Whenever a new AI model is released, it is typically touted as outperforming rivals on a suite of benchmarks. OpenAI’s GPT-4o, for example, launched in May with a compilation of results showing it beating every other AI company’s latest models in several tests.
The problem is that these benchmarks are poorly designed, the results are hard to replicate, and the metrics they use are often arbitrary, new research shows. That matters, because AI models’ scores on these benchmarks will determine the level of scrutiny and regulation they receive.
AI companies frequently cite benchmarks as evidence of a new model’s success, and those benchmarks already feature in some governments’ plans for regulating AI. But right now, they might not be good enough to be used that way, and researchers have some ideas for how to improve them.
—Scott J Mulligan
We need to start tackling the ethics of AI agents
Generative AI models have become remarkably good at conversing with us and creating images, videos, and music for us, but they’re not very good at doing things for us.
AI agents promise to change that. Last week, researchers published a new paper explaining how they trained simulation agents to reproduce the personalities of 1,000 people with breathtaking precision.
AI models that imitate you could soon act on your behalf. If such tools become cheap and easy to build, they will raise many new ethical concerns, but two in particular stand out. Read the full story.
—James O’Donnell