Despite growing demands for safety and accountability in AI, current tests and benchmarks may be insufficient, according to a new report.
Generative AI models—models that can analyze and produce text, images, music, video, and so on—are increasingly under scrutiny for their tendency to make mistakes and behave unpredictably. Now, organizations ranging from public sector agencies to big tech companies are proposing new benchmarks to test the safety of these models.
Late last year, startup Scale AI created a lab dedicated to assessing how closely models align with safety guidelines. This month, the U.S. National Institute of Standards and Technology (NIST) and the UK AI Safety Institute released tools designed to assess model risk.
But these tests and model exploration methods may prove inadequate.
The Ada Lovelace Institute (ALI), a UK-based nonprofit AI research organization, produced a study that drew on interviews with experts from academia, civil society, and labs that build AI models, as well as a review of recent research on AI safety evaluations. The co-authors found that while current evaluations can be useful, they are not comprehensive, can be easily gamed, and do not necessarily indicate how models will perform in real-world scenarios.
“Whether it’s a smartphone, a prescription drug, or a car, we expect the products we use to be safe and reliable; in these industries, products are rigorously tested to ensure they are safe before they are released,” Elliot Jones, a senior researcher at ALI and a co-author of the report, told TechCrunch. “Our research aimed to examine the limitations of current approaches to AI safety evaluation, how evaluations are currently being used, and their potential as a tool for policymakers and regulators.”
Benchmarking and red teaming
The study’s co-authors first reviewed the academic literature to establish an overview of the harms and risks that models pose today, and the state of evaluation of existing AI models. They then interviewed 16 experts, including four employees of unnamed tech companies that develop generative AI systems.
The study found strong disagreement within the AI industry regarding the best set of methods and taxonomies for model evaluation.
Some evaluations only tested how models performed against lab benchmarks, not how they might affect real-world users. Others drew on tests developed for research purposes rather than for evaluating production models, yet vendors insisted on using them in production anyway.
We’ve talked before about the problems with AI benchmarks, and this study highlights all of those problems and more.
Experts cited in the study noted that it is difficult to extrapolate a model’s real-world performance from benchmark results, and it is unclear whether benchmarks can even demonstrate that a model possesses a specific capability. For example, a model that performs well on a state bar exam will not necessarily be able to handle more open-ended legal challenges.
Experts also highlighted the problem of data contamination, where benchmark results can overestimate a model’s performance if the model was trained on the same data it is being tested on. Benchmarks, in many cases, are chosen by organizations not because they are the best evaluation tools, but for convenience and ease of use, experts said.
“Benchmarks are at risk of being manipulated by developers who may train models on the same dataset that will be used to evaluate the model, which is equivalent to seeing the exam paper before the exam, or by strategically choosing which evaluations to use,” Mahi Hardalupas, a researcher at ALI and a co-author of the study, told TechCrunch. “It also matters which version of a model is evaluated. Small changes can cause unpredictable shifts in behavior and can override built-in safety features.”
The ALI study also found problems with “red-teaming,” the practice of tasking individuals or groups with “hacking” a model to identify vulnerabilities and flaws. Several companies use red-teaming to evaluate models, including AI startups OpenAI and Anthropic, but there are few agreed-upon standards for red-teaming, making it difficult to assess the effectiveness of any given effort.
Experts told the study’s co-authors that it can be difficult to find people with the skills and experience needed to red-team a model, and that the manual nature of red-teaming makes it expensive and labor-intensive, presenting a barrier for smaller organizations without the necessary resources.
Possible solutions
Pressure to release models more quickly and a reluctance to conduct tests that might surface issues before release are the main reasons why AI safety evaluations have not improved.
“One person we spoke to, who works for a company that develops foundation models, felt there was more pressure within companies to release models quickly, making it harder to push back and take evaluations seriously,” Jones said. “Major AI labs are releasing models at a rate that outpaces their ability to ensure those models are safe and reliable.”
One respondent in the ALI study called the problem of evaluating models for safety “intractable.” So what hope does the industry, and those who regulate it, have for solutions?
Hardalupas believes there is a way forward, but that it will require greater commitment from public bodies.
“Regulators and policymakers need to clearly articulate what they want from evaluations,” he said. “At the same time, the evaluation community needs to be transparent about the current limitations and potential of evaluations.”
Hardalupas suggests that governments require greater public participation in the development of assessments and implement measures to support a third-party testing “ecosystem,” including programs to ensure regular access to all required models and datasets.
Jones believes it may be necessary to develop “context-specific” evaluations that go beyond simply testing how a model responds to a prompt, and instead examine the types of users a model might impact (such as people of a particular background, gender, or ethnicity) and the ways in which attacks on models might circumvent safety measures.
“This will require investment in the underlying science of evaluations to develop more robust and repeatable evaluations based on an understanding of how an AI model operates,” he added.
But there may never be a guarantee that a model is safe.
“As others have noted, ‘safety’ is not a property of models,” Hardalupas said. “Determining whether a model is ‘safe’ requires understanding the contexts in which it is used, who it is sold or made accessible to, and whether the safety measures in place are adequate and robust enough to mitigate those risks. Evaluations of a foundation model can serve an exploratory purpose, identifying potential risks, but they cannot guarantee that a model is safe, let alone ‘perfectly safe.’ Many of our interviewees agreed that evaluations cannot prove a model is safe; they can only indicate that a model is unsafe.”