How do you get an AI to answer a question it’s not supposed to answer? There are many “jailbreaking” techniques, and Anthropic researchers have just found a new one, in which a large language model (LLM) can be convinced to tell you how to build a bomb if you prime it with a few dozen less harmful questions first.
They call the approach “many-shot jailbreaking,” and they have both written a paper about it and informed their peers in the AI community so that it can be mitigated.
The vulnerability is new, resulting from the enlarged “context window” of the latest generation of LLMs. This is the amount of data they can hold in what might be called short-term memory, once just a few sentences but now thousands of words and even entire books.
Anthropic’s researchers found that models with large context windows tend to perform better on many tasks when there are many examples of that task in the prompt. So if there are lots of trivia questions in the prompt (or in a priming document, like a big list of trivia the model has in context), the answers actually improve over time: a fact the model might have gotten wrong as the first question, it may get right as the hundredth question.
But in an unexpected extension of this “in-context learning,” as it is called, the models also get “better” at answering inappropriate questions. So if you ask one to build a bomb right away, it will refuse. But if you ask it to answer 99 other, less harmful questions first and then ask it how to build a bomb… it is much more likely to comply.
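To make the format concrete, here is a minimal, hypothetical sketch of how a many-shot prompt is assembled; the function name and the benign trivia pairs are illustrative, not taken from Anthropic’s paper. The jailbreak variant described above uses the same structure, only with faux dialogue turns answering less harmful questions stacked in front of the final request.

```python
# Hypothetical sketch of the many-shot prompt format discussed above.
# The trivia pairs are placeholders; the attack variant swaps in faux
# question/answer turns instead of benign examples.

def build_many_shot_prompt(example_pairs, final_question):
    """Assemble a prompt from many in-context Q&A examples plus one target question."""
    turns = []
    for question, answer in example_pairs:
        turns.append(f"User: {question}")
        turns.append(f"Assistant: {answer}")
    # The question we actually care about comes last, after the model has
    # "warmed up" on dozens or hundreds of prior examples.
    turns.append(f"User: {final_question}")
    turns.append("Assistant:")
    return "\n".join(turns)


trivia = [
    ("What is the capital of France?", "Paris."),
    ("How many planets are in the solar system?", "Eight."),
    # ... dozens or hundreds more pairs; large context windows make this feasible
]

prompt = build_many_shot_prompt(trivia, "Which element has the atomic number 79?")
```

With only a handful of pairs in context, the final answer may be wrong; with hundreds, accuracy tends to improve, and that same dynamic is what the jailbreak exploits.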
Why does this work? No one really understands what goes on in the tangle of weights that is an LLM, but there is clearly some mechanism that lets it home in on what the user wants, as evidenced by the content of the context window. If the user wants trivia, the model seems to gradually activate more latent trivia-answering ability as you ask dozens of questions. And, for whatever reason, the same thing happens when users ask for dozens of inappropriate answers.
The team has already informed its peers and even competitors about this attack, which it hopes will “foster a culture in which exploits like this are openly shared between LLM providers and researchers.”
As for their own mitigation, the team found that while limiting the context window helps, it also hurts the model’s performance. That’s no good, so they are working on classifying and contextualizing queries before they ever reach the model. Of course, that just means you have a different model to fool… but at this point, moving the goalposts in AI safety is to be expected.
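Here is a minimal sketch of what such a front-end filter might look like, assuming a toy keyword classifier and a generic generate function; none of this reflects Anthropic’s actual implementation, which has not been published.

```python
# Hypothetical sketch of a query-screening step placed in front of the main model.
# classify_query() and generate() stand in for whatever classifier and LLM are
# actually used in production.

REFUSAL = "Sorry, I can't help with that."
SUSPECT_MARKERS = ("how to build a bomb",)  # illustrative only


def classify_query(prompt: str) -> str:
    """Toy placeholder for a smaller model or rule set that labels incoming prompts."""
    lowered = prompt.lower()
    if any(marker in lowered for marker in SUSPECT_MARKERS):
        return "harmful"
    return "benign"


def guarded_generate(prompt: str, generate) -> str:
    """Refuse or contextualize the prompt before it reaches the main model."""
    if classify_query(prompt) == "harmful":
        return REFUSAL
    # "Contextualize": re-wrap the query, e.g. by prepending a safety reminder.
    wrapped = "Answer helpfully and refuse unsafe requests.\n\n" + prompt
    return generate(wrapped)


# Usage with a stand-in model:
print(guarded_generate("What is 2 + 2?", generate=lambda p: "4"))
```

As the article notes, the screening step is itself a model, and so becomes a new target for attackers in its own right.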