How do you get an AI to answer a question it’s not supposed to answer? There are many “jailbreaking” techniques, and Anthropic researchers have just found a new one, in which a large language model (LLM) can be convinced to tell you how to build a bomb if you prime it with a few dozen less harmful questions first.
They call the approach “many-shot jailbreaking,” and they have both written a paper about it and informed their peers in the AI community so that it can be mitigated.
The vulnerability is new, resulting from the enlarged “context window” of the latest generation of LLMs. This is the amount of data they can hold in what might be called short-term memory, once just a few sentences but now thousands of words and even entire books.
Anthropic’s researchers found that models with large context windows tend to perform better on many tasks when there are many examples of that task in the prompt. So if there are lots of trivia questions in the prompt (or in a priming document, like a big list of trivia the model has in context), the answers actually improve over time: a fact the model might have gotten wrong as the first question, it may get right as the hundredth question.
But in an unexpected extension of this “in-context learning,” as it is called, the models also get “better” at answering inappropriate questions. So if you ask one to build a bomb right away, it will refuse. But if you ask it to answer 99 other, less harmful questions first and then ask it how to build a bomb… it is much more likely to comply.
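To make the format concrete, here is a minimal, hypothetical sketch of how a many-shot prompt is assembled; the function name and the benign trivia pairs are illustrative, not taken from Anthropic’s paper. The jailbreak variant described above uses the same structure, only with faux dialogue turns answering less harmful questions stacked in front of the final request.

```python
# Hypothetical sketch of the many-shot prompt format discussed above.
# The trivia pairs are placeholders; the attack variant swaps in faux
# question/answer turns instead of benign examples.

def build_many_shot_prompt(example_pairs, final_question):
    """Assemble a prompt from many in-context Q&A examples plus one target question."""
    turns = []
    for question, answer in example_pairs:
        turns.append(f"User: {question}")
        turns.append(f"Assistant: {answer}")
    # The question we actually care about comes last, after the model has
    # "warmed up" on dozens or hundreds of prior examples.
    turns.append(f"User: {final_question}")
    turns.append("Assistant:")
    return "\n".join(turns)


trivia = [
    ("What is the capital of France?", "Paris."),
    ("How many planets are in the solar system?", "Eight."),
    # ... dozens or hundreds more pairs; large context windows make this feasible
]

prompt = build_many_shot_prompt(trivia, "Which element has the atomic number 79?")
```

With only a handful of pairs in context, the final answer may be wrong; with hundreds, accuracy tends to improve, and that same dynamic is what the jailbreak exploits.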
Why does this work? No one really understands what goes on in the tangle of weights that is an LLM, but there is clearly some mechanism that lets it home in on what the user wants, as evidenced by the content of the context window. If the user wants trivia, the model seems to gradually activate more latent trivia-answering ability as you ask dozens of questions. And, for whatever reason, the same thing happens when users ask for dozens of inappropriate answers.
The team has already informed its peers and even competitors about this attack, which it hopes will “foster a culture in which exploits like this are openly shared between LLM providers and researchers.”
As for their own mitigation, the team found that while limiting the context window helps, it also hurts the model’s performance. That’s no good, so they are working on classifying and contextualizing queries before they ever reach the model. Of course, that just means you have a different model to fool… but at this point, moving the goalposts in AI safety is to be expected.
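Here is a minimal sketch of what such a front-end filter might look like, assuming a toy keyword classifier and a generic generate function; none of this reflects Anthropic’s actual implementation, which has not been published.

```python
# Hypothetical sketch of a query-screening step placed in front of the main model.
# classify_query() and generate() stand in for whatever classifier and LLM are
# actually used in production.

REFUSAL = "Sorry, I can't help with that."
SUSPECT_MARKERS = ("how to build a bomb",)  # illustrative only


def classify_query(prompt: str) -> str:
    """Toy placeholder for a smaller model or rule set that labels incoming prompts."""
    lowered = prompt.lower()
    if any(marker in lowered for marker in SUSPECT_MARKERS):
        return "harmful"
    return "benign"


def guarded_generate(prompt: str, generate) -> str:
    """Refuse or contextualize the prompt before it reaches the main model."""
    if classify_query(prompt) == "harmful":
        return REFUSAL
    # "Contextualize": re-wrap the query, e.g. by prepending a safety reminder.
    wrapped = "Answer helpfully and refuse unsafe requests.\n\n" + prompt
    return generate(wrapped)


# Usage with a stand-in model:
print(guarded_generate("What is 2 + 2?", generate=lambda p: "4"))
```

As the article notes, the screening step is itself a model, and so becomes a new target for attackers in its own right.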