The National Institute of Standards and Technology (NIST) is watching the AI lifecycle closely, and for good reason. As AI proliferates, so does the discovery and exploitation of AI cybersecurity vulnerabilities. Rapid injection is one such vulnerability that specifically attacks Generative AI.
In Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigation, NIST defines various adversarial machine learning (AML) tactics and cyberattacks, such as rapid injection, and advises users on how to mitigate and manage them. AML tactics extract information about how machine learning (ML) systems behave to discover how they can be manipulated. This information is used to attack AI and his large language models (LLM) to bypass security, bypass protective measures, and open avenues for exploitation.
What is a rapid injection?
NIST defines two types of rapid injection attacks: direct and indirect. With direct prompt injection, a user enters a text prompt that causes the LLM to perform unintended or unauthorized actions. An indirect rapid injection occurs when an attacker poisons or degrades the data an LLM draws on.
One of the most well-known direct rapid injection methods is DAN, Do Anything Now, a rapid injection used against ChatGPT. DAN uses roleplaying to bypass moderation filters. In its first iteration, ChatGPT was prompted with prompts that it was now DAN. DAN could do whatever he wanted and would have to pretend, for example, to help some nefarious person create and detonate explosives. This tactic avoided filters that prevented him from providing criminal or harmful information by following a role-playing scenario. OpenAI, the developers of ChatGPT, follow this tactic and update the model to prevent its use, but users continue to bypass the filters to the point that the method has evolved into (at least) DAN 12.0.
Indirect prompt injection, as NIST notes, depends on an attacker’s ability to provide sources that a generative AI model would ingest, such as a PDF, document, web page, or even a file PDF. audio files used to generate fake voices. Rapid indirect injection is widely considered to be the biggest security flaw in generative AI, with no easy ways to find and repair these attacks. Examples of this type of prompt are many and varied. They go from the absurd (make a chatbot respond using “pirate speech“) to damage (using socially designed cat to convince a user to reveal their credit card and other personal data) to a wide range (hijacking AI assistants to send fraudulent emails to your entire contact list).
Explore AI cybersecurity solutions
How to Stop Rapid Injection Attacks
These attacks tend to be well hidden, making them both effective and difficult to stop. How to protect yourself against immediate direct injection? As NIST notes, you can’t stop them completely, but defensive strategies add some measure of protection. For model builders, NIST suggests ensuring that training data sets are carefully organized. They also suggest training the model on the types of inputs that signal a rapid injection attempt and training the model on how to identify conflicting prompts.
For rapid indirect injection, NIST suggests human involvement to refine models, known as reinforcement learning from human feedback (RLHF). RLHF helps models better align with human values that prevent unwanted behaviors. Another suggestion is to filter instructions from fetched inputs, which can prevent the execution of unwanted instructions from outside sources. NIST further suggests using LLM moderators to help detect attacks that do not rely on retrieved sources for execution. Finally, NIST offers solutions based on interpretability. This means that the prediction trajectory of the model that recognizes abnormal inputs can be used to detect and then stop abnormal inputs.
Generative AI and those who wish to exploit its vulnerabilities will continue to change the cybersecurity landscape. But this same power of transformation can also provide solutions. Learn more about how IBM Security offers AI cybersecurity solutions that strengthen security defenses.