Computer scientists at Nanyang Technological University Singapore (NTU Singapore) have successfully compromised several artificial intelligence (AI) chatbots, including ChatGPT, Google Bard and Microsoft Bing Chat, to produce content that violates their developers’ guidelines – a result known as a “jailbreak”.
“Jailbreaking” is an IT security term for hackers finding and exploiting flaws in a system’s software to make it do something its developers deliberately restricted it from doing.
Additionally, by training a large language model (LLM) on a database of prompts that had previously been shown to hack these chatbots successfully, the researchers created an LLM chatbot capable of automatically generating further prompts to jailbreak other chatbots.
LLMs form the brains of AI chatbots, allowing them to process human input and generate text that is almost indistinguishable from text a human can create. This includes performing tasks such as planning a travel itinerary, telling a bedtime story, and developing computer code.
The work of the NTU researchers now adds “jailbreaking” to that list. Their findings may be critical in helping companies recognize the weaknesses and limitations of their LLM chatbots so they can take steps to harden them against hackers.
After running a series of proof-of-concept tests on LLMs to show that the technique does indeed present a clear and present threat to them, the researchers reported the issues to the affected service providers immediately after launching successful jailbreak attacks.
Professor Liu Yang from NTU’s School of Computer Science and Engineering, who led the study, said: “Large language models (LLMs) have proliferated rapidly due to their exceptional ability to understand, generate and complete human-like text, with LLM chatbots being very popular applications for everyday use.
“Developers of such AI services have implemented safeguards to prevent the AI from generating violent, unethical or criminal content. But AI can be outsmarted, and we have now used AI against its own kind to ‘jailbreak’ LLMs into producing such content.”
NTU Ph.D. student Mr. Liu Yi, co-author of the paper, said: “The paper presents a novel approach for automatically generating jailbreak prompts against fortified LLM chatbots. Training an LLM with jailbreak prompts makes it possible to automate the generation of these prompts, achieving a much higher success rate than existing methods. In effect, we attack chatbots by using them against themselves.”
The researchers’ paper describes a dual method for “jailbreaking” LLMs, which they called “Masterkey.”
First, they reverse-engineered how LLMs detect and defend against malicious queries. Using this information, they taught an LLM to automatically learn and produce prompts that bypass the defenses of other LLMs. This process can be automated, creating a jailbreak LLM that can adapt and create new jailbreak prompts even after developers patch their LLM.
The researchers’ paper, which appears on the preprint server arXiv, has been accepted for presentation at the Network and Distributed System Security Symposium, a leading security forum, in San Diego, USA, in February 2024.
Testing the limits of LLM ethics
AI chatbots receive prompts, or a series of instructions, from human users. All LLM developers establish guidelines to prevent chatbots from generating unethical, questionable, or illegal content. For example, asking an AI chatbot how to create malware to hack bank accounts typically results in an outright refusal to answer, on the grounds that doing so would facilitate criminal activity.
Professor Liu said: “Despite their advantages, AI chatbots remain vulnerable to jailbreak attacks. They can be compromised by malicious actors who abuse their vulnerabilities to force chatbots to generate results that violate established rules.”
NTU researchers studied ways to circumvent a chatbot by creating prompts that fly under the radar of its ethical guidelines so that the chatbot is tricked into responding. For example, AI developers rely on keyword censors that detect certain words that might signal potentially questionable activity and refuse to respond if such words are detected.
One strategy the researchers used to circumvent keyword censors was to create a persona that provided prompts containing simply spaces after each character. This bypasses LLM censors that may operate from a list of banned words.
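The trick is easy to illustrate in a few lines of Python. The following is a minimal sketch, assuming a naive filter that checks for banned substrings; the example query, the banned-word list, and the filter itself are illustrative stand-ins, not anything taken from the paper.

```python
# A toy banned-word filter and the space-insertion transform described above.
def insert_spaces(prompt: str) -> str:
    """Return the prompt with a space inserted between every character."""
    return " ".join(prompt)

def naive_keyword_filter(prompt: str, banned=("malware",)) -> bool:
    """Return True if the prompt contains a banned keyword verbatim."""
    lowered = prompt.lower()
    return any(word in lowered for word in banned)

original = "how to create malware"
transformed = insert_spaces(original)      # "h o w   t o   c r e a t e   m a l w a r e"
print(naive_keyword_filter(original))      # True: the substring "malware" is matched
print(naive_keyword_filter(transformed))   # False: the spaced-out text no longer matches
```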
The researchers also asked the chatbot to respond in the guise of a persona that was “unqualified and devoid of moral constraints,” thereby increasing the chances of producing unethical content.
The researchers were able to infer the inner workings and defenses of LLMs by manually entering such prompts and observing how long each took to succeed or fail. They were then able to reverse engineer the LLMs’ hidden defense mechanisms, identify their weaknesses, and create a dataset of prompts that successfully jailbreak the chatbots.
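A minimal sketch of that manual probing step might look like the Python below. The send_prompt() helper and the refusal markers are hypothetical placeholders for whatever chatbot interface and refusal wording is under test, and the timing interpretation in the closing comment is one plausible reading rather than a quoted detail from the paper.

```python
# Time one prompt's round trip and note whether the reply looks like a refusal.
import time

def send_prompt(prompt: str) -> str:
    """Placeholder for a call to the chatbot's API; a real client would go here."""
    raise NotImplementedError

def probe(prompt: str) -> dict:
    """Send one prompt and record how long it took and whether it was refused."""
    start = time.perf_counter()
    reply = send_prompt(prompt)
    elapsed = time.perf_counter() - start
    refused = any(marker in reply.lower() for marker in ("i cannot", "i can't"))
    return {"prompt": prompt, "seconds": elapsed, "refused": refused}

# One way to read the timings (an assumption, not a quoted finding): a refusal
# that arrives almost instantly suggests the prompt was filtered on the way in,
# while a refusal that takes about as long as a normal answer suggests the reply
# was generated first and then blocked.
```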
Intensification of the arms race between hackers and LLM developers
When vulnerabilities are discovered and revealed by hackers, AI chatbot developers respond by “fixing” the problem, in a never-ending cat-and-mouse cycle between hacker and developer.
With Masterkey, the NTU computer scientists have upped the ante in this arms race, as an AI jailbreaking chatbot can produce a large number of prompts and continually learn what works and what doesn’t, allowing hackers to beat LLM developers at their own game with their own tools.
The researchers first created a training dataset that included prompts they had found effective during the earlier reverse-engineering phase, as well as unsuccessful prompts, so that Masterkey also knew what not to do. The researchers fed this dataset into an LLM as a starting point and then carried out continuous pre-training and task tuning.
This exposes the model to a wide range of information and refines its capabilities by training it on tasks directly related to jailbreaking. The result is an LLM that can better predict how to manipulate text for jailbreaking, leading to more effective and universal prompts.
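As a rough illustration of what continuous pre-training and task tuning on such a prompt dataset can look like in practice, here is a minimal sketch using the Hugging Face Transformers library. The base model, the placeholder records, and the hyperparameters are assumptions made for illustration, not the researchers’ actual pipeline.

```python
# A minimal fine-tuning sketch (assumptions: "gpt2" as a stand-in base model,
# placeholder records instead of real jailbreak prompts, toy hyperparameters).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Hypothetical training records: prompts collected during the reverse-engineering
# phase, both the ones that worked and the ones that failed.
records = [
    {"text": "<prompt that succeeded during reverse engineering>"},
    {"text": "<another successful prompt>"},
    {"text": "<prompt that failed, kept so the model learns what not to do>"},
]
dataset = Dataset.from_list(records)

model_name = "gpt2"  # illustrative stand-in, not the model used in the study
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# With mlm=False the collator copies input_ids into labels, i.e. the standard
# next-token (causal language modeling) objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="prompt-generator",
        num_train_epochs=3,
        per_device_train_batch_size=2,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

In this sketch, tuning with the ordinary next-token objective on both successful and failed prompts is what stands in for “task tuning”; the resulting model can then be sampled for new candidate prompts.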
The researchers found that prompts generated by Masterkey were three times more effective than prompts generated by LLMs at jailbreaking LLMs. Masterkey was also able to learn from previously failed prompts and can be automated to constantly produce new, more effective prompts.
The researchers say their LLM can be used by developers themselves to strengthen their security.
NTU Ph.D. student Mr. Deng Gelei, co-author of the paper, said: “As LLMs continue to evolve and expand their capabilities, manual testing becomes both laborious and potentially inadequate for covering all possible vulnerabilities. An automated approach to generating jailbreak prompts can ensure comprehensive coverage, evaluating a wide range of possible misuse scenarios.”
More information:
Gelei Deng et al, MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots, arXiv (2023). DOI: 10.48550/arxiv.2307.08715
Provided by Nanyang Technological University
Citation: Researchers use AI chatbots against themselves to “jailbreak” themselves (December 28, 2023) retrieved January 2, 2024 from https://techxplore.com/news/2023-12-ai-chatbots-jailbreak.html
This document is subject to copyright. Except for fair use for private study or research purposes, no part may be reproduced without written permission. The content is provided for information only.