jailbreaking black box llms automatically

Vulnerabilities in Language Models: A Case Study on Jailbreaking LLMs

In an era that will be increasingly defined by the performance of Large Language Models (LLMs), understanding their vulnerabilities is paramount. This article delves into a study that employed bespoke prompt automation titled the Tree of Attacks with Pruning (TAP) to generate jailbreaks and expose vulnerabilities in LLMs. The study shed light on the potential risks associated with unethical requests and harmful content generation. 

Despite their impressive capabilities, LLMs are susceptible to diverse failure modes that may expose users to harmful content, polarize their opinions, or, more broadly, have detrimental effects on society. On the business front, falling victim to LLM jailbreaks within your base tech poses a serious existential threat. Our exploration of this study aims to contribute to the broader understanding of responsible AI development and deployment.

The Danfer Of Jailbroken LLMs

In the enterprise landscape, a compromise of LLM-based applications poses a direct threat to corporate integrity. A jailbroken LLM within an enterprise could lead, at the very least, to catastrophic data breaches. Proprietary information, trade secrets, and sensitive customer data falling into the wrong hands is a one-way ticket to both legal consequences and a severe trust deficit for any business. 

Another dire consequence could be the generation and dissemination of misinformation on a massive scale. A jailbroken LLM might produce content that appears legitimate but is intentionally misleading. In an enterprise context, this could result in the spread of false financial reports, inaccurate product information, or misleading statements that can deal reputational damage which could take years to repair.

There is an immediate imperative to enhance defenses for LLMs as their proliferation continues. Implementing an additional layer of protection should be part of all GenAI hygiene going forward. As such, it is vital to understand the fundamentals of attack mechanisms.

Understanding Tree of Attacks with Pruning (TAP)

As mentioned earlier, at the heart of the study was TAP, an automated method for generating jailbreaks requiring only black-box access to the target LLM. In that sense, it can be said to be a derivative of the Prompt Automatic Iterative Refinement (PAIR) method. The researchers found that TAP reduced prompt redundancy and significantly improved prompt quality in comparison.

At its core, TAP utilizes three LLMs: an attacker, an evaluator, and a target.

  • Attacker: Generates jailbreaking prompts, utilizes tree-of-thoughts reasoning
  • Evaluator: Assesses generated prompts, determines the success of jailbreaking attempts
  • Target: The specific LLM subjected to the jailbreaking effort

understanding the tree of attack with pruning

The TAP algorithm, as employed in this study, was a systematic approach to exploring and exploiting potential vulnerabilities within large language models. TAP proved to be resource-efficient and achieved jailbreaks without needing GPU-intensive computations, while demanding only black-box access to the target model. 

Choice of Attacker and Evaluator 

There were two key factors the researchers considered in choosing their attacker and evaluator LLMs:  

  1. Models that could handle complex conversations
  2. Models that don’t resist dealing with harmful or restricted prompt

Based on that, the researchers opted for Vicuna-13B-v1.5 as the attacker and GPT4 as the evaluator. To offer a point of reference,they also evaluated the performance of TAP with GPT3.5-Turbo as the evaluator.

The Outcome

outcome table

The results showcased below show that TAP consistently outperforms PAIR across a spectrum of target models. Notably, on GPT4-Turbo, the latest from OpenAI as of November 2023, TAP achieved a 40% higher jailbreak success rate while remarkably sending 52% fewer queries compared to PAIR.

otcome table 2

Examples of Successful Jailbreaks 

Here are some illustrations of successful jailbreaks that the researchers achieved using GPT-4 and GPT-4 Turbo.

examples of successful jailbreaks

Safeguarding LLMs against interpretable prompts

The introduction of innovative safeguards, such as Llama Guard and Purple Llama, has provided something of a proactive stance against the jailbreak examples listed above. While the study had its own limitations, it was a good approximation of potential risks associated with jailbreak methods – and that is certainly an aspect of LLM security that will be continuously tested by the internet at large in the coming years.

Llama Guard is designed specifically for human-AI conversation use cases and excels in both prompt and response classification. Purple Llama, inspired by the cybersecurity principle of “purple teaming,” takes a holistic approach by integrating offensive and defensive strategies and guiding developers through the entire AI innovation journey.

But while these new models begin to enter the mainstream, developers and researchers must leverage and further refine the security of LLMs. The ongoing evolution of AI safety must match the speed of LLM proliferation at the moment, which has been more rapid than anyone could have anticipated. 

The future of this era must be one where innovation and responsibility go hand in hand, and ensuring the continued progress and ethical deployment of generative AI is going to be one of – if not the – biggest challenges.