An algorithm that defends Large Language Models (LLMs) against jailbreaking attacks, significantly reducing attack success rates.

Problem: Despite efforts to align LLM outputs with ethical standards, the models remain vulnerable to adversarial attacks, commonly known as jailbreaking attacks, that bypass alignment mechanisms and safety guardrails and cause harmful or objectionable content to be generated. This weakness undermines the credibility of LLMs and poses risks in domains such as public policy and healthcare, necessitating effective safeguards against adversarial prompting techniques.

Solution: SmoothLLM addresses the vulnerability of LLMs to jailbreaking attacks by systematically perturbing a prompt and assessing the resulting responses to determine more reliably whether the prompt is adversarial. This method has demonstrated robust attack mitigation, significantly reducing the risk of generating objectionable content.

Technology: SmoothLLM adds an extra layer of protection by smoothing the input. This involves duplicating the input prompt multiple times, applying a random perturbation to each copy, assessing each response, and aggregating the results to estimate the probability that the original prompt is adversarial (an illustrative sketch of this procedure appears after the docket details below). This process helps identify malicious prompts designed to bypass safeguards, such as those that append adversarial suffixes to trick a chatbot into producing harmful content. Compared to an undefended LLM, SmoothLLM reduced the attack success rate (ASR) of the Greedy Coordinate Gradient (GCG) attack to below one percent for the Vicuna, Llama2, GPT-3.5, GPT-4, Claude-1, Claude-2, and PaLM-2 LLMs. For the Vicuna and Llama2 LLMs, this corresponds to a reduction in ASR by a factor of more than 100.

Advantages:
Stage of Development:
Intellectual Property:
Reference Media:
Desired Partnerships:
Docket #24-10574
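
Illustrative sketch of the smoothing procedure described under Technology. The Python code below is not the authors' reference implementation: it assumes a generic `query_llm` callable, uses a simple random character-swap perturbation and a keyword-based refusal check, and all names and default parameters (e.g., `num_copies`, `q`) are assumptions for demonstration only.

```python
import random
import string
from collections import Counter

# Illustrative refusal phrases used to flag non-jailbroken responses;
# the actual check used by SmoothLLM may differ.
REFUSAL_PHRASES = ["i'm sorry", "i cannot", "i can't", "i apologize", "as an ai"]

def random_swap_perturbation(prompt: str, q: float) -> str:
    """Randomly replace a fraction q of the characters in the prompt."""
    chars = list(prompt)
    if not chars:
        return prompt
    num_to_swap = max(1, int(q * len(chars)))
    for idx in random.sample(range(len(chars)), num_to_swap):
        chars[idx] = random.choice(string.printable)
    return "".join(chars)

def is_jailbroken(response: str) -> bool:
    """Heuristic check: a response containing no refusal phrase is treated as jailbroken."""
    lowered = response.lower()
    return not any(phrase in lowered for phrase in REFUSAL_PHRASES)

def smooth_llm(query_llm, prompt: str, num_copies: int = 10, q: float = 0.1) -> str:
    """Perturb the prompt, query the LLM on each copy, and answer by majority vote.

    `query_llm` is an assumed callable mapping a prompt string to a response string.
    """
    # 1. Duplicate the prompt and randomly perturb each copy.
    perturbed_prompts = [random_swap_perturbation(prompt, q) for _ in range(num_copies)]
    # 2. Query the underlying LLM on every perturbed copy.
    responses = [query_llm(p) for p in perturbed_prompts]
    # 3. Assess each response and take a majority vote on whether the
    #    original prompt appears to be adversarial.
    votes = [is_jailbroken(r) for r in responses]
    majority_is_jailbroken = Counter(votes).most_common(1)[0][0]
    # 4. Return one of the responses consistent with the majority vote.
    consistent = [r for r, v in zip(responses, votes) if v == majority_is_jailbroken]
    return random.choice(consistent)
```

In this sketch, returning a response that agrees with the majority vote means a prompt flagged as adversarial by most perturbed copies yields a refusal, while a benign prompt typically receives a normal answer.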