An algorithm that defends Large Language Models (LLMs) against jailbreaking attacks, significantly reducing attack success rates.

Problem: Despite efforts to align LLM outputs with ethical standards, the models remain vulnerable to adversarial attacks, commonly known as jailbreaking attacks, that bypass alignment mechanisms and safety guardrails and cause harmful or objectionable content to be generated. This weakness undermines the credibility of LLMs and poses risks in domains such as public policy and healthcare, necessitating effective safeguards against adversarial prompting techniques.

Solution: SmoothLLM addresses the vulnerability of LLMs to jailbreaking attacks by systematically perturbing a prompt and assessing the resulting responses to determine more reliably whether the prompt is adversarial. This method has demonstrated robust attack mitigation, significantly reducing the risk of generating objectionable content.

Technology: SmoothLLM adds an extra layer of protection by smoothing the input. This involves duplicating the input prompt multiple times, applying a random perturbation to each copy, assessing each response, and aggregating the results to estimate the probability that the original prompt is adversarial (an illustrative sketch of this procedure appears after the docket details below). This process helps identify malicious prompts designed to bypass safeguards, such as those that append adversarial suffixes to trick a chatbot into producing harmful content. Compared to an undefended LLM, SmoothLLM reduced the attack success rate (ASR) of the Greedy Coordinate Gradient (GCG) attack to below one percent for the Vicuna, Llama2, GPT-3.5, GPT-4, Claude-1, Claude-2, and PaLM-2 LLMs. For the Vicuna and Llama2 LLMs, this corresponds to a reduction in ASR by a factor of more than 100.

Advantages:
Stage of Development:
Intellectual Property:
Reference Media:
Desired Partnerships:
Docket #24-10574
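
Illustrative sketch of the smoothing procedure described under Technology. The Python code below is not the authors' reference implementation: it assumes a generic `query_llm` callable, uses a simple random character-swap perturbation and a keyword-based refusal check, and all names and default parameters (e.g., `num_copies`, `q`) are assumptions for demonstration only.

```python
import random
import string
from collections import Counter

# Illustrative refusal phrases used to flag non-jailbroken responses;
# the actual check used by SmoothLLM may differ.
REFUSAL_PHRASES = ["i'm sorry", "i cannot", "i can't", "i apologize", "as an ai"]

def random_swap_perturbation(prompt: str, q: float) -> str:
    """Randomly replace a fraction q of the characters in the prompt."""
    chars = list(prompt)
    if not chars:
        return prompt
    num_to_swap = max(1, int(q * len(chars)))
    for idx in random.sample(range(len(chars)), num_to_swap):
        chars[idx] = random.choice(string.printable)
    return "".join(chars)

def is_jailbroken(response: str) -> bool:
    """Heuristic check: a response containing no refusal phrase is treated as jailbroken."""
    lowered = response.lower()
    return not any(phrase in lowered for phrase in REFUSAL_PHRASES)

def smooth_llm(query_llm, prompt: str, num_copies: int = 10, q: float = 0.1) -> str:
    """Perturb the prompt, query the LLM on each copy, and answer by majority vote.

    `query_llm` is an assumed callable mapping a prompt string to a response string.
    """
    # 1. Duplicate the prompt and randomly perturb each copy.
    perturbed_prompts = [random_swap_perturbation(prompt, q) for _ in range(num_copies)]
    # 2. Query the underlying LLM on every perturbed copy.
    responses = [query_llm(p) for p in perturbed_prompts]
    # 3. Assess each response and take a majority vote on whether the
    #    original prompt appears to be adversarial.
    votes = [is_jailbroken(r) for r in responses]
    majority_is_jailbroken = Counter(votes).most_common(1)[0][0]
    # 4. Return one of the responses consistent with the majority vote.
    consistent = [r for r, v in zip(responses, votes) if v == majority_is_jailbroken]
    return random.choice(consistent)
```

In this sketch, returning a response that agrees with the majority vote means a prompt flagged as adversarial by most perturbed copies yields a refusal, while a benign prompt typically receives a normal answer.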