Adversarial Text Purification: Large Language Model Approach for Defense

Background

Adversarial purification is a defense mechanism for safeguarding classifiers against adversarial attacks without knowledge of the attack type or the classifier's training. These techniques analyze and remove adversarial perturbations from attacked inputs, restoring purified samples that remain similar to the attacked inputs and are correctly classified by the classifier. However, because of the challenges of characterizing noise perturbations for discrete text inputs, adversarial text purification methods have not been widely explored.

Invention Description

Researchers at Arizona State University have developed a new approach to defending text classifiers against adversarial attacks by leveraging the generative capabilities of Large Language Models (LLMs). Rather than directly characterizing discrete noise perturbations in text, the approach uses prompt engineering to guide an LLM to generate purified versions of attacked text that are semantically similar to the original inputs and correctly classified by the target classifier. This method offers a robust solution to a previously challenging aspect of text classification security.
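The sketch below illustrates the general prompt-based purification workflow described above: a possibly attacked input is sent to an LLM with a purification prompt, and the purified output is then passed to the unmodified downstream classifier. This is a minimal illustration only; the prompt wording, the `generate` callable, and the toy classifier are assumptions for demonstration, not the exact prompts or models used in the published work.

```python
# Minimal sketch of prompt-based adversarial text purification.
# NOTE: `generate` is a placeholder for any text-generation backend
# (hosted LLM API or local model); it is an assumption, not part of
# the published method. The prompt wording is illustrative only.

from typing import Callable

PURIFY_PROMPT = (
    "The following text may contain small adversarial edits such as "
    "character swaps, misspellings, or word substitutions. Rewrite it as "
    "clean, natural text that preserves the original meaning as closely "
    "as possible. Return only the rewritten text.\n\n"
    "Text: {text}\n"
    "Rewritten text:"
)


def purify(text: str, generate: Callable[[str], str]) -> str:
    """Ask the LLM to produce a purified version of a possibly attacked input."""
    return generate(PURIFY_PROMPT.format(text=text)).strip()


def robust_predict(text: str,
                   generate: Callable[[str], str],
                   classify: Callable[[str], str]) -> str:
    """Purify first, then run the unmodified downstream classifier."""
    return classify(purify(text, generate))


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without external services.
    fake_llm = lambda prompt: "The movie was wonderful and I enjoyed it."
    fake_classifier = lambda s: "positive" if "wonderful" in s else "negative"

    attacked = "The m0vie was w0nderfull and I enjoyyed it."
    print(robust_predict(attacked, fake_llm, fake_classifier))  # -> positive
```

Because purification happens entirely at inference time, the classifier itself needs no retraining and no knowledge of the attack is required, consistent with the defense described above.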

Potential Applications:

  • Enhanced security for NLP-based applications (e.g., spam detection, sentiment analysis, content moderation)
  • Robust AI-driven text analysis tools for cybersecurity
  • Improved accuracy and reliability of automated text classification services (e.g., finance, healthcare, legal)

Benefits and Advantages:

  • Improved accuracy – classifier accuracy under attack improved by over 65% on average, outperforming existing methods
  • Simplifies purification process – eliminates need for explicit characterization of adversarial perturbations
  • Effective – leverages the generative power of LLMs to restore attacked text to a purified state
  • Robust – does not require prior knowledge of the specific type of adversarial attack or classifier training

Related Publication: Adversarial Text Purification: A Large Language Model Approach for Defense

Patent Information: