Background
Adversarial purification is a defense mechanism that safeguards classifiers against adversarial attacks without requiring knowledge of the attack type or of the classifier's training. These techniques analyze attacked inputs and remove adversarial perturbations from them, restoring purified samples that remain similar to the attacked inputs while being correctly classified. However, because noise perturbations are difficult to characterize for discrete inputs such as text, adversarial purification methods have not been widely adopted for text classification.
Invention Description
Researchers at Arizona State University have developed a new approach to defending text classifiers against adversarial attacks by leveraging the generative capabilities of Large Language Models (LLMs). The approach bypasses the need to directly characterize discrete noise perturbations in text; instead, it uses prompt engineering to guide an LLM to generate purified versions of attacked text that are semantically similar to the attacked inputs and are correctly classified by the protected classifier. This method offers a robust solution to a previously challenging aspect of text classification security.
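A minimal sketch of this prompt-based purification loop is shown below. The `query_llm` helper, the prompt wording, and the `classify` function are hypothetical placeholders for illustration only; they do not reflect the researchers' actual implementation.

```python
# Hypothetical sketch of prompt-based adversarial text purification.
# `query_llm` and `classify` are placeholders for an LLM API call and
# for the downstream classifier being defended.

def query_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a large language model and return its reply."""
    raise NotImplementedError("Wire this to the LLM of your choice.")

def classify(text: str) -> str:
    """Placeholder: the text classifier being defended."""
    raise NotImplementedError("Wire this to the protected classifier.")

# Assumed prompt wording; the actual prompt engineering is described in the
# related publication and may differ.
PURIFICATION_PROMPT = (
    "The following sentence may contain small adversarial edits such as "
    "swapped synonyms or typos. Rewrite it so that it is fluent and keeps "
    "the original meaning, changing as few words as possible:\n\n{attacked}"
)

def purify_and_classify(attacked_text: str) -> str:
    """Ask the LLM for a semantically similar, cleaned-up version of the
    attacked input, then classify the purified text instead of the original."""
    purified = query_llm(PURIFICATION_PROMPT.format(attacked=attacked_text))
    return classify(purified)
```

In this setup the classifier itself is never retrained; the LLM acts as a preprocessing step that maps attacked inputs back toward the clean data distribution before classification.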
Potential Applications:
Benefits and Advantages:
Related Publication: Adversarial Text Purification: A Large Language Model Approach for Defense