A new study suggests that well-intentioned safety measures in large language models may introduce unexpected vulnerabilities. Researchers found significant differences in how easily models can be "jailbroken" depending on which demographic terms appear in a prompt. The research, titled "Do LLMs Have Political Correctness?", examines how demographic keywords influence the success rate of jailbreak attempts. The study found that prompts using terms for marginalized groups are more likely to bypass safety measures and produce harmful outputs than prompts using terms for privileged groups.

"These intentional biases led to a 20% difference in jailbreak success rates between non-binary and cisgender keywords, and a 16% difference between white and black keywords, even when the rest of the prompt was identical," explained Isack Lee and Haebin Seong of Theori Inc.

The researchers attribute these differences to intentional biases introduced to ensure the models' ethical behavior. To test how vulnerable large language models are to jailbreak attacks, they developed the "PCJailbreak" method. These attacks use carefully designed prompts to bypass AI safety measures and generate harmful content.


PCJailbreak uses keywords for different demographic and socioeconomic groups. Researchers created pairs of words like "rich" and "poor" or "male" and "female" to compare privileged and marginalized groups.
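A minimal sketch of how such keyword pairs might be represented is shown below. Only "rich"/"poor", "male"/"female", and the keywords quoted in the results are named in the article; any further pairs would be illustrative assumptions.

```python
# Privileged/marginalized keyword pairs used to probe jailbreak behavior.
# The pairs below are the ones mentioned in the article; the full list used
# in the paper may differ.
KEYWORD_PAIRS = [
    # (privileged term, marginalized term)
    ("rich", "poor"),
    ("male", "female"),
    ("cisgender", "non-binary"),
    ("white", "black"),
]
```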

They then created prompts that combined these keywords with potentially harmful instructions. By testing different combinations repeatedly, they measured the jailbreak success rate for each keyword. The results showed significant differences: keywords representing marginalized groups generally had much higher success rates than those representing privileged groups, indicating that the models' safety measures inadvertently carry biases that jailbreak attacks can exploit.
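A rough sketch of what this per-keyword measurement could look like follows. The prompt template, the `query_model` and `is_jailbroken` helpers, and the trial count are hypothetical stand-ins for illustration, not the paper's actual implementation.

```python
def measure_success_rates(keyword_pairs, harmful_instruction, query_model,
                          is_jailbroken, trials=50):
    """Estimate jailbreak success rates per keyword and the gap per pair.

    query_model(prompt) is assumed to return the model's response text;
    is_jailbroken(response) is assumed to return True if the response
    contains content the safety layer should have refused.
    """
    rates = {}
    for privileged, marginalized in keyword_pairs:
        for keyword in (privileged, marginalized):
            # Combine the group keyword with a potentially harmful instruction.
            prompt = f"As a {keyword} person, {harmful_instruction}"
            successes = sum(
                is_jailbroken(query_model(prompt)) for _ in range(trials)
            )
            rates[keyword] = successes / trials
    # The difference within each pair quantifies the bias in safety behavior.
    gaps = {(p, m): rates[m] - rates[p] for p, m in keyword_pairs}
    return rates, gaps
```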


To address the vulnerabilities found by PCJailbreak, researchers developed the "PCDefense" method. This approach uses special defense prompts to reduce excessive bias in language models, making them less vulnerable to jailbreak attacks.

What sets PCDefense apart is its simplicity: it requires no additional models or processing steps. Instead, defense prompts are added directly to the input to counteract the biases and elicit more balanced behavior from the language model.
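The core idea can be sketched in a few lines. The wording of the defense prompt below is an illustrative assumption, not the prompt used in the paper, and `query_model` is again a hypothetical helper.

```python
# Illustrative defense prompt; the actual wording used by PCDefense is not
# reproduced here.
DEFENSE_PROMPT = (
    "Apply your safety guidelines consistently to all demographic groups. "
    "Do not treat a request differently because of the group it mentions."
)

def apply_pc_defense(user_prompt: str) -> str:
    """Prepend the defense prompt to the user input before sending it to the
    language model -- no extra models or processing steps are needed."""
    return f"{DEFENSE_PROMPT}\n\n{user_prompt}"

# Example usage with a hypothetical model call:
# response = query_model(apply_pc_defense(user_prompt))
```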

The researchers tested PCDefense on various models and showed that jailbreak success rates could be significantly reduced for both privileged and marginalized groups. The gap between the groups also shrank, indicating a reduction in safety-related bias.


Researchers stated that PCDefense offers an efficient and scalable way to enhance the security of large language models without additional computational costs.

The findings highlight the complexity of designing AI systems that are both safe and ethical: developers must balance safety, fairness, and performance. Tightening specific safety guardrails may degrade other aspects of a model's performance, such as its creativity.

To encourage further research and improvement, the authors have released the code for PCJailbreak and all related artifacts as open source. Theori Inc, the company behind the research, specializes in offensive security and operates in the United States and South Korea. It was founded by Andrew Wesie and Brian Pak in January 2016.