Artificial intelligence company Anthropic recently announced the development of a new safety method called the "Constitution Classifier," aimed at protecting language models from malicious manipulation. This technology specifically targets "universal jailbreaks"—a method that attempts to systematically bypass all safety measures to prevent AI models from generating harmful content.

To validate the effectiveness of this technology, Anthropic conducted a large-scale test. The company recruited 183 participants who attempted to breach its defense system over two months. Participants were asked to input specific questions in an attempt to get the AI model Claude 3.5 to answer ten prohibited questions. Despite offering a reward of up to $15,000 and approximately 3,000 hours of testing time, no participant was able to completely bypass Anthropic's safety measures.

Claude2, Anthropic, Artificial Intelligence, Chatbot Claude

Progress from Challenges

The early version of Anthropic's "Constitution Classifier" had two main issues: it misclassified too many harmless requests as dangerous, and it required substantial computational resources. After improvements, the new version of the classifier significantly reduced the false positive rate and optimized computational efficiency. However, automated testing revealed that although the improved system successfully blocked over 95% of jailbreak attempts, it still required an additional 23.7% of computational power to operate. In contrast, the unprotected Claude model allowed 86% of jailbreak attempts to pass through.

Training Based on Synthetic Data

The core of the "Constitution Classifier" lies in using predefined rules (referred to as the "Constitution") to distinguish between allowed and prohibited content. The system trains the classifier to recognize suspicious inputs by generating synthetic training examples in various languages and styles. This approach not only improves the system's accuracy but also enhances its ability to respond to diverse attacks.

Despite significant progress, Anthropic's researchers acknowledge that the system is not foolproof. It may not be able to handle all types of universal jailbreak attacks, and new attack methods may emerge in the future. Therefore, Anthropic recommends using the "Constitution Classifier" in conjunction with other safety measures to provide more comprehensive protection.

Public Testing and Future Prospects

To further test the system's robustness, Anthropic plans to release a public demonstration version from February 3 to 10, 2025, inviting security experts to attempt to crack it. The test results will be announced in subsequent updates. This initiative not only showcases Anthropic's commitment to technological transparency but also provides valuable data for research in the field of AI safety.

Anthropic's "Constitution Classifier" marks a significant advancement in the safety protection of AI models. With the rapid development of AI technology, effectively preventing the misuse of models has become a focal point of industry concern. Anthropic's innovation offers a new solution to this challenge while also guiding future research in AI safety.