AI News

Don't miss any moment of global AI innovation

AI Daily

Daily three-minute AI industry trends

AI Timeline

AI industry milestones

AI Monetization Guide

Latest Cases

AI monetization case sharing

Image Collection

AI image creation monetization cases

Video Collection

AI video creation monetization cases

Audio Collection

AI audio creation monetization cases

Content Collection

AI content writing monetization cases

AI Tutorials

Latest Tutorials

Free sharing of the latest AI tutorials

AI Product Rankings

AI Product Ranking

Shows total visits ranking of AI websites

AI Traffic Growth Ranking

Track fastest growing AI websites by traffic

AI Traffic Decline Ranking

Focus on AI websites with significant traffic drops

AI Weekly Ranking

Shows weekly visits ranking of AI websites

Popular Country Rankings

United States

AI websites most popular with US users

China

AI websites most popular with Chinese users

India

AI websites most popular with Indian users

Brazil

AI websites most popular with Brazilian users

Popular Category Rankings

Image Generation

Total visits ranking of AI image generation websites

Personal Assistant

Total visits ranking of AI personal assistant websites

Character Generation

Total visits ranking of AI character generation websites

Video Generation

Total visits ranking of AI video generation websites

Popular Open Source Data Rankings

AI Project Ranking

GitHub popular AI projects by total stars

AI Project Growth Ranking

GitHub popular AI projects by growth rate

AI Developer Ranking

GitHub popular AI developer ranking

AI Organization Ranking

GitHub popular AI organization ranking

Popular Open Source Categories

Deepseek

GitHub popular deepseek open source projects

TTS

GitHub popular TTS open source projects

LLM

GitHub popular LLM open source projects

ChatGPT

GitHub popular ChatGPT open source projects

AI Open Source Project Library

Overview

Overview of GitHub popular AI open source projects

Product Library Tool Navigation

Anthropic Launches 'Constitution Classifier': Successfully Prevents 95% of Model Jailbreak Attempts

AIbase基地

Published inAI News · 5 min read · Feb 5, 2025

154

Artificial intelligence company Anthropic recently announced the development of a new safety method called the "Constitution Classifier," aimed at protecting language models from malicious manipulation. This technology specifically targets "universal jailbreaks"—a method that attempts to systematically bypass all safety measures to prevent AI models from generating harmful content.

To validate the effectiveness of this technology, Anthropic conducted a large-scale test. The company recruited 183 participants who attempted to breach its defense system over two months. Participants were asked to input specific questions in an attempt to get the AI model Claude 3.5 to answer ten prohibited questions. Despite offering a reward of up to $15,000 and approximately 3,000 hours of testing time, no participant was able to completely bypass Anthropic's safety measures.

Claude2, Anthropic, Artificial Intelligence, Chatbot Claude

Progress from Challenges

The early version of Anthropic's "Constitution Classifier" had two main issues: it misclassified too many harmless requests as dangerous, and it required substantial computational resources. After improvements, the new version of the classifier significantly reduced the false positive rate and optimized computational efficiency. However, automated testing revealed that although the improved system successfully blocked over 95% of jailbreak attempts, it still required an additional 23.7% of computational power to operate. In contrast, the unprotected Claude model allowed 86% of jailbreak attempts to pass through.

Training Based on Synthetic Data

The core of the "Constitution Classifier" lies in using predefined rules (referred to as the "Constitution") to distinguish between allowed and prohibited content. The system trains the classifier to recognize suspicious inputs by generating synthetic training examples in various languages and styles. This approach not only improves the system's accuracy but also enhances its ability to respond to diverse attacks.

Despite significant progress, Anthropic's researchers acknowledge that the system is not foolproof. It may not be able to handle all types of universal jailbreak attacks, and new attack methods may emerge in the future. Therefore, Anthropic recommends using the "Constitution Classifier" in conjunction with other safety measures to provide more comprehensive protection.

Public Testing and Future Prospects

To further test the system's robustness, Anthropic plans to release a public demonstration version from February 3 to 10, 2025, inviting security experts to attempt to crack it. The test results will be announced in subsequent updates. This initiative not only showcases Anthropic's commitment to technological transparency but also provides valuable data for research in the field of AI safety.

Anthropic's "Constitution Classifier" marks a significant advancement in the safety protection of AI models. With the rapid development of AI technology, effectively preventing the misuse of models has become a focal point of industry concern. Anthropic's innovation offers a new solution to this challenge while also guiding future research in AI safety.

QualityClassifier GeneralPurposeTuner Claude3.5 Anthropic

This article is from AIbase Daily

Welcome to the [AI Daily] column! This is your daily guide to exploring the world of artificial intelligence. Every day, we present you with hot topics in the AI field, focusing on developers, helping you understand technical trends, and learning about innovative AI product applications.

—— Created by the AIbase Daily Team

AI News Recommendations

Following OpenAI, Google Gemini Joins MCP Initiative to Accelerate AI Agent Interoperability

Just weeks after OpenAI announced its adoption of a rival Anthropic standard for connecting AI models to the systems where their data resides, Google has followed suit. Google DeepMind CEO Demis Hassabis announced on X on Wednesday that Google will add support for the Anthropic Model Context Protocol (MCP) to its Gemini models and software development kits (SDKs).

Apr 10, 2025

150

Anthropic Launches Premium Claude Max Subscription, Priced Up to $200 Monthly

Apr 10, 2025

210

Anthropic's Official Release: A Comprehensive Report on College Student Use of Claude AI

The tide of technology surges forward. Once considered exclusive to research, Artificial Intelligence (AI) systems have quietly permeated the daily academic lives of college students. Anthropic recently released a substantial, large-scale research report analyzing millions of anonymized student conversations on the Claude.ai platform. For the first time, this reveals how college students utilize this emerging tool in real-world scenarios. The report not only outlines the current landscape of student AI usage but also prompts reflection on the future direction of education. Focus includes STEM fields.

Apr 9, 2025

370

Anthropic Launches AI in Education Initiative to Support Higher Education and Critical Thinking

Apr 3, 2025

370

Claude Team Releases Comprehensive Prompt Engineering Guide to Fuel No-Code Development Boom

Apr 3, 2025

710

Anthropic Launches Claude for Education: An AI Tutor to Foster Critical Thinking in Students

Anthropic today announced Claude for Education, an AI assistant designed for the education sector to enhance learning by fostering critical thinking skills, rather than simply providing answers. The product is already partnering with Northeastern University, the London School of Economics, and Champlain College to extensively test how AI can effectively augment, not shorten, the learning experience. A core innovation in Claude for Education is its learning mode, a feature fundamentally altering how students interact with AI.

Apr 3, 2025

640

Anthropic Unveils Claude's Inner Workings: Nine Fascinating Discoveries Under the AI Microscope

Apr 2, 2025

300

Anthropic Enhances AI Model Safety Measures to Ensure Responsible Scaling

AI company Anthropic recently updated its "responsible scaling" policy, outlining which models require additional safety protections. This move aims to mitigate potential risks before new technologies are released. According to Anthropic's blog, if stress testing reveals an AI model could assist a "resource-constrained state actor" in developing chemical and biological weapons, Anthropic will delay the technology's public launch.

Apr 1, 2025

250

Anthropic and Databricks Partner in $100 Million Deal to Develop AI Agent Tools

Anthropic and Databricks announced a five-year, $100 million partnership to develop AI agent tools for enterprise task automation. Databricks CEO Ali Ghodsi stated that Anthropic's Claude model will be directly available on the Databricks platform, enabling clients to leverage their own data to develop custom AI solutions.

Mar 28, 2025

460

OpenAI Announces Support for Anthropic's MCP Standard; Agent SDK Adds MCP Support

OpenAI CEO Sam Altman announced on X that the company will support the Model Context Protocol (MCP), a standard developed by competitor Anthropic. This standard aims to improve the accuracy and relevance of AI assistants' responses to specific queries. Altman stated that OpenAI will integrate MCP into several of its products, including ChatGPT.

Mar 27, 2025

560