Zhipu AI has released AlignBench, an evaluation benchmark built specifically for Chinese large language models (LLMs) and the first of its kind to assess how well these models align with human intent across multiple dimensions.

AlignBench's dataset is drawn from real-world usage scenarios and passes through several stages (initial construction, sensitivity screening, reference answer generation, and difficulty filtering) to ensure the questions are both authentic and challenging. The data is organized into 8 major categories, including knowledge Q&A, writing, role-play, and others.

To make evaluation automated and reproducible, AlignBench uses judge models (such as GPT-4 and CritiqueLLM) to score each evaluated model's responses as a proxy for their quality. These judges apply a multi-dimensional, rule-calibrated scoring method, which improves the agreement between model-assigned scores and human ratings, and they return a detailed written analysis alongside each score.

Developers can run AlignBench evaluations themselves using a capable judge model (such as GPT-4 or CritiqueLLM), or submit results through the AlignBench website, where CritiqueLLM serves as the judge and evaluation results are typically returned within about 5 minutes.
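To make the judging step concrete, below is a minimal sketch of how a multi-dimensional, rule-calibrated LLM-as-judge call could look in Python. The dimension names, rubric text, and prompt wording here are illustrative assumptions, not AlignBench's actual prompts; the sketch assumes the `openai` package with GPT-4 as the judge and an API key in the environment.

```python
"""Illustrative sketch of rule-calibrated, multi-dimensional judging in the
style AlignBench describes. All prompt text and dimension names below are
assumptions for demonstration, not AlignBench's real evaluation prompts."""

import json
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical scoring dimensions; a real setup would tailor them per category.
DIMENSIONS = ["factual correctness", "fulfillment of user intent", "clarity"]

JUDGE_PROMPT = """You are a strict evaluator of responses from Chinese LLMs.

Question:
{question}

Reference answer:
{reference}

Model response:
{response}

Scoring rules (calibrate every score against the reference answer):
- 1-2: irrelevant or mostly wrong
- 3-4: partially correct but clearly worse than the reference
- 5-6: roughly on par with the reference, with noticeable flaws
- 7-8: on par with or slightly better than the reference
- 9-10: clearly better than the reference

Rate each dimension from 1 to 10: {dimensions}.
Reply with JSON only, for example:
{{"factual correctness": 7, "fulfillment of user intent": 8, "clarity": 7,
"overall": 7, "analysis": "<one-sentence justification>"}}"""


def judge(question: str, reference: str, response: str,
          model: str = "gpt-4") -> dict:
    """Ask the judge model for per-dimension scores plus an overall score."""
    prompt = JUDGE_PROMPT.format(
        question=question,
        reference=reference,
        response=response,
        dimensions=", ".join(DIMENSIONS),
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading aids reproducibility
    )
    text = completion.choices[0].message.content
    # The judge is instructed to reply with JSON only; extract defensively.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    return json.loads(match.group(0)) if match else {"raw": text}


if __name__ == "__main__":
    scores = judge(
        question="什么是大语言模型的对齐？",
        reference="对齐指让模型的行为符合人类的意图与价值观。",
        response="对齐就是训练模型更好地遵循人类指令。",
    )
    print(scores)
```

The rule calibration is what distinguishes this from free-form grading: anchoring each score band to the reference answer constrains the judge's scale, which is the mechanism AlignBench credits for better agreement with human raters.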