In natural language processing, large language models (LLMs) are advancing rapidly and making significant progress across many domains. However, as these models grow more complex, accurately evaluating their outputs becomes crucial. Evaluation has traditionally relied on human reviewers, but that approach is time-consuming, hard to scale, and cannot keep pace with the rapid development of new models.

To address this issue, the Salesforce AI Research team has introduced SFR-Judge, an evaluation family of three large language models with 8 billion, 12 billion, and 70 billion parameters, built on Meta Llama 3 and Mistral NeMo. SFR-Judge can perform several types of evaluation tasks, including pairwise comparison, single-response rating, and binary classification, aiming to help research teams evaluate new models quickly and efficiently.
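
For illustration, the three task formats could be prompted roughly as in the sketch below. The prompt wording and the `judge` callable are assumptions for this article, not the actual SFR-Judge templates.

```python
# Hypothetical prompt builders for the three judge task formats.
# `judge` stands in for any chat-LLM call that returns a string.
from typing import Callable

def pairwise_prompt(instruction: str, response_a: str, response_b: str) -> str:
    """Ask the judge which of two responses better follows the instruction."""
    return (
        f"Instruction:\n{instruction}\n\n"
        f"Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
        "Which response is better? Answer 'A' or 'B' and explain briefly."
    )

def single_rating_prompt(instruction: str, response: str) -> str:
    """Ask the judge for a 1-5 quality rating of a single response."""
    return (
        f"Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
        "Rate the response on a 1-5 scale and justify the score."
    )

def binary_prompt(instruction: str, response: str) -> str:
    """Ask the judge for a yes/no verdict on whether the response is acceptable."""
    return (
        f"Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
        "Does the response satisfy the instruction? Answer 'yes' or 'no'."
    )

def evaluate(judge: Callable[[str], str], prompt: str) -> str:
    """Run one evaluation with whatever LLM backend `judge` wraps."""
    return judge(prompt)
```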

LLM-based evaluation models often suffer from biases, such as positional and length bias, that skew their judgments. To overcome this, SFR-Judge is trained with Direct Preference Optimization (DPO), learning from both positive and negative examples to better understand evaluation tasks, reduce bias, and keep its judgments consistent.
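
The DPO objective itself is standard; the sketch below shows a minimal PyTorch version of the loss, assuming the chosen (positive) and rejected (negative) judgment log-probabilities have already been computed under the policy and a frozen reference model. The tensor names and values are illustrative, not taken from the paper.

```python
# Minimal sketch of the DPO loss used to learn from positive ("chosen") and
# negative ("rejected") judgement examples.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(chosen | prompt)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(chosen | prompt)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(rejected | prompt)
    beta: float = 0.1,
) -> torch.Tensor:
    """Push the policy to prefer chosen over rejected judgements,
    relative to the frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy log-probabilities for a batch of two preference pairs.
loss = dpo_loss(
    torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -11.0]),
    torch.tensor([-13.0, -10.0]), torch.tensor([-14.0, -10.5]),
)
print(loss.item())
```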

In testing, SFR-Judge performed strongly across 13 benchmarks, surpassing many existing evaluation models, including some proprietary ones. Notably, on the RewardBench leaderboard, SFR-Judge reached 92.7% accuracy, with its models becoming the first and second generative judge models to break the 90% threshold, demonstrating its strength in model evaluation.

SFR-Judge's training data spans three formats. The first, "Chain-of-Thought Critique," teaches the model to produce structured analyses of the responses it evaluates. The second, "Standard Judgment," streamlines evaluation by directly outputting a verdict on whether a response meets the standard. The third, "Response Derivation," helps the model understand what characterizes a high-quality response, strengthening its judgment. Combining these three formats significantly enhances SFR-Judge's evaluation capability.
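
As a rough illustration, preference records for the three formats might look like the hypothetical examples below. The field names and contents are assumptions for this article, not the paper's actual training schema.

```python
# Hypothetical preference records for the three training data formats.

chain_of_thought_critique = {
    "prompt": "Evaluate the response to: 'Explain photosynthesis to a child.'",
    "chosen": "Step 1: The response uses simple language... Verdict: good response.",
    "rejected": "Looks fine.",  # an unstructured critique as the negative example
}

standard_judgment = {
    "prompt": "Does the response answer the user's question? Reply 'yes' or 'no'.",
    "chosen": "yes",
    "rejected": "no",  # the incorrect verdict as the negative example
}

response_derivation = {
    # Given a judgment, the model infers which response it describes,
    # reinforcing what a high-quality response looks like.
    "prompt": "Which of the two responses below matches this positive critique?",
    "chosen": "Response A",
    "rejected": "Response B",
}
```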

In extensive experiments, the SFR-Judge models also outperformed others at mitigating bias. On the EvalBiasBench benchmark, they showed high pairwise order consistency, meaning their verdicts remain stable even when the order of the two responses is swapped. This makes SFR-Judge a reliable automated evaluation solution that reduces reliance on manual labeling and offers a more scalable option for model evaluation.
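
Pairwise order consistency can be measured by judging each pair in both orders and checking that the same underlying response wins, as in the sketch below. Here `judge_pairwise` is a hypothetical wrapper around the evaluation model that returns "A" or "B".

```python
# Sketch of a pairwise order consistency check for a judge model.
from typing import Callable

def is_order_consistent(
    judge_pairwise: Callable[[str, str, str], str],
    instruction: str,
    response_1: str,
    response_2: str,
) -> bool:
    """True if the judge picks the same underlying response
    regardless of whether it appears as A or B."""
    first = judge_pairwise(instruction, response_1, response_2)   # "A" means response_1
    second = judge_pairwise(instruction, response_2, response_1)  # "A" means response_2
    winner_first = response_1 if first == "A" else response_2
    winner_second = response_2 if second == "A" else response_1
    return winner_first == winner_second

def consistency_rate(judge_pairwise, examples) -> float:
    """Fraction of (instruction, response_1, response_2) triples judged
    consistently across both orderings."""
    results = [is_order_consistent(judge_pairwise, *ex) for ex in examples]
    return sum(results) / len(results)
```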

Paper link: https://arxiv.org/abs/2409.14664

Key Points:

📊 High Accuracy: SFR-Judge achieved top results on 10 of 13 benchmarks, including 92.7% accuracy on RewardBench.

🛡️ Bias Mitigation: The models exhibit less bias than other evaluation models, particularly length and positional bias.

🔧 Versatile Applications: SFR-Judge supports pairwise comparison, single scoring, and binary classification evaluations, adapting to various evaluation scenarios.