Berkeley Function-Calling Leaderboard

Leaderboard for evaluating the function calling ability of large language models

The Berkeley Function-Calling Leaderboard (BFCL) is an online leaderboard that evaluates how accurately large language models (LLMs) call functions (also known as tools). It is built on real-world data and updated regularly, providing a benchmark for measuring and comparing how different models perform on function-calling tasks. It is a useful resource for developers, researchers, and anyone interested in the programming capabilities of AI.
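
To make "function-calling accuracy" concrete, here is a minimal sketch in Python of one way such a score could be computed: each model output is parsed as a JSON function call and compared with an expected call. This is an illustration only and not the leaderboard's actual scoring method. The names `call_matches`, `accuracy`, and `get_weather`, and the `{"name": ..., "arguments": ...}` call format, are assumptions made for the example.

```python
import json


def call_matches(model_output: str, expected: dict) -> bool:
    """Return True if the model's JSON call has the expected name and arguments.

    Hypothetical format: {"name": "<function>", "arguments": {...}}.
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        # Output that is not valid JSON counts as a failed call.
        return False
    if not isinstance(call, dict):
        return False
    return (
        call.get("name") == expected["name"]
        and call.get("arguments") == expected["arguments"]
    )


def accuracy(outputs: list[str], expected_calls: list[dict]) -> float:
    """Fraction of expected calls that the model reproduced exactly."""
    if not expected_calls:
        return 0.0
    hits = sum(call_matches(o, e) for o, e in zip(outputs, expected_calls))
    return hits / len(expected_calls)


if __name__ == "__main__":
    expected = [
        {"name": "get_weather", "arguments": {"city": "Berkeley"}},
        {"name": "get_weather", "arguments": {"city": "Oakland"}},
    ]
    outputs = [
        '{"name": "get_weather", "arguments": {"city": "Berkeley"}}',
        '{"name": "get_weather", "arguments": {"city": "San Jose"}}',
    ]
    print(accuracy(outputs, expected))
```

Exact matching is the strictest possible choice. A real benchmark may accept equivalent argument values or check whether the call executes correctly, so treat this only as a starting point.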

Berkeley Function-Calling Leaderboard Alternatives

thisorthis.ai — AI Model Comparison Platform

Scale Leaderboard — AI Model Performance Evaluation Platform

Car Comparison — AI-powered car comparison

AIGCRank AI Language Model API Price Comparison — Aggregates and compares the pricing information of major AI model providers globally

CodeArena — AI Model Programming Competition Platform

OpenCompass 2.0 Large Language Model Leaderboard — A real-time large language model leaderboard that provides comprehensive performance assessments.

Deepmark AI — Generative AI Model Evaluation Tool

Gentrace — Evaluation and Monitoring of Generative AI

Openlayer — AI Model Testing and Evaluation Tool

SuperCLUE — Leading AI evaluation benchmark for measuring and comparing AI model performance.

FlagEval — Model Evaluation Platform

Ropes AI — New AI-powered Coding Evaluation

Patronus GLIDER — A general evaluation model for assessing text, dialogue, and RAG settings.

AVbeam — Audio comparison tool

deepeval — An evaluation and unit testing framework for large language models (LLMs)

promptbench — Unified Language Model Evaluation Framework

1X World Model — An advanced world model providing virtual simulation and evaluation for robotics.

SFR-Judge — An intelligent evaluation tool that accelerates model assessment and fine-tuning.

Countless.dev — AI model comparison tool, free and open-source

Algomax — Simplifies LLM and RAG model output evaluation, providing insights into qualitative metrics.

NameBeta — A fast domain search and comparison tool

Codestral 25.01 — An advanced programming assistance model launched by Mistral AI.

Promptclub — AI Model Online Programming and Interactive Learning Platform

Artificial Analysis — Independent analysis platform for AI language models and API providers, helping you choose the right models and APIs.

ImagenHub — Inference and Evaluation of Standardized Conditional Image Generation Models

Llama-3-Patronus-Lynx-8B-Instruct-v1.1 — Open-source hallucination evaluation model

Code Llama — An advanced large language model for programming.

Imaginary Programming — Programming Imagination - Fast as Thought

Lin's Grand Model Ranking — Ranking of large model products better suited to Chinese users.