On December 5th, the ByteDance Doubao large model team released FullStack Bench, its latest evaluation benchmark for code models. The benchmark covers more than 11 real-world scenarios, supports 16 programming languages, and contains 3,374 questions. Compared with earlier benchmarks, it assesses the code development capabilities of large models more accurately across a wider range of programming fields, which should help optimize models for real-world programming tasks.
Current mainstream code evaluation benchmarks each cover a narrow slice of programming work. HumanEval and MBPP typically focus on basic and advanced programming problems, DS-1000 concentrates on data analysis and machine learning tasks and supports only Python, and xCodeEval emphasizes advanced programming and mathematics. All of them are significantly limited in application scenarios and language coverage. In contrast, FullStack Bench broadens data coverage considerably, spanning more than 11 application domains and addressing more complex and diverse programming scenarios.
The domain coverage of FullStack Bench is derived from Stack Overflow, the world's largest programming Q&A platform. The research team sampled 500,000 questions and kept the application domains that account for the top 88.1% of them, giving the dataset both breadth and robustness. Each question includes a detailed description, a reference solution, and unit test cases to ensure evaluation accuracy. The team also cross-checked data quality through both AI and manual review, further improving the reliability of the data.
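As a rough illustration of how a question with a reference solution and unit test cases can be scored, here is a minimal Python sketch. It is not the team's evaluation code: the `BenchItem` field names, the `passes_tests` helper, and the sample question are all hypothetical, and a plain subprocess is used only as a stand-in for a real sandbox.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class BenchItem:
    # Hypothetical field names; the real dataset schema may differ.
    description: str
    reference_solution: str
    unit_tests: str


def passes_tests(candidate_code: str, item: BenchItem, timeout_s: float = 10.0) -> bool:
    """Run the candidate code followed by the item's unit tests in a fresh interpreter.

    Returns True only if the process exits with status 0 within the timeout.
    Note: a subprocess is NOT a security sandbox; this is for illustration only.
    """
    program = candidate_code + "\n\n" + item.unit_tests + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0


# A made-up question in the assumed format.
item = BenchItem(
    description="Write add(a, b) that returns the sum of two integers.",
    reference_solution="def add(a, b):\n    return a + b",
    unit_tests="assert add(2, 3) == 5\nassert add(-1, 1) == 0",
)

if __name__ == "__main__":
    # The reference solution should pass its own tests.
    print(passes_tests(item.reference_solution, item))
    # A deliberately wrong candidate should fail them.
    print(passes_tests("def add(a, b):\n    return a - b", item))
```

Checking the reference solution against its own tests, as in the first call, is also a cheap sanity check on the question itself before any model output is scored.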
To make the dataset easier for developers to use, the ByteDance Doubao team has also open-sourced SandboxFusion, a code sandbox tool that supports efficient execution of multi-language programming tasks. SandboxFusion is compatible with more than 10 widely used code evaluation datasets and supports 23 programming languages, so developers can easily test large models in different environments.
Additionally, the ByteDance Doubao large model team presented its self-developed code model, Doubao-Coder, for the first time and evaluated the programming capabilities of more than 20 code models from around the world. ByteDance continues to make progress in AI programming: its self-developed code base model MarsCode contributes millions of lines of code to users every month, which the company presents as evidence of its leading position in the field.
Dataset open-source address: https://huggingface.co/datasets/ByteDance/FullStackBench
Sandbox open-source address: https://github.com/bytedance/SandboxFusion
Paper address: https://arxiv.org/pdf/2412.00535v2