ByteDance's New Code Model Evaluation Benchmark 'FullStack Bench'

AIbase基地

Published in AI News · 4 min read · Dec 5, 2024


On December 5, ByteDance's Doubao large model team released FullStack Bench, a new benchmark for evaluating code models. It covers more than 11 real-world scenarios, supports 16 programming languages, and contains 3,374 questions. Compared with earlier benchmarks, it assesses the coding ability of large models more accurately across a wider range of programming fields, which should help drive model optimization for real-world programming tasks.

Among current mainstream code evaluation benchmarks, HumanEval and MBPP typically focus on basic and advanced programming problems, while DS-1000 concentrates on data analysis and machine learning tasks and supports only Python. xCodeEval emphasizes advanced programming and mathematics but is significantly limited in application scenarios and language coverage. In contrast, FullStack Bench substantially broadens data coverage, spanning more than 11 application domains and addressing more complex and diverse programming scenarios.

The dataset for FullStack Bench draws on Stack Overflow, the world's largest programming Q&A platform. From 500,000 questions, the research team kept the top application domains, which together account for 88.1% of the questions, to ensure the dataset's breadth and robustness. Each question includes a detailed description, a reference solution, and unit test cases so that results can be checked accurately. The team also cross-checked data quality through both AI and manual review, further improving the reliability of the data.
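To make the role of reference solutions and unit tests concrete, here is a minimal Python sketch of test-based scoring. It is an illustration only: the `BenchmarkItem` fields and the `passes_tests` helper are hypothetical names, not the actual FullStack Bench schema or the SandboxFusion API, and a plain subprocess is not a secure sandbox.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class BenchmarkItem:
    # Hypothetical field names; the real FullStack Bench schema may differ.
    description: str
    reference_solution: str
    unit_tests: str


def passes_tests(item: BenchmarkItem, candidate: str, timeout_s: float = 10.0) -> bool:
    """Return True if the candidate code plus the item's unit tests exit with status 0."""
    program = candidate + "\n\n" + item.unit_tests + "\n"
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate_test.py"
        script.write_text(program, encoding="utf-8")
        try:
            # Run in a separate interpreter so a crash or hang in the
            # candidate cannot take down the evaluator itself.
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0


# A toy item, invented for illustration.
item = BenchmarkItem(
    description="Write add(a, b) that returns the sum of two integers.",
    reference_solution="def add(a, b):\n    return a + b",
    unit_tests="assert add(2, 3) == 5\nassert add(-1, 1) == 0",
)
```

Calling `passes_tests(item, model_output)` for each question and averaging the results gives a simple pass rate. Running the reference solution through the same check is a quick way to confirm that the tests themselves are sound.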

To make the dataset easier for developers to use, the Doubao team has also open-sourced SandboxFusion, a code sandbox tool that supports efficient execution of multi-language programming tasks. SandboxFusion is compatible with more than 10 widely used code evaluation datasets and supports 23 programming languages, so developers can test large models across different environments with little effort.

Additionally, the Doubao large model team presented its self-developed code model, Doubao-Coder, for the first time and evaluated the programming capabilities of more than 20 code models from around the world. ByteDance continues to make progress in AI programming: MarsCode, powered by its self-developed code base model, contributes millions of lines of code to users every month, demonstrating the company's leading position in this field.

Dataset open-source address: https://huggingface.co/datasets/ByteDance/FullStackBench

Sandbox open-source address: https://github.com/bytedance/SandboxFusion

Paper address: https://arxiv.org/pdf/2412.00535v2



This article is from AIbase Daily
