A recent study has generated excitement by demonstrating that Large Language Models (LLMs) can significantly enhance their performance through search. Notably, the Llama 3.1 model with only 8 billion parameters, given 100 sampled attempts per problem, performed on par with GPT-4o on Python code generation tasks.

This idea recalls the work of reinforcement learning pioneer Rich Sutton, particularly his classic 2019 essay, "The Bitter Lesson." He emphasized the power of general methods as computational capability grows, singling out "search" and "learning" as the two approaches that continue to scale with compute.

While the learning half of Sutton's lesson is well understood (larger models acquire more knowledge), the potential of search during inference is often overlooked. Recently, researchers from Stanford, Oxford, and DeepMind found that simply increasing the number of samples drawn at inference time significantly improves model performance on mathematics, reasoning, and code generation tasks.

Inspired by these studies, two engineers ran their own experiment. They found that searching with 100 small Llama models could match or surpass GPT-4o on Python programming tasks. They described it metaphorically: "What used to require a large horse can now be accomplished with 100 small ducks."

To maximize throughput, they used the vLLM library for batch inference, running on 10 A100-40GB GPUs and reaching an impressive output speed of 40k tokens per second. The authors chose the HumanEval benchmark, which evaluates generated code by actually executing it against tests, offering a more objective and accurate assessment than text-similarity metrics.
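As a rough illustration of what repeated sampling with vLLM looks like (this is a minimal sketch, not the authors' actual script; the model name, prompt, and sampling parameters below are assumptions):

```python
# Minimal repeated-sampling sketch with vLLM. Not the authors' exact script;
# the model name and sampling parameters here are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")

# n=100 asks vLLM to draw 100 independent completions per prompt in one
# batched call; a nonzero temperature keeps the samples diverse.
params = SamplingParams(n=100, temperature=0.6, top_p=0.95, max_tokens=512)

prompts = ["Write a Python function that checks whether a string is a palindrome."]
outputs = llm.generate(prompts, params)

for request in outputs:
    candidates = [completion.text for completion in request.outputs]
    print(f"collected {len(candidates)} candidate solutions")
```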

According to the report, GPT-4o scored 90.2% on the pass@1 metric in zero-shot inference. With the method above, Llama 3.1 8B's pass@k score improved significantly: with 100 samples per problem, it reached 90.5%; with 1,000 samples, the score rose further to 95.1%, clearly outperforming GPT-4o.
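For context, pass@k in this literature is typically computed with the unbiased estimator from the Codex paper (Chen et al., 2021): draw n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k draws is correct. A small self-contained example:

```python
# Unbiased pass@k estimator (Chen et al., 2021):
# with n samples per problem, of which c pass the tests,
# pass@k = 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:  # fewer failing samples than draws: success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 1 of which passes, evaluated at k = 5.
print(pass_at_k(10, 1, 5))  # 0.5
```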

It is worth noting that although this experiment is not a strict replication of the original study, it highlights the potential, within practical limits, for smaller models to match or surpass larger ones when search is used to scale up inference-time compute.

The strength of search lies in its ability to scale "transparently" with computational power, shifting resources from memorization to computation and thereby enabling a more flexible allocation between the two. DeepMind's recent progress in mathematics, where its AlphaProof and AlphaGeometry 2 systems reached silver-medal level at the 2024 International Mathematical Olympiad, demonstrates the power of search.

However, successful search first requires a high-quality way to evaluate candidate results. DeepMind's systems obtained an effective supervision signal by translating natural-language descriptions of mathematical problems into formal statements that can be checked automatically. In other areas, such as open-ended NLP tasks like "summarizing emails," effective search is much harder because there is no comparably reliable verifier.
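Code generation sits at the easy end of this spectrum: the verifier is simply running each candidate against the problem's unit tests. A minimal sketch of such a generate-and-verify loop, where `candidates` (sampled completions) and `test_code` (the problem's test suite) are hypothetical placeholders:

```python
# Generate-and-verify search loop for code tasks; `candidates` and
# `test_code` are hypothetical placeholders. A subprocess with a timeout
# stands in for a proper sandbox; never run untrusted model output
# outside an isolated environment.
import subprocess
import tempfile

def first_passing(candidates: list[str], test_code: str,
                  timeout: float = 5.0) -> str | None:
    """Return the first candidate solution whose tests all pass, if any."""
    for solution in candidates:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(solution + "\n\n" + test_code)
            path = f.name
        try:
            result = subprocess.run(["python", path],
                                    capture_output=True, timeout=timeout)
            if result.returncode == 0:  # all asserts passed
                return solution
        except subprocess.TimeoutExpired:
            continue  # treat hangs and infinite loops as failures
    return None
```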

This study suggests that the performance gains of generative models in a given domain are closely tied to how well their outputs can be evaluated and searched over; future research can explore how to strengthen these abilities through repeatable, verifiable digital environments.

Paper link: https://arxiv.org/pdf/2407.21787