A recent study has generated excitement by demonstrating that Large Language Models (LLMs) can significantly boost their performance through search. Notably, the Llama 3.1 model with only 8 billion parameters, given 100 search attempts per problem, performed on par with GPT-4o on Python code generation tasks.
This idea is reminiscent of Rich Sutton's pioneering work in reinforcement learning, particularly his classic 2019 essay "The Bitter Lesson." In it, he emphasized the power of general methods as computational capability grows, singling out "search" and "learning" as the two approaches that continue to scale.
While learning has received most of the attention, since larger models typically acquire more knowledge, the potential of search during inference is often overlooked. Recently, researchers from Stanford, Oxford, and DeepMind found that increasing the number of repeated samples drawn at inference time significantly improves model performance in areas such as mathematics, reasoning, and code generation.
Inspired by these studies, two engineers decided to run their own experiment. They found that searching over 100 samples from a small Llama model could match or surpass GPT-4o on Python programming tasks. They described it metaphorically: "What used to require one big horse can now be accomplished with 100 small ducks."
To achieve higher throughput, they used the vLLM library for batched inference and ran it on 10 A100-40GB GPUs, reaching an output speed of 40k tokens per second. The authors chose the HumanEval benchmark, which evaluates generated code by actually running tests, offering a more objective and accurate assessment.
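As a rough illustration of this setup, the sketch below shows how repeated sampling can be done with vLLM. The model name, sampling parameters, and prompt are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch of repeated sampling with vLLM (illustrative only; the
# model name, sampling parameters, and prompt are assumptions, not the
# authors' exact configuration).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")

# Request 100 independent completions of the same prompt in one batched call.
params = SamplingParams(n=100, temperature=0.8, top_p=0.95, max_tokens=512)

prompt = "Write a Python function that returns the n-th Fibonacci number."
outputs = llm.generate([prompt], params)

# Collect the 100 candidate solutions for later evaluation.
candidates = [o.text for o in outputs[0].outputs]
print(f"Generated {len(candidates)} candidate solutions")
```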
According to the report, in zero-shot inference GPT-4o scored 90.2% on the pass@1 metric. With the method described above, Llama 3.1 8B's pass@k score improved significantly: with 100 samples per problem, Llama scored 90.5%, and with 1,000 samples the score rose further to 95.1%, clearly outperforming GPT-4o.
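For context, pass@k counts a problem as solved if at least one of k sampled solutions passes all of its tests. When more than k samples are drawn per problem, the unbiased estimator from the original HumanEval paper is typically used; a minimal sketch is shown below (the function name and example numbers are illustrative, not taken from the report).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that pass all tests.
    Returns the probability that at least one of k drawn samples is correct,
    i.e. 1 - C(n - c, k) / C(n, k), computed in a numerically stable way.
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

For instance, with n = 1000 samples of which only c = 1 is correct, pass@100 works out to k/n = 10%, which illustrates how repeated sampling makes even rare correct solutions likely to surface.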
It is worth noting that, although this experiment is not a strict replication of the original study, it highlights the potential for smaller models to match or surpass larger ones when search is used to scale up inference-time compute.
The strength of search lies in its ability to scale "transparently" with increased computational power, shifting resources from memorization to computation and allowing them to be allocated more flexibly. DeepMind's recent progress in mathematics likewise demonstrated the power of search.
However, successful search first requires high-quality evaluation of candidate results. DeepMind's models obtained effective supervision by converting natural-language descriptions of mathematical problems into formal statements that can be checked automatically. In other areas, such as open-ended NLP tasks like "summarizing emails," conducting an effective search is much harder.
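The reason search works so well for code is that the evaluation step can be automated: each candidate can simply be executed against the problem's tests. The sketch below illustrates this idea; the candidate programs and test snippet are made up for illustration and are not part of the original experiment.

```python
# Illustrative sketch: search over many candidates needs a verifier.
# For code generation, the verifier is simply "run the tests".

def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code passes the given assert-based tests."""
    namespace = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function
        exec(test_src, namespace)        # run the asserts against it
        return True
    except Exception:
        return False

def select_solution(candidates: list[str], test_src: str) -> str | None:
    """Return the first candidate that passes all tests, if any."""
    for src in candidates:
        if passes_tests(src, test_src):
            return src
    return None

# Hypothetical example: two candidates, only the second is correct.
candidates = [
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return a + b",
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(select_solution(candidates, tests))
```

For open-ended outputs such as email summaries, no comparably cheap and reliable check exists, which is exactly why search is harder to apply there.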
This study suggests that the performance gains of generative models in a given domain are closely tied to how well candidate outputs can be evaluated and searched over in that domain, and future research could explore how to strengthen these abilities through repeatable digital environments.
Paper link: https://arxiv.org/pdf/2407.21787