Researchers from Meta, the University of California, Berkeley, and New York University have jointly developed a new technique called "Thought Preference Optimization" (TPO). The technique aims to improve the performance of large language models (LLMs) across a wide range of tasks by having the AI think through its response before answering.
The researchers argue that "thinking" should be broadly useful. In creative writing, for example, a model can use its internal thoughts to plan overall structure and character development. This sets TPO apart from earlier Chain-of-Thought (CoT) prompting, which has been applied mainly to mathematical and logical problems. The team also points to OpenAI's new o1 model as evidence that a thinking process can benefit a much wider range of tasks.
So how does TPO work? First, the model is prompted to write out a series of thought steps before answering. It then samples multiple such outputs for each instruction, and a judge model scores only the final answers, not the thought steps themselves. Finally, the model is trained through preference optimization on these scores. The hope is that rewarding better answers implicitly refines the thinking that produced them, so the model acquires more effective reasoning without the thoughts ever being supervised directly.
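To make the loop concrete, here is a minimal Python sketch of one TPO training round as described above. It is an illustration under assumptions, not the authors' actual code: the helper names (`model.sample`, `judge.score`, `model.dpo_update`) and the `<R>` answer marker are hypothetical stand-ins, and the preference-optimization step is shown DPO-style.

```python
# Hypothetical sketch of a single TPO iteration.
# All helper names are illustrative assumptions, not the paper's real API.

THOUGHT_PROMPT = (
    "Respond to the user query below. Write your internal reasoning first, "
    "then give the final answer after the marker <R>."
)

def split_thought_and_answer(output: str) -> tuple[str, str]:
    """Split a generation into (thought, answer) at the <R> marker."""
    thought, _, answer = output.partition("<R>")
    return thought.strip(), answer.strip()

def tpo_iteration(model, judge, instructions, num_samples=8):
    preference_pairs = []
    for instruction in instructions:
        # 1. Sample several thought+answer generations for the same instruction.
        outputs = [model.sample(f"{THOUGHT_PROMPT}\n\n{instruction}")
                   for _ in range(num_samples)]

        # 2. Score only the final answers; the thoughts stay hidden from the judge.
        scores = [judge.score(instruction, split_thought_and_answer(o)[1])
                  for o in outputs]

        # 3. Keep the best- and worst-scored full outputs (thoughts included)
        #    as a chosen/rejected preference pair.
        best = outputs[max(range(num_samples), key=scores.__getitem__)]
        worst = outputs[min(range(num_samples), key=scores.__getitem__)]
        preference_pairs.append((instruction, best, worst))

    # 4. Preference-optimize the model on the chosen vs. rejected generations
    #    (a DPO-style update; `dpo_update` is a placeholder).
    model.dpo_update(preference_pairs)
    return model
```

Because only the answers are scored, the thought text is never judged directly; it is pulled along by whatever reasoning happened to precede the preferred answers.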
In tests, a Llama 3 8B model trained with TPO outperformed versions without explicit reasoning on general instruction-following benchmarks, achieving win rates of 52.5% on AlpacaEval and 37.3% on Arena-Hard. More strikingly, TPO also made gains in areas that typically do not require explicit thinking, such as common sense, marketing, and health.
However, the research team notes that the current setup is not well suited to mathematical problems: on these tasks TPO actually performs worse than the base model, suggesting that highly specialized domains may require different approaches. Future work may focus on controlling the length of the thinking process and on how thinking affects larger models.
Key Points:
🌟 The research team introduces "Thought Preference Optimization" (TPO), aimed at enhancing AI's thinking capabilities in task execution.
🧠 TPO enables the model to generate thought steps before answering, utilizing an evaluation model to optimize response quality.
📈 Tests show that TPO performs well in areas like common sense and marketing, but underperforms in mathematical tasks.