Self-Rewarding Language Models
Language Model Self-Reward Training
This product is a self-rewarding language model trained with LLM-as-a-Judge prompting, using reward signals generated by the model itself. Through iterative DPO training, the model improves not only its ability to follow instructions but also its ability to assign high-quality rewards to its own outputs. After three iterations of fine-tuning, it surpasses many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While this work is preliminary research, it opens the door to models that continually improve along both of these axes.
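The loop described above is simple enough to sketch. The Python/PyTorch snippet below is a minimal illustration, not the authors' code: generate and judge are hypothetical stand-ins for the model's sampling and its LLM-as-a-Judge scoring prompt, build_preference_pairs shows how self-assigned scores become preference data, and dpo_loss is the standard DPO objective that each iteration optimizes.

```python
import random

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy's log-prob margin between the
    chosen and rejected response above the frozen reference model's margin."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


def build_preference_pairs(prompts, generate, judge, n_samples=4):
    """One self-rewarding step: sample candidates for each prompt, score them
    with the model's own LLM-as-a-Judge prompt, and keep the best/worst pair
    as (prompt, chosen, rejected) training data for the next DPO round."""
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        scores = [judge(prompt, c) for c in candidates]
        if max(scores) > min(scores):  # skip prompts with no usable margin
            chosen = candidates[scores.index(max(scores))]
            rejected = candidates[scores.index(min(scores))]
            pairs.append((prompt, chosen, rejected))
    return pairs


if __name__ == "__main__":
    # Toy stand-ins; a real run would call the same language model for both
    # the generator and the judge roles (hypothetical interfaces, not the
    # paper's actual prompts).
    generate = lambda p: f"draft scored {random.randint(0, 5)}"
    judge = lambda p, r: int(r.split()[-1])  # parse the judge's 0-5 score
    print(build_preference_pairs(["Explain DPO in one line."], generate, judge))

    # DPO loss on dummy sequence log-probabilities for one preference pair.
    print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                   torch.tensor([-13.0]), torch.tensor([-14.0])).item())
```

Repeating this cycle, with each round's DPO-trained model generating and judging the next round's data, is what lets instruction following and self-reward quality improve together.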
Self-Rewarding Language Models traffic over time:
Monthly Visits: 29,742,941
Bounce Rate: 44.20%
Pages per Visit: 5.9
Visit Duration: 00:04:44