Generative language models face numerous challenges in the transition from training to practical deployment. A central one is how to get the best possible performance out of a model at inference time.
Current alignment strategies, such as reinforcement learning from human feedback (RLHF), focus mainly on improving the model's win rate while largely ignoring the decoding strategies used at inference time, such as Best-of-N sampling and controlled decoding. This gap between the training objective and actual usage can lead to inefficiencies and degrade the quality and reliability of outputs.
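To make the inference-time strategies concrete, here is a minimal sketch of Best-of-N sampling and its Worst-of-N counterpart. The `generate` and `reward_model` callables are hypothetical placeholders, not an API from the paper or any specific library.

```python
# Minimal sketch of Best-of-N / Worst-of-N sampling.
# `generate` and `reward_model` are assumed, user-supplied callables:
#   generate(prompt) -> str          (samples one response from the model)
#   reward_model(prompt, response) -> float  (scores a response)

def best_of_n(prompt, generate, reward_model, n=8):
    """Draw n candidate responses and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]

def worst_of_n(prompt, generate, reward_model, n=8):
    """Adversarial counterpart used in safety evaluations: return the lowest-scoring response."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    worst_index = min(range(n), key=lambda i: scores[i])
    return candidates[worst_index]
```

A model aligned without regard to these procedures may look good on single-sample win rate yet underperform once Best-of-N selection or Worst-of-N stress tests are applied at deployment.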
To address this, Google DeepMind and Google Research developed InfAlign, a machine learning framework that folds inference-time strategies into the alignment process, aiming to bridge the gap between training and deployment. Through a calibrated reinforcement learning approach, it adjusts the reward function to match the specific inference strategy. InfAlign is particularly effective with techniques such as Best-of-N sampling (generating multiple responses and selecting the best one) and Worst-of-N (commonly used in safety evaluations), ensuring that aligned models perform well in both controlled settings and real-world scenarios.
At the core of InfAlign is the calibrate-and-transform reinforcement learning (CTRL) algorithm, which follows three steps: calibrating reward scores, transforming them according to the chosen inference strategy, and solving a KL-regularized optimization problem. By tailoring the reward transformation to the deployment scenario, InfAlign aligns the training objective with inference-time needs. This not only improves the inference-time win rate but also keeps training computationally efficient. InfAlign also makes models more robust, enabling them to handle a variety of decoding strategies and produce consistently high-quality outputs.
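The sketch below illustrates the three CTRL steps under stated assumptions: calibration is approximated with an empirical CDF over reward scores sampled from the reference (pre-alignment) policy, and the exponential transforms shown are one plausible choice for the Best-of-N and Worst-of-N settings rather than the paper's exact formulas. The function names are illustrative.

```python
import math
import bisect

# Sketch of the CTRL recipe:
#   (1) calibrate the raw reward to its quantile under the reference policy,
#   (2) transform the calibrated reward for the target inference strategy,
#   (3) optimize the transformed reward with a standard KL-regularized RL objective.
# The specific transforms below are assumptions for illustration only.

def fit_calibrator(reference_scores):
    """Build an empirical CDF from reward scores of responses sampled from the reference policy."""
    sorted_scores = sorted(reference_scores)

    def calibrate(score):
        # Fraction of reference responses with reward <= score, a value in [0, 1].
        rank = bisect.bisect_right(sorted_scores, score)
        return rank / len(sorted_scores)

    return calibrate

def transform_for_best_of_n(calibrated, temperature=4.0):
    """A plausible monotone transform that emphasizes the upper tail (Best-of-N deployment)."""
    return math.exp(temperature * calibrated)

def transform_for_worst_of_n(calibrated, temperature=4.0):
    """A mirror transform that penalizes the lower tail (Worst-of-N safety setting)."""
    return -math.exp(temperature * (1.0 - calibrated))

def ctrl_reward(prompt, response, reward_model, calibrate, transform):
    """Reward actually fed to the KL-regularized RL step."""
    raw = reward_model(prompt, response)
    return transform(calibrate(raw))

# Step (3) is the usual KL-regularized objective,
#   maximize  E[ transformed_reward(x, y) ]  -  beta * KL(pi || pi_ref),
# which any standard RLHF trainer can optimize once the reward is swapped out.
```

Because calibration maps every reward onto a common [0, 1] quantile scale, the same transformation can be reused across prompts even when the raw reward model is miscalibrated.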
Experiments on Anthropic's helpfulness and harmlessness datasets validated InfAlign's effectiveness. Compared with existing methods, InfAlign improved the inference-time win rate by 8%-12% for Best-of-N sampling and by 4%-9% for Worst-of-N safety evaluations. These gains stem from its calibrated reward transformations, which address the miscalibration of reward models and ensure consistent performance across different inference scenarios.
InfAlign represents a significant advance in aligning generative language models. By incorporating inference-aware strategies, it addresses the critical mismatch between how models are trained and how they are deployed. Its solid theoretical foundation and empirical results highlight its potential to improve the alignment of AI systems across the board.
Link: https://arxiv.org/abs/2412.19792
Key Points:
🌟 InfAlign is a new framework from Google DeepMind and Google Research aimed at improving the performance of language models at inference time.
📈 The framework aligns training objectives with inference-time needs by adjusting the reward function for the target inference strategy through a calibrated reinforcement learning method.
✅ Experimental results show that InfAlign significantly improves models' inference-time win rates across multiple tasks, demonstrating strong adaptability and reliability.