In the world of AI, "brute force works miracles" seems to have become a golden rule: the larger the model, the more data, and the more compute, the closer it appears to get to the holy grail of intelligence. Behind this rapid progress, however, lie tremendous costs and mounting energy consumption.
To make AI training more efficient, researchers have long searched for more powerful optimizers; an optimizer acts like a coach, steering the model's parameters step by step toward their best state. AdamW, the default optimizer for Transformer pre-training, has been the industry benchmark for years. Yet as model scales keep growing, AdamW is starting to show its limitations.
Is there a way to both speed up training and cut energy consumption? Don't worry: an all-Chinese research team has arrived with its "secret weapon", C-AdamW!
C-AdamW, short for Cautious AdamW, sounds quite "zen," doesn't it? Indeed, the core idea of C-AdamW is "think twice before acting."
Imagine the model's parameters as a group of energetic children who always want to run around. AdamW acts like a diligent teacher, striving to guide them in the right direction. But sometimes, the children get too excited and run off in the wrong direction, wasting time and energy.
At this point, C-AdamW is like a wise elder with piercing eyes, able to tell at a glance whether each update direction is correct. If a direction is wrong, C-AdamW decisively calls a halt, stopping the model from straying further down the wrong path.
This "cautious" strategy ensures that each update effectively reduces the loss function, thus accelerating the model's convergence speed. Experimental results show that C-AdamW improves training speed by 1.47 times in Llama and MAE pre-training!
More importantly, C-AdamW adds almost no computational overhead: it requires only a simple one-line modification to existing code. Developers can therefore drop C-AdamW into all kinds of model training and enjoy the extra speed with minimal effort.
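To make the "one-line modification" concrete, here is a minimal PyTorch-style sketch of the cautious masking step. The function name `cautious_step`, the rescaling by the mask's mean, and the clamping constant are illustrative assumptions rather than the paper's exact code; see the official repository for the reference implementation.

```python
import torch

def cautious_step(param: torch.Tensor, update: torch.Tensor,
                  grad: torch.Tensor, lr: float) -> None:
    """Apply an Adam-style update only where it agrees in sign with the gradient."""
    # The essential "one line": zero out coordinates whose proposed update
    # points against the current gradient.
    mask = (update * grad > 0).to(update.dtype)
    # Rescale so the average step size is roughly preserved despite masking.
    mask = mask / mask.mean().clamp(min=1e-3)
    # Standard parameter update with the masked (cautious) step.
    param.add_(update * mask, alpha=-lr)
```

In an existing AdamW training loop, the same effect amounts to inserting the mask computation right before the parameter update, which is where the "one line of code" claim comes from.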
The "zen" aspect of C-AdamW also lies in its retention of Adam's Hamiltonian function while ensuring convergence guarantees under Lyapunov analysis. This means C-AdamW is not only faster but also more stable, avoiding issues like training crashes.
Of course, "zen" does not mean "unambitious." The research team states they will continue to explore richer φ functions and apply masks in feature space rather than parameter space to further enhance the performance of C-AdamW.
It is foreseeable that C-AdamW will become a new favorite in the field of deep learning, bringing revolutionary changes to large model training!
Paper link: https://arxiv.org/abs/2411.16085
GitHub: