Training large AI models (such as Transformers and language models) has become an indispensable part of the AI field, but it comes with high computational costs, heavy memory consumption, and steep energy demands. For example, OpenAI's GPT-3, with 175 billion parameters, requires weeks of GPU training. These enormous resource requirements limit the technology to organizations with ample computational resources, while also intensifying concerns about energy efficiency and environmental impact. Addressing these challenges is crucial for ensuring broader accessibility and sustainability in AI development.
Traditional training methods are inefficient and require innovative solutions.
The primary cause of inefficiency in training large models is their reliance on dense matrices, which demand substantial memory and compute. Moreover, modern GPUs offer limited support for optimized low-precision or low-rank operations, which exacerbates these demands. Methods such as matrix decomposition and heuristic rank reduction have been proposed, but they remain limited in practice. For instance, GaLore supports training in a single-batch setup but incurs impractical runtime overhead, and the low-rank adapters used in LTE face convergence issues on large tasks. No existing method simultaneously reduces memory usage, computational cost, and training time without compromising performance, which makes the need for innovative solutions urgent.
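To make the dense-versus-low-rank trade-off concrete, here is a minimal sketch, not taken from any of the papers discussed, of how factorizing a dense weight matrix into two thin factors cuts both parameter count and per-sample compute. The layer sizes and rank are illustrative assumptions.

```python
import torch

d_in, d_out, rank = 4096, 4096, 64  # illustrative sizes, not from the paper

# Dense layer: d_out * d_in parameters, O(d_in * d_out) FLOPs per sample.
W = torch.randn(d_out, d_in)

# Low-rank factorization W ~= A @ B: only rank * (d_out + d_in) parameters.
A = torch.randn(d_out, rank)
B = torch.randn(rank, d_in)

dense_params = W.numel()                # 16,777,216
lowrank_params = A.numel() + B.numel()  # 524,288
print(f"compression: {dense_params / lowrank_params:.1f}x")  # ~32x

x = torch.randn(32, d_in)
y = (x @ B.t()) @ A.t()  # two skinny matmuls instead of one dense one
```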
CoMERA Framework: Achieving Efficient Training Through Adaptive Tensor Optimization
Researchers from the University at Albany (State University of New York), the University of California, Santa Barbara, Amazon Alexa AI, and Meta have jointly introduced a new framework called CoMERA (Computing- and Memory-Efficient training via Rank-Adaptive tensor optimization). The framework combines memory efficiency with computational speed through rank-adaptive tensor compression. Unlike traditional methods that focus solely on compression, CoMERA uses a multi-objective optimization formulation to balance compression ratio against model accuracy. It improves GPU utilization through tensorized embeddings and advanced tensor-network contractions, reducing runtime overhead while maintaining robust performance. The framework also adopts CUDA Graphs to minimize kernel-launch latency during GPU operations, a significant bottleneck in traditional tensor-compression approaches.
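As a rough illustration of the kernel-launch problem CUDA Graphs address, the sketch below uses PyTorch's public torch.cuda.CUDAGraph API to capture one forward pass and replay it without per-kernel launch overhead. The Linear model and shapes are placeholder assumptions, not CoMERA's actual code.

```python
import torch

# Placeholder model standing in for a tensorized layer.
model = torch.nn.Linear(1024, 1024).cuda()
static_input = torch.randn(32, 1024, device="cuda")

# Warm-up on a side stream (PyTorch requires this before graph capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass. Replays skip per-kernel launch overhead, which
# matters when tensor-network contractions emit many small kernels.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Run on new data: copy into the captured input buffer, then replay.
static_input.copy_(torch.randn(32, 1024, device="cuda"))
g.replay()  # static_output now holds the result for the new input
```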
The foundation of CoMERA is an adaptive tensor representation that lets model layers adjust their ranks dynamically under resource constraints. By modifying tensor ranks, the framework achieves compression without compromising the integrity of neural network operations. This dynamic optimization is realized through a two-phase training process (a code sketch follows the list below):
Early Phase: Focused on stable convergence.
Later Phase: Fine-tuning ranks to meet specific compression targets.
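Here is a minimal sketch of this two-phase idea under simplifying assumptions. CoMERA itself optimizes tensor-train ranks with a multi-objective formulation; the toy LowRankLinear layer and truncate_rank method below are hypothetical names, and a plain two-factor matrix decomposition with truncated SVD stands in for the actual tensor-network machinery.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Toy low-rank layer whose rank can be tightened between phases."""

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(rank, d_in) * 0.02)

    def forward(self, x):
        # Costs O(rank * (d_in + d_out)) per sample instead of O(d_in * d_out).
        return x @ self.V.t() @ self.U.t()

    @torch.no_grad()
    def truncate_rank(self, new_rank: int):
        # Later phase: compress toward a target rank via truncated SVD of U @ V.
        W = self.U @ self.V
        P, S, Qt = torch.linalg.svd(W, full_matrices=False)
        r = min(new_rank, S.numel())
        self.U = nn.Parameter(P[:, :r] * S[:r])
        self.V = nn.Parameter(Qt[:r, :])

layer = LowRankLinear(1024, 1024, rank=256)
# ... early phase: train at a generous rank until convergence stabilizes ...
layer.truncate_rank(32)  # later phase: shrink rank to hit a compression target
# ... then fine-tune at the reduced rank ...
```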
In a six-encoder Transformer model, CoMERA achieved compression ratios of up to 43 times in its early phase and as high as 361 times in its later optimization phase. Compared with GaLore, it also reduced memory consumption by 9 times and trained 2-3 times faster per epoch.
Multiple test results demonstrate CoMERA's outstanding performance.
When training Transformer models on the MNLI dataset, CoMERA reduced the model size from 256 MB to as little as 3.2 MB (an 80-fold reduction) while maintaining accuracy. In large-scale recommendation systems such as DLRM, it compressed the model by 99 times and cut peak memory usage by 7 times. The framework also excelled at pre-training CodeBERT, a domain-specific large language model, achieving an overall compression ratio of 4.23 times and doubling the speed in certain training phases. These results highlight its ability to handle diverse tasks and architectures, extending its applicability across fields.
Key Advantages of the CoMERA Framework Summarized
The main conclusions of this research are as follows:
CoMERA achieved a compression ratio of up to 361 times for specific layers and 99 times for the entire model, significantly reducing storage and memory requirements.
The framework reduced the training time per epoch for Transformers and recommendation systems by 2-3 times, saving computational resources and time.
By using tensorized representations and CUDA graphs, CoMERA reduced peak memory consumption by 7 times, making it feasible to train on smaller GPUs.
CoMERA's approach supports various architectures, including Transformers and large language models, while maintaining or improving accuracy.
By lowering the energy and resources required for training, CoMERA contributes to more sustainable AI practices and enables a broader audience to access cutting-edge models.