In the rapidly evolving field of generative AI, the Nous Research team is running a unique experiment: pre-training a 1.5-billion-parameter large language model (LLM) on machines distributed around the globe, avoiding the centralized training runs typically conducted in expensive, power-hungry data centers or superclusters.
Nous Research is also live-streaming the pre-training run on its dedicated website, distro.nousresearch.com, showing the model's performance on various evaluation benchmarks in real time alongside a map of the participating hardware, which spans multiple sites across the United States and Europe. As of this article's publication, approximately 57 hours (roughly 2.4 days) of pre-training remain, with over 75% of the run already completed.
Pre-training is the first and most fundamental step in building an LLM: the model is trained on a vast amount of text to learn the statistical properties and structure of language. During this phase, it captures patterns, grammar, and contextual relationships between words by processing a wide-ranging text dataset. This equips the model with a broad understanding of language, enabling it to generate coherent text and perform various language-related tasks. After pre-training, the model still needs to be fine-tuned for specific tasks or domains.
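At its core, this phase is next-token prediction: the model repeatedly predicts each token from the tokens before it and is penalized with a cross-entropy loss. The snippet below is a minimal, illustrative sketch of one such training step in PyTorch; the tiny model, random token IDs, and hyperparameters are assumptions for demonstration only, not Nous Research's actual setup.

```python
# Illustrative sketch of one pre-training step (next-token prediction).
# The model, data, and sizes are toy stand-ins, not Nous Research's stack.
import torch
import torch.nn as nn

vocab_size, seq_len, batch_size, d_model = 1000, 32, 4, 64

class TinyLM(nn.Module):
    """A tiny causal language model: embedding -> one transformer layer -> vocab logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask so each position can only attend to earlier positions.
        length = tokens.size(1)
        causal_mask = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
        hidden = self.block(self.embed(tokens), src_mask=causal_mask)
        return self.head(hidden)

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Random token IDs stand in for a text corpus; the target is the input shifted by one.
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```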
If the plan succeeds, Nous Research will have demonstrated that cutting-edge LLMs can be trained without expensive superclusters or low-latency interconnects, marking a new era in distributed AI training. This open-source approach to training could reshape the power dynamics of generative AI, making small teams and non-corporate actors more competitive in the field.
The new technology Nous is using is called Nous DisTrO (Distributed Training Over-the-Internet), designed to reduce the communication bandwidth required between GPUs during pre-training. According to Nous Research's latest release, DisTrO can cut communication requirements by up to 10,000 times while maintaining competitive convergence rates and loss curves, even over slower, more affordable internet connections.
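For a sense of scale, consider what a 10,000-fold reduction would mean for a model of this size. The back-of-the-envelope arithmetic below assumes full fp32 gradients and one complete synchronization per step; both are simplifying assumptions of ours, not figures published by Nous Research.

```python
# Back-of-the-envelope estimate (assumptions: fp32 gradients, one full sync per step).
params = 1.5e9                       # 1.5 billion parameters
naive_bytes = params * 4             # ~6 GB exchanged per step with a full fp32 gradient sync
distro_bytes = naive_bytes / 10_000  # ~0.6 MB per step at the claimed 10,000x reduction

print(f"full gradient sync: {naive_bytes / 1e9:.1f} GB per step")
print(f"at 10,000x less:    {distro_bytes / 1e6:.1f} MB per step")
```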
The core breakthrough of DisTrO lies in compressing the data exchanged between GPUs without compromising the model's performance. The technology builds on Nous's earlier Decoupled Momentum Optimization (DeMo) work, which likewise aimed to significantly reduce inter-GPU communication while maintaining training performance.
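One generic way to shrink inter-GPU traffic is to transmit only a small, carefully chosen fraction of each update, as in the top-k sparsification sketch below. This illustrates the general compression idea only; it is not DeMo's or DisTrO's actual algorithm, and the tensor size and 0.1% keep-ratio are arbitrary assumptions.

```python
# Generic top-k sparsification sketch: send only the largest-magnitude entries.
# Illustrative only -- not the DeMo/DisTrO algorithm.
import math
import torch

def compress_topk(tensor: torch.Tensor, ratio: float = 1e-3):
    """Keep only the largest-magnitude entries; return their indices and values."""
    flat = tensor.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]

def decompress_topk(indices, values, shape):
    """Rebuild a dense tensor from the sparse (indices, values) payload."""
    flat = torch.zeros(math.prod(shape), dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)

grad = torch.randn(1_000_000)                    # stand-in for one worker's gradient/momentum shard
idx, vals = compress_topk(grad)                  # keep ~0.1% of the entries
approx = decompress_topk(idx, vals, grad.shape)  # what the receiving side would reconstruct

dense_bytes = grad.numel() * grad.element_size()
sparse_bytes = idx.numel() * idx.element_size() + vals.numel() * vals.element_size()
print(f"dense payload: {dense_bytes:,} bytes | sparse payload: {sparse_bytes:,} bytes")
```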
On the hardware side, the pre-training run is supported by several well-known partners, including Oracle, Lambda Labs, Northern Data Group, Crusoe Cloud, and Andromeda Cluster, which collectively provide the heterogeneous hardware needed to test DisTrO's capabilities in a real distributed environment.
Blog entry: https://nousresearch.com/
Highlights:
🌐 Nous Research is running a globally distributed AI training effort to pre-train a 1.5-billion-parameter large language model.
💻 Nous DisTrO technology dramatically reduces inter-GPU communication bandwidth requirements, making low-cost training feasible.
🤝 The project is backed by multiple hardware partners, advancing distributed AI research.