Earlier this year, Google introduced Trillium, its sixth-generation TPU and the most powerful to date. Today, Trillium is generally available to Google Cloud customers.
Google trained Gemini 2.0, its most capable AI model to date, on Trillium TPUs. Businesses and startups can now use the same powerful, efficient, and sustainable infrastructure.
The Core of the AI Supercomputer: Trillium TPU
Trillium TPU is a key component of the Google Cloud AI Hypercomputer. The AI Hypercomputer is a groundbreaking supercomputer architecture that integrates performance-optimized hardware, open software, leading ML frameworks, and flexible consumption models. With the official launch of Trillium TPU, Google has also made significant enhancements to the open software layer of the AI Hypercomputer, including optimizations to the XLA compiler and popular frameworks such as JAX, PyTorch, and TensorFlow to achieve leading cost-effectiveness in AI training, tuning, and serving.
Additionally, features such as host offloading, which uses large host DRAM to supplement high-bandwidth memory (HBM), deliver higher efficiency. The AI Hypercomputer lets you extract maximum value from deployments of over 100,000 Trillium chips per Jupiter network fabric, which provides 13 petabits per second of bisection bandwidth and allows a single distributed training job to scale to hundreds of thousands of accelerators.
Customers such as AI21 Labs are already using Trillium to deliver meaningful AI solutions to their own customers faster:
Barak Lenz, CTO of AI21 Labs, stated: “At AI21, we continuously strive to improve the performance and efficiency of our Mamba and Jamba language models. As long-term users of TPU v4, we are impressed with the capabilities of Google Cloud's Trillium. The advancements in scale, speed, and cost efficiency are remarkable. We believe Trillium will play a crucial role in accelerating the development of our next-generation complex language models, enabling us to deliver more powerful and accessible AI solutions to our customers.”
Significant Performance Boosts with Trillium, Setting Multiple Records
Trillium has made significant improvements compared to the previous generation in the following areas:
Training performance increased by over 4 times
Inference throughput improved by 3 times
Energy efficiency increased by 67%
Peak computing performance per chip increased by 4.7 times
High Bandwidth Memory (HBM) capacity doubled
Inter-chip connectivity (ICI) bandwidth doubled
Over 100,000 Trillium chips per Jupiter network fabric
Training performance improved by 2.5 times per dollar, and inference performance improved by 1.4 times per dollar
These enhancements allow Trillium to excel across various AI workloads, including:
Scaling AI training workloads
Training LLMs, including dense models and mixture of experts (MoE) models
Inference performance and collection scheduling
Embedding-intensive models
Providing cost-effective training and inference
How Does Trillium Excel Across Different Workloads?
Scaling AI Training Workloads
Training large models like Gemini 2.0 requires vast amounts of data and computation. Trillium's near-linear scaling significantly speeds up training of these models by distributing the workload efficiently across many Trillium hosts, which are connected by a high-speed inter-chip interconnect (ICI) within each 256-chip pod and by our state-of-the-art Jupiter data center network across pods. This is made possible by TPU Multislice technology and further optimized by Titanium, a system of dynamic data-center-wide offloads that ranges from host adapters to the network fabric.
When pre-training gpt3-175b, Trillium achieved 99% scaling efficiency on 12 pods (3072 chips) and 94% scaling efficiency on 24 pods (6144 chips), even when operating across the data center network.
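As a rough illustration of how a scaling-efficiency figure like the ones above is computed, the Python sketch below divides measured throughput at N pods by the throughput that perfect linear scaling from a baseline would give. The single-pod baseline and all throughput values are assumptions chosen for illustration, not Google's benchmark data.

```python
def scaling_efficiency(base_chips, base_throughput, chips, throughput):
    """Measured throughput divided by ideal (perfectly linear) throughput."""
    ideal_throughput = base_throughput * (chips / base_chips)
    return throughput / ideal_throughput


CHIPS_PER_POD = 256  # pod size stated in the article

# Hypothetical throughputs in arbitrary units (for example tokens per second).
# They are picked so the result matches the percentages quoted in the text.
base_throughput = 1000.0        # assumed baseline: one 256-chip pod
throughput_12_pods = 11880.0    # assumed measurement on 12 pods
throughput_24_pods = 22560.0    # assumed measurement on 24 pods

eff_12 = scaling_efficiency(CHIPS_PER_POD, base_throughput,
                            12 * CHIPS_PER_POD, throughput_12_pods)
eff_24 = scaling_efficiency(CHIPS_PER_POD, base_throughput,
                            24 * CHIPS_PER_POD, throughput_24_pods)

print(f"12 pods ({12 * CHIPS_PER_POD} chips): {eff_12:.0%}")
print(f"24 pods ({24 * CHIPS_PER_POD} chips): {eff_24:.0%}")
```

Under this definition, 94% efficiency on 6144 chips means the job runs about as fast as roughly 5775 perfectly scaled chips would.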
Training LLMs, Including Dense Models and Mixture of Experts (MoE) Models
LLMs like Gemini are powerful and complex, with billions of parameters. Training such dense LLMs requires tremendous computing power together with co-designed software optimizations. Trillium is 4 times faster than the previous-generation Cloud TPU v5e when training dense LLMs such as Llama-2-70b and gpt3-175b.
Beyond dense LLMs, training LLMs with a mixture of experts (MoE) architecture is an increasingly popular approach. An MoE model combines multiple "expert" neural networks, each specializing in a different aspect of an AI task. Managing and coordinating these experts during training adds complexity compared to training a single monolithic model. Trillium is 3.8 times faster than the previous-generation Cloud TPU v5e when training MoE models.
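To make the "multiple experts" idea concrete, here is a minimal pure-Python sketch of generic top-k gating: a router scores the experts, only the k best are evaluated, and their outputs are mixed by renormalized gate weights. The toy experts, the logits, and the function names are all hypothetical. This is not the architecture of the MoE models benchmarked on Trillium.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_logits, experts, top_k=2):
    """Route input x to the top_k experts and mix their outputs."""
    probs = softmax(gate_logits)
    # Indices of the top_k experts by gate probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    # Only the chosen experts run; the rest are skipped entirely.
    output = sum((probs[i] / norm) * experts[i](x) for i in chosen)
    return output, chosen


# Toy "experts": in a real model each would be a feed-forward network.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x, lambda x: -x]
# In a real model these logits come from a learned router; here they are made up.
gate_logits = [0.1, 2.0, 2.0, -1.0]

output, chosen = moe_forward(3.0, gate_logits, experts, top_k=2)
print(chosen, output)
```

Because only k of the experts run per input, compute per token stays well below that of a dense model with the same total parameter count. The routing and cross-device exchange of tokens are part of the coordination cost the text refers to.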
Moreover, Trillium provides 3 times the host dynamic random-access memory (DRAM) of Cloud TPU v5e. This allows some computation to be offloaded to the host, helping to maximize performance and goodput at scale. Trillium's host offloading delivered an improvement of over 50% in model FLOPs utilization (MFU) when training the Llama-3.1-405B model.
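For readers unfamiliar with MFU, the sketch below shows one common way to estimate it: achieved model FLOPs per second divided by the hardware's aggregate peak. It uses the widely used approximation of about 6 FLOPs per parameter per token for dense transformer training. The chip count, peak FLOP/s, and token rates are hypothetical placeholders, not Trillium specifications or measured results, and the 50% gain is read here as a relative increase.

```python
def model_flops_utilization(params, tokens_per_second, num_chips, peak_flops_per_chip):
    """Estimate MFU for dense transformer training.

    Uses the common ~6 * params FLOPs-per-token approximation
    (forward plus backward pass).
    """
    achieved_flops = 6.0 * params * tokens_per_second
    peak_flops = num_chips * peak_flops_per_chip
    return achieved_flops / peak_flops


PARAMS = 405e9                 # Llama-3.1-405B parameter count
NUM_CHIPS = 1024               # hypothetical deployment size
PEAK_FLOPS_PER_CHIP = 1.0e15   # hypothetical placeholder, NOT an official spec

# Hypothetical token rates without and with host offloading.
mfu_base = model_flops_utilization(PARAMS, 100_000, NUM_CHIPS, PEAK_FLOPS_PER_CHIP)
mfu_offload = model_flops_utilization(PARAMS, 150_000, NUM_CHIPS, PEAK_FLOPS_PER_CHIP)

print(f"baseline MFU:       {mfu_base:.3f}")
print(f"with offloading:    {mfu_offload:.3f}")
print(f"relative MFU gain:  {mfu_offload / mfu_base - 1.0:.0%}")
```

On fixed hardware, MFU is proportional to training throughput, so a 50% relative MFU gain corresponds to processing 50% more tokens per second.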
Inference Performance and Collection Scheduling
Multi-step reasoning is increasingly important at inference time, so accelerators must handle growing computational demands efficiently. Trillium offers significant advancements for inference workloads, enabling faster and more efficient deployment of AI models. In fact, Trillium provides our best TPU inference performance for image diffusion and dense LLMs. Our tests show that relative inference throughput (images per second) for Stable Diffusion XL is over 3 times higher than on Cloud TPU v5e, while relative inference throughput (tokens per second) for Llama2-70B is nearly 2 times higher.
Trillium is our highest-performing TPU for offline and server inference use cases. For Stable Diffusion XL, relative offline inference throughput (images per second) is 3.1 times higher than on Cloud TPU v5e, and relative server inference throughput is 2.9 times higher.
In addition to better performance, Trillium introduces a new collection scheduling capability. This feature allows Google's scheduling systems to make intelligent job-scheduling decisions that improve the overall availability and efficiency of inference workloads when a collection contains multiple replicas. It provides a way to manage multiple TPU slices running a single-host or multi-host inference workload, including through Google Kubernetes Engine (GKE). Grouping these slices into a collection makes it easy to adjust the number of replicas to match demand.
Embedding-Intensive Models
With the addition of the third-generation SparseCore, Trillium doubles the performance of embedding-intensive models and improves DLRM DCNv2 performance by 5 times.
SparseCore is a dataflow processor that provides a more adaptable architectural foundation for embedding-intensive workloads. Trillium's third-generation SparseCore excels at accelerating dynamic and data-dependent operations such as scatter-gather, sparse segment summation, and partitioning.
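To illustrate the kind of data-dependent operations named above, the following pure-Python sketch performs an embedding lookup (a gather) followed by a segment sum that pools the looked-up rows per example. It only shows the semantics of the operations. The table, IDs, and function names are made up, and nothing here reflects how SparseCore implements them in hardware.

```python
def gather(table, ids):
    """Look up one embedding row per ID (data-dependent memory access)."""
    return [table[i] for i in ids]


def segment_sum(rows, segment_ids, num_segments):
    """Sum rows that share a segment ID into one output row per segment."""
    dim = len(rows[0])
    out = [[0.0] * dim for _ in range(num_segments)]
    for row, seg in zip(rows, segment_ids):
        for d in range(dim):
            out[seg][d] += row[d]
    return out


# Hypothetical 4-row embedding table with 2-dimensional embeddings.
embedding_table = [
    [0.0, 1.0],
    [1.0, 0.0],
    [2.0, 2.0],
    [0.5, 0.5],
]

# Two examples with a variable number of sparse feature IDs each:
# example 0 has IDs [0, 2, 3]; example 1 has IDs [1, 1].
ids = [0, 2, 3, 1, 1]
segment_ids = [0, 0, 0, 1, 1]

pooled = segment_sum(gather(embedding_table, ids), segment_ids, num_segments=2)
print(pooled)  # [[2.5, 3.5], [2.0, 0.0]]
```

The access pattern depends entirely on the input IDs, which is why such workloads tend to be bound by memory access rather than by dense matrix math.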
Providing Cost-Effective Training and Inference
Beyond the absolute performance and scale needed to train some of the world's largest AI workloads, Trillium is designed to optimize performance per dollar. To date, when training dense LLMs such as Llama2-70b and Llama3.1-405b, Trillium's performance per dollar is 2.1 times higher than that of Cloud TPU v5e and 2.5 times higher than that of Cloud TPU v5p.
Trillium excels at processing large models cost-effectively. It is designed to let researchers and developers serve powerful, efficient image models at a significantly lower cost than before. Generating one thousand images with SDXL on Trillium costs 27% less than on Cloud TPU v5e for offline inference and 22% less for server inference.
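The relationship between throughput and cost per image can be sketched with simple arithmetic. The Python example below assumes that cost per image equals the hourly chip price divided by images per hour, and that the 27% and 22% savings correspond to the 3.1x and 2.9x SDXL throughput figures quoted earlier. The implied price ratio it prints is a back-of-the-envelope inference under those assumptions, not a published price.

```python
def relative_cost_per_image(price_ratio, throughput_ratio):
    """Cost per image on the new chip relative to the old chip.

    Assumes cost per image = hourly price / images per hour.
    """
    return price_ratio / throughput_ratio


def implied_price_ratio(cost_reduction, throughput_ratio):
    """Hourly price ratio (new / old) implied by a cost reduction and a speedup."""
    return (1.0 - cost_reduction) * throughput_ratio


# Figures quoted in the article for SDXL versus Cloud TPU v5e.
offline_ratio = implied_price_ratio(cost_reduction=0.27, throughput_ratio=3.1)
server_ratio = implied_price_ratio(cost_reduction=0.22, throughput_ratio=2.9)

print(f"implied price ratio (offline): {offline_ratio:.3f}")
print(f"implied price ratio (server):  {server_ratio:.3f}")

# Round trip: plugging the implied ratio back in recovers the quoted saving.
saving = 1.0 - relative_cost_per_image(offline_ratio, 3.1)
print(f"offline saving recovered: {saving:.0%}")
```

Under these assumptions both scenarios imply nearly the same hourly price ratio, which suggests the quoted savings and throughput figures are mutually consistent: a chip can cost more per hour and still be cheaper per image if its throughput gain is larger.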
Elevating AI Innovation to New Heights
Trillium represents a significant leap in Google Cloud's AI infrastructure, offering exceptional performance, scalability, and efficiency for a variety of AI workloads. With its ability to scale to hundreds of thousands of chips using world-class co-designed software, Trillium enables faster breakthroughs and the delivery of superior AI solutions. Furthermore, Trillium's outstanding cost-effectiveness makes it an economical choice for organizations looking to maximize the value of their AI investments. As the AI landscape continues to evolve, Trillium demonstrates Google Cloud's commitment to providing cutting-edge infrastructure that helps businesses unlock the full potential of AI.
Official introduction: https://cloud.google.com/blog/products/compute/trillium-tpu-is-ga