In AI training, Nvidia's GPUs are virtually unchallenged, but in AI inference, competitors appear to be catching up, especially on energy efficiency. Nvidia's latest Blackwell chips are powerful, but whether the company can keep its lead remains to be seen.

Today, MLCommons announced the latest round of AI inference results, MLPerf Inference v4.1. This round saw the first submissions built on AMD's Instinct accelerators, Google's Trillium accelerators, chips from Canadian startup UntetherAI, and Nvidia's Blackwell chips. Two other companies, Cerebras and FuriosaAI, announced new inference chips but did not submit them for MLPerf testing.


MLPerf is structured like an Olympic competition, with many events and sub-events. The "Data Center Closed" category drew the most entrants. Unlike the Open category, the Closed category requires participants to run inference on a given model as-is, without significant software modification. The Data Center category mainly tests the ability to process queries in batches, while the Edge category focuses on minimizing latency.

Each category includes nine different benchmark tests, covering various AI tasks, including popular ones like image generation (think Midjourney) and large language model question-answering (such as ChatGPT), as well as important but lesser-known tasks like image classification, object detection, and recommendation engines.

This round introduced a new benchmark: a "mixture of experts" model. This is an increasingly popular way to deploy language models, in which a single model is split into several independent smaller models, each fine-tuned for a particular task such as everyday conversation, solving math problems, or coding assistance. By routing each query to the appropriate small model, each query uses fewer resources, which lowers cost and increases throughput, according to AMD senior technologist Miroslav Hodak.
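To make the routing idea concrete, here is a minimal, purely illustrative sketch of MoE-style dispatch in Python. The keyword-based router and the three toy "experts" are invented for illustration; a real mixture-of-experts model uses a learned gating network inside the transformer, not hand-written rules.

```python
# Illustrative sketch of mixture-of-experts routing (not the MLPerf benchmark itself):
# a router sends each query to one specialized sub-model, so only that sub-model runs.
from typing import Callable, Dict

# Hypothetical expert sub-models, each fine-tuned for one kind of task.
experts: Dict[str, Callable[[str], str]] = {
    "chat": lambda q: f"[chat expert] {q}",
    "math": lambda q: f"[math expert] {q}",
    "code": lambda q: f"[code expert] {q}",
}

def route(query: str) -> str:
    """Toy router; a real MoE uses a learned gating network, not keyword rules."""
    q = query.lower()
    if any(tok in q for tok in ("integral", "solve", "equation")):
        return "math"
    if any(tok in q for tok in ("python", "bug", "function")):
        return "code"
    return "chat"

def answer(query: str) -> str:
    expert = route(query)          # only one sub-model handles the query,
    return experts[expert](query)  # which is what cuts cost and raises throughput

print(answer("Solve the equation x^2 = 4"))
print(answer("Why is my Python function slow?"))
```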


In the popular "Data Center Closed" benchmark, the winners were submissions based on Nvidia H200GPU and GH200 superchips, which combine GPU and CPU in a single package. However, a closer analysis reveals some interesting details. Some participants used multiple accelerators, while others used just one. If we normalize queries per second by the number of accelerators and keep only the best submissions for each accelerator type, the results become more ambiguous. It should be noted that this method ignores the role of CPUs and interconnects.

On a per-accelerator basis, Nvidia's Blackwell led the large language model question-answering task, the only benchmark it was submitted to, running 2.5 times faster than the previous chip generation. Untether AI's speedAI240 preview chip performed almost on par with the H200 in its only submitted task, image recognition. Google's Trillium came in slightly below the H100 and H200 on image generation, while AMD's Instinct matched the H100 on the large language model question-answering task.

Blackwell's success stems in part from its ability to run the large language model at 4-bit floating-point precision. Nvidia and its competitors have been working to reduce the number of bits used to represent data in transformer models such as ChatGPT in order to speed up computation. Nvidia introduced 8-bit math with the H100, and this submission marks the first demonstration of 4-bit math in an MLPerf benchmark.
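The idea behind 4-bit floating point can be pictured with a generic rounding sketch: scale a tensor so its largest value lands on a tiny grid of representable numbers, then snap every weight to the nearest grid point. The e2m1-style grid below is an assumption for illustration; this is not Nvidia's FP4 recipe, only a picture of why careful scaling is needed to preserve accuracy.

```python
# Generic low-precision rounding sketch (not Nvidia's FP4 implementation).
import numpy as np

# Magnitudes of an e2m1-style 4-bit float, assumed here for illustration.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_GRID[::-1], FP4_GRID])  # add the negative half of the grid

def quantize_fp4(weights: np.ndarray):
    """Snap each weight to the nearest representable 4-bit value, with a per-tensor scale."""
    scale = np.abs(weights).max() / FP4_GRID.max()          # map the largest weight onto the grid
    idx = np.abs(weights[:, None] / scale - FP4_GRID[None, :]).argmin(axis=1)
    return FP4_GRID[idx], scale

weights = np.random.randn(8).astype(np.float32)
q, scale = quantize_fp4(weights)
print("original   :", np.round(weights, 3))
print("dequantized:", np.round(q * scale, 3))  # fewer bits, coarser values -> accuracy must be managed
```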

The biggest challenge in using such low-precision numbers is maintaining accuracy, according to Nvidia's product marketing director Dave Salvator. To maintain high accuracy in the MLPerf submission, the Nvidia team made significant innovations in software.

Additionally, Blackwell's memory bandwidth nearly doubled, reaching 8 terabytes per second, compared to 4.8 terabytes per second for H200.

Nvidia's Blackwell submission used a single chip, but Salvator said it is designed for networking and scale and will perform best when paired with Nvidia's NVLink interconnect. A Blackwell GPU supports up to 18 NVLink connections of 100 gigabytes per second each, for a total bandwidth of 1.8 terabytes per second, roughly double the H100's interconnect bandwidth.
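The interconnect numbers quoted above follow from simple arithmetic; the H100 comparison assumes its published 900-gigabyte-per-second aggregate NVLink bandwidth.

```python
# Aggregate NVLink bandwidth implied by the figures above.
links = 18
per_link_gb_s = 100                        # GB/s per NVLink connection
blackwell_total = links * per_link_gb_s    # 1,800 GB/s = 1.8 TB/s
h100_total = 900                           # GB/s, H100's published aggregate NVLink bandwidth (assumption)
print(f"Blackwell: {blackwell_total} GB/s ({blackwell_total / 1000:.1f} TB/s)")
print(f"vs. H100:  {blackwell_total / h100_total:.1f}x")
```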


Salvator believes that as the scale of large language models continues to grow, even inference will require multi-GPU platforms to meet demand, and Blackwell is designed for this scenario. "Blackwell is a platform," Salvator said.

Nvidia submitted its Blackwell chip system to the preview subcategory, meaning it is not yet on the market but is expected to be available before the next MLPerf release, about six months from now.

For each benchmark, MLPerf also includes an energy-measurement counterpart that systematically tests the power each system actually draws while performing its tasks. The main event of this round (the Data Center Closed energy category) had only two entrants, Nvidia and Untether AI. While Nvidia entered every benchmark, Untether submitted results only for the image recognition task.


Untether AI excelled here, achieving remarkable energy efficiency. Its chips use an approach called at-memory computing: the chip is built from banks of memory elements with small processors placed directly beside them. The processors work in parallel, each operating on the data in its adjacent memory units, which drastically cuts the time and energy spent shuttling model data between memory and compute cores.

"We found that 90% of the energy consumption for AI workloads is in moving data from DRAM to the cache processing unit," said Robert Beachler, Vice President of Product at Untether AI. "So, Untether's approach is to move the computation to the data, rather than moving the data to the computing unit."

This approach performed particularly well in another MLPerf subcategory: Edge Closed. This category targets more on-the-ground use cases such as machine inspection on the factory floor, vision-guided robots, and autonomous vehicles, applications with strict demands on energy efficiency and fast processing, Beachler explained.

In the image recognition task, Untether AI's speedAI240 preview chip delivered 2.8 times lower latency than Nvidia's L40S and 1.6 times higher throughput (samples per second). The startup also submitted power-consumption results in this category, but its Nvidia-based rivals did not, so a direct comparison is hard to make. Still, Untether AI's speedAI240 preview chip carries a nominal power draw of 150 watts versus 350 watts for Nvidia's L40S, a 2.3-fold nominal power advantage on top of the better latency.
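The 2.3-fold figure is just the ratio of the nominal power numbers; combined with the throughput ratio, it also implies a rough samples-per-joule advantage, with the caveat that nominal power is not the same as measured draw under load.

```python
# Ratios implied by the figures quoted above (nominal power, not measured MLPerf energy results).
l40s_power_w, speedai_power_w = 350, 150
throughput_ratio = 1.6                                    # speedAI240 vs. L40S, samples/s (from the text)
power_ratio = l40s_power_w / speedai_power_w
print(f"nominal power ratio:       {power_ratio:.1f}x")                     # ~2.3x
print(f"implied samples per joule: {throughput_ratio * power_ratio:.1f}x")  # rough estimate only
```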

Although Cerebras and Furiosa did not participate in MLPerf, they also announced new chips. Cerebras unveiled its inference service at the IEEE Hot Chips conference at Stanford University. The Sunnyvale, California, company makes giant chips, as large as a silicon wafer allows, which avoids inter-chip interconnects and vastly increases the device's memory bandwidth; they are primarily used to train giant neural networks. The company has now upgraded its latest computer, the CS-3, to handle inference as well.

Although Cerebras did not submit to MLPerf, the company claims that its platform outperforms H100 by seven times and competitor Groq chips by two times in terms of the number of LLM tokens generated per second. "Today, we are in the dial-up era of generative AI," said Andrew Feldman, CEO and co-founder of Cerebras. "This is all due to a memory bandwidth bottleneck. Whether it's Nvidia's H100, AMD's MI300, or TPU, they all use the same external memory, leading to the same limitations. We broke this barrier because we use a wafer-scale design."

At the Hot Chips conference, Seoul-based Furiosa also showcased its second-generation chip, RNGD (pronounced "renegade"). The new chip is distinguished by its Tensor Contraction Processor (TCP) architecture. In AI workloads, the fundamental mathematical operation is matrix multiplication, usually implemented as a primitive in hardware. But the size and shape of the matrices, or more generally the tensors, can vary widely. RNGD instead implements the more general tensor contraction as a primitive. "In inference, batch sizes vary widely, so it is crucial to fully exploit the inherent parallelism and data reuse of a given tensor shape," said Furiosa founder and CEO June Paik at Hot Chips.
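A matrix multiplication is just one special case of a tensor contraction, which is the generality the TCP design is after. The einsum sketch below illustrates that relationship in NumPy; it is an analogy, not Furiosa's hardware primitive.

```python
# Matrix multiplication as a special case of the more general tensor contraction.
import numpy as np

A = np.random.randn(4, 8)
B = np.random.randn(8, 5)
assert np.allclose(np.einsum("ik,kj->ij", A, B), A @ B)  # plain matmul expressed as a contraction

# The same primitive handles a batched, transformer-style shape: contracting a
# (batch, heads, seq, head_dim) activation tensor with a (heads, head_dim, out_dim) weight tensor.
X = np.random.randn(2, 3, 16, 8)       # batch, heads, sequence, head_dim
W = np.random.randn(3, 8, 10)          # heads, head_dim, out_dim
Y = np.einsum("bhsd,hdo->bhso", X, W)  # one contraction, arbitrary tensor shapes
print(Y.shape)                         # (2, 3, 16, 10)
```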

Although Furiosa did not participate in MLPerf, the company ran the MLPerf LLM summarization benchmark internally and found that RNGD performed comparably to Nvidia's L40S chip while drawing only 185 watts, versus 320 watts for the L40S. Paik said performance will improve further with additional software optimization.

IBM also announced its new Spyre chip, designed specifically for enterprise generative AI workloads, expected to be available in the first quarter of 2025.

Clearly, the AI inference chip market will be bustling in the foreseeable future.

Reference: https://spectrum.ieee.org/new-inference-chips