In an era where smart devices are ubiquitous, we increasingly want smartphones, tablets, and even smart home devices to process information intelligently. However, the hardware resources of these edge devices are limited, particularly memory and computational power, which restricts deploying and running large language models (LLMs) on them. Imagine how our world might change if these devices could harness powerful models that understand natural language, answer questions, and even take on creative tasks.


This is the backdrop against which T-MAC was born. T-MAC, short for "Table-Lookup-based MAC" (where MAC stands for multiply-accumulate), is a lookup-table-based method that lets low-bit large language models run efficiently on CPUs, paving the way for intelligent upgrades on edge devices.

Large language models typically contain billions or even hundreds of billions of parameters, which require substantial memory to store. To deploy these models on edge devices, the model weights must be quantized, that is, represented with fewer bits, to reduce the memory footprint. A quantized model, however, must perform mixed-precision matrix multiplication (mpGEMM) at run time, multiplying low-bit weights by higher-precision activations, an operation that existing hardware and software stacks do not support efficiently.
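As a rough illustration of what quantization buys, the sketch below rounds float32 weights to signed 4-bit integers with one shared scale per group of 32, cutting storage roughly 8x. The helper names (`quantize_group`, `dequantize`) and the grouping scheme are illustrative assumptions, not the exact scheme used by T-MAC:

```python
import numpy as np

def quantize_group(w, bits=4, group_size=32):
    """Symmetric per-group quantization: each group of `group_size` weights
    shares one float scale, and values round to signed `bits`-bit integers.
    Hypothetical helper for illustration, not T-MAC's actual scheme."""
    qmax = 2 ** (bits - 1) - 1
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                      # avoid division by zero
    q = np.clip(np.round(g / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    """Recover approximate float weights from ints and per-group scales."""
    return (q * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_group(w, bits=4)
w_hat = dequantize(q, scale, w.shape)
# 4-bit storage is ~8x smaller than float32, at the cost of rounding error
max_err = np.abs(w - w_hat).max()
```

The rounding error per weight is bounded by half the group's scale, which is the trade-off mpGEMM must then handle: the stored integers are 4-bit, but the activations they multiply remain higher precision.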


The core idea of T-MAC is to replace traditional data-type-based multiplication with bit-level lookup table (LUT) lookups. This not only eliminates multiplications but also reduces the number of additions, significantly improving computational efficiency.

Specifically, T-MAC achieves this through the following steps:

1. Decompose the weight matrix into multiple one-bit matrices.

2. Precompute the products of the activation vector with all possible one-bit patterns and store the results in a lookup table.

3. During inference, obtain the final matrix multiplication result through table indexing and accumulation.
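Under simplifying assumptions (unsigned weights, no per-group scales), the steps above can be sketched in plain NumPy. The function names and the group size `g` are illustrative choices, a sketch of the idea rather than T-MAC's optimized kernels:

```python
import numpy as np

def decompose_bits(W_q, bits):
    """Step 1: split a matrix of unsigned b-bit integer weights into
    b one-bit planes; plane t holds bit t of every weight (0 or 1)."""
    return [((W_q >> t) & 1).astype(np.int64) for t in range(bits)]

def build_table(xg):
    """Step 2: precompute the dot product of activation subvector xg with
    every possible one-bit pattern of length g (bit set -> include xg[t])."""
    g = len(xg)
    table = np.zeros(2 ** g)
    for p in range(2 ** g):
        table[p] = sum(xg[t] for t in range(g) if (p >> t) & 1)
    return table

def lut_gemv(W_q, x, bits=2, g=4):
    """Step 3: y = W_q @ x with no weight-activation multiplications in the
    inner loop; each one-bit plane is resolved by table lookups, and planes
    are combined with power-of-two shifts. Illustrative sketch only."""
    M, K = W_q.shape
    assert K % g == 0
    planes = decompose_bits(W_q, bits)
    y = np.zeros(M)
    for j in range(0, K, g):
        table = build_table(x[j:j + g])        # one table per activation group
        for t, plane in enumerate(planes):
            idx = np.zeros(M, dtype=np.int64)  # pack each row's g bits
            for u in range(g):
                idx |= plane[:, j + u] << u
            y += table[idx] * (1 << t)         # plane t contributes 2^t
    return y

# sanity check against an ordinary matrix multiplication
rng = np.random.default_rng(0)
W_q = rng.integers(0, 4, size=(8, 16))         # 2-bit unsigned weights in [0, 3]
x = rng.standard_normal(16)
y = lut_gemv(W_q, x, bits=2, g=4)
```

Note the payoff: for a group of `g` activations, the `2**g`-entry table is built once and then reused across every row and every bit plane, so the per-weight cost collapses to an index computation and an addition.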

Tests on a range of edge devices show that T-MAC delivers significant performance gains. Compared with the existing llama.cpp implementation, T-MAC quadruples throughput while cutting energy consumption by 70%. Even a low-end device such as the Raspberry Pi 5 can generate tokens faster than an average adult reads.

T-MAC is not only theoretically appealing but also practically useful. Whether for real-time speech recognition and natural language processing on smartphones or for smarter interactions on smart home devices, T-MAC can play a key role.

T-MAC technology offers an efficient, energy-saving solution for deploying low-bit large language models on edge devices. It raises the intelligence of the devices themselves and brings users richer, smoother smart experiences. As the technology continues to mature, there is good reason to believe T-MAC will play an increasingly important role in edge intelligence.

Open Source Address: https://github.com/microsoft/T-MAC

Paper Address: https://www.arxiv.org/pdf/2407.00088