Google's recent release of the Gemma 3 series has excited many AI enthusiasts. Just a month after launch, Google followed up with Quantization-Aware Training (QAT) optimized versions of Gemma 3, significantly reducing memory requirements while maintaining model quality.
Specifically, the QAT-optimized Gemma 3 27B model's VRAM requirement drops dramatically from 54 GB to 14.1 GB, which means users can now run this large model locally on a consumer-grade GPU such as the NVIDIA RTX 3090. Simple tests show that a machine with an RTX 3070 can also run the 12B version of Gemma 3; token output is somewhat slower, but overall performance remains acceptable.
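As a rough back-of-the-envelope check (my own estimate, not Google's published methodology), weight memory scales with the number of bits per parameter, which is where most of the savings come from; the small gap between 13.5 GB and the quoted 14.1 GB presumably comes from quantization scales and other overhead:

```python
# Rough weight-memory estimate for a 27B-parameter model.
# Real usage is higher: KV cache, activations, and runtime overhead are not included.
PARAMS = 27e9

def weight_gb(bits_per_param: float, params: float = PARAMS) -> float:
    """Approximate weight storage in gigabytes."""
    return params * bits_per_param / 8 / 1e9

print(f"BF16 (16-bit): {weight_gb(16):.1f} GB")  # ~54.0 GB
print(f"int4  (4-bit): {weight_gb(4):.1f} GB")   # ~13.5 GB, plus overhead ≈ 14.1 GB
```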
The magic of QAT lies in integrating quantization operations directly into the training process, unlike traditional approaches that quantize only after training. By simulating low-precision computation during training, the model learns weights that lose little quality when they are later quantized to lower precision. Google ran approximately 5,000 steps of QAT and cut the perplexity degradation from quantization by 54%, enabling strong performance even on smaller devices.
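To make the idea concrete, here is a minimal PyTorch-style sketch of the core QAT mechanism: weights are "fake-quantized" (quantized and immediately dequantized) in the forward pass, while a straight-through estimator lets gradients pass through the rounding step. This is a generic illustration, not Google's actual training setup; the 4-bit width, per-tensor symmetric scaling, and layer choice are all assumptions made for the example.

```python
import torch
import torch.nn as nn

def fake_quant(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Quantize-dequantize a tensor (per-tensor symmetric integer quantization).

    The straight-through estimator (w + (q - w).detach()) makes the rounding
    step act like the identity for gradients, so the model trains while
    "seeing" low-precision weights in the forward pass.
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = w.abs().max().clamp(min=1e-8) / qmax  # per-tensor scale
    q = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (q - w).detach()                   # straight-through estimator

class QATLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized during training."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, fake_quant(self.weight), self.bias)

# Toy usage: a few training steps with quantization simulated in the forward pass.
layer = QATLinear(16, 16)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
for _ in range(10):
    x = torch.randn(8, 16)
    loss = layer(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```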
Different versions of Gemma 3 can now run on a range of GPUs. For example, Gemma 3 27B runs locally on a single NVIDIA RTX 3090 (24 GB VRAM), while Gemma 3 12B performs well on lighter hardware such as the NVIDIA RTX 4060. The smaller memory footprint lets far more users experience powerful AI capabilities, even on resource-constrained devices such as smartphones.
Google has also partnered with several developer tools to provide a seamless experience: Ollama, LM Studio, and MLX already support the Gemma 3 QAT models. Notably, many users have expressed great excitement and hope that Google will continue to explore even more efficient quantization techniques.
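For instance, with Ollama's official Python client, querying a QAT build takes only a few lines. The model tag below is an assumption on my part; confirm the exact QAT tag in the Ollama model library and pull it first (e.g. `ollama pull gemma3:12b-it-qat`) before running:

```python
# Minimal sketch using the `ollama` Python package (pip install ollama).
# Assumes the Ollama server is running locally and the QAT model has been pulled;
# the tag "gemma3:12b-it-qat" may differ from what your Ollama library lists.
import ollama

response = ollama.chat(
    model="gemma3:12b-it-qat",
    messages=[{"role": "user", "content": "Summarize what QAT does in one sentence."}],
)
print(response["message"]["content"])
```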