FP6-LLM
Efficiently serving large language models
Tags: Programming, Large language models, GPU inference
FP6-LLM is a full-stack solution for serving large language models efficiently. Through six-bit floating-point quantization (FP6), it substantially reduces model size while preserving model quality across a variety of applications. At its core is TC-FPx, the first complete GPU kernel design with unified Tensor Core support for floating-point weights at various quantization bit widths. By integrating the TC-FPx kernel into existing inference systems, FP6-LLM provides new end-to-end support for quantized LLM inference, striking a better trade-off between inference cost and model quality. Experiments show that FP6-LLM can run inference of LLaMA-70b on a single GPU, achieving normalized inference throughput 1.69x to 2.65x higher than the FP16 baseline.
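For intuition, here is a minimal Python sketch of what 6-bit floating-point quantization does to a weight tensor, assuming an E3M2 layout (1 sign, 3 exponent, 2 mantissa bits) and simple round-to-nearest. The helper names (`fp6_e3m2_values`, `quantize_fp6`) and the exponent bias are illustrative assumptions, not the TC-FPx/FP6-LLM implementation, which runs on GPU Tensor Cores and applies additional weight scaling.

```python
import numpy as np

def fp6_e3m2_values(bias=3):
    """Enumerate every value representable in a sign/3-exponent/2-mantissa format."""
    vals = []
    for sign in (1.0, -1.0):
        for e in range(8):           # 3 exponent bits
            for m in range(4):       # 2 mantissa bits
                if e == 0:           # subnormals: 0.m * 2^(1 - bias)
                    v = sign * (m / 4.0) * 2.0 ** (1 - bias)
                else:                # normals: 1.m * 2^(e - bias)
                    v = sign * (1 + m / 4.0) * 2.0 ** (e - bias)
                vals.append(v)
    return np.unique(np.array(vals, dtype=np.float32))

def quantize_fp6(w):
    """Round each weight to the nearest representable FP6 (E3M2) value."""
    grid = fp6_e3m2_values()
    idx = np.abs(w[..., None] - grid).argmin(axis=-1)
    return grid[idx]

w = (np.random.randn(4, 4) * 0.5).astype(np.float32)
print(w)
print(quantize_fp6(w))  # same tensor, snapped to the 6-bit grid
```

Storing each weight in 6 bits instead of 16 cuts weight memory by roughly 2.67x, which is what makes it plausible for a 70B-parameter model to fit on a single GPU.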
FP6-LLM Visits Over Time

Monthly Visits: 19,075,321
Bounce Rate: 45.07%
Pages per Visit: 5.5
Visit Duration: 00:05:32