FP6-LLM

Efficiently serving large language models

FP6-LLM is a new system-level solution for serving large language models efficiently. Through six-bit floating-point quantization (FP6), it substantially reduces model size while preserving model quality across a variety of applications. At its core is TC-FPx, the first full-stack GPU kernel design that uniformly supports floating-point weights at various quantization bit-widths. By integrating the TC-FPx kernel into existing inference systems, FP6-LLM provides end-to-end support for quantized LLM inference, striking a better trade-off between inference cost and model quality. Experiments show that FP6-LLM can serve LLaMA-70b on a single GPU with 1.69x to 2.65x higher normalized inference throughput than the FP16 baseline.
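For intuition on what six-bit quantization does to each weight, here is a minimal NumPy sketch (not from the FP6-LLM codebase; the function names are hypothetical) that snaps FP32 values to the grid of numbers representable in a 1-sign/3-exponent/2-mantissa (e3m2) six-bit format, assuming an exponent bias of 3 and no inf/NaN encodings. The actual TC-FPx kernel additionally handles weight bit-packing and on-the-fly dequantization on Tensor Cores.

import numpy as np

def fp6_e3m2_grid(bias: int = 3) -> np.ndarray:
    """All non-negative values representable with 3 exponent and 2
    mantissa bits (assumed e3m2 layout, no inf/NaN encodings)."""
    vals = []
    for e in range(8):                 # exponent field: 0..7
        for m in range(4):             # mantissa field: 0..3
            if e == 0:                 # subnormal: 0.mm * 2^(1 - bias)
                vals.append((m / 4) * 2.0 ** (1 - bias))
            else:                      # normal: 1.mm * 2^(e - bias)
                vals.append((1 + m / 4) * 2.0 ** (e - bias))
    return np.unique(np.asarray(vals, dtype=np.float32))

def fake_quantize_fp6(x: np.ndarray) -> np.ndarray:
    """Round each element to the nearest FP6 value (sign preserved;
    out-of-range magnitudes clamp to the largest representable, 28.0)."""
    grid = fp6_e3m2_grid()
    mag = np.abs(x).astype(np.float32)
    idx = np.abs(mag[..., None] - grid).argmin(axis=-1)  # nearest grid point
    return np.sign(x) * grid[idx]

w = np.array([0.013, -0.7, 1.9, 40.0], dtype=np.float32)
print(fake_quantize_fp6(w))  # -> [ 0.   -0.75  2.   28. ]

The grid has only 32 distinct magnitudes, which is why bit-width matters so much for memory footprint: each weight occupies 6 bits instead of 16, cutting weight storage by roughly 2.7x relative to FP16.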

FP6-LLM Visit Over Time

Monthly Visits: 17,104,189
Bounce Rate: 44.67%
Pages per Visit: 5.5
Avg. Visit Duration: 00:05:49

[Charts on the original page: FP6-LLM visit trend, visit geography, and traffic sources, plus a list of FP6-LLM alternatives.]