Artificial Intelligence (AI) driven text-to-image (T2I) generation models, such as DALL·E 3 and Adobe Firefly 3, demonstrate exceptional generative capabilities with broad potential in real-world applications. However, these models typically have billions of parameters and require substantial memory, which poses a serious challenge for deployment on resource-constrained platforms such as mobile devices.
To address these challenges, researchers from ByteDance and POSTECH explored extremely low-bit quantization of T2I models. Among the many advanced models available, FLUX.1-dev was chosen as the research target because of its public availability and strong performance. The researchers applied 1.58-bit quantization to compress the vision transformer weights in the FLUX model, restricting each weight to one of just three values: {-1, 0, +1} (hence 1.58 bits, since log2(3) ≈ 1.58). The quantization does not require access to any image data and relies solely on self-supervision from the FLUX.1-dev model itself. Unlike BitNet b1.58, this approach does not train a large language model from scratch; instead, it is a post-training quantization solution for T2I models.
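The article does not spell out the exact quantization recipe, but a common way to obtain ternary weights post-training is absmean scaling in the spirit of BitNet b1.58: divide each weight tensor by its mean absolute value, then round to {-1, 0, +1}. The PyTorch sketch below illustrates this idea; the function names and per-tensor scaling granularity are assumptions made for illustration, not the authors' implementation.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-8):
    """Post-training ternary (1.58-bit) quantization sketch.

    Scales a weight tensor by its mean absolute value (absmean, as in
    BitNet b1.58), then rounds to the three values {-1, 0, +1}.
    Returns the ternary codes and the scale needed to dequantize.
    NOTE: illustrative only; not the authors' exact recipe.
    """
    scale = w.abs().mean().clamp(min=eps)        # per-tensor absmean scale
    q = (w / scale).round().clamp(-1, 1)         # ternary codes in {-1, 0, +1}
    return q.to(torch.int8), scale

def ternary_dequantize(q: torch.Tensor, scale: torch.Tensor):
    """Reconstruct an approximate full-precision tensor: w ≈ q * scale."""
    return q.float() * scale

# Example: quantize a random "weight matrix" and measure the error.
w = torch.randn(256, 256)
q, s = ternary_quantize(w)
w_hat = ternary_dequantize(q, s)
print("unique codes:", q.unique().tolist())       # [-1, 0, 1]
print("relative error:", (w - w_hat).norm() / w.norm())
```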
With this method, the model's storage footprint was reduced by 7.7 times, since the 1.58-bit weights are stored as 2-bit signed integers instead of 16-bit values. To further improve inference efficiency, the researchers developed a custom kernel optimized for low-bit computation, which reduced inference memory usage by more than 5.1 times and also improved inference latency. Evaluations on the GenEval and T2I-CompBench benchmarks showed that 1.58-bit FLUX significantly improves computational efficiency while maintaining generation quality comparable to the full-precision FLUX model.
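Storing each ternary code as a 2-bit signed integer means four weights fit in one byte; the reported 7.7 times figure (rather than a naive 8 times from 16-bit to 2-bit) is consistent with a small fraction of the parameters being kept at higher precision. Below is a minimal packing sketch, assuming a simple four-codes-per-byte layout rather than the kernel's actual memory format:

```python
import torch

def pack_ternary(q: torch.Tensor) -> torch.Tensor:
    """Pack ternary codes {-1, 0, +1} into 2-bit fields, four per byte.

    Codes are shifted to {0, 1, 2} so they fit in 2-bit slots.
    Assumes q.numel() is a multiple of 4 for brevity.
    """
    u = (q.flatten().to(torch.int16) + 1).view(-1, 4)   # {-1,0,1} -> {0,1,2}
    packed = u[:, 0] | (u[:, 1] << 2) | (u[:, 2] << 4) | (u[:, 3] << 6)
    return packed.to(torch.uint8)

def unpack_ternary(packed: torch.Tensor) -> torch.Tensor:
    """Recover ternary codes from packed bytes."""
    p = packed.to(torch.int16)
    cols = [(p >> shift) & 0b11 for shift in (0, 2, 4, 6)]
    return (torch.stack(cols, dim=1).flatten() - 1).to(torch.int8)

# Round-trip check on random ternary codes.
q = torch.randint(-1, 2, (1024,)).to(torch.int8)
assert torch.equal(unpack_ternary(pack_ternary(q)), q)

# Back-of-the-envelope storage ratio: 99.5% of weights at 2 bits,
# the remainder kept at 16 bits -> roughly the reported ~7.7x reduction.
ratio = 1.0 / (0.995 * (2 / 16) + 0.005 * 1.0)
print(f"approximate compression: {ratio:.1f}x")   # ~7.7x
```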
Specifically, the researchers quantized 99.5% of the vision transformer parameters (11.9 billion in total) in the FLUX model to 1.58 bits, substantially lowering storage requirements. Experimental results showed that 1.58-bit FLUX performs comparably to the original FLUX model on the T2I-CompBench and GenEval benchmarks. In terms of inference speed, the improvements from 1.58-bit FLUX were more pronounced on lower-performance GPUs such as the NVIDIA L20 and A10.
In summary, 1.58-bit FLUX marks a significant step toward deploying high-quality T2I models on devices with tight memory and latency budgets. Although 1.58-bit FLUX still has limitations in the extent of its speed gains and in rendering fine detail in high-resolution images, its potential for improving model efficiency and reducing resource consumption is expected to offer new insights for future research.
Key improvements summary:
Model Compression: Model storage space reduced by 7.7 times.
Memory Optimization: Inference memory usage reduced by over 5.1 times.
Performance Retention: 1.58-bit FLUX maintained performance comparable to the full-precision FLUX model on the GenEval and T2I-CompBench benchmarks.
No Image Data Required: The quantization process does not require access to any image data, relying solely on the model's self-supervision.
Custom Kernel: A custom kernel optimized for low-bit computation was adopted, enhancing inference efficiency (see the sketch after this list).
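As a rough illustration of what a low-bit linear layer must do at inference time, the self-contained PyTorch sketch below stores ternary codes plus a per-tensor scale and dequantizes on the fly during the forward pass. The actual custom kernel would instead fuse unpacking, scaling, and the matrix multiply; the class name and quantization details here are hypothetical.

```python
import torch
import torch.nn as nn

class TernaryLinear(nn.Module):
    """Illustrative 1.58-bit linear layer (not the authors' kernel).

    Stores ternary codes in {-1, 0, +1} plus one scale per weight matrix
    and dequantizes on the fly in forward(). A production kernel would
    fuse unpacking, scaling, and the matmul rather than materializing a
    full-precision weight copy.
    """

    def __init__(self, weight: torch.Tensor):
        super().__init__()
        scale = weight.abs().mean().clamp(min=1e-8)
        codes = (weight / scale).round().clamp(-1, 1).to(torch.int8)
        self.register_buffer("codes", codes)      # ternary codes
        self.register_buffer("scale", scale)      # per-tensor dequantization scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.codes.to(x.dtype) * self.scale   # dequantize: w ≈ codes * scale
        return x @ w.t()

# Usage: replace a full-precision layer and compare outputs.
full = nn.Linear(512, 512, bias=False)
quant = TernaryLinear(full.weight.data)
x = torch.randn(4, 512)
err = (full(x) - quant(x)).norm() / full(x).norm()
print(f"relative output error: {err.item():.3f}")
```

Avoiding the full-precision weight copy is where most of the inference-time memory savings would come from, which is why a fused low-bit kernel matters beyond storage compression.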
Project Page: https://chenglin-yang.github.io/1.58bit.flux.github.io/
Paper Link: https://arxiv.org/pdf/2412.18653
Model Link: https://huggingface.co/papers/2412.18653