SliceGPT
SliceGPT: Compressing Large Language Models by Deleting Rows and Columns
Common Product, Programming, Sparsification, Model Compression
SliceGPT is a new post-training sparsification scheme that replaces each weight matrix with a smaller dense matrix, reducing the network's embedding dimension. Through extensive experiments, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) from LLAMA2-70B, OPT 66B, and Phi-2 while maintaining 99%, 99%, and 90% of the dense model's zero-shot task performance, respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on a 24GB consumer-grade GPU, the total inference compute for LLAMA2-70B drops to 64% of that of the dense model; on a 40GB A100 GPU, it drops to 66%. SliceGPT is made possible by a new insight, computational invariance in transformer networks, which we hope will inspire new ways to reduce the memory and compute requirements of pre-trained models.
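As a rough illustration of the two ideas in the summary (computational invariance, and replacing weight matrices with smaller dense ones), the NumPy sketch below works on a toy pair of linear maps. It is not the authors' implementation: the names `rms_norm`, `pca_rotation`, and `slice_pair`, the toy shapes, and the use of PCA on the hidden activations to choose the rotation are assumptions made for this example. It also assumes an RMS-style normalization between the two maps and ignores attention, nonlinearities, and the rescaling a real model would need after slicing.

```python
import numpy as np

def rms_norm(x, eps=1e-8):
    # Normalize each row by its root-mean-square (no learned scale; an assumption).
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def pca_rotation(h):
    # Orthogonal matrix whose columns are the eigenvectors of h^T h,
    # ordered from largest to smallest eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(h.T @ h)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order]

def slice_pair(w_in, w_out, q, keep):
    # Rotate the shared hidden dimension by q, then keep only the first
    # `keep` directions. Both results are smaller dense matrices.
    q_k = q[:, :keep]
    return w_in @ q_k, q_k.T @ w_out

rng = np.random.default_rng(0)
n, d_in, d, d_out = 256, 16, 32, 8           # hypothetical toy sizes
x = rng.normal(size=(n, d_in))
w_in = rng.normal(size=(d_in, d))
w_out = rng.normal(size=(d, d_out))

h = x @ w_in                                  # hidden signal, rank <= d_in
q = pca_rotation(h)

# 1) Invariance: an orthogonal q preserves row norms, so rotating the hidden
#    signal and un-rotating the next weight matrix leaves the output unchanged.
y_dense = rms_norm(h) @ w_out
y_rotated = rms_norm(h @ q) @ (q.T @ w_out)
invariance_err = np.max(np.abs(y_dense - y_rotated))

# 2) Slicing: drop 25% of the hidden dimension (linear case, no normalization).
keep = int(0.75 * d)
w_in_s, w_out_s = slice_pair(w_in, w_out, q, keep)
y_linear = h @ w_out
y_sliced = x @ w_in_s @ w_out_s
slice_err = np.linalg.norm(y_linear - y_sliced) / np.linalg.norm(y_linear)

param_ratio = (w_in_s.size + w_out_s.size) / (w_in.size + w_out.size)

print("max invariance error:", invariance_err)
print("relative slicing error:", slice_err)
print("parameter ratio after slicing:", param_ratio)
```

In this toy the hidden signal has rank at most 16, so discarding the 8 weakest of 32 directions loses essentially nothing. Real model activations are only approximately low-rank, which is why the reported accuracy retention is below 100%.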
SliceGPT Visits Over Time
Monthly Visits: 20,899,836
Bounce Rate: 46.04%
Pages per Visit: 5.2
Visit Duration: 00:04:57