SliceGPT

SliceGPT: Compressing Large Language Models by Deleting Rows and Columns

CommonProductProgrammingSparsificationModel Compression
SliceGPT is a new post-training sparsity approach that reduces the network's embedding dimension by replacing each weight matrix with a smaller (dense) matrix. Through extensive experiments, we demonstrate that SliceGPT can remove up to 25% of the model parameters (including embeddings) from LLAMA2-70B, OPT 66B, and Phi-2 models while maintaining 99%, 99%, and 90% of the zero-shot task performance, respectively. Our sliced models run on fewer GPUs and execute faster without any additional code optimizations: on a 24GB consumer-grade GPU, we reduce the total inference computation of LLAMA2-70B to 64% of the dense model; on a 40GB A100 GPU, we reduce it to 66%. We provide a new insight into the computational invariance in transformer networks, which makes SliceGPT possible. We hope it can inspire and promote new avenues for reducing memory and computational requirements of pre-trained models in the future.
Visit

SliceGPT Visit Over Time

Monthly Visits

20899836

Bounce Rate

46.04%

Page per Visit

5.2

Visit Duration

00:04:57

SliceGPT Visit Trend

SliceGPT Visit Geography

SliceGPT Traffic Sources