PowerInfer is a high-speed inference engine for large language models (LLMs) on consumer-grade GPUs in personal computers. It exploits the high locality of LLM inference: a small subset of "hot" neurons is consistently activated across inputs, so PowerInfer preloads these hot neurons onto the GPU, substantially reducing GPU memory requirements and CPU-GPU data transfer. It also integrates adaptive predictors and neuron-aware sparse operators, improving the efficiency of neuron-activation prediction and sparse computation. On a single NVIDIA RTX 4090 GPU, PowerInfer achieves an average generation speed of 13.20 tokens per second, only 18% lower than a top-tier server-grade A100 GPU, while maintaining model accuracy.
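
The hot/cold neuron split and predictor-gated sparse computation can be illustrated with a minimal sketch. This is not PowerInfer's actual implementation or API; the names (`HOT_FRACTION`, `sparse_ffn`, the profiling step) are illustrative assumptions. It shows the core idea for one FFN layer: frequently activated neurons would be kept GPU-resident, and a predictor lets the sparse operator compute only the neurons expected to fire, while matching the dense result when the prediction is correct.

```python
import numpy as np

# Hypothetical sketch of the hot/cold neuron split for one FFN layer.
# Constants and function names are illustrative, not PowerInfer's API.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1 = rng.standard_normal((d_ff, d_model))  # up-projection (rows = neurons)
W2 = rng.standard_normal((d_model, d_ff))  # down-projection

# Offline profiling step: rank neurons by activation frequency; the hottest
# fraction would live on the GPU, the cold remainder stays on the CPU.
activation_freq = rng.random(d_ff)
HOT_FRACTION = 0.25
hot = np.argsort(activation_freq)[-int(d_ff * HOT_FRACTION):]  # "GPU" set
cold = np.setdiff1d(np.arange(d_ff), hot)                      # "CPU" set

def relu(x):
    return np.maximum(x, 0.0)

def dense_ffn(x):
    # Reference dense computation over all d_ff neurons.
    return W2 @ relu(W1 @ x)

def sparse_ffn(x, predicted_active):
    # Neuron-aware sparse operator: compute only the rows the activation
    # predictor marks as likely active; skipped neurons contribute zero.
    h = np.zeros(d_ff)
    idx = np.asarray(predicted_active)
    h[idx] = relu(W1[idx] @ x)
    return W2 @ h

x = rng.standard_normal(d_model)
# With an oracle predictor (exactly the truly active neurons), the sparse
# path reproduces the dense output while touching far fewer rows.
active = np.nonzero(relu(W1 @ x) > 0)[0]
assert np.allclose(sparse_ffn(x, active), dense_ffn(x))
```

In the real system the predictor is a small learned network and the hot set is served by the GPU while cold neurons are computed on the CPU; this sketch only captures the partitioning and the skip-inactive-neurons computation pattern.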