MInference 1.0
Accelerates long-context pre-fill processing for large language models
CommonProductProgrammingNatural Language ProcessingMachine Learning
MInference 1.0 is a sparse computation method aimed at accelerating the pre-fill stage of long sequence processing. It implements a dynamic sparse attention method for long-context large language models (LLMs) by identifying three unique patterns in the long context attention matrix, accelerating the pre-fill stage for 1M token prompts while maintaining the capabilities of LLMs, especially retrieval capabilities.