Flash-Decoding
Flash-Decoding for long-context inference
International, Selection, Programming, Inference, Attention mechanism
Flash-Decoding is a technique that accelerates the attention mechanism during long-context inference, making generation up to 8x faster on very long sequences. It splits the keys and values into chunks that are loaded and processed in parallel, then rescales and combines the partial results so that the final attention output remains correct. Flash-Decoding is suited to large language models that must handle long contexts such as long documents, long conversations, or entire codebases. It is available in the FlashAttention package and in xFormers, whose dispatcher automatically selects between the Flash-Decoding and FlashAttention approaches; xFormers can also fall back to an efficient Triton kernel.
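The split, rescale, and combine steps described above can be illustrated with a minimal CPU sketch in plain Python. This is not the FlashAttention or xFormers implementation, which runs as GPU kernels. The function names (`attention`, `partial_attention`, `split_kv_attention`) and the `num_splits` parameter are hypothetical, chosen only for illustration, and the sketch assumes a single query vector, as in one decoding step.

```python
import math
import random


def _scores(q, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]


def partial_attention(q, keys, values):
    """Softmax attention over one chunk of keys/values.

    Returns the chunk's normalized output and its log-sum-exp, which is
    what the combine step needs to rescale the chunk correctly.
    """
    scores = _scores(q, keys)
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)


def attention(q, keys, values):
    """Reference: ordinary attention over the full key/value sequence."""
    return partial_attention(q, keys, values)[0]


def split_kv_attention(q, keys, values, num_splits):
    """Illustrative split-KV attention (hypothetical helper).

    1. Split keys/values into chunks (on a GPU these run in parallel).
    2. Compute attention and log-sum-exp per chunk.
    3. Rescale each chunk's output by its share of the total softmax
       mass and sum the results.
    """
    chunk = math.ceil(len(keys) / num_splits)
    partials = [
        partial_attention(q, keys[i:i + chunk], values[i:i + chunk])
        for i in range(0, len(keys), chunk)
    ]
    lse_max = max(lse for _, lse in partials)
    scales = [math.exp(lse - lse_max) for _, lse in partials]
    denom = sum(scales)
    dim = len(partials[0][0])
    return [sum(s * out[d] for s, (out, _) in zip(scales, partials)) / denom
            for d in range(dim)]


if __name__ == "__main__":
    random.seed(0)
    q = [random.gauss(0, 1) for _ in range(8)]
    keys = [[random.gauss(0, 1) for _ in range(8)] for _ in range(64)]
    values = [[random.gauss(0, 1) for _ in range(4)] for _ in range(64)]
    ref = attention(q, keys, values)
    got = split_kv_attention(q, keys, values, num_splits=4)
    print(max(abs(a - b) for a, b in zip(ref, got)))
```

Because each chunk reports its log-sum-exp alongside its output, the combine step can weight the chunks by their true share of the softmax denominator, so the result matches ordinary attention up to floating-point rounding regardless of how many chunks are used.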
Flash-Decoding Visit Over Time
Monthly Visits: 1039843
Bounce Rate: 45.57%
Pages per Visit: 2.8
Visit Duration: 00:02:27