Flash-Decoding

Flash-Decoding for long-context inference

Flash-Decoding is a technique for long-context inference that significantly accelerates the attention mechanism during decoding, reportedly making generation up to 8x faster for very long sequences. It splits the cached keys and values into chunks, computes attention over the chunks in parallel, and then rescales and combines the partial results so the final attention output remains exact. Flash-Decoding is suited to large language models serving long contexts such as long documents, long conversations, or entire codebases. It is available in the FlashAttention package and in xFormers, whose dispatcher can automatically select between Flash-Decoding and standard FlashAttention, and it can also make use of an efficient Triton kernel.
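
To make the rescale-and-combine step concrete, below is a minimal sketch of the split-KV idea in plain PyTorch. It is illustrative only: the real Flash-Decoding implementation is a fused GPU kernel, and the function name, tensor shapes, and num_splits parameter here are assumptions chosen for clarity, not the library's API.

import torch

def split_kv_attention(q, k, v, num_splits: int = 4):
    # q: (heads, d)        query for the single token being decoded
    # k: (heads, seq, d)   cached keys
    # v: (heads, seq, d)   cached values
    scale = q.shape[-1] ** -0.5
    partial_out, partial_lse = [], []

    # Step 1: attend to each chunk of the KV cache independently
    # (the real kernel processes these chunks in parallel on the GPU).
    for ks, vs in zip(k.chunk(num_splits, dim=1), v.chunk(num_splits, dim=1)):
        s = torch.einsum("hd,hnd->hn", q, ks) * scale          # scores: (heads, chunk)
        partial_lse.append(torch.logsumexp(s, dim=-1))         # (heads,)
        p = torch.softmax(s, dim=-1)
        partial_out.append(torch.einsum("hn,hnd->hd", p, vs))  # (heads, d)

    # Step 2: rescale each partial output by its share of the global
    # softmax mass and sum, recovering exact attention over the full cache.
    lse = torch.stack(partial_lse)                              # (splits, heads)
    out = torch.stack(partial_out)                              # (splits, heads, d)
    weights = torch.exp(lse - torch.logsumexp(lse, dim=0))
    return (weights.unsqueeze(-1) * out).sum(dim=0)             # (heads, d)

# Example: 16 heads, head dim 128, a 32k-token KV cache, one new query token.
q = torch.randn(16, 128)
k = torch.randn(16, 32768, 128)
v = torch.randn(16, 32768, 128)
ref = torch.softmax(torch.einsum("hd,hnd->hn", q, k) * 128 ** -0.5, dim=-1)
ref = torch.einsum("hn,hnd->hd", ref, v)
assert torch.allclose(split_kv_attention(q, k, v), ref, atol=1e-4)

The split version matches full-softmax attention over the whole cache up to floating-point error; the speedup in the actual kernel comes from the chunks being handled by separate GPU thread blocks in parallel.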

Flash-Decoding Visit Over Time

Monthly Visits: 956,914
Bounce Rate: 50.45%
Pages per Visit: 2.6
Visit Duration: 00:02:20

[Charts: Flash-Decoding Visit Trend, Visit Geography, and Traffic Sources]

Flash-Decoding Alternatives