A team of Chinese researchers from institutions including Princeton and UIUC has proposed Medusa, a simple framework for accelerating inference in large language models (LLMs), open-sourced on September 12th. Medusa attaches additional decoding heads to the original model: these heads are fine-tuned during training, and at generation time their multiple predictions are merged through a tree-based attention mechanism. In tests, this roughly doubled generation speed on the Vicuna series of models, and the team is actively expanding the framework to more application scenarios.
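To make the multi-head idea concrete, here is a minimal, hypothetical sketch of the core mechanism: alongside the base model's usual language-model head, each extra "Medusa" head reads the same last hidden state and guesses the token k+1 positions ahead. All names, sizes, and the random weights below are illustrative stand-ins, not the authors' actual implementation, which also arranges the candidate continuations into a tree and verifies them in one forward pass with tree attention.

```python
import random

random.seed(0)
VOCAB, HIDDEN, NUM_HEADS = 50, 16, 3  # toy sizes; a real LLM vocab is ~32k+


def rand_matrix(rows, cols):
    # Stand-in for trained weights; in Medusa the extra heads are fine-tuned.
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


base_lm_head = rand_matrix(HIDDEN, VOCAB)            # frozen base model's LM head
medusa_heads = [rand_matrix(HIDDEN, VOCAB)           # one extra head per lookahead step
                for _ in range(NUM_HEADS)]


def logits(hidden, weight):
    # Project a length-HIDDEN hidden state to VOCAB logits.
    return [sum(h * weight[i][v] for i, h in enumerate(hidden))
            for v in range(VOCAB)]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


def propose(hidden):
    """Base head predicts the next token; Medusa head k guesses the token
    k+1 positions ahead from the very same hidden state, so several future
    tokens are proposed in a single forward pass."""
    next_token = argmax(logits(hidden, base_lm_head))
    guesses = [argmax(logits(hidden, w)) for w in medusa_heads]
    return next_token, guesses
```

In the full framework, the base model then scores these speculative candidates in one batched pass (the tree-based attention step) and accepts the longest verified prefix, which is where the roughly two-fold speedup comes from: several tokens are committed per forward pass instead of one.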