In the realm of artificial intelligence, Large Language Models (LLMs) are renowned for their exceptional natural language processing capabilities. However, deploying these models in practice is challenging because of their high computational cost and memory footprint during inference. To address this, researchers have been exploring ways to make LLMs more efficient, and a recently proposed method called Q-Sparse has attracted widespread attention.

Q-Sparse is a simple yet effective method that achieves fully sparse activations in LLMs by applying top-K sparsification to the activations and using a straight-through estimator during training, which allows efficiency to be significantly improved at inference time. Key research findings include:

Q-Sparse offers higher inference efficiency while maintaining results comparable to baseline LLMs.

A novel inference-optimal scaling law for sparse activation LLMs has been proposed.

Q-Sparse is effective across various settings, including training from scratch, continued training of existing LLMs, and fine-tuning.

Q-Sparse is applicable to both full-precision and 1-bit LLMs (e.g., BitNet b1.58).

Advantages of Sparse Activation

Sparsity enhances the efficiency of LLMs in two ways: first, it reduces the computational load of matrix multiplication, since zero elements are not computed; second, it reduces the amount of input/output (I/O) transfer, which is a major bottleneck during the inference phase of LLMs.
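
As a rough back-of-the-envelope illustration (our own arithmetic, not a figure from the paper): consider a single linear projection y = W·x with a weight matrix W of shape d_out × d_in. If the activation vector x has sparsity ratio S, only about (1 − S)·d_in of its entries are nonzero, so only the corresponding columns of W need to be read from memory and multiplied. Both the multiply-accumulate count and the weight traffic therefore drop from roughly d_out·d_in to roughly (1 − S)·d_out·d_in, i.e. compute and I/O shrink in proportion to 1 − S.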

Q-Sparse achieves full sparsity of the activations by applying a top-K sparsification function to each linear projection. For backpropagation, a straight-through estimator is used to compute the gradients of the activations. Additionally, a squared ReLU function is introduced to further increase the sparsity of the activations.
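
The sketch below illustrates this mechanism in PyTorch. It is a minimal illustration under our own assumptions rather than the paper's reference implementation: the names TopKSparsify and QSparseLinear, the sparsity_ratio argument, and the placement of the optional squared ReLU are ours. The straight-through estimator simply passes the incoming gradient through the top-K step unchanged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKSparsify(torch.autograd.Function):
    """Keep only the top-K largest-magnitude entries of each activation vector.

    Forward: zero out everything outside the top-K (full activation sparsity).
    Backward: straight-through estimator -- the incoming gradient is passed
    through unchanged, as if the sparsification were the identity function.
    """

    @staticmethod
    def forward(ctx, x, k):
        _, idx = torch.topk(x.abs(), k, dim=-1)           # indices of the k largest |x|
        mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
        return x * mask

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                           # straight-through for x, no grad for k


class QSparseLinear(nn.Module):
    """Linear layer whose input activations are top-K sparsified (illustrative)."""

    def __init__(self, in_features, out_features, sparsity_ratio=0.6, squared_relu=False):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Keep the (1 - S) fraction of entries with the largest magnitude.
        self.k = max(1, int(round(in_features * (1.0 - sparsity_ratio))))
        self.squared_relu = squared_relu

    def forward(self, x):
        if self.squared_relu:
            x = F.relu(x) ** 2                             # squared ReLU adds more exact zeros
        x = TopKSparsify.apply(x, self.k)
        return self.linear(x)


if __name__ == "__main__":
    layer = QSparseLinear(1024, 4096, sparsity_ratio=0.75)
    x = torch.randn(2, 16, 1024, requires_grad=True)
    y = layer(x)
    y.sum().backward()
    print(y.shape, x.grad.shape)
```

In this sketch the layer sparsifies its input and then applies a dense nn.Linear; a production kernel would instead skip the zeroed columns entirely, which is where the compute and I/O savings described above actually come from.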

Experimental Validation

Researchers conducted a series of scaling experiments to study the scaling laws of sparse activation LLMs and made some intriguing discoveries:

The performance of sparse activation models improves as the model size and the sparsity ratio increase.

Given a fixed sparsity ratio S, the performance of sparse activation models scales according to a power law with the model size N.

Given a fixed number of parameters N, the performance of sparse activation models scales according to an exponential law with the sparsity ratio S (the corresponding functional shapes are sketched just after this list).
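
Written out with placeholder symbols of our own (L for the loss-like performance metric, α and β for fitted exponents) rather than the paper's exact parameterization, the two findings above correspond to functional shapes of the form L(N) ∝ N^(−α) at a fixed sparsity ratio S, and L(S) ∝ exp(β·S) at a fixed model size N. The values (and signs) of α and β, as well as the combined inference-optimal scaling law, are fitted in the paper; the point here is only the shape of the dependence.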

Q-Sparse can be used not only for training from scratch but also for the continued training and fine-tuning of existing LLMs. In the continued training and fine-tuning settings, researchers use the same architecture and training process as in training from scratch, with the only difference being that the model is initialized with pre-trained weights and the sparse function is enabled during continued training.
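
As a concrete (and hypothetical) illustration of what "enabling the sparse function" on a pre-trained model could look like, the sketch below registers a forward pre-hook on every nn.Linear so that its input activations are top-K sparsified while the pre-trained weights are left untouched; the helper name enable_activation_sparsity and the hook-based approach are our own choices, not the paper's code.

```python
import torch
import torch.nn as nn


def topk_mask(x, k):
    # Zero out everything except the k largest-magnitude entries along the last dim.
    _, idx = torch.topk(x.abs(), k, dim=-1)
    return x * torch.zeros_like(x).scatter_(-1, idx, 1.0)


def enable_activation_sparsity(model, sparsity_ratio=0.6):
    """Hypothetical helper: sparsify the input activations of every nn.Linear.

    The pre-trained weights are left untouched; only the activations flowing
    into each projection become sparse. Note: for brevity this hook masks
    activations in both the forward and backward pass; the paper trains with a
    straight-through estimator (see the earlier sketch), which would pass
    gradients through the mask unchanged.
    """
    handles = []
    for module in model.modules():
        if isinstance(module, nn.Linear):
            k = max(1, int(round(module.in_features * (1.0 - sparsity_ratio))))

            def hook(mod, inputs, k=k):
                return (topk_mask(inputs[0], k),) + tuple(inputs[1:])

            handles.append(module.register_forward_pre_hook(hook))
    return handles


# Usage (assuming a Hugging Face-style pre-trained model):
#   from transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained("...")   # any pre-trained LLM
#   enable_activation_sparsity(model, sparsity_ratio=0.75)
#   ...then continue the usual training / fine-tuning loop with sparsity enabled...
```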

Researchers are exploring the combination of Q-Sparse with 1-bit LLMs (such as BitNet b1.58) and Mixture of Experts (MoE) to further enhance the efficiency of LLMs. Additionally, they are working to make Q-Sparse compatible with batch mode, which will provide more flexibility for the training and inference of LLMs.