With the global attention garnered by DeepSeek-R1, the reasoning model released by Chinese AI company DeepSeek, its stability across third-party platforms has become a hot topic in the tech world. Recent discussions and benchmark data on X reveal that output completeness, accuracy, and inference time vary significantly depending on which hosting platform is used. This highlights the complexities of model deployment and offers practical guidance for users choosing a hosting service.
Testing Background and Methodology
A cross-platform stability test of DeepSeek-R1, based on feedback from X users and professional testing organizations, has recently attracted considerable attention. Led by the Artificial Intelligence Department of the China Software Evaluation Center, the test covered more than ten domestic and international third-party platforms, including Nano AI Search, Alibaba Bailian, and Silicon Flow. A standardized set of 20 basic mathematical reasoning problems (developed by the SuperCLUE team) served as the benchmark. The evaluation focused on three dimensions: response rate, accuracy, and inference time, and it also compared free and paid tiers.
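The three dimensions above reduce to simple aggregate metrics. As a minimal illustration of how such scoring might work (the data structure and scoring rules here are assumptions for clarity, not the SuperCLUE team's actual harness):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrialResult:
    answer: Optional[str]   # None if the platform produced no usable output (timeout/truncation)
    expected: str           # reference answer for the reasoning problem
    latency_s: float        # wall-clock inference time in seconds

def score(results: list[TrialResult]) -> dict[str, float]:
    """Aggregate response rate, accuracy, and mean inference time over all trials."""
    answered = [r for r in results if r.answer is not None]
    correct = [r for r in answered if r.answer.strip() == r.expected.strip()]
    return {
        "response_rate": len(answered) / len(results),
        "accuracy": len(correct) / len(results),
        # mean latency over answered trials only; guard against division by zero
        "mean_latency_s": sum(r.latency_s for r in answered) / max(len(answered), 1),
    }

# Hypothetical run: 2 of 3 trials answered, 1 of them correct
results = [
    TrialResult("42", "42", 3.1),
    TrialResult(None, "7", 30.0),   # truncated output, counts against response rate
    TrialResult("12", "13", 2.5),   # answered but wrong, counts against accuracy
]
print(score(results))
```

Counting truncated outputs against the response rate (rather than accuracy alone) is one way to capture the truncation complaints reported later in this article.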
Test Results: Significant Stability Differences
Results show that DeepSeek-R1's stability is highly dependent on the hosting platform. Nano AI Search, known for offering the full-fledged DeepSeek-R1 for free, performed exceptionally well. X user @op7418 posted on February 27th: "Nano AI Search was the first to integrate the full-fledged DeepSeek-R1, and it performed excellently in the evaluation." The platform drew positive feedback for its high response rate and stable output, which users saw as an embodiment of Zhou Hongyi's "AI popularization" philosophy.
However, other platforms showed less satisfactory results. X user @simonkuang938 noted on February 24th that Alibaba Bailian's DeepSeek-R1 frequently experienced truncated output due to excessive memory consumption when processing complex logical tasks (such as generating charts or flowcharts), causing client-side freezes despite maintaining connection. He jokingly described the experience as "cheap," reflecting some users' dissatisfaction with stability.
In contrast, Silicon Flow, by limiting free usage and offering a stable paid version, received positive feedback from @simonkuang938. On February 22nd, he stated: "There are so few conscientious platforms like Silicon Flow; R1 is the full-fledged version and hasn't been modified." This suggests that paid services may offer superior stability.
User Experience and Technical Details
User feedback on X indicates that DeepSeek-R1's performance varies across scenarios. @changli71829684 mentioned on February 25th that R1 is prone to entering infinite loops when single-turn output exceeds 3,000 words. While its information density is high and well suited to knowledge mining, its accuracy and output quality are somewhat lacking; he suggested the model is better suited to "brainstorming" than to precise tasks. Furthermore, @oran_ge observed on January 29th that DeepSeek-R1-Zero, the variant trained without supervised fine-tuning (SFT), exhibited erratic behavior on simple questions, such as responding to "Hello" with a mathematical formula, highlighting the model's instability in specific scenarios.
It's worth noting that some users have tried to optimize R1's user experience. @oran_ge shared a setup on February 12th that accesses R1 directly via its API, claiming it delivers "the most stable and fastest R1 user experience" and effectively resolves freezing and network issues. This shows that technical configuration, not just platform choice, also affects stability.
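One common way to harden direct API access of this kind is to wrap the call with explicit timeouts and retries. The sketch below is a generic retry helper, not @oran_ge's actual configuration; the endpoint URL, model name, and backoff schedule in the usage comment are assumptions for illustration:

```python
import time
from typing import Callable

def call_with_retries(request: Callable[[], str],
                      max_attempts: int = 4,
                      base_delay_s: float = 1.0) -> str:
    """Retry a flaky API call with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_attempts):
        try:
            return request()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError("unreachable")

# Hypothetical usage with an OpenAI-compatible client (identifiers assumed):
# from openai import OpenAI
# client = OpenAI(base_url="https://api.deepseek.com", api_key="...")
# reply = call_with_retries(lambda: client.chat.completions.create(
#     model="deepseek-reasoner",
#     messages=[{"role": "user", "content": "Hello"}],
#     timeout=60,
# ).choices[0].message.content)
```

Separating the retry policy from the request itself keeps the helper reusable across whichever hosted or direct endpoint a user settles on.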
Industry Significance and User Suggestions
This cross-platform testing not only exposed the challenges in DeepSeek-R1 deployment but also sparked discussions about the commercialization and stability of open-source models. X users generally believe that while DeepSeek-R1 excels in mathematical and programming benchmarks (e.g., MATH-500 score of 97.3%), its stability in real-world applications still needs improvement. The traffic pressure and high load on free services may lead to performance degradation, while paid platforms offer more reliable experiences through resource allocation.
Industry experts suggest that users choose hosting platforms based on their needs. Developers seeking high response rates and complete output will find stable services like Nano AI Search or Silicon Flow good options, while users handling complex reasoning tasks may be better served by paid platforms. Meanwhile, users such as @GrayPsyche (in a February 8th post) have urged DeepSeek to add hardware capacity or paid tiers to relieve congestion on its free services.
The DeepSeek-R1 third-party platform stability evaluation reveals a key fact: while the model's potential is great, its actual performance varies significantly with the hosting environment. From Nano AI Search's efficient free service to Alibaba Bailian's truncation issues to Silicon Flow's stable paid experience, users must weigh cost against performance. As AI technology becomes more prevalent, the future development of DeepSeek-R1 and its global competitiveness will likely depend on how well these stability challenges are addressed. The heated discussions on X continue, and this topic will undoubtedly remain a focus for the industry.