Leading Chinese AI company DeepSeek closed out its Open Source Week by dropping a technological "nuclear bomb": the official release of 3FS (Fire-Flyer File System), a high-performance parallel file system designed for modern computing environments, together with its companion data processing framework, Smallpond. This powerful combination directly targets the data-handling pain points of AI training and inference, achieving an industry-leading cluster read throughput of 6.6 TiB/s and marking a new era for distributed storage technology.

Revolutionary Performance: Architectural Innovation Defines New Standards

3FS, with its disaggregated architecture and strong consistency semantics, achieves an aggregate read throughput of 6.6 TiB/s on a 180-node cluster, with single-node KVCache lookup peaks exceeding 40 GiB/s. Its GraySort benchmark result of 3.66 TiB/min (on 25 nodes) far outpaces traditional solutions. The system is deeply optimized for the characteristics of SSDs and RDMA networks, pushing hardware bandwidth utilization close to its physical limits and providing a stable data supply for AI training clusters with thousands of GPUs.
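A quick back-of-envelope check makes the bandwidth claim concrete. Assuming the cluster configuration DeepSeek has published for this benchmark (each storage node fitted with 2×200 Gbps InfiniBand NICs), 6.6 TiB/s across 180 nodes works out to roughly 37.5 GiB/s per node, against a theoretical network ceiling of about 46.6 GiB/s per node: around 80% of raw line rate sustained at the file-system level.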

Restructuring the Landscape: Empowering the Entire AI Workflow

As a core piece of the infrastructure behind DeepSeek's V3/R1 models, 3FS is woven into key stages such as data preprocessing, checkpoint storage, vector search, and inference caching. Its shared storage layer dramatically simplifies distributed development, while strong consistency keeps large-scale concurrent operations safe. The accompanying open-source Smallpond framework adds lightweight PB-scale data processing on top, leveraging DuckDB to deliver "serverless" data engineering and closing the loop from storage to computation.
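To give a flavor of what that "serverless" workflow looks like, here is a minimal sketch in the style of Smallpond's published quick-start; the file paths and column names are illustrative placeholders, not part of the official example:

```python
import smallpond

# Start a smallpond session; DuckDB performs the actual SQL execution.
sp = smallpond.init()

# Read a Parquet dataset -- in production this would sit on a 3FS mount
# shared by every node in the cluster.
df = sp.read_parquet("data/training_corpus.parquet")

# Hash-partition the rows so downstream SQL runs per partition in parallel.
df = df.repartition(3, hash_by="doc_id")

# Run SQL over each partition; {0} is substituted with the input dataframe.
df = sp.partial_sql(
    "SELECT doc_id, count(*) AS n_chunks FROM {0} GROUP BY doc_id", df
)

# Materialize the result back to Parquet.
df.write_parquet("output/")
```

The appeal is that there is no long-running cluster service to operate: each task is an ordinary process running DuckDB against files in shared storage, which is what makes the "serverless" label apt.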

Open-Source Strategy: Accelerating the Democratization of AI Infrastructure

The dual open-sourcing of 3FS and Smallpond caps DeepSeek's five consecutive days of open-source releases. By publishing systems already proven in its own AI business, DeepSeek is pushing the industry past the storage bottlenecks of data-intensive applications. Analysts believe this solution could significantly outperform traditional distributed file systems such as Ceph and Lustre, and could open up new paradigms for large-model training in particular.