
DeepSeek AI Launches Smallpond: A Lightweight Data Processing Framework Based on DuckDB and 3FS

AIbase基地

Published in AI News · 5 min read · Mar 6, 2025

Modern data workflows face growing challenges as datasets expand and distributed processing becomes more complex. Many organizations find that traditional data processing systems fall short on processing time, memory capacity, and distributed task management. As a result, data scientists and engineers often spend significant time on system maintenance rather than on extracting insights from data. What these teams need is a tool that simplifies workflows without sacrificing performance.

Recently, DeepSeek AI released Smallpond, a lightweight data processing framework built on DuckDB and 3FS. Smallpond aims to extend DuckDB's efficient in-process SQL analytics to distributed environments. Paired with 3FS, a high-performance distributed file system optimized for modern SSDs and RDMA networks, it offers a practical way to handle large datasets without the complexity and infrastructure cost of long-running services.

Smallpond has a simple, modular design and is compatible with Python 3.8 to 3.12. Users can install it via pip and start processing data immediately. A key feature is its support for manual data partitioning: users can partition by file count, by row count, or by the hash value of a specific column. This flexibility lets users tailor processing to their own data and infrastructure.
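The three partitioning strategies mentioned above can be illustrated with a small, library-free sketch. This is a conceptual illustration only and does not use Smallpond's API: the function names `partition_by_file_count`, `partition_by_row_count`, and `partition_by_hash`, as well as the `ticker` column, are hypothetical and chosen purely for the example.

```python
import zlib
from typing import Dict, List

Row = Dict[str, object]


def partition_by_file_count(files: List[str], n: int) -> List[List[str]]:
    """Spread input files round-robin across n partitions (hypothetical helper)."""
    parts: List[List[str]] = [[] for _ in range(n)]
    for i, path in enumerate(files):
        parts[i % n].append(path)
    return parts


def partition_by_row_count(rows: List[Row], rows_per_part: int) -> List[List[Row]]:
    """Cut rows into consecutive chunks of at most rows_per_part rows."""
    return [rows[i:i + rows_per_part] for i in range(0, len(rows), rows_per_part)]


def partition_by_hash(rows: List[Row], column: str, n: int) -> List[List[Row]]:
    """Send each row to the partition selected by a hash of one column.

    zlib.crc32 is used instead of the built-in hash() so that the
    assignment is stable across Python processes.
    """
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        key = str(row[column]).encode("utf-8")
        parts[zlib.crc32(key) % n].append(row)
    return parts


if __name__ == "__main__":
    rows = [
        {"ticker": "AAPL", "price": 10.0},
        {"ticker": "MSFT", "price": 20.0},
        {"ticker": "AAPL", "price": 12.5},
    ]
    # Rows sharing a ticker always end up in the same partition.
    for i, part in enumerate(partition_by_hash(rows, "ticker", 2)):
        print(i, part)
```

Hash partitioning is the variant that matters most for grouped queries: because every row with the same key lands in the same partition, each partition can be aggregated independently.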

Technically, Smallpond leverages DuckDB's native SQL query performance and integrates with Ray to enable parallel processing across distributed computing nodes. This combination simplifies scaling operations and ensures efficient workload handling across multiple nodes. Furthermore, by avoiding persistent services, Smallpond reduces the operational overhead typically associated with distributed systems.
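The pattern described above (run an embedded SQL engine on each partition, in parallel, then combine the results) can be sketched with the Python standard library alone. Note the substitutions: `sqlite3` stands in for DuckDB and a `ThreadPoolExecutor` stands in for Ray, so this shows the shape of the idea rather than Smallpond's actual implementation. The `prices` table, the `ticker` and `price` columns, and all function names are assumptions made for the example.

```python
import sqlite3
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

PriceRow = Tuple[str, float]


def hash_partition(rows: List[PriceRow], n: int) -> List[List[PriceRow]]:
    """Assign each (ticker, price) row to a partition by hashing the ticker."""
    parts: List[List[PriceRow]] = [[] for _ in range(n)]
    for ticker, price in rows:
        parts[zlib.crc32(ticker.encode("utf-8")) % n].append((ticker, price))
    return parts


def query_partition(part: List[PriceRow]) -> List[Tuple[str, float, float]]:
    """Run the SQL aggregate on one partition in its own in-memory database.

    sqlite3 is a stand-in for DuckDB here; no long-running service is
    involved, the database exists only for the duration of the task.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE prices (ticker TEXT, price REAL)")
        conn.executemany("INSERT INTO prices VALUES (?, ?)", part)
        return conn.execute(
            "SELECT ticker, MIN(price), MAX(price) FROM prices GROUP BY ticker"
        ).fetchall()
    finally:
        conn.close()


def min_max_by_ticker(rows: List[PriceRow], n_partitions: int = 3):
    """Partition, query each partition in parallel, and concatenate results."""
    parts = hash_partition(rows, n_partitions)
    # ThreadPoolExecutor is a stand-in for a distributed scheduler such as Ray.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(query_partition, parts)
    # Hash partitioning keeps each ticker in one partition, so the
    # per-partition GROUP BY results can be concatenated without a merge step.
    return sorted(row for result in results for row in result)


if __name__ == "__main__":
    data = [("AAPL", 10.0), ("MSFT", 20.0), ("AAPL", 12.5), ("GOOG", 7.0), ("MSFT", 18.0)]
    print(min_max_by_ticker(data))
```

Each task creates and discards its own database, which mirrors the article's point about avoiding persistent services: nothing needs to keep running between jobs.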

In performance tests, Smallpond excelled in the GraySort benchmark, sorting 110.5 TiB of data in just over 30 minutes, achieving an average throughput of 3.66 TiB per minute. These performance metrics demonstrate Smallpond's ability to meet the needs of organizations handling data ranging from terabytes to petabytes. As an open-source project, Smallpond welcomes contributions from users and developers to further optimize and adapt it to diverse use cases.
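As a quick sanity check, the two benchmark figures quoted above can be compared directly. The sketch below only uses the numbers from this article (110.5 TiB and 3.66 TiB per minute); the variable names are illustrative.

```python
# Figures quoted in the article.
total_tib = 110.5            # data sorted, in TiB
throughput_tib_per_min = 3.66  # average throughput, in TiB per minute

# Elapsed time implied by the two figures.
minutes = total_tib / throughput_tib_per_min
whole_minutes, seconds = divmod(round(minutes * 60), 60)

print(f"{minutes:.2f} min")                     # 30.19 min
print(f"{whole_minutes} min {seconds} s")       # 30 min 11 s
```

The implied duration of roughly 30.2 minutes agrees with the article's "just over 30 minutes"; the throughput figure is rounded, so the implied time is approximate.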

Smallpond represents a significant step forward in distributed data processing. By extending DuckDB's efficiency to distributed environments and combining it with 3FS's high throughput, it provides a practical tool for data scientists and engineers. Whether handling smaller datasets or scaling to petabyte-level operations, Smallpond is an efficient and accessible framework.

Project: https://github.com/deepseek-ai/smallpond?tab=readme-ov-file

Key Highlights:
🌟 Smallpond is a lightweight data processing framework from DeepSeek AI, built on DuckDB and 3FS.
⚙️ Supports Python 3.8 to 3.12, allowing users to quickly install and flexibly customize data processing.
🚀 Demonstrated exceptional performance in the GraySort benchmark, showcasing its ability to handle terabyte-scale data.


This article is from AIbase Daily

—— Created by the AIbase Daily Team
