Today, the Doubao large model team officially released a technical report on text-to-image generation, publicly disclosing the technical details of the Seedream 2.0 image generation model for the first time. The report covers data construction, the pre-training framework, and the post-training RLHF process.
Since its launch in early December 2024 on the Doubao app and Jimeng, Seedream 2.0 has served over 100 million consumer users and has become popular among professional designers. Compared with mainstream models such as Ideogram 2.0 and Midjourney V6.1, it addresses issues such as poor text rendering and limited understanding of Chinese culture, with significant improvements in bilingual (Chinese and English) understanding, aesthetics, and instruction following.
Tests on the Bench-240 benchmark show that, for English prompts, its generated images have superior structural plausibility and text comprehension accuracy. For Chinese prompts, it achieves a 78% usable text rendering rate and a 63% perfect response rate, significantly exceeding other models in the industry.
The technical implementation involves several innovations. In data preprocessing, the team built a framework centered on "knowledge fusion". A four-dimensional data architecture balances data quality and knowledge diversity, while an intelligent annotation engine achieves three-level cognitive evolution, enhancing the model's understanding and recognition capabilities. Engineering refactoring significantly improves data processing efficiency.
During the pre-training phase, the team focused on bilingual understanding and text rendering. A native bilingual alignment scheme, which fine-tunes an LLM and constructs a dedicated dataset, breaks down the barrier between the language and visual dimensions. A dual-modal encoding fusion system lets the model consider both text semantics and font glyphs. A triple-upgraded DiT architecture, incorporating QK-Norm and Scaling RoPE, improves training stability and enables multi-resolution image generation.
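To make the QK-Norm idea concrete, here is a minimal sketch of single-head attention in which queries and keys are normalized before their dot product. It is an illustration only, not the Seedream 2.0 implementation: the use of RMS normalization without a learnable gain, and the function names `rms_norm`, `qk_logits`, and `qk_norm_attention`, are assumptions made for this example. The point it shows is that normalization bounds the attention logits regardless of how large the raw activations grow, which is the usual motivation for the technique as a training-stability aid.

```python
import math

def rms_norm(vec, eps=1e-6):
    """Scale a vector so its root-mean-square is approximately 1 (no learnable gain; an assumption)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def qk_logits(queries, keys, normalize=True):
    """Scaled dot-product logits; with normalize=True, queries and keys are RMS-normalized first."""
    dim = len(queries[0])
    if normalize:
        queries = [rms_norm(q) for q in queries]
        keys = [rms_norm(k) for k in keys]
    return [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        for q in queries
    ]

def qk_norm_attention(queries, keys, values):
    """Single-head attention using normalized queries and keys."""
    outputs = []
    for row in qk_logits(queries, keys, normalize=True):
        weights = softmax(row)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

With RMS-normalized vectors of dimension d, each scaled logit has magnitude at most sqrt(d), so the softmax inputs cannot blow up even when raw activations are very large.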
Note: Seedream 2.0 performance across different dimensions for English prompts. Data in this figure is normalized using the best indicator as a reference.
In the post-training RLHF process, the team developed and optimized a system built on multi-dimensional preference data, three different reward models, and iterative learning to drive model evolution. This effectively improves model performance, with the scores from the different reward models rising steadily across iterations.
Note: Seedream 2.0 performance across different dimensions for Chinese prompts. Data in this figure is normalized using the best indicator as a reference.
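The normalization mentioned in the figure notes can be read as dividing each model's score in a dimension by the best score in that dimension, so the leading model sits at 1.0. The sketch below illustrates that reading under two assumptions: that higher raw scores are better, and that "best indicator" means the per-dimension maximum. The model names and scores are hypothetical placeholders, not data from the report.

```python
def normalize_to_best(scores):
    """Divide every score by the best (maximum) score in its dimension.

    scores: {model_name: {dimension: raw_score}}; higher is assumed to be better.
    """
    dimensions = list(next(iter(scores.values())).keys())
    best = {d: max(per_model[d] for per_model in scores.values()) for d in dimensions}
    return {
        model: {d: per_model[d] / best[d] for d in dimensions}
        for model, per_model in scores.items()
    }

# Hypothetical raw scores, for illustration only.
raw_scores = {
    "model_a": {"alignment": 4.0, "structure": 3.0},
    "model_b": {"alignment": 2.0, "structure": 6.0},
}
normalized = normalize_to_best(raw_scores)
```

Under this reading, a value such as 0.5 means a model reached half of the best score observed in that dimension.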
The release of this technical report demonstrates the Doubao large model team's commitment to advancing image generation technology. In the future, the team will continue to explore innovative technologies, push the boundaries of model performance, delve into reinforcement learning optimization mechanisms, and continuously share technical experience to contribute to the vibrant development of the industry.
Technical Showcase: https://team.doubao.com/tech/seedream
Technical Report: https://arxiv.org/pdf/2503.07703