ByteDance and the University of Science and Technology of China have reported a significant advance with their multimodal document large model, DocPedia. Supporting input resolutions up to 2560×2560, it addresses a key weakness of existing models: parsing high-resolution document images. By combining a joint perception-understanding training strategy with frequency-domain processing of the image, DocPedia improves efficiency without discarding visual information, and shows clear gains in text recognition and semantic question answering. The model performs strongly across multiple benchmarks, marking a positive contribution to the field of multimodal document understanding.
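
The summary highlights frequency-domain processing as the mechanism that lets DocPedia handle high-resolution pages without an explosion in visual tokens. As a rough illustration only (not DocPedia's actual pipeline, whose details are in the original paper), the sketch below shows one common way to move a document page into the frequency domain: tiling it into small blocks and taking a 2-D DCT per block, so that the compact low-frequency coefficients carry most of the legible-text energy. The function name, block size, and synthetic input are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dctn  # 2-D type-II discrete cosine transform


def image_to_dct_blocks(gray_image: np.ndarray, block: int = 8) -> np.ndarray:
    """Split a grayscale page into block x block tiles and return each tile's
    flattened 2-D DCT coefficients, shape (num_tiles, block * block).

    Illustrative stand-in for frequency-domain tokenization; not the
    model's real preprocessing.
    """
    h, w = gray_image.shape
    # Crop so the page divides evenly into tiles (simplifying assumption).
    h, w = h - h % block, w - w % block
    img = gray_image[:h, :w].astype(np.float32)

    # Rearrange the page into (num_tiles, block, block) tiles.
    tiles = (img.reshape(h // block, block, w // block, block)
                .transpose(0, 2, 1, 3)
                .reshape(-1, block, block))

    # Per-tile 2-D DCT; low-frequency coefficients concentrate the
    # information needed to read text, which keeps the representation compact.
    coeffs = dctn(tiles, axes=(1, 2), norm="ortho")
    return coeffs.reshape(coeffs.shape[0], -1)


if __name__ == "__main__":
    # A synthetic 2560x2560 "page" stands in for a real document scan.
    page = np.random.rand(2560, 2560).astype(np.float32)
    tokens = image_to_dct_blocks(page, block=8)
    print(tokens.shape)  # (102400, 64)
```

In a real system, these frequency-domain features would typically be further pooled or projected before reaching the language model, which is what keeps the sequence length manageable even at 2560×2560 input resolution.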