The Beijing Academy of Artificial Intelligence (BAAI) has released Emu2, a new generation of multi-modal foundation models. Through large-scale autoregressive generative multi-modal pre-training, Emu2 makes significant advances in multi-modal in-context learning. It excels at few-shot multi-modal understanding tasks, surpassing mainstream multi-modal pre-trained models such as Flamingo-80B and IDEFICS-80B, and achieves state-of-the-art results on multiple few-shot understanding, visual question answering, and image generation benchmarks. Emu2-Chat accurately follows interleaved text-image instructions, enabling better information perception, intent understanding, and decision planning. Emu2-Gen accepts sequences of interleaved images, text, and locations as input, enabling flexible, controllable, and high-quality image and video generation. Emu2 adopts a simpler modeling framework and scales the model up to 37 billion parameters. For more details, please refer to the project link released by BAAI.
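
To make the instruction-following usage concrete, below is a minimal sketch of querying Emu2-Chat with an interleaved image-text prompt via Hugging Face transformers. The repository id `BAAI/Emu2-Chat`, the `[<IMG_PLH>]` image placeholder, and the `build_input_ids` helper are assumptions drawn from the publicly published model card rather than details stated in this announcement.

```python
# A minimal usage sketch, not an official example: the checkpoint id, the
# [<IMG_PLH>] placeholder, and build_input_ids are assumptions based on the
# published Emu2-Chat model card (the model ships its own remote code).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2-Chat")
model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu2-Chat",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # loads the model's own multi-modal code
).to("cuda").eval()

# Interleave an image placeholder with a text instruction.
query = "[<IMG_PLH>]Describe the image in detail:"
image = Image.open("example.jpg").convert("RGB")  # hypothetical local file

# build_input_ids is model-specific remote code (assumed from the model card);
# it pairs each [<IMG_PLH>] token in the text with one image in the list.
inputs = model.build_input_ids(text=[query], tokenizer=tokenizer, image=[image])

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same interleaved-sequence interface is what lets the model consume few-shot multi-modal prompts: additional image-text example pairs can be prepended to the query in the same placeholder format.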