Show-o is a unified transformer model for multimodal understanding and generation. A single model handles image captioning, visual question answering, text-to-image generation, text-guided inpainting and extrapolation, and mixed-modality generation. It was developed jointly by Show Lab at the National University of Singapore and ByteDance.