CM3leon is a single model that performs both text-to-image and image-to-text generation. Its training recipe is adapted from text-only language models and consists of a large-scale retrieval-augmented pre-training stage followed by a multi-task supervised fine-tuning stage. CM3leon retains the versatility and effectiveness of autoregressive models while keeping training costs low and inference efficient. It is a causal masked mixed-modal (CM3) model: it can generate sequences of text and images conditioned on arbitrary sequences of other image and text content. Compared with earlier models that perform only text-to-image or only image-to-text generation, CM3leon covers a wider range of multi-modal generation tasks.
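The "causal masked" idea can be illustrated with a toy data transform. The sketch below assumes the objective used by the CM3 family, in which a span is cut out of the sequence, replaced by a sentinel, and appended at the end so that a left-to-right decoder can still learn to infill it. The paragraph above does not spell this out, and the function name `causal_mask_transform`, the `<mask:0>` sentinel, and the token strings are hypothetical placeholders rather than CM3leon's actual vocabulary or code.

```python
import random
from typing import List, Optional, Tuple

MASK = "<mask:0>"  # hypothetical sentinel token, chosen for illustration


def causal_mask_transform(
    tokens: List[str],
    span: Optional[Tuple[int, int]] = None,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Move one contiguous span to the end of the sequence.

    The span tokens[start:end] is replaced in place by a sentinel, and the
    sentinel followed by the removed span is appended at the end.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    if span is None:
        # Assumption: sample a non-empty span uniformly at random.
        rng = rng or random.Random()
        start = rng.randrange(len(tokens))
        end = rng.randrange(start + 1, len(tokens) + 1)
    else:
        start, end = span
        if not 0 <= start < end <= len(tokens):
            raise ValueError("span must satisfy 0 <= start < end <= len(tokens)")
    return tokens[:start] + [MASK] + tokens[end:] + [MASK] + tokens[start:end]


# A mixed-modal sequence: text tokens followed by made-up image tokens.
sequence = ["a", "red", "bus", "<img>", "i17", "i4", "i92", "</img>"]
masked = causal_mask_transform(sequence, span=(1, 2))
print(masked)
```

Because text and image tokens live in one sequence, the same left-to-right model can in principle be prompted with text and continue with image tokens, or the reverse, and the relocated span gives it an infilling task as well.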