ACE is an all-in-one creator and editor built on a diffusion transformer. It supports joint training of multiple visual generation tasks by expressing their inputs in a single unified format, the Long-context Condition Unit (LCU). To address the shortage of training data, ACE relies on efficient data collection methods and uses multimodal large language models to generate accurate textual instructions. It shows significant performance advantages in visual generation and makes it possible to build chat systems that respond to any image creation request, avoiding the cumbersome workflows typically employed by visual agents.