MIDI is an image-to-3D scene generation method that uses a multi-instance diffusion model to generate multiple 3D instances with accurate spatial relationships directly from a single image. Its core is a multi-instance attention mechanism, which captures inter-object interactions and spatial consistency within the denoising process itself, avoiding a complex multi-step pipeline of per-object generation followed by layout estimation. MIDI generalizes across synthetic data, real-world scene images, and stylized scene images produced by text-to-image diffusion models. Its main advantages are efficiency, high fidelity, and strong generalization.
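
To illustrate the idea behind multi-instance attention, the sketch below shows one plausible way such a layer could be structured: the latent tokens of all instances in a scene are flattened into a single joint sequence, so self-attention lets each object's tokens attend to every other object's tokens. This is a minimal PyTorch sketch of the general technique, not MIDI's actual implementation; the class name, tensor layout, and use of `nn.MultiheadAttention` are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MultiInstanceAttention(nn.Module):
    """Illustrative sketch (not MIDI's API): joint self-attention over
    the latent tokens of all instances in a scene, so inter-object
    interactions and spatial consistency can be modeled directly."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_instances, tokens_per_instance, dim)
        b, n, t, d = x.shape
        # Flatten all instances into one joint token sequence per scene,
        # so attention spans every object rather than one object at a time.
        joint = x.reshape(b, n * t, d)
        out, _ = self.attn(joint, joint, joint)
        # Restore the per-instance layout for downstream per-object decoding.
        return out.reshape(b, n, t, d)


# Usage example: a batch of 2 scenes, each with 4 object latents of 256 tokens
x = torch.randn(2, 4, 256, 64)
out = MultiInstanceAttention(dim=64)(x)
print(out.shape)  # torch.Size([2, 4, 256, 64])
```

The design point this sketch captures is that a single attention pass over the concatenated instance tokens replaces separate per-object generation plus a later layout-composition step.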