MDT explicitly enhances the ability of diffusion probabilistic models (DPMs) to learn relationships among object parts in an image by introducing a masked latent modeling scheme. During training, MDT operates in the latent space, masks a subset of latent tokens, and uses an asymmetric diffusion transformer to predict the masked tokens from the unmasked ones while preserving the diffusion generation process. MDTv2 further improves on MDT through a more efficient macro network structure and training strategies.
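To make the masked latent modeling step concrete, the following is a minimal sketch of MAE-style random token masking over a latent token sequence. It is an illustrative assumption, not MDT's actual implementation: the function name `mask_latent_tokens` and the mask ratio are hypothetical, and the real model additionally runs the diffusion process and an asymmetric transformer on these tokens.

```python
import numpy as np

def mask_latent_tokens(tokens, mask_ratio=0.3, rng=None):
    """Randomly mask a fraction of latent tokens (illustrative sketch).

    tokens: (N, D) array of N latent tokens of dimension D.
    Returns (visible_tokens, masked_indices, boolean_mask), where the
    model would be trained to predict the masked tokens from the
    visible ones.
    """
    rng = np.random.default_rng(rng)
    n = tokens.shape[0]
    n_masked = int(round(n * mask_ratio))
    # Choose which token positions to hide via a random permutation.
    perm = rng.permutation(n)
    masked_idx = np.sort(perm[:n_masked])
    mask = np.zeros(n, dtype=bool)
    mask[masked_idx] = True
    return tokens[~mask], masked_idx, mask

# Example: 16 latent tokens of dimension 4, masking 30% of them
# leaves 11 visible tokens and 5 masked positions.
tokens = np.arange(16 * 4, dtype=np.float32).reshape(16, 4)
visible, masked_idx, mask = mask_latent_tokens(tokens, mask_ratio=0.3, rng=0)
```

During training, the unmasked tokens (`visible`) would be fed to the encoder, while the positions in `masked_idx` mark the tokens the decoder must reconstruct alongside the usual diffusion objective.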