The 4M framework, jointly developed by EPFL (the Swiss Federal Institute of Technology in Lausanne) and Apple, addresses the challenge of training a single visual foundation model across many modalities. It uses a Transformer architecture with modality-specific tokenizers that map each input modality into a shared token space, improving scalability and efficiency. Training is performed with masked modeling over both inputs and targets: the model observes a random subset of tokens and learns to predict a disjoint masked subset. With this scheme, 4M performs well across a wide range of vision tasks, demonstrating significant potential.
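To make the input/target masking idea concrete, here is a minimal sketch (not the official 4M implementation): tokens from several modalities are pooled, then a small random input set and a disjoint target set are sampled for one training example. The modality names and token ids below are hypothetical.

```python
import random

def sample_masked_training_example(modality_tokens, n_input, n_target, seed=None):
    """Illustrative 4M-style input/target masking (a sketch, not the
    official implementation): flatten tokens from all modalities, then
    sample a small input set and a disjoint target set."""
    rng = random.Random(seed)
    # Flatten (modality, token) pairs so sampling spans all modalities.
    flat = [(m, t) for m, toks in modality_tokens.items() for t in toks]
    rng.shuffle(flat)
    inputs = flat[:n_input]                      # tokens the model sees
    targets = flat[n_input:n_input + n_target]   # tokens the model must predict
    return inputs, targets

# Hypothetical tokenized modalities (token ids are made up).
example = {
    "rgb": [101, 102, 103, 104],
    "depth": [201, 202, 203],
    "caption": [301, 302],
}
inputs, targets = sample_masked_training_example(example, n_input=3, n_target=2, seed=0)
```

Because both the visible and predicted subsets are small, each training step is cheap, which is one reason this style of masked multimodal pretraining scales well.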