Purportedly leaked screenshots of GPT-4.5, suggesting it can process 3D and video content, have recently drawn widespread attention, and research on multimodal large models has become a major trend. Against this backdrop, teams from the University of Pennsylvania, Salesforce Research, and Stanford have proposed X-InstructBLIP, a framework that enables low-cost cross-modal reasoning and showcases the emergent capabilities of multimodal language models. To validate X-InstructBLIP on discriminative cross-modal tasks, the research team also constructed the DisCRn challenge dataset.