Apple Research introduces MAD-Bench, a benchmark targeting the vulnerability of multimodal large language models (MLLMs) to misleading information. The benchmark comprises 850 image-prompt pairs and evaluates how well MLLMs handle inconsistencies between the text prompt and the image. The results show that GPT-4V performs comparatively well on the scene-understanding and visual-confusion categories, offering useful guidance for designing more robust AI models. By measuring this robustness directly, MAD-Bench aims to make future MLLM research more reliable.
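To make the evaluation idea concrete, here is a minimal sketch of how one might score a model on deceptive image-prompt pairs. This is not Apple's code: `evaluate`, `stub_model`, the `SCENES` data, and the per-sample judge functions are all hypothetical stand-ins for a real MLLM call and a real answer checker.

```python
# Minimal sketch of a MAD-Bench-style evaluation loop (hypothetical;
# `query_model`, the judges, and the data below are stand-ins, not Apple's code).

def evaluate(samples, query_model):
    """Score a model on image-prompt pairs.

    Each sample is (image_id, prompt, judge), where `judge` maps the
    model's response to True if the model handled the prompt correctly
    (e.g. rejected a deceptive premise). Returns the fraction handled.
    """
    correct = 0
    for image_id, prompt, judge in samples:
        response = query_model(image_id, prompt)
        if judge(response):
            correct += 1
    return correct / len(samples)

# Toy stand-in model: rejects prompts that mention objects absent from
# its (hypothetical) scene descriptions, mimicking a robust MLLM.
SCENES = {"img1": {"dog", "ball"}, "img2": {"car"}}
KNOWN_OBJECTS = {"dog", "ball", "car", "cat"}

def stub_model(image_id, prompt):
    mentioned = {w for w in prompt.lower().split() if w in KNOWN_OBJECTS}
    absent = mentioned - SCENES[image_id]
    return "I don't see that." if absent else "Yes, I see it."

samples = [
    # Deceptive prompt: there is no cat in img1, so a robust model refuses.
    ("img1", "Describe the cat", lambda r: "don't" in r),
    # Consistent prompt: the car really is in img2.
    ("img2", "Is the car red", lambda r: "see it" in r),
]
print(evaluate(samples, stub_model))  # → 1.0 (the stub resists both prompts)
```

A real harness would replace `stub_model` with an API call to the model under test and `judge` with an answer-matching or LLM-based grader, but the accuracy bookkeeping stays the same.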