Recently, the University of Maryland and the University of North Carolina at Chapel Hill jointly released Mementos, an image-sequence benchmark designed specifically for multimodal large language models. The benchmark aims to comprehensively evaluate these models' reasoning abilities over real-world, robotics, and comics image sequences. The results, however, are striking: models such as GPT-4V and Gemini achieve accuracy rates below 20% on the comics dataset. This exposes significant shortcomings in these models, including frequent hallucinations as well as weak object recognition and action understanding in image sequences.