Researchers from ByteDance Research Institute and Tsinghua University recently released a joint study indicating that current AI video generation models, such as OpenAI's Sora, can create stunning visuals but have significant flaws in understanding basic physical laws. The study has sparked widespread discussion about how well AI can simulate reality.
The research team tested AI video generation models in three scenarios: predictions under known patterns (in-distribution), predictions under unknown patterns (out-of-distribution), and new combinations of familiar elements (combinatorial generalization). Their goal was to determine whether these models have truly learned physical laws or are merely relying on surface features of their training data.
Through testing, the researchers found that these AI models did not learn universally applicable rules. Instead, they primarily relied on surface features such as color, size, speed, and shape when generating videos, following a strict hierarchy: color takes precedence, followed by size, speed, and shape.
In familiar scenarios, these models performed almost perfectly, but they broke down when faced with unfamiliar situations. One test in the study demonstrated the models' limitations in handling object motion. For example, when a model was trained on a fast-moving sphere oscillating back and forth and then tested on a slow-moving sphere, it generated a sphere that abruptly changed direction, as is clearly visible in the related video.
The researchers pointed out that simply increasing the model size or adding more training data does not solve the problem. Although larger models perform better on familiar patterns and combinations, they still fail to understand basic physical laws or handle scenarios outside their training range. Co-author Bingyi Kang noted, “If the data coverage is good enough in specific scenarios, it might form an overfitted world model.” However, such a model does not meet the true definition of a world model, as a genuine world model should be able to generalize beyond the training data.
Co-author Bingyi Kang demonstrated this limitation on X, explaining that when the team trained the model on a fast-moving ball traveling left to right and back, and then tested it with a slow-moving ball, the model showed the ball suddenly changing direction after just a few frames (visible at 1 minute and 55 seconds in the video).
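To make the idea of matching training cases rather than learning a rule more concrete, here is a minimal toy sketch in Python. It is not the researchers' code or model: the 1D bouncing ball, the wall positions, the training speeds, and every function name below are assumptions made purely for illustration. A nearest-neighbor lookup stands in for a model that imitates the closest training example, and it is compared against the true motion rule on a fast ball (seen in training) and a slow ball (never seen).

```python
# Toy illustration only: a lookup-based predictor vs. the true motion rule.
# All names, speeds, and wall positions are hypothetical choices.

def step(x, v, dt=1.0, lo=0.0, hi=10.0):
    """Ground-truth physics: uniform motion with elastic bounces off two walls."""
    x = x + v * dt
    if x > hi:
        x, v = 2 * hi - x, -v
    elif x < lo:
        x, v = 2 * lo - x, -v
    return x, v

# Training data: only fast balls (|v| in {3.0, 3.5, 4.0}) on a grid of positions.
TRAIN_SPEEDS = [3.0, 3.5, 4.0]
train = []
for speed in TRAIN_SPEEDS:
    for v in (speed, -speed):
        for i in range(21):          # positions 0.0, 0.5, ..., 10.0
            x = i * 0.5
            train.append(((x, v), step(x, v)))

def nearest_neighbor_predict(x, v):
    """Return the next state of the training example closest to (x, v)."""
    _, next_state = min(
        train,
        key=lambda pair: (pair[0][0] - x) ** 2 + (pair[0][1] - v) ** 2,
    )
    return next_state

def rollout_error(predict, x0, v0, steps=8):
    """Mean absolute position error of an autoregressive rollout vs. true physics."""
    tx, tv = x0, v0   # true state
    px, pv = x0, v0   # predicted state
    total = 0.0
    for _ in range(steps):
        tx, tv = step(tx, tv)
        px, pv = predict(px, pv)
        total += abs(px - tx)
    return total / steps

# Fast ball: every state it visits lies in the training set.
in_dist_error = rollout_error(nearest_neighbor_predict, x0=2.0, v0=3.5)
# Slow ball: a speed the predictor has never seen.
ood_error = rollout_error(nearest_neighbor_predict, x0=2.0, v0=1.0)

print("in-distribution error:", in_dist_error)
print("out-of-distribution error:", ood_error)
```

In this toy setup the lookup reproduces the fast ball's trajectory because every state along it appears in the training data, whereas the slow ball is snapped onto the nearest fast trajectory and drifts away from the true motion. That is only an analogy for the behavior the study describes, not a reproduction of its experiment.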
The findings of this study pose a challenge to OpenAI's Sora project. OpenAI has claimed that Sora has the potential to evolve into a true world model through continued scaling, and has even asserted that it already has a basic understanding of physical interactions and three-dimensional geometry. However, the researchers pointed out that scaling alone is insufficient for video generation models to discover fundamental physical laws.
Yann LeCun, Meta's chief AI scientist, also expressed skepticism, stating that trying to predict the world by generating pixels is "a waste of time and doomed to fail." Nevertheless, many people are still looking forward to the release of Sora, which OpenAI first unveiled in mid-February 2024 to showcase its video generation potential.
Key Points:
🌟 The study found that AI video generation models have significant flaws in understanding physical laws, relying on surface features of training data.
⚡ Increasing model size does not solve the problem; these models perform poorly in unknown scenarios.
🎥 OpenAI's Sora project faces challenges, as simply scaling up cannot achieve a true world model.