Behind the rapid development of artificial intelligence, a pressing problem is emerging: data is getting harder to acquire. A recent study led by researchers at MIT finds that web data, once freely accessible, is becoming increasingly restricted, posing a significant challenge to AI training and research.

The researchers found that the websites underlying major open-source pretraining corpora such as C4, RefinedWeb, and Dolma are rapidly tightening restrictions on crawling and reuse. This affects not only the training of commercial AI models but also research by academic and non-profit organizations.


The study was conducted jointly by lead researchers from the MIT Media Lab, Wellesley College, the AI startup Raive, and other institutions. They point out that data restrictions are proliferating and that asymmetries and inconsistencies in licensing are becoming increasingly pronounced.

The research team audited two signals of website consent: the Robots Exclusion Protocol (REP, expressed through robots.txt files) and websites' Terms of Service (ToS). They found that even crawlers operated by large AI companies such as OpenAI face increasingly stringent restrictions.
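To make this concrete, here is a minimal sketch of how robots.txt restrictions can be checked with Python's standard library. The domain is a placeholder, and the list of crawler user agents (such as OpenAI's GPTBot) is illustrative rather than the paper's exact audit list:

```python
# Minimal sketch: query a site's robots.txt for several AI crawler agents.
# Uses only the standard library; the target domain is a placeholder.
from urllib import robotparser

ROBOTS_URL = "https://example.com/robots.txt"  # hypothetical target site

# Crawler user agents of the kind discussed in the study (illustrative list).
AGENTS = ["GPTBot", "CCBot", "Google-Extended", "*"]

parser = robotparser.RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetch and parse the live robots.txt

for agent in AGENTS:
    allowed = parser.can_fetch(agent, "https://example.com/")
    print(f"{agent:16s} allowed to crawl root: {allowed}")
```

Repeating checks like this across many domains, and over archived snapshots of each robots.txt, is essentially how such an audit can be scaled up into a longitudinal picture of tightening restrictions.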


Using a SARIMA (seasonal ARIMA) time-series model, the team forecasts that restrictions on data access, whether declared in robots.txt or in ToS, will continue to increase. This suggests that open web data will only become harder to obtain.
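As an illustration of this forecasting approach, the sketch below fits a SARIMA model to a synthetic monthly series of restriction rates using statsmodels. The data, model order, and forecast horizon are all invented for demonstration and are not the paper's actual configuration:

```python
# Illustrative only: fit a SARIMA model to a synthetic monthly series of
# the share of web tokens under robots.txt restrictions, then forecast ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic monthly restriction rates with a steady upward drift plus noise.
dates = pd.date_range("2016-01", periods=96, freq="MS")
trend = np.linspace(0.01, 0.25, 96)
noise = np.random.default_rng(0).normal(0, 0.01, 96)
series = pd.Series(trend + noise, index=dates)

# A SARIMA(1,1,1)x(1,0,0,12) specification; the order is chosen arbitrarily
# for this sketch, not taken from the paper.
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 0, 0, 12))
fit = model.fit(disp=False)

# Forecast 24 months ahead: under a persistent trend, restrictions keep rising.
forecast = fit.get_forecast(steps=24)
print(forecast.predicted_mean.tail())
```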

The study also found a mismatch between the data crawled from the web and the purposes for which AI models are actually used, with implications for model alignment, data collection practices, and copyright.

The research team calls for more flexible protocols that reflect website owners' wishes: separating permitted from prohibited use cases and staying synchronized with each site's Terms of Service. They also hope that AI developers will remain able to train on data from the open web and that future legislation will support this.
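No such protocol exists as a standard today, but a hypothetical extension of robots.txt syntax suggests what separating use cases might look like. The X-Usage directives below are invented purely for illustration and are not part of any real specification:

```text
# Hypothetical extension of robots.txt syntax -- not a real standard.
# Sketches the paper's call for consent signals that distinguish use cases
# instead of offering a single allow/deny switch.
User-Agent: *
Allow: /

# Invented directives: declare consent per use case, in sync with the ToS.
X-Usage-Allow: search-indexing, academic-research
X-Usage-Disallow: commercial-ai-training
X-Terms: https://example.com/terms
```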

Paper link: https://www.dataprovenance.org/Consent_in_Crisis.pdf