Harvard University announced on Thursday that it will publicly release a high-quality dataset of nearly one million public domain books, which anyone can use to train large language models and other AI tools. The dataset was created by Harvard's newly established Institutional Data Initiative with funding from Microsoft and OpenAI, and it consists of books scanned as part of the Google Books project that are no longer protected by copyright.
The dataset is roughly five times the size of the infamous Books3 dataset that was used to train AI models such as Meta's Llama. It spans a wide range of genres, eras, and languages, including classics by Shakespeare, Charles Dickens, and Dante alongside lesser-known works like Czech mathematics textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, said the project's goal is to "create a level playing field" by giving the public, including small players in the AI industry and individual researchers, access to the sort of highly refined, curated content repository that has typically been available only to established tech giants. "It has gone through rigorous scrutiny," he said.
Leppert believes the new public domain database can be used alongside other licensed materials to build AI models. "I think it's a bit like how Linux became the foundational operating system in many areas of the world," he said, noting that companies would still need additional training data to differentiate their models from those of competitors.
Burton Davis, Microsoft's vice president and deputy general counsel for intellectual property, emphasized that the company's support for the project reflects its broader belief in creating "accessible data pools" for AI startups that are "public interest-oriented." In other words, Microsoft does not necessarily plan to swap out the AI training data in its own models for public domain alternatives such as the books in Harvard's new database. "We use publicly available data to train our models," Davis said.
With dozens of lawsuits over the use of copyrighted data to train AI currently working their way through the courts, the future of how AI tools are built remains uncertain. If AI companies prevail, they will be able to keep scraping the internet without entering into licensing agreements with copyright holders. If they lose, they may be forced to completely overhaul how they build their models. Projects like the Harvard database are moving forward at an unprecedented pace, on the assumption that, whatever the outcome, there will be demand for public domain datasets.
Beyond the vast trove of books, the Institutional Data Initiative has also partnered with the Boston Public Library to scan millions of public domain newspaper articles, and the organization says it is open to similar collaborations in the future. Exactly how the book dataset will be released has yet to be determined. The Institutional Data Initiative has asked Google to take part in its public distribution, and while the search giant has not publicly committed to hosting the dataset, Harvard remains optimistic. (Google did not respond to WIRED's request for comment.)
However the IDI dataset is ultimately released, it will join a wave of similar projects, startups, and initiatives that promise to give companies access to large volumes of high-quality AI training material without the risk of running into copyright problems. Companies like Calliope Networks and ProRata have emerged to issue licenses and design compensation schemes meant to ensure that creators and rights holders are paid for providing AI training data.
There are other new public domain projects as well. Last spring, the French AI startup Pleias launched its own public domain dataset, Common Corpus, which contains a collection of roughly three to four million books and journal articles, according to project coordinator Pierre-Carl Langlais. Backed by the French Ministry of Culture, Common Corpus has been downloaded more than 60,000 times this month alone on the open-source AI platform Hugging Face. Last week, Pleias announced it would release its first large language models trained on the dataset, which Langlais told WIRED are "the first models ever trained entirely on open data and compliant with the [EU] AI Act."
Work is also underway on similar image datasets. The AI startup Spawning released one this summer called Source.Plus, which contains public domain images drawn from Wikimedia Commons as well as a range of museums and archives. Major cultural institutions, such as the Metropolitan Museum of Art, have also long made their archives openly available to the public as independent projects.
Ed Newton-Rex, a former Stability AI executive who now runs a nonprofit that certifies ethically trained AI tools, said the rise of these datasets shows that high-quality AI models can be built without stealing copyrighted material. OpenAI previously told lawmakers in the UK that it would be "impossible" to create products like ChatGPT without using copyrighted works. "Large public domain datasets like this further undermine the 'necessity defense' that some AI companies use to justify scraping copyrighted works to train their models," Newton-Rex said.
However, he remains cautious about whether IDI and projects like it will truly change the training landscape. "These datasets will only have a positive impact if they are used in conjunction with other licensed data to replace scraped copyrighted works. If they are simply added to mixed datasets that also include the life's work of creators around the world, taken without permission, they will primarily benefit AI companies," he said.