Harvard University recently announced plans to release a dataset consisting of nearly one million public domain books, which anyone can use to train large language models and other AI tools.

This project is led by Harvard's newly established Institutional Data Initiative and is funded by Microsoft and OpenAI. The dataset includes scanned books from the Google Books project, featuring classic works by authors such as Shakespeare, Dickens, and Dante, as well as some obscure Czech mathematics textbooks and Welsh dictionaries.

AI Teaching Assistant Robot

Image Source Note: Image generated by AI, licensed by service provider Midjourney

This dataset is five times larger than the so-called "Books3 dataset" and aims to provide a fair competitive environment in the AI field, allowing the public, especially small AI companies and individual researchers, access to high-quality data that is typically only available to large tech companies. Greg Leppert stated that the project underwent strict selection and careful curation.

Burton Davis, Vice President at Microsoft, emphasized that the purpose of Microsoft's support for this project is to create an "accessible data pool" for startups and ensure that this data is managed "in the public interest." Tom Rubin, OpenAI's head of intellectual property, also expressed the company's pleasure in supporting this initiative.

As lawsuits regarding the use of copyrighted data in AI continue to rise, projects like this public domain dataset from Harvard are becoming an important source of training data for AI. While it is still unclear how exactly this dataset will be released, it is expected to provide companies with a wealth of high-quality data while avoiding copyright issues.

Harvard's Institutional Data Initiative is not limited to books; it has also collaborated with the Boston Public Library to scan millions of public domain newspaper articles and plans to engage in similar partnerships in the future. Additionally, Harvard is working with Google to discuss how to facilitate public distribution of the dataset.

This project will join several similar initiatives that also promise to provide high-quality AI training materials while avoiding copyright risks. In the future, as more public domain datasets emerge, AI companies will have more options to train their models while reducing legal risks related to copyright.