Harvard University has announced the release of a groundbreaking dataset comprising nearly one million public-domain books, available for anyone to use in training large language models and other AI tools.
Spearheaded by Harvard’s Institutional Data Initiative (IDI) and funded by Microsoft and OpenAI, this initiative includes texts digitised through Google’s book-scanning project that are no longer under copyright protection.
The dataset is approximately five times larger than the infamous Books3 dataset, which was used to train models like Meta’s Llama. It features a diverse range of genres, time periods, and languages, showcasing works from iconic authors such as Shakespeare and Dickens alongside lesser-known texts.
Greg Leppert, executive director of the IDI, says the project aims to “level the playing field” in the AI industry by giving smaller players and individual researchers access to high-quality content typically available only to major tech companies.
He likens the public-domain database to Linux, a foundational resource that others build on, and notes that companies will still need additional training data to distinguish their models.
Microsoft’s Burton Davis emphasised that supporting this project aligns with the company’s commitment to creating “pools of accessible data” for AI startups. The release comes while legal battles over the use of copyrighted data for AI training are still ongoing, and the project assumes that demand for public-domain datasets will continue regardless of how those cases are decided.
In addition to this extensive book collection, the IDI is collaborating with the Boston Public Library to digitise millions of articles now in the public domain.
As similar projects emerge globally, including Pleias’s Common Corpus in France and Spawning’s Source.Plus for images, experts argue that these initiatives could reshape AI training practices by providing valuable resources without infringing copyright. However, some caution that such datasets will make a difference only if they are used to replace copyrighted materials rather than merely supplement them.