Researchers from the University of Washington, University of Copenhagen, and Stanford have provided new evidence suggesting that OpenAI’s models, such as GPT-4, likely memorised sections of copyrighted works during training.
This finding lends support to legal actions being taken against OpenAI, with various rights-holders accusing the company of unauthorised use of their creations. OpenAI maintains that its actions are justified under the “fair use” doctrine, while simultaneously lobbying for legislative changes to align with its practices.
How Researchers Caught the Models in the Act
The study, which investigates how large language models retain data, developed a method to probe the internal workings of models like GPT-3.5 and GPT-4. These models function as prediction engines, generating content by learning patterns in large volumes of text. While most outputs are original combinations of learned patterns, some portions of data are reproduced verbatim — a known vulnerability in large AI systems.
To identify memorisation, the researchers used a technique based on “high-surprisal” words — terms that are statistically rare in a given context. An example highlighted in the study is the sentence: “Jack and I sat perfectly still with the radar humming.” Here, “radar” is a high-surprisal word because it is less expected than alternatives like “engine” or “radio.”
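For readers who want the quantity behind the term: in information theory, a word’s surprisal under a language model is its negative log-probability in context (our notation, not the paper’s), so rarer continuations score higher.

```latex
\mathrm{surprisal}(w \mid \text{context}) = -\log p(w \mid \text{context})
```

In the example above, p(radar | context) is much smaller than p(engine | context), which is what makes “radar” a good candidate for masking.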
By masking these unusual words in snippets from books and news articles, the researchers tested whether the models could accurately guess them. If a model predicted the masked words correctly, it strongly suggested memorisation rather than generalisation.
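To make the probing idea concrete, here is a minimal sketch of such a masking test — not the authors’ implementation. It assumes the openai Python package (v1.x) and an API key in OPENAI_API_KEY; the prompt wording, the [MASK] token, and the helper name are illustrative choices, not details from the study.

```python
# Minimal sketch of a high-surprisal masking probe (illustrative, not the authors' code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def guess_masked_word(passage_with_mask: str, model: str = "gpt-4") -> str:
    """Ask the model to fill in the single [MASK] token in a passage."""
    prompt = (
        "One word in the following passage has been replaced with [MASK]. "
        "Reply with only the missing word.\n\n" + passage_with_mask
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output keeps repeated probes comparable
    )
    return response.choices[0].message.content.strip()

# The study's example sentence, with the high-surprisal word "radar" masked out.
passage = "Jack and I sat perfectly still with the [MASK] humming."
guess = guess_masked_word(passage)
print(guess, guess.lower() == "radar")  # an exact match hints at memorisation
```

Because “radar” is hard to guess from context alone, a correct completion is more plausibly explained by the model having seen the original passage during training than by generalisation.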
Evidence of Memorised Content from Books and News
According to the study, GPT-4 displayed clear signs of memorising passages from fiction, including material that likely originated from a dataset known as BookMIA, which contains copyrighted e-books. The model also exhibited signs of memorising portions of New York Times articles, though at a lower frequency. GPT-3.5 was also tested, but the study’s findings focused more heavily on GPT-4.
A Call from the Research Community
“In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically,” said Abhilasha Ravichander, a doctoral student at the University of Washington and co-author of the study.
“Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem.”
Legal pressure surrounding AI’s use of unlicensed content is intensifying, prompting new research that challenges the current approach to model training. The study highlights the need for more visibility into developers’ training practices and signals a deeper industry-wide dilemma: how to strike a balance between technological innovation and intellectual property protection.
OpenAI’s Defence and Industry-Wide Norms
OpenAI has taken steps to address concerns over content use, securing licensing agreements and offering opt-out options for rights-holders. Nevertheless, the company continues to advocate for a broader interpretation of fair use in AI-related contexts, arguing that looser restrictions are essential for pushing the boundaries of AI development.
Conclusion
The resolution of the conflict between proprietary content rights and the data-hungry nature of machine learning is no longer a distant concern—it is central to the future of AI development. With lawsuits underway and policymakers considering new regulations, the legal system may be slow to react, but the surge in AI auditing tools is forcing transparency into the spotlight, signalling an irreversible shift.