For years, people building powerful artificial intelligence systems have used massive amounts of text, images, and videos scraped from the internet to train their models.
Today, this data is running out.
Over the past year, many of the largest web sources used to train AI models have restricted the use of their data, according to a study released this week by the Data Provenance Initiative, an MIT-led research group.
The study, which examined 14,000 web domains included in three commonly used AI training datasets, found an “emerging consent crisis” as publishers and online platforms took steps to prevent their data from being collected.
The researchers estimate that across the three datasets, called C4, RefinedWeb, and Dolma, 5% of all data and 25% of data from the highest-quality sources were restricted. Those restrictions are expressed through the Robots Exclusion Protocol, a decades-old convention in which a website's owner publishes a file called robots.txt telling automated bots which pages they may not crawl.
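To illustrate the mechanism the study measured, here is a minimal Python sketch, not taken from the study itself, that uses the standard library's urllib.robotparser to check whether a given bot may fetch a page. The robots.txt contents and the URL are hypothetical examples, with GPTBot standing in for an AI training crawler.

```python
from urllib import robotparser

# Hypothetical robots.txt of the kind the study describes: the site
# turns away an AI training crawler while leaving other bots alone.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# The AI crawler is asked to stay off the whole site...
print(parser.can_fetch("GPTBot", "https://example.com/article.html"))        # False
# ...while an ordinary crawler is still permitted.
print(parser.can_fetch("SomeOtherBot", "https://example.com/article.html"))  # True
```

Notably, robots.txt is a voluntary convention rather than a technical barrier: a crawler that ignores the file can still fetch the pages, which is why the study frames these restrictions as a question of consent.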
The study also found that up to 45% of the data in one set, C4, had been restricted by websites’ terms of service. “We’re seeing a rapid decline in consent to use data on the web, which will impact not only AI companies, but researchers, academics and non-commercial entities,” Shayne Longpre, the study’s lead author, said in an interview.
Data is the main ingredient in today’s generative AI systems, fueled by billions of examples of text, images, and videos. Much of this data is scraped from public websites by researchers and compiled into large datasets, which can be freely downloaded and used, or supplemented with data from other sources. Learning from this data is what allows generative AI tools like OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude to write, code, and generate images and videos. The more high-quality data these models are fed, the better their results tend to be.
For years, AI developers have been able to collect data relatively easily. But the rise of generative AI in recent years has led to tensions with the owners of that data, many of whom object to their content being used as AI training material, or at least want to be paid for it. Faced with mounting criticism, some publishers have put up paywalls or changed their terms of service to limit the use of their data for AI training. Others have blocked the automated web crawlers used by companies like OpenAI, Anthropic, and Google.
Sites like Reddit and Stack Overflow have started charging AI companies for access to their data, and a few publishers have filed lawsuits, including The New York Times, which sued OpenAI and Microsoft for copyright infringement last year, alleging that the companies used its news articles to train their models without permission.
In recent years, companies like OpenAI, Google, and Meta have gone to great lengths to collect more data to improve their systems. More recently, some AI companies have struck deals with publishers, including The Associated Press and News Corp, the owner of The Wall Street Journal, that give the companies ongoing access to those publishers' content.
©2024 New York Times Press Service
First published: July 19, 2024 | 11:18 p.m. IST