For years, people building powerful artificial intelligence systems have used massive amounts of text, images, and videos scraped from the internet to train their models.
Today, this data is running out.
Over the past year, many of the largest web sources used to train AI models have restricted the use of their data, according to a study released this week by the Data Provenance Initiative, an MIT-led research group.
The study, which examined 14,000 web domains included in three commonly used AI training datasets, found an “emerging consent crisis” as publishers and online platforms took steps to prevent their data from being collected.
The researchers estimate that across the three datasets, called C4, RefinedWeb, and Dolma, 5% of all data and 25% of data from the highest-quality sources were restricted. Those restrictions are expressed through the Robots Exclusion Protocol, a decades-old convention in which a website's owner publishes a file called robots.txt telling automated bots which pages they may not crawl.
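To illustrate the mechanism the study measured, here is a minimal Python sketch, not taken from the study itself, that uses the standard library's urllib.robotparser to check whether a given bot may fetch a page. The robots.txt contents and the URL are hypothetical examples, with GPTBot standing in for an AI training crawler.

```python
from urllib import robotparser

# Hypothetical robots.txt of the kind the study describes: the site
# turns away an AI training crawler while leaving other bots alone.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# The AI crawler is asked to stay off the whole site...
print(parser.can_fetch("GPTBot", "https://example.com/article.html"))        # False
# ...while an ordinary crawler is still permitted.
print(parser.can_fetch("SomeOtherBot", "https://example.com/article.html"))  # True
```

Notably, robots.txt is a voluntary convention rather than a technical barrier: a crawler that ignores the file can still fetch the pages, which is why the study frames these restrictions as a question of consent.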
The study also found that up to 45% of the data in one set, C4, had been restricted by websites’ terms of service. “We’re seeing a rapid decline in consent to use data on the web, which will impact not only AI companies, but researchers, academics and non-commercial entities,” Shayne Longpre, the study’s lead author, said in an interview.
Data is the main ingredient in today’s generative AI systems, fueled by billions of examples of text, images, and videos. Much of this data is scraped from public websites by researchers and compiled into large datasets, which can be freely downloaded and used, or supplemented with data from other sources. Learning from this data is what allows generative AI tools like OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude to write, code, and generate images and videos. The more high-quality data these models are fed, the better their results tend to be.
For years, AI developers have been able to collect data relatively easily. But the rise of generative AI in recent years has led to tensions with the owners of that data, many of whom object to their content being used as AI training material, or at least want to be paid for it. Faced with mounting criticism, some publishers have put up paywalls or changed their terms of service to limit the use of their data for AI training. Others have blocked the automated web crawlers used by companies like OpenAI, Anthropic, and Google.
Sites like Reddit and Stack Overflow have started charging AI companies for access to their data, and a few publishers have filed lawsuits, including The New York Times, which sued OpenAI and Microsoft for copyright infringement last year, alleging that the companies used its news articles to train their models without permission.
In recent years, companies like OpenAI, Google, and Meta have gone to great lengths to collect more data to improve their systems. More recently, some AI companies have struck deals with publishers, including The Associated Press and News Corp, the owner of The Wall Street Journal, that give the companies ongoing access to those publishers' content.
©2024 New York Times Press Service
First published: July 19, 2024 | 11:18 p.m. IST