Is training AI on copyrighted works ethical – or legal?

Perhaps the only industry hotter than artificial intelligence right now? AI litigation.

Just a sample: writer Michael Chabon is suing Meta. Getty Images is suing Stability AI. And both the New York Times and the Authors Guild have filed separate lawsuits against OpenAI and Microsoft.

At the heart of these cases is the allegation that tech companies illegally used copyrighted works as part of their AI training data.

For text-driven generative AI, there’s a good chance that some of this training data comes from one massive archive: Common Crawl.

“Common Crawl is the copy of the Internet. This is an Internet archive dating back 17 years. We make this information freely available to researchers, academics and businesses,” said Rich Skrenta, who runs the nonprofit Common Crawl Foundation.

Since 2007, Common Crawl has saved 250 billion web pages, all in downloadable data files. Until recently, some of its biggest users were academics, exploring topics such as online hate speech and government censorship.

But now there is another major user.

“Researchers told me that LLMs wouldn’t exist without Common Crawl,” Skrenta said.

LLM stands for large language model, essentially the algorithm behind AI products like ChatGPT.

LLMs must ingest huge amounts of text to learn the rhythm and structure of language, so they can write a convincing essay or human-sounding wedding vows.

OpenAI, Google and Meta all used versions of Common Crawl in their early AI research.

Unless your 2009 “Glee” fanfiction blog is behind a paywall or has code telling Common Crawl to look away, chances are it’s in Common Crawl, although there is no simple way to check.
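For readers curious what that "look away" code can look like: one common mechanism is a robots.txt file that names a crawler and disallows it. The sketch below is illustrative only. It assumes the crawler identifies itself as "CCBot" (the user-agent token Common Crawl publishes for its crawler) and uses a made-up robots.txt and a placeholder example.com URL. It checks the rules with Python's standard-library parser rather than anything Common Crawl itself runs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative blog. "CCBot" is assumed to be
# the user-agent token of Common Crawl's crawler; every other bot is allowed.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL, not a real blog.
PAGE = "https://example.com/glee-fanfic/"

ccbot_allowed = is_allowed(ROBOTS_TXT, "CCBot", PAGE)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", PAGE)
print("CCBot allowed:", ccbot_allowed)
print("Other bots allowed:", other_allowed)
```

A rule like this only asks a crawler to skip future visits, and only works if the crawler chooses to honor it. It does not remove pages that were archived earlier.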

After the release of ChatGPT, Skrenta says, the number of websites blocking Common Crawl from archiving their content doubled. And there has been a sharp increase in requests to delete existing records.

Skrenta says that by posting something on the internet without explicitly telling bots to avoid it, you consent to its use by AI.

“You put your information on the Internet intentionally so people could come and see it. And robots are people too,” Skrenta said.

Common Crawl isn’t the only text used to train AI. Researcher Luca Soldaini of the nonprofit Allen Institute for AI says we used to know a lot more about the training data tech companies used.

But that was before OpenAI got a valuation of $100 billion.

“It’s not in their interest to tell us what’s in there, both from a competitive and legal perspective,” Soldaini said.

Most major AI companies allow web publishers to opt out of future AI training data. But Soldaini says that if companies were forced to retrain their current AI models without any of the material users want removed, it would take an enormous amount of time and money.

And without all that copyrighted work to learn from, AI might just stink.

Tech companies say taking copyrighted material to train AI is legally “fair use”: AI systems should be able to read and learn from the internet, just like humans do.

But beyond the legal debate, there is also an ethical debate.

“Every creator among us grew up fully knowing and fully accepting that when we create, when we put that out into the world, people will learn from it,” said Ed Newton-Rex, founder of the nonprofit startup Fairly Trained. “We didn’t expect big companies to take this and train on it and create these scalable systems. None of this is part of the social contract.”

Fairly Trained certifies AI systems that only use training data licensed or approved by their human creators.

Newton-Rex hopes the certification will allow consumers to decide which AI systems reflect their values, much like a fair trade sticker for robots.

“I don’t think people realize that when they use something like ChatGPT, they are using a model trained in that way, trained on the work of many people without their consent, often without their knowledge and without compensation,” Newton-Rex said.

