Perhaps the only industry hotter than artificial intelligence right now? AI litigation.
Just a sample: writer Michael Chabon is suing Meta. Getty Images is suing Stability AI. And both the New York Times and the Authors Guild have filed separate lawsuits against OpenAI and Microsoft.
At the heart of these cases is the allegation that tech companies illegally used copyrighted works as part of their AI training data.
For text-driven generative AI, there’s a good chance that some of this training data comes from one massive archive: Common Crawl.
“Common Crawl is the copy of the Internet. This is an Internet archive dating back 17 years. We make this information freely available to researchers, academics and businesses,” said Rich Skrenta, who runs the nonprofit Common Crawl Foundation.
Since 2007, Common Crawl has saved 250 billion web pages, all in downloadable data files. Until recently, some of its biggest users were academics, exploring topics such as online hate speech and government censorship.
But now there is another major user.
“Researchers told me that LLMs wouldn’t exist without Common Crawl,” Skrenta said.
LLM stands for large language model, essentially the algorithm behind AI products like ChatGPT.
LLMs must ingest huge amounts of text to learn the rhythm and structure of language, so they can write a compelling essay or human-sounding wedding vows.
OpenAI, Google and Meta all used versions of Common Crawl in their early AI research.
Unless your 2009 “Glee” fanfiction blog is paywalled or has code telling Common Crawl to look away, chances are it’s in Common Crawl, though there is no simple way to check.
Skrenta says that after the release of ChatGPT, the number of websites blocking Common Crawl from archiving their content doubled. And there has been a sharp increase in requests to delete existing records.
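For readers curious what "code telling Common Crawl to look away" can look like in practice, here is a minimal sketch. It assumes the block is done through a site's robots.txt file and that Common Crawl's crawler identifies itself as CCBot. The robots.txt contents, the example URL and the helper name `is_allowed` are hypothetical, made up for illustration; the check uses Python's standard-library robots.txt parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example blog: CCBot (assumed here to be
# Common Crawl's crawler user agent) is blocked from the whole site,
# while every other bot is allowed.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/glee-fanfic/chapter-1"  # hypothetical page
    print("CCBot allowed:", is_allowed(ROBOTS_TXT, "CCBot", url))
    print("Other bot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

A rule like this only asks a crawler to stay away from future visits. It does not remove pages that are already in an archive, which is where deletion requests come in.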
Skrenta says that by posting something on the internet without explicitly telling bots to avoid it, you consent to its use by AI.
“You put your information on the Internet intentionally so people could come and see it. And robots are people too,” Skrenta said.
Common Crawl isn’t the only text used to train AI. Researcher Luca Soldaini of the nonprofit Allen Institute for AI says we used to know a lot more about the training data tech companies used.
But that was before OpenAI got a valuation of $100 billion.
“It’s not in their interest to tell us what’s in there, both from a competitive and legal perspective,” Soldaini said.
Most major AI companies allow web publishers to opt out of future AI training data. But Soldaini says that if companies were forced to retrain their current AI models without any material that users wanted removed, it would take an enormous amount of time and money.
And without all that copyrighted work to learn from, AI might just stink.
Tech companies say taking copyrighted material to train AI is legally “fair use”: AI systems should be able to read and learn from the Internet, just as humans do.
But beyond the legal debate, there is also an ethical debate.
“Every creator among us grew up fully knowing and fully accepting that when we create, when we put that out into the world, people will learn from it,” said Ed Newton-Rex, founder of the nonprofit startup Fairly Trained. “We didn’t expect big companies to take this on and train on it and create these scalable systems. None of this is part of the social contract.”
Fairly Trained certifies AI systems that only use training data licensed or approved by their human creators.
Newton-Rex hopes the certification will allow consumers to decide which AI systems reflect their values, much like a fair trade sticker for robots.
“I don’t think people realize that when they use something like ChatGPT, they are using a model trained in that way, trained on the results of many people without their consent, often without their knowledge and without compensation,” Newton-Rex said.