Recent New York Times investigative reporting has shed new light on the ethics of developing artificial intelligence systems at OpenAI, Microsoft, Google and Meta. It revealed that, in creating the latest generative AI, these companies changed their own privacy policies and considered flouting copyright law in order to ingest the billions of words available on the Internet.
More importantly, the reporting reiterated claims by current industry leaders, like Sam Altman, the high-profile CEO of OpenAI, that the main problem facing the development of more advanced AI is that these systems will soon run out of available data to devour. So the biggest AI companies in the world are increasingly turning to “synthetic data,” or information generated by AI itself rather than by humans, to continue training their systems.
As a technology policy expert, I believe the use of synthetic data presents one of the biggest ethical issues for the future of AI. Using it to train new AI only compounds the bias problems of the past. And, coupled with generative AI’s tendency to create false information, the use of synthetic data risks leading AI into its own dangerous feedback loop.
AI is only as good as the data it is trained on. Or, as the computing adage goes: garbage in, garbage out. Years before the release of ChatGPT, the pioneering Internet studies scholar and endowed professor Safiya Noble argued that early search algorithms displayed biases rooted in “data discrimination” that produced racist and sexist results. And Rashida Richardson, an AI policy pioneer and civil rights advocate, wrote that, owing to historical practices of segregation and discrimination, the training databases available for early predictive AI systems were often full of “dirty data,” or “inaccurate, skewed, or systemically biased” information.
Timnit Gebru, a computer scientist and AI researcher, warned that the racist and misogynistic views prevalent on the Internet were “overrepresented in the training data” of early AI language models. She predicted that by “encoding bias,” generative AI would be deployed “to further amplify bias and prejudice.” Though Gebru was ousted from Google over her research, her warning proved prescient as more powerful generative AI was quickly released to the world in 2022.
Last month, a year after the release of Copilot Designer, Microsoft’s AI image generator, an engineer named Shane Jones confirmed that the foresight of Noble, Richardson and Gebru had become reality and urged the company to withdraw the product from public use. Jones said that while testing the AI system, he found that without much prompting it was easily capable of generating volumes of racist, misogynistic and violent content. He also said the ease of creating these images gave him “a glimpse of what the training dataset probably was.”
Not only do AI systems consume and reproduce bias, but AI trained on biased data also has a tendency to “hallucinate,” or generate incomplete or entirely inaccurate information. Recently, AI chatbots have fabricated nonexistent court cases cited as legal precedent and invented fake academic citations, complete with authors, dates and journal names, for research purposes. AI chatbots have also encouraged business owners to break the law and offered fictitious discounts to airline customers. These hallucinations are so widespread that the Washington Post has observed that they “look more like a feature than a bug.”
Current generative AI systems have made clear that, given their original training data, they will reproduce biases and create false information. Training new systems on synthetic data would mean constantly feeding those biased and inaccurate outputs back into the systems as fresh training data. Without intervention, this cycle ensures that the technology will only double down on its own biases and inaccuracies. One need only look at the echo chambers of hate speech and misinformation created by less intelligent social media technology to understand where such an infinite loop can lead.
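To make that feedback loop concrete, here is a minimal, purely illustrative Python sketch of the dynamic described above. Every number in it, and the reduction of “documents” to a single biased-or-clean bit, is my own assumption for illustration, not a measurement of any real system. Each generation is “trained” only on the previous generation’s output; because hallucination only ever adds false content and never removes it, the share of bad data ratchets upward.

```python
# Toy simulation of a synthetic-data feedback loop.
# A "document" is 1 (biased/false) or 0 (clean); all rates are illustrative.
import random

def train_and_generate(corpus, n_samples, hallucination_rate=0.02):
    """'Train' by estimating how common biased content is in the corpus,
    then generate synthetic documents reflecting that estimate, plus a
    small chance of newly hallucinated (false) content."""
    bias_rate = sum(corpus) / len(corpus)
    synthetic = []
    for _ in range(n_samples):
        doc = 1 if random.random() < bias_rate else 0
        if random.random() < hallucination_rate:
            doc = 1  # hallucination only ever adds false content
        synthetic.append(doc)
    return synthetic

corpus = [1] * 10 + [0] * 90  # generation 0: 10% biased human-written data
for gen in range(1, 11):
    corpus = train_and_generate(corpus, n_samples=100)
    share = 100 * sum(corpus) // len(corpus)
    print(f"generation {gen}: {share}% of the training data is biased or false")
```

Run repeatedly, the share of biased data drifts toward 100 percent: the system has no mechanism for recovering clean, human-grounded data once it starts consuming its own output.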
I believe that, now more than ever, it is time for people to organize and demand that AI companies pause their march toward deploying more powerful systems and instead work to fix the technology’s current failures. Though that may seem far-fetched, in February Google decided to suspend features of its AI chatbot after it became embroiled in a public scandal. And last month, following reports of a rise in scams using cloned voices of loved ones to demand ransom, OpenAI announced that it would not release its new AI voice generator, citing its “potential for synthetic voice misuse.”
But I believe that society cannot rely solely on the promises of American technology companies that have a history of placing profits and power above people. That is why I argue that Congress must create an agency to regulate the industry. In the field of AI, this agency should address potential harms by banning the use of synthetic data and requiring companies to audit and clean the original training data used by their systems.
AI has become an integral part of our lives. If we take the time to right the wrongs of the past and create new ethical guidelines and guardrails, it does not have to become an existential threat to our future.
Anika Collier Navaroli is a senior fellow at Columbia University’s Tow Center for Digital Journalism and a Public Voices Fellow on technology in the public interest with the OpEd Project. She previously held senior policy positions at Twitter and Twitch.