Once touted as a replacement for Google Search, Perplexity AI has found itself in hot water for allegedly plagiarizing news articles without providing proper sourcing. In early June, the AI-powered generative search engine was threatened with legal action by Forbes for allegedly plagiarizing its work. Subsequently, an investigation by Wired alleged that Perplexity AI could also freely copy online content from other major news sites.
Since then, several AI companies have come under scrutiny for allegedly circumventing paywalls and technical standards put in place by publishers to prevent their online content from being used to train AI models and generate summaries.
While Aravind Srinivas, CEO of Perplexity AI, said a third-party service was to blame, the controversy surrounding the AI startup is the latest flashpoint between news publishers, who allege their content is being copied without permission, and AI companies, who argue they should be allowed to use it.
How did it all start?
A graduate of IIT Madras, Aravind Srinivas worked at leading technology companies such as Google DeepMind and OpenAI before launching Perplexity, which sought to disrupt the way search results are presented to users by responding to queries with personalized, AI-generated answers.
Perplexity AI does this by “crawling the web, extracting relevant sources, using only the content from those sources to answer the question, and always telling the user where the answer comes from through citations or references,” Srinivas told The Indian Express in an interview.
Perplexity was thus seen as a small player taking on tech giants like Google and Microsoft in the search engine market. However, things took a different turn when it launched a feature called “Pages”, which allowed users to enter a prompt and receive a researched, AI-generated report that cited its sources and could be published as a web page to share with anyone.
A few days after the launch, the Perplexity team published an AI-generated “Page” based on an exclusive Forbes article about former Google CEO Eric Schmidt’s involvement in a secret military drone project. The US-based publication claimed that the language used in Perplexity’s AI-generated summary closely mirrored its paywalled article, pointed out that the article’s graphics had also been copied, and further alleged that Forbes had not been cited prominently enough.
Why is Perplexity receiving criticism from publishers?
In addition to allegedly plagiarizing articles and bypassing paywalls, Perplexity has also been accused of not adhering to accepted web standards such as robots.txt files.
According to cybersecurity company Cloudflare, “a robots.txt file contains instructions for robots that tell them which web pages they can and cannot access.”
The robots.txt file primarily applies to web crawlers, such as those used by Google to crawl the Internet and index content for displaying search results. The site administrator can set specific rules so that web crawlers do not process data from restricted web pages or directories.
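To illustrate how a rule-abiding crawler is supposed to behave, here is a minimal sketch using Python’s standard `urllib.robotparser`. The robots.txt rules, the `GPTBot` user agent, and the `example.com` URLs are hypothetical, chosen only to show the mechanism:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the "GPTBot" crawler entirely,
# and block everyone from the /private/ directory.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler checks can_fetch() before requesting a page.
print(rp.can_fetch("GPTBot", "https://example.com/article"))        # False
print(rp.can_fetch("NewsReaderBot", "https://example.com/article"))  # True
print(rp.can_fetch("NewsReaderBot", "https://example.com/private/x"))  # False
```

The key point is that this check is entirely voluntary: nothing stops a bot from skipping `can_fetch()` and requesting the page anyway, which is the behavior publishers are complaining about.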
However, the robots.txt file is not legally binding, meaning it is not an effective defense against AI bots, which can simply choose to ignore the instructions in the file. That is exactly what Perplexity did, according to Wired. Confirming the findings of a developer named Robb Knight, the technology news portal found that Perplexity AI was able to access its content and provide a summary of it despite the AI bot being prohibited from fetching its website.
But Perplexity isn’t alone in using questionable data scraping methods. Quora’s AI chatbot Poe goes beyond a summary and provides users with a downloadable HTML file of paywalled articles, according to a report by Wired. Additionally, content licensing startup TollBit said that more AI agents are “choosing to bypass the robots.txt protocol to scrape content from sites.”
How can publishers block AI bots?
The emerging trend of AI bots that allegedly defy web standards and bypass paywalls raises an important question: what more can publishers do to prevent the unauthorized scraping and use of their online content by AI bots?
Reddit said that in addition to updating its robots.txt file, it is also implementing a technique known as rate limiting, which restricts the number of times users can perform certain actions (such as logging into a web portal) within a specified amount of time. While this can help filter AI traffic from legitimate traffic to websites, it is not foolproof.
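The idea behind rate limiting can be sketched in a few lines. This is not Reddit’s implementation, just a minimal illustrative limiter that keeps a sliding log of request timestamps per client and rejects requests once a client exceeds a quota within the window; the class and parameter names are invented for the example:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` requests per client within `window` seconds."""

    def __init__(self, limit=3, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # client id -> recent request timestamps

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[client_id]
        # Discard timestamps that have fallen outside the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) < self.limit:
            q.append(now)
            return True
        return False  # over quota: block (or slow down) this client

limiter = RateLimiter(limit=3, window=60.0)
# Four rapid requests from the same client: the fourth is rejected.
results = [limiter.allow("203.0.113.7", now=t) for t in range(4)]
print(results)  # [True, True, True, False]
```

Bots that spread their requests across many IP addresses or pace them just under the threshold can slip through, which is why the article notes the approach is not foolproof.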
There has also been a rise in data poisoning tools such as Nightshade and Kudurru, which claim to help artists prevent AI bots from ingesting their artwork by corrupting the datasets those bots collect, as a form of retaliation.