Once touted as a replacement for Google Search, Perplexity AI has found itself in hot water for allegedly plagiarizing news articles without providing proper sourcing. In early June, the AI-powered generative search engine was threatened with legal action by Forbes for allegedly plagiarizing its work. Subsequently, an investigation by Wired alleged that Perplexity AI could also freely copy online content from other major news sites.
Since then, several AI companies have come under scrutiny for allegedly circumventing paywalls and technical standards put in place by publishers to prevent their online content from being used to train AI models and generate summaries.
While Aravind Srinivas, CEO of Perplexity AI, said a third-party service was to blame, the controversy surrounding the AI startup is the latest flashpoint between news publishers, who allege their content is being copied without permission, and AI companies, who argue they should be allowed to use it.
How did it all start?
A graduate of IIT Madras, Aravind Srinivas worked at leading technology companies such as Google DeepMind and OpenAI before launching Perplexity, which sought to disrupt the way search results are presented to users by responding to queries with personalized, AI-generated answers.
Perplexity AI does this by “crawling the web, extracting relevant sources, using only the content from those sources to answer the question, and always telling the user where the answer comes from through citations or references,” Srinivas told The Indian Express in an interview.
Perplexity was thus seen as a small player taking on tech giants like Google and Microsoft in the search engine market. However, things took a different turn when it launched a feature called “Pages”, which allowed users to enter a prompt and receive a researched, AI-generated report that cited its sources and could be published as a web page to share with anyone.
A few days after the launch, the Perplexity team published an AI-generated “Page” based on an exclusive Forbes article about former Google CEO Eric Schmidt’s involvement in a secret military drone project. The US-based publication claimed that the language used in Perplexity’s AI-generated summary closely mirrored its paywalled article, pointed out that the article’s graphics had also been copied, and further alleged that Forbes had not been cited prominently enough.
Why is Perplexity receiving criticism from publishers?
In addition to allegedly plagiarizing articles and bypassing paywalls, Perplexity has also been accused of not adhering to accepted web standards such as robots.txt files.
According to cybersecurity company Cloudflare, “a robots.txt file contains instructions for robots that tell them which web pages they can and cannot access.”
The robots.txt file primarily applies to web crawlers, such as those used by Google to crawl the Internet and index content for displaying search results. The site administrator can set specific rules so that web crawlers do not process data from restricted web pages or directories.
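To illustrate how a rule-abiding crawler is supposed to behave, here is a minimal sketch using Python’s standard `urllib.robotparser`. The robots.txt rules, the `GPTBot` user agent, and the `example.com` URLs are hypothetical, chosen only to show the mechanism:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the "GPTBot" crawler entirely,
# and block everyone from the /private/ directory.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler checks can_fetch() before requesting a page.
print(rp.can_fetch("GPTBot", "https://example.com/article"))        # False
print(rp.can_fetch("NewsReaderBot", "https://example.com/article"))  # True
print(rp.can_fetch("NewsReaderBot", "https://example.com/private/x"))  # False
```

The key point is that this check is entirely voluntary: nothing stops a bot from skipping `can_fetch()` and requesting the page anyway, which is the behavior publishers are complaining about.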
However, the robots.txt file is not legally binding, meaning it is not an effective defense against AI bots, which can simply choose to ignore the instructions in the file. That is exactly what Perplexity did, according to Wired. Confirming the findings of a developer named Robb Knight, the technology news portal found that Perplexity AI was able to access its content and provide a summary of it despite the AI bot being prohibited from fetching its website.
But Perplexity isn’t alone in using questionable data scraping methods. Quora’s AI chatbot Poe goes beyond a summary and provides users with a downloadable HTML file of paywalled articles, according to a report by Wired. Additionally, content licensing startup TollBit said that more AI agents are “choosing to bypass the robots.txt protocol to scrape content from sites.”
How can publishers block AI bots?
The emerging trend of AI bots that allegedly defy web standards and bypass paywalls raises an important question: what more can publishers do to prevent the unauthorized scraping and use of their online content by AI bots?
Reddit said that in addition to updating its robots.txt file, it is also implementing a technique known as rate limiting, which restricts the number of times users can perform certain actions (such as logging into a web portal) within a specified amount of time. While this can help filter AI traffic from legitimate traffic to websites, it is not foolproof.
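The idea behind rate limiting can be sketched in a few lines. This is not Reddit’s implementation, just a minimal illustrative limiter that keeps a sliding log of request timestamps per client and rejects requests once a client exceeds a quota within the window; the class and parameter names are invented for the example:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` requests per client within `window` seconds."""

    def __init__(self, limit=3, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # client id -> recent request timestamps

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[client_id]
        # Discard timestamps that have fallen outside the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) < self.limit:
            q.append(now)
            return True
        return False  # over quota: block (or slow down) this client

limiter = RateLimiter(limit=3, window=60.0)
# Four rapid requests from the same client: the fourth is rejected.
results = [limiter.allow("203.0.113.7", now=t) for t in range(4)]
print(results)  # [True, True, True, False]
```

Bots that spread their requests across many IP addresses or pace them just under the threshold can slip through, which is why the article notes the approach is not foolproof.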
There has also been a rise in data poisoning tools such as Nightshade and Kudurru, which claim to help artists prevent AI bots from ingesting their artwork by corrupting the datasets those bots collect, as a form of retaliation.