Cloudflare, a leading internet infrastructure provider, has accused AI startup Perplexity of scraping content from websites despite explicit instructions not to do so. The company claims that Perplexity is not only ignoring robots.txt files but also actively hiding its scraping activities. This behavior has raised concerns about the ethics and legality of data collection practices in the AI industry.
AI products rely on vast amounts of data, often obtained through web scraping. While this has been a common practice among AI startups, it has also led to tensions with website owners, who increasingly use robots.txt files to state which crawlers may access their content. Cloudflare's research indicates that Perplexity is actively circumventing these restrictions by changing its bots' user agent strings and the autonomous system numbers (ASNs) of the networks its requests come from.
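To make the mechanism concrete, here is a minimal Python sketch of how a well-behaved crawler consults robots.txt, using the standard library's `urllib.robotparser`. The bot name `ExampleAIBot`, the sample rules, and the URL are hypothetical, chosen only for illustration; they are not taken from Cloudflare's report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"  # placeholder URL
    # The declared crawler is refused by the rules.
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot/1.0", url))
    # The same request under a generic browser user agent matches only "*".
    print(is_allowed(ROBOTS_TXT, "Mozilla/5.0 (Macintosh) Chrome/124.0", url))
```

The sketch also shows why user agent changes matter: robots.txt rules are matched against the name a crawler declares, and compliance is voluntary. A crawler that presents itself as an ordinary browser no longer matches the rule written for it, so a site has to rely on other signals, such as network origin, to identify it.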
Cloudflare's findings were based on machine learning and network signal analysis, which it says showed Perplexity's scraping activity spanning tens of thousands of domains and millions of requests per day. The company also observed that Perplexity used a generic browser user agent, designed to impersonate Google Chrome, when its declared crawler was blocked.
In response to these allegations, Perplexity spokesperson Jesse Dwyer dismissed Cloudflare's claims as a "sales pitch" and denied that the bot in question belonged to Perplexity. Cloudflare, however, says its own tests showed Perplexity bypassing robots.txt restrictions.
This is not the first time Perplexity has faced accusations of unauthorized scraping. Last year, news outlets including Wired alleged that the company was plagiarizing their content. When questioned about the issue, Perplexity's CEO, Aravind Srinivas, was unable to give a clear definition of plagiarism.
Cloudflare has taken a strong stance against AI crawlers, launching a marketplace last month that allows website owners to charge AI scrapers for access to their sites. The company's CEO, Matthew Prince, has warned that AI is disrupting the business model of the internet, particularly for publishers. Cloudflare has also introduced a free tool to help prevent bots from scraping websites for AI training purposes.
The ongoing dispute between Cloudflare and Perplexity highlights the broader challenges surrounding data collection and AI development. As AI technologies continue to advance, it is crucial for companies to strike a balance between harnessing the power of data and respecting the rights of content creators.