Perplexity Cloudflare Crawlers

Tech giants Perplexity and Cloudflare are at odds because the former allegedly uses an agent that disguises itself as a regular browser, instead of identifying as an AI agent, in order to bypass crawler restrictions (basically robots.txt). Cloudflare’s public confrontation has stirred up a debate on what counts as a “useful user agent” and whether “a no is a no” (website consent).

Timeline:

August 8: Cloudflare publishes a blog post on its site about Perplexity unethically accessing content from websites that specifically block it. Cloudflare says it explicitly blocked PerplexityBot and Perplexity-User in the robots.txt file of two newly bought test sites that weren’t readily discoverable through Google. When Perplexity was asked about those sites, it allegedly got around the block with bots that stealthily disguised themselves as a Chrome browser.
August 8 (same day): Perplexity responds that the requests Cloudflare flagged came from a browser service (called BrowserBase) that it uses for specific tasks only 🙃. It called the robots.txt method very old and argued that Cloudflare needs to be able to differentiate between a “useful agent” and a threat.
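
To make the robots.txt part concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URL, and the `is_allowed` helper are assumptions made up for illustration (they are not Cloudflare's actual test setup); `PerplexityBot` and `Perplexity-User` are the user-agent tokens Perplexity documents for its declared crawlers. It shows that the rules only match a client that announces itself honestly:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks Perplexity's declared agents
# and allows everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""

# A generic desktop Chrome user-agent string (illustrative value).
CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots.txt permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    for ua in ("PerplexityBot", "Perplexity-User", CHROME_UA):
        print(ua[:20], "->", is_allowed(ua, url))
```

robots.txt is purely advisory: it is the client that decides whether to read and honor it. A request that sends a generic Chrome user agent falls under the `*` rule, so nothing in the file stops it.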

After this, there were two more relevant events:

August 12: Perplexity (allegedly) offers to buy Chrome for 34.5 billion dollars.
August 18: Cloudflare announces a browser developer program.

Perplexity defends its practice as valid on the grounds that its AI does not train on such data and merely acts as a browser that fetches data and displays it to the user. Here starts the moral dilemma: if a user had read the information on the website directly instead of through an AI, would that be considered a breach of trust? Maybe, maybe not.

Alleged method that Perplexity uses to bypass robots.txt (source: Cloudflare)

It appears to use two different agents to access a site: a “declared” one and a “stealth” one. In addition, it seems to get past Cloudflare’s WAF by rotating through different IP addresses and networks, which makes it harder to track and block.
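
A rough sketch of why this is hard to block by user agent or IP alone. Everything here is hypothetical: the `Request` record, the sample log, and the IP addresses (taken from documentation ranges) are invented for illustration, and this is not Cloudflare's detection method. It only splits requests into "declared" and "undeclared" and counts distinct IPs in each group:

```python
from collections import defaultdict
from dataclasses import dataclass

# User-agent tokens of Perplexity's declared crawlers (lowercased for matching).
DECLARED_TOKENS = ("perplexitybot", "perplexity-user")


@dataclass(frozen=True)
class Request:
    """Hypothetical access-log record: client IP and User-Agent header."""
    ip: str
    user_agent: str


def is_declared(user_agent: str) -> bool:
    """True if the User-Agent openly names a declared Perplexity crawler."""
    ua = user_agent.lower()
    return any(token in ua for token in DECLARED_TOKENS)


def distinct_ips_by_group(requests):
    """Count distinct client IPs for declared vs. undeclared requests."""
    ips = defaultdict(set)
    for req in requests:
        group = "declared" if is_declared(req.user_agent) else "undeclared"
        ips[group].add(req.ip)
    return {group: len(addresses) for group, addresses in ips.items()}


# Invented sample: one declared crawler, plus "Chrome" traffic spread over
# several addresses, as a rotating stealth client might look in a log.
CHROME_UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/124.0.0.0 Safari/537.36"
SAMPLE_LOG = [
    Request("192.0.2.10", "Mozilla/5.0 (compatible; PerplexityBot/1.0)"),
    Request("198.51.100.7", CHROME_UA),
    Request("203.0.113.42", CHROME_UA),
    Request("198.51.100.99", CHROME_UA),
]

if __name__ == "__main__":
    print(distinct_ips_by_group(SAMPLE_LOG))
```

A rule that matches on the declared tokens catches only the first request. The remaining ones look like ordinary Chrome visits from unrelated addresses, so neither a user-agent rule nor a single-IP block covers them.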

Using robots.txt to fend off bots does make sense, but shouldn’t explicitly asking an AI not to read the data matter too?

Asking not to be seen is a basic right to privacy, and although Perplexity did this to make someone’s life easier, it is still not ethically acceptable.

This traces back to the issue of AI models training on forbidden data or artists’ images. However, breach of consent seems to be the more pressing issue (Cloudflare also praised OpenAI for following these practices better).

In addition to this, Perplexity wants to buy Chrome. This obviously raises suspicion, because Chrome is a very widely used web agent ... hmmm, one that makes a lot of requests to websites. Wonder how that would be useful?

Note that Perplexity already has a browser named Comet, which is in the early stages of launch, so the rumored bid leads to a lot of speculation about both retribution and feasibility.

Cloudflare launching a program to support browser developers might be an attempt to address this exact issue and prevent unwanted activity by AI crawlers by training devs.

What does it mean to us?

WE ARE DONE FOR!!!

Tech giants must get consent before using your data, and should never be given leeway on this. The digital world already seems to never forget anything once it’s on the internet. Although Perplexity says that it didn’t stockpile data and only retrieved it, the retrieved data was used as context and still sits in someone’s history as data.

Not even paywalled content will stay blocked, which means monetary losses for editors and writers.

Selenium scripts that scrape data from websites using loopholes already exist, so this addition only makes things more concerning.

So it is noteworthy that old but important rules from the 1990s are still relevant to modern issues. If such loopholes exist, true privacy will forever remain a myth and will only lag behind true data monitoring and documentation.

Perplexity’s response seems more PR-oriented than an actual denial of the claims Cloudflare makes.

What can you do?

Set up tools like Anubis to prevent AI crawlers from accessing your site.
Add reCAPTCHA to every page and rate-limit your site (in case you are facing aggressive crawling).
Raise your voice against breaches of trust!!
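
For the rate-limiting suggestion, a minimal sketch of a per-client sliding-window limiter is below. The class name, parameters, and limits are assumptions chosen for illustration; in practice this is usually handled by your web server, CDN, or framework middleware rather than hand-rolled code.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client id (for example an IP address) -> timestamps of recent hits
        self._hits = defaultdict(deque)

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        """Record a request and return True if it is within the limit."""
        if now is None:
            now = time.monotonic()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject (e.g. respond with HTTP 429)
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
    for second in range(5):
        print(second, limiter.allow("203.0.113.42", now=float(second)))
```

Keying the limiter on the client IP is the simplest choice, but a crawler that rotates addresses spreads its requests across many keys, so rate limiting works best alongside challenges such as Anubis or a CAPTCHA.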

In conclusion, avoiding scrapers has always been difficult, but it has become far harder since AI agents gained direct access to the internet. Perplexity needs to mount a better defense to protect its reputation, since a bare “wrong understanding of our product” will not cut it.

Credits:

Cover Image: Geekflare

The Perplexity vs Cloudflare beef

Shreyas D R

Now, a little dive into both articles tells us that Perplexity is indeed prying into websites without consent!!
