July 2, 2024

Amazon Probes Perplexity AI for Alleged Website Scraping Violations


Amazon Web Services (AWS) has reportedly launched an investigation into Perplexity AI, an AI-powered answer engine, over allegations that the company has been violating AWS's terms of service by scraping content from websites without permission.

Amazon's cloud division is said to be assessing information it received from the news outlet WIRED, which published an investigation reporting that Perplexity AI uses a crawler hosted on AWS servers that disregards the Robots Exclusion Protocol, a web standard under which site operators place a robots.txt file on a domain to specify which pages bots are allowed to access.
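To illustrate how the Robots Exclusion Protocol is meant to work, here is a minimal sketch of the check a well-behaved crawler performs before fetching a page, using Python's standard urllib.robotparser module. The robots.txt content, the bot names ExampleBot and OtherBot, the example.com URLs and the is_allowed helper are all hypothetical and are not taken from WIRED's report or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, invented for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from a string rather than downloading them.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("ExampleBot", "https://example.com/news/story"),
        ("OtherBot", "https://example.com/news/story"),
        ("OtherBot", "https://example.com/private/draft"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(agent, url))
```

Nothing in the protocol enforces these rules. A crawler that skips this check, or ignores its result, can still request the page, which is the behaviour at issue in the allegations.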

In a report published in June, WIRED said it discovered a virtual machine bypassing its website’s robots.txt instructions. The machine was hosted on an Amazon Web Services server with the IP address 44.221.181.252, which WIRED said is “certainly operated by Perplexity.” To verify whether Perplexity was indeed scraping its content, WIRED entered headlines or short descriptions of its articles into the company’s chatbot. The tool responded with results that closely paraphrased the articles, providing “minimal attribution.”

The same machine also allegedly visited other Condé Nast properties hundreds of times within the past three months to scrape their content. Similar access patterns were observed at other major publications, such as Forbes, The New York Times, and The Guardian.

Perplexity AI spokesperson Sara Platnick disputed the claims, emphasising that the company's PerplexityBot respects robots.txt files. “Our PerplexityBot, which runs on AWS, respects robots.txt, and we confirmed that Perplexity-controlled services are not crawling in any way that violates AWS Terms of Service,” she stated. However, Platnick admitted to WIRED that PerplexityBot will ignore robots.txt if a user includes a specific URL in their chatbot query.

Perplexity AI CEO Aravind Srinivas also denied the allegations, though he acknowledged that Perplexity uses third-party web crawlers in addition to its own. He also confirmed that the crawler identified by WIRED was one of these third-party tools.

AWS has stated that it prohibits abusive and illegal activities and expects its customers to comply with its terms of service. The company regularly investigates reports of abuse and has engaged with Perplexity AI regarding the allegations as part of its standard procedure for handling reports of potential violations.

But Perplexity AI is not the only AI company accused of bypassing robots.txt files to gather content for training large language models. Reuters reported seeing a letter addressed to publishers from content licensing startup TollBit, warning them that “AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites.” While TollBit’s letter didn’t name any company, Business Insider reported that OpenAI and Anthropic, the creators of the ChatGPT and Claude chatbots, respectively, are also bypassing robots.txt signals, despite previously claiming they would respect “do not crawl” instructions.