Reddit plans to update a web standard to prevent automated website scraping

Social media platform Reddit (RDDT.N) announced on Tuesday that it will update a web standard used on the site to block automated data scraping, following reports that AI startups were circumventing the rule to gather content for their systems. The move comes amid growing accusations that AI firms are plagiarizing content from publishers to produce AI-generated summaries without giving credit or asking permission.

The robots.txt file, a widely accepted standard that tells crawlers which parts of a site they may access, has recently become a crucial tool for publishers seeking to stop tech companies from using their content without compensation to train AI models and generate answers to search queries. Last week, the content licensing startup TollBit said in a letter to publishers that several AI firms were bypassing the standard to scrape their sites. That echoes a Wired investigation, which found that the AI search startup Perplexity likely circumvented publishers' attempts to block its web crawler via robots.txt.
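Because robots.txt is simply a text file served at the root of a website, honoring it is entirely up to the crawler. The minimal sketch below, using Python's standard urllib.robotparser module, shows how a well-behaved bot is expected to consult the file before fetching a page; the crawler name "ExampleAIBot" and the URLs are hypothetical, not taken from the reports above.

```python
import urllib.robotparser

# A hypothetical robots.txt of the kind a publisher might serve: it blocks a
# made-up AI crawler site-wide while leaving other user agents unrestricted.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks permission for its own user agent before fetching.
url = "https://www.example.com/some-article"
print(parser.can_fetch("ExampleAIBot", url))  # False: the AI crawler is disallowed
print(parser.can_fetch("OtherBot", url))      # True: other agents may fetch the page
```

Because this check happens entirely on the crawler's side, the standard relies on voluntary compliance, which is what makes the bypassing alleged in the TollBit letter and the Wired investigation possible.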

Earlier in June, business media publisher Forbes accused Perplexity of plagiarizing its investigative stories for use in generative AI systems without giving proper credit.

Reddit said on Tuesday that researchers and organizations, such as the Internet Archive, will still have access to its content for non-commercial use.
