Reddit catches Perplexity stealing content with copyright trap

Reddit says it caught AI search startup Perplexity red-handed using its content, thanks to a copyright trap that revealed how the company circumvented Reddit’s attempts to block unauthorized data scraping. The social media platform used a “mountweazel,” a deliberate fake entry designed to catch copiers: it created a test post accessible only to Google’s search engine, and content from that post appeared in Perplexity’s results within hours.

The big picture: Reddit’s lawsuit against Perplexity and three other data scraping companies exposes the murky world of AI training data acquisition, where companies are finding increasingly sophisticated ways to harvest copyrighted content without permission.

How the trap worked: Reddit created a test post that could “only be crawled by Google’s search engine and was not otherwise accessible anywhere on the internet.”

  • Within hours, Perplexity’s AI-powered search engine displayed content from the trap post.
  • The technique builds on historical copyright traps like “esquivalience,” a fake word planted by the New Oxford American Dictionary in 2001 to catch competitors stealing definitions.
  • Such traps are known as “mountweazels,” and Reddit’s version adapts the old idea to protect content from unauthorized scraping.
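
To make the trap logic concrete, here is a minimal sketch of the general idea in Python. It is not Reddit’s actual implementation, which the lawsuit does not describe in code: the function names (`make_canary`, `serve_trap_page`, `canary_leaked`) are hypothetical, and the user-agent check is an assumed simplification, since a real system would need stronger crawler verification because user-agent strings can be spoofed.

```python
import secrets
from typing import Optional


def make_canary(prefix: str = "canary") -> str:
    """Return a unique marker string that should not exist anywhere else."""
    return f"{prefix}-{secrets.token_hex(8)}"


def is_allowed_crawler(user_agent: str, allowed: tuple = ("Googlebot",)) -> bool:
    """Assumed simplification: trust the user-agent string.

    User agents are trivially spoofed, so this only illustrates the gate.
    """
    return any(name in user_agent for name in allowed)


def serve_trap_page(user_agent: str, canary: str) -> Optional[str]:
    """Serve the trap post only to the allowed crawler; hide it from everyone else."""
    if is_allowed_crawler(user_agent):
        return f"Test post. Marker: {canary}"
    return None


def canary_leaked(canary: str, third_party_text: str) -> bool:
    """If the marker shows up in another product's output, the content was copied."""
    return canary in third_party_text


if __name__ == "__main__":
    marker = make_canary()
    page = serve_trap_page("Mozilla/5.0 (compatible; Googlebot/2.1)", marker)
    print(page is not None)                          # the allowed crawler sees the post
    print(serve_trap_page("curl/8.0", marker))       # other clients get nothing
    print(canary_leaked(marker, f"Answer citing {marker}"))
```

The key design point is that the marker is unique and visible through only one channel, so any later appearance elsewhere points back to that channel.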

What Reddit alleges: The company argues that “Perplexity’s business model is effectively to take Reddit’s content from Google search results,” then feed it into an AI model and “call it a new product.”

  • Reddit claims Perplexity bought scraped data sets from third-party firms to circumvent a cease and desist order.
  • Reddit says citations to its data in Perplexity’s AI search results jumped “fortyfold.”
  • The lawsuit targets three additional data scraping firms: SerpApi (Texas), Oxylabs (Lithuania), and AWMProxy (Russia).

The broader context: The AI industry’s voracious appetite for training data has fundamentally disrupted the traditional web scraping ecosystem.

  • Years ago, SEO firms scraped Google search data to provide optimization services, creating a mutually beneficial relationship that drove traffic back to original websites.
  • Now these firms are selling their scraped data directly to AI companies, whose chatbots don’t meaningfully direct traffic back to source websites.
  • Training large language models like those behind ChatGPT requires access to vast amounts of data, much of it copyrighted.

Reddit’s business strategy: The platform is simultaneously fighting unauthorized scraping while monetizing its own data through licensing deals.

  • Reddit expects to earn more than $200 million over the next few years through data licensing ventures.
  • The company is experimenting with its own built-in AI capabilities.
  • Reddit has been locking out scrapers and selling its user data at premium prices to capitalize on AI data demand.

Why this matters: This case highlights the growing tension between AI companies’ need for training data and content creators’ rights to control and monetize their intellectual property. It could set precedents for how the industry handles data acquisition going forward.

Perplexity Just Got Caught Breaking the Rules Red-Handed
