Data Scraping

Data scraping is the industrial-scale copying of information from the internet. It is the engine that powers the generative AI revolution, and it is the source of the industry’s most significant legal liabilities. While AI companies like to portray this as a neutral form of data collection, it is better understood as a legally aggressive and technically indiscriminate act of appropriation.

Analogy: Unregulated Strip Mining of the Internet

Imagine the internet is a vast, open territory. Some of it is national parkland, some is private property with ‘No Trespassing’ signs, and some is unclaimed.

A massive corporation arrives and begins strip-mining the entire territory.

  • The Operation: They use automated machinery (scraping bots) to dig up everything in their path: gold (valuable text and images), but also toxic waste (hate speech, misinformation) and private property (copyrighted books, personal photos).
  • Ignoring the Rules: They ignore the ‘No Trespassing’ signs (Terms of Service that forbid scraping) and disregard the zoning rules posted by landowners (a website’s robots.txt file, which tells bots which areas are off-limits but relies on voluntary compliance).
  • The Justification: When confronted, the corporation claims, “Everything we took was just lying on the ground! Anyone could have picked it up. We’re not selling the dirt; we’re using it to learn how to build better machines.”

This is the defense AI companies use for data scraping. But from a legal perspective, scraping is not one act; it is a series of discrete acts, each of which can be challenged on its own terms.
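
To make the robots.txt point concrete, here is a minimal sketch of how a well-behaved bot could check a site's posted rules before fetching a page. It uses Python's standard urllib.robotparser module. The robots.txt content, the bot name "ExampleAIBot", and the example.com URLs are all hypothetical, chosen only for illustration; this is not a description of how any particular company's crawler works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is barred from the whole site,
# and every other bot is barred only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.com/articles/1"),
        ("SomeBrowser", "https://example.com/articles/1"),
        ("SomeBrowser", "https://example.com/private/notes"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing in this check stops a bot from fetching a disallowed page anyway. The file states the site owner's wishes, and honoring it is the bot operator's choice, which is why ignoring it matters to the arguments that follow.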

Scraping isn’t one single legal issue; it’s a minefield. The litigation against AI companies is being fought on several fronts simultaneously.

  1. Copyright Infringement: This is the main front. When a scraper copies a copyrighted photo from Getty Images or a news article from The New York Times and saves it to a database, plaintiffs argue that the copying is itself a direct act of infringement. The AI companies’ primary defense is fair use, claiming their use is “transformative.” The outcome of these cases is likely to shape how copyright law applies to AI training for years to come.

  2. Breach of Contract (Terms of Service): Most websites publish terms of service that purport to bind anyone who uses the site, and these terms almost always forbid automated scraping. When an AI company’s bots access the site, they are arguably entering into a contract that they immediately breach. Whether such terms are enforceable generally depends on whether the visitor had adequate notice of them and assented to them. Where that can be shown, this is a powerful claim because, unlike fair use, it doesn’t require a complex balancing test.

  3. Computer Fraud and Abuse Act (CFAA): This is a federal anti-hacking statute. The central question is whether violating a website’s terms of service amounts to accessing a computer “without authorization” or exceeding authorized access under the CFAA. The Supreme Court’s 2021 ruling in Van Buren v. United States narrowed the scope of the CFAA, but its application to web scraping is still heavily contested. Circumventing a technical barrier (like an IP block) could be seen differently than ignoring a written prohibition.

Data scraping is not a passive act of observation. It is an active, automated process of copying. The defense that the data is “publicly available” is a misdirection; a book is publicly available in a library, but that doesn’t give you the right to photocopy the entire library. For litigators, the key is to break down the act of scraping into its component parts and attack the legal justification for each one.