AI Litigation Support | S-Square Research

Training Data Sources

Common Crawl

Status: Confirmed

Citation: Brown, T. et al. (2020). 'Language Models are Few-Shot Learners.' (The GPT-3 Paper)

For GPT-3, a filtered version of the Common Crawl dataset made up the largest portion of the training data, representing **60% of the total mix**. For later models like GPT-4, OpenAI has not disclosed its training data, but lawsuits allege that Common Crawl remained in use.

WebText2

Status: Confirmed

Citation: The GPT-3 Paper (2020); The New York Times v. OpenAI lawsuit.

This dataset, an expanded version of OpenAI's original WebText (web pages collected by following outbound links from Reddit), accounted for **22% of GPT-3's training data**. The NYT lawsuit alleges that for later models, OpenAI's scraping expanded to millions of paywalled articles.

Books1 & Books2

Status: Confirmed

Citation: The GPT-3 Paper (2020); Authors Guild v. OpenAI lawsuit.

The GPT-3 paper lists two internet-based book corpora, Books1 and Books2, which together formed **16% of the training data** (8% each). The Authors Guild lawsuit alleges these corpora are composed of copyrighted books from 'shadow libraries' like Libgen and Z-Library.

Wikipedia

Status: Confirmed

Citation: The GPT-3 Paper (2020).

The English-language version of Wikipedia accounted for **3% of the training data** for GPT-3, valued as a high-quality source of factual information.
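The GPT-3 percentages quoted in this section can be tallied with a short script. This is a minimal sketch, not anything from OpenAI: the dictionary and function names are hypothetical, and the figures are simply the rounded shares stated above. Because they are rounded, they add up to 101 rather than 100.

```python
# Rounded shares of the GPT-3 training mix, as quoted in this section (percent).
# The names below are illustrative only; they do not come from the GPT-3 paper.
GPT3_MIX_PERCENT = {
    "Common Crawl (filtered)": 60,
    "WebText2": 22,
    "Books1": 8,
    "Books2": 8,
    "Wikipedia": 3,
}


def summarize_mix(mix):
    """Return the overall total plus the combined web and book shares."""
    return {
        "total": sum(mix.values()),
        "web": mix["Common Crawl (filtered)"] + mix["WebText2"],
        "books": mix["Books1"] + mix["Books2"],
    }


if __name__ == "__main__":
    # Print the sources from largest to smallest share, then the summary.
    for name, pct in sorted(GPT3_MIX_PERCENT.items(), key=lambda kv: -kv[1]):
        print(f"{name:<26}{pct:>3}%")
    print(summarize_mix(GPT3_MIX_PERCENT))
```

On these figures, web-scraped text (Common Crawl plus WebText2) accounts for 82% of the mix and the two book corpora for 16%, which is why the lawsuits discussed on this page concentrate on those sources.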

Licensed Data (News Publishers, etc.)

Status: Confirmed

Citation: Public partnership announcements (e.g., with the Associated Press, Axel Springer, News Corp).

In response to legal pressure, OpenAI has begun actively licensing content from major publishers. These deals are for *future* use and do not retroactively cover the data used to train its foundational models. Other publishers, such as Ziff Davis, have not signed licenses and are suing OpenAI for past infringement.

GitHub Code

Status: Confirmed

Citation: GitHub Copilot documentation; Doe v. GitHub, Microsoft, OpenAI lawsuit.

OpenAI's Codex model, which powers GitHub Copilot, was explicitly trained on publicly available code from GitHub. The ongoing lawsuit alleges this violates the terms of various open-source licenses by not preserving attribution and other requirements.

User Data (ChatGPT Conversations)

Status: Denied

Citation: OpenAI's privacy policy and public statements.

OpenAI states it does not use conversations from its commercial API or ChatGPT Enterprise to train its public models. It *does* use conversations from its free consumer services (unless users opt out) for fine-tuning and alignment, not foundational pre-training.

Overview: The Epicenter of the Copyright Wars

OpenAI’s GPT (Generative Pre-trained Transformer) models are among the most influential and widely used in the world. They are also at the center of the legal and ethical debate over generative AI. As the developer of ChatGPT, OpenAI is the primary target of numerous copyright lawsuits that are likely to shape the future of the industry.

Key Models

The most relevant models in the GPT family are proprietary and accessed via API:

GPT-3.5: The model that powered the initial release of ChatGPT, setting the stage for the generative AI boom.
GPT-4 & GPT-4o: The current flagship models, known for their powerful reasoning and multimodal capabilities.
GPT-5: The next-generation model, expected to be even more powerful.

The Core Legal Issue: “Fair Use” on a Massive Scale

OpenAI’s legal strategy hinges almost entirely on the “fair use” doctrine in U.S. copyright law.

The Argument: OpenAI claims that training its models on vast amounts of copyrighted material scraped from the internet is “transformative.” It argues that the models are not simply storing and regurgitating copies but learning patterns to create something new, and it contends that this is a new, legally permissible use, similar to how a human learns by reading many books.
The Counterargument: Plaintiffs, such as The New York Times, argue that this is industrial-scale copyright infringement. The Times complaint includes examples of GPT models reproducing its articles verbatim, and plaintiffs argue that the models directly compete with and devalue their original work by, for example, providing summaries of paywalled articles.

Key Litigation

As the market leader, OpenAI is the primary defendant in the AI copyright wars. Most copyright cases filed against it across the country have been consolidated into a multidistrict litigation (MDL) proceeding in the Southern District of New York, presided over by Judge Sidney H. Stein.

Publisher Lawsuits (MDL)

A group of high-profile media organizations are suing OpenAI and its partner Microsoft, alleging that their content was used for training without permission and that ChatGPT’s outputs compete directly with their business.

The New York Times Co. v. Microsoft & OpenAI: The most significant publisher lawsuit, alleging that millions of articles were used to train GPT models, which now generate outputs that reproduce Times content verbatim and undermine its subscription business. Claims include copyright infringement, trademark dilution, and unfair competition.
Other News & Media Publishers: Several other media outlets have filed similar suits, including the Daily News, The Intercept, Raw Story Media, and Ziff Davis (owner of PCMag, Mashable). These cases primarily focus on copyright infringement and DMCA violations for removing attribution and copyright management information (CMI).

Author Class-Action Lawsuits (MDL)

Multiple class-action lawsuits filed on behalf of authors are also consolidated in the MDL.

Authors Guild v. OpenAI (Consolidated): A massive class action led by the Authors Guild and prominent authors like John Grisham, George R.R. Martin, and Jodi Picoult. It alleges that OpenAI copied their books from “shadow libraries” to train its models.
In re OpenAI ChatGPT Litigation (Tremblay, Silverman): Originally filed in California, this was one of the first major author lawsuits, led by authors like Paul Tremblay and Sarah Silverman. It makes similar claims regarding training on pirated book datasets. Judge Araceli Martinez-Olguin dismissed several initial claims before the case was transferred to the MDL.
Bird v. Microsoft: A separate lawsuit filed by authors against Microsoft, targeting its own Megatron large language model.

Code Copyright Lawsuit

Doe 1 v. GitHub, Microsoft, & OpenAI: A landmark class-action lawsuit filed by programmers over the training of GitHub Copilot (powered by an OpenAI Codex model). The suit alleges that training on public code repositories while ignoring open-source licenses breaches those licenses (breach of contract) and violates the DMCA by removing copyright management information. The case is currently on appeal before the Ninth Circuit.

International & Other Litigation

International Lawsuits: OpenAI faces copyright challenges globally, with lawsuits filed by news organizations in Canada (Toronto Star) and India (Asian News International) and by the music rights society GEMA in Germany.
Defamation: In Walters v. OpenAI, a radio host sued for defamation after ChatGPT allegedly generated false and damaging information about him in response to a user query.
Privacy Lawsuits: Several proposed class actions allege that OpenAI scraped personal data from the internet without consent, violating privacy laws.
Elon Musk v. OpenAI: A lawsuit filed by Elon Musk alleging that OpenAI betrayed its founding mission as a non-profit by pursuing a commercial, closed-source model with its Microsoft partnership. The suit includes claims for breach of contract and unfair business practices.

OpenAI’s “Copyright Shield”

To calm the fears of its enterprise customers, OpenAI has followed the lead of other major providers by offering a legal indemnity.

What it is: “Copyright Shield” is a contractual promise to defend enterprise customers and pay the costs if they are sued for copyright infringement over content generated by OpenAI’s services.
Who is Protected: This indemnity covers users of ChatGPT Enterprise and the enterprise API platform. It does not cover users of the free version of ChatGPT.
A Business Decision: Like similar programs from Google and Amazon, this is a business decision to absorb customer risk. It signals OpenAI’s confidence in its legal position and its willingness to use its significant financial resources to defend its technology in court. It makes adopting the technology less risky for large corporations.
