Known Lawsuits: 4

Training Data Sources

  • Public Web Data

    Status: Reported

    Citation: Anthropic's public statements and research papers.

    Like all major LLMs, Claude's foundational models are trained on a massive, undisclosed corpus of data scraped from the public internet. This is the primary source of copyright risk.

  • Licensed Data (from Partners)

    Status: Reported

    Citation: Partnership agreements with Amazon, Google, and others.

    Anthropic has received billions in investment from major cloud providers. It is widely reported that these partnerships include access to proprietary, curated datasets. However, the contents of these datasets are not public.

  • Constitutional AI & Reinforcement Learning Data

    Status: Confirmed

    Citation: Anthropic's 'Constitutional AI' research paper.

    This is a key part of Anthropic's safety methodology. The model's outputs are refined using reinforcement learning based on a 'constitution' of principles, as well as human feedback data. This data is used for alignment, not foundational pre-training.

  • Customer Data

    Status: Denied

    Citation: Anthropic's data privacy policies and enterprise contracts.

    Anthropic explicitly states that it does not train its public, general-purpose models on customer data submitted via its API or through partner platforms such as Amazon Bedrock. Customers can use their own data for fine-tuning, but that data remains private to them.

  • Copyrighted Books & Lyrics

    Status: Alleged

    Citation: Bartz v. Anthropic and Concord Music v. Anthropic lawsuits.

    Lawsuits from authors and music publishers allege that their copyrighted works were used in Claude's training data without permission. The authors' suit, which has reached a proposed settlement, is a significant case in the AI litigation landscape.

Overview: The “Constitutional AI” & Safety-Focused Competitor

Claude is a family of large language models developed by Anthropic, an AI safety and research company. From a legal perspective, Anthropic’s key differentiator is its public focus on AI safety and its development of “Constitutional AI”—a framework for training models to be helpful and harmless and to avoid generating problematic content, including copyrighted material. Despite this, Anthropic is still a primary defendant in the ongoing AI copyright wars.

Key Models

The Claude family is known for its large context windows and strong reasoning abilities.

  • Claude 2: The model that established Anthropic as a major competitor to OpenAI.
  • Claude 3: A family of models released in different sizes (Haiku, Sonnet, Opus) to compete across the full spectrum of performance and cost, from small, fast models to a high-performance flagship (Opus).

Key Litigation

Despite its focus on safety, Anthropic is a key defendant in copyright litigation from authors and music publishers, testing its “Constitutional AI” framework as a legal defense.

Bartz v. Anthropic

  • Case Number: 3:24-cv-05417 (N.D. Cal.)
  • Status: This landmark case saw the first-ever class certification for a group of book authors in an AI copyright lawsuit (July 17, 2025). The case, presided over by Judge William Alsup, involves authors alleging their books were used to train Claude without permission. A proposed settlement of $1.5 billion is pending court approval.
  • Key Ruling: In his June 23, 2025 summary judgment order, Judge Alsup held that using the books to train Claude was transformative fair use, but allowed the claims over Anthropic’s downloading and retention of pirated copies in a central library to proceed, which led to the class certification.
  • Core Claim: Direct copyright infringement for the use of books in training data.

Concord Music Group, Inc. v. Anthropic

  • Case Number: 5:24-cv-03811 (N.D. Cal.)
  • Allegation: A group of music publishers, including Concord Music Group and Universal Music Publishing Group, sued Anthropic, claiming the Claude models unlawfully copy and disseminate copyrighted song lyrics. The complaint alleges that the model can generate verbatim lyrics in response to prompts.
  • Core Claims: Direct, contributory, and vicarious copyright infringement, as well as DMCA violations for removing copyright management information (CMI).

Reddit, Inc. v. Anthropic

  • Court: California Superior Court; filed June 4, 2025.
  • Allegation: The social media platform Reddit sued Anthropic for allegedly scraping user-generated content from its platform in violation of its terms of service.
  • Core Claims: Breach of contract, unjust enrichment, and trespass to chattels. This is not a copyright case, but a contract dispute over data scraping.

The “Constitutional AI” Defense

  • What it is: Anthropic’s primary defense and marketing point is its use of “Constitutional AI,” a training methodology in which the AI is given a set of principles or rules (a “constitution”) to follow. These rules instruct the model to be helpful and harmless, and explicitly to “not produce copyrighted or proprietary text.” A minimal sketch of this approach appears after this list.
  • A Proactive Defense: This is a more proactive legal defense than a simple “fair use” argument. Anthropic can argue that it has taken concrete, technical steps to prevent copyright infringement, rather than just claiming a right to the data. It is an attempt to demonstrate good faith and responsible design.
  • The Weakness: While this supports a good-faith argument, it does not change the fact that the underlying foundational model was likely still trained on a massive, undisclosed corpus of web data containing the very copyrighted material the constitution tries to prevent it from outputting. A plaintiff can still argue that the initial training was an infringing act, regardless of the safeguards placed on the output.
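
For readers unfamiliar with the mechanics, the following is a minimal, illustrative sketch of the critique-and-revision loop described in Anthropic’s Constitutional AI research paper. The principle text and the generate() helper are placeholders standing in for the actual constitution and model calls; they are not Anthropic’s exact wording or API.

```python
# Illustrative sketch of a Constitutional AI critique-and-revision loop.
# The principles and the generate() helper are placeholders, not
# Anthropic's actual constitution or interface.

PRINCIPLES = [
    "Choose the response that is most helpful and harmless.",
    "Do not reproduce copyrighted or proprietary text verbatim.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to the base language model."""
    raise NotImplementedError

def constitutional_revision(user_prompt: str) -> str:
    """Draft a response, then critique and revise it against each principle."""
    response = generate(user_prompt)
    for principle in PRINCIPLES:
        critique = generate(
            "Critique the following response according to this principle.\n"
            f"Principle: {principle}\nResponse: {response}"
        )
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nOriginal response: {response}"
        )
    # The revised responses (and AI-generated preference labels) are then
    # used for fine-tuning and reinforcement learning, not for pre-training.
    return response
```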

Platform Indemnity

  • Risk Absorption: Because Claude is primarily available through major cloud platforms like Amazon Bedrock and Google Vertex AI, users of the model through these services are typically covered by the IP indemnity of the platform provider (Amazon or Google), not Anthropic itself.
  • Risk for Direct API Users: Users who access Claude directly via Anthropic’s own API would need to check their specific terms of service, as a broad, public IP indemnity is not Anthropic’s primary offering.
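
To make the distinction concrete, the sketch below shows the two access paths under discussion, assuming the `anthropic` and `boto3` Python SDKs and illustrative model IDs. Which contract (and whose indemnity) governs a request depends on which path the call goes through, not on the code itself.

```python
# Minimal sketch of the two common access paths; model IDs are illustrative.
import json

import anthropic  # direct Anthropic API: Anthropic's own commercial terms apply
import boto3      # Amazon Bedrock: the AWS service terms (and any AWS indemnity) apply

# Path 1: direct Anthropic API
direct_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
direct_reply = direct_client.messages.create(
    model="claude-3-5-sonnet-20241022",  # example model ID
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize this contract clause."}],
)

# Path 2: the same model family served through Amazon Bedrock
bedrock = boto3.client("bedrock-runtime")
bedrock_reply = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # example Bedrock model ID
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Summarize this contract clause."}],
    }),
)
```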