Known Lawsuits: 2

Training Data Sources

  • Filtered "textbook-quality" web data

    Status: Reported

    Citation: Microsoft research papers on the Phi models.

  • Synthetic data

    Status: Confirmed

    Citation: Microsoft research papers.

Overview: The “Small Model” & “Clean Data” Strategy

The Phi models are a family of “small language models” (SLMs) from Microsoft. From a legal perspective, they are highly significant as they represent a deliberate attempt to create powerful AI models while minimizing copyright risk. Instead of training on the entire internet, Microsoft claims to have used a heavily curated, “textbook-quality” dataset combined with synthetic, AI-generated data.

Key Models

The Phi models are designed to be small, efficient, and cost-effective, making them suitable for on-device applications.

  • Phi-3-mini (3.8B)
  • Phi-3-small (7B)
  • Phi-3-medium (14B)
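The parameter counts above are what make on-device deployment plausible. As a back-of-envelope sketch (the bytes-per-parameter figures are standard for fp16 and 4-bit quantization, not official Microsoft numbers), the weights alone fit comfortably on consumer hardware:

```python
# Rough memory footprint for the Phi-3 family's weights.
# Parameter counts are taken from the list above.

PHI3_PARAMS = {
    "Phi-3-mini": 3.8e9,
    "Phi-3-small": 7.0e9,
    "Phi-3-medium": 14.0e9,
}

def weight_footprint_gb(params: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights alone, in gigabytes."""
    return params * bytes_per_param / 1e9

for name, n in PHI3_PARAMS.items():
    fp16 = weight_footprint_gb(n, 2.0)  # 16-bit floats
    q4 = weight_footprint_gb(n, 0.5)    # 4-bit quantization
    print(f"{name}: ~{fp16:.1f} GB fp16, ~{q4:.1f} GB at 4-bit")
```

At 4-bit quantization, Phi-3-mini's weights come to roughly 1.9 GB, which is why a 3.8B-parameter model can run on a phone or laptop while frontier-scale models cannot.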

Microsoft’s entire strategy with Phi appears designed to build a stronger “fair use” defense in potential copyright litigation.

“Textbook-Quality” Web Data

  • The Curation Argument: Microsoft researchers state they trained Phi on a much smaller, more heavily filtered dataset than is typical. By focusing on data with “textbook-level” quality, they can argue they actively avoided the vast amounts of low-quality, infringing content found on social media, forums, and pirate websites that may be present in other models’ training data.
  • A Stronger “Fair Use” Case: This curation allows Microsoft to argue that their use was more targeted and less likely to harm the market for the original works, strengthening two of the four statutory fair use factors: the first (purpose and character of the use) and the fourth (effect on the market for the original). It is a direct response to the legal arguments being made against OpenAI and Google.
Synthetic Data

  • The Core Idea: A significant portion of Phi’s training data was not created by humans, but was synthetically generated by other, larger AI models (presumably from the GPT family, given Microsoft’s partnership with OpenAI).
  • Novel Copyright Questions: This raises untested legal questions. Does using synthetic data for training violate the terms of service of the model that generated it? Who owns the copyright to AI-generated text used as training data? Can a model trained on the output of another model be considered a “derivative work”?
  • Data Laundering?: Critics might argue that using synthetic data is a form of “data laundering”—if the original model (e.g., GPT-4) was trained on infringing data, then is the synthetic data it produces also “fruit of the poisonous tree”? This is a novel legal theory that has yet to be tested in court.

Licensing & Strategic Implications

  • Permissive Open-Source: Unlike many other enterprise-backed open models, the Phi-3 models are released under the very permissive MIT License. This allows for free commercial use, modification, and distribution with virtually no restrictions.
  • A Strategic Hedge?: Microsoft is OpenAI’s biggest partner and investor. Its release of a powerful, permissively licensed, and “cleaner” open-source model could be seen as a strategic hedge. As a defendant in lawsuits over both code (Doe 1 v. GitHub, Microsoft, OpenAI) and news content (Daily News, Bird), Microsoft has a strong incentive to develop an alternative AI asset that is less legally encumbered.