Known Lawsuits: 2

Training Data Sources

  • Filtered "textbook-quality" web data

    Status: Reported

    Citation: Microsoft research papers on the Phi models.

  • Synthetic data

    Status: Confirmed

    Citation: Microsoft research papers.

Overview: The “Small Model” & “Clean Data” Strategy

The Phi models are a family of “small language models” (SLMs) from Microsoft. From a legal perspective, they are highly significant as they represent a deliberate attempt to create powerful AI models while minimizing copyright risk. Instead of training on the entire internet, Microsoft claims to have used a heavily curated, “textbook-quality” dataset combined with synthetic, AI-generated data.

Key Models

The Phi models are designed to be small, efficient, and cost-effective, making them suitable for on-device applications.

  • Phi-3-mini (3.8B)
  • Phi-3-small (7B)
  • Phi-3-medium (14B)
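The parameter counts above are what make on-device deployment plausible. As a back-of-envelope sketch (the bytes-per-parameter figures are standard for fp16 and 4-bit quantization, not official Microsoft numbers), the weights alone fit comfortably on consumer hardware:

```python
# Rough memory footprint for the Phi-3 family's weights.
# Parameter counts are taken from the list above.

PHI3_PARAMS = {
    "Phi-3-mini": 3.8e9,
    "Phi-3-small": 7.0e9,
    "Phi-3-medium": 14.0e9,
}

def weight_footprint_gb(params: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights alone, in gigabytes."""
    return params * bytes_per_param / 1e9

for name, n in PHI3_PARAMS.items():
    fp16 = weight_footprint_gb(n, 2.0)  # 16-bit floats
    q4 = weight_footprint_gb(n, 0.5)    # 4-bit quantization
    print(f"{name}: ~{fp16:.1f} GB fp16, ~{q4:.1f} GB at 4-bit")
```

At 4-bit quantization, Phi-3-mini's weights come to roughly 1.9 GB, which is why a 3.8B-parameter model can run on a phone or laptop while frontier-scale models cannot.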

Microsoft’s entire strategy with Phi appears designed to build a stronger “fair use” defense in potential copyright litigation.

“Textbook-Quality” Web Data

  • The Curation Argument: Microsoft researchers state they trained Phi on a much smaller, more heavily filtered dataset than is typical. By focusing on data with “textbook-level” quality, they can argue they actively avoided the vast amounts of low-quality, infringing content found on social media, forums, and pirate websites that may be present in other models’ training data.
  • A Stronger “Fair Use” Case: This curation allows Microsoft to argue that their use was more targeted and less likely to harm the market for the original works, strengthening two of the four statutory fair use factors: the first (purpose and character of the use) and the fourth (effect on the market for the original). It is a direct response to the legal arguments being made against OpenAI and Google.
Synthetic Data

  • The Core Idea: A significant portion of Phi’s training data was not created by humans, but was synthetically generated by other, larger AI models (presumably from the GPT family, given Microsoft’s partnership with OpenAI).
  • Novel Copyright Questions: This raises untested legal questions. Does using synthetic data for training violate the terms of service of the model that generated it? Who owns the copyright to AI-generated text used as training data? Can a model trained on the output of another model be considered a “derivative work”?
  • Data Laundering?: Critics might argue that using synthetic data is a form of “data laundering”—if the original model (e.g., GPT-4) was trained on infringing data, then is the synthetic data it produces also “fruit of the poisonous tree”? This is a novel legal theory that has yet to be tested in court.

Licensing & Strategic Implications

  • Permissive Open-Source: Unlike many other enterprise-backed open models, the Phi-3 models are released under the very permissive MIT License. This allows for free commercial use, modification, and distribution with virtually no restrictions.
  • A Strategic Hedge?: Microsoft is OpenAI’s biggest partner and investor. Its release of a powerful, permissively licensed, and “cleaner” open-source model could be seen as a strategic hedge. As a defendant in lawsuits over both code (Doe 1 v. GitHub, Microsoft, OpenAI) and news content (Daily News, Bird), Microsoft has a strong incentive to develop an alternative AI asset that is less legally encumbered.