Training Data Sources
-
LAION-5B (and its subsets, e.g., LAION-2B-en)
Status: Confirmed
Citation: The Stable Diffusion research paper (Rombach et al., “High-Resolution Image Synthesis with Latent Diffusion Models,” CVPR 2022) and LAION's own documentation.
The LAION-5B dataset is the sole documented training source for the foundational Stable Diffusion models; versions 1.x and 2.x were trained on subsets such as LAION-2B-en and LAION-Aesthetics. It is an open, publicly available dataset of 5.85 billion image-text pairs assembled by the German non-profit LAION (Large-scale Artificial Intelligence Open Network). The pairs were derived from the Common Crawl web archive, and LAION distributes only image URLs and metadata, not the images themselves. The sketch below shows the shape of that metadata.
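A minimal sketch of inspecting one LAION metadata shard with pandas. The parquet filename is hypothetical; the column names (URL, TEXT, WIDTH, HEIGHT, similarity) follow LAION's published metadata schema, and the roughly 0.28 CLIP-similarity cutoff for the English subset is as described in LAION's release notes.

```python
# Minimal sketch: inspect one LAION metadata shard with pandas.
# The filename is hypothetical; LAION publishes metadata as parquet shards
# containing URL-caption pairs, not the images themselves.
import pandas as pd

shard = pd.read_parquet("laion2B-en-part-00000.parquet")

# Core columns in LAION's published schema: the source URL, the alt-text
# caption, image dimensions, and the CLIP image-text similarity score
# LAION used to filter the raw Common Crawl pairs.
print(shard[["URL", "TEXT", "WIDTH", "HEIGHT", "similarity"]].head())

# The English subset kept only pairs above a CLIP similarity cutoff
# (roughly 0.28), so scores cluster above that threshold.
print(shard["similarity"].describe())
```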
-
Copyrighted Images from Stock Photo Sites & Art Communities
Status: Confirmed
Citation: Getty Images v. Stability AI lawsuit; Andersen v. Stability AI lawsuit; public research into the LAION dataset.
Analysis of the LAION dataset has confirmed it contains a massive volume of copyrighted works collected without permission. The Getty lawsuit provides direct evidence that millions of its watermarked images appear in the dataset. Other heavily represented sources include Pinterest, Flickr, DeviantArt, and ArtStation, which makes the dataset the central point of legal contention. The sketch below reproduces the kind of domain analysis involved.
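The analysis behind these findings is straightforward to reproduce over the metadata. A hedged sketch, reusing the hypothetical shard filename from above with a hypothetical watchlist of domains:

```python
# Illustrative domain analysis over LAION metadata: count how many image
# URLs resolve to hosts matching a watchlist of stock-photo and
# art-community domains. Shard filename and watchlist are hypothetical.
from urllib.parse import urlparse

import pandas as pd

shard = pd.read_parquet("laion2B-en-part-00000.parquet")
hosts = shard["URL"].map(lambda u: urlparse(str(u)).netloc.lower())

watchlist = ("gettyimages", "pinterest", "flickr", "deviantart", "artstation")
for name in watchlist:
    count = int(hosts.str.contains(name, regex=False).sum())
    print(f"{name}: {count} URLs in this shard")
```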
-
Private & Sensitive Images
Status: Confirmed
Citation: Research papers and investigative reports analyzing the LAION dataset.
Because LAION is an unfiltered scrape of the internet, it has been shown to contain sensitive images, including private medical photos, photos of individuals posted without consent, and depictions of violence. This raises significant ethical and privacy concerns in addition to copyright issues.
Overview: The “Open” Image Generator & Its Legal Storm
Stable Diffusion is the most prominent open-source image generation model. Developed by the CompVis group at LMU Munich and Runway, with compute and funding from Stability AI, it sits at the absolute center of the legal storm over AI and art. Its open weights and high output quality have driven explosive adoption, but its training data and capabilities have also made it a primary target for landmark copyright lawsuits.
The Core Legal Issue: The LAION Dataset
The entire legal controversy surrounding Stable Diffusion stems from its training data.
- What is LAION-5B?: The model was trained on the LAION-5B dataset, a massive, publicly available collection of 5.85 billion image-text pairs scraped from the internet. The dataset was created by a German non-profit, LAION (Large-scale Artificial Intelligence Open Network), and is the image-generation equivalent of the controversial “Books3” dataset for language models.
- The “Original Sin”: LAION-5B is an unfiltered scrape of the internet. It contains billions of copyrighted images, including personal photos, medical images, and vast amounts of professional artwork and photography, all collected without the consent of the copyright holders.
- The Legal Claim: Plaintiffs argue that creating the LAION dataset and then using it to train a commercial product (Stable Diffusion) is copyright infringement on an industrial scale.
Key Litigation
Stable Diffusion’s training and capabilities have triggered several critical lawsuits.
Getty Images v. Stability AI
- Jurisdictions: United States (D. Del.) and United Kingdom (High Court of Justice).
- U.S. Case: No. 1:23-cv-00135 (D. Del.), filed February 3, 2023.
- Allegation: Getty Images alleges that Stability AI unlawfully copied millions of its proprietary and watermarked photographs to train the Stable Diffusion model.
- Evidence: A key piece of evidence is the model’s tendency to generate distorted versions of the Getty Images watermark, suggesting direct copying and memorization rather than transformative use.
- Core Claims: The lawsuit includes claims for direct copyright infringement, trademark infringement, and removal of copyright management information (CMI).
Andersen v. Stability AI
- Case Number: 3:23-cv-00201 (N.D. Cal.)
- Filing Date: January 13, 2023
- Allegation: A proposed class-action lawsuit led by artists, including Sarah Andersen, Kelly McKernan, and Karla Ortiz. It alleges that Stability AI, Midjourney, and DeviantArt (with Runway AI added by amendment) trained or built their products on billions of copyrighted images from the LAION dataset without consent.
- Core Claims: The suit includes claims for direct copyright infringement, inducement of infringement, and false endorsement under the Lanham Act. A central, and legally novel, argument is that the models’ ability to mimic an artist’s unique “style” constitutes a form of infringement. While copyright has not traditionally protected “style,” the plaintiffs argue the technology creates a new type of violation.
International Litigation
- United Kingdom: Getty Images also filed a parallel lawsuit in London (No. IL-2023-000007), making the legal battle an international one.
- Canada: A lawsuit, Gagne v. Stability AI, has been filed in the Federal Court of Canada.
The “RAIL” License: A Responsible AI License
Stable Diffusion is not released under a standard open-source license. It uses the custom CreativeML Open RAIL-M (Responsible AI License).
- Focus on “Use,” Not Commerce: Unlike the tiered commercial terms in Meta's Llama license or Alibaba's Qwen license, the RAIL license's restrictions are not about commerce; it allows full commercial use.
- Harm-Based Restrictions: Instead, the license forbids using the model for certain “harmful” purposes. This includes generating illegal content, defaming individuals, or creating misinformation. The goal is to contractually obligate users to behave responsibly.
- A Shift in Liability: Because the license explicitly forbids harmful uses, Stability AI can argue in court that any user who creates something harmful has violated the license agreement. It is an attempt to shift legal and ethical responsibility to the end user. The sketch below shows how the license surfaces for developers in practice.
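As a concrete illustration, here is a hedged sketch using the Hugging Face diffusers library. The repo ID reflects the historical v1.5 release, whose download was gated on accepting the RAIL-M terms on the model page; gating details and model IDs have changed over time, so treat this as illustrative rather than current practice.

```python
# Sketch: loading Stable Diffusion v1.5 with Hugging Face diffusers.
# Historically, downloading these weights required first accepting the
# CreativeML Open RAIL-M terms on the model page.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # historical v1.5 repo, RAIL-M gated
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# The pipeline bundles a safety_checker that blanks images it flags as
# NSFW -- a technical complement to the license's harm-based restrictions.
image = pipe("a watercolor painting of a lighthouse").images[0]
image.save("lighthouse.png")
```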