Training Data Sources
- Web data (Status: Reported; Citation: Mistral AI disclosures and technical blog)
- Common Crawl (Status: Confirmed)
- GitHub (Status: Confirmed)
Overview
Mistral AI is a European AI company known for its high-performance open-source and commercial models. Its Mixtral models use a sparse Mixture-of-Experts (MoE) architecture, which routes each token through only a small subset of expert sub-networks, giving large total parameter counts with comparatively low inference cost. From a legal perspective, the company's open-source releases and its limited public disclosures about training data are the notable points of interest.
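To make the MoE mechanism concrete, below is a minimal sketch of sparse top-2 routing, the gating scheme Mistral has publicly described for Mixtral 8x7B. All dimensions and weights here are hypothetical toy values, not Mistral's implementation.

```python
# Minimal sketch of sparse Mixture-of-Experts (MoE) routing.
# Top-2 gating over 8 experts as publicly described for Mixtral 8x7B;
# weights, dimensions, and layer structure are illustrative assumptions.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Toy router and expert parameters (stand-ins for learned weights).
router_w = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(token):
    """Route one token vector through its top-k experts only."""
    logits = token @ router_w           # (n_experts,) gating scores
    top = np.argsort(logits)[-top_k:]   # indices of the k highest-scoring experts
    gates = softmax(logits[top])        # renormalize gates over selected experts
    # Only top_k of n_experts run per token: large total capacity, less compute.
    return sum(g * (token @ experts[i]) for g, i in zip(gates, top))

out = moe_layer(rng.normal(size=d_model))
print(out.shape)  # (16,)
```

The design choice this illustrates is why MoE models can be "very large yet efficient": total parameters scale with the number of experts, but per-token compute scales only with the number of experts actually selected.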
Key Models & Timeline
- Mistral 7B (Sept 2023): An early, efficient Apache 2.0 licensed model. Its permissive license makes it a common base for other models.
- Mixtral 8x7B (Dec 2023): A powerful sparse MoE model, also released under the Apache 2.0 license. This model was the subject of a key copyright infringement study (see the Patronus AI study below).
- Mistral Large (Feb 2024): The company’s flagship commercial, closed-source model.
- Mixtral 8x22B (Apr 2024): A larger, more capable MoE model, likewise released under Apache 2.0.
Training Data & Copyright Risk
Mistral AI’s training data composition presents a significant area of legal inquiry.
- Stated Sources: The company reports using a mix of public web data (Common Crawl) and code repositories (GitHub).
- Lack of Transparency: Full, detailed breakdowns of training datasets are not provided, which creates ambiguity about the specific sources and volume of copyrighted material ingested by the models.
Copyright Infringement & Disputes
This section focuses on specific findings and events relevant to copyright litigation.
Patronus AI Study (2024)
A study by Patronus AI benchmarked the Mixtral-8x7B-Instruct-v0.1 model’s rate of regurgitating copyrighted content; a sketch of how such a rate can be computed follows the findings below.
- Finding: The model produced verbatim copyrighted content in 22% of tested prompts.
- Comparison: This rate is higher than Meta’s Llama 2 (10%) and Anthropic’s Claude 2.1 (8%).
- Context: It was still significantly lower than OpenAI’s GPT-4 (44%), which was the highest in the study.
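The study's exact methodology is not reproduced here; as a hypothetical illustration only, a verbatim-regurgitation rate of this kind can be computed by prompting a model with excerpts and flagging outputs that reproduce a sufficiently long exact substring of the source. The function names, the 50-character threshold, and the toy "model" below are assumptions, not the Patronus AI protocol.

```python
# Hypothetical sketch of computing a verbatim-regurgitation rate.
# NOT the Patronus AI methodology; threshold and helpers are assumed.

MIN_MATCH = 50  # chars of exact overlap counted as "verbatim" (assumed)

def is_verbatim(output: str, source: str, min_match: int = MIN_MATCH) -> bool:
    """True if output contains an exact substring of source at least
    min_match characters long (naive O(n*m) scan, fine for a sketch)."""
    for i in range(len(source) - min_match + 1):
        if source[i:i + min_match] in output:
            return True
    return False

def regurgitation_rate(cases, query_model) -> float:
    """cases: (prompt, copyrighted_source) pairs; query_model: prompt -> str.
    Returns the fraction of cases whose output is flagged as verbatim."""
    hits = sum(is_verbatim(query_model(p), src) for p, src in cases)
    return hits / len(cases)

# Toy demo: a stand-in "model" that completes the (public-domain) source.
demo = [("Call me Ishmael. Some years ago",
         "Call me Ishmael. Some years ago, never mind how long precisely")]
parrot = lambda p: p + ", never mind how long precisely"
print(regurgitation_rate(demo, parrot))  # 1.0: the toy model reproduced the source
```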
Stated Copyright Policy
Mistral AI has a public-facing copyright policy with several key assertions:
- Opt-Out Compliance: Claims to respect web crawling opt-out standards such as robots.txt (see the sketch after this list).
- No Circumvention: Asserts that it does not bypass technical measures designed to protect copyrighted works.
- Takedown Process: Provides a formal mechanism for rights holders to submit infringement complaints.
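For context on what robots.txt compliance entails mechanically, the sketch below uses Python's standard library to check whether a given user agent may fetch a URL. The user-agent string is a hypothetical placeholder, not a confirmed Mistral crawler name.

```python
# Sketch of a robots.txt opt-out check using only the standard library.
# "ExampleAICrawler" is a hypothetical user agent for illustration.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def may_crawl(url: str, user_agent: str = "ExampleAICrawler") -> bool:
    """Return True if the site's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches and parses the site's robots.txt
    return rp.can_fetch(user_agent, url)

# A crawler honoring the opt-out would skip any URL where this is False:
# print(may_crawl("https://example.com/article"))
```

A compliance claim of this kind is only as strong as its enforcement: the check must run before every fetch, and disallowed URLs must actually be excluded from the training corpus.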
Model Safeguards & Liability
- Open-Source Risks: The initial Mistral 7B model was released without typical content moderation safeguards. This led to criticism that the model could be easily prompted to generate harmful or illegal content.
- Design Philosophy: Mistral has historically favored releasing “raw” or less restricted models, prioritizing performance over built-in safety mechanisms. This philosophy could be a factor in arguments concerning foreseeable misuse and the developer’s responsibility for a model’s outputs.