AI Litigation Support | S-Square Research

Training Data Sources

Proprietary Curated Dataset

Status: Confirmed

Citation: NVIDIA's Nemotron-4 340B technical report and blog posts.

NVIDIA states that Nemotron-4 340B was pretrained on 9 trillion tokens of 'carefully curated' data. The frequently cited 98% synthetic figure comes from the same technical report, but it describes the data used in the alignment (fine-tuning) stage, which was generated by other AI models. It is not the share of synthetic data in the 9-trillion-token pretraining corpus.

Real-World Data (Public & Licensed)

Status: Reported

Citation: NVIDIA's technical report.

The pretraining corpus is described as a mix of roughly 70% English text, 15% multilingual text, and 15% source code. NVIDIA has not disclosed the specific sources, but claims the data was heavily filtered for quality and safety.

Copyrighted Books

Status: Alleged

Citation: Nazemian v. NVIDIA and Dubus v. NVIDIA lawsuits.

Despite NVIDIA's claims of a curated dataset, it is a defendant in class-action lawsuits from authors who allege their copyrighted books were used without permission to train its NeMo large language models. These suits remain the primary legal challenge to NVIDIA's model development.

Overview: The “Picks and Shovels” Provider

NVIDIA is the single most important hardware company in the AI ecosystem, providing the GPUs that power the training and inference of nearly all major models. From a legal perspective, NVIDIA’s own AI models are best understood as a reference implementation and a sales tool to drive hardware adoption and lock customers into its software ecosystem (CUDA, AI Enterprise). The company’s primary goal is not to sell model access, but to sell the “picks and shovels” for others to build their own models.

Key Models & Platforms

Nemotron-4 340B: A large, openly released model from NVIDIA, distributed under the NVIDIA Open Model License rather than a standard open-source license. Its primary purpose is to provide a high-quality foundation for enterprises to fine-tune with their own proprietary data.
NVIDIA NeMo: An end-to-end framework for building, customizing, and deploying generative AI models. It includes tools for data curation, training, and a feature called “Guardrails.”
NVIDIA AI Enterprise: The commercial software platform that provides support, security, and management for companies using NVIDIA’s AI tools in production.

Legal Strategy: The “Enabling Tools” Defense

NVIDIA has faced fewer copyright lawsuits than OpenAI and Google, although it is a defendant in the author class actions described under Key Litigation. Its legal posture is built around positioning itself as a neutral tools provider, not a data provider.

Focus on Customization: NVIDIA’s core message to enterprises is: “Don’t use a generic public model; use our tools to build a custom model on your own data.” This framing is designed to shift legal responsibility for the customization data away from NVIDIA and onto its customers.
“Guardrails” for Risk Management: The NeMo “Guardrails” feature is a key part of this strategy. It is a toolkit that allows a company to control the outputs of a model, for example, by preventing it from talking about certain topics or by steering it to cite specific sources. This is marketed as a way for enterprises to manage their own legal and brand risk. It does not, however, solve the underlying copyright issue of the foundational model’s training data.
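
To make the distinction concrete, the following is a minimal sketch of what an output-side guardrail does in principle. It is not the NeMo Guardrails API: the class name OutputRail, its fields, and the keyword-matching approach are hypothetical and chosen only for illustration. The point it shows is that the filter runs on text the model has already generated, so it can restrict what reaches the user but cannot change what the model was trained on.

```python
import re
from dataclasses import dataclass, field


@dataclass
class OutputRail:
    """Hypothetical post-generation filter (illustrative only, not NeMo Guardrails)."""

    # Maps a topic label to keywords that signal the topic (assumed scheme).
    blocked_topics: dict = field(default_factory=dict)
    refusal: str = "I can't discuss that topic."

    def check(self, model_output: str) -> str:
        """Return the refusal text if any blocked keyword appears, else the output unchanged."""
        lowered = model_output.lower()
        for keywords in self.blocked_topics.values():
            for keyword in keywords:
                # Whole-word match so "suit" does not trigger on "suitable".
                pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
                if re.search(pattern, lowered):
                    return self.refusal
        return model_output


# Example: an enterprise blocks discussion of its own legal disputes.
rail = OutputRail(blocked_topics={"litigation": ["lawsuit", "settlement"]})
print(rail.check("Our pending lawsuit is going well."))
print(rail.check("The GPU ships next quarter."))
```

The first call returns the refusal text and the second passes the output through unchanged. In both cases the underlying model and its training data are untouched, which is why a filter of this kind manages output risk without addressing claims about how the foundation model was trained.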

The NVIDIA Open Model License

Like other enterprise-focused companies, NVIDIA releases its “open” models under a custom, restrictive license.

Use-Based Restrictions: The license includes restrictions that, for example, prohibit using the model to train or improve a competing AI model.
No Warranty or Indemnity: The license provides the model “as-is” and offers no warranty or indemnity. The user assumes all legal risk.
The Enterprise Upsell: This lack of protection for the “open” model is a key part of the business strategy. It creates an incentive for risk-averse companies to pay for NVIDIA AI Enterprise, where they can get commercial support and potentially negotiate for stronger contractual protections, including a possible (but not publicly advertised) IP indemnity.

Key Litigation

While NVIDIA has aimed to position itself as a neutral tools provider, it has not entirely avoided copyright litigation. The company is a defendant in lawsuits alleging that its own foundational models were trained on infringing data.

Author Class-Action Lawsuits (Consolidated)

Case Numbers: Includes Nazemian v. NVIDIA (4:24-cv-01454) and Dubus v. NVIDIA (4:24-cv-02655), consolidated before Judge Jon S. Tigar (N.D. Cal.).
Allegation: A consolidated class-action lawsuit by authors (including Abdi Nazemian, Brian Keene, and Andre Dubus III) alleging that NVIDIA used their books without permission to train its NeMo large language models. The authors claim their works were part of a dataset of nearly 200,000 books used for training.
Core Claim: Direct copyright infringement for the use of books in training data.

YouTube Creator Lawsuit

Millette v. NVIDIA: A class action by YouTube video creators alleged that NVIDIA scraped their videos to train its models. This case was voluntarily dismissed.

International Litigation

Canada: NVIDIA is also named as a defendant in MacKinnon v. NVIDIA, a lawsuit filed in the Supreme Court of British Columbia.
