Training Data Sources
- Undisclosed web and proprietary data
Status: Alleged
Citation: Inferred from model performance, as no information is public.
Overview: The “Black Box” Leaderboard Model
“Nova” is the name given to a family of high-performance large language models that have appeared on various AI leaderboards without a clearly identified developer. From a legal perspective, the anonymous nature of these models makes them the ultimate “black box” and exceptionally dangerous to use in any production environment.
The Anonymous Vendor Problem
- No Accountability: With no known developer, there is no legal entity to sue, negotiate with, or hold accountable for the model’s output. If the model generates infringing, defamatory, or otherwise illegal content, the liability falls entirely on the user.
- Unknown Jurisdiction: The developers could be operating from any jurisdiction in the world, including those with weak or non-existent copyright enforcement, or those under international sanctions. This creates unpredictable legal and geopolitical risks.
Training Data: A Complete Unknown
The training data for the Nova models is entirely unknown, which presents the worst-case scenario for copyright risk.
- Presumption of Infringement: Given the models' high performance, they must have been trained on massive datasets. With no evidence to the contrary, a lawyer must assume this data was scraped from the internet without regard for copyright and likely includes infringing sources such as Books3, pirated content, and personal data.
- No “Fair Use” Defense: A user of these models would have an extremely difficult time arguing “fair use” in court. They would be unable to offer any evidence about the nature of the copyrighted works used in training or the purpose and character of that use, two key factors in the fair use analysis.
The Ultimate Legal Risk
Using a model from an anonymous source in a commercial product would be legally reckless.
- Zero Legal Recourse: Unlike models from Google, OpenAI, or Amazon, an anonymous model comes with no possibility of a copyright indemnity. The user bears 100% of the legal risk.
- Reputational Suicide: For any established company, being found to have used an anonymous, likely infringing AI model in a product could lead to catastrophic reputational damage, on top of any legal penalties.
- “Fruit of the Poisonous Tree”: By analogy to that doctrine, any product or service built on top of a Nova model could be treated as tainted by its infringing foundation. A court could order the entire product to be taken down, or even destroyed, if it is fundamentally based on a massively infringing underlying work.