Synthetic Data
Synthetic data is information that has been artificially created by a generative AI model rather than collected from the real world. AI companies are increasingly turning to it for two reasons: high-quality human-generated data is becoming scarce, and synthetic data looks like a potential legal shield against copyright infringement lawsuits. The argument runs that if a model is trained on “fake” data, it cannot infringe on real, copyrighted data. This argument is technically flawed and legally dubious.
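To make the mechanism concrete, here is a minimal sketch of how a synthetic corpus is typically produced: a “parent” generative model is prompted to emit new examples, which are collected into a training set for a “child” model. The OpenAI-style client and model name below are illustrative assumptions, not a description of any particular company’s pipeline.

```python
# Minimal sketch: prompting a parent model to mass-produce synthetic text.
# Assumes the `openai` Python package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def generate_synthetic_examples(topic: str, n: int) -> list[str]:
    """Ask the parent model for n short synthetic documents about `topic`."""
    examples = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # hypothetical choice of parent model
            messages=[{
                "role": "user",
                "content": f"Write a short, original news-style paragraph about {topic}.",
            }],
        )
        examples.append(response.choices[0].message.content)
    return examples

# Every example returned here is shaped by whatever the parent model was
# trained on: the point the rest of this section develops.
synthetic_corpus = generate_synthetic_examples("municipal budgets", n=3)
```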
Analogy: Lab-Grown Meat
Think of training data as the food you eat.
- Real Data: This is like eating a varied diet of real, farm-grown food—vegetables, fruits, and meat from actual animals. It’s complex, diverse, and contains all the nutrients and variety of the real world. This is the human-created internet.
- Synthetic Data: This is like eating a diet of lab-grown meat. A scientist takes a sample of a real steak, analyzes its cellular structure, and then grows a new piece of meat in a petri dish. It looks and tastes like steak, but it was created in a sterile lab. This is data generated by an AI.
Now, consider the problems with a diet of only lab-grown meat:
- It’s Still Derived from the Original: You couldn’t create the lab-grown steak without first having a real steak to analyze. Similarly, you cannot create “clean” synthetic data without first having a “parent” AI model that was trained on the real, messy, copyrighted internet. The synthetic data is a derivative of the original data, one step removed.
- It Lacks True Variety: The lab-grown steak will be a simplified version of the real thing. It won’t have the complex texture or the subtle flavors that come from a real animal’s life. Synthetic data is the same: generative models oversample the common cases and undersample the rare ones, so the output is an averaged, less diverse version of reality.
- The Long-Term Effects are Unknown: What happens if you eat only lab-grown meat for years? It might lack certain trace nutrients, leading to unforeseen health problems. This is model collapse. If you train a new generation of AIs only on the synthetic data produced by the previous generation, they will get progressively dumber, more biased, and more detached from reality. The short simulation after this list shows the effect in miniature.
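Here is a toy illustration of model collapse, under heavy simplifying assumptions: the “model” at each generation is just a Gaussian fitted to samples drawn from the previous generation’s Gaussian, and the original data is never seen again. The exact rate of decay depends on the sample size and model class, but the downward drift in diversity is systematic.

```python
# Toy model-collapse simulation: each generation fits a Gaussian to
# synthetic samples from the previous generation's fit, never revisiting
# the original data. Watch the fitted spread shrink.
import numpy as np

rng = np.random.default_rng(42)

# Generation 0: the "real world", a wide and varied distribution.
data = rng.normal(loc=0.0, scale=1.0, size=20)

for generation in range(1, 101):
    mu, sigma = data.mean(), data.std()
    # The next generation trains ONLY on synthetic output of the last fit.
    data = rng.normal(loc=mu, scale=sigma, size=20)
    if generation % 10 == 0:
        print(f"generation {generation:3d}: fitted std = {sigma:.4f}")
```

In typical runs the fitted standard deviation shrinks by orders of magnitude over the hundred generations. The tails of the distribution vanish first, which is exactly the lost “variety” described above.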
The Legal and Technical Flaws
Using synthetic data is not the “get out of jail free” card that some in the AI industry hope it will be.
- Derivative Work Argument: A strong legal argument can be made that synthetic data is a derivative work of the data the parent model was trained on. If GPT-4 was trained on The New York Times, and you then use GPT-4 to generate a million “synthetic” news articles to train your new model, your model is still being trained on the intellectual property of The New York Times, just laundered through an intermediary.
- The Threat of Model Collapse: From a product liability perspective, a company’s heavy reliance on synthetic data could be framed as negligence. They are knowingly training their model on an impoverished, distorted version of reality, which could lead to an unreliable and unsafe product. As model collapse becomes more widely understood, a failure to maintain a “diet” of fresh, human-generated data could be seen as a breach of the standard of care.
- Obfuscating Provenance: The use of synthetic data makes it much harder to trace the provenance of a model’s behavior. The biases and infringing content of the parent model are “baked into” the synthetic data in subtle and complex ways, making a full audit of the training data’s lineage nearly impossible. This obfuscation may be intentional. A sketch of the lineage metadata that would make such an audit possible follows this list.
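By way of contrast, here is a hedged sketch of the kind of lineage metadata that would make a synthetic corpus auditable. The field names and record format are illustrative assumptions, not any established standard; the point is that unless records like these are attached at generation time, provenance cannot be reconstructed afterward.

```python
# Sketch of per-sample lineage metadata for a synthetic corpus.
# All field names are hypothetical; no standard schema is implied.
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    sample_sha256: str         # content hash of the synthetic sample
    parent_model: str          # which model generated it
    parent_model_version: str  # exact checkpoint, since behavior drifts
    prompt_sha256: str         # hash of the prompt that elicited it
    generated_at: str          # UTC timestamp, for reproducing the run

def record_provenance(sample: str, prompt: str,
                      model: str, version: str) -> ProvenanceRecord:
    return ProvenanceRecord(
        sample_sha256=hashlib.sha256(sample.encode()).hexdigest(),
        parent_model=model,
        parent_model_version=version,
        prompt_sha256=hashlib.sha256(prompt.encode()).hexdigest(),
        generated_at=datetime.now(timezone.utc).isoformat(),
    )

# One record per sample, stored alongside the corpus:
rec = record_provenance(
    sample="(synthetic article text)",
    prompt="Write a short news-style paragraph.",
    model="parent-llm",       # hypothetical identifier
    version="2024-06-ckpt",   # hypothetical checkpoint tag
)
print(json.dumps(asdict(rec), indent=2))
```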
Synthetic data is a powerful tool, but it is not a magical solution to AI’s legal problems. It is a complex and risky trade-off that creates a new and subtle set of legal vulnerabilities for the companies that rely on it.