Original Paper: Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers
Authors: Tuhin Chakrabarty; Jane C. Ginsburg; Paramveer Dhillon
Stony Brook University; Columbia Law School; University of Michigan; MIT Initiative on the Digital Economy
Corresponding: [email protected]; [email protected]; [email protected]
TLDR:
- Standard LLM prompting failed to achieve stylistic fidelity, but fine-tuning models on complete copyrighted works led expert readers to strongly prefer the AI output over that of expert human writers.
- This preference reversal is attributed to fine-tuning eliminating detectable “AI stylistic quirks,” making the resulting text nearly impervious to current AI detectors.
- The low cost and high quality of these fine-tuned substitutes provide direct, actionable evidence supporting the “effect upon the potential market” factor in fair use analysis.
The escalating litigation concerning the use of copyrighted material in training Large Language Models (LLMs) fundamentally rests on a pivotal, yet often unquantified, question: do the resulting AI outputs truly constitute market-substituting derivatives? A recent, highly relevant study, “Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers,” co-authored by Tuhin Chakrabarty, the eminent legal scholar Jane C. Ginsburg, and Paramveer Dhillon, provides the empirical data necessary to bridge this legal-technical gap.
This research delivers a pragmatic account of the market threat posed by sophisticated generative AI. For years, opponents of LLM training have argued that the technology inherently threatens the potential market for original works, weighing the fourth factor of the Fair Use doctrine against a finding of fair use. However, proving this harm required demonstrating that AI could produce content of sufficient quality and fidelity to compete directly with, or even surpass, expert human writers. This study addresses that question head-on by comparing excerpts generated by frontier models (ChatGPT, Claude, Gemini) against those written by MFA-trained expert human writers, all tasked with emulating the diverse styles of 50 award-winning authors. The resulting data shifts the conversation from theoretical speculation about market harm to concrete, statistically significant evidence of substitution potential.
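To make the prompting-versus-fine-tuning distinction concrete, the sketch below contrasts the two deployment modes. It is a minimal illustration, not the paper's actual prompts or training recipe; the helper functions, prompt wording, and the chat-format JSONL serialization (a format commonly accepted by fine-tuning APIs) are all assumptions.

```python
import json

# Condition 1: in-context prompting. The style exemplar travels inside the
# prompt; the model's weights are untouched. (Hypothetical wording.)
def build_in_context_prompt(author: str, exemplar: str, task: str) -> str:
    return (
        f"Here is a passage by {author}:\n\n{exemplar}\n\n"
        f"Now write a new excerpt of up to 450 words in the same style. Task: {task}"
    )

# Condition 2: fine-tuning. The author's works are serialized as supervised
# training examples; the model's weights are then updated on them.
def build_finetune_records(author: str, passages: list[str]) -> list[str]:
    system = f"You are a novelist who writes in the style of {author}."
    return [
        json.dumps({
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": "Write the next passage."},
                {"role": "assistant", "content": passage},
            ]
        })
        for passage in passages
    ]
```

The legal exposure the study identifies attaches to the second path: the author's complete works themselves become the training signal, rather than a one-off exemplar in a prompt.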
The implications extend far beyond academia. For industry, this research defines the technical threshold at which general training transforms into targeted derivative creation. For litigators, it provides the missing empirical evidence required to successfully argue market harm in copyright infringement cases.
Key Findings
- The Fine-Tuning Reversal: In initial blind evaluations, expert readers (MFA candidates) overwhelmingly disfavored standard AI outputs (generated via in-context prompting) for both stylistic fidelity and overall quality. However, when models were fine-tuned specifically on the complete works of individual authors, this result dramatically reversed: experts favored the fine-tuned AI output over the human expert submissions for both stylistic fidelity (OR = 8.16) and quality (OR = 1.87); the odds-ratio arithmetic is sketched after this list. This finding demonstrates that the method of deployment, not just the initial training, is critical to assessing derivative quality.
- Elimination of AI Artifacts: Mediation analysis revealed that the initial rejection of standard AI outputs was driven by detectable “stylistic quirks,” such as higher cliché density. Fine-tuning effectively eliminated these artifacts, and with them the penalty readers attached to recognizably AI-like prose. This technical refinement is the mechanism that enables the market-substitution threat.
- Failure of Detection: The fine-tuned outputs, preferred by human experts, were rarely flagged as AI-generated (a mere 3% detection rate) by state-of-the-art AI detectors. This confirms that current technological countermeasures are inadequate against high-fidelity, fine-tuned generative outputs, creating a blind spot through which commercially viable, infringing content can circulate undetected.
- The Cost of Substitution: The median cost of fine-tuning and inference to produce these high-quality, preferred outputs was approximately $81 per author, a 99.7% reduction compared to human labor (the implied human-labor benchmark is worked through after this list). This dramatically low cost underscores the immediate economic viability of using fine-tuning to generate market-competing content at scale.
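A note on the reported odds ratios: whatever regression the authors fit over the pairwise judgments, the headline quantity reduces to a ratio of preference odds across the two conditions. A minimal sketch with hypothetical vote counts (not the study's data):

```python
def odds_ratio(ai_wins_ft: int, human_wins_ft: int,
               ai_wins_icl: int, human_wins_icl: int) -> float:
    """Odds of preferring AI text under fine-tuning, relative to the
    odds of preferring AI text under in-context prompting."""
    odds_finetuned = ai_wins_ft / human_wins_ft
    odds_prompted = ai_wins_icl / human_wins_icl
    return odds_finetuned / odds_prompted

# Hypothetical counts, chosen only to illustrate the scale of the effect:
# fine-tuned: AI preferred 120 times vs. 49 for the human writer;
# in-context: AI preferred 40 times vs. 133 for the human writer.
print(round(odds_ratio(120, 49, 40, 133), 2))  # -> 8.14, near the reported 8.16
```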
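The 99.7% figure can also be sanity-checked against the $81 median. Assuming the reduction is computed as 1 − (AI cost ÷ human cost), the two numbers together imply a human-labor benchmark of roughly $27,000 per author:

```python
ai_cost = 81.0      # reported median fine-tuning + inference cost per author (USD)
reduction = 0.997   # reported cost reduction relative to human labor

# Solving reduction = 1 - ai_cost / human_cost for human_cost:
implied_human_cost = ai_cost / (1 - reduction)
print(f"Implied human-labor benchmark: ${implied_human_cost:,.0f} per author")
# -> Implied human-labor benchmark: $27,000 per author
```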
Legal and Practical Impact
These findings fundamentally reshape how compliance and litigation strategies must address LLM utilization, particularly regarding the fourth factor of Fair Use.
Litigation Strategy: Litigators representing authors now possess direct, actionable evidence that specific practices (fine-tuning on complete works) result in outputs that are not merely “transformative,” but are demonstrably preferred by readers over expert human alternatives. This preference is the empirical basis for arguing that the AI outputs directly substitute for, and thus injure, the market for the original author’s creative style and subsequent works. The low production cost further strengthens the argument that the use is commercially exploitative and non-transformative in the economic sense.
Compliance and Licensing: For technology companies, the research draws a sharp distinction between general pre-training and targeted fine-tuning. Companies that fine-tune LLMs on proprietary or copyrighted datasets to create style-specific derivative works must recognize that they are operating at the highest risk threshold for infringement. Compliance strategies must shift away from relying solely on broad “transformative use” defenses for foundational models and move toward specific licensing mechanisms for the data used in fine-tuning, especially when the goal is stylistic emulation. This process is functionally equivalent to commissioning a derivative work and must be treated as such.
Industry Norms: The failure of AI detectors against fine-tuned content confirms that technical detection cannot be relied upon as a primary defense or compliance mechanism. Industry norms must pivot toward provenance and transparent data lineage tracking, ensuring that the specific datasets used for fine-tuning are auditable and appropriately licensed, particularly given the demonstrated ability of fine-tuning to create highly competitive, undetectable substitutes.
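In practice, such provenance tracking can start with a content-addressed manifest: a cryptographic digest and license tag recorded for every file that enters the fine-tuning corpus. The sketch below is illustrative only; the directory layout, file extension, and license identifier are assumptions, not an existing industry standard.

```python
import hashlib
import json
from pathlib import Path

def build_manifest(corpus_dir: str, license_id: str) -> list[dict]:
    """Record a SHA-256 digest and a license tag for each corpus file,
    making the fine-tuning dataset auditable after the fact."""
    manifest = []
    for path in sorted(Path(corpus_dir).glob("**/*.txt")):
        manifest.append({
            "file": str(path),
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            "license": license_id,  # e.g., a negotiated per-author license ID
        })
    return manifest

if __name__ == "__main__":
    # Hypothetical corpus directory and license identifier:
    print(json.dumps(build_manifest("finetune_corpus", "LIC-AUTHOR-0042"), indent=2))
```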
Risks and Caveats
While the data is robust, expert examiners and skeptical litigators must note several limitations. The study focused on relatively short excerpts (up to 450 words); the technical challenge of generating a cohesive, book-length narrative of sustained quality remains a higher hurdle not fully addressed here. Furthermore, the $81 cost metric, while compelling, is limited to the cost of fine-tuning and inference, excluding the substantial capital expenditure required for data acquisition, curation, and the underlying foundational model infrastructure. Finally, the “expert” reader pool, while rigorous (MFA candidates), is a proxy for, not the entirety of, the commercial literary market (e.g., publishers, agents, critics). These factors provide potential avenues for defense counsel seeking to minimize the scope of the market harm finding.
Fine-tuning an LLM on specific copyrighted works is not merely training; it is the creation of a high-fidelity, market-preferred derivative product that directly substantiates claims of economic harm.