
When AI Training Runs Into Copyright Law: What Adobe’s Proposed Class Action Says About Responsibility in the AI Era

  • Yoshi Soornack
  • Dec 20, 2025
  • 5 min read

Updated: Dec 21, 2025

A proposed class-action lawsuit against Adobe claims that copyrighted books were used without authorisation to train its AI models. The legal and operational consequences extend far beyond a single company.




The Lawsuit That Could Redefine AI Training Norms

In December 2025, Adobe Inc. was hit with a proposed class-action lawsuit in the United States alleging that the company used copyrighted books without permission to train one of its artificial intelligence models. The case was filed in the U.S. District Court for the Northern District of California by author Elizabeth Lyon, who asserts that her instructional books and those of other writers were used without authorisation as part of the training data for Adobe’s SlimLM language model. 


The complaint is significant because it targets not only Adobe’s use of content but also the legitimacy of widely used AI training datasets, specifically those derived from sources like Books3 and RedPajama, which have been at the centre of several high-profile copyright disputes this year. 


While Adobe’s response at the time of filing was limited, the case reinforces a broader legal landscape in which authors question whether developers can incorporate their works into AI training without consent, credit, or compensation.


What the Complaint Alleges, in Practical Terms

According to the filing, Adobe’s SlimLM was pre-trained on a dataset called SlimPajama-627B, which the company describes as an open-source, deduplicated multi-corpora dataset intended to standardise training sources.


However, the plaintiff’s complaint contends that SlimPajama is itself a derivative of RedPajama, which incorporates the Books3 corpus, an extensive collection of tens of thousands of books under standard copyright. The lawsuit asserts that this alleged lineage brings copyrighted works into Adobe’s training pipeline without licensing or consent.
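
To see why lineage matters operationally, consider a minimal sketch (in Python; the dependency graph and licence notes below merely encode the complaint’s allegations for illustration, not verified facts) of tracing a dataset’s ancestry. A team that consumes only the top-level dataset never encounters the leaf sources unless it walks the chain explicitly.

```python
# Illustrative sketch only: the lineage below encodes the complaint's
# allegations (SlimPajama derives from RedPajama, which drew on Books3),
# and the licence notes are placeholders, not an authoritative record.

DATASET_PARENTS = {
    "SlimPajama-627B": ["RedPajama"],
    "RedPajama": ["Books3", "other corpora"],
}

LICENCE_NOTES = {
    "Books3": "copyrighted books, allegedly unlicensed",
    "other corpora": "mixed sources and terms",
}

def trace_ancestry(dataset: str) -> list[str]:
    """Return every upstream source reachable from `dataset`."""
    seen, stack = [], [dataset]
    while stack:
        for parent in DATASET_PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen

for source in trace_ancestry("SlimPajama-627B"):
    print(f"{source}: {LICENCE_NOTES.get(source, 'unknown')}")
```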


The proposed class action, brought on behalf of all similarly situated authors, seeks monetary damages, injunctive relief to halt continued use of the allegedly infringing data, and a declaration that Adobe violated copyright law. Whether the court grants class certification will shape the scale of the litigation’s impact.


Why This Matters Beyond Adobe

This is not the first time authors have challenged the use of copyrighted works in AI training. Earlier in 2025, Anthropic, a major AI company, agreed to a US$1.5 billion settlement with a class of authors who alleged that pirated books were used to train its Claude models. That settlement, the largest to date in AI-related copyright litigation, sent a clear signal that creators were willing to escalate legal challenges.


Similar actions have been mounted against other large technology companies, including Apple and Salesforce, with lawsuits rooted in the same underlying datasets. In some cases, plaintiffs have specifically cited Books3 as a source of copyrighted material that entered training corpora without appropriate licensing.


These cases collectively highlight a central tension: training large-scale models requires vast amounts of data, often assembled from many sources, but the provenance and licensing of that data are under intensifying legal scrutiny. A ruling against Adobe, or traction in similar cases, would force businesses to reconsider how they document, curate and license their training data.


The Emerging Legal and Operational Challenge

Under U.S. copyright law, plaintiffs in this context typically need to demonstrate both access to the copyrighted works and copying of them. That standard is evolving in the AI era: courts and legal scholars are debating whether ingesting copyrighted material into large-model training falls within fair use or constitutes an actionable violation.


A parallel dimension of the debate concerns derivative datasets. Adobe’s reliance on SlimPajama, which it describes as open source, allegedly masks copyrighted components inherited from upstream sources. This raises questions not only about direct copying but also about the legal status of derivative datasets and companies’ obligations to verify and document dataset origins before training models.


Litigation of this kind makes these questions concrete for organisations processing large datasets for machine learning:

  • What constitutes legally permissible data ingestion?

  • How should organisations document data provenance?

  • Where do responsibilities lie when sourcing from aggregated or derivative datasets?


The answers remain unsettled, but litigation like this forces organisations to move from assumptions to accountable practices.
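
One way to move toward accountable practice is to attach a structured provenance record to every dataset and gate training on its review. The sketch below is a minimal illustration in Python; the field names and the review gate are assumptions of ours, not an established standard or any organisation’s actual process.

```python
# Illustrative sketch of a per-dataset provenance record. Field names
# and the review gate are assumptions for illustration only.

from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    dataset: str
    origin_url: str                      # where the data was obtained
    licence: str                         # terms under which it is used
    upstream_sources: list[str] = field(default_factory=list)
    licence_verified: bool = False       # has legal review signed off?

    def ready_for_training(self) -> bool:
        """Simple gate: nothing enters the pipeline until legal review
        has verified the licence terms."""
        return self.licence_verified

# Hypothetical record for an example corpus.
record = ProvenanceRecord(
    dataset="example-corpus-v1",
    origin_url="https://example.com/data",
    licence="CC-BY-4.0",
    upstream_sources=["example-web-crawl"],
)
print(record.ready_for_training())  # False until review is complete
```

Even a record this simple forces the questions above to be answered per dataset rather than assumed.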


Why This Matters for Project Delivery Professionals

For leaders and teams responsible for delivering AI-infused products or solutions, this lawsuit carries practical implications:


1. Documentation of Data Practices Must Be Operationalised. Legal teams are no longer solely responsible for data strategy. Product, delivery and engineering functions must jointly ensure that data sourcing and usage comply with copyright and licensing obligations.


2. Contracts and Licensing Require Clarity. Whether AI capability is built in-house or supplied by vendors, organisations must treat training-data provenance as a contractual requirement and a negotiable risk. Projects reliant on third-party models or datasets should explicitly address rights and use permissions.


3. Ethical and Compliance Risks Are Delivery Risks. Copyright litigation like this highlights that data ethics is not optional. A misstep in training data practices can delay roadmaps, expose organisations to material damages and erode stakeholder trust. These are material delivery risks with business consequences.


4. Trust and Transparency Become Competitive Factors. Organisations that can show transparent chains of consent and lawful use of data may find that clients and partners increasingly prefer them over those that rely on opaque training practices.


A Broader Context on AI and Copyright Law

The Adobe lawsuit fits into a broader trend in which legal standards are catching up with rapid technological adoption. Industry observers note that earlier rulings, including those in litigation involving other large language model developers, may influence how courts treat future cases.


In some courts, judges have expressed scepticism about broad fair use claims for AI training when the output could directly compete with the commercial market for the underlying work.


Moreover, AI models that reproduce copyrighted text verbatim, or in ways that compete with original works, may undermine longstanding assumptions about what fair use protects. This is more than a technical legal argument; it intersects with market impacts and creator rights.
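
To give “verbatim reproduction” a testable meaning, the sketch below shows a naive word n-gram overlap check between model output and a reference text. This is our own illustration of the underlying idea, not a test any court or vendor has endorsed; real memorisation audits are considerably more sophisticated.

```python
# Naive sketch: flag long word sequences shared between a model's
# output and a reference work. Illustration only.

def shared_ngrams(output: str, reference: str, n: int = 8) -> set[str]:
    """Return word n-grams that appear in both texts."""
    def ngrams(text: str) -> set[str]:
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(output) & ngrams(reference)

reference = "the quick brown fox jumps over the lazy dog near the river bank"
output = "as noted the quick brown fox jumps over the lazy dog near the pond"
overlaps = shared_ngrams(output, reference)
print(overlaps or "no long verbatim overlap found")
```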


“Adobe faces a class-action lawsuit alleging it used unauthorised copyrighted works to train its AI models.” — Reuters reporting on Lyon v. Adobe Inc. 

“Authors are increasingly asserting legal claims over the use of copyrighted works in generative AI systems.” — MediaNama on the expanding pattern of litigation against AI training data practices.

If your organisation builds, integrates or deploys AI systems, now is the moment to review how training data is sourced and documented. The legal landscape is shifting quickly, and assumptions about training data permissibility will no longer protect you from risk.


We also think that as AI becomes infrastructure, consent and provenance will become central to defensibility and commercial trust.


Ensure your next project iteration includes a data governance review that aligns with both legal standards and ethical use practices. This is not a theoretical exercise; it is foundational to sustainable delivery in an AI-enabled world.


For ongoing analysis of how legal, moral and operational shifts are reshaping AI-enabled delivery, subscribe to Project Flux.