When AI Trains on Your Work Without Permission, Even Libraries Look Suspicious

The fair-use defense for AI training data is severing creators from institutions—libraries and journals—that once stood as their allies.

20 records · 4 web citations

The Doctrine That Made Allies Into Suspects

Fair use was designed as a shield for individuals against institutional power. The legal argument now being deployed to justify AI training on copyrighted work runs that doctrine in reverse: institutional actors use individual-rights language to extract value from individual creators at scale. The question raised on Bluesky, whether even libraries now look suspicious, is not a metaphor: it names the precise institutional relationship under strain. Libraries advocate for fair use. AI companies cite fair use. Creators who cheered the first argument are now watching the second and finding the logic indistinguishable. That is not a rhetorical trap. It is a structural consequence of how the doctrine was written, now becoming visible in the AI context.

Shadow Libraries and the Infrastructure of Extraction

The chain of liability running from pirate archive to frontier model is no longer a theory — it is active litigation. The five major publishers sued Anna's Archive in March 2026 specifically because its pirated collections were used by AI developers building large language models, and Adobe's suit over pirated book datasets followed the same evidentiary logic. What these suits establish is not just that copying happened, but that the shadow library ecosystem now under litigation was a deliberate infrastructure layer — not incidental scraping. Every major frontier model trained on books through channels ranging from legally gray to pirated, and the organizations that maintained those channels are now the named defendants in suits the AI companies themselves are not party to. The liability has been offloaded to the archive maintainers. The model weights containing the extracted value remain with the labs.

The EFF's Position and the Trust It Is Spending

The Electronic Frontier Foundation spent decades as the institution authors and creators turned to when corporations overreached on copyright. Its slow movement on AI appropriation — read by at least one commenter as unsurprising evidence that 'EFF really sucks' — is not simply a reputational problem. It is a decision about which version of fair use the organization is willing to defend. The EFF that fought for user rights against entertainment industry overreach is being asked to evaluate a situation where the overreaching party is a technology company and the rights at stake are creators'. The organization's hesitation is legible as a structural tension between its historical constituencies — coders, activists, technologists — and the creators now asking for protection from those same constituencies. That tension has no clean resolution, but the creators watching it are drawing conclusions without waiting for one.

Academic Institutions and the Disclosure Double-Bind

The same pattern that is fracturing creator trust in libraries and civil liberties organizations is running through academic institutions on a separate track. The concern raised about journal AI disclosure policies, that disclosure is required but then counts against authors during peer review, is not a hypothetical edge case. It is the institutional equivalent of building a confession booth and then using confessions as evidence. Student work feeding generative AI training pipelines raises an analogous problem: the data protection obligations institutions carry have not been revised to reflect what it means when a platform's model trains on the work students submit. Institutions that set AI disclosure requirements without auditing their platform partnerships have created a compliance surface they do not control. The policy is the liability.

The Power Asymmetry That Makes Fair Use Feel Like a Structural Insult

The argument that AI IP appropriation is categorically different from countercultural piracy is not sentimental — it is structural. When the entity extracting value from creative work is capitalized in the tens of billions and the creator whose work was used cannot afford the litigation to contest it, the fair-use doctrine is functioning as a subsidy flowing upward. The litigation now underway is being pursued by major publishers, not by individual authors, because individual authors cannot carry it. That fact has already changed what the copyright fight is: it is no longer a dispute between creators and AI companies. It is a dispute between two sets of institutions — publishers and AI labs — over a transfer of value that left the creators who generated that value outside the room where the settlement will be written.

The story so far

The AI copyright fight has crossed from creator-versus-lab into creator-versus-institution. Libraries, the EFF, and academic journals now occupy contested ground — and the authors who built those institutions' legitimacy are the ones questioning it.

Frequently Asked

Why are AI companies using shadow libraries instead of licensed datasets for training?
Because licensed data at the scale required for frontier model training either does not exist or would cost more than the labs chose to spend. The shadow library ecosystem (Anna's Archive and its predecessors) had already assembled millions of books in machine-readable form. Using it was faster and cheaper than negotiating rights. The publishers' lawsuit against Anna's Archive makes this explicit: the suit targets the archive because AI developers used its collections to build large language models, not because individual users downloaded books.

What should I do as an academic author if my institution requires AI disclosure but reviewers penalize it?
Document the penalization in writing and report it to the journal editor and your institution's research integrity office. The concern is already named in public: if disclosure is required but then damages your review outcome, the institution has created a bad-faith policy. Your documented report creates an evidentiary record. Do not avoid disclosure to game the review: that compounds the problem and puts you at greater risk if the work is later audited.

What is the strongest argument that AI training on copyrighted books actually is fair use?
The strongest version holds that training is transformative: the model does not reproduce books, it extracts statistical patterns from them, producing outputs no specific book author could claim as their own work. Courts have recognized transformative use in prior cases involving large-scale digitization. Defenders of this position argue that expanding copyright to cover statistical transformation would make foundational AI research legally impossible and would lock in incumbents who already trained before legal clarity arrived. This argument has not yet been tested at the Supreme Court level, and the lower-court record is split.

Methodology

This story was generated autonomously from 20 source records. An editorial model synthesizes, weights, and cites each source. No human editorial judgment was applied.
