On December 9, 2024, I wrote about Meta’s new terms of service, effective January 1, 2025. This month, I’m even more disgusted by what I learned. An email from one of my publishers told me Meta stole 7.5 million books and 81 million research papers to train their new AI model, Llama 3.
For those who haven’t heard the news yet, Alex Reisner first broke the story in The Atlantic…
“When employees at Meta started developing their flagship AI model, Llama 3, they faced a simple ethical question. The program would need to be trained on a huge amount of high-quality writing to be competitive with products such as ChatGPT, and acquiring all of that text legally could take time. Should they just pirate it instead?”
Meta employees spoke with multiple companies about licensing books and research papers, but they nixed that idea, stating, “[This] seems unreasonably expensive.” A Llama-team senior manager also said it’d be an “incredibly slow” process. “They take like 4+ weeks to deliver data.”
Offended yet? Not only have Meta and others stolen copyrighted work, but they’ve reduced authors’ blood, sweat, and tears to nothing more than “data.”
“The problem is that people don’t realize that if we license one book, we won’t be able to lean into fair use strategy,” said the director of engineering at Meta in an internal memo.
The senior manager claimed that, if they were caught, the legal defense of “fair use” might work for using pirated books and research papers to train AI…
“[It is] really important for [Meta] to get books ASAP. Books are actually more important than web data.”
How did they solve this problem? Meta employees turned to LibGen (Library Genesis), a digital warehouse of stolen intellectual property, neatly stacked with pirated books, academic papers, and various works authors and publishers never approved.
As of March 2025, the LibGen library contained more than 7.5 million books and 81 million research papers. And Meta stole it all, with permission from “MZ” (a reference to CEO Mark Zuckerberg) to download and use the data set.
Internal correspondence was made public this month as part of a copyright-infringement lawsuit brought by Sarah Silverman and other celebs whose books LibGen pirated. If that’s not bad enough, the public also discovered OpenAI used LibGen for similar purposes. Microsoft owns a 49% equity stake in the for-profit subsidiary OpenAI LP. It is not yet known whose idea it was to download the LibGen library to train its AI model.
Does it matter? They still used copyrighted material without paying licensing fees or giving authors the option to opt out.
“Ask for forgiveness, not for permission,” said another Meta employee.
Even when a senior management employee at Meta raised concerns about lawsuits, they were convinced to download the libraries from LibGen and Anna’s Archive, another massive pirate site.
“To show the kind of work that has been used by Meta and OpenAI, I accessed a snapshot of LibGen’s metadata—revealing the contents of the library without downloading or distributing the books or research papers themselves—and used it to create an interactive database that you can search here:
https://reisner-books-index.vercel.app”
~ Alex Reisner, The Atlantic
Meta and OpenAI have both claimed the defense of “fair use” to train their generative-AI models on copyrighted work without a license, because LLMs (Large Language Models) “transform” the original material into new work. Work that could directly compete with the authors they stole from—by duplicating their writing voice and style!
This legal strategy could set a dangerous precedent: It’s okay to steal from authors. Who cares if they worked for months, even years, to write the pirated books and/or research papers?
The use of LibGen and Anna’s Archive also raises another issue.
Alex Reisner stated the following in one of The Atlantic articles:
“Bulk downloading is often done with BitTorrent, the file-sharing protocol popular with pirates for its anonymity, and downloading with BitTorrent typically involves uploading to other users simultaneously. Internal communications show employees saying that Meta did indeed torrent LibGen, which means that Meta could have not only accessed pirated material but also distributed it to others—well established as illegal under copyright law, regardless of what the courts determine about the use of copyrighted material to train generative AI.”
Not only have Meta and OpenAI stolen copyrighted material from authors, but they’ve distributed it to others.
By now, you must be wondering if your books are included in the LibGen library. I found six of mine, including my true crime/narrative nonfiction book, Pretty Evil New England, which took me a solid year to research—driving around six states to dig through archives—and then submit the finished manuscript to the publisher by the deadline, never mind the weeks of edits afterward. Each one of my stolen thrillers—HACKED, Blessed Mayhem, Silent Mayhem, Unnatural Mayhem, and HALOED—also took months of hard work.
By stealing six books, they robbed me of years—years(!) of pouring my soul onto the page to deliver the best experience I could—and I’ll continue to put in the time for my readers. I suspect you’ll do the same. But authors still need to eat and pay bills. It’s difficult to write if you’re homeless.
What message is Big Tech sending to the public?
If Meta and OpenAI prevail in the lawsuits, authors everywhere are at risk.
Quick side note about pirate sites: Sure, you can read books for free. Just know, most sites include Trojan horses in the pirated books that can steal banking and other personal info from your network. Every pirated book steals money from authors. If you want us to keep writing but can’t afford to buy books, get a library card. Or contact the author. Most will gift you a review copy.
Care to read Meta’s internal correspondence?
https://storage.courtlistener.com/recap/gov.uscourts.cand.415175/gov.uscourts.cand.415175.449.4.pdf
https://storage.courtlistener.com/recap/gov.uscourts.cand.415175/gov.uscourts.cand.415175.417.6.pdf
https://storage.courtlistener.com/recap/gov.uscourts.cand.415175/gov.uscourts.cand.415175.391.24.pdf
And here’s a court document regarding OpenAI:
https://storage.courtlistener.com/recap/gov.uscourts.cand.414822/gov.uscourts.cand.414822.254.0.pdf
Disgraceful, right?
The Authors Guild is also reporting on the theft and closely monitoring the court cases.
If your work is included in the LibGen library, your name will automatically be included in the class action (there are many filed), unless you opt out. However, if you prefer to contact the attorney handling the case against Meta, contact Saveri Law Firm.
Did you find any of your work in the pirated libraries?
Excellent post, Sue, one I hope spurs action. Like the toxicity of nuclear waste, it seems corporate people give little thought to the lasting destruction caused by stealing and reselling work. Even a casual glance at the problem should be enough for the courts to rule in favor of writers and artists. Google’s AI summarization steals the dreams of millions because people no longer visit the actual sites, but settle for the quick-hit of data dopamine. I hope the courts sort this out soon!
I hope so, Grant, but corporate greed always seems to win. Those internal documents show how ruthless they really are. It’s discouraging for writers who show up to the keyboard day after day, year after year.
Your post gave me an idea for a post titled “AI Steals, Kills, and Destroys.” I’ll publish it on Wednesday, linking to your post for the “stealing” part. Thanks again for the clarity and disclosure of the problem. As writers, I believe we can make a difference by contacting our state’s federal representatives and senators.
Keep ‘em coming, Sue!
You’re the best, Grant. Send a pingback so I can share your post.
Sue, thanks for spotlighting this staggering arrogance. The exhibits you linked of internal emails are disgusting.
My latest book is on LibGen.
Authors Guild provides a template of a letter telling AI companies to stop using your work:
https://actionnetwork.org/letters/authors-guild-author-letters-to-ai-companies/
The sad reality is billionaires have bottomless pockets, far deeper than all writers and the Authors Guild combined. When courts find against them in 5-10 years, they’ll shrug and pay a fine that might be .00001% of the damage they did.
We writers might each receive a nickel or so in damages. Don’t spend it all in one place.
So true, Debbie. The only ones who get rich off class-action suits are lawyers. The plaintiffs receive pennies. The worst part is, LibGen and others will never remove our books. Or one site will take them down, then sell them to another. 🤬
All of mine are there. I haven’t kept up, but what I read earlier is that it’s not copyright infringement if they’re not republishing your work. That doesn’t mean I have to like it. And I still don’t think AI is going to be able to create anything remotely human.
It’s theft to steal pirated material. It’s also copyright infringement if they distributed the work, even accidentally, through BitTorrent. Corporate greed doesn’t care who they hurt.
LibGen has 4 of my books and 5 of my scientific research articles(!) So can I ask these AI programs to let me actually READ one of the pirated books they used directly from their platforms? Use the AI platforms like a giant eReader?
That I don’t know, Carol. I wouldn’t download one yourself or you could be hacked. It’s all a disgusting mess.
All but one of my books were there when I checked recently. From what I understand, works created by AI cannot be copyrighted, but of course these companies won’t care about that–OpenAI sells access to their generative “A.I.” I’m not sure what Meta’s plans are for Llama–perhaps something similar.
I’m pulling big-time for the class-action lawsuit. I did not realize they used BitTorrent, which adds another legal problem for them.
There’s no there “there” with A.I., no awareness, no feeling, nothing, just programs working probabilities from very large datasets.
Thanks so much, Sue, for putting together this excellent, informative, and insightful article on this outrage. I hope you have a wonderful week, my friend!
True, Dale. Work produced by AI cannot be copyrighted. Not sure that matters to book sellers like Amazon, who allow AI work to be listed for sale as long as it’s labeled “written by AI.” Which may cause us more headaches. What if Amazon’s bots slap an AI label on our work because it was trained on our pirated books?
Thanks, Dale. Wishing you a fabulous week!
This is frightening, Sue. Thank you for posting this, with all of the helpful links.
I didn’t find any of my books on the list . . . yet.
With everything else currently falling apart in so many ways, this.
What a world. 🙁
Agreed, Deb. What a world. 🙃
I found two of my books on LibGen. My husband’s one novel is also there.
The “win at all costs, even if it means breaking the law” attitude at Meta is disgusting. I guess human beings’ lowest character traits can be found even in the brightest minds.
Besides, they aren’t even clever. “Ask for forgiveness, not for permission” is so trite.
Amen to that, Kay. Corporate greed at its ugliest. Shame on them.
New tech has always meant the same old thing for creators. Years ago, AOL started posting authors’ full works. Harlan Ellison, in true Harlan Ellison style, got mad and spent a small fortune in lawyer fees to stop them. He succeeded to the point they didn’t dare do that to him or any other author. That grumpy old man is one of my heroes.
Now, he’s one of my heroes, too! Thanks for sharing, Marilynn.
Thanks for the update, Sue.
Wish I had better news, Elaine. Hope you have a fabulous week!
All of my books are there…one silver lining–I write Christian fiction so the message is getting out there, even this way.
I understand your position, Patricia. Still, I’d rather readers receive a book’s message legally. Pirated books don’t garner reviews.
I totally agree. And hopefully something can be done about it, but I fear it’ll be like the pirate sites–whack-a-mole.
I did find one of my works there, but I’m still chuckling and scratching my head.
“Hydrocarbon Transport in a Plasma Boundary Layer
Fusion Technology
William D. Langer, Alicia Butcher Ehrhardt”
from a random paper before 1989 when I did fusion research at the Princeton Plasma Physics Lab.
I shall have to go read it again – must have a copy somewhere – so I can monitor when the A’I’ serves it up in some other context. And maybe it will make me a few pennies?
Don’t count on making any pennies, Alicia. It’s listed for free on LibGen and other pirate sites.