On December 9, 2024, I wrote about Meta’s new terms of service, effective January 1, 2025. This month, I’m even more disgusted by what I learned. An email from one of my publishers told me Meta stole 7.5 million books and 81 million research papers to train their new AI model, Llama 3.
For those who haven’t heard the news yet, Alex Reisner first broke the story in The Atlantic…
“When employees at Meta started developing their flagship AI model, Llama 3, they faced a simple ethical question. The program would need to be trained on a huge amount of high-quality writing to be competitive with products such as ChatGPT, and acquiring all of that text legally could take time. Should they just pirate it instead?”
Meta employees spoke with multiple companies about licensing books and research papers, but they nixed that idea, stating, “[This] seems unreasonably expensive.” A Llama-team senior manager also said it’d be an “incredibly slow” process. “They take like 4+ weeks to deliver data.”
Offended yet? Not only have Meta and others stolen copyrighted work, but they’ve reduced authors’ blood, sweat, and tears to nothing more than “data.”
“The problem is that people don’t realize that if we license one book, we won’t be able to lean into fair use strategy,” said the director of engineering at Meta in an internal memo.
The senior manager claimed that, if Meta were caught, the legal defense of “fair use” might work for using pirated books and research papers to train AI…
“[It is] really important for [Meta] to get books ASAP. Books are actually more important than web data.”
How did they solve this problem? Meta employees turned to LibGen (Library Genesis), a digital warehouse of stolen intellectual property, neatly stacked with pirated books, academic papers, and various works authors and publishers never approved.
As of March 2025, the LibGen library contained more than 7.5 million books and 81 million research papers. And Meta stole it all, with permission from “MZ” (a reference to CEO Mark Zuckerberg) to download and use the data set.
Internal correspondence was made public this month as part of a copyright-infringement lawsuit brought by Sarah Silverman and other celebs whose books LibGen pirated. If that’s not bad enough, the public also discovered OpenAI used LibGen for similar purposes. Microsoft owns a 49% equity stake in the for-profit subsidiary OpenAI LP. It is not yet known whose idea it was to download the LibGen library to train its AI model.
Does it matter? They still used copyrighted material without paying licensing fees or giving authors the option to opt out.
“Ask for forgiveness, not for permission,” said another Meta employee.
Even when a senior management employee at Meta raised concerns about lawsuits, they were convinced to download the libraries from LibGen and Anna’s Archive, another massive pirate site.
“To show the kind of work that has been used by Meta and OpenAI, I accessed a snapshot of LibGen’s metadata—revealing the contents of the library without downloading or distributing the books or research papers themselves—and used it to create an interactive database that you can search here:
https://reisner-books-index.vercel.app”
~ Alex Reisner, The Atlantic
Meta and OpenAI have both claimed the defense of “fair use” to train their generative-AI models on copyrighted work without a license, because LLMs (Large Language Models) “transform” the original material into new work. Work that could directly compete with the authors they stole from—by duplicating their writing voice and style!
This legal strategy could set a dangerous precedent: It’s okay to steal from authors. Who cares if they worked for months, even years, to write the pirated books and/or research papers?
The use of LibGen and Anna’s Archive also raises another issue.
Alex Reisner stated the following in one of The Atlantic articles:
“Bulk downloading is often done with BitTorrent, the file-sharing protocol popular with pirates for its anonymity, and downloading with BitTorrent typically involves uploading to other users simultaneously. Internal communications show employees saying that Meta did indeed torrent LibGen, which means that Meta could have not only accessed pirated material but also distributed it to others—well established as illegal under copyright law, regardless of what the courts determine about the use of copyrighted material to train generative AI.”
Not only have Meta and OpenAI stolen copyrighted material from authors, but by torrenting it, Meta may also have distributed it to others.
By now, you must be wondering if your books are included in the LibGen library. I found six of mine, including my true crime/narrative nonfiction book, Pretty Evil New England, which took me a solid year of research, driving around six states to dig through archives, before I submitted the finished manuscript to the publisher by the deadline, never mind the weeks of edits afterward. Each one of my stolen thrillers (HACKED, Blessed Mayhem, Silent Mayhem, Unnatural Mayhem, and HALOED) also took months of hard work.
By stealing six books, they robbed me of years—years(!) of pouring my soul onto the page to deliver the best experience I could—and I’ll continue to put in the time for my readers. I suspect you’ll do the same. But authors still need to eat and pay bills. It’s difficult to write if you’re homeless.
What message is Big Tech sending to the public?
If Meta and OpenAI prevail in the lawsuits, authors everywhere are at risk.
Quick side note about pirate sites: Sure, you can read books for free. Just know that many of these sites bundle Trojan horses with the pirated books, malware that can steal banking and other personal info from your network. Every pirated book steals money from authors. If you want us to keep writing but can’t afford to buy books, get a library card. Or contact the author. Most will gift you a review copy.
Care to read Meta’s internal correspondence?
https://storage.courtlistener.com/recap/gov.uscourts.cand.415175/gov.uscourts.cand.415175.449.4.pdf
https://storage.courtlistener.com/recap/gov.uscourts.cand.415175/gov.uscourts.cand.415175.417.6.pdf
https://storage.courtlistener.com/recap/gov.uscourts.cand.415175/gov.uscourts.cand.415175.391.24.pdf
And here’s a court document regarding OpenAI:
https://storage.courtlistener.com/recap/gov.uscourts.cand.414822/gov.uscourts.cand.414822.254.0.pdf
Disgraceful, right?
The Authors Guild is also reporting on the theft and closely monitoring the court cases.
If your work is included in the LibGen library, your name will automatically be included in the class action (there are many filed), unless you opt out. However, if you prefer to reach the attorneys handling the case against Meta, contact the Saveri Law Firm.
Did you find any of your work in the pirated libraries?