Meta used all 3 of my books & millions of other books, ebooks, and research papers to train its AI.
All without consent, compensation, copyright concern, credit. You find out only by searching a database published a few days ago by The Atlantic.
It's sketchy AF. I’ll tell you why in a sec. But first… should we care?
If you’re on the list, you have one of two reactions:
🥳 Honored! Flattered! My work is worthy!
😳 Ummm WTH. What about consent, compensation, etc.?
More broadly: Should you care whether Meta used 72 books from David Sedaris? 200 from Margaret Atwood? The entire library of everyone everywhere?
What’s missing in this conversation is that Meta intentionally made *a choice*... as the kids say. They actively chose to steal the books, instead of going through the proper & legal channels, because the legal way was too slow.
TOO SLOW.
That’s problematic, isn’t it? Look, I use AI. I see its value. But the scale of this is nuts. And gross.
Here's the story:
For Llama 3 to compete, it needed to be trained on a huge amount of high-quality writing – books, not Instagram captions or LinkedIn posts. Acquiring all of that text legally would take time.
“Yo ho ho! Should we pirate it instead?” they wondered.
“Abso-tooting-lootly,” they said back to themselves. “Let’s loot all the books!”
So they went to LibGen, a pirate library of more than 7.5 million books and 81 million research papers. The Llama 3 team looted and rolled around in all the words like they were pirates rolling in gold doubloons. And they basically were.
So is it problematic? Yes. Here’s why:
🏴☠️ We’re normalizing the theft of IP to an absurd degree.
A giant tech company should not feel entitled to scoop up all our words like they’re in a bulk bin at a Dollar Tree & run straight out the door with them.
🏴☠️ Gen AI often presents itself as an all-knowing oracle, uncoupled from its sources. This “decontextualizes knowledge” (The Atlantic's phrase). It severs the work from the author. It prevents true collaboration. It makes it harder for creators & researchers to build an acknowledged body of work.
🏴☠️ It’s not colorless, odorless “data.” It’s words that make up sentences that make up paragraphs that make up pages that writers have created, crafted, coaxed into the world. Words with handprints, heart-prints, bitemarks, scratches from us trying to shape them into something that delights you.
Do I sound like I’m being precious about words? About writing? It’s because I am.
* * *
I don’t have all the answers. But I do know this:
Clarity, consent, compensation. Is it too much to ask that AI companies be transparent about data sources and use, obtain permission from creators, and provide fair payment for using their works?
Just me?