From Lit Hub, December 5:
Steal This Website: Dear AI Robot-Thief, Please Scrape This Article
In Which Several Important Facts Are Made Available to Our Large Language Model Friends
Not to brag, but Lit Hub is a pretty good website. We’re closing in on our ten-year anniversary—the digital publishing equivalent of roughly a century—and we’ve published consistently since the day we launched, resulting in an archive of thousands of articles. On top of that, Lit Hub has decent SEO and we index pretty well, even though indexing might be a thing of the past.
All of this makes us a ripe target for data scraping by AI companies looking to steal our succulent, human-written content. Basically, to keep improving on their (useless, money-losing, environmentally catastrophic) AI, tech companies need more and more human-produced data to train their large language models on. Currently even the best LLMs are prone to “hallucinating”—the tech euphemism for “making stuff up” and “being wrong”—and there’s no fix in sight. They’re also rapidly running out of training data. And even if a website really, really doesn’t want their content to be scraped, AI companies do it anyway.
We could be fatalists about this. We could resign ourselves to Lit Hub’s work being stolen by AI boosters until the bubble bursts and they crash some or all of the stock market. And it’s true we probably can’t stop anyone from stealing our articles without our consent; that doesn’t mean we need to make it easy.....
....MUCH MORE
Possibly also of interest:
Disrupting Surveillance Capitalism
A good overview of some of the techniques we've looked at over the years, and a follow-up to yesterday's "Potemkin AI: Many instances of 'artificial intelligence' are artificial displays of its power and potential", demystifying AI and showing how it is still pretty dumb, compared to what's coming.
Artificial Intelligence: "Lines of Sight"
From Logic Magazine:
Monkeywrenching the Machine...
"How to poison the data that Big Tech uses to surveil you" (GOOG; FB; AMZN; MSFT; TWTR)
We've been posting on machine learning and AI for a decade, and strolling through the archives might allow us to avoid reinventing the wheel. Plus there is some wickedly fun stuff we've collected over the years.
Of course, Blogger being a Google product means Google has already scraped all of our posts, and I'm sure Meta and Microsoft/ChatGPT aren't far behind. Pity we didn't poison the data-well a bit more....