
2025-06-26 10:27:41
@… right! If you don't set it, the default applies: the license is ignored. From CC, though, I would have expected a push for a clear signal against scraping. So far, however, Signals only introduces ones in favor of scraping
How online fandom communities are advocating against AI, including protesting companies scraping fanfic content for AI training and opposing AI-generated fanfic (Decca Muldowney/The Verge)
https://www.theverge.com/ai-artificial-intelligence/688640/fanficti…
A review of the legal challenges associated with generative AI training disputes emphasises the need for clarity from the UK government, legislature and courts.
https://www.computing.co.uk/feature/2025/scraping-surface-generative-ai-training-disputes…
"AI bots that scrape the internet for training data are hammering the servers of libraries, archives, museums, and galleries, and are in some cases knocking their collections offline"
#AI is ruining our digital world
(Original title: AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums)
AI startup Perplexity is crawling and scraping content from websites that have explicitly indicated they don’t want to be scraped, according to internet infrastructure provider Cloudflare.
On Monday, Cloudflare published research saying it observed the AI startup ignore blocks and hide its crawling and scraping activities.
The network infrastructure giant accused Perplexity of obscuring its identity when trying to scrape web pages “in an attempt to circumvent the website’s prefe…
AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums
https://www.404media.co/ai-scraping-bots-are-breaking-open-libraries-archives-and-museums/?ref=daily-stories-newsletter
Painting and ceiling scraping is finally done!
Now I can set my office back up and get back into the swing of things.
But first, I must clean every single thing I own because it's all covered in a thin layer of white dust 😐
We are excited to announce the release of Vulnerability-Lookup 2.15.0!
This version brings new features, performance improvements, and several bug fixes.
Thanks to @… for the hard work.
#vulnerability
»Pay up or stop scraping – Cloudflare program charges bots for each crawl:
Cloudflare now beta testing pay-per-crawl feature to stop endless AI scraping.
Cloudflare is now experimenting with tools that will allow content creators to charge a fee to AI crawlers to scrape their websites.«
This is certainly a good idea, but on the other hand, the competition is trying to eliminate each other. I'm curious… 🍿😎
They (or an intentional DDoS) have been pounding the #SpamAssassin RuleQA site into catatonia. They construct URLs which are legitimate and which each cause the site to go digging for the specific performance of a rule on an arbitrary date in the past. Hundreds of rules tested daily for ~20 years.
AI bots that scrape the internet for training data are hammering the servers of libraries, archives, museums, and galleries,
and are in some cases knocking their collections offline,
according to a new survey published today.
While the impact of AI bots on open collections has been reported anecdotally,
this survey is the first attempt at measuring the problem,
which in the worst cases can make valuable, public resources unavailable to humans
because the…
I think I want to try a ZX Spectrum Next setup on MiSTer to see if I should get the Kickstarter hardware. Does anyone know of any packs that combine the free/demo Next software so I don’t have to spend hours scraping it from various places on the net? I have found packs of the classic Spectrum software.
A researcher says 245 extensions on nearly 1M devices are overriding security protections to turn browsers into engines that scrape websites for a paid service (Dan Goodin/Ars Technica)
https://arstechnica.com/security/2025/
The new Mastodon.social terms-of-service changes warm my heart. Among other things, they make clear that AI scraping is unwelcome and that all rights to the posts belong to the respective users (and that the service only holds the rights it needs to serve those posts).
Something that goes exactly the other way round on other services. Very nice, @… and Te…
In a letter to CEO Aravind Srinivas, the BBC says it has evidence Perplexity's default model used its content and seeks "a proposal for financial compensation" (Financial Times)
https://www.ft.com/content/b743d401-dc5d-44b8-9987-825a4ffcf4ca
I think a misunderstanding is that people want to fight "scraping" or "automated systems". But my feeling is that the issue is with the _purpose_ of the scraping: It's not "that person is scraping my site" it's "that person wants to use my work to train their slop machine". The issue is the SLOP machine with all the negative externalities they have.
And that is a path worth exploring (that I have similarly argued for code): We want to cont…
Reddit says it will block the Internet Archive from indexing most of its pages after it caught AI companies scraping its data from the Wayback Machine (Jay Peters/The Verge)
https://www.theverge.com/news/757538/reddit-internet-archive-wayback-machine-…
Anyone still think having Facebook or Meta around (or whatever that asshole Zuckerberg wants to call his bullshit company this week) is good for the fediverse? And they're making Social Network 2 to further glorify him. Fuck all this shit. https://www.dropsitenews.com/p/meta-facebo
Reddit sues AI company Anthropic for allegedly ‘scraping’ user comments to train chatbot Claude
https://apnews.com/article/reddit-sues-ai-company-anthropic-claude-chatbot-f5ea042beb253a3f05a091e70531692d
More time should be devoted to the (near-)future business models of AI and how it collects data/content. Merely trying to prevent AI models from scraping data will be futile.
https://blog.cloudflare.com/content-independence-day-no-ai-crawl-without-compen…
Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine,
-- and it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit.
The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles;
instead, it will only be able to index the Reddit.com homepage,
which effectively means Internet Archive will only be able to archive insights into which news headlines …
There was somebody fussing in my replies to my last link to my blog post about Medium (I don’t see them now; they probably blocked me, but their specific words don’t really matter), and the gist of their message was that they didn’t like that site. On the modern internet, if you have an issue with content written by humans, with no surveillance ads, that doesn’t allow AI scraping or AI slop content, with a business model that makes money… I don’t know how to help you. Honestly.
According to Drop Site News, Meta is training AI on multiple Lemmy instances. I also saw some Mastodon instances in the PDF.
Full article https://www.dropsitenews.com/p/meta-facebook-tech-copyright-privacy-whistleblower
Full list:
Anyone else getting these ridiculous repo scraping spikes? A clean checkout of the https://thi.ng/umbrella monorepo is ~370MB. Over the past 14 days there were 222k clones (only 117 unique) of this repo which have caused downloads of a whopping ~78TB. WTF! 🤯
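A quick sanity check of those numbers, as a minimal sketch: it assumes every one of the 222k clones transferred the full ~370MB checkout, which the post does not state explicitly. The variable names are illustrative only.

```python
# Back-of-the-envelope check of the figures quoted in the post.
CLONES = 222_000      # clones over 14 days, as quoted
CHECKOUT_MB = 370     # approximate size of one clean checkout, as quoted

total_mb = CLONES * CHECKOUT_MB

# Decimal units: 1 TB = 10^6 MB.
total_tb_decimal = total_mb / 1_000_000
# Binary units: 1 TiB = 2^20 MiB (treating the quoted MB as MiB).
total_tib_binary = total_mb / (1024 * 1024)

print(f"{total_tb_decimal:.1f} TB (decimal), {total_tib_binary:.1f} TiB (binary)")
```

Under that assumption the total comes out at roughly 82 TB in decimal units or roughly 78 TiB in binary units, so the quoted ~78TB is consistent with full clones counted in binary units.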
Cloudflare launches Pay per Crawl, a marketplace letting sites charge AI crawlers per crawl; new sites set up with Cloudflare will block AI crawlers by default (Maxwell Zeff/TechCrunch)
https://techcrunch.com/2025/07/01/clou
Everybody complaining about getting hammered with #AI traffic seems to think that these are crawlers scraping for training data.
How likely is it that this is a complete misconception and this is all inference time?
Most public companies give their crawlers and RAG agents different user agent strings. But what about security services trawling through their data?
#Firecrawl launches /search endpoint for web scraping and data extraction 🔥
🔍 New /search #API endpoint combines web search results with full page content in one call
🤖 Built specifically for #AI
Scrapers selectively respect robots.txt directives: evidence from a large-scale empirical study
Taein Kim, Karstan Bock, Claire Luo, Amanda Liswood, Emily Wenger
https://arxiv.org/abs/2505.21733
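For context, here is a minimal sketch of what respecting robots.txt looks like on the crawler side, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.org` URLs and the helper name `is_allowed` are made up for illustration and are not taken from the paper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named AI crawler entirely,
# and keep everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        for path in ("/articles/1", "/private/notes"):
            url = f"https://example.org{path}"
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

A compliant crawler would run a check like this before every request. Nothing enforces it, though: robots.txt is purely advisory, which is why selective compliance is possible at all.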
A look at various responses to AI firms' scraping, like open source Anubis that uses cryptographic JavaScript challenges, Cloudflare's "link mazes", and more (Emanuel Maiberg/404 Media)
https://www.404media.co/the-open-source-so
For anyone who wants to self-host a catalog of their book, video game, or movie collections, Koillection is a good open-source option.
It can also be installed using Docker, which can speed up the setup process.
I've only been using this tool for a couple of days, and it looks promising. The only thing that doesn't seem very intuitive at the moment is the scraping system, although its developer has commented on GitHub that they are working on it.
This week is building up to a crescendo of critical cybersecurity developments, so don't miss today's Metacurity for the top infosec news stories you should know, including
--CISA nominee Plankey pulled from Senate confirmation hearing,
--The Com has been hacking Salesforce tools,
--Chinese hackers broke into US telecoms in 2023,
--Law enforcement busts up BidenCash,
--China issues warrants for 20 alleged Taiwanese hackers,
--Feds are probing CrowdS…
Oh, the joy of having to block AI bots from scraping some public repos we host, in this case the bot with the signature
"KHTML, like Gecko; compatible; GPTBot/1.2; https://openai.com/gptbot"
I hope they read the license for all the code before using it ;)
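A block like that can be sketched as a simple user-agent match. This is only an illustration: the pattern list and the function name `should_block` are hypothetical, and the post does not say how the block was actually implemented. It also only stops bots that identify themselves honestly.

```python
import re

# Hypothetical blocklist; "GPTBot" is taken from the signature quoted in the post.
BLOCKED_BOT_PATTERN = re.compile(r"\bGPTBot\b", re.IGNORECASE)


def should_block(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a blocked bot name."""
    return bool(BLOCKED_BOT_PATTERN.search(user_agent))


if __name__ == "__main__":
    bot = "KHTML, like Gecko; compatible; GPTBot/1.2; https://openai.com/gptbot"
    browser = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"
    print(should_block(bot), should_block(browser))
```

In practice the same match would usually live in the web server or reverse-proxy configuration rather than in application code.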
LanternNet: A Novel Hub-and-Spoke System to Seek and Suppress Spotted Lanternfly Populations
Vinil Polepalli
https://arxiv.org/abs/2507.20800 https://arxiv…
#Firecrawl launches /search endpoint for web scraping and data extraction 🔥
#Firecrawl new /search #API combines web search results with full page content in single call, designed for
Anubis verifies that any visitor to a site is a human using a browser as opposed to a bot.
🍿One of the ways it does this is by making the browser do a type of cryptographic math with JavaScript or other subtle checks that browsers do by default but bots have to be explicitly programmed to do.
This check is invisible to the user, and most browsers since 2022 are able to complete this test.
In theory, bot scrapers could pretend to be users with browsers as well, but the ad…
The Open-Source Software Saving the Internet From AI Bot Scrapers https://www.404media.co/the-open-source-software-saving-the-internet-from-ai-bot-scrapers/
Getty's copyright lawsuit against Stability AI begins at London's High Court, accusing it of unlawfully scraping millions of images; Stability denies the claims (Sam Tobin/Reuters)
https://www.reuters.com/sustainability/boa
Q&A with British Library CEO Rebecca Lawrence on dealing with the aftermath of a major October 2023 cyberattack, AI scraping, AI for text analysis, and more (Mishal Husain/Bloomberg)
https://www.bloomberg.com/features/2025-rebecca-lawrence-weekend-i…
As Reddit turns 20, a look at its AI efforts, including the Reddit Answers chatbot, while it battles unauthorized scraping of user data for AI training (Jonathan Vanian/CNBC)
https://www.cnbc.com/2025/06/28/reddit-20-fighting-ai-defending-data.html…