Tootfinder

Opt-in global Mastodon full text search. Join the index!

@dnddeutsch@pnpde.social
2025-06-26 10:27:41

@… true! If you don't set them, the default applies: the license is ignored. From CC, though, I would have expected a push for a clear signal against scraping. So far, however, Signals only introduces pro-scraping ones

@heiseonline@social.heise.de
2025-06-22 11:23:00

Content scraping: BBC threatens Perplexity with legal action
The AI search engine Perplexity is alleged to be using content from Britain's public service broadcaster. Perplexity, for its part, suspects monopolism.

@Techmeme@techhub.social
2025-06-26 06:35:55

How online fandom communities are advocating against AI, including protesting companies scraping fanfic content for AI training and opposing AI-generated fanfic (Decca Muldowney/The Verge)
theverge.com/ai-artificial-int

@johnleonard@mastodon.social
2025-07-23 11:53:11

A review of the legal challenges associated with generative AI training disputes emphasises the need for clarity from the UK government, legislature and courts.
computing.co.uk/feature/2025/s

@theodric@social.linux.pizza
2025-05-26 10:56:46

Why is BMS software scraping my clipboard? Truly one of the great mysteries of the ages.

"极空BMS pasted from your clipboard"

@tante@tldr.nettime.org
2025-06-17 12:40:05

"AI bots that scrape the internet for training data are hammering the servers of libraries, archives, museums, and galleries, and are in some cases knocking their collections offline"
#AI is ruining our digital world
(Original title: AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums)

@Mediagazer@mstdn.social
2025-06-24 08:40:43

How online fandom communities are advocating against AI, including protesting companies scraping fanfic content for AI training and opposing AI-generated fanfic (Decca Muldowney/The Verge)
theverge.com/ai-artificial-int

AI startup Perplexity is crawling and scraping content from websites that have explicitly indicated they don’t want to be scraped, according to internet infrastructure provider Cloudflare.
On Monday, Cloudflare published research saying it observed the AI startup ignore blocks and hide its crawling and scraping activities.
The network infrastructure giant accused Perplexity of obscuring its identity when trying to scrape web pages “in an attempt to circumvent the website’s prefe…

@servelan@newsie.social
2025-06-17 14:49:49

AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums
404media.co/ai-scraping-bots-a

@newsie@darktundra.xyz
2025-06-17 10:01:26

AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums 404media.co/ai-scraping-bots-a

@tgpo@social.linux.pizza
2025-07-23 13:13:21

Painting and ceiling scraping is finally done!
Now I can set my office back up and get back into the swing of things.
But first, I must clean every single thing I own because it's all covered in a thin layer of white dust 😐

@adulau@infosec.exchange
2025-08-22 12:53:47

We are excited to announce the release of Vulnerability-Lookup 2.15.0!
This version brings new features, performance improvements, and several bug fixes.
Thanks to @… for the hard work.
#vulnerability

@kubikpixel@chaos.social
2025-07-01 10:40:52

»Pay up or stop scraping – Cloudflare program charges bots for each crawl:
Cloudflare now beta testing pay-per-crawl feature to stop endless AI scraping.
Cloudflare is now experimenting with tools that will allow content creators to charge a fee to AI crawlers to scrape their websites.«
This is certainly a good idea, but on the other hand, the competition is trying to eliminate each other. I'm curious… 🍿😎

@grumpybozo@toad.social
2025-06-18 01:35:26

They (or an intentional DDoS) have been pounding the #SpamAssassin RuleQA site into catatonia. They construct URLs which are legitimate and which each cause the site to go digging for the specific performance of a rule on an arbitrary date in the past. Hundreds of rules tested daily for ~20 years.

@heiseonline@social.heise.de
2025-07-02 03:37:00

Cloudflare stonewalls AI crawlers unless scraping is paid for
Websites can now be protected from crawler access by Cloudflare by default. AI companies can, however, also pay site operators for content scraping.

AI bots that scrape the internet for training data are hammering the servers of libraries, archives, museums, and galleries,
and are in some cases knocking their collections offline,
according to a new survey published today.
While the impact of AI bots on open collections has been reported anecdotally,
this survey is the first attempt at measuring the problem,
which in the worst cases can make valuable, public resources unavailable to humans
because the…

@trezzer@social.linux.pizza
2025-07-20 10:59:44

I think I want to try a ZX Spectrum Next setup on MiSTer to see if I should get the Kickstarter hardware. Does anyone know of any packs that combine the free/demo Next software so I don’t have to spend hours scraping it from various places on the net? I have found packs of the classic Spectrum software.

@Techmeme@techhub.social
2025-07-10 09:01:14

A researcher says 245 extensions on nearly 1M devices are overriding security protections to turn browsers into engines that scrape websites for a paid service (Dan Goodin/Ars Technica)
arstechnica.com/security/2025/

@dploeger@mastodon.social
2025-06-17 11:43:27

The new Mastodon.social terms of service changes warm my heart. Among other things, they make sure that AI scraping is unwelcome, and that all rights to the posts belong to the respective users (and that they themselves only hold the rights needed to serve the posts accordingly).
Something that happens exactly the other way round at other services. Very nice, @… and Te…

@Techmeme@techhub.social
2025-06-20 04:41:35

In a letter to CEO Aravind Srinivas, the BBC says it has evidence Perplexity's default model used its content and seeks "a proposal for financial compensation" (Financial Times)
ft.com/content/b743d401-dc5d-4

@tante@tldr.nettime.org
2025-06-27 09:50:54

I think a misunderstanding is that people want to fight "scraping" or "automated systems". But my feeling is that the issue is with the _purpose_ of the scraping: It's not "that person is scraping my site" it's "that person wants to use my work to train their slop machine". The issue is the SLOP machine with all the negative externalities they have.
And that is a path worth exploring (that I have similarly argued for code): We want to cont…

@Mediagazer@mstdn.social
2025-08-11 17:15:42

Reddit says it will block the Internet Archive from indexing most of its pages after it caught AI companies scraping its data from the Wayback Machine (Jay Peters/The Verge)
theverge.com/news/757538/reddi

@jake4480@c.im
2025-08-10 05:16:13

Anyone still think having Facebook or Meta around (or whatever that asshole Zuckerberg wants to call his bullshit company this week) is good for the fediverse? And they're making Social Network 2 to further glorify him. Fuck all this shit. dropsitenews.com/p/meta-facebo

@timbray@cosocial.ca
2025-07-01 18:42:29

wired.com/story/cloudflare-blo

@metacurity@infosec.exchange
2025-06-05 08:43:52

Reddit sues AI company Anthropic for allegedly ‘scraping’ user comments to train chatbot Claude
apnews.com/article/reddit-sues

@ErikJonker@mastodon.social
2025-07-05 07:35:51

More time should be devoted to the (near) future business models of AI and how it collects data/content. Just trying to prevent AI models from scraping data will be futile.
blog.cloudflare.com/content-in

Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine,
and it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit.
The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles;
instead, it will only be able to index the Reddit.com homepage,
which effectively means Internet Archive will only be able to archive insights into which news headlines …

@anildash@me.dm
2025-05-28 13:29:52

There was somebody fussing in my replies to my last link to my blog post about Medium (I don’t see them now; they probably blocked me, but their specific words don’t really matter), and the gist of their message was that they didn’t like that site. On the modern internet, if you have an issue with content written by humans, with no surveillance ads, that doesn’t allow AI scraping or AI slop content, with a business model that makes money… I don’t know how to help you. Honestly.

@Techmeme@techhub.social
2025-08-11 17:15:42

Reddit says it will block the Internet Archive from indexing most of its pages after it caught AI companies scraping its data from the Wayback Machine (Jay Peters/The Verge)
theverge.com/news/757538/reddi

@Stomata@social.linux.pizza
2025-08-09 05:33:29

According to Dropsitenews, Meta is training AI on multiple Lemmy instances. I also saw some Mastodon instances in the PDF.
Full article dropsitenews.com/p/meta-facebo
Full list:

@toxi@mastodon.thi.ng
2025-07-27 09:27:06

Anyone else getting these ridiculous repo scraping spikes? A clean checkout of the thi.ng/umbrella monorepo is ~370MB. Over the past 14 days there were 222k clones (only 117 unique) of this repo which have caused downloads of a whopping ~78TB. WTF! 🤯
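As a rough sanity check of the figures in this post, the sketch below multiplies the quoted checkout size by the clone count given in the screenshot description. It assumes that every clone transfers the full ~370MB and that decimal units are meant; neither is stated in the post, and the function name is made up for illustration.

```python
# Back-of-the-envelope check using the figures quoted in the post.
CHECKOUT_MB = 370   # approximate size of one clean checkout
CLONES = 222_356    # total clones over the 14 days

def total_transfer_tb(checkout_mb: float, clones: int) -> float:
    """Total transfer in decimal terabytes, assuming every clone is a full download."""
    return checkout_mb * clones / 1_000_000  # 1 TB = 1,000,000 MB

estimate = total_transfer_tb(CHECKOUT_MB, CLONES)
print(f"~{estimate:.1f} TB")  # prints "~82.3 TB"
```

Under these assumptions the estimate lands at roughly 82 TB, the same order of magnitude as the ~78TB reported; the gap could come from unit conventions or from clones that transfer less than a full checkout.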

Screenshot of a GitHub activity line plot showing the number of clones per day over the past 14 days. In the past week the number of daily clones went up to 60k+ for 2 days; the total number of clones for the entire timespan is 222,356, with only 117 unique cloners.

@Mediagazer@mstdn.social
2025-07-01 10:20:30

Cloudflare launches Pay per Crawl, a marketplace letting sites charge AI crawlers per crawl; new sites set up with Cloudflare will block AI crawlers by default (Maxwell Zeff/TechCrunch)
techcrunch.com/2025/07/01/clou

@pbloem@sigmoid.social
2025-06-03 12:42:10

Everybody complaining about getting hammered with #AI traffic seems to think that these are crawlers scraping for training data.
How likely is it that this is a complete misconception and this is all inference time?
Most public companies give their crawlers and RAG agents different user agent strings. But what about security services trawling through their data?

@michabbb@social.vivaldi.net
2025-06-03 22:15:15

#Firecrawl launches /search endpoint for web scraping and data extraction 🔥
🔍 New /search #API endpoint combines web search results with full page content in one call
🤖 Built specifically for #AI

@Techmeme@techhub.social
2025-07-01 10:20:45

Cloudflare launches Pay per Crawl, a marketplace letting sites charge AI crawlers per crawl; new sites set up with Cloudflare will block AI crawlers by default (Maxwell Zeff/TechCrunch)
techcrunch.com/2025/07/01/clou

@arXiv_csNI_bot@mastoxiv.page
2025-05-29 07:21:03

Scrapers selectively respect robots.txt directives: evidence from a large-scale empirical study
Taein Kim, Karstan Bock, Claire Luo, Amanda Liswood, Emily Wenger
arxiv.org/abs/2505.21733
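For readers unfamiliar with what "respecting robots.txt directives" involves on the crawler side, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot names, and the URLs are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser a crawler can query before fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# A compliant crawler asks before each request and skips disallowed URLs.
blocked_everywhere = parser.can_fetch("ExampleAIBot", "https://example.org/articles/1")
other_public = parser.can_fetch("OtherBot", "https://example.org/articles/1")
other_private = parser.can_fetch("OtherBot", "https://example.org/private/x")
print(blocked_everywhere, other_public, other_private)  # prints "False True False"
```

The check is purely voluntary: nothing in the protocol stops a scraper from fetching a disallowed URL anyway.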

@Mediagazer@mstdn.social
2025-07-08 06:25:37

A look at various responses to AI firms' scraping, like open source Anubis that uses cryptographic JavaScript challenges, Cloudflare's "link mazes", and more (Emanuel Maiberg/404 Media)
404media.co/the-open-source-so

@alejandrobdn@social.linux.pizza
2025-07-27 09:31:25

For anyone who wants to self-host a catalog of their book, video game, or movie collections, Koillection is a good open-source option.
It can also be installed using Docker, which can speed up the setup process.
I've only been using this tool for a couple of days, and it looks promising. The only thing that doesn't seem very intuitive at the moment is the scraping system, although its developer has commented on GitHub that they are working on it.

@metacurity@infosec.exchange
2025-06-05 11:43:11

This week is building up to a crescendo of critical cybersecurity developments, so don't miss today's Metacurity for the top infosec news stories you should know, including
--CISA nominee Plankey pulled from Senate confirmation hearing,
--The Com has been hacking Salesforce tools,
--Chinese hackers broke into US telecoms in 2023,
--Law enforcement busts up BidenCash,
--China issues warrants for 20 alleged Taiwanese hackers,
--Feds are probing CrowdS…

@alsutton@snapp.social
2025-07-02 10:06:01

Oh, the joy of having to block AI bots from scraping some public repos we host, in this case the bot with the signature
"KHTML, like Gecko; compatible; GPTBot/1.2; openai.com/gptbot"
I hope they read the license for all the code before using it ;)
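One simple way to act on a signature like this is a substring match on the User-Agent header. The sketch below is an illustration only: the blocklist and the function name are hypothetical, and in practice such a rule would more likely live in the web server or reverse proxy configuration than in application code.

```python
# Hypothetical blocklist of user-agent tokens; only "GPTBot" comes from the post.
BLOCKED_TOKENS = ("GPTBot",)

def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent contains any blocked token (case-insensitive)."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BLOCKED_TOKENS)

signature = "KHTML, like Gecko; compatible; GPTBot/1.2; openai.com/gptbot"
browser = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"

print(is_blocked(signature))  # prints "True"
print(is_blocked(browser))    # prints "False"
```

This only catches bots that identify themselves honestly; a scraper that sends a browser-like User-Agent passes straight through.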

@Techmeme@techhub.social
2025-07-07 21:20:38

A look at various responses to AI firms' scraping, like open source Anubis that uses cryptographic JavaScript challenges, Cloudflare's "link mazes", and more (Emanuel Maiberg/404 Media)
404media.co/the-open-source-so

@arXiv_csRO_bot@mastoxiv.page
2025-07-29 11:28:01

LanternNet: A Novel Hub-and-Spoke System to Seek and Suppress Spotted Lanternfly Populations
Vinil Polepalli
arxiv.org/abs/2507.20800 arxiv…

@michabbb@social.vivaldi.net
2025-06-03 22:15:15

#Firecrawl launches /search endpoint for web scraping and data extraction 🔥
#Firecrawl new /search #API combines web search results with full page content in single call, designed for

Anubis verifies that any visitor to a site is a human using a browser as opposed to a bot.
🍿One of the ways it does this is by making the browser do a type of cryptographic math with JavaScript or other subtle checks that browsers do by default but bots have to be explicitly programmed to do.
This check is invisible to the user, and most browsers since 2022 are able to complete this test.
In theory, bot scrapers could pretend to be users with browsers as well, but the ad…
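The excerpt describes the "cryptographic math" only loosely. One common form of such a challenge is hash-based proof of work, sketched below in Python rather than browser JavaScript. This is a generic illustration and not Anubis's actual code; the challenge string, the difficulty value, and the function names are made up.

```python
import hashlib
from itertools import count

def solve(challenge: str, difficulty: int) -> int:
    """Client side: find a nonce whose SHA-256 digest starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's answer."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

# Hypothetical challenge; difficulty 3 needs about 16**3 = 4096 hashes on average.
nonce = solve("example-challenge", 3)
print(verify("example-challenge", nonce, 3))  # prints "True"
```

The asymmetry is the point of the design: solving costs the client thousands of hashes, verifying costs the server one, so the work is negligible for a single visitor but adds up for a scraper making requests at scale.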

@Mediagazer@mstdn.social
2025-06-12 15:15:59

TollBit: from Q4 2024 to Q1 2025, traffic from AI retrieval bots to 266 websites, half run by national and local news organizations, grew 49%, as AI usage jumps (Nitasha Tiku/Washington Post)

@newsie@darktundra.xyz
2025-07-07 13:26:29

The Open-Source Software Saving the Internet From AI Bot Scrapers 404media.co/the-open-source-so

@Techmeme@techhub.social
2025-06-12 10:05:45

TollBit: from Q4 2024 to Q1 2025, traffic from AI retrieval bots to 266 websites, including national and local news organizations, grew 49%, as AI usage jumps (Nitasha Tiku/Washington Post)

@Mediagazer@mstdn.social
2025-06-09 09:40:37

Getty's copyright lawsuit against Stability AI begins at London's High Court, accusing it of unlawfully scraping millions of images; Stability denies the claims (Sam Tobin/Reuters)
reuters.com/sustainability/boa

@Techmeme@techhub.social
2025-06-09 09:30:39

Getty's copyright lawsuit against Stability AI begins at London's High Court, accusing it of unlawfully scraping millions of images; Stability denies the claims (Sam Tobin/Reuters)
reuters.com/sustainability/boa

@Mediagazer@mstdn.social
2025-07-07 08:05:31

Q&A with British Library CEO Rebecca Lawrence on dealing with the aftermath of a major October 2023 cyberattack, AI scraping, AI for text analysis, and more (Mishal Husain/Bloomberg)

@Techmeme@techhub.social
2025-07-06 19:25:31

Q&A with British Library CEO Rebecca Lawrence on dealing with the aftermath of a major October 2023 cyberattack, AI scraping, AI for text analysis, and more (Mishal Husain/Bloomberg)
bloomberg.com/features/2025-re

@Techmeme@techhub.social
2025-06-28 20:31:10

As Reddit turns 20, a look at its AI efforts, including the Reddit Answers chatbot, while it battles unauthorized scraping of user data for AI training (Jonathan Vanian/CNBC)
cnbc.com/2025/06/28/reddit-20-