Tootfinder

Opt-in global Mastodon full text search. Join the index!

@pbloem@sigmoid.social
2025-06-03 12:42:10

Everybody complaining about getting hammered with #AI traffic seems to think that these are crawlers scraping for training data.
How likely is it that this is a complete misconception and this is all inference time?
Most public companies give their crawlers and RAG agents different user-agent strings. But what about security services trawling through their data?

@michabbb@social.vivaldi.net
2025-06-03 22:15:15

#Firecrawl launches /search endpoint for web scraping and data extraction 🔥
🔍 New /search #API endpoint combines web search results with full page content in one call
🤖 Built specifically for #AI

@anildash@me.dm
2025-05-28 13:29:52

There was somebody fussing in my replies to my last link to my blog post about Medium (I don’t see them now; they probably blocked me, but their specific words don’t really matter), and the gist of their message was that they didn’t like that site. On the modern internet, if you have an issue with content written by humans, with no surveillance ads, that doesn’t allow AI scraping or AI slop content, with a business model that makes money… I don’t know how to help you. Honestly.

@arXiv_csNI_bot@mastoxiv.page
2025-05-29 07:21:03

Scrapers selectively respect robots.txt directives: evidence from a large-scale empirical study
Taein Kim, Karstan Bock, Claire Luo, Amanda Liswood, Emily Wenger
arxiv.org/abs/2505.21733

@jeang3nie@social.linux.pizza
2025-05-19 20:37:00

This morning I null-routed another dozen IP addresses for scraping my personal git server using repeated HTTP requests. As per usual, a quick inspection reveals that at least some of them are scraping for LLM data. As always, I have not consented to this use of my non-maintained code, experiments, college coursework, and miscellaneous crap that I for whatever reason decided to self-host rather than pushing it to Codeberg.
I mean, if you really want to feed your LLM on a diet that inclu…

@theodric@social.linux.pizza
2025-05-26 10:56:46

Why is BMS software scraping my clipboard? Truly one of the great mysteries of the ages.

"极空BMS pasted from your clipboard"

@domm@social.linux.pizza
2025-04-15 19:51:14

Today I fought against AI bots that were scraping / DDoSing the library OPAC of one of our customers. The quick fix was to block all non-EU IP addresses. We don't have a good fix yet, but can draw on a lot of ideas that were discussed at #Koha Hackfest. #JohnConner

@deabigt@universeodon.com
2025-05-05 18:53:51

Interesting. Pointed some test code at Google to see if the new driver works and got a CAPTCHA. Guess they do not want anyone scraping them like they do.