Tootfinder

Opt-in global Mastodon full-text search.

@heiseonline@social.heise.de
2025-10-14 05:21:00

UK: Clearview's facial recognition is subject to European data protection rules
A British appeals court has upheld a fine in the millions that the UK data protection commissioner imposed on US company Clearview AI for mega-scraping.

AI startup Perplexity is crawling and scraping content from websites that have explicitly indicated they don’t want to be scraped, according to internet infrastructure provider Cloudflare.
On Monday, Cloudflare published research saying it observed the AI startup ignore blocks and hide its crawling and scraping activities.
The network infrastructure giant accused Perplexity of obscuring its identity when trying to scrape web pages “in an attempt to circumvent the website’s prefe…

@Techmeme@techhub.social
2025-08-11 17:15:42

Reddit says it will block the Internet Archive from indexing most of its pages after it caught AI companies scraping its data from the Wayback Machine (Jay Peters/The Verge)
theverge.com/news/757538/reddi

@Mediagazer@mstdn.social
2025-08-11 17:15:42

Reddit says it will block the Internet Archive from indexing most of its pages after it caught AI companies scraping its data from the Wayback Machine (Jay Peters/The Verge)
theverge.com/news/757538/reddi

Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine,
and it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit.
The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles;
instead, it will only be able to index the Reddit.com homepage,
which effectively means Internet Archive will only be able to archive insights into which news headlines …

@Techmeme@techhub.social
2025-10-03 15:01:24

LinkedIn sues a company called ProAPIs for allegedly operating millions of fake accounts to scrape LinkedIn member data and selling it for ~$15,000 per month (Suzanne Smalley/The Record)
therecord.media/linkedin-sues-

@philip@mastodon.mallegolhansen.com
2025-09-11 03:41:20

@… *If* scrapers would actually follow the spec, I’m somewhat for it. It *does* allow for you to insert a custom license that says “No scraping, under any circumstances”.
But would any scraper actually follow it? Of course not.

@newsie@darktundra.xyz
2025-10-03 14:08:30

LinkedIn sues software company allegedly scraping data from millions of profiles therecord.media/linkedin-sues-

@fgraver@hcommons.social
2025-09-10 19:22:24

Pay-per-output? AI firms blindsided by beefed up robots.txt instructions. arstechnica.com/tech-policy/20

@johnleonard@mastodon.social
2025-07-23 11:53:11

A review of the legal challenges associated with generative AI training disputes emphasises the need for clarity from the UK government, legislature and courts.
computing.co.uk/feature/2025/s

@Stomata@social.linux.pizza
2025-08-09 05:33:29

According to Dropsitenews, Meta is training AI on multiple Lemmy instances. I also saw some Mastodon instances in the PDF.
Full article dropsitenews.com/p/meta-facebo
Full list:

@gedankenstuecke@scholar.social
2025-09-06 16:42:44

«Google quietly vanishes its net zero carbon pledge»
Of course, gotta spend all the energy scraping the web to death and using that for training some "AI" that advises people to kill themselves. The Butlerian Jihad truly can't come fast enough…
pivot-to-ai.com/2025/09/05/goo

@Mediagazer@mstdn.social
2025-09-10 13:11:01

Reddit, Yahoo, Medium, Quora, People, O'Reilly, wikiHow, Ziff Davis, and others adopt the Really Simple Licensing (RSL) Standard to set terms for AI scraping (Emma Roth/The Verge)
theverge.com/news/775072/rsl-s

@Techmeme@techhub.social
2025-09-10 13:10:55

Reddit, Yahoo, Medium, Quora, People, O'Reilly, wikiHow, Ziff Davis, and others adopt the Really Simple Licensing (RSL) Standard to set terms for AI scraping (Emma Roth/The Verge)
theverge.com/news/775072/rsl-s

@toxi@mastodon.thi.ng
2025-07-27 09:27:06

Anyone else getting these ridiculous repo scraping spikes? A clean checkout of the thi.ng/umbrella monorepo is ~370MB. Over the past 14 days there were 222k clones (only 117 unique) of this repo which have caused downloads of a whopping ~78TB. WTF! 🤯

Screenshot of a GitHub activity line plot showing the number of clones per day over the past 14 days. In the past week the number of daily clones went up to 60k+ for 2 days; the total for the entire timespan is 222,356 clones, with only 117 unique cloners.

@metacurity@infosec.exchange
2025-09-27 12:27:17

Metacurity is pleased to offer our free and premium subscribers a weekly digest of the best long-form (and longish) infosec-related pieces we couldn't properly fit into our daily news crush.
This week's selection covers
--How an NYT reporter almost fell for a scam,
--Hackers increasingly take aim at small-town water systems,
--Citizens must shift their threat models under Trump's regime,
--Even the most innocent AI model can spew out dark material,

@gedankenstuecke@scholar.social
2025-08-30 01:22:43

Spent the evening improving my #RSS setup by getting more into the possibilities of freshrss.org/:
Thanks to its support for web scraping, I've now managed to get the full text of articles, instead of just the snippets, of @… en español into my reader.
And I even managed to get BBC Mundo into it through scraping, despite @… unfortunately not providing any official feeds for the Spanish-language news at all.

@arXiv_csAI_bot@mastoxiv.page
2025-09-03 13:46:13

Throttling Web Agents Using Reasoning Gates
Abhinav Kumar, Jaechul Roh, Ali Naseh, Amir Houmansadr, Eugene Bagdasarian
arxiv.org/abs/2509.01619

@alejandrobdn@social.linux.pizza
2025-07-27 09:31:25

For anyone who wants to self-host a catalog of their book, video game, or movie collections, Koillection is a good open-source option.
It can also be installed using Docker, which can speed up the setup process.
I've only been using this tool for a couple of days, and it looks promising. The only thing that doesn't seem very intuitive at the moment is the scraping system, although its developer has commented on GitHub that they are working on it.

@adulau@infosec.exchange
2025-08-22 12:53:47

We are excited to announce the release of Vulnerability-Lookup 2.15.0!
This version brings new features, performance improvements, and several bug fixes.
Thanks to @… for the hard work.
#vulnerability

@tgpo@social.linux.pizza
2025-07-23 13:13:21

Painting and ceiling scraping is finally done!
Now I can set my office back up and get back into the swing of things.
But first, I must clean every single thing I own because it's all covered in a thin layer of white dust 😐

@arXiv_csCY_bot@mastoxiv.page
2025-10-01 09:39:27

Bubble, Bubble, AI's Rumble: Why Global Financial Regulatory Incident Reporting is Our Shield Against Systemic Stumbles
Anchal Gupta, Gleb Pappyshev, James T Kwok
arxiv.org/abs/2509.26150

@arXiv_csRO_bot@mastoxiv.page
2025-07-29 11:28:01

LanternNet: A Novel Hub-and-Spoke System to Seek and Suppress Spotted Lanternfly Populations
Vinil Polepalli
arxiv.org/abs/2507.20800 arxiv…

@trezzer@social.linux.pizza
2025-07-20 10:59:44

I think I want to try a ZX Spectrum Next setup on MiSTer to see if I should get the Kickstarter hardware. Does anyone know of any packs that combine the free/demo Next software so I don’t have to spend hours scraping it from various places on the net? I have found packs of the classic Spectrum software.

@arXiv_csCL_bot@mastoxiv.page
2025-09-22 10:11:51

UPRPRC: Unified Pipeline for Reproducing Parallel Resources -- Corpus from the United Nations
Qiuyang Lu, Fangjian Shen, Zhengkai Tang, Qiang Liu, Hexuan Cheng, Hui Liu, Wushao Wen
arxiv.org/abs/2509.15789

@arXiv_csHC_bot@mastoxiv.page
2025-09-23 08:36:00

Tides of Memory: Digital Echoes of Netizen Remembran
Lingyu Peng, Chang Ge, Liying Long, Xin Li, Xiao Hu, Pengda Lu, Qingchuan Li, Jiangyue Wu
arxiv.org/abs/2509.16579

@gedankenstuecke@scholar.social
2025-09-01 15:52:34

I've blogged about how I'm using #FreshRSS to get full-text #RSS feeds – and about crowdsourcing configs that will allow folks to subscribe to more things thanks to the web scraping feature!
tzovar.as/fulltext-freshrss/
(Responses to this toot will also become blog comments)

@arXiv_csSD_bot@mastoxiv.page
2025-09-23 08:19:10

On the de-duplication of the Lakh MIDI dataset
Eunjin Choi, Hyerin Kim, Jiwoo Ryu, Juhan Nam, Dasaem Jeong
arxiv.org/abs/2509.16662 arxiv.o…

@arXiv_csAI_bot@mastoxiv.page
2025-09-29 10:18:37

RISK: A Framework for GUI Agents in E-commerce Risk Management
Renqi Chen, Zeyin Tao, Jianming Guo, Jingzhe Zhu, Yiheng Peng, Qingqing Sun, Tianyi Zhang, Shuai Chen
arxiv.org/abs/2509.21982

@Techmeme@techhub.social
2025-08-29 01:05:46

While facial recognition tech remains unregulated at the US federal level, 23 states have passed or expanded laws to restrict mass scraping of biometric data (Bobby Allyn/NPR)
npr.org/2025/08/28/nx-s1-55197

@arXiv_csSI_bot@mastoxiv.page
2025-09-22 09:03:11

PoliTok-DE: A Multimodal Dataset of Political TikToks and Deletions From Germany
Tomas Ruiz, Andreas Nanz, Ursula Kristin Schmid, Carsten Schwemmer, Yannis Theocharis, Diana Rieger
arxiv.org/abs/2509.15860

@newsie@darktundra.xyz
2025-10-01 13:02:09

Podcast: Landlords Demand Your Workplace Logins to Scrape Paystubs 404media.co/podcast-landlords-