Tootfinder

@heiseonline@social.heise.de
2025-10-14 05:21:00

UK: Clearview's facial recognition is subject to European data protection rules
A British appeals court has upheld a multi-million fine imposed by the UK data protection commissioner on the US firm Clearview AI for mass scraping.

@cdarwin@c.im
2025-08-04 18:13:45

AI startup Perplexity is crawling and scraping content from websites that have explicitly indicated they don’t want to be scraped, according to internet infrastructure provider Cloudflare.
On Monday, Cloudflare published research saying it observed the AI startup ignore blocks and hide its crawling and scraping activities.
The network infrastructure giant accused Perplexity of obscuring its identity when trying to scrape web pages “in an attempt to circumvent the website’s prefe…

Perplexity accused of scraping websites that explicitly blocked AI scraping | TechCrunch
Internet giant Cloudflare says it detected Perplexity crawling and scraping websites, even after customers had added technical blocks telling Perplexity not to scrape their pages.

@Techmeme@techhub.social
2025-08-11 17:15:42

Reddit says it will block the Internet Archive from indexing most of its pages after it caught AI companies scraping its data from the Wayback Machine (Jay Peters/The Verge)
https://www.theverge.com/news/757538/reddit-internet-archive-wayback-machine-…

Reddit will block the Internet Archive
Reddit caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to limit the Internet Archive from indexing some data.

@Mediagazer@mstdn.social
2025-08-11 17:15:42

Reddit says it will block the Internet Archive from indexing most of its pages after it caught AI companies scraping its data from the Wayback Machine (Jay Peters/The Verge)
https://www.theverge.com/news/757538/reddit-internet-archive-wayback-machine-…

Reddit will block the Internet Archive
Reddit caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to limit the Internet Archive from indexing some data.

@cdarwin@c.im
2025-08-11 18:41:40

Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, and it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit.
The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles; instead, it will only be able to index the Reddit.com homepage, which effectively means Internet Archive will only be able to archive insights into which news headlines …

Reddit will block the Internet Archive
Reddit caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to limit the Internet Archive from indexing some data.

@Techmeme@techhub.social
2025-10-03 15:01:24

LinkedIn sues a company called ProAPIs for allegedly operating millions of fake accounts to scrape LinkedIn member data and selling it for ~$15,000 per month (Suzanne Smalley/The Record)
https://therecord.media/linkedin-sues-data-scraping-company

LinkedIn sues software company allegedly scraping data from millions of profiles
ProAPIs, a software company, and its CEO Rahmat Alam allegedly run an operation which LinkedIn says charges customers up to $15,000 per month for scraped user data taken from the social media platform.

@philip@mastodon.mallegolhansen.com
2025-09-11 03:41:20

@… *If* scrapers would actually follow the spec, I’m somewhat for it. It *does* allow for you to insert a custom license that says “No scraping, under any circumstances”.
But would any scraper actually follow it? Of course not.

@newsie@darktundra.xyz
2025-10-03 14:08:30

LinkedIn sues software company allegedly scraping data from millions of profiles https://therecord.media/linkedin-sues-data-scraping-company

LinkedIn sues software company allegedly scraping data from millions of profiles
ProAPIs, a software company, and its CEO Rahmat Alam allegedly run an operation which LinkedIn says charges customers up to $15,000 per month for scraped user data taken from the social media platform.

@fgraver@hcommons.social
2025-09-10 19:22:24

Pay-per-output? AI firms blindsided by beefed up robots.txt instructions. https://arstechnica.com/tech-policy/2025/09/pay-per-output-ai-firms-blindsided-by-beefed-up-robots-txt-instructions/

Pay-per-output? AI firms blindsided by beefed up robots.txt instructions.
“Really Simple Licensing” makes it easier for creators to get paid for AI scraping.

@johnleonard@mastodon.social
2025-07-23 11:53:11

A review of the legal challenges associated with generative AI training disputes emphasises the need for clarity from the UK government, legislature and courts.
https://www.computing.co.uk/feature/2025/scraping-surface-generative-ai-training-disputes…

Scraping the surface of generative AI training disputes and their legal challenges
A review of the legal challenges associated with generative AI training disputes emphasises the need for clarity from the UK government, legislature and courts.

@Stomata@social.linux.pizza
2025-08-09 05:33:29

According to Drop Site News, Meta is training AI on multiple Lemmy instances. I also saw some Mastodon instances in the PDF.
Full article https://www.dropsitenews.com/p/meta-facebook-tech-copyright-privacy-whistleblower
Full list:

LEAKED: A New List Reveals Top Websites Meta Is Scraping of Copyrighted Content to Train Its AI
The tech giant is sidestepping guardrails that websites use to prevent being scraped, data show, in a move whistleblowers say is unethical and potentially illegal.

@gedankenstuecke@scholar.social
2025-09-06 16:42:44

«Google quietly vanishes its net zero carbon pledge»
Of course, gotta spend all the energy scraping the web to death and using that for training some "AI" that advises people to kill themselves. The Butlerian Jihad truly can't come fast enough…
https://pivot-to-ai.com/2025/09/05/google-quietly-vanishes-its-net-zero-carbon-commitment/

@Mediagazer@mstdn.social
2025-09-10 13:11:01

Reddit, Yahoo, Medium, Quora, People, O'Reilly, wikiHow, Ziff Davis, and others adopt the Really Simple Licensing (RSL) Standard to set terms for AI scraping (Emma Roth/The Verge)
https://www.theverge.com/news/775072/rsl-standard-licensing-ai…

The web has a new system for making AI companies pay up
A new licensing standard, called the RSL Standard, aims to allow AI companies to license content from across the web. Reddit, Yahoo, Quora, and others are already on board.

@Techmeme@techhub.social
2025-09-10 13:10:55

Reddit, Yahoo, Medium, Quora, People, O'Reilly, wikiHow, Ziff Davis, and others adopt the Really Simple Licensing (RSL) Standard to set terms for AI scraping (Emma Roth/The Verge)
https://www.theverge.com/news/775072/rsl-standard-licensing-ai…

The web has a new system for making AI companies pay up
A new licensing standard, called the RSL Standard, aims to allow AI companies to license content from across the web. Reddit, Yahoo, Quora, and others are already on board.

@toxi@mastodon.thi.ng
2025-07-27 09:27:06

Anyone else getting these ridiculous repo scraping spikes? A clean checkout of the https://thi.ng/umbrella monorepo is ~370MB. Over the past 14 days there were 222k clones (only 117 unique) of this repo which have caused downloads of a whopping ~78TB. WTF! 🤯

Screenshot of a GitHub activity line plot showing the number of clones per day over the past 14 days. In the past week, daily clones rose to 60k+ for 2 days; the total for the entire timespan is 222,356 clones from only 117 unique cloners.

thi.ng/umbrella

@metacurity@infosec.exchange
2025-09-27 12:27:17

Metacurity is pleased to offer our free and premium subscribers a weekly digest of the best long-form (and longish) infosec-related pieces we couldn't properly fit into our daily news crush.
This week's selection covers
--How an NYT reporter almost fell for a scam,
--Hackers increasingly take aim at small-town water systems,
--Citizens must shift their threat models under Trump's regime,
--Even the most innocent AI model can spew out dark material,

@gedankenstuecke@scholar.social
2025-08-30 01:22:43

Spent the evening improving my #RSS setup by getting more into the possibilities of https://www.freshrss.org/:
Thanks to its support for web scraping, I've now managed to get the full text of articles, instead of just the snippets, of @… en español into my reader.
I even managed to get BBC Mundo into it through scraping, despite @… unfortunately not providing any official feeds for the Spanish-language news at all.

@arXiv_csAI_bot@mastoxiv.page
2025-09-03 13:46:13

Throttling Web Agents Using Reasoning Gates
Abhinav Kumar, Jaechul Roh, Ali Naseh, Amir Houmansadr, Eugene Bagdasarian
https://arxiv.org/abs/2509.01619 https://

Throttling Web Agents Using Reasoning Gates
AI web agents use Internet resources at far greater speed, scale, and complexity -- changing how users and services interact. Deployed maliciously or erroneously, these agents could overload content providers. At the same time, web agents can bypass CAPTCHAs and other defenses by mimicking user behavior or flood authentication systems with fake accounts. Yet providers must protect their services and content from denial-of-service attacks and scraping by web agents. In this paper, we design a fr…

@alejandrobdn@social.linux.pizza
2025-07-27 09:31:25

For anyone who wants to self-host a catalog of their book, video game, or movie collections, Koillection is a good open-source option.
It can also be installed using Docker, which can speed up the setup process.
I've only been using this tool for a couple of days, and it looks promising. The only thing that doesn't seem very intuitive at the moment is the scraping system, although its developer has commented on GitHub that they are working on it.

Koillection
A self-hosted website to manage all your collections.

@adulau@infosec.exchange
2025-08-22 12:53:47

We are excited to announce the release of Vulnerability-Lookup 2.15.0!
This version brings new features, performance improvements, and several bug fixes.
Thanks to @… for the hard work.
#vulnerability

Vulnerability-Lookup 2.15.0 released
We are excited to announce the release of Vulnerability-Lookup 2.15.0! This version brings new features, performance improvements, and several bug fixes. What’s New: Detecting vulnerabilities known only through sightings. The dashboard now highlights vulnerabilities discovered via our sighting tools, including scraping social networks, MISP, Nuclei templates, Shadowserver, Gist, and more. This gives you better visibility of unpublished advisories.

@tgpo@social.linux.pizza
2025-07-23 13:13:21

Painting and ceiling scraping is finally done!
Now I can set my office back up and get back into the swing of things.
But first, I must clean every single thing I own because it's all covered in a thin layer of white dust 😐

@arXiv_csCY_bot@mastoxiv.page
2025-10-01 09:39:27

Bubble, Bubble, AI's Rumble: Why Global Financial Regulatory Incident Reporting is Our Shield Against Systemic Stumbles
Anchal Gupta, Gleb Pappyshev, James T Kwok
https://arxiv.org/abs/2509.26150

Bubble, Bubble, AI's Rumble: Why Global Financial Regulatory Incident Reporting is Our Shield Against Systemic Stumbles
"Double, double toil and trouble; Fire burn and cauldron bubble." As Shakespeare's witches foretold chaos through cryptic prophecies, modern capital markets grapple with systemic risks concealed by opaque AI systems. According to IMF, the August 5, 2024, plunge in Japanese and U.S. equities can be linked to algorithmic trading yet absent from existing AI incidents database exemplifies this transparency crisis. Current AI incident databases, reliant on crowdsourcing or news scraping, systematic…

@arXiv_csRO_bot@mastoxiv.page
2025-07-29 11:28:01

LanternNet: A Novel Hub-and-Spoke System to Seek and Suppress Spotted Lanternfly Populations
Vinil Polepalli
https://arxiv.org/abs/2507.20800 https://arxiv…

LanternNet: A Novel Hub-and-Spoke System to Seek and Suppress Spotted Lanternfly Populations
The invasive spotted lanternfly (SLF) poses a significant threat to agriculture and ecosystems, causing widespread damage. Current control methods, such as egg scraping, pesticides, and quarantines, prove labor-intensive, environmentally hazardous, and inadequate for long-term SLF suppression. This research introduces LanternNet, a novel autonomous robotic Hub-and-Spoke system designed for scalable detection and suppression of SLF populations. A central, tree-mimicking hub utilizes a YOLOv8 com…

@trezzer@social.linux.pizza
2025-07-20 10:59:44

I think I want to try a ZX Spectrum Next setup on MiSTer to see if I should get the Kickstarter hardware. Does anyone know of any packs that combine the free/demo Next software so I don’t have to spend hours scraping it from various places on the net? I have found packs of the classic Spectrum software.

@arXiv_csCL_bot@mastoxiv.page
2025-09-22 10:11:51

UPRPRC: Unified Pipeline for Reproducing Parallel Resources -- Corpus from the United Nations
Qiuyang Lu, Fangjian Shen, Zhengkai Tang, Qiang Liu, Hexuan Cheng, Hui Liu, Wushao Wen
https://arxiv.org/abs/2509.15789

UPRPRC: Unified Pipeline for Reproducing Parallel Resources -- Corpus from the United Nations
The quality and accessibility of multilingual datasets are crucial for advancing machine translation. However, previous corpora built from United Nations documents have suffered from issues such as opaque process, difficulty of reproduction, and limited scale. To address these challenges, we introduce a complete end-to-end solution, from data acquisition via web scraping to text alignment. The entire process is fully reproducible, with a minimalist single-machine example and optional distribute…

@arXiv_csHC_bot@mastoxiv.page
2025-09-23 08:36:00

Tides of Memory: Digital Echoes of Netizen Remembrance
Lingyu Peng, Chang Ge, Liying Long, Xin Li, Xiao Hu, Pengda Lu, Qingchuan Li, Jiangyue Wu
https://arxiv.org/abs/2509.16579 h…

Tides of Memory: Digital Echoes of Netizen Remembrance
This artwork presents an interdisciplinary interaction installation that visualizes collective online mourning behavior in China. By focusing on commemorative content posted on Sina Weibo following the deaths of seven prominent Chinese authors, the artwork employs data scraping, natural language processing, and 3D modeling to transform fragmented textual expressions into immersive digital monuments. Through the analysis of word frequencies, topic models, and user engagement metrics, the system …

@gedankenstuecke@scholar.social
2025-09-01 15:52:34

I've blogged about how I'm using #FreshRSS to get full-text #RSS feeds – and about crowdsourcing configs that will allow folks to subscribe to more things thanks to the web scraping feature!
https://tzovar.as/fulltext-freshrss/
(Responses to this toot will also become blog comments)

@arXiv_csSD_bot@mastoxiv.page
2025-09-23 08:19:10

On the de-duplication of the Lakh MIDI dataset
Eunjin Choi, Hyerin Kim, Jiwoo Ryu, Juhan Nam, Dasaem Jeong
https://arxiv.org/abs/2509.16662 https://arxiv.o…

On the de-duplication of the Lakh MIDI dataset
A large-scale dataset is essential for training a well-generalized deep-learning model. Most such datasets are collected via scraping from various internet sources, inevitably introducing duplicated data. In the symbolic music domain, these duplicates often come from multiple user arrangements and metadata changes after simple editing. However, despite critical issues such as unreliable training evaluation from data leakage during random splitting, dataset duplication has not been extensively a…

@arXiv_csAI_bot@mastoxiv.page
2025-09-29 10:18:37

RISK: A Framework for GUI Agents in E-commerce Risk Management
Renqi Chen, Zeyin Tao, Jianming Guo, Jingzhe Zhu, Yiheng Peng, Qingqing Sun, Tianyi Zhang, Shuai Chen
https://arxiv.org/abs/2509.21982

RISK: A Framework for GUI Agents in E-commerce Risk Management
E-commerce risk management requires aggregating diverse, deeply embedded web data through multi-step, stateful interactions, which traditional scraping methods and most existing Graphical User Interface (GUI) agents cannot handle. These agents are typically limited to single-step tasks and lack the ability to manage dynamic, interactive content critical for effective risk assessment. To address this challenge, we introduce RISK, a novel framework designed to build and deploy GUI agents for this…

@Techmeme@techhub.social
2025-08-29 01:05:46

While facial recognition tech remains unregulated at the US federal level, 23 states have passed or expanded laws to restrict mass scraping of biometric data (Bobby Allyn/NPR)
https://www.npr.org/2025/08/28/nx-s1-5519756/biometrics-facial-recognition-l…

@arXiv_csSI_bot@mastoxiv.page
2025-09-22 09:03:11

PoliTok-DE: A Multimodal Dataset of Political TikToks and Deletions From Germany
Tomas Ruiz, Andreas Nanz, Ursula Kristin Schmid, Carsten Schwemmer, Yannis Theocharis, Diana Rieger
https://arxiv.org/abs/2509.15860

PoliTok-DE: A Multimodal Dataset of Political TikToks and Deletions From Germany
We present PoliTok-DE, a large-scale multimodal dataset (video, audio, images, text) of TikTok posts related to the 2024 Saxony state election in Germany. The corpus contains over 195,000 posts published between 01.07.2024 and 30.11.2024, of which over 18,000 (17.3%) were subsequently deleted from the platform. Posts were identified via the TikTok research API and complemented with web scraping to retrieve full multimodal media and metadata. PoliTok-DE supports computational social science acro…

@newsie@darktundra.xyz
2025-10-01 13:02:09

Podcast: Landlords Demand Your Workplace Logins to Scrape Paystubs https://www.404media.co/podcast-landlords-demand-your-workplace-logins-to-scrape-paystubs/

Podcast: Landlords Demand Your Workplace Logins to Scrape Paystubs
How companies working for landlords are scraping data inside corporate environments; lawyers explain why they used AI (after getting caught); and all the Ruby drama.
