2025-10-17 14:39:39
Quelle surprise: reputable websites are more likely to block access for AI training crawlers than sites whose purpose is disinformation.
https://arxiv.org/abs/2510.10315
Pay-per-output? AI firms blindsided by beefed up robots.txt instructions. https://arstechnica.com/tech-policy/2025/09/pay-per-output-ai-firms-blindsided-by-beefed-up-robots-txt-instructions/
That's curious: a newish paper suggesting an 'ai.txt' (similar to robots.txt) to manage server interactions with AI bots: Li et al. (2025). ai.txt: A Domain-Specific Language for Guiding AI Interactions with the Internet (No. arXiv:2505.07834). arXiv. https://doi.org/10.48550/arXiv.2505.0783…
Is Misinformation More Open? A Study of robots.txt Gatekeeping on the Web
Nicolas Steinacker-Olsztyn, Devashish Gosain, Ha Dao
https://arxiv.org/abs/2510.10315
Calendar.txt by Tero Karvinen is a plain text file calendar that's versionable, supports all operating systems, is future-proof, easily syncs with Android, etc: #TextFiles
The email exchange in which the creators of the UTF-8 encoding format explain how it was born #utf8
Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, Badr AlKhamissi, Ines Altemir Marinas, Mohammad Hossein Amani, Matin An…
SLEAN: Simple Lightweight Ensemble Analysis Network for Multi-Provider LLM Coordination: Design, Implementation, and Vibe Coding Bug Investigation Case Study
Matheus J. T. Vargas
https://arxiv.org/abs/2510.10010
from my link log —
Nontransitive comparison functions lead to out-of-bounds read and write in glibc's qsort().
https://www.qualys.com/2024/01/30/qsort.txt
saved 2025-09-06
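For readers unsure what "nontransitive" means here, a minimal Python sketch follows. It is an illustration only, not code from the Qualys advisory: `find_nontransitive_triple`, `rps_cmp`, and `int_cmp` are hypothetical names, and the rock-paper-scissors ordering is an assumed toy example. Python's own sort is memory-safe, so this only shows how such a comparator can be detected, not the glibc behaviour.

```python
from itertools import permutations


def find_nontransitive_triple(cmp, items):
    """Return (a, b, c) where a < b and b < c but not a < c under cmp, else None."""
    for a, b, c in permutations(items, 3):
        if cmp(a, b) < 0 and cmp(b, c) < 0 and not cmp(a, c) < 0:
            return (a, b, c)
    return None


# Toy ordering (assumption for illustration): x sorts before y when y beats x.
BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}


def rps_cmp(x, y):
    """Cyclic, and therefore nontransitive, comparison function."""
    if x == y:
        return 0
    return -1 if BEATS[y] == x else 1


def int_cmp(x, y):
    """Ordinary three-way integer comparison, which is transitive."""
    return (x > y) - (x < y)


print(find_nontransitive_triple(rps_cmp, ["rock", "paper", "scissors"]))
print(find_nontransitive_triple(int_cmp, [3, 1, 2]))
```

A brute-force check like this is cubic in the number of sample items, so it only suits small test sets, but it can flag a broken comparator before it is handed to a sort routine.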
I'm not sure this is going to make a difference. #LLMs weren't able to read #licenses or terms & conditions before, when these were not formalized in a "machine-readable" way. Besides licenses, we already had the declarative robots.txt files, even if those were not as expressive as this new proposal.
So, is this extra work for web developers and maintainers? Are we going to operate under the new assumption that if we didn't do the work of implementing this, then we are granting permission to scraper bots to steal all our online creations?
Or can this be a net gain for creators in some specific way?
Imagine ChatGPT, but instead of predicting text it just linked you to the top 3 documents most influential on the probabilities that would have been used to predict that text.
It could even generate some info about which parts of each would have been combined, and how.
There would still be issues with how training data is sourced and filtered, but these could be solved by crawling normally while respecting robots.txt and by paying filterers a fair wage with a more relaxed work schedule and mental health support.
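As a rough sketch of what "respecting robots.txt" can look like in a crawler, here is a check using Python's standard `urllib.robotparser`. The robots.txt content, the `example.org` URLs, and the helper name `make_parser` are assumptions made up for illustration. A real crawler would fetch the file from the site rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# A polite crawler asks before every request and skips disallowed URLs.
for agent, url in [
    ("GPTBot", "https://example.org/article"),
    ("SomeOtherBot", "https://example.org/article"),
    ("SomeOtherBot", "https://example.org/private/notes"),
]:
    verdict = "fetch" if parser.can_fetch(agent, url) else "skip"
    print(f"{agent:13} {url} -> {verdict}")
```

The check is purely advisory: nothing stops a crawler from never calling `can_fetch`, which is the complaint running through several of these posts.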
The energy issues are mainly about wild future investment and wasteful query spam, not optimized present-day per-query usage.
Is this "just search?"
Yes, but it would have some advantages for a lot of use cases, mainly in synthesizing results across multiple documents and in leveraging a language model more fully to find relevant stuff.
When we talk about the harms of current corporate LLMs, the opportunity cost of NOT building things like this is part of that.
The equivalent for art would have been so amazing too! "Here are some artists that can do what you want, with examples pulled from their portfolios."
It would be a really cool coding assistant that I'd actually encourage my students to use (with some guidelines).
#AI #GenAI #LLMs
@… two precautions that may help, in lossy situations:
pkg prime-origins | sort -u > /var/tmp/pkg-prime-origins.txt
/usr/local/etc/periodic/daily/411.pkg-backup
If – following an issue – you predict the need to revert to the backup, you can:
service cron stop
robots.txt (but really /dev/zero)
There is an ActivityPub proposal that involves the #DNS.
I have only just discovered it and have not considered it deeply so I am reluctant to make any grand statements. It is not obvious to me why this is useful or better than alternative approaches. It appears to involve the use of TXT RRs, any new de facto use of which makes me skeptical.
"Fewer and fewer real users: study shows massive rise in bot traffic on the web:
Human website visitors are becoming scarce: a new analysis shows how Google, ChatGPT and co. are changing the rules of the game on the web with their bots, and why publishers are sounding the alarm."
Bots have been around since the early days of the internet, and those, too, deliberately disregard robots.txt instructions.
🫥
Any reading recommendations for a small collective looking to move from Google dependency to self-hosting?
We're collecting some resources here:
https://doc.patternclub.org/s/QCwRlvO1A#
With a FreeBSD pkg repository configuration file set to use quarterly for the one and only repo:
― why is latest (not quarterly) used for bootstrapping?
https://gist.github.com/grahamperrin/1a36d21e9c6d3bc363cee7ecfe779595#file-2…
#WritersCoffeeClub
19. How do you keep track of dates and events in a WIP?
20. What rôle does death (or undeath!) play in your work?
21. What is your take on the adverb debate?
---
19. Any WIP that needs dates and events is probably already big enough to void memorization.
I went all the way from stickies to txt files to wikis to, eventually, Campfire. I love m…
Later, after I keyed n (to not continue), installation did continue.
This is partly understandable, because the y/n prompt was in response to a command that used
-y
Not really a bug, just slightly surprising.
https://gist.github.com/grahamperrin/1…
*AI* crawlers ignoring *robots*.txt file. The irony. 🤦♂️
#AI #technology #conventions #web