Tootfinder

@tiotasram@kolektiva.social
2025-05-13 22:19:54

Writing code that ignores robots.txt is a professional ethics violation.
This is a toot about #AI
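
A minimal sketch of what honoring robots.txt can look like in code, using Python's standard `urllib.robotparser`. The robots.txt content, the bot names, and the URLs below are hypothetical and only illustrate the check; a real crawler would fetch the file from the site and cache it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so the example
# runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = make_parser(ROBOTS_TXT)
# ExampleBot is disallowed everywhere; other bots are only kept out of /private/.
print(may_fetch(parser, "ExampleBot", "https://example.org/index.html"))
print(may_fetch(parser, "OtherBot", "https://example.org/index.html"))
print(may_fetch(parser, "OtherBot", "https://example.org/private/data"))
```

A crawler would call `may_fetch` before every request and skip any URL for which it returns False.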

@lysander07@sigmoid.social
2025-05-13 16:25:32

Last week, our students learned how to conduct a proper evaluation for an NLP experiment. To this end, we introduced a small text corpus with sentences about Joseph Fourier, who counts as one of the discoverers of the greenhouse effect, responsible for global warming.

Slide of the Information Service Engineering lecture 03, Natural Language Processing 02, section 2.6: Evaluation, Precision, and Recall
Headline: Experiment
Let's consider the following text corpus (FOURIERCORPUS):
1
In 1807, Fourier's work on heat transfer laid the foundation for understanding the greenhouse effect.
2
Joseph Fourier's energy balance analysis showed atmosphere's heat-trapping role.
3
Fourrier's calculations, though rudimentary, suggested that the atmosphere acts as an insulato…
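
A rough sketch of how precision and recall could be computed on the three sentences quoted above. The slide is cut off, so the actual experiment is not known; the exact-match query for "Fourier" and the assumption that all three sentences are relevant are illustrative choices, not the lecture's setup. The third sentence is kept truncated as it appears in the source.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Precision = TP / retrieved, recall = TP / relevant (0.0 if a set is empty)."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Sentences as quoted in the slide text (sentence 3 is truncated in the source).
corpus = {
    1: "In 1807, Fourier's work on heat transfer laid the foundation "
       "for understanding the greenhouse effect.",
    2: "Joseph Fourier's energy balance analysis showed atmosphere's "
       "heat-trapping role.",
    3: "Fourrier's calculations, though rudimentary, suggested that the "
       "atmosphere acts as an insulato…",
}

# Assumption: every sentence is about Joseph Fourier, so all are relevant.
relevant = {1, 2, 3}

# Hypothetical retrieval method: exact substring match on the query term.
retrieved = {doc_id for doc_id, text in corpus.items() if "Fourier" in text}

precision, recall = precision_recall(retrieved, relevant)
print(f"retrieved={sorted(retrieved)} precision={precision:.2f} recall={recall:.2f}")
```

Under these assumptions the exact match misses sentence 3 because of the spelling "Fourrier", so precision stays at 1.0 while recall drops to 2/3.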

@timbray@cosocial.ca
2025-06-10 17:23:53

Is there anything I can put in robots.txt that will stop Scrapy?
Failing that, let’s take the ship up and nuke the site from orbit. It’s the only way to be sure.

@groupnebula563@mastodon.social
2025-07-05 01:32:59

#AI #honeypots huh

The /llms.txt file – llms-txt
A proposal to standardise on using an /llms.txt file to provide information to help LLMs use a website at inference time.

@xtaran@chaos.social
2025-06-08 23:55:00

Fsck GMail!
@ IN TXT "v=spf1 all"

@andycarolan@social.lol
2025-06-09 08:36:36

I just discovered TXT... feels all kinds of uplifting for a Monday morning :)
#TomorrowXTogether #KPop

@kubikpixel@chaos.social
2025-05-26 06:00:07

From HOSTS.TXT to Modern Internet Infrastructure
🌐 #hoststxt

From HOSTS.TXT to Modern Internet Infrastructure | AXON Shield
The development of DNS demonstrates an impressive journey from its initial basic form into a modern distributed system which provides high resilience. The internet initially used a basic centralized text file named HOSTS.TXT for its operations. The rapid internet expansion made the initial text file system unworkable so developers created a new solution which could scale dynamically. The system evolves because organizations need better scalability and absolute reliability along with robust secu…

@cosmos4u@scicomm.xyz
2025-07-03 00:55:22

There is now also a CBET about the new interstellar #comet 3I/ATLAS: http://www.cbat.eps.harvard.edu/iau/cbet/005500/CBET005578.txt - it comes with an even more precise orbit based on astrometry back to 5 June and predicts 13th magnitude with 60° elongation after perihelion in November. The current magnitude is about 17.7.

@wfryer@mastodon.cloud
2025-07-02 03:03:31

Control How Your Content Is Used for AI Training With Cloudflare (Cloudflare Blog, 1 July 2025)
#MediaLit

Control content use for AI training with Cloudflare’s managed robots.txt and blocking for monetized content
Cloudflare is making it easier for publishers and content creators of all sizes to prevent their content from being scraped for AI training by managing robots.txt on their behalf, and allowing targeted blocking of AI crawling on sites that serve ads.

@n8foo@macaw.social
2025-05-08 04:10:30

From the digital archives: #AWS #EC2 IP ranges from 14 years ago.
https://

@chriscz@social.linux.pizza
2025-07-03 01:19:26

🥱
The day SHALL start.
Regards not given,
RFC2119
https://ietf.org/rfc/rfc2119.txt

@mgorny@social.treehouse.systems
2025-07-05 18:35:18

To whomever praises #Claude #LLM:
ClaudeBot has made 20k requests to bugs.gentoo.org today. 15k of them were repeatedly fetching robots.txt. That surely is a sign of great code quality.
#AI

@mgorny@pol.social
2025-07-05 18:36:35

If anyone is praising #Claude #LLM, let me mention:
ClaudeBot made 20 thousand requests to bugs.gentoo.org today. Of those, 15 thousand were fetching the robots.txt file over and over. Truly high-quality code.
#AI

@arXiv_csNI_bot@mastoxiv.page
2025-05-29 07:21:03

Scrapers selectively respect robots.txt directives: evidence from a large-scale empirical study
Taein Kim, Karstan Bock, Claire Luo, Amanda Liswood, Emily Wenger
https://arxiv.org/abs/2505.21733

Scrapers selectively respect robots.txt directives: evidence from a large-scale empirical study
Online data scraping has taken on new dimensions in recent years, as traditional scrapers have been joined by new AI-specific bots. To counteract unwanted scraping, many sites use tools like the Robots Exclusion Protocol (REP), which places a robots.txt file at the site root to dictate scraper behavior. Yet, the efficacy of the REP is not well-understood. Anecdotal evidence suggests some bots comply poorly with it, but no rigorous study exists to support (or refute) this claim. To understand th…

@groupnebula563@mastodon.social
2025-07-08 01:41:19

new cool idea: whenever anything requests /llms.txt or *.md serve them 42.zip
#ai #noai

@EgorKotov@datasci.social
2025-06-18 16:12:16

📝🗃️ 𝗿𝗱𝗼𝗰𝗱𝘂𝗺𝗽: Dump ‘R’ Package Source, Documentation, and Vignettes into One File for use in LLMs #rstats #LLM is on CRAN https://www.ekotov.pro/rdocdum…

rdocdump
Get fresh package docs to pass to LLM
library(rdocdump)
rdd_to_txt(
  pkg = "aws.s3",
  output_file = "aws.s3.txt",
  force_fetch = TRUE)
github.com/e-kotov/rdocdump

rdocdump: Dump R Package Documentation and Vignettes into One File
Dump R Package Documentation and Vignettes into One File

@noellabo@fedibird.com
2025-06-20 04:58:16

"To avoid trouble for DOS users, please use file names made of up to 8 uppercase letters, underscores, and digits (8.3 format)." -- README~1.TXT
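
The naming rule quoted above can be sketched as a small check. The pattern encodes only the rule as stated in the toot (uppercase letters, digits, and underscore, with an optional extension of up to three characters); real DOS accepts a few more characters, and the function name `is_8_3` is a hypothetical helper.

```python
import re

# Assumption: only the characters named in the toot are allowed.
# Base name: 1 to 8 characters; optional extension: 1 to 3 characters.
_NAME_8_3 = re.compile(r"[A-Z0-9_]{1,8}(\.[A-Z0-9_]{1,3})?")


def is_8_3(filename: str) -> bool:
    """Return True if the whole file name fits the quoted 8.3 rule."""
    return _NAME_8_3.fullmatch(filename) is not None


for name in ("README.TXT", "README~1.TXT", "readme.txt", "LONGFILENAME.TXT"):
    print(name, is_8_3(name))
```

Under this strict reading, the signature file README~1.TXT itself fails the check because of the tilde.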

@fluchtkapsel@nerdculture.de
2025-05-30 12:34:57

Content warning: tech, admin, dns

Today, I got notified about spamhaus not responding anymore to requests from our mailserver due to using an "open resolver".
Huh?
I found the command `dig +short test.openresolver.com TXT @<ip_of_dns_server_to_test>` to test if my DNS server is deemed an open resolver. And yes, the mailserver uses a DNS server that got recognized as an open resolver.
Out of curiosity, I tried the same in my local network where I have a dnsmasq serving DHCP and DNS for my cli…

@ripienaar@devco.social
2025-06-15 12:14:41

Been designing distributed counters for NATS. Pretty happy with this.
50k/second unoptimised and on a single counter - but we will support aggregation of regional to global etc.
Hard dist sys problems made trivial to use and operate 💪💪
https://gist.github.com/ripienaar/d95d

Counters.txt
GitHub Gist: instantly share code, notes, and snippets.

Opt-in global Mastodon full text search. Join the index!