Tootfinder

@heiseonline@social.heise.de
2024-05-08 11:05:00

OpenAI wants to replace robots.txt: Media Manager for creators, copyright holders, publishers
With a new Media Manager, OpenAI wants to regulate how data on the web may be used. It is meant to replace robots.txt.

@wemic@social.linux.pizza
2024-05-14 10:10:54

Spent some time creating an account on darkvisitors.com, adding deps to my site, writing logic to dynamically create a robots.txt file with data from dark visitors' API only for the API to return a 500 Internal Server Error...
Serves me right for not testing the API in the terminal first..
For now I'm sticking with the static robots.txt I got from @…

Go ahead and block AI web crawlers
AI companies are crawling the open web to, ostensibly, improve the quality of their models and products. This process is extractive and accrues the benefit to said companies, not the owners of sites both small and large.
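
The setup described in the post above (build robots.txt from a remote list, keep a static copy for when the remote call fails) could look roughly like the following. This is only a sketch under assumptions: it does not call the Dark Visitors API, whose details are not given here. `fetch_agents` is a hypothetical callable standing in for whatever client code retrieves the list, and `STATIC_AGENTS` is an illustrative fallback selection.

```python
from typing import Callable, Iterable

# Illustrative static fallback; GPTBot, ClaudeBot and CCBot are publicly
# documented crawler tokens, but this particular selection is an assumption.
STATIC_AGENTS = ["GPTBot", "ClaudeBot", "CCBot"]


def render_robots_txt(agents: Iterable[str]) -> str:
    """Render one 'disallow everything' group per user agent."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in agents]
    return "\n".join(groups)


def build_robots_txt(fetch_agents: Callable[[], Iterable[str]]) -> str:
    """Use the remote list if the fetch works, otherwise the static list."""
    try:
        agents = list(fetch_agents())
    except Exception:
        # For example, the remote API answered 500 Internal Server Error.
        agents = []
    return render_robots_txt(agents or STATIC_AGENTS)


def failing_fetch() -> Iterable[str]:
    # Hypothetical stand-in for a remote call that fails.
    raise RuntimeError("500 Internal Server Error")


print(build_robots_txt(failing_fetch))
```

Passing the fetcher in as a parameter keeps the fallback path testable without any network access.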

@vform@openbiblio.social
2024-03-13 17:41:03

Read about someone who, via robots.txt, now allows only the Google and Bing crawlers on their web server. Then I just came across a page that I couldn't save to archive.org.
Brave new web (well, the topic isn't all that new, but...)
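
To illustrate the side effect mentioned above, here is a minimal sketch of what such an allowlist might look like and how Python's standard `urllib.robotparser` evaluates it. The rules and the `archive.org_bot` user-agent string are assumptions for illustration, not the actual file or crawler name involved in the post.

```python
from urllib.robotparser import RobotFileParser

# Assumed allowlist: two named crawlers may fetch everything,
# every other crawler is shut out by the catch-all group.
ALLOWLIST_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ALLOWLIST_ROBOTS_TXT.splitlines())

# The archive crawler name below is an assumed example of "everyone else".
for agent in ("Googlebot", "Bingbot", "archive.org_bot"):
    allowed = parser.can_fetch(agent, "https://example.org/page.html")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Any crawler not named explicitly falls through to the `*` group, which is why an archiving service that honors robots.txt would be locked out along with the AI bots.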

@fluchtkapsel@nerdculture.de
2024-05-13 11:09:40

An agency wants to sell us SEO and is now complaining that the content of our website is available in our Gitea on a subdomain. Google supposedly penalizes that as duplicate content under different URLs. But our Gitea also hosts several other projects that we don't want to make invisible with a noindex robots.txt.
We think they're talking nonsense. Our PR department, however, wants "optimized" search results. Suggest…

@idbrii@mastodon.gamedev.place
2024-03-03 19:41:37

I finally set up a robots.txt to prevent my site from being scraped by AI bots:
http://idbrii.com/robots.txt
(Hopefully I got everything right.)
Not totally against AI, but I post my stuff free and without ads to spread knowledge. Scraping the web to create closed models behind paywall…

@josemurilo@mato.social
2024-03-31 10:31:46

"the rise of AI products like #ChatGPT, and the #LLMs underlying them, have made high-quality training data one of the internet’s most valuable commodities. That has caused internet providers of all sorts to reconsider the value of the data on their servers, and rethink who gets access to what. Being too …

With the rise of AI, web crawlers are suddenly controversial
For decades, a humble text file governed the behavior of web scrapers. But as the AI industry grows, the social contract of robots.txt is falling apart.

@life_is@no-pony.farm
2024-05-08 11:34:21

@…
OK, the idea is that not only people with write access to the "robots.txt" file can determine what may be crawled, but everyone who creates a page, for the page they created. In addition, it shouldn't be necessary to know the names of the crawlers. So far, so good. But nothing more is known.

@vform@openbiblio.social
2024-03-13 17:33:20

Hm, bots (especially, but not only, AI ones) seem to be a pretty widespread problem. Not only because of ethical questions (content theft), but also, quite mundanely, in terms of traffic and server load.
In the worst case, private (gatekeeper) companies with "clout" (e.g. Cloudflare) get more and more of the business, because robots.txt and IP banning are barely manageable in any sensible way even with IPv4. Just a layperson's thought for now.

@philip@mastodon.mallegolhansen.com
2024-03-05 16:19:49

@… Not saying this alone is good enough, but a starting point:
If you’re writing a scraper, make sure you actually respect the damn robots.txt, it’s there for a reason.
If someone took the time and effort to explicitly indicate what you’re allowed to scrape, listen.
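
For scraper authors, a minimal sketch of that check using Python's standard `urllib.robotparser` might look like this. The crawler name `ExampleScraper`, the rules, and the URLs are all hypothetical; a real scraper would load the site's own file with `set_url()` and `read()` instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleScraper"  # hypothetical crawler name

# Illustrative rules only, parsed from a string so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: ExampleScraper
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow: /
"""


def permitted(parser, user_agent, urls):
    """Keep only the URLs that robots.txt lets this user agent fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

candidates = [
    "https://example.org/blog/post.html",
    "https://example.org/private/notes.html",
]
to_fetch = permitted(parser, USER_AGENT, candidates)
delay = parser.crawl_delay(USER_AGENT)  # seconds between requests, or None

print("fetch:", to_fetch)
print("wait between requests:", delay)
```

Filtering the URL list before any request is made, and sleeping for the advertised delay between requests, keeps the site owner's stated wishes ahead of the scraper's convenience.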

@pgcd@mastodon.online
2024-03-01 05:15:19

While I would very much like to opensource some of the code I've written in the last few years, I'm pretty sure github (and others) will simply use it to train their LLMs with the stated objective of putting me and others out of a job.
What to do?
(No, I don't trust tags/robots.txt/whatever - and neither should you)

@drbruced@aus.social
2024-02-29 02:15:40

Here is my hot take on the #WordPress AI debacle:
- Automattic/WordPress.com could have done a much better job on communicating their strategy
- 404 Media and others could have written much better headlines that didn't make it sound like every WordPress installation would be impacted
- If you run a public web site and you haven't configured a robots.txt file to prevent AI sc…

Systems Approach
Explaining the Internet: its technology, architecture and evolution

@vform@openbiblio.social
2024-02-25 12:02:06

robots.txt: 30 years of house rules for websites
http://heise.de/-9636693
> The Robots Exclusion Standard governs who may automatically graze website content, and in the age of ChatGPT it is as topical as it has been in a long time.

@metacurity@infosec.exchange
2024-02-17 14:19:00

Each week, Metacurity offers our readers a digest of the best infosec-related long reads.
This week's selection covers,
--Marginalized communities hit harder by cyberattacks,
--Deepfakes complicate elections,
--AI threatens robots.txt files,
--Data centers scar Ireland
https://www.metacurity.com/p/best-infosecrelated-long-reads-week-21024

@DamonHD@mastodon.social
2024-04-21 07:29:54

@… ClaudeBot and its AI pals have been very vigorously scraping my pages recently; so much that I had to send the ClaudeBot mob a cease-and-desist email since it was pulling my 'go away' robots.txt every few seconds to see if I wanted to be friends again yet...
