Can’t we talk to the humans and work together? No, because they are dead
March 24, 2017 11:19 AM

The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots, and was first developed by people on the www-talk mailing list in 1994. RobotsTXT.org has information and history, and the similar Robots META tag. As with code in general, you can add silly things in the comments, and Google spoofed the format with their own killer-robots.txt. More recently, robots.txt inspired an alternative file: humans.txt, "a TXT file that contains information about the different people who have contributed to building the website."

Google got in on the fun, too. Since then, they've simplified it.

If making your own robots.txt is daunting, there are tools to help you generate one and set rules for each of 15 different bots. A few more bots are listed here, but both lists are paltry compared to the 302 bots detailed on RobotsTXT.
posted by filthy light thief (18 comments total) 24 users marked this as a favorite
 
One particularly long robots.txt tells a story of the struggles of being an SEO. At least he gave credit to Joan Stark (jgs), the ASCII artist.

The title is a reference to a reference: it's a line by Flight of the Conchords, which was also referenced in YouTube's robots.txt file, which SearchEngineLand might have missed because they think the reference means we already lost the war to robots.
posted by filthy light thief at 11:23 AM on March 24, 2017


Also, this is all inspired by Lanark's comment in MetaTalk, which is where I learned about humans.txt.
posted by filthy light thief at 11:24 AM on March 24, 2017 [3 favorites]


Also the bane of Archive.org, given that it applies retroactively. Anybody who buys a domain can slap a robots.txt exclusion on it and remove that site's entire history from the Wayback Machine.
posted by Rhaomi at 11:28 AM on March 24, 2017 [3 favorites]


Also the bane of Archive.org, given that it applies retroactively. Anybody who buys a domain can slap a robots.txt exclusion on it and remove that site's entire history from the Wayback Machine.

Just as a point of fact: this is archive.org being polite; robots.txt has no legal standing.
posted by Going To Maine at 11:37 AM on March 24, 2017 [5 favorites]


I'm waiting for the cat.txt
posted by Brandon Blatcher at 11:39 AM on March 24, 2017 [4 favorites]


I imagine you could also undo that by poaching that domain and undoing the robots.txt change.

I actually use Archive.org to view tweets, which also archives them, but today it seems Twitter has updated their robots.txt file, which is ... strange.
posted by filthy light thief at 11:41 AM on March 24, 2017


Rhaomi: "Also the bane of Archive.org, given that it applies retroactively. Anybody who buys a domain can slap a robots.txt exclusion on it and remove that site's entire history from the Wayback Machine."

Going To Maine: "Also the bane of Archive.org, given that it applies retroactively. Anybody who buys a domain can slap a robots.txt exclusion on it and remove that site's entire history from the Wayback Machine.

Just as a point of fact: this is archive.org being polite; robots.txt has no legal standing."

I use it to look up sites now missing in the current day. I see the listing in the Wayback Machine, click the latest check and BAM! Nothing. Sad Samizdata face.
posted by Samizdata at 11:53 AM on March 24, 2017 [1 favorite]


I'm waiting for cyborg.txt or mecha.txt (but not transformer.txt)
posted by oneswellfoop at 12:25 PM on March 24, 2017


I just checked and the humans are dead.
posted by tommasz at 12:37 PM on March 24, 2017


Google got in on the fun, too. Since then, they've simplified it.

(I suspect you'd find that first version in more places than a single hacker news comment if it was something they had actually served at some point...)
posted by effbot at 1:09 PM on March 24, 2017


Seconding Rhaomi's concern, I've noticed this happening in the past.

I'm also concerned because several classic gaming sites I visit from time to time have overly broad robots.txt exclusions. In each instance I've emailed the site maintainer to ask if they'd consider relaxing their policy, since it means the Wayback Machine won't touch the site and the (sometimes irreplaceable) information might get lost. Nothing has come of it yet, though.
posted by JHarris at 1:15 PM on March 24, 2017 [2 favorites]


I've seen MANY sites totally blocking off Googlebot simply because of a single word in their robots.txt:

Block:
User-agent: *
Disallow: /

Allow:
User-agent: *
Allow: /

Easy mistake to make.
posted by Foci for Analysis at 1:44 PM on March 24, 2017 [1 favorite]
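
For anyone who wants to check the two rule sets above before deploying them, here is a minimal sketch using Python's standard-library urllib.robotparser. The helper name can_fetch, the example.com URL, and the "Googlebot" agent string are only illustrative assumptions. Real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(rules: str, agent: str, url: str) -> bool:
    """Parse robots.txt text held in a string and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse() takes an iterable of lines
    return parser.can_fetch(agent, url)


# The two variants from the comment above; they differ by a single word.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nAllow: /\n"

# Hypothetical URL, used only for illustration.
url = "https://example.com/some/page.html"

print(can_fetch(BLOCK_ALL, "Googlebot", url))  # False: everything under / is disallowed
print(can_fetch(ALLOW_ALL, "Googlebot", url))  # True: everything under / is allowed
```

Running something like this against a draft file is a cheap way to catch the Disallow/Allow mix-up before a search engine does.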


I ... wrote a robot in 1993, to teach myself Perl.

I was working for SCO at the time, in London, and we had a 64K leased line. (Feel the raw power of my internets!!!)

I was testing my bot with a depth-first traversal of the web (stupid, I know, but hey, this was in the era when the "What's New on the Web" morning email from CERN was short enough you could visit all yesterday's new interesting websites before 10am) and hardwired a target URL as a starting point, because I wasn't thinking through the consequences.

Got an irate email a few days later from the sysadmin there; his company only had a 14.4K leased line and I was saturating it! So he horked up a simple protocol. "Look for a file in / called robots.txt. Grab it. It'll contain pathnames. Do not grab anything prefixed with those paths. Or I'll kvetch at your boss."

Then he went away and turned it into an actual specification, which became the robot exclusion protocol proper, circa early 1994.

Here's where I buried the guilty evidence (you're looking for websnarf). NB: I last updated the stuff in this corner of my website back in the very early 2000s; the contents of this directory date to 1996 or so.
posted by cstross at 1:48 PM on March 24, 2017 [57 favorites]
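
The "simple protocol" as described above (fetch /robots.txt, read the pathnames, skip anything prefixed with them) can be sketched roughly as follows. This is only an illustration of the prefix-matching idea, not the original 1993 code or the later specification: the function names and the sample file are hypothetical, and the sketch deliberately ignores User-agent grouping and Allow lines.

```python
from typing import List
from urllib.parse import urlsplit


def disallowed_prefixes(robots_txt: str) -> List[str]:
    """Collect non-empty Disallow path prefixes, ignoring User-agent grouping."""
    prefixes = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            path = value.strip()
            if path:  # an empty Disallow value excludes nothing
                prefixes.append(path)
    return prefixes


def may_fetch(url: str, prefixes: List[str]) -> bool:
    """Return False if the URL's path starts with any disallowed prefix."""
    path = urlsplit(url).path or "/"
    return not any(path.startswith(p) for p in prefixes)


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
# hypothetical robots.txt
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

prefixes = disallowed_prefixes(SAMPLE)
print(prefixes)                                                 # ['/cgi-bin/', '/tmp/']
print(may_fetch("http://example.com/index.html", prefixes))     # True
print(may_fetch("http://example.com/tmp/cache.html", prefixes)) # False
```

A polite crawler would run a check like may_fetch before every request; nothing in the protocol itself enforces it.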


Robots.txt is not a standard in the sense that there's any organization backing it or any consequences to ignoring or exploiting it. At best it's a consensus agreement for how to describe the way you wish visitors to behave in your house.

It's also been known as a security risk since a hostile robot can use the disallow section as a guide for where private information is more likely to be found.
posted by ardgedee at 1:55 PM on March 24, 2017 [2 favorites]


Don't worry about the robots.txt. It's the cake.txt that is trying to kill you with its lies.
posted by Nanukthedog at 6:31 PM on March 24, 2017 [1 favorite]


Nanukthedog: "Don't worry about the robots.txt. It's the cake.txt that is trying to kill you with its lies."

I gopher'd a text file that shows your parents didn't love you.
posted by Samizdata at 6:59 PM on March 24, 2017 [1 favorite]


I ... wrote a robot in 1993

cstross, I have a story involving unsupervised army cadets, unloaded 7.62mm assault rifles and unsuspecting suburban Canada, whose similarity to your tale - in terms of consequences - is... [shuffles feet awkwardly] similar in scope.
posted by CynicalKnight at 7:42 PM on March 24, 2017 [4 favorites]


Fortunately there is another team that interprets robots.txt as damage and works around it, for archival purposes.

BTW: Just uploaded my unicorns.txt file.
posted by runcifex at 1:19 AM on March 25, 2017 [4 favorites]



