Iraqfilter
October 27, 2003 2:30 PM

Iraqfilter. "Sometime between April 2003 and October 2003, someone at the White House added virtually all of the directories with 'Iraq' in them to its robots.txt file, meaning that search engines would no longer list those pages in results or archive them." The robots.txt file is here. And here's the Slashdot discussion. I guess it's hard to restore integrity to the Presidency when people can compare your statements over time.
posted by condour75 (29 comments total)
 
Yikes. That looks bad regardless of the reasoning.
posted by bz at 2:41 PM on October 27, 2003


one would imagine

/cakewalk

got deleted sometime back....
posted by specialk420 at 2:43 PM on October 27, 2003


One wonders if anyone out there is saving a copy of this robots.txt file for future comparison.. hmmm...

Seriously though, this disturbs the shit out of me.. and what disturbs me more is the fact that the first coworker I showed this to said "well, I dunno.. it seems some of that stuff is understandable.."

Attempting to deny the public the right to cache this stuff both 1) for history, and 2) for before/after comparison is at the very least "a bit shady", and more likely "quite shady"...

I cannot think of a single valid reason to block search engines and other "bots" from caching historical copies of public documents on the White House's website, regardless of subject matter.

Throw in that everything pertains to "iraq", and that's just plain scary... I'm no conspiracy theorist, but it doesn't take a radical nutball to notice they've already gone back and edited the content of Bush speeches, etc...
posted by twiggy at 2:46 PM on October 27, 2003
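
A minimal sketch of the archive-and-compare job twiggy is wishing for, in present-day Python using only the standard library (run it daily, e.g. from cron). The URL is the file under discussion; the archive directory and filenames are purely illustrative:

import difflib
import urllib.request
from datetime import date
from pathlib import Path

URL = "http://www.whitehouse.gov/robots.txt"   # the file under discussion
ARCHIVE = Path("robots-archive")               # illustrative location
ARCHIVE.mkdir(exist_ok=True)

# Save today's copy under a dated filename.
snapshot = ARCHIVE / f"robots-{date.today().isoformat()}.txt"
with urllib.request.urlopen(URL) as resp:
    snapshot.write_bytes(resp.read())

# Diff the two most recent snapshots, if we have that many
# (ISO dates in the filenames sort chronologically).
snapshots = sorted(ARCHIVE.glob("robots-*.txt"))
if len(snapshots) >= 2:
    old, new = snapshots[-2], snapshots[-1]
    diff = difflib.unified_diff(
        old.read_text().splitlines(), new.read_text().splitlines(),
        fromfile=old.name, tofile=new.name, lineterm="")
    print("\n".join(diff))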


Interesting. But wait, is Google keeping people from a history of preferences?

http://www.google.com/robots.txt
posted by the fire you left me at 2:54 PM on October 27, 2003


I think somebody screwed up the perl script for generating the robots.txt. Look at the directories; most don't exist at all.

Disallow: /firstlady/photoessay/bookfestival/iraq
Disallow: /firstlady/photoessay/welcometowh/iraq
Disallow: /firstlady/recipes/iraq
Disallow: /history/africanamerican/iraq
Disallow: /history/photoessays/easter/one/iraq
Disallow: /history/valentines/iraq
etc.
posted by monju_bosatsu at 2:55 PM on October 27, 2003
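
monju_bosatsu's claim is easy to test mechanically. A rough sketch, again standard-library Python; HEAD requests keep it cheap, though in practice you would want to throttle it, since the file lists thousands of paths:

import urllib.error
import urllib.request

BASE = "http://www.whitehouse.gov"

# Collect every Disallow path from the robots.txt.
with urllib.request.urlopen(BASE + "/robots.txt") as resp:
    lines = resp.read().decode("utf-8", errors="replace").splitlines()
paths = [l.split(":", 1)[1].strip()
         for l in lines if l.lower().startswith("disallow:")]

# HEAD each path and count the ones that don't exist.
missing = 0
for path in paths:
    req = urllib.request.Request(BASE + path, method="HEAD")
    try:
        urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        if e.code == 404:
            missing += 1
print(f"{missing} of {len(paths)} disallowed paths return 404")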


speaking of iraq: the good. the bad and ugly.
posted by specialk420 at 2:58 PM on October 27, 2003


Then again, there is a history of revisionism within this White House.
posted by the fire you left me at 2:59 PM on October 27, 2003


looks like they disallowed almost the whole site to my uneducated eye...
posted by internal at 3:03 PM on October 27, 2003


Let's all repeat the Slashdot discussion! First you say "This is scary! I hope someone is archiving this!" Then I say "Look, man, /firstlady/photos/2003/01/iraq? /president/holiday/decorations/iraq? They never existed. Some server monkey screwed up a perl -e." Then you ignore that because it's more fun to make Ministry of Truth jokes.
posted by rusty at 3:12 PM on October 27, 2003


It's just "/text" and "/iraq" added to every directory. *Looks* like a goof to me.

They have valid reasons to want to block indexing of "/text" and "/iraq" (as root-level directories).

/text simply mirrors the content of the rest of the site, so there's no need to index it. (Avoiding indexing redundant content is a good thing.)

/iraq simply redirects to /infocus/iraq, which is also linked from elsewhere. If they hadn't goofed, /infocus/iraq would be indexed normally, with no need for any crawlers to go in /iraq.

There is the question of how "/afac/index.htm" got into the blunder. It doesn't seem to exist (.html does), but maybe there's some big conspiracy to keep people from finding out about the money we're giving to Afghan children!

Now, you can argue all you want over whether the "goof" is a cover-up to block indexing of /infocus/iraq (I dare you to find any other directories ending in "/iraq") or not, but it looks like a dumb blunder. So just laugh at the dumb-dumbs or something.
posted by whatnotever at 3:13 PM on October 27, 2003
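
To make whatnotever's theory concrete: if a generator script glued both suffixes onto every directory it knew about, instead of emitting just the two root-level rules, you would get exactly the pattern quoted above. A hypothetical reconstruction (nobody outside the White House knows what the real script looked like):

# Intended, presumably: block only the two root-level directories.
#   Disallow: /text
#   Disallow: /iraq
# The suspected bug: the suffixes get appended to *every* directory.
directories = [
    "/firstlady/recipes",          # stems taken from the file quoted above
    "/history/africanamerican",
    "/president/holiday/decorations",
]
for d in directories:
    for suffix in ("/text", "/iraq"):
        print(f"Disallow: {d}{suffix}")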


Disallow: /firstlady/recipes/iraq

"We know in what areas they eat Lis-san El Qua-Thia. They're in the area around Tikrit and Baghdad and east, west, south and north somewhat."
posted by eddydamascene at 3:13 PM on October 27, 2003


whatnotever: How can you say that! After the well-known and respected journalist Dan Gillmor has urged us all to download those restricted nonexistent directories every day?

"In the blogosphere, my readers can fact-check my ass, cause God knows I'm not going to do it myself."
posted by rusty at 3:20 PM on October 27, 2003



Now, you can argue all you want over whether the "goof" is a cover-up to block indexing of /infocus/iraq (I dare you to find any other directories ending in "/iraq") or not, but it looks like a dumb blunder. So just laugh at the dumb-dumbs or something.


Sure, but for someone to commit a blunder in the robots.txt file with regard to the word "iraq", they must have been trying to do something involving the word "iraq" and the robots.txt file to begin with. So the question is, what were they really trying to do when they messed up?
posted by tuxster at 3:48 PM on October 27, 2003


speaking of iraq: the good. the bad and ugly.
posted by specialk420 at 4:58 PM CST on October 27


specialk420-

What is that? The same person with two blogs, or one original and a faker? I read the first one on a regular basis, but I'd never seen the second until today.
posted by kayjay at 4:13 PM on October 27, 2003


tuxster:

The rest of the post you quoted mentioned reasons why you wouldn't want /text/ and /iraq/ to be spidered.

This is what happens when you let Dubya exercise his MaD 1eet sKillZ at perl, I guess?
posted by ROU_Xenophobe at 4:15 PM on October 27, 2003


i think the links to the israeli-backed memri.org, and this blog (which was linked on the suspicious copy until just a few hours ago), might be a few clues as to who is behind the baghdad burning rip-off.
posted by specialk420 at 4:25 PM on October 27, 2003


I'm always happy to blame GW for, well, just about anything. This may be evidence of the generally secretive approach at GW's WH, but when directories like
http://www.whitehouse.gov/president/holiday/whtree/text/
are banned (that's legit, btw), you know they're just banning everything. Actually, that'd be an interesting project--write a spider that doesn't respect robots.txt, and find out which, if any, directories have been left open.
posted by adamrice at 4:36 PM on October 27, 2003
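
A sketch of the first step of adamrice's project, one page deep (standard-library Python; a real spider would recurse, and note that robots.txt is purely advisory, so nothing stops a client from fetching the "banned" paths anyway):

import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

BASE = "http://www.whitehouse.gov/"

# Load the rules that well-behaved crawlers obey.
rp = urllib.robotparser.RobotFileParser()
rp.set_url(urljoin(BASE, "robots.txt"))
rp.read()

class LinkCollector(HTMLParser):
    """Gather every href on a page."""
    def __init__(self):
        super().__init__()
        self.links = set()
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(BASE, value))

# Fetch the front page and report the on-site links left crawlable.
with urllib.request.urlopen(BASE) as resp:
    collector = LinkCollector()
    collector.feed(resp.read().decode("utf-8", errors="replace"))

for link in sorted(collector.links):
    if urlparse(link).netloc == urlparse(BASE).netloc and rp.can_fetch("*", link):
        print("still open to crawlers:", link)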


Damn, specialk420, those links are fascinating!
posted by adamrice at 5:06 PM on October 27, 2003


total derail. the rest of the discussion is over at atrios. the bush administration's efforts to control information in this case - and dilute information in the baghdad burning case (we suspect) - are ... well... i think people can make up their own minds.
posted by specialk420 at 5:11 PM on October 27, 2003


(further derail) this cat seems to be on the hunt.
posted by specialk420 at 5:16 PM on October 27, 2003


WebReaper will ignore robots.txt, I believe. If not, htweb might. One of them does. And wget might be configurable in that regard.
posted by mecran01 at 6:07 PM on October 27, 2003


Something interesting about this strategy, if it deserves that label. Go to Google and search for whitehouse.gov + Iraq, and you get two White House entries plus all the sites they would probably prefer you not see: StumbleUpon, Whitehouse.org, the Evil Eradication Office, and a lot of weblogs (such as this one).

So why is this such a good idea for the White House? Don't they know that someone will notice, and publicize it?
posted by palancik at 6:31 PM on October 27, 2003


adamrice, I wasn't aware that "/text" showed up everywhere. But that just makes even more sense, for the most part. Throw "/text" on the end of any directory and you get the plain-text version (for accessibility, generally). That's a fine way to do things. But there's no need for anyone to index it, because it has the exact same content as the non-/text page.

Take any "/text" url, remove "/text", and you should get the same page, but with pretty pictures. People can feel free to check all of them every day for themselves, but it seems a bit wasteful.

So blocking [everything]/text makes perfect sense. How [everything]/iraq got in there...? I'll still chalk it up to stupidity.

Really, does anyone remember "insert something meaningful here"?
posted by whatnotever at 6:37 PM on October 27, 2003


palancik: Your reported Google results aren't due to the robots.txt file - you're only getting two whitehouse.gov results because Google shows at most two results per domain and hides the rest. Searching whitehouse.gov for "iraq" gives a ton of hits (at least for now).
posted by UKnowForKids at 7:01 PM on October 27, 2003


Let's all repeat the Slashdot discussion!

Don't you have a site to maintain somewhere?
posted by inpHilltr8r at 7:26 PM on October 27, 2003


Yeah, this same discussion has been had at Atrios and other places all over the Net already. BUT it was worth posting just in order to call the link "Iraqfilter." How many chances are we gonna have to make that pun? Kudos, condour75.
posted by soyjoy at 9:09 PM on October 27, 2003


inpHilltr8r: I read MetaFilter to forget. :-)
posted by rusty at 11:38 PM on October 27, 2003


Wget's been mentioned above, but it might be a worthwhile endeavour to configure something like it to download exactly those pages. Might be interesting to read around, say, the next election.
posted by bwerdmuller at 2:35 AM on October 28, 2003


HTTrack can be configured to ignore robots.txt files, and it's got a nice clicky interface for those on Windows who can't handle a command line. You could even tell it to only get things mentioned in robots.txt (or any text file of links, for that matter).
posted by Mitheral at 7:48 AM on October 28, 2003
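
In the same spirit, a few lines of standard-library Python will turn the robots.txt itself into the kind of link list bwerdmuller and Mitheral are after, ready to feed a downloader:

import urllib.request

BASE = "http://www.whitehouse.gov"

# Emit one absolute URL per Disallow line, suitable for a
# downloader's link-list input.
with urllib.request.urlopen(BASE + "/robots.txt") as resp:
    for line in resp.read().decode("utf-8", errors="replace").splitlines():
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                print(BASE + path)

Save the output to a file and hand it to wget with -i FILE; and for what it's worth, wget's robots=off switch (wget -e robots=off) really does exist.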

