googlebot IS evil!
March 28, 2006 8:24 PM

Googlebot deletes inept company's site content.
Lessons to be learned, kids, lessons to be learned.
posted by Hackworth (36 comments total)
 
I for one welc . . . . oh never mind.
posted by fourcheesemac at 8:40 PM on March 28, 2006


This isn't the Googlebot's fault. It's the fault of the developers of the website / CMS.
posted by punkrockrat at 8:41 PM on March 28, 2006


That's the whole point, punkrockrat -- and the reason the story appeared on The Daily WTF. It's a fun site for the occasional forehead-slapping 'How Did Anyone Think This Was A Good Idea' code snippet.

This one, though, takes the cake...
posted by verb at 8:46 PM on March 28, 2006


Someday Matt will forget to remove some debugging code and the Googlebot will make its way through Metafilter, marking all of AskMe as a best answer. Or flagging all of MeFi as noise. Or both.

At that point we will be enlightened.
posted by IshmaelGraves at 8:46 PM on March 28, 2006


It's not Googlebot's fault, but the fact remains... Googlebot is evil.
posted by _aa_ at 8:49 PM on March 28, 2006




Dammit.
posted by IshmaelGraves at 8:52 PM on March 28, 2006


That's learning the hard way.
posted by raedyn at 9:03 PM on March 28, 2006


Robot, I said nofollow! Get away! No! NOFOLLOOOAAAAGGGHHHHH

(sounds of tearing and strained servomotors)
posted by jenovus at 9:09 PM on March 28, 2006


I especially like how this link text is only vaguely different than Digg's "Googlebot destroys incompetent company's website." ;)
posted by theonetruebix at 9:12 PM on March 28, 2006


Googlebot will one day index your soul.
posted by bubblesonx at 9:15 PM on March 28, 2006


P.S. I'll find my website

Who took my website
who found my website
posted by arialblack at 10:01 PM on March 28, 2006


Were Heinlein alive, he'd update that short story of his and write "The Indexer must Index". On the other hand, Vernor Vinge is still here to write "The Singular Day Googlebot came alive".
posted by nkyad at 10:06 PM on March 28, 2006


Repeat it with me folks:

Never make links an action
Never make links an action
Never make links an action
posted by maschnitz at 10:35 PM on March 28, 2006


Never make links an action
posted by cillit bang at 11:24 PM on March 28, 2006


Thanks for posting this -- I have a pet project (a web-UI for a PVR) that uses links as actions in various places... The dailyWTF thread has just exposed a nasty failure mode whereby if a web accelerator is used, these links will all get 'clicked'. Oops! (it's safe from bots due to the authentication required!)

Out of curiosity, where is the 'never make links an action' specified?
posted by nielm at 12:20 AM on March 29, 2006


In the HTTP spec, I think.

Most Web accelerators won't prefetch URLs that include query strings. Just sticking a ? at the end of your links should avoid the prefetch behavior without having to change much else about the site.
posted by kindall at 12:36 AM on March 29, 2006
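
Purely to illustrate kindall's stopgap (the "deleteComment" link below is made up): the trailing "?" gives the URL a query string, which most prefetchers reportedly skip, though the underlying GET-as-action design is unchanged.

<!-- Hypothetical action link with no query string: a prefetcher may "click" it -->
<a href="deleteComment/7">Delete comment</a>

<!-- Same link with a bare "?" appended, so accelerators that skip query-string
     URLs will leave it alone -- a dodge, not a fix -->
<a href="deleteComment/7?">Delete comment</a>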


Out of curiosity, where is the 'never make links an action' specified?

The HTTP spec:
In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered "safe". [...] The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them.
All actions should use POST.
posted by cillit bang at 1:14 AM on March 29, 2006
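
As a minimal sketch of that rule (the "deletePage" URL and page id are invented for illustration): the first snippet performs the action on a plain GET, which is exactly the request a crawler sends, while the second only acts on an explicit POST.

<!-- Anti-pattern: a GET link that changes server state. Any crawler,
     prefetcher, or offline reader that follows it triggers the delete. -->
<a href="deletePage?id=42">Delete</a>

<!-- Safer sketch: the same action behind a POST form, which well-behaved
     robots and prefetchers will not submit on their own. -->
<form method="post" action="deletePage">
  <input type="hidden" name="id" value="42">
  <input type="submit" value="Delete">
</form>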


Thanks kindall -- yes, all the 'actions' are actually queries, and looking at the Google and Mozilla WA docs, neither prefetches URLs with query strings... so I feel a little better.

cillit bang: thanks for the (repaired) link: The actions performed are not 'side effects' as such, they are 'actions' that the user did request, e.g. to schedule the recording of a TV show:
<a href="command?command=record&airingID=12345">Record</a>

It looks like it's time to change my code... Hmm. Time to learn how to style submit buttons :)
posted by nielm at 1:45 AM on March 29, 2006
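
A sketch of what that conversion could look like, reusing nielm's own hypothetical record URL: the GET link becomes a small POST form, and the submit button can be styled to resemble a link if need be.

<!-- Before: nielm's GET link, which anything that follows links can trigger -->
<a href="command?command=record&airingID=12345">Record</a>

<!-- After: the same action as a POST form; only an explicit submit fires it -->
<form method="post" action="command">
  <input type="hidden" name="command" value="record">
  <input type="hidden" name="airingID" value="12345">
  <!-- Inline style only to show a submit button can be dressed up as a link -->
  <input type="submit" value="Record"
         style="border: none; background: none; padding: 0;
                color: blue; text-decoration: underline; cursor: pointer;">
</form>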


It's also worth noting the website presumed authentication if it could not find a cookie saying that the user is not logged in. Genius.
posted by MetaMonkey at 1:49 AM on March 29, 2006


nielm: Read the first sentence I quoted again:
the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval
The "action" referred to in the second sentence is retrieval. Anything else is a side-effect.
posted by cillit bang at 1:52 AM on March 29, 2006


cillit bang - and according to RFC 2119 (as referenced in section 1.2 of RFC 2616 "Hypertext Transfer Protocol -- HTTP/1.1")
4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.
plus it is specifically described as a "convention"... so it doesn't look that much like a requirement to me...

of course it's a good idea for GET to not have server-side side effects, and the writers of this CMS are idiots, but I still don't think the folks claiming this isn't HTTP compliant have a legitimate point...
posted by russm at 3:40 AM on March 29, 2006


I don't develop web applications anymore. Quit that hazy bullshit for a PhD in Ecology, and the great outdoors.

But this has left me wondering whether, at any point in the past, I had actually done this in any application. And I'm pretty sure I did.

The thing is, any case where I made a GET request link delete or edit something, that code was locked away behind an administrator password. Googlebot wouldn't have been able to access it anyway.

Which leaves me wondering - when would you ever leave functions on a website enabling anonymous, non-authenticated users (like, say, a search engine bot) to mess with your back-end database? Except a wiki, perhaps. That seems even stupider at a more fundamental level than using GET requests like this. Code that messes with your back end shouldn't even be visible to bots, right? Or are my long out of practice web developer ganglia missing something vital?
posted by Jimbob at 3:41 AM on March 29, 2006


Jimbob : "Or are my long out of practice web developer ganglia missing something vital?"

You're not missing a thing: notice the word "inept" in the FPP here and the site it points to - DailyWTF is a repository for funny/amazing programming mistakes (and also plain stupid code).
posted by nkyad at 5:02 AM on March 29, 2006


so it doesn't look that much like a requirement to me...

I'm more than familiar with RFC language - I'm one of the authors of this one.

Firstly, statements containing RFC 2119 language are authoritative, so the SHOULD NOT overrides the talk of it being a "convention". Secondly, SHOULD NOTs are only allowed to be broken when the implementor can live with the problems doing so might cause. The site in the original link is only HTTP compliant if its creators wanted the Googlebot to delete all their stuff.

Jimbob: any case where I made a GET request link delete or edit something, that code was locked away behind an administrator password.

That'll protect you from the Googlebot, but software that knows your admin password is also allowed to make GET requests willy nilly. Web accelerators are the prime example of this.
posted by cillit bang at 5:14 AM on March 29, 2006


Yay homebrew!
posted by furtive at 5:16 AM on March 29, 2006


Never make links an action.
Never make links an action.
Never make links an action.
tar -t before tar -x.

Wait. That last one is another good rule. Sorry.
posted by eriko at 5:46 AM on March 29, 2006


Secondly, SHOULD NOTs are only allowed to be broken when the implementor can live with the problems doing so might cause.

well it sounds like they've learnt to live with it in this particular case...
posted by russm at 6:31 AM on March 29, 2006


It also doesn't pay attention to Javascript, which would normally prompt and redirect users who are not logged on.

Who the hell does their authentication through JavaScript?
posted by Afroblanco at 6:58 AM on March 29, 2006
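
For anyone unfamiliar with the pattern being mocked, here is a guess at what JavaScript-only "authentication" tends to look like (not the CMS's actual code; the cookie name is invented). A crawler never executes the script, so the page and all its action links get served regardless.

<!-- Sketch of the anti-pattern: the page is sent to everyone, and a script
     is trusted to shoo away visitors who aren't logged in -->
<script type="text/javascript">
  // Anything that doesn't run JavaScript (Googlebot, wget, a user with
  // scripts disabled) skips this check entirely.
  if (document.cookie.indexOf("loggedIn=true") == -1) {
    alert("Please log in first.");
    window.location = "login.html";
  }
</script>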


tar -t before tar -x

Dear God yes. I bet I spend a full 1% of my work time cleaning up after tarballs that don't put their entire contents into a subdir.
posted by Plutor at 7:10 AM on March 29, 2006


What does the "Googlebot" look like?

I'm pretty sure as I type this, one of you is securing a domain name and creating a site where we can submit drawings of the Googlebot.

Can't wait to see it!
posted by usedwigs at 8:14 AM on March 29, 2006


nielm writes "Thanks kindall -- yes, all the 'actions' are actually queries, and looking at the Google and Mozilla WA docs, neither prefetches URLs with query strings... so I feel a little better."

Keep in mind that the Googlebot et al. aren't the only things out there. Stuff like wget and httrack may not follow that convention. Plus who knows what some bright bunny might whip up on their own.
posted by Mitheral at 8:16 AM on March 29, 2006


re: GET vs POST. Learn something new every day, I guess. Though I concur that it is idiotic to rely on JS to do authentication and to have the default/failure case security condition evaluate as 'authenticated'. I too will now spend the day converting a bit of one of my projects to use buttons instead of hyperlinks.
posted by Fezboy! at 9:28 AM on March 29, 2006


Stuff like wget and httrack may not follow that convention.

I use wwwoffle to download web pages to my Zaurus PDA for offline browsing. It has a nifty "-r" option to follow links recursively. It'd cause the same problem.
posted by russilwvong at 11:00 AM on March 29, 2006


IshmaelGraves writes "Someday Matt will forget to remove some debugging code and the Googlebot will make its way through Metafilter, marking all of AskMe as a best answer."

It happened.
posted by OmieWise at 12:34 PM on March 29, 2006


In addition to "never make links an action", may I also suggest "never rely on client-side mechanisms for authentication". Also, "authentication is active, not passive".

Seriously, who trains these people? Using flag==false to deny, and otherwise allowing -- I need to slap someone. Authentication is ACTIVE, not passive!
posted by davejay at 4:36 PM on March 29, 2006

