From the googlebot FAQ:
June 5, 2001 3:50 AM

From the googlebot FAQ: "For most sites, Googlebot should not access your site more than once every few seconds on average"

I thought it was a mistake at first, but they go on to say that you should contact them if "we are placing too high a load on your site". Do they really hit some sites that hard? If so, is it really necessary?
posted by Nothing (17 comments total)
 
Yes - absolutely - I once worked at a database-driven site - I was assembling the search-engine strategy, ironically enough - running Vignette and Oracle, and the site kept collapsing completely after launch. It was determined that this was in fact caused by the Googlebot, and the site eventually had to stop all search engine robots spidering it. Disaster for the huge company concerned. VERY bad.
posted by barbelith at 4:06 AM on June 5, 2001


What upsets me more than anything is the handful of companies that have set the user-agent of their spider to match a popular browser, e.g. Mozilla/4.0 (compatible; MSIE 5.0; Win32)
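
One rough way to catch them in the act: tally hits per IP address and per user-agent from your access log, and be suspicious of any "browser" pulling down page after page from a single address at machine speed. A minimal sketch in Python, assuming Apache's combined log format and a hypothetical access_log path:

    import re
    from collections import Counter

    # Apache "combined" format puts the user-agent in the last quoted field:
    # 1.2.3.4 - - [05/Jun/2001:03:50:00 +0000] "GET / HTTP/1.0" 200 1234 "ref" "agent"
    LINE = re.compile(r'^(\S+) .* "([^"]*)"$')

    ip_hits = Counter()
    agent_hits = Counter()

    with open("access_log") as log:   # hypothetical path; point it at your own log
        for line in log:
            match = LINE.match(line.rstrip())
            if not match:
                continue
            ip, agent = match.groups()
            ip_hits[ip] += 1
            agent_hits[agent] += 1

    # A "browser" fetching thousands of pages from one address is almost
    # certainly a spider, whatever its user-agent string claims to be.
    for ip, count in ip_hits.most_common(10):
        print(count, ip)
    for agent, count in agent_hits.most_common(10):
        print(count, agent)

A reverse DNS lookup on the worst offenders usually settles who they really are.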
posted by chrish at 5:13 AM on June 5, 2001


Chrish: please explain.

It's like the Borg. Cool.
posted by ParisParamus at 5:22 AM on June 5, 2001


What about filtering googlebot's IP block at the router?
posted by greensweater at 5:42 AM on June 5, 2001


I had a similar problem with Altavista. Decided to make a script- and database-intensive site of mine all funky, you know, the-url-is-the-command-line and all that nonsense. So: switched from ?key=value to a /key/value scheme. Forgot to put a robots.txt file in.

Upshot: Fat Perl/database script got hit upwards of 7000 times in 11 hours by the Altavista spider, killed the entire web server, and I got booted off my hosts for being antisocial a couple of weeks later. Horrible, horrible episode.

Had its upside in the end though.

One alarming development is that bots are starting to spider obviously database-backed sites, so people are going to have to learn more about how to exclude them.
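
The standard way to do that, at least for robots that behave, is a couple of Disallow lines in robots.txt - a minimal sketch, with hypothetical /cgi-bin/ and /db/ paths standing in for wherever your dynamic scripts actually live:

    # robots.txt - keep well-behaved spiders out of the dynamic parts of the site
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /db/

    # or shut one particular spider out entirely (AltaVista's spider calls itself Scooter)
    User-agent: Scooter
    Disallow: /

Of course that only helps with robots that honor the exclusion standard; the ones spoofing browser user-agents (see chrish above) will sail right past it.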
posted by mattw at 6:10 AM on June 5, 2001


Background reading available at robotstxt.org.
posted by jessamyn at 6:20 AM on June 5, 2001


In answer to the original poster's question: they only hit a site every few seconds if it has thousands or millions of pages; they don't mean that they're going to hit the same page every few seconds.

You can exclude them, or you can just plan ahead and have enough server power to handle it. Google, like most spiders, only checks a given page every few days at most, so all you need is enough server power to cope if some actual human decided to read every page on your site over a few days.

I have a server that Google hits a lot (it indexed 137,000 pages in the last 4 days) and it's not a problem at all; Google has always had very nicely behaved spiders. Anyone who has to block them needs to look at their site setup again, because they're probably losing a ton of search engine hits needlessly.
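
For perspective, that's a back-of-the-envelope 34,000-odd pages a day, or roughly 0.4 requests per second on average (137,000 over 4 x 86,400 seconds) - background noise for any reasonably configured server.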
posted by beefula at 7:25 AM on June 5, 2001


It's not just Google's spiders that cause the problem. Multiple spiders running through my former employer's frames-based website brought down the web server every night until I added some stuff to our robots.txt file and we went to a non-frames-based design. Each page had like 5 frames(!) in it, thanks in large part to a pretty web designer they had there. All those frames add up: every "page" is really a frameset document plus five more HTTP requests, and a spider dutifully fetches them all.
posted by PWA_BadBoy at 7:43 AM on June 5, 2001


Keynote Systems pounded the presidential campaign sites from January until about March. Then they called the campaigns and offered their services: "We can improve your performance!"

They ended up posting the results of their "survey" in the Spring, and then did the process over again in the Fall. I don't think any of the campaigns actually utilized their services; we were all pretty pissed about becoming their unwilling guinea pigs.
posted by jennak at 8:47 AM on June 5, 2001


There are far, far worse, potentially system-crashing spider-and-bot-type things out there.
posted by Dreama at 11:23 AM on June 5, 2001


For one real-world example, read up on why Userland-hosted Manila sites aren't on Google (more detail). Short answer: Google's bot was forcing Frontier/Manila to generate pages every few seconds, and they couldn't handle the load very well. Critics have noted that one problem is that Frontier generates all sorts of files other than HTML (e.g. WAP versions) that Google doesn't need to index, and they weren't doing a good job of excluding the bot from that extraneous surfing. Winer's said that he and Google have had technical discussions and will let the bot return, but for the time being those of us with sites at e.g. EditThisPage.com are getting no, nada, zero, zilch referrals from Google. For all intents and purposes, if you search for us on Google, we do not exist.
posted by dhartung at 12:42 PM on June 5, 2001


in userland's case it wasn't google, but inktomi. however, they had blocked (more or less accidentally) all bots and crawlers.
posted by arf at 1:37 PM on June 5, 2001


erm, google does need to index wap sites. check out wap.google.com - very cool (on your Psion Revo and IR mobile of course, not a wap phone, waste of money!). It also translates html sites to wap, with often interesting results (let's say i'll be rearranging my divs :-] )
posted by nedrichards at 2:14 PM on June 5, 2001


One problem with Google and Userland sites is that the ETP sites are hosted off a single server. So even if Google is only hitting each site once every few seconds, it's hitting a number of sites at once, and down comes the server.

Another problem with ETP sites is the sheer number of duplicate links. I once tried indexing my site with Teleport Pro and had to stop it after a few hours and nearly 50MB of duplicated pages (all different views of the same material).

A carefully designed web app takes spidering into account: it only allows robots to hit static pages (giving them enough static content, hopefully reflecting the site's contents, to keep the spider happy) and steers them away from the dynamic database stuff. Some go so far as to dynamically create custom pages for the robot, but that often shades into spamming the search engine. As the Google FAQ points out, they even offer customization on a per-link basis; with a little care you can isolate parts of your site from legitimate robots that follow the standard.
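
The simplest per-page version of that, again only for robots that honor it, is the standard robots meta tag in the <head> of any page you want fenced off - a minimal sketch:

    <!-- keep this page out of the index and stop the spider following its links -->
    <meta name="robots" content="noindex,nofollow">

    <!-- or let the page be indexed but keep the spider from wandering further -->
    <meta name="robots" content="index,nofollow">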
posted by mutagen at 10:01 AM on June 6, 2001


Dan, it had nothing to do with the other formats we generate. The way Manila works is that all the WAP versions and other XMLizations are dynamic pages generated on demand. As arf says, it wasn't googlebot that caused us all the grief; it was the combination of all the crawlers, some of them very buggy. One was requesting the same page in an infinite loop. Bugs in their software give headaches to dynamic servers.
posted by davewiner at 5:42 PM on June 6, 2001


Gee, crawlers have been around for, oh, what, 7 or 8 years, and only now are people deciding they're a "problem"? I think the real problem is that site designers or architects simply forget that crawlers can make up a significant minority of their site's traffic (5-15%), and in return the site gets indexed in the search engines (which are the first place most people turn to find information online).

If your site or architecture can't handle a technology and behavior that's been around that long, perhaps it's time to look at the real problem and find a robust and stable technology that can. There are hundreds of thousands of dynamically-generated, database-backed sites online today, and you don't hear them complaining that being indexed by search engines is a "problem." Perhaps that's because they're using software whose authors understood this was an issue and built their programs with it in mind.
posted by yarf at 3:10 AM on June 7, 2001


There are hundreds of thousands of dynamically-generated, database-backed sites online today, and you don't hear them complaining that being indexed by search engines is a "problem."

You don't?
posted by rodii at 7:59 AM on June 7, 2001



