<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: Comments on 8055</title>
	<link>http://www.metafilter.com/8055//</link>
	<description>Comments on MetaFilter post Comments on 8055</description>
	<pubDate>Tue, 05 Jun 2001 04:06:30 -0800</pubDate>
	<lastBuildDate>Tue, 05 Jun 2001 04:06:30 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Post number 8055</title>
		<link>http://www.metafilter.com/8055/</link>	
		<description>&lt;a href="http://www.google.com/bot.html"&gt;From the googlebot FAQ:&lt;/a&gt; &quot;For most sites, Googlebot should not access your site more than once every few seconds on average&quot; &lt;br&gt;&lt;br&gt;I thought it was a mistake at first, but they go on to say that you should contact them if &quot;we are placing too high a load on your site&quot;

Do they really hit some sites that hard? If so, is it really necessary? </description>
		<guid isPermaLink="false">post:www.metafilter.com,2001:site.8055</guid>
		<pubDate>Tue, 05 Jun 2001 03:50:15 -0800</pubDate>
		<dc:creator>Nothing</dc:creator>		<category>google</category>		<category>bots</category>		<category>search</category>		<category>searchengines</category>
	</item>	<item>
		<title>By: barbelith</title>
		<link>http://www.metafilter.com/8055/#89227</link>	
		<description>Yes, absolutely. I once worked at a database-driven site running Vignette and Oracle (ironically enough, I was assembling the search-engine strategy), and the site kept collapsing completely after launch. It was determined that this was in fact caused by a Google bot, and the site eventually had to stop all search engine robots from spidering it. A disaster for the huge company concerned. VERY bad.</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-89227</guid>
		<pubDate>Tue, 05 Jun 2001 04:06:30 -0800</pubDate>
		<dc:creator>barbelith</dc:creator>
	</item>	<item>
		<title>By: chrish</title>
		<link>http://www.metafilter.com/8055/#89235</link>	
		<description>What upsets me more than anything is the handful of companies that have set the user-agent of their spider to match a popular browser, e.g. Mozilla/4.0 (compatible; MSIE 5.0; Win32)</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-89235</guid>
		<pubDate>Tue, 05 Jun 2001 05:13:10 -0800</pubDate>
		<dc:creator>chrish</dc:creator>
	</item>	<item>
		<title>By: ParisParamus</title>
		<link>http://www.metafilter.com/8055/#89236</link>	
		<description>Chrish:  please explain.

It&apos;s like the Borg.  Cool.</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-89236</guid>
		<pubDate>Tue, 05 Jun 2001 05:22:09 -0800</pubDate>
		<dc:creator>ParisParamus</dc:creator>
	</item>	<item>
		<title>By: greensweater</title>
		<link>http://www.metafilter.com/8055/#89239</link>	
		<description>What about filtering googlebot&apos;s IP block at the router?</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-89239</guid>
		<pubDate>Tue, 05 Jun 2001 05:42:24 -0800</pubDate>
		<dc:creator>greensweater</dc:creator>
	</item>	<item>
		<title>By: mattw</title>
		<link>http://www.metafilter.com/8055/#89243</link>	
		<description>I had a similar problem with Altavista. Decided to make a script- and database-intensive site of mine all funky, you know, the-url-is-the-command-line and all that nonsense. So: switched from ?key=value to a /key/value scheme. Forgot to put a robots.txt file in.

Upshot: Fat Perl/database script got hit upwards of 7000 times in 11 hours by the Altavista spider, killed the entire web server, and I got booted off my hosts for being antisocial a couple of weeks later. Horrible horrible episode.

Had its upside in the end though.

One alarming development is that bots are starting to spider obviously database-backed sites, so people are going to have to learn more about how to exclude them.</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-89243</guid>
		<pubDate>Tue, 05 Jun 2001 06:10:10 -0800</pubDate>
		<dc:creator>mattw</dc:creator>
	</item>	<item>
		<title>By: jessamyn</title>
		<link>http://www.metafilter.com/8055/#89246</link>	
		<description>Background reading available at &lt;a href=&quot;http://www.robotstxt.org&quot;&gt;robotstxt.org&lt;/a&gt;.</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-89246</guid>
		<pubDate>Tue, 05 Jun 2001 06:20:57 -0800</pubDate>
		<dc:creator>jessamyn</dc:creator>
	</item>	<item>
		<title>By: beefula</title>
		<link>http://www.metafilter.com/8055/#89269</link>	
		<description>In answer to the original poster&apos;s question: They hit a site every few seconds only if it has thousands or millions of pages; they don&apos;t mean that they&apos;re going to hit the same page every few seconds.

You can exclude them, or you can just plan ahead and have enough server power to handle it. Google, like most spiders, only checks a given page every few days at most, so all you need is enough server power to handle it if some actual human decided to read every page on your site for a few days.

I have a server that Google hits a lot (it indexed 137,000 pages in the last 4 days) and it&apos;s not a problem at all; Google has always had very nicely behaved spiders. Anyone who has to block them needs to look at their site setup again, because they&apos;re probably losing a ton of search engine hits needlessly.</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-89269</guid>
		<pubDate>Tue, 05 Jun 2001 07:25:02 -0800</pubDate>
		<dc:creator>beefula</dc:creator>
	</item>	<item>
		<title>By: PWA_BadBoy</title>
		<link>http://www.metafilter.com/8055/#89274</link>	
		<description>It&apos;s not just Google&apos;s spiders that cause the problem. Multiple spiders running through my former employer&apos;s frames-based website brought down the web server every night until I added some stuff to our robots.txt file and we went to a non-frames-based design. Each page had like 5 frames(!) in it, thanks in large part to a pretty &lt;a href=&quot;http://www.marcuccistudios.com&quot;&gt;web designer&lt;/a&gt; they had there. All those frames can make for some pretty big http requests.</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-89274</guid>
		<pubDate>Tue, 05 Jun 2001 07:43:28 -0800</pubDate>
		<dc:creator>PWA_BadBoy</dc:creator>
	</item>	<item>
		<title>By: jennak</title>
		<link>http://www.metafilter.com/8055/#89323</link>	
		<description>&lt;a href=&quot;http://www.keynote.com&quot;&gt;Keynote Systems&lt;/a&gt; &lt;i&gt;pounded&lt;/i&gt; the presidential campaign sites from January til about March.  Then they called the campaigns and offered their services:  &quot;We can improve your performance!&quot;

They ended up posting the results of their &quot;survey&quot; in the Spring, and then did the process over again in the Fall.  I don&apos;t think any of the campaigns actually utilized their services; we were all pretty pissed about becoming their unwilling guinea pigs.</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-89323</guid>
		<pubDate>Tue, 05 Jun 2001 08:47:36 -0800</pubDate>
		<dc:creator>jennak</dc:creator>
	</item>	<item>
		<title>By: Dreama</title>
		<link>http://www.metafilter.com/8055/#89378</link>	
		<description>There are far, far worse &lt;a href=&quot;http://www.c-4-u.com/&quot;&gt;potentially system-crashing spider-and-bot-type things&lt;/a&gt; out there.</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-89378</guid>
		<pubDate>Tue, 05 Jun 2001 11:23:33 -0800</pubDate>
		<dc:creator>Dreama</dc:creator>
	</item>	<item>
		<title>By: dhartung</title>
		<link>http://www.metafilter.com/8055/#89429</link>	
		<description>For one real-world example, read up on &lt;a href=&quot;http://wmf.editthispage.com/discuss/msgReader$4082?mode=day&quot;&gt;why Userland-hosted Manila Sites aren&apos;t on Google&lt;/a&gt; (more &lt;a href=&quot;http://www.truerwords.net/index/2001/05/09#TW658&quot;&gt;detail&lt;/a&gt;). Short answer: Google&apos;s bot was forcing Frontier/Manila to generate pages every few seconds, and they couldn&apos;t handle the load very well. Critics have noted that one problem is that Frontier generates all sorts of other kinds of files than HTML (e.g. WAP) that Google doesn&apos;t need to index, and they weren&apos;t doing a good job of excluding the bot from that extraneous surfing. Winer&apos;s said that he and Google have had technical discussions and will let the bot return, but for the time being those of us with sites at e.g. EditThisPage.com are getting no, nada, zero, zilch referrals from Google. For all intents and purposes, if you search for us on Google, we do not exist.</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-89429</guid>
		<pubDate>Tue, 05 Jun 2001 12:42:07 -0800</pubDate>
		<dc:creator>dhartung</dc:creator>
	</item>	<item>
		<title>By: arf</title>
		<link>http://www.metafilter.com/8055/#89480</link>	
		<description>in userland&apos;s case it wasn&apos;t google, but inktomi. however, they had blocked (more or less accidentally) all bots and crawlers.</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-89480</guid>
		<pubDate>Tue, 05 Jun 2001 13:37:11 -0800</pubDate>
		<dc:creator>arf</dc:creator>
	</item>	<item>
		<title>By: nedrichards</title>
		<link>http://www.metafilter.com/8055/#89509</link>	
		<description>erm, google does need to index wap sites. check out wap.google.com, very cool (on your Psion Revo and IR mobile of course, not a wap phone, waste of money!). It does also translate html sites to wap, with often interesting results (let&apos;s say i&apos;ll be rearranging my divs :-] )</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-89509</guid>
		<pubDate>Tue, 05 Jun 2001 14:14:11 -0800</pubDate>
		<dc:creator>nedrichards</dc:creator>
	</item>	<item>
		<title>By: mutagen</title>
		<link>http://www.metafilter.com/8055/#89847</link>	
		<description>One problem with Google and Userland sites is that the ETP sites are hosted off a single server. So if google is only hitting once every few seconds but is hitting a number of sites at once, down comes the server.

Another problem with ETP sites is the sheer number of duplicate links. I once tried indexing my site with Teleport Pro and had to stop it after a few hours and nearly 50MB of duplicated pages (all different views of the same material).

A carefully designed web app takes spidering into account, only allowing robots to hit static pages (and giving them enough static content that hopefully reflects the site contents to keep the spider happy) and steering them away from the dynamic database stuff. Some go so far as to dynamically create custom pages for the robot, but this often leads to spamming the search engine. As the Google FAQ points out, they even offer customization on a per-link basis; with a little care you can isolate parts of your site from legitimate robots that follow the standard.</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-89847</guid>
		<pubDate>Wed, 06 Jun 2001 10:01:51 -0800</pubDate>
		<dc:creator>mutagen</dc:creator>
	</item>	<item>
		<title>By: davewiner</title>
		<link>http://www.metafilter.com/8055/#90196</link>	
		<description>Dan, it had nothing to do with the other formats we generate. The way Manila works is that all the WAPs and other XMLizations are dynamic pages generated on-demand. As arf says, it wasn&apos;t googlebot that caused us all the grief; it was a combination of all the crawlers, some of them very buggy. One was requesting the same page in an infinite loop. Bugs in their software give headaches to dynamic servers.</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-90196</guid>
		<pubDate>Wed, 06 Jun 2001 17:42:22 -0800</pubDate>
		<dc:creator>davewiner</dc:creator>
	</item>	<item>
		<title>By: yarf</title>
		<link>http://www.metafilter.com/8055/#90299</link>	
		<description>Gee, crawlers have been around oh, for what, 7 or 8 years, and only now are people deciding they are a &quot;problem?&quot; I think the real problem is that site designers or architects simply forget that such crawlers can make up a significant minority of their site&apos;s traffic (5-15%), and in return, the site gets indexed in the search engine (which is the place most people turn to first to find information online).

If your site or architecture can&apos;t handle a technology and behavior that&apos;s been around so long, perhaps it&apos;s time to look at the real problem and find a robust and stable technology that can. There are hundreds of thousands of dynamically-generated, database-backed sites online today, and you don&apos;t hear them complaining that being indexed by search engines is a &quot;problem.&quot; Perhaps that&apos;s because they&apos;re using software whose authors understood this was an issue and built it with that in mind.</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-90299</guid>
		<pubDate>Thu, 07 Jun 2001 03:10:19 -0800</pubDate>
		<dc:creator>yarf</dc:creator>
	</item>	<item>
		<title>By: rodii</title>
		<link>http://www.metafilter.com/8055/#90382</link>	
		<description>&lt;i&gt;There are hundreds of thousands of dynamically-generated, database-backed sites online today, and you don&apos;t hear them complaining that being indexed by search engines is a &quot;problem.&quot;&lt;/i&gt;

&lt;a href=&quot;http://www.metafilter.com/comments.mefi/8055#89243&quot;&gt;You don&apos;t?&lt;/a&gt;</description>
		<guid isPermaLink="false">comment:www.metafilter.com,2001:site.8055-90382</guid>
		<pubDate>Thu, 07 Jun 2001 07:59:15 -0800</pubDate>
		<dc:creator>rodii</dc:creator>
	</item>
	</channel>
</rss>
