

The ODP bans its successful users.
January 17, 2003 6:08 PM

The Open Directory Project bans TNL.net. Tristan Louis's web site can no longer be used to access the Open Directory. Why? Apparently they can't handle the traffic, so they banned links coming from his pages in the early afternoon.
posted by clevershark (25 comments total)

 
Tristan Louis sounds like a whiny prat and this write up is really misleading. The Open Directory project provides a service, it costs them money to do this. Tristan builds something that makes use of this service but the manner which he uses to do it consumes too much bandwidth. The people who work on the directory can't do their work and other visitors are hampered. The problem is pointed out to him as well as a solution but he's too inept to make use of it, so he leaves his spiders running (not even dialing them back a bit) and then whines when they do the right thing and disable access from his ip address.
posted by substrate at 6:23 PM on January 17, 2003


I agree. I think it's pretty lame that he finds loading RDF into a database so difficult. He complains that blocking him was a violation of "putting users first" or something like that, but if his crawler was bogging down the infrastructure, they were only doing what they had to for the sake of the core service. Do a little optimization work fer fuck's sake, Tristan. It's the better half of development.

"Just did some quick research and there doesn't seem to be anything good out there ..."

Try PERL
posted by scarabic at 6:28 PM on January 17, 2003


It took me a few passes before I understood what he was getting at. My opinion was exactly the same as substrate's after my second pass, and I was in the process of saying that (a bit more diplomatically :) when I had another read through.

The problem he's trying to solve is to be more up-to-date than the latest weekly dump. They don't provide a good way to do that, so he had to resort to writing a crawler. He's suggesting some good ways that they could solve that problem.

The whole business about asking them for code to update his database or get an SQL dump is a bit weird. How does he expect them to know the structure of his database? If he can't parse some XML I'm not surprised that his spider is hogging their resources, and dmoz is perfectly reasonable in requesting that he stop. Whining about that is just silly.
posted by Emanuel at 6:35 PM on January 17, 2003


Try PERL

That was my first thought. If there's nothing out there, just make something. DMOZ provides an API and all the data is in XML. There's no reason he shouldn't be able to spend a few weeks building an application or script. Instead he spends his time whining.
posted by SweetJesus at 6:42 PM on January 17, 2003


Let's see if I have this right.

Some guy crawls a site, a site that provides data for free.

His, likely poorly written, spider sucks up so much of the available resources that the site no longer functions properly, to the detriment of both the users and the administrators.

They offer a solution, but that's not good enough. He would rather they restructure their entire damn architecture in order to provide him with a database dump that is more convenient for him. Adding insult to injury, he then ignores their request and leaves the spiders running.

I'd block his ass too.
posted by cedar at 6:52 PM on January 17, 2003


Good for DMOZ. This guy did the rough equivalent of breaking into a soup kitchen and demanding food at gunpoint because he couldn't figure out the logistics of standing in line for free food like everyone else.

DMOZ is one of the noblest efforts on the web. It's not perfect, and people have plenty of complaints, but they're doing it for FREE for crying out loud. The sheer quantity of man-hours that the project has contributed to the online community is staggering, and this guy whines because they don't give it to him on a silver platter. Geez.
posted by Erasmus at 7:41 PM on January 17, 2003


"I like to think of myself as an internet pioneer..."

"TNL.net explores the edge of technology..."

Wow, this poor guy is this close to going from prat to full-on net kook.
posted by Erasmus at 7:50 PM on January 17, 2003


>Tristan Louis's web site can no longer be used to access the Open Directory.

You're making it sound like users are clicking links from his site to hit DMOZ when what's really going on is that his spiders are simply being abusive. I don't see what this has to do with "Tristan Louis's web site." Just because the data is accessible from there doesn't mean the event is happening there.
It's not like there's a slashdot effect; this is more like an unintended DoS, and DMOZ probably did the correct thing.

"I am in charge of the technical management for the application development group in the Internet unit of HSBC, one of the largest banks in the world."

Oh boy.
posted by skallas at 8:21 PM on January 17, 2003


To sum up: What an ass.
posted by RylandDotNet at 9:03 PM on January 17, 2003


I have been downloading the RDF files for the past couple of years and wrote a very simple parser in perl to deal with all of the data. It in no way takes days to parse, as he states. Now I was only making a specialty directory that ended up using about 1% of the directory, but I still processed the whole thing. This is hardly the "edge of technology." It is simply 50 lines of perl (scarabic was dead on). This "internet pioneer" wouldn't even do that.

As for blocking him from crawling, I don't feel this violates the license. They still provide the full dump for free.

DMOZ provides an excellent service. I just wish they would update that RDF dump. The one from September is getting a little old.
posted by sciatica at 9:39 PM on January 17, 2003
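
For readers wondering what the kind of short parser sciatica describes might look like, here is a minimal sketch in Python rather than Perl. It reads a dump incrementally and keeps only the pages under one category, in the spirit of the specialty directory mentioned above. The element and attribute names in the sample (`ExternalPage`, `about`, `Title`, `topic`) and the `pages_under` helper are assumptions for illustration, not a verified copy of the ODP schema or of anyone's actual script.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the dump. The element and attribute names are
# assumptions for illustration, not a verified copy of the ODP schema.
SAMPLE = b"""<RDF>
  <ExternalPage about="http://example.org/perl">
    <Title>Perl site</Title>
    <topic>Top/Computers/Programming/Languages/Perl</topic>
  </ExternalPage>
  <ExternalPage about="http://example.org/soup">
    <Title>Soup kitchen</Title>
    <topic>Top/Society/Philanthropy</topic>
  </ExternalPage>
</RDF>"""


def pages_under(stream, prefix):
    """Yield (url, title, topic) for pages whose topic starts with prefix.

    iterparse reads the input incrementally, and clearing each finished
    element frees its children, so memory stays small compared with
    loading the whole tree at once.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "ExternalPage":
            continue
        topic = elem.findtext("topic", default="")
        if topic.startswith(prefix):
            yield elem.get("about"), elem.findtext("Title", default=""), topic
        elem.clear()  # drop the element's children once processed


matches = list(pages_under(io.BytesIO(SAMPLE), "Top/Computers"))
print(matches)
```

The real dump uses XML namespaces, so tag names would need adjusting before this could run against it.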


As the "whiny prat," "ass," and so on, I'd like to respond. First of all, my objection is not to the fact that they blocked me. It is to a couple of other things. Some people have said that I should optimize my code. I did, and with fewer than 3,000 requests a day (on average that section of my site gets no more than 2,000-3,000 visits), I thought I'd pass mustard. After all, my 5 biggest referrers all come in at over 10,000 requests per day, and they all get served without any decrease in speed from a dual 400 MHz Pentium III box (however, I am making some changes, notably a move away from Windows and towards Linux, to increase the number of requests served at peak times). Seems to me that if I can serve those numbers (and I'm assuming most anyone with a semi-respectable web site, including pretty much anyone who reads mefi, gets at least that much traffic) they must be doing something wrong in terms of optimization. Now I could optimize my code to do a dump of their latest feed every day (200 MB) so I get whatever they consider freshest, but I think that if you take their average page (about 20 KB) and multiply it by the number of requests (let's go on the high end and say that I grab 5,000 pages), you still end up with a set of downloads of 100,000 KB, or about 100 MB, vs. downloading a 200 MB feed daily. Seems to me this is a more optimized approach.

Others question my pioneer status. Well, I'll just have to ask those people how long they've been online, how long they've been working online, and how much they've contributed. The answer to these on my end are, online since 1988, working online since 1993, contributed internet.com (and related sites) as well as earthweb.com (and related sites, now all owned by internet.com) to the community. I'm sure between those two, if you're in web development, you've been on a site I created. Beyond that, I'm open to discussions like this. I'm sure this post is going to generate more flaming but hey, I'd rather chat with people about it than keep quiet.

What I really want out of ODP, to be honest, is more openness. Every question I have asked has received the answer "go to http://www.dmoz.org/rdf and download our RDF dump." I want fresher data (not only for my site but also for everyone else) and I think that it would be ridiculous to grab the whole data set in order to get it. Now the question is how do we solve this. Does anyone out there have an idea? Imagine a list of current events (yes, there is such a section in dmoz.org) that would not include mentions of North Korea, the Iraqi inspection, or the Bush incentive plan, for example. How current could that be? To me, not very. I want to solve the problem. I want to help ODP. The question is how can I do that? In the meantime, the error message pops up and I believe it is my responsibility to update the people trying to reach the page on progress being made. If you want to follow the conversation keep hitting the page as I'll keep updating it until it's all resolved.

Now, back to the RDF error_log that my parse generated :)
posted by TNLNYC at 12:52 PM on January 18, 2003
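
The bandwidth comparison in TNLNYC's comment above can be checked with a quick calculation. This is a minimal sketch that uses only the estimates quoted there (a 20 KB average page, 5,000 pages a day on the high end, a 200 MB dump); the constant and function names are made up for illustration, and 1 MB is taken as 1,000 KB for round numbers.

```python
# Figures quoted in the comment above; all are the commenter's estimates.
AVG_PAGE_KB = 20        # average ODP page size
PAGES_PER_DAY = 5000    # high-end number of pages fetched per day
DUMP_MB = 200           # size of the full RDF dump


def crawl_mb_per_day(pages: int, page_kb: float) -> float:
    """Daily crawl traffic in MB, using 1 MB = 1000 KB for round numbers."""
    return pages * page_kb / 1000


crawl_mb = crawl_mb_per_day(PAGES_PER_DAY, AVG_PAGE_KB)
print(f"crawl: {crawl_mb:.0f} MB/day, daily dump: {DUMP_MB} MB/day")
```

With these inputs the crawl works out to 100 MB a day, against 200 MB for a daily dump. The comparison covers bytes transferred only and says nothing about how much work each request costs the server.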


What a wanker! He really needs to develop his critical reading skills just a bit since he doesn't seem to understand the little snippet of the ODP Social Contract that he posted:

We will be guided by the needs of our data users and the ODP editorial community. We will place their interests first in our priorities.

Check it Tristan, it says users. It is in the best interests of the users that one self-important egomaniac isn't hogging all the resources due to his inability or lack of interest in solving the problem at hand in a more efficient manner for all.
posted by RevGreg at 1:08 PM on January 18, 2003


Oh, by the way, next time you post here make sure you terminate all of your open tags. Whiny prat!
posted by RevGreg at 1:09 PM on January 18, 2003


revgreg: Does providing 4 months old data work in the best interest of users when Netscape.com has day old data? Just a question. Let me restate: I don't have a problem with them blocking me, I have a problem with them not providing recent data.
posted by TNLNYC at 1:22 PM on January 18, 2003


If there's nothing out there, just make something. DMOZ provides an API and all the data is in XML. There's no reason he shouldn't be able to spend a few weeks building an application or script. Instead he spends his time whining.

I am doing the other part but in the meantime need to update people on progress. A quick message keeping users of my site in the loop takes what, 5 minutes? Writing the application to do the RDF parse is going to take me all weekend (maybe slightly longer, as I am rewriting it in PHP for the new site; I decided to abandon win2k last year). What do I do in the meantime, let the current application crash? My view is that a message that updates people is better. Maybe I'm wrong on this.
posted by TNLNYC at 1:26 PM on January 18, 2003


His, likely poorly written, spider sucks up so much of the available resources that the site no longer functions properly, to the detriment of both the users and the administrators.

If anyone wants to see the code, feel free to email me and I'll send it to you. It sucks up resources (it picks up the page at render time) but less so than doing frequent dumps of their complete DB (since there is no incremental one).
posted by TNLNYC at 1:30 PM on January 18, 2003


You're making it sound like users are clicking links from his site to hit DMOZ when what's really going on is that his spiders are simply being abusive. I don't see what this has to do with "Tristan Louis's web site." Just because the data is accessible from there doesn't mean the event is happening there.

Good catch on the misrepresentation. I typed the error message in a hurry and didn't think of it in that way. Anyone's got better wording? If so, either email it to me or post it in this forum and I'll change the wording on the site. :)
posted by TNLNYC at 1:32 PM on January 18, 2003


Almost forgot: For those interested in calling me a Net Kook, you might want to google "Usenet Bully" and "Tristan Louis"... I think you'll find me on the list there somewhere (it's a doc from circa 94-95)

Other areas of people bitching about me online (as far as I remember): The narrow 1-point win for the east coast in the "Cool Site in a Day" contest back in 1999 (there should be a webreview article somewhere out there about it and some mentions on a few blogs), the Boo.com debacle (my comments on it angered a few people)

:)

Now, I'm going to go quiet again on here as I want to focus on writing a better way to parse through the RDF (if anyone has suggestions, send them via email :) )

TNL
posted by TNLNYC at 2:07 PM on January 18, 2003


revgreg: Does providing 4 months old data work in the best interest of users when Netscape.com has day old data?

When attempting to provide the data in a more timely fashion threatens it being collected at all, yes. If you're willing to part with the cash to build the infrastructure and the time to write the code, feel free to show them up. When you're getting something for nothing it strikes me as ludicrous to then bitch about it...
posted by RevGreg at 3:35 PM on January 18, 2003



what do I do in the meantime, let the current application crash?

Yes. They don't owe you anything. On the contrary, you owe them some consideration, which you aren't showing by flooding their resources with your poorly written spider bot. They are doing nothing wrong, and providing a free service, a service you are abusing. You're the one in the wrong. Take your broken application offline until you fix it.
posted by RylandDotNet at 4:36 PM on January 18, 2003


If I passed mustard, I'd totally go see a proctologist.
posted by waldo at 9:11 PM on January 18, 2003


Well, since the general feeling seems to be that, because the data from ODP is free, I have no right to complain about its freshness and the preferential treatment ODP gives to Netscape.com, I've decided to completely get rid of my dependency on the ODP for data. My view (and it seems that it is not one shared by people on this thread) is that when the ODP provides data to the community, it should provide the same data to everyone, whether it is Netscape (their parent owner) or anyone else.

I appreciate all the comments back on this as they have helped me understand what other people feel. I disagree with the general view that it is OK for ODP to provide better data to Netscape than to other ODP licensees but, as some of you pointed out, it is free data so I can't really complain.

However, what I can do is try to build a better mousetrap. As a result, I'm kicking off a new project that will essentially mirror what Netscape/AOL-Time-Warner is doing with the ODP. The concept is simple:

Using RSS 2.0 and trackback, one can essentially rebuild the data that ODP provides.

Why RSS 2.0?

RSS 2.0 provides flexibility in the use of namespaces and provides all the data for a particular section. When you look at an ODP section, all it has is link, title, description, and category for each of the links. That data can all be encapsulated in an RSS channel. However, the one thing RSS does NOT provide is a way to organize channels in a hierarchical order. The other important item in RSS 2.0 is the source field, which allows me to refer to other sources using trackback.

Why trackback?

Using Sam Ruby's concept of easy grouping, one can work out a decentralized approach to a database. The idea here is that if there's a published category structure (I still need to figure out how to decentralize that part), I can create a category channel (for example a community weblog channel). Using trackback, anyone can enhance that category channel by defining their own channel as either a parent or a child of my channel. If my channel receives a referrer from a parent channel, it has to remember this info and display it when moving up one category. If it is a child, I need to include the trackback when moving down in categories. As a result, there is a 1-to-1 relationship between a channel, its parents, and its children. I don't need to dig deeper than one channel in this case, because all I want to know about a channel (beyond its content) is who its parents are, and who its children are.

Is anything wrong with the concept so far? Thoughts? Comments?
posted by TNLNYC at 12:41 PM on January 19, 2003
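
The parent/child bookkeeping proposed in the comment above could be sketched roughly as follows. This is only an illustration of the idea that a channel remembers its direct parents and children and nothing deeper. The `Channel` class, the `receive_trackback` method, and in particular the `relation` argument are assumptions: the comment does not say how a ping would declare itself a parent or a child.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """One category channel; it stores only its direct neighbours."""
    url: str
    parents: set = field(default_factory=set)
    children: set = field(default_factory=set)

    def receive_trackback(self, source_url: str, relation: str) -> None:
        """Record a ping from a channel declaring itself parent or child.

        `relation` is a hypothetical field, not part of any real protocol.
        """
        if relation == "parent":
            self.parents.add(source_url)
        elif relation == "child":
            self.children.add(source_url)
        else:
            raise ValueError(f"unknown relation: {relation!r}")


weblogs = Channel("http://example.org/community-weblogs.rss")
weblogs.receive_trackback("http://example.org/weblogs.rss", "parent")
weblogs.receive_trackback("http://example.org/mefi.rss", "child")
print(sorted(weblogs.parents), sorted(weblogs.children))
```

Moving up a category means following `parents`, moving down means following `children`, and no channel needs a copy of the whole tree. The sketch leaves open the question the comment itself raises, namely how the category structure gets published in the first place.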


Now, I'm going to go quiet again...

As usual for these types, this has turned out not to be the case...
posted by Erasmus at 8:51 PM on January 19, 2003


I am doing the other part but in the meantime need to update people on progress. A quick message keeping users of my site in the loop takes what, 5 minutes? Writing the application to do the RDF parse is going to take me all weekend (maybe slightly longer, as I am rewriting it in PHP for the new site; I decided to abandon win2k last year). What do I do in the meantime, let the current application crash? My view is that a message that updates people is better. Maybe I'm wrong on this.

First of all, what you're basically doing is mirroring DMOZ. Ethically, you shouldn't do this. It pisses people off, and that's why you're being called a whiny prat. And it seems you haven't even looked around for tools that will do what you want to do. There's a whole big fucking list, right here, of pre-made tools that could help you out.

But if you're really fucking primed to re-build dmoz on your own machine, you can, and this is not the end of the world. Learn some perl, which is designed expressly to rip through text files and generate usable data (that's why perl stands for Practical Extraction and Report Language). Put the perl script in a crontab and have it run nightly.

You probably wouldn't even have to do too much work. If you check out CPAN there are tons of pre-built APIs for XML and RDF processing. It's not rocket science.

I don't know much off the top of my head about how DMOZ structures their XML data, but I've done a bunch of work with large XML data sources*, and there is no way it should take 40+ hours to parse all the data if you do it right. The biggest problem I see is figuring out what data is new and what data is old, and this can be overcome with some work (only parsing data with dates after X, perhaps).

Just find a friend (or hire someone, if it's that important) who has a software engineering background to take a look around for a few hours, and give you a starting point.

And please stop complaining. DMOZ is not in business to provide you, and only you, with data. Stop acting that way.

*I used to work for a large semiconductor testing equipment manufacturer, and while I was working there I was charged with writing an XML parser in perl to plow through the automated testing equipment log files that were generated every night. We had about 40 or 50 machines, each one pumping out a 6 to 10 meg log file every night. So, on average, this script was processing around 360 megs (~8 * ~45 = ~360) nightly. It took much less than an hour to chomp through and generate usable data on an old 400 MHz Solaris box.
posted by SweetJesus at 4:41 PM on January 20, 2003
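
On the point above about working out which data is new and which is old: one way to do it, once two dumps have been parsed, is to compare them directly. This is not the date filter suggested in the comment, since nothing in the thread says whether the dump carries dates; it swaps in a plain comparison of two snapshots keyed by URL. The `diff_snapshots` function and the sample data are hypothetical.

```python
def diff_snapshots(old: dict, new: dict):
    """Compare two {url: (title, topic)} snapshots of the directory.

    Returns (added, removed, changed) as sorted lists of URLs.
    """
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(u for u in new.keys() & old.keys() if new[u] != old[u])
    return added, removed, changed


last_week = {
    "http://example.org/a": ("Site A", "Top/Arts"),
    "http://example.org/b": ("Site B", "Top/News"),
}
this_week = {
    "http://example.org/a": ("Site A, renamed", "Top/Arts"),
    "http://example.org/c": ("Site C", "Top/News"),
}

added, removed, changed = diff_snapshots(last_week, this_week)
print(added, removed, changed)
```

Only the added, removed, and changed entries would then need to be written to the local database, rather than reloading everything after each dump.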


First of all, what you're basically doing is mirroring DMOZ.

Actually, if it were just a question of mirroring it, I wouldn't find the project interesting. What I was doing (and I wish more people could see it work now) was to parse through DMOZ and, on top of creating the HTML page, enrich it with RDF metadata and an RSS parse of relevant feeds for every page. It didn't work too well on the internationalization page (the World Category) but worked pretty well on the category pages. The problem is that I needed the ODP data to figure out what were the best RSS feeds to place in context, as well as to generate the RDF metadata.

And it seems you haven't even looked around for tools that will do what you want to do. There's a whole big fucking list, right here, of pre-made tools that could help you out.

That list includes tools that essentially did the same kind of crawl as I did. As a result of my new "banned" status, none of those tools would work to get the data.

primed to re-build dmoz on your own machine

Actually, if it were just a question of doing a dmoz rebuild, I wouldn't even undertake it. What's the use? There's already a site out there that does that. It's called Dmoz :) What interested me in the project was doing the invisible stuff (RDF in particular). The dmoz site has OK data but no one has bothered adding all the metadata. For example, why is it that there is no meta author tag on every page that provides a metadata link to the profile info for category editors? Why is the description set aside on a separate page when it could be in a meta description tag? Using the Dublin Core metadata set makes sense but dmoz is not interested in including it. I sent them email way back before I started this project but never got an answer, so I ended up creating the metadata for them. As a result, when a request came into my site, I would grab their resource page, category description, and profiles for the editors before rendering my "enhanced" page. I think that's what created the conflict.

there is no way it should take 40+ hours to parse all the data if you do it right.

Probably true. I just grabbed the W3C RDF parser and put a quick wrapper around it overnight so that code is hardly pristine.

stop complaining. DMOZ is not in business to provide you, and only you, with data.

Granted, dmoz is not about just me. My complaints are actually not about "my site" per se but about the fact that they are not willing to give a feed that is more recent than September. What I would like is for them to provide a data dump that is at least as good as the data dump they use on Netscape.com (that one is much more up to date, and considering that dmoz is owned by Netscape, the dmoz people are probably doing it straight from dmoz). The other thing is that, for all of TNL.net's history (dating back to 1994), I've believed in being as transparent as possible in terms of why certain errors popped up. I usually post similar messages when something else breaks. I never thought that it would generate such heat. Never did in the past :)

What I am considering, though, is the fact that there is a critical failure in the way Dmoz works now. With the two-way web and the emergence of trackback, RSS, and other related metadata and two-way technologies, I suspect there is a better way to build it (dmoz 2.0?), so that's what I'm starting to look into.
posted by TNLNYC at 5:59 AM on January 21, 2003
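
The metadata idea in the comment above (an author tag per category editor and a description tag on each category page) might look something like this in practice. It is a rough sketch only: the `category_meta_tags` function is hypothetical, and the `DC.description` and `DC.creator` names follow the common convention for writing Dublin Core in HTML meta tags, which may or may not match what TNL.net actually emitted.

```python
from html import escape


def category_meta_tags(description: str, editors: list) -> str:
    """Build <meta> tags for one category page.

    The DC.* names follow the common Dublin Core in HTML naming
    convention; which fields the original pages used is an assumption.
    """
    tags = [f'<meta name="DC.description" content="{escape(description)}">']
    for editor in editors:
        tags.append(f'<meta name="DC.creator" content="{escape(editor)}">')
    return "\n".join(tags)


meta = category_meta_tags('Sites about "community" weblogs', ["editor_one"])
print(meta)
```

Escaping the values matters here because category descriptions can contain quotes or angle brackets that would otherwise break the attribute.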




This thread has been archived and is closed to new comments