Scraping away
February 18, 2005 10:26 AM

Feedpalooza. This gentleman offers to scrape any website (at his discretion) to provide you the custom feed you want. For instance, I wanted a simple black box on my site with the real-time number of coalition casualties in Iraq. I pointed him to this site. He scraped the one number and provided this feed. Brilliant.
posted by stupidsexyFlanders (15 comments total)
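
A rough sketch of the idea described in the post: pull one number out of a page and wrap it in a feed. This is not the Feedpalooza author's code. The sample HTML, the `id="total"` marker, the feed title, and the example.com URL are all made up for illustration, and a real scraper would download the page over HTTP first.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment standing in for a downloaded page.
SAMPLE_HTML = '<html><body><p>Total: <b id="total">1,234</b></p></body></html>'

def scrape_number(html):
    """Return the integer found inside the (assumed) id="total" element."""
    match = re.search(r'id="total">\s*([\d,]+)\s*<', html)
    if match is None:
        # The pattern is tied to the page layout, so fail loudly if it moves.
        raise ValueError("pattern not found; the page layout may have changed")
    return int(match.group(1).replace(",", ""))

def build_feed(value, source_url):
    """Wrap a single scraped value in a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Scraped counter"
    ET.SubElement(channel, "link").text = source_url
    ET.SubElement(channel, "description").text = "One number scraped from one page"
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = str(value)
    ET.SubElement(item, "link").text = source_url
    return ET.tostring(rss, encoding="unicode")

feed_xml = build_feed(scrape_number(SAMPLE_HTML), "http://example.com/counter")
print(feed_xml)
```

The regular expression is the fragile part: it depends on the exact markup around the number, so any redesign of the source page would need a matching change here.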
 
Some stats: On average, we serve RSS requests () per day. The last two months looked like this: 0 polls in , 0 polls in , and still counting.

Cool idea, even though he's probably going to end up upsetting a couple people.
posted by BradNelson at 10:39 AM on February 18, 2005


parsing html is not screenscraping, regardless of the misconception mr. haughey labors under.
posted by quonsar at 10:51 AM on February 18, 2005


Er, what is screenscraping?
posted by Gyan at 10:54 AM on February 18, 2005


screenscraping is what search engines do - they take all your content and format it in such a way that it's easy to skim. but screen scrapers tend to do less summarizing, and more just copying. and they don't generally respect robots.txt rules.
posted by scottreynen at 10:56 AM on February 18, 2005


Interesting site and some interesting feeds to wander through in there.

How much is €2! in real money?
posted by fenriq at 11:00 AM on February 18, 2005


Serverscraping!
posted by DrJohnEvans at 11:03 AM on February 18, 2005


About £1.50, or $38,342... $38,490... $38,900...
posted by benzo8 at 11:04 AM on February 18, 2005


OT, but screenscraping used to refer to pulling characters from video memory and re-mapping them for another purpose. It was common in early Windows terminal emulators. You'd run the terminal emulation in the background, and the output was 'scraped' and displayed in a 'Windows' window. This allowed copy & paste (the innovation!) and screen-mapped terminal keys, among other things.
posted by punilux at 11:13 AM on February 18, 2005
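
For the older sense of the word described above, here is a small sketch of reading characters out of a screen buffer. It assumes a PC text-mode style layout, where each cell is a character byte followed by an attribute byte, and code page 437 for the characters. The comment does not say which layout those emulators actually read, so treat both as assumptions. The function name and the tiny 2x8 "screen" are invented for the example.

```python
def scrape_text_buffer(buffer, columns):
    """Decode a text-mode screen buffer into a list of strings, one per row.

    Assumes each cell is two bytes: a character code followed by an
    attribute (colour) byte, which is discarded here.
    """
    chars = buffer[0::2]            # keep character bytes, skip attribute bytes
    text = chars.decode("cp437")    # assumed code page for the character set
    return [text[i:i + columns].rstrip() for i in range(0, len(text), columns)]

# A made-up 2-row, 8-column screen: every character paired with attribute 0x07.
rows = ["LOGIN:  ", "READY   "]
screen = bytes(b for ch in "".join(rows) for b in (ord(ch), 0x07))

print(scrape_text_buffer(screen, columns=8))  # ['LOGIN:', 'READY']
```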


I don't mean to pick a fight, scottreynen, but I have to disagree. Even Wiki considers Screen Scraping "parsing the HTML in generated web pages with programs designed to mine out particular patterns of content" -- which is exactly what this post is about.

Also, saying "they don't generally respect robots.txt rules" is a bit FUD-ish. It's up to the author of the particular tool/program/script to check for the existence of and honor the robots.txt file, which is something that many search engines even fail to do. (Not the big ones, mind you, but it's certainly not uncommon.)

Getting back to the point of the original post, it's a great idea, but I think the guy's crazy. This kind of thing tends to break when website owners make even small changes to their pages. I wouldn't have the kind of patience to maintain such a large number of feeds. Yikes.
posted by ibidem at 11:17 AM on February 18, 2005
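
The robots.txt check mentioned in the comment above can be sketched with Python's standard `urllib.robotparser`. The rules, the `example-scraper` user agent, and the example.com URLs are made up, and the rules are parsed from a string so that nothing here touches a real server. A real tool would fetch the site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules standing in for a site's real robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(allowed(ROBOTS_TXT, "example-scraper", "http://example.com/index.html"))         # True
print(allowed(ROBOTS_TXT, "example-scraper", "http://example.com/private/data.html"))  # False
```

Whether to skip a disallowed URL is still up to whoever writes the scraper, since the parser only reports what the file says.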


Can't help thinking he's got his business model topsy-turvy though. He charges for a new feed, but then opens it up to everyone else free, so, unless it's something you desperately want today, most people will wait and keep an eye out for when someone else is desperate enough to pay the €2.

He should set each feed up for free, but charge the second and subsequent subscribers €1, or less... (Of course, then he's charging for someone else's content, and that's really sticky legal territory... But still.)
posted by benzo8 at 11:18 AM on February 18, 2005


wiki is wrong. punilux is correct.
posted by quonsar at 11:25 AM on February 18, 2005


ibidem, i didn't mean to imply this isn't screen scraping, but rather that this isn't fundamentally different from what a search engine does. both are scraping others' content and reformatting it. i thought that would be a common point of understanding to explain scraping to those who don't understand it.

nor did i mean to suggest that scraping is bad. i do quite a lot of it myself. and i don't generally respect robots.txt files. this causes no fear, uncertainty or doubt for me, and i'm not sure why it would for others. robots.txt is not a law.

benzo8, there isn't wide enough support for authentication in RSS readers to allow for paid feeds.
posted by scottreynen at 11:32 AM on February 18, 2005


quonsar, the best argument you got is semantics? I never thought "screenscraping" was a negative title, like "spam" and I don't mean it that way. All the folks I know that grab pages via http to pull out text call what they do screen scraping. Heck, my friend paul calls his own versions of this a scrapi.

This guy's RSS maker scrapes pages for content you want and I know a few friends that have paid him to do it. I'm not a fan of scrapers hitting metafilter because it's like a random user loading a page, which takes memory and processing, and scrapers often pull 5-10 pages at once so they can result in mini-DoS moments.

I'd be more than happy to provide an XML API instead, to minimize the load on the server, but that's more for metatalk than here.
posted by mathowie at 12:07 PM on February 18, 2005
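
One way a scraper could avoid the bursts of 5-10 pages described above is to fetch pages one at a time with a pause between requests. This is only a sketch: `fetch_politely`, `fake_fetch`, and the delay values are hypothetical, and the stand-in fetcher means no real site is contacted.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch URLs one at a time, pausing between requests.

    `fetch` is any callable that takes a URL and returns its content;
    `delay_seconds` is an arbitrary illustrative value.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)   # wait before every request except the first
        results[url] = fetch(url)
    return results

# Stand-in fetcher so the example runs without touching a real server.
def fake_fetch(url):
    return f"<html>content of {url}</html>"

pages = fetch_politely(
    ["http://example.com/1", "http://example.com/2"],
    fake_fetch,
    delay_seconds=0.01,
)
print(list(pages))
```

Passing `sleep` in as a parameter is a design choice that keeps the pause easy to replace in tests. It does nothing about the cost of each individual page load, which is why an API that returns only the needed data would still be lighter on the server.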


quonsar, the best argument you got is semantics?

yeah, you should have heard me rant about hacker vs. cracker a few years ago!
posted by quonsar at 3:18 PM on February 18, 2005


benzo8 - you're wrong, I've got feeds off him in the past and he just gives you the feed URL, it's up to you whether or not you want to make that URL public.
posted by daveirl at 7:44 AM on February 19, 2005



