"Thus oft a struggle to escape - But lands us in a still worse scrape"
June 2, 2008 2:18 AM

page2rss is a simple, effective RSS scraper. For instance, here's an RSS feed for Astronomy Picture of the Day. A powerful feature: "You can add a button to your browser's bookmarks toolbar that will create Page2RSS feed for the page you are currently viewing."
posted by nthdegx (12 comments total) 7 users marked this as a favorite
<title>fuck you</title>
<a href="rickroll.rm">Here</a> is something really helpful and interesting.

The word scraper is evocative of the word skyscraper to me. I can't think of another scraper which is the result of an engineering process. I've never heard of any bulldozer or steamroller or anything like that which was called a scraper. I can't think of what else it would be.

The second thing I can think of is a sense of trust which allows the creation of things of various degrees of fragility. For example, where I live, a storefront is large panes of glass. You can see what's inside easily, it's pretty, and the employees can see outside, which is probably nice for them. In other neighborhoods, the storefront is a metal wall which can withstand direct assault with a sledgehammer, although probably not a Buick. Then there are those with the metal screens which roll up so that, as long as someone's around to watch the store, they aren't worried about someone throwing a chair through the window and coming in to loot the place.

So it is to compare a city to an army base. And you could say that the city can be fragile without much fear because it maintains a good PR. So it's open to everyone, it's an example of human accomplishment and partnership, it shelters and supports so much innovation or industry. But if it ever instead became a center for the manufacture of carbon monoxide and cluster bombs, then it had better put up some serious defenses, because the people of the world will no longer kindly tolerate its existence.

But there was a period when gunpowder manufacture was the height of human achievement. Anyway all this is moot because we don't target the skyscraper model, after all, look at the latest high rises going up everywhere around the world. We just want to destroy some of them. And so we do. So socialism is known by some as fascist, and the social web is simply, innocently collecting information, living by selling advertising, certainly not those intelligent billboards that know who they're talking to, that's obscene (if it's us that are psychologically abused by it, if it's them it's alright so long as we aren't told to care about it).

Am I paranoid about a site that steals lifts kindly shares page content? <title>Oh, I don't know.</title>
posted by nervousfritz at 2:48 AM on June 2, 2008

Does anyone actually make tinfoil these days? Was foil ever actually tin? For purposes of headwear, does the metal in the foil make a difference?
posted by maxwelton at 3:31 AM on June 2, 2008

Sweet, thanks! APoD in particular annoys me, so that's a great example. Although I'm not sure it's working yet and of course there's the issue of when page2rss goes dead. But we'll see what happens.
posted by DU at 4:11 AM on June 2, 2008

But these have existed for some time -- FeedFire, FeedYes and Feed 43 are a few I can think of. Is this one somehow different?
posted by loiseau at 5:08 AM on June 2, 2008

Thanks, I've been looking for something like this and didn't know this type of thing existed.
posted by stbalbach at 6:35 AM on June 2, 2008

Uh, nervousfritz, I don't know how to tell you this... but RSS is a feature of the page. It's built in. It's chosen. The maintainer designed it that way.

posted by sonic meat machine at 6:46 AM on June 2, 2008

Filtering should be on top of their todo list. Right now you'll be informed whenever an ad rotates or whenever there's a latest news update in a sidebar.
Even a simple selectable filter which would prevent updates if the only change in a page is a single image (ad) could make this service a lot less frustrating.
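A rough sketch of what such a filter could look like, in Python. This is not how page2rss works; it is only an illustration under the assumption that an update should fire only when the visible text of a page changes. The names `TextOnly`, `text_fingerprint` and `meaningfully_changed` are made up for this example.

```python
import hashlib
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text only. Tags such as <img> carry no text,
    so a rotated ad image never reaches the fingerprint."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            # Normalise whitespace so reflowed markup does not count as a change.
            self.chunks.append(" ".join(data.split()))


def text_fingerprint(html):
    """Hash of the page's visible text (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return hashlib.sha256("\n".join(parser.chunks).encode("utf-8")).hexdigest()


def meaningfully_changed(old_html, new_html):
    """True only if the visible text differs between two snapshots."""
    return text_fingerprint(old_html) != text_fingerprint(new_html)
```

This only covers the rotating-image case. A sidebar whose text changes would still trigger an update, so a real filter would also need some way to select which part of the page to watch.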

I do like the free-ness of p2r and have used it to get updates of specific pubmed searches in my RSS reader instead of my mailbox.
posted by Akeem at 7:17 AM on June 2, 2008

You can create rss feeds on pubmed, you know.
posted by cashman at 7:53 AM on June 2, 2008

I've found this to be a simple yet effective and easily customisable scraper for anyone with a passing knowledge of PHP.

Note that you can do a lot of cool stuff with Pipes, too, which is probably the most effective way for non-programmers.

I'm a little on the fence about the ethics of scraping. It's not just determining the limits of fair use: sometimes, for example, you can't help but wonder whether a site lacks a (functional) RSS feed for whatever interesting content it carries because the author/owner doesn't want the content to be freely syndicated, or because it simply hasn't occurred to them that there would even be an interest in one.

Sometimes the first argument can be said to hold water. Take Drudge, for instance: there are various third-party attempts at creating a feed for it, but none of them are timely, I find. I think Drudge might want it that way, effectively forcing people to visit the site itself instead of reading the content elsewhere.

Now, I do a lot of scraping (as well as non-scraping aggregation, filtering etc.) for Electicker, and I honestly find it very difficult to find the line between what's acceptable syndication/aggregation and what is content theft - in the meantime, it's very tempting to adopt a default attitude along the lines of "It's just a bunch of links, I'm driving traffic to their site, and if they ask me to take anything down I'll do it".

Then again, it's not just a bunch of links, because I do reprint actual content that isn't freely available as a feed (electoral-vote.com's projection, RCP polling averages, etc.). Then again, tons of bloggers post the latest polls, projections etc. - I've merely automated the process. (Plus, everything links back to the source.) So I'm kind of on the fence, and I can't help but wonder whether my attitude isn't a little bit lazy toward the ethics.

On a tangent, you'd be amazed at how much broken crud you come across when you start dealing with 100+ feeds, even from more reputable sources, mainstream media, etc. Obviously, there are a number of differing standards, but my parser should be able to handle most variations, even exceptions, etc.

As an example, a pet peeve of mine is the number of feeds that still carry their timestamps as "EST", even though it's summertime (DST), so it should be EDT. This seems minute, and in many ways it is, but as it makes a feed's content show up an hour ahead of its actual posting time, not only does this make for aberrant timestamps ("posted 53 minutes from now"), but it also gives a feed an unfair 'advantage' by way of its items showing up at the top of a list for longer.

This is of course trivial to compensate for in PHP, but it's still something I have to check for, and that takes time.
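For illustration, here is a minimal sketch of that kind of compensation, written in Python rather than PHP. It assumes RFC 822-style feed dates such as "Mon, 02 Jun 2008 11:09:00 EST" and the US daylight saving rules in force since 2007 (second Sunday of March to first Sunday of November). The function names are made up for this example, and it is not the actual Electicker code.

```python
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5), "EST")
EDT = timezone(timedelta(hours=-4), "EDT")


def nth_sunday(year, month, n):
    """Midnight on the n-th Sunday of the given month."""
    first = datetime(year, month, 1)
    # weekday(): Monday=0 ... Sunday=6
    days_to_sunday = (6 - first.weekday()) % 7
    return first + timedelta(days=days_to_sunday + 7 * (n - 1))


def us_eastern_dst(wall):
    """True if a naive US Eastern wall-clock time falls inside DST.
    Assumption: post-2007 US rules; older dates would need other rules."""
    start = nth_sunday(wall.year, 3, 2).replace(hour=2)   # 2nd Sunday of March, 02:00
    end = nth_sunday(wall.year, 11, 1).replace(hour=2)    # 1st Sunday of November, 02:00
    return start <= wall < end


def parse_feed_date(text):
    """Parse a feed timestamp, ignoring whether it says EST or EDT and
    choosing the offset from the calendar date instead."""
    stamp, zone = text.rsplit(" ", 1)
    wall = datetime.strptime(stamp, "%a, %d %b %Y %H:%M:%S")
    if zone not in ("EST", "EDT"):
        raise ValueError("unhandled zone: " + zone)
    return wall.replace(tzinfo=EDT if us_eastern_dst(wall) else EST)
```

Usage: `parse_feed_date("Mon, 02 Jun 2008 11:09:00 EST")` comes back with a UTC-4 offset, so the item no longer sorts an hour into the future. The weekday and month names rely on the default English locale.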

I'm sure this sounds kind of obsessed, but that's probably because I am.
posted by goodnewsfortheinsane at 11:09 AM on June 2, 2008 [1 favorite]

That is an awesome comment, gnfti, and thanks for the heads up about Electicker.

Mediareport runs a local blog that I'd like to follow but it's simple html, no rss, and I hadn't considered the ethics of using a tool like this. But shh...I'll probably use it until he finds this thread and objects. ;)
posted by artifarce at 7:12 PM on June 2, 2008

For this sort of thing, Dapper with a little Yahoo Pipes hackery can go much further. Not further than a custom PHP or Python script, though. I was thinking of abusing Google App Engine for my own custom screen scrape -> RSS stuff.
posted by bertrandom at 11:00 AM on June 3, 2008

Why would you need to scrape a feed for APOD when they make one available themselves?
posted by WCityMike at 10:52 AM on June 5, 2008

This thread has been archived and is closed to new comments