

Dapper: an API for any website
September 5, 2006 8:59 PM

Dapper: The Data Mapper
A recently launched service that allows users to extract data from any website into XML, and transform or build applications and mashups with that data. Described by its creators as a way to "easily build an API for any website... through a visual and intuitive process". Plagiarism Today, meanwhile, has cause for concern: "Dapper is a scraper. Nothing more... now the technologically impaired can scrape content from any site... the potential danger [is] very, very real".
posted by MetaMonkey (31 comments total) 6 users marked this as a favorite

 
Luckily, the "technologically impaired" won't be able to properly exploit that content.
posted by smackfu at 9:14 PM on September 5, 2006


This could be incredibly useful. Obviously it's a threat to existing methods of "monetarization", but that debate, outside of the minds of the currently-monetarized, is well-settled. New technology that does useful things that weren't possible before is more important than current vested interests in keeping those things impossible. (Well, in the context of information technology. We're not talking about swapping heads with flies here.)
posted by aeschenkarnos at 9:17 PM on September 5, 2006


monetarization? monetarized? what kind of gobbledygook are you people spouting in this thread???
posted by jonson at 9:25 PM on September 5, 2006


Hmm. Good idea, but not too good at finding patterns. E.g. on MetaFilter user contacts pages. Admittedly, Matt adds no semantic markup in the form of named CSS elements to it, and only some contacts are followed by spans, so it wouldn't be easy to automatically extract a pattern, but I'd hoped dapp would see the commonality in the URLs themselves; all are of the form "/user/[0-9]+".
posted by orthogonality at 9:28 PM on September 5, 2006
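[The pattern orthogonality describes is simple enough to sketch by hand. A minimal, hypothetical Python example, assuming raw HTML with links of the form "/user/12345":]

```python
import re

# Hypothetical sketch: pull MetaFilter user links of the form
# "/user/[0-9]+" out of raw HTML, the commonality noted above.
USER_LINK = re.compile(r'href="(/user/(\d+))"')

def extract_user_links(html):
    """Return a list of (url, user_id) pairs found in the markup."""
    return [(url, int(uid)) for url, uid in USER_LINK.findall(html)]

html = '<a href="/user/17097">MetaMonkey</a> <a href="/user/1312">smackfu</a>'
print(extract_user_links(html))  # -> [('/user/17097', 17097), ('/user/1312', 1312)]
```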


the potential danger [is] very, very real".

Ha ha he said danger. Maybe the surgeon general should add a warning label.
posted by IronLizard at 9:34 PM on September 5, 2006


With that being said, it is a service everyone needs to make note of. The one thing that is for certain is that it will be in the news again. The only question is what light will it be under.

That has to be one of the weakest final paragraphs of any article I've ever read.
posted by signal at 9:40 PM on September 5, 2006


The idea of this service is rather exciting to me, even if the reality has limitations. This seems like the first steps toward a real evolution in the web - the ability for people of average ability to connect, reuse and recombine data will, I suspect, have some very interesting, very unpredictable, very powerful consequences, by making the web fundamentally more networky.
posted by MetaMonkey at 9:45 PM on September 5, 2006


"Monetarization" is a fancy-shmancy word for "getting money from having a website". Ads, mostly. Dapper allows readers to avoid ads. That this occurs at the expense of the website provider is additional concern for the author of "Plagiarism Today".

Incidentally, I find the title "Plagiarism Today" a bit troubling. To me, plagiarism is a particular kind of intellectual fraud committed by a student or a researcher or author whereby the work of others is presented as if it were one's own. The way to avoid plagiarism is to cite sources, and it seems to me that content-aggregators, scrapers etc can cite their sources with trivial ease in the form of HTML links, and will invariably do so unless engaged in some sort of scam. A further connotation of the word "plagiarism" is that, given the act of citing, it is not only permissible to quote the work of others, but desirable to the point that if you don't quote others whose work is pertinent to whatever it is you are trying to say, you will lose credibility.
posted by aeschenkarnos at 9:46 PM on September 5, 2006


Too much fancy, not enough schmancy.
posted by geodave at 10:00 PM on September 5, 2006


MetaMonkey writes "The idea of this service is rather exciting to me, even if the reality has limitations."

Well, of course it's exciting to you!

When I was testing dapp, I tested it on metafilter user pages. And I went first to your user page, just because the link to your user page was the first one on the front page. It was the first one because you'd just posted this FPP.
posted by orthogonality at 10:04 PM on September 5, 2006


When I was testing dapp, I tested it on metafilter user pages. And I went first to your user page, just because the link to your user page was the first one on the front page. It was the first one because you'd just posted this FPP.

I'm not quite sure what you're getting at there.
posted by MetaMonkey at 10:09 PM on September 5, 2006


In case of misunderstanding, I appreciate that Dapp currently lacks strong pattern-matching abilities, but I would expect this is something they will be at pains to improve as soon as possible.

Dwelling on the subject a little, it seems far more likely to me that the semantic web (should such a recognizable beast emerge) will be built on tenuous, makeshift constructions of recombined aggregations of aggregations, than some kind of pre-conceived ur-system.
posted by MetaMonkey at 10:16 PM on September 5, 2006


The demo made me die inside, a little. And the screencast help, too. I have a hard time using something that irritates me to listen to. Buy a real microphone, you bastards, and stop wheezing and coughing into the microphone.
posted by boo_radley at 10:35 PM on September 5, 2006


This is really cool. (See, I'm not a hater all the time.)
posted by keswick at 10:39 PM on September 5, 2006


  1. I saw this ages ago
  2. I tried it and it didn't work
  3. or maybe I just didn't understand what it's supposed to do
  4. what's a "screencast"?
  5. why does it need three sample pages of Digg, all formatted exactly the same, to figure out the pattern?
  6. it's just a screen-scraper when all's said and done
  7. Am I crazy, or did the demo spend ten minutes getting me a magical RSS feed for a site which already has an RSS feed?
It's not sites like Digg you need this kind of service for. If it had some kick-ass AI which could grab a good RSS feed from a site full of bad, old-fashioned, non-semantic HTML, I'd be applauding.
posted by AmbroseChapel at 10:56 PM on September 5, 2006


Oh, when they decided to call their thing "Dapper", perhaps they should have checked that dapper.com wasn't already taken? Warning: hideous, annoying, auto-play-audio Flash site.
posted by AmbroseChapel at 10:59 PM on September 5, 2006


Yay!
I brake for screen scrapers! They never get any respect, but they rawk, they are ballsy and sooo useful. I think developing automated heuristics, or learning methods to harvest heterogeneous data (HTML), is a tall order...
I can't wait to try this!

Hellz Yeah!
posted by celerystick at 11:20 PM on September 5, 2006


Third post, but ... I just tried it on Slate.com and it hung, twice. I followed their steps and got a screen with "please wait" on it. It's been an hour now. I think I'm going to stop waiting.
posted by AmbroseChapel at 11:39 PM on September 5, 2006


It seems site admins could just block referrers with "http://www.dappit.com/CreateDapp?mode=virtualBrowser&applyToUrl=etc..." in them to block scraping?
posted by Blazecock Pileon at 11:46 PM on September 5, 2006


Blazecock: Faking the referrer is easy.
posted by spazzm at 12:26 AM on September 6, 2006
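[spazzm's point in one short sketch: the Referer header is entirely client-controlled, so referrer-based blocking is trivially bypassed. A hypothetical Python example with a placeholder URL; no request is actually sent:]

```python
import urllib.request

# The client sets the Referer header to whatever it likes, so a server
# rule that blocks Dapper's referrer is easy to sidestep.
# example.com is a placeholder, not a real target.
req = urllib.request.Request(
    "http://example.com/page",
    headers={"Referer": "http://example.com/"},  # any value we choose
)
print(req.get_header("Referer"))  # -> http://example.com/
```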


Also check out the similarly interesting DabbleDB, which has been around for a while now. Basically an automated scraper/RSS parser that stores data in a database-like way, allowing all kinds of fancy manipulation and relation. Take a look at this impressive screencast.
posted by MetaMonkey at 12:44 AM on September 6, 2006


Hmmm, interesting...

On the one hand, it's a fantastic development which might be the start of organisations realising that "if you put it on the web, it's fair game". Australian PVR & MCE users have had a running battle trying to get an electronic program guide - a single company (HWW) controls all TV guide data, from all networks to all delivery mechanisms (electronic, paper, etc) - so it's been an arms race with screen-scrapers vs TV guide websites at 30 paces. Even Microsoft couldn't get them to budge and licence a guide for MCE, despite HWW being owned by their local MSN joint-venture partner.

On the other hand, as mentioned, they'll just block that referrer or IP range...
posted by Pinback at 12:55 AM on September 6, 2006


I think the really interesting part of this technology is summed up in this quote from the TechCrunch link, "what Geocities did for static web pages, they want to do for dynamic content reuse".

For me it isn't so much the covert scraping of big data hoarders which is exciting, it is the mechanism that can be leveraged by anyone to reprocess and recombine almost a whole world wide web's worth of data.

Moreover, it's the beginning of a trend toward site owners thinking of their data as reusable, and the implications of that - some will inevitably seek to wall off their content, but others will engage in a debate about the merits of allowing their content to be used beyond their walls... and very many people will open their data up to this sort of thing. Before long we'll all be using emergent semantics in our XHTML like it was going out of style.
posted by MetaMonkey at 1:23 AM on September 6, 2006


I like mashups of fingerling potatoes flavored with a little garlic and butter the best.
posted by killdevil at 3:27 AM on September 6, 2006


There's a magazine called Plagiarism Today?
posted by moonbiter at 4:10 AM on September 6, 2006


Oh, never mind. It's just a blog. (And here I was thinking there was a readership for issues of academic plagiarism and whatnot. Should have followed the link first.)
posted by moonbiter at 4:12 AM on September 6, 2006


I've wanted something like this for NFL stats for years now. I guess I can give it a try in a week.
posted by effwerd at 5:55 AM on September 6, 2006


The red boxes it uses to highlight content look exactly like what Aardvark uses.
posted by poweredbybeard at 8:27 AM on September 6, 2006


Autrijus developed a Perl module a few years ago called Template::Generate. When provided with a series of "data to text" mappings, it creates a generic template string that can be used by Template::Extract to easily scrape data from sites. It never quite took off, though; no powerful frontend (and no Web 2.0, at the time).
posted by crysflame at 8:54 AM on September 6, 2006


Yeah Autrijus is some kind of mad genius, isn't he? She? I know someone who tried to use Template::Generate and spent a lot of time scratching his head for very little result, but the essential idea is brilliant: turn a templating system inside out, you've got a scraper.
posted by AmbroseChapel at 6:22 PM on September 6, 2006
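[The inside-out-template idea behind Template::Extract can be illustrated without Perl. A rough Python analogue, assuming a made-up `[% name %]` placeholder syntax: the template's literal text becomes anchors and its slots become capture groups that pull the data back out of rendered output.]

```python
import re

# Illustrative analogue of the Template::Extract idea described above:
# turn a template "inside out" so placeholders become named regex
# groups that extract data from text rendered with that template.
def template_to_pattern(template):
    """Compile a template with [% name %] slots into an extracting regex."""
    parts = re.split(r"\[%\s*(\w+)\s*%\]", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)     # literal template text
        else:
            pattern += f"(?P<{part}>.+?)"  # placeholder -> named group
    return re.compile(pattern, re.DOTALL)

tmpl = '<li><a href="[% url %]">[% title %]</a></li>'
pat = template_to_pattern(tmpl)
m = pat.search('<li><a href="/user/1">orthogonality</a></li>')
print(m.groupdict())  # -> {'url': '/user/1', 'title': 'orthogonality'}
```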


Cute, but real sites aren't naked HTML. Is there any software out there that can scrape hard-to-read webpages that are full of password fields and javascript? I'd love to have a heads-up view of all my bank accounts and credit card bills, mortgage, etc. but the scrapers I have tried typically can't find the content. [Obligatory nod to the trust issues implicit in using a web agent to aggregate this kind of information!]
posted by krebby at 6:38 AM on September 7, 2006
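[The password-field half of krebby's question is at least sketchable: keep cookies across a login POST and the fetches that follow. A minimal, hypothetical Python sketch; the URL and form field names are placeholders, and real banking sites add JavaScript and CSRF tokens that this simple approach cannot handle.]

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Session-aware scraping sketch: a shared cookie jar means the session
# cookie set by the login response is sent on later requests.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar)  # persists cookies per request
)

def login(base_url, username, password):
    """POST a login form; the jar retains any session cookie it sets.
    Field names 'user'/'pass' are placeholders for illustration."""
    data = urllib.parse.urlencode(
        {"user": username, "pass": password}).encode()
    return opener.open(base_url + "/login", data=data)
```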




This thread has been archived and is closed to new comments