Mining the Deep Web
March 2, 2004 4:56 PM

Mining the Deep Web. Google indexes 4 billion pages, but there are hundreds of billions of documents out there in the Deep Web that are effectively unreachable by search engines because they are locked in databases or are unsearchable media. It looks like Yahoo is going to start giving us a peek by providing unified access to a wide variety of sites that are ordinarily only searchable by their own custom search engines.
posted by badstone (12 comments total) 1 user marked this as a favorite

 
Great, Yahoo will be able to tell us all about proprietary information and "for subscribers only" pages we will have to pay to get.
posted by ilsa at 5:24 PM on March 2, 2004


and what about privacy? is this stuff supposed to be available?
posted by amberglow at 5:51 PM on March 2, 2004


And do we really need that much more porn?
posted by yhbc at 5:57 PM on March 2, 2004


Competition in the search arena is good news. I'm as much a fan of Google as the next guy, but it's important that the rest of the search sites don't lay down and die because one company came out of nowhere and kicked their asses.
posted by gwint at 6:15 PM on March 2, 2004 [1 favorite]


And do we really need that much more porn?

Yes, duh!

I'm as much a fan of Google as the next guy, but it's important that the rest of the search sites don't lay down and die because one company came out of nowhere and kicked their asses.

Actually, the Google algorithm is amazingly simple, and it's surprising that no one else has tried to implement it. However, I think the big problem with indexing the invisible web probably revolves around what the search engines retrieve from these partnerships. Does Yahoo or Google retrieve all the information or just some metadata? (I'm guessing probably just metadata.) And then you get into the whole problem with metadata, etc.

This is a good step, but it's going to take a while to materialize into something great.
posted by Stynxno at 6:22 PM on March 2, 2004


excellent post badstone!
posted by Steve_at_Linnwood at 7:22 PM on March 2, 2004


I don't think that "locked in" is the appropriate term here. Much of the database-driven content is freely available; it just isn't exploited by search engines because the algorithms and processes are problematic. For one, there are multiple (and growing) data access techniques that require significant programming logic. For another, the back-end data can be so dynamic as to make subsequent search and retrieval irrelevant to a user's request.

I would guess that Yahoo and others would target reasonably static databases with large amounts of data and/or significant traffic. They will probably code specifically for some data, perhaps by mutual agreement with the content provider, while developing generic code for those using common access routines (for example, a secondary detailed query into a database based upon a previous page's results).
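
To make the "secondary detailed query" idea concrete, here is a rough sketch of that access pattern, not anything Yahoo has described. It assumes a hypothetical site backed by a small SQLite table called `papers`; the table, its rows, and the function names are all invented for illustration. The first query plays the role of a results page (ids and titles only), and the second follows each id to pull the detail record that a link-following crawler would never reach on its own.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database standing in for a site's back end (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
    conn.executemany(
        "INSERT INTO papers (id, title, body) VALUES (?, ?, ?)",
        [
            (1, "Coral reef survey", "Bleaching data collected from 40 reef sites."),
            (2, "Reef fish counts", "Annual fish population counts by species."),
            (3, "Glacier retreat", "Ice thickness measurements since 1980."),
        ],
    )
    return conn


def search_summaries(conn, term):
    """First query: the 'results page', returning only ids and titles."""
    rows = conn.execute(
        "SELECT id, title FROM papers WHERE title LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return rows.fetchall()


def fetch_detail(conn, record_id):
    """Second query: the 'detail page' for one id taken from the results page."""
    row = conn.execute(
        "SELECT body FROM papers WHERE id = ?", (record_id,)
    ).fetchone()
    return row[0] if row else None


def crawl(conn, term):
    """Chain the two queries and return {title: body}, ready to hand to an indexer."""
    index = {}
    for record_id, title in search_summaries(conn, term):
        index[title] = fetch_detail(conn, record_id)
    return index


if __name__ == "__main__":
    conn = build_demo_db()
    for title, body in crawl(conn, "reef").items():
        print(title, "->", body)
```

A real site would expose these two steps as a search form and a detail URL rather than direct SQL, so the generic part is the chaining, while the per-site part is knowing which form fields and parameters to fill in.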

This is a good thing.
posted by F Mackenzie at 7:27 PM on March 2, 2004


Google needs competition. It's "Free Market thing" - Go!
posted by troutfishing at 8:39 PM on March 2, 2004


Even if all this does is encourage Google to do it better (which is what I suspect will happen), it's still a Very Good Thing.
posted by Opus Dark at 8:44 PM on March 2, 2004


The company also confirmed that its new search tool, unveiled two weeks ago, will allow companies to pay to have their Web sites included in regular search results.
...which will be the undoing of Yahoo! yet again. Still, competition is good.
posted by dg at 11:25 PM on March 2, 2004


Google already indexes dynamic content it finds on pages.
posted by SpaceCadet at 2:01 AM on March 3, 2004


Oops - unfortunately I posted and ran, so I'm only just getting back to this now.

and what about privacy? is this stuff supposed to be available?

Yes. There are tons of resources on the web that the owners/hosters really want to make available, but can only provide access to through their own custom database searches. Yahoo is not hacking into private databases and posting the content by any means. They are making mutually beneficial agreements with groups that have data that they want seen by the public. Many research projects have a mandate in their funding to make their findings and/or collected data publicly available, and often to do public outreach. Unfortunately, the amount of funding allotted to these efforts is usually too small to do anything particularly effective. Efforts like Deep Web mining and the semantic web will help bring some real gems to light.

Google already indexes dynamic content it finds on pages.
Assuming that content actually manifests as a textual web page, yes. There is lots of content out there that never does, though.

Oh, and as much as it might sound like I'm cheerleading, I'm not a Yahoo person, just a data mining geek.
posted by badstone at 10:18 AM on March 3, 2004



