The Open Library
July 16, 2007 1:23 PM
Cool. I like to keep my posts terse.
posted by chunking express at 1:41 PM on July 16, 2007
Somewhere Borges is turning in his grave...
posted by PostIronyIsNotaMyth at 1:47 PM on July 16, 2007 [1 favorite]
Borges' grave, one presumes.
posted by cortex at 2:04 PM on July 16, 2007 [4 favorites]
out-of-print only? it's Project: Gutenberg on steroids?
What happened to all the data havens I was promised? I'm looking for someone to pirate the google.books hoard and the amazon search-inside feature, and then make it all freely available from an oil rig in the Pacific or a satellite in geo-synchronous orbit over Washington, DC. This is not the jetpack I was promised!
posted by anotherpanacea at 2:12 PM on July 16, 2007
I was going to ask, how is this different from Project Gutenberg? Especially for mind-bogglingly-large projects like this, I can't see duplication of effort being a good thing.
posted by cyrusdogstar at 2:24 PM on July 16, 2007 [1 favorite]
When it was announced that the Library contained all books, the first reaction was unbounded joy. All men felt themselves the possessors of an intact and secret treasure. There was no personal problem, no world problem, whose eloquent solution did not exist - somewhere in some hexadecimal. The universe was justified; the universe suddenly became congruent with the unlimited width & breadth of humankind's hope.
posted by UbuRoivas at 2:28 PM on July 16, 2007 [2 favorites]
out-of-print only?
Huh? I've found every in-print book I've searched for.
posted by scottreynen at 2:39 PM on July 16, 2007
As I read it: mainly, they're trying to set up an open book catalog, gathered from as many book catalogs as possible.
T
If a given book's out of print, they might offer its text. This overlaps with Project Gutenberg so strongly that I'm sure there's cooperation involved.
If they want to provide links to different places to look for a given book, then they might look at the LibraryLookup Project or Wikipedia book sources, because there is probably some overlap there too.
I'm honestly surprised Jessamyn hasn't said anything yet.
posted by Pronoiac at 2:49 PM on July 16, 2007
What does she know that she isn't saying?
<dramatic music/>
posted by blue_beetle at 2:50 PM on July 16, 2007
Jessamyn blogged it, but hasn't mentioned it here.
posted by RobotHeart at 2:51 PM on July 16, 2007
It looks like the only full-text books are items in the public domain (Project Gutenberg and other archives). Anything else is merely just metadata on the books. Nothing very exciting yet. Sadly it has a long way to go before being anything close to its claim. Right now it is just a wiki with a bunch of book titles containing some full text but mostly just CIP information, unless I am missing something.
Also it appears to just scrape some sort of giant publishing index without much human intervention. Do a search on "Da Vinci Code" and you get things like Fodor's Da Vinci Code Companion Counter Display and The Da Vinci Code 36-copy Mass Market EAN Floor Display.
posted by Razzle Bathbone at 2:52 PM on July 16, 2007 [1 favorite]
It looks like they're sorta aiming for WorldCat, only more user-editable. More like openlibrarycatalog.org, amirite?
They do search a bit of text for recent books - I searched for one book, & it came up in a blurb for another.
RobotHeart: I see.
posted by Pronoiac at 3:04 PM on July 16, 2007
"This unbridled hopefulness was succeeded, naturally enough, by a similarly disproportionate depression. The certainty that some bookshelf in some hexagon contained precious books, yet that those precious books were forever out of reach, was almost unbearable."
posted by blucevalo at 3:05 PM on July 16, 2007
There is full text for A Connecticut Yankee in King Arthur's Court. Nothing else I've searched for yet.
I was going to ask, how is this different from Project Gutenberg?
For one thing it seems that this project uses scanned book pages whereas Gutenberg uses plain text files. Also I don't think Gutenberg has a listing for How To Make Money Like a Porn Star and KISS and Make-Up (by Gene Simmons).
(Try "making comics" as your search. The box suggests random terms. I don't know what KISS has to do with comics, either.)
posted by Tehanu at 3:20 PM on July 16, 2007
If I understand things correctly, OpenLibrary:Bowkers :: Wikipedia:Britannica.
posted by chrominance at 3:25 PM on July 16, 2007 [1 favorite]
Why all the confusion? It's simply a catalog of all books that have ever been published. Very simple idea. Estimates are around 150 million books have ever been published, but no one really knows for sure.
I wish them luck; it's an ambitious project that could take generations to complete. I'm not sure it will be reliable, but it may be useful, in a Wikipedia sort of way.
posted by stbalbach at 3:56 PM on July 16, 2007
Well, I can come up with a few other intro lines, depending on how the project develops:
Imagine a library that became instantly overshadowed and made irrelevant by better funded and more highly developed projects already in existence. We're building that library.
or
Imagine a library that was so blatantly guilty of copyright infringement that it was sued out of existence. We're building that library.
or
Imagine a library that sounded cool one weekend when we were all baked on this sweet weed from Oaxaca, but became a drag to work on once we had to get real jobs so we all gave up after about a year and a half. We're building that library, dude.
posted by Muddler at 4:53 PM on July 16, 2007 [3 favorites]
Only 150 million books have ever been published? Is that really true? If so, remarkable. Source?
posted by MarshallPoe at 5:05 PM on July 16, 2007
Every time I see Project Gutenberg, I think Project Steve Gutenberg— an open repository of every Police Academy movie.
posted by klangklangston at 5:06 PM on July 16, 2007 [3 favorites]
It's a bit of a mystery to me why this hadn't been done long ago. It's basically IMDB for books (unless I missed something significant). It always seemed anomalous that whenever you want to link to a book, you end up linking to Amazon, which is a seller of books, not a provider of information about them (though of course there is some overlap).
posted by MetaMonkey at 5:12 PM on July 16, 2007
stbalbach: They're calling "a collection of card catalogs about books" a library, instead of an index or catalog. That's not clear - that's confusing, if not deliberately misleading.
posted by Pronoiac at 5:36 PM on July 16, 2007
Muddler, you should totally check out the site this post is about. From your comment, I assume you haven't yet - if you did, um, what the hell are you talking about?
posted by freebird at 5:55 PM on July 16, 2007
Oh, me likey.
As for only 150 million books being published I'd be interested in that figure as well.
posted by banannafish at 6:00 PM on July 16, 2007
The other interesting thing here is the "semi-structured wiki" it seems they've made to run this thing on. I think it can be very useful for other things.
I've long wanted to create a replacement for IMDB, using a structured wiki as a base. There's lots of stuff you want to put into fields, but you also want to make it editable and versioned. What I came up with last I sat down and wrote some specs was very close to what they've written, even up to the choice in platform (PostgreSQL and an O-R mapper in Python).
IMDB needs to be replaced. If nothing else, then because it's very much not open. It never has been, and especially isn't now, and especially not compared to stuff like Wikipedia. There's a bunch of info about films in Wikipedia that could be semi-automatically parsed into something that would serve as a starting point for an IMDB replacement, by the way, which would make the prospect a lot less daunting. Also, IMDB has only a few hundred thousand movies, and has only rudimentary info about the majority of them. Compared to the size Wikipedia has achieved in just a few years, it seems pretty doable.
I don't have the time I'd like to work on something like this, but now that people seem to be creating base technology that can be used for a project like that, it might be more doable. I'll be keeping an eye on the source as it's developed, at least.
(After that, please replace All Music Guide.)
posted by Joakim Ziegler at 6:20 PM on July 16, 2007
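For anyone wondering what "fields, but editable and versioned" might look like in practice, here is a minimal Python sketch. It is not Open Library's code or schema: the `Record` and `Revision` classes, the field names, and the in-memory list standing in for a PostgreSQL table are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Revision:
    """One saved version of a record: its structured fields plus an edit note."""
    fields: Dict[str, str]
    comment: str


@dataclass
class Record:
    """A wiki-style page whose content is named fields rather than free text."""
    key: str
    revisions: List[Revision] = field(default_factory=list)

    def current(self) -> Dict[str, str]:
        """Latest field values, or an empty dict if nothing has been saved."""
        return self.revisions[-1].fields if self.revisions else {}

    def edit(self, changes: Dict[str, str], comment: str) -> int:
        """Apply changes on top of the latest version; return the new revision number."""
        merged = dict(self.current())  # copy, so older revisions stay untouched
        merged.update(changes)
        self.revisions.append(Revision(merged, comment))
        return len(self.revisions)

    def at(self, number: int) -> Dict[str, str]:
        """Field values as of a given 1-based revision number."""
        return self.revisions[number - 1].fields


# Hypothetical usage: two edits to one book record.
book = Record("dune")
book.edit({"title": "Dune", "author": "Frank Herbert"}, "initial import")
book.edit({"year": "1965"}, "add publication year")
```

Every edit appends a full snapshot, so any old version can be read back with `at()`, and the fields stay queryable in a way free wikitext is not. A real system would presumably store diffs or rows in a database instead.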
As for only 150 million books being published I'd be interested in that figure as well.
It is an estimate; no one knows how many books have ever been published. There are a number of estimates, but 150m is one that stuck with me, based on some reasonable-sounding deductions from library holdings whose details I no longer remember. In any case it is very likely in that order of magnitude, maybe double that, maybe half that.
As for the number of books people actually read, take a look at LibraryThing's "Zeitgeist" link. In a non-random poll of 250+ thousand readers, there are about 2 million unique works. Again, the order of magnitude is probably in the realm of 2 to 20 million unique books that people alive today are actively reading, probably more towards the lower end since LibraryThing users tend to be more serious readers and thus have the greatest range of unique works.
The numbers for what people read versus what actually exists are very different; the vast majority of books are no longer read at all. One only has to scroll through Internet Archive's "recent additions" and ask "who reads this stuff?" - and they only have 250,000 works (thus far).
posted by stbalbach at 6:42 PM on July 16, 2007 [1 favorite]
I've long wanted to create a replacement for IMDB, using a structured wiki as a base
Semantic MediaWiki is a good design for that.
posted by stbalbach at 6:45 PM on July 16, 2007
They're calling "a collection of card catalogs about books" a library, instead of an index or catalog. That's not clear - that's confusing, if not deliberately misleading.
Yeah, my local library website doesn't actually contain the full text of many books either. Sometimes I even go there and they have the nerve to refer me to another branch to get a book promoted in their catalog. I think I probably have a good case for a false advertising lawsuit, as long as the jury doesn't realize that it's completely ridiculous to expect a library website to provide full text of books in blatant violation of copyright law.
posted by scottreynen at 6:45 PM on July 16, 2007
It always seemed anomalous that whenever you want to link to a book, you end up linking amazon, which is a seller of books, not a provider of information about them
That's because the information is either hard to get at via the internet (e.g. government sources like the Library of Congress for books published in the States) or else locked up behind a paywall (e.g. Bowker's Books in Print). Furthermore, booksellers have been latching onto metadata and the various neat things you can do with metadata: you're more likely to buy a book if you know something about it, including basic bibliographical data. A side effect is that online retailers who excel at collecting metadata become the best source of information available to the public.
posted by chrominance at 6:49 PM on July 16, 2007 [1 favorite]
The semantic/semi-structured wiki/DB thing is a great idea - pretty much the natural evolution of the wiki, and probably an idea a lot of people have mulled over. But making such a system work well is going to be a real challenge, probably requiring a fair amount of trial and error. It probably takes someone like Brewster Kahle to kick off this sort of idea.
However it works itself out, though, it seems inevitable that some kind of open-wiki-DB will obsolete the closed online databases like IMDB, AMG, etc. I certainly hope so.
posted by MetaMonkey at 7:15 PM on July 16, 2007
stbalbach: As I understand Semantic MediaWiki, it isn't, really. Semantic MediaWiki is more about marking up parts of wikitext with semantic information, so it can be more easily processed by machines. If they did that to everything in Wikipedia, it would certainly be easier to make something like a movie database out of the data, but you'd still need a pretty different architecture to make the kinds of searches you do in IMDB (not to mention the kinds of searches you should be able to do in IMDB) possible.
posted by Joakim Ziegler at 7:16 PM on July 16, 2007
Freebase is also doing something similar, but much more meta.
posted by MetaMonkey at 7:19 PM on July 16, 2007
It couldn't find three of the four obscure books I threw at it.
posted by dmd at 7:29 PM on July 16, 2007
For one thing it seems that this project uses scanned book pages whereas Gutenberg uses plain text files.
What on earth could possibly be the justification for doing that? Images? Non-content-searchable, non-indexable images of text? If true, that's just dumb.
posted by stavrosthewonderchicken at 7:30 PM on July 16, 2007
Metamonkey: It's a bit of a mystery to me why this hadn't been done long ago.
Well, there's WorldCat (already mentioned upthread). Which really seems pretty similar to this, although it doesn't have the full-text of out of copyright works. But as a catalogue, it seems very similar - allowing users to comment on works, find them in libraries, etc.
posted by Infinite Jest at 7:43 PM on July 16, 2007
I kinda meant that the mystery was it hadn't been done properly, where the definition of properly is 'replaces Amazon for linking to books'. No-one is ever going to link to worldcat as a general bookDB because it is generally less useful/user-friendly than amazon, even though half the page on amazon is cross-selling. Compare the book Dune on worldcat, amazon and wikipedia.
posted by MetaMonkey at 8:03 PM on July 16, 2007
What on earth could possibly be the justification for doing that? Images? Non-content-searchable, non-indexable images of text? If true, that's just dumb.
This is a common query by people who primarily read fiction. One reason is to create an archival copy which can be used to re-create an almost identical reproduction. This is useful for archivists as well as projects such as the Internet Archive bookmobile.
Another reason is that for some material, text isn't sufficient to fully capture the content of the work. Mathematical equations, scientific illustrations, diagrams, photos, etc. need to be preserved as well.
Lastly, the images are OCRed. Usually when there is a book archived as images, there is also corresponding text, html, and xml that gives coordinate positions of the words (useful for creating searchable pdfs).
posted by rajbot at 8:57 PM on July 16, 2007
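To make the "xml that gives coordinate positions of the words" point concrete, here is a small Python sketch of searching a page scan through its OCR layer. The XML layout below is invented for illustration (the element name `word` and the attributes `x`, `y`, `w`, `h` are assumptions, not the Internet Archive's actual format), and the sample page text is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output for one scanned page: each word carries its
# bounding box in pixels. Real OCR formats use different names and layouts.
PAGE_XML = """
<page number="12" width="1200" height="1800">
  <word x="100" y="200" w="90" h="30">Camelot</word>
  <word x="200" y="200" w="40" h="30">was</word>
  <word x="250" y="200" w="70" h="30">quiet</word>
  <word x="100" y="240" w="90" h="30">Camelot</word>
</page>
"""


def find_word(page_xml, query):
    """Return (x, y, w, h) boxes for every case-insensitive match of query."""
    root = ET.fromstring(page_xml.strip())
    hits = []
    for word in root.iter("word"):
        if (word.text or "").lower() == query.lower():
            # The box tells a viewer where to draw a highlight on the page image.
            hits.append(tuple(int(word.get(k)) for k in ("x", "y", "w", "h")))
    return hits
```

Calling `find_word(PAGE_XML, "camelot")` returns the two boxes where the word appears, which is roughly the information a searchable PDF needs to lay invisible, selectable text over the page image.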
Imagine a library that collected all the world's information...
I remember seeing an episode of Planet of the Apes (the TV series) where such a computer was discovered.
It was printing out said information on a dot matrix printer IIRC.
posted by uncanny hengeman at 9:29 PM on July 16, 2007
Another reason is that for some material, text isn't sufficient to fully capture the content of the work. Mathematical equations, scientific illustrations, diagrams, photos, etc. need to be preserved as well.
OK, this I can see, sort of. Although there are much better technologies than flat images of full pages for that kind of mixed media, it seems to me.
One reason is to create an archival copy which can be used to re-create an almost identical reproduction.
This, not so much. Smacks a bit of fetishism to me -- it seems to me that it's not the object that we care about, or ought to, in terms of a book, but the information it contains, whether textual or otherwise.
posted by stavrosthewonderchicken at 9:38 PM on July 16, 2007
It's not a library book if it doesn't have the dates stamped inside the front cover.
posted by louche mustachio at 1:37 AM on July 17, 2007
PDF (though proprietary) probably would make a better scanned image than a straight up TIFF.
posted by chunking express at 4:42 AM on July 17, 2007
I'm honestly surprised Jessamyn hasn't said anything yet.
I'm still on the way back from the 8th Anniversary MeFi party. The plan, as I understand it (I went to an early planning meeting for this, but I'm not part of the project) is to have a way to get library cataloging records to people who need/want them without them having to go through huge borg-like corporations who take what is essentially a collaborative process (cataloging of books that go into libraries) and lock it up and make the results proprietary. If Brewster had his way, we'd have the text of all the books there as well.
I see it this way: "Given: digitization of texts is happening and continued progress in this direction is inevitable. Therefore: let's have a system set up which can be the catalog for all that great text and not wait for Bill Gates (or OCLC) to build a crappy version and sell it back to us." WorldCat has some serious problems since anyone can use the catalog (and there are some lame wiki aspects) but only people who pay can contribute records. Good for quality control, bad for openness.
The plan is to have records and texts and a page per "book" which is still a confusing idea that isn't really all the way done and thought out yet -- not that they're not thinking about it, just that choices have to be made that haven't been made at this stage yet.
Exciting times to be a librarian. I like this.
posted by jessamyn at 5:52 AM on July 17, 2007 [1 favorite]
Smacks a bit of fetishism to me -- it seems to me that it's not the object that we care about, or ought to, in terms of a book, but the information it contains, whether textual or otherwise.
It's much cheaper to start that way, and OCR is pretty accurate, which enables search (à la Google Books). You can acquire a page image for five to eight cents, say, factoring in storage, labor, and scanning equipment costs over several years, but getting that page typed in and verified, even in areas where labor is relatively less expensive than in the United States or Europe (the Philippines, Romania, parts of India all have text-entry industries) is much, much more expensive. It's also time-consuming, whereas a few fast book scanners can potentially churn through, OCR, and post hundreds of books a day assuming the bibliographic data is already available. If the book scans have an open license, you could perhaps engineer a system to automatically paste the OCR output into Wiki pages for volunteers to clean up, using the scans as a reference to fix OCR errors.
I doubt it has much to do with fetishism. You have to start somewhere, and if you start with the scan you make it possible to later create full text when time and budget allow--given the volume of books they're discussing, and the amount of labor involved and the resultant costs, this may mean decades. But in the meantime you have a book available online which is, with some good engineering, full-text searchable. In addition, the experience of the page--typographic choices, pagination, flyleafs, marbled endpapers, mottled lithographs, and so on, is maintained for those who find that sort of thing important.
posted by ftrain at 5:59 AM on July 17, 2007 [1 favorite]
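As a back-of-the-envelope check on the "five to eight cents" per page figure, here is a short Python sketch. The 300-page book length is an assumption picked for illustration. Only the per-page range comes from the comment above.

```python
def scan_cost_cents(pages, low_cents=5, high_cents=8):
    """Return the (low, high) scanning cost in whole cents for a book.

    The 5 to 8 cents per page range is the estimate quoted in the thread;
    working in integer cents avoids floating-point rounding surprises.
    """
    return pages * low_cents, pages * high_cents


# Assumed example: a 300-page book.
low, high = scan_cost_cents(300)
print(f"${low / 100:.2f} to ${high / 100:.2f}")  # prints "$15.00 to $24.00"
```

At that rate a 300-page book comes to somewhere between fifteen and twenty-four dollars to scan, which is the baseline the comment compares against the higher cost of typing and verifying every page.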
Alrighty, then. That makes sense to me; thanks for the educatin'.
posted by stavrosthewonderchicken at 6:32 AM on July 17, 2007
PDF (though proprietary) probably would make a better scanned image than a straight-up TIFF.
I don't follow. It'd be a PDF composed of TIFFs, wouldn't it? Do you just mean the default viewer experience, assuming they have Acro Reader or whatever but not a nice TIFF-specific viewer?
posted by cortex at 6:43 AM on July 17, 2007
I'm pretty sure I have PDFs of books where the text is selectable, but there are images as well. So the book looks like it would in the real world, but can be searched, etc. Maybe it's all just in my head. (A PDF which is just a collection of TIFFs wouldn't be so useful, as you point out.)
posted by chunking express at 6:56 AM on July 17, 2007
(And there's a distinction here that I didn't really address: if you can get a nicely made PDF from the publisher, then yes, great, much better than raw TIFFs; and if barring that you have the resources to create a nice PDF with good formatting as a followup step to home-rolled OCR of an existing physical text, also great; but if you're working at scale and trying to convert a huge amount of material, you're just going to scan and then OCR the TIFFs and be left with that. The difference is either compliance from publishers or considerable added cost for the project.)
posted by cortex at 7:20 AM on July 17, 2007
Harper's scanned and published their entire catalog of magazines. The PDFs are pretty much exact copies of the pages from the magazine, but with selectable text. I'm not sure how well what they did would scale, of course, but it is a nicer alternative to raw TIFF files, I would say.
posted by chunking express at 7:55 AM on July 17, 2007
Absolutely. And more expensive to produce. ftrain's notion of using volunteer labor to handle the post-scan production elements is a good one, but even that would be a difficult-to-impossible project for any real volume of, er, volumes. I'm just skeptical of this group having the budgetary freedom to do anything more than simple TIFF + OCR for giant piles of books. The cost-per-image of even that is going to be in significant fractions of a dollar, and that multiplies up pretty quick.
posted by cortex at 8:23 AM on July 17, 2007
Vernor Vinge's latest (awesome) novel Rainbows End imagines a scenario in which destructive scanning of books is much cheaper than scanning and saving the original book. A nasty dilemma.
(Gonna be a while before anything is as useful as Amazon to link to for book information.)
posted by straight at 9:28 AM on July 17, 2007
Gonna be a while before anything is as useful as Amazon to link to for book information
I agree to a certain extent, since Amazon has such a strong collection of data. But then again, who would have predicted Wikipedia's phenomenal growth? Open Library have already got data from LibraryThing, and are attempting to get data from pretty much everywhere else. If this gets a little net momentum behind it, it could well get big quickly. Actually being able to do decent search/DB-like queries would be a huge draw, and I'd wager a lot of book-lovers would be willing to help out with the wiki, if the project manages to make some waves.
posted by MetaMonkey at 9:41 AM on July 17, 2007
scottreynen: It's like a library website, only without the library? But it says 'library' right in the name! I didn't go to openbookdb.org!
Everyone knows about Project Gutenberg's Distributed Proofreading, right? Volunteers check a page or two for OCR errors.
posted by Pronoiac at 10:31 AM on July 17, 2007
The Open Library project could be more useful with Dewey Decimal numbers or ISBNs, and links to other related books in each entry would make it easier to move between records or just search the database. Other information for an entry could include multiple publications in various languages or locations.
posted by JJ86 at 1:32 PM on July 17, 2007
chunking express, the Harper's scanning project you mention was carried out by the same ftrain who educated stavrosthewonderchicken, above.
posted by cgc373 at 2:12 PM on July 17, 2007
Metafilter: only people who pay can contribute records. Good for quality control, bad for openness.
posted by darkripper at 5:18 PM on July 17, 2007
cgc373, I knew ftrain did the work, I just didn't notice he had commented.
ftrain, I love Harper's. The new site is awesome. Fucking Awesome in fact.
posted by chunking express at 5:48 PM on July 17, 2007
cortex, I really don't think these scans will be TIFFs. I use a scanner with document feeder at work and the software the scanner comes with automatically outputs nice searchable PDFs.
posted by any portmanteau in a storm at 6:29 PM on July 17, 2007
freebird, um, yes I did look at the site, and um, what the hell are you talking about? I suspect you don't know much about the several other projects that are under way, better funded, and further along; copyright law; or the attention span/capabilities of your average 20-something techie who gets an idea one weekend and promptly fails to see it through to completion. For reference, see the dot-com bubble burst and massive failure of start-up companies right around the year 2000.
Just to play devil's advocate even further, here are quotes from the website:
What if there was a library which held every book? Not every book on sale, or every important book, or even every book in English, but simply every book—a key part of our planet's cultural legacy. . . . Second, it must be grandly comprehensive.
The copyright implications are massive, and those that assume that this is all fair use and so forth don't know copyright law.
But most importantly, such a library must be fully open. Not simply "free to the people," as the grand banner across the Carnegie Library of Pittsburgh proclaims, but a product of the people: letting them create and curate its catalog, contribute to its content, participate in its governance, and have full, free access to its data. In an era where library data and Internet databases are being run by money-seeking companies behind closed doors, it's more important than ever to be open.
Yup, and those other projects are better funded, further along, and more capable.
Earlier this year, a small group of people gathered at Internet Archive's San Francisco office to discuss whether this was possible. Could we build something so grand? We concluded that we could.
Yup, that's a strong, stable group of people to see this through...look at the list of people involved. Good luck keeping them on this project for oh, say, the next several decades. Without funding, this is a hobby.
Anyway, best of luck. If this works, congrats. Otherwise I'll be old fashioned and go to the library, god forbid buy a book, or use Google's product.
posted by Muddler at 3:11 AM on July 18, 2007
In an era where library data and Internet databases are being run by money-seeking companies behind closed doors, it's more important than ever to be open.
Yup, and those other projects are better funded, further along, and more capable.
I'd like you to name them.
I don't know a single project along these lines where the data is open to being changed/improved/edited and where the data is provided to others to repurpose etc. I don't even know of another project that has an open API so that people can even build things off of it. LibraryThing has a semi-open API but they keep the more salable parts of it [user tagging] unavailable so they can use them as a revenue stream; nothing wrong with that, but it does impact the usefulness of their data.
It's possible that since I was around at the conception of the project, I understand more of what it's going to (hopefully) turn into than what it currently is, but I think it's also possible that when this project gets out of beta it will kick ass over every similar book-records-thing that exists.
At some point Brewster & Co at the Internet Archive realized that they weren't going to be able to "win" the scanning project, so they decided they'd rather be the go-to place for people to FIND the results of others' scanning projects. This works well for that, or will, or could.
I seriously do not know another company that is creating anything like this that allows remixing and data editing, to name two basic functions that this project has. Also, in this case, the funding is there. I can't speak to the rest of the criteria you list as caveats; they may certainly be true. Copyright is a big hurdle, but having people working on it who already have a team of copyright lawyers who fight for this stuff all the time (as the Internet Archive does) seems like a step in the right direction.
Put another way, library catalogs in their current state are a huge joke. They are insanely expensive for being a fancy relational database. They are hard for library staff and patrons to use. They have idiotic upgrade schedules that mean that they're always breaking, and they DON'T WORK WELL, which is the worst part. How nice would it be if there was a centralized location where you could get book data, build your own thing on top of that API, pick and choose what elements you wanted and not have to recreate a database of book (and other library items) record data for every library in the damned world?! The way libraries do things now is broken and crazy and makes librarians look like idiots and assholes by continuing to promote this model of finding information.
This works better than that and doesn't put us in the pockets of other corporate idiots who build lousy products. That's a win. The openness parts and the sharing parts mean that if we don't like the way it looks or functions we can change that. Try correcting a record on WorldCat. Try getting them to email you back if you're not a paying partner.
posted by jessamyn at 11:45 AM on July 18, 2007
Intro, tour, how you can help. By Aaron Swartz, who sold Reddit to Wired, and then quit, and Brewster Kahle, of the Internet Archive. Wikipedia meets Amazon?
posted by scottreynen at 1:34 PM on July 16, 2007