

Google, the Library
December 14, 2004 1:15 AM

Google to team up with the University of Michigan and Harvard University to make their extensive libraries available online. According to the agreement, Google will make available all books in the public domain; the universities can put the material to whatever use they see fit. Others have made attempts before, but none with the sheer might of Google. [via /.]
posted by Civil_Disobedient (71 comments total)

 
From the first article:
Besides digitizing U-M's massive collection, Google plans to scan parts of other research libraries, including those at Harvard, Stanford, Oxford University in England and the New York Public Library. Those projects are much smaller in scope than Google's plans for U-M. At Harvard, for example, only 40,000 of the university's 15 million volumes will be digitized.
posted by Civil_Disobedient at 1:17 AM on December 14, 2004


Ahhh, academic heaven.
posted by Mach3avelli at 1:49 AM on December 14, 2004


[this is good]
posted by Dreamghost at 1:54 AM on December 14, 2004


Skynet, up and running in 2007. I, for one, welcome our new google overlords.
posted by Derek at 3:26 AM on December 14, 2004


Tomorrow.. THE WORLD!
posted by TwelveTwo at 3:30 AM on December 14, 2004


Derek will now be erased from history.
posted by athenian at 3:31 AM on December 14, 2004


Derek? Who are you talking about?
posted by TwelveTwo at 3:42 AM on December 14, 2004


This was on NYT and slashdot... Is it worth reposting on mefi?
posted by about_time at 4:16 AM on December 14, 2004


It's worth at least sixteen quatloos. I freely admit that I'm not sure if metafilter is too high-rent for that, though.

Here's the first place I saw it, and it definitely trips the action potential of whatever neurons signal "neat!" in my noggin.
posted by Drastic at 4:21 AM on December 14, 2004


This was on NYT and slashdot... Is it worth reposting on mefi?

Lots of people don't read NYT or slashdot and neither of these have a relationship with MeFi such that links can be regarded as exclusive. Perhaps you should focus more on finding some good links for yourself rather than trying to Metapolice with what appears to be your first contribution of any kind as a member. And take it to MeTa if you have any further problem with it.
posted by biffa at 5:07 AM on December 14, 2004


I don't read the NYT or slashdot, so I'm pretty happy this was posted.
posted by signal at 5:09 AM on December 14, 2004


"1,600: Years it would take U-M to digitize all 7 million volumes without Google's special technology."

Anyone have any more info on this special scanning technology? Granted, I'm sure it's all top-secret (the article says of it "Google won't discuss in detail"), and if someone really knew...they'd immediately be scanned and indexed by a Google-cleaner...just curious.
posted by tpl1212 at 5:19 AM on December 14, 2004
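
The "1,600 years" figure quoted above can be turned into an implied scanning rate with a little arithmetic. The following is only a back-of-the-envelope sketch: the two input numbers come from the quoted article, while the function name and the assumption of scanning on every calendar day are illustrative choices, not anything the article states.

```python
# Figures quoted from the article: 7 million volumes, 1,600 years.
VOLUMES = 7_000_000
YEARS = 1_600


def implied_rate(volumes: int, years: int) -> tuple[float, float]:
    """Return (volumes per year, volumes per day) implied by the quoted figures."""
    per_year = volumes / years
    # Assumption: scanning happens on every calendar day of the year.
    per_day = per_year / 365
    return per_year, per_day


per_year, per_day = implied_rate(VOLUMES, YEARS)
print(f"{per_year:.0f} volumes/year, about {per_day:.1f} volumes/day")
```

In other words, the quoted estimate corresponds to roughly a dozen volumes a day without the new process, which gives a sense of the speed-up being claimed.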


I'm a Harvard librarian and I weep for my minions whose jobs are marching towards obsolescence. Of course, with only 40,000 items being scanned (likely the next phase of the Harvard Open Collections Program that banjo_and_the_pork posted about previously), they still have a long way to go.

Most of our students (b-school) won't look at information unless it's online. The trick comes when we try and get journals online. We have the ability to scan them quickly, but the battle for copyright and permissions would be a long one.

Of course, indexing and storage could make this a nice big white elephant.

Thanks for posting this, my library's going through a big strategic planning session this week, so this kind of info is good to have (what, you think Harvard tells its librarians what's going on? Ha!!)
posted by robocop is bleeding at 5:31 AM on December 14, 2004


"1,600: Years it would take U-M to digitize all 7 million volumes without Google's special technology."

Anyone have any more info on this special scanning technology?


No ... but oddly enough, the NYT article says "Google's technology is more labor-intensive than systems that are already commercially available."
posted by pmurray63 at 5:46 AM on December 14, 2004


I hate to spoil the party, but did everyone read that last paragraph?

What users will see when they search the U-M collection online depends upon whether the information is still covered by copyright. For older items, users will be able to search for and read every word on each page of a book or document. But for material under copyright, the university will put a short synopsis of the material online, with information that links to the publisher or libraries where the work can be obtained.

I'm not quite sure, but I believe this means that books written by people who have not been dead for at least 75 years will not be available, or only a "short synopsis" thereof, which is what you get at Amazon right now (in fact, I bet there'll be a direct link to your nearest online bookstore right next to that synopsis).
So that's maybe useful if you study literature or ancient history, but probably not so useful if you research anything in which there have been significant advances in the last 75 years, such as, oh let's see, just about everything.
posted by sour cream at 5:59 AM on December 14, 2004


what sour cream said.

Also, how will they deal with things still under copyright here, but available overseas?
posted by amberglow at 6:06 AM on December 14, 2004


For those that are interested, here's the description of the pilot program that Harvard's taking part in, as sent out in a mass email today:

Project Description:
Harvard's Pilot Project with Google

Harvard University is embarking on a collaboration with Google that could harness Google's search technology to provide to both the Harvard community and the larger public a revolutionary new information location tool to find materials available in libraries. In the coming months, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, an FAQ detailing the Harvard pilot program with Google will be available at http://hul.harvard.edu.

The Harvard pilot will provide the information and experience on which the University can base a decision to launch a large-scale digitization program. Any such decision will reflect the fact that Harvard's library holdings are among the University's core assets, that the magnitude of those holdings is unique among university libraries anywhere in the world, and that the stewardship of these holdings is of paramount importance. If the pilot is deemed successful, Harvard will explore a long-term program with Google through which the vast majority of the University's library books would be digitized and included in Google's searchable database. Google will bear the direct costs of digitization in the pilot project.

By combining the skills and library collections of Harvard University with the innovative search skills and capacity of Google, a long-term program has the potential to create an important public good. According to Harvard President Lawrence H. Summers, "Harvard has the greatest university library in the world. If this experiment is successful, we have the potential to provide the world's greatest system for dissemination as well."

In addition, there would be special benefits to the Harvard community. Plans call for the eventual development of a link allowing Google users at Harvard to connect directly to the online HOLLIS (Harvard Online Library Information System) catalog (http://holliscatalog.harvard.edu) for information on the location and availability at Harvard of works identified through a Google search. This would merge the search capacity of the Internet with the deep research collections at Harvard into one seamless resource - a development especially important for undergraduates who often see the library and the Internet as alternative and perhaps rival sources of information.

Eventually, Harvard users would benefit from far better access to the 5 million books located at the Harvard Depository (HD). If the University undertakes the long-term program, Harvard users would gain online access to the full text of out-of-copyright books stored at HD. For books still in copyright, Harvard users could gain the ability to search for small snippets of text and, possibly, to view tables of contents. In short, the Harvard student or faculty member would gain some of the advantages of browsing that remote storage of books at HD cannot currently provide.

According to Sidney Verba, Carl H. Pforzheimer University Professor and Director of the University Library, "The possibility of a large-scale digitization of Harvard's library books does not in any way diminish the University's commitment to the collection and preservation of books as physical objects. The digital copy will not be a substitute for the books themselves. We will continue actively to acquire materials in all formats and we will continue to conserve them. In fact, as part of the pilot we are developing criteria for identifying books that are too fragile for digitizing and for selecting them out of the project.

"It is clear," Verba continued, "that the new century presents unparalleled challenges and opportunities to Harvard's libraries. Our pilot program with Google can prove to be a vital and revealing first step in a lengthy and rewarding process that will benefit generations of scholars and others."

posted by robocop is bleeding at 6:30 AM on December 14, 2004


This was on NYT and slashdot... Is it worth reposting on mefi?
posted by about_time at 4:16 AM PST on December 14


Well, I just found the article on the BBC website - and thought "Ooh - this'd be good on MeFI!"... but then found that I'd been beaten to it! Boo.

I'm sure that my fpp wouldn't have been quite as comprehensive and link-y as Civil_Disobedient's though...
posted by Chunder at 6:33 AM on December 14, 2004


As far as limiting the copyright goes, though, I really don't see how they have a choice. US copyright law, however stupid, is the law in the land where the academic offices/servers reside. The primary advantage I can see here over amazon is that amazon, as far as I know, links to where you can buy said books, where the U-M has access to the academic interlibrary loan system.
posted by Karmakaze at 6:40 AM on December 14, 2004


Too bad they don't team up with Amazon who has already done something like this.

Amazon's efforts are impressive. Wonder if they could come up with a subscription model.
posted by dancingbaptist at 6:45 AM on December 14, 2004


Right, Karmakaze, but ILL departments currently will copy/scan chapters and articles for those who would normally be able to access the collection in order to deliver it to them. So the Google project material will likely be set up so only those who could normally check out the materials could view them online. HBS does something a bit similar with its case studies (except you need to pay for them if you're not an HBS affiliate). So unless the publishing houses want to pick a fight with Harvard, controlled access may likely be the way to go.

Most of HUL's new books get shipped off to the Harvard Depository (a big collection of Raiders of the Lost Ark-esque warehouses) rather than placed on the shelf in Widener. I've spent the past year sending old and dead journals out there too, so it'd be worthwhile if they were scanned for quicker access (not that anyone really wants The Journal of the Meat Packing Industry from 1926 very often).
posted by robocop is bleeding at 6:49 AM on December 14, 2004


Well, I just found the article on the BBC website - and thought "Ooh - this'd be good on MeFI!"... but then found that I'd been beaten to it!

You can congratulate yourself on your powers of observation. I'm quite serious; I'm absolutely certain this will be double-posted within the next couple of days, probably more than once. "Ooh, cool, Google news! I'll bet nobody on MetaFilter knows about it!"

As for the news, yeah, for now it's just going to be old stuff, but you've got to start somewhere, and for anyone interested in history it's going to be fantastic. If your main concern is, say, string theory, probably not so much.
posted by languagehat at 7:56 AM on December 14, 2004


This is excellent. Pity about the negative impact of copywrong, but at least everything will be in place when society finally wakes up and dumps it.

I'm a Harvard librarian and I weep for my minions whose jobs are marching towards obsolescence.

Sorry, but access to information > a million minions.
posted by rushmc at 8:04 AM on December 14, 2004


Amazon's implementation is NOT impressive. It's ruined by absurd searching (lack of full Boolean, for one), ridiculous results (sometimes you get an extract, sometimes you don't), and a limit to the number of pages one can view in the returned result. It's a selling tool, not a research tool.
posted by Mo Nickels at 8:11 AM on December 14, 2004


It's a selling tool, not a research tool.

Actually, it's both. Or could be (in Amazon's case it's certainly a selling tool). Access to the actual fulltext of a book is less important than having monographs and other non-electronic items represented in the result set of, say, a metasearch or even an aggregated database search.

With these items currently isolated to library catalogs (which almost certainly just contain bibliographic information), there is little motivation for the researcher to get up off his/her kiester to find if a book may be relevant to their topic, when there's so much other full text instantly available for them to possibly use.

I see this, actually, as a possibility to bring monographs and other non-electronic items back to being relevant and desirable to the average researcher.
posted by Human Stain at 8:36 AM on December 14, 2004


I appreciate this being an FPP - even though I read about it on boing boing last night. Worthy news which some may have missed. Thanks Civil_Disobedient for the link to CopyScape - a new resource with which I was not familiar. John Battelle's take on the program.
posted by ericb at 8:42 AM on December 14, 2004


So that's maybe useful if you study literature or ancient history, but probably not so useful if you research anything in which there have been significant advances in the last 75 years, such as, oh let's see, just about everything.
Or if, like my girlfriend, you study Victorian literature and spend a good chunk of time trying to dig up the more obscure stuff. Sure, there are plenty of topics where 75-year-old information is hopelessly outdated. In other areas of study, though, 75 years is just starting to lose the new-car smell.
posted by verb at 8:45 AM on December 14, 2004


More from John Battelle this morning.
posted by ericb at 8:53 AM on December 14, 2004


These are great times to be an academic.
posted by Quartermass at 9:18 AM on December 14, 2004


Here's a quick and dirty and really very homely chart of US copyright expiration information.

I just learned about this myself, so I'm still taking it in, but my first thought concerned the implications for the open access movement for scientific publishing. It's the obvious direction Google's heading. It's really dizzying. Academic science has been built on societies, which fund their activities on their publications and meetings, so we're talking about tremendous change not just to the information, but to its producers and distributors. There's a lot of excitement in watching the old edifices crack and give way, but the more free information gets, the more I wonder how we're going to pay for it, and how much of the really good stuff will be freely available.
posted by melissa may at 9:35 AM on December 14, 2004


thanks for that chart, melissa may. I am studying for my copyright exam (or should be) right now and that's a very handy format.

random query for robocop is bleeding: is there some reason why the Harvard library has tons of science fiction and fantasy novels in Hebrew only? Do donors give conditional bequests for the purchase of books or is this some cruel joke by the purchasing department?
posted by amber_dale at 9:54 AM on December 14, 2004


for more on academic open access
posted by leotrotsky at 11:10 AM on December 14, 2004


[sigh /] OK, I'll just point this out once, since the Googleites hate it when people rag on their creed, but: Google is not your friend. Google is rapidly becoming the most data-wealthy entity on earth. And this positions them to become the de facto mediators of access to academic data, too. Whether or not their "motives" (if a corporation can be said to have motives) are pristine is really quite irrelevant. If you love Google but: think Gary Webb was probably offed by the CIA; oppose WTO and World Bank policies; think that the power of large corporations ought to be curbed; or just generally bought into the ClueTrain, then I think you owe it to yourself to do some serious rethinking of your positions. Some of these blocks don't fit into the same shaped holes.
posted by lodurr at 11:15 AM on December 14, 2004


NPR had a story on this project this morning. From what I understood, all the volumes would be scanned in their entirety and the entire collection would be searchable; however, only a small section of the relevant text (relevant to your search) would be available from works still under copyright.
posted by gruchall at 11:29 AM on December 14, 2004


lodurr makes a good point. ObDisclaimer: "I am not a conspiracy nut" -- but -- this does need to be thought through. I am one of these academics of which they speak. The only thing information wants more than to be free is to be a non-monopolistic commercial commodity. Recent developments in on-line access have been a fantastic boon for research. And that particular brand of information that academics love - peer-reviewed journals and books - has always had commercial constraints, but piecemeal ones. A single publisher might charge for access to their information. What worries me is to look 20 years down the road, or 50, or 100. If the early internet saw the distribution of information as its most salient feature, will the later internet be defined by its re-centralization into the "google overlady"?
posted by Rumple at 11:37 AM on December 14, 2004


robocop is bleeding: (not that anyone really wants The Journal of the Meat Packing Industry from 1926 very often).

But man, if you did, having it online would be awesome. And there are a lot of interactions that may not be obvious. For example, let's say I'm modeling a late-twenties railroad in N scale. A meat packing facility is a great feature because you have at least three different types of trains coming and going with lots of volume of each, plus you need a place to ice the reefers.

The Journal of the Meat Packing Industry would be a perfect place to get pictures and operational details.

And I curse the overbroad extensions of copyright foisted upon us by Disney and their lackeys. The public domain is being strip mined with no return to the public.
posted by Mitheral at 11:43 AM on December 14, 2004


lodurr, you raise a good and interesting point. Ironically, it's not all that different from Elsevier having a monopoly on a vast amount of scholarly material.

At present it's hard to tell how well Google will work with academia. I mean, right now it's a fairly one-way street with academia handing its content over (again) and not having a whole lot of say about how they can get it back.
posted by Human Stain at 12:41 PM on December 14, 2004


Anyone have any more info on this special scanning technology?

Apparently, the University of Michigan already did a lot of the dirty work: Regent Emeritus Eugene Power and his company University Microfilms started microfilming a lot of their old catalog in the 60's, so I imagine that will knock a couple of years off the project. And there's probably a lot of redundancy between Harvard and Michigan's libraries.

Still, I'm particularly interested in what they're going to do for the rare/old books that haven't yet been microfilmed. You can't just slice 'em and automatically feed them into a scanner, you know?

Google is not your friend.

Maybe not, but my inner geek can't help but be impressed at their ingenuity. I mean, hell, people have been talking about projects like this for years, but nobody's done anything about it. Google, awash with IPO cash, is actually spending it on something constructive, instead of just buying up smaller companies.
posted by Civil_Disobedient at 12:53 PM on December 14, 2004


amber_dale: It was likely a bequest from some alumni/professor. Harvard has a policy of never turning down a sizable gift (because if we don't take your Hebrew scifi, you may not give us any money to help build a new library). We have all sorts of weird things in the stacks that we accepted with a smile.

Mitheral: The Journal of the Meat Packing Industry (aka Meat Cleaver Journal, as a minion calls it) would be of use to some, sure, but the cost involved with scanning, indexing, and storing a digital copy may outweigh its use. In addition to the not-all-that-paranoid stuff lodurr brought up, I would be concerned about a scenario where, after Harvard puts Meat Cleaver Journal online for all to see, most academic libraries dump their print copies. Then Harvard has a change of heart and takes it offline or gets its servers hacked. Suddenly, the information is gone.

There's a race going on in many of the large academic libraries to not be the library of record (to quote a director of another Boston college library I know, "If Harvard has it, why should I have it take up space on my shelves?"), such that those libraries left holding the copy will be able to exert some pressure on those without. I dunno if that jibes with my vision of what a library should be.
posted by robocop is bleeding at 12:53 PM on December 14, 2004


... academic libraries dump their print copies.

Most people just aren't at all aware of the immense pressures to reduce physical holdings. Every shelf-foot costs money, which is a commodity that even the wealthiest libraries have in short supply.
posted by lodurr at 1:02 PM on December 14, 2004


robocop: a little lost on how your minions would be losing jobs, unless your minions are responsible for reshelving physical books. otherwise this sounds like a great time to be a minion.

lodurr and rumple: great points. the participating libraries better make sure that they're making good, diverse and effective use of the scanned files they're getting in return, so google isn't the only way the majority of us can access these things digitally.

(on preview) robocop: heard a collection development director at a library once joke that Harvard is its off-site storage facility.. ditto your same concerns.
posted by bricsot at 1:03 PM on December 14, 2004


Libraries and the Assault on Paper. When I was living in New Haven the Sterling Library threw out huge quantities of old newspapers (nineteenth-century). Just dumped them out on the street. I have a love/hate relationship with libraries.
posted by languagehat at 1:44 PM on December 14, 2004


Re: use as a research tool. (I'm a fourth year in college and embroiled in a large reserach project.) Even if it can't give you the full text of recent sources, the mere fact that it allows you to search that text is incredibly helpful. For works whose titles do not make immediately obvious that they contain passages relevant to your research, this is a great alternative to poring backwards through bibliographies. Yes, you still have to have access to a large research library. With interlibrary loan, et al., that isn't so hard, especially if you know beforehand that the book you're ordering will contain relevant info. This is still a godsend.
posted by rustcellar at 2:21 PM on December 14, 2004


...research, jeez.
posted by rustcellar at 2:22 PM on December 14, 2004


Ah robocop is bleeding I see your concern. Who would have thought libraries would be actively looking for ways to throw books away? Not enough paranoid systems administrators take up librarian as a second career.
posted by Mitheral at 2:35 PM on December 14, 2004


libraries dump their print copies

Makes me recall the uproar that followed the British Library’s throwing out thousands of books and historical newspapers in 2000.

As well, memories of the criticism by novelist Nicholson Baker (to whose 2001 book “Double Fold: Libraries and the Assault on Paper” languagehat’s link above refers) of the San Francisco Public Library when they threw out their card catalog (replacing it with computers) and a significant quantity of books (about which he wrote in the New Yorker Magazine: “Discards,” April 4, 1994, and “Deadline: The Author's Desperate Bid to Save America's Past,” July 24, 2000).

In 1999 Baker established a non-profit corporation, the American Newspaper Repository to rescue old newspapers from destruction by librarians.

[Way off topic, but an interesting footnote…it was Nicholson Baker’s book “Vox” (a novel about phone sex) which Monica Lewinsky gave to President Clinton as a gift in 1997.]
posted by ericb at 3:15 PM on December 14, 2004


I wonder if Google is using their technology to scan bibliographies and footnotes (the original hyperlinks) to build a google "page rank" for their books.

To me, the full-text search is less interesting than the idea of building up a super-database which helps you find the 'most authoritative' sources as well as the least, essentially building giant electronic pyramids of related texts, articles, web sites, etc.
posted by chaz at 3:24 PM on December 14, 2004
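
The idea floated above, treating bibliographies and footnotes as links and ranking books the way web pages are ranked, can be sketched in a few lines. This is only an illustration of the general PageRank-style technique (power iteration with a damping factor), not a description of anything Google has said it is doing. The function name `citation_rank`, the damping value, and the book titles are all made up for the example.

```python
def citation_rank(cites, damping=0.85, iterations=50):
    """PageRank-style score for works in a citation graph.

    `cites` maps each work to the list of works it cites (its bibliography).
    Returns a dict of work -> score; the scores sum to 1.
    """
    # Collect every work mentioned, whether it cites or is only cited.
    works = set(cites)
    for targets in cites.values():
        works.update(targets)
    n = len(works)

    rank = {work: 1.0 / n for work in works}
    for _ in range(iterations):
        # Score held by works that cite nothing is spread evenly over all works.
        dangling = sum(rank[w] for w in works if not cites.get(w))
        new_rank = {w: (1 - damping) / n + damping * dangling / n for w in works}
        # Each citing work splits its score equally among the works it cites.
        for source, targets in cites.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical bibliography data: three works cite "Classic".
example = {
    "Survey": ["Classic", "Monograph"],
    "Monograph": ["Classic"],
    "Pamphlet": ["Classic"],
    "Classic": [],
}
ranks = citation_rank(example)
for title in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{title}: {ranks[title]:.3f}")
```

In this toy graph the most-cited work comes out on top, which is the whole point of the scheme. It is also a fair picture of its weakness: a score like this rewards whatever is already widely cited.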


Legally, can Google stop, say, Project Gutenberg from simply taking the stuff Google provides and being a second source?
posted by lbergstr at 3:47 PM on December 14, 2004


Sterling Library threw out huge quantities of old newspapers (nineteenth-century). Just dumped them out on the street.

OT a little, but... I worked in the Maps, Microtexts and Newspapers department at A Major Research Library for three years, as the department admin. I knew it was an educational experience at the time, but I'm continually astonished in retrospect about how educational.

Anyway, we had some amazing old newsprint, there. I remember one day going back to re-shelve some stuff that the student staff hadn't been able to get to before close the previous night (end of semester rush, doncha know...), and the Librarian stopped by as I was picking up a folio. "Let's look at this one," she said. We opened it up, and started leafing through. It was filled with single-items and short-runs of newspapers from the American west in the late 19th century. "Ah, here it is," she said, as she pulled out a particularly pristine piece of cotton-based newsprint -- some daily from a small town in Montana, from the 1880s. "Read this," she commanded. (You didn't disobey her.)

I did. It was remarkable. In those days, a "newspaper" would often be a single broadsheet, folded once to make four pages. That's all this was. But it was remarkable. Remember all those stories you've heard about when Mark Twain was a newspaperman? Well, he wasn't all that unusual; the west was full of guys with more education than the territory cared about, and more perception than they had a use for in daily life, and they poured it all into their jobs as editors and reporters for town rags like that one. I don't even remember the name, but it was filled with the editor's take on town politics, local businesses, weather, and a very scathing and witty remote-disarmament of a character who was apparently making the rounds representing himself as a reporter for the paper...

Thing was, there was a positive magic to having that newsprint right in front of you, that's impossible to communicate -- seeing it, clean and just slightly off-white, amid the brown and flaking pages that surrounded it. For some brief period of time, that newspaper had run out of cheap wood-pulp newsprint, and had used the good stuff, instead, and now because of that we had the Quonsaresque rantings of some lonely newspaper editor in the slowly taming wilds of Montana, c. 1885. Without that accident of logistics, and without some alum's whimsical bequest (the Librarian had been there 35 years, and knew the history of damn near everything), nobody would ever know about that now.

I could have made this a story about plowing through decades of Ithaca Journals looking for mentions of Seward (which a postdoc from another university paid me to do for a while), but that Montana newsprint, and the color and feel of that paper -- they're what stick in my mind.
posted by lodurr at 4:20 PM on December 14, 2004


chaz: Why in the world would anyone suppose that a Googlesque interpretation of "authoritativeness" for academic papers would be of positive value? It would serve only to amplify the consensus (which is what page-rank typically does), and would in effect bury important information.
posted by lodurr at 4:22 PM on December 14, 2004


I wonder if Google is using their technology to scan bibliographies and footnotes (the original hyperlinks) to build a google "page rank" for their books.

Oh man, that's a brilliant idea. God, the mind reels at what could be done with this system.
posted by Civil_Disobedient at 4:26 PM on December 14, 2004


Oh, and I recommend the following URL to Google:

alexandria.google.com
posted by Civil_Disobedient at 4:27 PM on December 14, 2004


The only thing information wants more than to be free is to be a non-monopolistic commercial commodity.

Well said, and definitely something to watch, although as ineffectual as the public seems to be in protecting (or even understanding) the public domain, all our data will probably be turned proprietary anyway.

I have a love/hate relationship with libraries.

I agree, and all the ones I have visited in the past 15 years seem to be turning themselves into something that doesn't appeal to me, more every day (which is very sad because, at its core, I think the library is the single best idea that humans have yet produced, and I used to live there).
posted by rushmc at 4:43 PM on December 14, 2004


along lodurr's lines, and what others have said: what happens when Google is sold? or there's a hostile takeover etc? Or a next generation of their leadership decides to dump it to temporarily raise the stock price? Shouldn't the Library of Congress or a Public Library be doing something like this, and not a publicly-traded company?
posted by amberglow at 7:35 PM on December 14, 2004


lodurr said:

While interesting points, I'm not sure that's correct. Quoth the NYT:

Each library, meanwhile, will receive its own copy of the digital database created from that institution's holdings, which the library can make available through its own Web site if it chooses.

And I swear the NPR story I heard this morning said that the agreements with Google are non-binding, such that each University could do something else with their newly digitized archive. (NPR's website is misbehaving at the moment - why the fuck won't they just put their transcripts online?)

The way I read that is: Google is effectively paying ~$1billion to non-destructively digitally archive these collections. They keep a copy for their index, they give a copy to the school to do with as they please. The world at large has greater access to more information than ever before.

As for the Google-hating that seems to be so fashionable with the kids these days, I just don't buy it. Yes, they are a huge multinational corporation and therefore deserve our scorn. Or something. I'm withholding scorn until they actually do something to deserve it, which hasn't happened yet. I mean, they're spending a BILLION dollars digitizing old books -- you think the Wikipedia people are gonna do that?

I'm all for a healthy dose of mistrust of authority, but I'm failing to see how this is bad. In fact, I'm having a real hard time seeing this as anything but capital-g Good.
posted by jimray at 10:03 PM on December 14, 2004


What's to stop an open source movement to do the same thing? People have tons of books at home and most people have scanners. Scan in a couple of books, upload them to a central repository, and it adds up. Is it illegal? I'm not sure, but I believe not, as long as the full text of (c) works cannot be downloaded and is only searchable; it's exactly what Google and Amazon are already doing.

chaz, your observation about Bibliographies and Footnotes is excellent. They are indeed the original hyperlinks, the more cites, the higher up the page rank, the more important the work. Brilliant, can't wait. This will be Google's competitive edge, no one will be able to touch it (unless they scan the same number of books again).
posted by stbalbach at 10:57 PM on December 14, 2004


the cost involved with scanning, indexing, and storing a digital copy [of The Journal of the Meat Packing Industry] may outweigh its use.

Maybe in isolation, but having every issue of every journal is incredibly valuable. You could algorithmically track words, phrases, modes of thought, through the entirety of recorded history.
posted by Tlogmer at 12:50 AM on December 15, 2004


This was on NYT and slashdot... Is it worth reposting on mefi?

This is one of the best FPPs in days. I do not read NYT or slashdot, so YES. It is worth reposting.

[Thank you! This is good.]
posted by erratic frog at 3:24 AM on December 15, 2004


your observation about Bibliographies and Footnotes is excellent. They are indeed the original hyperlinks, the more cites, the higher up the page rank, the more important the work.

Not really. I had a long chat with my brother about this a few months back, with regard to a project I was involved with at the time. He's a vector biologist, with something like 20 years in the field by now, and he found the idea of rating value by number or interconnectedness of citations to be kind of pernicious.

That squared with my experience looking at vitae in Nuclear Engineering in the mid-80s. It was a common practice in that department in those days -- and I was given to understand, a common practice in all departments, at least in the US -- for senior people in the department (e.g., the chair) to get the chance to review all the significant papers that went out of the department. "Review" usually meant "make minor edits to", which meant he got on the author list, which meant it went on his c.v. True, the chair would typically be low down on the list of authors, so I suppose if you weighted for position, there might be value. But that's a more difficult problem than it appears at first.

Finally, I have the same issue with this that I have with Google's mechanistic "rating" of web pages: It bears no relation to actual quality or to actual value. Much as a page can rank highly because everyone thinks it's stupid enough to comment on, a paper might rank highly because lots of people think it's a great example of bad reasoning. Without that context, though, you might think the rating actually had something to do with value.

Googlesque ranking systems -- and systems that attempt to divine knowledge from the traceable connections and interactions between people and organizations -- are attractive because they appear to find something profound in un-profound data. And they often do. But it's a mistake to think that they produce understanding. They not only do not, but cannot. Until the system can understand, it's not really giving us anything but information about connections and interactions -- not about the sources themselves.
posted by lodurr at 5:42 AM on December 15, 2004


What's to stop an open source movement to do the same thing?

Two things: Money, and prestige. Google's got money; and they've got a reputation that speaks loudly (no doubt amplified by the money) to Harvard's board.
posted by lodurr at 5:43 AM on December 15, 2004


As for the Google-hating that seems to be so fashionable with the kids these days, I just don't buy it. Yes, they are a huge multinational corporation and therefore deserve our scorn. Or something.

"Google-hating" isn't fashionable at all. Witness the round criticism of the messenger whenever someone dares criticise Google.

The most ardent Googleites, in my experience, are a lot like Conservative Republicans: They see persecution in criticism. I'm just saying that you need to look at Google with the same level of cynicism that you would use when looking at, say, Microsoft, or IBM. And hardly anybody does.
posted by lodurr at 5:49 AM on December 15, 2004


Shouldn't the Library of Congress or a Public Library be doing something like this, and not a publicly-traded company?

Speaking as a public librarian who is still teaching people how to get from the "to" line to the "cc" line when they use email, no public library except the godawful disneyfied Cerritos Library is anywhere near funded enough to do this even if they did somehow understand why it would be useful. The world of libraries is filled with petty turf wars, often-clueless administrators, and a massive game of last-copy "not it!" Collaboration between librarians on non-immediate projects is often perceived as taking time away from the more pressing library issues such as cataloging and patron services. The Library of Congress currently has a backlog of many many [millions?] volumes, as do Harvard, Stanford, and many of the other participating OCEAN libraries, none of which will be addressed, I'm reckoning, with any of the funding from this project [robocop, please tell me if I am misled here]

As a result, libraries that already lack the funding to perform basic tasks related to their print collection [and who are seeing Medicare and health insurance costs creep up in the double-digit percentages per year without corresponding funding increases] are getting on the bandwagon to work on digital collections while vendors try to sell us lots of proprietary hardware and software we may not need and make us subscribe to and license content that we used to just own. I think our future world will be much more reliant on electronic resources and I too would like to be able to use federated [i.e. one interface to many databases] searching when I do research or look for books. However, Google's "don't be evil" mantra does nothing to allay my concerns about who owns the rights to this content, whose labor is being used to digitize it [we already know Google Print sends digitizing work overseas and destroys the books to be digitized in the process] and what the long-term security of this content is, among other things. The gap between the library-with-books present and the library-with-digitized-information future is going to be craggy and ugly.

Google is supremely useful and has taught a lot of us about user-oriented interface design; however, they have also coined the cloying phrase "relevant ads" and have, as at least one of their missions, earning money -- and have a legal obligation to their shareholders to do this.

Check the mission of your local public library sometime and see where profitability comes into it. Many would argue that this is the problem with libraries lately, they suck up taxpayer money and deliver.... what? Social good? On the other hand, their methods are transparent, their staff are accountable and their resources belong to the public, now and forever. I'd like to be able to KNOW what Google and the libraries are going to be doing and how they are going to be doing it, not just have to trust Google to do the right thing.
posted by jessamyn at 6:43 AM on December 15, 2004


I mean, I guess the question still hasn't been answered, "If Google has to spend all this money on this project, where does it expect to recoup its losses?".
posted by Human Stain at 10:54 AM on December 15, 2004


Why do you call them "losses"? They're only losses if they have a product to sell and aren't making money on it.

Anyway, here are a few suggestions: Purchase referrals to Amazon; ad placement; personal data mining (which is legal as long as they don't make a direct connection between people and their data).

And then there's the more intriguing possibility: This is an investment. The amount of data they'll get from this (and I'm not talking about the books and journals) is really staggering. Once they figure out how to use that for targeted marketing, they could recoup a billion pretty easily, I think.

And that might not even matter: The real investment could be in pushing the ubiquity of their "brand". It makes Google not only the first stop for information, but also the second, third, fourth....
posted by lodurr at 11:19 AM on December 15, 2004


lodurr, I didn't follow your logic on why bibliographies are not important. I'm into history, and it's often difficult to know which papers published in the past 100 years are the significant papers. But, knowing that a paper has lots of cites gives it significance. So, a search on "Petrarch" brings up 5000 journal papers, I want them ordered according to which have the most cites (across 10 million books). That's very powerful. Or, for example, more specifically, "Petrarch & Dark Ages" .. so now I know which books and papers discuss Petrarch and the Dark Ages according to popularity. It's like having an expert guide.
posted by stbalbach at 11:50 AM on December 15, 2004


..the expert in this case is the wisdom of the crowds.
posted by stbalbach at 11:55 AM on December 15, 2004


... why bibliographies are not important.

I never said they weren't; I just said that Googleizing bibliographies is a good way to lose important context.

BTW, the "wisdom of crowds" is just the latest fashion -- it's a convenient way to valorize a type of "wisdom" that it's lately become cost-effective to produce. It's the equivalent of looking under the street lamp first, when you think there's a probability x, where x > 50%, that the quarter is in the alley. The real "wisdom" of crowd-wisdom then becomes how far above 50% the probability has to be to get you to look first in the alley, and how close to 50% it has to be to get you to just ignore the alley altogether. The "wisdom" of Googleism (and, I fear, of Googleized bibliographies) is that you can get a long ways above 50% before you decide you'll bother to look in the alley.
posted by lodurr at 12:40 PM on December 15, 2004


Put another way: If you rely on "crowd-wisdom" to find what you need to find, you'll nearly always find what everyone else already has. And fail to find what nearly everyone else has failed to find.
posted by lodurr at 12:58 PM on December 15, 2004


Yes, but you have to remember that it would be waaaay more sophisticated than simply building a popularity rank out of bibliographic citations.

You're thinking in a very small window which is just 'ranking based on citation' but try to imagine what can be done by linking together all footnotes and bibliographies.

Imagine how you would be able to find interesting data based on relationships between various texts, and the sub-relationship between various terms or concepts within those texts. You would be able to navigate seamlessly between footnotes and bibliographies, even linking various footnotes together based on their shared references and/or terms of interest.

Building a popularity index (which also can be valuable, as you can find out which are the most cited and least cited books in history, which has a meta-value in building relationship pyramids describing the history of scholarship) is just one tiny part of the value to be unlocked by bibliographic and footnote data.
posted by chaz at 4:55 PM on December 15, 2004


... but try to imagine what can be done by linking together all footnotes and bibliographies.

... which is not how Google would do it. They'd do it by treating citation as linking; any system built on that basis would still have all the weaknesses I described.

Also, the 'seamless navigation' you suggest would surely be a wonderful thing -- but again, it's not the Google way. Whenever anybody cooks up anything like that, they move to spoil the fun; witness the short-lived rash of XML-RPC based "Google browsers" from a couple of years ago. Very interesting beasts -- but Google, for whatever reason, did not suffer them to live.
posted by lodurr at 9:22 PM on December 15, 2004


Here's what I think of Harvard.
posted by thedevildancedlightly at 1:17 PM on December 28, 2004



