

Do I contradict myself? Very well, then I contradict myself, I am large, I contain multitudes.
September 1, 2009 1:08 PM

"Then there are the classification errors, which taken together can make for a kind of absurdist poetry. H.L. Mencken's The American Language is classified as Family & Relationships. A French edition of Hamlet and a Japanese edition of Madame Bovary are both classified as Antiques and Collectibles (a 1930 English edition of Flaubert's novel is classified under Physicians, which I suppose makes a bit more sense.) An edition of Moby Dick is labeled Computers; The Cat Lover's Book of Fascinating Facts falls under Technology & Engineering. And a catalog of copyright entries from the Library of Congress is listed under Drama (for a moment I wondered if maybe that one was just Google's little joke)." —Linguist Geoffrey Nunberg on Google's little metadata problem.
posted by Toekneesan (29 comments total) 11 users marked this as a favorite

 
That's not a link, this is a link.
/Crocodile Dundee
posted by pyrex at 1:16 PM on September 1, 2009 [2 favorites]


The Cat Lover's Book of Fascinating Facts falls under Technology & Engineering. And a catalog

CAN I HAZ CYBORG CHEEZBURGER?
posted by jonp72 at 1:25 PM on September 1, 2009


> Geoffrey Nunberg, a linguist, is an adjunct full professor at the School of Information at the University of California at Berkeley. Images of some of the errors discussed in this article can be found here.

Instead of linking "Images of some of the errors", they link "here", and the link is a TinyURL, both of which were surprising to see at the end of a very well thought out essay about metadata and online archives.
posted by ardgedee at 1:48 PM on September 1, 2009 [5 favorites]


Check out Jon Orwant's (of Book Search) response on that page -- he goes into a lot of detail about how the metadata system works and why those errors exist, and what is being done to improve the process.

Direct link
posted by wildcrdj at 1:51 PM on September 1, 2009 [4 favorites]


I guess the freetext search working better than the categorisations is why it's called Google Books and not Yahoo Circa Late 90s Books.

/snark
posted by Artw at 1:54 PM on September 1, 2009 [3 favorites]


I don't get it, is he upset with this incredible boon for scholars because analyzing the texts still requires some human fact-checking?
posted by shii at 2:00 PM on September 1, 2009


Thing is, the metadata isn't just for finding books — it's for learning about what you've found. If freetext search nets me an awesome old out-of-print book and I want to cite it, or read more by the same author, or just know when it was written so I can put it in context, good metadata is gonna make my life much, much easier.
posted by nebulawindphone at 2:01 PM on September 1, 2009


Wow, Nunberg's responses to Orwant are hugely jerkish.
posted by Artw at 2:11 PM on September 1, 2009


> I don't get it, is he upset with this incredible boon for scholars because analyzing the texts still requires some human fact-checking?

First, the fact-checking has already been done to the extent it's possible; existing cataloging information and library classifications are known and can be assumed trustworthy. Google is more or less not relying on any of it right now, which kneecaps the utility of Google Books for anything but free text searching.

Nunberg repeatedly demonstrates the problem of synthesizing metadata from free text searches in his essay, reaching for the extreme examples (Books on the Internet from before the word was coined!) to make his point, but more subtle errors in book metadata can become insidious if, for example, the authority of one edition or another of a given title is in question.

And more broadly, if the metadata for a book is wrong, it may disappear from view. If you search for "Moby Dick", and half the titles are miscataloged as "Moby Dice", they're off the chart. You have no reason to assume there are books filed as "Moby Dice". And if you happen across one of them by accident, then you have to investigate other possible misfilings: "Mozy Dick", "Maybe Dill", ad infinitum. The existence of reliable book catalog data expedites research in innumerable fields, and not using it does a disservice to researchers.
posted by ardgedee at 2:27 PM on September 1, 2009
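
The "Moby Dice" problem in the comment above can be probed with approximate string matching. What follows is only a minimal sketch in Python: the title list is made up from the examples in the comment, the `near_matches` helper is hypothetical, and the 0.8 similarity cutoff is an arbitrary assumption, not anything Google or a library catalog is known to use.

```python
import difflib

# Hypothetical catalog titles, including the misfilings named in the comment.
catalog_titles = [
    "Moby Dick",
    "Moby Dice",
    "Mozy Dick",
    "Maybe Dill",
    "Madame Bovary",
]

def near_matches(query, titles, cutoff=0.8):
    """Return titles whose difflib similarity ratio to query is >= cutoff."""
    return difflib.get_close_matches(query, titles, n=len(titles), cutoff=cutoff)

matches = near_matches("Moby Dick", catalog_titles)
```

With these inputs the cutoff picks up "Moby Dice" and "Mozy Dick" but not "Maybe Dill", which is the commenter's point: a looser cutoff would catch more misfilings but also more unrelated titles, and none of this helps unless you already suspect the misfilings exist.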


Shii, messed-up metadata makes it difficult to find and aggregate the texts for humans to fact-check. Google tries to claim that it takes care of all the work and we can rely on them. However, these examples show that (as many of us in the metadata world know) automatic tagging is not the answer. Human eyes should be at least spot-checking. It doesn't take much effort to set up searches that will pull out obvious absurdities such as works dated before their author's birth. Then they could use those results to fine-tune the auto-tagging so it doesn't continue to make the same mistake.

I especially loved how Google is trying to blame everyone else for its messy metadata. The author does a good job illustrating that many errors cannot be from the libraries.

Toekneesan thank you so much for this post. As a metadata geek I kept chortling at some of the examples :)
posted by Librarygeek at 2:28 PM on September 1, 2009
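
The spot check suggested in the comment above (pull out works dated before their author's birth) could be sketched roughly as follows. This is an illustration under stated assumptions: the `Record` fields and the sample rows are invented placeholders, and real catalog records would need far messier handling of missing, approximate, or multiple dates.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Record:
    # Hypothetical, simplified metadata record.
    title: str
    author: str
    pub_year: Optional[int]
    author_birth_year: Optional[int]

def published_before_birth(records: List[Record]) -> List[Record]:
    """Return records whose publication year precedes the author's birth year.

    Records missing either year are skipped rather than flagged.
    """
    return [
        r for r in records
        if r.pub_year is not None
        and r.author_birth_year is not None
        and r.pub_year < r.author_birth_year
    ]

# Invented sample data for illustration only.
records = [
    Record("Example Novel A", "Author A", 1851, 1819),
    Record("Example Novel B", "Author B", 1760, 1812),  # impossible date
    Record("Example Pamphlet", "Author C", None, 1800),  # unknown date
]
suspects = published_before_birth(records)
```

A check like this only produces a review queue; a human (or a better rule) still has to decide whether the date, the author attribution, or the birth year is the wrong field.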


Orwant's reply details the extent to which the existing metadata has problems, and how that affects catalog searches (such as the use of "1899" and "1905" as placeholders for unknown dates). Taking the long view, Google might do catalogers a service by rectifying these conflicts, and Nunberg may be magnifying the long-term consequences of short-term problems. On the other hand, it leaves Google with the challenge of finding and correcting these records, and raises the question of whether they will decide there's too little benefit in making the effort for titles at the far end of the long tail of public domain resources.
posted by ardgedee at 2:37 PM on September 1, 2009
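
The placeholder-date issue mentioned above suggests a simple triage step: set aside records carrying one of the known placeholder years before trusting any date-based search. A rough sketch, assuming only the two years named in the comment (1899 and 1905); the function name and the sample list of years are made up for illustration.

```python
from collections import Counter

# The two placeholder years mentioned in the comment; any others are unknown here.
PLACEHOLDER_YEARS = {1899, 1905}

def split_by_placeholder(years):
    """Separate years into a trusted list and a count of placeholder years."""
    trusted = [y for y in years if y not in PLACEHOLDER_YEARS]
    suspect = Counter(y for y in years if y in PLACEHOLDER_YEARS)
    return trusted, suspect

# Invented sample of publication years.
years = [1851, 1899, 1899, 1905, 1922, 1899]
trusted, suspect = split_by_placeholder(years)
```

The obvious catch is that books really published in 1899 or 1905 get set aside too, so the suspect pile means "needs another source", not "wrong".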


Google Books has a bigger problem: searches are inconsistent and don't show all the hits. It may report "20,000 hits" but actually only show a few hundred or thousand (if one were to actually scroll page by page through the search results). And the ones it shows will often change, so one can't get consistent deep search results. For example, search on "Charles Dickens" (with quotes, Full View only) and it will report

Books 1 - 10 of 97,064 on "charles dickens". (0.13 seconds)

So you'd think great, 97,064 books mention Charles Dickens. But try to display those 97k results and it will stop after a few hundred or thousand or so. Actually in my recent test, it only showed 91! Incredibly frustrating.
posted by stbalbach at 2:54 PM on September 1, 2009


That's a general problem with Google. The hit counts just have no basis in reality. It used to be you could use web search hit counts to get a feel for how common an expression was, but those numbers have gotten much fuzzier and less reliable over time.
posted by nebulawindphone at 3:10 PM on September 1, 2009


do we know why, nebulawindphone?
posted by LobsterMitten at 5:06 PM on September 1, 2009


I especially loved how Google is trying to blame everyone else for its messy metadata

Orwant's reply pretty much explicitly does NOT put all the blame on everyone else. In particular the category stuff he admits was largely Google errors. ("We guess correctly about 90% of the time and Geoff's comments prompted the engineer responsible to suggest some improvements that we will roll out over the coming months."). It's a combination of source errors and process/classification errors, the latter being on Google's side.

The author got a lot of the "illustrating that errors cannot be from libraries" wrong as well, since he seemed to think OCR was responsible for a lot of things, and OCR isn't used.

Obviously the better this gets the better for everyone, Google and users alike. But a human review process doesn't scale when you're talking about well over a hundred million books.
posted by wildcrdj at 5:06 PM on September 1, 2009


But try to display those 97k results and it will stop after a few hundred or thousand or so.

Yeah, that's a bug. Usually the first number is an estimate, not based directly on rows in a database, but an algorithm to be able to "fudge" it within a configurable degree of probability. Accuracy being the trade-off for efficiency. But normally as you get closer to the "actual" number, you should see the total go down. The "oooo"gle does go down on the next page, but the initial total on the top never gets updated.

Which, you're right, is really annoying.
posted by Civil_Disobedient at 5:52 PM on September 1, 2009


> the freetext search working better than the categorisations is why it's called Google
> Books and not Yahoo Circa Late 90s Books.

There's still something to be said for categorization. Search the online catalog of a large university library near me by subject and, right after "computer networks," it will offer you "computer newtworks."

Sadly the library has no holdings right now on the subject of computer newtworks. But I check it again every now and then, hopefully.
posted by jfuller at 6:46 PM on September 1, 2009 [1 favorite]


This is pretty much a problem done up GoogleStyle. Most of these records are purchasable, in one fashion or another. They'd be a pittance to Google. But buy them? No, not Google! We're smarter and 21st Century, baby!

Google does very well, as a company, when they come up against search problems that are new. When they attempt to tackle already solved problems, you see this kind of behavior which ignores already extant solutions. They really are fantastic at finding words, but extraction of meaning is tough. Hit Google Products sometime and look at quite how hard it is for them to find all of the places which might sell a given item. If you are looking for something rare, skip Products.

This is a project which would have greatly benefited from some guidance by serious librarians, instead of the usual "as long as we can get it into text, we can feed it into the hopper and data will just come out!" methodology Google applies to everything.
posted by adipocere at 7:42 PM on September 1, 2009


do we know why, nebulawindphone?

I'm sure someone does, but I don't. I don't have anything to do with search engine programming — I'm just a linguistics student on the lookout for cheap data.
posted by nebulawindphone at 7:45 PM on September 1, 2009


I keep mousing over the word ~linguist~ to click the hyperlink but nothing happens...
posted by Muirwylde at 9:27 PM on September 1, 2009


I was just looking at Introduction to Metadata at the library last week and made a note to go back and check it out some time. Now I definitely will do so.
posted by neuron at 10:19 PM on September 1, 2009


I can at least claim to have seen this coming. Almost exactly two years ago, in a comment over at thomas j wise's blog, I had this to say:

Google were in too much of a hurry, and plunged into the project without taking a long hard look at what was involved, or consulting scholars or librarians who could have warned them about some of the bibliographical complexities of nineteenth-century books. The attitude was, 'let's grab the data, and any problems can be sorted out afterwards'. And the result? A lack of reliable metadata, and chaotic search results with a huge number of false positives.

I wouldn't express myself quite so strongly now. In the last two years, I've used Google Books a lot, and I've come to appreciate its great virtues as well as its many flaws. (I'm copy-editing a book at the moment, and Google Books has saved me literally hours, no, days of my time checking references.) But I still think I was right in my basic judgement, that Google plunged in too quickly and without realising the importance of reliable metadata.

Jon Orwant's response to Nunberg (linked from wildcrdj's comment above) is very interesting. What strikes me is how promiscuous Google were in taking metadata from so many different providers. In the course of his response Orwant refers to 'a Brazilian metadata provider', 'a New Jersey metadata provider', 'a British union catalog', 'a French union catalog', 'a Korean commercial metadata provider' (for an English-language book), 'an Armenian union catalog' (also for an English-language book) and 'a library catalog aggregator'. The odd thing, to a librarian, is that the digitized material and the metadata were kept so completely separate: i.e. if Google digitized a book from Harvard, they didn't take the Harvard catalogue record that came with it, they took a catalogue record from somewhere else or just made one up on the spot. No wonder their metadata is in such a mess.

Orwant seems cheerfully untroubled by it all. I suppose he feels that Google has the resources to fix the problem, whatever it takes, and if it takes an army of metadata-checkers correcting the mistakes by hand, one at a time, well, that's what they'll do. (Or not.) But I think Nunberg's parting comment is pretty much spot-on:

The reason this is all so frustrating -- and not just for scholars -- is that Google Books represents such an extraordinary resource, and already one that numerous researchers are trying to exploit. But you have the sense that the decisions about metadata and related issues are being shaped by a bunch of engineers sitting over their free Odwallas in Mountain View, who haven't really tried to determine what scholars need to make this work for them .. And there's the suspicion, too, that the Google people don't deeply understand the cataloguing process as professionals understand it. Those are the crucial disconnects, and until they're bridged it's hard to see how Google Books can live up to its potential, for all the best intentions of the people there.
posted by verstegan at 11:24 PM on September 1, 2009 [2 favorites]


In other news: Google turns classic books into free gibberish eBooks.

(Via Gillian Spraggs, who has been doing a great job of analysing the flaws in the Google Book Settlement.)
posted by verstegan at 11:35 PM on September 1, 2009


Oh, please tell me that someone is compiling a "Best Of" list of these errors before they're corrected (assuming, of course, that they get corrected). After you get over the horror of it all, some of them are quite funny.
posted by cowpattybingo at 11:38 PM on September 1, 2009


I'd be surprised if that proportion of errors or anything like it held up in general for books in that range, and dating errors are far denser for older works than for the ones Google received from publishers. But even if the proportion is only 5 percent, that suggests hundreds of thousands of dating errors.

Does he realize he is just making up stats here?
posted by smackfu at 5:35 AM on September 2, 2009


existing cataloging information and library classifications are known and can be assumed trustworthy

Although Google's work is probably the first time these catalogs have ever been cross-checked. If a date or category on some random book in a catalog is wrong, what is the chance anyone would ever notice it, or that the person who notices it would be able to get it fixed?
posted by smackfu at 5:54 AM on September 2, 2009


That is what Google gets for trying to reinvent the wheel, rather than using what libraries (generally considered to be experts in retrieval of books) have developed over the past few hundred years.
posted by QIbHom at 8:13 AM on September 2, 2009


Google Books' metadata team lead responds. Some quotes:

"we've learned the hard way that when you're dealing with a trillion metadata fields, one-in-a-million errors happen a million times over."

"An Australian union catalog holds that Jane Eyre is about governesses; a Korean commercial provider claims it's about Antiques & Collectibles. We suspect that the prevalence of Antiques & Collectibles for some classic editions derives from a cataloger's conflation of a particular item's worth ("that first edition is a real collectible!") with the subject classification for the edition. The architecture subject heading was our fault."

And even Mefi's own languagehat offers positive feedback on the response.
posted by GuyZero at 11:37 AM on September 2, 2009 [1 favorite]


Classification errors? Say it ain't so! Oh, that's why you use the "Find in a Library" link to go to Worldcat. Whew.
posted by unknowncommand at 11:26 PM on September 2, 2009



