Google Books offers PDFs of public domain books
August 29, 2006 9:40 PM

Google is now offering PDFs of public domain books. Okay, this is a direct lift from Boing Boing but I figured it was too juicy for Metafilter to miss. On my first search I found An Historical Account of the Discovery and Education of a Savage Man, E. M. Itard's account (translated) of his experiences with Victor, the Wild Boy of Aveyron. What else is there, MeFiers?
posted by unSane (55 comments total)
 


Crying shame they settled on pdf format. I don't know what the hell they're thinking -- for a company that's credited with so many smarts, they certainly seem to make boneheaded decisions sometimes.
posted by stavrosthewonderchicken at 10:24 PM on August 29, 2006


Doesn't Project Gutenberg pretty much have this locked up? I'm not saying they can't do it better, because maybe they can, but wouldn't it have made more sense to partner with Project Gutenberg in some way to advance the cause?
posted by willnot at 10:33 PM on August 29, 2006


Crying shame they settled on pdf format.

I assume that they've scanned the books, and doing the OCR (and error checking) necessary to get it into a machine readable format was simply too labour-intensive and expensive for them.

But now that the books are out there, is there anything stopping someone with decent OCR software (and copious spare time) from converting them and submitting them to Project Gutenberg?
posted by spazzm at 10:34 PM on August 29, 2006


Crying shame they settled on pdf format.

Why? The PDF format is more or less a cross-platform open standard, PDFs make it easy to incorporate graphics in a reliable way from OCRed product, and PDFed public domain books can be easily distributed from machine to machine.

Worse would be that they were to settle on unsearchable bitmaps, like Amazon. HTML would be nice, granted, but it's not like the PDFs are locked up in any way.
posted by Blazecock Pileon at 10:41 PM on August 29, 2006


I hate reading Project Gutenberg's .txt files - in that format, big blocks of text are incredibly ugly and cumbersome.
posted by stammer at 10:49 PM on August 29, 2006


Thomas de Quincey: Confessions of an English Opium-Eater. It's awesome.
posted by ethocin at 10:53 PM on August 29, 2006


and what if you just want to browse through the titles to see if anything interests you?

isn't that what people do at bookstores and libraries a lot of the time?

they really should have thought of that
posted by pyramid termite at 10:53 PM on August 29, 2006


oh, i only got through 100 pages or so of dequincey ... the man wrote like he was on drugs or something
posted by pyramid termite at 10:54 PM on August 29, 2006


Hm, that was only a limited preview. Sorry.
posted by ethocin at 10:55 PM on August 29, 2006


Worse would be that they were to settle on unsearchable bitmaps
I downloaded the 'Education of a Savage Man'. It's an unsearchable bitmap.
posted by tellurian at 10:59 PM on August 29, 2006


"If any thing can be settled, it is that the man is the head of the woman,--that she is for him, not he for her; and that religion, government, family, property, are essential elements of all civilization. Without them man must sink below the savage, for in the lowest savage state we find, at least, some reminiscences of them.... We do the Socialists too much honor when we consent to hear and refute their dreams. We have not at this late day to resettle the basis of society, to seek for unknown truth in religion or politics, in relation to public or domestic, private or social life; we have no new discoveries to make, no important changes to introduce; and all that we need attempt is to ascertain the truth which has been known from the beginning, and to conform ourselves to it."

Better than YouTube! Thanks!
posted by stammer at 11:06 PM on August 29, 2006


I'd also recommend a search for "phrenology".
posted by stammer at 11:22 PM on August 29, 2006


I downloaded the 'Education of a Savage Man'. It's an unsearchable bitmap.

You're right. That's unfortunate. Thankfully, for some, there is a solution.
posted by Blazecock Pileon at 11:24 PM on August 29, 2006


OCR? The Wikipedia article says The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem.

I wish that were so. The Adobe Acrobat OCR is pretty poor. It can't even keep the font size consistent, it doesn't seem to have an option to ask me what font and font size it should be reading the text as, it doesn't seem to check for misreads of letters as numbers or punctuation inside relatively common words ("TH15" is an example), and it's hopelessly lost when presented with any kind of formatting, such as columns of numbers.

Can anyone recommend an OCR program that might deserve the description of "solved problem"?
posted by aeschenkarnos at 11:40 PM on August 29, 2006
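The kind of check described above (catching digits misread inside common words, as in "TH15") can be sketched in a few lines of Python. This is only an illustration, not how Acrobat or any real OCR engine works: the lookalike table, the tiny word list, and the function name `repair_token` are all made up for the example.

```python
import itertools

# Hypothetical lookalike table: digit -> letters an OCR engine may have meant.
LOOKALIKES = {"0": "O", "1": "IL", "5": "S", "8": "B"}

# Tiny stand-in word list; a real check would load a full dictionary.
COMMON_WORDS = {"THIS", "THE", "BOOK", "LIBRARY", "SOLVED"}


def repair_token(token, words=COMMON_WORDS):
    """Return an (uppercase) dictionary word if swapping lookalike digits
    for letters produces one; otherwise return the token unchanged."""
    upper = token.upper()
    # Nothing to do for known words or tokens without suspicious digits.
    if upper in words or not any(ch in LOOKALIKES for ch in upper):
        return token
    # Per character: the letters a digit might stand for, or the character itself.
    options = [LOOKALIKES.get(ch, ch) for ch in upper]
    for candidate in itertools.product(*options):
        word = "".join(candidate)
        if word in words:
            return word
    # No substitution gave a known word, so leave it alone (e.g. a real year).
    return token


print(repair_token("TH15"))
print(repair_token("1894"))
```

Because a token is only changed when a substitution lands on a known word, genuine numbers such as years pass through untouched. The cost is that anything outside the word list stays unrepaired.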


OmniPage?
posted by Blazecock Pileon at 12:03 AM on August 30, 2006


Why?

Plain text, I've noticed, is the most reliably cross-platform format going. It doesn't require a massive, resource hungry, utterly crap piece of software (note: opinion ahoy) to read it (though I know there are alternatives that are better), nor does it take anything more complicated than Notepad to create it.

The web is plain text, with embedded markup, at its heart.

The same kind of scorn and fury people sometimes express for Flash (which I don't mind, if used well), I reserve for pdf files used for text. *shrugs*
posted by stavrosthewonderchicken at 12:12 AM on August 30, 2006


I was hoping for a Flash viewer with embedded PDFs and a soundtrack, but I guess this will have to do.

How do we limit our searches to downloadable (public domain only?) books? And can downloaded books be printed?
posted by pracowity at 2:04 AM on August 30, 2006


It's hip to hate PDF.
posted by crunchland at 2:38 AM on August 30, 2006


Is it?

Awesome. I'm so in touch with today's youth culture.
posted by stavrosthewonderchicken at 3:24 AM on August 30, 2006


It's hip to be cynically aloof.

PDFs suck in a bad way.
posted by knave at 4:34 AM on August 30, 2006


Converting to (worthwhile) plain text takes a lot of work, manual labor. That is what Distributed Proofreaders does (the source for most of Gutenberg's texts). I wouldn't expect Google to do that. Just getting the books online in digital searchable format is a huge gain and we should all be merry.
posted by stbalbach at 5:00 AM on August 30, 2006


Pracowity: go to Advanced Book Search, and select "Full View Only".
posted by stammer at 5:19 AM on August 30, 2006


Boo hoo, Google won't provide plain text files of the thousands upon thousands of books they've scanned.

I don't think Google has enough money to hire thousands of monkeys to proofread the plain text. Yet.
posted by zsazsa at 5:22 AM on August 30, 2006


I do agree that this is an awesome resource, I wasn't trying to piss on it. Just trying to back up stavros, because he's right.
posted by knave at 5:24 AM on August 30, 2006


This is neat (increased availability is always good, and I was always a bit iffy on Google's decision to scan non-public domain books without permission), but I can't help asking: does anyone here enjoy reading books online? Like, full books? I've tried and tried to read books online and keep coming to the same conclusion: Ugh.

I really hope this changes in my lifetime, but right now reading books online sucks.
posted by mediareport at 5:46 AM on August 30, 2006


Wonderful, but scattershot. They have Volume V of the Lexow Committee report, for example, but not volumes I-IV, and Volume I of the Colonial Laws of New York, but none of the rest of the set. I wonder what their inclusion criteria were.
posted by Phlogiston at 5:48 AM on August 30, 2006


Pracowity: go to Advanced Book Search, and select "Full View Only".

That still pulls up books that cannot be downloaded. Maybe the only way to search only for downloadable books is to restrict the search to books published before 19xx?
posted by pracowity at 5:56 AM on August 30, 2006


does anyone here enjoy reading books online? Like, full books?

It's a good question. Someone here on MeFi who lives in South Korea (forget his id) indicated he had read 100's of Gutenberg books on a laptop, and others have said similar things with PDA's. I own a Sony Librie which uses e-ink but gave up on it because the text is too cramped (too many page flips), it was claustrophobic. Books, in particular big plush old books made for reading back in the day when people read, when the paper was thick, the fonts large, the bindings like tanks - are what I seek out - and usually can be had as cheap or cheaper than brand new paperbacks - and old books have a second life as my home interior decoration.

One of the more interesting digital reading devices is the iLiad because the screen is large enough to hold a full page's worth of text, but it costs a fortune, and you can't decorate your home with Gutenberg text files.
posted by stbalbach at 6:02 AM on August 30, 2006


Nice. I too wish that these were in plain text, which is not the same thing as saying this isn't nice. ('Cause it is.)
posted by OmieWise at 6:06 AM on August 30, 2006


Nice, but seems like more of a come-on for the publishers' benefit than the readers'.

I'm a fan of archive.org. They have a ton of books, all full-access, including a huge catalogue of pre-1650 books.
posted by the sobsister at 6:24 AM on August 30, 2006


a come-on for the publishers' benefit

What the hell? How does James B. Lyon, who printed Volume I of the Colonial Laws of New York in 1894, benefit from having his publication on Google Books? I'm all for skepticism and everything, but that's just nuts.
posted by languagehat at 6:37 AM on August 30, 2006


pracowity: that's because you're in Europe. 1923 is only the copyright expiration date in the US. Try looking for even older books.
posted by zsazsa at 6:38 AM on August 30, 2006


What else is there, MeFiers?

To echo willnot, there is Project Gutenberg, which has the txt files of public domain books, which are far more useful than PDFs. They will also mail you a free DVD with their entire archive, if you don't feel like downloading it off bittorrent.

There should be a remix project, where people take a txt version of a book, and design a real book, with fonts, layouts, cover, etc. Might make an interesting competition.
posted by Pastabagel at 6:59 AM on August 30, 2006


I know all about Gutenberg, but Google Books has masses of stuff that Gutenberg doesn't. And I really like the PDFs. I simply print them out... you know... books... paper...
posted by unSane at 7:10 AM on August 30, 2006


There should be a remix project, where people take a txt version of a book, and design a real book

This is done all the time by various people and organizations. The problem is finding them. One site is World eBook Library. They charge a very small fee but it's worth it if you're looking for PDF versions of Gutenberg texts (some are professional quality ready for the printer).
posted by stbalbach at 7:15 AM on August 30, 2006


Okay, so now there's Gutenberg, Bartleby, Google and various similar sites. There's more content than anyone can get through in a thousand lifetimes. So how do you find really great, obscure stuff? Is there a site (blog?) devoted to mining these sites for gems?
posted by grumblebee at 7:20 AM on August 30, 2006


There are a lot of books I am interested in which are not covered by copyright which are scanned in, but not full-text viewable. Is there any way to request that they be opened up?
posted by sonofsamiam at 7:26 AM on August 30, 2006


grumblebee-

Start here. It has "great books" lists from a number of sources. If you see books that appear on all the lists, those would be good ones to start with.
posted by Pastabagel at 7:29 AM on August 30, 2006


stbalbach, thx for the link.

Charging for the PDFs is a little sketchy. You can get programs for free that will convert them to pdf.

I was thinking some more along the lines of those sites that give you one HTML page and everyone only works on the CSS to make completely different designs of the same page.

Same idea, but at the end you generate a PDF (or why not a CSS overlay contest?).
posted by Pastabagel at 7:33 AM on August 30, 2006




George Kennan's funny and illuminating "Tent Life in Siberia" is available in its entirety.

Downloaded it today and still as entertaining as on the first read.

Thank you Metafilter, Thank you Thighmaster.

--Peter C.
posted by petecart at 8:52 AM on August 30, 2006


stavrosthewonderchicken: Plain text, I've noticed, is the most reliably cross-platform format going.

Which is all well and good, until you get into a text where typography and illustration matter. Not that I entirely disagree.
posted by KirkJobSluder at 9:38 AM on August 30, 2006


languagehat,

my point was that the "limited previews" aren't much more than amazon offers and, consequently, this venue offers publishers a place to shill. It's like going to google videos and being offered trailers for upcoming releases: yeah, they're free and, yeah, they're a taste but, really, they're not particularly useful or satisfying. Except as a come-on.

Which is not to say that the publishers and authors shouldn't defend their copyright. Just that putting ten or twenty pages of the text online is only useful as advertisement and, consequently, these promotional items shouldn't be mixed in with texts that are copyright-free and wholly-accessible.

I know that one can opt to see only the full-text versions. I was simply arguing against mixing apples and oranges. The oranges being free and edible, the apples, well, you can sniff and maybe nibble, but that's it.
posted by the sobsister at 10:40 AM on August 30, 2006


1) We're not talking about "limited previews" but about entire public domain books offered on line. It's like going into a thread about Yeats and complaining about Billy Collins. You talk about "apples and oranges," but you brought the apples yourself.

2) I completely disagree that limited previews are "only useful as advertisement." I have many, many times been grateful for even the tiniest access to hard-to-find books; if I need one obscure fact about a WWI general that happens to be available on one page of a rare book that Google Print lets me see, god bless Google Print. If you only want to read entire books, yeah, you'll do better elsewhere, but your personal needs and preferences are not the world's.
posted by languagehat at 10:49 AM on August 30, 2006


Even the Amazon "search inside the book" option is incredibly useful sometimes, for finding a footnote with some nugget of information, for uncovering the one reference to your topic within a 1000 page tome, and for busting half-smart plagiarists ("I'll copy from a source not on the internet--the professor will never catch me!")

I am currently helping to put together a History MA program to be offered entirely online, and stuff like this is solid gold. I stumbled on the full texts within Google Books a month or so back, and I am still having difficulty navigating it. But it is clear that there is a huge volume of primary source material. Put this together with some of the other free archives (like the Making of America site, or the Founders' Constitution) and a whole new world of historical scholarship opens up. Exciting times!
posted by LarryC at 11:28 AM on August 30, 2006


Charging for the PDFs is a little sketchy. You can get programs for free that will convert them to pdf.

Oh sure. But some of these PDF's look like books. You have to see it. I mean ready for the printer, ready for a professional printing house to print a book with. They look great. Some of them. Many of them look like what you say, just straight machine conversions of a text file, which are nothing special.
posted by stbalbach at 6:23 PM on August 30, 2006


The only thing I don't like is the "digitalized by Google" imprint on every page. I'd imagine one would get very annoyed seeing it page after page if they were reading the book straight through.
posted by lpctstr; at 8:38 PM on August 30, 2006


Someone here on MeFi who lives in South Korea (forget his id) indicated he had read 100's of Gutenberg books on a laptop

That was me, of course, but my reading is not by any means limited to Gutenberg or other public domain texts. I've got upwards of 10,000 books in my digital library, a collection that I have indeed been adding to and reading on my old Thinkpad for about 7 years, I guess. The vast majority are non-Gutenberg, and only a tiny percentage are pdf, thank christ. Mostly txt, rtf, html and some lit (though I have software to convert lit back to html (because in many ways it sucks even worse than pdf) if I'm so inclined). My reading is not online so much as onscreen.

If I had not lived the peripatetic life I have, and if I owned a house where I'd lived for some years and planned to stay, I would have all of those books as real, bound books, in my own library. You make sacrifices if you want to travel light. I love books, but I love what they contain more than their physical presence (much as I miss that, or used to, before I started reading everything onscreen).

Which is all well and good, until you get into a text where typography and illustration matter.

Yes, granted, absolutely. But few of the books I read, at least, are dependent on such things. I'm certainly design-conscious (although design idolatry tends to make me puke), but given the choice between plain text of a book or nothing, I'll go plain text every damn time, even if it's riddled with OCR errors.

Anyway, good on Google and all that. But pdfs still bite the wax tadpole.
posted by stavrosthewonderchicken at 11:01 PM on August 30, 2006


stavros, I'll more than admit: Adobe's PDF readers suck ass, and poorly made PDFs are really terrible (I'm looking at you, TeX, with your individual ugly bitmaps for every character).

But the PDF format itself is a thing of beauty!

A well (or even decently) made PDF really shines, and provides its full raw text in an organized manner for easy extraction by screenreaders and applications like Tofu, as well as searching and copy and pasting.
posted by blasdelf at 7:09 AM on August 31, 2006


I've never understood why PDFs don't just embed the font and then keep the text in a machine-understandable format. It seems *really* strange to have to use OCR to copy text from a computer-written document.
posted by knave at 7:19 AM on August 31, 2006


I don't know what I'm talking about when it comes to computers, but I always thought the whole point of PDFs was to keep the data on them from being easily extractable.
posted by OmieWise at 7:41 AM on August 31, 2006


blasdelf, fortunately TeX has progressed past those bad days and real PostScript Type 1 fonts are a part of every major TeX distribution. That still doesn't keep the default Computer Modern font from being strange and spindly looking in a design sense, though.

knave, they do. Except the ones that consist entirely of scanned images like the Google PDFs. They could be improved upon by OCRing the text and embedding it invisibly behind the bitmap, for example.

OmieWise, the original point of PDFs is to have a portable format that will let you see and print laid out text and graphical data as originally designed.

If a PDF has any plaintext in it, there are tools to automatically extract it. Images can also be automatically extracted. I haven't examined the Google PDFs too closely, but if you can extract the images you can run OCR on them. The image encoding is JBIG2, which is efficient and open but a bit esoteric.
posted by zsazsa at 6:41 PM on August 31, 2006
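For anyone wanting to check which kind of PDF they have before reaching for OCR, here is a rough Python sketch. It is a heuristic under loud assumptions, not a PDF parser: it just scans the raw bytes for the standard PDF names `/Subtype /Image`, `/Font` and `/JBIG2Decode`, so it will be fooled by files that keep their objects in compressed object streams. The function names are invented for the example.

```python
import re


def looks_image_only(pdf_bytes):
    """Heuristic: True if the raw bytes mention image XObjects
    but no font resources, i.e. the pages are probably just scans."""
    has_image = re.search(rb"/Subtype\s*/Image", pdf_bytes) is not None
    has_font = re.search(rb"/Font\b", pdf_bytes) is not None
    return has_image and not has_font


def uses_jbig2(pdf_bytes):
    """True if any stream declares the JBIG2Decode filter."""
    return b"/JBIG2Decode" in pdf_bytes


# Hand-made fragment standing in for a scanned page; a real file would be
# read with open(path, "rb").read().
scanned = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Type /XObject /Subtype /Image /Filter /JBIG2Decode >>\nendobj\n"
)
print(looks_image_only(scanned), uses_jbig2(scanned))
```

If the first check comes back true, extracting the page images and running OCR on them is probably the only way to get text out. If fonts are present, a text extraction tool is worth trying first.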


This is truly weird. Like some kind of horrible Aussie wank book written by an illiterate.

I found it by searching curse words.
posted by Mid at 8:01 PM on August 31, 2006


Just like to add, far too late for anyone to notice, that PDF's are actually a very good document wrapper. You can embed fonts in them for true cross-platform looks the same-ness. You can put form fields in them, which is great for printing out official-looking forms without having to put pen to paper (and makes archiving a cinch).

Further, PDF's are easily parsable by the hundreds of libraries out there (PHP, Java, C, .NET, etc.). That's why you always see "VIEW AS HTML" next to PDF's that are returned from Google searches.

What does suck, and what everyone who complains about PDF's is really complaining about, is Adobe's free Acrobat Reader, which is a stinking pile of dog poo. You can increase the load speed and reduce the memory footprint by deleting some of the files in the plugins directory, or just use a free PDF viewer like FoxIt!
posted by Civil_Disobedient at 10:52 AM on September 2, 2006


This is truly weird. Like some kind of horrible Aussie wank book written by an illiterate.

He explains in the glossary at the beginning that "fuck" means "sexual intercourse." But then, apparently so does every other word in Aussie.
posted by languagehat at 11:40 AM on September 2, 2006



