Internet Archive adds Millions of Public Domain Images to Flickr
August 29, 2014 9:31 PM

Looking for another source of public domain images? Yahoo research fellow Kalev Leetaru has extracted over 14 million images from IA public domain book scans, and so far 2.6 million of them have been posted to the Internet Archive Book Images photostream, where they have become part of The Commons.

More info from the BBC and the Internet Archive blog post.

posted by fings (25 comments total) 68 users marked this as a favorite

A fine endeavour, but...

It was a gigantic mistake to use that scanner function that auto-crops right up to anything black. Most of these images are pretty much useless because of it.
posted by Sys Rq at 9:47 PM on August 29, 2014

The crop function isn't quite what you think it is. These are coming out of texts that were being OCR'd by the Internet Archive. According to the BBC article, the code originally identified areas of stuff that weren't actually text, so that OCR wouldn't be run on those parts and produce gibberish.

What they've done here is go back to that data and pull out the areas of the scans previously identified as non-text content and ignored. Those extracted portions are put on flickr along with the OCR'd text immediately before and after the image, with the intention that people will be able to search and find historical images related to specific subjects.

Each page on flickr also contains a link to a scan of the entire book that the image was extracted from, as well as a link to the specific page of that book. Once you've found an image that is pertinent to you, you can easily create your own crop from the full images provided there if the automatic one isn't to your liking.
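The extraction described above can be sketched roughly as follows. This is an illustration only, not Leetaru's actual code: the `Block` type, its fields, and `extract_images_with_context` are hypothetical names, and the sketch assumes the OCR pass leaves behind an ordered list of page regions, each labelled as text or non-text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One region of a scanned page (hypothetical structure)."""
    kind: str                        # "text" or "image"
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
    text: str = ""                   # OCR output; empty for image regions


def extract_images_with_context(blocks: List[Block]) -> List[dict]:
    """Return each non-text region with the OCR text just before and after it."""
    results = []
    for i, block in enumerate(blocks):
        if block.kind != "image":
            continue
        # Nearest text block before the image, scanning backwards.
        before = next((b.text for b in reversed(blocks[:i]) if b.kind == "text"), "")
        # Nearest text block after the image, scanning forwards.
        after = next((b.text for b in blocks[i + 1:] if b.kind == "text"), "")
        results.append({"bbox": block.bbox, "text_before": before, "text_after": after})
    return results
```

The surrounding text is what makes the images searchable: a query matches the words next to a picture, not the picture itself.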

This is all about enhancing discoverability of content that was already available on IA, and poor automatic crops aren't actually a problem here.
posted by vibratory manner of working at 10:27 PM on August 29, 2014 [14 favorites]

The cynic in me says that here's a large bunch of free content with the bonus of being rights free and it will help drive large amounts of internet traffic to a Yahoo property which has the added ability to track users.

And then the nerd in me says "Cool! Old pictures! Let's see!" and off I go, damns not being given.

Ahh, the internet - a harsh mistress (or master, whatever your preference).
posted by Zedcaster at 10:43 PM on August 29, 2014 [1 favorite]

If I'm going through an archive of historical illustrations, photographs and esoteric marginalia - I ain't starting my search off with "telephone", Mr. Leetaru. Thank you very much.
posted by HE Amb. T. S. L. DuVal at 11:06 PM on August 29, 2014 [2 favorites]

I don't understand why the text search function on archive.org doesn't go "deep" -- if you search for something on the main page, it will look at the descriptions, but not the full text. If you do go to a book, then you can search the full text.

Photos are nice, but not as nice as text.
posted by user92371 at 11:18 PM on August 29, 2014

That search function really is the best thing about this collection. It's absolutely amazing that he managed to do this automatically. (Although I share some of Zedcaster's concerns, too.)

Sex*
Drugs
Rock and Roll

*(Surprisingly SFW until you scroll down a few pages, and then still pretty tame)
posted by HE Amb. T. S. L. DuVal at 11:18 PM on August 29, 2014 [1 favorite]

Why Flickr?
posted by sophist at 3:11 AM on August 30, 2014

Name him for me!
posted by StickyCarpet at 5:05 AM on August 30, 2014

Lots of useful stuff for an old D.C. photo group that I help run on Facebook - tagging away, and thanks!
posted by ryanshepard at 5:53 AM on August 30, 2014

I think my favorite things are all the random decorative illustrations and type that appear when searching for something random.
posted by vespabelle at 6:19 AM on August 30, 2014 [1 favorite]

Man I wish we had something better than Flickr. Yahoo has pretty much wrecked its usability with the series of NuFlickr redesigns. At least they recently put Creative Commons search back.
posted by Nelson at 6:54 AM on August 30, 2014

> Each page on flickr also contains a link to a scan of the entire book that the image was extracted from, as well as a link to the specific page of that book. Once you've found an image that is pertinent to you, you can easily create your own crop from the full images provided there if the automatic one isn't to your liking.

That's great and all, but these images could have easily been, in addition to reference material, usable (or even just presentable) images themselves. Unfortunately, most of them have their edges chopped off. It couldn't have been too hard to just throw in an algorithm to loosen up the cropping.
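For illustration, loosening a crop could be as simple as padding the detected box and clamping it to the page. This is a hypothetical sketch, not anything the project is known to use; the function name, the box layout, and the 20-pixel default margin are all assumptions.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def loosen_crop(box: Box, page_size: Tuple[int, int], margin: int = 20) -> Box:
    """Grow a crop box by `margin` pixels on each side, clamped to the page."""
    left, top, right, bottom = box
    width, height = page_size
    return (
        max(0, left - margin),        # never go past the left page edge
        max(0, top - margin),         # never go past the top page edge
        min(width, right + margin),   # never go past the right page edge
        min(height, bottom + margin), # never go past the bottom page edge
    )
```

A fixed margin can pull in stray text from a neighbouring column, which may be one reason a tight crop was preferred.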
posted by Sys Rq at 8:56 AM on August 30, 2014

It appears they partially addressed their copyfraud problem with the new category "No Known Copyright Restrictions." In the past, all the work from The Commons was being dumped into a CC License, which basically asserts that the image is copyrighted but usage restrictions are waived. So The Commons was committing copyfraud by asserting a valid copyright exists.

But their license statement "No Known Copyright Restrictions" is still technically not correct. These items should be labeled Public Domain. The Creative Commons project has been deliberately blurring the lines between CC and Public Domain, in an attempt to assert its administrative authority over the public domain. The Commons is just another land grab in the PD space.
posted by charlie don't surf at 9:14 AM on August 30, 2014 [2 favorites]

Oh, mixed emotions. Zillions of fascinating images, mediocre scans. These have better contrast than the recently uploaded British Library scan collection, none of which seemed to achieve anything darker than a medium grey. Neither set is going to win any awards for sharpness. But, zillions of fascinating images. I'll certainly spend days and days browsing and downloading, so thanks for the post!

> Man I wish we had something better than Flickr.

Amen to that. Flickr's old-skool UI was kind of inconvenient but the new improved version defines "suck".

Why couldn't they have put these on Wikimedia Commons? WC has a search function too, and when you find something nice then viewing or downloading full size is a snap. You don't even have to unblock scripting. Plus an image's filename already tells the image's title (if it has one) or else what it is, and the artist's name if that is known. I can't emphasize enough how nice and useful that is. God's curse on sites that use filenames like 14762209816_65bb901ab5_o.jpg.
posted by jfuller at 10:01 AM on August 30, 2014 [3 favorites]

I have a digital history project in my town (Spokane) where my students and I spend so very much time looking for historical images we can use. In 20 minutes I identified dozens of pictures of local landmarks and the like that we can use. This is huge.
posted by LarryC at 1:11 PM on August 30, 2014

Is there anything preventing people from putting them all on Wikimedia Commons as well?
posted by dmd at 5:14 PM on August 30, 2014

One at a time? I'm sure there isn't, if you have upload privileges. All 2,626,987 of them? Would pretty much have to be someone who already has them local to wherever they are and can write an upload engine specific to WC. Whether that would be allowed I dunno.
posted by jfuller at 5:29 PM on August 30, 2014

> Is there anything preventing people from putting them all on Wikimedia Commons as well?

No. And that group will probably be running bots for the next n years to do so.

I have no love for the WikiCommons community and interface, but the Internet Archive should have partnered with it, a non-profit. Since WC was surely considered, I wonder why it was rejected...
posted by sylvanshine at 6:48 PM on August 30, 2014

This project is way better than nothing, but so many images lack context because of the automation. Who's this?--well, you can't be sure from the Flickr page itself. You have to visit the provided link to the book's page on archive.org.

If these images were dumped on Wikimedia Commons instead, volunteers would be "iterating" the metadata, even the file names. On Flickr, they're more or less stuck like this. (Aren't they? I don't use it.)
posted by sylvanshine at 7:04 PM on August 30, 2014

If you have a yahoo (flickr) account, you can add tags to these photos, as well as add comments, so they are not completely stuck. I agree that the "new" flickr interface sucks compared to the old -- lack of pagination, no idea of how many search results you get, forced "justified" view unless you add "?details=1" to the URL by hand, etc.

While he did not upload these to Wikimedia Commons, the BBC article quotes Mr Leetaru as saying "What I want to see is... Wikipedia have a national day of going through this to illustrate Wikipedia articles", so he does want them to get used.
posted by fings at 7:30 PM on August 30, 2014 [2 favorites]

If what I said above ("Zillions of fascinating images, mediocre scans.") implied that all these scans are mediocre, I did them an injustice because it's not true. There are some quite good scans here. The quality is all over the map, but "all over the map" includes some up at the very respectable end.

I should note that I don't expect scans of old photographs to be sharp and have good contrast. I've been looking specifically for images that started life as pen-and-ink drawings (of trees or including trees). Drawings like that ought to be the easiest kind of art to reproduce well in a book in the first place, and also easiest to scan well.

Here are two examples, good and not so good. Mysteriously, they are from exactly the same Internet Archive scan of exactly the same book (fuller takes deep breath), Die Kunstdenkmaler der Stadte und Kreise Gladbach und Krefeld im Auftrage des Provinzialverbandes der Rheinprovinz herausgegeben by Paul Clemen, published 1896 in Dusseldorf. Links are directly to the original size on Flickr.

good

way less good

I understand that Kalev Leetaru's image extracting project is completely dependent for image quality on the Internet Archive's book scans (with the exception of the overenthusiastic cropping Sys Rq called out.) I checked the two example images linked above against the relevant pages of the IA scanned book itself and they are indeed exactly the same: one good image scan with clean almost-black lines (on a page with clean almost-black text), one poor image scan with fuzzy brown lines (on a page with fuzzy brown text.)

How on earth to account for such differences? IA's 33-page scanning workflow overview document is linked on this page (a direct link to the MS Word file also works.) The description makes it look as if you should be able to expect consistency of result within one book. Now I'm very curious to see the original book itself. Is it possible that the variations are present in the original? Those nineteenth-century Germans, what a bunch of slackers!
posted by jfuller at 8:43 AM on August 31, 2014 [1 favorite]

N.B. the Internet Archive blog entry linked in the fpp makes it clear that Kalev Leetaru is "a Yahoo research fellow" which probably explains how Flickr got into the act.
posted by jfuller at 8:52 AM on August 31, 2014

> But their license statement "No Known Copyright Restrictions" is still technically not correct. These items should be labeled Public Domain.

This probably reflects the original Internet Archive license.

> text search function on archive.org doesn't go "deep"

You can do this two ways:

1. Go to Open Library, click "More Search Options" (top-right page), then button "Full Text Search"

2. Google: "Charles Dickens" site:archive.org

I prefer option #2: it's quick and easy, and maybe more complete. It searches the OCR text files.
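If you run that kind of site-restricted search often, the query URL is easy to build. A small sketch, assuming the standard Google search URL with its `q` parameter; the helper name `site_search_url` is made up for this example.

```python
from urllib.parse import urlencode


def site_search_url(phrase: str, site: str = "archive.org") -> str:
    """Build a Google search URL for an exact phrase restricted to one site."""
    # Quote the phrase for an exact match, then add the site: operator.
    query = f'"{phrase}" site:{site}'
    return "https://www.google.com/search?" + urlencode({"q": query})


print(site_search_url("Charles Dickens"))
```

Whether this is more complete than Open Library's own full-text search depends on how much of the OCR text Google has indexed.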
posted by stbalbach at 7:05 PM on August 31, 2014

I do not speak for the Internet Archive. But I do work there.

Regarding the "mystery" that jfuller masticates on regarding the variance of quality in Die Kunstdenkmaler... the mystery took me 20 seconds to track down. The book, scanned in an Internet Archive Scribe scanner, was scanned in a way that the volunteer/employee operating the machinery focused the right-side Canon 5D MK II properly and the left-side Canon 5D MK II not quite so properly. There's a hair difference in the focus pull, or the glass was dirty, or the camera was knocked, or a dozen other reasons for issues. The text is entirely readable, the images understandable, and so the QA person likely passed it because on observation, it all looked passable. Likely, if it was a big enough deal and someone complained, the book/item could be retrieved and re-scanned to a more focused and consistent degree, but most of these items are "good enough" in that regard. Remember, the Internet Archive is adding a new book to the collections every ninety seconds and they are basically giving it all away for free upon doing so. Remember that per-book or per-month fee you paid before viewing everything? Me either.

Sylvanshine used that magic phrase, "Surely, they [thing that only the writer is sure happened]". In that case, "Surely, they contacted/considered Wikimedia Commons" is likely a false statement on its face; this project was the work of one driven and smart individual, working on something nobody was likely to fundamentally understand as a description, and who ultimately arranged a lot of the aspects of cross-referencing and utilization into the Flickr API. Internet Archive would have provided technical assistance but it has a relatively small amount of human resource available for "projects" of this sort. I know they'd like to change that in the future, but it's busy enough keeping 20 petabytes of data available to the world for absolutely free with no ads, thanks.

As an aside, and stressing my pretty strenuous years as a Wikipedia/Wikimedia critic, I think it's a generally better world for a project to upload to a "static" archive, in this case Flickr, and then let the bickering, arbitrary standards of Wikipedia be applied to it for acquisition into the Wikimedia Commons. I think all parties benefit from this approach. Just my opinion, though.

As for the "crop issue", I'm going to punt and say that I'm sure the kind professor thought the cropping was sufficient, and some percentage of people are not going to agree with the choices made (and I'm sure every other choice) and the brilliant execution of this project ensures that you can go back and get the original, off-the-camera photograph and immediately rejigger the entire thing to your specific needs. The cropping doesn't seem to bother me, so I don't know what group I'm in, but if I needed the original photograph, it's linked right in there. So I'd probably do some clever shrug emoji if this place supported it.

Now, if you'll excuse me I'm going to go enjoy the telephone search in this collection, because it is calming and it pleases me.
posted by jscott at 5:04 AM on September 1, 2014 [4 favorites]

Sex
Drugs
Rock and Roll

[clicks on Sex]

Result on first page:
LIBRARY
MAR 9 2001
UNIVERSITY OF TORONTO

How did they know????
posted by Kabanos at 6:21 PM on September 3, 2014

This thread has been archived and is closed to new comments