"This is the largest dataset of its kind ever produced."
September 21, 2020 8:37 AM

Newspaper Navigator is a project being carried out by Ben Lee (his announcement on Twitter), Innovator in Residence at the Library of Congress. It extracts visual content from 16+ million pages of sixty years of public domain digitized American newspapers and helps people learn to search the visual content using machine learning techniques. Read the FAQ to learn more about how its creator tried to manage algorithmic bias. Fun search terms are offered if you're not feeling creative: national park, giraffe, blimp, hats, stunts. The dataset is publicly available, the code is available and here's a white paper about the process of building it.
posted by jessamyn (8 comments total) 53 users marked this as a favorite

It’s been really satisfying to see this launch. If you’re interested in similar projects, the entire NDNP collection is available on the web and as an S3 bucket.
posted by adamsc at 9:17 AM on September 21, 2020 [2 favorites]

The visual similarity search capability retrieves relevant photos by empowering you to train a machine learning algorithm by selecting photos that you are interested in.

Is this different from how Facetube, uh, "empower[s] you to train [their] machine learning algorithm by selecting [content] that you are interested in"?
posted by Not A Thing at 9:53 AM on September 21, 2020

Yes. The linked article talks about it.
posted by jessamyn at 10:23 AM on September 21, 2020 [3 favorites]

Uh -- the one I linked to and quoted? Or one of the other ones? A ctrl-F for "empower" on the whitepaper doesn't return any results, and the only mention in the FAQ is the sentence quoted.

It's a minor point of course, but such language does make me wonder about the alternate universe it might describe (and whether there's a way to get there).
posted by Not A Thing at 12:46 PM on September 21, 2020

I may not be sure what you're getting at? Here are a few ways they're different:

- Their code and all the data are open, not closed. This is different from YouTube and Facebook.
- You can add or remove things from your collection and your preferences will be re-learned; you control the data that the AI bases its conclusions on. This is different from YouTube and Facebook.
- You don't have to log in; none of your data is linked to any real world identity. This is different from YouTube and Facebook.
- They're aware of, and specifically talk about, the "distortive effects of the microfilming process" and the potential erasure of POC as a result, as well as how "OCR engines amplify the noise" and how those engines' closed source code makes this tough to mitigate. LoC specifically aims to use OCR that works with non-English languages and performs well on them. This is different from YouTube and Facebook.
- Links between the extracted visual content and the original Chronicling America pages are always retained and are surfaced. This is different from YouTube and Facebook where metadata, tags and even image descriptions are unavailable or simply buried.
- More to the point, they're not a capitalistic endeavor which profits from your continued engagement with the tool. This is different from YouTube and Facebook.

The paper goes on, and I think the most useful part of it, for me, was:

“attempts to use algorithmic methods to describe collections must embrace the reality that, like human descriptions of collections, machine descriptions come with varying measure of certainty” [Padilla 2019]. Machine-generated metadata such as OCR are also fundamentally probabilistic in nature; this fact is not immediately apparent to end users of cultural heritage collections because cuts on confidence score are typically chosen before surfacing the metadata. Effectively communicating confidence scores, probabilistic descriptions, and the decisions surrounding them to end users remains a challenge.
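To make the "cuts on confidence score" idea concrete, here is a minimal sketch. It is not the project's code: the `Detection` records, the labels, the scores and the 0.5 threshold are all made up for illustration. It just shows how a cut chosen ahead of time decides what an end user gets to see, without the user ever seeing the scores.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One machine-generated description (hypothetical record shape)."""
    label: str
    confidence: float  # model score between 0 and 1


def cut_on_confidence(detections, threshold):
    """Keep only the detections scoring at or above the threshold."""
    return [d for d in detections if d.confidence >= threshold]


# Made-up example data, not taken from the real dataset.
detections = [
    Detection("photograph", 0.97),
    Detection("map", 0.62),
    Detection("cartoon", 0.31),
]

# The cut is chosen before anything is shown to the user.
surfaced = cut_on_confidence(detections, threshold=0.5)
hidden = len(detections) - len(surfaced)

for d in surfaced:
    print(d.label, d.confidence)
print("hidden by the cut:", hidden)
```

Raising or lowering `threshold` changes what gets surfaced, which is why the choice of cut is worth communicating along with the results.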

I have to admit, the AI part of this is the least-interesting part of it to me (I don't want to build a collection and have it find things for me that I might like, I am almost always drilling down to find very specific images) but I can see how we could use a similar thing in library catalogs and other digital archives to help people use the tools more effectively to do whatever they want with them.
posted by jessamyn at 1:28 PM on September 21, 2020 [11 favorites]

This looks more useful than it sounded at first. I'm interested in some expeditions about 120-150 years back, tried 'expeditions' as a filter and found stuff dating back to the 1860s. It looks like OCR from old papers hasn't improved a whole lot (barely readable), but maybe there's enough to nail some helpful clues at least. *Promising*!

(Most net search engines are *terrible* for this, especially as they almost never let you filter by dates.)
posted by Twang at 4:26 PM on September 21, 2020

This is very cool. I've used the related Chronicling America project before, for things like family history research, and it's a wonderful resource. In trying to track down information about an elusive great-grandfather, about whom I know very little as he was poor and died young, I discovered that there was a minor news sensation a century ago when his eldest son, my great-uncle, was kidnapped by a vagabond who led the police on a brief chase on horseback. (The alleged kidnapper was eventually acquitted, which hints at an interesting story I'd love to know more about someday.) The story, and other information I turned up, gave me some really interesting perspectives on the everyday life of ordinary people in the first two decades of the 20th century, the sort of stuff that doesn't generally get talked about in history classes. Historical newspapers are packed with all kinds of weird gems.

I might be biased because my mother spent her career working there, but the Library of Congress is one of the most underappreciated, amazing resources provided by the US federal government.
posted by biogeo at 6:33 PM on September 21, 2020 [3 favorites]

The LOC newspaper collection is a thing of joy. You can find all sorts of stuff there, but of course there is the notorious gap where only the out-of-copyright content is available. I'm not sure about this search engine, though, because I searched for some keywords that I knew were there, gradually broadened them to "boys school", and still only got six hits. Obviously visual search is really hard, and relies heavily on the quality of the OCR of the text, which is not great. Still, I would have expected a lot more than that. Surely there are hundreds of pictures of events at boys schools, no?
posted by wnissen at 11:35 AM on September 22, 2020


MetaFilter