Bringing ancient manuscripts in the Vatican's Archives to (digital) life
May 16, 2018 1:03 PM

The Archivum Secretum Vaticanum, or Vatican Secret Archives, contains 85 km (~53 mi) of shelving, but is limited to physical access only... for now. But digitizing alone isn't enough - even with an index, how can you search this volume of material? Digitized text would be ideal, but automated digital transcription through Optical Character Recognition (OCR) only works reliably with typed text, because it needs consistent shapes and clear spaces between characters. Add artificial intelligence and now you might have something. In Codice Ratio is a research project that segments handwritten words into candidate characters, feeds those segments into a convolutional neural network to recognize the characters, and uses language models to compose word transcriptions.
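For a rough feel of that last step, here is a toy sketch of how per-character guesses and a language model might be combined. It is not In Codice Ratio's code: the scores in `segment_candidates`, the three-word `LEXICON`, and the `transcribe` function are all made up for illustration, and a plain word list stands in for a real language model.

```python
from itertools import product

# Hypothetical classifier output: for each character segment, candidate
# letters with confidence scores (a real system would get these from a CNN).
segment_candidates = [
    {"q": 0.6, "p": 0.4},
    {"a": 0.9, "o": 0.1},
    {"x": 0.7, "r": 0.3},
]

# Hypothetical stand-in for a language model: a tiny Latin word list.
LEXICON = {"pax", "rex", "lux"}


def transcribe(candidates, lexicon):
    """Pick the highest-scoring letter combination that is a known word.

    Falls back to the best guess per segment if no combination is in
    the lexicon.
    """
    best_word, best_score = None, 0.0
    # Try every combination of one candidate letter per segment.
    for letters in product(*(c.items() for c in candidates)):
        word = "".join(ch for ch, _ in letters)
        score = 1.0
        for _, confidence in letters:
            score *= confidence
        if word in lexicon and score > best_score:
            best_word, best_score = word, score
    if best_word is None:
        best_word = "".join(max(c, key=c.get) for c in candidates)
    return best_word


print(transcribe(segment_candidates, LEXICON))
```

Taken letter by letter, the top guesses would spell "qax"; the word list pulls the result toward a real word instead. Trying every combination only works for short words, so anything realistic would need a smarter search.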
posted by filthy light thief (5 comments total) 31 users marked this as a favorite
 
Fascinating! Thanks.
posted by SecretAgentSockpuppet at 5:55 PM on May 16, 2018


I wonder why this is not being crowd-sourced? I suspect that pretty much every classicist on earth would LOVE a chance to transcribe and translate these documents.
posted by msalt at 11:31 AM on May 17, 2018


> I suspect that pretty much every classicist on earth would LOVE a chance to transcribe and translate these documents.

This is the entire reason I got online, back in...'91, maybe? There was a rumor that the Vatican was going to start scanning manuscripts. They had some lovely (less rare) pieces up, but it was a matter of starting the page loading, getting coffee, waiting longer, and then watching the image load so very, very slowly. Then click the big arrow button, and wait another half-hour.
posted by korej at 5:29 AM on May 18, 2018 [2 favorites]


> I wonder why this is not being crowd-sourced? I suspect that pretty much every classicist on earth would LOVE a chance to transcribe and translate these documents.

I'm surprised that they don't at least want human proofing of this OCR+ output. I get the desire to automate the process, given the sheer volume of work, but I imagine that there are still a fair number of corrections to be made that are (currently) best done by a human.
posted by filthy light thief at 8:07 AM on May 20, 2018 [1 favorite]


I'd say it this way -- I can understand why humans, even crowd sourced, might not be great at deciphering handwriting from medieval times without endless unresolvable debates. For that matter, highly qualified experts have trouble reaching consensus on these transcriptions.

However, I think the crowd is very good at identifying problems and areas of dispute. So a 2-step process, where the machine takes the first pass and people flag mistakes and offer suggested solutions, might be the best mix.
posted by msalt at 9:13 PM on May 20, 2018



