Universal instant transcription
December 15, 2010 10:51 AM

The Speakularity is coming. So says (MeFi's own) Matt Thompson of NPR, posting at NiemanLab as part of its series, Predictions for Journalism 2011. Constant social feedback plus machine learning could improve automatic speech transcription to the point where it’s finally ready for prime time. And when it does, the default expectation for recorded speech will be that it’s searchable and readable, nearly in the instant. I know this sounds totally retrograde, but I think it’s something like the future.
posted by beagle (21 comments total) 8 users marked this as a favorite
let’s call it the Speakularity

Oh please can we not call it that?

It's a fascinating idea; needs rebranding, stat.
posted by chavenet at 10:55 AM on December 15, 2010 [2 favorites]

Recorded speech transcription is already desperately needed yesterday. The cool futuristic thing would be text searchable *video*.
posted by DU at 11:08 AM on December 15, 2010

There will be all those diplomats who'll communicate with their home countries only through speech for fear of their messages being disclosed later. Well, we'll need their conversations to be searchable by people working against terrorism, of course.
posted by jeffburdges at 11:11 AM on December 15, 2010 [1 favorite]

Of course, this will kill another category of jobs— if we don't soon decide to do something like shorten the workweek or otherwise spread employment around more, we're going to end up with no middle class.

That said, it would save me tons of money and time... while putting a good friend out of a job, which would not make me happy.
posted by Maias at 11:18 AM on December 15, 2010

Apart from the usual to/two/too issues, any near-instant transcription will be useful for relaying breaking news, or providing basic accessibility functions. While voice recognition's already made great strides, machine editing and transcription still struggle with idioms, dialects, cadence, slang and impediments.

The first real use of the form in journalism will complement agencies, and greatly aid in providing tighter cues in prompting producers and staff, in a manner not unlike the telegraph's role in advancing news gathering.
posted by Smart Dalek at 11:30 AM on December 15, 2010

Man, joeclark is going to have an aneurysm when he sees this. Delivering searchable text with video is the future? I think I was hearing "closed captions where available" mentioned on The Price Is Right when I was twelve.

Yet I'm always somehow surprised when twenty year old tech becomes Zomg Teh Futurzorz when it's served up by a web service.
posted by mhoye at 11:33 AM on December 15, 2010

The cool futuristic thing would be text searchable *video*

How would you enter search terms?

Speech search is plausible because (in English, anyway), text is basically transcribed speech; it's really the same sort of information recorded in two different ways. There's a 1:1 correlation between a passage of text and a sound recording of the same text being spoken. Bridging that gap is difficult (for a machine) but semantically straightforward.

'Transcribing' video would be much more difficult. I know it happens (movies for the blind, for instance, have the action transcribed into text), but it's pretty "lossy". What would you end up with? A sort of running transcript of what was going on? A tag cloud, with timestamps? How you handled that would dramatically change how you'd be able to search for things.
posted by Kadin2048 at 11:34 AM on December 15, 2010
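
To make the "tag cloud, with timestamps" option above a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tags, the timestamps, and the helper names `build_index` and `search` are hypothetical, and the hard part (getting a machine to produce the tags from video at all) is not addressed.

```python
from collections import defaultdict

# Hypothetical annotations: (seconds into the video, tag). Invented for illustration.
ANNOTATIONS = [
    (12.0, "car"),
    (12.0, "street"),
    (47.5, "dog"),
    (90.0, "car"),
]

def build_index(annotations):
    """Map each lowercased tag to the sorted list of timestamps where it appears."""
    index = defaultdict(list)
    for seconds, tag in annotations:
        index[tag.lower()].append(seconds)
    for times in index.values():
        times.sort()
    return dict(index)

def search(index, term):
    """Return the timestamps tagged with the search term, or an empty list."""
    return index.get(term.lower(), [])

index = build_index(ANNOTATIONS)
print(search(index, "car"))
```

A search like this can only find what the tagger chose to record, which is one way of seeing why the representation is "lossy" and why the choice of representation shapes what can be searched for.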

Can't we just route all recorded speech through something like Google Voice? It's doing a bang-up job of transcribing voicemail messages, practically instantaneously depending on the length of the message.

I'm half joking but still, isn't the Speaku...nope, can't do it...isn't ubiquitous transcription just a case of implementation at this point? The tech is seemingly here already.
posted by JaredSeth at 11:43 AM on December 15, 2010

I think one relatively feasible but cool/scary feature would be mixing facial recognition algorithms with something like Google Image Search or TinEye. Kind of like tagging people in Facebook photos but done automatically when new images are scraped. It would be weird to be able to search for someone and have it find literally every photo of them that exists online. Of course this could be expanded to more than just facial recognition as well.
posted by burnmp3s at 11:52 AM on December 15, 2010

Speech is a hugely complex process. It's kind of a miracle that we can interpret it, not to mention generate it with such ease. Automatic recognition has been in the works for years and years, and it's just plain HARD. We produce a wide range of sounds, and we have to group them all into categories. Different languages (indeed, different dialects) draw the lines in different places. A "p" is not just a "p;" you can pronounce it several ways. In English, we'll classify most of those sounds as "p," but that won't be the case in a language like, say, Korean, where those sounds could be divided up into three different letters. Furthermore, you have to account for things like vocal damage, pitch, tone, speaker error, elision, dialect, etc. This is a huge field right now, and a ludicrous amount of money is going into it. It's a very exciting field, but it's still very far from foolproof. Crowdsourcing it might help, but it's not going to solve the problems of finding out exactly why our brains are so good at it.
posted by honeydew at 11:55 AM on December 15, 2010 [1 favorite]

I will admit to skipping over the vast majority of TED talks because a transcription is not present. I read faster than I listen. I do not want to see the flashy introduction. If I have an interest in the credits, I will go look them up. Give me text.
posted by adipocere at 12:02 PM on December 15, 2010 [4 favorites]

I'm trying to figure out how the porn industry could be an early adopter, thus advancing the adoption of the technology.
posted by Greg_Ace at 1:05 PM on December 15, 2010

> Give me text.

Ditto, TED. Also, I've got quite a number of downloaded vids, mainly tutorials on this and that, that I wish I could transform to just text and maybe two or three screen caps. Actually watching/listening to the vids themselves is soooooooo sloooooooooooow. But as of now there's no alternative to extracting the text by taking handwritten notes, exactly as if you were a student at the University of Padua in 1300AD.

> Of course, this will kill another category of jobs

Not for medical transcriptionists, anyway. One of the things I support is voice recognition software for a bunch of doctors. I can tell you from having listened to many, many voice files from MDs that clear well-enunciated speech is to doctors' speech into transcription mics as clear, carefully handwritten text is to doctors' scribbles on prescription forms. If we're talking about voice rec that works well without being carefully tuned to a single individual's speaking voice and habits, it's hard to believe that's happening any day soon. Hell, they haven't even gotten OCR working well yet for less than ideal originals.
posted by jfuller at 1:25 PM on December 15, 2010 [1 favorite]

Delivering searchable text with video is the future? I think I was hearing "closed captions where available" mentioned on The Price Is Right when I was twelve.

Closed captions are still done by humans. Generating 'em automatically (without introducing a fuckton of errors) would be a big hairy deal.
posted by nebulawindphone at 1:51 PM on December 15, 2010

Some captioning must be automated, otherwise the people who handle the closed captioning for live sports broadcasts are complete idiots.
posted by yerfatma at 2:20 PM on December 15, 2010

Why lie toothing the principal is sound eye wood half two expect that the practical implementation wilful far behind what is imagined hear.

Much like OCR, I expect it would initially generate a semi-adequate transcription, that would then need to be edited to make it make sense. Maias's friend's job (stenographer?) can be expected to change, from generating text to editing computer-generated text. The computer may be able to flag paragraphs of dubious grammar for special attention (my little joke above, for example) but the whole thing would need at least a skim-read no matter how good the technology gets.

This might mean that fewer stenographers are employed, but I doubt it; instead, I would predict that the volume of work the stenographers can process will increase, and as stenography becomes easy and practical, people's desire to use it and ability to afford it will increase. We may not want to spend $500 on ten hours of stenography getting a wedding video transcribed, but $50? Sure. Ten people will do that, and if it takes an hour each because the computer did almost all the work, then the stenographer's made the same amount of money in the same time. There are so many existing video and audio recordings that to transcribe even the worthwhile ones is a mountain of work just sitting there, because without this technology, it wasn't economically viable.

Also, this is just literal transcription, full of ums and ahs and repetitions and misspeakings. For the technology to advance to the point where a computer can not only transcribe speech but actually "neaten" it to the point where it reads as polished text is a whole other magnitude of problem.
posted by aeschenkarnos at 2:30 PM on December 15, 2010
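
The wedding-video arithmetic in the comment above can be checked in a few lines of Python. This is only a sketch of that one example: the dollar amounts and hours are the comment's own figures, and the variable names are made up here.

```python
# Figures from the comment above; variable names are illustrative.
manual_price = 500      # dollars for one fully manual transcription job
manual_hours = 10       # hours that job takes

assisted_price = 50     # dollars per machine-assisted job
assisted_hours = 1      # hours of human editing per assisted job
assisted_jobs = 10      # customers willing to pay the lower price

# Hourly income in each scenario.
manual_rate = manual_price / manual_hours
assisted_rate = assisted_price / assisted_hours

# Total income and total time across all assisted jobs.
assisted_income = assisted_price * assisted_jobs
assisted_total_hours = assisted_hours * assisted_jobs

print(manual_rate, assisted_rate)
print(assisted_income, assisted_total_hours)
```

Under those assumptions the hourly rate and the total income come out the same in both scenarios; the difference is that ten recordings get transcribed instead of one.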

machine learning (as used in automated translation in google chrome) definitely helps journalism in that today i was able to access news sources in languages and alphabets completely unknown to me in the easiest possible way-- so i would say he has a point.
posted by 3mendo at 4:21 PM on December 15, 2010

Some captioning must be automated, otherwise the people who handle the closed captioning for live sports broadcasts are complete idiots.

Some captioning is computer-assisted. For real-time captioning of live, unscripted events, the captioners use a stenotype machine, which lets them type in a special sort of shorthand. A computer translates that shorthand into standard English spelling — and, yeah, sometimes the computer fucks up, especially if you give it some newly drafted Samoan linebacker's last name or something. (And sometimes the typist fucks up — doing upwards of 200 WPM in real time isn't easy no matter what equipment you've got, especially if you're trying to recognize some newly drafted Samoan etcetera.)

But believe it or not, if it was entirely automated, with no human typist, the accuracy would be much, much worse.

In fact, a sports broadcast is like a perfect storm of error sources for speech recognition. You've got multiple speakers, they're talking quickly, with lots of emotion, often too close to the microphone, using all kinds of nonstandard English, not necessarily enunciating so well and sometimes interrupting or talking over each other, and meanwhile there's thousands of people yelling and maybe a marching band playing in the background. If you want good accurate automatic speech recognition, you need a situation that's exactly the opposite of that in every way: you adapt your system to a single speaker's voice, and you have him speak slow, steady, carefully articulated dictionary-standard English in a quiet room. And even then, you hope he's willing to put up with a decent number of errors, because holy shit, speech recognition is hard as hell.
posted by nebulawindphone at 8:05 PM on December 15, 2010 [1 favorite]
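
As a rough illustration of the computer-assisted step described above (shorthand in, standard spelling out), here is a minimal Python sketch. It is heavily simplified and hypothetical: the dictionary entries are invented rather than taken from any real steno theory, the `translate` helper is made up, and it assumes one stroke per word, which real captioning software does not.

```python
# Hypothetical stroke-to-word dictionary. Entries are invented for illustration
# and do not represent a real stenotype theory.
STENO_DICT = {
    "-T": "the",
    "KAT": "cat",
    "SAT": "sat",
}

def translate(strokes, dictionary=STENO_DICT):
    """Translate a list of strokes into text.

    Strokes missing from the dictionary (a newly drafted player's name, say)
    are passed through in brackets so a human can spot and fix them.
    """
    words = []
    for stroke in strokes:
        words.append(dictionary.get(stroke, f"[{stroke}]"))
    return " ".join(words)

print(translate(["-T", "KAT", "SAT"]))
print(translate(["-T", "TPAOU", "SAT"]))
```

The second call shows the failure mode mentioned above: anything not already in the dictionary comes out untranslated, however accurately it was typed.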

I laugh because my husband is right now sitting at the next desk in our home office swearing a blue streak about a very cutting-edge project in this very arena, which doesn't actually work very well.

The people who work in this arena, and who aren't selling anything, are not so optimistic as Mr. Thompson. Maybe by 2020 there will be something ready for prime time. Maybe. Maybe by 2025. Maybe.

'Transcribing' video would be much more difficult. I know it happens (movies for the blind, for instance, have the action transcribed into text), but it's pretty "lossy". What would you end up with? A sort of running transcript of what was going on? A tag cloud, with timestamps?

As a transcriber, I have done both, and it was really hard. I have no idea how that would possibly work in machine transcription.
posted by Sidhedevil at 11:50 AM on December 16, 2010 [1 favorite]

I hate when computers wreck a nice beach.
posted by straight at 12:50 PM on December 16, 2010

One does not “transcribe” video for the blind. That is called audio description.

Speaker-dependent speech recognition (one respeaker repeats the dialogue of a show) is also used for live captioning, and it’s much worse than even halfway incompetent stenocaptioners.

Real-time speaker-independent speech recognition is understood to be fully operational at intelligence agencies and has been for decades, but at the consumer level, there is a very good chance nobody reading this will be alive to see it.
posted by joeclark at 12:19 PM on December 29, 2010

