Speech-to-text with Whisper
October 13, 2022 10:58 AM   Subscribe

Whisper, from OpenAI, is an open source tool you can run on your own computer that "approaches human level robustness and accuracy on English speech recognition"; "Moreover, it enables transcription in multiple languages, as well as translation from those languages into English." Instructions on how to download, install, and run it. (I have successfully used Whisper and the results were very good. However, it is not fast enough to run during recording of an interview and give you live captions/transcripts; it runs after the fact, on already-recorded audio.)

Pages 19-20 of the research paper list the sources of data used in evaluating transcript quality, including 64 segments from The Late Show with Stephen Colbert, The Corpus of Regional African American Language, European Parliament event recordings, earnings calls addressing investors and the financial markets, and Mozilla's CommonVoice dataset.
posted by brainwane (60 comments total) 62 users marked this as a favorite
Pages 19-20 of the research paper list the sources of audio used in training the model
Those are the public datasets used for evaluation of the quality, not for training. They are rather vague about the actual training data, providing only rough statistics (680,000 hours of audio together with transcripts of varying quality, broken down by language).
posted by ltl at 11:12 AM on October 13, 2022 [8 favorites]

WOW - I can't wait to try this! ( ... which means, realistically, I will get to attempt installation and testing sometime in, probably, 2057).

I love learning languages and have a TON of downloaded podcasts to listen to, and of course, even when I can understand 90% of a given podcast, there are always those few words I just can't get, even with repeated listening.

I am really looking forward to seeing whether Whisper can decode those for me, with the added benefit of letting me check my comprehension of the stuff I think I AM understanding.

(I also really want to read all the in-depth material they've posted about the tool and how it works, but that probably won't happen until, say, 2071.)

The Github Discussions tab has a Show and Tell section for people to talk about how they're using this.

Also, there's more about the organization at the OpenAI About page.

Thank you SO MUCH for posting this, brainwane! I am really excited about this.
posted by kristi at 11:16 AM on October 13, 2022 [3 favorites]

There's a "no installation required" demo available on Hugging Face ("small" model variant, multilingual), limited to 30 seconds of audio recorded by microphone.
posted by ltl at 11:20 AM on October 13, 2022 [2 favorites]

This is going to make so much sensitive research data so much safer.
posted by humbug at 11:21 AM on October 13, 2022 [4 favorites]

I installed it last week on a pre-M1/2 MacBook Pro and the results were pretty amazing. I had a file that was a mix of English and Swedish with some background noise and it was able to transcribe both languages quite accurately. The downside is that it was *very* slow, like 10-20 minutes of compute time for every minute of audio. YMMV -- I would imagine it is quite zippy on an M1/M2 Mac.
posted by gwint at 11:24 AM on October 13, 2022 [3 favorites]

ltl -- thank you for the correction! Sorry for misreading. I'm disappointed because I wanted more info about where they got the training data. They work a lot with Microsoft, so, maybe calls on Skype, Microsoft Teams, and Xbox Live?
posted by brainwane at 11:56 AM on October 13, 2022

gwint: yeah, on my laptop it was pretty slow too; I think I need to fiddle with using different sized models to see whether there's some way I can speed things up with acceptable accuracy tradeoffs.
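(For anyone else fiddling along: the size knob is just a flag on the command line. Filenames here are made up; the tiny/base/small/medium model names are from the project's README.)

```shell
# Smaller models run much faster, at some cost in accuracy
whisper interview.mp3 --model tiny     # fastest, roughest
whisper interview.mp3 --model base
whisper interview.mp3 --model small    # a good English-language default
whisper interview.mp3 --model medium   # slower, noticeably more robust
```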

When I used ffmpeg to extract the audio from a video of my stand-up comedy and then ran Whisper on the resulting audio file, I was happily surprised to find that, by default?, it also emitted a .srt subtitles file and a .vtt file. The .srt file is suitable for manual editing, local viewing alongside downloaded video, and uploading to video platforms to provide greater accessibility for future audiences.
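In case anyone wants to reproduce this, the whole pipeline was only two commands (filenames made up; the ffmpeg flags just strip the video track and downmix the audio):

```shell
# Pull a mono 16 kHz audio track out of the video
ffmpeg -i standup.mp4 -vn -ac 1 -ar 16000 standup.wav
# Transcribe; alongside the console output you get .txt, .srt, and .vtt files
whisper standup.wav --model small --language en
```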
posted by brainwane at 12:03 PM on October 13, 2022 [5 favorites]

running it right now - it's still chugging along on a 30 minute podcast of mine - a hobbyist project with a lot of jargon - and it's doing shockingly well using the medium model

So far it's only missed on the words Chinook and Braggot.

posted by drewbage1847 at 12:23 PM on October 13, 2022 [2 favorites]

I have been shocked at how quickly and accurately Google's voice transcription is on my Pixel 6. It's like what I had hoped Dragon NaturallySpeaking would be 20 years ago.

I'm glad that the tech is becoming available outside of big tech's silos. If it could be made to work on cores optimized for running ML models it could really change the entire landscape.
posted by wierdo at 12:28 PM on October 13, 2022 [2 favorites]

I just got out of a meeting planning a series of interviews where we were discussing how much computer transcriptions have changed qualitative research and other projects that require transcriptions. We're going to be using Zoom to do all of our interviews in large part because of the transcriptions that can be easily created. In the past, transcribing interviews and focus groups has been expensive and very time-consuming, so having access to tools like this is incredibly helpful!
posted by ElKevbo at 12:54 PM on October 13, 2022 [4 favorites]

drewbage1847: early in the standup comedy routine I transcribed, I introduce myself. The Whisper transcript got my name nearly right, choosing an alternate transliteration that would be right in some contexts. This is, in my experience, unprecedented.

The accuracy and the privacy preservation make Whisper, for me, promising as a game-changer for audio I spoke myself.

I'm firing up yt-dlp (the more-updated? alternative to youtube-dl) and grabbing some conference talks that I delivered years ago and hadn't yet paid someone to transcribe, and that'll make it easier for me to turn them into blog posts or reuse parts of them for my forthcoming book.
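(Roughly like this, with a placeholder URL -- the -x flag tells yt-dlp to keep only the audio:)

```shell
# Fetch just the audio of an old conference talk
yt-dlp -x --audio-format mp3 -o "talk.%(ext)s" "https://example.com/my-old-talk"
# Then transcribe it
whisper talk.mp3 --model small
```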

My household sometimes records conversations where we talk about movies and TV we've just watched, and we aren't going to publish them so we were not about to pay someone to transcribe those many hours, but getting Whisper to chug on it in the background is totally doable, so now we'll have searchable transcripts to enjoy and revisit.

Sometimes it's easier for me to start drafting a talk or a memo by speaking aloud. The workaround I currently use: ask a friend to listen and take notes as I talk, and email me the notes, which I turn into my first draft. Whisper's accuracy makes it possible for me to do this by myself, whether or not someone else is available.
posted by brainwane at 1:11 PM on October 13, 2022 [7 favorites]

They work a lot with Microsoft, so, maybe calls on Skype, Microsoft Teams, and XBox Live?

I worked on a very big project transcribing many languages for a machine learning dataset for Microsoft for a good while a few years back. It used effectively anything that could be grabbed from the public internet that was any kind of creative commons or public license. Lots of public radio, podcasts, conference/speech recordings, and YouTube videos (in the small language I was working in, anyway).
posted by Dysk at 1:32 PM on October 13, 2022

Wondering how well this is going to work on my New Zealand accent, or any accent with less than say 10m speakers. Will there have been enough of us in the training set? Past experience with voice recognition is that I have to do fake American sometimes to get it to work.
posted by i_am_joe's_spleen at 1:33 PM on October 13, 2022 [2 favorites]

Well, this is something I'm going to put into the third edition of Automate the Boring Stuff. Thanks.
posted by AlSweigart at 1:40 PM on October 13, 2022 [13 favorites]

Likewise i_am_joe's_spleen, looking fwd to trying it on weekend. Most of what I record is botanical names, jargon, verbal shortcuts and any system I've tried so far - otter, speechnotes... seem to be heavily US inflected.

and all overlain with wind, chopper noise, traffic, mooing and bleating, machines
posted by unearthed at 2:05 PM on October 13, 2022

(Whoa! Al Sweigart! I'm in the same thread as Al Sweigart!)
posted by kristi at 2:06 PM on October 13, 2022 [5 favorites]

Something about the Announcer's Test seemed to kind of break the brain of that Hugging Face demo. There are many variations, but the test as I know it goes like this:

"One hen, two ducks, three squawking geese, four limerick oysters, five corpulent porpoises, six pairs of Don Alverso’s tweezers, seven thousand Macedonians in full battle array, eight brass monkeys from the ancient, sacred crypts of Egypt, nine apathetic, sympathetic, diabetic old men on roller skates with a marked propensity towards procrastination and sloth, ten lyrical, spherical, diabolical denizens of the deep who haul salt around the corner of the quay of the quarry, all at the same time."

You're supposed to rattle off the numbers cumulatively, so it gets longer and trickier each time like The Twelve Days of Christmas, and I tried doing that but it was a disaster. The transcription read, "1. 1. 1. 2. 1. 2. 1. 2. 3. 1. 2. 3. 1. 2. 1. 2. 1. 2. 1. 2. 1. 2. 1. 2. 1. 2. 1. 2. 1." It's like it got stuck on the numbers and couldn't hear anything else.

Finally I tried just doing the last iteration, and I got this:

"One hand, two ducks, three squawking geese, four limericoasters, five corpulent corpuses, six pairs of donal-versus tweezers, seven thousand Macedonians of full battle array, eight brass monkeys from the ancient sacred crypts of Egypt, nine apathetic-sympathetic, diabetic old men on roller skates with a market propensity towards procrastination of sloth, ten lyrical-spherical, diabolical denizens of the deep who hall stall around the corner of the quote of the query, all at the same time."
posted by Ursula Hitler at 2:34 PM on October 13, 2022 [6 favorites]

Like kristi, this is on my to-do list and has been since I first saw mention of it (likely here on the Blue), but at this point, I'd be happy to let some other, more technically minded MeFites play around with it for a few weeks and report back here about bugs/tweaking/etc. so I have a better idea of what to expect and who might be willing to offer me some technical assistance if I get stuck.

Way back in the archives, there's a question on the Green I posted about trying to get the IBM Watson speech-to-text program up and running. I never managed to get that to work. I even posted a Job linked to the same question (and never got any replies). I'd really like for this to work (and work for me), but I'm less than 100 per cent confident in my ability to implement it and its ability to work.
posted by sardonyx at 2:39 PM on October 13, 2022

So, reading through the transcript of the 29 minutes, just looking at the big misses. (There are some things that need editing.) Here are the bits it missed that jumped out:

transcription - actual word
braggrat - braggot - 8 times
schnuck - chinook
umpatstical - oomphtastical (I would have been floored if it caught this silly bit of me)
fermenters - fermentable
Chris Marisotter - Crisp Maris Otter
Weierman - Weyermann
Harvard - Homebrew
Isla - Islay
Humbert - homebrew
to be dumb - tippy dump
stouter - stout
reporter - porter

That's across 5500 words. That's beyond any reasonable expectation in my mind (under 1%).
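A quick sanity check on that figure (eight "braggrat"s plus the eleven one-off misses above, out of ~5500 words):

```shell
# 8 repeated misses + 11 one-offs, as a share of the word count
awk 'BEGIN { printf "%.2f%%\n", (8 + 11) / 5500 * 100 }'
```

which comes out to about 0.35% -- comfortably under 1%.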
posted by drewbage1847 at 2:53 PM on October 13, 2022 [1 favorite]

The first test I successfully ran on Whisper was a song by cortex that is only one minute long. I was pleased to note that it rendered all the profanity accurately. As opposed to the automatic captions in Google Meet, which censor swearing and, as I recall, the word "porn".

In transcribing my standup, Whisper thought "exhaustedly" was "exhaustively" which is, I admit, a much more common word.
posted by brainwane at 3:04 PM on October 13, 2022 [3 favorites]

What's the hardware requirements? The "Required VRAM" thing makes me think you need one of them unobtainable graphics cards thingies
posted by scruss at 3:06 PM on October 13, 2022 [1 favorite]

I ran mine on my mac laptop.
posted by drewbage1847 at 3:07 PM on October 13, 2022

By strange coincidence, The Limericoasters is the name of my new celtic indie-folk band
posted by acb at 3:13 PM on October 13, 2022 [3 favorites]

There is a whole subindustry of people who transcribe audio on Mechanical Turk and such sites that I can see this destroying. (Count on me for the depressing take!)
posted by JHarris at 3:14 PM on October 13, 2022 [3 favorites]

I heard from a friend at the Freedom of the Press Foundation about work on Stage Whisper, a web interface to Whisper specifically for use by journalists and newsrooms:
...not all journalists (or others who could benefit from this type of transcription tool) are comfortable with the command line and installing the dependencies required to run Whisper.

Our goal is to package Whisper in an easier to use way so that less technical users can take advantage of this neural net.....

The project is currently in the early stages of development.
posted by brainwane at 3:14 PM on October 13, 2022

Wondering how well this is going to work on my New Zealand accent, or any accent with less than say 10m speakers.

That would be most British accents? So maybe a number south of there, if this is as good as is being reported here.
posted by biffa at 3:56 PM on October 13, 2022

So it's a multi-gigabyte download and it brought my elderly but still adequate 4-core/8-thread 4 GHz i7-4790K to a near halt, but it did work almost perfectly. Using the default model, verbatim output:
Mammy, why does the iron munger sell egg meat?
Well, somehow or other he could make it the best if we just know the people that buy it around here where to buy it from because he's the only one round here that has it. Haha, you couldn't go into the dairy and ask for a pound of egg meat. They'd look at you as scans, wouldn't they?
  — Ivor Cutler, Egg Meat (excerpt)

Not bad: "iron munger" for ironmonger, "as scans" for askance. Ran at ~¼ real time, all fans howling. No, I have no graphics card.

I was worried that their idea of English wouldn't include mine.
posted by scruss at 3:57 PM on October 13, 2022 [1 favorite]

However, it is not fast enough to run during recording of an interview and give you live captions/transcripts; it runs after the fact, on already-recorded audio.

This is not true as a pure matter of speed, but it is true for unrelated reasons. Technical details follow:

When running on the CPU it is not very fast. (It's clearly unoptimized for that mode and there's active work on improving that going on now.) As a single datapoint, on my pretty beefy desktop (i9-10940X), it runs at approximately real-time with the medium model on the CPU. I'd expect it to be much slower on a typical notebook.

That said, it's really meant to run on CUDA cores (read Nvidia GPUs) and on that same desktop with a couple year-old GPU (RTX 2080 Super) it runs at approximately 3x real-time with the medium model without any significant CPU usage at all.

The small model, which is the default and very good on English audio, is about twice as fast as the medium one.

That said, while it can turn out text much faster than real-time with the right hardware, it's not designed to be used that way. It works on 30 second blocks of audio, so "live captions" would be 30-60 seconds late. I believe there's work being done to use the model in a lower-latency fashion as well, but I've not looked into that.
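Back-of-the-envelope on that latency (my own arithmetic, not anything from the codebase): you wait for a 30-second chunk to fill, then spend chunk-length divided by your speed factor transcribing it.

```shell
# Worst-case caption delay = chunk length + processing time for the chunk
awk 'BEGIN { chunk = 30
  printf "~3x real time (GPU): %.0f s behind\n", chunk + chunk / 3
  printf "~1x real time (CPU): %.0f s behind\n", chunk + chunk / 1
}'
```

That prints 40 s and 60 s respectively, hence the 30-60 second range.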

It is a very exciting new tool.
posted by bcd at 6:47 PM on October 13, 2022 [2 favorites]

In the future, self-published ebooks will be written by authors while walking to work, their podcast-like ramblings transcribed by Whisper rather than transcription farms and cleaned up by Word-E, given clickbait titles by Bait-E, and then sold as "micro-novels" by Amazon, with the option of having the book reread to you by an add-on celebrity voice plugin, also generated by AI. Imagine Ryan Reynolds smirking his way through niche fan fiction like "The Mystery of the Fluffy Pony Murders".

At some point the author will be completely removed and the books will be generated using complex prompts, much like Midjourney "art." Writers and poets will sell prompt bundles on eBay, with an underground market for everything not allowed on eBay. It won't be too different from ghostwriting tweets for $200k/year. The wealthiest, most sought-after artists will be that sliver of humanity who have hidden their work from traitorous academics, thus remaining inimitable, even by the algorithm. Individual poems, written with DNA-infused ink in small linen notebooks by famous poets, will be auctioned for millions.
posted by mecran01 at 9:14 PM on October 13, 2022 [9 favorites]

That is both brilliant and horrifying, mecran01.
posted by bcd at 9:46 PM on October 13, 2022 [1 favorite]

This is all quite poignant when I recall that it took me months of my PhD to transcribe sixty interviews, taking about a day to transcribe each 45-60-minute tape recording, using a pedal control to stop and start the tape while I typed. Maybe it would still take a day per interview for Whisper to do the work, but at least I'd have been able to do other things...

But never mind. Look back far enough and you'll see countless person-hours expended on tasks that we've now automated. I'm happy nowadays to recommend these tools to postgrads to save them some time; but there is something lost along the way. Doing the transcribing myself meant I knew the interviews inside out, which made analyzing them easier.
posted by rory at 11:07 PM on October 13, 2022 [4 favorites]

Wondering how well this is going to work on my New Zealand accent, or any accent with less than say 10m speakers.

I just tested the opening of the first episode of season 4 of the Great Australian Bake Off* using the small English model and default settings, and the results are pretty good. In an hour of runtime (no GPU) it processed 12 minutes of show, 1707 words. There were only 18 errors I could find, being pretty picky; that's about 99% correct:
transcript had 'feeling' -> should have been 'filling'
Verduce -> verjuice
Baker -> Bake Off (both hosts were saying it in semi-unison)
I -> They (someone was laughing at the time)
strict -> streaked
Let's find -> less time (some overlap between speakers)
you -> you've (mixer running in background)
make -> made
with a -> with the
grain -> green
Raice's -> Raeesa's (2x)
Just put them -> Just put'n them
I love Max -> Hello Max (laughing while talking)
Plaster and -> Plasterer
la moire -> l'amour (spoken by one of the hosts, she doesn't do a French accent)
it's so hot -> it's so (trailed off, interrupted)

The spelling is US (eg colors), but it would be pretty impressive to infer the spelling from the accent. I suspect this is a challenging task in some ways; it is professionally recorded, edited and processed sound, but there's background music and ambient kitchen noise, the speakers have multiple accents including bakers who grew up in (and have obvious accents from) France, South Africa and Singapore. It even did a good job when one of the hosts deliberately mispronounced 'table' to rhyme with 'marble'. A few of the mistakes I might have made if I wasn't listening intently and reading a transcript at the same time.

This is really impressive!

* I know, AUS != NZ, I don't have any NZ programs available to me right this moment.
posted by Superilla at 11:47 PM on October 13, 2022 [4 favorites]

I'm happy nowadays to recommend these tools to postgrads to save them some time; but there is something lost along the way. Doing the transcribing myself meant I knew the interviews inside out, which made analyzing them easier.

Given the amount of interview transcription I've done for students at the local university, this will be saving a lot of them money rather than time.

I was kind of looking at getting back to doing transcription work after a stint in something entirely different. Guess it's probably not sensible to get into the modern equivalent of the buggy whip industry.
posted by Dysk at 11:59 PM on October 13, 2022

Sorry to hear of the direct impact this will have on you, Dysk. It does seem that, between this and DALL-E et al., we're experiencing another seismic shift in the automation of entire fields of work.

Most of the students I've worked with haven't had the ££ to farm out their transcription, just as I didn't back in the day, but I know that some will have been using services such as yours. More worrying are the ones using essay-writing services to farm out the whole thing... before long there will be AIs doing that, too. Academic assessment is going to be the equivalent of those honesty boxes in the countryside where you drop your money in the slot to take a bag of apples—if it isn't already.
posted by rory at 2:25 AM on October 14, 2022

Eh, it is what it is. To a certain extent, this levels the playing field by giving everyone access to transcription, not just the rich kids, so it's a good thing! It's not like it's fascinating, quality work either - it's a rote job, very repetitive. Nobody mourns the buggy whip industry either, and quite rightly.
posted by Dysk at 4:31 AM on October 14, 2022 [1 favorite]

More and more, I feel like all the jobs are going away. Drivers will be replaced by automated cars. Store clerks will be replaced by automated registers. Artists and writers will be replaced by AI. Even a lot of healthcare jobs will be replaced by apps or tests you do at home. You might think that as automation takes over everything, it'd be a new golden age where people are finally free to do whatever they want with their lives without having to worry about making a living. But, no. It's going to be a situation where money matters as much as ever, but there are fewer and fewer ways for anybody to actually make money. I don't know what corporations will do when there's nobody left with the cash to buy their stuff. They sure as hell aren't gonna give it away for free.

If I had a kid who was looking for a viable long-term career now, I don't know what the heck I'd tell them. Try to be one of the people who creates the AIs, I guess. That'll be an in-demand job, until the AIs start creating each other.
posted by Ursula Hitler at 4:34 AM on October 14, 2022 [2 favorites]

There are some things where I think humans will continue to be preferred, if not superior. Cooking, particularly at the higher end, seems a fairly safe bet, and there will always be a demand for some degree of customer service and tech support. Warehousing seems to resist being completely de-staffed (beyond a certain point). Etc. But yes, the future is shitty retail and service jobs, for the benefit of the ultra rich.
posted by Dysk at 4:38 AM on October 14, 2022

This particular thing gets right at a few highly relevant areas of my life, my lived experience, and who I am. Two of my parents — and consequently, a fair bit of my own childhood — rubbed right up against this technology and the effects of it over the course of decades.

My mother was a career transcriptionist, specializing in both medical transcription and transcription of translated scientific journals. The hum and patter of her blue Selectric III put me to sleep as a child. My stepfather spent his entire working career as a translator of Soviet scientific journals for a small publisher.

Both awe-inspiringly smart people, of course they thought of and experimented with ways to automate their work both at a practical day to day level as well as trying to conceive the technologies that would replace them. Microcomputing was starting to change everything, and figured large as an obvious building block.

Machine translation, machine-assisted translation, voice recognition, linguistic mapping, and generally beanplating over the flow of words was a recurring theme in the house. Prototyping happened from time to time. Of course, the ML and AI spaces were still in their long infancy. Nothing a modern consumer would recognize as “AI” today was happening.

Everyday life consisted of a lot of puzzling out visual symbols in one language, creating audio recordings in another language, and turning them into visual symbols again. Occasionally both of these parents worked on the same project, saving the UPSing of audiotape back and forth to New York.

My stepfather continued his translation work into his 70s. My mother, long separated from him, went back to medical transcription and over a surprisingly short period of time watched, felt the economic effect, of the rise of first outsourcing made possible by telecommunications developments and then un-sourcing made possible by the use of automated speech recognition in her field.

This here hunk of code and blob of model encapsulates a tremendous number of things that are at the heart of who I am and how I got here. In a little under 3 minutes, I went from looking at a GitHub repo page to running a FLAC through something that produced a passable transcription in 6x real time.

I, scion of the buggy whip industry, watched my family’s empire fall when I typed a single command.
posted by majick at 6:44 AM on October 14, 2022 [15 favorites]

majick, my ex-husband is a medical transcriptionist. I'd be lying if I said I weren't a tiny bit smug -- I told that feckless nitwit to do a medical coding program instead of transcription.
posted by humbug at 7:34 AM on October 14, 2022 [1 favorite]

do a medical coding program instead of transcription.

Well, yes. There’s still money to be made feeding the EMR, especially in the Billing-Industrial Complex.

In fairness, though, we’re still years off from good medical transcription being automated. Like the product of a thousand years of scribes before my mother, it’s rote work, but highly skilled rote work. Unfortunately, there’s no market demand for good medical transcription any more. The market has a bottomless appetite for decidedly mediocre medical transcription, cheaply, and stat. You’ve been able to buy that off the shelf for years.
posted by majick at 8:10 AM on October 14, 2022 [2 favorites]

At least in the short term, I anticipate that there's a business model for value-added transcription and subtitling/captioning work by humans. Workers could use Whisper, Stage Whisper, and similar technologies to speed up making a first draft, and offer paid services in proofreading, validation (as in, "I, a human, listened to this and verify that this mostly-computer-generated transcript is correct"), speaker labelling, redaction as appropriate to the specific recording's context, and maybe even fact-checking corrections.

Even once this level of speech-to-text is built into widely deployed audio recording and playback tools/applications, I'd predict there will still be a market for "this was checked and validated by a human" services. At least for a while.
posted by brainwane at 8:14 AM on October 14, 2022 [3 favorites]

Even once this level of speech-to-text is built into widely deployed audio recording and playback tools/applications, I'd predict there will still be a market for "this was checked and validated by a human" services.

I agree, I just think it'll be a fraction of what there is now. You can still buy buggy whips!
posted by Dysk at 8:16 AM on October 14, 2022 [2 favorites]

In fairness, though, we’re still years off from good medical transcription being automated.

What I'm seeing - and I'm deeply involved in this professionally too - is that the basic transcription is going to be automated pretty soon now, but it will require a human editor to polish it properly for a long while yet.

Of course, it is those editing/proofreading passes that make the difference between mediocre and good transcription in the first place, so we are mostly saying the same thing.

In our case, I'm hoping this will let our humans focus more on editing and less on transcribing, but it is certainly anxiety-producing.
posted by bcd at 8:20 AM on October 14, 2022 [2 favorites]

"I, a human, listened to this and verify that this mostly-computer-generated transcript is correct"

Absolutely this - in particular, where the resulting transcript is going to have legal or financial weight, people are going to want a human/company they can sue for errors or omissions. There's a reason E&O insurance is a thing.
posted by bcd at 8:26 AM on October 14, 2022 [1 favorite]

I tell people occasionally that I was raised in the wild by a pack of proofreaders.

The early workplace implementation of machine assisted and outsource-assisted basic transcription a decade ago, the transformation of her role to one of listen-and-proof and downward pressure on staffing and available audio hours for high skill scribes, ultimately, drove my mom’s decision to retire from the field rather than keep swimming upstream.

Automation is transformational and while humans will remain better at a bunch of edge case tasks for a good while yet, the race to the bottom has long encouraged just… discarding (or minimizing) the extra quality you get by doing human-assisted work. My own field’s heading this way almost as quickly.
posted by majick at 8:32 AM on October 14, 2022 [2 favorites]

My spouse started as a medical transcriptionist in the 80s and has been doing/managing transcription of various sorts ever since. I've been building the software used to run her transcription business for the last decade. As a result, this whole topic is very central for both of us.

There are three sorts of editing that I feel are still a fair distance away from being automated:
  • Judgement calls about what is contributory versus what should be omitted. (The more verbatim the requirements, the less this is relevant.)
  • Where research is required to identify proper names and historical facts being referred to in the audio. (This isn't generally an issue in the medical transcription we do.)
  • Matching idiosyncratic house styles. (I'm unclear how many companies will eventually decide, "whatever our favorite ASR system produces is our house style.")
Interesting times - both good and bad.
posted by bcd at 9:04 AM on October 14, 2022 [1 favorite]

I've been using DeepL to do real time simultaneous speech-to-text transcription AND translation from Japanese to English on my iPhone. It depends heavily on conditions, but in the right environment (meaning not a lot of noise and crosstalk), it performs impressively well. Far from perfect, but way better than I would have imagined.

My partner's mother speaks no English, and she speaks Japanese in the Kansai dialect, not standard Japanese. We've used DeepL to communicate in real time with a fair amount of success. It can be awkward, and you get weird results sometimes, but much of the time, DeepL absolutely nails it, even with the dialect and heavy use of idioms/slang/etc.

When that happens, it is truly mindblowing, especially when you know how hard it is to translate from Japanese to English under the most ideal conditions. To do it solely from audio, in real time, based on real world speech in a nonstandard dialect?? That's like watching a basketball player stand at midcourt and sink 10 baskets in a row, nothing but net.
posted by mikeand1 at 12:13 PM on October 14, 2022 [3 favorites]

I run into that kind of Python thing all the time on various machines — pip is a horror show for anyone used to CPAN, and it ain’t like local::lib Perl is any great shakes — but the common thread tends towards “vendors ship a garbage Python”, or even worse there are “vendors ship python-is-python2” situations, combined with “pip is hideously out of date.”

First guess:
use pip3 instead of pip

Second guess, if pip3 is a thing but it still doesn’t work:
pip3 install --upgrade pip

Third guess:
Push system Python into a ditch, light it on fire, violently park an asteroid on it, and install Python in the crater.
posted by majick at 3:56 PM on October 14, 2022 [2 favorites]

I used mostly majick's third option.

In particular, I installed Python using Anaconda in order to easily create a PyTorch environment configured to work with my GPU, i.e. conda create --name whisper pytorch

And then use Whisper's suggested pip install git+https://github.com/openai/whisper.git within that environment.

This requires you conda activate whisper to enter that environment in your shell before using Whisper.
posted by bcd at 4:24 PM on October 14, 2022

Third guess:
Push system Python into a ditch, light it on fire, violently park an asteroid on it, and install Python in the crater.

As a long-time Perl programmer, getting "system Perl" out of the mix was the #1 reason I fell in love with containers. So, I guess the obvious next question is, "Does someone have a container for this yet?"
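I don't know of one offhand, but an untested sketch — assuming ffmpeg is the only system package Whisper needs on top of Python itself — might look something like:

```dockerfile
# Untested sketch: slim Python base plus ffmpeg, with Whisper pip-installed.
FROM python:3.10-slim
RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir git+https://github.com/openai/whisper.git
ENTRYPOINT ["whisper"]
```

Build it with docker build -t whisper . and then run it along the lines of docker run --rm -v "$PWD:/data" whisper /data/interview.mp3 so the container can see your audio files.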
posted by mikelieman at 4:45 PM on October 14, 2022 [4 favorites]

This is, essentially, a type of workload that will not run reasonably on a mostly scalar, low-parallelism general purpose CPU. You might do okaaaaaayish? with a huge Threadripper or one of those old Xeons from when they still had a decent number of available threads, but a consumer CPU without a GPU of any kind will never have the parallelism to run this workload as it exists today, not until it’s refactored for the purpose.

I do not know the state of the art for that kind of refactoring work but it’s fair to assume it’s non-trivial.

posted by majick at 7:44 PM on October 14, 2022 [3 favorites]

Well the good news is that we're on the far side of the crypto crunch so decent graphics cards are available for somewhat reasonable prices. But that doesn't fix things for laptops.
posted by wotsac at 6:39 PM on October 15, 2022 [1 favorite]

It's like what I had hoped Dragon NaturallySpeaking would be 20 years ago.

I remember desperately trying to get this to work back then in the early 2000s, and we've definitely come a long way since then even just with phone-based speech to text.

The main reason why I was trying to get it to work was to help my grandma be able to communicate and write easier and otherwise get online after surviving a stroke, and, well, it was just never going to work out for us.

One of the major problems with the program that was a huge barrier to entry was that you had to train it to recognize someone's particular voice by reading some very lengthy passages, which meant I had to make my poor grandma read a small novel at a dumb ol' eMac which she rightfully described as silly and annoying.

I couldn't really get it working on my voice training, either. But that's not really a surprise to me because I talk too fast and have some weird enunciation issues. Even today I regularly break phone-based speech to text and if I use it I have to talk so slow I might as well just write it down with a pen and take a picture of it to send instead.
posted by loquacious at 3:45 PM on October 18, 2022 [1 favorite]

In case you want to try Whisper but you don't want to fiddle with installing it on your computer:

The machine learning company Replicate is hosting a web-based version of Whisper so you can upload a sound file and get a transcription.
posted by brainwane at 2:18 PM on October 19, 2022 [3 favorites]

Oh, the hosted version is good to know about, thanks! I finally got Whisper running locally via Docker today (was having problems with Python versions and didn't want to reinstall Python) but it's still slow on a laptop and wasn't generating SRT files. Plus now I can share a link to this with other people instead of telling them to send me their files for transcription!
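(For anyone in the same boat on SRT: it's easy enough to generate yourself from Whisper's output. A rough sketch, assuming segments shaped like the dicts in result["segments"] that Whisper's transcribe() returns — "start" and "end" in seconds, plus "text":)

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.5 -> 00:00:03,500."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Build an SRT document from an iterable of Whisper-style segment dicts."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Hypothetical segments standing in for a real transcribe() result:
segments = [
    {"start": 0.0, "end": 3.5, "text": " Hello and welcome."},
    {"start": 3.5, "end": 7.25, "text": " Today we're talking about Whisper."},
]
print(segments_to_srt(segments))
```

(Newer versions of the whisper CLI can also write SRT directly, but this is handy when you only have the raw segments.)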
posted by sibilatorix at 9:25 PM on October 19, 2022 [1 favorite]

I want to see what William Gibson would do with this.
posted by conscious matter at 11:23 AM on October 21, 2022

In fairness, though, we’re still years off from good medical transcription being automated.

I'm not sure I'd want to bet my career on that, honestly.

The difference between good and bad transcription, particularly in a field with its own vocabulary, probably depends greatly on training the model. A general-purpose transcription model, trained on audiobooks and whatever else you can find online, is probably not going to do medical (or legal) transcription particularly well.

But I suspect that if you built up a good set of training data specific to the discipline, you could probably get something pretty decent.

That might be where the career path lies for a skilled human, though. Rather than doing actual transcription, I'd be looking to provide the data that companies are going to want to train their models. Which could be almost infinitely specific! The vocabulary used by a cardiologist is going to be different from that used by an ENT, which is going to be very different from that used by a neurologist or oncologist. Ideally, you'd want separate models for each of these disciplines... and maybe some sort of logic that chooses the most appropriate one based on trigger words or something.

And beyond that, my guess is you'd also probably want models trained on various accents within the specialties. E.g. my cardio guy is from Bangalore originally, and there are definitely some al-uminum vs. a-lum-in-ium type things going on with medical terms that I suspect could trip up a poorly-trained model.

Particularly for medical transcription, I'd imagine there are a lot of privacy issues involved in using actual real-world patient data for model training. So a good transcriptionist, who knows the idiosyncrasies of doctors within a particular specialty, could probably do pretty well by creating realistic simulated data and associated clean transcriptions.
posted by Kadin2048 at 12:39 PM on October 25, 2022

I've used whisper for the initial drafts of two transcriptions, so far. I didn't keep whisper's original outputs, but I used the base model in both cases. I used an online version of whisper found here. Both of my final transcriptions are moderately or heavily modified versions of the whisper output, but that's just because I'm a little OCD and tend to go overboard with this kind of thing. I think my final versions are quite good.

Last week, I used it to create a transcription of Tucker Carlson's astonishingly racist segment attacking MSNBC and Tiffany Cross (pdf), and today I used it to create a transcript of Fiona Apple's statement about her court watching of Prince George's County and the lawsuit against it (html).

I don't have an html version of the Carlson transcription, but here's the video clip, the audio only, and a docx version.

The html version I link above of Fiona Apple's transcript also contains links to the same variations as above.

BTW, I think the Tucker Carlson segment might deserve a FPP because it's so extreme and frankly diabolical. I'd like to do a thorough analysis of it, really. Fiona Apple's video (from TikTok?) also might deserve to be included in a FPP dedicated to the lawsuit.
posted by Ivan Fyodorovich at 6:41 PM on October 25, 2022 [1 favorite]

Whisper has successfully made it into my rotation of tools I reach for frequently without thinking "oh should I bother?"

I recently ran several videocalls to rehearse some standup comedy. For a few early ones, if I came up with a good riff spontaneously during the rehearsal, I paused to jot it down. But of course that broke the rhythm and the quality of the performance. Then, I started to record my own rehearsal performances. I ran Whisper afterwards and that let me skim through to find places I'd come up with a good joke on the fly, and then I could incorporate that into my notes for the next runthrough.
posted by brainwane at 7:00 PM on November 12, 2022 [3 favorites]

« Older Never say always   |   Did you know our filthy abattoir offers tours?... Newer »

This thread has been archived and is closed to new comments