Just don't play the samples backwards
September 8, 2016 2:38 PM   Subscribe

WaveNet: text to speech using a generative deep learning model. Existing text-to-speech systems use parametric generation or a concatenative approach, where tiny samples of a recorded voice are strung together to create synthesized speech. Using a deep learning technique, WaveNet generates synthetic speech a single sample at a time. Especially interesting: "If we train the network without the text sequence, it still generates speech, but now it has to make up what to say. As you can hear from the samples below, this results in a kind of babbling, where real words are interspersed with made-up word-like sounds." The academic paper.
posted by GuyZero (26 comments total) 32 users marked this as a favorite
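The "single sample at a time" idea is just an autoregressive loop: predict a distribution over the next audio sample given everything generated so far, sample from it, append, repeat. A toy sketch (the `predict_next` here is a uniform stand-in for the trained network, not the real model):

```python
import numpy as np

# Stand-in for the trained WaveNet: returns a distribution over the
# 256 mu-law quantization levels the paper uses for 8-bit audio.
# A real model would condition this on `history` (and the text).
def predict_next(history):
    return np.full(256, 1.0 / 256)

rng = np.random.default_rng(0)
samples = []
for _ in range(16000):              # one second of audio at 16 kHz
    probs = predict_next(samples)   # condition on everything so far
    nxt = rng.choice(256, p=probs)  # sample, don't just take the argmax
    samples.append(int(nxt))

print(len(samples))  # 16000 quantized samples
```

With a uniform stand-in this produces noise, of course; the point is only the shape of the loop, and why generation at 16,000 predictions per second of audio is so expensive.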
 
Oops, we tried to make Google Now sound less weird but accidentally invented AI glossolalia and also Arnold Schoenberg.
posted by theodolite at 2:56 PM on September 8, 2016 [9 favorites]


Awwww crap we pushed that PageRank update too early and now every computer on Earth believes in God
posted by theodolite at 2:59 PM on September 8, 2016


This is pretty cool. I've lately been messing with the Natural Language Toolkit for tagging parts of speech, to help make 'AIs' better able to recognize intent and subject when queried and hopefully add conversationality/variety to their output. I assume these TTS engines have got to do some sort of tagging as well, to know where to put emphasis and vary speed/tone when 'speaking'.
posted by destructive cactus at 3:00 PM on September 8, 2016


I assume these TTS engines have got to do some sort of tagging as well, to know where to put emphasis and vary speed/tone when 'speaking'.

With the concatenative model I think yes that's the case but with the deep learning model I think maybe not? But that's based on my lay reading of what's been done.
posted by GuyZero at 3:03 PM on September 8, 2016


Oh wait no:

In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, etc.) and by feeding it into WaveNet. This means the network’s predictions are conditioned not only on the previous audio samples, but also on the text we want it to say.

So I think the model input is phonetic as opposed to literal text, so the engine doesn't have to deal with English's insane pronunciation idiosyncrasies. Presumably with the Mandarin model it's phonetic including tone.
posted by GuyZero at 3:05 PM on September 8, 2016
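A hypothetical sketch of what that conditioning means, with made-up feature names for illustration (the actual feature set isn't spelled out in the blog post): at each timestep the network's prediction depends on both the audio history and a feature vector derived from the text.

```python
import numpy as np

PHONEMES = ["h", "@", "l", "oU"]  # toy phoneme inventory

def phoneme_features(index):
    vec = np.zeros(len(PHONEMES))  # one-hot phoneme identity; real
    vec[index] = 1.0               # features add stress, duration, etc.
    return vec

rng = np.random.default_rng(0)
W = rng.normal(size=(256, len(PHONEMES)))  # toy conditioning weights

def predict_next(history, cond):
    # Stand-in for p(x_t | x_<t, h): different phonemes should push
    # the distribution over the 256 output levels in different directions.
    logits = W @ cond
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

p_h = predict_next([], phoneme_features(0))  # conditioned on "h"
p_l = predict_next([], phoneme_features(2))  # conditioned on "l"
print(p_h.shape)  # (256,)
```

Without the conditioning vector, the distribution depends only on the audio history, which is exactly the "babbling" mode described in the post.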


Cool! The piano music kind of reminds me of Conlon Nancarrow.
posted by teponaztli at 3:06 PM on September 8, 2016


Goofs aside, this is really cool! The WaveNet clips are miles ahead of any speech synthesis I've ever heard. How close are we to making Siri sound exactly like Douglas Rain?
posted by theodolite at 3:10 PM on September 8, 2016 [1 favorite]


(Are the speaking-in-tongues clips all Australian?)
posted by theodolite at 3:12 PM on September 8, 2016


On first listen, I would swear up and down that the "babbling" clips are German or Dutch. And that's knowing a little German and a very little Dutch. Something about English prosody and consonants, but mixed up vowel diphthongs and syllables maybe? Or recognizing a few words (I feel like verstehen and zeit jumped out more than once), but not the whole sentence.
posted by supercres at 3:25 PM on September 8, 2016 [2 favorites]


Aww, it's a lil babby computer!
(This was my initial response to the idea of it babbling)
posted by dialMforMara at 3:27 PM on September 8, 2016 [1 favorite]


I wonder what the corpus was for the piano bits, because I like to imagine that the collected works for piano in the Western art music tradition all kind of converge on sounding like late-period Debussy when you smoosh them all together.
posted by invitapriore at 3:33 PM on September 8, 2016 [2 favorites]


If we can train this on a corpus of a specific person's speech, that seems to mean we can now synthesize an imitation of anyone's voice saying anything we want them to.
posted by wanderingmind at 4:04 PM on September 8, 2016


On first listen, I would swear up and down that the "babbling" clips are German or Dutch.
I heard the same thing!
posted by linux at 4:09 PM on September 8, 2016 [1 favorite]


daisy, daisy...
posted by the antecedent of that pronoun at 4:18 PM on September 8, 2016 [1 favorite]


I'd love to see more of these music samples, and trained on more specific corpora, like all Bach for example. (The ones they gave sounded very Scriabin-esque to me.)
posted by gold-in-green at 4:18 PM on September 8, 2016 [1 favorite]


The random speech is nightmare-inducing in the most delicious way.
posted by grumpybear69 at 4:21 PM on September 8, 2016 [1 favorite]


Goddamn but this stuff is cool. It's phenomenal what a huge leap forward we're making in applied machine learning right now.
posted by Nelson at 4:23 PM on September 8, 2016 [4 favorites]


It's amazing how the generated babble speech creates the mouth noises as well - breathing, a slight smacking of the mouth.
posted by GuyZero at 5:11 PM on September 8, 2016 [1 favorite]


Wow, WaveNet is nearly nice enough to listen to reading an audiobook. Obviously I want this in a Trump voice on my phone.
posted by Damienmce at 6:08 PM on September 8, 2016


It's amazing how the generated babble speech creates the mouth noises as well - breathing, a slight smacking of the mouth.

I've never seen Her, but when it came out, I listened to a movie review podcast where they played a clip. In it, Scarlett Johansson's AI character discusses whether or not her feelings are real, or just a simulacrum. She thinks she has emotions, but she can't prove that she isn't just programmed to believe that. She can't even prove that there's a her at all to have feelings, rather than just a cleverly designed but deterministic program producing a string of words that happen to be a philosophical argument. Joaquin Phoenix's character reassures her that of course she's real.

Divorced from the context of the rest of the movie and any visuals, the sequence gave me the most profoundly disquieting sensation, because you could hear breathing and mouth sounds from Scarlett Johansson as she spoke. It probably wasn't intended by the filmmakers, but the idea of something without a mouth or lungs, which had nevertheless been clearly designed to produce the illusion of them for the benefit of its users, wondering whether it had been given the illusion of emotions and intelligence for the benefit of its users, seemed to me to have such an obvious and poignant answer. I never wanted to watch the movie, for fear of spoiling that.

So I have mixed feelings about artificial speech that includes breathing.
posted by figurant at 6:14 PM on September 8, 2016 [4 favorites]


This is pretty awesome, especially because they got it to work on raw samples. I've experimented with generating random speech with LSTMs via speech codecs, and it's pretty CPU-intensive because each speech frame covers 10 ms or so, and you might have 10-20 variables in each frame, so you might want to train across a couple seconds of data.

I haven't read the paper yet, but it sounds like they're feeding in raw samples for the entire history, so 16000 Hz * N seconds = ??? That's a heck of a lot of neurons to generate one sample. But they're DeepMind, so they might be doing something fancy (or just have access to a heckuva lot of computing power)
posted by RobotVoodooPower at 6:58 PM on September 8, 2016 [1 favorite]
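For what it's worth, the paper's answer to the blow-up RobotVoodooPower describes is dilated causal convolutions: stacking layers whose dilation doubles each layer (1, 2, 4, ..., 512, then repeated) gives a receptive field that grows exponentially with depth, so the model doesn't need one input per raw sample of history. A quick back-of-envelope:

```python
# Receptive field of stacked dilated causal convolutions with filter
# width 2, as in the WaveNet paper's example configuration.
def receptive_field(dilations, filter_width=2):
    # each layer adds (filter_width - 1) * dilation samples of context
    return 1 + sum((filter_width - 1) * d for d in dilations)

one_stack = [2 ** i for i in range(10)]       # dilations 1, 2, ..., 512
print(receptive_field(one_stack))             # 1024 samples
print(receptive_field(one_stack * 3))         # 3070 samples for 3 stacks
print(receptive_field(one_stack * 3) / 16000) # ~0.19 s of context at 16 kHz
```

So the context window is a few thousand samples, not "16000 Hz * N seconds" of fully connected history, though the compute budget is still substantial.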


Well, I for one am terrified.
posted by odinsdream at 7:02 PM on September 8, 2016


wow wow wow wow

Those are good, excited, happy "wows" -- this fascinates me and makes me happy enough to leave me a little breathless. Next stop: check if Mark Liberman at Language Log has blogged on this yet.

A linguist, and especially a phonetician, would surely be able to instantly identify the distinguishing traits of each of the babbling clips. At least one sounds as if it were trained in German, a couple have Scottish/Welsh/Irish traits (I know each is quite distinct, but to my American ears sometimes the ones closer to RP are hard to differentiate), and one sounds pretty English anglophone. But maybe they are all trained on the same American anglophones as the actual text samples!

There is so much to explore here, and I haven't even started on the music stuff or the paralinguistic sounds.

Ahhhh! I'm overestimated, moderator please hope me!
posted by Ivan Fyodorovich at 7:04 PM on September 8, 2016 [2 favorites]


On a sidenote, it's super-cool to me that everyone remarking on the piano bits so far has invoked a different composer -- I brought up Debussy because I hear extended tertian harmonies, modality (buttressed by chromaticism, thus the "late-period" qualifier), and maybe a bit of planing in the audio samples linked in the article -- precisely because the parameters of musical style seem particularly nebulous in a way that maybe accents aren't?

I think about this with regards to machine learning a lot, and I have some very bittersweet feelings about it as a musician: I'd love to have the sort of acute theoretical mind that could put together just what makes a Ravel or a Messiaen tick just by listening, and so in spite of how fascinating it is, there's something excruciatingly sad to me about the fact that an NN can do exactly that but that I'll never be able to inspect its representation of that knowledge in a way that would be meaningful to me. Maybe something similar to the deep dream approach could be used here -- for each layer, start with white noise, and with some prior beliefs about what music should sound like attached, climb hills until you find the sound that that layer believes is the most representative of the features it looks for -- but of course it's hard to guarantee that the higher-level features deduced by the network will be intelligible to the human ear.
posted by invitapriore at 9:27 PM on September 8, 2016


If we can train this on a corpus of a specific person's speech, that seems to mean we can now synthesize an imitation of anyone's voice saying anything we want them to.

I read this in Gilbert Gottfried's voice. And soon enough, so will everyone else.
posted by um at 11:03 PM on September 8, 2016


I'm, uh, casually curious if there are enough master recordings out there of the late Majel Barrett as the Star Trek computer to feed into this system and produce usable results.
posted by figurant at 8:47 PM on September 13, 2016




This thread has been archived and is closed to new comments