They're not human... yet
September 2, 2015 8:23 PM

Mention Vocaloid, and most people think of this. But this is also a Vocaloid. As is this, and this. (warning: Youtube-heavy)

With her collaborations last year with the likes of Lady Gaga and Pharrell Williams, as well as an appearance on the Late Show (previously), the Japanese Vocaloid superstar Hatsune Miku (Japanese product page) has been the focus of rising interest in mainstream Western media. Still, the idea of a truly synthetic pop star is somewhat problematic to most folks, and one of the biggest hurdles is Miku's robotic, high-pitched singing voice, as well as her inability to sing in comprehensible English. But what people don't necessarily realise is that singing voice synthesizing technology has been around for decades - the Vocaloid engine itself is over ten years old, and since then the quest for an ever more realistic-sounding Vocaloid has been going strong.

The very first Vocaloids (previously), LEON and LOLA, were English voice banks released back in 2004 by the Zero-G studio. They were followed quickly by MIRIAM, voiced by Miriam Stockley (of Adiemus fame). Later that same year, Crypton Future Media (the company that made Hatsune Miku) released MEIKO (now in her third iteration), the first Japanese Vocaloid, whose voice was provided by the singer Meiko Haigo.

The Zero-G Vocaloids met with a lukewarm reception, and their lack of sales in the US was blamed on their "British" accents. (Leon and Lola's voice providers are unknown, but Miriam Stockley is a South African-born British singer.) MEIKO, on the other hand, sold well in Japan (success being defined, apparently, by whether the product could sell 1000 units within the first year, and MEIKO sold more than 3000 units). She remained Crypton's bestselling Vocaloid until Hatsune Miku's release in 2007.

What made MEIKO so much more successful than Leon or Lola? The most obvious answer is language: English is a much more difficult language to vocalize than Japanese.

The English language itself is made up of about 20 vowel sounds and 24 consonant sounds. Also, English doesn't have a systematic orthography, so there is no one-to-one or near one-to-one match between letters and sounds, as there is in other languages like Spanish or Japanese...

VOCALOID and VOCALOID2 use American spelling for the lyrics... However, the phonetic notation doesn't follow this, and instead uses Received Pronunciation written in X-SAMPA, with some minor modifications where required, as in the case of the allophones...

In some instances, Producers may be found to have adjusted VSQ and VSQX (VOCALOID format) files so heavily to make them work for one particular English VOCALOID that they become "VOCALOID specific" and do not work particularly well on other English VOCALOIDs without further adjustments. Cases like this are rarer in languages such as Japanese, though not unknown, and many VSQ and VSQX files will work without too much adjusting.
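
To make the orthography point in the passage above concrete, here is a minimal Python sketch. It is not taken from the Vocaloid software or the wiki: the mini-lexicon and both function names are hypothetical, and the transcriptions are broad, standard X-SAMPA for Received Pronunciation rather than the modified phoneme set that the VOCALOID editor actually uses. It only shows why an English voicebank needs a pronunciation dictionary (or hand-entered phonemes) instead of a simple letter-to-sound table.

```python
from collections import defaultdict

# Hypothetical mini-lexicon: word -> (sounds before "ough", sound of "ough").
# Broad X-SAMPA for Received Pronunciation; illustrative only, not the
# phoneme set shipped with any VOCALOID voicebank.
LEXICON = {
    "though": ("D", "@U"),
    "through": ("Tr", "u:"),
    "tough": ("t", "Vf"),
    "cough": ("k", "Qf"),
}


def readings_of_ough(lexicon):
    """Group words by the sound that the letters 'ough' stand for."""
    readings = defaultdict(list)
    for word, (_, ough_sound) in lexicon.items():
        readings[ough_sound].append(word)
    return dict(readings)


def to_xsampa(word, lexicon=LEXICON):
    """Look a word up; the spelling alone does not determine the sounds."""
    if word not in lexicon:
        raise KeyError(f"{word!r} needs a dictionary entry or manual phonemes")
    onset, ough_sound = lexicon[word]
    return onset + ough_sound


if __name__ == "__main__":
    for word in LEXICON:
        print(word, "->", to_xsampa(word))
    print(len(readings_of_ough(LEXICON)), "different readings of 'ough'")
```

The same four letters get a different reading in every one of these words, so no per-letter rule can cover them. A language with near one-to-one spelling, like the Spanish or Japanese mentioned above, can get much further with a small fixed table.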

Compelling though this argument may be, it does not fully explain the success of the Cryptonloids (short for Crypton Vocaloids). KAITO (voiced by Naoto Fuga) was developed as a complementary male voicebank for MEIKO, but didn't fare as well; his release in 2006 was infamous for being the one and only time a Crypton Vocaloid failed upon launch, having sold only 500 units in his first year. His failure has resulted in Crypton turning their focus almost exclusively to female Vocaloids. It's not really known why he failed initially; in 2007, he enjoyed a sudden revival of popularity, and he has since been hailed by Vocaloid song producers for the versatility of his voice and his compatibility with other Vocaloids. It's probably safe to say though that he failed because he was male, and people wanted female Vocaloids.

The perception of Vocaloids as robotic background singers incapable of carrying a song through by themselves changed with the release of the Vocaloid 2 engine in 2007. The sound of breathing, and the ability to tune the Vocaloid to create a more husky-sounding voice, were added to allow more realism. In addition, Megurine Luka, the first natively bilingual Vocaloid, was released. She was, in fact, meant to be the first Vocaloid for the Vocaloid 2 engine, but the schedules of her voice provider (voice actress Yuu Asakawa, who is fluent in both English and Japanese) clashed with those of Crypton, and she became the third voicebank released for Vocaloid 2. Hatsune Miku became the first, and the rest is history.

Interestingly, Crypton chose voice actors instead of professionally trained singers to provide the voices for Vocaloid 2, because they feared that the Vocaloids would overtake their human counterparts in popularity and thus ruin a real singer's career. Indeed, Yuu Asakawa, who voiced Luka, expressed vehement objection to the suggestion of a voice synthesiser realistic enough to take over the job of a radio personality. Naoto Fuga also disliked the first Vocaloids that he heard, although he has since embraced his fame as KAITO.

Those fears may not be completely unfounded. Among the plugins for Vocaloid 2 was a demo for VocaListener, which analysed a human's singing and automatically created estimated parameters based on it using the song lyrics. In other words, it allowed the Vocaloid to follow the inflections of the human singer almost perfectly. Although it has been released for the Vocaloid 3 engine, it's still not widely used, partly because it's still restricted to Japanese only and partly because Vocaloid fans thought that the Vocaloids' individual quirks and "personality" were lost when VocaListener was used.

It could be argued that when it comes to natural-sounding Vocaloids, the importance of language, gender and engine capabilities pales in comparison with the talent and ability of the people using the software, known in the Vocaloid community as Producers. With some skilled tuning, Producers can manipulate Vocaloids into emulating some degree of dynamics and even emotions. They can even make Japanese voicebanks sing heavily-accented English, and vice versa. But they have competition: following Luka's popularity as a bilingual Vocaloid, English voice banks have also been released for all of the other Cryptonloids. The level of realism is varied, depending on the voice provider's level of English ability.

But other studios have not been sitting on their hands while Crypton steams ahead in singing synthesizer technology. Internet Co., Ltd's GUMI (voiced by Megumi Nakajima) is widely thought to have the most realistic English voicebank of all the bilingual Japanese Vocaloids. IA (voiced by singer Lia) is at times almost indistinguishable from a real human's.

Even English Vocaloids are slowly making progress: Avanna has recently been released for the Vocaloid 3 engine, and is considered the most realistic English voicebank available. And with the Vocaloid 4 (warning: autoplay) engine recently released, these virtual singers can only get more realistic.

But here are some of the most realistic-sounding songs from the Vocaloids mentioned in this post, and you can judge for yourself how realistic Vocaloids have become over the years.

Take On Me by v3xman2

Fly Me To The Moon

Poem Weaved in Ruins by hinayukki/仕事してP
Amazing Grace (in English using Japanese V1 voicebank) by Nanameue-P
Love Is War

Byakkoya no Musume (The Girl In Byakkoya) - cover of theme song from Paprika by Susumu Hirasawa, which itself features LEON
Alice (mu-cho remix)
Jougen no Tsuki (Crescent Moon) by Kurousa-P; compare to Kazuki Kato's version

Hatsune Miku V2:
Freely Tomorrow by Mitchie M
Crystal Quartz; compare to Saki Fujita's version

Megurine Luka V2:
Megane (Glasses)
Poker Face (English voicebank)

Fragments of Star (English voicebank)
Interstellar Flight; compare to Megumi Nakajima's version

IA V3:
IA ROCKSで「daze」 by Ciel(神無月P)
Tori no Uta; compare to Lia's version

Avanna V3:
Titanium mix by GrandMasterFlames159
posted by satoshi (24 comments total) 44 users marked this as a favorite
I actually first found out about Vocaloids from the 2006 post on them on Metafilter, and became fascinated by the technology, so it seemed appropriate to make a more in-depth one covering their progress since then.
posted by satoshi at 8:25 PM on September 2, 2015 [6 favorites]

Where's Gloriana O'Toole when you need her?

Or even Norman Spinrad....
posted by jefflowrey at 8:38 PM on September 2, 2015 [1 favorite]

Wow, fantastic write-up, I look forward to digging in!
posted by filthy light thief at 8:46 PM on September 2, 2015

FirstPost? Wow!
posted by benito.strauss at 9:04 PM on September 2, 2015 [1 favorite]

Excellent post!

Here are some demos of Megurine Luka V4X. Compare its English quality to the original V2.

Miku's appearance on Letterman last year was to promote the concert at the Hatsune Miku Expo 2014 in New York and Los Angeles. I was lucky enough to get a ticket before they sold out. It was a lot of fun, the imperfect pronunciation doesn't matter when it's being played/sung/synthesized loud enough to fill the hall.

The Expo wasn't the first Vocaloid event in Los Angeles, and it probably won't be the last. I'm not sure about East Coast events, though; I heard that the New York concert wasn't very profitable (the place was seemingly packed, and the VIP tickets sold out, but it was a small venue).
posted by Rangi at 9:12 PM on September 2, 2015

Some of these Avanna tracks are downright amazing. They don't sound perfect yet but they're close enough to only sound kinda off.
posted by kafziel at 9:13 PM on September 2, 2015

sci-fi as fuck
posted by brennen at 9:43 PM on September 2, 2015 [3 favorites]

I dunno, I listened to the first three links above the cut. Bailed out of the first two because my affection for j-pop is pretty nonexistent, as is my ear for What Japanese Should Sound Like, then the third one claimed to be in English, but as soon as I turned off the subtitles, I ceased being able to tell what she was singing over the faux-Enya track. The vocal samples are better than they were in the 8-bit days but I'm not really convinced it's that much more comprehensible than S.A.M. back on my old c64.

Close your eyes while listening to Leon doing 'Take On Me'. If you read the lyrics as they're being sung then your brain interprets them, but all I hear with my eyes closed is assorted held consonants with weird little low-res slides between them. If I knew the entire song by heart I'm sure I'd be filling in the edges and thinking it was better, but all I really remember is the refrain.

Or this one, Gumi singing something called 'Fragments of Star'. Close your eyes to shut out the captions; what is she saying? Can you make out a single line? I sure can't.

Maybe I just randomly picked the worst English examples in this post? I dunno. Five Vocaloid snippets is my limit. I'm really not feeling like a major jump in vocal mimicry is being made here.
posted by egypturnash at 11:13 PM on September 2, 2015

I remember being fascinated by vocaloids and excited to hear all the cool music people could make with them. I was thinking, since the voice is hard to reproduce digitally, maybe a lot of musicians using digital instruments would have taken to using vocaloids as a cheap substitute. But instead I found it really hard to actually find anything that appealed to me since the vast majority of vocaloid songs are just covers of other songs but with a vocaloid. Also there's no telling until you listen whether the artist bothered to realistically tune them at all. So you have to wade through all of that to find anything good.

Also the ones with unique voices tend to be underused, which is a shame since there's such interesting music you can make with the other ones if you know what to do with them. For example:

Mew - Effanineffable

Mew - Tsutsun Gokko

V Flower - Kyouto da Kashiya Sensou

V Flower - Absolute Dance
posted by picklenickle at 12:00 AM on September 3, 2015

Excellent timing on this post, as I've this week been mostly falling into a Youtube rabbit hole of (fan made) vocaloid vids, like baka baka baka which has been stuck in my head since Sunday. Which led to dueling versions of "World is Mine" which led to Onii Yuukai, featuring Hatsune Miku and her "sisters" having feelings for their vocaloid "brother" which do not lead to a happy place.

And then there's Kasane territory featuring rival vocaloid Kasane Teto in a song that sounded very familiar because it's a take off of Kero 9 Destiny which itself is of course derived from Cirno's Perfect Math Class: in other words, there's somewhat of a crossover between Touhou and vocaloid fandom.

How do normal people respond to all this nonsense?
posted by MartinWisse at 12:20 AM on September 3, 2015 [1 favorite]

So it turns out the first step to making robot pop stars was making it normal for pop stars to sound like singing robots.
posted by idiopath at 12:40 AM on September 3, 2015 [6 favorites]

They're not human… and also not yet available for any platforms other than Windows.

Come on folks, this isn't the 1990s.
posted by acb at 1:42 AM on September 3, 2015 [1 favorite]

oh dear lord, i didn't expect this much text when I opened this topic.
there goes my day...
thank you, satoshi.
posted by bigendian at 3:12 AM on September 3, 2015

> I found it really hard to actually find anything that appealed to me since the vast majority of vocaloid songs are just covers of other songs...

That's the nature of fan activity, though. They want to replicate or riff on existing works rather than create entirely new work.

There is a lot of original vocaloid music, though. The Exit Tunes Presents series has a good selection of amateur and/or pro productions.
posted by ardgedee at 4:03 AM on September 3, 2015

Why the proprietary interface, though? Why not stick it into an instance of Kontakt or EW Play or something? Carefully-crafted instruments, scripts, and sample sets let talented composers create pretty convincing orchestral mockups. Why not with Vocaloids, too?

Realivox does this, but the quality isn't quite up to the Vocaloid standard, possibly because it uses scripts to generate the phrasing sort of like Vocaloid does. You could set up keyswitching to control banks of articulations, note data to enter the articulation, and then listen on another channel, maybe, for pitch. I dunno, not my gig, but I think you could make some really convincing stuff that way.
posted by uncleozzy at 5:56 AM on September 3, 2015

I wonder if Damon Albarn might be looking into vocaloid technology as a step towards making 2D and the other Gorillaz into their own semi-autonomous AI music collective.
posted by Strange Interlude at 6:01 AM on September 3, 2015

I wish posts like this didn't show up while I was at work where I can't give them the attention I'd like to because I think Vocaloid is great fun.

For those who aren't big fans of the J-pop/anison genre, here's some Miku you might like.
posted by Gev at 6:10 AM on September 3, 2015

> The Expo wasn't the first Vocaloid event in Los Angeles, and it probably won't be the last.

There has been another one since then in fact, as IA performed at Club Nokia this summer (during Anime Expo).

I was at both concerts (Miku Expo 2014 and IA) and had a good time, although Miku's was better -- mostly because it was a big crowd in a big theater, and so the energy and enthusiasm was better. I had a pretty good seat for Miku Expo so the hologram worked better as well (unfortunately your viewing angle matters a lot as you might expect, for IA I was in the balcony and looking down is not as good).

Also Neko Atsume now has a Vocaloid powered CD (3 CDs, actually).

Miku's video games (Project Diva) are pretty fun (although difficult!) as well.
posted by thefoxgod at 6:35 AM on September 3, 2015 [1 favorite]

Awesome post! I really like Vocaloid music - I think because it tends to be catchy and quirky but accessible, and vocals that you can't quite understand are appealing in a Sigur Rósian sort of way. I only really knew about the Vocaloids from the Miku games though (they're pretty good!), it's interesting to learn some of the history and hear the new English ones.

Regarding making them sound human, I get the impression that many producers aren't particularly trying to, or even that particular impossible sounds are a desirable feature. When people do try though, it's fascinating trying to listen and work out what it is that makes the voice sound off - it seems like it takes a lot of artistry to work in the slurs, breaths and other imperfections needed to make them more natural.

The Paprika theme cover "by" Kaito in the OP is pretty impressive, and Mitchie M is a producer specifically trying to get Miku to sound more conversational and human. This song that picklenickle linked is amazing though. Maybe it's the chorus hiding things, but I was convinced it was actual people until half way through. (Watch out, the music video gets a bit strobey towards the end.)
posted by lucidium at 4:31 PM on September 3, 2015

> Regarding making them sound human, I get the impression that many producers aren't particularly trying to, or even that particular impossible sounds are a desirable feature.

I'll just link to some previous thoughts I've had about that.
posted by ardgedee at 5:15 PM on September 3, 2015

Great first post. Wow.
posted by joseph conrad is fully awesome at 7:45 PM on September 3, 2015

Current favourite traditional vocaloid track: World Calling featuring IA, which avoids the techno-dance-pop aesthetic you might associate with vocaloid music at first blush, and instead goes with a very twee indie-pop sound that reminds me a bit of Cornelius. (Turn up the volume, the original video is quiet for some reason.)

Current favourite non-traditional vocaloid track: Slow Snow featuring Sekka Yufu. It's really chill, heavy on gauzy synths and narcotic pacing, and the delicate vocals fit perfectly. It's basically the polar opposite of Hatsune Miku. The artist, mus.hiba, occasionally works with real singers as well.

Current favourite non-traditional vocaloid.
posted by chrominance at 9:13 PM on September 3, 2015

Turns out there will be another Vocaloid event in NYC! The expo last year had all of Crypton's Vocaloids, and now IA will be performing. Tickets are $90, but I was able to buy one on presale with their mobile app for $47.
posted by Rangi at 5:16 PM on September 9, 2015
