Siri and Alexa are good, but no one would mistake them for a human being. Google’s newest project, however, could change that.
Called Tacotron 2, the latest attempt to make computers talk like people builds on two of the company’s most recent text-to-speech projects, the original Tacotron and WaveNet.
Tacotron 2 pairs the text-mapping abilities of its predecessor with the speaking prowess of WaveNet for an end result that is, frankly, a bit unsettling. It takes text and, based on training from snippets of actual human speech, maps it onto a spectrogram, a visual representation of how the frequencies in a sound change over time. The spectrogram is then turned into actual speech by a vocoder based on WaveNet. Tacotron 2 uses an 80-dimensional spectrogram, meaning each slice of time is described by 80 frequency values, which Google says is enough to recreate not only the accurate pronunciation of words but the natural rhythms of human speech as well. The researchers report their work in a paper published to the preprint server arXiv.
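To make the spectrogram idea concrete, here is a minimal Python sketch, using NumPy, that turns a sound into an 80-channel mel-scale spectrogram, the kind of representation the paper describes. It runs in the analysis direction (audio in, spectrogram out), whereas Tacotron 2 predicts the spectrogram from text and then hands it to the vocoder. Only the figure of 80 comes from the article; the sample rate, window size, hop length, mel formula, and function names are illustrative assumptions, not Google's settings.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mel_spectrogram(signal, sample_rate, n_fft=1024, hop=256, n_mels=80):
    """Return a (frames, n_mels) array of log mel energies."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    mel = magnitudes @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(np.maximum(mel, 1e-5))  # clamp to avoid log(0)

# Stand-in for recorded speech: one second of a 440 Hz tone.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

spec = mel_spectrogram(tone, sample_rate)
print(spec.shape)  # (number of time slices, 80 channels)
```

Each row of `spec` is one short slice of time and each of the 80 columns is one frequency band, which is the sense in which the spectrogram has 80 dimensions.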
Most computer voice programs use a library of recorded syllables and words to construct sentences, an approach called concatenative synthesis. When humans speak, we vary our pronunciation widely depending on context; a program that stitches together the same recordings every time cannot, and this gives computer-speak its lifeless patina. Google is trying to get away from the repetition of words and sounds and construct sentences based not only on the words they’re made of, but on what they mean as well. The program uses a neural network, a web of interconnected nodes, to identify patterns in speech and ultimately predict what will come next in a sentence, helping to smooth out intonation.
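The library-lookup approach can be illustrated with a toy Python sketch. The unit library, its sample values, and the function name are all made up for illustration; real systems store many recorded units and smooth the joins between them.

```python
# Hypothetical unit library: every word maps to one fixed, pre-recorded
# waveform (shown here as a few made-up sample values).
UNIT_LIBRARY = {
    "i": [0.3, -0.1],
    "read": [0.1, 0.4, -0.2],
    "books": [0.2, 0.2, -0.3, 0.1],
}

def concatenate(text):
    """Build a waveform by gluing stored units together, word by word."""
    samples = []
    for word in text.lower().split():
        samples.extend(UNIT_LIBRARY[word])  # same recording every time
    return samples

statement = concatenate("I read books")
reordered = concatenate("books I read")
print(statement)
print(reordered)
```

In both outputs the samples for "read" are identical, whatever surrounds them. That fixed lookup is the repetition the article describes, and it is what a model that predicts speech from context tries to avoid.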
The researchers back up their bluster with a bevy of examples posted online. Where WaveNet sounded accurate but a bit flat, Tacotron 2 sounds fleshed out and impressively varied. For a sample, just check out the same phrase repeated by both programs:
WaveNet:
Tacotron 2:
http://blogs.discovermagazine.com/d-brief/files/2018/01/bluelagoon_01.mp3
The program can also handle complex, multi-syllabic words with ease, and can be instructed to add stress to words or syllables to alter the interpretation of sentences. This means Tacotron 2 can phrase things as questions and correctly differentiate between homonyms, as well as more subtle things like highlighting the subject of a sentence by adding emphasis to a word.
http://blogs.discovermagazine.com/d-brief/files/2018/01/basilar_01.mp3
The final, and most compelling, test is a side-by-side comparison of a human and a computerized voice. Tacotron 2 earns a mean opinion score of 4.53 out of 5, a widely used measure of speech quality based on listener ratings, the researchers say, compared to 4.58 for professionally recorded speech. See if you can tell the difference:
http://blogs.discovermagazine.com/d-brief/files/2018/01/washington_gt_01.mp3 http://blogs.discovermagazine.com/d-brief/files/2018/01/washington_gen_01.mp3
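The speech-quality figures quoted above are mean opinion scores: listeners rate each clip on a scale from 1 to 5 and the ratings are averaged. A minimal Python sketch of that arithmetic follows. The ratings are invented for illustration, and the confidence interval uses a simple normal approximation, which may differ from the researchers' exact procedure.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (mean, 95% confidence half-width) for 1-to-5 listener ratings."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    mean = statistics.mean(ratings)
    # Normal-approximation interval: 1.96 standard errors either side.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Made-up ratings from eight hypothetical listeners, not data from the paper.
ratings = [5, 4, 5, 4.5, 5, 4, 4.5, 5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of listeners the interval is wide, which is why a gap as small as 4.53 versus 4.58 is only meaningful when many ratings are collected.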
Although the program is impressive, it still has a few flaws. It can’t inject any emotion into its speech, and isn’t yet fast enough to produce audio in real time. And don’t ask it to order wine for you either:
http://blogs.discovermagazine.com/d-brief/files/2018/01/merlot_01.mp3