Deep Voice by Baidu Labs (Summary by: Ricardo Reimao)

Original Paper: https://arxiv.org/abs/1702.07825

Researchers from Baidu Labs published a paper in early 2017 about their project, Deep Voice. The project aims to create a production-quality, end-to-end solution for Text-To-Speech (TTS). The key points of Deep Voice are that it requires no specialist knowledge during training or inference, and that the system can generate audio in real time.

Deep Voice breaks the TTS problem down into five models:

  • Grapheme-to-phoneme model, responsible for converting written text (e.g. English, Chinese) into phonemes. This is helpful because, in most languages, the same sequence of characters can produce different sounds depending on the surrounding characters. For example, in the words “frost” (phoneme: frôst) and “roast” (phoneme: rōst), the sequence “ro” sounds completely different, so different phonemes are used in each word (a toy lookup sketch follows this list).
  • Segmentation model, used only during the training phase, responsible for finding the start and end of each phoneme in an audio recording. Once these boundaries are found, the audio segment is isolated so the neural network can learn the sound of that exact phoneme.
  • Phoneme duration model, responsible for predicting how long each phoneme lasts. This is helpful for generating natural-sounding audio, since the same phoneme can have different durations depending on the word, or even on the context of the phrase. For example, the “mo” in “model” and “more” has different durations: it is a short sound in the first word and a longer sound in the second.
  • Fundamental frequency model, responsible for predicting the intonation (fundamental frequency, or F0) of each phoneme, as well as whether the phoneme is voiced at all. This is also very important for generating natural-sounding audio: voiced sounds such as vowels carry a pitch, while unvoiced sounds such as “s” or “f” are produced without vocal-cord vibration and have no fundamental frequency (a small sketch after this list illustrates the duration and fundamental-frequency predictions together).
  • Audio synthesis model, responsible for generating the audio waveform from the phonemes, durations, and fundamental frequencies. This model is heavily based on the WaveNet architecture, but it implements several improvements that allow audio to be generated in real time (a sketch of the autoregressive sampling loop follows this list).

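To make the grapheme-to-phoneme step more concrete, here is a minimal sketch. In the paper this component is a neural sequence-to-sequence model trained on a pronunciation dictionary; the tiny hard-coded lexicon and ARPAbet-style phonemes below are purely illustrative and are not part of Deep Voice.

```python
# Minimal grapheme-to-phoneme illustration using a hand-made lookup table.
# The real Deep Voice component is a neural sequence-to-sequence model;
# this tiny lexicon with ARPAbet-style phonemes is purely illustrative.

G2P_LEXICON = {
    "frost": ["F", "R", "AO1", "S", "T"],  # the "ro" here sounds like "aw"
    "roast": ["R", "OW1", "S", "T"],       # the "ro" here sounds like "oh"
}

def graphemes_to_phonemes(word: str) -> list[str]:
    """Return the phoneme sequence for a word, spelling it out as a fallback."""
    return G2P_LEXICON.get(word.lower(), list(word.upper()))

for word in ("frost", "roast"):
    print(word, "->", graphemes_to_phonemes(word))
```
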
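Similarly, the duration and fundamental-frequency predictions can be pictured as regression heads over per-phoneme features. The sketch below uses a tiny, randomly initialized NumPy network with made-up feature and layer sizes, just to show the inputs and outputs involved; the actual Deep Voice models are trained recurrent networks.

```python
import numpy as np

# Hypothetical per-phoneme feature vector: one-hot phoneme identity plus a few
# context flags (e.g. word-final, stressed). All sizes below are made up.
NUM_PHONEMES = 40
FEATURE_DIM = NUM_PHONEMES + 2
HIDDEN = 64

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(FEATURE_DIM, HIDDEN))  # shared hidden layer
w_duration = rng.normal(scale=0.1, size=HIDDEN)         # duration head (seconds)
w_voiced = rng.normal(scale=0.1, size=HIDDEN)           # voiced/unvoiced head
w_f0 = rng.normal(scale=0.1, size=HIDDEN)               # fundamental frequency head (Hz)

def predict(phoneme_features: np.ndarray):
    """Predict (duration, probability of being voiced, F0) for one phoneme."""
    h = np.tanh(phoneme_features @ W1)
    duration = 0.1 * np.exp(h @ w_duration)           # keep durations positive
    p_voiced = 1.0 / (1.0 + np.exp(-(h @ w_voiced)))  # sigmoid
    f0 = 100.0 + 100.0 / (1.0 + np.exp(-(h @ w_f0)))  # roughly 100-200 Hz
    return duration, p_voiced, f0

features = np.zeros(FEATURE_DIM)
features[3] = 1.0  # pretend this is phoneme id 3
print(predict(features))
```
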
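Finally, the audio synthesis model generates the waveform one sample at a time, conditioned on the previous samples and on the linguistic features (phonemes, durations, F0). The sketch below replaces the WaveNet-style network with a random placeholder distribution; it only illustrates why naive autoregressive sampling is expensive (one full network evaluation per audio sample), which is what the inference optimizations discussed later address.

```python
import numpy as np

SAMPLE_RATE = 16_000    # audio samples per second
RECEPTIVE_FIELD = 256   # how many past samples the model sees (made-up size)
rng = np.random.default_rng(0)

def next_sample_distribution(history: np.ndarray, conditioning: np.ndarray) -> np.ndarray:
    """Stand-in for the WaveNet-style network: returns a distribution over 256
    mu-law quantized amplitude levels. A real model would be a deep stack of
    dilated convolutions conditioned on the phoneme, duration and F0 features;
    here the inputs are ignored and the distribution is random."""
    logits = rng.normal(size=256)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def synthesize(conditioning: np.ndarray, num_samples: int) -> np.ndarray:
    """Autoregressive generation: one full network evaluation per audio sample,
    which is why real-time inference needs heavily optimized kernels."""
    audio = np.zeros(num_samples, dtype=np.int64)
    for t in range(num_samples):
        history = audio[max(0, t - RECEPTIVE_FIELD):t]
        probs = next_sample_distribution(history, conditioning)
        audio[t] = rng.choice(256, p=probs)
    return audio

samples = synthesize(conditioning=np.zeros(8), num_samples=SAMPLE_RATE // 100)  # 10 ms of audio
print(samples[:10])
```
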
The majority of previous work in the machine learning field implemented only one or two of these models and relied on specialist-engineered solutions for the other parts of the TTS pipeline. The problem with relying on specialists is that it may take weeks of parameter tuning to reach a natural-sounding voice. In the Deep Voice project, all five components are implemented as neural network models, which require minimal specialist effort. According to the publication, these machine learning models perform similarly to or better than the traditional non-machine-learning methods.

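As a rough picture of how the trained components fit together at inference time, the sketch below chains hypothetical stand-ins for the four inference-time models; none of the function names correspond to the paper's actual code, and the segmentation model is training-only, so it does not appear.

```python
# A rough sketch of how the inference pipeline could be chained together.
# Every function name here is a hypothetical placeholder, not the paper's API.

def text_to_speech(text, g2p, duration_model, f0_model, vocoder):
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(g2p(word))                     # 1. graphemes -> phonemes
    durations = [duration_model(p) for p in phonemes]  # 2. per-phoneme durations
    f0 = [f0_model(p, d) for p, d in zip(phonemes, durations)]  # 3. pitch / voicing
    return vocoder(phonemes, durations, f0)            # 4. waveform synthesis

# Dummy stand-ins so the sketch runs end to end.
audio = text_to_speech(
    "roast frost",
    g2p=lambda w: list(w.upper()),   # pretend letters are phonemes
    duration_model=lambda p: 0.08,   # 80 ms per phoneme
    f0_model=lambda p, d: 120.0,     # flat 120 Hz pitch
    vocoder=lambda ph, dur, f0: [0.0] * int(sum(dur) * 16_000),
)
print(len(audio), "samples")
```
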
Another key contribution of the paper is that the authors share a good amount of detail on how this architecture can be implemented to maximize performance on CPUs and/or GPUs. The authors compare several parallelization strategies and propose an implementation with impressive results: they report that their optimized inference kernels are up to 400 times faster than existing WaveNet implementations.