Original paper: https://arxiv.org/abs/1710.07654
After the success of DeepVoice 1 and DeepVoice 2, researchers from the same company published a paper describing their work on DeepVoice 3. In this new paper, the researchers discarded the architecture used in DeepVoice 1 and DeepVoice 2 and introduced a completely new neural network architecture for speech synthesis. With this new architecture, the system trains faster, allowing them to scale up to more than 800 hours of training data from more than 2,400 speakers.
The new architecture is based on an encoder-decoder scheme, or in other words, a fully-convolutional sequence-to-sequence (character-to-spectrogram) model. The encoder, built from fully-convolutional blocks, receives the written text and transforms it into an internal learned representation. The decoder, a fully-convolutional causal decoder with an attention mechanism, decodes this learned representation into a low-dimensional audio representation (mel-scale spectrograms). These audio representations are sent to converters so they can be transformed into audio using traditional (Griffin-Lim, WORLD) or machine-learning (WaveNet) methods.
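The key building block of such a causal decoder is a convolution whose output at time t depends only on inputs at times up to t, so spectrogram frames can be generated autoregressively. A minimal sketch of this idea (the function name and the toy kernel are illustrative, not from the paper; real models stack many such layers with gated activations and residual connections):

```python
def causal_conv1d(x, kernel):
    """Convolve x with kernel so that output[t] depends only on x[:t+1].

    Left-padding the input with (len(kernel) - 1) zeros is the standard
    trick that makes a 1-D convolution causal.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(x))]

signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.5]  # toy 2-tap averaging filter
out = causal_conv1d(signal, kernel)
# out[t] mixes only the current and past samples, never future ones
```

Because no output depends on future inputs, changing a late sample leaves all earlier outputs untouched, which is what allows the decoder to run step by step at inference time.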
The results were impressive at the time of publication. The training time of DeepVoice 3 outperformed other TTS systems, being about ten times faster than Tacotron, the state of the art at the time. The authors claim that the system is able to serve more than 10 million TTS requests a day on a single-GPU server. Moreover, DeepVoice 3 converges in significantly fewer iterations than other TTS systems. Also, with improvements to the attention-based convolution blocks, the authors were able to reduce the common errors of attention-based networks, meaning the output speech had fewer mistakes (repetitions, mispronunciations, etc.) than previous systems. The naturalness of the audio generated by DeepVoice 3 also ranks at the top compared with previous research, showing that even a fast system can achieve human-like voice audio.
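One way the paper reduces repetition and skipping errors is by forcing the attention to move monotonically through the text at inference time. A toy sketch of that idea: instead of attending to the globally highest-scoring character, restrict the choice to a small forward window from the previously attended position (the function name and window size here are my own, not the paper's):

```python
def constrained_argmax(scores, prev_pos, window=3):
    """Pick the best-scoring position within a forward window.

    scores:   attention scores over encoder positions for one decoder step
    prev_pos: position attended at the previous decoder step
    window:   how far forward the attention may jump (toy value)
    """
    lo = prev_pos
    hi = min(prev_pos + window, len(scores))
    return max(range(lo, hi), key=lambda i: scores[i])

# The unconstrained argmax would jump back to index 0 (score 0.9),
# re-reading text it already spoke; the constraint keeps it moving forward.
scores = [0.9, 0.1, 0.2, 0.8, 0.3]
pos = constrained_argmax(scores, prev_pos=2)
```

By ruling out backward jumps, this kind of constraint eliminates repeated words even when the raw attention scores are noisy.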