Original paper: https://arxiv.org/pdf/1705.08947.pdf
A few months after the publication of the Deep Voice paper, researchers from the same company published Deep Voice 2, an expansion of the first proposed methodology. In the second paper, the authors propose a few improvements over the original publication: multi-speaker support, separation of modules, and an increase in training data. Although the overall architectures of the two methodologies are very similar, the authors tune the system to achieve better performance and to provide the new features mentioned above.
The key difference between Deep Voice 1 and Deep Voice 2 is support for multi-speaker datasets. This is achieved by feeding a low-dimensional embedding of each speaker into each module of the Deep Voice 2 architecture, so that during training and inference the network can learn the characteristics of each specific speaker. Another improvement over Deep Voice 1 is the use of a much larger dataset: while the first paper used 20 hours of audio, Deep Voice 2 used more than 250 hours of audio from more than 250 speakers. The main source of this dataset was a database of audiobooks recorded by professional speakers.
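To make the speaker-embedding idea concrete, here is a minimal, hypothetical sketch of one common conditioning pattern: a learned per-speaker vector is projected to a gate that scales a module's hidden activations. The sizes, the gating choice, and all names here are illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

random.seed(0)

# Hypothetical sizes, not taken from the paper.
NUM_SPEAKERS = 4   # voices in the multi-speaker dataset
EMBED_DIM = 16     # low-dimensional speaker embedding size
HIDDEN_DIM = 32    # width of one module's hidden layer

# Embedding table: one low-dimensional vector per speaker, which would be
# learned jointly with the rest of the network during training.
speaker_table = [[random.gauss(0, 0.1) for _ in range(EMBED_DIM)]
                 for _ in range(NUM_SPEAKERS)]

# Projection from the speaker embedding to a per-speaker gate over
# the module's hidden units.
W_gate = [[random.gauss(0, 0.1) for _ in range(HIDDEN_DIM)]
          for _ in range(EMBED_DIM)]

def condition(hidden, speaker_id):
    """Scale hidden activations by a speaker-dependent sigmoid gate."""
    emb = speaker_table[speaker_id]
    gate = [1 / (1 + math.exp(-sum(emb[i] * W_gate[i][j]
                                   for i in range(EMBED_DIM))))
            for j in range(HIDDEN_DIM)]
    return [h * g for h, g in zip(hidden, gate)]

# The same hidden activations produce different outputs per speaker.
hidden = [random.gauss(0, 1) for _ in range(HIDDEN_DIM)]
out0 = condition(hidden, 0)
out1 = condition(hidden, 1)
```

Because each module receives the speaker embedding this way, a single network can serve all speakers while still producing speaker-specific output.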
As mentioned, the architectures of Deep Voice 1 and 2 are very similar, with the exception that in the second publication the authors implement the phoneme duration and frequency prediction as separate models. This improves the performance of the network and enables multi-speaker support, since each module is trained with a low-dimensional embedding of each speaker.
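The decoupling can be sketched as a toy pipeline: a duration model first assigns frames to each phoneme, and a separate frequency model then produces an F0 value per frame. The functions below are hypothetical stand-ins for the neural models (the lookup values and per-speaker offsets are invented for illustration); only the data flow between the two stages reflects the described architecture.

```python
# Toy stand-in for the separate phoneme duration model:
# predicts a number of frames for each phoneme.
def predict_durations(phonemes, speaker_id):
    base = {"HH": 5, "AH": 8, "L": 6, "OW": 9}  # invented values
    # Speaker offset mimics speaker-dependent speaking rate.
    return [base.get(p, 7) + speaker_id for p in phonemes]

# Toy stand-in for the separate frequency model:
# consumes the predicted durations and emits one F0 value per frame.
def predict_f0(durations, speaker_id):
    pitch = 120.0 + 40.0 * speaker_id  # crude per-speaker base pitch
    return [pitch for d in durations for _ in range(d)]

phonemes = ["HH", "AH", "L", "OW"]  # "hello"-like toy input
durs = predict_durations(phonemes, speaker_id=1)
f0 = predict_f0(durs, speaker_id=1)
```

The key point is that the frequency model only sees the duration model's output, so each stage can be trained and tuned independently.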
The results presented by Deep Voice 2 were fairly impressive at the time of publication. The audio quality surpasses previous TTS methodologies, and the MOS and accuracy numbers show that the proposed model is approaching the ground truth; however, there is still a statistically significant difference between a human speaker and the computer-generated voice.