DEEP NEURAL NETWORKS EMPLOYING MULTI-TASK LEARNING AND STACKED
BOTTLENECK FEATURES FOR SPEECH SYNTHESIS
Zhizheng Wu, Cassia Valentini-Botinhao, Oliver Watts, Simon King
Centre for Speech Technology Research, University of Edinburgh, United Kingdom
ABSTRACT
Deep neural networks (DNNs) use a cascade of hidden representations
to enable the learning of complex mappings from input to output
features. They are able to learn the complex mapping from text-based
linguistic features to speech acoustic features, and so perform
text-to-speech synthesis. Recent results suggest that DNNs can
produce more natural synthetic speech than conventional HMM-based
statistical parametric systems. In this paper, we show that the hidden
representation used within a DNN can be improved through the use
of Multi-Task Learning, and that stacking multiple frames of hidden
layer activations (stacked bottleneck features) also leads to
improvements. Experimental results confirmed the effectivene