Expressive visual text to speech and expression adaptation using deep neural networks


Type
Conference Object
Authors
Parker, J 
Maia, R 
Stylianou, Y 
Abstract

In this paper, we present an expressive visual text-to-speech (VTTS) system based on a deep neural network (DNN). Given an input text sentence and a set of expression tags, the VTTS system produces not only the audio speech but also the accompanying facial movements. The expression can be either one of those in the training corpus or a blend of several of them. Furthermore, we present a method for adapting a previously trained DNN to a new expression using a small amount of training data. Experiments show that the proposed DNN-based VTTS is preferred by 57.9% over the baseline hidden Markov model-based VTTS, which uses cluster adaptive training.
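
The abstract does not say how expression tags or blends are encoded, so the following is only a minimal sketch under stated assumptions: a blend is represented as a normalized weight vector over the corpus expressions and appended to the per-frame linguistic features fed to the DNN. The expression names, `blend_expression_tags`, and `build_frame_input` are hypothetical and are not taken from the paper.

```python
# Hypothetical expression inventory; the abstract does not list the corpus expressions.
EXPRESSIONS = ["neutral", "happy", "sad", "angry"]


def blend_expression_tags(weights):
    """Map {expression: weight} to a normalized vector ordered like EXPRESSIONS.

    A single expression yields a one-hot vector; several expressions yield
    a convex combination (assumed representation of a "blend").
    """
    unknown = set(weights) - set(EXPRESSIONS)
    if unknown:
        raise ValueError(f"unknown expressions: {sorted(unknown)}")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [weights.get(name, 0.0) / total for name in EXPRESSIONS]


def build_frame_input(linguistic_features, expression_vector):
    """Concatenate per-frame linguistic features with the expression vector.

    The concatenated layout is an assumption made for illustration.
    """
    return list(linguistic_features) + list(expression_vector)


single = blend_expression_tags({"happy": 1.0})
mixed = blend_expression_tags({"happy": 3.0, "sad": 1.0})
frame = build_frame_input([0.1, 0.2], mixed)
```

Under these assumptions, an expression from the corpus and a blend of expressions share the same input format, so the same network input layer can serve both cases.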

Keywords
Expressive Visual Text to Speech, Expression Adaptation, Deep Neural Network
Journal Title
2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Conference Name
ICASSP 2017 The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing
Journal ISSN
1520-6149
Publisher
IEEE