Joint Training Methods for Tandem and Hybrid Speech Recognition Systems using Deep Neural Networks

Hidden Markov models (HMMs) have been the mainstream acoustic modelling approach for state-of-the-art automatic speech recognition (ASR) systems over the past few decades. Recently, due to the rapid development of deep learning technologies, deep neural networks (DNNs) have become an essential part of nearly all kinds of ASR approaches. Among HMM-based ASR approaches, DNNs are most commonly used to extract features (tandem system configuration) or to directly produce HMM output probabilities (hybrid system configuration).

Although DNN tandem and hybrid systems have been shown to outperform traditional ASR systems that use no DNN models, issues remain with such systems. First, some of the DNN settings, such as the choice of the set of context-dependent (CD) output targets and of the hidden activation functions, are usually determined independently of the DNN training process. Second, different ASR modules are optimised separately according to different criteria, following a greedy build strategy. For instance, in tandem systems the features are often extracted by a DNN trained to classify individual speech frames, while the acoustic models are built on those features according to a sequence-level criterion. As a result, the best overall performance is not theoretically guaranteed.

This thesis focuses on alleviating both issues using joint training methods. In DNN acoustic model joint training, the decision tree HMM state tying approach is extended to cluster DNN-HMM states. Based on this method, an alternative CD-DNN training procedure that does not rely on any additional system is proposed, which produces DNN acoustic models comparable in word error rate (WER) to those trained by the conventional procedure. Meanwhile, the most common hidden activation functions, the sigmoid and rectified linear unit (ReLU), are parameterised to enable automatic learning of the function forms. Experiments using conversational telephone speech (CTS) Mandarin data give average relative character error rate (CER) reductions of 3.4% and 2.2% with the sigmoid and ReLU parameterisations, respectively. Such parameterised functions can also be applied to speaker adaptation tasks.
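To make the idea of parameterised activations concrete, the sketch below shows one plausible parameterisation (the exact functional forms and parameter names here are illustrative assumptions, not necessarily those used in the thesis): the sigmoid gains a learnable amplitude and input scale, and the ReLU gains learnable slopes on each side of zero. With default parameter values, each reduces to its standard form.

```python
import numpy as np

def p_sigmoid(x, eta=1.0, gamma=1.0):
    """Parameterised sigmoid (illustrative form): eta scales the
    amplitude, gamma scales the input; both could be learned by
    backpropagation alongside the ordinary DNN weights."""
    return eta / (1.0 + np.exp(-gamma * x))

def p_relu(x, alpha=1.0, beta=0.0):
    """Parameterised ReLU (illustrative form): learnable slope alpha
    on the positive side and beta on the negative side; alpha=1,
    beta=0 recovers the standard ReLU."""
    return np.where(x > 0.0, alpha * x, beta * x)

x = np.array([-2.0, 0.0, 2.0])
print(p_sigmoid(x))                 # standard sigmoid values at the defaults
print(p_relu(x))                    # equals np.maximum(x, 0) at the defaults
print(p_relu(x, alpha=1.0, beta=0.1))  # a leaky negative-side slope
```

Because the parameters are plain scalars inside a differentiable function, they receive gradients like any weight, which is also what makes them usable for speaker adaptation: a small number of per-speaker activation parameters can be updated while the bulk of the network stays fixed.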

At the ASR system level, the DNN acoustic model and the corresponding speaker dependent (SD) input feature transforms are jointly learned through minimum phone error (MPE) training as an example of hybrid system joint training, which outperforms the conventional hybrid system speaker adaptive training (SAT) method. MPE-based speaker independent (SI) tandem system joint training is also studied. Experiments on multi-genre broadcast (MGB) English data show that this method reduces tandem system WER by 11.8% relative, and the resulting tandem systems are comparable to MPE hybrid systems in both WER and number of parameters. In addition, all approaches in this thesis have been implemented using the hidden Markov model toolkit (HTK), and the related source code has been or will be made publicly available with recent or future HTK releases, to increase the reproducibility of the work presented in this thesis.
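The joint training idea described above, where per-speaker input transforms and the shared acoustic model are updated from the same objective rather than in separate greedy stages, can be caricatured with a toy sketch. Everything here is an illustrative assumption: a plain cross-entropy loss on random frames stands in for MPE (which requires lattices and sequence-level statistics), and the network is a single shared hidden layer. The point is only the gradient flow: the error signal backpropagates through the shared weights and on into the current speaker's linear transform, so both are optimised jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 5-dim features, 8 hidden units, 3 classes, 2 speakers.
D, H, C, S = 5, 8, 3, 2
W1 = rng.normal(0.0, 0.3, (H, D)); b1 = np.zeros(H)  # shared hidden layer
W2 = rng.normal(0.0, 0.3, (C, H)); b2 = np.zeros(C)  # shared output layer
A = [np.eye(D) for _ in range(S)]                    # per-speaker input transforms

def forward(x, s):
    z = A[s] @ x                      # speaker-dependent feature transform
    h = np.maximum(W1 @ z + b1, 0.0)  # shared hidden layer (ReLU)
    o = W2 @ h + b2
    e = np.exp(o - o.max())           # stable softmax
    return z, h, e / e.sum()

lr = 0.1
for step in range(200):
    s = step % S                      # alternate between speakers
    x = rng.normal(size=D)            # stand-in for an acoustic frame
    y = s % C                         # arbitrary toy label
    z, h, p = forward(x, s)
    g_o = p.copy(); g_o[y] -= 1.0           # d(cross-entropy)/d(output)
    g_h = (W2.T @ g_o) * (h > 0)            # back through the ReLU
    g_z = W1.T @ g_h                        # gradient reaching the transform
    # Joint update: shared weights AND the current speaker's transform.
    W2 -= lr * np.outer(g_o, h); b2 -= lr * g_o
    W1 -= lr * np.outer(g_h, z); b1 -= lr * g_h
    A[s] -= lr * np.outer(g_z, x)
```

In conventional SAT the transforms and the acoustic model would be estimated in alternating, separately-motivated passes; here a single criterion drives both sets of parameters, which is the property the MPE joint training in this thesis exploits at full scale.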

Supervisor: Woodland, Phil
Keywords: Deep Neural Network, Automatic Speech Recognition, Joint Training
Qualification: Doctor of Philosophy (PhD)
Awarding Institution: University of Cambridge
Sponsorship: Cambridge International Scholarship, Cambridge Overseas Trust; EPSRC Natural Speech Technology Project; DARPA BOLT Program; iARPA Babel Program