Attention-Based Encoder-Decoder Models for Speech Processing
Speech processing is one of the key components of machine perception. It covers a wide range of topics and plays an important role in many real-world applications. Many speech processing problems are modelled using sequence-to-sequence models. More recently, the Attention-Based Encoder-Decoder (AED) model has become a general and effective neural network architecture that transforms a source sequence into a target sequence. The two sequences may have different lengths and belong to different modalities. AED models offer a new perspective on various speech processing tasks. In this thesis, the fundamentals of AED models and Automatic Speech Recognition (ASR) are first covered. The rest of the thesis focuses on the application of AED models to three major speech processing tasks: speech recognition, confidence estimation and speaker diarisation.
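The core mechanism shared by the AED models discussed throughout the thesis is attention: at each decoding step, the decoder forms a context vector as a normalised weighted sum of encoder states. A minimal NumPy sketch with toy arrays (function names and dimensions are illustrative, not code from the thesis):

```python
import numpy as np

def dot_product_attention(query, keys, values):
    """Single-query dot-product attention: scores each encoder state,
    normalises the scores to sum to one, and returns the weighted sum."""
    scores = keys @ query                        # similarity to each encoder state, shape (T,)
    scores = scores - scores.max()               # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ values                   # weighted sum of encoder states
    return context, weights

# Toy example: 4 encoder states of dimension 3
rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 3))
values = rng.standard_normal((4, 3))
query = rng.standard_normal(3)
context, weights = dot_product_attention(query, keys, values)
```

Because the attention weights form a distribution over encoder positions, the decoder can consume a source sequence of any length, which is what lets AED models map between sequences of different lengths and modalities.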
Speech recognition technology is widely used in voice assistants and dictation systems. It converts speech signals into text. Traditionally, Hidden Markov Models (HMMs), which are generative sequence-to-sequence models, have been widely used as the backbone of acoustic models. Under the Source-Channel Model (SCM) framework, the ASR system finds the most likely text sequence to have produced the corresponding acoustic sequence, in combination with a language model and a lexicon. Alternatively, the speech recognition task can be addressed discriminatively using a single AED model. Each modelling approach has distinct characteristics. As the first contribution of the thesis, the Integrated Source-Channel and Attention (ISCA) framework is proposed to leverage the advantages of both approaches using two passes. The first pass uses the traditional SCM-based ASR system to generate diverse hypotheses, either in the form of N-best lists or lattices. The second pass obtains the AED model score for each hypothesis. Experiments on the Augmented Multi-Party Interaction (AMI) dataset showed that ISCA with two-pass decoding reduced the Word Error Rate (WER) by 13% relative compared to a joint SCM and AED system using one-pass decoding. Further experiments on both the AMI dataset and the larger Switchboard (SWB) dataset showed that, if the SCM and AED systems were trained separately to be more complementary, the combined system using ISCA outperformed each individual system by around 30%. Furthermore, the refined lattice rescoring algorithm is significantly better than N-best rescoring, as a lattice is a more compact representation of the hypothesis space, especially for longer utterances.
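A stylised view of the second pass in such two-pass rescoring: each first-pass hypothesis carries an SCM score, the AED model supplies a second score, and the two log-scores are interpolated before re-ranking. The hypotheses, scores and interpolation weight below are purely illustrative (real ISCA rescoring also operates on lattices, not just N-best lists):

```python
# Hypothetical N-best list: (hypothesis, SCM log-score, AED log-score)
nbest = [
    ("the cat sat", -12.3, -10.1),
    ("the cat sad", -11.8, -14.7),
    ("a cat sat",   -13.0, -11.5),
]

def isca_rescore(nbest, lam=0.5):
    """Second pass: interpolate SCM and AED log-scores with weight lam,
    then return the highest-scoring hypothesis."""
    def combined(entry):
        _, scm_score, aed_score = entry
        return (1.0 - lam) * scm_score + lam * aed_score
    return max(nbest, key=combined)[0]

best = isca_rescore(nbest)  # -> "the cat sat"
```

Note how the second hypothesis wins on the SCM score alone (`lam=0.0`) but is demoted once the AED score is interpolated in, which is exactly the kind of complementarity the combined system exploits.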
With various advancements in neural network training, AED models can reach similar or better performance than traditional systems for many ASR tasks. Compared to a conventional ASR system, one important attribute that an AED-based system often lacks is good confidence scores, which indicate the reliability of automatic transcriptions. Confidence scores are very helpful for various downstream tasks, including semi-supervised training, keyword spotting and dialogue systems. As the second contribution of this thesis, effective confidence estimators for AED-based ASR systems are proposed. The Confidence Estimation Module (CEM) is a simple, lightweight add-on neural network that takes various features from the encoder, attention mechanism and decoder to estimate a confidence score for each output unit (token). Experiments on the LibriSpeech dataset showed that, compared to using Softmax probabilities as confidence scores, the CEM improved token-level confidence estimation performance substantially and largely addressed the over-confidence issue. For various downstream tasks such as data selection, utterance-level confidence scores are more desirable. The Residual Energy-Based Model (R-EBM), an utterance-level confidence estimator, was demonstrated to outperform both Softmax probabilities and the CEM. The R-EBM operates directly at the utterance level and implicitly takes deletion errors into account. The R-EBM also provides a global normalisation term for the locally normalised auto-regressive AED models. On the LibriSpeech dataset, the R-EBM reduced the WER of an AED model by up to 8% relative. One potential issue for model-based confidence estimators such as the CEM and R-EBM is their performance on Out-of-Domain (OOD) data. To ensure that confidence estimators generalise well for OOD input, two simple approaches are suggested that can effectively inject OOD information during the training of the CEM and R-EBM.
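The CEM idea can be caricatured in a few lines: gather per-token features from the AED model and map them through a small trained network to a probability that the token is correct. The sketch below uses a single untrained linear layer with random features purely to show the shape of the computation; an actual CEM is a trained network with features drawn from the encoder, attention mechanism and decoder, and none of the names below come from the thesis:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cem_confidence(features, w, b):
    """Lightweight CEM sketch: a linear layer plus sigmoid maps a
    per-token feature vector (e.g. decoder state, attention context,
    top Softmax probabilities) to a confidence score in [0, 1]."""
    return sigmoid(features @ w + b)

# Toy per-token feature vector and (untrained) weights, purely illustrative
rng = np.random.default_rng(1)
features = rng.standard_normal(8)
w = rng.standard_normal(8)
conf = cem_confidence(features, w, b=0.0)
```

In practice such a module is trained with a per-token binary cross-entropy target (token correct vs. incorrect), which is what lets it correct the over-confidence of raw Softmax probabilities.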
Speaker diarisation, the task of identifying “who spoke when”, is a crucial step for information extraction and retrieval. The speaker diarisation pipeline often consists of multiple stages, the last of which performs clustering over segment-level or window-level speaker representations. Although clustering is normally an unsupervised task, this thesis proposes the use of AED models for supervised clustering. With specific data augmentation techniques, the proposed approach, Discriminative Neural Clustering (DNC), has been shown to be an effective alternative to unsupervised clustering algorithms. Experiments on the very challenging AMI dataset showed that DNC reduced the Speaker Error Rate (SpkER) by around 30% relative compared to a strong spectral clustering baseline. Furthermore, DNC opens up further interesting research directions, e.g. speaker diarisation with multi-channel or multi-modal information and end-to-end neural network-based speaker diarisation.
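One practical detail that makes supervised clustering with a sequence-to-sequence model workable is a permutation-invariant target encoding: speaker identities carry no meaning in themselves, so DNC-style training can relabel each reference sequence by order of first appearance, giving the model a single canonical target per input. A small sketch (the helper name is illustrative):

```python
def first_appearance_labels(speaker_ids):
    """Relabel a speaker sequence so clusters are numbered in order of
    first appearance, yielding a permutation-invariant target that a
    sequence-to-sequence clustering model can be trained to emit."""
    mapping = {}
    relabelled = []
    for speaker in speaker_ids:
        if speaker not in mapping:
            mapping[speaker] = len(mapping)   # next unused cluster index
        relabelled.append(mapping[speaker])
    return relabelled

labels = first_appearance_labels(["B", "A", "B", "C", "A"])
# -> [0, 1, 0, 2, 1]
```

Any permutation of the original speaker identities maps to the same relabelled sequence, so the model never has to guess which arbitrary identity the reference happened to use.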