Ensemble generation and compression for speech recognition


Type

Thesis

Authors

Wong, Jeremy Heng Meng 

Abstract

For many tasks in machine learning, performance gains can often be obtained by combining an ensemble of multiple systems. In Automatic Speech Recognition (ASR), a range of approaches can be used to combine an ensemble when performing recognition. However, many of these have computational costs that scale linearly with the ensemble size. One method to address this is teacher-student learning, which compresses the ensemble into a single student. The student is trained to emulate the combined ensemble, and only the student needs to be used when performing recognition. This thesis investigates both methods for ensemble generation and methods for ensemble compression.

The first contribution of this thesis is to explore approaches to generating multiple systems for an ensemble. The combined ensemble performance depends on both the accuracy of the individual members of the ensemble and the diversity between their behaviours. The structured nature of speech allows for many ways in which systems can be made different from each other. The experiments suggest that significant combination gains can be obtained by combining systems with different acoustic models, sets of state clusters, and sets of sub-word units. When performing recognition, these ensembles can be combined at the hypothesis and frame levels. However, these combination methods can be computationally expensive, as data is processed by multiple systems.

This thesis also considers approaches to compress an ensemble and reduce the computational cost when performing recognition. Teacher-student learning is one such method. In standard teacher-student learning, information about the per-frame state cluster posteriors is propagated from the teacher ensemble to the student, to train the student to emulate the ensemble. However, this has two limitations. First, it requires that the teachers and student all use the same set of state clusters. This limits the forms of diversity that the ensemble can have. Second, ASR is a sequence modelling task, and the frame-level posteriors that are propagated may not effectively convey all information about the sequence-level behaviours of the teachers. This thesis addresses both of these limitations.
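
The standard frame-level setup described above can be illustrated with a minimal sketch. It assumes, for illustration only, that the teachers' per-frame posteriors are combined by a weighted average and that the student is trained by minimising the KL divergence from that combined target; the function names and the uniform default weights are hypothetical and are not taken from the thesis.

```python
import numpy as np

def combine_teacher_posteriors(teacher_posteriors, weights=None):
    """Weighted average of per-frame posteriors: (M, T, S) -> (T, S).

    M teachers, T frames, S state clusters shared by all systems.
    Uniform weights are an assumption made for this sketch.
    """
    teacher_posteriors = np.asarray(teacher_posteriors, dtype=float)
    num_teachers = teacher_posteriors.shape[0]
    if weights is None:
        weights = np.full(num_teachers, 1.0 / num_teachers)
    weights = np.asarray(weights, dtype=float)
    # Contract the teacher axis against the weight vector.
    return np.tensordot(weights, teacher_posteriors, axes=1)

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def frame_level_ts_loss(student_logits, target_posteriors, eps=1e-12):
    """KL(target || student), averaged over frames."""
    log_q = log_softmax(np.asarray(student_logits, dtype=float))
    p = np.asarray(target_posteriors, dtype=float)
    kl_per_frame = np.sum(p * (np.log(p + eps) - log_q), axis=-1)
    return float(kl_per_frame.mean())
```

Note that the sketch only works because every system indexes the same S state clusters, which is exactly the first limitation named above.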

The second contribution of this thesis is to address the first limitation, and allow for different sets of state clusters between systems. The proposed method maps the state cluster posteriors from the teachers' sets of state clusters to that of the student. The map is derived by considering a distance measure between posteriors of unclustered logical context-dependent states, instead of the usual state clusters. The experiments suggest that this proposed method can allow a student to effectively learn from an ensemble that has a diversity of state cluster sets. However, the experiments also suggest that the student may need to have a large set of state clusters to effectively emulate this ensemble. This thesis proposes to use a student with a multi-task topology, with an output layer for each of the different sets of state clusters. This can capture the phonetic resolution of having multiple sets of state clusters, while having fewer parameters than a student with a single large output layer.
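
One way to picture a map between state cluster sets is sketched below. This is an illustrative assumption rather than the derivation used in the thesis: it supposes that each logical context-dependent state belongs to exactly one teacher cluster and one student cluster, and that the posterior mass of a teacher cluster is shared among its logical states in proportion to a prior. All names (`cluster_map`, `map_posteriors`, `logical_prior`) are hypothetical.

```python
import numpy as np

def cluster_map(teacher_of, student_of, logical_prior, n_teacher, n_student):
    """Build A with A[j, i] = P(student cluster j | teacher cluster i).

    teacher_of[c] / student_of[c] give the cluster index that logical
    state c falls into under each decision tree (assumed inputs).
    """
    joint = np.zeros((n_student, n_teacher))
    for c, prior in enumerate(logical_prior):
        joint[student_of[c], teacher_of[c]] += prior
    col_sums = joint.sum(axis=0, keepdims=True)
    # Teacher clusters with no prior mass are left as all-zero columns.
    return np.divide(joint, col_sums, out=np.zeros_like(joint),
                     where=col_sums > 0)

def map_posteriors(teacher_posteriors, mapping):
    """Map per-frame posteriors: (T, n_teacher) -> (T, n_student)."""
    return np.asarray(teacher_posteriors, dtype=float) @ mapping.T

# Toy usage: four logical states, two teacher clusters, three student clusters.
A = cluster_map([0, 0, 1, 1], [0, 1, 1, 2], [0.25] * 4, 2, 3)
mapped = map_posteriors([[0.8, 0.2]], A)
```

Because each non-empty column of the map sums to one, the mapped posteriors remain normalised and can serve as frame-level targets for a student with a different set of state clusters.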

The third contribution of this thesis is to address the second limitation of standard teacher-student learning, that only frame-level information is propagated to emulate the ensemble behaviour for the sequence modelling ASR task. This thesis proposes to generalise teacher-student learning to the sequence level, and propagate sequence posterior information. The proposed methods can also allow for many forms of ensemble diversity. The experiments suggest that by using these sequence-level methods, a student can learn to emulate the ensemble better. Recently, the lattice-free method has been proposed to train a system directly toward a sequence discriminative criterion. Ensembles of these systems can exhibit highly diverse behaviours, because the systems are not biased toward any cross-entropy forced alignments. It is difficult to apply standard frame-level teacher-student learning with these lattice-free systems, as they are often not designed to produce state cluster posteriors. Sequence-level teacher-student learning operates directly on the sequence posteriors, and can therefore be used directly with these lattice-free systems.
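
The idea of propagating sequence posteriors can be sketched in a simplified form. The example below assumes, purely for illustration, that the teachers and the student score the same N-best list of hypotheses for one utterance, so that sequence posteriors reduce to a normalised distribution over N entries; practical systems would more likely work with lattices, and the function names here are hypothetical.

```python
import numpy as np

def log_normalise(log_scores):
    """Turn unnormalised log scores into log posteriors over hypotheses."""
    log_scores = np.asarray(log_scores, dtype=float)
    peak = log_scores.max()
    return log_scores - (peak + np.log(np.exp(log_scores - peak).sum()))

def sequence_level_ts_loss(teacher_log_scores, student_log_scores, weights=None):
    """KL(combined teacher || student) over a shared N-best list.

    teacher_log_scores: (M, N) log scores, one row per teacher.
    student_log_scores: (N,) log scores from the student.
    Uniform teacher weights are an assumption of this sketch.
    """
    teacher_log_scores = np.asarray(teacher_log_scores, dtype=float)
    teacher_post = np.exp(np.stack([log_normalise(row)
                                    for row in teacher_log_scores]))
    if weights is None:
        weights = np.full(teacher_post.shape[0], 1.0 / teacher_post.shape[0])
    combined = np.asarray(weights, dtype=float) @ teacher_post
    student_log_post = log_normalise(student_log_scores)
    mask = combined > 0  # skip hypotheses the teachers give zero mass
    return float(np.sum(combined[mask]
                        * (np.log(combined[mask]) - student_log_post[mask])))
```

Nothing in this loss refers to state clusters or frame alignments, which is why a criterion of this kind can be applied to systems that do not produce state cluster posteriors.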

The proposals in this thesis are assessed on four ASR tasks. These are the Augmented Multi-party Interaction (AMI) meeting transcription, IARPA Babel Tok Pisin conversational telephone speech, English broadcast news, and Multi-Genre Broadcast (MGB) tasks. These datasets provide a variety of quantities of training data, recording environments, and speaking styles.

Date

2018-09-28

Advisors

Gales, Mark

Keywords

Teacher-student, ensemble, automatic speech recognition, random forest, sequence discriminative training

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge