Statistical Models for Noise-Robust Speech Recognition

Rogier Christiaan van Dalen
Queens' College
Department of Engineering, University of Cambridge

This dissertation is submitted for the degree of Doctor of Philosophy.
October 2011

Abstract

A standard way of improving the robustness of speech recognition systems to noise is model compensation. This replaces a speech recogniser's distributions over clean speech by ones over noise-corrupted speech. For each clean speech component, model compensation techniques usually approximate the corrupted speech distribution with a diagonal-covariance Gaussian distribution. This thesis looks into improving on this approximation in two ways: firstly, by estimating full-covariance Gaussian distributions; secondly, by approximating corrupted-speech likelihoods without any parameterised distribution.

The first part of this work is about compensating for within-component feature correlations under noise. For this, the covariance matrices of the computed Gaussians should be full instead of diagonal. The estimation of off-diagonal covariance elements turns out to be sensitive to approximations. A popular approximation is the one that state-of-the-art compensation schemes, like VTS compensation, use for dynamic coefficients: the continuous-time approximation. Standard speech recognisers contain both per-time slice, static, coefficients, and dynamic coefficients, which represent signal changes over time and are normally computed from a window of static coefficients. To remove the need for the continuous-time approximation, this thesis introduces a new technique. It first compensates a distribution over the window of statics, and then applies the same linear projection that extracts dynamic coefficients. It introduces a number of methods that address the correlation changes that occur in noise within this framework. The next problem is decoding speed with full covariances. This thesis re-analyses the previously-introduced predictive linear transformations, and shows how they can model feature correlations at low and tunable computational cost.

The second part of this work removes the Gaussian assumption completely. It introduces a sampling method that, given speech and noise distributions and a mismatch function, in the limit calculates the corrupted speech likelihood exactly. For this, it transforms the integral in the likelihood expression, and then applies sequential importance resampling. Though it is too slow to use for recognition, it enables a more fine-grained assessment of compensation techniques, based on the KL divergence to the ideal compensation for one component. The KL divergence proves to predict the word error rate well. This technique also makes it possible to evaluate the impact of approximations that standard compensation schemes make.

Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. In particular, two parts should be mentioned. Sections 6.2 and 6.3 of this thesis extend work on predictive linear transformations originally submitted for an M.Phil. thesis (van Dalen 2007). The work on component-independent predictive transformations, section 6.4, was joint work with Federico Flego (and published as van Dalen et al. 2009).
Most of the work in this thesis has been published, as technical reports (van Dalen and Gales 2009a; 2010b), in conference proceedings (van Dalen and Gales 2008; 2009b; van Dalen et al. 2009; van Dalen and Gales 2010a), and as a journal article (van Dalen and Gales 2011).

The length of this thesis including appendices, bibliography, footnotes, tables and equations is 63 532 words. It contains 39 figures and 15 tables.

Acknowledgements

Firstly, thanks must go to the people who have enabled me to do a Ph.D. at Cambridge University. Toshiba Research Europe Ltd. funded my Ph.D. completely, and after specifying the subject area, environment-aware speech recognition, did not constrain me at all. I have much appreciated that.

The other enabler is Dr Mark Gales, my supervisor. I could not have wished for a supervisor more attentive to the big picture and the details, more enthusiastic, or happier to let me work at my own schedule. I am grateful that he has always set aside an hour a week for me, and has always made time to address my questions. Above all, I appreciate his dedication to helping me get as much out of my Ph.D. as possible.

Hank Liao can be called my predecessor, and I inherited much of his code and experimental set-ups. Thanks go to Federico Flego not only for providing infinitely patient help with my experiments, but also for collaborating on exciting work (as specified in the Declaration). Milica Gašić and Matt Shannon provided invaluable comments on a draft version of this thesis. Thanks also go to past and present occupants of my office, and other members of the speech group, in particular, Chris Longworth, Zoi Roupakia, Anton Ragni, Milica Gašić, Catherine Breslin, and three Matts (Matt Gibson, Matt Shannon, and Matt Seigel), for scintillating discussions, whether about speech recognition, machine learning, or subjects less directly related to our research. All have enriched my time here.

Finally, to my parents and family: thank you for all the encouragement and support over the years. Without your help I would never have ended up here, nor would I have been able to see it through like this!

Contents

1 Introduction

I Background

2 Speech recognition
  2.1 Feature extraction
    2.1.1 Mel-frequency cepstral coefficients
    2.1.2 Dynamic coefficients
  2.2 Hidden Markov models
    2.2.1 State sequences
      2.2.1.1 Language modelling
      2.2.1.2 Latent discrete sequence
  2.3 Training
    2.3.1 Empirical distributions
    2.3.2 Maximum-likelihood estimation
      2.3.2.1 Expectation–maximisation
    2.3.3 Baum–Welch
  2.4 Decoding
  2.5 Summary
3 Adaptation
  3.1 Unsupervised adaptation
    3.1.1 Adaptive training
  3.2 Linear adaptation
    3.2.1 Constrained transformation
    3.2.2 Covariance adaptation
  3.3 Covariance modelling
    3.3.1 Structured precision matrices
    3.3.2 Maximum likelihood projection schemes
  3.4 Summary
4 Noise-robustness
  4.1 Methods for noise-robustness
  4.2 Noise-corrupted speech
    4.2.1 Log-spectral mismatch function
      4.2.1.1 Properties of the phase factor
    4.2.2 Cepstral mismatch function
    4.2.3 Mismatch function for dynamic coefficients
  4.3 The corrupted speech distribution
    4.3.1 Sampling from the corrupted speech distribution
  4.4 Model compensation
    4.4.1 Data-driven parallel model combination
    4.4.2 Vector Taylor series compensation
    4.4.3 Joint uncertainty decoding
      4.4.3.1 Estimating the joint distribution
    4.4.4 Single-pass retraining
      4.4.4.1 Assessing the quality of model compensation
  4.5 Observation-dependent methods
    4.5.1 The Algonquin algorithm
    4.5.2 Piecewise linear approximation
  4.6 Model-based feature enhancement
    4.6.1 Propagating uncertainty
    4.6.2 Algonquin
  4.7 Noise model estimation
    4.7.1 Linearisation
    4.7.2 Numerical approximation
  4.8 Summary

II Contributions

5 Compensating correlations
  5.1 Correlations under noise
  5.2 Compensating dynamics with extended feature vectors
    5.2.1 The extended Gaussian
    5.2.2 Validity of optimising an extended distribution
  5.3 Compensating extended distributions
    5.3.1 Extended DPMC
    5.3.2 Extended IDPMC
    5.3.3 Extended VTS
      5.3.3.1 Relationship with VTS
      5.3.3.2 Computational cost
    5.3.4 Algonquin with extended feature vectors
  5.4 Extended joint uncertainty decoding
  5.5 Extended statistics
    5.5.1 Clean speech statistics
    5.5.2 Noise model estimation
      5.5.2.1 Zeros in the noise variance estimate
      5.5.2.2 Estimating an extended noise model
    5.5.3 Phase factor distribution
  5.6 Summary
6 Predictive transformations
  6.1 Approximating the predicted distribution
    6.1.1 Per component
    6.1.2 Per sub-phone state
  6.2 Predictive linear transformations
    6.2.1 Predictive CMLLR
    6.2.2 Predictive covariance MLLR
    6.2.3 Predictive semi-tied covariance matrices
  6.3 Correlation modelling for noise
    6.3.1 Predictive CMLLR
    6.3.2 Predictive covariance MLLR
    6.3.3 Predictive semi-tied covariance matrices
    6.3.4 Computational complexity
    6.3.5 Predictive HLDA
  6.4 Front-end PCMLLR
    6.4.1 Observation-trained transformation
    6.4.2 Posterior-weighted transformations
  6.5 Summary
7 Asymptotically exact likelihoods
  7.1 Likelihood evaluation
  7.2 Importance sampling over the speech and noise
  7.3 Importance sampling in a transformed space
    7.3.1 Single-dimensional
      7.3.1.1 The shape of the integrand
      7.3.1.2 Importance sampling from the integrand
      7.3.1.3 Related method
    7.3.2 Multi-dimensional
      7.3.2.1 Per-dimension sampling
      7.3.2.2 Postponed factorisation
      7.3.2.3 Quasi-conditional factorisation
      7.3.2.4 Applying sequential importance resampling
  7.4 Approximate cross-entropy
  7.5 Summary
8 Experiments
  8.1 Correlation modelling
    8.1.1 Resource Management
      8.1.1.1 Compensation quality
      8.1.1.2 Extended noise reconstruction
      8.1.1.3 Per-component compensation
      8.1.1.4 Per-base class compensation
    8.1.2 AURORA 2
      8.1.2.1 Extended VTS
      8.1.2.2 Front-end PCMLLR
    8.1.3 Toshiba in-car database
  8.2 The effect of approximations
    8.2.1 Set-up
    8.2.2 Compensation methods
    8.2.3 Diagonal-covariance compensation
    8.2.4 Influence of the phase factor
9 Conclusion
  9.1 Modelling correlations
  9.2 Asymptotically exact likelihoods
  9.3 Future work

Appendices

A Standard techniques
  A.1 Known equalities
    A.1.1 Transforming variables of probability distributions
    A.1.2 Matrix identities
    A.1.3 Multi-variate Gaussian factorisation
  A.2 Kullback-Leibler divergence
    A.2.1 KL divergence between Gaussians
    A.2.2 Between mixtures
  A.3 Expectation–maximisation
    A.3.1 The expectation step: the hidden variables
    A.3.2 The maximisation step: the model parameters
    A.3.3 Convergence
  A.4 Monte Carlo
    A.4.1 Plain Monte Carlo
    A.4.2 Importance sampling
    A.4.3 Sequential importance sampling
    A.4.4 Resampling
    A.4.5 Sampling from the target distribution
B Derivation of linear transformations
  B.1 CMLLR
    B.1.1 Adaptive
    B.1.2 Predictive
  B.2 Covariance MLLR
    B.2.1 Adaptive
    B.2.2 Predictive
  B.3 Semi-tied covariance matrices
    B.3.1 From data
    B.3.2 Predictive
C Mismatch function
  C.1 Jacobians
  C.2 Mismatch function for other spectral powers
D Derivation of model-space Algonquin
E The likelihood for piecewise linear approximation
  E.1 Single-dimensional
  E.2 Multi-dimensional
F The likelihood for transformed-space sampling
  F.1 Transforming the single-dimensional integral
  F.2 Transforming the multi-dimensional integral
  F.3 Postponed factorisation of the integrand
  F.4 Terms of the proposal distribution
Bibliography

Notation

Operators
  argmax_x φ(x): the value of x that maximises φ(x)
  ∗: convolution
  |·|: absolute value

Matrices and vectors
  A: matrix
  a_i = [A]_i: row i of A
  a_ij = [A]_ij: element (i, j) of A
  b: vector
  b_i: element i of b
  I: identity matrix
  0: vector/matrix with all entries 0
  1: vector/matrix with all entries 1
  |·|: determinant
  ‖·‖: norm
  Tr(·): trace
  A^{-1}: inverse
  A^T: transpose
  A^{-T}: inverse and transpose
  diag(·): matrix diagonalisation
  exp(·), log(·), ◦: element-wise exponentiation, logarithm, multiplication

Distributions
  p: real distribution
  q: approximate distribution
  p̃: empirical distribution
  E{a}: expected value of a
  E_p{a}: expected value of a under p
  Var{a}: variance of a
  KL(p‖q): Kullback-Leibler divergence to p from q
  H(p‖q): cross-entropy of p and q
  H(p): entropy of p
  1(·): indicator function (Kronecker delta): evaluates to 1 if the argument is true and to 0 otherwise
  δ_a(x): Dirac delta: distribution with non-zero density only at x = a, where the density is infinite, but ∫ δ_a(x) dx = 1
  a ∼ N(µ_a, Σ_a): a is Gaussian-distributed with mean µ_a and covariance Σ_a
  N(a; µ_a, Σ_a): Gaussian density evaluated at a
  u ∼ Unif[a, b]: u is uniformly distributed between a and b

Speech recognition signals
  X: training data
  Y: test data
  U: hidden variables
  y: observation feature vector
  x: clean speech
  n: additive noise
  h: convolutional noise
  α: phase factor
  β: power of spectrum
  ·^s: static feature vector
  ·^∆, ·^∆²: vectors with first- and second-order dynamic features
  ·^e: extended feature vector
  ·^log: log-mel-spectral feature vector
  ·[k]: spectral-domain signal
  ·[t]: time-domain signal
  f(·): mismatch function
  J_x, J_n: Jacobian of the mismatch function with respect to the speech and noise

Speech recogniser parameters
  m: speech recogniser component
  µ^(m): mean of component m
  Σ^(m): covariance matrix of component m
  Λ^(m): precision matrix of component m
  W: word sequence
  ε: empty symbol
  θ: sub-phone state
  m ∈ Ω(θ): component in the mixture for θ
  γ_t^(m): occupancy for component m at time t
  π_m^(θ): weight of component m in the mixture for θ
  Mn: speech model
  A: transformation
  A: linear transformation
  b: bias
  H: linear transformation expressed on the model parameters
  g: bias expressed on the model parameters
  D: projection to static and dynamic coefficients
  L: log-likelihood
  F: lower or upper bound

Indices
  t: time
  w: window size
  r: front-end component
  r: base class
  m ∈ Ω(r): component associated with base class r
  k: iteration
  d: feature vector length
  s: static feature vector length
  k: Fourier frequency
  i: filter bin index

Monte Carlo
  γ: integrand
  π: normalised target distribution
  Z: normalising constant
  ρ: proposal distribution
  u: sample
  l: sample index
  w: sample weight

Chapter 1
Introduction

Automatic speech recognition is employed in many places. The most popular operating system for personal computers, Microsoft Windows, has shipped with a dictation system for years. Google's Android operating system for phones now supports using speech to run commands and input text. Mobile phones in particular are likely to be used in noisy places. This tends to highlight that speech recognisers are more sensitive to noise than humans: performance deteriorates quickly under noise that hardly affects human speech recognition (Lippmann 1997).

There are two approaches to make speech recognisers more robust to noise. One is to reconstruct the clean speech before it enters the speech recogniser. The other is to compensate the speech recogniser so it expects noise-corrupted rather than clean speech.

The first approach is called feature enhancement.
Since it aims to reconstruct the clean speech feature vectors and then passes them to the unchanged speech recogniser, it is usually fast. However, propagating just a point estimate of the clean speech discards the information about the uncertainty of the estimate, which is important. For example, loud noise can mask quiet speech completely. Though this makes the clean speech estimate meaningless, the speech recogniser treats it as the real clean speech, which causes recognition errors.

The other approach, model compensation, is the focus of this thesis. It has obtained better performance than feature enhancement, especially at low signal-to-noise ratios. Model compensation finds a distribution over the noise-corrupted speech. It computes this from a trained speech recogniser, which constitutes a model of the training data, and an estimated model of the noise. Most model compensation methods replace each clean speech Gaussian with a corrupted speech Gaussian with a diagonal covariance matrix. This is an approximation. If the speech and noise distributions were correct and the replacement distribution were the exact one, then according to the Bayes decision rule the speech recogniser would yield the optimal hypothesis.

This work will therefore look into modelling the corrupted speech more precisely than with a diagonal-covariance Gaussian. It will look into two different aspects. First, diagonalising covariance matrices neglects the changes in correlations between feature dimensions under noisy conditions. However, in reality, feature correlations do change. For example, in the limit as the noise masks the speech, the correlations of the input data become equal to those of the noise. The first part of this thesis will therefore estimate full-covariance Gaussians. The second part derives from the observation that, given a standard form for the relationship between the corrupted speech and the speech and noise, the corrupted speech is not Gaussian even if the speech and noise are. Rather than using a parameterised distribution, it will approximate the corrupted speech likelihood directly, with a sampling method. The following outlines the contributions of this thesis. Publications that have arisen from the Ph.D. work are indicated with "(published as van Dalen and Gales 2009b)".

The first part will find compensation that models correlation changes under noise. The obvious approach, which is to forgo the diagonalisation of estimated corrupted speech covariances, will encounter two problems. The first problem is that common approximations used to estimate parameters for dynamic coefficients cause off-diagonal elements of the covariance matrices to be misestimated. These dynamics indicate the change in the signal from time slice to time slice. They are computed from a window of per-time slice, static, coefficients. The state-of-the-art VTS compensation scheme uses the continuous-time approximation, which simplifies computation by assuming the dynamic coefficients to be time derivatives. Chapter 5 will propose using distributions over vectors of all static features in a window, called extended feature vectors.
It proposes extended DPMC (published as van Dalen and Gales 2008) and extended VTS (published as van Dalen and Gales 2009b;a; 2011). These compensation schemes, derived from the DPMC and VTS compensation schemes, compute the effect of the noise on each time instance separately and then perform the mapping to statics and dynamics. The more precise modelling enables speech recognisers to use full covariance matrices to model correlation changes.

Having estimated full covariance matrices, decoding with full covariances is slow. Chapter 6 will therefore propose a general mechanism to approximate one speech recogniser parameterisation with another one. This effectively trains the latter on predicted statistics of the former. Section 6.2 will re-analyse predictive transformations (first introduced in van Dalen 2007; Gales and van Dalen 2007) as minimising the Kullback-Leibler divergence between the two models. This is used to convert full-covariance compensation into a tandem of a linear transformation and a diagonal bias on the covariance matrices. Since the covariance bias is diagonal, once compensation has finished, decoding will be as fast as normal. When combined with a scheme that applies compensation to clusters of Gaussians at once, joint uncertainty decoding, compensation is also fast. The choice of the number of clusters provides a trade-off between compensation accuracy and decoding speed. The combination of a compensation scheme with extended feature vectors, joint uncertainty decoding, and predictive linear transformations yields practical schemes for noise-robustness.

The second part of this thesis is more theoretical. The corrupted speech distribution is not Gaussian, even though most model compensation methods assume it is. There has been no research on how good model compensation could be with the speech and noise models that modern recognisers use. Chapter 7 will therefore investigate how well speech recognisers would cope with noise if they used the exact corrupted speech distribution. It turns out that this is impossible: there is no closed form for the corrupted speech distribution. However, a useful approach is to re-formulate the problem. In practice, no closed form for the distribution is necessary: a speech recogniser only needs likelihoods for the observation vectors in the data. These likelihoods can be approximated with a cascade of transformations and a Monte Carlo method, sequential importance sampling. As the size of the sample cloud increases, the approximation converges to the real likelihood. Chapter 7 will discuss this transformed-space sampling (published as van Dalen and Gales 2010a;b) in detail.

In coming close to the real likelihood, transformed-space sampling becomes so slow that implementing a speech recogniser with it is infeasible. It is, however, possible to make a more fine-grained assessment of speech recogniser compensation, with a metric based on the KL divergence (section 7.4). In the limit, the new sampling method will effectively give the point where the KL divergence is zero, which is otherwise not known. This calibration will make it possible to determine how far well-known compensation methods are from the ideal. This work will examine how well the KL divergence predicts speech recogniser word error rate.
It will compare different compensation schemes, and examine the effect of common approximations. This includes assuming the corrupted speech distribution to be Gaussian and diagonalising its covariance matrix, as well as approximations to the mismatch function. This illustrates that the new method is an important new research tool.

Part I
Background

Chapter 2
Speech recognition

Speech recognition is the conversion of speech into text. "Text" here means a sequence of written words. This chapter will give an overview of speech recognition. A speech recogniser first converts the incoming audio into a sequence of feature vectors that each represent a fixed-duration time slice. Section 2.1 will discuss the standard type of features that this thesis will use: MFCCs. The feature vectors serve as observations to a generative probabilistic model, the topic of section 2.2, that relates them to sequences of words. Section 2.3 will then explain how the acoustic part of the model, the focus of this thesis, is trained. The process of finding the best word sequence from the observation sequence, decoding, will be the topic of section 2.4.

2.1 Feature extraction

The initial digital representation of an audio signal is a series of amplitude samples expressing the pressure waveform at, say, 8 or 16 kHz. Speech recognisers apply a number of transformations to translate the stream of samples into a sequence of low-dimensional feature vectors. The standard type of feature vector for a time instance contains mel-frequency cepstral coefficients (MFCCs) extracted from one time slice (section 2.1.1). Appended to these "static" coefficients are dynamic coefficients that represent the changes in the static features (section 2.1.2).

Figure 2.1: Mel-spaced filter bins; alternate filters are coloured differently. (The plot shows the filter weights w_ik against frequency k.)

2.1.1 Mel-frequency cepstral coefficients

The objective of feature extraction is to represent the audio signal at discrete intervals by a small number of features that are useful for speech recognition. The procedure that produces mel-frequency cepstral coefficients (Davis and Mermelstein 1980) first finds coefficients that indicate the energy in human-hearing-inspired frequency bands. It then converts these into (usually 24) features whose relation to the shape of the mouth decreases, and whose relation to voice quality increases, with increasing index, and retains only the first (usually 13). The following goes into more detail.

The spacing of the feature vectors, usually 10 ms, is chosen so that for one time slice the spectrum can be assumed to be stationary. A Hamming window is applied to a short time slice (of usually 25 ms). A Fourier transformation of the signal waveform x[t] results in the spectrum X[k] representing the audio in that time slice. The phase information of the spectrum is then discarded by taking the magnitude, or a power, of the spectrum. The spectrum has a resolution higher than necessary for representing the shape of the spectrum for speech recognition. Triangular filters (usually 24) on a mel scale, which imitates the varying resolution of the human ear, are therefore applied. Figure 2.1 contains a depiction of filters on the mel scale. Lower-indexed bins are narrower, and span fewer frequencies, than higher-indexed bins. Describing filter bin i with filter weights w_ik for frequencies k, the mel-filtered spectrum is

    \bar{X}_i = \sum_k w_{ik} \, |X[k]|^{\beta}.    (2.1)

Usual values for the power β are 1 (indicating the magnitude spectrum) or 2 (the power spectrum).
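To make the filter-bank step concrete, the following is a minimal numpy sketch of equation (2.1). The thesis does not specify how the triangular weights w_ik or the mel warping are constructed, so the common 2595 log10(1 + f/700) convention and a simple triangular interpolation are assumed here for illustration; the function names and default parameters are likewise illustrative rather than the thesis's configuration.

    import numpy as np

    def mel(f):
        # A common mel-scale convention (an assumption, not fixed by the text).
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_filterbank(n_filters=24, n_fft=512, sample_rate=16000):
        """Triangular filter weights w_ik on a mel-spaced grid, as in figure 2.1."""
        mel_points = np.linspace(0.0, mel(sample_rate / 2.0), n_filters + 2)
        hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
        bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
        weights = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            left, centre, right = bins[i - 1], bins[i], bins[i + 1]
            for k in range(left, centre):        # rising slope of the triangle
                weights[i - 1, k] = (k - left) / max(centre - left, 1)
            for k in range(centre, right):       # falling slope of the triangle
                weights[i - 1, k] = (right - k) / max(right - centre, 1)
        return weights

    def mel_spectrum(frame, weights, beta=2):
        """Equation (2.1): bar{X}_i = sum_k w_ik |X[k]|^beta for one time slice."""
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)),
                                      n=2 * (weights.shape[1] - 1)))
        return weights @ spectrum ** beta

Calling mel_spectrum on a 25 ms frame (400 samples at 16 kHz) with the default filter bank returns the 24-dimensional vector of mel-spectral coefficients whose logarithm the next step takes.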
Filter bank coefficients X̄_i are called mel-spectral coefficients. The next step is motivated by a source-filter model. The source models the vocal cords, and the filter models the mouth. Since in speech recognition voice quality is mainly irrelevant but the shape of the mouth is of interest, attributes influenced by the source should be disregarded and attributes influenced by the filter retained. In the time domain, the filter convolves the source signal; in the spectral domain, this becomes multiplication. Then, the logarithm of each of the filter bank coefficients is found:

    x^{\log} = \log \begin{bmatrix} \bar{X}_1 \\ \vdots \\ \bar{X}_I \end{bmatrix},    (2.2)

with log(·) denoting the element-wise logarithm. The resulting vector x^log represents the log-spectrum. In this domain, the source and the filter are additive.

It is assumed that the filter determines the overall shape of the elements of x^log. To separate it from the source, frequency analysis can be performed again, this time on x^log. This uses the discrete cosine transform (DCT), which is a matrix C_[I], the elements of which are defined as

    c_{ij} = \sqrt{\frac{2}{I}} \cos\!\left( \frac{(2j-1)(i-1)\pi}{2I} \right).    (2.3)

By taking the discrete cosine transform of x^log, an I-dimensional vector of mel-frequency cepstral coefficients x^s_[I] is found:

    x^{s}_{[I]} = C_{[I]} \, x^{\log}.    (2.4)

The following step is to discard the higher coefficients in the feature vector. One way of viewing the effect of this is by converting back to log-spectral features. (In section 4.2 this will actually be done to express the effect of noise.) Back in the log-spectral domain, higher-frequency changes from coefficient to coefficient, which are assumed to relate to the source more than to the filter, have been removed. Discarding the last part of the MFCC feature vector therefore has the effect of smoothing the spectrum. Truncating x^s_[I] to the first s elements,

    x^{s}_{[s]} = C_{[s]} \, x^{\log},    (2.5)

where C_[s] is the first s rows of C, and s ≤ I. For notational convenience, (2.5) will be written as

    x^{s} = C \, x^{\log}.    (2.6)

These features form the "cepstrum". The name "mel-frequency cepstral coefficients" (MFCCs) summarises, as a noun compound, the sequence of operations that leads to them. MFCCs are popular features in speech recognition. The first coefficient represents a scaled average of the coefficients in the log-spectral feature vector in (2.2). This coefficient is often replaced by the normalised total energy in the filter bank. However, like most work on noise-robust speech recognition, this thesis will use feature vectors with only MFCCs.
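As a minimal sketch of the remaining steps, (2.2) to (2.6), the following builds the DCT matrix of (2.3) and maps a vector of mel-spectral coefficients to the first s MFCCs. The mel-spectral input is assumed to be something like the output of the filter-bank sketch above; the choice s = 13 simply follows the usual configuration mentioned in the text, and the function names are illustrative.

    import numpy as np

    def dct_matrix(I):
        """The DCT matrix C[I] of (2.3), with 1-based indices i (row) and j (column):
        c_ij = sqrt(2/I) * cos((2j - 1)(i - 1) pi / (2I))."""
        i = np.arange(1, I + 1)[:, None]
        j = np.arange(1, I + 1)[None, :]
        return np.sqrt(2.0 / I) * np.cos((2 * j - 1) * (i - 1) * np.pi / (2 * I))

    def mfcc(mel_spectral, s=13):
        """Equations (2.2)-(2.6): element-wise logarithm of the mel-spectral
        coefficients, followed by the first s rows of the DCT."""
        I = len(mel_spectral)
        x_log = np.log(mel_spectral)      # (2.2)
        C_s = dct_matrix(I)[:s, :]        # C[s]: the first s rows of C[I]
        return C_s @ x_log                # (2.5), written as (2.6)

With the usual 24 filter bins and s = 13, mfcc returns the 13 static coefficients x^s for one time slice; truncating the DCT output is what gives the smoothing of the log-spectrum described above.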
2.1.2 Dynamic coefficients

It will be discussed in section 2.2 that the feature vectors will be modelled with a hidden Markov model (HMM). Hidden Markov models assume that, given the sequence of states that generate the feature vectors, consecutive observations are independent. This implies that consecutive feature vectors generated by a single sub-phone are independent and identically distributed. Thus, no information about changes over time is encoded, other than through switching sub-phone states. To alleviate this problem, consecutive feature vectors can be related by appending extra coefficients that approximate time derivatives (Furui 1986). Usually, "delta" and "delta-delta" coefficients are included, representing feature velocity and acceleration, respectively. These dynamic coefficients lead to large improvements in recognition accuracy and are standard in speech recognition systems.

As an approximation to time derivatives, linear regression over a window of consecutive frames is used. For exposition, assume a window of ±1 around the static feature vector x^s_t. The dynamic coefficients x^∆_t are then found by

    x^{\Delta}_t = \begin{bmatrix} -\tfrac{1}{2}I & 0 & \tfrac{1}{2}I \end{bmatrix}
    \begin{bmatrix} x^{s}_{t-1} \\ x^{s}_{t} \\ x^{s}_{t+1} \end{bmatrix}
    = D_{\Delta} \, x^{e}_t,    (2.7a)

where x^e_t is an extended feature vector, which is a concatenation of the static coefficients in a window, and D_∆ is the projection from x^e_t to x^∆_t. Then, a feature vector with statics and first-order dynamics ("deltas") is

    x_t = \begin{bmatrix} x^{s}_t \\ x^{\Delta}_t \end{bmatrix}
    = \begin{bmatrix} 0 & I & 0 \\ -\tfrac{1}{2}I & 0 & \tfrac{1}{2}I \end{bmatrix}
    \begin{bmatrix} x^{s}_{t-1} \\ x^{s}_{t} \\ x^{s}_{t+1} \end{bmatrix}
    = \begin{bmatrix} D_s \\ D_{\Delta} \end{bmatrix} x^{e}_t = D \, x^{e}_t,    (2.7b)

where D_s projects x^e_t to x^s_t, and D projects x^e_t to the final feature vector x_t. With the window of one time instance left and one right, this uses simple differences between static feature vectors. In general, first-order coefficients x^∆_t are found from consecutive static coefficients x^s_{t−w}, ..., x^s_{t+w} by

    x^{\Delta}_t = \frac{\sum_{i=1}^{w} i \left( x^{s}_{t+i} - x^{s}_{t-i} \right)}{2 \sum_{i=1}^{w} i^2}.    (2.8)

Second-order coefficients x^{∆²}_t are found analogously from x^∆_{t−v}, ..., x^∆_{t+v}, and similar for higher-order coefficients. The extended feature vector x^e_t then contains the statics in a window ±(w + v). The transformation from x^e_t to feature vector x_t with these higher-order terms remains linear, so that D in (2.7b) is straightforwardly generalised. The feature vector with static and first- and second-order dynamic coefficients ("deltas" and "delta-deltas") is then computed with

    x_t = \begin{bmatrix} x^{s}_t \\ x^{\Delta}_t \\ x^{\Delta^2}_t \end{bmatrix} = D \, x^{e}_t,
    \qquad
    x^{e}_t = \begin{bmatrix} x^{s}_{t-w-v} \\ \vdots \\ x^{s}_{t+w+v} \end{bmatrix}.    (2.9)

For exposition, however, this work assumes (2.7b).

2.2 Hidden Markov models

To describe the relation between word sequences and feature vector sequences, speech recognisers use a generative model. This is a joint distribution of word sequence W and feature vector sequence X, p(W, X). In such a model, the words generate feature vectors, called observations. A problem is that a word has a variable duration, and therefore the number of observation vectors it generates is also variable. To solve this, a latent discrete sequence of fixed-interval units is introduced. The observation at time t is assumed generated by the state at time t, written θ_t. Denoting the sequence of states with Θ,

    p(X, W) = \sum_{\Theta} p(X \mid \Theta) \, P(\Theta, W).    (2.10)

The distribution P(Θ|W) performs a mapping from a sequence of words to a sequence of states. To make training (section 2.3) and recognition (section 2.4) feasible, the Markov property is assumed: the state at time t only depends on its predecessor, not on further history. Figure 2.2 has a simple graphical model of this, with θ_t a random variable that represents the active state at time t. Section 2.2.1.2 will give more detail on how the transition probabilities are determined.

Figure 2.2: Graphical model of a Markov model, with nodes θ_{t−1}, θ_t, θ_{t+1}. Circles represent random variables.

Figure 2.3: Graphical model of a hidden Markov model, with states θ_{t−1}, θ_t, θ_{t+1} and observations x_{t−1}, x_t, x_{t+1}. The shaded circles represent observed variables.

To include the acoustic information in this model, the feature vectors, representing the acoustics at time t, are attached. Given the state sequence, the feature vectors are assumed independent. Because when training or decoding the feature vectors generated by the states are observed, but not the states themselves, the resulting model is called a hidden Markov model (HMM). Figure 2.3 shows the graphical model. Two important properties of the HMM are the Markov property, and the conditional independence assumption: the observation at time t only depends on the state at time t.
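Taken together, the two assumptions fix the form of the joint distribution of an observation sequence and a state sequence. The following restatement is a sketch of what figure 2.3 implies; the initial-state and transition notation P(θ_1) and P(θ_t | θ_{t−1}) is introduced here only for illustration, and the per-state output distribution p(x_t | θ_t) is the subject of the next paragraphs:

    p(X, \Theta)
    \;=\; \underbrace{P(\theta_1) \prod_{t=2}^{T} P(\theta_t \mid \theta_{t-1})}_{\text{Markov property}}
    \;\times\; \underbrace{\prod_{t=1}^{T} p(x_t \mid \theta_t)}_{\text{conditional independence}}.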
The arrow from state θ_t to observation x_t in figure 2.3 represents the state output distribution. This distribution, q^(θ)(x), is usually a mixture of Gaussians:

    q^{(\theta)}(x) = \sum_{m} \pi^{(\theta)}_m \, q^{(m)}(x),
    \qquad \sum_{m} \pi^{(\theta)}_m = 1,    (2.11)

where π^(θ)_m is the mixture weight for mixture θ and component m, and q^(m) is the component's Gaussian distribution. This can also be expressed as a graphical model: see figure 2.4. The component that generates the observation is indicated by a random variable m_t. Figures 2.3 and 2.4 represent different ways of looking at the same speech recogniser; showing the components m_t explicitly as in figure 2.4 is useful for training, and making them implicit in the distribution of x_t is useful for decoding.

Figure 2.4: Graphical model of a hidden Markov model with mixture models as output distributions: states θ_{t−1}, θ_t, θ_{t+1}, components m_{t−1}, m_t, m_{t+1}, and observations x_{t−1}, x_t, x_{t+1}.

Components are usually multi-variate Gaussians, which are parameterised with mean µ^(m)_x and covariance Σ^(m)_x:

    q^{(m)}(x) = \mathcal{N}\!\left(x;\, \mu^{(m)}_x,\, \Sigma^{(m)}_x\right)
    = \left| 2\pi \Sigma^{(m)}_x \right|^{-\frac{1}{2}}
      \exp\!\left( -\tfrac{1}{2} \big(x - \mu^{(m)}_x\big)^{\mathrm T} \, \Sigma^{(m)\,-1}_x \big(x - \mu^{(m)}_x\big) \right).    (2.12)

Sometimes it is more useful to express the Gaussian in terms of the inverse covariance. This precision matrix will be written Λ^(m)_x = Σ^(m)−1_x.

The covariance is often constrained to be diagonal, so that one Gaussian requires less data to estimate. This, however, fails to model the within-component correlations that are present. To model some correlations while reducing the amount of training data required, structure can be introduced into the covariances. This will be the topic of section 3.3.

2.2.1 State sequences

To use HMMs in a speech recogniser, it must define probabilities for state sequences and relate them to words. That is, it must define P(Θ, W) in (2.10). It is convenient to write this distribution in terms of the prior distribution P(W) and the likelihood of the state sequence P(Θ|W):

    P(\Theta, W) = P(W) \, P(\Theta \mid W).    (2.13)

The distribution over word sequences P(W) is called the language model. [Footnote 1: Whereas in linguistics speech is considered the only real form of language and spelling a confusing artefact, in computational linguistics "language" refers to written text.] The distribution P(Θ|W) performs a mapping from a sequence of variable-length words to a sequence of fixed-length states. It is possible, if not very insightful, to describe this mapping with a graphical model (Murphy 2002; Wiggers et al. 2010). A more insightful method is to use weighted finite state transducers. Thus, section 2.2.1.1 will describe the language model in terms of a probabilistic model, P(W), and section 2.2.1.2 will describe P(Θ|W) as a composition of weighted finite state transducers.

2.2.1.1 Language modelling

In the generative model in (2.10), the probability of each possible word sequence must be defined. For tasks where the inputs are constrained, this is often straightforward. For example, if a ten-digit phone number is expected, the probability of any such digit sequence can be set to the same value, and any other word sequence can receive a zero probability. For some tasks it is possible to set up a limited grammar. However, if free-form input is expected, no word sequences are impossible, though some may be improbable. Higher probabilities may be assigned to grammatical sentences (or fragments) compared to ungrammatical utterances, and to semantically likely sequences compared to unlikely ones. This is often done by training a statistical model on data. A language model can be seen as a probability distribution over word sequences.
If sentences are considered independently, the probability of a sentence W = w_1, ..., w_L is [Footnote 2: Boundary effects are ignored for simplicity.]

    P(W) = P(w_1) \, P(w_2 \mid w_1) \, P(w_3 \mid w_1, w_2) \cdots P(w_L \mid w_1, \ldots, w_{L-1})
         = \prod_{i=1}^{L} P(w_i \mid w_1, \ldots, w_{i-1}).    (2.14)

This factors the probability of a word sequence into probabilities of each word conditional on the word history, which can be trained from data. The usual strategy is to apply maximum-likelihood estimation, which in this case sets the probability of a word for a given word history proportional to its count in the training data. However, though there may be enough data to do this for zero-length histories, the last word depends on all previous words, and such long histories most likely occur only once in the training data and never in the test data. This is an example of over-training: the trained model would assign a zero probability to most sentences. The word history is therefore usually constrained to N − 1 words:

    P(w_i \mid w_1, \ldots, w_{i-1}) \simeq P(w_i \mid w_{i-N+1}, \ldots, w_{i-1}).    (2.15)

This type of model is called an N-gram model, and a typical value for N is 3.

For N ≥ 2, usually not all word tuples have been seen in the training data, so it is often necessary to recursively back off to data from a shorter word history, or to interpolate between N-gram models for different values of N. A state-of-the-art back-off scheme is modified Kneser-Ney (Chen and Goodman 1998). An alternative method to guard against over-training is to apply Bayesian methods, where a prior over parameters keeps them from assigning all probability mass to seen events. An interesting approach derives from a hierarchical Pitman–Yor process (Teh 2006).

2.2.1.2 Latent discrete sequence

To map words onto fixed-time units, words are converted to sequences of fixed-time discrete states. The state space can be described in various ways, but an insightful one is as a network of states. A formalism that produces these networks (and can be fast when embedded in a speech recogniser) is that of weighted finite state transducers. Mohri et al. (2008) give a good overview of how this works. The following will briefly describe how to construct a state network for speech recognition.

A finite-state automaton has discrete states that it switches between at each time step. Which transitions are possible is specified explicitly. Finite state transducers add to each transition an input symbol and an output symbol. They are therefore able to convert one type of symbol sequence into another. It is allowable for a transition to either not consume any input symbol, or not to output any. An empty input or output is indicated with "ε". This is useful if input and output sequences have different lengths, as when converting from words to states. Weighted finite state transducers, finally, add weights to transitions. For the sake of clarity, the weights will here stand for probabilities that are multiplied at each transition. This section will use weighted finite state transducers to convert word sequences into state sequences; table 2.1 gives an example cascade, and a small worked sketch of it follows the table.

  Transducer             Result                                          Weight
  Input sequence         three two
  Language model         three two                                       0.01
  Pronunciation lexicon  θ r i: t u:                                      0.01
  Sub-phones             θ1 θ2 θ2 θ3 r1 r2 r3 r3 r3 i:1 i:1 i:2 i:3
                         t1 t2 t2 t2 t3 t3 u:1 u:1 u:2 u:2 u:2 u:3 u:3    5.8 · 10^−20

Table 2.1: Conversions that the weighted finite state transducers in figure 2.5, applied after one another, may apply, starting from an input sequence.
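To make the cascade in table 2.1 concrete, here is a minimal sketch that reproduces its first two conversions with plain lookups instead of transducer composition. The per-word weight of 0.1 and the word-to-phone mappings are read off figures 2.5a and 2.5b; everything else (the dictionary representation, the function names) is an illustrative simplification, not the thesis's machinery.

    # Minimal sketch of the first two rows of table 2.1. The digit-loop
    # language model of figure 2.5a assigns weight 0.1 per word; the
    # deterministic lexicon of figure 2.5b maps words to phones with weight 1.
    LM_WORD_WEIGHT = 0.1
    LEXICON = {
        'three': ['θ', 'r', 'i:'],
        'two':   ['t', 'u:'],
    }

    def language_model_weight(words):
        """Weight accumulated by the language model transducer."""
        weight = 1.0
        for _ in words:
            weight *= LM_WORD_WEIGHT
        return weight

    def pronounce(words):
        """Map a word sequence to a phone sequence; the lexicon is
        deterministic, so the weight is left unchanged."""
        phones = []
        for word in words:
            phones.extend(LEXICON[word])
        return phones

    words = ['three', 'two']
    print(language_model_weight(words))   # 0.01, as in table 2.1
    print(' '.join(pronounce(words)))     # θ r i: t u:

The third conversion, to sub-phones, is non-deterministic because of the self-transitions in figure 2.5c, so one phone sequence corresponds to many weighted sub-phone sequences; this is why real decoders operate on composed weighted transducers rather than lookups of this kind.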
Figure 2.5 illustrates component weighted finite state transducers that, when composed, can convert a sequence of words to a sequence of discrete, equally-spaced states. The transducers in the chain translate between a number of alphabets. The output alphabet of one must be the input alphabet of the next, et cetera. The arbitrarily numbered circles represent states. The bold circles are start states, and the double-bordered ones end states, which can, but do not necessarily, end the sequence. The arrows represent transitions. Their labels consist of an input symbol, a colon (:), an output symbol, a slash (/), and a transition weight (here, a probability).

Figure 2.5a contains a simple language model that can take digit sequences. Its input and output symbols at each transition are equal. Table 2.1 shows the effect of applying this transducer to a sample sequence (first and second rows): it merely computes a weight. This weight stands for the probability of the word sequence. The transducer has a weight of 0.1 for each word, and a transition from state 2 to 1, without consuming or producing any symbols, to allow repetition. It is straightforward to use the same ingredients to generate a representation of a fixed grammar, or of a probabilistic language model trained on text data.

For a small vocabulary, it is possible to train the acoustics of every word from audio separately. In the digit sequence example, this could be an option, if the data contains enough examples of all words. However, in general words are mapped to a sequence consisting of symbols from a smaller alphabet with a pronunciation dictionary. These sub-word units are supposed to represent the pronunciation of the words.

Figure 2.5: Component weighted finite state transducers for building a speech recognition network. (a) A simple language model for a digit sequence, with arcs such as zero:zero/0.1 through nine:nine/0.1 and an ε:ε/1 transition allowing repetition. (b) A simple pronunciation lexicon mapping words to phones, for example two:t/1 followed by ε:u:/1, and three:θ/1 followed by ε:r/1 and ε:i:/1. (c) A mapping from phones to sub-phones, with arcs such as t:t1/1, ε:t1/0.8, ε:t2/0.2, ε:t2/0.7, ε:t3/0.3, ε:t3/0.5 and ε:ε/0.5.

However, standard linguistic units are usually considered to be at the wrong level of detail. There are phonemes, which are supposed to encode an underlying representation. Thus, the second c in "electric" and in "electricity" could be encoded with the same symbol (see, e.g., Harris 1994), even though they are always pronounced differently. On the other hand, phonetic transcriptions encode differences due to accents and sheer chance. Thus, it changes the transcription whether the last sound in a realisation of the word "hands" sounds like a z or an s, or changes halfway through (see, e.g., Collins and Mees 1999). It also encodes allophonic differences, like in realisations of the l in the words "clear", "voiceless", and "effulgent". Speech recognisers' acoustic models are powerful enough to deal with part of the pronunciation variability, and it is also possible to encode sound context (see below). A level of transcription in between phonemic and phonetic is therefore usually chosen for the sub-word units. These units are referred to with the linguistically neutral term phone.

The third row of table 2.1 shows what the mapping from a word sequence into a sequence of phones produces.
Figure 2.5b shows a weighted finite state transducer that performs this mapping. Because most words consist of more than one phone, the transducer needs to generate output symbols without consuming any input; this is indicated on the transitions by "ε" for the input symbol. In theory, it does not make a difference which transition on a deterministic path carries the non-empty input. For performance reasons, practical speech recognisers will apply operations on the transducer (for more details, see Mohri et al. 2008) to move the word symbol further back. Since there is no pronunciation variation that needs encoding, the transducer here is deterministic, with weights 1 throughout. It is, however, possible to include alternative phone sequences for one word, with the appropriate probabilities.

Movement of the articulators is a continuous process. To represent the resulting change in acoustics during the realisation of a phone with discrete units, phones are split into sub-phones. There must be a balance between the time resolution of the acoustics and the amount of training data available for each sub-phone. The canonical number of sub-phones per phone is therefore three.

Figure 2.5c shows part of a weighted finite state transducer that converts phones (e.g. "θ") into sub-phones (e.g. "θ2"). Since one phone generates more than one sub-phone, many transitions take no input symbol, which is indicated with "ε". Some states have self-transitions, which produce the same sub-phone label every time they are chosen. This makes the transducer non-deterministic and allows the sub-phone sequence to have varying lengths. Duration modelling is not very sophisticated: computational constraints practically dictate a geometric distribution for the sub-phone duration on one path through the network. The weight on a self-transition is the parameter of this geometric distribution. The bottom row of table 2.1 contains an example sub-phone sequence derived from the phone sequence. Each of the sub-phones produces one time slice, of which section 2.2 discusses the properties.

There are a number of additional steps in producing a real-world speech recogniser. One is to introduce context-sensitive phones. This divides phones up depending on the previous and following phones, which is straightforward to implement as a finite state transducer. To combat the resulting explosion of the number of parameters, it is usual to map phone or sub-phone models that share properties and acoustics into equivalence classes. A decision tree is built that for each split picks from a list the phonetically-inspired question that best separates the data. For example, an l like the one in "effulgent" may be separated from other allophones of l if the question "Is the next phone a consonant?" appears in the decision tree. Once the mapping into equivalence classes has been found, a finite state transducer straightforwardly performs the conversion.

Another trick, used for decoding, is to apply acoustic deweighting. The acoustic model's probabilities have too great a dynamic range, so that they swamp the language model's probabilities. The usual work-around is to take the acoustic model's transition weights to a power smaller than 1 before multiplying them with the language model's.

The discussion has so far considered separate transducers that take an input sequence and produce an output sequence.
However, when decoding, many paths must be considered at the same time, through the whole cascade of transducers at once, and it is the inverse transducers that are necessary. A transducer is inverted by exchanging the input and output symbols on all transitions. Following many paths in a cascade of transducers is less easy.

Conceptually, the transducers are composed into one big transducer that consumes a word sequence and non-deterministically produces a sub-phone sequence. As long as the empty symbol ε is not considered, composing two transducers straightforwardly yields a new transducer. Its state space is the product space of the two transducers'. For transducers with ε-transitions, performing composition is less straightforward, but possible (Mohri et al. 2008). It is also often beneficial to determinise and minimise the transducers so that, for example, words in the pronunciation lexicon share states as much as possible. These operations are generic, but it is also possible to use algorithms for specific network types (e.g. Dobrišek et al. 2010).

It is possible to expand the whole network for a system with a large vocabulary and language model. However, this often requires much memory. Alternatively, though it is non-trivial, the composition operation can be performed on the fly, by introducing filters (Oonishi et al. 2009; Allauzen et al. 2009), some of which are necessary for correct operation, while others increase decoding speed. Transducer composition is an associative operation. This allows some of the composition operations to be performed off-line, and the resulting network to be stored, and the rest to be done on the fly.

It is possible to express training a speech recogniser and decoding with it as operations on weighted finite state transducers. This requires setting up a linear transducer, with states that represent times, and transitions that convert all sub-phones into the feature vector found between those times, with the correct probabilities. However, this is not the most enlightening way of looking at it. The following section will use the active sub-phone at any given time as a random variable, probabilities of sequences of which are governed by the fully composed weighted finite state transducer.

2.3 Training

Speech recognisers, like most statistical models in machine learning, are trained on data. The criterion used to optimise the parameters is usually the likelihood. This is sometimes followed by discriminative training. However, since in this work the noise model must be estimated on unlabelled data, maximum-likelihood estimation will be used. This is only consistent if the speech model is generatively trained, so this thesis will only use maximum-likelihood estimation for the speech model.

The objective is usually to find the model parameters that maximise the likelihood of the labelled training data. Section 2.3.2 will discuss maximum-likelihood estimation and its instantiation for models with hidden variables, expectation–maximisation. Section 2.3.3 will discuss how expectation–maximisation is applied to speech recognisers.

Maximum-likelihood estimation, which normally uses training data, can be extended in two ways that will be important for this work. First, it is possible to adapt the model parameters to unlabelled audio that is to be recognised, rather than to labelled data. This will be the topic of chapter 3. Second, chapter 6 will introduce predictive methods, which train parameters not from data, but on predicted distributions.
The generalisation of methods that implement maximum-likelihood estimation to training on distributions requires an unusual presentation. The training data will be written as a distribution of samples, an empirical distribution, which will be the topic of section 2.3.1.

2.3.1 Empirical distributions

It is often useful to approximate distributions by a number of samples. In this work, these approximations will be called empirical distributions. If p is the real distribution over u, then its empirical approximation p̃ is defined by L samples u^(l):

    p \simeq \tilde{p} = \frac{1}{L} \sum_{l} \delta_{u^{(l)}},    (2.16)

where δ_{u^(l)} indicates a Dirac delta at u^(l). Sometimes, the samples will be weighted. This work will use empirical distributions for two purposes.

The first is in the well-known Monte Carlo family of algorithms. As models in machine learning get more complicated, it quickly becomes infeasible to perform exact inference. Sometimes distributions can be approximated parametrically, with a choice of forms and types of approximation (e.g. Minka 2005). Monte Carlo algorithms, on the other hand, replace a parameterised distribution by an empirical distribution acquired by sampling from it. Appendix A.4 discusses Monte Carlo methods that approximate integrals. However, even for message passing in general graphical models, distributions can be represented with samples (Dauwels et al. 2006). For example, Gibbs sampling (Geman and Geman 1984) can then be seen as loopy belief propagation (Frey and MacKay 1997) with single-sample messages.

The main use for empirical distributions in the next sections, however, is to represent training data. Data points (in this work: audio recordings of speech utterances) can be interpreted as samples from a stochastic process. This stochastic process, speech production, is hard to approximate and arguably impossible to model exactly. The next section will interpret maximum-likelihood estimation using the empirical distribution representing the training data, where every utterance is a data point.

2.3.2 Maximum-likelihood estimation

Training speech recogniser parameters, whether on thousands of hours of data with transcriptions, or a few parameters for adaptation on a few seconds, usually applies maximum-likelihood estimation or an approximation to it. This sets the model parameters to maximise the likelihood of the training data. The distribution that the model represents will be written q_X, and one training data point X. For mathematical convenience, the maximisation of the likelihood is usually rephrased as a maximisation of the log-likelihood, which will be written L(·). The log-likelihood of data point X according to q_X is

    \mathcal{L}(X, q_X) \triangleq \log q_X(X).    (2.17)

The likelihood of a set of independent and identically distributed data points {X^(l)} is the product of the likelihoods of the points, so the log-likelihood of that set is the sum of the individual log-likelihoods:

    \log \prod_{l} q_X\big(X^{(l)}\big) = \sum_{l} \log q_X\big(X^{(l)}\big).    (2.18)

In chapter 6, methods trained on data with maximum-likelihood estimation will be generalised to train on distributions. It is therefore useful at this stage to write the training data as a distribution. This empirical distribution p̃(X) has a Dirac delta at each point in the training set, as in (2.16). The log-likelihood of the training data can then be written

    \mathcal{L}(\tilde{p}, q_X) \triangleq \int \tilde{p}(X) \log q_X(X) \, dX.    (2.19)

Maximum-likelihood estimation then finds

    \hat{q}_X = \operatorname*{argmax}_{q_X} \mathcal{L}(\tilde{p}, q_X).    (2.20)
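As a minimal sketch of (2.16) to (2.20), the following treats the simplest possible case: the data points are scalars and q is a single Gaussian, so maximising L(p̃, q) has a closed-form solution. This toy setting, and the sample values in it, are assumptions for illustration only; a speech recogniser's q has hidden variables, which is what expectation–maximisation (next section) addresses.

    import numpy as np

    # The empirical distribution p_tilde: equal-weight Dirac deltas at the samples.
    samples = np.array([1.3, 0.7, 2.1, 1.8, 0.9])

    def log_likelihood(samples, mean, var):
        """L(p_tilde, q) of (2.19): with p_tilde a set of equal-weight deltas,
        the integral reduces to the average log-density of the samples."""
        return np.mean(-0.5 * (np.log(2 * np.pi * var)
                               + (samples - mean) ** 2 / var))

    # (2.20) in closed form: the maximum-likelihood Gaussian matches the
    # empirical mean and (biased) variance of the data.
    mean_hat = samples.mean()
    var_hat = samples.var()
    print(mean_hat, var_hat, log_likelihood(samples, mean_hat, var_hat))

Perturbing mean_hat or var_hat and re-evaluating log_likelihood confirms that the closed-form estimates are the maximisers, which is the content of (2.20) for this toy model.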
Maximum-likelihood estimation invites over-fitting of the training data, which endangers generalisation. To guard against this, Bayesian approaches are possible, which factor in a prior over the parameters. However, because of the temporal structure of speech recognisers, using a distribution over parameters is not feasible and must be approximated (Watanabe et al. 2004). Instead, speech recognition therefore uses techniques that control the amount of data that parameters are trained on.

Many learning problems in statistical pattern processing have unobserved, hidden, variables. Finding the parameters of the distributions over both the hidden and the observed parameters that maximise the likelihood is often intractable. An iterative algorithm that approximates the maximum-likelihood solution is expectation–maximisation (em) (Dempster et al. 1977).

2.3.2.1 Expectation–maximisation

Expectation–maximisation increases the log-likelihood L of the observations by optimising a lower bound F of the likelihood. This section will introduce the algorithm by writing it in terms of the empirical distribution. Appendix a.3 gives a derivation and proof of convergence. The two stages of the expectation–maximisation algorithm are the expectation stage and the maximisation stage. The expectation stage optimises the lower bound, making it equal to the log-likelihood. The maximisation stage optimises the model parameters.

The statistical model whose parameters are trained will be denoted with q_{UX}(U, X), which is a distribution over the hidden variable U and observed variables X. Marginalising out over the hidden variables gives the distribution over the observed variables:

q_X(X) = \int q_{UX}(U, X) \, dU.   (2.21a)

The log-likelihood for one data point is then

\mathcal{L}(X, q_{UX}) \triangleq \log \int q_{UX}(U, X) \, dU.   (2.21b)

The lower bound that expectation–maximisation maximises is defined for a single data point X as

\mathcal{F}(X, \rho, q_{UX}) \triangleq \int \rho(U|X) \log \frac{q_{UX}(U, X)}{\rho(U|X)} \, dU.   (2.21c)

Compared to L, its lower bound F explicitly takes an extra parameter, ρ, which is the distribution over the hidden variables U for each data point X.

Expectation–maximisation is an iterative algorithm. An initial parameter setting q^{(k-1)}_{UX}(U, X) must be given. The expectation stage of expectation–maximisation optimises the distribution over the hidden parameters ρ. Appendix a.3.1 shows that the optimal setting for ρ makes the lower bound equal to the log-likelihood. ρ is then equal to the posterior distribution of the hidden variables given the old parameter setting:

\rho^{(k)}(U|X) := q^{(k-1)}_{U|X}(U|X) = \frac{q^{(k-1)}_{UX}(U, X)}{q^{(k-1)}_X(X)}.   (2.22)

The maximisation step now sets the model parameters q_{UX} to maximise the lower bound evaluated on the whole training data, written as an empirical distribution as in (2.19):

q^{(k)}_{UX} := \arg\max_{q_{UX}} \int \tilde{p}(X) \int \rho^{(k)}(U|X) \log q_{UX}(U, X) \, dU \, dX.   (2.23)

It is often possible to perform this maximisation analytically. If not, generalised em may be used, which merely requires a new value of q^{(k)}_{UX} that improves the lower bound. Appendix a.3 proves that in both cases the likelihood increases at least as much as the lower bound. The full em iteration therefore causes the likelihood to converge to a maximum.
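As a concrete, if simplified, instance of the expectation step (2.22) and the maximisation step (2.23), the sketch below runs em for a two-component one-dimensional Gaussian mixture, where the hidden variable is the component index. The data, initial parameters, and variable names are all illustrative; speech recognisers apply the same steps to far larger models, as the next section describes.

import numpy as np
from scipy.stats import norm

# Minimal sketch of em for a two-component 1-d Gaussian mixture.  The hidden
# variable U is the component index; the e-step computes the posterior (2.22)
# and the m-step re-estimates weights, means, and variances in closed form,
# which maximises the lower bound (2.23) for this model.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

w = np.array([0.5, 0.5])        # mixture weights
mu = np.array([-1.0, 1.0])      # means
var = np.array([1.0, 1.0])      # variances

for iteration in range(50):
    # e-step: posterior responsibility of each component for each point
    lik = w * norm.pdf(x[:, None], mu, np.sqrt(var))    # shape (N, 2)
    rho = lik / lik.sum(axis=1, keepdims=True)
    # m-step: closed-form maximisation of the expected log-likelihood
    occupancy = rho.sum(axis=0)
    w = occupancy / len(x)
    mu = (rho * x[:, None]).sum(axis=0) / occupancy
    var = (rho * (x[:, None] - mu) ** 2).sum(axis=0) / occupancy

print(w, mu, var)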
Another way of looking at the optimisation in the maximisation step is as minimising the kl divergence to the inferred distribution over the complete data (the observed variables as well as the hidden data). This distribution p combines the empirical distribution and the approximation to the distribution of the hidden variables:

p(U, X) = \tilde{p}(X) \, \rho^{(k)}(U|X).   (2.24)

Minimising the kl divergence of the model q_{UX} to p can be written as

\arg\min_{q_{UX}} \mathrm{KL}(p \,\|\, q_{UX}) = \arg\min_{q_{UX}} \int\!\!\int \tilde{p}(X) \rho^{(k)}(U|X) \log \frac{\tilde{p}(X) \rho^{(k)}(U|X)}{q_{UX}(U, X)} \, dU \, dX = \arg\max_{q_{UX}} \int \tilde{p}(X) \int \rho^{(k)}(U|X) \log q_{UX}(U, X) \, dU \, dX,   (2.25)

which is exactly the expression in (2.23).

Many generative statistical models, including standard speech recognisers, consist of a distribution over the hidden variables, and one over the observed variables given the hidden ones. q_{UX} then factorises as

q_{UX}(U, X) = q_U(U) \, q_{X|U}(X|U).   (2.26)

The logarithm of this in the maximisation in (2.23) then becomes a sum, so that the two distributions can be optimised separately:

\log q_{UX}(U, X) = \log q_U(U) + \log q_{X|U}(X|U);   (2.27a)

q^{(k+1)}_U := \arg\max_{q_U} \int \tilde{p}(X) \int \rho(U|X) \log q_U(U) \, dU \, dX;   (2.27b)

q^{(k+1)}_{X|U} := \arg\max_{q_{X|U}} \int \tilde{p}(X) \int \rho(U|X) \log q_{X|U}(X|U) \, dU \, dX.   (2.27c)

All the distributions that this thesis will apply expectation–maximisation to will have this form.

2.3.3 Baum–Welch

The instantiation of expectation–maximisation for hmms is also called Baum–Welch training because it was introduced (Baum et al. 1970) before its generalisation. The discussion of expectation–maximisation in section 2.3.2.1 denoted the set of hidden variables with U and the observed variables X. Applying this to speech recognition, the hidden variables are the sub-phone state and the component at every time instance: U = {θ_t, m_t}. The observed variables X consist of the feature vectors for one utterance {x_t}_{1...T_X}, and, for training, transcriptions W of the audio.³ Assuming the transcriptions are given at the word level, the weighted finite state transducer for the language model is replaced by a simple word sequence for each utterance. This constrains the state space, so that it is feasible to keep the distribution of the hidden variables for one utterance in memory. The empirical distribution p̃(X) has Dirac deltas at the utterances in the training data with their transcriptions.

³For decoding, there will be no transcriptions.

The expectation step of em finds a distribution ρ(U|X) over the hidden variables for an utterance. The most convenient form for ρ will drop out of the derivation below. For speech recognisers, the model factorises as a distribution over the hidden variables and a distribution over the observed variables given the hidden ones. Once the distribution ρ has been found, the two factors are optimised separately as in (2.27). How to train q_U, the state transitions and the mixture weights, is well known (e.g. Bilmes 1998) and will not be discussed here. How to train q_{X|U}, though also well known, will become important for estimating adaptation transformations in chapter 3 and later, so it will be discussed here in detail. The form that q_{X|U} takes for speech recognition is the product of the likelihood for component Gaussians for every time t.
Let q^(m) represent the distribution of component m, of which the parameters are to be trained. The likelihood of one data point X, of length T_X, given a setting for the hidden variables U is

q_{X|U}(X|U) = \prod_{t=1}^{T_X} \sum_m \mathbb{1}(m_t = m) \, q^{(m)}(x_t),   (2.28)

where 1(·) is the indicator function, or Kronecker delta, which is 1 when its argument is true and 0 otherwise. Here, Σ_m 1(m_t = m) merely selects the correct component. Therefore, the log-likelihood given U is

\log q_{X|U}(X|U) = \sum_{t=1}^{T_X} \sum_m \mathbb{1}(m_t = m) \log q^{(m)}(x_t).   (2.29)

The inner integral in (2.27c), the expected log-likelihood under the distribution over the hidden variables ρ, then is

\int \rho^{(k)}(U|X) \log q^{(k)}_{X|U}(X|U) \, dU = \int \rho^{(k)}(U|X) \sum_{t=1}^{T_X} \sum_m \mathbb{1}(m_t = m) \log q^{(m)(k)}(x_t) \, dU = \sum_m \sum_{t=1}^{T_X} \left[ \int \rho^{(k)}(U|X) \, \mathbb{1}(m_t = m) \, dU \right] \log q^{(m)(k)}(x_t).   (2.30)

As section 2.3.2.1 has discussed, the expectation step of expectation–maximisation sets ρ to the posterior of the hidden variables using the old model parameters. The value of the integral, in square brackets in the last expression, can therefore be seen as the posterior marginal probability of component m at time t. For training q^(m), it is all that is necessary to know of the distribution of the hidden parameters. It will be written γ_t^(m), with

\gamma_t^{(m)} \triangleq \int \rho(U|X) \, \mathbb{1}(m_t = m) \, dU,   (2.31a)

and the summed component occupancy over the whole training data

\gamma^{(m)} \triangleq \int \tilde{p}(X) \sum_{t=1}^{T_X} \gamma_t^{(m)} \, dX.   (2.31b)

Finding the component–time posterior γ_t^(m) uses the forward–backward algorithm (Baum et al. 1970). This is an instantiation of the belief propagation algorithm (Pearl 1988), which finds the posterior distribution of random variables in a graphical model by message-passing between adjacent variables. In hmms, the forward probability is the distribution of m_{t-1} given observations x_1 ... x_{t-1}. The distribution of m_t given observations x_1 ... x_t can be computed from the forward message and the observed x_t. The backward probability is the distribution of m_{t+1} given x_{t+2} ... x_T. Together with the observed x_{t+1} this yields the distribution of m_t given x_{t+1} ... x_T. Multiplying the forward and backward probability for time t yields the distribution of m_t given x_1 ... x_T, which is the component–time posterior γ_t^(m).

Since forward and backward probabilities are computed recursively from opposite ends of the sequence, either the forward or the backward probabilities are required in the reverse order from the one in which they are computed. For a state space of size Θ, the natural implementation of the forward–backward algorithm therefore uses O(T·Θ) space and O(T·Θ) time. To deal with long sequences, it is also possible to cache the probabilities only at intervals and reduce the space requirement to O(Θ log T) at the cost of requiring O(Θ·T·log T) time (Murphy 2002). However, in practice longer utterances also contain more words and thus more states. To deal with this, pruning is used: forward and backward probabilities below a threshold are set to zero.

Having computed γ_t^(m), the maximisation step instantiates (2.27c), rewritten using (2.30) and (2.31a):

q_{X|U} := \arg\max_{q_{X|U}} \int \tilde{p}(X) \sum_m \sum_{t=1}^{T_X} \gamma_t^{(m)} \log q^{(m)}(x_t) \, dX.   (2.32)

This maximisation is used for training the output distributions' parameters. When training all speech recogniser parameters, the expectation and maximisation steps are applied iteratively. Chapter 3 and section 4.7 will discuss adaptation within this same framework. There is usually enough training data to train all parameters of Gaussians directly.
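The following sketch illustrates the forward–backward computation of the posteriors γ_t^(m) in (2.31a) and the summed occupancies in (2.31b) for a deliberately small hmm. To keep it short it uses categorical rather than Gaussian output distributions, works with unscaled probabilities (a real implementation would scale or work in the log domain), and all transition, output, and initial probabilities are toy values.

import numpy as np

# Minimal forward-backward sketch for a small hmm, computing the state-time
# posteriors gamma_t^(m) of (2.31a) and the summed occupancies of (2.31b).
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])            # A[i, j] = P(state j at t+1 | state i at t)
B = np.array([[0.9, 0.1],
              [0.3, 0.7]])            # B[i, k] = P(observation k | state i)
pi = np.array([0.6, 0.4])             # initial state distribution
obs = [0, 0, 1, 1, 0]

T, S = len(obs), len(pi)
alpha = np.zeros((T, S))              # forward probabilities
beta = np.zeros((T, S))               # backward probabilities

alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)   # posterior of each state per time
occupancy = gamma.sum(axis=0)               # summed occupancy per state
print(gamma)
print(occupancy)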
The parameters of each Gaussian can be estimated separately. The instantiation of (2.27c) for training speech recognition parameters sets the parameters of each component to maximise its expected log-likelihood under the distribution of m_t:

q^{(m)} := \arg\max_{q^{(m)}} \int \tilde{p}(X) \sum_{t=1}^{T_X} \gamma_t^{(m)} \log q^{(m)}(x_t) \, dX.   (2.33)

Taking the derivative of the integral to be maximised, the distribution q^{(m)} \sim \mathcal{N}(\mu^{(m)}, \Sigma^{(m)}) is maximised when

\mu^{(m)} = \frac{1}{\gamma^{(m)}} \int \tilde{p}(X) \sum_{t=1}^{T_X} \gamma_t^{(m)} x_t \, dX;   (2.34a)

\Sigma^{(m)} = \left( \frac{1}{\gamma^{(m)}} \int \tilde{p}(X) \sum_{t=1}^{T_X} \gamma_t^{(m)} x_t x_t^{\mathrm T} \, dX \right) - \mu^{(m)} \mu^{(m)\mathrm T}.   (2.34b)

2.4 Decoding

The purpose of a speech recogniser is to convert audio into text. With the audio to be recognised represented by feature vector sequence X and the word sequence denoted with W, Bayes' rule relates finding the most likely word sequence Ŵ to the generative model in (2.10):

\hat{\mathcal{W}} = \arg\max_{\mathcal{W}} P(\mathcal{W}|X) = \arg\max_{\mathcal{W}} \frac{P(\mathcal{W}, X)}{p(X)} = \arg\max_{\mathcal{W}} P(\mathcal{W}, X) = \arg\max_{\mathcal{W}} P(\mathcal{W}) \, p(X|\mathcal{W}).   (2.35a)

Since 1/p(X) does not depend on W, when decoding it is a constant factor in the maximand and can be ignored. Section 2.2 has discussed the form of the generative model P(W, X). To find the most likely word sequence, the sub-phone state sequence Θ should be marginalised out, as in (2.10):

\hat{\mathcal{W}} = \arg\max_{\mathcal{W}} P(\mathcal{W}) \sum_{\Theta} p(X|\Theta) \, P(\Theta|\mathcal{W}).   (2.35b)

However, this marginalisation turns out to be computationally infeasible. Therefore, the sum in (2.35b) is replaced by a max operator. Rather than finding the best word sequence, speech recognisers therefore find the word sequence corresponding to the best sub-phone state sequence:

\hat{\mathcal{W}} \simeq \arg\max_{\mathcal{W}} P(\mathcal{W}) \max_{\Theta} p(X|\Theta) \, P(\Theta|\mathcal{W}).   (2.35c)

Note that if p(X|Θ) is off by a factor, this does not influence the maximisation. This property is often useful in speech recogniser adaptation.

This sequence can be computed with the Viterbi algorithm (Viterbi 1982), which is a dynamic programming algorithm. The following describes it briefly. The property of the network it needs is the Markov property discussed in section 2.2: the variables at time t depend only on the variables at time t − 1, and not on anything before that. This means that if the best possible path that ends in sub-phone θ at time t goes through θ′ at time t − 1, it contains the best path ending in θ′ at time t − 1. Finding the best paths to all states at one time therefore only requires the best paths to all states at the previous time. The task of finding the best path to a final state at the final time therefore becomes a recursion backwards through time.

An approximation that increases decoding speed is pruning. This removes unlikely states from the set of paths at every time step. It defines a pruning beam, the difference in log-likelihood between the most likely state and the least likely state to be allowed through. Pruning does introduce search errors, so setting the pruning beam gives a trade-off between speed and accuracy.

To assess speech recogniser performance, the word error rate (wer) is often used. This metric gives the distance from the reference transcription. It is the lowest number of deletions, insertions, and substitutions required to transform the reference transcription into the result of the speech recognition, as a fraction of the number of words in the transcription.
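As an illustration of the word error rate just defined, the following sketch computes the Levenshtein distance between a reference and a hypothesis word sequence with dynamic programming and divides by the number of reference words. The example sentences are invented.

# Minimal sketch of the word error rate: the Levenshtein (edit) distance
# between reference and hypothesis word sequences, divided by the number of
# reference words.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # distance[i][j]: edit distance between ref[:i] and hyp[:j]
    distance = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        distance[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        distance[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            distance[i][j] = min(distance[i - 1][j] + 1,        # deletion
                                 distance[i][j - 1] + 1,        # insertion
                                 distance[i - 1][j - 1] + substitution)
    return distance[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # 1/6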
2.5 Summary

This chapter has described the structure of a speech recogniser, and how to use it. Section 2.1 has discussed how the audio is converted into feature vectors that form the observations to a probabilistic model. The influence of the noise on feature vectors extracted from noisy data will be derived from this (in section 4.2.1). Section 2.2 has discussed the structure of the generative model. How this model is trained with expectation–maximisation was the topic of section 2.3. Similar methods will be applied for adaptation and noise model estimation (chapter 3 and section 4.7). However, in the maximisation step the parameters will then be constrained so that they can be robustly estimated on limited amounts of data. Section 2.4 has discussed decoding, which will be used for the experiments (chapter 8).

Chapter 3

Adaptation

Speech recognisers are often employed in different environments to the one they were trained on. There may be, for example, differences in speaker, speaking style, accent, microphone, and, the topic of this thesis, background noise. This mismatch could be resolved by retraining the recogniser in the new environment. Re-training the model on data that is to be recognised is called adaptation. However, usually too little data is available to robustly train all parameters, and it is unlabelled. To deal with this, the model parameters are usually constrained.

Section 3.1 introduces the concept of adaptation and general strategies. Section 3.2 discusses training linear transformations of speech recogniser parameters. Linear transformations for covariance modelling while training are mathematically similar and will therefore be the topic of section 3.3.

3.1 Unsupervised adaptation

This thesis will denote an utterance from the training environment with X, with observations x_t. An utterance to be recognised, which is from a different environment, will be written Y with observations y_t. The adaptation methods that this chapter will introduce are general and can adapt a speech recogniser to many types of difference between environments. In chapter 4, about methods for noise-robustness, X will explicitly be assumed to be noise-free, clean, data, and Y noise-corrupted.

If sufficient training data and the correct transcriptions were available, the mismatch between the environment the recogniser was trained in and the environment it is used in could be resolved by retraining the recogniser in the new environment. One approach would be to apply maximum a posteriori (map) training to the speech recognition parameters (Gauvain and Lee 1994). There is no conjugate prior density for an hmm with mixtures of Gaussians, but if mixture weights and component parameters are assumed independent, maximum a posteriori estimates for them can be found. The main problem with this is that each Gaussian's parameters are re-estimated separately, so that to have an effect, sufficient data must be observed for each Gaussian. map adaptation of speech recognition parameters is therefore ill-suited to scenarios with limited adaptation data.

An alternative is to constrain the parameters to a subspace, by only training a transformation of the speech recognition parameters that itself has fewer parameters than the speech recogniser. Ideally, decoding with adaptation would jointly optimise the word sequence and speech recogniser transformation that maximises, for example, the likelihood. If L is the function that is to be optimised with respect to word sequence W and speech recogniser transformation A, and Y is the adaptation data, then the joint optimisation can be written as

(\hat{\mathcal{W}}, \hat{\mathcal{A}}) := \arg\max_{\mathcal{W}, \mathcal{A}} \mathcal{L}(\mathcal{Y}, \mathcal{W}, \mathcal{A}).   (3.1)
It is possible to approximate this by estimating Â for a number of hypotheses (Matsui and Furui 1998; Yu and Gales 2007). However, this is slow. The normal approach therefore uses coordinate ascent and interleaves optimising word sequence W and optimising speech recogniser transformation Â. As an approximation to optimising the word sequence, decoding as discussed in section 2.4 is applied. Optimising the speech recogniser transformation normally uses expectation–maximisation or generalised expectation–maximisation. By controlling the number of parameters, the need for Bayesian schemes is avoided. One form of speech recogniser transformation is an affine transformation of parameters of output distributions q^(m), which section 3.2 will discuss. Methods specifically for noise-robustness, which will be the topic of chapter 4, can also be seen as adaptation if a noise model is estimated.

Figure 3.1 Directed graphical model of a speech hidden Markov model with A transforming the parameters of the component-conditional distribution. The y_t are observations from a different environment than the x_t in figure 2.4 on page 20.

Figure 3.1 shows a graphical model of the speech hmm with a transformation, where the observations from the training environment x_t in figure 2.4 have been replaced by those from the recognition environment y_t. y_t depends not only on m_t, but also on A. The component output distribution q^(m)(x_t) is replaced with q^(m)(y_t|A). q^(m)(y_t|A) can have various forms, some of which section 3.2 will discuss, but all can be seen as transforming the parameters of the component output distribution. Note that the transformation does not affect m_t, nor the state transitions. For decoding, the algorithm from section 2.4 still applies, except that the output distributions are replaced by their transformed versions.

Figure 3.2 gives a flow diagram for unsupervised estimation of a transformation. First, the recogniser uses an initial transformation (often the identity transformation) to find a transcription hypothesis W that probably has many errors. To estimate the transformation, the distribution over the component sequence is found (the expectation step of expectation–maximisation). In the maximisation step, this can then be used to decrease the mismatch between the component distribution and the actual observations. (This will be discussed in greater detail below.)

Figure 3.2 Unsupervised adaptation.

In the short loop in figure 3.2 the expectation step immediately follows. Running round this loop implements expectation–maximisation or generalised expectation–maximisation. The new transformation A is therefore guaranteed not to decrease the likelihood in the joint maximisation in (3.1). It is also possible to replace the hypothesis with a new one by running the decoder with the latest estimate of the transformation (the long loop). Since decoding only finds the state sequence, not the word sequence, with the highest likelihood (see section 2.4), the new hypothesis is not guaranteed to yield a better likelihood. If it does, then it is a step towards the joint maximisation in (3.1). After a small number of iterations, this process can stop and yield the final hypothesis.

The expectation step finds the distribution of the hidden variables.
As when training a recogniser, this distribution is represented by component–time posteriors γ_t^(m). The maximisation step finds the best transformation, in a process similar to training output distributions, but using the hypothesis on adaptation utterances Y rather than training utterances X. The expression is very similar to (2.32), but rather than directly estimating the output distribution's parameters, the transformation A is estimated by maximising

\mathcal{A}^{(k)} := \arg\max_{\mathcal{A}} \int \tilde{p}(\mathcal{Y}) \sum_m \sum_{t=1}^{T_Y} \gamma_t^{(m)} \log q^{(m)}(y_t|\mathcal{A}) \, d\mathcal{Y}.   (3.2)

The empirical distribution p̃(Y) here is assumed to represent utterances from a homogeneous part of the training data.

Unsupervised estimation often works well even if the initial hypothesis contains many errors. The key to this is controlling the number of parameters that are trained on a given amount of data. Gaussian components are usually grouped in clusters that share one transformation. The grouping is normally hierarchical, in a regression class tree (Leggetter 1995; Gales 1996; Haeb-Umbach 2001). When performing adaptation, the tree is pruned so that the resulting leaf nodes have enough data for robust estimation of the transformation. The leaf nodes of an unpruned tree result in a component clustering into base classes 1 ... R. For the methods that this thesis will introduce (in chapter 6), the amount of adaptation data will be irrelevant, so base classes will feature most prominently. Estimating the transformations is completely separate per class, so to keep notation from being cluttered, the notation in sections 3.2 and 3.3 will assume one class.

3.1.1 Adaptive training

The discussion of adaptation has so far assumed that the speech recogniser model is trained on homogeneous and noise-free data, and that the test data is different from that. In reality, the training data often has different speakers, and sometimes is even explicitly multi-environment (for example, different noise data may be artificially added to the audio), to try and capture the different environments the recogniser might be employed in.

It is possible to use the graphical model with transformations, in figure 3.1, when training as well. For every set of homogeneous utterances (for example, utterances from one speaker, or from one noise environment, or just for one utterance) a transformation is trained to maximise its likelihood. The speech recogniser parameters are then re-estimated to maximise the average likelihood over all speakers. The speech recogniser and the transformation are optimised iteratively. This creates a canonical speech recogniser model, which, unlike a normally trained one, does not represent the training data without the transformation. It is a conceptually pleasing property that it unifies the model for training and testing. However, it creates a chicken-and-egg problem: the speech recogniser does not represent the data without a transformation, and there is no well-defined initial setting for the transformation without a hypothesis. There is therefore no clear starting point for the interleaved estimation of transformation and hypothesis. This can be solved by using a conventionally-trained recogniser initially, and only then using the adaptively-trained one with an appropriate transformation, or by using a heuristically determined initial transformation.
Some schemes that apply adaptive training introduce an extra stochastic variable that represents the identity of the speaker or cluster of speakers. This includes speaker adaptive training (Anastasakos et al. 1996) and cluster adaptive training (Gales 2000). Some more recent adaptive training schemes (Liao and Gales 2007; Kalinli et al. 2010; Flego and Gales 2009; Kim and Gales 2010) have found improvements from using transformations on speech recognisers with a standard form of speech recogniser adaptation.

3.2 Linear adaptation

Linear adaptation methods are instantiations of adaptation as discussed in section 3.1. A graphical model was given in figure 3.1 on page 41: the distribution of observations y_t depends on the component generating it, m_t, and the transformation A. Transformation A changes the parameters of Gaussian component m, which models the training data:

p^{(m)}(x) = \mathcal{N}\bigl(x;\, \mu_x^{(m)}, \Sigma_x^{(m)}\bigr).   (3.3a)

The transformed distribution q^(m)(y|A) represents the observed data with

q^{(m)}(y|\mathcal{A}) = \mathcal{N}\bigl(y;\, \mu_y^{(m)}, \Sigma_y^{(m)}\bigr).   (3.3b)

The general form of transformation that will be considered is an affine transformation {H, g} of the mean vector, and a linear transformation H′ of the covariance matrix:

\mu_y^{(m)} := H \mu_x^{(m)} - g;   (3.3c)

\Sigma_y^{(m)} := H' \Sigma_x^{(m)} H'^{\mathrm T}.   (3.3d)

The first adaptation method that applied an explicitly ml-estimated affine transform was maximum-likelihood linear regression (mllr) (Leggetter and Woodland 1995), which only adapts the parameters of the mean. Since the main interest in adaptation transformations for this thesis is in modelling correlations for noise-robustness, the following sections will focus on two different forms. The first, cmllr (Gales 1998a), constrains the mean and covariance transformations to be the same. The second only transforms the covariance (Neumeyer et al. 1995; Gales and Woodland 1996). Per class, both have in the order of d² parameters, which means that about 1000 frames are required to train them robustly.

3.2.1 Constrained transformation

Constrained mllr (cmllr) constrains the linear transform applied to the mean and covariance to be equal. Thus, the likelihood for component m becomes

q^{(m)}(y|\mathcal{A}) = \mathcal{N}\bigl(y;\, H\mu_x^{(m)} - g,\, H\Sigma_x^{(m)}H^{\mathrm T}\bigr).   (3.4a)

H can be diagonal (Digalakis et al. 1995) or full (Gales 1998a). The latter shape is of most interest for this work since it can model some feature correlations. One of its useful properties is that this can alternatively be written as a transformation of the observations, A = {A, b} with A = H⁻¹ and b = −H⁻¹g (Gales 1998a):

q^{(m)}(y|\mathcal{A}) = |A| \cdot \mathcal{N}\bigl(Ay + b;\, \mu_x^{(m)}, \Sigma_x^{(m)}\bigr).   (3.4b)

This means that each observation vector is transformed before being passed to the Gaussian components. In practice, components are usually clustered into classes based on their distance to each other, with a different transformation for each class (see section 3.1). In effect, cmllr then performs a piecewise linear transformation of the observations. In terms of the implementation, models can calculate the observation likelihood on the appropriately transformed feature vector. Transforming y to R parallel feature vectors A^(r)y + b^(r) can be computationally cheaper than transforming the parameters of each Gaussian. For a diagonal transformation matrix, this depends on the number of feature vectors to be transformed and the number of components: transforming one feature vector has the same complexity as transforming one component.
If the transformation matrix is full, transforming one feature vector costs O(d²) time, whereas transforming one covariance matrix costs O(d³) time. Additionally, transforming the features means that the original diagonal covariance matrices can be used, so that no extra memory is required to store the models and the likelihood computation is not slowed down much.

The interest here is in a method that works for diagonal covariance matrices, so that decoding is cheap. This allows for the row-wise optimisation algorithm in Gales (1998a). Estimating transformations for full covariance matrices, with a generalisation of the row-wise algorithm (Sim and Gales 2005) or gradient optimisation (Ghoshal et al. 2010), will not be discussed. Covariance matrix entries are denoted with σ_{x,ii}^(m).

The maximisation step implements (3.2) with the likelihood calculation in (3.4b). A derivation of the optimisation is in section b.1.1. What is interesting here is the statistics that this optimisation requires. (Section 6.2.1 will discuss how to train the same form of transformation from predicted statistics. The only difference will be the form of the statistics.) The required statistics from the adaptation data are γ, k^(i), and G^(i), with (from (b.5))

\gamma \triangleq \int \tilde{p}(\mathcal{Y}) \sum_m \sum_{t=1}^{T_Y} \gamma_t^{(m)} \, d\mathcal{Y};   (3.5a)

k^{(i)} \triangleq \int \tilde{p}(\mathcal{Y}) \sum_m \frac{\mu_{x,i}^{(m)}}{\sigma_{x,ii}^{(m)}} \sum_{t=1}^{T_Y} \gamma_t^{(m)} \begin{bmatrix} y_t^{\mathrm T} & 1 \end{bmatrix} d\mathcal{Y};   (3.5b)

G^{(i)} \triangleq \int \tilde{p}(\mathcal{Y}) \sum_m \frac{1}{\sigma_{x,ii}^{(m)}} \sum_{t=1}^{T_Y} \gamma_t^{(m)} \begin{bmatrix} y_t y_t^{\mathrm T} & y_t \\ y_t^{\mathrm T} & 1 \end{bmatrix} d\mathcal{Y}.   (3.5c)

The optimisation algorithm then maximises the likelihood with respect to A per row. It iterates over each row a number of times. The likelihood is therefore guaranteed not to decrease, which makes the overall process an instantiation of generalised expectation–maximisation. If the transformation matrix A is constrained to a block-diagonal shape, the likelihood expression factorises into likelihoods for these blocks of coefficients. They can therefore be optimised separately. In the extreme case, A is constrained to be diagonal, and the optimisation is separate for each dimension. The optimisation procedure then yields the global maximum immediately, so that iterating is not necessary, and the process is an instantiation of expectation–maximisation. The final transformation is equivalent to the one described in Digalakis et al. (1995), which applies it in model space, though with an iterative method.

For the full-transformation case, the computational complexity of this algorithm is dominated by the cost of calculating the cofactors and the inverse of G^(i), which is necessary for each row. The latter costs O(d³) per matrix (with d the dimension of the feature vector). A naive implementation of the former costs O(d³) per matrix per iteration, but using the Sherman–Morrison matrix inversion lemma this can be reduced to O(d²) (Gales and van Dalen 2007). Thus, for R transforms and L iterations, the cost of estimating the transforms is O(RLd³ + Rd⁴).
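The following sketch shows the decoding-time use of a cmllr transform as in (3.4b): the observation is mapped to Ay + b once, the original diagonal-covariance Gaussian is evaluated in the transformed space, and log|A| is added as the Jacobian term. The dimensionality and all parameter values are arbitrary illustrations, not estimates produced by the row-wise optimisation above.

import numpy as np

# Minimal sketch of evaluating a component likelihood with a cmllr transform
# as in (3.4b): transform the observation, evaluate the diagonal-covariance
# Gaussian in that space, and add log|A| as the Jacobian term.
d = 3
rng = np.random.default_rng(2)
A = np.eye(d) + 0.1 * rng.standard_normal((d, d))   # A = H^{-1}
b = rng.standard_normal(d)                          # b = -H^{-1} g
mu = rng.standard_normal(d)                         # clean-speech mean
var = np.ones(d) * 0.5                              # diagonal covariance

def cmllr_log_likelihood(y):
    z = A @ y + b                                   # transformed observation
    log_det = np.linalg.slogdet(A)[1]               # log |A|
    log_gauss = -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var)
    return log_det + log_gauss

y = rng.standard_normal(d)
print(cmllr_log_likelihood(y))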
3.2.2 Covariance adaptation

Covariance mllr (Neumeyer et al. 1995; Gales and Woodland 1996; Gales 1998a) updates only the covariances of the component Gaussians. It was originally proposed to be used in combination with mean mllr. The likelihood of transformed Gaussian component m becomes

q^{(m)}(y|\mathcal{A}) = \mathcal{N}\bigl(y;\, \mu_x^{(m)},\, H\Sigma_x^{(m)}H^{\mathrm T}\bigr).   (3.6a)

As in the constrained case, this is better expressed with the inverse transformation A = H⁻¹, so that

q^{(m)}(y|\mathcal{A}) = |A| \cdot \mathcal{N}\bigl(Ay;\, A\mu_x^{(m)}, \Sigma_x^{(m)}\bigr).   (3.6b)

It may not be immediately obvious why this is a better formulation than (3.6a). Normally, the covariance matrix Σ_x^(m) is diagonal. Storing the updated covariance HΣ_x^(m)H^T would therefore require extra storage, whereas storing Aμ_x^(m) does not. An even more important reason in this work, which will use a variant of covariance mllr to speed up decoding, is that computing the likelihood is faster with the form in (3.6b) than with the form in (3.6a), again because of the diagonal covariance matrix.

The derivation is in appendix b.2.1. Just like for cmllr, section 6.2.2 will compute the same form of transformation but from predicted statistics. The form of the statistics is therefore of most interest here. The statistics from the adaptation data are γ and G^(i), with

\gamma \triangleq \int \tilde{p}(\mathcal{Y}) \sum_m \sum_{t=1}^{T_Y} \gamma_t^{(m)} \, d\mathcal{Y};   (3.7a)

G^{(i)} \triangleq \int \tilde{p}(\mathcal{Y}) \sum_m \frac{1}{\sigma_{x,ii}^{(m)}} \sum_{t=1}^{T_Y} \gamma_t^{(m)} \bigl(y_t - \mu_x^{(m)}\bigr)\bigl(y_t - \mu_x^{(m)}\bigr)^{\mathrm T} d\mathcal{Y}.   (3.7b)

γ here is the same as for cmllr (in (3.5a)); G^(i) is similar to part of (3.5c), but uses y_t − μ_x^(m) instead of y_t.

Similarly to cmllr, the optimisation is row-wise, and this again implements generalised expectation–maximisation. Just as for cmllr, calculating the inverse of G^(i) and finding the cofactors are necessary for each row update and form the main computational cost. Therefore, estimating transforms for R classes in L iterations is O(RLd³ + Rd⁴), with d the size of the feature vector. Covariance mllr also needs to transform all model means, which takes O(Md²).
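A minimal sketch of accumulating the covariance-mllr statistics in (3.7) is given below: the total occupancy γ and one matrix G^(i) per dimension i, summing posterior-weighted outer products of (y_t − μ_x^(m)) scaled by 1/σ_{x,ii}^(m). The posteriors, means, and variances are random stand-ins for those of a real recogniser, and the row-wise update that would consume these statistics is not shown.

import numpy as np

# Minimal sketch of accumulating the statistics of (3.7) for one base class.
rng = np.random.default_rng(3)
T, M, d = 50, 4, 3                                # frames, components, dims
Y = rng.standard_normal((T, d))                   # adaptation observations
post = rng.dirichlet(np.ones(M), size=T)          # gamma_t^(m)
means = rng.standard_normal((M, d))               # mu_x^(m)
variances = np.full((M, d), 0.5)                  # diagonal sigma_x,ii^(m)

gamma = post.sum()
G = np.zeros((d, d, d))                           # G[i] is a d x d matrix
for m in range(M):
    diff = Y - means[m]                           # (T, d)
    weighted = post[:, m][:, None] * diff         # gamma_t^(m) (y_t - mu)
    accumulated = weighted.T @ diff               # sum_t gamma (y-mu)(y-mu)^T
    for i in range(d):
        G[i] += accumulated / variances[m, i]

print(gamma)
print(G[0])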
However, to attain good modelling, the loading matrices are normally tied across many components and p is large, which makes decoding as slow 49 chapter 3. adaptation as with full covariance matrices. e following sections will discuss two dišerent types of correlation modelling that require fewer parameters to be estimated and allow faster decoding than full cov- ariancematrices.e rst formmodels precisionmatrices, inverse covariancematrices (section 3.3.1).e second form to be discussed (section 3.3.2) are projection schemes, which choose dimensions that discriminate best and reduce the dimensionality of the data. 3.3.1 Structured precision matrices It is possible to model the precisionmatrices, the inverse covariance matrices, directly as a weighted sum of basis matrices (Olsen and Gopinath 2004; Axelrod et al. 2002; Vanhoucke and Sankar 2004; Sim and Gales 2004): Σ(m) −1 = ∑ i pi (m) i Bi, (3.9) where Bi is a basis matrix for modelling the precision matrices. e advantage of modelling the precision matrices rather than covariance matrices is decoding speed. e resulting likelihood calculation is q(m)(x) ∝ exp ( − 12(x− µ (m)) T Σ(m) −1 (x− µ(m)) ) (3.10) = exp (∑ i pi (m) i ( − 12x TBix+ µ (m)TBix− 1 2µ (m)TBiµ (m) )) . (3.11) Since xTBix and Bix do not depend on the component, it is possible to cache them and share the result between components. A more restricted, but ešective, form of precision matrix modelling is semi-tied covariance matrices (Gales 1999). Each basis matrix is of rank 1, and components in one base class share as many basis matrices as there are feature dimensions. It can therefore be written in a dišerent form, where components have diagonal covariance matrices that share one rotation matrix per base class. e algorithm nds a trans- formation that results in a feature space in which a diagonal covariance matrix is a 50 3.3. covariance modelling more valid assumption than in the original feature space. e covariance matrix in the transformed space will be denoted with Σ˜(m)x,diag.e expression for the likelihood is q(m)(x) = N (x; µ(m)x , HΣ˜(m)x,diagHT). (3.12a) Just like for covariancemllr, in (3.6a), the ešective covariance is a component-specic diagonal covariance in a space specied by a transformation H. e transformation is tied over all components in a base class. (Again, the dependence on the base class is not written since the optimisation is separate for each base class.) Unlike for cov- ariance mllr, data sparsity is not a problem, because all training data is used, so that component-dependent covariance Σ˜(m)x,diag is estimated as well as the transformation. To improve decoding speed, it is again useful to describe the covariance transforma- tion by its inverse,A = H−1, so that the likelihood is expressed q(m)(x) = |A| · N (Ax; Aµ(m)x , Σ˜(m)x,diag). (3.12b) e estimation ofA and Σ˜(m)x,diag is iterative. First,A is estimated, with the same form of statistics and procedure as covariance mllr. en, Σ˜(m)x,diag is straightforwardly set to the maximum-likelihood estimate. As for cmllr and covariancemllr, the interest here is in the statistics that estimation requires. A derivation is in section b.3.1. Stat- istics that do not change when component covariances are updated are γ, again, and the sample covariance for componentm,W(m): γ , ∫ p˜(X ) ∑ m TX∑ t=1 γ (m) t dX ; (3.13a) W(m) , 1 γ(m) ∫ p˜(X ) TX∑ t=1 γ (m) t ( xt − µ (m) x )( xt − µ (m) x )T dX . (3.13b) e statistics G(i) are of the same form as for covariance mllr, in (3.7b). 
They depend on σ̃_{x,ii}^(m), diagonal element i of Σ̃^(m)_{x,diag}, which for semi-tied covariance matrices is updated in every iteration:

G^{(i)} \triangleq \sum_m \frac{\gamma^{(m)}}{\tilde{\sigma}_{x,ii}^{(m)}} W^{(m)}.   (3.13c)

function Estimate-Semi-Tied-Covariance-Matrices({W^(m), γ^(m)}, γ)
    for all components m do
        Initialise Σ̃^(m)_{x,diag} ← diag(W^(m))
    Initialise A ← I
    repeat
        G^(i) ← Σ_m (γ^(m) / σ̃^(m)_{x,ii}) W^(m)
        A ← Estimate-Covariance-MLLR(γ, G^(i))
        for all components m do
            Σ̃^(m)_{x,diag} ← diag(A W^(m) A^T)
    until convergence
    return {Σ̃^(m)_{x,diag}}, A

Algorithm 1 The maximisation step of expectation–maximisation for estimating semi-tied covariance matrices.

The estimation procedure is given in algorithm 1. Σ̃^(m)_{x,diag} is initialised to the diagonalised original covariance, diag(W^(m)), and A to the identity matrix I. In the first step, the transformation A is updated in the same way as is done for covariance mllr transforms, described in section 3.2.2. However, the current estimate for the covariance, which changes every iteration, is used. The statistics must therefore be re-computed for every iteration. In the second step, Σ̃^(m)_{x,diag} is set to the maximum-likelihood diagonal covariance in the feature space given by A. This process is repeated until convergence. Both steps are guaranteed not to decrease the likelihood. This is therefore a generalised expectation–maximisation algorithm.

Because decoding with semi-tied covariance matrices uses diagonal-covariance Gaussians, it is almost as fast as decoding with plain diagonal-covariance Gaussians. Adjusting the number of base classes that transformations are computed for allows a trade-off between the number of parameters and the accuracy of the covariance model.

Just like for cmllr and covariance mllr, the computational complexity of this algorithm is dominated by the cost of calculating the cofactors and the inverse of G^(i). The former costs O(d²) (with d the dimension of the feature vector) per dimension per iteration. The latter costs O(d³) per dimension. Thus, for R transforms, K outer loop iterations, and L inner loop (of estimating A) iterations, the cost of estimating the transforms is O(RKLd³ + RKd⁴).
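The sketch below mirrors the structure of algorithm 1 in Python, under the assumption that the inner update of A follows the row-wise formula used for semi-tied covariance matrices (Gales 1999), in which row r is set from the corresponding cofactor row and G^(r)⁻¹. The occupancies and sample covariances W^(m) are synthetic, and the sketch is meant as an illustration of the outer and inner loop structure rather than a drop-in implementation.

import numpy as np

# Rough sketch of the maximisation step in algorithm 1 for one base class.
rng = np.random.default_rng(4)
d, M = 3, 5
gamma_m = rng.uniform(50, 100, size=M)             # occupancies gamma^(m)
beta = gamma_m.sum()
W = np.empty((M, d, d))                            # sample covariances W^(m)
for m in range(M):
    L = rng.standard_normal((d, d))
    W[m] = L @ L.T / d + np.eye(d)

A = np.eye(d)
sigma_diag = np.array([np.diag(W[m]) for m in range(M)])   # initialisation

for outer in range(10):
    # statistics (3.13c), using the current diagonal covariances
    G = np.zeros((d, d, d))
    for i in range(d):
        for m in range(M):
            G[i] += gamma_m[m] / sigma_diag[m, i] * W[m]
    # inner loop: row-wise update of A (assumed form, after Gales 1999)
    for inner in range(10):
        for r in range(d):
            cof = np.linalg.det(A) * np.linalg.inv(A).T[r]   # cofactor row
            Ginv = np.linalg.inv(G[r])
            scale = np.sqrt(beta / (cof @ Ginv @ cof))
            A[r] = cof @ Ginv * scale
    # component-specific diagonal covariances in the transformed space
    sigma_diag = np.array([np.diag(A @ W[m] @ A.T) for m in range(M)])

print(A)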
3.3.2 Maximum likelihood projection schemes

A projection of feature vectors onto a different overall feature space, as opposed to a component-specific one, can also improve speech recogniser performance. This notion motivates the final step of computing mfccs (in (2.6)), which aims to decorrelate the features with a discrete cosine transform, and then reduces the dimensionality. The same goes for deriving dynamic features from a window of static features (in (2.7b)). These two projection schemes have an intuitive motivation. This section, however, discusses data-driven approaches to decorrelation and dimensionality reduction that can be applied in combination with or instead of the projections for the dct and dynamic coefficients.

Linear discriminant analysis (lda) (Fukunaga 1972) is a standard linear projection scheme that transforms the feature vectors to maximise between-class distance and minimise within-class correlation. For speech recognisers, the classes are usually Gaussian components. The transformation is supposed to make the assumption that the components have diagonal covariance matrices more reasonable. An alternative projection scheme that does not assume that the component covariances are equal, but does not optimise the feature space to be diagonal, is heteroscedastic discriminant analysis (Saon et al. 2000).

Heteroscedastic linear discriminant analysis (hlda) (Kumar 1997) is a method that finds the best projection as well as a transformation that improves the diagonal-covariance approximation. It finds a projection A_{[p]} that projects the d-dimensional original feature space to a p-dimensional subspace of useful features. The parameters for the (d − p)-dimensional nuisance subspace are tied over all components in the base class. Thus

A = \begin{bmatrix} A_{[p]} \\ A_{[d-p]} \end{bmatrix},   (3.14)

so that the base class-specific transformed feature vector is

\hat{x} = Ax = \begin{bmatrix} A_{[p]} x \\ A_{[d-p]} x \end{bmatrix}.   (3.15)

The new parameters for component m become

\hat{\mu}^{(m)} = \begin{bmatrix} \hat{\mu}^{(m)}_{[p]} \\ \hat{\mu}_{[d-p]} \end{bmatrix} = \begin{bmatrix} A_{[p]} \mu^{(m)} \\ A_{[d-p]} \mu \end{bmatrix};   (3.16)

\hat{\Sigma}^{(m)} = \begin{bmatrix} \hat{\Sigma}^{(m)}_{[p]} & 0 \\ 0 & \hat{\Sigma}_{[d-p]} \end{bmatrix},   (3.17)

where μ is the global mean, and

\hat{\Sigma}^{(m)}_{[p]} = \mathrm{diag}\bigl( A_{[p]} W^{(m)} A_{[p]}^{\mathrm T} \bigr);   (3.18)

\hat{\Sigma}_{[d-p]} = \mathrm{diag}\bigl( A_{[d-p]} \Sigma A_{[d-p]}^{\mathrm T} \bigr),   (3.19)

where W^(m) is the actual covariance within components, and Σ is the global covariance. The transformation is found with maximum-likelihood estimation. Details of the process are in Kumar (1997).

Because the nuisance dimensions have been tied over all components in the base class, the component likelihood computation can be split up into a global Gaussian and a component-specific one:

q^{(m)}(x) = |A| \, \mathcal{N}\bigl(Ax;\, \hat{\mu}^{(m)}, \hat{\Sigma}^{(m)}\bigr) = |A| \, \mathcal{N}\bigl(A_{[d-p]}x;\, \hat{\mu}_{[d-p]}, \hat{\Sigma}_{[d-p]}\bigr) \, \mathcal{N}\bigl(A_{[p]}x;\, \hat{\mu}^{(m)}_{[p]}, \hat{\Sigma}^{(m)}_{[p]}\bigr),   (3.20)

which reduces the computational complexity, because for one observation the first Gaussian is constant. Since a constant factor does not influence decoding (see section 2.4), the determinant and the first Gaussian do not normally have to be computed.

A useful property of hlda is that it finds a model for the complete feature space, not just for the useful dimensions. This allows for a generalisation, multiple heteroscedastic linear discriminant analysis (mhlda) (Gales 2002), which finds a separate hlda-like transformation for each base class. (The determinant and the first Gaussian in (3.20) then do affect decoding, so they do have to be computed.) At the same time, it is also a generalisation of semi-tied covariance matrices that reduces the feature dimensionality.

3.4 Summary

This chapter has discussed the general mechanism of speech recogniser adaptation. Section 3.1 has described how the usual approach, given unlabelled adaptation data, iterates between decoding and estimating a transformation of the recogniser. Methods for noise-robustness that the next chapter will discuss can be cast in the same framework when the noise is estimated (section 4.7). Section 3.2, however, has discussed adaptation without a model of the environment, but with linear transformations. They can therefore adapt to many types of mismatch, and by placing the transformations in the right places in the likelihood equation, they are essentially as fast to decode with as without. A similar scheme, semi-tied covariance matrices, discussed in section 3.3, can be used while training to model covariances, and similarly hardly reduces decoding speed. Chapter 6 will train both types of linear transformations from predicted distributions (for example, over noise-corrupted speech) rather than adaptation data.

Chapter 4

Noise-robustness

This chapter will discuss methods that make speech recognisers robust to noise. Section 4.1 will present a number of strategies for noise-robustness. Compared with the generic adaptation methods discussed in chapter 3, they need less adaptation data.
This is possible since they make stronger assumptions about the mismatch between training and test environments. The assumptions are the model of the noise and how it influences the incoming feature vectors. They will be the topic of section 4.2. Section 4.3 will then describe the resulting corrupted speech distribution. It has no closed form, so it needs to be approximated. Section 4.4 will discuss specific methods of model compensation, which replace the recogniser's clean speech distributions with distributions over the corrupted speech. Rather than using precomputed parameterised distributions, it is possible to approximate the noise-corrupted speech likelihoods only as the observations come in. This will be the topic of section 4.5. Section 4.6 will discuss an alternative model-based scheme that reconstructs the clean speech before passing it to the recogniser. Section 4.7 will describe how the noise model parameters can be estimated in an adaptation framework, as described in section 3.1.

4.1 Methods for noise-robustness

There are two categories of approaches for making speech recognisers robust to noise. Feature enhancement aims to reconstruct the clean speech before it enters the speech recogniser, and is therefore relatively fast. Model compensation, on the other hand, aims to perform joint inference over the clean speech and the noise, and is slower but yields better accuracy. If input from multiple microphones is available, then it is sometimes possible to reconstruct the signal from a source at a specific location. However, this work will consider the general case where only the input from one microphone is available.

Feature enhancement can work on any representation of the audio signal that is available in a speech recogniser. For example, the spectrum is sometimes used. Spectral subtraction (Boll 1979) requires the spectrum of the noise to be given, and assumes the noise is stationary. Alternatively, it is possible to find a minimum mean square error estimate of the speech and the noise (Ephraim 1990). This requires probabilistic models of the speech and the noise. It is possible to formulate these in the spectral domain, and assume all spectral coefficients independent. Though this assumption is true for Gaussian white noise, this type of noise is not normally found outside of research papers. For speech, the assumption is particularly unhelpful.

Therefore, approaches to feature enhancement that work on the log-mel-spectrum or the cepstrum have over the past two decades become successful. Though it makes formulating the interaction of speech and noise harder, it makes the models of speech and noise that assume independence between dimensions more accurate. To find the minimum mean square error estimate of the clean speech, its distribution is normally assumed independent and identically distributed, often as a mixture of Gaussians. A joint distribution of the clean and the corrupted speech then needs to be derived. This normally applies the same methods that model compensation does. Section 4.6 will discuss this.

However, this work will focus on model compensation, which replaces a speech recogniser's distributions over clean speech by ones over noise-corrupted speech.
Conceptually, a speech recogniser is a classifier that takes an observation sequence as input and classifies it as belonging to one of a set of word sequences. The Bayes decision rule for classification says that the best choice for labelling observations is the one with the highest probability. The probability is factorised into the prior probability of the label, and the likelihood, the probability of the observations given the label. If the prior and the likelihood are the true ones, then the Bayes decision rule produces the optimal classification. Speech recognisers essentially implement this rule.

Assuming that the speech and noise distributions are the true ones, decoding with the exact distributions for the corrupted speech would therefore yield the best recogniser performance. The objective of feature enhancement, reconstructing the clean speech, may be useful where the clean speech is required, for example, as a preprocessing step before passing the signal to humans. However, as a preprocessing step for speech recognition, it gives no guarantees about optimality under any assumptions. This thesis will therefore aim to find accurate corrupted speech distributions for model compensation.

4.2 Noise-corrupted speech

Noise can be described as the change to the clean speech signal before the speech recogniser receives it. It is possible to identify a number of different types of noise. The most obvious is that background noise may be added to the signal. This will be called "additive noise" and denoted by n. It can be represented by its time-domain signal. It will be assumed independent of the speech. Second, due to the properties of the microphone and other elements of the channel, some frequencies may be amplified and others reduced. This can be represented by convolution in the time domain. The convolutional noise will be written h and will be assumed constant and independent of the additive noise and the speech. The environment model with additive and convolutional noise has been standard for model-based noise-robustness since it was introduced (Acero 1990). These two types of noise will be handled explicitly in this work. Other forms of noise need to be handled separately, with approaches that may well be combinable with the approaches from this thesis. One type of noise is non-linear distortion, for example, gain distortion or clipping. Another is reverberation, resulting from the characteristics of the room that the microphone is in. Noise also influences how people speak. To help the listener decode the message, people alter their speaking style in noisy conditions. This is called the Lombard effect (Junqua and Anglade 1990) and it has turned out to be hard to model.

As section 2.1 has discussed, speech recognisers preprocess the time-domain signal to produce feature vectors at fixed intervals. It is the influence of the noise on these feature vectors that is of interest for making speech recognisers robust to noise. The function representing the observation vector that results from vectors for the speech, the additive noise, and the convolutional noise is called the mismatch function. The following sections will find an expression for this influence in the log-spectral domain, the cepstral domain, and for dynamic features. The noise will be modelled in the same domain as the speech. Just like in chapter 3, the observations from the training data will be written x and the ones from the recognition data y.
In the context of noise-robustness, x is the noise-free, clean, speech, and y the noise-corrupted speech.

The observation distribution, the distribution that results from combining distributions of the speech and the noise through the mismatch function, is what most methods for noise-robustness in this thesis aim to model. Section 4.3 will find the exact expression.

4.2.1 Log-spectral mismatch function

The relationship between the corrupted speech, the clean speech and the noise is central to noise-robust speech recognition. The term mismatch function (Gales 1995) or interaction function (Kristjansson 2002) is often used for the function that takes the speech and noise signals and returns the corrupted speech signal. This section will derive a mismatch function in terms of speech recogniser feature vectors. Many speech recognisers use feature vectors in the cepstral domain, which was described in section 2.1.1. Cepstral features are related to log-spectral features by the discrete cosine transform (dct), which is a linear transformation. The reason the cepstral domain is often preferred is that the dct goes a long way to decorrelating the features within a feature vector. For the purpose of modelling the interaction between speech, noise, and observations, however, the log-spectral domain has an advantage: the interaction is per dimension. Chapter 7 will need the log-spectral domain; the speech recogniser experiments in chapter 8 will use the cepstral domain. The following will therefore initially derive the relation for log-spectral features. The conversion to cepstral feature vectors is then found by converting to and from log-spectral features, in section 4.2.2.

The derivation will rewrite the mismatch function by going through the steps for feature extraction described in section 2.1 in parallel for different feature vectors. It will assume the power spectrum (β = 2), as is often done; appendix c.2 generalises it to other factors. It follows Deng et al. (2004); Leutnant and Haeb-Umbach (2009a;b).

In the time domain, the relationship between the corrupted speech y[t], the clean speech x[t], the additive noise n[t], and the convolutional noise h[t] is simply (Acero 1990)

y[t] = h[t] \ast x[t] + n[t],   (4.1)

which in the frequency domain (after applying a Fourier transformation) becomes a relation between complex numbers:

Y[k] = H[k] X[k] + N[k].   (4.2)

To find the power spectrum, the absolute value of this complex value is squared:

|Y[k]|^2 = |H[k]X[k] + N[k]|^2 = |H[k]|^2 |X[k]|^2 + |N[k]|^2 + 2 |H[k]X[k]| |N[k]| \cos\theta_k,   (4.3)

where θ_k is the angle in the complex plane between H[k]X[k] and N[k]. This relates to the phase difference at frequency k between the clean speech and the noise. Since there is no process in speech production that synchronises the phase to background noise, this angle is uniformly distributed, so that the expected value of the cosine of the angle is

\mathbb{E}\{\cos\theta_k\} = 0.   (4.4)

To extract coefficients for speech recognition, the next step is to reduce the number of coefficients, by applying I filter bins to the power-spectral coefficients. There are usually 24 triangular bins. As in (2.1), let w_{ik} specify the contribution of the kth frequency to the ith bin. The mel-filtered power spectrum is then given by coefficients Ȳ_i:

\bar{Y}_i = \sum_k w_{ik} |Y[k]|^2 = \sum_k w_{ik} \bigl( |H[k]|^2 |X[k]|^2 + |N[k]|^2 + 2 |H[k]X[k]| |N[k]| \cos\theta_k \bigr).   (4.5)
Because this is a weighted average and the expectation of the term with cos θ_k is 0, as shown in (4.4), the cross-term of the speech and the noise is often dropped. Though retaining this term complicates the derivation, here (4.5) will be used as is.

This value of the mel-filtered power spectrum for the corrupted speech can be defined in terms of values of the clean speech, the additive noise, and the convolutional noise in the same domain:

\bar{X}_i = \sum_k w_{ik} |X[k]|^2; \qquad \bar{N}_i = \sum_k w_{ik} |N[k]|^2; \qquad \bar{H}_i = \sum_k w_{ik} |H[k]|^2.   (4.6)

The rewrite requires an approximation and the introduction of a random variable. First, the convolutional noise H[k] is assumed equal for all k in one bin, so that for any frequency k in bin i,

\bar{H}_i \simeq |H[k]|^2.   (4.7)

Then, replacing the terms in the right-hand side of (4.5) yields

\bar{Y}_i = \sum_k w_{ik} |Y[k]|^2 = \bar{H}_i \bar{X}_i + \bar{N}_i + 2 \alpha_i \sqrt{\bar{H}_i \bar{X}_i \bar{N}_i},   (4.8a)

where a new random variable α_i, indicating the phase factor, is defined as

\alpha_i \triangleq \frac{\sum_k w_{ik} |H[k]| |X[k]| |N[k]| \cos\theta_k}{\sqrt{\bar{H}_i \bar{X}_i \bar{N}_i}}.   (4.8b)

The next subsections will find properties of α_i that are independent of the spectra of the sources: first, that α_i is constrained to [−1, +1], and then that its distribution is approximately Gaussian.

The mel-power-spectral coefficients are usually converted to their logarithms, so that y_i^log = log(Ȳ_i), x_i^log = log(X̄_i), n_i^log = log(N̄_i), and h_i^log = log(H̄_i). The mismatch expression in the log-spectral domain trivially becomes

\exp\bigl(y_i^{\text{log}}\bigr) = \exp\bigl(x_i^{\text{log}} + h_i^{\text{log}}\bigr) + \exp\bigl(n_i^{\text{log}}\bigr) + 2\alpha_i \exp\bigl(\tfrac12\bigl(x_i^{\text{log}} + h_i^{\text{log}} + n_i^{\text{log}}\bigr)\bigr).   (4.9)

This relationship is per coefficient i of each feature vector. Section 4.2.2 will express the mismatch function in terms of mel-cepstral feature vectors, which are linearly-transformed log-spectral vectors. It is therefore useful to write the mismatch between log-spectral vectors. The relationship between the speech vector x^log, the additive noise n^log, the convolutional noise h^log, and the observation y^log in the log-spectral domain is

\exp\bigl(y^{\text{log}}\bigr) = \exp\bigl(x^{\text{log}} + h^{\text{log}}\bigr) + \exp\bigl(n^{\text{log}}\bigr) + 2\alpha \circ \exp\bigl(\tfrac12\bigl(x^{\text{log}} + h^{\text{log}} + n^{\text{log}}\bigr)\bigr),   (4.10a)

where exp(·) and ∘ denote element-wise exponentiation and multiplication respectively. To express the observation as a function of the speech and noise, using log(·) for the element-wise logarithm,

y^{\text{log}} = \log\Bigl( \exp\bigl(x^{\text{log}} + h^{\text{log}}\bigr) + \exp\bigl(n^{\text{log}}\bigr) + 2\alpha \circ \exp\bigl(\tfrac12\bigl(x^{\text{log}} + h^{\text{log}} + n^{\text{log}}\bigr)\bigr) \Bigr) \triangleq f\bigl(x^{\text{log}}, n^{\text{log}}, h^{\text{log}}, \alpha\bigr).   (4.10b)

The symmetry between the clean speech and the additive noise will be important in this work. However, other work often uses a different form. (4.10b) is often rewritten to bring out the effect of the noise on the clean speech, as in (4.10c), or on the channel-filtered speech, as in (4.10d):

y^{\text{log}} = x^{\text{log}} + \log\Bigl( \exp\bigl(h^{\text{log}}\bigr) + \exp\bigl(n^{\text{log}} - x^{\text{log}}\bigr) + 2\alpha \circ \exp\bigl(\tfrac12\bigl(h^{\text{log}} + n^{\text{log}} - x^{\text{log}}\bigr)\bigr) \Bigr)   (4.10c)

= x^{\text{log}} + h^{\text{log}} + \log\Bigl( 1 + \exp\bigl(n^{\text{log}} - x^{\text{log}} - h^{\text{log}}\bigr) + 2\alpha \circ \exp\bigl(\tfrac12\bigl(n^{\text{log}} - x^{\text{log}} - h^{\text{log}}\bigr)\bigr) \Bigr).   (4.10d)

In all cases, α encapsulates the phase difference between the two signals that are added (channel-filtered speech and noise) in one mel bin. The phase information is discarded in the conversion to the log-spectral domain, so that with speech and noise models in that domain, the phase factor α is a random vector, the distribution of which will be discussed below.
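A minimal sketch of the mismatch function f in (4.10b) is given below: given log-spectral vectors for the clean speech, the additive noise, and the convolutional noise, plus a phase-factor vector α, it returns the corrupted-speech log-spectral vector. The numerical inputs are arbitrary; setting α to zero corresponds to the traditional approximation that ignores the phase-factor term.

import numpy as np

# Minimal sketch of the log-spectral mismatch function (4.10b).
def mismatch(x_log, n_log, h_log, alpha):
    return np.log(np.exp(x_log + h_log) + np.exp(n_log)
                  + 2 * alpha * np.exp(0.5 * (x_log + h_log + n_log)))

x_log = np.array([2.0, 1.0, 0.5])      # clean speech (illustrative values)
n_log = np.array([1.5, 1.5, 1.5])      # additive noise
h_log = np.zeros(3)                    # flat channel
print(mismatch(x_log, n_log, h_log, alpha=np.zeros(3)))     # phase term ignored
print(mismatch(x_log, n_log, h_log, alpha=0.2 * np.ones(3)))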
Some papers (Deng et al. 2004; Leutnant and Haeb-Umbach 2009a) have gone into mathematical depth to get as close as possible to the real mismatch. For example, they show the elements of α to be between −1 and +1.

Other papers have been more pragmatic. After all, in practice, model compensation for noise robustness is a form of adaptation to the data. The parameters that in theory make up the noise model are usually estimated from the data, with the aim of maximising the likelihood of the adapted model (section 4.7 will discuss this in more detail). The difference between traditional linear transformation methods and methods for model compensation is therefore the space to which the adapted model is constrained. In both cases, it could be argued that the best choice for this space is the one that yields the lowest word error rate, which is not necessarily the mathematically correct one. The mismatch function is one element that determines this space.

In this vein, it is possible to show that the optimal value for the phase factor is the mathematically inconsistent αi = 2.5 for the aurora 2 corpus (Li et al. 2009). What has really happened is that the mismatch function has been tuned. The possibility of tuning the mismatch function to assume the features use a specific power of the spectrum (e.g. the magnitude or power spectrum) had been noted before (Gales 1995). Setting the phase factor to 2.5 has a very similar effect to assuming the power of the spectrum used to be 0.75 (Gales and Flego 2010). This parameter setting improves performance for aurora 2, but not for other corpora (Gales and Flego 2010). This illustrates that adjusting arbitrary parameters in the mismatch function, in effect adjusting the space in which the optimal adapted model is sought, can have an impact on word error rate in some cases. However, the following will analyse properties of the phase factor distribution mathematically.

4.2.1.1 Properties of the phase factor

Two important observations can be made about αi, which was defined in (4.8b). First, it is possible to determine the range of αi (Deng et al. 2004). Since a cosine is constrained to [−1, 1], from (4.8b) the following inequality holds:

|\alpha_i| \le \frac{\sum_k w_{ik} |X[k]|\, |N[k]|}{\sqrt{\bar{X}_i \bar{N}_i}}. (4.11)

It is possible to write the fraction in (4.11) as a normalised inner product of two vectors. The vectors are \tilde{X}_i and \tilde{N}_i, with entries

\tilde{X}_{ik} = \sqrt{w_{ik}}\, |X[k]|; (4.12a)
\tilde{N}_{ik} = \sqrt{w_{ik}}\, |N[k]|. (4.12b)

Then, \sqrt{\bar{X}_i} in (4.11) can be written as the norm of \tilde{X}_i, so that

\frac{\sum_k w_{ik} |X[k]|\, |N[k]|}{\sqrt{\bar{X}_i \bar{N}_i}} = \frac{\sum_k \sqrt{w_{ik}}\, |X[k]| \cdot \sqrt{w_{ik}}\, |N[k]|}{\|\tilde{X}_i\| \, \|\tilde{N}_i\|} = \frac{\tilde{X}_i^T \tilde{N}_i}{\|\tilde{X}_i\| \, \|\tilde{N}_i\|}, (4.13)

which is a normalised inner product of two vectors with non-negative entries, and therefore always in [0, 1]. The inequality in (4.11) then shows that αi is constrained to [−1, 1]:

|\alpha_i| \le \frac{\tilde{X}_i^T \tilde{N}_i}{\|\tilde{X}_i\| \, \|\tilde{N}_i\|} \le 1. (4.14)

Second, an approximation can decouple the distribution of αi from the values of |X[k]| and |N[k]| (Leutnant and Haeb-Umbach 2009a). Removing magnitude-spectral terms from the equation is useful because they are not usually modelled individually in speech recognisers. The assumption is that for one frequency bin i, all |X[k]| have the same value, and similarly for all |N[k]|. Since the bins overlap, this can be exactly true only if |X[k]| is equal for all k. However, especially for the narrowest bins i, the lower ones (see section 2.1.1), it may be a reasonable approximation.
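Both properties can be checked numerically before the derivation continues. The sketch below evaluates αi directly from (4.8b) for arbitrary filter weights and spectra, taking |H[k]| = 1 so that the convolutional term drops out; all values and names are illustrative. With flat spectra in the bin it also reproduces the weighted average of cosines that the next approximation relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 64                                    # frequencies in one mel bin (arbitrary)
w = rng.uniform(0.1, 1.0, size=K)         # filter weights w_ik (arbitrary)

for flat in (False, True):
    X = np.full(K, 3.0) if flat else rng.uniform(0.1, 10.0, size=K)   # |X[k]|
    N = np.full(K, 2.0) if flat else rng.uniform(0.1, 10.0, size=K)   # |N[k]|
    theta = rng.uniform(-np.pi, np.pi, size=K)                        # phase differences
    # alpha_i from (4.8b), with |H[k]| = 1:
    alpha_i = np.sum(w * X * N * np.cos(theta)) / np.sqrt(
        np.sum(w * X**2) * np.sum(w * N**2))
    assert abs(alpha_i) <= 1.0            # the bound of (4.14)
    # With flat spectra in the bin, alpha_i reduces to a weighted average of cosines:
    weighted_cos = np.sum(w * np.cos(theta)) / np.sum(w)
    print(f"flat={flat}:  alpha_i={alpha_i:+.4f}  weighted cosine={weighted_cos:+.4f}")
```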
By dividing both sides of the fraction in (4.8b) by |X[k]| and |N[k]|, assuming that they are equal for all k, αi is approximated as αi = ∑ kwik|X[k]||N[k]| cos θk√∑ kwik|X[k]|2 ∑ kwik|N[k]|2 ' ∑ kwik cos θk√(∑ kwik )(∑ kwik ) = ∑kwik cos θk∑ kwik , (4.15) αi can thus be approximated as a weighted average of cosines over independently distributed uniform variables θk. e distribution that this model produces is close to the empirical distribution on various combinations of noises and signal-to-noise ratios on the aurora 2 corpus (Leutnant and Haeb-Umbach 2009a). e distribution of the phase factor e distribution ofαi is most easily viewed by sampling from it. Drawing a sample is straightforward if three assumptions are used. Since there is no process in speech production that synchronises the speech phasewith the noise phase at a specic frequency, the phase θk is uniformly distributed and in- dependent of noise and speech. Also, the phase is assumed independently distributed for dišerent frequencies k.irdly, the distribution of αi is approximated as in (4.15), removing the inžuence of particular value of the speech and noise signals. 66 4.2. noise-corrupted speech procedure Draw-αi-Sample(i) for frequency k for whichwik 6= 0 do sample θk ∼ Unif [−pi,+pi]; compute υk = cos θk. Compute the sample αi = ∑ kwikυk∑ kwik . Algorithm 2 Drawing a sample from p(αi). 0 1 2 p (υ k ) −1 0 1 υk = cos(θk) Figure 4.1e distribution of υk = cos θk for one frequency k (aŸer Leutnant and Haeb-Umbach 2009b). Algorithm 2 shows how to sample from αi. Samples for θk are drawn independ- ently for all frequencies in one bin.is yields samples for cosθk, which will be called υk. A sample for p(αi) can be drawn by taking the weighted average over these. It is also possible to nd a parametric distribution of υk = cos θk. It can be shown to be (Leutnant and Haeb-Umbach 2009a;b) p(υk) =  1 pi √ 1−υ2k , ∣∣υ2k∣∣ ≤ 1, 0, otherwise. . (4.16) is distribution is pictured in gure 4.1. e distribution ofαi is a weighted average of distributions over theυks in the bin. As the number of frequencies goes up, the central limit theorem means that the dis- 67 chapter 4. noise-robustness 0 1 2 p (α 0 ) −1 −0.5 0 0.5 1 α0 (a) Bin 0 is the narrowest, so that α0 has the least Gaussian-like distribu- tion. 0 1 2 p (α 2 3 ) −1 −0.5 0 0.5 1 α23 (b) Bin 23 is the widest, so that α23 has the most Gaussian-like distribu- tion. Figure 4.2e distribution of αi for dišerent mel-lter channels i (—), and their Gaussian approximations (- - ). tribution of αi becomes closer to a Gaussian (Deng et al. 2004). For lower-frequency bins, the number of frequencies that is summed over is smaller, so the distribution of αi is expected to be further away from a Gaussian. is ešect can be seen in g- ure 4.2, which shows the distributions for two values of i. ese distributions were found by sampling many times from αi using algorithm 2 on the preceding page.e dashed lines showGaussian approximations. Forα0, the Gaussian is least appropriate, but still a reasonable approximation. e covariance of the Gaussian can be set to the secondmoment of the real distri- bution. It can be shown that, again assuming that all spectral coe›cients in one lter bin are equal, that (Leutnant and Haeb-Umbach 2009a) σ2α,i , E { α2i } = ∑ kw 2 ik 2 (∑ kwik )2 . (4.17) is gives values very close to the actual variance ofαi on various subsets of aurora 2 (Leutnant and Haeb-Umbach 2009a). is work will therefore approximate the dis- 68 4.2. 
noise-corrupted speech tribution of αi as a truncated Gaussian with p(αi) ∝  N ( αi; 0, σ 2 α,i ) αi ∈ [−1,+1]; 0 otherwise. (4.18) Evaluating this density at any point requires the normalisation constant 1/ ∫+1 −1N (α; 0, σ2)dα, which could be approximatedwith an approximation to theGaussian’s cumulative dis- tribution function. However, it is straightforward to draw samples from this distribu- tion, by sampling from the Gaussian and rejecting any samples not in [−1,+1]. 4.2.2 Cepstral mismatch function e recognition experiments in this thesis will use cepstral (mfcc) features. To con- vert the log-spectral mismatch function in (4.10b) to the cepstral domain, the feature vectors must be converted to and from log-spectral feature vectors.is uses the dct matrixC (analogously to (2.6)): ys = Cylog. (4.19a) is converts a log-spectral feature vector into a cepstral-domain one. As discussed in section 2.1, the dctmatrix is normally truncated, so that the cepstral feature vectorys is shorter than the log-spectral one ylog. Converting from ys to ylog therefore incurs smoothing of the coe›cients: high-frequency changes from coe›cient to coe›cient disappear.e pseudo-inverse of the truncated dctmatrix is performed by the trun- cated transpose of the matrix, which will be written C−1.e reconstructions of the speech and the additive and convolutional noise are then xlog ' C−1xs; nlog ' C−1ns; hlog ' C−1hs. (4.19b) Substituting (4.10b) and then (4.19b) into (4.19a) yields the cepstral-domainmismatch function: ys ' Clog ( exp ( C−1(xs + hs) ) + exp ( C−1ns ) + 2α ◦ exp( 12C−1(xs + hs + ns))) , f(xs,ns,hs,αs). (4.20) 69 chapter 4. noise-robustness 4.2.3 Mismatch function for dynamic coefficients As section 2.1.2 has discussed, speech recogniser feature vectors normally contain static as well as dynamic features. Dynamics features represent the change over time of the static features. A mismatch function is required for dynamic features to com- pensate model parameters. e dynamics features y∆t are a linear combination of static features in a window. As in section 2.1.2, the window will be assumed ±1 for exposition.e dynamic coe›cients of the corrupted speech are y∆t = D ∆  yst−1 yst yst+1  = D∆  f(xst−1,n s t−1,h s t−1,αt−1, ) f(xst,n s t,h s t,αt, ) f(xst+1,n s t+1,h s t+1,αt+1, )  , (4.21) where D∆ projects an extended feature vector to dynamic features. e specic in- stance of this form where the dynamics are computed as the dišerence between the statics at time t + w and t − w (“simple dišerences”) was used in Gales (1995) for noise-robustness. However, usually an approximation is used: the continuous-time approximation (Gopinath et al. 1995). e approximation assumes that the dynamic coe›cients are actual derivatives with respect to time: y∆t ' ∂f(xst,n s t,h s t,αt, ) ∂t . (4.22) Section 4.4.2 will discuss how this approximation is used for speech recogniser com- pensation with the state of the art vector Taylor series approximation. 4.3 The corrupted speech distribution Section 4.1 has argued that if the models of the clean speech and the noise, and the mismatch function are the correct ones, decoding with the true corrupted-speech dis- tribution would yield the optimal recogniser performance. Later sections will discuss specic methods for model compensation, and model-based feature enhancement. is section discusses how the speech and noise models can be combined to nd a 70 4.3. 
the corrupted speech distribution p (x|θ) p (n) Convolution p (y) Figure 4.3 Model combination: a schematic view. θt−1 θt θt+1 xt−1 xt xt+1 yt−1 yt yt+1 nt−1 nt nt+1 θnt−1 θnt θnt+1 Figure 4.4 Model combination: a speech hmm with states θt and a noise hmm with states θnt generate feature vectors that combine to form observations yt. model for the noise-corrupted speech.is process is called model combination, and is pictured in gure 4.3. is section will assume the speech to be modelled by an hmm. It would be possible to model the noise with a hiddenMarkov model that is in- dependent of the speech.is allows for as much structure for the noise as the speech model does. Figure 4.4 depicts a graphical model that combines a speech hmm with a noise hmm. As in section 2.2, the speech states θt generate feature vectors xt. In this model, the noise states θnt similarly generate feature vectorsnt.ey combine to form observation vector yt. Using this model directly results in a two-dimensional model containing a state 71 chapter 4. noise-robustness θt−1 θt θt+1 xt−1 xt xt+1 yt−1 yt yt+1 nt−1 nt nt+1 Figure 4.5 Model combination with a simplied noise model compared to g- ure 4.4: the noise feature vectors are independent and identically distributed. for every pair of the clean speech and noise state (θt, θnt). For recognition, a “three- dimensional Viterbi decoder” (Varga and Moore 1990) can be used (the third dimen- sion is time). A problem is that the number of states in the resultingmodel explodes to the product of the number of states in the clean speech and noise models. Increasing the number of states is undesirable since it slows down decoding. Another problem is that unlike the speech model, the noise model is usually not known in advance and usually has to be estimated from test data. (Section 4.7 will discuss this in greater detail.) A noise model with fewer parameters can be estimated robustly on less data. It is therefore standard to model the noise as independent and identically distrib- uted. Figure 4.5 depicts that: the noise at each time instance is independent. e number of states in the hmm that results from combining this model for the speech and the noise is the same as of the original, clean speech, hmm. To keep the number of parameters low, the prior over the noise feature vectors nt is usually restricted to be one Gaussian for the additive noise, and the convolutional noise is assumed xed: n ∼ N (µn,Σn) ; h = µh. (4.23) is work will therefore use the noise modelMn = {µn,Σn,µh}. Section 4.7 will discuss the structure of its parameters and how to estimate them. Per time frame, the noise is assumed independent and identically distributed and 72 4.3. the corrupted speech distribution −10 0 10 20 30 y p (y ) Figure 4.6e corrupted speech distribution for speech x ∼ N (10.5, 36) noise n ∼ N (4, 1), and phase factor α = 0. the speech is independent and identically distributed given sub-phone θ. e phase factor is assumed independent of the noise and speech, and independent and identic- ally distributed per time frame, as in (4.15). None of those variables are directly ob- served. For decoding, it is therefore possible to marginalise them out and directly describe the distribution of the observation vector given the sub-phone. is can be performed in either the log-spectral domain (with mismatch function f as in (4.10b)) or the cepstral domain (with mismatch function f as in (4.20)). 
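For the one-dimensional example of figure 4.6 this combination is easy to visualise by pushing samples of the speech and the noise through the scalar mismatch function (with h = 0 and α = 0, as in the figure). The sketch below does only that and is illustrative; the sampling procedure is formalised in section 4.3.1.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 100_000
x = rng.normal(10.5, 6.0, size=L)     # clean speech, x ~ N(10.5, 36)
n = rng.normal(4.0, 1.0, size=L)      # additive noise, n ~ N(4, 1)
y = np.log(np.exp(x) + np.exp(n))     # scalar mismatch function, h = 0, alpha = 0

# The samples define an empirical corrupted-speech distribution; its first two
# moments give a single-Gaussian fit, but the histogram is clearly not Gaussian
# (compare figure 4.6).
print("mean", y.mean(), "std", y.std())
hist, edges = np.histogram(y, bins=60, range=(-10.0, 30.0), density=True)
print("histogram peak", hist.max(), "near y =", edges[np.argmax(hist)])
```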
Given vectors for the speech, noise, and the phase factor, the observation vector is fully determined by the mismatch function. Denoting vectors in the appropriate domain with y, x,n,h,α, the distribution of the observation vector is p(θ)(y) = p(y|θ) = ∫ p(y|x)p(x|θ)dx (4.24a) = ∫ ∫ ∫ p(y|x,n,h)p(h)dhp(n)dnp(x|θ)dx (4.24b) = ∫ ∫ ∫ ∫ δf(x,n,h,α)(y)p(α)dαp(h)dhp(n)dnp(x|θ)dx (4.24c) = ∫ ∫ ∫ ∫ δf(x,n,h,α)(y)p(x,n,α,h|θ)dαdhdndx. (4.24d) where δf(x,n,h,α)(y) is the Dirac delta at f(x,n,h,α).is expression is exact. (It is 73 chapter 4. noise-robustness still valid if α is xed: then p(α) is a Dirac delta at the xed value.) However, since the mismatch function f is non-linear, for non-trivial distributions for x,n,h,α, like Gaussians, the expression does not have a closed form. Figure 4.6 shows a one-dimen- sional example of the corrupted speech distribution for Gaussian speech and noise. e topic of much of this thesis will discuss how to best approximate the distribution. 4.3.1 Sampling from the corrupted speech distribution Expressing the corrupted speech distribution parametrically is normally not possible. However, if distributions for x, n, h, and α are available, then it is straightforward to draw samples y(l) from the distribution. is applies Monte Carlo to the expression in (4.24d). e joint distribution p(x,n,h,α|θ) is replaced by an empirical version by sampling each variable from its prior: x(l) ∼ p(x|θ) ; n(l) ∼ p(n) ; h(l) ∼ p(h) ; α(l) ∼ p(α) . (4.25a) e empirical distribution over these then becomes p˜(x,n,h,α|θ) = 1 L ∑ l δx(l),n(l),h(l),α(l)((x,n,h,α)) . (4.25b) By substituting this in in (4.24d), the empirical distribution over y becomes p˜(y|θ) = 1 L ∑ l δf(x(l),n(l),h(l),α(l))(y) , (4.26a) where through the Dirac delta in (4.24d) the observation samples are dened by the mismatch function applied to the samples of x,n,h,α: y(l) = f ( x(l),n(l),h(l),α(l) ) . (4.26b) Sampling from the corrupted speech distribution will be used in this work to train parametric distributions (dpmc and idpmc) in section 4.4.1, and to examine how well approximated distributions match the actual distribution in section 7.4. 74 4.4. model compensation θt−1 θt θt+1 yt−1 yt yt+1 Figure 4.7 Model compensation: the speech and noise in the graphical model in gure 4.5 have been marginalised out. θt−1 θt θt+1 mt−1 mt mt+1 yt−1 yt yt+1 Figure 4.8 e conventional implementation of model compensation: each com- ponent is compensated separately. 4.4 Model compensation Integrating out the speech and noise, in (4.24), leads to the graphical model in g- ure 4.7. It is a simple hmm.is has the useful property that the structure is the same as that of a normal speech recogniser for clean speech (in gure 2.3 on page 19), with the state output distribution nowmodelling the corrupted speech. In an implementa- tion, this means that if the corrupted speech can be approximated with the same form of model as clean speech uses, the speech parameters in the original speech recog- niser can be replaced by estimated corrupted speech parameters.is is calledmodel compensation. For computational reasons the compensation is normally performed per Gauss- ian mixture component rather than per sub-phone state. Figure 4.8 show a graphical model for this, which has the same structure as the model in gure 2.4 on page 20. 75 chapter 4. noise-robustness Model compensation normally approximates (4.24) with a Gaussian: q(m)(y) = N (y; µ(m)y , Σ(m)y ). 
(4.27) is Gaussian then replaces the clean speech Gaussian in the original speech recog- niser (in (2.12)).is Gaussian,N (µ(m)x ,Σ(m)x ), also gives the clean speech statistics. As section 4.3 has shown, given standard speech and noise distributions, the cor- rupted speech distribution has no closed form. Model compensation methods there- foreminimise the kl divergence between the predicted distributionp(m) in (4.24) and the Gaussian q(m):1 q(m) := argmin q(m) KL(p(m)∥∥q(m)) = argmin q(m) ∫ p(m)(y) logq(m)(y)dy. (4.28) e observation vector y contains static coe›cients ys and dynamic coe›cients y∆. It is oŸen harder to estimate compensation for dynamics. (Indeed, most of chapter 5 will be dedicated to that subject.) Not all compensationmethods that the next sections will discuss also compensate dynamic parameters. ere is a range of model compensation schemes that produce Gaussians, includ- ing the log-normal approximation (Gales 1995), Jacobian compensation (Sagayama et al. 1997), the unscented transformation (Hu and Huo 2006; van Dalen and Gales 2009b), and a piecewise linear approximation to the mismatch function (Seltzer et al. 2010). Only the following schemes will be be discussed in this work. dpmc, which section 4.4.1 will discuss, approximates the predicted distribution p(m) by sampling, and trains the optimal Gaussian on that. A variant is iterative dpmc, which also uses samples, but trains mixtures of Gaussians rather than single Gaussians.e state-of- the-art scheme, which will be the topic of section 4.4.2, applies a vector Taylor series (vts) approximation to the mismatch function, so that the resulting predicted distri- bution becomes Gaussian. Section 4.4.3 will describe a scheme that speeds up com- pensation by nding compensation per base class rather than per component. Finally, 1In chapter 6, this type of schemewill be interpreted as an instantiation of predictivemethods, which approximate a predicted distribution. 76 4.4. model compensation section 4.4.4 will discuss single-pass retraining, which trains a speech recogniser on articially corrupted speech to nd the ideal compensation in some sense. 4.4.1 Data-driven parallel model combination Data-driven parallel model combination (dpmc) (Gales 1995) approximates the distri- butions with samples and applies the correct mismatch function. AGaussian assump- tion is made only when training the corrupted speech distribution on the samples. In the limit, it nds the optimal Gaussian distribution for the corrupted speech. e original algorithm did not use phase factor α; however, the generalisation to include this term is straightforward. dpmc represents the predicted distribution p(m) by an empirical version p˜(m). Section 4.3.1 has discussed how to nd this distri- bution by sampling. e empirical distribution has L delta spikes at positions y(l) (see (4.26a)): p˜(m)(y) = 1 L ∑ l δy(l)(y) . (4.29) e parametric distribution for the corrupted speech that dpmc nds is chosen to minimise the kl divergence with the empirical distribution p˜, approximating (4.28) with q(m) := argmin q KL(p˜(m)∥∥q) = argmax q ∫ p˜(m)(y) logq(y)dy = argmax q ∑ l logq(y(l)). (4.30) is is equivalent to nding the maximum-likelihood setting for q(m) from the sam- ples. Standard dpmc nds a Gaussian distribution for the corrupted speech: q(m)(y) = N ( y; µ (m) y , Σ (m) y ) . (4.31a) 77 chapter 4. 
noise-robustness e maximum-likelihood setting for its mean and covariance parameters are set to: µ (m) y := Ep˜(m){y} = 1 L L∑ l=1 y(l); (4.31b) Σ (m) y := Ep˜ { yyT } − µ (m) y µ (m) y T = ( 1 L L∑ l=1 y(l)y(l) T ) − µ (m) y µ (m) y T , (4.31c) where Ep˜{·} denotes the expectation under p˜. is allows the static parameters to be compensated. In previous work on dpmc a method for compensating the dynamic parameters was proposed (Gales 1995).is approach is only applicable when simple dišerences (linear regression using a window of one time instance leŸ and one right) are used. It uses the mismatch function for simple dišerences from section 4.2.3: by modelling the static coe›cients from the previous time instance to the feature vector, xst−1, the dynamic coe›cients for the noise-corrupted speech can be found using2 y ∆(k) t = f ( x ∆(l) t + x s(l) t−1,n ∆(k) t + n s(l) t−1,h s ) − f ( x s(l) t−1,n s(l) t−1,h s ) . (4.32) However, this form of approximation cannot be used for the linear-regression-based dynamic parameters, which is the form in modern speech recognisers. In the limit as the number of samples goes to innity, dpmc yields the optimal Gaussian parameters given a mismatch function and distributions for the speech, noise, and phase factor. However, as a large number of samples are necessary to ro- bustly train the noise-corrupted speech distributions, it is computationally expensive. Figure 4.9a on the next page shows an example of the corrupted speech distribu- tion and the dpmc approximation in one dimension. Even for the one-dimensional case, the corrupted speech can have a bimodal distribution that is impossible tomodel with one Gaussian, as the gure shows. Iterative dpmc (idpmc) also nds a para- metric distribution that is close to the empirical distribution, but the distribution is a mixture of Gaussians associated with a speech recogniser state rather than a single Gaussian. is allows it to model the multi-modal nature of the corrupted speech 2Normalisation of dynamic parameters is ignored for clarity of presentation. 78 4.4. model compensation −10 0 10 20 30 y p (y ) Real distribution dpmc (a) dpmc, with one component. −10 0 10 20 30 y p (y ) Real distribution idpmcmixture idpmc components (b) idpmc with four components. Figure 4.9dpmc and idpmc in one dimension.e corrupted speech distribution for speech x ∼ N (10.5, 36) and noise n ∼ N (4, 1). 79 chapter 4. noise-robustness distribution.e word “iterative” in the name of the scheme refers to the iterations of expectation–maximisation necessary to train amixture ofGaussians. To draw corrup- ted speech samples, the procedure in section 4.3.1 is applied again, but now the clean speech model is a state-conditional gmm. Analogously to (4.30), the approximation is3 q(θ) := argmin q(θ) KL(p˜(θ)∥∥q(θ)) = argmin q(θ) ∫ p˜(θ)(y) logq(θ)(y)dy = argmin q(θ) ∑ l logq(θ)(y(l)). (4.33) e corrupted speech gmm is then trained on the samples, without reference to the clean models, and is not restricted to have the same number of components as the clean speech gmm. Let the sub-phone-conditional distribution be dened q(θ)(y) , ∑ m∈Ω(θ) pimq (m)(y), (4.34) whereΩ(θ) is the set of components in the mixture of Gaussians for θ. Training mixtures of Gaussians from samples is similar to expectation–maximi- sation. It works iteratively, as follows. 
At iteration k, rst the hidden distribution, the posterior responsibilities of each component, is computed for each sample: ρ(k)(m, l) := pi (k−1) m q (m)(k−1) ( y(l) )∑ m ′∈Ω(θ) pi (k−1) m ′ q (m ′)(k−1) ( y(l) ) . (4.35a) 3In section 6.1.2 this will be analysed in terms of a predictive method, as an approximation of (6.9). 80 4.4. model compensation e new parameters for the mixture distribution are then trained with maximum likelihood using the distribution over the hidden variables. e component weights, means, and covariances for iteration k become pi (k) m := 1 L ∑ l ρ(k)(m, l); (4.35b) µ (m)(k) y := 1∑ l ρ (k)(m, l) ∑ l ρ(k)(m, l)y(l); (4.35c) Σ (m)(k) y := ( 1∑ l ρ (k)(m, l) ∑ l ρ(k)(m, l)y(l)y(l) T ) − µ (m)(k) y µ (m)(k) y T . (4.35d) Initialisation for training Gaussian mixture models is oŸen a problem. However, in this case, dpmc provides a sensible initial setting with the number of components equal to the original state-conditional mixture. To increase the number of compon- ents, “mixing up” can be used, which progressively increases the number of compon- ents. e component with the largest mixture weight is split into two components with dišerent ošsets on the means, and a few iterations of expectation–maximisation training are run. is is repeated until the mixture has the desired number of com- ponents. Figure 4.9 on page 79 shows how the approximation becomesmore accurate when the number of Gaussians increases. In the limit as the number of GaussiansM goes to innity, and the components are trained well, the mixture of Gaussians becomes equal to the real distribution. However, this requires each component to be trained well, for which it needs suf- cient samples. When increasing the number of componentsM, the number of total samples Lmust increase by at least the same factor. An iteration of expectation–maxi- misation takes O(ML) time, so in ešect this is O(M2). Additionally, the number of iterations of mixing up, which needs a number of iterations of expectation–maximi- sation at each step, increases linearly withM. In practice, then, the time complexity of idpmc is at leastO(M3).is becomes impractical very quickly, especially since com- pared to gure 4.9, which shows only a one-dimensional example, many components 81 chapter 4. noise-robustness may be required to model the distribution well in a high-dimensional space. 4.4.2 Vector Taylor series compensation Vector Taylor series (vts) compensation (Moreno 1996; Acero et al. 2000; Deng et al. 2004) is a standard method that is faster than dpmc. Rather than approximating the noise-corrupted speech distribution directly, it applies a per-component vector Taylor series approximation to the mismatch function f in (4.20).e most important result of this is that, given Gaussians for the clean speech, the noise, and the phase factor, the predicted noise-corrupted speech also becomesGaussian. vts compensation does not aim to minimise any criterion, like the kl divergence. e rst-order vector Taylor series approximation to themismatch function fwith expansion point (xs0,ns0,hs0,α0) is f (m) vts (x s,ns,hs,α) = f(xs0,n s 0,h s 0,α0) + J (m) x (x s − xs0) + J (m) n (n s − ns0) + J (m) h (h s − hs0) + J (m) α (α− α0), (4.36a) where the Jacobians for the clean speech, additive noise, and phase factor are J (m) x = ∂ys ∂xs ∣∣∣∣ xs0,n s 0,h s 0,α0 ; J (m) n = ∂ys ∂ns ∣∣∣∣ xs0,n s 0,h s 0,α0 ; J (m) h = ∂ys ∂hs ∣∣∣∣ xs0,n s 0,h s 0,α0 ; J (m) α = ∂ys ∂α ∣∣∣∣ xs0,n s 0,h s 0,α0 . 
(4.36b) Appendix c.1 gives expressions for these in terms of the expansion points. If the ex- pansion point of the speech is much larger than that of the noise, the speech will dom- inate, so that the Jacobian for the speech J(m)x (in (c.3) and (c.5)) will tend to I. e Jacobian for the noise J(m)n will then tend to 0. Conversely, under high noise condi- tions, J(m)x will tend to 0, and J(m)n will tend to I. An aspect that oŸen goes unmentioned but is of importance is compensation of dynamic parameters.is usually uses the continuous-time approximation (Gopinath et al. 1995), discussed in section 4.2.3, which approximates the dynamic parameters. However, to model the inžuence of the phase factor α on the dynamics, the dynamic 82 4.4. model compensation part of its covariance, Σ∆α is required. Rather than nding it, previous work has as- sumed the phase factor 0 (Acero et al. 2000), or xed to a dišerent value, like 1 (Liao 2007) or 2.5 (Li et al. 2007). Assuming α xed causes α∆ to be zero by denition. A phase factor distribution has previously only been used for feature enhancement (Deng et al. 2004), where no distribution over dynamics is required. e following will therefore assume that α = 0. As discussed in section 2.1.2, when extracting feature vectors from audio, deltas and delta-deltas are computed from a window of static feature vectors. However, they aim to indicate time derivatives of the static coe›cients.e continuous-time approx- imation (see section 4.2.3) assumes that they are in fact time derivatives: y∆t ' dys dt ∣∣∣∣ t ; x∆t ' dxs dt ∣∣∣∣ t ; n∆t ' dns dt ∣∣∣∣ t ; h∆t ' dhs dt ∣∣∣∣ t . (4.37) Combining this approximation and the vts approximation in (4.36a), the dynamic coe›cients become y∆t ' dys dxs dxs dt ∣∣∣∣ t + dys dns dns dt ∣∣∣∣ t + dys dhs dhs dt ∣∣∣∣ t ' J(m)x x∆t + J(m)n n∆t + J(m)h h∆t . (4.38) is uses the same Jacobians as the linearisation of the statics.e analogous expres- sion for the delta-deltas yields y∆ 2 t ' J(m)x x∆ 2 t + J (m) n n ∆2 t + J (m) h h ∆2 t . (4.39) For clarity of notation, the following will only write rst-order dynamics. Having linearised the inžuence of both the static and the dynamic coe›cients, the observation feature vector that results can be written as a mismatch function for static and dynamic coe›cients: f(m)vts (x,n,h). It applies (4.36a) for the static coe›- cients and (4.38) for the dynamic coe›cients. f(m)vts is a sum of linearly transformed independently Gaussian distributed variables.ese are also Gaussian. For example, for the clean speech statics: J (m) x (x s − xs0) ∼ N ( J (m) x (µ s(m) x − x s 0), J (m) x Σ s(m) x J (m) x T ) , (4.40) 83 chapter 4. noise-robustness −10 0 10 20 30 y p (y ) Real distribution vts Figure 4.10 vts compensation in one dimension.e corrupted speech distribu- tion for speech x ∼ N (10.5, 36) and noise n ∼ N (4, 1). and similar for the dynamics, and for the noise. e linearised mismatch function f(m)vts replaces f in the delta function in (4.24c). e approximation for y then is the sum of the mismatch function at the expansion point and the two Gaussians (this assumes the convolutional noise h is xed): q(m)(y) := ∫ ∫ δ f (m) vts (x,n,µh) (y)N (n; µn, Σn) dnN ( x; µ (m) x , Σ (m) x ) dx = N (y; µy, Σy). (4.41) e parameters µy,Σy consist of parameters for the static and dynamic coe›cients. 
Compensation for the static parameters applies (4.36a): µ s(m) y := f(x s 0,n s 0,µh, 0) + J (m) x (µ s(m) x − x s 0) + J (m) n (µ s n − n s 0); (4.42a) Σ s(m) y := J (m) x Σ s(m) x J (m) x T + J (m) n Σ s nJ (m) n T , (4.42b) and compensation for dynamics applies (4.38): µ ∆(m) y := J (m) x µ ∆(m) x + J (m) n µ ∆ n ; (4.42c) Σ ∆(m) y := J (m) x Σ ∆(m) x J (m) x T + J (m) n Σ ∆ n J (m) n T . (4.42d) Because in the mel-cepstral domain the Jacobians are non-diagonal, the covariance matrices Σs(m)y and Σ∆(m)y are full even when the covariances of the clean speech and 84 4.4. model compensation the noise are assumed diagonal.e expansion points are usually set to the means of the distributions for the clean speech and the additive noise, so that the terms (µs(m)x − xs0) and (µsn − ns0) in (4.42a) vanish.e mean of the statics in (4.42a) then becomes µ s(m) y := f ( µ s(m) x ,µ s n,µ s h, 0 ) . (4.43) Figure 4.10 on the preceding page illustrates a vts approximation to the corrupted speech distribution. e approximation is reasonable, but not the same as the max- imum likelihood Gaussian in gure 4.9a on page 79. e parameters for the feature vector with statics and dynamics in (4.41) then are the concatenation of the parameters of the parts in (4.42): µ (m) y :=  µs(m)y µ ∆(m) y  ; Σ(m)y := Σs(m)y 0 0 Σ ∆(m) y  . (4.44) SinceΣs(m)y andΣ∆(m)y are full, the overall corrupted speech covarianceΣ(m)y is block- diagonal. However, the block-diagonal structure is not normally applied for decoding, because of two problems. First, it is computationally expensive. Second, the continu- ous-time approximation for the dynamic parameters does not yield accurate block- diagonal compensation (section 8.1.1.1 will show this). erefore, the standard form of vts compensation is q(m) := N (y; µ(m)y , diag (Σ(m)y )), (4.45) where diag(·) denotes matrix diagonalisation. e most obviously useful ešect of linearising the mismatch function is that the corrupted speech turns out Gaussian.ere are also other advantages that arise from xing the expansion points, so that the relationship between speech, noise, and cor- rupted speech becomes linear per component. e means of the noise model can be estimatedwith a xed-point iteration (Moreno 1996) and the variance with a gradient- descent-based scheme (Liao 2007). Alternatively, since the rst-order approximation makes the noise, speech, and corrupted speech jointly Gaussian, an em approach (Kim et al. 1998; Frey et al. 2001b; Kristjansson et al. 2001) can be used. Section 4.7 will give 85 chapter 4. noise-robustness more details. Also, it is possible to use adaptive training with it (Liao and Gales 2007; Kalinli et al. 2009).ese aspects make vts compensation very useful in practice. Compared to using a distribution over α, assuming it constant has two ešects on the approximated distribution of the statics. One is that the term J(m)α ΣαJ (m) α T drops out from the covariance expression in (4.44). Since the entries of the phase factor covariance are small, this decreases the variances only slightly. Since Σα is constant across components and J(m)α changes only slightly between adjacent components, dis- crimination is hardly ašected. If α is equal to its expected value, 0, then the mean of the compensated Gaussian does not change compared to when α is assumed Gaussian. If α is set to a higher value, the mean is overestimated. Also, Jacobians J(m)x and J(m)n move closer to 12I (see appendix c.1 for the expressions). 
In practice,α is oŸen assumed xed but the noise model is estimated.is should subsume many of the ešects that using a phase factor distribution would have had. is includes a wider compensated Gaussian and overestimation of the mode. ere are other ways of approximating the corrupted speech distribution with a Gaussian. One possibility is to approximate the mismatch function with a second- order vector Taylor series approximation (Stouten et al. 2005; Xu and Chin 2009b), and estimate the Gaussian to match the second moment of the resulting distribution. Another approximation that has recently attracted interest is the unscented transform- ation (Julier and Uhlmann 2004). is approximation draws samples like dpmc, but it the samples are chosen deterministically, and if the mismatch function were linear, then it would yield the exact distribution, just like vts. It has been applied to feature enhancement (Shinohara and Akamine 2009), without compensation for dynamics, and model compensation (Li et al. 2010), with the continuous-time approximation. Yet another approach approximates the mismatch function with a piecewise linear approximation (Seltzer et al. 2010), the parameters of which are learned from data. Whatever the dišerences between how these methods deal with the mismatch func- tion, they all approximate the corrupted speech distribution with a diagonal-covari- 86 4.4. model compensation θt−1 θt θt+1 xt−1 xt xt+1 yt−1 yt yt+1 Figure 4.11 Joint uncertainty decoding: the noise is subsumed in p(y|x). ance Gaussian. When applying maximum-likelihood estimation to estimate the noise model (see section 4.7), the dišerences between these compensation methods come down to slight variations in the parameterisation. For example, Li et al. (2010) nds that compensation with the vts and the unscented transformation yields the same performance when parameters for both are correctly optimised. Rather than looking into all these variations, this thesis will look into full-covariance Gaussian (chapter 5) and non-Gaussian (chapter 7) distributions. 4.4.3 Joint uncertainty decoding e model compensation schemes discussed so far incur considerable computational cost, since they compensate components individually. Away of overcoming this prob- lem is to apply compensation at a dišerent level. It is possible to describe themismatch between the clean speech and the corrupted speech as a joint Gaussian of the speech and the observation. is is called joint uncertainty decoding (jud) (Liao 2007). Fig- ure 4.11 contains a graphical model for this.e inžuence of the noise is subsumed in the link between x and y.e joint distribution is dened x y  ∼ N µx µy  ,  Σx Σxy Σyx Σy  . (4.46) If stereo data with parallel clean speech and corrupted speech is available, then this distribution can be trained directly (Neumeyer and Weintraub 1994; Moreno 1996). 87 chapter 4. noise-robustness However, this means the scheme cannot adapt to new noise environment. More gen- eral schemes estimate a noise model and apply a form of model compensation. Most of the parameters of this joint distribution can be found with a model compensation scheme that takes a clean speech Gaussian and produces a corrupted speech Gauss- ian, like vts or dpmc. e cross-covariance Σyx, however, needs an extension. Sec- tion 4.4.3.1 will discuss how to estimate a joint distribution with vts and dpmc. e only approximation that this requires is the one that the original model compensa- tion method applies. 
However, no additional approximation is necessary to nd the corrupted-speech distribution for one speech recogniser component: it drops out as Gaussian.e following derivation will show that. A useful property of a joint Gaussian is that the conditional distribution of one variable given the other one is also Gaussian.is is a known result, which is derived in appendix a.1.3. e distribution of the observation given the clean speech p(y|x) therefore becomes p(y|x) = N ( y; µy + ΣyxΣ −1 x ( x− µx ) , Σy − ΣyxΣ −1 x Σxy ) = N ( y; µy +A −1 jud ( x− µx ) , Σy −A −1 judΣxy ) = ∣∣Ajud∣∣N(Ajudy; Ajudµy + x− µx, AjudΣyATjud − Σx) , (4.47a) with Ajud = ΣxΣ −1 yx . (4.47b) For the component-conditional distribution of the corrupted speech, joint uncertainty decoding uses the expression for model compensation given in (4.24a).e environ- ment model p(y|x) is replaced by the one in (4.47a).e component distribution for the corrupted speech that results from convolving this conditional with the compon- ent distribution of the clean speech is Gaussian. It can be written as a base-class-spe- cic transformation of the clean speech parameters: an a›ne transformation of the observation and a bias on the covariance: q(m)(y) , ∫ p(y|x)p(x|m)dx 88 4.4. model compensation = ∫ N ( y; µy +A −1 jud ( x− µx ) , Σy − Σx ) · N ( x; µ (m) x , Σ (m) x ) dx = ∫ ∣∣Ajud∣∣ · N(Ajudy; Ajudµy + x− µx, AjudΣyATjud − Σx) · N ( x; µ (m) x , Σ (m) x ) dx = ∣∣Ajud∣∣ · N(Ajudy+ µx −Ajudµy; µ (m) x , Σ (m) x +AjudΣyA T jud − Σx ) = ∣∣Ajud∣∣ · N(Ajudy+ bjud; µ(m)x , Σ(m)x + Σbias), (4.48a) with Ajud = ΣxΣ −1 yx ; (4.48b) bjud = µx −Ajudµy; (4.48c) Σbias = AjudΣyA T jud − Σx. (4.48d) If the joint distribution has full covariance, Ajud and Σbias are also full. However, if Σbias is full, the covariance matrices used for decoding become full as well, even if they were diagonal before being compensated.is means that decoding is slower. Conceivably,Σbias could simply be diagonalised, but this yields bad speech recogniser performance (Liao and Gales 2005). A solution is to nd a joint distribution with di- agonal covariances and cross-covariances, so thatAjud andΣbias drop out as diagonal. is is the approach taken in Liao and Gales (2006). Section 6.3 will discuss how to generate fullAjud and Σbias and convert them into a form that is faster in decoding. Normally, joint uncertainty decoding associates every componentm of the speech recogniser hmm with one base class r, each with a dišerent Gaussian joint distribu- tion. Note that a regression class tree is not necessary if the joint Gaussians are es- timated with a compensation method, because the number of parameters in the noise model does not varywhen the regression class tree is expanded.e speech recogniser components in one base class are close in acoustic space. Joint uncertainty decoding therefore can be viewed as partitioning the acoustic space and approximating the en- vironment properties for each partition. 89 chapter 4. noise-robustness Since the model compensation scheme is applied to all components in one base class at once, it is faster than applying model compensation separately per compon- ent. Varying the number of base classes gives a trade-oš between computational cost and accuracy. If every base class contains just one speech recogniser component, then the number of components is equal to the number of base classes, and the compens- ation is exactly equal to if the compensation scheme had been applied directly to the components. 
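The following NumPy sketch makes the decode-time form in (4.48a) concrete: it computes the base-class transformation from a joint Gaussian and evaluates the compensated log-likelihood of one component. The toy parameters and names are illustrative, and numerical safeguards are omitted.

```python
import numpy as np

def gaussian_logpdf(v, mean, cov):
    """log N(v; mean, cov) for a full covariance matrix."""
    d = len(mean)
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, v - mean)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + z @ z)

def jud_parameters(mu_x, mu_y, Sigma_x, Sigma_y, Sigma_yx):
    """Base-class JUD parameters (4.48b)-(4.48d) from the joint Gaussian (4.46)."""
    A = Sigma_x @ np.linalg.inv(Sigma_yx)          # A_jud
    b = mu_x - A @ mu_y                            # b_jud
    Sigma_bias = A @ Sigma_y @ A.T - Sigma_x       # covariance bias
    return A, b, Sigma_bias

def jud_component_loglik(y, A, b, Sigma_bias, mu_xm, Sigma_xm):
    """log q^(m)(y) = log|A_jud| + log N(A_jud y + b_jud; mu_x^(m), Sigma_x^(m) + Sigma_bias)."""
    _, log_abs_det = np.linalg.slogdet(A)
    return log_abs_det + gaussian_logpdf(A @ y + b, mu_xm, Sigma_xm + Sigma_bias)

# Toy two-dimensional base class:
mu_x, mu_y = np.array([1.0, 0.0]), np.array([2.0, 0.5])
Sigma_x = np.array([[4.0, 1.0], [1.0, 3.0]])
Sigma_y = Sigma_x + np.eye(2)
Sigma_yx = 0.8 * Sigma_x
A, b, Sigma_bias = jud_parameters(mu_x, mu_y, Sigma_x, Sigma_y, Sigma_yx)
print(jud_component_loglik(np.array([2.5, 1.0]), A, b, Sigma_bias,
                           np.array([1.2, -0.3]), np.diag([2.0, 1.5])))
```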
4.4.3.1 Estimating the joint distribution

Any model compensation scheme that takes a clean speech Gaussian and produces a corrupted speech Gaussian, like vts or dpmc, can be used to find most of the parameters of the joint distribution in (4.46). The parameters of the clean speech µx, Σx can be found from the clean training data. They can be derived from the distribution of all components in a base class, which is the obvious choice for joint uncertainty decoding. The parameters of the corrupted speech that a model compensation method finds give µy and Σy. That only leaves Σyx = Σxy^T to be estimated.

With dpmc
Finding the joint distribution with dpmc (Xu et al. 2006) extends the algorithm for model compensation with dpmc straightforwardly. The following only considers static parameters; section 5.4 will find a joint distribution over statics and dynamics. Section 4.3.1 has discussed how to draw samples y(l) from the noise-corrupted speech distribution. To train the joint distribution, sample pairs of both the clean speech and the corrupted speech are retained. The empirical distribution has L delta spikes at positions (x(l), y(l)), analogous to (4.29):

(x^{(l)}, y^{(l)}) \sim p^{(m)}(x, y); \qquad \tilde{p}(x, y) = \frac{1}{L} \sum_l \delta_{(x^{(l)}, y^{(l)})}. (4.49)

Just like in (4.31), the parameters are set to maximise the likelihood of the resulting distribution on the samples. However, for the joint distribution in (4.46), the mean and covariance parameters are set at once to the mean and covariance of the tuple (x, y) under the empirical distribution:

\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix} := \mathbb{E}_{\tilde{p}} \left\{ \begin{bmatrix} x \\ y \end{bmatrix} \right\} = \frac{1}{L} \sum_{l=1}^{L} \begin{bmatrix} x^{(l)} \\ y^{(l)} \end{bmatrix}; (4.50a)

\begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{bmatrix} := \mathbb{E}_{\tilde{p}} \left\{ \begin{bmatrix} x \\ y \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}^T \right\} - \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix} \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}^T = \frac{1}{L} \sum_{l=1}^{L} \begin{bmatrix} x^{(l)} \\ y^{(l)} \end{bmatrix} \begin{bmatrix} x^{(l)} \\ y^{(l)} \end{bmatrix}^T - \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix} \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}^T. (4.50b)

With vts
As in section 4.4.2, the vector Taylor series approximation by itself only finds compensation for the static parameters. For the dynamic parameters, an additional approximation, the continuous-time approximation, is necessary. (Section 5.4 will estimate the joint distribution without that approximation.) How to estimate the corrupted speech parameters µy, Σy was discussed in section 4.4.2, in (4.44). Since the speech is independent of the noise, and the mismatch function in (4.36a) is linearised, the cross-covariance of the speech and the observation does not contain any noise terms, so that (Moreno 1996; Xu et al. 2006):

\Sigma^s_{yx} \simeq \mathbb{E}\left\{ \left( f_{\text{vts}}(x^s, n^s, h^s, 0) - \mu^s_y \right) \left( x^s - \mu^s_x \right)^T \right\} = \mathbb{E}\left\{ J_x \left( x^s - \mu^s_x \right) \left( x^s - \mu^s_x \right)^T \right\} = J_x \Sigma^s_x. (4.51a)

The cross-covariance for the dynamics follows in the same manner from the approximation of the corrupted speech dynamic coefficients in (4.37):

\Sigma^{\Delta}_{yx} \simeq \mathbb{E}\left\{ \left( y^{\Delta} - \mu^{\Delta}_y \right) \left( x^{\Delta} - \mu^{\Delta}_x \right)^T \right\} = \mathbb{E}\left\{ J_x \left( x^{\Delta} - \mu^{\Delta}_x \right) \left( x^{\Delta} - \mu^{\Delta}_x \right)^T \right\} = J_x \Sigma^{\Delta}_x. (4.51b)

Similarly to (4.44), the cross-covariance of the feature vector with statics and dynamics then is the concatenation of the parameters of the parts:

\Sigma_{yx} := \begin{bmatrix} \Sigma^s_{yx} & 0 \\ 0 & \Sigma^{\Delta}_{yx} \end{bmatrix}. (4.51c)

Alternatively, it is possible to use different approximations to find the joint distribution. This includes the methods that were mentioned at the end of section 4.4.2: second-order vts (Xu and Chin 2009b) and the unscented transformation (Xu and Chin 2009a); the trained piecewise linear approximation should also be applicable to estimating the joint distribution. However, the final shape of the joint distribution is still the same, with the diagonal blocks. When the noise is estimated, this limits the amount of improvement different techniques can yield.
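A sketch of the dpmc-style estimate in (4.49)–(4.50): paired clean and corrupted samples are drawn and the joint Gaussian is moment-matched. Convolutional noise and the phase factor are dropped for brevity; the function and variable names are illustrative.

```python
import numpy as np

def estimate_joint_dpmc(mu_x, Sigma_x, mu_n, Sigma_n, mismatch, L=50_000, seed=0):
    """Moment-match the joint Gaussian of (x, y) from paired samples, as in (4.50)."""
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(mu_x, Sigma_x, size=L)
    n = rng.multivariate_normal(mu_n, Sigma_n, size=L)
    y = mismatch(x, n)
    xy = np.hstack([x, y])                          # stacked samples (x^(l), y^(l))
    mean = xy.mean(axis=0)
    cov = np.cov(xy, rowvar=False, bias=True)       # second moment minus outer product of means
    d = len(mu_x)
    return (mean[:d], mean[d:],                     # mu_x, mu_y
            cov[:d, :d], cov[d:, d:], cov[d:, :d])  # Sigma_x, Sigma_y, Sigma_yx

# Element-wise log-spectral mismatch with h = 0, alpha = 0:
log_mismatch = lambda x, n: np.log(np.exp(x) + np.exp(n))

mu_x, Sigma_x = np.array([10.0, 8.0]), np.diag([4.0, 3.0])
mu_n, Sigma_n = np.array([6.0, 7.0]), np.diag([1.0, 1.0])
mu_x_hat, mu_y, Sx, Sy, Syx = estimate_joint_dpmc(mu_x, Sigma_x, mu_n, Sigma_n, log_mismatch)
print(mu_y, "\n", Syx)
```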
Section 5.4 will discuss how to estimate a full-covariance joint distribution. 4.4.4 Single-pass retraining Single-pass retraining (Gales 1995) is a technique that takes a speech recogniser trained on clean speech and retrains it for corrupted speech. It requires stereo data, a parallel corpus of clean speech and exactly the same speech in the noisy acoustic environment. Stereo data is only available in laboratory conditions, for example, when noise is ar- ticially added to clean speech. Also, it is unlikely that in practical situations clean speech data is available but not preferred over noise-corrupted speech data as input to a speech recogniser. Single-pass retraining is therefore not a practical technique itself, but one that more practical compensation techniques can be compared to. Single-pass retraining trains a speech recogniser with expectation–maximisation as normally, but between the expectation step and the maximisation step of the last iteration it replaces the clean audio with articially corrupted audio that is exactly aligned. One utterance in the stereo data will be denoted (X ,Y), with X the clean data, and Y the corrupted data. e empirical distribution representing the whole training data will be denoted with joint distribution p˜(X ,Y). e expectation step of em yields a distributionρ(U |X ), based on the clean speech, as in normal training. However, the speech recogniser that is trained is not a dis- tribution over clean observations qUX , but a distribution over the corrupted speech observations qUY . Just as in normal speech recogniser training, this distribution fac- torises into a distribution over the hidden variables qU and a distribution over the 92 4.4. model compensation observations given the distribution of the hidden variables, here qY |U . e optimisation of the former has the same ešect as normal training, in (2.27b), since the observations do not directly enter into the equation. e latter, however, maximises the likelihood of Y rather than X . Adapted from (2.27), then, the optim- isation in the last iteration, K, is given by q (K) U := argmax qU ∫ p˜(X ,Y) ∫ ρ(U |X ) logqU (U)dUd(X ,Y); (4.52a) q (K) Y |U := argmax qY|U ∫ p˜(X ,Y) ∫ ρ(U |X ) logqY |U (Y |U)dUd(X ,Y). (4.52b) Again, p˜(X ) is represented by component–time occupanciesγ(m)t , found on the clean speech.e output distributions q(m) are then trained on the corresponding corrup- ted speech vectors (similar to (2.32)): q (k) Y |U := argmax qY|U ∫ p˜(X ,Y) ∑ m TX∑ t=1 γ (m) t logq (m)(yt)d(X ,Y). (4.53) It is interesting to relate this tomodel compensation.e clean training data p˜(X ) represents samples from the real distribution of the speech.e corrupted utteranceY corresponding to each clean speech utteranceX , found by articially adding the noise, is drawn from a distribution p(Y |X ) representing the mismatch. Single-pass retrain- ing therefore ešectively trains a speech recogniser on a non-parametric distribution of the noise-corrupted speech that results from combining non-parametric distribu- tions for the clean speech and the noise. is is in contrast with dpmc, discussed in section 4.4.1, which, even though it represents the corrupted speech with an empir- ical distribution as an intermediate step, assumes parametric distributions for both clean speech and noise, and a known mismatch function. If there is enough data to train the distributions robustly, single-pass retraining yields the optimal parameters for component distributions. 
A single-pass retrained speech recogniser therefore reflects the corrupted data better than model compensation methods that estimate the same form of component distribution could, because those derive from parametric representations of the speech and the noise. This work will therefore compare model compensation methods against a single-pass retrained speech recogniser, which gives the ideal compensation for a given form of output distribution.

Single-pass retraining is normally applied only in the last iteration of speech recogniser training. With additional training iterations, on just the noisy data, the component–time alignments would shift, and the state model of the speech and that of the noise would cease to be independent. In model compensation, the speech and noise models are assumed independent (see figure 4.4). When training reference recognisers, this work will therefore not apply additional training iterations after single-pass retraining.

4.4.4.1 Assessing the quality of model compensation

Normally, word error rates are used to evaluate the performance of speech recognition systems. However, this does not allow a detailed assessment of which aspects of the compensation process are working well and which poorly. An alternative approach is to compare compensated systems' distributions to their ideal counterparts. A well-known tool for estimating the distance between two distributions is the Kullback–Leibler (kl) divergence, discussed in appendix a.2. The only work that has used the kl divergence to investigate the performance of model compensation methods is Gales (1995). However, it will turn out that the kl divergence can help to assess compensation quality with a much finer granularity than word error rates alone.

A useful comparison method is the occupancy-weighted average of the component-for-component kl divergence of the compensated system to the single-pass retrained system (Gales 1995). If p^{(m)} is a Gaussian of the single-pass retrained system, and q^{(m)} is the corresponding Gaussian of the compensated system, then this metric D is

D \triangleq \frac{1}{\sum_m \gamma^{(m)}} \sum_m \gamma^{(m)} \, \mathcal{KL}\left( p^{(m)} \,\middle\|\, q^{(m)} \right). (4.54)

γ^{(m)} is the occupancy of component m in the last training iteration, for both the compensated and the single-pass retrained system. Apart from being an obvious measure of compensation quality over a whole speech recognition system, it is also proportional to the expression (in (6.7)) that this thesis will analyse predictive methods, such as model compensation methods, as aiming to optimise.

Another useful attribute of this form of metric is pointed out in appendix a.2. Depending on the structure of the covariance matrices, it is possible to assess the compensation per coefficient or block of coefficients. When diagonal covariance matrices are used, each dimension may be considered separately. This allows the accuracy of the compensation scheme to be assessed for each dimension. Similarly, block-diagonal compensation can be examined per block of coefficients.

4.5 Observation-dependent methods

Section 4.4 has discussed model compensation methods, which approximate the corrupted speech with a parameterised distribution. This section will describe two methods that use a different approach: they start the computation only when the observation has been seen. The Algonquin algorithm (section 4.5.1) extends the vts approximation by iteratively updating the expansion point.
It comes up with a dišerent Gaussian for each clean speech Gaussian for each observation. It is also possible to approximate the integral over the speech and noise using a piecewise linear approx- imation (section 4.5.2). 4.5.1 The Algonquin algorithm e “Algonquin” algorithm (Frey et al. 2001a; Kristjansson and Frey 2002) is an exten- sion to vts compensation, which updates the expansion point given the observation. is thesis will view the algorithm from a dišerent slant than the original presenta- tion. e original presentation extended model-based feature enhancement to per- form variational inference, which section 4.6.2 will discuss.is section, on the other hand, will discuss how Algonquin iteratively updates its approximation to the corrup- ted speech distribution for one speech Gaussian, in line with the rest of this chapter. At the same time, the algorithm updates an approximation to the the posterior of the 95 chapter 4. noise-robustness clean speech and the noise, which will come in useful in section 7.2. e most important conceptual addition of the Algonquin algorithm is that it takes into account the observation vector. In the following discussion, when an ac- tual observation is meant it will be indicated withyt. Whereas vts linearises the mis- match function at the expansion point given by themeans of the prior distributions of the speech and the noise, the Algonquin algorithm updates the expansion point iter- atively, nding the mode of the posterior of the speech and the noise. It can therefore be seen as an iterative approach to nding the Laplace approximation to the posterior. For the presentation of the Algonquin algorithm, the convolutional noise will be assumed zero and not written. (As long as it is assumed Gaussian, as in the original paper, the extension is trivial.) Feature vectors will be written x,n,y, but they can stand for any type of feature vectors that there exists a linearisable mismatch func- tion for.e original presentation assumed just static coe›cients, but a feature vector with statics and dynamics can be used. Section 5.3.4 will introduce a version that uses “extended” feature vectors. e Algonquin algorithm uses a approximation of the environment model com- pared to the one discussed in chapter 4.2, in (4.24c).e inžuence of the phase factor on the observation is captured by a Gaussian around the mode of the distribution for given x and n.us, that distribution is approximated as p(y|x,n) = ∫ δf(x,n,α)(y)p(α)dα ' N (y; f(x,n), Ψ) , (4.55) where f(x,n) , f(x,n, 0), andΨ is the xed covariance that models the uncertainty around the mismatch function. Figure 4.12 on the next page shows a one-dimensional example. e prior of the speech and the noise is given in the leŸ panel.eir posterior distribution aŸer having seen an observation yt is in the right panel.is posterior will be approximated with a Gaussian centred on its mode. To deal with the non-linearity in the mismatch function, it is linearised (as in vts compensation).eAlgonquin algorithm iteratively updates the linearisation point to themode of the posterior distribution.e linearisation point in iterationk is denoted 96 4.5. observation-dependent methods 3 4 5 6 7 8 9 10 11 n 3 5 7 9 11 x (a)e prior distribution p(x, n). 3 4 5 6 7 8 9 10 11 n 3 5 7 9 11 x (b) e posterior p(x, n|y = 9). Figure 4.12e Algonquin-derived distribution of the clean speech and noise for x ∼ N (10, 1); n ∼ N (9, 2); ψ = 0.04; y = 9. 
by ( x (k) 0 ,n (k) 0 ) .e linearised mismatch function in iteration k is f (k) vts(x,n) = f ( x (k) 0 ,n (k) 0 ) + J (k) x ( x− x (k) 0 ) + J (k) n ( n− n (k) 0 ) , (4.56) where the Jacobians are J (k) x = dy dx ∣∣∣∣ x (k) 0 ,n (k) 0 ; J (k) n = dy dn ∣∣∣∣ x (k) 0 ,n (k) 0 . (4.57) For the rst iteration k = 0, the linearisation point is set to (µx,µn), so that the mis- match function is equivalent to the one used in vts compensation in (4.36a), leaving out the phase factor. Using the linearised mismatch function in (4.56), the speech, the noise, and the observation become jointly Gaussian: q(k)   x n y   = N   x n y  ;  µx µn µ (k) y  ,  Σx 0 Σ (k) xy 0 Σn Σ (k) ny Σ (k) yx Σ (k) yn Σ (k) y   . (4.58) 97 chapter 4. noise-robustness Note that the parameters of the marginal of x and n (µx, µn, Σx, and Σn) are given by the prior and do not depend on the iteration k. On the other hand, the covari- ance of the observation and the cross-covariances are found through the linearised mismatch function, which changes with every iteration.ose parameters are found as follows. Using the linearised mismatch function and assuming the error N (0,Ψ) of the mismatch function, the distribution of the corrupted speech is Gaussian y ∼ N (µ(k)y ,Σ(k)y ) with parameters similar to those for vts compensation in (4.42): µ (k) y := E { f (k) vts(x,n) } = f ( x (k) 0 ,n (k) 0 ) + J (k) x ( µx − x (k) 0 ) + J (k) n ( µn − n (k) 0 ) ; (4.59a) Σ (k) y := E {( f (k) vts(x,n) − µy )( f (k) vts(x,n) − µy )T} +Ψ = E { J (k) x ( x− µx )( J (k) x ( x− µx ))T + J (k) n ( n− µn )( J (k) n ( n− µn ))T} +Ψ = J (k) x E {( x− µx )( x− µx )T} J (k) x T + J (k) n E {( n− µn )( n− µn )T} J (k) n T +Ψ = J (k) x ΣxJ (k) x T + J (k) n ΣnJ (k) n T +Ψ. (4.59b) e cross-covariance between the speech and the observation and between the noise and the observation can be derived similarly: Σ (k) yx := E {( f (k) vts(x,n) − µy )( x− µx )T} = E {( J (k) x ( x− µx ))( x− µx )T} = J (k) x E {( x− µx )( x− µx )T} = J (k) x Σx; (4.60a) Σ (k) yn := J (k) n Σn. (4.60b) Note that the original implementation (Frey et al. 2001a; Kristjansson 2002) diagon- alised the covariances and the cross-covariances.is requires that the Jacobians are diagonalised too, otherwise the relationship between speech, noise, and observation is invalid. Assuming the joint distribution of the speech, noise, and observations in (4.58), the posterior distribution of the speech and the noise conditioned on an observa- tion yt follows from a standard result (derived in appendix a.1.3, (a.10c)). is ap- 98 4.5. observation-dependent methods proximation to the posterior distribution will be written q(k): q(k)  x n  ∣∣∣∣∣∣yt  = N(  x n  ; µx µn + Σ(k)xy Σ (k) ny Σ(k)y −1(yt − µ(k)y ), Σx 0 0 Σn − Σ(k)xy Σ (k) ny Σ(k)y −1[Σ(k)yx Σ(k)yn ] ) . (4.61) Note that the speech and noise priors are not correlated, but the posteriors are. Algonquin sets the expansion point for the next iteration to the mean of this ap- proximation to the posterior: x (k+1) 0 = µx + Σ (k) xy Σ (k) y −1 (y− µ (k) y ); n (k+1) 0 = µn + Σ (k) ny Σ (k) y −1 (y− µ (k) y ), (4.62) so that the expansion point is updated at every iteration, and the Gaussian approxim- ation to the posterior is moved. ere is no guarantee that the mode of the approx- imation converges to the mode of the real posterior: the algorithm may overshoot. 
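A scalar sketch of these updates, using the toy values of figure 4.12 (x ∼ N(10, 1), n ∼ N(9, 2), ψ = 0.04, y = 9) and taking h and α as zero, may make the iteration concrete; running it for a few iterations also shows the mild oscillation just mentioned. The code is illustrative only.

```python
import numpy as np

def algonquin_1d(y_obs, mu_x, var_x, mu_n, var_n, psi, iterations=5):
    """Iteratively re-linearise y = log(exp(x) + exp(n)) and move the expansion
    point to the posterior mean, following (4.56)-(4.62)."""
    x0, n0 = mu_x, mu_n                       # initial expansion point: the prior means
    for k in range(iterations):
        ex, en = np.exp(x0), np.exp(n0)
        J_x, J_n = ex / (ex + en), en / (ex + en)                        # Jacobians (4.57)
        mu_y = np.log(ex + en) + J_x * (mu_x - x0) + J_n * (mu_n - n0)   # (4.59a)
        var_y = J_x**2 * var_x + J_n**2 * var_n + psi                    # (4.59b)
        # Posterior means (4.62), using the cross-covariances (4.60):
        x0 = mu_x + (J_x * var_x / var_y) * (y_obs - mu_y)
        n0 = mu_n + (J_n * var_n / var_y) * (y_obs - mu_y)
        print(f"iteration {k}: expansion point x0={x0:.3f}, n0={n0:.3f}")
    return x0, n0

algonquin_1d(y_obs=9.0, mu_x=10.0, var_x=1.0, mu_n=9.0, var_n=2.0, psi=0.04)
```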
A damping factor could be introduced to counteract such overshoot, but this appears to slow down convergence without benefit (Kristjansson 2002).

After $K$ iterations, the Gaussian approximation $q$ to the distribution of $y$ is found from (4.58) and (4.59):
\begin{align}
q^{(K)}_{y_t}(y) &= \mathcal{N}\big(y;\, \mu_y^{(K)},\, \Sigma_y^{(K)}\big) \nonumber \\
&= \mathcal{N}\Big(y;\; f\big(x_0^{(K)}, n_0^{(K)}\big) + J_x^{(K)}\big(\mu_x - x_0^{(K)}\big) + J_n^{(K)}\big(\mu_n - n_0^{(K)}\big),\; J_x^{(K)} \Sigma_x J_x^{(K)\mathsf T} + J_n^{(K)} \Sigma_n J_n^{(K)\mathsf T} + \Psi \Big). \qquad (4.63)
\end{align}

Figure 4.13 shows a one-dimensional simulation of the Algonquin algorithm in $(x, n)$-space. The left panel shows the prior of the clean speech and the additive noise, and the right panel the Algonquin approximation to the posterior. Note again that the priors of the clean speech and noise are not correlated, but the posterior is.

[Figure 4.13: Iterations of Algonquin. (a) Initialisation: clean speech and noise prior. (b) Two iterations of Algonquin: Gaussian approximation of the posterior.]

Algonquin applied to the model compensates each Gaussian separately for each observation. It is not clear from the original presentation that this happens, so it is proven in appendix d. The problem with using this Gaussian approximation is that the effective distribution is not normalised. Even though $q$ in (4.63) is a normalised Gaussian if estimated iteratively for one $y_t$, in practice it would be estimated and then applied to the same observation $y_t$. Thus, in general,
\[
\int q(y_t)\, dy_t \neq 1. \qquad (4.64)
\]
Moreover, because Algonquin compensates each Gaussian separately for each observation, the output distribution is optimised differently for each component, and there is no reason to assume that their densities at one position can be compared in the way decoding normally does. The original Algonquin algorithm works around the problem that $q(y_t)$ is not a normalised distribution by finding the minimum mean square error estimate of the clean speech. Section 4.6.2 will discuss this.

4.5.2 Piecewise linear approximation

The compensation methods in the previous sections have used a single Gaussian to represent the observation distribution. However, the actual observations are not Gaussian-distributed even when the speech and noise are. It is possible to approximate the integral that gives the likelihood of the observation. Myrvoll and Nakamura (2004) use a piecewise linear approximation for this as a step in estimating the noise model for feature enhancement. As for the Algonquin algorithm, the model for the interaction of the speech, noise, and observation that the original work uses is somewhat simpler than the one in section 4.2. The following will adhere to the original presentation, because it provides good insight into both the main idea and the main limitation.

The main idea is to transform the integral over the speech and the noise into another space, in which it becomes easier to apply a piecewise linear approximation. The main limitation is that the method works on a single dimension, and uses log-spectral domain coefficients. In the log-spectral domain, coefficients are highly correlated. Appendix e.2 shows that it is theoretically possible to perform the transformation of the integral in more dimensions. However, where the piecewise linear approximation in the one-dimensional case requires 8 line segments, the $d$-dimensional case requires $8^d$ plane segments. This makes the scheme infeasible for correlated feature vectors.

In one dimension, the scheme works as follows.
Speech $x$, noise $n$, and observation $y$ are assumed deterministically related, with
\[
\exp(y) = \exp(x) + \exp(n), \qquad (4.65a)
\]
which is equivalent to (4.9) where the convolutional noise and the phase factor are assumed 0. If $y$ is observed to be $y_t$, and $x$ is changed, $n$ automatically changes too. A substitute variable $u$ is therefore introduced to replace both $x$ and $n$ in the integration in the likelihood expression. It is defined as
\[
u = 1 - \exp(x - y_t), \qquad (4.65b)
\]
so that (see section e.1 for details)
\begin{align}
n &= y_t + \log(u); \qquad (4.65c) \\
x &= y_t + \log(1-u). \qquad (4.65d)
\end{align}
Given a speech coefficient, the distribution of the observation can be written in terms of the distribution of the noise. As this is a transformation of the variable of a probability distribution, it is a standard result (see section a.1.1) that a Jacobian is introduced:
\[
p(y_t|x) = \left| \frac{\partial n(x, y_t)}{\partial y_t} \right| p\big(n(x, y_t)\big), \qquad (4.66)
\]
where $p(n(x, y_t))$ is the prior of $n$ evaluated at the point implied by the setting of $x$ and $y_t$.

The likelihood of $y_t$ can be expressed as an integral over $x$. It follows from (4.65a) that $x < y_t$. The integral can then be transformed into one over $u \in [0, 1]$ as follows:
\begin{align}
p(y_t) &= \int_{-\infty}^{y_t} p(y_t|x)\, p(x)\, dx = \int_{-\infty}^{y_t} \left| \left.\frac{\partial n(x,y)}{\partial y}\right|_{y_t} \right| p\big(n(x, y_t)\big)\, p(x)\, dx \nonumber \\
&= \int_0^1 \left| \left.\frac{\partial n(x,y)}{\partial y}\right|_{y_t} \right| p\big(n(u, y_t)\big) \left| \frac{\partial x(u, y_t)}{\partial u} \right| p\big(x(u, y_t)\big)\, du, \qquad (4.67)
\end{align}
where $p(n(u, y_t))$ is the prior of $n$ evaluated at the point implied by the setting of $u$ and $y_t$, and similarly for $p(x(u, y_t))$. Appendix e gives the complete derivation. The likelihood can be rewritten as
\[
p(y_t) = \exp\Big(\tfrac12\sigma_n^2 + \tfrac12\sigma_x^2 - \mu_n - \mu_x + 2y_t\Big) \int_0^1 \mathcal{N}\big(\log(u);\, \mu_n - \sigma_n^2 - y_t,\, \sigma_n^2\big)\, \mathcal{N}\big(\log(1-u);\, \mu_x - \sigma_x^2 - y_t,\, \sigma_x^2\big)\, du. \qquad (4.68)
\]
The arguments of the two Gaussians, $\log(u)$ and $\log(1-u)$, can be approximated with piecewise linear functions. Myrvoll and Nakamura (2004) use 8 line segments. For any observation $y_t$, the expression then becomes a sum of terms, each a fixed factor times an integral over a Gaussian, for which well-known approximations exist.

This derivation crucially depends on the assumption that coefficients can be considered separately. To model the likelihood well, the priors $p(x)$ and $p(n)$ need to model correlations between coefficients. In the cepstral domain, where correlations are not usually modelled, they become more important as the signal-to-noise ratio drops (Gales and van Dalen 2007). But any generalisation of the derivation to vectors of cepstral coefficients will need to convert to log-spectral vectors anyway, and turn diagonal-covariance priors into full-covariance ones. Note that though Myrvoll and Nakamura (2004) give the derivation for log-spectral coefficients, they apply the method to cepstral coefficients.

Appendix e.2 gives the generalisation of the algorithm in Myrvoll and Nakamura (2004) to $d$-dimensional log-spectral vectors. It turns out that the integral in (4.68) becomes an integral over $[0, 1]^d$. This means that the 8 line segments for the single-dimensional case become $8^d$ hyperplanes. It is infeasible to apply this to a standard 24-dimensional log-spectral feature space. Section 7.3 will therefore use a similar idea but a different transformation and a different approximation to the integral. It will use a Monte Carlo method, sequential importance resampling, to approximate the integral.
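As a numerical check on the transformation, the following sketch (again for a single log-spectral dimension with Gaussian priors on $x$ and $n$; all names and values illustrative) evaluates the transformed integral (4.67) with a simple quadrature rule over $u$, in place of the 8 piecewise-linear segments of Myrvoll and Nakamura (2004), and compares the result against direct integration over $x$ using (4.66).

```python
import numpy as np

def normal_pdf(v, mean, var):
    # Univariate Gaussian density, used for the priors of x and n.
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def likelihood_transformed(y_t, mu_x, var_x, mu_n, var_n, num_points=2000):
    # Midpoint rule over u in (0, 1), approximating (4.67).
    u = (np.arange(num_points) + 0.5) / num_points
    du = 1.0 / num_points
    n = y_t + np.log(u)           # (4.65c)
    x = y_t + np.log1p(-u)        # (4.65d)
    dn_dy = np.exp(y_t - n)       # |dn/dy| at y_t for fixed x
    dx_du = 1.0 / (1.0 - u)       # |dx/du| from the substitution
    integrand = dn_dy * normal_pdf(n, mu_n, var_n) * dx_du * normal_pdf(x, mu_x, var_x)
    return np.sum(integrand) * du

def likelihood_direct(y_t, mu_x, var_x, mu_n, var_n, num_points=20000):
    # Direct quadrature over x < y_t for comparison, using (4.66).
    x = np.linspace(y_t - 30.0, y_t - 1e-6, num_points)
    n = y_t + np.log1p(-np.exp(x - y_t))
    dn_dy = np.exp(y_t - n)
    integrand = dn_dy * normal_pdf(n, mu_n, var_n) * normal_pdf(x, mu_x, var_x)
    return np.trapz(integrand, x)

print(likelihood_transformed(9.0, 10.0, 1.0, 9.0, 2.0))
print(likelihood_direct(9.0, 10.0, 1.0, 9.0, 2.0))
```

The two estimates should agree up to quadrature error, since (4.67) is an exact change of variable.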
4.6 Model-based feature enhancement

As section 4.1 has discussed, a faster but less principled technique for noise-robustness than model compensation is feature enhancement. The objective of feature enhancement is to reconstruct the clean speech. An advantage of this is that it sidesteps the issue of finding compensation for dynamics. Early schemes used spectral subtraction (Boll 1979). However, knowledge of the clean speech and noise distributions makes a statistical approach possible (Ephraim 1992). The distributions of the noise and especially the speech are best given in the log-spectral or cepstral domain. Usually, a minimum mean square error (mmse) estimate of the clean speech given an observation $y_t$ is found (Ephraim 1990):
\[
\hat{x} = \mathbb{E}\{x|y_t\}. \qquad (4.69)
\]
This requires $p(x|y)$, and therefore $p(x)$, a model for the speech, which is simplified from a speech recogniser. Normally, a mixture of Gaussians is used for the joint distribution of the clean and corrupted speech:
\[
\begin{bmatrix} x \\ y \end{bmatrix} \sim \sum_r \pi^{(r)}\, \mathcal{N}\!\left( \begin{bmatrix} \mu_x^{(r)} \\ \mu_y^{(r)} \end{bmatrix},\, \begin{bmatrix} \Sigma_x^{(r)} & \Sigma_{xy}^{(r)} \\ \Sigma_{yx}^{(r)} & \Sigma_y^{(r)} \end{bmatrix} \right). \qquad (4.70)
\]
The parameters of this joint distribution can be found with the methods described in section 4.4.3.1. Given the model in (4.70), the estimate in (4.69) is found by marginalising out the front-end component identity. It is a known result, which is derived in appendix a.1.3, that from a joint Gaussian the distribution of one variable conditioned on another is Gaussian with parameters
\[
x|y, r \sim \mathcal{N}\Big( \mu_x^{(r)} + \Sigma_{xy}^{(r)} \Sigma_y^{(r)-1}\big(y - \mu_y^{(r)}\big),\; \Sigma_x^{(r)} - \Sigma_{xy}^{(r)} \Sigma_y^{(r)-1} \Sigma_{yx}^{(r)} \Big). \qquad (4.71)
\]
The expected value of the clean speech for one component is the mean of the conditional distribution in (4.71):
\[
\hat{x} = \sum_r P(r|y_t)\, \mathbb{E}\{x|y_t, r\} = \sum_r P(r|y_t) \Big( \mu_x^{(r)} + \Sigma_{xy}^{(r)} \Sigma_y^{(r)-1}\big(y_t - \mu_y^{(r)}\big) \Big). \qquad (4.72)
\]
The posterior responsibilities $P(r|y_t)$ are found with the component-conditional marginal distribution of $y$:
\[
P(r|y_t) \propto P(r)\, p(y_t|r) = \pi^{(r)}\, \mathcal{N}\big(y_t;\, \mu_y^{(r)},\, \Sigma_y^{(r)}\big). \qquad (4.73)
\]
This can be written as an affine transformation $\{A_{t,\text{mmse}}, b_{t,\text{mmse}}\}$ that depends on the observation vector. It is a linear interpolation between affine transformations $\{A_{\text{mmse}}^{(r)}, b_{\text{mmse}}^{(r)}\}$ that can be precomputed with
\[
A_{\text{mmse}}^{(r)} = \Sigma_{xy}^{(r)} \Sigma_y^{(r)-1}; \qquad b_{\text{mmse}}^{(r)} = \mu_x^{(r)} - A_{\text{mmse}}^{(r)} \mu_y^{(r)}. \qquad (4.74)
\]
The estimate of the clean speech then becomes
\[
\hat{x} = A_{t,\text{mmse}}\, y_t + b_{t,\text{mmse}}, \qquad (4.75)
\]
where the interpolation weights are given by the components' posterior probabilities $P(r|y_t)$:
\[
A_{t,\text{mmse}} = \sum_r P(r|y_t)\, A_{\text{mmse}}^{(r)}; \qquad b_{t,\text{mmse}} = \sum_r P(r|y_t)\, b_{\text{mmse}}^{(r)}. \qquad (4.76)
\]
When decoding with the clean speech estimate $\hat{x}$ as the input vector for the speech recogniser, the likelihood for component $m$ is computed with
\[
q^{(m)}(y_t) = p^{(m)}(\hat{x}) = p^{(m)}\big(A_{t,\text{mmse}}\, y_t + b_{t,\text{mmse}}\big). \qquad (4.77)
\]
It is possible to write this as a transformation of the whole speech recogniser which is different for every observation vector. However, this transformation does not have a probabilistic interpretation.

4.6.1 Propagating uncertainty

A problem that has been recognised (Arrowood and Clements 2002; Stouten et al. 2004a) is that the clean speech estimate $\hat{x}$ is a point estimate which does not carry any information about its uncertainty. A number of approaches have been suggested. It is possible to propagate the uncertainty of the posterior $p(x|y, r)$ of the clean speech reconstruction (Arrowood and Clements 2002; Stouten et al. 2004a).
is uses the covariance of the Gaussian conditionals in (4.71), and ešectively computes likelihoods as q(m)(yt) = ∫ p(x|yt)p (m)(x)dx, (4.78) which is not mathematically consistent (Gales 2011). An alternative is to propagate the conditional distribution p(yt|x) (Droppo et al. 2002): q(m)(yt) = ∫ p(yt|x)p (m)(x)dx, (4.79) 105 chapter 4. noise-robustness where p(yt|x) = ∑ r P(r|x)p(yt|x, r) . (4.80) e problem with this is that the component posterior P(r|x) depends on the clean speech, because it is conditioned on latent variable and must therefore be approxim- ated. 4.6.2 Algonquin So far, the joint mixture of Gaussians in (4.46) has been assumed xed. If the joint was trained on stereo data, then this is sensible. However, if it was estimated with vts, then the linearisation points may not be optimal. Section 4.5.1 has introduced Algonquin formodel compensation, which iteratively updates the linearisation points towards the mode of the posterior given the observation. e original Algonquin algorithm (Frey et al. 2001a) applies feature enhancement, which nds the minimum mean square error estimate of the clean speech. Algonquin extends this idea by at the same time as updating the observation distributions, nding an approximation to the component posteriors of the mixture of Gaussians. e algorithm replaces the static joint distribution in (4.46) by an approximation (Kristjansson 2002) x,n,y ∼ ∑ r pi(m) · p(x,n|r) · q(r)(k)(y|x,n) . (4.81) At each iteration k, the component-dependent observation distribution q(r)(k) is up- dated by re-estimating the expansion point as in section 4.5.1.e component distri- bution of the speech and the noise is Gaussian, and q(r)(k) is assumed Gaussian be- cause of themismatch function.erefore, the posterior distributionq(k)(m)(x,n|yt) of the speech and the noise given the observation also becomes Gaussian. Its mean gives the expansion point for the next iteration. 106 4.7. noise model estimation AŸer a number of iterations, themmse estimate for the clean speech is found ana- logously to (4.72) xˆ = ∑ r q(r) · E{q(r)(k)(x|yt)}, (4.82a) where the expectation is the mean of the Gaussian posterior. e posterior respons- ibilities q(r) are found with, analogously to (4.73), q(r) ∝ pi(m)q(r)(k)(yt). (4.82b) 4.7 Noise model estimation e discussion so far has assumed that a distribution of the noise is known. In practice, however, this is seldom the case. e noise model must therefore be estimated. e usual approach is to apply expectation–maximisation in an unsupervised-adaptation framework discussed in section 3.1.e aim then is to optimise the noise model para- meters to improve the likelihood using the form of model compensation that is used for decoding. Conceptually, thismoves the interface of themodel with the real world. It does not matter any more whether the noise model matches the actual noise. What matters is that the parameters can be estimated robustly and that they allow the resulting model to match the real world reasonably well. Discrepancies between the model and the real process therefore become allowable: the noise model estimate can absorb some of the mismatch. e noise modelMn = {µn,Σn,µh} comprises the parameters of the additive noise, assumed Gaussian with N (µn,Σn), and the convolutional noise µh, which is assumed constant.e parameters are of the form µn = µsn 0  ; Σn =  diag(Σsn) 0 0 diag ( Σ∆n )  ; µh = µsh 0  . 
(4.83) e expected value of the dynamic coe›cients of the additive noise are zero because this work assumes that the noise model has no state changes. Since the convolutional 107 chapter 4. noise-robustness noise is assumed constant, its dynamic parameters are also zero.e noise covariance is normally assumed diagonal, just like for clean speech Gaussians. With standard feature vectors, with 13 statics and 13 delta and 13 delta-delta coe›cients, the noise model has only 65 parameters, as opposed to cmllr’s 1560 per class.is means that methods for noise-robustness can adapt to a few seconds of data, in situations where applying cmllr decreases performance (for a comparison, see Flego and Gales 2009). ough noise model estimation can conceptually be seen as adaptation, the gen- eric derivation is dišerent from the derivation of adaptation in section 3.1.e noise is explicitly a hidden variable distributed according to some distribution, whose para- meters must be trained. Training uses expectation–maximisation (see section 2.3.2.1) (Rose et al. 1994). Here, the hidden variables U consist not only of the component sequencemt, but also the sequence of the noise vectors nt and ht. e expressions for the expectation and maximisation steps are (aŸer (2.22) and (2.27b)) ρ(k) := q (k) U |Y ∝ q (k) UY ; (4.84a) q (k) U := argmax qU ∫ p˜(Y) ∫ ρ(k)(U |Y) logqU (U)dUdY. (4.84b) For noise estimation, the posterior distribution over the hidden variables is a joint distribution over the components and the noise vectors, which can be expressed as ρ(U |Y) = ρ({mt,nt,ht}∣∣Y) = ρ ( {mt} ∣∣Y)ρ({nt,ht}∣∣{mt},Y). (4.85) e distribution over the components, ρ ( {mt} ∣∣Y), is the same as for speech recog- niser training, and again it is su›cient to keep component–time occupancies γ(m)t . e distribution over the noise consists of a distribution for each component. e relationship between the noise and the noise-corrupted speech is non-linear. is relationship is captured by qY |U , which the computation of the hidden variable pos- terior in (4.84a) uses (see section 2.3.2.1). e posterior distribution of the noise for one componentmt is ρ ( {nt,ht} ∣∣{mt},Y) therefore does not have a closed form. Noise estimation is of practical value. It is therefore not surprising thatmostmeth- ods that have been proposed work on a practical compensation method, which lin- 108 4.7. noise model estimation earises of the mismatch function. is is mostly either vts compensation or joint uncertainty decoding. Methods use either of two ways of approximating the hidden variable posterior. Some x the linearisation of the inžuence of the noise on the ob- servation to make computing the parameters possible. Others approximate the real posterior given the mismatch function with numerical or Monte Carlo methods, but apply the linearisation when compensating anyway. Neither is guaranteed to yield an increase of the likelihood. 4.7.1 Linearisation Both vts compensation and joint uncertainty decoding linearise the inžuence of the noise (and the other sources) on the corrupted speech. It is possible to nd a newnoise model estimate that is guaranteed not to decrease the likelihood as long as the vts ex- pansion point of the noise is xed. However, the linearisation depends on the noise model: the expansion point of the additive noise is normally set to the noisemean (see section 4.4.2). As soon as the expansion point changes, therefore, the guarantee drops away.is can lead to oscillations. 
However, it is always possible to evaluate the result of the likelihood function before accepting a new noise model estimate.en, a back- oš strategy can decrease the step size until the likelihood does increase. Alternatively, the noise model mean and the noise expansion point can be disconnected, so that the guarantee about the likelihood remains valid, even though the vts approximation be- comes less close to the real distribution. Schemes that combine these strategies adapt- ively are possible.e actual estimation of the new parameters can work in two ways: either by setting up a joint Gaussian distribution of noise and corrupted speech, yield- ing a factor analysis-style solution, or by iteratively optimising the likelihood function directly. Since the additive noise is assumed Gaussian, the linearisation makes the additive noise and the noise-corrupted speech jointly Gaussian for each speech component (independently introduced by Kim et al. 1998; Frey et al. 2001b). is results in a 109 chapter 4. noise-robustness distribution of the formn y  ∣∣∣∣∣∣m ∼ N µn µy  ,  Σn Σny Σyn Σy  . (4.86) is has only been used for statics, but dynamic compensation with the continuous- time approximation yields a linearised relationship of noise and observation too, so the principle can be extended to vectors with statics and dynamics. From this joint distribution, the distribution of the noise conditional on an ob- servation y is also Gaussian (see appendix a.1.3). is yields a factor analysis-style solution for the optimal noise distribution for each time instance and component. However, nding the convolutional noise parameters is not possible in this frame- work if the convolutional noise is assumed constant, because the posterior distribution of it cannotmove from its prior estimate. It is possible to assume a convolutional noise covariance while estimating the parameters and then ignore it when compensating, but that again yields no guarantee of nding the optimal likelihood. A greater problem, however, is that some blocks of the joint distribution in (4.86) are usually diagonalised.e noise estimate is usually constrained to have a diagonal covariance, and so is the resulting corrupted speech covariance. Diagonalising the corrupted speech covariance, however, happens aŸer compensation. Since it is not an intrinsic property of the process, it cannot be reversed. To infer a distribution over n from a given value ofy, the joint distribution needs to be valid. In the case of the joint Gaussian, the cross-covariance Σyn must be diagonal as well. Diagonalising Σyn is equivalent to diagonalising Jn.us, this assumes that the relationship between noise and corrupted speech is per coe›cient. However, as section 4.4.2 has pointed out, the Jacobian that gives that relationship is not diagonal. If it is constrained to be diagonal in estimation, this, again, must be applied in compensation as well for the estimation process to yield a guaranteed improvement in likelihood. However, diagonalising the Jacobian is an additional approximation that reduces the quality of the compensation. An alternative is to directly optimise the noise model parameters.e static noise means can be updated at the same time using a xed-point iteration (Moreno 1996). 110 4.7. noise model estimation e additive noise covariance, however, is more complex to estimate. It is possible to estimate it on the parts of the waveform known to contain noise without speech. 
Another options is to use gradient ascent to nd an estimate for the additive noise variance for vtswith the continuous-time approximation (Liao andGales 2006).is needs to be alternated with the estimation of the noise mean.is scheme does allow the convolutional noise to be estimated, and it does not require the Jacobians to be diagonalised.is is the approach that this work will take.e resulting noise model estimate maximises the likelihood of model compensation with vts.us, the para- meters do not necessarily correspond to the actual noise or to a consistent sequence of static observations. 4.7.2 Numerical approximation It is possible to nd a numerical approximation to the hidden variable posterior. Myr- voll and Nakamura (2003) propose a method that uses a consistent approximation for estimation and compensation (see section 4.5.2), which does not apply the vector Taylor series approximation. However, it assumes all dimensions independent, and it is not feasible to generalise it to multiple dimensions (see appendix e). Alternatively, it is possible to approximate the noise posterior with an empirical distribution (Faubel and Klakow 2010).e empirical noise posterior is acquired with importance sampling (see appendix a.4.2). Samples n(l)t are drawn from the noise prior for the previous iteration. ey are then re-weighted to give an approximation to the noise posterior ρ(nt|Y).4 Since the noise is assumed identically distributed, the number of samples required per time frame is low. Faubel and Klakow (2010) apply this per dimension, but it may be possible to extended the method to more dimen- sions. However, note that this method improves the likelihood under the assumption that the exact distribution (given themismatch function) is used.is has the advant- age over using the vts assumption that no overshoot happens because of the linear- 4Faubel and Klakow (2010) uses Parzen windows to approximate the importance weights, but this can be replaced by a straightforward analytical solution.e required expression is (4.66) with x and n exchanged. 111 chapter 4. noise-robustness isation. However, in practice, a dišerent form of compensation will be applied, so that the method does not optimise the likelihood for the actual form of compensation. 4.8 Summary is chapter has discussed existing approaches for noise-robust speech recognition. Section 4.2.1 has derived the mismatch function, which relates the speech, noise, and the corrupted speech. is resulted (in section 4.3) in an expression for the corrup- ted speech which has no closed form. Model compensation, the topic of section 4.4, approximates the corrupted speech distribution with a parameterised density. e state-of-the-art vts compensation nds one corrupted speech Gaussian for one clean speech Gaussian.e Gaussian is diagonalised, because of the imprecise estimation of the oš-diagonals and decoding speed.ese two issues will be dealt with in chapters 5 and 6, respectively. e methods in section 4.5 use a dišerent approach: they nd a dišerent approximation to the corrupted speech distribution for every observation. is philosophy will also apply to the scheme that chapter 7 will introduce. Section 4.6 has described model-based feature enhancement, which only uses a linear transform- ation to the feature vector. Section 4.7 then cast methods for noise-robustness in an adaptation framework, by estimating the noise model.e interesting aspect is that a standard noise model has only 65 parameters, as opposed to at least 1560 for cmllr. 
is means that methods for noise-robustness need far less adaptation data than gen- eral adaptation methods. 112 Part II Contributions 113 Chapter 5 Compensating correlations is chapter will describe the rst contribution of this thesis.1 e previous chapter has described model compensation methods. ey diag- onalise the Gaussians they produce, because the oš-diagonals are more sensitive to approximations. For the state-of-the-art vts compensation, the continuous-time ap- proximation it applies for the dynamic coe›cients will turn out problematic. is chapter will propose a new approach for compensating the dynamic paramet- ers so that the correlations can be estimated accurately, which, section 5.1 will show, is important. Section 5.2 will then explain how the dynamic coe›cients can be ex- pressed as a linear transformation over a vector with all static feature coe›cients in a window. When a distribution over this “extended” feature vector is known, then the distribution of the static and dynamic parameters can oŸen be found by linearly transforming the parameters of the distribution over the extended feature vector. Sec- tion 5.2.2 will explore under which conditions this is valid. In the same fashion as standard model compensation, there is a range of schemes, that section 5.3 will intro- duce, that can be used to combine the extended clean speech and noise distributions 1Extended dpmc, and its application to joint uncertainty decoding, was introduced in van Dalen and Gales (2008). Extended vtswas introduced in van Dalen and Gales (2009b;a; 2011). Van Dalen and Gales (2010a;b) mentioned briežy the form in which this chapter will present them, with a distribution for the phase factor. Extended idpmc also got only a brief mention. is thesis newly introduces the extended Algonquin algorithm. 115 chapter 5. compensating correlations to yield the extended corrupted speech distribution. In particular, section 5.3.3 will in- troduce “extended vts”, a faster method that approximates the mismatch function for each time instance with a vector Taylor series. Section 5.4 will discuss how to estimate extended parameters for joint uncertainty decoding, which compensates a base class at a time. Section 5.5 will discuss how to nd robust estimates for speech and noise distributions over extended feature vectors, which havemore parameters than normal ones. Estimates for the extended noise distribution can be found from estimates for statics and dynamics. Alternatively, because the oš-diagonals can now be estimated, an expectation–maximisation approach is possible. 5.1 Correlations under noise Feature correlations change under noise. Figure 5.1 on the next page shows the overall correlations of the zeroth and rst cepstral coe›cients in Toshiba in-car data from dišerent noise conditions (for details, see section 8.1.3).e dišerences in the orient- ation of the ellipses indicate dišerences in correlations. How this comes about can be seen by considering feature correlations of only speech, and of only noise. If the feature space is optimised to reduce correlations for clean speech, whichmfccs make an attempt to do, correlations will appear under noise. However, in the limit as the noise masks the speech, the correlation pattern will be that of the noise. It is therefore important for noise-robustness that these correlations are modelled. ough correlations could be modelled with full covariance matrices, speech re- cogniser distributions are normally assumed Gaussian with diagonal covariance ma- trices. 
This is true for the clean speech distributions, and so ingrained that methods for noise-robustness are proposed without even mentioning that diagonalisation is performed (Kim et al. 1998; Acero et al. 2000). When estimating corrupted-speech distributions on stereo data, full covariance matrices have been shown to increase performance (Liao and Gales 2005). However, stereo data is seldom available.

[Figure 5.1: Overall correlations of the zeroth and first cepstral coefficients in Toshiba in-car data (see section 8.1.3) for different noise conditions: 35 dB office, 34 dB idle, 25 dB city, 18 dB highway.]

To compensate for correlation changes under noise in realistic scenarios, a noise model must be estimated, and full-covariance compensation computed with a model compensation method like vts, discussed in section 4.4.2. As mentioned in that section, vts compensation is normally diagonalised. This is for two reasons: decoding speed, and compensation quality. Chapter 6 will introduce forms of compensation that model correlations but are fast to decode with. This chapter will look into the quality of compensation for correlations. The estimates for full-covariance Gaussians are expected to be more sensitive to approximations in the compensation process than for diagonal covariance matrices. In particular, the continuous-time approximation, which standard vts compensation uses for dynamic coefficients and which was introduced in section 4.2.3, does not provide good compensation; section 8.1.1.1 will show this by comparing against a single-pass retrained recogniser.

In hmm-based speech recognition systems, dynamic features (usually delta and delta-delta coefficients) are appended to the static features to form the feature vectors (see section 2.1.2). The continuous-time approximation makes the assumption that the dynamic coefficients are the time derivatives of the statics. For vts, the form of compensation for the dynamic parameters is then closely related to the static parameters. Though compensation with the continuous-time approximation can generate block-diagonal covariance matrices (as section 4.4.2 has shown), the estimates are not accurate enough to yield an increase in performance.

An advantage of the continuous-time approximation is that it is possible to find compensation for any form of dynamic parameters, both those computed with linear regression and those computed with simple differences. A scheme for dynamic parameter compensation with dpmc stores extra clean speech statistics (see section 4.4.1), but is only applicable to simple differences. As section 2.1.2 has mentioned, state-of-the-art speech recognisers compute dynamic parameters with linear regression. Another scheme that attempts to improve compensation by using additional statistics, but in the log-spectral domain, is described in de la Torre et al. (2002). However, as section 5.3.3.1 will show, this approach involves approximations that negate any potential improvements and basically yields the same form as the continuous-time approximation. Though there are known limitations to the use of the continuous-time approximation, it is still the form used in the vast majority of model compensation schemes (Acero et al. 2000; Liao and Gales 2007; Li et al. 2007).

5.2 Compensating dynamics with extended feature vectors

This section will describe an alternative method to using the continuous-time approximation for compensating the dynamic model parameters.
e key intuition is that the dynamic coe›cients are a linear combination of consecutive static feature vectors. us, it is possible to model the ešect of the noise separately per time instance, and only then combine the time instances. is applies the same linear transformation that speech recognisers apply to nd dynamic coe›cients from a range of statics. At which point in the process it is valid to apply the linear transformation depends on the details of the compensation methods.is will become clear in this section. For simplicity, a window of ±1 and only rst-order dynamic coe›cients will be considered. An extended feature vector yet , containing the static feature vectors in the 118 5.2. compensating dynamics with extended feature vectors surrounding window, is given by yet = [ yst−1 T yst T yst+1 T ]T .2 e transformation of the extended feature vector yet to the standard feature vector with static and dynamic parameters yt can be expressed as the linear projection D that was introduced in section 2.1.2 (analogous to (2.7b)): yt =  yst y∆t  =  0 I 0 − I2 0 I 2   yst−1 yst yst+1  = Dyet , (5.1a) e second row ofD applies the transformation from a window of statics to yield the standard delta features. Similarly, xt = Dx e t ; nt = Dn e t ; ht = Dh e t ; αt = Dα e t , (5.1b) where extended feature vectors ·et all contain consecutive static feature vectors ·st−1, ·st, ·st+1. e form of their distributions will be discussed in section 5.5. Model compensation, described in section 4.4, approximates the predicted distri- bution of the noise-corrupted speech for one component (from (4.24d); the depend- ency on the component will not be written in this chapter) p(y) = ∫ ∫ ∫ ∫ δf(x,n,h,α)(y)p(x,n,h,α)dαdhdndx. (5.2) To model the ešect of the noise on each time instance separately, an extended mis- match function fe can be dened as fe(xet ,n e t ,h e t ,α e t) =  f(xt−1,nt−1,ht−1,αt−1) f(xt,nt,ht,αt) f(xt+1,nt+1,ht+1,αt+1)  , (5.3a) where f(·) is the per-time instance mismatch function dened in (4.20). Section 4.2.3 has given the mismatch function for dynamics, which used a projection matrixD∆. 2It is straightforward to extend this to handle both second-order dynamics and linear-regression coe›cients over a larger window of±w, so that yet = [ yst−w T . . . yst+w T ]T. 119 chapter 5. compensating correlations e full noise-corrupted speech vector with statics and dynamics can be found with projection matrixD: yt = Df e(xet ,n e t ,h e t ,α e t). (5.3b) To use this to express the distribution over the corrupted speech, the integral in (5.2) is rewritten in terms of extended feature vectors: p(y) = ∫ ∫ ∫ ∫ δDfe(xe,ne,he,αe)(y)p(x e,ne,he,αe)dαedhednedxe. (5.3c) It is possible to approximate the quadruple integral in (5.3c) by sampling. e Dirac delta yields corresponding samples with statics and dynamics, in a similar way to standard dpmc in section 4.4.1. Extended dpmc and extended iterative dpmc, which sections 5.3.1 and 5.3.2 will introduce, train on these samples. However, dpmc (not iterative dpmc) can also be viewed from a dišerent perspect- ive, which corresponds to the original presentation (in van Dalen and Gales 2008). It will turn out possible to estimate a Gaussian over extended feature vectors, and only then convert to a Gaussian over standard feature vectors with statics and dynamics. is will allow section 5.3.3 to introduce extended vts, a method that applies a vector Taylor series approximation for every time instance of extended feature vectors. 
Ex- tended vts therefore has a reasonable time complexity.e following will detail how a Gaussian over extended feature vectors can be converted into one over statics and dy- namics. Section 5.2.2 will prove under what condition optimising a distribution over extended feature vectors is equivalent to optimising one over statics and dynamics. 5.2.1 The extended Gaussian It is interesting to look at the structure of a Gaussian distribution over extended fea- ture vectors,ye ∼ N (µey,Σey).emeanµey of the concatenation of consecutive static feature vectors is simply a concatenation of static means at time ošsets−1, 0,+1. For the corrupted speech, these will be written µsy−1 ,µ s y0 ,µsy+1 . e covariance Σ e y con- tains the covariance between statics at dišerent time ošsets.e covariance between 120 5.2. compensating dynamics with extended feature vectors ošsets −1 and +1, for example, is written Σy−1y+1 . us, the full parameters of the extended distribution are µey =  µsy−1 µsy0 µsy+1  ; Σey =  Σsy−1y−1 Σ s y−1y0 Σsy−1y+1 Σsy0y−1 Σ s y0y0 Σsy0y+1 Σsy+1y−1 Σ s y+1y0 Σsy+1y+1  . (5.4) An extended vector and its equivalent with statics and dynamics are related with y = Dye. AsD is a linear transformation, if the distribution of the extended corrup- ted speech yet is assumed Gaussian, then the extended corrupted speech distribution can be transformed to a distribution over statics and dynamics with y = Dye ∼ N (Dµey,DΣeyDT) (5.5) e internal structure of thematrices will have ramications for how to store statistics and how to compensate distributions. For example, the covariance in (5.5), substitut- ing Σey from (5.4) andD from (5.1a), can be expressed as Σy = DΣ e yD T =  0 I 0 − I2 0 I 2   Σsy−1y−1 Σ s y−1y0 Σsy−1y+1 Σsy0y−1 Σ s y0y0 Σsy0y+1 Σsy+1y−1 Σ s y+1y0 Σsy+1y+1   0 − I2 I 0 0 I2  . (5.6) 5.2.2 Validity of optimising an extended distribution It would be interesting to nd under what conditions optimising the distribution over extended feature vectors of the noise-corrupted speech, for example, the Gaussian in section 5.2.1, optimises the distribution over statics and dynamics. e standard feature vector ywith statics and dynamics is related to the extended feature vector ye by (repeated from (5.1a)) y = Dye. (5.7) e distribution qe over extended feature vectors would therefore approximate a dis- tribution similar to the one in (5.3c). ere, the integrals were over extended feature 121 chapter 5. compensating correlations vectors, and the resulting distribution over standard feature vectors. Alternatively, as in section 5.2.1, the resulting distribution is over extended feature vectors as well: pe(ye) = ∫ ∫ ∫ ∫ δfe(xe,ne,he,αe)(y e)p(xe,ne,he,αe)dαedhednedxe. (5.8) e question that the following will answer is under what conditions approximating pe (in (5.8)) with qe is equivalent to optimising the approximation q to p (in (5.3c)) directly: qˆ := argmin q KL(p‖q) ; (5.9a) qˆe := argmin qe KL(pe‖qe) . (5.9b) To relate distributions over y and ye, the determinant of the Jacobian of the con- version is necessary. Since the relation is linear, this would be the determinant ofD if it were square. As a trick, extra dimensions can be appended onto y to increase its dimensionality.ese irrelevant dimensions are similar to the “nuisance” dimensions for hlda (discussed in section 3.3.2). ey will be written y(), and the vector with these appended y+, so that y+ =  y y()  . 
(5.10) Similarly, the projection D from extended feature vectors to ones with statics and dynamics is extended: D+ =  D D()  . (5.11) Provided D+ is full-rank, it is irrelevant what entries D() has exactly, because it is merely a mathematical construct.en, y+ = D+y e. (5.12) e distributions p and its approximation q are similarly extended: p+(y+) = p(y) · p() ( y() ∣∣y) ; q+(y+) = q(y) · q()(y()∣∣y) (5.13) 122 5.2. compensating dynamics with extended feature vectors Again, it is irrelevant how the distribution over the nuisance dimensions p() is dened or what its approximation q() is optimised to. In any case, the distributions over ex- tended feature vectors and over standard feature vectors plus nuisance dimensions are related by the determinant of the Jacobian (a well-known equality, in (a.1)): pe(ye) = |D+| · p+(y+); qe(ye) = |D+| · q+(y+). (5.14) e question now is whether when qe is optimised, q is optimised in the process. is can be taken in twoparts.e rst question iswhether optimisingqe is equivalent to optimising q+. is is the case since |D| is constant, so that it drops out of the optimisation, when it is rewritten with (5.14): qˆe = argmax qe ∫ pe(ye) log(qe(ye))dye = argmax qe ∫ ∣∣∣D−1+ ∣∣∣|D+|p+(y+) log(|D+|q+(y+))dy+ = argmax qe ∫ p+(y+) log(q+(y+))dy+. (5.15a) e second part of the question is when optimising q+ also optimises q. To express this, substitute (5.14) into (5.15a): qˆe = argmax qe ∫ p+(y+) log(q+(y+))dy+ = argmax qe ∫ p(y) · ∫ p() ( y() ∣∣y) log(q(y) · q()(y()∣∣y))dy()dy = argmax qe [ ∫ p(y) log(q(y))dy + ∫ p(y) ∫ p() ( y() ∣∣y) log(q()(y()∣∣y))dy()dy]. (5.15b) erefore, if the parameters of q and q() can be set independently, then nding the optimal qe means nding the optimal q. For a Gaussian qe, the parameters of q and q() can indeed be set independently. Appendix a.1.3 details the well-known composition of a multi-variate Gaussian into the marginal of some dimensions (here, y) and a conditional of the other variables (here, y()) given the rst set.e projectionD+ changes the parameter space, so that 123 chapter 5. compensating correlations this factorisation is not explicit when optimising qe. However, since the projection is full-rank, the optimisation in one feature space gives the optimal parameters in another feature space, and the argument still applies. is therefore proves that it is possible to optimise a Gaussian over the extended corrupted speech distribution and convert it to a distribution over statics and dynamics. 5.3 Compensating extended distributions e following will introduce four ways of estimating an extended distribution for ye. e rst, extended dpmc, uses sampling. Its variant, extended iterative dpmc, nds a mixture of Gaussians. A faster scheme, extended vts, applies a vector Taylor series approximation to every time instance. Extended Algonquin uses the same approxim- ation, but iteratively updates the expansion point. ey all assume that the extended speech xe and noise ne are Gaussian, and that the convolutional noise he is constant.e elements of α are assumed Gaussian dis- tributed but constrained to [−1,+1]. Section 5.5 will discuss the form of parameters for their distributions. It would be possible to apply the extended feature vector approach to other ap- proaches that nd Gaussian compensation. Indeed, an appendix of van Dalen and Gales (2009a) uses the unscented transformation to nd a Gaussian extended cor- rupted speech distribution. 
However, as pointed out at the end of section 4.4.2, the dišerence between model compensation schemes that come up with the same form of distribution largely disappears when the noise model is estimated.emodels then dišer only in the exact parameterisation that the optimisation uses.e only schemes that this thesis will introduce that nd one xedGaussian for the corrupted speech are therefore extended dpmc, a sampling scheme that in the limit produces the optimal Gaussian, and extended vts, based on the state-of-the-art vts. 124 5.3. compensating extended distributions 5.3.1 Extended DPMC e rst method of nding a Gaussian for the extended corrupted speech distribu- tion is extended dpmc (edpmc). It derives from the integral of the exact expression for the extended corrupted speech distribution analogously to standard dpmc (sec- tion 4.4.1). However, there is also a second perspective on dpmc, which converts samples to standard feature vectors immediately and trains a distribution on those. e rst perspective ties in with extended vts, which section 5.3.3 will introduce.e second perspectivemakes it possible to extend the algorithm tomixtures of Gaussians, for extended iterative dpmc (section 5.3.2). e rst perspective on dpmc nds an extended corrupted speech Gaussian. It derives from the integral over extended feature vectors in (5.8): pe(ye) = ∫ ∫ ∫ ∫ δfe(xe,ne,he,αe)(y)p(x e,ne,he,αe)dαedhednedxe. (5.16) Extendeddpmc approximates this by representing this distributionpe by an empirical distribution p˜e. To nd the empirical distribution, sample tuples (xe(l),ne(l),he(l),αe(l)) can be drawn from the extended distributions of the clean speech, noise, and phase factor.ese are joint samples over consecutive feature vectors, so that (again, assum- ing a window±1) x e(l) t =  x s(l) t−1 x s(l) t x s(l) t+1  ; ne(l)t =  n s(l) t−1 n s(l) t n s(l) t+1  ; he(l)t =  h s(l) t−1 h s(l) t h s(l) t+1  ; αe(l)t =  α s(l) t−1 α s(l) t α s(l) t+1  . (5.17) e distribution of the extended feature vectors over the corrupted speech ye is then dened by applying the mismatch function on each time instance, as in (5.3a). e mismatch function f for the static parameters in (4.10b) (for log-spectral feature vec- tors) or (4.20) (for cepstral feature vectors) is applied to each time ošset. is yields 125 chapter 5. compensating correlations an extended corrupted speech sample ye(l): y e(l) t =  y s(l) t−1 y s(l) t y s(l) t+1  =  f(x s(l) t−1,n s(l) t−1,h s(l) t−1,α s(l) t−1) f(x s(l) t ,n s(l) t ,h s(l) t ,α s(l) t ) f(x s(l) t+1,n s(l) t+1,h s(l) t+1,α s(l) t+1)  . (5.18) e empirical distribution then has delta spikes at the positions of these samples: p˜e = 1 L ∑ l δ y e(l) t . (5.19) is approximation can be substituted in for pe in (5.9b): qˆe := argmin qe KL(p˜e‖qe) = argmax qe ∫ p˜e(ye) logqe(ye)dye = argmax qe 1 L ∑ l logqe(ye). (5.20) is is equivalent to nding a maximum-likelihood solution on the samples, which has a well-known procedure for many distributions. For a Gaussian q ∼ N (µey,Σey), maximum-likelihood estimates of the extended corrupted speech parameters µey and Σey can then be found with µey = 1 L L∑ l=1 ye(l); (5.21a) Σey = ( 1 L L∑ l=1 ye(l) [ ye(l) ]T) − µeyµ e y T. (5.21b) e samples have been generated from distributions in which the time instances are correlated. 
erefore, the time instances in the corrupted speech sample will also be dependent.e cross-correlations of Σey (with the structure in (5.4)) will therefore be estimated correctly. Having thus found Gaussian parameters for the extended corrupted speech dis- tribution, distributions over statics and dynamics can be found with (5.5): y = Dye ∼ N ( Dµey,DΣ e yD T ) . (5.22) 126 5.3. compensating extended distributions In the limit as the number of samples goes to innity, this nds the optimal Gaussian for the corrupted speech distribution. An alternative perspective on dpmc gives less insight in the cross-correlations of the corrupted speech distribution but allows other distribution than Gaussians to be trained from the samples. It is possible to directly express the corrupted speech distribution with statics and dynamics while integrating over extended distributions (from (5.3c)): p(y) = ∫ ∫ ∫ ∫ δDfe(xe,ne,he,αe)(y)p(x e,ne,he,αe)dαedhednedxe. (5.23) Note that the Dirac delta has a projectionD added, so that the delta spike is directly in the standard domain.e Monte Carlo approximation of this integral is very similar to that of the rst perspective on dpmc. Extended corrupted speech samples ye(l) are found exactly as in (5.18), but then immediately converted to samples with statics and dynamics: y(l) = Dye(l). (5.24) e empirical distribution then is similar to p˜e in (5.19): p˜ = 1 L ∑ l δ y (l) t = 1 L ∑ l δ Dy e(l) t . (5.25) is distribution is in the domain with statics and dynamics, so that the proced- ure from here follows that of standard dpmc, in section 4.4.1. e parameters of Gaussian q are trained on samples y(l)t in exactly the same way. Substituting (5.24) into (4.31), µy := 1 L L∑ l=1 y(l) = 1 L L∑ l=1 Dye(l); (5.26a) Σy := ( 1 L L∑ l=1 y(l)y(l) T ) − µyµ T y = ( 1 L L∑ l=1 Dye(l)ye(l) T DT ) − µyµ T y . (5.26b) is is equal to the parameters of dpmc viewed from the rst perspective (combining (5.21) and (5.22)). As section 5.2.2 has proven, this is because qe, and therefore q, 127 chapter 5. compensating correlations is Gaussian, and D is a linear projection. e next section will introduce extended iterative dpmc, which trains a mixture of Gaussians rather than one Gaussian, using the second perspective. 5.3.2 Extended IDPMC Extended iterative dpmc is an extension of dpmc to train a mixture of Gaussians, as iterative dpmc is to dpmc (see section 4.4.1). It should be possible to train a mixture of Gaussians over extended feature vectors, and convert each of the Gaussians to be over standard vectors with statics and dynamics aŸerwards. However, as section 5.2.2 has shown, there is a requirement for this to be equivalent to optimising the distri- bution in the standard domain. When the extended distribution is transformed to a distribution with as many dimensions, some of which are the statics and dynamics, and the rest the nuisance dimensions, the distribution over the nuisance dimensions must be separate from the other distribution. In the case of a mixture of Gaussians, the nuisance dimensions are not allowed to depend on the hidden variable which in- dicates which component has produced the observation.is would mean that while training the mixture of Gaussians, some dimensions, in a dišerent feature space, must be tied across components. A more straightforward way of deriving iterative dpmc is from the second perspective on dpmc, which converts each sample to the standard domain (in (5.24)) rst, and then trains the distribution on those samples. 
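To make this second perspective concrete, the following minimal sketch (one one-dimensional static coefficient, a window of ±1, the phase factor and convolutional noise omitted; all parameter values illustrative) draws extended samples, applies the per-time-instance mismatch function as in (5.18), projects each sample with D as in (5.24), and fits a single Gaussian as in (5.26). Extended iterative dpmc replaces this last step with expectation–maximisation over a mixture of Gaussians.

```python
import numpy as np

rng = np.random.default_rng(0)

# Projection from extended statics to [static, delta], as in (5.1a).
D = np.array([[0.0, 1.0, 0.0],
              [-0.5, 0.0, 0.5]])

# Extended clean speech and noise Gaussians: means and full covariances
# over the three time offsets (illustrative values only).
idx = np.arange(3)
mu_x_e = np.array([10.0, 10.5, 11.0])
Sigma_x_e = 1.0 * 0.7 ** np.abs(idx[:, None] - idx[None, :])
mu_n_e = np.array([9.0, 9.0, 9.0])
Sigma_n_e = 2.0 * 0.9 ** np.abs(idx[:, None] - idx[None, :])

def mismatch(x, n):
    # Per-time-instance mismatch applied to each offset, as in (5.18).
    return np.logaddexp(x, n)

L = 100000
x_e = rng.multivariate_normal(mu_x_e, Sigma_x_e, size=L)
n_e = rng.multivariate_normal(mu_n_e, Sigma_n_e, size=L)
y_e = mismatch(x_e, n_e)      # extended corrupted-speech samples
y = y_e @ D.T                 # convert to statics and dynamics (5.24)

# Maximum-likelihood Gaussian over statics and dynamics (5.26).
mu_y = y.mean(axis=0)
Sigma_y = np.cov(y, rowvar=False, bias=True)
print(mu_y)
print(Sigma_y)
```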
Training amixture of Gaussians from extended samples without tying dimensions is not guaranteed to be optimal, whereas training it on samples with statics and dy- namics is.is is straightforward to see from the procedure of training iterativedpmc. Trainingmixtures of Gaussians uses expectation–maximisation, with in each iteration assigns responsibilities (component-sample posteriors) to train the parameters of each Gaussian.e iteration is guaranteed not to decrease the likelihood, i.e. not to increase the kl divergence with the empirical distribution. Since recognition uses statics and dynamics, it is the likelihoods in that domain that the responsibilities should be com- puted for to optimise the kl divergence between p and q. 128 5.3. compensating extended distributions Extended iterative dpmc, then, uses the samples y(l), found with (5.24). Training the mixture of Gaussians then applies exactly the iteration in (4.35): ρ(k)(m, l) := pi (k−1) m q (m)(k−1) ( y(l) )∑ m ′∈Ω(θ) pi (k−1) m ′ q (m ′)(k−1) ( y(l) ) ; (5.27a) pi (k) m := 1 L ∑ l ρ(k)(m, l); (5.27b) µ (m)(k) y := 1∑ l ρ (k)(m, l) ∑ l ρ(k)(m, l)y(l); (5.27c) Σ (m)(k) y := ( 1∑ l ρ (k)(m, l) ∑ l ρ(k)(m, l)y(l)y(l) T ) − µ (m)(k) y µ (m)(k) y T . (5.27d) Analogously to idpmc, a good initial setting for extended idpmc, which this thesis will use, is extended dpmc. To increase the number of components, it will again ap- ply “mixing up”: the component with the greatest mixture weight is split into two components. en, a few iterations of expectation–maximisation are run, and the procedure is repeated until the desired number of components has been reached. In the limit as the number of componentsM and the number of samples L go to innity, the modelled distribution of the corrupted speech can become equal to the real one. However, as explained in section 4.4.1, to train the mixture model well, the parameters of each component need to be trained on a large enough portion of the samples.erefore, as the number of componentsM is increased, Lmust grow faster thanM. One iteration of expectation–maximisation takesO(LM) time.e number of iterations of mixing up and expectation–maximisation increases linearly withM. In practice, therefore, the time complexity of idpmc is at leastO(M3). It is important to realise the dišerence with normal training of speech recognisers, where the amount of training data is limited, and the number of components is there- fore limited. Here, an unlimited number of samples can be generated within machine limitations, and a high number of components can be chosen. To represent the noise- corrupted speech as well as possible, a much larger number of samples is useful than is usually required for training a speech recogniser, especially to estimate full covariance 129 chapter 5. compensating correlations matrices. However, it is not a priori clear how many Gaussians is enough to correctly represent the corrupted speech distribution in 39 dimensions. e experiments in section 8.2, which will model the distribution as exactly as possible, will increase the number of components, and therefore the number of samples, to the machine limits. 5.3.3 Extended VTS Another method of nding Gaussian compensation adapts vts compensation (sec- tion 4.4.2) to extended feature vectors. Just like the rst presentation of extended dpmc in section 5.3.1, extended vts (evts) will approximate the extended corrup- ted speech with a Gaussian, and then convert the Gaussian to the standard domain. 
Section 5.2.1 has shown that, if the approximation is Gaussian, minimising the kl di- vergence between the extended distribution and its approximation is equivalent to optimising the kl divergence in the standard domain with static and dynamic fea- tures. Since the extended feature vector is a concatenation of static feature vectors, it is possible to use a linearised version of the static mismatch function for each time ošset to yield an overall mismatch function for ye. is alleviates the need for the continuous-time approximation for dynamics, so that no distribution over dynamics is required for the phase factor. It is possible to dene a distribution for the exten- ded phase factor αe. e elements of αe are approximately Gaussian distributed but constrained to [−1,+1] (see section 5.5.3). To simplify the distribution ofys, the con- straint can be ignored, so that αe ∼ N (0,Σeα). An extension to static compensation using vts can be used to nd the exten- ded corrupted speech distribution.e rst-order vector Taylor series approximation in (4.36a) is applied to each time instance separately. us the expansion point f0r each time instance is given by the static means at the appropriate time ošsets. ese are obtained from the extended distributions distributions over the clean speech xe, noise ne, and the phase factor αe.us, using the form of the vts approximation in 130 5.3. compensating extended distributions (4.36a) per time instance: fevts(x e t ,n e t ,h e t ,α e t) ,  ft−1,vts(xt−1,nt−1,ht−1,αt−1) ft,vts(xt,nt,ht,αt) ft+1,vts(xt+1,nt+1,ht+1,αt+1)  =  f ( µsx−1 ,µ s n−1 ,µsh−1 , 0 ) + Jx−1 ( xst−1 − µ s x−1 ) + Jn−1 ( nst−1 − µ s n−1 ) + Jα−1α s t−1 f ( µsx0 ,µ s n0 ,µsh0 , 0 ) + Jx0 ( xst − µ s x0 ) + Jn0 ( nst − µ s n0 ) + Jα0α s t f ( µsx+1 ,µ s n+1 ,µsh+1 , 0 ) + Jx+1 ( xst+1 − µ s x+1 ) + Jn+1 ( nst+1 − µ s n+1 ) + Jα+1α s t+1  , (5.28) where the ošset-dependent Jacobians are given by Jx−1 = dys dxs ∣∣∣∣ µsn−1 ,µsx−1 ,µsh−1 ,0 ; Jx0 = dys dxs ∣∣∣∣ µsn0 ,µsx0 ,µsh0 ,0 ; Jx+1 = dys dxs ∣∣∣∣ µsn+1 ,µsx+1 ,µsh+1 ,0 , (5.29) and similar for Jn and Jα. Note that Jx0 is equal to the Jacobian for standard vts in (4.36b). As in standard vts (in (4.41)), approximating the corrupted speech distribution works by substituting the vector Taylor series approximation of the mismatch func- tion. Again, using this approximation, the extended corrupted speech distribution drops out as Gaussian. e approximation qe to the extended corrupted speech dis- tribution then is qe(ye) := ∫ ∫ ∫ δfevts(xe,ne,he,αe)(y e)p(xe,ne,he,αe)dαednedxe = ∫ ∫ ∫ δfevts(xe,ne,he,αe)(y e)N (xe; µex, Σex) N (ne; µen, Σen)N (αe; µeα, Σeα) dαe dne dxe , N (ye; µey, Σey) . (5.30) is Gaussian is found by compensating each block of themean and covariance separ- ately with the appropriately linearised mismatch function.e structure of the para- 131 chapter 5. compensating correlations meters of qe is (repeated from (5.4)) µey =  µsy−1 µsy0 µsy+1  ; Σey =  Σsy−1y−1 Σ s y−1y0 Σsy−1y+1 Σsy0y−1 Σ s y0y0 Σsy0y+1 Σsy+1y−1 Σ s y+1y0 Σsy+1y+1  . (5.31) e parameters of this extended corrupted speech distribution can be found by com- puting expectations over the Gaussians. e mean for time ošset +1, for example, uses the linearisation for that time instance: µsy+1 = E { ft+1,vts(x s t+1,n s t+1,h s t+1,α s t+1) } = f ( µsx+1 ,µ s n+1 ,µsh+1 ) . 
(5.32) e covariancematrixΣey requires the correlations between all time ošsets in the win- dow.e covariance between ošsets 0 and+1, for example, uses the linearisations for time instances 0 and +1: Σsy0y+1 = E { (ft,vts(x s 0,n s 0,h s 0,α s 0) − µy0)(ft+1,vts(x s +1,n s +1,h s +1,α s +1) − µy+1) T} = E {( Jx0(x s t − µ s x0 ) + Jn0(n s t − µ s n0 ) + Jα0α s t ) ( Jx+1(x s t+1 − µ s x+1 ) + Jn+1(n s t+1 − µ s n+1 ) + Jα+1α s t+1 )T} = Jx0Σ s x0x+1 JTx+1 + Jn0Σ s n0n+1 JTn+1 + Jα0Σ s α0α+1 JTα+1 . (5.33) is is applied for each of the possible time ošset blocks in Σey. us, each block (t, t ′) in the covariance matrix combines Jacobians at time ošsets t and t ′ and cross- covariances of the speech, noise, and phase factor for t and t ′.is is the main dišer- ence between standard vts (which uses the continuous-time approximation) and ex- tendedvts. Section 5.3.3.1 will examine standardvts in the light of this. Section 5.3.3.2 will discuss the time complexity of extended vts. 5.3.3.1 Relationship with VTS It is interesting to examine the relationship between standard vts and extended vts described in the previous section. It is possible to describe the mismatch function for 132 5.3. compensating extended distributions the dynamic coe›cients of standard vts, which uses the continuous-time approxim- ation for the dynamic coe›cients, in terms of extended vts. To be able to compare with standard vts at all, the phase factor has to be assumed 0. e approximation that standard vts uses for the static coe›cients is exactly the same as the one extended vts uses for the statics at the centre time ošset.erefore, the compensated static mean and covariance that standard vts nds are the same as the ones extended vts nds for time ošset 0. However, dynamic parameter compens- ation with standard vts uses the continuous-time approximation.is uses the vector Taylor series expansion point of the static coe›cients for all the dynamic coe›cients. When the vector Taylor series expansion uses the same clean speech and noise means µsx0 ,µ s n0 and thus the same Jacobian J0 for all time ošsets in (5.28), it becomes yst−1 yst yst+1  '  f ( µsx0 ,µ s n0 ,µsh0 , 0 ) + Jx0 ( xst−1 − µ s x0 ) + Jn0 ( nst−1 − µ s n0 ) f ( µsx0 ,µ s n0 ,µsh0 , 0 ) + Jx0 ( xst − µ s x0 ) + Jn0 ( nst − µ s n0 ) f ( µsx0 ,µ s n0 ,µsh0 , 0 ) + Jx0 ( xst+1 − µ s x0 ) + Jn0 ( nst+1 − µ s n0 )  . (5.34) is approximation results in the following when substituted in the expression for computing dynamic coe›cients in (2.8): y∆t = ∑w τ=1 τ(y s t+τ − y s t−τ) 2 ∑w τ=1 τ 2 = ∑w τ=1 τ ( Jx0x s t+τ + Jn0n s t+τ − Jx0x s t−τ − Jn0n s t−τ ) 2 ∑w τ=1 τ 2 = Jx0 ∑w τ=1 τ ( xst+τ − x s t−τ ) + Jn0 ∑w τ=1 τ ( nst+τ − n s t−τ ) 2 ∑w τ=1 τ 2 = Jx0x ∆ t + Jn0n ∆ t . (5.35) is is exactly the same expression as the continuous-time approximation when ap- plied to vts compensation (in (4.37)). Extended vts compensation therefore becomes equivalent to standard vts compensation when the expansion point is chosen equal for all time ošsets. Extended vts performs the transformation from extended to standard parameters aŸer compensation. is allows extended vts to use a dišer- ent vector Taylor series expansion point for every time ošset to nd more accurate compensation. 133 chapter 5. compensating correlations vts evts Statistics diag. block-d. striped full Decoding diag. block-d. diag. 
                    VTS                          eVTS
    Statistics      diag.        block-d.        striped         full
    Decoding        diag.        block-d.        diag.           full
    Jacobians       s³                           e s³
    Compensation    d_Δ s²       d_Δ s³          e² s²           e² s³
    Projection      —            —               d_Δ s e²        d_Δ² s² e²

Table 5.1 Computational complexity O(·) per component for compensation with VTS and with extended VTS, for diagonal blocks and for full blocks.

5.3.3.2 Computational cost

Compensation with extended VTS is more computationally expensive than VTS with the continuous-time approximation. This section examines the differences in detail. The computational complexity per component will be expressed in terms of the size of the static feature vector s (typically 13), the total width of the window e = 4w + 1 (typically 9), and the number of orders of statics and dynamics d_Δ (typically 3). Since the calculation of the covariance matrices dominates the computation time, the analysis will not explicitly consider the means.

Table 5.1 compares the computational complexity of the three operations that can be distinguished in extended VTS compensation. The first one is computing the Jacobian of the mismatch function, which takes O(s³). Standard VTS compensation uses one linearisation point per component, and therefore needs to compute the Jacobian only once. Extended VTS, however, uses a different linearisation point for each of the e time offsets in the window, and computes a Jacobian for each of these.

Compensation of the covariance matrix is done one s × s block at a time. The expression for standard VTS (see (4.42b)) is of the form
\[
\Sigma^s_y := J_x \Sigma_x J_x^T + J_n \Sigma_n J_n^T. \quad (5.36)
\]
The expression for extended VTS compensation has the same form (but different variables) for each block of the covariance matrix. It has time complexity O(s²) if the blocks for the noise Σ^s_n, the clean speech Σ^s_x, and the corrupted speech Σ^s_y are diagonal. For standard VTS, this happens when the covariances for statistics and decoding are all diagonal; for extended VTS, when the blocks in the covariance matrices for the statistics are diagonal (“striped”) and the covariances for decoding are diagonal. When either the statistics or the compensation uses full covariance matrices, compensation takes O(s³). For VTS, the d_Δ blocks along the diagonal are compensated; for extended VTS, the ½e(e+1) distinct blocks in the extended covariance matrix. The row labelled “Compensation” in table 5.1 summarises this.

Extended VTS projects the compensated distribution onto the standard feature space with statics and dynamics. Since the blocks of the projection matrix D are diagonal, computing one entry of the resulting covariance matrix Σ_y = D Σ^e_y D^T takes O(e²). For diagonal-covariance decoding, d_Δ s entries need to be computed; for full-covariance decoding, d_Δ² s².

Thus, for full-covariance compensation, the computational complexity of eVTS is significantly higher than that of standard VTS. However, in practice per-Gaussian compensation is often too costly even when the standard version of VTS is used. Joint uncertainty decoding (JUD) (Liao 2007; section 4.4.3) addresses this by computing compensation per base class rather than per Gaussian component. Section 5.4 will detail how to find a joint distribution using extended VTS. Chapter 6 will deal with another important issue: the computational cost of decoding. If full-covariance compensation is found, joint uncertainty decoding still compensates for changes in the correlations by decoding with full covariance matrices. This is slow. Predictive linear transformations can solve this issue by applying transformations to the feature vectors that eliminate the need to decode with full covariance matrices.
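As a small illustration of the projection step, the sketch below builds the projection matrix D for the simple case of statics plus first-order dynamics computed over a window of ±1 (so e = 3, as in (5.56)) and maps an extended mean and covariance to the standard domain. The window width and the absence of higher-order dynamics are simplifying assumptions; names are illustrative.

```python
import numpy as np

def projection_matrix(s, w=1):
    """Projection D from a window of 2w+1 static vectors to statics plus
    first-order dynamics, following the delta definition in (2.8)."""
    e = 2 * w + 1
    D = np.zeros((2 * s, e * s))
    D[:s, w*s:(w+1)*s] = np.eye(s)                      # statics: centre frame
    norm = 2.0 * sum(tau**2 for tau in range(1, w + 1))
    for tau in range(1, w + 1):
        D[s:, (w+tau)*s:(w+tau+1)*s] += tau / norm * np.eye(s)
        D[s:, (w-tau)*s:(w-tau+1)*s] -= tau / norm * np.eye(s)
    return D

# Project an extended Gaussian onto the static-plus-dynamic domain, as in (5.1).
s, w = 13, 1
D = projection_matrix(s, w)
mu_e = np.random.randn((2*w + 1) * s)
A = np.random.randn((2*w + 1) * s, (2*w + 1) * s)
Sigma_e = A @ A.T                                       # any positive-definite matrix
mu_y, Sigma_y = D @ mu_e, D @ Sigma_e @ D.T
```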
5.3.4 Algonquin with extended feature vectors

Section 4.5.1 has discussed the Algonquin algorithm, which applies a vector Taylor series approximation but iteratively updates the expansion point to fit the observation better. This uses the joint distribution of the sources, the speech and the noise, and the observation. The way it was originally presented, it only acted on the static part of the feature vectors. This is all that is necessary for feature enhancement. The covariance matrix of the observation was assumed to be diagonal. To make the process consistent, the relation between the sources and the observation must then also be assumed independent across dimensions, which is an additional approximation. If this approximation is to be removed, then the observation covariance must be full, which runs into the same problem as for standard VTS compensation: compensation for dynamics, which applies the continuous-time approximation, is not accurate enough.

However, it is possible to apply the Algonquin algorithm to extended feature vectors so as not to rely on the continuous-time approximation. This makes it possible to express the influence that one coefficient in the speech vector, for example, has on another coefficient in the observation vector. The main intuition is that once the mismatch function is linearised for a component, the extended speech and noise vectors and the observation vector with statics and dynamics are jointly Gaussian. The following will detail how this intuition can be used to find an extended Algonquin algorithm.

Extended VTS applies the mismatch function per time instance. One way of viewing the resulting transformation from extended speech and noise to extended observation vectors is as a function f^e_vts defined in (5.28):
\[
y^e \simeq f^e_{\mathrm{vts}}(x^e, n^e, h^e, \alpha^e). \quad (5.37)
\]
Just like in the original Algonquin algorithm, the phase factor will not be assumed to have a distribution, but assumed 0; instead, the uncertainty will be modelled with an error term N(0, Ψ) on the observation. Also, for notational convenience, the convolutional noise h^e will again be omitted. It is handled in the same way as the additive noise n^e. The linearised mismatch function f^{e(k)}_vts at iteration k relates the extended vectors of the sources to the extended observation vector y^e, which in turn is related to the observation vector with statics and dynamics by the linear projection D as in (5.1a), plus the error term N(0, Ψ). This implies that if x^e and n^e are Gaussian, then x^e, n^e and y are jointly Gaussian, similarly to (4.58):
\[
q^{(k)}_{y_t}\!\left(\begin{bmatrix} x^e \\ n^e \\ y \end{bmatrix}\right) = \mathcal{N}\!\left(\begin{bmatrix} x^e \\ n^e \\ y \end{bmatrix};
\begin{bmatrix} \mu^e_x \\ \mu^e_n \\ \mu^{(k)}_y \end{bmatrix},
\begin{bmatrix} \Sigma^e_x & 0 & \Sigma^{(k)}_{xy} \\ 0 & \Sigma^e_n & \Sigma^{(k)}_{ny} \\ \Sigma^{(k)}_{yx} & \Sigma^{(k)}_{yn} & \Sigma^{(k)}_{y} \end{bmatrix}\right). \quad (5.38)
\]
μ^e_x, μ^e_n, Σ^e_x, and Σ^e_n are taken directly from the priors of the speech and the noise. The Jacobians that relate the extended speech and noise to the observation with statics and dynamics, y, are
\[
J^{(k)}_x = \left.\frac{\mathrm{d}y}{\mathrm{d}x}\right|_{x^{(k)}_0}
= \left.\frac{\mathrm{d}y}{\mathrm{d}y^e}\frac{\mathrm{d}y^e}{\mathrm{d}x}\right|_{x^{(k)}_0}
= D \begin{bmatrix} J^{s(k)}_{x_{-1}} & 0 & 0 \\ 0 & J^{s(k)}_{x_{0}} & 0 \\ 0 & 0 & J^{s(k)}_{x_{+1}} \end{bmatrix}; \quad (5.39a)
\]
\[
J^{(k)}_n = \left.\frac{\mathrm{d}y}{\mathrm{d}n}\right|_{n^{(k)}_0}
= \left.\frac{\mathrm{d}y}{\mathrm{d}y^e}\frac{\mathrm{d}y^e}{\mathrm{d}n}\right|_{n^{(k)}_0}
= D \begin{bmatrix} J^{s(k)}_{n_{-1}} & 0 & 0 \\ 0 & J^{s(k)}_{n_{0}} & 0 \\ 0 & 0 & J^{s(k)}_{n_{+1}} \end{bmatrix}. \quad (5.39b)
\]
The parameters of the joint distribution that depend on the linearisation at iteration k are then
\[
\Sigma^{(k)}_{yx} = J^{e(k)}_x \Sigma^e_x; \quad (5.40a) \qquad
\Sigma^{(k)}_{yn} = J^{e(k)}_n \Sigma^e_n; \quad (5.40b) \qquad
\Sigma^{(k)}_{y} = J^{e(k)}_x \Sigma^e_x J^{e(k)T}_x + J^{e(k)}_n \Sigma^e_n J^{e(k)T}_n + \Psi. \quad (5.40c)
\]
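The sketch below assembles the joint Gaussian of (5.38) from per-offset Jacobians and the projection D for one linearisation (one iteration k). It assumes the per-offset Jacobians, the projection matrix, the priors, the error-term covariance and the mismatch function evaluated at the expansion point are already available; the names are illustrative, not thesis code.

```python
import numpy as np
from scipy.linalg import block_diag

def algonquin_joint(mu_x_e, Sigma_x_e, mu_n_e, Sigma_n_e,
                    J_x_blocks, J_n_blocks, D, Psi, mismatch_mean):
    """Joint Gaussian over (x^e, n^e, y) for one linearisation, as in (5.38)-(5.40).
    J_x_blocks, J_n_blocks: per-offset s x s Jacobians of the static mismatch
    function; D projects the extended observation onto statics and dynamics;
    Psi is the error-term covariance; mismatch_mean is f^e at the expansion
    point, so that the observation mean is D @ mismatch_mean."""
    J_x = D @ block_diag(*J_x_blocks)       # (5.39a)
    J_n = D @ block_diag(*J_n_blocks)       # (5.39b)
    Sigma_yx = J_x @ Sigma_x_e              # (5.40a)
    Sigma_yn = J_n @ Sigma_n_e              # (5.40b)
    Sigma_y = J_x @ Sigma_x_e @ J_x.T + J_n @ Sigma_n_e @ J_n.T + Psi  # (5.40c)
    mu_y = D @ mismatch_mean
    mu = np.concatenate([mu_x_e, mu_n_e, mu_y])
    Sigma = np.block([
        [Sigma_x_e, np.zeros_like(Sigma_x_e), Sigma_yx.T],
        [np.zeros_like(Sigma_n_e), Sigma_n_e, Sigma_yn.T],
        [Sigma_yx, Sigma_yn, Sigma_y]])
    return mu, Sigma
```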
Having set up the joint distribution in (5.38), the Algonquin algorithm proceeds as in section 4.5.1. From the joint distribution, each observation gives a Gaussian approximation of the posterior distribution of the extended speech and the noise. The expansion point is set to the mean of this distribution, which yields a newly linearised mismatch function f^e_vts, and therefore a new joint distribution. After a few iterations of this process, the expansion point should be centred on the actual posterior of the speech and the noise. The advantage of applying this to extended feature vectors is that the distribution of the corrupted speech with dynamic coefficients can be modelled with a full covariance matrix. The original algorithm diagonalised the corrupted speech distribution. To make the joint distribution valid, the Jacobians must then also be assumed diagonal, which is an additional approximation compared to VTS compensation. Using extended feature vectors, compensation is of good enough quality to find full covariance matrices.

5.4 Extended joint uncertainty decoding

Section 4.4.3 has discussed joint uncertainty decoding, which applies a compensation method to a base class at once. The Gaussian joint distribution can be estimated using any model compensation method, with an appropriate extension. Given the joint distribution per base class, it is possible to compensate the components in that base class more quickly than by applying the compensation method to each component separately.

Applying extended VTS to each component is slower than normal VTS. Therefore, applying extended VTS per base class leads to an even greater increase in speed than applying standard VTS per base class. However, one of the important aspects of extended VTS is that it can generate full-covariance compensation. This leads joint uncertainty decoding to produce full-covariance compensation as well, which slows down decoding. Therefore, section 6.3 will present predictive linear transformations, which enable fast decoding from full-covariance joint uncertainty decoding. This section will produce joint Gaussian distributions with extended DPMC and extended VTS (repeated from (4.46)):
\[
\begin{bmatrix} x \\ y \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{bmatrix} \right). \quad (5.41)
\]
Since this joint distribution over statics and dynamics is Gaussian, finding the optimal extended Gaussian distribution and then converting it yields the optimal distribution over statics and dynamics, as in section 5.2. The extended joint distribution is
\[
\begin{bmatrix} x^e \\ y^e \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu^e_x \\ \mu^e_y \end{bmatrix}, \begin{bmatrix} \Sigma^e_x & \Sigma^e_{xy} \\ \Sigma^e_{yx} & \Sigma^e_y \end{bmatrix} \right). \quad (5.42)
\]
To transform the joint distribution over extended feature vectors in (5.42) into the one over statics and dynamics in (5.41), the relation between the joint vectors,
\[
\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} D & 0 \\ 0 & D \end{bmatrix} \begin{bmatrix} x^e \\ y^e \end{bmatrix}, \quad (5.43)
\]
is applied to the Gaussian. The joint extended distribution in (5.42) therefore transforms to
\[
\begin{bmatrix} x \\ y \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} D\mu^e_x \\ D\mu^e_y \end{bmatrix}, \begin{bmatrix} D\Sigma^e_x D^T & D\Sigma^e_{xy} D^T \\ D\Sigma^e_{yx} D^T & D\Sigma^e_y D^T \end{bmatrix} \right). \quad (5.44)
\]
Given these parameters per base class, decoding uses the same form as standard joint uncertainty decoding, in section 4.4.3. The rest of this section will therefore estimate the parameters of the joint extended distribution in (5.42).
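A short sketch of the conversion in (5.43) and (5.44), taking the projection D as given (any matrix of the right shape; names are illustrative):

```python
import numpy as np
from scipy.linalg import block_diag

def extended_joint_to_standard(mu_x_e, mu_y_e, Sigma_joint_e, D):
    """Convert the extended joint Gaussian (5.42) to the joint distribution
    over statics and dynamics (5.41) by applying blockdiag(D, D) as in (5.43)."""
    T = block_diag(D, D)                      # acts on the stacked vector [x^e; y^e]
    mu_joint = T @ np.concatenate([mu_x_e, mu_y_e])
    Sigma_joint = T @ Sigma_joint_e @ T.T     # yields the blocks of (5.44)
    return mu_joint, Sigma_joint
```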
Just like for standard joint uncertainty decoding, the original compensation methods (here, extended DPMC and extended VTS) already provide the clean and corrupted speech marginals x^e ∼ N(μ^e_x, Σ^e_x) and y^e ∼ N(μ^e_y, Σ^e_y). The clean speech distribution is given, and the corrupted speech Gaussian is what a model compensation method finds. The cross-covariance Σ^e_xy is what the extension needs to find.

With extended DPMC Section 4.4.3.1 has detailed how to find a standard joint distribution with DPMC. The procedure for producing an extended joint distribution with extended DPMC is analogous. Section 5.3.1 has discussed how to draw extended samples y^{e(l)} from the noise-corrupted speech distribution. To train the joint distribution, sample pairs of both the clean speech and the corrupted speech are retained. The empirical distribution has L delta spikes at positions (x^{e(l)}, y^{e(l)}), analogous to (5.19):
\[
\tilde p(x^e, y^e) = \frac{1}{L} \sum_l \delta_{(x^{e(l)},\, y^{e(l)})}. \quad (5.45)
\]
Just like in (5.21), the parameters of the Gaussian are set to minimise the KL divergence to the empirical distribution. This is equivalent to maximising the likelihood of the resulting distributions on the samples. However, for the joint distribution, the mean and covariance parameters are set at once to the mean and covariance of the tuple (x, y) under the empirical distribution:
\[
\begin{bmatrix} \mu^e_x \\ \mu^e_y \end{bmatrix} := \frac{1}{L}\sum_{l=1}^L \begin{bmatrix} x^{e(l)} \\ y^{e(l)} \end{bmatrix}; \quad (5.46a)
\]
\[
\begin{bmatrix} \Sigma^e_x & \Sigma^e_{xy} \\ \Sigma^e_{yx} & \Sigma^e_y \end{bmatrix} := \frac{1}{L}\sum_{l=1}^L \begin{bmatrix} x^{e(l)} \\ y^{e(l)} \end{bmatrix} \begin{bmatrix} x^{e(l)} \\ y^{e(l)} \end{bmatrix}^T - \begin{bmatrix} \mu^e_x \\ \mu^e_y \end{bmatrix} \begin{bmatrix} \mu^e_x \\ \mu^e_y \end{bmatrix}^T. \quad (5.46b)
\]

With extended VTS When finding the joint distribution with extended DPMC, the estimation of the cross-covariance is implicit. When using extended VTS for the same purpose, however, the structure of the cross-covariance has to be considered. It is analogous to (5.4):
\[
\Sigma^e_{yx} = \begin{bmatrix}
\Sigma^s_{y_{-1}x_{-1}} & \Sigma^s_{y_{-1}x_{0}} & \Sigma^s_{y_{-1}x_{+1}} \\
\Sigma^s_{y_{0}x_{-1}} & \Sigma^s_{y_{0}x_{0}} & \Sigma^s_{y_{0}x_{+1}} \\
\Sigma^s_{y_{+1}x_{-1}} & \Sigma^s_{y_{+1}x_{0}} & \Sigma^s_{y_{+1}x_{+1}}
\end{bmatrix}. \quad (5.47)
\]
Each block of this can be found analogously to (5.33). For example, for the cross-covariance between the corrupted speech at time instance 0 and the clean speech at time instance +1, noting that the clean speech is assumed independent of the noise and the phase factor,
\[
\Sigma^s_{y_0 x_{+1}} = \mathbb{E}\bigl\{ (f_{\mathrm{vts}}(x^s_0, n^s_0, h^s_0, \alpha^s_0) - \mu_{y_0})\bigl(x_{t+1} - \mu_{x_{+1}}\bigr)^T \bigr\}
= \mathbb{E}\bigl\{ \bigl(J_{x_0}(x^s_t - \mu^s_{x_0}) + J_{n_0}(n^s_t - \mu^s_{n_0}) + J_{\alpha_0}\alpha^s_t\bigr)\bigl(x^s_{t+1} - \mu^s_{x_{+1}}\bigr)^T \bigr\}
= J_{x_0}\Sigma^s_{x_0 x_{+1}}. \quad (5.48)
\]
The equivalent computation can be performed for all blocks of Σ^e_{yx}. This gives the complete joint distribution in (5.42).
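A minimal sketch of the sample-based estimate in (5.46), assuming paired extended clean and corrupted samples are already available as arrays (names illustrative):

```python
import numpy as np

def joint_gaussian_from_samples(x_e, y_e):
    """Estimate the extended joint Gaussian (5.42) from L paired samples,
    as in (5.46).  x_e, y_e: arrays of shape (L, d_x) and (L, d_y)."""
    z = np.hstack([x_e, y_e])                     # each row is [x^e(l); y^e(l)]
    mu = z.mean(axis=0)                           # (5.46a)
    Sigma = z.T @ z / len(z) - np.outer(mu, mu)   # (5.46b)
    d_x = x_e.shape[1]
    # The blocks of (5.42): clean mean/cov, cross-covariance, corrupted mean/cov.
    return (mu[:d_x], mu[d_x:],
            Sigma[:d_x, :d_x], Sigma[:d_x, d_x:], Sigma[d_x:, d_x:])
```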
5.5 Extended statistics

A practical issue when compensating extended feature vectors is the form of the statistics for the clean speech and the noise. For standard VTS, the clean speech statistics are usually taken from the recogniser trained on clean speech, and the noise model is usually estimated with maximum-likelihood estimation, as discussed in section 4.7. In contrast, extended VTS and extended DPMC require distributions over the extended clean speech and noise vectors. As these have more parameters than standard statistics, robustness and storage requirements need to be carefully considered. A model for the phase factor is also needed.

5.5.1 Clean speech statistics

Model compensation schemes, such as VTS, use the Gaussian components from the uncompensated system as the clean speech distributions. For extended VTS and extended DPMC, however, distributions over the extended clean speech vector are required. For one extended clean speech Gaussian N(μ^e_x, Σ^e_x), the parameters are of the same form as in (5.4):
\[
\mu^e_x = \begin{bmatrix} \mu^s_{x_{-1}} \\ \mu^s_{x_{0}} \\ \mu^s_{x_{+1}} \end{bmatrix};\qquad
\Sigma^e_x = \begin{bmatrix}
\Sigma^s_{x_{-1}x_{-1}} & \Sigma^s_{x_{-1}x_{0}} & \Sigma^s_{x_{-1}x_{+1}} \\
\Sigma^s_{x_{0}x_{-1}} & \Sigma^s_{x_{0}x_{0}} & \Sigma^s_{x_{0}x_{+1}} \\
\Sigma^s_{x_{+1}x_{-1}} & \Sigma^s_{x_{+1}x_{0}} & \Sigma^s_{x_{+1}x_{+1}}
\end{bmatrix}. \quad (5.49)
\]
In common with standard model compensation schemes, when there is no noise the compensated system should be the same as the original clean system trained with expectation–maximisation. To ensure that this is the case, single-pass retraining (Gales 1995) should be used to obtain the extended clean speech distributions. Here, the same posteriors (associated with the complete data set for expectation–maximisation) of the last standard clean speech training iteration (with static and dynamic parameters) are used to accumulate extended feature vectors around every time instance.

Another problem with using the extended statistics is ensuring robust estimation. The extended feature vectors contain more coefficients than the standard static and dynamic ones. Hence, the estimates of their distributions will be less robust and take up more memory. If full covariance matrices for Σ^e_x are stored and used, both first- and second-order dynamic parameters use window widths of ±2, and there are s static parameters, this requires estimating a 9s × 9s covariance matrix for every component. This is memory-intensive, and singular matrices and numerical accuracy problems can occur. One solution is to reduce the number of Gaussian components or states in the system. However, the precision of the speech model then decreases. Also, this makes it hard to compare the performance of compensation with extended VTS and standard VTS.

An alternative approach is to modify the structure of the covariance matrices, in the same fashion as diagonalising the standard clean speech covariance model. To maintain some level of inter-frame correlations, which may be useful for computing the dynamic parameters, each block is diagonalised. This yields the following structure:
\[
\Sigma^e_x = \begin{bmatrix}
\mathrm{diag}\bigl(\Sigma^s_{x_{-1}x_{-1}}\bigr) & \mathrm{diag}\bigl(\Sigma^s_{x_{-1}x_{0}}\bigr) & \mathrm{diag}\bigl(\Sigma^s_{x_{-1}x_{+1}}\bigr) \\
\mathrm{diag}\bigl(\Sigma^s_{x_{0}x_{-1}}\bigr) & \mathrm{diag}\bigl(\Sigma^s_{x_{0}x_{0}}\bigr) & \mathrm{diag}\bigl(\Sigma^s_{x_{0}x_{+1}}\bigr) \\
\mathrm{diag}\bigl(\Sigma^s_{x_{+1}x_{-1}}\bigr) & \mathrm{diag}\bigl(\Sigma^s_{x_{+1}x_{0}}\bigr) & \mathrm{diag}\bigl(\Sigma^s_{x_{+1}x_{+1}}\bigr)
\end{bmatrix}. \quad (5.50)
\]
For each Gaussian component, the ith element of the static coefficients for a time instance is then assumed correlated with only itself and the ith elements at other time instances. This causes Σ^e_x to have a striped structure with only 45s parameters rather than 9s(9s + 1)/2 for the full case. This type of covariance matrix will be called “striped”. It is a simple instantiation of structured precision matrix modelling,³ discussed in section 3.3.1, with the special attribute that when there is no noise it will still yield the standard static and dynamic clean speech parameters.

³A striped matrix can be expressed as a block-diagonal matrix transformed by a permutation matrix. This can straightforwardly be expressed in terms of basis matrices.
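A small sketch of imposing this striped structure on a full extended covariance, with the block layout of (5.49); illustrative only:

```python
import numpy as np

def stripe(Sigma_e, s, e=9):
    """Zero all within-block off-diagonal elements of an (e*s x e*s) extended
    covariance, keeping only the diagonal of every s x s block, as in (5.50)."""
    striped = np.zeros_like(Sigma_e)
    for t in range(e):
        for u in range(e):
            block = Sigma_e[t*s:(t+1)*s, u*s:(u+1)*s]
            striped[t*s:(t+1)*s, u*s:(u+1)*s] = np.diag(np.diag(block))
    return striped
```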
5.5.2 Noise model estimation

A noise model with extended feature vectors is necessary to perform compensation with extended VTS. This noise model is of the form M^e_n = {μ^e_n, Σ^e_n, μ^e_h}, with parameters
\[
n^e = \begin{bmatrix} n^s_{t-1} \\ n^s_{t} \\ n^s_{t+1} \end{bmatrix}
\sim \mathcal{N}\!\left( \begin{bmatrix} \mu^s_{n_{-1}} \\ \mu^s_{n_{0}} \\ \mu^s_{n_{+1}} \end{bmatrix},
\begin{bmatrix}
\Sigma^s_{n_{-1}n_{-1}} & \Sigma^s_{n_{-1}n_{0}} & \Sigma^s_{n_{-1}n_{+1}} \\
\Sigma^s_{n_{0}n_{-1}} & \Sigma^s_{n_{0}n_{0}} & \Sigma^s_{n_{0}n_{+1}} \\
\Sigma^s_{n_{+1}n_{-1}} & \Sigma^s_{n_{+1}n_{0}} & \Sigma^s_{n_{+1}n_{+1}}
\end{bmatrix}\right); \quad (5.51)
\]
\[
h^e = \begin{bmatrix} h^s_{t-1} \\ h^s_{t} \\ h^s_{t+1} \end{bmatrix} = \begin{bmatrix} \mu^s_{h_{-1}} \\ \mu^s_{h_{0}} \\ \mu^s_{h_{+1}} \end{bmatrix}. \quad (5.52)
\]
In this thesis, and in the majority of other work, the noise model consists of a single Gaussian component. The distribution for each time offset is therefore by definition the same. This means that the extended means for the additive and convolutional noise simply repeat the static means. The structure of the extended covariance Σ^e_n is also known. Since the noise is assumed identically distributed for all time instances, the correlation between time instances at the same distance is always the same. Thus, all diagonals of the covariance matrix repeat the same entries. Let Σ^s_{n_0}, Σ^s_{n_1}, Σ^s_{n_2} denote the cross-covariance between noise vectors that are 0, 1, or 2 time instances apart. The extended noise model then has the following form:
\[
\mu^e_n = \begin{bmatrix} \mu^s_n \\ \mu^s_n \\ \mu^s_n \end{bmatrix};\qquad
\Sigma^e_n = \begin{bmatrix}
\Sigma^s_{n_0} & \Sigma^{sT}_{n_1} & \Sigma^{sT}_{n_2} \\
\Sigma^s_{n_1} & \Sigma^s_{n_0} & \Sigma^{sT}_{n_1} \\
\Sigma^s_{n_2} & \Sigma^s_{n_1} & \Sigma^s_{n_0}
\end{bmatrix};\qquad
\mu^e_h = \begin{bmatrix} \mu^s_h \\ \mu^s_h \\ \mu^s_h \end{bmatrix}. \quad (5.53)
\]
In theory these noise parameters could be found using maximum-likelihood estimation. However, this would make the noise estimation process inconsistent between standard VTS and extended VTS. It would be useful to use the standard noise estimation schemes and map the parameters to the extended forms above. This has the additional advantage of limiting the number of parameters to be trained, thus ensuring robust estimation on small amounts of data. These standard noise parameters are (repeated from (4.83))
\[
\mu_n = \begin{bmatrix} \mu^s_n \\ 0 \end{bmatrix};\qquad
\Sigma_n = \begin{bmatrix} \mathrm{diag}(\Sigma^s_n) & 0 \\ 0 & \mathrm{diag}(\Sigma^\Delta_n) \end{bmatrix};\qquad
\mu_h = \begin{bmatrix} \mu^s_h \\ 0 \end{bmatrix}. \quad (5.54)
\]
The extended noise means are straightforward functions of the static means of the standard noise model (5.54). Similarly, Σ^s_{n_0}, the covariance between noise vectors 0 time instances apart, is the static noise covariance Σ^s_n. Computing the off-diagonal blocks of the extended covariance, however, is not as straightforward. The next subsections will discuss two ways of deriving the extended noise covariance from the standard noise covariance: the diagonal form, and a smooth reconstruction.

A simple way of reconstructing the extended noise covariance from a standard noise model assumes that the noise is uncorrelated between time instances. This is done by setting the off-diagonal blocks to zero, which yields
\[
\Sigma^e_n = \begin{bmatrix} \Sigma^s_n & 0 & 0 \\ 0 & \Sigma^s_n & 0 \\ 0 & 0 & \Sigma^s_n \end{bmatrix}. \quad (5.55)
\]
This only uses the static elements of the estimated noise covariance. For very low signal-to-noise ratio (SNR) conditions, this form of extended noise distribution will not yield the standard noise distributions for the dynamic parameters.

Another option is to use the dynamic parameters of the noise model to find a reconstruction of the extended noise covariance from (5.54). A problem is that the mapping from the extended feature domain to statics and dynamics is straightforward, but the reverse mapping is under-specified. The standard feature vector n_t is related to one in the extended domain n^e_t (analogously to (5.1a)) by
\[
n_t = \begin{bmatrix} n^s_t \\ n^\Delta_t \end{bmatrix}
= \begin{bmatrix} 0 & I & 0 \\ -\tfrac{I}{2} & 0 & \tfrac{I}{2} \end{bmatrix}
\begin{bmatrix} n^s_{t-1} \\ n^s_t \\ n^s_{t+1} \end{bmatrix} = D\, n^e_t. \quad (5.56)
\]
To reconstruct n^e_t from n_t, extra constraints are necessary, as D is not square and therefore not invertible. These constraints should yield an extended feature vector that represents a plausible sequence of static feature vectors. The Moore–Penrose pseudo-inverse of D could be used. However, this would result in the n^e_t with the smallest norm. For the three-block example used here, the reconstruction would set n^e_{t-1} = −n^e_{t+1} without any reference to the value of the static coefficients n^s_t. Thus, the Moore–Penrose pseudo-inverse might lead to reconstructions with large changes in coefficients from one time instance to the next. The need for smooth changes from time instance to time instance can be used as an additional constraint. Thus the aim is to find a smooth reconstruction whilst satisfying the constraints that yield the standard static and dynamic distributions.
To implement this constraint, rows representing higher-frequency changes are added to D, and zeros are appended to n_t to indicate their desired values. The extension of the projection matrix D, called E, can then be made invertible. Thus
\[
\begin{bmatrix} n^s_t \\ n^\Delta_t \\ 0 \end{bmatrix} = E\, n^e_t; \qquad
E^{-1} \begin{bmatrix} n^s_t \\ n^\Delta_t \\ 0 \end{bmatrix} = n^e_t. \quad (5.57)
\]
For the extra rows of E, the corresponding rows of the discrete cosine transform (DCT) matrix are appropriate, since they represent higher-order frequencies and are independent. The entries of an N × N DCT matrix C are given by (see (2.3))
\[
c_{ij} = \sqrt{\frac{2}{N}} \cos\frac{(2j-1)(i-1)\pi}{2N}. \quad (5.58)
\]
The form of E, which is D with DCT-derived blocks appended, is
\[
E = \begin{bmatrix} 0 & I & 0 \\ -\tfrac{I}{2} & 0 & \tfrac{I}{2} \\ c_{31}I & c_{32}I & c_{33}I \end{bmatrix}. \quad (5.59)
\]
Because the dynamic mean of the additive noise is zero, E^{-1}μ_n is equal to the extended mean in (5.53) (and similarly for the convolutional noise). To reconstruct the extended covariance Σ^e_n from a standard noise model, the cross-covariance between statics and dynamics can be ignored, and the higher-order covariance terms set to zero to make the reconstruction as smooth as possible. This results in the following expression:
\[
\begin{bmatrix} \Sigma^s_n & & \\ & \Sigma^\Delta_n & \\ & & 0 \end{bmatrix}
= E\, \Sigma^e_n\, E^T
= E \begin{bmatrix}
\Sigma^s_{n_0} & \Sigma^{sT}_{n_1} & \Sigma^{sT}_{n_2} \\
\Sigma^s_{n_1} & \Sigma^s_{n_0} & \Sigma^{sT}_{n_1} \\
\Sigma^s_{n_2} & \Sigma^s_{n_1} & \Sigma^s_{n_0}
\end{bmatrix} E^T, \quad (5.60)
\]
where the empty entries on the left-hand side are ignored. This is a system of three sets of matrix equalities, which can be solved straightforwardly.⁴ In this work, the estimated noise covariance matrix Σ_n is diagonal, so that Σ^s_{n_0}, Σ^s_{n_1}, Σ^s_{n_2} are also diagonal. This results in a striped matrix for Σ^e_n.

⁴This implicitly sets Σ^s_{n_0} to Σ^s_n.
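The sketch below solves (5.60) per dimension for the three-offset case, assuming diagonal static and delta noise covariances (so each dimension decouples) and building E as in (5.59). It is an illustrative reading of the reconstruction under these assumptions, not the thesis code.

```python
import numpy as np

def smooth_noise_reconstruction(var_static, var_delta):
    """Per-dimension solve of (5.60) for the three-offset case: recover the
    cross-covariances (sigma0, sigma1, sigma2) between noise frames 0, 1 and
    2 apart from the static and delta noise variances of one dimension."""
    # Rows of E for one dimension: statics, deltas, and the third DCT row (5.58/5.59).
    N = 3
    dct_row = np.sqrt(2.0 / N) * np.cos((2 * np.arange(1, N + 1) - 1) * 2 * np.pi / (2 * N))
    E = np.array([[0.0, 1.0, 0.0],
                  [-0.5, 0.0, 0.5],
                  list(dct_row)])
    # r Sigma r^T is linear in (sigma0, sigma1, sigma2) for the Toeplitz Sigma of (5.53).
    coeffs = np.array([[np.sum(r**2), 2 * (r[0]*r[1] + r[1]*r[2]), 2 * r[0]*r[2]]
                       for r in E])
    rhs = np.array([var_static, var_delta, 0.0])   # diagonal constraints of (5.60)
    sigma0, sigma1, sigma2 = np.linalg.solve(coeffs, rhs)
    return sigma0, sigma1, sigma2
```

For this three-offset case the solution reduces to σ₀ = Σ^s_{n,ii}, σ₂ = σ₀ − 2Σ^Δ_{n,ii}, and σ₁ = (3σ₀ + σ₂)/4, which illustrates how the delta variance fixes the longer-range correlation.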
5.5.2.1 Zeros in the noise variance estimate

An additional issue that can occur when estimating the noise model using maximum likelihood is that the noise variance estimates for some dimensions can become very small, or zero. Though such a value may optimise the likelihood, it does not necessarily reflect the “true” noise variance. This can lead to the following problem in compensation.

One cause of small noise variance estimates is that the clean speech silence models are never really estimated on silence. In practice, even for clean speech there are always low levels of background noise. Thus, the estimated noise is really only relative to this clean background level. At very high SNRs the noise may be at a similar level to the clean silence model. This will cause very small noise variance values. Another problem results from the form of the covariance matrix compensation. For the static parameters this may be written as (repeated from (4.42b))
\[
\Sigma^{s(m)}_y := J^{(m)}_x \Sigma^{s(m)}_x J^{(m)T}_x + J^{(m)}_n \Sigma^s_n J^{(m)T}_n. \quad (5.61)
\]
At low SNRs, J^{(m)}_x → 0 and J^{(m)}_n → I, so the corrupted speech distribution tends to the noise distribution. Conversely, at high SNRs, J^{(m)}_x → I and J^{(m)}_n → 0, and the corrupted speech distribution tends to the clean speech distribution. The impact of this when estimating the noise covariance matrix Σ^s_n in high-SNR conditions is that changes in the form of the noise covariance matrix have little impact on the final compensated distribution.

When VTS with the continuous-time approximation, along with diagonal corrupted speech covariance matrices, is used during both noise estimation and recognition, the process is self-consistent. However, if the noise estimates are used with eVTS to find full compensated covariance matrices, this is not the case. This slight mismatch can cause problems. To address this issue, a back-off strategy can be used. When the estimated noise variance has very low values, diagonal compensated variances can be used rather than full compensated covariance matrices. This will occur at high SNRs, where the correlation changes compared to the clean speech conditions should be small. In this condition, little gain is expected from full compensated covariance matrices.

5.5.2.2 Estimating an extended noise model

An alternative approach to address this problem is to make the noise estimation and decoding consistent for eVTS. This would mean estimating the parameters of the extended noise distribution directly. Some of the methods in section 4.7.1 can be extended. The most important difference with the circumstances in that section is that the compensated components now have full covariances. Section 4.7.1 discussed two types of methods: one type modelled the noise as a hidden variable in the expectation–maximisation framework, and the other directly optimised the noise model. Directly optimising the noise model has become harder because of the full covariance matrices. However, the biggest problem with treating the noise as a hidden variable was the inconsistency arising from the diagonalisation. Since the resulting component distributions are no longer diagonalised, this ceases to be a problem. This section will sketch how estimation of the extended noise distribution could proceed.

Just like in section 4.7, training uses expectation–maximisation (see section 2.3.2.1). Here, the hidden variables U consist of the component sequence m_t and the sequence of extended noise vectors n^e_t and h^e_t. At an abstract level, the expressions for the expectation and maximisation steps are the same as in (4.84):
\[
\rho^{(k)} := q^{(k)}_{U|Y} \propto q^{(k)}_{UY}; \quad (5.62a)
\]
\[
q^{(k)}_U := \arg\max_{q_U} \int \tilde p(Y) \int \rho^{(k)}(U|Y) \log q_U(U)\,\mathrm{d}U\,\mathrm{d}Y. \quad (5.62b)
\]
For simplicity, assume only additive noise. The expectation step is again approximate: the linearisation of the mismatch function from the last iteration is used. With extended feature vectors, each time instance is linearised separately. For time instance t + 1, for example, the influence of n^s_{t+1} on y^s_{t+1} is defined by J_{n_{+1}}. The extended feature vector y^e_t, which consists of time instances y^s_{t-w} ... y^s_{t+w}, is related to the observation with statics and dynamics by the linear projection D. The influence of the extended noise vector on the observation vector with statics and dynamics is therefore linear. The details of the relationship are the same as for extended Algonquin (see section 5.3.4). Both the noise and the observation are modelled as Gaussian per component m, so that their relation can be expressed as a joint distribution
\[
\begin{bmatrix} n^e \\ y \end{bmatrix}\bigg|\, m \sim \mathcal{N}\!\left( \begin{bmatrix} \mu^e_n \\ \mu^{(m)(k)}_y \end{bmatrix},
\begin{bmatrix} \Sigma^e_n & \Sigma^{(m)(k)}_{ny} \\ \Sigma^{(m)(k)}_{yn} & \Sigma^{(m)(k)}_y \end{bmatrix} \right), \quad (5.63)
\]
where Σ_{ny} is the cross-covariance of the extended noise vector and the conventional observation vector. None of the blocks of this covariance matrix is diagonal. The posterior distribution of the noise vector for component m at time t, ρ^{(m)}_t(n^e_t) = ρ(n^e_t | y_t, m_t), is then Gaussian. Its parameters can be found as in appendix A.1.3.

In the maximisation step, the noise parameters are set to the empirical mean and covariance of ρ. This requires summing over all time slices and components, and weighting the distributions by the posterior component–time occupancy γ^{(m)}_t defined in (2.31a):
\[
\mu^{e(k)}_n := \mathbb{E}_\rho\{n^e\}
= \frac{1}{\int \tilde p(Y)\, T_Y\,\mathrm{d}Y} \int \tilde p(Y) \sum_{t=1}^{T_Y} \sum_m \gamma^{(m)}_t\, \mathbb{E}_{\rho^{(m)}_t}\{n^e\}\,\mathrm{d}Y; \quad (5.64a)
\]
\[
\Sigma^{e(k)}_n := \mathbb{E}_\rho\bigl\{n^e n^{eT}\bigr\} - \mu^{e(k)}_n \mu^{e(k)T}_n
= \left( \frac{1}{\int \tilde p(Y)\, T_Y\,\mathrm{d}Y} \int \tilde p(Y) \sum_{t=1}^{T_Y} \sum_m \gamma^{(m)}_t\, \mathbb{E}_{\rho^{(m)}_t}\bigl\{n^e n^{eT}\bigr\}\,\mathrm{d}Y \right) - \mu^{e(k)}_n \mu^{e(k)T}_n. \quad (5.64b)
\]
Additionally, the convolutional noise should be estimated, and the structure of the noise model must be constrained. This work does not investigate this. Instead, the extended noise model will derive from a noise model estimated with standard VTS. By using the same noise estimates for both VTS and eVTS, only differences in the compensation process are examined, rather than any differences in the noise estimation process. It should be emphasised that the results presented for eVTS may therefore be a slight underestimate of the performance possible with a fully integrated noise estimate.

5.5.3 Phase factor distribution

The phase factor α^e is assumed to have independent dimensions (within and between time instances). It is
\[
\alpha^e = \begin{bmatrix} \alpha^s_{t-1} \\ \alpha^s_t \\ \alpha^s_{t+1} \end{bmatrix}, \quad (5.65)
\]
where every dimension is independent and distributed as in (4.18):
\[
p(\alpha_i) \propto \begin{cases} \mathcal{N}\bigl(\alpha_i;\, 0,\, \sigma^2_{\alpha,i}\bigr), & \alpha_i \in [-1,+1]; \\ 0, & \text{otherwise.} \end{cases} \quad (5.66)
\]

5.6 Summary

This chapter has described the first contribution of this thesis. It has extended the model compensation methods from chapter 4 to produce full-covariance compensation. The most important insight is that full covariance matrices require higher-quality compensation for dynamics. This claim will be validated in section 8.1.1.1. Section 5.2 has therefore shown how, from a distribution over the statics in a window (an extended feature vector), a distribution over dynamics can be found. This uses the same linear projection that feature extraction uses. Section 5.3 has introduced instances of compensation methods that model the effect of the noise separately for each time instance in the extended feature vector. They are therefore capable of generating accurate full covariance matrices. Section 5.5.2 has shown how to find an extended noise model from a standard noise model, so that model compensation with extended feature vectors needs as little adaptation data as standard model compensation.

The same principle of compensation can apply to a base class at once: section 5.4 has detailed how joint uncertainty decoding can be extended. The choice of the number of base classes gives a trade-off between speed and accuracy. However, decoding with full-covariance Gaussians is still slow. The next chapter will present approximations to deal with this.

Chapter 6

Predictive transformations

This chapter will describe the second contribution of this thesis.¹ The previous chapter has introduced methods of model compensation that need little data to train, but that find full-covariance model compensation, which is slow to decode with. This chapter will introduce predictive methods, which approximate a distribution predicted with one model by another, differently parameterised model. For example, this enables fast transformations to be trained from full-covariance compensation.

Section 6.1 will formalise predictive methods. They can combine the advantages of one method with those of another. For example, section 6.2 will introduce predictive linear transformations, which like their adaptive versions (which were discussed in sections 3.2 and 3.3) allow fast decoding.
The interesting aspect for this work is that they can model correlations without the computational burden of full-covariance Gaussians. Section 6.3 will use full-covariance joint uncertainty decoding (discussed in section 5.4) as the predicted distribution. Another use of the predictive framework is for fast feature transformations. Unlike conventional feature transformations for noise-robustness, which aim to reconstruct the clean speech, the methods that section 6.4 will introduce aim to model the corrupted speech distribution.

¹Though van Dalen (2007) and Gales and van Dalen (2007) introduced predictive linear transformations, they derived the predicted statistics by intuition. This chapter introduces a rigorous framework for predictive transformations. Additionally, this chapter will introduce a number of new schemes. The front-end CMLLR schemes in section 6.4, joint work with Federico Flego, were published as van Dalen et al. (2009). Section 6.3.5 will newly introduce predictive HLDA. Since the introduction of predictive linear transformations, a number of variants of specific transformations have been proposed. Xu et al. (2009) used joint uncertainty decoding in a different feature space for estimating predictive CMLLR transforms; this chapter will give the theoretical underpinnings for this. Another interesting line of work combines predicted statistics with statistics from data (Flego and Gales 2009; Breslin et al. 2010).

[Figure 6.1: Model compensation as a predictive method: the predicted corrupted speech is approximated with a standard HMM. (a) The predicted noise-corrupted speech, from figure 4.5 on page 72. (b) Compensated hidden Markov model, from figure 4.7 on page 75.]

6.1 Approximating the predicted distribution

Predictive methods train parameterised distributions from distributions predicted with another model. This is useful when it is impossible or impractical to use the former model. This section will introduce the general framework for predictive methods. It will give a formalisation of predictive transformations as minimising the KL divergence between the predicted distributions and the model set.

For example, section 4.4 has introduced model compensation, which approximates the predicted model for the corrupted speech. Figure 6.1a shows the predicted model, which results in an integral (in (4.24)) that has no closed form. Figure 6.1b shows the approximated model. The abstract idea of interpreting model compensation as a predictive method was proposed in Gales (1998b).

[Figure 6.2: Predictive linear transformations: a JUD-compensated HMM is approximated with a linear transformation. (a) Predicted corrupted speech with joint uncertainty decoding. (b) A hidden Markov model with a linear transformation.]

In many cases, it is possible to sample from the predicted distribution, and to train the approximated distribution from those samples. Indeed, extended DPMC (see section 5.3.1) drops out when the parameters of all Gaussians of the approximate distribution are trained separately, and extended iterative DPMC when the state-conditional mixtures of Gaussians are trained. These schemes are slow, but in the limit they yield the optimal parameters for their parameterisations.

Faster predictive methods, such as speech recogniser transformations, are possible.
Section 6.2 will introduce an application of the predictive framework that finds a speech recogniser transformation that approximates another distribution. Predictive transformations could be estimated with Monte Carlo. However, this would negate the reason for estimating transformations, which is speed. Figure 6.2a shows a graphical model for joint uncertainty decoding, a fast compensation method discussed in section 4.4.3. Section 6.3 will discuss how to find an approximation to this with linear transformations (discussed in section 3.2), depicted in figure 6.2b. This combines the advantages of joint uncertainty decoding, which needs little data to adapt but can generate full-covariance compensation (see section 5.4), and linear transformations, which are fast in decoding.

The rest of this section will discuss the general framework of predictive transformations. They are estimated by minimising the KL divergence between the predicted distributions and the target transformation of the model set. Section 6.1.1 will introduce a form of predicted distribution that predicts and approximates distributions per component (normally, a Gaussian). This is the form that most model compensation methods presented in this chapter use, and will be the form used in the rest of this work. However, it is also possible to approximate the distributions per sub-phone state. The formal derivation of this will be introduced in section 6.1.2. Iterative DPMC, introduced in section 4.4.1, is the only method in this thesis that applies it.

6.1.1 Per component

The idea of predictive transformations is that they are analogous to adaptive transformations, but are estimated on predicted statistics rather than statistics from data. Section 2.3.2.1 wrote the maximisation step of expectation–maximisation as minimising the KL divergence between the inferred distribution of the complete data and the model to be trained. Here, the idea is to minimise a KL divergence in the same way, but between the predicted distribution and the model.

The predicted distribution over the hidden variables U and test data observations Y will be written
\[
p(U, Y) = p(U)\, p(Y|U). \quad (6.1)
\]
For speech recognition, U is the component sequence {m_t}, which no transformation method in this thesis changes. p(Y|U) consists of the predicted component-conditional distributions p^(m). This replaces the empirical distribution from data and the inferred distribution over the hidden parameters in (2.24) on page 32, which was p(U, X) = p̃(X) ρ(U|X).

The distribution to be trained is q_UY, which for speech recognition factorises into a distribution over the hidden variables, q_U, and one over the observed variables given the hidden ones, q_{Y|U} (as in (2.26)):
\[
q_{UY}(U, Y) = q_U(U)\, q_{Y|U}(Y|U). \quad (6.2)
\]
Of these, only q_{Y|U}, the component-conditional distributions q^(m), will be trained, so that
\[
\arg\min_{q_{UY}} \mathrm{KL}(p\|q_{UY})
= \arg\min_{q_{UY}} \iint p(U,Y)\log\frac{p(U,Y)}{q_{UY}(U,Y)}\,\mathrm{d}Y\,\mathrm{d}U
= \arg\min_{q_{Y|U}} \iint p(U)\,p(Y|U)\log\frac{p(U)\,p(Y|U)}{q_U(U)\,q_{Y|U}(Y|U)}\,\mathrm{d}Y\,\mathrm{d}U
= \arg\min_{q_{Y|U}} \iint p(U)\,p(Y|U)\log\frac{p(Y|U)}{q_{Y|U}(Y|U)}\,\mathrm{d}Y\,\mathrm{d}U. \quad (6.3)
\]
In this case, the minimisation is performed per component distribution, so that U represents just the component identity m, and Y just the observation y it generates. The output distribution q_{Y|U} factorises per time step, so that the expression becomes
\[
\arg\min_{q_{Y|U}} \mathrm{KL}(p\|q_{UY})
= \arg\min_{q_{Y|U}} \int \sum_m p(m)\,p^{(m)}(y)\log\frac{p^{(m)}(y)}{q^{(m)}(y)}\,\mathrm{d}y
= \arg\min_{q_{Y|U}} \sum_m p(m)\,\mathrm{KL}\bigl(p^{(m)}\big\|q^{(m)}\bigr). \quad (6.4)
\]
A maximum-likelihood estimate of the prior distribution over components can be found from the training data. The expectation step of expectation–maximisation gives the total component occupancy γ^(m), which in (2.31b) was defined as
\[
\gamma^{(m)} \triangleq \int \tilde p(X) \sum_{t=1}^{T_X} \int \rho(U|X)\, \mathbb{1}(m_t = m)\,\mathrm{d}U\,\mathrm{d}X. \quad (6.5)
\]
The maximum-likelihood estimate of the component prior is
\[
p(m) := \frac{\gamma^{(m)}}{\sum_{m'} \gamma^{(m')}}. \quad (6.6)
\]
This can straightforwardly be substituted into (6.4). However, the normalisation term 1/Σ_{m′} γ^{(m′)} does not make a difference to the minimisation. Therefore, and because without the normalisation the minimand turns out to be easier to relate to transformations trained on data, the minimisation in (6.4) will be written as
\[
\arg\min_{q_{Y|U}} \mathrm{KL}(p \| q_{UY}) = \arg\min_{q_{Y|U}} \sum_m \gamma^{(m)} \mathrm{KL}\bigl(p^{(m)} \big\| q^{(m)}\bigr). \quad (6.7)
\]
The objective of predictive transforms, then, is to minimise the occupancy-weighted KL divergence between the predicted component distributions and the distributions used for decoding. For the model compensation methods discussed in the previous chapters, the parameters of each component distribution q^(m) were independent, so that the optimisation (in (4.28)) was separate for each component. For the predictive linear transformations that section 6.2 will introduce, however, the component distributions q^(m) cannot be optimised separately, because they share parameters. The weighting by training-data occupancy γ^(m) is therefore necessary.

The per-component KL divergence consists of the entropy of p^(m) and the cross-entropy of p^(m) and q^(m) (see appendix A.2), of which only the cross-entropy can be optimised. To solve (6.7), it therefore suffices to find
\[
\arg\min_{q_{Y|U}} \mathrm{KL}(p \| q_{UY}) = \arg\min_{q_{Y|U}} \sum_m \gamma^{(m)} \mathrm{H}\bigl(p^{(m)} \big\| q^{(m)}\bigr)
= \arg\max_{q_{Y|U}} \sum_m \gamma^{(m)} \int p^{(m)}(y) \log q^{(m)}(y)\,\mathrm{d}y. \quad (6.8)
\]
The minimisation of the cross-entropy turns into a maximisation of a quantity that can be interpreted as the expected log-likelihood.
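For Gaussian predicted and model distributions, the minimand in (6.8) has a closed form. The sketch below evaluates the occupancy-weighted cross-entropy; the helper names are illustrative and this is only a small illustration, not thesis code.

```python
import numpy as np

def gaussian_cross_entropy(mu_p, Sigma_p, mu_q, Sigma_q):
    """H(p||q) = -E_p[log q] for two Gaussians, as used in (6.8) and (6.14)."""
    d = len(mu_p)
    diff = mu_p - mu_q
    Sigma_q_inv = np.linalg.inv(Sigma_q)
    _, logdet_q = np.linalg.slogdet(Sigma_q)
    return 0.5 * (d * np.log(2 * np.pi) + logdet_q
                  + np.trace(Sigma_q_inv @ Sigma_p)
                  + diff @ Sigma_q_inv @ diff)

def predictive_objective(gammas, predicted, models):
    """Occupancy-weighted sum of cross-entropies, the minimand of (6.8).
    predicted, models: lists of (mean, covariance) pairs per component."""
    return sum(g * gaussian_cross_entropy(mp, Sp, mq, Sq)
               for g, (mp, Sp), (mq, Sq) in zip(gammas, predicted, models))
```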
6.1.2 Per sub-phone state

The discussion so far has assumed that the hidden variables are components m. However, it is also possible to match the decoding process more closely. Section 2.4 has discussed how Viterbi decoding finds a state sequence Θ and marginalises out over the component sequences M. Finding a transformed distribution that minimises the per-component KL divergence is often sub-optimal compared to minimising a per-state KL divergence. The following first presents the expression for the per-state optimisation, and then a method to approximate it for the specific case of two mixture models.

By following the derivation in section 6.1.1, but using the sub-phone state sequence Θ = {θ_t} as the hidden variables U and per-sub-phone output distributions for the predictions and approximations, p_{Y|U} = {p^(θ)}, q_{Y|U} = {q^(θ)}, the expression to optimise becomes (analogously to (6.7))
\[
\arg\min_{q_{Y|U}} \mathrm{KL}(p\|q_{UY}) = \arg\min_{q_{Y|U}} \sum_\theta \gamma^{(\theta)} \mathrm{KL}\bigl(p^{(\theta)}\big\|q^{(\theta)}\bigr), \quad (6.9)
\]
where γ^(θ) is the sub-phone occupancy, p^(θ) is the predicted distribution for state θ, and q^(θ) the transformed speech recogniser model for that state. For the case where p^(θ) and q^(θ) are mixture models, this expression normally has no analytic solution. However, assuming that it is possible to compute and improve the divergence between pairs of components of the mixtures, the divergence between sub-phone state pairs can be improved starting from the per-component one (Yu 2006; Dognin et al. 2009). This uses the upper bound on the cross-entropy presented in section A.2.2. After finding the tightest upper bound, q_{Y|U} is set to improve the upper bound with the variational parameters fixed. This process can be repeated a number of times to iteratively improve the bound.

To apply this, (6.9) must be written as a minimisation of only one half of the KL divergence, the cross-entropy H(p^(θ)||q^(θ)), analogously to (6.8):
\[
\arg\min_{q_{Y|U}} \mathrm{KL}(p\|q_{UY}) = \arg\min_{q_{Y|U}} \sum_\theta \gamma^{(\theta)} \mathrm{H}\bigl(p^{(\theta)}\big\|q^{(\theta)}\bigr). \quad (6.10)
\]
This is the minimisation that iterative DPMC approximates with a sampling method (in (4.33)). The following will optimise an upper bound on it. For simplicity of notation, it will assume that the components of the mixture distribution p^(θ) are not shared with other distributions p^(θ′), and similarly for the components of q^(θ). The algorithm is straightforward to extend to the general case.

The distributions p^(θ) and q^(θ) are assumed to be mixture distributions with component sets Ω^(θ) and Ω̃^(θ):
\[
p^{(\theta)}(y) = \sum_{m\in\Omega^{(\theta)}} \pi^{(\theta)}_m p^{(m)}(y);\qquad
q^{(\theta)}(y) = \sum_{n\in\tilde\Omega^{(\theta)}} \omega^{(\theta)}_n q^{(n)}(y). \quad (6.11)
\]
The sub-phone occupancy can be written as the sum of the occupancies of the components:
\[
\gamma^{(\theta)} = \sum_{m\in\Omega^{(\theta)}} \gamma^{(m)}. \quad (6.12)
\]
The algorithm for finding the upper bound on the cross-entropy between two mixtures (Yu 2006; Hershey and Olsen 2007) is discussed in appendix A.2.2. The algorithm optimises a probabilistic mapping between the components of one mixture and the components of the other. This mapping is represented by variational parameters φ^(m)_n, with (repeated from (A.16))
\[
\sum_n \phi^{(m)}_n = 1, \qquad \phi^{(m)}_n \geq 0. \quad (6.13)
\]
Appendix A.2.2 shows that the cross-entropy between mixtures can be upper-bounded by a function of the sub-weights and the KL divergences between all component pairs. The expression is given in (A.17b). Applied to one sub-phone state pair it is
\[
\mathrm{H}\bigl(p^{(\theta)}\big\|q^{(\theta)}\bigr) \leq \sum_{m\in\Omega^{(\theta)}} \sum_{n\in\tilde\Omega^{(\theta)}} \pi^{(\theta)}_m \phi^{(m)}_n \left( \mathrm{H}\bigl(p^{(m)}\big\|q^{(n)}\bigr) + \log\frac{\phi^{(m)}_n}{\omega^{(\theta)}_n} \right)
\triangleq \mathcal{F}\bigl(p^{(\theta)}, q^{(\theta)}, \phi\bigr). \quad (6.14)
\]
To find an upper bound on the KL divergence for the whole model set, this can be substituted into the expression optimised in (6.10):
\[
\sum_\theta \gamma^{(\theta)} \mathrm{H}\bigl(p^{(\theta)}\big\|q^{(\theta)}\bigr)
\leq \sum_\theta \gamma^{(\theta)} \sum_{m\in\Omega^{(\theta)}} \sum_{n\in\tilde\Omega^{(\theta)}} \pi^{(\theta)}_m \phi^{(m)}_n \left( \mathrm{H}\bigl(p^{(m)}\big\|q^{(n)}\bigr) + \log\frac{\phi^{(m)}_n}{\omega^{(\theta)}_n} \right)
\triangleq \mathcal{F}\bigl(p_{Y|U}, q_{Y|U}, \phi\bigr). \quad (6.15)
\]

    function Optimise-Predictive-Distribution(p_{Y|U})
        Initialise q_{Y|U}
        repeat
            for all m, n do
                φ^(m)_n ← ω^(θ)_n exp(−H(p^(m)‖q^(n))) / Σ_{n′} ω^(θ)_{n′} exp(−H(p^(m)‖q^(n′)))
            q_{Y|U} ← argmin_{q_{Y|U}} F(p_{Y|U}, q_{Y|U}, φ)
        until convergence
        return q_{Y|U}

Algorithm 3 Optimising the upper bound on the KL divergence to a predicted distribution, for mixture models.

Optimising the state-for-state KL divergence works as in algorithm 3. The initialisation of the approximate distribution q_{Y|U} could be the component-for-component optimum, or another setting. The variational parameters are then optimised to tighten the upper bound as in (A.20), after which the upper bound is improved by setting q_{Y|U}. Both of these steps are guaranteed not to increase the cross-entropy, so iterating over them finds a local optimum.

Since for most of the methods in the rest of this thesis the components of the predicted distribution and its approximation both derive from the same clean speech component, the component-for-component optimisation from the previous section will be used most. It is interesting to see how the component-for-component optimisation relates to the state-for-state optimisation.
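A small sketch of the variational step of algorithm 3 and of evaluating the bound (6.14), operating on a precomputed matrix of per-component cross-entropies (how those are computed for Gaussians was sketched after (6.8)). The update of q_{Y|U} itself depends on the chosen parameterisation and is therefore not shown; names are illustrative.

```python
import numpy as np

def update_variational_weights(H, omega):
    """Tighten the bound (6.15): phi[m, n] proportional to omega[n] * exp(-H[m, n]),
    normalised over n, as in the first step of algorithm 3.
    H[m, n] holds the cross-entropy H(p^(m) || q^(n))."""
    log_phi = np.log(omega)[None, :] - H
    log_phi -= log_phi.max(axis=1, keepdims=True)      # for numerical stability
    phi = np.exp(log_phi)
    return phi / phi.sum(axis=1, keepdims=True)

def upper_bound(pi, omega, phi, H, eps=1e-300):
    """Evaluate the bound F for one state pair, as in (6.14)."""
    return float(np.sum(pi[:, None] * phi
                        * (H + np.log(phi + eps) - np.log(omega)[None, :])))
```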
It is straightforward to set the parameters φ^(m)_n so that (6.15) is equal to the component-for-component KL divergence in (6.7) (repeated from (A.21)):
\[
\phi^{(m)}_n = \begin{cases} 1, & m = n; \\ 0, & m \neq n. \end{cases} \quad (6.16)
\]
Since the inequality in (6.15) still holds under this setting, optimising the component-for-component KL divergence optimises an upper bound on the optimal state-for-state KL divergence with variational parameters φ^(m)_n. This is called the matched-pair bound (see section A.2.2 and Hershey and Olsen 2007). It means that even when optimising the component-for-component KL divergence, an upper bound on the state-for-state KL divergence is optimised, which is consistent with the decoding process.

6.2 Predictive linear transformations

Adaptive linear transformations have been discussed in section 3.2. They do not have a model of the environment, but restrict the number of parameters that need to be trained, compared to training a full speech recogniser, by estimating linear transformations. However, they still require more data than methods for noise-robustness. On the other hand, decoding is often hardly slowed down at all. For example, CMLLR (section 3.2.1) transforms every observation feature vector with one of a small number of transformations, which is fast. It then feeds the differently-transformed feature vectors to different groups of components.

No such tricks are possible for model compensation methods. Methods that find Gaussian compensation, like standard VTS and extended VTS, apply a different transformation to each component. Converting them into fewer linear transformations, however, could combine the fast adaptation of extended VTS with the fast decoding of linear transformations.

As section 3.1 has discussed, adaptation uses expectation–maximisation. The maximisation step is equivalent to minimising the KL divergence between the empirical distribution and the modelled distribution. The framework of predictive transformations, introduced in section 6.1, approximates one distribution by another, also minimising the KL divergence between them. That the optimisation in both cases is similar makes converting adaptive transformations into predictive transformations relatively straightforward. The derivation of, for example, predictive CMLLR runs parallel to the derivation of standard CMLLR. The main difference will be that the statistics from data are replaced by predicted statistics.

Predictive transformations approximate a predicted distribution p. The approximate distribution q_{Y|U} over observed variables Y given hidden variables U is set to (repeated from (6.7) and (6.8))
\[
\arg\min_{q_{Y|U}} \mathrm{KL}(p\|q_{UY}) = \arg\min_{q_{Y|U}} \sum_m \gamma^{(m)} \mathrm{KL}\bigl(p^{(m)}\big\|q^{(m)}\bigr)
= \arg\min_{q_{Y|U}} \sum_m \gamma^{(m)} \mathrm{H}\bigl(p^{(m)}\big\|q^{(m)}\bigr)
= \arg\max_{q_{Y|U}} \sum_m \gamma^{(m)} \int p^{(m)}(y) \log q^{(m)}(y)\,\mathrm{d}y, \quad (6.17)
\]
where γ^(m) is the total occupancy on the training data for component m. Linear transformations are defined by a set of transformations A, one of which acts on each component. The output distribution of transformed component m is written q^(m)(y|A). The optimisation in (6.17) then becomes
\[
\mathcal{A} := \arg\min_{\mathcal{A}} \sum_m \gamma^{(m)} \mathrm{KL}\bigl(p^{(m)}\big\|q^{(m)}\bigr)
= \arg\max_{\mathcal{A}} \sum_m \gamma^{(m)} \int p^{(m)}(y) \log q^{(m)}(y|\mathcal{A})\,\mathrm{d}y. \quad (6.18)
\]
This expression is similar to the optimisation for adaptive linear transformations in (3.2):
\[
\mathcal{A}^{(k)} := \arg\max_{\mathcal{A}} \int \tilde p(Y) \sum_m \sum_{t=1}^{T_Y} \gamma^{(m)}_t \log q^{(m)}(y_t|\mathcal{A})\,\mathrm{d}Y. \quad (6.19)
\]
Alternatively, the per-state KL divergence could be optimised using the technique in section 6.1.2.
This would require interleaving the optimisation of the variational parameters of the upper bound on the per-state KL divergence with the optimisation of the transformation. Since the upper bound (in (6.15)) is a linear combination of the per-component cross-entropies, optimising the transformation still has the form in (6.18). For clarity, the following will therefore use that form.

Both predictive CMLLR and predictive covariance MLLR derive from (6.18) in the same way as their adaptive versions derive from (6.19). It will turn out that the only change is in the statistics, which are predicted rather than gathered from data. This is convenient, because the procedure for estimating the transformations presented in chapter 3.2 can be re-used. It is also satisfying, because the predicted statistics that drop out correspond to the intuitive expressions: they are the same as Gales and van Dalen (2007) derived by intuition.

One difference between adaptive linear transformations and their predictive versions is in the regression classes. Adaptation needs to carefully control for the available amount of data. One tool for this is a regression class tree that expands nodes as long as there is enough data to train the corresponding transformation robustly. For predictive transformations, on the other hand, there is a predicted distribution, which for the instantiations in this thesis is parameterised. This corresponds to an infinite amount of data, so that data sparsity is not an issue.

An interesting case is when there is only one component in a base class. If, additionally, the predicted distributions are Gaussian, then the algorithms for predictive CMLLR and predictive semi-tied covariance matrices will find a transformation that sets q^(m) exactly equal to p^(m). The choice of the number of base classes therefore gives a trade-off between speed and accuracy.

Since each component is assigned to one base class only, the optimisation expressions for each base class are independent. All of the derivations in the next sections will therefore simplify notation by assuming only one base class, and summing over all components. To convert these into expressions that do use base classes, the sums over components should only be over the components in the base class that the transformation is estimated for.

6.2.1 Predictive CMLLR

Predictive CMLLR (PCMLLR) uses the exact same form of transformation as CMLLR. It is called “constrained” because the linear transformations of the means and covariances are equal. There is also a bias on the mean. As section 3.2.1 has shown, this can be written with the inverse transformation, which then works on the feature vector. The likelihood computation becomes (as in (3.4b))
\[
q^{(m)}(y|\mathcal{A}) = |A| \cdot \mathcal{N}\bigl(Ay + b;\ \mu^{(m)}_x,\ \Sigma^{(m)}_x\bigr), \quad (6.20)
\]
where μ^(m)_x and Σ^(m)_x are the Gaussian parameters for the clean speech. Though for clarity of notation it will not be written explicitly here, there is normally a set of R transformations {A^(r), b^(r)}, one for each base class (in the adaptive case, regression class). Each component is assigned to one base class. A fast implementation can therefore transform each observation vector y_t into R transformed versions A^(r)y_t + b^(r) and pass each component the appropriately transformed version. The determinant |A^(r)| in (6.20) can be precomputed. This makes decoding fast, for both the adaptive and predictive versions of CMLLR.
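A sketch of this decoding strategy, assuming diagonal model covariances for simplicity (names are illustrative):

```python
import numpy as np

def cmllr_log_likelihoods(y, transforms, components):
    """Per-component log-likelihoods under (6.20).
    transforms: list of (A, b, log_det_A) per base class, with log|A| precomputed.
    components: list of (mu, var_diag, base_class) per Gaussian component."""
    # Transform the observation once per base class, not once per component.
    transformed = [A @ y + b for A, b, _ in transforms]
    log_likes = []
    for mu, var, r in components:
        z = transformed[r]
        _, _, log_det_A = transforms[r]
        log_gauss = -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var)
        log_likes.append(log_det_A + log_gauss)
    return np.array(log_likes)
```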
The algorithm for predictive CMLLR turns out to be the same as that for adaptive CMLLR; appendix B.1.2 derives this. Both are expressed in terms of the same statistics, but the difference is in how these statistics are acquired. For the predictive version they are
\[
\gamma \triangleq \sum_m \gamma^{(m)}; \quad (6.21a)
\]
\[
k^{(i)} \triangleq \sum_m \frac{\gamma^{(m)} \mu^{(m)}_{x,i}}{\sigma^{(m)}_{x,ii}} \begin{bmatrix} \mathbb{E}_{p^{(m)}}\{y^T\} & 1 \end{bmatrix}; \quad (6.21b)
\]
\[
G^{(i)} \triangleq \sum_m \frac{\gamma^{(m)}}{\sigma^{(m)}_{x,ii}} \begin{bmatrix} \mathbb{E}_{p^{(m)}}\{yy^T\} & \mathbb{E}_{p^{(m)}}\{y\} \\ \mathbb{E}_{p^{(m)}}\{y^T\} & 1 \end{bmatrix}. \quad (6.21c)
\]
The form of the statistics for predictive CMLLR is intuitively related to the form of the statistics for the adaptive version, in (3.5). γ is the total occupancy, which for adaptive CMLLR is found from the distribution over the hidden variables and here is derived from the clean training data. k^(i) is a function of the expected value of the predicted distribution; its equivalent in (3.5b) can be viewed as the mean observation vector for component m under the empirical distribution, and similarly for G^(i).

6.2.2 Predictive covariance MLLR

Covariance MLLR (see section 3.2.2) is a technique that tries to find the best covariance transformation to model the data. Since the overall aim of this chapter is to find transformations that help model correlations without the decoding cost of full covariances, a predictive variant of covariance MLLR should be useful. The derivation of predictive covariance MLLR follows the same structure as that of predictive CMLLR in the previous section: the KL divergence between the predicted distributions and the transformed speech recogniser is minimised. This results in the same expression as for adaptive covariance MLLR, but with the statistics replaced by their predicted equivalents.

The transformed likelihood is exactly the one in (3.6b):
\[
q^{(m)}(y|\mathcal{A}) = |A| \cdot \mathcal{N}\bigl(Ay;\ A\mu^{(m)}_x,\ \Sigma^{(m)}_x\bigr). \quad (6.22)
\]
As explained in section 3.2.2, this expression transforms the features and the means, because that makes decoding faster than transforming the covariance with the inverse. The derivation of predictive covariance MLLR is in appendix B.2.2. As for predictive CMLLR, the only change is in the statistics, which again are intuitively related to those for the adaptive version in (3.7):
\[
\gamma \triangleq \sum_m \gamma^{(m)}; \quad (6.23a)
\]
\[
G^{(i)} \triangleq \sum_m \frac{\gamma^{(m)}}{\sigma^{(m)}_{x,ii}} \mathbb{E}_{p^{(m)}}\Bigl\{\bigl(y - \mu^{(m)}_x\bigr)\bigl(y - \mu^{(m)}_x\bigr)^T\Bigr\}. \quad (6.23b)
\]
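As a sketch of how predicted statistics replace data statistics, the following accumulates k^(i) and G^(i) of (6.21) from per-component predicted moments, assuming a single base class; the data layout and names are illustrative.

```python
import numpy as np

def pcmllr_statistics(components):
    """Accumulate the predictive-CMLLR statistics (6.21).
    components: list of dicts with occupancy 'gamma', clean mean 'mu_x',
    clean diagonal variance 'var_x', and predicted moments 'Ey' (mean) and
    'Eyy' (second moment) of the corrupted speech for that component."""
    d = len(components[0]['mu_x'])
    gamma_total = 0.0
    k = np.zeros((d, d + 1))
    G = np.zeros((d, d + 1, d + 1))
    for c in components:
        gamma_total += c['gamma']
        ext = np.append(c['Ey'], 1.0)                   # [E{y}; 1]
        outer = np.zeros((d + 1, d + 1))
        outer[:d, :d] = c['Eyy']
        outer[:d, d] = c['Ey']
        outer[d, :d] = c['Ey']
        outer[d, d] = 1.0
        for i in range(d):
            w = c['gamma'] / c['var_x'][i]
            k[i] += w * c['mu_x'][i] * ext              # (6.21b)
            G[i] += w * outer                           # (6.21c)
    return gamma_total, k, G
```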
6.2.3 Predictive semi-tied covariance matrices

Predictive covariance MLLR does not change the means, nor the covariance matrices beyond the linear transformation it applies. However, it is possible to adjust the covariance matrices as well as the linear transformation that is applied to them. Semi-tied covariance matrices, discussed in section 3.3.1, do exactly that, for the training data. It is possible to train the same transformation on predicted statistics. The derivation is analogous to the one for standard semi-tied covariance matrices.

The likelihood expression for predictive semi-tied covariance matrices is analogous to (3.12b):
\[
q^{(m)}(y) = |A| \cdot \mathcal{N}\bigl(Ay;\ A\mu^{(m)}_y,\ \tilde\Sigma^{(m)}_{y,\mathrm{diag}}\bigr). \quad (6.24)
\]
This expression is similar to the one for predictive covariance MLLR, in (3.6b). It also uses a transformation A to find a feature space in which diagonal covariance matrices are used. However, there are two important differences. First, the mean is μ^(m)_y (instead of μ^(m)_x): it is set to the mean of the predicted corrupted speech distribution. This is not explicitly necessary for standard semi-tied covariance matrices, since they are normally trained on the clean training data. The second difference is that, just as with normal semi-tied covariance matrices, the component covariance is also re-estimated, here denoted Σ̃^(m)_{y,diag}. For standard semi-tied covariance matrices, the reason the covariances could be re-estimated was that the training data was used (rather than test data, as for adaptation transformations); without data sparsity, over-training was not a problem. For the predictive version, the training data statistics are replaced by predicted statistics, so that over-training is again not a problem.

The three types of parameters to be estimated are μ^(m)_y, Σ̃^(m)_{y,diag}, and A. The means are straightforwardly estimated: even when y is transformed by A, Aμ_y is still the mean in the transformed space. The covariances, on the other hand, are diagonal, so that if A changes they become sub-optimal, and vice versa. They will therefore, as for standard semi-tied covariance matrices, be estimated in an iterative fashion. Every step is guaranteed not to increase the KL divergence, so that the algorithm finds a local optimum.

The full derivation is in appendix B.3.2. The statistics required are (repeated from (B.42))
\[
W^{(m)} \triangleq \mathbb{E}_{p^{(m)}}\Bigl\{\bigl(y - \mu^{(m)}_y\bigr)\bigl(y - \mu^{(m)}_y\bigr)^T\Bigr\}; \quad (6.25a)
\]
\[
\gamma \triangleq \sum_m \gamma^{(m)}; \quad (6.25b)
\]
\[
G^{(i)} \triangleq \sum_m \frac{\gamma^{(m)}}{\tilde\sigma^{(m)}_{y,ii}} W^{(m)}. \quad (6.25c)
\]
Given the statistics, the process of estimating the component parameters and the transformation is basically the same as for standard semi-tied covariance matrices, in section 3.3.1. In addition, the means are first estimated. The complete process is given in algorithm 4.

    function Predictive-Semi-Tied-Covariance-Matrices({p^(m), γ^(m)}, γ)
        for all components m do
            μ^(m)_y ← E_{p^(m)}{y}
            W^(m) ← E_{p^(m)}{(y − μ^(m)_y)(y − μ^(m)_y)^T}
            Σ̃^(m)_{y,diag} ← diag(W^(m))
        A ← I
        repeat
            G^(i) ← Σ_m (γ^(m)/σ̃^(m)_{y,ii}) W^(m)
            A ← Estimate-Covariance-MLLR(γ, G^(i))
            for all components m do
                Σ̃^(m)_{y,diag} ← diag(A W^(m) A^T)
        until convergence
        return {μ^(m)_y, Σ̃^(m)_{y,diag}}, A

Algorithm 4 Estimating predictive semi-tied covariance matrices.

This scheme is computationally expensive, because it alternates between updating A and Σ̃^(m)_{y,diag}, and updating A requires iterating over its rows. An alternative is to update only A, by stopping after the call to Estimate-Covariance-MLLR in algorithm 4. The form of the likelihood is then the same as in (6.24), but the covariance matrices on the models are diagonalised predicted covariances in the original feature space, unlike covariance MLLR, where the original covariance matrices are retained. A is optimised for this feature space. This form will be referred to as “half-iteration predictive semi-tied”.

6.3 Correlation modelling for noise

Section 6.2 has presented predictive linear transformations agnostic to the form of the predicted distribution. Indeed, they can be trained from any distribution that yields the required statistics. This work applies predictive linear transformations to methods for noise-robustness. However, since the introduction of the general framework (Gales and van Dalen 2007), other forms of predictive transformations have been proposed, such as VTLN (Breslin et al. 2010).

This section will confine itself to estimating predictive linear transformations from joint uncertainty decoding as discussed in sections 4.4.3 and 5.4. The advantage of the form of distribution that joint uncertainty decoding predicts is that the components are Gaussian-distributed, and that it uses base classes.
e statistics that the predictive methods from the previous chapter require are straightforwardly expressed in terms of the means and covariances of the compon- ent distributions. As discussed in section 4.3, the real noise-corrupted speech distri- butions do not have a closed form. Model compensation methods normally already approximate the component distributions as Gaussians, so that no additional approx- imations are required to nd the statistics from joint uncertainty decoding. at joint uncertainty decoding shares compensation parameters across a whole base class is particularly useful if the predictive transform uses the same base class. It will turn out to be possible to express the statistics that joint uncertainty decoding predicts in a component-dependent part that can be computed oš-line, and a base- class-dependent part that changes with the noise parameters. Accumulating statistics from all components is therefore not necessary at run-time. is saves storage space and computation time. is section will use the convention that base classes for joint uncertainty decod- ing and for predictive transformations are the same. As before, since the estimation is per base class, the notation will assume only one base class. e distribution for componentm that joint uncertainty decoding predicts was given in (4.48a): p(m)(y) = ∣∣Ajud∣∣ · N (Ajudy+ bjud; µ(m)x , Σ(m)x + Σbias), (6.26) where Ajud, bjud, and Σbias are computed from the joint distribution of the clean speech and the observation as in section 4.4.3. Of particular interest is the case where Σbias is full, to compensate for changes in feature correlations, which arises when the joint distribution has a full covariance. e predictive transformations in this section will use the predicted distribution 167 chapter 6. predictive transformations of the jud-transformed observation yˆ, yˆ = Ajudy+ bjud. (6.27) Other options are possible (e.g. Xu et al. 2009; 2011), because jud compensation can be written without feature transformation, but with the inverse transformation on the mean and covariance. However, from joint uncertainty decoding with a full trans- formation and full covariance bias, predictive cmllr could nd the exact same max- imum likelihood solution (it would if it found the global optimum). Predictive covari- ance mllr keeps the original covariance Σx, so transforming it to a dišerent feature space rst would defeat the purpose. Predictive semi-tied covariance matrices could, just like predictive cmllr, nd the same solution in whatever feature space, as long as the the means are re-estimated as in section 6.2.3. e statistics that the predictive transforms require from the predicted distribu- tions consist solely of the following elements, which are straightforward to derive from the Gaussian in (6.26): Ep(m){yˆ} = µ(m)x ; (6.28a) Varp(m){yˆ} = Σ (m) x + Σbias; (6.28b) Ep(m) { yˆyˆT } = Σ (m) x + Σbias + µ (m) x µ (m) x T . (6.28c) e next sections will discuss instantiations of the predictive versions of cmllr, co- variance mllr, and semi-tied covariance matrices. In each case, it will turn out to be possible to express the statistics so that most of the accumulation can be performed oš-line.is is because in (6.28) the statistics derived from the clean speech compon- ent, µ(m)x and Σ(m)x , do not depend on the noise model, whereas Σbias does depend on the noise model, but not on the component distributions. 168 6.3. 
correlation modelling for noise 6.3.1 Predictive CMLLR Predictive cmllr trained on joint uncertainty decoding results in the following like- lihood calculation: q(m)(y) = ∣∣Acmllr∣∣ · ∣∣Ajud∣∣ · N(Acmllr(Ajudy+ bjud)+ bcmllr; µ(m)x , Σ(m)x ). (6.29) Compared to (6.26), this lacks covariance bias Σbias. To make up for that, Acmllr and bcmllr are trained to minimise the kl divergence with p in (6.26).is uses stat- istics k(i) andG(i) dened in (6.21) in section 6.2.1. Using (6.28), they become k(i) , ∑ m γ(m)µ (m) x,i σ (m) x,ii [ Ep(m) { yˆT } 1 ] = ∑ m γ(m)µ (m) x,i σ (m) x,ii [ µ (m) x T 1 ] ; (6.30a) G(i) , ∑ m γ(m) σ (m) x,ii  Ep(m){yˆyˆT} Ep(m){yˆ} Ep(m) { yˆT } 1  = ∑ m γ(m) σ (m) x,ii (Σ(m)x + Σbias + µ(m)x µ(m)x T) µ(m)x µ (m) x T 1  . (6.30b) It is interesting that k(i) does not depend on the parameters of joint uncertainty de- coding. It can therefore be computed oš-line and cached in its entirety. G(i), on the other hand, does depend on a jud parameter: Σbias. However, it can be rewritten to be largely cacheable: G(i) = ∑ m γ(m) σ (m) x,ii (Σ(m)x + µ(m)x µ(m)x T) µ(m)x µ (m) x T 1  ︸ ︷︷ ︸ cached + Σbias 0 0 0 ∑ m γ(m) σ (m) x,ii︸ ︷︷ ︸ cached . (6.30c) is completely removes the need to iterate over components at run-time. Since there is a G(i) for every dimension i, the total computational cost per base class for com- puting them isO(Rd3). 169 chapter 6. predictive transformations 6.3.2 Predictive covariance MLLR Predictive covariancemllr results in the following likelihood calculationwhen trained on joint uncertainty decoding: q(m)(y) = ∣∣Amllrcov∣∣ · ∣∣Ajud∣∣ · N (Amllrcov(Ajudy+ bjud) ; Amllrcovµ(m)x , Σ(m)x ). (6.31) e transformation Amllrcov on the features and on the means is equivalent to the inverse transformation on the covariance, as discussed in section 3.2.2. Compared to the distribution p(m) in (6.26) that q(m) aims to approximate, this lacks the covariance biasΣbias.erefore,Amllrcov is trained to make up for that.is allows decoding to use unchanged diagonal covariances while still modelling some of the predicted correlations. Training predictive covariance mllr uses statisticsG(i) dened in (b.22) in section 6.2.2. Using (6.28), and noting that the mean of yˆ is µ(m)x , they become G(i) , ∑ m γ(m) σ (m) x,ii Ep(m) {( yˆ− µ (m) x )( yˆ− µ (m) x )T} = ∑ m γ(m) σ (m) x,ii Varp(m){yˆ} = ∑ m γ(m) σ (m) x,ii ( Σ (m) x + Σbias ) . (6.32a) Just like for predictive cmllr trained on joint uncertainty decoding, this expression can be written so that iterations over the components can be cached.is works sim- ilarly to (6.30c): G(i) , ∑ m γ(m) σ (m) x,ii Σ (m) x︸ ︷︷ ︸ cached +Σbias ∑ m γ(m) σ (m) x,ii︸ ︷︷ ︸ cached . (6.32b) e total on-line cost of computing G(i) for all dimensions i is therefore O(Rd3). is does not depend on the number of components.e model means also need to be transformed, byAmllrcov, the complexity of which does depend on the number of components: O(Md2). 170 6.3. correlation modelling for noise 6.3.3 Predictive semi-tied covariance matrices Predictive semi-tied covariancematrices, when trained on statistics predicted by joint uncertainty decoding, result in the following likelihood expression: q(m)(y) = ∣∣Ast∣∣ · ∣∣Ajud∣∣ · N(Ast(Ajudy+ bjud) ; Astµ(m)x , Σ˜(m)y,diag) . (6.33) As in the case of predictive covariance mllr, the transformationAst is equivalent to the inverse transformation on the covariance matrix. 
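Before moving on, the off-line/on-line split used in (6.30c) and (6.32b) can be made concrete with the sketch below (illustrative names; diagonal clean-speech covariances are assumed, as in the text): the component-dependent sums are cached once, and only the noise-dependent Σbias enters at run time.

```python
import numpy as np

def cache_cov_mllr_stats(gamma, var_x):
    """Off-line, component-dependent part of (6.32b).
    gamma: (M,) occupancies; var_x: (M, d) diagonal clean-speech variances."""
    w = gamma[:, None] / var_x            # gamma(m) / sigma(m)_x,ii, shape (M, d)
    sum_w = w.sum(axis=0)                 # sum_m gamma(m) / sigma(m)_x,ii, shape (d,)
    # For each i, the diagonal of  sum_m gamma(m)/sigma(m)_x,ii * Sigma_x^(m)
    cached_diag = np.einsum('mi,mj->ij', w, var_x)       # shape (d, d)
    return cached_diag, sum_w

def online_G(cached_diag, sum_w, Sigma_bias):
    """On-line part: combine the cache with the noise-dependent Sigma_bias.
    Returns one full d-by-d matrix G(i) per dimension i, without any sum over
    components at run time."""
    d = sum_w.shape[0]
    return np.array([np.diag(cached_diag[i]) + Sigma_bias * sum_w[i]
                     for i in range(d)])
```

The same pattern carries over to the statistics for predictive cmllr in (6.30c); only the cached blocks differ, with the extra mean terms and the additional row and column for the bias.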
Semi-tied covariance matrices aim to find a feature space, specified by A_st, in which a diagonal covariance matrix is a reasonable assumption. Unlike for covariance mllr, the component covariances are also adapted to minimise the kl divergence with the predicted distribution. A_st and the Σ̃_y,diag^(m) are estimated in an iterative process.

The difference with the predicted distribution, in (6.26), is that the covariance matrix Σ̃_y,diag^(m) is diagonal. The transformation A_st needs to make up only for the off-diagonal entries of the covariance matrix, unlike for predictive covariance mllr, which has a similar form but uses the original covariance matrices. Training predictive semi-tied covariance matrices uses statistics μ_y^(m), W^(m) and G^(i) defined in (6.25). Using (6.28), μ_y^(m) and W^(m) become
\[
\mu_y^{(m)} \triangleq \mathbb{E}_{p^{(m)}}\{\hat{\mathbf{y}}\} = \mu_x^{(m)};
\tag{6.34a}
\]
\[
\mathbf{W}^{(m)} \triangleq \mathbb{E}_{p^{(m)}}\!\left\{\big(\hat{\mathbf{y}} - \mu_y^{(m)}\big)\big(\hat{\mathbf{y}} - \mu_y^{(m)}\big)^{\mathsf{T}}\right\} = \operatorname{Var}_{p^{(m)}}\{\hat{\mathbf{y}}\} = \Sigma_x^{(m)} + \Sigma_{\mathrm{bias}}.
\tag{6.34b}
\]
G^(i) can be defined in terms of this, exactly as in (6.25c):
\[
\mathbf{G}^{(i)} \triangleq \sum_m \frac{\gamma^{(m)}}{\tilde{\sigma}_{y,ii}^{(m)}}\, \mathbf{W}^{(m)}.
\tag{6.34c}
\]
These statistics can again be formulated in such a way that the on-line computational cost is less than for a direct implementation. This assumes that the original speech covariance matrices Σ_x^(m) are diagonal. G^(i) can be written
\[
\mathbf{G}^{(i)} = \underbrace{\sum_m \frac{\gamma^{(m)}}{\tilde{\sigma}_{y,ii}^{(m)}}\, \Sigma_x^{(m)}}_{\mathcal{O}(Md)} \;+\; \Sigma_{\mathrm{bias}} \underbrace{\sum_m \frac{\gamma^{(m)}}{\tilde{\sigma}_{y,ii}^{(m)}}}_{\mathcal{O}(M + d^2)}.
\tag{6.35a}
\]
The total cost of finding these for d dimensions and R base classes is O(Md² + Rd³). A similar optimisation can be applied to the covariance update in (b.40):
\[
\tilde{\Sigma}_{y,\mathrm{diag}}^{(m)} := \operatorname{diag}\!\big(\mathbf{A}\mathbf{W}^{(m)}\mathbf{A}^{\mathsf{T}}\big) = \operatorname{diag}\!\big(\mathbf{A}(\Sigma_x^{(m)} + \Sigma_{\mathrm{bias}})\mathbf{A}^{\mathsf{T}}\big) = \underbrace{\operatorname{diag}\!\big(\mathbf{A}\Sigma_x^{(m)}\mathbf{A}^{\mathsf{T}}\big)}_{\mathcal{O}(d^2)} + \underbrace{\operatorname{diag}\!\big(\mathbf{A}\Sigma_{\mathrm{bias}}\mathbf{A}^{\mathsf{T}}\big)}_{\mathcal{O}(d^3)}.
\tag{6.35b}
\]
Since the left-hand term is component-dependent, it needs to be computed separately for each component. However, Σ_x^(m) is diagonal and the result is constrained to be diagonal, so that the cost per component is only O(d²). The right-hand term is the same for a whole base class, but since Σ_bias is full, the calculation takes O(d³). The overall complexity of updating Σ̃_y,diag^(m) for all M components in R base classes is therefore O(Md² + Rd³).

For the full scheme, the complexities are multiplied by the number of outer iterations K. The half-iteration scheme for predictive semi-tied covariance matrices only finds a transformation matrix A_st and only initialises the covariances Σ̃_y,diag^(m), apart from adding the covariance bias, which is O(Md).

6.3.4 Computational complexity

Table 6.1 below details the time requirements for the approximations to joint uncertainty decoding discussed in the previous sections. The naive implementation for calculating the cofactors takes O(Rd⁴) per iteration, but using the Sherman–Morrison matrix-inversion lemma this can be reduced to O(Rd³) per iteration (Gales and van Dalen 2007). Inverting G^(i) takes O(Rd⁴) per iteration.² In all cases, by allowing for diagonal covariances on the models compensated for noise, the complexity associated with decoding T observations with joint uncertainty decoding is reduced by a factor of d.

² By using the average of the diagonal of Σ̃_y,diag^(m) rather than different σ̃_y,ii^(m) for 1 ≤ i ≤ d, it is possible to reduce this to O(Rd³) (Gales and van Dalen 2007) at the cost of some loss in accuracy. This has not been shown in the table.

Table 6.1 The complexity of estimating predictive transforms from joint uncertainty decoding. d is the size of the feature vector; M is the number of components; R is the number of base classes; L is the number of inner iterations; K is the number of outer iterations (see section 3.3.1 on page 50).

                       cmllr     covariance mllr   Half semi-tied   Full semi-tied
  Estimation
    Statistics         Rd³       Rd³               Md² + Rd³        K(Md² + Rd³)
    Inverting G^(i)    Rd⁴       Rd⁴               Rd⁴              KRd⁴
    Calculating c_i    LRd³      LRd³              LRd³             KLRd³
  Compensation
    Features           TRd²      TRd²              TRd²             TRd²
    Means              0         Md²               Md²              Md²
    Variances          0         0                 Md               KMd²

6.3.5 Predictive HLDA

This chapter has so far presented predictive methods that start with a predicted distribution over feature vectors with statics and dynamics. Those distributions will have been derived from distributions over extended feature vectors as in chapter 5. The conversion from extended feature vectors, or distributions over them, to ones with statics and dynamics used a linear transformation D. The predictive linear transformations in this chapter have estimated another linear feature transformation A. An interesting avenue would be to not assume projection D, and to replace the pair of transformations A·D, of which only A is estimated, by one transformation that reduces the feature dimensionality and at the same time transforms to a good feature space for noise.

To formulate distributions over the feature space with reduced dimensionality, a square Jacobian is necessary. Dimensionality reduction therefore needs, at least mathematically, to find a new feature space of the same dimensionality. Some of those dimensions are useful dimensions, which are used for discrimination, and some are nuisance dimensions, the distributions over which should be tied over all components so they do not discriminate. This requirement was also an issue for estimating an approximate distribution over extended feature vectors. As section 5.2.2 has discussed, this is only equivalent to optimising in the projected space under certain conditions. Extended idpmc (section 5.3.2), for example, which estimates mixtures of Gaussians, explicitly bypassed the issue and estimated parameters directly on samples with statics and dynamics. This trick does not apply here, so the nuisance dimensions must be handled explicitly.

This section will therefore sketch how to apply heteroscedastic linear discriminant analysis (Kumar 1997), discussed in section 3.3.2, to extended feature vectors. This will be called "predictive hlda". It explicitly splits the feature transformation up into useful and nuisance dimensions, and ties the distribution over the nuisance dimensions. Both heteroscedastic linear discriminant analysis (hlda) and semi-tied covariance matrices can be seen as instances of multiple hlda (mhlda). mhlda is hlda with different transformations for different classes; semi-tied covariance matrices are mhlda without dimensionality reduction. The scheme sketched here could straightforwardly be extended to multiple base classes.

The likelihood calculation for predictive hlda is similar to the one for predictive semi-tied covariance matrices (in (6.24)):³
\[
q^{(m)}(\mathbf{y}^{\mathrm{e}}) = \mathcal{N}\big(\mathbf{A}\mathbf{y}^{\mathrm{e}};\, \tilde{\mu}_y^{(m)},\, \tilde{\Sigma}_{y,\mathrm{diag}}^{(m)}\big).
\tag{6.36}
\]
The two differences are that this expression uses extended feature vectors y^e, and that A is a non-square matrix that reduces the dimensionality as well as finding a feature space in which it is a decent approximation to make Σ̃_y,diag^(m) diagonal.
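A minimal sketch of evaluating (6.36), assuming a single base class so that the |A| factor can be dropped (as footnote 3 notes), is given below; the names are illustrative, not the thesis's implementation.

```python
import numpy as np

def log_lik_predictive_hlda(y_ext, A, mu_tilde, var_tilde):
    """q(m)(y^e) = N(A y^e; mu~(m)_y, Sigma~(m)_y,diag), as in (6.36).
    y_ext: extended feature vector, shape (d_e,); A: (d, d_e) non-square projection;
    mu_tilde, var_tilde: transformed mean and diagonal variances, shape (d,)."""
    z = A @ y_ext - mu_tilde
    return -0.5 * (np.sum(np.log(2.0 * np.pi * var_tilde)) + np.sum(z * z / var_tilde))
```

With d < d_e this is the same computation as for predictive semi-tied covariance matrices, just with a rectangular A selecting the useful dimensions.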
hlda needs statistics similar to semi-tied covariance matrices; analogously, pre- dictive hlda needs statistics similar to predictive semi-tied covariance matrices, but 3e factor|A| in (6.24) does not need to be computed if there is only one base class, like here, since it ašects all likelihood equally. 174 6.4. front-end pcmllr over extended feature vectors. Denoting the predicted distribution for componentm with pe(m), the predicted covariance for componentm is (analogous to (6.34b)) W(m) , Epe(m) { yeyeT } − Epe(m){ye} Epe(m){ye}T. (6.37) Given the statistics, following the procedure for computing semi-tied covariance ma- trices would yield a transformation that aims to nd an optimal square linear trans- formation. hlda on the other hand applies dimensionality reduction as well. Unlike with predictive semi-tied covariance matrices, the new component parameters must be derived directly from the extended statistics.e means, for example, are set to the transformed mean of the predicted distribution: µ˜ (m) y := AEpe(m){ye} , (6.38a) and the component covariance matrices use (6.37): Σ˜ (m) y,diag := diag ( AW(m)AT ) . (6.38b) Optimising the transformation A as in Kumar (1997) yields an interesting vari- ant of predictive linear transformations that transforms the extended feature space as well as selecting useful features from it. For noise-corrupted speech, this makes intuitive sense. Some features may be completely masked under low signal-to-noise ratios. Predictive hlda will then transform the feature vector used for recognition so as to replace those features by more useful features for that specic noise condition. However, predictive hlda could also be applied to distributions predicted from other models. 6.4 Front-end PCMLLR Predictive linear transformations are žexible.is thesis has motivated their use from the perspective of reducing the computational load of decoding with full covariances. However, predictivecmllrnds a component-dependent feature transformation, and 175 chapter 6. predictive transformations it is possible to use this for approximating diagonal-covariance predicted distribu- tions. is section will apply pcmllr-like transformations to features without refer- ence to the component identity.4 An interesting practical consequence will be that the resulting transformation can be similar to that of model-based feature enhancement (see section 4.6), but is motivated dišerently. Whereas feature enhancement aims to reconstruct the clean speech, here, the transformations aim to minimise the kl diver- gence between the predicted distribution and the ešective speech recogniser distri- bution. is uses more precise information about the clean speech distribution, the speech recogniser components rather than the front-end components. Many statistics can again be cached, so that this is computationally very e›cient. e framework of predictive transformations can also be used to speed up com- pensation for noise with diagonal covariance matrices. e methods in this section will derive from predictive cmllr, which, like standard cmllr, applies a compon- ent-dependent feature transformation to the observation vector. e transformation that both forms of cmllr nd is a set of a›ne transformations for each base class r: A = {A(r)} ={A(r),b(r)}. To keep the notation uncluttered, this thesis has not ex- plicitly written the dependency on the base class, but this section will, because it will be vital. 
Both adaptive and predictive cmllr decode with (from (6.20), with the base class explicit) q(m)(y|A) = ∣∣A(r)∣∣ · N (A(r)y+ b(r); µ(m)x , Σ(m)x ). (6.39a) Each of the components is assigned one base class r, and only one feature transforma- tionA(r) is estimated for a base class.erefore, which transformationA(r) is chosen from the set of transformations depends on the component. However, it is also pos- sible to nd a transformation that takes the observation into accountwhenminimising the kl divergence to the ešective decoding distribution from the predicted noise-cor- rupted speech distribution.is could make the transformation more appropriate for the acoustic region that the observation is in. 4e work in this section, section 6.4, is joint work with Federico Flego, published as van Dalen et al. (2009). 176 6.4. front-end pcmllr e next sections will introduce methods of nding a component-independent transformationAt = { At,bt } at each time instance t.e distribution of compon- entm becomes q(m)(yt) = ∣∣At∣∣ · N(Aty+ bt; µ(m)x , Σ(m)x ). (6.39b) e simplest scheme estimates a global transformation. It is equivalent to pre- dictive cmllr with the number of base classes set to R = 1. is makes the scheme trivially independent of the component. It is interesting as a baselinemethod, because themethods that this section will introduce all reduce to it when there is only one base class.e optimisation in (6.18) becomes A := argmin A ∑ m γ(m)KL(p(m)∥∥q(m)). (6.40) Section 6.4.1 will introduce a method that re-trains the transformation with pre- dictive cmllr for every feature vector. Section 6.4.2 will introduce two methods that estimate the base class posterior to combine precomputed transformations. 6.4.1 Observation-trained transformation e global transformation estimated with (6.40) gives an optimal overall transforma- tion. However, when the observation is known, the distribution over the component identity can be approximated better. A scheme that estimates one pcmllr transform from the posterior predicted distribution will be called “observation-trained pcmllr”. To give a more robust estimate of the component distribution, the update of the component distribution is not performed per component, but per base class. To do this, the component occupancy is factorised into the occupancy of the base class γ(r) and the occupancy of the component given the base class: γ(m) = γ(r) ( γ(m) γ(r) ) , γ(r) = ∑ m∈Ω(r) γ(m), (6.41a) whereΩ(r) is the set of all components in a base class. 177 chapter 6. predictive transformations e base class occupancy can be replaced by a weight estimated using the obser- vation yt. An obvious choice is the posterior of the front-end mixture of Gaussians, P(r|yt). Ešectively, the posterior distribution of the corrupted speech vector y given the observation becomes p(y|yt) = ∑ r P(r|yt)P(y|r) . (6.41b) e base class posterior then replaces the prior base class occupancy in (6.41a): P(m|yt) := P(r|yt) ( γ(m) γ(r) ) = P(r|yt) γ(r) γ(m). (6.41c) Each of the componentsm of the speech recogniser gmm is weighted or deweighted by the same amount as its associated front-end component r, so that (6.40) becomes At := argmin A ∑ r P(r|yt) γ(r) ∑ m∈Ω(r) γ(m)KL(p(m)∥∥q(m)). (6.41d) At in (6.41d) is retrained for every feature vector. Apart from the weighting, this is the same expression as the optimisation for generic predictive transformations in (6.18). 
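The following sketch illustrates the reweighting in (6.41): the front-end posteriors P(r|y_t) replace the base-class occupancies and scale per-base-class statistics that have been accumulated off-line. The statistics layout and function names are assumptions for illustration only.

```python
import numpy as np

def front_end_posteriors(y_t, weights, means, variances):
    """P(r | y_t) from a diagonal-covariance front-end Gaussian mixture,
    one component per base class r.  means, variances: (R, d); weights: (R,)."""
    log_lik = -0.5 * (np.sum(np.log(2.0 * np.pi * variances), axis=1)
                      + np.sum((y_t - means) ** 2 / variances, axis=1))
    log_post = np.log(weights) + log_lik
    post = np.exp(log_post - log_post.max())      # subtract max for stability
    return post / post.sum()

def observation_weighted_statistics(post, gamma_r, k_r, G_r):
    """Pool cached per-base-class statistics (the inner sums over m in (6.41d))
    with weights P(r|y_t) / gamma(r), ready for the standard cmllr estimation.
    gamma_r: (R,); k_r: (R, d, d+1); G_r: (R, d, d+1, d+1)."""
    w = post / gamma_r                            # P(r|y_t) / gamma(r), shape (R,)
    k = np.einsum('r,rix->ix', w, k_r)            # k(i)
    G = np.einsum('r,rixy->ixy', w, G_r)          # G(i)
    return k, G
```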
e algorithm therefore is the same as for predictive cmllr discussed in section 6.2.1, but with modied statistics.ey are similar to (6.21), with the component occupan- cies γ(m) replaced by their posteriors P(m|yt) in (6.41c): k(i) , ∑ r P(r|yt) γ(r) ∑ m∈Ω(r) γ(m)µ (m) x,i σ (m) x,ii [ Ep(m) { yT } 1 ] ; (6.42a) G(i) , ∑ r P(r|yt) γ(r) ∑ m∈Ω(r) γ(m) σ (m) x,ii  Ep(m){yyT} Ep(m){y} Ep(m) { yT } 1  . (6.42b) If implemented directly, it is computationally expensive to accumulate these statistics. However, given that speech recogniser components are weighted a whole base class at a time, the necessary statistics are weighted sums of per-base class statistics, inside the sums ∑ m∈Ω(r) .ese per-base class statistics are exactly the same as the statistics that standard cmllr (predictive and adaptive) uses, in (6.21), to estimate per-base class transformations. e number of base classes is normally only a fraction of the number of back-end components, so computing the statistics in (6.42b) is fast. 178 6.4. front-end pcmllr 6.4.2 Posterior-weighted transformations Rather than estimating a global transform, whether using the speech recogniser com- ponent priors or their updated versions in (6.4.1), it is possible to estimate a set of transforms appropriate to regions of the acoustic space with pcmllr, and from those construct a transformation that is observation-specic. A simple way of doing this is to pick the transformation associated with the most likely front-end component: At := A(r∗), r∗ = argmax r P(r|yt) . (6.43) is scheme, “hard-decision pcmllr”, yields a piecewise linear transformation of the feature space. A more sophisticated approach is similar to front-end cmllr (Liao and Gales 2005). is interpolates the transforms, each optimised to minimise the kl diver- gence to the predicted back-end distributions in an acoustic region. e front-end component posterior is an approximate measure of how likely an observation is to have been generated by a back-end component associated with front-end compon- ent r.e transformation becomes At = ∑ r P(r|yt)A (r); (6.44a) bt := ∑ r P(r|yt)b (r). (6.44b) is will be called “interpolated pcmllr”. e form of the compensation is then the same as for model-based feature enhancement (see section 4.6), so that decoding is equally fast. However, interpolated pcmllr is properly viewed as transforming the models, aiming to minimise the kl divergence between the predicted distributions and the speech recogniser models, rather than reconstructing the clean speech. Post- processing of the transformed observation as if it represents the clean speech is there- fore not possible. 179 chapter 6. predictive transformations 6.5 Summary is chapter has described the second contribution of this thesis. It has introduced a general framework for approximating one speech recogniser parameterisation with another. Its objective is to minimise the kl divergence to a predicted distribution. is framework of predictive methods subsumes model compensation methods. Sec- tion 6.2 has derived a number of new predictive versions of linear transformations that chapter 3 had discussed. Section 6.3 has detailed how to approximate joint uncer- tainty decoding with full covariances (from section 5.4). 
is combines joint uncer- tainty decoding’s ability to train on limited data and linear transformations’ ability to model correlations without reducing decoding speed.e resulting chain of methods (extended vts, joint uncertainty decoding, and predictive semi-tied covariance ma- trices) is what this thesis proposes as a feasible compensation scheme for real-world speech recognition. To show oš the versatility of predictive transformations, section 6.4 has intro- duced a number of schemes based oš predictive cmllr, which apply only feature transformations, but make the resulting speech recogniser match the predicted dis- tributions as well as possible. 180 Chapter 7 Asymptotically exact likelihoods is chapter will present the third contribution of this thesis, which is more theoret- ical.1 Section 4.1 has argued that using the exact corrupted speech distribution would lead to optimal classication. Given standard models for the speech, noise, and mis- match function, the distributions of the corrupted speech does not have a closed form. Model compensation methods therefore approximate it with a parameterised distri- bution, usually a Gaussian. Even for amixture of full-covariance Gaussians, estimated as in section 5.3.2, the number of Gaussians required to approximate the real distri- bution well may be high, and the complexity is essentially cubic in that number. No previous work has investigated how well speech recognition would perform without parameterised distributions. is chapter will introduce a sampling method that, given distributions over fea- ture vectors for the speech and the noise and a mismatch function, in the limit calcu- lates the corrupted speech likelihood exactly. What the method yields, therefore, is an upper bound on model compensation. Section 7.1 will introduce the model that this chapter assumes. Section 7.2 will discuss how to approximate the distribution with importance sampling from a Gaussian over the speech and noise. However, when 1 is work has been published as van Dalen and Gales (2010b) and a summary in van Dalen and Gales (2010a). 181 chapter 7. asymptotically exact likelihoods applied to more than a few dimensions this becomes infeasible. Section 7.3 will there- fore rst transform the integral in one dimension, and apply importance sampling. It will then introduce factorisations for the multi-dimensional integrand, and approx- imate the integral with sequential importance sampling. is scheme is too slow to implement in a speech recogniser. An alternative metric for assessing model com- pensation methods, more ne-grained than word error rates, would be how closely model compensation methods match the distribution they approximate. Section 7.4 will therefore introduce a method to compute the kl divergence up to a constant of the approximation of the corrupted speech distribution to the real distribution. 7.1 Likelihood evaluation is section uses a dišerent approach to approximating the corrupted speech distribu- tion from standardmodel compensation. Model compensationmethods, like the ones discussed in section 4.4, nd a parameterised approximation for the corrupted speech distribution for recognition. However, no expression for the full density is needed: while recognising speech only likelihoods for vectors that are observed are required. erefore, similarly to the methods in section 4.5, in this section the likelihood is ap- proximated for a specic observationyt. 
For simplicity of notation, the convolutional noise will be assumed zero in this section.2 Substituting yt for y in (4.24b), p(yt) = ∫ ∫ p(yt|n, x)p(n)dnp(x)dx (7.1) Since, given yt, this is essentially of the form p(yt) = ∫ φ(n, x)p(n, x)d(x,n), (7.2) where the desired quantity is the integral of test function φ under distribution p, the obvious approachwould be to apply one of a number of standard approaches to solving this form of problem. 2 In the mismatch function in (4.10a) the convolutional noise just causes an ošset on the speech signal. 182 7.1. likelihood evaluation Laplace’s method may be used.is approximates the complete integrand by pla- cing a Gaussian q on its mode. is is not possible analytically, but the Algonquin algorithm (see section 4.5.1) approximates this with an iterative algorithm. However, this does not yield any guarantee about the resulting likelihood. Another approach would seem to nd a variational lower-bound q to the in- tegrand. An obstacle to this, however, is the form of the test function φ(n, x) = p(yt|n, x). It represents the probability that corrupted speech vector yt is generated from clean speech and noise vectors x,n. ough the phase factor introduces some uncertainty about thismapping,most values forx andn are incompatible for givenyt, and φ(n, x) is then 0. It is therefore not possible to lower-bound the integrand with a Gaussian, and not obvious which other distribution (the shape of the integrand will be detailed in section 7.2, and depicted in gure 7.1a on page 187) to choose. A third approach may be to approximate the integral with Monte Carlo. (7.2) is written in a form that makes it obvious how to do so. It is straightforward to sam- ple (n, x) from the prior p(n, x). However, the shape of p(n, x) is not always a good match for the shape of φ, so that most samples are drawn in vain. To alle- viate this problem, it is possible to use importance sampling, which is discussed in appendix a.4.2, to evaluate the integral over φ(n, x)p(n, x) at once. Why sampling from p(n, x) is so ine›cient will also be analysed from this perspective. Importance sampling requires a proposal distribution, the distribution that the al- gorithmdraws from instead of the actual distribution. Itmakes up for the dišerence by assigning each sample a weight.e proposal distributions needs to match the target density well, otherwise the weights will have a high variance, and too many samples will be required to arrive at a good estimate. e number of samples required grows exponentially with the number of dimensions. To apply importance sampling, the proposal distribution will be either the prior, which reduces to straightforward Monte Carlo, or an Algonquin-derived approxima- tion of the posterior. Both essentially use a joint Gaussian distribution over the speech and the noise.e prior distribution is oŸen far away from the integrand, but even the 183 chapter 7. asymptotically exact likelihoods Algonquin-generatedGaussian cannot approximate the curved integrandwell.ere- fore, the number of samples required makes it infeasible to use this approximation. Section 7.3 will then introduce a transformation of the integrand.e integral over x, n, and α can then be expressed as an integral over substitute variable u and α. is new expression is still exact, but more amenable to being approximated with import- ance sampling. 
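As a reminder of the mechanics just described, the sketch below performs generic importance sampling for an unnormalised target such as the integrand in (7.2): samples are drawn from a proposal ρ and weighted by the ratio of target to proposal density. This illustrates only the general technique; the proposal choices investigated here follow in the text.

```python
import numpy as np

def log_importance_sampling_estimate(log_target, sample_proposal, log_proposal, L=10000):
    """Return the log of the estimate of  integral target(z) dz = E_rho[target(z)/rho(z)].
    log_target and log_proposal map an array of L samples to log densities;
    sample_proposal(L) draws L samples from the proposal rho."""
    z = sample_proposal(L)
    log_w = log_target(z) - log_proposal(z)       # log importance weights
    m = np.max(log_w)                             # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_w - m)))
```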
AŸer considering the one-dimensional case, the higher-dimensional space will use the same techniques for each dimension, and apply sequential import- ance resampling. is chapter will focus on one Gaussian for the speech and one for the noise: x ∼ N (µx,Σx) ; (7.3a) n ∼ N (µn,Σn) . (7.3b) e mismatch function is the one given in (4.10a). is mismatch function uses a state-of-the-art model of the phase factor (see section 4.2.1.1). However, all that the scheme relies on is that it is possible to draw samples from the phase factor distribu- tion. Any other distribution for p(α) can be plugged in as long as it can be sampled from. In this work, the speech and noise are modelled with Gaussians. However, dif- ferent distributions would be possible, provided the form of the distributions satises two requirements. First, a proposal distributionmust be found that is su›ciently close to the one-dimensional distribution that importance sampling is possible. Second, a reasonable approximation must be available for the marginal of one dimension given settings for a subset of all other dimensions. ForGaussians, both of these requirements will appear to be possible, though far from straightforward, to full. One aspect that makes the second requirement hard is that the speech and noise will bemodelled with correlated dimensions, through full covariances. Since the speech and noise priors are full-covariance Gaussians, it make little dif- ference whether they represent cepstral or log-spectral features: the domains are re- lated by a linear transformation, the dct (see section 2.1). In the log-spectral domain, the mismatch function works dimension per dimension (see section 4.2.1), so this 184 7.2. importance sampling over the speech and noise chapter will assume log-spectral domain speech and noise priors.e vectors in this chapter will consist of just statics, and will be denoted with x,n,α,y. e main dišerence with earlier work with a similar aim is that here, the speech and noise priors have full covariance matrices, so that dimensions can not be treated separately. Myrvoll and Nakamura (2004) used a piecewise linear approximation in one dimension, which, as section 4.5.2 has shown, does not generalise to multiple dimensions. At the same conference as work in this thesis was presented (van Dalen and Gales 2010a), Hershey et al. (2010) independently introduced a similar strategy. As will be discussed in section 7.3.1.3, it also treats dimensions separately. e aim of the method that this chapter will introduce is the opposite of that of single-pass retraining (see section 4.4.4), though both can be seen as idealmodel com- pensation. Single-pass retraining does not use speech and noise distributions or a mismatch function, but it does assume that the corrupted speech is Gaussian. e method in this chapter assumes the speech and noise priors to be Gaussian, and the mismatch function to be given, but does not assume a form of distribution of the cor- rupted speech. 7.2 Importance sampling over the speech and noise is section presents the rst approach, which is to approximate the integration over x andn. Importance sampling requires that the integrand can be evaluated at any point (x,n). For this, part of the observation likelihood needs to be rewritten more expli- citly. e corrupted speech likelihood is ((4.24b) and (4.24c) with yt substituted for y, as in (7.1)): p(yt) = ∫ ∫ p(yt|x,n)p(n)dnp(x)dx (7.4a) = ∫ ∫ ∫ δf(x,n,α)(yt)p(α)dαp(n)dnp(x)dx. 
(7.4b)

The value of the observation vector y_t is not deterministic given the speech and the noise, because the phase factor α is a random variable. However, α is deterministic given x, n, and y_t, so that p(y_t|x,n) can be written in terms of the distribution of the phase factor. The phase factor that a specific setting of x, n, and y_t implies will be written α(x,n,y_t). It is a standard result (in appendix a.1.1 and, e.g., Bishop 2006, 11.1.1) that transforming the space of a probability distribution requires multiplying by the determinant of the Jacobian:
\[
p(\mathbf{y}_t|\mathbf{x},\mathbf{n}) = \left|\left.\frac{\partial \boldsymbol{\alpha}(\mathbf{x},\mathbf{n},\mathbf{y})}{\partial \mathbf{y}}\right|_{\mathbf{y}_t}\right| \cdot p\big(\boldsymbol{\alpha}(\mathbf{x},\mathbf{n},\mathbf{y}_t)\big),
\tag{7.5}
\]
where p(α(x,n,y)) denotes the density of p(α) at the value of α implied by x, n, and y.

The value of the phase factor as a function of the other variables follows from (4.9). The relation is defined per coefficient (i.e. frequency bin) i of the variables:
\[
\alpha_i = \frac{\exp\big(y^{\log}_i\big) - \exp\big(x^{\log}_i\big) - \exp\big(n^{\log}_i\big)}{2\exp\big(\tfrac{1}{2}x^{\log}_i + \tfrac{1}{2}n^{\log}_i\big)},
\tag{7.6a}
\]
and its partial derivative with respect to y_i is
\[
\frac{\partial \alpha(x_i, n_i, y_i)}{\partial y_i} = \frac{\exp\big(y^{\log}_i\big)}{2\exp\big(\tfrac{1}{2}x^{\log}_i + \tfrac{1}{2}n^{\log}_i\big)}.
\tag{7.6b}
\]
The diagonal elements of the Jacobian ∂α(x,n,y)/∂y are given by these partial derivatives with respect to y_i. The off-diagonal entries of the Jacobian are 0.

The distribution of the corrupted speech in (7.4a) can then be written as
\[
\begin{aligned}
p(\mathbf{y}_t) &= \int\!\!\int p(\mathbf{y}_t|\mathbf{x},\mathbf{n})\,p(\mathbf{n})\,\mathrm{d}\mathbf{n}\,p(\mathbf{x})\,\mathrm{d}\mathbf{x}
= \int\!\!\int \left|\left.\frac{\partial \boldsymbol{\alpha}(\mathbf{x},\mathbf{n},\mathbf{y})}{\partial \mathbf{y}}\right|_{\mathbf{y}_t}\right| \cdot p\big(\boldsymbol{\alpha}(\mathbf{x},\mathbf{n},\mathbf{y}_t)\big)\,p(\mathbf{n})\,\mathrm{d}\mathbf{n}\,p(\mathbf{x})\,\mathrm{d}\mathbf{x} \\
&= \int\!\!\int \left\{\left(\prod_i \frac{\exp\big(y^{\log}_{t,i}\big)}{2\exp\big(\tfrac{1}{2}x^{\log}_i + \tfrac{1}{2}n^{\log}_i\big)}\right) \cdot p\big(\boldsymbol{\alpha}(\mathbf{x},\mathbf{n},\mathbf{y}_t)\big)\,p(\mathbf{n})\,p(\mathbf{x})\right\}\mathrm{d}\mathbf{n}\,\mathrm{d}\mathbf{x}
\;\triangleq\; \int\!\!\int \gamma(\mathbf{x},\mathbf{n})\,\mathrm{d}\mathbf{n}\,\mathrm{d}\mathbf{x}.
\end{aligned}
\tag{7.7}
\]
This expression is exact. Though the integrand, the expression in curly braces and denoted with γ, is now straightforward to evaluate at any given point (x,n) if p(α) can be evaluated, the integral has no closed form. It can, however, be approximated using importance sampling with γ as the target density. Importance sampling requires a proposal distribution that is close to the target. Figure 7.1a illustrates a one-dimensional version of the density p(y_t|x, n). The density lies around the curve that relates x and n for given α and y_t, shown as a dashed line. This curve is given by exp(x) + exp(n) = exp(y_t). That the sum of the exponents of x and n is fixed causes a bend in the curve. Figure 7.1b shows the density in figure 7.1a multiplied by the priors of x and n, which gives p(x, n, y_t) = γ(x, n).

[Figure 7.1: The distribution of the clean speech and noise for x ∼ N(10, 1); n ∼ N(9, 2); α ∼ N(0, 0.04); y_t = 9. (a) The distribution p(y_t|x, n). (b) The distribution p(x, n, y_t).]

Importance sampling uses a proposal distribution ρ, from which it draws L samples (x^(l), n^(l)), and weights them to make up for the difference between proposal and target densities. The integral to be approximated is the normalisation constant of γ. The derivation is in appendix a.4.2, and in particular (a.43b), but intuitively it can be written as
\[
\int\!\!\int \gamma(\mathbf{x},\mathbf{n})\,\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{n} = \int\!\!\int \frac{\gamma(\mathbf{x},\mathbf{n})}{\rho(\mathbf{x},\mathbf{n})}\,\rho(\mathbf{x},\mathbf{n})\,\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{n} \simeq \frac{1}{L}\sum_{l=1}^{L} \frac{\gamma\big(\mathbf{x}^{(l)},\mathbf{n}^{(l)}\big)}{\rho\big(\mathbf{x}^{(l)},\mathbf{n}^{(l)}\big)}, \qquad \big(\mathbf{x}^{(l)},\mathbf{n}^{(l)}\big) \sim \rho.
\tag{7.8}
\]

[Figure 7.2: The distribution of the clean speech and noise for x ∼ N(10, 1); n ∼ N(9, 2); α ∼ N(0, 0.04); y = 9. (a) The prior p(x,n). (b) Gaussian approximation to the posterior p(x,n|y_t).]

The fraction of the target and proposal densities, γ/ρ, gives the weight of the samples.
is weight makes up for the dišerence between the two densities. To approximate the integral under γ, this section considers two options for the proposal distribution: the prior, and the Algonquin approximation to the posterior. Both are Gaussian. A priori, the speech and the noise are independent. Given Gaussian prior distri- butions for p(x) and p(n) (in (7.3a) and (7.3b), respectively), their joint prior becomes x n  ∼ N µx µn  , Σx 0 0 Σn  . (7.9) 188 7.3. importance sampling in a transformed space is Gaussian is shown as white lines on top of the actual posterior of x and n in gure 7.2a on the facing page. An alternative approach would be to use the Gaussian approximation to the pos- terior that the Algonquin algorithm (presented section 4.5.1) nds. Unlike the one in (7.9), this distribution does not model the speech and the noise as independent. Figure 7.2b shows it superimposed on the actual posterior. e main problem areas for either Gaussian as proposal distribution are the re- gions of space where proposal and target do not match well. Where the proposal dis- tribution has a higher value than the target distribution, more samples will be drawn that are assigned lower weights: a waste of computational time. Conversely, where the proposal distribution is much lower than the target distribution, samples will seldom be drawn, and when they do, they are assigned high weights. In this case, the number of samples that needs to be drawn to get su›cient coverage becomes very high. ese two problems are exacerbated by high dimensionality: for every dimension, either of these cases can occur.e probability that neither problem arises for a sample decreases exponentially, so that the number of samples required increases exponen- tially. Both approximations shown in gure 7.2 sušer from this problem, so that it is not feasible to apply them to a 24-dimensional log-spectral problem.e next section will therefore transform the space so that the target distribution per dimension can be approximated better. 7.3 Importance sampling in a transformed space e problemwith the scheme in the previous section is the hard-to-approximate bend in the distribution of x and n given yt.is section will overcome this by transform- ing the space, and then approximating the integral. Conceptually, this is similar to Myrvoll and Nakamura (2004), which was discussed in section 4.5.2. However, there the approximation was constrained to one dimension. Here, the mismatch function contains an extra variable (the phase factor), a dišerent transformation is used, and 189 chapter 7. asymptotically exact likelihoods the approximation uses sequential importance sampling. Initially, a one-dimensional space will be considered. Section 7.3.2 will discuss how to generalise this to the multi- dimensional case and how to deal with the additional complications that arise. 7.3.1 Single-dimensional In one dimension, the mismatch function is ((4.9) without indices and with yt sub- stituted for y) exp ( y log t ) = exp ( xlog ) + exp ( nlog ) + 2α exp ( 1 2x log + 12n log ) . (7.10) As discussed in section 4.2.1.1 on page 65, α is a weighted average of cosines of the angle between the signals of the frequencies in one bin, and as such constrained to be between −1 and +1. Since this equality relates four variables deterministically, if three are known, then in many cases the fourth is known as well. e objective of this section is to nd an approximation to p(yt) for a given observation yt. 
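To make the mismatch relation (7.10) concrete, the sketch below evaluates it and, given y_t, α, and x, solves the resulting quadratic in exp(½n) for the remaining variable; that up to two roots can survive is exactly the ambiguity discussed next. This is an illustrative derivation from (7.10), not code from the thesis.

```python
import numpy as np

def corrupted_log_energy(x, n, alpha):
    """The one-dimensional mismatch function (7.10) in the log-spectral domain."""
    return np.log(np.exp(x) + np.exp(n) + 2.0 * alpha * np.exp(0.5 * (x + n)))

def solve_for_noise(x, y, alpha):
    """Given x, y and alpha, solve (7.10) for n.  Writing s = exp(n/2) turns (7.10)
    into the quadratic  s^2 + 2*alpha*exp(x/2)*s + (exp(x) - exp(y)) = 0,
    so there can be zero, one, or two real solutions for n."""
    b = 2.0 * alpha * np.exp(0.5 * x)
    c = np.exp(x) - np.exp(y)
    disc = b * b - 4.0 * c
    if disc < 0.0:
        return []                                   # x, y, alpha are incompatible
    roots = [(-b + np.sqrt(disc)) / 2.0, (-b - np.sqrt(disc)) / 2.0]
    return [2.0 * np.log(s) for s in roots if s > 0.0]
```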
Since four variables are linked deterministically and one is known (y_t), the integration will be over two variables. This was also the case in section 7.2. There, the obvious choice of integrating over x and n did not work out well because of the shape of the density in (x, n)-space. This section introduces a variable u that represents a pair (x, n) given y_t and α. Changing u traverses the curve in (x, n)-space, so that the bend is not a problem any more. The integral will then be over α and the new variable u.

A property of (7.10) that complicates the derivation of the transformed integral is that when three variables are known, the fourth can in some cases have two values. (7.10) is quadratic in exp(½x) and exp(½n), so it can have two solutions for x and n. This can be seen in figure 7.3, which shows how x and n are related for various values of α. For n = 10, α = 0.99, for example, x has two solutions.

[Figure 7.3: The relation between clean speech x and additive noise n for y_t = 9 and evenly-spaced values of α (curves labelled from α = −0.99 to α = 0.99).]

However, for given y_t and α, it is possible to define one variable that unambiguously identifies a point on the curve. The substitute variable will be called u, with
\[
u = n - x.
\tag{7.11}
\]

[Figure 7.4: The region of (x, n) that the integral is explicitly derived for: x ≤ n.]

The value of u is related to the signal-to-noise ratio. If u is a large negative number, x is close to y_t and n is large and negative. This represents a very low signal-to-noise ratio. The converse is true if u is a large positive number: then n is close to y_t and x is large and negative.

This substitution will be used to define an integral that yields p(y_t). First, p(y_t|·) will be re-expressed using p(n|·) or p(x|·). Since neither of these variables is known deterministically for all values of the other variables, the integral will be partitioned in two parts. n has one possible value given a setting for (x, y_t, α) when x is constrained to be smaller than n, which is the shaded region in figure 7.4. In the complementary region, x has one possible value given fixed (n, y_t, α). The likelihood can be written as a sum of both these regions:
\[
p(y_t) = p(y_t, x \le n) + p(y_t, n < x).
\tag{7.12}
\]
Because of the symmetry between these two regions, only the derivation for the region x ≤ n will be given explicitly. The derivation for n < x is analogous.

The additive noise that follows from setting the variables x, y_t, α will be denoted with n(x, y_t, α). This is deterministic in the region where x ≤ n. The variable of the probability distribution will be changed from y_t to n. This requires multiplication by a Jacobian (see section a.1.1 on page 255 or, for example, Bishop 2006, 11.1.1). This Jacobian, the partial derivative of n with respect to y keeping x and α fixed, will be written ∂n(x,y,α)/∂y, and be evaluated at y_t:
\[
p(y_t, x \le n \,|\, x, \alpha) = \left|\left.\frac{\partial n(x, y, \alpha)}{\partial y}\right|_{y_t}\right| \cdot \mathbb{1}(x \le n) \cdot p\big(n(x, y_t, \alpha)\big).
\tag{7.13}
\]
Here, 1(·) denotes the indicator function, which evaluates to 1 if its argument is true, and 0 otherwise. p(n(x, y_t, α)) is the probability distribution of n evaluated at n(x, y_t, α), the value of n corresponding to the setting of (x, y_t, α).
e evaluation of the half of the likelihood for x ≤ n can then be rewritten with (7.13) and by then replacing the variable of the integration by u.e predicate x ≤ n is equivalent to 0 ≤ u, which can be expressed using bounds on the integral. p(yt, x ≤ n) = ∫ p(α) ∫ p(x)p(yt, x ≤ n|x, α)dxdα = ∫ p(α) ∫ p(x) · ∣∣∣∣∣ ∂n(x, y, α)∂y ∣∣∣∣ yt ∣∣∣∣∣ · 1(x ≤ n) · p(n(x, yt, α))dxdα = ∫ p(α) ∫∞ 0 ∣∣∣∣∂x(u, yt, α)∂u ∣∣∣∣ · ∣∣∣∣∣ ∂n(x, y, α)∂y ∣∣∣∣ yt,x(u,yt,α) ∣∣∣∣∣ · p(x(u, yt, α)) · p(n(u, yt, α)) dudα. (7.14) 192 7.3. importance sampling in a transformed space Here, p(x(n, yt, α)) is the probability distribution of x evaluated at x(n, yt, α), the value of x corresponding to the setting of (n, yt, α). Appendix f.1 on page 304 gives the derivations for the Jacobians in (7.14), and x(u, yt, α) and n(u, yt, α) for the re- gion x ≤ n. From (f.7c) the product of the Jacobians is −1. By substituting 1 for the absolute value of the product of the Jacobians in (f.7c) into (7.14), one half of the likelihood in (7.12) becomes p(yt, x ≤ n) = ∫ p(α) ∫∞ 0 ∣∣∣∣∂x(u, yt, α)∂u ∣∣∣∣ · ∣∣∣∣∣ ∂n(x, y, α)∂y ∣∣∣∣ yt,x(u,yt,α) ∣∣∣∣∣ p(x(u, yt, α))p(n(u, yt, α)) dudα = ∫ p(α) ∫∞ 0 p(x(u, yt, α))p(n(u, yt, α))dudα. (7.15a) e integral of u is over the area where 0 ≤ u, i.e. where x ≤ n. e equivalent integration over the region where u < 0 could be derived by exchanging x andn, and replacing u with −u. Applying this to all derivations in section f.1 on page 304 and to (7.15a) yields the other half of the likelihood in (7.12). Because of the symmetry of n and x and because the Jacobians cancel out, the result is identical to (7.15a) save for the range of u: p(yt, n < x) = ∫ p(α) ∫ 0 −∞p(x(u, yt, α))p(n(u, yt, α))dudα. (7.15b) e sum of (7.15a) and (7.15b) yields the total likelihood of yt. e integrand will be called γ. p(yt) = p(yt, x ≤ n) + p(yt, n < x) = ∫ p(α) ∫∞ −∞p(x(u, yt, α))p(n(u, yt, α))dudα , ∫ p(α) ∫ γ(u|α)dudα (7.16) , ∫ ∫ γ(u,α)dudα. (7.17) us the integral has been expressed in terms of u and α, rather than x and n as in (7.7). is derivation is exact and holds for any form of priors for the speech and 193 chapter 7. asymptotically exact likelihoods noise p(x) and p(n). Just like aŸer rewriting p(yt) in (7.7), the integrand can be eval- uated at any given point (u,α), assuming that p(α) can be evaluated, but the integral has no closed form. e outer integral is straightforward to approximate with plain Monte Carlo (see section a.4 on page 268).is works by drawing samples α(l) from p(α) and averaging over sampling approximations for the inner integral given α(l). e problem with approximating the inner integral is that it is impossible to draw samples from γ(u|α). erefore, importance sampling is necessary. is requires a proposal distribution ρ(u|α) that it is possible to draw samples from, and is close to γ. Section a.4.2 on page 269 gives a detailed description of importance sampling. How- ever, intuitively the double integral can be replaced by a summation over L samples α(l) from p(α) and corresponding samples for u(l) drawn from ρ(u|α(l)): ∫ p(α) ∫ γ(u|α)dudα = ∫ p(α) ∫ γ(u|α) ρ(u|α) ρ(u|α)dudα ' 1 L L∑ l=1 ∫ γ ( u ∣∣α(l)) ρ ( u ∣∣α(l))ρ(u∣∣α(l))du (7.18a) ' 1 L L∑ l=1 γ ( u(l) ∣∣α(l)) ρ ( u(l) ∣∣α(l)) , (7.18b) α(l) ∼ p(α) , u(l) ∼ ρ ( u ∣∣α(l)). e next section will detail the shape of γ(u|α(l)) and nd appropriate forms for ρ(u|α(l)) for it. 7.3.1.1 The shape of the integrand To apply importance sampling, a proposal distribution is required, whichwill be tailored to the parameters of the target distribution. 
As discussed in section 7.2, it is important that the proposal distribution matches the integrand closely, or too many samples will be required for a good approximation. This section will find proposal distributions with well-matching shapes. The scaling of the density graphs in this section will be arbitrary.

[Figure 7.5: The density of γ(u,α) = p(α)γ(u|α) for y_t = 9, x ∼ N(7, 1), n ∼ N(4, 4), σ²_α = 0.13.]

So far, the derivation has not assumed any specific distributions for x or n, and has been valid for any distribution. However, for different distributions, different proposal functions are required. With Gaussians for the speech and noise, the integrand becomes
\[
\gamma(u, \alpha) = p(\alpha)\,\mathcal{N}\big(x(u, \alpha, y_t);\, \mu_x, \sigma_x^2\big)\,\mathcal{N}\big(n(u, \alpha, y_t);\, \mu_n, \sigma_n^2\big) \triangleq p(\alpha)\,\gamma(u|\alpha).
\tag{7.19}
\]
Figure 7.5 gives an example of the target distribution γ(u,α). As shown in (7.18b), samples for one dimension, α, can be drawn directly from the distribution for α. It is the other dimension, u, that requires importance sampling, and therefore a proposal distribution ρ. The following examples will assume the mode of p(α), α = 0, and consider representative shapes for γ(u|α = 0).

γ(u|α) consists of a factor N(x(u,α,y_t); μ_x, σ²_x) related to the clean speech and a factor N(n(u,α,y_t); μ_n, σ²_n) related to the noise. Both terms are Gaussians, but the variables of the Gaussians (x and n) are non-linear functions of u. Figure 7.6 depicts the relationship between x and n. Different values of u represent different positions on the curve. When u is negative, n tends towards u and x tends towards y_t. When u is positive, x tends towards −u and n tends towards y_t. Around u = 0 there is a soft cut-off. This graph provides an intuitive connection to noise masking schemes, which assume that either the speech or the noise dominates (Klatt 1976; Holmes and Sedgwick 1986). This would yield a curve with a sharp angle, so that in all cases either the speech or the noise is equal to the observation.

[Figure 7.6: Values of x, n for y_t = 9, α = 0, and varying u (marked at u = −5, u = 0, u = 5). At the top of the frame, N(x; μ_x, σ²_x) in γ; at the right, N(n; μ_n, σ²_n).]

The two factors of γ(u|α) are the Gaussians depicted on top and on the side of the graph. They are evaluated at the appropriate values of x and n. When the Gaussians are plotted with respect to u, the soft cut-off leads to a Gaussian distribution that is infinitely extended on one side. Figure 7.7 illustrates the shape of the two factors. As u tends to −∞, x tends to y_t. In this example, as u → −∞, N(x(u,α,y_t); μ_x, σ²_x) therefore tends to
\[
\mathcal{N}\big(x(u, \alpha, y_t);\, \mu_x, \sigma_x^2\big) \to \mathcal{N}\big(y_t;\, \mu_x, \sigma_x^2\big) = \mathcal{N}(9;\, 7, 1) = \mathrm{e}^{-2}\big/\sqrt{2\pi} \simeq 0.054.
\tag{7.20}
\]
N(x(u,α,y_t); μ_x, σ²_x) in figure 7.7 therefore is a Gaussian-like distribution with a soft cut-off on its left tail, so that it converges to a non-zero constant. Analogously, N(n(u,α,y_t); μ_n, σ²_n) is Gaussian-like but cut off at its right tail, where it converges to a non-zero constant.

[Figure 7.7: The factors of γ(u|α = 0) separately, N(x(u,α,y_t); μ_x, σ²_x) and N(n(u,α,y_t); μ_n, σ²_n), for y_t = 9, x ∼ N(7, 1), n ∼ N(4, 4).]

Figure 7.8 shows examples for the three types of shape of γ(u|α = 0). Each time, the left graph contains the shape of the two factors, and the right graph their product.
Figure 7.8a uses the example from figure 7.7. The integrand, the product of two cut-off Gaussians, is bimodal. When u tends to ±∞, one term of γ tends to a non-zero constant, but the other one tends to 0. γ therefore tends to 0 as well.³ In figure 7.8b, μ_x > y_t, so that the graph of N(x(u,α,y_t); μ_x, σ²_x) is cut off right of its maximum. The product is similar to N(n(u,α,y_t); μ_n, σ²_n), except that the right tail goes to zero. The result is almost Gaussian. In the last example, figure 7.8c, both μ_x and μ_n are greater than y_t, so that both Gaussians are cut off before their maximum. Their product has a lop-sided Gaussian-like shape around the point of the soft cut-off, by definition u = 0.

³ This must be true also because ∫ γ(u|α = 0) du is equal to p(y_t) evaluated at a point, which cannot be infinite if either σ²_x or σ²_n is non-zero.

[Figure 7.8: γ(u|α = 0) for different cases: left the two factors, right their product. (a) N(x(u, 0, y_t); 7, 1), N(n(u, 0, y_t); 4, 4); their product γ(u|α). (b) N(x(u, 0, y_t); 9.5, 1), N(n(u, 0, y_t); 4, 4); their product γ(u|α). (c) N(x(u, 0, y_t); 9.5, 1), N(n(u, 0, y_t); 10, 10); their product γ(u|α).]

7.3.1.2 Importance sampling from the integrand

A proposal distribution that is simple to draw a sample from is a Gaussian mixture model. To find a mixture of Gaussian densities that approximates γ, the three types of γ from figure 7.8 are considered separately. Because γ has these different shapes, the approximation will be different for each of these cases. The proposal distributions must be defined over u, so that it is useful to find the value of u corresponding to a specific setting of x, α, and y_t (and n, α, and y_t). This will be denoted u(x, α, y_t) (and u(n, α, y_t)). The expressions are derived in appendix f.4, (f.25) and (f.26f). Figure 7.9 shows the proposal distributions. Their magnitudes are scaled to equalise the areas under the target and proposal densities. The proposal distribution is chosen differently depending on the means of the terms of γ, as follows:

1. μ_x < y_t and μ_n < y_t. This produces a shape of γ as in figure 7.8a. The shape of γ being close to the product of two Gaussians, a Gaussian mixture model with the parameters of these two Gaussians would form a good proposal distribution. As a proposal distribution, a mixture of two Gaussians is chosen with means at the approximate modes, and covariances set to σ²_x and σ²_n, respectively. These Gaussians are illustrated in figure 7.9a. They approximate the terms N(n(u,α,y_t); μ_n, σ²_n) and N(x(u,α,y_t); μ_x, σ²_x) with
\[
\mathcal{N}\big(u;\, u(\mu_n, \alpha, y_t), \sigma_n^2\big);
\tag{7.21a}
\]
\[
\mathcal{N}\big(u;\, u(\mu_x, \alpha, y_t), \sigma_x^2\big).
\tag{7.21b}
\]
As was seen in figure 7.7, each Gaussian is essentially scaled by the extended tail of the other one. The weights of the Gaussians of the proposal distribution can be set to the value that the tail of the other one converges to, which can be computed as in (7.20):
\[
\pi_n \propto \mathcal{N}\big(y_t;\, \mu_x, \sigma_x^2\big);
\tag{7.22a}
\]
\[
\pi_x \propto \mathcal{N}\big(y_t;\, \mu_n, \sigma_n^2\big).
\tag{7.22b}
\]
These weights are normalised so that they sum to 1. The distribution becomes
\[
\rho(u) = \pi_n\,\mathcal{N}\big(u;\, u(\mu_n, \alpha, y_t), \sigma_n^2\big) + \pi_x\,\mathcal{N}\big(u;\, u(\mu_x, \alpha, y_t), \sigma_x^2\big).
\tag{7.23}
\]

2. μ_x > y_t and μ_n < y_t (or its mirror image, μ_x < y_t and μ_n > y_t). Figure 7.8b has shown that N(x(u,α,y_t); μ_x, σ²_x) is cut off before its peak, and converges to its maximum in the limit as u → −∞. This results in a Gaussian distribution except for one tail. The proposal distribution is therefore set to this Gaussian:
\[
\rho(u) = \mathcal{N}\big(u;\, u(\mu_n, \alpha, y_t), \sigma_n^2\big).
\tag{7.24}
\]
Figure 7.9b shows the near-perfect match of this proposal distribution.

3. μ_n > y_t and μ_x > y_t. Both terms of γ are cut off before their peaks, resulting in a shape as in figure 7.8c. The product is a distribution around u = 0 with Gaussian-like tails, one derived from N(n(u,α,y_t); μ_n, σ²_n) and another one derived from N(x(u,α,y_t); μ_x, σ²_x). The proposal distribution is therefore set to a Gaussian with mean 0. Its variance is set to the largest of the variances of the speech and the noise:
\[
\rho(u) = \mathcal{N}\big(u;\, 0, \max(\sigma_n^2, \sigma_x^2)\big).
\tag{7.25}
\]
As figure 7.9c shows, this provides good coverage but overestimation on part of the space. This means that some samples will receive a very low weight.

[Figure 7.9: The proposal distribution for γ(u|α) for different cases: left the components of the proposal distribution, right γ (solid line) and proposal distribution ρ (dashed line, scaled so the area under the curve matches). (a) Proposal (gmm) for y_t = 9, x ∼ N(7, 1), n ∼ N(4, 4). (b) Proposal (one Gaussian) for y_t = 9, x ∼ N(9.5, 1), n ∼ N(4, 4). (c) Proposal (one Gaussian) for y_t = 9, x ∼ N(9.5, 1), n ∼ N(10, 10).]

Thus, by transforming the space of the integration from (x, n) to (u, α), much better proposal distributions for importance sampling can be found than in (x, n)-space, like in section 7.2. The sample weights will therefore vary less, so that good approximations to the integral will be found with a much smaller number of samples. The next section will extend the techniques applied in this chapter to the multi-dimensional case.

7.3.1.3 Related method

At the same time as van Dalen and Gales (2010a), a similar method was proposed (Hershey et al. 2010). There are three differences. First, the model is different: no mel-bins are used, so that the phase factor model in this work is effectively replaced by a per-frequency cosine of the angle between speech and noise in the complex plane (see section 4.2.1.1). Also, the variable transformation is different. The biggest difference, however, is that the method treats dimensions as independent. It therefore fails to take into account the correlations in the distributions of the speech and noise. The next section will introduce multi-dimensional sampling. The strategy it uses may also apply to the method in Hershey et al. (2010).

7.3.2 Multi-dimensional

In this chapter, the log-spectral domain is used. This has the advantage that the interaction between the speech and the noise can be modelled separately per dimension. However, there are strong correlations between log-spectral coefficients. Therefore, the Gaussian priors for the speech and the noise have full covariance matrices. This section will build on the techniques used in the previous section.
This section generalises the transformation of the integral to multi-dimensional (u, α). Rather than standard importance sampling, it then applies sequential importance resampling to approximate the integral.

The relations between the single-dimensional variables in the previous section hold per dimension for the multi-variate case. The substitute variable u that is introduced to represent a point (x, n) given observation yt and phase factor vector α is therefore defined as

u = n − x. (7.26)

The expressions for x(u, α, y) and n(u, α, y), the values for the speech and noise that result from setting (u, α, y), are given in appendix f.2.

There was a complication in transforming the one-dimensional integral in section 7.3.1 from (x, n) to (u, α): for some x, multiple values for n were possible, and vice versa. Because transforming the integral needed a deterministic link, the integral was split into two parts, for two regions of (x, n)-space. In the multi-dimensional case it is necessary to do this for each of the dimensions. Appendix f.2 gives the full derivation. The integration is therefore first split up into conditional distributions per dimension i, and then into regions. The integrals for the two regions over (xi, ni) are rewritten as an integral over (ui, αi). Collapsing the dimensions (see equation (f.16)) then yields an unsurprising generalisation of (7.17):

p(yt) = ∫ p(α) ∫ p(x(u, α, yt)) p(n(u, α, yt)) du dα (7.27a)
      ≜ ∫∫ γ(u, α) du dα, (7.27b)

and it is convenient to factorise the integrand γ(u, α) as

γ(u, α) = γ(α) γ(u|α), (7.27c)
γ(α) = p(α); (7.27d)
γ(u|α) = p(x(u, α, yt)) p(n(u, α, yt)). (7.27e)

This derivation is valid whatever the form of the speech and noise priors, p(x) and p(n). In this work, they are Gaussians with (repeated from (7.3a) and (7.3b))

x ∼ N(µx, Σx); n ∼ N(µn, Σn), (7.28)

with full covariance matrices Σx and Σn. γ(u|α) then becomes

γ(u|α) = N(x(u, α, yt); µx, Σx) N(n(u, α, yt); µn, Σn), (7.29a)

so that

γ(u, α) = p(α) N(x(u, α, yt); µx, Σx) N(n(u, α, yt); µn, Σn). (7.29b)

To approximate this integral, conventional importance sampling could again be used (a minimal sketch is given at the end of this section). Just like in section 7.2, intuitively, the integration over two variables can be approximated by drawing samples (u(l), α(l)) from a proposal distribution ρ:

∫∫ γ(u, α) du dα = ∫∫ [γ(u, α) / ρ(u, α)] ρ(u, α) du dα ≈ (1/L) Σ_{l=1}^{L} γ(u(l), α(l)) / ρ(u(l), α(l)), (u(l), α(l)) ∼ ρ. (7.30)

Figure 7.10 The integrand γ(u|α) for α = 0: the two factors, and their product. (a) N(x(u, α, yt); µx, Σx). (b) N(n(u, α, yt); µn, Σn). (c) The product: γ(u|α).

Figure 7.10 illustrates how the shape of the integrand γ(u|α = 0) generalises to two dimensions of u. The principles are the same as in the one-dimensional case in figure 7.8a. Figure 7.10a contains the factor of γ deriving from the speech prior, N(x(u, α, yt); µx, Σx), and figure 7.10b the same for the noise prior. They are again Gaussians with a soft cut-off, this time in two directions. By choosing the slightly contrived parameter setting

x ∼ N( [7, 6.3]ᵀ, [[1, −0.1], [−0.1, 0.5]] ); n ∼ N( [5, 3]ᵀ, [[2, 0.3], [0.3, 2]] ); yt = [9, 9]ᵀ, (7.31)

the product of the two factors, in figure 7.10c, turns out to have four maxima. In general, for d dimensions, the integrand can have 2ᵈ modes.
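As a concrete reference for (7.30), the following is a minimal sketch of the plain importance-sampling estimator of p(yt) in the transformed space. The phase-factor prior p(α), the mappings x(u, α, yt) and n(u, α, yt) from appendix f.2, and the proposal ρ are all passed in as assumed placeholders; none of them is reproduced from the thesis.

```python
import numpy as np
from scipy.stats import multivariate_normal

def likelihood_plain_is(y_t, mu_x, Sigma_x, mu_n, Sigma_n,
                        p_alpha, x_of, n_of, sample_proposal,
                        L=10000, seed=0):
    """Plain importance-sampling estimate of p(y_t), equation (7.30).

    Assumed placeholders (not reproduced from the thesis):
      p_alpha(alpha)         -- density of the phase-factor prior
      x_of(u, alpha, y_t)    -- the mapping x(u, alpha, y_t) from appendix F.2
      n_of(u, alpha, y_t)    -- the mapping n(u, alpha, y_t) from appendix F.2
      sample_proposal(rng)   -- returns (u, alpha, rho), with rho the proposal
                                density at the drawn point
    """
    rng = np.random.default_rng(seed)
    p_x = multivariate_normal(mu_x, Sigma_x)   # speech prior, full covariance
    p_n = multivariate_normal(mu_n, Sigma_n)   # noise prior, full covariance
    weights = []
    for _ in range(L):
        u, alpha, rho = sample_proposal(rng)
        # gamma(u, alpha) = p(alpha) N(x(u,a,y); mu_x, Sigma_x) N(n(u,a,y); mu_n, Sigma_n), (7.29b)
        gamma = p_alpha(alpha) * p_x.pdf(x_of(u, alpha, y_t)) * p_n.pdf(n_of(u, alpha, y_t))
        weights.append(gamma / rho)
    return float(np.mean(weights))             # consistent estimate of p(y_t)
```

As the next paragraphs explain, the difficulty lies in choosing a single proposal ρ when the integrand can be multi-modal; the sequential scheme below replaces it with per-dimension proposals and resampling.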
Even though this may be unlikely to occur oŸen in practice, it is hard to formulate a proposal distribution for importance sampling.e proposal distribution would need to be close to the integ- rand, and it must be possible to draw samples from it. A mixture of Gaussians, for example, could need as many components as the integrand has maxima. However, rather than applying normal importance sampling, the integrand will be factorised in dimensions for sequential importance sampling. 7.3.2.1 Per-dimension sampling Rather than sampling from all dimensions at once, sequential importance sampling (see appendix a.4.3 on page 272) can be used, which samples dimension per dimen- sion. Fundamentally, it is an instantiation of importance sampling formultiple dimen- sions. First, it draws a set of samples from a distribution over the rst dimension, and assigns the samples a weight.en, for each dimension it extends every partial sam- ple with a value drawn given the value for previous dimensions of that sample. e advantage of this formulation is that between dimensions it allows for resampling: duplicating higher-probability samples from the set and removing lower-probability ones.is concentrates the samples on the most relevant areas of the space. To be able to apply sequential importance sampling, the target density needs to be factorised into dimensions. If the feature space is d-dimensional, the integration 205 chapter 7. asymptotically exact likelihoods is over 2d dimensions: α1, . . . , αd, u1, . . . , ud. It is important to realise that there is no need for the factors to represent conditional probability distributions, normalised or not. It is true that the most informative weights aŸer each dimension i would arise if factors 1 . . . i combined to form the (potentially unnormalised) marginal of partial sampleu1:i. Resamplingwould then bemost ešective. By denition, the factors would be (unnormalised) conditionals. However, this is not a requirement, and this work will compare two dišerent factorisations, both of which should be close to the actual conditionals. Since the phase factor coe›cients are independent (see section 4.2.1.1 on page 65), an obvious factorisation for γ(α) is γ(α) = γ1(α1) · · ·γd(αd) = d∏ i=1 γi(αi), γi(αi) = p(αi) . (7.32a) Since the ui are not independent, the factors of γ(u|α)must take the full partial sample into account: γ(u|α) = γ1(u1|α1)γ2(u2|u1,α1:2)γ3(u3|u1:2,α1:3) · · ·γd(ud|u1:d−1,α1:d) = d∏ i=1 γi(ui|u1:i−1,α1:i) . (7.32b) Again, the notation of these factors γi(ui|u1:i−1,α1:i) does not mean that they are necessarily related to conditional distributions.ey can be any function of the vari- ables before and aŸer the bar, as long as their product γ(u|α) yields the correct res- ult. Indeed, the next subsections will present two choices for the factorisation. Both apply the same factorisation to both terms of γ in parallel. e rst, which will be called “postponed factorisation”, each factor only incorporates the dimensions that are sampled from, and leave other terms for later in the process. e second, which will be called “quasi-conditional factorisation”, factorises the two Gaussians separately into conditional distributions per dimension. e form of the speech and noise prior in this work are standard Gaussians, so 206 7.3. importance sampling in a transformed space that the factorisation of γ(u|α)must satisfy (combining (7.29a) and (7.32b)): d∏ i=1 γi(ui|u1:i−1,α1:i) = N (x(u,α,yt); µx, Σx)N (n(u,α,yt); µn, Σn) . 
(7.33)

7.3.2.2 Postponed factorisation

The Gaussian terms in (7.33) can be written as products involving every element of the precision matrices. The precision matrix is the inverse covariance: Λx = Σx⁻¹. Its elements are denoted λx,ij. It is possible to postpone the terms until both dimensions to be related are known. This requires some manipulation, the details of which are in appendix f.3. The integrand is rewritten in (f.20), and the factors γi are defined as (from (f.21))

γ1(u1|α1) = |2πΣy|^(−1/2) |2πΣx|^(−1/2) exp( −½ λn,11 (n(u1, α1, yt,1) − µn,1)² − ½ λx,11 (x(u1, α1, yt,1) − µx,1)² ); (7.34a)

γi(ui|u1:i−1, α1:i) = exp( −½ λx,ii (x(ui, αi, yt,i) − µx,i)² − (x(ui, αi, yt,i) − µx,i) νx,i − ½ λn,ii (n(ui, αi, yt,i) − µn,i)² − (n(ui, αi, yt,i) − µn,i) νn,i ), (7.34b)

where the terms containing coordinates of lower dimensions u1:i−1 are defined as (in (f.19b))

νx,i = Σ_{j=1}^{i−1} λx,ij (x(uj, αj, yt,j) − µx,j); νn,i = Σ_{j=1}^{i−1} λn,ij (n(uj, αj, yt,j) − µn,j). (7.35)

Note again that γi(ui|u1:i−1, α1:i) is not a conditional distribution.

At every dimension, a proposal distribution ρi is required for importance sampling. This distribution needs to have a shape similar to γi. There is, however, no need to match the amplitude of γi. This is convenient when rewriting γi to find the shape of the distribution, as appendix f.3 does.

The factors in (7.34b) turn out to be proportional to two Gaussian distributions that are functions of x(ui, αi, yt,i) and n(ui, αi, yt,i) (from (f.23)):

γi(ui|u1:i−1, α1:i) ∝ N( x(ui, αi, yt,i); µx,i − νx,i/λx,ii, λx,ii⁻¹ ) · N( n(ui, αi, yt,i); µn,i − νn,i/λn,ii, λn,ii⁻¹ ). (7.36)

This expression has the same shape as the one-dimensional integrand in (7.19) in section 7.3.1, and the same proposal distribution as discussed in section 7.3.1.2 can be used.

7.3.2.3 Quasi-conditional factorisation

Alternatively, both Gaussians in (7.29b) could be decomposed into the marginal of the first dimension, the marginal of the second dimension given the first, and so on. Appendix a.1.3 decomposes a Gaussian distribution over two vectors into the marginal of the first vector times the distribution of the second conditional on the first. However, since the density γ is the product of two Gaussians with two different non-linear variables, its factors are not normalised or proportional to conditional probabilities. Crucially, the derivation in (a.7) does not rely on the input variable or normalisation. It is therefore possible to find a factorisation of both speech and noise Gaussians in parallel, even if the factors are not exactly conditionals. For this, both terms of γ1:i are factorised recursively as (only the left-hand term is shown)

N(x1:i; µx,1:i, Σx,1:i,1:i) = N(x1:i−1; µx,1:i−1, Σx,1:i−1,1:i−1) N(xi; µx,i|1:i−1(x1:i−1), σ²x,i|1:i−1), (7.37a)

where the parameters for xi dependent on x1:i−1 are

µx,i|1:i−1(x1:i−1) = µx,i + Σx,i,1:i−1 [Σx,1:i−1,1:i−1]⁻¹ (x1:i−1 − µx,1:i−1); (7.37b)
σ²x,i|1:i−1 = σ²x,i,i − Σx,i,1:i−1 [Σx,1:i−1,1:i−1]⁻¹ Σx,1:i−1,i. (7.37c)

Note that the variance σ²x,i|1:i−1 is not a function of x, so that it needs to be computed only once for all samples. Also, in computing the inverses of Σx,1:i,1:i for i = 1 . . . d, their structure can be exploited. A block-wise inversion can be used.
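For reference, the following is a minimal sketch of computing the quasi-conditional parameters in (7.37b) and (7.37c) for one Gaussian. The naive matrix inverses here, recomputed from scratch for every i, are exactly what the block-wise inversion discussed next avoids.

```python
import numpy as np

def quasi_conditional_params(mu, Sigma):
    """Per-dimension 'quasi-conditional' parameters of a full-covariance Gaussian,
    as in (7.37b) and (7.37c): for each i, the weights that map (x[:i] - mu[:i])
    to the conditional mean offset of x_i, and the conditional variance.
    Naive sketch: each [Sigma_{1:i-1,1:i-1}]^{-1} is recomputed from scratch;
    the incremental block-wise inversion discussed next avoids this."""
    d = len(mu)
    weights = []              # weights[i] has length i; conditional mean is
                              # mu[i] + weights[i] @ (x[:i] - mu[:i])
    cond_var = np.empty(d)
    for i in range(d):
        if i == 0:
            weights.append(np.empty(0))
            cond_var[0] = Sigma[0, 0]
        else:
            w = Sigma[i, :i] @ np.linalg.inv(Sigma[:i, :i])
            weights.append(w)
            cond_var[i] = Sigma[i, i] - w @ Sigma[:i, i]
        # cond_var[i] does not depend on x, so it is computed once for all samples.
    return weights, cond_var
```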
Block-wise in- version is a technique oŸen applied to take advantage of known structure in blocks of the matrix, for example, when a block is diagonal.e trick here, however, is that the intermediate results are useful. If the the inverse of amatrix with one column removed from the right and one row removed from the bottom, Σx,1:i−1,1:i−1 is known, the in- verse of thematrix with the extra row and column,Σx,1:i,1:i, can be computed inO ( i2 ) time. An incremental implementation that yields[Σx,1:i,1:i]−1 for i = 1 . . . d thus has a time complexity of onlyO(d3), the same as inverting only the full covariance matrix with a conventional approach. e fully factorised formulation of the leŸ-hand term of γ is N (x1:d; µx,1:d, Σx,1:d,1:d) = N ( x1; µx,1, σ 2 x,1,1 ) d∏ i=2 N (xi; µx,i|1:i−1(x1:i−1), σ2x,i|1:i−1). (7.38) e analogous factorisation can be applied to the right-hand term of γ(u). (7.29a) can then be factorised γ(u|α) = N (x(u,α,yt); µx, Σx)N (n(u,α,yt); µn, Σn) = N (x(u1, α1, y1); µ1, σ21,1)N (n(u1, α1, y1); µ1, σ21,1) d∏ i=2 N ( x(ui, αi, yi); µx,i|1:i−1 ( x(u1:i−1,α1:i−1,y1:i−1) ) , σ2x,i|1:i−1 ) N ( n(ui, αi, yi); µn,i|1:i−1 ( n(u1:i−1,α1:i−1,y1:i−1) ) , σ2n,i|1:i−1 ) . (7.39) e factors of γ then become4 γi(ui|u1:i−1) = N ( x(ui, αi, yi); µx,i|1:i−1 ( x(u1:i−1,α1:i−1,y1:i−1) ) , σ2x,i|1:i−1 ) N ( n(ui, αi, yi); µn,i|1:i−1 ( n(u1:i−1,α1:i−1,y1:i−1) ) , σ2n,i|1:i−1 ) , (7.40) 4 is formulation assumes (in γ1) that multiplying a 1 × 0 matrix by a 0 × 0 matrix by a 0 × 1 matrix yields a 1× 1matrix[ 0 ]. 209 chapter 7. asymptotically exact likelihoods where µx,i|1:i−1(x1:i−1) and σ2x,i|1:i−1 are dened in (7.37a). Again, the factors have the form of density as the one-dimensional γ in (7.19), so that the proposal distribution given in section 7.3.1.2 can be used. 7.3.2.4 Applying sequential importance resampling Whichever factorisation of γ(u|α) is chosen, the application of sequential import- ance resampling is the same. e integral ∫ ∫ γ(u,α)dαdu, the value of interest, is the normalisation constant of γ(u,α), which will be called Z. To nd Z by stepping through dimensions, it can be expressed as a sequence of incremental normalisation constantsZi/Zi−1 (see appendixa.4.3). Given a sample set {( u (l) 1:i−1,α (l) 1:i−1 )} , the ap- proximation of the incremental normalisation constant is5 (when resampling is used) Z˜i Zi−1 = 1 L L∑ l=1 γi ( α (l) i ) ρi ( α (l) i ) γi(u(l)i ∣∣u(l)1:i−1,α(l)1:i) ρi ( u (l) i ∣∣u(l)1:i−1,α(l)1:i) , (7.41) where samples α(l)i are drawn from proposal distribution ρi ( α (l) i ) and samples u(l)i from the appropriate ρi ( ui ∣∣u(l)1:i−1,α(l)1:i). e shape of the density γ ( ui ∣∣u(l)1:i−1,α(l)1:i) depends on the current partial sam- ple ( u (l) 1:i−1,α (l) 1:i ) and the type of factorisation. e factorisations in sections 7.3.2.2 and 7.3.2.3 both result in factors of the form γi(ui|u1:i−1,α1:i) = N ( x(ui, αi, yt,i); µˆx,i, σˆ 2 x,i ) · N (n(ui, αi, yt,i); µˆn,i, σˆ2n,i), (7.42) where the parameters (µˆx,i, σˆ2x,i, µˆn,i, σˆ2n,i) depend on the type of factorisation and the current partial sample ( u (l) 1:i−1,α (l) 1:i ) . Appropriate proposal distributions for this type of density have been discussed in section 7.3.1.2. ese distributions over ui take the form of one Gaussian or a mixture of two.ey are therefore straightforward to draw a sample from and slot into (7.41). e density γi ( αi ) is set to p(αi) dened in (4.18), which has a Gaussian shape, but constrained to [−1, 1]. 
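Pulling (7.41) and (7.42) together, the following is a minimal sketch of the sequential importance resampling estimate of the normalisation constant, interleaving αi and ui per dimension. All per-dimension densities and proposals are passed in as placeholder callables; this is an illustration of the estimator, not a reproduction of the thesis's algorithm 7 (hybrid sequential importance resampling).

```python
import numpy as np

def sir_normaliser(d, L, draw_alpha, draw_u, gamma_u_factor, seed=0):
    """Sequential importance resampling estimate of Z = integral of gamma(u, alpha),
    following (7.41).  Assumed placeholder callables:

      draw_alpha(rng, i, L)         -- L samples from rho_i(alpha_i) = gamma_i(alpha_i),
                                       so the alpha weight ratio is 1
      draw_u(rng, i, alphas, us)    -- L samples u_i from rho_i(u_i | u_1:i-1, alpha_1:i)
                                       together with their proposal densities rho_i
      gamma_u_factor(i, alphas, us) -- gamma_i(u_i | u_1:i-1, alpha_1:i) for each sample
    """
    rng = np.random.default_rng(seed)
    log_Z = 0.0
    alphas = np.empty((L, 0))
    us = np.empty((L, 0))
    for i in range(d):
        # Extend each partial sample with alpha_i; since rho_i = gamma_i for alpha,
        # these samples contribute a factor of 1 to the weights.
        a_i = draw_alpha(rng, i, L)
        alphas = np.column_stack([alphas, a_i])
        # Extend with u_i and weight by gamma_i / rho_i, as in (7.41).
        u_i, rho_i = draw_u(rng, i, alphas, us)
        us = np.column_stack([us, u_i])
        w = gamma_u_factor(i, alphas, us) / rho_i
        log_Z += np.log(np.mean(w))          # incremental constant Z_i / Z_{i-1}
        # Resample: duplicate high-weight partial samples, drop low-weight ones.
        idx = rng.choice(L, size=L, p=w / w.sum())
        alphas, us = alphas[idx], us[idx]
    return np.exp(log_Z)                     # estimate of p(y_t)
```

Resampling between dimensions, the last two lines of the loop, is what concentrates the partial samples on high-probability regions, as discussed below.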
It is straightforward to draw a sample directly from this constrained Gaussian distribution, by sampling from the Gaussian and rejecting samples outside [−1, 1] (see section 4.2.1.1). Therefore, ρi(αi) can be set to γi(αi). This means that γi(αi)/ρi(αi) in (7.41) becomes 1, and hybrid sequential importance resampling of algorithm 7 on page 278 can be applied.

⁵Since samples for one dimension of α and one of u are drawn, this could be written Z2i/Z2i−2 to be exactly compatible with appendix a.4.3.

Resampling is discussed in appendix a.4.4. In short, it duplicates higher-weight samples from the sample set and removes lower-weight ones between every dimension. This reduces the variance of the sample weights; conceptually, it focuses effort on higher-probability regions.

When using resampling, the order in which the dimensions are traversed becomes important. The longer ago samples for a dimension were drawn, the more likely they are to have duplicate entries for that dimension. The sample set will therefore be less varied for earlier dimensions. This is not a big problem when, as here, the interest is not in the samples but in the normalisation constant. However, when drawing samples for one dimension, it still makes sense to have last drawn the dimensions on which the new dimension depends most.

For example, it might seem obvious to draw α1(l) . . . αd(l) first, and then go through u1(l) . . . ud(l). However, in i − 1 rounds of resampling, the set of samples α1(l) . . . αd(l) may become considerably less varied. For higher i, ui(l) may then be drawn with only a few or even one unique αi(l), which limits the accuracy of the approximation of the normalisation constant. In this work, the order of sampling is therefore α1, u1, α2, u2, . . . , αd, ud. This works best if ui and uj are less dependent when j − i is greater. The order in which samples for the dimensions are drawn could also be determined on the fly by considering the values of the off-diagonal elements of Σx and Σn, but this work does not investigate this.

This section has discussed a transformation of the integral that gives the corrupted speech likelihood p(yt), two different factorisations of the integrand, and how to apply sequential importance sampling to approximate the integral. The estimate from the sampling scheme is consistent, but not unbiased. This means that for a small sample cloud, the approximated value for p(yt) may be generally overestimated or
e kl divergence to real distribution p from approximation q over y is dened as (from (a.11)) KL(p‖q) = ∫ p(y) log p(y) q(y) dy. (7.43) Here, q is an approximation to the noise-corrupted speech distribution, found for example with vts, dpmc, or transformed-space sampling.e problem in computing this divergence is the one that motivates this whole chapter: the real distribution p has no closed form, and neither doesKL(p‖q).is problem can be worked around in two steps. First, the kl divergence can be decomposed as (from (a.12)) KL(p‖q) = H(p‖q) −H(p) , (7.44a) where the cross-entropy of p and q is dened as H(p‖q) = − ∫ p(y) logq(y)dy (7.44b) 212 7.4. approximate cross-entropy and the entropy of p as H(p) = − ∫ p(y) logp(y)dy. (7.44c) e entropy of p is constant when only q changes.e cross-entropy is then equal to the kl divergence up to a constant. For comparing dišerent approximations q against a xed p, therefore, the cross-entropyH(p‖q) su›ces. It does not, however, give an absolute divergence. When q becomes equal to p, the cross-entropy becomes equal to the entropy, but the latter cannot be computed for the noise-corrupted speech dis- tribution.6 e second problem is that H(p‖q) cannot be computed either. However, it is straightforward to draw samples fromp: section 4.3.1 has shown the algorithm. L sam- ples drawn from p give the delta spikes in the empirical distribution p˜: p˜ = ∑ l δy(l) , y (l) ∼ p(y). (7.45) en, plainMonte Carlo (see section a.4) can approximate the cross-entropy between p and q with the cross-entropy between p˜ and q. H(p‖q) ' H(p˜‖q) = − ∫ p˜(y) logq(y)dy = − 1 L ∑ l logq ( y(l) ) . (7.46) e cross-entropy in (7.44b) can be viewed as the expectation of the log-likelihood logq(y) under p, which is approximated with Monte Carlo (as in (a.37)).e result is (7.46), which can be seen as the negative log-likelihood of q for the set of sam- ples { y(l) } . is gives another motivation for using the cross-entropy as a metric for comparing compensation methods. Note that when q is the transformed-space sampling method from section 7.3, for every sample y(l) another level of sampling takes place inside the evaluation of q ( y(l) ) . 6e entropy could be rewritten to H(p) = ∫ p(y) logp(y)dy = ∫(∫ ∫ p(x,n,y)dndx ) log (∫ ∫ p(x,n,y)dndx ) dy, which has no obvious closed form. 213 chapter 7. asymptotically exact likelihoods e cross-entropy results in section 8.2 will use this Monte Carlo approximation. It has one caveat: the distribution q is assumed to be normalised, and if not, then the result is not valid.ismeans that theAlgonquin approximation, which does not yield a normalised distribution over y, cannot be assessed in this way. As pointed out in section 7.3.2, the likelihood approximation of transformed-space sampling is biased, but consistent. is means that as the size of its sample cloud increases, q converges to being normalised. 7.5 Summary is chapter has described the third contribution of this thesis. is chapter has introduced a new technique for computing the likelihood of a corrupted speech observation vector. It does not use a parametric density, like the schemes in chapters 4 and 5, but a sampling method.e integral over speech, noise, and phase factor that the likelihood consists of is transformed to allow importance sampling to be applied. 
As the number of samples goes to innity, this approximation converges to the real likelihood.is work uses it with specic distributions (Gaussian speech and noise, a constrained Gaussian for the phase factor), but the method could also be applied to other distributions. ough the method is too slow to embed in a speech recogniser, it is possible to nd the kl divergence from an approximated corrupted speech distribution to the real one up to a constant. Section 8.2 will use it to assess how close to ideal compensation methods are, and the ešect of approximations such as assuming the corrupted speech Gaussian. 214 Chapter 8 Experiments is thesis has looked into two ways of improving statistical models for noise-robust- ness.e experimental results will therefore be in two parts. e rst part is about modelling correlations within speech recogniser compon- ents.e theory has considered two aspects: estimating full-covariance compensation (chapter 5), and decoding with that compensation but without the computational cost (chapter 6).ese aspects will be demonstrated in section 8.1. e secondpart ismore theoretical. Chapter 7 has introduced amethod that, given speech andnoise priors and amismatch function, computes the corrupted speech like- lihood exactly in the limit.ough it is too slow for decoding, it makes it possible to nd how well current methods for model compensation do.is will use an approx- imation to the kl divergence. It is interesting to see how well that predicts speech recogniser performance. It also becomes possible to investigate specic approxima- tions that model compensationmethods make. Section 8.2 will examine, for example, the inžuence of the assumption that the corrupted speech distribution is Gaussian and diagonalising that Gaussian’s covariance. It will also assess the impact of a common approximation to the mismatch function for vts compensation, namely setting the phase factor to a xed value. 215 chapter 8. experiments 8.1 Correlation modelling Realisticmethods specically for noise-robustness aremeant to deal with short, noise- corrupted utterances, for example, voice commands to a car navigation system. With little adaptation data, it is vital to have as few parameters as possible to estimate. A Gaussian noisemodel can be estimated on a few seconds of data. A generic adaptation method, on the other hand, would do better only if the noise is constant for minutes (for a comparison between vts and cmllr on short noisy utterances, see Flego and Gales 2009). e scenario that this section will consider is, therefore, that of short utterances with varying noise. Model compensation methods using extended feature vectors (like extended vts or extended dpmc) model dynamics better.erefore, they are able to nd better full- covariance compensation. is section will examine the ešects of the improvements that extended vts makes over standard vts. For this, it will use the Resource Man- agement task, which allows per-speaker noise estimation, so that it is feasible to run edpmc.en, results on aurora will indicate whether the improvements carry over to this well-known task. Finally, results on the Toshiba in-car data will show perform- ance on data recorded in a real noisy environment. For all recognition systems, clean training data is used to train the speech models. 39-dimensional feature vectors are used: 12 mfccs and the zeroth coe›cient, aug- mented with deltas and delta-deltas. Unless indicated otherwise, themfccs are found from the magnitude spectrum with htk (Young et al. 
2006) and the deltas and delta- deltas are computed over a window of 2 observations leŸ and 2 right, making the total window width 9. e state of the art in model compensation is vts compensation with the con- tinuous-time approximation. Section 4.4.2 has discussed how it uses a vector Taylor series approximation of the mismatch function. is makes it possible to estimate the noise model, because the inžuence of the noise is locally linearised. To keep in line with standard vts compensation, for this section (not for section 8.2) the same noise estimates and the same mismatch function will be used for all per-component 216 8.1. correlation modelling compensationmethods in this section.e phase factor will be assumed 0 in themag- nitude spectrum. (Appendix c.2 shows how this is roughly equivalent to setting the phase factor to 1 when working with the power spectrum.) When the noise model is estimated, maximum-likelihood estimation (Liao and Gales 2006), as described in section 4.7, is used for a clean system with vts and the continuous-time approximation. e initial noise model’s Gaussian for the additive noise is the maximum-likelihood estimate from the rst 20 and last 20 frames of the utterance, which are assumed to contain no speech.e initial convolutional noise es- timate is 0. Given this initial noise estimate for an utterance, a recognition hypothesis is found.is is used to nd component–time posteriors.en, the noise means and the additive noise covariance are re-estimated. Decoding with this noise model yields the nal hypothesis. 8.1.1 Resource Management To assess compensation quality and the ešect of noise estimation, initial experiments are on a task of reasonable complexity, but with articial noise.e rst corpus used is the 1000-word Resource Management corpus (Price et al. 1988). Operations Room noise from the noisex-92 database (Varga and Steeneken 1993) is added at 20 and 14 dB. e rm database contains read sentences associated with a naval resource man- agement task.is task contains 109 training speakers reading 3990 sentences, a total of 3.8 hours of data.e original database contains clean speech recorded in a sound- isolated booth, which was used for training the recognisers. All results are averaged over three of the four available test sets, February 89, October 89, and February 91 (September 1992 is not used), a total of 30 test speakers and 900 utterances. e noisex-92 database provides recording samples of various articial, pedes- trian and military noise environments recorded at 20 kHz with 16-bit resolution.e Destroyer Operations Room noise is sampled at random intervals and added to the clean speech data scaled to yield signal-to-noise ratios of 20 dB and 14 dB. 217 chapter 8. experiments State-clustered triphone models with six components per mixture are built us- ing the htk rm recipe (Young et al. 2006). AŸer four iterations of embedded re- estimation, the monophone models are cloned to produce a single-component tri- phone system. AŸer two iterations of embedded training, the triphones are clustered at the state level.e number of distinct states is about 1600.ese are then mixed up to six components, yielding about 9500 components in total.e language model for recognition is a word-pair grammar. 
For initial experiments, an equivalent system is trained with one component per state from the one-but-last six-component system.e single-component one will be used for initial experiments, because data sparsity becomes an issue when estimating full covariance matrices over extended feature vectors. e six-component system is the standard one. On clean data, it produces a word error rate of 3.1 %. At the 20 dB word error rate, however, it yields 38.1 %. Since the additive background noise is known, it is possible to generate stereo data (clean and articially corrupted) and use single-pass retraining (see section 4.4.4) to obtain an idealised compensated system. e word error rate then becomes 7.4%. It is also possible to extract the true noise model. When the noise model is estimated, this is done per speaker. With maximum- likelihood estimation for the noise model, there is no guarantee that this compensates only for noise: it may implicitly also adapt for speaker characteristics. For example, the voice quality will inžuence the parameters of the convolutional noise, like cepstral mean normalisation does. 10 000 samples per distribution are used for dpmc. (Performance does not im- prove with additional samples.) e noise covariance estimate does not contain any zero entries, so back-oš as discussed in section 5.5.2.1 is not necessary. Section 8.1.1.1 will investigate the closeness of compensation methods to the ideal distribution. Section 8.1.1.2 will look into reconstructing an extended noise model from a noise model with statics and dynamics. Section 8.1.1.3 will then apply exten- ded feature vector compensation to components. e computational complexity of 218 8.1. correlation modelling compensation and decoding will be the topic of section 8.1.1.4. 8.1.1.1 Compensation quality e compensation methods with extended feature vectors in this thesis aim to model the corrupted speech distribution more accurately, with the objective to improve per- formance. Whether they succeed can be veried in two stages.is section will assess compensation quality with the kl divergence; later sections will assess recognition performance. Compensation quality will be measured with the component-per-component kl divergence to the single-pass retrained system as explained in section 4.4.4.1. at section has also mentioned that, depending on the structure of the covariance matrix, the kl divergence can be computed separately for coe›cients or blocks of compon- ents.e following will therefore examine single coe›cients, for diagonal-covariance compensation, rst, and then blocks of coe›cients for block-diagonal matrices. e results derive from a Resource Management system with one component per state so full-covariance speech statistics can be robustly estimated.e total number of Gaussians is 1600. noisex-92 Operations Room noise is articially added at a 14 dB snr. Hence it is possible to obtain the correct noise distribution, for both the standard and extended feature vector cases.e noise models also have full covariance matri- ces.1 Diagonal compensation Normally, vts-compensated covariance matrices are di- agonalised. us it is interesting to initially examine this conguration. Using diag- onal covariance matrices also allows each dimension to be assessed. Figure 8.1 on the following page contrasts the accuracy of an uncompensated system, and three forms of compensation: standard vts, extended vts, and, as an indication of maximum possible performance, extended dpmc. 
This graph is for a 14 dB snr, but graphs for other snrs are very similar. The horizontal axis has the feature dimensions: 13 static mfccs ys, 13 first-order dynamics y∆, and 13 second-order dynamics y∆². As expected, the uncompensated system is furthest away from the single-pass retrained system, and extended dpmc provides the most accurate compensation given the speech and noise models. The difference between standard vts and extended vts is interesting. By definition, both yield the same compensation for the statics.² For the deltas and especially the delta-deltas, however, the continuous-time approximation does not consistently decrease the distance to the single-pass retrained system. Extended vts, though not as accurate as extended dpmc, provides a substantial improvement over standard vts.

¹Similar trends are observed when striped noise statistics, consistent with diagonal standard noise models for vts, are used for evts.
²Small differences are because htk gathers statistics for statics and dynamics differently from for extended feature vectors.

Figure 8.1 Average Kullback-Leibler divergence between compensated systems and a single-pass retrained system. (Vertical axis: average kl divergence, log scale; horizontal axis: dimension, ys0 . . . ys12, y∆0 . . . y∆12, y∆²0 . . . y∆²12; curves: uncompensated, standard vts, extended vts, extended dpmc.)

Block-diagonal compensation The previous section used diagonal covariance matrices. To compensate for changing correlations under noise, more complex covariance matrix structures, such as full or block-diagonal, may be used. vts with the continuous-time approximation can also be used to generate block-diagonal covariance matrices for the output distributions. The form of this was shown in (4.44). Normally, evts produces full covariance matrices, and so can single-pass retraining, but they can be constrained to be block-diagonal by setting other entries to zero. It then is possible to compute the kl divergence to a single-pass retrained system per block. This allows the compensation of each of the blocks of features to be individually assessed: the statics, and first- and second-order dynamics. vts compensation uses block-diagonal statistics for both the clean speech and noise models. For evts the extended statistics have full covariance matrices, to be equivalent to the statistics for standard vts.

Table 8.1 Resource Management task: average kl divergence to a block-diagonal single-pass retrained system for vts (continuous time), evts and edpmc at 14 dB snr.

  Compensation    —      vts    evts   edpmc
  ys             58.8    1.0    1.0    1.0
  y∆              3.3    1.4    0.7    0.5
  y∆²             3.2    1.7    0.7    0.5

Table 8.1 shows the average kl divergence between a system compensated with block-diagonal vts with the continuous-time approximation and the block-diagonal single-pass retrained system. For other signal-to-noise ratios, numbers are very similar. vts finds compensated parameters close to the single-pass retrained system for the static features: the kl divergence goes from 58.8 to 1.0. However, the dynamic parameters are not compensated as accurately, though both the delta and delta-delta parameters are still somewhat closer to the spr system than the uncompensated model set (3.2 to 1.7 for the delta-deltas). Similarly to diagonal compensation (see figure 8.1), with block-diagonal covariances standard vts finds good compensation for the static parameters, but less good for the deltas and delta-deltas. evts has the same compensation as standard vts for the statics. (A sketch of the per-block Gaussian kl divergence used for these comparisons is given below.)
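The per-coefficient and per-block comparisons above reduce to the closed-form kl divergence between two Gaussians, with the single-pass retrained component taken as the reference distribution, following the convention of (7.43). A minimal sketch follows; the block boundaries for a 39-dimensional feature vector are an assumption, and the weighting and averaging over components is omitted.

```python
import numpy as np

def gaussian_kl(mu_p, Sigma_p, mu_q, Sigma_q):
    """KL( N(mu_p, Sigma_p) || N(mu_q, Sigma_q) ) in closed form."""
    d = len(mu_p)
    Sigma_q_inv = np.linalg.inv(Sigma_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(Sigma_p)
    _, logdet_q = np.linalg.slogdet(Sigma_q)
    return 0.5 * (np.trace(Sigma_q_inv @ Sigma_p)
                  + diff @ Sigma_q_inv @ diff
                  - d + logdet_q - logdet_p)

def per_block_kl(mu_spr, Sigma_spr, mu_comp, Sigma_comp,
                 blocks=((0, 13), (13, 26), (26, 39))):
    """KL divergence from a single-pass retrained (reference) Gaussian to a
    compensated Gaussian, per block of coefficients (statics, deltas,
    delta-deltas for a 39-dimensional feature vector)."""
    return [gaussian_kl(mu_spr[b:e], Sigma_spr[b:e, b:e],
                        mu_comp[b:e], Sigma_comp[b:e, b:e])
            for b, e in blocks]
```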
As in the di- agonal-covariance case, however, for dynamic parameters compensation it is more accurate. It does yield a clear improvement over the uncompensated system (3.2 to 0.7 for the delta-deltas) and is close to edpmc, which in the limit yields the best obtainable Gaussian compensation given the speech and noise models. 221 chapter 8. experiments Σy Σen diag full known diag 16.4 16.0 smoothed 15.9 15.2 full 15.9 14.9 estimated diag 12.0 11.2smoothed 12.5 12.0 Table 8.2 Resource Management task: word error rates for reconstructing an extended noise model at 14 dB snr. 8.1.1.2 Extended noise reconstruction In practice, labelled samples of the noise to estimate the noise model are not always available. As discussed in section 5.5.2, when using the standard ml estimated noise models there are two approaches tomapping the noisemodel parameters to the exten- ded noise model parameters: one with a diagonal covariance matrix for the additive noise, and one with a smoothly reconstructed matrix (which is striped).is section contrasts the performance of the two. It applies extended vts to the system with six striped-covariance Gaussian components per state. e standard noise model para- meters can either be derived from the actual noise data, the “known” case, or using maximum-likelihood estimation, “estimated”. In this section the noise added to the rm task is scaled to yield a 14 dB snr. e top three rows of table 8.2 compare diagonal and smoothed reconstructions of the extended noise when the standard noisemodel is estimated on the actual data. For diagonal-covariance and full-covariance compensation, smoothing results in 0.6% and 0.8% absolute improvement in the word error rate. is works almost as well as when the known noise distribution is used (as well, with diagonal-covariance com- pensation). However, when the standard noisemodel parameters are foundwithmax- imum-likelihood estimation, the bottom two rows in table 8.2, the smoothing process degrades performance. is degradation from the use of smoothing when using maximum-likelihood estimated noise parameters is caused by the nature of the maximum-likelihood es- 222 8.1. correlation modelling Jacobians Σex 20 dB 14 dB Standard vts 6.8 13.7 Fixed diag 7.8 13.5 Variable diag 7.5 13.0 Variable striped 6.2 12.0 Table 8.3 Resource Management task: word error rates for approximations of evts. Diagonal-covariance decoding. timates. For the smoothing process it is assumed that there is some true underlying sequence of noise samples that yields the standard noise model parameters. is is guaranteed to be true for the known noise distribution. However, this is not neces- sarily the case for the estimated noise. e dynamic noise parameters are estimated using the continuous-time approximation. ere are no constraints that the estim- ates režect the true or a possible sequence of noise samples. Using smoothing, which assumes relationships in the noise sample sequence, to estimate the extended noise covariance matrix may therefore not help.e experiments in the following sections will therefore use the diagonal reconstruction for the extended noise distribution. 8.1.1.3 Per-component compensation is section investigates recognition performance for per-component compensation with extendedvts. Here, a standard noisemodel is estimated and a diagonal extended noise model derived from it. Table 8.3 presents the properties of extended vts and striped statistics. e rst row gives word error rates for standard vts at 20 and 14 dB. 
As discussed in sec- tion 5.3.3.1, extended vts becomes standard vts if the expansion point is chosen equal for all time instances. Varying the expansion point is expected to provide better com- pensation for dynamics. On the other hand, diagonalising extended clean speech stat- istics discards information compared to diagonalising standard features with statics and dynamics. For the second line in the table, the expansion points vary, but the Jacobian is xed. At the lower snr, the improved modelling helps, but at the higher snr, where the corrupted speech distributions are closer to the clean speech, diagonal- 223 chapter 8. experiments Operations Room Car Scheme Speech Σy 20 14 8 20 14 8 vts diag Σx diag 6.8 13.7 30.0 5.2 9.1 18.7block Σx block 7.0 14.2 31.6 5.3 9.7 20.1 evts striped Σex diag 6.2 12.0 27.9 4.8 8.5 18.2 full 6.3 11.2 26.7 5.0 8.3 17.9 edpmc striped Σex diag 6.3 11.9 27.9 4.8 8.2 16.4 full 6.0 11.3 26.3 4.7 7.9 15.9 Table 8.4 Resource Management task: word error rates for standard vts, exten- ded vts and extended dpmc. ising the extended clean speech statistics discards vital information. For the third row, the Jacobians are allowed to vary, which gives complete evts, if with diagonal speech statistics. e bottom row uses striped statistics, as discussed in section 5.5.1, which discards no information compared to diagonal standard statistics.is leads to a con- sistent improvement over standard vts.e following experiments will therefore use striped statistics. Table 8.4 shows contrasts between compensation with standard vts and with ex- tended feature vectors using either evts or edpmc. A rst thing to note is that all approaches perform better than single-pass retraining (not in the table, at 7.4 % for 20 dB Operations Room noise). is is because the noise estimation can implicitly compensate for some speaker characteristics.e results in the rst row are from the standard scheme, diagonal-covariance compensation with vts. Block-diagonal com- pensation with standard vts is also implemented and block-diagonal clean speech statistics are used.e results for this approach are in the second row.e use of the block-diagonal compensation with vts degrades performance, for example 13.7 % to 14.2 % for Operations Room noise at 14 dB. Compensation with evts (shown in the middle two rows of the table) yields bet- ter performance than standard vts for both diagonal and full compensated covari- ance matrices. For diagonal-covariance compensation, the relative improvement is 5–10% (6.8 to 6.2 %; 13.7 to 12.0%, etc.) over standard vts.ough at the higher snr condition, 20 dB, full-covariance compensation does not improve performance over 224 8.1. correlation modelling diagonal-covariance performance, gains are observed at the lower snrs. For Oper- ations Room noise at 14 dB, full-covariance compensation produces an 11.2 % word error rate, which is a 20% relative improvement from standard vts, and 7% relative gain compared to the diagonal case. In addition, table 8.4 shows the performance of edpmc, which in the limit can be viewed as the optimal Gaussian compensation scheme. e results for this approach are shown in the bottom two rows of table 8.4. When compared with edpmc, the rst- order approximation in evts degrades performance by up to 0.4% absolute, except for 8 dB Car noise. However, evts is signicantly faster than edpmc. 8.1.1.4 Per-base class compensation e results in section 8.1.1.3 used per-component compensation, and full covariance matrices. 
In real recognition systems, this is oŸen impractical: computing per-com- ponent compensation is too slow, as is decoding with full covariances. ese prob- lems can be solved with two techniques in tandem. Joint uncertainty decoding can perform compensation with extended feature vectors per base class (see section 5.4). is speeds up compensation. Predictive linear transformations, such as predictive semi-tied covariance matrices (see section 6.2.3), can convert full-covariance com- pensation into diagonal-covariance compensation plus a linear transformation. is speeds up decoding. It is the combination of extended feature vector compensation with joint uncertainty decoding and predictive transformations that make it feasible for real systems. Since both joint uncertainty decoding (jud) and predictive semi-tied covariance matrices are approximations, some loss of accuracy is expected in return for the gain in speed.e noise model is estimated to maximise the likelihood for standard vts and for jud with standard vts. Again, a diagonal-covariance noise model with extended feature vectors is found. It should be noted that the performance gures on extended feature vectors give an underestimate of performance if the noisemodel was estimated consistently. In the results in this work there is a contest between the ešect of the 225 chapter 8. experiments Gaussians Type Σy 20 dB 14 dB 9.5 K vts diag 6.8 13.7 evts diag 6.2 12.0full 6.3 11.2 edpmc diag 6.3 11.9full 6.0 11.3 16 (jud) vts diag 7.4 16.4 evts diag 7.6 15.6 full 7.5 13.8 semi-tied 7.5 13.8 edpmc diag 8.0 16.2 full 7.4 14.6 semi-tied 7.4 14.7 Table 8.5 Word error rates for per-component and per-base class compensation. quality of compensation (extended dpmc should be better than extended vts, which should be better than standard vts), and the ešect of themismatch in the noisemodel estimation (standard vts should be better than extended vts, which should be better than extended dpmc). Table 8.5 shows the word error rates.e top half of table 8.5 contains results for per-component compensation. ey are repeated from table 8.4 on page 224 for ref- erence. e main ešects are that methods with extended feature vectors (evts and edpmc) were able to produce more accurate compensation, especially with full cov- ariance matrices and at the lower signal-to-noise ratio. Because the noise model is estimated for standard vts, to which evts compensation is more similar than edpmc compensation, edpmc does not consistently outperform evts even though it provides more accurate compensation. e second half of the table contains results with joint uncertainty decoding. To reduce the computational cost substantially, the number of base classes is low: only 16. e noise model is optimised for jud compensation with vts, the top row of the second half. Going from per-component to per-base class vts, the accuracy decreases by 0.6% and 2.7%. At 20 dB, extended vts and extended dpmc do not improve over standard vts at all.is is caused by the mismatch of the noise model, compounded 226 8.1. correlation modelling by the approximation that jud uses. At the 14 dB signal-to-noise ratio, however, the performance dišerences are more similar to the ones for per-component compensation. However, here extended dpmc performs less well than extended vts, though it is slower and more accurate given the real noise distribution. is is caused by the larger mismatch between the com- pensation that the noise model is estimated for and the actual compensation. 
e mismatch between standard vts and extended vts is less great. Extended vts im- proves jud performance by 0.8% when generating diagonal covariances, and by 2.6% with full covariances. e latter models the change in feature correlations. However, since it makes the covariance bias of joint uncertainty decoding full, all covariance matrices become full, which is slow in decoding. Semi-tied covariance matrices, in the next row, convert the full-covariance Gaussians to diagonal-covariance ones with a linear transformation per base class. As in earlier work (Gales and van Dalen 2007), this does not impact performance negatively: it stays at 13.8 %. ese results show that it is feasible to reduce the computational load of exten- ded feature vector approaches to a practical level, with only limited negative ešect to accuracy. e combination of predictive semi-tied covariance matrices, joint uncer- tainty decoding, and extended vts is the set-up that this thesis proposes for practical compensation. Joint uncertainty decoding reduces the number of Gaussians to be compensated. en, predictive semi-tied covariance matrices provides a form of de- coding that is basically as fast as diagonal covariances, but does model correlations. e choice of the number of base classes for joint uncertainty decoding provides a trade-oš between speed and accuracy. It is important to note that joint uncertainty decoding with only one component in each base class is equal to the form of com- pensation it is derived from, for example, extended vts.e same goes for predictive semi-tied covariance matrices. e results here show the extremes of both, but it is possible to select any point in between for the desired trade-oš between speed and accuracy. 227 chapter 8. experiments 8.1.2 AURORA 2 aurora 2 is a small vocabulary digit string recognition task (Hirsch and Pearce 2000). ough it is less complex than the Resource Management corpus, it is a standard cor- pus for testing noise-robustness. Utterances are one to seven digits long and based on the tidigits database with noise articially added.e clean speech training data comprises 8440 utterances from 55 male and 55 female speakers.e test data is split into three sections. Test set a comprises 4 noise conditions: subway, babble, car and exhibition hall. Matched training data is available for these test conditions, but not used in this work. Test set b comprises 4 dišerent noise conditions. For both test set a and b the noise is scaled and added to the waveforms. For the two noise conditions in test set c convolutional noise is also added. Each of the conditions has a test set of 1001 sentences with 52 male and 52 female speakers. e feature vectors are extractedwith the etsi front-end (Hirsch andPearce 2000). e delta and delta-delta coe›cients use 2 and 3 frames leŸ and right, respectively, for total window of 11 frames. e acoustic models are whole word digit models with 16 emitting states, and 3 mixtures per state and silence.e results presented here use the simple aurora back-end. Using the simple back-end recogniser rather than one with more Gaussian components per state ensures that block-diagonal and full cov- ariancematrices for the clean speech are robust. Since for this task, as for the Resource Management task, the noise estimates do not contain zero elements in the variance, the back-oš strategy for the noise estimate discussed in section 5.5.2.1 is not necessary. 
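For reference, the window settings above (2 frames each side for the deltas, 3 for the delta-deltas, giving the total window of 11 frames) refer to the standard regression formula for dynamic coefficients. The following is a minimal sketch of that formula, with edge frames replicated; it is an illustration, not the etsi front-end's exact implementation.

```python
import numpy as np

def dynamic_coefficients(statics, window):
    """Standard regression formula for dynamic coefficients:
    d_t = sum_theta theta * (c_{t+theta} - c_{t-theta}) / (2 * sum_theta theta^2),
    over `window` frames each side, with edge frames replicated.
    `statics` has shape (T, D)."""
    T, _ = statics.shape
    denom = 2.0 * sum(theta * theta for theta in range(1, window + 1))
    padded = np.concatenate([np.repeat(statics[:1], window, axis=0),
                             statics,
                             np.repeat(statics[-1:], window, axis=0)])
    deltas = np.zeros_like(statics, dtype=float)
    for theta in range(1, window + 1):
        deltas += theta * (padded[window + theta: window + theta + T]
                           - padded[window - theta: window - theta + T])
    return deltas / denom

# Deltas over +/-2 frames of the statics, delta-deltas over +/-3 frames of the
# deltas, so second-order coefficients depend on 11 static frames in total:
# deltas  = dynamic_coefficients(statics, 2)
# ddeltas = dynamic_coefficients(deltas, 3)
```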
8.1.2.1 Extended VTS Table 8.6 on the next page shows results for compensation with vtswith the continu- ous-time approximation and evts. Both diagonal and block-diagonal forms of vts are used. vts with diagonal compensation (trained on diagonal speech statistics) is the standard method. Results for this are shown in the columns labelled “vts diag” of table 8.6 and are treated as the baseline performance gures (similar results for vts are given in Li et al. (2007)). vts can also be used to produce block-diagonal cov- 228 8.1. correlation modelling snr A B C Scheme vts evts vts evts vts evts Comp. diag block full diag block full diag block full 00 28.2 24.3 23.5 26.2 22.9 24.3 25.9 23.6 22.6 05 10.5 8.2 7.1 9.3 7.9 7.2 9.9 8.2 6.9 10 4.3 3.3 2.5 3.9 3.2 2.4 4.4 3.4 2.8 15 2.2 1.9 1.2 2.2 1.8 1.2 2.3 1.8 1.4 20 1.6 1.3 0.8 1.4 1.2 0.7 1.6 1.2 1.0 Avg. 9.4 7.8 7.0 8.6 7.4 7.2 8.8 7.7 6.9 Table 8.6 aurora: diagonal compensation with standard vts and full compens- ation with extended vts. ariance matrices. e results for this are shown in the columns labelled “vts block” in table 8.6. Compared to the standard diagonal vts scheme, this gives, for example, relative reductions inword error rate of 15 % to 22% at 5 dB snr. With diagonal-covari- ance statistics for the clean speech (not in the table), this yields no performance gain. However, performance gains were obtained when using block-diagonal clean-speech models, unlike for Resource Management. is dišerence in performance between the tasks can be explained by the additional complexity of the rm task compared to aurora. e results for evts are shown in the last three columns of table 8.6. Here, full- covariancemodels are used for the extended clean speech to produce compensated full covariance matrices for decoding. e improved compensation for dynamics causes extended vts to perform better than to block-diagonal standard vts in all but one noise conditions. At 5 dB again, relative improvements are an extra 3% to 10%. 8.1.2.2 Front-end PCMLLR is section makes a dišerent, but interesting, trade-oš between speed and accuracy. Section 6.4 has introduced a number of schemes that are based on predictive trans- formations but are even faster, because they solely use feature transformations. Since the objective of the experiments in this section is to nd fast feature transformations, appropriate for embedded real-time systems, the small aurora task is again used. 229 chapter 8. experiments snr Scheme 20 15 10 5 0 Avg. vts 1.6 2.2 4.3 10.5 28.2 9.4 jud 1.7 2.7 4.7 11.4 30.4 10.2 pcmllr 1.5 2.6 5.2 13.5 34.7 11.5 Model-mmse 33.8 47.5 61.6 77.4 91.7 62.4 Table 8.7 aurora: component-dependent compensation. Also, all transformations have diagonal covariance matrices, which makes compens- ation and decoding very fast. 64 base classes are used, which as in Stouten et al. (2004b) is about a tenth of the number of back-end components (546).e noise model for each utterance is estim- ated for every utterance, for compensation on the base class level.is noise model is then used to vts-compensate the 64-component clean front-end gmm, producing the joint distribution (4.70).is scheme for estimating the joint distribution is the same as used for mmse feature enhancement in Stouten et al. (2004b), with the addition of additive noise estimation and explicit compensation of dynamic parameters. Table 8.7 contains word error rates for component-specic model compensation. 
vts is the odd one out in that it compensates every back-end component separately, providing accurate but slow compensation. All other schemes use the joint distri- bution. jud actually decodes with the back-end models set to their predicted distri- butions in (4.48a), trading in some accuracy for speed compared to vts. e other two schemes, pcmllr and model-mmse, both use component-dependent transform- ations. pcmllr minimises the kl divergence to jud compensation. Model-mmse is applied in the same way, but estimated for mmse (see section 4.6): every transform- ation reconstructs the clean speech in an area of acoustic space. pcmllr brings the adaptedmodels close to the jud-predicted statistics.atmodel-mmse fails to provide meaningful compensation highlights the dišerence in the nature of pcmllr andmmse transforms, even though their form is the same. mmse transforms, which aim to re- construct the clean speech, have no meaning when applied separately to each base class of components. 230 8.1. correlation modelling snr pcmllr Scheme 20 15 10 5 0 Avg. Global 2.9 6.5 16.8 40.5 71.8 27.7 Observation-trained 1.4 2.6 5.5 14.2 37.1 12.2 Hard-decision 1.5 3.0 6.5 16.6 40.5 13.6 Interpolated 1.4 2.5 5.0 12.6 32.5 10.8 mmse 1.5 2.6 5.4 14.4 39.7 12.7 Table 8.8 aurora: component-independent transformation: pcmllr-based schemes and mmse. Table 8.8 shows results of using component-independent transformations. mmse is the standard feature enhancement scheme that reconstructs a clean speech estim- ate from the noise-corrupted observation. For higher snrs, its accuracy is similar to model compensation methods jud and pcmllr trained from the same joint distribu- tion. For lower snrs, however,mmse’s point estimate of the clean speech performs less well than jud and pcmllr’s compensation of distributions. Global pcmllr estimates one transform per noise condition, and is the baseline for component-independent pcmllr. Observation-trained pcmllr adapts the pre- dicted corrupted speech distribution for every observation and estimates an appro- priate transform.is yields accurate compensation, which for lower snrs shows the advantages of model compensation. Hard-decision pcmllr uses the same transforms as pcmllr, but picks one based on the feature vector rather than on the back-end component. is simple scheme yields a piecewise linear transformation of the feature space. Interpolated pcmllr performs better, by interpolating the transforms weighted by the front-end posterior. Interpolated pcmllr outperforms all other derivatives of pcmllr, pcmllr itself, and mmse, which uses the same form of decoding.e ingredients for its performance are twofold. First, the pcmllr transforms are trained to minimise the kl divergence of the adaptedmodels to the jud-predicted corrupted speech distributions in an acoustic region. Secondly, the interpolation smoothly moves between transformations that are appropriate for the acoustic region. pcmllr itself applies component-dependent 231 chapter 8. experiments enon city hwy Scheme Decoding 35 dB 25 dB 18 dB vts diag 1.2 2.5 3.2 evts diag 1.1 2.4 2.8full 1.7 2.5 2.4 evts back-oš 1.1 2.2 2.4 % utterances 87% 38% 11% Table 8.9 Extended vts on the Toshiba in-car task. transformations independently of the feature vector. 8.1.3 Toshiba in-car database Experiments are also run on a task with real recorded noise: the Toshiba in-car data- base.is is a corpus collected by Toshiba Research Europe Limited’s Cambridge Re- search Laboratory. 
It comprises a set of small/medium sized tasks with noisy speech collected in an o›ce and in vehicles driving at various conditions. is work uses three of the test sets containing digit sequences (phone numbers) recorded in a car with a microphone mounted on the rear-view mirror. e enon set, which consists of 835 utterances, was recorded with the engine idle, and has a 35 dB average signal- to-noise ratio.e city set, which consists of 862 utterances, was recorded driving in the city, and has a 25 dB average signal-to-noise ratio.e hwy set, which consists of 887 utterances, was recorded on the highway, and has a 18 dB average signal-to-noise ratio. e clean speech models are trained on the Wall Street Journal corpus, based on the system described in Liao (2007), but the number of states is reduced to about 650, more appropriate for an embedded system.e acoustic models used are cross-word triphones decision-tree clustered per state, with three emitting states per hmm, twelve components per gmm and diagonal covariance matrices.e number of components is about 7800. Like for Resource Management, extended clean speech statistics for extended vts are striped for robustness.e language model is an open digit loop. To nd the noise model, it is re-estimated twice on a new hypothesis. 232 8.2. the effect of approximations Table 8.9 on the facing page shows results on the Toshiba task.e top row con- tains word error rates for the standard compensation method: vts trained on diag- onal speech statistics.e performance of evts using diagonal covariance matrices is shown in the second row. Again evts shows gains over vts, especially at the lowest snr condition, hwy. In the hwy condition about a 12% relative reduction in error rate was obtained. Initially full-covariance matrix compensation with evts is evaluated without the use of the back-oš scheme described in section 5.5.2.1. Using evts with full-covari- ance decoding yields additional gains compared to diagonal compensation at low snrs (2.8 % to 2.4%). However the performance is degraded at higher snr conditions, for example enon where performance is degraded from 1.1 % to 1.7 %. In contrast to the previous tasks, at high snrs there are found to be zeros in the noise variance estimate. e back-oš scheme, labelled “evts back-oš” in table 8.9, is therefore used. Here diagonal covariance matrix compensation is used if any noise variance estimate falls below 0.05 times the variance žoor used for clean speechmodel training (results are consistent over a range of values from 0.0 to 0.1). e bottom line in table 8.9 shows the percentage of utterances for each of the task where the system is backed oš to diagonal covariance matrix compensation. As expected the percentage at high snrs, 87%, was far higher than at lower snrs, 11 %. Using this back- oš approach gives consistent gains over using either diagonal or full compensation evts alone. Note that as the back-oš is based on the ml-estimated noise variances, it is fully automated. Compared to standard vts, evts with back-oš gave relative reductions of 8% in the enon condition, 12 % in city, and 25% in hwy. 8.2 The effect of approximations e second part of this chapter is more theoretical. Chapter 7 has found an accur- ate approximation to the corrupted speech likelihood. To assess its performance, this section will initially consider the cross-entropy to the real distribution for individual 233 chapter 8. experiments components, as discussed in section 7.4. 
In the limit, the transformed-space sampling method yields the ideal compensation, which gives the point where the kl divergence is 0. This allows a fine-grained assessment of the approximations that methods for noise-robustness make. This section will consider the effects of parameterisations of the noise-corrupted speech distribution, such as assuming it Gaussian (section 8.2.2), and diagonalising the covariance matrix of this Gaussian (section 8.2.3). It will also investigate the effect of a common approximation to the mismatch function: assuming the phase factor α fixed (section 8.2.4).

8.2.1 Set-up

If the speech and noise models represented the real distributions perfectly, then computing the corrupted speech distribution exactly would yield the best recognition performance. In practice, however, the models are imperfect, and improving the kl divergence to the real distribution does not necessarily mean that the speech recognition accuracy will also improve. In this respect, assessing the quality of speech recognition compensation with the kl divergence is conceptually similar to assessing language models by their perplexities. The following sections will therefore also examine how the cross-entropy results relate to word error rates.

However, not all methods discussed in this thesis can be assessed with both of these metrics. The Algonquin algorithm, discussed in section 4.5.1, yields a Gaussian approximation of the corrupted speech distribution specific to an observation. Used as a method to approximate the likelihood of observations, it therefore is not normalised. This makes it impossible to compute the cross-entropy for it. As discussed in section 7.3, the likelihood for transformed-space sampling is not normalised for small sample clouds, but converges to normalisation as the number of samples increases.

This section will also not present word error rates for the transformed-space sampling method introduced in chapter 7, because decoding with it is prohibitively slow. The cause of this is a conceptual difference between model compensation methods (e.g. vts, dpmc, and idpmc) on the one hand and transformed-space sampling on the other. Model compensation computes a parametric distribution, and once that is done, running a recogniser or computing a cross-entropy is not necessarily slower than without compensation. Transformed-space sampling, on the other hand, approximates the likelihood given an observation, and cannot precompute anything. To employ it for the recognition task in this section, just for the statics and with a decent-sized sample cloud of 512, it would run at roughly 30 million times real-time.3 This figure is based on an implementation that is not optimised for speed at all, but even with an optimised implementation, running a speech recogniser with it would not be feasible.

3 Computing the likelihood of one sample for one speech Gaussian takes slightly over 30 seconds on a machine on the Cambridge Engineering Department's compute cluster. Processing one second of speech, with observations every 10 ms, and 9500 speech Gaussians, requires performing this operation 950 000 times. This therefore takes roughly 30 million seconds.

However, the approximated likelihood of transformed-space sampling tends to the exact likelihood. In the following section it will become clear that the cross-entropy that transformed-space sampling converges to indicates the minimum value of the cross-entropy to the real distribution. This minimum is by definition where the kl divergence is 0. The distance to this point in a cross-entropy graph therefore shows how far compensation methods are from the ideal compensation.
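To make this assessment concrete, the sketch below (not part of the thesis's experiments; a minimal illustration assuming NumPy and SciPy, with a hypothetical stand-in distribution) shows how a cross-entropy of the kind used here is estimated: samples drawn from the "real" distribution p are scored under a candidate compensation q, and the negative average log-density is the Monte Carlo cross-entropy. The gap to the value that the sampling method converges to (an estimate of the entropy of p) is the kl divergence to the ideal compensation.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Stand-in for the "real" corrupted speech distribution p: an equal-weight
# mixture of two 2-D Gaussians (anything that can be sampled and evaluated).
means = np.array([[0.0, 0.0], [3.0, 1.0]])
components = [multivariate_normal(m, np.eye(2)) for m in means]

def sample_p(n):
    comp = rng.integers(0, 2, size=n)
    return means[comp] + rng.standard_normal((n, 2))

def log_p(y):
    # log(0.5 * p0(y) + 0.5 * p1(y))
    return np.logaddexp(components[0].logpdf(y),
                        components[1].logpdf(y)) - np.log(2.0)

# A candidate compensation q (here simply a broad Gaussian).
q = multivariate_normal(mean=[1.5, 0.5], cov=3.0 * np.eye(2))

y = sample_p(5000)                    # samples from p
H_pq = -np.mean(q.logpdf(y))          # Monte Carlo cross-entropy H(p||q)
H_p = -np.mean(log_p(y))              # what the sampling method converges to
print(f"H(p||q) = {H_pq:.2f}, H(p) = {H_p:.2f}, KL gap = {H_pq - H_p:.2f}")
```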
The cross-entropy experiments will assess compensation quality for the corrupted speech distribution resulting from combining one speech Gaussian with one noise Gaussian. The distributions will just be over statics, to remove the dependence on any additional approximations for the dynamics.

For the speech recognition experiments, the distributions over dynamic features are found with extended feature vectors, as introduced in chapter 5. Not only does this yield better accuracy than approximations, but it also keeps compensation for dynamics most closely related to that for the statics. For robustness, the speech statistics have striped covariance matrices as discussed in section 5.5.1. The estimated Gaussian distributions of the corrupted speech have full covariance matrices.

This section uses a noise-corrupted version of the Resource Management task, the set-up of which was discussed in section 8.1.1, both for evaluating the cross-entropy and the word error rate. All experiments use a noise model trained directly on the noise as added to the speech audio, and there is no convolutional noise. This eliminates the influence of the noise estimation algorithm. As discussed in section 4.4, in practice methods for noise robustness can estimate a noise model on little training data compared to generic adaptation techniques. This is because their model of the environment matches the actual environment to some degree. This section examines how close that model can be, without considering how to estimate the noise model.

This section uses the mismatch function presented in section 4.2.1 as the real one. It also assumes that the phase factor is Gaussian but constrained to [−1, 1], and independent per time frame and per spectral coefficient. The variances σ²α of the phase factor are found from the actual filter bank weights for the htk Resource Management recogniser, as discussed in section 4.2.1.1. For the cross-entropy experiments, the schemes that sample from the phase factor distribution, dpmc and transformed-space sampling, use the exact distribution. vts requires the distribution to be Gaussian (see section 5.3.3) and ignores the domain of the coefficients.

Previous work on model compensation has not modelled the phase factor with a distribution. Work on feature enhancement with vts, on the other hand, has: to find a minimum mean square error estimate (Deng et al. 2004), or with a Kalman filter (Leutnant and Haeb-Umbach 2009a). For model compensation with vts, previous work has fixed α to 0 (Moreno 1996; Acero et al. 2000), to 1 (Liao 2007, and section 8.1 of this thesis), which is mathematically highly improbable, or to 2.5 (Li et al. 2007), which is mathematically inconsistent. The original presentation of dpmc (Gales 1995) similarly ignored the phase factor. However, since it trains a distribution on samples drawn from the corrupted speech distribution, it is straightforward to extend it so that it uses a distribution for α. The phase factor distribution to draw samples from was given in section 4.2.1.1.
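As an illustration of this set-up, the following sketch (assuming NumPy; the means, variances, and the rejection-sampling step are illustrative choices, not the thesis's implementation, and diagonal covariances are used for brevity where the experiments use full ones) draws log-spectral corrupted-speech samples from one speech Gaussian and one noise Gaussian, with a phase factor drawn from a Gaussian truncated to [−1, 1] and passed through a mismatch function of the form discussed in section 4.2.1.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_phase_factor(sigma, size, rng):
    """Draw alpha ~ N(0, sigma^2) truncated to [-1, 1] by rejection sampling."""
    out = np.empty(size)
    filled = 0
    while filled < out.size:
        cand = rng.normal(0.0, sigma, size=out.size - filled)
        cand = cand[np.abs(cand) <= 1.0]
        out.flat[filled:filled + cand.size] = cand
        filled += cand.size
    return out

def mismatch(x, n, alpha):
    """Log-spectral mismatch function with a phase factor term:
    exp(y) = exp(x) + exp(n) + 2 * alpha * exp((x + n) / 2)."""
    return x + np.log1p(np.exp(n - x) + 2.0 * alpha * np.exp(0.5 * (n - x)))

D, L = 24, 5000                                    # log-spectral dims, samples
mu_x, var_x = np.zeros(D), np.ones(D)              # hypothetical speech Gaussian
mu_n, var_n = -2.3 * np.ones(D), 0.5 * np.ones(D)  # hypothetical noise Gaussian
sigma_alpha = 0.25                                 # phase factor std (in the thesis
                                                   # this varies per coefficient)

x = mu_x + np.sqrt(var_x) * rng.standard_normal((L, D))
n = mu_n + np.sqrt(var_n) * rng.standard_normal((L, D))
alpha = sample_phase_factor(sigma_alpha, (L, D), rng)
y = mismatch(x, n, alpha)                          # corrupted speech samples
```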
The recognition experiments in this section apply extended vts and dpmc compensation with a distribution over α. This is possible because extended feature vectors are used rather than the continuous-time approximation (see section 5.3.3). Section 8.2.4 will examine the influence of fixing the phase factor.

For the cross-entropy experiments, the full-covariance noise and speech Gaussians are both over 24 log-spectral coefficients. The one for the noise is trained directly on the noise audio. The speech distribution is taken from a trained Resource Management system, single-pass retrained to find Gaussians in the log-spectral domain. A low-energy speech component4 is chosen, to represent the part of the utterance where the low snr causes recognition errors. The distance between the speech and the noise means, averaged over the log-spectral coefficients, corresponds to a 10 dB snr. Except where mentioned, the relative ordering of the approximation methods is the same for all combinations of speech and noise examined. For the cross-entropy evaluation, 5000 samples y(l) are drawn from the corrupted speech distribution.

For the cross-entropy experiments, dpmc trains Gaussians on 50 000 samples. For idpmc, the average number of samples per Gaussian component is also 50 000, so that the 8-component mixture, for example, is trained on 400 000 samples. For the recognition experiments, the number of samples per component for extended dpmc is set to 100 000. For iterative dpmc, the average number of samples per component is 100 000: for example, a 6-component mixture is trained on 600 000 samples.

This is many more samples than in section 8.1. This is because, as the compensation methods get to the limits of their performance, both for the cross-entropy and speech recognition, small inaccuracies become important. For the cross-entropy experiments, there is no mismatch between training data and predicted distribution, because both are generated with exactly the same models. For both cross-entropy and recognition experiments, over-training can be prevented not by reducing the number of components, but by increasing the number of samples. Training idpmc, in particular, appears to be sensitive to the number of samples each component has to train on. The number of samples is therefore pushed to machine limits,5 at which point performance appeared to have converged.

4 Tied state "st uh 4 3", component 2.
5 Training the recogniser takes several processor-weeks.

Figure 8.2 Cross-entropy to the corrupted speech distribution for transformed-space sampling and model compensation methods. (Cross-entropy against the size of the sample cloud, from 1 to 8192 samples; the levels for no compensation, vts, dpmc, and idpmc are shown for comparison.)

8.2.2 Compensation methods

The graph in figure 8.2 shows the cross-entropy for different compensation methods. The curved line indicates the cross-entropy between the real distribution and the transformed-space sampling method described in section 7.3, for increasing sample cloud size. The factorisation is the quasi-conditional factorisation from section 7.3.2.3. For distributions other than a three-dimensional toy example, the postponed factorisation discussed in section 7.3.2.2 showed a much slower convergence: in an earlier version of the experiment in figure 8.2 (without the phase factor), even with 16 384 samples it did not perform as well as the vts-estimated Gaussian.
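The dpmc and idpmc levels in figure 8.2 are, in essence, maximum-likelihood fits to sample sets like the one drawn above, scored with the Monte Carlo cross-entropy. A minimal sketch (assuming NumPy, SciPy, and scikit-learn; the training and test arrays are hypothetical stand-ins for corrupted speech samples, and a direct EM fit stands in for the iterative mixture building of idpmc):

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Hypothetical, non-Gaussian stand-ins for the 50 000 training and
# 5000 test samples used in the cross-entropy experiments.
train_samples = rng.standard_normal((50_000, 4)) ** 2
test_samples = rng.standard_normal((5_000, 4)) ** 2

# dpmc: the single full-covariance Gaussian that maximises the likelihood of
# the training samples is simply their sample mean and covariance.
dpmc = multivariate_normal(mean=train_samples.mean(axis=0),
                           cov=np.cov(train_samples, rowvar=False))

# An idpmc-style estimate: a mixture of full-covariance Gaussians trained on
# the same samples with EM (purely for illustration).
gmm = GaussianMixture(n_components=8, covariance_type='full',
                      random_state=0).fit(train_samples)

# Monte Carlo cross-entropies: negative average log-likelihood on test samples.
H_dpmc = -np.mean(dpmc.logpdf(test_samples))
H_idpmc = -np.mean(gmm.score_samples(test_samples))
print(f"H(p||dpmc) = {H_dpmc:.2f}, H(p||8-component mixture) = {H_idpmc:.2f}")
```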
As the size of the sample cloud increases, the approximation of p(y(l)) found with transformed-space sampling converges to the correct value. This means that the cross-entropy H(p‖q) converges to the entropy H(p). The bottom of the graph is set to the point the curve in figure 8.2 converges to, which indicates the entropy of p. Since the kl divergence can be written (in (7.44a)) as KL(p‖q) = H(p‖q) − H(p), this is the point where the kl divergence is 0. Since the kl divergence cannot be negative, this point gives the optimum cross-entropy. It gives a lower bound on how well the real corrupted speech distribution can be matched. The value of the cross-entropy for transformed-space sampling with 16 384 samples, 30.14, will also be the bottom for the other graphs in this section.

The line labelled "dpmc" in figure 8.2 indicates the best match to the real distribution possible with one Gaussian. The Monte Carlo approximation to the cross-entropy, as section 7.4 has shown, is equivalent to the negative average log-likelihood on the samples. dpmc finds the Gaussian that maximises its log-likelihood on the samples it is trained on. If the sample sets for training and testing were the same, then dpmc would yield the mathematically optimal Gaussian. Though different sample sets are used (with 50 000 samples for training dpmc and 5000 samples for testing), the cross-entropy has converged. Any other Gaussian approximation will perform worse. The state-of-the-art vts compensation finds such a Gaussian analytically, and it is much faster. However, its cross-entropy to the real distribution is far from dpmc's ideal one.

Just like dpmc, idpmc finds a distribution from samples, but it uses a mixture of Gaussians rather than one Gaussian. The mixture in the graph has 8 components trained on 400 000 samples, and comes close to the correct distribution. As the number of components increases from 1 to 8, keeping the average number of samples per component at 50 000, the cross-entropy decreases, as figure 8.3 illustrates. With an infinite number of components, it would yield the exact distribution. To correctly model the non-Gaussianity in 24 dimensions, however, a large number of components is necessary, which quickly becomes impractical.

Figure 8.3 Cross-entropy to the corrupted speech distribution for iterative dpmc. (Cross-entropy against the number of components, from 1 to 8, for dpmc and idpmc.)

To examine the link between the cross-entropy and the word error rate, recognition experiments are run. Improved modelling of the corrupted speech does not guarantee better discrimination, since speech and noise models are not necessarily the real ones. Since transformed-space sampling needs to be run separately for every observation vector for every speech component, it is too slow to use in a speech recogniser.

  Compensation    Shape   20 dB   14 dB
  —               diag     38.1    83.8
  vts, α = 1      diag      8.6    17.3
  evts            full     11.1    16.5
  edpmc           full      7.4    13.3
  eidpmc          full      6.9    12.0
  eidpmc + 6      full      6.2    11.1
  eidpmc + 12     full      6.5    11.3

  Table 8.10 Word error rates (%) for various compensation schemes.

Table 8.10 contains word error rates at two signal-to-noise ratios for comparison with the cross-entropy results in figure 8.2. Results with the uncompensated system, trained on clean data, are in the top row. Below it, as a reference, is standard vts. It sets the phase factor α to 1, and finds diagonal-covariance compensation. Standard vts uses the continuous-time approximation to compensate delta and delta-delta parameters.
This yields inaccurate compensation for off-diagonals, as section 8.1.1.1 has demonstrated. Using block-diagonal statistics and compensation (not in the table), word error rates for standard vts are worse than using diagonal-covariance ones: 19.5 % and 38.5 %.

The bottom part of the table contains results on extended vts (evts) and extended dpmc (edpmc). They use distributions over extended feature vectors. They also use a distribution over the phase factor α. The covariances of the resulting distributions are full. evts performs less well than standard vts at 20 dB. This is caused by the interaction of the phase factor with the vector Taylor series approximation, which section 8.2.4 will explore in more detail. At 14 dB, the more precise modelling does pay off. Compared to the uncompensated system, extended vts's performance improves more (38.1 % to 11.1 %) than expected from its improvement in terms of the cross-entropy in figure 8.2. vts compensation uses a vector Taylor series approximation around the speech and noise means. It therefore models the mode of the corrupted speech distribution better than the tails. This causes the majority of the improvement in discrimination. However, extended dpmc, which finds the optimal Gaussian given the speech and noise models, does yield better accuracy (7.4 %).

Extended dpmc finds one corrupted speech Gaussian for one clean speech Gaussian. The cross-entropy experiment only uses one clean speech Gaussian. Extended idpmc (eidpmc), however, trains a mixture of Gaussians from samples, which can be drawn from any distribution. For the recognition experiments, therefore, eidpmc compensates one state-conditional mixture at a time. Replacing the 6-component speech distribution by a 6-component corrupted speech distribution, eidpmc increases performance from edpmc's 7.4 % to 6.9 %. By modelling the distribution better, with 12 components ("eidpmc + 6"), performance increases further to 6.2 %. The corrupted speech distribution should be more precise as the number of components increases to 18 ("eidpmc + 12"). However, even when the number of samples is increased by a factor of 2, to 3 600 000, performance does not increase. This can be explained by a lack of robustness of the speech statistics, even though they have striped covariance matrices. Since in figure 8.3 the line for idpmc tends towards the best possible cross-entropy, this should be the best possible word error rate for these clean speech and noise distributions and this noise model.

Going from a Gaussian trained with extended vts, to the optimal Gaussian, to a mixture of Gaussians in general improves the precision of the corrupted speech model. This shows in the cross-entropy to the real distribution, and the same effects are observed in the word error rate. Better modelling of the corrupted speech distribution leads to better performance. The next sections will evaluate specific common approximations: diagonalising Gaussians' covariance matrices, and setting α to a fixed value.

8.2.3 Diagonal-covariance compensation

The cepstral-domain Gaussians of speech recognisers are often diagonalised. This yields more robust estimates than full covariance matrices, and makes decoding fast. For noisy conditions specifically, it has previously been observed that feature correlations change and that it is advantageous to compensate for this.
However, it turns out that modelling correlations for the wrong noise conditions is counter-productive (Gales and van Dalen 2007). Also, estimates for off-diagonal elements are much less robust to approximations (see section 5.3.3). This section will relate these effects using the cross-entropy and speech recogniser accuracy.

Diagonalisation usually takes place in the cepstral domain. The theory and the cross-entropy experiments have used log-spectral-domain feature vectors. To emulate diagonalisation in the cepstral domain for log-spectral-domain features, therefore, the Gaussian is first converted to the cepstral domain with a dct matrix. Normally, cepstral feature vectors are truncated to 13 elements. However, to be able to convert back to the log-spectral domain, here all 24 dimensions are retained. The Gaussian is then diagonalised. To be able to compare the log-likelihoods, the diagonalised Gaussians are converted back to the log-spectral domain with the inverse dct.

Figure 8.4 The effect of diagonalisation on the cross-entropy. (Cross-entropy for no compensation, vts, vts diagonalised, dpmc, and dpmc diagonalised.)

Figure 8.4 compares the cross-entropy to the real distribution of diagonalised and non-diagonalised Gaussians found with dpmc and vts. As explained in the previous section, dpmc by definition yields the optimal Gaussian, so it must result in the lowest cross-entropy, and diagonalising it makes it perform less well. It is interesting that full-covariance vts performs less well than its diagonalised form. On this test case, apparently, the off-diagonals in the cepstral domain are not estimated well enough, so that diagonalising lends the distribution robustness.

  Compensation    Shape   20 dB   14 dB
  —               diag     38.1    83.8
  evts            diag      8.3    15.7
  evts            full     11.1    16.5
  edpmc           diag      7.5    14.9
  edpmc           full      7.4    13.3

  Table 8.11 The effect of diagonalisation on the word error rate (%).

Table 8.11 investigates speech recognition performance when diagonalising Gaussian compensation. As in the previous section, the compensation methods use a phase factor distribution and extended feature vectors, to model the distributions as precisely as possible. Here, as for the cross-entropy, diagonalising extended vts compensation improves performance (e.g., 11.1 % to 8.3 % at 20 dB). The off-diagonal covariance entries are not estimated well, so that diagonalisation increases robustness. (The next section will relate this to the model for the phase factor α.) However, edpmc, which finds the optimal Gaussian compensation, does perform better when it is allowed to model correlations (7.5 % to 7.4 % at 20 dB). As expected from the results in section 8.1.1.3, at a lower signal-to-noise ratio the correlations change more, so that modelling them becomes more important (14.9 % to 13.3 % at 14 dB).

8.2.4 Influence of the phase factor

Model compensation often assumes a mismatch function that is an approximation to the real one as presented in section 4.2.1. However, traditionally the phase factor α, which arises from the interaction between the speech and the noise in the complex plane, has been assumed fixed. As section 8.2.1 has explained, there has recently been interest in modelling the phase factor with a distribution. However, this work has been the first to use a phase factor distribution for model compensation. This section will look into the effect of the approximation of assuming the phase factor fixed.

Two settings for α are of interest. The traditional presentation (Moreno 1996; Acero et al. 2000) sets α = 0, which is the mode of the actual distribution. The second setting is α = 1.
As appendix c.2 shows, if the term with α in the mismatch function is ignored and magnitude-spectrum feature vectors are used, this is roughly equivalent to setting α = 1 for the power spectrum. This setting has been applied in previous work (e.g. Liao 2007), and in section 8.1 of this thesis.

Figure 8.5 The effect of the phase factor on Gaussian compensation. (Cross-entropy for dpmc and vts with α = 0, α = 1, and α ∼ N.)

Figure 8.5 shows the cross-entropy for dpmc and vts with different models for the phase factor. Note that the vertical axis uses a larger scale than figure 8.2. The bottom of the graph is still set to the optimal cross-entropy acquired with transformed-space sampling. Both methods generate full covariance matrices. The diagonalised versions (not shown in the graph) show the same trends, with smaller distances between cross-entropies. dpmc with the model for α matching the actual distribution ("dpmc α ∼ N") yields the lowest cross-entropy by definition. vts with a Gaussian model ("vts α ∼ N") is at some distance.

The obvious choice for fixing α would be the mode of its actual distribution, 0. With that assumption, both dpmc and vts end up further away from the ideal distribution. Note that though the cross-entropy lines for "dpmc α = 0" and "vts α = 0" are close, the distributions are not necessarily similar. As expected, when α is fixed to 1, the modelled distributions move even further away from the actual ones.

  Scheme   α    20 dB   14 dB
  edpmc    0     7.6    13.2
           1     8.0    14.7
           N     7.4    13.3
  evts     0    11.4    16.5
           1     8.7    14.9
           N    11.1    16.5

  Table 8.12 The effect of the phase factor on Gaussian compensation (word error rates, %).

Table 8.12 contains word error rates for the same contrasts. Again, it shows only full-covariance compensation. With diagonal covariances the trends are again the same but less pronounced. For edpmc, the effect of different phase factor models is as expected. Whether α is distributed around 0 or fixed to 0 mostly affects the covariances. Though this does have an effect on the cross-entropy, since the change to the covariance matrices is fairly uniform across components, this makes little difference for discrimination. However, setting α to the unlikely value 1 affects performance negatively.

The results for vts are more surprising. Again, there is little difference between setting α to 0 and letting it be distributed around 0. For vts, this by definition does not affect components' means, but only their covariances. However, setting it to 1 does improve performance. This may be because overestimation of the mode (see section 4.4.2) improves modelling for some components. Preliminary results suggest that which value of α yields the best cross-entropy varies with different distances between the speech and noise means. A hypothesis is that for different tasks, different settings for α optimise compensation for components at a speech-to-noise ratio where mis-compensation is most likely to cause recognition errors. This would explain why the optimal α is different for different corpora (Gales and Flego 2010). However, this is material for future research.

What the results here do show is that while modelling α with a distribution reduces the distance to the actual distribution, as evidenced by the improving cross-entropy, discrimination is not helped.
Section 4.4.2 has pointed out that the only ešect of using a distribution for the phase factor over a xed value at the distribution’s mode is a fairly equal bias on the covariance, which is unlikely to inžuence discrimination much. It has also discussed how in practice the noise estimation can subsume this bias. Using a distribution over the phase factor rather than a xed value as is currently done, is therefore unlikely to result in gains in a practical speech recogniser. 246 Chapter 9 Conclusion e theme of this thesis has been to improve modelling of noise-corrupted speech distributions for speech recognition. It has argued (section 4.1) that if the model for speech, noise, and their interactionwere exact, then decoding the audio with the exact corrupted speech distribution would yield the best speech recogniser performance. is thesis has sought not only performance baselines, but also performance ceil- ings. It has derived an explicit expression for the corrupted speech distribution, which has no closed form. It has then analysed model compensation as aiming to minim- ise the divergence between the recogniser distribution and this predicted corrupted speech distribution. Two ways of improving modelling of corrupted speech distribu- tions have been proposed. First, this thesis has introduced methods that can model within-component fea- ture correlations under noise.is is not normally done, because of problems nding accurate compensation, and the computational cost. is thesis has found new ap- proaches for both problems (chapters 5 and 6), resulting in a chain of techniques that provide correlation modelling at a reasonable and tunable computational cost. Second, this thesis has introduced a method of approximating the real corrupted speech likelihood (chapter 7). Given speech and noise distributions and a mismatch function, it nds a Monte Carlo approximation to the likelihood of one observation vector. ough it is very slow, in the limit it computes the exact likelihood. It there- 247 chapter 9. conclusion fore gives a theoretical bound for noise-robust speech recognition. is has made it possible to assess the ešect of approximations that model compensation makes. e following gives more detail about the contributions of this thesis. 9.1 Modelling correlations Methods for noise-robust speech recognition, like the state-of-the-art vts compensa- tion, normally use diagonal-covariance Gaussians, and fail to model within-compon- ent feature correlations. e reason for this is twofold: existing methods do not give good estimates for correlations, and the computational cost of decoding with full co- variance matrices is prohibitive. is thesis has presented insights and solutions for both problems. vts compensation’s estimates for the feature correlations of corrupted speech are unreliable because of the continuous-time approximation. It assumes that dynamic coe›cients are time derivatives of static coe›cients, whereas in reality they aremerely approximations. Dynamic coe›cients are found with linear regression from a win- dow of static feature vectors. Chapter 5 has introduced model compensation methods that apply the mismatch function (or an approximation) to each time instance in the window separately. By only then applying the linear transformation that the dynamic coe›cients are extracted with, compensation becomes much more precise. It then becomes possible to generate good full-covariance compensation with as little adapt- ation data as standard vts needs. 
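To make the projection step concrete: once a corrupted-speech Gaussian over the window of statics (the extended feature vector) has been found, the distribution over statics and dynamics follows by applying the linear transformation that extracts the dynamic coefficients. A minimal sketch of that step (assuming NumPy; the window width and regression weights follow the usual HTK-style delta definition, and the extended-domain Gaussian here is a random stand-in for a compensated one, not the thesis's implementation):

```python
import numpy as np

d = 24          # static dimensions per time slice
w = 2           # delta window half-width; the extended window has 2*w + 1 slices
slices = 2 * w + 1

# Linear-regression weights for delta coefficients:
# delta_t = sum_tau tau * x_{t+tau} / (2 * sum_tau tau^2).
taus = np.arange(-w, w + 1)
delta_weights = taus / (2.0 * np.sum(taus[taus > 0] ** 2))

# Projection from the extended static window to [statics; deltas].
centre = np.zeros(slices)
centre[w] = 1.0
D_proj = np.vstack([np.kron(centre, np.eye(d)),          # pick the centre slice
                    np.kron(delta_weights, np.eye(d))])  # delta regression

# Hypothetical compensated Gaussian over the extended window (found by, e.g.,
# applying the mismatch function to each time slice separately).
rng = np.random.default_rng(3)
mu_ext = np.zeros(slices * d)
A = rng.standard_normal((slices * d, slices * d))
Sigma_ext = A @ A.T / (slices * d)       # any symmetric positive-definite matrix

# Distribution over statics and deltas: mean D mu, covariance D Sigma D^T.
mu = D_proj @ mu_ext
Sigma = D_proj @ Sigma_ext @ D_proj.T    # full covariance, including
                                         # static-dynamic correlations
```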
However, decoding with full covariance matrices is slow. Chapter 6 has therefore derived methods that approximate full-covariance compensation using linear trans- formations. is makes decoding faster. ese predictive linear transformations are versions ofwell-knownadaptive linear transformations, whichnormally requiremuch more adaptation data than methods for noise-robustness. Also, joint uncertainty de- coding can compensate a whole base class of components at once. Combining joint uncertainty decoding and predictive linear transformations makes many of the re- 248 9.2. asymptotically exact likelihoods quired statistics cacheable, so that the whole process is fast enough to be implemented in a real-world speech recogniser. Predictive linear transformations had been introduced before (van Dalen 2007), but chapter 6 has introduced the formal framework for them. Predictive methods approximate a predicted distribution with a dišerent parameterisation.is has been formalised as minimising the kl divergence.e framework also subsumes standard model compensation methods, which minimise the kl divergence to the predicted distribution. Predictive transformations give a powerful framework for combining the advant- ages of two forms of distribution. Since the introduction of predictive linear trans- formations for noise-robustness, other variants have been proposed, some of which derive from vtln or combine predicted statistics from with statistics directly from data (Flego and Gales 2009; Xu et al. 2009; 2011; Breslin et al. 2010). Section 6.4, based on joint work with Federico Flego, has introduced another, fast, variant, which estimates a transformation tominimise the divergence to the predicted corrupted speech, but applies it to speech recogniser features. 9.2 Asymptotically exact likelihoods Model compensation methods aim to model the corrupted speech distribution, but usually fall short by denition. With Gaussian speech and noise, the corrupted speech is not Gaussian. Standard model compensation assumes it is, and can never provide ideal compensation. Chapter 7 has introduced a more accurate approximation to the corrupted speech likelihood. Rather than a parameterised density, it uses a sampling method, which approximates the integral over speech, noise and phase factor that the likelihood consists of. Because the probability density has an awkward shape, the integral is rst transformed. en, sequential importance resampling deals with the high dimensionality. As the number of samples tends to innity, this approximation converges to the real likelihood. 249 chapter 9. conclusion Because the method cannot precompute distributions, it is too slow to embed in a speech recogniser. However, it is possible to nd the kl divergence from the real corrupted speech distribution to an approximation up to a constant.e newmethod essentially gives the point where the kl divergence is 0, so it can be assessed how close compensationmethods are to the ideal.e kl divergence for dišerent compensation methods appears to predict their word error rates well. One of these compensation methods is iterative data-driven parallelmodel combination (idpmc), which takes im- practically long to train but it is feasible to run speech recognition with. A version of idpmc that uses extended feature vectors comes close to transformed-space sampling in terms of cross-entropy, and improves the word error rate substantially. 
Given the link between the cross-entropy and the word error rate, this should indicate the best possible performance with these speech and noise models, and this mismatch func- tion. Using the kl divergence technique, it also becomes possible to examine approx- imations to the mismatch function. ese include assuming the corrupted speech distribution Gaussian, and diagonalising that Gaussian’s covariance. One common approximation, assuming the phase factor xed, has seen particular interest in recent years. is work has introduced model compensation using a phase factor distribu- tion for extended vts, extended dpmc, and extended idpmc. is has turned out to improve the cross-entropy more than discrimination. In particular, for vts compens- ation setting the phase factor to a xed value other than its mode appears to counter some ešects of the vector Taylor series approximation at dišerent signal-to-noise ra- tios. 9.3 Future work is thesis has found improvedmodels for the corrupted speech, assuming the speech and noise Gaussian-distributed. is should give insight into what are viable direc- tions of research in model compensation. 250 9.3. future work e search for better compensation with diagonal-covariance Gaussians contin- ues. Better approximations to the mismatch function (Xu and Chin 2009b; Seltzer et al. 2010, and the phase factor distribution in section 8.2 of this work) and using an alternative sampling scheme, the unscented transformation (Hu and Huo 2006; van Dalen and Gales 2009a; Li et al. 2010) have been investigated. is line of research has two issues. First, if the noise distribution is known, the theoretical bound for Gaussian compensation with full covariance matrices, and therefore also for diagonal covariance matrices, is now known. It is given by extended dpmc, presented in this thesis. Second, in practice, it is necessary to estimate the noisemodel.is is currently possible with maximum-likelihood estimation for standard vts. Using such a noise model, extended dpmc hardly beats extended vts (also presented in this thesis, and reasonably fast), whether with full or diagonal covariances. is indicates that any new practical method for Gaussian compensation will need to improve over extended vts in terms of accuracy, and in terms of noise model estimation.is may be a tough search for little gain. Noise estimation can absorb the dišerences between compens- ation schemes if they model the environment reasonably well. It is telling that (with diagonal covariances) awell-tuned implementation of unscented transformations per- forms an insignicant 0.02%worse than a well-tuned implementation of vts (Li et al. 2010). It may therefore be advisable to call oš the search for newmethods of Gaussian compensation. However, this thesis has only sketched (in section 5.5.2.2) how to directly estimate a noise model for extended vts. Implementing this would not only make estimation and compensation with evts consistent, but also allow full-covariance noise models. For a known noise distribution, a full-covariance Gaussian has yielded some improve- ment in accuracy (in section 8.1.1.2). It would also be interesting to estimate a noise model that optimises the likelihood for compensation with full-covariance joint un- certainty decoding or predictive transformations directly. Non-Gaussian compensation may be harder to nd, though it may yield bigger gains. is thesis has upper-bounded the gain (in section 8.2), but with slow meth- 251 chapter 9. conclusion ods. 
Suitably fast non-Gaussian distributions that are better tailored to the corrup- ted speech distribution in multiple dimensions might help. Alternatively, but poten- tially even harder, forms of clean speech and noise distributions that when combined through themismatch function produce a distribution for the corrupted speech that is easier to approximate would be helpful.is work has only given a theoretical bound for Gaussian speech and noise models. However, the techniques presented in this thesis should be general enough to estimate how far from optimal any new proposal is. ere are alternative directions for research on improving noise-robust speech recognition, though. One is to investigate other noise models, for example, with more temporal structure. Again, this thesis has not given theoretical bounds for that. A more practically-minded strand of research that may follow from this thesis, not necessarily restricted to noise-robustness, is in predictive transformations. It has already become clear that predictive linear transformations bring advantages to other areas than noise-robustness (Breslin et al. 2010). at the framework is so general, and formalised in this thesis, provides opportunity for a wide range of interesting in- stantiations. 252 Appendices 253 Appendix A Standard techniques is appendix gives details of a number of well-known equalities and algorithms for reference. A.1 Known equalities e following useful equalities are well-known. A.1.1 Transforming variables of probability distributions If variables x and y are deterministically linked, a probability distribution over x, p(x), can be converted into one over y with (see, for example, Bishop 2006, 11.1.1) p(y) = p(x) ∣∣∣∣ ∂x∂y ∣∣∣∣ . (a.1) A.1.2 Matrix identities eWoodbury identity relates three matricesA,B,C (see e.g. Petersen and Pedersen 2008, 3.2.2):( A+CBCT )−1 = A−1 −A−1C ( B−1 +CTA−1C )−1 CTA−1. (a.2) 255 appendix a. standard techniques e inverse of a block symmetricmatrix is given by (see e.g. Petersen and Pedersen 2008, 9.1.3) A CT C B −1 =  D−1 −D−1CTB−1 −B−1CD−1 E−1  (a.3a) =  D−1 −A−1CTE−1 −E−1CA−1 E−1  , (a.3b) whereD = A−CTB−1C is the Schur complement of the matrix with respect to B, and E = B−CA−1CT is the Schur complement of the matrix with respect toA. e determinant of the matrix is∣∣∣∣∣∣ A CT C B ∣∣∣∣∣∣ = |A| ·|E| = |B| ·|D| . (a.4) A.1.3 Multi-variate Gaussian factorisation It can be useful to decompose the evaluation of a multi-variate Gaussian into factors. An obvious choice of factors would be the actual distribution of one coe›cient con- ditional on all previous ones. Straightforward derivations of this usually (e.g. Bishop 2006) assume that theGaussian is normalised (so that constant factors can be dropped) and assume the input for the Gaussian is linear in the variable of interest (so that the integral over coe›cients is constant). ese assumptions, however, can not be made in this work. Below derivation therefore explicitly considers all constants. Let q an unnormalised Gaussian density with parameters a and b, q a b  = exp −1 2 a b − µa µb T Λa Λab Λba Λb a b − µa µb   , (a.5) whereΛ is the precision matrix, the inverse of the covariance matrix Σ: Λ =  Λa Λab Λba Λb  =  Σa Σab Σba Σb −1 = Σ−1. (a.6) 256 a.1 . known equalities From the expression for the inverse of a symmetric block matrix, given in (a.3), it follows that Σ−1a = Λa −ΛabΛ−1b Λba, which will be useful in the derivation below. 
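(Both identities, and the precision-matrix relation just stated, are easy to sanity-check numerically. A small sketch, assuming NumPy, with arbitrary symmetric positive-definite test matrices; it is an illustration only, not part of the derivation:)

```python
import numpy as np

rng = np.random.default_rng(4)

def random_spd(d):
    """A random symmetric positive-definite matrix."""
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

# Check the Woodbury identity (a.2).
A, B = random_spd(5), random_spd(3)
C = rng.standard_normal((5, 3))
lhs = np.linalg.inv(A + C @ B @ C.T)
rhs = np.linalg.inv(A) - np.linalg.inv(A) @ C @ np.linalg.inv(
    np.linalg.inv(B) + C.T @ np.linalg.inv(A) @ C) @ C.T @ np.linalg.inv(A)
assert np.allclose(lhs, rhs)

# Check the Schur-complement relation from (a.3):
# inv(Sigma_a) = Lambda_a - Lambda_ab inv(Lambda_b) Lambda_ba.
Sigma = random_spd(6)
Lambda = np.linalg.inv(Sigma)
a, b = slice(0, 2), slice(2, 6)
lhs = np.linalg.inv(Sigma[a, a])
rhs = Lambda[a, a] - Lambda[a, b] @ np.linalg.inv(Lambda[b, b]) @ Lambda[b, a]
assert np.allclose(lhs, rhs)
```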
e density can be decomposed into a factor dependent on a and one dependent on both a and b. e steps the derivation follows are (a.7a) expanding the terms; (a.7b) gathering terms containing b; (a.7c) completing the square and compensating for that; and nally (a.7d) simplifying. q a b  = exp −1 2 a b − µa µb T Λa Λab Λba Λb a b − µa µb   = exp ( − 12(a− µa) TΛa(a− µa) −(b− µb) TΛba(a− µa) − 12(b− µb) TΛb(b− µb) ) = exp ( − 12(a− µa) TΛa(a− µa) − b TΛba(a− µa) + µ T bΛba(a− µa) − 12b TΛbb+ b TΛbµb − 1 2µ T bΛbµb ) (a.7a) = exp ( − 12(a− µa) TΛa(a− µa) + µ T bΛba(a− µa) − 1 2µ T bΛbµb − 12b TΛbb+ b TΛb ( µb −Λ −1 b Λba(a− µa) )) (a.7b) = exp ( − 12(a− µa) TΛa(a− µa) + µ T bΛba(a− µa) − 1 2µ T bΛbµb − 12 ( b− ( µb −Λ −1 b Λba(a− µa) ))T Λb ( b− ( µb −Λ −1 b Λba(a− µa) )) + 12 ( µb −Λ −1 b Λba(a− µa) )T Λb ( µb −Λ −1 b Λba(a− µa) )) (a.7c) = exp ( − 12(a− µa) T ( Λa −ΛabΛ −1 b Λba ) (a− µa) − 12 ( bT − ( µb −Λ −1 b Λba(a− µa) ))T Λb ( bT − ( µb −Λ −1 b Λba(a− µa) ))) = exp ( − 12(a− µa) TΣ−1a (a− µa) ) exp ( − 12 ( bT− ( µb −Λ −1 b Λba(a− µa) ))T Λb ( bT− ( µb −Λ −1 b Λba(a− µa) ))) . (a.7d) A normalised Gaussian can be factorised analogously. Using (a.4), the determinant of the block matrix Σ can be decomposed into the determinants of the covariance matrices of the two terms in (a.7): |Σ| = |Σa| · ∣∣Λ−1b ∣∣. A normalised Gaussian then 257 appendix a. standard techniques can be decomposed into two normalised Gaussians: N a b  ; µa µb  ,  Σa Σab Σba Σb  = |2piΣ|−1/2 q a b  = |2piΣa|−1/2 · ∣∣∣2piΛ−1b ∣∣∣−1/2 exp(− 12(a− µa)TΣ−1a (a− µa)) exp ( − 12 ( bT− ( µb −Λ −1 b Λba(a− µa) ))T Λb ( bT− ( µb −Λ −1 b Λba(a− µa) ))) = N (a; µa, Σa)N ( b; µb −Λ −1 b Λba(a− µa), Λ −1 b ) = N (a; µa, Σa)N ( b; µb + ΣbaΣ −1 a (a− µa), Σb − ΣbaΣ −1 a Σab ) . (a.8) If the density q is a probability distribution anda andb are distributed according to it: a b  ∼ N µa µb  ,  Σa Σab Σba Σb  , (a.9) then the two factors in (a.8) are the marginal probability distributions of a and the distribution of b conditional on a, so that a ∼ N (µa,Σa) ; (a.10a) b|a ∼ N ( µb −Λ −1 b Λba(a− µa),Λ −1 b ) (a.10b) ∼ N ( µb + ΣbaΣ −1 a (a− µa),Σb − ΣbaΣ −1 a Σab ) . (a.10c) is is a standard result. Note that the distribution of a is more concisely expressed in terms of the joint’s covariance matrix, and the distribution of b|a in terms of the precision matrix. A.2 Kullback-Leibler divergence An important tool in this work is the Kullback-Leibler (kl) divergence (Kullback and Leibler 1951). It measures the dišerence between two distributions. If p and q are 258 a.2. kullback-leibler divergence distributions over a continuous domain, the kl divergence between them, KL(p‖q) is dened as KL(p‖q) = ∫ p(x) log p(x) q(x) dx. (a.11) In this work it is used both as a criterion for optimisation and to assess the accuracy of models.e kl divergence has the following properties. e expression in (a.11) can be decomposed as KL(p‖q) = H(p‖q) −H(p) , (a.12a) whereH(p‖q) is the cross-entropy of p and q, H(p‖q) = − ∫ p(x) logq(x)dx, (a.12b) andH(p) is the entropy of p, H(p) = − ∫ p(x) logp(x)dx. (a.12c) e kl divergence is always non-negative, since the cross-entropy is always greater than or equal to the entropy. If and only if the distributions have the same density for each x, the cross-entropy and the entropy are equal, so that the kl divergence becomes 0. 
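(These properties are easy to illustrate numerically. A small sketch, assuming NumPy and SciPy, evaluates (a.11) and the decomposition (a.12) for two one-dimensional Gaussians by numerical integration; the particular distributions are arbitrary examples:)

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = norm(loc=0.0, scale=1.0)
q = norm(loc=1.0, scale=2.0)

# KL(p||q), the cross-entropy H(p||q) and the entropy H(p), as integrals.
kl, _ = quad(lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x)), -np.inf, np.inf)
cross_entropy, _ = quad(lambda x: -p.pdf(x) * q.logpdf(x), -np.inf, np.inf)
entropy, _ = quad(lambda x: -p.pdf(x) * p.logpdf(x), -np.inf, np.inf)

assert kl >= 0.0                                  # non-negativity
assert np.isclose(kl, cross_entropy - entropy)    # KL = H(p||q) - H(p)
print(f"KL = {kl:.4f}, H(p||q) = {cross_entropy:.4f}, H(p) = {entropy:.4f}")
```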
To nd a distribution of a particular form that best matches another distribution, minimising the kl divergence is oŸen a useful criterion. Well-known algorithms for training distribution parameters such as expectation–maximisation, described in ap- pendix 2.3.2.1, can be seen asminimising a kl divergence. Inference in graphicalmod- els is oŸen expressed as minimising a kl divergence as well.e expectation propaga- tion and belief propagation algorithms are examples. When optimising q or comparing dišerent distributions q, the reference distri- bution p is oŸen xed. In that case, the cross-entropy is the kl divergence up to a constant. Optimising the cross-entropy, or comparing merely the cross-entropy, is therefore oŸen a valid alternative for working with the kl divergence. 259 appendix a. standard techniques A.2.1 KL divergence between Gaussians A specic case of interest is the kl divergence between two Gaussians. If two distri- butions p and q over a d-dimensional space are dened p(x) = N (x; µa, Σa) ; q(x) = N (x; µb, Σb) , (a.13a) then the kl divergence between the distributions is KL(p‖q) = 1 2 ( log (|Σb| |Σa| ) + Tr ( Σ−1b Σa ) + (µb − µa) TΣ−1b (µb − µa) − d ) . (a.13b) A sub-case of this is when the dimensions for both distributions can be partitioned into blocks of dimensions that are mutually independent. e kl divergence then becomes a sum of kl divergences for these blocks. Without loss of generality, assume that the covariance matrices are block-diagonal with Σa = Σa,1 0 0 Σa,2  ; Σb = Σb,1 0 0 Σb,2  . (a.14a) Both distributions can then be factorised as distributions over x1 and x2, with x =[ xT1 x T 2 ]T : p(x) = p1(x1)p2(x2) = N (x1; µa,1, Σa,1)N (x2; µa,2, Σa,2) ; (a.14b) q(x) = q1(x1)q2(x2) = N (x1; µb,1, Σb,1)N (x2; µb,2, Σb,2) (a.14c) ekldivergence in (a.13b) then becomes the sumof the parallel divergences between the two factors: KL(p‖q) = KL(p1‖q1) +KL(p2‖q2) . (a.14d) is is true for the kl divergence between any two distributions that display inde- pendence between blocks of dimensions. By applying this equation recursively, the kl divergence between any two Gaussians with the same block-diagonal covariance matrices is the sum of the kl divergences of the factors in parallel. e most factor- ised Gaussians have diagonal covariances, in which case the kl divergence works per dimension. is will be useful both when training parameters of a distribution, and when assessing the distance to the real distribution. 260 a.2. kullback-leibler divergence A.2.2 Between mixtures e kl divergence between mixtures does not have a closed form. A number of ap- proximations with dišerent properties (Yu 2006; Hershey and Olsen 2007) are pos- sible. For section 6, it will be useful to minimise the kl divergence between two mix- tures of Gaussians. e following will therefore describe a variational approxima- tion to the kl divergence introduced independently by Yu (2006); Hershey and Olsen (2007).1 is approximation can then be minimised as a proxy for minimising the exact one. Let p and q be mixture models, with components indexed withm and n respect- ively: p(x) = ∑ m pi(m)p(m)(x); q(x) = ∑ n ω(n)q(n)(x), (a.15) with p(m)(x) and q(n)(x) the component distributions, and pi andω the weight vec- tors, with ∑ m pi (m) = ∑ nω (n) = 1. e kl divergence can be written as the dišerence between the cross-entropy and the entropy, as in (a.12). e variational approximation that will be presented here nds an upper bound on both the cross-entropy and on the entropy separately. 
is implies that there is no guarantee for the approximation to the kl divergence to be on either side of the real one. However, in this thesis the approximation is used to minimiseKL(p‖q) with respect to q. SinceH(p) is not a function of q, it su›ces to minimise the upper bound onH(p‖q).e following will therefore present the upper bound onH(p‖q); the upper bound onH(p) can be found analogously. A set of variational parametersφ is introduced that partition the weight of each component of p up into parts representing the components of q:∑ n φ (m) n = 1, (a.16) with φ(m)n ≥ 0. 1e description inHershey andOlsen (2007)was originallymeant formixtures ofGaussians, which is also what this work will use it for, but the derivation is valid for mixtures of any type of distribution, as Dognin et al. (2009) acknowledge. 261 appendix a. standard techniques e following derivation of an upper boundF(p, q,φ) to the cross-entropy uses Jensen’s inequality. It moves a summation outside a logarithm in (a.17a), and because of Jensen’s inequality, the result is less than or equal to the logarithm of the sum. In this case, the expressions are negated, so that the result is greater than or equal. H(p‖q) = − ∫ p(x) logq(x)dx = − ∫∑ m pi(m)p(m)(x) log (∑ n ω(n)q(n)(x) ) dx = − ∫∑ m pi(m)p(m)(x) log (∑ n φ (m) n ω(n)q(n)(x) φ (m) n ) dx ≤ − ∫∑ m pi(m)p(m)(x) ∑ n φ (m) n log ( ω(n)q(n)(x) φ (m) n ) dx (a.17a) = − ∑ m ∑ n pi(m)φ (m) n (∫ p(m)(x) logq(n)(x)dx+ log ω(n) φ (m) n ) = ∑ m ∑ n pi(m)φ (m) n ( H(p(m)∥∥q(n))+ log φ(m)n ω(n) ) , F(p, q,φ). (a.17b) Derivatives of this with respect to the variational parameters are dF(p, q,φ) dφ (m) n = dpi(m)φ (m) n ( H(p(m)∥∥q(n))− logω(n) + logφ(m)n ) dφ (m) n = pi(m) ( H(p(m)∥∥q(n))− logω(n) + logφ(m)n + 1) ; (a.18a) d2F(p, q,φ) d2φ (m) n = pi(m) φ (m) n . (a.18b) On the domain of φ(m)n , which is [0, 1], its second derivative is non-negative, so that the upper bound is convex. e upper bound is minimised with respect to the variational parameters φ(m)n . 262 a.2. kullback-leibler divergence e optimisation under the constraints in (a.16) uses Lagrange multipliers: 0 = dF(p, q,φ) + λ (∑ n ′ φ (m) n ′ − 1 ) dφ (m) n ; logφ(m)n = logω(n) −H ( p(m) ∥∥q(n))− 1− λ pi(m) ; φ (m) n = ω (n) exp ( −H(p(m)∥∥q(n))− 1− λ pi(m) ) , (a.19) which, setting λ to satisfy the constraint in (a.16), gives the optimal parameter setting φ (m) n := ω(n) exp ( −H(p(m)∥∥q(n)))∑ n ′ ω (n ′) exp ( −H(p(m)∥∥q(n ′))) . (a.20) Optimisingφ does not change the real cross-entropy; it merely nds a tighter bound. In section 6, this bound is used to optimise the cross-entropy itself with respect to q. A parameter setting that is of interest results in the matched-pair bound.is as- sumes a one-to-one mapping from components of p to components of q. Assuming that componentm in p corresponds to componentm in q, the parameters can be set to2 φ (m) n =  1, m = n; 0, m 6= n. . (a.21) is reduces the upper bound to F(p, q,φ) = ∑ m pi(m) ( H(p(m)∥∥q(m))− logω(m)) = H(pi∥∥ω)+∑ m pi(m)H(p(m)∥∥q(m)). (a.22) If additionally the component priors for both mixtures are kept equal, pi = ω, then optimising the term behind the + sign will tighten this upper bound. In section 6.1.1 this bound is used to nd distributionsq(m) that approximatep(m), where both derive from the same speech recogniser model set but are parameterised dišerently. 2is is easy to generalise to the case where there is a dišerent deterministic mapping from each component of p to a component of q. 263 appendix a. 
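(For reference, a compact numerical sketch of the bound above, assuming NumPy; the mixtures are arbitrary examples. Substituting the optimal variational parameters (a.20) into (a.17b), the terms involving the cross-entropies cancel against the logarithm of φ, and the bound collapses to F(p, q, φ) = −Σm π(m) log Σn ω(n) exp(−H(p(m)‖q(n))), which is what the code evaluates, using the standard closed-form cross-entropy between two Gaussians.)

```python
import numpy as np

def gaussian_cross_entropy(mu_p, cov_p, mu_q, cov_q):
    """H(N_p || N_q) for full-covariance Gaussians."""
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_p - mu_q
    return 0.5 * (d * np.log(2.0 * np.pi) + np.linalg.slogdet(cov_q)[1]
                  + np.trace(cov_q_inv @ cov_p) + diff @ cov_q_inv @ diff)

def variational_cross_entropy_bound(pi, mus_p, covs_p, omega, mus_q, covs_q):
    """Upper bound on H(p||q) for two Gaussian mixtures, with the optimal
    variational parameters of (a.20) substituted into (a.17b)."""
    bound = 0.0
    for m in range(len(pi)):
        H_mn = np.array([gaussian_cross_entropy(mus_p[m], covs_p[m],
                                                mus_q[n], covs_q[n])
                         for n in range(len(omega))])
        # log-sum-exp of log(omega_n) - H_mn, for numerical stability
        a = np.log(omega) - H_mn
        bound -= pi[m] * (a.max() + np.log(np.sum(np.exp(a - a.max()))))
    return bound

# Hypothetical two-component mixtures in two dimensions.
pi = np.array([0.6, 0.4])
mus_p = [np.zeros(2), np.array([2.0, 0.0])]
covs_p = [np.eye(2), 0.5 * np.eye(2)]
omega = np.array([0.5, 0.5])
mus_q = [np.array([0.1, 0.0]), np.array([1.8, 0.2])]
covs_q = [1.2 * np.eye(2), 0.6 * np.eye(2)]

print(variational_cross_entropy_bound(pi, mus_p, covs_p, omega, mus_q, covs_q))
```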
standard techniques A.3 Expectation–maximisation Expectation–maximisation aims to maximise L(p˜, qUX ) by updating a lower-bound function F(p˜, ρ, qUX ) of the likelihood.e lower-bound explicitly uses a distribu- tion over the hidden variables ρ(U |X ). ρ and qUX are optimised iteratively. First, ρ must be set to make the lower-bound function equal to the log-likelihood (in the “ex- pectation” step).en,qUX is set tomaximise the lower bound (in the “maximisation” step). e generalised em algorithm replaces this maximisation by an improvement. It is possible to prove, with Jensen’s inequality, that the lower bound is indeed lower or equal to the log-likelihood, so that the log-likelihood is guaranteed not to decrease if the lower bound does not decrease.e following details the expectation–maximi- sation algorithm and its proof. e statisticalmodelwhose parameters are trainedwill be denotedwithqUX (U ,X ), which is a distribution over the hidden and observed variables. Marginalising out over the hidden variables gives the distribution over the observed variables: qX (X ) = ∫ qUX (U ,X )dU (a.23a) e log-likelihood for one data point is then L(X , qUX ) , log ∫ qUX (U ,X )dU , (a.23b) and for the whole training data L(p˜, qUX ) , ∫ p˜(X )L(X , qUX )dX . (a.23c) e lower bound that expectation–maximisation optimises is dened for a single data point X as F(X , ρ, qUX ) , ∫ ρ(U |X ) log qUX (U ,X ) ρ(U |X ) dU (a.24a) and for the empirical distribution p˜ representing all training data as F(p˜, ρ, qUX ) , ∫ p˜(X )F(X , ρ, qUX )dX . (a.24b) 264 a.3. expectation–maximisation e following will optimise the lower bound as a surrogate for optimising the actual log-likelihood.e expectation and the maximisation steps rewrite the lower bound dišerently to make it possible to optimise its parameters. A.3.1 The expectation step: the hidden variables Compared to the log-likelihood, the lower bound function takes an extra parameter, ρ. e expectation step sets this distribution so that the lower bound is maximised, and equals the log-likelihood.is uses the posterior distribution of the hidden variables given the data according to the current mode parameters, which will be written qU |X with qU |X (U |X ) = qUX (U ,X ) qX (X ) . (a.25) For a given observation X , the lower bound can be written as a sum of the log-likeli- hood (independent of the hidden variables) and a kl divergence: F(X , ρ, qUX ) = ∫ ρ(U |X ) log qUX (U ,X ) ρ(U |X ) dU = ∫ ρ(U |X ) ( logqX (X ) + log qU |X (U |X ) ρ(U |X ) ) dU = logqX (X ) + ∫ ρ(U |X ) log qU |X (U |X ) ρ(U |X ) dU = L(X , qUX ) −KL ( ρ ∥∥qU |X ), (a.26) e log-likelihoodL(X , qUX ) does not depend on ρ. Tomake the lower bound equal to the log-likelihood, the right-hand term needs to be zero.e Kullback-Leibler di- vergenceKL(·‖·) of two distributions is always non-negative, and zero when the dis- tributions are identical. F is therefore maximised when ρ is set to the hidden variable posterior qU |X for all observations in p˜: ρ(k) := argmin ρ F(p˜, ρ, q(k)UX ) = argmin ρ ∫ p˜(X )KL(ρ∥∥q(k−1)U |X )dX = q (k−1) U |X . (a.27) 265 appendix a. standard techniques By setting ρ(k) to the hidden variable posterior, the Kullback-Leibler divergence in (a.26) becomes 0, so that L(p˜, q(k−1)X ) = F(p˜, ρ(k), q(k−1)UX ). (a.28) is is the rst part of the proof of convergence of expectation–maximisation. A.3.2 The maximisation step: the model parameters e second step of expectation–maximisation, the maximisation step, optimises the parameters of the model, qUX . 
Again, like in (a.26), the expression for the lower bound for a single data point is rewritten, this time straightforwardly as a term de- pendent on qUX and one independent: F(X , ρ, qUX ) = ∫ ρ(U |X ) log qUX (U ,X ) ρ(U |X ) dU = ∫ ρ(U |X ) logqUX (U ,X )dU − ∫ ρ(U |X ) log ρ(U |X )dU . (a.29) e right-hand term in this expression is constant when optimising F with respect to qUX in the maximisation step.e new estimate for qUX is therefore chosen q (k) UX := argmax qUX F(p˜, ρ(k), qUX ) = argmax qUX ∫ p˜(X )F(X , ρ(k), qUX )dX = argmax qUX ∫ p˜(X ) ∫ ρ(k)(U |X ) logqUX (U ,X )dUdX . (a.30) How to perform this maximisation depends on the specic problem. If it is not pos- sible to maximise F , then generalised em can be used. It merely requires that qUX does not decrease. In either case, the lower bound is guaranteed to remain equal or increase: F(p˜, ρ(k), q(k−1)UX ) ≤ F(p˜, ρ(k), q(k)UX ). (a.31) is is the second part of the proof of convergence of expectation–maximisation. 266 a.3. expectation–maximisation A.3.3 Convergence e objective of expectation–maximisation is to increase the log-likelihood at every iteration. It can be proven that the log-likelihood never decreases.e last step of this proof requires Jensen’s inequality, which states that for a convex function φ(x) (for example, the log function), inputs xi, and non-negative weights pii, the weighted sum of the function applied to the inputs is never greater than the function applied to the weighted sum of the inputs:∑ i piiφ(xi) ≤ φ (∑ i piixi ) . (a.32) e relation betweenF and L is the analogue in the continuous domain. For one observation X , F was dened in (a.24a). By applying Jensen’s inequality, F turns out to be related to lower bound L expressed as marginalising out over the hidden variables (as it was in (a.23b)): F(X , ρ, qUX ) = ∫ ρ(U |X ) log qUX (U ,X ) ρ(U |X ) dU ≤ log ∫ ρ(U |X ) qUX (U ,X ) ρ(U |X ) dU = log ∫ qUX (U ,X )dU = L(X , qUX ). (a.33) is same relation then goes for the log-likelihood and the lower bound over the full training data: F(p˜, ρ, qUX ) = ∫ p˜(X )F(X , ρ, qUX )dX ≤ ∫ p˜(X )L(X , qUX )dX = L(p˜, qUX ). (a.34) is is the nal part of the proof of convergence of expectation–maximisation. Combining (a.28), (a.31), and (a.34): L(p˜, q(k−1)X ) = F(p˜, ρ(k), q(k−1)UX ) ≤ F(p˜, ρ(k), q(k)UX ) ≤ L(p˜, q(k)X ). (a.35) is proves that the expectation–maximisation does not decrease the likelihood of the data. Note that the algorithm is not guaranteed to nd a global maximum. In practice, it oŸen does nd a useful parameter setting. 267 appendix a. standard techniques A.4 Monte Carlo e previous section considered training data, which naturally consists of a nite number of training samples, as an empirical distribution. Even when parameterised forms of distributions are available, using them directly is not always tractable. In these cases, it is oŸen necessary to produce empirical distributions from paramet- erised distributions by sampling. Methods that approximate a target density with a nite number of samples are calledMonte Carlomethods. Many Monte Carlo methods can work with unnormalised densities, which for many applications is a useful feature. Markov chainMonte Carlo, for example, divides the value of the density at two points by each other, so that any normalisation constant cancels out, and can be disregarded. 
However, the value required in this thesis, the integral of a target density over the whole of a space, is the normalisation constant of the density.e samples themselves are merely a by-product. is requirement rules outmanyMonteCarlomethods. One technique, called im- portance sampling, does return an approximation to the normalisation constant. It re- quires a (normalised) proposal distribution that samples can be drawn fromand is close in shape to the target density. Sequential importance sampling is importance sampling over a multi-dimensional space. In itself, it is just importance sampling that deals explicitly with one dimension at a time. It becomes advantageous once resampling is introduced between dimensions. is removes low-weight samples and duplicates high-weight ones, so that the samples focus on the most interesting, high-probability regions of space. Sequential sampling techniques are oŸen presented as traversing through time. ere is no reason, however, why the dimensions should represent time. In this work, dimensions will relate to elements of feature vectors and will be called just “dimen- sions”. is section follows the presentation of Doucet and Johansen (2008). It will discuss implicitly multi-dimensional Monte Carlo and importance sampling. en, resampling is introduced. Finally, section a.4.5 on page 277 discusses the case where for some dimensions it is possible to draw from the target distribution, and for some 268 a.4. monte carlo it is not. A.4.1 Plain Monte Carlo Monte Carlo methods approximate a target distribution with a nite number of sam- ples. Denote the target probability distribution with pi. If it is possible to draw L sam- ples u(l) ∼ pi, the Monte Carlo approximation of pi is the empirical distribution p˜i = 1 L ∑ l δu(l) , (a.36) where δ· denotes the Dirac delta. Using the empirical measure in the place of the target distribution, the expectation of any test functionφ under distribution pi can be approximated as Epi{φ(u)} = ∫ pi(u)φ(u)du ' ∫ p˜i(u)φ(u)du = 1 L ∑ l φ ( u(l) ) . (a.37) is equation is used in section 4.4.1 to estimate empirical means (with φ(u) = u) and second moments (with φ(u) = uuT). is straightforward Monte Carlo method requires, however, that it is possible to sample directly from the target distribution. When this is not the case, it is oŸen possible to use importance sampling, which draws samples from a proposal distribution that is close to the target distribution, and then assigns the samples weights to make up for the dišerence. A.4.2 Importance sampling If it is impossible to sample from the target distribution, it can still be possible to approximate it by sampling from a proposal distribution ρ similar to the target dis- tribution, and make up for the dišerence by weighting the samples. is is called importance sampling. e weights also give an approximation to the normalisation constant. e process is analogous to the evaluation of a function under a distribu- tion in (a.37). However, L samples u(l) ∼ ρ are drawn from the proposal distribution, 269 appendix a. standard techniques so that the empirical proposal distribution of it is, analogously to (a.36): ρ˜ = 1 L ∑ l δu(l) . (a.38) at samples are drawn from a proposal distribution and not directly from the target distribution makes it possible to use an unnormalised target density, γ.is is oŸen useful if the normalisation constant of the target density is unknown. It does need to be possible to evaluate γ at any point. 
Importance sampling nds samples from the distribution at the same time as an approximation to the normalisation con- stant. γ is a scaled version of pi: pi(u) = γ(u) Z , (a.39a) where the normalising constant is Z = ∫ γ(u)du. (a.39b) e proposal density needs to cover at least the area that the target distribution covers: pi(u) > 0⇒ ρ(u) > 0, (a.40) otherwise no samples will be drawn in some regions where pi is non-zero. e key to making up for the dišerence between proposal and target is the weight function w(u). It gives the ratio between the target density and the proposal distri- bution: w(u) = γ(u) ρ(u) . (a.41) Substitutingw(u)ρ(u) for γ(u) in (a.39a) and (a.39b), pi(u) = w(u)ρ(u) Z ; (a.42a) Z = ∫ w(u)ρ(u)du. (a.42b) 270 a.4. monte carlo Now that the target distribution has been expressed in terms of the proposal distri- bution ρ, the proposal distribution can be replaced by its empirical version ρ˜ in (a.38). is yields the empirical distribution to pi, p˜i, with the samples from ρ weighted by their importance weight: p˜i = 1 L ∑ l w ( u(l) ) Z˜ δu(l) = ∑ l w(l)δu(l) , (a.43a) where approximation to the normalisation constant is Z˜ = ∫ w(u)ρ˜(u)du = 1 L ∑ l w ( u(l) ) , (a.43b) and the normalised weightsw(l) are w(l) = w ( u(l) )∑ l ′ w ( u(l ′) ) . (a.43c) p˜i is the normalised importance sampling approximation to target distribution γ, and Z˜ is the corresponding approximation to the normalisation constant. e expectation of a test functionφ(u) underpi can be approximated analogously to (a.37), with p˜i given by (a.43a): Epi{φ(u)} = ∫ pi(u)φ(u)du ' ∫ p˜i(u)φ(u)du = 1 L ∑ l w(l)φ ( u(l) ) . (a.44) A degenerate case of importance sampling is when the proposal distribution is equal to the normalised target distribution ρ = pi. If it is possible to draw samples from pi, then using importance sampling is overkill. However, the next section will introduce sequential importance sampling, which samples from one dimension at a time. In that setting, it might be possible to draw from the target distribution for some dimensions, but not for others. In the simple importance sampling case with ρ = pi, the weight function in (a.41) always yields the normalisation constant: w(u) = γ(u) ρ(u) = Z. (a.45) 271 appendix a. standard techniques Substituting this in (a.43a), the approximation p˜i of the normalised target distribu- tion pi becomes p˜i = 1 L ∑ l w ( u(l) ) Z˜ δu(l) = 1 L ∑ l δu(l) , (a.46) which is exactly the standard Monte Carlo empirical distribution in (a.36). A.4.3 Sequential importance sampling Sequential importance sampling is an instance of importance sampling that explicitly handles a multi-dimensional sample space. It steps through the dimensions one by one, keeping track of L samples, and extending them with a new dimension at every step. Considering this explicitly is useful because then the set of samples can be ad- justed between dimensions. Section a.4.4 will discuss resampling, which drops low- probability samples and multiplies high-probability samples, so that computational ešort is focussed on high-probability regions. Sequential importance sampling is a generalisation of the well-known particle l- tering algorithm. It samples from a multi-dimensional distribution dimension by di- mension, applying principles similar to those of importance sampling at every step. e distribution must be factored into dimensions. In particle ltering, the dimen- sions are oŸen time steps, and the factor for each dimension a distribution condi- tional on previous dimensions. 
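Before turning to the per-dimension factorisation, the weighting scheme in (a.41)–(a.44) can be illustrated with a similarly minimal sketch. The unnormalised Gaussian target (with true normalisation constant 3) and the wider Gaussian proposal below are illustrative choices, not taken from this work.

```python
import numpy as np

# Minimal importance-sampling sketch for (a.41)-(a.44).
rng = np.random.default_rng(0)

def gamma(u):                       # unnormalised target: 3 * N(u; 1, 1)
    return 3.0 * np.exp(-0.5 * (u - 1.0) ** 2) / np.sqrt(2.0 * np.pi)

def rho_pdf(u):                     # normalised proposal: N(u; 0, 3^2)
    return np.exp(-0.5 * (u / 3.0) ** 2) / (3.0 * np.sqrt(2.0 * np.pi))

L = 100_000
u = rng.normal(0.0, 3.0, size=L)    # u^(l) ~ rho
w = gamma(u) / rho_pdf(u)           # importance weights, (a.41)

Z_tilde = w.mean()                  # (a.43b): approximately 3.0
w_norm = w / w.sum()                # normalised weights, (a.43c)
mean_estimate = np.sum(w_norm * u)  # (a.44) with phi(u) = u: approximately 1.0
print(Z_tilde, mean_estimate)
```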
However, for sequential importance sampling the per- dimension target distributions in sequential importance sampling need not be norm- alised or relate to valid probability distributions, as long as their product is equal to the target distribution. In section a.4.1 on page 269, the sample space was implicitly multi-dimensional. In this section, the the dimensions of the samples will be explicitly written.e space has d dimensions.us, u , u1:d.e distributions γ, pi, and ρ will be factorised, as will Z andw. To apply sequential importance sampling, itmust be possible to factorise the target 272 a.4. monte carlo density γ into factors γi(·|·) for each dimension. γi(·|·) is therefore dened as γi(ui|u1:i−1) = γi(u1:i) γi−1(u1:i−1) . (a.47a) If for every i, γi(u1:i) is amarginal distribution ofu1:i, then (a.47a) is an instantiation of Bayes’ rule and γi(ui|u1:i−1) is a conditional distribution. However, even though the notation used is(·|·), there is no requirement for the factors to be conditionals or to be normalised. is is a generalisation of particle ltering, and indeed, a strength of sequential importance sampling. e target densityγ can be written as the product of factorsγi. (a.47b) formulates the factorisation of γ recursively; (a.47c) writes out the recursion: γ(u) = γd(u1:d) = γd−1(u1:d−1)γd(ud|u1:d−1) (a.47b) = γ1(u1) d∏ i=2 γi(ui|u1:i−1) . (a.47c) e normalised variant of γi will be called pii and dened analogous to pi in the previous section: pii(u1:i) = γi(u1:i) Zi ; (a.48a) Zi = ∫ γi(u1:i)du1:i. (a.48b) e proposal distribution ρ is factorised similarly to the target distribution: ρi(ui|u1:i−1) = ρi(u1:i) ρi−1(u1:i−1) ; (a.49a) ρ(u) = ρd(u1:d) = ρd−1(u1:d−1)ρd(ud|u1:d−1) = ρ1(u1) d∏ i=2 ρi(ui|u1:i−1) . (a.49b) is makes it possible to draw samples u(l)1:d dimension per dimension. For the rst dimension, u(l)1 ∼ ρ1. en, u (l) i ∣∣u(l)1:i−1 ∼ ρi for dimensions i = 2, . . . , d. Each proposal factor ρi approximates target factor γi. Computing the importance weight of a sample can also be done dimension per dimension. Decomposing the weight function in (a.41) recursively, in factors wi(·|·) 273 appendix a. standard techniques related to γi(·|·) and ρi(·|·): wi(ui|u1:i−1) = wi(u1:i) wi−1(u1:i−1) = γi(u1:i)ρi−1(u1:i−1) ρi(u1:i)γi−1(u1:i−1) = γi(ui|u1:i−1) ρi(ui|u1:i−1) ; (a.50a) w(u) = wd(u1:d) = wd−1(u1:d−1)wd(ud|u1:d−1) = w1(u1) d∏ i=2 wi(ui|u1:i−1) . (a.50b) e empirical distribution of ρi then is found from L samples drawn from ρi, analogously to the approximation of ρ˜ in (a.38): ρ˜i = 1 L ∑ l δ u (l) 1:i . (a.51) Using this empirical distribution, the empirical normalisation constant and the nor- malised weights are Z˜i = ∫ wi(u1:i)ρ˜i(u1:i)du1:i = 1 L ∑ l wi ( u (l) 1:i ) ; (a.52a) w (l) i = wi ( u (l) 1:i )∑ l ′ wi ( u (l ′) 1:i ) . (a.52b) e empirical distribution derived from pii then is, analogously to (a.43a), p˜ii = ∑ l w (l) i δu(l)1:i . (a.53) e complete algorithm for sequential importance sampling is described in Al- gorithm 5. e nal normalisation constant could be approximated as the average of the weights in the last step, using (a.43b). However, in section a.4.4 resampling will be in- troduced.is at every step removes some samples and duplicates others, and overall weights will not be available. To overcome this problem, the normalising constant Z of γ, dened in (a.39b), can be factorised into terms Zi/Zi−1, which can be approx- imated at every step: Z , Zd = Zd−1 Zd Zd−1 = Z1 d∏ i=2 Zi Zi−1 . (a.54) 274 a.4. 
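As a concrete illustration of the recursion in (a.47)–(a.52), before the general procedure is stated in Algorithm 5, the following sketch runs sequential importance sampling on an illustrative two-dimensional unnormalised target (true normalisation constant 2) with independent Gaussian proposal factors. Without resampling, this is simply importance sampling performed one dimension at a time; all densities below are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 50_000

def gauss(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Illustrative unnormalised target, factorised per dimension as in (a.47c):
# gamma(u) = gamma_1(u1) * gamma_2(u2 | u1), with true Z = 2.
def gamma_1(u1):
    return 2.0 * gauss(u1, 0.0, 1.0)

def gamma_2(u2, u1):
    return gauss(u2, 0.5 * u1, 1.0)

# Independent Gaussian proposal factors rho_i (an illustrative choice).
# Dimension 1: draw u1 and compute the weights w_1 = gamma_1 / rho_1, (a.50a).
u1 = rng.normal(0.0, 2.0, size=L)
w = gamma_1(u1) / gauss(u1, 0.0, 2.0)

# Dimension 2: extend each sample and multiply in the incremental weight.
u2 = rng.normal(0.0, 2.0, size=L)
w = w * gamma_2(u2, u1) / gauss(u2, 0.0, 2.0)

Z_tilde = w.mean()          # (a.43b) applied to the final weights: approx. 2.0
print(Z_tilde)
```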
monte carlo procedure Sequential-importance-sampling(γ, ρ) for dimension i = 1 . . . d do for sample index l = 1 . . . L do Sample u(l)i ∼ ρi ( ui ∣∣u(l)1:i−1); Compute weightwi ( u (l) 1:i ) = wi−1 ( u (l) 1:i−1 )γi(u(l)i |u(l)1:i−1) ρi(u (l) i |u (l) 1:i−1) . return weighted samples { wd ( u (l) 1:i ) ,u (l) 1:d } . Algorithm 5 Sequential importance sampling e fraction Zi/Zi−1 can be written in terms of the normalised density for dimen- sion i− 1 and the proposal distribution and the weight function for dimension i (ap- plying (a.48a), (a.50a), and (a.49a)) as Zi Zi−1 = ∫ γi(u1:i)du1:i∫ γi−1(u1:i−1)du1:i−1 = ∫ γi−1(u1:i−1)γi(ui|u1:i−1)du1:i∫ γi−1(u1:i−1)du1:i−1 = ∫ pii−1(u1:i−1)wi(ui|u1:i−1) ρi(ui|u1:i−1)du1:i = ∫ pii−1(u1:i−1) ρi−1(u1:i−1) wi(ui|u1:i−1) ρi(u1:i)du1:i. (a.55) is can then be approximated at every step i using the empirical distribution ρ˜i from (a.51): Z˜i Zi−1 = 1 L L∑ l=1 w (l) 1:i−1wi ( u (l) i ∣∣u(l)1:i−1). (a.56) It is straightforward to see that this yields a consistent estimate of Zi/Zi−1. Note that as long as the samples are not resampled, this computation yields the exact same Z˜d as (a.43b). A.4.4 Resampling A problem with importance sampling is that some samples will be in low-probability regions. As the number of dimensions grows, the number of high-probability samples tends to shrink exponentially. As a measure for this problem, the variance of the sam- ple weights is oŸen used. Sequential importance sampling as presented so far does not do anything to produce lower variances. 275 appendix a. standard techniques A technique that does help to focus on higher-probability regions, and therefore does produce lower-variance weights is resampling. Resampling can be applied at every step to nd a new empirical measure with unweighted samples from a set of weighted samples. e unweighted samples can bewritten as a set ofweights and samples: { w (l) i ,u (l) 1:i } . e empirical distribution for dimensions 1 . . . i was given in (a.53): p˜ii = L∑ l=1 w (l) i δu(l)1:i . (a.57) Resampling aims to nd an approximation to this distribution with unweighted sam- ples. e conceptually simplest way of doing this is to draw L samples from p˜ii and construct a new empirical distribution pˆii = L∑ l=1 N (l) i L δ u (l) 1:i , (a.58) whereN(l)i is the number of times sampleu (l) 1:i was drawn from p˜ii, the (integer) num- ber of ošspring of sample u(l)1:i . is is called multinomial resampling. However, the only requirement to a resampling method is that the expected value of the number of ošspring of a sample is proportional to its weight: E{N(l)i } = Lw(l)i . Another way of generatingN(l)1:i uses systematic resampling (Kitagawa 1996).is uses uniformly distributed z ∼ Unif [ 0, 1L ] . New sample l ′ then is set equal to original sample l where ∑l−1 j=1w (j) i ≤ z+ l ′ < ∑l j=1w (j) i . e new empirical distribution can also be described as a list of unweighted sam- ples { 1 L , uˆ (l) 1:i } that containsN(l)i copies of original sample u (l) 1:i .e distribution then is pˆii = 1 L L∑ l=1 δ uˆ (l) 1:i . (a.59) is makes it straightforward to introduce resampling at every step of the sequential importance sampling algorithm as described in section a.4.3. AŸer drawing samples and computing their weights, the set of samples is resampled to yield equally weighted 276 a.4. monte carlo procedure Sequential-importance-resampling(γ, ρ) for dimension i = 1 . . . d do for sample index l = 1 . . . 
L do Sample u(l)i ∼ ρi ( ui ∣∣uˆ(l)1:i−1); Compute incremental weightwi ( u (l) i ∣∣u(l)1:i−1) = γi(u(l)i |u(l)1:i−1)ρi(u(l)i |u(l)1:i−1) . Compute Z˜iZi−1 = 1 L ∑ lwi ( u (l) i ∣∣u(l)1:i−1).{ uˆ (l) 1:i } ← Resample({wi(u(l)i ∣∣u(l)1:i−1),u(l)1:i}). Compute Z˜ = Z˜1 ∑d i=2 Z˜i Zi−1 . return ({ uˆ (l) 1:d } , Z˜ ) . Algorithm 6 Sequential importance resampling samples { 1 L , uˆ (l) 1:i } . is set is then used when drawing samples for the next itera- tion. e complete algorithm for sequential importance resampling is described in Algorithm 6. A.4.5 Sampling from the target distribution An extension of sequential importance resampling was foreshadowed in (a.46). It concerns the case where for some dimensions it is possible to sample from the norm- alised target distribution pii(ui|u1:i−1). For such a dimension i, the proposal distri- bution can be set to ρi(ui|u1:i−1) = pii(ui|u1:i−1). e incremental weight function from (a.50a) then returns the same value, the incremental normalisation constant, for each point: wi(ui|u1:i−1) = γi(ui|u1:i−1) ρi(ui|u1:i−1) = Zi Zi−1 . (a.60) is is useful because it removes the need to compute the density of the distribution at any point, or to resample the set of samples. In the case where γi = pii = ρi, wi(ui|u1:i−1) = 1 by denition. Algorithm 7 on the next page species how sequential importance resampling, in algorithm 6, can be extended to deal with dimensions for which it is possible to sample 277 appendix a. standard techniques procedureHybrid-sequential-importance-resampling(γ, ρ) for dimension i = 1 . . . d do if ρi ( ui ∣∣uˆ(l)1:i−1) ∝ γi(ui∣∣uˆ(l)1:i−1) then for sample index l = 1 . . . L do Sample uˆ(l)i ∼ ρi ( ui ∣∣uˆ(l)1:i−1); Set Z˜iZi−1 = wi ( ui ∣∣u1:i−1) (for any u1:i−1). else for sample index l = 1 . . . L do Sample u(l)i ∼ ρi ( ui ∣∣uˆ(l)1:i−1); Compute incremental weightwi ( u (l) i ∣∣u(l)1:i−1) = γi(u(l)i |u(l)1:i−1)ρi(u(l)i |u(l)1:i−1) . Compute Z˜iZi−1 = 1 L ∑ lwi ( u (l) i ∣∣u(l)1:i−1).{ uˆ (l) 1:i } ← Resample({wi(u(l)1:i)}). Compute Z˜ = Z˜1 ∑d i=2 Z˜i Zi−1 . return ({ uˆ (l) 1:d } , Z˜ ) . Algorithm 7Hybrid sequential importance resampling from the target distribution, but its density at a point cannot be computed. 278 Appendix B Derivation of linear transformations Section 3.2 has discussed linear transformations for adaptation. Section 6.2 has in- troduced versions of the same transformations trained on predicted statistics. eir derivations run parallel.e following sections will highlight this by deriving the stat- istics for both adaptive and predictive transformations of each form side by side.ey will discuss, in order, cmllr, covariancemllr, and semi-tied covariance matrices, all with their predictive variants. B.1 CMLLR e likelihood for cmllr is (repeated from (3.4b)) q(m)(y|A) = |A| · N (Ay+ b; µ(m)x , Σ(m)x ). (b.1) To express the optimisation, it is convenient to write the a›ne transformation of the observation vectors with one matrixW by appending a 1 to the observation to form 279 appendix b. derivation of linear transformations vector ζ: Ay+ b = [ A b ]y 1  ,Wζ. (b.2) B.1.1 Adaptive e optimisation procedure implements an iterative approximation themaximisation in (3.2).is requires the log-likelihood, which is the logarithm of 3.4b, and its deriv- ative with respect to the transform: logq(m)(y|A) = log|A|− 12 log ∣∣2piΣ(m)x ∣∣ − 12 ( Wζ− µ (m) x )T Σ (m) x −1( Wζ− µ (m) x ) ; (b.3a) ∂ logq(m)(y|A) ∂W = [ A−T 0 ] + Σ (m) x −1( µ (m) x −Wζ ) ζT. (b.3b) e transformation will be optimised row by row. 
e derivative of row i ofW is (assuming the covariance matrix is diagonal) ∂ logq(m)(y|A) ∂wi = [[ A−T ] i 0 ] + 1 σ (m) x,ii ( µ (m) x,i −wiζ ) ζT, (b.3c) where [ A−T ] i is the ith row of the transposed inverse ofA. is expression can be substituted in the maximisation step in (3.2): ∂ ∫ p˜(Y)∑m∑TYt=1 γ(m)t logq(m)(yt|A)dY ∂wi = ∫ p˜(Y) ∑ m TY∑ t=1 γ (m) t ∂ logq(m)(yt|A) ∂wi dY = ∫ p˜(Y) ∑ m TY∑ t=1 γ (m) t ([[ A−T ] i 0 ] + 1 σ (m) x,ii ( µ (m) x,i −wiζt ) ζTt ) dY = γ [[ A−T ] i 0 ] + k(i) −wiG (i), (b.4) where γ, k(i), andG(i) are statistics from the adaptation data: γ , ∫ p˜(Y) ∑ m TY∑ t=1 γ (m) t dY ; (b.5a) 280 b.1 . cmllr k(i) , ∫ p˜(Y) ∑ m µ (m) x,i σ (m) x,ii TY∑ t=1 γ (m) t ζ T t dY ; (b.5b) G(i) , ∫ p˜(Y) ∑ m 1 σ (m) x,ii TY∑ t=1 γ (m) t ζtζ T t dY. (b.5c) To maximise the log-likelihood, (b.4) should be set to zero for all rows ofW at once. is is not in general possible, so the algorithm has to resort to updating the rows ofW iteratively. Dene P to be the cofactor matrix of A with an extra column 0 appended. e rst term in (b.4) can be written in terms ofwi and row i of P, pi:[[ A−T ] i 0 ] = wi ·|A|−1 = wi · ( piw T i )−1 (b.6) e row update is given in Gales (1998a): wi := ( ηpi + k (i) ) G(i) −1 , (b.7a) where η is a solution of the quadratic expression η2piG (i)−1pTi + ηpiG (i)−1k(i) T − γ = 0. (b.7b) is row update is applied iteratively. It optimises the likelihood given the current settings of the other rows, i.e. given the current setting ofpi.erefore, the likelihood is guaranteed not to decrease, and the overall process is an instantiation of generalised expectation–maximisation. B.1.2 Predictive As for adaptive cmllr, the optimisation is performed row by row. e derivative of the log-likelihood for the output distribution is the same as in (b.3c). is can be substituted into the derivative of the maximand for general predictive linear trans- formations from (6.18): ∂ ∑ m γ (m) ∫ p(m)(y) logq(m)(y|A)dy ∂wi (b.8) 281 appendix b. derivation of linear transformations = ∑ m γ(m) ∫ p(m)(y) ∂ logq(m)(y|A) ∂wi dy (b.9) = ∑ m γ(m) ∫ p(m)(y) ([[ A−T ] i 0 ] + 1 σ (m) x,ii ( µ (m) x,i −wiζ ) ζT ) dy (b.10) = ∑ m γ(m) [[ A−T ] i 0 ] + ∑ m γ(m)µ (m) x,i σ (m) x,ii ∫ p(m)(y)ζTdy − ∑ m γ(m) σ (m) x,ii ∫ p(m)(y)wiζζ Tdy (b.11) = γ [[ A−T ] i 0 ] + k(i) −wiG (i), (b.12) where γ, k(i), andG(i) are predicted statistics: γ , ∑ m γ(m); (b.13a) k(i) , ∑ m γ(m)µ (m) x,i σ (m) x,ii ∫ p(m)(y)ζTdy = ∑ m γ(m)µ (m) x,i σ (m) x,ii Ep(m) { ζT } ; (b.13b) G(i) , ∑ m γ(m) σ (m) x,ii ∫ p(m)(y)ζζTdy = ∑ m γ(m) σ (m) x,ii Ep(m) { ζζT } , (b.13c) where Ep(m){·} is the expectation under the predicted distribution p(m) for compon- entm. B.2 Covariance MLLR e likelihood expression for covariance mllr is (repeated from (3.6b)) q(m)(y|A) = |A| · N (Ay; Aµ(m)x , Σ(m)x ). (b.14) 282 b.2. covariance mllr B.2.1 Adaptive e optimisation of the transformation requires the log-likelihood and its derivative with respect toA: logq(m)(y|A) = log|A|− 12 log ∣∣2piΣ(m)x ∣∣ − 12 ( A ( y− µ (m) x ))T Σ (m) x −1( A ( y− µ (m) x )) ; (b.15a) ∂ logq(m)(y|A) ∂A = A−T − Σ (m) x −1 A ( y− µ (m) x )( y− µ (m) x )T . (b.15b) e transformation will be optimised row by row. e derivative of row i of A is (assuming the covariance matrix Σ(m)x is diagonal) ∂ logq(m)(y|A) ∂ai = [ A−T ] i − 1 σ (m) x,ii ai ( y− µ (m) x )( y− µ (m) x )T . 
(b.15c) is expression can be substituted into the maximand in (3.2): ∂ ∫ p˜(Y)∑m∑TYt=1 γ(m)t logq(m)(yt|A)dY ∂ai = ∫ p˜(Y) ∑ m TY∑ t=1 γ (m) t ∂ logq(m)(yt|A) ∂ai dY = ∫ p˜(Y) ∑ m TY∑ t=1 γ (m) t ([ A−T ] i − 1 σ (m) x,ii ai ( yt − µ (m) x )( yt − µ (m) x )T) dY = γ [ A−T ] i − aiG (i), (b.16) where γ andG(i) are statistics from the adaptation data: γ , ∫ p˜(Y) ∑ m TY∑ t=1 γ (m) t dY ; (b.17a) G(i) , ∫ p˜(Y) ∑ m 1 σ (m) x,ii TY∑ t=1 γ (m) t ( yt − µ (m) x )( yt − µ (m) x )T dY. (b.17b) For covariance mllr, γ is the same as for cmllr (in (3.5a)); G(i) is similar but uses yt − µ (m) x instead of yt. To maximise the log-likelihood, (b.4) should be set to zero for all rows of A at once. Just like for cmllr, this is not in general possible, so the algorithm has to resort 283 appendix b. derivation of linear transformations to updating the rows ofA iteratively. Each row is set to optimise the likelihood giving the current settings of the other rows.e details of the algorithm are inGales (1998a). e likelihood is guaranteed not to decrease, and the overall process is an instantiation of generalised expectation–maximisation, where themaximisation step is an iteration over the rows. B.2.2 Predictive e optimisation works row by row. e partial derivative of the log-likelihood was given in (b.15c).is can be substituted into the derivative of themaximand for general predictive linear transformations from (6.18): ∂ ∑ m γ (m) ∫ p(m)(y) logq(m)(y|A)dy ∂ai (b.18) = ∑ m γ(m) ∫ p(m)(y) ∂ logq(m)(y|A) ∂ai dy (b.19) = ∑ m γ(m) ∫ p(m)(y) ([ A−T ] i − 1 σ (m) x,ii ai ( y− µ (m) x )( y− µ (m) x )T) dy (b.20) = γ [ A−T ] i − aiG (i), (b.21) where γ andG(i) are predicted statistics: γ , ∑ m γ(m); (b.22a) G(i) , ∑ m γ(m) σ (m) x,ii ∫ p(m)(y) ( y− µ (m) x )( y− µ (m) x )T dy = ∑ m γ(m) σ (m) x,ii Ep(m) {( y− µ (m) x )( y− µ (m) x )T} . (b.22b) B.3 Semi-tied covariance matrices e expression for the likelihood for semi-tied covariancematrices is (repeated from (3.12b)) q(m)(x) = |A| · N (Ax; Aµ(m)x , Σ˜(m)x,diag). (b.23) 284 b.3. semi-tied covariance matrices B.3.1 From data e log-likelihood expression and its derivative both with respect to the component covariances and to the rows of the transformation matrix (the optimisation again is row-wise, similar to (b.15c)) are necessary: logq(m)(x) = log|A|− 12 log ∣∣2piΣ˜(m)x,diag∣∣ − 12 ( A ( x− µ (m) x ))T[ Σ˜ (m) x,diag ]−1( A ( x− µ (m) x )) ; (b.24a) ∂ logq(m)(x) ∂Σ˜ (m) x,diag = −12 [ Σ˜ (m) x,diag ]−1( I−A ( x− µ (m) x )( x− µ (m) x )T AT [ Σ˜ (m) x,diag ]−1) ; (b.24b) ∂ logq(m)(x) ∂A = A−T − Σ (m) x −1 A ( x− µ (m) x )( x− µ (m) x )T ; (b.24c) ∂ logq(m)(x) ∂ai = [ A−T ] i − 1 σ˜ (m) x,ii ai ( x− µ (m) x )( x− µ (m) x )T . (b.24d) ese expressions can be substituted into the maximand in (2.32). For the optim- isation of Σ˜(m)x,diag, 2 ∂ ∫ p˜(X )∑m∑TXt=1 γ(m)t logq(m)(xt)dX ∂Σ˜ (m) x,diag (b.25) = ∫ p˜(X ) TX∑ t=1 γ (m) t [ Σ˜ (m) x,diag ]−1( A ( xt − µ (m) x )( xt − µ (m) x )T AT [ Σ˜ (m) x,diag ]−1 − I ) dX (b.26) = γ(m) [ Σ˜ (m) x,diag ]−1( AW(m)AT [ Σ˜ (m) x,diag ]−1 − I ) . (b.27) Here, γ(m)t and γ(m) are found from the training data as in (2.31), andW(m) contains statistics from training observations. Only the diagonal elements of Σ˜(m)x,diag are estim- ated. However sinceW(m), the empirical full covariance for one component m, is transformed byA (which changes at every iteration), it must be full: W(m) , 1 γ(m) ∫ p˜(X ) TX∑ t=1 γ (m) t ( xt − µ (m) x )( xt − µ (m) x )T dX . (b.28) 285 appendix b. 
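The row-by-row optimisation that recurs throughout this appendix can be made concrete with a sketch of the cmllr row update (b.7). This is a minimal sketch that assumes the statistics γ, k^(i) and G^(i) of (b.5) (or the predicted statistics of (b.13)) have already been accumulated; selecting the root of (b.7b) by comparing the auxiliary-function values of both candidate rows follows Gales (1998a).

```python
import numpy as np

def cmllr_row_update(W, i, k_i, G_i, gamma_total):
    """One application of the row update (b.7) to row i of W = [A b].

    k_i, G_i and gamma_total are the statistics in (b.5) (or the predicted
    statistics in (b.13)); they are assumed to have been accumulated already.
    """
    d = W.shape[0]
    A = W[:, :d]
    # Extended cofactor row p_i: row i of det(A) * A^{-T}, with a 0 appended
    # (cf. (b.6)).
    cof = np.linalg.det(A) * np.linalg.inv(A).T
    p_i = np.append(cof[i], 0.0)

    G_inv = np.linalg.inv(G_i)
    # Quadratic (b.7b):  a * eta^2 + b * eta - gamma_total = 0.
    a = p_i @ G_inv @ p_i
    b = p_i @ G_inv @ k_i
    etas = np.real(np.roots([a, b, -gamma_total]))

    # Both roots are real; pick the one with the larger auxiliary value
    # (root selection as in Gales 1998a).
    def aux(w_row):
        return (gamma_total * np.log(np.abs(p_i @ w_row))
                + w_row @ k_i - 0.5 * w_row @ G_i @ w_row)

    candidates = [(eta * p_i + k_i) @ G_inv for eta in etas]   # (b.7a)
    W[i] = max(candidates, key=aux)
    return W
```

In use, this update is applied to each row in turn, recomputing the cofactor row after every change, so that the overall procedure is an instance of generalised expectation–maximisation, as noted above.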
derivation of linear transformations e optimisation with respect toA is very similar to the one for covariance mllr in (b.16): ∂ ∫ p˜(X )∑m∑TXt=1 γ(m)t logq(m)(xt)dX ∂ai (b.29) = ∫ p˜(X ) ∑ m TX∑ t=1 γ (m) t ∂ logq(m)(xt) ∂ai dX (b.30) = ∫ p˜(X ) ∑ m TX∑ t=1 γ (m) t ([ A−T ] i − 1 σ˜ (m) x,ii ai ( xt − µ (m) x )( xt − µ (m) x )T) dX (b.31) = γ [ A−T ] i − aiG (i). (b.32) Estimating A requires two types of statistics, γ and G(i). γ, the total occupancy, is found in a similar way as in (b.17a), but from the training data. Since the procedure for estimating the covariance transformation is the same as for covariance mllr, the statistics are also basically the same.e dišerence is that for covariancemllr, the co- variance Σ(m)x of the components is xed, whereas for semi-tied covariance matrices, the covariances Σ˜(m)x,diag get re-estimated every iteration. Fixed σ (m) x,ii in (b.17b) there- fore is replaced by the diagonal elements of Σ˜(m)x,diag, σ˜ (m) x,ii . G (i) is rewritten in terms of the part that does not change with every iteration,W(m) in (b.28): γ , ∫ p˜(X ) ∑ m TX∑ t=1 γ (m) t dX ; (b.33a) G(i) , ∫ p˜(X ) ∑ m 1 σ˜ (m) x,ii TX∑ t=1 γ (m) t ( xt − µ (m) x )( xt − µ (m) x )T dX = ∑ m γ(m) σ˜ (m) x,ii W(m). (b.33b) B.3.2 Predictive Unlike non-predictive semi-tied covariance matrices, the predictive variant denes a distribution the corrupted speech, similar to (b.23) (repeated from (6.24)): q(m)(y) = |A| · N (Ay; Aµ(m)y , Σ˜(m)y,diag) (b.34) 286 b.3. semi-tied covariance matrices e optimisation of the means, the covariances, and the transformation need the derivatives of the log-likelihood with respect to them.ey are similar to (b.24) (the derivative with respect to the means was not given there): logq(m)(y) = log|A|− 12 log ∣∣2piΣ˜(m)y,diag∣∣ − 12 ( A ( y− µ (m) y ))T[ Σ˜ (m) y,diag ]−1( A ( y− µ (m) y )) ; (b.35a) ∂ logq(m)(y) ∂µ (m) y = 2 [ Σ˜ (m) y,diag ]−1 A ( y− µ (m) y ) ; (b.35b) ∂ logq(m)(y) ∂Σ˜ (m) y,diag = −12 [ Σ˜ (m) y,diag ]−1( I−A ( y− µ (m) y )( y− µ (m) y )T AT [ Σ˜ (m) y,diag ]−1) ; (b.35c) ∂ logq(m)(y) ∂ai = [ A−T ] i − 1 σ˜ (m) y,ii ai ( y− µ (m) y )( y− µ (m) y )T . (b.35d) ese can be substituted into the derivative of themaximand for general predictive linear transformations from (6.18). ∂ ∑ m γ (m) ∫ p(m)(y) logq(m)(y|A)dy ∂µ (m) y = 2 ∑ m γ(m) [ Σ˜ (m) y,diag ]−1 A ∫ p(m)(y) ( y− µ (m) y ) dy. (b.36) To minimise the kl divergence with respect to the means, they are unsurprisingly set to the expected value under the predicted distribution for componentm: µ (m) y := ∫ p(m)(y)ydy = Ep(m){y} . (b.37) Since this expression does not depend on the other variables that are estimated, setting the means is a one-shot process. e derivative with respect to the covariance of componentm is 2 ∂ ∑ m γ (m) ∫ p(m)(y) logq(m)(y)dy ∂Σ˜ (m) y,diag = γ(m) [ Σ˜ (m) y,diag ]−1 ∫ p(m)(y) ( A ( y− µ (m) y )( y− µ (m) y )T AT [ Σ˜ (m) y,diag ]−1 − I ) dy = γ(m) [ Σ˜ (m) y,diag ]−1( AW(m)AT [ Σ˜ (m) y,diag ]−1 − I ) , (b.38) 287 appendix b. derivation of linear transformations where the predicted covariance in the original feature space for componentm is W(m) , ∫ p(m)(y) ( y− µ (m) y )( y− µ (m) y )T dy = Ep(m) {( y− µ (m) y )( y− µ (m) y )T} . (b.39) is is the equivalent ofW(m) for standard semi-tied covariance matrices, but estim- ated on the predicted distribution rather than an empirical one.ough the compon- ent covariance Σ˜(m)y,diag is constrained to be diagonal, the statisticsW (m) have to be full, because they are transformed by A. 
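The interleaved estimation for semi-tied covariance matrices can similarly be sketched. The following assumes the per-component statistics W^(m) of (b.28) (or their predicted counterparts in (b.39)) and the occupancies γ^(m) are given; the diagonal covariance update is the one that (b.40), immediately below, makes explicit for the predictive case, and the closed-form row update for A is taken from Gales (1998a) rather than from this appendix, so it should be read as an assumption.

```python
import numpy as np

def semi_tied_update(A, W_list, gammas, n_iters=10, n_row_iters=5):
    """Sketch of the interleaved estimation of sections b.3.1 / b.3.2.

    W_list[m] is the full per-component statistic W^(m) and gammas[m] the
    component occupancy; both are assumed precomputed.  A is a square float
    array.  Returns the transform A and the diagonal component variances.
    """
    d = A.shape[0]
    gamma_total = sum(gammas)
    for _ in range(n_iters):
        # Diagonal component covariances: diag(A W^(m) A^T), cf. (b.27)/(b.40).
        sigmas = [np.diag(A @ W_m @ A.T) for W_m in W_list]

        # Row-wise updates of A given the current covariances, using the
        # statistics G^(i) of (b.33b).
        for _ in range(n_row_iters):
            for i in range(d):
                G_i = sum(g / s[i] * W_m
                          for g, s, W_m in zip(gammas, sigmas, W_list))
                G_inv = np.linalg.inv(G_i)
                # Cofactor row of A: row i of det(A) * A^{-T}.
                c_i = (np.linalg.det(A) * np.linalg.inv(A).T)[i]
                # Closed-form row update (Gales 1998a); an assumption here,
                # since the appendix defers the formula to that reference.
                A[i] = c_i @ G_inv * np.sqrt(
                    gamma_total / (c_i @ G_inv @ c_i))
    return A, sigmas
```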
By setting the bracketed expression in (b.27) to zero, the optimum setting for the diagonal elements of Σ˜(m)y,diag is found with Σ˜ (m) y,diag := diag ( AW(m)AT ) . (b.40) Unlike the expression for the means in (b.37), this expression depends onA, which in turn depends on Σ˜(m)y,diag.erefore, the procedure will be iterative. e optimisation of A is performed row-by-row. Again, taking the derivative of the kl divergence, ∂ ∑ m γ (m) ∫ p(m)(y) logq(m)(y)dy ∂ai = ∑ m γ(m) ∫ p(m)(y) ([ A−T ] i − 1 σ˜ (m) y,ii ai ( y− µ (m) y )( y− µ (m) y )T) dy = γ [ A−T ] i − aiG (i), (b.41) where γ andG(i) are the predicted statistics: γ , ∑ m γ(m); (b.42a) G(i) , ∑ m γ(m) σ˜ (m) y,ii ∫ p(m)(y) ( y− µ (m) y )( y− µ (m) y )T dy = ∑ m γ(m) σ˜ (m) y,ii W(m). (b.42b) As in standard semi-tied covariance matrices (in (b.33b)), G(i) can be expressed in terms ofW(m), which do not change with every iteration. 288 Appendix C Mismatch function is section gives derivations relating to the mismatch function that would have con- fused the main text. Section c.1 gives the Jacobians with respect to the sources, which are required for vts compensation. Section c.2 derives the mismatch function and its derivatives for dišerent powers of the spectrum. C.1 Jacobians emismatch function relates log-spectral coe›cients of the observation, speech, and noise with (repeated from (4.9)) exp ( yi ) = exp ( xi + hi ) + exp ( ni ) + 2αi exp 12 ( xi + hi + ni ) . (c.1) 289 appendix c. mismatch function e per-coe›cient derivatives of this are dyi dxi = exp(xi + hi) + αi exp ( 1 2 ( xi + hi + ni )) exp(xi + hi) + exp ( ni ) + 2αi exp ( 1 2 ( xi + hi + ni )) ; (c.2a) dyi dhi = dyi dxi ; (c.2b) dyi dni = exp ( ni ) + αi exp ( 1 2 ( xi + hi + ni )) exp(xi + hi) + exp ( ni ) + 2αi exp ( 1 2 ( xi + hi + ni )) = 1− dyi dxi ; (c.2c) dyi dαi = β exp ( 1 2 ( xi + hi + ni )) exp(xi + hi) + exp ( ni ) + 2αi exp ( 1 2 ( xi + hi + ni )) . (c.2d) e Jacobians of the vector mismatch function in the log-spectral domain are Jlogx = ∂ylog ∂xlog ; Jlogn = ∂ylog ∂nlog ; Jlogh = ∂ylog ∂hlog ; Jlogα = ∂ylog ∂α . (c.3) Since the mismatch function applies per dimension in the log-spectral domain, the Jacobians of the vector mismatch function in this domain are diagonal.ey have as diagonal entries the derivatives given in (c.2): j log x,i = dyi dxi ; jlogn,i = dyi dni ; jlogh,i = dyi dhi ; jlogα,i = dyi dαi . (c.4) Cepstral features are related to log-spectral through the dctmatrix C.e cepstral- domain can be found through the chain rule. Jx = ∂ys ∂xs = ∂ys ∂ylog ∂ylog ∂xlog ∂xlog ∂xs = CJlogx C −1, (c.5a) and analogously Jn = ∂ys ∂ns = CJlogn C −1; Jh = ∂ys ∂hs = CJlogh C −1; Jα = ∂ys ∂α = CJlogα . (c.5b) C.2 Mismatch function for other spectral powers e mismatch in the log-spectral domain was given in (4.9). It assumed that the fea- tures yi, xi, ni used the power spectrum.is section will write the power β applied 290 c.2. mismatch function for other spectral powers to the spectral coe›cients explicitly as y(β)i , x (β) i , n (β) i , so that (4.9) becomes: exp ( y (2) i ) = exp ( x (2) i + h (2) i ) + exp ( n (2) i ) + 2αi exp ( 1 2 ( x (2) i + h (2) i + n (2) i )) . (c.6) e expression for the mismatch relating vectors in domains with dišerent powers than 2 derives from this using an assumption about the mel-ltered spectrum. e assumption is the same that was used to approximate the convolutional noise in (4.7), namely that all spectral coe›cients in one mel-bin are equal. 
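Before continuing with the βth-power generalisation, the per-coefficient mismatch function (c.1) and its derivatives (c.2) can be sketched directly. The sketch below uses arbitrary values for one log-spectral coefficient and checks the analytic derivative (c.2a) against a finite difference.

```python
import numpy as np

def corrupted_log_spectrum(x, h, n, alpha):
    """Per-coefficient mismatch function (c.1), in the log-spectral domain."""
    return np.log(np.exp(x + h) + np.exp(n)
                  + 2.0 * alpha * np.exp(0.5 * (x + h + n)))

def dy_dx(x, h, n, alpha):
    """Derivative (c.2a); by (c.2c), dy/dn = 1 - dy/dx."""
    cross = alpha * np.exp(0.5 * (x + h + n))
    denom = np.exp(x + h) + np.exp(n) + 2.0 * cross
    return (np.exp(x + h) + cross) / denom

# Arbitrary illustrative values for one log-spectral coefficient.
x, h, n, alpha = 2.0, 0.1, 1.0, 0.5
y = corrupted_log_spectrum(x, h, n, alpha)

# Finite-difference check of (c.2a) (eps is an arbitrary small step).
eps = 1e-6
numeric = (corrupted_log_spectrum(x + eps, h, n, alpha) - y) / eps
print(y, dy_dx(x, h, n, alpha), numeric)   # analytic and numeric should agree
```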
In that case, a mel- ltered spectral coe›cient is a weighted sum of spectral coe›cients to the power of β (see (4.8a)), is equal to the power of the sum: Y¯ (β) i = ∑ k wik|Y[k]|β ' (∑ k wik|Y[k]| )β . (c.7) e log-spectral coe›cients are found by taking the logarithm of this, so that y (β) i = log ( Y¯ (β) i ) ' β log(∑ k wik|Y[k]| ) . (c.8) is assumption can be applied to all feature vectors. It causes coe›cients acquired from the βth-power domain to be assumed related to those using a power of 2 by y (β) i = β 2y (2) i ; x (β) i = β 2 x (2) i ; n (β) i = β 2n (2) i ; h (β) i = β 2h (2) i . (c.9) For the log-magnitude-spectrum (β = 1), for example, coe›cients yi, xi, ni, are smaller by a factor of 2.erefore, (c.6) can be generalised to any power β by making up for the power: exp ( 2 βy (β) i ) = exp ( 2 β ( x (β) i + h (β) i )) + exp ( 2 βn (β) i ) + 2αi exp ( 1 β ( x (β) i + h (β) i + n (β) i )) , (c.10a) or y (β) i = β 2 log ( exp ( 2 β ( x (β) i + h (β) i )) + exp ( 2 βn (β) i ) + 2αi exp ( 1 β ( x (β) i + h (β) i + n (β) i ))) . (c.10b) 291 appendix c. mismatch function Derivatives of this are dy (β) i dx (β) i = exp ( 2 β ( x (β) i + h (β) i )) + αi exp ( 1 β ( x (β) i + h (β) i + n (β) i )) exp ( 2 β ( x (β) i + h (β) i )) + exp ( 2 βn (β) i ) + 2αi exp ( 1 β ( x (β) i + h (β) i + n (β) i )) ; (c.11a) dy (β) i dh (β) i = dy (β) i dx (β) i ; (c.11b) dy (β) i dn (β) i = exp ( 2 βn (β) i ) + αi exp ( 1 β ( x (β) i + h (β) i + n (β) i )) exp ( 2 β ( x (β) i + h (β) i )) + exp ( 2 βn (β) i ) + 2αi exp ( 1 β ( x (β) i + h (β) i + n (β) i )) = 1− dy (β) i dx (β) i ; (c.11c) dy (β) i dα (β) i = β exp ( 1 β ( x (β) i + h (β) i + n (β) i )) exp ( 2 β ( x (β) i + h (β) i )) + exp ( 2 βn (β) i ) + 2αi exp ( 1 β ( x (β) i + h (β) i + n (β) i )) . (c.11d) Some implementations of vts compensation (e.g. Liao 2007) have usedmagnitude- spectrum features (β = 1), but assumed the mismatch function was simply exp ( y (1) i ) = exp ( x (1) i + h (1) i ) + exp ( n (1) i ) . (c.12) It is interesting to see the ešect of these assumptions. By converting this back to power- spectral features, exp ( 1 2y (2) i ) = exp ( 1 2x (2) i + 1 2h (2) i ) + exp ( 1 2n (2) i ) ; (c.13a) exp ( y (2) i ) = ( exp ( 1 2x (2) i + 1 2h (2) i ) + exp ( 1 2n (2) i ))2 = exp ( x (2) i ) + exp ( n (2) i ) + 2 exp ( 1 2x (2) i + 1 2h (2) i + 1 2n (2) i ) . (c.13b) is is exactly equivalent to the real mismatch function, in (c.6), with α = 1. is means that performing vts compensation with vectors in the magnitude domain and ignoring the phase term, as in (c.12) (e.g. Liao 2007), is essentially equivalent to as- suming α = 1 on log-power-spectral features. Also, when the noise model is ml- estimated, with the same mismatch function used for decoding, then the noise model parameters will subsume much of the dišerence between model and reality. 292 Appendix D Derivation of model-space Algonquin e Algonquin algorithm, when applied to the model, compensates each Gaussian separately for each observation. It is not clear from the original presentation that this happens. e original presentation replaces the likelihood calculation for each component by a computation of the component’s “soŸ information score” (Kristjansson and Frey 2002; Kristjansson 2002). It derives this form from the feature enhancement version. e end result can be derived more directly, which the following will do before show- ing the equivalence to the original presentation. 
As in the original, this will assume that there is only one component per state (though the generalisation is straightfor- ward). Consider a speech recogniser, which aims to nd the most likely state sequenceΘ from an observation sequence Y with P(Θ|Y) = p(Y |Θ)P(Θ) p(Y) ∝ p(Y |Θ)P(Θ) = P(θ0) ∏ t P(yt|θt)P(θt|θt−1) . (d.1) 293 appendix d. derivation of model-space algonquin e normalisation constant p(Y) is normally ignored, because it does not change de- pending on the state sequence Θ. An obvious way of approximating the component likelihood p(yt|θt) would be to replace it by the Algonquin approximation for that componentm and observation yt, p(yt|θt) ' ∑ m∈Ω(θ) pi (m) θ q (m) yt (yt). (d.2) at this approximation to p(yt|θt) is not necessarily normalised is not a problem for the Viterbi algorithm. A relevant question is, however, whether the likelihood ap- proximations for dišerent components can be compared, when they are compensated dišerently. e original presentation of Algonquin for model compensation (Kristjansson 2002; Kristjansson and Frey 2002) derives this form (up to a constant factor) through a number of manipulations.is relates the form of models compensation to the one for feature enhancement. e normalisation constant in (d.1) is not ignored, but ap- proximated with a mixture of Gaussians: p(Y) ' ∏ t p(yt) . (d.3) is decouples the distribution of y from the state sequence.is is substituted in in the expression that speech recognisers aim to maximise, P(Θ|Y). It is then rewritten by applying Bayes’ rule, the approximation in (d.3), and again Bayes’ rule: P(Θ|Y) = p(Y |Θ) p(Y) P(Θ) ' P(θ0) ∏ t p(yt|θt) p(yt) P(θt|θt−1) = P(θ0) ∏ t P(θt|yt)p(yt) p(yt)P(θt) P(θt|θt−1) = P(θ0) ∏ t P(θt|yt) P(θt) P(θt|θt−1) . (d.4) e expression that now replaces the likelihood computation for component θt = m (again, assuming the state-conditional distribution Gaussian) in the speech re- cogniser, P(m|yt) /P(m), is then called the “soŸ information score”. Its numerator, 294 P(m|yt), is approximated by its variational approximation1 P(m|yt) ' qyt(m) = q (m) yt (yt)P(m)∑ m ′ q (m ′) yt (yt)P(m ′) . (d.5) e soŸ information score therefore is approximated P(m|yt) P(m) ' qyt(m) P(m) = q (m) yt (yt)∑ m ′ q (m ′) yt (yt)P(m ′) . (d.6) e normalisation term for one frame is the same across all components. erefore, it does not have an ešect on decoding, so that the likelihood is in essence replaced by unnormalised distributionqyt(yt), as in (d.2).e question thus remainswhether the unnormalised Algonquin approximations to the likelihood are comparable between dišerence components. 1See Kristjansson (2002), equations (10.10), (10.15), and (8.18) for more details. 295 Appendix E The likelihood for piecewise linear approximation is section follows the transformation of the expression for the likelihood presented in Myrvoll and Nakamura (2004). e explanation of the idea behind it is in sec- tion 4.5.2. Section e.1 discusses the single-dimensional case as in the original paper. Section e.2 generalises it to more dimensions. E.1 Single-dimensional e interaction between the log-spectral coe›cients of the speech x, the noise n, and the observation y is assumed to be exp(y) = exp(x) + exp(n) . (e.1) y is set to its observed value, yt. e substitute variable introduced to replace the integration overx andn is dened u = 1− exp(x− yt) , (e.2a) 297 appendix e. the likelihood for piecewise linear approximation so that n = log(exp(yt) − exp(x)) = yt + log(1− exp(x− yt)) = yt + log(u) ; (e.2b) x = yt + log(1− u) . 
(e.2c) Two useful derivatives for transforming the integral are the following.e derivative of n with respect to y while keeping x xed is ∂n(x, y) ∂y = exp(y) exp(y) − exp(x) = 1 1− exp(x− y) = 1 u . (e.3a) e notation n(x, y) is used to indicate the value of n that the setting of the other two variables (x, y) implies. Similarly, the derivative of x with respect to u while keeping yt xed is ∂x(u, yt) ∂u = −1 1− u . (e.3b) As explained in section 4.5.2 (and see also section a.1.1), the transformation of the integral in the likelihood expression uses the absolute values of the two derivatives. p(yt) = ∫yt −∞ p(yt|x)p(x)dx = ∫yt −∞ ∣∣∣∣∣ ∂n(x, y)∂y ∣∣∣∣ yt ∣∣∣∣∣p(n(x, yt))p(x)dx = ∫ 1 0 ∣∣∣∣∣ ∂n(x, y)∂y ∣∣∣∣ yt ∣∣∣∣∣p(n(u, yt)) ∣∣∣∣∂x(u, yt)∂u ∣∣∣∣p(x(u, yt))du = ∫ 1 0 1 u N ( yt + log(u) ; µn, σ2n ) 1 1− u N ( yt + log(1− u) ; µx, σ2x ) du. (e.4) 298 e.2. multi-dimensional Rewriting only the leŸ-hand term of the integrand, noting that 1u = exp(− log(u)), 1 u N ( log(u) + yt; µn, σ2n ) = 1√ 2piσ2n exp ( − (log(u) + yt − µn)2 2σ2n − log(u) ) = 1√ 2piσ2n exp ( − (log(u) + yt − µn + σ2n)2 2σ2n + 12σ 2 n + yt − µn ) = exp ( 1 2σ 2 n + yt − µn ) N ( log(u); µn − σ2n − yt, σ2n ) . (e.5) e right-hand side of the integrand can be rewritten in a similar way, so that the likelihood expression becomes p(yt) = exp ( 1 2σ 2 n + 1 2σ 2 x − µn − µx + 2yt ) ∫ 1 0 N ( log(u); µn − σ2n − yt, σ2n ) N ( log(1− u); µx − σ2x − yt, σ2x ) du. (e.6) By approximating log(u) and log(1 − u) with a piecewise linear function (Myrvoll and Nakamura 2004), the integral can be written as a sum of integrals over part of a Gaussian and a constant factor. E.2 Multi-dimensional at the derivation above can use scalars crucially relies on two assumptions.e as- sumption that the ith coordinate of the clean speech only inžuences the ith coordinate of the corrupted speech is only valid in the log-spectral domain.e assumption that the coordinates of both the clean speech and the corrupted speech are uncorrelated is marginally valid in the cepstral domain, and invalid in the log-spectral domain. e following generalises the derivation above to a vector of mfccs. mfccs are related to log-spectral coe›cients by a linear transformation. As long as the distributions in the log-spectral domain are not assumed uncorrelated, therefore, a derivation in the log-spectral domain can be used. 299 appendix e. the likelihood for piecewise linear approximation e relation of the clean speech, noise, and corrupted speech for every dimension is the same as the single-dimensional case in (e.1), so that for vectors: exp(y) = exp(x) + exp(n) . (e.7) Again, y is set to its observed value, yt. e coe›cients of the substitute variableu are dened as in (e.2), so that in vector notation, u = 1− exp(x− yt) , (e.8) so that n = yt + log(u) ; (e.9) x = yt + log(1− u) , (e.10) where 1 is a vector with all entries set to 1. e absolutes of the derivatives that the transformation of the feature space res- ults in in one-dimensional space generalise to determinants of partial derivatives. Since the relationships between speech, noise, substitute variable, and observation are element-by-element in log-spectral space, the partial derivatives are diagonal.e generalisations of the derivatives in (e.3) therefore is (note that u ∈ [0, 1]) ∣∣∣∣∂n(x,y)∂y ∣∣∣∣ = ∣∣∣∣∣∏ i ∂n(xi, yi) ∂yi ∣∣∣∣∣ = ∣∣∣∣∣∏ i 1 1− exp(xi − yi) ∣∣∣∣∣ =∏ i 1 ui ; (e.11) ∣∣∣∣∂x(u,yt)∂u ∣∣∣∣ = ∣∣∣∣∣∏ i ∂x(ui, yt,i) ∂ui ∣∣∣∣∣ =∏ i 1 1− ui . 
(e.12) e additive noise and the clean speech are distributed as n ∼ N (µn,Σn) ; x ∼ N (µx,Σx) . (e.13) 300 e.2. multi-dimensional e likelihood of yt generalises (e.4): p(yt) = ∫ p(yt|x)p(x)dx = ∫ ∣∣∣∣∣ ∂n(x,y)∂y ∣∣∣∣ yt ∣∣∣∣∣p(n(x,y))p(x)dx = ∫ [0,1]d ∣∣∣∣∣ ∂n(x,y)∂y ∣∣∣∣ yt ∣∣∣∣∣p(n(u,y)) ∣∣∣∣∂x(u,yt)∂u ∣∣∣∣p(x(u,y))du = ∫ [0,1]d (∏ i 1 ui ) N (yt + log(u) ; µn, Σn)(∏ i 1 1− ui ) N (yt + log(1− u) ; µx, Σx) du. (e.14) Noting that ∏ i 1 ui = exp ( − ∑ i log(ui) ) = exp ( − log(u)T1 ) ; (e.15) N (log(u) + yt; µn, Σn) = |2piΣn|− 1 2 exp ( − 12(log(u) + yt − µn) TΣ−1n (log(u) + yt − µn) ) , (e.16) the leŸ term in (e.14) becomes (generalising (e.5))(∏ i 1 ui ) N (log(u) + yt; µn, Σn) = |2piΣn|− 1 2 exp ( − 12(log(u) + yt − µn) TΣ−1n (log(u) + yt − µn) − log(u) T1 ) = |2piΣn|− 1 2 exp ( − 12(log(u) + yt − µn + Σn1) TΣ−1n (log(u) + yt − µn + Σn1) + 121 TΣn1+ 1 Tyt − 1 Tµn ) = N (log(u) ; µn − yt − Σn1, Σn) exp ( 1 21 TΣn1+ 1 Tyt − 1 Tµn ) . (e.17) Applying the same process to the right term, the likelihood of yt becomes p(yt) = exp ( 1 21 TΣn1+ 1 21 TΣx1− 1 Tµn − 1 Tµx + 2 · 1Tyt ) ∫ [0,1]d N (log(u) ; µn − Σn1− yt, Σn)N (log(1− u) ; µx − Σx1− yt, Σx)du. (e.18) 301 appendix e. the likelihood for piecewise linear approximation In the single-dimensional case, the integral is approximated with 8 line segments. In themulti-dimensional case, the approximationwould use 8d hyperplanes. Sinceu has as many dimensions as there are lter bank coe›cients, a piecewise linear approxim- ation is infeasible. 302 Appendix F The likelihood for transformed-space sampling To approximate the integral in the expression for the likelihood of the observation, this work uses sequential importance resampling. A number of transformations of the integral are required, some of the details of which are in this appendix. e detailed derivation of the transformation of the single-dimensional version of the integral is in section f.1. e generalisation of this transformation to the multi- dimensional case is in section f.2. One of the two factorisations of the multi-dimen- sional integrand that this work presents is detailed in section f.3.e form of the pro- posal distribution that approximates the single-dimensional integrand and the factors of the multi-dimensional integrand is in section f.4. 303 appendix f. the likelihood for transformed-space sampling F.1 Transforming the single-dimensional integral In section 7.3 on page 189, one half of a one-dimensional version of the corrupted speech likelihood is rewritten to (repeated from (7.14)): p(yt, x ≤ n) = ∫ p(α) ∫∞ 0 ∣∣∣∣∂x(u, yt, α)∂u ∣∣∣∣ · ∣∣∣∣∣ ∂n(x, y, α)∂y ∣∣∣∣ yt,x(u,yt,α) ∣∣∣∣∣ · p(x(u, yt, α)) · p(n(u, yt, α)) dudα, (f.1a) where (repeated from (7.11)) u = n− x. (f.1b) Because the derivations of the Jacobians and of x(u, yt, α) and n(u, yt, α) are long, they are given here. e mismatch function is (repeated from (7.10)) exp ( y log t ) = exp ( xlog ) + exp ( nlog ) + 2α exp ( 1 2x log + 12n log ) . (f.2) To express n as a function of x, yt, α, (f.2) can be rewritten to exp(n) + 2α exp ( 1 2x ) exp ( 1 2n ) = exp(yt) − exp(x) ; (f.3a)( exp ( 1 2n ) + α exp ( 1 2x ))2 = exp(yt) − exp(x) + ( α exp ( 1 2x ))2 . (f.3b) is is where it becomes useful that the computation is restricted to the region where x ≤ n so that n has only one solution. 
Since −1 ≤ α (as shown in section 4.2.1.1), the squared expression on the leŸ-hand side, exp ( 1 2n ) + α exp ( 1 2x ) , is always non- negative.erefore, exp ( 1 2n ) = −α exp ( 1 2x ) + √ exp(yt) − exp(x) + α2 exp(x); (f.3c) n = 2 log ( −α exp ( 1 2x ) + √ exp(yt) + exp(x) (α2 − 1) ) . (f.3d) 304 f.1 . transforming the single-dimensional integral To express n as a function of u, yt, α, (f.2) can be rewritten with x = n − u from (f.1b): exp(yt) = exp(n− u) + exp(n) + 2α exp ( 1 2n− 1 2u+ 1 2n ) = exp(n) ( 1+ exp(−u) + 2α exp ( − 12u )) ; (f.4a) exp(n) = exp(yt) 1+ exp(−u) + 2α exp ( − 12u ) ; (f.4b) n = yt − log ( 1+ exp(−u) + 2α exp ( − 12u )) . (f.4c) Similarly, x can be expressed as a function of u, yt, α by rewriting (f.2) with n = u+ x from (f.1b): exp(yt) = exp(x) + exp(u+ x) + 2α exp ( 1 2x+ 1 2u+ 1 2x ) = exp(x) ( 1+ exp(u) + 2α exp ( 1 2u )) ; (f.5a) exp(yt − x) = 1+ exp(u) + 2α exp ( 1 2u ) ; (f.5b) x = yt − log ( 1+ exp(u) + 2α exp ( 1 2u )) . (f.5c) Because u was chosen to relate x and n symmetrically, (f.4c) and (f.5c) are the same except that u is replaced by −u. An equality that will come in useful derives from (f.5b): √ exp(yt) + exp(x) (α2 − 1) = exp ( 1 2x )√ exp(yt − x) + (α2 − 1) = exp ( 1 2x )√ exp(u) + 2α exp ( 1 2u ) + α2 = exp ( 1 2x )( exp ( 1 2u ) + α ) . (f.6) e Jacobians in (7.14) are derivatives of (f.5c) and (f.3d): ∂x(u, yt, α) ∂u = − exp(u) + α exp ( 1 2u ) 1+ exp(u) + 2α exp ( 1 2u ) = − exp( 12u)(exp( 12u)+ α) 1+ exp(u) + 2α exp ( 1 2u ) ; (f.7a) 305 appendix f. the likelihood for transformed-space sampling ∂n(x, y, α) ∂y = 2√ exp(y) + exp(x) (α2 − 1) − α exp ( 1 2x ) · exp(y) 2 √ exp(y) + exp(x) (α2 − 1) = exp(y)( exp ( 1 2x )( exp ( 1 2u ) + α ) − α exp ( 1 2x )) exp ( 1 2x )( exp ( 1 2u ) + α ) = exp(y− x) exp ( 1 2u )( exp ( 1 2u ) + α ) = 1+ exp(u) + 2α exp( 12u) exp ( 1 2u )( exp ( 1 2u ) + α ) . (f.7b) When these aremultiplied, as in the integral in (f.1a), they drop out against each other, except for the negation: ∂x(u, yt, α) ∂u ∂n(x, y, α) ∂y ∣∣∣∣ yt = −1. (f.7c) is does not seem to be an intrinsic property of the process. F.2 Transforming the multi-dimensional integral e transformation of the integral that returns the likelihood of the corrupted-speech observation is more laborious for multiple dimensions than for a single dimension. e derivation uses three steps. First, the integral is split into separate dimensions. en, each of the integrals for one dimension is rewritten similarly to appendix f.1. Finally, the dimensions are collated. e full expression for the likelihood of observation yt is (repeated from (7.1)) p(yt) = ∫ ∫ ∫ p(yt|x,n,α)p(x)p(n)p(α)dxdndα. (f.8) Just like in the single-dimensional case, the integration over the clean speech and the additive noise will be rewritten as an integral over a substitute variable. For each dimension, this substitution is the same as the one in (7.11). In multiple dimensions, the substitute variable u also relates the speech x and the noise n symmetrically: u = x− n. (f.9) However, the transformation of the integral will work one dimension at a time. Per dimension, the derivation will be split in two regions which use symmetric de- rivations, like in section 7.3.1. Again, the derivation will be explicitly given only for 306 f.2. transforming the multi-dimensional integral xi ≤ ni, with ni < xi completely analogous. 
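The substitution can be checked numerically. The sketch below, using an arbitrary point with u > 0 (the region x ≤ n), reconstructs x and n from (f.5c) and (f.4c), verifies that they satisfy the mismatch function (f.2) and the definition u = n − x in (f.1b), and confirms that the product of the Jacobians in (f.7c) is −1.

```python
import numpy as np

def x_from(u, y_t, alpha):
    """Clean speech implied by (u, y_t, alpha), equation (f.5c)."""
    return y_t - np.log(1.0 + np.exp(u) + 2.0 * alpha * np.exp(0.5 * u))

def n_from(u, y_t, alpha):
    """Additive noise implied by (u, y_t, alpha), equation (f.4c)."""
    return y_t - np.log(1.0 + np.exp(-u) + 2.0 * alpha * np.exp(-0.5 * u))

# Arbitrary test point with u > 0 (the region x <= n).
u, y_t, alpha = 0.7, 3.0, 0.2
x = x_from(u, y_t, alpha)
n = n_from(u, y_t, alpha)

# The reconstructed x and n satisfy the mismatch function (f.2) ...
lhs = np.exp(y_t)
rhs = np.exp(x) + np.exp(n) + 2.0 * alpha * np.exp(0.5 * (x + n))
assert np.isclose(lhs, rhs) and np.isclose(u, n - x)

# ... and the Jacobian product (f.7c) equals -1.
e = np.exp(0.5 * u)
dx_du = -e * (e + alpha) / (1.0 + e ** 2 + 2.0 * alpha * e)      # (f.7a)
dn_dy = (1.0 + e ** 2 + 2.0 * alpha * e) / (e * (e + alpha))     # (f.7b)
assert np.isclose(dx_du * dn_dy, -1.0)
print(x, n)
```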
By formulating (f.8) recursively, it can be transformed one scalar at a time.e following marginalises out one variable at a time, starting with αi: p(yt,i:d, xi ≤ ni|x1:i−1,n1:i−1,α1:i−1) = ∫ p(αi|α1:i−1)p(yt,i:d, xi ≤ ni|x1:i−1,n1:i−1,α1:i)dαi, (f.10a) where, marginalising out xi, p(yt,i:d, xi ≤ ni|x1:i−1,n1:i−1,α1:i) = ∫ p(xi|x1:i−1)p(yt,i:d, xi ≤ ni|x1:i,n1:i−1,α1:i)dxi, (f.10b) where, with the restriction xi ≤ ni subsumed in the range of the integration over ni, p(yt,i:d, xi ≤ ni|x1:i,n1:i−1,α1:i) = ∫∞ xi p(ni|n1:i−1)p(yt,i:d|x1:i,n1:i,α1:i)dni. (f.10c) e integrals in (f.10c) and in (f.10b) can then be re-expressed as one integral over the substitute variable. First, the integral over ni in (f.10c) can be written without the integral. is is because given the clean speech, additive noise, and phase factor for one dimension, the corrupted speech for that dimension is deterministic: p(yt,i|xi, ni, αi) = δf(xi,ni,αi)(yt,i) . (f.11) e variable of the Dirac delta in (f.11) can be transformed using the Jacobian (see (a.1) in appendix a.1.1): p(yt,i:d, xi ≤ ni|x1:i,n1:i−1,α1:i) = ∫∞ xi p(ni|n1:i−1)p(yt,i|xi, ni, αi)p(yt,i+1:d|x1:i,n1:i,α1:i)dni = ∫∞ xi p(ni|n1:i−1) δf(xi,ni,αi)(yt,i)p(yt,i+1:d|x1:i,n1:i,α1:i)dni 307 appendix f. the likelihood for transformed-space sampling = ∫∞ xi p(ni|n1:i−1) · ∣∣∣∣∣ dn(xi, αi, yi)dyi ∣∣∣∣ yt,i ∣∣∣∣∣ · δn(xi,αi,yt,i)(ni) p(yt,i+1:d|x1:i,n1:i,α1:i)dni = ∣∣∣∣∣ dn(xi, αi, yi)dyi ∣∣∣∣ yt,i ∣∣∣∣∣ · 1(xi ≤ ni)p(n(xi, αi, yt,i)|n1:i−1) p ( yt,i+1:d ∣∣x1:i,n1:i−1,α1:i, ni = n(xi, αi, yt,i)). (f.12) e next step is to substitute this result into (f.10b), and then replace the variable of the integral from xi to ui.e Jacobians that result from this are exactly the same as the ones in section 7.3.1 on page 190, in (7.14). Since the product of their absolutes is therefore again 1, they drop out. p(yt,i:d, xi ≤ ni|x1:i−1,n1:i−1,α1:i) = ∫ p(xi|x1:i−1) ∣∣∣∣∣ dn(xi, αi, yi)dyi ∣∣∣∣ yt,i ∣∣∣∣∣ 1(xi ≤ ni)p(n(xi, αi, yt,i)|n1:i−1) p ( yt,i+1:d ∣∣x1:i,n1:i−1,α1:i, ni = n(xi, αi, yt,i))dxi = ∫∞ 0 ∣∣∣∣dx(ui, αi, yi)dui ∣∣∣∣ · ∣∣∣∣∣ dn(xi, αi, yi)dyi ∣∣∣∣ yt,i ∣∣∣∣∣ p(x(ui, αi, yt,i)|x1:i−1)p(n(ui, αi, yt,i)|n1:i−1) p ( yt,i+1:d ∣∣x1:i−1,n1:i−1,α1:i, xi = x(ui, αi, yt,i), ni = n(ui, αi, yt,i)) dui = ∫∞ 0 p(x(ui, αi, yt,i)|x1:i−1)p(n(ui, αi, yt,i)|n1:i−1) p ( yt,i+1:d ∣∣x1:i−1,n1:i−1,α1:i, xi = x(ui, αi, yt,i), ni = n(ui, αi, yt,i)) dui, (f.13) 308 f.2. transforming the multi-dimensional integral Substituting this into (f.10a) gives one half of the likelihood: p(yt,i:d, xi ≤ ni|x1:i−1,n1:i−1,α1:i−1) = ∫ p(αi|α1:i−1) ∫∞ 0 p(x(ui, αi, yt,i)|x1:i−1)p(n(ui, αi, yt,i)|n1:i−1) p ( yt,i+1:d ∣∣x1:i−1,n1:i−1,α1:i, xi = x(ui, αi, yt,i), ni = n(ui, αi, yt,i)) dui dαi. (f.14) is gives half the likelihood, because it is constrained to xi ≤ ni. e other part, for ni < xi, has the exact same derivation with xi and ni swapped, and ui replaced by −ui. is is exactly the same as the single-dimensional case in appendix f.1. e full likelihood, expressed recursively, then combines integrals over ui ∈ [0,∞) and ui ∈ (−∞, 0): p(yt,i:d|x1:i−1,n1:i−1,α1:i−1) = p(yt,i:d, xi ≤ ni|x1:i−1,n1:i−1,α1:i−1) + p(yt,i:d, ni < xi|x1:i−1,n1:i−1,α1:i−1) = ∫ p(αi|α1:i−1) ∫ p(xi(ui, αi, yt,i)|x1:i−1)p(n(ui, αi, yt,i)|n1:i−1) p ( yt,i+1:d ∣∣x1:i−1,n1:i−1,α1:i, xi = x(ui, αi, yt,i), ni = n(ui, αi, yt,i)) dui dαi. 
(f.15) is recursive formulation is straightforward to unroll to p(yt) = p(yt,1:d) = ∫ [ d∏ i=1 p(αi|α1:i−1) ] ∫ [ d∏ i=1 p(x(ui, αi, yt,i)|x1:i−1) ] [ d∏ i=1 p(n(ui, αi, yt,i)|n1:i−1) ] dudα = ∫ p(α) ∫ p(x(u,α,yt))p(n(u,α,yt))dudα, (f.16) where, analogously to the one-dimensional case, p(x(u,α,yt)) denotes the value of the prior ofx evaluated at the value ofx implied by the values of (u,α,yt), and similar 309 appendix f. the likelihood for transformed-space sampling for p(n(u,α,yt)). To nd these values for x and n, the relations in (f.5c) and (f.4c) apply per dimension: x = yt − log ( 1+ exp(u) + 2α ◦ exp( 12u)) ; (f.17a) n = yt − log ( 1+ exp(−u) + 2α ◦ exp(− 12u)) . (f.17b) In section 7.3.2, the full integrand is called γ(u,α), and the integral is approximated with sequential importance sampling. Note that this derivation holds for any form of priors for the speech and noise p(x) and p(n). F.3 Postponed factorisation of the integrand is section presents a factorisation of the integrandγ(u|α). It should result in factors γi so that (repeated from (7.33)) γ(u|α) = N (x(u,α,yt); µx, Σx)N (n(u,α,yt); µn, Σn) . (f.18) e two Gaussians on the right-hand side have the same structure.e factorisation here will only explicitly consider the term deriving from the speech prior; the one deriving from the noise prior factorises analogously. A multi-variate Gaussian relates all elements in its input vector through the inverse covariance matrix, the precision matrix Λx. e derivation writes these explicitly. e elements of Λx = Σ−1x are denoted with λx,ij. N (x(u,α,yt); µx, Σx) = |2piΣx|− 1 2 exp ( − 12(x(u,α,yt) − µx) TΛx(x(u,α,yt) − µx) ) = |2piΣx|− 1 2 exp ( − 12 d∑ i=1 d∑ j=1 (x(ui, αi, yt,i) − µx,i) λx,ij(x(uj, αj, yt,j) − µx,j) ) = |2piΣx|− 1 2 exp ( d∑ i=1 [ − 12λx,ii(x(ui, αi, yt,i) − µx,i) 2 −(x(ui, αi, yt,i) − µx,i) i−1∑ j=1 λx,ij(x(uj, αj, yt,j) − µx,j) ]) 310 f.3. postponed factorisation of the integrand = |2piΣx|− 1 2 d∏ i=1 exp ( − 12λx,ii(x(ui, αi, yt,i) − µx,i) 2 −(x(ui, αi, yt,i) − µx,i)νx,i ) , (f.19a) where the term containing coordinates of lower dimensions u1:i−1 is νx,i = i−1∑ j=1 λx,ij(x(uj, αj, yt,j) − µx,j) . (f.19b) When drawing ui for dimension i, the coordinates of lower dimensions u1:i−1 are known. AŸer applying the same factorisation to the noise term, the complete integrand in (f.18) can be written γ(u|α) = N (x(u,α,yt); µx, Σx)N (n(u,α,yt); µn, Σn) = |2piΣx|− 1 2 |2piΣn|− 1 2 d∏ i=1 exp ( − 12λx,ii(x(ui, αi, yt,i) − µx,i) 2 −(x(ui, αi, yt,i) − µx,i)νx,i − 12λn,ii(n(ui, αi, yt,i) − µn,i) 2 −(n(ui, αi, yt,i) − µn,i)νn,i ) , (f.20) where νn,i is dened analogously to (f.19b). e factors are then dened as (it is ar- bitrary which factor takes the constant determiners) γ1(u1|α1) = |2piΣy|− 1 2 |2piΣx|− 1 2 exp ( − 12λn,11(n(u1, α1, yt,1) − µn,1) 2 − 12λx,11(x(u1, α1, yt,1) − µx,1) 2 ) ; (f.21a) γi(ui|u1:i−1,α1:i) = exp ( − 12λx,ii(x(ui, αi, yt,i) − µx,i) 2 −(x(ui, αi, yt,i) − µx,i)νx,i − 12λn,ii(n(ui, αi, yt,i) − µn,i) 2 −(n(ui, αi, yt,i) − µn,i)νn,i ) , (f.21b) 311 appendix f. the likelihood for transformed-space sampling To nd a proposal distribution for the resulting density, it can be rewritten so that it is easily related to the one-dimensional γ in section 7.3.1, for which good proposal distributions were discussed in section 7.3.1.2. Again, the following rewrites only part of the term related to the speech prior; the noise term works completely analogously. 
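The bookkeeping behind the per-dimension factor (f.21b) is compact in code. The sketch below assumes the precision matrices of the speech and noise priors and the implied values x(u_j, α_j, y_{t,j}) and n(u_j, α_j, y_{t,j}) for the dimensions drawn so far are already available; the constant determinant factors, which the text folds into the first factor (f.21a), are omitted. The function and argument names are chosen for the example.

```python
import numpy as np

def log_factor_i(i, x_vals, n_vals, mu_x, Lambda_x, mu_n, Lambda_n):
    """Log of the per-dimension factor gamma_i in (f.21b).

    Zero-based indexing is used, so x_vals and n_vals hold the i + 1 values
    implied by the coordinates drawn so far; mu_x, Lambda_x, mu_n, Lambda_n
    are the means and precision matrices of the speech and noise priors.
    """
    dx = x_vals - mu_x[:i + 1]
    dn = n_vals - mu_n[:i + 1]
    # nu_{x,i} and nu_{n,i} in (f.19b): cross terms with the lower dimensions.
    nu_x = Lambda_x[i, :i] @ dx[:i]
    nu_n = Lambda_n[i, :i] @ dn[:i]
    return (-0.5 * Lambda_x[i, i] * dx[i] ** 2 - dx[i] * nu_x
            - 0.5 * Lambda_n[i, i] * dn[i] ** 2 - dn[i] * nu_n)

# Example call with arbitrary three-dimensional priors.
rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
Lambda = B @ B.T + 3.0 * np.eye(3)         # an arbitrary precision matrix
val = log_factor_i(1, rng.normal(size=2), rng.normal(size=2),
                   np.zeros(3), Lambda, np.zeros(3), Lambda)
print(val)
```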
Since for importance sampling it is the shape rather than the height of the density that the proposal distribution needs to match, the following disregards constant factors (note the use of ∝), which are additive within the exp(·). A technique sometimes called “completing the square” helps nd the shape of the term related to the clean speech in (f.21).is derivation is similar to the derivation of the parameters of a con- ditional Gaussian distribution (see e.g. Bishop 2006). Taking one factor from (f.19a), exp ( − 12λx,ii(x(ui, αi, yt,i) − µx,i) 2 −(x(ui, αi, yt,i) − µx,i)νx,i ) ∝ exp ( − 12λx,ii(x(ui, αi, yt,i)) 2 + λx,iiµx,ix(ui, αi, yt,i) − νx,ix(ui, αi, yt,i) ) = exp ( − 12λx,ii(x(ui, αi, yt,i)) 2 + λx,ii ( µx,i − νx,i λx,ii ) x(ui, αi, yt,i) ) ∝ exp ( − 12λx,ii [ x(ui, αi, yt,i) − ( µx,i − νx,i λx,ii )]2) ∝ N ( x(ui, αi, yt,i); µx,i − νx,i λx,ii , λ−1x,ii ) . (f.22) By rewriting the additive noise term in the samemanner, the factors in (f.21) turn out to be proportional to twoGaussian distributions that are functions ofx(ui, αi, yt,i) and n(ui, αi, yt,i): γi(ui|u1:i−1,α1:i−1) ∝ N ( x(ui, αi, yt,i); µx,i − νx,i λx,ii , λ−1x,ii ) · N ( n(ui, αi, yt,i); µn,i − νn,i λn,ii , λ−1n,ii ) . (f.23) is expression has the same shape as the one-dimensional integrand in (7.19) in sec- tion 7.3.1. 312 f.4. terms of the proposal distribution F.4 Terms of the proposal distribution Finding the proposal distribution uses u(x, y, α), the value for u that follows from xing the other variables. Where this is necessary, u > 0.is is equivalent to x ≤ n, which is the area that this expression for n is valid for (repeated in (f.3d)): n = 2 log ( −α exp ( 1 2x ) + √ exp(yt) + exp(x) (α2 − 1) ) , (f.24) so that u can be found with u = n− x = 2 log ( −α+ √ exp(yt − x) + α2 − 1 ) . (f.25) emirror image of this expression isu(n, y, α), which xesn rather than x, and is required only for u < 0.is expression can be found by rewriting (f.4b): exp(n) = exp(yt) 1+ exp(−u) + 2α exp ( − 12u ) ; (f.26a) exp(−u) + 2α exp ( − 12u ) + 1 = exp(yt − n) ; (f.26b)( exp ( − 12u ) + α )2 − α2 + 1 = exp(yt − n) ; (f.26c)( exp ( − 12u ) + α )2 = exp(yt − n) + α2 − 1. (f.26d) From u < 0, it follows that exp ( − 12u ) ≥ 1 and exp(− 12u)+ α ≥ 0, so that exp ( − 12u ) + α = √ exp(yt − n) + α2 − 1; (f.26e) u = −2 log ( −α+ √ exp(yt − n) + α2 − 1 ) . (f.26f) 313 Bibliography Alex Acero (1990). Acoustical and Environmental Robustness in Automatic Speech Re- cognition. Ph.D. thesis, Carnegie Mellon University. Alex Acero, Li Deng, Trausti Kristjansson, and Jerry Zhang (2000). “hmm adaptation using vector Taylor series for noisy speech recognition.” In Proceedings of icslp. vol. 3, pp. 229–232. Cyril Allauzen, Michael Riley, and Johan Schalkwyk (2009). “A Generalized Com- position Algorithm forWeighted Finite-State Transducers.” In Proceedings of Inter- speech. pp. 1203–1206. Tasos Anastasakos, John McDonough, Richard Schwartz, and John Makhoul (1996). “A Compact Model for Speaker-Adaptive Training.” In Proceedings of icslp. pp. 1137–1140. Jon A. Arrowood and Mark A. Clements (2002). “Using Observation Uncertainty in hmm Decoding.” In Proceedings of icslp. pp. 1561–1564. Scott Axelrod, Ramesh Gopinath, and Peder Olsen (2002). “Modeling With A Sub- space Constraint On Inverse Covariance Matrices.” In Proceedings of icslp. pp. 2177–2180. L. E. Baum, T. Petrie, G. Soules, and N. Weiss (1970). 
“A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains.” Annals of Mathematical Statistics 41 (1), pp. 164–171. Ješ A. Bilmes (1998). “A Gentle Tutorial of the em Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models.” Tech. rep., U. C. Berkeley. Chistopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer. S. F. Boll (1979). “Suppression of acoustic noise in speech using spectral subtraction.” ieee Transactions on Acoustics, Speech and Signal Processing 27, pp. 113–120. C. Breslin, K. K. Chin, M. J. F. Gales, K. Knill, and H. Xu (2010). “Prior Information for Rapid Speaker Adaptation.” In Proceedings of Interspeech. pp. 1644–1647. 315 bibliography Stanley Chen and Joshua Goodman (1998). “An empirical study of smoothing tech- niques for language modeling.” Tech. rep., Harvard University. Beverly Collins and Inger Mees (1999). e Phonetics of English and Dutch. Brill, Leiden. Justin Dauwels, Sascha Korl, and Hans-Andrea Loeliger (2006). “Particle Methods as Message Passing.” In Proceedings of 2006 ieee International Symposium on Inform- ationeory. pp. 2052–2056. Steven B. Davis and Paul Mermelstein (1980). “Comparison of Parametric Repres- entations for Monosyllabic Word Recognition in Continuously Spoken Sentences.” ieee Transactions on Acoustics, Speech and Signal Processing 28 (4), pp. 357–366. A´ngel de la Torre, Dominique Fohr, and Jean-Paul Haton (2002). “Statistical Adapt- ation of Acoustic Models to Noise Conditions for Robust Speech Recognition.” In Proceedings of icslp. pp. 1437–1440. A. P. Dempster, N. M. Laird, and D. B. Rubin (1977). “Maximum Likelihood from IncompleteData via the emAlgorithm.” Journal of the Royal Statistical Society 39 (1), pp. 1–38. Li Deng, Jasha Droppo, and Alex Acero (2004). “Enhancement of log mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise.” ieee Transactions on Speech and Audio Processing 12 (2), pp. 133–143. V. V. Digalakis, D. Rtischev, and L. G. Neumeyer (1995). “Speaker Adaptation Using Constrained Estimation of Gaussian Mixtures.” ieee Transactions on Speech and Audio Processing 3 (5), pp. 357–366. SimonDobrisˇek, Janez Zˇibert, and FranceMihelicˇ (2010). “Towards theOptimalMin- imization of a Pronunciation Dictionary Model.” In Text, Speech and Dialogue, Springer, Berlin / Heidelberg, Lecture Notes in Computer Science, vol. 6231, pp. 267– 274. Pierre L. Dognin, John R. Hershey, Vaibhava Goel, and Peder A. Olsen (2009). “Re- structuring Exponential Family Mixture Models.” In Proceedings of Interspeech. A. Doucet and A. M. Johansen (2008). “A tutorial on particle ltering and smooth- ing: Ÿeen years later.” Tech. rep., Department of Statistics, University of British Columbia. url: 〈http://www.cs.ubc.ca/∼arnaud/doucet johansen tutorialPF. pdf〉. Jasha Droppo, Alex Acero, and Li Deng (2002). “Uncertainty Decoding with splice for Noise Robust Speech Recognition.” In Proceedings of icassp. pp. 829–832. 316 Yariv Ephraim (1990). “Amimimummean square error approach for speech enhance- ment.” In Proceedings of icassp. pp. 829–832. Yariv Ephraim (1992). “Statistical-model-based speech enhancement systems.” Pro- ceedings of the ieee 80 (10), pp. 1526–1555. Friedrich Faubel and Dietrich Klakow (2010). “Estimating Noise from Noisy Speech Features with a Monte Carlo Variant of the Expectation Maximization Algorithm.” In Proceedings of Interspeech. pp. 
F. Flego and M. J. F. Gales (2009). “Incremental Predictive and Adaptive Noise Compensation.” In Proceedings of icassp. pp. 3837–3840.
Brendan J. Frey, Li Deng, Alex Acero, and Trausti Kristjansson (2001a). “algonquin: Iterating Laplace’s Method to Remove Multiple Types of Acoustic Distortion for Robust Speech Recognition.” In Proceedings of Eurospeech. pp. 901–904.
Brendan J. Frey, Trausti T. Kristjansson, Li Deng, and Alex Acero (2001b). “algonquin: Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition.” In Proceedings of nips.
Brendan J. Frey and David J. C. MacKay (1997). “A Revolution: Belief Propagation in Graphs With Cycles.” In Proceedings of Neural Information Processing Systems. MIT Press, pp. 479–485.
K. Fukunaga (1972). Introduction to Statistical Pattern Recognition. Academic Press.
Sadaoki Furui (1986). “Speaker-Independent Isolated Word Recognition Using Dynamic Features of Speech Spectrum.” ieee Transactions on Acoustics, Speech, and Signal Processing 34 (1), pp. 52–59.
M. J. F. Gales (1996). “The Generation and Use of Regression Class Trees for mllr Adaptation.” Tech. Rep. cued/f-infeng/tr.263, Cambridge University Engineering Department.
M. J. F. Gales (1998a). “Maximum Likelihood Linear Transformations for hmm-based Speech Recognition.” Computer Speech and Language 12 (2), pp. 75–98.
M. J. F. Gales (1998b). “Predictive Model-Based Compensation Schemes for Robust Speech Recognition.” Speech Communication 25 (1–3), pp. 49–74.
M. J. F. Gales (1999). “Semi-Tied Covariance Matrices for Hidden Markov Models.” ieee Transactions on Speech and Audio Processing 7 (3), pp. 272–281.
M. J. F. Gales (2000). “Cluster adaptive training of hidden Markov models.” ieee Transactions on Speech and Audio Processing 8 (4), pp. 417–428.
M. J. F. Gales (2002). “Maximum likelihood multiple subspace projections for hidden Markov models.” ieee Transactions on Speech and Audio Processing 10 (2), pp. 37–47.
M. J. F. Gales (2011). “Model-Based Approaches to Handling Uncertainty.” In D. Kolossa and R. Haeb-Umbach (eds.), Robust Speech Recognition of Uncertain Data, Springer Verlag.
M. J. F. Gales and F. Flego (2010). “Discriminative classifiers with adaptive kernels for noise robust speech recognition.” Computer Speech and Language 24 (4), pp. 648–662.
M. J. F. Gales and R. C. van Dalen (2007). “Predictive Linear Transforms for Noise Robust Speech Recognition.” In Proceedings of asru. pp. 59–64.
M. J. F. Gales and P. C. Woodland (1996). “Mean and Variance Adaptation within the mllr Framework.” Computer Speech and Language 10, pp. 249–264.
Mark J. F. Gales (1995). Model-Based Techniques for Noise Robust Speech Recognition. Ph.D. thesis, Cambridge University.
Jean-Luc Gauvain and Chin-Hui Lee (1994). “Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains.” ieee Transactions on Speech and Audio Processing 2 (2), pp. 291–298.
S. Geman and D. Geman (1984). “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images.” ieee Transactions on Pattern Analysis and Machine Intelligence 6, pp. 721–741.
Arnab Ghoshal, Daniel Povey, Mohit Agarwal, Pinar Akyazi, Lukáš Burget, Kai Feng, Ondřej Glembek, Nagendra Goel, Martin Karafiát, Ariya Rastrow, Richard C. Rose, Petr Schwarz, and Samuel Thomas (2010). “A Novel Estimation of Feature-space mllr for Full Covariance Models.” In Proceedings of icassp.
R. A. Gopinath, M. J. F. Gales, P. S. Gopalakrishnan, S. Balakrishnan-Aiyer, and M. A. Picheny (1995). “Robust speech recognition in noise - performance of the ibm continuous speech recognizer on the arpa noise spoke task.” In Proceedings of the arpa Workshop on Spoken Language System Technology. pp. 127–130.
Ramesh A. Gopinath, Bhuvana Ramabhadran, and Satya Dharanipragada (1998). “Factor Analysis Invariant to Linear Transformations of Data.” In Proceedings of icslp.
Reinhold Haeb-Umbach (2001). “Automatic Generation of Phonetic Regression Class Trees for mllr Adaptation.” ieee Transactions on Speech and Audio Processing 9 (3), pp. 299–302.
John Harris (1994). English sound structure. Blackwell, Oxford.
John R. Hershey, Peder Olsen, and Steven J. Rennie (2010). “Signal Interaction and the Devil Function.” In Proceedings of Interspeech. pp. 334–337.
John R. Hershey and Peder A. Olsen (2007). “Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models.” In Proceedings of icassp.
Hans-Günter Hirsch and David Pearce (2000). “The aurora experimental framework for the performance evaluation of speech recognition systems under noise conditions.” In Proceedings of asr. pp. 181–188.
J. Holmes and N. Sedgwick (1986). “Noise compensation for speech recognition using probabilistic models.” In Proceedings of icassp. pp. 741–744.
Yu Hu and Qiang Huo (2006). “An hmm Compensation Approach Using Unscented Transformation [sic] for Noisy Speech Recognition.” In Proceedings of iscslp. pp. 346–357.
Simon J. Julier and Jeffrey K. Uhlmann (2004). “Unscented Filtering and Nonlinear Estimation.” Proceedings of the ieee 92 (3), pp. 401–422.
J. Junqua and Y. Anglade (1990). “Acoustic and perceptual studies of Lombard speech: application to isolated-word automatic speech recognition.” In Proceedings of icassp. pp. 841–844.
O. Kalinli, M. L. Seltzer, and A. Acero (2009). “Noise Adaptive Training Using a Vector Taylor Series Approach for Noise Robust Automatic Speech Recognition.” In Proceedings of icassp. pp. 3825–3828.
O. Kalinli, M. L. Seltzer, J. Droppo, and A. Acero (2010). “Noise Adaptive Training for Robust Automatic Speech Recognition.” ieee Transactions on Audio, Speech, and Language Processing 18 (8), pp. 1889–1901.
D. K. Kim and M. J. F. Gales (2010). “Noisy Constrained Maximum Likelihood Linear Regression for Noise-Robust Speech Recognition.” ieee Transactions on Audio, Speech, and Language Processing 19 (2), pp. 315–325.
Do Yeong Kim, Chong Kwan Un, and Nam Soo Kim (1998). “Speech recognition in noisy environments using first-order vector Taylor series.” Speech Communication 24, pp. 39–49.
G. Kitagawa (1996). “Monte Carlo filter and smoother for non-Gaussian non-linear state space models.” Journal of Computational and Graphical Statistics 5, pp. 1–25.
D. Klatt (1976). “A digital filter bank for spectral matching.” In Proceedings of icassp. pp. 573–576.
Trausti Kristjansson, Brendan Frey, Li Deng, and Alex Acero (2001). “Joint Estimation of Noise and Channel Distortion in a Generalized em Framework.” In Proceedings of asru.
Trausti T. Kristjansson and Brendan J. Frey (2002). “Accounting for uncertainity [sic] in observations: a new paradigm for robust automatic speech recognition.” In Proceedings of icassp. pp. 61–64.
Trausti Thor Kristjansson (2002). Speech Recognition in Adverse Environments: a Probabilistic Approach. Ph.D. thesis, University of Waterloo.
S. Kullback and R. A. Leibler (1951). “On Information and Sufficiency.” Annals of Mathematical Statistics 22 (1), pp. 79–86.
Nagendra Kumar (1997). Investigation of Silicon-Auditory Models and Generalization of Linear Discriminant Analysis for Improved Speech Recognition. Ph.D. thesis, Johns Hopkins University.
C. J. Leggetter (1995). Improved Acoustic Modelling for hmms using Linear Transformations. Ph.D. thesis, Cambridge University.
C. J. Leggetter and P. C. Woodland (1995). “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models.” Computer Speech and Language 9 (2), pp. 171–185.
Volker Leutnant and Reinhold Haeb-Umbach (2009a). “An analytic derivation of a phase-sensitive observation model for noise robust speech recognition.” In Proceedings of Interspeech. pp. 2395–2398.
Volker Leutnant and Reinhold Haeb-Umbach (2009b). “An analytic derivation of a phase-sensitive observation model for noise robust speech recognition.” Tech. rep., Universität Paderborn, Faculty of Electrical Engineering and Information Technology.
Jinyu Li, Li Deng, Dong Yu, Yifan Gong, and Alex Acero (2007). “High-Performance hmm Adaptation with Joint Compensation of Additive and Convolutive Distortions via Vector Taylor Series.” In Proceedings of asru. pp. 65–70.
Jinyu Li, Dong Yu, Li Deng, Yifan Gong, and Alex Acero (2009). “A unified framework of hmm adaptation with joint compensation of additive and convolutive distortions.” Computer Speech and Language 23, pp. 389–405.
Jinyu Li, Dong Yu, Yifan Gong, and L. Deng (2010). “Unscented Transform with Online Distortion Estimation for hmm Adaptation.” In Proceedings of Interspeech. pp. 1660–1663.
H. Liao and M. J. F. Gales (2005). “Uncertainty Decoding for Noise Robust Speech Recognition.” In Proceedings of Interspeech. pp. 3129–3132.
H. Liao and M. J. F. Gales (2006). “Joint Uncertainty Decoding for Robust Large Vocabulary Speech Recognition.” Tech. Rep. cued/f-infeng/tr.552, Cambridge University Engineering Department.
H. Liao and M. J. F. Gales (2007). “Adaptive Training with Joint Uncertainty Decoding for Robust Recognition of Noisy Data.” In Proceedings of icassp. vol. iv, pp. 389–392.
Hank Liao (2007). Uncertainty Decoding for Noise Robust Speech Recognition. Ph.D. thesis, Cambridge University.
Richard P. Lippmann (1997). “Speech recognition by machines and humans.” Speech Communication 22 (1), pp. 1–15.
Tomoko Matsui and Sadaoki Furui (1998). “N-Best-based unsupervised speaker adaptation for speech recognition.” Computer Speech and Language 12 (1), pp. 41–50.
Thomas Minka (2005). “Divergence measures and message passing.” Tech. Rep. msr-tr-2005-173, Microsoft Research. url: 〈ftp://ftp.research.microsoft.com/pub/tr/TR-2005-173.pdf〉.
Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley (2008). “Speech recognition with weighted finite-state transducers.” In Larry Rabiner and Fred Juang (eds.), Handbook on Speech Processing and Speech Communication, Part E: Speech recognition, Springer-Verlag, Heidelberg, Germany.
Pedro J. Moreno (1996). Speech Recognition in Noisy Environments. Ph.D. thesis, Carnegie Mellon University.
Kevin Murphy (2002). Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. thesis, UC Berkeley, Computer Science Division.
Tor André Myrvoll and Satoshi Nakamura (2003). “Optimal filtering of noisy ceptral [sic] coefficients for robust asr.” In Proceedings of asru. pp. 381–386.
Tor André Myrvoll and Satoshi Nakamura (2004). “Minimum mean square error filtering of noisy cepstral coefficients with applications to asr.” In Proceedings of icassp. pp. 977–980.
L. Neumeyer, A. Sankar, and V. Digalakis (1995). “A Comparative Study of Speaker Adaptation Techniques.” In Proceedings of Eurospeech. pp. 1127–1130.
Leonardo Neumeyer and Mitchel Weintraub (1994). “Probabilistic Optimum Filtering for Robust Speech Recognition.” In Proceedings of icassp. vol. 1, pp. 417–420.
Peder A. Olsen and Ramesh A. Gopinath (2004). “Modeling inverse covariance matrices by basis expansion.” ieee Transactions on Speech and Audio Processing 12 (1), pp. 37–46.
Tasuku Oonishi, Paul R. Dixon, Koji Iwano, and Sadaoki Furui (2009). “Generalization Of Specialized On-The-Fly Composition.” In Proceedings of icassp. pp. 4317–4320.
Judea Pearl (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.
K. B. Petersen and M. S. Pedersen (2008). “The Matrix Cookbook.” url: 〈http://matrixcookbook.com/〉.
P. Price, W. M. Fisher, J. Bernstein, and D. S. Pallett (1988). “The darpa 1000-word Resource Management database for continuous speech recognition.” In Proceedings of icassp. vol. 1, pp. 651–654.
R. C. Rose, E. M. Hofstetter, and D. A. Reynolds (1994). “Integrated models of signal and background with application to speaker identification in noise.” ieee Transactions on Speech and Audio Processing 2 (2), pp. 245–257.
A-V. I. Rosti and M. J. F. Gales (2004). “Factor analysed hidden Markov models for speech recognition.” Computer Speech and Language 18 (2), pp. 181–200.
S. Sagayama, Y. Yamaguchi, S. Takahashi, and J. Takahashi (1997). “Jacobian approach to fast acoustic model adaptation.” In Proceedings of icassp. vol. 2, pp. 835–838.
George Saon, Mukund Padmanabhan, Ramesh Gopinath, and Scott Chen (2000). “Maximum Likelihood Discriminant Feature Spaces.” In Proceedings of icassp. pp. 1129–1132.
Lawrence Saul and Mazin Rahim (2000). “Maximum Likelihood and Minimum Classification Error Factor Analysis for Automatic Speech Recognition.” ieee Transactions on Speech and Audio Processing 8, pp. 115–125.
Mike Seltzer, Alex Acero, and Kaustubh Kalgaonkar (2010). “Acoustic Model Adaptation via Linear Spline Interpolation for Robust Speech Recognition.” In Proceedings of icassp. pp. 4550–4553.
Yusuke Shinohara and Masami Akamine (2009). “Bayesian feature enhancement using a mixture of unscented transformations for uncertainty decoding of noisy speech.” In Proceedings of icassp. pp. 4569–4572.
K. C. Sim and M. J. F. Gales (2004). “Basis superposition precision matrix modelling for large vocabulary continuous speech recognition.” In Proceedings of icassp. vol. i, pp. 801–804.
K. C. Sim and M. J. F. Gales (2005). “Adaptation of Precision Matrix Models on Large Vocabulary Continuous Speech Recognition.” In Proceedings of icassp. vol. i, pp. 97–100.
Veronique Stouten, Hugo Van hamme, and Patrick Wambacq (2004a). “Accounting for the Uncertainty of Speech Estimates in the Context of Model-Based Feature Enhancement.” In Proceedings of icslp. pp. 105–108.
Veronique Stouten, Hugo Van hamme, and Patrick Wambacq (2004b). “Joint Removal of Additive and Convolutional Noise with Model-Based Feature Enhancement.” In Proceedings of icassp. pp. 949–952.
Veronique Stouten, Hugo Van hamme, and Patrick Wambacq (2005). “Effect of phase-sensitive environment model and higher order vts on noisy speech feature enhancement.” In Proceedings of icassp. pp. 433–436.
Yee Whye Teh (2006). “A Hierarchical Bayesian Language Model Based On Pitman-Yor Processes.” In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Sydney, Australia, pp. 985–992.
R. C. van Dalen, F. Flego, and M. J. F. Gales (2009). “Transforming Features to Compensate Speech Recogniser Models for Noise.” In Proceedings of Interspeech. pp. 2499–2502.
R. C. van Dalen and M. J. F. Gales (2008). “Covariance Modelling for Noise Robust Speech Recognition.” In Proceedings of Interspeech. pp. 2000–2003.
R. C. van Dalen and M. J. F. Gales (2009a). “Extended vts for noise-robust speech recognition.” Tech. Rep. cued/f-infeng/tr.636, Cambridge University Engineering Department.
R. C. van Dalen and M. J. F. Gales (2009b). “Extended vts for Noise-Robust Speech Recognition.” In Proceedings of icassp. pp. 3829–3832.
R. C. van Dalen and M. J. F. Gales (2010a). “Asymptotically Exact Noise-Corrupted Speech Likelihoods.” In Proceedings of Interspeech. pp. 709–712.
R. C. van Dalen and M. J. F. Gales (2010b). “A Theoretical Bound for Noise-Robust Speech Recognition.” Tech. Rep. cued/f-infeng/tr.648, Cambridge University Engineering Department.
Rogier C. van Dalen (2007). Optimal Feature Spaces for Noise-Robust Speech Recognition. Master’s thesis, University of Cambridge.
Rogier C. van Dalen and Mark J. F. Gales (2011). “Extended vts for Noise-Robust Speech Recognition.” ieee Transactions on Audio, Speech, and Language Processing 19 (4), pp. 733–743.
Vincent Vanhoucke and Ananth Sankar (2004). “Mixtures of Inverse Covariances.” ieee Transactions on Speech and Audio Processing 13, pp. 250–264.
A. Varga and H. J. M. Steeneken (1993). “Assessment for automatic speech recognition ii: noisex-92: A database and an experiment to study the effect of additive noise on speech recognition systems.” Speech Communication 12 (3), pp. 247–251.
A. P. Varga and R. K. Moore (1990). “Hidden Markov model decomposition of speech and noise.” In Proceedings of icassp. pp. 845–848.
A. J. Viterbi (1982). “Error bounds for convolutional codes and asymptotically optimum decoding algorithm.” ieee Transactions on Information Theory 13, pp. 260–269.
Shinji Watanabe, Yasuhiro Minami, Atsushi Nakamura, and Naonori Ueda (2004). “Variational Bayesian estimation and clustering for speech recognition.” ieee Transactions on Speech and Audio Processing 12, pp. 365–381.
Pascal Wiggers, Leon Rothkrantz, and Rob van de Lisdonk (2010). “Design and Implementation of a Bayesian Network Speech Recognizer.” In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala (eds.), Text, Speech and Dialogue, Springer, Berlin/Heidelberg, Lecture Notes in Computer Science, vol. 6231, pp. 447–454.
Haitian Xu and K. K. Chin (2009a). “Comparison of estimation techniques in joint uncertainty decoding for noise robust speech recognition.” In Proceedings of Interspeech. pp. 2403–2406.
Haitian Xu and K. K. Chin (2009b). “Joint uncertainty decoding with the second order approximation for noise robust speech recognition.” In Proceedings of icassp. pp. 3841–3844.
Haitian Xu, M. J. F. Gales, and K. K. Chin (2011). “Joint Uncertainty Decoding With Predictive Methods for Noise Robust Speech Recognition.” ieee Transactions on Audio, Speech, and Language Processing 19 (6), pp. 1665–1676.
Haitian Xu, Mark J. F. Gales, and K. K. Chin (2009). “Improving Joint Uncertainty Decoding Performance by Predictive Methods for Noise Robust Speech Recognition.” In Proceedings of asru. pp. 222–227.
Haitian Xu, Luca Rigazio, and David Kryze (2006). “Vector Taylor series based joint uncertainty decoding.” In Proceedings of Interspeech. pp. 1125–1128.
Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying (Andrew) Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, Valtcho Valtchev, and Phil Woodland (2006). “The htk book (for htk Version 3.4).” url: 〈http://htk.eng.cam.ac.uk/docs/docs.shtml〉.
Kai Yu (2006). Adaptive Training for Large Vocabulary Continuous Speech Recognition. Ph.D. thesis, Cambridge University.
Kai Yu and Mark J. F. Gales (2007). “Bayesian Adaptive Inference and Adaptive Training.” ieee Transactions on Audio, Speech, and Language Processing 15 (6), pp. 1932–1943.