Maximum Likelihood Parameter Estimation in Time Series Models Using Sequential Monte Carlo

Sinan Yıldırım
Darwin College
Department of Pure Mathematics and Mathematical Statistics
University of Cambridge

A thesis submitted for the degree of Doctor of Philosophy

To Selcan and İlhan...

Declaration

This dissertation is the result of work carried out by myself between October 2009 and October 2012. It includes nothing which is the outcome of work done in collaboration with others, except as specified in the text.

Signed: Sinan Yıldırım

Acknowledgements

I would like to thank my supervisor, Dr. Sumeetpal S. Singh, for his supervision, assistance, and friendship. Having greatly benefited from working with him for over three years, I am very glad to have been his PhD student. I would like to express my gratitude to my advisor Prof. A. Philip Dawid, who has been extremely kind and supportive throughout my PhD. Many thanks to my collaborators Dr. A. Taylan Cemgil, Dr. Tom Dean, Lan Jiang, and Prof. Arnaud Doucet for their invaluable contributions to the development of this thesis. I must thank my proofreaders Ozan Aksoy and Peter Bunch, whose efforts have improved the presentation of this work significantly. Also, many thanks to all my friends for providing the happy moments of this stressful period. Besides my special thanks to my 'mentor' Prof. Mihalis Dafermos, my housemate Yasemin Aslan, and my 'music-mate' Dr. Kyriacos Leptos, I owe a special word of appreciation to Selcan Deniz Kolağasıoğlu, who shared all the happiness, sorrow, and excitement with me. Finally, I would like to thank my parents and relatives, especially my dear brother İlhan and my best cousin and best friend Hasan İlkay Çelik, for their endless support and encouragement all these years.

Abstract

Time series models are used to characterise uncertainty in many real-world dynamical phenomena. A time series model typically contains a static variable, called the parameter, which parametrises the joint law of the random variables involved in the definition of the model. When a time series model is to be fitted to some sequentially observed data, it is essential to decide on the value of the parameter that describes the data best, a procedure generally called parameter estimation.

This thesis comprises novel contributions to the methodology of parameter estimation in time series models. Our primary interest is online estimation, although batch estimation is also considered. The developed methods are based on batch and online versions of expectation-maximisation (EM) and gradient ascent, two popular algorithms for maximum likelihood estimation (MLE). In the last two decades, the range of statistical models where parameter estimation can be performed has been significantly extended with the development of Monte Carlo methods. We contribute to the field in a similar manner, namely by combining EM and gradient ascent algorithms with sequential Monte Carlo (SMC) techniques. The time series models we investigate are widely used in statistical and engineering applications.

The original work of this thesis is organised in Chapters 4 to 7. Chapter 4 contains an online EM algorithm using SMC for MLE in changepoint models, which are widely used to model heterogeneity in sequential data. In Chapter 5, we present batch and online EM algorithms using SMC for MLE in linear Gaussian multiple target tracking models.
Chapter 6 contains a novel methodology for implementing MLE in a hidden Markov model having intractable probability densities for its observations. Finally, in Chapter 7 we formulate the nonnegative matrix factorisation problem as MLE in a specific hidden Markov model and propose online EM algorithms using SMC to perform MLE.

Contents

List of Figures
List of Tables
List of Abbreviations
1 Introduction
  1.1 Context
    1.1.1 Time series models
    1.1.2 Sequential inference and Monte Carlo
    1.1.3 Online parameter estimation
    1.1.4 Bayesian estimation vs maximum likelihood estimation
  1.2 Scope of the thesis
  1.3 Outline
  1.4 Notation
2 Monte Carlo Methods for Statistical Inference
  2.1 Introduction
  2.2 Perfect Monte Carlo
    2.2.1 Inversion sampling
    2.2.2 Rejection sampling
  2.3 Importance sampling
    2.3.1 Self-normalised importance sampling
  2.4 Markov chain Monte Carlo
    2.4.1 Discrete time Markov chains
    2.4.2 Metropolis-Hastings
    2.4.3 Gibbs sampling
  2.5 Sequential Monte Carlo
    2.5.1 Sequential importance sampling
    2.5.2 Sequential importance sampling resampling
    2.5.3 Auxiliary particle filter
    2.5.4 Sequential Monte Carlo samplers
  2.6 Approximate Bayesian computation
3 Hidden Markov Models and Parameter Estimation
  3.1 Introduction
  3.2 Hidden Markov models
    3.2.1 Extensions to HMMs
  3.3 Sequential inference in HMMs
    3.3.1 Bayesian optimal filtering
    3.3.2 Particle filters for optimal filtering
    3.3.3 The marginal particle filter
    3.3.4 The Rao-Blackwellised particle filter
    3.3.5 Application of SMC to smoothing additive functionals
      3.3.5.1 Forward filtering backward smoothing
      3.3.5.2 Forward-only smoothing
  3.4 Static parameter estimation in HMMs
    3.4.1 Direct maximisation of the likelihood
    3.4.2 Gradient ascent maximum likelihood
      3.4.2.1 Online gradient ascent
    3.4.3 Expectation-Maximisation
      3.4.3.1 Stochastic versions of EM
      3.4.3.2 Online EM
    3.4.4 Iterated filtering
    3.4.5 Discussion of the MLE methods
4 An Online Expectation-Maximisation Algorithm for Changepoint Models
  4.1 Introduction
  4.2 The changepoint model
  4.3 EM algorithms for changepoint models
    4.3.1 Batch EM
    4.3.2 Online EM
    4.3.3 SMC implementations of the online EM algorithm
    4.3.4 Comparison with the path space online EM
  4.4 Theoretical results
  4.5 Numerical examples
    4.5.1 Simulated experiments
      4.5.1.1 Online EM applied to long data sequence
      4.5.1.2 Comparison between online and batch EM for a short data sequence
      4.5.1.3 Comparison with the path space method
    4.5.2 GC content in the DNA of Human Chromosome no. 2
  4.6 Discussion
  4.A Appendix
    4.A.1 Derivation of Hk in (4.4)
    4.A.2 Derivation of the EM algorithm for the model in Section 4.5
    4.A.3 Proof of Proposition 4.1
      4.A.3.1 Verification of Example 4.2 satisfying the conditions of Proposition 4.1
5 Estimating the Static Parameters in Linear Gaussian Multiple Target Tracking Models
  5.1 Introduction
    5.1.1 Notation
  5.2 Multiple target tracking model
  5.3 EM algorithms for MTT
    5.3.1 Batch EM for MTT
      5.3.1.1 Estimation of sufficient statistics
      5.3.1.2 Stochastic versions of EM
    5.3.2 Online EM for MTT
      5.3.2.1 Online smoothing in a single GLSSM
      5.3.2.2 Application to MTT
      5.3.2.3 Online EM implementation
  5.4 Experiments and results
    5.4.1 Batch setting
    5.4.2 Online EM setting
      5.4.2.1 Unknown fixed number of targets
      5.4.2.2 Unknown time varying number of targets
  5.5 Conclusion and Discussion
  5.A Appendix
    5.A.1 Recursive updates for sufficient statistics in a single GLSSM
    5.A.2 SMC algorithm for MTT
    5.A.3 Computational complexity of EM algorithms
      5.A.3.1 Computational complexity of SMC filtering
      5.A.3.2 SMC-EM for the batch setting
      5.A.3.3 SMC online EM
6 Approximate Bayesian Computation for Maximum Likelihood Estimation in Hidden Markov Models
  6.1 Introduction
    6.1.1 Hidden Markov models
    6.1.2 Parameter estimation
    6.1.3 Approximate Bayesian computation for parameter estimation
    6.1.4 Outline of the chapter
  6.2 ABC MLE approaches for HMM
    6.2.1 Standard ABC MLE
    6.2.2 Noisy ABC MLE
    6.2.3 Smoothed ABC MLE
    6.2.4 Summary
  6.3 Implementing ABC MLE
      6.3.0.1 SMC algorithm for the expanded HMM
    6.3.1 Gradient ascent ABC MLE
      6.3.1.1 Batch gradient ascent
      6.3.1.2 Online gradient ascent
      6.3.1.3 Controlling the stability
      6.3.1.4 Special case: i.i.d. random variables with an intractable density
    6.3.2 Expectation-maximisation
  6.4 Numerical examples
    6.4.1 MLE for α-stable distribution
    6.4.2 MLE for g-and-k distribution
    6.4.3 The stochastic volatility model with symmetric α-stable returns
  6.5 Discussion
7 An Online Expectation-Maximisation Algorithm for Nonnegative Matrix Factorisation Models
  7.1 Introduction
    7.1.1 Notation
  7.2 The Statistical Model for NMF
    7.2.1 Relation to the classical NMF
  7.3 EM algorithms for NMF
    7.3.1 Batch EM
    7.3.2 Online EM
    7.3.3 SMC implementation of the online EM algorithm
  7.4 Numerical examples
    7.4.1 Multiple basis selection model
    7.4.2 A relaxation of the multiple basis selection model
  7.5 Discussion
8 Conclusions
  8.1 Contributions
  8.2 Future directions
References

List of Figures

4.1 SMC-FS online EM estimates vs time for a long simulated data sequence. The true parameter values are indicated with a horizontal line.
4.2 SMC-FS online EM estimates vs number of passes for the concatenated data set {y1:2000, y1:2000, . . .}, where each pass is one complete sweep through y1:2000. The true parameter values: α = 10, β = 0.1, ξ1 = 1.78, ξ2 = 3.56, κ1 = 0.30, κ2 = 0.03, λ1 = λ2 = 0.1, Pi,j = 0.5.
4.3 SMC-FS batch EM estimates vs number of iterations for the same y1:2000 used to produce the results in Figure 4.2.
4.4 Comparison of the forward smoothing and the path space methods in terms of the variability in the estimates of S16,n. The box plots and the relative variance plot are generated from 100 Monte Carlo simulations using the same observation data.
4.5 Comparison of SMC-FS online EM and SMC-PS online EM in terms of the variability in their estimates of λ1 = 0.1. The two plots at the top are generated by superimposing different estimates; the box plots and the relative variance plot are generated from estimates out of 100 different Monte Carlo runs using the same observation data.
4.6 Noisy GC content over 3 kb windows in human DNA chromosome 2.
4.7 Online EM estimates vs number of passes over the data sequence in Figure 4.6.
5.1 Top: the list of the random variables in the MTT model. Bottom: a realisation of an MTT model; states of a target are connected with arrows, and observations generated from targets are connected to those targets with arrows. Mis-detected targets are highlighted with shadows, and observations from false measurements are coloured grey.
5.2 Batch estimates obtained using the SMC-EM algorithm for MLE. θ∗,z is shown as a cross.
5.3 Comparison of online SMC-EM estimates applied to the concatenated data (thicker line) with batch SMC-EM.
5.4 Online estimates of the SMC-EM algorithm (Algorithm 5.3) for a fixed number of targets. True values are indicated with a horizontal line. Initial estimates for pd, λf, σ²xv, σ²y are 0.6, 15, 0.25, 25; they are not shown in order to zoom in around the converged values.
5.5 Left: estimates of pθ1:t(y1:t|K) (normalised by t) for t = 100, . . . , t = 500 and for K = 6, . . . , K = 15. Right: estimates of pθ1:t(y1:t|K) normalised by t for K = 6, . . . , K = 15; K = 10 is stressed with a bold plot.
5.6 Estimates of the online SMC-EM algorithm (Algorithm 5.3) for an MTT model with a time varying number of targets, compared with online EM estimates when the true data association {Zt}t≥1 is known. For the estimates in the case of known true association, only θ1000, θ2000, . . . , θ100000 are shown.
True values are indicated with a horizontal line.
5.7 SMC online EM estimates when birth-death times are known (solid line) compared to the original results in Figure 5.6 (dashed lines). For illustrative purposes, every 1000th estimate is shown.
6.1 Histograms of Monte Carlo estimates of gradients of log p^{ǫ,κ,ψ}_θ(Y^{ǫ,κ,ψ}) w.r.t. the parameters of the α-stable distribution, with tan⁻¹(·) being used. 10⁵ samples were used for generating the histograms.
6.2 Top: online estimation of α-stable parameters from a sequence of i.i.d. random variables using online gradient ascent MLE. True parameters (α, β, µ, σ) = (1.5, 0.2, 0, 0.5) are indicated with a horizontal line. Bottom: gradient of the incremental likelihood for the α-stable parameters.
6.3 S-ABC MLE and SN-ABC MLE estimates of the parameters of the α-stable distribution (averaged over 50 runs) using the online gradient ascent algorithm for the same data set. For SN-ABC MLE, a different noisy data sequence obtained from the original data set is used in each run. True parameters (α, β, µ, σ) = (1.5, 0.2, 0, 0.5) are indicated with a horizontal line.
6.4 Mean and variance (over 50 runs) of SN-ABC MLE estimates using the online gradient ascent algorithm. The same noisy data sequence is used in each run. True parameters (g, k, A, B) = (2, 0.5, 10, 2) are indicated with a horizontal line.
6.5 Top: SN-ABC MLE estimates of g-and-k parameters from a sequence of i.i.d. random variables using the batch gradient ascent algorithm. True parameters (g, k, A, B) = (2, 0.5, 10, 2) are indicated with a horizontal line. Bottom: approximate distributions (histograms over 20 bins) of the estimates.
6.6 Online estimation of SVαR parameters using the online gradient ascent algorithm to implement SN-ABC MLE. True parameter values (α, φ, σ²x) = (1.9, 0.9, 0.1) are indicated with a horizontal line.
6.7 Online estimation of SVαR parameters (α = 1.9 is known) using the online EM algorithm to implement SN-ABC MLE. True parameter values (φ, σ²x) = (0.9, 0.1) are indicated with a horizontal line.
7.1 Online estimation of B in the NMF model in Section 7.4.1 using the exact implementation of online EM for NMF. The (i, j)'th subfigure shows the estimation result for B(i, j) (horizontal lines).
7.2 A realisation of {Xt(1)}t≥1 for α = 0.95.
7.3 Online estimation of B in the NMF model in Section 7.4.2 using Algorithm 7.1. The (i, j)'th subfigure shows the estimation result for B(i, j) (horizontal lines).

List of Tables

5.1 The list of the EM variables used in Section 5.3.
6.1 A comparison of ABC MLE approaches.

List of Abbreviations

a.e.  almost everywhere
a.s.  almost surely
ABC  approximate Bayesian computation
AMPF  auxiliary marginal particle filter
CPHD  cardinalised PHD
EM  expectation-maximisation
FFBS  forward filtering backward smoothing
GLSSM  Gaussian linear state-space model
HMM  hidden Markov model
i.i.d.  independently and identically distributed
JPDAF  joint probabilistic data association filter
MCEM  Monte Carlo EM
MCMC  Markov chain Monte Carlo
MCMC-DA  MCMC-data association
MHT  multiple hypothesis tracking
MLE  maximum likelihood estimation
MPF  marginal particle filter
MTT  multiple target tracking
NMF  nonnegative matrix factorisation
PHD  probability hypothesis density
PMCMC  particle Markov chain Monte Carlo
PMHT  probabilistic MHT
RBPF  Rao-Blackwellised particle filter
S-ABC MLE  smoothed ABC MLE
SAEM  stochastic approximation EM
SEM  stochastic EM
SIS  sequential importance sampling
SISR  sequential importance sampling resampling
SMC  sequential Monte Carlo
SN-ABC MLE  smoothed noisy ABC MLE
SVαR  stochastic volatility model with α-stable returns

Chapter 1

Introduction

1.1 Context

1.1.1 Time series models

In probability theory and statistics, stochastic processes are used to capture uncertainty in many real-world dynamical phenomena. A stochastic process can be thought to evolve in time either continuously or discretely; in this thesis we will only consider discrete time stochastic processes. In the literature, a large number of different discrete time stochastic processes can be represented under the family of generative dynamical models called time series models. A parametric time series model consists of random variables that describe the modelled process with adequate generality, and these random variables admit probability laws that are parametrised by a vector-valued static variable. This variable is generally denoted by θ and called the static parameter, or simply the parameter, of the model.

A time series model associated with a stochastic process is generative. That is, when simulated, the model produces a realisation of a sequence of observable random variables {Yt}t≥1 of the stochastic process over time. Typically {Yt}t≥1 are only a subset of the random variables that comprise the time series model; the rest of the random variables are called latent, or hidden, variables. In many cases, observable variables are considered to be somewhat noisy measurements of an underlying structure which is of primary interest. The power of a time series model is its ability to provide a rigorous mathematical formulation of this underlying structure, as well as of its relation to {Yt}t≥1 via its latent variables. This helps the scientist infer the latent variables from an observed time series in a principled way by employing well-established methods from statistics.

1.1.2 Sequential inference and Monte Carlo

In many time series models, the latent variables themselves are lumped together to form another random process {Xt}t≥1. This process represents the hidden state of interest evolving dynamically, typically in a Markovian fashion. An example of this is a hidden Markov model (HMM), sometimes called a state-space model. In a HMM, {Xt}t≥1 is a Markov process and each Yt is a conditionally independent observation generated by Xt, the evolving state at time t. (For a review of HMMs in a closely related context, see Cappé et al. [2005].)

In the literature, the problem of sequential Bayesian estimation of Xt based on the sequentially observed variables Y1, . . . , Yt is known as the optimum Bayesian filtering problem. When the time series model has linear and Gaussian dynamics, the exact solution of this problem is given by the Kalman filter. However, in non-linear non-Gaussian models, numerical approximations must be used.
Sequential Monte Carlo (SMC) methods, also known as particle filters, are the most popular numerical methods for approximate solutions of the optimum Bayesian filtering problem [Doucet et al., 2000b; Durbin and Koopman, 2000; Gordon et al., 1993; Kitagawa, 1996; Liu and Chen, 1998]. These methods are a special class of Monte Carlo methods, which rely on the basic idea of simulating from probability distributions when analytical evaluation of quantities involving these probability distributions cannot be performed [Metropolis and Ulam, 1949]. Although originally developed for HMMs, SMC methods can often easily be extended to more general time series models. A review of SMC methods is presented in Section 2.5, and their application to HMMs is reviewed in Section 3.3.

1.1.3 Online parameter estimation

For the case when the true value of the static parameter of the time series model, which we will denote by θ∗ throughout the thesis, is known, numerous SMC methods have been proposed and successfully applied to the Bayesian optimal filtering problem over the last two decades. (See Cappé et al. [2007]; Doucet and Johansen [2009]; Fearnhead [2008] for recent reviews of the methodology.) However, in realistic applications θ∗ is hardly ever known, although its estimation is essential for accurate inference of the latent variables of the model. Therefore, developing efficient and accurate parameter estimation methods for time series models is of significant importance.

Classical methods used for parameter estimation process the observed data in a batch fashion, i.e. they require several complete passes through the entire data set. In this thesis, we are primarily concerned with developing online parameter estimation algorithms. With the advancement of sensor and storage technologies, and with the significantly reduced costs of data acquisition, we are able to collect and record vast amounts of raw data. Arguably, the grand challenge facing computation in the 21st century is the effective handling of such large data sets. Unfortunately, classical batch processing methods fail with very large data sets due to memory restrictions and long computation times. For this reason, so-called online methods have recently gained popularity in the area. The main principle of these methods is that a current estimate, obtained using the data available so far, is updated whenever a new portion of data is received. Based on this principle, online methods are promising in terms of reducing both memory and computation requirements; hence they are potentially a powerful alternative to batch methods.

1.1.4 Bayesian estimation vs maximum likelihood estimation

There are two different approaches to static parameter estimation: Bayesian and maximum likelihood. Bayesian parameter estimation requires the assignment of a prior distribution for the unknown parameter θ. The objective is then to calculate the posterior distribution of θ given the observed data. When a point estimate of θ∗ is required, some feature of this posterior distribution can be provided. The common Bayesian estimators are the posterior mean, the posterior median, and the posterior mode, or the maximum a posteriori probability (MAP) estimate. There are several Monte Carlo based methods for Bayesian parameter estimation when exact calculation of the posterior distribution is not available.
Alternatively, the maximum likelihood approach regards the likelihood of the observed data, viewed as a function of θ, as containing all the relevant information for estimating θ∗. The point estimate of θ∗ is the maximising argument of the likelihood. When maximum likelihood estimation (MLE) cannot be performed analytically, iterative search-based algorithms such as expectation-maximisation (EM) and gradient ascent guarantee local maximisation of the likelihood, given certain regularity conditions on the densities of the random variables involved. Also, Monte Carlo versions of these algorithms have been developed and applied successfully to many time series models. See Kantas et al. [2009] for a comprehensive review of SMC methods for Bayesian and maximum likelihood parameter estimation, or Section 3.4 for a briefer discussion.

Whether one should in principle use the Bayesian or the maximum likelihood approach for estimating θ∗ is a fundamental debate which we will not go into. There are indeed cases when these two approaches do produce dramatically different suggestions of what θ∗ might be, especially when the observed data is of small size and a highly informative prior is used for Bayesian estimation. However, as the data size tends to infinity, the likelihood of the data sweeps away the effect of the prior in the posterior distribution, and the difference between the estimates of the two approaches vanishes (say, when the MAP estimate is used for Bayesian estimation), provided that the prior is well behaved (i.e. it does not assign zero density to any 'feasible' parameter value). Therefore, in an online estimation setting, where the data size is presumably very large, the two approaches are expected to give almost identical results if they could be implemented exactly. Thus, for the practitioner, the choice of online parameter estimation method depends on which has the most favourable properties in terms of computational costs and memory requirements, rather than on philosophical concerns that would matter if the outcomes of the Bayesian and maximum likelihood approaches differed significantly. Moreover, when a parameter estimation method involves any sort of Monte Carlo approximation, this brings with it the additional requirement that the statistical properties of the method, such as the bias and variance of its estimator, be taken into consideration.

Given these concerns, one can argue that the online MLE methods proposed in the literature so far are preferable to their Bayesian counterparts. Online Bayesian estimation methods, in one way or another, are based on including the static parameter in the hidden state of the time series model and casting the online parameter estimation problem as a filtering one. Unfortunately, when the data size is large, these methods suffer from the particle degeneracy which is inherent in SMC filtering; see e.g. Andrieu et al. [2005]; Olsson et al. [2008] for a discussion. There are certain techniques proposed to overcome the degeneracy problem, such as those based on Markov chain Monte Carlo (MCMC) moves for the parameter (e.g. Gilks and Berzuini [2001]; Polson et al. [2008]) or on introducing artificial dynamics on the parameter (e.g. Campillo and Rossi [2009]; Higuchi [2001]; Kitagawa [1998]). But all these techniques either still suffer from the particle degeneracy problem or come with the price of bias and tuning difficulties, or both; see Section 3.4 for more discussion or Kantas et al. [2009] for even more details.
On the other hand, online MLE methods based on Monte Carlo are more promising due to their favourable stability properties and reasonable computational and memory requirements. Recently, SMC based online EM and online gradient ascent algorithms for hidden Markov models have been proposed and analysed in several works, such as Cappé [2009]; Del Moral et al. [2009, 2011]; Poyiadjis et al. [2011]. It has been shown in these works that the variance of the estimators in these algorithms either remains constant over time or decays, depending on the SMC scheme used. For these reasons, we focus on online MLE methods for parameter estimation in this thesis.

1.2 Scope of the thesis

SMC based MLE methods are present in the literature, and they have been applied successfully to many important time series models, especially to a large proportion of HMMs. However, there are still many important types of time series models for which the methods developed so far are not directly applicable. This thesis aims to develop MLE methods, especially online MLE methods, for some non-standard time series models using Monte Carlo. Below we list and summarise the topics we investigate in this thesis.

• Changepoint models: One example of a time series model is the changepoint model, which is commonly used to model heterogeneity of sequential data in a range of areas such as engineering, the physical and biological sciences, and finance. Having a segmented structure introduced by changepoints, the model differs from a HMM, which makes it both interesting and challenging to study its statistical aspects and to estimate its parameters.

• Multiple target tracking models: Another challenging problem in the areas of applied statistics and engineering is multiple target tracking (MTT). In MTT, the main objective is to simultaneously track several moving objects in a surveillance region under far from ideal conditions that introduce random mis-detection of targets and false measurements. The additional issues of a time varying unknown number of targets and the unknown association of targets to observation points make the problem even more challenging. In this thesis we restrict ourselves to the linear Gaussian MTT model. Statistical treatment of the dynamics of the MTT model is well established, and Monte Carlo methods are available for the estimation of the latent states in the tracking problem. However, the problem of calibrating the MTT model by estimating its static parameters has largely been ignored.

• HMMs with intractable observation densities: An important computational problem studied in this thesis is that of implementing batch and online MLE in a HMM where the conditional law of the observations is intractable, that is, its probability density is either analytically unavailable or prohibitively expensive to calculate. Due to this intractability, the online MLE methods developed for HMMs are not directly applicable, since the computation of quantities required by those methods becomes impossible. Approximate Bayesian computation (ABC) has become an increasingly popular strategy for confronting intractability in many statistical models. The adaptation of ABC to HMMs, resulting in an SMC-ABC scheme, has recently been demonstrated. Moreover, theoretical analysis of the properties of MLE based on this SMC-ABC scheme has been performed. However, methods for implementing MLE using SMC-ABC have not yet been developed, and we believe that solving this implementation problem would be a valuable contribution to the literature.
• Nonnegative matrix factorisation: Another interesting problem where online statistical estimation methods are of use is the nonnegative matrix factorisation (NMF) problem, where a given nonnegative matrix Y is to be approximated by the product BX of two nonnegative matrices. In many applications, B is considered as a matrix of 'basis' vectors and X is a 'gain' matrix determining which of the columns in B dominate the columns of Y. Our approach to the NMF problem is to consider it as a parameter estimation problem for a HMM whose latent and observed processes are the columns of X and Y, respectively. This approach is useful for handling the case where the columns of Y are generated sequentially in time, such as in audio signal processing. Usually, the very large number of columns in Y leads to the necessity of online algorithms to learn the model and make inference.

This thesis aims to contribute to the methodology of MLE, especially online MLE, in time series models within the contexts of the topics summarised above. We present novel EM and gradient ascent methods implemented with SMC. Statistical and computational aspects of the developed methods will be studied, mostly using numerical experiments.

1.3 Outline

The material presented in the rest of the thesis is organised in six main chapters, followed by a final concluding chapter, as follows.

Chapter 2: Monte Carlo Methods for Statistical Inference
This chapter provides a survey of the Monte Carlo literature. We will review some basic Monte Carlo methods, such as rejection sampling, importance sampling, MCMC, SMC, and ABC.

Chapter 3: Hidden Markov Models and Parameter Estimation
We introduce hidden Markov models and review Monte Carlo methods for filtering and parameter estimation in hidden Markov models.

Chapter 4: An Online Expectation-Maximisation Algorithm for Changepoint Models
We present a novel online EM algorithm using SMC for changepoint models. We also provide theoretical and numerical stability analysis for the developed algorithm.

Chapter 5: Estimating the Static Parameters in Linear Gaussian Multiple Target Tracking Models
We present novel batch and online EM algorithms using SMC for linear Gaussian MTT models. The algorithms are based on the availability of exact EM algorithms in a single linear Gaussian state-space model, but involve SMC for tracking the unknown data association inherent in the model.

Chapter 6: Approximate Bayesian Computation for Maximum Likelihood Estimation in Hidden Markov Models
We present methods for implementing MLE in HMMs with intractable observation densities. An SMC based ABC scheme is our main tool for dealing with the intractability inherent in these HMMs, and batch and online gradient ascent algorithms using SMC are shown to be suitable for this scheme.

Chapter 7: An Online Expectation-Maximisation Algorithm for Nonnegative Matrix Factorisation Models
We formulate the nonnegative matrix factorisation (NMF) problem as an MLE problem for HMMs and propose online EM algorithms using SMC to estimate the NMF and the other unknown static parameters.

1.4 Notation

It will be useful to summarise the notation used throughout this thesis.
The notation presented here will be used consistently in the literature review part of the thesis (Chapters 2 and 3); however, the particular requirements of Chapters 4, 5, 6, and 7, which contain the original work, are such that there is inevitably some conflict with the desire to be consistent with standard usage within the literature. The reader will be notified whenever we deviate from the notation or any additional notation is introduced.

We use N and R to denote the sets of natural numbers and real numbers, respectively. For a sequence {a_k}_{k≥1} and integers i, j, we let a_{i:j} denote the set {a_i, . . . , a_j}, which is empty if j < i, and a_{i:∞} = {a_i, a_{i+1}, . . .}.

Probability measures, integrals, and random variables: Given a general measurable space (X, E), we refer to the set of all σ-finite measures on that space as M(X). The set of all probability measures on (X, E) is denoted P(X) ⊂ M(X). We use B_b(X) to denote the Banach space of bounded real valued measurable functions on X. The integration of a real-valued measurable function ϕ on X with respect to a measure µ ∈ M(X) is denoted
\[ \mu(\varphi) = \int_{\mathcal{X}} \varphi(x)\,\mu(dx), \quad \forall \mu \in \mathcal{M}(\mathcal{X}). \]
Also, for A ∈ E, µ(A) = µ(I_A), where I_A : X → {0, 1} is the indicator function, so that I_A(x) is 1 if x ∈ A and 0 otherwise. Finally, δ_x is the Dirac measure satisfying δ_x(A) = I_A(x) for all A ∈ E.

Let (Ω, F, P) be a probability space and (X, E) a measurable space. An E/F-measurable function X : Ω → X is called an (X, E)-valued random variable. The probability measure π on (X, E) corresponding to the law of X is given by P ◦ X⁻¹, so that
\[ \pi(A) = \mathbb{P}\left[X^{-1}(A)\right], \quad \forall A \in \mathcal{E}. \]
We will denote the expectation of ϕ with respect to π as
\[ \mathbb{E}_\pi[\varphi(X)] = \pi(\varphi), \quad \forall \pi \in \mathcal{P}(\mathcal{X}). \]
Both the expression on the left and that on the right of the equality will be used. If π is parametrised by a vector θ, we will denote it by π_θ and write E_θ[ϕ(X)] to mean E_{π_θ}[ϕ(X)]. Finally, capital letters X, Y, Z, etc. will be used to denote random variables, whereas the corresponding small letters x, y, z, etc. will be used for their realisations.

Let π be the law of X. We write π ≪ λ to mean that π is absolutely continuous with respect to the dominating measure λ, and we call the Radon-Nikodým derivative ν = dπ/dλ the density of π (or the probability density of X) with respect to λ. Throughout the chapters of this thesis containing original work, λ will be either the Lebesgue measure or the counting measure, and λ(dx) will be replaced by dx for simplicity. To make the law of X explicit, we interchangeably use X ∼ π and X ∼ ν.

Markov kernels: Given two measurable spaces (X₁, E₁) and (X₂, E₂), we define a Markov kernel, or transition kernel, as a map K : X₁ → P(X₂) satisfying the following two conditions:
• for all x ∈ X₁, K(x, ·) is a probability measure in P(X₂);
• for all A ∈ E₂, K(·, A) is a nonnegative measurable function with respect to E₁ on X₁.
A Markov kernel induces two operators, the first one on M(X₁) and the second one on the bounded E₂-measurable functions on X₂:
\[ \mu K(dy) = \int_{\mathcal{X}_1} \mu(dx)\,K(x, dy), \quad \forall \mu \in \mathcal{M}(\mathcal{X}_1), \]
\[ K(\varphi)(x) = \int_{\mathcal{X}_2} \varphi(y)\,K(x, dy), \quad \forall \varphi \in B_b(\mathcal{X}_2). \]
Using the first operation, a probability measure µ ∈ P(X₁) is mapped by K to another probability measure µK ∈ P(X₂). Also, when we wish to consider the joint distribution induced over (X₁ × X₂, E₁ × E₂) by a measure π and a Markov kernel K : X₁ → P(X₂), we use the notation π ⊗ K, i.e. π ⊗ K(dx, dy) = π(dx) K(x, dy). Moreover, given a sequence of measurable spaces {(X_n, E_n)}_{n≥1} and a sequence of Markov kernels {K_n : X_{n-1} → P(X_n)}_{n≥2}, we define, for q ≥ p ≥ 1,
\[ K_{p:q}(x_p, dx_{p+1:q}) = K_{p+1} \otimes \cdots \otimes K_q(x_p, dx_{p+1:q}) = \prod_{i=p+1}^{q} K_i(x_{i-1}, dx_i), \]
and we define the corresponding operator on the bounded E_{p+1} ⊗ · · · ⊗ E_q-measurable functions as
\[ K_{p:q}(\varphi)(x_p) = \int_{\mathcal{X}_{p+1}} \cdots \int_{\mathcal{X}_q} \varphi(x_{p+1:q}) \prod_{i=p+1}^{q} K_i(x_{i-1}, dx_i), \quad \forall \varphi \in B_b(\mathcal{X}_{p+1} \times \cdots \times \mathcal{X}_q). \]
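As a concrete illustration of these two operators (our example, not one from the thesis), take X₁ = X₂ = R with µ = N(m, s²) and the Gaussian random-walk kernel K(x, ·) = N(x, σ²). Then
\[ \mu K = \mathcal{N}(m, s^2 + \sigma^2), \quad \text{since} \quad \int_{\mathbb{R}} \mathcal{N}(y; x, \sigma^2)\,\mathcal{N}(x; m, s^2)\,dx = \mathcal{N}(y; m, s^2 + \sigma^2), \]
and, for the test function ϕ(y) = y,
\[ K(\varphi)(x) = \int_{\mathbb{R}} y\,\mathcal{N}(y; x, \sigma^2)\,dy = x. \]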
Some common probability distributions: We will use N(µ, σ²) to denote the normal distribution with mean µ and variance σ²; U_A for the uniform distribution over the set A; PO(λ) for the Poisson distribution with rate λ; G(α, β) for the gamma distribution with shape α and scale β; IG(α, β) for the inverse-gamma distribution with shape α and (inverse) scale β; BE(p) for the Bernoulli distribution with success rate p; NΓ⁻¹(ξ, κ, α, β) for the normal-inverse gamma distribution, such that (X, Y) ∼ NΓ⁻¹(ξ, κ, α, β) means Y ∼ IG(α, β) and X ∼ N(ξ, Y/κ); A(α, β, µ, σ) for the α-stable distribution with shape α, skewness β, location µ and scale σ; and M(α, ρ) for the multinomial distribution with α independent trials and probability vector ρ. We also use these notations to express the corresponding probability densities. For example, N(x; µ, σ²) is the probability density of the normal distribution N(µ, σ²) evaluated at x.

Chapter 2

Monte Carlo Methods for Statistical Inference

Summary: This chapter provides a survey of the Monte Carlo literature. We will review some basic Monte Carlo methods for statistical inference that are related to the main content of this thesis. These methods are rejection sampling, importance sampling, Markov chain Monte Carlo, sequential Monte Carlo, and approximate Bayesian computation.

2.1 Introduction

Assume that we are given a probability space (Ω, F, P) and some random variable X : Ω → X which is E/F measurable. We let the probability measure π on (X, E) describe the law of X, so that π = P ◦ X⁻¹. We are interested in integrating a measurable function ϕ : X → R^dϕ with respect to the probability measure π, i.e.
\[ \pi(\varphi) = \mathbb{E}_\pi[\varphi(X)] = \int_{\mathcal{X}} \varphi(x)\,\pi(dx). \tag{2.1} \]
When analytical evaluation of (2.1) is not possible, we have to use approximations. There are deterministic numerical integration techniques available; however, these methods encounter the problem called the curse of dimensionality, since the amount of computation grows exponentially with the dimension of X [Press, 2007]. Therefore, they are far from practical and reliable except in low dimensional problems. A powerful alternative to deterministic methods for integration problems is Monte Carlo integration, where random samples from some distribution are used to approximate the integral in (2.1). The term Monte Carlo was coined in the 1940s; see Metropolis and Ulam [1949] for a first use of the term, and Eckhardt [1987]; Metropolis [1987] for a historical review.

In this chapter we will review the Monte Carlo methodology. We first present the main methods in the literature that aim to evaluate the integral in (2.1). We then proceed to sequential Monte Carlo methods, which approximate a sequence of integrals like the one in (2.1). We conclude the chapter with a review of approximate Bayesian computation, which is a name attached to a wide class of popular Monte Carlo methods aiming to tackle integrals π(ϕ) where π is a posterior distribution resulting from an intractable likelihood. Note that we restrict ourselves to the review of only those methods which are closely related to the work in this thesis. A book-length review of general Monte Carlo methods can be found in Robert and Casella [2004], and for a detailed review of sequential Monte Carlo methods one can consult the books of Doucet et al. [2001] and Del Moral [2004].

2.2 Perfect Monte Carlo

The term perfect Monte Carlo refers to those methods in which the distribution of interest π is approximated by N > 0 independent, identically distributed (i.i.d.) samples from π, and the integration of ϕ with respect to π is carried out under this approximation. The approximation to π using N i.i.d. samples X⁽¹⁾, . . . , X⁽ᴺ⁾ is given by
\[ \pi^N_{\mathrm{MC}}(dx) := \frac{1}{N}\sum_{i=1}^{N} \delta_{X^{(i)}}(dx). \]
Then, the perfect Monte Carlo approximation to π(ϕ) is obtained by substituting π with π^N_MC in (2.1):
\[ \pi^N_{\mathrm{MC}}(\varphi) = \frac{1}{N}\sum_{i=1}^{N} \varphi(X^{(i)}). \]
It is this approach which was originally referred to as the Monte Carlo method in Metropolis and Ulam [1949], although the term has come to encompass a broader class of methods over the following years.

It is easy to show that π^N_MC(ϕ) is an unbiased estimator of π(ϕ) for any N > 0. Also, if π(ϕ) is finite, the strong law of large numbers (e.g. Shiryaev [1995], p. 391) ensures almost sure (a.s.) convergence of π^N_MC(ϕ) to π(ϕ) as the number of i.i.d. samples tends to infinity:
\[ \pi^N_{\mathrm{MC}}(\varphi) \xrightarrow{\mathrm{a.s.}} \pi(\varphi). \]
The variance of π^N_MC(ϕ) is given by
\[ \mathrm{var}\left[\pi^N_{\mathrm{MC}}(\varphi)\right] = \frac{1}{N^2}\sum_{i=1}^{N} \mathrm{var}_\pi\left[\varphi(X^{(i)})\right] = \frac{1}{N}\mathrm{var}_\pi\left[\varphi(X)\right], \]
which indicates the improvement in accuracy with increasing N, provided that var_π[ϕ(X)] is finite. Note that this is true regardless of the dimension of X, which makes Monte Carlo preferable over deterministic numerical methods particularly for high dimensional integrations [Newman and Barkema, 1999]. Also, if var_π[ϕ(X)] is finite, the distribution of the estimator is well behaved in the limit, which is ensured by the central limit theorem (e.g. Shiryaev [1995], p. 335):
\[ \sqrt{N}\left[\pi^N_{\mathrm{MC}}(\varphi) - \pi(\varphi)\right] \xrightarrow{d} \mathcal{N}\left(0, \mathrm{var}_\pi[\varphi(X)]\right). \]
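To make the estimator concrete, here is a short Python sketch (ours, not part of the thesis; the target N(0, 1) and test function ϕ(x) = x², for which π(ϕ) = 1, are purely illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)

    def perfect_mc(phi, sampler, N):
        """Perfect Monte Carlo estimate of pi(phi) from N i.i.d. samples."""
        x = sampler(N)           # i.i.d. draws X^(1), ..., X^(N) from pi
        return np.mean(phi(x))   # (1/N) sum_i phi(X^(i))

    # Illustrative target: pi = N(0, 1) and phi(x) = x^2, so pi(phi) = 1.
    phi = lambda x: x**2
    sampler = lambda N: rng.standard_normal(N)

    for N in [100, 10_000, 1_000_000]:
        print(N, perfect_mc(phi, sampler, N))

In agreement with the variance formula above, doubling N is expected to halve the variance of the estimator.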
The requirement of perfect Monte Carlo is the ability to obtain i.i.d. samples from π. There are several methods for obtaining i.i.d. samples from distributions; we shall cover the two most common ones in the following.

2.2.1 Inversion sampling

If π is a distribution on R, then its cumulative distribution function can be defined as Fπ : R → [0, 1], Fπ(x) = π((−∞, x]). If it is possible to invert Fπ, then it is possible to sample from π by transforming a uniform sample U distributed over (0, 1) as
\[ X = F_\pi^{-1}(U) := \inf\{x \in \mathcal{X} : F_\pi(x) \ge U\}. \]
This approach was considered by Ulam prior to 1947 [Eckhardt, 1987], and some extensions to the method are provided by Robert and Casella [2004].
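For instance (our illustrative example, not the thesis's), the exponential distribution with rate λ has Fπ(x) = 1 − e^{−λx} for x ≥ 0, so F_π⁻¹(u) = −log(1 − u)/λ, and the transformation can be implemented in a few lines of Python:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_exponential(lam, N):
        """Inversion sampling for Exp(lam): X = F^{-1}(U), U ~ Unif(0,1)."""
        u = rng.uniform(size=N)
        return -np.log1p(-u) / lam   # F^{-1}(u) = -log(1 - u) / lam

    x = sample_exponential(2.0, 100_000)
    print(x.mean())   # should be close to 1/lam = 0.5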
2.2.2 Rejection sampling

Another common method of obtaining i.i.d. samples from π is rejection sampling, which is available when there exists an instrumental distribution µ such that π ≪ µ with bounded Radon-Nikodým derivative dπ/dµ. Rejection sampling was first mentioned in a 1947 letter by von Neumann [Eckhardt, 1987]; it was also presented a few years later in von Neumann [1951]. The method for obtaining one sample from π can be implemented with any M ≥ sup_x (dπ/dµ)(x) by (i) generating X from µ, (ii) accepting it with probability (1/M)(dπ/dµ)(X), and otherwise repeating steps (i) and (ii) until acceptance. Letting A = {U ≤ (1/M)(dπ/dµ)(X)} be the event of acceptance in a single trial, its probability is given by
\[ \mathbb{P}(A) = \mathbb{E}_\mu\left[\frac{1}{M}\frac{d\pi}{d\mu}(X)\right] = \frac{1}{M}\,\mu\!\left(\frac{d\pi}{d\mu}\right) = \frac{1}{M}, \tag{2.2} \]
which is also the long-run proportion of the number of accepted samples over the number of trials. Therefore, taking µ as close to π as possible, to avoid large Radon-Nikodým derivatives, and taking M = sup_x (dπ/dµ)(x) are sensible choices to make the acceptance probability P(A) as high as possible.

Algorithm 2.1. Rejection sampling: Choose M ≥ sup_x (dπ/dµ)(x). To generate a single sample,
1. Generate X ∼ µ and U ∼ Unif(0, 1).
2. If U ≤ (1/M)(dπ/dµ)(X), accept X; else go to 1.

The rejection sampling algorithm is given in Algorithm 2.1. The validity of this algorithm can be verified by considering the distribution of the accepted samples. Using Bayes' theorem,
\[ \mathbb{P}(X \in dx \mid A) = \frac{\mu(dx)\,\mathbb{P}(A \mid x)}{\mathbb{P}(A)} = \mu(dx)\,\frac{1}{M}\frac{d\pi}{d\mu}(x) \Big/ \frac{1}{M} = \pi(dx). \tag{2.3} \]

One advantage of rejection sampling is that we can implement it even when we know π and µ only up to some proportionality constants Z_π and Z_µ, that is, when π = π̂/Z_π, µ = µ̂/Z_µ and we only know π̂ and µ̂. It is easy to check that one can perform steps (i) and (ii) of the rejection sampling method for any M ≥ sup_x (dπ̂/dµ̂)(x) using dπ̂/dµ̂ instead of dπ/dµ, and the justification of this modification follows from steps similar to those in (2.3). Also, in that case, the acceptance probability is (1/M)(Z_π/Z_µ). Finally, when π and µ have densities (also denoted π and µ) with respect to a common dominating measure, the Radon-Nikodým derivative (dπ/dµ)(x) becomes equal to π(x)/µ(x).

The drawback of rejection sampling is that in practice a rejection based procedure is usually not viable when X is high-dimensional, since P(A) gets smaller and more computation is required to evaluate acceptance probabilities as the dimension increases. In the literature there exist approaches to improve the computational efficiency of rejection sampling. For example, assuming the densities exist, when it is difficult to compute π(x), tests like u ≤ (1/M)π(x)/µ(x) can be slow to evaluate. In this case, one may use a squeezing function s : X → [0, ∞) such that s(x)/µ(x) is cheap to evaluate and s(x)/π(x) is tightly bounded from above by 1. For such an s, not only does u ≤ (1/M)s(x)/µ(x) guarantee u ≤ (1/M)π(x)/µ(x), and hence acceptance, but also, if u ≤ (1/M)π(x)/µ(x), then u ≤ (1/M)s(x)/µ(x) holds with high probability. Therefore, in the case of acceptance, evaluation of π(x)/µ(x) is largely avoided by checking u ≤ (1/M)s(x)/µ(x) first. In Marsaglia [1977], the author proposed to squeeze π from above and below by µ and s respectively, where µ is easy to sample from and s is easy to evaluate. There are also adaptive methods to squeeze π from both below and above; they involve an adaptive scheme to gradually modify µ and s from the samples that have already been obtained [Gilks, 1992; Gilks et al., 1995; Gilks and Wild, 1992].
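A minimal Python sketch of Algorithm 2.1, under illustrative choices of ours (not the thesis's): the target is π = N(0, 1) and the proposal is µ = Cauchy(0, 1), for which the density ratio π(x)/µ(x) is bounded with supremum approximately 1.52, so any M ≥ 1.52 is valid.

    import numpy as np

    rng = np.random.default_rng(0)

    def norm_pdf(x):
        return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

    def cauchy_pdf(x):
        return 1.0 / (np.pi * (1.0 + x**2))

    def rejection_sample(N, M=1.6):
        """Algorithm 2.1: target pi = N(0,1), proposal mu = Cauchy(0,1).
        M must satisfy M >= sup_x pi(x)/mu(x), which is about 1.52 here."""
        samples = []
        while len(samples) < N:
            x = rng.standard_cauchy()
            u = rng.uniform()
            if u <= norm_pdf(x) / (M * cauchy_pdf(x)):  # accept w.p. (1/M) dpi/dmu
                samples.append(x)
        return np.array(samples)

    x = rejection_sample(10_000)
    print(x.mean(), x.var())   # close to 0 and 1

In agreement with (2.2), the long-run acceptance rate of this sampler is 1/M = 0.625.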
2.3 Importance sampling

We saw that rejection sampling can be wasteful, as it uses only about 1/M of the generated random samples to construct an approximation to π. In contrast, importance sampling uses every sample, but weights each one according to the degree of similarity between the target and instrumental distributions. The idea of importance sampling follows from the importance sampling fundamental identity [Robert and Casella, 2004]: if there is a probability measure µ such that π ≪ µ with Radon-Nikodým derivative w = dπ/dµ, then we have
\[ \pi(\varphi) = \mu(\varphi w). \]
This identity can be used with a µ which is easy to sample from. Sampling X⁽¹⁾, . . . , X⁽ᴺ⁾ from µ, the integral π(ϕ) = µ(ϕw) can be approximated by using perfect Monte Carlo as
\[ \pi^N_{\mathrm{IS}}(\varphi) := \frac{1}{N}\sum_{i=1}^{N} \varphi(X^{(i)})\,w(X^{(i)}). \tag{2.4} \]

Algorithm 2.2. Importance sampling:
• For i = 1, . . . , N: generate X⁽ⁱ⁾ ∼ µ and calculate w(X⁽ⁱ⁾) = (dπ/dµ)(X⁽ⁱ⁾).
• Set π^N_IS(ϕ) = (1/N) ∑ᵢ₌₁ᴺ w(X⁽ⁱ⁾) ϕ(X⁽ⁱ⁾).

Importance sampling is summarised in Algorithm 2.2. The Radon-Nikodým derivatives w(X⁽ⁱ⁾) are known as the importance sampling weights. Noting its equivalence to perfect Monte Carlo for µ(ϕw), the estimator in (2.4) is unbiased and justified by the strong law of large numbers and the central limit theorem, provided that π(ϕ) and var_µ[w(X)ϕ(X)] are finite. Moreover, as we have freedom to choose µ, we can control the variance of importance sampling [Robert and Casella, 2004]:
\[ \mathrm{var}\left[\pi^N_{\mathrm{IS}}(\varphi)\right] = \frac{1}{N}\mathrm{var}_\mu[w(X)\varphi(X)] = \frac{1}{N}\left(\mu(w^2\varphi^2) - [\mu(w\varphi)]^2\right) = \frac{1}{N}\left(\mu(w^2\varphi^2) - [\pi(\varphi)]^2\right). \]
Therefore, minimising var[π^N_IS(ϕ)] is equivalent to minimising µ(w²ϕ²), which can be lower bounded as
\[ \mu(w^2\varphi^2) \ge [\mu(w|\varphi|)]^2 = [\pi(|\varphi|)]^2 \]
using Jensen's inequality. Considering µ(w²ϕ²) = π(wϕ²), this bound is attainable if we choose µ such that it satisfies
\[ w(x) = \frac{d\pi}{d\mu}(x) = \frac{\pi(|\varphi|)}{|\varphi(x)|}, \quad x \in \mathcal{X},\ \varphi(x) \ne 0. \]
This results in the optimal choice of µ being
\[ \mu(dx) = \pi(dx)\,\frac{|\varphi(x)|}{\pi(|\varphi|)} \]
for points x ∈ X such that ϕ(x) ≠ 0, and the resulting minimum variance is given by
\[ \min_\mu \mathrm{var}\left[\pi^N_{\mathrm{IS}}(\varphi)\right] = \frac{1}{N}\left([\pi(|\varphi|)]^2 - [\pi(\varphi)]^2\right). \]
Note that this minimum value is 0 if ϕ is nonnegative π-almost everywhere. Therefore, importance sampling can in principle achieve a lower variance than perfect Monte Carlo. Of course, if we cannot already compute π(ϕ), it is unlikely that we can compute π(|ϕ|). Also, it will be rare that we can easily simulate from the optimal µ even if we can construct it. Instead, we are guided to seek a µ close to the optimal one, but from which it is easy to sample.
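The following Python sketch implements Algorithm 2.2 for an illustrative choice of target and proposal (ours, not the thesis's): π = N(0, 1), µ = N(0, σ_µ²) with σ_µ = 2, and ϕ(x) = x², so that π(ϕ) = 1.

    import numpy as np

    rng = np.random.default_rng(0)

    def norm_pdf(x, sigma=1.0):
        return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

    def importance_sampling(phi, N, sigma_mu=2.0):
        """Algorithm 2.2: target pi = N(0,1), proposal mu = N(0, sigma_mu^2)."""
        x = rng.normal(scale=sigma_mu, size=N)    # X^(i) ~ mu
        w = norm_pdf(x) / norm_pdf(x, sigma_mu)   # w(X^(i)) = (dpi/dmu)(X^(i))
        return np.mean(w * phi(x))                # unbiased estimate of pi(phi)

    print(importance_sampling(lambda x: x**2, 100_000))   # close to 1.0

Note that the wider proposal keeps the weights bounded; a proposal with lighter tails than the target would instead produce weights of unbounded variance.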
2.3.1 Self-normalised importance sampling

Like rejection sampling, the importance sampling method is available also when π = π̂/Z_π, µ = µ̂/Z_µ and we only have π̂ and µ̂. This time, letting w = dπ̂/dµ̂, we write the importance sampling fundamental identity in terms of π̂ and µ̂ as
\[ \pi(\varphi) = \frac{\mu(\varphi w)}{Z_\pi / Z_\mu} = \frac{\mu(\varphi w)}{\mu(w)}. \]
The importance sampling method can be modified to approximate both the numerator (the unnormalised estimate) and the denominator (the normalisation constant) by perfect Monte Carlo. Sampling X⁽¹⁾, . . . , X⁽ᴺ⁾ from µ, we have the approximation
\[ \pi^N_{\mathrm{IS}}(\varphi) = \frac{\frac{1}{N}\sum_{i=1}^{N} \varphi(X^{(i)})\,w(X^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} w(X^{(i)})} = \sum_{i=1}^{N} W^{(i)}\varphi(X^{(i)}), \]
where W⁽ⁱ⁾ = w(X⁽ⁱ⁾)/∑ⱼ₌₁ᴺ w(X⁽ʲ⁾) are called the normalised importance weights, as they sum to 1.

Being the ratio of two unbiased estimators, the estimator of self-normalised importance sampling is biased for finite N. However, its consistency and stability are provided by a strong law of large numbers and a central limit theorem in Geweke [1989]. In the same work, the variance of the self-normalised importance sampling estimator is analysed and an approximation is provided, which reveals that it can provide lower variance estimates than the unnormalised importance sampling method. Therefore, this method can be preferable to its unnormalised version even when it is not the case that π and µ are known only up to proportionality constants.

Algorithm 2.3. Self-normalised importance sampling:
• For i = 1, . . . , N: generate X⁽ⁱ⁾ ∼ µ and calculate w(X⁽ⁱ⁾) = (dπ̂/dµ̂)(X⁽ⁱ⁾).
• For i = 1, . . . , N: set W⁽ⁱ⁾ = w(X⁽ⁱ⁾)/∑ⱼ₌₁ᴺ w(X⁽ʲ⁾).
• Set π^N_IS(ϕ) = ∑ᵢ₌₁ᴺ W⁽ⁱ⁾ ϕ(X⁽ⁱ⁾).

Self-normalised importance sampling is also called Bayesian importance sampling in Geweke [1989], since in most Bayesian inference problems the normalising constant of the posterior distribution is unknown.

One approximation to the variance of the self-normalised importance sampling estimator is proposed in Kong et al. [1994]:
\[ \mathrm{var}\left[\pi^N_{\mathrm{IS}}(\varphi)\right] \approx \frac{1}{N}\mathrm{var}_\pi[\varphi(X)]\left\{1 + \mathrm{var}_\mu[w(X)]\right\} = \mathrm{var}\left[\pi^N_{\mathrm{MC}}(\varphi)\right]\left\{1 + \mathrm{var}_\mu[w(X)]\right\}. \]
This approximation might be confusing at first, since it suggests that the variance of self-normalised importance sampling is always greater than that of perfect Monte Carlo, which we have just seen is not the case. However, it is useful as it provides an easy way of monitoring the efficiency of the method. Consider the ratio of the variances of the self-normalised importance sampling method with N particles and perfect Monte Carlo with N′ particles, which is given under this approximation by
\[ \frac{\mathrm{var}\left[\pi^N_{\mathrm{IS}}(\varphi)\right]}{\mathrm{var}\left[\pi^{N'}_{\mathrm{MC}}(\varphi)\right]} \approx \frac{N'}{N}\left\{1 + \mathrm{var}_\mu[w(X)]\right\}. \]
The number N′ for which this ratio is 1 suggests how many samples for perfect Monte Carlo would be equivalent to N samples for self-normalised importance sampling. For this reason this number is defined as the effective sample size [Kong et al., 1994; Liu, 1996], and it is given by
\[ N_{\mathrm{eff}} = \frac{N}{1 + \mathrm{var}_\mu[w(X)]}. \]
Obviously, the term var_µ[w(X)] itself is usually estimated using the samples X⁽¹⁾, . . . , X⁽ᴺ⁾ with weights w(X⁽¹⁾), . . . , w(X⁽ᴺ⁾) obtained from the method.
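A Python sketch of Algorithm 2.3 under illustrative choices of ours: the target and proposal densities are supplied only up to their normalising constants, the weights are handled on the log scale for numerical stability, and the effective sample size is computed with the widely used empirical estimate 1/∑ᵢ(W⁽ⁱ⁾)², which approximates the N_eff defined above.

    import numpy as np

    rng = np.random.default_rng(0)

    def snis(phi, log_pi_hat, log_mu_hat, sample_mu, N):
        """Algorithm 2.3 with densities known only up to constants."""
        x = sample_mu(N)
        logw = log_pi_hat(x) - log_mu_hat(x)   # log of unnormalised weights
        W = np.exp(logw - logw.max())
        W /= W.sum()                           # normalised weights W^(i)
        ess = 1.0 / np.sum(W**2)               # empirical effective sample size
        return np.sum(W * phi(x)), ess

    # Illustrative unnormalised target pi_hat = exp(-x^2/2), i.e. N(0,1) up to a
    # constant, and unnormalised proposal mu_hat = exp(-x^2/8), i.e. N(0,4).
    est, ess = snis(lambda x: x**2,
                    log_pi_hat=lambda x: -x**2 / 2,
                    log_mu_hat=lambda x: -x**2 / 8,
                    sample_mu=lambda N: rng.normal(scale=2.0, size=N),
                    N=100_000)
    print(est, ess)   # est close to 1.0; ess < N reflects weight variability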
2.4 Markov chain Monte Carlo

We have already discussed the difficulties of generating a large number of i.i.d. samples from $\pi$. One alternative was importance sampling, which weights every generated sample in order not to waste it, but which has its own drawbacks, mostly due to issues related to controlling the variance. Another alternative is to use Markov chain Monte Carlo (MCMC) methods [Gilks et al., 1996; Hastings, 1970; Metropolis et al., 1953; Robert and Casella, 2004]. These methods are based on the design of a suitable ergodic Markov chain whose stationary distribution is $\pi$. The idea is that if one simulates such a Markov chain, after a long enough time the samples of the Markov chain will be distributed approximately according to $\pi$. Although the samples generated from the Markov chain are not i.i.d., their use is justified by convergence results for dependent random variables in the literature. The first examples of MCMC can be found in Metropolis et al. [1953] and Hastings [1970], and book-length reviews are available in Gilks et al. [1996] and Robert and Casella [2004].

2.4.1 Discrete time Markov chains

In order to adequately summarise the MCMC methodology, we first need to refer to the theory of discrete time Markov chains defined on general state spaces. Discrete time Markov chains also constitute an important part of this thesis. The review made here is limited to the relation of Markov chains to the topics of this thesis; for more details one can see Meyn and Tweedie [2009] or Shiryaev [1995]; introductions more closely related to our area of interest are found in Robert and Casella [2004, Chapter 6], Cappé et al. [2005, Chapter 14], Tierney [1994], and Gilks et al. [1996, Chapter 4].

Definition 2.1 (Markov chain). Consider a sequence of measurable spaces $\{(\mathcal{X}_n, \mathcal{E}_n)\}_{n \geq 1}$, an initial distribution $\eta$, and a sequence of Markov kernels $\{M_n\}_{n \geq 2}$ with each $M_n : \mathcal{X}_{n-1} \to \mathcal{P}(\mathcal{X}_n)$, where $\mathcal{P}(\mathcal{X}_n)$ denotes the set of probability measures on $\mathcal{X}_n$. Then there exists a unique stochastic process $\{X_n\}_{n \geq 1}$ on the canonical space $(\prod_{n=1}^{\infty} \mathcal{X}_n, \otimes_{n=1}^{\infty} \mathcal{E}_n)$ admitting the probability law $\mathbb{P}_{\eta}$, defined from the initial distribution $\eta$ and the Markov kernels $\{M_n\}_{n \geq 2}$ via its finite dimensional distributions
\[
\mathbb{P}_{\eta}(X_1 \in A_1, X_2 \in A_2, \ldots, X_n \in A_n) = \int_{A_1} \int_{A_2} \cdots \int_{A_n} \eta(dx_1) M_2(x_1, dx_2) \cdots M_n(x_{n-1}, dx_n)
\]
for all $n$ and $\mathcal{E}_i$-measurable $A_i$, $i = 1, \ldots, n$.

This is the canonical definition of a Markov chain, and it leads to the defining property of a Markov chain: the current state of the chain at time $n$ depends only on the previous state at time $n-1$. More explicitly, for any $n$ and $\mathcal{E}_n$-measurable set $A$, we have
\[
\mathbb{P}_{\eta}(X_n \in A \mid X_{1:n-1} = x_{1:n-1}) = \mathbb{P}_{\eta}(X_n \in A \mid X_{n-1} = x_{n-1}) = M_n(x_{n-1}, A).
\]
This property is also referred to as the weak Markov property, which can be stated in a more general sense:

Proposition 2.1 (weak Markov property). Given $X_1 = x_1, \ldots, X_m = x_m$, the process $\{X_{m+n}\}_{n \geq 0}$ is a Markov chain, independent of $X_1, \ldots, X_m$, whose probability law is constructed from the initial distribution $\delta_{x_m}$ and the sequence of Markov kernels $\{M_{m+n}\}_{n \geq 1}$ in the same way as in Definition 2.1.

From now on, we will consider time-homogeneous Markov chains, where $(\mathcal{X}_n, \mathcal{E}_n) = (\mathcal{X}, \mathcal{E})$ for all $n \geq 1$ and $M_n = M$ for all $n \geq 2$, and we will denote them as $\mathrm{Markov}(\eta, M)$. Such Markov chains are sufficient for the purposes of MCMC methods and the other methods investigated throughout this thesis. For MCMC, we require the Markov chain to have a unique stationary distribution $\pi$ and to converge to $\pi$. Before that, we need to review some fundamental properties of a discrete time Markov chain in order to understand when stationarity and convergence are ensured.

Irreducibility: Informally, a Markov chain is irreducible if (almost) all its states communicate, that is, the chain travels from any point in $\mathcal{X}$ to any set in $\mathcal{E}$ with positive probability. For discrete $\mathcal{X}$ it is possible to state this as
\[
\forall x, x' \in \mathcal{X}, \ \exists n \geq 1 \ \text{s.t.} \ \mathbb{P}_{\delta_x}(X_n = x') > 0.
\]
For general state-spaces, we need to generalise the concept of irreducibility.

Definition 2.2 ($\phi$-irreducibility). The transition kernel $M$, or the Markov chain $\{X_n\}_{n\geq1}$ with transition kernel $M$, is said to be $\phi$-irreducible if there exists a measure $\phi$ on $(\mathcal{X}, \mathcal{E})$ such that for any $A \in \mathcal{E}$ with $\phi(A) > 0$ we have
\[
\forall x \in \mathcal{X}, \ \exists n \geq 1 \ \text{s.t.} \ \mathbb{P}_{\delta_x}(X_n \in A) > 0.
\]
Such a measure $\phi$ is called an irreducibility measure for $M$.

Recurrence and transience: In the discrete state-space case, we say that a Markov chain is recurrent if each of its states is expected to be visited by the chain infinitely often; otherwise it is transient. In the general state-space case, instead of states we consider accessible sets. A set $A \in \mathcal{E}$ is accessible if $\mathbb{P}_{\delta_x}(X_n \in A \text{ for some } n) > 0$ for all $x \in \mathcal{X}$. It is also useful to consider stronger recurrence properties, expressed in terms of return probabilities rather than the expected number of visits.

Definition 2.3 (recurrence). Let $A$ be a set in $\mathcal{E}$. We say $A$ is recurrent if for all $x \in A$
\[
\mathbb{E}_{x}\left[\sum_{n=1}^{\infty} \mathbb{I}_A(X_n)\right] = \infty.
\]
Moreover, we say $A$ is Harris recurrent if for all $x \in A$
\[
\mathbb{P}_{\delta_x}\left(\sum_{n=1}^{\infty} \mathbb{I}_A(X_n) = \infty\right) = 1.
\]
Finally, we say a $\phi$-irreducible Markov chain is recurrent (Harris recurrent) if every accessible $A \in \mathcal{E}$ is recurrent (Harris recurrent).

Invariant measures: We call a $\sigma$-finite measure $\mu$ $M$-invariant if $\mu = \mu M$. If an $M$-invariant $\mu$ is a probability measure, then $\mu$ is referred to as stationary.
A Markov chain associated with a $\phi$-irreducible $M$ is called positive if there is a probability measure $\mu$ which is $M$-invariant. In order to state the conditions for the existence of a unique invariant probability measure for a Markov chain, we need the definition of a small set.

Definition 2.4 (small set). Let $M$ and $\nu$ be a transition kernel and a probability measure, respectively, on $(\mathcal{X}, \mathcal{E})$, let $m \geq 2$ be an integer and $\epsilon \in (0, 1]$ a constant. A set $C \in \mathcal{E}$ is called an $(m, \epsilon, \nu)$-small set for $M$, or simply a small set, if for all $x \in C$ and $A \in \mathcal{E}$,
\[
\mathbb{P}_{\delta_x}(X_m \in A) \geq \epsilon\, \nu(A).
\]
Trivially, every point in $\mathcal{X}$ is a small set, so in discrete $\mathcal{X}$ every state is a small set. Now we have the following theorem for the existence and uniqueness of an invariant probability measure.

Theorem 2.1. Given a Markov kernel $M$ and a Markov chain associated with $M$, the following hold:
• $M$ is $\phi$-irreducible and recurrent if and only if it admits a unique (up to a multiplicative constant) invariant measure.
• If $M$ admits an accessible small set $C$ such that
\[
\sup_{x \in C} \mathbb{E}_{\mathbb{P}_{\delta_x}}\left[\inf\{n \geq 2 : X_n \in C\}\right] < \infty, \tag{2.5}
\]
then the Markov chain is positive.

Note that while $\phi$-irreducibility and recurrence ensure a unique (up to a multiplicative constant) invariant measure, the existence of an accessible small set satisfying (2.5) is required as well to have an invariant probability measure. In fact, condition (2.5) is equivalent to positive recurrence for Markov chains with a discrete state-space, which is necessary for the existence of a unique stationary distribution.

Reversibility and detailed balance: One useful way to verify the existence of an invariant probability measure for a Markov chain is to check its reversibility, which is a sufficient (but not necessary) condition for the existence of a stationary distribution.

Definition 2.5 (reversibility). Let $M$ be a transition kernel having a stationary distribution $\pi$, and assume the associated Markov chain is started from $\pi$. We say that $M$ is reversible if the reversed process $\{\overline{X}_m = X_{n-m+1}\}_{1 \leq m \leq n}$ is also $\mathrm{Markov}(\pi, M)$ for all $n \geq 1$.

A necessary and sufficient condition for reversibility of $M$ is the detailed balance condition.

Proposition 2.2 (detailed balance). A Markov kernel $M$ is reversible with respect to a probability measure $\pi$ if and only if the following condition, known as the detailed balance condition, holds: for all bounded measurable functions $f$ on $\mathcal{X} \times \mathcal{X}$,
\[
\int_{\mathcal{X}\times\mathcal{X}} f(x, y)\, \pi(dx) M(x, dy) = \int_{\mathcal{X}\times\mathcal{X}} f(x, y)\, \pi(dy) M(y, dx).
\]
In that case, $\pi$ is a stationary distribution for $M$.

Being a sufficient condition for stationarity, the detailed balance condition is quite useful for designing transition kernels for MCMC algorithms.

Ergodicity: We have shown the conditions for a unique stationary distribution of a Markov chain. The first ergodic theorem shows that these conditions are sufficient for establishing a strong law of large numbers.

Theorem 2.2. If $\{X_n\}_{n\geq1}$ is a positive, Harris recurrent Markov chain with invariant distribution $\pi$, then for all $\pi$-integrable functions $\varphi$,
\[
\frac{1}{n} \sum_{i=1}^{n} \varphi(X_i) \xrightarrow{a.s.} \pi(\varphi).
\]
Note that this ergodic theorem is about the convergence of the sample mean; it does not tell us whether the chain will converge to its stationary distribution. For that to happen, the Markov chain is required to be aperiodic, a property which prevents the chain from getting trapped in cycles.
In a discrete state-space, the period of a state is the greatest common divisor of the lengths of all routes of positive probability from the state back to itself; if no state has period greater than one, the chain is said to be aperiodic. In general state-spaces, more care is required to define a cycle. It is a theorem that there exists an accessible $(m, \epsilon, \nu)$-small set $C$ for a $\phi$-irreducible Markov chain, which enables the following definition.

Definition 2.6 (cycle and period). A $\phi$-irreducible Markov chain associated with the Markov kernel $M$ has a cycle of length $d$ if, for some accessible $(m, \epsilon, \nu)$-small set $C$, $d$ is the greatest common divisor of
\[
\{n - 1 : n \geq 2,\ C \text{ is } (n, \epsilon_n, \nu_n)\text{-small for some } \epsilon_n > 0,\ \nu_n \in \mathcal{P}(\mathcal{X})\}.
\]
The period of the Markov chain is the largest possible cycle length $d$ for $M$. When the period is $1$, the chain is called aperiodic.

We are now ready to state our second ergodic theorem, which additionally requires aperiodicity of the Markov chain.

Theorem 2.3. If $\{X_n\}_{n\geq1}$ is a positive and aperiodic Markov chain with stationary distribution $\pi$, then for $\pi$-almost every $x \in \mathcal{X}$,
\[
\sup_{A \in \mathcal{E}} \left| \mathbb{P}_{\delta_x}(X_n \in A) - \pi(A) \right| \to 0. \tag{2.6}
\]
Moreover, if the chain is Harris recurrent with stationary distribution $\pi$, the above holds for every $x \in \mathcal{X}$.

We call a chain ergodic if it satisfies (2.6) for $\pi$-almost all $x \in \mathcal{X}$; if (2.6) is satisfied for all $x \in \mathcal{X}$ then the chain is called uniformly ergodic. Hence we can define ergodicity in terms of the properties of the Markov chain.

Definition 2.7 (ergodic Markov chain). A $\phi$-irreducible Markov chain is called ergodic if it is positive and aperiodic; it is called uniformly ergodic if it is also Harris recurrent.

2.4.2 Metropolis-Hastings

As previously stated, an MCMC method is based on a discrete-time Markov chain which has $\pi$ as its stationary distribution. The most widely used MCMC algorithm to date is the Metropolis-Hastings algorithm [Hastings, 1970; Metropolis et al., 1953]. In this algorithm, given the previous sample $X_{n-1}$, a new value $Y$ for $X_n$ is proposed using an instrumental transition kernel $K : \mathcal{X} \to \mathcal{P}(\mathcal{X})$. We assume for simplicity that the product measure $\pi(dx)K(x, dy)$ has a probability density $q(x, y)$ with respect to a dominating symmetric measure $\zeta(dx, dy)$ (a situation where this is not the case will be visited in Section 2.4.3). The proposed sample $Y$ is accepted with the acceptance probability $\alpha(X_{n-1}, Y)$, where the function $\alpha: \mathcal{X} \times \mathcal{X} \to [0, 1]$ is defined as
\[
\alpha(x, y) = \min\left\{1, \frac{q(y, x)}{q(x, y)}\right\}, \qquad x, y \in \mathcal{X}.
\]

Algorithm 2.4. Metropolis-Hastings: Begin with some $X_1 \in \mathcal{X}$. For $n = 2, 3, \ldots$
• Sample $Y \sim K(X_{n-1}, \cdot)$.
• Set $X_n = Y$ with probability $\alpha(X_{n-1}, Y)$; otherwise set $X_n = X_{n-1}$.

According to Algorithm 2.4, the transition kernel $M$ of the Markov chain from which the samples are obtained is such that, for any bounded measurable function $f$ defined on $\mathcal{X}$,
\[
M(x, f) = \int_{\mathcal{X}} K(x, dy)\alpha(x, y)f(y) + \left[1 - \int_{\mathcal{X}} K(x, dy)\alpha(x, y)\right] f(x),
\]
where we can simplify the expression by substituting $p_r(x) = 1 - \int_{\mathcal{X}} K(x, dy)\alpha(x, y)$, the rejection probability of a sample proposed from $K(x, \cdot)$. We can check the detailed balance condition to see why this Markov chain has $\pi$ as its stationary distribution. For a bounded measurable $f$ on $\mathcal{X} \times \mathcal{X}$, we have
\[
\int_{\mathcal{X}\times\mathcal{X}} \pi(dx)M(x, dy)f(x, y) = \int_{\mathcal{X}\times\mathcal{X}} q(x, y)\alpha(x, y)f(x, y)\,\zeta(dx, dy) + \int_{\mathcal{X}} \pi(dx)\,p_r(x)f(x, x)
\]
\[
= \int_{\mathcal{X}\times\mathcal{X}} \min\{q(x, y), q(y, x)\}f(x, y)\,\zeta(dx, dy) + \pi(p_r g),
\]
where the function $g : \mathcal{X} \to \mathbb{R}$ satisfies $g(x) = f(x, x)$. Since the measure $\zeta(dx, dy)$ and the expression $\min\{q(x, y), q(y, x)\}$ are symmetric in $(x, y)$, we can swap $x$ and $y$ in $f(x, y)$ in the last line, hence in the first line. This establishes the detailed balance condition for $M$ with respect to $\pi$. Note that the existence of a stationary $\pi$ for $M$ ensures the recurrence of $M$, and fortunately it is rare that a recurrent $M$ is not Harris recurrent. There are also various sufficient conditions for the $M$ of the Metropolis-Hastings algorithm to be $\phi$-irreducible and aperiodic. For example, if $K$ is $\pi$-irreducible and $\alpha(x, y) > 0$ for all $x, y \in \mathcal{X}$, then $M$ is $\pi$-irreducible; if $P(X_n = X_{n-1}) > 0$ or $K$ is aperiodic, then $M$ is aperiodic [Roberts and Smith, 1994]. More detailed results on the convergence of Metropolis-Hastings are also available; see e.g. Tierney [1994], Roberts and Tweedie [1996], and Mengersen and Tweedie [1996].

Historically, the original MCMC algorithm was introduced by Metropolis et al. [1953] for the purpose of optimisation on a discrete state-space. This algorithm, called the Metropolis algorithm, used symmetric proposal kernels $K$. The Metropolis algorithm was later generalised by Hastings [1970] to permit continuous state-spaces and asymmetric proposal kernels, preserving the Metropolis algorithm as a special case, and its use for statistical simulation was demonstrated. A more detailed historical survey is provided by Hitchcock [2003].
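As an illustration of Algorithm 2.4, the following is a minimal sketch with a symmetric Gaussian random-walk proposal $K$, in which case the ratio $q(y, x)/q(x, y)$ reduces to the ratio of (possibly unnormalised) target densities. The example target is our own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_hastings(log_target, x0, n_iter, step=1.0):
    """Algorithm 2.4 with a symmetric random-walk proposal, so that
    alpha(x, y) = min{1, pi(y)/pi(x)}. log_target may be unnormalised."""
    x = x0
    chain = np.empty(n_iter)
    for n in range(n_iter):
        y = x + step * rng.normal()               # Y ~ K(x, .)
        log_alpha = log_target(y) - log_target(x) # log acceptance ratio
        if np.log(rng.uniform()) < log_alpha:     # accept with prob. alpha(x, y)
            x = y                                 # X_n = Y
        chain[n] = x                              # else X_n = X_{n-1}
    return chain

# Toy target: unnormalised N(2, 1) density.
chain = metropolis_hastings(lambda x: -0.5 * (x - 2.0) ** 2, x0=0.0, n_iter=20_000)
print(chain[5_000:].mean())  # close to 2 after discarding burn-in
```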
2.4.3 Gibbs sampling

The Gibbs sampler [Gelfand and Smith, 1990; Geman and Geman, 1984] is one of the most popular MCMC methods, and can be used when $\mathcal{X}$ has more than one dimension. If $X$ has $d > 1$ components (of possibly different dimensions), such that $X = (X_1, \ldots, X_d)$, and one can sample from each of the full conditional distributions $\pi_k(\cdot \mid X_{1:k-1}, X_{k+1:d})$, then the Gibbs sampler produces a Markov chain by updating one component at a time using the $\pi_k$'s. One cycle of the Gibbs sampler successively samples from the conditional distributions $\pi_1, \ldots, \pi_d$, conditioning on the most recent samples.

Algorithm 2.5. The Gibbs sampler: Begin with some $X_1 \in \mathcal{X}$. For $n = 2, 3, \ldots$, generate for $k = 1, \ldots, d$
\[
X_{n,k} \sim \pi_k(\cdot \mid X_{n,1:k-1}, X_{n-1,k+1:d}).
\]

For an $x \in \mathcal{X}$, let $x_{-k} = (x_{1:k-1}, x_{k+1:d})$, $k = 1, \ldots, d$, denote the components of $x$ excluding $x_k$, and let us permit ourselves to write $x = (x_k, x_{-k})$. The corresponding MCMC kernel of the Gibbs sampler can be written as $M = M_1 M_2 \cdots M_d$, where each transition kernel $M_k : \mathcal{X} \to \mathcal{P}(\mathcal{X})$, $k = 1, \ldots, d$, can be written as
\[
M_k(x, dy) = \pi_k(dy_k \mid x_{-k})\,\delta_{x_{-k}}(dy_{-k}).
\]
The justification of this transition kernel comes from the reversibility of each $M_k$ with respect to $\pi$, which can be verified from the detailed balance condition as follows. For any bounded measurable function $f$ on $\mathcal{X} \times \mathcal{X}$,
\[
\int \pi(dx)M_k(x, dy)f(x, y) = \int \pi(dx)\,\pi_k(dy_k \mid x_{-k})\,\delta_{x_{-k}}(dy_{-k})\, f(x_k, x_{-k}, y_k, y_{-k})
\]
\[
= \int \pi(dx_{-k})\,\pi_k(dx_k \mid x_{-k})\,\pi_k(dy_k \mid x_{-k})\, f(x_k, x_{-k}, y_k, x_{-k})
\]
\[
= \int \pi(dy)\,\pi_k(dx_k \mid y_{-k})\, f(x_k, y_{-k}, y_k, y_{-k})
\]
\[
= \int \pi(dy)\,\pi_k(dx_k \mid y_{-k})\,\delta_{y_{-k}}(dx_{-k})\, f(x_k, x_{-k}, y_k, y_{-k})
\]
\[
= \int \pi(dy)M_k(y, dx)f(x, y), \tag{2.7}
\]
hence the detailed balance condition for $M_k$ is satisfied with $\pi$. This leads to $\pi M_k = \pi$, hence $\pi M = \pi$, so $\pi$ is indeed stationary for the Gibbs sampler. An insightful interpretation of (2.7) is that each step of a cycle of the Gibbs sampler is a Metropolis-Hastings move whose MCMC kernel $M$ is equal to its proposal kernel $K$, i.e. $\alpha(x, y) = 1$ uniformly.
This also shows that the assumption that $\pi(dx)K(x, dy)$ has a density with respect to a symmetric measure $\zeta(dx, dy)$ is not a necessary condition for the Metropolis-Hastings algorithm. However, reversibility of each $M_k$ with respect to $\pi$ does not suffice to establish proper convergence of the Gibbs sampler, as none of the individual steps produces a $\phi$-irreducible chain. Only the combination of the $d$ moves in the complete cycle has a chance of producing a $\phi$-irreducible chain. We refer to Roberts and Smith [1994] for some simple conditions for convergence of the classical Gibbs sampler. Note, also, that $M$ is not reversible either, although this is not a necessary condition for convergence. A way of guaranteeing both $\phi$-irreducibility and reversibility is to use a mixture of kernels
\[
M_{\beta} = \sum_{k=1}^{d} \beta_k M_k, \qquad \beta_k > 0, \ k = 1, \ldots, d, \qquad \sum_{k=1}^{d} \beta_k = 1,
\]
provided that at least one $M_k$ is irreducible and aperiodic. This choice of kernel leads to the random scan Gibbs sampler. We refer to Tierney [1994], Roberts and Tweedie [1996], and Robert and Casella [2004] for more detailed convergence results pertaining to these variants of the Gibbs sampler.

Having attractive computational properties, the Gibbs sampler is widely used; the requirement for easy-to-sample conditional distributions is its main restriction. Fortunately, though, replacing the exact simulation by a Metropolis-Hastings step in a general MCMC algorithm does not violate its validity, as long as the Metropolis-Hastings step is associated with the correct stationary distribution. The most natural alternative to the Gibbs move in step $k$, when sampling from the full conditional distribution $\pi_k(\cdot \mid x_{-k})$ is not directly feasible, is to use a one-step Metropolis-Hastings move that updates $x_k$ by using a Metropolis-Hastings kernel $M : \mathcal{X} \to \mathcal{P}(\mathcal{X})$ such that $\pi_k(\cdot \mid x_{-k})$ is $M$-invariant [Tierney, 1994].
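To illustrate Algorithm 2.5, here is a small sketch of a Gibbs sampler for a bivariate Gaussian target with correlation $\rho$ and unit marginal variances, chosen because both full conditionals are available in closed form; the specific target is our own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_bivariate_gaussian(rho, n_iter):
    """Algorithm 2.5 for a bivariate standard Gaussian with correlation rho:
    X1 | X2 = x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for X2."""
    x1, x2 = 0.0, 0.0
    chain = np.empty((n_iter, 2))
    sd = np.sqrt(1.0 - rho ** 2)
    for n in range(n_iter):
        x1 = rng.normal(rho * x2, sd)   # X_{n,1} ~ pi_1(. | x2)
        x2 = rng.normal(rho * x1, sd)   # X_{n,2} ~ pi_2(. | most recent x1)
        chain[n] = x1, x2
    return chain

chain = gibbs_bivariate_gaussian(rho=0.9, n_iter=50_000)
print(np.corrcoef(chain[1_000:].T)[0, 1])  # close to 0.9
```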
2.5 Sequential Monte Carlo

Despite their versatility and success, it may be impractical to apply MCMC algorithms to sequential inference problems. This section discusses sequential Monte Carlo (SMC) methods, which provide approximation tools for a sequence of varying distributions. Good tutorials on the subject are available; see for example Doucet et al. [2000b], and Doucet et al. [2001] for a book-length review. Robert and Casella [2004] and Cappé et al. [2005] also contain detailed summaries. Finally, the book Del Moral [2004] treats the subject theoretically in a more general framework, namely that of Feynman-Kac formulae.

2.5.1 Sequential importance sampling

Let $\{X_n\}_{n\geq1}$ be a sequence of random variables where each $X_n$ takes values in some measurable space $(\mathcal{X}_n, \mathcal{E}_n)$. Define the sequence of distributions $\{\pi_n\}_{n\geq1}$ on the product measurable spaces $(\overline{\mathcal{X}}_n = \prod_{i=1}^{n}\mathcal{X}_i, \overline{\mathcal{E}}_n = \otimes_{i=1}^{n}\mathcal{E}_i)$. Also, let $\{\varphi_n\}_{n\geq1}$ be a sequence of functions, where each $\varphi_n : \overline{\mathcal{X}}_n \to \mathbb{R}$ is a $\pi_n$-measurable real-valued function. We are interested in sequential inference, i.e. approximating the following integrals sequentially in $n$:
\[
\pi_n(\varphi_n) = \mathbb{E}_{\pi_n}[\varphi_n(X_{1:n})], \qquad n = 1, 2, \ldots
\]
The first method usually considered an SMC method is sequential importance sampling (SIS), a sequential version of importance sampling. The first uses of SIS can be recognised in works going back to the 1960s and 1970s, such as Mayne [1966], Handschin and Mayne [1969], and Handschin [1970]; see Doucet et al. [2000b] for a general formulation of the method for Bayesian filtering.

Consider the naive importance sampling approach to the sequential problem, where we have a sequence of importance measures $\{q_n\}_{n\geq1}$, each $q_n$ defined on $(\overline{\mathcal{X}}_n, \overline{\mathcal{E}}_n)$ such that $\pi_n \ll q_n$ with Radon-Nikodym derivative $w_n = \frac{d\pi_n}{dq_n}$. It is obvious that we can approximate $\pi_n(\varphi_n)$ by generating samples from $q_n$, independently of the samples generated from $q_1, \ldots, q_{n-1}$, and exploiting the relation $\pi_n(\varphi_n) = q_n(w_n \varphi_n)$. This approach would require the design of a separate $q_n$ and sampling the whole path $X_{1:n}$ at each $n$, which is obviously inefficient. An efficient alternative is SIS, which can be used when it is possible to choose $q_n$ of the form
\[
q_n(dx_{1:n}) = q_1(dx_1) \prod_{i=2}^{n} Q_i(x_{1:i-1}, dx_i), \tag{2.8}
\]
where the $Q_n : \overline{\mathcal{X}}_{n-1} \to \mathcal{P}(\mathcal{X}_n)$ are transition kernels from which it is possible to sample. This selection of $q_n$ leads to the following useful recursion for the importance weights:
\[
w_n(x_{1:n}) = w_{n-1}(x_{1:n-1}) \frac{d\pi_n}{d(\pi_{n-1} \otimes Q_n)}(x_{1:n}). \tag{2.9}
\]
In many applications of (2.9), the Radon-Nikodym derivative $\frac{d\pi_n}{d(\pi_{n-1} \otimes Q_n)}(x_{1:n})$ is a function of $x_{n-1}$ and $x_n$ only. Hence, one can exploit this recursion by sampling only $X_n$ using $Q_n$ at time $n$ and updating the weights with little effort. More explicitly, assume a set of $N > 0$ samples, termed particles, $X^{(i)}_{1:n-1}$ with weights $w^{(i)}_{n-1}$, $i = 1, \ldots, N$, is available at time $n-1$. As far as self-normalised importance sampling is concerned, it is practical to define the weighted empirical distribution
\[
\pi^{N}_{n-1}(dx_{1:n-1}) = \sum_{i=1}^{N} W^{(i)}_{n-1} \delta_{X^{(i)}_{1:n-1}}(dx_{1:n-1}) \tag{2.10}
\]
as an approximation to $\pi_{n-1}$, where $W^{(i)}_{n-1}$, $i = 1, \ldots, N$, are the self-normalised importance weights
\[
W^{(i)}_{n-1} = \frac{w_{n-1}(X^{(i)}_{1:n-1})}{\sum_{j=1}^{N} w_{n-1}(X^{(j)}_{1:n-1})}. \tag{2.11}
\]
The update from $\pi^{N}_{n-1}$ to $\pi^{N}_{n}$ is performed by first sampling $X^{(i)}_n \sim Q_n(X^{(i)}_{1:n-1}, \cdot)$, computing the weights $w_n$ at the points $X^{(i)}_{1:n} = (X^{(i)}_{1:n-1}, X^{(i)}_n)$ using the update rule in (2.9), and finally obtaining the normalised weights $W^{(i)}_n$ using (2.11). An SIS estimate of $\pi_n(\varphi_n)$ is then given by
\[
\pi^{N}_{n}(\varphi_n) = \sum_{i=1}^{N} W^{(i)}_{n} \varphi_n(X^{(i)}_{1:n}).
\]
Being a special case of the importance sampling approximation, this estimate converges almost surely to $\pi_n(\varphi_n)$ for any $n$ (under regularity conditions) as the number of particles tends to infinity; it is also possible to have a central limit theorem for $\pi^{N}_{n}(\varphi_n)$ [Geweke, 1989]. The SIS method is summarised in Algorithm 2.6.

Algorithm 2.6. Sequential importance sampling (SIS): For $n = 1, 2, \ldots$;
• for $i = 1, \ldots, N$,
  – if $n = 1$: sample $X^{(i)}_1 \sim q_1$ and calculate $w_1(X^{(i)}_1) = \frac{d\pi_1}{dq_1}(X^{(i)}_1)$;
  – if $n \geq 2$: sample $X^{(i)}_n \sim Q_n(X^{(i)}_{1:n-1}, \cdot)$, set $X^{(i)}_{1:n} = (X^{(i)}_{1:n-1}, X^{(i)}_n)$, and calculate
\[
w_n(X^{(i)}_{1:n}) = w_{n-1}(X^{(i)}_{1:n-1}) \frac{d\pi_n}{d(\pi_{n-1} \otimes Q_n)}(X^{(i)}_{1:n}).
\]
• for $i = 1, \ldots, N$, calculate
\[
W^{(i)}_{n} = \frac{w_n(X^{(i)}_{1:n})}{\sum_{j=1}^{N} w_n(X^{(j)}_{1:n})}.
\]
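The following is a minimal sketch of Algorithm 2.6 on a toy problem, anticipating the filtering application of Chapter 3: the targets $\pi_n$ are posteriors of an AR(1) state given noisy observations, and $Q_n$ is taken to be the state transition, so the incremental weight reduces to the observation density. The model and all names are our own illustrative choices; the collapse of the effective sample size as $n$ grows foreshadows the weight degeneracy discussed in the next section.

```python
import numpy as np

rng = np.random.default_rng(3)

def sis(y, N, phi):
    """Algorithm 2.6 for a toy AR(1)-plus-noise model:
    X_1 ~ N(0,1), X_n = 0.9 X_{n-1} + N(0,1), Y_n = X_n + N(0,1).
    Q_n is the state transition, so the incremental weight is g(y_n | x_n).
    Weights are carried in log form; no resampling is performed."""
    x = rng.normal(0.0, 1.0, N)                  # X_1^(i) ~ q_1
    logw = -0.5 * (y[0] - x) ** 2                # w_1 propto g(y_1 | x_1)
    est = []
    for yn in y[1:]:
        x = 0.9 * x + rng.normal(0.0, 1.0, N)    # X_n^(i) ~ Q_n
        logw += -0.5 * (yn - x) ** 2             # multiply by incremental weight
        W = np.exp(logw - logw.max())
        W /= W.sum()
        est.append(np.sum(W * phi(x)))           # SIS estimate (phi of x_n only)
    return np.array(est), W

y = rng.normal(size=50)                          # placeholder observations
est, W = sis(y, N=2_000, phi=lambda x: x)
print(1.0 / np.sum(W ** 2))  # N_eff collapses as n grows: weight degeneracy
```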
As in the non-sequential case, it is important to choose $\{q_n\}_{n\geq1}$ such that the variances of $\{\pi^{N}_{n}(\varphi_n)\}_{n\geq1}$ are minimised. Recall that in the SIS algorithm we restrict ourselves to $\{q_n\}_{n\geq1}$ satisfying (2.8); therefore the selection of the optimal proposal distributions suggested in Section 2.3 may not be possible. Instead, a more general criterion for those $\{q_n\}_{n\geq1}$ satisfying (2.8) is to minimise the variance of the incremental importance weights
\[
w_{n|n-1}(x_{1:n}) = \frac{d\pi_n}{d(\pi_{n-1} \otimes Q_n)}(x_{1:n})
\]
conditional upon $x_{1:n-1}$. Note that the objective of minimising the conditional variance of $w_{n|n-1}$ is more general in the sense that it is not specific to $\varphi_n$. It was shown in Doucet [1997] that the kernel $Q^{opt}_n$ minimising this variance is given by
\[
Q^{opt}_n(x_{1:n-1}, dx_n) = \pi_n(dx_n \mid x_{1:n-1}). \tag{2.12}
\]
Before Doucet [1997], the optimum kernel had been used in several works for particular applications; see e.g. Kong et al. [1994], Liu and Chen [1995], and Chen and Liu [1996]. The optimum kernel leads to the optimum incremental weight
\[
w^{opt}_{n|n-1}(x_{1:n-1}) = \frac{d\pi_n}{d\pi_{n-1}}(x_{1:n-1}), \tag{2.13}
\]
which does not depend on the value of $x_n$. This is an interesting observation and it will be revisited in Section 2.5.3.

2.5.2 Sequential importance sampling resampling

The SIS method is an efficient way of implementing importance sampling sequentially. However, unless the proposal distribution is very close to the true distribution, the importance weighting step will, over a number of iterations, lead to a small number of particles with very large weights compared to the rest. This will eventually result in one of the normalised weights being close to $1$ and the others being close to $0$, effectively leading to a particle approximation with a single particle; see Kong et al. [1994] and Doucet et al. [2000b]. This problem is called the weight degeneracy problem.

In order to address the weight degeneracy problem, a resampling step is introduced into the iterations of the SIS method, leading to the sequential importance sampling resampling (SISR) algorithm. Generally, we can describe resampling as a method by which a weighted empirical distribution is replaced with an equally weighted one, where the samples of the equally weighted distribution are drawn from the weighted empirical distribution. Here, resampling is applied to $\pi^{N}_{n-1}$ before proceeding to approximate $\pi_n$. Assume, again, that $\pi_{n-1}$ is approximated with $N$ particles $X^{(1)}_{1:n-1}, \ldots, X^{(N)}_{1:n-1}$ with normalised weights $W^{(i)}_{n-1}$ as in equation (2.10). We draw $N$ independent samples from $\pi^{N}_{n-1}$, namely $\widetilde{X}^{(i)}_{1:n-1}$, $i = 1, \ldots, N$, such that
\[
P\big(\widetilde{X}^{(i)}_{1:n-1} = X^{(j)}_{1:n-1}\big) = W^{(j)}_{n-1}, \qquad i, j = 1, \ldots, N.
\]
Obviously, this corresponds to drawing $N$ independent samples from a multinomial distribution, so this particular resampling scheme is called multinomial resampling. After resampling, for each $i = 1, \ldots, N$ we sample $X^{(i)}_n$ from $Q_n(\widetilde{X}^{(i)}_{1:n-1}, \cdot)$ and weight the particles $X^{(i)}_{1:n} = (\widetilde{X}^{(i)}_{1:n-1}, X^{(i)}_n)$ using
\[
W^{(i)}_{n} \propto \frac{d\pi_n}{d(\pi_{n-1} \otimes Q_n)}(X^{(i)}_{1:n}), \qquad \sum_{i=1}^{N} W^{(i)}_{n} = 1.
\]
The SISR method, also known as the particle filter, is summarised in Algorithm 2.7.

Algorithm 2.7. Sequential importance sampling resampling (SISR): For $n = 1$: for $i = 1, \ldots, N$ sample $X^{(i)}_1 \sim q_1$ and set $W^{(i)}_1 \propto \frac{d\pi_1}{dq_1}(X^{(i)}_1)$. For $n = 2, 3, \ldots$
• Resample $\{X^{(i)}_{1:n-1}\}_{1\leq i\leq N}$ according to the weights $\{W^{(i)}_{n-1}\}_{1\leq i\leq N}$ to get resampled particles $\{\widetilde{X}^{(i)}_{1:n-1}\}_{1\leq i\leq N}$ with weights $1/N$.
• For $i = 1, \ldots, N$: sample $X^{(i)}_n \sim Q_n(\widetilde{X}^{(i)}_{1:n-1}, \cdot)$, set $X^{(i)}_{1:n} = (\widetilde{X}^{(i)}_{1:n-1}, X^{(i)}_n)$, and set
\[
W^{(i)}_{n} \propto \frac{d\pi_n}{d(\pi_{n-1} \otimes Q_n)}(X^{(i)}_{1:n}).
\]
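Continuing the toy AR(1)-plus-noise example, here is a minimal sketch of Algorithm 2.7 with multinomial resampling at every step; again the model and names are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def multinomial_resample(x, W):
    """Draw N indices with P(index = j) = W^(j) and return the resampled,
    equally weighted particles (multinomial resampling)."""
    idx = rng.choice(len(W), size=len(W), p=W)
    return x[idx]

def sisr(y, N):
    """Algorithm 2.7 for the toy AR(1)-plus-noise model used above,
    with the state transition as proposal ('bootstrap' choice)."""
    x = rng.normal(0.0, 1.0, N)                  # X_1^(i) ~ q_1
    means = []
    for yn in y:
        logw = -0.5 * (yn - x) ** 2              # incremental weight g(y_n | x_n)
        W = np.exp(logw - logw.max())
        W /= W.sum()
        means.append(np.sum(W * x))              # estimate at time n
        x = multinomial_resample(x, W)           # X-tilde, weights reset to 1/N
        x = 0.9 * x + rng.normal(0.0, 1.0, N)    # propose X_{n+1} ~ Q_{n+1}
    return np.array(means)

y = rng.normal(size=50)                          # placeholder observations
print(sisr(y, N=2_000)[:5])
```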
The importance of resampling in the context of SMC was first demonstrated by Gordon et al. [1993], based on the ideas of Rubin [1987]. Although the resampling step alleviates the weight degeneracy problem, it has two drawbacks. Firstly, after successive resampling steps, some of the distinct particles for $X_{1:n}$ are dropped in favour of additional copies of highly weighted particles. This leads to the impoverishment of the particles, such that for $k \ll n$ very few particles represent the marginal distribution of $X_k$ under $\pi_n$ [Andrieu et al., 2005; Del Moral and Doucet, 2003; Olsson et al., 2008]. Hence, regardless of the number of particles, $\pi_n(dx_{1:k})$ will eventually be approximated by a single unique particle for all sufficiently large $n$. As a result, any attempt to perform integrations over the path space will suffer from this form of degeneracy, which is called path degeneracy. The second drawback is the extra variance introduced by the resampling step. There are a few ways of reducing the effects of resampling.

• One way is adaptive resampling, i.e. resampling only at iterations where the effective sample size drops below a certain proportion of $N$. For a practical implementation, the effective sample size at time $n$ itself should be estimated from the particles as well. One particle estimate of $N_{\mathrm{eff},n}$ is given in Liu [2001, pp. 35-36]:
\[
\widetilde{N}_{\mathrm{eff},n} = \frac{1}{\sum_{i=1}^{N} \big(W^{(i)}_{n}\big)^2}.
\]

• Another way to reduce the effects of resampling is to use alternatives to multinomial resampling. Let $I_n(i)$ be the number of times the $i$'th particle is drawn from $\pi^{N}_{n}$ in a resampling scheme. A number of resampling methods have been proposed in the literature that satisfy $\mathbb{E}[I_n(i)] = N W^{(i)}_{n}$ but have different $\operatorname{var}[I_n(i)]$. The idea behind $\mathbb{E}[I_n(i)] = N W^{(i)}_{n}$ is that the mean of the particle approximation to $\pi_n(\varphi_n)$ remains the same after resampling. Standard resampling schemes include multinomial resampling [Gordon et al., 1993], residual resampling [Liu and Chen, 1998; Whitley, 1994], stratified resampling [Kitagawa, 1996], and systematic resampling [Carpenter et al., 1999; Whitley, 1994]. There are also some non-standard resampling algorithms in which the particle count varies (randomly) after resampling (e.g. Crisan et al. [1999]; Fearnhead and Liu [2007]), or in which the weights are not constrained to be equal after resampling (e.g. Fearnhead and Clifford [2003]; Fearnhead and Liu [2007]).

• A third way of avoiding path degeneracy is provided by the resample-move algorithm [Gilks and Berzuini, 2001], where each resampled particle $\widetilde{X}^{(i)}_{1:n}$ is moved according to an MCMC kernel $K_n : \overline{\mathcal{X}}_n \to \mathcal{P}(\overline{\mathcal{X}}_n)$ whose invariant distribution is $\pi_n$. In fact, we could have included this MCMC move step in Algorithm 2.7 to make the algorithm more generic. However, the resample-move algorithm is usually most useful as a degeneracy reduction technique in a much more general setting: although possible in principle, it is computationally infeasible to apply a kernel to the path space on which the current particles live, as the state space grows at every iteration of SISR. The resample-move algorithm will be revisited in Section 2.5.4, where it is considered as a special case of a wide class of sequential sampling methods that operate on sequences of arbitrary spaces.

• The final method we mention here for reducing path degeneracy is block sampling [Doucet et al., 2006], where at time $n$ one samples the components $X_{n-L+1:n}$ for some $L > 1$, and the previously sampled values for $X_{n-L+1:n-1}$ are simply discarded. In return for the computational cost introduced by $L$, this procedure reduces the variance of the weights and hence dramatically reduces the number of resampling steps (if an adaptive resampling strategy is used). Therefore, path degeneracy is reduced.

2.5.3 Auxiliary particle filter

Recall that when the optimum proposal $Q^{opt}_n$ is used to sample $x_n$, the corresponding optimum incremental weight $w^{opt}_{n|n-1}$ does not depend on the value of $x_n$. Therefore, the optimum incremental weight indicates which particles are likely to represent $\pi_n$ better even before the new state $x_n$ is proposed.
This observation encourages a sequential sampling strategy in which the optimum incremental weights are involved in deciding which particles the algorithm carries forward to the next time step; this is the strategy on which the auxiliary particle filter [Pitt and Shephard, 1999] is based. To understand how this strategy can be implemented, it is useful to see how the target distributions at each iteration are modified by the resampling step in the SISR algorithm. One can show that, given $\pi^{N}_{n-1}$ in (2.10) as the SISR approximation to $\pi_{n-1}$, SISR targets the following distribution at time $n$ (provided that the resampling step is performed):
\[
\bar{\pi}_n(dx_{1:n}) \propto \left[\sum_{i=1}^{N} W^{(i)}_{n-1}\, w^{opt}_{n|n-1}(X^{(i)}_{1:n-1})\, \delta_{X^{(i)}_{1:n-1}}(dx_{1:n-1})\right] Q^{opt}_n(x_{1:n-1}, dx_n). \tag{2.14}
\]
In the standard SISR algorithm, the following proposal distribution is used to implement importance sampling at time $n$:
\[
\bar{q}_n(dx_{1:n}) = \underbrace{\left[\sum_{i=1}^{N} W^{(i)}_{n-1} \delta_{X^{(i)}_{1:n-1}}(dx_{1:n-1})\right]}_{\text{resampling } X_{1:n-1}} \underbrace{Q_n(x_{1:n-1}, dx_n)}_{\text{proposing } X_n},
\]
which does not fully exploit the structure in (2.14). As a result, we have a well-known drawback of SISR: if $\pi_n$ varies significantly compared to $\pi_{n-1}$, the variance of the weights can be quite high. This results in an inefficient algorithm, and a large number of particles may be required for recovery. Provided that one can calculate $w^{opt}_{n|n-1}(x_{1:n-1})$, a more sensible choice for $\bar{q}_n(dx_{1:n})$ could be
\[
\bar{q}^{opt}_n(dx_{1:n}) = \left[\sum_{i=1}^{N} \overline{W}^{(i)}_{n-1} \delta_{X^{(i)}_{1:n-1}}(dx_{1:n-1})\right] Q_n(x_{1:n-1}, dx_n), \tag{2.15}
\]
where $\overline{W}^{(i)}_{n-1} \propto W^{(i)}_{n-1} w^{opt}_{n|n-1}(X^{(i)}_{1:n-1})$ such that $\sum_{i=1}^{N} \overline{W}^{(i)}_{n-1} = 1$. Then the importance weight for a particle $X^{(i)}_{1:n} = (X^{(j)}_{1:n-1}, X^{(i)}_n)$ would be
\[
W^{(i)}_{n} \propto \frac{d\bar{\pi}_n}{d\bar{q}^{opt}_n}(X^{(i)}_{1:n}) = \frac{dQ^{opt}_n(X^{(j)}_{1:n-1}, \cdot)}{dQ_n(X^{(j)}_{1:n-1}, \cdot)}(X^{(i)}_n).
\]
This type of particle filter is called an auxiliary particle filter in the literature. The term 'auxiliary' comes from treating $X_{1:n-1}$ at time $n$ as auxiliary: in many cases where a particle filter is used, the integration of functions on $\mathcal{X}_n$ with respect to the marginal distribution $\pi_n(dx_n)$ is the main interest, and resampling $X_{1:n-1}$ in this particular way improves the Monte Carlo approximation of such integrations. One remarkable point here is that if one can use $Q_n = Q^{opt}_n$, then all the particles have equal weights. This shows how this sampling scheme can effectively reduce weight degeneracy. (Notice also that $Q_n = Q^{opt}_n$ results in the regular SISR with the optimum proposal, where the sampling and resampling steps are interchanged.) However, it may not be possible (or straightforward) to sample from $Q^{opt}_n$ or to calculate $w^{opt}_{n|n-1}(x_{1:n-1})$. This does not restrict the use of the idea behind the auxiliary particle filter, though. In fact, the auxiliary particle filter is more general: we can perform importance sampling for $\bar{\pi}_n$ by constructing a $\bar{q}_n$ which can be written generically as
\[
\bar{q}^{aux}_n(dx_{1:n}) = \sum_{i=1}^{N} \alpha^{(i)}_{n-1} \delta_{X^{(i)}_{1:n-1}}(dx_{1:n-1})\, Q_n(x_{1:n-1}, dx_n).
\]
We have complete control over $\alpha_{n-1}$ and $Q_n$; the idea, however, is to resample those particles $X^{(i)}_{1:n-1}$ which represent $\pi_n(dx_{1:n-1})$ better, and to sample $X_n$ approximately from the optimal proposal distribution in order to have weights with low variance. Therefore, the rule of thumb is to make $\alpha^{(i)}_{n-1}$ and $Q_n$ as close as possible to $\overline{W}^{(i)}_{n-1}$ and $Q^{opt}_n$. Indeed, the authors of Andrieu et al. [2001] propose an improved auxiliary particle filter scheme, where (2.15) or a suitable approximation to (2.15) is suggested to be used.
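Below is a sketch of the fully adapted case $Q_n = Q^{opt}_n$ for the toy AR(1)-plus-noise model used earlier, artificially assuming an initial state $X_0 \sim \mathcal{N}(0, 1)$ one step before the first observation so that every step has the same form. In this model both the predictive density and the optimal proposal are Gaussian and available in closed form, so all second-stage weights are equal; the model and names are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)

def fully_adapted_apf(y, N):
    """Auxiliary particle filter for X_n = 0.9 X_{n-1} + N(0,1),
    Y_n = X_n + N(0,1), X_0 ~ N(0,1). Here the predictive likelihood is
    p(y_n | x_{n-1}) = N(y_n; 0.9 x_{n-1}, 2) and the optimal proposal is
    Q_opt(x_{n-1}, dx_n) = N((0.9 x_{n-1} + y_n)/2, 1/2)."""
    x = rng.normal(0.0, 1.0, N)                   # X_0^(i)
    means = []
    for yn in y:
        # first-stage weights W_{n-1} * w_opt(x_{n-1}), with W_{n-1} = 1/N
        logv = -0.25 * (yn - 0.9 * x) ** 2        # log N(y_n; 0.9x, 2) + const
        v = np.exp(logv - logv.max())
        v /= v.sum()
        x = x[rng.choice(N, size=N, p=v)]         # resample on first-stage weights
        # propose from the optimal kernel; second-stage weights are uniform
        x = 0.5 * (0.9 * x + yn) + np.sqrt(0.5) * rng.normal(size=N)
        means.append(x.mean())                    # filtering estimate at time n
    return np.array(means)

y = rng.normal(size=50)                           # placeholder observations
print(fully_adapted_apf(y, N=2_000)[:5])
```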
2.5.4 Sequential Monte Carlo samplers

Sequential Monte Carlo samplers [Del Moral et al., 2006] cover a very large class of SMC methods. Assume that we have a sequence of somehow related distributions $\pi_1, \ldots, \pi_p$, where each $\pi_n$ is defined on an arbitrary measurable space $(\mathcal{X}_n, \mathcal{E}_n)$. There are many potential choices of $\pi_1, \ldots, \pi_p$, leading to various integration and optimisation algorithms; examples can be found in Chopin [2002] for static parameter estimation, Gelman and Meng [1998] and Neal [2001] for targeting a distribution through a sequence of intermediate distributions, Del Moral et al. [2006] for global optimisation, Johansen et al. [2005] and Del Moral et al. [2006] for rare event simulation and density estimation, and Del Moral et al. [2012] for approximate Bayesian computation. The problem of approximating these distributions sequentially using Monte Carlo is beyond the scope of the classical SIS or SISR methods, since those require the distributions to be defined on spaces of increasing dimension.

The first approach that comes to mind is to treat each $\pi_n$ individually and perform importance sampling for each of them independently. Obviously, this approach inherits the difficulties of importance sampling: unless the distribution of interest is a standard low-dimensional one, importance sampling is almost never used when there are alternatives, the main reason being the difficulty of designing a good proposal. A more reasonable way is to perform importance sampling for each $\pi_n$ individually, but this time designing the importance distributions sequentially, using an initial distribution $\eta_1$ and a sequence of transition kernels $\{K_n : \mathcal{X}_{n-1} \to \mathcal{P}(\mathcal{X}_n)\}_{n\geq2}$. The idea here is that if the distributions $\pi_n$ vary slowly in $n$, then it is possible to obtain samples approximating $\pi_n$ effectively by using $K_n$ to slowly move the samples obtained for approximating $\pi_{n-1}$. Let us assume that we begin by sampling $X^{(1)}_1, \ldots, X^{(N)}_1$ from $\eta_1$ to approximate $\pi_1$. At times $n \geq 2$, we sample $X^{(i)}_n$ from $K_n(X^{(i)}_{n-1}, \cdot)$. The importance weight of $X^{(i)}_n$ is given by
\[
w^{(i)}_n = \frac{d\pi_n}{d\eta_n}(X^{(i)}_n), \qquad \eta_n = \eta_{n-1} K_n.
\]
The choice of the $K_n$'s is free, except for the requirement that $\pi_n \ll \eta_{n-1}K_n$; however, it is crucial for the performance of this method. In the literature, several different types of moves have been used, such as independent proposals [West, 1993], local random moves [Givens and Raftery, 1996], and MCMC and Gibbs moves [Del Moral et al., 2006]. This sequential implementation of the importance sampling approach is attractive and optimal in some sense (we shall soon see in what sense); however, it has a quite restrictive limitation: in most cases it is impossible to calculate the importance distribution $\eta_n$. SMC samplers come into play at this point, circumventing the need to calculate $\eta_n$. The main idea of the method is to construct synthetic distributions $\widetilde{\pi}_n$ on the extended spaces $(\mathcal{X}_1 \times \cdots \times \mathcal{X}_n, \mathcal{E}_1 \otimes \cdots \otimes \mathcal{E}_n)$ as
\[
\widetilde{\pi}_n(dx_{1:n}) = \pi_n(dx_n) \prod_{i=1}^{n-1} L_i(x_{i+1}, dx_i), \tag{2.16}
\]
where each $L_n : \mathcal{X}_{n+1} \to \mathcal{P}(\mathcal{X}_n)$ is a backward Markov kernel. Since $\widetilde{\pi}_n$ admits $\pi_n$ as a marginal by construction, importance sampling on $\widetilde{\pi}_n$ using the proposal distribution
\[
\widetilde{\eta}_n(dx_{1:n}) = \eta_1(dx_1) \prod_{i=2}^{n} K_i(x_{i-1}, dx_i)
\]
can provide an approximation for $\pi_n$ as well. Although the freedom to choose the $K_n$'s and $L_n$'s contributes to the method's generality, the performance of the method crucially depends on their choice.
In fact, the central limit theorem presented in Del Moral et al. [2006] demonstrates that the variance of the estimator is strongly dependent upon the choice of these kernels. The importance weight for this method is given by
\[
w_n(x_{1:n}) = \frac{d\widetilde{\pi}_n}{d\widetilde{\eta}_n}(x_{1:n}).
\]
It was shown in Del Moral et al. [2006] that, given $K_n$, the optimum backward kernel $L^{opt}_{n-1}$ which minimises the variance of the importance weights satisfies the relation
\[
\eta_n \otimes L^{opt}_{n-1} = \eta_{n-1} \otimes K_n.
\]
It can be shown that the importance weight for the optimum backward kernel is
\[
w^{opt}_n(x_{1:n}) = \frac{d\pi_n}{d\eta_n}(x_n).
\]
This result reveals that the optimum backward kernel takes us back to the case where one performs importance sampling on the marginal space instead of the extended one. However, most of the time $\eta_n$ cannot be calculated, hence other, sub-optimal, backward kernels must be used. It was shown in Del Moral et al. [2006] that when $L^{opt}_{n-1}$ is not used, the variance of $w_n(x_{1:n})$ cannot be stabilised. For that reason, resampling of the samples used for approximating $\pi_{n-1}$ is necessary before moving on to the approximation of $\pi_n$. This is possible thanks to the option of constructing $\widetilde{\pi}_n$ such that the importance weights can be expressed as a product of incremental weights. Assume that $\pi_n \ll K_n(x_{n-1}, \cdot)$ and $L_{n-1}(x_n, \cdot) \ll \pi_{n-1}$ for all $n$ and all relevant points. Then it can be shown that for a bounded measurable function $\varphi_n$ on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ we have $\widetilde{\pi}_n(\varphi_n) = \widetilde{\eta}_n(\varphi_n w_n)$, where the importance weights $w_n$ are given by
\[
w_n(x_{1:n}) = \frac{d\pi_1}{d\eta_1}(x_1) \prod_{i=2}^{n} \frac{dL_{i-1}(x_i, \cdot)}{d\pi_{i-1}}(x_{i-1})\, \frac{d\pi_i}{dK_i(x_{i-1}, \cdot)}(x_i). \tag{2.17}
\]
Equation (2.17) admits a recursion in $n$, namely $w_n(x_{1:n}) = w_{n|n-1}(x_{n-1}, x_n)\, w_{n-1}(x_{1:n-1})$, where the incremental weight $w_{n|n-1}(x_{n-1}, x_n)$ is given by
\[
w_{n|n-1}(x_{n-1}, x_n) = \frac{d\pi_n}{dK_n(x_{n-1}, \cdot)}(x_n)\, \frac{dL_{n-1}(x_n, \cdot)}{d\pi_{n-1}}(x_{n-1}). \tag{2.18}
\]
Note that the recursive form of the weights enables us to implement an SMC method for the synthetic distributions $\widetilde{\pi}_n$. In fact, when (2.17) exists, the SMC sampler for $\pi_1, \ldots, \pi_p$ is the SISR algorithm targeting $\widetilde{\pi}_1, \ldots, \widetilde{\pi}_p$ using the initial and transition proposal distributions $\eta_1$ and $K_n$, $n = 2, \ldots, p$, respectively, with the incremental weights given in (2.18). Note that, in practice, even if $L_n$ is not absolutely continuous with respect to $\pi_n$, we can still obtain importance weights factorised into incremental weights by taking the restrictions of the $L_n$'s to the supports of the $\pi_n$'s. Note, also, that as in importance sampling, SIS, and SISR, even if we know the $\widetilde{\pi}_n$'s and $\widetilde{\eta}_n$'s only up to normalising constants, we can still run the SMC sampler algorithm to approximate the integrals $\pi_n(\varphi_n)$ and to estimate the unknown normalising constants as well.

SMC samplers generalise many earlier works in the literature. For example, the annealed importance sampling method, proposed by Neal [2001] for sequences of slowly varying distributions, corresponds to the SMC sampler without resampling where $L_{n-1}$ satisfies
\[
\pi_{n-1}K_n \otimes L_{n-1} = \pi_{n-1} \otimes K_n \tag{2.19}
\]
and $K_n$ is such that $\pi_{n-1}$ is $K_n$-invariant. To deal with the variance problem in general cases, the equivalent choice of kernels is used in (among others) Chopin [2002] and Gilks and Berzuini [2001] with resample-move strategies, which corresponds to the SMC sampler algorithm with resampling. Population Monte Carlo, presented by Cappé et al. [2004] and extended by Celeux et al. [2006], is another special case of SMC samplers, where the authors consider the homogeneous case $\pi_n = \pi$ with $L_n(x, dx') = \pi(dx')$ and $K_n(x, dx') = K_n(dx')$. Finally, Liang [2002] presents a related algorithm where $\pi_n = \pi$ and $K_n(x, dx') = L_n(x, dx') = K(x, dx')$.
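As a concrete instance, here is a sketch of a tempered SMC sampler, essentially annealed importance sampling with resampling and Metropolis moves: $K_k$ is a random-walk Metropolis kernel leaving the current tempered target invariant, and the backward kernel is the time-reversal choice related to (2.19), so the incremental weight reduces to the ratio of successive tempered densities. The target, the temperature schedule, and all numerical values are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)

def smc_sampler(log_gamma, betas, N, mh_steps=5, step=0.5):
    """Tempered SMC sampler: pi_k(x) propto N(x; 0, 1) exp(beta_k log_gamma(x)),
    incremental weight exp((beta_k - beta_{k-1}) log_gamma(x)), followed by
    multinomial resampling and MH moves leaving pi_k invariant."""
    x = rng.normal(size=N)                        # X^(i) ~ eta_1 = N(0, 1)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        logw = (b - b_prev) * log_gamma(x)        # incremental weights
        W = np.exp(logw - logw.max())
        W /= W.sum()
        x = x[rng.choice(N, size=N, p=W)]         # resample

        def log_pi(z):                            # current tempered target
            return -0.5 * z ** 2 + b * log_gamma(z)

        for _ in range(mh_steps):                 # MH moves, pi_k-invariant
            y = x + step * rng.normal(size=N)
            accept = np.log(rng.uniform(size=N)) < log_pi(y) - log_pi(x)
            x = np.where(accept, y, x)
    return x

# Toy 'log-likelihood' pulling the N(0,1) prior towards 3; the final target
# is Gaussian with precision 5 and mean 2.4.
samples = smc_sampler(lambda x: -2.0 * (x - 3.0) ** 2, np.linspace(0, 1, 21), N=2_000)
print(samples.mean())  # close to 2.4
```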
2.6 Approximate Bayesian computation

Assume that we have a random variable of interest $X$, taking values in $\mathcal{X}$. Its probability distribution $\pi(dx)$ has a density on $\mathcal{X}$ with respect to a dominating measure $dx$, which is abusively denoted as $\pi(x)$.¹ The value of $X$, denoted by $x$, is observed indirectly through an observation process generating values $Y \in \mathcal{Y}$ according to a conditional observation distribution which also has a density on $\mathcal{Y}$ with respect to $dy$, denoted $g(y|x)$. The density $g(y|x)$ is also called the likelihood. The posterior distribution of $X$ given $Y = y$ has the following density, given by Bayes' theorem:
\[
\pi(x|y) = \frac{\pi(x)\, g(y|x)}{\int_{\mathcal{X}} \pi(x')\, g(y|x')\, dx'}.
\]
Approximate Bayesian computation (ABC) deals with the problem of Monte Carlo approximation of $\pi(x|y)$ when the likelihood $g(y|x)$ is intractable. By intractability it is meant either that the density does not have a closed form expression or that it is prohibitively expensive to calculate. ABC methods try to approximate $\pi(x|y)$ while circumventing the calculation of $g(y|x)$, and for this reason they are also known as likelihood-free methods. The main idea behind ABC is to simulate from the observation process and accept simulated samples provided that they are close to the observed value $y$ in some sense. ABC methods have emerged in the past ten years as one of the most satisfactory approaches to intractable likelihood problems. This section is a brief and limited review of the main contributions to the ABC methodology; for a more detailed recent review, one can see Marin et al. [2011].

¹It is simpler to describe the methodology in this section using densities instead of measures.

The core idea of ABC was first mentioned in Rubin [1984], but the first ABC method was proposed by Tavaré et al. [1997] as a special case of rejection sampling for discrete $\mathcal{Y}$. It proposes to sample $(x, u)$ from $\pi(x)g(u|x)$ and consider only those samples for which $u = y$. It is not difficult to show that if the accepted samples are $(X^{(1)}, y), \ldots, (X^{(N)}, y)$, then $X^{(1)}, \ldots, X^{(N)}$ are samples from the posterior $\pi(x|y)$. Note that this is a rejection sampling method for the distribution $\pi(x, u|y)$ on $\mathcal{X} \times \mathcal{Y}$, given by
\[
\pi(x, u|y) \propto \pi(x)\, g(u|x)\, \mathbb{I}_{y}(u), \tag{2.20}
\]
and when this density is integrated over $u$, we end up with $\pi(x|y)$. This method is exact in the sense that the obtained samples for $X$ are drawn from $\pi(x|y)$. However, $\mathbb{I}_y(u)$ would obviously not work for continuous $\mathcal{Y}$, since the probability of hitting $\{y\}$ is zero. The first genuine ABC method, proposed by Pritchard et al. [1999], also a rejection sampler, relaxes $\mathbb{I}_y(u)$ and replaces (2.20) with
\[
\pi^{\epsilon}(x, u|y) \propto \pi(x)\, g(u|x)\, \mathbb{I}_{A^{\epsilon}_{y}}(u), \tag{2.21}
\]
where $A^{\epsilon}_{y}$, called the ABC set, is defined based on a summary statistic $s : \mathcal{Y} \to \mathbb{R}^{d_s}$ and a distance metric $\rho : \mathbb{R}^{d_s} \times \mathbb{R}^{d_s} \to \mathbb{R}$ as
\[
A^{\epsilon}_{y} = \{u \in \mathcal{Y} : \rho[s(u), s(y)] < \epsilon\}. \tag{2.22}
\]
If $s$ is sufficient with respect to $x$, one can show that, as $\epsilon$ tends to zero, the marginal of the density $\pi^{\epsilon}(x, u|y)$ with respect to $x$ converges to the posterior $\pi(x|y)$. In most cases sufficient statistics are not available, hence the choice of summary statistics is of great importance. The ABC literature is rich in papers discussing the selection of these summary statistics; see Fearnhead and Prangle [2012] for an example.
The method of Pritchard et al. [1999] is thus a rejection sampling method targeting $\pi^{\epsilon}(x, u|y)$: generate $(X, U)$ from $\pi(x)g(u|x)$ and keep only those samples for which $U \in A^{\epsilon}_{y}$. This method is summarised in Algorithm 2.8.

Algorithm 2.8. Rejection sampling for ABC: To generate a single sample from $\pi^{\epsilon}(x, u|y)$,
1. Generate $(X, U) \sim \pi(x)g(u|x)$.
2. If $U \in A^{\epsilon}_{y}$, accept $(X, U)$; else go to 1.

Using simulations from the prior distribution $\pi(x)$ can be inefficient, since this takes neither the data nor the previously accepted samples into account when proposing a new $x$, and thus fails to propose values located in high posterior probability regions. To overcome this impracticality of rejection sampling, an MCMC-based ABC method was developed by Marjoram et al. [2003]. The method is simply an MCMC algorithm targeting $\pi^{\epsilon}(x, u|y)$ which uses an instrumental kernel with density $q(x'|x)g(u'|x')$ to move the samples $(x, u)$, and takes either $(x', u')$ or $(x, u)$ as the next sample according to the corresponding acceptance probability
\[
\alpha(x, u; x', u') = \min\left\{1, \frac{q(x|x')\,\pi(x')\,\mathbb{I}_{A^{\epsilon}_{y}}(u')}{q(x'|x)\,\pi(x)}\right\}, \qquad x, x' \in \mathcal{X},\ u, u' \in \mathcal{Y}.
\]
The MCMC-ABC method is given in Algorithm 2.9.

Algorithm 2.9. MCMC for ABC: Begin with some $(X_1, U_1) \in \mathcal{X} \times \mathcal{Y}$. For $n = 2, 3, \ldots$
• Generate $(X', U') \sim q(x'|X_{n-1})g(u'|x')$.
• Set $(X_n, U_n) = (X', U')$ with probability $\alpha(X_{n-1}, U_{n-1}; X', U')$; otherwise set $(X_n, U_n) = (X_{n-1}, U_{n-1})$.

It is useful to interpret the ABC posterior entirely in terms of densities: we can regard $\mathbb{I}_{A^{\epsilon}_{y}}(u)$ as being proportional to the density of a conditional distribution of $Y$ given $U = u$, say $\kappa^{\epsilon}(y|u)$. Then we can rewrite (2.21) as
\[
\pi^{\epsilon}(x, u|y) = \frac{\pi(x)\, g(u|x)\, \kappa^{\epsilon}(y|u)}{\int_{\mathcal{X}\times\mathcal{Y}} \pi(x')\, g(u'|x')\, \kappa^{\epsilon}(y|u')\, dx'\, du'}.
\]
A useful generalisation of $\pi^{\epsilon}(x, u|y)$ is obtained by taking $\kappa^{\epsilon}(y|u)$ to be some normalised kernel with bandwidth $\epsilon$ centred at $u$. In many applications it is sometimes practical to take $\kappa^{\epsilon}$ proportional not to an indicator function but to a smooth kernel, such as a Gaussian kernel, to make calculations tractable or to avoid computational waste due to rejections. We will see a use of a smooth kernel in Chapter 6. Another use of kernels in defining the ABC posterior is to be able to express the difference between the ABC posterior and the true posterior in terms of model error. Note that ABC suffers from model discrepancy, since it corresponds to performing Bayesian inference for the case where the observation $Y$ has the conditional probability density not $g(y|x)$ but
\[
g^{\epsilon}(y|x) = \int_{\mathcal{Y}} g(u|x)\, \kappa^{\epsilon}(y|u)\, du.
\]
Therefore, we say that the ABC posterior is not 'calibrated'. A way of rephrasing this is that if the model included an error term, characterised by $\kappa^{\epsilon}$, then ABC would target the true posterior, hence the ABC posterior would be 'calibrated' [Wilkinson, 2008]. This leads to the method of noisy ABC [Dean et al., 2011; Fearnhead and Prangle, 2012], which adds noise to the (summary statistic of the) data itself to obtain $y^{\epsilon} \sim \kappa^{\epsilon}(\cdot|y)$, and then performs ABC for the modified data by targeting $\pi^{\epsilon}(x, u|y^{\epsilon})$, which is calibrated.
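A minimal sketch of Algorithm 2.8 on a toy model where the exact posterior is known, so the quality of the ABC approximation can be checked; all function names and the model itself are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)

def abc_rejection(y_obs, prior_sampler, simulate, distance, eps, n_accept):
    """Algorithm 2.8: keep prior draws whose pseudo-data fall in the ABC set.
    Here s is the identity summary statistic and rho an absolute distance."""
    accepted = []
    while len(accepted) < n_accept:
        x = prior_sampler()                 # X ~ pi(x)
        u = simulate(x)                     # U ~ g(u | x)
        if distance(u, y_obs) < eps:        # U in A_eps_y: accept
            accepted.append(x)
    return np.array(accepted)

# Toy example: X ~ N(0, 1), Y | X = x ~ N(x, 1), observed y = 1.5.
samples = abc_rejection(
    y_obs=1.5,
    prior_sampler=lambda: rng.normal(),
    simulate=lambda x: x + rng.normal(),
    distance=lambda u, y: abs(u - y),
    eps=0.1,
    n_accept=2_000,
)
print(samples.mean())  # close to the exact posterior mean y/2 = 0.75
```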
Beyond approximating the posterior distribution of a single random variable, the ABC approach also extends to sequential inference. Jasra et al. [2012] propose an ABC implementation scheme for approximating densities like $\pi_n(x_{1:n}|y_{1:n})$ in hidden Markov models (HMMs), where $\{X_t\}_{t\geq1}$ is a Markov process and the distribution of the observables $\{Y_t\}_{t\geq1}$ conditioned on the hidden process is intractable. Their approach is related to the convolution particle filter of Campillo and Rossi [2009]. Dean et al. [2011] discuss the ABC implementation for HMMs further and show that the model for which noisy ABC is exact is itself an HMM; therefore they conclude that noisy ABC can be implemented for HMMs. We will see the sequential implementation of ABC, as well as its use for static parameter estimation, in more detail in Chapter 6.

While decreasing the value of $\epsilon$ obviously brings $\pi^{\epsilon}(x, u|y)$ closer to the true posterior, the variance of its Monte Carlo approximation then becomes a more crucial issue. Therefore, it is important to keep the variance of the approximation at a reasonable level while making $\epsilon$ sufficiently small. For this reason, SMC samplers are used to approximate a sequence of distributions $\{\pi_k(x, u|y) = \pi^{\epsilon_k}(x, u|y)\}_{1\leq k\leq p}$, where $\epsilon_1 > \cdots > \epsilon_p = \epsilon$ and the difference between successive $\epsilon_k$'s is small enough to make the successive $\pi_k(x, u|y)$ vary sufficiently slowly. SMC samplers were used in the ABC context in Sisson et al. [2007], and the method there was improved in Beaumont et al. [2009], Toni et al. [2009], and Sisson et al. [2009]. Del Moral et al. [2012] showed the relation of these works to SMC samplers explicitly. Other novelties of Del Moral et al. [2012] are that the authors rely on $M$ repeated simulations of the pseudo-data $u$, benefiting from the variance reduction of Monte Carlo averaging, and that they propose a scheme for the adaptive selection of the sequence of tolerance levels $\{\epsilon_k\}_{1\leq k\leq p}$. The forward kernel at step $k$ is chosen to leave $\pi_{k-1}$ invariant, and the backward kernel is chosen to satisfy (2.19).

Other than inference of hidden variables, there is also a considerable body of work on model selection using ABC methods. We will not review these methods, as they are not of particular interest for this thesis; the interested reader may see Marin et al. [2011] for a review and the references therein for details.

Chapter 3

Hidden Markov Models and Parameter Estimation

Summary: This chapter contains the second half of the literature survey. The main purpose of this chapter is to introduce hidden Markov models (HMMs), which are also known as general state-space models, and review their use in the literature as a powerful framework for filtering and parameter estimation.

3.1 Introduction

HMMs arguably constitute the widest class of time series models used for modelling stochastic dynamic systems. In Section 3.2, we will introduce HMMs using a formulation that is appropriate for filtering and parameter estimation problems. We will restrict ourselves to discrete time homogeneous HMMs whose dynamics for the hidden states and observables admit conditional probability densities parametrised by vector valued static parameters. However, this is our only restriction; we keep our framework general enough to cover models with non-linear non-Gaussian dynamics.

One of the main problems dealt with within the framework of HMMs is optimal Bayesian filtering, which has many applications in signal processing and related areas such as speech processing [Rabiner, 1989], finance [Pitt and Shephard, 1999], robotics [Gordon et al., 1993], communications [Andrieu et al., 2001], etc. Due to the non-linearity and non-Gaussianity of most models of interest in real life applications, approximate solutions are inevitable, and SMC is the main computational tool used for this; see e.g. Doucet et al. [2001] for a wide selection of examples demonstrating the use of SMC.
SMC methods have already been presented in their general form in Section 2.5; we will present their application to HMMs for optimal Bayesian filtering in Section 3.3.

In practice, it is rare that the practitioner has complete knowledge of the static parameters of the time series model which she uses to perform optimal Bayesian filtering. This raises the necessity of 'calibrating the model', hence estimating its static parameters. Note also that estimating the static parameters of an HMM may itself be the main objective. Section 3.4 of this chapter contains a review of the methodology for static parameter estimation in HMMs; in particular, we will present some of the popular maximum likelihood estimation (MLE) algorithms. We will also show how to obtain SMC approximations of those MLE algorithms for HMMs.

Although we present optimal Bayesian filtering and static parameter estimation methods and their SMC approximations within the framework of HMMs, we would like to stress that this thesis does contain time series models which are not HMMs (at least in the way we deal with them). We will mention such models in Section 3.2.1. The reason why we restrict ourselves to HMMs is that the computational tools developed for them are generally applicable to more general time series models with some suitable modifications.

3.2 Hidden Markov models

We begin with the definition of an HMM. Let $\{X_n\}_{n\geq1}$ be a homogeneous Markov chain defined on $(\mathcal{X}, \mathcal{E}_X)$. Suppose that this process is observed as another process $\{Y_n\}_{n\geq1}$ defined on $(\mathcal{Y}, \mathcal{E}_Y)$, such that the conditional distribution of $Y_n$ given all the other random variables depends only on $X_n$. Then the bivariate process $\{X_n, Y_n\}_{n\geq1}$ is called an HMM. We give below a more formal definition, taken from Cappé et al. [2005]; we additionally assume that the HMM is parametrised by a vector valued static parameter.

Definition 3.1 (HMM). Let $(\mathcal{X}, \mathcal{E}_X)$ and $(\mathcal{Y}, \mathcal{E}_Y)$ be two measurable spaces, $d_{\theta} > 0$, and let $\Theta$ be a compact subset of $\mathbb{R}^{d_{\theta}}$. For any $\theta \in \Theta$, let $\mu_{\theta}$, $F_{\theta}$, and $G_{\theta}$ denote, respectively, a probability measure on $(\mathcal{X}, \mathcal{E}_X)$, a Markov transition kernel on $(\mathcal{X}, \mathcal{E}_X)$, and a transition kernel from $(\mathcal{X}, \mathcal{E}_X)$ to $(\mathcal{Y}, \mathcal{E}_Y)$. Consider the Markov transition kernel $H_{\theta}$ defined on the product space $(\mathcal{X} \times \mathcal{Y}, \mathcal{E}_X \otimes \mathcal{E}_Y)$ such that for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and $C \in \mathcal{E}_X \otimes \mathcal{E}_Y$,
\[
H_{\theta}[(x, y), C] = \int_{C} F_{\theta}(x, dx')\, G_{\theta}(x', dy').
\]
Then the Markov chain $\{X_n, Y_n\}_{n\geq1}$ with initial distribution $\mu_{\theta} \otimes G_{\theta}$ and transition kernel $H_{\theta}$ is called a hidden Markov model (HMM) parametrised by $\theta$.

Although this definition concerns the joint process $\{X_n, Y_n\}_{n\geq1}$, the term 'hidden' is justified when only $\{Y_n\}_{n\geq1}$ is observable. We call $\{X_n\}_{n\geq1}$ the hidden process and its states the hidden states, and $\{Y_n\}_{n\geq1}$ is called the observed process, containing the observed states. We will deal with real valued vector processes, which is why we always take $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$. Note, also, that it follows from Definition 3.1 that $\{X_n\}_{n\geq1}$ is $\mathrm{Markov}(\mu_{\theta}, F_{\theta})$, and that the observations $\{Y_n\}_{n\geq1}$ conditioned upon $\{X_n\}_{n\geq1}$ are independent with conditional distributions $G_{\theta}(x_n, \cdot)$; i.e., for every $A \in \mathcal{E}_X$ and $B \in \mathcal{E}_Y$ we have
\[
\mathbb{P}_{\theta}(X_1 \in A) = \mu_{\theta}(A), \qquad \mathbb{P}_{\theta}(X_n \in A \mid X_{1:n-1} = x_{1:n-1}) = F_{\theta}(x_{n-1}, A), \tag{3.1}
\]
\[
\mathbb{P}_{\theta}\left(Y_n \in B \mid \{X_t\}_{t\geq1} = \{x_t\}_{t\geq1},\ \{Y_t\}_{t\neq n} = \{y_t\}_{t\neq n}\right) = G_{\theta}(x_n, B). \tag{3.2}
\]
In the time series literature, the term HMM has been widely associated with the case of $\mathcal{X}$ being finite [Rabiner, 1989], and models with continuous $\mathcal{X}$ are often referred to as state-space models. Again, in some works the term 'state-space models' refers to the case of linear Gaussian systems [Anderson and Moore, 1979]. We emphasise at this point that in this thesis we shall keep the framework as general as possible. We consider the general case of measurable spaces, and we avoid making any restrictive assumptions on $\mu_{\theta}$, $F_{\theta}$, and $G_{\theta}$ that would impose a certain structure on the dynamics of the HMM. Also, we clarify that, in contrast to previous restrictive uses of the terminology, we will use both terms 'HMM' and 'general state-space model' to describe exactly the same thing, as defined by Definition 3.1.

For the rest of the thesis we will be dealing with fully dominated HMMs, where $\mu_{\theta}$, $F_{\theta}(x, \cdot)$ and $G_{\theta}(x, \cdot)$ have densities with respect to some dominating measures. We give a formal definition of a fully dominated HMM here.

Definition 3.2 (fully dominated HMM). Consider the HMM in Definition 3.1. Suppose that there exist probability measures $\lambda$ on $(\mathcal{X}, \mathcal{E}_X)$ and $\nu$ on $(\mathcal{Y}, \mathcal{E}_Y)$ such that (i) $\mu_{\theta}$ is absolutely continuous with respect to $\lambda$, (ii) for all $x \in \mathcal{X}$, $F_{\theta}(x, \cdot)$ is absolutely continuous with respect to $\lambda$ with transition density function $f_{\theta}(\cdot|x)$, and (iii) for all $x \in \mathcal{X}$, $G_{\theta}(x, \cdot)$ is absolutely continuous with respect to $\nu$ with transition density function $g_{\theta}(\cdot|x)$. Then the HMM is said to be fully dominated, and the joint Markov transition kernel $H_{\theta}$ is dominated by the product measure $\lambda \otimes \nu$ and admits the transition density function
\[
h_{\theta}(x', y'|x, y) = f_{\theta}(x'|x)\, g_{\theta}(y'|x').
\]
Therefore, for a fully dominated HMM as in Definition 3.2, the joint probability density of $(X_{1:n}, Y_{1:n})$ exists and is given by
\[
p_{\theta}(x_{1:n}, y_{1:n}) = \mu_{\theta}(x_1)\, g_{\theta}(y_1|x_1) \prod_{t=2}^{n} f_{\theta}(x_t|x_{t-1})\, g_{\theta}(y_t|x_t), \tag{3.3}
\]
where, with an abuse of notation, we have used $\mu_{\theta}$ also to denote the density of the probability measure $\mu_{\theta}$. The joint law of all the variables of the HMM up to time $n$ is summarised in (3.3), from which we derive several probability densities of interest. One example is the likelihood of the observations up to time $n$, which can be derived as
\[
p_{\theta}(y_{1:n}) = \int p_{\theta}(x_{1:n}, y_{1:n})\, \lambda(dx_{1:n}). \tag{3.4}
\]
Maximisation of this quantity with respect to $\theta$ is the main interest of this thesis. Another important probability density, which will be pursued in detail, is the density of the posterior distribution of $X_{1:n}$ given $Y_{1:n} = y_{1:n}$, obtained by Bayes' theorem:
\[
p_{\theta}(x_{1:n}|y_{1:n}) = \frac{p_{\theta}(x_{1:n}, y_{1:n})}{p_{\theta}(y_{1:n})}. \tag{3.5}
\]

3.2.1 Extensions to HMMs

Although HMMs are the most common class of time series models in the literature, there are also many time series models which are not HMMs yet are still of great importance. These models differ from HMMs mostly because they do not possess the conditional independence of the observations. Here, we give two examples that we will also use in this thesis.

• In the first example of such models, the process $\{X_n\}_{n\geq1}$ is still a Markov chain; however, the conditional distribution of $Y_n$, given all past variables $X_{1:n}$ and $Y_{1:n-1}$, depends not only on the value of $X_n$ but also on the values of the past observations $Y_{1:n-1}$. If we denote the probability density of this conditional distribution by $g_{\theta,n}(y_n|x_n, y_{1:n-1})$, the joint probability density of $(X_{1:n}, Y_{1:n})$ is
\[
p_{\theta}(x_{1:n}, y_{1:n}) = \mu_{\theta}(x_1)\, g_{\theta}(y_1|x_1) \prod_{t=2}^{n} f_{\theta}(x_t|x_{t-1})\, g_{\theta,t}(y_t|x_t, y_{1:t-1}).
\]
If $Y_n$ given $X_n$ is independent of the observations prior to time $n-k$, then we can define a $g_{\theta}$ such that $g_{\theta,n}(y_n|x_n, y_{1:n-1}) = g_{\theta}(y_n|x_n, y_{n-k:n-1})$ for all $n$. One example of such models is a changepoint model; e.g. see Fearnhead and Liu [2007]. We will encounter changepoint models in Chapter 4 of this thesis. The terminology regarding the type of models where we have $g_{\theta}(y_n|x_n, y_{n-k:n-1})$ is not fully standardised. One term in use is Markov switching models; Markov jump systems is also used, at least in cases where the hidden state space is finite [Cappé et al., 2005]. These models have much in common with basic HMMs, in the sense that virtually identical computational tools may be used for both. In the particular context of SMC, the similarity between these two types of models is most clearly exposed in Del Moral [2004] via the Feynman-Kac representation of SMC methods, where the conditional density of the observation at time $n$ is treated generally as a potential function of $x_n$.

• In another type of time series model that is not an HMM, the latent process $\{X_n\}_{n\geq1}$ is, again, still a Markov chain; however, the observation at the current time depends on all the past values, i.e. $Y_n$ conditional on $(X_{1:n}, Y_{1:n-1})$ depends on all of these conditioning random variables. Such models usually result from marginalising an extended HMM. Consider the HMM $\{(X_n, Z_n), Y_n\}_{n\geq1}$, where the joint process $\{X_n, Z_n\}_{n\geq1}$ is a Markov chain whose transition law admits a density $f_{\theta}$ with respect to the product measure $\lambda_1 \otimes \lambda_2$ which can be factorised as
\[
f_{\theta}(x_n, z_n|x_{n-1}, z_{n-1}) = f_{\theta,1}(x_n|x_{n-1})\, f_{\theta,2}(z_n|x_n, z_{n-1}),
\]
and where the observation $Y_n$, given all the past random variables, depends only on $X_n$ and $Z_n$ and admits the probability density $g_{\theta}(y_n|x_n, z_n)$. Now, the marginal bivariate process $\{X_n, Y_n\}_{n\geq1}$ is not an HMM, and we express the joint density of $(X_{1:n}, Y_{1:n})$ as
\[
p_{\theta}(x_{1:n}, y_{1:n}) = \mu_{\theta}(x_1)\, p_{\theta,1}(y_1|x_1) \prod_{t=2}^{n} f_{\theta,1}(x_t|x_{t-1})\, p_{\theta,t}(y_t|x_{1:t}, y_{1:t-1}),
\]
where the density $p_{\theta,n}(y_n|x_{1:n}, y_{1:n-1})$ is given by
\[
p_{\theta,n}(y_n|x_{1:n}, y_{1:n-1}) = \int p_{\theta}(z_{1:n-1}|x_{1:n-1}, y_{1:n-1})\, f_{\theta,2}(z_n|x_n, z_{n-1})\, g_{\theta}(y_n|x_n, z_n)\, \lambda_2(dz_{1:n}). \tag{3.6}
\]
The reason $\{X_n, Y_n\}_{n\geq1}$ might be of interest is that the conditional laws of $Z_{1:n}$ may be available in closed form, and exact evaluation of the integral in (3.6) may be possible. In that case, it can be more effective to perform the Monte Carlo approximation for the law of $X_{1:n}$ given the observations $Y_{1:n}$, which leads to the so-called Rao-Blackwellised particle filters in the literature [Doucet et al., 2000a]. The integration is indeed available in closed form for some time series models. One example is linear Gaussian switching state-space models [Chen and Liu, 2000; Doucet et al., 2000a; Fearnhead and Clifford, 2003], where $X_n$ takes values in a finite set whose elements are often called 'labels', and, conditioned on $\{X_n\}_{n\geq1}$, $\{Z_n, Y_n\}_{n\geq1}$ is a linear Gaussian state-space model. A more sophisticated time series model of the same nature is the linear Gaussian multiple target tracking model, which we will investigate in detail in Chapter 5.

Having stated that the interest of this thesis is in time series models more general than HMMs, we note that the computational tools developed for HMMs are generally applicable to this more general class of time series models with suitable modifications. For this reason we carry on this chapter with a review of SMC and parameter estimation methods for HMMs.
3.3 Sequential inference in HMMs

3.3.1 Bayesian optimal filtering

In an HMM, one is usually interested in sequential inference on the variables of the hidden process $\{X_t\}_{t\geq1}$ given observations $Y_{1:n} = y_{1:n}$ up to time $n$. For example, one seeks the sequence of posterior distributions $\{p_\theta(x_{1:n}|y_{1:n})\}_{n\geq1}$, where $p_\theta(x_{1:n}|y_{1:n})$ is given in equation (3.5). It is also straightforward to generalise $p_\theta(x_{1:n}|y_{1:n})$ to the posterior distribution of $X_{1:n'}$ for any $n' \geq 1$. For $n' > n$ we have
$$p_\theta(x_{1:n'}|y_{1:n}) = p_\theta(x_{1:n}|y_{1:n}) \prod_{t=n+1}^{n'} f_\theta(x_t|x_{t-1});$$
whereas for $n' < n$ the density $p_\theta(x_{1:n'}|y_{1:n})$ can be obtained simply by integrating out the variables $x_{n'+1:n}$, i.e.
$$p_\theta(x_{1:n'}|y_{1:n}) = \int p_\theta(x_{1:n}|y_{1:n}) \, \lambda(dx_{n'+1:n}).$$

It is possible to obtain a recursion for these posterior distributions as one receives observations sequentially. Equations (3.3) and (3.5) reveal that we can write $p_\theta(x_{1:n}|y_{1:n})$ in terms of $p_\theta(x_{1:n-1}|y_{1:n-1})$ as
$$p_\theta(x_{1:n}|y_{1:n}) = \frac{f_\theta(x_n|x_{n-1}) g_\theta(y_n|x_n)}{p_\theta(y_n|y_{1:n-1})} \, p_\theta(x_{1:n-1}|y_{1:n-1}). \qquad (3.7)$$
The normalising constant $p_\theta(y_n|y_{1:n-1})$ can be written in terms of the known densities as
$$p_\theta(y_n|y_{1:n-1}) = \int p_\theta(x_{1:n-1}|y_{1:n-1}) f_\theta(x_n|x_{n-1}) g_\theta(y_n|x_n) \, \lambda(dx_{1:n}). \qquad (3.8)$$
Also, by convention, $p_\theta(y_1|y_0) = p_\theta(y_1) = \int g_\theta(y_1|x_1) \mu_\theta(x_1) \lambda(dx_1)$. The recursion in (3.7) is essential since it enables efficient sequential approximation of the distributions $p_\theta(x_{1:n}|y_{1:n})$, as we will see in Section 3.3.2.

From a Bayesian point of view, the probability densities $p_\theta(x_{1:n'}|y_{1:n})$ are complete solutions to the inference problems, as they contain all the information about the hidden states $X_{1:n'}$ given the observations $y_{1:n}$. For example, the expectation of a measurable function $\varphi_{n'}: \mathsf{X}^{n'} \to \mathbb{R}^{d_\varphi(n')}$ conditional upon the observations $y_{1:n}$ can be evaluated as
$$\mathbb{E}_\theta[\varphi_{n'}(X_{1:n'})|y_{1:n}] = \int \varphi_{n'}(x_{1:n'}) \, p_\theta(x_{1:n'}|y_{1:n}) \, \lambda(dx_{1:n'}).$$
However, one can also restrict attention to a problem of smaller size, such as the marginal distribution of the random variable $X_k$, $k \leq n'$, given $y_{1:n}$. The probability density of such a marginal posterior distribution, $p_\theta(x_k|y_{1:n})$, is called a smoothing, filtering or prediction density if $k < n$, $k = n$ or $k > n$, respectively. Indeed, there are many cases where one is interested in calculating the expectations of functions $\varphi: \mathsf{X} \to \mathbb{R}^{d_\varphi}$ of $X_k$ given $y_{1:n}$,
$$\mathbb{E}_\theta[\varphi(X_k)|y_{1:n}] = \int \varphi(x_k) \, p_\theta(x_k|y_{1:n}) \, \lambda(dx_k).$$
Although once we have $p_\theta(x_{1:n'}|y_{1:n})$ for some $n' \geq k$ the marginal density can be obtained directly by marginalisation, this route via the recursion in (3.7) may be intractable or too expensive to calculate. Therefore it is useful to have alternative recursions that evaluate the marginal densities sequentially. Here, we will cover the recursions for the filtering and one-step prediction densities. Given the filtering density $p_\theta(x_{n-1}|y_{1:n-1})$ at time $n-1$, the filtering density at time $n$ is usually obtained recursively in two stages, called prediction and update. These are given as
$$p_\theta(x_n|y_{1:n-1}) = \int f_\theta(x_n|x_{n-1}) \, p_\theta(x_{n-1}|y_{1:n-1}) \, \lambda(dx_{n-1}), \qquad (3.9)$$
$$p_\theta(x_n|y_{1:n}) = \frac{g_\theta(y_n|x_n) \, p_\theta(x_n|y_{1:n-1})}{p_\theta(y_n|y_{1:n-1})}, \qquad (3.10)$$
where this time we write the normalising constant as
$$p_\theta(y_n|y_{1:n-1}) = \int p_\theta(x_n|y_{1:n-1}) \, g_\theta(y_n|x_n) \, \lambda(dx_n). \qquad (3.11)$$
The problem of evaluating the recursion given by equations (3.9) and (3.10) is called the Bayesian optimal filtering (or simply optimal filtering) problem in the literature. In the following, we will look at the SMC methodology in the context of HMMs and review how SMC methods have been used to provide approximate solutions to the optimal filtering problem.
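When $\mathsf{X}$ is a finite set, the integrals in (3.9)-(3.11) reduce to sums and the optimal filtering recursion can be evaluated exactly, a case revisited at the start of the next subsection. The following minimal sketch illustrates this; the two-state transition matrix, the Gaussian observation density, and all variable names are illustrative assumptions rather than part of the general model.

    import numpy as np

    def finite_state_filter(y, mu, F, g):
        """Exact optimal filtering on a finite state space {1, ..., K}.
        mu: initial distribution (K,); F: transition matrix, F[i, j] = P(x' = j | x = i);
        g(yn): observation density g(yn | x) evaluated at every state, shape (K,).
        Returns the filtering distributions p(x_n | y_{1:n}) and the log-likelihood."""
        filt = np.zeros((len(y), len(mu)))
        log_lik = 0.0
        pred = mu                           # prediction p(x_1)
        for n, yn in enumerate(y):
            unnorm = g(yn) * pred           # update step (3.10), unnormalised
            c = unnorm.sum()                # normalising constant p(y_n | y_{1:n-1}), cf. (3.11)
            log_lik += np.log(c)
            filt[n] = unnorm / c
            pred = filt[n] @ F              # prediction step (3.9)
        return filt, log_lik

    # Illustrative two-state example with Gaussian emissions of state-dependent means.
    F = np.array([[0.95, 0.05], [0.10, 0.90]])
    mu = np.array([0.5, 0.5])
    means, sd = np.array([-1.0, 1.0]), 0.5
    g = lambda yn: np.exp(-0.5 * ((yn - means) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    y = np.concatenate([np.random.normal(-1, sd, 50), np.random.normal(1, sd, 50)])
    filt, ll = finite_state_filter(y, mu, F, g)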
3.3.2 Particle filters for optimal filtering

There are cases where the optimal filtering problem can be solved exactly. One such case is when $\mathsf{X}$ is a finite set [Rabiner, 1989]. Also, in linear Gaussian state-space models the densities in (3.9) and (3.10) are obtained by the Kalman filter [Kalman, 1960]. In general, however, these densities do not admit a closed-form expression and one has to use methods based on numerical approximations. One such approach is to use grid-based methods, where the continuous $\mathsf{X}$ is approximated by a finite discretised version and the update rules are applied as in the case of finite state-space HMMs. Another approach is the extended Kalman filter [Sorenson, 1985], which approximates a non-linear transition by a linear one and then performs the Kalman filter; the method fails if the nonlinearity in the HMM is substantial. An improved approach based on the Kalman filter is the unscented Kalman filter [Julier and Uhlmann, 1997], which is based on a deterministic selection of sigma-points from the support of the state distribution of interest such that the mean and variance of the true distribution are matched by the sample mean and covariance calculated at these sigma-points. All of these methods are deterministic and not capable of dealing with the most general state-space models; in particular, they will fail as the dimensions or the nonlinearities increase.

As an alternative to the deterministic approximation methods, Monte Carlo can provide a robust and efficient solution to the optimal filtering problem. SMC methods for optimal filtering, also known as particle filters, have been shown to produce more accurate estimates than the deterministic methods mentioned above [Doucet et al., 2000b; Durbin and Koopman, 2000; Kitagawa, 1996; Liu and Chen, 1998]. Good tutorials on SMC methods for filtering as well as smoothing in HMMs include, from the earliest to the most recent, Doucet et al. [2000b], Arulampalam et al. [2002], Cappé et al. [2007], Fearnhead [2008], and Doucet and Johansen [2009]. One can also see Doucet et al. [2001] as a reference book, although a somewhat dated one. Also, the book of Del Moral [2004] contains a rigorous review of numerous theoretical aspects of the SMC methodology in a different framework, where an SMC method is treated as an interacting particle system associated with the mean field interpretation of a Feynman-Kac flow.

With reference to the Monte Carlo methodology covered in Chapter 2, the filtering problem in state-space models can be considered as a sequential inference problem for the sequence of probability distributions $\pi_{\theta,n}$ on the product measurable spaces $(\mathsf{X}_n = \mathsf{X}^n, \mathcal{E}_n = \mathcal{E}^{\otimes n})$,
$$\pi_{\theta,n}(dx_{1:n}) := p_\theta(x_{1:n}|y_{1:n}) \, \lambda(dx_{1:n}).$$
As we saw in Section 2.5, we can apply the SIS and SISR methods targeting $\{\pi_{\theta,n}\}_{n\geq1}$. The SMC proposal distribution at time $n$, denoted $q_{\theta,n}$, is designed conditionally on the observations up to time $n$ and the state values up to time $n-1$; in the most general case it can be written as
$$q_{\theta,n}(dx_{1:n}) := Q_{\theta,1}(y_1, dx_1) \prod_{t=2}^{n} Q_{\theta,t}[(x_{1:t-1}, y_{1:t}), dx_t] = q_{\theta,n-1}(dx_{1:n-1}) \, Q_{\theta,n}[(x_{1:n-1}, y_{1:n}), dx_n]. \qquad (3.12)$$
In fact, most of the time the transition kernel $Q_{\theta,n}$ depends only on the current observation and the previous state; hence we simplify (3.12) by defining $Q_\theta: \mathsf{X} \times \mathsf{Y} \to \mathcal{P}(\mathcal{E})$ and taking
$$Q_{\theta,n}[(x_{1:n-1}, y_{1:n}), dx_n] = Q_\theta[(x_{n-1}, y_n), dx_n]$$
for all $n \geq 1$, with the convention $Q_\theta[(x_0, y_1), dx_1] = Q_\theta(y_1, dx_1)$.
Suppose we design $Q_\theta[(x, y), \cdot]$ such that it is absolutely continuous with respect to $\lambda$ with density $q_\theta(\cdot|x, y)$. Then we can write
$$q_{\theta,n}(dx_{1:n}) = \left[ q_\theta(x_1|y_1) \prod_{t=2}^{n} q_\theta(x_t|x_{t-1}, y_t) \right] \lambda(dx_{1:n}). \qquad (3.13)$$
If we wanted to perform SMC using the target distribution $\pi_{\theta,n}$ directly, we would have to calculate the following incremental weight at time $n$:
$$\frac{d\pi_{\theta,n}}{d(\pi_{\theta,n-1} \otimes Q_\theta)}(x_{1:n}) = \frac{f_\theta(x_n|x_{n-1}) g_\theta(y_n|x_n)}{p_\theta(y_n|y_{1:n-1}) \, q_\theta(x_n|x_{n-1}, y_n)} \propto \frac{f_\theta(x_n|x_{n-1}) g_\theta(y_n|x_n)}{q_\theta(x_n|x_{n-1}, y_n)}.$$
In most applications $p_\theta(y_n|y_{1:n-1})$ cannot be calculated; hence this Radon-Nikodym derivative is not available. For this reason, instead of $\pi_{\theta,n}$, SMC methods use the following unnormalised measure for importance sampling,
$$\widehat{\pi}_{\theta,n}(dx_{1:n}) = p_\theta(x_{1:n}, y_{1:n}) \, \lambda(dx_{1:n}),$$
whose normalising constant is $p_\theta(y_{1:n})$, the likelihood of the observations up to time $n$. In that case, the importance weight for the whole path $X_{1:n}$ is given by $w_n(x_{1:n}) = w_{n-1}(x_{1:n-1}) \, w_{n|n-1}(x_{n-1}, x_n)$, where the incremental importance weight $w_{n|n-1}(x_{n-1}, x_n)$ is
$$w_{n|n-1}(x_{n-1}, x_n) = \frac{f_\theta(x_n|x_{n-1}) g_\theta(y_n|x_n)}{q_\theta(x_n|x_{n-1}, y_n)}.$$

Algorithm 3.1. SISR (particle filter) for HMMs

For $n = 1$: for $i = 1, \ldots, N$, sample $X_1^{(i)} \sim q_\theta(\cdot|y_1)$ and set
$$W_1^{(i)} \propto \frac{\mu_\theta(X_1^{(i)}) \, g_\theta(y_1|X_1^{(i)})}{q_\theta(X_1^{(i)}|y_1)}.$$
For $n = 2, 3, \ldots$:
• Resample $\{X_{1:n-1}^{(i)}\}_{1\leq i\leq N}$ according to the weights $\{W_{n-1}^{(i)}\}_{1\leq i\leq N}$ to get resampled particles $\{\widetilde{X}_{1:n-1}^{(i)}\}_{1\leq i\leq N}$ with weights $1/N$.
• For $i = 1, \ldots, N$: sample $X_n^{(i)} \sim q_\theta(\cdot|\widetilde{X}_{n-1}^{(i)}, y_n)$, set $X_{1:n}^{(i)} = (\widetilde{X}_{1:n-1}^{(i)}, X_n^{(i)})$, and set
$$W_n^{(i)} \propto \frac{f_\theta(X_n^{(i)}|\widetilde{X}_{n-1}^{(i)}) \, g_\theta(y_n|X_n^{(i)})}{q_\theta(X_n^{(i)}|\widetilde{X}_{n-1}^{(i)}, y_n)}.$$

We present the SISR algorithm, also known as the particle filter, for general state-space models in Algorithm 3.1, recalling that SIS is the special case of SISR in which no resampling is performed.
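To illustrate Algorithm 3.1 concretely, here is a minimal sketch of the bootstrap special case discussed in the list below, in which the proposal is the prior transition density so that the incremental weight reduces to $g_\theta(y_n|x_n)$; the scalar autoregressive model and all function names are illustrative assumptions, not part of the general algorithm.

    import numpy as np

    def bootstrap_filter(y, N, mu_sample, f_sample, g_density, rng):
        """Algorithm 3.1 with the prior proposal q_theta = f_theta (the bootstrap
        filter), resampling at every step. Returns particles and normalised
        weights approximating the filtering distribution p(x_n | y_{1:n})."""
        x = mu_sample(N, rng)                      # X_1^(i) ~ mu_theta; proposal = prior,
        w = g_density(y[0], x)                     # so W_1 is proportional to g(y_1 | x)
        W = w / w.sum()
        for yn in y[1:]:
            idx = rng.choice(N, size=N, p=W)       # multinomial resampling
            x = f_sample(x[idx], rng)              # propagate through f_theta
            w = g_density(yn, x)                   # incremental weight w_{n|n-1}
            W = w / w.sum()
        return x, W

    # Illustrative model: X_n = 0.9 X_{n-1} + V_n, Y_n = X_n + W_n, V_n, W_n ~ N(0, 1).
    rng = np.random.default_rng(0)
    mu_sample = lambda N, rng: rng.normal(0.0, 1.0, N)
    f_sample = lambda x, rng: 0.9 * x + rng.normal(0.0, 1.0, x.shape)
    g_density = lambda yn, x: np.exp(-0.5 * (yn - x) ** 2) / np.sqrt(2 * np.pi)
    x, W = bootstrap_filter(np.array([0.3, -0.1, 1.2]), 1000, mu_sample, f_sample, g_density, rng)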
In the following we list some aspects of the particle filter.

• As in the general SISR algorithm, we can use an optional resampling scheme, where resampling is performed only when the estimated effective sample size falls below a threshold value.

• A by-product of the particle filter is that it provides unbiased estimates of the unknown normalising constants of the target distribution [Del Moral, 2004, Chapter 7]. For example, when SISR is used with an optional resampling scheme, if the last time prior to $n$ at which resampling was performed is $k$, an unbiased estimator of $p_\theta(y_{k+1:n}|y_{1:k})$ can be obtained as
$$p_\theta(y_{k+1:n}|y_{1:k}) \approx \frac{1}{N} \sum_{i=1}^{N} \prod_{t=k+1}^{n} w_{t|t-1}(X_{t-1}^{(i)}, X_t^{(i)}).$$
We will come back to this aspect of the particle filter in Section 3.4.1.

• The choice of the kernel $Q_\theta$ for the proposal distribution is important to ensure an effective SMC approximation. The first genuine particle filter in the literature, proposed by Gordon et al. [1993], proposes from the prior distribution of $X_{1:n}$, i.e. takes $q_\theta(x_n|x_{n-1}, y_n) = f_\theta(x_n|x_{n-1})$; the particle filter with this particular choice of $Q_\theta$ is called the bootstrap filter. Another interesting choice is $q_\theta(x_n|x_{n-1}, y_n) = q_\theta(x_n|y_n)$, which can be useful when the observations provide significant information about the hidden state but the state dynamics are weak. This proposal was introduced in Lin et al. [2005] and the resulting particle filter was called the independent particle filter. The optimal choice, which minimises the variance of the incremental importance weights, is, from equation (2.12),
$$q_\theta^{\mathrm{opt}}(x_n|x_{n-1}, y_n) = p_\theta(x_n|x_{n-1}, y_n),$$
resulting in the optimal incremental weight $w_{n|n-1}^{\mathrm{opt}}(x_{n-1}, x_n) = p_\theta(y_n|x_{n-1})$, which is independent of the value of $x_n$. First works where $q_\theta^{\mathrm{opt}}$ was used include Kong et al. [1994], Liu [1996], and Liu and Chen [1995].

• The auxiliary particle filter for optimal filtering [Pitt and Shephard, 1999] is implemented by sampling $X_{1:n-1}$ from among the particle paths up to time $n-1$ and a new $X_n$ from $\mathsf{X}$ in order to target
$$\bar{\pi}_{\theta,n}(dx_{1:n}) = \left[ \sum_{i=1}^{N} W_{n-1}^{(i)} w_{n|n-1}^{\mathrm{opt}}(X_{1:n-1}^{(i)}) \, \delta_{X_{1:n-1}^{(i)}}(dx_{1:n-1}) \right] p_\theta(x_n|x_{n-1}, y_n) \, \lambda(dx_n).$$
Note that when $p_\theta(y_n|x_{n-1})$ can be calculated and $p_\theta(x_n|x_{n-1}, y_n)$ can be sampled from, all the particles at time $n$ have equal weights. If this is not the case, the proposal distribution used to target this distribution can be written in general as
$$\bar{q}_{\theta,n}(dx_{1:n}) = \left[ \sum_{i=1}^{N} \alpha_{n-1}^{(i)} \, \delta_{X_{1:n-1}^{(i)}}(dx_{1:n-1}) \right] q_\theta(x_n|x_{n-1}, y_n) \, \lambda(dx_n),$$
where the first-stage weights $\alpha_{n-1}^{(i)}$ and the density $q_\theta(x_n|x_{n-1}, y_n)$ are design choices that should be as close as possible to the ideal ones. One attempt to make $\alpha_{n-1}^{(i)}$ close to $W_{n-1}^{(i)} p_\theta(y_n|X_{n-1}^{(i)})$ (up to normalisation), suggested in the original work of Pitt and Shephard [1999] on the auxiliary particle filter, is to take $\alpha_{n-1}^{(i)} = g_\theta(y_n|x_n^{*(i)})$, where $x_n^{*(i)}$ is a prediction of $X_n$ given $X_{n-1}^{(i)}$ based on the dynamics of the process, e.g. $x_n^{*(i)} = \mathbb{E}_\theta[X_n|X_{n-1}^{(i)}]$.

• Although the particle filter presented in Algorithm 3.1 targets the path filtering distributions $\pi_{\theta,n}(dx_{1:n}) = p_\theta(x_{1:n}|y_{1:n})\lambda(dx_{1:n})$, it can easily be modified, or used directly, to make inference on other distributions of interest. For example, consider the one-step path prediction distribution $\pi_{\theta,n}^p(dx_{1:n}) = p_\theta(x_{1:n}|y_{1:n-1})\lambda(dx_{1:n})$. The relation between $\pi_{\theta,n}$ and $\pi_{\theta,n}^p$ is
$$\pi_{\theta,n}^p(dx_{1:n}) = \pi_{\theta,n-1}(dx_{1:n-1}) f_\theta(x_n|x_{n-1}) \, \lambda(dx_n), \qquad \frac{d\pi_{\theta,n}}{d\pi_{\theta,n}^p}(x_{1:n}) = \frac{g_\theta(y_n|x_n)}{\pi_{\theta,n}^p(g_\theta(y_n|\cdot))}.$$
Therefore, it is easy to derive approximations of these distributions from each other: obtaining $\pi_{\theta,n}^{p,N}$ from $\pi_{\theta,n-1}^{N}$ requires a simple extension of the path $X_{1:n-1}$ to $X_{1:n}$ through $f_\theta$, done by sampling $X_n^{(i)}$ conditioned on the existing particle paths $X_{1:n-1}^{(i)}$, respectively for $i = 1, \ldots, N$; conversely, obtaining $\pi_{\theta,n}^{N}$ from $\pi_{\theta,n}^{p,N}$ requires a simple reweighting of the (approximate) measure according to $g_\theta(y_n|\cdot)$. As a second example, approximations to the marginal distributions $\pi_{\theta,n}^{N}(dx_k)$, $k \leq n$ (or $\pi_{\theta,n}^{p,N}(dx_k)$), are obtained simply from the $k$'th components of the particles, e.g.
$$\pi_{\theta,n}^{N}(dx_{1:n}) = \sum_{i=1}^{N} W_n^{(i)} \delta_{X_{1:n}^{(i)}}(dx_{1:n}) \;\Rightarrow\; \pi_{\theta,n}^{N}(dx_k) = \sum_{i=1}^{N} W_n^{(i)} \delta_{X_k^{(i)}}(dx_k).$$
Note that the optimal filtering problem corresponds to the case $k = n$; hence it may be sufficient to have a good approximation of the marginal posterior distribution of the current state $X_n$ rather than of the whole path $X_{1:n}$. This justifies the resampling step of the particle filter in practice, since resampling trades off accuracy for states $X_k$ with $k \ll n$ in exchange for a good approximation of the marginal posterior distribution of $X_n$.

3.3.3 The marginal particle filter

Recall that the standard particle filter follows the recursion in (3.7): it estimates $\pi_{\theta,n}(dx_{1:n})$ by taking an estimate of $\pi_{\theta,n-1}(dx_{1:n-1})$ and augmenting it with $x_n$ at time $n$. It involves a resampling step to avoid the high variance that results from the sequential nature of the algorithm and from the fact that the dimension of the sampled paths grows by the dimension of the state space at each time. When it is the filtering distribution $\pi_{\theta,n}(dx_n)$ that is desired, one can use a somewhat more principled approach.
The marginal particle filter (MPF) [Klaas et al., 2005] follows the recursion in (3.9) and (3.10) and performs particle filtering for the marginal distribution $\pi_{\theta,n}(dx_n)$ instead of the joint distribution $\pi_{\theta,n}(dx_{1:n})$. Assume $\{X_{n-1}^{(i)}, W_{n-1}^{(i)}\}_{1\leq i\leq N}$ is the set of particles and weights obtained by the MPF as an approximation of $\pi_{\theta,n-1}(dx_{n-1})$. The MPF approximates the recursion in (3.9) and (3.10) by substituting for the predictive density $p_\theta(x_n|y_{1:n-1})$ in (3.9) its approximation $\sum_{i=1}^{N} W_{n-1}^{(i)} f_\theta(x_n|X_{n-1}^{(i)})$. It then performs importance sampling for the resulting approximation of the marginal density $p_\theta(x_n|y_{1:n})$,
$$p_\theta^N(x_n|y_{1:n}) \propto g_\theta(y_n|x_n) \sum_{i=1}^{N} W_{n-1}^{(i)} f_\theta(x_n|X_{n-1}^{(i)}).$$
Although we have the freedom to choose any proposal distribution $q_\theta(x_n|y_{1:n})$ with appropriate support, the authors of Klaas et al. [2005] suggest a proposal of a similar form, namely
$$q_\theta(x_n|y_{1:n}) = \sum_{i=1}^{N} W_{n-1}^{(i)} q_\theta(x_n|X_{n-1}^{(i)}, y_n). \qquad (3.14)$$
Note that the proposal in (3.14) amounts to sampling $X_{n-1}$ from the particle estimate of $\pi_{\theta,n-1}(dx_{n-1})$ and then proposing the new component $X_n$. Instead, we may want to design a proposal that samples particles which will be in high-probability regions of the observation model. We can do this by re-weighting the particles at time $n-1$ to boost them in these regions; this modification results in the auxiliary marginal particle filter (AMPF) [Klaas et al., 2005]. The AMPF is the general version of the MPF in which the proposal distribution is written more generally than (3.14) as
$$q_\theta(x_n|y_{1:n}) = \sum_{i=1}^{N} \alpha_{n-1}^{(i)} q_\theta(x_n|X_{n-1}^{(i)}, y_n). \qquad (3.15)$$
Just as for the auxiliary particle filter in Section 3.3.2, one should ideally take
$$\alpha_{n-1}^{(i)} \propto W_{n-1}^{(i)} \, p_\theta(y_n|X_{n-1}^{(i)})$$
if the calculation of $p_\theta(y_n|X_{n-1}^{(i)})$ is possible; otherwise a suitable approximation of it should be used instead.

The pseudocode for the AMPF is given in Algorithm 3.2. The variance of the importance weights of the AMPF is less than or equal to that of the standard auxiliary particle filter. Although this improvement of the marginal particle filter comes at the cost of $O(N^2)$ calculations per time step, compared to the $O(N)$ calculations of standard particle filters, it is possible to reduce this cost to $O(N \log N)$ with a small and controllable error [Klaas et al., 2005].

Algorithm 3.2. The auxiliary marginal particle filter

For $n = 1$: for $i = 1, \ldots, N$, sample $X_1^{(i)} \sim q_\theta(\cdot|y_1)$ and set
$$W_1^{(i)} \propto \frac{\mu_\theta(X_1^{(i)}) \, g_\theta(y_1|X_1^{(i)})}{q_\theta(X_1^{(i)}|y_1)}.$$
For $n = 2, 3, \ldots$:
• For $i = 1, \ldots, N$: sample $X_n^{(i)} \sim \sum_{j=1}^{N} \alpha_{n-1}^{(j)} q_\theta(x_n|X_{n-1}^{(j)}, y_n)$, where $\alpha_{n-1}^{(j)}$ is proportional to $W_{n-1}^{(j)} p_\theta(y_n|X_{n-1}^{(j)})$ or to an approximation of it.
• For $i = 1, \ldots, N$: set
$$W_n^{(i)} \propto \frac{g_\theta(y_n|X_n^{(i)}) \sum_{j=1}^{N} W_{n-1}^{(j)} f_\theta(X_n^{(i)}|X_{n-1}^{(j)})}{\sum_{j=1}^{N} \alpha_{n-1}^{(j)} q_\theta(X_n^{(i)}|X_{n-1}^{(j)}, y_n)}.$$

Finally, we note that another $O(N^2)$ particle filter can be found in Lin et al. [2005] as a special case of what the authors call the independent particle filter. The name 'independent' is due to their proposal distribution at time $n$ being independent of $x_{n-1}$, which allows multiple matchings with the previous particles and makes their algorithm $O(N^2)$ in the case of complete matching. Moreover, a slight extension of their algorithm in which the proposal distribution uses the past particles is also mentioned in their work, and the MPF and AMPF can be considered equivalent to special cases of this extension.
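To make the $O(N^2)$ structure of Algorithm 3.2 explicit, the following sketches a single AMPF step; the vectorised density functions, their names, and the choice of first-stage weights are illustrative assumptions.

    import numpy as np

    def ampf_step(x_prev, W_prev, alpha, yn, q_sample, q_density, f_density, g_density, rng):
        """One step of the auxiliary marginal particle filter (Algorithm 3.2).
        alpha: first-stage weights, e.g. proportional to W_prev * p(y_n | x_prev)
        or an approximation of it; densities must accept broadcast arrays."""
        N = len(x_prev)
        a = alpha / alpha.sum()
        idx = rng.choice(N, size=N, p=a)                 # pick a mixture component per particle
        x = q_sample(x_prev[idx], yn, rng)               # X_n^(i) ~ q(. | X_{n-1}^(j), y_n)
        # O(N^2) mixture evaluations: row i indexes new particles, column j old ones.
        F = f_density(x[:, None], x_prev[None, :])       # f(X_n^(i) | X_{n-1}^(j))
        Q = q_density(x[:, None], x_prev[None, :], yn)   # q(X_n^(i) | X_{n-1}^(j), y_n)
        w = g_density(yn, x) * (F @ W_prev) / (Q @ a)    # weight formula of Algorithm 3.2
        return x, w / w.sum()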
3.3.4 The Rao-Blackwellised particle filter

Assume we are given an HMM $\{(X_n, Z_n), Y_n\}_{n\geq1}$, where this time the hidden state at time $n$ is composed of two components, $X_n$ and $Z_n$. Suppose that the initial and transition distributions of the Markov chain $\{X_n, Z_n\}_{n\geq1}$ have densities $\mu_\theta$ and $f_\theta$ with respect to the product measure $\lambda_1 \otimes \lambda_2$, and that they can be factorised as follows:
$$\mu_\theta(x_1, z_1) = \mu_{\theta,1}(x_1) \, \mu_{\theta,2}(z_1|x_1), \qquad f_\theta(x_n, z_n|x_{n-1}, z_{n-1}) = f_{\theta,1}(x_n|x_{n-1}) \, f_{\theta,2}(z_n|x_n, z_{n-1}).$$
Also, conditioned on $(x_n, z_n)$, the distribution of the observation $Y_n$ admits a density $g_\theta(\cdot|x_n, z_n)$ with respect to $\nu$. We are interested in the case where the posterior distribution
$$\pi_{\theta,n}(dx_{1:n} \, dz_{1:n}) = p_\theta(x_{1:n}, z_{1:n}|y_{1:n}) \, \lambda_1(dx_{1:n}) \lambda_2(dz_{1:n})$$
is analytically intractable and we wish to approximate the expectations $\pi_{\theta,n}(\varphi_n) = \mathbb{E}_\theta[\varphi_n(X_{1:n}, Z_{1:n})|y_{1:n}]$ for bounded measurable functions $\varphi_n: \mathsf{X}^n \times \mathsf{Z}^n \to \mathbb{R}^{d_\varphi(n)}$. Obviously, one way to do this is to run an SMC filter for $\{\pi_{\theta,n}\}_{n\geq1}$, which obtains at time $n$ the approximation
$$\pi_{\theta,n}^N(dx_{1:n} \, dz_{1:n}) = \sum_{i=1}^{N} W_n^{(i)} \, \delta_{(X_{1:n}^{(i)}, Z_{1:n}^{(i)})}(dx_{1:n} \, dz_{1:n}), \qquad \sum_{i=1}^{N} W_n^{(i)} = 1.$$
However, if the conditional posterior probability distribution
$$\pi_{\theta,2,n}(dz_{1:n}|x_{1:n}) = p_\theta(z_{1:n}|x_{1:n}, y_{1:n}) \, \lambda_2(dz_{1:n})$$
is analytically tractable, there is a better SMC scheme for approximating $\pi_{\theta,n}$ and estimating $\pi_{\theta,n}(\varphi_n)$. This SMC scheme is called the Rao-Blackwellised particle filter (RBPF) [Doucet et al., 2000a]. Consider the following decomposition, which follows from the chain rule,
$$p_\theta(x_{1:n}, z_{1:n}|y_{1:n}) = p_\theta(x_{1:n}|y_{1:n}) \, p_\theta(z_{1:n}|x_{1:n}, y_{1:n}),$$
and define the marginal posterior distribution of $X_{1:n}$ conditioned on $y_{1:n}$ as $\pi_{\theta,1,n}(dx_{1:n}) = p_{\theta,1}(x_{1:n}|y_{1:n}) \, \lambda_1(dx_{1:n})$. The RBPF is a particle filter for the sequence of marginal distributions $\{\pi_{\theta,1,n}\}_{n\geq1}$ which produces at time $n$ the approximation
$$\pi_{\theta,1,n}^N(dx_{1:n}) = \sum_{i=1}^{N} W_{1,n}^{(i)} \, \delta_{X_{1:n}^{(i)}}(dx_{1:n}), \qquad \sum_{i=1}^{N} W_{1,n}^{(i)} = 1,$$
and the Rao-Blackwellised approximation of the full posterior distribution combines the particle filter estimate $\pi_{\theta,1,n}^N$ with the exact conditional distribution $\pi_{\theta,2,n}$:
$$\pi_{\theta,n}^{RB,N}(dx_{1:n} \, dz_{1:n}) = \pi_{\theta,1,n}^N(dx_{1:n}) \, \pi_{\theta,2,n}(dz_{1:n}|x_{1:n}).$$
The RBPF estimator of $\pi_{\theta,n}(\varphi_n)$ then becomes
$$\pi_{\theta,n}^{RB,N}(\varphi_n) = \pi_{\theta,1,n}^N\big(\pi_{\theta,2,n}[\varphi_n(X_{1:n}, \cdot)]\big) = \sum_{i=1}^{N} W_{1,n}^{(i)} \, \pi_{\theta,2,n}[\varphi_n(X_{1:n}^{(i)}, \cdot)].$$
Assuming $q_\theta(x_{1:n}|y_{1:n}) = q_\theta(x_{1:n-1}|y_{1:n-1}) \, q_\theta(x_n|x_{1:n-1}, y_{1:n})$ is used as the proposal distribution, the incremental importance weight for the RBPF is given by
$$w_{1,n|n-1}(x_{1:n}) = \frac{f_{\theta,1}(x_n|x_{n-1}) \, p_{\theta,n}(y_n|x_{1:n}, y_{1:n-1})}{q_\theta(x_n|x_{1:n-1}, y_{1:n})},$$
where the density $p_{\theta,n}(y_n|x_{1:n}, y_{1:n-1})$ is given, as in (3.6), by
$$p_{\theta,n}(y_n|x_{1:n}, y_{1:n-1}) = \int p_\theta(z_{1:n-1}|x_{1:n-1}, y_{1:n-1}) f_{\theta,2}(z_n|x_n, z_{n-1}) g_\theta(y_n|x_n, z_n) \, \lambda_2(dz_{1:n}).$$
Also, the variance of $w_{1,n|n-1}$ is minimised by taking the incremental importance density $q_\theta(x_n|x_{1:n-1}, y_{1:n})$ to be $p_\theta(x_n|x_{1:n-1}, y_{1:n})$, in which case $w_{1,n|n-1}(x_{1:n})$ equals $p_\theta(y_n|x_{1:n-1}, y_{1:n-1})$.

The use of the RBPF whenever it is possible is intuitively justified by the fact that we substitute exact values for the particle approximations of some expectations. Indeed, the theoretical analyses in Doucet et al. [2000a] and Chopin [2004, Proposition 3] revealed that the RBPF has better precision than the regular particle filter: the estimates of the RBPF never have larger variances. These favourable results are essentially due to the Rao-Blackwell theorem (see e.g. Blackwell [1947]), after which the proposed particle filter gets its name.
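As a concrete instance, the following is a minimal sketch of one RBPF step for a switching linear Gaussian model in which the labels $X_n$ are sampled by the particle filter and the conditionally Gaussian component $Z_n$ is marginalised exactly by a per-particle Kalman filter; the scalar model and its parameters are illustrative assumptions.

    import numpy as np

    def rbpf_step(lab, m, V, W, yn, P, a, q, r, rng):
        """One RBPF step for a switching linear Gaussian model: labels X_n follow a
        Markov chain with transition matrix P and, conditionally on the labels,
        Z_n = a[x_n] Z_{n-1} + N(0, q) and Y_n = Z_n + N(0, r).
        (lab, m, V, W): labels, Kalman means/variances, weights at time n-1."""
        N = len(lab)
        idx = rng.choice(N, size=N, p=W)                  # resample
        lab, m, V = lab[idx].copy(), m[idx], V[idx]
        for i in range(N):                                # propagate labels from the prior
            lab[i] = rng.choice(len(P), p=P[lab[i]])
        m_pred = a[lab] * m                               # Kalman prediction per particle
        V_pred = a[lab] ** 2 * V + q
        S = V_pred + r                                    # innovation variance
        w = np.exp(-0.5 * (yn - m_pred) ** 2 / S) / np.sqrt(S)  # predictive likelihood
        K = V_pred / S                                    # Kalman update per particle
        m = m_pred + K * (yn - m_pred)
        V = (1.0 - K) * V_pred
        return lab, m, V, w / w.sum()

With the prior label proposal, the incremental weight reduces to the Kalman predictive likelihood $p_\theta(y_n|x_{1:n}, y_{1:n-1})$, exactly as in the weight expression above.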
The RBPF was formulated by Doucet et al. [2000a] and has been implemented in various settings by Andrieu and Doucet [2002], Chen and Liu [2000], and Särkkä et al. [2004], among many others. We will also use RBPFs in our work presented in Chapters 4, 5, and 7.

The use of Rao-Blackwellisation is not limited to marginalising out one of the components of the hidden state; it may also be possible to use Rao-Blackwellisation in the intermediate steps of a particle filter. In some time series models, exact sequential inference is not tractable, but the exact one-step update of the distributions conditioned on the approximations made prior to the current time is possible. For such models, one can calculate an expectation of interest using this exact one-step update, and then continue by approximating the exact update with particles in order to be able to proceed to the next time step of the particle filter. For examples of such implementations of Rao-Blackwellisation, see Fearnhead and Clifford [2003, p. 890], Fearnhead and Liu [2007], and the algorithm in Chapter 4 of this thesis.

3.3.5 Application of SMC to smoothing additive functionals

In this section, we provide an application of particle filters which is central to this thesis due to its relation to parameter estimation. We are interested in approximating smoothed estimates of additive functionals of the state variables in a fully dominated HMM $\{X_n, Y_n\}_{n\geq1}$ as in Definition 3.2. Consider a sequence of functions $s_t: \mathsf{X} \times \mathsf{X} \to \mathbb{R}$, $t \geq 1$, and let $S_n: \mathsf{X}^n \to \mathbb{R}$, $n \geq 1$, be the corresponding sequence of additive functionals constructed from the $s_t$ as
$$S_n(x_{1:n}) = \sum_{t=1}^{n} s_t(x_{t-1}, x_t),$$
where, by convention, we take $s_1(x_0, x_1) = s_1(x_1)$. In many instances it is necessary to be able to compute the following expectations sequentially:
$$S_n^\theta = \pi_{\theta,n}(S_n) = \mathbb{E}_\theta[S_n(X_{1:n})|y_{1:n}] = \int S_n(x_{1:n}) \, p_\theta(x_{1:n}|y_{1:n}) \, \lambda(dx_{1:n}).$$
The expectation is computed with respect to the density $p_\theta(x_{1:n}|y_{1:n})$, and for this reason $S_n^\theta$ is referred to as a smoothed additive functional. The calculation of $S_n^\theta$ may be of interest in its own right; it is also necessary for computing the filter derivative and the gradient of the log-likelihood of the observations [Del Moral et al., 2011; Poyiadjis et al., 2011], the intermediate function of the expectation-maximisation algorithm (see e.g. Del Moral et al. [2009]), etc.

In most cases exact computation of $S_n^\theta$ is not possible due to the unavailability of $p_\theta(x_{1:n}|y_{1:n})$; therefore one has to use Monte Carlo methods, specifically SMC. The first SMC method in the literature proposed to approximate $S_n^\theta$ uses the path space approximation of $\pi_{\theta,n}$ directly [Cappé, 2009]. Let the SMC approximation of $\pi_{\theta,n}$ be
$$\pi_{\theta,n}^N(dx_{1:n}) = \sum_{i=1}^{N} W_n^{(i)} \, \delta_{X_{1:n}^{(i)}}(dx_{1:n}), \qquad \sum_{i=1}^{N} W_n^{(i)} = 1. \qquad (3.16)$$
Then one obtains the path space approximation of the smoothed additive functional as
$$\widehat{S}_n^\theta = \pi_{\theta,n}^N(S_n) = \sum_{i=1}^{N} W_n^{(i)} S_n(X_{1:n}^{(i)}). \qquad (3.17)$$
Observing that $S_n(x_{1:n}) = S_{n-1}(x_{1:n-1}) + s_n(x_{n-1}, x_n)$, this approximation can be calculated online along with the particle filter; see Cappé [2009] for an application exploiting this fact. With this approximation there is no need to store the entire ancestry of each particle, and the computational cost of calculating $\widehat{S}_n^\theta$ is linear in the number of particles, i.e. $O(N)$.
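The following fragment sketches how the path space estimate (3.17) is computed online inside a particle filter such as the bootstrap sketch of Section 3.3.2: each particle carries a running sum that is duplicated by resampling along with its path. The names reuse those of that earlier sketch, and the increment function s_n is an illustrative placeholder.

    # Inside the time-n loop of the bootstrap filter sketch, each particle i carries
    # S[i] = S_n(X_{1:n}^(i)) alongside its current state x[i]:
    idx = rng.choice(N, size=N, p=W)       # resampling duplicates the running sums too
    x_prev, S = x[idx], S[idx]
    x = f_sample(x_prev, rng)              # propagate (bootstrap proposal)
    S = S + s_n(x_prev, x)                 # S_n = S_{n-1} + s_n(x_{n-1}, x_n)
    w = g_density(yn, x)
    W = w / w.sum()
    S_hat = np.sum(W * S)                  # the path space estimate (3.17)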
However, this approximation relies on the SMC approximation of the joint distribution $\pi_{\theta,n}(dx_{1:n})$, which, as already mentioned in Section 2.5.2, is well known in the SMC literature to become progressively impoverished as $n$ increases because of the successive resampling steps. Indeed, under favourable mixing assumptions, Del Moral and Doucet [2003] established an upper bound on the $L_p$ error of the path space estimate in (3.17) which is proportional to $n^2/\sqrt{N}$; under similar assumptions, it was shown in Poyiadjis et al. [2011] that the asymptotic variance of the path space estimate increases at least quadratically with $n$.

An $O(N)$ SMC approach that reduces the variance is fixed-lag smoothing [Kitagawa and Sato, 2001], which uses the approximation
$$p_\theta(x_{1:k}|y_{1:n}) \approx p_\theta(x_{1:k}|y_{1:\min(n, k+\Delta)}), \qquad \Delta > 0, \qquad (3.18)$$
the idea being that for a large enough lag $\Delta$ the error introduced by the approximation is negligible. The SMC implementation of this approximation prevents the particle filter from updating the path $X_{1:k}$ beyond time $k + \Delta$ and hence reduces the variance resulting from path degeneracy. However, choosing the lag $\Delta$ is a difficult task, and this approach introduces a bias in the estimate of $S_n^\theta$ which does not vanish asymptotically in $N$; see Olsson et al. [2008].

3.3.5.1 Forward filtering backward smoothing

A standard alternative for computing $S_n^\theta$ is to use SMC approximations of fixed-interval smoothing techniques such as the forward filtering backward smoothing (FFBS) algorithm [Doucet et al., 2000b; Godsill et al., 2004]. Let us define the marginal smoothing distributions
$$\eta_{\theta,n,k}(dx_k) := \pi_{\theta,n}(dx_k) = p_\theta(x_k|y_{1:n}) \, \lambda(dx_k)$$
and the backward transition kernel $M_{\theta,n-1}: \mathsf{X} \to \mathcal{P}(\mathcal{E})$ such that
$$M_{\theta,n-1}(x_n, dx_{n-1}) = p_\theta(x_{n-1}|x_n, y_{1:n-1}) \, \lambda(dx_{n-1}).$$
FFBS relies on the additivity of the functional $S_n$ and on the identity $p_\theta(x_{t-1}, x_t|y_{1:n}) \, \lambda(dx_{t-1} \, dx_t) = \eta_{\theta,n,t}(dx_t) M_{\theta,t-1}(x_t, dx_{t-1})$ for $t \leq n$, which lead to
$$S_n^\theta = \sum_{t=1}^{n} [\eta_{\theta,n,t} \otimes M_{\theta,t-1}](s_t) = \sum_{t=1}^{n} \int \eta_{\theta,n,t}(dx_t) \, M_{\theta,t-1}(x_t, dx_{t-1}) \, s_t(x_{t-1}, x_t).$$
Moreover, once $\pi_{\theta,1}, \ldots, \pi_{\theta,n}$ are obtained up to time $n$ (forward filtering), $\eta_{\theta,n,1}, \ldots, \eta_{\theta,n,n}$ can be obtained with a backward recursion (backward smoothing) starting from $\eta_{\theta,n,n}(dx_n) = \pi_{\theta,n}(dx_n)$ and recursing back with
$$\eta_{\theta,n,t} = \eta_{\theta,n,t+1} M_{\theta,t}, \qquad t = n-1, \ldots, 1.$$
The SMC implementation of FFBS [Doucet et al., 2000b], which we will call SMC-FFBS, is based on the following alternative approximation of $\pi_{\theta,n}$:
$$\pi_{\theta,n}^{*,N} = \eta_{\theta,n,n}^N \otimes M_{\theta,n-1}^N \otimes \cdots \otimes M_{\theta,1}^N, \qquad (3.19)$$
where the particle approximations of the backward kernels are
$$M_{\theta,n-1}^N(x_n, dx_{n-1}) = \eta_{\theta,n-1,n-1}^N(dx_{n-1}) \, \frac{f_\theta(x_n|x_{n-1})}{\int \eta_{\theta,n-1,n-1}^N(dx'_{n-1}) \, f_\theta(x_n|x'_{n-1})}. \qquad (3.20)$$
Therefore, once $\pi_{\theta,1}^N, \ldots, \pi_{\theta,n}^N$ are obtained up to time $n$ (forward filtering), $\eta_{\theta,n,1}^N, \ldots, \eta_{\theta,n,n}^N$ can be obtained with a backward recursion (backward smoothing) starting from $\eta_{\theta,n,n}^N(dx_n) = \pi_{\theta,n}^N(dx_n)$ and recursing back with $\eta_{\theta,n,t}^N = \eta_{\theta,n,t+1}^N M_{\theta,t}^N$, $t = n-1, \ldots, 1$. The SMC approximation of FFBS then leads to the following estimate of the smoothed functional:
$$\widehat{S}_n^{*,\theta} = \sum_{t=1}^{n} \left[ \eta_{\theta,n,t}^N \otimes M_{\theta,t-1}^N \right](s_t).$$
The SMC implementation of FFBS requires $O(N^2)$ computations per time step, compared to the $O(N)$ of the path space approximation. In return, the estimator has better properties than the path space estimator: Douc et al. [2011] includes a central limit theorem for $\widehat{S}_n^{*,\theta}$ and time-uniform deviation inequalities for the SMC-FFBS approximations of the marginals $\{\eta_{\theta,n,t}\}_{1\leq t\leq n}$.
For alternative proofs of the results in Douc et al. [2011], see Del Moral et al. [2010]. Additionally, it was shown in Del Moral et al. [2009] that under strong mixing conditions the asymptotic variance of $\widehat{S}_n^{*,\theta}$ as $N \to \infty$ is linear in $n$. More general but more complicated results on the variance of $\widehat{S}_n^{*,\theta}$ under weaker conditions can be found in Del Moral et al. [2010].

3.3.5.2 Forward-only smoothing

Since it filters forwards and then smooths backwards, the FFBS algorithm is inherently offline, unlike the path space approximation. It may also be demanding in memory, since it requires the SMC filters $\eta_{\theta,t,t}^N$ to be stored up to time $n$. To circumvent the need for the backward pass in the computation of $S_n^\theta$, the following auxiliary function on $\mathsf{X}$ is introduced:
$$T_n^\theta(x_n) = M_{\theta,n-1} \otimes \cdots \otimes M_{\theta,1}[S_n(\cdot, x_n)](x_n) = \mathbb{E}_\theta[S_n(X_{1:n})|X_n = x_n, y_{1:n-1}].$$
It is apparent that $S_n^\theta = \eta_{\theta,n,n}(T_n^\theta)$. A forward recursion to compute $\{T_n^\theta\}_{n\geq1}$, and hence $\{S_n^\theta\}_{n\geq1}$, is established by
$$T_n^\theta(x_n) = M_{\theta,n-1}\left[ T_{n-1}^\theta + s_n(\cdot, x_n) \right](x_n) = \mathbb{E}_\theta\left[ T_{n-1}^\theta(X_{n-1}) + s_n(X_{n-1}, x_n) \,\middle|\, x_n, y_{1:n-1} \right] \qquad (3.21)$$
for $n \geq 2$, with the initial condition $T_1^\theta(x_1) = s_1(x_1)$. Note that the online calculation of $T_n^\theta(x_n)$ requires only an integration with respect to the measure $M_{\theta,n-1}(x_n, \cdot)$, i.e. $p_\theta(x_{n-1}|x_n, y_{1:n-1})$. The recursion in (3.21) has been rediscovered independently several times (see e.g. Elliott and Krishnamurthy [1999]; Hernando et al. [2005]; Mongillo and Deneve [2008]) and was called the forward smoothing recursion in Del Moral et al. [2009].

A straightforward SMC implementation of the forward smoothing recursion uses $\pi_{\theta,n}^{*,N}$ in (3.19), so that $M_{\theta,n-1}$ in (3.21) is approximated by $M_{\theta,n-1}^N$ in (3.20). It can be shown that when this approximation is used, we calculate exactly the same quantity as SMC-FFBS; therefore the preferable statistical properties of SMC-FFBS are preserved. Moreover, although the online calculation still requires $O(N^2)$ operations per time step, it does not need to store the SMC filters $\{\eta_{\theta,t,t}^N\}_{1\leq t\leq n}$. Being a forward implementation of SMC-FFBS, we call this implementation SMC-forward smoothing, or SMC-FS. SMC-FS proves to be a very useful tool for online parameter estimation, as we shall see in Section 3.4 and throughout this thesis.

Algorithm 3.3. SMC-FS: forward-only SMC computation of FFBS for smoothing additive functionals

For $n = 1$:
• Compute the SMC approximation $\{X_1^{(i)}, W_1^{(i)}\}_{1\leq i\leq N}$ of $\eta_{\theta,1,1}$.
• For $i = 1, \ldots, N$, set $T_1^{(i)} = s_1(X_1^{(i)})$.
For $n = 2, 3, \ldots$:
• Compute the SMC approximation $\{X_n^{(i)}, W_n^{(i)}\}_{1\leq i\leq N}$ of $\eta_{\theta,n,n}$.
• For $i = 1, \ldots, N$, set
$$T_n^{(i)} = \frac{\sum_{j=1}^{N} \left[ T_{n-1}^{(j)} + s_n(X_{n-1}^{(j)}, X_n^{(i)}) \right] W_{n-1}^{(j)} f_\theta(X_n^{(i)}|X_{n-1}^{(j)})}{\sum_{j'=1}^{N} W_{n-1}^{(j')} f_\theta(X_n^{(i)}|X_{n-1}^{(j')})}.$$
• Calculate $\widehat{S}_n^{*,\theta} = \sum_{i=1}^{N} W_n^{(i)} T_n^{(i)}$.

We present the SMC-FS algorithm in Algorithm 3.3. We note that, since SMC-FS relies on particle estimates of the filtering distributions $\{\eta_{\theta,n,n}\}_{n\geq1}$ only, the marginal particle filter of Section 3.3.3 can be used in Algorithm 3.3 instead of the standard particle filter. Finally, note that the SMC implementation of the forward smoothing recursion using the path space approximation is trivial in the sense that it reduces to the approximation given in (3.16).
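A minimal sketch of the $O(N^2)$ update of Algorithm 3.3 is given below; it assumes a transition density $f_\theta$ that accepts broadcast arrays, and the filter producing the weighted particles, as well as the increment function s_n, are left abstract.

    import numpy as np

    def fs_update(T_prev, x_prev, W_prev, x, s_n, f_density):
        """One forward-smoothing update (Algorithm 3.3): computes T_n^(i), the
        expectation of T_{n-1} + s_n(., x_n^(i)) under the particle backward
        kernel (3.20)."""
        F = f_density(x[:, None], x_prev[None, :])        # f(x_n^(i) | x_{n-1}^(j)), N x N
        B = W_prev[None, :] * F                           # unnormalised backward weights
        B /= B.sum(axis=1, keepdims=True)                 # normalise over j for each i
        S = s_n(x_prev[None, :], x[:, None])              # s_n(x_{n-1}^(j), x_n^(i))
        return (B * (T_prev[None, :] + S)).sum(axis=1)    # T_n^(i)

    # After each filter step the smoothed additive functional estimate is
    # S_hat = np.sum(W * T), matching the last line of Algorithm 3.3.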
3.4 Static parameter estimation in HMMs

One problem that is dealt with at length in the literature is that of estimating the true static parameter $\theta^*$ of the HMM given observations $y_{1:n}$ up to time $n$. There are two main approaches to solving the parameter estimation problem: the Bayesian approach and the maximum likelihood approach. We briefly summarise Bayesian methods and then give a more detailed review of maximum likelihood parameter estimation methods. We refer the interested reader to Kantas et al. [2009] for a comprehensive review of SMC methods that have been proposed for static parameter estimation in HMMs.

Bayesian parameter estimation: In the Bayesian approach, the static parameter is treated as a random variable taking values $\theta$ in $\Theta$ with a probability density $\eta(\theta)$ with respect to a dominating measure $d\theta$, and the aim is to evaluate the density of the posterior distribution of $\theta$ given $y_{1:n}$, which follows from Bayes' theorem as
$$\eta(\theta|y_{1:n}) = \frac{\eta(\theta) \, p_\theta(y_{1:n})}{\int p_\theta(y_{1:n}) \, \eta(\theta) \, d\theta}. \qquad (3.22)$$
When the likelihood $p_\theta(y_{1:n})$ is analytically available, one can simply apply an MCMC scheme targeting the posterior $\eta(\theta|y_{1:n})$. An MCMC algorithm can be inefficient when $n$ is large; however, online Bayesian methods are also available. For example, the method in Chopin [2002] is based on the SMC approximation of the sequence of distributions $\{\eta(d\theta|y_{1:t})\}_{1\leq t\leq n}$. This approach is equivalent to the resample-move algorithm described in Gilks and Berzuini [2001], which is a special SMC sampler.

More sophisticated techniques are required when $p_\theta(y_{1:n})$ cannot be computed, which is usually the case for a general HMM. The methods developed for this case consider
• the joint posterior density $p(\theta, x_{1:n}|y_{1:n}) \propto \eta(\theta) \, p_\theta(x_{1:n}, y_{1:n})$ in the batch parameter estimation setting,
• the sequence of posterior densities $\{p(\theta, x_{1:t}|y_{1:t})\}_{1\leq t\leq n}$ in the online parameter estimation setting.
One elegant method used in the batch estimation setting is particle MCMC (PMCMC) [Andrieu et al., 2010]. Notice that an ideal Metropolis-Hastings algorithm targeting $p(\theta, x_{1:n}|y_{1:n})$ is not feasible in general, since it requires exact sampling from $p_\theta(x_{1:n}|y_{1:n})$ and exact calculation of $p_\theta(y_{1:n})$. A particle version of the Metropolis-Hastings algorithm, called PMMH in Andrieu et al. [2010], runs an SMC filter for $p_\theta(x_{1:n}|y_{1:n})$ with $N$ particles and uses the SMC approximation of the unknown quantity $p_\theta(y_{1:n})$. The validity of this approach is not trivial to show; see, again, Andrieu et al. [2010] for a derivation. In the same work, a particle version of the Gibbs sampler was also developed. In Andrieu et al. [2010], the variance of the acceptance rate of the PMMH algorithm was numerically shown to be proportional to $n/N$ under favourable mixing conditions. This suggests that one needs to increase the number of particles linearly with $n$ in order to keep the performance of the PMCMC algorithm at a given level. Therefore, for large $n$, PMCMC may not be practical and online parameter estimation methods may be required.

Although with possible modifications, all of the Bayesian methods for online parameter estimation rely on the SMC approximation of the sequence of distributions $p(\theta, x_{1:t}|y_{1:t})$, $1 \leq t \leq n$. At first sight, it seems easy to achieve this using standard SMC methods by introducing the extended state $\{\theta_n, X_n\}_{n\geq1}$ with initial distribution $\mu_{\theta_1}(x_1)\lambda(dx_1)\eta(\theta_1)d\theta_1$ and transition distribution $f_{\theta_n}(x_n|x_{n-1})\lambda(dx_n)\delta_{\theta_{n-1}}(d\theta_n)$. This implies $\theta_n = \theta_{n-1}$; therefore such an SMC algorithm explores the parameter space only at its initialisation. As a result of successive resampling steps, we will end up with only a single value for $\theta$, so that the resulting approximation of the marginal distribution $\eta(d\theta|y_{1:n})$ is clearly poor.
Several methods have been proposed to avoid degeneracy of the particles for the static parameter of the HMM. We briefly mention them below.

One approach to avoiding degeneracy, proposed originally in Gilks and Berzuini [2001], is based on adding MCMC steps to re-introduce particle diversity. Assume that the SMC approximation of $p(\theta, x_{1:n}|y_{1:n})$ at time $n$ contains particles $(\theta_n^{(i)}, X_{1:n}^{(i)})$, $i = 1, \ldots, N$, with equal weights. To add diversity to this population, an MCMC kernel $K_n$ which leaves $p(d(\theta, x_{1:n})|y_{1:n})$ invariant is applied to each of the particles. One remarkable point is that the MCMC kernel need not be ergodic; indeed, in practice one designs $K_n$ so that it moves only $\theta^{(i)}$ and the last $L$ components of $X_{1:n}^{(i)}$. A first use of this method in an online Bayesian parameter estimation context appears in Andrieu et al. [1999], where $K_n$ is taken to be a Gibbs move that updates the parameter value only, i.e.
$$K_n[(x_{1:n}, \theta), d(x'_{1:n}, \theta')] = \delta_{x_{1:n}}(dx'_{1:n}) \, p(\theta'|x_{1:n}, y_{1:n}) \, d\theta'.$$
Similar strategies were used in Fearnhead [2002] and Storvik [2002]. The use of MCMC steps within SMC is particularly elegant when $(x_{1:n}, y_{1:n})$ can be summarised by a set of fixed-dimensional sufficient statistics, since then the memory and computational requirements for calculating densities such as $p_\theta(y_{1:n}|x_{1:n})$ or $p(\theta|x_{1:n}, y_{1:n})$ do not increase with time. Unfortunately, these MCMC-based methods suffer from the path degeneracy problem of the SMC approximation: the error in the estimate of $p_\theta(x_{1:n}|y_{1:n})$ leads to an error in the sufficient statistics used, and these errors build up over time. This disadvantage was first noticed in Andrieu et al. [1999], and a convincing example was provided in Andrieu et al. [2005]. Another MCMC-based online Bayesian estimation method is called practical filtering [Polson et al., 2008], which relies on a fixed-lag approximation as in (3.18). As for all fixed-lag approaches, it is hard to tune the amount of lag and to control the non-vanishing bias introduced by the approximation.

As an alternative to MCMC-based methods for avoiding degeneracy, another class of methods is based on introducing artificial dynamics for the parameter [Higuchi, 2001; Kitagawa, 1998]. More explicitly, it is assumed that
$$\theta_1 \sim \eta(\theta_1), \qquad \theta_n = \theta_{n-1} + \epsilon_n, \quad n \geq 2,$$
where $\epsilon_n$ is a small artificial zero-mean noise whose variance decreases with $n$. Obviously, SMC applied to approximate $\{p(\theta_n, x_{1:n}|y_{1:n})\}_{n\geq1}$ under this assumption has better degeneracy properties than before. This approach is closely related to the kernel density estimation method of Liu and West [2001], which proposes regularising the empirical measure of the posterior distribution of the parameter by smoothing it with a kernel density, such as a Gaussian or Epanechnikov kernel. A more general approach, in which the kernel smoothing is also applied to the components of the HMM, is given in Campillo and Rossi [2009]. All of these methods introducing artificial dynamics for the parameter require a significant amount of tuning and suffer from a bias that is hard to quantify.

Maximum likelihood parameter estimation: In the maximum likelihood approach to parameter estimation, one has a point estimate obtained by calculating the value of $\theta$ that maximises the likelihood $p_\theta(y_{1:n})$ over all possible values of $\theta$, i.e.
$$\theta_{\mathrm{ML}} = \arg\max_{\theta \in \Theta} \, p_\theta(y_{1:n}).$$
This procedure is called maximum likelihood estimation (MLE). In this thesis we will investigate methods for MLE applied to several time series models.
In the following we present some of the MLE methods directly applicable to HMMs.

3.4.1 Direct maximisation of the likelihood

The traditional approach to MLE is to maximise $p_\theta(y_{1:n})$ with respect to $\theta$ by direct calculation of $p_\theta(y_{1:n})$. Note that $p_\theta(y_{1:n})$ satisfies the recursion
$$p_\theta(y_{1:n}) = p_\theta(y_1) \prod_{t=2}^{n} p_\theta(y_t|y_{1:t-1}) = p_\theta(y_{1:n-1}) \, p_\theta(y_n|y_{1:n-1}). \qquad (3.23)$$
The incremental likelihood $p_\theta(y_n|y_{1:n-1})$ may be obtained by exploiting one of the expressions for it, such as (3.8) or (3.11), whichever is available. In practice, one works with the log-likelihood $l_\theta(y_{1:n}) = \log p_\theta(y_{1:n})$, which is numerically better behaved since the product in (3.23) is replaced by a sum.

It is rarely the case that the likelihood (or log-likelihood) is available in closed form and can be maximised analytically. When it is not in closed form but can still be calculated, grid-based methods, in which the likelihood is evaluated on a grid representation of $\Theta$ with sufficient resolution, can be used. When the likelihood cannot even be calculated exactly, SMC approximation can be applied. Let $\tau_1, \ldots, \tau_k$ be the times at which the resampling step is applied in the particle filter of Algorithm 3.1, and let $\tau_0 = 0$ and $\tau_{k+1} = n$. It is shown in Del Moral [2004, Chapter 7] that the following estimator of $p_\theta(y_{1:n})$ is unbiased:
$$p_\theta^N(y_{1:n}) = \prod_{j=1}^{k+1} p_\theta^N(y_{\tau_{j-1}+1:\tau_j}|y_{1:\tau_{j-1}}), \qquad p_\theta^N(y_{\tau_{j-1}+1:\tau_j}|y_{1:\tau_{j-1}}) = \frac{1}{N} \sum_{i=1}^{N} \prod_{t=\tau_{j-1}+1}^{\tau_j} w_{t|t-1}^{(i)}.$$
Based on this unbiased estimator, an estimate of $l_\theta(y_{1:n})$ is
$$l_\theta^N(y_{1:n}) = \sum_{j=1}^{k+1} \log p_\theta^N(y_{\tau_{j-1}+1:\tau_j}|y_{1:\tau_{j-1}}),$$
which is biased due to the non-linear transformation of the unbiased estimators. The bias can be reduced by using a standard technique based on a Taylor series expansion; see Andrieu et al. [2004].
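When resampling is performed at every time step, as in the bootstrap filter sketch of Section 3.3.2, each inter-resampling block contains a single observation and every factor of the estimator is simply the mean of the unnormalised incremental weights. The fragment below reuses the names of that sketch and omits the analogous $n = 1$ step; it is a sketch of this special case only, under those assumptions.

    # Log-likelihood estimate accumulated inside the bootstrap filter, assuming
    # multinomial resampling at every step (so each tau_j - tau_{j-1} = 1):
    log_lik = 0.0
    for yn in y[1:]:
        idx = rng.choice(N, size=N, p=W)
        x = f_sample(x[idx], rng)
        w = g_density(yn, x)               # unnormalised incremental weights w_{n|n-1}
        log_lik += np.log(w.mean())        # log of the unbiased factor for p(y_n | y_{1:n-1})
        W = w / w.sum()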
Direct maximisation of the likelihood by calculating it pointwise is not a practical approach unless $\Theta$ is a discrete space with a small number of elements or a continuous space that can be well approximated by a grid. Unfortunately, these conditions fail in almost all cases of interest, mainly because $\theta$ is of large dimension. In the following we review two alternative approaches that maximise $p_\theta(y_{1:n})$ (at least locally) indirectly, without calculating it.

3.4.2 Gradient ascent maximum likelihood

Gradient-based maximum likelihood methods work with the gradient of the log-likelihood rather than with the log-likelihood itself. The gradient ascent algorithm is an iterative procedure implemented as follows. We begin with $\theta^{(0)}$ and assume that we have the estimate $\theta^{(j-1)}$ at the end of the $(j-1)$'th iteration. At the $j$'th iteration we update the parameter as
$$\theta^{(j)} = \theta^{(j-1)} + \gamma_j \, \nabla_\theta l_\theta(y_{1:n})\big|_{\theta = \theta^{(j-1)}}.$$
The gradient term $\nabla_\theta l_\theta(y_{1:n})$ is also called the score vector. Here $\{\gamma_j\}_{j\geq1}$ is a sequence of step sizes satisfying
$$\sum_{j\geq0} \gamma_j = \infty, \qquad \sum_{j\geq0} \gamma_j^2 < \infty, \qquad (3.24)$$
which ensures convergence of the algorithm when it is used with Monte Carlo approximations $\nabla_\theta^N l_\theta(y_{1:n})$ of the score vectors. A common choice is $\gamma_j = j^{-a}$ for $0.5 < a \leq 1$.

One way to calculate the gradient term is to use Fisher's identity for the score vector,
$$\nabla_\theta l_\theta(y_{1:n}) = \int \nabla_\theta \log p_\theta(x_{1:n}, y_{1:n}) \, p_\theta(x_{1:n}|y_{1:n}) \, \lambda(dx_{1:n}), \qquad (3.25)$$
i.e. the expectation, with respect to the posterior distribution of the latent variables, of the gradient of the complete-data log-likelihood. Equation (3.25) can be rewritten as
$$\nabla_\theta l_\theta(y_{1:n}) = \pi_{\theta,n}(S_{\theta,n}), \qquad (3.26)$$
where $S_{\theta,n}: \mathsf{X}^n \to \mathbb{R}^{d_\theta}$ is the additive function of $x_{1:n}$
$$S_{\theta,n}(x_{1:n}) = \sum_{t=1}^{n} s_{\theta,t}(x_{t-1}, x_t), \qquad (3.27)$$
$$s_{\theta,1}(x_0, x_1) = s_{\theta,1}(x_1) = \nabla_\theta \log \mu_\theta(x_1) + \nabla_\theta \log g_\theta(y_1|x_1),$$
$$s_{\theta,t}(x_{t-1}, x_t) = \nabla_\theta \log f_\theta(x_t|x_{t-1}) + \nabla_\theta \log g_\theta(y_t|x_t), \qquad t \geq 2.$$
Notice that since $S_{\theta,n}$ is in additive form, the approximation of its expectation $\pi_{\theta,n}(S_{\theta,n})$ can be carried out with one of the Monte Carlo methods mentioned in Section 3.3.5 whenever exact calculation of $\pi_{\theta,n}(S_{\theta,n})$ is not available. An SMC estimate of the score vector using the $O(N)$ path space approximation was provided in Andrieu et al. [2004]. However, it was shown in Poyiadjis et al. [2011] that the variance of this estimate typically increases quadratically with $n$. For this reason, Poyiadjis et al. [2011] proposed to use the $O(N^2)$ method based on FFBS to estimate $\nabla_\theta l_\theta(y_{1:n})$, and it was shown in Del Moral et al. [2011] that this SMC estimate is stable.

An alternative to Fisher's identity for computing the score vector $\nabla_\theta l_\theta(y_{1:n})$ is a method based on infinitesimal perturbation analysis, proposed in Coquelin et al. [2009]. This method also estimates the expectation, with respect to $p_\theta(x_{1:n}|y_{1:n})$, of an additive functional of the form $\sum_{t=1}^{n} s_\theta(x_{t-1}, x_t)$; hence all the SMC smoothing techniques described in Section 3.3.5 can also be applied to estimate this expectation.

3.4.2.1 Online gradient ascent

The batch gradient ascent MLE algorithm may be inefficient when $n$ is large, since each iteration requires a complete pass over the whole data sequence. An alternative to the batch algorithm is possible via online calculation of the score vector, leading to a recursive maximum likelihood algorithm which we will call online gradient ascent. An online gradient ascent algorithm can be implemented as follows [Del Moral et al., 2011; Poyiadjis et al., 2011]. Let $\theta_1$ be the initial guess of $\theta^*$ before having made any observations, and let $\theta_{1:n}$ be the sequence of parameter estimates of the online gradient ascent algorithm computed sequentially based on $y_{1:n-1}$. When $y_n$ is received, we update the parameter as
$$\theta_{n+1} = \theta_n + \gamma_n \, \nabla_\theta \log p_\theta(y_n|y_{1:n-1})\big|_{\theta = \theta_n}. \qquad (3.28)$$
The incremental gradients $\nabla_\theta \log p_\theta(y_n|y_{1:n-1})$ can be calculated sequentially from the gradients $\nabla_\theta l_\theta(y_{1:n})$ using the relation
$$\nabla_\theta \log p_\theta(y_n|y_{1:n-1}) = \nabla_\theta l_\theta(y_{1:n}) - \nabla_\theta l_\theta(y_{1:n-1}) = \pi_{\theta,n}(S_{\theta,n}) - \pi_{\theta,n-1}(S_{\theta,n-1}). \qquad (3.29)$$
However, since $\theta_n$ changes over time, (3.29), and hence (3.28), is impractical to calculate sequentially. In practice, the integrals $\pi_{\theta_n,n}(S_{\theta_n,n})$ are approximated by
$$\pi_{\theta_{1:n},n}\left( \sum_{t=1}^{n} s_{\theta_t}(x_{t-1}, x_t) \right),$$
where $\theta_{1:n}$ in $\pi_{\theta_{1:n},n}$ indicates that the distributions are calculated sequentially with varying $\theta$'s. This approach has previously appeared in the literature for finite state-space HMMs; see e.g. Le Gland and Mevel [1997] and Collings and Ryden [1998]. The asymptotic properties of this algorithm, i.e. the behaviour of $\theta_n$ in the limit as $n$ goes to infinity, have been studied by Titterington [1984] for i.i.d. hidden processes and by Le Gland and Mevel [1997] for finite state-space HMMs. It is shown in Le Gland and Mevel [1997] that under regularity conditions this algorithm converges to a local maximum of the average log-likelihood, and that the average log-likelihood is maximised at $\theta^*$.
An SMC online gradient ascent method, which can be seen as a particle version of the recursive maximum likelihood algorithm of Le Gland and Mevel [1997], is given in Algorithm 3.4.

Algorithm 3.4. SMC online gradient ascent algorithm

Choose $\theta_1$; set $\mathcal{S}_0 = 0$.
For $n = 1$:
• Compute the SMC approximation $\{X_1^{(i)}, W_1^{(i)}\}_{1\leq i\leq N}$ of $\eta_{\theta_1,1,1}$.
• For $i = 1, \ldots, N$, set $T_{\gamma,1}^{(i)} = \nabla_\theta \log \mu_{\theta_1}(X_1^{(i)}) + \nabla_\theta \log g_{\theta_1}(y_1|X_1^{(i)})$.
For $n = 2, 3, \ldots$:
• Compute the SMC approximation $\{X_n^{(i)}, W_n^{(i)}\}_{1\leq i\leq N}$ of $\eta_{\theta_{1:n},n,n}$.
• For $i = 1, \ldots, N$, set
$$T_{\gamma,n}^{(i)} = \frac{\sum_{j=1}^{N} \left[ (1-\gamma_n) T_{\gamma,n-1}^{(j)} + \gamma_n s_n(X_{n-1}^{(j)}, X_n^{(i)}) \right] W_{n-1}^{(j)} f_{\theta_n}(X_n^{(i)}|X_{n-1}^{(j)})}{\sum_{j'=1}^{N} W_{n-1}^{(j')} f_{\theta_n}(X_n^{(i)}|X_{n-1}^{(j')})},$$
where $s_n(X_{n-1}^{(j)}, X_n^{(i)}) = \nabla_\theta \log f_{\theta_n}(X_n^{(i)}|X_{n-1}^{(j)}) + \nabla_\theta \log g_{\theta_n}(y_n|X_n^{(i)})$.
• Calculate $\mathcal{S}_n = \sum_{i=1}^{N} W_n^{(i)} T_{\gamma,n}^{(i)}$ and set $\theta_{n+1} = \theta_n + \gamma_n (\mathcal{S}_n - \mathcal{S}_{n-1})$.

This algorithm is based on the $O(N^2)$ SMC approximation of (3.29) and calculates
$$\theta_{n+1} = \theta_n + \gamma_n \left[ \pi_{\theta_{1:n},n}^{*,N}\left( \sum_{t=1}^{n} s_{\theta_t}(x_{t-1}, x_t) \right) - \pi_{\theta_{1:n-1},n-1}^{*,N}\left( \sum_{t=1}^{n-1} s_{\theta_t}(x_{t-1}, x_t) \right) \right]. \qquad (3.30)$$
In Poyiadjis et al. [2011], this algorithm is used with the MPF described in Section 3.3.3 in order to approximate the filtering distributions $\eta_{\theta,n,n}$. A very similar algorithm, equivalent to Algorithm 3.4 in principle, can be found in Del Moral et al. [2011]; the difference is that the authors include $\theta_n$ in the calculation of the second term in (3.29) by using the relation
$$\pi_{\theta,n-1}(S_{\theta,n-1}) = \widehat{\pi}_{\theta,n}\big( S_{\theta,n-1} + \nabla_\theta \log f_\theta(x_n|x_{n-1}) \big),$$
where we recall that here $\widehat{\pi}_{\theta,n}$ denotes the distribution of $X_{1:n}$ conditioned on $y_{1:n-1}$, i.e. the path prediction distribution $\pi^p_{\theta,n}$ of Section 3.3.2. Hence, (3.30) is replaced by
$$\theta_{n+1} = \theta_n + \gamma_n \left[ \pi_{\theta_{1:n},n}^{*,N}\left( \sum_{t=1}^{n} s_{\theta_t}(x_{t-1}, x_t) \right) - \widehat{\pi}_{\theta_{1:n},n}^{*,N}\left( \nabla_\theta \log f_{\theta_n}(x_n|x_{n-1}) + \sum_{t=1}^{n-1} s_{\theta_t}(x_{t-1}, x_t) \right) \right],$$
and the online implementation of this update is derived using the filter derivative at time $n$. Similarly to the $O(N^2)$ particle approximation of $\pi_{\theta,n}(S_{\theta,n})$, the $O(N^2)$ particle approximation of $\widehat{\pi}_{\theta,n}(S_{\theta,n})$ can be performed by taking
$$\widehat{\pi}_{\theta,n}^{*,N} = \widehat{\eta}_{\theta,n,n}^N \otimes M_{\theta,n-1}^N \otimes \cdots \otimes M_{\theta,1}^N,$$
where $\widehat{\eta}_{\theta,n,n}^N(dx_n) = \widehat{\pi}_{\theta,n}^N(dx_n)$ is the particle approximation of the one-step prediction distribution obtained by marginalising the path particle approximation $\widehat{\pi}_{\theta,n}^N$.
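A single step of Algorithm 3.4 can be sketched by combining the forward-smoothing update of Section 3.3.5.2 with discounted score increments; the function score_increment, which is assumed to return $\nabla_\theta \log f_{\theta_n} + \nabla_\theta \log g_{\theta_n}$ for a scalar parameter, and the surrounding filter variables are illustrative assumptions.

    # One step of Algorithm 3.4 for a scalar parameter theta, reusing the broadcast
    # conventions of fs_update in Section 3.3.5.2:
    F = f_density(x[:, None], x_prev[None, :])                   # f_{theta_n}(x_n^(i) | x_{n-1}^(j))
    B = W_prev[None, :] * F
    B /= B.sum(axis=1, keepdims=True)                            # particle backward weights
    s = score_increment(x_prev[None, :], x[:, None], yn, theta)  # grad log f + grad log g
    T = (B * ((1.0 - gamma) * T_prev[None, :] + gamma * s)).sum(axis=1)
    S = np.sum(W * T)
    theta = theta + gamma * (S - S_prev)                         # parameter update of Algorithm 3.4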
3.4.3 Expectation-Maximisation

The expectation-maximisation (EM) algorithm [Dempster et al., 1977] is one of the most popular methods for MLE. Given $Y_{1:n} = y_{1:n}$, the EM algorithm for maximising $p_\theta(y_{1:n})$ is the following iterative procedure: if $\theta^{(j)}$ is the estimate of the EM algorithm at the $j$'th iteration, then at iteration $j+1$ the estimate is updated by first calculating the intermediate optimisation criterion known as the expectation (E) step,
$$Q(\theta^{(j)}, \theta) = \int \log p_\theta(x_{1:n}, y_{1:n}) \, p_{\theta^{(j)}}(x_{1:n}|y_{1:n}) \, \lambda(dx_{1:n}) = \mathbb{E}_{\theta^{(j)}}[\log p_\theta(X_{1:n}, y_{1:n})|y_{1:n}]. \qquad (3.31)$$
The updated estimate is then computed in the maximisation (M) step,
$$\theta^{(j+1)} = \arg\max_{\theta \in \Theta} \, Q(\theta^{(j)}, \theta). \qquad (3.32)$$
The EM algorithm produces a sequence $\{\theta^{(j)}\}_{j\geq1}$ such that $\{p_{\theta^{(j)}}(y_{1:n})\}_{j\geq1}$ is non-decreasing, and under mild conditions this sequence is guaranteed to converge to a maximum point of $p_\theta(y_{1:n})$. In practice, the procedure in (3.31) and (3.32) is repeated until $\theta^{(j)}$ ceases to change significantly.

One important observation here is that the integrand in (3.31), the joint log-density of the complete data $(x_{1:n}, y_{1:n})$, has the following additive structure:
$$\log p_\theta(x_{1:n}, y_{1:n}) = \log \mu_\theta(x_1) + \log g_\theta(y_1|x_1) + \sum_{t=2}^{n} \left[ \log f_\theta(x_t|x_{t-1}) + \log g_\theta(y_t|x_t) \right]. \qquad (3.33)$$
Moreover, equation (3.33) suggests that when $p_\theta(x_{1:n}, y_{1:n})$ belongs to the exponential family with respect to $\theta$, there exist an integer $r > 0$ and functions $s_{i,t}: \mathsf{X} \times \mathsf{X} \to \mathbb{R}$, $i = 1, \ldots, r$, $t \geq 1$, such that the E-step and M-step of the EM algorithm reduce to calculating
$$S_{i,n}^{\theta^{(j)}} = \pi_{\theta^{(j)},n}(S_{i,n}) = \mathbb{E}_{\theta^{(j)}}[S_{i,n}(X_{1:n})|y_{1:n}], \qquad S_{i,n}(x_{1:n}) = \sum_{t=1}^{n} s_{i,t}(x_{t-1}, x_t), \quad i = 1, \ldots, r,$$
and applying a maximisation rule $\Lambda: \mathbb{R}^r \to \Theta$ to compute (3.32), so that
$$\theta^{(j+1)} = \Lambda\left( S_{1,n}^{\theta^{(j)}}, \ldots, S_{r,n}^{\theta^{(j)}} \right). \qquad (3.34)$$
The functionals $S_{1,n}, \ldots, S_{r,n}$ are also called the sufficient statistics of the complete data $(x_{1:n}, y_{1:n})$.

3.4.3.1 Stochastic versions of EM

The intermediate function $Q(\theta^{(j)}, \theta)$ of the EM algorithm can be computed exactly in only a few classes of HMMs, such as linear Gaussian HMMs and finite state-space HMMs. When $Q(\theta^{(j)}, \theta)$ cannot be computed exactly, Monte Carlo approximation must be used to estimate it numerically. The additive structure of $\log p_\theta(x_{1:n}, y_{1:n})$ allows us to use several SMC smoothing techniques for estimating $Q(\theta^{(j)}, \theta)$; see Andrieu et al. [2004] for the path space approximation, Olsson et al. [2008] for the fixed-lag approximation, Wills et al. [2008] for the FFBS approximation, and Briers et al. [2010] for generalised two-filter smoothing. Using a Monte Carlo estimate of the intermediate function leads to stochastic versions of the EM algorithm. Three main stochastic versions have been proposed in the literature; we review them below.

• If we use a constant number $N$ of particles for all iterations, the resulting algorithm is called the stochastic EM algorithm (SEM) [Celeux and Diebolt, 1985]. Since the Monte Carlo variance is never reduced over the iterations, this algorithm does not converge to a point in $\Theta$; however, one expects to obtain an ergodic homogeneous Markov chain of estimates $\{\theta^{(j)}\}_{j\geq0}$ whose stationary distribution is concentrated around $\theta_{\mathrm{ML}}$ [Nielsen, 2000].

• The settling of the SEM Markov chain into its equilibrium may take a long time. An alternative to SEM, introduced in Wei and Tanner [1990], is Monte Carlo EM (MCEM). In MCEM, the number of particles used for the Monte Carlo approximation increases with $j$ in order to ensure convergence to the maximum likelihood estimate $\theta_{\mathrm{ML}}$, rather than convergence to a stationary distribution around it. The disadvantage of this approach is the increasing amount of computational resources required as the number of particles grows over the iterations.

• Another stochastic version of the EM algorithm involves a stochastic approximation procedure, and is hence called stochastic approximation EM (SAEM) [Delyon et al., 1999]. In SAEM, the E-step involves a weighted average of the approximations of the intermediate quantity of EM obtained in the current as well as the previous iterations. Specifically, consider a step-size sequence $\{\gamma_j\}_{j\geq0}$ satisfying the conditions in (3.24). Then we calculate the weighted average of the estimates $Q^N(\theta^{(j)}, \theta)$ of the intermediate functions recursively as
$$Q_{\gamma,j}(\theta) = (1-\gamma_j) \, Q_{\gamma,j-1}(\theta) + \gamma_j \, Q^N(\theta^{(j)}, \theta),$$
with the initialisation $Q_{\gamma,-1}(\theta) = 0$; at the M-step of iteration $j$, $\theta^{(j+1)}$ is set to be the maximiser of $Q_{\gamma,j}(\theta)$ with respect to $\theta$. When $p_\theta(x_{1:n}, y_{1:n})$ is in the exponential family, the above recursion can be written in terms of the smoothed estimates of the sufficient statistics; we will see a use of SAEM in this case in Chapter 5.
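Returning to the exponential-family setting of (3.34), a concrete illustration may help. Consider the scalar model with $f_\theta(x_t|x_{t-1}) = \mathcal{N}(x_t; a x_{t-1}, \sigma^2)$ and $\theta = (a, \sigma^2)$, where the observation density is assumed free of $\theta$; ignoring the initial term, the sufficient statistics are $S_{1,n} = \sum_{t=2}^{n} x_{t-1}^2$, $S_{2,n} = \sum_{t=2}^{n} x_{t-1} x_t$ and $S_{3,n} = \sum_{t=2}^{n} x_t^2$, and the mapping $\Lambda$ is available in closed form. A minimal sketch under these assumptions:

    def Lambda_ar1(S1, S2, S3, n):
        """Closed-form M-step for X_t = a X_{t-1} + sigma V_t, V_t ~ N(0, 1):
        maximises the expected complete-data log-likelihood given the smoothed
        sufficient statistics S1, S2, S3 (smoothed expectations of the sums above)."""
        a = S2 / S1                                       # regression coefficient
        sigma2 = (S3 - 2 * a * S2 + a * a * S1) / (n - 1) # residual variance
        return a, sigma2

In the online EM algorithm below, the statistics are replaced by their running averages, in which case the divisor $n-1$ drops out of the corresponding mapping.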
3.4.3.2 Online EM

The online EM algorithm [Cappé, 2009, 2011; Elliott et al., 2002; Kantas et al., 2009; Mongillo and Deneve, 2008] is a variation on batch EM where, as in the online gradient ascent algorithm, the parameter is re-estimated each time a new observation is collected. We assume that $p_\theta(x_{1:n}, y_{1:n})$ is in the exponential family and that sufficient statistics exist, so that the M-step can be characterised by (3.34). In the online EM algorithm, running averages of the $S_{i,n}^\theta$ are computed. Specifically, let $\gamma = \{\gamma_n\}_{n\geq1}$, called the step-size sequence, be a positive decreasing sequence satisfying $\sum_{n\geq1} \gamma_n = \infty$ and $\sum_{n\geq1} \gamma_n^2 < \infty$. Let $\theta_1$ be the initial guess of $\theta^*$ before having made any observations, and let $\theta_{1:n}$ be the sequence of parameter estimates of the online EM algorithm computed sequentially based on $y_{1:n-1}$. When $y_n$ is received, online EM computes, for $i = 1, \ldots, r$,
$$T_{\gamma,i,n}(x_n) = M_{\theta_{1:n},n-1}\left[ (1-\gamma_n) T_{\gamma,i,n-1} + \gamma_n s_{i,n}(\cdot, x_n) \right](x_n), \qquad (3.35)$$
$$\mathcal{S}_{i,n} = \eta_{\theta_{1:n},n,n}(T_{\gamma,i,n}), \qquad (3.36)$$
and then sets
$$\theta_{n+1} = \Lambda(\mathcal{S}_{1,n}, \ldots, \mathcal{S}_{r,n}).$$
The subscript $\theta_{1:n}$ on $M_{\theta_{1:n},n-1}$ and $\eta_{\theta_{1:n},n,n}$ indicates that these laws are computed sequentially using the parameter $\theta_t$ at time $t$, $t \leq n$. In practice, the maximisation step is not executed until a burn-in time $n_b$, for added stability of the estimators, as discussed in Cappé [2009].

The online EM algorithm can be implemented exactly for linear Gaussian state-space models [Elliott et al., 2002] and for finite state-space HMMs [Cappé, 2011; Mongillo and Deneve, 2008]. An exact implementation is not possible for general state-space models, so SMC implementations of the online EM algorithm are used. Both the $O(N)$ and the $O(N^2)$ approximations have been used for the SMC implementation of online EM in the literature; we present them in Algorithms 3.5 and 3.6. The first SMC online EM algorithm, proposed in Cappé [2009], uses the path space approximation of equations (3.35) and (3.36), resulting in Algorithm 3.5. The $O(N^2)$ approximation was proposed in Del Moral et al. [2009], resulting in Algorithm 3.6.

Algorithm 3.5. SMC online EM: $O(N)$ implementation

Choose $\theta_1$.
For $n = 1$:
• Compute the SMC approximation $\{X_1^{(i)}, W_1^{(i)}\}_{1\leq i\leq N}$ of $\pi_{\theta_1,1}$.
• For $i = 1, \ldots, N$ and $k = 1, \ldots, r$, set $T_{\gamma,k,1}^{(i)} = s_{k,1}(X_1^{(i)})$.
For $n = 2, 3, \ldots$:
• Compute the SMC approximation $\{X_{1:n}^{(i)}, W_n^{(i)}\}_{1\leq i\leq N}$ of $\pi_{\theta_{1:n},n}$. Construct the $N \times 1$ vector $A$ of resampling indices such that $X_{1:n}^{(i)} = (X_{1:n-1}^{(A(i))}, X_n^{(i)})$.
• For $i = 1, \ldots, N$: set $j = A(i)$ and compute, for $k = 1, \ldots, r$,
$$T_{\gamma,k,n}^{(i)} = (1-\gamma_n) \, T_{\gamma,k,n-1}^{(j)} + \gamma_n \, s_{k,n}(X_{n-1}^{(j)}, X_n^{(i)}).$$
• If $n \geq n_b$, calculate $\mathcal{S}_{k,n} = \sum_{i=1}^{N} W_n^{(i)} T_{\gamma,k,n}^{(i)}$ for $k = 1, \ldots, r$ and set $\theta_{n+1} = \Lambda(\mathcal{S}_{1,n}, \ldots, \mathcal{S}_{r,n})$; else, set $\theta_{n+1} = \theta_n$.

Algorithm 3.6. SMC online EM: $O(N^2)$ implementation

Choose $\theta_1$.
For $n = 1$:
• Compute the SMC approximation $\{X_1^{(i)}, W_1^{(i)}\}_{1\leq i\leq N}$ of $\eta_{\theta_1,1,1}$.
• For $i = 1, \ldots, N$ and $k = 1, \ldots, r$, set $T_{\gamma,k,1}^{(i)} = s_{k,1}(X_1^{(i)})$.
For $n = 2, 3, \ldots$:
• Compute the SMC approximation $\{X_n^{(i)}, W_n^{(i)}\}_{1\leq i\leq N}$ of $\eta_{\theta_{1:n},n,n}$.
• For $i = 1, \ldots, N$ and $k = 1, \ldots, r$, set
$$T_{\gamma,k,n}^{(i)} = \frac{\sum_{j=1}^{N} \left[ (1-\gamma_n) T_{\gamma,k,n-1}^{(j)} + \gamma_n s_{k,n}(X_{n-1}^{(j)}, X_n^{(i)}) \right] W_{n-1}^{(j)} f_{\theta_n}(X_n^{(i)}|X_{n-1}^{(j)})}{\sum_{j'=1}^{N} W_{n-1}^{(j')} f_{\theta_n}(X_n^{(i)}|X_{n-1}^{(j')})}.$$
• If $n \geq n_b$, calculate $\mathcal{S}_{k,n} = \sum_{i=1}^{N} W_n^{(i)} T_{\gamma,k,n}^{(i)}$ for $k = 1, \ldots, r$ and set $\theta_{n+1} = \Lambda(\mathcal{S}_{1,n}, \ldots, \mathcal{S}_{r,n})$; else, set $\theta_{n+1} = \theta_n$.
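A minimal sketch of the $O(N)$ implementation in Algorithm 3.5, with bootstrap proposals and multinomial resampling at every step, is given below; the model functions, the step-size exponent, and the burn-in are illustrative assumptions. For the AR(1) example above, Lambda would act on the time-averaged statistics.

    import numpy as np

    def smc_online_em(y, N, theta1, mu_sample, f_sample, g_density, s_funcs, Lambda, nb, rng):
        """O(N) SMC online EM (Algorithm 3.5). s_funcs is a list of r functions
        s_k(x_prev, x, y_n) computing the sufficient-statistic increments; Lambda
        maps the r running averages to a new parameter value."""
        theta = theta1
        x = mu_sample(N, theta, rng)                     # X_1^(i) ~ mu_theta
        T = [s(x, x, y[0]) for s in s_funcs]             # T_{k,1} = s_{k,1}(X_1); x_prev unused at n = 1
        w = g_density(y[0], x, theta)
        W = w / w.sum()
        for n, yn in enumerate(y[1:], start=2):
            gamma = n ** -0.8                            # step size gamma_n
            idx = rng.choice(N, size=N, p=W)             # resampling index vector A
            x_prev = x[idx]
            T = [t[idx] for t in T]                      # statistics follow their paths
            x = f_sample(x_prev, theta, rng)
            T = [(1 - gamma) * t + gamma * s(x_prev, x, yn) for t, s in zip(T, s_funcs)]
            w = g_density(yn, x, theta)
            W = w / w.sum()
            if n >= nb:                                  # M-step after the burn-in time
                theta = Lambda(*[np.sum(W * t) for t in T])
        return theta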
3.4.4 Iterated filtering

Iterated filtering [Ionides et al., 2011] is a batch MLE method for HMMs that can be useful for models with non-linear state-space dynamics. The iterated filtering algorithm works as follows. We begin with $\theta^{(0)}$ and assume that at the end of the $(j-1)$'th iteration we have the estimate $\theta^{(j-1)}$. Iterated filtering extends the HMM to $\{X_t, \theta_t, Y_t\}_{t\geq1}$ by introducing a slowly moving Markov chain $\{\theta_t\}_{t\geq1}$ for the static parameter. At iteration $j$, the Markov chain for $\{\theta_t\}_{t\geq1}$ is a random walk, typically with Gaussian steps,
$$\theta_1 \sim \mathcal{N}(\theta^{(j-1)}, \tau_j^2 \Sigma), \qquad \theta_k|\theta_{k-1} \sim \mathcal{N}(\theta_{k-1}, \sigma_j^2 \Sigma), \quad k \geq 2. \qquad (3.37)$$
At iteration $j$, one runs an SMC filter for $\{X_t, \theta_t, Y_t\}_{t\geq1}$ with $N_j$ particles and calculates at every time step the mean and variance estimates of $\theta_t$ with respect to the filtering and prediction densities, respectively:
$$m_t = \mathbb{E}_{\theta^{(j-1)}}[\theta_t|y_{1:t}], \qquad V_t = \mathrm{var}_{\theta^{(j-1)}}[\theta_t|y_{1:t-1}], \qquad t = 1, \ldots, n. \qquad (3.38)$$
Denoting the SMC estimates of these quantities by $\widetilde{m}_t$ and $\widetilde{V}_t$, and letting $\widetilde{m}_0 = \theta^{(j-1)}$, the algorithm updates the parameter estimate as
$$\theta^{(j)} = \theta^{(j-1)} + \gamma_j \sum_{t=1}^{n} \widetilde{V}_t^{-1}(\widetilde{m}_t - \widetilde{m}_{t-1}). \qquad (3.39)$$
In fact, the quantity $\sum_{t=1}^{n} \widetilde{V}_t^{-1}(\widetilde{m}_t - \widetilde{m}_{t-1})$ is an approximation to the gradient of the log-likelihood, $\nabla_\theta l_\theta(y_{1:n})$, at $\theta = \theta^{(j-1)}$. Here, the positive sequences $\{\tau_j\}_{j\geq0}$ and $\{\sigma_j\}_{j\geq0}$ satisfy the conditions $\lim_{j\to\infty} \tau_j = 0$ and $\lim_{j\to\infty} \sigma_j/\tau_j = 0$, which define an annealing schedule. Moreover, the sequences of particle numbers and step sizes must satisfy $N_j \tau_j \to \infty$ and $\sum_j \gamma_j^2 N_j^{-1} \tau_j^{-2} < \infty$, which are the conditions for convergence of the stochastic approximation for $\theta$ to a local maximum [Ionides et al., 2011].

3.4.5 Discussion of the MLE methods

On the one hand, one might prefer a gradient ascent procedure over the EM algorithm for a number of reasons. Firstly, when $l_\theta(y_{1:n})$ is a concave function of $\theta$, if $\gamma_j$ is replaced by $-\gamma_j \Gamma_j^{-1}$, where $\Gamma_j$ is the Hessian of $l_\theta(y_{1:n})$ evaluated at $\theta^{(j)}$, then the rate of convergence is quadratic and thus faster than that of EM, which converges linearly. The Hessian matrix can be estimated using SMC techniques; see Poyiadjis et al. [2011]. Secondly, the gradient ascent algorithm is more general, since it can be implemented in those cases where the M-step of EM cannot be solved in closed form. On the other hand, scaling the gradients can be quite hard; in addition, EM needs less tuning and its M-step is typically numerically stable. Therefore, one might prefer an EM approach whenever the M-step can be computed analytically. Finally, both approaches have online versions, which makes them very powerful tools for dealing with large sequential data sets.

An advantage of iterated filtering over standard gradient and EM techniques is that it only requires the ability to sample from $f_\theta(x'|x)$; no explicit calculation of derivatives is needed. However, it may require some tuning when the parameter is high-dimensional. Another disadvantage of iterated filtering is the necessity of using an increasing number of particles over the iterations in order to ensure convergence. Finally, iterated filtering does not have an online version and hence can only be used in a batch setting.

Chapter 4

An Online Expectation-Maximisation Algorithm for Changepoint Models

Summary: Changepoint models are widely used to model the heterogeneity of sequential data. We present a novel sequential Monte Carlo (SMC) online Expectation-Maximisation (EM) algorithm for estimating the static parameters of such models.
Chapter 4

An Online Expectation-Maximisation Algorithm for Changepoint Models

Summary: Changepoint models are widely used to model the heterogeneity of sequential data. We present a novel sequential Monte Carlo (SMC) online Expectation-Maximisation (EM) algorithm for estimating the static parameters of such models. The SMC online EM algorithm has a per-time-step cost that is linear in the number of particles, which is particularly important when the data form a long sequence of observations, since it drastically reduces the computational requirements of an implementation. We present an asymptotic analysis of the stability of the SMC estimates used in the online EM algorithm and demonstrate the performance of this scheme using both simulated and real data originating from DNA analysis.

The work in this chapter is published in Yıldırım et al. [2012d]. The idea was initiated in a discussion Dr. Sumeetpal S. Singh had with Prof. Arnaud Doucet. I did all the work except that Section 4.4 was done in collaboration with Dr. Sumeetpal S. Singh.

4.1 Introduction

Consider a sequence of observations $\{y_1, y_2, \ldots\}$ collected sequentially in time. A changepoint model is a particular model for the heterogeneity of sequential data: it postulates the existence of a strictly increasing time sequence $t_1, t_2, \ldots$ with $t_1 = 1$ that partitions the data into disjoint segments $\{y_{t_1}, \ldots, y_{t_2 - 1}\}, \{y_{t_2}, \ldots, y_{t_3 - 1}\}, \ldots$, such that the data are correlated within a segment but independent across segments. The time instances $t_1, t_2, \ldots$ are known as the changepoints and constitute a random unobserved sequence.

This segmental structure is both an intuitive and a versatile model for heterogeneity, and it is the reason why changepoint models have enjoyed wide appeal in a variety of disciplines such as Biological Science [Braun and Muller, 1998; Caron et al., 2011; Fearnhead and Vasileiou, 2009; Johnson et al., 2003], Physical Science [Lund and Reeves, 2002; Ó Ruanaidh and Fitzgerald, 1996], Signal Processing [Cemgil et al., 2006; Punskaya et al., 2002], and Finance [Dias and Embrechts, 2004].

In a Bayesian approach to inferring changepoints, one adopts a prior distribution on their locations and a likelihood function for the observed process given these changepoints. Both of these laws typically depend on a finite-dimensional real parameter vector $\theta \in \Theta$, where $\Theta$ denotes the set of permissible parameter vectors. In all realistic applications, the static parameter $\theta$ is unknown and needs to be estimated from the data as well. A fully Bayesian approach would assign a prior distribution to $\theta$; however, the resulting posterior distribution is intractable. Several Markov chain Monte Carlo (MCMC) schemes have been proposed in this context [Chib, 1998; Fearnhead, 2006; Lavielle and Lebarbier, 2001; Stephens, 1994]. Unfortunately, these algorithms are far too computationally intensive when dealing with very large datasets. An alternative to an MCMC-based full Bayesian analysis is sequential Monte Carlo (SMC); however, SMC methods for online Bayesian static parameter estimation suffer from the well-known particle path degeneracy problem and can provide unreliable estimates; see Andrieu et al. [2005] and Olsson et al. [2008] for a discussion of this issue. This is why we focus here on estimating the parameter $\theta$ using a maximum likelihood approach; i.e., the Maximum Likelihood Estimate (MLE) of interest is the parameter vector from $\Theta$ that maximises the probability density of the observed data sequence, $p_\theta(y_1, \ldots, y_n)$. This is a challenging problem, as computing the likelihood $p_\theta(y_1, \ldots, y_n)$ requires a computational cost increasing super-linearly with $n$ [Chopin, 2007; Fearnhead and Liu, 2007].
Our main contribution is a novel online EM algorithm to compute the MLE of the static parameter $\theta$ for changepoint models. We remark that standard batch EM algorithms for a restricted class of changepoint models have been proposed before; see, e.g., Gales and Young [1993], Barbu and Limnios [2008], Fearnhead and Vasileiou [2009]. The main reason an online algorithm is desirable is that huge computational and memory savings are possible. For a long data sequence, a standard EM algorithm requires a complete browse through the entire data set at each iteration to update the estimate of $\theta$, and many such iterations are needed until the estimate of $\theta$ converges. This not only requires storing the entire data sequence but also the probability laws needed in the intermediate computations of each EM iteration, which can be impractical. For this reason, there has been strong interest in online methods, which make parameter estimation possible by browsing through the data only once and hence circumvent the need to store it in its entirety (see Kantas et al. [2009] for a review). The only other work we are aware of on computing the MLE of $\theta$ in an online manner, for a more restrictive class of changepoint models, is Caron et al. [2011], where the authors use a recursive gradient algorithm. If the model permits an EM implementation, then it is fair to say that EM is generally preferred by practitioners, as no algorithm tuning is required, whereas it can be difficult to properly scale the components of the computed gradient vector.

For finite state-space Hidden Markov Models (HMM) [Cappé, 2011; Mongillo and Deneve, 2008] and linear Gaussian state-space models [Elliott et al., 2002], it is possible to implement the online EM algorithm exactly. A detailed study of this algorithm in the finite state-space case can be found in Cappé [2011]. For changepoint models, it is necessary to numerically approximate, sequentially over time, certain expectations with respect to (w.r.t.) the conditional law of the changepoints and the other latent random variables of the model given the observations available up to that point in time. We present SMC estimates of these expectations and establish the stability (via the variance) of these estimates w.r.t. time $n$ and the number of particles $N$, both theoretically and with numerical examples. Stability of the SMC estimates of the expectations is important for assessing the performance and reliability of the EM algorithm, and it is not to be taken for granted, because these expectations are computed w.r.t. a probability law whose dimension increases linearly with time $n$. We note that the computational cost of the proposed SMC online EM algorithm is $O(N)$ per time step, whereas an $O(N^2)$ per-time-step algorithm is required to obtain similar stability results for general state-space HMMs [Del Moral et al., 2009].

Cappé [2011] remarked that "although the online EM algorithm resembles a classical stochastic approximation algorithm, it is sufficiently different to resist conventional 'analysis of convergence'". We believe that limited results similar to those discussed in Cappé [2011, Section 4], identifying the potential accumulation points of the online EM procedure, could be established, but this is beyond the scope of this work. In the numerical studies reported in this work, and indeed in all the ones we have conducted, the SMC online EM algorithm converges, and to a very close vicinity of the correct values when these are known, e.g. in synthetic examples. Moreover, we observed that online EM converged significantly quicker than the batch EM implementation.
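As a deliberately simplified illustration of the generic online EM recursion underlying this chapter, the sketch below runs the scheme on a toy i.i.d. Gaussian model where the E-step is trivial: a running average of the sufficient statistics with step sizes $\gamma_n$, followed by an explicit M-step mapping the averaged statistics to the parameters. The toy model and all numerical choices are ours, for illustration only; the chapter's contribution is to approximate exactly these expectation steps with SMC for changepoint models, where they are not available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=10_000)   # synthetic data stream

# Online EM on a toy i.i.d. Gaussian model: sufficient statistics
# s(y) = (y, y^2); M-step: mu = S1, sigma^2 = S2 - S1^2.
S = np.array([0.0, 1.0])            # running estimates of E[y], E[y^2]
mu, var = 0.0, 1.0
for n, yn in enumerate(y, start=1):
    gamma_n = n ** -0.6             # steps: sum gamma_n = inf, sum gamma_n^2 < inf
    S = (1.0 - gamma_n) * S + gamma_n * np.array([yn, yn * yn])
    if n >= 20:                     # short burn-in before invoking the M-step
        mu, var = S[0], S[1] - S[0] ** 2
print(mu, var ** 0.5)               # approaches the true values (2.0, 1.5)
```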
The organisation of the chapter is as follows. In Section 4.2, we describe a general changepoint model. In Section 4.3, we present the associated online EM algorithm and its SMC implementation. Theoretical results on the stability of the SMC estimates used in the online EM algorithm are given in Section 4.4. In Section 4.5, we demonstrate the performance of the SMC online EM algorithm on both simulated and real data. We finish with a discussion in Section 4.6; finally, some detailed model-specific derivations as well as mathematical proofs are given in the Appendix.

4.2 The changepoint model

In this work, a changepoint model is defined to comprise two discrete-time stochastic processes, $\{(X_k, Z_k)\}_{k \geq 1}$ and $\{Y_k\}_{k \geq 1}$. $\{(X_k, Z_k)\}_{k \geq 1}$ is an unobserved time-homogeneous Markov chain taking values in $\mathcal{X} \times \mathcal{Z}$, where $\mathcal{X} = \{1, 2, \ldots\} \times \{1, \ldots, R\}$ and $\mathcal{Z} \subseteq \mathbb{R}^p$. (While the definition of $\mathcal{X}$ in this manner is necessary for the resulting model to be a changepoint model, the definition of $\mathcal{Z}$ can change depending on the application domain.) We denote realisations of the first component of this chain by $x_k = (d_k, m_k)$. The variable $m_k$ takes values in the index set $\{1, \ldots, R\}$ and indicates the (generative) model the chain is in at that time, while $d_k$ indicates the duration the chain has spent in model $m_k$. The transition law of $\{(X_k, Z_k)\}_{k \geq 1}$ is
$$X_1 \sim \mu, \qquad X_k \,|\, (x_{k-1} = (d, m), z_{k-1}) = \begin{cases} (d+1, m) & \text{w.p. } 1 - \lambda_{\theta,m}(d), \\ (1, m') & \text{w.p. } \lambda_{\theta,m}(d) \times P_\theta(m, m'), \end{cases}$$
$$Z_k \,|\, (x_k = (d', m'), x_{k-1}, z_{k-1}) \sim \begin{cases} f_{\theta,m'}(z \,|\, z_{k-1})\, dz & \text{if } d' \neq 1, \\ \pi_{\theta,m'}(z)\, dz & \text{if } d' = 1, \end{cases} \qquad (4.1)$$
where $\lambda_{\theta,m}(d) \in [0, 1]$ for all $\theta \in \Theta$ and $(d, m) \in \mathcal{X}$; $P_\theta$ is an $R \times R$ row-stochastic matrix; for each $\theta$ and $m$, $f_{\theta,m}(z | z_{k-1})$ is the density of a Markov transition kernel on $\mathcal{Z}$ w.r.t. a suitable dominating measure, denoted $dz$; and for each $\theta$ and $m$, $\pi_{\theta,m}$ is a probability density on $\mathcal{Z}$. The transition kernel of the Markov chain $\{(X_k, Z_k)\}_{k \geq 1}$ is assumed to be parametrised by the finite-dimensional parameter $\theta \in \Theta$. Without loss of generality, it is assumed that the probability distribution of the initial state of the chain $\{X_k\}_{k \geq 1}$, denoted $\mu$, has all its mass on $\{(1, 1), \ldots, (1, R)\}$; e.g., it could be the uniform distribution on $\{(1, 1), \ldots, (1, R)\}$. For a sequence $\{a_k\}_{k \geq 1}$ and integers $i, j$, let $a_{i:j}$ denote the set $\{a_i, a_{i+1}, \ldots, a_j\}$, which is empty if $j < i$, and let $a_{i:\infty} = \{a_i, a_{i+1}, \ldots\}$.

The process $\{Y_k\}_{k \geq 1}$ is a $\mathcal{Y}$-valued observed process which satisfies the following conditional independence property:
$$Y_k \,\big|\, \left( \{x_k, z_k\}_{k \geq 1},\, y_{1:k-1},\, y_{k+1:\infty} \right) \sim g_{\theta, m_k}(y \,|\, z_k)\, dy, \qquad (4.2)$$
where for each $\theta$ and $m$, $g_{\theta,m}$ is a probability density on $\mathcal{Y}$ with respect to the dominating measure $dy$. In this work $\mathcal{Y} \subseteq \mathbb{R}^q$, although the definition of $\mathcal{Y}$ may be altered depending on the application. Equations (4.1) and (4.2) now define the law of $(X_{1:n}, Z_{1:n}, Y_{1:n})$.

Note that $\{X_k\}_{k \geq 1}$ is itself a Markov chain, and we denote its transition matrix by $p_\theta(x_k | x_{k-1})$. Secondly, it is useful to visualise a realisation of $\{X_k\}_{k \geq 1}$ as a labelled contiguous partition of $\{1, 2, \ldots\}$, namely $\{[t_1, t_2), [t_2, t_3), \ldots\}$ with $t_{i+1} > t_i$, where each set $[t_i, t_{i+1})$ of the partition, which we call a segment, is accompanied by $m_{t_i}$, the model number during that segment. The times $t_i$ are the instances at which $\{X_k\}_{k \geq 1}$ visits the set $\{1\} \times \{1, \ldots, R\}$ and are called the changepoints.
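The generative structure in (4.1)-(4.2) is perhaps easiest to see as a simulation. The sketch below samples from a minimal instance with concrete choices that are ours alone and not from this chapter: $R = 2$ models, constant hazards $\lambda_{\theta,m}(d) = \lambda_m$ (so segment lengths are geometric), a scalar AR(1) kernel for $f_{\theta,m}$, standard normal $\pi_{\theta,m}$, and a Gaussian observation density $g_{\theta,m}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameter choices (not from the thesis): R = 2 models.
lam = np.array([0.05, 0.10])            # lambda_{theta,m}: changepoint probability
P = np.array([[0.0, 1.0],               # row-stochastic model-switch matrix P_theta
              [1.0, 0.0]])
a = np.array([0.9, 0.5])                # AR(1) coefficients for f_{theta,m}
obs_sd = 0.3                            # g_{theta,m}: y_k ~ N(z_k, obs_sd^2)

def simulate(n):
    d, m = 1, int(rng.integers(2))      # X_1 ~ mu, uniform on {(1,1), (1,2)}
    z = rng.normal()                    # Z_1 ~ pi_{theta,m} (standard normal here)
    xs, zs, ys = [], [], []
    for _ in range(n):
        xs.append((d, m)); zs.append(z)
        ys.append(rng.normal(z, obs_sd))      # observation law (4.2)
        if rng.random() < lam[m]:             # changepoint, w.p. lambda_{theta,m}(d)
            m = int(rng.choice(2, p=P[m]))    # new model m' ~ P_theta(m, .)
            d, z = 1, rng.normal()            # duration resets; Z ~ pi_{theta,m'}
        else:                                 # no changepoint, w.p. 1 - lambda
            d += 1                            # d is tracked although the hazard
            z = a[m] * z + rng.normal(0, 0.5) # here is constant; Z ~ f_{theta,m}
    return xs, zs, ys

x, z_latent, y = simulate(500)
```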
As $\{Z_k\}_{k \geq 1}$ forgets its past at the times of changepoints, within the segment $[t_i, t_{i+1})$ the process $\{(Z_k, Y_k)\}_{t_i \leq k < t_{i+1}}$ is independent of the other segments given the realisation of $\{X_k\}_{k \geq 1}$.