Maximum Likelihood Parameter Estimation in Time Series Models Using Sequential Monte Carlo

Sinan Yıldırım
Darwin College
Department of Pure Mathematics and Mathematical Statistics
University of Cambridge

A thesis submitted for the degree of Doctor of Philosophy

To Selcan and İlhan...

Declaration

This dissertation is the result of work carried out by myself between October 2009 and October 2012. It includes nothing which is the outcome of work done in collaboration with others, except as specified in the text.

Signed: Sinan Yıldırım

Acknowledgements

I would like to thank my supervisor, Dr. Sumeetpal S. Singh, for his supervision, assistance, and friendship. Having greatly benefited from working with him for over three years, I am very glad to have been his PhD student. I would like to express my gratitude to my advisor Prof. A. Philip Dawid, who has been extremely kind and supportive throughout my PhD. Many thanks to my collaborators Dr. A. Taylan Cemgil, Dr. Tom Dean, Lan Jiang, and Prof. Arnaud Doucet for their invaluable contributions to the development of this thesis. I must thank my proofreaders Ozan Aksoy and Peter Bunch, whose efforts have improved the presentation of this work significantly. Also, many thanks to all my friends for providing the happy moments of this stressful period. Besides my special thanks to my 'mentor' Prof. Mihalis Dafermos, my housemate Yasemin Aslan, and my 'music-mate' Dr. Kyriacos Leptos, I owe a special word of appreciation to Selcan Deniz Kolağasıoğlu, who shared all the happiness, sorrow, and excitement with me. Finally, I would like to thank my parents and relatives, especially my dear brother İlhan and my best cousin and best friend Hasan İlkay Çelik, for their endless support and encouragement all these years.

Abstract

Time series models are used to characterise uncertainty in many real-world dynamical phenomena. A time series model typically contains a static variable, called the parameter, which parametrises the joint law of the random variables involved in the definition of the model. When a time series model is to be fitted to some sequentially observed data, it is essential to decide on the value of the parameter that describes the data best, a procedure generally called parameter estimation.

This thesis comprises novel contributions to the methodology of parameter estimation in time series models. Our primary interest is online estimation, although batch estimation is also considered. The developed methods are based on batch and online versions of expectation-maximisation (EM) and gradient ascent, two popular algorithms for maximum likelihood estimation (MLE). In the last two decades, the range of statistical models where parameter estimation can be performed has been significantly extended with the development of Monte Carlo methods. We contribute to the field in a similar manner, namely by combining EM and gradient ascent algorithms with sequential Monte Carlo (SMC) techniques. The time series models we investigate are widely used in statistical and engineering applications.

The original work of this thesis is organised in Chapters 4 to 7. Chapter 4 contains an online EM algorithm using SMC for MLE in changepoint models, which are widely used to model heterogeneity in sequential data. In Chapter 5, we present batch and online EM algorithms using SMC for MLE in linear Gaussian multiple target tracking models.
Chapter 6 contains a novel methodology for implementing MLE in a hidden Markov model having intractable probability densities for its observations. Finally, in Chapter 7 we formulate the nonnegative matrix factorisation problem as MLE in a specific hidden Markov model and propose online EM algorithms using SMC to perform MLE.

Contents

List of Figures
List of Tables
List of Abbreviations
1 Introduction
  1.1 Context
    1.1.1 Time series models
    1.1.2 Sequential inference and Monte Carlo
    1.1.3 Online parameter estimation
    1.1.4 Bayesian estimation vs maximum likelihood estimation
  1.2 Scope of the thesis
  1.3 Outline
  1.4 Notation
2 Monte Carlo Methods for Statistical Inference
  2.1 Introduction
  2.2 Perfect Monte Carlo
    2.2.1 Inversion sampling
    2.2.2 Rejection sampling
  2.3 Importance sampling
    2.3.1 Self-normalised importance sampling
  2.4 Markov chain Monte Carlo
    2.4.1 Discrete time Markov chains
    2.4.2 Metropolis-Hastings
    2.4.3 Gibbs sampling
  2.5 Sequential Monte Carlo
    2.5.1 Sequential importance sampling
    2.5.2 Sequential importance sampling resampling
    2.5.3 Auxiliary particle filter
    2.5.4 Sequential Monte Carlo samplers
  2.6 Approximate Bayesian computation
3 Hidden Markov Models and Parameter Estimation
  3.1 Introduction
  3.2 Hidden Markov models
    3.2.1 Extensions to HMMs
  3.3 Sequential inference in HMMs
    3.3.1 Bayesian optimal filtering
    3.3.2 Particle filters for optimal filtering
    3.3.3 The marginal particle filter
    3.3.4 The Rao-Blackwellised particle filter
    3.3.5 Application of SMC to smoothing additive functionals
      3.3.5.1 Forward filtering backward smoothing
      3.3.5.2 Forward-only smoothing
  3.4 Static parameter estimation in HMMs
    3.4.1 Direct maximisation of the likelihood
    3.4.2 Gradient ascent maximum likelihood
      3.4.2.1 Online gradient ascent
    3.4.3 Expectation-Maximisation
      3.4.3.1 Stochastic versions of EM
      3.4.3.2 Online EM
    3.4.4 Iterated filtering
    3.4.5 Discussion of the MLE methods
4 An Online Expectation-Maximisation Algorithm for Changepoint Models
  4.1 Introduction
  4.2 The changepoint model
  4.3 EM algorithms for changepoint models
    4.3.1 Batch EM
    4.3.2 Online EM
    4.3.3 SMC implementations of the online EM algorithm
    4.3.4 Comparison with the path space online EM
  4.4 Theoretical results
  4.5 Numerical examples
    4.5.1 Simulated experiments
      4.5.1.1 Online EM applied to long data sequence
      4.5.1.2 Comparison between online and batch EM for a short data sequence
      4.5.1.3 Comparison with the path space method
    4.5.2 GC content in the DNA of Human Chromosome no. 2
  4.6 Discussion
  4.A Appendix
    4.A.1 Derivation of Hk in (4.4)
    4.A.2 Derivation of the EM algorithm for the model in Section 4.5
    4.A.3 Proof of Proposition 4.1
      4.A.3.1 Verification of Example 4.2 satisfying the conditions of Proposition 4.1
5 Estimating the Static Parameters in Linear Gaussian Multiple Target Tracking Models
  5.1 Introduction
    5.1.1 Notation
  5.2 Multiple target tracking model
  5.3 EM algorithms for MTT
    5.3.1 Batch EM for MTT
      5.3.1.1 Estimation of sufficient statistics
      5.3.1.2 Stochastic versions of EM
    5.3.2 Online EM for MTT
      5.3.2.1 Online smoothing in a single GLSSM
      5.3.2.2 Application to MTT
      5.3.2.3 Online EM implementation
  5.4 Experiments and results
    5.4.1 Batch setting
    5.4.2 Online EM setting
      5.4.2.1 Unknown fixed number of targets
      5.4.2.2 Unknown time varying number of targets
  5.5 Conclusion and Discussion
  5.A Appendix
    5.A.1 Recursive updates for sufficient statistics in a single GLSSM
    5.A.2 SMC algorithm for MTT
    5.A.3 Computational complexity of EM algorithms
      5.A.3.1 Computational complexity of SMC filtering
      5.A.3.2 SMC-EM for the batch setting
      5.A.3.3 SMC online EM
6 Approximate Bayesian Computation for Maximum Likelihood Estimation in Hidden Markov Models
  6.1 Introduction
    6.1.1 Hidden Markov models
    6.1.2 Parameter estimation
    6.1.3 Approximate Bayesian computation for parameter estimation
    6.1.4 Outline of the chapter
  6.2 ABC MLE approaches for HMM
    6.2.1 Standard ABC MLE
    6.2.2 Noisy ABC MLE
    6.2.3 Smoothed ABC MLE
    6.2.4 Summary
  6.3 Implementing ABC MLE
      6.3.0.1 SMC algorithm for the expanded HMM
    6.3.1 Gradient ascent ABC MLE
      6.3.1.1 Batch gradient ascent
      6.3.1.2 Online gradient ascent
      6.3.1.3 Controlling the stability
      6.3.1.4 Special case: i.i.d. random variables with an intractable density
    6.3.2 Expectation-maximisation
  6.4 Numerical examples
    6.4.1 MLE for α-stable distribution
    6.4.2 MLE for g-and-k distribution
    6.4.3 The stochastic volatility model with symmetric α-stable returns
  6.5 Discussion
7 An Online Expectation-Maximisation Algorithm for Nonnegative Matrix Factorisation Models
  7.1 Introduction
    7.1.1 Notation
  7.2 The Statistical Model for NMF
    7.2.1 Relation to the classical NMF
  7.3 EM algorithms for NMF
    7.3.1 Batch EM
    7.3.2 Online EM
    7.3.3 SMC implementation of the online EM algorithm
  7.4 Numerical examples
    7.4.1 Multiple basis selection model
    7.4.2 A relaxation of the multiple basis selection model
  7.5 Discussion
8 Conclusions
  8.1 Contributions
  8.2 Future directions
References

List of Figures

4.1 SMC-FS online EM estimates vs time for a long simulated data sequence. The true parameter values are indicated with a horizontal line.
4.2 SMC-FS online EM estimates vs number of passes for the concatenated data set {y1:2000, y1:2000, . . .}, where each pass is one complete sweep through y1:2000. The true parameter values: α = 10, β = 0.1, ξ1 = 1.78, ξ2 = 3.56, κ1 = 0.30, κ2 = 0.03, λ1 = λ2 = 0.1, Pi,j = 0.5.
4.3 SMC-FS batch EM estimates vs number of iterations for the same y1:2000 used to produce the results in Figure 4.2.
4.4 Comparison of the forward smoothing and the path space methods in terms of the variability in the estimates of S16,n. The box plots and the relative variance plot are generated from 100 Monte Carlo simulations using the same observation data.
4.5 Comparison of SMC-FS online EM and SMC-PS online EM in terms of the variability in their estimates of λ1 = 0.1. The two plots at the top are generated by superimposing different estimates; the box plots and the relative variance plot are generated from estimates out of 100 different Monte Carlo runs using the same observation data.
4.6 Noisy GC content over 3 kb windows in human DNA chromosome 2.
4.7 Online EM estimates vs number of passes over the data sequence in Figure 4.6.
5.1 Top: the list of the random variables in the MTT model. Bottom: a realisation of an MTT model; states of a target are connected with arrows, and observations generated from targets are connected to those targets with arrows. Mis-detected targets are highlighted with shadows, and observations from false measurements are coloured grey.
5.2 Batch estimates obtained using the SMC-EM algorithm for MLE. θ∗,z is shown as a cross.
5.3 Comparison of online SMC-EM estimates applied to the concatenated data (thicker line) with batch SMC-EM.
5.4 Online estimates of the SMC-EM algorithm (Algorithm 5.3) for a fixed number of targets. True values are indicated with a horizontal line. Initial estimates for pd, λf, σ²xv, σ²y are 0.6, 15, 0.25, 25; they are not shown in order to zoom in around the converged values.
5.5 Left: estimates of pθ1:t(y1:t|K) (normalised by t) for t = 100, . . . , t = 500 and for K = 6, . . . , K = 15. Right: estimates of pθ1:t(y1:t|K) normalised by t for K = 6, . . . , K = 15; K = 10 is stressed with a bold plot.
5.6 Estimates of the online SMC-EM algorithm (Algorithm 5.3) for an MTT model with a time varying number of targets, compared with online EM estimates when the true data association {Zt}t≥1 is known. For the estimates in the case of known true association, only θ1000, θ2000, . . . , θ100000 are shown.
True values are indicated with a horizontal line.
5.7 SMC online EM estimates when birth-death times are known (solid line) compared to the original results in Figure 5.6 (dashed lines). For illustrative purposes, every 1000th estimate is shown.
6.1 Histograms of Monte Carlo estimates of gradients of log p^{ǫ,κ,ψ}_θ(Y^{ǫ,κ,ψ}) w.r.t. the parameters of the α-stable distribution, with tan⁻¹(·) being used. 10⁵ samples were used for generating the histograms.
6.2 Top: online estimation of α-stable parameters from a sequence of i.i.d. random variables using online gradient ascent MLE. True parameters (α, β, µ, σ) = (1.5, 0.2, 0, 0.5) are indicated with a horizontal line. Bottom: gradient of the incremental likelihood for the α-stable parameters.
6.3 S-ABC MLE and SN-ABC MLE estimates of the parameters of the α-stable distribution (averaged over 50 runs) using the online gradient ascent algorithm for the same data set. For SN-ABC MLE, a different noisy data sequence obtained from the original data set is used in each run. True parameters (α, β, µ, σ) = (1.5, 0.2, 0, 0.5) are indicated with a horizontal line.
6.4 Mean and variance (over 50 runs) of SN-ABC MLE estimates using the online gradient ascent algorithm. The same noisy data sequence is used in each run. True parameters (g, k, A, B) = (2, 0.5, 10, 2) are indicated with a horizontal line.
6.5 Top: SN-ABC MLE estimates of g-and-k parameters from a sequence of i.i.d. random variables using the batch gradient ascent algorithm. True parameters (g, k, A, B) = (2, 0.5, 10, 2) are indicated with a horizontal line. Bottom: approximate distributions (histograms over 20 bins) of the estimates.
6.6 Online estimation of SVαR parameters using the online gradient ascent algorithm to implement SN-ABC MLE. True parameter values (α, φ, σ²x) = (1.9, 0.9, 0.1) are indicated with a horizontal line.
6.7 Online estimation of SVαR parameters (α = 1.9 is known) using the online EM algorithm to implement SN-ABC MLE. True parameter values (φ, σ²x) = (0.9, 0.1) are indicated with a horizontal line.
7.1 Online estimation of B in the NMF model in Section 7.4.1 using the exact implementation of online EM for NMF. The (i, j)'th subfigure shows the estimation result for B(i, j) (horizontal lines).
7.2 A realisation of {Xt(1)}t≥1 for α = 0.95.
7.3 Online estimation of B in the NMF model in Section 7.4.2 using Algorithm 7.1. The (i, j)'th subfigure shows the estimation result for B(i, j) (horizontal lines).

List of Tables

5.1 The list of the EM variables used in Section 5.3.
6.1 A comparison of ABC MLE approaches.

List of Abbreviations

a.e.  almost everywhere
a.s.  almost surely
ABC  approximate Bayesian computation
AMPF  auxiliary marginal particle filter
CPHD  cardinalised PHD
EM  expectation-maximisation
FFBS  forward filtering backward smoothing
GLSSM  Gaussian linear state-space model
HMM  hidden Markov model
i.i.d.  independently and identically distributed
JPDAF  joint probabilistic data association filter
MCEM  Monte Carlo EM
MCMC  Markov chain Monte Carlo
MCMC-DA  MCMC-data association
MHT  multiple hypothesis tracking
MLE  maximum likelihood estimation
MPF  marginal particle filter
MTT  multiple target tracking
NMF  nonnegative matrix factorisation
PHD  probability hypothesis density
PMCMC  particle Markov chain Monte Carlo
PMHT  probabilistic MHT
RBPF  Rao-Blackwellised particle filter
S-ABC MLE  smoothed ABC MLE
SAEM  stochastic approximation EM
SEM  stochastic EM
SIS  sequential importance sampling
SISR  sequential importance sampling resampling
SMC  sequential Monte Carlo
SN-ABC MLE  smoothed noisy ABC MLE
SVαR  stochastic volatility model with α-stable returns

Chapter 1

Introduction

1.1 Context

1.1.1 Time series models

In probability theory and statistics, stochastic processes are used to capture uncertainty in many real-world dynamical phenomena. A stochastic process can be thought to evolve in time either continuously or discretely; in this thesis we will only consider discrete time stochastic processes. In the literature, a large number of different discrete time stochastic processes can be represented under the family of generative dynamical models called time series models. A parametric time series model consists of random variables that describe the modelled process with adequate generality, and these random variables admit probability laws that are parametrised by a vector-valued static variable. This variable is generally denoted by θ and called the static parameter, or simply the parameter, of the model.

A time series model associated with a stochastic process is generative. That is, when simulated, the model produces a realisation of a sequence of observable random variables {Yt}t≥1 of the stochastic process over time. Typically {Yt}t≥1 are only a subset of the random variables that comprise the time series model; the rest of the random variables are called latent, or hidden, variables. In many cases, observable variables are considered to be somewhat noisy measurements of an underlying structure which is of primary interest. The power of a time series model is its ability to provide a rigorous mathematical formulation of this underlying structure, as well as of its relation to {Yt}t≥1 via its latent variables. This helps the scientist infer the latent variables from an observed time series in a principled way by employing well-established methods from statistics.

1.1.2 Sequential inference and Monte Carlo

In many time series models, the latent variables themselves are lumped together to form another random process {Xt}t≥1. This process represents the hidden state of interest evolving dynamically, typically in a Markovian fashion. An example of this is a hidden Markov model (HMM), sometimes called a state-space model. In a HMM, {Xt}t≥1 is a Markov process and each Yt is a conditionally independent observation generated by Xt, the evolving state at time t. (For a review of HMMs in a closely related context, see Cappé et al. [2005].)

In the literature, the problem of sequential Bayesian estimation of Xt based on the sequentially observed variables Y1, . . . , Yt is known as the optimum Bayesian filtering problem. When the time series model has linear and Gaussian dynamics, the exact solution of this problem is given by the Kalman filter. However, in non-linear non-Gaussian models, numerical approximations must be used.
Sequential Monte Carlo (SMC) methods, also known as particle filters, are the most popular numerical methods for approximate solutions of the optimum Bayesian filtering problem [Doucet et al., 2000b; Durbin and Koopman, 2000; Gordon et al., 1993; Kitagawa, 1996; Liu and Chen, 1998]. These methods are a special class of Monte Carlo methods, which rely on the basic idea of simulating from probability distributions when analytical evaluation of quantities involving these probability distributions cannot be performed [Metropolis and Ulam, 1949]. Although originally developed for HMMs, SMC methods can often easily be extended to more general time series models. A review of SMC methods is presented in Section 2.5, and their application to HMMs is reviewed in Section 3.3.

1.1.3 Online parameter estimation

For the case when the true value of the static parameter of the time series model, which we will denote by θ∗ throughout the thesis, is known, numerous SMC methods have been proposed and successfully applied to the Bayesian optimal filtering problem over the last two decades. (See Cappé et al. [2007]; Doucet and Johansen [2009]; Fearnhead [2008] for recent reviews of the methodology.) However, in realistic applications θ∗ is hardly ever known, although its estimation is essential for accurate inference of the latent variables of the model. Therefore, developing efficient and accurate parameter estimation methods for time series models is of significant importance.

Classical methods used for parameter estimation process the observed data in a batch fashion, i.e. they require several complete passes through the entire data set. In this thesis, we are primarily concerned with developing online parameter estimation algorithms. With the advancement of sensor and storage technologies, and with the significantly reduced costs of data acquisition, we are able to collect and record vast amounts of raw data. Arguably, the grand challenge facing computation in the 21st century is the effective handling of such large data sets. Unfortunately, classical batch processing methods fail with very large data sets due to memory restrictions and long computation times. For this reason, so-called online methods have recently gained popularity in the area. The main principle of these methods is that a current estimate, obtained using the data available so far, is updated whenever a new portion of data is received. Based on this principle, online methods are promising in terms of reducing both memory and computation requirements; hence they are potentially a powerful alternative to batch methods.

1.1.4 Bayesian estimation vs maximum likelihood estimation

There are two different approaches to static parameter estimation: Bayesian and maximum likelihood. Bayesian parameter estimation requires the assignment of a prior distribution for the unknown parameter θ. The objective is then to calculate the posterior distribution of θ given the observed data. When a point estimate of θ∗ is required, some feature of this posterior distribution can be provided. The common Bayesian estimators are the posterior mean, the posterior median, and the posterior mode, or the maximum a posteriori probability (MAP) estimate. There are several Monte Carlo based methods for Bayesian parameter estimation when exact calculation of the posterior distribution is not available.
Alternatively, the maximum likelihood approach regards the likelihood of the observed data, viewed as a function of θ, as containing all the relevant information for estimating θ∗. The point estimate of θ∗ is the maximising argument of the likelihood. When maximum likelihood estimation (MLE) cannot be performed analytically, iterative search-based algorithms such as expectation-maximisation (EM) and gradient ascent guarantee local maximisation of the likelihood, given certain regularity conditions on the densities of the random variables involved. Also, Monte Carlo versions of these algorithms have been developed and applied successfully to many time series models. See Kantas et al. [2009] for a comprehensive review of SMC methods for Bayesian and maximum likelihood parameter estimation, or Section 3.4 for a briefer discussion.

Whether one should in principle use the Bayesian or the maximum likelihood approach for estimating θ∗ is a fundamental debate which we will not go into. There are indeed cases when these two approaches do produce dramatically different suggestions of what θ∗ might be, especially when the observed data is of small size and a highly informative prior is used for Bayesian estimation. However, as the data size tends to infinity, the likelihood of the data sweeps away the effect of the prior in the posterior distribution, and the difference between the estimates of the two approaches vanishes (say, when the MAP estimate is used for Bayesian estimation), provided that the prior is well behaved (i.e. it does not assign zero density to any 'feasible' parameter value). Therefore, in an online estimation setting, where the data size is presumably very large, the two approaches are expected to give almost identical results if they could be implemented exactly. Thus, for the practitioner, the choice of online parameter estimation method depends on which has the most favourable properties in terms of computational costs and memory requirements, rather than on philosophical concerns that would matter if the outcomes of the Bayesian and maximum likelihood approaches differed significantly. Moreover, when a parameter estimation method involves any sort of Monte Carlo approximation, this brings with it the additional requirement that the statistical properties of the method, such as the bias and variance of its estimator, be taken into consideration.

Given these concerns, one can argue that the online MLE methods proposed in the literature so far are preferable to their Bayesian counterparts. Online Bayesian estimation methods, in one way or another, are based on including the static parameter in the hidden state of the time series model and casting the online parameter estimation problem as a filtering one. Unfortunately, when the data size is large, these methods suffer from the particle degeneracy which is inherent in SMC filtering; see e.g. Andrieu et al. [2005]; Olsson et al. [2008] for a discussion. There are certain techniques proposed to overcome the degeneracy problem, such as those based on Markov chain Monte Carlo (MCMC) moves for the parameter (e.g. Gilks and Berzuini [2001]; Polson et al. [2008]) or on introducing artificial dynamics on the parameter (e.g. Campillo and Rossi [2009]; Higuchi [2001]; Kitagawa [1998]). But all these techniques either still suffer from the particle degeneracy problem or come with the price of bias and tuning difficulties, or both; see Section 3.4 for more discussion or Kantas et al. [2009] for even more details.
On the other hand, online MLE methods based on Monte Carlo are more promising due to their favourable stability properties and reasonable computational and memory requirements. Recently, SMC based online EM and online gradient ascent algorithms for hidden Markov models have been proposed and analysed in several works, such as Cappé [2009]; Del Moral et al. [2009, 2011]; Poyiadjis et al. [2011]. It has been shown in these works that the variance of the estimators in these algorithms either remains constant over time or decays, depending on the SMC scheme used. For these reasons, we focus on online MLE methods for parameter estimation in this thesis.

1.2 Scope of the thesis

SMC based MLE methods are present in the literature, and they have been applied successfully to many important time series models, especially to a large proportion of HMMs. However, there are still many important types of time series models for which the methods developed so far are not directly applicable. This thesis aims to develop MLE methods, especially online MLE methods, for some non-standard time series models using Monte Carlo. Below we list and summarise the topics we investigate in this thesis.

• Changepoint models: One example of a time series model is the changepoint model, which is commonly used to model heterogeneity of sequential data in a range of areas such as engineering, the physical and biological sciences, and finance. Having a segmented structure introduced by changepoints, the model differs from a HMM, which makes it both interesting and challenging to study its statistical aspects and to estimate its parameters.

• Multiple target tracking models: Another challenging problem in the areas of applied statistics and engineering is multiple target tracking (MTT). In MTT, the main objective is to simultaneously track several moving objects in a surveillance region under far from ideal conditions that introduce random mis-detection of targets and false measurements. The additional issues of a time varying unknown number of targets and the unknown association of targets to observation points make the problem even more challenging. In this thesis we restrict ourselves to the linear Gaussian MTT model. Statistical treatment of the dynamics of the MTT model is well established, and Monte Carlo methods are available for the estimation of the latent states in the tracking problem. However, the problem of calibrating the MTT model by estimating its static parameters has largely been ignored.

• HMMs with intractable observation densities: An important computational problem studied in this thesis is that of implementing batch and online MLE in a HMM where the conditional law of the observations is intractable, that is, its probability density is either analytically unavailable or prohibitively expensive to calculate. Due to this intractability, the online MLE methods developed for HMMs are not directly applicable, since the computation of quantities required by those methods becomes impossible. Approximate Bayesian computation (ABC) has become an increasingly popular strategy for confronting intractability in many statistical models. The adaptation of ABC to HMMs, resulting in an SMC-ABC scheme, has recently been demonstrated. Moreover, theoretical analysis of the properties of MLE based on this SMC-ABC scheme has been performed. However, methods for implementing MLE using SMC-ABC have not yet been developed, and we believe that solving this implementation problem would be a valuable contribution to the literature.
• Nonnegative matrix factorisation: Another interesting problem where online statistical estimation methods are of use is the nonnegative matrix factorisation (NMF) problem, where a given nonnegative matrix Y is to be approximated by the product BX of two nonnegative matrices. In many applications, B is considered as a matrix of 'basis' vectors and X is a 'gain' matrix determining which of the columns in B dominate the columns of Y. Our approach to the NMF problem is to consider it as a parameter estimation problem for a HMM whose latent and observed processes are the columns of X and Y, respectively. This approach is useful for handling the case where the columns of Y are generated sequentially in time, such as in audio signal processing. Usually, the very large number of columns in Y leads to the necessity of online algorithms to learn the model and make inference.

This thesis aims to contribute to the methodology of MLE, especially online MLE, in time series models within the contexts of the topics summarised above. We present novel EM and gradient ascent methods implemented with SMC. Statistical and computational aspects of the developed methods will be studied, mostly using numerical experiments.

1.3 Outline

The material presented in the rest of the thesis is organised in six main chapters, followed by a final concluding chapter, as follows.

Chapter 2: Monte Carlo Methods for Statistical Inference
This chapter provides a survey of the Monte Carlo literature. We will review some basic Monte Carlo methods, such as rejection sampling, importance sampling, MCMC, SMC, and ABC.

Chapter 3: Hidden Markov Models and Parameter Estimation
We introduce hidden Markov models and review Monte Carlo methods for filtering and parameter estimation in hidden Markov models.

Chapter 4: An Online Expectation-Maximisation Algorithm for Changepoint Models
We present a novel online EM algorithm using SMC for changepoint models. We also provide theoretical and numerical stability analysis for the developed algorithm.

Chapter 5: Estimating the Static Parameters in Linear Gaussian Multiple Target Tracking Models
We present novel batch and online EM algorithms using SMC for linear Gaussian MTT models. The algorithms are based on the availability of exact EM algorithms in a single linear Gaussian state-space model, but involve SMC for tracking the unknown data association inherent in the model.

Chapter 6: Approximate Bayesian Computation for Maximum Likelihood Estimation in Hidden Markov Models
We present methods for implementing MLE in HMMs with intractable observation densities. An SMC based ABC scheme is our main tool for dealing with the intractability inherent in these HMMs, and batch and online gradient ascent algorithms using SMC are shown to be suitable for this scheme.

Chapter 7: An Online Expectation-Maximisation Algorithm for Nonnegative Matrix Factorisation Models
We formulate the nonnegative matrix factorisation (NMF) problem as an MLE problem for HMMs and propose online EM algorithms using SMC to estimate the NMF and the other unknown static parameters.

1.4 Notation

It will be useful to summarise the notation used throughout this thesis.
The notation presented here will be used consistently in the literature review part of the thesis (Chapters 2 and 3); however, the particular requirements of Chapters 4, 5, 6, and 7, which contain the original work, are such that there is inevitably some conflict with the desire to be consistent with standard usage within the literature. The reader will be notified whenever we deviate from the notation or any additional notation is introduced.

We use N and R to denote the sets of natural numbers and real numbers, respectively. For a sequence {a_k}_{k≥1} and integers i, j, we let a_{i:j} denote the set {a_i, . . . , a_j}, which is empty if j < i, and a_{i:∞} = {a_i, a_{i+1}, . . .}.

Probability measures, integrals, and random variables: Given a general measurable space (X, E), we refer to the set of all σ-finite measures on that space as M(X). The set of all probability measures on (X, E) is denoted P(X) ⊂ M(X). We use B_b(X) to denote the Banach space of bounded real valued measurable functions on X. The integration of a real-valued measurable function ϕ on X with respect to a measure µ ∈ M(X) is denoted
\[ \mu(\varphi) = \int_{\mathcal{X}} \varphi(x)\,\mu(dx), \quad \forall \mu \in \mathcal{M}(\mathcal{X}). \]
Also, for A ∈ E, µ(A) = µ(I_A), where I_A : X → {0, 1} is the indicator function, so that I_A(x) is 1 if x ∈ A and 0 otherwise. Finally, δ_x is the Dirac measure satisfying δ_x(A) = I_A(x) for all A ∈ E.

Let (Ω, F, P) be a probability space and (X, E) a measurable space. An E/F-measurable function X : Ω → X is called an (X, E)-valued random variable. The probability measure π on (X, E) corresponding to the law of X is given by P ◦ X⁻¹, so that
\[ \pi(A) = \mathbb{P}\left[X^{-1}(A)\right], \quad \forall A \in \mathcal{E}. \]
We will denote the expectation of ϕ with respect to π as
\[ \mathbb{E}_\pi[\varphi(X)] = \pi(\varphi), \quad \forall \pi \in \mathcal{P}(\mathcal{X}). \]
Both the expression on the left and that on the right of the equality will be used. If π is parametrised by a vector θ, we will denote it by π_θ and write E_θ[ϕ(X)] to mean E_{π_θ}[ϕ(X)]. Finally, capital letters X, Y, Z, etc. will be used to denote random variables, whereas the corresponding small letters x, y, z, etc. will be used for their realisations.

Let π be the law of X. We write π ≪ λ to mean that π is absolutely continuous with respect to the dominating measure λ, and we call the Radon-Nikodým derivative ν = dπ/dλ the density of π (or the probability density of X) with respect to λ. Throughout the chapters of this thesis containing original work, λ will be either the Lebesgue measure or the counting measure, and λ(dx) will be replaced by dx for simplicity. To make the law of X explicit, we interchangeably use X ∼ π and X ∼ ν.

Markov kernels: Given two measurable spaces (X₁, E₁) and (X₂, E₂), we define a Markov kernel, or transition kernel, as a map K : X₁ → P(X₂) satisfying the following two conditions:
• for all x ∈ X₁, K(x, ·) is a probability measure in P(X₂);
• for all A ∈ E₂, K(·, A) is a nonnegative measurable function with respect to E₁ on X₁.
A Markov kernel induces two operators, the first one on M(X₁) and the second one on the bounded E₂-measurable functions on X₂:
\[ \mu K(dy) = \int_{\mathcal{X}_1} \mu(dx)\,K(x, dy), \quad \forall \mu \in \mathcal{M}(\mathcal{X}_1), \]
\[ K(\varphi)(x) = \int_{\mathcal{X}_2} \varphi(y)\,K(x, dy), \quad \forall \varphi \in B_b(\mathcal{X}_2). \]
Using the first operation, a probability measure µ ∈ P(X₁) is mapped by K to another probability measure µK ∈ P(X₂). Also, when we wish to consider the joint distribution induced over (X₁ × X₂, E₁ × E₂) by a measure π and a Markov kernel K : X₁ → P(X₂), we use the notation π ⊗ K, i.e. π ⊗ K(dx, dy) = π(dx) K(x, dy). Moreover, given a sequence of measurable spaces {(X_n, E_n)}_{n≥1} and a sequence of Markov kernels {K_n : X_{n-1} → P(X_n)}_{n≥2}, we define, for q ≥ p ≥ 1,
\[ K_{p:q}(x_p, dx_{p+1:q}) = K_{p+1} \otimes \cdots \otimes K_q(x_p, dx_{p+1:q}) = \prod_{i=p+1}^{q} K_i(x_{i-1}, dx_i), \]
and we define the corresponding operator on the bounded E_{p+1} ⊗ · · · ⊗ E_q-measurable functions as
\[ K_{p:q}(\varphi)(x_p) = \int_{\mathcal{X}_{p+1}} \cdots \int_{\mathcal{X}_q} \varphi(x_{p+1:q}) \prod_{i=p+1}^{q} K_i(x_{i-1}, dx_i), \quad \forall \varphi \in B_b(\mathcal{X}_{p+1} \times \cdots \times \mathcal{X}_q). \]
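As a concrete illustration of these two operators (our example, not one from the thesis), take X₁ = X₂ = R with µ = N(m, s²) and the Gaussian random-walk kernel K(x, ·) = N(x, σ²). Then
\[ \mu K = \mathcal{N}(m, s^2 + \sigma^2), \quad \text{since} \quad \int_{\mathbb{R}} \mathcal{N}(y; x, \sigma^2)\,\mathcal{N}(x; m, s^2)\,dx = \mathcal{N}(y; m, s^2 + \sigma^2), \]
and, for the test function ϕ(y) = y,
\[ K(\varphi)(x) = \int_{\mathbb{R}} y\,\mathcal{N}(y; x, \sigma^2)\,dy = x. \]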
Some common probability distributions: We will use N(µ, σ²) to denote the normal distribution with mean µ and variance σ²; U_A for the uniform distribution over the set A; PO(λ) for the Poisson distribution with rate λ; G(α, β) for the gamma distribution with shape α and scale β; IG(α, β) for the inverse-gamma distribution with shape α and (inverse) scale β; BE(p) for the Bernoulli distribution with success rate p; NΓ⁻¹(ξ, κ, α, β) for the normal-inverse gamma distribution, such that (X, Y) ∼ NΓ⁻¹(ξ, κ, α, β) means Y ∼ IG(α, β) and X ∼ N(ξ, Y/κ); A(α, β, µ, σ) for the α-stable distribution with shape α, skewness β, location µ and scale σ; and M(α, ρ) for the multinomial distribution with α independent trials and probability vector ρ. We also use these notations to express the corresponding probability densities. For example, N(x; µ, σ²) is the probability density of the normal distribution N(µ, σ²) evaluated at x.

Chapter 2

Monte Carlo Methods for Statistical Inference

Summary: This chapter provides a survey of the Monte Carlo literature. We will review some basic Monte Carlo methods for statistical inference that are related to the main content of this thesis. These methods are rejection sampling, importance sampling, Markov chain Monte Carlo, sequential Monte Carlo, and approximate Bayesian computation.

2.1 Introduction

Assume that we are given a probability space (Ω, F, P) and some random variable X : Ω → X which is E/F measurable. We let the probability measure π on (X, E) describe the law of X, so that π = P ◦ X⁻¹. We are interested in integrating a measurable function ϕ : X → R^dϕ with respect to the probability measure π, i.e.
\[ \pi(\varphi) = \mathbb{E}_\pi[\varphi(X)] = \int_{\mathcal{X}} \varphi(x)\,\pi(dx). \tag{2.1} \]
When analytical evaluation of (2.1) is not possible, we have to use approximations. There are deterministic numerical integration techniques available; however, these methods encounter the problem called the curse of dimensionality, since the amount of computation grows exponentially with the dimension of X [Press, 2007]. Therefore, they are far from practical and reliable except in low dimensional problems. A powerful alternative to deterministic methods for integration problems is Monte Carlo integration, where random samples from some distribution are used to approximate the integral in (2.1). The term Monte Carlo was coined in the 1940s; see Metropolis and Ulam [1949] for a first use of the term, and Eckhardt [1987]; Metropolis [1987] for a historical review.

In this chapter we will review the Monte Carlo methodology. We first present the main methods in the literature that aim to evaluate the integral in (2.1). We then proceed to sequential Monte Carlo methods, which approximate a sequence of integrals like the one in (2.1). We conclude the chapter with a review of approximate Bayesian computation, which is a name attached to a wide class of popular Monte Carlo methods aiming to tackle integrals π(ϕ) where π is a posterior distribution resulting from an intractable likelihood. Note that we restrict ourselves to the review of only those methods which are closely related to the work in this thesis. A book-length review of general Monte Carlo methods can be found in Robert and Casella [2004], and for a detailed review of sequential Monte Carlo methods one can consult the books of Doucet et al. [2001] and Del Moral [2004].

2.2 Perfect Monte Carlo

The term perfect Monte Carlo refers to those methods in which the distribution of interest π is approximated by N > 0 independent, identically distributed (i.i.d.) samples from π, and the integration of ϕ with respect to π is carried out under this approximation. The approximation to π using N i.i.d. samples X⁽¹⁾, . . . , X⁽ᴺ⁾ is given by
\[ \pi^N_{\mathrm{MC}}(dx) := \frac{1}{N}\sum_{i=1}^{N} \delta_{X^{(i)}}(dx). \]
Then, the perfect Monte Carlo approximation to π(ϕ) is obtained by substituting π with π^N_MC in (2.1):
\[ \pi^N_{\mathrm{MC}}(\varphi) = \frac{1}{N}\sum_{i=1}^{N} \varphi(X^{(i)}). \]
It is this approach which was originally referred to as the Monte Carlo method in Metropolis and Ulam [1949], although the term has come to encompass a broader class of methods over the following years.

It is easy to show that π^N_MC(ϕ) is an unbiased estimator of π(ϕ) for any N > 0. Also, if π(ϕ) is finite, the strong law of large numbers (e.g. Shiryaev [1995], p. 391) ensures almost sure (a.s.) convergence of π^N_MC(ϕ) to π(ϕ) as the number of i.i.d. samples tends to infinity:
\[ \pi^N_{\mathrm{MC}}(\varphi) \xrightarrow{\mathrm{a.s.}} \pi(\varphi). \]
The variance of π^N_MC(ϕ) is given by
\[ \mathrm{var}\left[\pi^N_{\mathrm{MC}}(\varphi)\right] = \frac{1}{N^2}\sum_{i=1}^{N} \mathrm{var}_\pi\left[\varphi(X^{(i)})\right] = \frac{1}{N}\mathrm{var}_\pi\left[\varphi(X)\right], \]
which indicates the improvement in accuracy with increasing N, provided that var_π[ϕ(X)] is finite. Note that this is true regardless of the dimension of X, which makes Monte Carlo preferable over deterministic numerical methods particularly for high dimensional integrations [Newman and Barkema, 1999]. Also, if var_π[ϕ(X)] is finite, the distribution of the estimator is well behaved in the limit, which is ensured by the central limit theorem (e.g. Shiryaev [1995], p. 335):
\[ \sqrt{N}\left[\pi^N_{\mathrm{MC}}(\varphi) - \pi(\varphi)\right] \xrightarrow{d} \mathcal{N}\left(0, \mathrm{var}_\pi[\varphi(X)]\right). \]
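To make the estimator concrete, here is a short Python sketch (ours, not part of the thesis; the target N(0, 1) and test function ϕ(x) = x², for which π(ϕ) = 1, are purely illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)

    def perfect_mc(phi, sampler, N):
        """Perfect Monte Carlo estimate of pi(phi) from N i.i.d. samples."""
        x = sampler(N)           # i.i.d. draws X^(1), ..., X^(N) from pi
        return np.mean(phi(x))   # (1/N) sum_i phi(X^(i))

    # Illustrative target: pi = N(0, 1) and phi(x) = x^2, so pi(phi) = 1.
    phi = lambda x: x**2
    sampler = lambda N: rng.standard_normal(N)

    for N in [100, 10_000, 1_000_000]:
        print(N, perfect_mc(phi, sampler, N))

In agreement with the variance formula above, doubling N is expected to halve the variance of the estimator.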
The requirement of perfect Monte Carlo is the ability to obtain i.i.d. samples from π. There are several methods for obtaining i.i.d. samples from distributions; we shall cover the two most common ones in the following.

2.2.1 Inversion sampling

If π is a distribution on R, then its cumulative distribution function can be defined as Fπ : R → [0, 1], Fπ(x) = π((−∞, x]). If it is possible to invert Fπ, then it is possible to sample from π by transforming a uniform sample U distributed over (0, 1) as
\[ X = F_\pi^{-1}(U) := \inf\{x \in \mathcal{X} : F_\pi(x) \ge U\}. \]
This approach was considered by Ulam prior to 1947 [Eckhardt, 1987], and some extensions to the method are provided by Robert and Casella [2004].
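For instance (our illustrative example, not the thesis's), the exponential distribution with rate λ has Fπ(x) = 1 − e^{−λx} for x ≥ 0, so F_π⁻¹(u) = −log(1 − u)/λ, and the transformation can be implemented in a few lines of Python:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_exponential(lam, N):
        """Inversion sampling for Exp(lam): X = F^{-1}(U), U ~ Unif(0,1)."""
        u = rng.uniform(size=N)
        return -np.log1p(-u) / lam   # F^{-1}(u) = -log(1 - u) / lam

    x = sample_exponential(2.0, 100_000)
    print(x.mean())   # should be close to 1/lam = 0.5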
2.2.2 Rejection sampling

Another common method of obtaining i.i.d. samples from π is rejection sampling, which is available when there exists an instrumental distribution µ such that π ≪ µ with bounded Radon-Nikodým derivative dπ/dµ. Rejection sampling was first mentioned in a 1947 letter by von Neumann [Eckhardt, 1987]; it was also presented a few years later in von Neumann [1951]. The method for obtaining one sample from π can be implemented with any M ≥ sup_x (dπ/dµ)(x) by (i) generating X from µ, (ii) accepting it with probability (1/M)(dπ/dµ)(X), and otherwise repeating steps (i) and (ii) until acceptance. Letting A = {U ≤ (1/M)(dπ/dµ)(X)} be the event of acceptance in a single trial, its probability is given by
\[ \mathbb{P}(A) = \mathbb{E}_\mu\left[\frac{1}{M}\frac{d\pi}{d\mu}(X)\right] = \frac{1}{M}\,\mu\!\left(\frac{d\pi}{d\mu}\right) = \frac{1}{M}, \tag{2.2} \]
which is also the long-run proportion of the number of accepted samples over the number of trials. Therefore, taking µ as close to π as possible, to avoid large Radon-Nikodým derivatives, and taking M = sup_x (dπ/dµ)(x) are sensible choices to make the acceptance probability P(A) as high as possible.

Algorithm 2.1. Rejection sampling: Choose M ≥ sup_x (dπ/dµ)(x). To generate a single sample,
1. Generate X ∼ µ and U ∼ Unif(0, 1).
2. If U ≤ (1/M)(dπ/dµ)(X), accept X; else go to 1.

The rejection sampling algorithm is given in Algorithm 2.1. The validity of this algorithm can be verified by considering the distribution of the accepted samples. Using Bayes' theorem,
\[ \mathbb{P}(X \in dx \mid A) = \frac{\mu(dx)\,\mathbb{P}(A \mid x)}{\mathbb{P}(A)} = \mu(dx)\,\frac{1}{M}\frac{d\pi}{d\mu}(x) \Big/ \frac{1}{M} = \pi(dx). \tag{2.3} \]

One advantage of rejection sampling is that we can implement it even when we know π and µ only up to some proportionality constants Z_π and Z_µ, that is, when π = π̂/Z_π, µ = µ̂/Z_µ and we only know π̂ and µ̂. It is easy to check that one can perform steps (i) and (ii) of the rejection sampling method for any M ≥ sup_x (dπ̂/dµ̂)(x) using dπ̂/dµ̂ instead of dπ/dµ, and the justification of this modification follows from steps similar to those in (2.3). Also, in that case, the acceptance probability is (1/M)(Z_π/Z_µ). Finally, when π and µ have densities (also denoted π and µ) with respect to a common dominating measure, the Radon-Nikodým derivative (dπ/dµ)(x) becomes equal to π(x)/µ(x).

The drawback of rejection sampling is that in practice a rejection based procedure is usually not viable when X is high-dimensional, since P(A) gets smaller and more computation is required to evaluate acceptance probabilities as the dimension increases. In the literature there exist approaches to improve the computational efficiency of rejection sampling. For example, assuming the densities exist, when it is difficult to compute π(x), tests like u ≤ (1/M)π(x)/µ(x) can be slow to evaluate. In this case, one may use a squeezing function s : X → [0, ∞) such that s(x)/µ(x) is cheap to evaluate and s(x)/π(x) is tightly bounded from above by 1. For such an s, not only does u ≤ (1/M)s(x)/µ(x) guarantee u ≤ (1/M)π(x)/µ(x), and hence acceptance, but also, if u ≤ (1/M)π(x)/µ(x), then u ≤ (1/M)s(x)/µ(x) holds with high probability. Therefore, in the case of acceptance, evaluation of π(x)/µ(x) is largely avoided by checking u ≤ (1/M)s(x)/µ(x) first. In Marsaglia [1977], the author proposed to squeeze π from above and below by µ and s respectively, where µ is easy to sample from and s is easy to evaluate. There are also adaptive methods to squeeze π from both below and above; they involve an adaptive scheme to gradually modify µ and s from the samples that have already been obtained [Gilks, 1992; Gilks et al., 1995; Gilks and Wild, 1992].
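A minimal Python sketch of Algorithm 2.1, under illustrative choices of ours (not the thesis's): the target is π = N(0, 1) and the proposal is µ = Cauchy(0, 1), for which the density ratio π(x)/µ(x) is bounded with supremum approximately 1.52, so any M ≥ 1.52 is valid.

    import numpy as np

    rng = np.random.default_rng(0)

    def norm_pdf(x):
        return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

    def cauchy_pdf(x):
        return 1.0 / (np.pi * (1.0 + x**2))

    def rejection_sample(N, M=1.6):
        """Algorithm 2.1: target pi = N(0,1), proposal mu = Cauchy(0,1).
        M must satisfy M >= sup_x pi(x)/mu(x), which is about 1.52 here."""
        samples = []
        while len(samples) < N:
            x = rng.standard_cauchy()
            u = rng.uniform()
            if u <= norm_pdf(x) / (M * cauchy_pdf(x)):  # accept w.p. (1/M) dpi/dmu
                samples.append(x)
        return np.array(samples)

    x = rejection_sample(10_000)
    print(x.mean(), x.var())   # close to 0 and 1

In agreement with (2.2), the long-run acceptance rate of this sampler is 1/M = 0.625.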
2.3 Importance sampling

We saw that rejection sampling can be wasteful, as it uses only about 1/M of the generated random samples to construct an approximation to π. In contrast, importance sampling uses every sample, but weights each one according to the degree of similarity between the target and instrumental distributions. The idea of importance sampling follows from the importance sampling fundamental identity [Robert and Casella, 2004]: if there is a probability measure µ such that π ≪ µ with Radon-Nikodým derivative w = dπ/dµ, then we have
\[ \pi(\varphi) = \mu(\varphi w). \]
This identity can be used with a µ which is easy to sample from. Sampling X⁽¹⁾, . . . , X⁽ᴺ⁾ from µ, the integral π(ϕ) = µ(ϕw) can be approximated by using perfect Monte Carlo as
\[ \pi^N_{\mathrm{IS}}(\varphi) := \frac{1}{N}\sum_{i=1}^{N} \varphi(X^{(i)})\,w(X^{(i)}). \tag{2.4} \]

Algorithm 2.2. Importance sampling:
• For i = 1, . . . , N: generate X⁽ⁱ⁾ ∼ µ and calculate w(X⁽ⁱ⁾) = (dπ/dµ)(X⁽ⁱ⁾).
• Set π^N_IS(ϕ) = (1/N) ∑ᵢ₌₁ᴺ w(X⁽ⁱ⁾) ϕ(X⁽ⁱ⁾).

Importance sampling is summarised in Algorithm 2.2. The Radon-Nikodým derivatives w(X⁽ⁱ⁾) are known as the importance sampling weights. Noting its equivalence to perfect Monte Carlo for µ(ϕw), the estimator in (2.4) is unbiased and justified by the strong law of large numbers and the central limit theorem, provided that π(ϕ) and var_µ[w(X)ϕ(X)] are finite. Moreover, as we have freedom to choose µ, we can control the variance of importance sampling [Robert and Casella, 2004]:
\[ \mathrm{var}\left[\pi^N_{\mathrm{IS}}(\varphi)\right] = \frac{1}{N}\mathrm{var}_\mu[w(X)\varphi(X)] = \frac{1}{N}\left(\mu(w^2\varphi^2) - [\mu(w\varphi)]^2\right) = \frac{1}{N}\left(\mu(w^2\varphi^2) - [\pi(\varphi)]^2\right). \]
Therefore, minimising var[π^N_IS(ϕ)] is equivalent to minimising µ(w²ϕ²), which can be lower bounded as
\[ \mu(w^2\varphi^2) \ge [\mu(w|\varphi|)]^2 = [\pi(|\varphi|)]^2 \]
using Jensen's inequality. Considering µ(w²ϕ²) = π(wϕ²), this bound is attainable if we choose µ such that it satisfies
\[ w(x) = \frac{d\pi}{d\mu}(x) = \frac{\pi(|\varphi|)}{|\varphi(x)|}, \quad x \in \mathcal{X},\ \varphi(x) \ne 0. \]
This results in the optimal choice of µ being
\[ \mu(dx) = \pi(dx)\,\frac{|\varphi(x)|}{\pi(|\varphi|)} \]
for points x ∈ X such that ϕ(x) ≠ 0, and the resulting minimum variance is given by
\[ \min_\mu \mathrm{var}\left[\pi^N_{\mathrm{IS}}(\varphi)\right] = \frac{1}{N}\left([\pi(|\varphi|)]^2 - [\pi(\varphi)]^2\right). \]
Note that this minimum value is 0 if ϕ is nonnegative π-almost everywhere. Therefore, importance sampling can in principle achieve a lower variance than perfect Monte Carlo. Of course, if we cannot already compute π(ϕ), it is unlikely that we can compute π(|ϕ|). Also, it will be rare that we can easily simulate from the optimal µ even if we can construct it. Instead, we are guided to seek a µ close to the optimal one, but from which it is easy to sample.
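The following Python sketch implements Algorithm 2.2 for an illustrative choice of target and proposal (ours, not the thesis's): π = N(0, 1), µ = N(0, σ_µ²) with σ_µ = 2, and ϕ(x) = x², so that π(ϕ) = 1.

    import numpy as np

    rng = np.random.default_rng(0)

    def norm_pdf(x, sigma=1.0):
        return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

    def importance_sampling(phi, N, sigma_mu=2.0):
        """Algorithm 2.2: target pi = N(0,1), proposal mu = N(0, sigma_mu^2)."""
        x = rng.normal(scale=sigma_mu, size=N)    # X^(i) ~ mu
        w = norm_pdf(x) / norm_pdf(x, sigma_mu)   # w(X^(i)) = (dpi/dmu)(X^(i))
        return np.mean(w * phi(x))                # unbiased estimate of pi(phi)

    print(importance_sampling(lambda x: x**2, 100_000))   # close to 1.0

Note that the wider proposal keeps the weights bounded; a proposal with lighter tails than the target would instead produce weights of unbounded variance.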
2.3.1 Self-normalised importance sampling

Like rejection sampling, the importance sampling method is available also when π = π̂/Z_π, µ = µ̂/Z_µ and we only have π̂ and µ̂. This time, letting w = dπ̂/dµ̂, we write the importance sampling fundamental identity in terms of π̂ and µ̂ as
\[ \pi(\varphi) = \frac{\mu(\varphi w)}{Z_\pi / Z_\mu} = \frac{\mu(\varphi w)}{\mu(w)}. \]
The importance sampling method can be modified to approximate both the numerator (the unnormalised estimate) and the denominator (the normalisation constant) by perfect Monte Carlo. Sampling X⁽¹⁾, . . . , X⁽ᴺ⁾ from µ, we have the approximation
\[ \pi^N_{\mathrm{IS}}(\varphi) = \frac{\frac{1}{N}\sum_{i=1}^{N} \varphi(X^{(i)})\,w(X^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} w(X^{(i)})} = \sum_{i=1}^{N} W^{(i)}\varphi(X^{(i)}), \]
where W⁽ⁱ⁾ = w(X⁽ⁱ⁾)/∑ⱼ₌₁ᴺ w(X⁽ʲ⁾) are called the normalised importance weights, as they sum to 1.

Being the ratio of two unbiased estimators, the estimator of self-normalised importance sampling is biased for finite N. However, its consistency and stability are provided by a strong law of large numbers and a central limit theorem in Geweke [1989]. In the same work, the variance of the self-normalised importance sampling estimator is analysed and an approximation is provided, which reveals that it can provide lower variance estimates than the unnormalised importance sampling method. Therefore, this method can be preferable to its unnormalised version even when it is not the case that π and µ are known only up to proportionality constants.

Algorithm 2.3. Self-normalised importance sampling:
• For i = 1, . . . , N: generate X⁽ⁱ⁾ ∼ µ and calculate w(X⁽ⁱ⁾) = (dπ̂/dµ̂)(X⁽ⁱ⁾).
• For i = 1, . . . , N: set W⁽ⁱ⁾ = w(X⁽ⁱ⁾)/∑ⱼ₌₁ᴺ w(X⁽ʲ⁾).
• Set π^N_IS(ϕ) = ∑ᵢ₌₁ᴺ W⁽ⁱ⁾ ϕ(X⁽ⁱ⁾).

Self-normalised importance sampling is also called Bayesian importance sampling in Geweke [1989], since in most Bayesian inference problems the normalising constant of the posterior distribution is unknown.

One approximation to the variance of the self-normalised importance sampling estimator is proposed in Kong et al. [1994]:
\[ \mathrm{var}\left[\pi^N_{\mathrm{IS}}(\varphi)\right] \approx \frac{1}{N}\mathrm{var}_\pi[\varphi(X)]\left\{1 + \mathrm{var}_\mu[w(X)]\right\} = \mathrm{var}\left[\pi^N_{\mathrm{MC}}(\varphi)\right]\left\{1 + \mathrm{var}_\mu[w(X)]\right\}. \]
This approximation might be confusing at first, since it suggests that the variance of self-normalised importance sampling is always greater than that of perfect Monte Carlo, which we have just seen is not the case. However, it is useful as it provides an easy way of monitoring the efficiency of the method. Consider the ratio of the variances of the self-normalised importance sampling method with N particles and perfect Monte Carlo with N′ particles, which is given under this approximation by
\[ \frac{\mathrm{var}\left[\pi^N_{\mathrm{IS}}(\varphi)\right]}{\mathrm{var}\left[\pi^{N'}_{\mathrm{MC}}(\varphi)\right]} \approx \frac{N'}{N}\left\{1 + \mathrm{var}_\mu[w(X)]\right\}. \]
The number N′ for which this ratio is 1 suggests how many samples for perfect Monte Carlo would be equivalent to N samples for self-normalised importance sampling. For this reason this number is defined as the effective sample size [Kong et al., 1994; Liu, 1996], and it is given by
\[ N_{\mathrm{eff}} = \frac{N}{1 + \mathrm{var}_\mu[w(X)]}. \]
Obviously, the term var_µ[w(X)] itself is usually estimated using the samples X⁽¹⁾, . . . , X⁽ᴺ⁾ with weights w(X⁽¹⁾), . . . , w(X⁽ᴺ⁾) obtained from the method.
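A Python sketch of Algorithm 2.3 under illustrative choices of ours: the target and proposal densities are supplied only up to their normalising constants, the weights are handled on the log scale for numerical stability, and the effective sample size is computed with the widely used empirical estimate 1/∑ᵢ(W⁽ⁱ⁾)², which approximates the N_eff defined above.

    import numpy as np

    rng = np.random.default_rng(0)

    def snis(phi, log_pi_hat, log_mu_hat, sample_mu, N):
        """Algorithm 2.3 with densities known only up to constants."""
        x = sample_mu(N)
        logw = log_pi_hat(x) - log_mu_hat(x)   # log of unnormalised weights
        W = np.exp(logw - logw.max())
        W /= W.sum()                           # normalised weights W^(i)
        ess = 1.0 / np.sum(W**2)               # empirical effective sample size
        return np.sum(W * phi(x)), ess

    # Illustrative unnormalised target pi_hat = exp(-x^2/2), i.e. N(0,1) up to a
    # constant, and unnormalised proposal mu_hat = exp(-x^2/8), i.e. N(0,4).
    est, ess = snis(lambda x: x**2,
                    log_pi_hat=lambda x: -x**2 / 2,
                    log_mu_hat=lambda x: -x**2 / 8,
                    sample_mu=lambda N: rng.normal(scale=2.0, size=N),
                    N=100_000)
    print(est, ess)   # est close to 1.0; ess < N reflects weight variability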
2.4 Markov chain Monte Carlo

We have already discussed the difficulties of generating a large number of i.i.d. samples from $\pi$. One alternative was importance sampling, which weights every generated sample in order not to waste it, but which has its own drawbacks, mostly due to issues related to controlling the variance. Another alternative is to use Markov chain Monte Carlo (MCMC) methods [Gilks et al., 1996; Hastings, 1970; Metropolis et al., 1953; Robert and Casella, 2004]. These methods are based on the design of a suitable ergodic Markov chain whose stationary distribution is $\pi$. The idea is that if one simulates such a Markov chain, after a long enough time the samples of the Markov chain will be distributed approximately according to $\pi$. Although the samples generated from the Markov chain are not i.i.d., their use is justified by convergence results for dependent random variables in the literature. The first examples of MCMC can be found in Metropolis et al. [1953] and Hastings [1970], and book-length reviews are available in Gilks et al. [1996] and Robert and Casella [2004].

2.4.1 Discrete time Markov chains

In order to adequately summarise the MCMC methodology, we first need to refer to the theory of discrete time Markov chains defined on general state spaces. Discrete time Markov chains also constitute an important part of this thesis. The review made here is limited to the relation of Markov chains to the topics of this thesis; for more details one can see Meyn and Tweedie [2009] or Shiryaev [1995]; introductions more closely related to our area of interest are found in Robert and Casella [2004, Chapter 6], Cappé et al. [2005, Chapter 14], Tierney [1994], and Gilks et al. [1996, Chapter 4].

Definition 2.1 (Markov chain). Consider a sequence of measurable spaces $\{(\mathcal{X}_n, \mathcal{E}_n)\}_{n \geq 1}$, an initial distribution $\eta$, and a sequence of Markov kernels $\{M_n\}_{n \geq 2}$ with each $M_n : \mathcal{X}_{n-1} \to \mathcal{P}(\mathcal{X}_n)$, where $\mathcal{P}(\mathcal{X}_n)$ denotes the set of probability measures on $\mathcal{X}_n$. Then there exists a unique stochastic process $\{X_n\}_{n \geq 1}$ on the canonical space $(\prod_{n=1}^{\infty} \mathcal{X}_n, \otimes_{n=1}^{\infty} \mathcal{E}_n)$ admitting the probability law $\mathbb{P}_{\eta}$, defined from the initial distribution $\eta$ and the Markov kernels $\{M_n\}_{n \geq 2}$ via its finite dimensional distributions
\[
\mathbb{P}_{\eta}(X_1 \in A_1, X_2 \in A_2, \ldots, X_n \in A_n) = \int_{A_1} \int_{A_2} \cdots \int_{A_n} \eta(dx_1) M_2(x_1, dx_2) \cdots M_n(x_{n-1}, dx_n)
\]
for all $n$ and $\mathcal{E}_i$-measurable $A_i$, $i = 1, \ldots, n$.

This is the canonical definition of a Markov chain, and it leads to the defining property of a Markov chain: the current state of the chain at time $n$ depends only on the previous state at time $n-1$. More explicitly, for any $n$ and $\mathcal{E}_n$-measurable set $A$, we have
\[
\mathbb{P}_{\eta}(X_n \in A \mid X_{1:n-1} = x_{1:n-1}) = \mathbb{P}_{\eta}(X_n \in A \mid X_{n-1} = x_{n-1}) = M_n(x_{n-1}, A).
\]
This property is also referred to as the weak Markov property, which can be stated in a more general sense:

Proposition 2.1 (weak Markov property). Given $X_1 = x_1, \ldots, X_m = x_m$, the process $\{X_{m+n}\}_{n \geq 0}$ is a Markov chain, independent of $X_1, \ldots, X_m$, whose probability law is constructed from the initial distribution $\delta_{x_m}$ and the sequence of Markov kernels $\{M_{m+n}\}_{n \geq 1}$ in the same way as in Definition 2.1.

From now on, we will consider time-homogeneous Markov chains, where $(\mathcal{X}_n, \mathcal{E}_n) = (\mathcal{X}, \mathcal{E})$ for all $n \geq 1$ and $M_n = M$ for all $n \geq 2$, and we will denote them as $\mathrm{Markov}(\eta, M)$. Such Markov chains are sufficient for the purposes of MCMC methods and the other methods investigated throughout this thesis. For MCMC, we require the Markov chain to have a unique stationary distribution $\pi$ and to converge to $\pi$. Before that, we need to review some fundamental properties of a discrete time Markov chain in order to understand when stationarity and convergence are ensured.

Irreducibility: Informally, a Markov chain is irreducible if (almost) all its states communicate, that is, the chain travels from any point in $\mathcal{X}$ to any set in $\mathcal{E}$ with positive probability. For discrete $\mathcal{X}$ it is possible to state this as
\[
\forall x, x' \in \mathcal{X}, \ \exists n \geq 1 \ \text{s.t.} \ \mathbb{P}_{\delta_x}(X_n = x') > 0.
\]
For general state-spaces, we need to generalise the concept of irreducibility.

Definition 2.2 ($\phi$-irreducibility). The transition kernel $M$, or the Markov chain $\{X_n\}_{n\geq1}$ with transition kernel $M$, is said to be $\phi$-irreducible if there exists a measure $\phi$ on $(\mathcal{X}, \mathcal{E})$ such that for any $A \in \mathcal{E}$ with $\phi(A) > 0$ we have
\[
\forall x \in \mathcal{X}, \ \exists n \geq 1 \ \text{s.t.} \ \mathbb{P}_{\delta_x}(X_n \in A) > 0.
\]
Such a measure $\phi$ is called an irreducibility measure for $M$.

Recurrence and transience: In the discrete state-space case, we say that a Markov chain is recurrent if each of its states is expected to be visited by the chain infinitely often; otherwise it is transient. In the general state-space case, instead of states we consider accessible sets. A set $A \in \mathcal{E}$ is accessible if $\mathbb{P}_{\delta_x}(X_n \in A \text{ for some } n) > 0$ for all $x \in \mathcal{X}$. It is also useful to consider stronger recurrence properties, expressed in terms of return probabilities rather than the expected number of visits.

Definition 2.3 (recurrence). Let $A$ be a set in $\mathcal{E}$. We say $A$ is recurrent if for all $x \in A$
\[
\mathbb{E}_{x}\left[\sum_{n=1}^{\infty} \mathbb{I}_A(X_n)\right] = \infty.
\]
Moreover, we say $A$ is Harris recurrent if for all $x \in A$
\[
\mathbb{P}_{\delta_x}\left(\sum_{n=1}^{\infty} \mathbb{I}_A(X_n) = \infty\right) = 1.
\]
Finally, we say a $\phi$-irreducible Markov chain is recurrent (Harris recurrent) if every accessible $A \in \mathcal{E}$ is recurrent (Harris recurrent).

Invariant measures: We call a $\sigma$-finite measure $\mu$ $M$-invariant if $\mu = \mu M$. If an $M$-invariant $\mu$ is a probability measure, then $\mu$ is referred to as stationary.
A Markov chain associated with a $\phi$-irreducible $M$ is called positive if there is a probability measure $\mu$ which is $M$-invariant. In order to state the conditions for the existence of a unique invariant probability measure for a Markov chain, we need the definition of a small set.

Definition 2.4 (small set). Let $M$ and $\nu$ be a transition kernel and a probability measure, respectively, on $(\mathcal{X}, \mathcal{E})$, let $m \geq 2$ be an integer and $\epsilon \in (0, 1]$ a constant. A set $C \in \mathcal{E}$ is called an $(m, \epsilon, \nu)$-small set for $M$, or simply a small set, if for all $x \in C$ and $A \in \mathcal{E}$,
\[
\mathbb{P}_{\delta_x}(X_m \in A) \geq \epsilon\, \nu(A).
\]
Trivially, every point in $\mathcal{X}$ is a small set, so in discrete $\mathcal{X}$ every state is a small set. Now we have the following theorem for the existence and uniqueness of an invariant probability measure.

Theorem 2.1. Given a Markov kernel $M$ and a Markov chain associated with $M$, the following hold:
• $M$ is $\phi$-irreducible and recurrent if and only if it admits a unique (up to a multiplicative constant) invariant measure.
• If $M$ admits an accessible small set $C$ such that
\[
\sup_{x \in C} \mathbb{E}_{\mathbb{P}_{\delta_x}}\left[\inf\{n \geq 2 : X_n \in C\}\right] < \infty, \tag{2.5}
\]
then the Markov chain is positive.

Note that while $\phi$-irreducibility and recurrence ensure a unique (up to a multiplicative constant) invariant measure, the existence of an accessible small set satisfying (2.5) is required as well to have an invariant probability measure. In fact, condition (2.5) is equivalent to positive recurrence for Markov chains with a discrete state-space, which is necessary for the existence of a unique stationary distribution.

Reversibility and detailed balance: One useful way to verify the existence of an invariant probability measure for a Markov chain is to check its reversibility, which is a sufficient (but not necessary) condition for the existence of a stationary distribution.

Definition 2.5 (reversibility). Let $M$ be a transition kernel having a stationary distribution $\pi$, and assume the associated Markov chain is started from $\pi$. We say that $M$ is reversible if the reversed process $\{\overline{X}_m = X_{n-m+1}\}_{1 \leq m \leq n}$ is also $\mathrm{Markov}(\pi, M)$ for all $n \geq 1$.

A necessary and sufficient condition for reversibility of $M$ is the detailed balance condition.

Proposition 2.2 (detailed balance). A Markov kernel $M$ is reversible with respect to a probability measure $\pi$ if and only if the following condition, known as the detailed balance condition, holds: for all bounded measurable functions $f$ on $\mathcal{X} \times \mathcal{X}$,
\[
\int_{\mathcal{X}\times\mathcal{X}} f(x, y)\, \pi(dx) M(x, dy) = \int_{\mathcal{X}\times\mathcal{X}} f(x, y)\, \pi(dy) M(y, dx).
\]
In that case, $\pi$ is a stationary distribution for $M$.

Being a sufficient condition for stationarity, the detailed balance condition is quite useful for designing transition kernels for MCMC algorithms.

Ergodicity: We have shown the conditions for a unique stationary distribution of a Markov chain. The first ergodic theorem shows that these conditions are sufficient for establishing a strong law of large numbers.

Theorem 2.2. If $\{X_n\}_{n\geq1}$ is a positive, Harris recurrent Markov chain with invariant distribution $\pi$, then for all $\pi$-integrable functions $\varphi$,
\[
\frac{1}{n} \sum_{i=1}^{n} \varphi(X_i) \xrightarrow{a.s.} \pi(\varphi).
\]
Note that this ergodic theorem is about the convergence of the sample mean; it does not tell us whether the chain will converge to its stationary distribution. For that to happen, the Markov chain is required to be aperiodic, a property which prevents the chain from getting trapped in cycles.
In a discrete state-space, the period of a state is the greatest common divisor of the lengths of all routes of positive probability from the state back to itself; if no state has period greater than one, the chain is said to be aperiodic. In general state-spaces, more care is required to define a cycle. It is a theorem that there exists an accessible $(m, \epsilon, \nu)$-small set $C$ for a $\phi$-irreducible Markov chain, which enables the following definition.

Definition 2.6 (cycle and period). A $\phi$-irreducible Markov chain associated with the Markov kernel $M$ has a cycle of length $d$ if, for some accessible $(m, \epsilon, \nu)$-small set $C$, $d$ is the greatest common divisor of
\[
\{n - 1 : n \geq 2,\ C \text{ is } (n, \epsilon_n, \nu_n)\text{-small for some } \epsilon_n > 0,\ \nu_n \in \mathcal{P}(\mathcal{X})\}.
\]
The period of the Markov chain is the largest possible cycle length $d$ for $M$. When the period is $1$, the chain is called aperiodic.

We are now ready to state our second ergodic theorem, which additionally requires aperiodicity of the Markov chain.

Theorem 2.3. If $\{X_n\}_{n\geq1}$ is a positive and aperiodic Markov chain with stationary distribution $\pi$, then for $\pi$-almost every $x \in \mathcal{X}$,
\[
\sup_{A \in \mathcal{E}} \left| \mathbb{P}_{\delta_x}(X_n \in A) - \pi(A) \right| \to 0. \tag{2.6}
\]
Moreover, if the chain is Harris recurrent with stationary distribution $\pi$, the above holds for every $x \in \mathcal{X}$.

We call a chain ergodic if it satisfies (2.6) for $\pi$-almost all $x \in \mathcal{X}$; if (2.6) is satisfied for all $x \in \mathcal{X}$ then the chain is called uniformly ergodic. Hence we can define ergodicity in terms of the properties of the Markov chain.

Definition 2.7 (ergodic Markov chain). A $\phi$-irreducible Markov chain is called ergodic if it is positive and aperiodic; it is called uniformly ergodic if it is also Harris recurrent.

2.4.2 Metropolis-Hastings

As previously stated, an MCMC method is based on a discrete-time Markov chain which has $\pi$ as its stationary distribution. The most widely used MCMC algorithm to date is the Metropolis-Hastings algorithm [Hastings, 1970; Metropolis et al., 1953]. In this algorithm, given the previous sample $X_{n-1}$, a new value $Y$ for $X_n$ is proposed using an instrumental transition kernel $K : \mathcal{X} \to \mathcal{P}(\mathcal{X})$. We assume for simplicity that the product measure $\pi(dx)K(x, dy)$ has a probability density $q(x, y)$ with respect to a dominating symmetric measure $\zeta(dx, dy)$ (a situation where this is not the case will be visited in Section 2.4.3). The proposed sample $Y$ is accepted with the acceptance probability $\alpha(X_{n-1}, Y)$, where the function $\alpha: \mathcal{X} \times \mathcal{X} \to [0, 1]$ is defined as
\[
\alpha(x, y) = \min\left\{1, \frac{q(y, x)}{q(x, y)}\right\}, \qquad x, y \in \mathcal{X}.
\]

Algorithm 2.4. Metropolis-Hastings: Begin with some $X_1 \in \mathcal{X}$. For $n = 2, 3, \ldots$
• Sample $Y \sim K(X_{n-1}, \cdot)$.
• Set $X_n = Y$ with probability $\alpha(X_{n-1}, Y)$; otherwise set $X_n = X_{n-1}$.

According to Algorithm 2.4, the transition kernel $M$ of the Markov chain from which the samples are obtained is such that, for any bounded measurable function $f$ defined on $\mathcal{X}$,
\[
M(x, f) = \int_{\mathcal{X}} K(x, dy)\alpha(x, y)f(y) + \left[1 - \int_{\mathcal{X}} K(x, dy)\alpha(x, y)\right] f(x),
\]
where we can simplify the expression by substituting $p_r(x) = 1 - \int_{\mathcal{X}} K(x, dy)\alpha(x, y)$, the rejection probability of a sample proposed from $K(x, \cdot)$. We can check the detailed balance condition to see why this Markov chain has $\pi$ as its stationary distribution. For a bounded measurable $f$ on $\mathcal{X} \times \mathcal{X}$, we have
\[
\int_{\mathcal{X}\times\mathcal{X}} \pi(dx)M(x, dy)f(x, y) = \int_{\mathcal{X}\times\mathcal{X}} q(x, y)\alpha(x, y)f(x, y)\,\zeta(dx, dy) + \int_{\mathcal{X}} \pi(dx)\,p_r(x)f(x, x)
\]
\[
= \int_{\mathcal{X}\times\mathcal{X}} \min\{q(x, y), q(y, x)\}f(x, y)\,\zeta(dx, dy) + \pi(p_r g),
\]
where the function $g : \mathcal{X} \to \mathbb{R}$ satisfies $g(x) = f(x, x)$. Since the measure $\zeta(dx, dy)$ and the expression $\min\{q(x, y), q(y, x)\}$ are symmetric in $(x, y)$, we can swap $x$ and $y$ in $f(x, y)$ in the last line, hence in the first line. This establishes the detailed balance condition for $M$ with respect to $\pi$. Note that the existence of a stationary $\pi$ for $M$ ensures the recurrence of $M$, and fortunately it is rare that a recurrent $M$ is not Harris recurrent. There are also various sufficient conditions for the $M$ of the Metropolis-Hastings algorithm to be $\phi$-irreducible and aperiodic. For example, if $K$ is $\pi$-irreducible and $\alpha(x, y) > 0$ for all $x, y \in \mathcal{X}$, then $M$ is $\pi$-irreducible; if $P(X_n = X_{n-1}) > 0$ or $K$ is aperiodic, then $M$ is aperiodic [Roberts and Smith, 1994]. More detailed results on the convergence of Metropolis-Hastings are also available; see e.g. Tierney [1994], Roberts and Tweedie [1996], and Mengersen and Tweedie [1996].

Historically, the original MCMC algorithm was introduced by Metropolis et al. [1953] for the purpose of optimisation on a discrete state-space. This algorithm, called the Metropolis algorithm, used symmetric proposal kernels $K$. The Metropolis algorithm was later generalised by Hastings [1970] to permit continuous state-spaces and asymmetric proposal kernels, preserving the Metropolis algorithm as a special case, and its use for statistical simulation was demonstrated. A more detailed historical survey is provided by Hitchcock [2003].
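As an illustration of Algorithm 2.4, the following is a minimal sketch with a symmetric Gaussian random-walk proposal $K$, in which case the ratio $q(y, x)/q(x, y)$ reduces to the ratio of (possibly unnormalised) target densities. The example target is our own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_hastings(log_target, x0, n_iter, step=1.0):
    """Algorithm 2.4 with a symmetric random-walk proposal, so that
    alpha(x, y) = min{1, pi(y)/pi(x)}. log_target may be unnormalised."""
    x = x0
    chain = np.empty(n_iter)
    for n in range(n_iter):
        y = x + step * rng.normal()               # Y ~ K(x, .)
        log_alpha = log_target(y) - log_target(x) # log acceptance ratio
        if np.log(rng.uniform()) < log_alpha:     # accept with prob. alpha(x, y)
            x = y                                 # X_n = Y
        chain[n] = x                              # else X_n = X_{n-1}
    return chain

# Toy target: unnormalised N(2, 1) density.
chain = metropolis_hastings(lambda x: -0.5 * (x - 2.0) ** 2, x0=0.0, n_iter=20_000)
print(chain[5_000:].mean())  # close to 2 after discarding burn-in
```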
2.4.3 Gibbs sampling

The Gibbs sampler [Gelfand and Smith, 1990; Geman and Geman, 1984] is one of the most popular MCMC methods, and can be used when $\mathcal{X}$ has more than one dimension. If $X$ has $d > 1$ components (of possibly different dimensions), such that $X = (X_1, \ldots, X_d)$, and one can sample from each of the full conditional distributions $\pi_k(\cdot \mid X_{1:k-1}, X_{k+1:d})$, then the Gibbs sampler produces a Markov chain by updating one component at a time using the $\pi_k$'s. One cycle of the Gibbs sampler successively samples from the conditional distributions $\pi_1, \ldots, \pi_d$, conditioning on the most recent samples.

Algorithm 2.5. The Gibbs sampler: Begin with some $X_1 \in \mathcal{X}$. For $n = 2, 3, \ldots$, generate for $k = 1, \ldots, d$
\[
X_{n,k} \sim \pi_k(\cdot \mid X_{n,1:k-1}, X_{n-1,k+1:d}).
\]

For an $x \in \mathcal{X}$, let $x_{-k} = (x_{1:k-1}, x_{k+1:d})$, $k = 1, \ldots, d$, denote the components of $x$ excluding $x_k$, and let us permit ourselves to write $x = (x_k, x_{-k})$. The corresponding MCMC kernel of the Gibbs sampler can be written as $M = M_1 M_2 \cdots M_d$, where each transition kernel $M_k : \mathcal{X} \to \mathcal{P}(\mathcal{X})$, $k = 1, \ldots, d$, can be written as
\[
M_k(x, dy) = \pi_k(dy_k \mid x_{-k})\,\delta_{x_{-k}}(dy_{-k}).
\]
The justification of this transition kernel comes from the reversibility of each $M_k$ with respect to $\pi$, which can be verified from the detailed balance condition as follows. For any bounded measurable function $f$ on $\mathcal{X} \times \mathcal{X}$,
\[
\int \pi(dx)M_k(x, dy)f(x, y) = \int \pi(dx)\,\pi_k(dy_k \mid x_{-k})\,\delta_{x_{-k}}(dy_{-k})\, f(x_k, x_{-k}, y_k, y_{-k})
\]
\[
= \int \pi(dx_{-k})\,\pi_k(dx_k \mid x_{-k})\,\pi_k(dy_k \mid x_{-k})\, f(x_k, x_{-k}, y_k, x_{-k})
\]
\[
= \int \pi(dy)\,\pi_k(dx_k \mid y_{-k})\, f(x_k, y_{-k}, y_k, y_{-k})
\]
\[
= \int \pi(dy)\,\pi_k(dx_k \mid y_{-k})\,\delta_{y_{-k}}(dx_{-k})\, f(x_k, x_{-k}, y_k, y_{-k})
\]
\[
= \int \pi(dy)M_k(y, dx)f(x, y), \tag{2.7}
\]
hence the detailed balance condition for $M_k$ is satisfied with $\pi$. This leads to $\pi M_k = \pi$, hence $\pi M = \pi$, so $\pi$ is indeed stationary for the Gibbs sampler. An insightful interpretation of (2.7) is that each step of a cycle of the Gibbs sampler is a Metropolis-Hastings move whose MCMC kernel $M$ is equal to its proposal kernel $K$, i.e. $\alpha(x, y) = 1$ uniformly.
This also shows that the assumption that $\pi(dx)K(x, dy)$ has a density with respect to a symmetric measure $\zeta(dx, dy)$ is not a necessary condition for the Metropolis-Hastings algorithm. However, reversibility of each $M_k$ with respect to $\pi$ does not suffice to establish proper convergence of the Gibbs sampler, as none of the individual steps produces a $\phi$-irreducible chain. Only the combination of the $d$ moves in the complete cycle has a chance of producing a $\phi$-irreducible chain. We refer to Roberts and Smith [1994] for some simple conditions for convergence of the classical Gibbs sampler. Note, also, that $M$ is not reversible either, although this is not a necessary condition for convergence. A way of guaranteeing both $\phi$-irreducibility and reversibility is to use a mixture of kernels
\[
M_{\beta} = \sum_{k=1}^{d} \beta_k M_k, \qquad \beta_k > 0, \ k = 1, \ldots, d, \qquad \sum_{k=1}^{d} \beta_k = 1,
\]
provided that at least one $M_k$ is irreducible and aperiodic. This choice of kernel leads to the random scan Gibbs sampler. We refer to Tierney [1994], Roberts and Tweedie [1996], and Robert and Casella [2004] for more detailed convergence results pertaining to these variants of the Gibbs sampler.

Having attractive computational properties, the Gibbs sampler is widely used; the requirement for easy-to-sample conditional distributions is its main restriction. Fortunately, though, replacing the exact simulation by a Metropolis-Hastings step in a general MCMC algorithm does not violate its validity, as long as the Metropolis-Hastings step is associated with the correct stationary distribution. The most natural alternative to the Gibbs move in step $k$, when sampling from the full conditional distribution $\pi_k(\cdot \mid x_{-k})$ is not directly feasible, is to use a one-step Metropolis-Hastings move that updates $x_k$ by using a Metropolis-Hastings kernel $M : \mathcal{X} \to \mathcal{P}(\mathcal{X})$ such that $\pi_k(\cdot \mid x_{-k})$ is $M$-invariant [Tierney, 1994].
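To illustrate Algorithm 2.5, here is a small sketch of a Gibbs sampler for a bivariate Gaussian target with correlation $\rho$ and unit marginal variances, chosen because both full conditionals are available in closed form; the specific target is our own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_bivariate_gaussian(rho, n_iter):
    """Algorithm 2.5 for a bivariate standard Gaussian with correlation rho:
    X1 | X2 = x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for X2."""
    x1, x2 = 0.0, 0.0
    chain = np.empty((n_iter, 2))
    sd = np.sqrt(1.0 - rho ** 2)
    for n in range(n_iter):
        x1 = rng.normal(rho * x2, sd)   # X_{n,1} ~ pi_1(. | x2)
        x2 = rng.normal(rho * x1, sd)   # X_{n,2} ~ pi_2(. | most recent x1)
        chain[n] = x1, x2
    return chain

chain = gibbs_bivariate_gaussian(rho=0.9, n_iter=50_000)
print(np.corrcoef(chain[1_000:].T)[0, 1])  # close to 0.9
```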
2.5 Sequential Monte Carlo

Despite their versatility and success, it may be impractical to apply MCMC algorithms to sequential inference problems. This section discusses sequential Monte Carlo (SMC) methods, which provide approximation tools for a sequence of varying distributions. Good tutorials on the subject are available; see for example Doucet et al. [2000b], and Doucet et al. [2001] for a book-length review. Robert and Casella [2004] and Cappé et al. [2005] also contain detailed summaries. Finally, the book Del Moral [2004] treats the subject theoretically in a more general framework, namely that of Feynman-Kac formulae.

2.5.1 Sequential importance sampling

Let $\{X_n\}_{n\geq1}$ be a sequence of random variables where each $X_n$ takes values in some measurable space $(\mathcal{X}_n, \mathcal{E}_n)$. Define the sequence of distributions $\{\pi_n\}_{n\geq1}$ on the product measurable spaces $(\overline{\mathcal{X}}_n = \prod_{i=1}^{n}\mathcal{X}_i, \overline{\mathcal{E}}_n = \otimes_{i=1}^{n}\mathcal{E}_i)$. Also, let $\{\varphi_n\}_{n\geq1}$ be a sequence of functions, where each $\varphi_n : \overline{\mathcal{X}}_n \to \mathbb{R}$ is a $\pi_n$-measurable real-valued function. We are interested in sequential inference, i.e. approximating the following integrals sequentially in $n$:
\[
\pi_n(\varphi_n) = \mathbb{E}_{\pi_n}[\varphi_n(X_{1:n})], \qquad n = 1, 2, \ldots
\]
The first method usually considered an SMC method is sequential importance sampling (SIS), a sequential version of importance sampling. The first uses of SIS can be recognised in works going back to the 1960s and 1970s, such as Mayne [1966], Handschin and Mayne [1969], and Handschin [1970]; see Doucet et al. [2000b] for a general formulation of the method for Bayesian filtering.

Consider the naive importance sampling approach to the sequential problem, where we have a sequence of importance measures $\{q_n\}_{n\geq1}$, each $q_n$ defined on $(\overline{\mathcal{X}}_n, \overline{\mathcal{E}}_n)$ such that $\pi_n \ll q_n$ with Radon-Nikodym derivative $w_n = \frac{d\pi_n}{dq_n}$. It is obvious that we can approximate $\pi_n(\varphi_n)$ by generating samples from $q_n$, independently of the samples generated from $q_1, \ldots, q_{n-1}$, and exploiting the relation $\pi_n(\varphi_n) = q_n(w_n \varphi_n)$. This approach would require the design of a separate $q_n$ and sampling the whole path $X_{1:n}$ at each $n$, which is obviously inefficient. An efficient alternative is SIS, which can be used when it is possible to choose $q_n$ of the form
\[
q_n(dx_{1:n}) = q_1(dx_1) \prod_{i=2}^{n} Q_i(x_{1:i-1}, dx_i), \tag{2.8}
\]
where the $Q_n : \overline{\mathcal{X}}_{n-1} \to \mathcal{P}(\mathcal{X}_n)$ are transition kernels from which it is possible to sample. This selection of $q_n$ leads to the following useful recursion for the importance weights:
\[
w_n(x_{1:n}) = w_{n-1}(x_{1:n-1}) \frac{d\pi_n}{d(\pi_{n-1} \otimes Q_n)}(x_{1:n}). \tag{2.9}
\]
In many applications of (2.9), the Radon-Nikodym derivative $\frac{d\pi_n}{d(\pi_{n-1} \otimes Q_n)}(x_{1:n})$ is a function of $x_{n-1}$ and $x_n$ only. Hence, one can exploit this recursion by sampling only $X_n$ using $Q_n$ at time $n$ and updating the weights with little effort. More explicitly, assume a set of $N > 0$ samples, termed particles, $X^{(i)}_{1:n-1}$ with weights $w^{(i)}_{n-1}$, $i = 1, \ldots, N$, is available at time $n-1$. As far as self-normalised importance sampling is concerned, it is practical to define the weighted empirical distribution
\[
\pi^{N}_{n-1}(dx_{1:n-1}) = \sum_{i=1}^{N} W^{(i)}_{n-1} \delta_{X^{(i)}_{1:n-1}}(dx_{1:n-1}) \tag{2.10}
\]
as an approximation to $\pi_{n-1}$, where $W^{(i)}_{n-1}$, $i = 1, \ldots, N$, are the self-normalised importance weights
\[
W^{(i)}_{n-1} = \frac{w_{n-1}(X^{(i)}_{1:n-1})}{\sum_{j=1}^{N} w_{n-1}(X^{(j)}_{1:n-1})}. \tag{2.11}
\]
The update from $\pi^{N}_{n-1}$ to $\pi^{N}_{n}$ is performed by first sampling $X^{(i)}_n \sim Q_n(X^{(i)}_{1:n-1}, \cdot)$, computing the weights $w_n$ at the points $X^{(i)}_{1:n} = (X^{(i)}_{1:n-1}, X^{(i)}_n)$ using the update rule in (2.9), and finally obtaining the normalised weights $W^{(i)}_n$ using (2.11). An SIS estimate of $\pi_n(\varphi_n)$ is then given by
\[
\pi^{N}_{n}(\varphi_n) = \sum_{i=1}^{N} W^{(i)}_{n} \varphi_n(X^{(i)}_{1:n}).
\]
Being a special case of the importance sampling approximation, this estimate converges almost surely to $\pi_n(\varphi_n)$ for any $n$ (under regularity conditions) as the number of particles tends to infinity; it is also possible to have a central limit theorem for $\pi^{N}_{n}(\varphi_n)$ [Geweke, 1989]. The SIS method is summarised in Algorithm 2.6.

Algorithm 2.6. Sequential importance sampling (SIS): For $n = 1, 2, \ldots$;
• for $i = 1, \ldots, N$,
  – if $n = 1$: sample $X^{(i)}_1 \sim q_1$ and calculate $w_1(X^{(i)}_1) = \frac{d\pi_1}{dq_1}(X^{(i)}_1)$;
  – if $n \geq 2$: sample $X^{(i)}_n \sim Q_n(X^{(i)}_{1:n-1}, \cdot)$, set $X^{(i)}_{1:n} = (X^{(i)}_{1:n-1}, X^{(i)}_n)$, and calculate
\[
w_n(X^{(i)}_{1:n}) = w_{n-1}(X^{(i)}_{1:n-1}) \frac{d\pi_n}{d(\pi_{n-1} \otimes Q_n)}(X^{(i)}_{1:n}).
\]
• for $i = 1, \ldots, N$, calculate
\[
W^{(i)}_{n} = \frac{w_n(X^{(i)}_{1:n})}{\sum_{j=1}^{N} w_n(X^{(j)}_{1:n})}.
\]
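The following is a minimal sketch of Algorithm 2.6 on a toy problem, anticipating the filtering application of Chapter 3: the targets $\pi_n$ are posteriors of an AR(1) state given noisy observations, and $Q_n$ is taken to be the state transition, so the incremental weight reduces to the observation density. The model and all names are our own illustrative choices; the collapse of the effective sample size as $n$ grows foreshadows the weight degeneracy discussed in the next section.

```python
import numpy as np

rng = np.random.default_rng(3)

def sis(y, N, phi):
    """Algorithm 2.6 for a toy AR(1)-plus-noise model:
    X_1 ~ N(0,1), X_n = 0.9 X_{n-1} + N(0,1), Y_n = X_n + N(0,1).
    Q_n is the state transition, so the incremental weight is g(y_n | x_n).
    Weights are carried in log form; no resampling is performed."""
    x = rng.normal(0.0, 1.0, N)                  # X_1^(i) ~ q_1
    logw = -0.5 * (y[0] - x) ** 2                # w_1 propto g(y_1 | x_1)
    est = []
    for yn in y[1:]:
        x = 0.9 * x + rng.normal(0.0, 1.0, N)    # X_n^(i) ~ Q_n
        logw += -0.5 * (yn - x) ** 2             # multiply by incremental weight
        W = np.exp(logw - logw.max())
        W /= W.sum()
        est.append(np.sum(W * phi(x)))           # SIS estimate (phi of x_n only)
    return np.array(est), W

y = rng.normal(size=50)                          # placeholder observations
est, W = sis(y, N=2_000, phi=lambda x: x)
print(1.0 / np.sum(W ** 2))  # N_eff collapses as n grows: weight degeneracy
```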
As in the non-sequential case, it is important to choose $\{q_n\}_{n\geq1}$ such that the variances of $\{\pi^{N}_{n}(\varphi_n)\}_{n\geq1}$ are minimised. Recall that in the SIS algorithm we restrict ourselves to $\{q_n\}_{n\geq1}$ satisfying (2.8); therefore the selection of the optimal proposal distributions suggested in Section 2.3 may not be possible. Instead, a more general criterion for those $\{q_n\}_{n\geq1}$ satisfying (2.8) is to minimise the variance of the incremental importance weights
\[
w_{n|n-1}(x_{1:n}) = \frac{d\pi_n}{d(\pi_{n-1} \otimes Q_n)}(x_{1:n})
\]
conditional upon $x_{1:n-1}$. Note that the objective of minimising the conditional variance of $w_{n|n-1}$ is more general in the sense that it is not specific to $\varphi_n$. It was shown in Doucet [1997] that the kernel $Q^{opt}_n$ minimising this variance is given by
\[
Q^{opt}_n(x_{1:n-1}, dx_n) = \pi_n(dx_n \mid x_{1:n-1}). \tag{2.12}
\]
Before Doucet [1997], the optimum kernel had been used in several works for particular applications; see e.g. Kong et al. [1994], Liu and Chen [1995], and Chen and Liu [1996]. The optimum kernel leads to the optimum incremental weight
\[
w^{opt}_{n|n-1}(x_{1:n-1}) = \frac{d\pi_n}{d\pi_{n-1}}(x_{1:n-1}), \tag{2.13}
\]
which does not depend on the value of $x_n$. This is an interesting observation and it will be revisited in Section 2.5.3.

2.5.2 Sequential importance sampling resampling

The SIS method is an efficient way of implementing importance sampling sequentially. However, unless the proposal distribution is very close to the true distribution, the importance weighting step will, over a number of iterations, lead to a small number of particles with very large weights compared to the rest. This will eventually result in one of the normalised weights being close to $1$ and the others being close to $0$, effectively leading to a particle approximation with a single particle; see Kong et al. [1994] and Doucet et al. [2000b]. This problem is called the weight degeneracy problem.

In order to address the weight degeneracy problem, a resampling step is introduced into the iterations of the SIS method, leading to the sequential importance sampling resampling (SISR) algorithm. Generally, we can describe resampling as a method by which a weighted empirical distribution is replaced with an equally weighted one, where the samples of the equally weighted distribution are drawn from the weighted empirical distribution. Here, resampling is applied to $\pi^{N}_{n-1}$ before proceeding to approximate $\pi_n$. Assume, again, that $\pi_{n-1}$ is approximated with $N$ particles $X^{(1)}_{1:n-1}, \ldots, X^{(N)}_{1:n-1}$ with normalised weights $W^{(i)}_{n-1}$ as in equation (2.10). We draw $N$ independent samples from $\pi^{N}_{n-1}$, namely $\widetilde{X}^{(i)}_{1:n-1}$, $i = 1, \ldots, N$, such that
\[
P\big(\widetilde{X}^{(i)}_{1:n-1} = X^{(j)}_{1:n-1}\big) = W^{(j)}_{n-1}, \qquad i, j = 1, \ldots, N.
\]
Obviously, this corresponds to drawing $N$ independent samples from a multinomial distribution, so this particular resampling scheme is called multinomial resampling. After resampling, for each $i = 1, \ldots, N$ we sample $X^{(i)}_n$ from $Q_n(\widetilde{X}^{(i)}_{1:n-1}, \cdot)$ and weight the particles $X^{(i)}_{1:n} = (\widetilde{X}^{(i)}_{1:n-1}, X^{(i)}_n)$ using
\[
W^{(i)}_{n} \propto \frac{d\pi_n}{d(\pi_{n-1} \otimes Q_n)}(X^{(i)}_{1:n}), \qquad \sum_{i=1}^{N} W^{(i)}_{n} = 1.
\]
The SISR method, also known as the particle filter, is summarised in Algorithm 2.7.

Algorithm 2.7. Sequential importance sampling resampling (SISR): For $n = 1$: for $i = 1, \ldots, N$ sample $X^{(i)}_1 \sim q_1$ and set $W^{(i)}_1 \propto \frac{d\pi_1}{dq_1}(X^{(i)}_1)$. For $n = 2, 3, \ldots$
• Resample $\{X^{(i)}_{1:n-1}\}_{1\leq i\leq N}$ according to the weights $\{W^{(i)}_{n-1}\}_{1\leq i\leq N}$ to get resampled particles $\{\widetilde{X}^{(i)}_{1:n-1}\}_{1\leq i\leq N}$ with weights $1/N$.
• For $i = 1, \ldots, N$: sample $X^{(i)}_n \sim Q_n(\widetilde{X}^{(i)}_{1:n-1}, \cdot)$, set $X^{(i)}_{1:n} = (\widetilde{X}^{(i)}_{1:n-1}, X^{(i)}_n)$, and set
\[
W^{(i)}_{n} \propto \frac{d\pi_n}{d(\pi_{n-1} \otimes Q_n)}(X^{(i)}_{1:n}).
\]
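Continuing the toy AR(1)-plus-noise example, here is a minimal sketch of Algorithm 2.7 with multinomial resampling at every step; again the model and names are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def multinomial_resample(x, W):
    """Draw N indices with P(index = j) = W^(j) and return the resampled,
    equally weighted particles (multinomial resampling)."""
    idx = rng.choice(len(W), size=len(W), p=W)
    return x[idx]

def sisr(y, N):
    """Algorithm 2.7 for the toy AR(1)-plus-noise model used above,
    with the state transition as proposal ('bootstrap' choice)."""
    x = rng.normal(0.0, 1.0, N)                  # X_1^(i) ~ q_1
    means = []
    for yn in y:
        logw = -0.5 * (yn - x) ** 2              # incremental weight g(y_n | x_n)
        W = np.exp(logw - logw.max())
        W /= W.sum()
        means.append(np.sum(W * x))              # estimate at time n
        x = multinomial_resample(x, W)           # X-tilde, weights reset to 1/N
        x = 0.9 * x + rng.normal(0.0, 1.0, N)    # propose X_{n+1} ~ Q_{n+1}
    return np.array(means)

y = rng.normal(size=50)                          # placeholder observations
print(sisr(y, N=2_000)[:5])
```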
The importance of resampling in the context of SMC was first demonstrated by Gordon et al. [1993], based on the ideas of Rubin [1987]. Although the resampling step alleviates the weight degeneracy problem, it has two drawbacks. Firstly, after successive resampling steps, some of the distinct particles for $X_{1:n}$ are dropped in favour of additional copies of highly weighted particles. This leads to the impoverishment of the particles, such that for $k \ll n$ very few particles represent the marginal distribution of $X_k$ under $\pi_n$ [Andrieu et al., 2005; Del Moral and Doucet, 2003; Olsson et al., 2008]. Hence, regardless of the number of particles, $\pi_n(dx_{1:k})$ will eventually be approximated by a single unique particle for all sufficiently large $n$. As a result, any attempt to perform integrations over the path space will suffer from this form of degeneracy, which is called path degeneracy. The second drawback is the extra variance introduced by the resampling step. There are a few ways of reducing the effects of resampling.

• One way is adaptive resampling, i.e. resampling only at iterations where the effective sample size drops below a certain proportion of $N$. For a practical implementation, the effective sample size at time $n$ itself should be estimated from the particles as well. One particle estimate of $N_{\mathrm{eff},n}$ is given in Liu [2001, pp. 35-36]:
\[
\widetilde{N}_{\mathrm{eff},n} = \frac{1}{\sum_{i=1}^{N} \big(W^{(i)}_{n}\big)^2}.
\]

• Another way to reduce the effects of resampling is to use alternatives to multinomial resampling. Let $I_n(i)$ be the number of times the $i$'th particle is drawn from $\pi^{N}_{n}$ in a resampling scheme. A number of resampling methods have been proposed in the literature that satisfy $\mathbb{E}[I_n(i)] = N W^{(i)}_{n}$ but have different $\operatorname{var}[I_n(i)]$. The idea behind $\mathbb{E}[I_n(i)] = N W^{(i)}_{n}$ is that the mean of the particle approximation to $\pi_n(\varphi_n)$ remains the same after resampling. Standard resampling schemes include multinomial resampling [Gordon et al., 1993], residual resampling [Liu and Chen, 1998; Whitley, 1994], stratified resampling [Kitagawa, 1996], and systematic resampling [Carpenter et al., 1999; Whitley, 1994]. There are also some non-standard resampling algorithms in which the particle count varies (randomly) after resampling (e.g. Crisan et al. [1999]; Fearnhead and Liu [2007]), or in which the weights are not constrained to be equal after resampling (e.g. Fearnhead and Clifford [2003]; Fearnhead and Liu [2007]).

• A third way of avoiding path degeneracy is provided by the resample-move algorithm [Gilks and Berzuini, 2001], where each resampled particle $\widetilde{X}^{(i)}_{1:n}$ is moved according to an MCMC kernel $K_n : \overline{\mathcal{X}}_n \to \mathcal{P}(\overline{\mathcal{X}}_n)$ whose invariant distribution is $\pi_n$. In fact, we could have included this MCMC move step in Algorithm 2.7 to make the algorithm more generic. However, the resample-move algorithm is usually most useful as a degeneracy reduction technique in a much more general setting: although possible in principle, it is computationally infeasible to apply a kernel to the path space on which the current particles live, as the state space grows at every iteration of SISR. The resample-move algorithm will be revisited in Section 2.5.4, where it is considered as a special case of a wide class of sequential sampling methods that operate on sequences of arbitrary spaces.

• The final method we mention here for reducing path degeneracy is block sampling [Doucet et al., 2006], where at time $n$ one samples the components $X_{n-L+1:n}$ for some $L > 1$, and the previously sampled values for $X_{n-L+1:n-1}$ are simply discarded. In return for the computational cost introduced by $L$, this procedure reduces the variance of the weights and hence dramatically reduces the number of resampling steps (if an adaptive resampling strategy is used). Therefore, path degeneracy is reduced.

2.5.3 Auxiliary particle filter

Recall that when the optimum proposal $Q^{opt}_n$ is used to sample $x_n$, the corresponding optimum incremental weight $w^{opt}_{n|n-1}$ does not depend on the value of $x_n$. Therefore, the optimum incremental weight indicates which particles are likely to represent $\pi_n$ better even before the new state $x_n$ is proposed.
This observation encourages a sequential sampling strategy in which the optimum incremental weights are involved in deciding which particles the algorithm carries forward to the next time step; this is the strategy on which the auxiliary particle filter [Pitt and Shephard, 1999] is based. To understand how this strategy can be implemented, it is useful to see how the target distributions at each iteration are modified by the resampling step in the SISR algorithm. One can show that, given $\pi^{N}_{n-1}$ in (2.10) as the SISR approximation to $\pi_{n-1}$, SISR targets the following distribution at time $n$ (provided that the resampling step is performed):
\[
\bar{\pi}_n(dx_{1:n}) \propto \left[\sum_{i=1}^{N} W^{(i)}_{n-1}\, w^{opt}_{n|n-1}(X^{(i)}_{1:n-1})\, \delta_{X^{(i)}_{1:n-1}}(dx_{1:n-1})\right] Q^{opt}_n(x_{1:n-1}, dx_n). \tag{2.14}
\]
In the standard SISR algorithm, the following proposal distribution is used to implement importance sampling at time $n$:
\[
\bar{q}_n(dx_{1:n}) = \underbrace{\left[\sum_{i=1}^{N} W^{(i)}_{n-1} \delta_{X^{(i)}_{1:n-1}}(dx_{1:n-1})\right]}_{\text{resampling } X_{1:n-1}} \underbrace{Q_n(x_{1:n-1}, dx_n)}_{\text{proposing } X_n},
\]
which does not fully exploit the structure in (2.14). As a result, we have a well-known drawback of SISR: if $\pi_n$ varies significantly compared to $\pi_{n-1}$, the variance of the weights can be quite high. This results in an inefficient algorithm, and a large number of particles may be required for recovery. Provided that one can calculate $w^{opt}_{n|n-1}(x_{1:n-1})$, a more sensible choice for $\bar{q}_n(dx_{1:n})$ could be
\[
\bar{q}^{opt}_n(dx_{1:n}) = \left[\sum_{i=1}^{N} \overline{W}^{(i)}_{n-1} \delta_{X^{(i)}_{1:n-1}}(dx_{1:n-1})\right] Q_n(x_{1:n-1}, dx_n), \tag{2.15}
\]
where $\overline{W}^{(i)}_{n-1} \propto W^{(i)}_{n-1} w^{opt}_{n|n-1}(X^{(i)}_{1:n-1})$ such that $\sum_{i=1}^{N} \overline{W}^{(i)}_{n-1} = 1$. Then the importance weight for a particle $X^{(i)}_{1:n} = (X^{(j)}_{1:n-1}, X^{(i)}_n)$ would be
\[
W^{(i)}_{n} \propto \frac{d\bar{\pi}_n}{d\bar{q}^{opt}_n}(X^{(i)}_{1:n}) = \frac{dQ^{opt}_n(X^{(j)}_{1:n-1}, \cdot)}{dQ_n(X^{(j)}_{1:n-1}, \cdot)}(X^{(i)}_n).
\]
This type of particle filter is called an auxiliary particle filter in the literature. The term 'auxiliary' comes from treating $X_{1:n-1}$ at time $n$ as auxiliary: in many cases where a particle filter is used, the integration of functions on $\mathcal{X}_n$ with respect to the marginal distribution $\pi_n(dx_n)$ is the main interest, and resampling $X_{1:n-1}$ in this particular way improves the Monte Carlo approximation of such integrations. One remarkable point here is that if one can use $Q_n = Q^{opt}_n$, then all the particles have equal weights. This shows how this sampling scheme can effectively reduce weight degeneracy. (Notice also that $Q_n = Q^{opt}_n$ results in the regular SISR with the optimum proposal, where the sampling and resampling steps are interchanged.) However, it may not be possible (or straightforward) to sample from $Q^{opt}_n$ or to calculate $w^{opt}_{n|n-1}(x_{1:n-1})$. This does not restrict the use of the idea behind the auxiliary particle filter, though. In fact, the auxiliary particle filter is more general: we can perform importance sampling for $\bar{\pi}_n$ by constructing a $\bar{q}_n$ which can be written generically as
\[
\bar{q}^{aux}_n(dx_{1:n}) = \sum_{i=1}^{N} \alpha^{(i)}_{n-1} \delta_{X^{(i)}_{1:n-1}}(dx_{1:n-1})\, Q_n(x_{1:n-1}, dx_n).
\]
We have complete control over $\alpha_{n-1}$ and $Q_n$; the idea, however, is to resample those particles $X^{(i)}_{1:n-1}$ which represent $\pi_n(dx_{1:n-1})$ better, and to sample $X_n$ approximately from the optimal proposal distribution in order to have weights with low variance. Therefore, the rule of thumb is to make $\alpha^{(i)}_{n-1}$ and $Q_n$ as close as possible to $\overline{W}^{(i)}_{n-1}$ and $Q^{opt}_n$. Indeed, the authors of Andrieu et al. [2001] propose an improved auxiliary particle filter scheme, where (2.15) or a suitable approximation to (2.15) is suggested to be used.
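Below is a sketch of the fully adapted case $Q_n = Q^{opt}_n$ for the toy AR(1)-plus-noise model used earlier, artificially assuming an initial state $X_0 \sim \mathcal{N}(0, 1)$ one step before the first observation so that every step has the same form. In this model both the predictive density and the optimal proposal are Gaussian and available in closed form, so all second-stage weights are equal; the model and names are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)

def fully_adapted_apf(y, N):
    """Auxiliary particle filter for X_n = 0.9 X_{n-1} + N(0,1),
    Y_n = X_n + N(0,1), X_0 ~ N(0,1). Here the predictive likelihood is
    p(y_n | x_{n-1}) = N(y_n; 0.9 x_{n-1}, 2) and the optimal proposal is
    Q_opt(x_{n-1}, dx_n) = N((0.9 x_{n-1} + y_n)/2, 1/2)."""
    x = rng.normal(0.0, 1.0, N)                   # X_0^(i)
    means = []
    for yn in y:
        # first-stage weights W_{n-1} * w_opt(x_{n-1}), with W_{n-1} = 1/N
        logv = -0.25 * (yn - 0.9 * x) ** 2        # log N(y_n; 0.9x, 2) + const
        v = np.exp(logv - logv.max())
        v /= v.sum()
        x = x[rng.choice(N, size=N, p=v)]         # resample on first-stage weights
        # propose from the optimal kernel; second-stage weights are uniform
        x = 0.5 * (0.9 * x + yn) + np.sqrt(0.5) * rng.normal(size=N)
        means.append(x.mean())                    # filtering estimate at time n
    return np.array(means)

y = rng.normal(size=50)                           # placeholder observations
print(fully_adapted_apf(y, N=2_000)[:5])
```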
2.5.4 Sequential Monte Carlo samplers

Sequential Monte Carlo samplers [Del Moral et al., 2006] cover a very large class of SMC methods. Assume that we have a sequence of somehow related distributions $\pi_1, \ldots, \pi_p$, where each $\pi_n$ is defined on an arbitrary measurable space $(\mathcal{X}_n, \mathcal{E}_n)$. There are many potential choices of $\pi_1, \ldots, \pi_p$, leading to various integration and optimisation algorithms; examples can be found in Chopin [2002] for static parameter estimation, Gelman and Meng [1998] and Neal [2001] for targeting a distribution through a sequence of intermediate distributions, Del Moral et al. [2006] for global optimisation, Johansen et al. [2005] and Del Moral et al. [2006] for rare event simulation and density estimation, and Del Moral et al. [2012] for approximate Bayesian computation. The problem of approximating these distributions sequentially using Monte Carlo is beyond the scope of the classical SIS or SISR methods, since those require the distributions to be defined on spaces of increasing dimension.

The first approach that comes to mind is to treat each $\pi_n$ individually and perform importance sampling for each of them independently. Obviously, this approach inherits the difficulties of importance sampling: unless the distribution of interest is a standard low-dimensional one, importance sampling is almost never used when there are alternatives, the main reason being the difficulty of designing a good proposal. A more reasonable way is to perform importance sampling for each $\pi_n$ individually, but this time designing the importance distributions sequentially, using an initial distribution $\eta_1$ and a sequence of transition kernels $\{K_n : \mathcal{X}_{n-1} \to \mathcal{P}(\mathcal{X}_n)\}_{n\geq2}$. The idea here is that if the distributions $\pi_n$ vary slowly in $n$, then it is possible to obtain samples approximating $\pi_n$ effectively by using $K_n$ to slowly move the samples obtained for approximating $\pi_{n-1}$. Let us assume that we begin by sampling $X^{(1)}_1, \ldots, X^{(N)}_1$ from $\eta_1$ to approximate $\pi_1$. At times $n \geq 2$, we sample $X^{(i)}_n$ from $K_n(X^{(i)}_{n-1}, \cdot)$. The importance weight of $X^{(i)}_n$ is given by
\[
w^{(i)}_n = \frac{d\pi_n}{d\eta_n}(X^{(i)}_n), \qquad \eta_n = \eta_{n-1} K_n.
\]
The choice of the $K_n$'s is free, except for the requirement that $\pi_n \ll \eta_{n-1}K_n$; however, it is crucial for the performance of this method. In the literature, several different types of moves have been used, such as independent proposals [West, 1993], local random moves [Givens and Raftery, 1996], and MCMC and Gibbs moves [Del Moral et al., 2006]. This sequential implementation of the importance sampling approach is attractive and optimal in some sense (we shall soon see in what sense); however, it has a quite restrictive limitation: in most cases it is impossible to calculate the importance distribution $\eta_n$. SMC samplers come into play at this point, circumventing the need to calculate $\eta_n$. The main idea of the method is to construct synthetic distributions $\widetilde{\pi}_n$ on the extended spaces $(\mathcal{X}_1 \times \cdots \times \mathcal{X}_n, \mathcal{E}_1 \otimes \cdots \otimes \mathcal{E}_n)$ as
\[
\widetilde{\pi}_n(dx_{1:n}) = \pi_n(dx_n) \prod_{i=1}^{n-1} L_i(x_{i+1}, dx_i), \tag{2.16}
\]
where each $L_n : \mathcal{X}_{n+1} \to \mathcal{P}(\mathcal{X}_n)$ is a backward Markov kernel. Since $\widetilde{\pi}_n$ admits $\pi_n$ as a marginal by construction, importance sampling on $\widetilde{\pi}_n$ using the proposal distribution
\[
\widetilde{\eta}_n(dx_{1:n}) = \eta_1(dx_1) \prod_{i=2}^{n} K_i(x_{i-1}, dx_i)
\]
can provide an approximation for $\pi_n$ as well. Although the freedom to choose the $K_n$'s and $L_n$'s contributes to the method's generality, the performance of the method crucially depends on their choice.
In fact, the central limit theorem presented in Del Moral et al. [2006] demonstrates that the variance of the estimator is strongly dependent upon the choice of these kernels. The importance weight for this method is given by
\[
w_n(x_{1:n}) = \frac{d\widetilde{\pi}_n}{d\widetilde{\eta}_n}(x_{1:n}).
\]
It was shown in Del Moral et al. [2006] that, given $K_n$, the optimum backward kernel $L^{opt}_{n-1}$ which minimises the variance of the importance weights satisfies the relation
\[
\eta_n \otimes L^{opt}_{n-1} = \eta_{n-1} \otimes K_n.
\]
It can be shown that the importance weight for the optimum backward kernel is
\[
w^{opt}_n(x_{1:n}) = \frac{d\pi_n}{d\eta_n}(x_n).
\]
This result reveals that the optimum backward kernel takes us back to the case where one performs importance sampling on the marginal space instead of the extended one. However, most of the time $\eta_n$ cannot be calculated, hence other, sub-optimal, backward kernels must be used. It was shown in Del Moral et al. [2006] that when $L^{opt}_{n-1}$ is not used, the variance of $w_n(x_{1:n})$ cannot be stabilised. For that reason, resampling of the samples used for approximating $\pi_{n-1}$ is necessary before moving on to the approximation of $\pi_n$. This is possible thanks to the option of constructing $\widetilde{\pi}_n$ such that the importance weights can be expressed as a product of incremental weights. Assume that $\pi_n \ll K_n(x_{n-1}, \cdot)$ and $L_{n-1}(x_n, \cdot) \ll \pi_{n-1}$ for all $n$ and all relevant points. Then it can be shown that for a bounded measurable function $\varphi_n$ on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ we have $\widetilde{\pi}_n(\varphi_n) = \widetilde{\eta}_n(\varphi_n w_n)$, where the importance weights $w_n$ are given by
\[
w_n(x_{1:n}) = \frac{d\pi_1}{d\eta_1}(x_1) \prod_{i=2}^{n} \frac{dL_{i-1}(x_i, \cdot)}{d\pi_{i-1}}(x_{i-1})\, \frac{d\pi_i}{dK_i(x_{i-1}, \cdot)}(x_i). \tag{2.17}
\]
Equation (2.17) admits a recursion in $n$, namely $w_n(x_{1:n}) = w_{n|n-1}(x_{n-1}, x_n)\, w_{n-1}(x_{1:n-1})$, where the incremental weight $w_{n|n-1}(x_{n-1}, x_n)$ is given by
\[
w_{n|n-1}(x_{n-1}, x_n) = \frac{d\pi_n}{dK_n(x_{n-1}, \cdot)}(x_n)\, \frac{dL_{n-1}(x_n, \cdot)}{d\pi_{n-1}}(x_{n-1}). \tag{2.18}
\]
Note that the recursive form of the weights enables us to implement an SMC method for the synthetic distributions $\widetilde{\pi}_n$. In fact, when (2.17) exists, the SMC sampler for $\pi_1, \ldots, \pi_p$ is the SISR algorithm targeting $\widetilde{\pi}_1, \ldots, \widetilde{\pi}_p$ using the initial and transition proposal distributions $\eta_1$ and $K_n$, $n = 2, \ldots, p$, respectively, with the incremental weights given in (2.18). Note that, in practice, even if $L_n$ is not absolutely continuous with respect to $\pi_n$, we can still obtain importance weights factorised into incremental weights by taking the restrictions of the $L_n$'s to the supports of the $\pi_n$'s. Note, also, that as in importance sampling, SIS, and SISR, even if we know the $\widetilde{\pi}_n$'s and $\widetilde{\eta}_n$'s only up to normalising constants, we can still run the SMC sampler algorithm to approximate the integrals $\pi_n(\varphi_n)$ and to estimate the unknown normalising constants as well.

SMC samplers generalise many earlier works in the literature. For example, the annealed importance sampling method, proposed by Neal [2001] for sequences of slowly varying distributions, corresponds to the SMC sampler without resampling where $L_{n-1}$ satisfies
\[
\pi_{n-1}K_n \otimes L_{n-1} = \pi_{n-1} \otimes K_n \tag{2.19}
\]
and $K_n$ is such that $\pi_{n-1}$ is $K_n$-invariant. To deal with the variance problem in general cases, the equivalent choice of kernels is used in (among others) Chopin [2002] and Gilks and Berzuini [2001] with resample-move strategies, which corresponds to the SMC sampler algorithm with resampling. Population Monte Carlo, presented by Cappé et al. [2004] and extended by Celeux et al. [2006], is another special case of SMC samplers, where the authors consider the homogeneous case $\pi_n = \pi$ with $L_n(x, dx') = \pi(dx')$ and $K_n(x, dx') = K_n(dx')$. Finally, Liang [2002] presents a related algorithm where $\pi_n = \pi$ and $K_n(x, dx') = L_n(x, dx') = K(x, dx')$.
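As a concrete instance, here is a sketch of a tempered SMC sampler, essentially annealed importance sampling with resampling and Metropolis moves: $K_k$ is a random-walk Metropolis kernel leaving the current tempered target invariant, and the backward kernel is the time-reversal choice related to (2.19), so the incremental weight reduces to the ratio of successive tempered densities. The target, the temperature schedule, and all numerical values are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)

def smc_sampler(log_gamma, betas, N, mh_steps=5, step=0.5):
    """Tempered SMC sampler: pi_k(x) propto N(x; 0, 1) exp(beta_k log_gamma(x)),
    incremental weight exp((beta_k - beta_{k-1}) log_gamma(x)), followed by
    multinomial resampling and MH moves leaving pi_k invariant."""
    x = rng.normal(size=N)                        # X^(i) ~ eta_1 = N(0, 1)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        logw = (b - b_prev) * log_gamma(x)        # incremental weights
        W = np.exp(logw - logw.max())
        W /= W.sum()
        x = x[rng.choice(N, size=N, p=W)]         # resample

        def log_pi(z):                            # current tempered target
            return -0.5 * z ** 2 + b * log_gamma(z)

        for _ in range(mh_steps):                 # MH moves, pi_k-invariant
            y = x + step * rng.normal(size=N)
            accept = np.log(rng.uniform(size=N)) < log_pi(y) - log_pi(x)
            x = np.where(accept, y, x)
    return x

# Toy 'log-likelihood' pulling the N(0,1) prior towards 3; the final target
# is Gaussian with precision 5 and mean 2.4.
samples = smc_sampler(lambda x: -2.0 * (x - 3.0) ** 2, np.linspace(0, 1, 21), N=2_000)
print(samples.mean())  # close to 2.4
```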
2.6 Approximate Bayesian computation

Assume that we have a random variable of interest $X$, taking values in $\mathcal{X}$. Its probability distribution $\pi(dx)$ has a density on $\mathcal{X}$ with respect to a dominating measure $dx$, which is abusively denoted as $\pi(x)$.¹ The value of $X$, denoted by $x$, is observed indirectly through an observation process generating values $Y \in \mathcal{Y}$ according to a conditional observation distribution which also has a density on $\mathcal{Y}$ with respect to $dy$, denoted $g(y|x)$. The density $g(y|x)$ is also called the likelihood. The posterior distribution of $X$ given $Y = y$ has the following density, given by Bayes' theorem:
\[
\pi(x|y) = \frac{\pi(x)\, g(y|x)}{\int_{\mathcal{X}} \pi(x')\, g(y|x')\, dx'}.
\]
Approximate Bayesian computation (ABC) deals with the problem of Monte Carlo approximation of $\pi(x|y)$ when the likelihood $g(y|x)$ is intractable. By intractability it is meant either that the density does not have a closed form expression or that it is prohibitively expensive to calculate. ABC methods try to approximate $\pi(x|y)$ while circumventing the calculation of $g(y|x)$, and for this reason they are also known as likelihood-free methods. The main idea behind ABC is to simulate from the observation process and accept simulated samples provided that they are close to the observed value $y$ in some sense. ABC methods have emerged in the past ten years as one of the most satisfactory approaches to intractable likelihood problems. This section is a brief and limited review of the main contributions to the ABC methodology; for a more detailed recent review, one can see Marin et al. [2011].

¹It is simpler to describe the methodology in this section using densities instead of measures.

The core idea of ABC was first mentioned in Rubin [1984], but the first ABC method was proposed by Tavaré et al. [1997] as a special case of rejection sampling for discrete $\mathcal{Y}$. It proposes to sample $(x, u)$ from $\pi(x)g(u|x)$ and consider only those samples for which $u = y$. It is not difficult to show that if the accepted samples are $(X^{(1)}, y), \ldots, (X^{(N)}, y)$, then $X^{(1)}, \ldots, X^{(N)}$ are samples from the posterior $\pi(x|y)$. Note that this is a rejection sampling method for the distribution $\pi(x, u|y)$ on $\mathcal{X} \times \mathcal{Y}$, given by
\[
\pi(x, u|y) \propto \pi(x)\, g(u|x)\, \mathbb{I}_{y}(u), \tag{2.20}
\]
and when this density is integrated over $u$, we end up with $\pi(x|y)$. This method is exact in the sense that the obtained samples for $X$ are drawn from $\pi(x|y)$. However, $\mathbb{I}_y(u)$ would obviously not work for continuous $\mathcal{Y}$, since the probability of hitting $\{y\}$ is zero. The first genuine ABC method, proposed by Pritchard et al. [1999], also a rejection sampler, relaxes $\mathbb{I}_y(u)$ and replaces (2.20) with
\[
\pi^{\epsilon}(x, u|y) \propto \pi(x)\, g(u|x)\, \mathbb{I}_{A^{\epsilon}_{y}}(u), \tag{2.21}
\]
where $A^{\epsilon}_{y}$, called the ABC set, is defined based on a summary statistic $s : \mathcal{Y} \to \mathbb{R}^{d_s}$ and a distance metric $\rho : \mathbb{R}^{d_s} \times \mathbb{R}^{d_s} \to \mathbb{R}$ as
\[
A^{\epsilon}_{y} = \{u \in \mathcal{Y} : \rho[s(u), s(y)] < \epsilon\}. \tag{2.22}
\]
If $s$ is sufficient with respect to $x$, one can show that, as $\epsilon$ tends to zero, the marginal of the density $\pi^{\epsilon}(x, u|y)$ with respect to $x$ converges to the posterior $\pi(x|y)$. In most cases sufficient statistics are not available, hence the choice of summary statistics is of great importance. The ABC literature is rich in papers discussing the selection of these summary statistics; see Fearnhead and Prangle [2012] for an example.
The method of Pritchard et al. [1999] is thus a rejection sampling method targeting $\pi^{\epsilon}(x, u|y)$: generate $(X, U)$ from $\pi(x)g(u|x)$ and keep only those samples for which $U \in A^{\epsilon}_{y}$. This method is summarised in Algorithm 2.8.

Algorithm 2.8. Rejection sampling for ABC: To generate a single sample from $\pi^{\epsilon}(x, u|y)$,
1. Generate $(X, U) \sim \pi(x)g(u|x)$.
2. If $U \in A^{\epsilon}_{y}$, accept $(X, U)$; else go to 1.

Using simulations from the prior distribution $\pi(x)$ can be inefficient, since this takes neither the data nor the previously accepted samples into account when proposing a new $x$, and thus fails to propose values located in high posterior probability regions. To overcome this impracticality of rejection sampling, an MCMC-based ABC method was developed by Marjoram et al. [2003]. The method is simply an MCMC algorithm targeting $\pi^{\epsilon}(x, u|y)$ which uses an instrumental kernel with density $q(x'|x)g(u'|x')$ to move the samples $(x, u)$, and takes either $(x', u')$ or $(x, u)$ as the next sample according to the corresponding acceptance probability
\[
\alpha(x, u; x', u') = \min\left\{1, \frac{q(x|x')\,\pi(x')\,\mathbb{I}_{A^{\epsilon}_{y}}(u')}{q(x'|x)\,\pi(x)}\right\}, \qquad x, x' \in \mathcal{X},\ u, u' \in \mathcal{Y}.
\]
The MCMC-ABC method is given in Algorithm 2.9.

Algorithm 2.9. MCMC for ABC: Begin with some $(X_1, U_1) \in \mathcal{X} \times \mathcal{Y}$. For $n = 2, 3, \ldots$
• Generate $(X', U') \sim q(x'|X_{n-1})g(u'|x')$.
• Set $(X_n, U_n) = (X', U')$ with probability $\alpha(X_{n-1}, U_{n-1}; X', U')$; otherwise set $(X_n, U_n) = (X_{n-1}, U_{n-1})$.

It is useful to interpret the ABC posterior entirely in terms of densities: we can regard $\mathbb{I}_{A^{\epsilon}_{y}}(u)$ as being proportional to the density of a conditional distribution of $Y$ given $U = u$, say $\kappa^{\epsilon}(y|u)$. Then we can rewrite (2.21) as
\[
\pi^{\epsilon}(x, u|y) = \frac{\pi(x)\, g(u|x)\, \kappa^{\epsilon}(y|u)}{\int_{\mathcal{X}\times\mathcal{Y}} \pi(x')\, g(u'|x')\, \kappa^{\epsilon}(y|u')\, dx'\, du'}.
\]
A useful generalisation of $\pi^{\epsilon}(x, u|y)$ is obtained by taking $\kappa^{\epsilon}(y|u)$ to be some normalised kernel with bandwidth $\epsilon$ centred at $u$. In many applications it is sometimes practical to take $\kappa^{\epsilon}$ proportional not to an indicator function but to a smooth kernel, such as a Gaussian kernel, to make calculations tractable or to avoid computational waste due to rejections. We will see a use of a smooth kernel in Chapter 6. Another use of kernels in defining the ABC posterior is to be able to express the difference between the ABC posterior and the true posterior in terms of model error. Note that ABC suffers from model discrepancy, since it corresponds to performing Bayesian inference for the case where the observation $Y$ has the conditional probability density not $g(y|x)$ but
\[
g^{\epsilon}(y|x) = \int_{\mathcal{Y}} g(u|x)\, \kappa^{\epsilon}(y|u)\, du.
\]
Therefore, we say that the ABC posterior is not 'calibrated'. A way of rephrasing this is that if the model included an error term, characterised by $\kappa^{\epsilon}$, then ABC would target the true posterior, hence the ABC posterior would be 'calibrated' [Wilkinson, 2008]. This leads to the method of noisy ABC [Dean et al., 2011; Fearnhead and Prangle, 2012], which adds noise to the (summary statistic of the) data itself to obtain $y^{\epsilon} \sim \kappa^{\epsilon}(\cdot|y)$, and then performs ABC for the modified data by targeting $\pi^{\epsilon}(x, u|y^{\epsilon})$, which is calibrated.
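A minimal sketch of Algorithm 2.8 on a toy model where the exact posterior is known, so the quality of the ABC approximation can be checked; all function names and the model itself are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)

def abc_rejection(y_obs, prior_sampler, simulate, distance, eps, n_accept):
    """Algorithm 2.8: keep prior draws whose pseudo-data fall in the ABC set.
    Here s is the identity summary statistic and rho an absolute distance."""
    accepted = []
    while len(accepted) < n_accept:
        x = prior_sampler()                 # X ~ pi(x)
        u = simulate(x)                     # U ~ g(u | x)
        if distance(u, y_obs) < eps:        # U in A_eps_y: accept
            accepted.append(x)
    return np.array(accepted)

# Toy example: X ~ N(0, 1), Y | X = x ~ N(x, 1), observed y = 1.5.
samples = abc_rejection(
    y_obs=1.5,
    prior_sampler=lambda: rng.normal(),
    simulate=lambda x: x + rng.normal(),
    distance=lambda u, y: abs(u - y),
    eps=0.1,
    n_accept=2_000,
)
print(samples.mean())  # close to the exact posterior mean y/2 = 0.75
```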
Beyond approximating the posterior distribution of a single random variable, the ABC approach also extends to sequential inference. Jasra et al. [2012] propose an ABC implementation scheme for approximating densities like $\pi_n(x_{1:n}|y_{1:n})$ in hidden Markov models (HMMs), where $\{X_t\}_{t\geq1}$ is a Markov process and the distribution of the observables $\{Y_t\}_{t\geq1}$ conditioned on the hidden process is intractable. Their approach is related to the convolution particle filter of Campillo and Rossi [2009]. Dean et al. [2011] discuss the ABC implementation for HMMs further and show that the model for which noisy ABC is exact is itself an HMM; therefore they conclude that noisy ABC can be implemented for HMMs. We will see the sequential implementation of ABC, as well as its use for static parameter estimation, in more detail in Chapter 6.

While decreasing the value of $\epsilon$ obviously brings $\pi^{\epsilon}(x, u|y)$ closer to the true posterior, the variance of its Monte Carlo approximation then becomes a more crucial issue. Therefore, it is important to keep the variance of the approximation at a reasonable level while making $\epsilon$ sufficiently small. For this reason, SMC samplers are used to approximate a sequence of distributions $\{\pi_k(x, u|y) = \pi^{\epsilon_k}(x, u|y)\}_{1\leq k\leq p}$, where $\epsilon_1 > \cdots > \epsilon_p = \epsilon$ and the difference between successive $\epsilon_k$'s is small enough to make the successive $\pi_k(x, u|y)$ vary sufficiently slowly. SMC samplers were used in the ABC context in Sisson et al. [2007], and the method there was improved in Beaumont et al. [2009], Toni et al. [2009], and Sisson et al. [2009]. Del Moral et al. [2012] showed the relation of these works to SMC samplers explicitly. Other novelties of Del Moral et al. [2012] are that the authors rely on $M$ repeated simulations of the pseudo-data $u$, benefiting from the variance reduction of Monte Carlo averaging, and that they propose a scheme for the adaptive selection of the sequence of tolerance levels $\{\epsilon_k\}_{1\leq k\leq p}$. The forward kernel at step $k$ is chosen to leave $\pi_{k-1}$ invariant, and the backward kernel is chosen to satisfy (2.19).

Other than inference of hidden variables, there is also a considerable body of work on model selection using ABC methods. We will not review these methods, as they are not of particular interest for this thesis; the interested reader may see Marin et al. [2011] for a review and the references therein for details.

Chapter 3

Hidden Markov Models and Parameter Estimation

Summary: This chapter contains the second half of the literature survey. The main purpose of this chapter is to introduce hidden Markov models (HMMs), which are also known as general state-space models, and review their use in the literature as a powerful framework for filtering and parameter estimation.

3.1 Introduction

HMMs arguably constitute the widest class of time series models used for modelling stochastic dynamic systems. In Section 3.2, we will introduce HMMs using a formulation that is appropriate for filtering and parameter estimation problems. We will restrict ourselves to discrete time homogeneous HMMs whose dynamics for the hidden states and observables admit conditional probability densities parametrised by vector valued static parameters. However, this is our only restriction; we keep our framework general enough to cover models with non-linear non-Gaussian dynamics.

One of the main problems dealt with within the framework of HMMs is optimal Bayesian filtering, which has many applications in signal processing and related areas such as speech processing [Rabiner, 1989], finance [Pitt and Shephard, 1999], robotics [Gordon et al., 1993], communications [Andrieu et al., 2001], etc. Due to the non-linearity and non-Gaussianity of most models of interest in real life applications, approximate solutions are inevitable, and SMC is the main computational tool used for this; see e.g. Doucet et al. [2001] for a wide selection of examples demonstrating the use of SMC.
SMC methods have already been presented in their general form in Section 2.5; we will present their application to HMMs for optimal Bayesian filtering in Section 3.3.

In practice, it is rare that the practitioner has complete knowledge of the static parameters of the time series model which she uses to perform optimal Bayesian filtering. This raises the necessity of 'calibrating the model', hence estimating its static parameters. Note also that estimating the static parameters of an HMM may itself be the main objective. Section 3.4 of this chapter contains a review of the methodology for static parameter estimation in HMMs; in particular, we will present some of the popular maximum likelihood estimation (MLE) algorithms. We will also show how to obtain SMC approximations of those MLE algorithms for HMMs.

Although we present optimal Bayesian filtering and static parameter estimation methods and their SMC approximations within the framework of HMMs, we would like to stress that this thesis does contain time series models which are not HMMs (at least in the way we deal with them). We will mention such models in Section 3.2.1. The reason why we restrict ourselves to HMMs is that the computational tools developed for them are generally applicable to more general time series models with some suitable modifications.

3.2 Hidden Markov models

We begin with the definition of an HMM. Let $\{X_n\}_{n\geq1}$ be a homogeneous Markov chain defined on $(\mathcal{X}, \mathcal{E}_X)$. Suppose that this process is observed as another process $\{Y_n\}_{n\geq1}$ defined on $(\mathcal{Y}, \mathcal{E}_Y)$, such that the conditional distribution of $Y_n$ given all the other random variables depends only on $X_n$. Then the bivariate process $\{X_n, Y_n\}_{n\geq1}$ is called an HMM. We give below a more formal definition, taken from Cappé et al. [2005]; we additionally assume that the HMM is parametrised by a vector valued static parameter.

Definition 3.1 (HMM). Let $(\mathcal{X}, \mathcal{E}_X)$ and $(\mathcal{Y}, \mathcal{E}_Y)$ be two measurable spaces, $d_{\theta} > 0$, and let $\Theta$ be a compact subset of $\mathbb{R}^{d_{\theta}}$. For any $\theta \in \Theta$, let $\mu_{\theta}$, $F_{\theta}$, and $G_{\theta}$ denote, respectively, a probability measure on $(\mathcal{X}, \mathcal{E}_X)$, a Markov transition kernel on $(\mathcal{X}, \mathcal{E}_X)$, and a transition kernel from $(\mathcal{X}, \mathcal{E}_X)$ to $(\mathcal{Y}, \mathcal{E}_Y)$. Consider the Markov transition kernel $H_{\theta}$ defined on the product space $(\mathcal{X} \times \mathcal{Y}, \mathcal{E}_X \otimes \mathcal{E}_Y)$ such that for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and $C \in \mathcal{E}_X \otimes \mathcal{E}_Y$,
\[
H_{\theta}[(x, y), C] = \int_{C} F_{\theta}(x, dx')\, G_{\theta}(x', dy').
\]
Then the Markov chain $\{X_n, Y_n\}_{n\geq1}$ with initial distribution $\mu_{\theta} \otimes G_{\theta}$ and transition kernel $H_{\theta}$ is called a hidden Markov model (HMM) parametrised by $\theta$.

Although this definition concerns the joint process $\{X_n, Y_n\}_{n\geq1}$, the term 'hidden' is justified when only $\{Y_n\}_{n\geq1}$ is observable. We call $\{X_n\}_{n\geq1}$ the hidden process and its states the hidden states, and $\{Y_n\}_{n\geq1}$ is called the observed process, containing the observed states. We will deal with real valued vector processes, which is why we always take $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$. Note, also, that it follows from Definition 3.1 that $\{X_n\}_{n\geq1}$ is $\mathrm{Markov}(\mu_{\theta}, F_{\theta})$, and that the observations $\{Y_n\}_{n\geq1}$ conditioned upon $\{X_n\}_{n\geq1}$ are independent with conditional distributions $G_{\theta}(x_n, \cdot)$; i.e., for every $A \in \mathcal{E}_X$ and $B \in \mathcal{E}_Y$ we have
\[
\mathbb{P}_{\theta}(X_1 \in A) = \mu_{\theta}(A), \qquad \mathbb{P}_{\theta}(X_n \in A \mid X_{1:n-1} = x_{1:n-1}) = F_{\theta}(x_{n-1}, A), \tag{3.1}
\]
\[
\mathbb{P}_{\theta}\left(Y_n \in B \mid \{X_t\}_{t\geq1} = \{x_t\}_{t\geq1},\ \{Y_t\}_{t\neq n} = \{y_t\}_{t\neq n}\right) = G_{\theta}(x_n, B). \tag{3.2}
\]
In the time series literature, the term HMM has been widely associated with the case of $\mathcal{X}$ being finite [Rabiner, 1989], and models with continuous $\mathcal{X}$ are often referred to as state-space models. Again, in some works the term 'state-space models' refers to the case of linear Gaussian systems [Anderson and Moore, 1979]. We emphasise at this point that in this thesis we shall keep the framework as general as possible. We consider the general case of measurable spaces, and we avoid making any restrictive assumptions on $\mu_{\theta}$, $F_{\theta}$, and $G_{\theta}$ that would impose a certain structure on the dynamics of the HMM. Also, we clarify that, in contrast to previous restrictive uses of the terminology, we will use both terms 'HMM' and 'general state-space model' to describe exactly the same thing, as defined by Definition 3.1.

For the rest of the thesis we will be dealing with fully dominated HMMs, where $\mu_{\theta}$, $F_{\theta}(x, \cdot)$ and $G_{\theta}(x, \cdot)$ have densities with respect to some dominating measures. We give a formal definition of a fully dominated HMM here.

Definition 3.2 (fully dominated HMM). Consider the HMM in Definition 3.1. Suppose that there exist probability measures $\lambda$ on $(\mathcal{X}, \mathcal{E}_X)$ and $\nu$ on $(\mathcal{Y}, \mathcal{E}_Y)$ such that (i) $\mu_{\theta}$ is absolutely continuous with respect to $\lambda$, (ii) for all $x \in \mathcal{X}$, $F_{\theta}(x, \cdot)$ is absolutely continuous with respect to $\lambda$ with transition density function $f_{\theta}(\cdot|x)$, and (iii) for all $x \in \mathcal{X}$, $G_{\theta}(x, \cdot)$ is absolutely continuous with respect to $\nu$ with transition density function $g_{\theta}(\cdot|x)$. Then the HMM is said to be fully dominated, and the joint Markov transition kernel $H_{\theta}$ is dominated by the product measure $\lambda \otimes \nu$ and admits the transition density function
\[
h_{\theta}(x', y'|x, y) = f_{\theta}(x'|x)\, g_{\theta}(y'|x').
\]
Therefore, for a fully dominated HMM as in Definition 3.2, the joint probability density of $(X_{1:n}, Y_{1:n})$ exists and is given by
\[
p_{\theta}(x_{1:n}, y_{1:n}) = \mu_{\theta}(x_1)\, g_{\theta}(y_1|x_1) \prod_{t=2}^{n} f_{\theta}(x_t|x_{t-1})\, g_{\theta}(y_t|x_t), \tag{3.3}
\]
where, with an abuse of notation, we have used $\mu_{\theta}$ also to denote the density of the probability measure $\mu_{\theta}$. The joint law of all the variables of the HMM up to time $n$ is summarised in (3.3), from which we derive several probability densities of interest. One example is the likelihood of the observations up to time $n$, which can be derived as
\[
p_{\theta}(y_{1:n}) = \int p_{\theta}(x_{1:n}, y_{1:n})\, \lambda(dx_{1:n}). \tag{3.4}
\]
Maximisation of this quantity with respect to $\theta$ is the main interest of this thesis. Another important probability density, which will be pursued in detail, is the density of the posterior distribution of $X_{1:n}$ given $Y_{1:n} = y_{1:n}$, obtained by Bayes' theorem:
\[
p_{\theta}(x_{1:n}|y_{1:n}) = \frac{p_{\theta}(x_{1:n}, y_{1:n})}{p_{\theta}(y_{1:n})}. \tag{3.5}
\]

3.2.1 Extensions to HMMs

Although HMMs are the most common class of time series models in the literature, there are also many time series models which are not HMMs yet are still of great importance. These models differ from HMMs mostly because they do not possess the conditional independence of the observations. Here, we give two examples that we will also use in this thesis.

• In the first example of such models, the process $\{X_n\}_{n\geq1}$ is still a Markov chain; however, the conditional distribution of $Y_n$, given all past variables $X_{1:n}$ and $Y_{1:n-1}$, depends not only on the value of $X_n$ but also on the values of the past observations $Y_{1:n-1}$. If we denote the probability density of this conditional distribution by $g_{\theta,n}(y_n|x_n, y_{1:n-1})$, the joint probability density of $(X_{1:n}, Y_{1:n})$ is
\[
p_{\theta}(x_{1:n}, y_{1:n}) = \mu_{\theta}(x_1)\, g_{\theta}(y_1|x_1) \prod_{t=2}^{n} f_{\theta}(x_t|x_{t-1})\, g_{\theta,t}(y_t|x_t, y_{1:t-1}).
\]
If $Y_n$ given $X_n$ is independent of the observations prior to time $n-k$, then we can define a $g_{\theta}$ such that $g_{\theta,n}(y_n|x_n, y_{1:n-1}) = g_{\theta}(y_n|x_n, y_{n-k:n-1})$ for all $n$. One example of such models is a changepoint model; e.g. see Fearnhead and Liu [2007]. We will encounter changepoint models in Chapter 4 of this thesis. The terminology regarding the type of models where we have $g_{\theta}(y_n|x_n, y_{n-k:n-1})$ is not fully standardised. One term in use is Markov switching models; Markov jump systems is also used, at least in cases where the hidden state space is finite [Cappé et al., 2005]. These models have much in common with basic HMMs, in the sense that virtually identical computational tools may be used for both. In the particular context of SMC, the similarity between these two types of models is most clearly exposed in Del Moral [2004] via the Feynman-Kac representation of SMC methods, where the conditional density of the observation at time $n$ is treated generally as a potential function of $x_n$.

• In another type of time series model that is not an HMM, the latent process $\{X_n\}_{n\geq1}$ is, again, still a Markov chain; however, the observation at the current time depends on all the past values, i.e. $Y_n$ conditional on $(X_{1:n}, Y_{1:n-1})$ depends on all of these conditioning random variables. Such models usually result from marginalising an extended HMM. Consider the HMM $\{(X_n, Z_n), Y_n\}_{n\geq1}$, where the joint process $\{X_n, Z_n\}_{n\geq1}$ is a Markov chain whose transition law admits a density $f_{\theta}$ with respect to the product measure $\lambda_1 \otimes \lambda_2$ which can be factorised as
\[
f_{\theta}(x_n, z_n|x_{n-1}, z_{n-1}) = f_{\theta,1}(x_n|x_{n-1})\, f_{\theta,2}(z_n|x_n, z_{n-1}),
\]
and where the observation $Y_n$, given all the past random variables, depends only on $X_n$ and $Z_n$ and admits the probability density $g_{\theta}(y_n|x_n, z_n)$. Now, the marginal bivariate process $\{X_n, Y_n\}_{n\geq1}$ is not an HMM, and we express the joint density of $(X_{1:n}, Y_{1:n})$ as
\[
p_{\theta}(x_{1:n}, y_{1:n}) = \mu_{\theta}(x_1)\, p_{\theta,1}(y_1|x_1) \prod_{t=2}^{n} f_{\theta,1}(x_t|x_{t-1})\, p_{\theta,t}(y_t|x_{1:t}, y_{1:t-1}),
\]
where the density $p_{\theta,n}(y_n|x_{1:n}, y_{1:n-1})$ is given by
\[
p_{\theta,n}(y_n|x_{1:n}, y_{1:n-1}) = \int p_{\theta}(z_{1:n-1}|x_{1:n-1}, y_{1:n-1})\, f_{\theta,2}(z_n|x_n, z_{n-1})\, g_{\theta}(y_n|x_n, z_n)\, \lambda_2(dz_{1:n}). \tag{3.6}
\]
The reason $\{X_n, Y_n\}_{n\geq1}$ might be of interest is that the conditional laws of $Z_{1:n}$ may be available in closed form, and exact evaluation of the integral in (3.6) may be possible. In that case, it can be more effective to perform the Monte Carlo approximation for the law of $X_{1:n}$ given the observations $Y_{1:n}$, which leads to the so-called Rao-Blackwellised particle filters in the literature [Doucet et al., 2000a]. The integration is indeed available in closed form for some time series models. One example is linear Gaussian switching state-space models [Chen and Liu, 2000; Doucet et al., 2000a; Fearnhead and Clifford, 2003], where $X_n$ takes values in a finite set whose elements are often called 'labels', and, conditioned on $\{X_n\}_{n\geq1}$, $\{Z_n, Y_n\}_{n\geq1}$ is a linear Gaussian state-space model. A more sophisticated time series model of the same nature is the linear Gaussian multiple target tracking model, which we will investigate in detail in Chapter 5.

Having stated that the interest of this thesis is in time series models more general than HMMs, we note that the computational tools developed for HMMs are generally applicable to this more general class of time series models with suitable modifications. For this reason we carry on this chapter with a review of SMC and parameter estimation methods for HMMs.
3.3 Sequential inference in HMMs

3.3.1 Bayesian optimal filtering

In an HMM, one is usually interested in sequential inference on the variables of the hidden process $\{X_t\}_{t\geq1}$ given observations $Y_{1:n} = y_{1:n}$ up to time $n$. For example, one seeks the sequence of posterior distributions $\{p_\theta(x_{1:n}|y_{1:n})\}_{n\geq1}$, where $p_\theta(x_{1:n}|y_{1:n})$ is given in equation (3.5). It is also straightforward to generalise $p_\theta(x_{1:n}|y_{1:n})$ to the posterior distribution of $X_{1:n'}$ for any $n' \geq 1$. For $n' > n$ we have
$$p_\theta(x_{1:n'}|y_{1:n}) = p_\theta(x_{1:n}|y_{1:n}) \prod_{t=n+1}^{n'} f_\theta(x_t|x_{t-1});$$
whereas for $n' < n$ the density $p_\theta(x_{1:n'}|y_{1:n})$ can be obtained simply by integrating out the variables $x_{n'+1:n}$, i.e.
$$p_\theta(x_{1:n'}|y_{1:n}) = \int p_\theta(x_{1:n}|y_{1:n}) \, \lambda(dx_{n'+1:n}).$$

It is possible to obtain a recursion for these posterior distributions as one receives observations sequentially. Equations (3.3) and (3.5) reveal that we can write $p_\theta(x_{1:n}|y_{1:n})$ in terms of $p_\theta(x_{1:n-1}|y_{1:n-1})$ as
$$p_\theta(x_{1:n}|y_{1:n}) = \frac{f_\theta(x_n|x_{n-1}) g_\theta(y_n|x_n)}{p_\theta(y_n|y_{1:n-1})} \, p_\theta(x_{1:n-1}|y_{1:n-1}). \qquad (3.7)$$
The normalising constant $p_\theta(y_n|y_{1:n-1})$ can be written in terms of the known densities as
$$p_\theta(y_n|y_{1:n-1}) = \int p_\theta(x_{1:n-1}|y_{1:n-1}) f_\theta(x_n|x_{n-1}) g_\theta(y_n|x_n) \, \lambda(dx_{1:n}). \qquad (3.8)$$
Also, by convention, $p_\theta(y_1|y_0) = p_\theta(y_1) = \int g_\theta(y_1|x_1) \mu_\theta(x_1) \lambda(dx_1)$. The recursion in (3.7) is essential since it enables efficient sequential approximation of the distributions $p_\theta(x_{1:n}|y_{1:n})$, as we will see in Section 3.3.2.

From a Bayesian point of view, the probability densities $p_\theta(x_{1:n'}|y_{1:n})$ are complete solutions to the inference problems, as they contain all the information about the hidden states $X_{1:n'}$ given the observations $y_{1:n}$. For example, the expectation of a measurable function $\varphi_{n'}: \mathsf{X}^{n'} \to \mathbb{R}^{d_\varphi(n')}$ conditional upon the observations $y_{1:n}$ can be evaluated as
$$\mathbb{E}_\theta[\varphi_{n'}(X_{1:n'})|y_{1:n}] = \int \varphi_{n'}(x_{1:n'}) \, p_\theta(x_{1:n'}|y_{1:n}) \, \lambda(dx_{1:n'}).$$
However, one can also restrict attention to a problem of smaller size, such as the marginal distribution of the random variable $X_k$, $k \leq n'$, given $y_{1:n}$. The probability density of such a marginal posterior distribution, $p_\theta(x_k|y_{1:n})$, is called a smoothing, filtering or prediction density if $k < n$, $k = n$ or $k > n$, respectively. Indeed, there are many cases where one is interested in calculating the expectations of functions $\varphi: \mathsf{X} \to \mathbb{R}^{d_\varphi}$ of $X_k$ given $y_{1:n}$,
$$\mathbb{E}_\theta[\varphi(X_k)|y_{1:n}] = \int \varphi(x_k) \, p_\theta(x_k|y_{1:n}) \, \lambda(dx_k).$$
Although once we have $p_\theta(x_{1:n'}|y_{1:n})$ for some $n' \geq k$ the marginal density can be obtained directly by marginalisation, this route via the recursion in (3.7) may be intractable or too expensive to calculate. Therefore it is useful to have alternative recursions that evaluate the marginal densities sequentially. Here, we will cover the recursions for the filtering and one-step prediction densities. Given the filtering density $p_\theta(x_{n-1}|y_{1:n-1})$ at time $n-1$, the filtering density at time $n$ is usually obtained recursively in two stages, called prediction and update. These are given as
$$p_\theta(x_n|y_{1:n-1}) = \int f_\theta(x_n|x_{n-1}) \, p_\theta(x_{n-1}|y_{1:n-1}) \, \lambda(dx_{n-1}), \qquad (3.9)$$
$$p_\theta(x_n|y_{1:n}) = \frac{g_\theta(y_n|x_n) \, p_\theta(x_n|y_{1:n-1})}{p_\theta(y_n|y_{1:n-1})}, \qquad (3.10)$$
where this time we write the normalising constant as
$$p_\theta(y_n|y_{1:n-1}) = \int p_\theta(x_n|y_{1:n-1}) \, g_\theta(y_n|x_n) \, \lambda(dx_n). \qquad (3.11)$$
The problem of evaluating the recursion given by equations (3.9) and (3.10) is called the Bayesian optimal filtering (or simply optimal filtering) problem in the literature. In the following, we will look at the SMC methodology in the context of HMMs and review how SMC methods have been used to provide approximate solutions to the optimal filtering problem.
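When $\mathsf{X}$ is a finite set, the integrals in (3.9)-(3.11) reduce to sums and the optimal filtering recursion can be evaluated exactly, a case revisited at the start of the next subsection. The following minimal sketch illustrates this; the two-state transition matrix, the Gaussian observation density, and all variable names are illustrative assumptions rather than part of the general model.

    import numpy as np

    def finite_state_filter(y, mu, F, g):
        """Exact optimal filtering on a finite state space {1, ..., K}.
        mu: initial distribution (K,); F: transition matrix, F[i, j] = P(x' = j | x = i);
        g(yn): observation density g(yn | x) evaluated at every state, shape (K,).
        Returns the filtering distributions p(x_n | y_{1:n}) and the log-likelihood."""
        filt = np.zeros((len(y), len(mu)))
        log_lik = 0.0
        pred = mu                           # prediction p(x_1)
        for n, yn in enumerate(y):
            unnorm = g(yn) * pred           # update step (3.10), unnormalised
            c = unnorm.sum()                # normalising constant p(y_n | y_{1:n-1}), cf. (3.11)
            log_lik += np.log(c)
            filt[n] = unnorm / c
            pred = filt[n] @ F              # prediction step (3.9)
        return filt, log_lik

    # Illustrative two-state example with Gaussian emissions of state-dependent means.
    F = np.array([[0.95, 0.05], [0.10, 0.90]])
    mu = np.array([0.5, 0.5])
    means, sd = np.array([-1.0, 1.0]), 0.5
    g = lambda yn: np.exp(-0.5 * ((yn - means) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    y = np.concatenate([np.random.normal(-1, sd, 50), np.random.normal(1, sd, 50)])
    filt, ll = finite_state_filter(y, mu, F, g)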
3.3.2 Particle filters for optimal filtering

There are cases where the optimal filtering problem can be solved exactly. One such case is when $\mathsf{X}$ is a finite set [Rabiner, 1989]. Also, in linear Gaussian state-space models the densities in (3.9) and (3.10) are obtained by the Kalman filter [Kalman, 1960]. In general, however, these densities do not admit a closed-form expression and one has to use methods based on numerical approximations. One such approach is to use grid-based methods, where the continuous $\mathsf{X}$ is approximated by a finite discretised version and the update rules are applied as in the case of finite state-space HMMs. Another approach is the extended Kalman filter [Sorenson, 1985], which approximates a non-linear transition by a linear one and then performs the Kalman filter; the method fails if the nonlinearity in the HMM is substantial. An improved approach based on the Kalman filter is the unscented Kalman filter [Julier and Uhlmann, 1997], which is based on a deterministic selection of sigma-points from the support of the state distribution of interest such that the mean and variance of the true distribution are matched by the sample mean and covariance calculated at these sigma-points. All of these methods are deterministic and not capable of dealing with the most general state-space models; in particular, they will fail as the dimensions or the nonlinearities increase.

As an alternative to the deterministic approximation methods, Monte Carlo can provide a robust and efficient solution to the optimal filtering problem. SMC methods for optimal filtering, also known as particle filters, have been shown to produce more accurate estimates than the deterministic methods mentioned above [Doucet et al., 2000b; Durbin and Koopman, 2000; Kitagawa, 1996; Liu and Chen, 1998]. Good tutorials on SMC methods for filtering as well as smoothing in HMMs include, from the earliest to the most recent, Doucet et al. [2000b], Arulampalam et al. [2002], Cappé et al. [2007], Fearnhead [2008], and Doucet and Johansen [2009]. One can also see Doucet et al. [2001] as a reference book, although a somewhat dated one. Also, the book of Del Moral [2004] contains a rigorous review of numerous theoretical aspects of the SMC methodology in a different framework, where an SMC method is treated as an interacting particle system associated with the mean field interpretation of a Feynman-Kac flow.

With reference to the Monte Carlo methodology covered in Chapter 2, the filtering problem in state-space models can be considered as a sequential inference problem for the sequence of probability distributions $\pi_{\theta,n}$ on the product measurable spaces $(\mathsf{X}_n = \mathsf{X}^n, \mathcal{E}_n = \mathcal{E}^{\otimes n})$,
$$\pi_{\theta,n}(dx_{1:n}) := p_\theta(x_{1:n}|y_{1:n}) \, \lambda(dx_{1:n}).$$
As we saw in Section 2.5, we can apply the SIS and SISR methods targeting $\{\pi_{\theta,n}\}_{n\geq1}$. The SMC proposal distribution at time $n$, denoted $q_{\theta,n}$, is designed conditionally on the observations up to time $n$ and the state values up to time $n-1$; in the most general case it can be written as
$$q_{\theta,n}(dx_{1:n}) := Q_{\theta,1}(y_1, dx_1) \prod_{t=2}^{n} Q_{\theta,t}[(x_{1:t-1}, y_{1:t}), dx_t] = q_{\theta,n-1}(dx_{1:n-1}) \, Q_{\theta,n}[(x_{1:n-1}, y_{1:n}), dx_n]. \qquad (3.12)$$
In fact, most of the time the transition kernel $Q_{\theta,n}$ depends only on the current observation and the previous state; hence we simplify (3.12) by defining $Q_\theta: \mathsf{X} \times \mathsf{Y} \to \mathcal{P}(\mathcal{E})$ and taking
$$Q_{\theta,n}[(x_{1:n-1}, y_{1:n}), dx_n] = Q_\theta[(x_{n-1}, y_n), dx_n]$$
for all $n \geq 1$, with the convention $Q_\theta[(x_0, y_1), dx_1] = Q_\theta(y_1, dx_1)$.
Suppose we design $Q_\theta[(x, y), \cdot]$ such that it is absolutely continuous with respect to $\lambda$ with density $q_\theta(\cdot|x, y)$. Then we can write
$$q_{\theta,n}(dx_{1:n}) = \left[ q_\theta(x_1|y_1) \prod_{t=2}^{n} q_\theta(x_t|x_{t-1}, y_t) \right] \lambda(dx_{1:n}). \qquad (3.13)$$
If we wanted to perform SMC using the target distribution $\pi_{\theta,n}$ directly, we would have to calculate the following incremental weight at time $n$:
$$\frac{d\pi_{\theta,n}}{d(\pi_{\theta,n-1} \otimes Q_\theta)}(x_{1:n}) = \frac{f_\theta(x_n|x_{n-1}) g_\theta(y_n|x_n)}{p_\theta(y_n|y_{1:n-1}) \, q_\theta(x_n|x_{n-1}, y_n)} \propto \frac{f_\theta(x_n|x_{n-1}) g_\theta(y_n|x_n)}{q_\theta(x_n|x_{n-1}, y_n)}.$$
In most applications $p_\theta(y_n|y_{1:n-1})$ cannot be calculated; hence this Radon-Nikodym derivative is not available. For this reason, instead of $\pi_{\theta,n}$, SMC methods use the following unnormalised measure for importance sampling,
$$\widehat{\pi}_{\theta,n}(dx_{1:n}) = p_\theta(x_{1:n}, y_{1:n}) \, \lambda(dx_{1:n}),$$
whose normalising constant is $p_\theta(y_{1:n})$, the likelihood of the observations up to time $n$. In that case, the importance weight for the whole path $X_{1:n}$ is given by $w_n(x_{1:n}) = w_{n-1}(x_{1:n-1}) \, w_{n|n-1}(x_{n-1}, x_n)$, where the incremental importance weight $w_{n|n-1}(x_{n-1}, x_n)$ is
$$w_{n|n-1}(x_{n-1}, x_n) = \frac{f_\theta(x_n|x_{n-1}) g_\theta(y_n|x_n)}{q_\theta(x_n|x_{n-1}, y_n)}.$$

Algorithm 3.1. SISR (particle filter) for HMMs

For $n = 1$: for $i = 1, \ldots, N$, sample $X_1^{(i)} \sim q_\theta(\cdot|y_1)$ and set
$$W_1^{(i)} \propto \frac{\mu_\theta(X_1^{(i)}) \, g_\theta(y_1|X_1^{(i)})}{q_\theta(X_1^{(i)}|y_1)}.$$
For $n = 2, 3, \ldots$:
• Resample $\{X_{1:n-1}^{(i)}\}_{1\leq i\leq N}$ according to the weights $\{W_{n-1}^{(i)}\}_{1\leq i\leq N}$ to get resampled particles $\{\widetilde{X}_{1:n-1}^{(i)}\}_{1\leq i\leq N}$ with weights $1/N$.
• For $i = 1, \ldots, N$: sample $X_n^{(i)} \sim q_\theta(\cdot|\widetilde{X}_{n-1}^{(i)}, y_n)$, set $X_{1:n}^{(i)} = (\widetilde{X}_{1:n-1}^{(i)}, X_n^{(i)})$, and set
$$W_n^{(i)} \propto \frac{f_\theta(X_n^{(i)}|\widetilde{X}_{n-1}^{(i)}) \, g_\theta(y_n|X_n^{(i)})}{q_\theta(X_n^{(i)}|\widetilde{X}_{n-1}^{(i)}, y_n)}.$$

We present the SISR algorithm, also known as the particle filter, for general state-space models in Algorithm 3.1, recalling that SIS is the special case of SISR in which no resampling is performed.
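To illustrate Algorithm 3.1 concretely, here is a minimal sketch of the bootstrap special case discussed in the list below, in which the proposal is the prior transition density so that the incremental weight reduces to $g_\theta(y_n|x_n)$; the scalar autoregressive model and all function names are illustrative assumptions, not part of the general algorithm.

    import numpy as np

    def bootstrap_filter(y, N, mu_sample, f_sample, g_density, rng):
        """Algorithm 3.1 with the prior proposal q_theta = f_theta (the bootstrap
        filter), resampling at every step. Returns particles and normalised
        weights approximating the filtering distribution p(x_n | y_{1:n})."""
        x = mu_sample(N, rng)                      # X_1^(i) ~ mu_theta; proposal = prior,
        w = g_density(y[0], x)                     # so W_1 is proportional to g(y_1 | x)
        W = w / w.sum()
        for yn in y[1:]:
            idx = rng.choice(N, size=N, p=W)       # multinomial resampling
            x = f_sample(x[idx], rng)              # propagate through f_theta
            w = g_density(yn, x)                   # incremental weight w_{n|n-1}
            W = w / w.sum()
        return x, W

    # Illustrative model: X_n = 0.9 X_{n-1} + V_n, Y_n = X_n + W_n, V_n, W_n ~ N(0, 1).
    rng = np.random.default_rng(0)
    mu_sample = lambda N, rng: rng.normal(0.0, 1.0, N)
    f_sample = lambda x, rng: 0.9 * x + rng.normal(0.0, 1.0, x.shape)
    g_density = lambda yn, x: np.exp(-0.5 * (yn - x) ** 2) / np.sqrt(2 * np.pi)
    x, W = bootstrap_filter(np.array([0.3, -0.1, 1.2]), 1000, mu_sample, f_sample, g_density, rng)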
In the following we list some aspects of the particle filter.

• As in the general SISR algorithm, we can use an optional resampling scheme, where resampling is performed only when the estimated effective sample size falls below a threshold value.

• A by-product of the particle filter is that it provides unbiased estimates of the unknown normalising constants of the target distribution [Del Moral, 2004, Chapter 7]. For example, when SISR is used with an optional resampling scheme, if the last time prior to $n$ at which resampling was performed is $k$, an unbiased estimator of $p_\theta(y_{k+1:n}|y_{1:k})$ can be obtained as
$$p_\theta(y_{k+1:n}|y_{1:k}) \approx \frac{1}{N} \sum_{i=1}^{N} \prod_{t=k+1}^{n} w_{t|t-1}(X_{t-1}^{(i)}, X_t^{(i)}).$$
We will come back to this aspect of the particle filter in Section 3.4.1.

• The choice of the kernel $Q_\theta$ for the proposal distribution is important to ensure an effective SMC approximation. The first genuine particle filter in the literature, proposed by Gordon et al. [1993], proposes from the prior distribution of $X_{1:n}$, i.e. takes $q_\theta(x_n|x_{n-1}, y_n) = f_\theta(x_n|x_{n-1})$; the particle filter with this particular choice of $Q_\theta$ is called the bootstrap filter. Another interesting choice is $q_\theta(x_n|x_{n-1}, y_n) = q_\theta(x_n|y_n)$, which can be useful when the observations provide significant information about the hidden state but the state dynamics are weak. This proposal was introduced in Lin et al. [2005] and the resulting particle filter was called the independent particle filter. The optimal choice, which minimises the variance of the incremental importance weights, is, from equation (2.12),
$$q_\theta^{\mathrm{opt}}(x_n|x_{n-1}, y_n) = p_\theta(x_n|x_{n-1}, y_n),$$
resulting in the optimal incremental weight $w_{n|n-1}^{\mathrm{opt}}(x_{n-1}, x_n) = p_\theta(y_n|x_{n-1})$, which is independent of the value of $x_n$. First works where $q_\theta^{\mathrm{opt}}$ was used include Kong et al. [1994], Liu [1996], and Liu and Chen [1995].

• The auxiliary particle filter for optimal filtering [Pitt and Shephard, 1999] is implemented by sampling $X_{1:n-1}$ from among the particle paths up to time $n-1$ and a new $X_n$ from $\mathsf{X}$ in order to target
$$\bar{\pi}_{\theta,n}(dx_{1:n}) = \left[ \sum_{i=1}^{N} W_{n-1}^{(i)} w_{n|n-1}^{\mathrm{opt}}(X_{1:n-1}^{(i)}) \, \delta_{X_{1:n-1}^{(i)}}(dx_{1:n-1}) \right] p_\theta(x_n|x_{n-1}, y_n) \, \lambda(dx_n).$$
Note that when $p_\theta(y_n|x_{n-1})$ can be calculated and $p_\theta(x_n|x_{n-1}, y_n)$ can be sampled from, all the particles at time $n$ have equal weights. If this is not the case, the proposal distribution used to target this distribution can be written in general as
$$\bar{q}_{\theta,n}(dx_{1:n}) = \left[ \sum_{i=1}^{N} \alpha_{n-1}^{(i)} \, \delta_{X_{1:n-1}^{(i)}}(dx_{1:n-1}) \right] q_\theta(x_n|x_{n-1}, y_n) \, \lambda(dx_n),$$
where the first-stage weights $\alpha_{n-1}^{(i)}$ and the density $q_\theta(x_n|x_{n-1}, y_n)$ are design choices that should be as close as possible to the ideal ones. One attempt to make $\alpha_{n-1}^{(i)}$ close to $W_{n-1}^{(i)} p_\theta(y_n|X_{n-1}^{(i)})$ (up to normalisation), suggested in the original work of Pitt and Shephard [1999] on the auxiliary particle filter, is to take $\alpha_{n-1}^{(i)} = g_\theta(y_n|x_n^{*(i)})$, where $x_n^{*(i)}$ is a prediction of $X_n$ given $X_{n-1}^{(i)}$ based on the dynamics of the process, e.g. $x_n^{*(i)} = \mathbb{E}_\theta[X_n|X_{n-1}^{(i)}]$.

• Although the particle filter presented in Algorithm 3.1 targets the path filtering distributions $\pi_{\theta,n}(dx_{1:n}) = p_\theta(x_{1:n}|y_{1:n})\lambda(dx_{1:n})$, it can easily be modified, or used directly, to make inference on other distributions of interest. For example, consider the one-step path prediction distribution $\pi_{\theta,n}^p(dx_{1:n}) = p_\theta(x_{1:n}|y_{1:n-1})\lambda(dx_{1:n})$. The relation between $\pi_{\theta,n}$ and $\pi_{\theta,n}^p$ is
$$\pi_{\theta,n}^p(dx_{1:n}) = \pi_{\theta,n-1}(dx_{1:n-1}) f_\theta(x_n|x_{n-1}) \, \lambda(dx_n), \qquad \frac{d\pi_{\theta,n}}{d\pi_{\theta,n}^p}(x_{1:n}) = \frac{g_\theta(y_n|x_n)}{\pi_{\theta,n}^p(g_\theta(y_n|\cdot))}.$$
Therefore, it is easy to derive approximations of these distributions from each other: obtaining $\pi_{\theta,n}^{p,N}$ from $\pi_{\theta,n-1}^{N}$ requires a simple extension of the path $X_{1:n-1}$ to $X_{1:n}$ through $f_\theta$, done by sampling $X_n^{(i)}$ conditioned on the existing particle paths $X_{1:n-1}^{(i)}$, respectively for $i = 1, \ldots, N$; conversely, obtaining $\pi_{\theta,n}^{N}$ from $\pi_{\theta,n}^{p,N}$ requires a simple reweighting of the (approximate) measure according to $g_\theta(y_n|\cdot)$. As a second example, approximations to the marginal distributions $\pi_{\theta,n}^{N}(dx_k)$, $k \leq n$ (or $\pi_{\theta,n}^{p,N}(dx_k)$), are obtained simply from the $k$'th components of the particles, e.g.
$$\pi_{\theta,n}^{N}(dx_{1:n}) = \sum_{i=1}^{N} W_n^{(i)} \delta_{X_{1:n}^{(i)}}(dx_{1:n}) \;\Rightarrow\; \pi_{\theta,n}^{N}(dx_k) = \sum_{i=1}^{N} W_n^{(i)} \delta_{X_k^{(i)}}(dx_k).$$
Note that the optimal filtering problem corresponds to the case $k = n$; hence it may be sufficient to have a good approximation of the marginal posterior distribution of the current state $X_n$ rather than of the whole path $X_{1:n}$. This justifies the resampling step of the particle filter in practice, since resampling trades off accuracy for states $X_k$ with $k \ll n$ in exchange for a good approximation of the marginal posterior distribution of $X_n$.

3.3.3 The marginal particle filter

Recall that the standard particle filter follows the recursion in (3.7): it estimates $\pi_{\theta,n}(dx_{1:n})$ by taking an estimate of $\pi_{\theta,n-1}(dx_{1:n-1})$ and augmenting it with $x_n$ at time $n$. It involves a resampling step to avoid the high variance that results from the sequential nature of the algorithm and from the fact that the dimension of the sampled paths grows by the dimension of the state space at each time. When it is the filtering distribution $\pi_{\theta,n}(dx_n)$ that is desired, one can use a somewhat more principled approach.
The marginal particle filter (MPF) [Klaas et al., 2005] follows the recursion in (3.9) and (3.10) and performs particle filtering for the marginal distribution $\pi_{\theta,n}(dx_n)$ instead of the joint distribution $\pi_{\theta,n}(dx_{1:n})$. Assume $\{X_{n-1}^{(i)}, W_{n-1}^{(i)}\}_{1\leq i\leq N}$ is the set of particles and weights obtained by the MPF as an approximation of $\pi_{\theta,n-1}(dx_{n-1})$. The MPF approximates the recursion in (3.9) and (3.10) by substituting for the predictive density $p_\theta(x_n|y_{1:n-1})$ in (3.9) its approximation $\sum_{i=1}^{N} W_{n-1}^{(i)} f_\theta(x_n|X_{n-1}^{(i)})$. It then performs importance sampling for the resulting approximation of the marginal density $p_\theta(x_n|y_{1:n})$,
$$p_\theta^N(x_n|y_{1:n}) \propto g_\theta(y_n|x_n) \sum_{i=1}^{N} W_{n-1}^{(i)} f_\theta(x_n|X_{n-1}^{(i)}).$$
Although we have the freedom to choose any proposal distribution $q_\theta(x_n|y_{1:n})$ with appropriate support, the authors of Klaas et al. [2005] suggest a proposal of a similar form, namely
$$q_\theta(x_n|y_{1:n}) = \sum_{i=1}^{N} W_{n-1}^{(i)} q_\theta(x_n|X_{n-1}^{(i)}, y_n). \qquad (3.14)$$
Note that the proposal in (3.14) amounts to sampling $X_{n-1}$ from the particle estimate of $\pi_{\theta,n-1}(dx_{n-1})$ and then proposing the new component $X_n$. Instead, we may want to design a proposal that samples particles which will be in high-probability regions of the observation model. We can do this by re-weighting the particles at time $n-1$ to boost them in these regions; this modification results in the auxiliary marginal particle filter (AMPF) [Klaas et al., 2005]. The AMPF is the general version of the MPF in which the proposal distribution is written more generally than (3.14) as
$$q_\theta(x_n|y_{1:n}) = \sum_{i=1}^{N} \alpha_{n-1}^{(i)} q_\theta(x_n|X_{n-1}^{(i)}, y_n). \qquad (3.15)$$
Just as for the auxiliary particle filter in Section 3.3.2, one should ideally take
$$\alpha_{n-1}^{(i)} \propto W_{n-1}^{(i)} \, p_\theta(y_n|X_{n-1}^{(i)})$$
if the calculation of $p_\theta(y_n|X_{n-1}^{(i)})$ is possible; otherwise a suitable approximation of it should be used instead.

The pseudocode for the AMPF is given in Algorithm 3.2. The variance of the importance weights of the AMPF is less than or equal to that of the standard auxiliary particle filter. Although this improvement of the marginal particle filter comes at the cost of $O(N^2)$ calculations per time step, compared to the $O(N)$ calculations of standard particle filters, it is possible to reduce this cost to $O(N \log N)$ with a small and controllable error [Klaas et al., 2005].

Algorithm 3.2. The auxiliary marginal particle filter

For $n = 1$: for $i = 1, \ldots, N$, sample $X_1^{(i)} \sim q_\theta(\cdot|y_1)$ and set
$$W_1^{(i)} \propto \frac{\mu_\theta(X_1^{(i)}) \, g_\theta(y_1|X_1^{(i)})}{q_\theta(X_1^{(i)}|y_1)}.$$
For $n = 2, 3, \ldots$:
• For $i = 1, \ldots, N$: sample $X_n^{(i)} \sim \sum_{j=1}^{N} \alpha_{n-1}^{(j)} q_\theta(x_n|X_{n-1}^{(j)}, y_n)$, where $\alpha_{n-1}^{(j)}$ is proportional to $W_{n-1}^{(j)} p_\theta(y_n|X_{n-1}^{(j)})$ or to an approximation of it.
• For $i = 1, \ldots, N$: set
$$W_n^{(i)} \propto \frac{g_\theta(y_n|X_n^{(i)}) \sum_{j=1}^{N} W_{n-1}^{(j)} f_\theta(X_n^{(i)}|X_{n-1}^{(j)})}{\sum_{j=1}^{N} \alpha_{n-1}^{(j)} q_\theta(X_n^{(i)}|X_{n-1}^{(j)}, y_n)}.$$

Finally, we note that another $O(N^2)$ particle filter can be found in Lin et al. [2005] as a special case of what the authors call the independent particle filter. The name 'independent' is due to their proposal distribution at time $n$ being independent of $x_{n-1}$, which allows multiple matchings with the previous particles and makes their algorithm $O(N^2)$ in the case of complete matching. Moreover, a slight extension of their algorithm in which the proposal distribution uses the past particles is also mentioned in their work, and the MPF and AMPF can be considered equivalent to special cases of this extension.
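To make the $O(N^2)$ structure of Algorithm 3.2 explicit, the following sketches a single AMPF step; the vectorised density functions, their names, and the choice of first-stage weights are illustrative assumptions.

    import numpy as np

    def ampf_step(x_prev, W_prev, alpha, yn, q_sample, q_density, f_density, g_density, rng):
        """One step of the auxiliary marginal particle filter (Algorithm 3.2).
        alpha: first-stage weights, e.g. proportional to W_prev * p(y_n | x_prev)
        or an approximation of it; densities must accept broadcast arrays."""
        N = len(x_prev)
        a = alpha / alpha.sum()
        idx = rng.choice(N, size=N, p=a)                 # pick a mixture component per particle
        x = q_sample(x_prev[idx], yn, rng)               # X_n^(i) ~ q(. | X_{n-1}^(j), y_n)
        # O(N^2) mixture evaluations: row i indexes new particles, column j old ones.
        F = f_density(x[:, None], x_prev[None, :])       # f(X_n^(i) | X_{n-1}^(j))
        Q = q_density(x[:, None], x_prev[None, :], yn)   # q(X_n^(i) | X_{n-1}^(j), y_n)
        w = g_density(yn, x) * (F @ W_prev) / (Q @ a)    # weight formula of Algorithm 3.2
        return x, w / w.sum()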
3.3.4 The Rao-Blackwellised particle filter

Assume we are given an HMM $\{(X_n, Z_n), Y_n\}_{n\geq1}$, where this time the hidden state at time $n$ is composed of two components, $X_n$ and $Z_n$. Suppose that the initial and transition distributions of the Markov chain $\{X_n, Z_n\}_{n\geq1}$ have densities $\mu_\theta$ and $f_\theta$ with respect to the product measure $\lambda_1 \otimes \lambda_2$, and that they can be factorised as follows:
$$\mu_\theta(x_1, z_1) = \mu_{\theta,1}(x_1) \, \mu_{\theta,2}(z_1|x_1), \qquad f_\theta(x_n, z_n|x_{n-1}, z_{n-1}) = f_{\theta,1}(x_n|x_{n-1}) \, f_{\theta,2}(z_n|x_n, z_{n-1}).$$
Also, conditioned on $(x_n, z_n)$, the distribution of the observation $Y_n$ admits a density $g_\theta(\cdot|x_n, z_n)$ with respect to $\nu$. We are interested in the case where the posterior distribution
$$\pi_{\theta,n}(dx_{1:n} \, dz_{1:n}) = p_\theta(x_{1:n}, z_{1:n}|y_{1:n}) \, \lambda_1(dx_{1:n}) \lambda_2(dz_{1:n})$$
is analytically intractable and we wish to approximate the expectations $\pi_{\theta,n}(\varphi_n) = \mathbb{E}_\theta[\varphi_n(X_{1:n}, Z_{1:n})|y_{1:n}]$ for bounded measurable functions $\varphi_n: \mathsf{X}^n \times \mathsf{Z}^n \to \mathbb{R}^{d_\varphi(n)}$. Obviously, one way to do this is to run an SMC filter for $\{\pi_{\theta,n}\}_{n\geq1}$, which obtains at time $n$ the approximation
$$\pi_{\theta,n}^N(dx_{1:n} \, dz_{1:n}) = \sum_{i=1}^{N} W_n^{(i)} \, \delta_{(X_{1:n}^{(i)}, Z_{1:n}^{(i)})}(dx_{1:n} \, dz_{1:n}), \qquad \sum_{i=1}^{N} W_n^{(i)} = 1.$$
However, if the conditional posterior probability distribution
$$\pi_{\theta,2,n}(dz_{1:n}|x_{1:n}) = p_\theta(z_{1:n}|x_{1:n}, y_{1:n}) \, \lambda_2(dz_{1:n})$$
is analytically tractable, there is a better SMC scheme for approximating $\pi_{\theta,n}$ and estimating $\pi_{\theta,n}(\varphi_n)$. This SMC scheme is called the Rao-Blackwellised particle filter (RBPF) [Doucet et al., 2000a]. Consider the following decomposition, which follows from the chain rule,
$$p_\theta(x_{1:n}, z_{1:n}|y_{1:n}) = p_\theta(x_{1:n}|y_{1:n}) \, p_\theta(z_{1:n}|x_{1:n}, y_{1:n}),$$
and define the marginal posterior distribution of $X_{1:n}$ conditioned on $y_{1:n}$ as $\pi_{\theta,1,n}(dx_{1:n}) = p_{\theta,1}(x_{1:n}|y_{1:n}) \, \lambda_1(dx_{1:n})$. The RBPF is a particle filter for the sequence of marginal distributions $\{\pi_{\theta,1,n}\}_{n\geq1}$ which produces at time $n$ the approximation
$$\pi_{\theta,1,n}^N(dx_{1:n}) = \sum_{i=1}^{N} W_{1,n}^{(i)} \, \delta_{X_{1:n}^{(i)}}(dx_{1:n}), \qquad \sum_{i=1}^{N} W_{1,n}^{(i)} = 1,$$
and the Rao-Blackwellised approximation of the full posterior distribution combines the particle filter estimate $\pi_{\theta,1,n}^N$ with the exact conditional distribution $\pi_{\theta,2,n}$:
$$\pi_{\theta,n}^{RB,N}(dx_{1:n} \, dz_{1:n}) = \pi_{\theta,1,n}^N(dx_{1:n}) \, \pi_{\theta,2,n}(dz_{1:n}|x_{1:n}).$$
The RBPF estimator of $\pi_{\theta,n}(\varphi_n)$ then becomes
$$\pi_{\theta,n}^{RB,N}(\varphi_n) = \pi_{\theta,1,n}^N\big(\pi_{\theta,2,n}[\varphi_n(X_{1:n}, \cdot)]\big) = \sum_{i=1}^{N} W_{1,n}^{(i)} \, \pi_{\theta,2,n}[\varphi_n(X_{1:n}^{(i)}, \cdot)].$$
Assuming $q_\theta(x_{1:n}|y_{1:n}) = q_\theta(x_{1:n-1}|y_{1:n-1}) \, q_\theta(x_n|x_{1:n-1}, y_{1:n})$ is used as the proposal distribution, the incremental importance weight for the RBPF is given by
$$w_{1,n|n-1}(x_{1:n}) = \frac{f_{\theta,1}(x_n|x_{n-1}) \, p_{\theta,n}(y_n|x_{1:n}, y_{1:n-1})}{q_\theta(x_n|x_{1:n-1}, y_{1:n})},$$
where the density $p_{\theta,n}(y_n|x_{1:n}, y_{1:n-1})$ is given, as in (3.6), by
$$p_{\theta,n}(y_n|x_{1:n}, y_{1:n-1}) = \int p_\theta(z_{1:n-1}|x_{1:n-1}, y_{1:n-1}) f_{\theta,2}(z_n|x_n, z_{n-1}) g_\theta(y_n|x_n, z_n) \, \lambda_2(dz_{1:n}).$$
Also, the variance of $w_{1,n|n-1}$ is minimised by taking the incremental importance density $q_\theta(x_n|x_{1:n-1}, y_{1:n})$ to be $p_\theta(x_n|x_{1:n-1}, y_{1:n})$, in which case $w_{1,n|n-1}(x_{1:n})$ equals $p_\theta(y_n|x_{1:n-1}, y_{1:n-1})$.

The use of the RBPF whenever it is possible is intuitively justified by the fact that we substitute exact values for the particle approximations of some expectations. Indeed, the theoretical analyses in Doucet et al. [2000a] and Chopin [2004, Proposition 3] revealed that the RBPF has better precision than the regular particle filter: the estimates of the RBPF never have larger variances. These favourable results are essentially due to the Rao-Blackwell theorem (see e.g. Blackwell [1947]), after which the proposed particle filter gets its name.
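As a concrete instance, the following is a minimal sketch of one RBPF step for a switching linear Gaussian model in which the labels $X_n$ are sampled by the particle filter and the conditionally Gaussian component $Z_n$ is marginalised exactly by a per-particle Kalman filter; the scalar model and its parameters are illustrative assumptions.

    import numpy as np

    def rbpf_step(lab, m, V, W, yn, P, a, q, r, rng):
        """One RBPF step for a switching linear Gaussian model: labels X_n follow a
        Markov chain with transition matrix P and, conditionally on the labels,
        Z_n = a[x_n] Z_{n-1} + N(0, q) and Y_n = Z_n + N(0, r).
        (lab, m, V, W): labels, Kalman means/variances, weights at time n-1."""
        N = len(lab)
        idx = rng.choice(N, size=N, p=W)                  # resample
        lab, m, V = lab[idx].copy(), m[idx], V[idx]
        for i in range(N):                                # propagate labels from the prior
            lab[i] = rng.choice(len(P), p=P[lab[i]])
        m_pred = a[lab] * m                               # Kalman prediction per particle
        V_pred = a[lab] ** 2 * V + q
        S = V_pred + r                                    # innovation variance
        w = np.exp(-0.5 * (yn - m_pred) ** 2 / S) / np.sqrt(S)  # predictive likelihood
        K = V_pred / S                                    # Kalman update per particle
        m = m_pred + K * (yn - m_pred)
        V = (1.0 - K) * V_pred
        return lab, m, V, w / w.sum()

With the prior label proposal, the incremental weight reduces to the Kalman predictive likelihood $p_\theta(y_n|x_{1:n}, y_{1:n-1})$, exactly as in the weight expression above.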
The RBPF was formulated by Doucet et al. [2000a] and has been implemented in various settings by Andrieu and Doucet [2002], Chen and Liu [2000], and Särkkä et al. [2004], among many others. We will also use RBPFs in our work presented in Chapters 4, 5, and 7.

The use of Rao-Blackwellisation is not limited to marginalising out one of the components of the hidden state; it may also be possible to use Rao-Blackwellisation in the intermediate steps of a particle filter. In some time series models, exact sequential inference is not tractable, but the exact one-step update of the distributions conditioned on the approximations made prior to the current time is possible. For such models, one can calculate an expectation of interest using this exact one-step update, and then continue by approximating the exact update with particles in order to be able to proceed to the next time step of the particle filter. For examples of such implementations of Rao-Blackwellisation, see Fearnhead and Clifford [2003, p. 890], Fearnhead and Liu [2007], and the algorithm in Chapter 4 of this thesis.

3.3.5 Application of SMC to smoothing additive functionals

In this section, we provide an application of particle filters which is central to this thesis due to its relation to parameter estimation. We are interested in approximating smoothed estimates of additive functionals of the state variables in a fully dominated HMM $\{X_n, Y_n\}_{n\geq1}$ as in Definition 3.2. Consider a sequence of functions $s_t: \mathsf{X} \times \mathsf{X} \to \mathbb{R}$, $t \geq 1$, and let $S_n: \mathsf{X}^n \to \mathbb{R}$, $n \geq 1$, be the corresponding sequence of additive functionals constructed from the $s_t$ as
$$S_n(x_{1:n}) = \sum_{t=1}^{n} s_t(x_{t-1}, x_t),$$
where, by convention, we take $s_1(x_0, x_1) = s_1(x_1)$. In many instances it is necessary to be able to compute the following expectations sequentially:
$$S_n^\theta = \pi_{\theta,n}(S_n) = \mathbb{E}_\theta[S_n(X_{1:n})|y_{1:n}] = \int S_n(x_{1:n}) \, p_\theta(x_{1:n}|y_{1:n}) \, \lambda(dx_{1:n}).$$
The expectation is computed with respect to the density $p_\theta(x_{1:n}|y_{1:n})$, and for this reason $S_n^\theta$ is referred to as a smoothed additive functional. The calculation of $S_n^\theta$ may be of interest in its own right; it is also necessary for computing the filter derivative and the gradient of the log-likelihood of the observations [Del Moral et al., 2011; Poyiadjis et al., 2011], the intermediate function of the expectation-maximisation algorithm (see e.g. Del Moral et al. [2009]), etc.

In most cases exact computation of $S_n^\theta$ is not possible due to the unavailability of $p_\theta(x_{1:n}|y_{1:n})$; therefore one has to use Monte Carlo methods, specifically SMC. The first SMC method in the literature proposed to approximate $S_n^\theta$ uses the path space approximation of $\pi_{\theta,n}$ directly [Cappé, 2009]. Let the SMC approximation of $\pi_{\theta,n}$ be
$$\pi_{\theta,n}^N(dx_{1:n}) = \sum_{i=1}^{N} W_n^{(i)} \, \delta_{X_{1:n}^{(i)}}(dx_{1:n}), \qquad \sum_{i=1}^{N} W_n^{(i)} = 1. \qquad (3.16)$$
Then one obtains the path space approximation of the smoothed additive functional as
$$\widehat{S}_n^\theta = \pi_{\theta,n}^N(S_n) = \sum_{i=1}^{N} W_n^{(i)} S_n(X_{1:n}^{(i)}). \qquad (3.17)$$
Observing that $S_n(x_{1:n}) = S_{n-1}(x_{1:n-1}) + s_n(x_{n-1}, x_n)$, this approximation can be calculated online along with the particle filter; see Cappé [2009] for an application exploiting this fact. With this approximation there is no need to store the entire ancestry of each particle, and the computational cost of calculating $\widehat{S}_n^\theta$ is linear in the number of particles, i.e. $O(N)$.
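The following fragment sketches how the path space estimate (3.17) is computed online inside a particle filter such as the bootstrap sketch of Section 3.3.2: each particle carries a running sum that is duplicated by resampling along with its path. The names reuse those of that earlier sketch, and the increment function s_n is an illustrative placeholder.

    # Inside the time-n loop of the bootstrap filter sketch, each particle i carries
    # S[i] = S_n(X_{1:n}^(i)) alongside its current state x[i]:
    idx = rng.choice(N, size=N, p=W)       # resampling duplicates the running sums too
    x_prev, S = x[idx], S[idx]
    x = f_sample(x_prev, rng)              # propagate (bootstrap proposal)
    S = S + s_n(x_prev, x)                 # S_n = S_{n-1} + s_n(x_{n-1}, x_n)
    w = g_density(yn, x)
    W = w / w.sum()
    S_hat = np.sum(W * S)                  # the path space estimate (3.17)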
However, this approximation relies on the SMC approximation of the joint distribution $\pi_{\theta,n}(dx_{1:n})$, which, as already mentioned in Section 2.5.2, is well known in the SMC literature to become progressively impoverished as $n$ increases because of the successive resampling steps. Indeed, under favourable mixing assumptions, Del Moral and Doucet [2003] established an upper bound on the $L_p$ error of the path space estimate in (3.17) which is proportional to $n^2/\sqrt{N}$; under similar assumptions, it was shown in Poyiadjis et al. [2011] that the asymptotic variance of the path space estimate increases at least quadratically with $n$.

An $O(N)$ SMC approach that reduces the variance is fixed-lag smoothing [Kitagawa and Sato, 2001], which uses the approximation
$$p_\theta(x_{1:k}|y_{1:n}) \approx p_\theta(x_{1:k}|y_{1:\min(n, k+\Delta)}), \qquad \Delta > 0, \qquad (3.18)$$
the idea being that for a large enough lag $\Delta$ the error introduced by the approximation is negligible. The SMC implementation of this approximation prevents the particle filter from updating the path $X_{1:k}$ beyond time $k + \Delta$ and hence reduces the variance resulting from path degeneracy. However, choosing the lag $\Delta$ is a difficult task, and this approach introduces a bias in the estimate of $S_n^\theta$ which does not vanish asymptotically in $N$; see Olsson et al. [2008].

3.3.5.1 Forward filtering backward smoothing

A standard alternative for computing $S_n^\theta$ is to use SMC approximations of fixed-interval smoothing techniques such as the forward filtering backward smoothing (FFBS) algorithm [Doucet et al., 2000b; Godsill et al., 2004]. Let us define the marginal smoothing distributions
$$\eta_{\theta,n,k}(dx_k) := \pi_{\theta,n}(dx_k) = p_\theta(x_k|y_{1:n}) \, \lambda(dx_k)$$
and the backward transition kernel $M_{\theta,n-1}: \mathsf{X} \to \mathcal{P}(\mathcal{E})$ such that
$$M_{\theta,n-1}(x_n, dx_{n-1}) = p_\theta(x_{n-1}|x_n, y_{1:n-1}) \, \lambda(dx_{n-1}).$$
FFBS relies on the additivity of the functional $S_n$ and on the identity $p_\theta(x_{t-1}, x_t|y_{1:n}) \, \lambda(dx_{t-1} \, dx_t) = \eta_{\theta,n,t}(dx_t) M_{\theta,t-1}(x_t, dx_{t-1})$ for $t \leq n$, which lead to
$$S_n^\theta = \sum_{t=1}^{n} [\eta_{\theta,n,t} \otimes M_{\theta,t-1}](s_t) = \sum_{t=1}^{n} \int \eta_{\theta,n,t}(dx_t) \, M_{\theta,t-1}(x_t, dx_{t-1}) \, s_t(x_{t-1}, x_t).$$
Moreover, once $\pi_{\theta,1}, \ldots, \pi_{\theta,n}$ are obtained up to time $n$ (forward filtering), $\eta_{\theta,n,1}, \ldots, \eta_{\theta,n,n}$ can be obtained with a backward recursion (backward smoothing) starting from $\eta_{\theta,n,n}(dx_n) = \pi_{\theta,n}(dx_n)$ and recursing back with
$$\eta_{\theta,n,t} = \eta_{\theta,n,t+1} M_{\theta,t}, \qquad t = n-1, \ldots, 1.$$
The SMC implementation of FFBS [Doucet et al., 2000b], which we will call SMC-FFBS, is based on the following alternative approximation of $\pi_{\theta,n}$:
$$\pi_{\theta,n}^{*,N} = \eta_{\theta,n,n}^N \otimes M_{\theta,n-1}^N \otimes \cdots \otimes M_{\theta,1}^N, \qquad (3.19)$$
where the particle approximations of the backward kernels are
$$M_{\theta,n-1}^N(x_n, dx_{n-1}) = \eta_{\theta,n-1,n-1}^N(dx_{n-1}) \, \frac{f_\theta(x_n|x_{n-1})}{\int \eta_{\theta,n-1,n-1}^N(dx'_{n-1}) \, f_\theta(x_n|x'_{n-1})}. \qquad (3.20)$$
Therefore, once $\pi_{\theta,1}^N, \ldots, \pi_{\theta,n}^N$ are obtained up to time $n$ (forward filtering), $\eta_{\theta,n,1}^N, \ldots, \eta_{\theta,n,n}^N$ can be obtained with a backward recursion (backward smoothing) starting from $\eta_{\theta,n,n}^N(dx_n) = \pi_{\theta,n}^N(dx_n)$ and recursing back with $\eta_{\theta,n,t}^N = \eta_{\theta,n,t+1}^N M_{\theta,t}^N$, $t = n-1, \ldots, 1$. The SMC approximation of FFBS then leads to the following estimate of the smoothed functional:
$$\widehat{S}_n^{*,\theta} = \sum_{t=1}^{n} \left[ \eta_{\theta,n,t}^N \otimes M_{\theta,t-1}^N \right](s_t).$$
The SMC implementation of FFBS requires $O(N^2)$ computations per time step, compared to the $O(N)$ of the path space approximation. In return, the estimator has better properties than the path space estimator: Douc et al. [2011] includes a central limit theorem for $\widehat{S}_n^{*,\theta}$ and time-uniform deviation inequalities for the SMC-FFBS approximations of the marginals $\{\eta_{\theta,n,t}\}_{1\leq t\leq n}$.
For alternative proofs of the results in Douc et al. [2011], see Del Moral et al. [2010]. Additionally, it was shown in Del Moral et al. [2009] that under strong mixing conditions the asymptotic variance of $\widehat{S}_n^{*,\theta}$ as $N \to \infty$ is linear in $n$. More general but more complicated results on the variance of $\widehat{S}_n^{*,\theta}$ under weaker conditions can be found in Del Moral et al. [2010].

3.3.5.2 Forward-only smoothing

Since it filters forwards and then smooths backwards, the FFBS algorithm is inherently offline, unlike the path space approximation. It may also be demanding in memory, since it requires the SMC filters $\eta_{\theta,t,t}^N$ to be stored up to time $n$. To circumvent the need for the backward pass in the computation of $S_n^\theta$, the following auxiliary function on $\mathsf{X}$ is introduced:
$$T_n^\theta(x_n) = M_{\theta,n-1} \otimes \cdots \otimes M_{\theta,1}[S_n(\cdot, x_n)](x_n) = \mathbb{E}_\theta[S_n(X_{1:n})|X_n = x_n, y_{1:n-1}].$$
It is apparent that $S_n^\theta = \eta_{\theta,n,n}(T_n^\theta)$. A forward recursion to compute $\{T_n^\theta\}_{n\geq1}$, and hence $\{S_n^\theta\}_{n\geq1}$, is established by
$$T_n^\theta(x_n) = M_{\theta,n-1}\left[ T_{n-1}^\theta + s_n(\cdot, x_n) \right](x_n) = \mathbb{E}_\theta\left[ T_{n-1}^\theta(X_{n-1}) + s_n(X_{n-1}, x_n) \,\middle|\, x_n, y_{1:n-1} \right] \qquad (3.21)$$
for $n \geq 2$, with the initial condition $T_1^\theta(x_1) = s_1(x_1)$. Note that the online calculation of $T_n^\theta(x_n)$ requires only an integration with respect to the measure $M_{\theta,n-1}(x_n, \cdot)$, i.e. $p_\theta(x_{n-1}|x_n, y_{1:n-1})$. The recursion in (3.21) has been rediscovered independently several times (see e.g. Elliott and Krishnamurthy [1999]; Hernando et al. [2005]; Mongillo and Deneve [2008]) and was called the forward smoothing recursion in Del Moral et al. [2009].

A straightforward SMC implementation of the forward smoothing recursion uses $\pi_{\theta,n}^{*,N}$ in (3.19), so that $M_{\theta,n-1}$ in (3.21) is approximated by $M_{\theta,n-1}^N$ in (3.20). It can be shown that when this approximation is used, we calculate exactly the same quantity as SMC-FFBS; therefore the preferable statistical properties of SMC-FFBS are preserved. Moreover, although the online calculation still requires $O(N^2)$ operations per time step, it does not need to store the SMC filters $\{\eta_{\theta,t,t}^N\}_{1\leq t\leq n}$. Being a forward implementation of SMC-FFBS, we call this implementation SMC-forward smoothing, or SMC-FS. SMC-FS proves to be a very useful tool for online parameter estimation, as we shall see in Section 3.4 and throughout this thesis.

Algorithm 3.3. SMC-FS: forward-only SMC computation of FFBS for smoothing additive functionals

For $n = 1$:
• Compute the SMC approximation $\{X_1^{(i)}, W_1^{(i)}\}_{1\leq i\leq N}$ of $\eta_{\theta,1,1}$.
• For $i = 1, \ldots, N$, set $T_1^{(i)} = s_1(X_1^{(i)})$.
For $n = 2, 3, \ldots$:
• Compute the SMC approximation $\{X_n^{(i)}, W_n^{(i)}\}_{1\leq i\leq N}$ of $\eta_{\theta,n,n}$.
• For $i = 1, \ldots, N$, set
$$T_n^{(i)} = \frac{\sum_{j=1}^{N} \left[ T_{n-1}^{(j)} + s_n(X_{n-1}^{(j)}, X_n^{(i)}) \right] W_{n-1}^{(j)} f_\theta(X_n^{(i)}|X_{n-1}^{(j)})}{\sum_{j'=1}^{N} W_{n-1}^{(j')} f_\theta(X_n^{(i)}|X_{n-1}^{(j')})}.$$
• Calculate $\widehat{S}_n^{*,\theta} = \sum_{i=1}^{N} W_n^{(i)} T_n^{(i)}$.

We present the SMC-FS algorithm in Algorithm 3.3. We note that, since SMC-FS relies on particle estimates of the filtering distributions $\{\eta_{\theta,n,n}\}_{n\geq1}$ only, the marginal particle filter of Section 3.3.3 can be used in Algorithm 3.3 instead of the standard particle filter. Finally, note that the SMC implementation of the forward smoothing recursion using the path space approximation is trivial in the sense that it reduces to the approximation given in (3.16).
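A minimal sketch of the $O(N^2)$ update of Algorithm 3.3 is given below; it assumes a transition density $f_\theta$ that accepts broadcast arrays, and the filter producing the weighted particles, as well as the increment function s_n, are left abstract.

    import numpy as np

    def fs_update(T_prev, x_prev, W_prev, x, s_n, f_density):
        """One forward-smoothing update (Algorithm 3.3): computes T_n^(i), the
        expectation of T_{n-1} + s_n(., x_n^(i)) under the particle backward
        kernel (3.20)."""
        F = f_density(x[:, None], x_prev[None, :])        # f(x_n^(i) | x_{n-1}^(j)), N x N
        B = W_prev[None, :] * F                           # unnormalised backward weights
        B /= B.sum(axis=1, keepdims=True)                 # normalise over j for each i
        S = s_n(x_prev[None, :], x[:, None])              # s_n(x_{n-1}^(j), x_n^(i))
        return (B * (T_prev[None, :] + S)).sum(axis=1)    # T_n^(i)

    # After each filter step the smoothed additive functional estimate is
    # S_hat = np.sum(W * T), matching the last line of Algorithm 3.3.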
3.4 Static parameter estimation in HMMs

One problem that is dealt with at length in the literature is that of estimating the true static parameter $\theta^*$ of the HMM given observations $y_{1:n}$ up to time $n$. There are two main approaches to solving the parameter estimation problem: the Bayesian approach and the maximum likelihood approach. We briefly summarise Bayesian methods and then give a more detailed review of maximum likelihood parameter estimation methods. We refer the interested reader to Kantas et al. [2009] for a comprehensive review of SMC methods that have been proposed for static parameter estimation in HMMs.

Bayesian parameter estimation: In the Bayesian approach, the static parameter is treated as a random variable taking values $\theta$ in $\Theta$ with a probability density $\eta(\theta)$ with respect to a dominating measure $d\theta$, and the aim is to evaluate the density of the posterior distribution of $\theta$ given $y_{1:n}$, which follows from Bayes' theorem as
$$\eta(\theta|y_{1:n}) = \frac{\eta(\theta) \, p_\theta(y_{1:n})}{\int p_\theta(y_{1:n}) \, \eta(\theta) \, d\theta}. \qquad (3.22)$$
When the likelihood $p_\theta(y_{1:n})$ is analytically available, one can simply apply an MCMC scheme targeting the posterior $\eta(\theta|y_{1:n})$. An MCMC algorithm can be inefficient when $n$ is large; however, online Bayesian methods are also available. For example, the method in Chopin [2002] is based on the SMC approximation of the sequence of distributions $\{\eta(d\theta|y_{1:t})\}_{1\leq t\leq n}$. This approach is equivalent to the resample-move algorithm described in Gilks and Berzuini [2001], which is a special SMC sampler.

More sophisticated techniques are required when $p_\theta(y_{1:n})$ cannot be computed, which is usually the case for a general HMM. The methods developed for this case consider
• the joint posterior density $p(\theta, x_{1:n}|y_{1:n}) \propto \eta(\theta) \, p_\theta(x_{1:n}, y_{1:n})$ in the batch parameter estimation setting,
• the sequence of posterior densities $\{p(\theta, x_{1:t}|y_{1:t})\}_{1\leq t\leq n}$ in the online parameter estimation setting.
One elegant method used in the batch estimation setting is particle MCMC (PMCMC) [Andrieu et al., 2010]. Notice that an ideal Metropolis-Hastings algorithm targeting $p(\theta, x_{1:n}|y_{1:n})$ is not feasible in general, since it requires exact sampling from $p_\theta(x_{1:n}|y_{1:n})$ and exact calculation of $p_\theta(y_{1:n})$. A particle version of the Metropolis-Hastings algorithm, called PMMH in Andrieu et al. [2010], runs an SMC filter for $p_\theta(x_{1:n}|y_{1:n})$ with $N$ particles and uses the SMC approximation of the unknown quantity $p_\theta(y_{1:n})$. The validity of this approach is not trivial to show; see, again, Andrieu et al. [2010] for a derivation. In the same work, a particle version of the Gibbs sampler was also developed. In Andrieu et al. [2010], the variance of the acceptance rate of the PMMH algorithm was numerically shown to be proportional to $n/N$ under favourable mixing conditions. This suggests that one needs to increase the number of particles linearly with $n$ in order to keep the performance of the PMCMC algorithm at a given level. Therefore, for large $n$, PMCMC may not be practical and online parameter estimation methods may be required.

Although with possible modifications, all of the Bayesian methods for online parameter estimation rely on the SMC approximation of the sequence of distributions $p(\theta, x_{1:t}|y_{1:t})$, $1 \leq t \leq n$. At first sight, it seems easy to achieve this using standard SMC methods by introducing the extended state $\{\theta_n, X_n\}_{n\geq1}$ with initial distribution $\mu_{\theta_1}(x_1)\lambda(dx_1)\eta(\theta_1)d\theta_1$ and transition distribution $f_{\theta_n}(x_n|x_{n-1})\lambda(dx_n)\delta_{\theta_{n-1}}(d\theta_n)$. This implies $\theta_n = \theta_{n-1}$; therefore such an SMC algorithm explores the parameter space only at its initialisation. As a result of successive resampling steps, we will end up with only a single value for $\theta$, so that the resulting approximation of the marginal distribution $\eta(d\theta|y_{1:n})$ is clearly poor.
Several methods have been proposed to avoid degeneracy of the particles for the static parameter of the HMM. We briefly mention them below.

One approach to avoiding degeneracy, proposed originally in Gilks and Berzuini [2001], is based on adding MCMC steps to re-introduce particle diversity. Assume that the SMC approximation of $p(\theta, x_{1:n}|y_{1:n})$ at time $n$ contains particles $(\theta_n^{(i)}, X_{1:n}^{(i)})$, $i = 1, \ldots, N$, with equal weights. To add diversity to this population, an MCMC kernel $K_n$ which leaves $p(d(\theta, x_{1:n})|y_{1:n})$ invariant is applied to each of the particles. One remarkable point is that the MCMC kernel need not be ergodic; indeed, in practice one designs $K_n$ so that it moves only $\theta^{(i)}$ and the last $L$ components of $X_{1:n}^{(i)}$. A first use of this method in an online Bayesian parameter estimation context appears in Andrieu et al. [1999], where $K_n$ is taken to be a Gibbs move that updates the parameter value only, i.e.
$$K_n[(x_{1:n}, \theta), d(x'_{1:n}, \theta')] = \delta_{x_{1:n}}(dx'_{1:n}) \, p(\theta'|x_{1:n}, y_{1:n}) \, d\theta'.$$
Similar strategies were used in Fearnhead [2002] and Storvik [2002]. The use of MCMC steps within SMC is particularly elegant when $(x_{1:n}, y_{1:n})$ can be summarised by a set of fixed-dimensional sufficient statistics, since then the memory and computational requirements for calculating densities such as $p_\theta(y_{1:n}|x_{1:n})$ or $p(\theta|x_{1:n}, y_{1:n})$ do not increase with time. Unfortunately, these MCMC-based methods suffer from the path degeneracy problem of the SMC approximation: the error in the estimate of $p_\theta(x_{1:n}|y_{1:n})$ leads to an error in the sufficient statistics used, and these errors build up over time. This disadvantage was first noticed in Andrieu et al. [1999], and a convincing example was provided in Andrieu et al. [2005]. Another MCMC-based online Bayesian estimation method is called practical filtering [Polson et al., 2008], which relies on a fixed-lag approximation as in (3.18). As for all fixed-lag approaches, it is hard to tune the amount of lag and to control the non-vanishing bias introduced by the approximation.

As an alternative to MCMC-based methods for avoiding degeneracy, another class of methods is based on introducing artificial dynamics for the parameter [Higuchi, 2001; Kitagawa, 1998]. More explicitly, it is assumed that
$$\theta_1 \sim \eta(\theta_1), \qquad \theta_n = \theta_{n-1} + \epsilon_n, \quad n \geq 2,$$
where $\epsilon_n$ is a small artificial zero-mean noise whose variance decreases with $n$. Obviously, SMC applied to approximate $\{p(\theta_n, x_{1:n}|y_{1:n})\}_{n\geq1}$ under this assumption has better degeneracy properties than before. This approach is closely related to the kernel density estimation method of Liu and West [2001], which proposes regularising the empirical measure of the posterior distribution of the parameter by smoothing it with a kernel density, such as a Gaussian or Epanechnikov kernel. A more general approach, in which the kernel smoothing is also applied to the components of the HMM, is given in Campillo and Rossi [2009]. All of these methods introducing artificial dynamics for the parameter require a significant amount of tuning and suffer from a bias that is hard to quantify.

Maximum likelihood parameter estimation: In the maximum likelihood approach to parameter estimation, one has a point estimate obtained by calculating the value of $\theta$ that maximises the likelihood $p_\theta(y_{1:n})$ over all possible values of $\theta$, i.e.
$$\theta_{\mathrm{ML}} = \arg\max_{\theta \in \Theta} \, p_\theta(y_{1:n}).$$
This procedure is called maximum likelihood estimation (MLE). In this thesis we will investigate methods for MLE applied to several time series models.
In the following we present some of the MLE methods directly applicable to HMMs.

3.4.1 Direct maximisation of the likelihood

The traditional approach to MLE is to maximise $p_\theta(y_{1:n})$ with respect to $\theta$ by direct calculation of $p_\theta(y_{1:n})$. Note that $p_\theta(y_{1:n})$ satisfies the recursion
$$p_\theta(y_{1:n}) = p_\theta(y_1) \prod_{t=2}^{n} p_\theta(y_t|y_{1:t-1}) = p_\theta(y_{1:n-1}) \, p_\theta(y_n|y_{1:n-1}). \qquad (3.23)$$
The incremental likelihood $p_\theta(y_n|y_{1:n-1})$ may be obtained by exploiting one of the expressions for it, such as (3.8) or (3.11), whichever is available. In practice, one works with the log-likelihood $l_\theta(y_{1:n}) = \log p_\theta(y_{1:n})$, which is numerically better behaved since the product in (3.23) is replaced by a sum.

It is rarely the case that the likelihood (or log-likelihood) is available in closed form and can be maximised analytically. When it is not in closed form but can still be calculated, grid-based methods, in which the likelihood is evaluated on a grid representation of $\Theta$ with sufficient resolution, can be used. When the likelihood cannot even be calculated exactly, SMC approximation can be applied. Let $\tau_1, \ldots, \tau_k$ be the times at which the resampling step is applied in the particle filter of Algorithm 3.1, and let $\tau_0 = 0$ and $\tau_{k+1} = n$. It is shown in Del Moral [2004, Chapter 7] that the following estimator of $p_\theta(y_{1:n})$ is unbiased:
$$p_\theta^N(y_{1:n}) = \prod_{j=1}^{k+1} p_\theta^N(y_{\tau_{j-1}+1:\tau_j}|y_{1:\tau_{j-1}}), \qquad p_\theta^N(y_{\tau_{j-1}+1:\tau_j}|y_{1:\tau_{j-1}}) = \frac{1}{N} \sum_{i=1}^{N} \prod_{t=\tau_{j-1}+1}^{\tau_j} w_{t|t-1}^{(i)}.$$
Based on this unbiased estimator, an estimate of $l_\theta(y_{1:n})$ is
$$l_\theta^N(y_{1:n}) = \sum_{j=1}^{k+1} \log p_\theta^N(y_{\tau_{j-1}+1:\tau_j}|y_{1:\tau_{j-1}}),$$
which is biased due to the non-linear transformation of the unbiased estimators. The bias can be reduced by using a standard technique based on a Taylor series expansion; see Andrieu et al. [2004].
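When resampling is performed at every time step, as in the bootstrap filter sketch of Section 3.3.2, each inter-resampling block contains a single observation and every factor of the estimator is simply the mean of the unnormalised incremental weights. The fragment below reuses the names of that sketch and omits the analogous $n = 1$ step; it is a sketch of this special case only, under those assumptions.

    # Log-likelihood estimate accumulated inside the bootstrap filter, assuming
    # multinomial resampling at every step (so each tau_j - tau_{j-1} = 1):
    log_lik = 0.0
    for yn in y[1:]:
        idx = rng.choice(N, size=N, p=W)
        x = f_sample(x[idx], rng)
        w = g_density(yn, x)               # unnormalised incremental weights w_{n|n-1}
        log_lik += np.log(w.mean())        # log of the unbiased factor for p(y_n | y_{1:n-1})
        W = w / w.sum()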
Direct maximisation of the likelihood by calculating it pointwise is not a practical approach unless $\Theta$ is a discrete space with a small number of elements or a continuous space that can be well approximated by a grid. Unfortunately, these conditions fail in almost all cases of interest, mainly because $\theta$ is of large dimension. In the following we review two alternative approaches that maximise $p_\theta(y_{1:n})$ (at least locally) indirectly, without calculating it.

3.4.2 Gradient ascent maximum likelihood

Gradient-based maximum likelihood methods work with the gradient of the log-likelihood rather than with the log-likelihood itself. The gradient ascent algorithm is an iterative procedure implemented as follows. We begin with $\theta^{(0)}$ and assume that we have the estimate $\theta^{(j-1)}$ at the end of the $(j-1)$'th iteration. At the $j$'th iteration we update the parameter as
$$\theta^{(j)} = \theta^{(j-1)} + \gamma_j \, \nabla_\theta l_\theta(y_{1:n})\big|_{\theta = \theta^{(j-1)}}.$$
The gradient term $\nabla_\theta l_\theta(y_{1:n})$ is also called the score vector. Here $\{\gamma_j\}_{j\geq1}$ is a sequence of step sizes satisfying
$$\sum_{j\geq0} \gamma_j = \infty, \qquad \sum_{j\geq0} \gamma_j^2 < \infty, \qquad (3.24)$$
which ensures convergence of the algorithm when it is used with Monte Carlo approximations $\nabla_\theta^N l_\theta(y_{1:n})$ of the score vectors. A common choice is $\gamma_j = j^{-a}$ for $0.5 < a \leq 1$.

One way to calculate the gradient term is to use Fisher's identity for the score vector,
$$\nabla_\theta l_\theta(y_{1:n}) = \int \nabla_\theta \log p_\theta(x_{1:n}, y_{1:n}) \, p_\theta(x_{1:n}|y_{1:n}) \, \lambda(dx_{1:n}), \qquad (3.25)$$
i.e. the expectation, with respect to the posterior distribution of the latent variables, of the gradient of the complete-data log-likelihood. Equation (3.25) can be rewritten as
$$\nabla_\theta l_\theta(y_{1:n}) = \pi_{\theta,n}(S_{\theta,n}), \qquad (3.26)$$
where $S_{\theta,n}: \mathsf{X}^n \to \mathbb{R}^{d_\theta}$ is the additive function of $x_{1:n}$
$$S_{\theta,n}(x_{1:n}) = \sum_{t=1}^{n} s_{\theta,t}(x_{t-1}, x_t), \qquad (3.27)$$
$$s_{\theta,1}(x_0, x_1) = s_{\theta,1}(x_1) = \nabla_\theta \log \mu_\theta(x_1) + \nabla_\theta \log g_\theta(y_1|x_1),$$
$$s_{\theta,t}(x_{t-1}, x_t) = \nabla_\theta \log f_\theta(x_t|x_{t-1}) + \nabla_\theta \log g_\theta(y_t|x_t), \qquad t \geq 2.$$
Notice that since $S_{\theta,n}$ is in additive form, the approximation of its expectation $\pi_{\theta,n}(S_{\theta,n})$ can be carried out with one of the Monte Carlo methods mentioned in Section 3.3.5 whenever exact calculation of $\pi_{\theta,n}(S_{\theta,n})$ is not available. An SMC estimate of the score vector using the $O(N)$ path space approximation was provided in Andrieu et al. [2004]. However, it was shown in Poyiadjis et al. [2011] that the variance of this estimate typically increases quadratically with $n$. For this reason, Poyiadjis et al. [2011] proposed to use the $O(N^2)$ method based on FFBS to estimate $\nabla_\theta l_\theta(y_{1:n})$, and it was shown in Del Moral et al. [2011] that this SMC estimate is stable.

An alternative to Fisher's identity for computing the score vector $\nabla_\theta l_\theta(y_{1:n})$ is a method based on infinitesimal perturbation analysis, proposed in Coquelin et al. [2009]. This method also estimates the expectation, with respect to $p_\theta(x_{1:n}|y_{1:n})$, of an additive functional of the form $\sum_{t=1}^{n} s_\theta(x_{t-1}, x_t)$; hence all the SMC smoothing techniques described in Section 3.3.5 can also be applied to estimate this expectation.

3.4.2.1 Online gradient ascent

The batch gradient ascent MLE algorithm may be inefficient when $n$ is large, since each iteration requires a complete pass over the whole data sequence. An alternative to the batch algorithm is possible via online calculation of the score vector, leading to a recursive maximum likelihood algorithm which we will call online gradient ascent. An online gradient ascent algorithm can be implemented as follows [Del Moral et al., 2011; Poyiadjis et al., 2011]. Let $\theta_1$ be the initial guess of $\theta^*$ before having made any observations, and let $\theta_{1:n}$ be the sequence of parameter estimates of the online gradient ascent algorithm computed sequentially based on $y_{1:n-1}$. When $y_n$ is received, we update the parameter as
$$\theta_{n+1} = \theta_n + \gamma_n \, \nabla_\theta \log p_\theta(y_n|y_{1:n-1})\big|_{\theta = \theta_n}. \qquad (3.28)$$
The incremental gradients $\nabla_\theta \log p_\theta(y_n|y_{1:n-1})$ can be calculated sequentially from the gradients $\nabla_\theta l_\theta(y_{1:n})$ using the relation
$$\nabla_\theta \log p_\theta(y_n|y_{1:n-1}) = \nabla_\theta l_\theta(y_{1:n}) - \nabla_\theta l_\theta(y_{1:n-1}) = \pi_{\theta,n}(S_{\theta,n}) - \pi_{\theta,n-1}(S_{\theta,n-1}). \qquad (3.29)$$
However, since $\theta_n$ changes over time, (3.29), and hence (3.28), is impractical to calculate sequentially. In practice, the integrals $\pi_{\theta_n,n}(S_{\theta_n,n})$ are approximated by
$$\pi_{\theta_{1:n},n}\left( \sum_{t=1}^{n} s_{\theta_t}(x_{t-1}, x_t) \right),$$
where $\theta_{1:n}$ in $\pi_{\theta_{1:n},n}$ indicates that the distributions are calculated sequentially with varying $\theta$'s. This approach has previously appeared in the literature for finite state-space HMMs; see e.g. Le Gland and Mevel [1997] and Collings and Ryden [1998]. The asymptotic properties of this algorithm, i.e. the behaviour of $\theta_n$ in the limit as $n$ goes to infinity, have been studied by Titterington [1984] for i.i.d. hidden processes and by Le Gland and Mevel [1997] for finite state-space HMMs. It is shown in Le Gland and Mevel [1997] that under regularity conditions this algorithm converges to a local maximum of the average log-likelihood, and that the average log-likelihood is maximised at $\theta^*$.
An SMC online gradient ascent method, which can be seen as a particle version of the recursive maximum likelihood algorithm of Le Gland and Mevel [1997], is given in Algorithm 3.4.

Algorithm 3.4. SMC online gradient ascent algorithm

Choose $\theta_1$; set $\mathcal{S}_0 = 0$.
For $n = 1$:
• Compute the SMC approximation $\{X_1^{(i)}, W_1^{(i)}\}_{1\leq i\leq N}$ of $\eta_{\theta_1,1,1}$.
• For $i = 1, \ldots, N$, set $T_{\gamma,1}^{(i)} = \nabla_\theta \log \mu_{\theta_1}(X_1^{(i)}) + \nabla_\theta \log g_{\theta_1}(y_1|X_1^{(i)})$.
For $n = 2, 3, \ldots$:
• Compute the SMC approximation $\{X_n^{(i)}, W_n^{(i)}\}_{1\leq i\leq N}$ of $\eta_{\theta_{1:n},n,n}$.
• For $i = 1, \ldots, N$, set
$$T_{\gamma,n}^{(i)} = \frac{\sum_{j=1}^{N} \left[ (1-\gamma_n) T_{\gamma,n-1}^{(j)} + \gamma_n s_n(X_{n-1}^{(j)}, X_n^{(i)}) \right] W_{n-1}^{(j)} f_{\theta_n}(X_n^{(i)}|X_{n-1}^{(j)})}{\sum_{j'=1}^{N} W_{n-1}^{(j')} f_{\theta_n}(X_n^{(i)}|X_{n-1}^{(j')})},$$
where $s_n(X_{n-1}^{(j)}, X_n^{(i)}) = \nabla_\theta \log f_{\theta_n}(X_n^{(i)}|X_{n-1}^{(j)}) + \nabla_\theta \log g_{\theta_n}(y_n|X_n^{(i)})$.
• Calculate $\mathcal{S}_n = \sum_{i=1}^{N} W_n^{(i)} T_{\gamma,n}^{(i)}$ and set $\theta_{n+1} = \theta_n + \gamma_n (\mathcal{S}_n - \mathcal{S}_{n-1})$.

This algorithm is based on the $O(N^2)$ SMC approximation of (3.29) and calculates
$$\theta_{n+1} = \theta_n + \gamma_n \left[ \pi_{\theta_{1:n},n}^{*,N}\left( \sum_{t=1}^{n} s_{\theta_t}(x_{t-1}, x_t) \right) - \pi_{\theta_{1:n-1},n-1}^{*,N}\left( \sum_{t=1}^{n-1} s_{\theta_t}(x_{t-1}, x_t) \right) \right]. \qquad (3.30)$$
In Poyiadjis et al. [2011], this algorithm is used with the MPF described in Section 3.3.3 in order to approximate the filtering distributions $\eta_{\theta,n,n}$. A very similar algorithm, equivalent to Algorithm 3.4 in principle, can be found in Del Moral et al. [2011]; the difference is that the authors include $\theta_n$ in the calculation of the second term in (3.29) by using the relation
$$\pi_{\theta,n-1}(S_{\theta,n-1}) = \widehat{\pi}_{\theta,n}\big( S_{\theta,n-1} + \nabla_\theta \log f_\theta(x_n|x_{n-1}) \big),$$
where we recall that here $\widehat{\pi}_{\theta,n}$ denotes the distribution of $X_{1:n}$ conditioned on $y_{1:n-1}$, i.e. the path prediction distribution $\pi^p_{\theta,n}$ of Section 3.3.2. Hence, (3.30) is replaced by
$$\theta_{n+1} = \theta_n + \gamma_n \left[ \pi_{\theta_{1:n},n}^{*,N}\left( \sum_{t=1}^{n} s_{\theta_t}(x_{t-1}, x_t) \right) - \widehat{\pi}_{\theta_{1:n},n}^{*,N}\left( \nabla_\theta \log f_{\theta_n}(x_n|x_{n-1}) + \sum_{t=1}^{n-1} s_{\theta_t}(x_{t-1}, x_t) \right) \right],$$
and the online implementation of this update is derived using the filter derivative at time $n$. Similarly to the $O(N^2)$ particle approximation of $\pi_{\theta,n}(S_{\theta,n})$, the $O(N^2)$ particle approximation of $\widehat{\pi}_{\theta,n}(S_{\theta,n})$ can be performed by taking
$$\widehat{\pi}_{\theta,n}^{*,N} = \widehat{\eta}_{\theta,n,n}^N \otimes M_{\theta,n-1}^N \otimes \cdots \otimes M_{\theta,1}^N,$$
where $\widehat{\eta}_{\theta,n,n}^N(dx_n) = \widehat{\pi}_{\theta,n}^N(dx_n)$ is the particle approximation of the one-step prediction distribution obtained by marginalising the path particle approximation $\widehat{\pi}_{\theta,n}^N$.
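A single step of Algorithm 3.4 can be sketched by combining the forward-smoothing update of Section 3.3.5.2 with discounted score increments; the function score_increment, which is assumed to return $\nabla_\theta \log f_{\theta_n} + \nabla_\theta \log g_{\theta_n}$ for a scalar parameter, and the surrounding filter variables are illustrative assumptions.

    # One step of Algorithm 3.4 for a scalar parameter theta, reusing the broadcast
    # conventions of fs_update in Section 3.3.5.2:
    F = f_density(x[:, None], x_prev[None, :])                   # f_{theta_n}(x_n^(i) | x_{n-1}^(j))
    B = W_prev[None, :] * F
    B /= B.sum(axis=1, keepdims=True)                            # particle backward weights
    s = score_increment(x_prev[None, :], x[:, None], yn, theta)  # grad log f + grad log g
    T = (B * ((1.0 - gamma) * T_prev[None, :] + gamma * s)).sum(axis=1)
    S = np.sum(W * T)
    theta = theta + gamma * (S - S_prev)                         # parameter update of Algorithm 3.4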
3.4.3 Expectation-Maximisation

The expectation-maximisation (EM) algorithm [Dempster et al., 1977] is one of the most popular methods for MLE. Given $Y_{1:n} = y_{1:n}$, the EM algorithm for maximising $p_\theta(y_{1:n})$ is the following iterative procedure: if $\theta^{(j)}$ is the estimate of the EM algorithm at the $j$'th iteration, then at iteration $j+1$ the estimate is updated by first calculating the intermediate optimisation criterion known as the expectation (E) step,
$$Q(\theta^{(j)}, \theta) = \int \log p_\theta(x_{1:n}, y_{1:n}) \, p_{\theta^{(j)}}(x_{1:n}|y_{1:n}) \, \lambda(dx_{1:n}) = \mathbb{E}_{\theta^{(j)}}[\log p_\theta(X_{1:n}, y_{1:n})|y_{1:n}]. \qquad (3.31)$$
The updated estimate is then computed in the maximisation (M) step,
$$\theta^{(j+1)} = \arg\max_{\theta \in \Theta} \, Q(\theta^{(j)}, \theta). \qquad (3.32)$$
The EM algorithm produces a sequence $\{\theta^{(j)}\}_{j\geq1}$ such that $\{p_{\theta^{(j)}}(y_{1:n})\}_{j\geq1}$ is non-decreasing, and under mild conditions this sequence is guaranteed to converge to a maximum point of $p_\theta(y_{1:n})$. In practice, the procedure in (3.31) and (3.32) is repeated until $\theta^{(j)}$ ceases to change significantly.

One important observation here is that the integrand in (3.31), the joint log-density of the complete data $(x_{1:n}, y_{1:n})$, has the following additive structure:
$$\log p_\theta(x_{1:n}, y_{1:n}) = \log \mu_\theta(x_1) + \log g_\theta(y_1|x_1) + \sum_{t=2}^{n} \left[ \log f_\theta(x_t|x_{t-1}) + \log g_\theta(y_t|x_t) \right]. \qquad (3.33)$$
Moreover, equation (3.33) suggests that when $p_\theta(x_{1:n}, y_{1:n})$ belongs to the exponential family with respect to $\theta$, there exist an integer $r > 0$ and functions $s_{i,t}: \mathsf{X} \times \mathsf{X} \to \mathbb{R}$, $i = 1, \ldots, r$, $t \geq 1$, such that the E-step and M-step of the EM algorithm reduce to calculating
$$S_{i,n}^{\theta^{(j)}} = \pi_{\theta^{(j)},n}(S_{i,n}) = \mathbb{E}_{\theta^{(j)}}[S_{i,n}(X_{1:n})|y_{1:n}], \qquad S_{i,n}(x_{1:n}) = \sum_{t=1}^{n} s_{i,t}(x_{t-1}, x_t), \quad i = 1, \ldots, r,$$
and applying a maximisation rule $\Lambda: \mathbb{R}^r \to \Theta$ to compute (3.32), so that
$$\theta^{(j+1)} = \Lambda\left( S_{1,n}^{\theta^{(j)}}, \ldots, S_{r,n}^{\theta^{(j)}} \right). \qquad (3.34)$$
The functionals $S_{1,n}, \ldots, S_{r,n}$ are also called the sufficient statistics of the complete data $(x_{1:n}, y_{1:n})$.

3.4.3.1 Stochastic versions of EM

The intermediate function $Q(\theta^{(j)}, \theta)$ of the EM algorithm can be computed exactly in only a few classes of HMMs, such as linear Gaussian HMMs and finite state-space HMMs. When $Q(\theta^{(j)}, \theta)$ cannot be computed exactly, Monte Carlo approximation must be used to estimate it numerically. The additive structure of $\log p_\theta(x_{1:n}, y_{1:n})$ allows us to use several SMC smoothing techniques for estimating $Q(\theta^{(j)}, \theta)$; see Andrieu et al. [2004] for the path space approximation, Olsson et al. [2008] for the fixed-lag approximation, Wills et al. [2008] for the FFBS approximation, and Briers et al. [2010] for generalised two-filter smoothing. Using a Monte Carlo estimate of the intermediate function leads to stochastic versions of the EM algorithm. Three main stochastic versions have been proposed in the literature; we review them below.

• If we use a constant number $N$ of particles for all iterations, the resulting algorithm is called the stochastic EM algorithm (SEM) [Celeux and Diebolt, 1985]. Since the Monte Carlo variance is never reduced over the iterations, this algorithm does not converge to a point in $\Theta$; however, one expects to obtain an ergodic homogeneous Markov chain of estimates $\{\theta^{(j)}\}_{j\geq0}$ whose stationary distribution is concentrated around $\theta_{\mathrm{ML}}$ [Nielsen, 2000].

• The settling of the SEM Markov chain into its equilibrium may take a long time. An alternative to SEM, introduced in Wei and Tanner [1990], is Monte Carlo EM (MCEM). In MCEM, the number of particles used for the Monte Carlo approximation increases with $j$ in order to ensure convergence to the maximum likelihood estimate $\theta_{\mathrm{ML}}$, rather than convergence to a stationary distribution around it. The disadvantage of this approach is the increasing amount of computational resources required as the number of particles grows over the iterations.

• Another stochastic version of the EM algorithm involves a stochastic approximation procedure, and is hence called stochastic approximation EM (SAEM) [Delyon et al., 1999]. In SAEM, the E-step involves a weighted average of the approximations of the intermediate quantity of EM obtained in the current as well as the previous iterations. Specifically, consider a step-size sequence $\{\gamma_j\}_{j\geq0}$ satisfying the conditions in (3.24). Then we calculate the weighted average of the estimates $Q^N(\theta^{(j)}, \theta)$ of the intermediate functions recursively as
$$Q_{\gamma,j}(\theta) = (1-\gamma_j) \, Q_{\gamma,j-1}(\theta) + \gamma_j \, Q^N(\theta^{(j)}, \theta),$$
with the initialisation $Q_{\gamma,-1}(\theta) = 0$; at the M-step of iteration $j$, $\theta^{(j+1)}$ is set to be the maximiser of $Q_{\gamma,j}(\theta)$ with respect to $\theta$. When $p_\theta(x_{1:n}, y_{1:n})$ is in the exponential family, the above recursion can be written in terms of the smoothed estimates of the sufficient statistics; we will see a use of SAEM in this case in Chapter 5.
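Returning to the exponential-family setting of (3.34), a concrete illustration may help. Consider the scalar model with $f_\theta(x_t|x_{t-1}) = \mathcal{N}(x_t; a x_{t-1}, \sigma^2)$ and $\theta = (a, \sigma^2)$, where the observation density is assumed free of $\theta$; ignoring the initial term, the sufficient statistics are $S_{1,n} = \sum_{t=2}^{n} x_{t-1}^2$, $S_{2,n} = \sum_{t=2}^{n} x_{t-1} x_t$ and $S_{3,n} = \sum_{t=2}^{n} x_t^2$, and the mapping $\Lambda$ is available in closed form. A minimal sketch under these assumptions:

    def Lambda_ar1(S1, S2, S3, n):
        """Closed-form M-step for X_t = a X_{t-1} + sigma V_t, V_t ~ N(0, 1):
        maximises the expected complete-data log-likelihood given the smoothed
        sufficient statistics S1, S2, S3 (smoothed expectations of the sums above)."""
        a = S2 / S1                                       # regression coefficient
        sigma2 = (S3 - 2 * a * S2 + a * a * S1) / (n - 1) # residual variance
        return a, sigma2

In the online EM algorithm below, the statistics are replaced by their running averages, in which case the divisor $n-1$ drops out of the corresponding mapping.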
3.4.3.2 Online EM

The online EM algorithm [Cappé, 2009, 2011; Elliott et al., 2002; Kantas et al., 2009; Mongillo and Deneve, 2008] is a variation on batch EM where, as in the online gradient ascent algorithm, the parameter is re-estimated each time a new observation is collected. We assume that $p_\theta(x_{1:n}, y_{1:n})$ is in the exponential family and that sufficient statistics exist, so that the M-step can be characterised by (3.34). In the online EM algorithm, running averages of the $S_{i,n}^\theta$ are computed. Specifically, let $\gamma = \{\gamma_n\}_{n\geq1}$, called the step-size sequence, be a positive decreasing sequence satisfying $\sum_{n\geq1} \gamma_n = \infty$ and $\sum_{n\geq1} \gamma_n^2 < \infty$. Let $\theta_1$ be the initial guess of $\theta^*$ before having made any observations, and let $\theta_{1:n}$ be the sequence of parameter estimates of the online EM algorithm computed sequentially based on $y_{1:n-1}$. When $y_n$ is received, online EM computes, for $i = 1, \ldots, r$,
$$T_{\gamma,i,n}(x_n) = M_{\theta_{1:n},n-1}\left[ (1-\gamma_n) T_{\gamma,i,n-1} + \gamma_n s_{i,n}(\cdot, x_n) \right](x_n), \qquad (3.35)$$
$$\mathcal{S}_{i,n} = \eta_{\theta_{1:n},n,n}(T_{\gamma,i,n}), \qquad (3.36)$$
and then sets
$$\theta_{n+1} = \Lambda(\mathcal{S}_{1,n}, \ldots, \mathcal{S}_{r,n}).$$
The subscript $\theta_{1:n}$ on $M_{\theta_{1:n},n-1}$ and $\eta_{\theta_{1:n},n,n}$ indicates that these laws are computed sequentially using the parameter $\theta_t$ at time $t$, $t \leq n$. In practice, the maximisation step is not executed until a burn-in time $n_b$, for added stability of the estimators, as discussed in Cappé [2009].

The online EM algorithm can be implemented exactly for linear Gaussian state-space models [Elliott et al., 2002] and for finite state-space HMMs [Cappé, 2011; Mongillo and Deneve, 2008]. An exact implementation is not possible for general state-space models, so SMC implementations of the online EM algorithm are used. Both the $O(N)$ and the $O(N^2)$ approximations have been used for the SMC implementation of online EM in the literature; we present them in Algorithms 3.5 and 3.6. The first SMC online EM algorithm, proposed in Cappé [2009], uses the path space approximation of equations (3.35) and (3.36), resulting in Algorithm 3.5. The $O(N^2)$ approximation was proposed in Del Moral et al. [2009], resulting in Algorithm 3.6.

Algorithm 3.5. SMC online EM: $O(N)$ implementation

Choose $\theta_1$.
For $n = 1$:
• Compute the SMC approximation $\{X_1^{(i)}, W_1^{(i)}\}_{1\leq i\leq N}$ of $\pi_{\theta_1,1}$.
• For $i = 1, \ldots, N$ and $k = 1, \ldots, r$, set $T_{\gamma,k,1}^{(i)} = s_{k,1}(X_1^{(i)})$.
For $n = 2, 3, \ldots$:
• Compute the SMC approximation $\{X_{1:n}^{(i)}, W_n^{(i)}\}_{1\leq i\leq N}$ of $\pi_{\theta_{1:n},n}$. Construct the $N \times 1$ vector $A$ of resampling indices such that $X_{1:n}^{(i)} = (X_{1:n-1}^{(A(i))}, X_n^{(i)})$.
• For $i = 1, \ldots, N$: set $j = A(i)$ and compute, for $k = 1, \ldots, r$,
$$T_{\gamma,k,n}^{(i)} = (1-\gamma_n) \, T_{\gamma,k,n-1}^{(j)} + \gamma_n \, s_{k,n}(X_{n-1}^{(j)}, X_n^{(i)}).$$
• If $n \geq n_b$, calculate $\mathcal{S}_{k,n} = \sum_{i=1}^{N} W_n^{(i)} T_{\gamma,k,n}^{(i)}$ for $k = 1, \ldots, r$ and set $\theta_{n+1} = \Lambda(\mathcal{S}_{1,n}, \ldots, \mathcal{S}_{r,n})$; else, set $\theta_{n+1} = \theta_n$.

Algorithm 3.6. SMC online EM: $O(N^2)$ implementation

Choose $\theta_1$.
For $n = 1$:
• Compute the SMC approximation $\{X_1^{(i)}, W_1^{(i)}\}_{1\leq i\leq N}$ of $\eta_{\theta_1,1,1}$.
• For $i = 1, \ldots, N$ and $k = 1, \ldots, r$, set $T_{\gamma,k,1}^{(i)} = s_{k,1}(X_1^{(i)})$.
For $n = 2, 3, \ldots$:
• Compute the SMC approximation $\{X_n^{(i)}, W_n^{(i)}\}_{1\leq i\leq N}$ of $\eta_{\theta_{1:n},n,n}$.
• For $i = 1, \ldots, N$ and $k = 1, \ldots, r$, set
$$T_{\gamma,k,n}^{(i)} = \frac{\sum_{j=1}^{N} \left[ (1-\gamma_n) T_{\gamma,k,n-1}^{(j)} + \gamma_n s_{k,n}(X_{n-1}^{(j)}, X_n^{(i)}) \right] W_{n-1}^{(j)} f_{\theta_n}(X_n^{(i)}|X_{n-1}^{(j)})}{\sum_{j'=1}^{N} W_{n-1}^{(j')} f_{\theta_n}(X_n^{(i)}|X_{n-1}^{(j')})}.$$
• If $n \geq n_b$, calculate $\mathcal{S}_{k,n} = \sum_{i=1}^{N} W_n^{(i)} T_{\gamma,k,n}^{(i)}$ for $k = 1, \ldots, r$ and set $\theta_{n+1} = \Lambda(\mathcal{S}_{1,n}, \ldots, \mathcal{S}_{r,n})$; else, set $\theta_{n+1} = \theta_n$.
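A minimal sketch of the $O(N)$ implementation in Algorithm 3.5, with bootstrap proposals and multinomial resampling at every step, is given below; the model functions, the step-size exponent, and the burn-in are illustrative assumptions. For the AR(1) example above, Lambda would act on the time-averaged statistics.

    import numpy as np

    def smc_online_em(y, N, theta1, mu_sample, f_sample, g_density, s_funcs, Lambda, nb, rng):
        """O(N) SMC online EM (Algorithm 3.5). s_funcs is a list of r functions
        s_k(x_prev, x, y_n) computing the sufficient-statistic increments; Lambda
        maps the r running averages to a new parameter value."""
        theta = theta1
        x = mu_sample(N, theta, rng)                     # X_1^(i) ~ mu_theta
        T = [s(x, x, y[0]) for s in s_funcs]             # T_{k,1} = s_{k,1}(X_1); x_prev unused at n = 1
        w = g_density(y[0], x, theta)
        W = w / w.sum()
        for n, yn in enumerate(y[1:], start=2):
            gamma = n ** -0.8                            # step size gamma_n
            idx = rng.choice(N, size=N, p=W)             # resampling index vector A
            x_prev = x[idx]
            T = [t[idx] for t in T]                      # statistics follow their paths
            x = f_sample(x_prev, theta, rng)
            T = [(1 - gamma) * t + gamma * s(x_prev, x, yn) for t, s in zip(T, s_funcs)]
            w = g_density(yn, x, theta)
            W = w / w.sum()
            if n >= nb:                                  # M-step after the burn-in time
                theta = Lambda(*[np.sum(W * t) for t in T])
        return theta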
3.4.4 Iterated filtering

Iterated filtering [Ionides et al., 2011] is a batch MLE method for HMMs that can be useful for models with non-linear state-space dynamics. The iterated filtering algorithm works as follows. We begin with $\theta^{(0)}$ and assume that at the end of the $(j-1)$'th iteration we have the estimate $\theta^{(j-1)}$. Iterated filtering extends the HMM to $\{X_t, \theta_t, Y_t\}_{t\geq1}$ by introducing a slowly moving Markov chain $\{\theta_t\}_{t\geq1}$ for the static parameter. At iteration $j$, the Markov chain for $\{\theta_t\}_{t\geq1}$ is a random walk, typically with Gaussian steps,
$$\theta_1 \sim \mathcal{N}(\theta^{(j-1)}, \tau_j^2 \Sigma), \qquad \theta_k|\theta_{k-1} \sim \mathcal{N}(\theta_{k-1}, \sigma_j^2 \Sigma), \quad k \geq 2. \qquad (3.37)$$
At iteration $j$, one runs an SMC filter for $\{X_t, \theta_t, Y_t\}_{t\geq1}$ with $N_j$ particles and calculates at every time step the mean and variance estimates of $\theta_t$ with respect to the filtering and prediction densities, respectively:
$$m_t = \mathbb{E}_{\theta^{(j-1)}}[\theta_t|y_{1:t}], \qquad V_t = \mathrm{var}_{\theta^{(j-1)}}[\theta_t|y_{1:t-1}], \qquad t = 1, \ldots, n. \qquad (3.38)$$
Denoting the SMC estimates of these quantities by $\widetilde{m}_t$ and $\widetilde{V}_t$, and letting $\widetilde{m}_0 = \theta^{(j-1)}$, the algorithm updates the parameter estimate as
$$\theta^{(j)} = \theta^{(j-1)} + \gamma_j \sum_{t=1}^{n} \widetilde{V}_t^{-1}(\widetilde{m}_t - \widetilde{m}_{t-1}). \qquad (3.39)$$
In fact, the quantity $\sum_{t=1}^{n} \widetilde{V}_t^{-1}(\widetilde{m}_t - \widetilde{m}_{t-1})$ is an approximation to the gradient of the log-likelihood, $\nabla_\theta l_\theta(y_{1:n})$, at $\theta = \theta^{(j-1)}$. Here, the positive sequences $\{\tau_j\}_{j\geq0}$ and $\{\sigma_j\}_{j\geq0}$ satisfy the conditions $\lim_{j\to\infty} \tau_j = 0$ and $\lim_{j\to\infty} \sigma_j/\tau_j = 0$, which define an annealing schedule. Moreover, the sequences of particle numbers and step sizes must satisfy $N_j \tau_j \to \infty$ and $\sum_j \gamma_j^2 N_j^{-1} \tau_j^{-2} < \infty$, which are the conditions for convergence of the stochastic approximation for $\theta$ to a local maximum [Ionides et al., 2011].

3.4.5 Discussion of the MLE methods

On the one hand, one might prefer a gradient ascent procedure over the EM algorithm for a number of reasons. Firstly, when $l_\theta(y_{1:n})$ is a concave function of $\theta$, if $\gamma_j$ is replaced by $-\gamma_j \Gamma_j^{-1}$, where $\Gamma_j$ is the Hessian of $l_\theta(y_{1:n})$ evaluated at $\theta^{(j)}$, then the rate of convergence is quadratic and thus faster than that of EM, which converges linearly. The Hessian matrix can be estimated using SMC techniques; see Poyiadjis et al. [2011]. Secondly, the gradient ascent algorithm is more general, since it can be implemented in those cases where the M-step of EM cannot be solved in closed form. On the other hand, scaling the gradients can be quite hard; in addition, EM needs less tuning and its M-step is typically numerically stable. Therefore, one might prefer an EM approach whenever the M-step can be computed analytically. Finally, both approaches have online versions, which makes them very powerful tools for dealing with large sequential data sets.

An advantage of iterated filtering over standard gradient and EM techniques is that it only requires the ability to sample from $f_\theta(x'|x)$; no explicit calculation of derivatives is needed. However, it may require some tuning when the parameter is high-dimensional. Another disadvantage of iterated filtering is the necessity of using an increasing number of particles over the iterations in order to ensure convergence. Finally, iterated filtering does not have an online version and hence can only be used in a batch setting.

Chapter 4

An Online Expectation-Maximisation Algorithm for Changepoint Models

Summary: Changepoint models are widely used to model the heterogeneity of sequential data. We present a novel sequential Monte Carlo (SMC) online Expectation-Maximisation (EM) algorithm for estimating the static parameters of such models.
Chapter 4

An Online Expectation-Maximisation Algorithm for Changepoint Models

Summary: Changepoint models are widely used to model the heterogeneity of sequential data. We present a novel sequential Monte Carlo (SMC) online Expectation-Maximisation (EM) algorithm for estimating the static parameters of such models. The SMC online EM algorithm has a per-time-step cost that is linear in the number of particles, which is particularly important when the data form a long sequence of observations, since it drastically reduces the computational requirements of an implementation. We present an asymptotic analysis of the stability of the SMC estimates used in the online EM algorithm and demonstrate the performance of this scheme using both simulated and real data originating from DNA analysis.

The work in this chapter is published in Yıldırım et al. [2012d]. The idea was initiated in a discussion Dr. Sumeetpal S. Singh had with Prof. Arnaud Doucet. I did all the work except that Section 4.4 was done in collaboration with Dr. Sumeetpal S. Singh.

4.1 Introduction

Consider a sequence of observations $\{y_1, y_2, \ldots\}$ collected sequentially in time. A changepoint model is a particular model for the heterogeneity of sequential data: it postulates the existence of a strictly increasing time sequence $t_1, t_2, \ldots$ with $t_1 = 1$ that partitions the data into disjoint segments $\{y_{t_1}, \ldots, y_{t_2 - 1}\}, \{y_{t_2}, \ldots, y_{t_3 - 1}\}, \ldots$, such that the data are correlated within a segment but independent across segments. The time instances $t_1, t_2, \ldots$ are known as the changepoints and constitute a random unobserved sequence.

This segmental structure is both an intuitive and a versatile model for heterogeneity, and it is the reason why changepoint models have enjoyed wide appeal in a variety of disciplines such as Biological Science [Braun and Muller, 1998; Caron et al., 2011; Fearnhead and Vasileiou, 2009; Johnson et al., 2003], Physical Science [Lund and Reeves, 2002; Ó Ruanaidh and Fitzgerald, 1996], Signal Processing [Cemgil et al., 2006; Punskaya et al., 2002], and Finance [Dias and Embrechts, 2004].

In a Bayesian approach to inferring changepoints, one adopts a prior distribution on their locations and a likelihood function for the observed process given these changepoints. Both of these laws typically depend on a finite-dimensional real parameter vector $\theta \in \Theta$, where $\Theta$ denotes the set of permissible parameter vectors. In all realistic applications, the static parameter $\theta$ is unknown and needs to be estimated from the data as well. A fully Bayesian approach would assign a prior distribution to $\theta$; however, the resulting posterior distribution is intractable. Several Markov chain Monte Carlo (MCMC) schemes have been proposed in this context [Chib, 1998; Fearnhead, 2006; Lavielle and Lebarbier, 2001; Stephens, 1994]. Unfortunately, these algorithms are far too computationally intensive when dealing with very large datasets. An alternative to an MCMC-based full Bayesian analysis is sequential Monte Carlo (SMC); however, SMC methods for online Bayesian static parameter estimation suffer from the well-known particle path degeneracy problem and can provide unreliable estimates; see Andrieu et al. [2005] and Olsson et al. [2008] for a discussion of this issue. This is why we focus here on estimating the parameter $\theta$ using a maximum likelihood approach; i.e., the Maximum Likelihood Estimate (MLE) of interest is the parameter vector from $\Theta$ that maximises the probability density of the observed data sequence, $p_\theta(y_1, \ldots, y_n)$. This is a challenging problem, as computing the likelihood $p_\theta(y_1, \ldots, y_n)$ requires a computational cost increasing super-linearly with $n$ [Chopin, 2007; Fearnhead and Liu, 2007].
Our main contribution is a novel online EM algorithm to compute the MLE of the static parameter $\theta$ for changepoint models. We remark that standard batch EM algorithms for a restricted class of changepoint models have been proposed before; see, e.g., Gales and Young [1993], Barbu and Limnios [2008], Fearnhead and Vasileiou [2009]. The main reason an online algorithm is desirable is that huge computational and memory savings are possible. For a long data sequence, a standard EM algorithm requires a complete browse through the entire data set at each iteration to update the estimate of $\theta$, and many such iterations are needed until the estimate of $\theta$ converges. This not only requires storing the entire data sequence but also the probability laws needed in the intermediate computations of each EM iteration, which can be impractical. For this reason, there has been strong interest in online methods, which make parameter estimation possible by browsing through the data only once and hence circumvent the need to store it in its entirety (see Kantas et al. [2009] for a review). The only other work we are aware of on computing the MLE of $\theta$ in an online manner, for a more restrictive class of changepoint models, is Caron et al. [2011], where the authors use a recursive gradient algorithm. If the model permits an EM implementation, then it is fair to say that EM is generally preferred by practitioners, as no algorithm tuning is required, whereas it can be difficult to properly scale the components of the computed gradient vector.

For finite state-space Hidden Markov Models (HMM) [Cappé, 2011; Mongillo and Deneve, 2008] and linear Gaussian state-space models [Elliott et al., 2002], it is possible to implement the online EM algorithm exactly. A detailed study of this algorithm in the finite state-space case can be found in Cappé [2011]. For changepoint models, it is necessary to numerically approximate, sequentially over time, certain expectations with respect to (w.r.t.) the conditional law of the changepoints and the other latent random variables of the model given the observations available up to that point in time. We present SMC estimates of these expectations and establish the stability (via the variance) of these estimates w.r.t. time $n$ and the number of particles $N$, both theoretically and with numerical examples. Stability of the SMC estimates of the expectations is important for assessing the performance and reliability of the EM algorithm, and it is not to be taken for granted, because these expectations are computed w.r.t. a probability law whose dimension increases linearly with time $n$. We note that the computational cost of the proposed SMC online EM algorithm is $O(N)$ per time step, whereas an $O(N^2)$ per-time-step algorithm is required to obtain similar stability results for general state-space HMMs [Del Moral et al., 2009].

Cappé [2011] remarked that "although the online EM algorithm resembles a classical stochastic approximation algorithm, it is sufficiently different to resist conventional 'analysis of convergence'". We believe that limited results similar to those discussed in Cappé [2011, Section 4], identifying the potential accumulation points of the online EM procedure, could be established, but this is beyond the scope of this work. In the numerical studies reported in this work, and indeed in all the ones we have conducted, the SMC online EM algorithm converges, and to a very close vicinity of the correct values when these are known, e.g. in synthetic examples. Moreover, we observed that online EM converged significantly quicker than the batch EM implementation.
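As a deliberately simplified illustration of the generic online EM recursion underlying this chapter, the sketch below runs the scheme on a toy i.i.d. Gaussian model where the E-step is trivial: a running average of the sufficient statistics with step sizes $\gamma_n$, followed by an explicit M-step mapping the averaged statistics to the parameters. The toy model and all numerical choices are ours, for illustration only; the chapter's contribution is to approximate exactly these expectation steps with SMC for changepoint models, where they are not available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=10_000)   # synthetic data stream

# Online EM on a toy i.i.d. Gaussian model: sufficient statistics
# s(y) = (y, y^2); M-step: mu = S1, sigma^2 = S2 - S1^2.
S = np.array([0.0, 1.0])            # running estimates of E[y], E[y^2]
mu, var = 0.0, 1.0
for n, yn in enumerate(y, start=1):
    gamma_n = n ** -0.6             # steps: sum gamma_n = inf, sum gamma_n^2 < inf
    S = (1.0 - gamma_n) * S + gamma_n * np.array([yn, yn * yn])
    if n >= 20:                     # short burn-in before invoking the M-step
        mu, var = S[0], S[1] - S[0] ** 2
print(mu, var ** 0.5)               # approaches the true values (2.0, 1.5)
```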
The organisation of the chapter is as follows. In Section 4.2, we describe a general changepoint model. In Section 4.3, we present the associated online EM algorithm and its SMC implementation. Theoretical results on the stability of the SMC estimates used in the online EM algorithm are given in Section 4.4. In Section 4.5, we demonstrate the performance of the SMC online EM algorithm on both simulated and real data. We finish with a discussion in Section 4.6; finally, some detailed model-specific derivations as well as mathematical proofs are given in the Appendix.

4.2 The changepoint model

In this work, a changepoint model is defined to comprise two discrete-time stochastic processes, $\{(X_k, Z_k)\}_{k \geq 1}$ and $\{Y_k\}_{k \geq 1}$. $\{(X_k, Z_k)\}_{k \geq 1}$ is an unobserved time-homogeneous Markov chain taking values in $\mathcal{X} \times \mathcal{Z}$, where $\mathcal{X} = \{1, 2, \ldots\} \times \{1, \ldots, R\}$ and $\mathcal{Z} \subseteq \mathbb{R}^p$. (While the definition of $\mathcal{X}$ in this manner is necessary for the resulting model to be a changepoint model, the definition of $\mathcal{Z}$ can change depending on the application domain.) We denote realisations of the first component of this chain by $x_k = (d_k, m_k)$. The variable $m_k$ takes values in the index set $\{1, \ldots, R\}$ and indicates the (generative) model the chain is in at that time, while $d_k$ indicates the duration the chain has spent in model $m_k$. The transition law of $\{(X_k, Z_k)\}_{k \geq 1}$ is
$$X_1 \sim \mu, \qquad X_k \,|\, (x_{k-1} = (d, m), z_{k-1}) = \begin{cases} (d+1, m) & \text{w.p. } 1 - \lambda_{\theta,m}(d), \\ (1, m') & \text{w.p. } \lambda_{\theta,m}(d) \times P_\theta(m, m'), \end{cases}$$
$$Z_k \,|\, (x_k = (d', m'), x_{k-1}, z_{k-1}) \sim \begin{cases} f_{\theta,m'}(z \,|\, z_{k-1})\, dz & \text{if } d' \neq 1, \\ \pi_{\theta,m'}(z)\, dz & \text{if } d' = 1, \end{cases} \qquad (4.1)$$
where $\lambda_{\theta,m}(d) \in [0, 1]$ for all $\theta \in \Theta$ and $(d, m) \in \mathcal{X}$; $P_\theta$ is an $R \times R$ row-stochastic matrix; for each $\theta$ and $m$, $f_{\theta,m}(z | z_{k-1})$ is the density of a Markov transition kernel on $\mathcal{Z}$ w.r.t. a suitable dominating measure, denoted $dz$; and for each $\theta$ and $m$, $\pi_{\theta,m}$ is a probability density on $\mathcal{Z}$. The transition kernel of the Markov chain $\{(X_k, Z_k)\}_{k \geq 1}$ is assumed to be parametrised by the finite-dimensional parameter $\theta \in \Theta$. Without loss of generality, it is assumed that the probability distribution of the initial state of the chain $\{X_k\}_{k \geq 1}$, denoted $\mu$, has all its mass on $\{(1, 1), \ldots, (1, R)\}$; e.g., it could be the uniform distribution on $\{(1, 1), \ldots, (1, R)\}$. For a sequence $\{a_k\}_{k \geq 1}$ and integers $i, j$, let $a_{i:j}$ denote the set $\{a_i, a_{i+1}, \ldots, a_j\}$, which is empty if $j < i$, and let $a_{i:\infty} = \{a_i, a_{i+1}, \ldots\}$.

The process $\{Y_k\}_{k \geq 1}$ is a $\mathcal{Y}$-valued observed process which satisfies the following conditional independence property:
$$Y_k \,\big|\, \left( \{x_k, z_k\}_{k \geq 1},\, y_{1:k-1},\, y_{k+1:\infty} \right) \sim g_{\theta, m_k}(y \,|\, z_k)\, dy, \qquad (4.2)$$
where for each $\theta$ and $m$, $g_{\theta,m}$ is a probability density on $\mathcal{Y}$ with respect to the dominating measure $dy$. In this work $\mathcal{Y} \subseteq \mathbb{R}^q$, although the definition of $\mathcal{Y}$ may be altered depending on the application. Equations (4.1) and (4.2) now define the law of $(X_{1:n}, Z_{1:n}, Y_{1:n})$.

Note that $\{X_k\}_{k \geq 1}$ is itself a Markov chain, and we denote its transition matrix by $p_\theta(x_k | x_{k-1})$. Secondly, it is useful to visualise a realisation of $\{X_k\}_{k \geq 1}$ as a labelled contiguous partition of $\{1, 2, \ldots\}$, namely $\{[t_1, t_2), [t_2, t_3), \ldots\}$ with $t_{i+1} > t_i$, where each set $[t_i, t_{i+1})$ of the partition, which we call a segment, is accompanied by $m_{t_i}$, the model number during that segment. The times $t_i$ are the instances at which $\{X_k\}_{k \geq 1}$ visits the set $\{1\} \times \{1, \ldots, R\}$ and are called the changepoints.
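The generative structure in (4.1)-(4.2) is perhaps easiest to see as a simulation. The sketch below samples from a minimal instance with concrete choices that are ours alone and not from this chapter: $R = 2$ models, constant hazards $\lambda_{\theta,m}(d) = \lambda_m$ (so segment lengths are geometric), a scalar AR(1) kernel for $f_{\theta,m}$, standard normal $\pi_{\theta,m}$, and a Gaussian observation density $g_{\theta,m}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameter choices (not from the thesis): R = 2 models.
lam = np.array([0.05, 0.10])            # lambda_{theta,m}: changepoint probability
P = np.array([[0.0, 1.0],               # row-stochastic model-switch matrix P_theta
              [1.0, 0.0]])
a = np.array([0.9, 0.5])                # AR(1) coefficients for f_{theta,m}
obs_sd = 0.3                            # g_{theta,m}: y_k ~ N(z_k, obs_sd^2)

def simulate(n):
    d, m = 1, int(rng.integers(2))      # X_1 ~ mu, uniform on {(1,1), (1,2)}
    z = rng.normal()                    # Z_1 ~ pi_{theta,m} (standard normal here)
    xs, zs, ys = [], [], []
    for _ in range(n):
        xs.append((d, m)); zs.append(z)
        ys.append(rng.normal(z, obs_sd))      # observation law (4.2)
        if rng.random() < lam[m]:             # changepoint, w.p. lambda_{theta,m}(d)
            m = int(rng.choice(2, p=P[m]))    # new model m' ~ P_theta(m, .)
            d, z = 1, rng.normal()            # duration resets; Z ~ pi_{theta,m'}
        else:                                 # no changepoint, w.p. 1 - lambda
            d += 1                            # d is tracked although the hazard
            z = a[m] * z + rng.normal(0, 0.5) # here is constant; Z ~ f_{theta,m}
    return xs, zs, ys

x, z_latent, y = simulate(500)
```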
As $\{Z_k\}_{k \geq 1}$ forgets its past at the times of changepoints, within the segment $[t_i, t_{i+1})$ the process $\{(Z_k, Y_k)\}_{t_i \leq k < t_{i+1}}$ is independent of the other segments given the realisation of $\{X_k\}_{k \geq 1}$.