Approximate Inference in Graphical Models
Philipp Hennig
Robinson College
dissertation submitted in candidature for the degree of
Doctor of Philosophy, University of Cambridge
Inference Group
Cavendish Laboratory
University of Cambridge
14 November 2010
ii
Declaration
I hereby declare that my dissertation entitled “Approximate Inference in Graphical
Models” is not substantially the same as any that I have submitted for a degree or
diploma or other qualification at any other University.
I further state that no part of my dissertation has already been or is concurrently
submitted for any such degree of diploma or other qualification.
Except where specifically indicated in the text, this dissertation is the result of my
own work and includes nothing which is the outcome of work done in collaboration.
This dissertation does not exceed sixty thousands words in length.
Date: 14 November 2010 Signed:
Philipp Hennig, Cambridge
iii
Abstract
Probability theory provides a mathematically rigorous yet conceptually flexible
calculus of uncertainty, allowing the construction of complex hierarchical models
for real-world inference tasks. Unfortunately, exact inference in probabilistic mod-
els is often computationally expensive or even intractable. A close inspection in
such situations often reveals that computational bottlenecks are confined to cer-
tain aspects of the model, which can be circumvented by approximations without
having to sacrifice the model’s interesting aspects. The conceptual framework of
graphical models provides an elegant means of representing probabilistic models
and deriving both exact and approximate inference algorithms in terms of local
computations. This makes graphical models an ideal aid in the development of
generalizable approximations. This thesis contains a brief introduction to approx-
imate inference in graphical models (Chapter 2), followed by three extensive case
studies in which approximate inference algorithms are developed for challenging
applied inference problems. Chapter 3 derives the first probabilistic game tree
search algorithm. Chapter 4 provides a novel expressive model for inference in
psychometric questionnaires. Chapter 5 develops a model for the topics of large
corpora of text documents, conditional on document metadata, with a focus on
computational speed. In each case, graphical models help in two important ways:
They first provide important structural insight into the problem; and then suggest
practical approximations to the exact probabilistic solution.
iv
Acknowledgments
I would like to thank all members of the Inference Group at the Cavendish Lab-
oratory in Cambridge for their support during the preparation of this thesis. In
particular, I am grateful to Carl Scheffler, Philip Sterne, Keith Vertanen, Emli-
Mari Nel, Tamara Broderick, Oliver Stegle and Christian Steinrücken for engaging
discussions and thoughtful comments over the past years, as well as for making
this time as much fun as it has been.
I feel privileged to have had the chance to work alongside, interact with, and
learn from a number of members of the extended Cambridge machine learning
community. Among them are Tom Borchert, John Cunningham, Marc Deisenroth,
Jürgen van Gael, Zoubin Ghahramani, Katherine Heller, Ferenc Huszár, Gergji
Kasneci, David Knowles, Simon Lacoste-Julien, Máté Lengyel, Tom Minka, Shakir
Mohamed, Iain Murray, Peter Orbanz, Juaquin Quiñonero-Candela, Carl Ras-
mussen, Yunus Saatc˛i, Richard Samworth, Anton Schwaighofer, Martin Szummer,
Yee Whye Teh, Ryan Turner, Sinead Williamson and John Winn.
I am especially indebted to David Stern and Ralf Herbrich, both of Microsoft
Research, for repeatedly and actively offering their time and expertise, often after
hours, to help me with final preparations of papers and talks, and for suggesting
interesting experiments and extensions.
My work was supported through a grant from Microsoft Research Ltd. Besides the
financial support, I am also grateful to Microsoft for access to the great researchers
and technical possibilities in its Cambridge research laboratories.
My deepest gratitude goes to my two supervisors: To David MacKay, for innu-
merable enlightening comments and constructive criticism as well as for being a
shining example of a scientist’s boldness and freedom. And to Thore Graepel, for
relentlessly supporting and challenging me throughout the last three years. This
thesis would not have been possible without either of them.
Contents
1 Introduction 1
1.1 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 A Calculus of Uncertainty . . . . . . . . . . . . . . . . . . . 1
1.2 Numerical Complexity Mandates Approximations . . . . . . . . . . 4
1.2.1 The Classic Answer . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Provability Is Not Everything . . . . . . . . . . . . . . . . . 4
1.2.3 About This Text . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Graphical Models 7
2.1 Factor Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Other Graphical Models . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 The Sum-Product Algorithm . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 The Sum Product Algorithm is No Panacea . . . . . . . . . 14
2.3 Approximate Inference Methods . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Exponential Families . . . . . . . . . . . . . . . . . . . . . . 15
2.3.2 Expectation Propagation . . . . . . . . . . . . . . . . . . . . 20
2.3.3 Variational Inference . . . . . . . . . . . . . . . . . . . . . . 25
2.3.4 Laplace Approximations . . . . . . . . . . . . . . . . . . . . 28
2.3.5 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . 30
2.3.6 Assuming Independence . . . . . . . . . . . . . . . . . . . . 31
2.3.7 Comparison of Approximation Schemes . . . . . . . . . . . . 33
3 Inference on Optimal Play in Games 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.2 Generative Model . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.3 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.5 Replacing a Hard Problem with a Simple Prior . . . . . . . 48
3.2.6 Exploration Policies . . . . . . . . . . . . . . . . . . . . . . . 49
vi CONTENTS
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.1 Structure of Go Game Trees . . . . . . . . . . . . . . . . . . 50
3.3.2 Inference on the Generators . . . . . . . . . . . . . . . . . . 52
3.3.3 Recursive Inductive Inference on Optimal Values . . . . . . . 53
3.3.4 Errors Introduced by Model Mismatch . . . . . . . . . . . . 53
3.3.5 Use as a Standalone Tree Search . . . . . . . . . . . . . . . . 54
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Addendum: Related Subsequent Work . . . . . . . . . . . . . . . . . 56
4 Approximate Bayesian Psychometrics 59
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Item Response Theory . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.1 Contemporary Item Response Models . . . . . . . . . . . . . 63
4.3 A Generative Model . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.1 Expressiveness of the Model . . . . . . . . . . . . . . . . . . 66
4.3.2 Approximate Inference . . . . . . . . . . . . . . . . . . . . . 66
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.1 Computational cost . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.2 Approximate Bayesian Estimates . . . . . . . . . . . . . . . 72
4.4.3 Predictive Performance . . . . . . . . . . . . . . . . . . . . . 72
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5 Fast, Online Inference for Conditional Topic Models 79
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2.1 Semi-Collapsed Variational Inference . . . . . . . . . . . . . 84
5.2.2 Laplace Approximation for Dirichlets . . . . . . . . . . . . . 89
5.2.3 Gaussian Regression . . . . . . . . . . . . . . . . . . . . . . 91
5.2.4 Inference from Data Streams . . . . . . . . . . . . . . . . . . 94
5.2.5 Queries to the Model . . . . . . . . . . . . . . . . . . . . . . 95
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.1 Quality of the Laplace Bridge . . . . . . . . . . . . . . . . . 96
5.3.2 Experiments on the Twitter Corpus . . . . . . . . . . . . . . 98
5.3.3 Learning Sparse Topics . . . . . . . . . . . . . . . . . . . . . 98
5.3.4 Comparing Conditioned and Unconditioned Models . . . . . 99
5.3.5 Single Pass versus Iterative Inference . . . . . . . . . . . . . 102
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6 Conclusions 111
CONTENTS vii
A The Maximum of Correlated Gaussian Variables 113
A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
A.2 The Maximum of Two Gaussian Variables . . . . . . . . . . . . . . 114
A.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
A.2.2 Some Integrals . . . . . . . . . . . . . . . . . . . . . . . . . 115
A.2.3 Analytic Forms . . . . . . . . . . . . . . . . . . . . . . . . . 117
A.2.4 Moment Matching . . . . . . . . . . . . . . . . . . . . . . . 119
A.3 The Maximum of a Finite Set . . . . . . . . . . . . . . . . . . . . . 124
A.3.1 Analytic Form . . . . . . . . . . . . . . . . . . . . . . . . . . 124
A.3.2 A Heuristic Approximation . . . . . . . . . . . . . . . . . . . 125
A.4 Discussion of the Approximation’s Quality . . . . . . . . . . . . . . 126
A.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
B Efficient Rank 1 EP Updates 129
B.1 The Step Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
C A Laplace Map Linking Gaussians and Dirichlets 135
C.1 The Dirichlet in the Softmax Basis . . . . . . . . . . . . . . . . . . 137
C.1.1 Ensuring Identifiability . . . . . . . . . . . . . . . . . . . . . 137
C.1.2 Transforming to the Softmax Basis . . . . . . . . . . . . . . 138
C.2 The Laplace Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
C.2.1 Mode and Hessian in the Softmax Basis . . . . . . . . . . . 140
C.2.2 A Sparse Representation . . . . . . . . . . . . . . . . . . . . 141
C.2.3 Inverse Map . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
C.3 The Two-Dimensional Case . . . . . . . . . . . . . . . . . . . . . . 144
C.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
viii CONTENTS
List of Figures
2.1 Factor graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Plates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Advantages of directed graph notation . . . . . . . . . . . . . . . . 9
2.4 Advantages of factor graph notation . . . . . . . . . . . . . . . . . . 10
2.5 Derivation of sum-product algorithm . . . . . . . . . . . . . . . . . 10
2.6 Messages sent by the sum-product algorithm . . . . . . . . . . . . . 11
2.7 Defect of the Laplace approximation . . . . . . . . . . . . . . . . . 29
2.8 Sketch of the explaining away phenomenon . . . . . . . . . . . . . . 31
2.9 Factorized regression . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1 Sketch of game tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Sketch of game tree smoothness . . . . . . . . . . . . . . . . . . . . 40
3.3 Generative model for game trees . . . . . . . . . . . . . . . . . . . . 42
3.4 Inference on optimal values . . . . . . . . . . . . . . . . . . . . . . 45
3.5 Replacing a hard problem with a simple prior . . . . . . . . . . . . 49
3.6 Empirical distribution of generator values in Go . . . . . . . . . . . 50
3.7 Changes to mean values . . . . . . . . . . . . . . . . . . . . . . . . 51
3.8 Log likelihood of ground truth game values under model . . . . . . 53
3.9 Inductive model predictions vs ground truth . . . . . . . . . . . . . 54
3.10 Performance in a tree search task . . . . . . . . . . . . . . . . . . . 55
4.1 Marginal frequencies of reply categories . . . . . . . . . . . . . . . . 63
4.2 Directed model for ordinal regression . . . . . . . . . . . . . . . . . 65
4.3 Factor graph for the user-item response model . . . . . . . . . . . . 67
4.4 Ordinal and response factors . . . . . . . . . . . . . . . . . . . . . . 68
4.5 Sketch of test set setup . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.6 Thresholds learned by the approximate inference algorithm . . . . . 73
4.7 Sampled threshold values . . . . . . . . . . . . . . . . . . . . . . . . 74
4.8 Predictions for user responses to items . . . . . . . . . . . . . . . . 75
5.1 Conceptual sketch of topic models . . . . . . . . . . . . . . . . . . . 81
5.2 Directed graphical model . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3 Approximate inference in the conditional topic model . . . . . . . . 91
x LIST OF FIGURES
5.4 Factorized Gaussian regression . . . . . . . . . . . . . . . . . . . . . 93
5.5 Convergence behaviour of the Laplace bridge . . . . . . . . . . . . . 97
5.6 Sparsity of learned author topic predictions . . . . . . . . . . . . . . 99
5.7 Predictive topic distribution for individual authors . . . . . . . . . . 100
5.8 Learned topics: conditioned model . . . . . . . . . . . . . . . . . . . 106
5.9 Learned topics: un-conditioned model . . . . . . . . . . . . . . . . . 107
5.10 Topic distributions for four authors . . . . . . . . . . . . . . . . . . 108
5.11 Comparison of iterative and single-pass convergence . . . . . . . . . 109
5.12 Regrets and log probabilities . . . . . . . . . . . . . . . . . . . . . . 109
A.1 Sketch of integration ranges . . . . . . . . . . . . . . . . . . . . . . 117
A.2 Analytical distributions of max and generating variables . . . . . . 120
A.3 Minimal factor graph using the max factor. . . . . . . . . . . . . . . 120
A.4 Illustrations of Gaussian approximations for the max . . . . . . . . 123
A.5 Factor graph for finite set max factor . . . . . . . . . . . . . . . . . 124
A.6 Failure modes of the max iterative approximation . . . . . . . . . . 126
A.7 Examples of the max factors results . . . . . . . . . . . . . . . . . . 128
C.1 Factor graph for the softmax factor . . . . . . . . . . . . . . . . . . 136
C.2 Conceptual sketch of the soft constraint onto a subspace . . . . . . 138
C.3 Effect of the softmax basis change on the Dirichlet distribution. . . 140
Chapter 1
Introduction
1.1 Probability
1.1.1 A Calculus of Uncertainty
The essence of human existence is constant interaction with the world, aiming
to steer and shape our surroundings to our benefit. To this end, we constantly
form hypotheses about the structure of the world, and the rules that govern it,
act according to our theories, and then change our beliefs in the light of new ex-
perience. The process of updating one’s belief about the validity of a hypothesis,
and about the values of parameters describing the hypothesis, is known as infer-
ence, or learning. Biological learning systems have been very successful adapting
to their environments in this way, and still outperform machines in many areas. In
an attempt to replicate some of the abilities of biological systems, the discipline
of machine learning has emerged over the past two decades, bringing together
researchers from a diverse range of fields, such as computer science, engineering,
physics, mathematics, neuroscience and biology, rallying around the central idea
that the behaviour of a system should be governed to a large degree by examples
(data), rather than explicit predetermined rules.
An unavoidable consequence of learning from finite data is incompleteness of the
acquired knowledge, because some areas of the data space will not have been vis-
ited or experience might otherwise be limited. This incompleteness is associated
with some level of uncertainty, and quantifying this uncertainty is essential for
good decision making. Probability theory provides an ideal, mathematically rig-
orous framework to represent and manipulate uncertain information. In fact, any
rational inference paradigm consistent with common sense (as defined by three
rather incontrovertible axioms) can be mapped to probability theory [Cox, 1946].
In an extension of formal logic, where statements A are either true or false, the
probabilistic framework introduces the concept of a probability p(A) ∈ [0, 1] of
2 Introduction
statement A being true1 The probability of both statements A and B being true
is denoted by the joint probability p(A,B), the probability of A being true if B is
known to be true is the conditional probability p(A |B). The means to manipulate
probabilities is provided by a calculus arising directly from Cox’s axioms, which
can be subsumed in the two central rules [Jaynes and Bretthorst, 2003]
p(X) =
∑
Y
p(X, Y ) and p(X, Y ) = p(Y |X)p(X) (1.1)
known as the sum and product rule, respectively. The sum symbol is meant to
represent the sum over all possible values for Y . An important corollary of these
results is Bayes’ theorem
p(X |Y ) =
p(Y |X)p(X)
∑
X p(Y |X)p(X)
(1.2)
which relates the posterior probability of X after observation of Y to the prior
probability ofX, the likelihood ofX (the conditional probability of Y givenX) and
the evidence for Y under the probabilistic model we use to evaluate conditional
probabilities.
posterior =
likelihood · prior
evidence
(1.3)
This text uses a somewhat sloppy notation to increase readability, in which the
probability of a particular variable X having the concrete value x will simply be
written as p(x), instead of p(X = x). This applies to both discrete variables and
real-valued variables; in the latter case, the sums in Equations (1.1) and (1.2) have
to be replaced with integrals.
The above framework, sometimes called “inferring inverse probabilities”, is most
widely known as the Bayesian interpretation of probability. Although named in
honour of early works by a non-conformist priest [Bayes, 1763], the theory was
first formally developed by Laplace [1774].2 Cox [1946] and Kolmogorov [1933]
established the fundamental nature of these rules of reasoning. Nevertheless, the
Bayesian paradigm has been under fierce debate throughout the 20th century,
mostly based on philosophical [see Jaynes and Bretthorst, 2003, for an opinion-
ated review] and technical [e.g. Walker, 2004] issues involving the prior. In recent
years, though, the Bayesian framework has increasingly found acceptance within
the statistics and machine learning communities, mostly thanks to successful ex-
perimental demonstration of its powerful abilities in the description of complicated
data and prior knowledge. Some important strengths of the Bayesian formulation
1In principle, probabilities can be defined as members of any space isomorphic to the real line
[Jaynes and Bretthorst, 2003], but [0, 1] has become the established space. Appendix C contains
an example of the inconveniences caused by this choice in some cases.
2However, Laplace did not yet state Bayes’ theorem explicitly.
1.1 Probability 3
are
. Bayesian methods allow quantitative statements about uncertainty. This al-
lows reasoning about the information conveyed by data about the value of
a latent variable, and about the confidence with which future data can be
predicted.
. In contrast to classical statistical methods, which distinguish between ran-
dom data and deterministic parameters that are not random variables, every
variable in a Bayesian model can be assigned probabilities. This makes it
possible to design hierarchical probabilistic models, describing very general
‘hypotheses over hypotheses’.
. Because probabilistic models use probability measures to spread a finite
amount of probability mass over all possible outcomes, they do not suf-
fer from the problem of ‘over-fitting’ that can plague estimative methods.
Bayesian methods can thus deal consistently with arbitrary, even infinite
numbers of free parameters in a model (the latter case is known as Bayesian
nonparametric inference).
. From the standpoint of classical statistics, Bayesian methods can be shown
to be consistent and to converge optimally fast (subject to minor technical
restrictions3) [Le Cam, 1973, Ibragimov and Has’minskii, 1981, Ghosal et al.,
2000]. This has allowed researchers to concentrate their efforts on the devel-
opment of flexible models and efficient inference methods without having to
worry about theoretical guarantees.
The body of literature on the subject of probabilistic inference, although still
expanding rapidly, has become too large even to be characterized by individual
exemplary works. Good introductions to the applied aspects of the field can be
found in the textbooks by Jaynes and Bretthorst [2003], MacKay [2003] and Bishop
[2006].
3More precisely: Consider a dataset {Xi} generated by a probability measure Pθ({Xi})
parametrized by a parameter θ from a parameter set Θ. If the prior over Θ puts nonzero mass on
every sufficiently small open neighbourhood of the true value θ0, and if Θ is a subset of a finite-
dimensional Euclidean space, and the functional relationship θ → Pθ({Xi}) is sufficiently regular,
then the posterior distribution on θ converges with the optimal rate [Le Cam, 1973, Ibragimov
and Has’minskii, 1981]. The situation is less clear-cut when Θ is infinite dimensional [Cox, 1993,
Ghosal et al., 2000]. Note that such analyses of both Bayesian and Frequentist methods assume
the likelihood function to be correct, which is a causal assumption about the generative process
that is often incorrect in real-world applications, such as the ones studied in this thesis.
4 Introduction
1.2 Numerical Complexity Mandates Approxima-
tions
There is also one aspect of Bayesian modeling that is still seen as a weakness
by many applied researchers, and that provides the motivation for this thesis:
Although Bayes’ rule provides a unique and in some sense straightforward way
to do inference, obtaining the posterior for a particular choice of prior is often
expensive, or even intractable. Research in Bayesian methods addresses this issue
in two ways:
1. Priors that allow closed-form, or computationally cheap, evaluation of the
posterior without sacrificing generality. Some Bayesian nonparametric meth-
ods fall into this category, including Gaussian process algorithms [Williams
and Rasmussen, 1996] and, more recently, random processes over discrete
probability spaces [Beal et al., 2002, Blei et al., 2004, Griffiths and Ghahra-
mani, 2006, Roy and Teh, 2009].
2. Approximation methods that capture important characteristics of the pos-
terior at tractable computational cost, such as Markov Chain Monte Carlo
methods [Neal, 1996], and approximate inference methods for graphical mod-
els. The latter are the methods that this text will focus on. See Chapter 2
for an introduction and further references.
1.2.1 The Classic Answer
Classic statistical research takes a disparate approach: Noting the numerical com-
plexity mentioned above, researchers in this field construct deterministic estima-
tors, and then take care to prove several desirable characteristics of these esti-
mators, such as the absence of bias, good rates of convergence, and bounds on
the error between the “true” and estimated value of a parameter. Carefully con-
structed estimators can perform very well. But their serious drawback is a lack of
generalization: Even slight changes in the structure of a problem can invalidate
the derivation of an estimator and require the attention of a specialized researcher
starting from scratch. Since not every applied researcher can afford the luxury
of their own statistician, estimators are often marketed as “black boxes” to the
applied fields, despite being usually subject to several technical restrictions. This
creates a considerable risk of misuse [Jaffe and Spirer, 1987, Altman, 1982, 1994].
1.2.2 Provability Is Not Everything
From the Bayesian viewpoint, complex models are not a weakness, but a reflection
of the complexity of the modeled system. The ability to capture intricate non-linear
1.2 Numerical Complexity Mandates Approximations 5
relationships may come at the price of losing analytic mathematical formulation.
This structural complexity also means that it is often very difficult to provide
explicit proof of the correctness of a particular approximate algorithm, even though
the underlying exact algorithms might be known to be optimal. This situation is
similar to that in the natural sciences, where there are epistemological limits to the
discoverability of the “correct” model [Popper, 1934, Hume, 1739], and predictions
are made not just on the basis of assumptions, but also as a result of extensive
approximations. Experimental evaluation has to replace mathematical proofs in
many cases. In fact, modern machine learning might be seen as an extension of the
effort of physics to explain the inanimate world, to ever more complex systems, and
to systems dominated by human behaviour (the field is thus also sometimes termed
more generally as probabilistic data analysis). Although provable performance is
of course a virtue for any method, absence of a proof should not keep us from
using approximate algorithms known empirically to perform better. In recent years
several approximate algorithms with unknown general performance have become
established tools of machine learning, because they have been shown to perform
better than simpler algorithms with provable characteristics. In some cases (such
as loopy belief propagation [Frey and MacKay, 1998] and expectation propagation
[Minka, 2001]), the approximations are even known to fail in certain cases, and are
still used for their superior capacity in those cases where they do not fail.
1.2.3 About This Text
This thesis presents a series of applied inference schemes. Although solving largely
independent concrete problems, the chapters are linked by their use of approxi-
mate inference techniques within the framework of graphical models. These are a
particularly expressive form of notation exposing the factorization properties of
multivariate probability distributions. The framework itself was developed largely
within the statistics and machine learning community, starting with seminal work
by Pearl [1988], Lauritzen and Spiegelhalter [1988] and Frey [1998]. Approxima-
tions using graphical models [Frey and MacKay, 1998, Minka, 2001, Winn and
Bishop, 2006] allow tractable inference in highly structured models.
This thesis makes extensive use of graphical models, less as a theoretical framework,
but as a robust toolbox for the applied researcher. It utilizes graphical models to
present and describe the often intricate structure of a series of probabilistic models,
and to discover and study approximate solutions to inference problems where exact
posterior distributions are intractable.
Chapter 2 presents a short introduction to graphical models, and to popular ap-
proximate inference methods, all of which will be used in the following chapters.
Chapter 3 derives the first probabilistic game tree search algorithm, allowing the
6 Introduction
update of probabilistic beliefs on the solution of an EXPTIME-complete problem,
from individual data points, in linear time. Chapter 4 presents an approximate
Bayesian inference scheme for the evaluation of psychometric questionnaires, al-
lowing inference on individual evaluation ranges for every item in the questionnaire
and every respondent to the questionnaire. Chapter 5 derives a lightweight infer-
ence algorithm for the popular Latent Dirichlet Allocation topic model for corpora
of text documents, which allows the model to be conditioned on features of indi-
vidual documents and which can run on very large datasets in a single pass.
Chapter 2
Graphical Models
At the core of probability theory lies the idea that knowledge about the values of
quantities can be “spread out” over a measurable space, rather than having to be
confined to a single exact value. As a direct consequence, inference in probabilistic
models requires probability mass to be summed up over ranges or, in the case of
continuous spaces, to be integrated. Unfortunately, integration is a tricky art, and
analytical solutions to the integrals over general probability distributions are the
exception rather than the rule. Hence much of applied inference relies on the use
of a small set of parametric distributions amenable to integration under certain
atomic operations, such as multiplication.
In multivariate models, the complexity of integration can be immense. The compu-
tational complexity of integrating, even numerically, a general multivariate prob-
ability density p(x) on a variable x ∈ V D in a D dimensional space V D =
⊗
D V
rises exponentially with D, because the volume over which we need to integrate
rises exponentially with D. This is known as the curse of dimensionality [Bellman,
1957].
However, if p(x) has structure, in particular if certain dimensions of p(x) are
conditionally independent of certain dimensions given others, then the problem
might be much easier than in the general case. In the most extreme instance, if all
dimensions are independent, p(x) =
∏
d p(xd), then the cost of integration is only
O(D).
With a bit of luck (and foresight during model design), the univariate marginals
p(xd) might also be more susceptible to analytic integration. Of course, such a fully
factorizing model is also less interesting, as it corresponds to the assumption that
none of the elements of x have any influence on each other.
Between this linear and the general exponential-cost regime lies a wide spectrum of
conditional independence structures. Particularly interesting cases are hierarchical
models in which the elements of x have a natural ordering (known as directed
acyclic models). Complexity is not just determined by conditional independence
8 Graphical Models
either; the actual functional form of the relationships between different elements
of x also has influence on computational cost. For example, some elements of
x might act as switches, controlling functional influence of other elements upon
others (this is the case in mixture models). Such structure can have influence on
the computational complexity of inference.
2.1 Factor Graphs
x
p(x)
x ∼ p(x)
Figure 2.1: In factor graph notation, probabilistic variables are denoted by hollow circles.
Functions and probability distributions are denoted by gray squares called factors. Edges
between nodes denote membership in a functional relation.
Factor graphs are bipartite graphs providing a symbolic representation of the
functional relationships between elements of multivariate probability distributions,
where the term “relationship” is meant in the sense that the expression y = f(x, z)
defines a relationship f between the variables (x, y, z).
xd
p(xd)
D ∫ D∏
d=1
p(xd) dxd =
D∏
d=1
∫
p(xd) dxd
Figure 2.2: In factor graph notation, identical copies of functional relationships are
denoted by a black rectangle around the copied group of variables and factors, called
a plate. A parameter in a corner of the plate denotes the number of copies.
Figures 2.1 and 2.2 introduce the main components of factor graphs: factors (func-
tional relationships between groups of nodes); variable nodes and plates (copies of
groups of nodes). In addition, other graphs in this text will also contain observed
variables, denoted by black circles. Factor graphs have three main raisons d’être:
. they provide an expressive graphical language which can help reveal structure
in multivariate probability distributions
. often, inference algorithms can be read off mechanically from the factor graph
(and in fact there exist compilers that can generate machine code directly
from factor graphs [Minka and Winn, 2008])
. graph theoretical concepts (such as whether the graph is planar, or even a
tree) can be used to describe certain algebraic characteristics of the under-
lying multivariate model and make general statements about solvability and
computational complexity.
2.1 Factor Graphs 9
The development of a theory of graphical models in general, and factor graphs
in particular, has been a community effort and the available literature is now too
vast to give a full account. Seminal works were contributed by Pearl [1988], Lau-
ritzen and Spiegelhalter [1988] and Frey et al. [1997]. More recently, interest in
approximate inference algorithms operating on graphs led to important contribu-
tions by Frey and MacKay [1998] Minka [2001] and Winn and Bishop [2006]. Good
introductions to graphical models can be found in Bishop [2006] and Frey [1998].
2.1.1 Other Graphical Models
Besides factor graphs, there are two other forms of graphical models in general use,
both of which predate factor graphs historically. They are known as directed graph-
ical models, or Bayesian Networks, and undirected graphical models, or Markov
Random Fields (MRFs) / Markov Networks. In both kinds of models, there are
no factor nodes. Instead, variable nodes are connected by arrows and lines, respec-
tively, when they share functional relationships.
Undirected graphical models will not feature in this text. Directed graphical mod-
els are of particular utility in the design phase of a probabilistic model, as they are
often better suited to depict motivations and convey the intuition behind probabi-
listic models, and consequently will be used for this purpose regularly throughout
this text.
x y
z
x y
z
x y
z
Figure 2.3: Advantages of directed graph notation: The factor graph on the left may
be expressed by either of the directed graphs in the middle and on the right. However,
in the directed graph in the middle, x and y are dependent conditional on z; in the
graph in the right, x and y are independent conditional on z. The factor graph does
not make this structure explicit.
Another advantage of directed graphical models is that they allow conditional in-
dependence structure to be read off from the graph (Figure 2.3), which is not
possible in general in factor graphs. Conversely, factor graphs are more expres-
sive than directed graphs in depicting functional relationships (Figure 2.4): Not
all distributions that can be encoded as a factor graph can be encoded as a di-
rected graphical model, and often one directed graphical model can encode different
models which would be distinguishable as separate when written as a factor graph
[Bishop, 2006]. Hence, algorithmic constructions will invariably be accompanied
by both factor graphs and directed graphs in this thesis.
10 Graphical Models
x y
z
x y
z
x y
z
Figure 2.4: Advantages of factor graph notation: The directed model might represent
either of the two factor graphs.
2.1.2 Outlook
The remainder of this chapter proceeds as follows: Section 2.2 introduces the cen-
tral concept of inference in graphical models — message passing — and constructs
the most important exact algorithm for inference on the marginals of nodes in
graphs, the sum-product algorithm. Since this thesis focuses on approximate in-
ference, the rest of the chapter (Section 2.3) introduces some important general
approximate algorithms for inference which leverage the graphical view: Expecta-
tion Propagation (2.3.2), variational bounds (2.3.3), local Laplace approximations
2.3.4, Markov Chain Monte Carlo algorithms (2.3.5) and other less formalized
methods (2.3.6). All of these methods use convenient classes of parametric proba-
bility distributions known as exponential families, introduced in Section 2.3.1.
2.2 The Sum-Product Algorithm
One of the main strengths of the factor graph representation is that inference
becomes almost mechanical. This is true in particular if the graph is a tree, in
which case there is an exact algorithm for finding the marginal distribution of all
variables in the graph individually, known as the sum-product algorithm. Technical
derivations can be found in Kschischang et al. [2001] and Bishop [2006]; this section
gives a brief introduction.
x
fi
...
...
va
vb
ga
gb
vi
f1
Figure 2.5: A factor graph for the derivation of the sum-product algorithm. The clouds
on the right are placeholders for subgraphs.
Consider the tree-structured factor graph shown in Figure 2.5. Assume there is
a potentially large number Ki of variables, jointly denoted vi, connected to the
2.2 The Sum-Product Algorithm 11
factors fi to the right of x, indicated by the clouds in Figure 2.5. We will assume
that the conditional dependencies between these variables retain the tree structure.
For notational clarity, we will denote (va, vb) = v1. If we subsume the normalization
into a constant Z, then the graph represents the multivariate distribution
p(x, va, vb, {vi>1}) = Z−1f1(x, va, vb)pga(va)pgb(vb)
∏
i>1
fi(x,vi)pgi(vi) (2.1)
Assume that we are interested in the marginal on x. For a general function f(α, β, γ)
with finite integral (i.e. any function that can be turned into a probability distri-
bution), we have
∫
f(α, β, γ)
∫
f(α, β, γ) dα dβ dγ
dβ dγ =
f˜(α)
∫
f˜(α) dα
(2.2)
(using an implicitly defined “marginal function” f˜ whose exact form is irrelevant).
To get a marginal on x, we can thus work with the unnormalized joint distribution
p˜(x, {vi}) and normalize over each variable individually only at the end. So we can
write the marginal as
p˜(x) =
∫
· · ·
∫ ∏
i
pfi(x |vi)gi(vi) dvi
=
∏
i
∫
fi(x,vi)gi(vi) dvi
=
∏
i
mi(x).
(2.3)
Note that the step from the first line to the second in Equation (2.3) is only possible
because of the structure of the graph. The terms of the product in the second line
are functions only of x. They can thus be interpreted as messages mi(x), sent to
the node for x from the factors fi (Figure 2.6). This is a helpful paradigm, because
it allows to speak about the process of deriving local marginals in terms of local
objects: the messages.
x
fi
...
...
va
vb
ga
gb
m1 mi
f1
µa
µb
Figure 2.6: Messages mi sent from the neighbouring factor nodes to x, and messages
µj from variables to one of the factors.
But what is the actual form of the messages mi(x)? The factor fi is connected
12 Graphical Models
to Ki variables nodes vik other than x. Because the graph is a tree, these nodes
are themselves connected to separate sub-graphs of their own, containing other
variableswik, connected through a (potentially factorizing) distribution h
i
k(v
i
k,w
i
k).
So we can write
mi(x) =
∫
dvi1 · · ·
∫
dviKifi(x, v
i
1, . . . , v
i
Ki)
Ki∏
k
[∫
hik(v
i
k,w
i
k) dw
i
k
]
=
∫
dvi1 · · ·
∫
dviKifi(x, v
i
1, . . . , v
i
Ki)
Ki∏
k
µik(v
i
k)
(2.4)
In the second line, we have noted that the product again contains terms which are
local objects, only depending on vik, because all other variables are integrated out,
and have denoted these terms as messages µik(v
i
k) from the variable nodes v
i
k to
the factor fi. In its general form, Equation (2.4) is difficult to parse, so consider
specifically the example in Figure 2.6. The subgraph connected to f1 contains the
variables va and vb, each of which are is connected to only one other factor, ga and
gb. We thus have
m1(x) =
∫
dva
∫
dvb fi(x, va, vb)ga(va)gb(vb) (2.5)
and are already done; the messages from the variables are simply given by ga and
gb. In the more general case of Equation (2.4), it might seem like evaluating the
messages µik is complicated, because it involves integrating out all other variables
connected to this node. However, the sub-graph connected to this node is a tree
itself, containing other factors, so hik separates into other terms h
i
k,1, . . . h
i
k,H . This
is where an inductive argument can take hold: In fact, µik is of analogous form
to Equation (2.3), and the message can be constructed simply by multiplying the
incoming messages from all factors connected to vik, except for fi:
µik(v
i
k) =
H∏
`=1
mhik,`(v
i
k) (2.6)
We summarize: Evaluating the marginal of any variable node in a tree-structured
factor graph involves a series of localized computations represented by messages.
. Messages from variables vi to factors fk are formed by multiplying all incom-
ing factor-to-variable messages into vi, other than that from fk (denote that
neighbourhood set by ne(vi) \ fk):
µvi→fk(vi) =
∏
`∈ne(vi)\fk
mf`→vi(vi) (2.7)
2.2 The Sum-Product Algorithm 13
. Messages from factors fk to variables vi are formed by integrating out (sum-
ming over) all variables connected to fk, except for vi itself (denoted by
ne(fk) \ vi):
mfk→vi(vi) =
∏
h∈ne(fk)\vi
∫
µvh→fk(vh) dvh (2.8)
Using this scheme, calculating marginals in factor graphs reduces to a mechanical
process of summing and multiplying probability distributions. At the leaves of the
graph, the induction can be anchored by defining that the messages from leaf nodes
are
µx→f (x) = 1 and mf→x(x) = f(x) (2.9)
as already used in Equation (2.5). Another appealing property is the fact that
message passing to gain marginals on all variables in the graph is linear in the
number of variables. To see this, choose any variable in the graph and define it to
be the root of the tree. Sending messages from the leaves to the root is possible
because all necessary messages are available locally. Once the root has received
all messages, send messages to the leaves, which is now possible because all nodes
have their necessary incoming messages. The result is a set of marginals (products
of incoming messages) on all variables in the tree, where every node was involved
in message passing twice.
Conditioning on Data
Equation (2.2) already established that normalization of marginals can be per-
formed locally. So far it was assumed that all variables are described by probabil-
ity measures (i.e. they are latent). In real applications, a subset of the variables
will invariably have been observed, or otherwise set to some fixed value. Extend-
ing the message-passing paradigm to include this situation is straightforward from
a theoretical standpoint: We simply connect all observed variables to “pin-down”
factors containing Dirac δ distributions to fix values. Because
∫
δ(x−x0) dx = x0,
the effect of this on the resulting algorithm is even less complicated: Instead of
evaluating integrals over variables, their values are set to fixed values.
General Graphs
The sum-product algorithm is applicable only to trees. However, it is possible to
extend it to general graphs by constructing a tree through combination of cliques of
nodes (non-tree-structured subgraphs) into joint nodes of a meta-graph, known as
the junction tree, on which the sum-product algorithm can then be run [Lauritzen
and Spiegelhalter, 1988]. This scheme will be used implicitly in chapters 4 and
5 of this thesis. Unfortunately, inference within the cliques now involves high-
14 Graphical Models
dimensional integrals again, and the overall computational cost will be exponential
in — and hence dominated by — the size of the largest clique in the graph.
2.2.1 The Sum Product Algorithm is No Panacea
The graphical model paradigm and the sum-product algorithm leverage the con-
ditional independence structure of multivariate distributions to find low-cost ways
of evaluating marginals. Unfortunately, they do not solve all challenges of applied
inference. Some problems remain:
Integration Constructing factor-to-variable messages still involves multidimen-
sional integrations, albeit of lower dimensionality. Possible approximate ap-
proaches include representing the messages by a set of samples (Section
2.3.5), or by a “nonparametric” representation on a grid of fixed resolution,
i.e. essentially a histogram. Another popular approach, which will be used
widely in this thesis, is to choose the distributions involved carefully such
that analytical integration is possible, or to approximate the resulting dis-
tributions with others of simpler, parametric structure (Section 2.3.2).
Conditional Dependence Even taking all factorization properties into account,
many models will still retain large cliques of dependent variables which have
to be represented by joint vector-valued nodes in the graph to retain tree
structure. Inference involving such variables can still be subject to the curse
of dimensionality. Even in the easiest case, if the relationships are linear,
inference will involve solving linear systems of equations. Leaving some tech-
nical issues aside [Golub and Van Loan, 1996], solving such systems has
computational cost cubic in the clique size. This can still be too slow for
many real-world applications. Often, the best remaining options will then
be to construct independent approximations (Section 2.3.3), or simply to as-
sume independence in the marginals and use certain heuristics to deal with
the resulting defects (Section 2.3.6).
At their core, these issues reflect the fact that factorization simplifies integration,
but does not make it trivial: We are still left with integrals, and integration, both
analytic and approximate, is still an art, not a mechanical process. There are
no general analytic solution strategies for integrals. Applied inference remains a
challenging field. The remainder of this thesis will revolve around approximate
techniques, as outlined above, which can make hard inference problems tractable
without sacrificing the crucial aspects of the problem itself. In effect, this entire
thesis is a collection of approximations and design tricks. This is not necessarily a
deficiency: Entire disciplines in other fields, e.g. condensed matter physics, could
arguably be described as collections of highly developed integration tricks. The
2.3 Approximate Inference Methods 15
hope is that the algorithms presented in the following sections and the remaining
chapters of this thesis can convey concepts necessary for applied inference problems
in general.
2.3 Approximate Inference Methods
One could think that approximation in probabilistic models is a completely un-
structured field: We build a model, realize that inference is intractable, and then
have to somehow pluck an ad-hoc approximation out of thin air. In fact, there are
several more structured approaches available. A lot of problems can be prevented
from the start by using convenient probability distributions. Distributions forming
exponential families provide parametric descriptions with several favourable prop-
erties. The next section will give a brief introduction to this approach. Although
exponential families in themselves are exact mathematical representation of be-
liefs, they are presented here in a section on approximate methods; because from
a strict Bayesian point of view, they already represent a form of approximation,
in the sense that the designing human represents her or his internal beliefs in an
“alphabet” of parametric distributions approximating the mental beliefs which are
never explicitly stated.
2.3.1 Exponential Families
An exponential family is a set of distributions q(x) over the variable x ∈ RD,
parameterized by a set of parameters η ∈ RF , of the general form
q(x;η) = g(η) exp
[
ηTu(x)
]
= exp
[
ηTu(x) + log g(η)
]
(2.10)
with functions g : RF → R and u : RD → RF . In this, the so-called canonical
representation, the elements of η are the natural parameters of the distribution.
The elements of u(x), or linear combinations thereof, are known as the sufficient
statistics of the distribution. Not all functions of the form (2.10) are integrable, the
definition should be understood to contain the implicit assumption that the distri-
bution is in fact normalizable and normalized (this might also imply that both x
and/or η might only be well defined on a subspace of RD). Under this requirement,
the function value g(η) can be interpreted as the normalization constant, because
g(η)
∫
exp
[
ηTu(x)
]
dx = 1. (2.11)
16 Graphical Models
Example: The Gaussian Distribution The family of all normal distributions
forms an exponential family, because it can be written as
N (x;µ, σ2) =
1
√
2piσ2
exp
(
−
(x− µ)2
2σ2
)
= exp


(
µ/σ2
1/σ2
)T(
x
−12x
2
)
−
1
2
[
µ2
σ2
+ log(2piσ2)
]


(2.12)
its natural parameters are the precision-adjusted mean µ/σ2 and the precision σ−2.
Note that the normalization function
g(µ/σ2, σ−2) =
1
√
2piσ2
exp
(
−
1
2
µ2
σ2
)
(2.13)
has a slightly different form than the term normally interpreted as the normaliza-
tion constant of the Gaussian. The Gaussian’s sufficient statistics are often called
the sample mean and sample variance (as well as the number of samples), for the
following reason: If a dataset consists of N data points xi generated from a Gaus-
sian distribution, then their conditional probability given the natural parameters
is
p(D |µ, σ2) =
N∏
i
N (xi;µ, σ
2)
= exp


(
µ/σ2
1/σ2
)T( ∑
i xi
−12
∑
i x
2
i
)
−
N
2
[
µ2
σ2
+ log(2piσ2)
]


= exp


(
µ/σ2
1/σ2
)T(
Nx¯
−N2 (S + x¯)
)
−
N
2
[
µ2
σ2
+ log(2piσ2)
]


(2.14)
with
x¯ =
1
N
∑
i
xi and S =
1
N
∑
i
(xi − x¯)
2. (2.15)
Exponential families have several algebraic aspects that make them particularly
well suited for use in structured probabilistic models; the following paragraphs
point out some of these properties.
Closure Under Multiplication and Exponentiation The members of one ex-
ponential family form an associative algebra over the positive real numbers, in
the sense that two members of the family can be multiplied together and with a
positive real number, returning another unnormalized distribution which also lies
2.3 Approximate Inference Methods 17
in the family. They are also closed under exponentiation: In general, we find
a·q(x;η)αq(x;η′)β ∀a ∈ R+ ∀α, β ∈ R
= exp
[
(αη + βη′)Tu(x) + α log g(η) + β log g(η′) + a
]
= a
gα(η)gβ(η′)
g(αη + βη′)
· q(x;αη + βη′)
(2.16)
Consider for example the multiplication of two Gaussian distributions,N (x;µ1, σ21)
and N (x;µ2, σ22). From Equation (2.12) above, we see that multiplying Gaussians
involves adding the precision-adjusted mean and precision. We can also find the
constant scaling factor under multiplication (starting from Equation (2.13) and
leaving out a few lines of straightforward algebra) to be
g(µ1/σ21, σ
−2
1 )g(µ2/σ
2
2, σ
−2
2 )
g(µ1/σ21 + µ2/σ
2
2, σ
−2
1 + σ
−2
2 )
=
1
√
2pi(σ21 + σ
2
2)
exp
(
−
1
2
(µ1 − µ2)2
σ21 + σ
2
2
)
(2.17)
Which gives the widely known formula
N (x;µ1, σ
2
1)N (x;µ2, σ
2
2) =
N (µ1;µ2, σ
2
1 + σ
2
2)N
[
x;
(
µ1
σ21
+
µ2
σ22
)(
1
σ21
+
1
σ22
)−1
,
(
1
σ21
+
1
σ22
)−1
]
(2.18)
So multiplication within exponential families amounts to summation of the natural
parameters. This is crucial because multiplication is such a central operation in
message passing algorithms, and exponentiation will be a convenient trick in some
approximate schemes (Section 2.3.2). Of course, if we can multiply by adding natu-
ral parameters, we can also divide distributions, by subtracting natural parameters
— as long as we are careful not to break integrability and end up with an improper
distribution. Up to that restriction, exponential families do indeed form Abelian
groups under multiplication.
Analytic Conjugate Priors on the Parameters In Bayesian inference, conjugate
priors are a much-used tool in the construction of analytic inference algorithms. A
conjugate prior pi(η;ω) to the likelihood λ(x | η) is a probability distribution over η,
parametrized by some parameters ω, such that the posterior on η (the normalized
product of pi and λ) is a member of the same functional family as the prior:
λ(x | η)pi(η;ω)
∫
λ(x | η)pi(η;ω) dη
= pi(η;ω′) (2.19)
Such forms are crucial for efficient inference because they allow data to be incor-
porated into a belief analytically. For exponential families, conjugate priors can be
constructed analytically, up to normalization: Assume that we have access to N
18 Graphical Models
data points xn generated from a distribution with uncertain parameters η of the
form (2.10). We define a prior distribution r on the parameters η of the form
r(η;χ, ν) = f(χ, ν)g(η)ν exp
[
ηTχ
]
= exp


(
χ
ν
)T(
η
log[g(η)]
)
+ log f(χ, ν)

 .
(2.20)
Note that this distribution itself forms an exponential family, with the natural
parameters ω ≡ (χ, ν), the sufficient statistics u(η) = (η, log[g(η)]), and the
normalization constant f(ω). Multiplying the likelihood (2.10) with this prior, we
get the unnormalized posterior
p(η |χ,ν, {xn}) ∝ g(η)ν+N exp
[
ηT
(
∑
n
u(xn) + χ
)]
∝ r
(
η;χ+
∑
n
u(xn), ν +N
) (2.21)
which is of the same form as (2.20). This construction of course does not provide
the normalization constant f of either prior or posterior, which can be difficult to
evaluate in practice. Its technical nature also means that it does not necessarily
provide the most convenient parametrization for the conjugate prior. For example,
consider again the Gaussian distribution from Equation (2.12). Equation (2.20)
gives a conjugate prior of the unnormalized form
r(µ, σ2;χ, ν) ∝ (2piσ2)ν/2 exp
[
1
σ2
(
µχ1 + χ2 + µ
2χ3
)
]
∝
1
(2pi)ν/2
(σ−2)νeχ2σ
−2
e
(µχ1+µ
2χ3)
σ−2
(2.22)
which is in fact an unnormalized form of the well-known normal-gamma prior for
the parameters of a Gaussian. The usual parametrization of the normal-gamma
distribution is
r(µ, σ−2 | a, b, α, β) =
βα
√
b
Γ(α)
√
2pi
(σ−2)α−1/2e−βσ
−2
e−
b(µ−a)2
2σ2 (2.23)
so we can identify the linear transformation ν = α− 1/2, χ1 = ab, χ2 = −β, χ3 =
−b/2 to establish an isomorphism between our derivation and the standard form.
The remaining dissimilarities between Equations (2.22) and (2.23) are functions of
the parameters only and can thus be subsumed into the normalization constant.
Estimation Through Differentiation For exponential families, the expected
value of the sufficient statistics is closely related to the gradient of the normal-
2.3 Approximate Inference Methods 19
ization constant. To see this, we differentiate Equation (2.11) with respect to η on
both sides:
0 = [∇g(η)]
∫
exp
[
ηTu(x)
]
dx+ g(η)
∫
exp
[
ηTu(x)
]
u(x) dx (2.24)
Using (2.11) for the first term and (2.10) for the second, we get
0 = [∇g(η)] · g−1(η) + E[u(x)] (2.25)
and hence
E[u(x)] = −∇ log g(η). (2.26)
So the expected value of the sufficient statistics equals the negative log derivative
of the natural parameters. This is a somewhat obscure result at first sight, but
it will become important in the derivation of several approximate methods in the
following sections. The significance arises because for many of the widely used
exponential families at least one of these terms is readily evaluated; so being able
to express one in terms of the other is helpful.
Expressive Forms Many of the most popular parametric distributions in applied
statistics do indeed form exponential families. Among them are the (univariate and
multivariate) Gaussian, and the Gamma andWishart distributions (defined on pos-
itive real numbers and positive definite matrices, respectively). The Multinomial
distribution on categorical variables, and the Dirichlet distribution over discrete
probabilities (including their special two-dimensional or single-count cases known
as the Bernoulli, Discrete, Binomial and Beta distributions), as well as the Poisson
and Exponential distributions, the von-Mises distribution, and several others, also
form exponential families.
There are additional beneficial aspects of exponential family distributions which
are beyond the scope of this text. For example, they are in some sense easy to ex-
tend to nonparametric Bayesian models [Orbanz, 2009]. The upshot for the practi-
tioner of approximate inference is that it is generally a good idea to use exponential
family distributions when designing probabilistic models, as this will already avoid
many problems arising immediately when using general, less structured distribu-
tions. The diversity of available exponential family distributions means that they
are applicable in many situations; and they can be combined (using graphical
models) to form very expressive hierarchical models.
20 Graphical Models
2.3.2 Expectation Propagation
Minimizing KL-Divergence — Gaussian Moment Matching
Even when approximations become necessary, exponential family distributions can
still be helpful. Intractability will typically be due to nonlinear factors in the graph
or the connection of variables with nonconjugate marginals. When using the sum-
product algorithm, both of these problems can often be addressed through local
approximations: At a given problematic variable node x, the marginal
p(x) =
∏
i
mi(x) (2.27)
will consist of a product over messages mi(x) that do not give a simple parametric
form when multiplied. We can then try to find a projection of the marginal into
an exponential family q(x), and hope that the resulting approximate marginal
still captures the “important” aspects of the exact marginal. How to choose this
projection is a matter of mathematical convenience and what the approximation is
supposed to achieve (for example, whether the approximation should give a good
representation of the exact distribution’s overall width, or whether is should give
a good representation of a certain region of the distribution).
One popular method is to minimize the Kullback–Leibler divergence DKL(p‖q) of
q from p, which leads to an algorithm known as Expectation Propagation (EP)
[Minka, 2001]. The KL-divergence [Kullback and Leibler, 1951] is also known as
the relative entropy and defined as
DKL(p‖q) =
∫
p(x) log
p(x)
q(x)
dx = −
∫
p(x) log q(x) dx− H[p(x)] (2.28)
where H[p(x)] ≡ −Ep(log p) is the entropy of p. Since we want q to be in an
exponential family, i.e. of form (2.10), we can evaluate (2.28) to
DKL(p‖q) = − log g(η)− ηTEp[u(x)] + constants. (2.29)
To minimize the divergence, take the gradient with respect to η and set to zero,
giving
−∇ log g(η) = Ep[u(x)]. (2.30)
Now we can use Equation (2.26) on the left side to find
Eq[u(x)] = Ep[u(x)]. (2.31)
So, to minimize KL-divergence from p to q, we need to match the expected sufficient
statistics of q to their expected value under p. The most widely used exponential
2.3 Approximate Inference Methods 21
family for expectation propagation is the Gaussian family. Extending the univariate
form of Equation (2.12) to the general, D-dimensional case, the exponential family
form of the multivariate Gaussian is
N (x;µ,Λ−1) =
|Λ|1/2
(2pi)D/2
exp
[
−
1
2
(x− µ)TΛ(x− µ)
]
= exp
{
−
1
2
[
xTΛx− 2xTΛµ+ µTΛµ− log |Λ|+D log 2pi
]
}
= exp



−
1
2
(
Λµ
vec Λ
)T(
−2x
vecxxT
)
+ log g(Λµ,Λ)



(2.32)
where vecA denotes some arbitrary but consistent way of stacking matrix A into
a vector1. Analogous to the univariate case, the canonical parameters of the multi-
variate Gaussian are the precision matrix Λ (the inverse of the covariance matrix)
and the precision-adjusted mean vector Λµ; and the sufficient statistics of the
multivariate Gaussian are x and xxT. Hence, for Gaussian EP, we want to choose
a Gaussian approximation q(x) = N (x;µ,Λ−1) such that its first two moments
(mean E[x] and variance E[xxT]− [Ex][Ex]T) match the first two moments of the
true marginal p.
Approximating Individual Messages
In message passing algorithms, we will deal with marginals consisting of a product
of messages:
p(x) = Z−1
∏
i
fi(x). (2.33)
Because the messages are the fundamental objects of the sum product algorithm,
we will want the approximating Gaussian to be also a product of Gaussians
q(x) = Z˜−1
∏
i
f˜i(x) with f˜i(x) = N (x;mi, vi) (2.34)
The naïve thing to do would be to match the moments of each unnormalized dis-
tribution fi(x) to the moments of f˜i(x). However, we are ultimately interested in
the marginals, not the messages. So it would be better if we could ensure that the
approximate marginal minimizes KL-divergence to the exact marginal. Unfortu-
nately, matching the moments of the product of complicated messages fi can be
challenging — splitting the resulting marginal into meaningful messages even more
so. Minka [2001] found a surprisingly simple and elegant intermediate approach,
based on the ease with which exponential family distributions can be multiplied
1This operation is well defined here, because we can write, using the sum convention [Ein-
stein, 1916], xTΛx = xiΛijxj = Λij(xxT)ij , so any consistent way of mapping the indices
ij ∈ [1, . . . , D]× [1, . . . , D] to a new index k ∈ [0, . . . , D2] works.
22 Graphical Models
and divided (Section 2.3.1, Equation (2.16)). He proposes an iterative scheme, in
which, for any j, the message f˜j is chosen such that
q′(x) ∝ f˜j(x)
∏
i 6=j
f˜i(x) (2.35)
is as close to
fj(x)
∏
i 6=j
f˜i(x) (2.36)
as possible in KL-divergence. At runtime of the algorithm, we have available a
collection of approximate messages f˜i(x), as well as an approximate marginal q(x)
(if some messages are not yet available, they are replaced with uninformative de-
generate messages with natural parameter vector 0. In the Gaussian case, this
corresponds to setting the precision matrix to 0). To update the message f˜j(x),
we calculate the cavity distribution
q\j(x) ≡
q(x)
f˜j(x)
. (2.37)
Note that this division of distributions is a well defined operation, corresponding to
subtracting natural parameters (Section 2.3.1). Then, we calculate the expected
values of the sufficient statistics (for Gaussians: the first two moments) of the
product
fj(x)q
\j(x). (2.38)
These statistics (moments) give an approximation q′(x). Minka introduces an op-
erator proj for this operation, which allows the intuitive notation
q′(x) = proj
[
fj(x)q
\j(x)
]
. (2.39)
The distribution q′ is the new approximate marginal and defines the approximate
message as
f˜j(x) =
q′(x)
q\j(x)
=
proj[fj(x)q\j(x)]
q\j(x)
(2.40)
The process is repeated until convergence, which can be measured as follows: Be-
cause the message is in the exponential family, the update from the old message
f˜ oldj to the new message f˜
new
j can be written as
∆f˜j =
f˜newj
f˜ oldj
. (2.41)
Since this update is itself an element of the exponential family, it is described by
a natural parameter vector, which can be used to measure convergence through
2.3 Approximate Inference Methods 23
any arbitrary norm.2 It is interesting to note in passing that messages themselves
do not need to be proper distributions (see Section 2.2), as long as the marginal
is normalizable. Indeed, in some applications of EP, “Gaussian” messages with
negative precision are not unusual.
Power EP
One interpretation for EP is as a form of approximate numerical integration: The
normalization constant Zq =
∫ ∏
i f˜i(x) dx (which is easy to evaluate thanks to
Equation (2.16) and Equation (2.11)) is an approximation to the local partition
function Zp =
∫ ∏
i fi(x) dx. Sometimes, however, even the individual integrals
required to evaluate proj[fj(x)q\j(x)] can be intractable. In that case, an extension
of EP called Power EP [Minka, 2004] can offer hope: For many functions fj(x)
for which
∫
fj(x)q\j dx is difficult to evaluate, there exists a scalar nj such that
∫
fnjj (x)q
\j(x) is tractable. As a simple example, consider f(x) = 1/x. The integral
∫
x−1N (x;µ, σ2) dx (2.42)
is much more challenging than
∫
xN (x;µ, σ2) dx = µ (2.43)
In the Power EP algorithm, we introduce new functions hi(x) = f
ni
i (x). Then, the
algorithm proceeds almost identically to EP: Calculate the cavity distribution
q\j(x) = q(x)/h˜j(x); (2.44)
project
q′(x) = proj
[
hj(x)q
\j(x)
]
; (2.45)
and store the message
h˜newj (x) =
q′(x)
q\j(x)
; (2.46)
but now update the marginal to the new distribution
qnew = qold(x)
(
h˜newj (x)
h˜oldj (x)
)1/nj
(2.47)
2because the natural parameters are elements of RM for some M , and all norms on finite-
dimensional real vector spaces are equivalent, the choice of norm is arbitrary. If there is an εa such
that the parameter vectors η∆ of the updates become and remain, for all subsequent iterations,
smaller than εa under the norm ‖ · ‖a, i.e. ‖η∆‖ < εa, then for any other norm ‖ · ‖b, there exists
an εb such that the updates’ parameters become and remain smaller than εb under ‖ · ‖b. Hence,
only the numerical value of the stopping condition ε has to be chosen sensibly.
24 Graphical Models
where the exponent nj denotes exponentiation of exponential family distributions,
as defined in Equation (2.16). Minka [2004] shows that this iterative scheme has
the same fixed points as an implementation of EP which does not minimize KL-
divergence, but instead the more general α-divergence [Amari, 1985] for α = 1/ni.
The α-divergence is defined as
Dα(p‖q) =
1
α(1− α)
∫
αp(x) + (1− α)q(x)− pα(x)q1−α(x) dx (2.48)
It satisfies Dα(p‖q) ≥ 0, with equality if and only if p = q. For α → 1, it reduces
to3 DKL(p‖q); for α → 0, it equals DKL(q‖p). For α = 0.5 it gives the Hellinger
distance, for α = 2 the χ2 distance. Changing α → 1 − α swaps the position of p
and q [Minka, 2005].
Damping Messages
A drawback of EP is that it is not guaranteed to converge [Minka, 2004]. The
messages passed around the graph constitute a dynamic system that is not always
stable. One approach to this problem, which is not guaranteed to be effective, but
has been empirically found to be helpful in many cases, is to “damp the messages”
[Minka, 2004], i.e. to weaken the iterative updates to the messages in a way that
does not change the fix-points of the algorithm. For the general case of Power EP
(for standard EP, set nj = 1), we use a scalar 0 < γ ≤ 1 to construct the projected
marginal as in Equation (2.45), then create a new message as
h˜newj (x) =
(
h˜oldj (x)
)1−γ
(
q′(x)
q\j(x)
)γ
= h˜oldj (x)
(
q′(x)
q(x)
)γ
(2.49)
and update the marginal to
qnew(x) = qold(x)
(
h˜newj (x)
h˜oldj (x)
)nj
= qold(x)
(
q′(x)
q(x)
)γnj
. (2.50)
EP: Summary
EP and Power EP provide a flexible approximate scheme for message passing which
can double as an approximate integration method. Power EP iteratively minimizes
3To see this, note that both numerator and denominator of Equation (2.48) are smooth
functions and vanish for α → 1 and α → 0, so L’Hôpital’s rule applies. Differentiating both
numerator and denominator with respect to α gives
lim
α→1
Dα(p‖q) = lim
α→1
−
∫
log(p/q)pαq1−α dx
1− 2α
= DKL(p‖q).
because α→ 1− α swaps positions of p and q, the other statement follows directly.
2.3 Approximate Inference Methods 25
the local α-divergence
Dα
(
fj(x)
∏
i 6=j
f˜i(x)
∥
∥
∥
∏
i
f˜i(x)
)
(2.51)
with α = (2/ni) − 1, standard EP sets ni = 1, corresponding to minimizing the
local KL-divergence from p to q. Note that, since this KL-divergence contains the
term −
∫
p(x) log q(x) dx, it becomes large if q(x) puts small mass on values of
x where p(x) is large. Conversely, regions where p(x) vanishes do not influence
DKL(p‖q). This means EP prefers broad approximations. This is in stark contrast
to the other approximations following in this section, and is an important charac-
teristic determining the use cases for EP.
The two biggest issues with EP are
. EP still requires analytic evaluation of certain integrals (albeit much easier
ones than the ones required for the full joint distribution). This can make
EP difficult to apply (see Chapter 3 and Appendix A for derivations of just
one specific factor and the technical stretches required)
. EP is not guaranteed to converge (but often does so anyway)
In this thesis, EP is used extensively in Chapter 3 and Chapter 4. Appendices A
and B contains derivations for EP updates on one specific functional factor.
2.3.3 Variational Inference
The previous section established EP, an approximation based on the minimiz-
ing the KL-divergence DKL(p‖q) from p to q on some local objective. Minimizing
KL-divergence in this direction leads to a “broad” approximation, because the di-
vergence becomes large if p puts mass on a region where q vanishes. This section
considers in some sense the opposite approach (but see the alternative motivation
at the end of this section): Given the joint distribution p(x,θ) of a set of observable
variables (data) x and latent parameters θ, we try to minimize the KL-divergence
DKL[q(θ)‖p(θ |x)] = −
∫
q(θ) log
p(θ |x)
q(θ)
dθ (2.52)
between an approximating distribution q for the parameters and their exact pos-
terior distribution. This leads to an approximation attempting to fit the shape of
p in some region, while ignoring other regions. This behaviour can be desirable
for example in mixture models, which often have combinatorial symmetries: They
put identical mass on equivalent realizations. In such situations, the approximation
should pick out one good realization and represent a belief over it well. In contrast,
26 Graphical Models
EP would construct a very broad approximation that covers all equivalent realiza-
tions, but does not necessarily capture the structure of any individual realization
well.
Inspecting Equation (2.52), it would seem that minimizing this objective involves
having to evaluate the exact posterior, which would beg the question, as that is
exactly what we are trying to avoid. However, there is a “shortcut”, which involves
an algebraic trick: Notice that, using any arbitrary proper distribution q(θ), we
can use Bayes’ rule to expand the log model evidence (which is a fixed number) as
log p(x) =
∫
q(θ) log
p(x,θ)
q(θ)
dθ
︸ ︷︷ ︸
L[q(θ)]
+
∫
q(θ) log
q(θ)
p(θ |x)
dθ
︸ ︷︷ ︸
DKL[q(θ)‖p(θ |x)]
= L[q(θ)] +DKL[q(θ)‖p(θ |x)]
(2.53)
where we have defined a variational bound L(q), also known as the variational free
energy in statistical physics, on the log evidence. To see that this is indeed a lower
bound, note that the KL-divergence is non-negative, a statement known as Gibbs’
inequality4. Hence, to minimize the KL-divergence, maximize L(q) instead. This
is potentially much easier, because the joint is often available in closed form.
This is known as a variational approximation, because maximizing the bound in-
volves functional derivatives of the functional L, and the branch of mathematics
concerned with functional derivatives is known as the calculus of variations.
Alternate Motivation The derivation of the bound by Equation (2.53) may
seem somewhat unsatisfactory, and indeed it of course leaves out certain techni-
cal requirements on the approximating distribution. It is also possible to motivate
the variational approximation in an alternate way inverse to the argument of the
previous section: To find a good approximation q, construct a bound on the log ev-
idence of the model, note that it involves the positive definite KL-divergence, and
maximize the bound. The connection between EP and variational approximations
provided by the different directions of KL-divergence they minimize can provide a
helpful intuition when deciding which of the two to use (see Section 2.3.7). How-
ever, it also contains a pitfall: Note that variational inference directly minimizes
KL-divergence, i.e. using the actual joint distribution, not some factorised approx-
imation of it, while EP only minimizes a local KL-divergence. This difference is
less benign than it might seem: Using variational inference, one can at least rely
on having the right objective (even if the found bound might well be loose), while
4It is easy to prove Gibbs’ inequality [e.g. MacKay, 2003], using Jensen’s inequality [Jensen,
1906]. Consider the convex function f(u) = 1/u and u = p(x)/q(x) > 0. Then DKL(q‖p) =
Eq[f(u)] ≥ f [Eq(u)] = f(
∫
q pq ) = − log
∫
p = − log 1 = 0
2.3 Approximate Inference Methods 27
with EP, one always has to worry somewhat about the local approximations miss-
ing important aspects of the product of the factors.
Factorized Variational Approximations
If we pose no restrictions on q, the bound L reaches a unique maximum if q(θ) =
p(θ |x), where DKL(q‖p) = 0. Since the assumed intractability of this distribution
is the reason to use an approximation in the first place, this most general form is
of little use. To find approximations of easier to track structure, we can impose
factorization properties on q over K disjoint subsets θk of θ:
q(θ) =
K∏
k=1
qk(θk) (2.54)
In line with the overall point of this chapter that factorization leads to local com-
putations, this factorized form for q allows us to derive a local bound which is a
function of only one qj, simply by inserting (2.54) into L from (2.53). Writing qk
as shorthand for qk(θk) we get
L(q) =
∫ ∏
k
qk
[
log p(x,θ)−
∏
`
log q`
]
dθ
=
∫
qk
[∫
log p(x,θ)
∏
6`=k
q` dθ`
]
︸ ︷︷ ︸
≡E 6`=k[log p(x,θ)]
dθk −
∫
qk log qk dθk + const.
=
∫
qk · E` 6=k[log p(x,θ)] dθk + H[qk] + const.
(2.55)
The last line of this equation has the form (up to sign) of a KL-divergence between
qk(θk) and the distribution p˜ that satisfies the relation
log p˜(θk) = E`6=k[log p(x,θ)] + const. (2.56)
Hence, L is maximized when qk = p˜, i.e.
log qk(θk) = E` 6=k[log p(x,θ)] + const.
qk(θk) ∝ exp (E` 6=k[log p(x,θ)])
(2.57)
If the expectation of log probabilities in this expression is easy to evaluate (here
again, using exponential family distributions for model design can greatly reduce
the computational complexity), then Equation (2.57) directly leads to an iterative
optimization scheme: For each variable group θk, maximize the local bound, then
repeat. If we choose the θk to correspond to one variable in the graphical model
28 Graphical Models
each, this scheme is known as Variational Message Passing (VMP) [Winn and
Bishop, 2006]. To find good approximations, however, it can sometimes be a better
idea to adapt the graphical representation to the approximation, rather than the
other way round (see Chapter 5).
2.3.4 Laplace Approximations
The approximate inference methods described in the previous two sections can
provide expressive approximations; but they both require the evaluation of some
integrals involving the exact distribution p in some way. Even though the inte-
grals in question are much simpler than the integration required to evaluate the
exact posterior, they can still be challenging in some cases. This section introduces
an approximate scheme, known as a Laplace approximation, which only involves
derivatives of p. Since integration is harder than differentiation, this approxima-
tion can be applied to a larger class of distributions. Unfortunately this simplicity
comes at the cost of a considerable defect, which means that this scheme should
only be applied with care.
Since conditional dependence structure will not be important for the following
derivations, we will abandon the separation of variables and parameters, and con-
sider, for simplicity, a probability distribution p(x) of some multidimensional vari-
able x.
“Laplace approximation” is a grand name for a second order Taylor expansion of
log p(x) around a mode µ of p(x) (which, because the logarithm is a strictly mono-
tonic function, is identical to the mode of log p(x)): To find a mode, differentiate
once and set to zero
∇ log p(µ) =
∂ log p(x)
∂x
∣
∣
∣
∣
x=µ
= 0 (2.58)
if log p(x) is analytic in a neighbourhood of µ, then it can be expanded in this
neighbourhood to
log p(µ+ x) = log p(µ) + (x− µ)T∇ log p(µ)
︸ ︷︷ ︸
=0 by eq. (2.58)
−
1
2
(x− µ)TΛ(x− µ) +O[(x− µ)3]
= −
1
2
(x− µ)TΛ(x− µ) + const +O[(x− µ)3]
(2.59)
where Λ is the negative Hessian matrix with elements
Λij = −
∂2 log p(x)
∂xi∂xj
∣
∣
∣
∣
x=µ
. (2.60)
There is an exponential family whose natural parameters are linearly related to µ
2.3 Approximate Inference Methods 29
x
p(
x)
p(x)
q(x)
Figure 2.7: Defect of the Laplace approximation: because modes are a local feature of
the distribution rather than a global one, approximations based solely on the structure of
the distribution around the node can lead to bad representation of the true distribution
p(x) by the approximation q(x).
and Λ: the Gaussian
logN (x;µ,Λ−1) = −
1
2
(x− µ)TΛ(x− µ) +
1
2
(log |Λ| −D log 2pi) (2.61)
so the Laplace approximation consists of approximating p(x) with a Gaussian
q(x) = N (x;µ,Λ−1) whose mean is set to the location of a mode of p, and whose
precision matrix (inverse covariance matrix) is set to the negative Hessian of log p
at its mode.
From this construction, the main strength of this approximation scheme is im-
mediately obvious: Laplace approximations require very little algebraic structure
of p. It suffices to be able to differentiate p twice. It is a particularly convenient
approximation if we can find the mode and Hessian analytically, but if everything
fails, a numerical optimization scheme, such as Newton–Raphson iterations or even
numerical differentiation can be used as well. At this level, virtually every distri-
bution p which can be evaluated at all becomes fair game. Unfortunately, there
is a major drawback: Modes are a local feature and need by no means represent
the overall distribution well (see Figure 2.7). If the distribution has heavy tails,
is multimodal, is strongly asymmetric, or if its shape is badly represented by the
Hessian at the mode, the Laplace approximation can be arbitrarily bad, and this
defect may not even be easy to spot during the derivation! Hence, before resorting
30 Graphical Models
to this form of approximation, it should be ensured that the approximated function
does have an overall shape that can at least be roughly represented by a Gaussian.
See Sections 5.2.2 and 5.3.1 as well as Appendix C for an application where this
approach arguably makes sense.
2.3.5 Markov Chain Monte Carlo
Another approach to approximate inference, quite different from the three methods
presented in the previous sections, is to replace analytical parametric distributions
with samples, leading to Monte Carlo algorithms. If exact sampling from the dis-
tribution in question is not possible — which is the typical situation — sampling
algorithms based on ergodic random walks through the distribution, known as
Markov Chain Monte Carlo (MCMC) algorithms provide an approximate answer.
MCMC methods will not feature prominently in this thesis; good introductions
into this vast field can be found in Murray [2007] and MacKay [2003, chapters 29
and 30]. But approximate sampling can be a great tool for approximate inference,
for two main reasons.
. MCMC algorithms are guaranteed by construction to sample from the exact
posterior in the limit of large sample sizes
. they tend to be easy to implement for general probabilistic models
They also have a few important drawbacks which are mostly to do with practical
issues of implementation rather than with theoretical guarantees:
. finding and fixing bugs in MCMC algorithms can be difficult, as the results
are stochastic by nature
. more importantly, diagnosing convergence of a MCMC sampler is challeng-
ing, and there is no generally applicable criterion for convergence. A very
bad MCMC sampler, performing a random walk in a confined region of the
distribution, can look dangerously similar to a very good MCMC sampler
from the point of view of convergence diagnostics, such as autocorrelation
measures. The time required for convergence is also difficult to predict from
the model structure alone. In models where a particular sampling scheme
mixes well, it can in fact provide the most computationally efficient solution.
But if the algorithm is badly designed and mixes slowly, sampling can be an
exceedingly expensive solution.
The fact that MCMC methods only show up on the fringes of this thesis should not
be interpreted as a statement about their usefulness, but as a conscious decision
to focus on parametric analytic approximations.
2.3 Approximate Inference Methods 31
w1
w2
σprior σposterior
σlikelihood
y = w1 + w3
Figure 2.8: Sketch of the “explaining away” phenomenon. An independent Gaussian
prior (green circle) over the weights w turns into an anti-correlated posterior (blue
ellipsis) upon observing noisy information about the value of their sum y = w1 + w2
(red band, with thick mean and light lines at one standard deviation).
2.3.6 Assuming Independence
A final, drastic measure to simplify inference is to simply assume certain parts of
the model to be independent, even though they are not. Since every interesting
aspect of probabilistic inference relies on dependences between variables, this is
obviously not a good general approach. But there are a few special situations in
which a set of variables is “all but independent” in such a way that they can in
fact be treated as independent. The most important such case is sparse Gaussian
regression: Assume that we get to observe a noisy data point y ∈ R assumed to be
generated in a linear way from weights w ∈ RD, over which we have a Gaussian
prior.
p(w) = N (w;µ,Σ) p(y |w,φ, τ) = N (y;φTw, τ) (2.62)
(see Section 5.2.3 for the more general case with y ∈ RK). The posterior on w can
be found by “completing the square” [e.g. Bishop, 2006, §2.3.3], and is given by
p(w | y, φ) = N (w; Ψ
[
τ−1φy + Σ−1µ
]
,Ψ)
where Ψ =
(
Σ−1 + τ−1φφT
)−1
= Σ−
ΣφφTΣ
τ + φTΣφ
(2.63)
(The last line follows from the matrix inversion lemma. See also Equation (C.17)
and the footnote associated with it.) A typical situation in applications is that D
is very large and φ is a sparse vector with very few non-zero entries.
For example, consider the task of predicting user preferences for certain types of
music from the users’ past rankings of records belonging to certain classes. (This
is just an example, and should not be considered a particularly good solution to
this kind of task. See Bennett and Lanning [2007] for a review of good approaches
32 Graphical Models
w
φi
y˜i yi
τ
N
f
wd
φid
y˜i yi
τ
N
D
f
Figure 2.9: On factorized regression: Left: fully connected, exact regression factor
graph. Right: factor graph for the approximate assumption of independence. The
“sum factor” is marked by an f .
to such tasks). Here, φ might be a binary vector that is zero everywhere, except
for a one at the ID of a particular user, and a one at one other index, indicating
that “this is a user from North America”.
This setup poses a computational challenge: On the one hand, because φ has more
than one nonzero entry, the posterior on w is correlated — even if the prior has a
diagonal covariance matrix Σ = diag(σ2), the outer product φφT induces correla-
tions (nonzero off-diagonal elements). This effect is famously known as explaining
away [Pearl, 1988]; Figure 2.8 shows a simple sketch providing an intuition. On
the other hand, it would be computationally intractable to try to model the entire
D × D correlation matrix: If D ∼ 107 (a typical user number for popular con-
temporary web services), then a single precision representation of Ψ would require
Petabytes of storage.
The alternative is to enforce the assumption that the posterior beliefs on individual
weights wi be independent: p(w) = N [w;µ, diag(σ2)]. This approach has been an
established way to bring down computational cost for some time [Maybeck, 1982,
Opper, 1996]. Figure 2.9 shows a factor graph reflecting this assumption.
Under this assumption of independence, the message from the elements of w to
each data point yi — the “likelihood” term of Equation (2.62) — has a diagonal
covariance and can thus be evaluated with low cost:
mf→yi(yi) =
∫
f(y˜i |φi,w)N (y˜i; yi, τ)N (w;µ,σ
2) dτ dw
=
∫
N
[
yi;
∑
d
φidwd, τ
]
N (w;µ,σ2) dw
= N
[
yi;
∑
d
φidµd, τ +
∑
d
φ2idσ
2
d
]
.
(2.64)
Because the marginals on wd are considered independent, we can find the messages
to the individual weights simply by re-arranging the terms (because the Gaussian
2.3 Approximate Inference Methods 33
is symmetric around the mean: N (x;m, v) = N (m;x, v)). This gives
mf→wd(wd) = N
[
wd;
1
φ id
(
yi −
∑
d′ 6=d
φid′µd′
)
,
1
φ2id
(
τ +
∑
d′ 6=d
φ2id′σ
2
d′
)]
(2.65)
The drawback of this approach is that we have ignored a correction that would
show up in the fully connected model. Over time, the beliefs on wd will thus become
too confident. A simple but ad-hoc approach to rectify this behaviour is known
as exponential forgetting : When iterating through the dataset, at each yi, before
sending the messages to the wd, add a small constant to the uncertainty (variance)
on each wd with φid 6= 0:
σ2d ← σ
2
d +  (2.66)
where  is a free parameter chosen through some heuristics. There are motivations
for this approach — for example, one can interpret it as modeling “drift” of the
correct values for wd from one data point to the other — but they have the whiff of
ex post facto motivations. A more honest motivation is that this factorization as-
sumption is simply so much cheaper computationally that one is willing to accept
small inaccuracies. As long as φ are sparse and pairwise dissimilar (in the sense
that
∑
iφiφ
T
i is a matrix dominated by its diagonal), the resulting defects might
be unproblematic. If additional knowledge about the structure of the φi is avail-
able, e.g. if co-occurrences of features are limited to a sub-space, block-diagonal
approximations can provide a good trade-off between the fully factorized posterior
of Equation (2.63) and the fully factorized marginals arising from Equation (2.65).
2.3.7 Comparison of Approximation Schemes
All the approximate inference algorithms presented in the preceding sections are
generally applicable and fit well with the graphical models framework. Yet they
each have their advantages and shortcomings. Table 2.1 contains a compact overview
of the these aspects. Because of the table’s space limitations, the following list pro-
vides some background on the individual entries:
EP provides broad, “zero-avoiding” [Bishop, 2006] approximations (i.e. q tries to
be non-zero everywhere where p is non-zero). This is advantageous if we
are searching for a approximation of the type “estimate with error bar” —
an unstructured Gaussian approximation covering the entire spread of the
underlying exact, structured distribution (such as in Chapter 3). The same
characteristic is a problem in distributions with combinatorial symmetry if
we are only interested in one of the modes. The most important example of
such combinatorial symmetries is the case of mixture models.
Variational Inference is in some sense the counterpart to EP, providing a “zero-
34 Graphical Models
enforcing” [Bishop, 2006] approximation. Hence, it is a good candidate for
inference in mixture models (such as in Chapter 5), but not well suited for
structured broad distributions with several nodes, whose width we would like
to approximate (such as the max function of Appendix A).
Laplace Approximations can be used in two different ways: As an overall ap-
proximation for the evidence of a big multivariate model, or to construct a
lightweight approximation for the message between two variables within a
larger graphical model. In the former case, the mode and Hessian are usually
found numerically. The second case can be found in Chapter 5 and Ap-
pendix C, where it is used to provide a link between two parts of a graphical
model in which different approximate schemes (factorized message passing
and variational inference) produce two different exponential family distribu-
tions (multivariate Gaussian and Dirichlet distributions, respectively).
MCMC sampling algorithms have the advantage of providing a nonparametric
representation of a distribution through samples, and converge to the exact
posterior in the limit of many samples. They are thus a “gold-standard” to
which other, faster approximations can be compared. The single big draw-
back of MCMC methods is that, unless great care is taken to ensure good
mixing, they can mix badly and take exceedingly long to converge — and
that this behaviour is difficult to discover. This problem is particularly pro-
nounced in hierarchical models of complicated structure, particularly when a
large amount of data is available and the posterior is thus highly concentrated
in several “disconnected” regions.
Exact Sampling Methods mostly refers to rejection sampling algorithms, such
as adaptive rejection sampling [Gilks and Wild, 1992, Wild and Gilks, 1993].
Aside from analytical exact sampling schemes, which are rarely available, a
rejection sampler with high acceptance rate provides arguably the best pos-
sible representation of a joint posterior. Unfortunately, in high-dimensional
models any rejection sampler has low acceptance rate [MacKay, 2003, §29.3]
and is thus of little help. However, in some cases it can be helpful to sam-
ple individual dimensions exactly. For example, single dimensions in a larger
Gibbs sampling scheme might be sampled by a method like adaptive rejection
sampling, when the conditional distributions are not of closed exponential
family form.
2.3
A
pproxim
ate
Inference
M
ethods
35
method requirements strengths weaknesses good applications bad applications
EP
∫
f(θ |x)q(θ) dθ broad approximation
convergence not guar-
anteed; can smooth
out modes
non-linear factors (e.g.
max, step-function,
. . . )
mixture models
Variational
Inference
∫
q(θ) log p(θ, x) dθ
mode-finding
(symmetry-breaking)
weak tails,
ignores minor modes
mixture models non-linear factors
Laplace ∇p∗(θ |x),
∂2p∗(θ |x)
∂θi∂θj
lightweight
approximation based
on local feature of the
distribution (mode)
only
linking approxima-
tions
distributions where
node is no good
representation
MCMC
evaluation of
unnormalized
distribution
exact in the limit of
many samples
convergence/mixing
not measurable
gold standard for
other models to com-
pare to; models with
complex structure
highly hierarchical
models (unless well
mixing); applications
where computational
cost matters
Exact MC
good proposal
distribution
exact, known conver-
gence rate
hard to design; bad
proposal leads to high
rejection rate
challenging 1D sub-
parts of a model
high-dimensional
models (curse of
dimensionality)
Assumed
Independence
none
can drastically reduce
cost
can be arbitrarily
wrong
almost independent
distributions
strongly correlated
distributions
Table 2.1: Overview over strengths and weaknesses of the approximate inference schemes introduced in this chapter.
36 Graphical Models
Chapter 3
Inference on Optimal Play in
Games
The work presented in this chapter was carried out in collaboration with
Thore Graepel and David Stern, both of Microsoft Research Ltd. All
mathematical derivations, algorithmic implementations and experimental
evaluations were performed by the author of this thesis.
A shorter version of this chapter was published as [Hennig et al., 2010]: Co-
herent Inference on Optimal Play in Games ; P. Hennig, D. Stern and T.
Graepel; proceedings of the 13th International Conference on Artificial Intel-
ligence and Statistics, 2010. Journal of Machine Learning Research: W&CP
9
Abstract
Round-based games are an instance of discrete planning problems. Some of the best
contemporary game tree search algorithms use random roll-outs as data. Relying
on a good policy, they learn on-policy values by propagating information upwards
in the tree, but not between sibling nodes. This chapter presents a generative
model and a corresponding approximate message passing scheme for inference on
the optimal, off-policy value of nodes in smooth AND/OR trees, given random
roll-outs. The crucial insight is that the distribution of values in game trees is not
completely arbitrary. We define a generative model of the on-policy values using a
latent score for each state, representing the value under the random roll-out policy.
Inference on the values under the optimal policy separates into an inductive, pre-
data step and a deductive, post-data part. Both can be solved approximately with
Expectation Propagation, allowing off-policy value inference for any node in the
(exponentially big) tree in linear time.
38 Inference on Optimal Play in Games
3.1 Introduction
Games are one of the oldest problem of artificial intelligence research, going back
to early works by Wiener [1948] and Shannon [1950]. Most early research was
concerned with Chess. Following the success of Deep Blue [Campbell et al., 2002]
against Gary Kasparov, interest has shifted to the Asian board game Go. Com-
pared to Chess, Go has a much larger number of possible positions and is not as
amenable to the design of hand-crafted evaluation functions.
Many round-based two-player games, like Chess and Go, can be represented, up
to transpositions, by graphs with the structure of AND/OR trees [Nilsson, 1971].
These are trees with binary leaf nodes, in which branch nodes at alternate depths
are assigned the OR (maximum) value of their children or AND (minimum) value
of their children, respectively, reflecting the adversarial efforts of two competing
players trying to achieve either binary value as the outcome of the game. If the
branching factor b is constant, a tree of depth d contains bd nodes. For Go, b and
d are on the order of 200, so it seems finding the optimal path through the game
should be intractable (In fact, finding an optimal path through the Go game tree
has been shown to be EXPTIME-complete [Robson, 1983]1). Yet humans can find
good paths through the Go tree in finite time.
A crucial property of games that humans use is that the tree has structure: Winning
positions are not distributed uniformly among the tree’s leaves; they are clustered
in the sense that leaf nodes that are close to each other (in terms of the graph)
tend to have similar outcomes. This structure can be modeled by a latent score,
representing the amount by which one player is ‘ahead’ or ‘behind’. Random play
leads to a random walk, typically changing the evaluation by small increments.
Critical moves, changing the score drastically, are a rare occurrence in the tree
overall.
Although it is not usually mentioned explicitly, this smoothness is a crucial intu-
ition behind Monte Carlo algorithms for ‘best first’ tree search, like UCT [Kocsis
and Szepesvári, 2006], which have been very successful recently. These algorithms
repeatedly play roll-outs—random descents through the tree to a leaf, generated
by a relatively weak or even uniformly random policy. The search tree is expanded
asymmetrically from the root, based on the frequency of wins and losses in the roll-
outs. If wins and losses were distributed uniformly at random among the leaves, the
roll-out results would be almost completely uninformative [Pearl, 1985]. The best
contemporary Go machines use UCT as part of their method [Gelly and Silver,
2008]. However, UCT-like methods base their value estimates directly on average
1EXPTIME problems are problems solvable by a deterministic Turing machine in O(2p(n))
where p(n) is a polynomial of n. EXPTIME-complete problems are such problems, such that every
other EXPTIME problem has a polynomial time many-to-one reduction to it. In particular, all
NP problems are a subset of EXPTIME problems.
3.2 Methods 39
outcomes of tree-descents, making them dependent on a good exploration and roll-
out policy. They also do not propagate information laterally in the tree, although,
thanks to the smoothness of the tree, the value of a node does contain information
about the value of its siblings.
This chapter constructs an explicit generative model for the value of game tree
nodes under the random roll-out policy. Finding the value under the optimal policy
would amount to solving a min-max optimization problem of complexity O(bd)
if all nodes were observed. However, for best-first search algorithms, it will be
shown how an approximate closed form for the unobserved parts of the tree can
be constructed and used to derive an approximate message passing scheme. The
resulting algorithm tracks a joint posterior belief over the optimal values of all
nodes in the tree. It incorporates a new roll-out at depth k from the root in O(kb)
time (using heuristics, this can be brought to O(k), the complexity class of UCT),
arriving at an intermediate set of local marginals that is sufficient to evaluate
the posterior of an arbitrary node at depth ` in the tree (something classic Monte
Carlo algorithms cannot do) with costO(`). The algorithm can be interpreted as an
instance of Bayesian off-policy reinforcement learning [Watkins and Dayan, 1992,
Duff, 2002], inferring the optimal policy from samples generated by a non-optimal
policy. Our method might be applicable, to varying degree, to other tree-structured
optimization problems, if they exhibit a functional relationship between steps; c.f.
the metric (though not tree-structured) sets of bandits considered by Kleinberg
et al. [2008].
The main research contributions are the generative model for the score of game
positions, the formulation of a probabilistic best-first tree search algorithm, and a
demonstration of Expectation Propagation on min-max trees.
Probabilistic approaches to game tree search have been suggested before [see e.g.
Baum and Smith, 1997, Russell and Wefald, 1991, Palay, 1985, Stern et al., 2007].
These works concentrated on guiding the search policy. Here we focus on efficient
and consistent off-policy inference for the entire tree. This is valuable because a
coherent posterior over off-policy values provides meaningful probabilistic training
data for Bayesian algorithms attempting to learn a generalizing evaluation function
based on features of the game state.
3.2 Methods
This section defines the problem (3.2.1), then develops the algorithm in several
steps. We define the generative model of on-policy and off-policy values (3.2.2),
show how to perform inference in this model by way of example (3.2.3), and finally
combine the results into an explicit algorithm (3.2.4).
40 Inference on Optimal Play in Games
3.2.1 Problem Definition
max
max
min
min
Terminal
db
Figure 3.1: A zero-sum, two-player, round-based game represented by a game tree.
Terminal positions are shown as filled nodes, nonterminal positions as hollow circles.
The meaning of the branching factor b (here b = 2) and tree depth d (here d = 4) is
indicated by schematic annotations.
We consider a tree-structured graph defining a round-based, loop-free, zero-sum
game between two opposing players, MAX and MIN with binary outcomes 1 and
−1 — “win” and “loss” from MAX’s point of view (Games with real-valued out-
comes are in fact an easier variant of this problem, because the scores gt of terminal
nodes can be observed directly, rather than just their signs). MAX is defined to
be the first player to move.
The task is to predict the outcome of the game, assuming optimal play by both
players, from any position in the tree. The only type of data available (at request)
is the length ` ∈ N and result ri ∈ {−1; 1} of random roll-outs of the game starting
at node i. A roll-out is a path through the tree to a terminal node, generated by a
policy choosing stochastically (not necessarily uniformly) among available moves
in each encountered position. (The policies of contemporary UCT algorithms do
not choose moves uniformly at random. See Section 3.3.4 for experimental evidence
that our model can still be approximately valid in this case.)
3.2.2 Generative Model
Two kinds of evaluations for nodes in the game tree will be central to the analysis:
The value under the non-optimal roll-out policy, known as the on-policy value
Figure 3.2: Conceptual sketch of game tree smoothness. The binary win/loss values of
terminal positions (shown as black/white nodes) are the result of incremental changes
in the game score over the course of the game, indicated by grayness levels.
3.2 Methods 41
in reinforcement learning, will be called the score for clarity. The optimal value
achieved by two hypothetical ideal players is known as the off-policy value in
reinforcement learning. The following two definitions jointly form a generative
modelM for the latent scores G = {gi} and optimal values V = {vi} of tree nodes
i.
Definition: The score gi of node i models the value of i under random play. It
is a real number such that
. for any terminal position t, sign(gt) = rt, where rt is the binary result of the
game at t. The likelihood for gt is thus a step function (denoted θ):
p(rt | gt) = θ(rtgt). (3.1)
. the score of child node c of node i is generated from gi by a zero mean, unit
variance Gaussian step (see also note below):
p(gc | gi,M) = N (gc; gi, 1). (3.2)
. the prior for the score of the root node is Gaussian p(g0) = N (µ0, σ20) (one
is free to choose µ0 and σ0, although µ0 = 0 is the obvious choice).
Thus, scores of sibling nodes are independent given their parent’s score, and the
prior distribution of the value during a roll-out is a Brownian random walk. For
binary results, the scale factor of the steps is arbitrary, and so is the choice of
1 here. In the simpler case where the actual real value of a terminal position
is observed (which will not be considered further here), this step size obtains a
meaning, and should then be learned. How to do this will follow straightforwardly
from the following derivations in Section 3.2.3: It involves inferring the variance
of values at the leaf nodes (which is trivial from the observed data), and dividing
that number by the depth of the tree.
Definition: The off-policy value vi of node i is the true value of node i under
optimal play by both players. That is
. for terminal positions t, we have vt = gt.
. for non-terminal positions i with children {c}i
vi =



maxc{vc} if MAX plays at i
minc{vc} if MIN plays at i.
(3.3)
42 Inference on Optimal Play in Games
g0
g1
g11
v11
g12
v12
v1
g2
g21
v21
g22
v22
v2
v0
G
R
V
ro
ll
-o
u
t
max
max
max
max
min
min
min
min
T
e
rm
in
a
l
P
o
sitio
n
g1 ∼ N
(g0, 1)
g2 ∼ N (g0 , 1)
v0 = max(v1, v2)
v1 = min(v11, v12)
nodes stored
in memory
untracked nodes
(inductive part)
optimal value
(ground truth)
Figure 3.3: Generative model (Bayesian network, directed graphical model) for the
scores G, the roll-out results R and the optimal values V of an example game repre-
sented by a binary tree of depth 4. Shown is the situation after four search-descents
into the tree, the blue and orange shaded nodes represent the representation of the
exploratory tree in the algorithm’s memory. The most recent roll-out (of length 2) is
shown as two black curved arrows, previous roll-outs (from nodes (1), (2) and (11))
are not shown. The lower half of the diagram shows the generative process of optimal
values v. In a minor deviation from the standard notation for directed graphical mod-
els, ground truth identity of the optimal paths through the tree (not observed by the
algorithm) is indicated by thick arrows with double heads (i.e. optimal play consists of
MAX moving from node 0 to node 1, then MIN moving to 12, etc.). This is only for
intuition, the generative model itself just consists of the nodes and arrows. Note that,
while the v nodes are shown below the g nodes here for readability, in the text ‘up’ and
‘down’ refer to the game tree structure, with parent nodes vi being ‘above’ children
vij.
We note in passing that OR-trees (i.e. tree-structured optimization problems, and
non-adversarial games like Solitaire) are a trivial variant of this formulation.
Figure 3.3 shows the full generative model for a small tree of b = 2 and d = 4, as a
directed graphical model. Note that the only observed data is the sign (the binary
value) of terminal positions (black nodes); all other variables are latent.
In some games — such as Chess, see e.g. Campbell et al. [2002] — noisy value esti-
mates for non-terminal positions might be available from some evaluation function.
This situation will not be studied here, but in many cases it might be possible to
include such information in a principled way. In fact, if the observation noise can
be considered Gaussian with known mean and variance, incorporating such data
is a straightforward case of belief propagation (see Section 2.2).
3.2 Methods 43
3.2.3 Inference
We will use the results from 2 to derive an approximate message passing algorithm
to perform inference both on the scores and values, using Gaussian Expectation
Propagation (Section 2.3.2) to project the messages to the normal exponential
family.
Inspecting Figure 3.3, it might seem like inference in the tree would call for mes-
sages among all bd nodes. In this section, we will show that this can be avoided,
because the messages from unobserved parts of the tree can be derived a priori,
in jointly O(bd) time.
We assume the learner acquires data in a best-first manner: At any given point in
time, it tracks an asymmetric but contiguous tree in memory which includes the
root. Additional nodes are added to the boundary of the stored tree by requesting
a roll-out from that node (Figure 3.3, shows the situation after four roll-outs, with
four (blue/orange shaded) nodes already added to the memory representation).
The message passing is performed in parallel with the search process. The resulting
message passing schedule is quite complex. For clarity, we will use an example
descent through the small tree of Figure 3.3.
On-Policy Inference on g
Each search descent begins at the root node 0. As the descent passes through the
tracked part of the tree, a policy pi chooses among available children, potentially
based on the current beliefs over their v. For our example, say the policy chose the
descent 0→ 1→ 12. At each step, we update the message from the current node
to the chosen child. The message out of g1 in the direction of g12 is
mpa(1)(g1)
∏
j∈ch(1)\12
mgj(g1) ≡ N (µ1\12, σ
2
1\12) (3.4)
where ch(1)\12 is the set of child nodes of 1 excluding node 12, and pa(1) is the
parent node (i.e. 0) of 1. And the message into g12 is
mg1(g12) =
∫
p(g12|g1)p(g1|gpa(1), {gch(1)\12}) dg1
= N (µ1\12, σ
2
1\12 + 1)
(3.5)
Assume node 12 just reached is not part of the stored tree yet. To add it to the
tree, we request a roll-out starting from 12, which turns out to be of length ` = 2
and have result r = +1 (see Figure 3.3). The data thus gained is a likelihood
p(r|gt) of the score of the terminal position t at the end of the roll-out, which is
a step-function. We can generate a prior over gt as a message from the current
44 Inference on Optimal Play in Games
marginal over g12 by integrating out the ` intermediate steps (see Equation (3.5)):
p(gt | g12) = N (µ12, σ
2
12 + `) (3.6)
giving a posterior over gt which is a truncated Gaussian. To get the EP message
back to g12, we need the function that calculates the moments of this truncated
Gaussian distribution (this function will be denoted fEPtrG in Algorithm 1). The
calculation is straightforward, because for Gaussians in general
∫ ∞
0
xN (x;µ, σ2) = µΦ
(µ
σ
)
+ σφ
(µ
σ
)
and
∫ ∞
0
x2N (x;µ, σ2) = (µ2 + σ2)Φ
(µ
σ
)
+ µσφ
(µ
σ
)
where φ(x) = N (x; 0, 1) is the standard Gaussian and Φ(x) =
∫ x
−∞ φ(y) dy is the
cumulative Gaussian. This result was used previously for EP by Herbrich et al.
[2007]. To finally arrive at the message mr12(g12) from the roll-out to g12, we need
to apply fN (0,`) again to the resulting EP message. Note that the message contains
no g other than gi. It is thus possible to perform inference on gi using exactly two
messages: One from its parent node, and one from the outcome of the roll-out.
To incorporate the new knowledge from this roll-out into all ancestor nodes of
g12, we pass messages of analogous form to Equation (3.5) back up the tree. This
obviously does not propagate the information through the whole tree, but it leads
to a situation where the score gi of any node i in the tree can be evaluated in
linear time, simply by performing one descent from the root towards i, updating
messages analogously to Equation (3.5) downwards during the descent.
There is one more pitfall to avoid: Consider the next time a search descent passes
through node 12, which currently lies at the boundary of the tracked tree. This
leads to the addition of a child node of 12 and a roll-out from there. Now the
roll-out result associated with 12 has to be dealt with in a consistent way. Sim-
ply keeping the corresponding message in the marginal is not correct: Because
the roll-out necessarily passed through one of i’s children, information from that
child would otherwise be counted twice. There are two other options: If informa-
tion about the course of the roll-out was stored, the corresponding message can
be moved down along the path of the roll-out to the boundary of the search tree.
If the amount of roll-outs played is too large to store the paths of all roll-outs,
it becomes necessary to remove the information gained from the roll-out at node
12 from the marginal. This removal corresponds to ‘dividing’ the corresponding
message out of the marginal as discussed in Sections 2.3.1 and 2.3.2. In collect-
ing the experimental results reported in Section 3.3.2, we opted for this latter,
more memory-conservative (but slightly information-wasting) approach, to facili-
3.2 Methods 45
gi
oi
vi
∆i
p(oi|gi, `i,M) = N (oi; gi, `i)
vi = gi + ∆i(`i)
p(gi|gparent(i),M) = N (gi; gparent(i), 1)
p(∆i |M)
Figure 3.4: Inference on optimal values vi separates into two independent parts: The
value gi of playing to node i, and the value ∆i of playing optimally from node i onwards.
Inference on gi is deductive, because it uses the roll-out data oi, while inference on ∆i
uses the model only, and is thus inductive.
tate comparison to contemporary Monte Carlo tree search algorithms.
Inference on Optimal Values v
The previous paragraph sketched the message passing inference leading to a con-
sistent posterior over scores G. How can these beliefs be used to obtain a posterior
over the values V ? We will use the definitions of Section 3.2.2 to again derive a
message-passing scheme running parallel to the search process.
First, consider again node i = 12 in Figure 3.3, at the boundary of the stored tree.
The value under optimal play is the sum of the two independent variables (see
factor graph in Figure 3.4)
vi = gi + ∆i (3.7)
where ∆i is the optimal reachable increment to the score of i. So the inference
breaks up into a deductive part (on gi, as solved in the previous section) and an
inductive part (on ∆i). If the next move after i is controlled by MAX (replace max
with min in the opposite case), then
∆i = max
j∈children(i)
{ξj + ∆j} (3.8)
where ξj ∼ N (0, 1) is the unknown Brownian step to the score of node j. Deriving
a belief over ∆i for any node i, which is `i steps from a terminal position, is a
recursive problem which depends only on `i and the branching factor b: Assuming
MIN gets the last move (with straightforward variations in other cases), ∆i is
∆i(`i, b) =



0
maxj=1...b {∆j(`i − 1, b) + ξj}
minj=1...b {∆j(`i − 1, b) + ξj}
(3.9)
46 Inference on Optimal Play in Games
for ` = 0, for ` mod 2 = 0 and for ` mod 2 = 1, respectively.
Similarly to the situation in the previous section, the beliefs generated by this re-
cursive operation are not Gaussian themselves. To perform the EP approximation,
we need the function calculating the moments of the maximum or minimum over
Gaussian variables (this function will be denoted fEPmax /min in Algorithm 1). The
necessary derivations are lengthy and have thus been moved to Appendix A, where
these moments are derived for the case of the maximum of two Gaussian variables,
and it is shown how to combine such binary comparisons iteratively into an ap-
proximate posterior belief over the maximum of a finite set of such variables. The
corresponding messages for the minimum of variables is a trivial variant, because
mini{xi} = −maxi{−xi}.
Using this approximation, we arrive at a recursive operation in closed form, which
can be used to derive the message from ∆(`) for all ` up to a pre-defined depth.
Perhaps surprisingly, this lookup table for optimal increments can be constructed
prior to data acquisition, once for the entire game tree, in O(bd) time (as opposed
to the O(bd) cost of probing the entire tree), which is easily tractable even for
massive game trees like that of 19×19 Go. To see this, note that the optimal value
increments from nodes one level above the leafs to the leafs are the maximum (or
minimum) of b unit-variance, zero-mean Gaussian random variables. The above
approximation gives a new Gaussian approximation for the belief over the value
of this maximum (or minimum) that is identical for all these nodes. So the fact
that there are bd−1 of these nodes is irrelevant for this question.
In my simple, non-optimized implementation, constructing this table takes about
2 minutes on a contemporary desktop machine, for a tree of Go-like dimensions
(10400 nodes). This step is a parametrized version of the Monte Carlo technique
known as density evolution (see e.g. [MacKay, 2003, §47.5] and Richardson and
Urbanke [2008, §4.5]). See Section 3.3.3 for an experimental analysis of the quality
of the approximation.
We sum the independent variables gi and ∆i, using the exact function
f∑N
[
N (µa, σ
2
a),N (µb, σ
2
b )
]
= N (µa + µb, σ
2
a + σ
2
b ). (3.10)
For a node j that does not lie on the boundary of the stored tree, vj is given by
Equation (3.3). Marginals p(vcj) are available for all children cj of j in this case.
Hence, an approximate Gaussian message to vj from its children can be found using
the same method as above. However, it is important to note that the children of vj
are correlated variables, because they are all of the form shown in Equation (3.7),
sharing the contribution gj. The EP equations derived in Appendix A include the
3.2 Methods 47
correlated case. Using approximate Gaussian messages
q(vk) = N (µk, σ
2
k) = N (µgj + µ∆k , σ
2
gj + σ
2
∆k) (3.11)
for each child k of j, the correlation coefficient2 %k1k2 between two children k1 and
k2 is
%12 = V
−1(σ2gi − µgiµ∆k1 − µgiµ∆k2 − µ∆k1µ∆ik2 ) where
V ≡ (σ2gi + σ
2
∆k1
− 2µgiµ∆k1 )
1/2 · (σ2gi + σ
2
∆k2
− 2µgiµ∆k2 )
1/2
3.2.4 Algorithm
Algorithm 1 sums up the message-passing scheme presented in the previous sec-
tions. It defines a recursive function that descends through the tracked tree to a
leaf, passing messages downward (the operator fN (0,1)(p) refers to Equation (3.5)).
To choose the part of the tree to explore, the algorithm uses a generic policy pi
(line 3), whose precise form can be arbitrary. In particular, the policy may or
may not make use of the algorithm’s value estimates (see Section 3.2.6 below). At
the boundary of the stored tree, the algorithm performs a roll-out (line 14), then
passes g and v messages upwards to the root. The actual top-level search algo-
rithm repeatedly calls this function, accumulating more and more data, at roughly
constant computational cost per call (apart from the small increase in cost caused
by the growth of the stored tree). The notation pa(i) and si(i) refers to the parent
of i and the set of siblings of (and including) i, respectively. The function stored
accesses a one-dimensional array of stored inductive messages from the unexplored
parts of the tree, as discussed in Section 3.2.3.
2The correlation coefficient %ij between two Gaussian variables i and j is defined by cov (ij) =
%ij
√
Var (i)Var (j).
48 Inference on Optimal Play in Games
Algorithm 1 Bayesian Best-First Tree Search
1: procedure descent(i)
2: if i previously visited then
3: c← pi(i) . policy pi chooses child c to explore
4: p(gc)← p(gc)/mgi(gc) . update message to c
5: mgi(gc)← fN (0,1)[p(gi)/mgc(gi)]
6: p(gc)← p(gc) ·mgi(gc)
7: m′c(gi)← descent(c) . continue descent (returns g message from child)
8: p(gi)← p(gi)/mc(gi) ·m′c(gi) . update marginals
9: mc(gi)← m′c(gi)
10: else
11: p(gpa(i))← p(gpa(i))/mrpa(i)(gpa(i)) . divide out roll-out from parent’s marginal
12: mrpa(gpa(i))← N (0,∞)
13: [lines 3 to 5] . update message from parent to gi, identical to above
14: (ri, `i)← roll-out(i) . do roll-out
15: mri(gi)← fN (0,`i)
[
fEPtrG(ri, fN (0,`i)[p(gi)])
]
. build message from roll-out result
to i
16: p(gi)← p(gi) ·mri(gi)
17: p(vi)← f∑N [p(gi),stored(`i)] . generate marginal for vi
18: end if
19: mi(gpa(i))← fN (0,1)(p(gi)/mpa(i)(gi)) . Calculate messages to parent’s g and v
20: p(vpa(i))← fEPmax /min({p(vk)}k∈si(i), p(gi))
21: return mi(gpa(i))
22: end procedure
3.2.5 Replacing a Hard Problem with a Simple Prior
Figure 3.5 provides an intuition for the computational simplification achieved with
the probabilistic model and inference algorithm introduced in the previous sections.
Finding the exact answer to the tree search problem is an exponentially hard
problem. But finding a good path through the tree — estimating the value of a
position under optimal play — is a much simpler problem if we allow ourselves the
luxury of a probabilistic model for the tree structure. With this model, inferring a
belief (rather than an exact statement) over the value of any node in the tree has
only linear computational cost. This also provides some insight into why humans
can play exponentially hard problems like Go, despite their ostensibly incredible
structural complexity: Humans do not predict entire games from the start: They
think ahead a few moves, and assume that the game proceeds “as usual” (i.e. as
described by some simple model) from there on. In classic tree search algorithms,
this intuition is encoded as a hard evaluation function by the designer of the
algorithm. In games where such an evaluation is hard to obtain, such as Go, random
roll-outs combined with the inference algorithm presented above provide a means
of learning the evaluation function.
3.2 Methods 49
g0
v0
Figure 3.5: Factor graph illustrating the advantage of a probabilistic model over classic
tree search algorithms: the exponentially large tree from Figure 3.3 (indicated by gray
clouds) has been replaced, literally, with black boxes modelling its effect on optimal
play. The exponentially hard problem of identifying the optimal path through the tree
is replaced with the linear-cost problem of inferring a belief over the optimal path.
3.2.6 Exploration Policies
While the inference process itself is independent of the chosen policy pi in Algo-
rithm 1, the policy is crucial to make the roll-outs informative about the optimal
path (this is a general feature of off-policy reinforcement learning). Many possi-
ble policies are available, among them the point-estimate based UCT [Kocsis and
Szepesvári, 2006], greedy choice among samples [Thompson, 1933], and information
gain [Dearden et al., 1998]. An approach that has received some renewed interest
recently [Kolter and Ng, 2009] is ‘optimistic’ exploration based on weighted sums
of mean and variance of the value estimates. That is, given approximate beliefs
q(vi) = N (µi, σ2i ) over the optimal value of children i, the policy chooses as
pioptimistic[q(v)] = arg max
children i
(µi + βσi) (3.12)
with a parameter β controlling between exploration and exploitation. In our ex-
periments, we used this optimistic exploration where applicable, but it should be
understood that this is essentially an arbitrary choice, and the ‘right’ choice of
policy is still a matter of open debate in the reinforcement learning community.
50 Inference on Optimal Play in Games
0.0 0.2 0.4 0.6 0.8 1.0
Φ(g)
0.0
0.2
0.4
0.6
0.8
1.0
∫ g −∞
pˆ(
g′
)
dg
′
0.0 0.2 0.4 0.6 0.8 1.0
Φ(g)
0.0
0.2
0.4
0.6
0.8
1.0
Figure 3.6: Quantile-quantile plot of 71 empirical distributions of g, centered on their
mean, at varying depth from the root, for one random path through Go, against the
standard normal distribution Φ(x) (black dotted line). Left: Random roll-out policy.
Right: Smart roll-out policy. Color intensity of the plots decays linearly with depth
from the root. Note that the distributions far from the root are based on increasingly
small sample sizes.
3.3 Results
We performed experiments to answer several questions arising with regard to the
described inference model: Is the model of Brownian motion applicable for real
games, like Go (3.3.1)? Is the EP approximation on the roll-out results effective
(3.3.2)? Does the recursive min/max approximation in the inductive part of the in-
ference produce a reasonable approximation of the true MIN/MAX values (3.3.3)?
How robust is the model to mis-match between generative model and ground truth
(3.3.4)? And could the algorithm be used as a standalone searcher (3.3.5)?
3.3.1 Structure of Go Game Trees
We generated one random path through the tree of 9× 9 Go. At each level in this
path, roll-outs from all legal moves i at this position were generated (1000 roll-outs
from each i). Depending on the depth from the root, there were between 81 and 0
such legal positions. We stopped the game after 71 steps, when there were less than
5 legal moves available. The average length ¯`i of the roll-outs and the empirical
frequency pˆ(win|i) of a win for the MAX player from i under a random roll-out
policy was stored. This implicitly defines the value of gi under the model through
p(win|gi) =
∫ ∞
0
N (gt, gi, `i) dgt = Φ
(
gi√
`i
)
(3.13)
where we replace p(win|gi) and `i with empirical averages. With sufficiently many
samples, p(win|gi) and thus gi can be evaluated up to negligible error. The gener-
ative process defined as part of our modelM then leads to the statement that the
{gi} of sibling nodes in the tree are distributed like the standard normal distri-
bution around their parent’s value. Figure 3.6 shows Q-Q plots of these empirical
3.3 Results 51
0 10 20 30 40 50 60 70
depth in game
4
3
2
1
0
1
2
3
4
g i+
1−
g i
MAX
MIN
0 10 20 30 40 50 60 70
depth in game
4
3
2
1
0
1
2
3
4
g i+
1−
g i
Figure 3.7: Change to means of positions’ values during roll-outs, as a function of
depth in tree (i.e. this plot shows the effects explicitly removed in Figure 3.6). Left:
random roll-out policy. Right: Smart roll-out policy. Nodes controlled by MIN player in
green, nodes controlled by MAX player in red. Connecting blue lines for visual aid only.
Means over entire dataset, MAX’s moves only, and MIN’s moves only as dashed blue,
red and green lines, respectively (For the smart policy, all these averages only include
data points up to move 48). Note the strong amplitudes at the end of the game for
the smart policy. The data points in this region, cut off in this plot to aid readability,
reach values of up to -14/+31.
distributions against Φ for depths d = 0 to 71 from the root. A standard normal
would lie on the diagonal in this plot; weaker tailed distributions are steeper, heav-
ier tailed ones flatter. The left plot shows results from a uniform roll-out policy
(only excluding illegal and trivially bad ‘suicide’ moves, which are not technically
illegal, but would never be played by a human player, and are cheap to test for).
Given the limited sample sizes, especially towards the end of the game, the empir-
ical distributions are strikingly similar to Φ(x).
Contemporary Monte Carlo search algorithms use ‘heavy roll-outs’, i.e. policies
that produce less random, more informative results. Clearly, the generative part
of our model will be more and more invalid the smarter the policy — the limit
of a perfect policy would repeatedly generate only a single perfect roll-out, and
it is unlikely that this roll-out would conform to the assumptions for randomly
generated games made in the generative model. To examine how drastic this effect
is, we repeated the above experiment with a smart policy, similar to the published
parts of MoGo [Gelly et al., 2006] (Figure 3.6, right). The results do develop
heavier tails, particularly deep in the tree, towards the end of the simulated game.
The reason is that in this late phase, only few good moves remain, and the roll-
outs under the smart policy become very similar to each other. Based on Figure
3.6, one could argue that the Gaussian generative model remains an acceptable
approximation. Alternatively, one could of course search for a better generative
model for any particular roll-out policy used; example approaches might include
heaver-tailed distributions like the Laplace distribution.
The analysis so far has only considered the distribution of children relative to
their parent nodes, and explicitly excluded the drift of the mean value gi from one
52 Inference on Optimal Play in Games
node to the next chosen child. Figure 3.7 shows the relative difference gi+1 + gi
between subsequent moves, as a function of the depth in the tree, again for the
uniformly random and “smart” roll-out policy. For the random policy, subsequent
means have no clear tendency towards one particular direction. An analysis of
variance shows not enough evidence for the rejection of the null hypothesis that
either player’s increments come from the same distribution (p-value 0.19). For the
smart policy, a clearer tendency to choose moves favouring the player’s desired
outcomes is apparent. More importantly, in the final twenty moves of the game,
the step sizes become extreme, as it becomes possible for the policy to choose very
good moves based on its heuristic alone. This effect is also reflected in the extreme
step-shaped cumulative density functions in visible in Figure 3.6. In this late phase
of the game, the generative model presented here arguably becomes invalid under
this particular smart roll-out policy. Note, however, that this very deviation from
the model also suggests that this part of the game tree has a simpler structure
that can in fact be modeled well with heuristics. The more benign tendency to
choose moves changing the score towards the direction favoured by the current
player evident in the earlier phase of the game can be modeled straightforwardly
by introducing a player-dependent bias-term in the mean for the generative step
in Equation (3.2).
3.3.2 Inference on the Generators
A good way to evaluate the quality of Bayesian models is the (log) likelihood they
assign to ground truth in known test environments. To do so, we generated 500
artificial trees of b = 2, d = 18 fromM. The inference model was implemented as
presented in Algorithm 1. A time step corresponds to one descent into the tree,
ending with a roll-out at a previously unexplored node. For the descent through
previously visited nodes, an optimistic policy was used (see Section 3.2.6), choosing
greedily among children i based on µvi + 3σvi . In addition, the value of the root
node’s score was assumed to be 0 with high precision. Figure 3.8 shows the log
likelihood assigned to the ground truth of both g and v at distances 1, 2 and 3
from the root, as a function of the number of roll-outs performed
As expected, the likelihoods rise on average during learning. They saturate when
the majority of the nodes is expanded, because the (binary) roll-out results do
not contain sufficient information to determine g, and thus v, to arbitrary preci-
sion. Nodes deeper in the tree saturate at smaller likelihoods because they receive
information from fewer offspring nodes. They also start out with a smaller likeli-
hood because their priors contain more uncertainty. Note that the message passing
causes the beliefs to develop simultaneously at all three depths, even though the
nodes at greater depths are not explored until several descents after initialization.
3.3 Results 53
100 101 102 103 104
ndescent
2.2
2.0
1.8
1.6
1.4
1.2
〈 lo
g
p(
g;
v)
〉
depth=1
depth=2
depth=3
Figure 3.8: Log likelihood of the ground truth value of gi (crosses) and vi (triangles)
under the beliefs of the model, at varying depth from the root in artificial trees, as a
function of number of roll-outs performed. Averages over all nodes at those depths.
3.3.3 Recursive Inductive Inference on Optimal Values
To evaluate the quality of the inductive part of the approximation, 1000 artificial
game trees were generated and solved by explicit min/max search as in the preced-
ing section. Figure 3.9 compares empirical ground truth of ∆ and values predicted
by pre-data inference, for all nodes at two different distances ` from the leaves,
in 1000 artificial game trees. Despite the repeated application of the Gaussian ap-
proximation to non-Gaussian beliefs, there is good agreement between predictions
and ground truth.
3.3.4 Errors Introduced by Model Mismatch
For the last two experiments, the artificial game trees were generated by the gen-
erative modelM, and we showed in Section 3.3.1 that the real-world game Go is
in fact approximated well by this model. However, other games might be less well
approximated by M. As a tentative test of the severity of the errors thus intro-
duced, the Bayesian searcher was tested on 500 generated p-game trees [Kocsis and
Szepesvári, 2006] with b = 2, d = 18. These trees are also generated by a latent
stochastic variable, but with a different generative model, choosing ξi uniformly
from [−1, 0] if the player is MIN and uniformly from [0, 1] if the player is MAX.
We performed a best-first search in these trees (results not shown), and compared
to the performance on trees for which M is the correct model. The performance
54 Inference on Optimal Play in Games
4 2 0 2 4
∆
0.0
0.1
0.2
0.3
0.4
0.5
0.6
p(
∆
)
b=5,`=6
b=5,`=5
Figure 3.9: Empirical histogram and model predictions (lines) for the optimal future
value increment ∆ (see Equation (1.1)), for p(∆|` = 6, b = 5) (first move for MAX,
last move for MIN) and p(∆|` = 5, b = 5) (first and last move for MIN). Model
predictions as lines.
on these two types of trees during the search was very similar (i.e. a corresponding
plot looks very much like Figure 3.10), except for a globally slightly higher chance
of the model misclassifying nodes into winning and losing nodes. At least in this
particular case, the model mismatch causes only minor decay in performance.
3.3.5 Use as a Standalone Tree Search
It is tempting to interpret the presented inference algorithm as a standalone search
method. Experiments on artificial game trees (Figure 3.10) suggest that the result-
ing algorithm does not necessarily improve on a vanilla UCT searcher, and that
performance depends strongly on the policy used. The intent of the presented algo-
rithm is not to develop a good tree searcher, but to provide a consistent posterior
from which generalizing evaluation functions can be learned.
3.4 Conclusion
We have presented a generative model for game trees, and derived an approximate
message-passing scheme for inference on the optimal value of game positions, given
samples from random roll-outs. Inference separates into two tractable deductive
and inductive parts, and allows evaluation of the marginal value of any node in the
3.4 Conclusion 55
Figure 3.10: Performance in a tree search task. Comparison between α-β search, stan-
dard UCT and a probabilistic tree search algorithm (PTS) based on our model, as a
function of number of evaluated leaf nodes / descents. Averages over a set of 500
artificially generated game trees from the model, with b = 5, d = 4 (larger trees are
very challenging for α-β). Left: Probability of the models predicting the correct binary
value (win/loss) of the root node of the tree (α-β assigns random values with uniform
probability before convergence, the exact value afterwards. The smooth development
of the corresponding curve is an effect of averaging). Note that, due to the optimistic
policy, the probabilistic algorithm is not guaranteed to actually explore the entire tree
and thus sometimes does not observe enough data to converge, leading to the about
1% error rate in the limit of many descents. Right: Probability of the algorithm’s
policy not choosing the correct optimal next move from the root.
tree in linear time. Similar to the way humans think about games, the computa-
tional cost of our algorithm depends more directly on the number of data collected
than on the size of the game tree.
Like any model, the assumption of Brownian motion is imperfect. Neither does it
catch all the details of any one game, nor does it necessarily apply to all games. But
it provides a quantitative concept of a smoothly developing game between com-
peting players. Our experimental results suggest that errors introduced by model
mismatch are not severe, and that this is a meaningful model for Go. We argue
that it lies at a ‘sweet spot’ between unstructured priors and non-probabilistic,
rule-based methods, which can over-fit.
Retaining a posterior does not necessarily improve on point-based methods in the
search for the optimal next move. Nevertheless, the results presented here are valu-
able in two ways: humans use sophisticated feature-based concepts to play games,
not full tree search. Emulating this behavior with a Bayesian algorithm, learning
an evaluation function of features of the game position, requires probabilistic be-
liefs over the optimal values, which our algorithm provides, decoupling the task
of designing an evaluation function from that of data acquisition. Secondly, game
trees are discrete planning tasks (hierarchical bandit problems). Our results indi-
cate that such problems can exhibit structure (in this case, correlation) even if they
might ostensibly look unstructured, offering potential for considerable performance
increase.
56 Inference on Optimal Play in Games
3.5 Addendum: Related Subsequent Work
The work presented in the preceding sections of this chapter was carried out in 2009
and, after an initial unsuccessful submission to NIPS 2009, published in AISTATS
2010 (see citation at beginning of chapter). Three months after the publication of
our paper, Tesauro et al. [2010] published a paper at UAI 2010 titled Bayesian
inference in Monte-Carlo tree search in which they independently arrived at very
similar results to ours. Their paper focused on guiding tree search more than on
providing good absolute predictions of minimax values, but the algorithm shares
many core aspects of our work, including the Gaussian moment matching for in-
ference on minimax values.
The main difference between the two works is that Tesauro et al. do not rely on
a generative model for the game tree, but rather describe a tree search algorithm
closer to the standard form of UCT. They construct beliefs over the generative
score gi of leaf nodes of the search tree by replacing our inductive step with in-
ference from the binary roll-out results. For this inference, they use a local Beta
posterior (moment matched to get Gaussian approximations). They also construct
an explicit policy for exploration of the tree, using the current beliefs over mini-
max values of leaf nodes of the search tree. After a node has been visited N times,
their policy explores the child node i with minimax belief p(vi) = N (vi;µi;σ2i )
that maximizes the score
si = µi +
√
2 logNσi (3.14)
The disadvantage of this approach is that it only allows inference on parts of the
tree which have already been incorporated into the search tree. The estimates will
also be of larger uncertainty as they do not incorporate the additional knowledge
about the tree’s structure available in our model. Both of these issues are of no
concern to Tesauro and colleagues, because their work focuses on guiding the tree
search, rather than providing high quality absolute value estimates. Finally since
their scheme does not take the depth of the remaining game tree below a leaf node
of the search tree into account, one could also conjecture that it might lead to
inefficient exploration if different children of a given node have drastically differing
remaining trees below them.
However, their algorithm also has some important advantages over ours. Most im-
portantly, it is guaranteed to converge to the exact minimax values as the number
of sampled roll-outs approaches infinity, and the authors provide proof-sketches
for these guarantees in their paper. The proofs are simple enough to be sketched
here in a few lines, and provide some insight into why proving a similar statement
is more challenging with regard to our algorithm.
To prove that an algorithm inferring the Bernoulli probabilities pii of the roll-out
results at leaf nodes i of the search tree (not the full game tree), and the minimax
3.5 Addendum: Related Subsequent Work 57
value under these probabilities for nodes within the search tree, converges to the
exact minimax probabilities pi∗i for all nodes in the search tree in the limit of large
sample numbers, consider the following argument.
. For the leaf nodes: Using any prior distribution assigning nonzero mass to all
0 ≤ pii ≤ 1 (e.g. a Beta distribution) and ni binary samples with Bernoulli
likelihoods, the posterior converges towards the correct pi∗i with the optimal
convergence rate of O(1/
√
ni). This is a general characteristic of Bayesian
inference.
. Theorem 1 (Tesauro et al.): Ergodic policies lead to convergence. Consider
a fixed finite bandit tree with binary reward leaf nodes and priors as in the
previous point, and assume that no two sibling nodes have the exact same
minimax pay-off rates. Then for any policy which visits every leaf an un-
bounded number of times, the minimax posteriors of all nodes converge to
Dirac δ-distributions at the exact minimax values. Proof: by induction, start-
ing from the point made above for leaf nodes. If a node’s children all collapse
to correct δ’s, due to the finite separation between children’s values, the par-
ents’ values also collapse. This also applies under Gaussian approximation,
because the approximation errors vanish as the input Gaussians’ variances
vanish.
. Theorem 2 (Tesauro et al.): For the policy of Equation (3.14), the algorithm
converges to the exact minimax reward probabilities everywhere in the search
tree in the limit of ni →∞ for all leaves i. Proof: Show that policy (3.14) is
ergodic. This is straightforward because for any node j which is never selected
over its siblings i, the variance σ2j will remain constant and thus the score sj
will rise without bound. Hence, the policy visits every node an unbounded
number of times. This also applies for the Gaussian approximation, because
the approximations variance is strictly positive if the children’s variances are
positive.
By contrast, our algorithm does not infer roll-out reward probabilities, but optimal
values from a model. Of course, in applications where the model is wrong, the
inferred values can be arbitrarily wrong. If the model is correct, however, we argue
that because our algorithm is based on a probabilistic construction, it is exact up to
the effects of the Gaussian approximations made in the message passing. The effects
of this approximation are subtle (see also Appendix A), and it is thus difficult
to make exact statements about its quality (apart from the trivial convergence
characteristics mentioned in the proofs above). As in the other chapters of this
thesis, this should not be seen as a defect of the approximation, but as a symptom
of the complexity of the underlying model: If it were easy to make general structural
58 Inference on Optimal Play in Games
statements about the exact beliefs, we would not need the approximations to begin
with.
Chapter 4
Approximate Bayesian
Psychometrics
The work presented in this chapter is the result of a collaboration with
Michal Kosinski, of the Department of Social and Developmental Psychol-
ogy at the University of Cambridge, who collected the presented data, and
David Stern and Thore Graepel, both of Microsoft Research Ltd. Except
where the work of others is explicitly cited, all mathematical derivations and
algorithmic implementations are the work of the author of this thesis.
Abstract
This chapter applies approximate Bayesian inference methods to a problem from
psychometrics. Specifically, we provide a Bayesian methodology for analyzing psy-
chometric questionnaire data in which people answer tens to hundreds of questions
on a five-point ordinal scale. Taking the established item response theory as a start-
ing point, we develop a Bayesian model of ordinal user-item responses and propose
an inference algorithm, based on Expectation Propagation, to infer marginal dis-
tributions on underlying traits, such as “extraversion” or “neuroticism”. We present
results on a subset of a very large scale psychometric data set collected on Face-
book and illustrate the advantages of the proposed method in terms of predictive
accuracy, computational speed, and the availability of error bars in the inferred
traits.
4.1 Introduction
Psychometrics, the business of measuring, describing and classifying the human
mind, has traditionally relied on the use of statistical methods. The complicated
60 Approximate Bayesian Psychometrics
variations of human behavior, the stochastic noise induced by the often limited
sample size, and the unavoidable biases in the selection of experimental subjects
make for challenging inference. This might account for the fact that explicitly
structured generative models of experimental data have not been applied as widely
in psychology as in other fields. This work is an initial attempt at showing how
methods from machine learning might be helpful for data description and inference
in psychology.
Psychometrics is a subfield of psychology, concerned with the design and evalua-
tion of psychological measurements, such as questionnaires and personality tests.
Psychological tests are usually designed based on accepted models of the human
mind. That means questions (called items) are chosen specifically to measure an
explicit latent quantity, such as “extraversion”, “agreeableness”, “neuroticism”, etc.
Because such traits are abstract concepts, they are typically defined implicitly
through associated responses to items (such as “Do you see yourself as a careless
person?” or “Are you inventive?”). The items themselves are often carefully de-
signed to “load” on a single trait only. That is, ideally, every item should co-vary
with only one trait. This focus on formulated traits raises the scientific question
of how well these latent dimensions describe the actual data, in particular when
compared to other models. However, from the scientists’ point of view, the latent
personality traits are not nuisance parameters here, so integrating them out for
prediction is not usually the desired path. Whether this focus on ease of interpre-
tation (rather than information content) and binary item loadings is a good design
decision by the community (or even is achievable at all) is a question far beyond
the scope of the technical work presented here. For the purpose of this chapter, the
modeling assumptions of the test designers will be taken at face value (see Section
4.2 for specific definitions).
Many psychological tests use discrete ordinal answers; for example a range of N
choices ranging from “completely disagree” through “indifferent” to “completely
agree” (known as Likert scales [Likert, 1932]). From these discrete answers, real-
valued latent variables associated with personality traits of the user are inferred
(such latent scores are often called the ability of the respondent in the associated
trait, even in tests which are not supposed to be judgmental). This discretized setup
is not as simple as it might seem at first sight, as the way in which the ordinal
responses are interpreted varies both between individual items and between indi-
vidual respondents. Apart from scaling and location, classical test theory [Novick,
1966] does not take effects specific to items or respondents into account. The more
modern item response theory [e.g. Hambleton and Swaminathan, 1990] uses a spe-
cial case of generalized logistic regression (see Section 4.2), allowing for varying
“difficulty” among items, but does not consider variations in the way users in-
terpret the Likert scale. Here, we will construct a probabilistic model for ordinal
4.1 Introduction 61
responses which, in addition to variations between items, also accounts for varying
nonlinear scalings of the Likert response levels among respondents. We introduce
an approximate inference scheme using Expectation Propagation, which returns
an approximate belief over the latent meaning of an answer on a unified scale for
all individual respondents.
Bayesian methods have so far occupied a niche among the psychometric literature,
and are usually confined to Markov Chain Monte Carlo techniques [Lord, 1986,
Patz and Junker, 1999, Arima, 2006], and such treatments are often accompanied
by warnings about the computational cost of these methods. With the advent of
social networking sites, psychometric tests have become popular among internet
users, producing large quantities of data: The experiments presented in Section
4.4 were carried out on a data set comprising 10, 000 respondents, which is only
a subset of a much larger data set collected on Facebook.com (the total number
of respondents is approximately 4 million). While these large amounts of available
data can potentially lead to improvements in test reliability and design, it also
creates challenging inference tasks. Approximate Bayesian inference can provide
inference results of high quality in acceptable computational time from such large
amounts of data, improving on maximum likelihood results by introducing an
explicit representation of parameter uncertainty.
As pointed out several times already in this thesis, approximate inference algo-
rithms have the perceived disadvantage of not providing provable performance
from a Frequentist point of view. It is thus important to point out that there is no
unequivocal “ground truth” in psychometry; so the usual Frequentist framework
of proving convergence toward ground truth in the limit of large amounts of data
does not seem meaningful. The human mind is too complex to be described by a
small number of parameters, so any simple low-dimensional regression model will
be patently wrong, whether it be based on point estimates or Bayesian beliefs. In
such a situation, insisting on provable convergence towards some ad-hoc popula-
tion has little value. If anything, such requirements limit the expressiveness of new
models by imposing unnecessarily hard requirements on their algebraic structure.
We argue that, since the ultimate goal of psychometry is to predict human be-
haviour, the only fair comparison of descriptive models and associated inference
algorithms is given by predictive performance on a test set.
The following Section 4.2 formally introduces the item response problem, and gives
a brief introduction to some state-of-the-art methods (a thorough overview of the
entire field would be far beyond the scope of this thesis). Section 4.3 extends the
classic models to a novel generative probabilistic description of users’ responses to
items based on low-dimensional latent traits. Section 4.3.2 derives an approximate
inference algorithm based on Expectation Propagation. Section 4.4 gives empiri-
cal evidence of the increased expressiveness of this scheme relative to maximum
62 Approximate Bayesian Psychometrics
likelihood estimation in contemporary item response models.
4.2 Item Response Theory
The Likert scale is a ubiquitous format for psychological tests, and indeed many
other forms of questionnaires. In its general form, it involves a range of R ordered
discrete answers, ranging from a strong disagreement to a strong agreement with
given statements. The machine learning community is accustomed to this form of
data from the widely studied Netflix movie ranking data set [Bennett and Lanning,
2007], where the R = 5 discrete answers correspond to “stars” rating a user’s
satisfaction with a particular movie.
We assume that a given questionnaire consisting of I items has been answered by
U respondents (or “users”), providing a data set of U×I responses rui ∈ {1, . . . , R}.
The goal of inference is to assign a latent score (ability) xuc ∈ R, c ∈ [1, . . . , C]
to C ≤ I different traits for each user. It is usually assumed that there exists a
function c(i) : [1, . . . , I] → [1, . . . , C] assigning items of the questionnaire to only
one trait (as pointed out in Section 4.1, this assumption may be questionable in
reality. But it is ubiquitous in the literature, and will be taken for granted for
the purpose of this work). Since the responses rui are ordinal, some relationship
between rui and xuc(i) is required. The definition of this relationship usually in-
volves an intermediate real-valued latent variable yui which is related to xuc(i) in
some straightforward way (e.g. through Gaussian noise), and which is assumed to
generate the item-specific response rui.
For example, in the popular Big Five factorization model, the item i = “I am always
the center of attention at parties.” is supposed to test for a trait c = “Extraversion”,
one of five such traits, and the answer to this item is not assumed to contain any
information about the other four traits, labeled “Conscientiousness”, “Openness”,
“Agreeableness” and “Neuroticism”. The Big Five model is a longstanding frame-
work of character classification, and was developed over an extended period by a
number of authors. Seminal papers include McDougall [1932] and [Cattell, 1943].
See Digman [1990] and Barrick and Mount [1991] for historical reviews. An impor-
tant detail about the history of this model is that it is in fact based on statistical
analysis, rather than philosophical deduction: College students were asked to de-
scribe peers with English adjectives; co-occurring clusters of words were identified
by factor analysis and then named. Early models had many more than 5 clusters,
and the decision to limit C = 5 seems to be based on analytical convenience as
much as on statistical evidence.
A problem with the Likert scale is that users’ responses do not only depend on
the latent trait itself, but also on personal interpretation of both the Likert scale’s
and the items’ meaning. Figure 4.1 shows empirical histograms of the marginal
4.2 Item Response Theory 63
1 2 3 4 5
response r
0
5
10
15
20
25
30
35
m
ar
g
in
al
 f
re
q
u
en
cy
utest=2
utest=8
utest=15
1 2 3 4 5
response r
0
500
1000
1500
2000
2500
3000
m
ar
g
in
al
 f
re
q
u
en
cy
itest=1
itest=2
itest=3
Figure 4.1: Marginal frequencies of reply categories in the dataset studied in Section
4. Connecting lines for visual aid only. Left: Marginal frequency of replies from users
1,2, and 3 from the dataset. Right: Marginal frequencies for items 1,2, and 3 in the
dataset. Note the strong variations in the frequencies between users, and that none of
the histograms follows a bell curve.
frequencies of the Likert answers 1 to 5 for three users to 100 items (Figure 4.1
left) and for three items (right) from 5000 users. It is immediately evident that the
users make very varied use of the responses, in a nonlinear way: Some users prefer
the “conservative” statements 2 and 4, others use response 5 much more often than
any other option, others again do not use response 1 at all. A similar behaviour
emerges in the marginals on the items.
4.2.1 Contemporary Item Response Models
A full overview over the psychometric literature is beyond the scope of this work.
For the purpose of the model presented here, the relevant conceptual development
in that literature is a trend toward more expressive and flexible descriptions, first
through an explicit model for the Likert response scale, then through the addition
of parameters specific to item and response. We we will here review three important
models in this development. Our contribution is then the addition of parameters
specific to every respondent and response.
Classic test theory [Lord, 1959, Novick, 1966] presupposes some means of turning
respondent’s u response rui to item i into an empirical score yui ∈ R. This mapping
is assumed to be provided by the test’s designer. The task left for the statistician is
then to infer a so-called true score ξui for further analysis. For example, a frequent
64 Approximate Bayesian Psychometrics
assumption is a linear relationship between true score and empirical score through
ξui = y¯i + ni(yui − y¯i), (4.1)
where y¯i is the average empirical score of all respondents, and ni is the reliability
of item i, i.e. a measure of variance.
Since the mapping from responses to scores is left to the test designer, it might be
tempting to simply use a linear or logistic scale for Likert type responses, mapping
directly from a discrete ordered response to a real-valued score. However, as Figure
4.1 shows, the empirical answer frequencies exhibit considerable structure. It thus
seems desirable to learn some of this structure directly from the data.
An important development in this direction is the Rasch Model [Rasch, 1960], and
a generalized version of it known as the generalized partial credit model [Muraki,
1992] (GPCM). These models are instances of logistic regression. Each item i is
assigned a difficulty di, and each response type r = 1, . . . , R is represented by an
item-specific threshold tri. In the GPCM, the probability of person u answering
rui to item i is given by a logistic distribution parameterized by the latent value
yui:
p(rui = 2, . . . , R | yui) =
exp (
∑rui
r˜=1 di(yui − tr˜i))
1 +
∑R
m=1 exp (
∑m
r˜=1 di(yui − tr˜i))
and
p(rui = 1 | yid) =
1
1 +
∑R
m=1 exp (
∑m
r˜=1 di(yui − tr˜i))
(4.2)
The Rasch model is the restricted case where di ≡ 1 for all items i. The parame-
ters are usually set to maximum likelihood (ML) values. While these models can
account for variations in the distribution of responses over items, they evidently
cannot take into account variations between individual users.
4.3 A Generative Model
This section will construct a probabilistic model for ordinal answers. A compact
directed graphical model is shown in Figure 4.2. Instead of the logistic link function
used in the GPCM, we will use a multivariate probit (i.e. cumulative Gaussian)
link function, because it has algebraic advantages when used with the Expectation
Propagation approximation. We will pair this link function with linear regression
on a combined model of both item-specific and user-specific thresholds.
Quite similar to the GPCM, we will use a set of L = R − 1 thresholds hui =
(h`ui)`=1,...,L to describe the generative process of the responses rui = 1, . . . , L + 1.
In contrast, however, we will assume that these thresholds are also a function of
the respondent, not only of the item. Specifically, we assume there is a set of L
4.3 A Generative Model 65
bu
bi
hui yui xuc
rui
U
I
C
user threshold
item threshold
user trait
response
Figure 4.2: Directed graphical model for ordinal regression on user and item thresholds
and user traits. Priors and intermediate variables are left out in this representation for
clarity. See Figure 4.3 for a more explicit factor graph.
thresholds bu for each user and similarly a set of item-specific thresholds bi for
item i, such that the overall threshold is given by
hui = bu + bi. (4.3)
These thresholds are latent variables, and we construct a prior over their values
by combining a general Gaussian prior with the explicit requirement that they be
ordered:
p(bu) ∝ N (bu;µu,Σu)
L∏
`=2
θ(b`u − b
`−1
u ), (4.4)
where θ here and throughout the rest of the chapter is Heaviside’s step function,
which equals 1 for positive arguments and 0 elsewhere. The normalization constant
of this prior is nontrivial, but the approximate algorithm derived in Section 4.3.2
will construct an approximate Gaussian belief, which is easily normalized.
As in the previous section, the latent variable xuc ∈ R, c ∈ [1, . . . , C] represents the
respondent’s traits. It generates the latent opinion / ability yui on item i through
the addition of Gaussian noise with variance τ 2
p(yui|xuc) = N (yui;xuc, τ
2). (4.5)
Similarly, we also assume that the actual threshold in the user-item combination
ui is a noisy version h˜ui of hui:
h˜ui ∼ N (hui, β2IL), (4.6)
where IL represents the L-dimensional identity matrix. The respondent’s discrete
answer rui represents the statement that yui lies between h
r−1
ui and h
r
ui, or above
66 Approximate Bayesian Psychometrics
or below all thresholds for rui = R = L+ 1 and rui = 1, respectively:
p(rui | h˜ui, yui) =
[
r−1∏
`=1
θ(yid − h˜
`
ui)
][
L∏
`=r
θ(h˜`ui − yid)
]
(4.7)
4.3.1 Expressiveness of the Model
By defining a generative model and a prior for its parameters, we have implicitly
defined a prior distribution over the responses. It is clear that this prior does not
put finite mass on all possible response processes. For example, the additive rela-
tionship in Equation (4.3) allows either of the thresholds to “veto” the other: There
is no way for the model to encode a situation in which a specific respondent never
uses, say, reply number 3 (implying that b2u = b
3
u), because the item thresholds
(assuming at least some of the other respondents do use reply 3) will enforce a
finite width of this region. So, even though the model is more expressive than a
model which does not take respondent-specific effects into account at all, it is not
fully general.
Other simple models one might consider instead of the additive interaction in
Equation (4.3) have defects of their own: For example a multiplicative interaction
exhibits a vetoing effect similar to that described above. We thus consider the
model defined above, while far from perfect, still a considerable improvement over
existing models.
It is also clear that a further generalization, to a situation in which every individ-
ual respondent-item pair has its own thresholds, is challenging because it would
remove all identifiability. The degree to which responses were described by the la-
tent trait over the latent free parameters for the thresholds would become critically
dependent on the priors over these different parameters, and great care would have
to be taken to analyze and find good priors for this situation.
4.3.2 Approximate Inference
Exact inference in the probabilistic model constructed in this section is intractable
in two ways: First, the prior (4.4) on the thresholds bu and bi has no closed form,
because it corresponds to a multivariate Gaussian distribution constrained to a
sub-space. Similarly, the posterior on (hui, yui) given the response rui, from the
likelihood (4.7), is intractable as it too is a restriction of a Gaussian. This is the
general form of the probit problem [Ashford and Sowden, 1970], which is known
to be intractable [Huguenin et al., 2009].
To circumvent this problem, we will use Expectation Propagation [Minka, 2001]
(Section 2.3.2) to construct approximate Gaussian posteriors. Even using this ap-
proach, inference remains potentially challenging, because the plates in Figure 4.2
4.3 A Generative Model 67
O bu
bi
O
hui h˜ui yui xuc
vui
A
rui
U
I
C
N (µuser,Σuser)
N (µitem,Σitem)
N (µtrait,Σtrait)∑
Nβ join Nτ
Figure 4.3: Factor graph for the user-item response model. To reduce clutter, the fixed
noise parameters β and τ have not been drawn as observed variables. Their influence
is indicated by labels underneath the corresponding factors. The factors marked O and
A represent structured subgraphs. The factor O enforces ordinality among thresholds,
the factor A encodes the precise meaning of the answer r. To improve readability, these
sub-graphs are shown separately in Figure 4.4.
introduce correlations between different respondent and item thresholds, so that
we would in principle have to track a joint approximate Gaussian posterior over all
variables in the model. For large datasets, such as the one studied in Section 4.4,
with many thousands of users, memory requirements alone make this approach
impractical. Instead, we will make an approximation to the posterior in which the
beliefs on the variables drawn as individual nodes in Figure 4.2 are independent.
That is, we will only retain covariance terms between the elements of individual
thresholds bu and bi. Figure 4.3 shows a factor graph from which a message passing
algorithm will be derived in the following sections. The subgraphs for the ordinal
factors have been replaced with placeholders, which are shown in detail in Figure
4.4. Note that the factor graph contains a large number of loops. In such situa-
tions, Expectation Propagation is not guaranteed to converge, but is empirically
known to often perform well anyway [Minka, 2001], similar to other loopy inference
methods [Frey and MacKay, 1998]. Inference is performed in an iterative way, cy-
cling through the data set several times. At each iteration, the message previously
incorporated in the marginal from this data point is “divided out” of the marginal
to avoid double counting (see Section 2.3.2).
68 Approximate Bayesian Psychometrics
bθ(bi − bi−1)
i = 2, . . . , B
O
rui vui
fk(vui, rui)
k = 1, . . . , B
fk =
{
θ(hkui − yui) if r ≤ k
θ(yui − hkui) else
vui ≡
(
hui
yui
)
A
Figure 4.4: Factor graphs for the ordinal (left) and response-assignment (right) infer-
ence, denoted by O and A in Figure 4.3, respectively.
Ensuring Proper Ordering of the Boundaries
We begin by constructing approximate Gaussian marginals on the boundaries bu
(the treatment for bi is entirely analogous). From Figure 4.4, an approximate
Gaussian marginal on bu can be constructed from a set of Gaussian EP messages
representing the effect of the step-functions θ(b`u− b
`−1
u ). This step factor has been
used for similar purposes before in Herbrich et al. [2007] and in Stern et al. [2009]. A
technical report by [Minka, 2008] describes a particularly efficient way of updating
the corresponding EP message, based on the observation that the integral
Z` =
∫
θ(piT` bu)N (bu;µu,Σu) dbu (4.8)
with a vector (pi`)k = δk` − δk(`−1) (using Kroneckers’ δ) depends on µu and Σu
only through rank 1 derivatives
∇µ logZ` = α`pi` ∇Σ logZ` =
1
2
(α2` − β`)pi`pi
T
` (4.9)
with some scalar terms α` and β`. Detailed derivations are reproduced in Ap-
pendix B for completeness. This allows the individual updates to be performed
with computational cost O(L2), rather than the O(L3) a naïve implementation
would require. For the purpose of the following derivations, it suffices to know that
there exists an algorithm constructing a joint approximate Gaussian marginal with
dense covariance matrix Σˆu on bu, which approximates the second moments of the
distribution (4.4) and, in later stages of the algorithm, of the posterior marginal
on bu (and analogously for bi).
Factorized Regression
We perform straightforward Gaussian belief propagation to construct an interme-
diate marginal on h˜ui (i.e. a message from this variable to the factor labeled “join”
in Figure 4.3): Because the approximate marginals on the thresholds are Gaussian
4.3 A Generative Model 69
N (bu; µˆu, Σˆu) and N (bi; µˆi, Σˆi), the message can be evaluated analytically to
m(h˜ui) =
∫
N (h˜ui; bu + bi, β2IL)N (bu; µˆu, Σˆu)N (bi; µˆi, Σˆi) dbi dbu
= N (h˜ui; µˆu + µˆi, Σˆu + Σˆi + β
2IL) ≡ N (h˜ui;µhui ,Σhui)
(4.10)
Similarly, the message from yui in this direction is N (yui;µxuc , σ
2
xuc + τ
2). This is
using the mean and variance of the marginal on the trait xuc, where c = c(i) is the
trait responsible for this item. From these two marginals, we construct a marginal
on their joint vui ≡ (hui, yui) in the trivial way, by stacking them:
p(vui) = N
[
vui;
(
µhui
µyui
)
,
(
Σhui 0
0 σ2yui
)]
(4.11)
Since the conditional (4.7) of the response is again a product of step functions, EP
inference on the marginal of vui can be performed in a similar way to the ordinal
inference in Section 4.3.2, establishing a marginal on vui with a dense co-variance
matrix Σui. Now that there is a set of messages from rui towards vui in Figure
4.3, the direction of message passing can be reversed: The messages from vui to
h˜ui and yui can be constructed by simply separating the elements of the mean and
variance of the marginal of vui, because for Gaussians in general (see also Equation
A.13)
p(x, y) = N
[(
x
y
)
;
(
µx
µy
)
,
(
Σxx Σxy
Σxy Σyy
)]
=⇒ p(x) = N (x;µx,Σxx)
(4.12)
The new message to xuc is constructed in an analogous way to the presentation
in Equation (4.10): Adding Gaussian noise of variance τ 2, the factorized messages
to the thresholds bu and bi are found by re-arranging the conditional in Equation
(4.10)
N (h˜ui; bu + bi, β2IL) = N (bu; h˜ui − bi, β2IL) = N (bi; h˜ui − bu, β2IL) (4.13)
and marginalizing once more, as in Equation (4.10). This completes the derivation
of all approximate Gaussian messages necessary for Expectation Propagation in
this model.
Algorithmic Implementation
For a large dataset, the number of messages passed in the scheme laid out in the
previous sections is considerable, and the message-passing schedule can have a
marked effect on the speed of convergence and numerical stability. In our imple-
mentation, the algorithm begins with a small loop over the ordinality constraints.
70 Approximate Bayesian Psychometrics
Once this loop has converged, the algorithm iterates over all pairs (u, i) of user u
responding to item i with answer rui, stopping at regular intervals to update the
ordinal messages, and loops over the data set in this fashion several times until
the messages have converged. Convergence can be measured in several ways. In
our implementation, we track the maximal change to the logarithm of the deter-
minant of the precision matrix of any message in the algorithm. We also damp
the messages slightly, as described in Section 2.3.2, which increases performance
slightly. The algorithm typically converges within less than 10 iterations, even for
large datasets.
Free Parameters and Identifiability
The approximate probabilistic algorithm has two obvious free parameters β and
τ . To fix these in our experiments, the evidence of the training set was evaluated
for a small number of settings and maximized, leading to the values τ = 3.0 and
β = 0.2. The parameters of the priors jointly define the scale and location of
the latent space, which is arbitrary. If the priors’ means are set to zero, there is
only one further meaningful degree of freedom, which is the relative width of the
prior on xui to the effective prior on hui defined through the priors on bu and
bi. Similar to the treatment of the other free parameters, a number of settings
were tested based on the assigned evidence, and the priors where chosen to be
p(bi) = p(bu) = N (bu/i; 0, I4), where I4 is the four-dimensional unit matrix, and
p(xuc) = N (xuc; 0, 1) ∀c. So the overall number of free parameters to choose by
maximum evidence is three, which should be compared to the U + I(L + 1) free
parameters fitted by maximum likelihood in the GPCM.
In the literature on generalized regression, it is often pointed out that models such
as the one presented here are not fully “identified”, because under the transfor-
mation hui → hui + k1 and xui → xui + k, i.e. under addition of a constant to
all involved quantities, the model produces the same predictions. In the GPCM,
the different structure of the last line of Equation (4.2) is a solution to this is-
sue, enforcing identifiability. However, since the algorithm presented here retains
beliefs rather than point estimates, producing predictions by integrating out the
latent beliefs, this degree of freedom is not a problem, as long as the latent distri-
bution is proper. The simple independent Gaussian prior mentioned above suffices
for this task. There is never a need to identify any specific solution for the latent
parameters.
4.4 Results 71
tr
ai
ni
ng
se
t
te
st
se
t
5000 users
5000 users
50
it
em
s
50
it
em
s
Figure 4.5: Sketch of the separation of the data set into training and test sets.
4.4 Results
We evaluated the methods presented in the previous sections on a data set provided
courtesy of the MyPersonality.org application1, which allows users of the Facebook
social networking system to complete personality tests. The entire dataset com-
prises approximately 4 million users, but to limit computational cost, was limited
to a subset from 10, 000 users. The test taken is a version of the “Big Five” per-
sonality test introduced in Section 4.2. The version of the test used here has 100
items, answered on a one-out-of-five Likert scale. As discussed before, each item
is associated, in theory, with one and only one of five traits called “openness”,
“conscientiousness”, “extraversion”, “agreeableness” and “neuroticism”. Because all
algorithms require training data for both users and items, the data set was split
into training and test set as follows (Figure 4.5): A first set of 5, 000 users was used
to train item characteristics (bi). The second half (5, 000 users) was split in half
by removing every other question, with one half providing another training set for
user parameters (bu), and the second half being the test set. For the Rasch and
CPGM models, the latent trait is the only variable conveying information about
the user; for our approximate scheme we used the users’ inferred traits as well as
the inferred user boundaries.
4.4.1 Computational cost
The GPCM and Rasch models were trained with the publicly available ltm pack-
age2 for the R programming language. This package returns a maximum likelihood
fit for the I(R+ 1) parameters of the GPCM model (Equation (4.2)), and the IR
1http://mypersonality.org/research/interested-in-collaborating/
2http://rwiki.sciviews.org/doku.php?id=packages:cran:ltm
72 Approximate Bayesian Psychometrics
parameters of the Rasch model (Equation (4.2) with β = 1). It also returns max-
imum a-posteriori Gaussian approximations for the latent traits xuc of users. The
EP algorithm was implemented in the .NET language F#. Due to the incompara-
ble platforms used, a precise timing comparison is not meaningful, but the overall
computation times were roughly comparable (the GPCModels took 5 hours to
conclude, the EP implementation about 2 hours), even though the probabilistic
algorithm infers about 100 times more latent parameters.
4.4.2 Approximate Bayesian Estimates
Figure 4.6 shows beliefs over the values of bu and bi for the same users and items
as shown in Figure 4.1. Note that the scale for user 8, who rarely uses response
r = 3, is “steeper”, with the boundaries b2u=3 and b
3
u=3 moved closer together. The
opposite effect can be seen for user 2, who prefers response 3. User 15 barely uses
the lowest response r = 1, leading to the boundary b1u=15 moving further down,
and the uncertainty on its value to increase. Similar effects can be seen for the
items, shown on the right. Since every item takes part in a much larger number
of user-item pairs (5000 pairs for items in the test set) than the individual users
(50 pairs for each user in the test set), the beliefs on the item thresholds are much
more precise than those on the users.
The bottom half of Figure 4.6 shows the resulting belief on the threshold h˜ui for
one specific item i = 2 and the three same users as above. Note that, despite
the relatively high precision of the user and item marginals, the noise term β still
produces a relatively broad distribution. It is important to note that, while all
the figures show the beliefs on threshold values as four individual distributions,
these are marginals of a joint distribution. One can imagine samples from this
distribution to be threshold values taking small or large values in co-ordination
(Figure 4.7).
4.4.3 Predictive Performance
Table 4.1 reports the log probability assigned by the variant methods to the re-
sponse chosen by the user. The probabilistic algorithm can make use of its more
expressive model to predict users’ answers with higher accuracy. The table also
contains predictive probabilities for the approximate inference model with bi and
bu fixed to 0, respectively. In the former setting, which is very similar to the
GPCM setup, the probabilistic algorithm produces approximately the same pre-
dictive performance as GPCM, while using only user-specific thresholds yields a
less expressive model. This is not surprising, as this setup effectively creates inde-
pendent datasets for every user, each consisting of 50 data points used to predict
50 other data points.
4.4 Results 73
-3 -2 -1 0 1 2
users (latent scale)
b 1u
b 2u
b 3u
b 4u
utest=2
utest=8
utest=15
4 3 2 1 0 1 2
items (latent scale)
b 1i
b 2i
b 3i
b 4i
itest=1
itest=2
itest=3
5 4 3 2 1 0 1 2 3 4
latent space [arbitrary scale]
h˜
1
ui
h˜
2
ui
h˜
3
ui
h˜
4
ui
h˜2,2
h˜8,2
h˜15,2
Figure 4.6: Thresholds learned by the approximate inference algorithm, for the users
and items introduced in Figure 4.1. Top left: Marginal beliefs for the first three users.
Means denoted by a marker, “error bars” of two standard deviations width to each
side in the same colour as the marker (the error bars have been slightly off-set from
the means to avoid overlap). The vertical position of each marginal is indicative of
its ordinal position, from threshold 1 at the bottom to threshold 4 at the top. The
connecting lines are for visual aid only. To create a smooth connection, they were
generated by evaluating the function f(x) =
∑
` Φ(x;µb`ui , σ
2
b`ui
), where Φ(x, µ, σ2)
is the cumulative density function of the one-dimensional Gaussian with mean µ and
variance σ2. Nonlinear functions of analogous form are sometimes used to represent
thresholded models in the literature on GPCM; note, however, that this distribution
is not connected in an obvious way to the actual predictive probability of replies rui.
Top right: Similar plot for the three items shown in Figure 4.1. Bottom: A similar
plot for the threshold h˜ui resulting for the combination of user u = 8 and item i = 2.
74 Approximate Bayesian Psychometrics
-2 -1 0 1 2
bu
0
5
10
15
20
sa
m
p
le
 i
n
d
ex
4 2 0 2 4
h˜ui and yui
0
5
10
15
20
sa
m
p
le
 i
n
d
ex
Figure 4.7: 20 sampled threshold values, for the item-respondent combination u =
8, i = 2 (see also Figure 4.6). Left: sampled respondent-specific thresholds bu=8.
Samples as circles, connecting lines for visual aid only (rows of samples were generated
independently, their order is arbitrary). Marginal means and standard deviations as
thick and dashed lines, respectively. The order of the threshold elements is from 1
(left) to 4 (right). Note the ordering among the thresholds, and that the approximate
character of the inference means that minor variations of the ordering are not entirely
impossible. Right: Similar plot for the combined thresholds h˜u=8,i=2 (coloured circles)
and the predicted opinion yui (black squares). Due to the independent regression noise
τ , these thresholds do mix sometimes, which is an explicit part of the model, not an
artifact of approximation.
4.4 Results 75
6 4 2 0 2 4 6
0
0.5
1
it
em
 1
6 4 2 0 2 4 6
0
0.5
1
it
em
 2
6 4 2 0 2 4 6
latent ability [arbitrary scale]
0
0.5
1
it
em
 3
h˜
`
ui
p(yui)
GPCM
EP
Figure 4.8: Predictions for the responses of user 8 (see Figure 4.1) to items 1 (top),
2 (middle), and 3 (bottom). All these items are in the test set, so the algorithms
have not observed this combination of item and user before. Solid green distribu-
tions: Marginals for the locations of the four latent thresholds h˜ui (note that these
are marginals of the elements of a highly correlated joint distribution). Dashed green
distributions: Marginals on this user’s latent ability yui. Blue bars (in front): Predictions
of the approximate inference scheme on user-item pair regression for the five possible
answers. Red bars (in background): Predictions of the GPCModel. The answer actually
chosen by the user is highlighted in a stronger colour. The location of the bars is es-
sentially arbitrary as the replies are elements of a different space as the thresholds and
abilities, but have been chosen suggestively: The left-most bar, to the left of the first
threshold, corresponds to reply 1, and so on up to reply 5, to the right of the highest
threshold.
76 Approximate Bayesian Psychometrics
MethodM 1/Ntest · log p(Dtest|M) p1/Ntest(Dtest|M)
Chance (for comparison) −1.609 0.200
Rasch Model (ML) −1.362 0.256
GPCM (ML) −1.299 0.273
Ordinal Model, items only −1.306 0.271
Ordinal Model, users only −1.385 0.250
Ordinal Model, users & items −1.257 0.285
Table 4.1: Average log and direct probabilities assigned, by the different models, to
the item actually chosen by the user, on the test set of 5, 000 users, on 50 items
(see Figure 4.5). Larger values are better. “Items only” denotes a model only modeling
item-specific thresholds; analogously for “users only”.
Figure 4.8 gives an intuition of the prediction process. The marginals of the thresh-
olds are shown as Gaussian distributions, the marginal on the latent ability yui as a
broader, dashed Gaussian distribution. The predictions of both the GPCM and the
probabilistic algorithm are shown as bars, with the actual response (which was not
observed by the algorithms) highlighted. Note that these predictions are a nontriv-
ial function of the depicted Gaussian beliefs and can not simply be read off from
the plot. The figure shows the beliefs for the combination of user 8 with the first
three items in the test set, as also shown in Figure 4.6. The GPCM tends to make
broader, less certain predictions than the more expressive probabilistic model. In
case of item 1, the GPCM happens to put larger mass on the correct answer, in the
other two cases, the probabilistic algorithm could predict the right answer with
higher probability. As Table 4.1 shows, overall the probabilistic algorithm is more
successful in predicting the users responses.
4.5 Discussion
The model and accompanying inference algorithms presented in this paper are in-
tended mainly as technical demonstrations. They obviously do not constitute psy-
chological research. We have presented and evaluated them using the framework
and evaluation criteria of the machine learning community. It should be pointed
out that the psychometric literature differs from this point of view in some im-
portant aspects. Most importantly, in psychological contexts inferring and then
studying the latent dimensions of the data is the main aim of the research pro-
cess. Many psychologists would thus find it not acceptable to retain a finite-width
Bayesian belief over possible values of these values, even though, from a statisti-
cian’s perspective, this would be a good course of action if the ultimate goal of the
research was to predict human behavior. The need for scientific as well as public
communication of the research results has so far discouraged reporting a set of
beliefs as final research outcomes. This also means that testing a particular pro-
4.6 Conclusion 77
babilistic model solely on its predictive power on held-out test data might not be
sufficient for applications in psychology, and that interpretability of the inferred
latent structure is just as important. On the other hand, it should be pointed
out that, on a fundamental level, neither the models in general use today nor the
probabilistic algorithm presented here can claim to be a faithful representation of
the complicated generative process for item responses. Such a model, if it exists,
would have to take into account both the semantic structure of the items and the
complicated thought process the human respondent uses to arrive at an answer. In
contrast to the GPCM, the algorithm presented here at least provides some crude
treatment of the latter aspect.
If modern inference methods are to make a significant contribution to the devel-
opment of psychological analysis, it will be necessary to find solutions to these
conceptual issues. The aim of this work is, more humbly, to communicate to the
machine learning community that psychometric datasets provide interesting and
challenging applications for machine learning methods. Thanks to social network-
ing, such datasets are reaching previously unseen scales, making efficient data
modeling techniques crucial.
4.6 Conclusion
Psychometry is a well-established discipline within psychology, with considerable
mathematical underpinnings in classical statistics, and developed through a large
body of literature. The recent advent of social networking has created a surge in
test results that have the potential to raise fidelity and reliability of psychometric
tests. At the same time, the ongoing fast pace of machine learning research has
generated new tools for efficient and robust approximate inference, which allow
inference in larger models of more complex structure. As a demonstration of the
usefulness of contemporary machine learning methods for psychological modeling,
we have presented an approximate inference scheme based on a combination and
extension of several recently developed methods and models, which allows fully
probabilistic inference from ordinal discrete item responses on latent structural
descriptions of human behavior.
78 Approximate Bayesian Psychometrics
Chapter 5
Fast, Online Inference for
Conditional Topic Models
The work presented in this chapter is the result of a collaboration with
Thore Graepel, Ralf Herbrich and David Stern, all of Microsoft Re-
search Ltd.. Except for where the work of others is explicitly cited, all math-
ematical derivations, algorithmic implementations and experimental eval-
uations were performed by the author of this thesis. The author thanks
Thomas Borchert for providing access to a different method as a benchmark
during development, as well as David Knowles and Simon Lacoste-Julien for
helpful comments.
A shorter version of this chapter is under review by the International World
Wide Web Conference 2011.
Abstract
Topic models provide a means of describing semantic similarity between documents
in a multidimensional latent space, based on similarity of word frequencies. Unfor-
tunately, virtually all contemporary inference algorithms for topic models require
multiple passes over the document corpus, making them ill-suited for use on the
large scale corpora typical of the web. Moreover, modern online document reposi-
tories regularly provide document features other than the words themselves, which
can provide helpful semantic information but are not part of standard topic mod-
els. These issues have so far confined the use of topic models to small and specific
corpora. Using a series of approximations that have proven helpful in other areas
of machine learning, we construct an efficient approximate inference scheme for
topic models conditional on arbitrary features of the document. We also study the
viability of online (single pass) inference in such models, and show experimental
80 Fast, Online Inference for Conditional Topic Models
results from large online document corpora.
5.1 Introduction
Topic models are probabilistic descriptions of corpora of text documents as “bags of
words” (i.e. ignoring word order), generated from a mixture of discrete distributions
over words. Each document in a collection is considered to be generated by a
mixture of topics such that each word in the document is generated by first drawing
a particular topic, then drawing the word from that topic. One way to interpret
these models is as a form of dimensionality reduction: Instead of having to explain
every word in every document individually, the model describes the documents’
words in terms of a lower-dimensional space of topics (see sketch in Figure 5.1, top).
Such discrete mixture models are of course more generally applicable than only to
texts. But texts are currently the most important application, and the metaphors
of documents, topics and texts are useful for visualization. This chapter will thus
stick to this nomenclature, and all experiments will be carried out on collections
of actual text documents.
A wide array of generative models for text — with a view towards indexing — have
been studied in the past, starting from ad-hoc methods such as the tf-idf heuristic
still popular in information retrieval [Salton and McGill, 1983], to methods such
as latent semantic indexing [Deerwester et al., 1990], which is based on a maxi-
mum likelihood estimate of a singular value decomposition. Seminal steps towards
models retaining probabilistic beliefs for this task were the development of a proba-
bilistic model for latent semantic indexing [Hofmann, 1999], and the introduction
of latent Dirichlet allocation [Blei et al., 2003] (in the compression and language
modelling communities, whose goals differ from the indexing setting discussed here;
probabilistic language models based on latent Dirichlet variables had been studied
much earlier than that, see e.g. MacKay and Bauman-Peto [1995]). Since then,
a considerable amount of further work on this subject has been produced. An
overview can be found in [Blei and Lafferty, 2009]. Of particular relevance for the
work presented here are the development of collapsed inference algorithms [Grif-
fiths and Steyvers, 2004, Teh et al., 2007], and of topic models conditioned on
features of the document [Mimno and McCallum, 2008, Zhu and Xing, 2010].
In most published applications, topic models are applied to collections of “conven-
tional” documents, such as scholarly articles, or newswire reports. In such corpora,
the overall number of documents is limited, and usually on the order of a few
thousand to tens of thousands, making several iterations over the corpus compu-
tationally feasible with standard hardware. Because the corpus is either of fixed
size or growing slowly, computation time (typically on the order of hours or days)
is short compared to the pace at which the corpus develops. Document collections
5.1 Introduction 81
D
d
o
c
u
m
e
n
ts
V words
W ∼
D
d
o
c
u
m
e
n
ts
K topics
×
V words
K
to
p
ic
s
Π Θ
D
d
o
c
u
m
e
n
ts
V words
W ∼ σ
∈ RD×K
∈ [0, 1]D×K∈ ND×V ∈ [0, 1]K×V
D
d
o
c
u
m
e
n
ts
F features
Φ ×
F
fe
a
tu
re
s
K topics
Z ×
V words
K
to
p
ic
s
Θ
Figure 5.1: Conceptual sketches. Top: Classic topic model. Words from a vocabulary of
size V observed with varying frequencies over a corpus ofD documents. The generative
process for the words is described in terms of a K-dimensional latent space of topics
(withK  D andK  V ). Bottom: Topic model conditioned on observable features
of the document. Note projection through the softmax function σ. The symbols in the
matrices refer to the nomenclature used throughout the chapter, where the elements of
the matrices are denoted by the corresponding lower-case symbols. Black circles denote
observed variables, white circles latent ones. In many applications, Φ is a sparse matrix
and F is large, though usually smaller than D. The colour coding for features and
topics will be re-used in Section 5.2.3 and Section 5.2.3 to clarify two different sources
of conditional dependence.
82 Fast, Online Inference for Conditional Topic Models
on the web differ considerably from this setup: these corpora are constantly grow-
ing, and topics must be available (almost) in real time, so an inference algorithm
has to be able to run at speeds comparable to the growth rate of the corpus. An
extreme example for such a setting is the microblogging service Twitter.com, where
users publish short “status updates” of ≤ 140 characters in length. At the time of
writing, the English subset of the Twitter corpus grows at > 106 status updates
per hour, and this rate is increasing. This makes single pass, or online algorithms
the only option for inference on the structure of this and similar corpora (such as
Wikipedia.org, Facebook.com, etc., each of which contain well over 107 documents
and are growing constantly).
A second, potentially beneficial aspect of documents on the web is the wide avail-
ability of explanatory metadata, such as the identity of the author, the time of
writing, annotation tags by other users, geographic location of the author or her/his
membership in certain groups. Such features can provide strong information about
the topic distribution of a document. For example, being in a region currently
struck by natural disaster may almost completely determine the topic of a Twitter
status update. A more mundane example is the high frequency of “Good morning
world” status updates on Twitter at certain times of day, or the focus of individual
Wikipedia authors on particular subjects.
The following sections present an extension of the latent Dirichlet allocation (LDA)
model, with a focus on computational efficiency, and the use of (sparse) features
for regression on topics. Such conditional topic models have recently been intro-
duced by other authors [Mimno and McCallum, 2008, Zhu and Xing, 2010]; but
those implementations have relatively high computational cost (see Section 5.2),
making these algorithms difficult to apply to large corpora. Instead, we propose a
semi-collapsed variational inference scheme adapted from work by Teh et al. [2007],
and combine it with sparse linear Gaussian regression based on assumed density
filtering in Gaussian message passing (see e.g. Opper [1996]). The necessary link
function between these two schemes is provided by a Laplace approximation for
Dirichlet distributions in the softmax basis first introduced by MacKay [1998]. The
result is a fast implementation of conditional topic models that can deal with doc-
ument collections of the size of Twitter and other online document collections, and
allows prediction of topics from features, inference of a document’s topic distribu-
tion from words and features, as well as inference on each topic’s word distribution.
5.2 Model
Consider a corpus ofD documents. Document d contains Id words wdi ∈ {1, . . . , V },
d ∈ {1, . . . , D}, i ∈ {1, . . . , Id} from a vocabulary of size V . Some description of
d is available in the form of a feature vector φd ∈ R
F (in many applications, this
5.2 Model 83
Z
τ
yd
φd
pid cdi wdi
βk θk
topic proportions topic assignments
document features
topic descriptions
feature weights
D
Id
K
Figure 5.2: Directed graphical model of the conditional topic model. Some of the latent
parameters have been labeled with descriptions for clarity.
vector will be sparse). We construct a topic model conditional on observable fea-
tures of the documents, using the following generative process for all vectors wd,
d ∈ [1, . . . , D], from K topics:
. For each topic k ∈ {1, . . . , K}, generate a discrete probability distribution
with parameters θk ∈ [0, 1]V over the vocabulary of size V by sampling from
a Dirichlet distribution with parameter vector βk:
p(θk |βk) = D(θk;βk) =
Γ
(∑V
v βkv
)
∏V
v Γ(βkv)
∏
k
θβkv−1kv (5.1)
where Γ is the Gamma function.
. Generate a matrix Z ∈ RF×K of feature-topic weights, by sampling each
weight independently from a Gaussian distribution with mean µfk and vari-
ance σ2fk:
p(zfk |µ,Σ) = N (zfk;µfk, σ2fk) (5.2)
(note that this notation allows a nonzero mean for every individual topic,
assuming the existence of a “bias” feature with fixed value 1.)
. For each document d with features φd ∈ R
F ,
– Draw a latent variable yd from the noisy linear model with a single
noise parameter τ :
p(yd |Z, τ,φd) = N (yd;Z
Tφd, diag(τ
2)) (5.3)
– Define the topic proportions pid = σ(y) ∈ [0, 1]K where σ is the vector
softmax function
σk(y) =
exp(yk)
∑K
` exp(y`)
(5.4)
– For the i-th of Id words
84 Fast, Online Inference for Conditional Topic Models
∗ draw a topic cdi from the discrete distribution defined by pid:
p(cdi = k |pid) = pidk (5.5)
∗ draw word wdi from the discrete distribution of topic cdi:
p(wdi = v | cdi,Θ) = θcdiv (5.6)
Figure 5.2 shows a directed graphical model representing this generative model. If
we replace everything to the left of pid in that figure by a single Dirichlet parameter
vector α (identical for all d), then the parts shown to the right of and including the
node pi correspond to the traditional LDA model [Blei et al., 2003]. The extension
to a conditional topic model as defined above has been used before by Mimno and
McCallum [2008], who developed an inference scheme based on expectation maxi-
mization and Gibbs sampling for it. Such an algorithm does not lend itself to large
datasets. There is also a related probabilistic model, known as the correlated topic
model [Blei and Lafferty, 2007], which shares the softmax relationship introduced
above and everything to its right in Figure 5.2, but not the regression element
to its left. While it is straightforward to extend the correlated topic model with
the regression part as introduced above, the published inference method for this
model uses an overall variational bound that can be optimized only numerically
and involves frequent matrix multiplications of costO(K2); this scheme thus comes
at considerable computational cost, too. Our contribution will be the derivation
of a fast inference scheme. To construct this algorithm, we will integrate out the
distribution over Θ (i.e. we will treat this part of the model exactly), and de-
rive a variational bound on Π and c (Section 5.2.1), using approximate discrete
beliefs on the cdi and Dirichlet beliefs on the pid. This requires a Dirichlet prior
for pid. To construct this prior, we will use factorized Gaussian message passing
in the regression algorithm (Section 5.2.3), which establishes a Gaussian belief on
yd = σ
−1(pid). We will use a Laplace approximation to establish a match between
Gaussians and Dirichlets in the softmax basis (Sections 5.2.2 and Appendix C).
5.2.1 Semi-Collapsed Variational Inference
This section constructs a variational inference scheme for LDA which we will call
semi-collapsed for the following reason: the necessary derivations are an extension
of a fully collapsed variational bound derived in an excellent paper by Teh et al.
[2007]. The parts of our algorithm formulated in this section differ from Teh’s
work in only two details, which cause minor changes in the derivation but will be
crucial for their use in our overall model: We will retain an explicit variational
approximation to the posterior beliefs on the pid, instead of integrating them out
5.2 Model 85
(i.e. this part of the inference is not collapsed). We will also make the more general
assumption of non-uniform Dirichlet parameters. To keep the notation uncluttered,
it will be assumed that a Dirichlet prior p(pid |αd) = D(pid;αd) on pid is available
at the time of inference. This prior will be constructed in Section 5.2.2.
From the definitions in Section 5.2, the joint probability of all latent and observed
variables in the LDA model, as a function of the parameters α and β, is a large
product of Dirichlet distributions. To see this, note that for every observed docu-
ment wd, the likelihood for pid and θ is multinomial in both parameters:
p(wd, cd |pid,θd) ∝
Id∏
i
pidcdiθcdiwi . (5.7)
Because the two Dirichlet distributions in the prior (per-document over topics and
per-topic over words) are conjugate to this double multinomial, the posterior is
Dirichlet. A compact representation can be achieved using counts
ndkv ≡ |{i s.t. cdi = k;wdi = v}|. (5.8)
We will use a dot to denote that a dimension has been summed out; for example
n·kv =
∑
d ndkv, etc. With this, we get the joint (note that the left-hand-side
variable C is represented implicitly on the right through the counts ndkv)
p(W ,C,Θ,Π |α,β) =
∏
d
Γ (
∑
k αdk)∏
k Γ (αdk)
∏
k
piαdk−1+ndk·dk
∏
k
Γ(
∑
v βkv)∏
v Γ(βkv)
∏
v
θβkv−1+n·kvkv .
(5.9)
Standard inference in LDA [Blei et al., 2003] uses a fully factorized approximate
distribution
q˜LDA(Π,C,Θ) =
∏
d
q˜(pid)
∏
i
q˜(cdi)
∏
k
q˜(θk) (5.10)
on all latent parameters. This raises two problems.
. Due to the mixing terms ndk·, n·k· and n·kv, the posterior beliefs over the cdn
are highly correlated. The fully factorized bound thu may be loose.
. Tracking approximate beliefs for Θ and Π is computationally inefficient. This
is less a problem of memory or nominal computation cost than of algorithmic
design, and related to the previous point: In a standard implementation, the
algorithm passes through the corpus, updating the estimates on Π and C. It
then steps out and updates the estimate on Θ. Because the other estimates
are highly correlated with Θ, they now have to be re-estimated, and this
cycle has to be repeated many times: The algorithm converges slowly. Note
that this problem is more pronounced for inference on Θ than for inference
on Π because the latter depends directly only on the words in one document,
86 Fast, Online Inference for Conditional Topic Models
while the former is a function of the entire corpus.
To address this problem, we adapt Equation (5.10) by introducing a dependence
of Θ on C:
q(Θ,Π,C) = q(Θ |C)
D∏
d
q(pid)
Id∏
i
q(cdi) (5.11)
The variational bound (see Section 2.3.3) is
L[q(C)q(Θ |C)q(Π)]
= Eq(Z)q(Θ |Z)q(Π) [log p(W ,C,Π,Θ |β,α)] + H [q(C)q(Θ |C)q(Π)]
= Eq(C)q(Π)
[
Eq(Θ |Z) [log p(W ,C,Π,Θ |β,α)] + H[q(Θ |C)]
]
+ H (q(C)q(Π))
(5.12)
Since we are not restricting the functional form, optimizing for q(Θ |C) leads to
a unique optimum at the exact posterior p(Θ |C,W ,β).
L(q(C)q(Π)) = max
q(Θ |C)
L[q(C)q(Θ |C)q(Π)]
= Eq(C)q(Π) [log p(W ,C,Π|β,α)] + H [q(C)q(Π)]
(5.13)
An obvious advantage of this treatment over the classic factorized bound is that,
because Equation (5.11) is a strictly weaker assumption on the structure than
Equation (5.10), the resulting bound on the true posterior under the model is
strictly tighter. A second advantage will become apparent when we derive an ex-
plicit scheme for the optimization of remaining the approximate distributions. To
do so, we first integrate out all θk from Equation (5.9), arriving at
p(W ,C,Π |β,α) =
∏
d
Γ (
∑
k αdk)∏
k Γ (αdk)
∏
k
piαdk−1+ndk·dk
∏
k
Γ (
∑
v βkv)
Γ (n·k· +
∑
v βkv)
∏
v
βkv + n·kv
Γ(βkv)
(5.14)
we now choose a particular parametric form for the approximating distributions,
namely independent discrete distributions on the cdi with parameters q(cdi = k) =
γdik and independent Dirichlet distributions on the pid, with parameter vectors
νd ∈ RK+ :
q(C) =
∏
d,i
∏
k
γdik and q(Π) =
∏
d
D(pid |νd) (5.15)
We insert this and Equation (5.14) into Equation (5.13), only retain the terms
containing pi, and use that the entropy of the Dirichlet is [e.g. Bishop, 2006]
H[D(x |ω)] = log
(∏
k Γ(ωk)
Γ (ωˆ)
)
+ (ωˆ −K)z (ωˆ)−
∑
k
(ωk − 1)z(ωk) (5.16)
5.2 Model 87
with ωˆ ≡
∑
k ωk and the Digamma function z, which is the first derivative of the
log of the Gamma function. This gives an explicit expression for the bound on each
document’s topic proportion as a function of νd, up to additive constants:
L(νd) =
∑
k
[αdk − 1 + Eq(ndk·)]Eq[log(pidk)]
+ (νˆd −K)z(νˆk)−
∑
k
(νdk − 1)z(νdk)
+
∑
k
log Γ(νdk)− log Γ(νˆk)
(5.17)
Using that the expected logarithm under Dirichlet beliefs is
ED(ω)[log x] = z(ωk)−z(ωˆ) (5.18)
we get
L(νd) =
∑
k
(αdk + Eq(ndk·)− νdk)
[
z(νdk)−z
(
K∑
`
νd`
)]
+
∑
k
log Γ(νdk)− log Γ(νˆk).
(5.19)
Differentiating this expression with respect to νdk and setting to zero establishes
an optimum at
νdk = αdk + Eq(ndk·). (5.20)
We will deal with the expectation in this equation in the next section. First, we
consider the optimization with respect to γdik. To do so, we introduce the notation
n\didkv for ndkv counting all words except the one at location (d, i). The entropy of
the discrete distribution with parameters γv is Hγ[q(c)] = −
∑
v γv log γv. Further,
we can expand
log
Γ(x+ a)
Γ(x)
=
a−1∑
`=0
log(x+ `) ∀x ∈ R+ ; a ∈ N. (5.21)
Using all this, again Equation (5.14) and some of the results in the previous equa-
tions, canceling terms appearing in both nominator and denominator, we arrive at
the update rule
γdik ∝ exp
{
z(νdk) + Eq
[
log(βkwdi + n
\di
·kwdi
)
]
− Eq
[
∑
v
βkv + n
\di
·k·
]}
. (5.22)
88 Fast, Online Inference for Conditional Topic Models
A Gaussian approximation to the expected values of counts
The final problem in this section is the evaluation of the expectation terms in
Equations (5.19) and (5.22). Computing these terms exactly is computationally
expensive. However, Teh et al. [2007] point out that these terms are sums of usually
large numbers of independent Bernoulli variables. For example,
ndi·kv =
∑
d′ 6=d;i′ 6=i
I(cd′i′ = k) (5.23)
(and analogously for the other counts). Sums of independent Bernoulli variables
are approximately Gaussian distributed, with mean and variance given by the sums
of the means and variances of the individual Bernoulli variables
Eq(n·k·) =
∑
d,i
γdik
and Varq(n·k·) =
∑
d,i
γdik(1− γdik)
(5.24)
We can thus approximate by expanding
Eq[log(βkwdi + n
\di
·kwdi
)] ≈ log(βkwdi + Eq[n
\di
·kwdi
])−
Varq(n
\di
·kwdi
)
2(βkwdi + Eq[n
\di
·kwdi
])2
(5.25)
Teh et al. [2007] studied this approximation and concluded that, even though the
count is usually larger than β, this Taylor expansion gives good results in this
application. In our model, this gives the update equation
γdik ∝ exp[z(νdk)]
βkwdi + Eqn
\di
·kwdi
∑
v βkv + Eq(n
\di
·k· )
· exp
{
−
Varq n
\di
·kwdi
2(βkwdi + Eq[n
\di
·kwdi
])2
+
Varq n
\di
·k·
2(
∑
v βkv + Eq[n
\di
·k· ])
2
}
.
(5.26)
Comparison to fully collapsed model
It is instructive to take a short diversion here and study the differences between
this semi -collapsed algorithm, where only Θ is integrated out, and the fully col-
lapsed algorithm derived in Teh et al. [2007]. In the latter scheme, there is no
5.2 Model 89
approximation on Π and the update rule for γdik is
γdik ∝ (αdk + Eq(n
\di
dk·)
βkwdi + Eqn
\di
·kwdi
∑
v βkv + Eq(n
\di
·k· )
· exp
{
−
Varq n
\di
dk·
2(αdk + Eq[n
\di
dk·])
2
−
Varq n
\di
·kwdi
2(βkwdi + Eq[n
\di
·kwdi
])2
+
Varq n
\di
·k·
2(
∑
v βkv + Eq[n
\di
·k· ])
2
}
(5.27)
We observe further that the Taylor expansion of the Digamma function is given
by ([Abramowitz and Stegun, 1972, §6.4.12])
z(x) = log(x)−
1
2x
+O(x−2). (5.28)
So if νdk is updated after every update of γ (which is not a computationally efficient
scheme, but possible in principle), the two update rules are identical, up to second
order corrections, up to a factor
∆ =
γsemi-collapseddik
γcollapseddik
≈ exp
(
−
1
2(α + Eq[n
\di
dk·])
+
Varq [n
\di
dk·]
2(α + Eq[n
\di
dk·])
2
)
= exp
(
Varq [n
\di
dk·]− α− Eq[n
\di
dk·]
2(α + Eq[n
\di
dk·])
2
)
= exp







−
α +
∑
i′ 6=i
γ2di′k
2
(
α +
∑
i′ 6=i
γdi′k
)2







(using Eq. 5.24)
(5.29)
Because γdik < 1 ∀d, i, k, we have ∆ ∼ 1, and the update rules are approximately
equal up to second order corrections. From this point of view, the main difference
between the two schemes is thus one of scheduling. The fully collapsed scheme can
utilize a slightly more efficient representation of the approximate beliefs, which
can aid mixing. However, for large corpora where the mean field is dominated by
effects from other documents, the differences between the two algorithms can be
expected to be small.
5.2.2 Laplace Approximation for Dirichlets
The previous section introduced a variational inference scheme for the subgraph
to the right of and including pid in Figure 5.2. The next section will construct a
fast approximate message passing scheme for regression from Gaussian beliefs on
90 Fast, Online Inference for Conditional Topic Models
yd, to the left of that variable in the graphical model. To link these two parts
of the inference, we require some way of connecting the beliefs on pid and yd.
The deterministic functional relationship σ(yd) = pid identifies this task as a
probabilistic form of logistic regression (probabilistic in the sense that discrete
samples cdn from the distribution defined by pid are replaced by probabilistic beliefs
over cdn).
As pointed out above, solutions to this problem found by previous authors suffer
from comparatively high computational cost. Here, we will retain the Dirichlet
beliefs on pid returned by the semi-collapsed variational scheme, and explicitly
construct approximate Gaussian beliefs on yd. For this, we utilize a Laplace ap-
proximation in the softmax basis, which was derived by MacKay [1998], and can be
extended such that it provides an invertible map from the set of parameters of K-
dimensional Dirichlet distributions to a subset of the parameters of K-dimensional
Gaussians, which can be approximately represented by the means and variances of
K independent Gaussians. The detailed derivation of this approximation is lengthy,
and has thus been moved to Appendix C. The map between a Dirichlet with pseu-
docounts α and a multivariate Gaussian with mean and covariance matrix (µ,Σ)
is (Equations C.32 and C.21)
µk = logαk −
1
K
K∑
`=1
logα` (5.30)
Σk` = δk`
1
αk
−
1
K
[
1
αk
+
1
α`
−
1
K
(
1
τ
+
K∑
u
1
αu
)]
(5.31)
and αk =
1
Σkk
(
1−
2
K
+
eµk
K2
K∑
`
e−µ`
)
for k = 1, . . . , K (5.32)
In the following Section 5.2.3, we will use this approximation in a special form of
generalized regression, to link a set of approximately independent Gaussians to a
Dirichlet distribution. A few interesting characteristics to note are:
. The correlation between components of the Gaussian approximation is small
for K  1 and will thus be ignored here, giving K independent Gaussians
with means as above and variances (see also Equation C.22)
σk = Σkk =
1
αk
(
1−
2
K
)
+
1
K2
K∑
`
1
α`
(5.33)
In the other direction, from Gaussians to Dirichlets, the approximation dis-
cards any correlation structure between the components. Since the factorized
approximation to be introduced in Section 5.2.3 does not capture such cor-
relations anyway, this does not cause any additional issues.
5.2 Model 91
. The Gaussian resulting from this map has weaker tails than the Dirichlet
distribution in the softmax basis. The resulting weights are thus slightly
overconfident. The factorized regression introduced in the next section will
have a similar defect, and both of these problems need to be addressed by
ad-hoc solutions.
. Inspecting the Equations following (5.30), it is clear that this approximation
is well defined for the entire parameter space of Dirichlets, including values of
0 < αk < 1. For such sparse cases, the resulting Gaussian approximation can
have considerable uncertainty, but it does not lead to ill-defined parameter
settings.
See Figure C.3 in Appendix C for an intuition on this approximation.
5.2.3 Gaussian Regression
D
IdZ y˜d
φd
τ
yd pid cdi
I(y˜
d
=
Z
T φ d
)
I(pi
d
=
σ(
y d
))
y d
∼
N
(y˜ d
, d
iag
(τ
))
q(pid) q(cdi)
Figure 5.3: Factor graph representation of approximate inference in the conditional
topic model. Note that the semi-collapsed variational algorithm provides independent
beliefs q(pid) on the per-document topic distributions, allowing inference by message
passing on the weights Z in the Gaussian regression part. The regression is here
represented in its exact joint form. Figure 5.4 and Section 5.2.3 introduce a further
factorizing approximation allowing O(EK) inference, where E is the average number
of non-zero features per document.
Under the exact model given by Eq. (5.9), the beliefs on the per-document topic
distributions pid are correlated indirectly through the correlation of the word topic
labels cdn, and would only become independent if these labels were observed. The
semi-collapsed variational inference scheme introduced in Section 5.2.1 explicitly
constructed independent beliefs on the pid (see factor graph representation in Fig-
ure 5.3). In this section, we will make use of these independent beliefs to learn
relations between observable features of the document and the topic mixture of
documents, using the linear model with weights zfk as defined in Eq. (5.3). Thanks
to the Laplace map introduced in Section 5.2.2, we have access to an approximate
92 Fast, Online Inference for Conditional Topic Models
Gaussian belief on the inverse softmax of pid, denoted by yd = σ
−1(pid). This
reduces the problem to straightforward Gaussian regression, which has been stud-
ied very extensively in the past [e.g. Bishop, 2006, Section 3.3], and will be only
outlined in this section.
Fully Connected Regression
Noting that the Laplace approximation returns a correlated belief over the elements
of yd, and because “explaining away” causes correlations between the weights zfk
for different features f , an exact treatment of the regression problem calls for a
fully connected (i.e. joint) posterior on the weights Z. Because this exact solution
requires the inversion of a K ×K matrix, of cost O(K3), it is usually too costly
for real applications, and the next Section 5.2.3 will introduce a factorized scheme
which is not exact, but much faster. For completeness, this section first provides
the exact answer to the regression problem. It is the multivariate version of the
Gaussian linear regression introduced in Section 2.3.6.
Because we will be interested in the posterior on the zfk, to simplify the algebra,
we introduce the stacked vector z = vec(Z) ∈ RKF , i.e. a re-arrangement of the
matrix Z into a vector, and the matrix F d ∈ RK×FK , a re-arrangement of the
vector φd, such that
ZTφd ≡ F dz = y˜d (5.34)
The concrete realization of the projection is irrelevant for the analysis, as long as
it is consistent. The likelihood for z under one document “data point” yd is then
p(yd | z,F ) = N (yd;F dz, diag(τ
2)) (5.35)
To get the full posterior, assuming a Gaussian prior p(z) = N (z;m,S), we use
Bayes’ rule, and “complete the square” (for a step-by-step derivation, see for ex-
ample Bishop [2006], Section 2.3.6). The posterior is also Gaussian, and takes the
form
p(z |yd,F d) = N
(
z; Ψ
[
F T diag(τ−2)yd + S
−1m
]
,Ψ
)
where Ψ =
[
S−2 + F T diag(τ−2)F
]−1
= Σ− SF T(diag(τ 2) + FSF T)−1FS
(5.36)
where we have applied the matrix inversion lemma (C.17) in the last conversion,
which reduces the size of the matrix to be inverted from KF × KF to the size
K × K of the Schur complement. Since the variational inference provides only a
belief p(yd) = N (y;µd,Σd), rather than an exact value, we have to marginalize
5.2 Model 93
over y:
p(z |µd,Σd,F d) =
∫
p(z |yd,F d)p(yd) dyd
= N
(
z; Ψ
[
F dT diag(τ−2)µd + S
−1m
]
,
Ψ + ΨF dT diag(τ−2)Σd diag(τ−2)F dΨ
)
.
(5.37)
This scheme corresponds to applying the Sum-Product algorithm (Section 2.2)
to the factor graph shown in Figure 5.3. Note the neat separation of the two
different types of correlation present in the model: Equation (5.36) introduces the
“explaining away” correlations between features, caused by the sum factor (see
Section 2.3.6 and Figure 2.8); Equation (5.37) propagates the correlations between
the elements (topics) of yd, created by the softmax factor and encoded in the
matrix Σd. This separation is indicated by the colours in the covariance matrix in
Equation (5.37), which should be compared with the colour coding in Figures 5.1
and 5.4.
Factorized Regression
zfk y˜dk
φdf
ydk
τ
K
F
D
y˜dk =
∑
fzfkφdf
Figure 5.4: Factor graph representation of factorized Gaussian regression using the sum
factor.
In large-scale applications of topic models, the dimensionalities of the topic model
can be considerable. For a concrete example, consider once again the Twitter mes-
saging service. An obvious feature of interest in Twitter posts is the user-ID, which
currently means that F > 5 · 107 using a 1-in-n encoding. With so many features,
even small values of K lead to prohibitively high memory requirements if Equa-
tion (5.36) is used for regression. As mentioned in Section 2.3.6, implementations
of linear cost in D and K that can also deal with sparse features efficiently are
then the only possible solutions. This, of course, comes at the cost of accuracy, as
we will have to make further approximating assumptions.
Figure 5.4 shows a factor graph for a fully factorized approximation. We retain
only the diagonal elements σ2k of Σ and we will store only the diagonal elements
94 Fast, Online Inference for Conditional Topic Models
mfk, s2fk of the elements of m and S. With the likelihood
p(zfk | zf ′ 6=f,k, ydk,φd, τ) = N
[
zfk;
1
φdf
(
ydk −
∑
f ′ 6=f
φfzf ′k
)
,
τ 2
φ2df
]
, (5.38)
the sum-product message into zfk is easily obtained by marginalizing over the
beliefs on all other variables connected to the sum factor, giving
msg(zfk) = N
[
zfk;
1
φdf
(
µdk −
∑
f ′ 6=f
φdf ′mf ′k
)
,
1
φ2df
(
τ 2 + σ2dk +
∑
f ′ 6=f
φ2df ′s
2
f ′k
)]
(5.39)
Such factorized approximations have been used widely in the machine learning
literature to construct linear-cost algorithms. Since we had to discard several cor-
relation effects to arrive at this result, this scheme will make overconfident predic-
tions. Equations (C.7) and (C.10) give an intuition for the factor determining the
quality of this approximation: We can expect it to work well if K is large and α
not extremely sparse, and if φd is sparse. The first and last assumptions are usually
well satisfied by topic models on big corpora. The effect of a sparse α (which is
required to a certain degree for a useful topic model) is of order O(1/K) and thus
less problematic.
5.2.4 Inference from Data Streams
Since the topic-model estimates depend on the regression predictions and vice
versa, inference in the overall model should be done iteratively in principle: Iterate
over the corpus; infer the topics of each document, passing Gaussian messages
to the regression weights zfk. For each document d, make a prediction from the
current beliefs on the weights zfk and the features φd. This prediction provides
a Gaussian message to yd which, after application of the Laplace map, acts as a
Dirichlet prior on pid. Semi-collapsed variational inference on the topics leads to
new Dirichlet and Gaussians marginals on pid and yd, respectively. Dividing out
the Gaussian message to yd provides a message into the sum factor, with which the
beliefs on the weights are updated. In subsequent iterations, an updated prediction
can be gathered by dividing the marginals on zfk by the message sent by ydk in
the previous iteration.
For very large corpora which continue to grow during inference, such as our example
of Twitter, such a scheme is not feasible, because the rate of growth of the corpus
is comparable to the speed of inference. In such a setup, an online, or single-pass
scheme can be constructed in the following way: For each new document d arriving
in the corpus, predict the topic distributions based on φd just as above. Semi-
collapsed variational inference on d, to convergence, returns an update message to
5.2 Model 95
the weights as above. Now, however, the messages to the weights received earlier
during the inference are incorrect because they are the result from a less converged
topic model. Situations like this are typical for online algorithms, and can be
addressed by ad-hoc methods such as exponential forgetting : After each update to
the beliefs p(zfk) ∼ N (mfk, s2fk), decrease the precision of the belief to
s−2fk ← s
−2
fk0 + ε(s
−2
fk − s
−2
fk0) where ε . 1. (5.40)
The analogous operation on the side of the per-topic word distributions is to mul-
tiply the counts n·kv with a constant ξ . 1.0 in regular intervals, which corre-
sponds to raising the variance of the corresponding Dirichlet beliefs. Algorithm
5.2.4 contains high-level pseudocode describing the individual steps necessary for
topic inference from a single document in a stream.
Algorithm 2 Single-Pass Conditional Topic Inference.
Require: E[n·kv],Var[n·kv],E[n·k·],Var[n·k·] ∀k, v . topic stats
Require: N (mfk, sfk)∀f, k . regression stats
1: procedure Infer(φd,wd)
2: add noise to p(zf ) ∀f with φdf 6= 0 . Eq. (5.40)
3: m→(ydk)← N (φ
T
dmk, τ
2 +
∑
f φ
2
fks
2
fk) . incoming message
4: D(pid;αd)← Laplace(N (yd;µd,σ
2
d)) . Eq. (5.30)
5: νd ← αd; γdi ← αd+rnd() . initialise variational parameters
6: repeat . variational inference
7: for i ∈ 1, . . . , ID do
8: γdi ← Equation (5.26)
9: update E[n·kv],Var[n·kv],E[n·k·],Var[n·k·] . O(1) update
10: end for
11: νd ← Equation (5.20)
12: until converged
13: N (yd;µ
′,σ2′)← Laplace(D(pid;αd + νd)) . Eq. (5.30)
14: m←(ydk)← N (µ′,σ2
′)/N (µd,σ
2
d) . message to regression
15: msg(zfk)← Equation (5.39) ∀f, k
16: p(zfk)← p(zfk) ·msg(zfk) . assumed density filtering
17: end procedure
5.2.5 Queries to the Model
From the point of view of the user, there are several different queries that can be
posed to a conditional topic model, after it has converged on a sufficiently large
corpus:
. To infer a topic distribution for a given document from (φd,wd), use Algo-
rithm 5.2.4 up to line 11, return
q(pid |φd,wd) = D(pid;αd + νd)
96 Fast, Online Inference for Conditional Topic Models
. To predict the topic distribution given certain features φd, use Algorithm
5.2.4 up to line 4
q(pid |φd) = D(pid;αd)
. An approximate posterior belief over the topics themselves is given by
q¯(v | k) = D(v;βk + E[n·kv])
5.3 Experiments
There are three distinct approximations used in our inference algorithm:
1. a variational bound on Π and C (Section 5.2.1)
2. the “Laplace bridge” between the regression and mixture model parts of the
generative process (Section 5.2.2)
3. the factorization assumptions in the sparse-feature regression (Section 5.2.3)
The quality of the variational bound has been studied in Teh et al. [2007]. Factor-
ization assumptions in linear regression are well understood and used extensively
throughout machine learning and thus will not be re-evaluated here. What remains
is the use of the Laplace approximation.
5.3.1 Quality of the Laplace Bridge
Although the quality of the Laplace approximation to the Dirichlet in the softmax
basis in itself has previously been investigated in MacKay [1998], the use of this
approximation here differs considerably from the setting studied in the cited paper
(which dealt with evidence estimation in neural networks). The setup in which the
approximation is used here effectively amounts to the following:
. Some unobserved process with known parameters µ, ζ generates data ac-
cording to the following process:
– Sample x ∈ RK ∼
∏
kN (µk, ζ
2
k)
– Map pi = σ(x)
– Sample data points c according to p(c = k |pi) = pi
. The inference method tries to infer x in the following way:
– Use the Laplace map to gain a Dirichlet belief on pi from the prior
∏
kN (µk, ζk)
5.3 Experiments 97
– Update this belief using the data (which is trivial, due to the Dirichlet’s
conjugacy to the Multinomial distribution)
– map the Dirichlet belief back to Rk using the Laplace map in the oppo-
site direction; claim the resulting belief to be an approximate posterior
on x
Figure 5.5 compares this approximate scheme to an asymptotically exact Markov
Chain Monte Carlo scheme (the particular MCMC method chosen for this task
is elliptical slice sampling [Murray et al., 2010], which has the advantages of fast
convergence and having no free parameters). More specifically, the Figure shows —
averaged over 10 independent experiments — the 2-norm error of a point estimate
for x returned by the two methods (solid lines) and error estimates constructed
from the algorithms’ results. For the MCMC sampler, these two estimates are the
sample mean and (unbiased) sample covariance. For the Laplace approximations,
the two estimates are the mean and standard deviation of the approximate Gaus-
sian belief. Note that the Laplace bridge does not show any discernible bias or
over-convergence, except that the error estimate is slightly too big.
0 50 100 150 200
number of observed samples
0.0
0.5
1.0
1.5
2.0
2.5
|xˆ
−
x
|
Laplace
MCMC
σLaplace
σMCMC
Figure 5.5: Convergence behaviour of approximate inference using the Laplace bridge
compared to exact inference using a Markov Chain Monte Carlo scheme. The solid
lines and data points represent the deviation of the mean estimate (sample mean
for MCMC) from ground truth, the dashed lines represent the error estimate of the
inference algorithm (one standard deviation). The results shown are averages over 10
independent experiments. Both methods were initialised with a prior of µ = 0, ζ = 1.
98 Fast, Online Inference for Conditional Topic Models
5.3.2 Experiments on the Twitter Corpus
Twitter provides a challenging dataset: Because the individual documents are very
short (< 140 characters, but regularly as short as 3–5 words), the overall number
of co-occurences, which are central to topic models, is low even if the number of
documents is large. We generated a dataset of Twitter status updates by selecting
posts from 19 different “Twitter lists” — collections of posts authored by users
with a stated topical preference, which yields a certain “denseness” of authors.
Some posts by random authors from the unbiased Twitter stream were inserted as
well. A set of 6, 635 authors was selected by choosing those authors with more than
5 documents in the corpus. The parameters of the topic model were set to K = 40,
with a fixed value β = 10−4 for all words in all topics (these parameter settings
were found through a rough grid-search based on log evidence on the training
set). Experiments without conditioning on features were performed with a fixed
α = 10−2. When the model was conditioned on features, the individual weights
were given Gaussian priors such that the average document had an initial prior
(i.e. before learning the regression weights) producing a Dirichlet prior on pid with
approximately the same parameter value α = 10−2. The noise in the regression
term was set to τ = 10−4. The feature set was chosen with F = 6, 636, with one
always present bias feature φ0 = 1, and 6,635 binary features for every author in
the aforementioned set of prolific users. The model was run for 100 iterations in
the iterative setup.
Only minimal preprocessing was performed: The text of each document was con-
verted to lower case, split on spaces, punctuation was removed, and members of a
conservative list of 180 stop words were removed. Of the remaining set of words,
all terms with frequencies < 10 in the corpus were discarded.
5.3.3 Learning Sparse Topics
After inference, the learned regression weights for each author can be used to
generate predictions of the topics used by this author. If a broad Gaussian prior
is used for the regression, and τ is chosen relatively large (e.g. 1 . τ . 10),
then the learned features will be sparse. Figure 5.6 gives an impression of the
sparsity of the topic distributions learned by our model. Most authors can be
represented almost completely with only a few topics. Also note the kink in tails
of the empirical cumulative density functions shown in Figure 5.6, indicating the
presence of a “heavy tail”, assigning finite probability to all topics. Such a tail is
expected, because the fixed regression noise τ enforces a non-zero pseudo-count
for all dimensions of the Dirichlet prior on documents, even in the limit of infinite
data. Since Figure 5.6 is rather technical and might be difficult to interpret, Figure
5.7 shows concrete topic distributions for four specific authors (the same authors
5.3 Experiments 99
100 101 102
k
10-4
10-3
10-2
10-1
100
p
ro
p
or
ti
on
 o
f 
u
se
rs
 c
ov
er
in
g
 t
 w
it
h
 k
 t
op
ic
s
t=0.50
t=0.75
t=0.95
t=0.99
Figure 5.6: Sparsity of learned author topic predictions. Shown are, for different values
of 0 < t < 1, the proportion of authors in the dataset for which the top k topics
cover less than t of the probability mass of the topic predictions. For example (dashed
black lines), for ∼ (1− 0.05) = 95% of all authors, 5 topics cover at least 95% of the
prediction mass. Note the double logarithmic scale. Lines for visual aid only.
will be used again at the end of Section 5.3.4).
5.3.4 Comparing Conditioned and Unconditioned Models
There is no generally accepted way of evaluating the performance of a topic model.
Since inference in topic models is an unsupervised learning problem, there is no
ground truth to compare to, and Chang et al. [2009] showed that evaluations based
on predictive strength (i.e. perplexity; log probability of the dataset) can differ
from subjective evaluations by human subjects. We will thus strike a compromise
by using human evaluations in this section, and predictive strength in Section
5.3.5.
Figures 5.8 and 5.9 show the topics learned by conditional topic models with
and without access to the author’s ID, respectively. Following Blei and Lafferty
[2009], we do not represent topics by words ranked by raw frequency, but weighted
according to the following score inspired by the tf-idf heuristic [Salton and McGill,
1983]:
scorekv = βˆkv
[
log βˆkv −
1
K
K∑
`=1
log βˆ`v
]
(5.41)
where
βˆkv =
βkv + Eqn·kv
∑
v βkv + Eqn·k·
(5.42)
100 Fast, Online Inference for Conditional Topic Models
0 5 10 15 20 25 30 35 40
topic index
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
0.40
0.45
to
p
ic
 p
ro
b
ab
il
it
y
Rick_Bayless
MarkKnoller
Rob_Madden1
Variety
Figure 5.7: Predictive topic distributions conditioned on individual authors, for four
authors in the dataset. This Figure provides an intuition for the distributions charac-
terized more abstractly for the whole data set in Figure 5.6. See Figure 5.10 for more
details on these authors, Figure 5.8 for a characterisation of the learned topics.
For words with the same marginal probability, this weights words of uneven dis-
tributions over the topics higher than those with more even distributions, thereby
providing a more informative representation of each topic (note that βˆkv is the
model’s estimate of the probability of word v in topic k, the second factor in
Equation (5.41) is a measure of how similar this word’s probability in topic k is to
its probabilities in all other models).
Simply inspecting the learned topics, one might find it evident that the model
conditioned on author features was able to learn much more meaningful topics.
To quantify this initial intuition, we performed a simple experiment to collect
unbiased human evaluations of the two models against each other: A small group
of volunteers, which had not seen either of the topic models before, were shown, in
sequence, pairs of the topics shown in Figures 5.8 and 5.9, and asked to choose one
over the other. We will make the widely accepted, albeit arguable, assumption that
the participants are reasonably representative of the wider population for this task.
To avoid bias, the topics were shown in randomized positions on the screen; i.e.
the unconditioned and conditioned models’ topics were shown in first and second
place on the screen randomly, with equal probability. The order of topics was kept
as shown in the figures (since the topics are already generated randomly by the
inference algorithm itself, there is no need to further randomize their order). The
precise task description given to the participants was
5.3 Experiments 101
The screen will now repeatedly show two collections of words, called "top-
ics". These are supposed to represent groupings of semantically similar
words. You have to decide which of the two topics is a better overall match
of words together to a topic. There are 40 such topic pairs overall.
Please note: The pairs of topics DO NOT necessarily describe the same
topic. In fact, they have been randomly matched. You should not attempt
to find similarities between the topics. Just decide which of the word collec-
tions makes for a neater topic, in your opinion.
The participants performed the experiments alone without supervision, so any
psychological influence of the author on the participants’ decisions can be ruled
out.
The result of each experiments is a set of 40 binary variables D = {ωi}, indicating
whether the topic inferred by the model conditioned on author features “won” over
the topic inferred by the model without author features. In such a setup, a simple
Bayesian hypothesis test is possible, using a Beta prior
p(ξ|H) = B(ξ; a, b) ≡
Γ(a+ b)
Γ(a)Γ(b)
ξa−1(1− ξ)b−1 (5.43)
for the probability ξ of choosing the author-based model over the no-author model.
The null hypothesis is H0 = (ξ < 0.5), i.e. “The no-author model has a larger
chance of being chosen”. We choose a uniform prior by setting a = 1, b = 1
(this is a conservative choice. A (0,0) prior would lead to stronger p-values in the
following analysis). The Beta distribution is the exponential family conjugate to
the Bernoulli probability distribution. The sufficient statistics are the number nD
of positive answers (i.e. nD ≡ |{ωi ∈ D |ωi = 1}|) and the total number N = 40 of
comparisons. We can thus perform a Bayesian hypothesis test1 by evaluating the
posterior probability for ξ < 0.5 (the evidence for the null hypothesis), which is
given by the normalized incomplete Beta function2:
p(H0 |D,H) =
B0.5(a, b)
B(a, b)
(5.44)
Table 5.1 shows the resulting evidences and nd for all the participants in the ex-
periment. As the table shows, all participants’ responses lead to evidences well
below the 1% level, confidently rejecting the null hypothesis. Since there is only
one alternative hypothesis, we can thus conclude that each of the participants pre-
ferred the author-based model over its alternative. This hard categorization allows
an easy hypothesis test on the next higher conceptual level: Observing 4 partic-
1From a Frequentist viewpoint, the evidence mass calculated in this hypothesis test may be
intepreted as a one-tailed “p-value”.
2Bx(a, b) ≡
∫ x
0 t
a−1(1− t)b−1 dt, and B(a, b) ≡ B1(a, b).
102 Fast, Online Inference for Conditional Topic Models
participant nD (out of 40) evidence for H0
#1 31 2.2 · 10−3
#2 33 1.3 · 10−5
#3 36 5.1 · 10−8
#4 32 5.6 · 10−5
Table 5.1: Evidence (“p-values”) for the null hypothesis of the no-author model provid-
ing equally good or better topics than the model conditioned on author features, for
the individual (anonymized) participants of the experiment.
ipants preferring the author model and no participant preferring the alternative,
the posterior belief that the probability of the frequency of participants preferring
the no-author model is larger than 0.5 for the entire population, using again a uni-
form Beta prior, is 3%. Based on this evidence, one might be convinced to reject
this hypothesis as well: We expect, with high (97%) confidence, that the major-
ity of people will prefer the model conditioned on author features over the model
without author features.
This shows that metadata can provide valuable additional information for topic
modeling. Of course, the author feature is of particular importance in corpora with
short documents and several documents per author, such as Twitter. Being able
to describe authors, rather than documents, in terms of topic mixtures is a useful
feature in itself. Figures 5.7 and 5.10 show the predictive topic distributions for
four individual authors.
5.3.5 Single Pass versus Iterative Inference
As pointed out in the preceding sections, the standard implementation of varia-
tional inference in topic models involves repeated iterations over the bound on a
fixed data set. The single pass setup introduced in Section 5.2.4 has a more ad-
hoc character than this iterative optimization scheme. Nevertheless, there are two
distinct issues which might make single-pass inference more attractive than the
iterative scheme in certain situations:
Data Abundance: Iterative inference has higher computational cost, and also a
considerably higher memory cost, due to the need to store the messages sent
in the regression module and the pseudo-counts determining the variational
bound for each document. Where essentially arbitrary amounts of training
data are available, limited computation time might be better spent consid-
ering more data than re-considering approximate beliefs on old data in an
iterative optimization scheme.
Topic Drift: Infinite data streams may change their topic structure over time.
If the drift is slow, a simple heuristic dynamic model like the one described
5.3 Experiments 103
in Section 5.2.4 may then be able to model such a drift, in contrast to the
inherently time-free description of the standard implementation (there has
been work on dynamical topic models [Blei and Lafferty, 2006, Wang and
McCallum, 2006], but these solutions again require multiple passes over the
dataset and are thus not suitable for data streams). This issue is somewhat
linked to the first issue — if the corpus has inhomogenous structure, using
streaming inference may or may not lead to better performance, depending
on what exactly the task of the topic model should be (see more details
below).
We study both these issues in two experiments on a second data set, consisting of
articles from the English Wikipedia. While this dataset has finite size, it is very
large, and comes from the provider in form of a database3 with some accidental
semantic structure (of course, since the corpus is finite, it would be easy to random-
ize, but this structure is an interesting real-world example of the effects expected
in streaming datasets as well; it is thus retained on purpose). The parameters of
the model were chosen based on an elaborate grid search for maximal evidence,
performed by another researcher for standard latent Dirichlet allocation on this
dataset4 to be K = 100, β = 0.1, and regression priors implicitly defined by a de-
sired Dirichlet prior on the document topic distributions with parameter α = 0.25.
All models used as features of the document the identity of the last author having
edited a given article (note that multiple authorship would be easy to represent
for the linear regression, but the database used does not provide this information).
Robot authors (scripts, which are ubiquitous on Wikipedia) and authors with very
few overall edits were lumped together under an “anonymous author” feature.
Figure 5.11 shows an experimental setup focusing on the first issue of data abun-
dance. We compare an iterative algorithm running for 200 iterations (i.e. to con-
vergence) on a fixed training set of 1000 articles, to streaming inference algorithms
observing 200 times as much data, but inferring every document’s topics only
once, which leads to comparable computation cost. As described in Section 5.2.4,
the streaming algorithm has two parameters controlling the discount of older data
over new data. Because the author features are relatively sparse, it became clear
during preliminary experiments that a discount on the regression weights is not
necessary. Figure 5.11 thus shows a family of performances for varying values of
only the topic distribution discount parameter ξ. The chosen performance measure
is average predictive probability of unseen words on a test set of 1000 documents
(both algorithms were allowed to observe the first half of all these documents, then
predicted the second half of the words of these documents based on the inferred
topics). Some interesting aspects to note are:
3see http://en.wikipedia.org/wiki/Wikipedia:Database_download
4Carl Scheffler, personal communication.
104 Fast, Online Inference for Conditional Topic Models
. For a good setting of the discount parameter, observing more data leads to
better performance than re-visiting and optimizing on old data.
. The iterative algorithm actually goes through a phase of decreasing perfor-
mance while optimizing its hyperparameters.
. As indicated above, inhomogenuous structure of the corpus can lead to vary-
ing performance for the single-pass algorithms. A good example of this be-
haviour are the spikes visible between documents number 50, 000 and 80, 000
in Figure 5.11 (and again around document 140, 000). These fluctuations are
caused by a range of atypical documents at these locations in the database
(closer inspections reveals these to be clusters of documents about small
towns in the United States).
Figure 5.12 shows a second comparison between an iterative and a single-pass
algorithm, focusing more on the second issue of adaptation to fluctuations in the
dataset. The only single pass algorithm used here was chosen, based on the results
of the previous experiment, with ξ = 0.95. In this setup, both algorithms first get
to observe a small training set of 1000 documents (left part of Figure 5.12). The
iterative algorithm (top half of Figure 5.12) runs to convergence on this set. It
then cedes updating its internal model parameters while passing once through the
remaining 190, 000 documents, for each document inferring topics from the first
half of its words and then predicting the second half. The average log probability
assigned to ground truth by this prediction is used as a performance measure. In
contrast, the algorithm shown in the lower half of Figure 5.12 only iterates 10 times
over the training set, then continues to update its model based on the first half
of all the remaining documents observed during evaluation — this again leads to
comparable computational cost (however, due to the need to store messages, the
batch phase has a considerably higher memory cost. In fact, it cannot readily be
scaled to training on the entire dataset). Two points to note are
. Again, overall predictive performance (straight lines in Figure 5.12 on the
right, reproduced for easier comparison on the left) is (slightly) better for the
streaming inference algorithm, which spends less time on each document, but
instead incorporates more data.
. The semantic changes of the documents around index 50k to 80k, which
looked disadvantageous in the previous experiment, now actually run in
favour of the single-pass algorithm, which can adapt well to the new struc-
ture.
5.4 Conclusion 105
5.4 Conclusion
We constructed an efficient approximate inference algorithm for topic models con-
ditioned on arbitrary features of the documents, of computational cost linear in
the number and length of documents, the number of topics and the number of
active features of each document, and independent of the size of the vocabulary.
Reaching this low complexity required an array of different methods and approx-
imations. The computational cost’s (though not the memory cost’s) dependence
on vocabulary size was removed by analytical integration. A lower bound on the
mixture model was found by means of a variational approximation and a fast
Gaussian approximation. Regression on the feature weights was realized using the
sum-product algorithm on a factorized approximation of the exact beliefs. Finally,
a link between the regression in RK and the mixture modeling on the [0, 1]K sim-
plex was constructed from a Laplace approximation to the Dirichlet distribution
in the softmax basis.
This concoction of approximations might seem unsatisfying, and of course a more
exact solution would be more appealing. But approximations are the price of fast
inference. Other authors have previously demonstrated the abilities of topic models
using more exact, more costly approximate inference schemes. Our experiments
show that document features can provide crucial information about the content
of a document, especially if the document itself is short. If topic models are to
become useful for the semantic description of documents at large scale, such as
those found on the web, then fast approximate algorithms will be necessary. The
advantage of approximate Bayesian approaches like those presented here is that the
exact objective is clearly spelled out in the form of the generative model, providing
a guiding light for the construction of the algorithm, and a means to assess the
quality of the approximations used. The result of our analysis is a fast algorithm
retaining approximate probabilistic beliefs, rather than point estimates. Since the
algorithm is linear in the number of documents, features and topics, and can be
run in a single pass over the data, it constitutes the first practical topic model
inference algorithm for very large web-scale corpora.
106 Fast, Online Inference for Conditional Topic Models
1. via, http, life, another, end, bit, august, people, wednesday, ask, . . .
2. post, new, blog, page, now, money, start, around, time, better, . . .
3. really, life, want, know, good, nba, draft, mean, moment, cream, . . .
4. news, change, says, climate, russian, afghanistan, president, #climate, leaders, public, . . .
5. iphone, twitter, like, use, design, great, know, things, think, miss, . . .
6. please, art, latest, live, boy, #tcot, word, blog, museum, security, . . .
7. run, care, update, says, hit, local, year, second, gets, calls, . . .
8. world, cup, #worldcup, england, japan, match, goal, final, football, team, . . .
9. dog, via, world, press, toronto, week, dogs, daily, story, die, . . .
10. love, follow, lol, like, back, movie, thank, girl, justin, happy, . . .
11. post, via, science, study, research, work, yesterday, important, death, training, . . .
12. today, film, photos, birthday, day, check, hours, show, gallery, amazing, . . .
13. right, good, said, #news, may, lose, finished, keep, believe, now, . . .
14. june, market, recovery, washington, rise, low, set, trade, may, space, . . .
15. food, summer, dinner, sale, eat, beautiful, sweet, recipes, healthy, lunch, . . .
16. full, new, wine, best, show, photography, red, mind, winner, interesting, . . .
17. thx, article, brain, storm, coast, breaking, area, mexico, near, car, . . .
18. think, know, like, yeah, back, going, even, guys, good, might, . . .
19. call, office, give, youtube, long, see, john, jackson, brown, words, . . .
20. new, music, video, plus, week, dead, coming, challenge, july, mark, . . .
21. good, tonight, fun, done, going, trying, everyone, pretty, thanks, much, . . .
22. game, season, win, tonight, night, last, tomorrow, congrats, games, play, . . .
23. record, news, talks, west, house, two, says, former, camp, report, . . .
24. book, review, read, years, reading, books, budget, ice, year, david . . . ,
25. oil, obama, gulf, vote, spill, #oilspill, mcchrystal, power, energy, sen, . . .
26. bill, court, #fb, kagan, financial, senate, bank, jobs, america, street, . . .
27. time, cont, told, days, today, back, another, daily, now, right, . . .
28. love, god, always, like, never, someone, see, people, heart, good, . . .
29. com, video, twitter, www, blog, live, check, #ff, que, help, . . .
30. hey, great, good, day, ever, favorite, sure, black, way, thanks, . . .
31. home, new, garden, tips, water, gardening, #sports, blog, summer, city, . . .
32. following, three, stop, chance, james, photo, pick, paul, top, man, . . .
33. live, day, rugby, set, race, week, now, win, today, join, . . .
34. new, times, women, post, photo, cool, ipad, sex, iphone, fun, . . .
35. now, need, travel, great, website, project, thanks, store, phone, #jobs, . . .
36. social, media, facebook, york, cloud, job, based, service, san, million, . . .
37. deal, one, home, #etsy, time, enough, face, #wwes, league, actually, . . .
38. help, recipe, easy, buy, part, success, #food, new, pls, save, . . .
39. google, source, business, health, #economy, company, stock, marketing, sales, china, . . .
40. london, open, #olympics, star, meet, olympic, visit, days, pic, sports, . . .
Figure 5.8: Topics learned by the conditioned topic model from the cleaned twitter
dataset. See text for details. Some of the learned topics reflect news items of the
summer of 2010, such as the G20 summit in Toronto (topic 4), the football world
cup (topic 8), the Great Recession (topics 14 and 26), the BP oil spill in the Gulf of
Mexico (topic 25), the sacking of the US supreme commander in Afghanistan, Gen.
McChrystal (also topic 25), and Congress’s confirmation of Elena Kagan to the US
supreme court (topic 26).
5.4 Conclusion 107
1. japan, took, rich, football, boy, sign, california, cover, mumbai, level, . . .
2. official, canada, power, recipe, thought, usa, fall, winning, williams, place, . . .
3. chris, web, serious, money, etc, street, dogs, reality, thursday, hall, . . .
4. become, org, child, system, fair, human, continues, markets, weekly, download, . . .
5. behind, sun, spain, special, gardening, amp, #nascar, despite, stuff, saw, . . .
6. baseball, alex, #scotus, huge, spent, bloggers, english, early, fan, private, . . .
7. solar, india, security, inside, far, stories, rescue, info, meeting, whether, . . .
8. tony, petraeus, nothing, photography, jersey, hello, happened, improve, olympics, wedding,. . .
9. mcchrystal, federal, friend, tip, joe, ipo, quote, analysis, buy, access, . . .
10. enough, tune, finds, double, king, spot, strategy, sunday, mexico, try, . . .
11. four, program, los, project, marketing, celebrate, #ff, calls, caught, room, . . .
12. mail, playing, plans, fresh, hill, smart, collection, believe, bet, opening, . . .
13. manager, answer, followers, goes, ago, consumer, true, anti, guardian, miles, . . .
14. #economy, france, return, poor, blogs, martin, shooting, breakfast, jackson, hair, . . .
15. air, baby, nearly, bring, winner, government, model, running, putting, pres, . . .
16. research, months, drive, storm, session, fit, guide, owner, girls, forum, . . .
17. beat, dead, talks, gift, wanted, without, #cloud, lunch, finally, actually, . . .
18. reform, code, omg, agree, ocean, star, diabetes, problem, push, row, . . .
19. fund, less, okay, #food, light, hear, vacation, paraguay, came, simple, . . .
20. building, competition, george, #tedxoilspill, five, sox, #olympics, proud, thru, deficit, . . .
21. america, heat, dont, looks, hard, seen, films, ball, town, date, . . .
22. hours, sometimes, class, scientists, store, face, fantastic, european, crazy, finish, . . .
23. weeks, image, sold, chance, expected, round, reading, interview, center, lower, . . .
24. girl, details, hands, pretty, ppl, yay, aug, cheese, yahoo, wake, . . .
25. #climate, posted, police, clean, australia, sense, pre, africa, pizza, comes, . . .
26. using, training, site, haha, general, shows, bag, exclusive, forget, conference, . . .
27. create, economic, #oilspill, yeah, song, price, tom, speak, often, challenge, . . .
28. rate, hulu, following, writing, soccer, cuts, west, mobile, #cnn, capital, . . .
29. tweets, enjoy, guy, modern, changes, publishing, signed, dreams, sea, record, . . .
30. piece, couple, risk, breaking, cancer, search, including, play, link, justice, . . .
31. gallery, congress, fifa, hearings, don, spies, features, wrote, extra, drug, . . .
32. mark, though, york, wants, finished, won, head, dream, heart, budget, . . .
33. lots, view, son, harry, given, important, russia, track, #tls, teams, . . .
34. apps, added, heard, study, sweet, toy, instead, forget, across, comments, . . .
35. hate, quick, earth, pro, bbc, #jobs, french, award, coast, experience, . . .
36. question, #traveltuesday, area, value, wish, bed, moment, germany, dating, brand, . . .
37. men, welcome, act, understand, camp, cute, wsj, former, eat, single, . . .
38. album, based, cnn, advice, chinese, shot, yesterday, dies, super, ones, . . .
39. links, link, photography, article, maybe, podcast, pet, rock, country, brian, . . .
40. campaign, style, books, update, statement, wife, #oilspill, different, edition, tax, . . .
Figure 5.9: Topics learned by a conditional topic model without access to the author
features, from the same dataset as in Figure 5.8. Note the lower degree of topical
separation, indicating the usefulness of author information. This model only learns
weights for the bias features. Results from a fully unconditioned model with fixed α
(not shown) look even less structured.
108 Fast, Online Inference for Conditional Topic Models
food, summer, dinner, sale, eat, beautiful, sweet, recipes, healthy, . . .
run, care, update, says, hit, local, year, second, gets, calls, . . .
today, film, photos, birthday, day, check, hours, show, gallery, . . .
help, recipe, easy, buy, part, success, #food, new, pls, save, . . .
oil, obama, gulf, vote, spill, #oilspill, mcchrystal, power, energy, sen, . . .
news, change, says, climate, russian, afghanistan, president, #climate, leaders, public, . . .
bill, court, #fb, kagan, financial, senate, bank, jobs, america, street, . . .
following, three, stop, chance, james, photo, pick, paul, top, man, . . .
home, new, garden, tips, water, gardening, #sports, blog, summer, . . .
world, cup, #worldcup, england, japan, match, goal, final, football, . . .
run, care, update, says, hit, local, year, second, gets, calls,. . .
deal, one, home, #etsy, time, enough, face, #wwes, league, actually, . . .
news, change, says, climate, russian, afghanistan, president, #climate, . . .
today, film, photos, birthday, day, check, hours, show, gallery, amazing, . . .
time, cont, told, days, today, back, another, daily, now, right, . . .
following, three, stop, chance, james, photo, pick, paul, top, man, . . .
Figure 5.10: Four prolific authors in the dataset, with the top three topics for each
author as identified by the algorithm. The bars in the background are indications of
the relative weight of the corresponding topics (arbitrary scale, precise numbers can be
found in Figure 5.7).
5.4 Conclusion 109
0 50 100 150 200
number of iterations (iterative) or multiples of set size (single pass)
8.8
9.0
9.2
9.4
9.6
−
N
−
1
te
st
∑ i
lo
g
p(
w
te
st
,i
)
Iterative
ξ=0.3
ξ=0.5
ξ=0.7
ξ=0.75
ξ=0.8
ξ=0.9
ξ=0.95
ξ=0.99
Figure 5.11: Comparison of convergence, as a function of computation time spent,
between iterative inference and a series of online inference algorithms with varying
settings of the discount parameter ξ. The convergence measure is negative average
predictive log probability on a fixed test set (i.e. lower values are better). Note the
spikes around 50k to 80k documents for the streaming inference algorithms, where the
corpus has a quantitative change in structure.
Figure 5.12: Predictive power (negative average log probability assigned to unobserved
words, smaller numbers are better) for a batch (top) and an online (bottom) algorithm.
Left: performance on the first 1000 articles (on which the batch algorithm is trained ex-
clusively). Right: performance on remaining dataset. Straight lines: averages over these
two entire data regions (right averages repeated in left plots for easier comparison).
Data points in grey, overlaid with running averages over a 1000 document window.
110 Fast, Online Inference for Conditional Topic Models
Chapter 6
Conclusions
This thesis offered contributions on three conceptual levels.
Applications Chapters 3, 4 and 5 each presented independent solutions to ap-
plied problems from different research communities.
. Chapter 3 proposed a Bayesian tree search algorithm. While the metaphor of
games was used there, it is easy to imagine that variations of this algorithm
might be applicable to other tree search problems, for reasons similar to the
motivation for the game models (i.e. unstructured tree search problems are in
some sense easy, because large parts of the tree are irrelevant to the solution.
Hence, hard tree search problems may benefit from using a generative model).
Tree search is a widely used paradigm in computer science, so the results
presented here might be of broader use.
. Chapter 4 presented an extended “noise” model for psychometric question-
naires. Psychometrics is an influential discipline, with applications from the
management of human resources to criminal law, so any improvements to
the mathematical methods used in this field have high potential for societal
impact.
. Chapter 5 offered a computationally lightweight model for the topics of doc-
uments with metadata. The results from this chapter have already found
real-world, commercial use within the Microsoft Corporation, and are the
subject of a patent application submitted by Microsoft Research Ltd. on
behalf of the author of this thesis.
Algorithms Some of the algorithmic constructs used in this thesis, such as the
methods presented in the following Appendices, form self-contained solutions for
specific approximate inference challenges, and may be re-usable in other settings
than those presented here. For example, the “max-factor” detailed in Appendix A
112 Conclusions
is applicable to probabilistic shortest path problems, and the “Laplace bridge” of
Appendix C addresses an issue common in Bayesian logistic regression.
The Approximate Inference Paradigm Despite the ostensible diversity of the
applications addressed in chapters 3–5, the solutions presented in this thesis were
all derived following the same three conceptual steps:
1. Define an explicit generative probabilistic model for the data at
hand. Having done so, we immediately know, in principle, that the correct
answer to any inference question involving variables of this model: It is given
by Bayes’ theorem. Since Bayesian inference is known to be both optimal and
isomorphic to any other sensible framework of reasoning, this approach avoids
having to invent a new ad-hoc inference method for every new problem.
The probabilistic framework also forces modeling assumptions to be made
explicit, which facilitates analysis and comparison to other models.
2. Write down the graphical model corresponding to the algebraic
probabilistic model. The directed graph clarifies conditional independence
and makes the generative model easer to understand. Rewriting it as a factor
graph elucidates analytical as well as computational bottlenecks.
3. Find specific approximations for these bottlenecks, and evaluate them
through unit tests. The graphical model framework often allows such separa-
tion of larger problems into conceptual sub-problems. This also means that
good solutions, once found, can be re-used in other problems. Notice that
this phase benefits crucially from both previous steps. Ad-hoc approxima-
tions may not work well and tend not to generalise well. But approximations
addressing aspects of the algebraic structure of probability distributions, as
exposed by graphical models, can be meaningfully tested and re-used.
The fact that this general approach works well for applications as wide-ranging
as tree search, psychometrics and semantic language modelling indicates that ap-
proximate probabilistic inference, particularly using graphical models, is a powerful
tool for the analytic scientist. It provides a language of uncertainty that removes
problem-specific technicalities to reveal the mathematical structure of the under-
lying inference task. The solutions presented in this thesis are far from a complete
compendium of these methods, but they showcase several families of approximate
inference algorithms as part of a framework. Further extension of this framework
may well lead to a general paradigm for tractable inference. All human behaviour
is learning, predicting, and acting based on these predictions. Probability theory
is the mathematical abstraction of these three abilities. Approximate inference is
one approach making them tractable.
Appendix A
The Maximum of Correlated
Gaussian Variables
The derivations presented in this chapter were published as the technical re-
port [Hennig, 2009] Expectation Propagation on the Maximum of Correlated
Normal Variables, P. Hennig, arXiv [stat.ML] 0910.0115, October 2009
Abstract
This appendix derives the first two moments of the distribution of the maximum
of a pair of correlated Gaussian variables. In a second step, an extension to the
maximum of finite sets is developed through an iterative approximation.
A.1 Introduction
The tree search problem introduced in Chapter 3 is not the only application requir-
ing a probabilistic belief over the maximum of a set of variables. Other problems in
this class include shortest path problems [Burton and Toint, 1992], Reinforcement
Learning [Dearden et al., 1998], and scientific inference in Seismology [Neumann-
Denzau and Behrens, 1984], to name but a few. Often, there is a corresponding
inverse optimization problem [Ahuja and Orlin, 2001, Heuberger, 2004], where the
optimal solution is known with some uncertainty and the question is about the
quantities generating this optimum. Most contemporary algorithms for this case
aim to provide a point estimate (typically the least-squares solution), but have
trouble offering an error estimate on this estimate as well.
This chapter derives (Section A.2) mean and variance of the posterior of the max-
imum of two correlated Gaussian variables (for forward optimization problems),
and the mean and variance on the posterior of the Gaussian variables generat-
ing the maximum (for inverse optimization problems). It will be shown how these
114 The Maximum of Correlated Gaussian Variables
results can be used to build a heuristic approximation to the maximum of a fi-
nite set of normal variables (Section A.3). This provides the necessary results for
Expectation Propagation with a black box “max”-factor on factor graphs. Because
maximum and minimum are linked by max({xi}) = −min({−xi}), this also allows
inference on the minimum where necessary. Limitations of this approximation are
examined in Section A.4.
The moments of the normalized likelihood function of the maximum of two normal
variables have previously been derived by Clark [1961]. To my best knowledge,
this is the first publication deriving the full posterior, and the first to report the
posterior for the inverse problem (see also Section A.2.4).
A.2 The Maximum of Two Gaussian Variables
A.2.1 Notation
We consider two normally distributed variables x1 and x2, forming the vector x.
Let there be some prior information Ig giving rise to the belief
p(x1, x2 | Ig) = N (x;µg,Σg)
=
1
2piσg1σg2(1− %2)1/2
exp
(
−
1
2
(x− µg)
TΣ−1g (x− µg)
)
(A.1)
over their values. Here we have defined a mean vector µg = (µg1, µg2)
T and a
covariance matrix Σg. The latter has the form
Σg =
(
σ2g1 %σg1σg2
%σg1σg2 σ2g2
)
and thus
Σ−1g =
1
σ2g1σ
2
g2(1− %2)
(
σ2g2 −%σg1σg2
−%σg1σg2 σ2g1
) (A.2)
with the linear coefficient of correlation
% =
cov(x1, x2)
σg1σg2
(A.3)
(for notational convenience, the index g is dropped from % because there will be
little risk of confusion). We further introduce the variable m = max(x1, x2), and
assume that there is some outside prior information Im on the value of m as well:
p(m | Im) = N (m;µm, σ
2
m) (A.4)
The inference problems to be solved are
A.2 The Maximum of Two Gaussian Variables 115
. The belief over m given both Im and Ig (jointly called Ic). This is not truly
a “posterior” in the strict technical sense, but the normalised product of two
different factors (independent sets of information):
p(m | Ic) =
p(m | Im)
∫
p(x |m)p(x | Ig) dx
∫ [
p(m | Im)
∫
p(x |m)p(x | Ig) dx
]
dm
= Z−1p(m | Im)
∫
p(x |m)p(x | Ig) dx
(A.5)
with the normalization constant Z =
∫∫
p(x,m | Ic) dx dm. This problem
will be called the “forward” problem here.
. The posterior over x given Ic,
p(x | Ic) =
p(x | Ig)
∫
p(m |x)p(m | Im) dm
∫ [
p(x | Ig)
∫
p(x |m)p(m | Im) dm
]
dx
= Z−1p(x | Ig)
∫
p(m |x)p(m | Im) dm.
(A.6)
This problem will be called the “inverse” problem.
Throughout the derivations, the notation
N (x;µ, σ2) ≡
1
√
2piσ2
exp
[
−
1
2
(
x− µ
σ
)2
]
φ(x) ≡
1
√
2pi
exp
(
−
x2
2
)
Φ(x) ≡
∫ x
−∞
φ(t) dt =
1
2
[
1 + erf
(
x
√
2
)]
(A.7)
will be used to denote the general and standard normal probability density func-
tions (PDF) and the standard normal cumulative distribution function (CDF),
respectively.
A.2.2 Some Integrals
The following derivations will repeatedly feature certain integrals. The first two
incomplete moments of the standard Gaussian are
∫ y
−∞
tφ(t) dt = −φ(y)
∫ y
−∞
t2φ(t) dt = Φ(y)− yφ(y).
(A.8)
116 The Maximum of Correlated Gaussian Variables
This is obvious directly from differentiation. A simple substitution gives
∫ y
−∞
tN (t;α, β2) dt = αΦ
(
y − α
β
)
− βφ
(
y − α
β
)
(A.9)
∫ y
−∞
t2N (t;α, β2) dt =(α2 + β2)Φ
(
y − α
β
)
−(α + y)βφ
(
y − α
β
)
(A.10)
Further, we will use
Lemma
∫ ∞
−∞
Φ
(
x− a
b
)
N (x;α, β2) dx = Φ(z)
∫ ∞
−∞
xΦ
(
x− a
b
)
N (x;α, β2) dx = αΦ(z) +
β2
b
√
1 + β2/b2
φ(z)
∫ ∞
−∞
x2Φ
(
x− a
b
)
N (x;α, β2) dx = (α2 + β2)Φ(z) +
[
2α
β2
b
√
1 + β2/b2
− z
β4
b2 + β2
]
φ(z)
where z =
α− a
b
√
1 + β2/b2
(A.11)
Proof: To prove the Lemma, we follow Rasmussen and Williams [2006, §3.9] and
expand
∫ ∞
−∞
Φ
(
x− a
b
)
N (x;α, β2) dx
=
∫ ∞
−∞
∫ x
−∞
N (y; a, b2)N (x;α, β2) dy dx
=
∫ ∞
−∞
∫ x
−∞
N
[(
y
x
)
;
(
a
α
)
,
(
b2 0
0 β2
)]
dy dx
=
∫ ∞
−∞
∫ α−a
−∞
N
[(
w
z
)
;
(
0
0
)
,
(
b2 + β2 β2
β2 β2
)]
dw dz
(A.12)
where we have introduced the auxiliary variables w ≡ y − a − (x − α) and z ≡
x−α. Note that the integration limit is independent of z, so we can exchange the
integrations. But, because Gaussians have the convenient marginalization property
(see e.g. von Mises [1964], §9.3, Equations (A.11–A.13))
∫
N
[(
X
Y
)
;
(
X˜
Y˜
)
,
(
A B
C D
)]
dX = N (Y ; Y˜ , D) (A.13)
the now inner integral over z is trivial, and the Lemma follows directly. 
A.2 The Maximum of Two Gaussian Variables 117
A.2.3 Analytic Forms
Forward Problem
Neither of the posterior distributions is normal itself. The forward posterior is
p(m|Ic) = Z
−1p(m | Im)
∫∫ ∞
−∞
p(x |m)p(x | Ig) dx
= Z−1p(m | Im)
∫ ∞
−∞
[∫ x1
−∞
δ(x1 −m)p(x | Ig) dx2 +
∫ ∞
x1
δ(x2 −m)p(x | Ig) dx2
]
dx1
= Z−1 p(m | Im)
∫ ∞
−∞
δ(x1 −m)
∫ x1
−∞
p(x | Ig) dx2 dx1
︸ ︷︷ ︸
ν1
+
Z−1 p(m | Im)
∫ ∞
−∞
δ(x2 −m)
∫ x2
−∞
p(x | Ig) dx1 dx2
︸ ︷︷ ︸
ν2
.
(A.14)
For a motivation of the change in the integration ranges from the second to the
third line in Equation (A.14), consider the sketch in Figure A.1. Since the two
x1
x2
Figure A.1: Sketch of the integration range for ν2 (shaded). The open set (x1, x2) ∈
((−∞,∞), (x1,∞)) is identical to the open set (x1, x2) ∈ ((−∞, x2), (−∞,∞)).
summands are related to each other through the symmetry x1 ↔ x2, we consider
only the first term, ν1. To solve the integrals, note that the bi-variate Gaussian
p(x | Ig) can be re-written as
p(x1, x2 | Ig) = p(x1 | Ig)p(x2 |x1, Ig)
=
1
√
2piσ2g1
exp
[
−
1
2
(
x1 − µg1
σg1
)2
]
1
√
2piσ2g2(1− %2)
exp
[
−
1
2σ2g2(1− %2)
(
x2 −
(
µg2 + %
σg2
σg1
(x1 − µg1)
))2
]
.
(A.15)
118 The Maximum of Correlated Gaussian Variables
So we can simplify ν1 to
ν1 = p(m | Im)N (m;µg1, σ
2
g1)×
∫ m
−∞
1
√
2piσ2g2(1− %2)
exp
[
−
1
2(1− %2)
(
x2 − µg2
σg2
− %
m− µg1
σg1
)2
]
dx2
= p(m | Im)N (m;µg1, σ
2
g1)×
∫ m
−∞
1
√
2piσ2g2(1− %2)
exp

−
1
2
(
x2 − µg2 − %
σg2
σg1
(m− µg1)
σg2(1− %2)1/2
)2

 dx2
(A.16)
The substitution
t(x2) ≡
x2 − µg2 − %
σg2
σg1
(m− µg1)
σg2(1− %2)1/2
with Jacobian
dt
dx2
=
1
σ2(1− %2)1/2
(A.17)
allows us to solve the integral and find the posterior up to normalization
p(m | Ic) = Z
−1N (µm;µg1, σ
2
m + σ
2
g1)N (m;µc1, σ
2
c1)×
Φ
(
(σg1 − %σg2)m− σg1µg2 + %σg2µg1
σg1σg2(1− %2)1/2
)
+
Z−1N (µm;µg2, σ
2
m + σ
2
g2)N (m;µc2, σ
2
c2)×
Φ
(
(σg2 − %σg1)m− σg2µg1 + %σg1µg2
σg2σg1(1− %2)1/2
)
,
(A.18)
where we have used the abbreviations
σ2c1 ≡
σ2g1σ
2
m
σ2c1 + σ2m
and µc1 ≡
(
µg1
σ2g1
+
µm
σ2m
)
σ2c1 (A.19)
for the mean and variance of the product of two Gaussians. This is using the result,
derived in Equation (2.18), that
N (x;a1, b
2
1)N (x; a2, b
2
2) =
N (a1; a2, b
2
1 + b
2
2)N
[
x;
(
a1
b21
+
a2
b22
)(
1
b21
+
1
b22
)−1
,
(
1
b21
+
1
b22
)−1
]
,
(A.20)
The forms of µc2 and σc2 are entirely analogous. To find the normalization constant
Z, we use the first identity in Equation (A.11) to get
Z = N (µm;µg1, σ
2
m + σ
2
g1)Φ(k1) +N (µm;µg2, σ
2
m + σ
2
g2)Φ(k2) (A.21)
A.2 The Maximum of Two Gaussian Variables 119
with
k1 ≡
(σg1 − %σg2)µc1 − σg1µg2 + %σg2µg1
[
σ2g1σ
2
g2(1− %2) + (σg1 − %σg2)2σ
2
c1
]1/2 and
k2 ≡
(σg2 − %σg1)µc2 − σg2µg1 + %σg1µg2
[
σ2g1σ
2
g2(1− %2) + (σg2 − %σg1)2σ
2
c2
]1/2 .
(A.22)
Inverse Problem
The conditional probability of x on m is
p(x1, x2 |m) = θ(x1 − x2)δ(x1 −m) + θ(x2 − x1)δ(x2 −m) (A.23)
where θ(y) is Heaviside’s step function. Therefore, the conditional of x on Im (the
likelihood of [x1, x2]) is
f(x1, x2 | Im) =
∫ ∞
−∞
p(m |x1, x2)p(m|Im) dm
= θ(x1 − x2)N (x1;µm, σ
2
m) + θ(x2 − x1)N (x2;µm, σ
2
m)
(A.24)
which, as a likelihood, is not a proper (i.e. normalizable) distribution, but becomes
normalizable after multiplication with the prior:
p(x | Ic) = Z−1 Θ(x1 − x2)N (x1;µm, σ2m)N (x;µg,Σg)
︸ ︷︷ ︸
ξ1
+
Z−1 Θ(x2 − x1)N (x2;µm, σ
2
m)N (x;µg,Σg)
︸ ︷︷ ︸
ξ2
(A.25)
Figure A.2 illustrates the shape of these functions by way of some concrete exam-
ples.
A.2.4 Moment Matching
The analytical forms derived in the preceding sections are clearly not members of
the normal exponential family. If x has more than two elements, they also quickly
take on complicated forms that are expensive to evaluate. If the application in
question allows, it might thus be desirable to find Gaussian approximations to the
posteriors. The next sections derive the moments of these distributions for use with
the EP approximation.
Forward Problem
We will denote the mean and variance of the posterior of the max as µm(12) and
σ2m(12) for reasons that will become clear in Section A.3. The corresponding integrals
120 The Maximum of Correlated Gaussian Variables
2 0 2 4 6
m
0.0
0.1
0.2
0.3
0.4
0.5
0.6
p(
m
)
p(m| )
p(xi | )
p(m| )
MC
0.0 0.5 1.0 1.5 2.0 2.5 3.0
x2
0.0
0.5
1.0
1.5
2.0
2.5
3.0
x 1
Figure A.2: Illustrative plots for the analytical form of the forward and inverse posteriors.
Left: Inference onm. Prior distribution and marginals on xi. Posteriors for five different
values of %: -0.9 (most peaked), -0.5, 0.0 (thick line), 0.5 and 0.9 (broadest). As an
experimental verification, a histogram of 20,000 samples from the posterior (generated
by rejection sampling, with % = 0) is shown in blue. Right: Inference on the inverse
problem: Prior with µg = (1, 1)T, σg1 = σg2 = 1 and % = −0.5. Data on m with
µm = 1, σm = 1 gives the posterior in red. Note the bimodality arising in this particular
case.
m p(m | Im) = N (m;µm, σ2m)
I[m = max(x1, x2)]
x
cov(x1, x2) = % %
x1p(x1 | Ig) = N (m;µg1, σ2g1) x2 p(x2 | Ig) = N (m;µg2, σ
2
g2)
Figure A.3: Minimal factor graph using the max factor.
to solve are
〈m〉 ≡ µm(12) =
∫ ∞
−∞
mp(m | Ic) dm = Z
−1
∫
m(ν1 + ν2) dm
〈m2〉 − 〈m〉2 ≡ σ2m(12) =
∫ ∞
−∞
m2p(m | Ic) dm− µm(12)
(A.26)
Comparison with Equation (A.18) shows that these two integrals are solved by
Equation (A.11). The solutions are thus, after some algebra,
µm(12) = w1
[
µc1 + σc1
b1
a1
φ(k1)
Φ(k1)
]
+ w2
[
µc2 + σc2
b2
a2
φ(k2)
Φ(k2)
]
σ2m(12) = w1
{
[
µ2c1 + σ
2
c1
]
+
[
2µc1σc1
b1
a1
− k1σ
2
c1
b21
a21
]
φ(k1)
Φ(k1)
}
+
w2
{
[
µ2c2 + σ
2
c2
]
+
[
2µc2σc2
b2
a2
− k2σ
2
c2
b22
a22
]
φ(k2)
Φ(k2)
}
− µ2m(12)
(A.27)
A.2 The Maximum of Two Gaussian Variables 121
where
w1 = Z
−1N (µm;µg1, σ
2
m + σ
2
1)Φ(k1) w2 = Z
−1N (µm;µg2, σ
2
m + σ
2
2)Φ(k2)
(A.28)
a1 =
[
σ2g1σ
2
g2(1− %
2) + (σg1 − %σg2)
2σ2c1
]1/2
a2 =
[
σ2g1σ
2
g2(1− %
2) + (σg2 − %σg1)
2σ2c2
]1/2
(A.29)
b1 = σc1(σg1 − %σg2) b2 = σc2(σg2 − %σg1) (A.30)
Inverse Problem
The derivation for the inverse problem is just slightly more involved. We are in-
terested in the moments of the marginals p(x1 | Ic) and p(x2 | Ic), and will denote
these means and variances with µ1(m2), σ21(m2), et cetera. From Equation (A.25),
we get
µ1(m2) = 〈x1〉Ic =
∫ ∞
−∞
x1
∫ x1
−∞
N (x1;µm, σ
2
m)N (x;µg,Σg) dx2 dx1+
∫ ∞
−∞
∫ x2
−∞
x1N (x2;µm, σ
2
m)N (x;µg,Σg) dx1 dx2.
(A.31)
The first integral is in fact identical to the first term of µm(12). The second term,
however, involves the first incomplete moment:
∫ ∞
−∞
∫ x2
−∞
x1N (x2;µm, σ
2
m)N (x;µg,Σg) dx1 dx2
=
∫ ∞
−∞
N (x2;µm, σ
2
m)N (x2;µg2, σ
2
g2)×
∫ x2
−∞
x1N
[
x1;µg1 + %
σg1
σg2
(x2 − µg2), σ
2
g1(1− %
2)
]
dx1 dx2
(A.32)
The inner integral can be solved using the result given in Equation (A.10), leading
to an expression solved by Equation (A.11). After a bit of algebra, we arrive at
the final result
µ1(m2) = w1
[
µc1 + σc1
b1
a1
φ(k1)
Φ(k1)
]
+ w2
[(
µg1 + %
σg1
σg2
(µc2 − µg2)
)
+
A
a2
φ(k2)
Φ(k2)
]
σ21(m2) = w1
{
[
µ2c1 + σ
2
c1
]
+
[
2µc1σc1
b1
a1
− k1σ
2
c1
b21
a21
]
φ(k1)
Φ(k1)
}
+
w2
{
σ2g1
[(
µg1
σg1
+ %
(µc2 − µg2)
σg2
)2
+ (1− %2) + %2
σ2c2
σ2g2
]
+
[
B
h(1 + σ2c2/h2)1/2
−
C
h3(1 + σ2c2/h2)3/2
]
φ(k2)
Φ(k2)
}
− µ21
(A.33)
122 The Maximum of Correlated Gaussian Variables
where
A = %σ2c2σg1
(
1− %
σg1
σg2
)
− σ2g1σg2(1− %
2)
B = 2%2
σ2g1
σ2g2
σ2c2(µc2 − µg2) + %
σg1
σg2
(
2σ2c2µg1 + µg2
σ2g1σg2(1− %
2)
σg2 − %σg1
)
− µg1σ
2
g1(1− %
2)
σg2
σg2 − %σg1
C = %2
σ2g1
σ2g2
σ4c2(µc2 − f) + σ
2
g1(1− %
2)
(
1 + %
σg1
σg2
)
σg2
σg2 − %σg1
(µc2h
2 + fσ2c2)
(A.34)
with
f =
σg2µg1 − %σg1µg2
σg2 − %σg1
and h =
σg1σg2(1− %2)1/2
σg2 − %σg1
(A.35)
The corresponding result for the posterior marginal on x2 can be derived trivially
from these results by exchanging the indices 1 and 2. Note that, as mentioned
above, the first terms of these mixtures are shared with the posterior for m. Intu-
itively, this can be interpreted as follows: For the posterior onm, the first term (ν1)
corresponds to the statement that “if x1 > x2” (the probability of this is encoded
by the cumulative density term in Equation (A.18)) “then m is distributed like
x1” (represented by the product of the probability density functions in (A.18)).
This part of the relationship features in the inverse problem as well: If x1 > x2,
then x1 is distributed like m. The second term in the posterior marginal on x1
corresponds to the statement that “if x1 < x2, then x2 is distributed like m, and x1
is distributed such that its distribution fits with the updated marginal of x2 given
the correlation between x1 and x2 and the prior marginal on x1.
Related Work
The moments of the likelihood of the max have been derived before by Clark [1961].
That is, for σm → ∞, the posterior p(m | Ic) reported here simplifies to a result
reported by Clark:
µm(12) → Φ(k)
[
µg1 + σg1
(σg1 − %σg2)
a
φ(k)
Φ(k)
]
+ Φ(−k)
[
µg2 + σg2
(σg2 − %σg1)
a
φ(−k)
Φ(−k)
]
σ2m(12) → Φ(k)
{
[
µ2g1 + σ
2
g1
]
+
[
2µg1σg1
(σg1 − %σg2)
a
− kσ2g1
(σg1 − %σg2)2
a2
]
φ(k)
Φ(k)
}
+ Φ(−k)
{
[
µ2g2 + σ
2
g2
]
+
[
2µg2σg2
(σg2 − %σg1)
a
+ kσ2g2
(σg2 − %σg1)2
a2
]
φ(−k)
Φ(−k)
}
− µ2m(12)
where a =
√
σ2g1 + σ
2
g2 − 2%σg1σg2 and k =
µg1 − µg2
a
(A.36)
A.3 The Maximum of a Finite Set 123
As expected, the posterior of the inverse problem simply becomes equal to the
prior in this case. From Equation (A.33) we find
µ1(m2) → Φ(k)µ1 + σ1
σ1 − %σ2
a
φ(k) + Φ(−k)µ1 − σ1
σ1 − %σ2
a
φ(−k)
= Φ(k)µ1 + σ1
σ1 − %σ2
a
φ(k) + (1− Φ(k))µ1 − σ1
σ1 − %σ2
a
φ(k)
= µ1
(A.37)
and similarly for the variance.
The max-factor is also part of the Infer.net software package [Minka and Winn,
2008] (to my knowledge, the derivations for this code have not been published yet).
However, their implementation can only handle two independent Gaussian inputs
(Section A.3 introduces the max over a finite set of correlated variables). So their
implementation corresponds to the case of % = 0, which leads to the following
simplifications, presented here for reference:
k1 =
µc1 − µg2
(σg1 + σc2)1/2
a1 = σg1(σg1 + σc2)
1/2 b1 = σc1σg1 (A.38)
A = σ2g1σg2 B = −µg1σ
2
g1 C = σ
2
g1(µc2σ
2
g1 + µg1σ
2
c2) (A.39)
f = µg1 h = σg1 (A.40)
Figure A.4 shows some of these approximations. The parameter settings used in
this figure represent a worst case (e.g., the posterior over x is rarely so strongly
bimodal.)
2 0 2 4 6
m
0.0
0.1
0.2
0.3
0.4
0.5
0.6
p(
m
)
p(m| )
p(xi | )
p(m| )
q(m| )
0.0 0.5 1.0 1.5 2.0 2.5 3.0
x2
0.0
0.5
1.0
1.5
2.0
2.5
3.0
x 1
Figure A.4: Illustrative plots for the Gaussian approximations to the posteriors. Same
beliefs in Ic as in Figure A.2. Left: For the sake of readability, only the cases % = −0.9
(broadest), % = 0 and % = 0.9 are plotted here. In red dashed lines, the correspond-
ing three Gaussian approximations. Note the varying quality of fit. Right: Gaussian
approximation (with µ1(m2) = 1.06 and σ
2
1(m2) = 0.94) indicated by shaded area.
124 The Maximum of Correlated Gaussian Variables
m p(m | Im) = N (m;µm, σ2m)
I[m = max(x1, x2)]
x
cov(xi, xj) = Σij Σ
xk
p(xi | Ig) = N (m;µgi, σ2gi)
k = 1, . . . , N
Figure A.5: Factor graph representation of the inference problem on a finite set. Note
the plate representin N copies of generating variable nodes.
A.3 The Maximum of a Finite Set
A.3.1 Analytic Form
Extending the analysis of Section A.2.3, we can write the posterior over the max m
of a finite set {xi}i=1,...,N of variables, distributed according to an N -dimensional
version of Equation (A.1), with a new normalization constant ZN , as
p(m | Ic) = ZNp(m | Im)
∫
p(x |m)p(x | Ig) dx
= ZNN (m;µm, σm)×
[
N∑
i=1
∫ ∞
−∞
δ(m− xi)p(xi | Ig)
∫
· · ·
∫ xi
−∞
p({xj}j 6=i |xi, Ig)
∏
j 6=i
dxj dxi
]
= ZN
∑
i
[
N (µm;µgi, σ
2
m + σ
2
gi)N (m;µci, σ
2
ci)×
∫
· · ·
∫ xi
−∞
N (x\i;µg\i(xi),Σg\i) dx\i
]
(A.41)
where x\i = (x1, . . . , xi−1, xi+1, . . . , xN). The conditional mean is [see e.g. Bishop,
2006, Section 2.3.2]
(
µ\i(xi)
)
j
= µgj + ΣgjiΣ
−1
gii (xi − µgi) = µgj + %ij
σgj
σgi
(xi − µgi) (A.42)
with the linear coefficient of correlation %ij = Σgij/(σgiσgj). The conditional co-
variance matrix is the Schur complement of Σgii = σ2gi in Σg:
Σg\i,kj = Σgkj − Σgkiσ
−2
gi Σgij (A.43)
A.3 The Maximum of a Finite Set 125
In principle, it would be possible to follow the path laid out in the previous sec-
tions to calculate the first two moments of this distribution. However, while the
univariate Gaussian CDF (essentially an evaluation of the error function) has com-
putational cost comparable to evaluating an exponential function, computationally
efficient ways of calculating a multivariate Gaussian CDF are not generally avail-
able. See, however, Appendix B for a numerical scheme which does in fact offer a
good approximation to the required integrals. Unfortunately, even though it is a
highly optimized numerical scheme, its computational cost is still considerable.
A.3.2 A Heuristic Approximation
Another, cheaper option is to use an iterative procedure initially proposed by Clark
[1961]. The idea is to start out with the approximation for only two of the gener-
ating variables. W.l.o.g., let these be x1 and x2, resulting in m(12) = max(x1, x2).
Next, estimate m(123) = max(x3,m(12)) and so on up to m(1...N). For the intermedi-
ate maxima, the likelihoods presented in Equation (A.36) suffice, and the prior is
included in the last step (using Equation (A.27)) to gain an approximate posterior
over the maximum of the whole set. Of course, this necessitates an analytic expres-
sion for the correlation coefficient %i(1...i−1) between the i-th variable and the max
over the preceding variables. This was derived by Clark. Adopted to the notation
used here and made more explicit, his result is
%3(12) = σ
−1
(12)
(
σ1%31Φ(k(12)) + σ2%32Φ(−k(12))
)
, (A.44)
where %ij = Σij/σiσj, the index g has been dropped for simplicity and k(12) =
(µ1−µ2)/
√
σ21 + σ
2
2 − 2%12σ1σ2 is the simplified version of k1 arising from Equation
(A.22) under σm →∞. Using this, we can build a recursive algorithm to calculate
%i(1...j) with j < i as
%i(1...j) = σ
−1
(1...j) ·



σj%ij if j = 1
Φ(−k(1...j))σj%ij + Φ(k(1...j))%i(1...j−1) otherwise
(A.45)
this necessitates a list k[j] = (k(12), . . . k(1...i−1)) which is available at the necessary
point in time from the calculation of previous maxima over the preceding parts
of the set. Note that calculating %i(1...i−1) involves i − 1 recursive function calls,
so building the full approximation over the max of N variables is of complexity
O(N2), as might be expected (although there are only (N−1) uses of the results in
Equation (A.27)). If all correlation coefficients are the same, %ij = % ∀ij, then the
recursive evaluations can be re-used in consecutive evaluations and the complexity
drops to O(N).
126 The Maximum of Correlated Gaussian Variables
2 1 0 1 2 3
y∈
{
xi ;m
}
0.0
0.2
0.4
0.6
0.8
1.0
p(
y)
2 1 0 1 2 3
y∈
{
xi ;m
}
0.0
0.2
0.4
0.6
0.8
1.0
p(
y)
2 1 0 1 2 3
y∈
{
xi ;m
}
0.0
0.2
0.4
0.6
0.8
1.0
p(
y)
5 0 5 10 15 20 25 30
y∈
{
xi ;m
}
0.00
0.05
0.10
0.15
0.20
0.25
p(
y)
Figure A.6: Quality and failure modes of the EP approximation. Max of five uncorrelated
Gaussians. Top row: examples of good fits. Left: well separated beliefs. Right: similar
beliefs. Bottom row: worst case examples. Left: high certainty contributions within
the center. Right: high uncertainty in one tail. In all plots, beliefs over the xi as slim
black lines. True posterior over m in thick red, approximation in thick dashed blue. For
simplicity, p(m | Im) was set to an uninformative value. See text for details.
Inverse Problem
The same iterative scheme can be used to provide an approximation for the in-
verse problem’s posterior. First, the list k[j] is build as in the preceding section.
Then, approximations to the posterior marginals are build iteratively, starting with
q(xN | Ic), ending with q(x2 | Ic) and q(x1 | Ic). At each intermediate step, we use the
EP approximation: To get q(xi | Ic), use q(m(1...i) | Im) = q(m(1...i) | Ic)/q(m(1...i) | Ig)
as an approximation to the prior over the subset max, and q(m(1...i−1) | Ig) as the
approximation on the max over the subset up to xi−1.
A.4 Discussion of the Approximation’s Quality
Figure A.6 gives some intuition on the quality of the approximation. For the pur-
pose of this comparison, uncorrelated Gaussians were used because this allows the
analytic evaluation of the true posterior (the CDF factorizes into individual one-
dimensional CDFs). The fit is reasonably good if the beliefs over the xi are either
very similar (Figure A.6 top right), or if the beliefs are “separated”, in the sense that
one of the xi provides a dominant contribution to the overall mixture (top left).
The fit becomes bad when the mixture has many modes (bottom left) or a strong
A.5 Conclusion 127
asymmetry (bottom right). The corresponding worst case distributions shown here
were generated by setting µgi = a + b−i and σ2gi = b
−i (left, a = −1, b = 16) or
µgi = ci and σ2gi = i
d + 1 (right, c = −1, d = 16). More quantitatively, consider
Equation (A.36) or Equation (A.18), the case of the max of only two Gaussians.
The two cases of good fit described above correspond to
1. one mixture component dominating the mixture, i.e.
|k12| =
|µg1 − µg2|
√
σ2g1 + σ
2
g2 − 2%σg1.σg2
 0 (A.46)
The likelihood then has one clearly dominating Gaussian component and the
fit is good. In this case, the inverse problem is also a good fit, as each of the
generating variables x1, x2 has one dominating component in its posterior.
2. the two mixture components being almost identical:
µg1 ≈ µg2 and σg1 ≈ σg2. (A.47)
The likelihood then consists of two roughly identical Gaussian components
with roughly the same weights, and is therefore roughly Gaussian. However,
the approximation is bad for the inverse problem here, as the true posterior
marginals become bimodal (c.f. Figure A.2, right). This effect is particularly
pronounced if the mean of the prior and the likelihood differ significantly.
These observations suggest a potential increase in the quality of the approximation
to be gained from calculating all N(N − 1) weight-generators kij as defined in
Equation (A.46) and iteratively choosing the pair ij with maximal kij. However,
this re-ordering has to be updated after each incremental two-component max
operation, involving a re-calculation of up to N correlation coefficients. It thus
raises the complexity of calculating the approximation for the overall max from
O(N2) to O(N3). Initial experiments suggest that the potential gain in fit is almost
always negligible.
A.5 Conclusion
This chapter derived the first two moments of the posterior over the maximum of a
pair of Gaussian variables, and over the posterior over the two generating variables.
These moments can be used for approximate inference on their own, or as part of
a larger graphical model using Expectation Propagation. I have also shown how
to extend the usefulness of these approximations to finite sets of Gaussian vari-
ables using a heuristic iterative approximation. The quality of the approximation
128 The Maximum of Correlated Gaussian Variables
2 1 0 1 2 3
y∈
{
xi ;m
}
0.0
0.2
0.4
0.6
0.8
1.0
p(
y)
2 1 0 1 2 3
y∈
{
xi ;m
}
0.0
0.2
0.4
0.6
0.8
1.0
p(
y)
2 1 0 1 2 3
y∈
{
xi ;m
}
0.0
0.2
0.4
0.6
0.8
1.0
p(
y)
2 1 0 1 2 3
y∈
{
xi ;m
}
0.0
0.2
0.4
0.6
0.8
1.0
p(
y)
Figure A.7: Illustrative examples for uses of approximate max factor. p(xi | Ig) in black
dashed lines. p(m | Im) in red dotted. Marginals after EP message passing as cor-
responding solid lines. Top left: Max over 5 uncorrelated variables. Only the two
variables contributing significantly to the max change their beliefs. Top right: same
as previous, but with %ij = 0.9 for all ij. The change in belief over the dominating
xi now also affects the other beliefs, as expected. Bottom left: The approximation
is well-behaved under inconsistent beliefs. p(m | Im) was set inconsistently low relative
to the beliefs on the xi (all %ij = 0.2). Note that the belief over the largest xi extends
beyond the belief over m as a result of the moment-matching. Bottom right: The
approximation is stable for large values of N . Maximum over 50 correlated normals,
all %ij were set to 0.5.
depends on the location and precision of the belief over the generating variables
relative to each other, but is always good enough to provide a meaningful point
estimate and error measure. It is sufficiently robust to deal with inconsistent belief
assignments and large numbers of generating variables (see Figure A.7).
Appendix B
Efficient Rank 1 EP Updates
Chapter 4 mentions a particularly efficient way of updating multivariate Gaussian
beliefs from factors with rank 1 derivatives. This update rule was derived by Ralf
Herbrich and communicated in a technical report by Minka [2008]. The report is
very concise and has not been published in a permanently accessible outlet. The
entire derivation is thus reproduced here for completeness, with some notational
adaptations and expansions to improve accessibility. We start by deriving the gen-
eral update rules for a Gaussian approximation under the Expectation Propagation
scheme: We want to approximate a factor fi(x) by a Gaussian term of the form
f˜i(x) = si exp
(
−
1
2
(x−mi)TV
−1
i (x−mi)
)
(B.1)
The approximate unnormalized Gaussian marginal on x is
q(x) = sN (x;m,V ) (B.2)
where the normalization constant s can be used for model comparison or as a
predictive probability, as in Chapter 4. We will require the cavity distribution for
the derivation of the updates, which is defined as
q\i(x) = N (x;m\i,V \i) ∝
q(x)
f˜i(x)
with V \i = (V −1 − V −1i )
−1
and m\i = V \i(V −1m− V \ii mi)
(B.3)
An important quantity is the normalization constant of the EP update,
Zi(m\i,V
\i) =
∫
fi(x)q\i(x) dx (B.4)
130 Efficient Rank 1 EP Updates
Its derivative with respect to m\i can be rewritten to reveal an expression for the
mean of the EP message, in the following way.
∇m logZi =
1
Zi
∫
fi(x)q\i(x)V
−1
i (x−m
\i) dx
⇒ m = m\i + V i∇m logZi
(B.5)
and analogously, after a few more lines of algebra
V = V \i − V \i
[(
∇m∇
T
m − 2∇v
)
logZi
]
V \i (B.6)
Using Equation (B.6), Equation (B.3), and the matrix inversion lemma (Equation
C.17), we get an expression for the covariance matrix of the EP message
V i =
[(
∇m∇
T
m − 2∇v
)
logZi
]−1
− V \i (B.7)
using further Equations (B.5) and (B.3) reveals an expression for the message
mean:
mi = V i(V
−1m− (V \i)−1m\i)
= m\i + (V i + V
\i)(V \i)−1(m−m\i)
= m\i + (V i + V
\i)∇m logZi
= m\i +
[(
∇m∇
T
m − 2∇v
)
logZi
]−1
∇m logZi
(B.8)
Even the contribution to the normalization constant can be expressed with these
two derivatives. Using the result (A.20) again, we get
si = Zi
|V i + V
\i|1/2
|V i|1/2
exp
(
1
2
(mi −m\i)T(V i + V
\i)−1(mi −m\i)
)
= Zi
∣
∣
∣I + V \iV −1i
∣
∣
∣
1/2
exp
(
1
2
∇Tm(∇m∇
T
m − 2∇v)
−1∇m logZi
) (B.9)
where the differential operators in the last line should be understood to all act on
logZi, the necessary brackets have been left out for readability. This special form
for the updates can be leveraged if the derivatives have low rank. In particular, if
the derivatives have rank 1:
∇m logZi = αiξi
∇m∇
T
m − 2∇v logZi = βiξiξ
T
i
ξTi (∇m∇
T
m − 2∇v logZi)
−1ξi = β
−1
i
(B.10)
where αi and βi are some scalars and ξi is a vector, then the messages can be
represented by one-dimensional Gaussians with mean mi and variance vi, defined
131
through
V −1i = v
−1
i ξiξ
T
i
ξTi V iξi = vi
mi = ξ
T
imi
(B.11)
Updating such a one-dimensional message also only involves one-dimensional pro-
jections, namely ξTi V
\iξi and ξ
T
im
\i, which can be found through
V \i = (V −1 − v−1i ξiξ
T
i )
−1
= V + (V ξi)(vi − ξ
T
i V ξi)
−1(ξTi V )
ξTi V
\iξi = ξ
T
i V ξi(1− v
−1
i ξ
T
i V ξi)
−1 and
m\i = m+ V \iV −1i (m−mi)
= m+ V \iξiv
−1
i (ξ
T
im−mi)
= m+ (V ξi)(1− v
−1
i ξ
T
i V ξi)
−1v−1i (ξ
T
im−mi)
ξTim
\i = ξTim+ (ξ
T
i V ξi)(1− v
−1
i ξ
T
i V ξi)
−1v−1i (ξ
T
im−mi)
(B.12)
Note that these terms can be evaluated in order O(D2) if V ∈ RD, instead of the
general O(D3) of Equation (B.7). Often, the vector ξi will be sparse, reducing cost
further. Equations (B.7), (B.8) and (B.9) now become one-dimensional updates:
mi = ξ
T
im
\i +
αi
βi
vi = β
−1
i − ξ
T
i V
\iξi
si = Zi(βivi)
−1/2 exp
(
α2i
2βi
)
(B.13)
The natural parameters of the Gaussian are v−1i and v
−1
i mi (see Equation 2.32).
In those parameters, the updates to the messages are,
using r ≡
βi
(ξTi V
\iξi)−1 − βi
,
v−1i = r(ξ
T
i V
\iξi)
−1
v−1i mi = r
(
αi +
ξTim
\i
ξTi V
\iξi
)
+ αi
(B.14)
132 Efficient Rank 1 EP Updates
The updates to the marginal can be derived from Equation (B.5) and Equation
(B.6), and are also of cost O(D2), or less if ξ is sparse:
using ∆(v−1i ) ≡ (v
new
i )
−1 − (voldi )
−1
and ∆(v−1i mi) ≡ (v
new
i )
−1mnewi − (v
old
i )
−1moldi ,
we get V new = V −
∆(v−1i )
1 + ξiV ξi∆(v
−1
i )
V ξiξ
T
i V
and mnew = m+
∆(v−1i mi)− ξ
T
im∆(v
−1
i )
1 + ξTi V ξi∆(v
−1
i )
V ξi.
(B.15)
The normalization constant (evidence term) of the marginal can be evaluated after
the algorithm has converged. Its logarithm is
log s =
1
2
[
mTV −1m−mT0V
−1
0 m0 + log |V | − log |V 0| −
∑
i
m2i
vi
]
+
∑
i
log si,
(B.16)
where the quantities with subscript 0 denote the values of the “prior”, i.e. the initial
normalized marginal before the EP messages are incorporated. Because precision
matrix and precision-adjusted mean are the Gaussian’s natural parameters (Equa-
tion (2.32)), we can utilize Equation (2.16) write the marginal’s parameters at any
point during the iterative updates compactly in terms of the contributions from
prior and messages as
V −1 = V −10 +
∑
i
v−1i ξiξ
T
i
m = V
(
V −10 m0 +
∑
i
v−1i miξi
) (B.17)
B.1 The Step Factor
The derivations in Chapter 4 require the EP updates for the step function factor
fi(x) = θ[ξ
T
i (x − ωi)] with arbitrary vectors ξi and ωi (this provides an approx-
imation to the multivariate Gaussian integral over polyhedral regions). For this
particular factor, the normalization constant is
Zi(m\i,V
\i) =
∫
θ[ξTi (x− ωi)]N (x;m
\i,V \i) dx
=
∫ z
−∞
N (z′; 0, 1) dz′ = Φ(z)
with z ≡
ξTi (m
\i − ωi)
√
ξTi V
\iξi
(B.18)
B.1 The Step Factor 133
and the scalars required in Equation (B.10) are
and αi =
1
√
ξTi V
\iξi
N (z; 0, 1)
Φ(z)
and βi = αi
(
αi +
ξTi (m
\i − ωi)
ξiV
\iξi
)
.
(B.19)
Deriving these specific results is tedious, so only the most important steps will be
reproduced here. Unnecessary subscripts will be left out from here on to reduce
clutter. Because N (x;m,V ) = N (x−ω;m−ω,V ), we will limit the derivations
to the case ω = 0, the general case can then be found simply by replacing m →
m− ω.
To get from the first line of Equation (B.18) to the second, notice that, for any
vector ξ, there exists an orthonormal basis {ζ,w2, . . . ,wD} of RD containing a
normalized form ζ = ξ/
√
ξTξ. In other words, there exists an orthonormal matrix
A =
(
ζ w2 . . . wD
)
with ATA = ID (B.20)
which we can use to perform a change of basis to new co-ordinates y ≡ ATx.
Because A is orthonormal, its determinant is 1 and the measure is conserved. We
can re-write the quadratic form of the Gaussian distribution as
(x−m)TV −1(x−m) = (x−m)TAATV −1AAT(x−m)
= (x−m)TA
(
ATV A
)−1
AT(x−m)
(B.21)
so the Gaussian can be rewritten in the new co-ordinate system as
N (x;m,V ) = N (ATx;ATm,ATV A) (B.22)
and we can now integrate out all the dimensions orthogonal to ζ, where the inte-
gration region spans the entire dimension, using Equation (A.13), leaving only a
one-dimensional incomplete Gaussian integral
Z =
∫
θ(ξTx)N (ζTx; ζTm, ζTV ζ) dζTx (B.23)
The normalization terms
√
ξTξ in the Gaussian all cancel, leading to the result
of Equation (B.18). The forms of α and β can be found by differentiating Z with
respect to m\i and the elements of V \i, analogous to Equations (B.10).
134 Efficient Rank 1 EP Updates
Appendix C
A Laplace Map Linking Gaussians
and Dirichlets
The Dirichlet distribution D is a measure over discrete probability distributions;
the Gaussian N a measure over real values. The two are designed for quite different
purposes. Nevertheless, sometimes the need to match them to each other arises.
For example, consider a situation where discrete samples {cn} ∈ N are available
in several different “locations” xn, and we would like to perform regression on the
discrete probability p(cn = k |xn) distributions in those and other locations. This
situation is a case of generalized regression, and there are many different ways of
dealing with it, using link functions mapping discrete probabilities to real values
(the inverse of the link function is also known as the activation function in machine
learning, due to historical links to neural systems). Two particularly well-known
methods involve
. the evaluation of the probit link function, which provides a direct link to the
Gaussian domain but is only easy to evaluate in the binary case (see, however,
Section B.1 for a computationally involved yet very expressive approach to
the multinomial case)
. using the softmax activation function σ with elements
σk(y) =
exp(yk)
K∑
k′
exp(yk′)
(C.1)
to map a probability measure (typically a Gaussian) on RK to normalized
probabilities pi ∈ [0, 1]K . A disadvantage of this approach is that the likeli-
hood p(cn = k|yn) is not Gaussian, and the posterior is thus not of Gaussian
form either, and has to be approximated somehow to keep computations
tractable.
136 A Laplace Map Linking Gaussians and Dirichlets
y pi cn N
I(pi = σ(y)) p(cn |pi) =M(cn;pi)
N D
Figure C.1: Factor graph representation of the softmax factor. The parametric approx-
imate beliefs used in the Laplace map on the variables y and pi are indicated as labels
underneath the graph.
A problem with both these approaches is that they presume the data cn be avail-
able as actual observations. During inference in graphical models, we often have
access to only a probabilistic belief over the values of ci. This chapter introduces
a computationally lightweight approximation for this case. The main ideas are as
follows.
. Retain a Dirichlet beliefD(pi;α) over the probabilities pi. Because the Dirich-
let is conjugate to the multinomial, this allows exact, fast evaluations of the
posterior given (probabilistic) data.
. To perform generalized regression, approximate the Dirichlet with a Gaus-
sian, using a Laplace approximation. The crucial idea here is that the right
basis to do this approximation in is the softmax basis, in which the Dirichlet
has full support on RK and is much closer to a Gaussian in shape than in
the standard basis with support on [0, 1]K .
The idea to use a basis transformation to improve Laplace approximations was
first raised by MacKay [1998]. That paper also pointed out the softmax basis as
convenient for Gaussian approximations on the Dirichlet, but it does not contain
explicit forms for the resulting Gaussian. Since the corresponding derivation is not
entirely trivial, it is developed here in some depth.
The remaining parts of this chapter proceed by first introducing the softmax basis
form of the Dirichlet (C.1). This basis has a technical issue involving identifiability,
which is addressed in Section C.1.1. The actual Laplace approximation is performed
in Section C.2, in which we arrive at a one-to-one map between Dirichlets and a
strict subset of the parameter space of multivariate Gaussians.
As with all approximations, the results presented in this chapter have their short-
comings, and should not be used without some preliminary thought about the
application in question. The point of this particular approximation is that it is
computationally lightweight, and that it can be used as a black box providing a
numerically stable map between the parameters of a multivariate Gaussian and
those of a Dirichlet. None of this changes the fact, though, that Gaussians and
Dirichlets are not the same thing. There are two big shortcomings of the Laplace
map: The Gaussian has weaker tails than the Dirichlet in the softmax basis (i.e.
C.1 The Dirichlet in the Softmax Basis 137
the Gaussian resulting from mapping a Dirichlet underestimates the probability of
extreme sparse values of pi). In the other direction of the map, the Dirichlet has
less parameters than the Gaussian, so almost all of the covariance structure of a
multivariate Gaussian is lost in translation (see Section C.1.1 for more details).
It would be ill-advised to use this map in cases of high correlation. In such cases,
computational cost is probably of lesser concern, and more elaborate inference
schemes can be used instead.
C.1 The Dirichlet in the Softmax Basis
[The derivations in this Section are reproduced from [MacKay, 1998] with only
minor changes, and are provided here solely for completeness.]
To Dirichlet in its standard basis on the space [0, 1]K has the form
Dpi(pi;α) =
Γ
(∑K
k αk
)
∏K
k Γ(αk)
K∏
k
piαk−1k δ(1
Tpi − 1) ≡ Z−1D
K∏
k
piαk−1k δ(1
Tpi − 1) (C.2)
where Γ denotes the Gamma function (the normalization constant ZD is also known
as the multinomial Beta function), and the subscript pi on the distribution name
denotes the basis used. The Dirac δ-distribution ensures that pi is in fact a discrete
probability measure (i.e. that its elements sum to 1). The vector 1 is the one-vector
1 = [1, 1, 1, . . . ].
C.1.1 Ensuring Identifiability
We can remove the notational complication of the Dirac distribution in Equation
(C.2) by defining the function on a K − 1 subspace ensuring the requirement
1Tpi = 1. In the subspace, we can define a new parameter vector % with
pik =



%k if k = 1, 2, . . . , K − 1
1−
∑K−1
k=1 %k if k = K
(C.3)
The situation in the softmax space is similar: σ−1 only fixes y up to additive
constants: For a given y, all y′ satisfying y′ = y+ξ1 share the same value of σ(y′)
for any ξ ∈ R. Since we are attempting to match the Dirichlet distribution in RK
with another distribution (a Gaussian), we need to ensure that this distribution is
proper, which requires us to introduce a further restriction to solve this ambiguity.
For this purpose, we are free to choose any restriction r of the functional form
138 A Laplace Map Linking Gaussians and Dirichlets
Figure C.2: conceptual sketch of the effect of the restriction r. All points on the red
line share the same values of σ(y) and are projected “softly” (i.e. with precision τ)
onto the black hyperplane by r.
r(1Ty). Following MacKay once more, we use the squared exponential
r = exp
[
−
τ
2
(1Ty)2
]
(C.4)
One way to get an intuition for what this amounts to is to interpret this restriction
as an unnormalized Gaussian “message” of precision τ on the deviation of the sum
of the elements of y from 0. Another is to see r as a soft projection of the subspaces
forming lines parallel to 1 onto their intersection with the hyperplane defined by
1Ty = 0. Figure C.1.1 contains a sketch of this operation.
In the special case of τ →∞, where the constraint becomes a Dirac distribution, we
can use an analogous re-formulation of the parameter space, usingK−1 parameters
a defined through
yk =



ak if k = 1, 2, . . . , K − 1
−
∑K−1
k=1 ak if k = K
(C.5)
C.1.2 Transforming to the Softmax Basis
We first consider the special case of the hard constraint τ → ∞. For probabil-
ity functions in general, changes of basis correspond to changes of the measure,
mediated by the Jacobian matrix J :
pu(u) du = pg(g(u)) dg ⇒ pu(u) = pg(g(u)) |detJ | with Jk` =
∂gk
∂u`
(C.6)
[see e.g. Bishop, 2006, Section 1.2.1]. For the softmax map in particular, we can
write the Jacobian for the density over % as a function of the density over a in
C.2 The Laplace Map 139
terms of the original parameters pi and y:
Jk` =
∂%k
∂a`
=
K∑
h=1
∂pik
∂yh
∂yh
∂a`
= δk`pik − pi`pik + pikpiK = pik(δk` − (pik − piK)) (C.7)
Now we define the K − 1 dimensional vector pi+k ≡ pik − piK , which gives
detJ = det
[
I− 1pi+T
]K−1∏
k=1
pik (C.8)
We can use the matrix determinant lemma det[A − xyT] = (1 − xA−1y) det(A)
(see e.g. [Roweis, 1999]) to get
detJ = (1− 1pi+)
K−1∏
k=1
pik = piK
K−1∏
k=1
pik =
K∏
k=1
pik (C.9)
Hence, the Dirichlet in the softmax basis takes the form
Dy(pi(y);α) =
Γ
(∑K
k αk
)
∏K
k Γ(αk)
K∏
k
piαkk δ(1− 1pi) (C.10)
This form is in fact also correct if we relax the hard Dirac constraint back to the
soft constraint of Equation (C.4), because integrating over that distribution does
not change the value of pi.
David MacKay [1998] points out several interesting aspects of this representation.
For example, the mean of this distribution, at pi(y) = α/‖α‖, now falls together
with its mode, which is not the case in the standard basis. As we got rid of the
−1 terms in the exponent, the distribution now also does not diverge any more for
αk < 1, but is much more well behaved as a function of y, approaching zero for all
large values of y.
C.2 The Laplace Map
The idea of the Laplace approximation (see also Section 2.3.4) is to approximate
a distribution p(x) with a Gaussian distribution q(x) = N (x;µ,Σ) by setting the
mean µ of q to the location of the mode of p, and the covariance matrix Σ of
q to the inverse of the Hessian of log p at its mode (because the logarithm is a
monotonic function, a mode of p is also a mode of log p). This is often considered
a weak approximation, mainly for the following three reasons
. Modes are local features and do not necessarily represent the overall “loca-
tion” of p well at all.
140 A Laplace Map Linking Gaussians and Dirichlets
0.0 0.2 0.4 0.6 0.8 1.0
pi
0.0
0.5
1.0
1.5
2.0
p(
pi)
a=2.0,b=1.2
a=0.5,b=0.9
6 4 2 0 2 4 6
y
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
p(
y)
N
D
N
D
Figure C.3: Effect of the softmax basis change on the Dirichlet distribution. Left: Two
Beta distributions (i.e. 2-dimensional Dirichlet distributions) for parameter settings
of a and b. Note that a naïve Laplace approximation to these distributions with a
Gaussian in this basis would give a bad match (The second distribution does not even
have a proper mode). Right: solid lines: the same distributions in the softmax basis,
with Gaussian approximations from the Laplace map as dashed lines. Note that there
is now a well defined mode. While the distributions are not symmetric, the Gaussian
approximations do provide meaningful approximations. Also note that the Gaussian
approximations have weaker tails than the Dirichlets.
. This approximation evidently misses any multimodality that might be present
in p.
. since the Gaussian has relatively weak tails, the Laplace approximation often
underestimates the spread of p.
In the case in question here, however, the first two points are moot, because we
have just found a representation in which mean and (unique) mode fall together.
The softmax basis representation also makes the last point less severe, as we will
see soon.
C.2.1 Mode and Hessian in the Softmax Basis
We have already established that the mode of the Dirichlet in the softmax basis
lies at a value of y satisfying σ(y) = α/‖α‖. The remaining degree of freedom is
fixed by the restriction introduced in Section C.1.1, which requires the elements
of y to sum to zero. So we set the mean µ of the Gaussian approximation to the
C.2 The Laplace Map 141
Dirichlet to
µk = logαk −
1
K
K∑
`
logα`, (C.11)
which is evidently the only setting fulfilling both requirements. The logarithm of
the Dirichlet with the soft identifiability constraint is, up to additive constants,
log py(y |α) ,
∑
k
αkpik −
τ
2
1Ty. (C.12)
Plugging in the definition pik = σk(y) we find, after some simple algebra, the
elements of the Hessian L
Lk` = αˆ (δk`pˆik − pˆikpˆi`) + τ(11T)k`, (C.13)
where we have introduced the shorthands αˆ ≡
∑
k αk and pˆik = αk/αˆ for the value
of pi at the mode. The term (11T)k` is a convoluted way of writing a 1, which will
make the subsequent algebra easier to parse.
C.2.2 A Sparse Representation
As mentioned in the introduction to this chapter, the previous sections of this
chapter were largely based on the work by MacKay [1998]. The following section
contains a novel but straightforward extension.
In principle, the previous section provided everything necessary to construct a
Gaussian approximation to the Dirichlet. However, in most cases where we might
be tempted to use this approximation, we will be hard pressed to save computation
time. Often, we will want to discard the correlation present in L and treat the
resulting Gaussian as a direct product of independent univariate Gaussians. Noting
that the elements of the Dirichlet are “almost” independent (independent up to
normalization) to begin with, this is not even too bad an approximation. But it
would not be a good idea to just throw away the off-diagonal elements of L and
use the diagonal elements as the precisions of independent Gaussians, because L’s
diagonal elements each only depend on one single element of α and do not capture
the correlation introduced by normalization.
Instead, this section will derive the analytic inverse of L. The diagonal elements of
the resulting covariance matrix Σ will provide a better starting point for a sparse
approximation. To construct this inverse, we introduce the rectangular matrix
142 A Laplace Map Linking Gaussians and Dirichlets
X ∈ RK×2 with elements
Xku = pˆikδ1u + 1kδ2u =






pˆi1 1
pˆi2 1
...
...
pˆiK 1






(C.14)
and the square matrices A ∈ RK×K and B ∈ R2×2 with
A = diag(α) and B =
(
−αˆ 0
0 τ
)
(C.15)
which allows us to write
L = A+XBXT (C.16)
Both A and B are diagonal with strictly positive diagonal elements, and thus
invertible. Hence we can use the well-known matrix inversion lemma1, which states
(
A+XBXT
)−1
= A−1 −A−1X
(
B−1 +XTA−1X
)−1
XTA−1 (C.17)
The 2 × 2 expression in brackets, known as the Schur complement, is, using the
summation convention [Einstein, 1916]
(
B−1 +XTA−1X
)
ij
= B−1ij + (
αk
αˆ
δi1 + nkδi2)
1
αk
δk`(
α`
αˆ
δj1 + n`δj2)
= B−1ij +
1
αˆ
δi1δj1 +
D
αˆ
(δi1δj2 + δi2δj1) + δi2δj2
∑
k
1
αk
B−1 +XTA−1X =
(
0 K/αˆ
K/αˆ τ−1 +
∑
k α
−1
k
)
(C.18)
The inverse of a 2× 2 matrix is
(
a b
c d
)−1
=
1
ad− bc
(
d −b
−c a
)
(C.19)
so we get for the inverse of the Schur complement (which exists in particular for
1This lemma is also known as the Woodbury identity, based on a technical report by that
author [Woodbury, 1950]. Hager [1989] points out that the formula itself showed up in several
earlier papers, but it appears Woodbury was the first to study it in detail. The special case
where the Schur complement is one-dimensional is also known as the Sherman-Morrison formula,
although it seems it was actually introduced by Bartlett [1951].
C.2 The Laplace Map 143
all α, τ with αk > 0 ∀k and τ > 0)
(
B−1 +XTA−1X
)−1
=
(
− αˆ
2
K
(
1
τ +
∑
k
1
αk
)
αˆ
K
αˆ
K 0
)
(C.20)
With some simple algebra similar to Equation (C.18), we project back to RK×K
and get the final value for the inverse of the Hessian
L−1k` = δk`
1
αk
−
1
K
[
1
αk
+
1
α`
−
1
K
(
1
τ
+
K∑
u
1
αu
)]
(C.21)
because the inverse is defined for all positive values of τ , we can now safely take
the limit of τ →∞, which hardens the constraint on the subspace 1Ty. Then, the
diagonal elements of Σ = L−1, which we are interested in, are
Σkk =
1
αk
(
1−
2
K
)
+
1
K2
K∑
`
1
α`
. (C.22)
Looking at this form, we firstly note that each diagonal element of Σ now depends
on all the elements of α, which is reassuring. The term 1−2/K might look worrying
at first, but it turns out that the formula is also meaningful for K = 2 (and in
fact even the trivial K = 1). For this bivariate situation, the variance on both
dimensions becomes identical, which is expected as this case corresponds to a
Laplace approximation on a Beta distribution. For more on this special case, see
Section C.3.
C.2.3 Inverse Map
Having established a map from the parameter space of Dirichlets into a subset of
the parameters of the Gaussian exponential family, we would now like to invert
this map, to arrive at a rule mapping sets of K independent scalar Gaussians to
the parameters of a K-dimensional Dirichlet. To do so, we first transform Equation
(C.11) to
αk = e
µk
K∏
`
α1/K` (C.23)
inserting this form for αk into Equation (C.22) and re-arranging gives
K∏
`
α1/K` =
1
Σkk
[
e−µk
(
1−
2
K
)
+
1
K2
K∑
u
e−µu
]
∀k ∈ 1, . . . , K (C.24)
144 A Laplace Map Linking Gaussians and Dirichlets
which we can re-insert into Equation (C.23) to get
αk =
1
Σkk
(
1−
2
K
+
e−µk
K2
K∑
`
e−µ`
)
for k = 1, . . . , K (C.25)
The results of the preceding sections will be summed up in section C.4. Before
that, though, it is instructive to take a closer look at the special case of K = 2.
C.3 The Two-Dimensional Case
The bivariate version of the Dirichlet distribution is the Beta distribution. Because
the probability pi1 of label 1 completely determines pi2 = 1 − pi1, it is possible to
write it as a function of only one variable pi. Its two parameters are often denoted
a and b, rather than α1 and α2:
D(2)pi (pi; a, b) =
Γ(a+ b)
Γ(a)Γ(b)
pia−1(1− pi)b−1 (C.26)
The same fact also simplifies the mapping to a subset of R2. Because the subspace
of y ∈ R2 which satisfies 1Ty = 0 is isomorphic to R itself, we can save the
regularization work of Section C.1.1 and use the one-dimensional version of the
softmax as the link function. This function is known as the logistic:
σ1(y) =
exp(y)
1 + exp(y)
(C.27)
As the one-dimensional case of Equation (C.1), the logistic has the Jacobian (see
Equation (C.7))
dσ1(y)
dy
= σ1(y)(1− σ1(y)) (C.28)
so the analysis of Equations (C.10) and following carries through. It is then not
difficult to see that this distribution, in the logistic basis, has its mode at
µ = log
(a
b
)
(C.29)
and the Hessian (i.e. the second derivative) of its logarithm at this point is
σ2 =
a+ b
ab
(C.30)
This map can be easily inverted to give
a =
exp(µ) + 1
σ2
and b =
exp(−µ) + 1
σ2
(C.31)
C.4 Summary 145
C.4 Summary
This appendix developed a Laplace approximation to the Dirichlet distribution
using the softmax basis. Introducing a regularizing distribution and inverting the
Hessian analytically, we arrived at a K-variate Gaussian approximation to the
Dirichlet with a simple covariance structure, which can be described by K pa-
rameters, and is fully determined through the diagonal elements of the covariance
matrix. The maps between the Dirichlet parameters α and the Gaussian parame-
ters (µ,Σ) is
µk = logαk −
1
K
K∑
`=1
logα` (C.32)
Σk` = δk`
1
αk
−
1
K
[
1
αk
+
1
α`
−
1
K
K∑
u=1
1
αu
]
(C.33)
and αk =
1
Σkk
(
1−
2
K
+
eµk
K2
K∑
`
e−µ`
)
for k = 1, . . . , K (C.34)
Note that, if K  1 and αk  0 ∀k, the covariance matrix approaches a diagonal
Σ = diag [(αk)−1]. In situations like this, where the dimensionality is high and the
Dirichlet is not sparse, the output of the Laplace map can thus be treated as a set
of K approximately independent Gaussians.
Computationally lightweight approximate regression algorithms often return inde-
pendent Gaussian beliefs over covariates. Using the inverse of the map, these can
be used to construct approximate Dirichlet beliefs.
Some experimental evaluations of the quality of such approximations for evidence
estimation can be found in the previously cited paper by MacKay [1998]. Chapter
5 contains experimental evaluations of the quality of this approximation when used
in combination with a factorized regression algorithm.
146 A Laplace Map Linking Gaussians and Dirichlets
Bibliography
M. Abramowitz and I.A. Stegun, editors. Handbook of Mathematical Functions
with Formulas, Graphs and Mathematical Tables. Dover, 1972. 89
R.K. Ahuja and J.B. Orlin. Inverse optimization. Operations Research, pages
771–783, 2001. 113
D.G. Altman. Statistics in medical journals. Statistics in Medicine, 1(1):59–71,
1982. 4
D.G. Altman. The scandal of poor medical research. British Medical Journal, 308
(6924):283, 1994. 4
S. Amari. Differential-geometrical methods in statistics. Springer, 1985. 24
S. Arima. Peabody picture vocabulary test-revised data: a Bayesian approach to
item response theory. Master’s thesis, Universiteit Hasselt, Belgium, 2006. 61
J.R. Ashford and R.R. Sowden. Multi-variate probit analysis. Biometrics, pages
535–546, 1970. 66
M.R. Barrick and M.K. Mount. The Big Five personality dimensions and job
performance: a meta-analysis. Personnel Psychology, 44, 1991. 62
M.S. Bartlett. An inverse matrix adjustment arising in discriminant analysis. The
Annals of Mathematical Statistics, 22(1):107–111, 1951. 142
E.B. Baum and W.D. Smith. A Bayesian approach to relevance in game playing.
Artificial Intelligence, 97(1-2):195 – 242, 1997. 39
T. Bayes. On a problem in the doctrine of chances. Philosophical Transactions of
the Royal Society, 1763. 2
M.J. Beal, Z. Ghahramani, and C.E. Rasmussen. The infinite hidden Markov
model. Advances in neural information processing systems, 1:577–584, 2002. 4
R.E. Bellman. Dynamic Programming. Princeton University Press, 1957. 7
148 BIBLIOGRAPHY
J. Bennett and S. Lanning. The Netflix prize. In Proceedings of KDD Cup and
Workshop, 2007. 31, 62
C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. 3, 9,
10, 31, 33, 34, 86, 92, 124, 138
D. Blei, T.L. Griffiths, M.I. Jordan, and J.B. Tenenbaum. Hierarchical topic models
and the nested Chinese restaurant process. Advances in neural information
processing systems, 16:106, 2004. 4
D.M. Blei and J.D. Lafferty. Dynamic topic models. In Proceedings of the 23rd
international conference on Machine learning, ICML ’06, pages 113–120, 2006.
103
D.M. Blei and J.D. Lafferty. A correlated topic model of Science. Annals of
Statistics, 1(1):17–35, 2007. 84
D.M. Blei and J.D. Lafferty. Topic models. In A. Srivastava and M. Sahami,
editors, Text Mining: Theory and Application. Taylor & Francis, 2009. 80, 99
D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. Journal of
Machine Learning Research, 3:993–1022, 2003. 80, 84, 85
D. Burton and P.L. Toint. On an instance of the inverse shortest paths problem.
Mathematical Programming, 53(1):45–61, 1992. 113
M. Campbell, A.J. Hoane, and F.H. Hsu. Deep Blue. Artificial Intelligence, 134
(1-2):57–83, 2002. 38, 42
R.B. Cattell. The description of personality: basic traits resolved into clusters.
Journal of Abnormal and Social Psychology, 38(4):476–506, 1943. 62
J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, and D.M. Blei. Reading tea leaves:
How humans interpret topic models. In Neural Information Processing Systems,
2009. 99
C.E. Clark. The greatest of a finite set of random variables. Operations Research,
9(2):145–162, 1961. 114, 122, 125
D.D. Cox. An analysis of bayesian inference for nonparametric regression. Annals
of Statistics, 21(2):903–923, 1993. 3
R.T. Cox. Probability, frequency and reasonable expectation. American Journal
of Physics, 14(1):1–13, 1946. 1, 2
BIBLIOGRAPHY 149
R. Dearden, N. Friedman, and S. Russell. Bayesian Q-Learning. In Proceedings of
the Thirteenth AAAI Conference on Artificial Intelligence, pages 761–768, 1998.
49, 113
S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman.
Indexing by latent semantic analysis. Journal of the American society for infor-
mation science, 41(6):391–407, 1990. 80
J.M. Digman. Personality structure: Emergence of the five-factor model. Annual
Review of Psychology, 41(1):417–440, 1990. 62
M.O.G. Duff. Optimal Learning: Computational procedures for Bayes-adaptive
Markov decision processes. PhD thesis, U of Massachusetts, Amherst, 2002. 39
A. Einstein. Die Grundlage der allgemeinen Relativitätstheorie. Annalen der
Physik, 354(7):770–822, 1916. 21, 142
B.J. Frey. Graphical Models for Machine Learning and Digital Communication.
MIT Press, 1998. 5, 9
B.J. Frey and D.J.C. MacKay. A revolution: Belief propagation in graphs with
cycles. In Advances in Neural Information Processing Systems, 1998. 5, 9, 67
B.J. Frey, F.R. Kschischang, H.-A. Loeliger, and N. Wiberg. Factor graphs and
algorithms. In Proc. 35th Allerton Conf. on Communications, Control, and
Computing, pages 666–680, 1997. 9
S. Gelly and D. Silver. Achieving master level play in 9 x 9 computer Go. In
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence,
pages 1537–1540, 2008. 38
S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with patterns
in Monte-Carlo Go. Research Report RR-6062, INRIA, 2006. 51
S. Ghosal, J.K. Ghosh, and A.W. van der Vaart. Convergence rates of posterior
distributions. Ann. Statistics, 28(2):500–531, 2000. 3
W.R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied
Statistics, 41(2):337–348, 1992. 34
G.H. Golub and C.F. Van Loan. Matrix computations. Johns Hopkins Univ Pr,
1996. 14
T.L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian
buffet process. Advances in neural information processing systems, 18:475, 2006.
4
150 BIBLIOGRAPHY
T.L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the Na-
tional Academy of Sciences, 101(Suppl 1):5228, 2004. 80
W.W. Hager. Updating the inverse of a matrix. SIAM Review, 31(2):221–239,
1989. 142
R.K. Hambleton and H. Swaminathan. Item Response Theory. Kluwer-Nijhoff,
1990. 60
P. Hennig. Expectation propagation on the maximum of correlated normal vari-
ables. arXiv:0910.0115v1 [stat:ML], October 2009. URL http://arxiv.
org/abs/0910.0115. 113
P. Hennig, D. Stern, and T. Graepel. Coherent inference on optimal play in game
trees. Journal of Machine Learning Research, W&CP, 9, 2010. 37
R. Herbrich, T.P. Minka, and T. Graepel. TrueSkill™: A Bayesian skill rating
system. Advances in Neural Information Processing Systems, 20:569–576, 2007.
44, 68
C. Heuberger. Inverse combinatorial optimization: A survey on problems, methods,
and results. Journal of Combinatorial Optimization, 8(3):329–361, 2004. 113
T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd
annual international ACM SIGIR conference on Research and development in
information retrieval, pages 50–57, 1999. 80
J. Huguenin, F. Pelgrin, and A. Holly. Estimation of multivariate probit models by
exact maximum likelihood. Technical report, University of Lausanne, Institute
of Health Economics and Management (IEMS), 2009. 66
D. Hume. An Enquiry Concerning Human Understanding. original publisher
uknown. Currently under publication, for example, by Oxford Univ. Press, 1739.
5
I.A. Ibragimov and R.Z. Has’minskii. Statistical Estimation: Asymptotic Theory.
Springer, New York, 1981. 3
A.J. Jaffe and H.F. Spirer. Misused Statistics: Straight Talk for Twisted Numbers.
Marcel Dekker, New York, 1987. 4
E.T. Jaynes and G.L. Bretthorst. Probability Theory: the Logic of Science. Cam-
bridge University Press, 2003. 2, 3
J.L.W.V. Jensen. Sur les fonctions convexes et les inégalités entre les valeurs
moyennes. Acta Mathematica, 30(1):175–193, 1906. 26
BIBLIOGRAPHY 151
R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In
Proc. ACM Symposium on Theory of Computing, pages 681–690, 2008. 39
L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of
the European Conference on Machine Learning, pages 282–293. Springer, 2006.
38, 49, 53
A.N. Kolmogorov. Grundbegriffe der Wahrscheinlichkeitsrechnung. Ergebnisse der
Mathematik und ihrer Grenzgebiete, 2, 1933. 2
J.Z. Kolter and A.Y. Ng. Near-Bayesian exploration in polynomial time. In Pro-
ceedings of the 26th International Conference on Machine Learning. Morgan
Kaufmann, 2009. 49
F.R. Kschischang, B.J. Frey, and H.-A. Loeliger. Factor graphs and the sum-
product algorithm. IEEE Transactions on Information Theory, 47(2):498–519,
February 2001. 10
S. Kullback and R.A. Leibler. On information and sufficiency. Annals of Mathe-
matical Statistics, 22(1):79–86, 1951. 20
P.S. Laplace. Mémoire sur la probabilité des causes par les évènemens. Mémoires
de mathématique et de physique presentés à l’Académie royale des sciences, par
divers savans, et lûs dans ses assemblées, 6:621–656, 1774. 2
S.L. Lauritzen and D.J. Spiegelhalter. Local computations with probabilities on
graphical structures and their application to expert systems. J. R. Stat. Soc.,
50:157–224, 1988. 5, 9, 13
L.M. Le Cam. Convergence of estimates under dimensionality restrictions. Ann.
Statistics, 1:38–53, 1973. 3
R. Likert. A technique for the measurement of attitudes. Archives of Psychology.
Vol, 22(140):55, 1932. 60
F. Lord. Statistical inferences about true scores. Psychometrika, 24:1–17, 1959.
63
F.M. Lord. Maximum likelihood and Bayesian parameter estimation in item re-
sponse theory. Journal of Educational Measurement, 23(2):157–162, 1986. 61
D.J.C. MacKay. Choice of basis for Laplace approximation. Machine Learning, 33
(1):77–86, 1998. 82, 90, 96, 136, 137, 139, 141, 145
D.J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Cam-
bridge Univ Press, 2003. 3, 26, 30, 34, 46
152 BIBLIOGRAPHY
D.J.C. MacKay and L.C. Bauman-Peto. A hierarchical Dirichlet language model.
Natural Language Engineering, 1(03):289–308, 1995. 80
P.S. Maybeck. Stochastic Models, Estimation and Control, chapter 12.7. Academic
Press, 1982. 32
W. McDougall. Of the word character and personality. Character personality, 1:
3–16, 1932. 62
D. Mimno and A. McCallum. Topic models conditioned on arbitrary features with
dirichlet-multinomial regression. In Uncertainty in Artificial Intelligence, 2008.
80, 82, 84
T. Minka. Divergence measures and message passing. Technical Report MSR-TR-
2005-173, Microsoft Research Technical Report, 2005. 24
T.P. Minka. Expectation Propagation for approximate Bayesian inference. In Pro-
ceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pages
362–369, San Francisco, CA, USA, 2001. Morgan Kaufmann. ISBN 1-55860-
800-1. 5, 9, 20, 21, 66, 67
T.P. Minka. Power EP. Technical report, Dep. Statistics, Carnegie Mellon Uni-
versity, Pittsburgh, PA, 2004. URL http://research.microsoft.com/en-us/
um/people/minka/papers/. 23, 24
T.P. Minka. EP: A quick reference. Technical report, Microsoft Re-
search, 2008. URL http://research.microsoft.com/en-us/um/people/
minka/papers/. 68, 129
T.P. Minka and J. Winn. infer.NET. Microsoft Research Cam-
bridge, 2008. URL http://research.microsoft.com/en-us/um/cambridge/
projects/infernet/. 8, 123
E. Muraki. A generalized partial credit model. Applied Psychological Measurement,
2, 1992. 64
I. Murray. Advances in Markov Chain Monte Carlo Methods. PhD thesis, Univer-
sity College London, 2007. 30
I. Murray, R.P. Adams, and D.J.C. MacKay. Elliptical slice sampling. In Proceed-
ings of the 13th International Conference on Artificial Intelligence and Statistics,
2010. 97
R.M. Neal. Bayesian learning for neural networks. Springer Verlag, 1996. 4
BIBLIOGRAPHY 153
G. Neumann-Denzau and J. Behrens. Inversion of seismic data using tomographical
reconstruction techniques for investigations of laterally inhomogeneous media.
Geophysical Journal International, 79(1):305–315, 1984. 113
N.J. Nilsson. Problem-Solving Methods in Artificial Intelligence. McGraw-Hill
Pub. Co., 1971. 38
M.R. Novick. The axioms and principal results of classical test theory. Journal of
Mathematical Psychology, 3(1):1–18, 1966. 60, 63
M. Opper. On-line versus off-line learning from random examples: General results.
Phys. Rev. Lett., 77(22):4671–4674, Nov 1996. 32, 82
P. Orbanz. Construction of nonparametric Bayesian models from parametric Bayes
equations. In Adv. in Neural Information Processing Systems, 2009. 19
A.J. Palay. Searching with probabilities. Pitman Publishing, Inc., Marshfield, MA,
USA, 1985. 39
R.J. Patz and B.W. Junker. A straightforward approach to Markov chain Monte
Carlo methods for item response models. Journal of Educational and Behavioral
Statistics, 24(2):146, 1999. 61
J. Pearl. Heuristics. Intelligent search strategies for computer problem solving.
Addison-Wesley, 1985. 38
J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
5, 9, 32
K. Popper. Logik der Forschung. Mohr Siebeck, 1934. 5
G. Rasch. Probabilistic Models for Some Intelligence and Attainment Tests, 1960.
64
C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning.
MIT Press, 2006. 116
T. Richardson and R.L. Urbanke. Modern coding theory. Cambridge University
Press, 2008. 46
J.M. Robson. The complexity of Go. Proc. Information Processing, 83:413–417,
1983. 38
S. Roweis. Matrix identities. Technical report, University of Toronto, 1999. 139
D.M. Roy and Y.W. Teh. The Mondrian process. In Adv. in neural information
processing systems, volume 21, 2009. 4
154 BIBLIOGRAPHY
S. J. Russell and E. Wefald. Do the Right Thing. MIT Press, 1991. 39
G. Salton and M.J. McGill. Introduction to modern information retrieval. McGraw-
Hill New York, 1983. 80, 99
C.E. Shannon. Programming a computer for playing chess. Philosophical Magazine
(Series 7), 41(314):256–275, 1950. 38
D.H. Stern, R. Herbrich, and T. Graepel. Learning to solve game trees. In Proceed-
ings of the 24th International Conference on Machine Learning, pages 839–846,
New York, NY, USA, 2007. ACM. 39
D.H. Stern, R. Herbrich, and T. Graepel. Matchbox: large scale online bayesian
recommendations. In 18th International Conference on the World Wide Web,
pages 111–120. ACM New York, NY, USA, 2009. 68
Y.W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian infer-
ence algorithm for latent Dirichlet allocation. Advances in neural information
processing systems, 19:1353, 2007. 80, 82, 84, 88, 96
G. Tesauro, V.T. Rajan, and R. Segal. Bayesian inference in Monte Carlo tree
search. In Uncertainty in Artificial Intelligence, July 2010. 56
W.R. Thompson. On the likelihood that one unknown probability exceeds another
in view of two samples. Biometrika, 25:275–294, 1933. 49
R. von Mises. Mathematical Theory of Probability and Statistics. Academic Press
New York, 1964. 116
S. Walker. New approaches to Bayesian consistency. The Annals of Statistics, 32
(5):2028–2043, 2004. 2
X. Wang and A. McCallum. Topics over time: a non-markov continuous-time
model of topical trends. In Proceedings of the 12th ACM SIGKDD international
conference on Knowledge discovery and data mining, KDD ’06, pages 424–433,
2006. 103
C.J.C.H. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning,
8:279 – 292, 1992. 39
N. Wiener. Cybernetics or Control and Communication in the Animal and the
Machine. MIT Press, 1948. 38
P. Wild and WR Gilks. Algorithm AS 287: Adaptive rejection sampling from
log-concave density functions. Applied Statistics, 42(4):701–709, 1993. 34
BIBLIOGRAPHY 155
C.K.I Williams and C.E. Rasmussen. Gaussian processes for regression. In Ad-
vances in neural information processing systems, 1996. 4
J. Winn and C.M. Bishop. Variational message passing. Journal of Machine
Learning Research, 6(1):661, 2006. 5, 9, 28
M.A. Woodbury. Inverting modified matrices. Technical Report no. 42, Statistical
Research Group, Princeton University, Princeton, N.J., 1950. 142
J. Zhu and E.P. Xing. Conditional topic random fields. In International Conference
on Machine Learning, 2010. 80, 82
Index
activation function, 135
AND/OR, 40
Bayesian inference, 2
Bayesian Networks, 9
Beta distribution, 144
classic test model, 63
conjugate prior, 17
curse of dimensionality, 7
damping messages, 24
Digamma function, 87
Dirichlet distribution, 137
expectation propagation, 20, 66, 113
rank 1, 129
exponential family, 15
EXPTIME, 38
EXPTIME complete, 38
factorization, 27, 31
factor graph, 8
Frequentist inference, 4
game tree, 40
Gaussian distribution, 21
generalized partial credit model, 64
Gibb’s inequality, 26
Go, 38
Graphical Models, 7
graphical models, 5
directed, 9
undirected, 9
identifiability, 70, 137
inference, 1
item response theory, 62
KL divergence, 20
Laplace approximation, 28, 89
Laplace bridge, 89
learning, 1
Likert scale, 60
link function, 135
machine learning, 1
Markov Chain Monte Carlo, 30
matrix inversion lemma, 142
max-factor, 114
MAX/MIN, 40
max factor, 114
message passing, 12
Monte Carlo, 30
Nonparametric Methods, 4
Power EP, 23
probability, 1
product rule, 2
provability, 4
Psychometry, 59
Rasch model, 64
regression
Gaussian, 31
Multivariate Gaussian, 91
probit, 64
rejection sampling, 34
relative entropy, 20
step factor, 68, 132
sum product algorithm, 10
INDEX 157
sum rule, 2
topic model, 80
trait, 60
UCT, 38
variational inference, 25
collapsed, 84
message passing, 28
Woodbury identity, 142