A Unified Framework for
Resource-Bounded Autonomous
Agents Interacting with
Unknown Environments
Pedro A. Ortega
Department of Engineering
University of Cambridge
A thesis submitted for the degree of
Doctor of Philosophy
September 2010
2
To my parents Pedro and Pilar.
Acknowledgements
This thesis is the result of four years of work, and it would not have been
possible without the motivation and support of many people. Thanks to
them, the years I spent in Cambridge have been amongst the happiest of
my life.
I especially want to thank my family Pedro, Pilar, Carolina and Paulina
and my closest friends Francisca Albert, Paul Aguayo, Daniel Braun, Oscar
Van Heerden, Emre Karaa, Aliff Mohamad, Jose´ Donoso, Aditya Saxena,
Loreto Valenzuela, Horacio Tate, Aaron Lobo, Zoi Roupakia, Aysha Roohi,
Ben Mansfield, Francisco Pe´rez and Mauricio Gaete. Also, I am deeply
indebted to Disa Helander and Raffaella Nativio.
Special thanks go to my friend Daniel Braun, who has been closely col-
laborating with me during the course of my study. I also want to thank
Jose´ Aliste, John Cunningham, Jose´ Donoso, Marcus Hutter, Humberto
Maturana, Gonzalo Ruz and David Wingate for their invaluable help and
comments on earlier versions of this manuscript. The present study has
been supported by the Ministerio de Planificacio´n de Chile (MIDEPLAN),
the Comisio´n Nacional de Investigacio´n Cient´ıfica y Tecnolo´gica (CONI-
CYT) and the Bo¨hringer-Ingelheim-Fonds (BIF). Finally, I want to thank
my supervisor Zoubin Ghahramani for his guidance, motivation and sup-
port.
Abstract
The aim of this thesis is to present a mathematical framework for con-
ceptualizing and constructing adaptive autonomous systems under resource
constraints. The first part of this thesis contains a concise presentation of
the foundations of classical agency: namely the formalizations of decision
making and learning. Decision making includes: (a) subjective expected
utility (SEU) theory, the framework of decision making under uncertainty;
(b) the maximum SEU principle to choose the optimal solution; and (c)
its application to the design of autonomous systems, culminating in the
Bellman optimality equations. Learning includes: (a) Bayesian probability
theory, the theory for reasoning under uncertainty that extends logic; and
(b) Bayes-Optimal agents, the application of Bayesian probability theory
to the design of optimal adaptive agents. Then, two major problems of
the maximum SEU principle are highlighted: (a) the prohibitive computa-
tional costs and (b) the need for the causal precedence of the choice of the
policy. The second part of this thesis tackles the two aforementioned prob-
lems. First, an information-theoretic notion of resources in autonomous
systems is established. Second, a framework for resource-bounded agency
is introduced. This includes: (a) a maximum bounded SEU principle that is
derived from a set of axioms of utility; (b) an axiomatic model of probabilis-
tic causality, which is applied for the formalization of autonomous systems
having uncertainty over their policy and environment; and (c) the Bayesian
control rule, which is derived from the maximum bounded SEU principle
and the model of causality, implementing a stochastic adaptive control law
that deals with the case where autonomous agents are uncertain about their
policy and environment.
iv
Contents
Preface xvii
1 Introduction 1
1.1 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 2
I Foundations of Classical Agency 3
2 Characterization of Behavior 5
2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Basic Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Probabilities & Random Variables . . . . . . . . . . . . . . . . . 6
2.2 Models of Autonomous Systems . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Output Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Decision Making 13
3.1 Subjective Expected Utility . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.2 Rationality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.3 Representation Theorem . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 The Maximum Subjective Expected Utility Principle . . . . . . . . . . . 19
3.2.1 SEU in Autonomous Systems . . . . . . . . . . . . . . . . . . . . 19
3.2.2 I/O Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.3 Bellman Optimality Equations . . . . . . . . . . . . . . . . . . . 22
3.2.4 Subjective versus True Expected Utility . . . . . . . . . . . . . . 25
3.3 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 26
4 Learning 29
4.1 Bayesian Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1.1 Reasoning under Certainty . . . . . . . . . . . . . . . . . . . . . 31
4.1.2 Reasoning under Uncertainty . . . . . . . . . . . . . . . . . . . . 33
4.1.3 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Adaptive Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.1 Bayesian Input Model . . . . . . . . . . . . . . . . . . . . . . . . 37
v
CONTENTS
4.2.2 Predictive Distribution . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.3 Induced Input Model . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.4 Convergence of Predictive Distribution . . . . . . . . . . . . . . . 40
4.2.5 Bayes Optimal Agents . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 46
5 Problems of Classical Agency 47
5.1 Computational Cost and Precedence of Policy Choice . . . . . . . . . . 47
5.2 Is Rationality a Useful Concept? . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 49
II Resource-Bounded Agency 51
6 Resources 53
6.1 Preliminaries in Information Theory . . . . . . . . . . . . . . . . . . . . 54
6.1.1 The Communication Problem . . . . . . . . . . . . . . . . . . . . 54
6.1.2 Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.1.3 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.2 Resources as Information . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2.1 Thermodynamical Interpretation . . . . . . . . . . . . . . . . . . 60
6.2.2 Computational Interpretation . . . . . . . . . . . . . . . . . . . . 63
6.3 Resource Costs in Agents . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.3.1 Cost of Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.3.2 Costs of Construction . . . . . . . . . . . . . . . . . . . . . . . . 68
6.4 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 69
7 Boundedness 71
7.1 An Example of Boundedness . . . . . . . . . . . . . . . . . . . . . . . . 71
7.2 Utility & Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.2.1 Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.2.2 Variational principle . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.2.3 Bounded SEU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.3 Bounded SEU in Autonomous Systems . . . . . . . . . . . . . . . . . . . 85
7.3.1 Bounded Optimal Control . . . . . . . . . . . . . . . . . . . . . . 85
7.3.2 Adaptive Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.4 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 88
8 Causality 91
8.1 The Big Picture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.2 Causal Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
8.2.1 Interventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
8.3 Causality in Autonomous Systems . . . . . . . . . . . . . . . . . . . . . 99
8.3.1 Bayesian I/O Model . . . . . . . . . . . . . . . . . . . . . . . . . 99
vi
CONTENTS
8.3.2 Causal Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.3.3 Belief Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.3.4 Induced I/O Model . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.4 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 104
9 Control as Estimation 105
9.1 Interlude: Dynamic versus Static . . . . . . . . . . . . . . . . . . . . . . 106
9.1.1 Risk versus Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . 106
9.2 Adaptive Estimative Control . . . . . . . . . . . . . . . . . . . . . . . . 109
9.3 Bayesian Control Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
9.4 Convergence of the Bayesian Control Rule . . . . . . . . . . . . . . . . . 111
9.4.1 Policy Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
9.4.2 Divergence Processes . . . . . . . . . . . . . . . . . . . . . . . . . 112
9.4.3 Decomposition of Divergence Processes . . . . . . . . . . . . . . 113
9.4.4 Bounded Variation . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.4.5 Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
9.4.6 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
9.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
9.5.1 Bandit Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
9.5.2 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . 124
9.6 Critical Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
9.7 Relation to Existing Approaches . . . . . . . . . . . . . . . . . . . . . . 129
9.8 Derivation of Gibbs Sampler for MDP Agent . . . . . . . . . . . . . . . 129
9.9 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 131
10 Discussion 133
10.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
10.2 What are the contributions? . . . . . . . . . . . . . . . . . . . . . . . . . 134
10.3 What is missing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
References 145
vii
CONTENTS
viii
List of Figures
2.1 Behavioral Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Interaction System. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 An Output Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1 Setup of Subjective Expected Utility. . . . . . . . . . . . . . . . . . . . . 14
3.2 Axiom R3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Axiom R5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Axiom R6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5 Axiom R7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.6 A Behavioral Function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.7 Combining Behavioral Functions determines an Interaction String. . . . 20
3.8 A Decision Tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.9 Solution of a Decision Tree. . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1 A Fixed Predictor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 An Adaptive Predictor. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Truth Space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4 Extension of Truth Function. . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5 Bayes’ rule. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.6 Progressive Refinement of Accuracy. . . . . . . . . . . . . . . . . . . . . 36
4.7 Convergence of Predictive Distribution. . . . . . . . . . . . . . . . . . . 43
6.1 Communication Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2 Prefix Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.3 Information Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.4 Probability versus Codeword Length . . . . . . . . . . . . . . . . . . . . 59
6.5 The Molecule-In-A-Box Device. . . . . . . . . . . . . . . . . . . . . . . . 61
6.6 A Generalized Molecule-In-A-Box Device. . . . . . . . . . . . . . . . . . 62
6.7 Time-Space Tradeoff. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.8 Logic Circuit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.9 Sequential Processing Machine. . . . . . . . . . . . . . . . . . . . . . . . 65
6.10 State Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.1 An Exhaustive Optimization. . . . . . . . . . . . . . . . . . . . . . . . . 72
ix
LIST OF FIGURES
7.2 Distributions after Bounded Optimization. . . . . . . . . . . . . . . . . . 73
7.3 Performance of the Bounded Optimization. . . . . . . . . . . . . . . . . 74
7.4 Expected Value Penalized by Relative Entropy. . . . . . . . . . . . . . . 74
7.5 Transformation of a System . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.1 A Three-Stage Randomized Experiment. . . . . . . . . . . . . . . . . . . 93
8.2 A Causal Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
8.3 An Intervention in the Probability Tree. . . . . . . . . . . . . . . . . . . 95
8.4 Primitive Events and their Atom Sets. . . . . . . . . . . . . . . . . . . . 96
8.5 Causal Space of an Autonomous System. . . . . . . . . . . . . . . . . . . 100
8.6 Updates following an Observation versus an Action. . . . . . . . . . . . 101
9.1 Risk versus Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
9.2 A Policy Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
9.3 Realization of Divergence Processes. . . . . . . . . . . . . . . . . . . . . 113
9.4 Policies Influence Divergence Processes . . . . . . . . . . . . . . . . . . . 113
9.5 Decomposition of a Divergence Process into Sub-Divergences . . . . . . 114
9.6 Bounded Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.7 Problems with Disambiguation. . . . . . . . . . . . . . . . . . . . . . . . 117
9.8 Inconsistent Policies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
9.9 Space of Bandit Configurations. . . . . . . . . . . . . . . . . . . . . . . . 122
9.10 Performance Comparison for Bandit Problem . . . . . . . . . . . . . . . 123
9.11 MDP Performance Results . . . . . . . . . . . . . . . . . . . . . . . . . . 126
x
List of Notation
Basic
X A set or alphabet.
N The set of natural numbers 1, 2, 3, . . .
R The set of real numbers.
ǫ The empty string.
X n The set of strings of length n over the alphabet X .
X ∗ The set of all finite strings over the alphabet X .
xi:k The substring xixi+1 · · · xk−1xk.
x≤i The string x1x2 · · · xi.
ln(x) The natural logarithm of x.
log(x) The logarithm base-2 of x.
P(X ) The powerset of X .
Pr An arbitrary probability distribution.
Ω A sample space.
F An algebra.
Autonomous Agents
A The set of actions.
O The set of observations.
Z The set of interactions, i.e. Z := A×O.
at The action at time t.
ot The observation at time t.
aoi An interaction viewed as a symbol, i.e. aioi.
T The horizon, i.e. the maximum length of interaction strings.
Z⋄ The set of interaction strings up to length T .
U The utility function over interactions strings.
P The (behavioral) model of the agent, i.e. the distribution over
interaction sequences implemented by the agent. It is used to
denote the input model, the output model or the I/O model.
xi
LIST OF NOTATION
Q The (behavioral) model of the environment, i.e. the distribution
over interaction sequences implemented by the environment. It is
used to denote the input model, the output model the I/O model.
G The generative distribution, i.e. the sampling distribution over the
interactions sequences that results from the interaction between
the agent and the environment.
P The belief model of the agent, i.e. the Bayesian mixture distribu-
tion over interaction sequences. The symbol is used to denote the
Bayesian input model, the Bayesian output model or the Bayesian
I/O model. The belief model is a conceptual explanation that
gives rise to a unique behavioral model.
Model Types
P(at|ao<t) The probability of generating action at given the past ao<t. The
collection of these probabilities forms the agent’s output model.
P(ot|ao<tat) The probability of gathering observation ot given the past ao<tat.
The collection of these probabilities forms the agent’s input
model.
” The collection of the previous two forms the agent’s I/O model.
Q(at|ao<t) The probability of gathering action action at given the past ao<t.
The collection of these probabilities forms the environment’s in-
put model.
Q(ot|ao<tat) The probability of generating observation ot given the past ao<tat.
The collection of these probabilities forms the environment’s out-
put model.
” The collection of the previous two forms the environment’s I/O
model.
P (θ) The prior probability of the parameter θ ∈ Θ.
P (at|θ, ao<t) The probability of generating action at given the past ao<t under
the parameter θ. Together with the prior over θ, the collection of
these probabilities forms the agent’s Bayesian output model.
P (ot|θ, ao<tat) The probability of gathering observation ot given the past ao<tat
under the parameter θ. Together with the prior over θ, the col-
lection of these probabilities forms the agent’s Bayesian input
model.
” The collection of the previous three forms the agent’s Bayesian
I/O model.
xii
Resources & Boundedness
γ The conversion factor between units of information and units of
energy.
α The conversion factor between units of information and units of
utility.
U(A) The utility of event A.
u(A|B) The utility gain of changing from event B to event A ∩ B, i.e.
u(A|B) := U(A ∩B)−U(B).
J(Pr;U) The free utility of Pr under the utility U.
P0 In a transformation, this is the distribution that is assumed to be
known (It can be the distribution before or the distribution after
the transformation).
Pr In a variational problem, this is the distribution to be varied.
Pi,S In a transformation, this is the distribution before the change.
Pf ,R In a transformation, this is the distribution after the change.
xiii
LIST OF NOTATION
xiv
List of Definitions and Results
Definition 1. Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Definition 2. Output Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9
Definition 3. Generative Probability Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Definition 4. Conditional Preference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Definition 5. Null Event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Definition 6. Constant Act . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Definition 7. Rationality/Savage Axioms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15
Theorem 1. Expected Utility Representation Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Definition 8. Behavioral Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Definition 9. Input Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Definition 10. I/O Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Definition 11. “Knows”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Definition 12. Optimality Equations for Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Definition 13. Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Definition 14. Optimality Equations for Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Definition 15. Truth Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Definition 16. Belief axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Definition 17. Belief Space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .34
Theorem 2. Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Definition 18. Bayesian Input Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Definition 19. Induced Input Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .39
Theorem 3. Convergence of Predictive Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Definition 20. Prefix Free . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Definition 21. Prefix Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Theorem 4. Kraft-McMillan Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
xv
LIST OF DEFINITIONS AND RESULTS
Definition 22. Axioms of Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Theorem 5. Utility Gain ↔ Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Definition 23. Free Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Theorem 6. Variational Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Definition 24. Bounded Subjective Expected Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Definition 25. Primitive Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Definition 26. Causal Axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Definition 27. Causal Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Definition 28. Induced Belief Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Theorem 7. Induced Belief Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Definition 29. Intervention. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .98
Definition 30. Bayesian Output Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Definition 31. Bayesian I/O Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Definition 32. Induced I/O Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Theorem 8. Induced I/O Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Definition 33. Divergence Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Definition 34. Sub-Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Definition 35. Bounded Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Theorem 9. Lower Bound of True Posterior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Definition 36. Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Theorem 10. Not in Core ⇒ Vanishing Posterior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Definition 37. Consistent Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Theorem 11. Convergence of Bayesian Control Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
xvi
Preface
Artificial intelligence is a fascinating field. It is the only field of engineering that deals
with your most familiar possession—your mind! How do we create intelligence? And,
by the way, what is it?
When I first read Marcus Hutter’s Universal Artificial Intelligence (2004a) I was
astonished—it is the optimal solution to the artificial intelligence problem! The AIXI
agent defined therein optimally adapts to any (computable) situation: in particular, it
will make predictions about the weather, play chess, solve mazes, drive cars, discover
physical theories, solve IQ tests and predict future stock prices. Even though, admit-
tedly, AIXI cannot be implemented because it is uncomputable, one should still be able
to devise a down-scaled approach to create primitive intelligence. Or maybe not?
Well, down-scaling turns out to be a very difficult task. The literature is rich of
(implicit) “approximations” to the optimal solution, and probably there is at least one
new method being conceived every week. Many of these approximations resort to ad-
hoc methods that deviate considerably from the theory of rationality, and hence one
ends up asking oneself whether they can be justified at all from the point of view of the
theory. Moreover, it is a fact that all these state-of-the-art approximations either only
work in very restricted problem domains or, if they aim to be more general, cannot
cope with more than simple toy-problems. The computational complexity of these
approximations are just way too high. This is disappointing.
However, spiders build webs, ants and bees successfully navigate complex terrains
to collect food. Simple organisms display intelligence levels that seem inexplicable given
their limited resources—at least from the point of view of the theory of rationality.
Rather than starting a search for a “better” approximation, and driven by my
dampened optimism about the theory, I wanted to find out whether there is more
to rationality than we actually know. This necessarily implies investigating the well-
established foundations of decision making and learning that we nowadays take for
granted. It turns out that one can point out at least two fundamental shortcomings: the
causal precedence of the choice of the policy and the implicit computational bottleneck,
both of them imposed by the very theory of rational decision making!
Hence, my research program for the past four years has been to understand where
the limitations imposed by the theory of rationality arise, and to formulate a framework
of bounded rationality that includes classical rationality as a limit case.
The goal of this thesis is to present the foundations of classical rationality along with
xvii
0. PREFACE
a synthesis of my work during my graduate studies at the Department of Engineering
of the University of Cambridge. Most of the effort in writing this thesis has been
allocated in presenting the material and results in the most concise and simplest way I
could (hopefully, with moderate success), especially because the fundamental ideas are
very simple! In other words: one brief and coherent story, no fancy mathematics, and
to drop (admittedly with sadness) the topics that do not add anything substantial to
the main message.
Structure of this Thesis
The thesis contains ten short chapters and is divided into two parts: Foundations of
Classical Agency and Resource-Bounded Agency. The two parts also happen to roughly
subdivide the material into the preexisting literature and the original contributions,
respectively.
Chapter 1: Introduction. This chapter introduces the reader to the problem of
the design of autonomous agents.
Part I: Foundations of Classical Agency.
Chapter 2: Characterization of Behavior. The aim of this chapter is to famil-
iarize the reader with the basic notation & concepts that are necessary in order to
characterize behavior. In particular, it clarifies what is meant by a model of an au-
tonomous system, and introduces the first (and simplest) of the four models of this
thesis.
Chapter 3: Decision Making. The principle of maximum subjective expected util-
ity is the standard design principle for the construction of cybernetic systems. Where
does it come from and what is its justification? This chapter presents the underlying
theory.
Chapter 4: Learning. When the cybernetic system faces an unknown environment,
then it has to adapt to it by learning through experience. How is this learning pro-
cess formalized? This chapter reviews the standard framework for reasoning under
uncertainty, and applies it to the case of cybernetic systems.
Chapter 5: Problems of Classical Agency. This chapter highlights the main
drawbacks of classical agency.
Part II: Resource-Bounded Agency.
xviii
Chapter 6: Resources. What are resources and how are they formalized? This
chapter provides an information-theoretic answer to this question.
Chapter 7: Boundedness. Based on three intuitive axioms, one can derive a con-
version law between utilities and probabilities (and hence, information or resources).
This conversion law is then used to formulate a bounded subjective expected utility
that models decision-making under resource constraints.
Chapter 8: Causality. When a cybernetic system gathers inputs or generates out-
puts, its information state is updated. However, inputs and outputs have different
updates. While the first is well modeled by the learning framework, the latter requires
the consideration of causal constraints. This chapter introduces a framework to deal
with both inputs and outputs.
Chapter 9: Control as Estimation. This chapter combines the theory developed
in the two preceding chapters to formulate a new law for adaptive control. Furthermore,
a convergence result for a very restricted case is provided.
Chapter 10: Discussion. This chapter briefly discusses the results and concludes.
Declaration of Originality
This dissertation is my own work and contains nothing which is the outcome of work
done in collaboration with others, except as specified in the text and Acknowledge-
ments.
xix
0. PREFACE
xx
Chapter 1
Introduction
In artificial intelligence and control theory, a major field of ongoing research is the
design of complex autonomous systems1 interacting with unknown environments. Good
designs are not easy to achieve because they must consider and integrate solutions to
many different problems. Apart from the purely physical sensory-motor challenges in
the case of embodied agents, a well-designed agent must cope with several information
processing problems at the same time. From these, learning, decision making, and the
limited amount of (computational) resources play the major roles.
The first formal treatment of autonomous systems can be traced back to the ideas of
the field of cybernetics in the first half of the twentieth century (Rosenblueth, Wiener,
and Bigelow, 1943). Since then, and especially after the development of cheap and
fast computers during the decade of 1980, artificial intelligence and control theory
have evolved into rich and fruitful areas of research. Nowadays, the ideas that have
emerged from both fields are present everywhere, ranging from applications in large-
scale industrial manufacturing to small embedded systems and single-user software.
The theoretical foundations for the design of optimal autonomous systems for a
given problem class is well-developed and widely accepted. It is even well under-
stood how to characterize optimal universal autonomous systems that can adapt to any
(computable2) environment (Hutter, 2004a). However, the theory has not yet led to
real-world implementations of complex autonomous systems. Even though recently
promising approximations have been developed (Veness, Ng, Hutter, and Silver, 2010;
Veness, Ng, Hutter, W., and Silver, 2011), the problem remains essentially unresolved
in practice. The main reason for this failure is simple: the computational complexity of
constructing such a system is prohibitive.
The aim of this thesis is to review the theoretical foundations of agency, to identify
its shortcomings, and to propose a mathematical formalization of resource-bounded
agency. This is a necessary enterprise given the current state-of-the-art: virtually all
1Originally, autonomous systems were called cybernetic systems.
2The term computable refers to functions that are the formalized analogue of the intuitive notion of
an algorithm. The Church-Turing thesis postulates that computable functions are exactly the functions
that can be calculated using a mechanical calculation device given unlimited amounts of time and
storage space.
1
1. INTRODUCTION
of the research community centers its efforts around the design of increasingly more
efficient algorithms that are approximations to the “gold standard” dictated by the
theory. There is no consensus, however, about how to implement, how to rationalize,
nor how to compare such approximations.
1.1 Historical Remarks & References
Cybernetics was pioneered by Rosenblueth et al. (1943) and then established as a field by
Wiener’s book Cybernetics (Wiener, 1965). The field later evolved to what is now known as
control theory, where purposeful behavior is conceived as the minimization of the “error” be-
tween the current state and the goal state. This field, mainly driven by electrical engineers,
developed a rich theory of continuous, mainly linear, control systems. Nevertheless, the mathe-
matics used in control theory also imposed limitations to the kind of autonomous systems that
could be conceived. As a response to this, J. McCarthy, M. Minsky, C. Shannon, A. Turing,
R. Solomonoff, and other researchers gave birth to the field of artificial intelligence (AI)—“the
science and engineering of making intelligent machines”. Inspired by the work of McCulloch
and Pitts (1943), AI researchers embraced the idea of duplicating human faculties like creativity
and self-improvement.
Between the 1950s and 1970s, AI grew into a rich and fruitful field with many very di-
verse approaches, mainly centered around solving restricted problem domains (a paradigm now
called narrow AI ). Later, during the 1980s, the rapid development of cheap and fast computers
paved the way for the adoption of advanced statistical and control-theoretic techniques, thus
revitalizing the interest for the problem of the design of complex autonomous systems. Perhaps
the most influential modern books, presenting many of the disparate and isolated ideas in AI
and related fields in a unified way, are Russell and Norvig (2009) and Nilsson (1998). Another
modern approach that has considerably grown in popularity is reinforcement learning (Sutton
and Barto, 1998). In this setup, an autonomous system learns what to do from a feedback signal
(in addition to the observation signal) issued by the environment. The system system uses this
feedback signal to evaluate its performance. Within the reinforcement learning paradigm, uni-
versal artificial intelligence has established itself as a rigorous top-down approach to AI (Hutter,
2004a; Legg, 2008), combining ideas from sequential decision making with universal prediction
(Solomonoff, 1964). This work presents an optimal (though uncomputable) autonomous system
(called AIXI) that is universal with respect to the class of computable environments.
2
Part I
Foundations of Classical Agency
3

Chapter 2
Characterization of Behavior
An autonomous system is anything that has an observable input and output (I/O)
stream, like a calculator, a human cell, an animal, a computer program or a robot.
In other words, it is anything having a boundary defining what is inside and what is
outside and acting as an interface to communicate with its environment. This definition
is symmetrical: the environment of an autonomous system is an autonomous system
too.
A autonomous system communicates with its environment by interacting with it.
This interaction consists in the exchange of symbols, generated in order to influence
each other. The generation of these symbols is governed by (stochastic) rules that
make up the behavior of the autonomous system.
There are certain behaviors that are more desirable than others. Artificial intel-
ligence and control theory can be seen as the design and analysis of the behavior of
autonomous systems. The goal of this chapter is to introduce mathematical formal-
izations of behavior that should serve as a basis for the study of systems and their
interactions.
2.1 Preliminaries
We first establish some notation, conventions and basic concepts.
2.1.1 Basic Notation
A set is denoted by a calligraphic letter like X and consists of elements or symbols.
We use N = {1, 2, 3, . . .} for the set of natural numbers, and R for the set of real
numbers. Strings are finite concatenations of symbols. The empty string is denoted
by ǫ. X n denotes the set of strings of length n based on X . For substrings, the
following shorthand notation is used: a string that runs from index i to k is written as
xi:k := xixi+1 . . . xk−1xk. Similarly, x≤i := x1x2 . . . xi is a string starting from the first
index. By convention, xi:j := ǫ if i > j. Logarithms are always taken with respect to
base 2, thus log(2) = 1, unless written explicitly ln(x), in which case we mean natural
5
2. CHARACTERIZATION OF BEHAVIOR
logarithms. The symbol P(X ) denotes the powerset of X , i.e. the set of all subsets
of X .
2.1.2 Probabilities & Random Variables
To simplify the exposition, all probability spaces are assumed to be finite unless clearly
stated otherwise. While this assumption limits the domain of application of the expo-
sition, it allows isolating the problems belonging solely to the design of autonomous
systems from the problems that arise due to infinite sets (and in particular, from the
problems of geometrical or topological assumptions). Due to this, we clarify some
terminology.
A sample space is a finite set Ω, where each member ω ∈ Ω is called a sample
or outcome. A subset A ⊂ Ω is called an event. A subset F ⊂ P(Ω) of events
that contains Ω and is closed under complementation and finite union is called an
algebra. A measurable space is a tuple (Ω,F), where Ω is a sample space and F
is an algebra. Given a measurable space (Ω,F), a set function Pr over F is called a
probability measure iff it obeys the (Kolmogorov) probability axioms
K1. for all A ∈ F , Pr(A) ∈ [0, 1];
K2. Pr(∅) = 0 and Pr(Ω) = 1;
K3. for all disjoint A,B ∈ F , Pr(A ∪B) = Pr(A) +Pr(B).
A probability space is a tuple (Ω,F ,Pr) where (Ω,F) is a measurable space and Pr
is its measure. Given a probability space (Ω,F ,Pr), a random variable is a function
X : Ω → X mapping each outcome ω into a symbol X(ω) of a set X , and where
X−1(x) ∈ F for all x ∈ X . The probability of the random variable X taking on the
value x ∈ X is defined as Pr(x) := Pr(X = x) := Pr({ω ∈ Ω : X(ω) = x}).
2.2 Models of Autonomous Systems
If one wants to characterize the way an autonomous system behaves, it is necessary to
fully describe the rules governing its potential I/O stream. The mathematical descrip-
tion of an autonomous system’s behavior can be done at several levels, ranging from
very detailed physical descriptions (e.g. in the case of a robot, it would require specify-
ing its gears, distribution of current, sensors and effectors, etc.) to abstract statistical
descriptions. This thesis deals exclusively with statistical descriptions. In particular,
during the course of this thesis, two levels of description will be discussed: namely, the
behavioral level and the belief level.
1. Behavioral Level. The behavioral level specifies the actual I/O behavior, i.e. the
probabilities of making observations and issuing actions of the autonomous sys-
tem. These probabilities have to be specified for every possible information state
of the system, meaning that they must characterize the system’s I/O statistics
6
2.2 Models of Autonomous Systems
for every possible sequence of past I/O symbols. In Chapter 6, we will link this
description with the amount of resources (measured in thermodynamic work) that
the autonomous system has to spend during its interactions. The behavioral level
will be introduced in two parts: first, by specifying the output model and then
by completing it with an input model.
(a) Output Model. The output model specifies how the autonomous system gen-
erates its outputs given the past I/O symbols. This is the minimal statistical
description of an autonomous system’s behavior. However, it does not ex-
plain the purpose of the system, i.e. it does not explain what it tries to
accomplish. This model will be introduced in this chapter.
(b) Input Model. The autonomous system predicts its input stream using the
input model. This prediction model represents the assumptions that the
system makes about its environment. We will see in Chapter 3 that these as-
sumptions are necessary in order to formalize the purpose of the autonomous
system. Furthermore, we will argue in Chapter 6 that this model is also
necessary in order to characterize the thermodynamical work spent in inter-
actions. The input model is introduced in the next chapter.
2. Belief Level. In this thesis we are mainly interested in modeling adaptive au-
tonomous agents. However, specifying adaptive agents by directly writing down
their behavioral models is difficult. To simplify the description of adaptive agents,
one first starts out specifying a belief model, which is a high-level auxiliary model
characterizing the beliefs, assumptions and uncertainties of an agent. Subse-
quently, one simply derives the low-level behavioral model from the belief model.
This is similar to writing programs in a high-level programming language that
is then compiled into low-level machine code. As in the case of the behavioral
level, this belief level too will be introduced in two parts: first, by introducing
the Bayesian input model and then by introducing the Bayesian output model.
(a) Bayesian Input Model. When the autonomous system does not know its
environment, then the designer can use a Bayesian approach to model this
uncertainty. This Bayesian input model is currently the standard in the
adaptive control literature. It is introduced in Chapter 4.
(b) Bayesian Output Model. In addition to modeling the uncertainty an au-
tonomous system has over its environment, a designer can also model the
uncertainty the system has over its own policy. This leads to a Bayesian
output model that is analogous to the Bayesian input model. However, we
will see that this generalization is not straightforward, as it will require a
careful distinction between the technical treatment of actions and observa-
tion arising due to causal constraints. This model is an original contribution
of this thesis, and it will be introduced in Chapter 8, Part II.
A schematic illustration of this setup is given in Figure 2.1. Other aspects that are
important in the characterization of autonomous systems are the following:
7
2. CHARACTERIZATION OF BEHAVIOR
Output Model Input Model
Bayesian Input ModelBayesian Output Model
I/O Model
Bayesian I/O Model
(policy) (environment)
(uncertain environment)(uncertain policy)
Behavioral Level
Belief Level
Figure 2.1: Behavioral Models. Dependencies are indicated by arrows.
1. I/O Domains: The choice of the cardinality, geometry and topology of the I/O
domain can have a significant impact on the description, the implementation and
the performance of an autonomous system. In navigation devices for instance,
the usage of continuous I/O sets is vital for its robustness. Furthermore, choosing
and modeling continuous I/O sets appropriately can be a very challenging task.
However, in this thesis we will exclusively deal with finite I/O sets, because
arguably the core problems of the design of autonomous agents are already present
even in this simple setup.
2. Interaction Protocol: The interaction protocol specifies all the details concerning
the rules and the timing of the communication between two autonomous systems.
There are applications where de-synchronized interactions with variable timing
play a crucial role, like e.g. in moving artillery or in tennis. In this thesis it is
assumed that interactions occur in discrete time steps where autonomous systems
alternately take turns to generate a symbol. While in many cases, this interac-
tion protocol is flexible enough to accommodate other interaction regimes (e.g.
simultaneous interactions, discrete approximations of continuous time, etc.), it is
not clear whether the theory would significantly change under other interaction
protocols.
3. Multiple Autonomous Systems: This thesis deals exclusively with situations hav-
ing only two interacting autonomous systems. However, there are many situa-
tions, especially in economics, where one would prefer modeling the behavior of a
population of autonomous systems. If the population is significantly large, then
a more coarse-grained description of behavior could be beneficial: e.g. treating
8
2.3 Output Model
entire sub-populations as if they were individual autonomous systems; or using a
completely different description modeling emergent properties. This view might
be especially relevant to the understanding of decentralized behavior for instance.
2.3 Output Model
In the following an autonomous system’s behavior is formalized as a conditional prob-
ability measure over I/O sequences over I/O alphabets. We start with basic definitions
of interactions.
Definition 1 (Interactions) The possible I/O symbols are drawn from two finite
sets. Let O denote the set of observations and let A denote the set of actions.
The set Z := A × O is the set of interactions. The interaction string of length 0 is
denoted by ǫ. Let T ∈ N be the horizon, i.e. maximum length of interaction strings.
Let Z⋄ := ⋃Tt=0Zt denote the set of interactions up to length T . We also underline
symbols to glue them together as for example in ao≤2 = a1o1a2o2. 2
We assume that there are two autonomous systems P and Q. By convention,
we assume that P is the agent i.e. the autonomous system to be constructed by the
designer, and thatQ is the environment1, i.e. the autonomous system to be controlled
by the agent.
Agent and environment operate obeying the following interaction protocol. The in-
teraction proceeds in cycles t = 1, 2, . . . , T . In cycle t, the agent P generates an action
at conditioned on the past I/O symbols ao<t. The action is observed by both systems.
Then, the environment Q responds by generating an observation ot conditioned on the
past ao<tat. The observation is observed by both systems as well. Then, the next cycle
starts. This interaction protocol is fairly general, and many other interaction protocols
can be translated into this scheme. Figure 2.2 illustrates this setup.
The definitions of the agent’s and the environment’s output model follow.
Definition 2 (Output Model) An output model of an agent is a set of conditional
probabilities
P(at|ao<t), for all ao≤t ∈ Z⋄,
inducing a unique probability measure P over AT conditioned on OT given by
P(a≤t|o<t) :=
t∏
τ=1
P(aτ |ao<τ ).
Similarly, an output model of an environment is a set of conditional probabilities
Q(ot|ao<tat), for all ao≤t ∈ Z⋄,
1Agent and environment are the terms commonly used in the artificial intelligence literature. In the
control literature, the two autonomous systems are known as controller and plant respectively.
9
2. CHARACTERIZATION OF BEHAVIOR
a1 a2 a3 a4 aTo1 o2 o3 oT
Agent
P
Envi-
ronment
Q
Figure 2.2: An interaction system. The agent P and the environment Q define a probability
distribution over interaction sequences Z⋄.
inducing a unique probability measure Q over OT conditioned on AT given by
Q(o≤t|a≤t) :=
t∏
τ=1
Q(oτ |ao<τaτ ).
2
Thus, the output model for an agent is a probability distribution over the next action
given the past, and the output model for an environment is a probability distribution
over the next observation given the past. This type of specifications are called stream
probabilities. The asymmetry in the definitions should not distract the reader from
the fact that these choices are purely conventional. Output models are graphically
represented as trees (Figure 2.3).
When an agent and an environment are coupled, they uniquely define a probability
measure over the I/O strings Z⋄, i.e. the probability law governing the actual produc-
tion of I/O symbols. This probability measure is given by the generative probability
distribution defined next.
Definition 3 (Generative Probability Measure) LetP andQ be the output model
of an agent and an environment respectively. The generative probability measure
G over Z⋄ is defined by the conditional probabilities
G(at|ao<t) := P(at|ao<t)
G(ot|ao<tat) := Q(ot|ao<tat)
valid for all ao≤t ∈ Z⋄. 2
The generative probability measure G is the objective probability law from which
the interaction sequences are sampled. As such, it is the probability law that the
10
2.3 Output Model
1
2
1
2
0 1 14
3
4
3
4
1
4
1
3
2
3
a1
o1
a2
o2
Figure 2.3: An Output Model. Behavioral models are naturally represented as trees. The
figure illustrates an output model for an agent (i.e. a distribution over actions given past I/O
symbols) for binary I/O sets A := O := {0, 1}. In the tree there are three types of nodes.
Square nodes () are action nodes, round nodes (©) are observation nodes and leaves (•) are
full I/O histories. A transition from a parent node to a child node on the left corresponds
to choosing symbol 0, and a transition to a child node on the right corresponds to choosing
symbol 1. In this case, only the transition probabilities in the action nodes are specified because
observations are just conditionals. The probability of choosing symbol at in the node reached
by following the path ao<t starting from the root node is given by P(at|ao<t). For instance,
P(a2 = 1|a1 = 0, o1 = 1) = 34 .
11
2. CHARACTERIZATION OF BEHAVIOR
designer should use in order to assess the objective expected performance of a system.
However, in practice this probability distribution is unknown because the environment
is unknown. In this case, the designer can assume a “subjective” probability measure
over the I/O stream, i.e. a probability law that is thought to hold, and then use it to
compute “subjective” expectations. This will be clarified in the next chapter.
12
Chapter 3
Decision Making
The output model introduced in the previous chapter contains all the information
to characterize an autonomous system’s behavior. However, it is not very useful for
explaining what it tries to accomplish, i.e. it does not provide any insight about its
purpose. One would like to have a framework that allows justifying the choice of an
autonomous system’s behavior. The standard framework for conceptualizing purpose
is the theory of subjective expected utility (SEU) developed by Savage (1954).
3.1 Subjective Expected Utility
The theory of SEU is a framework dealing with decision making under uncertainty, i.e.
a situation where a choice does not uniquely determine the outcome. We can motivate
this as follows. Suppose you are running late to get to your work. You have the option
to either wait for the bus, or run to your work. These choices are called “acts” in SEU.
However, the bus can be unreliable, and there is a possibility of the bus not being on
time. These uncertain situations are called “states” in SEU. Depending on whether you
decide to wait for the bus or run to your work and whether the bus is on time, there
are three possible “outcomes”: either you are on time but tired (because you decided
to run); or you are on time and active (you waited for the bus and it arrived on time);
or you are late and active (you waited for the bus but it came too late). This decision
problem is summarized in Table 3.1.
Act
Wait for bus. Run to work.
State
Bus is late. Late and active. On time and tired.
Bus is on time. On time and active. On time and tired.
Table 3.1: A Decision Problem with Uncertainty.
SEU theory postulates that rational decision makers can be characterized by their
13
3. DECISION MAKING
preferences over acts (and mixtures over acts), and that these preferences obey certain
structural and consistency rules. For instance, you might prefer to “wait for the bus”
over “either waiting for the bus or running to work with equal probabilities”. Intu-
itively, it seems plausible to say that this preference reveals something about both your
subjective belief in the bus being on time and your subjective desire of getting to your
job active. This is precisely the idea underlying SEU: if preferences obey certain ratio-
nality axioms, then these preferences reveal the subjective beliefs in events happening
and the subjective desirability of the outcomes.
3.1.1 Setup
SEU theory is formalized as follows. Let S be a set of states, C be a set of conse-
quences and F be a set of acts (that is, mappings from the set of states S into the
set of consequences C). This is illustrated in Figure 3.1. The intuitive meaning is as
follows: an act is a choice available to the decision maker; a consequence is a possible
result of an act; and a state is a compilation of facts about the world that the decision
maker is uncertain about but uniquely determines the consequence of a chosen act. A
subset of states A ⊂ S is an event. Note that the states of world S are mutually
exclusive and complete, i.e. exactly one state s ∈ S will occur.
S C
s x
f ∈ F
Figure 3.1: Setup of Subjective Expected Utility.
3.1.2 Rationality
A decision maker is characterized by a preference relation < on the set of acts F ,
i.e. < is transitive and complete. For any pair of acts f, g ∈ F , the expression f < g
means that “the decision maker prefers f over g”. From <, one constructs the strict
preference relation ≻ and the indifference relation ∼ defined as
f ≻ g ⇔ (f < g) and not (g < f),
f ∼ g ⇔ (f < g) and (g < f),
and having the obvious meaning. The decision maker is called rational if its preference
relation < fulfills the axioms of rationality presented further down. Some definitions
are needed before.
Definition 4 (Conditional Preference) f is preferred over g given A ⊂ S, writ-
ten f <A g, iff f ′ < g′ where f = f ′ and g = g′ on A and f ′ = g′ on Ac. 2
14
3.1 Subjective Expected Utility
Intuitively, f <A g means that f is preferred over g if we disregard all the states
outside of A. The auxiliary functions f ′ and g′ are constructed such that comparing f ′
and g′ unconditionally will say only something about how f and g compare within A.
This is because f ′ and g′ are equal to f and g within A respectively, but are indistin-
guishable from each other outside of A. The relations ≻A and ∼A are constructed in
the obvious way from <A.
Definition 5 (Null Event) A subset A ⊂ S is said to be null iff for all f, g ∈ F , f
is indifferent over g given A, i.e. f ∼A g. 2
Null sets correspond to the events where the decision maker is indifferent between
all of his available choices. They will turn out to be the events having zero probability.
Definition 6 (Constant Act) An act f is said to be constant iff for all s ∈ S,
f(s) = x ∈ C for some x ∈ C. In this case, we also simply write x to denote the
constant act f . 2
Note that a constant act x is a “sure gamble”, because the consequence x is the
same irrespective of the state of the world s.
We first present the axioms of rationality together in one place and subsequently
briefly discuss their meaning. The following version corresponds to Kreps’ presentation
of the axioms (Kreps, 1988), but they are essentially the same as the original ones
introduced by Savage (1954).
Definition 7 (Rationality/Savage Axioms) Let S, C and F be a set of states, a
set of outcomes, and a set of acts respectively. A binary relation < over F is said to
be a rational preference relation iff it follows the axioms:
R1. < is transitive and complete.
R2. There exist x, y ∈ C such that x ≻ y.
R3. Let f, f ′, g, g′ ∈ F and A ⊂ S be such that
(a) f(s) = f ′(s) and g(s) = g′(s) if s ∈ A,
(b) f(s) = g(s) and f ′(s) = g′(s) if s /∈ A.
Then, f ≻ g iff f ′ ≻ g′.
R4. Let A ⊂ S be non-null and for all s ∈ A, f(s) = x and g(s) = y.
Then f ≻A g iff x ≻ y.
R5. Let x, y, x′, y′ ∈ C, f, g, f ′, g′ ∈ F , and A,B ⊂ S be such that
(a) x ≻ y and x′ ≻ y′,
(b) f(s) = x and f ′(s) = x′ for s ∈ A, and f(s) = y and f ′(s) = y′ for s ∈ Ac,
(c) g(s) = x and g′(s) = x′ for s ∈ B, and g(s) = y and g′(s) = y′ for s ∈ Bc.
Then, f ≻ g iff f ′ ≻ g′.
15
3. DECISION MAKING
R6. For all A ⊂ S,
(a) [f ≻A g(s) for all s ∈ A] implies f <A g,
(b) [g(s) ≻A f for all s ∈ A] implies g <A f .
R7. For all f, g ∈ F such that f ≻ g and for all x ∈ C,
there is a finite partition of S such that for every A in the partition,
(a) [f ′(s) = x for s ∈ A, f ′(s) = f(s) for s ∈ Ac] implies f ′ ≻ g,
(b) [g′(s) = x for s ∈ A, g′(s) = g(s) for s ∈ Ac] implies f ≻ g′.
Axiom R1 just states that < is a preference relation.
Axiom R2 is purely structural: it rules out the trivial situation where the decision
maker is indifferent between all the consequences.
Axiom R3 justifies talking about conditional preference. Consider the situation in
Figure 3.2 and the comparison of f with g. Essentially, Axiom R3 tells us that we only
need to worry about how they compare in A, since they agree on Ac. Thus, if there is
another pair f ′ and g′ that agrees on Ac and is identical to f and g respectively on A,
then we can conclude the preference order of f and g from f ′ and g′.
A Ac
f = f ′
g = g′
f = g
f ′ = g′
Figure 3.2: The setup in Axiom R3.
Axiom R4 tells us that utilities of outcomes are not state-dependent. That is, if
the decision maker strictly prefers an outcome x over y (and thereby strictly prefers a
constant act x over y), then knowing that the (non-null) event A obtains preserves the
strict preference order.
Axiom R5 essentially tells us that acts cannot affect probabilities. Let the “prizes”
(i.e. consequences) x and y be a “win” and a “loss” respectively. Consider two “gam-
bles” (i.e. acts) f and g over these outcomes having different winning odds, represented
by the different winning sets of states A and B respectively, as depicted in Figure 3.3 a.
Then, if we prefer f over g, then it is because the odds in f are more favorable than
in g. Hence, replacing the prizes x and y by x′ and y′ having the same order (i.e.
x′ ≻ y′) will preserve the preference order over the resulting gambles (i.e. f ′ ≻ g′, see
Figure 3.3 b).
Axiom R6 is the sure-thing principle. Consider for instance the situation depicted
in Figure 3.4, where the consequences in C are linearly ordered in increasing order of
preference. The acts f and g map the possible states in A into the sets of consequences
16
3.1 Subjective Expected Utility
(a)
(b)
f = g =
f ′ = g′ =
xx
yy
x′x′
y′y′
A
A
Ac
Ac
B
B
Bc
Bc
Figure 3.3: Example illustrating Axiom R5.
f(A) and g(A) respectively. If an act f is strictly preferred over any “sure gamble”
g(s) with s ∈ A, then f is preferred over any “randomizing gamble” g over A by the
force of Axiom R6.
g(A) f(A)
g(s), s ∈ A
C
Figure 3.4: Example illustrating Axiom R6.
Axiom R7 is a “continuity” axiom. Consider two acts f, g such that f ≻ g and
an arbitrary constant act x. Then, by Axiom R7, there is always a (sufficiently fine-
grained) finite partition of S, such that all mappings f ′ constructed from f by replacing
a small “patch” of the partition by the constant act x (Figure 3.5) are still strictly
preferred over g, i.e. slightly changing f into f ′ will still preserve the preference relation
f ′ ≻ g. Similarly, slightly changing g into g′ will preserve the preference relation f ≻ g′.
17
3. DECISION MAKING
f
x =: f ′
S
Figure 3.5: Example illustrating Axiom R7.
3.1.3 Representation Theorem
The main result of SEU is that any rational decision maker can be thought of as
evaluating an expected utility. Formally:
Theorem 1 (Expected Utility Representation Theorem) Let < be a preference
relation on F . Then the following two conditions are equivalent:
i. < satisfies axioms R1-R7.
ii. There exists a unique, nonatomic, finitely additive, probability measure P on
S such that P(A) = 0 iff A is null, and a bounded, unique up to a positive
affine transformation, real-valued function U on C such that, for all acts f and g,
f < g iff ∫
S
U(f(s)) dP(s) ≥
∫
S
U(g(s)) dP(s).
In this expression, the notation
∫
S φ(ω) dP denotes the Lebesgue-integral of a function
φ over the set of states S with respect to the measure P. 2
Remark 1 The rationality axioms R1–R7 imply that the underlying probability space
is uncountably infinite. This is a common technical assumption that is used to obtain
uniqueness results. For our purposes, we will assume that the utility function U and
the probability measure P are given, and think about them conceptually as resulting
from such a process of comparing acts. 2
Put differently, if a decision maker’s preference relation < is rational, then the the-
orem justifies the existence of two (essentially) unique subjective entities: a probability
measure P representing the plausibilities of the states of the world, and a utility func-
tionU representing how desirable the consequences are. Moreover, the decision maker’s
choice behavior can then be formalized as maximizing the expected utility. Conversely,
having a probability measure P and a utility function U, and basing choices on the
maximization of the expected utility, implies that the decision-maker has a rational
preference relation <.
18
3.2 The Maximum Subjective Expected Utility Principle
3.2 The Maximum Subjective Expected Utility Principle
An autonomous agent is a decision maker. Therefore, equipped with the theory of
SEU, one can construct the behavior of a rational autonomous agent by maximizing an
expected utility. This leads to a construction method called the maximum subjective
expected utility principle (abbreviated maximum SEU principle). But before we can
apply SEU theory to interaction systems, we need to develop a formal correspondence
between the language of SEU theory and the language of autonomous systems. This
is the aim of the following section. While the concepts that we will present in the
following are well-known in the literature, their formal correspondence to SEU theory
is an original contribution of this thesis.
3.2.1 SEU in Autonomous Systems
In the SEU framework, an act is a deterministic mapping of a states of the world into
consequences—there seems to be no randomness at all. This is not so: the randomness
arises because the decision maker has imperfect knowledge about the state of world,
represented by his subjective probability measure. Similarly, here we will assume that
an autonomous system with random behavior can be represented as a collection of
deterministic autonomous systems and a probability measure over them. If an au-
tonomous system is deterministic, then it can be represented as a function mapping a
past I/O string into an I/O symbol (Figure 3.6).
π
a1
o1
a2
o2
π
ao<t π(ao<t)
ǫ 1
0 0 1
0 1 1
1 0 0
1 1 1
Figure 3.6: A behavioral function π is most naturally represented as a transition tree. Inter-
actions (A = O = {0, 1}) start from the root node. In the figure, choices are highlighted with
solid edges. Note that π only chooses transitions in the at-levels, not in the ot-levels. The table
specifies the function π.
Definition 8 (Behavioral Function) A behavioral function of an agent is a map-
ping π : Z⋄ → A of histories ao<t into actions at = π(ao<t). Similarly, a behavioral
function of an environment is a mapping λ : Z⋄ × A → O of histories ao<tat into
observations ot = λ(ao<tat). 2
19
3. DECISION MAKING
Let Π and Λ be the set of behavioral functions of agents and environments respec-
tively. It is easily seen (Figure 3.7) that each choice of a pair (π, λ) ∈ Π×Λ determines
a unique interaction string a∗o∗≤T ∈ ZT as
a∗t = π(a
∗o∗<t) and o
∗
t = λ(a
∗o∗<ta
∗
t ). (3.1)
Put differently, choosing (π, λ) also chooses a unique interaction string in ZT . For
each π ∈ Π, define the agent functional φπ : Λ → ZT as the functional mapping a
behavioral function of an environment λ ∈ Λ into the interaction string φπ(λ) ∈ ZT
defined by (3.1). Let Φ := {φπ|π ∈ Π} be the set of these agent functionals.
π ∈ Π λ ∈ Λ (π, λ)
a1
o1
a2
o2
Figure 3.7: Choosing behavioral functions of an agent and of an environment determines a
unique interaction string.
With these definitions, one can establish the following correspondence to SEU:
States S ←→ Λ Beh. Func. of Environments
Consequences C ←→ ZT Interaction Strings
Acts F ←→ Φ Agent Functionals
Additionally, we assume that we are given a subjective probability measure P over Λ
and a subjective utility function U over ZT . This correspondence sets up a scheme
where we can determine whether an agent functional φπ1 is preferred over an agent
functional φπ2 by comparing their subjective expected utilities:
φπ1 < φπ2 ⇐⇒
∑
λ∈Λ
P(λ)U(φπ1(λ)) ≥
∑
λ∈Λ
P(λ)U(φπ2(λ)).
This in turn allows us extending the preference relation over agent functionals to a
preference relation over behavioral functions of agents by defining
π1 < π2 ⇐⇒ φπ1 < φπ2 .
Finally, our last step of the correspondence links the previous notions of behavioral
functions to stream probabilities. Because a behavioral function λ ∈ Λ of an envi-
ronment specifies how to choose the observation ot ∈ O for any past ao<tat, having a
20
3.2 The Maximum Subjective Expected Utility Principle
probability measure P over Λ induces a set of conditional probabilities
P(ot|ao<tat) :=
∑
λ∈Λ
P(λ) δλ(ot|ao<tat),
where
δλ(ot|ao<tat) :=
{
1 if ot = λ(ao<tat),
0 otherwise.
In a similar fashion, choosing a behavioral function of an agent π ∈ Π induces a set of
(degenerate) conditional probabilities
P(at|ao<t) :=
{
1 if at = π(ao<t),
0 otherwise.
Note how we have defined a way of extending the probability measure P over Λ to a
probability measure over the interaction strings Z⋄. Importantly, it is worth stressing
that fixing π ∈ Π corresponds to choosing an output model P(at|ao<t) (Definition 2).
The overall conclusion is as follows. If an autonomous system is rational, then it
chooses its output model P(at|ao<t) such that is maximizes the expected utility∑
ao≤T
P(ao≤T )U(ao≤T ), (3.2)
where the utility functionU over ZT and the set of conditional probabilitiesP(ot|ao<tat)
are given. Note that the probability measureP over ZT is obtained from the conditional
probabilities P(at|ao<t) and P(ot|ao<tat) using the product rule for probabilities. This
construction method is the well-known maximum SEU principle for autonomous
systems (Russell and Norvig, 2009).
Remark 2 Solving (3.2) by choosing the optimal behavioral function π ∈ Π and by
choosing the optimal output model P(at|ao<t) are not exactly the same. This is be-
cause choosing π always leads to deterministic conditional probabilities for the output
model, while directly choosing the output model allows choosing even non-deterministic
output models. However, these optimal non-deterministic output models always arise
as arbitrary mixtures of optimal deterministic output models, and thus are not “truly
probabilistic”. In particular, one can always choose an optimal deterministic output
model. This will become clear in Section 3.2.3. 2
3.2.2 I/O Model
In the previous section we have seen that purposeful behavior entails holding “implicit
beliefs” about the input stream which are formalized by a set of conditional probabilities
P(ot|ao<tat). This conditional probability measure will turn out to be a fundamental
part of the information-theoretic characterization of autonomous agents. The next
definition extends the output model introduced in Chapter 2 with a model of the input
stream.
21
3. DECISION MAKING
Definition 9 (Input Model) An input model of an agent is a set of conditional
probabilities
P(ot|ao<tat) for all ao≤t ∈ Z⋄.
inducing a unique probability measure P over OT conditioned on AT . 2
Remark 3 Note that an input model for an agent is formally equivalent to an output
model of an environment. 2
Definition 10 (I/O Model) An I/O model of an agent is an output model paired
with an input model, i.e. a set of conditional probabilities
P(at|ao<t) and P(ot|ao<tat) for all ao<t ∈ Z⋄.
inducing a unique probability measure P over ZT given by
P(ao<tat) := P(ao<t)P(at|ao<t) and P(ao≤t) :=
t∏
τ=1
P(aτ |ao<τ )P(oτ |ao<τaτ ).
2
The intuitive interpretation of the I/O model is as follows. The P(at|ao<t) are the
conditional probabilities that the autonomous system follows to generate actions, i.e.
they constitute propensities. It is common to call the output model the policy of
the I/O model. In contrast, the P(ot|ao<t) are the conditional probabilities that the
autonomous system uses to represent its assumptions, beliefs and/or predictions about
the observations, i.e. they constitute plausibilities. It is common to call the input
model the predictor of the I/O model.
Definition 11 (“Knows”) An agent P (a I/O model) is said to know its environ-
ment Q (a output model) iff
P(ot|ao<tat) = Q(ot|ao<tat) for all ao≤t ∈ Z⋄. 2
In other words, an agent knows its environment if it perfectly predicts its behavior.
Although this assumption is rarely met in practice, it is often a useful approximation.
An agent that is constructed using the maximum SEU principle is said to be optimal
(with respect to its input model).
3.2.3 Bellman Optimality Equations
The variational problem in (3.2) can be formulated as recursive system of equations
collectively known as Bellman optimality equations (Bellman, 1957). Given a
utility function U over ZT and given a set of conditional probabilities P(ot|ao<t),
the problem of finding the optimal output model can be graphically represented as a
decision tree (see Figure 3.8).
There are two equivalent ways of conceptualizing this decision problem. The first
is to find a policy such that, when executed, it will eventually lead to an optimal
22
3.2 The Maximum Subjective Expected Utility Principle
1
2
1
2 0 1
1
4
3
4
3
4
1
4
1
3
2
3
1
2
1
2
1
3
2
3
1
2
1
2
1
4
3
4
1
2
1
2
a1
o1
a2
o2
U 4 −4 4 −4 6 −3 2 6 0 6 2 −2 5 1 1 −3
Figure 3.8: A Decision Tree. Square nodes () are action nodes, round nodes (©) are
observations nodes and leaves (•) are full I/O histories, which correspond to the consequences
of the decision problem. The transition probabilities in observation nodes are given by the
input model P(ot|ao<tat), and the utilities of the consequences by the U(ao≤T ). The problem
consists in choosing the transition probabilities in action nodes (i.e. the P(at|ao<t)) such that
the expected utility of the resulting I/O history is maximized.
“final state” having a maximum utility. The second is to find a policy such that,
when executed, collects the maximum expected sum of “instantaneous rewards” before
eventually reaching a final state.
Definition 12 (Optimality Equations for Utilities) Let U be a utility function
over ZT and let the P(ot|ao<t) be a set of conditional probabilities. Define the future
utility function F : Z⋄ → R recursively as
F (ao≤T ) := U(ao≤T )
F (ao<t) := maxat
∑
ot
P(ot|ao<tat)F (ao≤t),
for all ao<t ∈ Z⋄ with t ≤ T . Then, the solution to the variational problem in (3.2) is
the output model given by
P(at|ao<t) := δa
∗
t
at , where a
∗
t := argmaxat
∑
ot
P(ot|ao<tat)F (ao≤t). 2
Remark 4 If there are several maximizing actions, then one can choose the action
that lexicographically precedes all the other ones for definiteness. 2
It can be verified that the previous definition characterizes the optimal output
model. The Bellman optimality equations of Definition 12 suggest a solution algorithm
called Expectimax (Michie, 1966; Russell and Norvig, 2009) illustrated in Figure 3.9.
23
3. DECISION MAKING
1
2
1
2 0 1
1
4
3
4
3
4
1
4
1
3
2
3
1
2
1
2
1
3
2
3
1
2
1
2
1
4
3
4
1
2
1
2
a1
o1
a2
o2
U 4 −4 4 −4 6 −3 2 6 0 6 2 −2 5 1 1 −3
−2 2 0 4 4 0 2 −1
3 2
2 4 4 2
3
Figure 3.9: Solution of the Decision Tree from Figure 3.8. Starting from the leaves and
working up until reaching the root, nodes are successively labeled with either the expected
value of their children nodes (in observation nodes) or the maximum value of their children
nodes (in action nodes). The transitions chosen in the action nodes form the solution. Note
that the values written in action nodes and in the leaves correspond to the future utilities
(Definition 12), and that the differences of two subsequent future utilities correspond to the
rewards (Definition 14).
As anticipated previously, one can formulate this recursion in terms of collecting
“instantaneous rewards” in each interaction. This formulation is useful for modeling
situations where there is a feedback signal communicating the autonomous system how
well it is performing. We introduce the definition of rewards.
Definition 13 (Rewards) Let F : Z⋄ → R the future utility from Definition 12.
Then, define the reward function r as
r(aot|ao<t) := F (ao≤t)− F (ao<t)
for all ao≤t ∈ Z⋄ where t ≥ 1. 2
In a reward-based decision problem, the goal is to maximize the cumulative sum
over rewards, i.e. the quantity
T∑
t=1
r(aot|ao<t) = F (ao≤T )− F (ǫ).
Although rewards are mathematically formalized as differences in future utility, deci-
sion problems based on rewards are most naturally stated by directly specifying the
reward function, i.e. without explicitly deriving it from an overall utility function. The
associated optimality equations follow.
24
3.2 The Maximum Subjective Expected Utility Principle
Definition 14 (Optimality Equations for Rewards) Let r be a reward function
and let the P(ot|ao<t) be a set of conditional probabilities. Define the value function
V : Z⋄ → R recursively as
V (ao≤T ) := 0
V (ao<t) := maxat
∑
ot
P(ot|ao<tat)
(
r(aot|ao<t) + V (ao≤t)
)
,
for all ao<t ∈ Z⋄ with t < T . Then, the solution is the output model given by
P(at|ao<t) := δa
∗
t
at , where a
∗
t := argmaxat
∑
ot
P(ot|ao<tat)
(
r(aot|ao<t) + V (ao≤t)
)
.2
Remark 5 Note that the choice of the policy P(at|ao<t) causally precedes the real-
ization of the stochastic process a1, o1, a2, o2, . . . , aT , oT in all the formulations we have
seen so far. This means that the agent has to “know how it will act under all potential
situations” before having interacted with the environment even once. We will relax this
assumption in Chapter 9. 2
3.2.4 Subjective versus True Expected Utility
In the previous section we have seen how to solve a decision problem. This design
method makes especially sense when the agent knows its environment, because
∑
ao≤T
G(ao≤T )U(ao≤T ) =
∑
ao≤T
( T∏
t=1
P(at|ao<t)Q(ot|ao<tat)
)
U(ao≤T )
=
∑
ao≤T
( T∏
t=1
P(at|ao<t)P(ot|ao<tat)
)
U(ao≤T )
=
∑
ao≤T
P(ao≤T )U(ao≤T ),
where G is the generative probability measure (Definition 3), and hence the true ex-
pected utility arising from the interaction between the agent P and the environment Q
coincides with the subjective expected utility of the agent P. As we have pointed out
in the context of Definition 11, this is a requirement that is hardly fulfilled in practice.
Finding the optimal policy given that the agent knows the environment constitutes the
paradigm of optimal control. Optimal control is an important problem class having
application in many real-life situations.
Obviously, if the agent does not know its environment, then the true expected utility
and the subjective expected utility do not coincide and in fact little can be said about
how these two quantities compare, i.e.∑
ao≤T
G(ao≤T )U(ao≤T )
?
⋚
∑
ao≤T
P(ao≤T )U(ao≤T ).
25
3. DECISION MAKING
Thus, it seems that rational agents are somewhat fragile, in the sense that their per-
formance seems to depend strongly on knowing the environment they are facing. This
begs the question of whether we can design more robust agents, e.g. agents that perform
well even if they don’t know what environment they are facing. This is the question
addressed in the next chapter.
3.3 Historical Remarks & References
The concept of utility has become an integral part of the standard vocabulary in economics,
control theory and artificial intelligence. However, its development has been slow and contro-
versial. The roots of decision theory can be traced back as far as before the 18th century.
Gamblers were thought to evaluate random ventures based on their expected returns (i.e. ex-
pected monetary payoffs). This led to the famous St. Petersburg paradox : a gamble with
infinite expected return, but nevertheless intuitively considered to be worth only a very small
amount of money. Bernoulli (1738) presented a solution to this paradox by introducing the
notion of utilities: an outcome is not worth its nominal (i.e. monetary) value, but a subjective
transformation of this value. Accordingly, the valuation of a gamble would then reduce to the
calculation of its expected utility, which decomposed this valuation into a sum of utilities of
outcomes weighted by their likelihoods. Surprisingly though, this idea did generally not pick
up with early economists. It was only after Knight (1921) made a case suggesting that risk,
uncertainty and utility might be relevant for economic analysis that the leading economists
finally took these concepts seriously into account.
Ramsey (1926) suggested a way of deriving a consistent theory of choice under uncertainty
that could isolate subjective probabilities (i.e. beliefs) from preferences. This idea got later
independently developed by De Finetti (1937). In his formulation, he proposed a thought
experiment where a decision maker is required to set the fee for a gamble involving the likelihood
of an event. He then showed that the decision maker has to choose a fee that is in accordance
with his subjective belief in the likelihood of the event in order to avoid being a victim of a
Dutch book, i.e. a strategy that an opponent can use in order to systematically win against him.
The first mathematical formalization of the expected utility hypothesis came with von
Neumann and Morgenstern (1944). In their work, a decision maker is formalized as having
preferences over lotteries (i.e. probability distributions over outcomes). von Neumann and
Morgenstern have shown that if this preference relation follows certain consistency axioms,
then there exists an “essentially” unique utility function over lotteries. Utilities over outcomes
are then derived from utilities of deterministic lotteries. Note that this formalization is based
on at least two important assumptions. First, probabilities are assumed to be “objective”, i.e.
they are given by Nature and cannot be influenced by the decision maker. Second, lotteries
logically precede outcomes, which seems to suggest that decision makers really get utility from
distributions, not from outcomes.
The ideas of subjective utility and probability culminated in the theory of subjective ex-
pected utility developed by Savage (1954). Savage combined de Finetti’s and von Neumann-
Morgenstern’s ideas into the theory of decision making under uncertainty presented in this
chapter. This theory is by many considered the biggest achievement in rational decision mak-
ing, being the standard mathematical framework until today. Later, Anscombe and Aumann
(1963) provided a simpler derivation by extending von Neumann and Morgenstern’s approach
with subjective probabilities. The exposition in Section 3.1 follows closely the one presented in
Kreps (1988, Chapter 9). A mathematically rigorous treatment is given in Fishburn (1982).
In control theory, Bellman (1957) discovered the optimality equations named after him.
26
3.3 Historical Remarks & References
These equations constitute a necessary condition for optimality. As it has been explained
in Section 3.2.3, the advantage of this condition is that it expresses the solution of a large
optimization problem as a recursive combination of smaller optimization problems.
Bellman’s work turned out to be very influential, and his method quickly established itself
as the standard in the control community. In particular, Bellman optimality equations have
been mainly applied to efficiently solve Markov decision problems (MDPs). An MDP is a
probabilistic model of a special class of environments, and it emphasizes the generative aspect
of observations. From a computational point of view, a finite MDP is a finite state machine
with stochastic transitions influenced by the actions of the agent, and where the computational
states correspond exactly to the observations that the agent can make. The latter assumption
allows devising an efficient computational algorithm, called dynamic programming, to find
the optimal policy in (linear) running time O(|Z|T ) (Tsitsiklis, 2007). Because of the wide
range of applicability of MDPs and the low computational cost of finding their solution, MDPs
have become almost synonymous with control problems. Later, as researchers have attempted
to tackle more realistic problems, they increasingly realized the shortcomings of the MDP
assumptions. As a response to this, the MDP model has been generalized to partially observable
MDPs (POMDPs), where observations are allowed to be functions of the computational state
(Russell and Norvig, 2009). However, the computational complexity of solving a POMDP grows
exponentially with the number of computational states. It is worth point out that the POMDP
formulation and the formulation presented in this thesis are essentially equivalent, with the
difference that POMDPs emphasize a particular generative model for observations, while the
generative mechanism is left open in this thesis.
Finally, while SEU theory makes a strong case for being a normative1 model of decision
making, it is not flawless. In fact, there is strong experimental evidence that disconfirms the
validity of SEU as a normative model in systematic ways—most notably Ellberg’s Paradox,
which has puzzled even Savage himself (Ellsberg, 1961). The interested reader can refer to the
specialized literature (Machina, 1987).
1Normative in the sense that a rational decision maker would revise his decisions if they do not
follow the maximum SEU principle.
27
3. DECISION MAKING
28
Chapter 4
Learning
In the previous chapter, we have seen how to solve an optimal control problem, i.e.
how to construct the policy of an agent when the environment is known. Examples of
this situation include hitting a target with a cannon under known weather conditions,
solving a maze having its map and controlling a robotic arm in a manufacturing plant.
However, when the environment is unknown, then the agent needs an adaptive policy.
For example, shooting the cannon lacking the appropriate measurement equipment,
finding the way out of an unknown maze and designing an autonomous robot for Mar-
tian exploration. In all these examples, the agent needs to “learn” its environment
on-the-fly in order to optimize its performance. How to design adaptive agents within
SEU theory is the subject matter of this chapter.
1
2
1
2
1
2
1
2
1
2
1
2
1
2
T H
Figure 4.1: A Fixed Predictor.
To illustrate learning, consider the following example. The task is to predict the
outcomes of a sequence of coin tosses. The coin has an unknown bias drawn from
θ ∈ [0, 1]. Observations ot are drawn from O := {H, T}, and beliefs are P(ot|o<t) for
any o≤t ∈ O∗ (actions are omitted in this case). Consider a first belief given by
P(ot = H|o<t) = P(ot = T|o<t) = 1
2
.
29
4. LEARNING
This is illustrated in Figure 4.1. This predictor does not learn anything, because it
maintains a fixed prediction over the next outcome, despite the past observations.
In contrast, consider the belief illustrated in Figure 4.2. This predictor learns from
experience, because it adjusts its prediction based on past observations. Two important
properties should be emphasized. First, it uses the past observations to find a suitable
prediction for the next coin toss (that is, within a predefined class of predictors, namely,
the Bernoulli processes1). Second, the predictor “accumulates evidence”, since the
“amount of change” is decreasing.
1
2
2
3
1
3
3
4
1
2
1
2
1
4
4
5
3
5
3
5
2
5
3
5
2
5
2
5
1
5
5
6
2
3
2
3
1
2
2
3
1
2
1
2
1
3
1
2
1
3
1
2
1
3
1
2
1
3
1
3
1
6
T H
Figure 4.2: An Adaptive Predictor.
Let h denote the number of times a head has been observed at time t. The proba-
bilities in the second predictor have been chosen as
P(ot+1 = H|o≤t) = h+ 1
t+ 2
,
known as the rule of succession. It is easy to see that this rule converges to the right
odds with enough coin tosses. The choice of this predictor is not arbitrary: it can be
given a statistical justification that will be developed in this chapter.
4.1 Bayesian Probability Theory
So far, we have given probabilities two usages: to specify a generative law (i.e. propen-
sities) and to specify degrees of belief (i.e. plausibilities). While both usages obey the
axioms of probability, it would be desirable to have an explanatory framework for plau-
sibilities that is intuitively more appealing. Such a framework is Bayesian probability
1An i.i.d. sequence of binary random variables with a fixed bias.
30
4.1 Bayesian Probability Theory
theory. Simply put, Bayesian probability theory is a framework that extends logic for
reasoning under uncertainty. The presentation of Bayesian probability theory outlined
in the following is an original contribution of this thesis.
4.1.1 Reasoning under Certainty
Logic is the most important framework of reasoning (under certainty). Here, it is
rephrased in set-theoretic terms2. As will be seen, this facilitates its extension to a
framework for reasoning under uncertainty.
Let Ω be a set of outcomes, which is assumed to be finite for simplicity. A subset
A ⊂ Ω is an event. Let c, ∪ and ∩ be the set-operations of complement, union
and intersection respectively. Let F be an algebra, i.e. a set of events obeying the
axioms
A1. F 6= ∅.
A2. A ∈ F ⇒ Ac ∈ F .
A3. A,B ∈ F ⇒ A ∪B ∈ F .
In this framework, an outcome ω ∈ Ω represents a state of affairs and an event A ∈ F
a proposition. For instance, if two coins are tossed, then there are four outcomes
Ω = {HH, HT, TH, TT}. A possible outcome is w = HT (i.e. first head, then tail),
and an event is A = {HH, HT, TH} (i.e. there is a head). Hence, a singleton {ω} ∈ F
is an irreducible (i.e. atomic) proposition about the world. The set-operations c, ∪
and ∩ correspond to the logical connectives of ¬ (negation), ∨ (disjunction) and ∧
(conjunction) respectively. They allow the construction of complex propositions from
simpler ones. An algebra is a system of propositions that is closed under negation
and disjunction (and hence is closed under conjunction as well), i.e. it comprises all
propositions that the reasoner might entertain.
Remark 6 A consequence of the axioms is that both the universal event Ω and the
impossible event ∅ are in F . 2
The objective of logic is to allow the reasoner to conclude the veracity of events
given information. Let V := {1, 0, ?} be the set of truth states, where 1 is true, 0 is
false, and ? is uncertain (but known to be either true or false). From these, {1, 0}
are called truth values. The truth function is the set function T over F×F defined
as
A,B ∈ F , T(A|B) =


1 if B ⊂ A,
0 if A ∩B = ∅,
? else.
2Strictly speaking, this set-theoretic logic is “a logic within logic”, since set theory is based on
standard logic.
31
4. LEARNING
Furthermore, define the shorthand T(A) := T(A|Ω). The quantity T(A|B) stands for
the “truth value of event A given that event B is true”. Accordingly, the knowledge of
the reasoner about the facts of the world is represented by his truth function and his
algebra. From his point of view, a proposition can be either true, false or uncertain (i.e.
having an unresolved truth value given his knowledge). Understanding the definition
of the truth function is simple. Claiming that an event B ∈ F is true means that
one of its members ω ∈ B is the current outcome/state of affairs. Thus, for instance,
claiming that the statement “there is some head” is true means that ω ∈ {HH, HT,
TH}. Hence the veracity of A given B is evaluated as follows (Figure 4.3): if A contains
every outcome in B then it must be true as well; if A is known not to contain any of
B’s outcome then it must be false; and if A contains only part of B then it cannot be
resolved, since knowing that ω ∈ B does neither imply that ω ∈ A nor ω ∈ Ac. The
definition of a truth space follows.
(a) (b)
(c) (d)
T(A|A) = 1 T(Ac|A) = 0
T(C|A) = 1 T(B|A) = ?
A
BC
Ω
Figure 4.3: A truth space. It is known that the true outcome ω ∈ Ω is in A. Hence, (a) the
event A is true and (b) its complement Ac is false. (c) Any event that contains a true event is
true as well. (d) An event that contains only part of a true event is uncertain.
Definition 15 (Truth Space) A truth space is a tuple (Ω,F ,T) where: Ω is a set
of outcomes, F is an algebra over Ω and T : F ×F → V is a truth function. 2
The intuitive meaning of a truth space is as follows. Nature arbitrarily selects an
outcome ω ∈ Ω. (This choice is not governed by a generative law.) Subsequently,
the reasoner performs a measurement: he chooses a set B and nature reveals to him
whether ω ∈ B or not. Accordingly, the reasoners infers the veracity of any event A ∈ F
by evaluating either T(A|B) (if ω ∈ B) or T(A|Bc) (if ω /∈ B).
Several measurements are combined as a conjunction. Thus, if the reasoner learns
that ω is in B1, B2, . . . , and Bt after performing t measurements, then the truth value
is T(A|B1 ∩ · · · ∩Bt) for any A ∈ F .
Remark 7 Knowing that ω ∈ Ω does not resolve uncertainty, i.e. T(A|Ω) = ? for any
A ∈ F \ {Ω,∅}, while knowing that ω ∈ {ω} resolves all uncertainty, i.e. T(A|{ω}) ∈
{0, 1} for any A ∈ F . 2
32
4.1 Bayesian Probability Theory
Remark 8 The set relation B ⊂ A corresponds to the logical relation B ⇒ A. Since
an algebra is an encoding of how sets are contained within each other, it should be
clear that an algebra is essentially a system of implications. 2
4.1.2 Reasoning under Uncertainty
Unlike logic, Bayesian probability theory allows reasoning under uncertainty. For this
end, it provides a consistent mechanism to replace the uncertainty state ? with a nu-
merical value in the interval [0, 1] representing degrees of truth, belief or plausibility.
(a) (b) (c)
AAA
BBB
ΩΩΩ
B(A|B) = T(A|B) = 0 B(A|B) = B(A∩B|Ω)
B(B|Ω) B(A|B) = T(A|B) = 1
Figure 4.4: Extension of Truth Function.
The goal is to find a suitable definition of a quantity B(A|B) meaning “the degree of
belief in event A given that event B is true” that is consistent with the truth function
when it is certain, i.e. B(A|B) := T(A|B) if T(A|B) ∈ {0, 1}. Consider the three
situations in Figure 4.4. (a) In the case A∩B = ∅, we impose B(A|B) := T(A|B) = 0.
(c) In the case B ⊂ A, we impose B(A|B) := T(A|B) = 1. (b) In the intermediate
case where T(A|B) = ?, the event A only partially covers the members of B. If one
interprets the quantity B(C|D) as “the fraction of D contained in C”, then one can
characterize B(A|B) with the relation
B(A|B) = B(A ∩B|Ω)
B(B|Ω)
as long as B(B|Ω) > 0. It is easy to see that this formula generalizes correctly to the
border cases, since B(A|B) = 0
B(B|Ω) = 0 when A ∩ B = ∅ and B(A|B) = B(B|Ω)B(B|Ω) = 1
when B ⊂ A. Noting that B = B ∩Ω and rearranging terms, one gets
B(A ∩B|Ω) = B(B|Ω)B(A|B ∩ Ω).
We demand this relation to hold under any restriction to a “universal” set C ∈ F , not
only when it is restricted to Ω. Thus, replacing Ω by C one obtains
B(A ∩B|C) = B(B|C)B(A|B ∩ C),
33
4. LEARNING
which is known as the product rule for beliefs. Following a similar reasoning, we
impose that for any event A ∈ F , the sum of the degree of belief in A and its comple-
ment Ac must be true under any condition B, i.e.
B(A|B) +B(Ac|B) = 1,
which is known as the sum rule for beliefs. In summary, we impose the following
axioms for beliefs (Jaynes and Bretthorst, 2003).
Definition 16 (Belief axioms) Let Ω be a set of outcomes and let F be an algebra
over Ω. A set function P over F ×F is a belief function iff
B1. A,B ∈ F , B(A|B) ∈ [0, 1].
B2. A,B ∈ F , B(A|B) = 1 if B ⊂ A.
B3. A,B ∈ F , B(A|B) = 0 if A ∩B = ∅.
B4. A,B ∈ F , B(A|B) +B(Ac|B) = 1.
B5. A,B,C ∈ F , B(A ∩B|C) = B(A|C)B(B|A ∩ C). 2
Furthermore, define the shorthand B(A) := B(A|Ω). Axiom B1 states that degrees of
belief are real values in the unit interval [0, 1]. Axioms B2 and B3 equate the belief and
the truth function under certainty. Axioms B4 and B5 are the structural requirements
under uncertainty discussed above. Accordingly, one defines a belief space as follows.
Definition 17 (Belief Space) A belief space is a tuple (Ω,F ,B) where: Ω is a set
of outcomes, F is an algebra over Ω and B : F × F → [0, 1] is a belief function. 2
The intuitive meaning of a belief space is analogous to a truth space. Nature arbi-
trarily selects an outcome ω ∈ Ω. Subsequently, the reasoner performs a measurement:
he chooses a set B and nature reveals to him whether ω ∈ B or not. Accordingly, the
reasoner infers the degree of belief in any event A ∈ F by evaluating either B(A|B) (if
ω ∈ B) or B(A|Bc) (if ω /∈ B).
Remark 9 The word “subsequently”, that has been emphasized for the second time
now, is crucial. When the reasoner performs his measurements, the outcome is already
determined. In Chapter 8, we will relax this assumption by allowing outcomes that are
only partially determined or jointly determined by the reasoner him-/herself. 2
Remark 10 Note that logical truth is not the same as probabilistic truth. That is
T(A|B) = 1 ⇒ B(A|B) = 1,
but B(A|B) = 1 ⇒ T(A|B) ∈ {?, 1}.
In particular, the case B(A|B) = 1 and T(A|B) = ? occurs when B is chosen such that
the set B \A is non-empty but has no probability mass relative to B. In other words,
if B(B \A|B) := 0, then B(A ∩B|B) = B(A|B) = 1, even though B 6⊂ A. 2
34
4.1 Bayesian Probability Theory
An easy but fundamental result is that the axioms of belief are equivalent to the
axioms of probability3 (Jaynes and Bretthorst, 2003). This simple observation is what
constitutes the foundation of Bayesian probability theory.
4.1.3 Bayes’ Rule
We now return to the central topic of this chapter. Suppose the reasoner has uncer-
tainty over a set of competing hypotheses about the world. Subsequently, he makes an
observation. He can use this observation to update his beliefs about the hypotheses.
The following theorem explains how to carry out this update.
Theorem 2 (Bayes’ Rule) Let (Ω,F ,B) be a belief space. Let {H1, . . . ,HN} be a
partition of Ω, and let D ∈ F be an event such that B(D) > 0. Then, for all n ∈
{1, . . . , N},
B(Hn|D) = B(D|Hn)B(Hn)
B(D)
=
B(D|Hn)B(Hn)∑
mB(D|Hm)B(Hm)
.
This is known as Bayes’ rule. 2
The interpretation is as follows. The H1, . . . ,HN represent N mutually exclusive
hypotheses, and the event D represents a new observation or data. Initially, the
reasoner holds a prior belief B(Hn) over each hypothesis Hn. Subsequently, he in-
corporates the observation of the event D and arrives at a posterior belief B(Hn|D)
over each hypothesis Hn. Bayes’ rule states that this update can be seen as com-
bining the prior belief B(Hn) with the likelihood B(D|Hn) of observation D under
hypothesis Hn. The denominator
∑
mB(D|Hm)B(Hm) = B(D) just plays the role of
a normalizing constant (Figure 4.5).
Ω
H1
H2
H2
H3
H3
D
D
Figure 4.5: Schematic Representation of Bayes’ Rule. The prior belief in hypotheses H1, H2
and H3 is roughly uniform. After conditioning on the observationD, the belief in hypothesis H3
increases significantly.
3More precisely, the axioms of beliefs as stated here imply the axioms of probability for finitely
additive measures over finite algebras. Furthermore, the axioms of beliefs also specify a unique version
of the conditional probability measure.
35
4. LEARNING
Bayes’ rule naturally applies to a sequential setting. Incorporating a new observa-
tion Dt after having observed D1,D2, . . . ,Dt−1 updates the beliefs as
Bt+1(Hn) := B(Hn|D1 ∩ · · · ∩Dt) = Bt(Dt|Hn)Bt(Hn)∑
mBt(Dt|Hm)B(Hm)
,
where for the t-th update,
Bt(Hn) := B(Hn|D1 ∩ · · · ∩Dt−1) and Bt(Dt|Hn) := B(Dt|Hn ∩D1 ∩ · · · ∩Dt−1)
play the role of the prior belief and the likelihood respectively. Note that
B(D1 ∩ · · · ∩Dt|Hn) =
t∏
τ=1
B(Dτ |Hn ∩D1 ∩ · · · ∩Dτ−1),
and hence each hypothesis Hn naturally determines a probability measure B(·|Hn) over
sequences of observations.
H1
H2
H3
S1
S2
S3
S4
S5
Figure 4.6: Progressive refinement of the accuracy of the joint observation. The sequence of
observations D1, . . . , D5 leads to refinements S1, S2, . . . , S5, where St = D1 ∩ · · · ∩ Dt. Note
that S5 ⊂ H1 and therefore B(H1|S5) = 1, while B(H2|S5) = B(H3|S5) = 0.
A smaller event D corresponds to a more “accurate” observation. Hence, making a
new observation D′ necessarily improves the accuracy, since
D ⊃ D ∩D′.
In some cases, the accuracy of an observation (or sequence of observations) can be so
high that it uniquely identifies a hypothesis (Figure 4.6).
The way Bayes’ rule operates can be illustrated as follows. Consider a partition
{X1, . . . ,XK} of Ω and let H∗ ∈ {H1, . . . ,HN} be the true hypothesis, i.e. the outcome
ω ∈ Ω is drawn obeying propensities described by B(·|H∗). The Xk represent different
observations the reasoner can make. If ω is drawn by Nature and reported to be in
36
4.2 Adaptive Optimal Control
Xk (without revealing ω itself), then the log-posterior probability of hypothesis Hn is
given by
logB(Hn|Xk) = logB(Xk|Hn)︸ ︷︷ ︸
ln
+ logB(Hn)︸ ︷︷ ︸
pn
− logB(Xk)︸ ︷︷ ︸
c
.
This decomposition highlights all the relevant terms for understanding Bayesian learn-
ing. The term ln is the log-likelihood of the data Xk. The term pn is the log-prior of
hypothesis Hn, which is a way of representing the relative confidence in hypothesis Hn
prior to seeing the data. In practice, it can also be interpreted as (a) a complexity term,
(b) the log-posterior resulting from “previous” inference steps, or (c) an initialization
term for the inference procedure. The term c is the log-probability of the data, which
is constant over the hypotheses, and thus does not affect our analysis. Hence, log-
posteriors are compared by their differences in ln+pn. Ideally, the log-posterior should
be maximum for the true hypothesis Hn = H∗. However, since ω is chosen randomly,
the observation Xk is random, and hence the log-posterior logB(Hn|Xk) is a random
quantity too. If the variance of the log-posterior is high enough, then a particular
realization of the data can lead to a log-posterior favoring some “wrong” hypotheses
over the true hypothesis, i.e. ln + pn > l∗ + p∗ for some Hn 6= H∗. In general, this is
an unavoidable problem (that necessarily haunts every statistical inference method).
Further insight can be gained by analyzing the expected log-posterior:∑
Xk
B(Xk|H∗) logB(Xk|Hn)
︸ ︷︷ ︸
Ln
+ logB(Hn)
︸ ︷︷ ︸
Pn=pn
−
∑
Xk
B(Xk|H∗) logB(Xk)
︸ ︷︷ ︸
C
.
This reveals4 that, on average, the log-likelihood Ln is indeed maximized by Hn = H∗.
Hence, the posterior belief will, on average, concentrate its mass on the hypotheses
having high Ln + Pn.
4.2 Adaptive Optimal Control
The previous section introduced the conceptual framework of Bayesian probability the-
ory to model reasoning under uncertainty. The aim of this section is to apply this
framework to model adaptive autonomous systems. We first introduce a model to rep-
resent the uncertainty an agent has over its environment. Then, we show that this
model of uncertainty also allows deriving a predictor over the input stream, which will
be used as the input model of the agent. Finally, we show a convergence result for this
predictive input model.
4.2.1 Bayesian Input Model
One can exploit the Bayesian interpretation of probabilities to construct adaptive au-
tonomous systems. The Bayesian framework can be used to specify a belief model of
4For pi, qi probabilities,
∑
i
pi log qi is maximum when qi = pi for fixed pi.
37
4. LEARNING
agents having uncertainty about their environments. From this belief model, one can
derive an adaptive predictor for the I/O model introduced in Chapter 3.
Definition 18 (Bayesian Input Model) Let Θ be a finite set and let ZT be the set
of interaction strings. A Bayesian input model of an agent is a set of conditional
probabilities
P (θ) and P (ot|θ, ao<tat) for all θ ∈ Θ and all ao<t ∈ Z⋄,
uniquely specifying a conditional probability measure P over Θ×OT conditioned on AT
given by
P (θ, o≤t|a≤t) = P (θ)
t∏
τ=1
P (oτ |θ, ao<τaτ ).
2
The Bayesian input model is a concise way of bundling together many hypotheses
about the input stream. In fact, for each choice of θ, the conditional probabilities
P (ot|θ, ao<tat) define an input model. One can derive a distribution over OT condi-
tioned on AT by marginalizing over Θ:
P (o≤T |a≤T ) =
∑
θ
P (θ)P (o≤T |θ, a≤T ). (4.1)
Hence, the Bayesian input model defines a mixture distribution over the different
input models, where the mixture weights are given by the P (θ).
4.2.2 Predictive Distribution
The Bayesian input model allows deriving an adaptive predictor, i.e. a predictor over
the next observation ot that learns from the past observed I/O string ao<tat.
Proposition 1 (Predictive Distribution) The probability of ot conditioned on the
past ao<tat is given by
P (ot|ao<tat) =
∑
θ
P (θ|ao<t)P (ot|θ, ao<tat), (4.2)
where the mixture weights P (θ|ao≤t) are given by the recursion
P (θ|ao≤t) =
P (ot|θ, ao<t)P (θ|ao<t)∑
θ′ P (ot|θ′, ao<t)P (θ′|ao<t)
.
Equation (4.2) is known as the predictive distribution. 2
Proof First, P (ot|ao<tat) can be rewritten as
P (ot|ao<tat) =
∑
θ
P (θ|ao<tat)P (ot|θ, ao<tat). (4.3)
38
4.2 Adaptive Optimal Control
The posterior probabilities P (θ|ao<tat) are calculated using Bayes’ rule:
P (θ|ao<tat) =
P (o<t|θ, a≤t)P (θ)∑
θ′ P (o<t|θ′, a≤t)P (θ′)
=
P (θ)
∏t−1
τ=1 P (oτ |θ, ao<τ )∑
θ′ P (θ
′)
∏t−1
τ=1 P (oτ |θ′, ao<τ )
.
Here, we see that the trailing at does not affect the observations o<t and thus can be
dropped, i.e.
P (θ|ao<tat) = P (θ|ao<t). (4.4)
Consider now without loss of generality the probability P (θ|ao≤t). Applying Bayes’
rule over the last observation ot, one obtains
P (θ|ao≤t) =
P (ot|θ, ao<t)P (θ|ao<t)∑
θ′ P (ot|θ′, ao<t)P (θ′|ao<t)
. (4.5)
Collecting (4.3), (4.4) and (4.5) yields the result. 
In (4.2), the posterior probabilities P (θ|ao≤t) play the role of adaptive mixture
weights. Also note that the result shows that P (θ|ao<tat) = P (θ|ao<t), i.e. the trailing
at does not affect the posterior weights. The advantage of Proposition 1 is that of
concisely highlighting the sequential nature of belief updates.
4.2.3 Induced Input Model
We have seen that the Bayesian input model allows deriving a predictor (i.e. the
predictive distribution) over the input stream. This predictor can serve as the input
model for an adaptive autonomous system.
Definition 19 (Induced Input Model) The input model P induced by a Bayesian
input model P is defined as
P(ot|ao<tat) :=
∑
θ
P (θ|ao<t)P (ot|θ, ao<tat) for all ao<t ∈ Z⋄.
2
The intuitive interpretation of this definition is as follows. The agent does not know
the environment Q. Probabilistically, this fact is made precise by assuming that Q is
going to be drawn with probability P (θ) from a set Q := {Qθ}θ∈Θ of possible output
models of environments before the interaction starts. Following a Bayesian approach,
this is modeled by equating
P (ot|θ, ao<tat) = Qθ(ot|ao<tat).
Hence, the uncertainty over the environment Qθ is translated into the uncertainty
over θ. Accordingly, an alternative way of looking at this situation is to think about an
environment that secretly chooses its “parameter” θ ∈ Θ before the interaction starts.
Because of this, the set Θ is known as the set of unknown parameters.
39
4. LEARNING
Remark 11 The distinction between the Bayesian input model and the input model,
and the use of different probability symbols P and P respectively, is meaningful, but this
will only become clear later in Chapter 8 when we will introduce causal interventions.2
Remark 12 While this setup models agents that are uncertain about its environment,
it assumes that agents are certain about their own policy. In other words, the choice
of the policy (causally) precedes the interactions. As we will see later, this is an
assumption that cannot be met in most situations. 2
4.2.4 Convergence of Predictive Distribution
So far, the usage of a Bayesian input model to describe the uncertainty in an au-
tonomous system has been justified from a purely axiomatic point of view. However,
the predictive distribution has an astonishing statistical property. If the Bayesian model
contains the true input model (i.e. the input model describing the environment) then
the predictive distribution converges to it.
Theorem 3 (Convergence of Predictive Distribution) Let P be an input model
induced by a Bayesian model P . Let θ∗ ∈ Θ be the true parameter, i.e. P (ot|θ∗, ao<tat) =
Q(ot|ao<tat). Let S and D defined as
S :=
T∑
t=1
∑
o≤t
Q(o<t|a<t)
(
Q(ot|ao<tat)−P(ot|ao<tat)
)2
D :=
∑
o≤T
Q(o≤T |a≤T ) ln Q(o≤T |a≤T )
P(o≤T |a≤T )
be the cumulative sum of the mean-squared prediction error and the relative entropy
respectively. Then, the following inequalities hold:
0 ≤ S ≤ D ≤ ln 1
P (θ∗)
.
2
Proof This proof is due to Hutter (2004a). We want to apply the entropy inequality
(Hutter, 2004a, Section 3.9.2)∑
i
(xi − yi)2 ≤
∑
i
xi ln
xi
yi
, for xi ≥ 0, yi ≥ 0,
∑
i
xi = 1,
∑
i
yi = 1.
In the entropy inequality, identify xi ↔ Q(ot|ao<tat), yi ↔ P(ot|ao<tat) and carry out
the sums over ot,
∑
ot
(Q(ot|ao<tat)−P(ot|ao<tat))2 ≤
∑
ot
Q(ot|ao<tat) ln
Q(ot|ao<tat)
P(ot|ao<tat)
.
40
4.2 Adaptive Optimal Control
We multiply the inequality by Q(o<t|a<t) and take sums over the o<t and over the
time steps t = 1, . . . , T . We further apply the chain rule Q(o<t|a<t)Q(ot|ao<tat) =
Q(o≤t|a≤t) on the r.h.s. and obtain
T∑
t=1
∑
o≤t
Q(o<t|a<t)
(
Q(ot|ao<tat)−P(ot|ao<tat)
)2
≤
T∑
t=1
∑
o≤t
Q(o≤t|a≤t) ln Q(ot|ao<tat)
P(ot|ao<tat)
.
The r.h.s. can be further developed,
(a)
=
T∑
t=1
∑
o≤t
Q(o≤t|a≤t) ln Q(ot|ao<tat)
P(ot|ao<tat)
(b)
=
∑
o≤T
Q(o≤T |a≤T ) ln
T∏
t=1
Q(ot|ao<tat)
P(ot|ao<tat)
(c)
=
∑
o≤T
Q(o≤T |a≤T ) ln Q(o≤T |a≤T )
P(o≤T |a≤T )
(d)
=
∑
o≤T
Q(o≤T |a≤T ) ln Q(o≤T |a≤T )∑
θ P (θ)P (o≤T |θ, a≤T )
(e)
≤
∑
o≤T
Q(o≤T |a≤T ) ln Q(o≤T |a≤T )
P (θ∗)P (o≤T |θ∗, a≤T )
(f)
= ln
1
P (θ∗)
.
In (a), we changed
∑
o≤t
Q(o≤t|a≤t) to
∑
o≤T
Q(o≤T |a≤T ). This can be done because
the argument in the logarithm does not depend on ot+1:T . (b) follows from moving
the
∑
t inside of the logarithm and thus converting it into a product
∏
t. (c) follows
from applying the chain rule. (d) is obtained by applying (4.1), and (e) follows from
dropping all the summands in the denominator but the one corresponding to the true
parameter θ∗. Equality (f) follows from P (o≤t|θ∗, a≤t) = Q(o≤t|a≤t) and from then
simplifying the expression. Note that S, D (as defined in the theorem statement) and
ln 1P (θ∗) are positive quantities. 
The theorem says essentially two things. First, the cumulative sum of prediction
errors (in the mean-square sense) S is bounded by the (total) relative entropy of the
predictive distribution from the environment. Absolute deviations like those in S are
intuitively easier to grasp (because of their geometrical interpretation) than relative
deviations like those in D. Second, the relative entropy is bounded from above by a
constant that is independent of T . The key property leading to this conclusion is that
the mixture distribution dominates any of its hypotheses, i.e.
P(o≤T |a≤T ) = P (o≤T |a≤T ) ≥ αP (o≤T |θ, a≤T ), α > 0. (4.6)
(In particular, in the Bayesian input model α = P (θ).) The dominance factor α can
be arbitrarily small as long as it stays constant. Hence, the upper bound − lnα is
valid even in the limit T →∞. Since S contains an infinite sum of positive terms, one
concludes:
Corollary 1 Under the same conditions as in Theorem 3, one has
P(ot|ao<tat) t→∞−−−→ Q(ot|ao<tat)
with G-probability one. 2
41
4. LEARNING
Proof Let z1(ω), z2(ω), . . . be a sequence of real-valued random variables drawn ac-
cording to a probability measure µ. zt is said to converge for t→∞ in the mean sum
to a real value z∗ iff
∞∑
t=1
E[(zt − z∗)2] <∞.
It is a well-known result that convergence in the mean sum implies convergence with
probability one, i.e.
µ{ω : zt(ω)→ z∗} = 1.
From Theorem 3 we know that
T∑
t=1
∑
o≤t
Q(o<t|a<t)
(
Q(ot|ao<tat)−P(ot|ao<tat)
)2
≤ − lnP (θ∗) <∞.
In particular, it can be verified that this bound holds for the case P (θ∗) = 1. Extending
the sum over o≤t to o≤T and then averaging over the a≤T preserves the inequality. This
yields
T∑
t=1
∑
ao≤T
G(ao≤T )
(
Q(ot|ao<tat)−P(ot|ao<tat)
)2
<∞,
where G is the generative probability measure
G(ao≤T ) =
T∏
t=1
P(at|ao<t)Q(ot|ao<tat).
Replacing G ↔ µ, (P(ot|ao<tat) −Q(ot|ao<tat)) ↔ zt and 0 ↔ z∗ and applying “in
the mean sum ⇒ with probability one” we obtain the result. 
Hence, in the limit T → ∞, a random sequence a1o1a2o2 . . . drawn from the gen-
erative measure G will have the property that the predictive distribution P(ot|ao<tat)
converges to the environment’s output stream Q(ot|ao<tat).
To illustrate the convergence result, consider the coin-toss prediction example from
the beginning of this chapter. Assume a Bayesian model with two possible parameter
settings, namely Θ := {14 , 34}. Hence,
P (ot = H|θ, o<t) = P (ot = H|θ) =
{
1
4 if θ =
1
4 ,
3
4 if θ =
3
4 .
The two hypotheses are shown in Figure 4.7a and b. Next, assume a uniform prior
distribution over the parameters, i.e.
P (θ) =
1
2
, θ =
1
4
,
3
4
.
42
4.2 Adaptive Optimal Control
1
64
1
64
3
64
3
64
3
64
3
64
9
64
9
64
3
64
3
64
9
64
9
64
9
64
9
64
27
64
27
64
0.0625
0.0469
0.0375
0.0301
(a) (b)
(c)
T H
3
4
3
4
3
4
3
4
3
4
3
4
3
4
3
4
3
4
3
4
3
4
3
4
3
4
3
4
3
4
1
4
1
4
1
4
1
4
1
4
1
4
1
4
1
4
1
4
1
4
1
4
1
4
1
4
1
4
1
4
1
2
5
8
3
8
7
10
1
2
1
2
3
10
41
56
5
8
5
8
3
8
5
8
3
8
3
8
15
56
Figure 4.7: Convergence of Predictive Distribution. Panels (a) and (b) illustrate the hypothe-
ses θ = 14 and θ =
3
4 respectively. The boxes represent the probabilities P (ot = T|θ, o<t) (black)
and P (ot = H|θ, o<t) (white). The resulting probabilities of drawing a particular realization
are indicated at the leaves. Panel (c) illustrates the predictive distribution P(ot|o<t) resulting
from combining the two hypotheses with prior probabilities P (θ = 14 ) = P (θ =
3
4 ) =
1
2 . Assume
any of the two hypotheses is true. The numbers written at each level correspond to the mean-
squared prediction error at time t, i.e. E[st], where st := (Q(ot|o<t)−P(ot|o<t))2. (The st are
the same for both hypotheses because of the symmetry.) Note how this average monotonically
decreases over time.
43
4. LEARNING
The posterior probability P (θ = 14 |o≤t) is calculated using Bayes’ rule:
P (θ = 14 |o≤t) =
P (o≤t|θ = 14)P (θ)∑
θ′ P (o≤t|θ′)P (θ′)
=
P (θ)
∏t
τ=1 P (oτ |θ = 14 , o<τ )∑
θ′ P (θ
′)
∏t
τ=1 P (oτ |θ′, o<τ )
=
(12) (
3
4 )
h (14)
t−h
(12 ) (
3
4 )
h (14 )
t−h + (12 ) (
3
4 )
t−h (14)
h
=
1
1 + 3t−2h
,
where h is the number of times a head has been observed in the first t coin tosses. The
posterior probability P (θ = 34 |o≤t) is obtained following a similar calculation. This
yields a posterior distribution given by
P (θ = 14 |o≤t) =
1
1 + 3t−2h
and P (θ = 34 |o≤t) =
1
1 + 32h−t
,
The induced input model is the predictive distribution given by
P(ot+1 = H|o≤t) =
∑
θ
P (θ|o≤t)P (ot+1 = H|θ, o≤t)
=
1
4
· 1
1 + 3t−2h
+
3
4
· 1
1 + 32h−t
.
This predictor is shown in Figure 4.7c. In the figure, it is seen how the predictive
distribution approaches the hypotheses that seems more plausible given the experience:
paths from the root that have more heads reach a leaf with bias ≈ 34 , and paths that
have more tails reach a leaf with bias ≈ 14 .
This is a simple predictor with only two hypotheses. If one increases the number of
possible bias settings to the unit interval Θ := [0, 1] and places a uniform prior density
p(θ) = 1 over Θ, then the resulting predictive distribution turns out to be the famous
rule of succession
P(ot+1 = H|o≤t) = h+ 1
t+ 2
presented at the beginning of this chapter. Accordingly, this predictor converges to any
bias θ∗ ∈ Θ. Thus, some simple predictors implement infinite mixtures of predictors!
Remark 13 Theorem 3 has additional implications. For instance, it provides a “con-
vergence rate” in the sense that it bounds the expected amount of times the mean-
squared error exceeds a given tolerance. 2
Remark 14 The convergence of the predictive distribution holds for any policy, since
the actions a≤t in Theorem 3 are arbitrary conditionals. 2
Remark 15 Note that there are no restrictions on the statistical properties of the
hypotheses. For instance, the input models P (ot|θ, ao<tat) are neither required to be
i.i.d., stationary nor Markov. The convergence of the predictive distribution holds even
with complex time-series. 2
44
4.2 Adaptive Optimal Control
Remark 16 The convergence of the posterior distribution to the true parameter θ∗,
i.e. P (θ|ao<t)→ δθθ∗ as t→∞, does not hold in general. This is simply because having
two distinct parameters θ 6= θ′ does not imply that P (ao<t|θ) 6= P (ao<t|θ′), i.e. it does
not imply that their likelihood functions are different under all realizations. For such
a convergence result to hold, one has to make additional assumptions (e.g. ergodicity,
i.i.d., etc.) 2
Remark 17 The key property for finding an upper bound in Theorem 3, namely
dominance (Equation 4.6), is a very powerful property that holds even in a number of
other situations where θ∗ 6∈ Θ. 2
4.2.5 Bayes Optimal Agents
The main question we have addressed in this section is: If the environment is un-
known, how do we construct an optimal agent? The Bayesian model introduced above
allows specifying a class of possible environments that the agent is uncertain about.
This leads to the specification of prior probabilities P (θ) and conditional probabilities
P (ot|θ, ao<tat). As explained previously, defining these quantities completely deter-
mines the agent’s input model P(ot|ao<tat):
P(ot|ao<tat) =
∑
θ
P (ot|θ, ao<tat)P (θ|ao<tat).
The important observation is that one can interpret this distribution as if it represented
a “real” environment Q˜ with the property that
P(ot|ao<tat) = Q˜(ot|ao<tat)
for all ao≤t ∈ Z⋄. This view allows the construction of a rational agent following
the maximum SEU principle, e.g. by solving the Bellman optimality equations from
Section 3.2.3. That is, given the input model P(ot|ao<tat) and a utility function U
over ZT , choose a policy P(at|ao<t) as
P(at|ao<t) := δa
∗
t
at , where a
∗
t := argmaxat
∑
ot
P(ot|ao<tat)F (ao≤t),
where F is the future utility function from Definition 12. The resulting policy is known
as theBayes optimal policy, and it constitutes the solution to the adaptive optimal
control problem.
Let us recapitulate what we have achieved. The previous equation is an important
result in control theory and artificial intelligence. It explains how to design adap-
tive autonomous systems: adaptive cannons, maze solvers and robots for Martian
exploration—at least, in SEU theory!
45
4. LEARNING
4.3 Historical Remarks & References
As in the case of decision theory, probability theory has also had a long and controversial
history, and in fact their development has always been closely connected. Currently, there are
several schools of thought who defend different interpretations of probability. Roughly, they
can be classified into physical and evidential5.
The physical interpretation of probability sees probabilities as an intrinsic property of
nature. These include the frequentist accounts (Venn, 1866; von Mises, 1919; Fisher, 1970;
Neyman, 1950) and propensity accounts (Popper, 1934). In the frequentist interpretation, a
probability corresponds to the relative frequency of the occurrence of an event in a random ex-
periment that is repeated over time under similar conditions. In the propensity interpretation,
a probability is a physical tendency (or chance) of an event to happen (even in a single trial).
The evidential interpretation sees probabilities as measures of degrees of belief in inference.
These include the logical and epistemic accounts (Jeffreys, 1939; Cox, 1961; Jaynes and Bret-
thorst, 2003), personalistic (or “gambling”) accounts (Ramsey, 1926; De Finetti, 1937; Savage,
1954), although the evidential interpretation can be traced back as far as to the 18th century
(Bayes, 1763; Laplace, 1774). In the logical and epistemic accounts probabilities are extensions
of plain truth values to partial truth values. In the personalistic interpretation, probabilities
correspond to assumptions about the world that are necessary to justify decision making.
While the different schools of thought disagree on the interpretation of probabilities, they
do agree on the axioms that govern them. These have been laid down in modern form by Kol-
mogorov (1933) following the success of measure theory (Borel, 1898; Lebesgue, 1904; Fre´chet,
1915). The exposition presented in this chapter, which is an original contribution of this thesis,
combines the basics of Kolmogorov (1933) and Jaynes and Bretthorst (2003).
Bayesian methods in control are as old as control theory itself. For instance, consider the
work of Bellman (1957) and Kalman (1960). Currently, Bayesian methods are being applied ex-
tensively in diverse areas of control and reinforcement learning. Recent expositions on Bayesian
methods in sequential decision making are for example Jordan, Ghahramani, and Saul (1997),
Duff (2002), Hutter (2004a) and Legg (2008).
5They are sometimes also known as objectivist and subjectivist respectively in the literature.
46
Chapter 5
Problems of Classical Agency
So far, we have seen how to construct an optimal adaptive agent that is universal
with respect to a given class of possible environments. This class can be very rich,
containing a vast number of environments. For example, one could have a Bayesian
input model spanning the set of all computable environments. Such a choice of the
class of environments would lead to an agent that is able to learn any computable
sequence: it will make predictions about the weather, play chess, solve mazes, drive
cars, discover physical theories, solve IQ tests and predict future stock prices1. While
such an agent is possible in principle (within SEU theory), it is a fact that state-of-the-
art implementations of adaptive agents do not even match fly intelligence. There are
many technical reasons for this lack of success, but there are also theoretical reasons
that are deeply rooted in SEU theory. This is the subject of the second part of this
thesis.
5.1 Computational Cost and Precedence of Policy Choice
According to the maximum SEU principle, the designer has a utility function U over
ZT and a predictor P(ot|ao<tat), possibly derived from a Bayesian input model. The
maximum SEU principle stipulates the choice of a policy P(at|ao<t) maximizing the
subjective expected utility ∑
ao≤T
P(ao≤T )U(ao≤T ).
For the maximum SEU principle to make sense, the choice of the optimal policy has to
be made before the interaction starts, i.e. the policy has to be fully precomputed.
What is not apparent from this simple formula, however, is that finding the optimal
policy is very difficult safe for simple toy problems. Essentially, the number of candidate
policies is too large, even if we consider only deterministic policies.
Recall from Section 3.2.1 that choosing a deterministic policy P(at|ao<t) amounts
to choosing a behavioral function π from the set Π. This choice specifies the actions to
1These are all activities that are at least approximable with a computational method.
47
5. PROBLEMS OF CLASSICAL AGENCY
be chosen in each decision node. In the first level, there is only one decision node. In
the second level, there are |A| · |O| = |Z| decision nodes. Similarly, in level t, there are
|Z|t−1
decision nodes. Because of the chronological property of policies, the causal depen-
dencies amongst the decision nodes are such that the decisions in the lower levels are
independent from the decisions further up the decision tree—which is exactly the prop-
erty that is exploited in order to formulate the Bellman optimality equations. Hence,
the optimal policy is found by choosing the optimal action of each decision node at
level T − 1, then at level T − 2, and so forth. Thus, if solving one decision node is one
elementary operation, then the total number of operations C is given by
C := |Z|T−1 + |Z|T−2 + · · ·+ |Z|+ 1 = |Z|
T − 1
|Z| − 1 .
As a matter of illustration, consider the construction of a very simple cybernetic system
having only two possible inputs and outputs, i.e. |A| = |O| = 2, running for one minute
at one action per second, i.e. a horizon of T = 60 time steps. (This is way simpler than
the structural design of an ant.) Applying the maximum SEU principle requires
C =
460 − 1
4− 1 ≈ 4.43 · 10
35
operations—which is clearly an unreasonably large number considering that there are
only about 1050 atoms in the world2!
Hence, any rigorous application of the maximum SEU principle to choose an optimal
policy must rely on severe domain assumptions that reduce the effective cardinality of
the policy space to a manageable size.
5.2 Is Rationality a Useful Concept?
The maximum SEU principle has always been advocated as a normative principle.
However, given that there are virtually no real-world application domains due to its
dramatic computational demands, it not unreasonable to question the theoretical use-
fulness of SEU theory and the notion of rationality all together. One can tackle this
question in at least three ways.
1. Rationality as the “gold standard”. One could easily imagine that computing the
optimal answer is so costly, that one would rather content oneself with a “sub-
optimal” (but strictly speaking, irrational) solution that incurs into less resource
costs. In this view, the maximum SEU principle is taken as a “gold standard”
that has to be approximated. This is the view shared by most of the engineering
community.
2And if this does not convince yet, consider that today’s fastest supercomputer carrying out 1016
operations per second would take 1.40 · 1010 years to find the optimal policy (which, ironically, lasts
only one minute), which is even more than the age of the universe, being “only” about 1.37 ·1010 years!
48
5.3 Historical Remarks & References
2. Rationality as an idealization. Scientific models are usually idealized descriptions
of phenomena happening in Nature, and some models are empirically more ac-
curate than others. The model of rationality is not different in this sense: it
is an idealization that has the advantage of being mathematically simple and
elegant. While it is inaccurate, it captures important aspects of the decision
making process. In this view, rationality turns out to be useful concept to study
many situations arising in real life situations. This is the view defended by many
economists.
3. Searching for new foundations. If the framework of rationality fails to capture
some aspects of decision making that are considered important (like the resource
costs of finding the solution), then one can search for new foundations to remedy
the shortcomings of SEU theory. In this view, the aim is to find a theory to
conceptualize behavior under limited resources. This is the approach pursued by
the paradigm of bounded rationality, as well as the goal of the second part of this
thesis.
5.3 Historical Remarks & References
The distinction between decision theory as a normative and as a descriptive theory has been
pointed out even during its first mathematical formalizations. For instance, see Luce (1959)
for a different set of axioms. The need for considering computational resources in decision
making has been first pointed out by Simon (1955). A modern reference to the field of bounded
rationality is ?.
49
5. PROBLEMS OF CLASSICAL AGENCY
50
Part II
Resource-Bounded Agency
51

Chapter 6
Resources
In the previous chapter we have argued that one of the major problems of building
autonomous agents according to the maximum SEU principle is the prohibitive compu-
tational complexity of the method. This computational cost arises because the method
requires us to find the optimal policy within a set of candidate policies that is way too
large before the agent has even interacted once with the environment. It is therefore of
crucial importance to formally understand this computational cost in order to propose
methods to deal with it.
There are several ways to study computational costs in learning and in control. One
widely developed field is computational learning theory, which studies the com-
putational complexity and feasibility1 of learning algorithms (Angluin, 1992; Kearns
and Vazirani, 1994; Bishop, 2006). Another approach is bounded rationality, an
approach to decision making that complements rational decision making by taking into
account the cognitive/computational limitations of decision makers (Simon, 1955; Ru-
binstein, 1988). In particular, in artificial intelligence, bounded rationality is viewed
as the ability to reason about the resource-costs of reasoning, i.e. meta-reasoning
(Russell and Wefald, 1991). In this view, a resource-bounded agent must decide what
to reason about, when, and for how long. This leads to agents that trade off computa-
tional expenses of reasoning against the expected utility of the outcome. However, this
approach to bounded rationality is not devoid of criticism. One can easily envisage an
infinite hierarchy of meta-problems of “reasoning-about-reasoning” with no principled
way to stop the infinite regression (Parkes, 1997).
In this chapter, the two main questions that we want to be able to address with a
formalization of resources are:
1. What are the resources spent by the agent during its execution, i.e. its interaction
with the environment?
2. What are the resources spent by the designer for the construction of the agent?
As we will argue in this chapter, answering these two questions requires understanding
the relationship between:
1A computation is considered feasible when it can be carried out in polynomial time.
53
6. RESOURCES
• the intuitive concept of resources;
• (transformations of) probability distributions;
• computational resources such as computation time and computation space.
As of today, the formal relationship between probability distributions and computa-
tional resources is not understood, and in fact relates to many deep open questions in
complexity theory. Fortunately, some progress into this direction can be made by re-
stricting ourselves to understanding the relationship between resources and probability
distributions by drawing ideas from thermodynamics and coding theory.
Consequently, the aim of this chapter is to introduce an information-theoretic (or
more accurately, a thermodynamic) formalization of resources. While this formalization
is admittedly non-standard from a control-theoretic point of view, it will prove to be
fruitful to reinterpret the construction and the behavior of an autonomous system. In
particular, it will allow to state a natural link between resources and probabilities,
in a way such that behavior can be thought of as arising only from resource costs.
Furthermore, we will sketch an intuitive but non-rigorous connection to computational
complexity.
6.1 Preliminaries in Information Theory
As has been anticipated in the introduction, our primary focus will be to establish
an information-theoretic formalization of resources. To understand this connection, it
is necessary to introduce some basic concepts of information theory, the mathemati-
cal theory concerned with the quantification of information (Shannon, 1948; MacKay,
2003).
6.1.1 The Communication Problem
The basic problem of information theory is that of communicating a message or a choice
from a sender to a receiver via a communication channel, i.e. any device that translates
an input symbol into an output symbol, possibly with some noise. This setup covers
a wide range of scenarios, such as a modem transmitting content to another modem
via an optic fiber; a parent cell passing genetic information to its offspring cells via its
chromosomes; a hard disk storing data, i.e. delivering the data to itself but at a later
point in time; and so forth. The fundamental result of information theory is that all
the properties of the channel can essentially be reduced to its capacity, that is, the
limit to the amount of (noiseless) information that it can transmit each time it is used.
In a communication problem, a sender communicates a choice to a receiver over a
communication channel (see Figure 6.1). This is formalized as follows. Let U and X
be two finite sets, where U is the set of possible choices, and where X is the alphabet
used by the channel. A channel is a device that receives an input symbol x ∈ X and
produces an output symbol y ∈ X following a conditional distribution P (y|x). Since X
is typically assumed to be much smaller than U , a choice u ∈ U is communicated to the
54
6.1 Preliminaries in Information Theory
receiver by encoding it as a string x≤n ∈ X and then transmitted symbol by symbol2.
These symbols are recovered as a (possibly corrupted) string y≤n ∈ X ∗ which is then
decoded as the object v ∈ U .
Sender Encoder Channel Decoder Receiver
u x≤n y≤n v
Figure 6.1: The communication problem consists in communicating a (noisy) choice of a
target object v ∈ V via the selection of a source object u ∈ U using a communication channel.
This is done by encoding u as a string x≤n ∈ X ∗ that is then transmitted over the channel.
The transmitted string is recovered as a string y≤n ∈ X ∗ that is then decoded as the target
object v. The blocks highlighted in bold correspond to the processes that have to be designed
by the engineer.
Most of the time, we will assume a binary alphabet for the channel, that is X =
{0, 1}, because the choice of an alphabet changes the length of the codes by at most
a constant factor that depends only on the size of the alphabet. We can thus think of
the communication problem as the design of binary codes that efficiently transmit a
choice using a given binary communication channel. Due to the central role that codes
play in information theory, the next subsection investigates their basic properties.
6.1.2 Codes
A code is a scheme to represent objects as binary strings. A badly chosen code can
lead to inefficient or ambiguous representations. An inefficient code is a code that does
not minimize the length of its codewords (measured in bits), and an ambiguous code is
a code where the original object cannot be recovered uniquely anymore. In this section
we investigate the property that a code has to possess in order to be both efficient and
unambiguous. The exposition in this subsection follows closely the one presented in
Gru¨nwald (2007). We first require a definition of suitable binary codes.
Definition 20 (Prefix Free) Given a finite alphabet X , a subset P ⊂ X ∗ is called
prefix free iff there are no distinct members u, v ∈ P such that u is a prefix of v. 2
For instance, for the binary alphabet {0, 1}, the sets
P1 = {001, 01, 101, 11}, P2 = {00, 01, 10, 11} and P3 = {0, 10, 110, 111}
are prefix free, but
P4 = {01, 011, 1}
2Recall that X ∗ is the set of all finite strings over the alphabet X .
55
6. RESOURCES
is not because 01 is a prefix of 011. In fact, P4 is even ambiguous: if we want to decode
the string 011 we would not be able to tell whether it corresponds to the codeword 011
alone or to the concatenation of the codewords 01 and 1.
Definition 21 (Prefix Code) Given a finite set of objects U , a prefix code for U
is an injective function c : U → P, where P ⊂ {0, 1}∗ is a prefix free set of binary
strings. For any object u ∈ U , c(u) is called the codeword of u, and l(u) denotes the
corresponding codeword length. 2
Prefix codes are important because of two reasons. First, they can be uniquely
decoded; and second, when a given codeword is scanned from left to right, the end
of the codeword is detected instantaneously. Prefix codes can be represented with a
binary tree, where codewords correspond to the paths starting from the root until a
leave is reached (Figure 6.2).
P1 P2 P3
0
1
c(a) = 001
c(b) = 01
c(c) = 101
c(d) = 11
c(a) = 00
c(b) = 01
c(c) = 10
c(d) = 11
c(a) = 0
c(b) = 10
c(c) = 110
c(d) = 111
Figure 6.2: Three prefix codes P1,P2 and P3 over U = {a, b, c, d}. Note that P1 could
allocate codewords for at least two additional objects (namely, codewords with prefixes 000 or
100), while the latter ones cannot (i.e. they are complete). Furthermore, if one wants to shorten
a codeword in a complete prefix code, then other codewords must necessarily grow.
From the examples in Figure 6.2, it is apparent that one cannot compress the
codewords indefinitely. Codewords seem to allocate some “room”, of which there is
only a limited amount available. The shorter a codeword, the more “room” it takes
away for allocating other codewords. For instance, compare the codes P1 and P2. In
P1, the allocation of codewords is not optimal, in the sense that the codewords for ‘a’
and ‘c’ can still be shortened, as has been done in P2. Then, compare the codes P2 and
P3. There, we decided to shorten the codeword for ‘a’ even more from 00 to 0. This
however exceeds the limit, and hence in exchange we had to allocate longer codewords
for the objects ‘c’ and ‘d’ to preserve the unique decodeability. This intuition is made
precise by the following theorem.
Theorem 4 (Kraft-McMillan Inequality) There is a prefix code over a finite set
of objects U with codeword lengths l1, l2, . . . , l|U| iff they satisfy the Kraft-McMillan
56
6.1 Preliminaries in Information Theory
inequality
|U|∑
i=1
2−li ≤ 1.
A prefix code whose codeword lengths satisfy this bound with equality is said to be
complete. 2
Proof (⇒) Interpret 0.u as the binary expansion of a real number in [0, 1). For each
u ∈ U , let Γu := [0.u, 0.u + 2−l(u)), i.e. the right-open interval on the unit line starting
at 0.u and ending at 0.u + 2−l(u). Observe that the measure of this interval is 2−l(u).
Since the Γu are disjoint and contained within [0, 1), the total measure cannot exceed 1,
thus proving that the inequality holds for prefix codes.
(⇐) Assume w.l.g. that the l1, . . . , l|U| are non-decreasing. Choose adjacent disjoint
intervals I1, . . . , I|U| of lengths 2
−l1 , . . . , 2−l|U| starting from the left end of the interval
[0, 1). The total measure of the union of the Ii is ≤ 1 by construction and the assump-
tion about the li. Note that since the li are non-decreasing, every interval Ii ends up
aligned with an interval Γu for some binary string u. Take u as the i-th codeword. 
Thus, long codewords add little, while short codewords add much to the sum. In-
formally, one can say that long codewords are “cheap”, while short codewords are
“expensive”. Accordingly, a prefix code that is incomplete can always be either com-
pressed more or extended with extra codewords until it is complete.
But the Kraft-McMillan inequality allows us to say more than that. Consider a
complete prefix code for U . This code fulfills the equality:∑
u∈U
2−l(u) = 1.
Now, define
Pr(u) := 2−l(u) for all u ∈ U .
We have obtained a probability distribution over U ! This is an important construction.
Essentially, the Kraft-McMillan inequality allows establishing a one-to-one correspon-
dence between codeword lengths and probabilities. This works even when we allow
codeword lengths to take on any positive real values as long as they satisfy the Kraft-
McMillan inequality. This enables us to talk about a codeword length as being a direct
specification of a probability that is invariant to the particular choice of the encoding
alphabet, and even to the actual codewords.
6.1.3 Information
We now return to the problem of communication, i.e. the problem of communicating a
choice using a communication channel. In the previous subsection, we have seen that the
Kraft-McMillan inequality characterizes the limits of compressibility for unambiguous
57
6. RESOURCES
codes. We have also seen that the inequality allows us establishing a bijection between
codeword lengths and probabilities given by
l(u) = − logPr(u), (6.1)
where the quantity on the right-hand side is called the information content of u.
The function f(p) = − log p is depicted in Figure 6.3a, where we use the convention
− log 0 =∞. What is the operational meaning of this relation? Suppose that both the
sender and the receiver agree on a code for U , and that the communication channel is
noiseless. When the sender chooses object u ∈ U , he has to transmit l(u) bits through
the channel. Assuming this choice is made following a probability distribution Pr over
U , the expected amount of transmitted bits is given by∑
u
Pr(u)l(u).
If we want to make this communication as efficient as possible, then we have to choose
a code that minimizes this expectation. One can show (MacKay, 2003, Section 5.3)
that the optimal choice of the codeword lengths is precisely given by (6.1). Using (6.1),
one sees that the minimum expected number of bits is given by
−
∑
u
Pr(u) logPr(u)
which is called the entropy of the probability distribution Pr. The function f(p) =
−p log p is shown in Figure 6.3b, where we use the convention 0 · log 0 = 0. The entropy
can be shown to fulfill a series of properties that make it a suitable measure of the
“uncertainty”, “randomness” and the “average amount of information” contained in
the choice of u (Shannon, 1948; MacKay, 2003). However, if we choose a code with
codeword lengths l′(u) 6= − logPr(u), then the expected codeword length is the cross-
entropy ∑
u
Pr(u)l′(u) = −
∑
u
Pr(u) logPr′(u),
where Pr′(u) = 2−l
′(u) are the probabilities “implicitly assumed” by the code. Obvi-
ously, the cross-entropy is lower bounded by the entropy,
−
∑
u
Pr(u) logPr′(u) ≥ −
∑
u
Pr(u) logPr(u).
Notice how the efficiency of the communication does not only depend on the channel,
but on its usage as well, i.e. the statistics of the information that is transmitted.
Intuitively speaking, if we knew beforehand what choice was going to be made, then it
would not be necessary to transmit any bit through the communication channel. But
because we are uncertain about the realization, we have to place bets, bets which are
implicitly captured by the code.
58
6.1 Preliminaries in Information Theory
a) b)
− log p −p log p
pp
Figure 6.3: Information functions.
Thus, the information content corresponds to the amount of bits that have to be
transmitted in order to communicate a choice. However, what happens during trans-
mission, i.e. when not all the bits have been transmitted yet? The few bits that got
sent did indeed communicate part of the choice. Communicating only part of the choice
means that some of the possible choices were ruled out by the transmitted informa-
tion. Consider the code P3 depicted in Figure 6.4a and assume that the first bit of the
choice, say ‘1’, is sent. This transmission changes the codeword function c to another
codeword function c′ depicted in Figure 6.4b, where the codewords starting with ‘1’
are shortened by this bit and the codewords starting with ‘0’ are discarded.
0
0
0
0
0
1
1
1
1
1
1
a) b)
c(a) = 0
c(b) = 10
c(c) = 110
c(d) = 111
c′(b) = 0
c′(c) = 10
c′(d) = 11
Figure 6.4: Probability versus Codeword Length (in bits).
The change in the expected codeword length can be calculated as{
final expected codeword length
}
−
{
initial expected codeword length
}
= −
∑
u
Pr′(u) logPr′(u) +
∑
u
Pr′(u) logPr(u) =
∑
u
Pr′(u) log
Pr(u)
Pr′(u)
.
In the previous calculation, the discarded codeword lengths are assumed to be equal
to ∞ by convention, and the actual distribution is Pr′(u) = 2−l′(u), where l′ is the
59
6. RESOURCES
codeword length function associated to c′. In the example of Figure 6.4, this change is∑
u
Pr′(u) log
Pr(u)
Pr′(u)
= 0 · (∞− 0) + 1
2
· (1− 2) + 1
4
· (1− 3) + 1
4
· (1− 3) = −1,
as expected. Hence, the negative of the previous quantity, that is∑
u
Pr′(u) log
Pr′(u)
Pr(u)
,
called the relative entropy or Kullback-Leibler divergence, measures the amount
of bits that have to be transmitted in order to change the knowledge about the choice
from Pr(u) to Pr′(u). Note that the relative entropy is always positive and zero iff
Pr = Pr′. More generally, the relative entropy corresponds to the amount of bits that
have to be paid in order to transform an initial probability distribution Pr into a final
probability distribution Pr′. Because of this, the relative entropy will play a fundamental
role in our formalization of resources.
6.2 Resources as Information
How does information theory help us to formalize resources in agents? In a communi-
cation problem, a natural way of measuring resources consists in counting the number
of bits that are necessary to communicate a choice. The resource costs conceptualized
in this way arise due to the uncertainty one has about the realization before it has hap-
pened. The setup of the communication problem is more general that it seems: given
an appropriate interpretation, it can be related to thermodynamical and computational
formalizations of resources.
6.2.1 Thermodynamical Interpretation
From a physical point of view, resources are typically expressed in terms of energy units
(e.g. Joules or Calories). In particular, energy is defined via work, i.e. the amount of
mechanical effort carried out on a system in order to change its physical state (Gold-
stein, 1980; Callen, 1985). For example, when an ideal gas is compressed with a piston
under isothermal conditions from an initial volume V to a final volume V ′, then the
work is calculated as
W = −
∫ V ′
V
NRT
V
dV = NRT ln
V
V ′
, (6.2)
where N ≥ 0 is the amount of substance, R > 0 is the gas constant, and T ≥ 0 is
the absolute temperature. The minus sign is just a convention to denote work done by
the piston rather than by the gas. The interpretation of resources that we want to put
forward here is analogous to the physical concept of work.
One can postulate a formal correspondence between one unit of information and
one unit of work (Feynman, 2000). Consider representing one bit of information using
60
6.2 Resources as Information
one of the following logical devices: a molecule that can be located either on the top or
the bottom part of a box; a coin whose face-up side can be either head or tail; a door
that can be either open or closed; a train that can be orientated facing either north or
south; and so forth. Assume that all these devices are initialized in an undetermined
logical state, where the first state has probability p and the second probability 1 − p.
Now, imagine you want to set these devices to their first logical state. In the case of the
molecule in a box, this means the following. Initially, the molecule is uniformly moving
around within a space confined by two pistons as depicted in Figure 6.5a. Assuming
that the initial volume is V , the molecule has to be pushed by the lower piston into the
upper part of the box having volume V ′ = pV (Figure 6.5b). From information theory,
we know that the number of bits that we fix by this operation is given by
− log p. (6.3)
Using the formula in (6.2), one gets that the amount of work done by the piston is
given by
W = RT ln
V
V ′
= RT ln
V
pV
= −RT ln p = − RT
log e
log p = −γmol log p,
where we have assumed N = 1 and where the constant γmol :=
RT
log e > 0 can be
interpreted as the conversion factor between one unit of information and one unit of
work for the molecule-in-a-box device.
a) b)
pV
(1 − p)V
A
B
Figure 6.5: The Molecule-In-A-Box Device. (a) Initially, the molecule moves freely within a
space of volume V delimited by two pistons. The compartments A and B correspond to the
two logical states of the device. (b) Then, the lower piston pushes the molecule into part A
having volume V ′ = pV .
How do we compute the information and work for the case of the coin, door and
train devices? The important observation is that we can model these cases as if they
were like molecule-in-a-box devices, with the difference that their conversion factors
between units of information and units of work are different. Hence, the number of bits
fixed while these devices are set to the first state is given by
− log p,
61
6. RESOURCES
i.e. exactly as in the case of the molecule. However, the work is given by
−γcoin log p, −γdoor log p, and − γtrain log p
respectively, where γcoin, γdoor and γtrain are the associated conversion factors between
units of information. Obviously,
γmol ≤ γcoin ≤ γdoor ≤ γtrain.
The point is that information is proportional to work. In other words, the amount of bits
required to communicate a choice is proportional to the amount of thermodynamical
work that has to be carried out on the recording device of the receiver.
One can easily envisage devices that generalize this idea to multiple states. Consider
for instance the device shown in Figure 6.6a corresponding to the codeword lengths
shown in Figure 6.4a. Here, four pistons control the probability of the molecule being
in the parts of the space labeled as {a, b, c, d}. A choice is ruled out by pushing its
piston downwards until it hits the wall. To rule out option ‘a’ as we did in the example
of Figure 6.4, we push the first piston to the end in order to obtain the configuration
shown in Figure 6.6b, which corresponds to codeword lengths of Figure 6.4b. Since this
change reduces the total volume by half, the thermodynamical work is proportional to
1 bit.
a) b)
aa bb cc dd
Figure 6.6: A generalized molecule-in-a-box device representing the partial choice of Fig-
ure 6.4. A molecule can move freely within a volume delimited by four pistons. The box has
four sections labeled as ‘a’, ‘b’, ‘c’ and ‘d’, corresponding to four possible states of the device.
Panel (a) and (b) show the initial and final configuration, corresponding to ruling out option ‘a’.
In general, this calculation is done using the relative entropy. Let v(u) and v′(u)
denote the volume of choice u before and after the compression respectively, and let
Pr(u) and Pr′(u) denote their associates probabilities. These probabilities are given by
Pr(u) =
v(u)
V
and Pr′(u) =
v′(u)
V ′
.
62
6.2 Resources as Information
Substituting the probabilities into the relative entropy, one obtains
∑
u
Pr′(u) log
Pr′(u)
Pr(u)
=
∑
u
v′(u)
V ′
log
v(u)
v′(u)
+ log
V ′
V
.
Using the quantities of the example, one obtains
0 · log 0 + 1
8
· log 1 + 1
16
· log 1 + 1
16
· log 1 + log 2 = 1 bit,
as expected. Note that this calculation only works if the volume stays equal or is
compressed, but not expanded.
6.2.2 Computational Interpretation
Remark 18 This subsection is of speculative nature. 2
Usually, the efficiency of an algorithm is assessed in terms of a relevant compu-
tational resource. There are many possible computational resources, but the most
important ones are the computation time and the computation space, i.e. the maxi-
mum number of time steps and the maximum number of cells that a Turing machine
uses in order to compute the output of a function for any input. Typically, the goal of
the designer is to construct an algorithm that minimizes either time or space.
However, these two goals seem to be somewhat incompatible. First, concentrating
on a single resource does not represent all the issues involved in solving a problem. Also,
there are numerous cases where reducing the computation time increases the computa-
tion space and viceversa. Indeed, the works of Borodin and Cook (1980), Beame (1989),
Beame, Jayram, and Saks (2001) and Beame, Saks, Sun, and Vee (2003) derive tradeoff
lower bounds for various problems such as sorting, finding unique elements in a set and
solving randomized decision problems. Their findings show that these lower bounds
can be stated in terms of minimum time-space products. For example, in Borodin and
Cook (1980), an optimal Ω(n2/ log n) lower bound is shown for the time-space product
of sorting n integers in the range of [1, n2]. Although these findings are not conclusive,
they seem to suggest that the time-space product might be a more general measure of
computational resources, i.e. one that captures the intuitive notion of the difficulty of
a computation (Figure 6.7). If one assumes that the time-space product is a suitable
measure of computational resources, then one is led to ask about the meaning of this
quantity.
Fortunately, one can give this product an information-theoretic meaning. We closely
follow the presentation of Savage (1998). Assume we want to build a device that
computes a function f mapping X into Y, where both X and Y are finite. To make our
discussion concrete, suppose we want to implement this device using a logic circuit,
i.e. a collection of interconnected logical gates computing elementary boolean functions.
It is sufficient to restrict ourselves to binary AND, binary OR and unary NOT gates,
since it can be shown that every boolean function can be implemented by them. An
example circuit is shown in Figure 6.8. The complexity of a function is measured by
63
6. RESOURCES
Space
Time
C1
C2
Figure 6.7: Time-Space Tradeoff. Studies suggest that there is a tradeoff between time and
space that can be characterized in terms of a constant that acts as a lower bound on the time-
space product of a problem. In the plot, the time-space curves of two problems C1 and C2 are
shown. In this case, C2 is more difficult than C1 because the time-space product of C2 is greater
than the one of C1, which can be seen by comparing their respective time-space rectangular
areas.
counting the number of logic gates it has3. A function requiring more logic gates is
considered to be more complex.
Many times we can also implement f using a sequential processing machine
which uses less logic gates. Such a machine is illustrated in Figure 6.9a. A sequen-
tial processing machine is capable of simulating computation models having a finite
number of configurations, such as space-bounded Turing machines or space-bounded
random access machines (Savage, 1998). As shown in the figure, a sequential pro-
cessing machine is a logic circuit φ communicating with an external storage device (or
“memory”) M . The computation proceeds in T discrete steps, where in step t, the
logic circuit takes the binary input xt and the current state qt and generates the binary
output yt and the next state qt+1. This is done with the help of M , which stores the
state qt+1 generated at step t until it is released in step t + 1. Here, the input string
x1:T encodes the input and the output string y1:T encodes the output.
The computation of a sequential processing machine can be rephrased as a com-
munication problem. This is easily seen by “unwinding” the computation as shown in
Figure 6.9, obtaining a concatenation of T times the logic circuit φ. This construction
transforms the computation into a communication problem where the channel is given
by the logic circuit φ, that is, φ can be regarded as a noisy channel transforming (xt, qt)
into (yt, qt+1). If the state qt is represented by S bits, then the total number of bits
3This measure is known as the circuit complexity of a function.
64
6.2 Resources as Information
NOT AND OR
x1
x2
y1
Figure 6.8: A logic circuit implementing the XOR function.
a) b)
φφφφφ
xt x1 x2 x3 xT
yt y1 y2 y3 yT
Mqt
qt+1
q0
q1 q2 q3 qT−1 qT
Figure 6.9: (a) A sequential processing machine consists of an logic circuit φ and an external
storage device M . In each step t, φ takes the t-th input xt and state qt (stored in the external
memory) and computes the t-th output yt and the (t+1)-th state qt+1. (b) The computation of
the sequential processing machine can be “unwinded”, thereby constructing a large logic circuit
consisting of T concatenations of φ that computes the same function as before.
65
6. RESOURCES
transmitted is given by the product
T · (S + 1),
where it is seen that (S + 1) is a measure of the maximum capacity that the circuit φ,
seen as a communication channel, can achieve. Note that T and S correspond to the
time and space complexity of the function f . Let C(f) and C(φ) denote the minimum
number of logic gates needed to implement the functions f and φ respectively. Then,
C(f) ≤ T · C(φ) = κ · T · S,
because C(φ) = κS for some positive κ and because the implementation of f as a
sequential processing machine cannot use less logic gates than the direct implementation
of f . Hence, the order of the time-space product is lower-bounded by the circuit
complexity of f . In some sense, it seems that:
Computer Science Information Theory
Compute f(x) ⇐⇒ Communicate x and f
although the author is unaware of any proof of this claim.
The point is that, if we are willing to accept the time-space product as an ap-
propriate measure of the computational complexity, then the computational resources
can be related to information-theoretic resources—and as such, they are governed by
information-theoretic principles. In other words, the computational resources corre-
spond to the amount of bits that have to be transmitted to communicate a “compu-
tation”. While this interpretation is intuitively appealing, there are still many open
questions left. For instance, it is unclear what “computation” means in this sense,
and it is also unclear how much computation is necessary in order to compute a given
function (Sipser, 1996; Papadimitriou, 1993).
6.3 Resource Costs in Agents
In the previous section, we have seen that resources can be formalized in information-
theoretic terms, and that this formalization can be given a thermodynamic and a
computational interpretation. This can be summarized as follows. Given a set U
of options, a receiver that does not know beforehand what the choice will be has to
allocate resource costs (codeword lengths) that implicitly specify his beliefs Pr(u) about
the realization of the choice u ∈ U . Furthermore, when the receiver makes a (possibly
stochastic and/or partial) choice represented by a probability distribution Pr′(u) over
U , then the amount of bits spent by the receiver in order to record this choice is given
by the relative entropy ∑
u
Pr′(u) log
Pr′(u)
Pr(u)
,
which is a generalization of the information content from deterministic choices to prob-
abilistic choices. We have argued that this quantity is proportional to the amount of
66
6.3 Resource Costs in Agents
thermodynamic work that the receiver has to carry out in order to record the choice.
Furthermore, we have sketched a relation between the amount of bits and the required
time-space product of the associated computation.
We argue that this way of thinking has many advantages: It greatly simplifies the
analysis of resource costs, because we only have to deal with changes in probability
distributions. This allows us abstracting away from the algorithmic details in order to
reason about computational costs. The objective of this section is to explain how this
formalization can be used to calculate the resource costs of running and constructing
an agent.
6.3.1 Cost of Interaction
We have formalized autonomous systems as probability distributions over interaction
sequences. According to the information-theoretic arguments presented in this chapter,
this means that we are implicitly assigning resource costs for interactions.
This makes sense from an intuitive point of view. The implementation (or embod-
iment) of an agent facilitates some interactions while it hampers others. For instance,
biologists can infer a great deal of the habits of a species by studying its anatomy. The
rationale behind this is that animals manifest energy-efficient behavior more frequently
than energy-intensive behavior. This kind of reasoning acquires an extreme form in
paleontology, where behavior is mainly inferred from fossilized animals. Conversely,
in engineering, systems are designed such that they minimize the resource costs of fre-
quent or desirable operations and uses. This is visible in the designs of cars, aeroplanes,
buildings, algorithms, advertising campaigns, etc.
From an information-theoretic point of view, an agent interacting with an environ-
ment is communicating an interaction sequence. Whenever the agent interacts with its
environment (either producing an output or reading an input), its “physical state” or
“internal configuration” changes as a necessary consequence of the interaction—simply
because the two instants are distinguishable. In other words, if the instants before and
after the interaction cannot be told apart, then we are forced to conclude that they
are empirically the same. This change in “physical state” or “internal configuration”
can take place in many possible ways: for instance, by a chemical reaction; by up-
dating the internal memory; by consulting a random number generator; by moving to
another location; or even by simply advancing in time (i.e. by changing the value of the
time-coordinate). Hence, in this context, the semantics of “physical state” or “internal
configuration” corresponds to an abstract information state: it is a description that
exhaustively characterizes a situation of an agent. We call such a description a state
of the agent.
To make this notion of states concrete, we introduce the following model. We
assume that a change in state occurs whenever the agent either issues an action or
reads an input. We start out from a blank binary tape that we will use to record the
agent’s experience. Then, we iteratively append a new binary string that encodes the
new input or output symbol experienced by the agent (Figure 6.10). In this model, the
appended binary strings are proxies for the changes in state that the agent experiences
67
6. RESOURCES
a1 a2 a3 aTo1 o2 oT
0 1 0 1 1 1 0 1 0 0 1 1
6
6
3
3
8
8
3
3
a) b)
Figure 6.10: State Model. The agent has an I/O domain given byA := O := {1, 2, . . . , 10}. So
far, the agent has experienced four I/O symbols (Panel a). The state of the agent is constructed
by iteratively encoding the four I/O symbols into binary strings (Panel b).
during its interactions with the environment. In this way, we abstract away from the
inner workings of the agent by simply representing every change by a binary string.
Note that this scheme does not allow an agent returning to a previous state, hence the
agent cannot “jump back in time”. Additionally, we want the content of the binary
tape to be uniquely decodeable, such that we can recover the whole I/O history the
agent has experienced so far at any given time by decoding the content of the tape.
This model highlights the correspondence between resources, codeword lengths and
probabilities. Therefore, the behavior of the agent can be thought of as a reflection of
the underlying resource costs.
6.3.2 Costs of Construction
Remark 19 This subsection is of speculative nature. 2
Designing and constructing an agent has a cost. Whether we are thinking to find a
solution to optimality equations, running a search algorithm, or assembling mechatronic
parts, we are always spending resources during the conception of an agent. These
resource costs can be thought of as arising from the change of the distribution over the
possible agents, where the cost of this change is given by the relative entropy.
Let Θ denote the index set parameterizing the possible agents. Furthermore, let
Pr(θ) denote the belief we have about θ being the optimal parameter. Consider the
relative entropy
ρ =
∑
θ
Pr′(θ) log
Pr′(θ)
Pr(θ)
, (6.4)
where Pr′(θ) is the (partial) choice made by the sender. This quantity correctly mea-
sures the number of bits we are going to receive over a noiseless channel from a sender
that picks out the right θ.
However, there is caveat: The previous calculation represents a situation where
we passively receive the optimal answer, which is not possible unless the sender is an
“oracle” who guesses the first ρ bits of the optimal θ. To correctly calculate the required
number of bits in this communication problem, we have to account for the fact that
we do not know the sender. Not knowing the sender has important implications, since
this uncertainty might lead to significant resource costs that we are not accounting for.
68
6.4 Historical Remarks & References
Unfortunately, the author is not aware of any widely accepted way of dealing with this
situation. However, one can speculate that the amount of information is
2O(ρ),
that is, exponential in the amount of information needed when the sender is known.
Roughly speaking, the justification for this intuition is based on results from computa-
tional complexity, where a non-deterministic machine can be simulated by deterministic
machine but incurring exponential cost (Sipser, 1996; Papadimitriou, 1993). In the next
chapter, this claim is made precise for a special case.
6.4 Historical Remarks & References
The fundamental results of information theory were almost entirely developed in the paper by
Shannon (1948). In particular, Shannon’s paper derives a quantitative measure of information
based on three desiderata. Surprisingly, the resulting formula for information turned out to have
the same mathematical shape as the formula for thermodynamic entropy discovered empirically
by Boltzmann (in a simpler form) and Gibbs. Because of this, much of information theory
borrows mathematics from thermodynamics. Information theory has found a wide range of
applications in communication, coding, compression, statistical learning, dynamical systems,
and other areas (Khinchin, 1957; Ash, 1965; Kolmogorov, 1968; Gallager, 1968; Billingsley, 1978;
Cover and Thomas, 1991; Li and Vitanyi, 2008; MacKay, 2003; Gru¨nwald, 2007). The relation
between information (more specifically, codeword lengths) and probability, follows roughly the
argument presented in Gru¨nwald (2007). The Kraft-McMillan inequality was developed in two
steps by Kraft (1949) and then by McMillan (1956).
The relative entropy ∑
u
Pr′(u) log
Pr′(u)
Pr(u)
has been introduced by Kullback and Leibler (1951). The standard interpretation is as follows.
The relative entropy measures the expected number of extra bits required to code samples
from Pr′ when using a code based on Pr, rather than using a code based on Pr′. The idea
of the relative entropy as a generalized formula for the information content is the author’s
(non-standard) interpretation. While mathematically equivalent, conceptually the author’s
interpretation seems to suggest a “temporal directionality”, where Pr and Pr′ represent the
knowledge state of the receiver before and after receiving information.
The connections between information theory, thermodynamics and computational complex-
ity presented in this chapter borrow ideas from various sources. The relation between units of
energy and units of information, has been originally put forward in the context of Maxwell’s
demon by various authors (Maxwell, 1867; Szilard, 1929; Brillouin, 1951, 1956; Gabor, 1964).
The first modern argument linking information theory with thermodynamics is due to Landauer
(1961). Since then, the ideas of the thermodynamics of computing have found wide acceptance
(Tribus and McIrvine, 1971; Bennett, 1973, 1982; Feynman, 2000; Li and Vitanyi, 2008). The
devices shown in figures 6.5 and 6.6 are the author’s contribution, and they differ from the de-
vices in the literature in that they do not allow expanding the volume (i.e. erasing knowledge)
but only compressing it (i.e. acquiring knowledge).
The relation between computational resources and information is an area under active re-
search. Time-space tradeoffs have been investigated in Borodin and Cook (1980), Beame (1989),
69
6. RESOURCES
Beame et al. (2001) and Beame et al. (2003) using a computational model called branching
programs. The time-space tradeoff related to circuit complexity is presented in Savage (1998).
The speculations relating the transmission of information to computational complexity are due
to the author, although related ideas have been put forward by Bremermann (1965).
The arguments linking information-theoretic resources to the cost of interaction and con-
struction are an original contribution of the author. The concept of an oracle, while non-
standard in the context of information theory, is commonplace in computational complexity.
An oracle is a hypothetical device which is capable of answering decision problems in a single
operation (Sipser, 1996; Papadimitriou, 1993), and they are widely used in cryptography to
make arguments about the security of cryptographic protocols involving hash functions.
70
Chapter 7
Boundedness
If a designer has unlimited computational resources, then he would pay the cost of cal-
culating the optimal policy for an autonomous agent. However, in Chapter 5 we have
argued that the rigorous application of the maximum SEU principle is computationally
too expensive to serve as a practical design principle. Therefore, most implementa-
tions make severe domain restrictions and/or approximations in order to significantly
reduce the computational expenses. In other words, designers content themselves with
suboptimal solutions to the original problem.
The preference of the designer of choosing a suboptimal solution over the optimal
one suggests that resources have an impact over the “perceived utility” of the solution.
One can argue that this “suboptimal solution” becomes the “optimal solution” when
resources are taken into account. Hence, intuitively there seems to be a common
“currency” for resources and utilities. What is this currency? If there were one, then
one could compute the optimal solution that trades off the benefits obtained from
maximizing the expected subjective utility and the resource costs of the calculation.
7.1 An Example of Boundedness
We start our discussion with a concrete example. We are given an array of N numbers
(v1, v2, . . . , vN ), and our task is to pick the largest one. We solve this problem by
checking the numbers one by one in any order, always comparing the current value
against the largest seen so far. It is easy to see that this algorithm takes O(N) time
and O(logN) space if we assume that each comparison is done in a single computational
step and that the indices are represented using ⌈logN⌉ bits. The time-space complexity
of this algorithm is O(N logN).
From an information-theoretic point of view, this algorithm transforms the knowl-
edge state about the maximum. Figure 7.1 shows an example array having N = 10
elements with values in {1, 2, . . . , 10}. Initially, the algorithm doesn’t know the loca-
tion of the maximum: If we assume that each index is encoded with log 10 ≈ 3.3219
bits, then the initial distribution is uniform. After running the algorithm, the location
of the maximum is known, which is represented by a delta function concentrating its
71
7. BOUNDEDNESS
probability mass on number 10. Notice that for discovering logN bits of information
we are running an algorithm of time complexity O(N) = 2O(logN), that is, exponential
in the number of bits. Compare this to the arguments for the cost of construction
presented in Section 6.3.2, page 68.
a) b) c)
Figure 7.1: An Exhaustive Optimization. (a) A shuffled array with numbers from 1 to 10 is
given. Initially, the algorithm does not know where the optimum is, which is represented by a
uniform distribution (b) over the elements. After the execution of the algorithm, the solution
was found, which is represented by a distribution (c) concentrating its probability mass on the
maximum.
If N is small, say N = 10, then choosing the largest number is easy, because one can
simply revise the whole array and then pick the largest number. However, if N is very
large, say N = 1000, then comparing all the numbers becomes a difficult task. How do
we go about this problem then? One solution would be to limit ourselves to comparing
only a fraction of the array, say M ≪ N elements. This reduces the computational
complexity to O(M) time and O(logN) space at the cost of tolerating an error with
probability 1 −M/N . That is, we can give up certainty for the sake of computational
efficiency. Notice that the time-space product is O(M logN), which is linear in M for
fixed N .
To understand the effect of the resulting tradeoff, we again analyze how the know-
ledge state is transformed. We assume that the set was shuffled beforehand and that the
algorithm inspects the first M outcomes in linear order, choosing the largest number.
We then use the frequency of choosing each number as its probability. This assigns
equal probability to each one of theN ! possible permutations of the array. The resulting
probability distributions are shown in Figure 7.2. Here, we see that inspecting only one
element does not change the state of knowledge at all; that increasing M moves the
probability mass towards the larger numbers; and that complete certainty is achieved
when M = N . Furthermore, notice that we can actually infer the ranking of the
numbers by merely looking at the distributions, since larger numbers consistently get
more probability mass.
The particular shape of the resulting distributions has some interesting properties.
Let ρ denote the relative entropy between the initial and the final distribution, and let
C =M − 1 denote the number of comparisons carried out by the algorithm, and let V
denote the expected maximum. Note that C = O(M logN) for fixed N , i.e. it serves as
a measure of the computational complexity. If we plot ρ versus C (Figure 7.3a), we see
72
7.1 An Example of Boundedness
M = 1 M = 2 M = 3 M = 4 M = 5
M = 6 M = 7 M = 8 M = 9 M = 10
Figure 7.2: Distributions after Bounded Optimization. The plots show the distributions over
the maximum in {1, 2, . . . , 10} obtained after running the bounded optimization algorithm for
different values of M .
a surprising property: the quantities are proportional, having a correlation coefficient1
of r = 0.9991. That is, we can use ρ as a good measure of the computational complexity
of the algorithm. We also want to understand how the expected value V evolves as we
increase the computational effort. This is seen by plotting V versus ρ (Figure 7.3b).
This plot shows that certainty has a diminishing marginal value, that is, the gain in
the value decreases with more computation. Intuitively, this is because the better the
candidate solution, the more effort it takes to find an even better one. Moreover, the
shape of the expected utility turns out to be logarithmic, as can be seen by plotting
2V versus ρ.
Intuitively, if we care about computational costs, then it is not always a good idea
to run an exhaustive optimization algorithm, because we can reach a point where the
gain in value is too small to justify the extra computational effort. This idea can be
captured by changing the evaluation criterion to one that penalizes the expected value
by the relative entropy, e.g.
V − αρ,
where α is a conversion factor. Fixing α defines a tradeoff between the expected value
and the relative entropy (Figure 7.4). The plot confirms our intuition: the smaller
α, the larger the number of elements we compare, and the more “rational” our choice
becomes. A brief simulation also shows that a achieving a perfectly rational choice does
not require α = 0; it is already achieved for α ≈ 0.2131.
This analysis leads to a principled algorithm for bounded optimization. For a
fixed α, consider the algorithm that linearly inspects the elements of the array until
1More precisely, the Pearson product-moment correlation coefficient is an indicator of the linear
dependence between two variables, having values ranging from −1 to 1. A value equal to 1 means that
a linear relation perfectly describes the relationship between the two variables.
73
7. BOUNDEDNESS
a) b) c)
ρ
C
U
ρ
2U
ρ
r = 0.9991 r = 0.9937
Figure 7.3: Performance of the Bounded Optimization. In panel (a), it is apparent that the
relative entropy ρ is proportional to the number of comparisons C carried out by the algorithm,
and hence ρ serves as a measure of the computational complexity. Plot (b) shows the expected
value V against the relative entropy ρ. The marginal increment of V diminishes as ρ increases.
Furthermore, panel (c) shows that 2V is linear in ρ, meaning that V is logarithmic in ρ.
V
−
α
ρ
ρ
α = 0.5
α = 1.0
α = 2.0
Figure 7.4: Expected Value Penalized by Relative Entropy. Choosing the conversion factor α
defines a tradeoff between the expected value and the relative entropy. In the plot, three
performance curves are shown, corresponding to the tradeoffs α = 12 , 1 and 2. Notice how an
exponential increase of α leads to a linear increase in the optimal ρ.
74
7.2 Utility & Resources
the largest number found so far, penalized by αρ, reaches its peak. This algorithm is
stochastic, with the property that the probabilities of choosing a number are monotonic:
for all i, j,
pi > pj ⇐⇒ vi > vj
where vi is the value of the i-th element and and pi is its probability of being chosen.
This seems to be a more general property of a bounded optimization algorithm. Fur-
thermore, notice that because of the diminishing marginal value, it is easy to reach a
good performance level (although it is very hard to approach the optimum).
This concludes our example. In the remainder of this chapter, we aim to develop
a general framework of bounded rationality and then apply it to autonomous systems.
We are especially interested in providing a solid axiomatic basis for bounded rationality.
7.2 Utility & Resources
In physics, the behavior of a system can be described in two ways: using the dynamical
equations or using an extremal principle (Goldstein, 1980). The first one specifies how
the coordinates of the physical system change in time, like e.g. in Newton’s second law
F =
dp
dt
where F is the force, p is the momentum and t is time. The second expresses the
dynamics of a system as the solution to a variational problem, like e.g. the action
integral
A =
∫ tf
ti
L[q, q˙, t] dt
in Langrangian Mechanics, where L is the Langrangian (i.e. the kinetic energy minus
the potential energy of the system), q are the generalized coordinates of the system
and t is time. According to the principle of least action, the dynamical equations of
the system are then obtained by finding the trajectory q(t) that is an extremum of the
action integral. On a conceptual level, the difference between stating the dynamical
equations or the extremum principle to specify a physical system is analogous to the
difference between stating the output model or the subjective expected utility in order
to specify an autonomous system.
In the previous chapter, we have argued that the resource cost (i.e. work) of ob-
serving an event A given B is
−γ logP(A|B),
where γ > 0 is the conversion factor between units of energy and information. This
has the advantage of linking three disparate concepts together, namely resource costs
(physics), information content (information theory) and behavior P(A|B) (statistics).
Can we exploit this connection to physics in order to devise a new principle for con-
structing autonomous systems? The answer is yes, but this connection is not straight-
forward, because it requires revising the way we think about utilities and about the
process of constructing an autonomous system.
75
7. BOUNDEDNESS
7.2.1 Utility
In this section, we propose a non-standard concept of utility that is more in accord
with thermodynamics and information-theory. Let U(A) denote the utility of an event
A ∈ F and let
u(A|B) := U(A ∩B)−U(B)
be a shorthand to denote the gain in utility obtained from experiencing event A given
event B. We will derive the relation
u(A|B) = α logP(A|B)
where α > 0 is a conversion factor between units of utility and information. This
relation is obtained from desiderata characterizing the notion of utilities in systems
that are “free” to generate events. More specifically, by a free system we mean a
system that can choose the sampling probabilities of events generated by itself.
Consider a free system represented by a finite probability space (Ω,F ,P). Here,
the probability measure P models the generative law that the system uses to choose
events. Thus, if P(A) > P(B), then the propensity of experiencing A is higher than
B. In such a system, differences in probability can be given an interpretation relating
them to differences in preferences: one can say that A is more probable than B because
A is more desirable than B. In other words, a system is more likely to choose the
events that it prefers. Using this line of reasoning, can we find a quantity which will
measure how desirable an event is? We call such a measure of a utility gain function,
although one should always bear in mind that the resulting utilities do not correspond
to the same notion of utility that we have seen in Part I.
If there is such a measure, then it is reasonable to demand the following three
properties for it:
i. Utility gains should be mappings from conditional probabilities into real numbers.
ii. Utility gains should be additive. That is, the gain of a joint event should be
obtained by summing up the gain of the sub-events. For example, the “gain of
drinking coffee and eating a croissant” should equal “the gain of drinking coffee”
plus the “gain of having a croissant given the gain of drinking coffee”.
iii. A more probable event should have a higher utility gain than a less probable event.
For example, if “drinking a coffee” is more likely than “eating a croissant”, then
the utility gain of the former is higher. Note that this is not necessarily true
anymore if the system is not free to choose between events. For instance, “losing
the lottery” has a much higher probability than “winning the lottery” even though
the utility gain of the latter is obviously higher, but this case differs from the
previous example in that the event is not determined by the system itself.
The three properties can then be summarized as follows.
76
7.2 Utility & Resources
Definition 22 (Axioms of Utility) Let (Ω,F ,P) be a probability space. A set func-
tion U : F → R is a utility function for P iff its utility gain function u(A|B) :=
U(A ∩B)−U(B) has the following three properties for all events A,B,C,D ∈ F :
U1. ∃f,u(A|B) = f(P(A|B)) ∈ R, (real-valued)
U2. u(A ∩B|C) = u(A|C) + u(B|A ∩C), (additive)
U3. P(A|B) > P(C|D) ⇔ u(A|B) > u(C|D). (monotonic)
Furthermore, we use the abbreviation u(A) := u(A|Ω). 2
The following theorem shows that these three properties enforce a strict mapping
between probabilities and utility gains.
Theorem 5 (Utility Gain ↔ Probability) If f is such that u(A|B) = f(P(A|B))
for every probability space (Ω,F ,P), then f is of the form
f(·) = α log(·),
where α > 0 is arbitrary strictly positive constant. 2
Proof Given arbitrary p, q ∈ (0, 1] and n,m ∈ N, one can always choose a probability
space (Ω,F ,P) having sequences of events A1, A2, . . . , An ∈ F and B1, B2, . . . , Bm ∈ F ,
and an event C ∈ F such that
p = P(A1|C) = P(A2|C ∩A1) = · · · = P(An|C ∩A<n),
q = P(B1|C) = P(B2|C ∩B1) = · · · = P(Bm|C ∩B<m),
where we use the shorthand A<j :=
⋂
i<j Ai. Applying f to the first sequence yields
the equivalence
P(A1|C) = P(Ai|C ∩A<j) ⇐⇒ u(A1|C) = u(Ai|C ∩A<j) (7.1)
for all i = 1, . . . , n. Then, one can show
f
(
P(A1|C)n
) (a)
= f
( n∏
i=1
P(Ai
∣∣∣C ∩A<j)) (b)= f(P(A1 ∩ · · · ∩An|C))
(c)
= u(A1 ∩ · · ·An|C) (d)=
n∑
i=1
u(Ai|C ∩A<j)
(e)
= nu(A1|C) (f)= nf(P(A1|C)).
Equality (a) is obtained by substituting each P(A1|C) in the product P(A1|C)n with
a corresponding P(Ai|C ∩A<j) term. Equality (b) is obtained by repeatedly applying
the product rule. Applying first the function to transform the probability into a utility
gain and then using the additivity property yields equalities (c) and (d). Finally,
77
7. BOUNDEDNESS
equalities (e) and (f) are obtained by first using (7.1) and then transforming back to
probabilities. Similarly, for the second sequence, one has
f
(
P(B1|C)m
)
= mf(P(B1|C)).
Since p, q and n,m are arbitrary, then there is always a probability space such that
f(pn) = nf(p) and f(qm) = mf(q).
This part of the argument parallels Shannon’s entropy theorem (Shannon, 1948).
Let p, q ∈ (0, 1] such that q < p. Choose an arbitrarily large m ∈ N and find an n ∈ N
to satisfy qm < pn < qm+1. Taking the logarithm, and dividing by n log q one obtains
m
n
<
log p
log q
<
m
n
+
1
n
. (7.2)
Similarly, using f(pn) = nf(p) and the monotonicity of f , we have
qm < pn < qm+1
⇐⇒ f(qm) < f(pn) < f(qm+1)
⇐⇒ mf(q) < nf(p) < (m+ 1)f(q).
Dividing the last set of inequalities by nf(p) yields
m
n
<
f(p)
f(q)
<
m
n
+
1
n
. (7.3)
Combining the inequalities in (7.2) and (7.3), one gets∣∣∣ log p
log q
− f(p)
f(q)
∣∣∣ < 2
n
.
Since m,n can be chosen arbitrary large, this implies
log p
log q
=
f(p)
f(q)
in the limit n→∞. Fixing q and rearranging terms gives the functional form
f(p) = α log p,
where α must be positive to satisfy the monotonicity property.
The previous choice of probability spaces implies that f(p) = α log p. We show now
that this choice does not violate the axioms for any probability space. First, f is real
valued. Second,
u(A ∩B|C) = α logP(A ∩B|C) = α log(P(A|C)P(B|A ∩ C))
= α logP(A|C) + α logP(B|A ∩ C) = u(A|C) + u(B|A ∩ C),
where the equalities are obtained by using the product rule for probabilities and a
property of logarithms. Third, f is monotonic. Hence, f(p) = α log p holds for any
probability space. 
78
7.2 Utility & Resources
Thus, if one is willing to accept Definition 22 as a reasonable characterization of a
utility function, then one obtains the relations
U(A ∩B)−U(B) = α logP(A|B). (7.4)
In this relation, α > 0 plays the role of a conversion factor between utilities and
information. If a probability measure P and a utility function U satisfy the relation
(7.4), then we say that they are conjugate. We call one unit of utility one utile. Given
that this transformation between utility gains and probabilities is a bijection, one has
that
P(A|B) = 2 1α (U(A∩B)−U(B)) .
There are several important observations with respect to this particular functional form.
First, recall that in Bayesian probability theory (Section 4.1), a new measurement
A is combined with the previous knowledge B by an intersection, i.e. the posterior
knowledge is given by A∩B. Furthermore, note that utility gains are always negative,
i.e.
U(A ∩B)−U(B) ≤ 0.
This means that every measurement decreases the utility. While this sounds counterin-
tuitive, the relation also says that minimizing the resource costs maximizes the utility
that is achieved after the interaction. Hence, if the free system is forced to act, then it
will favor outcomes that reduce the utility less.
Second, note that in the context of thermodynamics discussed in Section 6.2.1,
doing work W = −γ logP(A|B) on a physical system changes its energy level from
e(B) to e(A ∩B) as
e(B) −→ e(A ∩B) = e(B) +W = e(B)− γ logP(A|B).
That is, the work done on a system increases its internal energy. Comparing this with
the relation
U(A ∩B)−U(B) = −α logP(A|B)
leads to the conclusion that
U(A) = −αγ e(A)
for all A ∈ F . Hence, the utilities that we have derived from mathematical desiderata
are (safe for a conversion factor) negative energy levels. This is useful because it allows
us analyzing decision making by borrowing ideas from thermodynamics.
Furthermore, note that the exponentiation of the utility function can be written as
a sum of parts. That is, if A1, A2 ∈ F form a partition of A ∈ F , then
2
1
α
U(A) = 2
1
α
U(A1) + 2
1
α
U(A2).
79
7. BOUNDEDNESS
This is because
P(A|Ω) = P(A1|Ω) +P(A2|Ω),
2
1
α
U(A)
2
1
α
U(Ω)
=
2
1
α
U(A1)
2
1
α
U(Ω)
+
2
1
α
U(A2)
2
1
α
U(Ω)
,
2
1
α
U(A) = 2
1
α
U(A1) + 2
1
α
U(A2).
In particular, one can rewrite any probability P(A|B) as a Gibbs measure:
P(A|B) =
∑
ω∈A 2
1
α
U(ω)∑
ω∈B 2
1
α
U(ω)
.
where we have used the abbreviation U(ω) := U({ω}). As the conversion factor α
approaches zero, the probability measure P(ω) approaches a uniform distribution over
the maximal set Ωmax := {ω∗ ∈ Ω|ω∗ = argmaxωU(ω)}. Similarly, as α → ∞,
P(ω) → 1|Ω| , i.e. the uniform distribution over the whole outcome set Ω. Also, note
that
2
1
α
U(Ω) =
∑
ω
2
1
α
U(ω).
Intuitively, the utility U(Ω) of the sample set corresponds to the utility of the system
before any interaction has occurred.
7.2.2 Variational principle
The conversion between probability and utility established in the previous section can
be characterized using a variational principle. Inspired by thermodynamics, we now
define a quantity that will allow us analyzing changes of the state of the system.
Definition 23 (Free Utility) Let (Ω,F) be a measurable space and letU be a utility
function. Define the free utility functional as
J(Pr;U) :=
∑
ω∈Ω
Pr(ω)U(ω) − α
∑
ω∈Ω
Pr(ω) logPr(ω),
(the term Pr(ω) := Pr({ω}) is an abbreviation) where Pr is an arbitrary probability
measure over (Ω,F). 2
Here, we see that the free utility2 is the expected utility of the system plus the
uncertainty (i.e. the entropy) over the outcome multiplied by the utility-information
conversion factor. It satisfies the following relation.
2The functional F := −J is also known as the Helmholtz free energy in thermodynamics. F is a
measure of the “useful” work obtainable from a closed thermodynamic system at a constant temperature
and volume.
80
7.2 Utility & Resources
Theorem 6 (Variational Principle) A conjugate pair (P,U) satisfies
J(Pr;U) ≤ J(P;U) = U(Ω).
where Pr is an arbitrary probability measure over Ω. 2
Proof Rewriting terms using the utility-probability conversion and applying Jensen’s
inequality yields
J(Pr;U) =
∑
ω∈Ω
Pr(ω)U(ω) − α
∑
ω∈Ω
Pr(ω) logPr(ω)
= α
∑
ω∈Ω
Pr(ω) log
2
1
αU(ω)
Pr(ω)
≤ α log
∑
ω∈Ω
Pr(ω)
2
1
αU(ω)
Pr(ω)
= α log
∑
ω∈Ω
2
1
αU(ω)
= U(Ω),
with equality iff 2
1
αU(ω)
Pr(ω) is constant, i.e. if Pr = P. 
The variational principle tells us that the probability law P of the system is the one
that maximizes the free utility for a given utility function U, since
P = argmax
Pr
J(Pr;U).
Hence, the utility function U plays the role of a constraint landscape for probability
measuresPr out of which the conjugate probability measureP is the one that maximizes
the uncertainty.
7.2.3 Bounded SEU
In SEU theory, the designer constructs an autonomous system by first assuming a
probability measure that characterizes the behavior of the environment and a utility
criterion to compare different outcomes, and then by calculating a policy that maxi-
mizes the subjective expected utility. During this process, the calculation of the optimal
policy is done disregarding the costs of the computation. As we have argued at the
beginning of this chapter, here we do want to take into account the costs of computing
the optimal policy. However, this poses two important problems.
First, how do we measure the costs of this computation? To answer this question,
we first have to agree on what this calculation is actually doing. We know that after
this calculation, the agent ends up with an optimal policy. But what if this calculation
81
7. BOUNDEDNESS
had been omitted? Clearly, the agent would not end up with an optimal policy—unless
the agent already had the optimal policy from the very beginning. The important
conclusion here is that, in the general case, the purpose of such a calculation is to
transform an initial system Pi into a final system Pf . In the previous chapter, we have
seen that the amount of bits that have to be set to carry out this transformation is
given by the relative entropy, i.e.∑
ω∈Ω
Pf (ω) log
Pf (ω)
Pi(ω)
.
Hence, the cost measured in joules and in utiles is obtained by multiplying the previous
quantity by γ and α respectively.
Second, how do we optimally choose this transformation? In other words, how
should we transform the initial system Pi such that the final system Pf optimally
trades off the benefits of maximizing a given utility function U∗ against the cost of the
transformation? To answer this question, consider the transformation represented in
Figure 7.5. The initial system satisfies the equation
Ji :=
∑
ω∈Ω
Pi(ω)Ui(ω)− α
∑
ω∈Ω
Pi(ω) logPi(ω) = Ui(Ω).
We add new constraints represented by the utility function U∗. Then, the resulting
utility function Uf is given by the sum
Uf = Ui +U∗.
The free utility Jf of the final system is then given by
Jf :=
∑
ω∈Ω
Pf (ω)Uf (ω)− α
∑
ω∈Ω
Pf (ω) logPf (ω)
=
∑
ω∈Ω
Pf (ω)(Ui(ω) +U∗(ω))− α
∑
ω∈Ω
Pf (ω) logPf (ω)
=
∑
ω∈Ω
Pf (ω)U∗(ω)− α
∑
ω∈Ω
Pf (ω) log
Pf (ω)
Pi(ω)
+Ui(Ω).
To understand the change that has occurred due to this transformation, we take the
difference Jf − Ji in the free utility. This results in a quantity that will play a central
role in our next development.
Definition 24 (Bounded Subjective Expected Utility) LetU∗ a utility function
and let Ji and Jf be the free utility for the conjugate pairs (Pi,Ui) and (Pf ,Uf ). Then,
the bounded subjective expected utility (bounded SEU) is given by the difference
in free utility
Jf − Ji =
∑
ω∈Ω
Pf (ω)U∗(ω)− α
∑
ω∈Ω
Pf (ω) log
Pf (ω)
Pi(ω)
. (7.5)
82
7.2 Utility & Resources
Pi,Ui Pf ,Uf
+U∗
Figure 7.5: Transformation of a System. A transformation from a system (Pi,Ui) into a
system (Pf ,Uf ) by addition of a constraint U∗.
This difference in free utility has an interpretation that is crucial for the formaliza-
tion of bounded rationality: it is the expected target utility U∗ (first term) penalized
by the cost of transforming Pi into Pf (second term). In practice, this means the
following. A designer starts out with an initial system Pi that he wants to change in
order to maximize a utility functionU∗. He then changes this system into Pf , spending
in the process a total amount of
γ
∑
ω∈Ω
Pf (ω) log
Pf (ω)
Pi(ω)
joules of energy. The expected utility of the system is given by∑
ω∈Ω
Pf (ω)U∗(x).
However, from the point of view of the designer, the total expected utility is smaller,
because he has to subtract the reduction in utility caused by the cost of the transfor-
mation itself: ∑
ω∈Ω
Pf (ω)U∗(ω)− α
∑
ω∈Ω
Pf (ω) log
Pf (ω)
Pi(ω)
.
Remark 20 Alternatively, one can interpret bounded SEU as the expected utility pe-
nalized by an “uncertainty” term. In this interpretation, the relative entropy measures
the “risk” of a gamble. 2
Because of its interpretation, we define (7.5) as a functional to be maximized, ei-
ther with respect to Pf or with respect to Pi. We call this construction method the
maximum bounded SEU principle. Depending on whether we vary Pf or Pi, one
obtains two different variational problems having different applications.
Control Method. The construction method for control corresponds to the case
when the designer wants to build a system that optimizes a given utility function U∗
subject to the costs of the transformation. This is achieved by fixing Pi and varying
Pf :
Pf = argmax
Pr
∑
ω∈Ω
Pr(ω)U∗(ω)− α
∑
ω∈Ω
Pr(ω) log
Pr(ω)
Pi(ω)
. (7.6)
83
7. BOUNDEDNESS
The solution is given by
Pf (ω) ∝ Pi(ω) exp
(
1
α
U∗(ω)
)
.
This can be interpreted as follows. The decision maker starts out with a prior probabil-
ity of choosing ω given by Pi(ω). Then, he changes this probability to Pf (ω) obtained
from multiplying the prior with the term exp( 1αU∗(ω)), which intuitively corresponds
to the “likelihood” of ω being the best choice. Compare this to the example given in
the introduction. Here, α controls the amount of resource units required to increase
the utility. In particular, when the conversion factor between information and utility
is negligible α ≈ 0, then (7.5) becomes
Jf − Ji ≈
∑
ω∈Ω
Pf (ω)U∗(x),
and hence resource costs are ignored in the choice of Pf , leading to Pf ≈ δω∗(ω), where
ω∗ = argmaxωU∗(ω). Similarly, when α is very high, then the difference is
Jf − Ji ≈ −α
∑
ω∈Ω
Pf (ω) log
Pf (ω)
Pi(ω)
,
and hence no computation is carried out to optimize the choice, i.e. Pf ≈ Pi.
Estimation Method. The construction method for estimation corresponds to the
case when the designer wants to build a system that approximates another system
(that is maximizing a possibly unknown utility function U∗) subject to the costs of the
transformation. This is achieved by fixing Pf and varying Pi:
Pi = argmax
Pr
∑
ω∈Ω
Pf (ω)U∗(ω)− α
∑
ω∈Ω
Pf (ω) log
Pf (ω)
Pr(ω)
(7.7)
= argmin
Pr
∑
ω∈Ω
Pf (ω) log
Pf (ω)
Pr(ω)
,
and thus we have recovered the minimum relative entropy principle for estimation,
having the solution
Pi = Pf
and therefore carry no transformation costs. While this “construction method” might
look bizarre at a first glance, it is saying something obvious: If the designer is looking
for a system Pi that approximates another known system Pf , then the best he can do
is to just choose Pf itself without having to carry out any transformation at all!
84
7.3 Bounded SEU in Autonomous Systems
7.3 Bounded SEU in Autonomous Systems
We tackle now the question of how to construct an autonomous system using the
bounded SEU principle. In the previous section, we have learnt that building a system
is conceptualized as transforming a preexisting system. Furthermore, there are two
construction methods, namely one for control and one for estimation.
We assume that we are in possession of a reference I/O model P0 and a utility
function U∗. The objective is to find an I/O model P maximizing the bounded SEU
using the method for control, estimation or a mixture of both. We further assume that
the conversion factor between utilities and information is given by α > 0.
To construct the associated bounded SEU, one has to carefully decide what con-
struction method to apply for each random variable. For instance, consider constructing
an I/O model P(ao) from a reference I/O model P0(ao) using a utility function U∗.
Then, assuming that a is derived using the control method and o using the estimation
method, one gets
P = argmax
Pr
{∑
ao
Pr(a)P0(o|a)U∗(ao)− α
∑
ao
Pr(a)P0(o|a) log Pr(a)P0(o|a)
P0(a)Pr(o|a)
}
.
This is because for a, the reference probabilities P0(a) play the role of Pf in (7.5); and
for o, the reference probabilities P0(o|a) play the role of Pi in (7.5).
A simple way to concisely write down the bounded SEU for mixed methods is by
defining two auxiliary I/O models R and S as
R(at|ao<t) :=
{
Pr(at|ao<t)
P0(at|ao<t)
R(ot|ao<tat) :=
{
Pr(ot|ao<tat) (control)
P0(ot|ao<tat) (estimation)
S(at|ao<t) :=
{
P0(at|ao<t)
Pr(at|ao<t)
S(ot|ao<tat) :=
{
P0(ot|ao<tat) (control)
Pr(ot|ao<tat) (estimation)
Then, the bounded SEU is given by
P = argmax
Pr

∑
ao≤T
R(ao≤T )U∗(ao≤T )− α
∑
ao≤T
R(ao≤T ) log
R(ao≤T )
S(ao≤T )

 . (7.8)
Observe that another way of looking at (7.8) is as a collection of independent variational
problems, where this collection contains one variational problem for each P(at|ao<t)
and P(ot|ao<tat). We now show two examples of how to construct autonomous systems
using the bounded SEU principle.
7.3.1 Bounded Optimal Control
As we have seen in Section 3.2.4, in optimal control problems it is generally assumed
that we are given a utility function U∗ and that the environment is fully known, i.e.
85
7. BOUNDEDNESS
P0(ot|ao<tat) = Q(ot|ao<tat). The probability measures R and S are given by
R(at|ao<t) = Pr(at|ao<t), R(ot|ao<tat) = P0(ot|ao<tat),
S(at|ao<t) = P0(at|ao<t), S(ot|ao<tat) = Pr(ot|ao<tat).
Hence, the variational problem to find P is to maximize the functional
∑
ao≤T
R(ao≤T )
[
U∗(ao≤T )− α
T∑
t=1
log
Pr(at|ao<t)
P0(at|ao<t)
− α
T∑
t=1
log
P0(ot|ao<tat)
Pr(ot|ao<tat)
]
, (7.9)
which results from using the definition of R and S in (7.8). In the variational problem
for the observation probabilities we can disregard the constraint utilities and the re-
source cost of the action probabilities. The t-th summand of the total expected reward
can then be written as
∑
ao<tat
R(ao<tat)
[∑
ot
P0(ot|ao<tat) log
Pr(ot|ao<tat)
P0(ot|ao<tat)
]
.
Since varying Pr(ot|ao<tat) does not influence the summands at times 6= t, the optimal
solution to this minimum relative entropy problem is trivially obtained by
P(ot|ao<tat) = P0(ot|ao<tat).
The variational problem with respect to the action probabilities is a bit more intricate,
since varying the action probability at time t has an impact on all subsequent condi-
tional action probabilities. The functional (7.9) can be expanded recursively, yielding
∑
a1
Pr(a1)
[
− α log Pr(a1)
P0(a1)
+
∑
o1
P(o1|a1)
[
+
∑
a2
Pr(a2|ao1)
[
− α log Pr(a2|ao1)
P0(a2|ao1)
+
∑
o2
P(o2|ao1a2)
[
+ · · ·
+
∑
aT
Pr(aT |ao<T )
[
− α log Pr(aT |ao<T )
P0(aT |ao<T )
+
∑
oT
P(oT |ao<TaT )U∗(ao≤T )
]
· · ·
]]]]
,
(7.10)
By inspection of the recursive expansion (7.10), one sees that the action probabilities
Pr(at|ao<t) can be solved backwards in time (akin to the Bellman optimality equations,
Section 3.2.3). The innermost variational problem can be written as
∑
aT
Pr(aT |ao<T )
[∑
oT
P(oT |ao<TaT )U∗(ao≤T )− α log
Pr(aT |ao<T )
P0(aT |ao<T )
]
.
86
7.3 Bounded SEU in Autonomous Systems
As discussed in Section 7.2.3, its solution is
P(aT |ao<T ) =
P0(aT |ao<T )
Zα(ao<T )
exp
{
1
α
∑
oT
P(oT |ao<TaT )U∗(ao≤T )
}
,
where Zα(ao<T ) is the normalizing constant, also known as the partition function. By
induction, it is seen that the variational problem for the time steps t < T can be written
as ∑
at
Pr(at|ao<t)
[
α
∑
ot
P(ot|ao<tat) logZα(ao≤t)− α log
Pr(at|ao<t)
P0(at|ao<t)
]
,
where the Zα(ao≤t) are the normalizing constants obtained from subsequent time steps.
The action probabilities P(at|ao<t) for the times t < T are given as
P(at|ao<t) =
P0(at|ao<t)
Zα(ao<t)
exp
{∑
ot
P(ot|ao<tat) logZα(ao≤t)
}
.
This way the optimal action probabilities can be computed recursively.
Case α→ 0. We first investigate the limit case for α→ 0. Identify the future utility
Fα(ao≤t) using the recursion
Fα(ao≤T ) := U∗(ao≤T ),
Fα(ao≤t) := α logZ
α(ao≤t).
If one takes the limit α→ 0, then the future utility Fα(ao<t) converges to the recursive
formula
F 0(ao≤T ) = U∗(ao≤T ),
F 0(ao<t) = maxat
∑
ot
P(ot|ao<tat)F 0(ao≤t),
and the policy P(at|ao<t) to3
P(at|ao<t) = δata∗ , where a∗ := maxat
∑
ot
P(ot|ao<tat)F 0(ao≤t).
Compare this to Definition 12. We have recovered the Bellman optimality equations for
utilities! Using an analogous construction one can also recover the Bellman optimality
equations for rewards.
Case α → ∞. Now we investigate the limit of unbounded resource costs by taking
the limit α→∞. This case yields
P(at|ao<t) = P0(at|ao<t).
Hence, the optimal agent is the reference agent itself, as expected.
3This result holds assuming that in each step there is only one optimal action.
87
7. BOUNDEDNESS
7.3.2 Adaptive Estimation
We consider the problem of adaptive estimation of an unknown source (i.e. envi-
ronment). Let P0 denote in the following a Bayesian input model with no actions,
i.e. characterized by P0(θ) and P0(ot|θ, o<t). This models possible input sources
P0(ot|θ, o<t) = Q(ot|θ, o<t) indexed by θ ∈ Θ and chosen randomly with probabili-
ties P0(θ). Let P0 denote the input model induced by P0. The auxiliary input models
R and S are given by
R(ot|o<t) = P0(ot|o<t), S(ot|o<t) = Pr(ot|o<t).
Substituting these into (7.8) yields
∑
o≤T
P0(o≤T )U∗(o≤T )− α
∑
o≤T
P0(o≤T ) log
P0(o≤T )
Pr(o≤T )
Here, the solution is P(o≤T ) = P0(o≤T ) for any utility function U∗, and thus P(ot|o<t)
is the predictive distribution (Section 4.2.2)
P(ot|o<t) =
∑
θ
P (θ|o<t)P (ot|o<t),
as expected.
Remark 21 The reference model P0 captures the current knowledge, which in this
case happens to be a state of uncertainty induced by the Bayesian model P0. This
illustrates that Bayesian models do not add any further complications as they integrate
seamlessly with the construction method. 2
Remark 22 This result looks trivial. However, notice that the variational problem
for ot can be rewritten (dropping some constants that do not affect the variational
problem) as
P(ot|o<t) = argmin
Pr
∑
θ
P (θ)
∑
o<t
P (o<t|θ)
∑
ot
P (ot|θ, o<t) log P (ot|θ, o<t)
Pr(ot|o<t) ,
which is the standard way of formulating the problem. The question posed is: “what is
the distribution P(ot|o<t) that minimizes the average relative entropy to an unknown
source that is chosen randomly?” This question is important in coding, because the
P(ot|o<t) chosen this way leads to the optimal adaptive compressor. From this point
of view, the fact that the optimum is given by the predictive distribution (and hence
the application of Bayes’ rule) is not trivial at all. 2
7.4 Historical Remarks & References
The formalization of bounded rationality presented in this chapter was entirely developed by
the author and D. A. Braun (Ortega and Braun, 2010d). The conversion between probability
88
7.4 Historical Remarks & References
and utility in Section 7.2.1 has been introduced for the first time in Ortega and Braun (2010b).
The proof of the variational principle is adapted from Keller (1998, Theorem 1.1.3).
Many of the presented concepts have pre-cursors in the literature. Previous theoretical
studies have reported, for example, structural similarities between entropy and utility functions,
see e.g. Candeal, De Miguel, Indura´in, and Mehta (2001). A duality between optimal control
and estimation has been reported by Todorov (2008), where an exponential transformation
mediates between the cost-to-go function and a probability distribution that acts as a backwards
filter. Free energy principles and the use of reference distributions for action selection have
been proposed by Todorov (2009) and Kappen, Gomez, and Opper (2009). In these studies,
transition probabilities of Markov systems were manipulated directly and the cost measured
as a probabilistic deviation with respect to the passive dynamics of the system. The idea of
bounded rationality through the consideration of information costs has been first proposed by
Simon (1982). One of the first instantiations of bounded rationality was introduced in the
context of risk in decision making (Borch, 1969). Accordingly, the decision maker prefers a
gamble maximizing the mean utility penalized by the variance of the gamble, where the latter
is multiplied by a risk factor. This is obviously conceptually in accordance with bounded SEU
introduced here if we interpret variance as a proxy for uncertainty.
Besides the links to classical decision theory that the presented material has by design,
there are obvious connections to information theory (Shannon, 1948), thermodynamics (see
e.g. Callen 1985) and statistical inference (see e.g. maximum entropy principles, Jaynes and
Bretthorst 2003). It remains to be seen whether these similarities are purely formal, or whether
they rest on more fundamental principles.
89
7. BOUNDEDNESS
90
Chapter 8
Causality
The maximum SEU principle assumes that an agent chooses its optimal policy before
the interaction with its environment starts. As has been argued in Chapter 5 and
further developed in Chapters 6 and 7, choosing the optimal policy requires running a
computationally intractable optimization algorithm.
This assumption can be relaxed by letting the choice of the optimal policy happen
during interaction, not before. This “dynamic choice” can be modeled by designing an
agent that is initially uncertain about the optimal policy, but then learns it from the
interactions with the environment. We can model this uncertainty over the optimal
policy in the same way we modeled the uncertainty over the environment, introducing
a Bayesian output model over a class of output models akin to the Bayesian input
model. In this way, we can skip the costly optimization step and use a Bayesian
mixture model to learn the optimal policy, driven exclusively by the interactions with
the environment. Obviously, the cost we pay for doing so is that the resulting adaptive
agent is suboptimal in the classical expected utility sense (although optimal in an
information-theoretic sense that will be explained in Chapter 9).
However, this usage of Bayesian probability theory will violate a subtle but vital
assumption underlying the reasoning model; namely, the predetermination of the out-
come. If the agent is just a passive observer, then it does not matter whether the
observations are generated on-the-fly or whether they were determined beforehand and
then merely “uncovered” by the environment, because the formal treatment is the same
in both cases. This is also true if the agent knows his policy. If the agent is certain
about its policy and the policy is deterministic, then actions are just functions of the
observations (Section 3.2.1). This in turn means that the I/O string can be modeled
as if it were randomly chosen by the environment alone before the interaction starts,
and then merely sequentially uncovered. However, if the agent is uncertain about its
policy, then its actions are determined dynamically (say, by choosing them according to
a stochastic rule), and hence not a function of the observation stream alone anymore.
In this case the outcome is generated on-the-fly rather than predetermined: the agent
chooses the actions and the environment chooses the observations.
Suppose there is an unknown cause influencing a result we are waiting for. As
91
8. CAUSALITY
soon as we observe the result, we learn something about the unknown cause. However,
if instead we decide to interrupt the process by choosing the result ourselves (i.e. by
acting), then our choice will not affect our knowledge about the unknown cause. This
is simply because we know that our current actions cannot change the past anymore.
Meanwhile, in both cases, we learn something about the future, i.e. about all the
outcomes that will follow the result.
This distinction between belief updates following externally generated observations
and internally generated actions is not modeled in Bayesian probability theory. Essen-
tially, the theory lacks the formal tools to deal with indeterminate outcomes chosen
by the reasoner himself. This requires introducing additional information to clearly
identify the past and the future of choices, or more abstractly speaking, introducing
a causal order of events. Inputs are incorporated into the knowledge by conditioning,
while outputs have to be incorporated by first intervening the belief model and then
conditioning.
The theory for enabling causal reasoning is known as statistical causality. Statistical
causality is the study of the functional dependencies of events, i.e. the study of their
cause-effect relationships. This stands in contrast to statistics proper, which, on an
abstract level, can be said to study the equivalence dependencies of events, such as
co-occurrence or correlation. Causal statements differ fundamentally from statistical
statements. Examples that highlight the differences are many, such as “do smokers get
lung cancer?” as opposed to “do smokers have lung cancer?” in a clinical study; “assign
y ← f(x)” as opposed to “compare y = f(x)” in a programming language; “how does
the increase of the price change the demand?” versus “how does the increase of the
price correlate with the demand?” in econometrics; “does the legalization of abortion
decrease the crime rate?” versus “does the legalization of abortion relate to the crime
rate?” in the social sciences; and “a← F/m” as opposed to “F = ma” in Newtonian
physics.
Especially over the last decade, significant progress has been made towards the for-
mal understanding of causation, mainly in the context of graphical models Pearl (2000);
Spirtes and Scheines (2001); Dawid (2007), but also in the context of probability trees
Shafer (1996). However, there is currently no set-theoretic formalization of causality
(and interventions) that is naturally compatible with measure theory.
The aim of this chapter is to present a simple framework for causal reasoning that
is compatible with the framework for uncertain reasoning introduced in Section 4.1.2.
The axiomatization of causality presented in this chapter is entirely due to the author.
8.1 The Big Picture
The aim of this section is to give an overview of the concepts that are going to be
formalized abstractly in the next section. Consider a three-stage experiment involving
two identical glass urns: the left one containing one white and three black balls, and
the right one having three white and one black ball. In stage one, the two urns are
either swapped or not with equal probabilities. In stage two it is randomly decided
92
8.1 The Big Picture
whether to exclude the left of the right urn from the experiment. If the urns have not
been swapped in the first stage, then the odds are 3/4 and 1/4 for keeping the left and
the right urn respectively. If the urns have been swapped, then the odds are reversed.
In the last stage, a ball is drawn from the urn with equal probabilities and its color is
revealed. We associate each stage with a binary random variable: namely Swap ∈ {yes,
no}, Pick ∈ {left, right} and Color ∈ {white, black} respectively. Figure 8.1 illustrates
the setup using a probability tree (Shafer, 1996). In calculations, we will abbreviate
variable names and their values with their first letters.
1/2 1/2
3/4 1/4 1/4 3/4
3/4 1/4 1/4 3/4 1/4 3/4 3/4 1/4
Figure 8.1: A Three-Stage Randomized Experiment.
The advantage of the representation as a probability tree is that it clearly highlights
how the binary random variables Swap, Pick and Color are instantiated, including both
the order and the probabilities. In this tree, the nodes corresponds to the (interme-
diate) steps of the realization of the experiment, where each node has a unique past
(given path of nodes starting at the root node and ending at the current node) and
many possible futures (given by any of the paths that follow from the node). The
conditional probabilities of the probability tree fully determine a unique belief func-
tion. In particular, one can compute the probability of each realization (Table 8.1) by
multiplying the conditional probabilities that make up the path of the realization.
Note that each level of the probability tree is associated with one and only one binary
random variable. This is not very restrictive, as random variables taking on values
in finite sets can always be broken down into sequences of binary random variables.
Restricting ourselves to binary trees allows us representing each node as an intersection
of “primitive” events. Informally, a primitive event tells us whether the realization took
a left or a right branch at a given level of the tree.
The reason why we have chosen a probability tree representation of the experiment
is because it clarifies the link between belief spaces introduced in Chapter 4 and Pearl’s
causal graphs (Pearl, 2000) which is currently one of the most-widely accepted for-
malisms for reasoning under interventions. The causal graph representation of the
93
8. CAUSALITY
Table 8.1: Probabilities of Realizations of the Experiment
Swap Pick Color Probability
no left black 9/32
no left white 3/32
no right black 1/32
no right white 3/32
yes left black 1/32
yes left white 3/32
yes right black 9/32
yes right white 3/32
experiment is shown in Figure 8.2a. Intuitively, Pearl defines an intervention as the act
of setting the value of a random variable, as opposed to passively observing it. In the
graph, this operation amounts to removing the parent links to the node of the variable
and then binding its value (Figure 8.2b). It turns out that the analogous operation in
a probability tree corresponds to changing the conditional probabilities of a primitive
event to either the left or the right branch (Figure 8.3). The next section introduces
an abstract model of probability trees and interventions.
a) b)
SS P
CC
P = l
Figure 8.2: A Causal Graph. The graph shown in (a) is a representation of the experiment of
Figure 8.1. Setting the value of the variable ‘P’ corresponds to an intervention, i.e. removing
the parent link to the associated node and then recording the value of the random variable as
shown in (b).
8.2 Causal Spaces
The aim of this section is to introduce causal spaces. Causal spaces contain enough
information to characterize the causal structure of a random process.
Let Ω be a finite set of outcomes. An atom set A is a partition of Ω, and an atom
is a member A ∈ A. Given a set E of subsets of Ω, define the algebra generated
by E , written σ(E), as the smallest algebra over Ω containing every member of E .
94
8.2 Causal Spaces
1 0 1 0
1/2 1/2
3/4 1/4 1/4 3/4 1/4 3/4 3/4 1/4
Figure 8.3: An Intervention in the Probability Tree. The figure illustrates choosing the left
urn at the second stage of the experiment as in Figure 8.2b.
Furthermore, define the atom set generated by an algebra F , written α(F), as the
largest set of atoms containing members of F . For any set E of subsets of Ω, we also
abbreviate α(E) := α(σ(E)).
Remark 23 In the finite case, it is easily seen that both generated algebras and gen-
erated atom sets are unique. 2
Definition 25 (Primitive Events) Let E = (E0, E1, E2, . . . , EN ) be a finite sequence
of subsets of Ω called primitive events, where E0 := Ω, and where for all n ≥ 1,
En /∈ σ
(
{E0, E1, . . . , En−1}
)
.
Furthermore, define En := {En, Ecn} and A0,A1, . . . ,AN as the sequence of atom sets
An := α
(
{E0, E1, . . . , En}
)
.
2
This setup is illustrated in Figure 8.4. The sequence of primitive events is an ab-
stract characterization of a random process that occurs in discrete steps n = 1, 2, . . . , N .
Each step n is associated with a primitive event En representing a basic proposition
whose truth value is resolved during this step (and not before!), i.e. step n determines
whether the outcome ω ∈ Ω is either in En or in Ecn. The n-th atom set An contains
one proposition for each possible path the random process can take. Therefore, after n
steps, the process will find itself in one (and only one) of the members in An.
Remark 24 The condition that En cannot be in the algebra generated by the pre-
vious events E0, . . . , En−1 guarantees that En adds a new proposition that cannot be
expressed in terms of the previous propositions. 2
95
8. CAUSALITY
E0
E1
E2
E3
A0 A1
A2 A3
Figure 8.4: Primitive Events and their Atom Sets.
The sequence of primitive events E = (E1, . . . , EN ) can equivalently be represented
by any sequence E′ = (E′1, . . . , E
′
N ) where E
′
n ∈ En. Due to this, we will call any
member of En primitive event. We introduce causal functions.
Definition 26 (Causal Axioms) Let Ω be a set of outcomes, and let E = (E1, . . . , EN )
be a sequence of primitive events. A set function Cn is a n-th causal function iff
C1. A ∈ En, B ∈ An−1, Cn(A|B) ∈ [0, 1].
C2. A ∈ En, B ∈ An−1, Cn(A|B) = 1 if B ⊂ A.
C3. A ∈ En, B ∈ An−1, Cn(A|B) = 0 if A ∩B = ∅.
C4. A ∈ En, B ∈ An−1, Cn(A|B) +Cn(Ac|B) = 1.
Hence, Cn maps En ×An−1 into [0, 1]. A causal function over E is a function
C(A|B) = Cn(A|B), if A ∈ En, B ∈ An−1,
where Cn is an n-th causal function. Hence, C maps
⋃
n(En ×An−1) into [0, 1]. 2
The intuition behind this definition is as follows. The causal function specifies the
knowledge the reasoner has about the evolution of a random process. It specifies the
likelihood of a primitive event A ∈ En to happen after the random process is known to
have taken a path B ∈ An−1.
By comparing Axioms C1–C4 with Axioms B1–B5 (Section 4.1.2) of belief functions,
we observe the following. First, in contrast to B, only a subset of combinations (A,B) ∈
F × F is specified for C, namely, the ones that chain a history of primitive events
B ∈ An−1 ⊂ F together with the primitive event A ∈ En ⊂ F that immediately
follows. Second, Axioms C1–C4 play the same roˆle as Axioms B1–B4, namely: (C1)
96
8.2 Causal Spaces
probabilities lie in the unit interval [0, 1]; (C2 & C3) probabilities are consistent with
the truth function; and (C4) probabilities of complementary events add up to one. No
axiom analogous to Axiom B5 is needed for C.
Putting everything together, one gets a causal space. A causal space contains
enough information to derive an associated belief space.
Definition 27 (Causal Space) A causal space is a tuple (Ω, E,C), where: Ω is a
set of outcomes, E is sequence of primitive events, and C is a causal function over E.2
Definition 28 (Induced Belief Space) Given a causal space (Ω, E,C), the induced
belief space is the belief space (Ω,F ,B) where the algebra F and the belief function
B are defined as
i. F = σ
(
{E0, E1, . . . , EN}
)
;
ii. B(A|B) = C(A|B), for all (A,B) ∈ ⋃n(En ×An−1). 2
Thus, the induced belief space is constructed by generating the algebra F from the
primitive events E, and by equating the belief function B to the causal function C over
the subset of F ×F where C is defined. The following theorem tells us that this subset
is enough to completely determine the whole belief function.
Theorem 7 (Induced Belief Space) The induced belief space exists and is unique.2
Proof Let F0,F1, . . . ,FN denote the sequence of algebras generated as
Fn := σ({E0, E1, . . . , En}).
Let r, s ∈ N, r ≤ s, be the smallest numbers such that B is Fr-measurable and A is Fs-
measurable. Let B ⊂ Ar and A ⊂ As be the partitions of B and A respectively.Then,
B(A|B) = 0 if A ∩B by the belief axioms, and
B(A|B) = B(A|Ω)
B(B|Ω)
otherwise. We have
B(A|Ω) =
∑
a∈A
B(a|Ω) and B(B|Ω) =
∑
b∈B
B(b|Ω).
For every a ∈ A, let a1, a2, . . . , as the unique sequence aj ∈ Ej such that
a = a1 ∩ a2 ∩ · · · ∩ as =
s⋂
j=1
aj.
Hence,
B(a|Ω) = B
( s⋂
j=1
aj
∣∣∣Ω) = s∏
j=1
B
(
aj
∣∣∣Ω ∩ j−1⋂
i=1
ai
)
.
97
8. CAUSALITY
Similarly, one obtains for every b ∈ B,
B(b|Ω) = B
( r⋂
j=1
bj
∣∣∣Ω) = r∏
j=1
B
(
bj
∣∣∣Ω ∩ j−1⋂
i=1
bi
)
.
Thus, we have proven the following. First, F is unique because generated algebras
are unique. Second, we have shown, for arbitrarily chosen events A,B ∈ F , how to
reexpress B(A|B) into an expression involving only terms of the form C(C|D). Hence,
it cannot be that B,B′ are both consistent with C and there is A,B ∈ F such that
B(A|B) 6= B′(A|B). 
8.2.1 Interventions
We now define the operation that specifies how the knowledge about the random process
transforms when the reasoner himself intervenes it.
Definition 29 (Intervention) Given a causal space (Ω, E,C) and a primitive event
A ∈ En for some n ∈ {1, . . . , N}, the A-intervention is the causal space (Ω, E,C′)
where for all (B,C) ∈ ⋃n(En ×An−1),
C′(B|C) =


1 if A = B and (B ∩ C) /∈ {∅, C},
0 if A = Bc and (B ∩ C) /∈ {∅, C},
C(B|C) else.
2
This is an important definition. The reasoner ask himself the question: “How do my
beliefs about the world change if I were to choose the truth value of a primitive event?”
This is answered by directly changing the causal function accordingly. However, this
change cannot contradict the logical constraints given by the underlying truth function.
Remark 25 Note that (B ∩ C) /∈ {∅, C} ⇔ T(B|C) = ?. Hence, an intervention can
only affect primitive propositions B ∈ En that have an unresolved truth value given
the history C ∈ An−1. Moreover, the intervention resolves the truth value of B. This
makes sense intuitively. 2
We will use the abbreviation Aˆ to denote A-interventions on a causal space. When
the underlying causal space (Ω, E,C) inducing a belief space (Ω,F ,B) is clear from the
context, then the expression B(B|Aˆ) denotes the belief B′(B|A) measured w.r.t. the
belief space (Ω,F ,B′) induced by the A-intervention of (Ω, E,C). Furthermore, when
A ∈ F is an event such that
A =
I⋂
i=1
Ai,
where each Ai is a primitive event, then the A-intervention is the causal space resulting
as the succession of Ai-interventions.
98
8.3 Causality in Autonomous Systems
8.3 Causality in Autonomous Systems
The previous section introduced causal spaces to model reasoning under uncertainty
and control. The aim of this section is to apply this framework to model autonomous
agents having uncertainty over their policies and their environments.
8.3.1 Bayesian I/O Model
We complete the description of the Bayesian model introduced in Section 4.2.1 by
adding uncertainty over the policy.
Definition 30 (Bayesian Output Model) ABayesian output model of an agent
is a set of conditional probabilities
P (θ), P (at|θ, ao<t) for all (θ, ao≤t) ∈ Θ×Z⋄.
inducing a unique probability measure P over Θ×AT conditioned on OT . 2
Definition 31 (Bayesian I/O Model) A Bayesian I/O model of an agent is a
Bayesian input model paired with a Bayesian output model, i.e. a set of conditional
probabilities
P (θ), P (at|θ, ao<t) and P (ot|θ, ao<tat) for all (θ, ao≤t) ∈ Θ×Z⋄.
inducing a unique probability measure P over Θ×ZT . 2
The intuition is analogous to the Bayesian input model. The set Θ is a collection of
unknown parameters that completely specify the I/O model, i.e. knowing θ would
lead to a unique I/O model given by
Pθ(at|ao<t) = P (at|θ, ao<t)
Pθ(ot|ao<tat) = P (ot|θ, ao<tat) for all ao≤t ∈ Z⋄.
8.3.2 Causal Structure
As has been pointed out in the introduction to this chapter, when using a model of
beliefs, one has to distinguish between inputs (provided externally by the environment)
and outputs (generated internally). Inputs are incorporated into the knowledge by
conditioning, while outputs have to be incorporated by first applying the corresponding
intervention to the belief model and then conditioning.
The Bayesian I/O model (and all the models of autonomous systems we have in-
troduced so far) is specified by conditional probabilities over the next symbol given its
past. It is therefore actually a specification of the causal structure! More precisely,
consider any causal space (Ω, E,C) fulfilling the following requirements. First, the
primitive events are given by
E =
(
E0, E1, . . . , ENθ ,︸ ︷︷ ︸
θ
ENθ+1, . . . , ENa1︸ ︷︷ ︸
a1
, ENa1+1, . . . , ENo1︸ ︷︷ ︸
o1
, . . . , ENaT +1, . . . , ENoT︸ ︷︷ ︸
oT
)
,
99
8. CAUSALITY
where each of the highlighted blocks of primitive events generates a partition of Ω
containing enough members for specifying a random variable (i.e. either θ, at or ot for
t = 1, . . . , T ). Note that Nθ ≤ Na1 ≤ No1 ≤ · · · ≤ NoT = N . This is illustrated
in Figure 8.5. Choosing an action at corresponds to an at-intervention (that breaks
down into a sequence of En-interventions). Second, the causal function C is such
that the induced belief function B coincides with the Bayesian model, i.e. for all
(θ, ao≤t) ∈ Θ×Z⋄,
B(θ) = P (θ),
B(at|θ, ao<t) = P (at|θ, ao<t)
and B(at|θ, ao<tat) = P (ot|θ, ao<tat).
E = (E0, . . . , EN )
E0
a b
c d
Figure 8.5: Causal Space of an Autonomous System. The sequence of primitive events is
represented as a stack of sets on the left hand side. The figure shows how a random variable (in
this case having the alphabet {a,b,c,d}) is constructed from two successive primitive events.
8.3.3 Belief Updates
The correspondence between causal spaces and Bayesian I/O models has the following
operational implications. When an autonomous system interacts with the environment,
then its belief state is updated. This update depends on whether the symbol is an input
or an output. If it is an input, then the update is a condition. If it is an output, then
the update is an intervention followed by a condition. This difference is illustrated in
Figure 8.6.
Assume there is a sequence of random variables X1,X2, . . . ,XT taking on values
in X with conditional probability measures B(Xt|X<t), t = 1, . . . , T .
An observation is a measurement. As such, it provides information about the
whole realization of the stochastic process. That is, learning the value of Xt provides
information about all {Xs : 1 ≤ s ≤ T} through the dependencies established by the
causal model. The update Xt = xt following an observation changes all conditional
probabilities as
B(A|B) Xt=xt−−−−→ B(A|B,Xt = xt),
100
8.3 Causality in Autonomous Systems
X2 = 1
X2 ← 1
X1 X2 X3
X1 X2 X3 X1 X2 X3
X1 X2 X3 X1 X2 X3
Figure 8.6: Updates following an Observation versus an Action. The figure shows three
causally ordered random variables X1, X2 and X3 (taking on binary values) and their proba-
bilities (through the height of their boxes). Two updates are compared: the update X2 = 1
following an observation and the update X2 ← 1 following an action. These updates eliminate
the incompatible probability mass (as shown in the first column after the update) and then
normalize the remaining probability mass (second column after the update). The observation
affects the probability mass of the whole history, eliminating the incompatible realizations and
then normalizing globally. The action affects only the probability mass of the present and the
future and then normalizes within each past.
101
8. CAUSALITY
where A and B are arbitrary events.
An action is a decision. As such, it only provides information about the future of
the realization of the stochastic process, but not about its past. That is, learning the
value of Xt provides information about {Xs : t ≤ s ≤ T} only. The update Xt ← xt
following an action changes all conditional probabilities as
B(A|B) Xt←xt−−−−→ B(A|B,Xt ← xt) = B′(A|B,Xt = xt),
where A and B are arbitrary events and where B′ is the probability measure uniquely
defined by the equations
i. B′(X<t) = B(X<t), (past)
ii. B′(Xt|X<t) = δxt(Xt), (present)
iii. B′(Xt+1:T |X≤t) = B(Xt+1:T |X≤t). (future)
(8.1)
When the random variable Xt is clear from the context, we use the abbreviation
B(A|B, xˆt) := B(A|B,Xt ← xt).
Hence, when an autonomous system (having a Bayesian I/O model P ) experiences
the I/O string ao≤t ∈ Z⋄, then its probabilities are given by
P (A|aˆo≤t),
where A is an arbitrary event. It is easily shown that
P (at|θ, aˆo<t) = P (at|θ, ao<t) and P (ot|θ, aˆo<taˆt) = P (ot|θ, ao<tat).
Remark 26 In general, one has that P (ot|aˆo<taˆt) 6= P (ot|ao<tat). Note however,
that when the policies of the Bayesian model are all the same, that is, P (at|θ, aˆo<t) =
P (at|θ′, aˆo<t) for all θ, θ′ ∈ Θ (or, if one prefers, there is no uncertainty over the policy),
then P (ot|aˆo<taˆt) = P (ot|ao<tat). This clarifies why causality does not play a role when
systems are constructed using the maximum SEU principle. 2
8.3.4 Induced I/O Model
We have seen that the Bayesian I/O model allows modeling both the uncertainty over
the policy and the predictor of an autonomous system.
Definition 32 (Induced I/O Model) The I/O model P induced by a Bayesian
I/O model P is defined as
P(at|ao<t) := P (at|aˆo<t) and P(ot|ao<tat) := P (ot|aˆo<taˆt)
for all ao<t ∈ Z⋄. 2
Before we give an intuitive interpretation of the induced model, we want to find out
what the quantities P (at|aˆo<t) and P (ot|aˆo<taˆt) actually are.
102
8.3 Causality in Autonomous Systems
Theorem 8 (Induced I/O Model) The quantities P (at|aˆo<t) and P (ot|aˆo<taˆt) are
given by
P (at|aˆo<t) =
∑
θ
P (at|θ, ao<t)P (θ|aˆo<t) (8.2)
and P (ot|aˆo<taˆt) =
∑
θ
P (ot|θ, ao<tat)P (θ|aˆo<t), (8.3)
where P (θ|aˆo≤t) is given by
P (θ|aˆo≤t) =
P (ot|θ, ao<tat)P (θ|aˆo<t)∑
θ′ P (ot|θ, ao<tat)P (θ′|aˆo<t)
(recursion) (8.4)
=
P (θ)
∏t
τ=1 P (oτ |θ, ao<τaτ )∑
θ′ P (θ
′)
∏t
τ=1 P (oτ |θ′, ao<τaτ )
. (function) (8.5)
Proof For P (at|aˆo<t), introduce the variable θ via marginalization and then applying
the chain rule:
P (at|aˆo<t) =
∑
θ
P (at|θ, aˆo<t)P (θ|aˆo<t).
First, note that P (at|θ, aˆo<t) = P (at|θ, ao<t). Then, the term P (θ|aˆo<t) can be further
expanded as
P (θ|aˆo<t) =
P (aˆo<t|θ)P (θ)∑
θ′ P (aˆo<t|θ′)P (θ′)
=
P (θ)
∏t−1
τ=1 P (aˆτ |θ, aˆo<τ )P (oτ |θ, aˆo<τ aˆτ )∑
θ′ P (θ
′)
∏t−1
τ=1 P (aˆτ |θ′, aˆo<τ )P (oτ |θ′, aˆo<τ aˆτ )
=
P (θ)
∏t−1
τ=1 P (oτ |θ, ao<τaτ )∑
θ′ P (θ
′)
∏t−1
τ=1 P (oτ |θ′, ao<τaτ )
.
The first equality is obtained by applying Bayes’ rule and the second by using the chain
rule for probabilities. The second equality follows from using the causal factorization
of the joint probability distribution. To get the last equality, one applies the inter-
ventions to the causal factorization. Thus, P (aˆτ |θ, aˆo<τ ) = 1 and P (oτ |θ, aˆo<τ aˆτ ) =
P (oτ |θ, ao<τaτ ). The recursive characterization of P (θ|ao≤t) is obtained trivially fol-
lowing similar arguments and by applying Bayes’ rule only over aot. Finally, the equa-
tions characterizing P (ot|aˆo<taˆt) are derived analogously. 
In the Bayesian I/O model, one assumes that the autonomous system has uncer-
tainty over the unknown parameter θ ∈ Θ that captures all the relevant information
to completely define an I/O model. More precisely, it is assumed that θ is going to
be drawn with probability P (θ), thus defining: P (ot|θ, ao<tat), the true input model
characterizing the true environment; and P (at|θ, ao<t), the true policy that should be
applied.
103
8. CAUSALITY
Remark 27 Recalling that P (ot|aˆo<taˆt) implements the predictive distribution and
comparing it to P (at|aˆo<t), it is reasonable to ask whether the latter is a “predictive
output distribution” having analogous properties, i.e. a distribution over the outputs
that converges to the “true optimal policy”. Experimental evidence suggests that this
might be the case. This will be further investigated in Chapter 9. 2
8.4 Historical Remarks & References
The axiomatization of causality presented in this chapter is entirely due to the author. This
formalization is based on previous work by other authors. The importance of distinguishing
between input and output, more commonly known as the difference between seeing and doing,
and their impact on inference, has been mainly developed by Pearl (2000) in the context of
graphical models. Especially, Pearl introduced the formalization of interventions in so-called
causal graphs and extended them later to the context of structural equations. Furthermore,
first ideas of how to formalize the causal structure of a random process in set-theoretic terms
were analyzed by Shafer (1996), although his formalization does not clarify how to extend it to
interventions. Of particular relevance for the context of autonomous systems is causal decision
theory (Stalnacker, 1968; Nozick, 1969; Lewis, 1973). This theory maintains that the expected
utility of actions should be evaluated with respect to their potential causal consequences. Es-
sentially, this proposes that there is a difference between actions and observations.
The study of causality has recently enjoyed considerable attention from the researchers
in the fields of statistics and machine learning. Especially over the last decade, significant
progress has been made towards the formal understanding of causation. For a more in-depth
exposition of causality and its applications, the reader is referred to the specialized literature.
For instance: Suppes (1970); Cartwright (1989); Eells (1991); Mellor (1995); Shafer (1996);
Pearl (2000); Spirtes and Scheines (2001); and Dawid (2007).
104
Chapter 9
Control as Estimation
Bounded SEU introduced in Chapter 7 allows conceptualizing autonomous agents under
bounded information-theoretic resources. These can be obtained by means of two pos-
sible transformations (or a combination of them): the control transformation and the
estimation transformation. The control transformation solves the problem of construct-
ing an autonomous agent optimizing the expected utility under resource constraints.
The estimation transformation solves the problem of constructing an autonomous agent
approximating a “known” reference autonomous agent under resource constraints.
Chapter 8 introduced a Bayesian model of autonomous systems that allows repre-
senting the uncertainty over the optimal policy and over the environment. Furthermore,
it explained how to carry out the belief updates following inputs and following outputs,
emphasizing the different causal constraints they are subject to.
The Bayesian model and the estimation transformation can be used jointly to for-
mulate an autonomous agent that “discovers” the optimal policy during the interactions
with the environment. That is, we can formulate an autonomous agent whose policy at
the behavioral level can be interpreted as having uncertainty over a set of policies at
the belief level. Then, while the agent is interacting with the environment, it obtains
evidence that progressively eliminates this uncertainty. This is in stark contrast to
autonomous agents that are constructed following the maximum SEU principle, where
only one policy (namely, the optimal) is chosen right from the beginning, even before
the interaction with the environment has even started.
One of the main points that has been put forward in this thesis is that obtaining
certainty is expensive, be it through computation or interaction with the environment.
Furthermore, it has been argued at a number of places in this thesis that computing
information oneself is more expensive than being told the same information. Recall that
in Chapter 6 and 7 we have argued that the resource costs spent in a transformation
can be measured by the relative entropy between the initial and the final distribution.
Intuitively speaking, the advantage of the method presented in this chapter lies in the
fact that we do not have to pre-compute the optimal policy. Instead, we can already
design an agent using just the distribution over policies that best describes our prior
knowledge. Hence, instead of having to transform the distribution over the optimal
105
9. CONTROL AS ESTIMATION
policy before the interaction starts, we can slowly transform this distribution guided
by the observations alone.
9.1 Interlude: Dynamic versus Static
During the course of this thesis, we have spoken on various occasions about dynamic
policies as opposed to pre-computed or static policies. Similarly, we have spoken about
stochastic and deterministic policies. What does this mean, what is the difference and
what are the implications?
Consider the well-known rock-paper-scissors game. In this game, non-random be-
havior can be exploited, and thus the safest way to protect oneself against an opponent
consists in playing with uniform probability1. Suppose we have to play a sequence of
rock-paper-scissors games. The uniform strategy can be implemented statically, e.g.
by sampling each move beforehand and revealing them turn by turn; or, it can be
implemented dynamically, e.g. by deciding each move just before it has to be played2.
This flexibility does not exists if we wanted to use a deterministic strategy, because by
choosing a deterministic policy we are determining all the moves beforehand. In general,
having uncertainty allows us determining the value of random variables dynamically.
9.1.1 Risk versus Ambiguity
A special case occurs when there is uncertainty over the very beliefs of the decision
maker, as is the case modeled by the Bayesian I/O model introduced in 3.2.2. The
author proposes that this kind of uncertainty relates to an old debate dating from the
early conceptualizations of decision theory: risk versus ambiguity (Knight, 1921).
Risk corresponds to the odds that the decision maker has adequate knowledge of,
whereas ambiguity to the odds that are unknown to him. From the two, risk is well-
understood; indeed, classical decision theory was regarded by its very founders as a
“normative foundation of optimal decision making under risk” (von Neumann and
Morgenstern, 1944; Savage, 1954). The success of these frameworks had put under
question the operational relevance of the concept of ambiguity, until Ellsberg presented
the paradox named after him in his seminal paper (Ellsberg, 1961). Several mathe-
matical formalizations for ambiguity have been proposed in the literature, but none of
them has found widespread acceptance (Camerer and Weber, 1992). However, on an
abstract level the vast majority of approaches considers ambiguity to be some kind of
uncertainty about risk or uncertainty of belief or lack of information. Here we propose
a simple formalization that is an original contribution of this thesis.
To illustrate the formalization, let us review an example. Imagine a rational decision
maker has to place a bet over the outcome of a biased coin toss which is either Head
1In game-theoretic parlance, it is said that the rock-paper-scissors game has a mixed strategy Nash
equilibrium (Osborne and Rubinstein, 1999).
2In computer science, this technique of delaying the evaluation of a quantity up until the point
where it is needed is known as lazy evaluation (Pratt and Zelkowitz, 2000).
106
9.1 Interlude: Dynamic versus Static
or Tail. The payoff is $1 for a correct bet or $0 for a wrong bet. Given the rational
decision maker’s belief, we want to predict the bet he will place under five different
cases, illustrated in Figure 9.1:
I. He believes that the odds are 14 for Head and
3
4 for Tail.
II. He believes that the odds are 34 for Head and
1
4 for Tail.
III. He believes that the odds 58 for Head and
3
8 for Tail.
IV. He believes that either I or II occurs with probability 14 or
3
4 respectively.
V. He finds himself believing in either I or II with probability 14 or
3
4 respectively.
Cases I–IV are easily examined under the framework of maximum expected utility. The
rational decision maker places the bet that maximizes his expected payoff. For cases I
and II, the optimal bets are Tail and Head respectively, because their expected payoff is
$0.75, as opposed to $0.25 offered by the alternative bet (Figures 9.1a & b). Likewise,
in III the decision maker bets Head. In case IV, the decision maker can reexpress the
two-stage coin toss, i.e. first selecting between situation I or II and then tossing the coin
(Figure 9.1c), as an equivalent single-stage coin toss with a re-weighted bias obtained
by multiplying the probabilities of the first stage with the probabilities of the second
stage (Figure 9.1e). This reduction reveals that case IV is equivalent to case III, with
Head being to optimal bet.
However, by construction, case V requires a different analysis. Here, the decision
maker’s belief can take on one out of two possible forms, in which the optimal bets
are Tail and Head as discussed previously. The crucial difference lies in the fact that
the probabilities of the belief instantiations are beyond the scope of the decision maker’s
analysis. Therefore, for him it is optimal to bet Tail when reaching case I and to bet
Head when reaching case II. This innocuous fact is by no means trivial, because the
subjective expected utility of the decision maker is 14 · 34 + 34 · 34 = 34 > 58 , that is, strictly
higher than the subjective expected utility of the classical analysis—and in practice,
subjective beliefs are all what a decision maker has3! We call these probabilities that
are exogenous to the decision maker’s beliefs ambiguities.
The distinction has an operational meaning. Notice that in case V, the belief of the
decision maker is itself a random variable, implying that the optimal policy is undefined
until the random variable is resolved. Hence, the computation of the optimal policy can
be delayed, i.e. the optimal policy can be determined dynamically. This is unlike case
IV, where the policy is pre-computed/static. The corresponding Bayesian I/O model
is as follows. Let θ ∈ {I, II} be the parameter determining whether the decision maker
3Intuitively, it seems safer to delay one’s decision until the evidence is conclusive.
107
9. CONTROL AS ESTIMATION
I II III
IV V
H
H
H
H
H
HH
T
T
T
T
T
TT
1
4
1
4
1
4
3
4
3
4
3
4
3
4
3
4
3
4
1
4
1
4
1
4
5
8
3
8
1
4
1
4
3
4
3
4
T
T
H
H H
H
Figure 9.1: Risk versus Ambiguity. In the figure, five different decision making scenarios are
shown. A biased coin is tossed. The goal is to predict the outcome and the payoffs are $1 and
$0 for a right and wrong guess respectively. A rational decision maker places bets (here shown
inside speech bubbles) such that his subjective expected utility is maximized. These subjective
beliefs are delimited within dotted boxes. The cases in panels I–IV differ from case V in that the
former can be fully understood in terms of classical decision theory, whereas the latter cannot.
108
9.2 Adaptive Estimative Control
is in case I or II. Then, the full Bayesian I/O model is given by:
P (θ = I) =
1
4
P (θ = II) =
3
4
P (a|θ = I) =
{
0 if a = H
1 if a = T
P (a|θ = II) =
{
1 if a = H
0 if a = T
P (o|θ = I, a) =
{
1
4 if o = H
3
4 if o = T
P (o|θ = II, a) =
{
3
4 if o = H
1
4 if o = T.
Here, the prior probabilities P (θ) are ambiguities while the P (a|θ) and P (o|θ, a) are
risk probabilities, because fixing θ determines the decision maker’s estimation about
the outcome and his policy.
9.2 Adaptive Estimative Control
If the environment Q is unknown, then the task of designing an appropriate agent
P constitutes an adaptive control problem. We have already seen how to solve such a
problem using the maximum SEU principle in Section 4.2, where it has been formulated
as an adaptive optimal control problem.
We formulate a different problem setup that will be termed adaptive estimative
control. Specifically, this setup deals with the case when the designer has a class of
output models {Pθ}θ∈Θ parameterized by a finite set Θ, designed to fit to a class of en-
vironments {Qθ}θ∈Θ; in other words, when the designer wants to use policy Pθ(at|ao<t)
for environment Qθ(ot|ao<tat).
Formally, it is assumed that we have a reference Bayesian I/O model (Section 8.3.1)
characterized by conditional probabilities
P0(at|θ, ao<t) = Pθ(at|ao<t) and P0(ot|θ, ao<tat) = Pθ(ot|θ, ao<tat)
representing θ-th I/O model, and by probabilities P (θ) over the unknown parameter
θ ∈ Θ. We will call the set of I/O models indexed by Θ the set of operation modes
and a particular I/O model an operation mode. The objective is to find an I/O
model P obtained by maximizing the bounded SEU using an estimation transformation
to solve for all its input and outputs.
9.3 Bayesian Control Rule
One can derive a solution to the adaptive estimative control problem. Given the refer-
ence Bayesian I/O model P0, let P0 denote the induced I/O model, i.e. the I/O model
defined by (Section 8.3.4)
P0(at|ao<t) = P0(at|aˆo<t) and P0(ot|ao<tat) = P0(ot|aˆo<taˆt).
109
9. CONTROL AS ESTIMATION
Then, define the auxiliary I/O models R and S as
R(at|ao<t) = P0(at|ao<t), R(ot|ao<tat) = P0(ot|ao<tat),
S(at|ao<t) = Pr(at|ao<t), S(ot|ao<tat) = Pr(ot|ao<tat).
Substituting these into (7.8) yields
∑
ao≤T
P0(ao≤T )U∗(ao≤T )− α
∑
ao≤T
P0(ao≤T ) log
P0(ao≤T )
Pr(ao≤T )
Here, the solution is P(ao≤T ) = P0(ao≤T ) for any utility function U∗. The policy and
the predictor are thus given by P(at|ao<t) and P(ot|ao<tat) respectively. According to
Theorem 8, the input model is given by the predictive distribution
P(ot|ao<tat) =
∑
θ
P0(θ|aˆo<t)P0(ot|θ, ao<tat).
The associated output model is given by a quantity which we now promote to a defini-
tion.
Proposition 2 (Bayesian Control Rule) The probability of at conditioned on the
past ao<t is given by
P(at|ao<t) =
∑
θ
P0(θ|aˆo<t)P0(at|θ, ao<t) (9.1)
where the mixture weights P (θ|ao≤t) are given by the recursion
P0(θ|aˆo≤t) =
P0(ot|θ, ao<tat)P0(θ|aˆo<t)∑
θ′ P0(ot|θ, ao<tat)P0(θ′|aˆo<t)
.
We call Equation (9.1) the Bayesian control rule. 2
Remark 28 Note that the Bayesian control rule is a stochastic control law, i.e. actions
at are sampled from this distribution. 2
Essentially, the previous result tells us that the Bayesian I/O model is the solution to
the adaptive estimative control problem. This result leads to a family of algorithms that
is very easy to implement in practice. The essential idea of the Bayesian control rule
is to regard P0(at|θ, ao<t) as the policy to apply when the agent faces the environment
described by P0(ot|θ, ao<tot), and then to perform a Bayesian (posterior) mixture over
all θ ∈ Θ. Then, at each time step, one samples a parameter θ from the posterior
distribution P (θ|aˆo≤t) and then executes the policy indexed by θ.
110
9.4 Convergence of the Bayesian Control Rule
9.4 Convergence of the Bayesian Control Rule
In this section, let P denote a I/O model induced by a Bayesian I/O model P . We have
seen that the predictive distribution converges to the true input stream (Section 4.2.4).
Can something similar be said about the Bayesian control rule? That is, is it true that
the Bayesian control rule converges to the “correct” output model, i.e.
P (at|aˆo<t) t→∞−−−→ P (at|θ∗, ao<t),
where θ∗ ∈ Θ is such that P (ot|θ∗, ao<tat) = Q(ot|θ∗, ao<tat)? In this chapter we
sketch a proof for a very restricted setting and we present experimental evidence that
this might be the case in a more general setting.
9.4.1 Policy Diagrams
In the following we use “policy diagrams” as a useful informal tool to analyze the effect
of policies on environments. One can imagine an environment as a collection of states
connected by transitions labeled by I/O symbols (Figure 9.2). The zoom highlights a
state s where taking action a ∈ A and collecting observation o ∈ O leads to state s′. Sets
of states and transitions are represented as enclosed areas similar to a Venn diagram.
Choosing a particular policy in an environment amounts to partially controlling the
transitions taken in the state space, thereby choosing a probability distribution over
state transitions (e.g. a Markov chain given by the environmental dynamics). If the
probability mass concentrates in certain areas of the state space, choosing a policy can
be thought of as choosing a subset of the environment’s dynamics. In the following, a
policy is represented by a subset in state space (enclosed by a directed curve) as shown
in Figure 9.2.
state space
policy
s s
′ao
Figure 9.2: A Policy Diagram.
Policy diagrams are especially useful to analyze the effect of policies on different
hypotheses about the environment’s dynamics. For the sake of simplifying the interpre-
tation of policy diagrams, we will assume the existence of a state space T : (A×O)∗ → S
mapping I/O histories into states. Note however that no such assumptions are made
to obtain the results of this section.
111
9. CONTROL AS ESTIMATION
9.4.2 Divergence Processes
The central question in this section is to investigate whether the Bayesian control rule
converges to the correct control law or not. As will be obvious from the discussion in
the rest of this section, this is in general not true.
As it is easily seen from (9.1), showing convergence amounts to show that the pos-
terior distribution P (θ|aˆo<t) concentrates its probability mass on a subset of operation
modes Θ∗ having “essentially” the same output stream as θ∗,∑
θ∈Θ
P (at|θ, ao<t)P (θ|aˆo<t) ≈
∑
θ∈Θ∗
P (at|θ∗, ao<t)P (θ|aˆo<t) ≈ P (at|θ∗, ao<t).
Hence, understanding the asymptotic behavior of the posterior probabilities
P (θ|aˆo≤t)
is crucial here. In particular, we need to understand under what conditions these
quantities converge to zero. The posterior can be rewritten as
P (θ|aˆo≤t) =
P (aˆo≤t|θ)P (θ)∑
θ′∈Θ P (aˆo≤t|θ′)P (θ′)
=
P (θ)
∏t
τ=1 P (oτ |θ, ao<τaτ )∑
θ′∈Θ P (θ
′)
∏t
τ=1 P (oτ |θ′, ao<τaτ )
.
If all the summands but the one with index θ∗ are dropped from the denominator, one
obtains the bound
P (θ|aˆo≤t) ≤
P (θ)
P (θ∗)
t∏
τ=1
P (oτ |θ, ao<τaτ )
P (oτ |θ∗, ao<τaτ )
,
which is valid for all θ∗ ∈ Θ. From this inequality, it is seen that it is convenient to
analyze the behavior of the following stochastic process.
Definition 33 (Divergence Process) The stochastic process defined by
dt(θ∗‖θ) :=
t∑
τ=1
ln
P (oτ |θ∗, ao<τaτ )
P (oτ |θ, ao<τaτ )
is called the divergence process of θ from the reference θ∗. 2
Indeed, if dt(θ∗‖θ)→∞ as t→∞, then
lim
t→∞
P (θ)
P (θ∗)
t∏
τ=1
P (oτ |θ, ao<τaτ )
P (oτ |θ∗, ao<τaτ )
= lim
t→∞
P (θ)
P (θ∗)
· e−dt(θ∗‖θ) = 0,
and thus clearly P (θ|aˆo≤t)→ 0. Figure 9.3 illustrates simultaneous realizations of the
divergence processes of a controller.
Remark 29 Intuitively speaking, the divergence process dt(θ∗‖θ) is a lower bound on
the “number of wrong bits predicted” by the θ-th input model. 2
112
9.4 Convergence of the Bayesian Control Rule
0
1
2
3
4
t
dt
Figure 9.3: Realization of Divergence Processes. The divergence processes 1 to 4 are associated
with a Bayesian I/O model having operation modes θ1 to θ4. The divergence processes 1 and 2
diverge, whereas 3 and 4 stay below the dotted bound. Hence, the posterior probabilities of θ1
and θ2 vanish.
Remark 30 A divergence process is a random walk whose value at time t depends
on the whole history up to time t − 1. A given divergence process can have different
growth rates depending on the policy (Figure 9.4). It can happen that a divergence
process stays stable under one policy, but diverges under another. 2
The divergence process dt(θ
∗‖θ) is a random variable that depends on the realization
ao≤t which is drawn from
t∏
τ=1
P (aτ |θτ , ao≤τ )P (oτ |θ∗, ao≤τaτ ),
where the θ1, θ2, . . . , θt are drawn themselves from P (θ1), P (θ2|aˆo1), . . . , P (θt|aˆo<t).
0
1
1
2
2
3
3
t
dt
Figure 9.4: Policies Influence Divergence Processes. The application of different policies lead
to different statistical properties of the same divergence process.
9.4.3 Decomposition of Divergence Processes
To deal with the heterogeneous nature of divergence processes, one introduces a tem-
poral decomposition that demultiplexes the original process into many sub-processes
belonging to unique policies.
113
9. CONTROL AS ESTIMATION
Definition 34 (Sub-Divergence) Let Nt := {1, 2, . . . , t} be the set of time steps up
to time t. Let T ⊂ Nt, and let θ, θ′ ∈ Θ. Define a sub-divergence of dt(θ∗‖θ) as a
random variable
gθ′(θ;T ) :=
∑
τ∈T
ln
P (oτ |θ∗, ao<τaτ )
P (oτ |θ, ao<τaτ )
drawn from
Pθ′({aoτ}τ∈T |{aoτ}τ∈T ∁) :=
(∏
τ∈T
P (aτ |θ′, ao<τ )
)(∏
τ∈T
P (oτ |θ∗, ao<τaτ )
)
,
where T ∁ := Nt \T and where {aoτ}τ∈T ∁ are given conditions that are kept constant.2
In this definition, θ′ plays the role of the policy that is used to sample the actions
in the time steps T . Clearly, any realization of the divergence process dt(θ∗‖θ) can be
decomposed into a sum of sub-divergences, i.e.
dt(θ∗‖θ) =
∑
θ′
gθ′(θ;Tθ′), (9.2)
where {Tθ}θ∈Θ forms a partition of Nt. Figure 9.5 shows an example decomposition.
0
1
2
3
t
dt
Figure 9.5: Decomposition of a Divergence Process (1) into Sub-Divergences (2 & 3). Note
that the sub-divergences grow in disjoint time intervals.
The averages of sub-divergences will play an important roˆle in the analysis. Define
the average over all realizations of gθ′(θ;T ) as
Gθ′(θ;T ) :=
∑
(aoτ )τ∈T
Pθ′({aoτ}τ∈T |{aoτ}τ∈T ∁) gθ′(θ;T ).
Notice that for any τ ∈ Nt,
Gθ′(θ; {τ}) =
∑
aoτ
P (aτ |θ′, ao<τ )P (oτ |θ∗, ao<τaτ ) ln
P (oτ |θ∗, ao<τaτ )
P (oτ |θ, ao<τaτ )
≥ 0,
because of Gibbs’ inequality. In particular,
Gθ′(θ∗; {τ}) = 0.
Clearly, this holds as well for any T ⊂ Nt:
∀θ Gθ′(θ;T ) ≥ 0,
Gθ′(θ∗;T ) = 0.
(9.3)
114
9.4 Convergence of the Bayesian Control Rule
9.4.4 Bounded Variation
In general, a divergence process is very complex: virtually all the classes of distributions
that are of interest in control go well beyond the assumptions of i.i.d. and stationarity.
This increased complexity can jeopardize the analytic tractability of the divergence
process, such that no predictions about its asymptotic behavior can be made anymore.
More specifically, if the growth rates of the divergence processes vary too much from
realization to realization, then the posterior distribution over operation modes can vary
qualitatively between realizations. Hence, one needs to impose a stability requirement
akin to ergodicity to limit the class of possible divergence-processes to a class that is
analytically tractable. For this purpose the following property is introduced.
Definition 35 (Bounded Variation) A divergence process dt(θ∗‖θ) is said to have
bounded variation in Θ iff for any δ > 0, there is a C ≥ 0, such that for all θ′ ∈ Θ,
all t and all T ⊂ Nt ∣∣∣gθ′(θ;T )−Gθ′(θ;T )∣∣∣ ≤ C
with probability ≥ 1− δ. 2
0
1 2 3
t
dt
Figure 9.6: Bounded Variation. If a divergence process has bounded variation, then the
realizations (curves 2 & 3) of a sub-divergence stay within a band around the mean (curve 1).
Figure 9.6 illustrates this property. Bounded variation is the key property that is
going to be used to construct the results of this section. However, it is very restrictive.
For instance, simple Bernoulli processes do not fulfill this property. The first result is
that the posterior probability of the true parameter is bounded from below.
Theorem 9 (Lower Bound of True Posterior) Let the set of operation modes of
a controller be such that for all θ ∈ Θ the divergence process dt(θ∗‖θ) has bounded
variation. Then, for any δ > 0, there is a λ > 0, such that for all t ∈ N,
P (θ∗|aˆo≤t) ≥
λ
|Θ|
with probability ≥ 1− δ. 2
115
9. CONTROL AS ESTIMATION
Proof As has been pointed out in (9.2), a particular realization of the divergence
process dt(θ∗‖θ) can be decomposed as
dt(θ∗‖θ) =
∑
θ′
gθ(θ
′;Tθ′),
where the gθ(θ
′;Tθ′) are sub-divergences of dt(θ∗‖θ) and the Tθ′ form a partition of Nt.
However, since dt(θ∗‖θ) has bounded variation for all θ ∈ Θ, one has for all δ′ > 0,
there is a C(θ) ≥ 0, such that for all θ′ ∈ Θ, all t ∈ Nt and all T ⊂ Nt, the inequality∣∣∣gθ(θ′;Tθ′)−Gθ(θ′;Tθ′)∣∣∣ ≤ C(θ)
holds with probability ≥ 1− δ′. However, due to (9.3),
Gθ(θ
′;Tθ′) ≥ 0
for all θ′ ∈ Θ. Thus,
gθ(θ
′;Tθ′) ≥ −C(θ).
If all the previous inequalities hold simultaneously then the divergence process can be
bounded as well. That is, the inequality
dt(θ∗‖θ) ≥ −MC(θ) (9.4)
holds with probability ≥ (1− δ′)M where M := |Θ|. Choose
β(θ) := max{0, ln P (θ)P (θ∗)}.
Since 0 ≥ ln P (θ)P (θ∗) − β(θ), it can be added to the right hand side of (9.4). Using the
definition of dt(θ∗‖θ), taking the exponential and rearranging the terms one obtains
P (θ∗)
t∏
τ=1
P (oτ |θ∗, ao<τaτ ) ≥ e−α(θ)P (θ)
t∏
τ=1
P (oτ |θ, ao<τaτ )
where α(θ) := MC(θ) + β(θ) ≥ 0. Identifying the posterior probabilities of θ∗ and θ
by dividing both sides by the normalizing constant yields the inequality
P (θ∗|aˆo≤t) ≥ e−α(θ)P (θ|aˆo≤t).
This inequality holds simultaneously for all θ ∈ Θ with probability ≥ (1− δ′)M2 and in
particular for λ := minθ{e−α(θ)}, that is,
P (θ∗|aˆo≤t) ≥ λP (θ|aˆo≤t).
But since this is valid for any θ ∈ Θ, and because maxθ{P (θ|aˆo≤t)} ≥ 1M , one gets
P (θ∗|aˆo≤t) ≥
λ
M
,
with probability ≥ 1 − δ for arbitrary δ > 0 related to δ′ through the equation δ′ :=
1− M2√1− δ. 
116
9.4 Convergence of the Bayesian Control Rule
Remark 31 It has been pointed by M. Hutter4 that bounded variation can most
probably be weakened to “C growing sub-linearly” (which will require adapting the
definitions that follow as well) and still get the convergence result of this section. 2
9.4.5 Core
If one wants to identify the operation modes whose posterior probabilities vanish, then it
is not enough to characterize them as those modes whose hypothesis does not match the
true hypothesis. Figure 9.7 illustrates this problem. Here, three hypotheses along with
their associated policies are shown. H1 and H2 share the prediction made for region A
but differ in region B. Hypothesis H3 differs everywhere from the others. Assume H1
is true. As long as we apply policy P2, hypothesis H3 will make wrong predictions and
thus its divergence process will diverge as expected. However, no evidence against H2
will be accumulated. It is only when one applies policy P1 for long enough time that
the agent will eventually enter region B and hence accumulate counter-evidence for H2.
H1
P1
H2
P2
H3
P3A A
B B
Figure 9.7: Problems with Disambiguation. If hypothesis H1 is true and agrees with H2 on
region A, then policy P2 cannot disambiguate the three hypotheses.
But what does “long enough” mean? If P1 is executed only for a short period, then
the controller risks not visiting the disambiguating region. But unfortunately, neither
the right policy nor the right length of the period to run it are known beforehand.
Hence, an agent needs a clever time-allocating strategy to test all policies for all finite
time intervals. This motivates the following definition.
Definition 36 (Core) The core of an operation mode θ∗, denoted as [θ∗], is the
subset of Θ containing operation modes behaving like θ∗ under its policy. Formally, an
operation mode θ /∈ [θ∗] (i.e. is not in the core) iff for any C ≥ 0, δ > 0, there is a ξ > 0
and a t0 ∈ N, such that for all t ≥ t0,
Gθ∗(θ;T ) ≥ C
with probability ≥ 1− δ, where Gθ∗(θ;T ) is a sub-divergence of dt(θ∗‖θ), and Pr{τ ∈
T } ≥ ξ for all τ ∈ Nt. 2
In other words, if the agent was to apply θ∗’s policy in each time step with probabil-
ity at least ξ, and under this strategy the expected sub-divergence Gθ∗(θ;T ) of dt(θ∗‖θ)
grows unboundedly, then θ is not in the core of θ∗.
4personal communication
117
9. CONTROL AS ESTIMATION
Remark 32 Note that demanding a strictly positive probability of execution in each
time step guarantees that the agent will run θ∗ for all possible finite time-intervals. 2
As the following theorem shows, the posterior probabilities of the operation modes
that are not in the core vanish almost surely.
Theorem 10 (Not in Core ⇒ Vanishing Posterior) Let the set of operation modes
of an agent be such that for all θ ∈ Θ the divergence process dt(θ∗‖θ) has bounded vari-
ation. If θ /∈ [θ∗], then P (θ|aˆo≤t)→ 0 as t→∞ almost surely. 2
Proof The divergence process dt(θ∗‖θ) can be decomposed into a sum of sub-divergences
(see Equation 9.2)
dt(θ∗‖θ) =
∑
θ′
gθ′(θ;Tθ′). (9.5)
Furthermore, for every θ′ ∈ Θ, one has that for all δ > 0, there is a C ≥ 0, such that
for all t ∈ N and for all T ⊂ Nt∣∣∣gθ′(θ;T )−Gθ′(θ;T )∣∣∣ ≤ C(θ)
with probability ≥ 1 − δ′. Applying this bound to the summands in (9.5) yields the
lower bound ∑
θ′
gθ′(θ;Tθ′) ≥
∑
θ′
(
Gθ′(θ;Tθ′)− C(θ)
)
which holds with probability ≥ (1− δ′)M , where M := |Θ|. Due to Inequality 9.3, one
has that for all θ′ 6= θ∗, Gθ′(θ;Tθ′) ≥ 0. Hence,∑
θ′
(
Gθ′(θ;Tθ′)− C(θ)
) ≥ Gθ∗(θ;Tθ∗)−MC
where C := maxθ{C(θ)}. The members of the set Tθ∗ are determined stochastically;
more specifically, the i-th member is included into Tθ∗ with probability P (θ∗|aˆo≤i) ≥
λ/M for some λ > 0 by Theorem 9. But since θ /∈ [θ∗], one has that Gθ∗(θ;Tθ∗) →∞
as t→∞ with probability ≥ 1− δ′ for arbitrarily chosen δ′ > 0. This implies that
lim
t→∞
dt(θ∗‖θ) ≥ lim
t→∞
Gθ∗(θ;Tθ∗)−MC ր∞
with probability ≥ 1−δ, where δ > 0 is arbitrary and related to δ′ as δ = 1−(1−δ′)M+1.
Using this result in the upper bound for posterior probabilities yields the final result
0 ≤ lim
t→∞
P (θ|aˆo≤t) ≤ lim
t→∞
P (θ)
P (θ∗)
e−dt(θ∗‖θ) = 0.

118
9.4 Convergence of the Bayesian Control Rule
9.4.6 Consistency
Even if an operation mode θ is in the core of θ∗, i.e. given that θ is essentially indis-
tinguishable from θ∗ under θ∗’s control, it can still happen that θ∗ and θ have different
policies. Figure 9.8 shows an example of this. The hypotheses H1 and H2 share re-
gion A but differ in region B. In addition, both operation modes have their policies P1
and P2 respectively confined to region A. Note that both operation modes are in the
core of each other. However, their policies are different. This means that it is unclear
whether multiplexing the policies in time will ever disambiguate the two hypotheses.
This is undesirable, as it could impede the convergence to the right control law.
H1
P1
H2
P2
A A
B B
Figure 9.8: Inconsistent Policies. An example of inconsistent policies. Both operation modes
are in the core of each other, but have different policies.
Thus, it is clear that one needs to impose further restrictions on the mapping
of hypotheses into policies. With respect to Figure 9.8, one can make the following
observations:
1. Both operation modes have policies that select subsets of region A. Therefore,
the dynamics in A are preferred over the dynamics in B.
2. Knowing that the dynamics in A are preferred over the dynamics in B allows us
to drop region B from the analysis when choosing a policy.
3. Since both hypotheses agree in region A, they have to choose the same policy in
order to be consistent in their selection criterion.
This motivates the following definition.
Definition 37 (Consistent Policies) An operation mode θ is said to be consistent
with θ∗ iff θ ∈ [θ∗] implies that for all ε < 0, there is a t0, such that for all t ≥ t0 and
all ao<tat, ∣∣∣P (at|θ, ao≤t)− P (at|θ∗, ao≤t)∣∣∣ < ε.
2
In other words, if θ is in the core of θ∗, then θ’s policy has to converge to θ∗’s policy.
The following theorem shows that consistency is a sufficient condition for convergence
to the right control law.
119
9. CONTROL AS ESTIMATION
Theorem 11 (Convergence of Bayesian Control Rule) Let the set of operation
modes of an agent be such that: for all θ ∈ Θ the divergence process dt(θ∗‖θ) has
bounded variation; and for all θ, θ∗ ∈ Θ, θ is consistent with θ∗. Then,
P (at|aˆo<t)→ P (at|θ∗, ao<t)
almost surely as t→∞. 2
Proof We will use the abbreviations pθ(t) := P (at|θ, aˆo<t) and wθ(t) := P (θ|aˆo<t).
Decompose P (at|aˆo<t) as
P (at|aˆo<t) =
∑
θ/∈[θ∗]
pθ(t)wθ(t) +
∑
θ∈[θ∗]
pθ(t)wθ(t). (9.6)
The first sum on the right-hand side is lower-bounded by zero and upper-bounded
by ∑
θ/∈[θ∗]
pθ(t)wθ(t) ≤
∑
θ/∈[θ∗]
wθ(t)
because pθ(t) ≤ 1. Due to Theorem 10, wθ(t) → 0 as t → ∞ almost surely. Given
ε′ > 0 and δ′ > 0, let t0(θ) be the time such that for all t ≥ t0(θ), wθ(t) < ε′. Choosing
t0 := maxθ{t0(θ)}, the previous inequality holds for all θ and t ≥ t0 simultaneously
with probability ≥ (1− δ′)M . Hence,∑
θ/∈[θ∗]
pθ(t)wθ(t) ≤
∑
θ/∈[θ∗]
wθ(t) < Mε
′. (9.7)
To bound the second sum in (9.6) one proceeds as follows. For every member
θ ∈ [θ∗], one has that pθ(t)→ pθ∗(t) as t→∞. Hence, following a similar construction
as above, one can choose t′0 such that for all t ≥ t′0 and θ ∈ [θ∗], the inequalities∣∣∣pθ(t)− pθ∗(t)∣∣∣ < ε′
hold simultaneously for the precision ε′ > 0. Applying this to the second sum in
Equation 9.6 yields the bounds∑
θ∈[θ∗]
(
pθ∗(t)− ε′
)
wθ(t) ≤
∑
θ∈[θ∗]
pθ(t)wθ(t) ≤
∑
θ∈[θ∗]
(
pθ∗(t) + ε
′
)
wθ(t).
Here
(
pθ∗(t) ± ε′
)
are multiplicative constants that can be placed in front of the sum.
Note that
1 ≥
∑
θ∈[θ∗]
wθ(t) = 1−
∑
θ/∈[θ∗]
wθ(t) > 1− ε.
Use of the above inequalities allows simplifying the lower and upper bounds respectively:(
pθ∗(t)− ε′
) ∑
θ∈[θ∗]
wθ(t) > pθ∗(t)(1 − ε′)− ε′ ≥ pθ∗(t)− 2ε′,
(
pθ∗(t) + ε
′
) ∑
θ∈[θ∗]
wθ(t) ≤ pθ∗(t) + ε′ < pθ∗(t) + 2ε′.
(9.8)
120
9.5 Examples
Combining the inequalities (9.7) and (9.8) in (9.6) yields the final result:∣∣∣P (at|aˆo<t)− pθ∗(t)∣∣∣ < (2 +M)ε′ = ε,
which holds with probability ≥ 1−δ for arbitrary δ > 0 related to δ′ as δ′ = 1− M√1− δ
and arbitrary precision ε. 
9.5 Examples
In this section we illustrate the usage of the Bayesian control rule on two examples
that are very common in the reinforcement learning literature: multi-armed bandits
and Markov decision processes. As a reminder, a summary of the Bayesian control rule
is given in Table 9.1.
Bayesian Control Rule: Given a set of operation modes {P (·|θ, ·)}θ∈Θ
over interaction sequences in Z∞ and a prior distribution P (θ) over the
parameters Θ, the probability of the action at+1 is given by
P (at+1|aˆo≤t) =
∑
θ
P (at+1|θ, ao≤t)P (θ|aˆo≤t), (9.9)
where the posterior probability over operation modes is given by the
recursion
P (θ|aˆo≤t) =
P (ot|θ, ao<t)P (θ|aˆo<t)∑
θ′ P (ot|θ′, ao<t)P (θ′|aˆo<t)
.
Table 9.1: Summary of the Bayesian control rule.
9.5.1 Bandit Problems
Consider the multi-armed bandit problem Robbins (1952). The problem is stated as
follows. Suppose there is an N -armed bandit, i.e. a slot-machine with N levers. When
pulled, lever i provides a reward drawn from a Bernoulli distribution with a bias hi
specific to that lever. That is, a reward r = 1 is obtained with probability hi and a
reward r = 0 with probability 1 − hi. The objective of the game is to maximize the
time-averaged reward through iterative pulls. There is a continuum range of stationary
strategies, each one parameterized by N probabilities {si}Ni=1 indicating the probabil-
ities of pulling each lever. The difficulty arising in the bandit problem is to balance
reward maximization based on the knowledge already acquired with attempting new
actions to further improve knowledge. This dilemma is known as the exploration versus
exploitation tradeoff Sutton and Barto (1998).
121
9. CONTROL AS ESTIMATION
This is an ideal task for the Bayesian control rule, because each possible bandit has
a known optimal agent. Indeed, a bandit can be represented by an N -dimensional bias
vector θ = [θ1, . . . , θN ] ∈ Θ = [0; 1]N . Given such a bandit, the optimal policy consists
in pulling the lever with the highest bias. That is, an operation mode is given by:
hi = P (ot = 1|θ, at = i) = θi si = P (at = i|θ) =
{
1 if i = maxj{θj},
0 else.
0 1
0
1
0
1
0
1
1
a)
θ1
θ2
θ1 ≥ θ2
b)
θ1
θ2
θ3
θ2 ≥ θ1, θ3
Figure 9.9: The space of bandit configurations can be partitioned into N regions according
to the optimal lever. Panel a and b show the 2-armed and 3-armed bandit cases respectively.
To apply the Bayesian control rule, it is necessary to fix a prior distribution over
the bandit configurations. Assuming a uniform distribution, the Bayesian control rule
is
P (at+1 = i|aˆo≤t) =
∫
Θ
P (at+1 = i|θ)P (θ|aˆo≤t) (9.10)
with the update rule given by
P (θ|aˆo≤t) =
P (θ)
∏t
τ=1 P (oτ |θ, aτ )∫
Θ P (θ
′)
∏t
τ=1 P (oτ |θ′, aτ ) dθ′
=
N∏
j=1
θ
rj
j (1− θj)fj
B(rj + 1, fj + 1)
(9.11)
where rj and fj are the counts of the number of times a reward has been obtained
from pulling lever j and the number of times no reward was obtained respectively.
Observe that here the summation over discrete operation modes has been replaced by
an integral over the continuous space of configurations. In the last expression we see
that the posterior distribution over the lever biases is given by a product of N Beta
distributions. Thus, sampling an action amounts to first sample an operation mode θ
by obtaining each bias θi from a Beta distribution with parameters ri + 1 and fi + 1,
and then choosing the action corresponding to the highest bias a = argmaxi θi. The
pseudo-code can be seen in Algorithm 1.
Simulation: The Bayesian control rule described above has been compared against
two other agents: an ε-greedy strategy with decay (on-line) and Gittins indices (off-
line). The test bed consisted of bandits with N = 10 levers whose biases were drawn
uniformly at the beginning of each run. Every agent had to play 1000 runs for 1000
122
9.5 Examples
Algorithm 1: BCR bandit.
for i = 1, . . . , N do
Initialize ri and fi to zero.
Main cycle: for t = 1, 2, 3, . . . do
Sample m using (9.11).
Interaction:
Set a← argmaximi and issue a.
Obtain o from environment.
Update belief:
if o = 1 then
ra = ra + 1
else
fa = fa + 1
200 400 600 800 10000
0.70
0.75
0.80
0.85
200 400 600 800 10000
0
20
40
60
80
100
A
v
g
.
R
ew
a
rd
%
B
es
t
L
ev
er
Bayesian control rule
ǫ-greedy
Gittins indices
Figure 9.10: Comparison in the N -armed bandit problem of the Bayesian control rule (solid
line), an ε-greedy agent (dashed line) and using Gittins indices (dotted line). 1,000 runs have
been averaged. The top panel shows the evolution of the average reward. The bottom panel
shows the evolution of the percentage of times the best lever was pulled.
123
9. CONTROL AS ESTIMATION
time steps each. Then, the performance curves of the individual runs were averaged.
The ε-greedy strategy selects a random action with a small probability given by εα−t
and otherwise plays the lever with highest expected reward. The parameters have been
determined empirically to the values ε = 0.1, and α = 0.99 after several test runs. They
have been adjusted in a way to maximize the average performance in the last trials of
our simulations. For the Gittins method, all the indices were computed up to horizon
1300 using a geometric discounting of α = 0.999, i.e. close to one to approximate the
time-averaged reward. The results are shown in Figure 9.10.
It is seen that ε-greedy strategy quickly reaches an acceptable level of performance,
but then seems to stall at a significantly suboptimal level, pulling the optimal lever only
60% of the time. In contrast, both the Gittins strategy and the Bayesian control rule
show essentially the same asymptotic performance, but differ in the initial transient
phase where the Gittins strategy significantly outperforms the Bayesian control rule.
There are at least three observations that are worth making here. First, Gittins indices
have to be pre-computed off-line. The time complexity scales quadratically with the
horizon, and the computations for the horizon of 1300 steps took several hours on
our machines. In contrast, the Bayesian control rule could be applied without pre-
computation. Second, even though the Gittins method actively issues the optimal
information gathering actions while the Bayesian control rule passively samples the
actions from the posterior distribution over operation modes, in the end both methods
rely on the convergence of the underlying Bayesian estimator. This implies that both
methods have the same information bottleneck, since the Bayesian estimator requires
the same amount of information to converge. Thus, active information gathering actions
only affect the utility of the transient phase, not the permanent state. Other efficient
algorithms for bandit problems can be found in the literature (Auer, CesaBianchi, and
Fischer, 2002).
9.5.2 Markov Decision Processes
A Markov Decision Process (MDP) is defined as a tuple (X ,A, T, r): X is the state
space; A is the action space; Ta(x;x′) = Pr(x′|a, x) is the probability that an action
a ∈ A taken in state x ∈ X will lead to state x′ ∈ X ; and r(x, a) ∈ R := R is
the immediate reward obtained in state x ∈ X and action a ∈ A. The interaction
proceeds in time steps t = 1, 2, . . . where at time t, action at ∈ A is issued in state
xt−1 ∈ X , leading to a reward rt = r(xt−1, at) and a new state xt that starts the
next time step t + 1. A stationary closed-loop control policy π : X → A assigns an
action to each state. For MDPs there always exists an optimal stationary deterministic
policy and thus one only needs to consider such policies. In undiscounted MDPs the
average reward per time step for a fixed policy π with initial state x is defined as
ρπ(x) = limt→∞E
π[1t
∑t
τ=0 rτ ]. It can be shown Bertsekas (1987) that ρ
π(x) = ρπ(x′)
for all x, x′ ∈ X under the assumption that the Markov chain for policy π is ergodic.
Here, we assume that the MDPs are ergodic for all stationary policies.
124
9.5 Examples
In order to keep the intervention model particularly simple5, we follow the Q-
notation of Watkins (1989). The optimal policy π∗ can then be characterized in terms of
the optimal average reward ρ and the optimal relative Q-values Q(x, a) for each state-
action pair (x, a) that are solutions to the following system of non-linear equations
(Singh, 1994): for any state x ∈ X and action a ∈ A,
Q(x, a) + ρ = r(x, a) +
∑
x′∈X
Pr(x′|x, a)
[
max
a′
Q(x′, a′)
]
= r(x, a) +Ex′
[
max
a′
Q(x′, a′)
∣∣∣x, a]. (9.12)
The optimal policy can then be defined as π∗(x) := argmaxaQ(x, a) for any state
x ∈ X .
Again this setup allows for a straightforward solution with the Bayesian control rule,
because each learnable MDP (characterized by the Q-values and the average reward)
has a known solution π∗. Accordingly, an operation mode θ is given by θ = [Q, ρ] ∈
Θ = R|A|×|O|+1. To obtain a likelihood model for inference over θ, we realize that
Equation 9.12 can be rewritten such that it predicts the instantaneous reward r(x, a)
as the sum of a mean instantaneous reward ξθ plus a noise term ν given the Q-values
and the average reward ρ for the MDP labeled by θ
r(x, a) = Q(x, a) + ρ−max
a′
Q(x′, a′)︸ ︷︷ ︸
mean instantaneous reward ξθ(x,a,x′)
+max
a′
Q(x′, a′)−E[max
a′
Q(x′, a′)|x, a]︸ ︷︷ ︸
noise ν
Assuming that ν can be reasonably approximated by a normal distribution N(0, 1/p)
with precision p, we can write down a likelihood model for the immediate reward r
using the Q-values and the average reward, i.e.
P (r|θ, x, a, x′) =
√
p
2π
exp
{
−p
2
(r − ξθ(x, a, x′))2
}
. (9.13)
In order to determine the intervention model for each operation mode, we can simply
exploit the above properties of the Q-values, which gives
P (a|θ, x) =
{
1 if a = argmaxa′ Q(x, a
′)
0 else.
(9.14)
To apply the Bayesian control rule, the posterior distribution P (θ|aˆ≤t, x≤t) needs to
be computed. Fortunately, due to the simplicity of the likelihood model, one can
5The “brute-force” adaptive agent for this problem would roughly look as follows. First, the agent
starts with a prior distribution over all MDPs, e.g. product of Dirichlet distributions over the transition
probabilities. Then, in each cycle, the agent samples a full transition matrix from the distribution and
solves it using dynamic programming. Once it has computed the optimal policy, it uses it to issue the
next action, and then discards the policy. Subsequently, it updates the distribution over MDPs using
the next observed state. However, in the main text we follow a different approach that avoids solving
an MDP in every time step.
125
9. CONTROL AS ESTIMATION
easily devise a conjugate prior distribution and apply standard inference methods (see
derivation in Section 9.8 at the end of this Chapter). Actions are again determined
by sampling operation modes from this posterior and executing the action suggested
by the corresponding intervention models. The resulting algorithm is very similar to
Bayesian Q-learning Dearden, Friedman, and Russell (1998); Dearden, Friedman, and
Andre (1999), but differs in the way actions are selected. The pseudo-code is listed in
Algorithm 2.
Algorithm 2: BCR-MDP Gibbs sampler.
Initialize entries of λ and µ to zero.
Set initial state to x← x0.
for t = 1, 2, 3, . . . do
Gibbs sweep:
Sample ρ using (9.19).
for Q(y, b) of visited states do
Sample Q(y, b) using (9.20).
Interaction:
Set a← argmaxa′ Q(x, a′) and issue a.
Obtain o = (r, x′) from environment.
Update hyperparameters:
µ(x, a, x′)← λ(x,a,x′)µ(x,a,x′)+p rλ(x,a,x′)+p
λ(x, a, x′)← λ(x, a, x′) + p
Set x← x′.
1250 250 375
500
x1000 time steps
0.0
0.1
0.2
0.3
0.4
f) Average Reward
C=30
Bayesian control rule
c) R-learning, C=5 d) R-learning, C=30b) Bayesian control rule
in
it
ia
l 
5
,0
0
0
 s
te
p
s
la
st
 5
,0
0
0
 s
te
p
s
a) 7x7 Maze
goal
membranes
high
probability
low
probability
C=5
C=200
e) R-learning, C=200
Figure 9.11: Results for the 7×7 grid-world domain. Panel (a) illustrates the setup. Columns
(b)-(e) illustrate the behavioral statistics of the algorithms. The upper and lower row have been
calculated over the first and last 5,000 time steps of randomly chosen runs. The probability of
being in a state is color-encoded, and the arrows represent the most frequent actions taken by
the agents. Panel (f) presents the curves obtained by averaging ten runs.
126
9.5 Examples
Simulation: We have tested our MDP-agent in a grid-world example. To give an
intuition of the achieved performance, the results are contrasted with those achieved
by R-learning. We have used the R-learning variant presented in the work of Singh
(1994, Algorithm 3) together with the uncertainty exploration strategy Mahadevan
(1996). The corresponding update equations are
Q(x, a)← (1− α)Q(x, a) + α(r − ρ+max
a′
Q(x′, a′)
)
ρ← (1− β)ρ+ β(r +max
a′
Q(x′, a′)−Q(x, a)), (9.15)
where α, β > 0 are learning rates. The exploration strategy chooses with fixed proba-
bility pexp > 0 the action a that maximizes Q(x, a)+
C
F (x,a) , where C is a constant, and
F (x, a) represents the number of times that action a has been tried in state x. Thus,
higher values of C enforce increased exploration.
In a study Mahadevan (1996), a grid-world is described that is especially useful as a
test bed for the analysis of RL algorithms. For our purposes, it is of particular interest
because it is easy to design experiments containing suboptimal limit-cycles. Figure 9.11,
panel (a), illustrates the 7× 7 grid-world. A controller has to learn a policy that leads
it from any initial location to the goal state. At each step, the agent can move to any
adjacent space (up, down, left or right). If the agent reaches the goal state then its
next position is randomly set to any square of the grid (with uniform probability) to
start another trial. There are also “one-way membranes” that allow the agent to move
into one direction but not into the other. In these experiments, these membranes form
“inverted cups” that the agent can enter from any side but can only leave through the
bottom, playing the role of a local maximum. Transitions are stochastic: the agent
moves to the correct square with probability p = 910 and to any of the free adjacent
spaces (uniform distribution) with probability 1 − p = 110 . Rewards are assigned as
follows. The default reward is r = 0. If the agent traverses a membrane it obtains a
reward of r = 1. Reaching the goal state assigns r = 2.5. The parameters chosen for this
simulation were the following. For our MDP-agent, we have chosen hyperparameters
µ0 = 1 and λ0 = 1 and precision p = 1. For R-learning, we have chosen learning rates
α = 0.5 and β = 0.001, and the exploration constant has been set to C = 5, C = 30
and to C = 200. A total of 10 runs were carried out for each algorithm. The results
are presented in Figure 9.11 and Table 9.2. R-learning only learns the optimal policy
given sufficient exploration (panels d & e, bottom row), whereas the Bayesian control
rule learns the policy successfully. In Figure 9.11f, the learning curve of R-learning for
C = 5 and C = 30 is initially steeper than the Bayesian controller. However, the latter
attains a higher average reward around time step 125,000 onwards. We attribute this
shallow initial transient to the phase where the distribution over the operation modes
is flat, which is also reflected by the initially random exploratory behavior.
127
9. CONTROL AS ESTIMATION
Average Reward
BCR 0.3582 ± 0.0038
R-learning, C = 200 0.2314 ± 0.0024
R-learning, C = 30 0.3056 ± 0.0063
R-learning, C = 5 0.2049 ± 0.0012
Table 9.2: Average reward attained by the different algorithms at the end of the run. The
mean and the standard deviation has been calculated based on 10 runs.
9.6 Critical Issues
Problems of Bayesian methods. The Bayesian control rule treats an adaptive
control problem as a Bayesian inference problem. Hence, all the problems typically
associated with Bayesian methods carry over to agents constructed with the Bayesian
control rule. These problems are of both analytical and computational nature. For
example, there are many probabilistic models where the posterior distribution does not
have a closed-form solution. Also, exact probabilistic inference is in general computa-
tionally very intensive. Even though there is a large literature in efficient/approximate
inference algorithms for particular problem classes Bishop (2006), not many of them
are suitable for on-line probabilistic inference in more realistic environment classes.
Bayesian control rule versus Bayes-optimal control. Directly maximizing the
(subjective) expected utility for a given environment class is not the same as minimizing
the expected relative entropy for a given class of operation modes. The two methods
are based on different assumptions and optimality principles. As such, the Bayesian
control rule is not a Bayes-optimal controller. Indeed, it is easy to design experiments
where the Bayesian control rule converges exponentially slower (or does not converge
at all) than a Bayes-optimal controller to the maximum utility. Consider the following
simple example: Environment 1 is a k-state MDP in which only k consecutive actions
A reach a state with reward +1. Any interception with a B-action leads back to the
initial state. Consider a second environment which is like the first but actions A and
B are interchanged. A Bayes-optimal controller figures out the true environment in
k actions (either k consecutive A’s or B’s). Consider now the Bayesian control rule:
The optimal action in Environment 1 is A, in Environment 2 is B. A uniform (12 ,
1
2)
prior over the operation modes stays a uniform posterior as long as no reward has been
observed. Hence the Bayesian control rule chooses at each time-step A and B with
equal probability. With this policy it takes about 2k actions to accidentally choose a
row of A’s (or B’s) of length k. From then on the Bayesian control rule is optimal too.
So a Bayes-optimal controller converges in time k, while the Bayesian control rule needs
exponentially longer. One way to remedy this problem might be to allow the Bayesian
control rule to sample actions from the same operation mode for several time steps in
a row rather than randomizing controllers in every cycle. However, if one considers
non-stationary environments this strategy can also break down. Consider, for example,
128
9.7 Relation to Existing Approaches
an increasing MDP with k =
⌈
10
√
t
⌉
, in which a Bayes-optimal controller converges in
100 steps, while the Bayesian control rule does not converge at all in most realizations,
because the boundedness assumption is violated.
9.7 Relation to Existing Approaches
Some of the ideas underlying this work are not unique to the Bayesian control rule. The
following is a selection of previously published work in the recent Bayesian reinforcement
learning literature where related ideas can be found.
Compression principles. In the literature, there is an important amount of work
relating compression to intelligence (MacKay, 2003; Hutter, 2004a). In particular, it
has been even proposed that compression ratio is an objective quantitative measure of
intelligence (Mahoney, 1999). Compression has also been used as a basis for a theory
of curiosity, creativity and beauty (Schmidhuber, 2009).
Mixture of experts. Passive sequence prediction by mixing experts has been studied
extensively in the literature (Cesa-Bianchi and Lugosi, 2006). In a study on online-
predictors (Hutter, 2004b), Bayes-optimal predictors are mixed. Bayes-mixtures can
also be used for universal prediction (Hutter, 2003). For the control case, the idea
of using mixtures of expert-controllers has been previously evoked in models like the
MOSAIC-architecture (Haruno, Wolpert, and Kawato, 2001). Universal learning with
Bayes mixtures of experts in reactive environments has been studied in the work of
Poland and Hutter (2005) and Hutter (2002).
Stochastic action selection. The idea of using actions as random variables, and the
problems that this entails, has been expressed in the work of Hutter (2004a, Problem
5.1). The study in this chapter can be regarded as a thorough investigation of this
open problem. Other stochastic action selection approaches are found in the thesis of
Wyatt (1997) who examines exploration strategies for (PO)MDPs, in learning automata
(Narendra and Thathachar, 1974) and in probability matching (Duda, Hart, and Stork,
2001) amongst others. In particular, the thesis discusses theoretical properties of an
extension to probability matching in the context of multi-armed bandit problems. There,
it is proposed to choose a lever according to how likely it is to be optimal and it is shown
that this strategy converges, thus providing a simple method for guiding exploration.
9.8 Derivation of Gibbs Sampler for MDP Agent
Inserting the likelihood given in Equation 9.13 into Equation 9.9 of the Bayesian control
rule, one obtains the following expression for the posterior
129
9. CONTROL AS ESTIMATION
P (θ|aˆ≤t, o≤t) = P (x
′|θ, x, a)P (r|θ, x, a, x′)P (θ|aˆ<t, o<t)∫
Θ P (x
′|θ′, x, a)P (r|θ′, x, a, x′)P (θ′|aˆ<t, o<t) dθ′
=
P (r|θ, x, a, x′)P (θ|aˆ<t, o<t)∫
Θ P (r|θ′, x, a, x′)P (θ′|aˆ<t, o<t) dθ′
, (9.16)
where we have replaced the sum by an integration over θ′, the finite-dimensional real
space containing only the average reward and the Q-values of the observed states, and
where we have simplified the term P (x′|θ, x, a) because it is constant for all θ′ ∈ Θ.
The likelihood model P (r|θ′, x, a, x′) in Equation 9.16 encodes a set of independent
normal distributions over the immediate reward with means ξθ(x, a, x
′) indexed by
triples (x, a, x′) ∈ X × A × X . In other words, given (x, a, x′), the rewards are drawn
from a normal distribution with unknown mean ξθ(x, a, x
′) and known variance σ2.
The sufficient statistics are given by n(x, a, x′), the number of times that the transition
x → x′ under action a, and r¯(x, a, x′), the mean of the rewards obtained in the same
transition. The conjugate prior distribution is well known and given by a normal
distribution with hyperparameters µ0 and λ0:
P (ξθ(x, a, x
′)) = N(µ0, 1/λ0) =
√
λ0
2π
exp
{
−λ02
(
ξθ(x, a, x
′)− µ0
)2}
. (9.17)
The posterior distribution is given by
P (ξθ(x, a, x
′)|aˆ≤t, o≤t) = N(µ(x, a, x′), 1/λ(x, a, x′))
where the posterior hyperparameters are computed as
µ(x, a, x′) =
λ0 µ0 + p n(x, a, x
′) r¯(x, a, x′)
λ0 + p n(x, a, x′)
λ(x, a, x′) = λ0 + p n(x, a, x
′).
(9.18)
By introducing the shorthand V (x) := maxaQ(x, a), we can write the posterior
distribution over ρ as
P (ρ|aˆ≤t, o≤t) = N(ρ¯, 1/S) (9.19)
where
ρ¯ =
1
S
∑
x,a,x′
λ(x, a, x′)(µ(x, a, x′)−Q(x, a) + V (x′)),
S =
∑
x,a,x′
λ(x, a, x′).
The posterior distribution over the Q-values is more difficult to obtain, because
each Q(x, a) enters the posterior distribution both linearly and non-linearly through µ.
However, if we fix Q(x, a) within the max operations, which amounts to treating each
130
9.9 Historical Remarks & References
V (x) as a constant within a single Gibbs step, then the conditional distribution can be
approximated by
P (Q(x, a)|aˆ≤t, o≤t) ≈ N
(
Q¯(x, a), 1/S(x, a)
)
(9.20)
where
Q¯(x, a) =
1
S(x, a)
∑
x′
λ(x, a, x′)(µ(x, a, x′)− ρ+ V (x′)),
S(x, a) =
∑
x′
λ(x, a, x′).
We expect this approximation to hold because the resulting update rule constitutes a
contraction operation that forms the basis of most stochastic approximation algorithms
Mahadevan (1996). As a result, the Gibbs sampler draws all the values from normal
distributions. In each cycle of the adaptive controller, one can carry out several Gibbs
sweeps to obtain a sample of θ to improve the mixing of the Markov chain. However, our
experimental results have shown that a single Gibbs sweep per state transition performs
reasonably well. Once a new parameter vector θ is drawn, the Bayesian control rule
proceeds by taking the optimal action given by Equation 9.14. Note that only the µ
and λ entries of the transitions that have occurred need to be represented explicitly;
similarly, only the Q-values of visited states need to be represented explicitly.
9.9 Historical Remarks & References
The Bayesian control rule has been entirely developed by the author and D. A. Braun and it
has been first published in Ortega and Braun (2010c). The convergence proof was published
later in Ortega and Braun (2010a). See also Braun and Ortega (2010). The name “Bayesian
control rule” has been suggested by Z. Ghahramani.
Some of the ideas underlying this work are not unique to the Bayesian control rule. The
following is a selection of previously published work in the recent Bayesian control literature
where related ideas can be found.
Compression principles. In the literature, there is an important amount of work relating
compression to intelligence (MacKay, 2003; Hutter, 2004a). In particular, it has been even
proposed that compression ratio is an objective quantitative measure of intelligence (Mahoney,
1999). Compression has also been used as a basis for a theory of curiosity, creativity and beauty
(Schmidhuber, 2009).
Mixture of experts. Passive sequence prediction by mixing experts has been studied ex-
tensively in the literature (Cesa-Bianchi and Lugosi, 2006). In a study on online-predictors
(Hutter, 2004b), Bayes-optimal predictors are mixed. Bayes-mixtures can also be used for
universal prediction (Hutter, 2003). For the control case, the idea of using mixtures of expert-
controllers has been previously evoked in models like the MOSAIC-architecture (Haruno et al.,
2001). Universal learning with Bayes mixtures of experts in reactive environments has been
studied in the work of Poland and Hutter (2005) and Hutter (2002).
Stochastic action selection. The idea of using actions as random variables, and the problems
that this entails, has been expressed in the work of Hutter (2004a, Problem 5.1). The present
131
9. CONTROL AS ESTIMATION
chapter can be regarded as a thorough investigation of this open problem. Other stochastic
action selection approaches are found in the thesis of Wyatt (1997) who examines exploration
strategies for (PO)MDPs, in learning automata (Narendra and Thathachar, 1974) and in proba-
bility matching (Duda et al., 2001) amongst others. In particular, the thesis discusses theoretical
properties of an extension to probability matching in the context of multi-armed bandit prob-
lems. There, it is proposed to choose a lever according to how likely it is to be optimal and it
is shown that this strategy converges, thus providing a simple method for guiding exploration.
Relative entropy criterion. The usage of a minimum relative entropy criterion to derive
control laws underlies the KL-control methods developed in the work of Todorov (2006, 2009)
and Kappen et al. (2009). There, it has been shown that a large class of optimal control problems
can be solved very efficiently if the problem statement is reformulated as the minimization of the
deviation of the dynamics of a controlled system from the uncontrolled system. A related idea
is to conceptualize planning as an inference problem (Toussaint, Harmeling, and Storkey, 2006).
This approach is based on an equivalence between maximization of the expected future return
and likelihood maximization which is both applicable to MDPs and POMDPs. Algorithms
based on this duality have become an active field of current research. See for example the
work of Rasmussen and Deisenroth (2008), where very fast model-based reinforcement learning
techniques are used for control in continuous state and action spaces.
132
Chapter 10
Discussion
Despite of recent major theoretical achievements in the field of artificial intelligence, the
current state-of-the-art implementations are still far away from even displaying insect
intelligence. Given this situation, the two main questions addressed in this thesis are:
1. Are there limitations imposed by the mathematical foundations of classical agency?
2. If yes, how do we formulate new foundations that overcome these limitations?
10.1 Summary
This thesis has been organized in two parts, summarized as follows.
1. The first part contains a concise presentation of the foundations of classical
agency: namely the formalizations of decision making and learning. The first
includes: SEU theory, the framework of decision making under uncertainty; the
maximum SEU principle to choose the optimal solution; and its application to
the design of autonomous systems, culminating in the Bellman optimality equa-
tions. The second includes: Bayesian probability theory, the theory for reasoning
under uncertainty that extends logic; and Bayes-Optimal agents, the application
of Bayesian probability theory to the design of optimal adaptive agents.
2. Then, two major problems of the maximum SEU principle are highlighted: the
prohibitive computational costs and the need for the causal precedence of the
choice of the policy.
3. The second part tackles the two aforementioned problems. First, an information-
theoretic notion of resources in autonomous agents is established. Second, a
framework for resource-bounded agency is introduced. This includes: a maxi-
mum bounded SEU principle that is derived from a set of axioms of utility; an
axiomatic model of probabilistic causality, which is applied for the formalization
of autonomous systems having uncertainty over their policy and environment; and
the Bayesian control rule, derived from the maximum bounded SEU principle and
133
10. DISCUSSION
the model of causality, implementing a stochastic adaptive control law that deals
with the case where autonomous agents are uncertain about their policy and
environment.
10.2 What are the contributions?
Bounded SEU. There is abundant literature tackling the problem of bounded ra-
tionality, whose purpose is: (a) capturing aspects of human decision making that con-
tradicts rationality; or (b) understanding how resource costs affect decision making.
This thesis introduces a related view whose main goal is to present an axiomatic for-
malization of bounded rationality that encompasses classical rationality as a limit case
when resource costs vanish. More specifically, the utility-information conversion factor
α controls the tradeoff between information and utility, and the energy-information
conversion factor γ controls the tradeoff between information and energy. Real au-
tonomous systems have α > 0, and hence, under the view of the presented framework,
perfect rationality is only an idealization. Conceptually, the distinguishing feature of
bounded SEU (compared to classical SEU) is that it provides an explanatory framework
for approximations.
Causality. Causality is a field that has historically been highly controversial, and it
has been only recently that mathematical formalizations have started to find wider ac-
ceptance. Still, so far these formalizations have not clarified the connection to measure
theory, the mathematically rigorous theory of probability. The framework introduced
in this thesis does a step towards this direction. Simultaneously, this thesis clarifies the
importance of causal consideration in agent designs. More specifically, the Bayesian
I/O model (Section 8.3.1), which is really a causal model, allows enriching the classi-
cal Bayesian autonomous system, having uncertainty only over its environment, with
having uncertainty over its very policy.
Bayesian control rule. Agents that are constructed following the maximum SEU
principle have to carry out massive computations before they even have had a single
interaction with its environment. To bypass this limitation, most practical algorithms
implement autonomous systems that “discover” their policy during the interactions
with the environment. In this thesis, we have used the framework of bounded SEU
and causal reasoning to derive the Bayesian control rule: a rule that allows construct-
ing a natural class adaptive agents having uncertainty over their policies. Formally,
the Bayesian control rule for outputs is the probabilistic equivalent to the predictive
distribution for inputs. Furthermore, this thesis presents a convergence proof for the
Bayesian control rule under a very restricted setting.
134
10.3 What is missing?
10.3 What is missing?
Understanding the Implications of the Relations. If one is willing to accept
the connections between decision theory, information theory and thermodynamics that
this thesis puts forward, then one obtains a potentially very fertile ground for novel
ideas and reinterpretations. While this thesis shows some of the benefits of adopting
this unified view, it also leaves many questions unanswered. For instance, the utility-
information conversion factor α controls the cost of translating resources into utilities.
On a very abstract level, one could blame the failure of approximations to the very large
value of α. But what does α mean in practice? How can it be influenced? Examples
such as this abound and need to be addressed in the future.
Descriptive Power. While the bounded SEU framework introduced in this thesis
has a theoretical appeal due to its properties and simplicity, it remains to be seen
whether it can explain human decision making, and especially, whether it can explain
the experimental evidence (Allais, 1953; Ellsberg, 1961; Kahneman and Tversky, 1979;
Machina, 1987; Kreps, 1988) that contradicts perfect rationality. In particular, it would
be important to verify whether the causality framework and/or the Bayesian control
rule can predict aspects of human decision making.
Intelligence Measure. Legg and Hutter (2006) proposed a formal measure of ma-
chine intelligence. While this measure is a synthesis of many informal definitions of
human intelligence that have been given by experts, it has been constructed mainly bor-
rowing ideas from the theory of universal artificial intelligence, staying thus within the
paradigm of agency with unlimited resources. It would be interesting to test whether
this intelligence measure can be accommodated to include agency with bounded re-
sources.
Game Theory. It is not at all clear how the design of autonomous systems connects
to the game theoretic literature. The fundamental difference lies in the assumptions.
In artificial intelligence and control theory, one assumes a dynamical model of the
environment first, and then constructs a suitable agent. Meanwhile, in game theory,
one instead only assumes a utility function describing the environment’s preferences.
This assumption, however, does not provide enough information to derive the dynam-
ical model of the environment, since this model must depend on the assumptions the
environment makes about the agent. Under this point of view, additional solution con-
cepts (collectively called equilibria) other than the maximum expected utility prin-
ciple are required to derive the resulting behavior of the interaction system. In the
context of this thesis, an important point to be verified is to investigate whether the
Bayesian I/O model for agents is “essentially” equivalent to the Bayesian game frame-
work (Harsanyi, 1967–1968), and whether game theoretic concepts can be extended to
the case of resource-bounded agency.
135
10. DISCUSSION
136
References
M. Allais. Le comportment de l’homme rationnel devant la risque: critique des postulats
et axiomes de l’ecole americaine. Econometrica, 21:503–546, 1953.
Dana Angluin. Computational learning theory: survey and selected bibliography. In
Proceedings of the twenty-fourth annual ACM symposium on Theory of computing,
STOC ’92, pages 351–369, New York, NY, USA, 1992.
F. J. Anscombe and R. J. Aumann. A definition of subjective probability. The Annals
of Mathematical Statistics, 34(1):199–205, mar 1963.
R. B. Ash. Information Theory. New York: Interscience, 1965.
P. Auer, N. CesaBianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit
problem. Machine Learning, 47:235–256, 2002.
R. Bayes. An essay towards solving a problem in the doctrine of chances. Philosophical
Transactions of the Royal Society of London, 53, 1763.
P. Beame. A general sequential time-space tradeoff for finding unique elements. In
Proceedings of the twenty-first annual ACM symposium on Theory of computing,
STOC ’89, pages 197–203, New York, NY, USA, 1989. ACM.
P. Beame, T. S. Jayram, and M. Saks. Time-space tradeoffs for branching programs.
J. Comput. Syst. Sci., 63:542–572, December 2001.
P. Beame, M. Saks, X. Sun, and E. Vee. Time-space trade-off lower bounds for ran-
domized computation of decision problems. J. ACM, 50:154–195, March 2003.
R. E. Bellman. Dynamic programming, 1957.
C. H. Bennett. Logical reversibility of computation. IBM Journal of Research and
Development, 17(6):525–532, 1973.
C. H. Bennett. The thermodynamics of computationa review. International Journal of
Theoretical Physics, 21(12):905–940, 1982.
D. Bernoulli. Specimen theoriae novae de mensara sortis. Commentarii Academiae
Scientiarum Imperialis Petropolitanae, 1738. (trans. in 1954, Econometrica).
137
REFERENCES
D. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-
Hall, Upper Saddle River, NJ, 1987.
P. Billingsley. Ergodic Theory and Information. R. E. Krieger Pub. Co., 1978.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
K. Borch. A note on uncertainty and indifference curves. The Review of Economic
Studies, 36(1):1–4, 1969.
E. Borel. Lec¸ons sur la the´orie des fonctions. Gauthier-Villars, Paris, 1898.
A. Borodin and S. Cook. A time-space tradeoff for sorting on a general sequential model
of computation. In Proceedings of the twelfth annual ACM symposium on Theory of
computing, STOC ’80, pages 294–301, New York, NY, USA, 1980. ACM.
D. A. Braun and P. A. Ortega. A minimum relative entropy principle for adaptive con-
trol in linear quadratic regulators. In Proceedings of the 7th international conference
on informatics in control, automation and robotics, page (in press), 2010.
H. J. Bremermann. Quantum noise and information. In 5th Berkeley Symposium on
Mathematical Statistics and Probability, Berkeley, California, 1965. Univ. of Califor-
nia Press.
L. Brillouin. Maxwell’s demon cannot operate: Information and entropy i. Journal of
Applied Physics, 22:334–337, 1951.
L. Brillouin. Science and Information Theory. New York: Academic Press., 1956.
H. B. Callen. Thermodynamics and an Introduction to Themostatistics. John Wiley &
Sons, 2nd edition, 1985.
C. F. Camerer and M. Weber. Recent developments in modelling preferences: uncer-
tainty and ambiguity. J. Risk Uncertain., 5:325–370, 1992.
J. C. Candeal, J. R. De Miguel, E. Indura´in, and G. B. Mehta. Utility and entropy.
Economic Theory, 17:233–238, 2001.
N. Cartwright. Nature’s Capacities and Their Measurement. Claredon Press, Oxnard,
1989.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning and Games. Cambridge University
Press, 2006.
T. M. Cover and J. A. Thomas. Elements of information theory. New York: Wiley-
Interscience, 1st edition, 1991.
R. T. Cox. The Algebra of Probable Inference. Johns Hopkins, 1961.
138
REFERENCES
A. P. Dawid. Fundamentals of statistical causality. Research Report 279, Department
of Statistical Science, University College London, 2007.
B. De Finetti. La pre´vision: Ses lois logiques, ses sources subjectives. In Annales de
l’Institut Henri Poincare´, volume 7, pages 1–68, 1937.
R. Dearden, N. Friedman, and S. Russell. Bayesian q-learning. In AAAI ’98/IAAI
’98: Proceedings of the fifteenth national/tenth conference on Artificial intelli-
gence/Innovative applications of artificial intelligence, pages 761–768, Menlo Park,
CA, US, 1998. American Association for Artificial Intelligence.
R. Dearden, N. Friedman, and D. Andre. Model based bayesian exploration. In In
Proceedings of Fifteenth Conference on Uncertainty in Artificial Intelligence, pages
150–159, 1999.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley & Sons, Inc.,
second edition, 2001.
M. O’G. Duff. Optimal learning: computational procedures for bayes-adaptive markov
decision processes. PhD thesis, University of Massachusetts—Amherst, 2002.
Director-Andrew Barto.
E. Eells. Probabilistic Causality. Cambridge University Press, 1991.
D. Ellsberg. Risk, ambiguity and the savage axioms. The Quaterly Journal of Eco-
nomics, 75:643–669, 1961.
R. P. Feynman. Feynman Lectures on Computation. Westview Press, 2nd edition,
2000.
P. C. Fishburn. The Foundations of Expected Utility. D. Reidel Publishing, Dordrecht,
1982.
R. A. Fisher. Statistical Methods for Research Workers. Macmillan Pub. Co., 13th
edition, 1970.
M. Fre´chet. De´finition de l’inte´grale d’une fonctionnelle e´tendue a` un ensemble abstrait.
Comptes Rendus de l’Acade´mie des sciences, 1915.
D. Gabor. Light and information. Progress in Optics, 1:111–153, 1964.
R. Gallager. Information Theory and Reliable Communication. New York: John Wiley
and Sons, 1968.
H. Goldstein. Classical Mechanics. Addison-Wesley, 2nd edition, 1980.
P. D. Gru¨nwald. The Minimum Description Length Principle. MIT Press, 2007.
J. C. Harsanyi. Games with incomplete information played by ”bayesian” players, i–iii.
Management Science, 14:159–182, 320–334, 486–502, 1967–1968.
139
REFERENCES
M. Haruno, D. M. Wolpert, and M. Kawato. Mosaic model for sensorimotor learning
and control. Neural Computation, 13:2201–2220, 2001.
M. Hutter. Self-optimizing and pareto-optimal policies in general environments based
on bayes-mixtures. In COLT, 2002.
M. Hutter. Optimality of universal Bayesian prediction for general loss and alphabet.
Journal of Machine Learning Research, 4:971–997, 2003.
M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic
Probability. Springer, Berlin, 2004a.
M. Hutter. Online prediction – bayes versus experts. Technical report, July 2004b. Pre-
sented at the EU PASCAL Workshop on Learning Theoretic and Bayesian Inductive
Principles (LTBIP-2004).
E. T. Jaynes and Larry G. Bretthorst. Probability Theory: The Logic of Science: Books.
Cambridge University Press, 2003.
H. Jeffreys. The Theory of Probability. The Clarendon Press, 1939.
M. I. Jordan, Z. Ghahramani, and L. K. Saul. Hidden Markov decision trees. In
Advances in Neural Information Processing Systems 9, 1997.
D. Kahneman and A. Tversky. Prospect theory: An analysis of decision under risk.
Econometrica, 47:263–291, 1979.
R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions
of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960.
B. Kappen, V. Gomez, and M. Opper. Optimal control as a graphical model inference
problem. arXiv:0901.0633, 2009.
M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT
Press, 1994.
G. Keller. Equilibrium States in Ergodic Theory. London Mathematical Society Student
Texts. Cambridge University Press, 1998.
A. I. Khinchin. Mathematical Foundations of Information Theory. New York: Dover,
1957.
F. H. Knight. Risk, Uncertainty, and Profit. Houghton Mifflin, Boston, 1921.
A. N. Kolmogorov. Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer, Berlin,
1933.
A. N. Kolmogorov. Three approaches to the quantitative definition of information.
International Journal of Computer Mathematics, 1968.
140
REFERENCES
L. G. Kraft. A Device for Quantizing, Grouping, and Coding Amplitude Modulated
Pulses. Ms thesis, Electrical Engineering Department, Massachusetts Institute of
Technology, Cambridge, MA, 1949.
D. M. Kreps. Notes on the Theory of Choice. Westview Press, 1988.
S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathe-
matical Statistics, 22(1):79–86, mar 1951.
R. Landauer. Irreversibility and heat generation in the computing process. IBM Journal
of Research and Development, 5(3):183–191, 1961.
P. S. Laplace. Me´moires sur la probabilite´ des causes par les e´ve`nements. Me´moires
de mathe´matique et des physiques presentee´s a` l’Acade´mie royale des sciences, par
divers savans, & luˆs dans ses assemble´es, 6:621–656, 1774.
H. Lebesgue. Lec¸ons sur l’inte´gration et la recherche des fonctions primitives. 1904.
S. Legg. Machine Super Intelligence. PhD thesis, Department of Informatics, University
of Lugano, June 2008.
S. Legg and M. Hutter. A formal measure of machine intelligence. In Annual machine
learning conference of Belgium and The Netherlands (Benelearn-2006), Ghent, 2006.
D. Lewis. Counterfactuals. Cambridge, MA: Harvard University Press, 1973.
M. Li and P. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications
(Texts in Computer Science). Springer, February 2008.
R. D. Luce. Individual Choice Behavior: A Theoretical Analysis. New York: Wiley,
1959.
M. Machina. Choice under uncertainty: Problems solved and unsolved. Journal of
Economic Perspectives, 1:121–154, 1987.
D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge
University Press, 2003.
S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and
empirical results. Machine Learning, 22(1-3):159–195, 1996.
M. V. Mahoney. Text compression as a test for artificial intelligence. In AAAI/IAAI,
pages 486–502, 1999.
J. C. Maxwell. Letter to P. G. Tait, 11 December 1867, 1867.
W. S. McCulloch and W. Pitts. A logical calculus of ideas immanent in neural activity.
Bulletin of Mathematical Biophysics, 5:115–133, 1943.
141
REFERENCES
B. McMillan. Two inequalities implied by unique decipherability. IEEE Trans. Infor-
mation Theory, 2(4):115–116, 1956.
D. H. Mellor. The Facts of Causation. Routledge, 1995.
D. Michie. Game-playing and game-learning automata. Advances in Programming &
Non-Numerical Computation, pages 183–200, 1966.
K. Narendra and M. A. L. Thathachar. Learning automata - a survey. IEEE Transac-
tions on Systems, Man, and Cybernetics, SMC-4(4):323–334, July 1974.
J. Neyman. First course in probability and statistics. Holt, New York, 1950.
N. J. Nilsson. Artificial Intelligence: A New Synthesis. Morgan Kaufmann Publishers,
San Francisco, 1998.
R. Nozick. Newcomb’s problem and two principles of choice. In N. Rescher, editor,
Essays in Honor of Carl G. Hempel, pages 114–146. Reidel, 1969.
P. A. Ortega and D. A. Braun. A minimum relative entropy principle for learning and
acting. Journal of Artificial Intelligence Research, 38:475–511, 2010a.
P. A. Ortega and D. A. Braun. A conversion between utility and information. In
The third conference on artificial general intelligence, pages 115–120, Paris, 2010b.
Atlantis Press.
P. A. Ortega and D. A. Braun. A bayesian rule for adaptive control based on causal
interventions. In The third conference on artificial general intelligence, pages 121–
126, Paris, 2010c. Atlantis Press.
P. A. Ortega and D. A. Braun. An axiomatic formalization of bounded rationality
based on a utility-information equivalence. arXiv:1007.0940, 2010d.
M. J. Osborne and A. Rubinstein. A Course in Game Theory. MIT Press, 1999.
C. H. Papadimitriou. Computational Complexity. Adison Wesley, 1993.
D. C. Parkes. Bounded rationality. Technical report, University of Pennsylvania, 1997.
J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press,
Cambridge, UK, 2000.
J. Poland and M. Hutter. Defensive universal learning with experts. In ALT, 2005.
K. R. Popper. The Logic of Scientific Discovery (Routledge Classics). Routledge, 1934.
T. W. Pratt and M. V. Zelkowitz. Programming Languages: Design and Implementa-
tion. Prentice Hall, 4th edition, 2000.
142
REFERENCES
F. P. Ramsey. The Foundations of Mathematics and Other Logical Essays, chapter
‘Truth and Probability’. Harcourt, Brace and Co., 1926. posthumously published in
1931.
C. E. Rasmussen and M. P. Deisenroth. Recent Advances in Reinforcement Learning,
volume 5323 of Lecture Notes on Computer Science, LNAI, chapter Probabilistic
Inference for Fast Learning in Control, pages 229–242. Springer-Verlag, 2008.
H. Robbins. Some aspects of the sequential design of experiments. Bulletin American
Mathematical Socierty, 58:527–535, 1952.
A. Rosenblueth, N. Wiener, and J. Bigelow. Behavior, purpose, and teleology. Philos-
ophy of Science, 10:18–24, 1943.
A. Rubinstein. Modeling Bounded Rationality. MIT Press, 1988.
S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall,
Englewood Cliffs, NJ, 3rd edition edition, 2009.
S. Russell and E. Wefald. Principles of metareasoning. Artificial Intelligence, 49:361–
395, 1991.
J. E. Savage. Models of Computation: Exploring the Power of Computing. Addison
Wesley Publishing Company, 1998.
L. J. Savage. The Foundations of Statistics. John Wiley and Sons, New York, 1954.
J. Schmidhuber. Simple algorithmic theory of subjective beauty, novelty, surprise,
interestingness, attention, curiosity, creativity, art, science, music, jokes. Journal of
SICE, 48(1):21–32, 2009.
G. Shafer. The Art of Causal Conjecture. MIT Press, 1996.
C. E. Shannon. A mathematical theory of communication. Bell System Technical
Journal, 27:379–423 and 623–656, Jul and Oct 1948.
H. Simon. Models of Bounded Rationality. MIT Press, 1982.
H. A. Simon. A behavioral model of rational choice. Quarterly Journal of Economics,
69:99–188, 1955.
S. P. Singh. Reinforcement learning algorithms for average-payoff markovian decision
processes. In National Conference on Artificial Intelligence, pages 700–705, 1994.
M. Sipser. Introduction to the Theory of Computation. PWS Pub. Co., 1996.
R. J. Solomonoff. A formal theory of inductive inference. part 1 & 2. Information and
Control, 7:1–22, 224–254, 1964.
143
REFERENCES
P. Spirtes and R. Scheines. Causation, Prediction, and Search, Second Edition. MIT
Press, 2001.
R. Stalnacker. Ifs: Conditionals, Belief, Decision, Chance, and Time, chapter A Theory
of Conditionals., pages 41–56. Dordrecht: Reidel, 1968.
P. Suppes. A Probabilistic Theory of Causality. Amsterdam: North-Holland Publishing
Company, 1970.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press,
Cambridge, MA, 1998.
L. Szilard. On the decrease of entropy in a thermodynamic system by the intervention
of intelligent beings. Zeitschrift fu¨r Physik, 53:840–856, 1929.
E. Todorov. Linearly solvable markov decision problems. In Advances in Neural Infor-
mation Processing Systems, volume 19, pages 1369–1376, 2006.
E. Todorov. General duality between optimal control and estimation. In Proceedings
of the 47th conference on decision and control, pages 4286–4292, 2008.
E. Todorov. Efficient computation of optimal actions. Proceedings of the National
Academy of Sciences U.S.A., 106:11478–11483, 2009.
M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic inference for solving
(po)mdps, 2006.
M. Tribus and E. C. McIrvine. Energy and information. Scientific American, 225:
179–188, 1971.
J. N. Tsitsiklis. Computational complexity in Markov decision theory. HERMIS—An
International Journal of Computer Mathematics and its Applications, 9(1):45–54,
2007.
J. Veness, K. S. Ng, M. Hutter, and D. Silver. Reinforcement learning via aixi ap-
proximation. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial
Intelligence, 2010.
J. Veness, K. S. Ng, M. Hutter, Uther W., and D. Silver. A monte-carlo aixi approxi-
mation. Journal of Artificial Intelligence Research, 40:95–142, 2011.
J. Venn. The Logic of Chance. Macmillan Pub. Co., London, 1st edition, 1866.
R. von Mises. Grundlagen der wahrscheinlichkeitsrechnung. Mathemat. Zeitsch., 5:
52–99, 1919.
J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior.
Princeton University Press, 1944.
144
REFERENCES
C. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge,
Cambridge, England, 1989.
N. Wiener. Cybernetics, Second Edition: or the Control and Communication in the
Animal and the Machine. The MIT Press, 1965.
J. Wyatt. Exploration and Inference in Learning from Reinforcement. PhD thesis,
Department of Artificial Intelligence, University of Edinburgh, 1997.
145
Index
act, 14
constant, 15
action, 9
agent, 9
algebra, 6, 31
generated, 94
ambiguity, 106
atom, 94
atom set, 94
generated, 95
axioms
belief, 34
causal, 96
probability, 6
rationality, 15
truth, 31
Bayes’ rule, 35
Bayesian control rule, 110
behavioral function, 19
belief
function, 34
induced space, 97
posterior, 35
prior, 35
space, 34
Bellman optimality equations, 22
bounded variation, 115
capacity, 54
causal
n-th function, 96
function, 96
space, 97
channel, 54
circuit
logic, 63
codeword, 56
codeword length, 56
complement, 31
conjugate, 79
consequence, 14
construction method
for control, 83
for estimation, 84
control
adaptive estimative, 109
adaptive optimal, 45
bounded optimal, 85
optimal, 25
core, 117
cross-entropy, 58
cycle, 9
data, 35
decision, 102
divergence
Kullback-Leibler, 60
divergence process, 112
dominates, 41
dynamic programming, 27
element, 5
empty string, 5
entropy, 58
relative, 60
environment, 9
ε, see empty string
equilibrium, 135
Nash, 106
event, 6, 14, 31
null, 15
146
INDEX
primitive, 95
false, 31
free utility, 80
horizon, 9
hypothesis, 35
information content, 58
interaction, 9
intersection, 31
intervention, 93, 98
knows, 22
Kraft-McMillan inequality, 56
learning theory
computational, 53
likelihood, 35
logarithm, 5
natural, 5
machine
random access, 64
sequential processing, 64
Turing, 64
Markov decision problem, 27
MDP, see Markov decision problem
measure
probability, 6
measurement, 100
mixture distribution, 38
model
Bayesian I/O, 99
Bayesian input, 38
Bayesian output, 99
I/O, 22
induced I/O, 102
induced input, 39
input, 22
output, 9
natural numbers, 5
observation, 9
operation mode, 109
set, 109
optimal, 22
outcome, 6, 31, 94
parameter
unknown, 39, 99
plausibility, 22
policy, 22
Bayes optimal, 45
consistent, 119
powerset, 6
predictive distribution, 38
predictor, 22
preference
conditional, 14
preference relation, 14
indifference, 14
rational, 15
strict, 14
prefix code, 56
complete, 57
prefix free, 55
probability measure
generative, 10
product rule, 34
program
branching, 70
propensity, 22
random variable, 6
rational, 14
rationality
bounded, 53
real numbers, 5
reasoning
meta-, 53
reward function, 24
risk, 106
sample, 6
set, 5
SEU principle
maximum, 21
maximum bounded, 83
147
INDEX
space
measurable, 6
probability, 6
sample, 6
truth, 32
state, 14, 67
stream probabilities, 10
string, 5
sub-divergence, 114
subjective expected utility
bounded, 82
substring, 5
sum rule, 34
symbol, 5
system
autonomous, 5
free, 76
true, 31
truth function, 31
truth state, 31
truth value, 31
uncertain, 31
union, 31
utile, 79
utility function, 77
utility gain function, 76, 77
value function, 25
148