A Unified Framework for
Resource-Bounded Autonomous
Agents Interacting with
Unknown Environments
Pedro A. Ortega
Department of Engineering
University of Cambridge
A thesis submitted for the degree of
Doctor of Philosophy
September 2010
To my parents Pedro and Pilar.
Acknowledgements
This thesis is the result of four years of work, and it would not have been
possible without the motivation and support of many people. Thanks to
them, the years I spent in Cambridge have been amongst the happiest of
my life.
I especially want to thank my family Pedro, Pilar, Carolina and Paulina
and my closest friends Francisca Albert, Paul Aguayo, Daniel Braun, Oscar
Van Heerden, Emre Karaa, Aliff Mohamad, José Donoso, Aditya Saxena,
Loreto Valenzuela, Horacio Tate, Aaron Lobo, Zoi Roupakia, Aysha Roohi,
Ben Mansfield, Francisco Pérez and Mauricio Gaete. Also, I am deeply
indebted to Disa Helander and Raffaella Nativio.
Special thanks go to my friend Daniel Braun, who has been closely col-
laborating with me during the course of my study. I also want to thank
José Aliste, John Cunningham, José Donoso, Marcus Hutter, Humberto
Maturana, Gonzalo Ruz and David Wingate for their invaluable help and
comments on earlier versions of this manuscript. The present study has
been supported by the Ministerio de Planificación de Chile (MIDEPLAN),
the Comisión Nacional de Investigación Científica y Tecnológica
(CONICYT) and the Böhringer-Ingelheim-Fonds (BIF). Finally, I want to thank
my supervisor Zoubin Ghahramani for his guidance, motivation and sup-
port.
Abstract
The aim of this thesis is to present a mathematical framework for con-
ceptualizing and constructing adaptive autonomous systems under resource
constraints. The first part of this thesis contains a concise presentation of
the foundations of classical agency: namely the formalizations of decision
making and learning. Decision making includes: (a) subjective expected
utility (SEU) theory, the framework of decision making under uncertainty;
(b) the maximum SEU principle to choose the optimal solution; and (c)
its application to the design of autonomous systems, culminating in the
Bellman optimality equations. Learning includes: (a) Bayesian probability
theory, the theory for reasoning under uncertainty that extends logic; and
(b) Bayes-Optimal agents, the application of Bayesian probability theory
to the design of optimal adaptive agents. Then, two major problems of
the maximum SEU principle are highlighted: (a) the prohibitive computa-
tional costs and (b) the need for the causal precedence of the choice of the
policy. The second part of this thesis tackles the two aforementioned prob-
lems. First, an information-theoretic notion of resources in autonomous
systems is established. Second, a framework for resource-bounded agency
is introduced. This includes: (a) a maximum bounded SEU principle that is
derived from a set of axioms of utility; (b) an axiomatic model of probabilis-
tic causality, which is applied for the formalization of autonomous systems
having uncertainty over their policy and environment; and (c) the Bayesian
control rule, which is derived from the maximum bounded SEU principle
and the model of causality, implementing a stochastic adaptive control law
that deals with the case where autonomous agents are uncertain about their
policy and environment.
Contents
Preface
1 Introduction
1.1 Historical Remarks & References
I Foundations of Classical Agency
2 Characterization of Behavior
2.1 Preliminaries
2.1.1 Basic Notation
2.1.2 Probabilities & Random Variables
2.2 Models of Autonomous Systems
2.3 Output Model
3 Decision Making
3.1 Subjective Expected Utility
3.1.1 Setup
3.1.2 Rationality
3.1.3 Representation Theorem
3.2 The Maximum Subjective Expected Utility Principle
3.2.1 SEU in Autonomous Systems
3.2.2 I/O Model
3.2.3 Bellman Optimality Equations
3.2.4 Subjective versus True Expected Utility
3.3 Historical Remarks & References
4 Learning
4.1 Bayesian Probability Theory
4.1.1 Reasoning under Certainty
4.1.2 Reasoning under Uncertainty
4.1.3 Bayes’ Rule
4.2 Adaptive Optimal Control
4.2.1 Bayesian Input Model
4.2.2 Predictive Distribution
4.2.3 Induced Input Model
4.2.4 Convergence of Predictive Distribution
4.2.5 Bayes Optimal Agents
4.3 Historical Remarks & References
5 Problems of Classical Agency
5.1 Computational Cost and Precedence of Policy Choice
5.2 Is Rationality a Useful Concept?
5.3 Historical Remarks & References
II Resource-Bounded Agency
6 Resources
6.1 Preliminaries in Information Theory
6.1.1 The Communication Problem
6.1.2 Codes
6.1.3 Information
6.2 Resources as Information
6.2.1 Thermodynamical Interpretation
6.2.2 Computational Interpretation
6.3 Resource Costs in Agents
6.3.1 Cost of Interaction
6.3.2 Costs of Construction
6.4 Historical Remarks & References
7 Boundedness
7.1 An Example of Boundedness
7.2 Utility & Resources
7.2.1 Utility
7.2.2 Variational Principle
7.2.3 Bounded SEU
7.3 Bounded SEU in Autonomous Systems
7.3.1 Bounded Optimal Control
7.3.2 Adaptive Estimation
7.4 Historical Remarks & References
8 Causality
8.1 The Big Picture
8.2 Causal Spaces
8.2.1 Interventions
8.3 Causality in Autonomous Systems
8.3.1 Bayesian I/O Model
8.3.2 Causal Structure
8.3.3 Belief Updates
8.3.4 Induced I/O Model
8.4 Historical Remarks & References
9 Control as Estimation
9.1 Interlude: Dynamic versus Static
9.1.1 Risk versus Ambiguity
9.2 Adaptive Estimative Control
9.3 Bayesian Control Rule
9.4 Convergence of the Bayesian Control Rule
9.4.1 Policy Diagrams
9.4.2 Divergence Processes
9.4.3 Decomposition of Divergence Processes
9.4.4 Bounded Variation
9.4.5 Core
9.4.6 Consistency
9.5 Examples
9.5.1 Bandit Problems
9.5.2 Markov Decision Processes
9.6 Critical Issues
9.7 Relation to Existing Approaches
9.8 Derivation of Gibbs Sampler for MDP Agent
9.9 Historical Remarks & References
10 Discussion
10.1 Summary
10.2 What are the contributions?
10.3 What is missing?
References
List of Figures
2.1 Behavioral Models.
2.2 Interaction System.
2.3 An Output Model.
3.1 Setup of Subjective Expected Utility.
3.2 Axiom R3.
3.3 Axiom R5.
3.4 Axiom R6.
3.5 Axiom R7.
3.6 A Behavioral Function.
3.7 Combining Behavioral Functions determines an Interaction String.
3.8 A Decision Tree.
3.9 Solution of a Decision Tree.
4.1 A Fixed Predictor.
4.2 An Adaptive Predictor.
4.3 Truth Space.
4.4 Extension of Truth Function.
4.5 Bayes’ rule.
4.6 Progressive Refinement of Accuracy.
4.7 Convergence of Predictive Distribution.
6.1 Communication Problem.
6.2 Prefix Codes.
6.3 Information Functions.
6.4 Probability versus Codeword Length.
6.5 The Molecule-In-A-Box Device.
6.6 A Generalized Molecule-In-A-Box Device.
6.7 Time-Space Tradeoff.
6.8 Logic Circuit.
6.9 Sequential Processing Machine.
6.10 State Model.
7.1 An Exhaustive Optimization.
7.2 Distributions after Bounded Optimization.
7.3 Performance of the Bounded Optimization.
7.4 Expected Value Penalized by Relative Entropy.
7.5 Transformation of a System.
8.1 A Three-Stage Randomized Experiment.
8.2 A Causal Graph.
8.3 An Intervention in the Probability Tree.
8.4 Primitive Events and their Atom Sets.
8.5 Causal Space of an Autonomous System.
8.6 Updates following an Observation versus an Action.
9.1 Risk versus Ambiguity.
9.2 A Policy Diagram.
9.3 Realization of Divergence Processes.
9.4 Policies Influence Divergence Processes.
9.5 Decomposition of a Divergence Process into Sub-Divergences.
9.6 Bounded Variation.
9.7 Problems with Disambiguation.
9.8 Inconsistent Policies.
9.9 Space of Bandit Configurations.
9.10 Performance Comparison for Bandit Problem.
9.11 MDP Performance Results.
List of Notation
Basic
X A set or alphabet.
N The set of natural numbers 1, 2, 3, . . .
R The set of real numbers.
ǫ The empty string.
X^n The set of strings of length n over the alphabet X.
X^* The set of all finite strings over the alphabet X.
x_{i:k} The substring x_i x_{i+1} · · · x_{k−1} x_k.
x_{≤i} The string x_1 x_2 · · · x_i.
ln(x) The natural logarithm of x.
log(x) The logarithm base-2 of x.
P(X ) The powerset of X .
Pr An arbitrary probability distribution.
Ω A sample space.
F An algebra.
Autonomous Agents
A The set of actions.
O The set of observations.
Z The set of interactions, i.e. Z := A×O.
at The action at time t.
ot The observation at time t.
ao_i An interaction viewed as a symbol, i.e. a_i o_i.
T The horizon, i.e. the maximum length of interaction strings.
Z⋄ The set of interaction strings up to length T .
U The utility function over interaction strings.
P The (behavioral) model of the agent, i.e. the distribution over
interaction sequences implemented by the agent. It is used to
denote the input model, the output model or the I/O model.
xi
LIST OF NOTATION
Q The (behavioral) model of the environment, i.e. the distribution
over interaction sequences implemented by the environment. It is
used to denote the input model, the output model or the I/O model.
G The generative distribution, i.e. the sampling distribution over the
interactions sequences that results from the interaction between
the agent and the environment.
P The belief model of the agent, i.e. the Bayesian mixture distribu-
tion over interaction sequences. The symbol is used to denote the
Bayesian input model, the Bayesian output model or the Bayesian
I/O model. The belief model is a conceptual explanation that
gives rise to a unique behavioral model.
Model Types
P(a_t|ao_{<t}) The output model, i.e. the distribution over the action a_t given the past interactions ao_{<t}.

By convention, x_{i:j} := ǫ whenever i > j. Logarithms are always taken with respect to
base 2, thus log(2) = 1, unless written explicitly as ln(x), in which case we mean natural
logarithms. The symbol P(X ) denotes the powerset of X , i.e. the set of all subsets
of X .
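These set constructions can be made concrete with a short Python sketch; the two-letter alphabet below is purely illustrative:

```python
from itertools import combinations, product

X = ("a", "b")  # a small alphabet, chosen here only for illustration

# X^n: all strings of length n over X.
def strings_of_length(alphabet, n):
    return ["".join(s) for s in product(alphabet, repeat=n)]

# A finite approximation of X^*: all strings up to a length bound.
def strings_up_to(alphabet, max_len):
    return [s for n in range(max_len + 1) for s in strings_of_length(alphabet, n)]

# P(X): the powerset of X, i.e. the set of all subsets of X.
def powerset(xs):
    return [set(c) for n in range(len(xs) + 1) for c in combinations(xs, n)]

print(strings_of_length(X, 2))  # ['aa', 'ab', 'ba', 'bb']
print(strings_up_to(X, 1))      # ['', 'a', 'b']
print(len(powerset(X)))         # 4, i.e. 2^|X|
```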
2.1.2 Probabilities & Random Variables
To simplify the exposition, all probability spaces are assumed to be finite unless clearly
stated otherwise. While this assumption limits the domain of application of the expo-
sition, it allows isolating the problems belonging solely to the design of autonomous
systems from the problems that arise due to infinite sets (and in particular, from the
problems of geometrical or topological assumptions). With this in mind, we clarify
some terminology.
A sample space is a finite set Ω, where each member ω ∈ Ω is called a sample
or outcome. A subset A ⊂ Ω is called an event. A subset F ⊂ P(Ω) of events
that contains Ω and is closed under complementation and finite union is called an
algebra. A measurable space is a tuple (Ω,F), where Ω is a sample space and F
is an algebra. Given a measurable space (Ω,F), a set function Pr over F is called a
probability measure iff it obeys the (Kolmogorov) probability axioms
K1. for all A ∈ F, Pr(A) ∈ [0, 1];
K2. Pr(∅) = 0 and Pr(Ω) = 1;
K3. for all disjoint A, B ∈ F, Pr(A ∪ B) = Pr(A) + Pr(B).
A probability space is a tuple (Ω,F,Pr) where (Ω,F) is a measurable space and Pr
is its measure. Given a probability space (Ω,F,Pr), a random variable is a function
X : Ω → X mapping each outcome ω to a symbol X(ω) of a set X, and where
X^{-1}(x) ∈ F for all x ∈ X. The probability of the random variable X taking on the
value x ∈ X is defined as Pr(x) := Pr(X = x) := Pr({ω ∈ Ω : X(ω) = x}).
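A minimal Python sketch of these definitions, using a fair die as an illustrative finite probability space (the die and its weights are an assumption, not from the text):

```python
from fractions import Fraction

# A finite sample space with a probability measure given by point masses.
omega = [1, 2, 3, 4, 5, 6]
pmass = {w: Fraction(1, 6) for w in omega}

def Pr(event):
    """Probability of an event A ⊂ Ω under the measure induced by the point masses."""
    return sum(pmass[w] for w in event)

# A random variable X: Ω → {"even", "odd"}, with Pr(X = x) := Pr({ω : X(ω) = x}).
def X(w):
    return "even" if w % 2 == 0 else "odd"

def Pr_X(x):
    return Pr([w for w in omega if X(w) == x])

# The Kolmogorov axioms hold on this finite space:
assert Pr([]) == 0 and Pr(omega) == 1          # K2
A, B = [1, 2], [3]                             # two disjoint events
assert Pr(A + B) == Pr(A) + Pr(B)              # K3 (finite additivity)
print(Pr_X("even"))  # 1/2
```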
2.2 Models of Autonomous Systems
If one wants to characterize the way an autonomous system behaves, it is necessary to
fully describe the rules governing its potential I/O stream. The mathematical descrip-
tion of an autonomous system’s behavior can be done at several levels, ranging from
very detailed physical descriptions (e.g. in the case of a robot, it would require specify-
ing its gears, distribution of current, sensors and effectors, etc.) to abstract statistical
descriptions. This thesis deals exclusively with statistical descriptions. In particular,
during the course of this thesis, two levels of description will be discussed: namely, the
behavioral level and the belief level.
1. Behavioral Level. The behavioral level specifies the actual I/O behavior, i.e. the
autonomous system’s probabilities of making observations and issuing actions.
These probabilities have to be specified for every possible information state
of the system, meaning that they must characterize the system’s I/O statistics
for every possible sequence of past I/O symbols. In Chapter 6, we will link this
description with the amount of resources (measured in thermodynamic work) that
the autonomous system has to spend during its interactions. The behavioral level
will be introduced in two parts: first, by specifying the output model and then
by completing it with an input model.
(a) Output Model. The output model specifies how the autonomous system gen-
erates its outputs given the past I/O symbols. This is the minimal statistical
description of an autonomous system’s behavior. However, it does not ex-
plain the purpose of the system, i.e. it does not explain what it tries to
accomplish. This model will be introduced in this chapter.
(b) Input Model. The autonomous system predicts its input stream using the
input model. This prediction model represents the assumptions that the
system makes about its environment. We will see in Chapter 3 that these as-
sumptions are necessary in order to formalize the purpose of the autonomous
system. Furthermore, we will argue in Chapter 6 that this model is also
necessary in order to characterize the thermodynamical work spent in inter-
actions. The input model is introduced in the next chapter.
2. Belief Level. In this thesis we are mainly interested in modeling adaptive au-
tonomous agents. However, specifying adaptive agents by directly writing down
their behavioral models is difficult. To simplify the description of adaptive agents,
one first starts out specifying a belief model, which is a high-level auxiliary model
characterizing the beliefs, assumptions and uncertainties of an agent. Subse-
quently, one simply derives the low-level behavioral model from the belief model.
This is similar to writing programs in a high-level programming language that
is then compiled into low-level machine code. As in the case of the behavioral
level, this belief level too will be introduced in two parts: first, by introducing
the Bayesian input model and then by introducing the Bayesian output model.
(a) Bayesian Input Model. When the autonomous system does not know its
environment, the designer can use a Bayesian approach to model this
uncertainty. This Bayesian input model is currently the standard in the
adaptive control literature. It is introduced in Chapter 4.
(b) Bayesian Output Model. In addition to modeling the uncertainty an au-
tonomous system has over its environment, a designer can also model the
uncertainty the system has over its own policy. This leads to a Bayesian
output model that is analogous to the Bayesian input model. However, we
will see that this generalization is not straightforward, as it will require a
careful distinction between the technical treatment of actions and observations
arising due to causal constraints. This model is an original contribution
of this thesis, and it will be introduced in Chapter 8, Part II.
A schematic illustration of this setup is given in Figure 2.1. Other aspects that are
important in the characterization of autonomous systems are the following:
Output Model Input Model
Bayesian Input ModelBayesian Output Model
I/O Model
Bayesian I/O Model
(policy) (environment)
(uncertain environment)(uncertain policy)
Behavioral Level
Belief Level
Figure 2.1: Behavioral Models. Dependencies are indicated by arrows.
1. I/O Domains: The choice of the cardinality, geometry and topology of the I/O
domain can have a significant impact on the description, the implementation and
the performance of an autonomous system. In navigation devices for instance,
the usage of continuous I/O sets is vital for its robustness. Furthermore, choosing
and modeling continuous I/O sets appropriately can be a very challenging task.
However, in this thesis we will exclusively deal with finite I/O sets, because
arguably the core problems of the design of autonomous agents are already present
even in this simple setup.
2. Interaction Protocol: The interaction protocol specifies all the details concerning
the rules and the timing of the communication between two autonomous systems.
There are applications where de-synchronized interactions with variable timing
play a crucial role, like e.g. in moving artillery or in tennis. In this thesis it is
assumed that interactions occur in discrete time steps where autonomous systems
alternately take turns to generate a symbol. In many cases, this interaction
protocol is flexible enough to accommodate other regimes (e.g. simultaneous
interactions, discrete approximations of continuous time, etc.); nevertheless, it is
not clear whether the theory would change significantly under other interaction
protocols.
3. Multiple Autonomous Systems: This thesis deals exclusively with situations hav-
ing only two interacting autonomous systems. However, there are many situa-
tions, especially in economics, where one would prefer modeling the behavior of a
population of autonomous systems. If the population is significantly large, then
a more coarse-grained description of behavior could be beneficial: e.g. treating
entire sub-populations as if they were individual autonomous systems; or using a
completely different description modeling emergent properties. This view might
be especially relevant to the understanding of decentralized behavior for instance.
2.3 Output Model
In the following, an autonomous system’s behavior is formalized as a conditional
probability measure over sequences drawn from the I/O alphabets. We start with basic
definitions of interactions.
Definition 1 (Interactions) The possible I/O symbols are drawn from two finite
sets. Let O denote the set of observations and let A denote the set of actions.
The set Z := A × O is the set of interactions. The interaction string of length 0 is
denoted by ǫ. Let T ∈ N be the horizon, i.e. the maximum length of interaction strings.
Let Z⋄ := ∪_{t=0}^{T} Z^t denote the set of interaction strings up to length T. We also
underline symbols to glue them together, as for example in ao_{≤2} = a_1 o_1 a_2 o_2. □
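To make interaction strings concrete, the following Python sketch builds a string in Z⋄ by alternately drawing actions and observations. The action and observation sets and both sampling rules are toy assumptions, chosen only for illustration:

```python
import random

random.seed(0)
T = 3                            # the horizon
A, O = ["L", "R"], ["0", "1"]    # toy action and observation sets (illustrative)

def agent_P(history):
    """A toy output model P(a_t | ao_<t): here simply a uniform random policy."""
    return random.choice(A)

def env_Q(history, action):
    """A toy environment conditional Q(o_t | ao_<t, a_t)."""
    return "1" if action == "R" else random.choice(O)

history = []                     # the interaction string ao_<t as a list of (a, o) pairs
for t in range(1, T + 1):
    a = agent_P(history)         # cycle t: the agent emits a_t given ao_<t
    o = env_Q(history, a)        # then the environment emits o_t given ao_<t and a_t
    history.append((a, o))

print(history)                   # an interaction string in Z^T
```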
We assume that there are two autonomous systems P and Q. By convention,
we assume that P is the agent, i.e. the autonomous system to be constructed by the
designer, and that Q is the environment, i.e. the autonomous system to be controlled
by the agent.
Agent and environment operate obeying the following interaction protocol. The
interaction proceeds in cycles t = 1, 2, . . . , T. In cycle t, the agent P generates an
action a_t conditioned on the past I/O symbols ao_{<t} with probability P(a_t|ao_{<t}).

Conditional beliefs are defined by

B(A|B) := B(A ∩ B|Ω) / B(B|Ω)

whenever B(B|Ω) > 0. It is easy to see that this formula generalizes correctly to the
border cases, since B(A|B) = 0/B(B|Ω) = 0 when A ∩ B = ∅, and B(A|B) =
B(B|Ω)/B(B|Ω) = 1 when B ⊂ A. Noting that B = B ∩ Ω and rearranging terms,
one gets

B(A ∩ B|Ω) = B(B|Ω) B(A|B ∩ Ω).
We demand this relation to hold under any restriction to a “universal” set C ∈ F , not
only when it is restricted to Ω. Thus, replacing Ω by C one obtains
B(A ∩B|C) = B(B|C)B(A|B ∩ C),
which is known as the product rule for beliefs. Following a similar reasoning, we
impose that for any event A ∈ F, the degrees of belief in A and in its complement
A^c must sum to one under any condition B, i.e.

B(A|B) + B(A^c|B) = 1,
which is known as the sum rule for beliefs. In summary, we impose the following
axioms for beliefs (Jaynes and Bretthorst, 2003).
Definition 16 (Belief axioms) Let Ω be a set of outcomes and let F be an algebra
over Ω. A set function B over F × F is a belief function iff
B1. for all A, B ∈ F, B(A|B) ∈ [0, 1];
B2. for all A, B ∈ F, B(A|B) = 1 if B ⊂ A;
B3. for all A, B ∈ F, B(A|B) = 0 if A ∩ B = ∅;
B4. for all A, B ∈ F, B(A|B) + B(A^c|B) = 1;
B5. for all A, B, C ∈ F, B(A ∩ B|C) = B(A|C) B(B|A ∩ C). □
Furthermore, define the shorthand B(A) := B(A|Ω). Axiom B1 states that degrees of
belief are real values in the unit interval [0, 1]. Axioms B2 and B3 equate the belief and
the truth function under certainty. Axioms B4 and B5 are the structural requirements
under uncertainty discussed above. Accordingly, one defines a belief space as follows.
Definition 17 (Belief Space) A belief space is a tuple (Ω,F,B) where: Ω is a set
of outcomes, F is an algebra over Ω and B : F × F → [0, 1] is a belief function. □
The intuitive meaning of a belief space is analogous to a truth space. Nature arbi-
trarily selects an outcome ω ∈ Ω. Subsequently, the reasoner performs a measurement:
he chooses a set B and nature reveals to him whether ω ∈ B or not. Accordingly, the
reasoner infers the degree of belief in any event A ∈ F by evaluating either B(A|B) (if
ω ∈ B) or B(A|Bc) (if ω /∈ B).
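On a finite space, conditioning a probability measure via B(A|B) := Pr(A ∩ B)/Pr(B) yields a belief function on all conditions of positive mass. The following Python sketch checks axioms B1–B5 exhaustively for a small uniform measure; the sample space and the measure are illustrative assumptions:

```python
from fractions import Fraction
from itertools import combinations

# A uniform probability measure on a small sample space (an illustrative choice).
omega = frozenset({1, 2, 3, 4})

def Pr(A):
    return Fraction(len(A), len(omega))

def Bel(A, B):
    """B(A|B) := Pr(A ∩ B)/Pr(B), defined here only when Pr(B) > 0."""
    return Pr(A & B) / Pr(B)

events = [frozenset(s) for n in range(len(omega) + 1)
          for s in combinations(sorted(omega), n)]
nonnull = [B for B in events if B]   # events of positive probability mass

for A in events:
    for B in nonnull:
        b = Bel(A, B)
        assert 0 <= b <= 1                          # B1
        if B <= A:
            assert b == 1                           # B2
        if not (A & B):
            assert b == 0                           # B3
        assert b + Bel(omega - A, B) == 1           # B4
        for C in nonnull:
            if A & C:                               # Pr(A ∩ C) > 0, so both factors exist
                assert Bel(A & B, C) == Bel(A, C) * Bel(B, A & C)   # B5
print("axioms B1-B5 verified on all events of positive mass")
```

The restriction to conditions of positive mass is deliberate: B2 and B3 constrain B(A|B) even when Pr(B) = 0, a case the ratio definition leaves open.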
Remark 9 The word “subsequently”, which has now been emphasized for the second
time, is crucial. When the reasoner performs his measurements, the outcome is already
determined. In Chapter 8, we will relax this assumption by allowing outcomes that are
only partially determined or jointly determined by the reasoner him-/herself. □
Remark 10 Note that logical truth is not the same as probabilistic truth. That is,

T(A|B) = 1 ⇒ B(A|B) = 1,
but B(A|B) = 1 ⇒ T(A|B) ∈ {?, 1}.

In particular, the case B(A|B) = 1 and T(A|B) = ? occurs when B is chosen such that
the set B \ A is non-empty but has no probability mass relative to B. In other words,
if B(B \ A|B) := 0, then B(A ∩ B|B) = B(A|B) = 1, even though B ⊄ A. □
An easy but fundamental result is that the axioms of belief are equivalent to the
axioms of probability3 (Jaynes and Bretthorst, 2003). This simple observation is what
constitutes the foundation of Bayesian probability theory.
4.1.3 Bayes’ Rule
We now return to the central topic of this chapter. Suppose the reasoner has uncer-
tainty over a set of competing hypotheses about the world. Subsequently, he makes an
observation. He can use this observation to update his beliefs about the hypotheses.
The following theorem explains how to carry out this update.
Theorem 2 (Bayes’ Rule) Let (Ω,F,B) be a belief space. Let {H1, . . . ,HN} be a
partition of Ω, and let D ∈ F be an event such that B(D) > 0. Then, for all n ∈
{1, . . . , N},

B(Hn|D) = B(D|Hn)B(Hn) / B(D) = B(D|Hn)B(Hn) / [∑_m B(D|Hm)B(Hm)].

This is known as Bayes’ rule. □
The interpretation is as follows. The H1, . . . ,HN represent N mutually exclusive
hypotheses, and the event D represents a new observation or data. Initially, the
reasoner holds a prior belief B(Hn) over each hypothesis Hn. Subsequently, he in-
corporates the observation of the event D and arrives at a posterior belief B(Hn|D)
over each hypothesis Hn. Bayes’ rule states that this update can be seen as com-
bining the prior belief B(Hn) with the likelihood B(D|Hn) of observation D under
hypothesis Hn. The denominator ∑_m B(D|Hm)B(Hm) = B(D) just plays the role of
a normalizing constant (Figure 4.5).
Figure 4.5: Schematic Representation of Bayes’ Rule. The prior belief in hypotheses H1, H2
and H3 is roughly uniform. After conditioning on the observation D, the belief in hypothesis H3
increases significantly.
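A small numerical instance of Theorem 2 mirrors the situation in Figure 4.5; the priors and likelihoods below are invented for illustration:

```python
from fractions import Fraction as F

# Three mutually exclusive hypotheses with a uniform prior, and the likelihood
# each assigns to the observation D (the numbers are illustrative).
prior      = {"H1": F(1, 3), "H2": F(1, 3), "H3": F(1, 3)}
likelihood = {"H1": F(1, 10), "H2": F(2, 10), "H3": F(7, 10)}   # B(D | Hn)

def bayes(prior, likelihood):
    evidence = sum(likelihood[h] * prior[h] for h in prior)     # B(D), the normalizer
    return {h: likelihood[h] * prior[h] / evidence for h in prior}

posterior = bayes(prior, likelihood)
print(posterior["H3"])   # 7/10 -- belief concentrates on H3
```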
3 More precisely, the axioms of beliefs as stated here imply the axioms of probability for finitely
additive measures over finite algebras. Furthermore, the axioms of beliefs also specify a unique version
of the conditional probability measure.
Bayes’ rule naturally applies to a sequential setting. Incorporating a new observation
Dt after having observed D1, D2, . . . , Dt−1 updates the beliefs as

B_{t+1}(Hn) := B(Hn|D1 ∩ · · · ∩ Dt) = Bt(Dt|Hn) Bt(Hn) / [∑_m Bt(Dt|Hm) Bt(Hm)],

where for the t-th update,

Bt(Hn) := B(Hn|D1 ∩ · · · ∩ Dt−1) and Bt(Dt|Hn) := B(Dt|Hn ∩ D1 ∩ · · · ∩ Dt−1)

play the role of the prior belief and the likelihood respectively. Note that

B(D1 ∩ · · · ∩ Dt|Hn) = ∏_{τ=1}^{t} B(Dτ |Hn ∩ D1 ∩ · · · ∩ Dτ−1),

and hence each hypothesis Hn naturally determines a probability measure B(·|Hn) over
sequences of observations.
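The sequential scheme can be checked against a single batch update in a few lines of Python. The i.i.d. coin likelihoods are an illustrative assumption; the text allows each likelihood Bt(Dt|Hn) to depend on the whole past:

```python
from fractions import Fraction as F

# Each hypothesis determines a measure over observation sequences; here we
# take i.i.d. coin biases for simplicity (an illustrative assumption).
bias  = {"H1": F(1, 4), "H2": F(1, 2), "H3": F(3, 4)}   # B(heads | Hn)
prior = {h: F(1, 3) for h in bias}
data  = [1, 1, 0, 1]                                    # 1 = heads, 0 = tails

def lik(h, d):
    return bias[h] if d == 1 else 1 - bias[h]

# Sequential updating: the posterior after D1,...,Dt serves as the next prior.
belief = dict(prior)
for d in data:
    z = sum(lik(h, d) * belief[h] for h in belief)
    belief = {h: lik(h, d) * belief[h] / z for h in belief}

# Batch updating: condition once on D1 ∩ ... ∩ Dt.
joint = dict(prior)
for d in data:
    joint = {h: joint[h] * lik(h, d) for h in joint}
z = sum(joint.values())
batch = {h: joint[h] / z for h in joint}

assert belief == batch   # both routes yield the same posterior
print(belief["H3"])      # 27/46
```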
Figure 4.6: Progressive refinement of the accuracy of the joint observation. The sequence of
observations D1, . . . , D5 leads to refinements S1, S2, . . . , S5, where St = D1 ∩ · · · ∩ Dt. Note
that S5 ⊂ H1 and therefore B(H1|S5) = 1, while B(H2|S5) = B(H3|S5) = 0.
A smaller event D corresponds to a more “accurate” observation. Hence, making a
new observation D′ necessarily improves the accuracy, since
D ⊃ D ∩D′.
In some cases, the accuracy of an observation (or sequence of observations) can be so
high that it uniquely identifies a hypothesis (Figure 4.6).
The way Bayes’ rule operates can be illustrated as follows. Consider a partition
{X1, . . . ,XK} of Ω and let H∗ ∈ {H1, . . . ,HN} be the true hypothesis, i.e. the outcome
ω ∈ Ω is drawn obeying propensities described by B(·|H∗). The Xk represent different
observations the reasoner can make. If ω is drawn by Nature and reported to be in
Xk (without revealing ω itself), then the log-posterior probability of hypothesis Hn is
given by
log B(Hn|Xk) = log B(Xk|Hn) + log B(Hn) − log B(Xk) =: ln + pn − c.
This decomposition highlights all the relevant terms for understanding Bayesian learning. The term ln is the log-likelihood of the data Xk. The term pn is the log-prior of
hypothesis Hn, which is a way of representing the relative confidence in hypothesis Hn
prior to seeing the data. In practice, it can also be interpreted as (a) a complexity term,
(b) the log-posterior resulting from “previous” inference steps, or (c) an initialization
term for the inference procedure. The term c is the log-probability of the data, which
is constant over the hypotheses, and thus does not affect our analysis. Hence, log-
posteriors are compared by their differences in ln+pn. Ideally, the log-posterior should
be maximum for the true hypothesis Hn = H∗. However, since ω is chosen randomly, the observation Xk is random, and hence the log-posterior log B(Hn|Xk) is a random quantity too. If the variance of the log-posterior is high enough, then a particular realization of the data can lead to a log-posterior favoring some “wrong” hypothesis over the true one, i.e. ln + pn > l∗ + p∗ for some Hn ≠ H∗. In general, this is an unavoidable problem that necessarily haunts every statistical inference method.
Further insight can be gained by analyzing the expected log-posterior:

    Σk B(Xk|H∗) log B(Xk|Hn) + log B(Hn) − Σk B(Xk|H∗) log B(Xk) =: Ln + Pn − C,

where Pn = pn.
This reveals⁴ that, on average, the log-likelihood Ln is indeed maximized by Hn = H∗.
Hence, the posterior belief will, on average, concentrate its mass on the hypotheses
having high Ln + Pn.
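A small numerical check of this claim, with made-up values for B(Xk|Hn) over a three-cell partition:

```python
import math

# Toy check that the expected log-likelihood L_n is maximized by the
# true hypothesis (Gibbs' inequality). The conditional probabilities
# B(X_k|H_n) below are invented for the example.
hypotheses = {
    "H1": [0.7, 0.2, 0.1],
    "H2": [0.3, 0.4, 0.3],
    "H3": [0.1, 0.1, 0.8],
}
true_h = "H2"
p_star = hypotheses[true_h]                     # B(X_k | H*)

def expected_loglik(q):
    # L_n = Σ_k B(X_k|H*) log B(X_k|H_n)
    return sum(p * math.log(q_k) for p, q_k in zip(p_star, q))

L = {h: expected_loglik(q) for h, q in hypotheses.items()}
assert max(L, key=L.get) == true_h
```
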
4.2 Adaptive Optimal Control
The previous section introduced the conceptual framework of Bayesian probability theory to model reasoning under uncertainty. The aim of this section is to apply this framework to model adaptive autonomous systems. We first introduce a model to represent the uncertainty an agent has over its environment. Then, we show that this
model of uncertainty also allows deriving a predictor over the input stream, which will
be used as the input model of the agent. Finally, we show a convergence result for this
predictive input model.
4.2.1 Bayesian Input Model
One can exploit the Bayesian interpretation of probabilities to construct adaptive autonomous systems. The Bayesian framework can be used to specify a belief model of
⁴For probabilities pi, qi, the sum Σi pi log qi is maximized when qi = pi for fixed pi (Gibbs’ inequality).
agents having uncertainty about their environments. From this belief model, one can
derive an adaptive predictor for the I/O model introduced in Chapter 3.
Definition 18 (Bayesian Input Model) Let Θ be a finite set and let Z^T be the set of interaction strings. A Bayesian input model of an agent is a set of conditional probabilities

    P(θ) and P(ot | θ, ao<tat).

Each hypothesis θ is dominated by the Bayesian mixture, in the sense that

    P(ao≤T) ≥ α P(ao≤T | θ) for some constant α > 0. (4.6)
(In particular, in the Bayesian input model α = P(θ).) The dominance factor α can be arbitrarily small as long as it stays constant. Hence, the upper bound −ln α is valid even in the limit T → ∞. Since S contains an infinite sum of positive terms, one concludes:
Corollary 1 Under the same conditions as in Theorem 3, one has

    P(ot | ao<tat) − P(ot | θ, ao<tat) → 0 as t → ∞.

The isothermal work required to compress an ideal gas of N moles from volume V to volume V′ is given by

    W = −NRT ln(V′/V), (6.2)

where R > 0 is the gas constant, and T ≥ 0 is the absolute temperature. The minus sign is just a convention to denote work done by the piston rather than by the gas. The interpretation of resources that we want to put forward here is analogous to the physical concept of work.
One can postulate a formal correspondence between one unit of information and
one unit of work (Feynman, 2000). Consider representing one bit of information using
6.2 Resources as Information
one of the following logical devices: a molecule that can be located either on the top or
the bottom part of a box; a coin whose face-up side can be either head or tail; a door
that can be either open or closed; a train that can be orientated facing either north or
south; and so forth. Assume that all these devices are initialized in an undetermined
logical state, where the first state has probability p and the second probability 1 − p.
Now, imagine you want to set these devices to their first logical state. In the case of the
molecule in a box, this means the following. Initially, the molecule is uniformly moving
around within a space confined by two pistons as depicted in Figure 6.5a. Assuming
that the initial volume is V , the molecule has to be pushed by the lower piston into the
upper part of the box having volume V ′ = pV (Figure 6.5b). From information theory,
we know that the number of bits that we fix by this operation is given by
− log p. (6.3)
Using the formula in (6.2), one gets that the amount of work done by the piston is given by

    W = RT ln(V/V′) = RT ln(V/(pV)) = −RT ln p = −(RT/log e) log p = −γmol log p,

where we have assumed N = 1 and where the constant γmol := RT/log e > 0 can be interpreted as the conversion factor between one unit of information and one unit of work for the molecule-in-a-box device.
Figure 6.5: The Molecule-In-A-Box Device. (a) Initially, the molecule moves freely within a space of volume V delimited by two pistons. The compartments A and B correspond to the two logical states of the device. (b) Then, the lower piston pushes the molecule into part A having volume V′ = pV.
How do we compute the information and work for the case of the coin, door and
train devices? The important observation is that we can model these cases as if they
were like molecule-in-a-box devices, with the difference that their conversion factors
between units of information and units of work are different. Hence, the number of bits
fixed while these devices are set to the first state is given by
− log p,
6. RESOURCES
i.e. exactly as in the case of the molecule. However, the work is given by

    −γcoin log p, −γdoor log p, and −γtrain log p

respectively, where γcoin, γdoor and γtrain are the associated conversion factors between units of information and units of work. Obviously,

    γmol ≤ γcoin ≤ γdoor ≤ γtrain.
The point is that information is proportional to work. In other words, the amount of bits
required to communicate a choice is proportional to the amount of thermodynamical
work that has to be carried out on the recording device of the receiver.
One can easily envisage devices that generalize this idea to multiple states. Consider
for instance the device shown in Figure 6.6a corresponding to the codeword lengths
shown in Figure 6.4a. Here, four pistons control the probability of the molecule being
in the parts of the space labeled as {a, b, c, d}. A choice is ruled out by pushing its
piston downwards until it hits the wall. To rule out option ‘a’ as we did in the example
of Figure 6.4, we push the first piston to the end in order to obtain the configuration
shown in Figure 6.6b, which corresponds to codeword lengths of Figure 6.4b. Since this
change reduces the total volume by half, the thermodynamical work is proportional to
1 bit.
Figure 6.6: A generalized molecule-in-a-box device representing the partial choice of Figure 6.4. A molecule can move freely within a volume delimited by four pistons. The box has four sections labeled as ‘a’, ‘b’, ‘c’ and ‘d’, corresponding to four possible states of the device. Panels (a) and (b) show the initial and final configurations, corresponding to ruling out option ‘a’.
In general, this calculation is done using the relative entropy. Let v(u) and v′(u) denote the volume of choice u before and after the compression respectively, and let Pr(u) and Pr′(u) denote their associated probabilities. These probabilities are given by

    Pr(u) = v(u)/V and Pr′(u) = v′(u)/V′.
Substituting the probabilities into the relative entropy, one obtains

    Σu Pr′(u) log ( Pr′(u) / Pr(u) ) = Σu (v′(u)/V′) log ( v′(u) / v(u) ) + log ( V / V′ ).

Using the quantities of the example (where v′(u) = v(u) for every option that is not ruled out, so that each of these log terms vanishes), one obtains

    0 · log 0 + 1/8 · log 1 + 1/16 · log 1 + 1/16 · log 1 + log 2 = 1 bit,

as expected.
as expected. Note that this calculation only works if the volume stays equal or is
compressed, but not expanded.
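This one-bit cost can be checked numerically. The initial probabilities below are assumptions for the sketch: option ‘a’ is taken to occupy half the volume (so that ruling it out halves V), with b, c and d sharing the rest; the actual values of Figure 6.4 may differ.

```python
import math

# Verifying the 1-bit cost of ruling out option 'a'.
pr_before = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}   # Pr(u)  = v(u)/V
pr_after  = {"a": 0.0, "b": 1/2, "c": 1/4, "d": 1/4}   # Pr'(u) = v'(u)/V'

def rel_entropy_bits(p_new, p_old):
    # Σ_u Pr'(u) log2(Pr'(u)/Pr(u)), with the convention 0·log 0 = 0.
    return sum(q * math.log2(q / p_old[u])
               for u, q in p_new.items() if q > 0)

cost = rel_entropy_bits(pr_after, pr_before)   # → 1.0 bit
```
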
6.2.2 Computational Interpretation
Remark 18 This subsection is of a speculative nature. □
Usually, the efficiency of an algorithm is assessed in terms of a relevant computational resource. There are many possible computational resources, but the most important ones are the computation time and the computation space, i.e. the maximum number of time steps and the maximum number of cells that a Turing machine uses in order to compute the output of a function for any input. Typically, the goal of the designer is to construct an algorithm that minimizes either time or space.
However, these two goals seem to be somewhat incompatible. First, concentrating on a single resource does not represent all the issues involved in solving a problem. Also, there are numerous cases where reducing the computation time increases the computation space and vice versa. Indeed, the works of Borodin and Cook (1980), Beame (1989), Beame, Jayram, and Saks (2001) and Beame, Saks, Sun, and Vee (2003) derive tradeoff lower bounds for various problems such as sorting, finding unique elements in a set and solving randomized decision problems. Their findings show that these lower bounds can be stated in terms of minimum time-space products. For example, in Borodin and Cook (1980), an optimal Ω(n²/log n) lower bound is shown for the time-space product of sorting n integers in the range [1, n²]. Although these findings are not conclusive,
they seem to suggest that the time-space product might be a more general measure of
computational resources, i.e. one that captures the intuitive notion of the difficulty of
a computation (Figure 6.7). If one assumes that the time-space product is a suitable
measure of computational resources, then one is led to ask about the meaning of this
quantity.
Fortunately, one can give this product an information-theoretic meaning. We closely
follow the presentation of Savage (1998). Assume we want to build a device that
computes a function f mapping X into Y, where both X and Y are finite. To make our
discussion concrete, suppose we want to implement this device using a logic circuit,
i.e. a collection of interconnected logical gates computing elementary boolean functions.
It is sufficient to restrict ourselves to binary AND, binary OR and unary NOT gates,
since it can be shown that every boolean function can be implemented by them. An
example circuit is shown in Figure 6.8. The complexity of a function is measured by
Figure 6.7: Time-Space Tradeoff. Studies suggest that there is a tradeoff between time and space that can be characterized in terms of a constant that acts as a lower bound on the time-space product of a problem. In the plot, the time-space curves of two problems C1 and C2 are shown. In this case, C2 is more difficult than C1 because the time-space product of C2 is greater than the one of C1, which can be seen by comparing their respective time-space rectangular areas.
counting the number of logic gates it has.³ A function requiring more logic gates is considered to be more complex.
Often we can also implement f using a sequential processing machine that uses fewer logic gates. Such a machine is illustrated in Figure 6.9a. A sequential processing machine is capable of simulating computation models having a finite
number of configurations, such as space-bounded Turing machines or space-bounded
random access machines (Savage, 1998). As shown in the figure, a sequential processing machine is a logic circuit φ communicating with an external storage device (or
“memory”) M . The computation proceeds in T discrete steps, where in step t, the
logic circuit takes the binary input xt and the current state qt and generates the binary
output yt and the next state qt+1. This is done with the help of M , which stores the
state qt+1 generated at step t until it is released in step t + 1. Here, the input string
x1:T encodes the input and the output string y1:T encodes the output.
The computation of a sequential processing machine can be rephrased as a communication problem. This is easily seen by “unwinding” the computation as shown in
Figure 6.9, obtaining a concatenation of T times the logic circuit φ. This construction
transforms the computation into a communication problem where the channel is given
by the logic circuit φ, that is, φ can be regarded as a noisy channel transforming (xt, qt)
into (yt, qt+1). If the state qt is represented by S bits, then the total number of bits
³This measure is known as the circuit complexity of a function.
Figure 6.8: A logic circuit implementing the XOR function.
Figure 6.9: (a) A sequential processing machine consists of a logic circuit φ and an external storage device M. In each step t, φ takes the t-th input xt and state qt (stored in the external memory) and computes the t-th output yt and the (t+1)-th state qt+1. (b) The computation of the sequential processing machine can be “unwound”, thereby constructing a large logic circuit consisting of T concatenations of φ that computes the same function as before.
transmitted is given by the product
T · (S + 1),
where it is seen that (S + 1) is a measure of the maximum capacity that the circuit φ,
seen as a communication channel, can achieve. Note that T and S correspond to the
time and space complexity of the function f . Let C(f) and C(φ) denote the minimum
number of logic gates needed to implement the functions f and φ respectively. Then,
C(f) ≤ T · C(φ) = κ · T · S,
because C(φ) = κS for some positive κ and because the implementation of f as a sequential processing machine cannot use fewer logic gates than the direct implementation
of f . Hence, the order of the time-space product is lower-bounded by the circuit
complexity of f . In some sense, it seems that:
    Compute f(x)  (Computer Science)  ⇐⇒  Communicate x and f  (Information Theory)
although the author is unaware of any proof of this claim.
The point is that, if we are willing to accept the time-space product as an appropriate measure of the computational complexity, then the computational resources can be related to information-theoretic resources—and as such, they are governed by information-theoretic principles. In other words, the computational resources correspond to the amount of bits that have to be transmitted to communicate a “computation”. While this interpretation is intuitively appealing, there are still many open questions left. For instance, it is unclear what “computation” means in this sense, and it is also unclear how much computation is necessary in order to compute a given function (Sipser, 1996; Papadimitriou, 1993).
6.3 Resource Costs in Agents
In the previous section, we have seen that resources can be formalized in information-
theoretic terms, and that this formalization can be given a thermodynamic and a
computational interpretation. This can be summarized as follows. Given a set U
of options, a receiver that does not know beforehand what the choice will be has to
allocate resource costs (codeword lengths) that implicitly specify his beliefs Pr(u) about
the realization of the choice u ∈ U . Furthermore, when the receiver makes a (possibly
stochastic and/or partial) choice represented by a probability distribution Pr′(u) over
U , then the amount of bits spent by the receiver in order to record this choice is given
by the relative entropy

    Σu Pr′(u) log ( Pr′(u) / Pr(u) ),
which is a generalization of the information content from deterministic choices to probabilistic choices. We have argued that this quantity is proportional to the amount of
thermodynamic work that the receiver has to carry out in order to record the choice.
Furthermore, we have sketched a relation between the amount of bits and the required
time-space product of the associated computation.
We argue that this way of thinking has many advantages: it greatly simplifies the analysis of resource costs, because we only have to deal with changes in probability distributions. This allows us to abstract away from the algorithmic details in order to reason about computational costs. The objective of this section is to explain how this formalization can be used to calculate the resource costs of running and constructing an agent.
6.3.1 Cost of Interaction
We have formalized autonomous systems as probability distributions over interaction
sequences. According to the information-theoretic arguments presented in this chapter,
this means that we are implicitly assigning resource costs for interactions.
This makes sense from an intuitive point of view. The implementation (or embodiment) of an agent facilitates some interactions while it hampers others. For instance, biologists can infer a great deal about the habits of a species by studying its anatomy. The rationale behind this is that animals manifest energy-efficient behavior more frequently than energy-intensive behavior. This kind of reasoning acquires an extreme form in paleontology, where behavior is mainly inferred from fossilized animals. Conversely,
in engineering, systems are designed such that they minimize the resource costs of frequent or desirable operations and uses. This is visible in the designs of cars, aeroplanes, buildings, algorithms, advertising campaigns, etc.
From an information-theoretic point of view, an agent interacting with an environment is communicating an interaction sequence. Whenever the agent interacts with its environment (either producing an output or reading an input), its “physical state” or “internal configuration” changes as a necessary consequence of the interaction—simply because the two instants are distinguishable. In other words, if the instants before and after the interaction cannot be told apart, then we are forced to conclude that they are empirically the same. This change in “physical state” or “internal configuration”
can take place in many possible ways: for instance, by a chemical reaction; by updating the internal memory; by consulting a random number generator; by moving to
another location; or even by simply advancing in time (i.e. by changing the value of the
time-coordinate). Hence, in this context, the semantics of “physical state” or “internal
configuration” corresponds to an abstract information state: it is a description that
exhaustively characterizes a situation of an agent. We call such a description a state
of the agent.
To make this notion of states concrete, we introduce the following model. We
assume that a change in state occurs whenever the agent either issues an action or
reads an input. We start out from a blank binary tape that we will use to record the
agent’s experience. Then, we iteratively append a new binary string that encodes the
new input or output symbol experienced by the agent (Figure 6.10). In this model, the
appended binary strings are proxies for the changes in state that the agent experiences
Figure 6.10: State Model. The agent has an I/O domain given by A := O := {1, 2, . . . , 10}. So far, the agent has experienced four I/O symbols (Panel a). The state of the agent is constructed by iteratively encoding the four I/O symbols into binary strings (Panel b).
during its interactions with the environment. In this way, we abstract away from the
inner workings of the agent by simply representing every change by a binary string.
Note that this scheme does not allow an agent to return to a previous state, hence the agent cannot “jump back in time”. Additionally, we want the content of the binary tape to be uniquely decodable, so that we can recover the whole I/O history the agent has experienced so far at any given time by decoding the content of the tape.
This model highlights the correspondence between resources, codeword lengths and
probabilities. Therefore, the behavior of the agent can be thought of as a reflection of
the underlying resource costs.
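A minimal sketch of this tape model, using a fixed-length 4-bit code for concreteness (the code in Figure 6.10 may differ); the symbol sequence 6, 3, 8, 3 is an illustrative I/O history:

```python
# Every I/O symbol from A = O = {1, ..., 10} is appended to the tape as
# a fixed-length 4-bit codeword, which is trivially uniquely decodable.

def append_symbol(tape, symbol):
    return tape + format(symbol - 1, "04b")    # 4 bits cover 10 symbols

def decode_tape(tape):
    return [int(tape[i:i + 4], 2) + 1 for i in range(0, len(tape), 4)]

tape = ""
for s in [6, 3, 8, 3]:                # I/O symbols experienced so far
    tape = append_symbol(tape, s)     # each interaction changes the state

assert decode_tape(tape) == [6, 3, 8, 3]   # the whole history is recoverable
assert len(tape) == 16                     # resource cost: 4 bits per interaction
```
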
6.3.2 Costs of Construction
Remark 19 This subsection is of a speculative nature. □
Designing and constructing an agent has a cost. Whether we are solving optimality equations, running a search algorithm, or assembling mechatronic parts, we are always spending resources during the conception of an agent. These
resource costs can be thought of as arising from the change of the distribution over the
possible agents, where the cost of this change is given by the relative entropy.
Let Θ denote the index set parameterizing the possible agents. Furthermore, let
Pr(θ) denote the belief we have about θ being the optimal parameter. Consider the
relative entropy

    ρ = Σθ Pr′(θ) log ( Pr′(θ) / Pr(θ) ), (6.4)
where Pr′(θ) is the (partial) choice made by the sender. This quantity correctly measures the number of bits we are going to receive over a noiseless channel from a sender that picks out the right θ.
However, there is a caveat: the previous calculation represents a situation where
we passively receive the optimal answer, which is not possible unless the sender is an
“oracle” who guesses the first ρ bits of the optimal θ. To correctly calculate the required
number of bits in this communication problem, we have to account for the fact that
we do not know the sender. Not knowing the sender has important implications, since
this uncertainty might lead to significant resource costs that we are not accounting for.
Unfortunately, the author is not aware of any widely accepted way of dealing with this
situation. However, one can speculate that the amount of information is 2^O(ρ), that is, exponential in the amount of information needed when the sender is known.
Roughly speaking, the justification for this intuition is based on results from computational complexity, where a non-deterministic machine can be simulated by a deterministic machine, but only at an exponential cost (Sipser, 1996; Papadimitriou, 1993). In the next chapter, this claim is made precise for a special case.
6.4 Historical Remarks & References
The fundamental results of information theory were almost entirely developed in the paper by
Shannon (1948). In particular, Shannon’s paper derives a quantitative measure of information
based on three desiderata. Surprisingly, the resulting formula for information turned out to have
the same mathematical shape as the formula for thermodynamic entropy discovered empirically
by Boltzmann (in a simpler form) and Gibbs. Because of this, much of information theory
borrows mathematics from thermodynamics. Information theory has found a wide range of
applications in communication, coding, compression, statistical learning, dynamical systems,
and other areas (Khinchin, 1957; Ash, 1965; Kolmogorov, 1968; Gallager, 1968; Billingsley, 1978;
Cover and Thomas, 1991; Li and Vitányi, 2008; MacKay, 2003; Grünwald, 2007). The relation between information (more specifically, codeword lengths) and probability follows roughly the argument presented in Grünwald (2007). The Kraft-McMillan inequality was developed in two steps by Kraft (1949) and then by McMillan (1956).
The relative entropy

    Σu Pr′(u) log ( Pr′(u) / Pr(u) )
has been introduced by Kullback and Leibler (1951). The standard interpretation is as follows.
The relative entropy measures the expected number of extra bits required to code samples
from Pr′ when using a code based on Pr, rather than using a code based on Pr′. The idea
of the relative entropy as a generalized formula for the information content is the author’s
(non-standard) interpretation. While mathematically equivalent, conceptually the author’s
interpretation seems to suggest a “temporal directionality”, where Pr and Pr′ represent the
knowledge state of the receiver before and after receiving information.
The connections between information theory, thermodynamics and computational complexity presented in this chapter borrow ideas from various sources. The relation between units of energy and units of information has been originally put forward in the context of Maxwell’s
demon by various authors (Maxwell, 1867; Szilard, 1929; Brillouin, 1951, 1956; Gabor, 1964).
The first modern argument linking information theory with thermodynamics is due to Landauer
(1961). Since then, the ideas of the thermodynamics of computing have found wide acceptance
(Tribus and McIrvine, 1971; Bennett, 1973, 1982; Feynman, 2000; Li and Vitányi, 2008). The devices shown in figures 6.5 and 6.6 are the author’s contribution, and they differ from the devices in the literature in that they do not allow expanding the volume (i.e. erasing knowledge) but only compressing it (i.e. acquiring knowledge).
The relation between computational resources and information is an area under active research. Time-space tradeoffs have been investigated in Borodin and Cook (1980), Beame (1989),
Beame et al. (2001) and Beame et al. (2003) using a computational model called branching
programs. The time-space tradeoff related to circuit complexity is presented in Savage (1998).
The speculations relating the transmission of information to computational complexity are due
to the author, although related ideas have been put forward by Bremermann (1965).
The arguments linking information-theoretic resources to the cost of interaction and construction are an original contribution of the author. The concept of an oracle, while non-standard in the context of information theory, is commonplace in computational complexity. An oracle is a hypothetical device which is capable of answering decision problems in a single operation (Sipser, 1996; Papadimitriou, 1993), and oracles are widely used in cryptography to make arguments about the security of cryptographic protocols involving hash functions.
Chapter 7
Boundedness
If a designer had unlimited computational resources, then he could pay the cost of calculating the optimal policy for an autonomous agent. However, in Chapter 5 we have argued that the rigorous application of the maximum SEU principle is computationally too expensive to serve as a practical design principle. Therefore, most implementations make severe domain restrictions and/or approximations in order to significantly reduce the computational expenses. In other words, designers content themselves with suboptimal solutions to the original problem.
The designer’s preference for a suboptimal solution over the optimal one suggests that resources have an impact on the “perceived utility” of the solution.
One can argue that this “suboptimal solution” becomes the “optimal solution” when
resources are taken into account. Hence, intuitively there seems to be a common
“currency” for resources and utilities. What is this currency? If there were one, then
one could compute the optimal solution that trades off the benefits obtained from
maximizing the expected subjective utility and the resource costs of the calculation.
7.1 An Example of Boundedness
We start our discussion with a concrete example. We are given an array of N numbers (v1, v2, . . . , vN), and our task is to pick the largest one. We solve this problem by checking the numbers one by one in any order, always comparing the current value against the largest seen so far. It is easy to see that this algorithm takes O(N) time and O(log N) space if we assume that each comparison is done in a single computational step and that the indices are represented using ⌈log N⌉ bits. The time-space complexity of this algorithm is O(N log N).
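The linear-scan algorithm just described can be sketched as follows (the example array is illustrative):

```python
# Linear scan for the maximum: O(N) comparisons, O(log N) bits to store
# the index of the largest element seen so far.

def argmax_scan(values):
    best = 0
    for i in range(1, len(values)):
        if values[i] > values[best]:   # one comparison per step
            best = i
    return best

arr = [3, 7, 1, 10, 2, 9, 4, 8, 5, 6]  # an example shuffled array, N = 10
assert arr[argmax_scan(arr)] == 10
```
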
From an information-theoretic point of view, this algorithm transforms the knowledge state about the maximum. Figure 7.1 shows an example array having N = 10 elements with values in {1, 2, . . . , 10}. Initially, the algorithm does not know the location of the maximum: if we assume that each index is encoded with log 10 ≈ 3.3219 bits, then the initial distribution is uniform. After running the algorithm, the location of the maximum is known, which is represented by a delta function concentrating its
probability mass on number 10. Notice that for discovering log N bits of information we are running an algorithm of time complexity O(N) = 2^O(log N), that is, exponential in the number of bits. Compare this to the arguments for the cost of construction presented in Section 6.3.2, page 68.
Figure 7.1: An Exhaustive Optimization. (a) A shuffled array with numbers from 1 to 10 is given. Initially, the algorithm does not know where the optimum is, which is represented by a uniform distribution (b) over the elements. After the execution of the algorithm, the solution is found, which is represented by a distribution (c) concentrating its probability mass on the maximum.
If N is small, say N = 10, then choosing the largest number is easy, because one can
simply revise the whole array and then pick the largest number. However, if N is very
large, say N = 1000, then comparing all the numbers becomes a difficult task. How do
we go about this problem then? One solution would be to limit ourselves to comparing
only a fraction of the array, say M ≪ N elements. This reduces the computational
complexity to O(M) time and O(log N) space at the cost of tolerating an error with probability 1 − M/N. That is, we can give up certainty for the sake of computational efficiency. Notice that the time-space product is O(M log N), which is linear in M for fixed N.
To understand the effect of the resulting tradeoff, we again analyze how the knowledge state is transformed. We assume that the set was shuffled beforehand and that the algorithm inspects the first M outcomes in linear order, choosing the largest number. We then use the frequency of choosing each number as its probability. This assigns equal probability to each one of the N! possible permutations of the array. The resulting probability distributions are shown in Figure 7.2. Here, we see that inspecting only one element does not change the state of knowledge at all; that increasing M moves the probability mass towards the larger numbers; and that complete certainty is achieved when M = N. Furthermore, notice that we can actually infer the ranking of the numbers by merely looking at the distributions, since larger numbers consistently get more probability mass.
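These distributions can also be computed exactly: since the array is shuffled, the M inspected positions form a uniformly random M-element subset of {1, . . . , N}, so the reported maximum equals v with probability C(v−1, M−1)/C(N, M). A sketch of this calculation:

```python
from math import comb

# Exact distribution over the value reported by the bounded algorithm.
N = 10

def max_distribution(M):
    # P(reported max = v) = C(v-1, M-1) / C(N, M) for v in {1, ..., N}
    return [comb(v - 1, M - 1) / comb(N, M) for v in range(1, N + 1)]

assert max_distribution(1) == [1 / N] * N      # M = 1: knowledge unchanged
assert max_distribution(N)[-1] == 1.0          # M = N: complete certainty

# Larger values consistently get more probability mass, and the expected
# maximum matches the closed form M(N+1)/(M+1).
for M in range(1, N + 1):
    d = max_distribution(M)
    assert all(d[v] <= d[v + 1] for v in range(N - 1))
    V = sum(v * p for v, p in zip(range(1, N + 1), d))
    assert abs(V - M * (N + 1) / (M + 1)) < 1e-9
```
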
The particular shape of the resulting distributions has some interesting properties.
Let ρ denote the relative entropy between the initial and the final distribution, let C = M − 1 denote the number of comparisons carried out by the algorithm, and let V denote the expected maximum. Note that C = O(M log N) for fixed N, i.e. it serves as a measure of the computational complexity. If we plot ρ versus C (Figure 7.3a), we see
Figure 7.2: Distributions after Bounded Optimization. The plots show the distributions over the maximum in {1, 2, . . . , 10} obtained after running the bounded optimization algorithm for M = 1, . . . , 10.
a surprising property: the quantities are proportional, having a correlation coefficient¹ of r = 0.9991. That is, we can use ρ as a good measure of the computational complexity of the algorithm. We also want to understand how the expected value V evolves as we increase the computational effort. This is seen by plotting V versus ρ (Figure 7.3b). This plot shows that certainty has a diminishing marginal value, that is, the gain in value decreases with more computation. Intuitively, this is because the better the candidate solution, the more effort it takes to find an even better one. Moreover, the shape of the expected utility turns out to be logarithmic, as can be seen by plotting 2^V versus ρ.
Intuitively, if we care about computational costs, then it is not always a good idea
to run an exhaustive optimization algorithm, because we can reach a point where the
gain in value is too small to justify the extra computational effort. This idea can be
captured by changing the evaluation criterion to one that penalizes the expected value
by the relative entropy, e.g.
V − αρ,
where α is a conversion factor. Fixing α defines a tradeoff between the expected value
and the relative entropy (Figure 7.4). The plot confirms our intuition: the smaller
α, the larger the number of elements we compare, and the more “rational” our choice
becomes. A brief simulation also shows that achieving a perfectly rational choice does
not require α = 0; it is already achieved for α ≈ 0.2131.
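The quantities in this example are easy to reproduce. The sketch below (Python; illustrative names) uses the closed form P(max = k) = C(k−1, M−1)/C(N, M) for the distribution over the maximum after inspecting M of N distinct elements (this combinatorial identity is an assumption of the sketch, not something derived in the text), and prints ρ, V and the penalized value V − αρ for each M.

```python
from math import comb, log2

def max_dist(N, M):
    """P(max = k) after inspecting M of N distinct elements (assumed closed form)."""
    return [comb(k - 1, M - 1) / comb(N, M) for k in range(1, N + 1)]

def kl(p, q):
    """Relative entropy (in bits) between distributions p and q."""
    return sum(pk * log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

N, alpha = 10, 0.5
p1 = max_dist(N, 1)                # inspecting one element: uniform, no knowledge gained
for M in range(1, N + 1):
    pM = max_dist(N, M)
    rho = kl(pM, p1)               # proxy for computational cost (C = M - 1)
    V = sum(k * pk for k, pk in enumerate(pM, start=1))   # expected maximum
    print(M, round(rho, 3), round(V, 3), round(V - alpha * rho, 3))
```

For moderate α the penalized value V − αρ peaks at an intermediate M, matching the diminishing marginal value of certainty described above.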
This analysis leads to a principled algorithm for bounded optimization. For a
fixed α, consider the algorithm that linearly inspects the elements of the array until
1More precisely, the Pearson product-moment correlation coefficient is an indicator of the linear
dependence between two variables, having values ranging from −1 to 1. A value equal to 1 means that
a linear relation perfectly describes the relationship between the two variables.
Figure 7.3: Performance of the Bounded Optimization. In panel (a), it is apparent that the
relative entropy ρ is proportional to the number of comparisons C carried out by the algorithm
(correlation coefficient r = 0.9991), and hence ρ serves as a measure of the computational
complexity. Panel (b) shows the expected value V against the relative entropy ρ. The marginal
increment of V diminishes as ρ increases. Furthermore, panel (c) shows that 2^V is linear in ρ
(r = 0.9937), meaning that V is logarithmic in ρ.
Figure 7.4: Expected Value Penalized by Relative Entropy (V − αρ versus ρ). Choosing the
conversion factor α defines a tradeoff between the expected value and the relative entropy. In
the plot, three performance curves are shown, corresponding to the tradeoffs α = 1/2, 1 and 2.
Notice how an exponential increase of α leads to a linear increase in the optimal ρ.
the largest number found so far, penalized by αρ, reaches its peak. This algorithm is
stochastic, with the property that the probabilities of choosing a number are monotonic:
for all i, j,
pi > pj ⇐⇒ vi > vj
where vi is the value of the i-th element and pi is its probability of being chosen.
This seems to be a more general property of bounded optimization algorithms. Fur-
thermore, notice that because of the diminishing marginal value, it is easy to reach a
good performance level (although it is very hard to approach the optimum).
This concludes our example. In the remainder of this chapter, we aim to develop
a general framework of bounded rationality and then apply it to autonomous systems.
We are especially interested in providing a solid axiomatic basis for bounded rationality.
7.2 Utility & Resources
In physics, the behavior of a system can be described in two ways: using the dynamical
equations or using an extremal principle (Goldstein, 1980). The first one specifies how
the coordinates of the physical system change in time, like e.g. in Newton’s second law
F = dp/dt,
where F is the force, p is the momentum and t is time. The second expresses the
dynamics of a system as the solution to a variational problem, like e.g. the action
integral
A = ∫_{ti}^{tf} L[q, q̇, t] dt
in Lagrangian Mechanics, where L is the Lagrangian (i.e. the kinetic energy minus
the potential energy of the system), q are the generalized coordinates of the system
and t is time. According to the principle of least action, the dynamical equations of
the system are then obtained by finding the trajectory q(t) that is an extremum of the
action integral. On a conceptual level, the difference between stating the dynamical
equations or the extremum principle to specify a physical system is analogous to the
difference between stating the output model or the subjective expected utility in order
to specify an autonomous system.
In the previous chapter, we have argued that the resource cost (i.e. work) of ob-
serving an event A given B is
−γ logP(A|B),
where γ > 0 is the conversion factor between units of energy and information. This
has the advantage of linking three disparate concepts together, namely resource costs
(physics), information content (information theory) and behavior P(A|B) (statistics).
Can we exploit this connection to physics in order to devise a new principle for con-
structing autonomous systems? The answer is yes, but this connection is not straight-
forward, because it requires revising the way we think about utilities and about the
process of constructing an autonomous system.
7.2.1 Utility
In this section, we propose a non-standard concept of utility that is more in accord
with thermodynamics and information theory. Let U(A) denote the utility of an event
A ∈ F and let
u(A|B) := U(A ∩B)−U(B)
be a shorthand to denote the gain in utility obtained from experiencing event A given
event B. We will derive the relation
u(A|B) = α logP(A|B)
where α > 0 is a conversion factor between units of utility and information. This
relation is obtained from desiderata characterizing the notion of utilities in systems
that are “free” to generate events. More specifically, by a free system we mean a
system that can choose the sampling probabilities of events generated by itself.
Consider a free system represented by a finite probability space (Ω,F ,P). Here,
the probability measure P models the generative law that the system uses to choose
events. Thus, if P(A) > P(B), then the propensity of experiencing A is higher than
B. In such a system, differences in probability can be given an interpretation relating
them to differences in preferences: one can say that A is more probable than B because
A is more desirable than B. In other words, a system is more likely to choose the
events that it prefers. Using this line of reasoning, can we find a quantity which will
measure how desirable an event is? We call such a measure a utility gain function,
although one should always bear in mind that the resulting utilities do not correspond
to the same notion of utility that we have seen in Part I.
If there is such a measure, then it is reasonable to demand the following three
properties for it:
i. Utility gains should be mappings from conditional probabilities into real numbers.
ii. Utility gains should be additive. That is, the gain of a joint event should be
obtained by summing up the gain of the sub-events. For example, the “gain of
drinking coffee and eating a croissant” should equal “the gain of drinking coffee”
plus the “gain of having a croissant given the gain of drinking coffee”.
iii. A more probable event should have a higher utility gain than a less probable event.
For example, if “drinking a coffee” is more likely than “eating a croissant”, then
the utility gain of the former is higher. Note that this is not necessarily true
anymore if the system is not free to choose between events. For instance, “losing
the lottery” has a much higher probability than “winning the lottery” even though
the utility gain of the latter is obviously higher, but this case differs from the
previous example in that the event is not determined by the system itself.
The three properties can then be summarized as follows.
Definition 22 (Axioms of Utility) Let (Ω,F ,P) be a probability space. A set func-
tion U : F → R is a utility function for P iff its utility gain function u(A|B) :=
U(A ∩B)−U(B) has the following three properties for all events A,B,C,D ∈ F :
U1. ∃f such that u(A|B) = f(P(A|B)) ∈ R, (real-valued)
U2. u(A ∩B|C) = u(A|C) + u(B|A ∩C), (additive)
U3. P(A|B) > P(C|D) ⇔ u(A|B) > u(C|D). (monotonic)
Furthermore, we use the abbreviation u(A) := u(A|Ω). 2
The following theorem shows that these three properties enforce a strict mapping
between probabilities and utility gains.
Theorem 5 (Utility Gain ↔ Probability) If f is such that u(A|B) = f(P(A|B))
for every probability space (Ω,F ,P), then f is of the form
f(·) = α log(·),
where α > 0 is an arbitrary strictly positive constant. 2
Proof Given arbitrary p, q ∈ (0, 1] and n,m ∈ N, one can always choose a probability
space (Ω,F ,P) having sequences of events A1, A2, . . . , An ∈ F and B1, B2, . . . , Bm ∈ F ,
and an event C ∈ F such that

p = P(A1|C) = P(A2|C ∩A1) = · · · = P(An|C ∩A1 ∩ · · · ∩An−1),
q = P(B1|C) = P(B2|C ∩B1) = · · · = P(Bm|C ∩B1 ∩ · · · ∩Bm−1).

By the chain rule, P(A1 ∩ · · · ∩An|C) = p^n and P(B1 ∩ · · · ∩Bm|C) = q^m, while additivity
(U2) gives u(A1 ∩ · · · ∩An|C) = nf(p) and u(B1 ∩ · · · ∩Bm|C) = mf(q), i.e. f(p^n) = nf(p)
and f(q^m) = mf(q). Monotonicity (U3) then demands that p^n ≥ q^m iff nf(p) ≥ mf(q).
Since n and m are arbitrary, this forces f(q)/f(p) = log q/ log p for all p and q, and hence
f(·) = α log(·) for some constant α; by (U3), f must be increasing, so α > 0.

This establishes the relation

u(A|B) = α logP(A|B), (7.4)

where α > 0 plays the role of a conversion factor between utilities and
information. If a probability measure P and a utility function U satisfy the relation
(7.4), then we say that they are conjugate. We call one unit of utility one utile. Given
that this transformation between utility gains and probabilities is a bijection, one has
that
P(A|B) = 2^{(U(A∩B)−U(B))/α}.
There are several important observations with respect to this particular functional form.
First, recall that in Bayesian probability theory (Section 4.1), a new measurement
A is combined with the previous knowledge B by an intersection, i.e. the posterior
knowledge is given by A∩B. Furthermore, note that utility gains are always negative,
i.e.
U(A ∩B)−U(B) ≤ 0.
This means that every measurement decreases the utility. While this sounds counterin-
tuitive, the relation also says that minimizing the resource costs maximizes the utility
that is achieved after the interaction. Hence, if the free system is forced to act, then it
will favor outcomes that reduce the utility less.
Second, note that in the context of thermodynamics discussed in Section 6.2.1,
doing work W = −γ logP(A|B) on a physical system changes its energy level from
e(B) to e(A ∩B) as
e(B) −→ e(A ∩B) = e(B) +W = e(B)− γ logP(A|B).
That is, the work done on a system increases its internal energy. Comparing this with
the relation
U(A ∩B)−U(B) = −α logP(A|B)
leads to the conclusion that
U(A) = −(α/γ) e(A)
for all A ∈ F . Hence, the utilities that we have derived from mathematical desiderata
are (save for a conversion factor) negative energy levels. This is useful because it allows
us to analyze decision making by borrowing ideas from thermodynamics.
Furthermore, note that the exponentiation of the utility function can be written as
a sum of parts. That is, if A1, A2 ∈ F form a partition of A ∈ F , then
2^{U(A)/α} = 2^{U(A1)/α} + 2^{U(A2)/α}.
This is because
P(A|Ω) = P(A1|Ω) +P(A2|Ω),
2^{U(A)/α} / 2^{U(Ω)/α} = 2^{U(A1)/α} / 2^{U(Ω)/α} + 2^{U(A2)/α} / 2^{U(Ω)/α},
2^{U(A)/α} = 2^{U(A1)/α} + 2^{U(A2)/α}.
In particular, one can rewrite any probability P(A|B) as a Gibbs measure:
P(A|B) = [ ∑_{ω∈A} 2^{U(ω)/α} ] / [ ∑_{ω∈B} 2^{U(ω)/α} ],
where we have used the abbreviation U(ω) := U({ω}). As the conversion factor α
approaches zero, the probability measure P(ω) approaches a uniform distribution over
the maximal set Ωmax := {ω∗ ∈ Ω|ω∗ = argmaxωU(ω)}. Similarly, as α → ∞,
P(ω) → 1/|Ω|, i.e. the uniform distribution over the whole outcome set Ω. Also, note
that
2^{U(Ω)/α} = ∑_{ω} 2^{U(ω)/α}.
Intuitively, the utility U(Ω) of the sample set corresponds to the utility of the system
before any interaction has occurred.
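The conjugacy between probabilities and utilities can be checked numerically. The sketch below is a toy example (three outcomes, arbitrarily chosen α and offset U(Ω); all names are illustrative): it builds U(ω) = U(Ω) + α log₂ P(ω) and verifies the Gibbs form of P(A|B) as well as the sum rule for 2^{U(Ω)/α}.

```python
from math import log2

alpha = 2.0                                   # conversion factor (utiles per bit)
P = {"a": 0.5, "b": 0.3, "c": 0.2}            # generative law of a free system
U_Omega = 1.0                                 # utility before any interaction (free offset)
U = {w: U_Omega + alpha * log2(p) for w, p in P.items()}   # conjugate utilities

def weight(ws):
    """Sum of 2^{U(ω)/α} over a set of outcomes."""
    return sum(2 ** (U[w] / alpha) for w in ws)

A, B = {"a"}, {"a", "b"}                      # A ⊂ B
P_A_given_B = P["a"] / (P["a"] + P["b"])
assert abs(weight(A) / weight(B) - P_A_given_B) < 1e-9       # Gibbs measure form
assert abs(weight(P.keys()) - 2 ** (U_Omega / alpha)) < 1e-9  # sum rule for U(Ω)
print(U)
```

Note that every U(ω) lies below U(Ω), mirroring the observation above that measurements can only decrease the utility.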
7.2.2 Variational principle
The conversion between probability and utility established in the previous section can
be characterized using a variational principle. Inspired by thermodynamics, we now
define a quantity that will allow us to analyze changes of the state of the system.
Definition 23 (Free Utility) Let (Ω,F) be a measurable space and let U be a utility
function. Define the free utility functional as
J(Pr;U) := ∑_{ω∈Ω} Pr(ω)U(ω) − α ∑_{ω∈Ω} Pr(ω) logPr(ω),
(the term Pr(ω) := Pr({ω}) is an abbreviation) where Pr is an arbitrary probability
measure over (Ω,F). 2
Here, we see that the free utility2 is the expected utility of the system plus the
uncertainty (i.e. the entropy) over the outcome multiplied by the utility-information
conversion factor. It satisfies the following relation.
2The functional F := −J is also known as the Helmholtz free energy in thermodynamics. F is a
measure of the “useful” work obtainable from a closed thermodynamic system at a constant temperature
and volume.
Theorem 6 (Variational Principle) A conjugate pair (P,U) satisfies
J(Pr;U) ≤ J(P;U) = U(Ω).
where Pr is an arbitrary probability measure over Ω. 2
Proof Rewriting terms using the utility-probability conversion and applying Jensen’s
inequality yields
J(Pr;U) = ∑_{ω∈Ω} Pr(ω)U(ω) − α ∑_{ω∈Ω} Pr(ω) logPr(ω)
        = α ∑_{ω∈Ω} Pr(ω) log [ 2^{U(ω)/α} / Pr(ω) ]
        ≤ α log ∑_{ω∈Ω} Pr(ω) [ 2^{U(ω)/α} / Pr(ω) ]
        = α log ∑_{ω∈Ω} 2^{U(ω)/α} = U(Ω),

with equality iff 2^{U(ω)/α}/Pr(ω) is constant, i.e. if Pr = P.
The variational principle tells us that the probability law P of the system is the one
that maximizes the free utility for a given utility function U, since
P = argmax_{Pr} J(Pr;U).
Hence, the utility function U plays the role of a constraint landscape for probability
measures Pr out of which the conjugate probability measure P is the one that maximizes
the uncertainty.
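The variational principle can be verified numerically. The following sketch (arbitrary utility landscape and α; base-2 logarithms, matching the bit-based utilities of this chapter) constructs the conjugate measure P(ω) ∝ 2^{U(ω)/α} and checks that randomly drawn probability measures never attain a free utility above J(P;U) = U(Ω).

```python
from math import log2
import random

alpha = 1.5
U = {"x": 0.0, "y": -1.0, "z": 2.0}           # arbitrary utility landscape
Z = sum(2 ** (u / alpha) for u in U.values())
P = {w: 2 ** (u / alpha) / Z for w, u in U.items()}   # conjugate measure P(ω) ∝ 2^{U(ω)/α}
U_Omega = alpha * log2(Z)                     # from the sum rule 2^{U(Ω)/α} = Σ_ω 2^{U(ω)/α}

def J(Pr):
    """Free utility: expected utility plus α times the entropy of Pr (base-2 logs)."""
    return sum(Pr[w] * U[w] for w in U) - alpha * sum(
        Pr[w] * log2(Pr[w]) for w in U if Pr[w] > 0)

assert abs(J(P) - U_Omega) < 1e-9             # J(P;U) = U(Ω)
random.seed(0)
for _ in range(1000):                         # random measures never exceed the conjugate one
    raw = [random.random() for _ in U]
    Pr = {w: r / sum(raw) for w, r in zip(U, raw)}
    assert J(Pr) <= J(P) + 1e-9
print(round(U_Omega, 4))
```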
7.2.3 Bounded SEU
In SEU theory, the designer constructs an autonomous system by first assuming a
probability measure that characterizes the behavior of the environment and a utility
criterion to compare different outcomes, and then by calculating a policy that maxi-
mizes the subjective expected utility. During this process, the calculation of the optimal
policy is done disregarding the costs of the computation. As we have argued at the
beginning of this chapter, here we do want to take into account the costs of computing
the optimal policy. However, this poses two important problems.
First, how do we measure the costs of this computation? To answer this question,
we first have to agree on what this calculation is actually doing. We know that after
this calculation, the agent ends up with an optimal policy. But what if this calculation
had been omitted? Clearly, the agent would not end up with an optimal policy—unless
the agent already had the optimal policy from the very beginning. The important
conclusion here is that, in the general case, the purpose of such a calculation is to
transform an initial system Pi into a final system Pf . In the previous chapter, we have
seen that the number of bits that have to be set to carry out this transformation is
given by the relative entropy, i.e.

∑_{ω∈Ω} Pf (ω) log [Pf (ω)/Pi(ω)].
Hence, the cost measured in joules and in utiles is obtained by multiplying the previous
quantity by γ and α respectively.
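As a small illustration, the sketch below computes this cost for a toy transformation. The conversion factors are hypothetical placeholders (γ on the order of kT ln 2 joules per bit at room temperature), not values given in the text.

```python
from math import log2

def rel_entropy_bits(Pf, Pi):
    """Number of bits that must be set to transform the system Pi into Pf."""
    return sum(pf * log2(pf / Pi[w]) for w, pf in Pf.items() if pf > 0)

Pi = {"a": 0.5, "b": 0.5}      # initial system
Pf = {"a": 0.9, "b": 0.1}      # final system
bits = rel_entropy_bits(Pf, Pi)
gamma = 2.87e-21               # hypothetical: ~kT ln 2 joules per bit at 300 K
alpha = 1.0                    # hypothetical: utiles per bit
print(round(bits, 4), gamma * bits, alpha * bits)
```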
Second, how do we optimally choose this transformation? In other words, how
should we transform the initial system Pi such that the final system Pf optimally
trades off the benefits of maximizing a given utility function U∗ against the cost of the
transformation? To answer this question, consider the transformation represented in
Figure 7.5. The initial system satisfies the equation
Ji := ∑_{ω∈Ω} Pi(ω)Ui(ω) − α ∑_{ω∈Ω} Pi(ω) logPi(ω) = Ui(Ω).
We add new constraints represented by the utility function U∗. Then, the resulting
utility function Uf is given by the sum
Uf = Ui +U∗.
The free utility Jf of the final system is then given by
Jf := ∑_{ω∈Ω} Pf (ω)Uf (ω) − α ∑_{ω∈Ω} Pf (ω) logPf (ω)
    = ∑_{ω∈Ω} Pf (ω)(Ui(ω) +U∗(ω)) − α ∑_{ω∈Ω} Pf (ω) logPf (ω)
    = ∑_{ω∈Ω} Pf (ω)U∗(ω) − α ∑_{ω∈Ω} Pf (ω) log [Pf (ω)/Pi(ω)] +Ui(Ω).
To understand the change that has occurred due to this transformation, we take the
difference Jf − Ji in the free utility. This results in a quantity that will play a central
role in our next development.
Definition 24 (Bounded Subjective Expected Utility) Let U∗ be a utility function
and let Ji and Jf be the free utility for the conjugate pairs (Pi,Ui) and (Pf ,Uf ). Then,
the bounded subjective expected utility (bounded SEU) is given by the difference
in free utility
Jf − Ji = ∑_{ω∈Ω} Pf (ω)U∗(ω) − α ∑_{ω∈Ω} Pf (ω) log [Pf (ω)/Pi(ω)]. (7.5)
Figure 7.5: Transformation of a System. A transformation from a system (Pi,Ui) into a
system (Pf ,Uf ) by addition of a constraint U∗.
This difference in free utility has an interpretation that is crucial for the formaliza-
tion of bounded rationality: it is the expected target utility U∗ (first term) penalized
by the cost of transforming Pi into Pf (second term). In practice, this means the
following. A designer starts out with an initial system Pi that he wants to change in
order to maximize a utility function U∗. He then changes this system into Pf , spending
in the process a total amount of
γ ∑_{ω∈Ω} Pf (ω) log [Pf (ω)/Pi(ω)]

joules of energy. The expected utility of the system is given by

∑_{ω∈Ω} Pf (ω)U∗(ω).
However, from the point of view of the designer, the total expected utility is smaller,
because he has to subtract the reduction in utility caused by the cost of the transfor-
mation itself:

∑_{ω∈Ω} Pf (ω)U∗(ω) − α ∑_{ω∈Ω} Pf (ω) log [Pf (ω)/Pi(ω)].
Remark 20 Alternatively, one can interpret bounded SEU as the expected utility pe-
nalized by an “uncertainty” term. In this interpretation, the relative entropy measures
the “risk” of a gamble. 2
Because of its interpretation, we define (7.5) as a functional to be maximized, ei-
ther with respect to Pf or with respect to Pi. We call this construction method the
maximum bounded SEU principle. Depending on whether we vary Pf or Pi, one
obtains two different variational problems having different applications.
Control Method. The construction method for control corresponds to the case
when the designer wants to build a system that optimizes a given utility function U∗
subject to the costs of the transformation. This is achieved by fixing Pi and varying
Pf :
Pf = argmax_{Pr} { ∑_{ω∈Ω} Pr(ω)U∗(ω) − α ∑_{ω∈Ω} Pr(ω) log [Pr(ω)/Pi(ω)] }. (7.6)
The solution is given by
Pf (ω) ∝ Pi(ω) exp((1/α)U∗(ω)).
This can be interpreted as follows. The decision maker starts out with a prior probabil-
ity of choosing ω given by Pi(ω). Then, he changes this probability to Pf (ω) obtained
from multiplying the prior with the term exp((1/α)U∗(ω)), which intuitively corresponds
to the “likelihood” of ω being the best choice. Compare this to the example given in
the introduction. Here, α controls the amount of resource units required to increase
the utility. In particular, when the conversion factor between information and utility
is negligible, α ≈ 0, then (7.5) becomes

Jf − Ji ≈ ∑_{ω∈Ω} Pf (ω)U∗(ω),

and hence resource costs are ignored in the choice of Pf , leading to Pf ≈ δω∗(ω), where
ω∗ = argmaxωU∗(ω). Similarly, when α is very high, then the difference is
Jf − Ji ≈ −α ∑_{ω∈Ω} Pf (ω) log [Pf (ω)/Pi(ω)],
and hence no computation is carried out to optimize the choice, i.e. Pf ≈ Pi.
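Both limiting regimes can be observed directly in code. The following sketch (a hypothetical two-outcome example; `control` is an illustrative name) implements the solution Pf(ω) ∝ Pi(ω) exp(U∗(ω)/α).

```python
from math import exp

def control(Pi, Ustar, alpha):
    """Bounded-optimal transformation: Pf(ω) ∝ Pi(ω) exp(U*(ω)/α)."""
    w = {o: Pi[o] * exp(Ustar[o] / alpha) for o in Pi}
    Z = sum(w.values())
    return {o: v / Z for o, v in w.items()}

Pi = {"left": 0.5, "right": 0.5}       # prior choice probabilities (initial system)
Ustar = {"left": 0.0, "right": 1.0}    # target utility function

print(control(Pi, Ustar, 0.01))   # α ≈ 0: nearly all mass on the maximizer "right"
print(control(Pi, Ustar, 100.0))  # α large: Pf ≈ Pi, no computation is spent
```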
Estimation Method. The construction method for estimation corresponds to the
case when the designer wants to build a system that approximates another system
(that is maximizing a possibly unknown utility function U∗) subject to the costs of the
transformation. This is achieved by fixing Pf and varying Pi:
Pi = argmax_{Pr} ∑_{ω∈Ω} Pf (ω)U∗(ω) − α ∑_{ω∈Ω} Pf (ω) log [Pf (ω)/Pr(ω)] (7.7)
   = argmin_{Pr} ∑_{ω∈Ω} Pf (ω) log [Pf (ω)/Pr(ω)],
and thus we have recovered the minimum relative entropy principle for estimation,
having the solution

Pi = Pf ,

which therefore carries no transformation costs. While this “construction method” might
look bizarre at first glance, it is saying something obvious: if the designer is looking
for a system Pi that approximates another known system Pf , then the best he can do
is to just choose Pf itself without having to carry out any transformation at all!
7.3 Bounded SEU in Autonomous Systems
We tackle now the question of how to construct an autonomous system using the
bounded SEU principle. In the previous section, we have learnt that building a system
is conceptualized as transforming a preexisting system. Furthermore, there are two
construction methods, namely one for control and one for estimation.
We assume that we are in possession of a reference I/O model P0 and a utility
function U∗. The objective is to find an I/O model P maximizing the bounded SEU
using the method for control, estimation or a mixture of both. We further assume that
the conversion factor between utilities and information is given by α > 0.
To construct the associated bounded SEU, one has to carefully decide what con-
struction method to apply for each random variable. For instance, consider constructing
an I/O model P(ao) from a reference I/O model P0(ao) using a utility function U∗.
Then, assuming that a is derived using the control method and o using the estimation
method, one gets
P = argmax_{Pr} { ∑_{ao} Pr(a)P0(o|a)U∗(ao) − α ∑_{ao} Pr(a)P0(o|a) log [ Pr(a)P0(o|a) / (P0(a)Pr(o|a)) ] }.
This is because for a, the reference probabilities P0(a) play the role of Pi in (7.5); and
for o, the reference probabilities P0(o|a) play the role of Pf in (7.5).
A simple way to concisely write down the bounded SEU for mixed methods is by
defining two auxiliary I/O models R and S as
R(at|ao 5/8, that is, strictly
higher than the subjective expected utility of the classical analysis—and in practice,
subjective beliefs are all a decision maker has3! We call these probabilities that
are exogenous to the decision maker’s beliefs ambiguities.
The distinction has an operational meaning. Notice that in case V, the belief of the
decision maker is itself a random variable, implying that the optimal policy is undefined
until the random variable is resolved. Hence, the computation of the optimal policy can
be delayed, i.e. the optimal policy can be determined dynamically. This is unlike case
IV, where the policy is pre-computed/static. The corresponding Bayesian I/O model
is as follows. Let θ ∈ {I, II} be the parameter determining whether the decision maker
3Intuitively, it seems safer to delay one’s decision until the evidence is conclusive.
Figure 9.1: Risk versus Ambiguity. In the figure, five different decision making scenarios are
shown. A biased coin is tossed. The goal is to predict the outcome and the payoffs are $1 and
$0 for a right and wrong guess respectively. A rational decision maker places bets (here shown
inside speech bubbles) such that his subjective expected utility is maximized. These subjective
beliefs are delimited within dotted boxes. The cases in panels I–IV differ from case V in that the
former can be fully understood in terms of classical decision theory, whereas the latter cannot.
is in case I or II. Then, the full Bayesian I/O model is given by:

P (θ = I) = 1/4,    P (θ = II) = 3/4,

P (a|θ = I) = { 0 if a = H; 1 if a = T },    P (a|θ = II) = { 1 if a = H; 0 if a = T },

P (o|θ = I, a) = { 1/4 if o = H; 3/4 if o = T },    P (o|θ = II, a) = { 3/4 if o = H; 1/4 if o = T }.
Here, the prior probabilities P (θ) are ambiguities while the P (a|θ) and P (o|θ, a) are
risk probabilities, because fixing θ determines the decision maker’s estimation about
the outcome and his policy.
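From this model one can compute both the subjective expected utility 5/8 of the static (pre-computed) policy and the value 3/4 obtained by delaying the bet until θ is resolved. The script below is a direct transcription of the probabilities above, using exact rational arithmetic.

```python
from fractions import Fraction as F

P_theta = {"I": F(1, 4), "II": F(3, 4)}      # ambiguity over the coin's bias
P_H = {"I": F(1, 4), "II": F(3, 4)}          # risk: P(o = H | θ)

# Static (pre-computed) policy: bet on the marginally more likely outcome.
marg_H = sum(P_theta[t] * P_H[t] for t in P_theta)
static_seu = max(marg_H, 1 - marg_H)

# Dynamic policy: wait until θ is resolved, then bet on the likelier outcome.
dynamic_seu = sum(P_theta[t] * max(P_H[t], 1 - P_H[t]) for t in P_theta)

print(static_seu, dynamic_seu)   # prints: 5/8 3/4
```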
9.2 Adaptive Estimative Control
If the environment Q is unknown, then the task of designing an appropriate agent
P constitutes an adaptive control problem. We have already seen how to solve such a
problem using the maximum SEU principle in Section 4.2, where it has been formulated
as an adaptive optimal control problem.
We formulate a different problem setup that will be termed adaptive estimative
control. Specifically, this setup deals with the case when the designer has a class of
output models {Pθ}θ∈Θ parameterized by a finite set Θ, designed to fit to a class of en-
vironments {Qθ}θ∈Θ; in other words, when the designer wants to use policy Pθ(at|ao 0, there is a C ≥ 0, such that for all θ′ ∈ Θ,
all t and all T ⊂ Nt,

|gθ′(θ;T )−Gθ′(θ;T )| ≤ C
with probability ≥ 1− δ. 2
Figure 9.6: Bounded Variation. If a divergence process has bounded variation, then the
realizations (curves 2 & 3) of a sub-divergence stay within a band around the mean (curve 1).
Figure 9.6 illustrates this property. Bounded variation is the key property that is
going to be used to construct the results of this section. However, it is very restrictive.
For instance, simple Bernoulli processes do not fulfill this property. The first result is
that the posterior probability of the true parameter is bounded from below.
Theorem 9 (Lower Bound of True Posterior) Let the set of operation modes of
a controller be such that for all θ ∈ Θ the divergence process dt(θ∗‖θ) has bounded
variation. Then, for any δ > 0, there is a λ > 0, such that for all t ∈ N,
P (θ∗|aˆo≤t) ≥ λ/|Θ|
with probability ≥ 1− δ. 2
Proof As has been pointed out in (9.2), a particular realization of the divergence
process dt(θ∗‖θ) can be decomposed as
dt(θ∗‖θ) = ∑_{θ′} gθ′(θ;Tθ′),

where the gθ′(θ;Tθ′) are sub-divergences of dt(θ∗‖θ) and the Tθ′ form a partition of Nt.
However, since dt(θ∗‖θ) has bounded variation for all θ ∈ Θ, one has that for all δ′ > 0,
there is a C(θ) ≥ 0, such that for all θ′ ∈ Θ, all t ∈ N and all T ⊂ Nt, the inequality

|gθ′(θ;Tθ′)−Gθ′(θ;Tθ′)| ≤ C(θ)

holds with probability ≥ 1− δ′. However, due to (9.3),

Gθ′(θ;Tθ′) ≥ 0

for all θ′ ∈ Θ. Thus,

gθ′(θ;Tθ′) ≥ −C(θ).
If all the previous inequalities hold simultaneously then the divergence process can be
bounded as well. That is, the inequality
dt(θ∗‖θ) ≥ −MC(θ) (9.4)
holds with probability ≥ (1− δ′)^M where M := |Θ|. Choose

β(θ) := max{0, ln[P (θ)/P (θ∗)]}.

Since 0 ≥ ln[P (θ)/P (θ∗)]− β(θ), it can be added to the right hand side of (9.4). Using the
definition of dt(θ∗‖θ), taking the exponential and rearranging the terms one obtains
P (θ∗) ∏_{τ=1}^{t} P (oτ |θ∗, ao<τaτ ) ≥ e^{−α(θ)} P (θ) ∏_{τ=1}^{t} P (oτ |θ, ao<τaτ ),
where α(θ) := MC(θ) + β(θ) ≥ 0. Identifying the posterior probabilities of θ∗ and θ
by dividing both sides by the normalizing constant yields the inequality
P (θ∗|aˆo≤t) ≥ e^{−α(θ)}P (θ|aˆo≤t).
This inequality holds simultaneously for all θ ∈ Θ with probability ≥ (1− δ′)^{M²} and in
particular for λ := minθ{e^{−α(θ)}}, that is,
P (θ∗|aˆo≤t) ≥ λP (θ|aˆo≤t).
But since this is valid for any θ ∈ Θ, and because maxθ{P (θ|aˆo≤t)} ≥ 1/M , one gets
P (θ∗|aˆo≤t) ≥ λ/M,

with probability ≥ 1− δ for arbitrary δ > 0 related to δ′ through the equation δ′ :=
1− (1− δ)^{1/M²}.
Remark 31 It has been pointed out by M. Hutter4 that bounded variation can most
probably be weakened to “C growing sub-linearly” (which will require adapting the
definitions that follow as well) and still get the convergence result of this section. 2
9.4.5 Core
If one wants to identify the operation modes whose posterior probabilities vanish, then it
is not enough to characterize them as those modes whose hypothesis does not match the
true hypothesis. Figure 9.7 illustrates this problem. Here, three hypotheses along with
their associated policies are shown. H1 and H2 share the prediction made for region A
but differ in region B. Hypothesis H3 differs everywhere from the others. Assume H1
is true. As long as we apply policy P2, hypothesis H3 will make wrong predictions and
thus its divergence process will diverge as expected. However, no evidence against H2
will be accumulated. It is only when one applies policy P1 for long enough time that
the agent will eventually enter region B and hence accumulate counter-evidence for H2.
Figure 9.7: Problems with Disambiguation. If hypothesis H1 is true and agrees with H2 on
region A, then policy P2 cannot disambiguate the three hypotheses.
But what does “long enough” mean? If P1 is executed only for a short period, then
the controller risks not visiting the disambiguating region. But unfortunately, neither
the right policy nor the right length of the period to run it are known beforehand.
Hence, an agent needs a clever time-allocating strategy to test all policies for all finite
time intervals. This motivates the following definition.
Definition 36 (Core) The core of an operation mode θ∗, denoted as [θ∗], is the
subset of Θ containing operation modes behaving like θ∗ under its policy. Formally, an
operation mode θ /∈ [θ∗] (i.e. is not in the core) iff for any C ≥ 0, δ > 0, there is a ξ > 0
and a t0 ∈ N, such that for all t ≥ t0,
Gθ∗(θ;T ) ≥ C
with probability ≥ 1− δ, where Gθ∗(θ;T ) is a sub-divergence of dt(θ∗‖θ), and Pr{τ ∈
T } ≥ ξ for all τ ∈ Nt. 2
In other words, if the agent were to apply θ∗’s policy in each time step with probabil-
ity at least ξ, and under this strategy the expected sub-divergence Gθ∗(θ;T ) of dt(θ∗‖θ)
grows unboundedly, then θ is not in the core of θ∗.
4personal communication
Remark 32 Note that demanding a strictly positive probability of execution in each
time step guarantees that the agent will run θ∗ for all possible finite time-intervals. 2
As the following theorem shows, the posterior probabilities of the operation modes
that are not in the core vanish almost surely.
Theorem 10 (Not in Core ⇒ Vanishing Posterior) Let the set of operation modes
of an agent be such that for all θ ∈ Θ the divergence process dt(θ∗‖θ) has bounded vari-
ation. If θ /∈ [θ∗], then P (θ|aˆo≤t)→ 0 as t→∞ almost surely. 2
Proof The divergence process dt(θ∗‖θ) can be decomposed into a sum of sub-divergences
(see Equation 9.2)
dt(θ∗‖θ) = ∑_{θ′} gθ′(θ;Tθ′). (9.5)
Furthermore, for every θ′ ∈ Θ, one has that for all δ′ > 0, there is a C(θ) ≥ 0, such that
for all t ∈ N and for all T ⊂ Nt,

|gθ′(θ;T )−Gθ′(θ;T )| ≤ C(θ)

with probability ≥ 1− δ′. Applying this bound to the summands in (9.5) yields the
lower bound

∑_{θ′} gθ′(θ;Tθ′) ≥ ∑_{θ′} ( Gθ′(θ;Tθ′)− C(θ) ),

which holds with probability ≥ (1− δ′)^M , where M := |Θ|. Due to Inequality 9.3, one
has that for all θ′ ≠ θ∗, Gθ′(θ;Tθ′) ≥ 0. Hence,

∑_{θ′} ( Gθ′(θ;Tθ′)− C(θ) ) ≥ Gθ∗(θ;Tθ∗)−MC,
where C := maxθ{C(θ)}. The members of the set Tθ∗ are determined stochastically;
more specifically, the i-th member is included into Tθ∗ with probability P (θ∗|aˆo≤i) ≥
λ/M for some λ > 0 by Theorem 9. But since θ /∈ [θ∗], one has that Gθ∗(θ;Tθ∗) →∞
as t→∞ with probability ≥ 1− δ′ for arbitrarily chosen δ′ > 0. This implies that
lim_{t→∞} dt(θ∗‖θ) ≥ lim_{t→∞} Gθ∗(θ;Tθ∗)−MC ր ∞

with probability ≥ 1− δ, where δ > 0 is arbitrary and related to δ′ as δ = 1− (1− δ′)^{M+1}.
Using this result in the upper bound for posterior probabilities yields the final result
0 ≤ lim_{t→∞} P (θ|aˆo≤t) ≤ lim_{t→∞} [P (θ)/P (θ∗)] e^{−dt(θ∗‖θ)} = 0.
9.4.6 Consistency
Even if an operation mode θ is in the core of θ∗, i.e. given that θ is essentially indis-
tinguishable from θ∗ under θ∗’s control, it can still happen that θ∗ and θ have different
policies. Figure 9.8 shows an example of this. The hypotheses H1 and H2 share re-
gion A but differ in region B. In addition, both operation modes have their policies P1
and P2 respectively confined to region A. Note that both operation modes are in the
core of each other. However, their policies are different. This means that it is unclear
whether multiplexing the policies in time will ever disambiguate the two hypotheses.
This is undesirable, as it could impede the convergence to the right control law.
Figure 9.8: Inconsistent Policies. An example of inconsistent policies. Both operation modes
are in the core of each other, but have different policies.
Thus, it is clear that one needs to impose further restrictions on the mapping
of hypotheses into policies. With respect to Figure 9.8, one can make the following
observations:
1. Both operation modes have policies that select subsets of region A. Therefore,
the dynamics in A are preferred over the dynamics in B.
2. Knowing that the dynamics in A are preferred over the dynamics in B allows us
to drop region B from the analysis when choosing a policy.
3. Since both hypotheses agree in region A, they have to choose the same policy in
order to be consistent in their selection criterion.
This motivates the following definition.
Definition 37 (Consistent Policies) An operation mode θ is said to be consistent
with θ∗ iff θ ∈ [θ∗] implies that for all ε > 0, there is a t0, such that for all t ≥ t0 and
all ao 0 and δ′ > 0, let t0(θ) be the time such that for all t ≥ t0(θ), wθ(t) < ε′. Choosing
t0 := maxθ{t0(θ)}, the previous inequality holds for all θ and t ≥ t0 simultaneously
with probability ≥ (1− δ′)M . Hence,∑
θ/∈[θ∗]
pθ(t)wθ(t) ≤
∑
θ/∈[θ∗]
wθ(t) < Mε
′. (9.7)
To bound the second sum in (9.6) one proceeds as follows. For every member θ ∈ [θ∗], one has that pθ(t) → pθ∗(t) as t → ∞. Hence, following a similar construction as above, one can choose t′0 such that for all t ≥ t′0 and θ ∈ [θ∗], the inequalities
\[
\bigl| p_\theta(t) - p_{\theta^*}(t) \bigr| \;<\; \varepsilon'
\]
hold simultaneously for the precision ε′ > 0. Applying this to the second sum in Equation 9.6 yields the bounds
\[
\sum_{\theta\in[\theta^*]} \bigl(p_{\theta^*}(t) - \varepsilon'\bigr)\, w_\theta(t)
\;\le\; \sum_{\theta\in[\theta^*]} p_\theta(t)\, w_\theta(t)
\;\le\; \sum_{\theta\in[\theta^*]} \bigl(p_{\theta^*}(t) + \varepsilon'\bigr)\, w_\theta(t).
\]
Here (pθ∗(t) ± ε′) are multiplicative constants that can be placed in front of the sum. Note that
\[
1 \;\ge\; \sum_{\theta\in[\theta^*]} w_\theta(t) \;=\; 1 - \sum_{\theta\notin[\theta^*]} w_\theta(t) \;>\; 1 - M\varepsilon'.
\]
Use of the above inequalities allows simplifying the lower and upper bounds respectively:
\[
\bigl(p_{\theta^*}(t) - \varepsilon'\bigr) \sum_{\theta\in[\theta^*]} w_\theta(t)
\;>\; p_{\theta^*}(t)\,(1 - M\varepsilon') - \varepsilon'
\;\ge\; p_{\theta^*}(t) - (M+1)\varepsilon',
\]
\[
\bigl(p_{\theta^*}(t) + \varepsilon'\bigr) \sum_{\theta\in[\theta^*]} w_\theta(t)
\;\le\; p_{\theta^*}(t) + \varepsilon'
\;<\; p_{\theta^*}(t) + 2\varepsilon'. \tag{9.8}
\]
Combining the inequalities (9.7) and (9.8) in (9.6) yields the final result:
\[
\bigl| P(a_t|\hat{a}o_{<t}) - P(a_t|\theta^*, ao_{<t}) \bigr| \;<\; M\varepsilon' + 2\varepsilon' \;=\; (M+2)\varepsilon' \;=:\; \varepsilon,
\]
which holds with probability ≥ 1 − δ, where δ > 0 is related to δ′ as δ′ = 1 − (1 − δ)^{1/M}, and the precision ε is arbitrary.
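This convergence can be illustrated with a toy simulation (an illustrative sketch with hypothetical numbers, not the thesis's code): three operation modes over binary observations, where two modes are in the core of the true mode and have consistent (identical) policies. The posterior never disambiguates the two core modes, yet the action probability of the rule converges to that of the true mode:

```python
import random

# Illustrative sketch (hypothetical numbers).  Observations here do not
# depend on actions, so the actions need not be simulated.  Modes 0 and 1
# are in the core of the true mode 0 (identical observation models) and
# have consistent (identical) policies; mode 2 is distinguishable.
obs_model = [0.7, 0.7, 0.2]   # P(o = 1 | theta)
policy    = [0.9, 0.9, 0.1]   # P(a = 1 | theta); modes 0 and 1 agree

random.seed(0)
w = [1 / 3, 1 / 3, 1 / 3]     # prior over operation modes

for t in range(500):
    o = 1 if random.random() < obs_model[0] else 0   # true mode generates o
    like = [p if o == 1 else 1 - p for p in obs_model]
    z = sum(l * wi for l, wi in zip(like, w))
    w = [l * wi / z for l, wi in zip(like, w)]       # posterior update

# The posterior stays split between the two core modes, but since their
# policies agree, the action probability converges to the true mode's.
p_action = sum(pi * wi for pi, wi in zip(policy, w))
print(w, p_action)
```

The weight of mode 2 vanishes while modes 0 and 1 keep equal posterior mass forever; consistency of their policies is exactly what makes the mixed action probability converge nonetheless.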
9.5 Examples
In this section we illustrate the usage of the Bayesian control rule on two examples
that are very common in the reinforcement learning literature: multi-armed bandits
and Markov decision processes. As a reminder, a summary of the Bayesian control rule
is given in Table 9.1.
Bayesian Control Rule: Given a set of operation modes {P(·|θ, ·)}θ∈Θ over interaction sequences in Z∞ and a prior distribution P(θ) over the parameters Θ, the probability of the action at+1 is given by
\[
P(a_{t+1}|\hat{a}o_{\le t}) \;=\; \sum_\theta P(a_{t+1}|\theta, ao_{\le t})\, P(\theta|\hat{a}o_{\le t}), \tag{9.9}
\]
where the posterior probability over operation modes is given by the recursion
\[
P(\theta|\hat{a}o_{\le t}) \;=\; \frac{P(o_t|\theta, ao_{<t}a_t)\, P(\theta|\hat{a}o_{<t})}{\sum_{\theta'} P(o_t|\theta', ao_{<t}a_t)\, P(\theta'|\hat{a}o_{<t})}. \tag{9.10}
\]
Table 9.1: Summary of the Bayesian control rule.

where α, β > 0 are learning rates. The exploration strategy chooses with fixed probability pexp > 0 the action a that maximizes Q(x, a) + C/F(x, a), where C is a constant and F(x, a) represents the number of times that action a has been tried in state x. Thus, higher values of C enforce increased exploration.
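The exploration strategy just described can be sketched as follows (a minimal illustration; the Q-values, visit counts and constants are hypothetical, and the guard against a zero visit count is an addition of this sketch):

```python
import random

# Minimal sketch of the count-based exploration strategy: with probability
# P_EXP, pick the action maximizing the bonus-adjusted value
# Q(x, a) + C / F(x, a); otherwise act greedily on Q alone.
P_EXP, C = 0.2, 30.0  # hypothetical exploration probability and constant

def select_action(Q, F, x, actions, rng=random):
    if rng.random() < P_EXP:
        # max(F, 1) guards against unvisited pairs (an addition of this sketch)
        return max(actions, key=lambda a: Q[(x, a)] + C / max(F[(x, a)], 1))
    return max(actions, key=lambda a: Q[(x, a)])
```

Larger C inflates the bonus of rarely tried actions, reproducing the effect described above: more exploration at the price of slower exploitation.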
In a study by Mahadevan (1996), a grid-world is described that is especially useful as a
test bed for the analysis of RL algorithms. For our purposes, it is of particular interest
because it is easy to design experiments containing suboptimal limit-cycles. Figure 9.11,
panel (a), illustrates the 7× 7 grid-world. A controller has to learn a policy that leads
it from any initial location to the goal state. At each step, the agent can move to any
adjacent space (up, down, left or right). If the agent reaches the goal state then its
next position is randomly set to any square of the grid (with uniform probability) to
start another trial. There are also “one-way membranes” that allow the agent to move
into one direction but not into the other. In these experiments, these membranes form
“inverted cups” that the agent can enter from any side but can only leave through the
bottom, playing the role of a local maximum. Transitions are stochastic: the agent moves to the intended square with probability p = 9/10 and to one of the free adjacent squares (chosen uniformly) with probability 1 − p = 1/10. Rewards are assigned as
follows. The default reward is r = 0. If the agent traverses a membrane it obtains a
reward of r = 1. Reaching the goal state assigns r = 2.5. The parameters chosen for this
simulation were the following. For our MDP-agent, we have chosen hyperparameters
µ0 = 1 and λ0 = 1 and precision p = 1. For R-learning, we have chosen learning rates
α = 0.5 and β = 0.001, and the exploration constant has been set to C = 5, C = 30
and to C = 200. A total of 10 runs were carried out for each algorithm. The results
are presented in Figure 9.11 and Table 9.2. R-learning only learns the optimal policy
given sufficient exploration (panels d & e, bottom row), whereas the Bayesian control
rule learns the policy successfully. In Figure 9.11f, the learning curves of R-learning for C = 5 and C = 30 are initially steeper than that of the Bayesian controller. However, the latter attains a higher average reward from around time step 125,000 onwards. We attribute this shallow initial transient to the phase where the distribution over the operation modes is still flat, which is also reflected in the initially random exploratory behavior.
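The grid-world dynamics described above can be sketched compactly. In the code below the membrane coordinates are illustrative placeholders (the thesis arranges them as “inverted cups”, whose exact layout is not reproduced here), and the slip is simplified to retrying a random move rather than picking among free squares only:

```python
import random

# Sketch of the grid-world dynamics; BLOCKED lists hypothetical membrane
# edges, i.e. (cell, direction) pairs whose crossing is forbidden.
SIZE = 7
GOAL = (0, 6)
P_CORRECT = 9 / 10
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right
BLOCKED = {((2, 2), (-1, 0)), ((2, 3), (-1, 0))}  # hypothetical layout

def in_grid(pos):
    return 0 <= pos[0] < SIZE and 0 <= pos[1] < SIZE

def step(pos, move, rng=random):
    """One stochastic transition: the intended move succeeds with probability
    9/10; otherwise a random adjacent move is attempted instead."""
    if rng.random() >= P_CORRECT:
        move = rng.choice(MOVES)
    target = (pos[0] + move[0], pos[1] + move[1])
    if not in_grid(target) or (pos, move) in BLOCKED:
        return pos, 0.0                  # bump: stay put, default reward r = 0
    if target == GOAL:
        # reaching the goal pays r = 2.5 and restarts at a random square
        return (rng.randrange(SIZE), rng.randrange(SIZE)), 2.5
    # crossing a membrane in the permitted direction pays r = 1
    reverse = (-move[0], -move[1])
    reward = 1.0 if (target, reverse) in BLOCKED else 0.0
    return target, reward
```

Because membrane crossings pay r = 1 while the goal pays r = 2.5, a controller that keeps cycling through a membrane “cup” earns steady reward, which is precisely the suboptimal limit-cycle the experiment is designed to expose.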
Average Reward
BCR 0.3582 ± 0.0038
R-learning, C = 200 0.2314 ± 0.0024
R-learning, C = 30 0.3056 ± 0.0063
R-learning, C = 5 0.2049 ± 0.0012
Table 9.2: Average reward attained by the different algorithms at the end of the run. The mean and the standard deviation have been calculated from 10 runs.
9.6 Critical Issues
Problems of Bayesian methods. The Bayesian control rule treats an adaptive
control problem as a Bayesian inference problem. Hence, all the problems typically
associated with Bayesian methods carry over to agents constructed with the Bayesian
control rule. These problems are of both analytical and computational nature. For
example, there are many probabilistic models where the posterior distribution does not
have a closed-form solution. Also, exact probabilistic inference is in general computa-
tionally very intensive. Even though there is a large literature in efficient/approximate
inference algorithms for particular problem classes Bishop (2006), not many of them
are suitable for on-line probabilistic inference in more realistic environment classes.
Bayesian control rule versus Bayes-optimal control. Directly maximizing the
(subjective) expected utility for a given environment class is not the same as minimizing
the expected relative entropy for a given class of operation modes. The two methods
are based on different assumptions and optimality principles. As such, the Bayesian
control rule is not a Bayes-optimal controller. Indeed, it is easy to design experiments where the Bayesian control rule converges to the maximum utility exponentially more slowly than a Bayes-optimal controller, or does not converge at all. Consider the following simple example. Environment 1 is a k-state MDP in which only a sequence of k consecutive A-actions reaches a state with reward +1; any interspersed B-action leads back to the initial state. Environment 2 is like the first, but with the actions A and B interchanged. A Bayes-optimal controller figures out the true environment in k actions (either k consecutive A’s or B’s). Consider now the Bayesian control rule: the optimal action in Environment 1 is A, and in Environment 2 it is B. A uniform (1/2, 1/2) prior over the operation modes remains a uniform posterior as long as no reward has been observed. Hence the Bayesian control rule chooses A and B with equal probability at each time step. With this policy it takes about 2^k actions to accidentally produce a run of k consecutive A’s (or B’s). From then on the Bayesian control rule is optimal too.
So a Bayes-optimal controller converges in time k, while the Bayesian control rule needs
exponentially longer. One way to remedy this problem might be to allow the Bayesian
control rule to sample actions from the same operation mode for several time steps in
a row rather than randomizing controllers in every cycle. However, if one considers
non-stationary environments this strategy can also break down. Consider, for example,
an increasing MDP with k = ⌈10√t⌉, in which a Bayes-optimal controller converges in 100 steps, while the Bayesian control rule does not converge at all in most realizations, because the boundedness assumption is violated.
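The exponential gap in the example above can be checked with a standard run-length formula: the expected number of fair coin flips until the first run of k consecutive heads is 2^{k+1} − 2, i.e. of the order 2^k as stated.

```python
# Expected number of fair coin flips until the first run of k consecutive
# heads (a standard result): E_k = 2^(k+1) - 2.  This is the order of the
# waiting time of the undirected A/B exploration discussed above, versus
# the k steps needed by a Bayes-optimal controller.
def expected_flips_for_run(k: int) -> int:
    return 2 ** (k + 1) - 2

for k in (5, 10, 20):
    print(f"k={k}: Bayes-optimal ~ {k} steps,"
          f" random A/B ~ {expected_flips_for_run(k)} steps")
```

Already for k = 20 the undirected strategy needs on the order of two million actions where the Bayes-optimal controller needs twenty.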
9.7 Relation to Existing Approaches
Some of the ideas underlying this work are not unique to the Bayesian control rule. The
following is a selection of previously published work in the recent Bayesian reinforcement
learning literature where related ideas can be found.
Compression principles. In the literature, there is a substantial body of work relating compression to intelligence (MacKay, 2003; Hutter, 2004a). In particular, it has even been proposed that compression ratio is an objective quantitative measure of intelligence (Mahoney, 1999). Compression has also been used as a basis for a theory of curiosity, creativity and beauty (Schmidhuber, 2009).
Mixture of experts. Passive sequence prediction by mixing experts has been studied
extensively in the literature (Cesa-Bianchi and Lugosi, 2006). In a study on online-
predictors (Hutter, 2004b), Bayes-optimal predictors are mixed. Bayes-mixtures can
also be used for universal prediction (Hutter, 2003). For the control case, the idea
of using mixtures of expert-controllers has been previously evoked in models like the
MOSAIC-architecture (Haruno, Wolpert, and Kawato, 2001). Universal learning with
Bayes mixtures of experts in reactive environments has been studied in the work of
Poland and Hutter (2005) and Hutter (2002).
Stochastic action selection. The idea of using actions as random variables, and the problems that this entails, have been discussed in the work of Hutter (2004a, Problem 5.1). The study in this chapter can be regarded as a thorough investigation of this
open problem. Other stochastic action selection approaches are found in the thesis of
Wyatt (1997) who examines exploration strategies for (PO)MDPs, in learning automata
(Narendra and Thathachar, 1974) and in probability matching (Duda, Hart, and Stork,
2001) amongst others. In particular, Wyatt’s thesis discusses theoretical properties of an extension of probability matching in the context of multi-armed bandit problems: there, it is proposed to choose a lever according to how likely it is to be optimal, and it is shown that this strategy converges, thus providing a simple method for guiding exploration.
9.8 Derivation of Gibbs Sampler for MDP Agent
Inserting the likelihood given in Equation 9.13 into Equation 9.9 of the Bayesian control
rule, one obtains the following expression for the posterior
\[
P(\theta|\hat{a}_{\le t}, o_{\le t}) \;\propto\; P(x'|\theta, x, a)\, P(r|\theta, x, a, x')\, P(\theta|\hat{a}_{<t}, o_{<t}).
\]

Under the view of the presented framework, perfect rationality is only an idealization. Conceptually, the distinguishing feature of bounded SEU (compared to classical SEU) is that it provides an explanatory framework for approximations.
Causality. Causality is a field that has historically been highly controversial, and it is only recently that mathematical formalizations of it have started to find wider acceptance. Still, so far these formalizations have not clarified the connection to measure theory, the mathematically rigorous theory of probability. The framework introduced in this thesis takes a step in this direction. Simultaneously, this thesis clarifies the importance of causal considerations in agent design. More specifically, the Bayesian I/O model (Section 8.3.1), which is really a causal model, allows enriching the classical Bayesian autonomous system, which has uncertainty only over its environment, with uncertainty over its very policy.
Bayesian control rule. Agents that are constructed following the maximum SEU principle have to carry out massive computations before they have had even a single interaction with their environment. To bypass this limitation, most practical algorithms implement autonomous systems that “discover” their policy during the interactions with the environment. In this thesis, we have used the framework of bounded SEU and causal reasoning to derive the Bayesian control rule: a rule that allows constructing a natural class of adaptive agents having uncertainty over their policies. Formally, the Bayesian control rule for outputs is the probabilistic equivalent of the predictive distribution for inputs. Furthermore, this thesis presents a convergence proof for the Bayesian control rule in a very restricted setting.
10.3 What is missing?
Understanding the Implications of the Relations. If one is willing to accept
the connections between decision theory, information theory and thermodynamics that
this thesis puts forward, then one obtains a potentially very fertile ground for novel
ideas and reinterpretations. While this thesis shows some of the benefits of adopting
this unified view, it also leaves many questions unanswered. For instance, the utility-
information conversion factor α controls the cost of translating resources into utilities.
On a very abstract level, one could blame the failure of approximations on an excessively large value of α. But what does α mean in practice? How can it be influenced? Questions such as these abound and need to be addressed in the future.
Descriptive Power. While the bounded SEU framework introduced in this thesis
has a theoretical appeal due to its properties and simplicity, it remains to be seen
whether it can explain human decision making, and especially, whether it can explain
the experimental evidence (Allais, 1953; Ellsberg, 1961; Kahneman and Tversky, 1979;
Machina, 1987; Kreps, 1988) that contradicts perfect rationality. In particular, it would
be important to verify whether the causality framework and/or the Bayesian control
rule can predict aspects of human decision making.
Intelligence Measure. Legg and Hutter (2006) proposed a formal measure of machine intelligence. While this measure is a synthesis of many informal definitions of human intelligence given by experts, it has been constructed mainly by borrowing ideas from the theory of universal artificial intelligence, thus staying within the paradigm of agency with unlimited resources. It would be interesting to test whether this intelligence measure can be adapted to include agency with bounded resources.
Game Theory. It is not at all clear how the design of autonomous systems connects
to the game theoretic literature. The fundamental difference lies in the assumptions.
In artificial intelligence and control theory, one assumes a dynamical model of the
environment first, and then constructs a suitable agent. In game theory, by contrast, one only assumes a utility function describing the environment’s preferences. This assumption, however, does not provide enough information to derive the dynamical model of the environment, since this model must depend on the assumptions the environment makes about the agent. Under this point of view, additional solution concepts (collectively called equilibria), other than the maximum expected utility principle, are required to derive the resulting behavior of the interaction system. In the context of this thesis, an important open question is whether the
Bayesian I/O model for agents is “essentially” equivalent to the Bayesian game frame-
work (Harsanyi, 1967–1968), and whether game theoretic concepts can be extended to
the case of resource-bounded agency.
References
M. Allais. Le comportement de l’homme rationnel devant le risque: critique des postulats et axiomes de l’école américaine. Econometrica, 21:503–546, 1953.
D. Angluin. Computational learning theory: survey and selected bibliography. In Proceedings of the Twenty-Fourth Annual ACM Symposium on Theory of Computing, STOC ’92, pages 351–369, New York, NY, USA, 1992. ACM.
F. J. Anscombe and R. J. Aumann. A definition of subjective probability. The Annals
of Mathematical Statistics, 34(1):199–205, mar 1963.
R. B. Ash. Information Theory. New York: Interscience, 1965.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
T. Bayes. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 1763.
P. Beame. A general sequential time-space tradeoff for finding unique elements. In
Proceedings of the twenty-first annual ACM symposium on Theory of computing,
STOC ’89, pages 197–203, New York, NY, USA, 1989. ACM.
P. Beame, T. S. Jayram, and M. Saks. Time-space tradeoffs for branching programs.
J. Comput. Syst. Sci., 63:542–572, December 2001.
P. Beame, M. Saks, X. Sun, and E. Vee. Time-space trade-off lower bounds for ran-
domized computation of decision problems. J. ACM, 50:154–195, March 2003.
R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
C. H. Bennett. Logical reversibility of computation. IBM Journal of Research and
Development, 17(6):525–532, 1973.
C. H. Bennett. The thermodynamics of computation: a review. International Journal of Theoretical Physics, 21(12):905–940, 1982.
D. Bernoulli. Specimen theoriae novae de mensura sortis. Commentarii Academiae
Scientiarum Imperialis Petropolitanae, 1738. (trans. in 1954, Econometrica).
D. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-
Hall, Upper Saddle River, NJ, 1987.
P. Billingsley. Ergodic Theory and Information. R. E. Krieger Pub. Co., 1978.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
K. Borch. A note on uncertainty and indifference curves. The Review of Economic
Studies, 36(1):1–4, 1969.
E. Borel. Lec¸ons sur la the´orie des fonctions. Gauthier-Villars, Paris, 1898.
A. Borodin and S. Cook. A time-space tradeoff for sorting on a general sequential model
of computation. In Proceedings of the twelfth annual ACM symposium on Theory of
computing, STOC ’80, pages 294–301, New York, NY, USA, 1980. ACM.
D. A. Braun and P. A. Ortega. A minimum relative entropy principle for adaptive con-
trol in linear quadratic regulators. In Proceedings of the 7th international conference
on informatics in control, automation and robotics, page (in press), 2010.
H. J. Bremermann. Quantum noise and information. In 5th Berkeley Symposium on
Mathematical Statistics and Probability, Berkeley, California, 1965. Univ. of Califor-
nia Press.
L. Brillouin. Maxwell’s demon cannot operate: Information and entropy. I. Journal of
Applied Physics, 22:334–337, 1951.
L. Brillouin. Science and Information Theory. New York: Academic Press., 1956.
H. B. Callen. Thermodynamics and an Introduction to Thermostatistics. John Wiley &
Sons, 2nd edition, 1985.
C. F. Camerer and M. Weber. Recent developments in modelling preferences: uncer-
tainty and ambiguity. J. Risk Uncertain., 5:325–370, 1992.
J. C. Candeal, J. R. De Miguel, E. Indura´in, and G. B. Mehta. Utility and entropy.
Economic Theory, 17:233–238, 2001.
N. Cartwright. Nature’s Capacities and Their Measurement. Claredon Press, Oxnard,
1989.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning and Games. Cambridge University
Press, 2006.
T. M. Cover and J. A. Thomas. Elements of information theory. New York: Wiley-
Interscience, 1st edition, 1991.
R. T. Cox. The Algebra of Probable Inference. Johns Hopkins, 1961.
A. P. Dawid. Fundamentals of statistical causality. Research Report 279, Department
of Statistical Science, University College London, 2007.
B. De Finetti. La pre´vision: Ses lois logiques, ses sources subjectives. In Annales de
l’Institut Henri Poincare´, volume 7, pages 1–68, 1937.
R. Dearden, N. Friedman, and S. Russell. Bayesian q-learning. In AAAI ’98/IAAI
’98: Proceedings of the fifteenth national/tenth conference on Artificial intelli-
gence/Innovative applications of artificial intelligence, pages 761–768, Menlo Park,
CA, US, 1998. American Association for Artificial Intelligence.
R. Dearden, N. Friedman, and D. Andre. Model based bayesian exploration. In In
Proceedings of Fifteenth Conference on Uncertainty in Artificial Intelligence, pages
150–159, 1999.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley & Sons, Inc.,
second edition, 2001.
M. O’G. Duff. Optimal learning: computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts Amherst, 2002. Director: Andrew Barto.
E. Eells. Probabilistic Causality. Cambridge University Press, 1991.
D. Ellsberg. Risk, ambiguity and the savage axioms. The Quaterly Journal of Eco-
nomics, 75:643–669, 1961.
R. P. Feynman. Feynman Lectures on Computation. Westview Press, 2nd edition,
2000.
P. C. Fishburn. The Foundations of Expected Utility. D. Reidel Publishing, Dordrecht,
1982.
R. A. Fisher. Statistical Methods for Research Workers. Macmillan Pub. Co., 13th
edition, 1970.
M. Fre´chet. De´finition de l’inte´grale d’une fonctionnelle e´tendue a` un ensemble abstrait.
Comptes Rendus de l’Acade´mie des sciences, 1915.
D. Gabor. Light and information. Progress in Optics, 1:111–153, 1964.
R. Gallager. Information Theory and Reliable Communication. New York: John Wiley
and Sons, 1968.
H. Goldstein. Classical Mechanics. Addison-Wesley, 2nd edition, 1980.
P. D. Gru¨nwald. The Minimum Description Length Principle. MIT Press, 2007.
J. C. Harsanyi. Games with incomplete information played by “Bayesian” players, I–III. Management Science, 14:159–182, 320–334, 486–502, 1967–1968.
M. Haruno, D. M. Wolpert, and M. Kawato. Mosaic model for sensorimotor learning
and control. Neural Computation, 13:2201–2220, 2001.
M. Hutter. Self-optimizing and Pareto-optimal policies in general environments based on Bayes-mixtures. In COLT, 2002.
M. Hutter. Optimality of universal Bayesian prediction for general loss and alphabet.
Journal of Machine Learning Research, 4:971–997, 2003.
M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic
Probability. Springer, Berlin, 2004a.
M. Hutter. Online prediction – Bayes versus experts. Technical report, July 2004b. Presented at the EU PASCAL Workshop on Learning Theoretic and Bayesian Inductive Principles (LTBIP-2004).
E. T. Jaynes and G. L. Bretthorst. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
H. Jeffreys. The Theory of Probability. The Clarendon Press, 1939.
M. I. Jordan, Z. Ghahramani, and L. K. Saul. Hidden Markov decision trees. In
Advances in Neural Information Processing Systems 9, 1997.
D. Kahneman and A. Tversky. Prospect theory: An analysis of decision under risk.
Econometrica, 47:263–291, 1979.
R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions
of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960.
B. Kappen, V. Gomez, and M. Opper. Optimal control as a graphical model inference
problem. arXiv:0901.0633, 2009.
M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT
Press, 1994.
G. Keller. Equilibrium States in Ergodic Theory. London Mathematical Society Student
Texts. Cambridge University Press, 1998.
A. I. Khinchin. Mathematical Foundations of Information Theory. New York: Dover,
1957.
F. H. Knight. Risk, Uncertainty, and Profit. Houghton Mifflin, Boston, 1921.
A. N. Kolmogorov. Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer, Berlin,
1933.
A. N. Kolmogorov. Three approaches to the quantitative definition of information.
International Journal of Computer Mathematics, 1968.
L. G. Kraft. A Device for Quantizing, Grouping, and Coding Amplitude Modulated Pulses. MS thesis, Electrical Engineering Department, Massachusetts Institute of Technology, Cambridge, MA, 1949.
D. M. Kreps. Notes on the Theory of Choice. Westview Press, 1988.
S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathe-
matical Statistics, 22(1):79–86, mar 1951.
R. Landauer. Irreversibility and heat generation in the computing process. IBM Journal
of Research and Development, 5(3):183–191, 1961.
P. S. Laplace. Me´moires sur la probabilite´ des causes par les e´ve`nements. Me´moires
de mathe´matique et des physiques presentee´s a` l’Acade´mie royale des sciences, par
divers savans, & luˆs dans ses assemble´es, 6:621–656, 1774.
H. Lebesgue. Leçons sur l’intégration et la recherche des fonctions primitives. Gauthier-Villars, Paris, 1904.
S. Legg. Machine Super Intelligence. PhD thesis, Department of Informatics, University
of Lugano, June 2008.
S. Legg and M. Hutter. A formal measure of machine intelligence. In Annual machine
learning conference of Belgium and The Netherlands (Benelearn-2006), Ghent, 2006.
D. Lewis. Counterfactuals. Cambridge, MA: Harvard University Press, 1973.
M. Li and P. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications
(Texts in Computer Science). Springer, February 2008.
R. D. Luce. Individual Choice Behavior: A Theoretical Analysis. New York: Wiley,
1959.
M. Machina. Choice under uncertainty: Problems solved and unsolved. Journal of
Economic Perspectives, 1:121–154, 1987.
D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge
University Press, 2003.
S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and
empirical results. Machine Learning, 22(1-3):159–195, 1996.
M. V. Mahoney. Text compression as a test for artificial intelligence. In AAAI/IAAI,
pages 486–502, 1999.
J. C. Maxwell. Letter to P. G. Tait, 11 December 1867.
W. S. McCulloch and W. Pitts. A logical calculus of ideas immanent in neural activity.
Bulletin of Mathematical Biophysics, 5:115–133, 1943.
B. McMillan. Two inequalities implied by unique decipherability. IEEE Trans. Infor-
mation Theory, 2(4):115–116, 1956.
D. H. Mellor. The Facts of Causation. Routledge, 1995.
D. Michie. Game-playing and game-learning automata. Advances in Programming &
Non-Numerical Computation, pages 183–200, 1966.
K. Narendra and M. A. L. Thathachar. Learning automata - a survey. IEEE Transac-
tions on Systems, Man, and Cybernetics, SMC-4(4):323–334, July 1974.
J. Neyman. First course in probability and statistics. Holt, New York, 1950.
N. J. Nilsson. Artificial Intelligence: A New Synthesis. Morgan Kaufmann Publishers,
San Francisco, 1998.
R. Nozick. Newcomb’s problem and two principles of choice. In N. Rescher, editor,
Essays in Honor of Carl G. Hempel, pages 114–146. Reidel, 1969.
P. A. Ortega and D. A. Braun. A minimum relative entropy principle for learning and
acting. Journal of Artificial Intelligence Research, 38:475–511, 2010a.
P. A. Ortega and D. A. Braun. A conversion between utility and information. In
The third conference on artificial general intelligence, pages 115–120, Paris, 2010b.
Atlantis Press.
P. A. Ortega and D. A. Braun. A bayesian rule for adaptive control based on causal
interventions. In The third conference on artificial general intelligence, pages 121–
126, Paris, 2010c. Atlantis Press.
P. A. Ortega and D. A. Braun. An axiomatic formalization of bounded rationality
based on a utility-information equivalence. arXiv:1007.0940, 2010d.
M. J. Osborne and A. Rubinstein. A Course in Game Theory. MIT Press, 1999.
C. H. Papadimitriou. Computational Complexity. Addison Wesley, 1993.
D. C. Parkes. Bounded rationality. Technical report, University of Pennsylvania, 1997.
J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press,
Cambridge, UK, 2000.
J. Poland and M. Hutter. Defensive universal learning with experts. In ALT, 2005.
K. R. Popper. The Logic of Scientific Discovery (Routledge Classics). Routledge, 1934.
T. W. Pratt and M. V. Zelkowitz. Programming Languages: Design and Implementa-
tion. Prentice Hall, 4th edition, 2000.
F. P. Ramsey. The Foundations of Mathematics and Other Logical Essays, chapter
‘Truth and Probability’. Harcourt, Brace and Co., 1926. posthumously published in
1931.
C. E. Rasmussen and M. P. Deisenroth. Recent Advances in Reinforcement Learning,
volume 5323 of Lecture Notes on Computer Science, LNAI, chapter Probabilistic
Inference for Fast Learning in Control, pages 229–242. Springer-Verlag, 2008.
H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.
A. Rosenblueth, N. Wiener, and J. Bigelow. Behavior, purpose, and teleology. Philos-
ophy of Science, 10:18–24, 1943.
A. Rubinstein. Modeling Bounded Rationality. MIT Press, 1998.
S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 3rd edition, 2009.
S. Russell and E. Wefald. Principles of metareasoning. Artificial Intelligence, 49:361–
395, 1991.
J. E. Savage. Models of Computation: Exploring the Power of Computing. Addison
Wesley Publishing Company, 1998.
L. J. Savage. The Foundations of Statistics. John Wiley and Sons, New York, 1954.
J. Schmidhuber. Simple algorithmic theory of subjective beauty, novelty, surprise,
interestingness, attention, curiosity, creativity, art, science, music, jokes. Journal of
SICE, 48(1):21–32, 2009.
G. Shafer. The Art of Causal Conjecture. MIT Press, 1996.
C. E. Shannon. A mathematical theory of communication. Bell System Technical
Journal, 27:379–423 and 623–656, Jul and Oct 1948.
H. Simon. Models of Bounded Rationality. MIT Press, 1982.
H. A. Simon. A behavioral model of rational choice. Quarterly Journal of Economics, 69:99–118, 1955.
S. P. Singh. Reinforcement learning algorithms for average-payoff Markovian decision processes. In National Conference on Artificial Intelligence, pages 700–705, 1994.
M. Sipser. Introduction to the Theory of Computation. PWS Pub. Co., 1996.
R. J. Solomonoff. A formal theory of inductive inference. part 1 & 2. Information and
Control, 7:1–22, 224–254, 1964.
P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2001.
R. Stalnaker. Ifs: Conditionals, Belief, Decision, Chance, and Time, chapter A Theory of Conditionals, pages 41–56. Dordrecht: Reidel, 1968.
P. Suppes. A Probabilistic Theory of Causality. Amsterdam: North-Holland Publishing
Company, 1970.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press,
Cambridge, MA, 1998.
L. Szilard. On the decrease of entropy in a thermodynamic system by the intervention
of intelligent beings. Zeitschrift fu¨r Physik, 53:840–856, 1929.
E. Todorov. Linearly solvable markov decision problems. In Advances in Neural Infor-
mation Processing Systems, volume 19, pages 1369–1376, 2006.
E. Todorov. General duality between optimal control and estimation. In Proceedings
of the 47th conference on decision and control, pages 4286–4292, 2008.
E. Todorov. Efficient computation of optimal actions. Proceedings of the National
Academy of Sciences U.S.A., 106:11478–11483, 2009.
M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic inference for solving
(po)mdps, 2006.
M. Tribus and E. C. McIrvine. Energy and information. Scientific American, 225:
179–188, 1971.
J. N. Tsitsiklis. Computational complexity in Markov decision theory. HERMIS—An
International Journal of Computer Mathematics and its Applications, 9(1):45–54,
2007.
J. Veness, K. S. Ng, M. Hutter, and D. Silver. Reinforcement learning via AIXI approximation. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
J. Veness, K. S. Ng, M. Hutter, W. Uther, and D. Silver. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95–142, 2011.
J. Venn. The Logic of Chance. Macmillan Pub. Co., London, 1st edition, 1866.
R. von Mises. Grundlagen der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 5:52–99, 1919.
J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior.
Princeton University Press, 1944.
C. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge,
Cambridge, England, 1989.
N. Wiener. Cybernetics, Second Edition: or the Control and Communication in the
Animal and the Machine. The MIT Press, 1965.
J. Wyatt. Exploration and Inference in Learning from Reinforcement. PhD thesis,
Department of Artificial Intelligence, University of Edinburgh, 1997.
Index
act, 14
constant, 15
action, 9
agent, 9
algebra, 6, 31
generated, 94
ambiguity, 106
atom, 94
atom set, 94
generated, 95
axioms
belief, 34
causal, 96
probability, 6
rationality, 15
truth, 31
Bayes’ rule, 35
Bayesian control rule, 110
behavioral function, 19
belief
function, 34
induced space, 97
posterior, 35
prior, 35
space, 34
Bellman optimality equations, 22
bounded variation, 115
capacity, 54
causal
n-th function, 96
function, 96
space, 97
channel, 54
circuit
logic, 63
codeword, 56
codeword length, 56
complement, 31
conjugate, 79
consequence, 14
construction method
for control, 83
for estimation, 84
control
adaptive estimative, 109
adaptive optimal, 45
bounded optimal, 85
optimal, 25
core, 117
cross-entropy, 58
cycle, 9
data, 35
decision, 102
divergence
Kullback-Leibler, 60
divergence process, 112
dominates, 41
dynamic programming, 27
element, 5
empty string, 5
entropy, 58
relative, 60
environment, 9
ε, see empty string
equilibrium, 135
Nash, 106
event, 6, 14, 31
null, 15
primitive, 95
false, 31
free utility, 80
horizon, 9
hypothesis, 35
information content, 58
interaction, 9
intersection, 31
intervention, 93, 98
knows, 22
Kraft-McMillan inequality, 56
learning theory
computational, 53
likelihood, 35
logarithm, 5
natural, 5
machine
random access, 64
sequential processing, 64
Turing, 64
Markov decision problem, 27
MDP, see Markov decision problem
measure
probability, 6
measurement, 100
mixture distribution, 38
model
Bayesian I/O, 99
Bayesian input, 38
Bayesian output, 99
I/O, 22
induced I/O, 102
induced input, 39
input, 22
output, 9
natural numbers, 5
observation, 9
operation mode, 109
set, 109
optimal, 22
outcome, 6, 31, 94
parameter
unknown, 39, 99
plausibility, 22
policy, 22
Bayes optimal, 45
consistent, 119
powerset, 6
predictive distribution, 38
predictor, 22
preference
conditional, 14
preference relation, 14
indifference, 14
rational, 15
strict, 14
prefix code, 56
complete, 57
prefix free, 55
probability measure
generative, 10
product rule, 34
program
branching, 70
propensity, 22
random variable, 6
rational, 14
rationality
bounded, 53
real numbers, 5
reasoning
meta-, 53
reward function, 24
risk, 106
sample, 6
set, 5
SEU principle
maximum, 21
maximum bounded, 83
space
measurable, 6
probability, 6
sample, 6
truth, 32
state, 14, 67
stream probabilities, 10
string, 5
sub-divergence, 114
subjective expected utility
bounded, 82
substring, 5
sum rule, 34
symbol, 5
system
autonomous, 5
free, 76
true, 31
truth function, 31
truth state, 31
truth value, 31
uncertain, 31
union, 31
utile, 79
utility function, 77
utility gain function, 76, 77
value function, 25