A Unified Framework for Resource-Bounded Autonomous Agents Interacting with Unknown Environments

Pedro A. Ortega
Department of Engineering
University of Cambridge

A thesis submitted for the degree of Doctor of Philosophy
September 2010

To my parents Pedro and Pilar.

Acknowledgements

This thesis is the result of four years of work, and it would not have been possible without the motivation and support of many people. Thanks to them, the years I spent in Cambridge have been amongst the happiest of my life. I especially want to thank my family Pedro, Pilar, Carolina and Paulina and my closest friends Francisca Albert, Paul Aguayo, Daniel Braun, Oscar Van Heerden, Emre Karaa, Aliff Mohamad, José Donoso, Aditya Saxena, Loreto Valenzuela, Horacio Tate, Aaron Lobo, Zoi Roupakia, Aysha Roohi, Ben Mansfield, Francisco Pérez and Mauricio Gaete. Also, I am deeply indebted to Disa Helander and Raffaella Nativio. Special thanks go to my friend Daniel Braun, who has been closely collaborating with me during the course of my study. I also want to thank José Aliste, John Cunningham, José Donoso, Marcus Hutter, Humberto Maturana, Gonzalo Ruz and David Wingate for their invaluable help and comments on earlier versions of this manuscript. The present study has been supported by the Ministerio de Planificación de Chile (MIDEPLAN), the Comisión Nacional de Investigación Científica y Tecnológica (CONICYT) and the Böhringer-Ingelheim-Fonds (BIF). Finally, I want to thank my supervisor Zoubin Ghahramani for his guidance, motivation and support.

Abstract

The aim of this thesis is to present a mathematical framework for conceptualizing and constructing adaptive autonomous systems under resource constraints. The first part of this thesis contains a concise presentation of the foundations of classical agency: namely the formalizations of decision making and learning. Decision making includes: (a) subjective expected utility (SEU) theory, the framework of decision making under uncertainty; (b) the maximum SEU principle to choose the optimal solution; and (c) its application to the design of autonomous systems, culminating in the Bellman optimality equations. Learning includes: (a) Bayesian probability theory, the theory for reasoning under uncertainty that extends logic; and (b) Bayes-optimal agents, the application of Bayesian probability theory to the design of optimal adaptive agents. Then, two major problems of the maximum SEU principle are highlighted: (a) the prohibitive computational costs and (b) the need for the causal precedence of the choice of the policy. The second part of this thesis tackles the two aforementioned problems. First, an information-theoretic notion of resources in autonomous systems is established. Second, a framework for resource-bounded agency is introduced. This includes: (a) a maximum bounded SEU principle that is derived from a set of axioms of utility; (b) an axiomatic model of probabilistic causality, which is applied to the formalization of autonomous systems having uncertainty over their policy and environment; and (c) the Bayesian control rule, which is derived from the maximum bounded SEU principle and the model of causality, implementing a stochastic adaptive control law that deals with the case where autonomous agents are uncertain about their policy and environment.

Contents

Preface xvii
1 Introduction 1
1.1 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . .
2 I Foundations of Classical Agency 3 2 Characterization of Behavior 5 2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Basic Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Probabilities & Random Variables . . . . . . . . . . . . . . . . . 6 2.2 Models of Autonomous Systems . . . . . . . . . . . . . . . . . . . . . . . 6 2.3 Output Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3 Decision Making 13 3.1 Subjective Expected Utility . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.1.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.1.2 Rationality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.1.3 Representation Theorem . . . . . . . . . . . . . . . . . . . . . . . 18 3.2 The Maximum Subjective Expected Utility Principle . . . . . . . . . . . 19 3.2.1 SEU in Autonomous Systems . . . . . . . . . . . . . . . . . . . . 19 3.2.2 I/O Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.2.3 Bellman Optimality Equations . . . . . . . . . . . . . . . . . . . 22 3.2.4 Subjective versus True Expected Utility . . . . . . . . . . . . . . 25 3.3 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 26 4 Learning 29 4.1 Bayesian Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . 30 4.1.1 Reasoning under Certainty . . . . . . . . . . . . . . . . . . . . . 31 4.1.2 Reasoning under Uncertainty . . . . . . . . . . . . . . . . . . . . 33 4.1.3 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.2 Adaptive Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.2.1 Bayesian Input Model . . . . . . . . . . . . . . . . . . . . . . . . 37 v CONTENTS 4.2.2 Predictive Distribution . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2.3 Induced Input Model . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2.4 Convergence of Predictive Distribution . . . . . . . . . . . . . . . 40 4.2.5 Bayes Optimal Agents . . . . . . . . . . . . . . . . . . . . . . . . 45 4.3 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 46 5 Problems of Classical Agency 47 5.1 Computational Cost and Precedence of Policy Choice . . . . . . . . . . 47 5.2 Is Rationality a Useful Concept? . . . . . . . . . . . . . . . . . . . . . . 48 5.3 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 49 II Resource-Bounded Agency 51 6 Resources 53 6.1 Preliminaries in Information Theory . . . . . . . . . . . . . . . . . . . . 54 6.1.1 The Communication Problem . . . . . . . . . . . . . . . . . . . . 54 6.1.2 Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.1.3 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 6.2 Resources as Information . . . . . . . . . . . . . . . . . . . . . . . . . . 60 6.2.1 Thermodynamical Interpretation . . . . . . . . . . . . . . . . . . 60 6.2.2 Computational Interpretation . . . . . . . . . . . . . . . . . . . . 63 6.3 Resource Costs in Agents . . . . . . . . . . . . . . . . . . . . . . . . . . 66 6.3.1 Cost of Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . 67 6.3.2 Costs of Construction . . . . . . . . . . . . . . . . . . . . . . . . 68 6.4 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 69 7 Boundedness 71 7.1 An Example of Boundedness . . . . . . . . . . . . . . . . . . . . . . . . 71 7.2 Utility & Resources . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . 75 7.2.1 Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 7.2.2 Variational principle . . . . . . . . . . . . . . . . . . . . . . . . . 80 7.2.3 Bounded SEU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 7.3 Bounded SEU in Autonomous Systems . . . . . . . . . . . . . . . . . . . 85 7.3.1 Bounded Optimal Control . . . . . . . . . . . . . . . . . . . . . . 85 7.3.2 Adaptive Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 88 7.4 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 88 8 Causality 91 8.1 The Big Picture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 8.2 Causal Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 8.2.1 Interventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 8.3 Causality in Autonomous Systems . . . . . . . . . . . . . . . . . . . . . 99 8.3.1 Bayesian I/O Model . . . . . . . . . . . . . . . . . . . . . . . . . 99 vi CONTENTS 8.3.2 Causal Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 8.3.3 Belief Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 8.3.4 Induced I/O Model . . . . . . . . . . . . . . . . . . . . . . . . . . 102 8.4 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 104 9 Control as Estimation 105 9.1 Interlude: Dynamic versus Static . . . . . . . . . . . . . . . . . . . . . . 106 9.1.1 Risk versus Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . 106 9.2 Adaptive Estimative Control . . . . . . . . . . . . . . . . . . . . . . . . 109 9.3 Bayesian Control Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 9.4 Convergence of the Bayesian Control Rule . . . . . . . . . . . . . . . . . 111 9.4.1 Policy Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 9.4.2 Divergence Processes . . . . . . . . . . . . . . . . . . . . . . . . . 112 9.4.3 Decomposition of Divergence Processes . . . . . . . . . . . . . . 113 9.4.4 Bounded Variation . . . . . . . . . . . . . . . . . . . . . . . . . . 115 9.4.5 Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 9.4.6 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 9.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 9.5.1 Bandit Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 9.5.2 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . 124 9.6 Critical Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 9.7 Relation to Existing Approaches . . . . . . . . . . . . . . . . . . . . . . 129 9.8 Derivation of Gibbs Sampler for MDP Agent . . . . . . . . . . . . . . . 129 9.9 Historical Remarks & References . . . . . . . . . . . . . . . . . . . . . . 131 10 Discussion 133 10.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 10.2 What are the contributions? . . . . . . . . . . . . . . . . . . . . . . . . . 134 10.3 What is missing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 References 145 vii CONTENTS viii List of Figures 2.1 Behavioral Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Interaction System. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 An Output Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1 Setup of Subjective Expected Utility. . . . . . . . . . . . . . . . . . . . . 
14 3.2 Axiom R3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3 Axiom R5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.4 Axiom R6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.5 Axiom R7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.6 A Behavioral Function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.7 Combining Behavioral Functions determines an Interaction String. . . . 20 3.8 A Decision Tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.9 Solution of a Decision Tree. . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.1 A Fixed Predictor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2 An Adaptive Predictor. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.3 Truth Space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.4 Extension of Truth Function. . . . . . . . . . . . . . . . . . . . . . . . . 33 4.5 Bayes’ rule. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.6 Progressive Refinement of Accuracy. . . . . . . . . . . . . . . . . . . . . 36 4.7 Convergence of Predictive Distribution. . . . . . . . . . . . . . . . . . . 43 6.1 Communication Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.2 Prefix Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 6.3 Information Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 6.4 Probability versus Codeword Length . . . . . . . . . . . . . . . . . . . . 59 6.5 The Molecule-In-A-Box Device. . . . . . . . . . . . . . . . . . . . . . . . 61 6.6 A Generalized Molecule-In-A-Box Device. . . . . . . . . . . . . . . . . . 62 6.7 Time-Space Tradeoff. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 6.8 Logic Circuit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 6.9 Sequential Processing Machine. . . . . . . . . . . . . . . . . . . . . . . . 65 6.10 State Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 7.1 An Exhaustive Optimization. . . . . . . . . . . . . . . . . . . . . . . . . 72 ix LIST OF FIGURES 7.2 Distributions after Bounded Optimization. . . . . . . . . . . . . . . . . . 73 7.3 Performance of the Bounded Optimization. . . . . . . . . . . . . . . . . 74 7.4 Expected Value Penalized by Relative Entropy. . . . . . . . . . . . . . . 74 7.5 Transformation of a System . . . . . . . . . . . . . . . . . . . . . . . . . 83 8.1 A Three-Stage Randomized Experiment. . . . . . . . . . . . . . . . . . . 93 8.2 A Causal Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 8.3 An Intervention in the Probability Tree. . . . . . . . . . . . . . . . . . . 95 8.4 Primitive Events and their Atom Sets. . . . . . . . . . . . . . . . . . . . 96 8.5 Causal Space of an Autonomous System. . . . . . . . . . . . . . . . . . . 100 8.6 Updates following an Observation versus an Action. . . . . . . . . . . . 101 9.1 Risk versus Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 9.2 A Policy Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 9.3 Realization of Divergence Processes. . . . . . . . . . . . . . . . . . . . . 113 9.4 Policies Influence Divergence Processes . . . . . . . . . . . . . . . . . . . 113 9.5 Decomposition of a Divergence Process into Sub-Divergences . . . . . . 114 9.6 Bounded Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
115 9.7 Problems with Disambiguation. . . . . . . . . . . . . . . . . . . . . . . . 117 9.8 Inconsistent Policies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 9.9 Space of Bandit Configurations. . . . . . . . . . . . . . . . . . . . . . . . 122 9.10 Performance Comparison for Bandit Problem . . . . . . . . . . . . . . . 123 9.11 MDP Performance Results . . . . . . . . . . . . . . . . . . . . . . . . . . 126 x List of Notation Basic X A set or alphabet. N The set of natural numbers 1, 2, 3, . . . R The set of real numbers. ǫ The empty string. X n The set of strings of length n over the alphabet X . X ∗ The set of all finite strings over the alphabet X . xi:k The substring xixi+1 · · · xk−1xk. x≤i The string x1x2 · · · xi. ln(x) The natural logarithm of x. log(x) The logarithm base-2 of x. P(X ) The powerset of X . Pr An arbitrary probability distribution. Ω A sample space. F An algebra. Autonomous Agents A The set of actions. O The set of observations. Z The set of interactions, i.e. Z := A×O. at The action at time t. ot The observation at time t. aoi An interaction viewed as a symbol, i.e. aioi. T The horizon, i.e. the maximum length of interaction strings. Z⋄ The set of interaction strings up to length T . U The utility function over interactions strings. P The (behavioral) model of the agent, i.e. the distribution over interaction sequences implemented by the agent. It is used to denote the input model, the output model or the I/O model. xi LIST OF NOTATION Q The (behavioral) model of the environment, i.e. the distribution over interaction sequences implemented by the environment. It is used to denote the input model, the output model the I/O model. G The generative distribution, i.e. the sampling distribution over the interactions sequences that results from the interaction between the agent and the environment. P The belief model of the agent, i.e. the Bayesian mixture distribu- tion over interaction sequences. The symbol is used to denote the Bayesian input model, the Bayesian output model or the Bayesian I/O model. The belief model is a conceptual explanation that gives rise to a unique behavioral model. Model Types P(at|ao j. Logarithms are always taken with respect to base 2, thus log(2) = 1, unless written explicitly ln(x), in which case we mean natural 5 2. CHARACTERIZATION OF BEHAVIOR logarithms. The symbol P(X ) denotes the powerset of X , i.e. the set of all subsets of X . 2.1.2 Probabilities & Random Variables To simplify the exposition, all probability spaces are assumed to be finite unless clearly stated otherwise. While this assumption limits the domain of application of the expo- sition, it allows isolating the problems belonging solely to the design of autonomous systems from the problems that arise due to infinite sets (and in particular, from the problems of geometrical or topological assumptions). Due to this, we clarify some terminology. A sample space is a finite set Ω, where each member ω ∈ Ω is called a sample or outcome. A subset A ⊂ Ω is called an event. A subset F ⊂ P(Ω) of events that contains Ω and is closed under complementation and finite union is called an algebra. A measurable space is a tuple (Ω,F), where Ω is a sample space and F is an algebra. Given a measurable space (Ω,F), a set function Pr over F is called a probability measure iff it obeys the (Kolmogorov) probability axioms K1. for all A ∈ F , Pr(A) ∈ [0, 1]; K2. Pr(∅) = 0 and Pr(Ω) = 1; K3. for all disjoint A,B ∈ F , Pr(A ∪B) = Pr(A) +Pr(B). 
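For a concrete check of these axioms, consider the uniform measure on a small sample space. The following sketch assumes, purely for illustration, a fair six-sided die and a few helper names of the editorially simplest kind; it enumerates the algebra F = P(Ω) and verifies K1–K3 numerically.

```python
from itertools import chain, combinations

Omega = {1, 2, 3, 4, 5, 6}                     # outcomes of a fair die

def powerset(s):
    """All subsets of s, i.e. the algebra F = P(Omega)."""
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

F = powerset(Omega)

def Pr(A):
    """Uniform probability measure Pr(A) = |A| / |Omega|."""
    return len(A) / len(Omega)

# K1: all values lie in [0, 1].   K2: Pr(empty set) = 0 and Pr(Omega) = 1.
assert all(0.0 <= Pr(A) <= 1.0 for A in F)
assert Pr(frozenset()) == 0.0 and Pr(frozenset(Omega)) == 1.0

# K3: additivity for disjoint events.
A, B = frozenset({1, 2}), frozenset({5, 6})
assert Pr(A | B) == Pr(A) + Pr(B)
```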
A probability space is a tuple (Ω,F ,Pr) where (Ω,F) is a measurable space and Pr is its measure. Given a probability space (Ω,F ,Pr), a random variable is a function X : Ω → X mapping each outcome ω into a symbol X(ω) of a set X , and where X−1(x) ∈ F for all x ∈ X . The probability of the random variable X taking on the value x ∈ X is defined as Pr(x) := Pr(X = x) := Pr({ω ∈ Ω : X(ω) = x}). 2.2 Models of Autonomous Systems If one wants to characterize the way an autonomous system behaves, it is necessary to fully describe the rules governing its potential I/O stream. The mathematical descrip- tion of an autonomous system’s behavior can be done at several levels, ranging from very detailed physical descriptions (e.g. in the case of a robot, it would require specify- ing its gears, distribution of current, sensors and effectors, etc.) to abstract statistical descriptions. This thesis deals exclusively with statistical descriptions. In particular, during the course of this thesis, two levels of description will be discussed: namely, the behavioral level and the belief level. 1. Behavioral Level. The behavioral level specifies the actual I/O behavior, i.e. the probabilities of making observations and issuing actions of the autonomous sys- tem. These probabilities have to be specified for every possible information state of the system, meaning that they must characterize the system’s I/O statistics 6 2.2 Models of Autonomous Systems for every possible sequence of past I/O symbols. In Chapter 6, we will link this description with the amount of resources (measured in thermodynamic work) that the autonomous system has to spend during its interactions. The behavioral level will be introduced in two parts: first, by specifying the output model and then by completing it with an input model. (a) Output Model. The output model specifies how the autonomous system gen- erates its outputs given the past I/O symbols. This is the minimal statistical description of an autonomous system’s behavior. However, it does not ex- plain the purpose of the system, i.e. it does not explain what it tries to accomplish. This model will be introduced in this chapter. (b) Input Model. The autonomous system predicts its input stream using the input model. This prediction model represents the assumptions that the system makes about its environment. We will see in Chapter 3 that these as- sumptions are necessary in order to formalize the purpose of the autonomous system. Furthermore, we will argue in Chapter 6 that this model is also necessary in order to characterize the thermodynamical work spent in inter- actions. The input model is introduced in the next chapter. 2. Belief Level. In this thesis we are mainly interested in modeling adaptive au- tonomous agents. However, specifying adaptive agents by directly writing down their behavioral models is difficult. To simplify the description of adaptive agents, one first starts out specifying a belief model, which is a high-level auxiliary model characterizing the beliefs, assumptions and uncertainties of an agent. Subse- quently, one simply derives the low-level behavioral model from the belief model. This is similar to writing programs in a high-level programming language that is then compiled into low-level machine code. As in the case of the behavioral level, this belief level too will be introduced in two parts: first, by introducing the Bayesian input model and then by introducing the Bayesian output model. (a) Bayesian Input Model. 
When the autonomous system does not know its environment, then the designer can use a Bayesian approach to model this uncertainty. This Bayesian input model is currently the standard in the adaptive control literature. It is introduced in Chapter 4. (b) Bayesian Output Model. In addition to modeling the uncertainty an au- tonomous system has over its environment, a designer can also model the uncertainty the system has over its own policy. This leads to a Bayesian output model that is analogous to the Bayesian input model. However, we will see that this generalization is not straightforward, as it will require a careful distinction between the technical treatment of actions and observa- tion arising due to causal constraints. This model is an original contribution of this thesis, and it will be introduced in Chapter 8, Part II. A schematic illustration of this setup is given in Figure 2.1. Other aspects that are important in the characterization of autonomous systems are the following: 7 2. CHARACTERIZATION OF BEHAVIOR Output Model Input Model Bayesian Input ModelBayesian Output Model I/O Model Bayesian I/O Model (policy) (environment) (uncertain environment)(uncertain policy) Behavioral Level Belief Level Figure 2.1: Behavioral Models. Dependencies are indicated by arrows. 1. I/O Domains: The choice of the cardinality, geometry and topology of the I/O domain can have a significant impact on the description, the implementation and the performance of an autonomous system. In navigation devices for instance, the usage of continuous I/O sets is vital for its robustness. Furthermore, choosing and modeling continuous I/O sets appropriately can be a very challenging task. However, in this thesis we will exclusively deal with finite I/O sets, because arguably the core problems of the design of autonomous agents are already present even in this simple setup. 2. Interaction Protocol: The interaction protocol specifies all the details concerning the rules and the timing of the communication between two autonomous systems. There are applications where de-synchronized interactions with variable timing play a crucial role, like e.g. in moving artillery or in tennis. In this thesis it is assumed that interactions occur in discrete time steps where autonomous systems alternately take turns to generate a symbol. While in many cases, this interac- tion protocol is flexible enough to accommodate other interaction regimes (e.g. simultaneous interactions, discrete approximations of continuous time, etc.), it is not clear whether the theory would significantly change under other interaction protocols. 3. Multiple Autonomous Systems: This thesis deals exclusively with situations hav- ing only two interacting autonomous systems. However, there are many situa- tions, especially in economics, where one would prefer modeling the behavior of a population of autonomous systems. If the population is significantly large, then a more coarse-grained description of behavior could be beneficial: e.g. treating 8 2.3 Output Model entire sub-populations as if they were individual autonomous systems; or using a completely different description modeling emergent properties. This view might be especially relevant to the understanding of decentralized behavior for instance. 2.3 Output Model In the following an autonomous system’s behavior is formalized as a conditional prob- ability measure over I/O sequences over I/O alphabets. We start with basic definitions of interactions. 
Definition 1 (Interactions) The possible I/O symbols are drawn from two finite sets. Let O denote the set of observations and let A denote the set of actions. The set Z := A × O is the set of interactions. The interaction string of length 0 is denoted by ǫ. Let T ∈ N be the horizon, i.e. the maximum length of interaction strings. Let Z⋄ := ⋃_{t=0}^{T} Z^t denote the set of interaction strings up to length T. We also underline symbols to glue them together, as for example in ao≤2 = a1o1a2o2. □

We assume that there are two autonomous systems P and Q. By convention, we assume that P is the agent, i.e. the autonomous system to be constructed by the designer, and that Q is the environment, i.e. the autonomous system to be controlled by the agent. Agent and environment operate obeying the following interaction protocol. The interaction proceeds in cycles t = 1, 2, . . . , T. In cycle t, the agent P generates an action at conditioned on the past I/O symbols ao<t.

B(A|B) := B(A ∩ B|Ω)/B(B|Ω), which is well defined whenever B(B|Ω) > 0. It is easy to see that this formula generalizes correctly to the border cases, since B(A|B) = 0/B(B|Ω) = 0 when A ∩ B = ∅, and B(A|B) = B(B|Ω)/B(B|Ω) = 1 when B ⊂ A. Noting that B = B ∩ Ω and rearranging terms, one gets

B(A ∩ B|Ω) = B(B|Ω) B(A|B ∩ Ω).

We demand this relation to hold under any restriction to a “universal” set C ∈ F, not only when it is restricted to Ω. Thus, replacing Ω by C one obtains

B(A ∩ B|C) = B(B|C) B(A|B ∩ C),

which is known as the product rule for beliefs. Following a similar reasoning, we impose that for any event A ∈ F, the degrees of belief in A and in its complement Ac must sum to one under any condition B, i.e.

B(A|B) + B(Ac|B) = 1,

which is known as the sum rule for beliefs. In summary, we impose the following axioms for beliefs (Jaynes and Bretthorst, 2003).

Definition 16 (Belief axioms) Let Ω be a set of outcomes and let F be an algebra over Ω. A set function B over F × F is a belief function iff
B1. for all A, B ∈ F, B(A|B) ∈ [0, 1];
B2. for all A, B ∈ F, B(A|B) = 1 if B ⊂ A;
B3. for all A, B ∈ F, B(A|B) = 0 if A ∩ B = ∅;
B4. for all A, B ∈ F, B(A|B) + B(Ac|B) = 1;
B5. for all A, B, C ∈ F, B(A ∩ B|C) = B(A|C) B(B|A ∩ C). □

Furthermore, define the shorthand B(A) := B(A|Ω). Axiom B1 states that degrees of belief are real values in the unit interval [0, 1]. Axioms B2 and B3 equate the belief and the truth function under certainty. Axioms B4 and B5 are the structural requirements under uncertainty discussed above. Accordingly, one defines a belief space as follows.

Definition 17 (Belief Space) A belief space is a tuple (Ω, F, B) where Ω is a set of outcomes, F is an algebra over Ω and B : F × F → [0, 1] is a belief function. □

The intuitive meaning of a belief space is analogous to a truth space. Nature arbitrarily selects an outcome ω ∈ Ω. Subsequently, the reasoner performs a measurement: he chooses a set B and nature reveals to him whether ω ∈ B or not. Accordingly, the reasoner infers the degree of belief in any event A ∈ F by evaluating either B(A|B) (if ω ∈ B) or B(A|Bc) (if ω ∉ B).

Remark 9 The word “subsequently”, which has now been emphasized for the second time, is crucial. When the reasoner performs his measurements, the outcome is already determined. In Chapter 8, we will relax this assumption by allowing outcomes that are only partially determined or jointly determined by the reasoner him-/herself. □

Remark 10 Note that logical truth is not the same as probabilistic truth. That is, T(A|B) = 1 ⇒ B(A|B) = 1, but B(A|B) = 1 ⇒ T(A|B) ∈ {?, 1}. In particular, the case B(A|B) = 1 and T(A|B) = ?
occurs when B is chosen such that the set B \ A is non-empty but has no probability mass relative to B. In other words, if B(B \ A|B) = 0, then B(A ∩ B|B) = B(A|B) = 1, even though B ⊄ A. □

An easy but fundamental result is that the axioms of belief are equivalent to the axioms of probability (Jaynes and Bretthorst, 2003). (More precisely, the axioms of beliefs as stated here imply the axioms of probability for finitely additive measures over finite algebras; furthermore, the axioms of beliefs also specify a unique version of the conditional probability measure.) This simple observation is what constitutes the foundation of Bayesian probability theory.

4.1.3 Bayes’ Rule

We now return to the central topic of this chapter. Suppose the reasoner has uncertainty over a set of competing hypotheses about the world. Subsequently, he makes an observation. He can use this observation to update his beliefs about the hypotheses. The following theorem explains how to carry out this update.

Theorem 2 (Bayes’ Rule) Let (Ω, F, B) be a belief space. Let {H1, . . . , HN} be a partition of Ω, and let D ∈ F be an event such that B(D) > 0. Then, for all n ∈ {1, . . . , N},

B(Hn|D) = B(D|Hn) B(Hn) / B(D) = B(D|Hn) B(Hn) / [ Σ_m B(D|Hm) B(Hm) ].

This is known as Bayes’ rule. □

The interpretation is as follows. The H1, . . . , HN represent N mutually exclusive hypotheses, and the event D represents a new observation or data. Initially, the reasoner holds a prior belief B(Hn) over each hypothesis Hn. Subsequently, he incorporates the observation of the event D and arrives at a posterior belief B(Hn|D) over each hypothesis Hn. Bayes’ rule states that this update can be seen as combining the prior belief B(Hn) with the likelihood B(D|Hn) of observation D under hypothesis Hn. The denominator Σ_m B(D|Hm) B(Hm) = B(D) just plays the role of a normalizing constant (Figure 4.5).

Figure 4.5: Schematic Representation of Bayes’ Rule. The prior belief in hypotheses H1, H2 and H3 is roughly uniform. After conditioning on the observation D, the belief in hypothesis H3 increases significantly.

Bayes’ rule naturally applies to a sequential setting. Incorporating a new observation Dt after having observed D1, D2, . . . , Dt−1 updates the beliefs as

Bt+1(Hn) := B(Hn|D1 ∩ · · · ∩ Dt) = Bt(Dt|Hn) Bt(Hn) / [ Σ_m Bt(Dt|Hm) Bt(Hm) ],

where for the t-th update,

Bt(Hn) := B(Hn|D1 ∩ · · · ∩ Dt−1) and Bt(Dt|Hn) := B(Dt|Hn ∩ D1 ∩ · · · ∩ Dt−1)

play the role of the prior belief and the likelihood respectively. Note that

B(D1 ∩ · · · ∩ Dt|Hn) = ∏_{τ=1}^{t} B(Dτ|Hn ∩ D1 ∩ · · · ∩ Dτ−1),

and hence each hypothesis Hn naturally determines a probability measure B(·|Hn) over sequences of observations.

Figure 4.6: Progressive refinement of the accuracy of the joint observation. The sequence of observations D1, . . . , D5 leads to refinements S1, S2, . . . , S5, where St = D1 ∩ · · · ∩ Dt. Note that S5 ⊂ H1 and therefore B(H1|S5) = 1, while B(H2|S5) = B(H3|S5) = 0.

A smaller event D corresponds to a more “accurate” observation. Hence, making a new observation D′ necessarily improves the accuracy, since D ⊃ D ∩ D′. In some cases, the accuracy of an observation (or sequence of observations) can be so high that it uniquely identifies a hypothesis (Figure 4.6).
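For a finite hypothesis set, the sequential update above amounts to multiplying the current belief vector by the likelihood of the newest observation and renormalizing. The following sketch assumes, purely for illustration, three hypotheses about a coin together with observations that are conditionally independent given the hypothesis (so that Bt(Dt|Hn) does not depend on the past); the numbers and variable names are illustrative only.

```python
import numpy as np

def bayes_update(prior, likelihoods, observation):
    """One step of Bayes' rule: posterior is proportional to likelihood x prior."""
    posterior = np.array([likelihoods[n][observation] * prior[n]
                          for n in range(len(prior))])
    return posterior / posterior.sum()

# Three hypotheses about a coin's bias (probability of heads), illustrative numbers.
likelihoods = [{"H": 0.2, "T": 0.8},   # H1: tails-biased
               {"H": 0.5, "T": 0.5},   # H2: fair
               {"H": 0.8, "T": 0.2}]   # H3: heads-biased
belief = np.array([1/3, 1/3, 1/3])     # prior B(Hn)

for obs in ["H", "H", "T", "H"]:       # observation sequence D1, D2, ...
    belief = bayes_update(belief, likelihoods, obs)
    print(obs, belief.round(3))
```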
The way Bayes’ rule operates can be illustrated as follows. Consider a partition {X1, . . . , XK} of Ω and let H∗ ∈ {H1, . . . , HN} be the true hypothesis, i.e. the outcome ω ∈ Ω is drawn obeying propensities described by B(·|H∗). The Xk represent different observations the reasoner can make. If ω is drawn by Nature and reported to be in Xk (without revealing ω itself), then the log-posterior probability of hypothesis Hn is given by

log B(Hn|Xk) = log B(Xk|Hn) + log B(Hn) − log B(Xk) =: ln + pn − c.

This decomposition highlights all the relevant terms for understanding Bayesian learning. The term ln is the log-likelihood of the data Xk. The term pn is the log-prior of hypothesis Hn, which is a way of representing the relative confidence in hypothesis Hn prior to seeing the data. In practice, it can also be interpreted as (a) a complexity term, (b) the log-posterior resulting from “previous” inference steps, or (c) an initialization term for the inference procedure. The term c is the log-probability of the data, which is constant over the hypotheses, and thus does not affect our analysis. Hence, log-posteriors are compared by their differences in ln + pn. Ideally, the log-posterior should be maximum for the true hypothesis Hn = H∗. However, since ω is chosen randomly, the observation Xk is random, and hence the log-posterior log B(Hn|Xk) is a random quantity too. If the variance of the log-posterior is high enough, then a particular realization of the data can lead to a log-posterior favoring some “wrong” hypotheses over the true hypothesis, i.e. ln + pn > l∗ + p∗ for some Hn ≠ H∗. In general, this is an unavoidable problem (that necessarily haunts every statistical inference method). Further insight can be gained by analyzing the expected log-posterior

Ln + Pn − C, where Ln := Σ_{Xk} B(Xk|H∗) log B(Xk|Hn), Pn := log B(Hn) = pn, and C := Σ_{Xk} B(Xk|H∗) log B(Xk).

This reveals that, on average, the log-likelihood Ln is indeed maximized by Hn = H∗ (for probabilities pi, qi, the sum Σ_i pi log qi is maximized over the qi when qi = pi for fixed pi). Hence, the posterior belief will, on average, concentrate its mass on the hypotheses having high Ln + Pn.

4.2 Adaptive Optimal Control

The previous section introduced the conceptual framework of Bayesian probability theory to model reasoning under uncertainty. The aim of this section is to apply this framework to model adaptive autonomous systems. We first introduce a model to represent the uncertainty an agent has over its environment. Then, we show that this model of uncertainty also allows deriving a predictor over the input stream, which will be used as the input model of the agent. Finally, we show a convergence result for this predictive input model.

4.2.1 Bayesian Input Model

One can exploit the Bayesian interpretation of probabilities to construct adaptive autonomous systems. The Bayesian framework can be used to specify a belief model of agents having uncertainty about their environments. From this belief model, one can derive an adaptive predictor for the I/O model introduced in Chapter 3.

Definition 18 (Bayesian Input Model) Let Θ be a finite set and let Z^T be the set of interaction strings. A Bayesian input model of an agent is a set of conditional probabilities P(θ) and P(ot|θ, ao<t at) …

P(o≤t|a≤t) ≥ α Q(o≤t|a≤t) for all t and some constant α > 0. (4.6)

(In particular, in the Bayesian input model α = P(θ).) The dominance factor α can be arbitrarily small as long as it stays constant. Hence, the upper bound −ln α is valid even in the limit T → ∞. Since S contains an infinite sum of positive terms, one concludes:

Corollary 1 Under the same conditions as in Theorem 3, one has P(ot|ao<t at) …

W = −N R T ln(V′/V), (6.2)

where N is the amount of gas, R > 0 is the gas constant, and T ≥ 0 is the absolute temperature.
The minus sign is just a convention to denote work done by the piston rather than by the gas. The interpretation of resources that we want to put forward here is analogous to the physical concept of work. One can postulate a formal correspondence between one unit of information and one unit of work (Feynman, 2000). Consider representing one bit of information using one of the following logical devices: a molecule that can be located either on the top or the bottom part of a box; a coin whose face-up side can be either head or tail; a door that can be either open or closed; a train that can be orientated facing either north or south; and so forth. Assume that all these devices are initialized in an undetermined logical state, where the first state has probability p and the second probability 1 − p. Now, imagine you want to set these devices to their first logical state.

In the case of the molecule in a box, this means the following. Initially, the molecule is uniformly moving around within a space confined by two pistons as depicted in Figure 6.5a. Assuming that the initial volume is V, the molecule has to be pushed by the lower piston into the upper part of the box having volume V′ = pV (Figure 6.5b). From information theory, we know that the number of bits that we fix by this operation is given by

− log p. (6.3)

Using the formula in (6.2), one gets that the amount of work done by the piston is given by

W = RT ln(V/V′) = RT ln(V/(pV)) = −RT ln p = −(RT/log e) log p = −γmol log p,

where we have assumed N = 1 and where the constant γmol := RT/log e > 0 can be interpreted as the conversion factor between one unit of information and one unit of work for the molecule-in-a-box device.

Figure 6.5: The Molecule-In-A-Box Device. (a) Initially, the molecule moves freely within a space of volume V delimited by two pistons. The compartments A and B correspond to the two logical states of the device. (b) Then, the lower piston pushes the molecule into part A having volume V′ = pV.

How do we compute the information and work for the case of the coin, door and train devices? The important observation is that we can model these cases as if they were like molecule-in-a-box devices, with the difference that their conversion factors between units of information and units of work are different. Hence, the number of bits fixed while these devices are set to the first state is given by −log p, i.e. exactly as in the case of the molecule. However, the work is given by

−γcoin log p, −γdoor log p, and −γtrain log p

respectively, where γcoin, γdoor and γtrain are the associated conversion factors between units of information and units of work. Obviously, γmol ≤ γcoin ≤ γdoor ≤ γtrain. The point is that information is proportional to work. In other words, the amount of bits required to communicate a choice is proportional to the amount of thermodynamical work that has to be carried out on the recording device of the receiver.

One can easily envisage devices that generalize this idea to multiple states. Consider for instance the device shown in Figure 6.6a corresponding to the codeword lengths shown in Figure 6.4a. Here, four pistons control the probability of the molecule being in the parts of the space labeled as {a, b, c, d}. A choice is ruled out by pushing its piston downwards until it hits the wall.
To rule out option ‘a’ as we did in the example of Figure 6.4, we push the first piston to the end in order to obtain the configuration shown in Figure 6.6b, which corresponds to the codeword lengths of Figure 6.4b. Since this change reduces the total volume by half, the thermodynamical work is proportional to 1 bit.

Figure 6.6: A generalized molecule-in-a-box device representing the partial choice of Figure 6.4. A molecule can move freely within a volume delimited by four pistons. The box has four sections labeled as ‘a’, ‘b’, ‘c’ and ‘d’, corresponding to four possible states of the device. Panels (a) and (b) show the initial and final configuration, corresponding to ruling out option ‘a’.

In general, this calculation is done using the relative entropy. Let v(u) and v′(u) denote the volume of choice u before and after the compression respectively, and let Pr(u) and Pr′(u) denote their associated probabilities. These probabilities are given by

Pr(u) = v(u)/V and Pr′(u) = v′(u)/V′.

Substituting the probabilities into the relative entropy, one obtains

Σ_u Pr′(u) log( Pr′(u)/Pr(u) ) = Σ_u ( v′(u)/V′ ) log( v′(u)/v(u) ) + log( V/V′ ).

Using the quantities of the example, one obtains

0 · log 0 + (1/8) · log 1 + (1/16) · log 1 + (1/16) · log 1 + log 2 = 1 bit,

as expected. Note that this calculation only works if the volume stays equal or is compressed, but not expanded.

6.2.2 Computational Interpretation

Remark 18 This subsection is of a speculative nature. □

Usually, the efficiency of an algorithm is assessed in terms of a relevant computational resource. There are many possible computational resources, but the most important ones are the computation time and the computation space, i.e. the maximum number of time steps and the maximum number of cells that a Turing machine uses in order to compute the output of a function for any input. Typically, the goal of the designer is to construct an algorithm that minimizes either time or space. However, these two goals seem to be somewhat incompatible. First, concentrating on a single resource does not represent all the issues involved in solving a problem. Also, there are numerous cases where reducing the computation time increases the computation space and vice versa. Indeed, the works of Borodin and Cook (1980), Beame (1989), Beame, Jayram, and Saks (2001) and Beame, Saks, Sun, and Vee (2003) derive tradeoff lower bounds for various problems such as sorting, finding unique elements in a set and solving randomized decision problems. Their findings show that these lower bounds can be stated in terms of minimum time-space products. For example, in Borodin and Cook (1980), an optimal Ω(n²/log n) lower bound is shown for the time-space product of sorting n integers in the range [1, n²]. Although these findings are not conclusive, they seem to suggest that the time-space product might be a more general measure of computational resources, i.e. one that captures the intuitive notion of the difficulty of a computation (Figure 6.7).

If one assumes that the time-space product is a suitable measure of computational resources, then one is led to ask about the meaning of this quantity. Fortunately, one can give this product an information-theoretic meaning. We closely follow the presentation of Savage (1998). Assume we want to build a device that computes a function f mapping X into Y, where both X and Y are finite. To make our discussion concrete, suppose we want to implement this device using a logic circuit, i.e.
a collection of interconnected logical gates computing elementary boolean functions. It is sufficient to restrict ourselves to binary AND, binary OR and unary NOT gates, since it can be shown that every boolean function can be implemented by them. An example circuit is shown in Figure 6.8. The complexity of a function is measured by 63 6. RESOURCES Space Time C1 C2 Figure 6.7: Time-Space Tradeoff. Studies suggest that there is a tradeoff between time and space that can be characterized in terms of a constant that acts as a lower bound on the time- space product of a problem. In the plot, the time-space curves of two problems C1 and C2 are shown. In this case, C2 is more difficult than C1 because the time-space product of C2 is greater than the one of C1, which can be seen by comparing their respective time-space rectangular areas. counting the number of logic gates it has3. A function requiring more logic gates is considered to be more complex. Many times we can also implement f using a sequential processing machine which uses less logic gates. Such a machine is illustrated in Figure 6.9a. A sequen- tial processing machine is capable of simulating computation models having a finite number of configurations, such as space-bounded Turing machines or space-bounded random access machines (Savage, 1998). As shown in the figure, a sequential pro- cessing machine is a logic circuit φ communicating with an external storage device (or “memory”) M . The computation proceeds in T discrete steps, where in step t, the logic circuit takes the binary input xt and the current state qt and generates the binary output yt and the next state qt+1. This is done with the help of M , which stores the state qt+1 generated at step t until it is released in step t + 1. Here, the input string x1:T encodes the input and the output string y1:T encodes the output. The computation of a sequential processing machine can be rephrased as a com- munication problem. This is easily seen by “unwinding” the computation as shown in Figure 6.9, obtaining a concatenation of T times the logic circuit φ. This construction transforms the computation into a communication problem where the channel is given by the logic circuit φ, that is, φ can be regarded as a noisy channel transforming (xt, qt) into (yt, qt+1). If the state qt is represented by S bits, then the total number of bits 3This measure is known as the circuit complexity of a function. 64 6.2 Resources as Information NOT AND OR x1 x2 y1 Figure 6.8: A logic circuit implementing the XOR function. a) b) φφφφφ xt x1 x2 x3 xT yt y1 y2 y3 yT Mqt qt+1 q0 q1 q2 q3 qT−1 qT Figure 6.9: (a) A sequential processing machine consists of an logic circuit φ and an external storage device M . In each step t, φ takes the t-th input xt and state qt (stored in the external memory) and computes the t-th output yt and the (t+1)-th state qt+1. (b) The computation of the sequential processing machine can be “unwinded”, thereby constructing a large logic circuit consisting of T concatenations of φ that computes the same function as before. 65 6. RESOURCES transmitted is given by the product T · (S + 1), where it is seen that (S + 1) is a measure of the maximum capacity that the circuit φ, seen as a communication channel, can achieve. Note that T and S correspond to the time and space complexity of the function f . Let C(f) and C(φ) denote the minimum number of logic gates needed to implement the functions f and φ respectively. 
Then,

C(f) ≤ T · C(φ) = κ · T · S,

because C(φ) = κS for some positive κ and because the implementation of f as a sequential processing machine cannot use fewer logic gates than the direct implementation of f. Hence, the order of the time-space product is lower-bounded by the circuit complexity of f. In some sense, it seems that

    Computer Science                Information Theory
    Compute f(x)         ⇐⇒        Communicate x and f,

although the author is unaware of any proof of this claim. The point is that, if we are willing to accept the time-space product as an appropriate measure of the computational complexity, then the computational resources can be related to information-theoretic resources—and as such, they are governed by information-theoretic principles. In other words, the computational resources correspond to the amount of bits that have to be transmitted to communicate a “computation”. While this interpretation is intuitively appealing, there are still many open questions left. For instance, it is unclear what “computation” means in this sense, and it is also unclear how much computation is necessary in order to compute a given function (Sipser, 1996; Papadimitriou, 1993).

6.3 Resource Costs in Agents

In the previous section, we have seen that resources can be formalized in information-theoretic terms, and that this formalization can be given a thermodynamic and a computational interpretation. This can be summarized as follows. Given a set U of options, a receiver that does not know beforehand what the choice will be has to allocate resource costs (codeword lengths) that implicitly specify his beliefs Pr(u) about the realization of the choice u ∈ U. Furthermore, when the receiver makes a (possibly stochastic and/or partial) choice represented by a probability distribution Pr′(u) over U, then the amount of bits spent by the receiver in order to record this choice is given by the relative entropy

Σ_u Pr′(u) log( Pr′(u)/Pr(u) ),

which is a generalization of the information content from deterministic choices to probabilistic choices. We have argued that this quantity is proportional to the amount of thermodynamic work that the receiver has to carry out in order to record the choice. Furthermore, we have sketched a relation between the amount of bits and the required time-space product of the associated computation. We argue that this way of thinking has many advantages: it greatly simplifies the analysis of resource costs, because we only have to deal with changes in probability distributions. This allows us to abstract away from the algorithmic details in order to reason about computational costs. The objective of this section is to explain how this formalization can be used to calculate the resource costs of running and constructing an agent.

6.3.1 Cost of Interaction

We have formalized autonomous systems as probability distributions over interaction sequences. According to the information-theoretic arguments presented in this chapter, this means that we are implicitly assigning resource costs for interactions. This makes sense from an intuitive point of view. The implementation (or embodiment) of an agent facilitates some interactions while it hampers others. For instance, biologists can infer a great deal of the habits of a species by studying its anatomy. The rationale behind this is that animals manifest energy-efficient behavior more frequently than energy-intensive behavior.
This kind of reasoning acquires an extreme form in paleontology, where behavior is mainly inferred from fossilized animals. Conversely, in engineering, systems are designed such that they minimize the resource costs of frequent or desirable operations and uses. This is visible in the designs of cars, aeroplanes, buildings, algorithms, advertising campaigns, etc.

From an information-theoretic point of view, an agent interacting with an environment is communicating an interaction sequence. Whenever the agent interacts with its environment (either producing an output or reading an input), its “physical state” or “internal configuration” changes as a necessary consequence of the interaction—simply because the two instants are distinguishable. In other words, if the instants before and after the interaction cannot be told apart, then we are forced to conclude that they are empirically the same. This change in “physical state” or “internal configuration” can take place in many possible ways: for instance, by a chemical reaction; by updating the internal memory; by consulting a random number generator; by moving to another location; or even by simply advancing in time (i.e. by changing the value of the time-coordinate). Hence, in this context, the semantics of “physical state” or “internal configuration” corresponds to an abstract information state: it is a description that exhaustively characterizes a situation of an agent. We call such a description a state of the agent.

To make this notion of states concrete, we introduce the following model. We assume that a change in state occurs whenever the agent either issues an action or reads an input. We start out from a blank binary tape that we will use to record the agent’s experience. Then, we iteratively append a new binary string that encodes the new input or output symbol experienced by the agent (Figure 6.10). In this model, the appended binary strings are proxies for the changes in state that the agent experiences during its interactions with the environment. In this way, we abstract away from the inner workings of the agent by simply representing every change by a binary string. Note that this scheme does not allow an agent to return to a previous state, hence the agent cannot “jump back in time”. Additionally, we want the content of the binary tape to be uniquely decodable, such that we can recover the whole I/O history the agent has experienced so far at any given time by decoding the content of the tape. This model highlights the correspondence between resources, codeword lengths and probabilities. Therefore, the behavior of the agent can be thought of as a reflection of the underlying resource costs.

Figure 6.10: State Model. The agent has an I/O domain given by A := O := {1, 2, . . . , 10}. So far, the agent has experienced four I/O symbols (Panel a). The state of the agent is constructed by iteratively encoding the four I/O symbols into binary strings (Panel b).
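The bookkeeping suggested by this state model can be sketched directly: each experienced I/O symbol contributes its information content −log Pr(symbol) to the tape, so that the tape length tracks the resource cost of the interaction history. The alphabet, the probabilities and the variable names below are illustrative assumptions, and the history-dependence of the behavioral model is ignored for brevity.

```python
import math

# Illustrative output model P(a | past) and input model P(o | past, a);
# their dependence on the history is dropped here to keep the sketch short.
P_action = {"left": 0.75, "right": 0.25}
P_observation = {"bump": 0.5, "clear": 0.5}

def information_content(prob):
    """Ideal codeword length in bits assigned to a symbol of probability prob."""
    return -math.log2(prob)

tape_bits = 0.0
history = []

# A hypothetical interaction string a1 o1 a2 o2 a3 o3.
for action, observation in [("left", "clear"), ("left", "bump"), ("right", "clear")]:
    tape_bits += information_content(P_action[action])            # record the output
    tape_bits += information_content(P_observation[observation])  # record the input
    history.append((action, observation))

print(f"{len(history)} interaction cycles recorded using about {tape_bits:.2f} bits")
```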
6.3.2 Costs of Construction

Remark 19 This subsection is of a speculative nature. □

Designing and constructing an agent has a cost. Whether we are solving optimality equations, running a search algorithm, or assembling mechatronic parts, we are always spending resources during the conception of an agent. These resource costs can be thought of as arising from the change of the distribution over the possible agents, where the cost of this change is given by the relative entropy. Let Θ denote the index set parameterizing the possible agents. Furthermore, let Pr(θ) denote the belief we have about θ being the optimal parameter. Consider the relative entropy

ρ = Σ_θ Pr′(θ) log( Pr′(θ)/Pr(θ) ), (6.4)

where Pr′(θ) is the (partial) choice made by the sender. This quantity correctly measures the number of bits we are going to receive over a noiseless channel from a sender that picks out the right θ. However, there is a caveat: the previous calculation represents a situation where we passively receive the optimal answer, which is not possible unless the sender is an “oracle” who guesses the first ρ bits of the optimal θ. To correctly calculate the required number of bits in this communication problem, we have to account for the fact that we do not know the sender. Not knowing the sender has important implications, since this uncertainty might lead to significant resource costs that we are not accounting for. Unfortunately, the author is not aware of any widely accepted way of dealing with this situation. However, one can speculate that the amount of information is 2^O(ρ), that is, exponential in the amount of information needed when the sender is known. Roughly speaking, the justification for this intuition is based on results from computational complexity, where a non-deterministic machine can be simulated by a deterministic machine but incurring exponential cost (Sipser, 1996; Papadimitriou, 1993). In the next chapter, this claim is made precise for a special case.

6.4 Historical Remarks & References

The fundamental results of information theory were almost entirely developed in the paper by Shannon (1948). In particular, Shannon’s paper derives a quantitative measure of information based on three desiderata. Surprisingly, the resulting formula for information turned out to have the same mathematical shape as the formula for thermodynamic entropy discovered empirically by Boltzmann (in a simpler form) and Gibbs. Because of this, much of information theory borrows mathematics from thermodynamics. Information theory has found a wide range of applications in communication, coding, compression, statistical learning, dynamical systems, and other areas (Khinchin, 1957; Ash, 1965; Kolmogorov, 1968; Gallager, 1968; Billingsley, 1978; Cover and Thomas, 1991; Li and Vitanyi, 2008; MacKay, 2003; Grünwald, 2007). The relation between information (more specifically, codeword lengths) and probability follows roughly the argument presented in Grünwald (2007). The Kraft-McMillan inequality was developed in two steps by Kraft (1949) and then by McMillan (1956). The relative entropy

Σ_u Pr′(u) log( Pr′(u)/Pr(u) )

was introduced by Kullback and Leibler (1951). The standard interpretation is as follows. The relative entropy measures the expected number of extra bits required to code samples from Pr′ when using a code based on Pr, rather than using a code based on Pr′. The idea of the relative entropy as a generalized formula for the information content is the author’s (non-standard) interpretation. While mathematically equivalent, conceptually the author’s interpretation seems to suggest a “temporal directionality”, where Pr and Pr′ represent the knowledge state of the receiver before and after receiving information.
The connections between information theory, thermodynamics and computational complex- ity presented in this chapter borrow ideas from various sources. The relation between units of energy and units of information, has been originally put forward in the context of Maxwell’s demon by various authors (Maxwell, 1867; Szilard, 1929; Brillouin, 1951, 1956; Gabor, 1964). The first modern argument linking information theory with thermodynamics is due to Landauer (1961). Since then, the ideas of the thermodynamics of computing have found wide acceptance (Tribus and McIrvine, 1971; Bennett, 1973, 1982; Feynman, 2000; Li and Vitanyi, 2008). The devices shown in figures 6.5 and 6.6 are the author’s contribution, and they differ from the de- vices in the literature in that they do not allow expanding the volume (i.e. erasing knowledge) but only compressing it (i.e. acquiring knowledge). The relation between computational resources and information is an area under active re- search. Time-space tradeoffs have been investigated in Borodin and Cook (1980), Beame (1989), 69 6. RESOURCES Beame et al. (2001) and Beame et al. (2003) using a computational model called branching programs. The time-space tradeoff related to circuit complexity is presented in Savage (1998). The speculations relating the transmission of information to computational complexity are due to the author, although related ideas have been put forward by Bremermann (1965). The arguments linking information-theoretic resources to the cost of interaction and con- struction are an original contribution of the author. The concept of an oracle, while non- standard in the context of information theory, is commonplace in computational complexity. An oracle is a hypothetical device which is capable of answering decision problems in a single operation (Sipser, 1996; Papadimitriou, 1993), and they are widely used in cryptography to make arguments about the security of cryptographic protocols involving hash functions. 70 Chapter 7 Boundedness If a designer has unlimited computational resources, then he would pay the cost of cal- culating the optimal policy for an autonomous agent. However, in Chapter 5 we have argued that the rigorous application of the maximum SEU principle is computationally too expensive to serve as a practical design principle. Therefore, most implementa- tions make severe domain restrictions and/or approximations in order to significantly reduce the computational expenses. In other words, designers content themselves with suboptimal solutions to the original problem. The preference of the designer of choosing a suboptimal solution over the optimal one suggests that resources have an impact over the “perceived utility” of the solution. One can argue that this “suboptimal solution” becomes the “optimal solution” when resources are taken into account. Hence, intuitively there seems to be a common “currency” for resources and utilities. What is this currency? If there were one, then one could compute the optimal solution that trades off the benefits obtained from maximizing the expected subjective utility and the resource costs of the calculation. 7.1 An Example of Boundedness We start our discussion with a concrete example. We are given an array of N numbers (v1, v2, . . . , vN ), and our task is to pick the largest one. We solve this problem by checking the numbers one by one in any order, always comparing the current value against the largest seen so far. 
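A minimal sketch of this scan, with illustrative values and variable names, reads as follows; it keeps only the best value seen so far and an index into the array.

```python
def exhaustive_max(values):
    """Scan all N elements, always keeping the largest value seen so far."""
    best = values[0]
    for v in values[1:]:          # N - 1 comparisons in total
        if v > best:
            best = v
    return best

print(exhaustive_max([3, 7, 1, 10, 2, 8, 5, 9, 4, 6]))   # -> 10
```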
It is easy to see that this algorithm takes O(N) time and O(log N) space if we assume that each comparison is done in a single computational step and that the indices are represented using ⌈log N⌉ bits. The time-space complexity of this algorithm is therefore O(N log N).

From an information-theoretic point of view, this algorithm transforms the knowledge state about the maximum. Figure 7.1 shows an example array having N = 10 elements with values in {1, 2, . . . , 10}. Initially, the algorithm does not know the location of the maximum: if we assume that each index is encoded with log 10 ≈ 3.3219 bits, then the initial distribution is uniform. After running the algorithm, the location of the maximum is known, which is represented by a delta function concentrating its probability mass on the number 10. Notice that for discovering log N bits of information we are running an algorithm of time complexity O(N) = 2^O(log N), that is, exponential in the number of bits. Compare this to the arguments for the cost of construction presented in Section 6.3.2, page 68.

Figure 7.1: An Exhaustive Optimization. (a) A shuffled array with numbers from 1 to 10 is given. Initially, the algorithm does not know where the optimum is, which is represented by a uniform distribution (b) over the elements. After the execution of the algorithm, the solution was found, which is represented by a distribution (c) concentrating its probability mass on the maximum.

If N is small, say N = 10, then choosing the largest number is easy, because one can simply scan the whole array and then pick the largest number. However, if N is very large, say N = 1000, then comparing all the numbers becomes a difficult task. How do we go about this problem then? One solution would be to limit ourselves to comparing only a fraction of the array, say M ≪ N elements. This reduces the computational complexity to O(M) time and O(log N) space at the cost of tolerating an error with probability 1 − M/N. That is, we can give up certainty for the sake of computational efficiency. Notice that the time-space product is O(M log N), which is linear in M for fixed N.

To understand the effect of the resulting tradeoff, we again analyze how the knowledge state is transformed. We assume that the array was shuffled beforehand and that the algorithm inspects the first M elements in linear order, choosing the largest number. We then use the frequency of choosing each number as its probability. The shuffling assigns equal probability to each one of the N! possible permutations of the array. The resulting probability distributions are shown in Figure 7.2. Here, we see that inspecting only one element does not change the state of knowledge at all; that increasing M moves the probability mass towards the larger numbers; and that complete certainty is achieved when M = N. Furthermore, notice that we can actually infer the ranking of the numbers by merely looking at the distributions, since larger numbers consistently get more probability mass.

The particular shape of the resulting distributions has some interesting properties. Let ρ denote the relative entropy between the initial and the final distribution, let C = M − 1 denote the number of comparisons carried out by the algorithm, and let V denote the expected maximum. Note that, for fixed N, C is of the same order as the time-space product O(M log N), i.e. it serves as a measure of the computational complexity.
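These quantities are easy to estimate by simulation. Under the reading that the initial distribution is uniform over the ten numbers and the final distribution is the frequency with which each number is chosen, the following Python sketch (a minimal, hypothetical implementation, not the thesis's code) estimates the distribution of Figure 7.2 for a given M together with ρ, C and V.

    import math, random

    def bounded_pick(values, M):
        # Inspect only the first M elements and pick the largest one found.
        return max(values[:M])

    def bounded_statistics(N=10, M=3, samples=100000):
        counts = [0] * (N + 1)                    # counts[v]: how often value v is chosen
        for _ in range(samples):
            values = list(range(1, N + 1))
            random.shuffle(values)
            counts[bounded_pick(values, M)] += 1
        dist = [c / samples for c in counts[1:]]  # induced distribution over the chosen number
        rho = sum(p * math.log2(p * N) for p in dist if p > 0)  # relative entropy to the uniform prior
        V = sum(v * p for v, p in enumerate(dist, start=1))     # expected value of the chosen number
        return dist, rho, V, M - 1                # M - 1 is the number of comparisons C

    dist, rho, V, C = bounded_statistics(M=5)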
Figure 7.2: Distributions after Bounded Optimization. The plots show the distributions over the maximum in {1, 2, . . . , 10} obtained after running the bounded optimization algorithm for different values of M (one panel for each M = 1, . . . , 10).

If we plot ρ versus C (Figure 7.3a), we see a surprising property: the quantities are proportional, having a correlation coefficient of r = 0.9991. (More precisely, the Pearson product-moment correlation coefficient is an indicator of the linear dependence between two variables, with values ranging from −1 to 1; a value equal to 1 means that a linear relation perfectly describes the relationship between the two variables.) That is, we can use ρ as a good measure of the computational complexity of the algorithm.

We also want to understand how the expected value V evolves as we increase the computational effort. This is seen by plotting V versus ρ (Figure 7.3b). This plot shows that certainty has a diminishing marginal value, that is, the gain in value decreases with more computation. Intuitively, this is because the better the candidate solution, the more effort it takes to find an even better one. Moreover, the shape of the expected utility turns out to be logarithmic, as can be seen by plotting 2^V versus ρ.

Figure 7.3: Performance of the Bounded Optimization. In panel (a), it is apparent that the relative entropy ρ is proportional to the number of comparisons C carried out by the algorithm (r = 0.9991), and hence ρ serves as a measure of the computational complexity. Panel (b) shows the expected value V against the relative entropy ρ; the marginal increment of V diminishes as ρ increases. Furthermore, panel (c) shows that 2^V is linear in ρ (r = 0.9937), meaning that V is logarithmic in ρ.

Intuitively, if we care about computational costs, then it is not always a good idea to run an exhaustive optimization algorithm, because we can reach a point where the gain in value is too small to justify the extra computational effort. This idea can be captured by changing the evaluation criterion to one that penalizes the expected value by the relative entropy, e.g. V − αρ, where α is a conversion factor. Fixing α defines a tradeoff between the expected value and the relative entropy (Figure 7.4). The plot confirms our intuition: the smaller α, the larger the number of elements we compare, and the more "rational" our choice becomes. A brief simulation also shows that achieving a perfectly rational choice does not require α = 0; it is already achieved for α ≈ 0.2131.

Figure 7.4: Expected Value Penalized by Relative Entropy. Choosing the conversion factor α defines a tradeoff between the expected value and the relative entropy. In the plot, three performance curves are shown, corresponding to the tradeoffs α = 1/2, 1 and 2. Notice how an exponential increase of α leads to a linear increase in the optimal ρ.

This analysis leads to a principled algorithm for bounded optimization. For a fixed α, consider the algorithm that linearly inspects the elements of the array until the largest number found so far, penalized by αρ, reaches its peak. This algorithm is stochastic, with the property that the probabilities of choosing a number are monotonic: for all i, j,

pi > pj  ⇐⇒  vi > vj,

where vi is the value of the i-th element and pi is its probability of being chosen.
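A minimal sketch of this stopping rule is given below, assuming (purely for illustration, not as the author's implementation) that the curve ρ(M) has been estimated beforehand, e.g. by the simulation above; here it is replaced by a crude linear stand-in.

    import random

    def penalized_stop(values, alpha, rho_of):
        # Inspect elements in linear order and stop once the running maximum,
        # penalized by alpha * rho(M), stops improving (i.e. has reached its peak).
        best, best_score = None, float("-inf")
        for M, v in enumerate(values, start=1):
            best = v if best is None else max(best, v)
            score = best - alpha * rho_of(M)
            if score < best_score:
                break
            best_score = score
        return best

    rho_of = lambda M: 0.35 * (M - 1)   # crude stand-in for the estimated curve rho(M)

    values = list(range(1, 11))
    random.shuffle(values)
    choice = penalized_stop(values, alpha=0.5, rho_of=rho_of)

Repeating such runs over many shuffles yields the choice probabilities pi; larger values are more likely to be the running maximum at the stopping point, which is one way to see why the pi come out monotonic in the vi.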
This seems to be a more general property of bounded optimization algorithms. Furthermore, notice that because of the diminishing marginal value, it is easy to reach a good performance level (although it is very hard to approach the optimum). This concludes our example. In the remainder of this chapter, we aim to develop a general framework of bounded rationality and then apply it to autonomous systems. We are especially interested in providing a solid axiomatic basis for bounded rationality.

7.2 Utility & Resources

In physics, the behavior of a system can be described in two ways: using the dynamical equations or using an extremal principle (Goldstein, 1980). The first specifies how the coordinates of the physical system change in time, as e.g. in Newton's second law

F = dp/dt,

where F is the force, p is the momentum and t is time. The second expresses the dynamics of a system as the solution to a variational problem, as e.g. the action integral

A = ∫_{t_i}^{t_f} L[q, q̇, t] dt

in Lagrangian mechanics, where L is the Lagrangian (i.e. the kinetic energy minus the potential energy of the system), q are the generalized coordinates of the system and t is time. According to the principle of least action, the dynamical equations of the system are then obtained by finding the trajectory q(t) that is an extremum of the action integral. On a conceptual level, the difference between stating the dynamical equations or the extremum principle to specify a physical system is analogous to the difference between stating the output model or the subjective expected utility in order to specify an autonomous system.

In the previous chapter, we have argued that the resource cost (i.e. work) of observing an event A given B is −γ log P(A|B), where γ > 0 is the conversion factor between units of energy and units of information. This has the advantage of linking three disparate concepts together, namely resource costs (physics), information content (information theory) and behavior P(A|B) (statistics). Can we exploit this connection to physics in order to devise a new principle for constructing autonomous systems? The answer is yes, but the connection is not straightforward, because it requires revising the way we think about utilities and about the process of constructing an autonomous system.

7.2.1 Utility

In this section, we propose a non-standard concept of utility that is more in accord with thermodynamics and information theory. Let U(A) denote the utility of an event A ∈ F and let u(A|B) := U(A ∩ B) − U(B) be a shorthand for the gain in utility obtained from experiencing event A given event B. We will derive the relation

u(A|B) = α log P(A|B),

where α > 0 is a conversion factor between units of utility and units of information. This relation is obtained from desiderata characterizing the notion of utilities in systems that are "free" to generate events. More specifically, by a free system we mean a system that can choose the sampling probabilities of the events generated by itself. Consider a free system represented by a finite probability space (Ω, F, P). Here, the probability measure P models the generative law that the system uses to choose events. Thus, if P(A) > P(B), then the propensity of experiencing A is higher than that of B. In such a system, differences in probability can be given an interpretation relating them to differences in preferences: one can say that A is more probable than B because A is more desirable than B.
In other words, a system is more likely to choose the events that it prefers. Using this line of reasoning, can we find a quantity which will measure how desirable an event is? We call such a measure of a utility gain function, although one should always bear in mind that the resulting utilities do not correspond to the same notion of utility that we have seen in Part I. If there is such a measure, then it is reasonable to demand the following three properties for it: i. Utility gains should be mappings from conditional probabilities into real numbers. ii. Utility gains should be additive. That is, the gain of a joint event should be obtained by summing up the gain of the sub-events. For example, the “gain of drinking coffee and eating a croissant” should equal “the gain of drinking coffee” plus the “gain of having a croissant given the gain of drinking coffee”. iii. A more probable event should have a higher utility gain than a less probable event. For example, if “drinking a coffee” is more likely than “eating a croissant”, then the utility gain of the former is higher. Note that this is not necessarily true anymore if the system is not free to choose between events. For instance, “losing the lottery” has a much higher probability than “winning the lottery” even though the utility gain of the latter is obviously higher, but this case differs from the previous example in that the event is not determined by the system itself. The three properties can then be summarized as follows. 76 7.2 Utility & Resources Definition 22 (Axioms of Utility) Let (Ω,F ,P) be a probability space. A set func- tion U : F → R is a utility function for P iff its utility gain function u(A|B) := U(A ∩B)−U(B) has the following three properties for all events A,B,C,D ∈ F : U1. ∃f,u(A|B) = f(P(A|B)) ∈ R, (real-valued) U2. u(A ∩B|C) = u(A|C) + u(B|A ∩C), (additive) U3. P(A|B) > P(C|D) ⇔ u(A|B) > u(C|D). (monotonic) Furthermore, we use the abbreviation u(A) := u(A|Ω). 2 The following theorem shows that these three properties enforce a strict mapping between probabilities and utility gains. Theorem 5 (Utility Gain ↔ Probability) If f is such that u(A|B) = f(P(A|B)) for every probability space (Ω,F ,P), then f is of the form f(·) = α log(·), where α > 0 is arbitrary strictly positive constant. 2 Proof Given arbitrary p, q ∈ (0, 1] and n,m ∈ N, one can always choose a probability space (Ω,F ,P) having sequences of events A1, A2, . . . , An ∈ F and B1, B2, . . . , Bm ∈ F , and an event C ∈ F such that p = P(A1|C) = P(A2|C ∩A1) = · · · = P(An|C ∩A 0 plays the role of a conversion factor between utilities and information. If a probability measure P and a utility function U satisfy the relation (7.4), then we say that they are conjugate. We call one unit of utility one utile. Given that this transformation between utility gains and probabilities is a bijection, one has that P(A|B) = 2 1α (U(A∩B)−U(B)) . There are several important observations with respect to this particular functional form. First, recall that in Bayesian probability theory (Section 4.1), a new measurement A is combined with the previous knowledge B by an intersection, i.e. the posterior knowledge is given by A∩B. Furthermore, note that utility gains are always negative, i.e. U(A ∩B)−U(B) ≤ 0. This means that every measurement decreases the utility. While this sounds counterin- tuitive, the relation also says that minimizing the resource costs maximizes the utility that is achieved after the interaction. 
Hence, if the free system is forced to act, then it will favor outcomes that reduce the utility less.

Second, note that in the context of thermodynamics discussed in Section 6.2.1, doing work W = −γ log P(A|B) on a physical system changes its energy level from e(B) to e(A ∩ B) as

e(B) → e(A ∩ B) = e(B) + W = e(B) − γ log P(A|B).

That is, the work done on a system increases its internal energy. Comparing this with the relation

U(A ∩ B) − U(B) = α log P(A|B)

leads to the conclusion that U(A) = −(α/γ) e(A) for all A ∈ F. Hence, the utilities that we have derived from mathematical desiderata are (save for a conversion factor) negative energy levels. This is useful because it allows us to analyze decision making by borrowing ideas from thermodynamics.

Furthermore, note that the exponentiation of the utility function can be written as a sum of parts. That is, if A1, A2 ∈ F form a partition of A ∈ F, then

2^{U(A)/α} = 2^{U(A1)/α} + 2^{U(A2)/α}.

This is because P(A|Ω) = P(A1|Ω) + P(A2|Ω), and hence

2^{U(A)/α} / 2^{U(Ω)/α} = 2^{U(A1)/α} / 2^{U(Ω)/α} + 2^{U(A2)/α} / 2^{U(Ω)/α},

which yields the claim after multiplying through by 2^{U(Ω)/α}. In particular, one can rewrite any probability P(A|B) as a Gibbs measure:

P(A|B) = [ ∑_{ω∈A} 2^{U(ω)/α} ] / [ ∑_{ω∈B} 2^{U(ω)/α} ],

where we have used the abbreviation U(ω) := U({ω}). As the conversion factor α approaches zero, the probability measure P(ω) approaches a uniform distribution over the maximal set Ωmax := {ω* ∈ Ω : ω* = argmax_ω U(ω)}. Similarly, as α → ∞, P(ω) → 1/|Ω|, i.e. the uniform distribution over the whole outcome set Ω. Also, note that

2^{U(Ω)/α} = ∑_ω 2^{U(ω)/α}.

Intuitively, the utility U(Ω) of the sample set corresponds to the utility of the system before any interaction has occurred.

7.2.2 Variational principle

The conversion between probability and utility established in the previous section can be characterized using a variational principle. Inspired by thermodynamics, we now define a quantity that will allow us to analyze changes of the state of the system.

Definition 23 (Free Utility) Let (Ω, F) be a measurable space and let U be a utility function. Define the free utility functional as

J(Pr; U) := ∑_{ω∈Ω} Pr(ω) U(ω) − α ∑_{ω∈Ω} Pr(ω) log Pr(ω),

where Pr is an arbitrary probability measure over (Ω, F) and Pr(ω) := Pr({ω}) is an abbreviation.

Here, we see that the free utility is the expected utility of the system plus the uncertainty (i.e. the entropy) over the outcome multiplied by the utility-information conversion factor. (The functional F := −J is also known as the Helmholtz free energy in thermodynamics; F is a measure of the "useful" work obtainable from a closed thermodynamic system at constant temperature and volume.) It satisfies the following relation.

Theorem 6 (Variational Principle) A conjugate pair (P, U) satisfies

J(Pr; U) ≤ J(P; U) = U(Ω),

where Pr is an arbitrary probability measure over Ω.

Proof  Rewriting terms using the utility-probability conversion and applying Jensen's inequality yields

J(Pr; U) = ∑_{ω∈Ω} Pr(ω) U(ω) − α ∑_{ω∈Ω} Pr(ω) log Pr(ω)
         = α ∑_{ω∈Ω} Pr(ω) log [ 2^{U(ω)/α} / Pr(ω) ]
         ≤ α log ∑_{ω∈Ω} Pr(ω) [ 2^{U(ω)/α} / Pr(ω) ]
         = α log ∑_{ω∈Ω} 2^{U(ω)/α} = U(Ω),

with equality iff 2^{U(ω)/α} / Pr(ω) is constant, i.e. iff Pr = P. □

The variational principle tells us that the probability law P of the system is the one that maximizes the free utility for a given utility function U, since

P = argmax_Pr J(Pr; U).

Hence, the utility function U plays the role of a constraint landscape for probability measures Pr, out of which the conjugate probability measure P is the one that maximizes the uncertainty.
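As a quick numerical illustration (not from the thesis; the outcome set and utility values below are made up), the following Python sketch builds the conjugate Gibbs measure P(ω) ∝ 2^{U(ω)/α} for a small outcome set and checks that J(Pr; U) ≤ J(P; U) = U(Ω) for randomly drawn measures Pr.

    import math, random

    alpha = 1.0
    U = {"a": 2.0, "b": 0.5, "c": -1.0}          # hypothetical utilities U(omega)

    # Conjugate measure: P(omega) proportional to 2**(U(omega)/alpha).
    Z = sum(2 ** (u / alpha) for u in U.values())
    P = {w: 2 ** (u / alpha) / Z for w, u in U.items()}
    U_Omega = alpha * math.log2(Z)               # U(Omega) = alpha * log2(sum_w 2**(U(w)/alpha))

    def free_utility(Pr):
        # J(Pr; U) = sum_w Pr(w) U(w) - alpha * sum_w Pr(w) log2 Pr(w)
        return sum(Pr[w] * U[w] for w in U) - alpha * sum(
            Pr[w] * math.log2(Pr[w]) for w in U if Pr[w] > 0)

    assert abs(free_utility(P) - U_Omega) < 1e-9
    for _ in range(1000):
        x = [random.random() for _ in U]
        Pr = {w: xi / sum(x) for w, xi in zip(U, x)}
        assert free_utility(Pr) <= U_Omega + 1e-9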
7.2.3 Bounded SEU

In SEU theory, the designer constructs an autonomous system by first assuming a probability measure that characterizes the behavior of the environment and a utility criterion to compare different outcomes, and then by calculating a policy that maximizes the subjective expected utility. During this process, the calculation of the optimal policy is done disregarding the costs of the computation. As we have argued at the beginning of this chapter, here we do want to take into account the costs of computing the optimal policy. However, this poses two important problems.

First, how do we measure the costs of this computation? To answer this question, we first have to agree on what this calculation is actually doing. We know that after this calculation, the agent ends up with an optimal policy. But what if this calculation had been omitted? Clearly, the agent would not end up with an optimal policy, unless the agent already had the optimal policy from the very beginning. The important conclusion here is that, in the general case, the purpose of such a calculation is to transform an initial system Pi into a final system Pf. In the previous chapter, we have seen that the number of bits that have to be set to carry out this transformation is given by the relative entropy

∑_{ω∈Ω} Pf(ω) log [Pf(ω)/Pi(ω)].

Hence, the cost measured in joules and in utiles is obtained by multiplying the previous quantity by γ and α respectively.

Second, how do we optimally choose this transformation? In other words, how should we transform the initial system Pi such that the final system Pf optimally trades off the benefits of maximizing a given utility function U∗ against the cost of the transformation? To answer this question, consider the transformation represented in Figure 7.5.

Figure 7.5: Transformation of a System. A transformation from a system (Pi, Ui) into a system (Pf, Uf) by addition of a constraint U∗.

The initial system satisfies the equation

Ji := ∑_{ω∈Ω} Pi(ω) Ui(ω) − α ∑_{ω∈Ω} Pi(ω) log Pi(ω) = Ui(Ω).

We add new constraints represented by the utility function U∗. Then, the resulting utility function Uf is given by the sum Uf = Ui + U∗. The free utility Jf of the final system is then given by

Jf := ∑_{ω∈Ω} Pf(ω) Uf(ω) − α ∑_{ω∈Ω} Pf(ω) log Pf(ω)
    = ∑_{ω∈Ω} Pf(ω) (Ui(ω) + U∗(ω)) − α ∑_{ω∈Ω} Pf(ω) log Pf(ω)
    = ∑_{ω∈Ω} Pf(ω) U∗(ω) − α ∑_{ω∈Ω} Pf(ω) log [Pf(ω)/Pi(ω)] + Ui(Ω).

To understand the change that has occurred due to this transformation, we take the difference Jf − Ji in the free utility. This results in a quantity that will play a central role in our next development.

Definition 24 (Bounded Subjective Expected Utility) Let U∗ be a utility function and let Ji and Jf be the free utilities of the conjugate pairs (Pi, Ui) and (Pf, Uf). Then, the bounded subjective expected utility (bounded SEU) is given by the difference in free utility

Jf − Ji = ∑_{ω∈Ω} Pf(ω) U∗(ω) − α ∑_{ω∈Ω} Pf(ω) log [Pf(ω)/Pi(ω)].   (7.5)

This difference in free utility has an interpretation that is crucial for the formalization of bounded rationality: it is the expected target utility U∗ (first term) penalized by the cost of transforming Pi into Pf (second term). In practice, this means the following.
A designer starts out with an initial system Pi that he wants to change in order to maximize a utility function U∗. He then changes this system into Pf, spending in the process a total amount of

γ ∑_{ω∈Ω} Pf(ω) log [Pf(ω)/Pi(ω)]

joules of energy. The expected utility of the resulting system is given by

∑_{ω∈Ω} Pf(ω) U∗(ω).

However, from the point of view of the designer, the total expected utility is smaller, because he has to subtract the reduction in utility caused by the cost of the transformation itself:

∑_{ω∈Ω} Pf(ω) U∗(ω) − α ∑_{ω∈Ω} Pf(ω) log [Pf(ω)/Pi(ω)].

Remark 20  Alternatively, one can interpret bounded SEU as the expected utility penalized by an "uncertainty" term. In this interpretation, the relative entropy measures the "risk" of a gamble.

Because of its interpretation, we define (7.5) as a functional to be maximized, either with respect to Pf or with respect to Pi. We call this construction method the maximum bounded SEU principle. Depending on whether we vary Pf or Pi, one obtains two different variational problems having different applications.

Control Method. The construction method for control corresponds to the case when the designer wants to build a system that optimizes a given utility function U∗ subject to the costs of the transformation. This is achieved by fixing Pi and varying Pf:

Pf = argmax_Pr { ∑_{ω∈Ω} Pr(ω) U∗(ω) − α ∑_{ω∈Ω} Pr(ω) log [Pr(ω)/Pi(ω)] }.   (7.6)

The solution is given by

Pf(ω) ∝ Pi(ω) exp( U∗(ω)/α ).

This can be interpreted as follows. The decision maker starts out with a prior probability of choosing ω given by Pi(ω). Then, he changes this probability to Pf(ω), obtained by multiplying the prior with the term exp(U∗(ω)/α), which intuitively corresponds to the "likelihood" of ω being the best choice. Compare this to the example given in the introduction. Here, α controls the amount of resource units required to increase the utility. In particular, when the conversion factor between information and utility is negligible, α ≈ 0, then (7.5) becomes

Jf − Ji ≈ ∑_{ω∈Ω} Pf(ω) U∗(ω),

and hence resource costs are ignored in the choice of Pf, leading to Pf ≈ δ_{ω∗}(ω), where ω∗ = argmax_ω U∗(ω). Similarly, when α is very high, then the difference is

Jf − Ji ≈ −α ∑_{ω∈Ω} Pf(ω) log [Pf(ω)/Pi(ω)],

and hence no computation is carried out to optimize the choice, i.e. Pf ≈ Pi.

Estimation Method. The construction method for estimation corresponds to the case when the designer wants to build a system that approximates another system (one that is maximizing a possibly unknown utility function U∗) subject to the costs of the transformation. This is achieved by fixing Pf and varying Pi:

Pi = argmax_Pr { ∑_{ω∈Ω} Pf(ω) U∗(ω) − α ∑_{ω∈Ω} Pf(ω) log [Pf(ω)/Pr(ω)] }   (7.7)
   = argmin_Pr ∑_{ω∈Ω} Pf(ω) log [Pf(ω)/Pr(ω)],

and thus we have recovered the minimum relative entropy principle for estimation, having the solution Pi = Pf, which carries no transformation costs. While this "construction method" might look bizarre at first glance, it is saying something obvious: if the designer is looking for a system Pi that approximates another known system Pf, then the best he can do is to just choose Pf itself, without having to carry out any transformation at all!
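The control update is just a softmax reweighting of the prior, which is easy to see in code. The sketch below (hypothetical values, not part of the thesis) computes Pf ∝ Pi · exp(U∗/α) for a small outcome set and shows the two limiting regimes discussed above.

    import math

    def control_update(Pi, U_star, alpha):
        # Bounded-SEU control method: reweight the prior by exp(U*/alpha) and renormalize.
        weights = {w: Pi[w] * math.exp(U_star[w] / alpha) for w in Pi}
        Z = sum(weights.values())
        return {w: weights[w] / Z for w in weights}

    Pi = {"a": 0.5, "b": 0.3, "c": 0.2}          # initial system (prior)
    U_star = {"a": 0.0, "b": 1.0, "c": 3.0}      # target utility function (made up)

    print(control_update(Pi, U_star, alpha=0.01))  # ~ delta on the maximizer "c"
    print(control_update(Pi, U_star, alpha=100))   # ~ unchanged prior Pi

The estimation method needs no code at all: with Pf fixed, the relative entropy term is minimized by taking Pi = Pf.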
7.3 Bounded SEU in Autonomous Systems

We now tackle the question of how to construct an autonomous system using the bounded SEU principle. In the previous section, we have learnt that building a system is conceptualized as transforming a preexisting system. Furthermore, there are two construction methods, namely one for control and one for estimation. We assume that we are in possession of a reference I/O model P0 and a utility function U∗. The objective is to find an I/O model P maximizing the bounded SEU using the method for control, the method for estimation, or a mixture of both. We further assume that the conversion factor between utilities and information is given by α > 0.

To construct the associated bounded SEU, one has to carefully decide which construction method to apply to each random variable. For instance, consider constructing an I/O model P(ao) from a reference I/O model P0(ao) using a utility function U∗. Then, assuming that a is derived using the control method and o using the estimation method, one gets

P = argmax_Pr { ∑_{ao} Pr(a) P0(o|a) U∗(ao) − α ∑_{ao} Pr(a) P0(o|a) log [ Pr(a) P0(o|a) / (P0(a) Pr(o|a)) ] }.

This is because for a, the reference probabilities P0(a) play the role of Pi in (7.5); and for o, the reference probabilities P0(o|a) play the role of Pf in (7.5). A simple way to concisely write down the bounded SEU for mixed methods is by defining two auxiliary I/O models R and S as R(at|ao

5/8, that is, strictly higher than the subjective expected utility of the classical analysis; and in practice, subjective beliefs are all that a decision maker has! (Intuitively, it seems safer to delay one's decision until the evidence is conclusive.) We call these probabilities that are exogenous to the decision maker's beliefs ambiguities. The distinction has an operational meaning. Notice that in case V, the belief of the decision maker is itself a random variable, implying that the optimal policy is undefined until the random variable is resolved. Hence, the computation of the optimal policy can be delayed, i.e. the optimal policy can be determined dynamically. This is unlike case IV, where the policy is pre-computed/static.

Figure 9.1: Risk versus Ambiguity. In the figure, five different decision making scenarios are shown. A biased coin is tossed. The goal is to predict the outcome, and the payoffs are $1 and $0 for a right and a wrong guess respectively. A rational decision maker places bets (here shown inside speech bubbles) such that his subjective expected utility is maximized. These subjective beliefs are delimited within dotted boxes. The cases in panels I–IV differ from case V in that the former can be fully understood in terms of classical decision theory, whereas the latter cannot.

The corresponding Bayesian I/O model is as follows. Let θ ∈ {I, II} be the parameter determining whether the decision maker is in case I or II. Then, the full Bayesian I/O model is given by:

P(θ = I) = 1/4,   P(θ = II) = 3/4,

P(a|θ = I) = 0 if a = H, 1 if a = T;   P(a|θ = II) = 1 if a = H, 0 if a = T;

P(o|θ = I, a) = 1/4 if o = H, 3/4 if o = T;   P(o|θ = II, a) = 3/4 if o = H, 1/4 if o = T.

Here, the prior probabilities P(θ) are ambiguities, while the P(a|θ) and P(o|θ, a) are risk probabilities, because fixing θ determines the decision maker's estimation about the outcome and his policy.

9.2 Adaptive Estimative Control

If the environment Q is unknown, then the task of designing an appropriate agent P constitutes an adaptive control problem.
We have already seen how to solve such a problem using the maximum SEU principle in Section 4.2, where it has been formulated as an adaptive optimal control problem. We formulate a different problem setup that will be termed adaptive estimative control. Specifically, this setup deals with the case when the designer has a class of output models {Pθ}θ∈Θ parameterized by a finite set Θ, designed to fit to a class of en- vironments {Qθ}θ∈Θ; in other words, when the designer wants to use policy Pθ(at|ao 0, there is a C ≥ 0, such that for all θ′ ∈ Θ, all t and all T ⊂ Nt ∣∣∣gθ′(θ;T )−Gθ′(θ;T )∣∣∣ ≤ C with probability ≥ 1− δ. 2 0 1 2 3 t dt Figure 9.6: Bounded Variation. If a divergence process has bounded variation, then the realizations (curves 2 & 3) of a sub-divergence stay within a band around the mean (curve 1). Figure 9.6 illustrates this property. Bounded variation is the key property that is going to be used to construct the results of this section. However, it is very restrictive. For instance, simple Bernoulli processes do not fulfill this property. The first result is that the posterior probability of the true parameter is bounded from below. Theorem 9 (Lower Bound of True Posterior) Let the set of operation modes of a controller be such that for all θ ∈ Θ the divergence process dt(θ∗‖θ) has bounded variation. Then, for any δ > 0, there is a λ > 0, such that for all t ∈ N, P (θ∗|aˆo≤t) ≥ λ |Θ| with probability ≥ 1− δ. 2 115 9. CONTROL AS ESTIMATION Proof As has been pointed out in (9.2), a particular realization of the divergence process dt(θ∗‖θ) can be decomposed as dt(θ∗‖θ) = ∑ θ′ gθ(θ ′;Tθ′), where the gθ(θ ′;Tθ′) are sub-divergences of dt(θ∗‖θ) and the Tθ′ form a partition of Nt. However, since dt(θ∗‖θ) has bounded variation for all θ ∈ Θ, one has for all δ′ > 0, there is a C(θ) ≥ 0, such that for all θ′ ∈ Θ, all t ∈ Nt and all T ⊂ Nt, the inequality∣∣∣gθ(θ′;Tθ′)−Gθ(θ′;Tθ′)∣∣∣ ≤ C(θ) holds with probability ≥ 1− δ′. However, due to (9.3), Gθ(θ ′;Tθ′) ≥ 0 for all θ′ ∈ Θ. Thus, gθ(θ ′;Tθ′) ≥ −C(θ). If all the previous inequalities hold simultaneously then the divergence process can be bounded as well. That is, the inequality dt(θ∗‖θ) ≥ −MC(θ) (9.4) holds with probability ≥ (1− δ′)M where M := |Θ|. Choose β(θ) := max{0, ln P (θ)P (θ∗)}. Since 0 ≥ ln P (θ)P (θ∗) − β(θ), it can be added to the right hand side of (9.4). Using the definition of dt(θ∗‖θ), taking the exponential and rearranging the terms one obtains P (θ∗) t∏ τ=1 P (oτ |θ∗, ao<τaτ ) ≥ e−α(θ)P (θ) t∏ τ=1 P (oτ |θ, ao<τaτ ) where α(θ) := MC(θ) + β(θ) ≥ 0. Identifying the posterior probabilities of θ∗ and θ by dividing both sides by the normalizing constant yields the inequality P (θ∗|aˆo≤t) ≥ e−α(θ)P (θ|aˆo≤t). This inequality holds simultaneously for all θ ∈ Θ with probability ≥ (1− δ′)M2 and in particular for λ := minθ{e−α(θ)}, that is, P (θ∗|aˆo≤t) ≥ λP (θ|aˆo≤t). But since this is valid for any θ ∈ Θ, and because maxθ{P (θ|aˆo≤t)} ≥ 1M , one gets P (θ∗|aˆo≤t) ≥ λ M , with probability ≥ 1 − δ for arbitrary δ > 0 related to δ′ through the equation δ′ := 1− M2√1− δ.  116 9.4 Convergence of the Bayesian Control Rule Remark 31 It has been pointed by M. Hutter4 that bounded variation can most probably be weakened to “C growing sub-linearly” (which will require adapting the definitions that follow as well) and still get the convergence result of this section. 
2 9.4.5 Core If one wants to identify the operation modes whose posterior probabilities vanish, then it is not enough to characterize them as those modes whose hypothesis does not match the true hypothesis. Figure 9.7 illustrates this problem. Here, three hypotheses along with their associated policies are shown. H1 and H2 share the prediction made for region A but differ in region B. Hypothesis H3 differs everywhere from the others. Assume H1 is true. As long as we apply policy P2, hypothesis H3 will make wrong predictions and thus its divergence process will diverge as expected. However, no evidence against H2 will be accumulated. It is only when one applies policy P1 for long enough time that the agent will eventually enter region B and hence accumulate counter-evidence for H2. H1 P1 H2 P2 H3 P3A A B B Figure 9.7: Problems with Disambiguation. If hypothesis H1 is true and agrees with H2 on region A, then policy P2 cannot disambiguate the three hypotheses. But what does “long enough” mean? If P1 is executed only for a short period, then the controller risks not visiting the disambiguating region. But unfortunately, neither the right policy nor the right length of the period to run it are known beforehand. Hence, an agent needs a clever time-allocating strategy to test all policies for all finite time intervals. This motivates the following definition. Definition 36 (Core) The core of an operation mode θ∗, denoted as [θ∗], is the subset of Θ containing operation modes behaving like θ∗ under its policy. Formally, an operation mode θ /∈ [θ∗] (i.e. is not in the core) iff for any C ≥ 0, δ > 0, there is a ξ > 0 and a t0 ∈ N, such that for all t ≥ t0, Gθ∗(θ;T ) ≥ C with probability ≥ 1− δ, where Gθ∗(θ;T ) is a sub-divergence of dt(θ∗‖θ), and Pr{τ ∈ T } ≥ ξ for all τ ∈ Nt. 2 In other words, if the agent was to apply θ∗’s policy in each time step with probabil- ity at least ξ, and under this strategy the expected sub-divergence Gθ∗(θ;T ) of dt(θ∗‖θ) grows unboundedly, then θ is not in the core of θ∗. 4personal communication 117 9. CONTROL AS ESTIMATION Remark 32 Note that demanding a strictly positive probability of execution in each time step guarantees that the agent will run θ∗ for all possible finite time-intervals. 2 As the following theorem shows, the posterior probabilities of the operation modes that are not in the core vanish almost surely. Theorem 10 (Not in Core ⇒ Vanishing Posterior) Let the set of operation modes of an agent be such that for all θ ∈ Θ the divergence process dt(θ∗‖θ) has bounded vari- ation. If θ /∈ [θ∗], then P (θ|aˆo≤t)→ 0 as t→∞ almost surely. 2 Proof The divergence process dt(θ∗‖θ) can be decomposed into a sum of sub-divergences (see Equation 9.2) dt(θ∗‖θ) = ∑ θ′ gθ′(θ;Tθ′). (9.5) Furthermore, for every θ′ ∈ Θ, one has that for all δ > 0, there is a C ≥ 0, such that for all t ∈ N and for all T ⊂ Nt∣∣∣gθ′(θ;T )−Gθ′(θ;T )∣∣∣ ≤ C(θ) with probability ≥ 1 − δ′. Applying this bound to the summands in (9.5) yields the lower bound ∑ θ′ gθ′(θ;Tθ′) ≥ ∑ θ′ ( Gθ′(θ;Tθ′)− C(θ) ) which holds with probability ≥ (1− δ′)M , where M := |Θ|. Due to Inequality 9.3, one has that for all θ′ 6= θ∗, Gθ′(θ;Tθ′) ≥ 0. Hence,∑ θ′ ( Gθ′(θ;Tθ′)− C(θ) ) ≥ Gθ∗(θ;Tθ∗)−MC where C := maxθ{C(θ)}. The members of the set Tθ∗ are determined stochastically; more specifically, the i-th member is included into Tθ∗ with probability P (θ∗|aˆo≤i) ≥ λ/M for some λ > 0 by Theorem 9. But since θ /∈ [θ∗], one has that Gθ∗(θ;Tθ∗) →∞ as t→∞ with probability ≥ 1− δ′ for arbitrarily chosen δ′ > 0. 
This implies that lim t→∞ dt(θ∗‖θ) ≥ lim t→∞ Gθ∗(θ;Tθ∗)−MC ր∞ with probability ≥ 1−δ, where δ > 0 is arbitrary and related to δ′ as δ = 1−(1−δ′)M+1. Using this result in the upper bound for posterior probabilities yields the final result 0 ≤ lim t→∞ P (θ|aˆo≤t) ≤ lim t→∞ P (θ) P (θ∗) e−dt(θ∗‖θ) = 0.  118 9.4 Convergence of the Bayesian Control Rule 9.4.6 Consistency Even if an operation mode θ is in the core of θ∗, i.e. given that θ is essentially indis- tinguishable from θ∗ under θ∗’s control, it can still happen that θ∗ and θ have different policies. Figure 9.8 shows an example of this. The hypotheses H1 and H2 share re- gion A but differ in region B. In addition, both operation modes have their policies P1 and P2 respectively confined to region A. Note that both operation modes are in the core of each other. However, their policies are different. This means that it is unclear whether multiplexing the policies in time will ever disambiguate the two hypotheses. This is undesirable, as it could impede the convergence to the right control law. H1 P1 H2 P2 A A B B Figure 9.8: Inconsistent Policies. An example of inconsistent policies. Both operation modes are in the core of each other, but have different policies. Thus, it is clear that one needs to impose further restrictions on the mapping of hypotheses into policies. With respect to Figure 9.8, one can make the following observations: 1. Both operation modes have policies that select subsets of region A. Therefore, the dynamics in A are preferred over the dynamics in B. 2. Knowing that the dynamics in A are preferred over the dynamics in B allows us to drop region B from the analysis when choosing a policy. 3. Since both hypotheses agree in region A, they have to choose the same policy in order to be consistent in their selection criterion. This motivates the following definition. Definition 37 (Consistent Policies) An operation mode θ is said to be consistent with θ∗ iff θ ∈ [θ∗] implies that for all ε < 0, there is a t0, such that for all t ≥ t0 and all ao 0 and δ′ > 0, let t0(θ) be the time such that for all t ≥ t0(θ), wθ(t) < ε′. Choosing t0 := maxθ{t0(θ)}, the previous inequality holds for all θ and t ≥ t0 simultaneously with probability ≥ (1− δ′)M . Hence,∑ θ/∈[θ∗] pθ(t)wθ(t) ≤ ∑ θ/∈[θ∗] wθ(t) < Mε ′. (9.7) To bound the second sum in (9.6) one proceeds as follows. For every member θ ∈ [θ∗], one has that pθ(t)→ pθ∗(t) as t→∞. Hence, following a similar construction as above, one can choose t′0 such that for all t ≥ t′0 and θ ∈ [θ∗], the inequalities∣∣∣pθ(t)− pθ∗(t)∣∣∣ < ε′ hold simultaneously for the precision ε′ > 0. Applying this to the second sum in Equation 9.6 yields the bounds∑ θ∈[θ∗] ( pθ∗(t)− ε′ ) wθ(t) ≤ ∑ θ∈[θ∗] pθ(t)wθ(t) ≤ ∑ θ∈[θ∗] ( pθ∗(t) + ε ′ ) wθ(t). Here ( pθ∗(t) ± ε′ ) are multiplicative constants that can be placed in front of the sum. Note that 1 ≥ ∑ θ∈[θ∗] wθ(t) = 1− ∑ θ/∈[θ∗] wθ(t) > 1− ε. Use of the above inequalities allows simplifying the lower and upper bounds respectively:( pθ∗(t)− ε′ ) ∑ θ∈[θ∗] wθ(t) > pθ∗(t)(1 − ε′)− ε′ ≥ pθ∗(t)− 2ε′, ( pθ∗(t) + ε ′ ) ∑ θ∈[θ∗] wθ(t) ≤ pθ∗(t) + ε′ < pθ∗(t) + 2ε′. (9.8) 120 9.5 Examples Combining the inequalities (9.7) and (9.8) in (9.6) yields the final result:∣∣∣P (at|aˆo 0 related to δ′ as δ′ = 1− M√1− δ and arbitrary precision ε.  9.5 Examples In this section we illustrate the usage of the Bayesian control rule on two examples that are very common in the reinforcement learning literature: multi-armed bandits and Markov decision processes. 
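To make the sampling scheme of the rule concrete before going through the examples, here is a minimal Python sketch (not the thesis's implementation; the two-armed Bernoulli setup and all names are illustrative) that draws an operation mode from the posterior, acts with that mode's policy, and updates the posterior with the observed outcome.

    import random

    # Hypothetical two-armed Bernoulli bandit: each operation mode theta is a pair of
    # reward probabilities, and its policy pulls the arm it believes to be best.
    modes = [(0.2, 0.8), (0.8, 0.2), (0.5, 0.6)]
    posterior = [1.0 / len(modes)] * len(modes)
    true_bias = (0.3, 0.7)

    def bcr_step():
        global posterior
        # Sample an operation mode from the posterior and act with its policy.
        theta = random.choices(range(len(modes)), weights=posterior)[0]
        arm = 0 if modes[theta][0] >= modes[theta][1] else 1
        reward = 1 if random.random() < true_bias[arm] else 0
        # Posterior recursion: weight each mode by the likelihood of the observation.
        lik = [m[arm] if reward else 1 - m[arm] for m in modes]
        posterior = [l * p for l, p in zip(lik, posterior)]
        Z = sum(posterior)
        posterior = [p / Z for p in posterior]
        return arm, reward

    for _ in range(1000):
        bcr_step()

Note that only the observation likelihood enters the update; the agent's own actions are treated as interventions and do not count as evidence, which is the role of the hat notation â in the rule.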
As a reminder, a summary of the Bayesian control rule is given in Table 9.1.

Bayesian Control Rule: Given a set of operation modes {P(·|θ, ·)}_{θ∈Θ} over interaction sequences in Z∞ and a prior distribution P(θ) over the parameters Θ, the probability of the action at+1 is given by

P(at+1|âo≤t) = ∑_θ P(at+1|θ, ao≤t) P(θ|âo≤t),   (9.9)

where the posterior probability over operation modes is given by the recursion

P(θ|âo≤t) = P(ot|θ, ao<t at) P(θ|âo<t) / ∑_{θ'} P(ot|θ', ao<t at) P(θ'|âo<t).

The R-learning updates use learning rates α, β > 0. The exploration strategy chooses with fixed probability pexp > 0 the action a that maximizes Q(x, a) + C/F(x, a), where C is a constant and F(x, a) represents the number of times that action a has been tried in state x. Thus, higher values of C enforce increased exploration.

In a study by Mahadevan (1996), a grid-world is described that is especially useful as a test bed for the analysis of RL algorithms. For our purposes, it is of particular interest because it is easy to design experiments containing suboptimal limit-cycles. Figure 9.11, panel (a), illustrates the 7×7 grid-world. A controller has to learn a policy that leads it from any initial location to the goal state. At each step, the agent can move to any adjacent space (up, down, left or right). If the agent reaches the goal state, then its next position is randomly set to any square of the grid (with uniform probability) to start another trial. There are also "one-way membranes" that allow the agent to move in one direction but not in the other. In these experiments, the membranes form "inverted cups" that the agent can enter from any side but can only leave through the bottom, playing the role of a local maximum. Transitions are stochastic: the agent moves to the correct square with probability p = 9/10 and to any of the free adjacent spaces (with uniform probability) with probability 1 − p = 1/10. Rewards are assigned as follows. The default reward is r = 0. If the agent traverses a membrane, it obtains a reward of r = 1. Reaching the goal state assigns r = 2.5.

The parameters chosen for this simulation were the following. For our MDP-agent, we have chosen hyperparameters µ0 = 1 and λ0 = 1 and precision p = 1. For R-learning, we have chosen learning rates α = 0.5 and β = 0.001, and the exploration constant has been set to C = 5, C = 30 and C = 200. A total of 10 runs were carried out for each algorithm. The results are presented in Figure 9.11 and Table 9.2. R-learning only learns the optimal policy given sufficient exploration (panels d & e, bottom row), whereas the Bayesian control rule learns the policy successfully. In Figure 9.11f, the learning curve of R-learning for C = 5 and C = 30 is initially steeper than that of the Bayesian controller. However, the latter attains a higher average reward from around time step 125,000 onwards. We attribute this shallow initial transient to the phase where the distribution over the operation modes is flat, which is also reflected by the initially random exploratory behavior.

Table 9.2: Average reward attained by the different algorithms at the end of the run. The mean and the standard deviation have been calculated based on 10 runs.

    Algorithm              Average Reward
    BCR                    0.3582 ± 0.0038
    R-learning, C = 200    0.2314 ± 0.0024
    R-learning, C = 30     0.3056 ± 0.0063
    R-learning, C = 5      0.2049 ± 0.0012

9.6 Critical Issues

Problems of Bayesian methods. The Bayesian control rule treats an adaptive control problem as a Bayesian inference problem.
Hence, all the problems typically associated with Bayesian methods carry over to agents constructed with the Bayesian control rule. These problems are of both analytical and computational nature. For example, there are many probabilistic models where the posterior distribution does not have a closed-form solution. Also, exact probabilistic inference is in general computationally very intensive. Even though there is a large literature on efficient/approximate inference algorithms for particular problem classes (Bishop, 2006), not many of them are suitable for on-line probabilistic inference in more realistic environment classes.

Bayesian control rule versus Bayes-optimal control. Directly maximizing the (subjective) expected utility for a given environment class is not the same as minimizing the expected relative entropy for a given class of operation modes. The two methods are based on different assumptions and optimality principles. As such, the Bayesian control rule is not a Bayes-optimal controller. Indeed, it is easy to design experiments where the Bayesian control rule converges exponentially more slowly to the maximum utility than a Bayes-optimal controller (or does not converge at all). Consider the following simple example: Environment 1 is a k-state MDP in which only k consecutive actions A reach a state with reward +1. Any interception with a B-action leads back to the initial state. Consider a second environment which is like the first, but with actions A and B interchanged. A Bayes-optimal controller figures out the true environment in k actions (either k consecutive A's or B's). Consider now the Bayesian control rule: the optimal action in Environment 1 is A, in Environment 2 it is B. A uniform (1/2, 1/2) prior over the operation modes stays a uniform posterior as long as no reward has been observed. Hence the Bayesian control rule chooses at each time step A and B with equal probability. With this policy it takes about 2^k actions to accidentally choose a run of A's (or B's) of length k. From then on, the Bayesian control rule is optimal too. So a Bayes-optimal controller converges in time k, while the Bayesian control rule needs exponentially longer. One way to remedy this problem might be to allow the Bayesian control rule to sample actions from the same operation mode for several time steps in a row rather than randomizing controllers in every cycle. However, if one considers non-stationary environments, this strategy can also break down. Consider, for example, an increasing MDP with k = ⌈10√t⌉, in which a Bayes-optimal controller converges in 100 steps, while the Bayesian control rule does not converge at all in most realizations, because the boundedness assumption is violated.

9.7 Relation to Existing Approaches

Some of the ideas underlying this work are not unique to the Bayesian control rule. The following is a selection of previously published work in the recent Bayesian reinforcement learning literature where related ideas can be found.

Compression principles. In the literature, there is an important body of work relating compression to intelligence (MacKay, 2003; Hutter, 2004a). In particular, it has even been proposed that compression ratio is an objective quantitative measure of intelligence (Mahoney, 1999). Compression has also been used as a basis for a theory of curiosity, creativity and beauty (Schmidhuber, 2009).

Mixture of experts.
Passive sequence prediction by mixing experts has been studied extensively in the literature (Cesa-Bianchi and Lugosi, 2006). In a study on online- predictors (Hutter, 2004b), Bayes-optimal predictors are mixed. Bayes-mixtures can also be used for universal prediction (Hutter, 2003). For the control case, the idea of using mixtures of expert-controllers has been previously evoked in models like the MOSAIC-architecture (Haruno, Wolpert, and Kawato, 2001). Universal learning with Bayes mixtures of experts in reactive environments has been studied in the work of Poland and Hutter (2005) and Hutter (2002). Stochastic action selection. The idea of using actions as random variables, and the problems that this entails, has been expressed in the work of Hutter (2004a, Problem 5.1). The study in this chapter can be regarded as a thorough investigation of this open problem. Other stochastic action selection approaches are found in the thesis of Wyatt (1997) who examines exploration strategies for (PO)MDPs, in learning automata (Narendra and Thathachar, 1974) and in probability matching (Duda, Hart, and Stork, 2001) amongst others. In particular, the thesis discusses theoretical properties of an extension to probability matching in the context of multi-armed bandit problems. There, it is proposed to choose a lever according to how likely it is to be optimal and it is shown that this strategy converges, thus providing a simple method for guiding exploration. 9.8 Derivation of Gibbs Sampler for MDP Agent Inserting the likelihood given in Equation 9.13 into Equation 9.9 of the Bayesian control rule, one obtains the following expression for the posterior 129 9. CONTROL AS ESTIMATION P (θ|aˆ≤t, o≤t) = P (x ′|θ, x, a)P (r|θ, x, a, x′)P (θ|aˆ 0, and hence, under the view of the presented framework, perfect rationality is only an idealization. Conceptually, the distinguishing feature of bounded SEU (compared to classical SEU) is that it provides an explanatory framework for approximations. Causality. Causality is a field that has historically been highly controversial, and it has been only recently that mathematical formalizations have started to find wider ac- ceptance. Still, so far these formalizations have not clarified the connection to measure theory, the mathematically rigorous theory of probability. The framework introduced in this thesis does a step towards this direction. Simultaneously, this thesis clarifies the importance of causal consideration in agent designs. More specifically, the Bayesian I/O model (Section 8.3.1), which is really a causal model, allows enriching the classi- cal Bayesian autonomous system, having uncertainty only over its environment, with having uncertainty over its very policy. Bayesian control rule. Agents that are constructed following the maximum SEU principle have to carry out massive computations before they even have had a single interaction with its environment. To bypass this limitation, most practical algorithms implement autonomous systems that “discover” their policy during the interactions with the environment. In this thesis, we have used the framework of bounded SEU and causal reasoning to derive the Bayesian control rule: a rule that allows construct- ing a natural class adaptive agents having uncertainty over their policies. Formally, the Bayesian control rule for outputs is the probabilistic equivalent to the predictive distribution for inputs. 
Furthermore, this thesis presents a convergence proof for the Bayesian control rule under a very restricted setting. 134 10.3 What is missing? 10.3 What is missing? Understanding the Implications of the Relations. If one is willing to accept the connections between decision theory, information theory and thermodynamics that this thesis puts forward, then one obtains a potentially very fertile ground for novel ideas and reinterpretations. While this thesis shows some of the benefits of adopting this unified view, it also leaves many questions unanswered. For instance, the utility- information conversion factor α controls the cost of translating resources into utilities. On a very abstract level, one could blame the failure of approximations to the very large value of α. But what does α mean in practice? How can it be influenced? Examples such as this abound and need to be addressed in the future. Descriptive Power. While the bounded SEU framework introduced in this thesis has a theoretical appeal due to its properties and simplicity, it remains to be seen whether it can explain human decision making, and especially, whether it can explain the experimental evidence (Allais, 1953; Ellsberg, 1961; Kahneman and Tversky, 1979; Machina, 1987; Kreps, 1988) that contradicts perfect rationality. In particular, it would be important to verify whether the causality framework and/or the Bayesian control rule can predict aspects of human decision making. Intelligence Measure. Legg and Hutter (2006) proposed a formal measure of ma- chine intelligence. While this measure is a synthesis of many informal definitions of human intelligence that have been given by experts, it has been constructed mainly bor- rowing ideas from the theory of universal artificial intelligence, staying thus within the paradigm of agency with unlimited resources. It would be interesting to test whether this intelligence measure can be accommodated to include agency with bounded re- sources. Game Theory. It is not at all clear how the design of autonomous systems connects to the game theoretic literature. The fundamental difference lies in the assumptions. In artificial intelligence and control theory, one assumes a dynamical model of the environment first, and then constructs a suitable agent. Meanwhile, in game theory, one instead only assumes a utility function describing the environment’s preferences. This assumption, however, does not provide enough information to derive the dynam- ical model of the environment, since this model must depend on the assumptions the environment makes about the agent. Under this point of view, additional solution con- cepts (collectively called equilibria) other than the maximum expected utility prin- ciple are required to derive the resulting behavior of the interaction system. In the context of this thesis, an important point to be verified is to investigate whether the Bayesian I/O model for agents is “essentially” equivalent to the Bayesian game frame- work (Harsanyi, 1967–1968), and whether game theoretic concepts can be extended to the case of resource-bounded agency. 135 10. DISCUSSION 136 References M. Allais. Le comportment de l’homme rationnel devant la risque: critique des postulats et axiomes de l’ecole americaine. Econometrica, 21:503–546, 1953. Dana Angluin. Computational learning theory: survey and selected bibliography. In Proceedings of the twenty-fourth annual ACM symposium on Theory of computing, STOC ’92, pages 351–369, New York, NY, USA, 1992. F. J. Anscombe and R. J. Aumann. 