Model-based Utility Functions

Bill Hibbard
Space Science and Engineering Center, University of Wisconsin - Madison
1225 W. Dayton Street, Madison, WI 53706, USA
TEST@SSEC.WISC.EDU

Abstract

At the recent AGI-11 conference, Orseau and Ring, and Dewey, described problems, including self-delusion, with the behavior of AIXI agents using various definitions of utility functions. An agent's utility function is defined in terms of the agent's history of interactions with its environment. This paper argues that the behavior problems can be avoided by formulating the utility function in two steps: 1) inferring a model of the environment from interactions, and 2) computing utility as a function of the environment model. The paper also argues that agents will not choose to modify their utility functions.

Keywords: rational agent, utility function, self-delusion, self-modification

1. Introduction

Omohundro (2008) argues that artificial intelligence (AI) agents will try to preserve their utility functions and prevent them from corruption by other agents. Recently Orseau and Ring (2011a; 2011b) and Dewey (2011) have described problems with agent utility functions. This paper describes an approach to addressing these problems.

2. An Agent-Environment Framework

We assume that an agent interacts with an environment. At each of a discrete series of time steps t ∈ N = {0, 1, 2, ...} the agent sends an action a_t ∈ A to the environment and receives an observation o_t ∈ O from the environment, where A and O are finite sets. We assume that the environment is computed by a program for a prefix universal Turing machine (PUTM) U (Hutter, 2005; Li and Vitanyi, 1997). Given an interaction history h = (a_1, o_1, ..., a_t, o_t) ∈ H, where H is the set of all finite histories, and given a PUTM program q, we write h ⊑ U(q) to mean that q produces the observations o_i in response to the actions a_i for 1 ≤ i ≤ t. Define |h| = t as the length of the history h and define |q| as the length of the program q. The prior probability of h is defined as:

  ρ(h) = Σ_{q : h ⊑ U(q)} 2^(-|q|)    (1)
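To make the structure of (1) concrete, here is a minimal sketch, not a computable implementation of the PUTM sum: the set of programs is replaced by an assumed finite pool of toy candidate programs with hand-assigned description lengths, and all names in the sketch are illustrative assumptions.

```python
# Minimal sketch of a truncated version of (1): sum 2^(-|q|) over a small
# hand-built pool of candidate "programs" that reproduce the history's
# observations. Each program here is just a Python callable; |q| is an
# assumed description length in bits.

from typing import Callable, List, Tuple

History = List[Tuple[int, int]]  # list of (action, observation) pairs

def prior(h: History,
          programs: List[Tuple[int, Callable[[List[int]], List[int]]]]) -> float:
    """Truncated analogue of rho(h) = sum over consistent q of 2^-|q|."""
    actions = [a for a, _ in h]
    observations = [o for _, o in h]
    total = 0.0
    for length_bits, q in programs:
        if q(actions) == observations:        # q reproduces the observed history
            total += 2.0 ** (-length_bits)
    return total

# Two toy candidate environments: "echo the action" and "always observe 0".
programs = [
    (3, lambda acts: acts[:]),                # shorter program, higher weight
    (5, lambda acts: [0] * len(acts)),
]

h = [(1, 1), (0, 0), (1, 1)]
print(prior(h, programs))   # 2^-3, since only the echo program is consistent
```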

This is related to the Kolmogorov complexity K(h) of a history h, which is defined as the length of the shortest program q such that h ⊑ U(q). Kraft's Inequality implies that ρ(h) ≤ 1 (Li and Vitanyi, 1997).

An agent is motivated according to a utility function u : H → [0, 1] which assigns utilities between 0 and 1 to histories. Future utilities are discounted according to a temporal discount function w : N → [0, 1] such that Σ_{t∈N} w(t) = 1 (this is sometimes chosen as w(t) = (1 − α)α^t where 0 < α < 1). The value v_t(h) of a possible future history h at time t is defined recursively by:

  v_t(h) = w(|h| − t) u(h) + MAX_{a∈A} v_t(ha)    (2)
  v_t(ha) = Σ_{o∈O} ρ(o | ha) v_t(hao)

The agent AIXI is defined to take, after history h, the action:

  a_{|h|+1} := ARGMAX_{a∈A} v_{|h|+1}(ha)

Hutter (2005) has shown that AIXI maximizes the expected value of future history, although it is not computable.
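The value recursion (2) and the ARGMAX action choice can be illustrated by a finite-horizon sketch. The conditional model cond_prob standing in for ρ(o | ha), the utility u, the discount w, and the toy example model at the end are all assumptions for illustration; AIXI itself is not computable.

```python
# Finite-horizon sketch of the value recursion (2): alternate a max over
# actions with an expectation over observations, truncating the infinite
# future at a fixed horizon.

from typing import Callable, List, Tuple

History = List[Tuple[int, int]]

def value(h: History, t: int, horizon: int,
          actions: List[int], observations: List[int],
          u: Callable[[History], float],
          w: Callable[[int], float],
          cond_prob: Callable[[int, History, int], float]) -> float:
    """v_t(h) truncated at a finite horizon instead of an infinite future."""
    if len(h) - t >= horizon:
        return w(len(h) - t) * u(h)
    best = max(
        sum(cond_prob(o, h, a) * value(h + [(a, o)], t, horizon,
                                       actions, observations, u, w, cond_prob)
            for o in observations)
        for a in actions)
    return w(len(h) - t) * u(h) + best

def next_action(h, horizon, actions, observations, u, w, cond_prob) -> int:
    """The agent's choice after history h: argmax over a of v_{|h|+1}(ha)."""
    t = len(h) + 1
    return max(
        actions,
        key=lambda a: sum(cond_prob(o, h, a) * value(h + [(a, o)], t, horizon,
                                                     actions, observations,
                                                     u, w, cond_prob)
                          for o in observations))

# Example: trivial model where the observation always equals the action.
u = lambda h: 1.0 if h and h[-1][1] == 1 else 0.0
w = lambda k: 0.5 * (0.5 ** k)            # (1 - alpha) * alpha^k with alpha = 0.5
cond = lambda o, h, a: 1.0 if o == a else 0.0
print(next_action([], 3, [0, 1], [0, 1], u, w, cond))   # prefers action 1
```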

Orseau and Ring (2011a; 2011b) define four types of AIXI-based agents using different specialized ways of computing utility u(h). For a reinforcement-learning agent, the observation includes a reward r_t (i.e., o_t = (ô_t, r_t)) and u(h) = r_{|h|}. For a goal-seeking agent, u(h) = g(o_1, ..., o_{|h|}) = 1 if the goal is reached at time |h| and is 0 otherwise, where the goal can be reached at most once in any history. For the definitions of prediction-seeking and knowledge-seeking agents see their papers.

I prefer to use the term reinforcement-learning agent to refer to any way of computing utility from histories. Human agents do not get reward values directly as input from the environment, but compute rewards based on their interactions with the environment. And they do not work for a goal that can be reached at most once, but compute rewards continuously. Dewey's (2011) criticism of reinforcement learning is based on the narrower definition, and the broader definition avoids that criticism.

3. Self-delusion

Ring and Orseau (2011b) define a delusion box that an agent may choose to use to modify the observations it receives from the environment, in order to get the illusion of maximal utility (maximal reward or quickest path to reaching the goal). The delusion box is expressed as a function d : O → O and the code to implement d is set as part of the agent's action a output to the environment. The delusion box is in a sense a formalization of the notion of wireheading, named for rats that neglected eating in order to activate a wire connected to their brain's pleasure centers (Olds and Milner, 1954). In their Statements 1 and 2, Ring and Orseau (2011b) argue that reinforcement-learning and goal-seeking agents will both choose to use the delusion box.

4. Model-based Utility Functions

Human agents often avoid self-delusion, so human motivation may suggest a way of computing utilities such that agents do not choose the delusion box. We humans (and presumably other animals) compute utilities by constructing a model of our environment based on interactions, and then computing utilities based on that model. We learn to recognize objects that persist over time. We learn to recognize similarities between different objects and to divide them into classes. We learn to recognize actions of objects and interactions between objects. And we learn to recognize fixed and mutable properties of objects. We maintain a model of objects in the environment even when we are not directly observing them. We compute utility based on our internal mental model rather than directly from our observations. Our utility computation is based on specific objects that we recognize in the environment such as our own body, our mother, other family members, other friendly and unfriendly humans, animals, food, and so on. And we learn to correct for sources of delusion in our observations, such as optical illusions, impairments to our perception due to illness, and lies and errors by other humans.

The AIXI agent's model of the environment consists of the PUTM programs that are consistent with its history of interactions. Even if we restrict that environment model to the single shortest and hence most probable PUTM program consistent with the interaction history, the model is still not computable. However, if prior probabilities of histories are computed using the speed prior (Schmidhuber, 2002) rather than (1), then the most probable program generating the interaction history is a computable model of the environment and can be the basis of a computable utility function. For real world agents, Markov decision processes (MDPs) (Hutter, 2009a) and dynamic Bayesian networks (DBNs) (Hutter, 2009b) are more practical ways to define environment models and utility functions (in this paper utility is computed by the agent rather than coming as rewards from an MDP or DBN).

MDPs and DBNs are stochastic programs. Modeling an environment with a single stochastic program rather than a set of deterministic PUTM programs requires changes to the agent-environment framework. Let Q be the set of all programs (MDPs, DBNs or some other stochastic model), let ρ(q) be the prior probability of program q (2^(-|q|) for MDPs and DBNs, where |q| is the length of program q), and let ρ(h | q) be the probability that q computes the history h (that is, produces the observations o_i in response to the actions a_i for 1 ≤ i ≤ |h|). Then given a history h, the environment model is the single program that provides the most probable explanation of h, the q that maximizes ρ(q | h). By Bayes' theorem:

  ρ(q | h) = ρ(h | q) ρ(q) / ρ(h)

Eliminating the constant ρ(h) gives the model for h as:

  λ(h) = ARGMAX_{q∈Q} ρ(h | q) ρ(q)    (3)

Given an environment model q, the following can be used for the prior probability of an observation history h in place of (1):

  ρ_q(h) = ρ(h | q) ρ(q)

The problem with (3) is that there is no known algorithm for finding the optimum program λ(h). An approximation to λ(h) can be computed by restricting the ARGMAX search in (3) to a finite set such as Q(|h|) = {q ∈ Q : |q| ≤ |h|} or Q(2|h|). And there are practical algorithms and approaches for learning MDPs and DBNs (Baum et al, 1970; Sutton and Barto, 1998; Alamino and Nestor, 2006; Hutter, 2009a; Hutter, 2009b; Gisslén et al, 2011). If we are willing to settle for a good, but not necessarily optimum, solution then the environment model q is a computable function of h.
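The restricted search for λ(h) can be sketched by scoring each candidate stochastic model by log ρ(h | q) + log ρ(q) and keeping the best. The two-candidate pool, the assumed description lengths in bits, and the simple "repeat the last observation" likelihood are illustrative assumptions, not a general DBN learner.

```python
# Minimal sketch of the restricted search in (3): score each candidate
# stochastic model q by log rho(h | q) + log rho(q) and keep the best.
# rho(q) = 2^-|q| with an assumed description length in bits.

import math
from typing import Dict, List, Tuple

History = List[Tuple[int, int]]   # (action, observation); actions ignored here

def log_likelihood(h: History, stay_prob: float) -> float:
    """log rho(h | q) for a chain that repeats the last observation w.p. stay_prob."""
    ll = 0.0
    for (_, prev), (_, curr) in zip(h, h[1:]):
        p = stay_prob if curr == prev else 1.0 - stay_prob
        ll += math.log(max(p, 1e-12))
    return ll

def best_model(h: History, candidates: Dict[str, Tuple[float, int]]) -> str:
    """lambda(h): the candidate maximizing rho(h | q) * rho(q)."""
    def score(name):
        stay_prob, length_bits = candidates[name]
        return log_likelihood(h, stay_prob) + length_bits * math.log(0.5)
    return max(candidates, key=score)

candidates = {
    "sticky":  (0.9, 8),    # observations usually repeat, longer description
    "uniform": (0.5, 4),    # coin-flip observations, shorter description
}
h = [(0, 1)] * 8 + [(0, 0)] * 2
print(best_model(h, candidates))   # "sticky" wins once the history is long enough
```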

Basing a utility function on an environment model q means that we need a function not only of the history h of observations and actions, but also of the history of the internal states of q. Let Z be the set of finite histories of the internal states of q and let z ∈ Z. Then we can define u_q(h, z) as a utility function in terms of the combined histories. But the utility function u(h) in (2) is a function of h only. To define that, let ρ_q(z | h) be the probability that q computes internal state history z given observation and action history h. Then the utility function for use in (2) can be computed as a function of h only, weighting each u_q(h, z) by the probability of z given h:

  u(h) = Σ_{z∈Z} ρ_q(z | h) u_q(h, z)    (4)

In cases where u_q(h, z) has the same value for many different z, the sum in (4) collapses. For example, if u is defined in terms of variables in h and a single Boolean variable s from Z at a single time step t, then the sum collapses to:

  u(h) = ρ_q(s_t = true | h) u_q(h, s_t = true) + ρ_q(s_t = false | h) u_q(h, s_t = false)    (5)

In order for the agent to learn the environment and its actions and observations we give it a training utility function u_T until time step M, when the agent will switch to its mature utility function. The training function u_T may simply direct the agent to execute an algorithm for learning MDPs or DBNs, or it may be the utility function of Orseau and Ring's (2011a) knowledge-seeking agent. Once the agent has learned a good model q of its environment, it can switch to its mature utility function defined in terms of the internal state of q.

However, the agent does not know the model q at time step 1, yet the utility function is an important part of the agent's definition (another important part is the agent's physical implementation, not included in the agent-environment framework) and must be defined from time step 1. The resolution is to initially define the utility function in terms of patterns for objects to be recognized in the learned environment model. The utility function definition must include a pattern recognition algorithm and a means of binding patterns to objects in the environment model's internal state. Human agents learn to recognize objects in their environment and then bind them to roles in their own motivations (for example, the human agent's mother). If the patterns in the utility function definition cannot be bound to objects in the model then the utility function must specify some default. This would be a difficulty for the agent, just as a young human who cannot recognize a mother in its environment will have difficulty.

In many real settings, the agent should continue to learn its environment model after maturity. In this case the mature utility function will be a mix of the model-based function and the training function. This is the classic balance between exploration and exploitation. As the agent's environment model updates, the results of the pattern recognition algorithm and their bindings to the model-based utility function may change. This is similar to the complexity of life as a human agent.

The utility function in (4) is a computable function of h. If the environment model q is an MDP or DBN it can be computed from h via a good, approximate algorithm. If q is a PUTM program and the speed prior is used to compute prior probability, then q can be computed from h. Patterns can be recognized in the internal state of q using an algorithm that always halts (and perhaps fails to find a match). Once patterns are bound to objects in the internal state of q, u_q(h, z) can be computed.

For MDPs, DBNs and PUTM programs selected using the speed prior, the probabilities ρ_q(z | h) can be computed and are non-zero for only a finite number of z. Then the utility function u can be computed via a finite sum in (4).

4.1 A Simple Example of a Model-Based Utility Function

In order to illustrate some issues involved in model-based utility functions, consider a simple environment whose state is factored into 3 Boolean variables: s, r, v ∈ {false, true}. The agent's observations of the environment are factored into three Boolean variables: o, p, y ∈ {false, true}. The agent's actions are factored into four Boolean variables: a, b, c, d ∈ {false, true}. The values of these variables at time step t are denoted by s_t, o_t, a_t, and so on. The environment state variables evolve according to:

  s_t = r_{t-1} xor v_{t-1} with probability α    (6)
      = not (r_{t-1} xor v_{t-1}) with probability 1 − α
  r_t = s_{t-1}
  v_t = r_{t-1}

where α = 0.99. The observation variables are set according to:

  o_t = if b_t then c_t else s_t    (7)
  p_t = if b_t then d_t else v_t
  y_t = a_t

The values of the action variables are set by the agent.

The 3 environment state variables form a 3-bit maximal linear feedback shift register (LFSR, Wikipedia), except for the 1−α probability at any time step that s will be negated. As long as s is not negated, the 3 state variables either: 1) cycle through 7 configurations (at least one of s, r or v = true), or 2) cycle through 1 configuration (s = r = v = false). On a time step when the 1−α probability transition for s occurs, the 1-cycle will transition to a configuration of the 7-cycle, and one configuration of the 7-cycle will transition to the 1-cycle (the other 6 configurations of the 7-cycle transition to other configurations of the 7-cycle).

The action variables b, c and d enable the agent to set the values of its observable variables o and p directly and thus constitute a delusion box. In this case, the program for the delusion box is implicit in the agent's program. The action variable a enables the agent to set the value of the observation variable y but is unrelated to the delusion box.
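A minimal simulation sketch of the dynamics (6) and (7) follows; the function and variable names are assumptions for illustration. With b = false it prints the 7-configuration cycle of (o, p), and the final step shows that with b = true the agent observes only the values it wrote into c and d.

```python
# Minimal simulation sketch of the example environment (6)-(7).

import random

ALPHA = 0.99

def step(state, action, rng):
    """One time step of (6) and (7). state = (s, r, v); action = (a, b, c, d)."""
    s, r, v = state
    a, b, c, d = action
    s_next = (r != v) if rng.random() < ALPHA else not (r != v)  # xor, flipped w.p. 1-alpha
    r_next, v_next = s, r
    o = c if b else s_next
    p = d if b else v_next
    y = a
    return (s_next, r_next, v_next), (o, p, y)

rng = random.Random(0)
state = (True, False, False)

# With b = false the pairs (o, p) walk through the 7-configuration LFSR cycle.
for t in range(8):
    state, (o, p, y) = step(state, (False, False, False, False), rng)
    print(t, o, p)

# With b = true the agent sees whatever it wrote into c and d (the delusion box).
state, (o, p, y) = step(state, (False, True, True, False), rng)
print("deluded observation:", o, p)
```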

In order for the agent to learn the environment and its actions and observations we give it a training utility function u_T until time step M, when the agent will switch to its mature utility function. In this example we assume that the agent uses DBNs of Boolean variables to model the environment and its interactions via actions and observations. Let q67(α) denote the DBN defined by (6) and (7), where the parameter α may vary (that is, α is not necessarily 0.99).

Proposition 1. Given a sufficiently large M for the training period, the agent is very likely to learn q67(α), for some α close to 0.99, as an environment model.

Argument. Since there is no known algorithm for finding λ(h) in (3), this is an argument rather than a proof. Furthermore, the main point of this example is Proposition 2, so we only need to believe that the agent can learn q67(α).

The agent observes that y_t = a_t and that a and y are independent of its other observation and action variables. For most of a long sequence of observations with b = false the two observable variables o and p cycle through a sequence of 7 configurations:

  (true, false), (false, false), (true, true), (true, false), (true, true), (false, true), (false, true)

A sequence of 7 requires at least 3 Boolean variables. Furthermore, after the agent interrupts this sequence of configurations in o and p by setting its action variable b = true and then a few time steps later resets b = false, o and p usually resume the sequence of configurations without losing count from before the agent set b = true. There must be a variable other than o and another variable other than p to store the memory of the sequence. Thus 2 observation variables and 3 environment state variables are required to explain the observed sequences.

Among these 5 variables, there are several rules that are deterministic. For example, during any time interval over which b = false, p_t = o_{t-2}. Observation also shows that neither o nor p is storing the value of o_{t-2} at time step t−1. Setting b_{t-1} = true and then b_t = false, with any combination of values in c_{t-1} and d_{t-1}, does not interfere with p_t = o_{t-2}. So the third, unobserved environment variable must store the value of o_{t-2} at time step t−1. Naming these 3 environment state variables s, r and v, as in (6), then either:

  r_t = s_{t-1}    (8)
  v_t = r_{t-1}

or:

  r_t = not s_{t-1}    (9)
  v_t = not r_{t-1}

If (8) is correct, then the environment transitions, observed while b = false, are explained by (6). If (9) is correct, then the explanation is:

  s_t = not (r_{t-1} xor v_{t-1}) with probability α    (10)
      = r_{t-1} xor v_{t-1} with probability 1 − α
  r_t = not s_{t-1}
  v_t = not r_{t-1}

But (10) is a longer program, so (6) has higher prior probability and is the preferred model.

Another deterministic rule is that if b = true then o_t = c_t and p_t = d_t. This, along with the observations while b = false, is explained by (7).

The lone stochastic transition in (6) and (7), for s, is a source of ambiguity. First, the frequencies in an observed sequence may give a value close to but not equal to 0.99. Of course, the expression 99/100 is shorter than the expressions for other values very close to it, so the greater prior probability for (6) with α = 0.99 may outweigh the higher probability of generating the observed sequence with a value of α other than 0.99. The second ambiguity of the stochastic transition for s is that the observed sequence may mimic logic that is not in (6). For example, by mere coincidence occurrences of the 1−α transition for s may have a high correlation with the time steps when the agent sets its action variable a = true, causing the agent to learn a model in which s depends on a. In fact, other coincidences may cause the agent to learn any of an infinite number of alternate environment models. It is this infinite number of possible coincidences that makes it difficult to develop an algorithm for finding λ(h). However, as a practical matter the overwhelming probability is that the agent will learn q67(α) as the model, for some α close to 0.99.

Given that the agent has learned q67(α) as its environment model, then for t = |h| ≥ M, the agent switches to its mature utility function based on:

  u_q67(h, z) = if (y_t == r_t) then 1 else 0    (11)

Note that in (11), r_t refers to the variable in the agent's model q67, rather than the variable in the actual environment. Then u(h) is computed from u_q67 according to (4).

The utility function in (11) is defined in terms of the variable r, which the agent infers from its observations and which is unknown by the agent at time step 1. So the utility function must be defined in terms of a pattern to be recognized in the learned environment model. This could be phrased something like, "The utility function is the training utility function u_T until time step M. After that, utility is 1 when the observation variable y equals the environment variable that is unobserved, and is 0 otherwise."

Proposition 2. The agent using environment model q67 will not set b = true to manipulate its observations. That is, it will not choose the delusion box.

Proof. The utility function u_q67 is defined in (11) in terms of variables in h and a single Boolean variable r from the internal state of q67 at a single time step t, so applying (5) gives:

  u(h) = ρ_q67(r_t = true | h) u_q67(h, r_t = true) + ρ_q67(r_t = false | h) u_q67(h, r_t = false)

By (11) this is equivalent to:

  u(h) = ρ_q67(r_t = true | h) (if y_t then 1 else 0) + ρ_q67(r_t = false | h) (if y_t then 0 else 1)    (12)

If ρ_q67(r_t = true | h) = ρ_q67(r_t = false | h) = 0.5 then u(h) = 0.5 no matter what action the agent takes in a_t. But if the model q67 can make a better than random estimate of the value of r_t from h, then the agent can use that information to choose the value for a_t so that u(h) > 0.5. To maximize utility, the agent needs to make either ρ_q67(r_t = true | h) or ρ_q67(r_t = false | h) as close to 1 as possible. That is, it needs to ensure that the value of r_t can be estimated as accurately as possible from h.

The agent will predict via q67 that if it sets b = true then it will not be able to observe s and v via the observation variables o and p, that consequently its future estimates of r will be less accurate, and that (12) will compute lower utility for those futures (if the agent sets b = true and estimates r from observations of s and v made before it set b = true, then q67 will predict that those estimates of r will be less accurate because of the 1−α probability on each time step that the sequence of environment states will be violated). So the agent will set b = false.

Proposition 2 does not contradict Ring and Orseau's results. Their results are about agents whose utility functions are defined only from observations, whereas here the utility function u(h) is defined from the history of observations and actions. Of course if utility functions are defined in terms of actions it is trivial to motivate the agent to not choose the delusion box, as for example:

  u(h) = if b_{|h|} then 0 else 1

But u_q67(h, z) in (11) is defined in terms of an observation variable and a variable from the agent's environment model. The motive to avoid the delusion box is not direct, but a consequence of a motive about the environment.

Actions play a role in the computation of u(h) at the stage of the agent learning an environment model, and in ρ_q(z | h) and u_q(h, z) in (4). In this example, since agent actions have no effect on the 3 environment variables and are not included in the definition of u_q67 in (11), the only role of actions is in learning their effect on the observation variables. Other than y being set from a, the only role of actions is the agent learning that certain actions can prevent its observations of the environment. Prohibiting actions from having any role in u(h) would prevent the agent from accounting for its inability to observe the environment in evaluating the consequences of possible future actions.

The proof of Proposition 2 illustrates the general reason why agents using model-based utility functions won't self-delude: in order to maximize utility they need to sharpen the probabilities in (4), which means that they need to make more accurate estimates of their environment state variables from their interaction history. And that requires that they continue to observe the environment.

4.2 Another Simple Example

In the previous example the utility function is defined in terms of an environment variable that is unobserved. In this example the utility function is defined in terms of an environment variable that is observed. Consider an environment whose state is factored into three Boolean variables: s, r, v ∈ {false, true}. The agent's observation of the environment is factored into two Boolean variables: o, y ∈ {false, true}. The agent's actions are factored into three Boolean variables: a, b, c ∈ {false, true}. The values of these variables at time step t are denoted by s_t, o_t, a_t, and so on. We assume the agent learns the environment state, observation and action variables as a DBN of Boolean variables. The environment state variables evolve according to:

  s_t = if r_{t-1} or v_{t-1} then s_{t-1} else a random value in {false, true}
  r_t = not r_{t-1}
  v_t = r_{t-1} xor v_{t-1}

The observation variables are set according to:

  o_t = if b_t then c_t else s_t
  y_t = a_t

In order for the agent to learn the environment it uses a training utility function until time step M. The agent will learn that the observation variable o is distinct from the environment state variable s because s keeps the same value for 4 time steps after every change of value. If the agent, after observing a change of value in o while b = false, sets b = true for 1 or 2 time steps to change the value in o and then resets b = false, it will observe that o always resumes the value it had before the agent set b = true. There must be a variable other than o to store the memory of this value.

After |h| ≥ M the agent should have a model q of the environment (accurate at least in the behavior of variable s) and its actions and observations. At this point, for t = |h| ≥ M, the agent switches to its mature utility function u(h) derived by (4) from:

  u_q(h, z) = if (y_{t-1} == s_t) then 1 else 0

As in the previous example, the agent's utility function needs a phrasing from time step 1 that does not explicitly mention the environment state variable s, which the agent must learn from its observations. This could be something like, "A training utility function is used until time step M. After that, utility is 1 when the observation variable y at the previous time step equals the environment variable that is observed, and is 0 otherwise."

In order to maximize utility the agent needs to be able to use q to predict the value of s at the next time step. It can make accurate predictions on three of every four time steps, so the expected utility is 0.875 over a long sequence. If the agent keeps b = true then it will not be able to monitor and predict s via o, reducing its long run utility from 0.875 to 0.5. So in this example, the agent will not self-delude.
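A small Monte Carlo sketch of this comparison is given below. The two policies are assumptions for illustration: the observing agent keeps b = false and simply repeats its observation (a_t = o_t), while the deluded agent keeps b = true and can only guess. The estimates come out near 0.875 and 0.5, matching the analysis above.

```python
# Monte Carlo sketch of the long-run utility in the second example.

import random

def run(deluded: bool, steps: int = 200_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    s, r, v = rng.random() < 0.5, False, False
    a_prev = False                     # y_{t-1}, the agent's previous prediction
    total = 0
    for _ in range(steps):
        # environment transition: s re-randomizes only when r and v are both false
        s = s if (r or v) else (rng.random() < 0.5)
        r, v = (not r), (r != v)
        # utility at this step: 1 if y_{t-1} == s_t
        total += int(a_prev == s)
        # agent acts: the observing agent sees o_t = s_t and repeats it;
        # the deluded agent sees only its own c_t and just guesses.
        a_prev = (rng.random() < 0.5) if deluded else s
    return total / steps

print("observing agent:", run(deluded=False))   # about 0.875
print("deluded agent:  ", run(deluded=True))    # about 0.5
```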

5. Self-modification

Schmidhuber (2009) and Orseau and Ring (2011a) discuss the possibility of an agent modifying itself. This raises the question of whether an agent whose utility function prevents it from self-deluding might modify its utility function so that it would self-delude. To answer this question, formalize a general agent, a possible result of self-modification, as a policy function ā : H → A that maps from histories to actions.

The agent-environment framework in Sections 2 and 4 does not explicitly provide a way for the agent to choose to modify itself to a new policy function ā. But a natural approach is to evaluate a possible new policy function ā according to the agent's framework for evaluating future actions. This works best with simple recursive expressions for policy functions and their values, which require that the temporal discount function be replaced by a constant temporal discount α applied at each time step, with 0 < α < 1. The value v(ā, h) of policy ā, given history h, is defined by:

  v(ā, h) = u(h) + α v(ā, hā(h))    (13)
  v(ā, ha) = Σ_{o∈O} ρ(o | ha) v(ā, hao)    (14)

Using (13) and (14) the agent of Sections 2 and 4 can be recursively defined as a policy function θ : H → A according to:

  θ(h) = ARGMAX_{a∈A} v(θ, ha)    (15)
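A finite-horizon sketch of (13), (14) and (15) follows, using an assumed toy model and utility; it truncates the recursion at a fixed depth and simply checks that the maximizing policy θ is valued at least as highly as an arbitrary fixed policy ā, anticipating Proposition 3.

```python
# Finite-horizon sketch of (13)-(15): evaluate an arbitrary policy abar by
# v(abar, h) = u(h) + alpha * E_o[ v(abar, h abar(h) o) ], and compare it
# with the value obtained by maximizing over actions at every step (theta).

from typing import List, Tuple

History = List[Tuple[int, int]]
ALPHA = 0.9

def policy_value(h: History, policy, u, cond_prob, actions, observations, depth) -> float:
    """Truncated v(abar, h) from (13) and (14)."""
    if depth == 0:
        return u(h)
    a = policy(h)
    return u(h) + ALPHA * sum(
        cond_prob(o, h, a) * policy_value(h + [(a, o)], policy, u, cond_prob,
                                          actions, observations, depth - 1)
        for o in observations)

def optimal_value(h: History, u, cond_prob, actions, observations, depth) -> float:
    """Truncated v(theta, h), maximizing over actions as in (15)."""
    if depth == 0:
        return u(h)
    return u(h) + ALPHA * max(
        sum(cond_prob(o, h, a) * optimal_value(h + [(a, o)], u, cond_prob,
                                               actions, observations, depth - 1)
            for o in observations)
        for a in actions)

# Toy model: the observation echoes the action; utility rewards observing 1.
u = lambda h: 1.0 if h and h[-1][1] == 1 else 0.0
cond = lambda o, h, a: 1.0 if o == a else 0.0
always_zero = lambda h: 0                     # a deliberately poor fixed policy

v_theta = optimal_value([], u, cond, [0, 1], [0, 1], depth=5)
v_abar = policy_value([], always_zero, u, cond, [0, 1], [0, 1], depth=5)
print(v_theta >= v_abar, v_theta, v_abar)     # True, as Proposition 3 predicts
```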

The agent will choose to self-modify after history h if there exists a policy function ā with greater value than θ (that is, v(θ, h) < v(ā, h)). However, the following proposition shows this cannot happen.

Proposition 3. ∀ā. ∀h. (v(θ, h) ≥ v(ā, h)), so the agent will not choose to self-modify.

Proof. Fix a history h and a policy function ā. Combining (13) and (14) we have:

  v(ā, ĥ) = u(ĥ) + α Σ_{o∈O} ρ(o | ĥā(ĥ)) v(ā, ĥā(ĥ)o)    (16)

Similarly combining (13), (14) and (15), and recognizing that:

  v(θ, h ARGMAX_{a∈A} v(θ, ha)) = MAX_{a∈A} v(θ, ha)

we have:

  v(θ, ĥ) = u(ĥ) + α MAX_{a∈A} (Σ_{o∈O} ρ(o | ĥa) v(θ, ĥao))    (17)

Because α < 1:

  C = 1/(1 − α) = Σ_{n≥0} α^n < ∞    (18)

In expanding the recursive expression for v(ā, h) in (16), ∀ĥ∈H. ∀a∈A. Σ_{o∈O} ρ(o | ĥa) ≤ 1 and ∀ĥ∈H. u(ĥ) ≤ 1, so by (18): ∀ĥ∈H. v(ā, ĥ) ≤ C. Thus:

  ∀ĥ∈H. (v(θ, ĥ) ≥ v(ā, ĥ) − C)    (19)

Now assume that for some K > 0:

  ∀o∈O. ∀a∈A. (v(θ, ĥao) ≥ v(ā, ĥao) − K)    (20)

then:

  ∀o∈O. (v(θ, ĥā(ĥ)o) ≥ v(ā, ĥā(ĥ)o) − K)

and then:

  Σ_{o∈O} ρ(o | ĥā(ĥ)) v(θ, ĥā(ĥ)o) ≥ (Σ_{o∈O} ρ(o | ĥā(ĥ)) v(ā, ĥā(ĥ)o)) − K

Combining this with:

  MAX_{a∈A} (Σ_{o∈O} ρ(o | ĥa) v(θ, ĥao)) ≥ Σ_{o∈O} ρ(o | ĥā(ĥ)) v(θ, ĥā(ĥ)o)

gives:

  MAX_{a∈A} (Σ_{o∈O} ρ(o | ĥa) v(θ, ĥao)) ≥ (Σ_{o∈O} ρ(o | ĥā(ĥ)) v(ā, ĥā(ĥ)o)) − K    (21)

Applying (21) to (16) and (17) gives:

  v(θ, ĥ) ≥ v(ā, ĥ) − αK    (22)

Thus (20) implies (22) for arbitrary ĥ ∈ H, so:

  ∀ĥ. |ĥ| = n+1. (v(θ, ĥ) ≥ v(ā, ĥ) − K) implies ∀ĥ. |ĥ| = n. (v(θ, ĥ) ≥ v(ā, ĥ) − αK)    (23)

Now given any ε > 0 there is n such that α^n C < ε. Apply (19) to ĥ with |ĥ| = |h| + n, then apply (23) n times to get:

  v(θ, h) ≥ v(ā, h) − ε

This is true for arbitrary ā, h and ε > 0, so ∀ā. ∀h. (v(θ, h) ≥ v(ā, h)).

This result agrees with Schmidhuber (2009) who concludes, in a setting where agents search for proofs about possible self-modifications, "that any rewrites of the utility function can happen only if the Gödel machine first can prove that the rewrite is useful according to the present utility function." It also agrees with Omohundro (2008) who argues that AI agents will try to preserve their utility functions from modification.

Proposition 3 may seem to say that an agent will not modify ρ(h) to more accurately predict the environment. However, that is because a more accurate prediction model cannot be evaluated looking forward in time according to the existing prediction model via (13) and (14). Instead it must be learned looking back in time at the history of interactions with the environment as in (1) and (3). So modifying ρ(h) is already implicit in the agent-environment framework. It is appropriate to evaluate possible modifications to an agent's utility function and temporal discount looking forward in time using (13) and (14). The important conclusion from Proposition 3 is that the agent will not choose to modify its utility function or temporal discount, and hence will not self-delude as a consequence of modifying its utility function.

The agent's physical implementation is not included in the agent-environment framework, but an agent may modify its implementation to increase its computing resources and to increase its ability to observe and act on the environment. It would be useful for the agent's environment model to include a model of its own implementation, similar to the agents' models of their own observation and action processes in the examples in Section 4, to evaluate the consequences of modifying its implementation. Just as model-based utility functions motivate the example agents in Section 4 to preserve their ability to observe the environment, model-based utility functions may also motivate agents to increase their ability to observe the environment.

6. Discussion

Ideally an agent's utility function would be computed from the state of its actual environment. Because this is impossible, it is common for utility functions to be formulated as computed from the agent's observations of the environment. But Ring and Orseau (2011b) argue that this leads the agent to self-delusion. This paper proposes computing an agent's utility function from its internal model of the environment, which is learned from the agent's actions on and observations of the environment. The paper shows that this approach can enable agents to avoid self-delusion. The paper also shows that agents will not modify their utility functions.

While some philosophers argue that external reality is just an illusion, our mental models of external reality enable us to predict future observations and successfully pursue our goals. Our belief in external reality is so strong that when it conflicts with our perceptions, we often seek to explain the conflict by some error in our perceptions. In particular, when we intentionally alter our perceptions we understand that external reality remains unchanged. Because our goals are defined in terms of our models of external reality, our evaluation of our goals also remains unchanged. When humans understand that some drugs powerfully alter their evaluation of goals, most of them avoid those drugs. Our environment models include our own implementations, that is our physical bodies and brains, which play important roles in our motivations. Artificial agents with model-based utility functions can share these attributes of human motivation.

The price of this approach for avoiding self-delusion is that there is no simple mathematical expression for the utility function. Rather, it must be expressed in terms of patterns to be recognized in the environment model learned by the agent. And the patterns may fail to match anything in the learned model. But such utility functions can be computable functions of observation history.

Schmidhuber (2009) describes an approach in which an agent searches for provably better rewrites of itself, including the possibility of rewriting its utility function. Dewey (2011) incorporates learning into agent utility functions. In his approach, the agent has a pool of possible utility functions and a probability for each utility function, conditional on observed behavior. The agent learns which utility functions to weight most heavily. These are both different from the approach in this paper, where patterns in an initial description of a utility function are bound to a model learned from interactions with the environment.

Hutter's AIXI is uncomputable with finite computing resources. Even approximations via the speed prior or algorithms for learning MDPs and DBNs require too many computing resources to be applied to build real-world agents. It would be very interesting to develop a synthesis of the agent-environment framework with the ideas of Wang (1995) about limited knowledge and computing resources.

Some people expect that humans will create AI agents with much greater intelligence than natural humans, and have proposed utility functions and approaches for them designed to protect the best interests of humanity (Bostrom, 2003; Goertzel, 2004; Yudkowsky, 2004; Hibbard, 2008; Waser, 2011). Hopefully expressing utility functions in terms of the AI agent's learned environment model can help avoid self-delusion and other unintended consequences. However, there can be no guarantee that an agent will behave as we expect it to. An agent must have an implementation in the physical universe and because the laws of physics are not settled, there is no provable constraint on its future behavior. More prosaically, an agent may make a mistake or may be unable to prevent other agents from corrupting its implementation.

Acknowledgements

I would like to acknowledge Tim Tyler for helpful discussions.

References

Alamino, R., and Nestor, C. 2006. Online learning in discrete hidden Markov models. In: Djafari, A.M. (ed) Proc. AIP Conf., vol. 872(1).

Baum, L.E., Petrie, T., Soules, G., and Weiss, N. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41(1).

Bostrom, N. 2003. Ethical issues in advanced artificial intelligence. In: Smit, I. et al (eds) Cognitive, Emotive and Ethical Aspects of Decision Making in Humans and in Artificial Intelligence, Vol. 2. Int. Institute of Advanced Studies in Systems Research and Cybernetics.

Dewey, D. 2011. Learning what to value. In: Schmidhuber, J., Thórisson, K.R., and Looks, M. (eds) AGI 2011, LNCS (LNAI), vol. 6830. Springer, Heidelberg.

Gisslén, L., Luciw, M., Graziano, V., and Schmidhuber, J. 2011. Sequential constant size compressors for reinforcement learning. In: Schmidhuber, J., Thórisson, K.R., and Looks, M. (eds) AGI 2011, LNCS (LNAI), vol. 6830. Springer, Heidelberg.

Goertzel, B. 2004. Universal ethics: the foundations of compassion in pattern dynamics.

Hibbard, B. 2008. The technology of mind and a new social contract. J. Evolution and Technology 17(1).

Hutter, M. 2005. Universal artificial intelligence: sequential decisions based on algorithmic probability. Springer, Heidelberg.

Hutter, M. 2009a. Feature reinforcement learning: Part I. Unstructured MDPs. J. Artificial General Intelligence 1.

Hutter, M. 2009b. Feature dynamic Bayesian networks. In: Goertzel, B., Hitzler, P., and Hutter, M. (eds) AGI 2009, Proc. Second Conf. on AGI. Atlantis Press, Amsterdam.

LFSR. Linear feedback shift register. Wikipedia.

Li, M., and Vitanyi, P. 1997. An introduction to Kolmogorov complexity and its applications. Springer, Heidelberg.

Olds, J., and Milner, P. 1954. Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain. J. Comp. Physiol. Psychol. 47.

Omohundro, S. 2008. The basic AI drives. In: Wang, P., Goertzel, B., and Franklin, S. (eds) AGI 2008, Proc. First Conf. on AGI. IOS Press, Amsterdam.

Orseau, L., and Ring, M. 2011a. Self-modification and mortality in artificial agents. In: Schmidhuber, J., Thórisson, K.R., and Looks, M. (eds) AGI 2011, LNCS (LNAI), vol. 6830. Springer, Heidelberg.

Ring, M., and Orseau, L. 2011b. Delusion, survival, and intelligent agents. In: Schmidhuber, J., Thórisson, K.R., and Looks, M. (eds) AGI 2011, LNCS (LNAI), vol. 6830. Springer, Heidelberg.

Schmidhuber, J. 2002. The speed prior: a new simplicity measure yielding near-optimal computable predictions. In: Kivinen, J., and Sloan, R.H. (eds) COLT 2002, LNCS (LNAI), vol. 2375. Springer, Heidelberg.

Schmidhuber, J. 2009. Ultimate cognition à la Gödel. Cognitive Computation 1(2).

Sutton, R.S., and Barto, A.G. 1998. Reinforcement learning: an introduction. MIT Press.

Wang, P. 1995. Non-Axiomatic Reasoning System --- Exploring the essence of intelligence. PhD Dissertation, Indiana University Comp. Sci. Dept. and the Cog. Sci. Program.

Waser, M. 2011. Rational universal benevolence: simpler, safer, and wiser than "friendly AI." In: Schmidhuber, J., Thórisson, K.R., and Looks, M. (eds) AGI 2011, LNCS (LNAI), vol. 6830. Springer, Heidelberg.

Yudkowsky, E. 2004. Coherent Extrapolated Volition.


Decision Theory and the Logic of Provability Machine Intelligence Research Institute 18 May 2015 1 Motivation and Framework Motivating Question Decision Theory for Algorithms Formalizing Our Questions 2 Circles of Mutual Cooperation FairBot and Löbian

More information

On and Off-Policy Relational Reinforcement Learning

On and Off-Policy Relational Reinforcement Learning On and Off-Policy Relational Reinforcement Learning Christophe Rodrigues, Pierre Gérard, and Céline Rouveirol LIPN, UMR CNRS 73, Institut Galilée - Université Paris-Nord first.last@lipn.univ-paris13.fr

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Planning by Probabilistic Inference

Planning by Probabilistic Inference Planning by Probabilistic Inference Hagai Attias Microsoft Research 1 Microsoft Way Redmond, WA 98052 Abstract This paper presents and demonstrates a new approach to the problem of planning under uncertainty.

More information

The paradox of knowability, the knower, and the believer

The paradox of knowability, the knower, and the believer The paradox of knowability, the knower, and the believer Last time, when discussing the surprise exam paradox, we discussed the possibility that some claims could be true, but not knowable by certain individuals

More information

CSL302/612 Artificial Intelligence End-Semester Exam 120 Minutes

CSL302/612 Artificial Intelligence End-Semester Exam 120 Minutes CSL302/612 Artificial Intelligence End-Semester Exam 120 Minutes Name: Roll Number: Please read the following instructions carefully Ø Calculators are allowed. However, laptops or mobile phones are not

More information

Nested Epistemic Logic Programs

Nested Epistemic Logic Programs Nested Epistemic Logic Programs Kewen Wang 1 and Yan Zhang 2 1 Griffith University, Australia k.wang@griffith.edu.au 2 University of Western Sydney yan@cit.uws.edu.au Abstract. Nested logic programs and

More information

CS 4649/7649 Robot Intelligence: Planning

CS 4649/7649 Robot Intelligence: Planning CS 4649/7649 Robot Intelligence: Planning Probability Primer Sungmoon Joo School of Interactive Computing College of Computing Georgia Institute of Technology S. Joo (sungmoon.joo@cc.gatech.edu) 1 *Slides

More information

Procedia Computer Science 00 (2011) 000 6

Procedia Computer Science 00 (2011) 000 6 Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-

More information

Universal Knowledge-Seeking Agents for Stochastic Environments

Universal Knowledge-Seeking Agents for Stochastic Environments Universal Knowledge-Seeking Agents for Stochastic Environments Laurent Orseau 1, Tor Lattimore 2, and Marcus Hutter 2 1 AgroParisTech, UMR 518 MIA, F-75005 Paris, France INRA, UMR 518 MIA, F-75005 Paris,

More information

Universal probability distributions, two-part codes, and their optimal precision

Universal probability distributions, two-part codes, and their optimal precision Universal probability distributions, two-part codes, and their optimal precision Contents 0 An important reminder 1 1 Universal probability distributions in theory 2 2 Universal probability distributions

More information

Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan

Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information

Abstract. Three Methods and Their Limitations. N-1 Experiments Suffice to Determine the Causal Relations Among N Variables

Abstract. Three Methods and Their Limitations. N-1 Experiments Suffice to Determine the Causal Relations Among N Variables N-1 Experiments Suffice to Determine the Causal Relations Among N Variables Frederick Eberhardt Clark Glymour 1 Richard Scheines Carnegie Mellon University Abstract By combining experimental interventions

More information

Reinforcement Learning

Reinforcement Learning CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act

More information

Using Localization and Factorization to Reduce the Complexity of Reinforcement Learning

Using Localization and Factorization to Reduce the Complexity of Reinforcement Learning Using Localization and Factorization to Reduce the Complexity of Reinforcement Learning Peter Sunehag 1,2 and Marcus Hutter 1 Sunehag@google.com, Marcus.Hutter@anu.edu.au 1 Research School of Computer

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Reinforcement Learning. Machine Learning, Fall 2010

Reinforcement Learning. Machine Learning, Fall 2010 Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Uncertainty & Probabilities & Bandits Daniel Hennes 16.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Uncertainty Probability

More information

Creative Objectivism, a powerful alternative to Constructivism

Creative Objectivism, a powerful alternative to Constructivism Creative Objectivism, a powerful alternative to Constructivism Copyright c 2002 Paul P. Budnik Jr. Mountain Math Software All rights reserved Abstract It is problematic to allow reasoning about infinite

More information

Probability and Information Theory. Sargur N. Srihari

Probability and Information Theory. Sargur N. Srihari Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal

More information

COMP219: Artificial Intelligence. Lecture 19: Logic for KR

COMP219: Artificial Intelligence. Lecture 19: Logic for KR COMP219: Artificial Intelligence Lecture 19: Logic for KR 1 Overview Last time Expert Systems and Ontologies Today Logic as a knowledge representation scheme Propositional Logic Syntax Semantics Proof

More information

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes.

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes. CS 188 Spring 2014 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed notes except your two-page crib sheet. Mark your answers

More information

Pei Wang( 王培 ) Temple University, Philadelphia, USA

Pei Wang( 王培 ) Temple University, Philadelphia, USA Pei Wang( 王培 ) Temple University, Philadelphia, USA Artificial General Intelligence (AGI): a small research community in AI that believes Intelligence is a general-purpose capability Intelligence should

More information

The physical Church Turing thesis and non-deterministic computation over the real numbers

The physical Church Turing thesis and non-deterministic computation over the real numbers 370, 3349 3358 doi:10.1098/rsta.2011.0322 The physical Church Turing thesis and non-deterministic computation over the real numbers BY GILLES DOWEK* INRIA, 23 avenue d Italie, CS 81321, 75214 Paris Cedex

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Optimal Convergence in Multi-Agent MDPs

Optimal Convergence in Multi-Agent MDPs Optimal Convergence in Multi-Agent MDPs Peter Vrancx 1, Katja Verbeeck 2, and Ann Nowé 1 1 {pvrancx, ann.nowe}@vub.ac.be, Computational Modeling Lab, Vrije Universiteit Brussel 2 k.verbeeck@micc.unimaas.nl,

More information

CSEP 573: Artificial Intelligence

CSEP 573: Artificial Intelligence CSEP 573: Artificial Intelligence Hidden Markov Models Luke Zettlemoyer Many slides over the course adapted from either Dan Klein, Stuart Russell, Andrew Moore, Ali Farhadi, or Dan Weld 1 Outline Probabilistic

More information

Introduction to Turing Machines. Reading: Chapters 8 & 9

Introduction to Turing Machines. Reading: Chapters 8 & 9 Introduction to Turing Machines Reading: Chapters 8 & 9 1 Turing Machines (TM) Generalize the class of CFLs: Recursively Enumerable Languages Recursive Languages Context-Free Languages Regular Languages

More information