A reinforcement learning scheme for a multi-agent card game with Monte Carlo state estimation

Hajime Fujita and Shin Ishii
Nara Institute of Science and Technology, Takayama, Ikoma, JAPAN
{hajime-f,ishii}@is.aist-nara.ac.jp
CREST, Japan Science and Technology Agency

Abstract

This article presents a state estimation method based on Monte Carlo sampling in a partially observable situation. We formulate an automatic strategy acquisition problem for the multi-agent card game Hearts as a reinforcement learning (RL) problem. Since there are often many unobservable cards in this game, RL is dealt with in the framework of a partially observable Markov decision process (POMDP). We apply a Monte Carlo method to estimate unobservable states. Simulation results show that our model-based POMDP-RL method with Monte Carlo state estimation is applicable to this realistic multi-agent problem.

1 Introduction

Studies on reinforcement learning (RL) schemes in multi-agent environments have so far focused on game-strategy acquisition problems, which usually assume completely observable situations; for example, Blackjack [5], Othello [9] and Backgammon [7] have been studied. In more realistic applications, however, there is partial observability, and hence partially observable Markov decision processes (POMDPs) have attracted much attention in recent years. POMDPs provide a way to choose optimal actions in partially observable situations. For example, Kaelbling et al. presented a method that obtains the optimal solution of a POMDP [4], but in general such a solution is difficult to achieve. One reason is that the estimation of missing information [2], which is crucial for solving a POMDP, is often intractable due to the large state space. An approximation method is therefore essential for solving a realistic POMDP problem.

In our previous study [3], we dealt with the card game Hearts (a four-player, non-cooperative, finite-state, zero-sum, imperfect-information game) and proposed an RL scheme that used the prediction of environmental behaviors and the estimation of unobservable state variables. We formulated the automatic strategy acquisition problem of this card game as a POMDP, such that the state transition was approximated as one defined on mean-field approximation states. Although the resulting RL agent grew stronger than rule-based agents, further improvement was possible; the mean-field approximation was effective but not very accurate, because the states of Hearts are discrete. It is well known that such an analog approximation of discrete states makes the value (the evaluation of a state) smaller than its optimum. One possible way to avoid this difficulty is to estimate the unobservable variables with a sampling method, such as a Monte Carlo method [8].

This article proposes a state estimation method based on Monte Carlo sampling in a realistic partially observable situation, namely the card game Hearts. Since the state space of this game is huge, we focus on pessimistic observations, each of which is predicted to be detrimental to the learning agent, in order to carve out an important domain from such a large space. We then use Monte Carlo sampling over such future observations to estimate hidden states and to calculate their likelihood based on the prediction of environmental behaviors. This shrinks the state space and makes the estimation of the real world more effective.
We carried out experiments using rule-based agents that are stronger than the previous ones [3]; a new rule-based agent has more than 50 rules and plays at the level of an experienced human Hearts player. The learning agent trained by our improved POMDP-RL method exhibits high performance against the rule-based agents, and is dramatically better than an agent trained by our previous method in both learning speed and strength. We therefore conclude that our model-based POMDP-RL method with Monte Carlo state estimation is applicable to a realistic multi-agent problem.

2 Preparation

2.1 Review of POMDPs

This section describes the essential features of POMDPs that are used throughout this article. A POMDP framework consists of (1) a set of real states S = {s_1, s_2, ..., s_{|S|}}, which cannot be determined with complete certainty; (2) a set of observation states O = {o_1, o_2, ..., o_{|O|}}, which can be perceived by the agents; (3) a set of actions A = {a_1, a_2, ..., a_{|A|}}, which can be executed by the agents; and (4) a reward function R : S × A → R, which maps a state-action pair to a numerical reward. The dynamics of this model are represented by the transition probability P(s_{t+1} | s_t, a_t) and the observation probability P(o_t | s_t). Since the agents cannot perceive real states, they maintain belief states, each of which is a probability distribution over S and is a sufficient statistic of the whole history H_t = {o_t, a_{t-1}, o_{t-1}, ..., a_1, o_1}. The belief state is updated according to an incremental Bayes rule:

b(s_{t+1}) = P(s_{t+1} | o_{t+1}, a_t, H_t)
           = \frac{P(o_{t+1} | s_{t+1}, a_t, H_t) P(s_{t+1} | a_t, H_t)}{P(o_{t+1} | a_t, H_t)}
           = \frac{P(o_{t+1} | s_{t+1}) \sum_{s_t} P(s_{t+1} | s_t, a_t) b(s_t)}{P(o_{t+1} | a_t, H_t)}.    (1)

Here, b(s_t) ≡ P(s_t | H_t) ∈ B represents a belief state about a real state s_t. The main problem in a POMDP is to make the agents learn a policy π : B → A which maximizes the expected numerical reward. A formulation as a belief-state MDP provides a way to achieve this, by calculating the value function for a belief state:

V^\pi(b_t) = \sum_{s_t} b(s_t) V^\pi(s_t).    (2)

Here, V^π(b_t) and V^π(s_t) represent the value function under a policy π for b(s_t) and s_t, respectively. Since V^π(b_t) is linear with respect to b_t ≡ b(s_t), the optimal V^*(b_t) is the upper surface of the collection of state value functions, and hence the optimal value function V^*(b_t) is piecewise-linear and convex.

In general, it is difficult to calculate equations (1) and (2) because the set of real states S is huge, so the summation over s is intractable and some approximation method is needed. In our previous study [3], we used a mean-field approximation, which calculated the expectation value of states to avoid this intractability. Although this approximation achieved a good performance, it introduced some error in evaluating the current optimal value: applying an analog approximation like the mean-field approximation to problems whose states are discrete, like the card game Hearts, causes a performance loss due to under-estimation of the value function. One possible way to resolve this difficulty is to use a sampling method, e.g., a Monte Carlo method, and to estimate the unobservable states using an ensemble of sampled discrete states. The details of this method are described below (see Proposed method).
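To make equations (1) and (2) concrete, the following is a minimal sketch of the belief update and the belief value for a generic finite POMDP, not for Hearts itself; the transition array T, observation array Z, state values V and the toy numbers are illustrative assumptions, not taken from the paper.

    import numpy as np

    def belief_update(b, a, o, T, Z):
        """Incremental Bayes update of equation (1).

        b : belief over states, shape (|S|,)
        a : index of the action taken at time t
        o : index of the observation received at time t+1
        T : transition probabilities, T[a, s, s'] = P(s' | s, a)
        Z : observation probabilities, Z[s', o] = P(o | s')
        """
        predicted = b @ T[a]                 # sum_s P(s' | s, a) b(s)
        unnormalized = Z[:, o] * predicted   # P(o_{t+1} | s_{t+1}) * predicted
        return unnormalized / unnormalized.sum()   # divide by P(o_{t+1} | a_t, H_t)

    def belief_value(b, V):
        """Equation (2): value of a belief state as the expectation of state values."""
        return float(b @ V)

    # Toy usage with made-up numbers (2 states, 2 actions, 2 observations).
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.4, 0.6]]])   # T[a, s, s']
    Z = np.array([[0.8, 0.2], [0.3, 0.7]])     # Z[s', o]
    b = np.array([0.5, 0.5])
    b = belief_update(b, a=0, o=1, T=T, Z=Z)
    print(b, belief_value(b, V=np.array([1.0, -1.0])))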
2.2 POMDP formulation

The game of Hearts is partially observable for each of the four players. If we assume that there is a single learning agent and three opponent agents that do not learn, this game is approximated as a partially observable Markov decision process (POMDP); that is, the opponent agents constitute a static environment.

A single state transition of the game is represented by: (1) a real state s, which includes the allocation of every card (observable and unobservable); (2) an observation o for the learning agent, i.e., the cards in the agent's hand and the cards that have already been played in past time steps and in the current one; (3) the agent's action a, i.e., a single play at its turn; and (4) the strategy φ of each of the opponent agents. Let t indicate a playing turn of the learning agent; t = 14 indicates the end state of the game (there are 13 time steps in the card game Hearts; see section 4.1 for the details of the rules). In the following descriptions, we assume that there are three opponent agents intervening between the t-th play and the (t+1)-th play of the learning agent (if the first player of the t-th time step and of the (t+1)-th time step are different, the number of intervening opponent agents is not three, but the following explanation remains valid in such a case).
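To fix this notation in code before formalizing the transition, here is one possible, purely illustrative encoding of these elements; the card representation and field names are our own choices and are not prescribed by the paper.

    from dataclasses import dataclass
    from typing import FrozenSet, List, Tuple

    Card = Tuple[str, int]            # e.g. ('S', 12) = queen of spades; ranks 2..14 with A = 14

    @dataclass
    class RealState:
        """s: the full allocation of all 52 cards, unobservable to any single player."""
        hands: Tuple[FrozenSet[Card], FrozenSet[Card], FrozenSet[Card], FrozenSet[Card]]
        played: List[Tuple[int, Card]]        # (player, card) in the order played so far

    @dataclass
    class Observation:
        """o: what the learning agent perceives: its own hand plus all cards already played."""
        my_hand: FrozenSet[Card]
        played: List[Tuple[int, Card]]

    Action = Card                             # a: a single play at the agent's turn

    def observe(s: RealState, player: int) -> Observation:
        """The observation is a deterministic function of the real state (cf. Fact 2 below)."""
        return Observation(my_hand=s.hands[player], played=list(s.played))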

Between the t-th play and the (t+1)-th play of the learning agent there are therefore three state transitions, due to the actions of the three opponent agents. These intermediate state transitions are also indexed by t. Each of the opponent agents is likewise in a partial observation situation; its state, observation, action and strategy at its t-th playing turn are denoted by s_t^i, o_t^i, a_t^i and φ^i, respectively, where i is the index of the opponent agent. Let M denote the learning agent and M^i (i = 1, 2, 3) denote the i-th opponent agent. We assume that agent M^i probabilistically determines its action a_t^i from its own observation o_t^i at its t-th play. Under this assumption, the state transition between the t-th play and the (t+1)-th play of the learning agent is given by

P(s_{t+1} | s_t, a_t, \Phi) = \sum_{s_t^1, s_t^2, s_t^3} \sum_{a_t^1, a_t^2, a_t^3} \prod_{j=0}^{3} P(s_t^{j+1} | s_t^j, a_t^j) \prod_{i=1}^{3} \sum_{o_t^i} P(a_t^i | o_t^i, \phi^i) P(o_t^i | s_t^i),    (3)

where s_t^0 = s_t, a_t^0 = a_t, s_t^4 = s_{t+1}, a_t^4 = a_{t+1}, o_t^0 = o_t and o_t^4 = o_{t+1}. Φ ≡ {φ^i : i = 1, 2, 3}, where φ^i denotes the strategy of opponent agent M^i. Since the game process of Hearts is deterministic, the following three facts hold:

Fact 1. The new state s_t^{i+1}, which is reached from a previous state s_t^i by an action a_t^i, is uniquely determined. Namely, P(s_t^{i+1} | s_t^i, a_t^i) is 1 for a certain state and 0 for all other states.

Fact 2. The observation o_t^i is uniquely determined at state s_t^i. Namely, P(o_t^i | s_t^i) is 1 for a certain observation state and 0 for all other observation states.

Fact 3. The penalty points at time t are determined only by the actual observation o_{t+1}. Assuming that the reward is given by the penalty points received in a turn, the reward r_t therefore depends only on the observation o_{t+1}.

Since the state s_t is not observable for any agent, the learning agent could estimate it using the history H_t ≡ {(o_t, ·, ·), (o_{t-1}, a_{t-1}, a_{t-1}^{1,2,3}), ..., (o_1, a_1, a_1^{1,2,3})} and the actions a_t^i (i = 1, 2, 3) at the t-th play. To predict the rewards, however, it is not necessary to estimate s_t; it is enough to estimate o_{t+1}, because of Fact 3 above. Using this property, we propose a state estimation method that predicts the expected reward effectively in the next section.
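To illustrate Facts 1-3, the sketch below resolves a single time step deterministically and reads the penalty directly off the cards that will appear in o_{t+1}. It assumes the standard Hearts rules summarized in section 4.1 (the highest card of the suit led wins the trick), a (suit, rank) card encoding with ranks 2-14 (ace high), and helper names of our own choosing.

    PENALTY = {('S', 12): 13}   # the queen of spades costs 13 points; each heart costs 1

    def card_penalty(card):
        suit, rank = card
        if suit == 'H':
            return 1
        return PENALTY.get(card, 0)

    def resolve_trick(plays):
        """plays: list of (player, (suit, rank)) for one time step, in playing order.

        Fact 1: the outcome is a deterministic function of the state and the actions.
        Fact 3: the penalty (reward) depends only on the cards visible in o_{t+1}.
        Returns (winner, penalty_points)."""
        lead_suit = plays[0][1][0]
        # the strongest card of the suit led takes the trick
        winner, _ = max((p for p in plays if p[1][0] == lead_suit),
                        key=lambda p: p[1][1])
        points = sum(card_penalty(card) for _, card in plays)
        return winner, points

    # Example: the queen of spades is dumped on a spade lead.
    print(resolve_trick([(0, ('S', 5)), (1, ('S', 12)), (2, ('H', 3)), (3, ('S', 9))]))
    # -> (1, 14): player 1 wins the trick and receives 13 + 1 penalty points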
3 Proposed method

3.1 State estimation by Monte Carlo sampling

In the POMDP framework, we need to consider all possible combinations of unobservable states and to calculate the summation over the possible states (see equations (1) and (2)). However, this is computationally intractable in problems whose state space is huge. In addition, Hearts has a difficulty peculiar to realistic problems. Since an ordinary 52-card deck is used, each state is represented by a 52-dimensional vector. Although the 13 cards in the learning agent's hand are observable variables, the remaining unobservable cards occupy 39 dimensions of each state. Moreover, each player plays a card in turn, so the state space of the unobservable variables shrinks by three dimensions to a subspace at every time step. Therefore, if we estimate primitive states from such a high-dimensional and awkward state space with a sampling method, it does not perform well: sampled states in the shrunken state space have little meaning, and expected values calculated from such sampled states are not accurate.

According to our method, therefore, we first sample a next observation ô_{t+1} from a subspace Ô_{t+1}, which is extracted from the possible next observation space O_{t+1}, before estimating the current state s_t. The effective dimension of O_{t+1} is larger by three than that of O_t, so it is enough for the sampling method to focus on those three dimensions; this restriction on the observation space is helpful for restricting the possible state space. Next, we estimate unobservable states from the state subspace restricted by each sampled observation, and predict the environmental change starting from each of the estimated states.

More concretely, we consider a specific next observation space Ô_{t+1}, extracted from the possible next observation space O_{t+1}, such that every next observation ô_{t+1} ∈ Ô_{t+1} is predicted to be detrimental to the learning agent. Such an observation ô_{t+1} is called a pessimistic observation. The three cards played at time t are determined solely by the difference between each sampled observation ô_{t+1} and the current observation o_t, and the winner at time t is then also determined. Here, we assume for simplicity that ô_{t+1} includes the order of the cards played at time t. The state space S_t(ô_{t+1}), which is restricted by each sampled observation ô_{t+1}, is a subspace of the whole state space S_t; its dimension is smaller than that of S_t by almost three. By sampling an unobservable state from this restricted state space, the likelihood of the next-time state that produces a pessimistic observation ô_{t+1} can be calculated based on the dynamics of the environment. Note that each pessimistic observation is generated according to the opponent agents' strategies, assuming that they take actions detrimental to the learning agent (see Action control by state prediction, below). This idea for reducing the state space to explore is similar to the Min-Max search of a game decision tree.

First, then, the pessimistic observations are generated from the current history H_t based on a set of rules, which is easy to do because of Fact 3 above. Second, we sample possible current states ŝ_t(k, n) ∈ S_t(ô_{t+1}(k)) so as to be consistent with each generated pessimistic observation ô_{t+1}(k), where k is the index of pessimistic observations and n is the index of sampled current states. Let K denote the number of pessimistic observations ô_{t+1} and N the number of current states ŝ_t(k, n) sampled for each ô_{t+1}. Since there are many current states consistent with a pessimistic observation ô_{t+1}(k), we sample N current states ŝ_t(k, n) for each pessimistic observation ô_{t+1}(k). As just described, we can carve out an important domain of the state space by restricting the next observation space to the pessimistic observations, and estimate the unobservable state variables by Monte Carlo sampling from such a shrunken subspace. Using this state estimation method, the learning agent calculates the likelihood of each pessimistic observation ô_{t+1}(k), as described in the next section.
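The two-stage sampling just described can be summarized schematically as follows. The rule set that proposes pessimistic observations and the sampler of consistent card allocations depend on hand-crafted Hearts knowledge, so they appear here only as placeholder callables (generate_pessimistic_observations, sample_consistent_state); only the K × N loop structure is taken from the text.

    def estimate_states(history, o_t, K, N,
                        generate_pessimistic_observations,
                        sample_consistent_state):
        """Section 3.1, schematically.

        1) Carve out K pessimistic next observations ô_{t+1}(k) from the current
           history (rule-based; easy because of Fact 3).
        2) For each of them, draw N current states ŝ_t(k, n) from the restricted
           subspace S_t(ô_{t+1}(k)) that is consistent with it, with uniform
           probability, i.e. P(s_t = ŝ_t(k, n)) = 1/N.
        Returns a list of (pessimistic observation, [N sampled states]) pairs.
        """
        samples = []
        for o_hat in generate_pessimistic_observations(history, o_t, K):
            states = [sample_consistent_state(history, o_t, o_hat) for _ in range(N)]
            samples.append((o_hat, states))
        return samples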

3.2 State prediction

The transition probability for the observation state is given by

P(o_{t+1} | a_t, \Phi, H_t) = \sum_{s_t \in S_t} \sum_{s_{t+1} \in S_{t+1}} P(o_{t+1} | s_{t+1}) P(s_{t+1} | s_t, a_t, \Phi) P(s_t | H_t).    (4)

From Facts 1 and 2 above and equation (3), equation (4) is transformed into

P(o_{t+1} | a_t, \Phi, H_t) = \sum_{s_t \in S_t} \sum_{(a_t^1, a_t^2, a_t^3) \in A_{t+1}} \prod_{i=1}^{3} P(a_t^i | o_t^i, \phi^i, H_t^i) P(s_t | H_t),    (5)

where P(s_t | H_t) is a belief state. A_{t+1} denotes the set of possible triplets (a_t^1, a_t^2, a_t^3) by which the previous state-action pair (s_t, a_t) reaches a new state s_{t+1} whose observation state is o_{t+1}. Note that A_{t+1} corresponds to the difference between the next observation space O_{t+1} and the current observation space O_t. Equation (5) provides a model of the environmental dynamics. However, the calculation in equation (5) involves two difficulties. One is the intractability of the belief state: since the state space S_t of the game of Hearts is huge, the rigorous calculation of the summation over s_t ∈ S_t is difficult. The other is the difficulty of searching the game tree: since the next observation space O_{t+1} is huge, A_{t+1}, which is the difference between it and the current observation space O_t, is also huge, and the summation over (a_t^1, a_t^2, a_t^3) ∈ A_{t+1} is therefore also difficult.

To avoid these computational difficulties, we approximate equation (5) using the Monte Carlo sampling described in section 3.1 as

P(o_{t+1} | a_t, \Phi, H_t) \approx \sum_{\hat{s}_t \in S_t(\hat{o}_{t+1})} \sum_{(\hat{a}_t^1, \hat{a}_t^2, \hat{a}_t^3) \in A_{t+1}(\hat{o}_{t+1})} \prod_{i=1}^{3} P(\hat{a}_t^i | \hat{o}_t^i, \phi^i, H_t^i) P(\hat{s}_t | H_t)
                            = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} \prod_{i=1}^{3} P(\hat{a}_t^i(k) | \hat{o}_t^i(k, n), \hat{\phi}^i, \hat{H}_t^i(k, n)).    (6)

Note that when the next observation space O_{t+1} is restricted to Ô_{t+1}, S_t and A_{t+1} are also restricted to S_t(ô_{t+1}) and A_{t+1}(ô_{t+1}), respectively, and the computational difficulties described above can be avoided. ô_t^i(k, n) denotes an observation of the opponent agent M^i. Because every card played at the t-th time step is contained in the next observation o_{t+1}, if we pick the k-th pessimistic observation, the next history Ĥ_{t+1}(k) = H_t ∪ {ô_{t+1}(k), a_t, â_t^1(k), â_t^2(k), â_t^3(k)} is determined without any ambiguity.

Moreover, if we consider the n-th sampled current state ŝ_t(k, n) for the k-th next history Ĥ_{t+1}(k), the observation state of each opponent agent, ô_t^i(k, n), can also be determined. â_t^i(k) denotes an action of the opponent agent M^i depending on the k-th pessimistic observation, and Ĥ_t^i(k, n) denotes an opponent agent's history depending on ô_t^i(k, n). φ̂^i represents the policy approximated by the learning agent, which is assumed to produce the actions of agent M^i but differs from the real strategy φ^i in equation (3) (recall that we have assumed agent M^i probabilistically determines its action a_t^i from its own observation o_t^i). The action selection probability of the i-th opponent agent, P(â_t^i | ô_t^i(k, n), φ̂^i, Ĥ_t^i(k, n)), is calculated by a function approximator whose input is ô_t^i(k, n). We assume P(s_t = ŝ_t(k, n)) = 1/N, that is, the current states are sampled from the restricted state space S_t with uniform probability. Under this assumption the sampling method does not use the incremental Bayes inference for belief state estimation, but neither does it need to maintain a belief state, which is a probability distribution over a huge unobservable state space. By calculating equation (6) approximately with this state estimation method, the learning agent can predict the observation dynamics and select its action based on the environmental behavior.

3.3 Action control by state prediction

According to our RL method, an action is selected based on the expected TD error, defined by

\delta_t(a_t) = \langle R(o_{t+1}) \rangle(a_t) + \gamma \langle V(o_{t+1}) \rangle(a_t) - V(o_t),    (7)

where

\langle f(o_{t+1}) \rangle(a_t) \equiv \sum_{o_{t+1} \in O_{t+1}} P(o_{t+1} | a_t, \Phi, H_t) f(o_{t+1})    (8)

and P(o_{t+1} | a_t, Φ, H_t) is given by equation (6). The expected TD error thus incorporates the estimation of unobservable states and the strategies of the other agents. Using the expected TD error, the action selection probability is determined as

P(a_t | o_t) = \frac{\exp(\delta_t(a_t)/T_m)}{\sum_{a_t' \in A} \exp(\delta_t(a_t')/T_m)}.    (9)

T_m is a parameter controlling the action randomness. Our RL method uses the TD error expected with respect to the estimated transition probability for the observation state, and an action is then determined based on this estimated environmental model. Such an RL method is often called a model-based RL method.
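A schematic rendering of equations (6)-(9): the per-sample likelihoods of equation (6) are averaged into observation probabilities, which then feed the expected TD error and the softmax of equation (9). Here opponent_views and predict_opponent_action_prob stand in for the reconstruction of the opponents' observations and histories and for the learned action predictors of section 3.4, V and reward are placeholder callables over observations, and normalizing the probabilities over the pessimistic subset only is our own simplification.

    import math

    def observation_probs(samples, opponent_views, predict_opponent_action_prob):
        """Equation (6): for each pessimistic observation ô_{t+1}(k), average over the N
        sampled states the product of the predicted opponent-action probabilities.

        samples: [(ô_{t+1}(k), [ŝ_t(k, n), ...]), ...] as produced in section 3.1
        opponent_views(o_hat, s_hat) -> [(ô_t^i, â_t^i, Ĥ_t^i) for i = 1, 2, 3]
        """
        probs = []
        for o_hat, states in samples:
            total = 0.0
            for s_hat in states:
                p = 1.0
                for o_i, a_i, h_i in opponent_views(o_hat, s_hat):
                    p *= predict_opponent_action_prob(a_i, o_i, h_i)
                total += p
            probs.append(total / len(states))
        return probs

    def expected_td_error(samples, probs, reward, V, o_t, gamma):
        """Equation (7): <R(o_{t+1})>(a_t) + gamma <V(o_{t+1})>(a_t) - V(o_t), with the
        expectation (8) taken over the estimated (here normalized) observation probabilities."""
        z = sum(probs) or 1.0
        expectation = sum(p * (reward(o_hat) + gamma * V(o_hat))
                          for p, (o_hat, _) in zip(probs, samples)) / z
        return expectation - V(o_t)

    def action_probs(deltas, T_m=1.0):
        """Equation (9): softmax over the expected TD errors of the candidate actions."""
        weights = [math.exp(d / T_m) for d in deltas]
        z = sum(weights)
        return [w / z for w in weights]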
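The action predictor's output distribution of equation (10) can be sketched as a softmax over merit values. U_i stands for the merit function (realized in the paper as an NGnet trained by on-line EM, which is not reproduced here), the argument names are ours, and the training comment is only schematic.

    import math

    def opponent_action_probs(U_i, o_i, legal_actions, T_i=1.0):
        """Equation (10): softmax over the merit values U^i(ô_t^i(k, n), â) of the legal
        actions A^i of opponent M^i.  U_i is any callable (observation, action) -> float."""
        weights = {a: math.exp(U_i(o_i, a) / T_i) for a in legal_actions}
        z = sum(weights.values())
        return {a: w / z for a, w in weights.items()}

    # Training (schematic): when a past game is replayed, o_t^i is reconstructed and the
    # merit U^i(o_t^i, a_t^i) of the action the opponent actually took is increased,
    # actor-style; the paper realizes U^i with an NGnet fitted by on-line EM [6].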

4 Experiment

4.1 The card game Hearts

Hearts is played by four players with an ordinary 52-card deck. There is an order of strength within each suit (decreasing from A, K, Q, ..., 2). The cards are distributed to the four players so that each player holds 13 cards at the beginning of a game. Thereafter, each player plays a card in clockwise order. The game ends after 13 time steps have been carried out. At each time step, the player that has played the strongest card (of the suit led) becomes the winner of that time step. Each heart incurs a one-point penalty, while the queen of spades incurs a 13-point penalty. The winner of a time step receives all of the penalty points of the cards played in that time step. A single game is played according to these rules, and at the end of the game the score of each player is the sum of the penalty points it has received. The lower the score, the better, so players must contrive strategies to avoid receiving penalty points.

4.2 Single agent learning in a stationary environment

We carried out experiments using one learning agent and three rule-based opponent agents. The newly developed rule-based agent has more than 50 rules and plays at the level of an experienced human Hearts player. The acquired penalty ratio was 0.42 when an agent that simply played permitted cards at random from its hand challenged the three rule-based agents. The acquired penalty ratio is the ratio of the penalty points acquired by the learning agent to the total penalty points of the four agents; that is, the random agent acquired on average about 2.1 times as many points as a rule-based agent. The rule-based agent used in this experiment is much stronger than the previous one [3]: the learning agent based on our previous RL method could not become stronger than this rule-based agent even after 120,000 training games (see Figure 2; detailed data not shown).
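As a quick check of the figures quoted above (assuming the three rule-based agents split the remaining penalty points roughly equally):

    random_agent_ratio = 0.42                                  # acquired penalty ratio of the random agent
    rule_based_ratio_each = (1.0 - random_agent_ratio) / 3     # ~0.193 per rule-based agent
    print(random_agent_ratio / rule_based_ratio_each)          # ~2.17, i.e. about 2.1-fold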
Figure 1 shows the learning curve of an agent trained by our new RL method when it challenged the three rule-based agents. The curve is an average over three learning runs, each consisting of 100,000 training games. The acquired penalty ratio of the learning agent decreases from the beginning of the learning episodes. This simulation result shows that the learning agent based on our proposed method performs well against experienced-level rule-based agents.

Figure 2 compares the learning agent trained by our improved RL method with a learning agent trained by our previous method. The learning curves are averages over three learning runs, each consisting of 80,000 training games, in which each agent challenged the three rule-based agents. The result shows that our new RL method improves dramatically on our previous one in both learning speed and strength. There are two reasons for this successful result. First, the learning agent can avoid the worst consequences by generating pessimistic observations, which also shrinks the state space effectively. Second, our state estimation based on the Monte Carlo method over discrete states can effectively eliminate the above-mentioned performance loss.

Figure 1: Proposed RL agent vs. rule-based agents (curves: "Proposed RL agent", "Rule-based agents"; axes: averaged acquired penalty ratio vs. number of games, x 10^4). The penalty ratio is smoothed using the 2,000 games just before that number of training games.

Figure 2: Comparison between the proposed RL agent and an agent trained by our previous RL method (curves: "Proposed RL agent with Monte Carlo state estimation", "Proposed RL agent with mean-field state estimation"; same axes as Figure 1). The penalty ratio is smoothed using the 2,000 games just before that number of training games.

5 Summary

In this article, we proposed a state estimation method based on Monte Carlo sampling in a partially observable world, and designed an autonomous learning agent that plays the multi-player card game Hearts. Since Hearts is a realistic imperfect-information game and its state space is huge, the RL scheme is formulated as a POMDP, and an approximation method is required to estimate the unobservable state variables. By using pessimistic observations, we carved out the important domain of the huge state space and estimated the transition probability from discrete states sampled according to those pessimistic observations. The proposed method thus removes the performance loss arising from analog approximation and avoids computational intractability, so that it copes with partial observability in a multi-agent environment. Our model-based POMDP-RL method with sampling-based estimation of unobservable states has successfully been applied to a realistic multi-agent problem.

References

[1] Barto, A.; Sutton, R.; and Anderson, C. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, Vol. 13.

[2] Ginsberg, M. 2001. Imperfect information in a computationally challenging game. Journal of Artificial Intelligence Research, Vol. 14.

[3] Fujita, H.; Matsuno, Y.; and Ishii, S. 2003. A reinforcement learning scheme for a multi-agent card game. In Proc. IEEE International Conference on Systems, Man and Cybernetics.

[4] Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence, Vol. 101.

[5] Pérez-Uribe, A. and Sanchez, A. Blackjack as a test bed for learning strategies in neural networks. In Proc. IEEE International Joint Conference on Neural Networks, Vol. 3.

[6] Sato, M. and Ishii, S. 2000. On-line EM algorithm for the normalized Gaussian network. Neural Computation, Vol. 12, No. 2.

[7] Tesauro, G. 1994. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, Vol. 6, No. 2.

[8] Thrun, S. 2000. Monte Carlo POMDPs. In Advances in Neural Information Processing Systems 12.

[9] Yoshioka, T.; Ishii, S.; and Ito, M. 1999. Strategy acquisition for the game Othello based on min-max reinforcement learning. IEICE Transactions on Information and Systems, Vol. E82-D, No. 12.


More information

Local Linear Controllers. DP-based algorithms will find optimal policies. optimality and computational expense, and the gradient

Local Linear Controllers. DP-based algorithms will find optimal policies. optimality and computational expense, and the gradient Efficient Non-Linear Control by Combining Q-learning with Local Linear Controllers Hajime Kimura Λ Tokyo Institute of Technology gen@fe.dis.titech.ac.jp Shigenobu Kobayashi Tokyo Institute of Technology

More information

Introduction to Reinforcement Learning. Part 6: Core Theory II: Bellman Equations and Dynamic Programming

Introduction to Reinforcement Learning. Part 6: Core Theory II: Bellman Equations and Dynamic Programming Introduction to Reinforcement Learning Part 6: Core Theory II: Bellman Equations and Dynamic Programming Bellman Equations Recursive relationships among values that can be used to compute values The tree

More information

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study CS 287: Advanced Robotics Fall 2009 Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study Pieter Abbeel UC Berkeley EECS Assignment #1 Roll-out: nice example paper: X.

More information

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti 1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early

More information

STATE GENERALIZATION WITH SUPPORT VECTOR MACHINES IN REINFORCEMENT LEARNING. Ryo Goto, Toshihiro Matsui and Hiroshi Matsuo

STATE GENERALIZATION WITH SUPPORT VECTOR MACHINES IN REINFORCEMENT LEARNING. Ryo Goto, Toshihiro Matsui and Hiroshi Matsuo STATE GENERALIZATION WITH SUPPORT VECTOR MACHINES IN REINFORCEMENT LEARNING Ryo Goto, Toshihiro Matsui and Hiroshi Matsuo Department of Electrical and Computer Engineering, Nagoya Institute of Technology

More information

Reinforcement Learning. Machine Learning, Fall 2010

Reinforcement Learning. Machine Learning, Fall 2010 Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30

More information

An Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning 1 / 58 An Introduction to Reinforcement Learning Lecture 01: Introduction Dr. Johannes A. Stork School of Computer Science and Communication KTH Royal Institute of Technology January 19, 2017 2 / 58 ../fig/reward-00.jpg

More information

Reinforcement Learning. Up until now we have been

Reinforcement Learning. Up until now we have been Reinforcement Learning Slides by Rich Sutton Mods by Dan Lizotte Refer to Reinforcement Learning: An Introduction by Sutton and Barto Alpaydin Chapter 16 Up until now we have been Supervised Learning Classifying,

More information

Planning with Predictive State Representations

Planning with Predictive State Representations Planning with Predictive State Representations Michael R. James University of Michigan mrjames@umich.edu Satinder Singh University of Michigan baveja@umich.edu Michael L. Littman Rutgers University mlittman@cs.rutgers.edu

More information

An Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@cse.iitb.ac.in Department of Computer Science and Engineering Indian Institute of Technology Bombay April 2018 What is Reinforcement

More information