Estimating Passive Dynamics Distributions and State Costs in Linearly Solvable Markov Decision Processes during Z Learning Execution


SICE Journal of Control, Measurement, and System Integration, Vol. 7, No. 1, January 2014

Estimating Passive Dynamics Distributions and State Costs in Linearly Solvable Markov Decision Processes during Z Learning Execution

Mauricio BURDELIS and Kazushi IKEDA
Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), Nara, Japan
E-mail: mauricio-b@is.naist.jp, kazushi@is.naist.jp
(Received January 24, 2013; Revised May 10, 2013)

Abstract: Although the framework of linearly solvable Markov decision processes (LMDPs) reduces the computational complexity of reinforcement learning, it requires knowledge of the state-transition probability in the absence of control, called the passive dynamics. The passive dynamics can be estimated by a temporal difference method called Z learning if the environment obeys the passive dynamics. However, this leads to slow convergence, since no control is allowed during learning. This paper proposes a method to estimate the passive dynamics while executing Z learning under a state-transition probability different from the passive dynamics. The proposed method requires only knowledge of which states can be visited from each state, and estimates the state-transition probability as well as the immediate cost of the states from the constraints they must satisfy. Computer experiments showed that the proposed method successfully estimates the passive dynamics and state costs, remains more efficient than Q learning, and has a convergence speed comparable to that of traditional Z learning.

Key Words: reinforcement learning, Bellman equation, linearly solvable Markov decision process.

1. Introduction

Reinforcement learning (RL) is a machine learning technique used to learn how to take actions to achieve a desired goal. An agent is not given the correct action to take in each situation, but only a reward according to the current state and the chosen action, followed by a stochastic state transition. This is regarded as a discrete-time Markov decision process (MDP) with stochastic dynamics [1]. In other words, reinforcement learning can be applied widely to problems modeled by MDPs, such as trajectory optimization [2],[3], robotics and control [4], mobile communication [5], image recognition [6],[7], e-commerce [8] and medical treatment [9].

The objective of reinforcement learning is to choose actions that minimize the expected total cumulative cost, called the cost-to-go function. Since the values of the cost-to-go function at the current and the following states satisfy the Bellman equation, the function is given as the solution of the Bellman equation [1],[10]. Although solving the Bellman equation has exponential computational complexity in general, Todorov gave conditions on MDPs under which the Bellman equation becomes linear [11],[12]. In these conditions, the state-transition probability in the absence of control, called the passive dynamics, is crucial. Todorov also proposed a method to estimate the cost-to-go function called Z learning. Z learning is a temporal difference method that converges faster than traditional methods such as Q learning [1],[11],[12]. When no control is applied, the environment follows the passive dynamics and Z learning correctly estimates the cost-to-go function. Otherwise, Z learning requires knowledge of the passive dynamics, but it can balance the dilemma of exploration and exploitation.
For example, it can employ the greedy policy, which chooses the action that appears optimal given the currently estimated knowledge.

In this paper, we propose a method to estimate the passive dynamics and the immediate costs during the execution of Z learning with general dynamics. The proposed method estimates the state-transition probability using the immediate cost information observed by the agent while exploring the environment. Since the method updates the estimates step by step, it requires neither all immediate costs beforehand nor that the agent follow the passive dynamics. This property allows the agent to use a more efficient policy. The effectiveness of the method was confirmed by computer simulations, where Newtonian dynamics in a two-dimensional grid world were considered, including a simple model of inertia and collisions. The method showed performance comparable to greedy Z learning in the convergence speed of the estimates of the cost-to-go function, and better performance than Q learning.

This paper is organized as follows. Section 2 reviews the theory of traditional MDPs and explains the framework of LMDPs, as well as the sufficient conditions for its existence. Section 3 describes the proposed method for calculating the passive dynamics from observed costs and explains how to apply it during the execution of Z learning. Section 4 presents computational experiments and results. Section 5 discusses practical difficulties. Finally, Section 6 presents brief concluding remarks.

2. Linearly Solvable Markov Decision Processes

Suppose that the environment of a reinforcement learning problem is a discrete-time Markov decision process, that is,

    \Pr(x_{t+1} \mid u_t, x_t) = \Pr(x_{t+1} \mid u_t, x_t, x_{t-1}, x_{t-2}, \ldots)                    (1)

where x_t and u_t denote the state of the agent and the action taken at time step t, respectively. Then, the problem of reinforcement learning is to give the optimal probability p(u | x) of actions u ∈ U for the current state x ∈ X, where U and X denote the sets of possible actions and states, respectively.

If the agent takes an action u for a state x, the state changes to x' according to the state-transition probability p(x' | x, u) and the agent pays the immediate cost l(x, u). The optimality of actions here means minimizing the expected total cumulative cost v(x) from a state x until the agent reaches a terminal or goal state [1],[10],[11]. Hereafter, v(x) is termed the cost-to-go function, following Todorov's work. It is known that v(x) must satisfy

    v(x) = \min_u \left\{ l(x, u) + E_{p(\cdot \mid x, u)}[ v(x') ] \right\}                    (2)

which is called the Bellman equation. E_{p(\cdot \mid x, u)}[ v(\cdot) ] denotes the statistical expectation of v(\cdot) taken with respect to p(\cdot \mid x, u). Some methods such as dynamic programming [10] or reinforcement learning [1] can solve the Bellman equation. However, these can be time-consuming due to explosions of the number of unknown variables, because the number of future states grows exponentially with time.

Todorov showed that the class of linearly solvable Markov decision processes (LMDPs) greatly simplifies reinforcement learning [11],[12]. When specific conditions are met, the Bellman equation of an MDP becomes linear and the problem reduces to an eigenvector problem. We review these facts according to [11],[12]. There are two conditions for an MDP to have a linear Bellman equation. One is that the action u directly specifies the state-transition probability, that is, the state-transition probability p(x' | x, u) is represented as u(x' | x). The other is that the immediate cost is given by the sum of an action cost and a state cost, where the action cost is measured by the Kullback-Leibler (KL) divergence from the passive dynamics p_d(x' | x) to the current transition probability u(x' | x), that is,

    \mathrm{KL}\left( u(\cdot \mid x) \,\|\, p_d(\cdot \mid x) \right) \equiv E_{u(\cdot \mid x)}\left[ \log \frac{u(x' \mid x)}{p_d(x' \mid x)} \right]                    (3)

where E_{u(\cdot \mid x)} denotes the statistical expectation taken with respect to the controlled transition distribution u(\cdot \mid x). The passive dynamics is the transition probability that corresponds to the behavior of the system in the absence of controls. In theory it corresponds to a reference distribution that makes the KL divergence above null, and it can be arbitrary; it is usually defined as a random walk. The state cost depends only on the current state x and hence is denoted by q(x). In total, the immediate cost is expressed as

    l(x, u) = q(x) + \mathrm{KL}\left( u(\cdot \mid x) \,\|\, p_d(\cdot \mid x) \right).                    (4)

Todorov introduced the desirability function

    z(x) \equiv \exp( -v(x) )                    (5)

instead of considering v(x) itself. Then, under the conditions mentioned above, the Bellman equation (2) reduces to

    z(x) = \exp( -q(x) )\, G[z](x)                    (6)

where

    G[z](x) \equiv \sum_{x'} p_d(x' \mid x)\, z(x').                    (7)

Note that (6) is linear in z, and the optimal controlled transition probability u^* is given by

    u^*(x' \mid x) = \frac{ p_d(x' \mid x)\, z(x') }{ \sum_{x''} p_d(x'' \mid x)\, z(x'') }.                    (8)

The class of linearly solvable MDPs is restricted because of the conditions that must be satisfied, but it is important in reinforcement learning [11]-[15]. If the state costs q(x) and passive dynamics p_d are not known, they must be learned through the agent's exploration of the environment. One learning method is Z learning, a temporal difference method. It has the same benefits as other temporal difference methods such as Q learning [1], being an off-policy method and being able to absorb small errors in the measurement of the immediate cost, but it has the advantage of faster convergence.
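To make the linear formulation concrete, the following is a minimal Python sketch (an illustration, not the authors' code) that computes the desirability function by iterating (6) on a small LMDP and recovers the optimal control from (8). The toy transition matrix, the unit state costs, and the treatment of the goal state as absorbing are assumptions made for the example.

```python
import numpy as np

def solve_lmdp(P_passive, q, goal, n_iter=500):
    """Iterate z = exp(-q) * (P_passive @ z), as in (6), with the goal state treated as
    absorbing (z(goal) fixed to exp(-q(goal))). Returns z, the optimal control of (8),
    and the cost-to-go v = -log z from (5)."""
    z = np.ones(len(q))
    for _ in range(n_iter):
        z = np.exp(-q) * (P_passive @ z)
        z[goal] = np.exp(-q[goal])                 # absorbing goal: v(goal) = q(goal)
    u_star = P_passive * z[None, :]                # u*(x'|x) proportional to p_d(x'|x) z(x')
    u_star /= u_star.sum(axis=1, keepdims=True)
    return z, u_star, -np.log(z)

# toy 3-state chain whose passive dynamics drifts toward the goal state 2
P_passive = np.array([[0.50, 0.50, 0.00],
                      [0.25, 0.50, 0.25],
                      [0.00, 0.00, 1.00]])
q = np.array([1.0, 1.0, 0.0])                      # unit state cost except at the goal
z, u_star, v = solve_lmdp(P_passive, q, goal=2)
print(v)           # cost-to-go estimates
print(u_star[1])   # optimal controlled distribution from state 1
```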
When the agent follows the passive dynamics, Z learning updates the desirability function z as

    z_{new}(x_t) \leftarrow (1 - \eta_t)\, z_{cur}(x_t) + \eta_t \exp( -q_t )\, z_{cur}(x_{t+1})                    (9)

where z_{new}(x_t) is the new estimate of z at the current state x_t, z_{cur}(x_t) and z_{cur}(x_{t+1}) are the current estimates of z(x_t) and z(x_{t+1}), respectively, q_t is the state cost of the current state x_t, and \eta_t is a learning rate that decreases over time. The observed immediate cost l(x, p_d) is equal to the state cost q_t because the KL divergence (3) is null, so q_t can be obtained directly by observing the immediate cost l.

When the agent follows a controlled transition probability \hat{u}, we need to introduce the importance sampling technique into (9), that is,

    z_{new}(x_t) \leftarrow (1 - \eta_t)\, z_{cur}(x_t) + \eta_t \exp( -q_t )\, z_{cur}(x_{t+1})\, \frac{ p_d(x_{t+1} \mid x_t) }{ \hat{u}(x_{t+1} \mid x_t) }.                    (10)

This means that we can use a more efficient policy than the passive dynamics. In greedy Z learning, for example, \hat{u} is the policy that appears optimal according to (8) given the current estimate \hat{z}. However, this method requires knowledge of the passive dynamics p_d beforehand. Because the policy differs from the passive dynamics, the KL divergence is not null, and the observed costs l are not necessarily equal to the state costs q. Hence, the state costs must be known beforehand, or a method to measure q separately must be proposed.
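As a concrete illustration, here is a minimal sketch of the Z-learning update with importance sampling (10), assuming the agent can observe the next state, the state cost, and the probabilities assigned to the sampled transition by the passive dynamics and by the behavior policy; the function and variable names are not from the paper.

```python
import numpy as np

def z_update(z, x_t, x_next, q_t, p_d_prob, u_hat_prob, eta):
    """One Z-learning step while following a controlled policy u_hat, as in (10):
    the weight p_d(x_{t+1}|x_t) / u_hat(x_{t+1}|x_t) corrects for sampling the
    transition from u_hat instead of the passive dynamics."""
    target = np.exp(-q_t) * z[x_next] * (p_d_prob / u_hat_prob)
    z[x_t] = (1.0 - eta) * z[x_t] + eta * target
    return z

def greedy_policy(z, p_d_row):
    """Greedy controlled distribution over successors, as in (8), built from the
    current desirability estimates and the passive dynamics row p_d(.|x)."""
    w = p_d_row * z
    return w / w.sum()

# usage sketch: one update after a transition from state 2 to state 5
z = np.ones(10)                      # initial desirability estimates for 10 states
t, c = 1, 10_000                     # learning-rate schedule eta_t = c / (c + t)
z = z_update(z, x_t=2, x_next=5, q_t=1.0, p_d_prob=0.9, u_hat_prob=0.6,
             eta=c / (c + t))
```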

3. Passive Dynamics Estimation

To make Z learning applicable under a controlled condition, we need knowledge of the passive dynamics and the state costs. We propose a method for estimating them from measured immediate costs l(x, u) using the constraints they satisfy [16],[17]. This way, Z learning can be applied by measuring only immediate costs and updating estimates of the quantities of interest, as in other temporal difference methods such as Q learning.

Suppose that a discrete state space X has cardinality |X| = N_S (Fig. 1). Our method regards the log of each state-transition probability, log p_d(x' | x), as well as each state cost q(x), as unknown variables. Then, the variables and the immediate costs must satisfy (4), or more concretely

    l(x, u) = \sum_{x'} u(x' \mid x) \log \frac{ u(x' \mid x) }{ p_d(x' \mid x) } + q(x)
            = \sum_{x'} u(x' \mid x) \log u(x' \mid x) - \sum_{x'} u(x' \mid x) \log p_d(x' \mid x) + q(x)                    (11)

for any x. Rearranging the terms, we have

    -\sum_{x'} u(x' \mid x) \log p_d(x' \mid x) + q(x) = l(x, u) - \sum_{x'} u(x' \mid x) \log u(x' \mid x).                    (12)

Fig. 1  A state space with N_S possible states and N_S^2 possible transitions.

Because we know log u(x' | x) and can measure l(x, u), we get N_eq = N_S linear equations in log p_d(x' | x) and q(x) from N_S distinct states x with an arbitrarily fixed u (where N_S is the number of states and N_eq denotes the number of equations). Hence, repeating the procedure for N_S different controlled distributions u, say u_1, ..., u_{N_S}, we get a system of N_eq = N_S^2 linear equations. This number of equations is less than the number of unknown variables, N_S^2 + N_S. However, the probability p_d(\cdot | x) has an additional constraint,

    \sum_{x'} p_d(x' \mid x) = 1,                    (13)

for any x. Although these constraints are not linear in log p_d(x' | x), the resulting N_S^2 + N_S equations are easily solved by a gradient method (Fig. 2) or a variable substitution method (Fig. 3). For both algorithms, in the general case, applying fewer than N_S different controlled distributions u yields fewer equations than necessary, while applying more than N_S is unnecessary under the constraint (13).

Algorithm 1: Gradient Descent (with probability normalization)
 1: Take an initial solution m_1 using the Moore-Penrose pseudoinverse A_GD^+ of the matrix A_GD:  m_1 = A_GD^+ b
 2: repeat
 3:   Take a step of the gradient descent algorithm:  m_{t+1} = m_t + (1/\gamma) A_GD^T (b - A_GD m_t),  where \gamma > 1
 4:   Normalize the probabilities to sum up to one:  for i = 1 to N_V:  m_i \leftarrow \log\left( \exp(m_i) / \sum_{j=1}^{N_V} \exp(m_j) \right)
 5: until convergence
Fig. 2  A pseudo code of the proposed gradient descent with probability normalization algorithm.

Algorithm 2: Variable Substitution Method
 1: Observing that A_VS is a stochastic matrix, rewrite the system as  A_VS (q 1 - n) = b
 2: Change of variables:  c = q 1 - n
 3: Solve  A_VS c = b  for c
 4: Observing that \sum_{i=1}^{N_V} \exp(n_i) = 1, obtain  q = -\log\left( \sum_{i=1}^{N_V} \exp(-c_i) \right)
 5: Solve  q 1 - n = c  for n
Fig. 3  A pseudo code of the variable substitution algorithm.

In the gradient descent algorithm (Fig. 2), the system is written in vector notation as A_GD m = b, where A_GD is the matrix of coefficients, m is the vector of variables, and b is the vector of constants. Each equation of the system is of the form of (12). The elements m_1, ..., m_{N_V} of m are log p_d(x' | x) for all valid x' at x, and the last element m_{N_V + 1} is q(x). N_V is the number of valid possible future states at x (so the length of m is N_V + 1). The term 1/\gamma corresponds to the step size of the gradient descent algorithm. Each element of the constants vector b corresponds to the right-hand side of (12). The matrix A_GD has N_V rows and N_V + 1 columns, with element -u_i(x'_j | x) at row i and column j for columns 1 to N_V. The rightmost column (N_V + 1) consists of ones, the coefficients of the variable q(x).
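The following is a minimal Python sketch of the gradient descent with probability normalization of Fig. 2 for a single state x (an illustration, not the authors' code): each row of A_GD holds the coefficients -u_i(x'_j|x) plus a trailing one for q(x), and after every step the entries holding log p_d(.|x) are renormalized so that the probabilities sum to one. The synthetic data, the step size, and the iteration count are assumptions made for the example.

```python
import numpy as np

def estimate_pd_q_gd(U, b, gamma=10.0, n_iter=20000):
    """Estimate p_d(.|x) and q(x) for one state x, as in Fig. 2.
    U: (N_eq, N_V) matrix whose row i is the controlled distribution u_i(.|x).
    b: right-hand sides b_i = l(x, u_i) - sum_x' u_i(x'|x) log u_i(x'|x), as in (12)."""
    n_eq, n_v = U.shape
    A = np.hstack([-U, np.ones((n_eq, 1))])          # coefficients of [log p_d(.|x), q(x)]
    m = np.linalg.pinv(A) @ b                        # initial solution via the pseudoinverse
    for _ in range(n_iter):
        m = m + (1.0 / gamma) * A.T @ (b - A @ m)    # gradient step on ||A m - b||^2
        logp = m[:n_v]
        m[:n_v] = logp - np.log(np.exp(logp).sum())  # renormalize so p_d(.|x) sums to one
    return np.exp(m[:n_v]), m[n_v]                   # estimated p_d(.|x) and q(x)

# usage sketch: 3 valid successors, true p_d(.|x) = [0.2, 0.3, 0.5], true q(x) = 1.0
rng = np.random.default_rng(0)
p_d, q_true = np.array([0.2, 0.3, 0.5]), 1.0
U = rng.dirichlet(np.ones(3), size=3)                # three distinct controlled distributions
l = np.array([q_true + np.sum(u * np.log(u / p_d)) for u in U])   # measured costs from (4)
b = l - np.sum(U * np.log(U), axis=1)                # right-hand side of (12)
print(estimate_pd_q_gd(U, b))                        # should approach ([0.2, 0.3, 0.5], 1.0)
```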
In the variable substitution algorithm (Fig. 3), the system is written in vector notation as q 1 - A_VS n = b, where A_VS is the matrix of coefficients, n is the vector of variables (similar to m, but without q), q is the state cost of the current state x, and b is the vector of constants (the same as in the gradient descent algorithm). n_1, ..., n_{N_V} are the elements of n, and c_1, ..., c_{N_V} are the elements of c. The matrix A_VS is the negative of A_GD without the column of ones (the coefficients of the variable q(x)).

When q(x) is known, the problem becomes easier. If we consider minimizing the time to reach a goal, for example, q(x) takes a constant value. Then, (11) reduces to a system of linear equations,

    l(x, u) = \sum_{x'} u(x' \mid x) \log u(x' \mid x) - \sum_{x'} u(x' \mid x) \log p_d(x' \mid x) + c.                    (14)
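A minimal sketch of the variable substitution method of Fig. 3 follows (again an illustration, not the authors' code): the system A_VS c = b is solved for c, the state cost is recovered from the normalization constraint as q = -log sum_i exp(-c_i), and n = q1 - c gives log p_d(.|x). Using the pseudoinverse in the solve also covers incomplete systems with fewer than N_V equations, as described later in the text; the synthetic data are assumptions made for the example.

```python
import numpy as np

def estimate_pd_q_vs(U, b):
    """Variable substitution solve for one state x, as in Fig. 3.
    U: (N_eq, N_V) matrix of controlled distributions u_i(.|x) (this is A_VS).
    b: right-hand sides of (12). Returns the estimated p_d(.|x) and q(x)."""
    c = np.linalg.pinv(U) @ b           # solve A_VS c = b (pseudoinverse for incomplete systems)
    q = -np.log(np.exp(-c).sum())       # from sum_i exp(n_i) = 1 with n = q*1 - c
    n = q - c                           # n_i = log p_d(x'_i | x)
    return np.exp(n), q

# usage sketch with the same synthetic data as in the gradient descent example
rng = np.random.default_rng(0)
p_d, q_true = np.array([0.2, 0.3, 0.5]), 1.0
U = rng.dirichlet(np.ones(3), size=3)
b = q_true - U @ np.log(p_d)            # right-hand side of (12) for each u_i
p_hat, q_hat = estimate_pd_q_vs(U, b)
print(p_hat, q_hat)                     # recovers [0.2, 0.3, 0.5] and 1.0
```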

Our algorithm can be applied during Z learning with any policy, such as the greedy policy. In fact, we need not fix u in (14); on the contrary, a changing u is preferable for acquiring a well-conditioned equation system. Hence, (10) is replaced with

    z_{new}(x_t) \leftarrow (1 - \eta_t)\, z_{cur}(x_t) + \eta_t \exp( -\hat{q}(x_t) )\, z_{cur}(x_{t+1})\, \frac{ \hat{p}_d(x_{t+1} \mid x_t) }{ \hat{u}(x_{t+1} \mid x_t) },                    (15)

where \hat{p}_d(x_{t+1} | x_t) and \hat{q}(x_t) are the current estimates. The algorithm for estimating p_d and q during Z learning is illustrated in Fig. 4.

Algorithm 3: Calculating the passive dynamics distributions and state costs during Z learning
 1: when each state x_i is visited (following policy \hat{u}_t):
 2:   if p_d(\cdot | x_i) and q(x_i) are not yet known then
 3:     gather one equation from the measured l(x_i, \hat{u}_t):
          l(x_i, \hat{u}_t) = \sum_{x'} \hat{u}_t(x' \mid x_i) \log \hat{u}_t(x' \mid x_i) - \sum_{x'} \hat{u}_t(x' \mid x_i) \log p_d(x' \mid x_i) + q(x_i)
 4:     N_eq(x_i) \leftarrow N_eq(x_i) + 1
 5:     if N_eq(x_i) < N_V(x_i) then
 6:       get one solution of the incomplete system of N_eq(x_i) equations, under the constraint \sum_{x'} p_d(x' \mid x_i) = 1
 7:     else
 8:       solve the complete system (with N_eq(x_i) = N_V(x_i) equations), under the constraint \sum_{x'} p_d(x' \mid x_i) = 1
 9:       consider p_d(\cdot | x_i) and q(x_i) known
10:     end if
11:   end if
Fig. 4  A pseudo code of the proposed algorithm.

In the figure, \hat{u}_t is the policy that appears optimal (according to (8)) given the current estimates z_{cur}(x_i), \hat{p}_d(\cdot | x_i) and \hat{q}(x_i) at time t. N_V(x_i) is the number of valid possible future states at x_i. In order to get one solution to the incomplete system of equations (step 6 in Fig. 4), we can use either the gradient descent method (Fig. 2) or the variable substitution method (Fig. 3). The latter is preferable because it is not iterative and finds the solution directly, but for its application to an incomplete system we take the Moore-Penrose pseudoinverse of the matrix A_VS (step 3 in Fig. 3) and multiply it by b in order to obtain c. The gradient descent method (Fig. 2) can be used without any change.

4. Computational Experiments

To validate our method, three experiments were carried out. The first shows the efficiency of Z learning with the greedy policy (greedy Z learning) compared with Z learning following the passive dynamics (passive Z learning), under the condition that the passive dynamics is known. The second experiment confirms that our method can correctly estimate the passive dynamics and state costs under a controlled condition. The third experiment shows that our method works well during the execution of our modified Z learning.

The environment for all the experiments was a two-dimensional 10x10 grid world with obstacles (Fig. 5). The task of the agent was to reach the goal position (lower right in Fig. 5) from a random start position on the grid as fast as possible. To this end, the state cost of every state was set to unity, so that the total cost tends to be large if the trajectory chosen by the agent is long. Note that one possible application of LMDPs is finding shortest paths in graphs [11]. The agent obeyed Newtonian mechanics, that is, it could only move from its current position to an adjacent position at each time step. Each state of the agent consisted of the pair of the current and the previous positions, because Newtonian mechanics in discrete time is determined by this pair, and because the pair allows the Newtonian mechanics to be expressed as a Markov decision process (1). In total, there are N_P = 86 positions and N_S = 575 possible states. As for the walls and obstacles, we considered two cases: the obstacles were reflexive walls in one case and absorptive walls in the other. In both scenarios the passive dynamics is the same in open spaces (Fig. 6), where the agent obeys Newtonian mechanics with high probability hp = 0.9 and moves to another state with low probability. The passive dynamics of the absorptive scenario and the reflexive scenario differ when there are walls or obstacles (Fig. 7). In all experiments only transitions to adjacent positions were possible, and hence the number of valid future states N_V equals the number of positions adjacent to the current position.

Fig. 5  The environment of our experiments.
Fig. 6  Modeling inertia.
Fig. 7  Modeling collisions (hp = 0.9).
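As an illustration of this state representation (not the authors' code), the following sketch enumerates states as (previous position, current position) pairs of adjacent free cells of a small grid, which is one way of making the discrete-time inertial dynamics Markovian in the state; the grid contents and the four-neighbour adjacency rule are assumptions of the example, not the paper's 10x10 map.

```python
import numpy as np

# 0 = free cell, 1 = obstacle; a small illustrative grid, not the paper's environment
grid = np.array([[0, 0, 0, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0]])

free = [(r, c) for r in range(grid.shape[0])
               for c in range(grid.shape[1]) if grid[r, c] == 0]

def adjacent(a, b):
    """Four-neighbour adjacency between grid positions (an assumed movement rule)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1

# A state is the pair (previous position, current position), so that the
# discrete-time inertial dynamics satisfies the Markov property (1).
states = [(prev, cur) for prev in free for cur in free if adjacent(prev, cur)]
print(len(free), "positions,", len(states), "states")
```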
4.1 Efficiency of Greedy Z Learning

To confirm the efficiency of Z learning with dynamics other than the passive dynamics, we compared the learning curves of greedy Z learning and passive Z learning, assuming that the passive dynamics p_d and the state costs q were known. In our experiments, the learning rate \eta_t at time step t decays as \eta_t = c/(c + t), where c = 10,000 for greedy Z learning and c = 30,000 for passive Z learning. The estimation error was calculated as the normalized difference between the estimated cost-to-go function \hat{v} and the optimal one v obtained analytically using (6), that is,

    \frac{ \sum_{i=1}^{N_S} | \hat{v}(x_i) - v(x_i) | }{ \sum_{i=1}^{N_S} | v(x_i) | }.                    (16)

The learning curves in Figs. 8(a) and 8(b) show that Z learning converges faster when following the greedy policy than when following the passive dynamics. Hence, greedy Z learning is more efficient. Note that these results are consistent with the experiments in [11],[12].

Fig. 8  Learning curves for the reflexive and the absorptive environments.

4.2 Estimation of Passive Dynamics and State Costs

To confirm that the proposed method can correctly estimate the passive dynamics p_d and the state costs q from measured total immediate costs l, we ran Algorithms 1 and 2 with a controlled transition probability u. Here, u was set so that one adjacent state had a large probability, the other adjacent states had a low probability, and the remaining states had probability zero, as seen in Fig. 9, where the squares represent positions, the darker square is the current position, and the numbers represent transition probabilities (which depend on the number of adjacent states).

Fig. 9  A controlled transition probability u.

The estimation error was calculated as the difference between the estimated probabilities \hat{p}_d and the correct passive dynamics p_d, that is,

    | \hat{p}_d(x_i) - p_d(x_i) |.                    (17)

In a similar way, the errors of the state costs were calculated as

    | \hat{q}(x_i) - q(x_i) |.                    (18)

The results show that the errors are almost within the numerical precision of the simulator software for both algorithms (Fig. 10). This means that the method can calculate the passive dynamics p_d and the state costs q correctly.

Fig. 10  Box plots of the errors in the estimation of passive dynamics and state costs (estimation from action costs, from total costs with the gradient descent method, and from total costs with the variable substitution method).

4.3 Z Learning with Passive Dynamics and State Costs Estimation

To confirm that greedy Z learning performs well during estimation of the passive dynamics p_d and the state costs q, the authors ran Algorithm 1 with a controlled transition probability \hat{u} made from the current estimates \hat{p}_d and \hat{q}, and compared it with greedy Z learning with the optimal control u and with traditional Q learning (the \epsilon-greedy policy with \epsilon = 0.1). In the experiments, the learning rate \eta_t at time step t decays as \eta_t = c/(c + t), where c = 10,000 for greedy Z learning and c = 200,000 for Q learning. Each simulation was repeated ten times. The errors of the cost-to-go v were calculated as in (16). In the same way, the errors of the estimated passive dynamics and state costs were calculated as

    \frac{ \sum_{i=1}^{N_S} \sum_{j=1}^{N_S} | \hat{p}_d(x_j \mid x_i) - p_d(x_j \mid x_i) | }{ \sum_{i=1}^{N_S} \sum_{j=1}^{N_S} p_d(x_j \mid x_i) },                    (19)

    \frac{ \sum_{i=1}^{N_S} | \hat{q}(x_i) - q(x_i) | }{ \sum_{i=1}^{N_S} q(x_i) },                    (20)

respectively.

Fig. 11  Estimation errors of p_d and q, with reflexive walls and with absorptive walls.

Even during greedy Z learning with the estimates of p_d and q, the errors of the estimated passive dynamics and state costs consistently decreased as the number of simulation steps progressed (Fig. 11). Moreover, the errors of these quantities did not have a significant impact on the convergence speed of the Z learning algorithm (Fig. 12). The difference between the errors of Z learning with p_d and q estimation and those of traditional Z learning can be observed in Fig. 13. It can be observed in Fig. 11 that the errors of the estimated p_d and q become very small near a certain simulation time step. Before this time step, the errors of v for Z learning with p_d and q estimation are bigger (as seen in Fig. 12 and Fig. 13), because the correct p_d and q had not yet been estimated. Approximately after this time step, the errors of the estimated p_d and q are very small, and the difference of the errors of v oscillates (Fig. 13) and approaches zero in the long run. Positive values mean that the error of v for Z learning with p_d and q estimation is bigger. This was consistently observed for both simulated passive dynamics distributions.

5. Discussions

Although the proposed method was successful in the experiments, there is a difficulty in practical applications: the uniqueness of the solution of the linear equation systems. Each obtained system of linear equations should have a unique solution when the number of obtained equations equals the number of unknowns, that is, the number of valid future states N_V. However, if the controlled distributions are not sufficiently different from each other, the determinant of the matrix of equations can become close to zero and be considered null within the numerical precision of the simulator software. This makes the software unable to find a unique solution to the system, even when N_V equations are obtained. In our experiments a heuristic solution was adopted, by obtaining two more equations than necessary for each state. After this solution was adopted, the number of cases in which the system of equations was not solved for a unique solution was negligibly small.

Fig. 12  Errors of the estimated v by the following methods: Z learning with estimation of p_d and q, Z learning without p_d or q estimation, and the traditional Q learning algorithm.
Fig. 13  Difference of the error of the estimated v between Z learning with p_d and q estimation and traditional Z learning.

6. Conclusion

The authors proposed a method for the direct application of Z learning in a true temporal-difference approach, without the need for previous knowledge of the passive dynamics or the state costs. All that is required is the possibility of measuring the immediate costs (or total costs) incurred in state transitions, as well as knowledge of the impossible state transitions and of the controlled state-transition distributions imposed on the system. Complete knowledge of the controlled transition distributions and the possibility of imposing any desired distribution might not be available in realistic problems, in which symbolic actions might exist and the transition distributions might depend on those actions. A possible future work is to extend the method to consider symbolic actions.

Acknowledgments

This work was supported by JSPS KAKENHI grants.

References
[1] R.S. Sutton and A.G. Barto: Reinforcement Learning: An Introduction, MIT Press.
[2] M.A.P. Burdelis and K. Ikeda: Temporal difference approach in linearly solvable Markov decision problems, Proc. Artificial Life and Robotics, GS12-3.
[3] T. Kollar and N. Roy: Trajectory optimization using reinforcement learning for map exploration, The International Journal of Robotics Research, Vol. 27.
[4] J. Buchli, F. Stulp, E. Theodorou, and S. Schaal: Learning variable impedance control, The International Journal of Robotics Research, Vol. 30.
[5] J. Nie and S. Haykin: A dynamic channel assignment policy through Q-learning, IEEE Transactions on Neural Networks, Vol. 10, No. 6.
[6] K. Shibata and T. Kawano: Acquisition of flexible image recognition by coupling of reinforcement learning and a neural network, SICE Journal of Control, Measurement, and System Integration, Vol. 2, No. 2.
[7] A.A.M. Faudzi and K. Shibata: Acquisition of active perception and recognition through Actor-Q learning using a movable camera, Proc. SICE Annual Conf., FB03-2.
[8] L. Jian: An agent bilateral multi-issue alternate bidding negotiation protocol based on reinforcement learning and its application in e-commerce, Proc. Int'l Symposium on Electronic Commerce and Security.
[9] A. Gaweda, M. Muezzinoglu, G. Aronoff, A. Jacobs, J. Zurada, and M. Brier: Individualization of pharmacological anemia management using reinforcement learning, Neural Networks, Vol. 18.
[10] D. Bertsekas: Dynamic Programming and Optimal Control, Athena Scientific.
[11] E. Todorov: Efficient computation of optimal actions, Proceedings of the National Academy of Sciences, Vol. 106, No. 28.
[12] E. Todorov: Linearly-solvable Markov decision problems, Schölkopf et al. (eds.), Advances in Neural Information Processing Systems, Vol. 19, MIT Press.

[13] K. Doya: How can we learn efficiently to act optimally and flexibly?, Proceedings of the National Academy of Sciences, Vol. 106, No. 28.
[14] E. Todorov: Compositionality of optimal control laws, Advances in Neural Information Processing Systems, Vol. 22.
[15] K. Dvijotham and E. Todorov: Inverse optimal control with linearly-solvable MDPs, International Conference on Machine Learning, Vol. 27.
[16] M.A.P. Burdelis and K. Ikeda: Modeling and estimating passive dynamics distributions in linearly solvable Markov decision processes, IEICE Technical Report, NC.
[17] M.A.P. Burdelis and K. Ikeda: Estimating passive dynamics distributions in linearly solvable Markov decision processes from measured immediate costs in reinforcement learning problems, Proc. JNNS, P3-20.

Mauricio BURDELIS
He received the B.E. and M.E. degrees in computer systems engineering from the University of Sao Paulo in 2001 and 2009, respectively. Since 2009, he has been a Ph.D. student at the Nara Institute of Science and Technology (NAIST), receiving a scholarship from the Japanese Ministry of Education, Culture, Sports, Science, and Technology (MEXT).

Kazushi IKEDA (Member)
He received the B.E., M.E., and Ph.D. degrees in mathematical engineering and information physics from The University of Tokyo in 1989, 1991, and 1994, respectively. He was a research associate with the Department of Electrical and Computer Engineering, Kanazawa University beginning in 1994. In 1995, he was a research associate of the Chinese University of Hong Kong for three months. From 1998 to 2008, he was with the Graduate School of Informatics, Kyoto University, as an associate professor. Since 2008, he has been a full professor at the Nara Institute of Science and Technology. He is the editor-in-chief of the Journal of the Japanese Neural Network Society, an action editor of Neural Networks, an associate editor of IEEE Transactions on Neural Networks, and an associate editor of IEICE Transactions on Information and Systems.


More information

arxiv: v1 [cs.ai] 5 Nov 2017

arxiv: v1 [cs.ai] 5 Nov 2017 arxiv:1711.01569v1 [cs.ai] 5 Nov 2017 Markus Dumke Department of Statistics Ludwig-Maximilians-Universität München markus.dumke@campus.lmu.de Abstract Temporal-difference (TD) learning is an important

More information

On the Convergence of Optimistic Policy Iteration

On the Convergence of Optimistic Policy Iteration Journal of Machine Learning Research 3 (2002) 59 72 Submitted 10/01; Published 7/02 On the Convergence of Optimistic Policy Iteration John N. Tsitsiklis LIDS, Room 35-209 Massachusetts Institute of Technology

More information

Reinforcement Learning (1)

Reinforcement Learning (1) Reinforcement Learning 1 Reinforcement Learning (1) Machine Learning 64-360, Part II Norman Hendrich University of Hamburg, Dept. of Informatics Vogt-Kölln-Str. 30, D-22527 Hamburg hendrich@informatik.uni-hamburg.de

More information

Markov Decision Processes Chapter 17. Mausam

Markov Decision Processes Chapter 17. Mausam Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.

More information

RL 3: Reinforcement Learning

RL 3: Reinforcement Learning RL 3: Reinforcement Learning Q-Learning Michael Herrmann University of Edinburgh, School of Informatics 20/01/2015 Last time: Multi-Armed Bandits (10 Points to remember) MAB applications do exist (e.g.

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

The Markov Decision Process Extraction Network

The Markov Decision Process Extraction Network The Markov Decision Process Extraction Network Siegmund Duell 1,2, Alexander Hans 1,3, and Steffen Udluft 1 1- Siemens AG, Corporate Research and Technologies, Learning Systems, Otto-Hahn-Ring 6, D-81739

More information

Fuzzy Model-Based Reinforcement Learning

Fuzzy Model-Based Reinforcement Learning ESIT 2, 14-15 September 2, Aachen, Germany 212 Fuzzy Model-Based Reinforcement Learning Martin Appl 1, Wilfried Brauer 2 1 Siemens AG, Corporate Technology Information and Communications D-8173 Munich,

More information

LINEARLY SOLVABLE OPTIMAL CONTROL

LINEARLY SOLVABLE OPTIMAL CONTROL CHAPTER 6 LINEARLY SOLVABLE OPTIMAL CONTROL K. Dvijotham 1 and E. Todorov 2 1 Computer Science & Engineering, University of Washington, Seattle 2 Computer Science & Engineering and Applied Mathematics,

More information

Reinforcement Learning in Non-Stationary Continuous Time and Space Scenarios

Reinforcement Learning in Non-Stationary Continuous Time and Space Scenarios Reinforcement Learning in Non-Stationary Continuous Time and Space Scenarios Eduardo W. Basso 1, Paulo M. Engel 1 1 Instituto de Informática Universidade Federal do Rio Grande do Sul (UFRGS) Caixa Postal

More information

In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and. Convergence of Indirect Adaptive. Andrew G.

In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and. Convergence of Indirect Adaptive. Andrew G. In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and J. Alspector, (Eds.). Morgan Kaufmann Publishers, San Fancisco, CA. 1994. Convergence of Indirect Adaptive Asynchronous

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information