Introduction to Reinforcement Learning


1 Introduction to Reinforcement Learning Rémi Munos SequeL project: Sequential Learning munos/ INRIA Lille - Nord Europe Machine Learning Summer School, September 2011, Bordeaux

2 Outline of the course
Part 1: Introduction to Reinforcement Learning and Dynamic Programming. Setting, examples. Dynamic programming: value iteration, policy iteration. RL algorithms: TD(λ), Q-learning.
Part 2: Approximate dynamic programming. L∞-performance bounds. Sample-based algorithms: Least-Squares TD, Bellman Residual, Fitted-VI.
Part 3: Exploration-exploitation tradeoffs. The stochastic bandit: UCB. The adversarial bandit: EXP3. Populations of bandits: tree search, Nash equilibrium. Applications to games (Go, Poker) and planning.

3 Part 1: Introduction to Reinforcement Learning and Dynamic Programming
Setting, examples. Dynamic programming: value iteration, policy iteration. RL algorithms: TD(λ), Q-learning.
General references:
Neuro-Dynamic Programming, Bertsekas and Tsitsiklis, 1996.
Reinforcement Learning: An Introduction, Sutton and Barto, 1998.
Markov Decision Processes, Puterman, 1994.
Markov Decision Processes in Artificial Intelligence, Sigaud and Buffet (eds.), 2008.
Algorithms for Reinforcement Learning, Szepesvári, 2009.

4 Introduction to Reinforcement Learning (RL)
Acquire skills for sequential decision making in complex, stochastic, partially observable, possibly adversarial environments.
Learn from experience a behavior policy (what to do in each situation) based on past successes or failures.
Examples: the Hot and Cold game, chess, learning to ride a bicycle, autonomous robotics, operations research, decision making in stochastic markets, ... any adaptive sequential decision-making task.

5 Birth of the domain
A meeting, at the end of the 70s, of:
Computational neuroscience. Reinforcement of synaptic weights in neuronal transmission (Hebb's rule, the Rescorla-Wagner model). Reinforcement = correlations in neuronal activity.
Experimental psychology. Animal conditioning: reinforcement of behaviors that lead to satisfaction (behaviorism, initiated by Pavlov and Skinner). Reinforcement = satisfaction or discomfort.
Independently, a mathematical formalism appeared in the 50s and 60s: Dynamic Programming, by R. Bellman, in optimal control theory. Reinforcement = criterion to be optimized.

6 Experimental Psychology
About animal conditioning, Thorndike (1911) says: "Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond." This is the Law of Effect.

7 History of computational RL
Shannon 1950: programming a computer for playing chess.
Minsky 1954: neural-analog reinforcement systems.
Samuel 1959: learning for the game of checkers.
Michie 1961: trial and error for the tic-tac-toe game.
Michie and Chambers 1968: inverted pendulum.
Widrow, Gupta, Maitra 1973: punish/reward, learning with a critic in adaptive threshold systems; neuronal rules.
Barto, Sutton, Anderson 1983: actor-critic neuronal rules.
Sutton 1984: temporal credit assignment in RL.
Sutton 1988: learning from temporal differences.
Watkins 1989: Q-learning.
Tesauro 1992: TD-Gammon.

8 A few applications
TD-Gammon [Tesauro]: Backgammon.
KnightCap [Baxter et al. 1998]: chess (about 2500 ELO).
Robotics: juggling, acrobots [Schaal and Atkeson, 1994].
Mobile robot navigation [Thrun et al., 1999].
Elevator controller [Crites and Barto, 1996].
Packet routing [Boyan and Littman, 1993].
Job-shop scheduling [Zhang and Dietterich, 1995].
Production manufacturing optimization [Mahadevan et al., 1998].
Game of poker (bandit algorithms for Nash computation).
Game of Go (hierarchical bandits, UCT).
szepesva/research/rlapplications.html

9 Intro to Reinforcement Learning
Intro to Dynamic Programming
DP algorithms...
RL algorithms...

10 Reinforcement Learning
[Diagram: a decision-making agent interacts with an environment, receiving a state and a reinforcement and emitting an action; the environment may be stochastic, partially observable, or adversarial.]
Environment: can be stochastic (Tetris), adversarial (chess), partially unknown (bicycle), partially observable (robot).
Available information: the reinforcement (which may be delayed).
Goal: maximize the expected sum of future rewards.
Problem: how to sacrifice a small short-term reward in order to obtain larger rewards in the long term?

11 Optimal value function
The optimal value function V* gives an evaluation of each state if the agent plays optimally.
[Diagram: in a stochastic environment, from state x_t each action leads, according to the transition probabilities, to possible next states x_{t+1}.]
Bellman equation: $V^*(x_t) = \max_{a \in A} \big[ r(x_t, a) + \sum_y p(y \mid x_t, a) V^*(y) \big]$
Temporal difference: $\delta_t(V^*) = V^*(x_{t+1}) + r(x_t, a_t) - V^*(x_t)$
If V* is known, then when choosing the optimal action a_t, $E[\delta_t(V^*)] = 0$ (i.e., on average there is no surprise).

12 Learning the value function
How can we learn the value function? Thanks to the surprise! (i.e., the temporal difference).
If V is an approximation of V* and we observe $\delta_t(V) = V(x_{t+1}) + r(x_t, a_t) - V(x_t) > 0$, then V(x_t) should be increased.
From V* we deduce the optimal action in x_t: $\arg\max_{a \in A} \big[ r(x_t, a) + \sum_y p(y \mid x_t, a) V^*(y) \big]$
Locally maximizing the value function = maximizing the sum of future rewards.
Note that RL is really different from supervised learning!

13 Challenges of RL
The state dynamics and reward functions are unknown. Two approaches:
Model-based: first learn a model, then use dynamic programming.
Model-free: use samples to directly learn a good value function.
The curse of dimensionality: in high-dimensional problems, the computational complexity may be prohibitively large. We need to learn an approximation of the value function / policy.
The exploration issue: where should one explore the state space in order to build a good representation of the unknown function where (and only where) it is useful?

14 Introduction to Dynamic Programming
References:
Neuro-Dynamic Programming, Bertsekas and Tsitsiklis, 1996.
Markov Decision Processes, Puterman, 1994.
Markov Decision Processes in Artificial Intelligence, Sigaud and Buffet (eds.), 2008.

15 Markov Decision Process [Bellman 1957, Howard 1960, Dubins and Savage 1965]
We consider a discrete-time process $(x_t) \in X$ where:
X: state space (assumed to be finite).
A: action space (or decisions) (assumed to be finite).
State dynamics: all relevant information about the future is included in the current state and action, $P(x_{t+1} \mid x_t, x_{t-1}, \dots, x_0, a_t, a_{t-1}, \dots, a_0) = P(x_{t+1} \mid x_t, a_t)$ (Markov property). Thus we can define the transition probabilities $p(y \mid x, a) = P(x_{t+1} = y \mid x_t = x, a_t = a)$.
Reinforcement (or reward): r(x, a) is obtained when choosing action a in state x.
An MDP is defined by (X, A, p, r).
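
To make these objects concrete, a small finite MDP (X, A, p, r) can be stored as arrays; the following is a minimal sketch in Python, with purely hypothetical 3-state, 2-action numbers.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP: X = {0, 1, 2}, A = {0, 1}.
# P[x, a, y] = p(y | x, a); each row P[x, a, :] is a probability distribution.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],   # transitions from state 0
    [[0.0, 0.9, 0.1], [0.0, 0.1, 0.9]],   # transitions from state 1
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
])
# r[x, a] = reward obtained when choosing action a in state x.
r = np.array([[0.0, 0.0],
              [0.0, 0.5],
              [1.0, 1.0]])
gamma = 0.95

# Sanity check of the Markov transition kernel.
assert np.allclose(P.sum(axis=2), 1.0)
```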

16 Definition of a policy
A policy is $\pi = (\pi_1, \pi_2, \dots)$, where $\pi_t : X \to A$ defines the action $\pi_t(x)$ chosen in x at time t.
If $\pi = (\pi, \pi, \dots, \pi)$, the policy is called stationary (or Markovian).
When following a stationary policy π, the process $(x_t)_{t \ge 0}$ is a Markov chain with transition probabilities $p(x_{t+1} \mid x_t) = p(x_{t+1} \mid x_t, \pi(x_t))$.
Our goal is to learn a policy that maximizes the sum of rewards.

17 Performance of a policy
For any policy π, define the value function V^π.
Infinite horizon, discounted: $V^\pi(x) = E\big[ \sum_{t \ge 0} \gamma^t r(x_t, a_t) \mid x_0 = x; \pi \big]$, where $0 \le \gamma < 1$ is the discount factor.
Infinite horizon, undiscounted: $V^\pi(x) = E\big[ \sum_{t \ge 0} r(x_t, a_t) \mid x_0 = x; \pi \big]$.
Infinite horizon, average reward: $V^\pi(x) = \lim_{T \to \infty} \frac{1}{T} E\big[ \sum_{t=0}^{T-1} r(x_t, a_t) \mid x_0 = x; \pi \big]$.
Finite horizon: $V^\pi(x, t) = E\big[ \sum_{s=t}^{T-1} r(x_s, a_s) + R(x_T) \mid x_t = x; \pi \big]$.
Goal of the MDP: find an optimal policy π*, i.e. $V^{\pi^*} = \sup_\pi V^\pi$.
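
As a small illustration of the discounted and average criteria, here is a sketch applying them to a single observed reward sequence; the reward values are made up.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """sum_{t >= 0} gamma^t r_t for a finite reward sequence (truncated infinite-horizon criterion)."""
    return float(np.sum(gamma ** np.arange(len(rewards)) * np.asarray(rewards)))

def average_reward(rewards):
    """(1/T) sum_{t=0}^{T-1} r_t, the empirical average-reward criterion."""
    return float(np.mean(rewards))

rewards = [0.0, 1.0, 0.0, 2.0, 1.0]           # hypothetical rewards along one trajectory
print(discounted_return(rewards, gamma=0.9))  # 0.9 + 1.458 + 0.6561 = 3.0141
print(average_reward(rewards))                # 0.8
```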

18 Example: Tetris
State: wall configuration + new piece.
Action: possible positions of the new piece on the wall.
Reward: number of lines removed.
Next state: resulting configuration of the wall + random new piece.
(One can prove that for any policy, the game ends in finite time a.s.)
State space: huge for regular Tetris (the number of wall configurations grows exponentially with the number of board cells).

19 The dilemma of the MLSS student
[Figure: a 7-state MDP in which, at each state, the student chooses between Think and Sleep; transitions are stochastic (probabilities such as 0.5, 0.4, 0.6) and rewards are attached to the transitions and terminal states.]
You try to maximize the sum of rewards!

20 Solution of the MLSS student
[Figure: the same MDP annotated with the value of each state.]
The values of the terminal states are given directly by their rewards; the values V_4, V_3, V_2 then follow by backing up the Bellman equation, and V_1 is the maximum over the two actions available in state 1. Solving these equations gives V_1 = V_2 = 88.3.

21 Infinite-horizon, discounted problems
Let π be a stationary deterministic policy. Define the value function V^π as $V^\pi(x) = E\big[ \sum_{t \ge 0} \gamma^t r(x_t, \pi(x_t)) \mid x_0 = x; \pi \big]$, where $0 \le \gamma < 1$ is a discount factor (which weights future rewards relative to current rewards). And define the optimal value function $V^* = \sup_\pi V^\pi$.
Proposition 1 (Bellman equations). For any policy π, V^π satisfies
$V^\pi(x) = r(x, \pi(x)) + \gamma \sum_{y \in X} p(y \mid x, \pi(x)) V^\pi(y)$,
and V* satisfies
$V^*(x) = \max_{a \in A} \big[ r(x, a) + \gamma \sum_{y \in X} p(y \mid x, a) V^*(y) \big]$.

22 Proof of Proposition 1
$V^\pi(x) = E\big[ \sum_{t \ge 0} \gamma^t r(x_t, \pi(x_t)) \mid x_0 = x; \pi \big]$
$= r(x, \pi(x)) + E\big[ \sum_{t \ge 1} \gamma^t r(x_t, \pi(x_t)) \mid x_0 = x; \pi \big]$
$= r(x, \pi(x)) + \gamma \sum_y P(x_1 = y \mid x_0 = x; \pi) \, E\big[ \sum_{t \ge 1} \gamma^{t-1} r(x_t, \pi(x_t)) \mid x_1 = y; \pi \big]$
$= r(x, \pi(x)) + \gamma \sum_y p(y \mid x, \pi(x)) V^\pi(y)$.

23 Proof of Proposition 1 (continued)
For any policy π = (a, π') (not necessarily stationary),
$V^*(x) = \max_\pi E\big[ \sum_{t \ge 0} \gamma^t r(x_t, \pi(x_t)) \mid x_0 = x; \pi \big]$
$= \max_{(a, \pi')} \big[ r(x, a) + \gamma \sum_y p(y \mid x, a) V^{\pi'}(y) \big]$
$= \max_a \big[ r(x, a) + \gamma \sum_y p(y \mid x, a) \max_{\pi'} V^{\pi'}(y) \big] \quad (1)$
$= \max_a \big[ r(x, a) + \gamma \sum_y p(y \mid x, a) V^*(y) \big]$,
where (1) holds since $\max_{\pi'} \sum_y p(y \mid x, a) V^{\pi'}(y) \le \sum_y p(y \mid x, a) \max_{\pi'} V^{\pi'}(y)$, and conversely, letting $\bar\pi$ be the policy defined by $\bar\pi(y) = \arg\max_{\pi'} V^{\pi'}(y)$, $\sum_y p(y \mid x, a) \max_{\pi'} V^{\pi'}(y) = \sum_y p(y \mid x, a) V^{\bar\pi}(y) \le \max_{\pi'} \sum_y p(y \mid x, a) V^{\pi'}(y)$.

24 Bellman operators
Since X is finite (say with N states), V^π can be considered as a vector of $\mathbb{R}^N$. Write:
$r^\pi \in \mathbb{R}^N$: the vector with components $r^\pi(x) = r(x, \pi(x))$,
$P^\pi \in \mathbb{R}^{N \times N}$: the stochastic matrix with elements $P^\pi(x, y) = p(y \mid x, \pi(x))$.
Define the Bellman operator $T^\pi : \mathbb{R}^N \to \mathbb{R}^N$: for any $W \in \mathbb{R}^N$, $T^\pi W(x) = r(x, \pi(x)) + \gamma \sum_y p(y \mid x, \pi(x)) W(y)$.
Dynamic Programming operator $T : \mathbb{R}^N \to \mathbb{R}^N$: $T W(x) = \max_{a \in A} \big[ r(x, a) + \gamma \sum_y p(y \mid x, a) W(y) \big]$.
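
A possible direct translation of these two operators into code, as a sketch under the array convention used above (P[x, a, y], r[x, a]):

```python
import numpy as np

def bellman_pi(W, P, r, pi, gamma):
    """Bellman operator for a fixed policy: (T^pi W)(x) = r(x, pi(x)) + gamma * sum_y p(y|x,pi(x)) W(y)."""
    idx = np.arange(len(W))
    return r[idx, pi] + gamma * P[idx, pi] @ W

def bellman_opt(W, P, r, gamma):
    """Dynamic Programming operator: (T W)(x) = max_a [ r(x,a) + gamma * sum_y p(y|x,a) W(y) ]."""
    return np.max(r + gamma * P @ W, axis=1)
```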

25 Proposition 2 (properties of the value functions).
1. V^π is the unique fixed point of T^π: $V^\pi = T^\pi V^\pi$.
2. V* is the unique fixed point of T: $V^* = T V^*$.
3. For any policy π, we have $V^\pi = (I - \gamma P^\pi)^{-1} r^\pi$.
4. The policy defined by $\pi^*(x) \in \arg\max_{a \in A} \big[ r(x, a) + \gamma \sum_y p(y \mid x, a) V^*(y) \big]$ is optimal (and stationary).
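
Points 3 and 4 suggest a direct way to evaluate a policy exactly and to extract a greedy policy; a minimal sketch (same array convention, names hypothetical):

```python
import numpy as np

def evaluate_policy(P, r, pi, gamma):
    """Point 3: solve the linear system (I - gamma P^pi) V = r^pi, i.e. V^pi = (I - gamma P^pi)^{-1} r^pi."""
    N = r.shape[0]
    idx = np.arange(N)
    P_pi = P[idx, pi]                    # P^pi[x, y] = p(y | x, pi(x))
    r_pi = r[idx, pi]                    # r^pi[x]    = r(x, pi(x))
    return np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)

def greedy_policy(P, r, V, gamma):
    """Point 4: pi(x) in argmax_a [ r(x,a) + gamma * sum_y p(y|x,a) V(y) ]."""
    return np.argmax(r + gamma * P @ V, axis=1)
```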

26 Proof of Proposition 2 [part 1]
Basic properties of the Bellman operators:
Monotonicity: if $W_1 \le W_2$ (componentwise) then $T^\pi W_1 \le T^\pi W_2$ and $T W_1 \le T W_2$.
Contraction in max-norm: for any vectors $W_1$ and $W_2$, $\|T^\pi W_1 - T^\pi W_2\|_\infty \le \gamma \|W_1 - W_2\|_\infty$ and $\|T W_1 - T W_2\|_\infty \le \gamma \|W_1 - W_2\|_\infty$.
Indeed, for all $x \in X$,
$T W_1(x) - T W_2(x) = \max_a \big[ r(x, a) + \gamma \sum_y p(y \mid x, a) W_1(y) \big] - \max_a \big[ r(x, a) + \gamma \sum_y p(y \mid x, a) W_2(y) \big]$
$\le \gamma \max_a \sum_y p(y \mid x, a) |W_1(y) - W_2(y)| \le \gamma \|W_1 - W_2\|_\infty$.

27 Proof of Proposition 2 [part 2]
1. From Proposition 1, V^π is a fixed point of T^π. Uniqueness comes from the contraction property of T^π.
2. Idem for V*.
3. From point 1, $V^\pi = r^\pi + \gamma P^\pi V^\pi$, thus $(I - \gamma P^\pi) V^\pi = r^\pi$. Now P^π is a stochastic matrix (whose eigenvalues have modulus $\le 1$), thus the eigenvalues of $(I - \gamma P^\pi)$ have modulus $\ge 1 - \gamma > 0$, thus it is invertible.
4. From the definition of π*, we have $T^{\pi^*} V^* = T V^* = V^*$. Thus V* is the fixed point of $T^{\pi^*}$. But, by definition, $V^{\pi^*}$ is the fixed point of $T^{\pi^*}$, and since the fixed point is unique, $V^{\pi^*} = V^*$ and π* is optimal.

28 Value Iteration
Proposition 3. For any $V_0 \in \mathbb{R}^N$, define $V_{k+1} = T V_k$. Then $V_k \to V^*$. Idem for V^π: define $V_{k+1} = T^\pi V_k$; then $V_k \to V^\pi$.
Proof. $\|V_{k+1} - V^*\|_\infty = \|T V_k - T V^*\|_\infty \le \gamma \|V_k - V^*\|_\infty \le \gamma^{k+1} \|V_0 - V^*\|_\infty \to 0$ (idem for V^π).
Variant: asynchronous iterations.
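
A minimal value-iteration sketch following Proposition 3; the stopping threshold and the random test MDP are arbitrary choices made for illustration.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Iterate V_{k+1} = T V_k from V_0 = 0 until the max-norm change falls below tol."""
    V = np.zeros(r.shape[0])
    while True:
        V_next = np.max(r + gamma * P @ V, axis=1)     # apply the DP operator T
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next

# Example on a random MDP (hypothetical data).
rng = np.random.default_rng(0)
N, A, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(N), size=(N, A))             # P[x, a, :] is a distribution over next states
r = rng.uniform(size=(N, A))
V_star = value_iteration(P, r, gamma)
```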

29 Policy Iteration
Choose any initial policy π_0. Iterate:
1. Policy evaluation: compute $V^{\pi_k}$.
2. Policy improvement: π_{k+1} is greedy w.r.t. $V^{\pi_k}$: $\pi_{k+1}(x) \in \arg\max_{a \in A} \big[ r(x, a) + \gamma \sum_y p(y \mid x, a) V^{\pi_k}(y) \big]$ (i.e. $\pi_{k+1} \in \arg\max_\pi T^\pi V^{\pi_k}$).
Stop when $V^{\pi_k} = V^{\pi_{k+1}}$.
Proposition 4. Policy iteration generates a sequence of policies with increasing performance ($V^{\pi_{k+1}} \ge V^{\pi_k}$) and terminates in a finite number of steps with the optimal policy π*.
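
A corresponding policy-iteration sketch, combining exact evaluation and greedy improvement (same array convention; the initial policy is an arbitrary choice):

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Alternate exact policy evaluation and greedy improvement until the policy no longer changes."""
    N = r.shape[0]
    idx = np.arange(N)
    pi = np.zeros(N, dtype=int)                                   # arbitrary initial policy pi_0
    while True:
        # 1. Policy evaluation: V^pi = (I - gamma P^pi)^{-1} r^pi
        V = np.linalg.solve(np.eye(N) - gamma * P[idx, pi], r[idx, pi])
        # 2. Policy improvement: greedy with respect to V^pi
        pi_next = np.argmax(r + gamma * P @ V, axis=1)
        if np.array_equal(pi_next, pi):
            return pi, V
        pi = pi_next
```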

30 Proof of Proposition 4
From the definitions of the operators T, T^{π_k}, T^{π_{k+1}} and of π_{k+1},
$V^{\pi_k} = T^{\pi_k} V^{\pi_k} \le T V^{\pi_k} = T^{\pi_{k+1}} V^{\pi_k}, \quad (2)$
and from the monotonicity of $T^{\pi_{k+1}}$ we have $V^{\pi_k} \le \lim_{n \to \infty} (T^{\pi_{k+1}})^n V^{\pi_k} = V^{\pi_{k+1}}$. Thus $(V^{\pi_k})_k$ is a non-decreasing sequence.
Since there is a finite number of possible policies (finite state and action spaces), the stopping criterion holds for some finite k. We then have equality in (2), thus $V^{\pi_k} = T V^{\pi_k}$, so $V^{\pi_k} = V^*$ and π_k is an optimal policy.

31 Back to Reinforcement Learning
What if the transition probabilities $p(y \mid x, a)$ and the reward function $r(x, a)$ are unknown?
In DP, we used their knowledge:
in value iteration: $V_{k+1}(x) = T V_k(x) = \max_a \big[ r(x, a) + \gamma \sum_y p(y \mid x, a) V_k(y) \big]$;
in policy iteration: when computing $V^{\pi_k}$ (which requires iterating $T^{\pi_k}$) and when computing the greedy policy $\pi_{k+1}(x) \in \arg\max_{a \in A} \big[ r(x, a) + \gamma \sum_y p(y \mid x, a) V^{\pi_k}(y) \big]$.
RL = introduction of two ideas: sampling and Q-functions.

32 First idea: use sampling
Use the real system (or a simulator) to generate trajectories. This means that from any state-action pair $(x_t, a_t)$ we can obtain a next-state sample $x_{t+1} \sim p(\cdot \mid x_t, a_t)$ and a reward sample $r(x_t, a_t)$.
So we can estimate V^π(x) by Monte-Carlo: run trajectories $(x_t^k)_{t \ge 0}$ starting from x and following π, and update
$V_{k+1}(x) = (1 - \eta_k) V_k(x) + \eta_k \sum_{t \ge 0} \gamma^t r(x_t^k, \pi(x_t^k))$,
where e.g. $\eta_k = 1/k$ (sample mean), or more generally use stochastic approximation with $\sum_k \eta_k = \infty$ and $\sum_k \eta_k^2 < \infty$.
Problem: this method has to be repeated for all initial states x and is not sample efficient.
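
A sketch of this Monte-Carlo estimate with the running-average step size $\eta_k = 1/k$, using the model arrays P, r as a simulator; the trajectory length and the number of trajectories are arbitrary truncation choices.

```python
import numpy as np

def mc_value_estimate(P, r, pi, gamma, x0, n_traj=1000, horizon=200, seed=0):
    """Estimate V^pi(x0) via V_{k+1}(x0) = (1 - eta_k) V_k(x0) + eta_k * (discounted return), eta_k = 1/k."""
    rng = np.random.default_rng(seed)
    N = r.shape[0]
    V = 0.0
    for k in range(1, n_traj + 1):
        x, G, disc = x0, 0.0, 1.0
        for _ in range(horizon):                     # truncated trajectory following pi
            a = pi[x]
            G += disc * r[x, a]
            disc *= gamma
            x = rng.choice(N, p=P[x, a])
        V += (1.0 / k) * (G - V)                     # running sample mean
    return V
```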

33 Temporal difference learning
The update rule can be rewritten (writing $r_t = r(x_t, \pi(x_t))$) as, for each visited state $x_s$:
$V_{k+1}(x_s) = (1 - \eta_k) V_k(x_s) + \eta_k \sum_{t \ge s} \gamma^{t-s} r_t = V_k(x_s) + \eta_k \sum_{t \ge s} \gamma^{t-s} \underbrace{\big[ r_t + \gamma V_k(x_{t+1}) - V_k(x_t) \big]}_{\delta_t(V_k)}$
$\delta_t(V_k)$ is the temporal difference for the transition $x_t \to x_{t+1}$. It provides an indication of the direction in which the estimate $V_k(x_s)$ should be updated.
Note that $E[\delta_t(V^\pi)] = 0$ (this is the Bellman equation).

34 TD(λ) algorithm [Sutton, 1988]
Observe a trajectory $(x_t)$ by following π. Update the value of each state $x_s$:
$V_{k+1}(x_s) = V_k(x_s) + \eta_k(x_s) \sum_{t \ge s} (\gamma\lambda)^{t-s} \delta_t(V_k)$, where $0 \le \lambda \le 1$.
$V_k(x_s)$ is impacted by $\delta_t(V_k)$ at all times $t \ge s$.
For λ = 1 we recover Monte-Carlo: $V_{k+1}(x_s) = V_k(x_s) + \eta_k(x_s) \sum_{t \ge s} \gamma^{t-s} \delta_t(V_k)$.
For λ = 0 we have TD(0): $V_{k+1}(x_s) = V_k(x_s) + \eta_k(x_s) \delta_s(V_k)$.
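
A sketch of this update applied offline over one observed trajectory (names hypothetical: `states` is the list $x_0, \dots, x_T$ and `rewards` the list $r_0, \dots, r_{T-1}$); all temporal differences are computed with the current estimate $V_k$, as on the slide.

```python
import numpy as np

def td_lambda_update(V, states, rewards, gamma, lam, eta):
    """V(x_s) <- V(x_s) + eta * sum_{t >= s} (gamma*lam)^{t-s} * delta_t,
    with delta_t = r_t + gamma V(x_{t+1}) - V(x_t) computed from the current V."""
    V = V.copy()
    T = len(rewards)                                              # states has length T + 1
    deltas = np.array([rewards[t] + gamma * V[states[t + 1]] - V[states[t]] for t in range(T)])
    for s in range(T):
        weights = (gamma * lam) ** np.arange(T - s)
        V[states[s]] += eta * np.dot(weights, deltas[s:])
    return V
```

Setting lam = 1 recovers the Monte-Carlo update and lam = 0 the TD(0) update described above.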

35 Convergence of TD(λ)
Proposition 5 (Jaakkola, Jordan and Singh, 1994). Assume that all states $x \in X$ are visited infinitely often and that the learning steps $\eta_k(x)$ satisfy, for all $x \in X$, $\sum_{k \ge 0} \eta_k(x) = \infty$ and $\sum_{k \ge 0} \eta_k^2(x) < \infty$. Then $V_k \to V^\pi$ almost surely.
Proof. TD(0) rewrites as $V_{k+1}(x_s) = V_k(x_s) + \eta_k(x_s) \big[ \widehat{T^\pi} V_k(x_s) - V_k(x_s) \big]$, where $\widehat{T^\pi} V_k(x_s) = r_s + \gamma V_k(x_{s+1})$ is a noisy estimate of $T^\pi V_k(x_s) = r(x_s, a_s) + \gamma \sum_y p(y \mid x_s, a_s) V_k(y)$. This is a stochastic approximation algorithm for estimating the fixed point of the contraction mapping T^π. The proof for TD(λ) is similar.

36 Tradeoff for λ
Example of the linear chain. [Figure: expected quadratic error as a function of λ, computed over 100 trajectories.]
Small λ reduces the variance; large λ propagates rewards faster.

37 Second idea: use Q-value functions
Define the Q-value function $Q^\pi : X \times A \to \mathbb{R}$: for a policy π,
$Q^\pi(x, a) = E\big[ \sum_{t \ge 0} \gamma^t r(x_t, a_t) \mid x_0 = x, a_0 = a, a_t = \pi(x_t) \text{ for } t \ge 1 \big]$,
and the optimal Q-value function $Q^*(x, a) = \max_\pi Q^\pi(x, a)$.
Proposition 6. Q^π and Q* satisfy the Bellman equations:
$Q^\pi(x, a) = r(x, a) + \gamma \sum_{y \in X} p(y \mid x, a) Q^\pi(y, \pi(y))$
$Q^*(x, a) = r(x, a) + \gamma \sum_{y \in X} p(y \mid x, a) \max_{b \in A} Q^*(y, b)$
Idea: compute Q* and then $\pi^*(x) \in \arg\max_a Q^*(x, a)$.
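
In code (same array convention, a sketch for illustration), the standard one-step relation between a value function V and its Q-function, together with greedy-policy extraction, reads:

```python
import numpy as np

def q_from_v(P, r, V, gamma):
    """Q(x, a) = r(x, a) + gamma * sum_y p(y|x,a) V(y); applied to V = V*, this gives Q*."""
    return r + gamma * P @ V

def greedy_from_q(Q):
    """pi*(x) in argmax_a Q*(x, a)."""
    return np.argmax(Q, axis=1)
```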

38 Q-learning algorithm [Watkins, 1989]
Whenever a transition $(x_t, a_t) \xrightarrow{r_t} x_{t+1}$ occurs, update the Q-value:
$Q_{k+1}(x_t, a_t) = Q_k(x_t, a_t) + \eta_k(x_t, a_t) \big[ r_t + \gamma \max_{b \in A} Q_k(x_{t+1}, b) - Q_k(x_t, a_t) \big]$.
Proposition 7 (Watkins and Dayan, 1992). Assume that all state-action pairs (x, a) are visited infinitely often and that the learning steps satisfy, for all (x, a), $\sum_{k \ge 0} \eta_k(x, a) = \infty$ and $\sum_{k \ge 0} \eta_k^2(x, a) < \infty$. Then $Q_k \to Q^*$ almost surely.
Again the proof relies on stochastic approximation for estimating the fixed point of a contraction mapping.
Remark: this does not say how to explore the state space.
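
A tabular Q-learning sketch; the epsilon-greedy exploration and the 1/visits step-size schedule below are common choices assumed here for illustration, not prescribed by the slide (the convergence theorem only constrains the step sizes and requires all pairs to be visited infinitely often).

```python
import numpy as np

def q_learning(P, r, gamma, n_steps=50000, eps=0.1, seed=0):
    """Tabular Q-learning (Watkins' update) run on the simulator defined by P[x, a, y] and r[x, a]."""
    rng = np.random.default_rng(seed)
    N, A = r.shape
    Q = np.zeros((N, A))
    visits = np.zeros((N, A))
    x = int(rng.integers(N))
    for _ in range(n_steps):
        # epsilon-greedy exploration (an assumed scheme; any scheme visiting all pairs would do)
        a = int(rng.integers(A)) if rng.random() < eps else int(np.argmax(Q[x]))
        y = int(rng.choice(N, p=P[x, a]))                  # next-state sample
        visits[x, a] += 1
        eta = 1.0 / visits[x, a]                           # satisfies sum eta = inf, sum eta^2 < inf
        Q[x, a] += eta * (r[x, a] + gamma * np.max(Q[y]) - Q[x, a])
        x = y
    return Q
```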

39 Q-learning algorithm
Deterministic case, discount factor γ = 0.9. With step size η = 1, the update after a transition $(x, a) \xrightarrow{r} y$ becomes $Q_{k+1}(x, a) = r + \gamma \max_{b \in A} Q_k(y, b)$.

40 Optimal Q-values
Bellman's equation, for the deterministic example above: $Q^*(x, a) = \gamma \max_{b \in A} Q^*(\text{next-state}(x, a), b)$.

41 Conclusions of Part 1
When the state space is finite and small:
If the transition probabilities and rewards are known, then DP algorithms (value iteration, policy iteration) compute the optimal solution.
Otherwise, use sampling techniques: RL algorithms (TD(λ), Q-learning) apply.
Two big problems:
Usually the state space is HUGE! We face the curse of dimensionality and thus need to find approximate solutions (Part 2).
We need efficient exploration (Part 3).
