Introduction to Reinforcement Learning
1 Introduction to Reinforcement Learning Rémi Munos SequeL project: Sequential Learning munos/ INRIA Lille - Nord Europe Machine Learning Summer School, September 2011, Bordeaux
2 Outline of the course Part 1: Introduction to Reinforcement Learning and Dynamic Programming Setting, examples Dynamic programming: value iteration, policy iteration RL algorithms: TD(λ), Q-learning. Part 2: Approximate dynamic programming L∞-performance bounds Sample-based algorithms: Least Squares TD, Bellman Residual, Fitted-VI Part 3: Exploration-Exploitation tradeoffs The stochastic bandit: UCB The adversarial bandit: EXP3 Populations of bandits: Tree search, Nash equilibrium. Applications to games (Go, Poker) and planning.
3 Part 1: Introduction to Reinforcement Learning and Dynamic Programming Setting, examples Dynamic programming: value iteration, policy iteration RL algorithms: TD(λ), Q-learning. General references: Neuro-Dynamic Programming, Bertsekas and Tsitsiklis, Introduction to Reinforcement Learning, Sutton and Barto, Markov Decision Problems, Puterman, Markov Decision Processes in Artificial Intelligence, Sigaud and Buffet ed., Algorithms for Reinforcement Learning, Szepesvári, 2009.
4 Introduction to Reinforcement Learning (RL) Acquire skills for sequential decision making in complex, stochastic, partially observable, possibly adversarial environments. Learning from experience a behavior policy (what to do in each situation) from past successes or failures. Examples: Hot and Cold game, chess, learning to ride a bicycle, autonomous robotics, operations research, decision making in stochastic markets,... any adaptive sequential decision making task.
5 Birth of the domain A meeting of fields at the end of the 70s: Computational Neuroscience. Reinforcement of synaptic weights in neuronal transmissions (Hebb's rule, Rescorla-Wagner models). Reinforcement = correlations in neuronal activity. Experimental Psychology. Animal conditioning: reinforcement of behaviors that lead to a satisfaction (behaviorism initiated by Pavlov, Skinner). Reinforcement = satisfaction or discomfort. Independently, a mathematical formalism appeared in the 50s and 60s: Dynamic Programming by R. Bellman, in optimal control theory. Reinforcement = criterion to be optimized.
6 Experimental Psychology About animal conditioning, Thorndike (1911) says: Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. The Law of Effect.
7 History of computational RL Shannon 1950: Programming a computer for playing chess. Minsky 1954: Neural-Analog Reinforcement Systems. Samuel 1959: Learning for the game of checkers. Michie 1961: Trial and error for the tic-tac-toe game. Michie and Chambers 1968: Inverted pendulum. Widrow, Gupta, Maitra 1973: Punish/reward: learning with a critic in adaptive threshold systems. Neuronal rules. Barto, Sutton, Anderson 1983: Actor-Critic neuronal rules. Sutton 1984: Temporal Credit Assignment in RL. Sutton 1988: Learning from temporal differences. Watkins 1989: Q-learning. Tesauro 1992: TD-Gammon.
8 A few applications TD-Gammon [Tesauro ]: Backgammon. KnightCap [Baxter et al. 1998]: chess (~2500 ELO). Robotics: juggling, acrobots [Schaal and Atkeson, 1994]. Mobile robot navigation [Thrun et al., 1999]. Elevator controller [Crites and Barto, 1996]. Packet routing [Boyan and Littman, 1993]. Job-shop scheduling [Zhang and Dietterich, 1995]. Production manufacturing optimization [Mahadevan et al., 1998]. Game of poker (bandit algorithms for Nash computation). Game of go (hierarchical bandits, UCT). szepesva/research/rlapplications.html
9 Intro to Reinforcement Learning Intro to Dynamic Programming DP algorithms... RL algorithms...
10 Reinforcement Learning [Diagram: a decision-making agent interacts with the environment in a loop: it observes a state, chooses an action, and receives a reinforcement; the environment may be stochastic, partially observable, or adversarial.] Environment: can be stochastic (Tetris), adversarial (chess), partially unknown (bicycle), partially observable (robot). Available information: the reinforcement (may be delayed). Goal: maximize the expected sum of future rewards. Problem: how to sacrifice a small short-term reward in order to obtain larger rewards in the long term?
11 Optimal value function Gives an evaluation of each state if the agent plays optimally. Ex: in a stochastic environment, the value V*(x_t) of the current state is linked, through the actions and transition probabilities, to the values V*(x_{t+1}) of the possible next states. Bellman eq.: V*(x_t) = max_{a∈A} [ r(x_t, a) + Σ_y p(y|x_t, a) V*(y) ]. Temporal difference: δ_t(V*) = V*(x_{t+1}) + r(x_t, a_t) − V*(x_t). If V* is known, then when choosing the optimal action a_t, E[δ_t(V*)] = 0 (i.e., on average there is no surprise).
12 Learning the value function How can we learn the value function? Thanks to the surprise! (i.e., the temporal difference.) If V is an approximation of V* and we observe δ_t(V) = V(x_{t+1}) + r(x_t, a_t) − V(x_t) > 0, then V(x_t) should be increased. From V* we deduce the optimal action in x_t: arg max_{a∈A} [ r(x_t, a) + Σ_y p(y|x_t, a) V*(y) ], i.e. locally maximizing the value function = maximizing the sum of future rewards. Note that RL is really different from supervised learning!
13 Challenges of RL The state-dynamics and reward functions are unknown. Two approaches: Model-based: first learn a model, then use dynamic programming. Model-free: use samples to directly learn a good value function. The curse of dimensionality: in high-dimensional problems, the computational complexity may be prohibitively large; we need to learn an approximation of the value function / policy. Exploration issue: where should one explore the state space in order to build a good representation of the unknown function where (and only where) it is useful?
14 Introduction to Dynamic Programming References: Neuro-Dynamic Programming, Bertsekas and Tsitsiklis, Markov Decision Problems, Puterman, Markov Decision Processes in Artificial Intelligence, Sigaud and Buffet ed., 2008.
15 Markov Decision Process [Bellman 1957, Howard 1960, Dubins and Savage 1965] We consider a discrete-time process (x_t) ∈ X where: X: state space (assumed to be finite). A: action space (or decisions) (assumed to be finite). State dynamics: all relevant information about the future is included in the current state and action: P(x_{t+1} | x_t, x_{t−1}, ..., x_0, a_t, a_{t−1}, ..., a_0) = P(x_{t+1} | x_t, a_t) (Markov property). Thus we can define transition probabilities p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a). Reinforcement (or reward): r(x, a) is obtained when choosing action a in state x. An MDP is defined by (X, A, p, r).
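Since X and A are finite, the tuple (X, A, p, r) maps directly onto arrays. A minimal sketch in Python/NumPy; the 2-state, 2-action MDP below is hypothetical, with numbers chosen only for illustration:

```python
import numpy as np

n_states, n_actions = 2, 2

# p[a, x, y] = P(x_{t+1} = y | x_t = x, a_t = a); each row sums to 1.
p = np.array([
    [[0.9, 0.1],   # action 0 from state 0
     [0.2, 0.8]],  # action 0 from state 1
    [[0.5, 0.5],   # action 1 from state 0
     [0.0, 1.0]],  # action 1 from state 1
])

# r[x, a] = reward obtained when choosing action a in state x.
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])

# Sanity check: each p[a, x, :] is a probability distribution over next states.
assert np.allclose(p.sum(axis=2), 1.0)
```

Storing p with the action as the leading axis makes p[a] the N×N transition matrix of action a, which is convenient for the matrix form used in the following slides.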
16 Definition of policy Policy π = (π_1, π_2, ...), where π_t: X → A defines the action π_t(x) chosen in x at time t. If π = (π, π, ..., π), the policy is stationary or Markovian. When following a stationary policy π, the process (x_t)_{t≥0} is a Markov chain with transition probabilities p(x_{t+1}|x_t) = p(x_{t+1}|x_t, π(x_t)). Our goal is to learn a policy that maximizes the sum of rewards.
17 Performance of a policy For any policy π, define the value function V^π. Infinite horizon: Discounted: V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, a_t) | x_0 = x; π ], where 0 ≤ γ < 1 is the discount factor. Undiscounted: V^π(x) = E[ Σ_{t≥0} r(x_t, a_t) | x_0 = x; π ]. Average: V^π(x) = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r(x_t, a_t) | x_0 = x; π ]. Finite horizon: V^π(x, t) = E[ Σ_{s=t}^{T−1} r(x_s, a_s) + R(x_T) | x_t = x; π ]. Goal of the MDP: find an optimal policy π*, i.e. V^{π*} = sup_π V^π.
18 Example: Tetris State: wall configuration + new piece. Action: possible positions of the new piece on the wall. Reward: number of lines removed. Next state: resulting configuration of the wall + random new piece. (We can prove that for any policy, the game ends in finite time a.s.) State space: a huge number of states for regular Tetris.
19 The dilemma of the MLSS student [Diagram: a 7-state MDP. In the first states the student chooses between Think and Sleep; some transitions are stochastic (with probabilities such as 0.5, 0.4, 0.6) and rewards of magnitude 0, 1, 10, 100, 1000 are attached to the branches and terminal states.] You try to maximize the sum of rewards!
20 Solution of the MLSS student [Diagram: the same MDP annotated with the computed values V_1, ..., V_7.] The terminal values V_5 = 10, V_6 = 100, V_7 = 1000 follow directly from their rewards; the intermediate values V_4, V_3, V_2 follow by taking expectations over the transitions; and V_1 = max over the two actions (Think or Sleep) of the corresponding expected value. Solving the resulting system yields V_1 = V_2 = 88.3.
21 Infinite horizon, discounted problems Let π be a stationary deterministic policy. Define the value function V^π as: V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ], where 0 ≤ γ < 1 is a discount factor (which weights future rewards relative to current rewards). And define the optimal value function: V* = sup_π V^π. Proposition 1 (Bellman equations). For any policy π, V^π satisfies: V^π(x) = r(x, π(x)) + γ Σ_{y∈X} p(y|x, π(x)) V^π(y), and V* satisfies: V*(x) = max_{a∈A} [ r(x, a) + γ Σ_{y∈X} p(y|x, a) V*(y) ].
22 Proof of Proposition 1 V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ] = r(x, π(x)) + E[ Σ_{t≥1} γ^t r(x_t, π(x_t)) | x_0 = x; π ] = r(x, π(x)) + γ Σ_y P(x_1 = y | x_0 = x; π) E[ Σ_{t≥1} γ^{t−1} r(x_t, π(x_t)) | x_1 = y; π ] = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).
23 Proof of Proposition 1 (continued) And for any policy π = (a, π') (not necessarily stationary), V*(x) = max_π E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ] = max_{(a,π')} [ r(x, a) + γ Σ_y p(y|x, a) V^{π'}(y) ] = max_a [ r(x, a) + γ Σ_y p(y|x, a) max_{π'} V^{π'}(y) ] = max_a [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ]. (1) where (1) holds since: max_{π'} Σ_y p(y|x, a) V^{π'}(y) ≤ Σ_y p(y|x, a) max_{π'} V^{π'}(y). Conversely, let π̄ be the policy defined by π̄(y) = arg max_{π'} V^{π'}(y). Then Σ_y p(y|x, a) max_{π'} V^{π'}(y) = Σ_y p(y|x, a) V^{π̄}(y) ≤ max_{π'} Σ_y p(y|x, a) V^{π'}(y).
24 Bellman operators Since X is finite (say with N states), V^π can be considered as a vector of R^N. Write: r^π ∈ R^N, the vector with components r^π(x) = r(x, π(x)); P^π ∈ R^{N×N}, the stochastic matrix with elements P^π(x, y) = p(y|x, π(x)). Define the Bellman operator T^π: R^N → R^N: for any W ∈ R^N, T^π W(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) W(y), and the Dynamic Programming operator T: R^N → R^N: T W(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) W(y) ].
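Treating W as a vector of R^N, both operators are a few lines of code. A sketch assuming the array layout p[a, x, y] and r[x, a] (an illustrative convention, not from the slides); the last lines check the γ-contraction property numerically on a toy MDP:

```python
import numpy as np

def bellman_pi(W, p, r, pi, gamma):
    """Apply T^pi: (T^pi W)(x) = r(x, pi(x)) + gamma * sum_y p(y|x, pi(x)) W(y)."""
    n = len(W)
    return np.array([r[x, pi[x]] + gamma * p[pi[x], x] @ W for x in range(n)])

def bellman_opt(W, p, r, gamma):
    """Apply T: (T W)(x) = max_a [ r(x, a) + gamma * sum_y p(y|x, a) W(y) ]."""
    n, n_actions = r.shape
    return np.array([max(r[x, a] + gamma * p[a, x] @ W for a in range(n_actions))
                     for x in range(n)])

# Contraction check on a hypothetical 2-state MDP (numbers are illustrative).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [2.0, 0.0]])
gamma = 0.9
W1, W2 = np.array([1.0, -3.0]), np.array([0.0, 2.0])
lhs = np.max(np.abs(bellman_opt(W1, p, r, gamma) - bellman_opt(W2, p, r, gamma)))
assert lhs <= gamma * np.max(np.abs(W1, ) - W2).max() if False else lhs <= gamma * np.max(np.abs(W1 - W2)) + 1e-12
```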
25 Proposition 2. Properties of the value functions 1. V^π is the unique fixed point of T^π: V^π = T^π V^π. 2. V* is the unique fixed point of T: V* = T V*. 3. For any policy π, we have V^π = (I − γP^π)^{−1} r^π. 4. The policy defined by π*(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ] is optimal (and stationary).
26 Proof of Proposition 2 [part 1] Basic properties of the Bellman operators: Monotonicity: if W_1 ≤ W_2 (componentwise) then T^π W_1 ≤ T^π W_2 and T W_1 ≤ T W_2. Contraction in max-norm: for any vectors W_1 and W_2, ||T^π W_1 − T^π W_2||_∞ ≤ γ ||W_1 − W_2||_∞ and ||T W_1 − T W_2||_∞ ≤ γ ||W_1 − W_2||_∞. Indeed, for all x ∈ X, T W_1(x) − T W_2(x) = max_a [ r(x, a) + γ Σ_y p(y|x, a) W_1(y) ] − max_a [ r(x, a) + γ Σ_y p(y|x, a) W_2(y) ] ≤ γ max_a Σ_y p(y|x, a) |W_1(y) − W_2(y)| ≤ γ ||W_1 − W_2||_∞.
27 Proof of Proposition 2 [part 2] 1. From Proposition 1, V^π is a fixed point of T^π. Uniqueness comes from the contraction property of T^π. 2. Idem for V*. 3. From point 1, V^π = r^π + γP^π V^π. Thus (I − γP^π) V^π = r^π. Now P^π is a stochastic matrix (whose eigenvalues have modulus ≤ 1), thus the eigenvalues of (I − γP^π) have modulus ≥ 1 − γ > 0, so it is invertible. 4. From the definition of π*, we have T^{π*} V* = T V* = V*. Thus V* is a fixed point of T^{π*}. But, by definition, V^{π*} is the fixed point of T^{π*}, and since the fixed point is unique, V^{π*} = V* and π* is optimal.
28 Value Iteration Proposition 3. For any V_0 ∈ R^N, define V_{k+1} = T V_k. Then V_k → V*. Idem for V^π: define V_{k+1} = T^π V_k. Then V_k → V^π. Proof. ||V_{k+1} − V*||_∞ = ||T V_k − T V*||_∞ ≤ γ ||V_k − V*||_∞ ≤ γ^{k+1} ||V_0 − V*||_∞ → 0 (idem for V^π). Variant: asynchronous iterations.
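Value iteration is then a short loop. A sketch under the same hypothetical array layout (p[a, x, y], r[x, a]); the stopping test uses the max-norm, matching the contraction argument:

```python
import numpy as np

def value_iteration(p, r, gamma, tol=1e-10):
    """Iterate V_{k+1} = T V_k until the max-norm change drops below tol."""
    n, n_actions = r.shape
    V = np.zeros(n)
    while True:
        V_next = np.array([max(r[x, a] + gamma * p[a, x] @ V
                               for a in range(n_actions)) for x in range(n)])
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next

# Hypothetical 2-state MDP (illustrative numbers only).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [2.0, 0.0]])
V_star = value_iteration(p, r, gamma=0.9)

# Fixed-point check: V* = T V* (up to the tolerance).
TV = np.array([max(r[x, a] + 0.9 * p[a, x] @ V_star for a in range(2))
               for x in range(2)])
assert np.allclose(V_star, TV)
```

The geometric rate γ^k from Proposition 3 means roughly log(1/tol)/log(1/γ) iterations suffice.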
29 Policy Iteration Choose any initial policy π_0. Iterate: 1. Policy evaluation: compute V^{π_k}. 2. Policy improvement: π_{k+1} greedy w.r.t. V^{π_k}: π_{k+1}(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{π_k}(y) ] (i.e. π_{k+1} ∈ arg max_π T^π V^{π_k}). Stop when V^{π_k} = V^{π_{k+1}}. Proposition 4. Policy iteration generates a sequence of policies with increasing performance (V^{π_{k+1}} ≥ V^{π_k}) and terminates in a finite number of steps with the optimal policy π*.
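A sketch of policy iteration under the same hypothetical array layout, using the closed-form evaluation V^π = (I − γP^π)^{−1} r^π from Proposition 2 for the evaluation step:

```python
import numpy as np

def policy_iteration(p, r, gamma):
    """Alternate exact policy evaluation and greedy policy improvement."""
    n, n_actions = r.shape
    pi = np.zeros(n, dtype=int)
    while True:
        # Evaluation: V^pi = (I - gamma P^pi)^{-1} r^pi  (Proposition 2, point 3).
        P_pi = np.array([p[pi[x], x] for x in range(n)])
        r_pi = np.array([r[x, pi[x]] for x in range(n)])
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
        # Improvement: pi' greedy with respect to V^pi.
        Q = np.array([[r[x, a] + gamma * p[a, x] @ V for a in range(n_actions)]
                      for x in range(n)])
        pi_next = Q.argmax(axis=1)
        if np.array_equal(pi_next, pi):   # stable policy => optimal
            return pi, V
        pi = pi_next

# Hypothetical 2-state MDP (illustrative numbers only).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [2.0, 0.0]])
pi_star, V_star = policy_iteration(p, r, gamma=0.9)
```

Termination follows Proposition 4: each improvement step strictly increases the value until the greedy policy is stable.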
30 Proof of Proposition 4 From the definition of the operators T, T^{π_k}, T^{π_{k+1}} and of π_{k+1}, V^{π_k} = T^{π_k} V^{π_k} ≤ T V^{π_k} = T^{π_{k+1}} V^{π_k}, (2) and from the monotonicity of T^{π_{k+1}}, we have V^{π_k} ≤ lim_{n→∞} (T^{π_{k+1}})^n V^{π_k} = V^{π_{k+1}}. Thus (V^{π_k})_k is a non-decreasing sequence. Since there is a finite number of possible policies (finite state and action spaces), the stopping criterion holds for a finite k. We thus have equality in (2), thus V^{π_k} = T V^{π_k}, so V^{π_k} = V* and π_k is an optimal policy.
31 Back to Reinforcement Learning What if the transition probabilities p(y|x, a) and the reward functions r(x, a) are unknown? In DP, we used their knowledge: in value iteration: V_{k+1}(x) = T V_k(x) = max_a [ r(x, a) + γ Σ_y p(y|x, a) V_k(y) ]; in policy iteration: when computing V^{π_k} (which requires iterating T^{π_k}) and when computing the greedy policy: π_{k+1}(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{π_k}(y) ]. RL = introduction of 2 ideas: sampling and Q-functions.
32 1st idea: use sampling Use the real system (or a simulator) to generate trajectories. This means that from any state-action pair (x_t, a_t) we can obtain a next-state sample x_{t+1} ~ p(·|x_t, a_t) and a reward sample r(x_t, a_t). So we can estimate V^π(x) by Monte-Carlo: run trajectories (x_t^k) starting from x and following π, and update V_{k+1}(x) = (1 − η_k) V_k(x) + η_k Σ_{t≥0} γ^t r(x_t^k, π(x_t^k)), where e.g. η_k = 1/k (sample mean), or more generally use Stochastic Approximation with Σ_k η_k = ∞ and Σ_k η_k² < ∞. Problem: this method has to be repeated for all initial states x and is not sample efficient.
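A sketch of this Monte-Carlo estimator on a hypothetical 2-state simulator (the sample mean over trajectories corresponds to η_k = 1/k); the estimate is compared against the exact V^π = (I − γP^π)^{−1} r^π:

```python
import numpy as np

def mc_value(x0, pi, p, r, gamma, rng, n_traj=1000, horizon=100):
    """Average truncated discounted returns of trajectories following pi."""
    total = 0.0
    for _ in range(n_traj):
        x, ret, disc = x0, 0.0, 1.0
        for _ in range(horizon):
            a = pi[x]
            ret += disc * r[x, a]          # reward for choosing a in x
            disc *= gamma
            x = rng.choice(r.shape[0], p=p[a, x])  # x_{t+1} ~ p(.|x_t, a_t)
        total += ret
    return total / n_traj

# Hypothetical 2-state MDP and policy (illustrative numbers only).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [2.0, 0.0]])
pi, gamma = np.array([0, 0]), 0.9
rng = np.random.default_rng(0)

# Exact value for comparison: V^pi = (I - gamma P^pi)^{-1} r^pi.
P_pi = np.array([p[pi[x], x] for x in range(2)])
r_pi = np.array([r[x, pi[x]] for x in range(2)])
V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

V_hat = mc_value(0, pi, p, r, gamma, rng)
assert abs(V_hat - V_exact[0]) < 0.4   # Monte-Carlo noise, hence the loose bound
```

Truncating at horizon 100 is harmless here since γ^100 is tiny; the dominant error is the Monte-Carlo variance, which is exactly the sample-efficiency problem the slide points out.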
33 Temporal difference learning The update rule rewrites (writing r_t = r(x_t, π(x_t))): V_{k+1}(x_s) = (1 − η_k) V_k(x_s) + η_k Σ_{t≥s} γ^{t−s} r_t = V_k(x_s) + η_k Σ_{t≥s} γ^{t−s} [ r_t + γ V_k(x_{t+1}) − V_k(x_t) ], where δ_t(V_k) = r_t + γ V_k(x_{t+1}) − V_k(x_t) is the temporal difference for the transition x_t → x_{t+1}. δ_t(V_k) provides an indication of the direction in which the estimate V_k(x_s) should be updated. Note that E[δ_t(V^π)] = 0 (this is the Bellman equation).
34 TD(λ) algorithm [Sutton, 1988] Observe a trajectory (x_t) by following π. Update the value of each state x_s: V_{k+1}(x_s) = V_k(x_s) + η_k(x_s) Σ_{t≥s} (γλ)^{t−s} δ_t(V_k), where 0 ≤ λ ≤ 1. V_k(x_s) is impacted by δ_t(V_k) at all times t ≥ s. For λ = 1, we recover Monte-Carlo: V_{k+1}(x_s) = V_k(x_s) + η_k(x_s) Σ_{t≥s} γ^{t−s} δ_t(V_k). For λ = 0, we have TD(0): V_{k+1}(x_s) = V_k(x_s) + η_k(x_s) δ_s(V_k).
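The same updates can be applied incrementally with eligibility traces (the backward view of TD(λ)). A sketch on a hypothetical deterministic 3-state chain (0 → 1 → 2, reward 1 per transition, state 2 absorbing with reward 0); the chain and the constant step size are illustrative choices, not from the slides:

```python
import numpy as np

def td_lambda(n_states, step, x0, gamma, lam, alpha, n_episodes, horizon):
    """TD(lambda) with accumulating eligibility traces."""
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        e = np.zeros(n_states)              # eligibility trace per state
        x = x0
        for _ in range(horizon):
            y, rew = step(x)
            delta = rew + gamma * V[y] - V[x]  # temporal difference delta_t(V)
            e *= gamma * lam                   # decay all traces by (gamma*lambda)
            e[x] += 1.0                        # mark the visit to x
            V += alpha * delta * e             # credit every recently visited state
            x = y
    return V

# Hypothetical deterministic chain: 0 -> 1 -> 2 (absorbing), reward 1 per move.
def step(x):
    return (x + 1, 1.0) if x < 2 else (2, 0.0)

gamma = 0.9
V = td_lambda(3, step, x0=0, gamma=gamma, lam=0.5, alpha=0.1,
              n_episodes=2000, horizon=10)
# Exact values: V(2) = 0, V(1) = 1, V(0) = 1 + gamma * 1 = 1.9.
assert np.allclose(V, [1.9, 1.0, 0.0], atol=1e-3)
```

With lam = 1.0 the traces implement the Monte-Carlo update and with lam = 0.0 only the current state is updated, matching the two limiting cases above.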
35 Convergence of TD(λ) Proposition 5 (Jaakkola, Jordan and Singh, 1994). Assume that all states x ∈ X are visited infinitely often and that the learning steps η_k(x) satisfy, for all x ∈ X, Σ_{k≥0} η_k(x) = ∞ and Σ_{k≥0} η_k²(x) < ∞; then V_k → V^π almost surely. Proof. TD(0) rewrites V_{k+1}(x_s) = V_k(x_s) + η_k(x_s) [ T̂^π V_k(x_s) − V_k(x_s) ], where T̂^π V_k(x_s) = r_s + γ V_k(x_{s+1}) is a noisy estimate of T^π V_k(x_s) = r(x_s, a_s) + γ Σ_y p(y|x_s, a_s) V_k(y). This is a Stochastic Approximation algorithm for estimating the fixed point of the contraction mapping T^π. TD(λ) is similar.
36 Tradeoff for λ [Plot: expected quadratic error as a function of λ, averaged over 100 trajectories, on a linear-chain example.] Small λ reduces variance; large λ propagates rewards faster.
37 2nd idea: use Q-value functions Define the Q-value function Q^π: X × A → R: for a policy π, Q^π(x, a) = E[ Σ_{t≥0} γ^t r(x_t, a_t) | x_0 = x, a_0 = a, a_t = π(x_t) for t ≥ 1 ], and the optimal Q-value function Q*(x, a) = max_π Q^π(x, a). Proposition 6. Q^π and Q* satisfy the Bellman equations: Q^π(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) Q^π(y, π(y)) and Q*(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) max_{b∈A} Q*(y, b). Idea: compute Q*, and then π*(x) ∈ arg max_a Q*(x, a).
38 Q-learning algorithm [Watkins, 1989] Whenever a transition x_t, a_t → r_t, x_{t+1} occurs, update the Q-value: Q_{k+1}(x_t, a_t) = Q_k(x_t, a_t) + η_k(x_t, a_t) [ r_t + γ max_{b∈A} Q_k(x_{t+1}, b) − Q_k(x_t, a_t) ]. Proposition 7 (Watkins and Dayan, 1992). Assume that all state-action pairs (x, a) are visited infinitely often and that the learning steps satisfy, for all (x, a), Σ_{k≥0} η_k(x, a) = ∞ and Σ_{k≥0} η_k²(x, a) < ∞; then Q_k → Q* almost surely. Again the proof relies on a Stochastic Approximation algorithm for estimating the fixed point of a contraction mapping. Remark: this does not say how to explore the space.
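A sketch of tabular Q-learning with uniform random exploration, on a hypothetical deterministic 3-state chain (action 0 moves right, action 1 stays; reward 1 only on entering the terminal state 2). Since the dynamics are deterministic, a constant step size η = 1 suffices:

```python
import numpy as np

def q_learning(n_states, n_actions, step, gamma, n_episodes, horizon, rng):
    """Tabular Q-learning; eta = 1 is valid here because the MDP is deterministic."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        x = int(rng.integers(0, n_states - 1))   # random non-terminal start state
        for _ in range(horizon):
            a = int(rng.integers(n_actions))     # explore uniformly at random
            y, rew = step(x, a)
            Q[x, a] = rew + gamma * Q[y].max()   # eta = 1 update
            x = y
    return Q

# Hypothetical chain: action 0 moves right, action 1 stays; state 2 is terminal.
def step(x, a):
    if x == 2:
        return 2, 0.0
    if a == 0:
        y = x + 1
        return y, 1.0 if y == 2 else 0.0
    return x, 0.0

rng = np.random.default_rng(0)
Q = q_learning(3, 2, step, gamma=0.9, n_episodes=300, horizon=20, rng=rng)
# Q*(1,right)=1, Q*(0,right)=0.9, Q*(1,stay)=0.9, Q*(0,stay)=0.81, Q*(2,.)=0.
assert np.allclose(Q[:2], [[0.9, 0.81], [1.0, 0.9]])
```

Note the update uses max over actions at the next state, so it learns Q* off-policy while the behavior (here, uniformly random) only needs to keep visiting every state-action pair, exactly as Proposition 7 requires.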
39 Q-learning algorithm Deterministic case, discount factor γ = 0.9. Take steps η = 1. After a transition x, a → r, y, update Q_{k+1}(x, a) = r + γ max_{b∈A} Q_k(y, b).
40 Optimal Q-values Bellman's equation: Q*(x, a) = γ max_{b∈A} Q*(next-state(x, a), b).
41 Conclusions of Part 1 When the state space is finite and small: if transition probabilities and rewards are known, then DP algorithms (value iteration, policy iteration) compute the optimal solution; otherwise, sampling-based RL algorithms (TD(λ), Q-learning) apply. 2 big problems: Usually the state space is HUGE! We face the curse of dimensionality and thus need to find approximate solutions → Part 2. We need efficient exploration → Part 3.
More informationApproximate Dynamic Programming
Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning) A.
More informationReinforcement Learning II. George Konidaris
Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2018 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes
More informationChapter 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1
Chapter 7: Eligibility Traces R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 Midterm Mean = 77.33 Median = 82 R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
More informationReinforcement Learning for Continuous. Action using Stochastic Gradient Ascent. Hajime KIMURA, Shigenobu KOBAYASHI JAPAN
Reinforcement Learning for Continuous Action using Stochastic Gradient Ascent Hajime KIMURA, Shigenobu KOBAYASHI Tokyo Institute of Technology, 4259 Nagatsuda, Midori-ku Yokohama 226-852 JAPAN Abstract:
More informationValue Function Based Reinforcement Learning in Changing Markovian Environments
Journal of Machine Learning Research 9 (2008) 1679-1709 Submitted 6/07; Revised 12/07; Published 8/08 Value Function Based Reinforcement Learning in Changing Markovian Environments Balázs Csanád Csáji
More informationLecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010
Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book
More informationAnimal learning theory
Animal learning theory Based on [Sutton and Barto, 1990, Dayan and Abbott, 2001] Bert Kappen [Sutton and Barto, 1990] Classical conditioning: - A conditioned stimulus (CS) and unconditioned stimulus (US)
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationAutonomous Helicopter Flight via Reinforcement Learning
Autonomous Helicopter Flight via Reinforcement Learning Authors: Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, Shankar Sastry Presenters: Shiv Ballianda, Jerrolyn Hebert, Shuiwang Ji, Kenley Malveaux, Huy
More informationThe convergence limit of the temporal difference learning
The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector
More informationIn Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and. Convergence of Indirect Adaptive. Andrew G.
In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and J. Alspector, (Eds.). Morgan Kaufmann Publishers, San Fancisco, CA. 1994. Convergence of Indirect Adaptive Asynchronous
More informationIntroduction to Reinforcement Learning
CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.
More informationReinforcement learning an introduction
Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,
More informationToday s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning
CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides
More informationReinforcement Learning: An Introduction
Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is
More informationReinforcement Learning
Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo MLR, University of Stuttgart
More informationApproximate Optimal-Value Functions. Satinder P. Singh Richard C. Yee. University of Massachusetts.
An Upper Bound on the oss from Approximate Optimal-Value Functions Satinder P. Singh Richard C. Yee Department of Computer Science University of Massachusetts Amherst, MA 01003 singh@cs.umass.edu, yee@cs.umass.edu
More informationReinforcement Learning
Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.
More informationLeast-Squares λ Policy Iteration: Bias-Variance Trade-off in Control Problems
: Bias-Variance Trade-off in Control Problems Christophe Thiery Bruno Scherrer LORIA - INRIA Lorraine - Campus Scientifique - BP 239 5456 Vandœuvre-lès-Nancy CEDEX - FRANCE thierych@loria.fr scherrer@loria.fr
More informationAlgorithms for Reinforcement Learning
Algorithms for Reinforcement Learning Draft of the lecture published in the Synthesis Lectures on Artificial Intelligence and Machine Learning series by Morgan & Claypool Publishers Csaba Szepesvári June
More informationReinforcement Learning. Up until now we have been
Reinforcement Learning Slides by Rich Sutton Mods by Dan Lizotte Refer to Reinforcement Learning: An Introduction by Sutton and Barto Alpaydin Chapter 16 Up until now we have been Supervised Learning Classifying,
More informationMarks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:
Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,
More informationThis question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer.
This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer. 1. Suppose you have a policy and its action-value function, q, then you
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationTemporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI
Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning
More informationLecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning
More informationReinforcement learning
Reinforcement learning Stuart Russell, UC Berkeley Stuart Russell, UC Berkeley 1 Outline Sequential decision making Dynamic programming algorithms Reinforcement learning algorithms temporal difference
More informationReinforcement Learning and Control
CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make
More informationMarkov Decision Processes Chapter 17. Mausam
Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.
More informationECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control
ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Tianyu Wang: tiw161@eng.ucsd.edu Yongxi Lu: yol070@eng.ucsd.edu
More informationTemporal Difference Learning & Policy Iteration
Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.
More informationLecture 3: The Reinforcement Learning Problem
Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationCS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study
CS 287: Advanced Robotics Fall 2009 Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study Pieter Abbeel UC Berkeley EECS Assignment #1 Roll-out: nice example paper: X.
More informationRewards and Value Systems: Conditioning and Reinforcement Learning
Rewards and Value Systems: Conditioning and Reinforcement Learning Acknowledgements: Many thanks to P. Dayan and L. Abbott for making the figures of their book available online. Also thanks to R. Sutton
More informationChapter 3: The Reinforcement Learning Problem
Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationDeep Reinforcement Learning. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017
Deep Reinforcement Learning STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017 Outline Introduction to Reinforcement Learning AlphaGo (Deep RL for Computer Go)
More informationReinforcement Learning In Continuous Time and Space
Reinforcement Learning In Continuous Time and Space presentation of paper by Kenji Doya Leszek Rybicki lrybicki@mat.umk.pl 18.07.2008 Leszek Rybicki lrybicki@mat.umk.pl Reinforcement Learning In Continuous
More informationREINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning
REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari
More informationReinforcement Learning II
Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini
More information