CS599 Lecture 1 Introduction To RL

Reinforcement Learning
- Introduction
- Learning from rewards
- Policies
- Value Functions
- Rewards
- Models of the Environment
- Exploitation vs. Exploration
- Dynamic Programming
- Temporal Difference Learning
- Q-Learning

Handout: Class Notes

Reading Assignment for Next Class: Chapters 1-4, http://www.cs.ualberta.ca/~sutton/book/the-book.html

Goals for the Course

Highlights:
- Introduction to reinforcement learning with a view towards solving real-world problems
- State of the art of learning control with reinforcement learning
- Projects possible with simulated or actual robots

Course Description: This course will introduce and discuss machine learning methods for learning control, with a particular focus on robotics, but also applicable to models of learning in biology and to any other control process. The course will cover the basics of reinforcement learning with value functions (dynamic programming, temporal difference learning, Q-learning). The emphasis, however, will be on learning methods that scale to complex, high-dimensional control problems. Thus, we will cover function approximation methods for reinforcement learning, policy gradients, probabilistic reinforcement learning, learning from trajectory trials, optimal control methods, stochastic optimal control methods, dynamic Bayesian networks for learning control, Gaussian processes for reinforcement learning, etc.

Reminder: Supervised Learning
- Supervised learning provides a signed error vector, i.e., the gradient
- Training info = desired (target) outputs
- Inputs -> Supervised Learning System -> Outputs
- Error = (target output - actual output)

Evaluative Feedback
- Evaluating actions vs. instructing by giving correct actions
- Pure evaluative feedback depends totally on the action taken.
- Pure instructive feedback depends not at all on the action taken.
- Supervised learning is instructive; optimization and reinforcement learning are evaluative.

What is Reinforcement Learning?
- Reinforcement learning is learning what to do, i.e., how to map situations to actions, so as to maximize a numerical reward signal (evaluative feedback).
- There is no information about which actions to take; only the reward signal is given.
- Actions in the present may affect future rewards, so there is a temporal credit assignment problem.
- Reinforcement learning requires learning certain functional relationships, and thus builds on techniques of function approximation.
- Reinforcement learning considers an entire learning problem, i.e., an agent interacting with its environment, which makes it a more complex problem than most learning tasks.

What is Reinforcement Learning?
- An approach to Artificial Intelligence
- Learning from interaction
- Goal-oriented learning
- Learning about, from, and while interacting with an external environment
- Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal

Key Features of RL
- Learner is not told which actions to take
- Trial-and-error search
- Possibility of delayed reward: sacrifice short-term gains for greater long-term gains
- The need to explore and exploit
- Considers the whole problem of a goal-directed agent interacting with an uncertain environment

RL In The Context of Other Research Areas
[Figure: Reinforcement Learning (RL) shown in relation to Artificial Intelligence, Psychology, Control Theory and Operations Research, Neuroscience, and Artificial Neural Networks]

Elements of Reinforcement Learning
- Policies: map the perceived state to an action (can be probabilistic)
- Reward functions: map the perceived state-action pair to a single number, the immediate reward (can be stochastic)
- Value functions: map a state to the accumulated expected reward that would be received when starting in that state
- Models: predict the next state given the current state and action (can be probabilistic)
- Objective: optimize reward!

Elements of RL
- Policy: what to do
- Reward: what is good
- Value: what is good because it predicts reward
- Model: what follows what

The Agent-Environment Interface
Agent and environment interact at discrete time steps: $t = 0, 1, 2, \ldots$
- Agent observes state at step $t$: $s_t \in S$
- produces action at step $t$: $a_t \in A(s_t)$
- gets resulting reward: $r_{t+1} \in \mathbb{R}$
- and resulting next state: $s_{t+1}$
Resulting trajectory: $\ldots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$

Examples of Reinforcement Learning
- Robocup Soccer Teams (Stone & Veloso, Riedmiller et al.): world's best player of simulated soccer, 1999; runner-up 2000
- Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry standard methods
- Dynamic Channel Assignment (Singh & Bertsekas, Nie & Haykin): world's best assigner of radio channels to mobile telephone calls
- Elevator Control (Crites & Barto): (probably) world's best down-peak elevator controller
- Many Robots: navigation, bi-pedal walking, grasping, switching between skills...
- TD-Gammon and Jellyfish (Tesauro, Dahl): world's best backgammon player

The Markov Property
By the state at step $t$, the book means whatever information is available to the agent at step $t$ about its environment. The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations. Ideally, a state should summarize past sensations so as to retain all essential information, i.e., it should have the Markov Property:
$\Pr\{ s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0 \} = \Pr\{ s_{t+1} = s', r_{t+1} = r \mid s_t, a_t \}$
for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$.

Markov Decision Processes
If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP). If the state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give:
- state and action sets
- one-step dynamics defined by transition probabilities:
  $P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}$ for all $s, s' \in S$, $a \in A(s)$
- reward probabilities:
  $R^a_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}$ for all $s, s' \in S$, $a \in A(s)$

Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy.
State-value function for policy $\pi$:
$V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \}$
The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$.
Action-value function for policy $\pi$:
$Q^\pi(s,a) = E_\pi\{ R_t \mid s_t = s, a_t = a \} = E_\pi\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a \}$

Bellman Equation for a Policy
The basic idea:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots$
$= r_{t+1} + \gamma ( r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots )$
$= r_{t+1} + \gamma R_{t+1}$
So:
$V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \}$
Or, without the expectation operator:
$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ]$

Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:
$V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s,a)$
$= \max_{a \in A(s)} E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}$
$= \max_{a \in A(s)} \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^*(s') ]$
$V^*$ is the unique solution of this system of nonlinear equations.

Bellman Optimality Equation for Q*
$Q^*(s,a) = E\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \}$
$= \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') ]$
$Q^*$ is the unique solution of this system of nonlinear equations.

The reward hypothesis
All of what we mean by goals and purposes can be well thought of as the maximization of the cumulative sum of a received scalar signal (reward).

Returns
Suppose the sequence of rewards after step $t$ is: $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$
What do we want to maximize? In general, we want to maximize the expected return, $E\{R_t\}$, for each step $t$.
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$,
where $T$ is a final time step at which a terminal state is reached, ending an episode.
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Returns for Continuing Tasks
Continuing tasks: interaction does not have natural episodes.
Discounted return:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,
where $\gamma$, $0 \le \gamma \le 1$, is the discount rate.
shortsighted $0 \leftarrow \gamma \rightarrow 1$ farsighted
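
The discounted return above can be computed for a finite reward sequence with the recursion $R_t = r_{t+1} + \gamma R_{t+1}$. The following is a minimal sketch; the function name `discounted_return` and the list-based reward format are illustrative assumptions, not part of the lecture.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for rewards r_{t+1}, r_{t+2}, ...

    `rewards` is a finite list (an assumption; a continuing task would
    truncate the infinite sum at some horizon).
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    ret = 0.0
    # Work backwards so that R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


# gamma = 0 is fully shortsighted; gamma = 1 gives the undiscounted sum.
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

With `gamma = 0` only the first reward counts, which matches the "shortsighted" end of the scale on the slide.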

An Example
Avoid failure: the pole falling beyond a critical angle or the cart hitting the end of the track.
- As an episodic task where the episode ends upon failure: reward = +1 for each step before failure, so return = number of steps before failure
- As a continuing task with discounted return: reward = -1 upon failure, 0 otherwise, so return = $-\gamma^k$, for $k$ steps before failure
In either case, return is maximized by avoiding failure for as long as possible.

Another Example
Get to the top of the hill as quickly as possible.
- reward = -1 for each step where not at top of hill
- return = -(number of steps before reaching top of hill)
Return is maximized by minimizing the number of steps to reach the top of the hill.

Example: Tic-Tac-Toe
Goal: Learn to play an optimal game against, e.g.:
- a particular opponent
- the optimal playing opponent
What is the state of the system? All possible board configurations.
What is an action? Put down a new X in an empty field.
What is the reward? Say, +10 for winning, -1 for every move that does not win.

Dynamic Programming for Tic-Tac-Toe
Key Idea: Use the value function to organize and structure the search for good policies.
Key Ingredients:
- a model of the state transitions (i.e., the opponent)
- an algorithm to compute the value function from a given policy
- an algorithm to compute the policy from a given value function
- the proof that an iteration between policy computation and value function computation converges to the optimal policy and value function!

Computing the Model
- Observe VERY many Tic-Tac-Toe games.
- Count the number of times $n_{x_{n+1} \mid x_n}$ that the opponent makes a particular move in a particular state, and the total number of times $n_{x_n}$ that the state is visited.
- Thus, we obtain a model of the opponent, indicating the opponent's state-transition probability:
  $P(x_{n+1} \mid x_n) = \frac{n_{x_{n+1} \mid x_n}}{n_{x_n}}$
- For the given discrete-state example, this results in a state-transition matrix.
- Note: we could also learn this model in an on-line fashion, simultaneously with the policy and value function.
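
The counting procedure above amounts to a relative-frequency estimate. The sketch below assumes, purely for illustration, that observed games are available as a flat list of `(x, x_next)` pairs with hashable state labels; the function name `estimate_transition_model` is hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transition_model(transitions):
    """Estimate P(x_next | x) from observed (x, x_next) pairs by counting."""
    counts = defaultdict(Counter)
    for x, x_next in transitions:
        counts[x][x_next] += 1
    model = {}
    for x, next_counts in counts.items():
        total = sum(next_counts.values())  # number of visits to state x
        model[x] = {x_next: n / total for x_next, n in next_counts.items()}
    return model


# Toy data: state labels "A", "B", "C" are placeholders, not real boards.
observed = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "A")]
model = estimate_transition_model(observed)
print(model["A"])
```

States that were never visited simply do not appear in the returned dictionary, so a caller would need a default (for example a uniform distribution) for them.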

Policy Evaluation
Goal: Compute the value function under a given policy $\pi$ and model (state-transition matrix).
Remember: The value function is the expected long-term reward from a given state:
$V^\pi(x) = E\{ r_{n+1} + \gamma r_{n+2} + \gamma^2 r_{n+3} + \cdots \mid x \}$, with $\gamma \in [0,1]$
$= E\{ r_{n+1} + \gamma V^\pi(x_{n+1}) \mid x \}$
$= \sum_a \pi(x,a) \sum_{x_{n+1}} P(x_{n+1} \mid x) ( r_{n+1} + \gamma V^\pi(x_{n+1}) )$
For any given policy (stochastic or deterministic), a repeated application of this update formula will lead to the correct value function under the given policy $\pi$:
$V_{n+1}(x) = \sum_a \pi(x,a) \sum_{x_{n+1}} P(x_{n+1} \mid x) ( r_{n+1} + \gamma V_n(x_{n+1}) )$
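
A minimal sketch of the repeated update follows. It assumes a generic finite MDP stored as `P[x][a] = [(prob, x_next, reward), ...]`, i.e., action-dependent transitions as in the earlier $P^a_{ss'}$ notation, and a policy stored as `policy[x][a]`; this data layout, the function name, and the stopping tolerance are illustrative choices rather than anything prescribed by the lecture.

```python
def policy_evaluation(P, policy, gamma=0.9, tol=1e-10, max_sweeps=10_000):
    """Iterate V_{n+1}(x) = sum_a pi(x,a) sum_x' P(x'|x,a) (r + gamma V_n(x'))."""
    V = {x: 0.0 for x in P}
    for _ in range(max_sweeps):
        V_new = {
            x: sum(
                policy[x][a] * prob * (reward + gamma * V[x_next])
                for a in P[x]
                for prob, x_next, reward in P[x][a]
            )
            for x in P
        }
        delta = max(abs(V_new[x] - V[x]) for x in P)
        V = V_new
        if delta < tol:  # stop once a full sweep barely changes V
            break
    return V


# Toy problem: one state that loops onto itself with reward 1.
P = {"loop": {"stay": [(1.0, "loop", 1.0)]}}
policy = {"loop": {"stay": 1.0}}
print(policy_evaluation(P, policy, gamma=0.5))  # approaches 1 / (1 - 0.5)
```

The sweep here is synchronous: all states are updated from the previous estimate `V` before it is replaced.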

Example: Grid World

Policy Improvement
Goal: Compute a better policy given the value function.
Bellman's Principle of Optimality: An optimal policy has to be locally optimal as well. Thus, the policy can be improved by local improvements:
$\pi_{n+1}(x) = \arg\max_a \sum_{x_{n+1}} P(x_{n+1} \mid x) ( r_{n+1}(a, x) + \gamma V_n(x_{n+1}) )$
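
The local improvement step can be sketched as a one-step lookahead that picks, in every state, the action with the best backed-up value. As an assumption for illustration, the model is again stored as `P[x][a] = [(prob, x_next, reward), ...]`, and `greedy_policy` is a hypothetical name.

```python
def greedy_policy(P, V, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to V."""
    policy = {}
    for x in P:
        def backup(a, x=x):
            # Expected one-step reward plus discounted value of the successor.
            return sum(
                prob * (reward + gamma * V[x_next])
                for prob, x_next, reward in P[x][a]
            )
        policy[x] = max(P[x], key=backup)
    return policy


# Toy problem: from "s", moving right leads to the more valuable state.
P = {
    "s": {"left": [(1.0, "bad", 0.0)], "right": [(1.0, "good", 0.0)]},
    "good": {"stay": [(1.0, "good", 0.0)]},
    "bad": {"stay": [(1.0, "bad", 0.0)]},
}
V = {"s": 0.0, "good": 1.0, "bad": -1.0}
print(greedy_policy(P, V))
```

Ties between actions are broken by dictionary order here, which is an arbitrary choice.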

Example: Grid World (cont'd)

Computing the Optimal Policy
Policy Iteration
- update value function (policy evaluation)
- update policy (policy improvement)
- iterate until convergence
- policy iteration usually converges fairly fast
Value Iteration
- policy evaluation is expensive since it takes several iterations to converge to the correct value function
- value iteration corresponds to a single policy evaluation step followed by a policy improvement step, which can actually omit the explicit policy update:
  $V_{n+1}(x) = \max_a \sum_{x_{n+1}} P(x_{n+1} \mid x) ( r_{n+1}(a, x) + \gamma V_n(x_{n+1}) )$
Asynchronous DP
- it is not necessary to update all the states simultaneously, as long as every state is updated sufficiently often
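
Value iteration can be sketched in a few lines. The example assumes an action-dependent model stored as `P[x][a] = [(prob, x_next, reward), ...]` and a small three-state chain invented for illustration; neither comes from the lecture.

```python
def value_iteration(P, gamma=0.9, tol=1e-10, max_sweeps=10_000):
    """Iterate V_{n+1}(x) = max_a sum_x' P(x'|x,a) (r + gamma V_n(x'))."""
    V = {x: 0.0 for x in P}
    for _ in range(max_sweeps):
        V_new = {
            x: max(
                sum(prob * (reward + gamma * V[x_next])
                    for prob, x_next, reward in P[x][a])
                for a in P[x]
            )
            for x in P
        }
        delta = max(abs(V_new[x] - V[x]) for x in P)
        V = V_new
        if delta < tol:
            break
    return V


# Hypothetical chain 0 -> 1 -> 2, where state 2 is absorbing with zero reward
# and every other move costs -1.
chain = {
    0: {"stay": [(1.0, 0, -1.0)], "right": [(1.0, 1, -1.0)]},
    1: {"left": [(1.0, 0, -1.0)], "right": [(1.0, 2, -1.0)]},
    2: {"stay": [(1.0, 2, 0.0)]},
}
print(value_iteration(chain, gamma=0.9))
```

A greedy policy can be read off the resulting values afterwards, which is why the explicit policy update can be skipped during the iteration.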

Monte Carlo Methods
Goal: Learn the value function only from experience, without knowledge of the model of the environment (on-line learning).
Advantage: real data is often easily obtained, while building models of the environment (e.g., density estimation) can be very hard.
Monte Carlo methods are episode-based: this assumes that a trial ends after a while (absorbing states, finite number of steps, and discounted reward).

Monte Carlo Methods: Example: Grid World
Monte Carlo Policy Evaluation:
- start at a random state
- follow the policy, keeping the entire trajectory and rewards in memory
- after the goal has been reached, update the values of all states on the trajectory, starting from the last state of the trajectory: each value becomes the discounted average of the rewards after this state
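
The procedure above can be sketched as every-visit Monte Carlo evaluation. The episode format, a list of `(state, reward)` pairs where the reward is the one received on leaving that state, and the function name are assumptions made for this example.

```python
from collections import defaultdict


def mc_policy_evaluation(episodes, gamma=0.9):
    """Average the discounted returns observed after each visited state."""
    returns_sum = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk the trajectory backwards, starting from its last state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            visits[state] += 1
    return {s: returns_sum[s] / visits[s] for s in returns_sum}


# Two toy episodes recorded under some fixed policy (made-up data).
episodes = [[("A", 0.0), ("B", 1.0)], [("B", 3.0)]]
print(mc_policy_evaluation(episodes, gamma=0.5))
```

This variant counts every visit to a state; a first-visit variant would only use the first occurrence of each state per episode.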

Temporal Difference Learning
- TD is a combination of Monte Carlo techniques and DP to compute a value function.
- TD allows a more natural (on-line) calculation of the value function, as it only needs to look at states that are neighbors in TIME, de-emphasizing the knowledge of the spatial layout of states.
- In its simplest form, TD (actually called TD(0)) updates the value function as:
  $V(x(t)) \leftarrow V(x(t)) + \alpha ( r(t+1) + \gamma V(x(t+1)) - V(x(t)) )$, with $\alpha \in [0,1]$
- In order to obtain data, we need to follow the current policy for a while until sufficient data have been experienced.
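
A single TD(0) update can be sketched as follows, assuming the value function is kept in a dictionary keyed by state; the function name and default parameters are illustrative.

```python
def td0_update(V, x, reward, x_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to V[x] in place and return the TD error."""
    td_error = reward + gamma * V[x_next] - V[x]
    V[x] += alpha * td_error
    return td_error


# One observed transition A -> B with reward 0.5 (made-up numbers).
V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", 0.5, "B", alpha=0.5, gamma=0.9)
print(V["A"])
```

Only the value of the state that was just left is changed, which is what makes the update usable on-line after every step.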

N-Step TD and TD(λ)
N-step TD is an update method between TD(0) and Monte Carlo methods (TD(1)): instead of taking just two temporally adjacent states into account for updating, several successive states are used.
TD(0) update:
$V(x(t)) \leftarrow V(x(t)) + \alpha ( r(t+1) + \gamma V(x(t+1)) - V(x(t)) )$, with $\alpha \in [0,1]$
n-step returns:
$R^1(t) = r(t+1) + \gamma V(x(t+1))$
$R^2(t) = r(t+1) + \gamma r(t+2) + \gamma^2 V(x(t+2))$
$R^n(t) = \sum_{i=1}^{n} \gamma^{i-1} r(t+i) + \gamma^n V(x(t+n))$
When averaging over the $R^n(t)$ values, one obtains the TD(λ) method:
$R^\lambda(t) = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R^n(t)$

TD(λ) Implemented With Eligibility Traces
Using the concept of decaying activation traces (eligibility traces), TD(λ) can be implemented in a simple on-line version:
- every state $x$ is assigned an eligibility trace $e(x)$
- at each step, the TD error is computed and the trace of the current state is incremented:
  $\delta = r(t+1) + \gamma V(x(t+1)) - V(x(t))$
  $e(x(t)) \leftarrow e(x(t)) + 1$
- then, for all states $x$:
  $V(x) \leftarrow V(x) + \alpha \delta e(x)$
  $e(x) \leftarrow \gamma \lambda e(x)$
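
The on-line update with accumulating traces can be sketched as one function call per observed transition. The dictionaries `V` and `e` (keyed by state), the function name, and the parameter defaults are assumptions for illustration.

```python
def td_lambda_step(V, e, x, reward, x_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One TD(lambda) step with accumulating eligibility traces (in place)."""
    delta = reward + gamma * V[x_next] - V[x]
    e[x] += 1.0  # mark the current state as eligible
    for s in V:
        V[s] += alpha * delta * e[s]  # credit all recently visited states
        e[s] *= gamma * lam           # decay every trace
    return delta


# Two transitions A -> B -> C with a reward only on the second one.
V = {"A": 0.0, "B": 0.0, "C": 0.0}
e = {"A": 0.0, "B": 0.0, "C": 0.0}
td_lambda_step(V, e, "A", 0.0, "B")
td_lambda_step(V, e, "B", 1.0, "C")
print(V)
```

After the second step, state A also receives part of the credit for the reward, scaled by its decayed trace, which is the effect the eligibility traces are meant to produce.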

Q-Learning, A Special Case of TD(0)
By building a value function (Q-function) that is a function of both the states AND the actions, Q-learning avoids the need for a model of the environment.
One-step Q-learning:
$Q(x(t), a(t)) \leftarrow Q(x(t), a(t)) + \alpha ( r(t+1) + \gamma \max_{a'} Q(x(t+1), a') - Q(x(t), a(t)) )$
Note: it is not necessary to follow the policy for Q-learning; it suffices to visit every state-action pair sufficiently often! The policy simply selects the action that has the maximal Q-value in a particular state.
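
The one-step Q-learning update can be sketched with a nested dictionary `Q[state][action]`. Treating a state with no actions as terminal (bootstrap value 0) is an assumption of this example, as are the function name and defaults.

```python
def q_learning_update(Q, x, a, reward, x_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to Q[x][a] in place and return the new value."""
    # A state with no listed actions is treated as terminal (assumption).
    best_next = max(Q[x_next].values()) if Q[x_next] else 0.0
    Q[x][a] += alpha * (reward + gamma * best_next - Q[x][a])
    return Q[x][a]


# Made-up Q-table with two ordinary states and one terminal state.
Q = {"s": {"l": 0.0, "r": 0.0}, "t": {"l": 1.0, "r": 2.0}, "end": {}}
q_learning_update(Q, "s", "r", 1.0, "t", alpha=0.5, gamma=0.5)
print(Q["s"])
```

The update uses the best action in the next state regardless of which action is actually taken there, so the data may come from any behavior that visits all state-action pairs.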

The Exploration/Exploitation Dilemma
Suppose you form action value estimates $Q_t(a) \approx Q^*(a)$.
The greedy action at $t$ is $a_t^* = \arg\max_a Q_t(a)$.
- $a_t = a_t^*$: exploitation
- $a_t \ne a_t^*$: exploration
You can't exploit all the time; you can't explore all the time.
You can never stop exploring; but you should always reduce exploring. Maybe.

ε-greedy Action Selection
Greedy action selection: $a_t = a_t^* = \arg\max_a Q_t(a)$
ε-greedy:
- $a_t = a_t^*$ with probability $1 - \varepsilon$
- $a_t$ = random action with probability $\varepsilon$
... the simplest way to balance exploration and exploitation
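
A minimal sketch of ε-greedy selection over a list of action-value estimates follows. The function name and the optional `rng` argument (any object offering `random()` and `randrange()`, such as the standard `random` module or a `random.Random` instance) are assumptions of this example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest estimate (first one wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)  # seeded only to make the demo repeatable
choices = [epsilon_greedy([0.1, 0.9, 0.3], 0.1, rng) for _ in range(1000)]
print(choices.count(1) / len(choices))
```

Note that the random branch may also pick the greedy action, so the greedy action is chosen slightly more often than $1 - \varepsilon$ of the time.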

10-Armed Testbed
- $n = 10$ possible actions
- Each $Q^*(a)$ is chosen randomly from a normal distribution
- Each reward $r_t$ is also normal
- 1000 plays, using sample-average estimates
- Repeat the whole thing 2000 times and average the results

ε-greedy Methods on the 10-Armed Testbed

Softmax Action Selection
Softmax action selection methods grade action probabilities by estimated values.
The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action $a$ on play $t$ with probability
$\frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$,
where $\tau$ is the computational temperature.
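
The Gibbs/Boltzmann probabilities can be computed as below. Subtracting the maximum estimate before exponentiating is a standard numerical-stability step that does not change the probabilities; it and the function name are additions for this sketch.

```python
import math


def softmax_probabilities(q_values, tau):
    """Return Boltzmann action probabilities exp(Q/tau) / sum_b exp(Q_b/tau)."""
    if tau <= 0:
        raise ValueError("temperature tau must be positive")
    q_max = max(q_values)
    # Shifting by q_max avoids overflow and leaves the ratios unchanged.
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


print(softmax_probabilities([1.0, 2.0], tau=1.0))
print(softmax_probabilities([1.0, 2.0], tau=100.0))  # high tau: nearly uniform
```

High temperatures make all actions nearly equally likely, while low temperatures approach greedy selection.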

Incremental Implementation
Recall the sample average estimation method: the average of the first $k$ rewards is (dropping the dependence on $a$):
$Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$
Can we do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:
$Q_{k+1} = Q_k + \frac{1}{k+1} [ r_{k+1} - Q_k ]$
This is a common form for update rules:
NewEstimate = OldEstimate + StepSize [Target - OldEstimate]
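
The incremental rule can be checked against the plain average with a short sketch; the function name is illustrative.

```python
def incremental_mean(rewards):
    """Compute the sample average one reward at a time, storing only Q and k."""
    q, k = 0.0, 0
    for r in rewards:
        k += 1
        q += (r - q) / k  # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
    return q


rewards = [1.0, 2.0, 3.0, 4.0]
print(incremental_mean(rewards), sum(rewards) / len(rewards))
```

The step size $1/k$ shrinks as more rewards arrive, so later rewards move the estimate less and less.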

Tracking a Nonstationary Problem
Choosing $Q_k$ to be a sample average is appropriate in a stationary problem, i.e., when none of the $Q^*(a)$ change over time, but not in a nonstationary problem.
Better in the nonstationary case is:
$Q_{k+1} = Q_k + \alpha [ r_{k+1} - Q_k ]$ for constant $\alpha$, $0 < \alpha \le 1$,
which unrolls to
$Q_k = (1-\alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} r_i$,
an exponential, recency-weighted average.
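
The equivalence between the recursive constant-step-size update and the exponential, recency-weighted average can be verified numerically. Both function names below are hypothetical and the reward sequence is made up.

```python
def constant_step_estimate(rewards, alpha, q0=0.0):
    """Apply Q <- Q + alpha * (r - Q) for each reward in order."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q


def closed_form_estimate(rewards, alpha, q0=0.0):
    """Evaluate (1-alpha)^k Q_0 + sum_i alpha (1-alpha)^(k-i) r_i directly."""
    k = len(rewards)
    return (1 - alpha) ** k * q0 + sum(
        alpha * (1 - alpha) ** (k - i) * r
        for i, r in enumerate(rewards, start=1)
    )


rewards = [1.0, 0.0, 2.0, 5.0]
print(constant_step_estimate(rewards, 0.3, q0=4.0))
print(closed_form_estimate(rewards, 0.3, q0=4.0))
```

The most recent reward carries weight $\alpha$, and each older reward is discounted by a further factor of $(1-\alpha)$.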

Optimistic Initial Values
All methods so far depend on $Q_0(a)$, i.e., they are biased.
Suppose instead we initialize the action values optimistically, i.e., on the 10-armed testbed, use $Q_0(a) = 5$ for all $a$.