Introduction to Reinforcement Learning Part 6: Core Theory II: Bellman Equations and Dynamic Programming
Bellman Equations: recursive relationships among values that can be used to compute values.
The tree of transition dynamics: a backup diagram of states, actions, and possible paths (trajectories).
The web of transition dynamics: the same states, actions, and possible paths, drawn as a web rather than a tree.
Four Bellman-equation backup diagrams representing recursive relationships among values: state values and action values, for prediction and for control (the control diagrams take a max over actions).
Bellman Equation for a Policy π

The basic idea:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots = R_{t+1} + \gamma\bigl(R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots\bigr) = R_{t+1} + \gamma G_{t+1}$$

So:

$$v_\pi(s) = \mathbb{E}_\pi[\,G_t \mid S_t = s\,] = \mathbb{E}_\pi[\,R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\,]$$

Or, without the expectation operator:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr]$$
More on the Bellman Equation

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr]$$

This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution. Backup diagrams: one for v_π and one for q_π.
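Because the Bellman equations for a fixed policy are linear, v_π can be computed exactly by solving the linear system $(I - \gamma P_\pi)v_\pi = r_\pi$. Below is a minimal sketch of that computation; the 3-state transition matrix `P_pi` and expected-reward vector `r_pi` are hypothetical placeholders, not quantities from the slides.

```python
# A minimal sketch: solving the Bellman equations for v_pi as a linear system.
# P_pi and r_pi below are made-up values for a 3-state MDP; substitute your own.
import numpy as np

gamma = 0.9

# P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a)   (policy-averaged transition probs)
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])

# r_pi[s] = sum_a pi(a|s) * sum_{s', r} p(s', r | s, a) * r   (expected reward)
r_pi = np.array([1.0, 0.0, -1.0])

# One linear equation per state; v_pi is the unique solution.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v_pi)
```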
Gridworld

Actions: north, south, east, west; deterministic.
If an action would take the agent off the grid: no move, but reward = −1.
Other actions produce reward = 0, except actions that move the agent out of the special states A and B, as shown.
State-value function for the equiprobable random policy; γ = 0.9.
Bellman Optimality Equation for q*

$$q_*(s, a) = \mathbb{E}\Bigl[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a\Bigr] = \sum_{s',r} p(s', r \mid s, a)\Bigl[r + \gamma \max_{a'} q_*(s', a')\Bigr]$$

The relevant backup diagram: from (s, a), branch on (r, s'), then take the max over a'.
q* is the unique solution of this system of nonlinear equations.
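One way to make this backup concrete is to apply it repeatedly as an update (Q-value iteration). The sketch below assumes a tabular model `p[s][a]` given as a list of `(prob, next_state, reward)` triples, with terminal states modelled as absorbing with zero reward; these names are assumptions for illustration, not notation from the slides.

```python
# A minimal sketch of repeatedly applying the Bellman optimality backup for q_*.
# Assumed model: p[s][a] is a list of (prob, next_state, reward) triples;
# terminal states are absorbing with zero reward.
import numpy as np

def q_value_iteration(p, n_states, n_actions, gamma=0.9, theta=1e-8):
    Q = np.zeros((n_states, n_actions))
    while True:
        delta = 0.0
        for s in range(n_states):
            for a in range(n_actions):
                q_old = Q[s, a]
                # q(s,a) <- sum_{s',r} p(s',r|s,a) [ r + gamma * max_a' q(s',a') ]
                Q[s, a] = sum(prob * (r + gamma * Q[s2].max())
                              for prob, s2, r in p[s][a])
                delta = max(delta, abs(q_old - Q[s, a]))
        if delta < theta:
            return Q
```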
Dynamic Programming: using Bellman equations to compute values and optimal policies (thus a form of planning).
Iterative Methods

$$v_0 \to v_1 \to \cdots \to v_k \to v_{k+1} \to \cdots \to v_\pi$$

A sweep consists of applying a backup operation to each state.

A full policy-evaluation backup:

$$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\bigl[r + \gamma v_k(s')\bigr] \qquad \forall s \in \mathcal{S}$$
A Small Gridworld

γ = 1: an undiscounted episodic task.
Nonterminal states: 1, 2, ..., 14; one terminal state (shown twice, as the shaded squares).
Actions that would take the agent off the grid leave the state unchanged.
Reward is −1 on all transitions until the terminal state is reached.
Iterative Policy Evaluation for the Small Gridworld

π = equiprobable random action choices (each of the four actions with probability 1/4), applied to the undiscounted (γ = 1) episodic gridworld above.
Iterative Policy Evaluation (one-array version)

Input π, the policy to be evaluated
Initialize an array V(s) = 0, for all s ∈ S⁺
Repeat:
    Δ ← 0
    For each s ∈ S:
        v ← V(s)
        V(s) ← Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ V(s')]
        Δ ← max(Δ, |v − V(s)|)
until Δ < θ (a small positive number)
Output V ≈ v_π
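A minimal Python sketch of the one-array (in-place) algorithm above. The tabular inputs `pi[s][a]` (action probabilities) and `p[s][a]` (a list of `(prob, next_state, reward)` triples) are assumed representations chosen for illustration.

```python
# A minimal sketch of in-place iterative policy evaluation.
# Assumed inputs: pi[s][a] = probability of action a in state s;
# p[s][a] = list of (prob, next_state, reward) triples.
import numpy as np

def policy_evaluation(pi, p, n_states, gamma=1.0, theta=1e-6):
    V = np.zeros(n_states)                     # V(s) = 0 for all s in S+
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # Full backup: V(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a)[r + gamma V(s')]
            V[s] = sum(pi[s][a] * sum(prob * (r + gamma * V[s2])
                                      for prob, s2, r in p[s][a])
                       for a in range(len(p[s])))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V                           # V is approximately v_pi
```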
Gambler's Problem

The gambler can repeatedly bet money on a coin flip: heads he wins his stake, tails he loses it.
Initial capital ∈ {$1, $2, ..., $99}.
The gambler wins if his capital reaches $100 and loses if it reaches $0.
The coin is unfair: heads (the gambler wins) with probability p = 0.4.
States, actions, rewards?
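The problem as stated is small enough to solve with value iteration. Below is a minimal sketch under the slide's assumptions: states are the capital 1..99, an action is the stake, the only nonzero reward is +1 on reaching $100, and the task is undiscounted; the function and parameter names are made up for illustration.

```python
# A minimal value-iteration sketch for the gambler's problem:
# capital 1..99, stake wins with probability p_h = 0.4, reward +1 only on
# reaching the goal of $100, undiscounted.
import numpy as np

def gambler_value_iteration(p_h=0.4, goal=100, theta=1e-10):
    V = np.zeros(goal + 1)                         # V[0] = V[goal] = 0 (terminal)
    while True:
        delta = 0.0
        for s in range(1, goal):
            stakes = range(1, min(s, goal - s) + 1)
            # Expected return of each stake: win with prob p_h (reward 1 if the
            # goal is reached), lose the stake otherwise.
            returns = [p_h * ((s + a == goal) + V[s + a]) + (1 - p_h) * V[s - a]
                       for a in stakes]
            best = max(returns)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V
```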
Gambler's Problem Solution (figures).
Generalized Policy Iteration

Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity. Evaluation drives v toward v_π; improvement drives π toward greedy(v).

A geometric metaphor for convergence of GPI: starting from v_0 and π_0, the two processes pull the value function and the policy toward each other until v = v_π and π = greedy(v), at which point they are v_* and π_*.
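At the coarsest granularity, GPI is classical policy iteration: full evaluation followed by greedy improvement, repeated until the policy is stable. The sketch below reuses the assumed tabular model `p[s][a]` and the `policy_evaluation` sketch from earlier; it is an illustration, not the slides' own code.

```python
# A minimal policy-iteration sketch (coarsest-granularity GPI).
# Assumes p[s][a] = list of (prob, next_state, reward) and the earlier
# policy_evaluation function.
import numpy as np

def policy_iteration(p, n_states, n_actions, gamma=0.9):
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start equiprobable
    while True:
        V = policy_evaluation(pi, p, n_states, gamma)       # evaluation: v -> v_pi
        stable = True
        for s in range(n_states):
            q = [sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[s][a])
                 for a in range(n_actions)]
            greedy = int(np.argmax(q))                       # improvement: pi -> greedy(v)
            if int(pi[s].argmax()) != greedy:
                stable = False
            pi[s] = np.eye(n_actions)[greedy]
        if stable:
            return pi, V
```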
Jack's Car Rental

$10 for each car rented (the car must be available when the request is received).
Two locations, maximum of 20 cars at each.
Cars are returned and requested randomly, Poisson distributed: n returns/requests with probability $\frac{\lambda^n}{n!} e^{-\lambda}$.
1st location: average requests = 3, average returns = 3.
2nd location: average requests = 4, average returns = 2.
Up to 5 cars can be moved between the locations overnight.
States, actions, rewards? Transition probabilities?
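The transition probabilities and expected rewards are built from these Poisson quantities. A minimal sketch of the pieces is below; the truncation of the Poisson sum at 20 requests and the helper names are assumptions for illustration, while the $10-per-rental figure comes from the slide.

```python
# Poisson probability and the expected one-day rental revenue at a location
# holding `cars` cars (requests beyond the cars on hand are lost).
from math import exp, factorial

def poisson(n, lam):
    # prob(n) = lambda^n * e^{-lambda} / n!
    return lam ** n * exp(-lam) / factorial(n)

def expected_rental_revenue(cars, request_lambda, rent_reward=10.0, max_n=20):
    return sum(poisson(n, request_lambda) * rent_reward * min(n, cars)
               for n in range(max_n + 1))

# e.g. first location (average requests = 3) holding 5 cars:
print(expected_rental_revenue(5, 3))
```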
Jack's Car Rental (figure: the sequence of policies π_0, π_1, π_2, π_3, π_4 found by policy iteration, plotted over the number of cars at the first and second locations, together with the final value function v_4).
Solving MDPs with Dynamic Programming

Finding an optimal policy by solving the Bellman Optimality Equation requires:
- accurate knowledge of the environment dynamics;
- enough space and time to do the computation;
- the Markov Property.

How much space and time do we need? Polynomial in the number of states (via dynamic programming methods; Chapter 4), BUT the number of states is often huge (e.g., backgammon has about 10^20 states).

We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman Optimality Equation.
Efficiency of DP

Finding an optimal policy is polynomial in the number of states, BUT the number of states is often astronomical, e.g., it often grows exponentially with the number of state variables.
We need to use approximation, but unfortunately classical DP is not sound with approximation (more on this later).
In practice, classical DP can be applied to problems with a few million states.
It is surprisingly easy to come up with MDPs for which DP methods are not practical.
The biggest limitation of DP is that it requires a probability model (as opposed to a generative or simulation model).
Unified View (figure): a space of methods organized by the width of the backup and the height (depth) of the backup, with temporal-difference learning, dynamic programming, Monte Carlo, and exhaustive search at the corners, and eligibility traces, Dyna, and MCTS in between.