CS 188: Artificial Intelligence Reinforcement Learning Instructor: Fabrice Popineau [These slides adapted from Stuart Russell, Dan Klein and Pieter Abbeel @ai.berkeley.edu]
Reinforcement Learning
Double Bandits
Double-Bandit MDP
- Actions: Blue, Red
- States: Win (W), Lose (L)
- Blue pays $1 with probability 1.0
- Red pays $2 with probability 0.75 and $0 with probability 0.25
- No discount
- 10 time steps
- Both states have the same value
Offline Planning
- Solving MDPs is offline planning
  - You determine all quantities through computation
  - You need to know the details of the MDP
  - You do not actually play the game!
- Setting: no discount, 10 time steps, both states have the same value
- Value of each policy: Play Red = 15, Play Blue = 10
- [Diagram: the double-bandit MDP. Blue pays $1 with probability 1.0; Red pays $2 with probability 0.75 and $0 with probability 0.25.]
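The two policy values on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the payout tables are read off the slide, while the names `BLUE`, `RED` and `expected_total` are made up for this example.

```python
# Payouts read off the slide as (reward, probability) pairs.
BLUE = [(1.0, 1.0)]
RED = [(2.0, 0.75), (0.0, 0.25)]


def expected_total(payouts, steps=10):
    """Expected undiscounted winnings from pulling the same arm `steps` times."""
    per_step = sum(reward * prob for reward, prob in payouts)
    return steps * per_step


red_value = expected_total(RED)    # always play Red
blue_value = expected_total(BLUE)  # always play Blue
print(red_value, blue_value)
```

No game is played here: the values come purely from computation on a known model, which is what makes this offline planning.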
Let's Play! $2 $2 $0 $2 $2 $2 $2 $0 $0 $0
Online Planning
- Rules changed! Red's win chance is different.
- [Diagram: Blue still pays $1 with probability 1.0; Red pays $2 or $0 with unknown probabilities (??).]
Let's Play! $0 $0 $0 $2 $0 $2 $0 $0 $0 $0
What Just Happened?
- That wasn't planning, it was learning!
  - Specifically, reinforcement learning
  - There was an MDP, but you couldn't solve it with just computation
  - You needed to actually act to figure it out
- Important ideas in reinforcement learning that came up
  - Exploration: you have to try unknown actions to get information
  - Exploitation: eventually, you have to use what you know
  - Regret: even if you learn intelligently, you make mistakes
  - Sampling: because of chance, you have to try things repeatedly
  - Difficulty: learning can be much harder than solving a known MDP
Reinforcement Learning
- Still assume a Markov decision process (MDP):
  - A set of states \( s \in S \)
  - A set of actions (per state) \( A \)
  - A model \( T(s,a,s') \)
  - A reward function \( R(s,a,s') \)
- Still looking for a policy \( \pi(s) \)
- New twist: don't know T or R
  - I.e., we don't know which states are good or what the actions do
  - Must actually try actions and states out to learn
Reinforcement Learning
- [Diagram: the agent sends actions a to the environment; the environment returns the state s and the reward r.]
- Basic idea: learn how to maximize expected rewards based on observed samples of transitions
Example: Learning to Walk Initial After Learning [1K Trials] [Kohl and Stone, ICRA 2004]
Example: Learning to Walk [Kohl and Stone, ICRA 2004] Initial [Video: AIBO WALK initial]
Example: Learning to Walk [Kohl and Stone, ICRA 2004] Finished [Video: AIBO WALK finished]
Example: Sidewinding [Andrew Ng] [Video: SNAKE climbstep+sidewinding]
Example: Toddler Robot [Tedrake, Zhang and Seung, 2005] [Video: TODDLER 40s]
The Crawler! [Demo: Crawler Bot (L10D1)] [You, in Project 3]
Video of Demo Crawler Bot
DeepMind Atari (Two Minute Lectures)
Reinforcement Learning
- Still assume a Markov decision process (MDP):
  - A set of states \( s \in S \)
  - A set of actions (per state) \( A \)
  - A model \( P(s' \mid a, s) \)
  - A reward function \( R(s,a,s') \)
- Still looking for a policy \( \pi(s) \)
- New twist: don't know P or R
  - I.e., we don't know which states are good or what the actions do
  - Must actually try actions and explore new states, to boldly go where no Pacman agent has been before
Offline (MDPs) vs. Online (RL) Offline Solution Online Learning
Model-Based Learning
Model-Based Learning
- Model-based idea:
  - Learn an approximate model based on experiences
  - Solve for values as if the learned model were correct
- Step 1: Learn empirical MDP model
  - Count outcomes s' for each s, a
  - Normalize to give an estimate of \( P(s' \mid s, a) \)
  - Discover each \( R(s,a,s') \) when we experience the transition
- Step 2: Solve the learned MDP
  - For example, use value or policy iteration, as before
Example: Model-Based Learning
Input Policy (states A, B, C, D, E). Assume: γ = 1

Observed Episodes (Training)
  Episode 1: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 2: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 3: E, north, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 4: E, north, C, -1;  C, east, A, -1;  A, exit, x, -10

Learned Model
  Transitions T(s,a,s'), i.e. P(s' | s,a):
    P(C | B, east) = 1.00
    P(D | C, east) = 0.75
    P(A | C, east) = 0.25
  Rewards R(s,a,s'):
    R(B, east, C) = -1
    R(C, east, D) = -1
    R(D, exit, x) = +10
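Step 1 of model-based learning (count, then normalize) can be sketched on the four episodes of this example. This is a minimal illustration, not a full learner: the `EPISODES` list transcribes the slide, and `learn_model` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# The four training episodes from the slide: (s, a, s', r) transitions.
EPISODES = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]


def learn_model(episodes):
    """Estimate P(s' | s, a) by counting and record R(s, a, s') as observed."""
    counts = defaultdict(Counter)  # (s, a) -> Counter of observed s'
    rewards = {}                   # (s, a, s') -> observed reward
    for episode in episodes:
        for s, a, s_next, r in episode:
            counts[(s, a)][s_next] += 1
            rewards[(s, a, s_next)] = r
    probs = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s_next, n in outcomes.items():
            probs[(s, a, s_next)] = n / total  # normalize the counts
    return probs, rewards


probs, rewards = learn_model(EPISODES)
print(probs[("C", "east", "D")], probs[("C", "east", "A")])
```

Step 2 would then hand `probs` and `rewards` to value or policy iteration as if they were the true MDP.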
Pros and cons
- Pro: Makes efficient use of experiences
- Con: May not scale to large state spaces
  - Learns model one state-action pair at a time (but this is fixable)
  - Cannot solve MDP for very large |S|
Model-Free Learning
Basic idea of model-free methods
- To approximate expectations with respect to a distribution, you can either
  - Estimate the distribution from samples, then compute an expectation
  - Or, estimate the expectation from samples directly
Example: Expected Age
- Goal: Compute expected age of cs188 students
- Known P(A):
  \( E[A] = \sum_a P(a) \cdot a = 0.35 \times 20 + \dots \)
- Without P(A), instead collect samples \( [a_1, a_2, \dots, a_N] \)
- Unknown P(A): Model Based
  \( \hat{P}(a) = N_a / N \)
  \( E[A] \approx \sum_a \hat{P}(a) \cdot a \)
  Why does this work? Because eventually you learn the right model.
- Unknown P(A): Model Free
  \( E[A] \approx \frac{1}{N} \sum_i a_i \)
  Why does this work? Because samples appear with the right frequencies.
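The two estimators can be compared side by side. In the sketch below the sample ages are invented purely for illustration, and `model_based_mean` and `model_free_mean` are hypothetical names.

```python
from collections import Counter

# Hypothetical sampled ages; these numbers are NOT from the slides.
samples = [20, 20, 21, 22, 20, 19, 21, 20]


def model_based_mean(samples):
    """Estimate P(a) from counts first, then take the expectation under it."""
    n = len(samples)
    p_hat = {age: count / n for age, count in Counter(samples).items()}
    return sum(p * age for age, p in p_hat.items())


def model_free_mean(samples):
    """Average the samples directly, never building a distribution."""
    return sum(samples) / len(samples)


print(model_based_mean(samples), model_free_mean(samples))
```

On the same samples both routes give the same number (up to rounding); the difference is whether an explicit model is built along the way.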
Passive Reinforcement Learning
Passive Reinforcement Learning
- Simplified task: policy evaluation
  - Input: a fixed policy \( \pi(s) \)
  - You don't know the transitions \( P(s' \mid s,a) \)
  - You don't know the rewards \( R(s,a,s') \)
  - Goal: learn the state values \( V^\pi(s) \)
- In this case:
  - Learner is along for the ride
  - No choice about what actions to take: just do it
  - This is NOT offline planning! Agent takes actions in the world.
Direct utility estimation
- Goal: Estimate \( V^\pi(s) \), i.e., expected total discounted reward from s onwards
- Idea:
  - Use the actual sum of discounted rewards from s
  - Average over multiple trials and visits to s
- This is called direct utility estimation
Example: Direct Evaluation
Input Policy (states A, B, C, D, E). Assume: γ = 1

Observed Episodes (Training)
  Episode 1: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 2: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 3: E, north, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 4: E, north, C, -1;  C, east, A, -1;  A, exit, x, -10

Output Values
  A = -10
  B = +8
  C = +4
  D = +10
  E = -2
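The output values of this example can be reproduced by averaging the observed reward-to-go for each state. A minimal sketch, assuming every visit to a state counts as one sample; `direct_evaluation` is a hypothetical name and `EPISODES` transcribes the slide.

```python
from collections import defaultdict

# The four training episodes from the slide: (s, a, s', r) transitions.
EPISODES = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]


def direct_evaluation(episodes, gamma=1.0):
    """Average the observed discounted return from every visit to each state."""
    returns = defaultdict(list)
    for episode in episodes:
        reward_to_go = 0.0
        # Walk backwards so reward_to_go accumulates the rewards that follow.
        for s, _a, _s_next, r in reversed(episode):
            reward_to_go = r + gamma * reward_to_go
            returns[s].append(reward_to_go)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


values = direct_evaluation(EPISODES)
print(values)
```

For instance, C is visited four times with returns 9, 9, 9 and -11, which average to +4.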
Problems with Direct Evaluation
- What's good about direct evaluation?
  - It's easy to understand
  - It doesn't require any knowledge of \( P(s' \mid s,a) \) or \( R(s,a,s') \)
  - It converges to the right answer in the limit
- What's bad about it?
  - Each state must be learned separately (fixable)
  - It ignores information about state connections
  - So, it takes a long time to learn
- Output Values: A = -10, B = +8, C = +4, D = +10, E = -2
- If B and E both go to C under this policy, how can their values be different?
Why Not Use Policy Evaluation?
- Simplified Bellman updates calculate V for a fixed policy:
  - Each round, replace V with a one-step look-ahead layer over V
  - This approach fully exploits the connections between the states
  - Unfortunately, we need T and R to do it!
- [Diagram: one-step look-ahead tree from s through (s, π(s)) to the successor states s'.]
- Key question: how can we do this update to V without knowing T and R?
  - In other words, how do we take a weighted average without knowing the weights?
Sample-Based Policy Evaluation?
- We want to improve our estimate of V by computing the averages over next states s' that appear in the Bellman update
- Idea: take samples of outcomes s' (by doing the action!) and average
- [Diagram: from state s, action π(s) leads to the sampled outcomes s'_1, s'_2, s'_3.]
- Almost! But we can't rewind time to get sample after sample from state s.
Temporal difference (TD) learning
TD as approximate Bellman update
- Policy evaluation (version 1) improves the estimate of V by computing the Bellman update, i.e., an expectation over next-state values:
  \( V^\pi_{k+1}(s) \leftarrow \sum_{s'} P(s' \mid \pi(s), s) \, [R(s, \pi(s), s') + \gamma V^\pi_k(s')] \)
- Idea 1: Use actual samples to estimate the expectation:
  \( \text{sample}_1 = R(s, \pi(s), s'_1) + \gamma V^\pi_k(s'_1) \)
  \( \text{sample}_2 = R(s, \pi(s), s'_2) + \gamma V^\pi_k(s'_2) \)
  \( \dots \)
  \( \text{sample}_N = R(s, \pi(s), s'_N) + \gamma V^\pi_k(s'_N) \)
  \( V^\pi_{k+1}(s) \leftarrow \frac{1}{N} \sum_i \text{sample}_i \)
TD as approximate Bellman update
- Idea 2: Update the value of s after each transition (s, a, s', r):
  - Update \( V^\pi([3,1]) \) based on \( R([3,1], \text{up}, [3,2]) \) and \( \gamma V^\pi([3,2]) \)
  - Update \( V^\pi([3,2]) \) based on \( R([3,2], \text{up}, [3,3]) \) and \( \gamma V^\pi([3,3]) \)
  - Update \( V^\pi([3,3]) \) based on \( R([3,3], \text{right}, [4,3]) \) and \( \gamma V^\pi([4,3]) \)
TD as approximate Bellman update Idea 3: Update values by maintaining a running average
Running averages
- How do you compute the average of 1, 4, 7?
- Method 1: add them up and divide by N
  - 1 + 4 + 7 = 12
  - average = 12/N = 12/3 = 4
- Method 2: keep a running average \( \mu_n \) and a running count n
  - n = 0: \( \mu_0 = 0 \)
  - n = 1: \( \mu_1 = (0 \cdot \mu_0 + x_1)/1 = (0 \cdot 0 + 1)/1 = 1 \)
  - n = 2: \( \mu_2 = (1 \cdot \mu_1 + x_2)/2 = (1 \cdot 1 + 4)/2 = 2.5 \)
  - n = 3: \( \mu_3 = (2 \cdot \mu_2 + x_3)/3 = (2 \cdot 2.5 + 7)/3 = 4 \)
- General formula: \( \mu_n = ((n-1)\,\mu_{n-1} + x_n)/n = [(n-1)/n]\,\mu_{n-1} + [1/n]\,x_n \) (weighted average of old mean, new sample)
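Running averages contd.
- What if we use a weighted average with a fixed weight?
  \( \mu_n = (1-\alpha)\,\mu_{n-1} + \alpha\,x_n \)
  - n = 1: \( \mu_1 = x_1 \)
  - n = 2: \( \mu_2 = (1-\alpha)\,\mu_1 + \alpha x_2 = (1-\alpha)\,x_1 + \alpha x_2 \)
  - n = 3: \( \mu_3 = (1-\alpha)\,\mu_2 + \alpha x_3 = (1-\alpha)^2 x_1 + \alpha(1-\alpha)\,x_2 + \alpha x_3 \)
  - n = 4: \( \mu_4 = (1-\alpha)\,\mu_3 + \alpha x_4 = (1-\alpha)^3 x_1 + \alpha(1-\alpha)^2 x_2 + \alpha(1-\alpha)\,x_3 + \alpha x_4 \)
- I.e., exponential forgetting of old values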
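Both kinds of running average fit in a few lines. The sketch below is illustrative only; `running_mean` and `exponential_average` are hypothetical names, and the second function follows the convention above that the first sample initializes the average.

```python
def running_mean(xs):
    """Exact mean via mu_n = ((n - 1) * mu_{n-1} + x_n) / n."""
    mu = 0.0
    for n, x in enumerate(xs, start=1):
        mu = ((n - 1) * mu + x) / n
    return mu


def exponential_average(xs, alpha):
    """Fixed-weight average mu_n = (1 - alpha) * mu_{n-1} + alpha * x_n."""
    it = iter(xs)
    mu = next(it)  # mu_1 = x_1
    for x in it:
        mu = (1 - alpha) * mu + alpha * x
    return mu


print(running_mean([1, 4, 7]))              # 4.0, as in the worked example
print(exponential_average([1, 4, 7], 0.5))  # 4.75: recent samples weigh more
```

With a fixed weight the result is no longer the true mean: the latest sample always gets weight α, and older samples decay geometrically.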
TD as approximate Bellman update
- Idea 3: Update values by maintaining a running average
  \( \text{sample} = R(s, \pi(s), s') + \gamma V^\pi(s') \)
  \( V^\pi(s) \leftarrow (1-\alpha)\,V^\pi(s) + \alpha \cdot \text{sample} \)
  \( V^\pi(s) \leftarrow V^\pi(s) + \alpha\,[\text{sample} - V^\pi(s)] \)
- This is the temporal difference learning rule
- \( [\text{sample} - V^\pi(s)] \) is the TD error
- I.e., observe a sample, move \( V^\pi(s) \) a little bit to make it more consistent with its neighbor \( V^\pi(s') \)
Example: Temporal Difference Learning
Assume: γ = 1, α = 1/2
Observed Transitions: (B, east, C, -2), then (C, east, D, -2)

State values:
  State   Initial   After B, east, C, -2   After C, east, D, -2
  A       0         0                      0
  B       0         -1                     -1
  C       0         0                      3
  D       8         8                      8
  E       0         0                      0
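The two updates in this example can be replayed with a direct transcription of the TD rule. A minimal sketch: `td_update` is a hypothetical helper, and the starting values, rewards, γ = 1 and α = 1/2 are those of the example.

```python
def td_update(values, s, r, s_next, alpha=0.5, gamma=1.0):
    """Move V(s) toward the sample r + gamma * V(s') by a fraction alpha."""
    sample = r + gamma * values[s_next]
    values[s] += alpha * (sample - values[s])  # alpha times the TD error
    return values


# Initial state values from the example.
values = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}

td_update(values, "B", -2, "C")  # V(B): 0 -> -1
td_update(values, "C", -2, "D")  # V(C): 0 -> 3
print(values)
```

Only the state that was just left is updated, and the transition probabilities never appear anywhere in the rule.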
Problems with TD Value Learning
- TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
- However, if we want to turn values into a (new) policy, we're sunk: picking the best action from V requires a one-step look-ahead through the transition model and rewards, which we don't have
- Idea: learn Q-values, not values
- Makes action selection model-free too!
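Detour: Q-Value Iteration
- Value iteration: find successive (depth-limited) values
  - Start with \( V_0(s) = 0 \), which we know is right
  - Given \( V_k \), calculate the depth k+1 values for all states:
    \( V_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s' \mid a, s) \, [R(s,a,s') + \gamma V_k(s')] \)
- But Q-values are more useful, so compute them instead
  - Start with \( Q_0(s,a) = 0 \), which we know is right
  - Given \( Q_k \), calculate the depth k+1 q-values for all q-states:
    \( Q_{k+1}(s,a) \leftarrow \sum_{s'} P(s' \mid a, s) \, [R(s,a,s') + \gamma \max_{a'} Q_k(s',a')] \)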
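Q-value iteration can be tried on the double-bandit MDP from the start of the lecture, where the model is known. The sketch below is an illustration under one stated assumption: Blue is taken to leave the state unchanged, which the slides do not spell out (they only say both states have the same value). `MODEL` and `q_value_iteration` are hypothetical names.

```python
STATES = ["W", "L"]
ACTIONS = ["blue", "red"]

# MODEL[(s, a)] lists (probability, next_state, reward) outcomes.
# Assumption: Blue keeps the current state; Red wins with probability 0.75.
MODEL = {(s, "blue"): [(1.0, s, 1.0)] for s in STATES}
MODEL.update({(s, "red"): [(0.75, "W", 2.0), (0.25, "L", 0.0)] for s in STATES})


def q_value_iteration(model, states, actions, gamma=1.0, depth=10):
    """Apply the Q-value iteration update `depth` times, starting from Q_0 = 0."""
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(depth):
        # The right-hand side reads the old table Q_k; the new table Q_{k+1}
        # is bound to `q` only after the whole comprehension is evaluated.
        q = {
            (s, a): sum(
                p * (r + gamma * max(q[(s2, a2)] for a2 in actions))
                for p, s2, r in model[(s, a)]
            )
            for s in states
            for a in actions
        }
    return q


q10 = q_value_iteration(MODEL, STATES, ACTIONS)
print(q10[("W", "red")], q10[("W", "blue")])
```

With no discount and 10 steps, playing Red throughout is worth 15, matching the offline-planning slide; the Blue entry is the value of one Blue pull followed by nine optimal (Red) pulls.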
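Q-Learning
- Q-Learning: sample-based Q-value iteration
- Learn Q(s,a) values as you go
  - Receive a sample (s, a, s', r)
  - Consider your old estimate: \( Q(s,a) \)
  - Consider your new sample estimate: \( \text{sample} = R(s,a,s') + \gamma \max_{a'} Q(s',a') \)
  - Incorporate the new estimate into a running average: \( Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha \cdot \text{sample} \)
[Demo: Q-learning gridworld (L10D2)] [Demo: Q-learning crawler (L10D3)]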
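One step of tabular Q-learning might be sketched as follows. This is an illustration, not the course's project code: `q_learning_update` is a hypothetical name, the table is a plain `defaultdict`, and states with no available actions are assumed to be terminal with future value 0.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, next_actions, alpha=0.5, gamma=1.0):
    """Blend the old estimate Q(s, a) with a new one-step sample."""
    # Assumption: an empty action list means s_next is terminal (value 0).
    best_next = max((q[(s_next, a2)] for a2 in next_actions), default=0.0)
    sample = r + gamma * best_next
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * sample
    return q[(s, a)]


q = defaultdict(float)   # unseen (state, action) pairs start at 0
q[("D", "exit")] = 8.0   # hypothetical prior estimate, for illustration only
print(q_learning_update(q, "C", "east", -2, "D", ["exit"]))  # 0.5*0 + 0.5*(-2 + 8) = 3.0
```

The update uses the best action available in s', not the action the agent actually takes next, which is what makes the method off-policy.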
Q-Learning Properties
- Amazing result: Q-learning converges to the optimal policy, even if you're acting suboptimally!
- This is called off-policy learning
- Technical conditions for convergence:
  - Explore enough: eventually try every state-action pair infinitely often
  - Decay the learning rate α properly:
    \( \sum_t \alpha_t = \infty \) and \( \sum_t \alpha_t^2 < \infty \)
    \( \alpha_t = O(1/t) \) meets these conditions
Video of Demo Q-Learning -- Gridworld
Video of Demo Q-Learning -- Crawler
Active Reinforcement Learning
Active Reinforcement Learning
- Full reinforcement learning:
  - You don't know the transition model
  - You don't know the reward function
  - You choose the actions now
  - Goal: learn the optimal policy / values
- In this case:
  - Learner makes choices!
  - Fundamental tradeoff: exploration vs. exploitation
  - This is NOT offline planning! You actually take actions in the world and find out what happens
Approaches to reinforcement learning
1. Learn the model, solve it, execute the solution
2. Learn values from experiences
   a. Direct utility estimation
      - Add up rewards obtained from each state onwards
   b. Temporal difference (TD) learning
      - Adjust V(s) to agree better with \( R(s,a,s') + \gamma V(s') \)
      - Still need a transition model to make decisions
   c. Q-learning
      - Like TD learning, but learn Q(s,a) instead
      - Sufficient to make decisions without a model!
Important idea: use samples to approximate the weighted sum over next-state values
TD learning and Q-learning
- Approximate version of value iteration update gives TD learning:
  \( V^\pi_{k+1}(s) \leftarrow \sum_{s'} P(s' \mid \pi(s), s) \, [R(s, \pi(s), s') + \gamma V^\pi_k(s')] \)
  \( V^\pi(s) \leftarrow (1-\alpha)\,V^\pi(s) + \alpha\,[R(s, \pi(s), s') + \gamma V^\pi(s')] \)
- Approximate version of Q iteration update gives Q-learning:
  \( Q_{k+1}(s,a) \leftarrow \sum_{s'} P(s' \mid a, s) \, [R(s,a,s') + \gamma \max_{a'} Q_k(s',a')] \)
  \( Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[R(s,a,s') + \gamma \max_{a'} Q(s',a')] \)
- We obtain a policy from learned Q, with no model!
Video of Demo Q-Learning Auto Cliff Grid
Exploration vs. Exploitation
Exploration vs exploitation
- Exploration: try new things
- Exploitation: do what's best given what you've learned so far
- Key point: pure exploitation often gets stuck in a rut and never finds an optimal policy!
Exploration method 1: ε-greedy
- ε-greedy exploration
  - Every time step, flip a biased coin
  - With (small) probability ε, act randomly
  - With (large) probability 1-ε, act on current policy
- Properties of ε-greedy exploration
  - Every s,a pair is tried infinitely often
  - Does a lot of stupid things
    - Jumping off a cliff lots of times to make sure it hurts
  - Keeps doing stupid things forever
    - Decay ε towards 0
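The biased coin flip can be written as a small action-selection helper. A minimal sketch: `epsilon_greedy` is a hypothetical name, the Q-table is assumed to be a dict keyed by (state, action), and missing entries are treated as 0.

```python
import random


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """With probability epsilon act randomly, otherwise act greedily on Q."""
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore
    # Exploit: pick the action with the highest current Q estimate.
    return max(actions, key=lambda a: q.get((state, a), 0.0))


q = {("s", "left"): 0.2, ("s", "right"): 1.0}  # hypothetical Q-values
print(epsilon_greedy(q, "s", ["left", "right"], epsilon=0.1))
```

Decaying ε over time, for example passing a smaller `epsilon` on later time steps, reduces the amount of random behaviour once the estimates are good; the exact schedule is left open here.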
Video of Demo Q-learning Epsilon-Greedy Crawler
Sensible exploration: Bandits
  Arm   Tries   Winnings
  A     1000    900
  B     100     90
  C     5       4
  D     100     0
- Which one-armed bandit to try next?
- Most people would choose C > B > A > D
- Basic intuition: higher mean is better; more uncertainty is better
- Gittins (1979): rank arms by an index that depends only on the arm itself
Exploration Functions
- Exploration functions implement this tradeoff
  - Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g., \( f(u,n) = u + k/\sqrt{n} \)
- Regular Q-update:
  \( Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[R(s,a,s') + \gamma \max_{a'} Q(s',a')] \)
- Modified Q-update:
  \( Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[R(s,a,s') + \gamma \max_{a'} f(Q(s',a'), n(s',a'))] \)
- Note: this propagates the bonus back to states that lead to unknown states as well!
[Demo: exploration Q-learning crawler exploration function (L11D4)]
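The modified Q-update could be sketched as below. Two assumptions are made for the sake of a runnable example and are not taken from the slides: the bonus is taken to be k / sqrt(n + 1), where the +1 avoids dividing by zero for pairs that have never been tried, and states with no actions are treated as terminal. The names `exploration_value` and `explore_q_update` are hypothetical.

```python
import math
from collections import defaultdict


def exploration_value(u, n, k=1.0):
    """Optimistic utility: value estimate u plus a bonus that shrinks with visits n.

    Assumption: n + 1 is used under the square root so that n = 0 is defined.
    """
    return u + k / math.sqrt(n + 1)


def explore_q_update(q, visits, s, a, r, s_next, next_actions,
                     alpha=0.5, gamma=1.0, k=1.0):
    """Q-update whose target uses optimistic next-state values f(Q, n)."""
    visits[(s, a)] += 1
    optimistic_next = max(
        (exploration_value(q[(s_next, a2)], visits[(s_next, a2)], k)
         for a2 in next_actions),
        default=0.0,  # assumption: no actions means terminal, future value 0
    )
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * (r + gamma * optimistic_next)
    return q[(s, a)]


q = defaultdict(float)
visits = defaultdict(int)
print(explore_q_update(q, visits, "s1", "go", 1.0, "s2", ["go"]))
```

Because the bonus sits inside the target, a rarely visited successor raises the Q-value of the action leading to it, so the optimism flows backwards through the state space.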
Video of Demo Q-learning Exploration Function Crawler
Optimality and exploration
[Plot: total reward per trial against number of trials, with curves labeled optimal, exploration function, decay ε-greedy, and fixed ε-greedy; the plot also marks the regret.]
Regret
- Regret measures the total cost of your youthful errors made while exploring and learning instead of behaving optimally
- Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal
Approximate Q-Learning
Generalizing Across States Basic Q-Learning keeps a table of all Q-values In realistic situations, we cannot possibly learn about every single state! Too many states to visit them all in training Too many states to hold the Q-tables in memory Instead, we want to generalize: Learn about some small number of training states from experience Generalize that experience to new, similar situations Can we apply some machine learning tools to do this? [demo RL pacman]
Example: Pacman
- Let's say we discover through experience that this state is bad: [image]
- In naïve Q-learning, we know nothing about this state: [image]
- Or even this one! [image]
Video of Demo Q-Learning Pacman Tiny Watch All
Video of Demo Q-Learning Pacman Tiny Silent Train
Video of Demo Q-Learning Pacman Tricky Watch All
Feature-Based Representations
- Solution: describe a state using a vector of features (properties)
  - Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  - Example features:
    - Distance to closest ghost (\( f_{GST} \))
    - Distance to closest dot
    - Number of ghosts
    - 1 / (distance to closest dot) (\( f_{DOT} \))
    - Is Pacman in a tunnel? (0/1)
    - etc.
  - Can also describe a q-state (s, a) with features (e.g., action moves closer to food)
Linear Value Functions
- We can express V and Q (approximately) as weighted linear functions of feature values:
  \( V_w(s) = w_1 f_1(s) + w_2 f_2(s) + \dots + w_n f_n(s) \)
  \( Q_w(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \dots + w_n f_n(s,a) \)
- Important: depending on the features used, the best possible approximation may be terrible!
- But in practice we can compress a value function for chess (\( 10^{43} \) states) down to about 30 weights and get decent play!
Updating a linear value function
- Original Q-learning rule tries to reduce prediction error at s,a:
  \( Q(s,a) \leftarrow Q(s,a) + \alpha\,[R(s,a,s') + \gamma \max_{a'} Q(s',a') - Q(s,a)] \)
- Instead, we update the weights to try to reduce the error at s,a:
  \( w_i \leftarrow w_i + \alpha\,[R(s,a,s') + \gamma \max_{a'} Q(s',a') - Q(s,a)] \, \partial Q_w(s,a)/\partial w_i \)
  \( \phantom{w_i} = w_i + \alpha\,[R(s,a,s') + \gamma \max_{a'} Q(s',a') - Q(s,a)] \, f_i(s,a) \)
- Qualitative justification:
  - Pleasant surprise: increase weights on +ve features, decrease on -ve ones
  - Unpleasant surprise: decrease weights on +ve features, increase on -ve ones
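The weight update for a linear Q-function might look like this in code. A minimal sketch with hypothetical names (`linear_q`, `approx_q_update`); the feature values, weights, reward and learning rate in the usage lines are invented for illustration and are not taken from the slides.

```python
def linear_q(weights, feats):
    """Q_w(s, a) as a weighted sum of the feature values f_i(s, a)."""
    return sum(w * f for w, f in zip(weights, feats))


def approx_q_update(weights, feats, reward, max_next_q, alpha, gamma):
    """Shift each weight by alpha * (prediction error) * its feature value."""
    difference = reward + gamma * max_next_q - linear_q(weights, feats)
    return [w + alpha * difference * f for w, f in zip(weights, feats)]


# Hypothetical numbers: two features, a large negative reward, terminal next state.
weights = [4.0, -1.0]
feats = [0.5, 1.0]
new_weights = approx_q_update(weights, feats, reward=-500.0, max_next_q=0.0,
                              alpha=0.004, gamma=1.0)
print(new_weights)  # roughly [2.998, -3.004]
```

Here the outcome is an unpleasant surprise (the error is negative), so both weights decrease because both features are positive, in line with the qualitative justification.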
Example: Q-Pacman [Demo: approximate Q-learning pacman (L11D10)]
Video of Demo Approximate Q-Learning -- Pacman
Convergence*
- Let \( V_L \) be the closest linear approximation to \( V^* \).
- TD learning with a linear function approximator converges to some V that is pretty close to \( V_L \)
- Q-learning with a linear function approximator may diverge
- With much more complicated update rules, stronger convergence results can be proved, even for nonlinear function approximators such as neural nets
Nonlinear function approximators
- We can still use the gradient-based update for any \( Q_w \):
  \( w_i \leftarrow w_i + \alpha\,[R(s,a,s') + \gamma \max_{a'} Q(s',a') - Q(s,a)] \, \partial Q_w(s,a)/\partial w_i \)
- Neural network error back-propagation already does this!
- Maybe we can get much better V or Q approximators using a complicated neural net instead of a linear function
Backgammon
TDGammon
- 4-ply lookahead using V(s) trained from 1,500,000 games of self-play
- 3 hidden layers, ~100 units each
- Input: contents of each location plus several handcrafted features
- Experimental results:
  - Plays approximately at parity with world champion
  - Led to radical changes in the way humans play backgammon
DeepMind DQN
- Used a deep learning network to represent Q:
  - Input is last 4 images (84x84 pixel values) plus score
  - 3 hidden layers:
    - 16 units, each taking input from an 8x8 region, replicated 400 times
    - 32 units, each taking input from a 4x4 region from layer 1, replicated 100 times
    - 256 units fully connected to all units from layer 2
- Trained on 7 Atari video games: Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, Space Invaders
DeepMind (DQN) results
The games Q*bert, Seaquest, Space Invaders, on which we are far from human performance, are more challenging because they require the network to find a strategy that extends over long time scales.
RL and dopamine
- Dopamine signal generated by parts of the striatum
- Encodes predictive error in value function (as in TD learning)
RL and Flappy Bird
- State space
  - Discretized vertical distance from lower pipe
  - Discretized horizontal distance from next pair of pipes
  - Life: Dead or Living
- Actions
  - Click
  - Do nothing
- Rewards
  - +1 if Flappy Bird still alive
  - -1000 if Flappy Bird is dead
- 6-7 hours of Q-learning