Grundlagen der Künstlichen Intelligenz (Foundations of Artificial Intelligence)
Reinforcement Learning
Daniel Hennes, 4.12.2017 (WS 2017/18)
University of Stuttgart - IPVS - Machine Learning & Robotics 1
Today
- Reinforcement learning
- Model-based and model-free RL
- TD learning
- On- vs. off-policy learning
- Exploration
- SARSA
- Q-learning
Reading: Russell & Norvig, Chapter 21; Sutton & Barto, Reinforcement Learning: An Introduction 2
Reinforcement learning
[Figure: agent-environment loop - the agent in state s_t takes action a_t; the environment returns reward r_{t+1} and next state s_{t+1}]
- Learning by trial and error
- Receive feedback in the form of rewards
- Maximize expected future rewards
- Learning is based on observed samples: we have to try and see 3
Learning in MDPs
A Markov Decision Process (MDP) is a tuple (S, A, T, R, γ, H):
- S is a finite set of states
- A is a finite set of actions
- T is a state transition function: T(s, a, s') = P(S_{t+1} = s' | S_t = s, A_t = a)
- R is a reward function: R(s, a) = E[R_t | S_t = s, A_t = a]
- γ ∈ [0, 1] is a discount factor
- H is a horizon (possibly H = ∞)
[Figure: agent-environment loop]
T and R are unknown! 4
[Diagram: overview of RL approaches. From data {(s, a, r, s')}, model-based methods use model learning to estimate P(s' | s, a) and R(s, a) and then plan with dynamic programming; model-free methods use TD-/Q-learning to estimate values V(s), Q(s, a) and select actions from them; policy search works directly on the policy π(a | s).] 5
Learning in MDPs
While interacting with the world, the agent collects data of the form
D = {(s_t, a_t, r_t, s_{t+1})}_{t=0}^{H}
What can we learn from this?
- Model-based RL: learn to predict the next state (estimate P(s' | s, a)) and the immediate reward (estimate R(s, a))
- Model-free RL: learn to predict value: estimate V(s) or Q(s, a)
- Policy search: estimate the policy gradient, or directly use black-box (e.g. evolutionary) search 6
Model-based RL
Idea of model-based reinforcement learning:
- Learn a model based on experience
- Assume the model is correct, perform planning
Adaptive dynamic programming: model learning + dynamic programming (intractable for large state spaces)
1. Model learning: given data D = {(s_t, a_t, r_t, s_{t+1})}_{t=0}^{H}, estimate P(s' | s, a) and R(s, a), e.g.
   P̂(s' | s, a) = N(s, a, s') / N(s, a)
   R̂(s, a) = (1 / N(s, a)) Σ_t r_t, the average reward observed after taking a in s
2. Planning using the estimated model, e.g. search, dynamic programming 7
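The counting estimates in step 1 can be sketched as follows; the function and variable names here are illustrative, not from the slides.

```python
from collections import defaultdict

def learn_model(data):
    """Estimate P(s' | s, a) and R(s, a) from (s, a, r, s') tuples by counting."""
    n_sa = defaultdict(int)      # N(s, a): visits to state-action pair
    n_sas = defaultdict(int)     # N(s, a, s'): observed transitions
    r_sum = defaultdict(float)   # summed reward observed for (s, a)
    for s, a, r, s_next in data:
        n_sa[(s, a)] += 1
        n_sas[(s, a, s_next)] += 1
        r_sum[(s, a)] += r
    P = {(s, a, s2): n / n_sa[(s, a)] for (s, a, s2), n in n_sas.items()}
    R = {sa: r_sum[sa] / n for sa, n in n_sa.items()}
    return P, R

# toy data: from state 0, action 'up' led to state 1 twice and state 2 once
data = [(0, 'up', -0.04, 1), (0, 'up', -0.04, 1), (0, 'up', -0.04, 2)]
P, R = learn_model(data)
print(P[(0, 'up', 1)])   # P-hat = N(0,up,1)/N(0,up) = 2/3
```

The estimated model can then be handed to any MDP planner (value iteration, search) as if it were the true model.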
Passive reinforcement learning
Policy evaluation: given a fixed policy π, what are the state values V^π(s)?
∀s: V^π_{k+1}(s) = R(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) V^π_k(s')
Assume we do not know T and R:
- need to learn the value function V from experience
- requires interaction, not offline planning in MDPs
- however: the learner does not take actions itself 8
Example: direct value estimation
Samples (s, a, r, s'):
(1,1), up, -0.04, (1,2)
(1,2), up, -0.04, (1,2)
(1,2), up, -0.04, (1,3)
(1,3), right, -0.04, (1,3)
(1,3), right, -0.04, (2,3)
(2,3), right, -0.04, (3,3)
(3,3), right, -0.04, (4,3)
(4,3), no-op, +1.0, none
[Figure: 4×3 grid world with terminal rewards +1.000 and -1.000]
V((2,3)) = (-0.04) + (-0.04) + (+1) = 0.92
V((1,1)) = 7 · (-0.04) + 1 = 0.72
V((1,3)) = ((4 · (-0.04) + 1) + (3 · (-0.04) + 1)) / 2 = 0.86
V((1,2)) = ((6 · (-0.04) + 1) + (5 · (-0.04) + 1)) / 2 = 0.78
γ = 1.0, step cost -0.04 9
Example: direct value estimation
Samples (s, a, r, s') as on the previous slide.
[Figure: 4×3 grid world showing the true values for comparison: 0.812, 0.868, 0.918, +1.000 (top row); 0.762, 0.660, -1.000 (middle row); 0.705, 0.655, 0.611, 0.388 (bottom row)]
V((2,3)) = (-0.04) + (-0.04) + (+1) = 0.92
V((1,1)) = 7 · (-0.04) + 1 = 0.72
V((1,3)) = ((4 · (-0.04) + 1) + (3 · (-0.04) + 1)) / 2 = 0.86
V((1,2)) = ((6 · (-0.04) + 1) + (5 · (-0.04) + 1)) / 2 = 0.78
γ = 1.0, step cost -0.04 10
Direct value estimation
1. Wait until the end of the sequence
2. Calculate the observed reward-to-go for each encountered state
3. Update estimates: keep a running average for each state
Reduces reinforcement learning to supervised learning: input = state, target = observed reward-to-go
Misses opportunities for learning:
- states are not independent: value = immediate reward + expected value of the successor state
- often converges slowly 11
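The three steps above can be sketched as an every-visit Monte Carlo average; the episode encoding (state, reward received on that step) is an assumption made for this sketch.

```python
def direct_value_estimation(episodes, gamma=1.0):
    """Direct (every-visit Monte Carlo) value estimation: average the
    observed reward-to-go over all visits to each state."""
    totals, counts = {}, {}
    for episode in episodes:             # episode: list of (state, reward)
        G = 0.0
        for s, r in reversed(episode):   # accumulate reward-to-go backwards
            G = r + gamma * G
            totals[s] = totals.get(s, 0.0) + G
            counts[s] = counts.get(s, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

# the grid-world trajectory from the example slide
episode = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 2), -0.04),
           ((1, 3), -0.04), ((1, 3), -0.04), ((2, 3), -0.04),
           ((3, 3), -0.04), ((4, 3), 1.0)]
V = direct_value_estimation([episode])
print(round(V[(2, 3)], 2))   # 0.92, as on the slide
```

States visited twice, such as (1,3), average the two observed reward-to-go values, exactly as in the worked example.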
Temporal difference learning
- Directly learn V^π from experience under policy π
- Model-free: no knowledge of T and R
- Update V^π(s) every time we experience a sample (s, a, r, s')
- Bootstrapping: update the value towards the estimate V^π(s'), i.e. update values to match those of successor states
sample: R(s, a) + γ V^π(s')
update: V^π(s) ← (1 - α) V^π(s) + α (R(s, a) + γ V^π(s'))
equivalently: V^π(s) ← V^π(s) + α (R(s, a) + γ V^π(s') - V^π(s))
TD target: R(s, a) + γ V^π(s')
TD error: R(s, a) + γ V^π(s') - V^π(s)
learning rate: α 12
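A single TD(0) update can be sketched in a few lines; the dictionary-based value table and the default of 0 for unseen states are assumptions of this sketch.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """One TD(0) step: move V(s) toward the TD target r + gamma * V(s')."""
    td_target = r + gamma * V.get(s_next, 0.0)
    td_error = td_target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V

V = {'A': 0.0, 'B': 1.0}
td0_update(V, 'A', 0.5, 'B', alpha=0.5, gamma=1.0)
print(V['A'])   # 0.0 + 0.5 * (0.5 + 1.0 - 0.0) = 0.75
```

Note that, unlike direct estimation, this update can be applied after every single step, without waiting for the episode to end.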
Exponential moving average
Temporal difference update: V^π(s) ← (1 - α) V^π(s) + α (R(s, a) + γ V^π(s'))
Exponential moving average:
x̄_n = (1 - α) x̄_{n-1} + α x_n
x̄_n = (x_n + (1 - α) x_{n-1} + (1 - α)² x_{n-2} + …) / (1 + (1 - α) + (1 - α)² + …)
- Recent samples receive higher weight
- Decreasing α can produce converging averages 13
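The recursion above is easy to verify numerically; this small helper is an illustration, not part of the slides.

```python
def ema(samples, alpha):
    """Exponential moving average: each new sample gets weight alpha,
    the running average decays by a factor (1 - alpha) per step."""
    x_bar = samples[0]
    for x in samples[1:]:
        x_bar = (1 - alpha) * x_bar + alpha * x
    return x_bar

# the most recent sample dominates: two zeros then a one gives 0.5
print(ema([0.0, 0.0, 1.0], alpha=0.5))   # 0.5
```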
TD vs. direct evaluation
Direct evaluation:
- requires complete sequences, only works in terminating domains
- does not rely on the Markov property
- high variance: many random actions, state transitions, and rewards
Temporal difference learning:
- can learn from every step (incomplete sequences), works in non-terminating domains
- bootstrapping exploits the Markov property
- much lower variance 14
Active reinforcement learning
Eventually the goal is to use RL for control!
- The agent must decide what actions to take; no longer a fixed policy a = π(s)
- Assume we have V: what action should we take?
To turn learned values V^π into a new policy
π(s) = argmax_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V^π(s') ]
we can either
1. learn a model P̂(s' | s, a) and R̂(s, a), or
2. use Q instead of V: acting greedily with respect to Q is model-free: π(s) = argmax_a Q(s, a) 15
Active reinforcement learning
Problem: choosing the optimal action with respect to the learned Q-function can lead to suboptimal behavior
- what is optimal in the learned model need not be optimal in the true model
- actions provide both reward and new knowledge
- exploration-exploitation tradeoff (link to bandits) 16
Example of greedy action selection
Two doors in front of you: s_l, s_r
- You open the left door and get reward 0: V(s_l) = 0, V(s_r) = 0
- You open the right door and get reward 1: V(s_l) = 0, V(s_r) = 1
- You open the right door and get reward 3: V(s_l) = 0, V(s_r) = 2
Are you sure you are choosing the right door? (Example from David Silver) 17
ɛ-greedy exploration
Simple idea to force continued exploration:
- with probability 1 - ɛ take the greedy action
- with probability ɛ take a random action
All m actions are chosen with non-zero probability:
π(a | s) = 1 - ɛ + ɛ/m  if a = argmax_{a'} Q(s, a'),  ɛ/m otherwise 18
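An ɛ-greedy action selector can be sketched like this (ties in the argmax are broken arbitrarily by `max`; the Q-table layout is an assumption of this sketch):

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps pick a uniformly random action,
    otherwise pick the greedy action argmax_a Q(s, a)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

Q = {(0, 'left'): 0.0, (0, 'right'): 1.0}
print(epsilon_greedy(Q, 0, ['left', 'right'], eps=0.0))   # greedy: 'right'
```

Note the greedy branch also has probability ɛ/m of being picked by the random branch, which matches the 1 - ɛ + ɛ/m term on the slide.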
Boltzmann exploration (softmax)
π(a | s) = e^{Q(s,a)/τ} / Σ_{a'} e^{Q(s,a')/τ}
τ is called the temperature:
- large temperature: more exploration
- small temperature: more greedy
Suppose Q(s, a_1) = 1 and Q(s, a_2) = 2:
- τ = 10:  P(a_1 | s) ≈ 0.48, P(a_2 | s) ≈ 0.52
- τ = 1:   P(a_1 | s) ≈ 0.27, P(a_2 | s) ≈ 0.73
- τ = 0.5: P(a_1 | s) ≈ 0.12, P(a_2 | s) ≈ 0.88
- τ = 0.1: P(a_1 | s) ≈ 0.00005, P(a_2 | s) ≈ 0.99995 19
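The softmax formula reproduces the probabilities tabulated above; the Q-table layout is again an assumption of this sketch.

```python
import math

def boltzmann(Q, s, actions, tau=1.0):
    """Boltzmann (softmax) action probabilities with temperature tau."""
    prefs = [math.exp(Q[(s, a)] / tau) for a in actions]
    z = sum(prefs)
    return [p / z for p in prefs]

Q = {('s', 'a1'): 1.0, ('s', 'a2'): 2.0}
for tau in (10.0, 1.0, 0.5, 0.1):
    p1, p2 = boltzmann(Q, 's', ['a1', 'a2'], tau=tau)
    print(tau, round(p1, 2), round(p2, 2))
```

For very small τ the exponentials can overflow in practice; a numerically stable implementation would subtract max_a Q(s, a) before exponentiating.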
Greedy in the limit of infinite exploration (GLIE)
- All (s, a) pairs are explored infinitely often: lim_{k→∞} N_k(s, a) = ∞
- The policy converges to a greedy policy:
lim_{k→∞} π_k(a | s) = 1 if a = argmax_{a'} Q_k(s, a'), 0 otherwise
e.g. ɛ-greedy is GLIE with ɛ(k) = 1/k
- very slow to converge; in practice ɛ is often a constant, e.g. ɛ = 0.1 or ɛ = 0.01 20
On-policy and off-policy model-free learning
- On-policy learning: "learning on the job" (SARSA)
- Off-policy learning: "look over someone else's shoulder" (Q-learning) 21
SARSA
SARSA is temporal difference learning of Q^π
Recall the recursive property of Q^π(s, a):
Q^π(s, a) = R(s, a) + γ Σ_{s'} P(s' | s, a) Q^π(s', π(s'))
Recall the TD update rule:
V(s_t) ← V(s_t) + α (R(s_t, a_t) + γ V(s_{t+1}) - V(s_t))
SARSA update rule:
Q(s, a) ← Q(s, a) + α (R(s, a) + γ Q(s', a') - Q(s, a)) 22
SARSA
SARSA update rule: Q(s, a) ← Q(s, a) + α (R(s, a) + γ Q(s', a') - Q(s, a))
- R(s, a) + γ Q(s', a') > Q(s, a): increase Q(s, a)
- R(s, a) + γ Q(s', a') < Q(s, a): decrease Q(s, a)
The update is scaled with the learning rate α 23
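The SARSA update uses the quintuple (s, a, r, s', a'), i.e. the action a' actually selected by the behavior policy in s'. A minimal sketch, assuming a dictionary Q-table with default value 0:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """On-policy SARSA update: bootstrap from Q(s', a') for the
    action a' the agent actually takes next."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
    return Q

Q = {}
sarsa_update(Q, 's', 'a', 1.0, 't', 'b', alpha=0.5, gamma=1.0)
print(Q[('s', 'a')])   # 0.0 + 0.5 * (1.0 + 0.0 - 0.0) = 0.5
```

Because the target depends on a', the learned Q-values account for the exploration the policy actually performs.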
Q-learning
Q-learning is temporal difference learning of the optimal Q-function
- action a is performed in s
- update towards the value of the alternative (optimal) action a'
Q-learning update rule:
Q(s, a) ← Q(s, a) + α (R(s, a) + γ max_{a'} Q(s', a') - Q(s, a))
Converges to the optimal policy even when acting suboptimally!
- requires sufficient exploration
- requires a (carefully) decreasing learning rate α 24
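The only difference from SARSA is the max over next actions in the target; a minimal sketch, again assuming a dictionary Q-table:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0):
    """Off-policy Q-learning update: bootstrap from the best action in s',
    regardless of which action the behavior policy will actually take."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
    return Q

Q = {('t', 'x'): 0.0, ('t', 'y'): 2.0}
q_update(Q, 's', 'x', 0.0, 't', ['x', 'y'], alpha=0.5, gamma=1.0)
print(Q[('s', 'x')])   # 0.5 * (0.0 + max(0.0, 2.0)) = 1.0
```

Substituting max_{a'} Q(s', a') for Q(s', a') is exactly what makes the update off-policy: the target is the greedy policy even while behavior is ɛ-greedy.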
Off-policy learning
Why is it important?
- Learn from observing other agents (e.g. humans)
- Re-use experience: improve sample efficiency
- Learn about the optimal policy while exploring
Off-policy Q-learning: both the behavior and the target policy can improve
- the target policy is greedy with respect to Q(s, a)
- the behavior policy is ɛ-greedy with respect to Q(s, a)
Is off-policy always a good idea? 25
Example: cliff walking
[Figure: cliff-walking grid world (Sutton & Barto). Q-learning learns the optimal path along the cliff edge, while SARSA learns the longer but safer path; under ɛ-greedy exploration SARSA therefore collects more reward online.] 26
Summary
An RL agent learns to maximize reward given only percepts and rewards
- Passive vs. active: passive = estimate V of an unknown MDP; active = control in an unknown MDP
- Model-based vs. model-free: model-based = 1. learn estimates of T and R, 2. solve the MDP; model-free = estimate Q directly from samples
- On-policy vs. off-policy: on-policy = estimate Q^π while executing π; off-policy = estimate the Q-function of a different (e.g. the optimal) policy while executing π 27