INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
Outline Reinforcement learning basics Relation with MDPs Model-based and model-free learning Exploitation vs. exploration (Approximate Q-learning) 2
Reinforcement learning RL methods are employed to address two related problems: the prediction problem and the control problem. Prediction: learn the value function for a (fixed) policy and use that to predict the reward for future actions. Control: learn, by interacting with the environment, a policy which maximizes the reward when travelling through state space, i.e. obtain an optimal policy which allows for action planning and optimal control. 3
Examples of Reinforcement Learning Robocup Soccer Teams (Stone & Veloso, Riedmiller et al.) World's best player of simulated soccer, 1999; runner-up 2000 Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis) 10-15% improvement over industry standard methods Dynamic Channel Assignment (Singh & Bertsekas, Nie & Haykin) World's best assigner of radio channels to mobile telephone calls Elevator Control (Crites & Barto) (Probably) world's best down-peak elevator controller Many Robots: navigation, bi-pedal walking, grasping, switching between skills... Games: TD-Gammon, Jellyfish (Tesauro, Dahl), AlphaGo (DeepMind) World's best backgammon & Go players (AlphaGo: https://www.youtube.com/watch?v=subqykxvx0a) 5
Key Features of RL Agent learns by interacting with environment Agent learns from the consequences of its actions, rather than from being explicitly taught, by receiving a reinforcement signal Because of chance, agent has to try things repeatedly Agent makes mistakes, even if it learns intelligently (regret) Agent selects its actions based on its past experiences (exploitation) and also on new choices (exploration) trial and error learning Possibly sacrifices short-term gains for larger long-term gains 6
Reinforcement Learning: idea Agent State: s Reward: r Actions: a Environment Basic idea: Receive feedback in the form of rewards Agent's return in the long run is defined by the reward function Must (learn to) act so as to maximize expected return All learning is based on observed samples of outcomes! 8
The Agent-Environment Interface Agent: Interacts with environment at discrete time steps t = 0, 1, 2, ... Observes state at step t: s_t ∈ S Produces action at step t: a_t ∈ A(s_t) Gets resulting reward: r_{t+1} ∈ R And resulting next state: s_{t+1} This yields the trajectory s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, ... 9
RL as MDP The best studied case is when RL can be formulated as a (finite) Markov Decision Process (MDP), i.e. we assume: A (finite) set of states s ∈ S A set of actions (per state) A A model T(s,a,s′) A reward function R(s,a,s′) Markov assumption Still looking for a policy π(s) New twist: we don't know T or R! I.e. we don't know which states are good or what the actions do Must actually try actions and states out to learn 11
An Example: Recycling robot At each step, robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge. Searching is better but runs down the battery; if it runs out of power while searching, it has to be rescued (which is bad). Actions are chosen based on current energy level (states): high, low. Reward = number of cans collected 12
Recycling Robot MDP S = {high, low} A(high) = {search, wait} A(low) = {search, wait, recharge} R_search = expected no. of cans while searching R_wait = expected no. of cans while waiting R_search > R_wait 13
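As an illustration, the robot's MDP could be encoded with plain dictionaries. The transition probabilities (ALPHA, BETA) and the reward numbers below are made-up placeholders, not values given in the lecture:

```python
# Plain-dictionary encoding of the recycling-robot MDP.
# ALPHA/BETA and the reward numbers are illustrative placeholders.
ALPHA, BETA = 0.8, 0.6        # P(battery stays high | search), P(stays low | search)
R_SEARCH, R_WAIT = 3.0, 1.0   # expected cans: searching pays more than waiting
R_RESCUE = -3.0               # penalty for running out of power while searching

states = ["high", "low"]
actions = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

# T[(s, a)]: distribution over next states; R[(s, a)]: expected reward.
T = {
    ("high", "search"):  {"high": ALPHA, "low": 1 - ALPHA},
    ("high", "wait"):    {"high": 1.0},
    ("low", "search"):   {"low": BETA, "high": 1 - BETA},  # rescued with prob. 1 - BETA
    ("low", "wait"):     {"low": 1.0},
    ("low", "recharge"): {"high": 1.0},
}
R = {
    ("high", "search"):  R_SEARCH,
    ("high", "wait"):    R_WAIT,
    ("low", "search"):   BETA * R_SEARCH + (1 - BETA) * R_RESCUE,
    ("low", "wait"):     R_WAIT,
    ("low", "recharge"): 0.0,
}
```

With T and R known like this, the MDP could be solved offline; in RL, the point is precisely that these tables are not given.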
MDPs and RL Known MDP: offline solution, no learning. Goal: compute V^π Technique: policy evaluation. Goal: compute V*, π* Technique: value / policy iteration. Unknown MDP, model-based. Goal: compute V*, π* Technique: VI/PI on approximated MDP. Unknown MDP, model-free. Goal: compute V^π Technique: direct evaluation, TD-learning. Goal: compute Q*, π* Technique: Q-learning 14
Model-Based Learning Model-based idea: Learn an approximate model based on experiences Solve for values, as if the learned model were correct Step 1: Learn empirical MDP model Count outcomes s′ for each s, a Normalize to give an estimate of T(s,a,s′) Discover each R(s,a,s′) when we experience (s, a, s′) Step 2: Solve the learned MDP For example, use value iteration, as before 15
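Step 1 can be sketched as follows; `learn_model` is a hypothetical helper that tallies observed transitions and normalizes the counts (rewards are assumed deterministic per (s, a, s′) here):

```python
from collections import Counter, defaultdict

def learn_model(experiences):
    """Estimate T(s, a, s') and R(s, a, s') from observed (s, a, r, s') tuples."""
    counts = defaultdict(Counter)
    rewards = {}
    for s, a, r, s_next in experiences:
        counts[(s, a)][s_next] += 1          # count outcomes s' for each (s, a)
        rewards[(s, a, s_next)] = r          # discovered when (s, a, s') is experienced
    T = {}
    for (s, a), outcomes in counts.items():  # normalize counts to probabilities
        total = sum(outcomes.values())
        for s_next, n in outcomes.items():
            T[(s, a, s_next)] = n / total
    return T, rewards

# Example: three experiences of taking 'east' in state 'B'
T, R = learn_model([("B", "east", -1, "C"),
                    ("B", "east", -1, "C"),
                    ("B", "east", -1, "A")])
# T[("B", "east", "C")] == 2/3, T[("B", "east", "A")] == 1/3
```

Step 2 then runs value or policy iteration on the estimated T and R as if they were the true model.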
Model-Free Learning Model-free idea: Directly learn (approximate) state values, based on experiences Methods (a.o.): I. Direct evaluation II. Temporal difference learning III. Q-learning Passive (I, II): evaluate a fixed policy Active (III): learn while selecting actions; Q-learning is off-policy Remember: this is NOT offline planning! You actually take actions in the world. 16
I: Direct Evaluation Goal: Compute V(s) under given π Idea: Average reward-to-go over visits 1. First act according to π for several episodes/epochs 2. Afterwards, for every state s and every time t that s is visited: determine the rewards r_t, r_{t+1}, ... subsequently received in the epoch 3. Sample for s at time t = sum of discounted future rewards: sample = R_t(s) = r_t + γ r_{t+1} + γ² r_{t+2} + ... (= r_t + γ R_{t+1}(s′), given experience tuples <s, π(s), r_t, s′>) 4. Average samples over all visits of s Note: this is the simplest Monte Carlo method 17
Example: Direct Evaluation Input: policy π over states A, B, C, D, E Assume: γ = 1 Observed episodes (training): Episode 1: B, east, -1, C; C, east, -1, D; D, exit, +10 Episode 2: B, east, -1, C; C, east, -1, D; D, exit, +10 Episode 3: E, north, -1, C; C, east, -1, D; D, exit, +10 Episode 4: E, north, -1, C; C, east, -1, A; A, exit, -10 Output values: A = -10, B = +8, C = +4, D = +10, E = -2 18
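The output values above can be reproduced with a minimal direct-evaluation sketch (function and variable names are illustrative):

```python
from collections import defaultdict

# The four observed episodes, as (state, action, reward) steps.
episodes = [
    [("B", "east", -1), ("C", "east", -1), ("D", "exit", +10)],
    [("B", "east", -1), ("C", "east", -1), ("D", "exit", +10)],
    [("E", "north", -1), ("C", "east", -1), ("D", "exit", +10)],
    [("E", "north", -1), ("C", "east", -1), ("A", "exit", -10)],
]

def direct_evaluation(episodes, gamma=1.0):
    """Average the discounted reward-to-go over every visit of each state."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so the reward-to-go accumulates in one pass.
        for state, _action, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}

values = direct_evaluation(episodes)
# values == {"A": -10.0, "B": 8.0, "C": 4.0, "D": 10.0, "E": -2.0}
```

For instance, C is visited four times with samples 9, 9, 9 and -11, which average to +4.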
Properties of Direct Evaluation Benefits: easy to understand doesn't require any knowledge of T, R eventually computes the correct average values, using just sample transitions Drawbacks: wastes information about state connections each state must be learned separately takes a long time to learn Output values: A = -10, B = +8, C = +4, D = +10, E = -2 If B and E both go to C under this policy, how can their values be different? 20
II: Temporal Difference Learning Goal: Compute V(s) under given π Big idea: update after every experience! Likely outcomes will contribute updates more often Temporal difference learning of values: 1. Initialize each V(s) with some value 2. Observe experience tuple <s, π(s), r, s′> 3. Use observation in rough estimate of long-term reward: sample = r + γ V^π(s′) 4. Update V(s) by moving values slightly towards estimate: V^π(s) ← V^π(s) + α (sample − V^π(s)), where 0 ≤ α ≤ 1 is the learning rate. 21
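Steps 2-4 can be sketched as a single update function (`td_update` is an illustrative name); the numbers below are those of the TD example on the next slides (γ = 1, α = 1/2):

```python
def td_update(V, s, r, s_next, gamma=1.0, alpha=0.5):
    """One temporal-difference update of V[s] from experience <s, pi(s), r, s'>."""
    sample = r + gamma * V[s_next]          # rough estimate of long-term reward
    V[s] = V[s] + alpha * (sample - V[s])   # move V[s] slightly towards the estimate
    return V[s]

# V(D) initialised to 8 (known reward after ending up in D), rest to 0.
V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}
td_update(V, "B", -2, "C")   # sample = -2 + 0 = -2, so V(B) becomes -1.0
td_update(V, "C", -2, "D")   # sample = -2 + 8 =  6, so V(C) becomes  3.0
```

Unlike direct evaluation, the update runs after every single experience tuple, not at the end of an episode.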
Example: TD-Learning Input: policy π over states A, B, C, D, E Initial values: V(A) = 0, V(B) = 0, V(C) = 0, V(D) = 8, V(E) = 0 Each V(s) can be initialised with an arbitrary value. The reward function is unknown; but perhaps we do know that we receive a reward of 8 after ending up in D: this can be exploited. Assume: γ = 1, α = 1/2 22
Example: TD-Learning Experience <s, π(s), r, s′>: B, east, -2, C sample(B) = -2 + γ · 0 = -2 Update: V^π(B) ← (1 − α) · 0 + α · sample(B) = -1 In general: V^π(s) ← V^π(s) + α (sample − V^π(s)) = (1 − α) V^π(s) + α · sample = (1 − α) V^π(s) + α (r + γ V^π(s′)) Values after update: V(A) = 0, V(B) = -1, V(C) = 0, V(D) = 8, V(E) = 0 23
Example: TD-Learning Experience <s, π(s), r, s′>: C, east, -2, D sample(C) = -2 + γ · 8 = 6 Update: V^π(C) ← (1 − α) · 0 + α · sample(C) = 3 Values after update: V(A) = 0, V(B) = -1, V(C) = 3, V(D) = 8, V(E) = 0 24
Properties of TD Value Learning Benefits: Model-free Bellman updates: connections between states are used Updates upon each action Drawback: values are learned per policy Good for policy evaluation, but a long way from establishing the optimal policy (Note that the same holds for direct evaluation) 25
Golf example: how valuable is a state? State is ball location Reward of -1 for each stroke until the ball is in the hole Actions: putt (use putter) driver (use driver) putt succeeds anywhere on the green Value of a state? 26
Optimal quantities revisited State s has value V*(s): V*(s) = expected reward starting in s and acting optimally A q-state (s,a) has value Q*(s,a): Q*(s,a) = expected reward having taken action a from state s and (thereafter) acting optimally The optimal policy: π*(s) = optimal action from state s
Bellman equation revisited Recall the Bellman equation for the optimal value function: V*(s) = max_a Σ_{s′} T(s,a,s′) [ R(s,a,s′) + γ V*(s′) ] = max_a Q*(s,a) Now, since also V*(s′) = max_{a′} Q*(s′,a′), we have that Q*(s,a) = Σ_{s′} T(s,a,s′) [ R(s,a,s′) + γ max_{a′} Q*(s′,a′) ] The optimal policy now directly (no look-ahead) follows with argmax: π*(s) = argmax_a Q*(s,a) 28
Gridworld: V and Q values Noise = 0.2 Discount γ = 0.9 Living reward R(s) = 0 Optimal policy? 29
III: Q-Learning Idea: do Q-value updates to each q-state (like VI): but we can't compute this update without knowing T, R Instead: incorporate estimates as we go (like TD) 1. Initialize Q(s,a) = 0 for each s,a pair 2. Select action a and observe experience <s, a, r, s′> 3. Use observation in rough estimate of Q(s,a): sample(s,a) = r + γ max_{a′} Q(s′,a′) 4. Update Q(s,a) by moving values slightly towards estimate: Q(s,a) ← Q(s,a) + α (sample(s,a) − Q(s,a)) = (1 − α) Q(s,a) + α · sample(s,a) 30
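Steps 3-4 can be sketched as a single update (`q_update` is a hypothetical helper, not a full training loop); plugging in the numbers from the example two slides ahead (γ = 0.9, α = 1) reproduces the value 90:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, gamma=0.9, alpha=0.5):
    """One Q-learning update from experience <s, a, r, s'>."""
    # sample(s, a) = r + gamma * max_a' Q(s', a')
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q[(s, a)]

# Numbers from the worked example (gamma = 0.9, alpha = 1):
Q = defaultdict(float)
Q[("s2", "up")], Q[("s2", "right")], Q[("s2", "down")] = 63, 81, 100
Q[("s1", "right")] = 72
q_update(Q, "s1", "right", 0, "s2", ["up", "right", "down"], gamma=0.9, alpha=1.0)
# Q[("s1", "right")] is now 0 + 0.9 * 100 = 90.0
```

Note that the max over a′ in the sample, rather than the value of the action actually taken next, is what makes Q-learning off-policy.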
Optimal Q-Function for Golf We can hit the ball farther with driver than with putter, but with less accuracy Q*(s, driver) gives the value of using driver first, then using whichever actions are best 31
Updating Q-values: example Current Q(s,a) indicated; experience <s_1, a_right, 0, s_2> Assume: γ = 0.9, α = 1 sample(s_1, a_right) = r + γ max_{a′} Q(s_2, a′) = 0 + 0.9 · max{63, 81, 100} = 90 Q(s_1, a_right) ← (1 − α) · Q(s_1, a_right) + α · sample(s_1, a_right) = (1 − α) · 72 + α · 90 = 90 32
Q-Learning Properties I Q-learning is off-policy learning If rewards ≥ 0, then Q-values ≥ 0 and non-decreasing with each update If each (s,a) pair is visited infinitely often, the process converges to the true (optimal) Q Amazing result: Q-learning converges to the optimal policy -- even if you're acting suboptimally! Basically, in the limit, it doesn't matter how you select actions (!) 33
Q-Learning Properties II Caveats: You have to explore enough You have to eventually make the learning rate α small enough but not decrease it too quickly 34
Exploration vs. Exploitation Multi-armed bandit: each machine provides a random reward from a distribution specific to that machine. Which machine should you play, and how many times? 35
Exploration vs Exploitation The policy indicates the exploration strategy: which action to take in which state Standard Q-learning uses the Q-values associated with the best action: pure exploitation, using what it already knows We can add randomness for true exploration: sometimes try to learn something new by picking a random action (e.g. ε-greedy) The exploration-exploitation trade-off is highly influenced by context: online or offline? 36
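An ε-greedy action selection could be sketched as follows: with probability ε a random action (exploration), otherwise the greedy one (exploitation):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))          # explore: random action
    return max(actions, key=lambda a: Q[(s, a)])  # exploit: best known action

# Usage: with epsilon = 0 this is pure exploitation.
Q = {("s", "left"): 1.0, ("s", "right"): 2.0}
epsilon_greedy(Q, "s", ["left", "right"], epsilon=0.0)   # returns "right"
```

In practice ε is often decayed over time, shifting from exploration early on to exploitation once the Q-values are reliable.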
Q-learning to crawl 37
Approximate Q-Learning 38
Generalizing Across States Basic Q-Learning keeps a table of all q-values In realistic situations, we cannot possibly learn about every single state! Too many states to visit them all in training Too many states to hold the q-tables in memory Instead, we want to generalize: Learn about some small number of training states from experience Generalize that experience to new, similar situations This is a fundamental idea in machine learning! 39
Example: Pacman Let s say we discover through experience that this state is bad: In naïve Q-learning, we know nothing about this state: Or even this one! 40
Feature-Based Representations Solution: describe a state using a vector of features (properties) Features are functions from states to real numbers (often 0/1) that capture important properties of the state Example features: Distance to closest ghost Distance to closest dot Number of ghosts 1 / (distance to closest dot)² Is Pacman in a tunnel? (0/1) etc. Is it the exact state on this slide? Can also describe a q-state (s, a) with features (e.g. action moves closer to food) 41
Linear Value Functions Using a feature representation, we can write a Q or V function for any state using a few weights: V(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s) Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + ... + w_n f_n(s,a) Advantage: our experience is summed up in a few powerful numbers Disadvantage: states may share features but actually be very different in value! 42
Approximate Q-Learning In Q-learning, use the difference between current Q(s,a) and the new sample to update the weights of active features: transition = <s, a, r, s′> difference = [r + γ max_{a′} Q(s′,a′)] − Q(s,a) Before (exact Q): Q(s,a) ← Q(s,a) + α · difference Now (approximate Q with weight updates): w_i ← w_i + α · difference · f_i(s,a) Intuitive interpretation: if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features 43
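The weight update can be sketched as follows; `features(s, a)` is a hypothetical function returning the feature vector f_1(s,a), ..., f_n(s,a):

```python
def q_value(w, f):
    """Linear Q-value: dot product of weights and feature values."""
    return sum(wi * fi for wi, fi in zip(w, f))

def approx_q_update(w, features, s, a, r, s_next, actions, gamma=0.9, alpha=0.5):
    """Update the weights of the active features from one transition <s, a, r, s'>."""
    # difference = [r + gamma * max_a' Q(s', a')] - Q(s, a)
    target = r + gamma * max(q_value(w, features(s_next, a2)) for a2 in actions)
    difference = target - q_value(w, features(s, a))
    f = features(s, a)
    return [wi + alpha * difference * fi for wi, fi in zip(w, f)]

# Toy check: a single feature that is always 1, weight starting at 0.
features = lambda s, a: [1.0]
w = approx_q_update([0.0], features, "s", "a", r=1.0, s_next="s", actions=["a"])
# difference = (1 + 0.9 * 0) - 0 = 1, so w becomes [0.5]
```

Because a weight is shared by every state with that feature active, one surprising transition adjusts the estimated value of all similar states at once.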
Example: Q-Pacman (no noise) 44
Summary Reinforcement learning: learn from experience, not from a teacher The reinforcement learning problem can be cast as an MDP with unknown T and R Model-based RL: estimate R and T from experience Model-free RL: estimate V(s) or Q(s,a) from experience The latter can be done actively using Q-learning, which yields an optimal policy Large state spaces: use approximate Q-learning with domain-specific features 46