Grundlagen der Künstlichen Intelligenz

1 Grundlagen der Künstlichen Intelligenz (Foundations of Artificial Intelligence)
Reinforcement learning II
Daniel Hennes (WS 2017/18)
University Stuttgart - IPVS - Machine Learning & Robotics

2 Today
- Eligibility traces
- n-step TD returns
- Forward and backward view
- Function approximation
- Partially observable domains
Reading: Russell & Norvig, Chapter 21; Sutton & Barto, Reinforcement Learning: An Introduction

3 Scaling-up reinforcement learning
- Sparse rewards: Gridworld
- Large state spaces:
  - Backgammon: ... states
  - Go: ... states
- Continuous spaces: e.g. inverted pendulum, helicopter, etc.
- Partially observable domains (POMDPs)
- Does model-free solve everything?

4 Recall: Temporal difference learning
- Update $V^\pi(s)$ every time we experience a sample $(s, a, r, s')$
- Bootstrapping: update the value towards the estimate $V^\pi(s')$
- sample: $R(s, a) + \gamma V^\pi(s')$
- update: $V^\pi(s) \leftarrow V^\pi(s) + \alpha \left( R(s, a) + \gamma V^\pi(s') - V^\pi(s) \right)$
Backup diagrams (figure): TD (1-step), 2-step, 3-step, n-step, Monte Carlo
Excerpt shown on the slide: ... reward sequence, $S_t, R_{t+1}, S_{t+1}, R_{t+2}, \dots, R_T, S_T$ (omitting the actions for simplicity). We know that in Monte Carlo backups the estimate of $v(S_t)$ is updated in the direction of the complete return:
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T,$
where $T$ is the last time step of the episode. Let us call this quantity the target of the backup. Whereas in Monte Carlo backups the target is the return, in one-step backups the target is the first reward plus the discounted estimated value of the next state.
(The excerpt indexes the first reward as $R_{t+1}$; the slides themselves write it as $R_t$.)

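5 n-step return
n-step returns for n = 1, 2, ...:
- n = 1 (TD): $R_t + \gamma V(S_{t+1})$
- n = 2: $R_t + \gamma R_{t+1} + \gamma^2 V(S_{t+2})$
- n = 3: $R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 V(S_{t+3})$
- ...
- n = $\infty$ (MC): $R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots + \gamma^{T-t} R_T$
n-step return: $G_t^{(n)} = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots + \gamma^{n-1} R_{t+n-1} + \gamma^n V(S_{t+n})$
n-step temporal difference update: $V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{(n)} - V(S_t) \right)$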
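The n-step return and its update can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names, the list layout (`rewards[k]` holds $R_k$, `values[k]` holds the current estimate $V(S_k)$) and the convention that nothing is bootstrapped past the last reward are all assumptions.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n) = R_t + gamma R_{t+1} + ... + gamma^(n-1) R_{t+n-1} + gamma^n V(S_{t+n}).

    rewards[k] is R_k and values[k] is the current estimate V(S_k).
    Assumption: the episode ends after the last reward in the list, so the
    sum is truncated there and no bootstrap term is added past the end.
    """
    num_rewards = len(rewards)
    end = min(t + n, num_rewards)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < num_rewards:  # S_{t+n} is non-terminal: bootstrap from its value
        g += gamma ** n * values[t + n]
    return g


def n_step_td_update(values, rewards, t, n, gamma, alpha):
    """Return the new V(S_t) = V(S_t) + alpha * (G_t^(n) - V(S_t))."""
    g = n_step_return(rewards, values, t, n, gamma)
    return values[t] + alpha * (g - values[t])
```

With n = 1 this is the one-step TD target; once n reaches the remaining episode length the bootstrap term disappears and the target is the Monte Carlo return.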

6 Example: n-step Sarsa
Pseudocode fragment shown on the slide ($\tau$ is the time step whose estimate is being updated):
  If $\tau \geq 0$:
    $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n, T)} \gamma^{i-\tau-1} R_i$
    If $\tau + n < T$, then $G \leftarrow G + \gamma^n Q(S_{\tau+n}, A_{\tau+n})$    ($G_\tau^{(n)}$)
    $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha \left[ G - Q(S_\tau, A_\tau) \right]$
    If $\pi$ is being learned, then ensure that $\pi(\cdot \mid S_\tau)$ is $\varepsilon$-greedy wrt $Q$
  Until $\tau = T - 1$
Figure 7.4 (panels: Path taken; Action values increased by one-step Sarsa; Action values increased by 10-step Sarsa; overlaid label: Sarsa($\lambda$) with $\lambda = 0.9$): Gridworld example of the speedup of policy learning due to the use of n-step methods. The first panel shows the path taken by an agent in a single episode, ending at a location of high reward, marked by the G. In this example the values were all initially 0, and all rewards were zero except for a positive reward at G. The arrows in the other two panels show which action values were strengthened as a result of this path by one-step and n-step Sarsa methods. The one-step method strengthens only the last action of the sequence of actions that led to the high reward, whereas the n-step method strengthens the last n actions of the sequence, so that much more is learned from the one episode.
How to choose n?

7 λ-return
The λ-return $G_t^\lambda$ combines all n-step returns:
$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$
Figure 12.1: The backup diagram for TD($\lambda$). If $\lambda = 0$, then the overall backup reduces to its ...
Figure labels: TD($\lambda$), $\lambda$-return; weights $1 - \lambda$, $(1 - \lambda)\lambda$, $(1 - \lambda)\lambda^2$, ..., $\lambda^{T-t-1}$; $\sum = 1$
Excerpt shown on the slide: Figure 12.2 further illustrates the weighting on the sequence of n-step returns in the $\lambda$-return. The one-step return is given the largest weight, $1 - \lambda$; the two-step return is given the next largest weight, $(1 - \lambda)\lambda$; the three-step return is given the weight $(1 - \lambda)\lambda^2$; and so on. The weight fades by $\lambda$ with each additional step. After a terminal state has been reached, all subsequent n-step returns are equal to $G_t$. If we ...

8 λ-return weighting function
$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$
Figure 12.2: Weighting given in the $\lambda$-return to each of the n-step returns.
Figure labels: axes Weight against Time (from $t$ to $T$); first weight $1 - \lambda$; decay by $\lambda$; weight given to the 3-step return is $(1 - \lambda)\lambda^2$; weight given to actual, final return is $\lambda^{T-t-1}$; total area = 1
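The weighting in Figure 12.2 can be checked numerically. The sketch below is illustrative only: the function names are made up, and it assumes an episodic task in which every n-step return with $n \geq T - t$ equals the full return, so those weights collapse into the single weight $\lambda^{T-t-1}$.

```python
def lambda_return_weights(lam, steps_to_end):
    """Weights of the n-step returns in the lambda-return.

    steps_to_end is T - t. Returns with n < T - t get (1 - lam) * lam**(n - 1);
    the final (full) return gets the remaining mass lam**(T - t - 1).
    """
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, steps_to_end)]
    weights.append(lam ** (steps_to_end - 1))
    return weights


def lambda_return(n_step_returns, lam):
    """Combine G^(1), ..., G^(T-t), where the last entry is the full return."""
    weights = lambda_return_weights(lam, len(n_step_returns))
    return sum(w * g for w, g in zip(weights, n_step_returns))
```

For any $\lambda$ in [0, 1] the weights sum to 1. Setting $\lambda = 0$ keeps only the one-step return, and $\lambda = 1$ keeps only the full Monte Carlo return.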

9 Forward view
- Update values by looking forward to future rewards and states
- Update values towards the λ-return
- Can only be computed for terminated sequences
Figure 12.4: The forward view. We decide how to update each state by looking forward to future rewards and states.
Excerpt shown on the slide: ... to all the future rewards and decide how best to combine them. We might imagine ourselves riding the stream of states, looking forward from each state to determine its update, as suggested by Figure 12.4. After looking forward from and updating one state, we move on to the next and never have to work with the preceding state again. Future states, on the other hand, are viewed and processed repeatedly, once from each vantage point preceding them.

10 Backward view
- Forward view provides theory
- Backward view provides a mechanism for how to perform updates
- Update every step; works for incomplete sequences
Figure 12.5: The backward or mechanistic view. Each update depends on the current TD error combined with eligibility traces of past events.
Excerpt shown on the slide: This is why that algorithm was called TD(0). In terms of Figure 12.5, TD(0) is the case in which only the one state preceding the current one is changed by the TD error. For larger values of $\lambda$, but still $\lambda < 1$, more of the preceding states ...

11 Eligibility traces
Credit assignment problem:
- Frequency: assign credit to most frequent states
- Recency: assign credit to recent states
$\forall s: e(s) \leftarrow \gamma \lambda e(s)$
$e(s_t) \leftarrow e(s_t) + 1$

12 Backward view
- Keep an eligibility trace for every state $s$
- Update the value $V(s)$ of every state $s$:
$\delta_t = R_t + \gamma V(S_{t+1}) - V(S_t)$
$\forall s: V(s) \leftarrow V(s) + \alpha \delta_t e_t(s)$
Figure 12.5 (repeated): The backward or mechanistic view. Each update depends on the current TD error combined with eligibility traces of past events.
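The backward view amounts to one small routine per time step. The following is a minimal tabular sketch with accumulating traces, under stated assumptions: the function name, the dictionary-based tables and the `terminal` flag are illustrative choices, not part of the slides.

```python
from collections import defaultdict


def td_lambda_step(V, e, s, r, s_next, alpha, gamma, lam, terminal=False):
    """One backward-view TD(lambda) step with accumulating traces.

    V and e are defaultdict(float) tables of values and eligibility traces.
    Assumption: the value of a terminal state is 0.
    """
    # TD error: delta_t = R_t + gamma * V(S_{t+1}) - V(S_t)
    v_next = 0.0 if terminal else V[s_next]
    delta = r + gamma * v_next - V[s]
    # Decay every trace, then increment the trace of the state just visited.
    for state in list(e):
        e[state] *= gamma * lam
    e[s] += 1.0
    # Every state moves in proportion to its eligibility trace.
    for state, trace in e.items():
        V[state] += alpha * delta * trace
    return delta
```

With `lam=0.0` all older traces decay to zero immediately, so only the current state is updated and the step matches TD(0).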

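13 TD(λ) and TD(0)
When λ = 0:
$e(s) = \begin{cases} 1 & \text{for } s = S_t \\ 0 & \text{else} \end{cases}$
$\delta_t = R_t + \gamma V(S_{t+1}) - V(S_t)$
$\forall s: V(s) \leftarrow V(s) + \alpha \delta_t e_t(s)$
Same as TD(0): $V(S_t) \leftarrow V(S_t) + \alpha \left( R_t + \gamma V(S_{t+1}) - V(S_t) \right)$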

14 Large state spaces
- Backgammon: ... states
- Go: ... states
- Continuous spaces: inverted pendulum, mountain car, helicopter, ...

15 Small domains: tabular representation
So far V and Q were just lookup tables (figure: value tables at k = 50).
Problems with large MDPs:
- too many states to store in memory
- too slow to learn
What about continuous state spaces?

16 Function approximation
- Parameterized functional form, with weights $w \in \mathbb{R}^n$: $\hat{V}^\pi(s, w) \approx V^\pi(s)$
- Generally, far fewer weights than states: $n \ll |S|$
- Changing a single weight changes the value estimate of many states
- When one state is updated, the change generalizes to many states
- Update $w$ with TD learning
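One common way to "update w with TD learning" is semi-gradient TD(0) with a linear approximator. The slides do not name a specific method, so treat the sketch below as an assumption-laden illustration: the function names and the plain-list feature vectors are hypothetical.

```python
def v_hat(features, w):
    """Linear value estimate: V_hat(s, w) = sum_i w_i * x_i(s)."""
    return sum(wi * xi for wi, xi in zip(w, features))


def td0_linear_update(w, x, r, x_next, alpha, gamma, terminal=False):
    """Semi-gradient TD(0) step for a linear approximator.

    For linear V_hat the gradient with respect to w is the feature vector x,
    so the update is w <- w + alpha * delta * x. Returns the new weights.
    """
    target = r if terminal else r + gamma * v_hat(x_next, w)
    delta = target - v_hat(x, w)
    return [wi + alpha * delta * xi for wi, xi in zip(w, x)]
```

Because the weights are shared, a single update changes the estimate of every state whose features overlap with `x`. With one-hot features per group of states this reduces to state aggregation.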

17 Function approximators
- Linear combinations of features
  - state aggregation
  - tile coding
  - polynomials
  - radial basis functions (RBFs)
  - Fourier bases
  - ...
- Neural networks
- Decision trees
- Nearest neighbours

18 State aggregation

19 Mountain Car
Figure: two panels over the state space, with axes Position and Velocity.

20 Mountain Car
Figure: state space with axes Position and Velocity.

21 Tile coding
Excerpt shown on the slide: ... features present is always the same as the number of tilings. This allows the step-size parameter, $\alpha$, to be set in an easy, intuitive way. For example, choosing $\alpha = \frac{1}{m}$, where $m$ is the number of tilings, results in exact one-trial learning. If the example $s \mapsto v$ is trained on, then whatever the prior estimate, $\hat{v}(s, w_t)$, the new estimate will be $\hat{v}(s, w_{t+1}) = v$. Usually one wishes to change more slowly than this, to allow for generalization and stochastic variation in target outputs. For example, one might choose $\alpha = \frac{1}{10m}$, in which case the estimate for the trained state would move one-tenth of the way to the target ...
Figure 9.9: Multiple, overlapping grid-tilings on a limited two-dimensional space. These tilings are offset from one another by a uniform amount in each dimension.
Figure labels: Continuous 2D state space; Tiling 1, Tiling 2, Tiling 3, Tiling 4; Point in state space to be represented; Four active tiles/features overlap the point and are used to represent it
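A small sketch can make the tile-coding idea concrete. It assumes a square 2D state space, uniformly offset grid tilings and a flat weight list; the function names, default sizes and indexing scheme are illustrative choices rather than the lecture's implementation.

```python
import math


def active_tiles(x, y, num_tilings=4, tiles_per_dim=4, low=0.0, high=1.0):
    """Return one active tile index per tiling for a point in [low, high]^2.

    Each tiling is a uniform grid shifted by a fraction of one tile width,
    by the same amount in both dimensions.
    """
    width = (high - low) / tiles_per_dim
    side = tiles_per_dim + 1  # one extra row/column covers the shifted grids
    active = []
    for k in range(num_tilings):
        offset = k * width / num_tilings
        col = int(math.floor((x - low + offset) / width))
        row = int(math.floor((y - low + offset) / width))
        active.append(k * side * side + row * side + col)
    return active


def tile_value(w, tiles):
    """Value estimate: sum of the weights of the active tiles."""
    return sum(w[i] for i in tiles)


def tile_update(w, tiles, target, alpha):
    """Move the estimate towards target; each active tile gets the same step."""
    error = target - tile_value(w, tiles)
    for i in tiles:
        w[i] += alpha * error
```

With `alpha = 1 / num_tilings` a single call to `tile_update` makes the estimate at the trained point equal to the target, which is the one-trial learning case from the excerpt. Nearby points that share some tiles move part of the way, and distant points are unaffected.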

22 Example: TD-Gammon by Gerald Tesauro (1992)
- Multi-layer perceptron (neural network) represents the value function
- Only reward given at the end of the game
- Self-play: use the current policy to sample moves on both sides!
  - Random policies: thousands of moves
  - Skilled players: ... moves

23 Example: TD-Gammon
- TD-Gammon: raw features (number of pieces at each place)
- Chess: features designed by domain experts
- Only 2-ply lookahead; limiting factor in the endgame
- Comments by Kit Woolsey (world-class player back then):
  - TD-Gammon is particularly good on vague positions: superior weighing of risk against safety
  - not so good on calculable/special positions
  - just the opposite for (old) chess programs
- Same ideas used in recent work on Go, Chess, Shogi:
  - AlphaGo Zero (2017)
  - AlphaZero (2017)

24 Recent successes of deep reinforcement learning
Images from "Human-level control through deep reinforcement learning" and "Continuous control with deep reinforcement learning" (Google DeepMind / Nature)

25 Deep (supervised) learning
- Deep representation is a composition of many functions:
  $x \xrightarrow{w_1} h_1 \xrightarrow{w_2} h_2 \xrightarrow{w_3} \dots \xrightarrow{w_n} h_n \xrightarrow{w_{n+1}} y$
- Linear transformation and non-linear activation functions $h_k$
- Weight sharing
  - Recurrent neural networks: across time steps
  - Convolutional neural networks: across spatial (or temporal) regions
- Weights $w$ optimized by stochastic gradient descent (SGD)
- Powerful function approximation and representation learning: finds compact low-dimensional representation (features)
- State-of-the-art for image, text and audio

26 Stochastic gradient descent
- Differentiable loss function, e.g. $l(w) = \left( V^\pi(s) - \hat{V}(s, w) \right)^2$
- The objective is to minimize the expected loss: $L = \mathbb{E}_\pi[l(w)]$
- Gradient descent finds a local minimum
- Stochastic gradient descent samples the gradient
- Adjust weights in the direction of the negative gradient: $\Delta w_i = -\alpha \frac{\partial l(w)}{\partial w_i}$
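Working the gradient out for the squared loss on this slide shows where the familiar "step size times error times gradient" update comes from. The linear case in the last line, with features $x_i(s)$, is an added assumption for illustration.

```latex
\begin{align}
l(w) &= \left( V^\pi(s) - \hat{V}(s, w) \right)^2 \\
\frac{\partial l(w)}{\partial w_i}
  &= -2 \left( V^\pi(s) - \hat{V}(s, w) \right) \frac{\partial \hat{V}(s, w)}{\partial w_i} \\
\Delta w_i = -\alpha \frac{\partial l(w)}{\partial w_i}
  &= 2 \alpha \left( V^\pi(s) - \hat{V}(s, w) \right) \frac{\partial \hat{V}(s, w)}{\partial w_i} \\
\text{linear case } \hat{V}(s, w) = \sum_j w_j x_j(s):
  \quad \Delta w_i &= 2 \alpha \left( V^\pi(s) - \hat{V}(s, w) \right) x_i(s)
\end{align}
```

The constant factor 2 is usually absorbed into the step size $\alpha$.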

27 Naive deep Q-learning
- Q-learning update rule: $Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$
- Q is represented by a neural network with weights $w$: $Q(s, a, w)$
- Loss is the mean-squared TD error: $L(w) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w) \right)^2 \right]$
- Minimize the loss with SGD: $\frac{\partial L(w)}{\partial w}$

28 Stability
Naive Q-learning with neural networks oscillates or diverges:
1. Data is not i.i.d.: samples within a trajectory are correlated (generated by interaction)
2. Policy changes rapidly with slight changes to Q-values: the policy may oscillate
3. Reward range is unknown: gradients can be large, causing instabilities during back-propagation

29 Deep Q-networks (DQN)
Deep Q-networks (DQN) address instabilities through:
- Experience replay
  - store transitions $(S_t, A_t, R_t, S_{t+1})$
  - sample random mini-batches
  - removes correlation, restores i.i.d. property
- Target network
  - second Q network
  - fixed parameters in target network
  - periodically update target network parameters
- Reward clipping/normalization
  - clip rewards to $r \in [-1, 1]$
  - batch normalization
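The three ingredients fit together as in the following sketch. It is a simplified illustration, not the published DQN code: the class and function names are hypothetical, the `done` flag is an addition to the stored transition, and `q_target` stands in for the frozen target network as any callable that returns a list of action values.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions (s, a, r, s_next, done)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first

    def add(self, s, a, r, s_next, done):
        r = max(-1.0, min(1.0, r))  # reward clipping to [-1, 1]
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random mini-batch: breaks the correlation within a trajectory.
        return random.sample(list(self.buffer), batch_size)


def td_targets(batch, q_target, gamma):
    """Targets r + gamma * max_a' Q_target(s', a') from the frozen target network.

    Assumption: q_target(s) returns a list of action values for state s.
    """
    return [r if done else r + gamma * max(q_target(s_next))
            for (_, _, r, s_next, done) in batch]
```

In a full training loop the online network would be fitted to these targets on each sampled mini-batch, and its parameters copied into the target network only every so many steps.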

30 DQN in Atari
End-to-end learning:
- state: stack of 4 frames, raw pixels
- action: joystick commands (18 discrete actions)
- reward: change in score
Figure 1 (layer labels: Convolution, Convolution, Fully connected, Fully connected, No input): Schematic illustration of the convolutional neural network. The details of the architecture are explained in the Methods. The input to the neural network consists of an image produced by the preprocessing map $\phi$, followed by three convolutional layers (note: snaking blue line symbolizes sliding of each filter across input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, $\max(0, x)$).
Image from "Human-level control through deep reinforcement learning" (Google DeepMind / Nature)

31 DQN in Atari
Image from "Human-level control through deep reinforcement learning" (Google DeepMind / Nature)

32 Recall: POMDPs
Figure: graphical model with the agent, observations $y_0, y_1, y_2$, actions $a_0, a_1, a_2$, states $s_0, s_1, s_2$ and rewards $r_0, r_1, r_2$.
- initial state distribution: $P(s_0)$
- transition probabilities: $P(s_{t+1} \mid s_t, a_t)$
- observation probabilities: $P(y_{t+1} \mid s_{t+1}, a_t)$
- reward probabilities: $P(r_t \mid s_t, a_t)$
- Markov property does not hold for $y$: $P(y_{t+1} \mid y_t, a_t)$ is unknown

33 Example: Tiger problem

34 POMDPs
With model (planning in belief state):
- Beliefs represent a probability distribution over states
- Belief state: $b_t(s_t) = P(s_t \mid b_{t-1}, y_t, r_{t-1}, a_{t-1}, y_{t-1}, \dots)$
- Belief update: $b_t(s') \propto P(y_t \mid s', a_{t-1}) \sum_s P(s' \mid s, a_{t-1}) \, b_{t-1}(s)$
Without model (learning):
1. Ignore the fact that $y$ is not Markovian: learn $y_t \mapsto a_t$
2. Use recent history: $y_{t-k:t}, a_{t-k:t-1} \mapsto a_t$
Note: eligibility traces may help to overcome some of the issues with hidden state
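The belief update is a predict-then-correct step followed by normalization. The sketch below applies it to a Tiger-style listening action. The nested-dictionary layout, the names and the 0.85 listening accuracy are assumptions chosen for illustration; the slides give no numbers.

```python
def belief_update(belief, action, observation, P_trans, P_obs):
    """b_t(s') proportional to P(y_t | s', a) * sum_s P(s' | s, a) * b_{t-1}(s).

    belief:  dict state -> probability
    P_trans: P_trans[action][s][s_next]
    P_obs:   P_obs[action][s_next][observation]
    Assumption: the observation has non-zero probability under the belief.
    """
    unnormalized = {}
    for s_next in belief:
        predicted = sum(P_trans[action][s][s_next] * belief[s] for s in belief)
        unnormalized[s_next] = P_obs[action][s_next][observation] * predicted
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}


# Illustrative Tiger-style model (numbers are assumptions): listening does not
# move the tiger and reports the correct side with probability 0.85.
P_TRANS = {"listen": {"left": {"left": 1.0, "right": 0.0},
                      "right": {"left": 0.0, "right": 1.0}}}
P_OBS = {"listen": {"left": {"hear_left": 0.85, "hear_right": 0.15},
                    "right": {"hear_left": 0.15, "hear_right": 0.85}}}

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "hear_left", P_TRANS, P_OBS)
b2 = belief_update(b1, "listen", "hear_left", P_TRANS, P_OBS)
```

Each consistent observation sharpens the belief: after one "hear_left" the belief in "left" equals the listening accuracy, and a second one pushes it higher.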

35 Is model-free learning always the answer?
- Model-free learning is powerful
  - Surprisingly successful on a variety of domains
- Limitations of model-free learning
  - Given learnt values, behavior is a fixed strategy
  - If the goal changes: need to re-learn values for every state in the world; all previous values are obsolete
  - No general knowledge learned, only values
  - No anticipation of general outcomes ($s'$) of actions
  - No planning possible

36 Wolfgang Köhler (1917) Intelligenzprüfung am Menschenaffen.

37 Laumer et al. (2017) Can hook-bending be let off the hook? Bending/unbending of pliant tools by cockatoos.
