Reinforcement Learning: Value Function Approximation
Continuous state/action space, mean-square error, gradient temporal difference learning, least-squares temporal difference, least-squares policy iteration
Hung Ngo & Vien Ngo, MLR Lab, University of Stuttgart

Outline
- Review: Function Approximation (FA)
  - Representation issues: hypothesis and features
  - Linear and nonlinear models
  - Deep learning
- RL with FA
  - Gradient descent methods
  - Estimating target values
  - Linear RL (online & batch)
  - Nonlinear methods; Deep RL

Example: Value Iteration in Continuous MDP
$$V(s) = \sup_a \left[ r(s, a) + \gamma \int P(s' \mid s, a)\, V(s')\, ds' \right]$$

Model-free RL with Large/Continuous Domains
- RL with table-lookup (tabular) methods:
  - Only feasible for small finite state-action spaces: exponential growth.
  - $Q(s, a)$, $\pi(a \mid s)$: tables/matrices of size $|S| \times |A|$
- RL in large and/or continuous state and/or action spaces:
  - Backgammon: $10^{20}$ states (board size: 28)
  - Computer Go: $10^{35}$ states ($9 \times 9$); $10^{171}$ states ($19 \times 19$)
  - Autonomous helicopter: continuous and high-dimensional spaces $S$ and $A$
- Scalability and generalization issues
- In this course: learn/optimize parametric models
  - for value functions $V(s; \theta)$, $Q(s, a; \theta)$: this lecture
  - for the policy $\pi(a \mid s; \theta)$: policy search, next lecture

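Value Function Approximation
- Parameterized by $\theta$: $V_t(s) \approx V(s; \theta_t)$, $Q_t(s, a) \approx Q(s, a; \theta_t)$
- Generalizes to unvisited states (state-action pairs)
- Many choices for the model parameterized by $\theta$ (i.e., $V(s; \theta)$, $Q(s, a; \theta)$): linear regression, decision trees, nearest neighbors, artificial neural networks (universal function approximators), Gaussian process regression, kernel methods, etc.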

Review: Supervised Learning
- Training data: $D = \{(x_i, y_i)\}_{i=1}^{N}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathcal{Y} \subseteq \mathbb{R}^m$
- Linear learner: $g(z)$, $z = W x$, $W \in \mathbb{R}^{m \times d}$, $g: \mathbb{R}^m \to \mathbb{R}^m$
  - $\hat{y} = g(W x)$: regression; $\hat{y} = \arg\max_{i \in \{1, \dots, m\}} g_i(W x)$: classification
  - $g$: transfer function for the desired output, e.g., i) identity $g(z) = z$ for regression/class margin, ii) class posterior probability $P_Y(i \mid x; W)$, e.g., sigmoid for binary classification, softmax for multiclass.
- Loss function $l(g, x, y) \in \mathbb{R}$: MSE, number of mistakes, negative log-likelihood, hinge, etc.

Review: Input Representation (Feature Extraction)
- Nonlinear features $\phi: \mathbb{R}^d \to \mathbb{R}^k$: the (linear) learner $g(W \phi(x))$ is linear in feature space $\Phi$, but highly nonlinear in input space $X$, i.e., the features increase the expressiveness of the model w.r.t. $X$: better learner!
- Fixed features: designed by domain experts. E.g., tile coding, SIFT, n-grams, MFCC, basis functions, $\phi(x) = (1, x_1, \dots, x_d, x_1^2, x_1 x_2, \dots, x_d^2)$
- Kernel trick: $k(x_1, x_2) = \phi(x_1)^\top \phi(x_2)$, e.g., $(x_1^\top x_2 + c)^n$, Gaussian, ...
- Learned features: $\phi(x) = h(w_h^\top x)$, an artificial neuron; stacking them gives a deep NN
  - $h$: nonlinear activation function; sigmoid, tanh, rectifier/hinge $\max(0, z)$
  - Multiple layers of features: $\phi(x) = h_j(W_j\, h_{j-1}(W_{j-1} \cdots h_0(W_0 x)))$
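
The kernel-trick identity can be checked numerically. The sketch below assumes the simplest case $c = 0$, $n = 2$ in two dimensions, for which one explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$; the function names and test points are illustrative, not from the slides.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (assumes c = 0, n = 2)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(x, y):
    """Homogeneous polynomial kernel k(x, y) = (x . y)^2, computed in input space."""
    return float(np.dot(x, y)) ** 2

# Two arbitrary illustrative points.
x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = float(phi(x) @ phi(y))   # inner product in feature space
rhs = poly_kernel(x, y)        # same quantity without ever building phi
print(lhs, rhs)
```

Both numbers agree up to floating-point rounding, which is the point of the trick: the kernel evaluates the feature-space inner product at input-space cost.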

Example: Coarse Coding (2D)
- Generalization from X to Y depends on the number of their features whose receptive fields (circles) overlap.
- [Figure: overlapping circular receptive fields around two points X and Y. Here: slight generalization.]
- [Figure panels: a) Narrow generalization, b) Broad generalization, c) Asymmetric generalization]
- Sparse, over-complete, distributed representation

Example: Tile Coding (2D)
- Receptive fields of the features are grouped into exhaustive partitions of the input space. Each such partition is a tiling, and each element of the partition is a tile. Each tile is the receptive field for one binary feature.
- [Figure: two overlapping tilings (tiling #1, tiling #2) of a 2D state space.]
- Shape of tiles: determines the generalization. Number of tilings: determines the resolution of the final approximation.
- Exactly one feature is present in each tiling, so the total number of features present is always the same as the number of tilings.
- Also known as CMAC (cerebellar model articulation controller)
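
A minimal sketch of tile coding for a point in the unit square. It assumes uniform square tiles and tilings that are shifted diagonally by equal fractions of a tile width; the function name, grid sizes and offsets are illustrative choices, not taken from the slides.

```python
import numpy as np

def tile_features(x, y, n_tilings=4, tiles_per_dim=5, lo=0.0, hi=1.0):
    """Binary tile-coding features for a point in [lo, hi]^2.

    Each tiling is a uniform grid shifted by k/n_tilings of a tile width,
    so exactly one tile per tiling is active.
    """
    width = (hi - lo) / tiles_per_dim
    side = tiles_per_dim + 1              # one extra tile per dimension covers the shift
    phi = np.zeros(n_tilings * side * side)
    for k in range(n_tilings):
        offset = k * width / n_tilings
        ix = min(int((x - lo + offset) / width), side - 1)
        iy = min(int((y - lo + offset) / width), side - 1)
        phi[k * side * side + iy * side + ix] = 1.0
    return phi

a = tile_features(0.43, 0.43)
print(int(a.sum()))                               # number of active features
print(int(a @ tile_features(0.44, 0.44)))         # shared tiles with a very close point
print(int(a @ tile_features(0.47, 0.43)))         # shared tiles with a nearby point
print(int(a @ tile_features(0.90, 0.90)))         # shared tiles with a distant point
```

The number of shared active tiles shrinks as points move apart, which is how a linear learner on these features generalizes locally.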

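Example: Radial Basis Functions
- $\phi_i(x) = \mathrm{rbf}(d_i(x))$, e.g., Gaussian kernel $\mathrm{rbf}(d) = \exp\left(-\frac{d^2}{2}\right)$
- $d_i^2(x) = \|x - c_i\|^2_{\Sigma_i} = (x - c_i)^\top \Sigma_i^{-1} (x - c_i)$: distance in metric $\Sigma_i$ (e.g., $\Sigma_i = \sigma_i^2 I$: $\phi_i(x) = \exp\left(-\frac{\|x - c_i\|^2}{2\sigma_i^2}\right)$)
- Prototype/center $c_i$, kernel width $\sigma_i$
- [Figure: overlapping Gaussian bumps of width $\sigma_i$ centered at $c_{i-1}$, $c_i$, $c_{i+1}$.]
- An RBF's receptive field has continuous coverage, whereas the receptive fields of the previous methods discretize the input space.
- If the parameters $\{c_i, \Sigma_i\}$ are learned: RBF network.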
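
A short sketch of the isotropic case $\Sigma_i = \sigma^2 I$ with a shared width. The centers, the width and the function name are illustrative assumptions.

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """Gaussian RBF features phi_i(x) = exp(-||x - c_i||^2 / (2 sigma^2))."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    centers = np.asarray(centers, dtype=float).reshape(len(centers), -1)
    d2 = np.sum((centers - x) ** 2, axis=1)    # squared distance to every center
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Five evenly spaced 1-D centers (an assumption for illustration).
centers = np.linspace(0.0, 1.0, 5)
phi = rbf_features(0.5, centers, sigma=0.25)
print(np.round(phi, 4))
```

Every feature is strictly positive and decays smoothly with distance from its center, in contrast to the binary features of coarse and tile coding.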

Example: Polynomial Kernel
[Figure: examples of SVMs with kernels; the feature map $\phi$ takes the input axes $(x, y)$ to the feature axes $(x^2, y^2)$.]

Example: Tabular Basis
- Naive features for a finite discrete state space $S = \{s_1, \dots, s_n\}$:
  $\phi(s) = (\delta_{s_1}(s), \delta_{s_2}(s), \dots, \delta_{s_n}(s))^\top$
- $\delta_{s_i}(s) \in \{0, 1\}$: Dirac function. 1-of-K coding, or one-hot vector.
- For (simple) generalization: bias weight and augmented feature
  $\phi(s) = (1, \delta_{s_1}(s), \delta_{s_2}(s), \dots, \delta_{s_n}(s))^\top$

Example: Perceptron, Frank Rosenblatt, 1957!
- [Figure: perceptron with fixed features $\phi(x)$.]
- Paul Werbos, backpropagation for MLP: 1974.
- Kunihiko Fukushima's deep CNN: 1980.

Example: Deep Learning, 2012
- A lot of training data available + computational power (GPU)
- Good learning priors (hierarchical, modular, sparse coding, pooling, regularizers such as dropout, invariances, etc.) and training algorithms.

Review: Gradient Descent (Steepest Descent)
- For $f: \mathbb{R}^n \to \mathbb{R}$: $f(x + \delta x) \approx f(x) + \nabla_x f^\top \delta x + \frac{1}{2}\, \delta x^\top H_x\, \delta x$
- Steepest descent direction: $\hat{\delta x} = -\nabla_x f$ (true only for small step size)
- Effects: initial point, direction, step size; second-order methods
- [Figure panels: 2nd Order, Plain Gradient, Conjugate Gradient]

Approximate RL: Formulation & Solution Techniques

Approx. RL: Supervised Learning Formulation
- Given training data: $D = \{(s_t, a_t, r_{t+1}, s_{t+1})\}_{t=0}^{H} \sim$ some distribution $P(\cdot)$
- Minimize the mean-squared error (similarly for $Q^\pi(s, a)$):
  $L(\theta) = E_{s \sim P(\cdot)}\left[ \left( V^\pi(s) - V(s; \theta) \right)^2 \right] \approx E_{s \sim D}\left[ \left( V^\pi(s) - V(s; \theta) \right)^2 \right]$
- Analytical solution: $\nabla_\theta L(\theta) = 0 \iff \sum_{s \in D} \left( V^\pi(s) - V(s; \theta) \right) \nabla_\theta V(s; \theta) = 0$
  - E.g., linear least squares methods (LSTD, LSPI)
- Gradient-descent solution: batch (offline), SGD (incremental/online), mini-batch (hybrid)
  - SGD samples the (full) gradient at each $s_t$ / $(s_t, a_t)$
  - Least mean squares (LMS), also known as the Widrow-Hoff rule: SGD update
    $\theta_{t+1} = \theta_t + \alpha_t \left( V^\pi(s_t) - V(s_t; \theta_t) \right) \nabla_\theta V(s_t; \theta_t)$
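
A minimal sketch of the LMS (Widrow-Hoff) update for a linear model, under the idealized assumption that the true targets $V^\pi(s_t)$ are available. The "true" weight vector, the random features and the step size are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0, 0.5])   # hypothetical weights generating the targets
theta = np.zeros(3)
alpha = 0.05                              # constant step size (an assumption)

for t in range(5000):
    phi = rng.normal(size=3)              # feature vector of the sampled state s_t
    target = theta_true @ phi             # stands in for V^pi(s_t)
    # For a linear model the gradient of V(s; theta) w.r.t. theta is phi.
    theta += alpha * (target - theta @ phi) * phi

print(np.round(theta, 6))
```

With noise-free targets the iterate converges to the generating weights; in RL the target is not given and has to be estimated.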

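Approximate RL: Estimating Target Value
- No true value functions $V^\pi(s)$ / $Q^\pi(s, a)$ are given by a supervisor!
  - Use an estimate $v_t$ in place of the target $V^\pi(s_t)$
  - The estimate is unbiased if $E_P[v(s_t)] = V^\pi(s_t)$
- MC: $\theta_{t+1} = \theta_t + \alpha_t \left( R_t - V(s_t; \theta_t) \right) \nabla_\theta V(s_t; \theta_t)$
- TD(0): $\theta_{t+1} = \theta_t + \alpha_t \left( r_{t+1} + \gamma V(s_{t+1}; \theta_t) - V(s_t; \theta_t) \right) \nabla_\theta V(s_t; \theta_t)$
- TD($\lambda$): biased for $\lambda < 1$
  - Forward view: $\theta_{t+1} = \theta_t + \alpha_t \left( R_t^\lambda - V(s_t; \theta_t) \right) \nabla_\theta V(s_t; \theta_t)$
  - Backward view: $\theta_{t+1} = \theta_t + \alpha_t \left[ r_{t+1} + \gamma V(s_{t+1}; \theta_t) - V(s_t; \theta_t) \right] e_t$, with $e_t = \gamma \lambda e_{t-1} + \nabla_\theta V(s_t; \theta_t)$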
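
The targets $R_t$ and $R_t^\lambda$ are used above without being written out. The block below states the standard definitions they presumably refer to; the episode end time $T$ and the $n$-step return $R_t^{(n)}$ are notation introduced here, not in the slides.

```latex
% Monte Carlo return from time t in an episode ending at time T
R_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k+1}

% n-step return: n sampled rewards, then bootstrap from the current estimate
R_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k+1} + \gamma^{n}\, V(s_{t+n}; \theta_t)

% lambda-return: geometrically weighted average of all n-step returns
R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}
```

Setting $\lambda = 0$ recovers the TD(0) target $r_{t+1} + \gamma V(s_{t+1}; \theta_t)$, and $\lambda = 1$ recovers the Monte Carlo return for episodic tasks.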

Linear RL

RL with Linear Function Approximation
- Input representation: fixed features $\phi$
- Hypothesis representation: $\theta_t$, a weight vector
  - prediction = linear combination of feature values, i.e., a dot product:
    $V(s; \theta) = \theta^\top \phi(s)$, $Q(s, a; \theta) = \theta^\top \phi(s, a)$, $\theta \in \mathbb{R}^d$
- Learning: $\theta_{t+1} \leftarrow \theta_t$ given $(s_t, a_t, r_{t+1}, s_{t+1})$
  - The update rule is particularly simple with $\nabla_\theta V(s; \theta) = \phi(s)$

Linear TD(λ) for Online Prediction
- TD(0) with function approximation:
  $\theta_{t+1} = \theta_t + \alpha_t \left[ r_{t+1} + \gamma V(s_{t+1}; \theta_t) - V(s_t; \theta_t) \right] \phi(s_t)$
- TD($\lambda$) (with eligibility trace):
  $e_t = \gamma \lambda e_{t-1} + \phi(s_t)$
  $\theta_{t+1} = \theta_t + \alpha_t \left[ r_{t+1} + \gamma V(s_{t+1}; \theta_t) - V(s_t; \theta_t) \right] e_t$
- Similarly for the action-value function $Q^\pi(s_t, a_t)$
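
A minimal sketch of the backward-view update, assuming a hypothetical 5-state random walk (terminate left with reward 0, right with reward 1, undiscounted) and one-hot features. The environment, step size and episode count are illustrative assumptions; the true values for this chain are $1/6, \dots, 5/6$.

```python
import numpy as np

def linear_td_lambda(n_states=5, episodes=5000, alpha=0.02, gamma=1.0, lam=0.5, seed=0):
    """Linear TD(lambda) prediction for the uniform random policy on a chain."""
    rng = np.random.default_rng(seed)
    phi = np.eye(n_states)                    # one-hot (tabular) features
    theta = np.zeros(n_states)
    for _ in range(episodes):
        s = n_states // 2                     # start in the middle
        e = np.zeros(n_states)                # eligibility trace, reset per episode
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            done = s_next < 0 or s_next >= n_states
            r = 1.0 if s_next >= n_states else 0.0
            v_next = 0.0 if done else theta @ phi[s_next]
            delta = r + gamma * v_next - theta @ phi[s]   # TD error
            e = gamma * lam * e + phi[s]                  # accumulate trace
            theta += alpha * delta * e
            if done:
                break
            s = s_next
    return theta

theta = linear_td_lambda()
true_v = np.arange(1, 6) / 6.0
print(np.round(theta, 2), np.round(true_v, 2))
```

With a constant step size the estimate keeps fluctuating around the true values rather than settling exactly; a decaying $\alpha_t$ would remove that residual noise.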

Linear SARSA(λ) for On-Policy Control
Initialize θ = 0 (a vector of zeros)
While (!converge)
    e = 0
    Initialize state s ∼ S_0; a ← π(s), e.g., ɛ-greedy w.r.t. Q(s, ·; θ)
    Repeat (for each step of episode)
        Execute a, observe r, s'
        e ← e + φ(s, a)
        If s' is terminal, then θ ← θ + αe(r − Q(s, a; θ)); go to next episode
        a' ← π(s'), e.g., ɛ-greedy w.r.t. Q(s', ·; θ)
        θ ← θ + αe[r + γQ(s', a'; θ) − Q(s, a; θ)]
        e ← γλe
        s ← s', a ← a'
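
The pseudocode above can be sketched in Python as follows. The environment is a hypothetical 6-cell corridor (reward -1 per step, goal at the right end) with one-hot state-action features; the corridor, the hyperparameters and all function names are assumptions made for illustration.

```python
import numpy as np

N_STATES, N_ACTIONS, GOAL = 6, 2, 5       # hypothetical corridor; action 0 = left, 1 = right

def phi(s, a):
    """One-hot state-action features (the tabular basis)."""
    f = np.zeros(N_STATES * N_ACTIONS)
    f[s * N_ACTIONS + a] = 1.0
    return f

def step(s, a):
    """Deterministic move; returns (reward, next state, terminal flag)."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    return -1.0, s_next, s_next == GOAL

def eps_greedy(theta, s, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax([theta @ phi(s, a) for a in range(N_ACTIONS)]))

def sarsa_lambda(episodes=2000, alpha=0.05, gamma=0.95, lam=0.8, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(N_STATES * N_ACTIONS)
    for _ in range(episodes):
        e = np.zeros_like(theta)
        s = 0
        a = eps_greedy(theta, s, eps, rng)
        while True:
            r, s_next, done = step(s, a)
            e += phi(s, a)                              # accumulating trace
            q = theta @ phi(s, a)
            if done:
                theta += alpha * e * (r - q)
                break
            a_next = eps_greedy(theta, s_next, eps, rng)
            theta += alpha * e * (r + gamma * (theta @ phi(s_next, a_next)) - q)
            e *= gamma * lam
            s, a = s_next, a_next
    return theta

theta = sarsa_lambda()
greedy = [int(np.argmax([theta @ phi(s, a) for a in range(N_ACTIONS)])) for s in range(GOAL)]
print(greedy)   # greedy action in each non-terminal state
```

Swapping `phi` for tile-coded or RBF features changes only the feature function; the update loop stays the same.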

Linear Q(λ) for Off-Policy Control
Initialize θ = 0 (a vector of zeros)
While (!converge)
    e = 0
    Initialize state s ∼ S_0
    Repeat (for each step of episode)
        a ← π(s), e.g., ɛ-greedy w.r.t. Q(s, ·; θ)
        If a ≠ argmax_{a'} Q(s, a'; θ), then e = 0
        Execute a, observe r, s'
        e ← e + φ(s, a)
        If s' is terminal, then θ ← θ + αe(r − Q(s, a; θ)); go to next episode
        θ ← θ + αe[r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ)]
        e ← γλe
        s ← s'

SARSA(λ) with tile-coding function approximation

λ: How Much Bootstrapping?

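Linear Least Squares Prediction
- Recall: batch (offline) training of linear models has an analytical solution, $\nabla_\theta L(\theta) = 0$:
  $0 = \sum_{s \in D} \left( V^\pi(s) - V(s; \theta) \right) \nabla_\theta V(s; \theta)$
  $0 = \sum_{s \in D} \left( V^\pi(s) - \phi(s)^\top \theta \right) \phi(s)$
  $\theta = \left( \sum_{s \in D} \phi(s) \phi(s)^\top \right)^{-1} \sum_{s \in D} \phi(s) V^\pi(s)$
- Sherman-Morrison incremental update ($O(d^2)$) of $A^{-1}$, with $B = A^{-1}$:
  $(A + u v^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u} = B - \frac{B u v^\top B}{1 + v^\top B u}$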
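
A small numerical check of the Sherman-Morrison identity. The matrix and vectors are arbitrary illustrative values, chosen so that $1 + v^\top A^{-1} u \neq 0$.

```python
import numpy as np

def sherman_morrison_update(B, u, v):
    """Return (A + u v^T)^{-1} given B = A^{-1}, using only O(d^2) work."""
    Bu = B @ u
    vB = v @ B
    return B - np.outer(Bu, vB) / (1.0 + v @ Bu)

# Arbitrary illustrative values.
A = np.diag([1.0, 2.0, 3.0, 4.0])
u = np.array([1.0, 0.0, 2.0, 0.0])
v = np.array([0.5, 1.0, 0.0, 1.0])

B_new = sherman_morrison_update(np.linalg.inv(A), u, v)
B_direct = np.linalg.inv(A + np.outer(u, v))   # O(d^3) reference
print(np.max(np.abs(B_new - B_direct)))
```

In a least-squares learner each new sample contributes one rank-one term $u v^\top$, so the inverse can be maintained incrementally instead of being recomputed from scratch.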

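Linear Least Squares Prediction
- Again, many choices of estimate $v(s_t)$ for the true target $V^\pi(s_t)$:
  - LSMC: $v(s_t) = R_t$
  - LSTD: $v(s_t) = r_{t+1} + \gamma V(s_{t+1}; \theta) = r_{t+1} + \gamma \phi(s_{t+1})^\top \theta$
  - LSTD($\lambda$): $v(s_t) = R_t^\lambda$
- In each case, solve directly for the fixed point of MC / TD(0) / TD($\lambda$):
  - LSMC: $\theta = \left( \sum_{s \in D} \phi(s) \phi(s)^\top \right)^{-1} \sum_{s \in D} \phi(s) R(s)$, where $R(s)$ is the return observed from $s$
  - LSTD: $\theta = \left( \sum_{s \in D} \phi(s) \left( \phi(s) - \gamma \phi(s') \right)^\top \right)^{-1} \sum_{s \in D} \phi(s) r(s)$
  - LSTD($\lambda$): $\theta = \left( \sum_{s \in D} E(s) \left( \phi(s) - \gamma \phi(s') \right)^\top \right)^{-1} \sum_{s \in D} E(s) r(s)$, which solves $\sum_{s \in D} \alpha\, \delta(s) E(s) = 0$ ($E(s)$: eligibility trace at $s$, $\delta(s)$: TD error)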
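
A minimal sketch of batch LSTD, assuming the same kind of hypothetical 5-state random walk (reward 1 on exiting to the right, undiscounted) with one-hot features. The data-collection helper and all names are illustrative; terminal successors are given the zero feature vector.

```python
import numpy as np

def collect_random_walk(n_states=5, episodes=2000, seed=0):
    """Transitions (s, r, s_next) of a random walk; s_next is None at termination."""
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(episodes):
        s = n_states // 2
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            done = s_next < 0 or s_next >= n_states
            r = 1.0 if s_next >= n_states else 0.0
            data.append((s, r, None if done else s_next))
            if done:
                break
            s = s_next
    return data

def lstd(data, phi, gamma=1.0):
    """Solve A theta = b with A = sum phi (phi - gamma phi')^T and b = sum phi r."""
    d = phi.shape[1]
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, r, s_next in data:
        f = phi[s]
        f_next = np.zeros(d) if s_next is None else phi[s_next]
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)

phi = np.eye(5)                       # one-hot features
theta = lstd(collect_random_walk(), phi)
true_v = np.arange(1, 6) / 6.0
print(np.round(theta, 3), np.round(true_v, 3))
```

No step size is involved: the batch is used once to build $A$ and $b$, which is why least-squares methods are more sample-efficient than the incremental TD updates.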

Linear Least Squares Control
- Policy evaluation: off-policy LSTDQ
  $\theta = \left( \sum_{(s, a, r, s') \in D} \phi(s, a) \left( \phi(s, a) - \gamma \phi(s', \pi(s')) \right)^\top \right)^{-1} \sum_{(s, a, r, s') \in D} \phi(s, a)\, r$
- Policy improvement: LSPI iterates over $D$
- Least Squares Policy Iteration Algorithm: the following pseudocode uses LSTDQ for policy evaluation. It repeatedly re-evaluates experience $D$ with different policies.

function LSPI-TD(D, π_0)
    π' ← π_0
    repeat
        π ← π'
        Q ← LSTDQ(π, D)
        for all s ∈ S do
            π'(s) ← argmax_{a ∈ A} Q(s, a)
        end for
    until (π ≈ π')
    return π
end function
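
A runnable sketch of the LSPI loop with LSTDQ as the evaluation step. It assumes a hypothetical deterministic corridor (reward -1 per step, goal at the right end), one-hot state-action features, and a batch containing each state-action pair once; the stopping test compares successive weight vectors instead of policies, which is a simplification.

```python
import numpy as np

N_ACTIONS, GOAL, GAMMA = 2, 5, 0.95       # hypothetical corridor: states 0..4, goal = 5
D = GOAL * N_ACTIONS                      # one feature per non-terminal (s, a) pair

def phi(s, a):
    f = np.zeros(D)
    f[s * N_ACTIONS + a] = 1.0
    return f

def step(s, a):
    """Action 0 = left, 1 = right; returns (reward, next state, terminal flag)."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    return -1.0, s_next, s_next == GOAL

def greedy(theta, s):
    return int(np.argmax([theta @ phi(s, a) for a in range(N_ACTIONS)]))

def lstdq(data, theta_pi):
    """Evaluate the policy that is greedy w.r.t. theta_pi on the fixed batch."""
    A = np.zeros((D, D))
    b = np.zeros(D)
    for s, a, r, s_next, done in data:
        f = phi(s, a)
        f_next = np.zeros(D) if done else phi(s_next, greedy(theta_pi, s_next))
        A += np.outer(f, f - GAMMA * f_next)
        b += f * r
    return np.linalg.solve(A, b)

def lspi(data, max_iter=20, tol=1e-8):
    theta = np.zeros(D)
    for _ in range(max_iter):
        theta_new = lstdq(data, theta)            # evaluation + implicit greedy improvement
        if np.max(np.abs(theta_new - theta)) < tol:
            break
        theta = theta_new
    return theta_new

data = [(s, a, *step(s, a)) for s in range(GOAL) for a in range(N_ACTIONS)]
theta = lspi(data)
policy = [greedy(theta, s) for s in range(GOAL)]
print(policy)
```

The same batch `data` is reused in every iteration; only the policy used to pick the successor action changes.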

LSPI: Bellman residual minimization
- The Q-function of a given policy $\pi$ satisfies, for any $s, a$:
  $Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, Q^\pi(s', \pi(s'))$
- Written as an optimization: minimize the Bellman residual error
  $L(Q^\pi) = \left\| R + \gamma P \Pi_\pi Q^\pi - Q^\pi \right\|$
  where $\Pi_\pi$ is the matrix describing the policy $\pi$: it maps each next state $s'$ to the state-action pair $(s', \pi(s'))$.

LSPI: Bellman residual minimization
- With $Q^\pi \approx \Phi \beta$, the Bellman residual minimizing solution (this is an overconstrained system) is
  $\beta^\pi = \left( (\Phi - \gamma P \Pi_\pi \Phi)^\top (\Phi - \gamma P \Pi_\pi \Phi) \right)^{-1} (\Phi - \gamma P \Pi_\pi \Phi)^\top r$
- The solution $\beta^\pi$ of the system is unique since the columns of $\Phi$ (the basis functions) are linearly independent by definition.
(See Lagoudakis & Parr (JMLR 2003) for details.)

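LSPI: Least Squares Fixed-Point Approximation
- Project $T_\pi Q$ back onto $\mathrm{span}(\Phi)$:
  $\hat{T}_\pi(Q) = \Phi (\Phi^\top \Phi)^{-1} \Phi^\top (T_\pi Q)$
- Minimize the projected Bellman residual: $\hat{T}_\pi(Q) = Q$
- The approximate fixed point:
  $\beta^\pi = \left( \Phi^\top (\Phi - \gamma P \Pi_\pi \Phi) \right)^{-1} \Phi^\top r$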
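
The step from the projected fixed-point condition to the closed form is short. The derivation below is a sketch that assumes $Q = \Phi\beta$, $T_\pi Q = r + \gamma P \Pi_\pi Q$ with $\Pi_\pi$ the policy matrix, and that $\Phi$ has linearly independent columns.

```latex
\begin{align*}
\Phi(\Phi^\top\Phi)^{-1}\Phi^\top\,(r + \gamma P\Pi_\pi\Phi\beta) &= \Phi\beta
  && \text{projected fixed point } \hat{T}_\pi(Q) = Q \\
\Phi^\top (r + \gamma P\Pi_\pi\Phi\beta) &= \Phi^\top\Phi\,\beta
  && \text{left-multiply by } \Phi^\top \\
\Phi^\top(\Phi - \gamma P\Pi_\pi\Phi)\,\beta &= \Phi^\top r
  && \text{collect terms in } \beta \\
\beta^\pi &= \bigl(\Phi^\top(\Phi - \gamma P\Pi_\pi\Phi)\bigr)^{-1}\Phi^\top r
\end{align*}
```

Replacing the matrix products by sums over sampled transitions gives the LSTDQ estimate, since $\Phi^\top(\Phi - \gamma P \Pi_\pi \Phi)$ corresponds to $\sum \phi(s,a)\,(\phi(s,a) - \gamma\phi(s',\pi(s')))^\top$ and $\Phi^\top r$ to $\sum \phi(s,a)\, r$.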

LSPI: Comparison of the two views
- Bellman residual minimization: focuses on the magnitude of the change.
- Least-squares fixed-point approximation: focuses on the direction of the change.
- The least-squares fixed-point approximation is less stable and less predictable.
- The least-squares fixed-point method might nevertheless be preferable because:
  - Learning the Bellman residual minimizing approximation requires doubled samples (two independent next-state samples per transition).
  - Experimentally, it often delivers policies that are superior.
(See Lagoudakis & Parr (JMLR 2003) for details.)

Convergence Properties

Prediction:

On/Off-Policy  Algorithm  Table Lookup  Linear  Non-Linear
On-Policy      MC         ✓             ✓       ✓
               LSMC       ✓             ✓       -
               TD         ✓             ✓       ✗
               LSTD       ✓             ✓       -
Off-Policy     MC         ✓             ✓       ✓
               LSMC       ✓             ✓       -
               TD         ✓             ✗       ✗
               LSTD       ✓             ✓       -

Convergence of Control Algorithms:

Algorithm            Table Lookup  Linear  Non-Linear
Monte-Carlo Control  ✓             (✓)     ✗
Sarsa                ✓             (✓)     ✗
Q-learning           ✓             ✗       ✗
LSPI                 ✓             (✓)     -

(✓) = chatters around near-optimal value function

Linear Q-Learning Diverges Sometimes
A counter-example from Baird (1995) (due to bootstrapping + not following true gradients)

Using exact updates (DP backups) for policy evaluation:
$$\theta_{t+1} = \theta_t + \alpha \sum_s P(s) \left[ E\{ r_{t+1} + \gamma V(s_{t+1}; \theta_t) \mid s_t = s \} - V(s; \theta_t) \right] \nabla_\theta V(s; \theta_t)$$
If $P(s)$ is uniform ($\neq$ the true stationary distribution of the Markov chain), then the asymptotic behaviour can become unstable.

Problems with Gradient-Descent TD Methods
- With nonlinear FA, on-policy methods like SARSA may still diverge (Tsitsiklis and Van Roy, 1997).
- With linear FA, off-policy methods like Q-learning may diverge (Baird, 1995).
- They are not true gradient-descent methods.
- Gradient TD methods (Sutton et al., 2008, 2009) can solve all of the above problems: they follow the true gradient direction of the projected Bellman error.

Gradient temporal difference learning
- GTD (gradient temporal difference learning)
- GTD2 (gradient temporal difference learning, version 2)
- TDC (temporal difference learning with gradient corrections)

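Value Function Geometry
- Bellman operator: $T V = R + \gamma P V$
- [Figure: the plane is the space spanned by the feature vectors; $T$ takes $V_\theta$ out of it and the projection $\Pi$ brings it back.]
- RMSBE: residual mean-squared Bellman error
- RMSPBE: residual mean-squared projected Bellman error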

TD performance measure
- Error from the true value: $\| V_\theta - V \|_P^2$
- Error in the Bellman update (used in the previous section: TD(0), GTD(0) methods): $\| V_\theta - T V_\theta \|_P^2$
- Error in the Bellman update after projection (TDC and GTD2 methods): $\| V_\theta - \Pi T V_\theta \|_P^2$

TD performance measure
- GTD(0): the norm of the expected TD update
  $\mathrm{NEU}(\theta) = E(\delta \phi)^\top E(\delta \phi)$
- GTD2 and TDC: the norm of the expected TD update weighted by the covariance matrix of the features ($\delta$ is the TD error)
  $\mathrm{MSPBE}(\theta) = E(\delta \phi)^\top E(\phi \phi^\top)^{-1} E(\delta \phi)$
- (GTD2 and TDC differ slightly in their derivation of the approximate gradient direction.)

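Updates
- GTD(0):
  $\theta_{t+1} = \theta_t + \alpha_t \left( \phi(s_t) - \gamma \phi(s_{t+1}) \right) \phi(s_t)^\top w_t\, \pi(s_t, a_t)$
  $w_{t+1} = w_t + \beta_t \left( \delta_t \phi(s_t) - w_t \right) \pi(s_t, a_t)$
- GTD2:
  $\theta_{t+1} = \theta_t + \alpha_t \left( \phi(s_t) - \gamma \phi(s_{t+1}) \right) \phi(s_t)^\top w_t$
  $w_{t+1} = w_t + \beta_t \left( \delta_t - \phi(s_t)^\top w_t \right) \phi(s_t)$
- TDC:
  $\theta_{t+1} = \theta_t + \alpha_t \delta_t \phi(s_t) - \alpha_t \gamma \phi(s_{t+1})\, \phi(s_t)^\top w_t$
  $w_{t+1} = w_t + \beta_t \left( \delta_t - \phi(s_t)^\top w_t \right) \phi(s_t)$
where $\delta_t = r_{t+1} + \gamma \theta_t^\top \phi(s_{t+1}) - \theta_t^\top \phi(s_t)$ is the TD error.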
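
A minimal sketch of the TDC update pair, run on-policy on a hypothetical 5-state random walk with one-hot features (reward 1 on exiting to the right, undiscounted). This setting is an assumption chosen because the answer is known ($1/6, \dots, 5/6$); it does not demonstrate the off-policy case that motivates the method.

```python
import numpy as np

def tdc(n_states=5, episodes=5000, alpha=0.02, beta=0.05, gamma=1.0, seed=0):
    """TDC with primary weights theta and secondary weights w."""
    rng = np.random.default_rng(seed)
    phi = np.eye(n_states)
    theta = np.zeros(n_states)
    w = np.zeros(n_states)
    for _ in range(episodes):
        s = n_states // 2
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            done = s_next < 0 or s_next >= n_states
            r = 1.0 if s_next >= n_states else 0.0
            f = phi[s]
            f_next = np.zeros(n_states) if done else phi[s_next]
            delta = r + gamma * (theta @ f_next) - theta @ f      # TD error
            # Plain TD step plus the gradient-correction term.
            theta += alpha * (delta * f - gamma * f_next * (f @ w))
            # w tracks the expected TD error given the features.
            w += beta * (delta - f @ w) * f
            if done:
                break
            s = s_next
    return theta, w

theta, w = tdc()
true_v = np.arange(1, 6) / 6.0
print(np.round(theta, 2), np.round(w, 2))
```

Near the TD fixed point the secondary weights `w` stay close to zero, so the correction term vanishes and the update behaves like ordinary TD(0).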

Results on Baird's counterexample.

Gradient TD: Summary
Gradient TD algorithms with linear function approximation:
- are guaranteed to converge under both general on-policy and off-policy training;
- have a computational complexity of only $O(n)$ ($n$ is the number of features);
- the curse of dimensionality is removed.

Nonlinear & Deep RL

TD-Gammon
- By Gerry Tesauro (1992, 1994, 1995). World-class level.
- Training data: playing about 300,000 games against itself
- [Figure: backgammon board with points numbered 1-24; white pieces move counterclockwise, black pieces move clockwise.]

TD-Gammon
- TD-Gammon used the gradient-descent form of TD(λ), with the gradients computed by the error backpropagation algorithm.
- Input encoding (fragment of a textbook excerpt shown on the slide; its beginning and some margins are cut off): if there were more than three pieces on a point, the fourth unit also came on, to a degree indicating the number of additional pieces beyond three. With $n$ the total number of pieces on the point, if $n > 3$ the fourth unit took the value $(n - 3)/2$. Two additional units encoded the number of white and black pieces on the bar (each took the value $n/2$, where $n$ is the number of pieces on the bar).
- [Figure 15.2: The neural network used in TD-Gammon. Backgammon position (198 input units), hidden units (40-80), output: predicted probability of winning $\hat{v}(S_t, \theta)$; TD error $\hat{v}(S_{t+1}, \theta) - \hat{v}(S_t, \theta)$.]
- Later versions combine learned value functions and decision-time search.

Fitted Q-Iteration
- Given all experience $D = \{(s_t, a_t, r_{t+1}, s_{t+1})\}_{t=0}^{H}$.
- FQI approximates $Q(s, a; \theta)$ using offline (batch, not SGD) supervised regression, by constructing a training dataset at each iteration $k$:
  $D_k = \{ ((s_t, a_t), \hat{Q}_k(s_t, a_t)) \}_{t=0}^{H}$, where $\hat{Q}_k(s_t, a_t) = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta_{k-1})$
- Hence, FQI can be seen as approximate Q-iteration.
- At each iteration, any regression technique can be used: neural networks, radial basis function networks, regression trees, etc.
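
A minimal sketch of the FQI loop, with ordinary least squares (`numpy.linalg.lstsq`) standing in for the regressor. The corridor environment, the one-hot features and the batch (each state-action pair once) are illustrative assumptions; any regression method could replace the least-squares fit.

```python
import numpy as np

N_ACTIONS, GOAL, GAMMA = 2, 5, 0.95       # hypothetical corridor: states 0..4, goal = 5
D = GOAL * N_ACTIONS

def phi(s, a):
    f = np.zeros(D)
    f[s * N_ACTIONS + a] = 1.0
    return f

def step(s, a):
    """Action 0 = left, 1 = right; reward -1 per step; returns (r, s_next, done)."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    return -1.0, s_next, s_next == GOAL

def fitted_q_iteration(data, n_iter=20):
    X = np.array([phi(s, a) for s, a, r, s_next, done in data])   # regression inputs
    theta = np.zeros(D)
    for _ in range(n_iter):
        # Regression targets built from the previous iterate theta_{k-1}.
        y = np.array([
            r if done else
            r + GAMMA * max(theta @ phi(s_next, b) for b in range(N_ACTIONS))
            for s, a, r, s_next, done in data
        ])
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)             # supervised fit
    return theta

data = [(s, a, *step(s, a)) for s in range(GOAL) for a in range(N_ACTIONS)]
theta = fitted_q_iteration(data)
policy = [int(np.argmax([theta @ phi(s, a) for a in range(N_ACTIONS)])) for s in range(GOAL)]
print(policy, round(float(theta @ phi(0, 1)), 4))
```

Each pass is one full regression on freshly computed targets, so with an exact regressor the loop reduces to Q-iteration on the sampled transitions.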

Fitted Q-Iteration: Mountain Car Benchmark
- The control interval is $\Delta t = 0.05$ s.
- Actions are continuous, restricted to the interval $[-4, 4]$.
- Using a neural network trained with RPROP: configuration 3-5-5-1 (input-hidden-hidden-output layers).
- Training trajectories had a maximum length of 50 primitive control steps.
- After each trajectory/episode, run one FQI iteration.
- Neural fitted Q-learning by Martin Riedmiller, 2005.

Deep Fitted Q-Iteration: Neuro Slot Car Racer
- [Figure: a deep autoencoder maps the input (vector of pixel values) through a low-dimensional bottleneck to a reconstruction target (identity), trained by gradient descent; the low-dimensional feature space is improved by reinforcement learning, and the policy maps feature vectors to actions.]
- From the thesis excerpt shown on the slide: after the pretraining process, the network is trained by RProp [RB92], identical to the training of a regular feed-forward network. This setup, known as a deep network, has been shown to produce exceptionally good non-linear embeddings of data [Lan10].

Deep Q-Networks (DQN)
- Given the training data $D = \{(s_t, a_t, r_{t+1}, s_{t+1})\}_{t=0}^{H}$, recall the squared error:
  $L(\theta) = \frac{1}{|D|} \sum_{t=0}^{H} \left( Q^\pi(s_t, a_t) - Q(s_t, a_t; \theta) \right)^2 = E_D\left[ \left( Q^\pi(s_t, a_t) - Q(s_t, a_t; \theta) \right)^2 \right] \approx E_D\left[ \left( r_{t+1} + \gamma \max_a Q(s_{t+1}, a; \hat{\theta}) - Q(s_t, a_t; \theta) \right)^2 \right]$
- $\hat{\theta}$ is called the target network: a fixed model trained several steps before.
- Using stochastic gradient descent: at time $t$, sample the $k$-th data point $(s, a, r, s') \sim D$ and apply the SGD update
  $\theta_{k+1} = \theta_k + \alpha_t \left( r + \gamma \max_{a'} Q(s', a'; \hat{\theta}_t) - Q(s, a; \theta_k) \right) \nabla_\theta Q(s, a; \theta_k)$

Deep Q-Networks (DQN)
- $Q(s, a)$ is approximated using a deep CNN.
- Initialize $\theta_t$, $\hat{\theta}$ (time index $t$), then loop until convergence:
  - take action $a_t$ using ɛ-greedy w.r.t. $Q(s_t, a; \theta_t)$
  - add $(s_t, a_t, r_t, s_{t+1})$ into the replay memory $D$
  - sample a subset of $(s_i, a_i, r_i, s_{i+1})$ from $D$
  - minimize the squared error on the sample (as in the previous slide),
    $\rho(\theta) = \sum_i \left( r_i + \gamma \max_a Q(s_{i+1}, a; \hat{\theta}_t) - Q(s_i, a_i; \theta) \right)^2$,
    and return the trained model as $\theta_{t+1}$
  - update the target network $\hat{\theta} = \theta_{t+1}$ periodically (after a fixed number of steps)
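
The loop above can be sketched without a deep network to isolate its two ingredients, the replay memory and the periodically synchronized target parameters. The sketch swaps the CNN for a linear model on one-hot features and uses a hypothetical corridor environment; this substitution, the hyperparameters and all names are assumptions, so it illustrates the training loop rather than DQN itself.

```python
import numpy as np

N_ACTIONS, GOAL, GAMMA = 2, 5, 0.95       # hypothetical corridor: states 0..4, goal = 5
D = GOAL * N_ACTIONS

def phi(s, a):
    f = np.zeros(D)
    f[s * N_ACTIONS + a] = 1.0
    return f

def step(s, a):
    """Action 0 = left, 1 = right; reward -1 per step; returns (r, s_next, done)."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    return -1.0, s_next, s_next == GOAL

def q_values(theta, s):
    return np.array([theta @ phi(s, a) for a in range(N_ACTIONS)])

def train(steps=3000, alpha=0.1, eps=0.2, batch=8, sync_every=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(D)                   # online parameters (linear stand-in for the CNN)
    theta_target = theta.copy()           # frozen target parameters
    replay = []                           # replay memory
    s = 0
    for t in range(1, steps + 1):
        if rng.random() < eps:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(np.argmax(q_values(theta, s)))
        r, s_next, done = step(s, a)
        replay.append((s, a, r, s_next, done))
        s = 0 if done else s_next         # restart the episode at the goal
        # One SGD step per transition sampled uniformly from the replay memory.
        for i in rng.integers(len(replay), size=batch):
            si, ai, ri, sn, dn = replay[i]
            y = ri if dn else ri + GAMMA * np.max(q_values(theta_target, sn))
            f = phi(si, ai)
            theta += alpha * (y - theta @ f) * f
        if t % sync_every == 0:
            theta_target = theta.copy()   # refresh the target parameters
    return theta

theta = train()
policy = [int(np.argmax(q_values(theta, s))) for s in range(GOAL)]
print(policy)
```

Sampling from the replay memory decorrelates consecutive updates, and holding `theta_target` fixed between synchronizations keeps the regression target from moving with every gradient step.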

DQN in Atari Games
[Figure from David Silver, NIPS talk, 2014.]

DQN in Atari Games
- $Q(s, a)$ is approximated using a deep CNN.
- Input $s$ is a stack of raw pixels from 4 consecutive frames.
- Control $a$ is one of 18 joystick/button positions.
(From David Silver, NIPS talk, 2014.)

DQN in Atari Games
[Figure: experience replay vs. no replay. From David Silver, NIPS talk, 2014.]