Reinforcement Learning

Reinforcement Learning: Value Function Approximation
Continuous state/action spaces, mean-squared error, gradient temporal-difference learning, least-squares temporal difference, least-squares policy iteration.
Hung Ngo & Vien Ngo, MLR Lab, University of Stuttgart

Outline
- Review: Function Approximation (FA)
  - Representation issues: hypothesis and features
  - Linear and nonlinear models
  - Deep learning
- RL with FA
  - Gradient-descent methods
  - Estimating target values
  - Linear RL (online & batch)
  - Nonlinear methods; deep RL

Example: Value Iteration in a Continuous MDP
V*(s) = sup_a [ r(s,a) + γ ∫ P(s'|s,a) V*(s') ds' ]

Model-free RL with Large/Continuous Domains
Table-lookup (tabular) RL methods are only feasible for small finite state-action spaces (exponential growth): Q(s,a), π(a|s) are tables/matrices of size |S| × |A|.
RL in large and/or continuous state and/or action spaces:
- Backgammon: about 10^20 states (board size: 28)
- Computer Go: about 10^38 states (9×9); about 10^170 states (19×19)
- Autonomous helicopter: continuous & high-dimensional spaces of S and A
Scalability and generalization issues. In this course we learn/optimize parametric models:
- for value functions V(s; θ), Q(s,a; θ): this lecture
- for the policy π(a|s; θ): policy search, next lecture

Value Function Approximation
Parameterized by θ: V_t(s) ≈ V(s; θ_t), Q_t(s,a) ≈ Q(s,a; θ_t).
Generalizes to unvisited states (state-action pairs).
Many choices for the function class of V(s; θ), Q(s,a; θ): linear regression, decision trees, nearest neighbors, ANNs (universal function approximators), GPR, kernel methods, etc.

Review: Supervised Learning
Training data: D = {(x_i, y_i)}_{i=1}^N, x_i ∈ R^d, y_i ∈ Y ⊆ R^m.
Linear learner: g(z), z = Wx, W ∈ R^{m×d}, g: R^m → R^m.
ŷ = g(Wx) for regression; ŷ = argmax_{i∈{1,...,m}} g_i(Wx) for classification.
g is a transfer function producing the desired output, e.g., (i) the identity g(z) = z for regression / class margins, (ii) a class posterior probability P_Y(i | x; W), e.g., a sigmoid for binary classification or a softmax for multiclass.
Loss function ℓ(g, x, y) ∈ R: MSE, number of mistakes, negative log-likelihood, hinge loss, etc.
Generalization & regularization: Occam's razor; prefer a simpler W.
Learning: optimize W w.r.t. the loss on D, e.g., gradient-based methods, PA (passive-aggressive), CW (confidence-weighted), etc.

Review: Input Representation (Feature Extraction)
Nonlinear features φ: R^d → R^k feeding a (linear) learner g(Wφ(x)): linear in the feature space Φ but highly nonlinear in the input space X, i.e., increased expressiveness of the model w.r.t. X → a better learner!
Fixed features: designed by domain experts, e.g., tile coding, SIFT, n-grams, MFCC, basis functions such as φ(x) = (1, x_1, ..., x_d, x_1², x_1 x_2, ..., x_d²).
Kernel trick: k(x_1, x_2) = φ(x_1)ᵀφ(x_2), e.g., polynomial (x_1ᵀx_2 + c)^n, Gaussian, ...
Learned features: φ(x) = h(W_h x), an artificial neuron → deep NN.
- h: nonlinear activation function, e.g., sigmoid, tanh, rectifier/hinge max(0, z).
- Multiple layers of features: φ(x) = h_j(W_j h_{j−1}(W_{j−1}(... h_0(W_0 x)))), trained together as g(Wφ(x)), e.g., using back-propagation.
- Universal function approximator!
Each feature φ_i encodes a landmark, prototype, regularity, or pattern; each feature value φ_i(x) is a similarity/distance measure between x and that prototype.
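
As a concrete illustration of a fixed, expert-designed feature map, here is a minimal sketch (not from the original slides) of the quadratic basis φ(x) = (1, x_1, ..., x_d, x_1², x_1x_2, ..., x_d²) used with a linear learner; the function name and data are illustrative.

```python
import numpy as np

def quadratic_features(x):
    """Fixed feature map: bias, linear terms, and all degree-2 monomials."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    quad = [x[i] * x[j] for i in range(d) for j in range(i, d)]
    return np.concatenate(([1.0], x, quad))

# A linear learner in feature space is nonlinear in input space:
x = np.array([0.5, -1.2])
phi = quadratic_features(x)          # shape (1 + d + d(d+1)/2,)
w = np.random.randn(phi.shape[0])    # weights of the linear model
y_hat = w @ phi                      # prediction g(w.T phi(x)) with identity g
```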

Example: Coarse Coding (2D)
Generalization from point X to point Y depends on the number of features whose receptive fields (here, circles) contain both. Here: slight generalization.
[Figure: receptive-field shapes control generalization — a) narrow, b) broad, c) asymmetric.]
A sparse, over-complete, distributed representation.

Example: Tile Coding (2D)
The receptive fields of the features are grouped into exhaustive partitions of the input space. Each such partition is a tiling, and each element of a partition is a tile. Each tile is the receptive field of one binary feature.
[Figure: two offset tilings over a 2D state space. The shape of the tiles determines generalization; the number of tilings determines the resolution of the final approximation.]
Exactly one feature is active in each tiling, so the total number of active features always equals the number of tilings. Also known as CMAC (cerebellar model articulation controller).
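
Below is a minimal Python sketch of tile coding with a few diagonally offset tilings over a 2D box; the function name and grid parameters are illustrative, and practical implementations (e.g., Sutton's tile-coding software) typically add hashing.

```python
import numpy as np

def tile_code(x, low, high, n_tilings=4, tiles_per_dim=8):
    """Return the index of the active tile in each tiling for a 2D input x.

    Each tiling is a uniform grid over [low, high), offset by a fraction of
    a tile width, so exactly one binary feature is active per tiling."""
    x, low, high = map(np.asarray, (x, low, high))
    tile_width = (high - low) / tiles_per_dim
    active = []
    for k in range(n_tilings):
        offset = (k / n_tilings) * tile_width            # diagonal offsets
        coords = np.floor((x - low + offset) / tile_width).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)
        # flatten (tiling, row, col) into a single feature index
        idx = k * tiles_per_dim**2 + coords[0] * tiles_per_dim + coords[1]
        active.append(idx)
    return active                                        # len == n_tilings

active = tile_code([0.3, -0.7], low=[-1, -1], high=[1, 1])
n_features = 4 * 8**2
phi = np.zeros(n_features); phi[active] = 1.0            # sparse binary features
```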

Example: Radial Basis Functions
φ_i(x) = rbf(d_i(x)), e.g., a Gaussian kernel rbf(d) = exp(−d²/2), with
d_i²(x) = ‖x − c_i‖²_{Σ_i} = (x − c_i)ᵀ Σ_i⁻¹ (x − c_i): distance in the metric Σ_i.
E.g., Σ_i = σ_i² I gives φ_i(x) = exp(−‖x − c_i‖² / (2σ_i²)), with prototype/center c_i and kernel width σ_i.
An RBF's receptive field has continuous coverage, whereas the previous methods' receptive fields discretize the input space. If the parameters {c_i, Σ_i} are learned: RBF network.
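
A minimal sketch of Gaussian RBF features with fixed centers on a grid and a shared width σ, fed into a linear value predictor; the grid layout and names are illustrative assumptions.

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """Gaussian RBF features phi_i(x) = exp(-||x - c_i||^2 / (2 sigma^2))."""
    x = np.asarray(x, dtype=float)
    d2 = np.sum((centers - x) ** 2, axis=1)     # squared distances to all centers
    return np.exp(-d2 / (2.0 * sigma ** 2))

# centers on a 5x5 grid over a 2D state space, shared width sigma
grid = np.linspace(-1.0, 1.0, 5)
centers = np.array([[cx, cy] for cx in grid for cy in grid])   # 25 prototypes
phi = rbf_features([0.2, -0.4], centers, sigma=0.5)
V = np.random.randn(len(centers)) @ phi        # linear value prediction theta.T phi(s)
```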

Example: Polynomial Kernel
[Figure (examples of SVMs with kernels): a polynomial feature map φ turns a nonlinear decision boundary in the input space (x, y) into a linear one in the feature space (e.g., x², y², xy).]

Example: Tabular Basis
Naive features for a finite discrete state space S = {s_1, ..., s_n}:
φ(s) = (δ_{s_1}(s), δ_{s_2}(s), ..., δ_{s_n}(s))ᵀ
where δ_{s_i}(s) ∈ {0, 1} is the Dirac (indicator) function; this is 1-of-K coding, or a one-hot vector.
For (simple) generalization: add a bias weight and an augmented feature,
φ(s) = (1, δ_{s_1}(s), δ_{s_2}(s), ..., δ_{s_n}(s))ᵀ
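
A tiny sketch showing that tabular RL is the special case of linear function approximation with one-hot features: θᵀφ(s) reduces to a table lookup. The names are illustrative.

```python
import numpy as np

def one_hot(s, n_states):
    """Tabular basis: phi(s) is the s-th standard basis vector (1-of-K coding)."""
    phi = np.zeros(n_states)
    phi[s] = 1.0
    return phi

theta = np.random.randn(5)               # one weight per state
s = 3
V = theta @ one_hot(s, 5)                # equals theta[3]: a tabular lookup
```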

Example: Perceptron (Frank Rosenblatt, 1957!), built on fixed features φ(x).
Paul Werbos: backpropagation for MLPs (1974). Kunihiko Fukushima's deep CNN, the Neocognitron (1980).

Example: Deep Learning, 2012
A lot of training data available + computational power (GPUs).
Good learning priors (hierarchical, modular, sparse coding, pooling, regularizers such as dropout, invariances, etc.) and training algorithms.

Review: Gradient Descent (Steepest Descent)
For f: R^n → R: f(x + δx) ≈ f(x) + ∇_x fᵀ δx + ½ δxᵀ H_x δx
Steepest-descent direction: δx̂ = −∇_x f (the best direction only for small step sizes).
Effects of the initial point, direction, and step size; second-order methods.
[Figure: optimization paths of plain gradient descent, conjugate gradient, and a 2nd-order method.]
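
A tiny illustrative sketch of plain gradient descent on an ill-conditioned quadratic, showing the role of the step size; the matrix and step size are arbitrary choices, not from the slides.

```python
import numpy as np

# f(x) = 0.5 x.T A x with an ill-conditioned A; its gradient is A x
A = np.diag([1.0, 20.0])

def grad(x):
    return A @ x

x = np.array([1.0, 1.0])
alpha = 0.04                       # too large a step size would oscillate or diverge
for _ in range(200):
    x = x - alpha * grad(x)        # steepest-descent step: delta_x = -grad f(x)
print(x)                           # approaches the minimizer (0, 0)
```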

Approximate RL: Formulation & Solution Techniques

Approx. RL: Supervised Learning Formulation
Given training data D = {(s_t, a_t, r_{t+1}, s_{t+1})}_{t=0}^H, with states drawn from some distribution P(·).
Minimize the mean-squared error (similarly for Q^π(s, a)):
L(θ) = E_{s∼P(·)}[ (V^π(s) − V(s; θ))² ] ≈ E_{s∼D}[ (V^π(s) − V(s; θ))² ]
Analytical solution: ∇_θ L(θ) = 0, i.e., Σ_{s∈D} (V^π(s) − V(s; θ)) ∇_θ V(s; θ) = 0,
e.g., linear least-squares methods (LSTD, LSPI).
Gradient-descent solutions: batch (offline), SGD (incremental/online), mini-batch (hybrid); SGD samples the (full) gradient at each s_t / (s_t, a_t).
Least mean squares (LMS), aka the Widrow-Hoff rule, is the SGD update
θ_{t+1} = θ_t + α_t (V^π(s_t) − V(s_t; θ_t)) ∇_θ V(s_t; θ_t)
Combined with experience replay: store all real-world experience in D and sample (i.i.d.) from it for SGD.
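
A minimal sketch of the LMS/Widrow-Hoff SGD update for a linear V(s; θ) = θᵀφ(s), with i.i.d. sampling from a replay memory. Here the target values are assumed to be given by a supervisor (they are estimated on the next slide); the data and names are illustrative.

```python
import numpy as np

def lms_update(theta, phi_s, v_target, alpha):
    """Widrow-Hoff / LMS step: theta <- theta + alpha (v_target - theta.T phi) phi."""
    err = v_target - theta @ phi_s         # prediction error
    return theta + alpha * err * phi_s     # grad_theta V(s;theta) = phi(s) for linear V

rng = np.random.default_rng(0)
d = 8
theta = np.zeros(d)
# replay memory of (features, target-value) pairs; targets assumed given here
replay = [(rng.normal(size=d), rng.normal()) for _ in range(500)]
for _ in range(5000):
    phi_s, v = replay[rng.integers(len(replay))]   # i.i.d. sample from D
    theta = lms_update(theta, phi_s, v, alpha=0.01)
```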

Approximate RL: Estimating the Target Value
No true value function V^π(s) / Q^π(s,a) is given by a supervisor! Use an estimate v_t in place of the target V^π(s_t); it is unbiased if E_P[v(s_t)] = V^π(s_t).
MC: θ_{t+1} = θ_t + α_t (R_t − V(s_t; θ_t)) ∇_θ V(s_t; θ_t)
TD(0): θ_{t+1} = θ_t + α_t (r_{t+1} + γ V(s_{t+1}; θ_t) − V(s_t; θ_t)) ∇_θ V(s_t; θ_t)
TD(λ): biased for λ < 1
- Forward view: θ_{t+1} = θ_t + α_t (R_t^λ − V(s_t; θ_t)) ∇_θ V(s_t; θ_t)
- Backward view: θ_{t+1} = θ_t + α_t [ r_{t+1} + γ V(s_{t+1}; θ_t) − V(s_t; θ_t) ] e_t, with e_t = γλ e_{t−1} + ∇_θ V(s_t; θ_t)
Similarly, estimate a target q_t for the action-value function Q(s_t, a_t) (control):
θ_{t+1} = θ_t + α_t (r_{t+1} + γ max_{a'} Q(s_{t+1}, a'; θ_t) − Q(s_t, a_t; θ_t)) ∇_θ Q(s_t, a_t; θ_t)
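
As an illustration of the Monte-Carlo choice of target, here is a sketch that computes the returns R_t from one finished episode and plugs them into the gradient update, using linear features for concreteness; the toy features and rewards are assumptions of the example.

```python
import numpy as np

def mc_returns(rewards, gamma):
    """Discounted returns R_t = r_{t+1} + gamma r_{t+2} + ... for one episode."""
    R, out = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        out.append(R)
    return out[::-1]

def mc_update(theta, features, rewards, gamma=0.99, alpha=0.05):
    """Gradient MC: theta <- theta + alpha (R_t - V(s_t;theta)) grad V, linear V."""
    for phi_s, R in zip(features, mc_returns(rewards, gamma)):
        theta = theta + alpha * (R - theta @ phi_s) * phi_s
    return theta

theta = np.zeros(4)
episode_features = [np.eye(4)[i % 4] for i in range(10)]   # toy phi(s_t)
episode_rewards = [1.0] * 10
theta = mc_update(theta, episode_features, episode_rewards)
```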

Linear RL

RL with Linear Function Approximation
Input representation: fixed features φ.
Hypothesis representation: a weight vector θ_t; the prediction is a linear combination of feature values, i.e., a dot product:
V(s; θ) = θᵀφ(s), Q(s,a; θ) = θᵀφ(s,a), θ ∈ R^d
Learning: θ_t → θ_{t+1} given (s_t, a_t, r_{t+1}, s_{t+1}). The update rule is particularly simple since ∇_θ V(s; θ) = φ(s).
(Example: 9×9 Go — a huge state space handled with ~10^5 binary features and parameters; R. Sutton, ICML 2009.)

Linear TD(λ) for Online Prediction
TD(0) with linear function approximation:
θ_{t+1} = θ_t + α_t [ r_{t+1} + γ V(s_{t+1}; θ_t) − V(s_t; θ_t) ] φ(s_t)
TD(λ) (with an eligibility trace):
e_t = γλ e_{t−1} + φ(s_t)
θ_{t+1} = θ_t + α_t [ r_{t+1} + γ V(s_{t+1}; θ_t) − V(s_t; θ_t) ] e_t
Similarly for the action-value function Q^π(s_t, a_t).
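
A minimal sketch of backward-view linear TD(λ) for prediction, matching the update above. The environment interface (`reset()`, `step()`) and the feature map `phi` are hypothetical and must be supplied by the user.

```python
import numpy as np

def linear_td_lambda(env, phi, n_features, episodes=100,
                     alpha=0.05, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda) with V(s;theta) = theta.T phi(s).

    `env` is a hypothetical object with reset() -> s and
    step() -> (r, s_next, done) under the policy being evaluated;
    `phi` maps states to feature vectors of length n_features."""
    theta = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        e = np.zeros(n_features)              # eligibility trace
        done = False
        while not done:
            r, s_next, done = env.step()
            v = theta @ phi(s)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - v    # TD error
            e = gamma * lam * e + phi(s)      # accumulate trace
            theta = theta + alpha * delta * e
            s = s_next
    return theta
```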

Linear SARSA(λ) for On-Policy Control
Initialize θ = 0 (a vector of zeros)
While not converged:
  e = 0; initialize state s ∈ S_0; choose a ∼ π(s), e.g., ε-greedy w.r.t. Q(s, ·; θ)
  Repeat (for each step of the episode):
    Execute a, observe r, s'
    e ← e + φ(s, a)
    If s' is terminal: θ ← θ + α e (r − Q(s, a; θ)); go to the next episode
    Choose a' ∼ π(s'), e.g., ε-greedy w.r.t. Q(s', ·; θ)
    θ ← θ + α e [ r + γ Q(s', a'; θ) − Q(s, a; θ) ]
    e ← γλ e
    s ← s', a ← a'
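
The following Python sketch mirrors the pseudocode above, assuming a hypothetical environment with a small discrete action set and a given state-action feature map `phi(s, a)`; these interfaces are assumptions, not part of the slides.

```python
import numpy as np

def epsilon_greedy(theta, phi, s, actions, eps, rng):
    """Pick a random action with prob. eps, else the greedy one w.r.t. Q(s,.;theta)."""
    if rng.random() < eps:
        return actions[rng.integers(len(actions))]
    q = [theta @ phi(s, a) for a in actions]
    return actions[int(np.argmax(q))]

def linear_sarsa_lambda(env, phi, n_features, actions, episodes=200,
                        alpha=0.1, gamma=0.99, lam=0.9, eps=0.1, seed=0):
    """Linear SARSA(lambda) with Q(s,a;theta) = theta.T phi(s,a)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)
    for _ in range(episodes):
        e = np.zeros(n_features)
        s = env.reset()
        a = epsilon_greedy(theta, phi, s, actions, eps, rng)
        done = False
        while not done:
            r, s_next, done = env.step(a)
            e = e + phi(s, a)                       # accumulating trace
            q = theta @ phi(s, a)
            if done:
                theta = theta + alpha * (r - q) * e
                break
            a_next = epsilon_greedy(theta, phi, s_next, actions, eps, rng)
            delta = r + gamma * (theta @ phi(s_next, a_next)) - q
            theta = theta + alpha * delta * e
            e = gamma * lam * e
            s, a = s_next, a_next
    return theta
```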

Linear Q(λ) for Off-Policy Control
Initialize θ = 0 (a vector of zeros)
While not converged:
  e = 0; initialize state s ∈ S_0
  Repeat (for each step of the episode):
    Choose a ∼ π(s), e.g., ε-greedy w.r.t. Q(s, ·; θ)
    If a ∉ argmax_{a'} Q(s, a'; θ): e = 0   (cut the trace on exploratory actions)
    Execute a, observe r, s'
    e ← e + φ(s, a)
    If s' is terminal: θ ← θ + α e (r − Q(s, a; θ)); go to the next episode
    θ ← θ + α e [ r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ) ]
    e ← γλ e
    s ← s'

SARSA(λ) with tile-coding function approximation [figure]

λ: How Much Bootstrapping? [figure]

Linear Least Squares Prediction
Recall: batch (offline) training of linear models has an analytical solution, ∇_θ L(θ) = 0:
0 = Σ_{s∈D} (V^π(s) − V(s; θ)) ∇_θ V(s; θ)
0 = Σ_{s∈D} (V^π(s) − φ(s)ᵀθ) φ(s)
θ = ( Σ_{s∈D} φ(s) φ(s)ᵀ )⁻¹ Σ_{s∈D} φ(s) V^π(s)
Sherman-Morrison incremental update (O(d²)) of A⁻¹, with B = A⁻¹:
(A + u vᵀ)⁻¹ = A⁻¹ − (A⁻¹ u vᵀ A⁻¹) / (1 + vᵀ A⁻¹ u) = B − (B u vᵀ B) / (1 + vᵀ B u)
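
A short sketch that checks the Sherman-Morrison rank-1 inverse update numerically; this is the O(d²) trick used to maintain A⁻¹ incrementally in recursive least-squares / LSTD implementations. The test matrices are arbitrary.

```python
import numpy as np

def sherman_morrison(B, u, v):
    """Given B = inv(A), return inv(A + u v.T) in O(d^2)."""
    Bu = B @ u
    vB = v @ B
    return B - np.outer(Bu, vB) / (1.0 + v @ Bu)

rng = np.random.default_rng(1)
d = 5
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))
u, v = rng.normal(size=d), rng.normal(size=d)
B = np.linalg.inv(A)
incremental = sherman_morrison(B, u, v)
direct = np.linalg.inv(A + np.outer(u, v))
print(np.allclose(incremental, direct))   # True (up to numerical error)
```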

Linear Least Squares Prediction
Again, there are many choices of the estimate v(s_t) for the true target V^π(s_t):
- LSMC: v(s_t) = R_t
- LSTD: v(s_t) = r_{t+1} + γ V(s_{t+1}; θ) = r_{t+1} + γ φ(s_{t+1})ᵀθ
- LSTD(λ): v(s_t) = R_t^λ
In each case, solve directly for the fixed point of MC / TD(0) / TD(λ):
- LSMC: θ = ( Σ_{s∈D} φ(s) φ(s)ᵀ )⁻¹ Σ_{s∈D} φ(s) R(s)
- LSTD: θ = ( Σ_{s∈D} φ(s) (φ(s) − γ φ(s'))ᵀ )⁻¹ Σ_{s∈D} φ(s) r(s)
- LSTD(λ): θ = ( Σ_{s∈D} E(s) (φ(s) − γ φ(s'))ᵀ )⁻¹ Σ_{s∈D} E(s) r(s), i.e., the fixed point of Σ_{s∈D} α δ(s) e(s) = 0
where R(s) is the Monte-Carlo return from s, r(s) the one-step reward, and E(s) the eligibility trace.
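
A minimal batch LSTD(0) sketch: accumulate A = Σ φ(s)(φ(s) − γφ(s'))ᵀ and b = Σ φ(s) r over a dataset of transitions and solve Aθ = b. The transition format, the feature map `phi`, and the small ridge term added for invertibility are assumptions of this sketch.

```python
import numpy as np

def lstd(transitions, phi, n_features, gamma=0.99, ridge=1e-3):
    """Batch LSTD(0): theta = inv(sum phi(s)(phi(s) - gamma phi(s')).T) sum phi(s) r.

    `transitions` is an iterable of (s, r, s_next, done) generated by the
    policy under evaluation; `phi` maps states to feature vectors."""
    A = ridge * np.eye(n_features)        # regularization for invertibility
    b = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        f = phi(s)
        f_next = np.zeros(n_features) if done else phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)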

Linear Least Squares Control
Policy evaluation: off-policy LSTDQ,
θ = ( Σ_{(s,a,r,s')∈D} φ(s,a) (φ(s,a) − γ φ(s', π(s')))ᵀ )⁻¹ Σ_{(s,a,r,s')∈D} φ(s,a) r
Policy improvement: LSPI iterates over D, repeatedly re-evaluating the same experience D with different policies. Pseudocode using LSTDQ for policy evaluation:
LSPI-TD(D, π_0):
  π' ← π_0
  repeat
    π ← π'
    Q ← LSTDQ(π, D)
    for all s ∈ S: π'(s) ← argmax_{a∈A} Q(s, a)
  until π ≈ π'
  return π
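
As a complement to the pseudocode, here is a minimal Python sketch of LSTDQ and the LSPI loop over a fixed dataset D of (s, a, r, s', done) transitions. The function names, the feature map `phi(s, a)`, and the ridge term are assumptions of this sketch, not part of the original slides.

```python
import numpy as np

def lstdq(D, phi, policy, n_features, gamma=0.99, ridge=1e-3):
    """LSTDQ: least-squares weights of Q^pi for the evaluation policy `policy`."""
    A = ridge * np.eye(n_features)
    b = np.zeros(n_features)
    for s, a, r, s_next, done in D:
        f = phi(s, a)
        f_next = np.zeros(n_features) if done else phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)

def lspi(D, phi, n_features, actions, iters=20, tol=1e-4):
    """LSPI: repeatedly re-evaluate the same dataset D under the greedy policy."""
    theta = np.zeros(n_features)
    greedy = lambda s: actions[int(np.argmax([theta @ phi(s, a) for a in actions]))]
    for _ in range(iters):
        theta_new = lstdq(D, phi, greedy, n_features)
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```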

LSPI: Bellman Residual Minimization
The Q-function of a given policy π fulfills, for any (s, a):
Q^π(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) Q^π(s', π(s'))
Written as an optimization: minimize the Bellman residual error
L(Q) = ‖R + γ P Π_π Q − Q‖
where Π_π denotes the operator that applies the policy π at the next state (notation of Lagoudakis & Parr, JMLR 2003).

LSPI: Bellman Residual Minimization
The fixed point of Bellman residual minimization (this is an overconstrained system):
β^π = ( (Φ − γ P Π_π Φ)ᵀ (Φ − γ P Π_π Φ) )⁻¹ (Φ − γ P Π_π Φ)ᵀ R
The solution β^π of the system is unique since the columns of Φ (the basis functions) are linearly independent by definition. (See Lagoudakis & Parr, JMLR 2003, for details.)

LSPI: Least-Squares Fixed-Point Approximation
Project T^π Q back onto span(Φ): T̂^π(Q) = Φ (ΦᵀΦ)⁻¹ Φᵀ (T^π Q)
Minimize the projected Bellman residual, i.e., solve T̂^π(Q) = Q.
The approximate fixed point: β^π = ( Φᵀ (Φ − γ P Π_π Φ) )⁻¹ Φᵀ R

LSPI: Comparison of the Two Views
Bellman residual minimization focuses on the magnitude of the change; least-squares fixed-point approximation focuses on the direction of the change.
The least-squares fixed-point approximation is less stable and less predictable, yet it may still be preferable because:
- learning the Bellman-residual-minimizing approximation requires doubled (paired) samples;
- experimentally, it often delivers superior policies.
(See Lagoudakis & Parr, JMLR 2003, for details.)

Convergence Properties
[Table: convergence of prediction algorithms (MC, LSMC, TD, LSTD; on-policy and off-policy) under table-lookup, linear, and non-linear approximation.]
Convergence of control algorithms (✓ = converges, (✓) = chatters around a near-optimal value function, ✗ = may diverge):

Algorithm           | Table Lookup | Linear | Non-Linear
Monte-Carlo Control | ✓            | (✓)    | ✗
Sarsa               | ✓            | (✓)    | ✗
Q-learning          | ✓            | ✗      | ✗
LSPI                | ✓            | (✓)    | -

Linear Q-Learning Sometimes Diverges
A counterexample from Baird (1995): divergence due to bootstrapping + not following true gradients.

Using the exact update (DP backups) for policy evaluation:
θ_{t+1} = θ_t + α Σ_s P(s) [ E{ r_{t+1} + γ V_t(s_{t+1}) | s_t = s } − V(s; θ_t) ] ∇_θ V(s; θ_t)
If P(s) is uniform (≠ the true stationary distribution of the Markov chain), then the asymptotic behaviour becomes unstable.

Problems with Gradient-Descent TD Methods
- With nonlinear FA, even on-policy methods like SARSA may diverge (Tsitsiklis and Van Roy, 1997).
- With linear FA, off-policy methods like Q-learning may diverge (Baird, 1995).
- They are not true gradient-descent methods.
Gradient TD methods (Sutton et al., 2008, 2009) solve all of the above problems: they follow the true gradient direction of the projected Bellman error.

Gradient Temporal Difference Learning
- GTD: gradient temporal-difference learning
- GTD2: gradient temporal-difference learning, version 2
- TDC: temporal-difference learning with gradient corrections

Value Function Geometry
Bellman operator: TV = R + γPV.
[Figure: value-function geometry in the space spanned by the feature vectors.]
RMSBE: root mean-squared Bellman error; RMSPBE: root mean-squared projected Bellman error.

TD performance measures
- Error from the true value: ‖V_θ − V‖²_P
- Error in the Bellman update (used so far: TD(0), GTD(0) methods): ‖V_θ − T V_θ‖²_P
- Error in the Bellman update after projection (TDC and GTD2 methods): ‖V_θ − Π T V_θ‖²_P

TD performance measures
GTD(0): the norm of the expected TD update,
NEU(θ) = E[δφ]ᵀ E[δφ]
GTD2 and TDC: the norm of the expected TD update weighted by the inverse covariance matrix of the features (δ is the TD error),
MSPBE(θ) = E[δφ]ᵀ E[φφᵀ]⁻¹ E[δφ]
(GTD2 and TDC differ slightly in their derivation of the approximate gradient direction.)

Updates
GTD(0):
θ_{t+1} = θ_t + α_t (φ(s_t) − γφ(s_{t+1})) (φ(s_t)ᵀ w_t) π(s_t, a_t)
w_{t+1} = w_t + β_t (δ_t φ(s_t) − w_t) π(s_t, a_t)
GTD2:
θ_{t+1} = θ_t + α_t (φ(s_t) − γφ(s_{t+1})) (φ(s_t)ᵀ w_t)
w_{t+1} = w_t + β_t (δ_t − φ(s_t)ᵀ w_t) φ(s_t)
TDC:
θ_{t+1} = θ_t + α_t δ_t φ(s_t) − α_t γ φ(s_{t+1}) (φ(s_t)ᵀ w_t)
w_{t+1} = w_t + β_t (δ_t − φ(s_t)ᵀ w_t) φ(s_t)
where δ_t = r_{t+1} + γ φ(s_{t+1})ᵀθ_t − φ(s_t)ᵀθ_t is the TD error.
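
A minimal sketch of a single TDC step with linear features and the secondary weight vector w, in the on-policy form (the importance weights shown for GTD(0) are omitted); the step sizes and random data are illustrative.

```python
import numpy as np

def tdc_step(theta, w, phi_s, phi_next, r, gamma=0.99, alpha=0.05, beta=0.25):
    """One TDC update: a TD step plus a gradient-correction term using w."""
    delta = r + gamma * (theta @ phi_next) - theta @ phi_s     # TD error
    theta = theta + alpha * (delta * phi_s
                             - gamma * phi_next * (phi_s @ w)) # correction term
    w = w + beta * (delta - phi_s @ w) * phi_s                 # secondary weights
    return theta, w

d = 6
theta, w = np.zeros(d), np.zeros(d)
rng = np.random.default_rng(2)
phi_s, phi_next, r = rng.normal(size=d), rng.normal(size=d), 1.0
theta, w = tdc_step(theta, w, phi_s, phi_next, r)
```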

On Baird's counterexample. [figure]

Gradient TD: Summary
Gradient-TD algorithms with linear function approximation:
- are guaranteed convergent under both general on-policy and off-policy training;
- have computational complexity of only O(n), where n is the number of features;
- remove the curse of dimensionality.

Nonlinear & Deep RL

TD-Gammon
By Gerry Tesauro (1992, 1994, 1995). Reached world-class level. Training data: about 300,000 games of self-play. [Figure: backgammon board; white pieces move counterclockwise, black pieces move clockwise.]

TD-Gammon
Input encoding (Sutton & Barto, Figure 15.2: the neural network used in TD-Gammon): four units per color for each of the 24 points; with up to three pieces on a point the corresponding units were simply on, and with n > 3 pieces the fourth unit took the value (n−3)/2, indicating the number of additional pieces beyond three. Two additional units encoded the number of white and black pieces on the bar (each taking the value n/2), for a total of 198 input units.
TD-Gammon used the gradient-descent form of TD(λ), with the gradients computed by the error backpropagation algorithm: the backgammon position (198 input units) is mapped through 40-80 hidden units to a single output, the predicted probability of winning v̂(S_t, w); the TD error is v̂(S_{t+1}, w) − v̂(S_t, w).
Later versions combine learned value functions and decision-time search.

Fitted Q-Iteration
Given all experience D = {(s_t, a_t, r_{t+1}, s_{t+1})}_{t=0}^H, FQI approximates Q(s,a; θ) by offline (batch, not SGD) supervised regression, constructing a training dataset at each iteration k:
D_k = { ((s_t, a_t), q_t^k) }_{t=0}^H, where q_t^k = r_{t+1} + γ max_{a'} Q(s_{t+1}, a'; θ_{k−1})
Hence FQI can be seen as approximate Q-iteration. At each iteration, any regression technique can be used: neural networks, radial basis function networks, regression trees, etc.
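
A minimal FQI sketch over a fixed batch of transitions. The `regressor_factory` is assumed to return a fresh regressor with a scikit-learn-style fit/predict interface (e.g., a tree ensemble); this interface, the state-action encoding, and all names are assumptions of the sketch, not prescribed by the slides.

```python
import numpy as np

def fitted_q_iteration(D, regressor_factory, actions, n_iters=50, gamma=0.99):
    """Fitted Q-Iteration over a fixed batch D = [(s, a, r, s_next, done), ...]."""
    S = np.array([np.append(s, a) for s, a, _, _, _ in D])        # (s, a) inputs
    R = np.array([r for _, _, r, _, _ in D])
    S_next = np.array([s_next for _, _, _, s_next, _ in D])
    not_done = np.array([0.0 if done else 1.0 for *_, done in D])
    q_model = None
    for _ in range(n_iters):
        if q_model is None:
            targets = R                                           # first targets: rewards
        else:
            q_next = np.column_stack([
                q_model.predict(np.column_stack([S_next, np.full(len(D), a)]))
                for a in actions])                                # Q_{k-1}(s', a')
            targets = R + gamma * not_done * q_next.max(axis=1)
        q_model = regressor_factory()
        q_model.fit(S, targets)                                   # supervised regression
    return q_model
```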

Fitted Q-Iteration: Mountain Car Benchmark
- Control interval Δt = 0.05 s.
- Actions are continuous, restricted to the interval [−4, 4].
- A neural network trained with RPROP (input-hidden-hidden-output layer configuration).
- Training trajectories had a maximum length of 50 primitive control steps.
- After each trajectory/episode, run one FQI iteration.
Neural fitted Q-iteration by Martin Riedmiller (2005).

Deep Fitted Q-Iteration: Neuro Slot Car Racer
[Figure: a deep autoencoder is trained by gradient descent to reconstruct its high-dimensional input (a vector of raw pixels); the bottleneck yields a low-dimensional feature space, which is further improved by reinforcement learning, and the policy maps feature vectors to actions.]
After the pretraining process, the network is trained by RProp [RB92], identically to the training of a regular feed-forward network. This setup, known as a deep network, has been shown to produce exceptionally good non-linear embeddings of a system's data [Lan10].

Deep Q-Networks (DQN)
Given training data D = {(s_t, a_t, r_{t+1}, s_{t+1})}_{t=0}^H, recall the squared error
L(θ) = (1/|D|) Σ_{t=0}^H (Q^π(s_t, a_t) − Q(s_t, a_t; θ))²
     = E_D[ (Q^π(s_t, a_t) − Q(s_t, a_t; θ))² ]
     ≈ E_D[ (r_{t+1} + γ max_a Q(s_{t+1}, a; θ̂) − Q(s_t, a_t; θ))² ]
where θ̂ is the "target network": a fixed model trained several steps before.
Using stochastic gradient descent: at each step sample a data point (s, a, r, s') ∈ D and apply the SGD update
θ_{k+1} = θ_k + α (r + γ max_{a'} Q(s', a'; θ̂) − Q(s, a; θ_k)) ∇_θ Q(s, a; θ_k)

Deep Q-Networks (DQN)
Q(s,a) is approximated using a deep CNN. Initialize θ and the target network θ̂, then loop until convergence:
- Take action a_t, ε-greedy w.r.t. Q(s_t, ·; θ_t).
- Add (s_t, a_t, r_{t+1}, s_{t+1}) to the replay memory D.
- Sample a mini-batch of transitions (s_i, a_i, r_i, s_{i+1}) from D.
- Optimize the squared error (as on the previous slide),
  L(θ) = (1/|B|) Σ_{i∈B} ( r_i + γ max_a Q(s_{i+1}, a; θ̂) − Q(s_i, a_i; θ) )²
  and keep the trained model as θ_{t+1}.
- Update the target network θ̂ ← θ_{t+1} after a fixed number of steps.
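
A framework-agnostic sketch of the replay sampling and target computation at the heart of this loop. Here `q_target(states)` stands for the frozen target network (any function returning per-action values); this interface and all names are assumptions, and the actual gradient step is left to whichever deep-learning library is used.

```python
import numpy as np

def dqn_targets(batch, q_target, gamma=0.99):
    """Compute y_i = r_i + gamma * max_a Q(s_{i+1}, a; theta_hat) for a mini-batch."""
    s, a, r, s_next, done = batch
    q_next = q_target(s_next)                       # shape (batch, n_actions)
    bootstrap = q_next.max(axis=1) * (1.0 - done)   # no bootstrap at terminal s'
    return r + gamma * bootstrap                    # regression targets for Q(s,a;theta)

def sample_batch(replay, batch_size, rng):
    """Uniform i.i.d. sampling from the replay memory (a list of transitions)."""
    idx = rng.integers(len(replay), size=batch_size)
    s, a, r, s_next, done = zip(*(replay[i] for i in idx))
    return (np.array(s), np.array(a), np.array(r, dtype=float),
            np.array(s_next), np.array(done, dtype=float))

# The squared error (y_i - Q(s_i, a_i; theta))^2 is then minimized by SGD in the
# deep-learning framework of choice; theta_hat is copied from theta periodically.
```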

DQN in Atari Games
(From a NIPS talk by David Silver.)

DQN in Atari Games
Q(s,a) is approximated using a deep CNN: the input s is a stack of raw pixels from 4 consecutive frames, and the control a is one of 18 joystick/button positions. (From a NIPS talk by David Silver.)


DQN in Atari Games: Experience Replay vs. No Replay
(From a NIPS talk by David Silver.)


More information

Reinforcement Learning as Variational Inference: Two Recent Approaches

Reinforcement Learning as Variational Inference: Two Recent Approaches Reinforcement Learning as Variational Inference: Two Recent Approaches Rohith Kuditipudi Duke University 11 August 2017 Outline 1 Background 2 Stein Variational Policy Gradient 3 Soft Q-Learning 4 Closing

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Policy Search: Actor-Critic and Gradient Policy search Mario Martin CS-UPC May 28, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 28, 2018 / 63 Goal of this lecture So far

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Algorithms for Fast Gradient Temporal Difference Learning

Algorithms for Fast Gradient Temporal Difference Learning Algorithms for Fast Gradient Temporal Difference Learning Christoph Dann Autonomous Learning Systems Seminar Department of Computer Science TU Darmstadt Darmstadt, Germany cdann@cdann.de Abstract Temporal

More information

Policy Gradient Methods. February 13, 2017

Policy Gradient Methods. February 13, 2017 Policy Gradient Methods February 13, 2017 Policy Optimization Problems maximize E π [expression] π Fixed-horizon episodic: T 1 Average-cost: lim T 1 T r t T 1 r t Infinite-horizon discounted: γt r t Variable-length

More information

Artificial Neural Networks. MGS Lecture 2

Artificial Neural Networks. MGS Lecture 2 Artificial Neural Networks MGS 2018 - Lecture 2 OVERVIEW Biological Neural Networks Cell Topology: Input, Output, and Hidden Layers Functional description Cost functions Training ANNs Back-Propagation

More information

Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *

Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms * Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered

More information

PART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B.

PART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B. Advanced Topics in Machine Learning, GI13, 2010/11 Advanced Topics in Machine Learning, GI13, 2010/11 Answer any THREE questions. Each question is worth 20 marks. Use separate answer books Answer any THREE

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

Lecture 17: Neural Networks and Deep Learning

Lecture 17: Neural Networks and Deep Learning UVA CS 6316 / CS 4501-004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions

More information