Geometric Variance Reduction in Markov Chains. Application to Value Function and Gradient Estimation


Rémi Munos
Centre de Mathématiques Appliquées, Ecole Polytechnique, 91128 Palaiseau Cedex, France

Copyright © 2005, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We study a sequential variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V. Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of samples. Here, we obtain a geometric variance reduction O(\rho^N) (with \rho < 1) up to a threshold that depends on the approximation error V - AV, where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. AV = V), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in policy iteration for Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is, computing the sensitivity \partial_\alpha V of the performance measure V with respect to some parameter \alpha of the transition probabilities. For example, in parametric optimization of the policy, an estimate of the policy gradient is required to perform a gradient optimization method. We show that, using two approximations, of the value function and of the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations.

Introduction

We consider a Markov chain over a finite state space X defined by the transition matrix P. We write X(x) for a trajectory (x_t)_{t \ge 0} starting at a state x_0 = x. Let \Psi(r, X(x)) be a functional that depends on some function r : X \to IR and on the trajectory X(x), and write V(x) for the expectation of the functional that we wish to evaluate:

    V(x) = E[\Psi(r, X(x))].    (1)

Here, the quantity of interest V is expressed in terms of a Markov representation, as an expectation of a functional that depends on trajectories. We will consider a functional \Psi(r, \cdot) that is linear in r, and such that its expectation V may equivalently be expressed in terms of the solution to a linear system

    LV = r,    (2)

with L an invertible linear operator (matrix). An example of \Psi is the sum of discounted rewards r received along the trajectory:

    \Psi(r, X(x)) = \sum_{t \ge 0} \gamma^t r(x_t),    (3)

with \gamma < 1 a discount factor. In that case, V is the solution to the Bellman equation (2) with L = I - \gamma P. Indeed, using matrix notations, V equals \sum_{t \ge 0} \gamma^t P^t r = (I - \gamma P)^{-1} r. Other functionals include the finite-horizon sum of rewards \Psi(r, X(x)) = \sum_{t=0}^{T} r(x_t) and the infinite-time average reward \Psi(r, X(x)) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} r(x_t).

A regular Monte Carlo (MC) method would estimate V(x) by sampling N independent trajectories {X^n(x)}_{1 \le n \le N} starting from x and calculating the average \frac{1}{N} \sum_{n=1}^{N} \Psi(r, X^n(x)). The variance of such an estimator is of order 1/N. Variance reduction is crucial since the numerical approximation error of the quantity of interest is directly related to the variance of its estimate. Variance reduction techniques include importance sampling, correlated sampling, control variates, antithetic variates and stratified sampling; see e.g. (Hammersley and Handscomb, 1964; Halton, 1970).
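To make the baseline concrete, here is a minimal sketch (an editor-added illustration, not from the paper) of the regular MC estimate of the discounted value (1)-(3). The small 3-state chain, reward vector and discount are arbitrary assumptions, and trajectories are truncated at a horizon T where gamma^T is negligible.

import numpy as np

def simulate(P, x0, T, rng):
    """Sample a length-T trajectory of the chain with transition matrix P."""
    traj = [x0]
    for _ in range(T - 1):
        traj.append(rng.choice(len(P), p=P[traj[-1]]))
    return traj

def mc_value(P, r, gamma, x0, N, T=200, seed=0):
    """Plain Monte Carlo estimate of V(x0) = E[sum_t gamma^t r(x_t)],
    truncated at horizon T.  Returns the estimate and its variance (~ 1/N)."""
    rng = np.random.default_rng(seed)
    discounts = gamma ** np.arange(T)
    returns = []
    for _ in range(N):
        traj = simulate(P, x0, T, rng)
        returns.append(np.dot(discounts, r[traj]))
    return np.mean(returns), np.var(returns) / N

# Illustrative 3-state chain (assumption, not from the paper)
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, 2.0])
gamma = 0.9
V_exact = np.linalg.solve(np.eye(3) - gamma * P, r)   # V = (I - gamma P)^{-1} r
print(mc_value(P, r, gamma, x0=0, N=1000), V_exact[0])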
Geometric variance reduction rates have been obtained by processing these variance reduction methods iteratively, the so-called sequential (or recursive) Monte Carlo. Examples include adaptive importance sampling (Kollman et al., 1999) and what Halton called the Third Sequential Method (Halton, 1994), based on sequential correlated sampling and control variates. This approach has been recently developed in (Maire, 2003) for numerical integration and, more related to our work, applied to (continuous time) Markov processes in (Gobet and Maire, 2005).

The idea is to replace the expectation of \Psi(r, \cdot) by an expectation of \Psi(r - LW, \cdot) for some function W close to V. From the linearity of \Psi and the equivalence between the representations (1) and (2), for any W one has

    V(x) = W(x) + E[\Psi(r - LW, X(x))].

Thus, if W is a good approximation of V, the residual r - LW is small, and the variance is low. In the sequential method described in this paper, we use successive approximations V_n of V to estimate by Monte Carlo a correction E_n based on the residual r - LV_n in \Psi, which is then used to build a new approximation V_{n+1}.
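As a quick numerical illustration of this identity (a sketch under assumed toy quantities, not the paper's experiment), the following checks V(x) = W(x) + E[\Psi(r - LW, X(x))] in the discounted case, with W a deliberately perturbed copy of V; the closer W is to V, the smaller the variance of the correction.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative 3-state chain (assumption, not from the paper)
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, 2.0])
gamma, T, N = 0.9, 200, 20000
L = np.eye(3) - gamma * P
V = np.linalg.solve(L, r)                 # exact value, for reference
W = V + rng.normal(scale=0.1, size=3)     # any approximation of V

def psi(f, traj):
    """Psi(f, X) = sum_t gamma^t f(x_t), truncated at T."""
    return sum(gamma ** t * f[x] for t, x in enumerate(traj))

def sample_traj(x0):
    traj = [x0]
    for _ in range(T - 1):
        traj.append(rng.choice(3, p=P[traj[-1]]))
    return traj

x0 = 0
corrections = [psi(r - L @ W, sample_traj(x0)) for _ in range(N)]
print(W[x0] + np.mean(corrections), V[x0])              # both close to V(x0)
print("variance of the corrected estimator:", np.var(corrections) / N)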

We consider an approximation operator A that is linear in the values. We show that (for enough sampled trajectories at each iteration) the variance of the estimator has a geometric rate \rho^N (with \rho < 1, and N the total number of sampled trajectories) until some threshold is reached, whose value is related to the approximation error AV - V.

An interesting extension of this method concerns the estimation of the gradient \partial_\alpha V of V with respect to (w.r.t.) some parameter \alpha of the transition matrix P. A useful application of such sensitivity analysis appears in policy gradient estimation. An optimal control problem may be approximated by a parametric optimization problem in a given space of parameterized policies. Thus, the transition matrix P depends on some (possibly multidimensional) policy parameter \alpha. In order to apply gradient methods to search for a local maximum of the performance in the parameter space, one wishes to estimate the policy gradient, i.e. the sensitivity Z = \partial_\alpha V of the performance measure with respect to \alpha. The gradient may be expressed as an expectation Z(x) = E[\Phi(r, X(x))], using the so-called likelihood ratio or score method (Reiman and Weiss, 1986; Glynn, 1987; Williams, 1992; Baxter and Bartlett, 2001; Marbach and Tsitsiklis, 2003). The gradient Z is also the solution to a linear system

    LZ = -\partial_\alpha L \, L^{-1} r = -\partial_\alpha L \, V.    (4)

Indeed, since V solves V = L^{-1} r, we have Z = \partial_\alpha V = -L^{-1} \partial_\alpha L \, L^{-1} r. For example, in the discounted case (3), the functional \Phi is defined by

    \Phi(r, X(x)) = \sum_{t \ge 0} \gamma^t r(x_t) \sum_{s=0}^{t-1} \frac{\partial_\alpha P(x_s, x_{s+1})}{P(x_s, x_{s+1})},    (5)

and Z is the solution to (4) with L = I - \gamma P and \partial_\alpha L = -\gamma \partial_\alpha P. We show that, using two approximations V_n and Z_n of the value function and the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations. Numerical experiments on a simple Gambler's ruin problem illustrate the approach.

Value function estimation

We first describe the approximation operator, linear in the values, considered here, then describe the algorithm, and state the main result on geometric variance reduction.

Approximation operator A

We consider a fixed set of J representative states X_J := {x_j \in X}_{1 \le j \le J} and functions {\phi_j : X \to IR}_{1 \le j \le J}. The linear approximation operator A maps any function W : X_J \to IR to the function AW : X \to IR, according to

    AW(x) = \sum_{j=1}^{J} W(x_j) \phi_j(x).    (6)

This kind of function approximation includes:

- Linear regression, for example with spline, polynomial, radial basis, Fourier or wavelet decomposition. This is the projection of a function W onto the space spanned by a set of functions {\psi_k : X \to IR}_{1 \le k \le K}, that is, the fit which minimizes some norm (induced by a discrete inner product <f, g> := \sum_{j=1}^{J} \mu_j f(x_j) g(x_j), for some distribution \mu over X_J):

    min_{\alpha \in IR^K} || \sum_{k=1}^{K} \alpha_k \psi_k - W ||.

The solution \alpha solves the linear system A\alpha = b, with A a K x K matrix of elements A_{kl} = <\psi_k, \psi_l> and b a K-vector of components b_k = <W, \psi_k>. Thus \alpha_k = \sum_{l=1}^{K} A^{-1}_{kl} \sum_{j=1}^{J} \mu_j \psi_l(x_j) W(x_j), and the best fit \sum_{k=1}^{K} \alpha_k \psi_k is thus of type (6) with

    \phi_j(x) = \mu_j \sum_{k=1}^{K} \sum_{l=1}^{K} A^{-1}_{kl} \psi_l(x_j) \psi_k(x).    (7)

- k-nearest neighbors (Hastie et al., 2001): here \phi_j(x) = 1/k if x_j is one of the k nearest neighbors of x, and \phi_j(x) = 0 otherwise.

- Locally weighted learning and kernel regression (Atkeson et al., 1997). Functions similar to (7) may be derived, with the matrix A depending on x (through the kernel).
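The least-squares weights (7) can be computed directly; below is a hedged sketch (the quadratic basis, the uniform weights mu and the 21-state domain are illustrative assumptions; only the representative states X_J = {1, 7, 13, 19} are borrowed from the experiments later in the paper).

import numpy as np

def lsq_weights(psi_basis, XJ, mu, X):
    """Coefficients phi_j(x) of eq. (7): AW(x) = sum_j W(x_j) phi_j(x),
    where AW is the mu-weighted least-squares fit, over the basis psi_k,
    of a function W known at the representative states X_J only."""
    Psi_J = np.array([[psi(xj) for psi in psi_basis] for xj in XJ])  # J x K
    Psi_X = np.array([[psi(x) for psi in psi_basis] for x in X])     # |X| x K
    A = Psi_J.T @ np.diag(mu) @ Psi_J                                # A_kl = <psi_k, psi_l>
    # phi[x, j] = mu_j * sum_{k,l} (A^{-1})_{kl} psi_l(x_j) psi_k(x)
    return Psi_X @ np.linalg.solve(A, Psi_J.T @ np.diag(mu))         # |X| x J

# Illustrative instance (assumption): quadratic basis, 4 representative states
psi_basis = [lambda x: 1.0, lambda x: float(x), lambda x: float(x) ** 2]
XJ = [1, 7, 13, 19]
mu = np.full(len(XJ), 1.0 / len(XJ))
X = range(21)
Phi = lsq_weights(psi_basis, XJ, mu, X)

W_at_XJ = np.array([np.sin(0.3 * xj) for xj in XJ])   # any values given at X_J
AW = Phi @ W_at_XJ                                    # the fit AW on the whole of X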
The algorithm

We assume the equivalence between the Markov representation (1) and its interpretation in terms of the solution to the linear system (2), i.e. for any function f : X \to IR we have

    f(x) = E[\Psi(Lf, X(x))].    (8)

We consider successive approximations V_n \in IR^J of V, defined at the states X_J = (x_j)_{1 \le j \le J}, built recursively as follows. We initialize V_0(x_j) = 0. At stage n, we use the values V_n(x_j) to provide a new estimation of V(x_j). Let E_n(x_j) := V(x_j) - AV_n(x_j) be the approximation error at the states X_J. From the equivalence property (8), we have AV_n(x) = E[\Psi(LAV_n, X(x))]. Thus, by linearity of \Psi w.r.t. its first variable,

    E_n(x_j) = E[\Psi(r - LAV_n, X(x_j))].

Now, we use a Monte Carlo technique to estimate E_n(x_j) with M trajectories (X^{n,m}(x_j))_{1 \le m \le M}: we calculate the average

    \hat{E}_n(x_j) := \frac{1}{M} \sum_{m=1}^{M} \Psi(r - LAV_n, X^{n,m}(x_j))

and define the new approximation at the states X_J:

    V_{n+1}(x_j) := AV_n(x_j) + \hat{E}_n(x_j).    (9)

Remark 1. Notice that there is a slight difference between this algorithm and that of (Gobet and Maire, 2005), which may be written V_{n+1}(x_j) = V_n(x_j) + A[\frac{1}{M} \sum_{m=1}^{M} \Psi(r - LV_n, X^{n,m}(x_j))]. Our formulation enables us to avoid the assumption of the idempotency of A (i.e. that A^2 = A), and guarantees that V_n is an unbiased estimate of V for all n, as shown in the next paragraph.
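A direct sketch of the resulting iteration for the discounted case L = I - gamma*P (an editor-added illustration, not the authors' code): Phi denotes the matrix of weights phi_j(x), e.g. built as in the previous sketch, states are assumed to be the integers 0..len(P)-1, and trajectories are truncated at a horizon T where gamma^T is negligible.

import numpy as np

def sequential_cv(P, r, gamma, XJ, Phi, n_iter, M, T=200, seed=0):
    """Sequential control variates, eq. (9), for the discounted case L = I - gamma*P.
    Phi[x, j] = phi_j(x) are the weights of the linear operator A; XJ are the
    representative states (integers indexing the rows of P and Phi)."""
    rng = np.random.default_rng(seed)
    S = len(P)
    L = np.eye(S) - gamma * P
    disc = gamma ** np.arange(T)
    V_n = np.zeros(len(XJ))                  # V_0 = 0 at the states X_J
    for _ in range(n_iter):
        AVn = Phi @ V_n                      # the function A V_n on the whole state space
        residual = r - L @ AVn               # r - L A V_n, evaluated at every state
        V_next = np.empty_like(V_n)
        for j, xj in enumerate(XJ):
            est = 0.0
            for _ in range(M):               # M fresh trajectories from x_j
                traj, x = [], xj
                for _ in range(T):
                    traj.append(x)
                    x = rng.choice(S, p=P[x])
                est += np.dot(disc, residual[traj])   # Psi(r - L A V_n, X^{n,m}(x_j))
            V_next[j] = AVn[xj] + est / M    # eq. (9): A V_n(x_j) + hat E_n(x_j)
        V_n = V_next
    return V_n                               # unbiased estimates of V at X_J

Given a stochastic transition matrix P over the same states, a discount gamma < 1 and the weights Phi from the previous sketch, a call such as sequential_cv(P, r, gamma, XJ, Phi, n_iter=10, M=100) returns the estimates V_n(x_j).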

Properties of the estimates V_n

We write the conditional expectations and variances given the trajectories sampled before stage n:

    E_n[Y] = E[Y | X^{p,m}(x_j), 0 \le p < n, 1 \le m \le M, 1 \le j \le J]   and   Var_n[Y] = E_n[Y^2] - (E_n[Y])^2.

We have the following properties of the estimates.

Expectation of V_n. From the definition (9), E_n[V_{n+1}(x_j)] = AV_n(x_j) + E_n(x_j) = V(x_j). Thus E[V_n(x_j)] = V(x_j) for all n \ge 1: the approximation V_n(x_j) is an unbiased estimate of V(x_j).

Variance of V_n. Write v_n = sup_{1 \le j \le J} Var V_n(x_j). The following result expresses that, for a large enough value of M, the variance decreases geometrically.

Theorem 1. We have

    v_{n+1} \le \rho_M v_n + \frac{2}{M} V_\Psi(V - AV)    (10)

with \rho_M = \frac{2}{M} \big( \sum_{j=1}^{J} \sqrt{V_\Psi(\phi_j)} \big)^2, using the notation V_\Psi(f) := sup_{1 \le j \le J} Var \Psi(Lf, X(x_j)). Thus, for a large enough value of M (i.e. whenever \rho_M < 1), (v_n)_n decreases geometrically at rate \rho_M, up to the threshold

    lim sup_{n} v_n \le \frac{2}{(1 - \rho_M) M} V_\Psi(V - AV).

If V belongs to the space of functions that are representable by A, i.e. AV = V, then the variance decreases geometrically to 0 at rate \tilde\rho^N, with \tilde\rho := \rho_M^{1/M} and N the total number of sampled trajectories per state x_j (i.e. N = nM is the number n of iterations times the number M of trajectories per iteration and state x_j). Further research should consider the best tradeoff between n and M for a given budget of a total of N = nM trajectories per state.

Notice that the threshold depends on the variance of \Psi for the function L(V - AV) = r - LAV, the residual of the representation (by A) of V. Notice also that this threshold depends on V - AV only at states reached by the trajectories {X(x_j)}_{x_j \in X_J}: a uniform (over the whole domain) representation of V is not required. Of course, once the threshold is reached, a further convergence of O(1/N) can be obtained thereafter, using regular Monte Carlo.

Example: the discounted case

Let us illustrate the sequential control variates algorithm on value function estimation in Markov chains. As an example, we consider the infinite-horizon, discounted case (3). The value function V(x) = E[\sum_{t \ge 0} \gamma^t r(x_t)] solves the Bellman equation V = r + \gamma P V, which may be written as the linear system (2) with L = I - \gamma P. In the previous algorithm, at stage n, the approximation error E_n(x_j) = V(x_j) - AV_n(x_j) is therefore the expectation

    E_n(x_j) = E\big[ \sum_{t \ge 0} \gamma^t [r(x_t) - (I - \gamma P) AV_n(x_t)] \,\big|\, x_0 = x_j \big].    (11)

We notice that the term r - (I - \gamma P) AV_n is the Bellman residual of the approximation AV_n. The estimate thus has zero variance if this approximation happens to be the value function. In model-free learning, it may be interesting (in order to avoid the computation of the expectation in P) to replace the term P AV_n(x_t) by AV_n(x_{t+1}) in (11), leaving the expectation unchanged (because of the linearity of the approximation operator A):

    E_n(x_j) = E\big[ \sum_{t \ge 0} \gamma^t [r(x_t) - AV_n(x_t) + \gamma AV_n(x_{t+1})] \,\big|\, x_0 = x_j \big],    (12)

at the cost of a slight increase of the variance (here, even if the approximation AV_n is the value function, the variance is not zero). The approximation error is thus the expected discounted sum of temporal differences (Sutton, 1988) r(x_t) - AV_n(x_t) + \gamma AV_n(x_{t+1}). Following the algorithm, the next approximation V_{n+1} is defined by (9) with \hat{E}_n(x_j) being a Monte Carlo estimate of (11) or (12).
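A sketch of an estimator of the model-free form (12), which can be plugged into (9) as \hat{E}_n(x_j) (an editor's illustration; the function names and the truncation horizon T are assumptions, and a simulator of P is assumed available):

import numpy as np

def td_correction(r, AVn, P, gamma, xj, M, T=200, rng=None):
    """Monte Carlo estimate (12) of E_n(x_j): the expected discounted sum of
    temporal differences r(x_t) - A V_n(x_t) + gamma * A V_n(x_{t+1}).
    Only sampling from P is used; the expectation P A V_n is never formed."""
    if rng is None:
        rng = np.random.default_rng(0)
    S = len(P)
    est = 0.0
    for _ in range(M):
        x, acc = xj, 0.0
        for t in range(T):
            x_next = rng.choice(S, p=P[x])
            acc += gamma ** t * (r[x] - AVn[x] + gamma * AVn[x_next])
            x = x_next
        est += acc
    return est / M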
Numerical experiment

We consider the Gambler's ruin problem described in (Kollman et al., 1999): a gambler with i dollars bets repeatedly against the house, whose initial capital is L - i. Each bet is one dollar and the gambler has probability p of winning. The state space is X = {0, ..., L} and the transition matrix P is defined, for i, j \in X, by

    P_{ij} = p        if j - i = 1 and 0 < i < L,
    P_{ij} = 1 - p    if i - j = 1 and 0 < i < L,
    P_{ij} = 0        otherwise.

Betting continues until either the gambler is ruined (i = 0) or he has broken the bank (i = L) (thus 0 and L are terminal states). We are interested in computing the probability of the gambler's eventual ruin V(i) when starting from initial fortune i. We thus define the reward function r(0) = 1 and r(i) = 0 for i \ne 0. The value function V solves the Bellman equation (I - P)V = r, and its value is

    V(i) = \frac{\lambda^i - \lambda^L}{1 - \lambda^L},  for i \in X,    (13)

with \lambda := \frac{1-p}{p} when p \ne 0.5, and V(i) = 1 - i/L for p = 0.5.

The representative states are X_J = {1, 7, 13, 19}. We consider two linear function approximators A_1 and A_2 that are projection operators (minimizing the L_2 norm at the states X_J) onto the space spanned by a set of functions {\psi_k : X \to IR}_{1 \le k \le K}. A_1 uses K = 2 functions \psi_1(i) = 1, \psi_2(i) = \lambda^i, i \in X, whereas A_2 uses K = 4 functions \psi_1(i) = 1, \psi_2(i) = i, \psi_3(i) = i^2, \psi_4(i) = i^3, i \in X. Notice that V is representable by A_1 (i.e. A_1 V = V) but not by A_2. We chose p = 0.51. We ran the algorithm with L = I - P (which is an invertible matrix). At each iteration, we used M = 100 simulations per state.
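For reference, a sketch constructing this chain and checking the closed form (13); the bank size L = 20 is an illustrative assumption (it is not stated above), p = 0.51 follows the text.

import numpy as np

def gamblers_ruin(L=20, p=0.51):
    """Transition matrix and reward of the Gambler's ruin chain; states 0 and L
    are terminal (their rows of P are zero, so I - P is invertible)."""
    P = np.zeros((L + 1, L + 1))
    for i in range(1, L):
        P[i, i + 1] = p
        P[i, i - 1] = 1.0 - p
    r = np.zeros(L + 1)
    r[0] = 1.0                     # reward 1 on ruin, 0 elsewhere
    return P, r

L_cap, p = 20, 0.51                # L = 20 is an editor's choice, not from the paper
P, r = gamblers_ruin(L_cap, p)
V = np.linalg.solve(np.eye(L_cap + 1) - P, r)   # probability of eventual ruin

lam = (1 - p) / p                  # closed form (13), valid for p != 0.5
V_exact = (lam ** np.arange(L_cap + 1) - lam ** L_cap) / (1 - lam ** L_cap)
print(np.max(np.abs(V - V_exact)))              # both computations agree to machine precision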

[Figure 1: Approximation error (in log scale) vs. number of iterations, for regular MC (10^8 trajectories) and for the sequential control variates algorithm using the approximations A_1 and A_2.]

Figure 1 shows the L_\infty approximation error max_{j \in X_J} |V(j) - V_n(j)|, in logarithmic scale, as a function of the iteration number n \ge 0. This approximation error (which is the true quantity of interest) is directly related to the variance of the estimates V_n. For the approximation A_1, we observe the geometric convergence to 0, as predicted in Theorem 1. It takes fewer than 10 x 100 simulations per state to reach an error of 10^{-15}. Using A_2, the error does not decrease below a threshold of about 2.10^{-5}, due to the approximation error V - A_2 V. This threshold is reached using about 5 x 100 simulations per state. For comparison, usual MC reaches an error of 10^{-4} with 10^8 simulations per state. The variance reduction obtained when using such sequential control variates is thus considerable.

Gradient estimation

Here, we assume that the transition matrix P depends on some parameter \alpha, and we wish to estimate the sensitivity of V(x) = E[\Psi(r, X(x))] with respect to \alpha, which we write Z(x) = \partial_\alpha V(x). An example of interest consists in solving approximately a Markov Decision Problem by searching for a feedback control law in a given class of parameterized stochastic policies. The optimal control problem is replaced by a parametric optimization problem, which may be solved (at least in order to find a local optimum) using gradient methods. Thus we are interested in estimating the gradient of the performance measure w.r.t. the parameter of the policy. In this example, the transition matrix P would be the transition matrix of the MDP combined with the parameterized policy.

As mentioned in the introduction, the gradient may be expressed as an expectation Z(x) = E[\Phi(r, X(x))], using the so-called likelihood ratio or score method (Reiman and Weiss, 1986; Glynn, 1987; Williams, 1992; Baxter and Bartlett, 2001; Marbach and Tsitsiklis, 2003), where \Phi(r, X(x)) is also a functional that depends on the trajectory X(x) and is linear in its first variable. For example, in the discounted case (3), the functional \Phi is given by (5). The variance of this estimator is usually high, so variance reduction techniques are highly needed (Greensmith et al., 2005).

The gradient Z is also the solution to the linear system (4). Unfortunately, this linear expression is not of the form (2), since \partial_\alpha L is not invertible, which prevents us from directly using the method of the previous section. However, the linear equation (4) provides us with another representation of Z in terms of Markov chains:

    Z(x) = E\big[ \sum_{t \ge 0} \gamma^t (-\partial_\alpha L \, V)(x_t) \big] = E[\Psi(-\partial_\alpha L \, V, X(x))].    (14)

We may extend the previous algorithm to the estimation of Z by using two representations, V_n and Z_n. The approximation V_n of V is updated from a Monte Carlo estimation of the residual r - LV_n, and Z_n, which approximates Z, is updated from the gradient residual -\partial_\alpha L \, V_n - LZ_n built from the current V_n. This approach may be related to the so-called Actor-Critic algorithms (Konda and Borkar, 1999; Sutton et al., 2000), which use the representation (14) with an approximation of the value function. A geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations.
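For comparison, here is a sketch of the plain likelihood-ratio (score) estimator of Z for the Gambler's ruin chain, i.e. the un-reduced estimator whose high variance motivates the method (gamma = 1 since trajectories terminate; the bank size L = 20, the sample size and the numerical-derivative check are the editor's assumptions).

import numpy as np

def lr_gradient(i0, L_cap=20, p=0.51, N=50_000, seed=0):
    """Plain likelihood-ratio estimate of Z(i0) = dV(i0)/dp.  This is Phi of
    eq. (5) specialised to this chain: the only nonzero reward is at ruin, so
    Phi = 1_{ruin} * (accumulated score), with score 1/p per up-step and
    -1/(1-p) per down-step."""
    rng = np.random.default_rng(seed)
    samples = np.empty(N)
    for n in range(N):
        i, score = i0, 0.0
        while 0 < i < L_cap:
            if rng.random() < p:
                score += 1.0 / p            # d/dp log p
                i += 1
            else:
                score += -1.0 / (1.0 - p)   # d/dp log(1-p)
                i -= 1
        samples[n] = (1.0 if i == 0 else 0.0) * score
    return samples.mean(), samples.std() / np.sqrt(N)   # estimate and its std error

lam = lambda q: (1 - q) / q
def Z_exact(i, L_cap=20, p=0.51, h=1e-6):
    """Numerical derivative of the closed form (13), for checking."""
    V = lambda q: (lam(q) ** i - lam(q) ** L_cap) / (1 - lam(q) ** L_cap)
    return (V(p + h) - V(p - h)) / (2 * h)

print(lr_gradient(10), Z_exact(10))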
The algorithm

From (4) and the equivalence property (8), we obtain the following representation for Z:

    Z(x) = AZ_n(x) + E[\Psi(-\partial_\alpha L \, V - LAZ_n, X(x))]
         = AZ_n(x) + E[\Psi(-\partial_\alpha L \, (V - AV_n), X(x)) - \Psi(\partial_\alpha L \, AV_n + LAZ_n, X(x))]
         = AZ_n(x) + E[\Phi(r - LAV_n, X(x)) - \Psi(\partial_\alpha L \, AV_n + LAZ_n, X(x))],    (15)

from which the algorithm is deduced. We consider successive approximations V_n \in IR^J of V and Z_n \in IR^J of Z, defined at the states X_J = (x_j)_{1 \le j \le J}. We initialize V_0(x_j) = 0 and Z_0(x_j) = 0. At stage n, we simulate by Monte Carlo M trajectories (X^{n,m}(x_j))_{1 \le m \le M} and define the new approximations V_{n+1} and Z_{n+1} at the states X_J (a code sketch of these coupled updates follows at the end of this subsection):

    V_{n+1}(x_j) = AV_n(x_j) + \frac{1}{M} \sum_{m=1}^{M} \Psi(r - LAV_n, X^{n,m}(x_j)),
    Z_{n+1}(x_j) = AZ_n(x_j) + \frac{1}{M} \sum_{m=1}^{M} \big[ \Phi(r - LAV_n, X^{n,m}(x_j)) - \Psi(\partial_\alpha L \, AV_n + LAZ_n, X^{n,m}(x_j)) \big].

Expectation of V_n and Z_n. We have already seen that E[V_n] = V for all n > 0. Now, (15) implies that E_n[Z_{n+1}] = Z, thus E[Z_n] = Z for all n > 0.

Variance of V_n and Z_n. We write v_n = sup_{1 \le j \le J} Var V_n(x_j) and z_n = sup_{1 \le j \le J} Var Z_n(x_j).
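A sketch of these coupled updates for an absorbing chain such as the Gambler's ruin (gamma = 1, L = I - P, \partial_\alpha L = -dP). The names PhiW, dP and the rest are the editor's, chosen to match the earlier sketches; this is an illustration of the updates displayed above, not the authors' implementation.

import numpy as np

def coupled_updates(P, dP, r, XJ, PhiW, n_iter, M, seed=0):
    """Coupled sequential estimates of V and of Z = dV/dalpha for an absorbing
    chain with L = I - P and d_alpha L = -dP.  PhiW[x, j] = phi_j(x) are the
    weights of the linear operator A; XJ are the representative states."""
    rng = np.random.default_rng(seed)
    S = len(P)
    I = np.eye(S)
    terminal = P.sum(axis=1) == 0.0            # zero rows: absorbing states
    V_n, Z_n = np.zeros(len(XJ)), np.zeros(len(XJ))
    for _ in range(n_iter):
        AVn, AZn = PhiW @ V_n, PhiW @ Z_n
        res_V = r - (I - P) @ AVn              # r - L A V_n
        g = -dP @ AVn + (I - P) @ AZn          # d_alpha(L) A V_n + L A Z_n
        V_next, Z_next = np.empty_like(V_n), np.empty_like(Z_n)
        for j, xj in enumerate(XJ):
            psi_r, phi_r, psi_g = 0.0, 0.0, 0.0
            for _ in range(M):
                x, score = xj, 0.0
                while True:
                    psi_r += res_V[x]
                    phi_r += res_V[x] * score  # Phi: residual times the score of steps 0..t-1
                    psi_g += g[x]
                    if terminal[x]:
                        break
                    x_next = rng.choice(S, p=P[x])
                    score += dP[x, x_next] / P[x, x_next]   # likelihood-ratio increment
                    x = x_next
            V_next[j] = AVn[xj] + psi_r / M
            Z_next[j] = AZn[xj] + (phi_r - psi_g) / M
        V_n, Z_n = V_next, Z_next
    return V_n, Z_n

With P and r from the Gambler's ruin sketch, dP given by dP[i, i+1] = 1 and dP[i, i-1] = -1 for 0 < i < L, and PhiW built as earlier from a basis containing 1, lambda^i and i*lambda^i, a call like coupled_updates(P, dP, r, XJ, PhiW, n_iter=10, M=1000) mirrors the experiment reported below; all of these names are assumptions of this sketch.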

5 Theorem. We have v n+ ρ v n + V Ψ(V AV ) z n+ ρ z n + [c (V AV, Z AZ) + c v n ] ( with ρ = J j= VΨ (φ j ) ), and the coefficients ( c (f, g) = VΦ (f) + V Ψ (L α Lf) + V Ψ (g)) c = [ J j=, V Φ (φ j ) + V Ψ (L α Lφ j )] using the notations V Ψ (f) := sup j J Var Ψ(Lf, X(x j )) and V Φ (f) := sup j J Var Φ(Lf, X(x j )). Thus, for a large enough value of, (i.e. whenever ρ < ), the convergence of (v n ) n and (z n ) n is geometric at rate ρ, up to the thresholds lim sup v n n ρ V Ψ(V AV ) [ c (V AV, Z AZ) lim sup z n n ρ ] +c ρ V V Ψ(V AV ). Here also, if V and Z are representable by A, then the variance converges geometrically to 0. Numerical experiment Again we consider the Gambler s ruin problem described previously. The transition matrix is parameterized by α = p, the probability of winning. The gradient Z(i) = α V (i) may be derived from (3): Z(i) = L( λi )λ L i( λ L )λ i ( λ L ) α for i X, (for α 0.5), and Z(i) = 0 for α = 0.5. Again we use the representative states X J = {, 7, 3, 9}. Here, we consider two approximators A and A for the value function representations V n, and two approximators A and A 3 for the gradient representations Z n, where A 3 is a projection that uses K = 3 functions ψ (i) =, ψ (i) = λ i, ψ 3 (i) = iλ i, i X. Notice that Z is representable by A 3 but not by A. We chose p = 0.5 and = 000. Figure shows the L approximation error of Z (max j XJ Z(j) Z n (j) ) in logarithmic scale, for the different possible approximations of V and Z. When both V and Z may be perfectly approximated (i.e. A for V and A 3 for Z) we observe the geometric convergence to 0, as predicted in Theorem. The error is about 0 4 using a total of 0 4 simulations. When either the value function or the gradient is not representable in the approximation spaces, the error does not decrease below some threshold ( when Z is not representable) reached in.0 3 simulations. The threshold is lower (.0 4 ) when Z is representable. For comparison, usual C reaches an error (for Z) of with 0 8 simulations per state. The variance reduction of this sequential method compared to regular C is thus also considerable. Approximation error of Z (in log scale) Number of iterations 3 Figure : Approximation error of the gradient Z = α V using approximators A and A for the value function, and A and A 3 for the gradient. Conclusion We described a sequential control variates method for estimating an expectation of functionals in arkov chains, using linear approximation (in the values). We illustrate the method on value function and policy estimates. We proved geometric variance reduction up to a threshold that depends on the approximation error of the functions of interest. Future work would consider sampling initial states from some distribution over X instead of using representative states X J. Proof of Theorem From the decomposition V AV n = V AV i= (V V n)(x i )φ i, (6) we have V n+ (x j ) = AV n (x j )+ [ Ψ(L(V AV ), X n,m (x j )) Thus m= i= (V V n)(x i )Ψ(Lφ i, X n,m (x j )) Var n V n+ (x j ) = Varn[ Ψ(L(V AV ), X(x j )) i= (V V n)(x i )Ψ(Lφ i, X(x j )) ]. We use the general bound Var [ i α ] [ iy i α i Var [Y i ] ], (7) for any real numbers (α i ) i and square integrable real random variables (Y i ) i, to deduce that i Var n V n+ (x j ) [ VΨ (V AV ) (8) 3 ]. i= V V n (x i ) V Ψ (φ i )], AAAI-05 / 06

with V_\Psi(f) := sup_{1 \le j \le J} Var \Psi(Lf, X(x_j)). Now, we use the variance decomposition

    Var V_{n+1}(x_j) = Var[E_n[V_{n+1}(x_j)]] + E[Var_n[V_{n+1}(x_j)]] = E[Var_n[V_{n+1}(x_j)]]

(since E_n[V_{n+1}(x_j)] = V(x_j) is deterministic), and the general bound

    E\Big[ \big( \alpha_0 + \sum_{i=1}^{J} \alpha_i Y_i \big)^2 \Big] \le \Big( \alpha_0 + \sum_{i=1}^{J} \alpha_i \sqrt{E[Y_i^2]} \Big)^2    (19)

for non-negative numbers (\alpha_i)_i, to deduce from (18) that

    v_{n+1} \le \frac{2}{M} \Big[ V_\Psi(V - AV) + \Big( \sum_{i=1}^{J} \sqrt{V_\Psi(\phi_i)} \Big)^2 v_n \Big],

which gives (10). Now, if M is such that \rho_M := \frac{2}{M} \big( \sum_{i=1}^{J} \sqrt{V_\Psi(\phi_i)} \big)^2 < 1, then taking the upper limit finishes the proof of Theorem 1.

Proof of Theorem 2

Using (4) and (16), we have the decomposition

    -\partial_\alpha L \, AV_n - LAZ_n = -\partial_\alpha L \, A(V_n - V) - \partial_\alpha L \, (AV - V) + L(Z - AZ) + LA(Z - Z_n)
        = \sum_{i=1}^{J} (V - V_n)(x_i) \, \partial_\alpha L \, \phi_i - \partial_\alpha L \, (AV - V) + L(Z - AZ) + \sum_{i=1}^{J} (Z - Z_n)(x_i) \, L\phi_i.

Now, using (16), Var_n Z_{n+1}(x_j) equals

    \frac{1}{M} Var_n\Big[ \Phi(L(V - AV), X(x_j)) + \sum_{i=1}^{J} (V - V_n)(x_i) \Phi(L\phi_i, X(x_j)) - \Psi(\partial_\alpha L \, (AV - V), X(x_j))
        + \sum_{i=1}^{J} (V - V_n)(x_i) \Psi(\partial_\alpha L \, \phi_i, X(x_j)) + \Psi(L(Z - AZ), X(x_j)) + \sum_{i=1}^{J} (Z - Z_n)(x_i) \Psi(L\phi_i, X(x_j)) \Big].

We use (17) to deduce that Var_n Z_{n+1}(x_j) is bounded by

    \frac{1}{M} \Big[ \sqrt{V_\Phi(V - AV)} + \sqrt{V_\Psi(L^{-1} \partial_\alpha L \, (AV - V))} + \sum_{i=1}^{J} |V - V_n|(x_i) \big( \sqrt{V_\Phi(\phi_i)} + \sqrt{V_\Psi(L^{-1} \partial_\alpha L \, \phi_i)} \big)
        + \sqrt{V_\Psi(Z - AZ)} + \sum_{i=1}^{J} |Z - Z_n|(x_i) \sqrt{V_\Psi(\phi_i)} \Big]^2.

Now, we use (19) to deduce that

    z_{n+1} \le \frac{2}{M} \Big\{ \Big( \sqrt{V_\Phi(V - AV)} + \sqrt{V_\Psi(L^{-1} \partial_\alpha L \, (AV - V))} + \Big[ \sum_{i=1}^{J} \big( \sqrt{V_\Phi(\phi_i)} + \sqrt{V_\Psi(L^{-1} \partial_\alpha L \, \phi_i)} \big) \Big] \sqrt{v_n} + \sqrt{V_\Psi(Z - AZ)} \Big)^2 + \Big[ \sum_{i=1}^{J} \sqrt{V_\Psi(\phi_i)} \Big]^2 z_n \Big\},

and Theorem 2 follows.

References

Atkeson, C. G., Schaal, S. A., and Moore, A. W. (1997). Locally weighted learning. AI Review, 11.

Baxter, J. and Bartlett, P. (2001). Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319-350.

Glynn, P. (1987). Likelihood ratio gradient estimation: an overview. In Thesen, A., Grant, H., and Kelton, W., editors, Proceedings of the 1987 Winter Simulation Conference.

Gobet, E. and Maire, S. (2005). Sequential control variates for functionals of Markov processes. To appear in SIAM Journal on Numerical Analysis.

Greensmith, E., Bartlett, P., and Baxter, J. (2005). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471-1530.

Halton, J. H. (1970). A retrospective and prospective survey of the Monte Carlo method. SIAM Review, 12(1):1-63.

Halton, J. H. (1994). Sequential Monte Carlo techniques for the solution of linear systems. Journal of Scientific Computing, 9:213-257.

Hammersley, J. and Handscomb, D. (1964). Monte Carlo Methods. Chapman and Hall.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics.

Kollman, C., Baggerly, K., Cox, D., and Picard, R. (1999). Adaptive importance sampling on discrete Markov chains. The Annals of Applied Probability, 9(2):391-412.

Konda, V. R. and Borkar, V. S. (1999). Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94-123.

Maire, S. (2003). An iterative computation of approximations on Korobov-like spaces. J. Comput. Appl. Math., 154:261-281.

Marbach, P. and Tsitsiklis, J. N. (2003). Approximate gradient methods in policy-space optimization of Markov reward processes. Journal of Discrete Event Dynamical Systems, 13:111-148.

Reiman, M. and Weiss, A. (1986). Sensitivity analysis via likelihood ratios. In Wilson, J., Henriksen, J., and Roberts, S., editors, Proceedings of the 1986 Winter Simulation Conference.

Sutton, R. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9-44.

Sutton, R., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. Neural Information Processing Systems, pages 1057-1063. MIT Press.

Williams, R. J. (1992).
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256.
