Geometric Variance Reduction in Markov Chains. Application to Value Function and Gradient Estimation


Rémi Munos
Centre de Mathématiques Appliquées, Ecole Polytechnique, 91128 Palaiseau Cedex, France

Copyright © 2005, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We study a sequential variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V. Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of samples. Here, we obtain a geometric variance reduction O(\rho^N) (with \rho < 1) up to a threshold that depends on the approximation error V - AV, where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. AV = V), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in policy iteration for Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is, computing the sensitivity \partial_\alpha V of the performance measure V with respect to some parameter \alpha of the transition probabilities. For example, in parametric optimization of the policy, an estimate of the policy gradient is required to perform a gradient optimization method. We show that, using two approximations, of the value function and of the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations.

Introduction

We consider a Markov chain over a finite state space X defined by the transition matrix P. We write X(x) for a trajectory (x_t)_{t \ge 0} starting at a state x_0 = x. Let \Psi(r, X(x)) be a functional that depends on some function r : X \to IR and on the trajectory X(x), and write V(x) for the expectation of the functional that we wish to evaluate:

    V(x) = E[\Psi(r, X(x))].    (1)

Here, the quantity of interest V is expressed in terms of a Markov representation, as an expectation of a functional that depends on trajectories. We will consider a functional \Psi(r, \cdot) that is linear in r, and such that its expectation V may equivalently be expressed in terms of the solution to a linear system

    LV = r,    (2)

with L an invertible linear operator (matrix). An example of \Psi is the sum of discounted rewards r received along the trajectory:

    \Psi(r, X(x)) = \sum_{t \ge 0} \gamma^t r(x_t),    (3)

with \gamma < 1 a discount factor. In that case, V is the solution to the Bellman equation (2) with L = I - \gamma P. Indeed, using matrix notations, V equals \sum_{t \ge 0} \gamma^t P^t r = (I - \gamma P)^{-1} r. Other functionals include the finite-horizon sum of rewards \Psi(r, X(x)) = \sum_{t=0}^{T} r(x_t) and the infinite-time average reward \Psi(r, X(x)) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} r(x_t).

A regular Monte Carlo (MC) method would estimate V(x) by sampling N independent trajectories {X^n(x)}_{1 \le n \le N} starting from x and calculating the average \frac{1}{N} \sum_{n=1}^{N} \Psi(r, X^n(x)). The variance of such an estimator is of order 1/N. Variance reduction is crucial since the numerical approximation error of the quantity of interest is directly related to the variance of its estimate. Variance reduction techniques include importance sampling, correlated sampling, control variates, antithetic variates and stratified sampling; see e.g. (Hammersley and Handscomb, 1964; Halton, 1970).
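To make the baseline concrete, here is a minimal sketch (an editor-added illustration, not from the paper) of the regular MC estimate of the discounted value (1)-(3). The small 3-state chain, reward vector and discount are arbitrary assumptions, and trajectories are truncated at a horizon T where gamma^T is negligible.

import numpy as np

def simulate(P, x0, T, rng):
    """Sample a length-T trajectory of the chain with transition matrix P."""
    traj = [x0]
    for _ in range(T - 1):
        traj.append(rng.choice(len(P), p=P[traj[-1]]))
    return traj

def mc_value(P, r, gamma, x0, N, T=200, seed=0):
    """Plain Monte Carlo estimate of V(x0) = E[sum_t gamma^t r(x_t)],
    truncated at horizon T.  Returns the estimate and its variance (~ 1/N)."""
    rng = np.random.default_rng(seed)
    discounts = gamma ** np.arange(T)
    returns = []
    for _ in range(N):
        traj = simulate(P, x0, T, rng)
        returns.append(np.dot(discounts, r[traj]))
    return np.mean(returns), np.var(returns) / N

# Illustrative 3-state chain (assumption, not from the paper)
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, 2.0])
gamma = 0.9
V_exact = np.linalg.solve(np.eye(3) - gamma * P, r)   # V = (I - gamma P)^{-1} r
print(mc_value(P, r, gamma, x0=0, N=1000), V_exact[0])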
Geometric variance reduction rates have been obtained by processing these variance reduction methods iteratively, the so-called sequential (or recursive) Monte Carlo. Examples include adaptive importance sampling (Kollman et al., 1999) and what Halton called the Third Sequential Method (Halton, 1994), based on sequential correlated sampling and control variates. This approach has been recently developed in (Maire, 2003) for numerical integration and, more related to our work, applied to (continuous time) Markov processes in (Gobet and Maire, 2005).

The idea is to replace the expectation of \Psi(r, \cdot) by an expectation of \Psi(r - LW, \cdot) for some function W close to V. From the linearity of \Psi and the equivalence between the representations (1) and (2), for any W one has

    V(x) = W(x) + E[\Psi(r - LW, X(x))].

Thus, if W is a good approximation of V, the residual r - LW is small, and the variance is low. In the sequential method described in this paper, we use successive approximations V_n of V to estimate by Monte Carlo a correction E_n based on the residual r - LV_n in \Psi, which is then used to build a new approximation V_{n+1}.
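As a quick numerical illustration of this identity (a sketch under assumed toy quantities, not the paper's experiment), the following checks V(x) = W(x) + E[\Psi(r - LW, X(x))] in the discounted case, with W a deliberately perturbed copy of V; the closer W is to V, the smaller the variance of the correction.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative 3-state chain (assumption, not from the paper)
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, 2.0])
gamma, T, N = 0.9, 200, 20000
L = np.eye(3) - gamma * P
V = np.linalg.solve(L, r)                 # exact value, for reference
W = V + rng.normal(scale=0.1, size=3)     # any approximation of V

def psi(f, traj):
    """Psi(f, X) = sum_t gamma^t f(x_t), truncated at T."""
    return sum(gamma ** t * f[x] for t, x in enumerate(traj))

def sample_traj(x0):
    traj = [x0]
    for _ in range(T - 1):
        traj.append(rng.choice(3, p=P[traj[-1]]))
    return traj

x0 = 0
corrections = [psi(r - L @ W, sample_traj(x0)) for _ in range(N)]
print(W[x0] + np.mean(corrections), V[x0])              # both close to V(x0)
print("variance of the corrected estimator:", np.var(corrections) / N)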

We consider an approximation operator A that is linear in the values. We show that (for enough sampled trajectories at each iteration) the variance of the estimator has a geometric rate \rho^N (with \rho < 1, and N the total number of sampled trajectories) until some threshold is reached, whose value is related to the approximation error AV - V.

An interesting extension of this method concerns the estimation of the gradient \partial_\alpha V of V with respect to (w.r.t.) some parameter \alpha of the transition matrix P. A useful application of such sensitivity analysis appears in policy gradient estimation. An optimal control problem may be approximated by a parametric optimization problem in a given space of parameterized policies. Thus, the transition matrix P depends on some (possibly multidimensional) policy parameter \alpha. In order to apply gradient methods to search for a local maximum of the performance in the parameter space, one wishes to estimate the policy gradient, i.e. the sensitivity Z = \partial_\alpha V of the performance measure with respect to \alpha. The gradient may be expressed as an expectation Z(x) = E[\Phi(r, X(x))], using the so-called likelihood ratio or score method (Reiman and Weiss, 1986; Glynn, 1987; Williams, 1992; Baxter and Bartlett, 2001; Marbach and Tsitsiklis, 2003). The gradient Z is also the solution to a linear system

    LZ = -\partial_\alpha L \, L^{-1} r = -\partial_\alpha L \, V.    (4)

Indeed, since V solves V = L^{-1} r, we have Z = \partial_\alpha V = -L^{-1} \partial_\alpha L \, L^{-1} r. For example, in the discounted case (3), the functional \Phi is defined by

    \Phi(r, X(x)) = \sum_{t \ge 0} \gamma^t r(x_t) \sum_{s=0}^{t-1} \frac{\partial_\alpha P(x_s, x_{s+1})}{P(x_s, x_{s+1})},    (5)

and Z is the solution to (4) with L = I - \gamma P and \partial_\alpha L = -\gamma \partial_\alpha P. We show that, using two approximations V_n and Z_n of the value function and the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations. Numerical experiments on a simple Gambler's ruin problem illustrate the approach.

Value function estimation

We first describe the approximation operator, linear in the values, considered here, then describe the algorithm, and state the main result on geometric variance reduction.

Approximation operator A

We consider a fixed set of J representative states X_J := {x_j \in X}_{1 \le j \le J} and functions {\phi_j : X \to IR}_{1 \le j \le J}. The linear approximation operator A maps any function W : X_J \to IR to the function AW : X \to IR, according to

    AW(x) = \sum_{j=1}^{J} W(x_j) \phi_j(x).    (6)

This kind of function approximation includes:

- Linear regression, for example with spline, polynomial, radial basis, Fourier or wavelet decomposition. This is the projection of a function W onto the space spanned by a set of functions {\psi_k : X \to IR}_{1 \le k \le K}, that is, the fit which minimizes some norm (induced by a discrete inner product <f, g> := \sum_{j=1}^{J} \mu_j f(x_j) g(x_j), for some distribution \mu over X_J):

    min_{\alpha \in IR^K} || \sum_{k=1}^{K} \alpha_k \psi_k - W ||.

The solution \alpha solves the linear system A\alpha = b, with A a K x K matrix of elements A_{kl} = <\psi_k, \psi_l> and b a K-vector of components b_k = <W, \psi_k>. Thus \alpha_k = \sum_{l=1}^{K} A^{-1}_{kl} \sum_{j=1}^{J} \mu_j \psi_l(x_j) W(x_j), and the best fit \sum_{k=1}^{K} \alpha_k \psi_k is thus of type (6) with

    \phi_j(x) = \mu_j \sum_{k=1}^{K} \sum_{l=1}^{K} A^{-1}_{kl} \psi_l(x_j) \psi_k(x).    (7)

- k-nearest neighbors (Hastie et al., 2001): here \phi_j(x) = 1/k if x_j is one of the k nearest neighbors of x, and \phi_j(x) = 0 otherwise.

- Locally weighted learning and kernel regression (Atkeson et al., 1997). Functions similar to (7) may be derived, with the matrix A depending on x (through the kernel).
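The least-squares weights (7) can be computed directly; below is a hedged sketch (the quadratic basis, the uniform weights mu and the 21-state domain are illustrative assumptions; only the representative states X_J = {1, 7, 13, 19} are borrowed from the experiments later in the paper).

import numpy as np

def lsq_weights(psi_basis, XJ, mu, X):
    """Coefficients phi_j(x) of eq. (7): AW(x) = sum_j W(x_j) phi_j(x),
    where AW is the mu-weighted least-squares fit, over the basis psi_k,
    of a function W known at the representative states X_J only."""
    Psi_J = np.array([[psi(xj) for psi in psi_basis] for xj in XJ])  # J x K
    Psi_X = np.array([[psi(x) for psi in psi_basis] for x in X])     # |X| x K
    A = Psi_J.T @ np.diag(mu) @ Psi_J                                # A_kl = <psi_k, psi_l>
    # phi[x, j] = mu_j * sum_{k,l} (A^{-1})_{kl} psi_l(x_j) psi_k(x)
    return Psi_X @ np.linalg.solve(A, Psi_J.T @ np.diag(mu))         # |X| x J

# Illustrative instance (assumption): quadratic basis, 4 representative states
psi_basis = [lambda x: 1.0, lambda x: float(x), lambda x: float(x) ** 2]
XJ = [1, 7, 13, 19]
mu = np.full(len(XJ), 1.0 / len(XJ))
X = range(21)
Phi = lsq_weights(psi_basis, XJ, mu, X)

W_at_XJ = np.array([np.sin(0.3 * xj) for xj in XJ])   # any values given at X_J
AW = Phi @ W_at_XJ                                    # the fit AW on the whole of X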
The algorithm

We assume the equivalence between the Markov representation (1) and its interpretation in terms of the solution to the linear system (2), i.e. for any function f : X \to IR we have

    f(x) = E[\Psi(Lf, X(x))].    (8)

We consider successive approximations V_n \in IR^J of V, defined at the states X_J = (x_j)_{1 \le j \le J}, built recursively as follows. We initialize V_0(x_j) = 0. At stage n, we use the values V_n(x_j) to provide a new estimation of V(x_j). Let E_n(x_j) := V(x_j) - AV_n(x_j) be the approximation error at the states X_J. From the equivalence property (8), we have AV_n(x) = E[\Psi(LAV_n, X(x))]. Thus, by linearity of \Psi w.r.t. its first variable,

    E_n(x_j) = E[\Psi(r - LAV_n, X(x_j))].

Now, we use a Monte Carlo technique to estimate E_n(x_j) with M trajectories (X^{n,m}(x_j))_{1 \le m \le M}: we calculate the average

    \hat{E}_n(x_j) := \frac{1}{M} \sum_{m=1}^{M} \Psi(r - LAV_n, X^{n,m}(x_j))

and define the new approximation at the states X_J:

    V_{n+1}(x_j) := AV_n(x_j) + \hat{E}_n(x_j).    (9)

Remark 1. Notice that there is a slight difference between this algorithm and that of (Gobet and Maire, 2005), which may be written V_{n+1}(x_j) = V_n(x_j) + A[\frac{1}{M} \sum_{m=1}^{M} \Psi(r - LV_n, X^{n,m}(x_j))]. Our formulation enables us to avoid the assumption of the idempotency of A (i.e. that A^2 = A), and guarantees that V_n is an unbiased estimate of V for all n, as shown in the next paragraph.
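A direct sketch of the resulting iteration for the discounted case L = I - gamma*P (an editor-added illustration, not the authors' code): Phi denotes the matrix of weights phi_j(x), e.g. built as in the previous sketch, states are assumed to be the integers 0..len(P)-1, and trajectories are truncated at a horizon T where gamma^T is negligible.

import numpy as np

def sequential_cv(P, r, gamma, XJ, Phi, n_iter, M, T=200, seed=0):
    """Sequential control variates, eq. (9), for the discounted case L = I - gamma*P.
    Phi[x, j] = phi_j(x) are the weights of the linear operator A; XJ are the
    representative states (integers indexing the rows of P and Phi)."""
    rng = np.random.default_rng(seed)
    S = len(P)
    L = np.eye(S) - gamma * P
    disc = gamma ** np.arange(T)
    V_n = np.zeros(len(XJ))                  # V_0 = 0 at the states X_J
    for _ in range(n_iter):
        AVn = Phi @ V_n                      # the function A V_n on the whole state space
        residual = r - L @ AVn               # r - L A V_n, evaluated at every state
        V_next = np.empty_like(V_n)
        for j, xj in enumerate(XJ):
            est = 0.0
            for _ in range(M):               # M fresh trajectories from x_j
                traj, x = [], xj
                for _ in range(T):
                    traj.append(x)
                    x = rng.choice(S, p=P[x])
                est += np.dot(disc, residual[traj])   # Psi(r - L A V_n, X^{n,m}(x_j))
            V_next[j] = AVn[xj] + est / M    # eq. (9): A V_n(x_j) + hat E_n(x_j)
        V_n = V_next
    return V_n                               # unbiased estimates of V at X_J

Given a stochastic transition matrix P over the same states, a discount gamma < 1 and the weights Phi from the previous sketch, a call such as sequential_cv(P, r, gamma, XJ, Phi, n_iter=10, M=100) returns the estimates V_n(x_j).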

Properties of the estimates V_n

We write the conditional expectations and variances given the trajectories sampled before stage n:

    E_n[Y] = E[Y | X^{p,m}(x_j), 0 \le p < n, 1 \le m \le M, 1 \le j \le J]   and   Var_n[Y] = E_n[Y^2] - (E_n[Y])^2.

We have the following properties of the estimates.

Expectation of V_n. From the definition (9), E_n[V_{n+1}(x_j)] = AV_n(x_j) + E_n(x_j) = V(x_j). Thus E[V_n(x_j)] = V(x_j) for all n \ge 1: the approximation V_n(x_j) is an unbiased estimate of V(x_j).

Variance of V_n. Write v_n = sup_{1 \le j \le J} Var V_n(x_j). The following result expresses that, for a large enough value of M, the variance decreases geometrically.

Theorem 1. We have

    v_{n+1} \le \rho_M v_n + \frac{2}{M} V_\Psi(V - AV)    (10)

with \rho_M = \frac{2}{M} \big( \sum_{j=1}^{J} \sqrt{V_\Psi(\phi_j)} \big)^2, using the notation V_\Psi(f) := sup_{1 \le j \le J} Var \Psi(Lf, X(x_j)). Thus, for a large enough value of M (i.e. whenever \rho_M < 1), (v_n)_n decreases geometrically at rate \rho_M, up to the threshold

    lim sup_{n} v_n \le \frac{2}{(1 - \rho_M) M} V_\Psi(V - AV).

If V belongs to the space of functions that are representable by A, i.e. AV = V, then the variance decreases geometrically to 0 at rate \tilde\rho^N, with \tilde\rho := \rho_M^{1/M} and N the total number of sampled trajectories per state x_j (i.e. N = nM is the number n of iterations times the number M of trajectories per iteration and state x_j). Further research should consider the best tradeoff between n and M for a given budget of a total of N = nM trajectories per state.

Notice that the threshold depends on the variance of \Psi for the function L(V - AV) = r - LAV, the residual of the representation (by A) of V. Notice also that this threshold depends on V - AV only at states reached by the trajectories {X(x_j)}_{x_j \in X_J}: a uniform (over the whole domain) representation of V is not required. Of course, once the threshold is reached, a further convergence of O(1/N) can be obtained thereafter, using regular Monte Carlo.

Example: the discounted case

Let us illustrate the sequential control variates algorithm on value function estimation in Markov chains. As an example, we consider the infinite-horizon, discounted case (3). The value function V(x) = E[\sum_{t \ge 0} \gamma^t r(x_t)] solves the Bellman equation V = r + \gamma P V, which may be written as the linear system (2) with L = I - \gamma P. In the previous algorithm, at stage n, the approximation error E_n(x_j) = V(x_j) - AV_n(x_j) is therefore the expectation

    E_n(x_j) = E\big[ \sum_{t \ge 0} \gamma^t [r(x_t) - (I - \gamma P) AV_n(x_t)] \,\big|\, x_0 = x_j \big].    (11)

We notice that the term r - (I - \gamma P) AV_n is the Bellman residual of the approximation AV_n. The estimate thus has zero variance if this approximation happens to be the value function. In model-free learning, it may be interesting (in order to avoid the computation of the expectation in P) to replace the term P AV_n(x_t) by AV_n(x_{t+1}) in (11), leaving the expectation unchanged (because of the linearity of the approximation operator A):

    E_n(x_j) = E\big[ \sum_{t \ge 0} \gamma^t [r(x_t) - AV_n(x_t) + \gamma AV_n(x_{t+1})] \,\big|\, x_0 = x_j \big],    (12)

at the cost of a slight increase of the variance (here, even if the approximation AV_n is the value function, the variance is not zero). The approximation error is thus the expected discounted sum of temporal differences (Sutton, 1988) r(x_t) - AV_n(x_t) + \gamma AV_n(x_{t+1}). Following the algorithm, the next approximation V_{n+1} is defined by (9) with \hat{E}_n(x_j) being a Monte Carlo estimate of (11) or (12).
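A sketch of an estimator of the model-free form (12), which can be plugged into (9) as \hat{E}_n(x_j) (an editor's illustration; the function names and the truncation horizon T are assumptions, and a simulator of P is assumed available):

import numpy as np

def td_correction(r, AVn, P, gamma, xj, M, T=200, rng=None):
    """Monte Carlo estimate (12) of E_n(x_j): the expected discounted sum of
    temporal differences r(x_t) - A V_n(x_t) + gamma * A V_n(x_{t+1}).
    Only sampling from P is used; the expectation P A V_n is never formed."""
    if rng is None:
        rng = np.random.default_rng(0)
    S = len(P)
    est = 0.0
    for _ in range(M):
        x, acc = xj, 0.0
        for t in range(T):
            x_next = rng.choice(S, p=P[x])
            acc += gamma ** t * (r[x] - AVn[x] + gamma * AVn[x_next])
            x = x_next
        est += acc
    return est / M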
Numerical experiment

We consider the Gambler's ruin problem described in (Kollman et al., 1999): a gambler with i dollars bets repeatedly against the house, whose initial capital is L - i. Each bet is one dollar and the gambler has probability p of winning. The state space is X = {0, ..., L} and the transition matrix P is defined, for i, j \in X, by

    P_{ij} = p        if j - i = 1 and 0 < i < L,
    P_{ij} = 1 - p    if i - j = 1 and 0 < i < L,
    P_{ij} = 0        otherwise.

Betting continues until either the gambler is ruined (i = 0) or he has broken the bank (i = L) (thus 0 and L are terminal states). We are interested in computing the probability of the gambler's eventual ruin V(i) when starting from initial fortune i. We thus define the reward function r(0) = 1 and r(i) = 0 for i \ne 0. The value function V solves the Bellman equation (I - P)V = r, and its value is

    V(i) = \frac{\lambda^i - \lambda^L}{1 - \lambda^L},  for i \in X,    (13)

with \lambda := \frac{1-p}{p} when p \ne 0.5, and V(i) = 1 - i/L for p = 0.5.

The representative states are X_J = {1, 7, 13, 19}. We consider two linear function approximators A_1 and A_2 that are projection operators (minimizing the L_2 norm at the states X_J) onto the space spanned by a set of functions {\psi_k : X \to IR}_{1 \le k \le K}. A_1 uses K = 2 functions \psi_1(i) = 1, \psi_2(i) = \lambda^i, i \in X, whereas A_2 uses K = 4 functions \psi_1(i) = 1, \psi_2(i) = i, \psi_3(i) = i^2, \psi_4(i) = i^3, i \in X. Notice that V is representable by A_1 (i.e. A_1 V = V) but not by A_2. We chose p = 0.51. We ran the algorithm with L = I - P (which is an invertible matrix). At each iteration, we used M = 100 simulations per state.
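For reference, a sketch constructing this chain and checking the closed form (13); the bank size L = 20 is an illustrative assumption (it is not stated above), p = 0.51 follows the text.

import numpy as np

def gamblers_ruin(L=20, p=0.51):
    """Transition matrix and reward of the Gambler's ruin chain; states 0 and L
    are terminal (their rows of P are zero, so I - P is invertible)."""
    P = np.zeros((L + 1, L + 1))
    for i in range(1, L):
        P[i, i + 1] = p
        P[i, i - 1] = 1.0 - p
    r = np.zeros(L + 1)
    r[0] = 1.0                     # reward 1 on ruin, 0 elsewhere
    return P, r

L_cap, p = 20, 0.51                # L = 20 is an editor's choice, not from the paper
P, r = gamblers_ruin(L_cap, p)
V = np.linalg.solve(np.eye(L_cap + 1) - P, r)   # probability of eventual ruin

lam = (1 - p) / p                  # closed form (13), valid for p != 0.5
V_exact = (lam ** np.arange(L_cap + 1) - lam ** L_cap) / (1 - lam ** L_cap)
print(np.max(np.abs(V - V_exact)))              # both computations agree to machine precision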

[Figure 1: Approximation error (in log scale) vs. number of iterations, for regular MC (10^8 trajectories) and for the sequential control variates algorithm using the approximations A_1 and A_2.]

Figure 1 shows the L_\infty approximation error max_{j \in X_J} |V(j) - V_n(j)|, in logarithmic scale, as a function of the iteration number n \ge 0. This approximation error (which is the true quantity of interest) is directly related to the variance of the estimates V_n. For the approximation A_1, we observe the geometric convergence to 0, as predicted in Theorem 1. It takes fewer than 10 x 100 simulations per state to reach an error of 10^{-15}. Using A_2, the error does not decrease below a threshold of about 2.10^{-5}, due to the approximation error V - A_2 V. This threshold is reached using about 5 x 100 simulations per state. For comparison, usual MC reaches an error of 10^{-4} with 10^8 simulations per state. The variance reduction obtained when using such sequential control variates is thus considerable.

Gradient estimation

Here, we assume that the transition matrix P depends on some parameter \alpha, and we wish to estimate the sensitivity of V(x) = E[\Psi(r, X(x))] with respect to \alpha, which we write Z(x) = \partial_\alpha V(x). An example of interest consists in solving approximately a Markov Decision Problem by searching for a feedback control law in a given class of parameterized stochastic policies. The optimal control problem is replaced by a parametric optimization problem, which may be solved (at least in order to find a local optimum) using gradient methods. Thus we are interested in estimating the gradient of the performance measure w.r.t. the parameter of the policy. In this example, the transition matrix P would be the transition matrix of the MDP combined with the parameterized policy.

As mentioned in the introduction, the gradient may be expressed as an expectation Z(x) = E[\Phi(r, X(x))], using the so-called likelihood ratio or score method (Reiman and Weiss, 1986; Glynn, 1987; Williams, 1992; Baxter and Bartlett, 2001; Marbach and Tsitsiklis, 2003), where \Phi(r, X(x)) is also a functional that depends on the trajectory X(x) and is linear in its first variable. For example, in the discounted case (3), the functional \Phi is given by (5). The variance of this estimator is usually high, so variance reduction techniques are highly needed (Greensmith et al., 2005).

The gradient Z is also the solution to the linear system (4). Unfortunately, this linear expression is not of the form (2), since \partial_\alpha L is not invertible, which prevents us from directly using the method of the previous section. However, the linear equation (4) provides us with another representation of Z in terms of Markov chains:

    Z(x) = E\big[ \sum_{t \ge 0} \gamma^t (-\partial_\alpha L \, V)(x_t) \big] = E[\Psi(-\partial_\alpha L \, V, X(x))].    (14)

We may extend the previous algorithm to the estimation of Z by using two representations, V_n and Z_n. The approximation V_n of V is updated from a Monte Carlo estimation of the residual r - LV_n, and Z_n, which approximates Z, is updated from the gradient residual -\partial_\alpha L \, V_n - LZ_n built from the current V_n. This approach may be related to the so-called Actor-Critic algorithms (Konda and Borkar, 1999; Sutton et al., 2000), which use the representation (14) with an approximation of the value function. A geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations.
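For comparison, here is a sketch of the plain likelihood-ratio (score) estimator of Z for the Gambler's ruin chain, i.e. the un-reduced estimator whose high variance motivates the method (gamma = 1 since trajectories terminate; the bank size L = 20, the sample size and the numerical-derivative check are the editor's assumptions).

import numpy as np

def lr_gradient(i0, L_cap=20, p=0.51, N=50_000, seed=0):
    """Plain likelihood-ratio estimate of Z(i0) = dV(i0)/dp.  This is Phi of
    eq. (5) specialised to this chain: the only nonzero reward is at ruin, so
    Phi = 1_{ruin} * (accumulated score), with score 1/p per up-step and
    -1/(1-p) per down-step."""
    rng = np.random.default_rng(seed)
    samples = np.empty(N)
    for n in range(N):
        i, score = i0, 0.0
        while 0 < i < L_cap:
            if rng.random() < p:
                score += 1.0 / p            # d/dp log p
                i += 1
            else:
                score += -1.0 / (1.0 - p)   # d/dp log(1-p)
                i -= 1
        samples[n] = (1.0 if i == 0 else 0.0) * score
    return samples.mean(), samples.std() / np.sqrt(N)   # estimate and its std error

lam = lambda q: (1 - q) / q
def Z_exact(i, L_cap=20, p=0.51, h=1e-6):
    """Numerical derivative of the closed form (13), for checking."""
    V = lambda q: (lam(q) ** i - lam(q) ** L_cap) / (1 - lam(q) ** L_cap)
    return (V(p + h) - V(p - h)) / (2 * h)

print(lr_gradient(10), Z_exact(10))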
The algorithm

From (4) and the equivalence property (8), we obtain the following representation for Z:

    Z(x) = AZ_n(x) + E[\Psi(-\partial_\alpha L \, V - LAZ_n, X(x))]
         = AZ_n(x) + E[\Psi(-\partial_\alpha L \, (V - AV_n), X(x)) - \Psi(\partial_\alpha L \, AV_n + LAZ_n, X(x))]
         = AZ_n(x) + E[\Phi(r - LAV_n, X(x)) - \Psi(\partial_\alpha L \, AV_n + LAZ_n, X(x))],    (15)

from which the algorithm is deduced. We consider successive approximations V_n \in IR^J of V and Z_n \in IR^J of Z, defined at the states X_J = (x_j)_{1 \le j \le J}. We initialize V_0(x_j) = 0 and Z_0(x_j) = 0. At stage n, we simulate by Monte Carlo M trajectories (X^{n,m}(x_j))_{1 \le m \le M} and define the new approximations V_{n+1} and Z_{n+1} at the states X_J (a code sketch of these coupled updates follows at the end of this subsection):

    V_{n+1}(x_j) = AV_n(x_j) + \frac{1}{M} \sum_{m=1}^{M} \Psi(r - LAV_n, X^{n,m}(x_j)),
    Z_{n+1}(x_j) = AZ_n(x_j) + \frac{1}{M} \sum_{m=1}^{M} \big[ \Phi(r - LAV_n, X^{n,m}(x_j)) - \Psi(\partial_\alpha L \, AV_n + LAZ_n, X^{n,m}(x_j)) \big].

Expectation of V_n and Z_n. We have already seen that E[V_n] = V for all n > 0. Now, (15) implies that E_n[Z_{n+1}] = Z, thus E[Z_n] = Z for all n > 0.

Variance of V_n and Z_n. We write v_n = sup_{1 \le j \le J} Var V_n(x_j) and z_n = sup_{1 \le j \le J} Var Z_n(x_j).
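A sketch of these coupled updates for an absorbing chain such as the Gambler's ruin (gamma = 1, L = I - P, \partial_\alpha L = -dP). The names PhiW, dP and the rest are the editor's, chosen to match the earlier sketches; this is an illustration of the updates displayed above, not the authors' implementation.

import numpy as np

def coupled_updates(P, dP, r, XJ, PhiW, n_iter, M, seed=0):
    """Coupled sequential estimates of V and of Z = dV/dalpha for an absorbing
    chain with L = I - P and d_alpha L = -dP.  PhiW[x, j] = phi_j(x) are the
    weights of the linear operator A; XJ are the representative states."""
    rng = np.random.default_rng(seed)
    S = len(P)
    I = np.eye(S)
    terminal = P.sum(axis=1) == 0.0            # zero rows: absorbing states
    V_n, Z_n = np.zeros(len(XJ)), np.zeros(len(XJ))
    for _ in range(n_iter):
        AVn, AZn = PhiW @ V_n, PhiW @ Z_n
        res_V = r - (I - P) @ AVn              # r - L A V_n
        g = -dP @ AVn + (I - P) @ AZn          # d_alpha(L) A V_n + L A Z_n
        V_next, Z_next = np.empty_like(V_n), np.empty_like(Z_n)
        for j, xj in enumerate(XJ):
            psi_r, phi_r, psi_g = 0.0, 0.0, 0.0
            for _ in range(M):
                x, score = xj, 0.0
                while True:
                    psi_r += res_V[x]
                    phi_r += res_V[x] * score  # Phi: residual times the score of steps 0..t-1
                    psi_g += g[x]
                    if terminal[x]:
                        break
                    x_next = rng.choice(S, p=P[x])
                    score += dP[x, x_next] / P[x, x_next]   # likelihood-ratio increment
                    x = x_next
            V_next[j] = AVn[xj] + psi_r / M
            Z_next[j] = AZn[xj] + (phi_r - psi_g) / M
        V_n, Z_n = V_next, Z_next
    return V_n, Z_n

With P and r from the Gambler's ruin sketch, dP given by dP[i, i+1] = 1 and dP[i, i-1] = -1 for 0 < i < L, and PhiW built as earlier from a basis containing 1, lambda^i and i*lambda^i, a call like coupled_updates(P, dP, r, XJ, PhiW, n_iter=10, M=1000) mirrors the experiment reported below; all of these names are assumptions of this sketch.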

5 Theorem. We have v n+ ρ v n + V Ψ(V AV ) z n+ ρ z n + [c (V AV, Z AZ) + c v n ] ( with ρ = J j= VΨ (φ j ) ), and the coefficients ( c (f, g) = VΦ (f) + V Ψ (L α Lf) + V Ψ (g)) c = [ J j=, V Φ (φ j ) + V Ψ (L α Lφ j )] using the notations V Ψ (f) := sup j J Var Ψ(Lf, X(x j )) and V Φ (f) := sup j J Var Φ(Lf, X(x j )). Thus, for a large enough value of, (i.e. whenever ρ < ), the convergence of (v n ) n and (z n ) n is geometric at rate ρ, up to the thresholds lim sup v n n ρ V Ψ(V AV ) [ c (V AV, Z AZ) lim sup z n n ρ ] +c ρ V V Ψ(V AV ). Here also, if V and Z are representable by A, then the variance converges geometrically to 0. Numerical experiment Again we consider the Gambler s ruin problem described previously. The transition matrix is parameterized by α = p, the probability of winning. The gradient Z(i) = α V (i) may be derived from (3): Z(i) = L( λi )λ L i( λ L )λ i ( λ L ) α for i X, (for α 0.5), and Z(i) = 0 for α = 0.5. Again we use the representative states X J = {, 7, 3, 9}. Here, we consider two approximators A and A for the value function representations V n, and two approximators A and A 3 for the gradient representations Z n, where A 3 is a projection that uses K = 3 functions ψ (i) =, ψ (i) = λ i, ψ 3 (i) = iλ i, i X. Notice that Z is representable by A 3 but not by A. We chose p = 0.5 and = 000. Figure shows the L approximation error of Z (max j XJ Z(j) Z n (j) ) in logarithmic scale, for the different possible approximations of V and Z. When both V and Z may be perfectly approximated (i.e. A for V and A 3 for Z) we observe the geometric convergence to 0, as predicted in Theorem. The error is about 0 4 using a total of 0 4 simulations. When either the value function or the gradient is not representable in the approximation spaces, the error does not decrease below some threshold ( when Z is not representable) reached in.0 3 simulations. The threshold is lower (.0 4 ) when Z is representable. For comparison, usual C reaches an error (for Z) of with 0 8 simulations per state. The variance reduction of this sequential method compared to regular C is thus also considerable. Approximation error of Z (in log scale) Number of iterations 3 Figure : Approximation error of the gradient Z = α V using approximators A and A for the value function, and A and A 3 for the gradient. Conclusion We described a sequential control variates method for estimating an expectation of functionals in arkov chains, using linear approximation (in the values). We illustrate the method on value function and policy estimates. We proved geometric variance reduction up to a threshold that depends on the approximation error of the functions of interest. Future work would consider sampling initial states from some distribution over X instead of using representative states X J. Proof of Theorem From the decomposition V AV n = V AV i= (V V n)(x i )φ i, (6) we have V n+ (x j ) = AV n (x j )+ [ Ψ(L(V AV ), X n,m (x j )) Thus m= i= (V V n)(x i )Ψ(Lφ i, X n,m (x j )) Var n V n+ (x j ) = Varn[ Ψ(L(V AV ), X(x j )) i= (V V n)(x i )Ψ(Lφ i, X(x j )) ]. We use the general bound Var [ i α ] [ iy i α i Var [Y i ] ], (7) for any real numbers (α i ) i and square integrable real random variables (Y i ) i, to deduce that i Var n V n+ (x j ) [ VΨ (V AV ) (8) 3 ]. i= V V n (x i ) V Ψ (φ i )], AAAI-05 / 06

with V_\Psi(f) := sup_{1 \le j \le J} Var \Psi(Lf, X(x_j)). Now, we use the variance decomposition

    Var V_{n+1}(x_j) = Var[E_n[V_{n+1}(x_j)]] + E[Var_n[V_{n+1}(x_j)]] = E[Var_n[V_{n+1}(x_j)]]

(since E_n[V_{n+1}(x_j)] = V(x_j) is deterministic), and the general bound

    E\Big[ \big( \alpha_0 + \sum_{i=1}^{J} \alpha_i Y_i \big)^2 \Big] \le \Big( \alpha_0 + \sum_{i=1}^{J} \alpha_i \sqrt{E[Y_i^2]} \Big)^2    (19)

for non-negative numbers (\alpha_i)_i, to deduce from (18) that

    v_{n+1} \le \frac{2}{M} \Big[ V_\Psi(V - AV) + \Big( \sum_{i=1}^{J} \sqrt{V_\Psi(\phi_i)} \Big)^2 v_n \Big],

which gives (10). Now, if M is such that \rho_M := \frac{2}{M} \big( \sum_{i=1}^{J} \sqrt{V_\Psi(\phi_i)} \big)^2 < 1, then taking the upper limit finishes the proof of Theorem 1.

Proof of Theorem 2

Using (4) and (16), we have the decomposition

    -\partial_\alpha L \, AV_n - LAZ_n = -\partial_\alpha L \, A(V_n - V) - \partial_\alpha L \, (AV - V) + L(Z - AZ) + LA(Z - Z_n)
        = \sum_{i=1}^{J} (V - V_n)(x_i) \, \partial_\alpha L \, \phi_i - \partial_\alpha L \, (AV - V) + L(Z - AZ) + \sum_{i=1}^{J} (Z - Z_n)(x_i) \, L\phi_i.

Now, using (16), Var_n Z_{n+1}(x_j) equals

    \frac{1}{M} Var_n\Big[ \Phi(L(V - AV), X(x_j)) + \sum_{i=1}^{J} (V - V_n)(x_i) \Phi(L\phi_i, X(x_j)) - \Psi(\partial_\alpha L \, (AV - V), X(x_j))
        + \sum_{i=1}^{J} (V - V_n)(x_i) \Psi(\partial_\alpha L \, \phi_i, X(x_j)) + \Psi(L(Z - AZ), X(x_j)) + \sum_{i=1}^{J} (Z - Z_n)(x_i) \Psi(L\phi_i, X(x_j)) \Big].

We use (17) to deduce that Var_n Z_{n+1}(x_j) is bounded by

    \frac{1}{M} \Big[ \sqrt{V_\Phi(V - AV)} + \sqrt{V_\Psi(L^{-1} \partial_\alpha L \, (AV - V))} + \sum_{i=1}^{J} |V - V_n|(x_i) \big( \sqrt{V_\Phi(\phi_i)} + \sqrt{V_\Psi(L^{-1} \partial_\alpha L \, \phi_i)} \big)
        + \sqrt{V_\Psi(Z - AZ)} + \sum_{i=1}^{J} |Z - Z_n|(x_i) \sqrt{V_\Psi(\phi_i)} \Big]^2.

Now, we use (19) to deduce that

    z_{n+1} \le \frac{2}{M} \Big\{ \Big( \sqrt{V_\Phi(V - AV)} + \sqrt{V_\Psi(L^{-1} \partial_\alpha L \, (AV - V))} + \Big[ \sum_{i=1}^{J} \big( \sqrt{V_\Phi(\phi_i)} + \sqrt{V_\Psi(L^{-1} \partial_\alpha L \, \phi_i)} \big) \Big] \sqrt{v_n} + \sqrt{V_\Psi(Z - AZ)} \Big)^2 + \Big[ \sum_{i=1}^{J} \sqrt{V_\Psi(\phi_i)} \Big]^2 z_n \Big\},

and Theorem 2 follows.

References

Atkeson, C. G., Schaal, S. A., and Moore, A. W. (1997). Locally weighted learning. AI Review, 11.

Baxter, J. and Bartlett, P. (2001). Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319-350.

Glynn, P. (1987). Likelihood ratio gradient estimation: an overview. In Thesen, A., Grant, H., and Kelton, W., editors, Proceedings of the 1987 Winter Simulation Conference.

Gobet, E. and Maire, S. (2005). Sequential control variates for functionals of Markov processes. To appear in SIAM Journal on Numerical Analysis.

Greensmith, E., Bartlett, P., and Baxter, J. (2005). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471-1530.

Halton, J. H. (1970). A retrospective and prospective survey of the Monte Carlo method. SIAM Review, 12(1):1-63.

Halton, J. H. (1994). Sequential Monte Carlo techniques for the solution of linear systems. Journal of Scientific Computing, 9:213-257.

Hammersley, J. and Handscomb, D. (1964). Monte Carlo Methods. Chapman and Hall.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics.

Kollman, C., Baggerly, K., Cox, D., and Picard, R. (1999). Adaptive importance sampling on discrete Markov chains. The Annals of Applied Probability, 9(2):391-412.

Konda, V. R. and Borkar, V. S. (1999). Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94-123.

Maire, S. (2003). An iterative computation of approximations on Korobov-like spaces. J. Comput. Appl. Math., 154:261-281.

Marbach, P. and Tsitsiklis, J. N. (2003). Approximate gradient methods in policy-space optimization of Markov reward processes. Journal of Discrete Event Dynamical Systems, 13:111-148.

Reiman, M. and Weiss, A. (1986). Sensitivity analysis via likelihood ratios. In Wilson, J., Henriksen, J., and Roberts, S., editors, Proceedings of the 1986 Winter Simulation Conference.

Sutton, R. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9-44.

Sutton, R., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. Neural Information Processing Systems, pages 1057-1063. MIT Press.

Williams, R. J. (1992).
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256.
