On Differentiability of Average Cost in Parameterized Markov Chains


Vijay Konda    John N. Tsitsiklis

August 30

1 Overview

The purpose of this appendix is to prove Theorem 4.6 in [5] and establish various facts used in verifying some of the assumptions of Theorem A.1 therein. We use the same notation and assumptions as in [5], and to simplify the analysis, we assume that the parameter $\theta$ is scalar. The setting used here is similar (but not identical) to the one in [4]. The major difference is that the setting of [4] allows only RSPs of the form $U_k = f(X_k, W_k)$, where the parameter enters only through the distribution of $W_k$, which does not depend on $x$. In contrast, our setting allows for any RSPs of the form $U_k = f_\theta(X_k, W_k)$, as long as the distribution $\mu_\theta(u \mid x)$ of $f_\theta(x, W)$ satisfies the required assumptions.

The differentiability results presented in this report are based on the following key lemma (see [3, Corollary 2.2.1]).

Lemma 1.1. Consider a parameterized family of random variables $\{F_\theta(\omega);\ \theta \in \mathbb{R}\}$ on a Polish space $\Omega$. Suppose the following hold:

1. For each $\theta_0$, there exist some $d > 1$ and $K_d > 0$ such that
\[ E\big[|F_\theta - F_{\theta_0}|^d\big] \le K_d\, |\theta - \theta_0|^d, \]
for all $\theta$ sufficiently close to $\theta_0$.

2. There is a family of random variables $\{f_\theta\}$ such that
\[ \frac{F_{\theta+h} - F_\theta}{h} \xrightarrow{P} f_\theta, \quad \text{as } h \to 0, \]
where $\xrightarrow{P}$ denotes convergence in probability.
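As a quick numerical illustration of Lemma 1.1 (not part of the formal development), the sketch below checks the conclusion $\frac{d}{d\theta}E[F_\theta] = E[f_\theta]$ on a hypothetical family $F_\theta = e^{\theta Z}$ with $Z$ standard normal, for which $f_\theta = Z e^{\theta Z}$ and $E[F_\theta] = e^{\theta^2/2}$ in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)

def F(theta):
    # pathwise family F_theta(omega) = exp(theta * Z(omega)), on common samples of Z
    return np.exp(theta * z)

theta, h = 0.5, 1e-4
# Hypothesis 2 of the lemma: (F_{theta+h} - F_theta)/h converges to f_theta = Z exp(theta Z)
pathwise = (F(theta + h) - F(theta)) / h
f_theta = z * np.exp(theta * z)
# Conclusion: d/dtheta E[F_theta] = E[f_theta]; here d/dtheta exp(theta^2/2) is known exactly
deriv_mc = f_theta.mean()
exact = theta * np.exp(theta**2 / 2)
print(float(np.abs(pathwise - f_theta).mean()))  # pathwise quotient already close to f_theta
print(deriv_mc, exact)
```

The Monte Carlo average of $f_\theta$ matches the analytic derivative up to sampling error, and the difference quotient tracks $f_\theta$ pathwise, as the lemma's hypotheses require.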

Then the map $\theta \mapsto E[F_\theta]$ is differentiable, with
\[ \frac{d}{d\theta} E[F_\theta] = E[f_\theta]. \]

Our proof of the differentiability results involves the following steps.

1. Obtain a representation of the steady-state expected cost and of solutions to the Poisson equation in terms of suitably defined expectations $E_{\theta,x}[\cdot]$.

2. Using likelihood ratios, represent the above expectations using only the expectations $E_{\theta_0,x}[\cdot]$ corresponding to some fixed $\theta_0$. This new representation has the advantage that the probability law with respect to which expectations are taken does not depend on the parameter. Only the random variables inside the expectation depend on the parameter, making our key result, Lemma 1.1, applicable to this representation.

3. Verify the hypotheses of Lemma 1.1.

The next section carries out these steps for expectations over a finite horizon. In Section 3, we recall the results on the regenerative representations for the average cost and solutions to the Poisson equation. In Section 4, we introduce the likelihood ratio representations and show that they satisfy the assumptions of Lemma 1.1. In Sections 5 and 6, we use all the previous results to prove differentiability of expectations over an infinite horizon and verify some of the assumptions of Theorem A.1 in [5].

2 Finite Horizon

In this section, we prove the differentiability of expectations of the form $E_{\theta,x}[W]$, where $W$ is a random variable that depends only on a finite number of state-decision pairs $(X_k, U_k)$. For each $\theta$ and $k$, let
\[ \Lambda_{\theta,k} = \prod_{l=0}^{k} \frac{\mu_\theta(U_l \mid X_l)}{\mu_{\theta_0}(U_l \mid X_l)}. \]
It is easy to see that $\Lambda_{\theta,k}$ is the Radon-Nikodym derivative of the distribution of $(X_l, U_l)_{l=0}^{k}$ under policy $\theta$ with respect to the distribution under policy $\theta_0$. We will use the following basic lemma to derive some bounds on $\Lambda_{\theta,k}$.

Lemma 2.1. If $W_0, \ldots, W_{k-1}$ are positive random variables satisfying
\[ E\big[|W_l - 1|^d\big] \le \epsilon^d C(d), \qquad d \ge 1, \; l = 0, \ldots, k-1, \]
for some monotonic function $C(\cdot) \ge 1$, and if $\epsilon < 1/(k 2^k)$, then
\[ E\bigg[\Big|\prod_{l=0}^{k-1} W_l - 1\Big|^d\bigg] \le \epsilon^d (2^k k)^d C(kd), \qquad E\bigg[\Big|\prod_{l=0}^{k-1} W_l\Big|^d\bigg] \le 2^d C(kd), \]
for any $d \ge 1$.
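The likelihood-ratio representation $E_{\theta,x}[W] = E_{\theta_0,x}[\Lambda_{\theta,k} W]$ underlying this section can be checked by simulation. The sketch below uses a hypothetical two-state, two-action chain with a logistic randomized policy (the policy, kernel, and cost are all invented for illustration) and compares a direct estimate of $E_\theta[W]$ with a reweighted estimate built from trajectories drawn under $\theta_0$:

```python
import numpy as np

rng = np.random.default_rng(1)

def p_act1(theta, x):
    # hypothetical randomized policy: P(U = 1 | x) under parameter theta
    return 1.0 / (1.0 + np.exp(-(theta + 0.3 * x)))

def mu(theta, x, u):
    p = p_act1(theta, x)
    return p if u == 1 else 1.0 - p

def simulate(theta, k, x0=0):
    # returns states X_0..X_{k+1} and decisions U_0..U_k
    xs, us = [x0], []
    for _ in range(k + 1):
        u = int(rng.random() < p_act1(theta, xs[-1]))
        us.append(u)
        xs.append(int(rng.random() < (0.7 if u == 1 else 0.2)))
    return xs, us

theta0, theta, k, n = 0.0, 0.4, 5, 20_000
direct, reweighted = [], []
for _ in range(n):
    xs, _ = simulate(theta, k)
    direct.append(xs[k])                     # W = X_k, a simple finite-horizon cost
    xs0, us0 = simulate(theta0, k)
    lam = 1.0                                # Lambda_{theta,k} = prod mu_theta / mu_theta0
    for x, u in zip(xs0[:k + 1], us0):
        lam *= mu(theta, x, u) / mu(theta0, x, u)
    reweighted.append(lam * xs0[k])
est_direct, est_lr = float(np.mean(direct)), float(np.mean(reweighted))
print(est_direct, est_lr)
```

Both estimators target the same expectation; only the second samples under the fixed law $P_{\theta_0}$, which is exactly what makes Lemma 1.1 applicable.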

Proof. For any $d \ge 1$ and $z \ge 0$, we have
\[ z^d \le 2^{d-1}\big(1 + |z - 1|^d\big), \tag{1} \]
and therefore
\[ E\big[W_l^d\big] \le 2^{d-1}\big(1 + \epsilon^d C(d)\big) \le 2^d C(d), \qquad l = 0, \ldots, k-1. \]
To prove the first part, note that
\[ z_0 \cdots z_{k-1} - 1 = (z_0 - 1) + z_0 (z_1 - 1) + \cdots + z_0 \cdots z_{k-2} (z_{k-1} - 1), \]
and therefore
\[ |z_0 \cdots z_{k-1} - 1|^d \le k^{d-1}\Big( |z_0 - 1|^d + z_0^d |z_1 - 1|^d + \cdots + z_0^d \cdots z_{k-2}^d \, |z_{k-1} - 1|^d \Big). \]
We now use Holder's inequality to see that
\[ E\bigg[\Big|\prod_{l=0}^{k-1} W_l - 1\Big|^d\bigg] \le k^{d-1} \sum_{l=0}^{k-1} E\bigg[\prod_{j=0}^{l-1} W_j^d \, |W_l - 1|^d\bigg] \le k^{d-1} \sum_{l=0}^{k-1} \prod_{j=0}^{l-1} \Big(E\big[W_j^{(l+1)d}\big]\Big)^{1/(l+1)} \Big(E\big[|W_l - 1|^{(l+1)d}\big]\Big)^{1/(l+1)} \le k^{d-1} \sum_{l=0}^{k-1} 2^{ld} \epsilon^d C((l+1)d) \le \epsilon^d (2^k k)^d C(kd). \]
To prove the second part, use (1) and the fact that $\epsilon < 1/(k 2^k)$ to simplify the bound.

The following are two immediate consequences of the above lemma. In the following lemma, note that the constant $K_d$ might depend on $k$. This dependence can be safely ignored, as the lemma is intended to verify the first assumption of Lemma 1.1 for a fixed $k$.

Lemma 2.2.

1. For every $d \ge 1$, there exists $K_d > 0$ such that
\[ E_{\theta_0,x}\big[|\Lambda_{\theta,k} - 1|^d\big] \le K_d |\theta - \theta_0|^d L(x), \qquad \forall\, \theta, x. \]

2. For every $d \notin (-1, 1)$, there exists $K_d > 0$ such that
\[ E_{\theta_0,x}\big[\Lambda_{\theta,k}^d\big] \le K_d L(x), \qquad \forall\, x, \]
and for all $\theta$ sufficiently close to $\theta_0$.

Proof. Use the mean-value theorem to see that
\[ \bigg|\frac{\mu_\theta(u \mid x)}{\mu_{\theta_0}(u \mid x)} - 1\bigg| \le |\theta - \theta_0| \, \sup_{\bar\theta} \bigg|\frac{\nabla \mu_{\bar\theta}(u \mid x)}{\mu_{\theta_0}(u \mid x)}\bigg|. \tag{2} \]
Since $\nabla\mu_\theta(u \mid x)/\mu_{\theta_0}(u \mid x)$ belongs to $\mathcal{D}$, we have
\[ E_{\theta,x}\bigg[\sup_{\bar\theta} \bigg|\frac{\nabla \mu_{\bar\theta}(U_k \mid X_k)}{\mu_{\theta_0}(U_k \mid X_k)}\bigg|^d\bigg] \le K_d L(x), \]
for some constant $K_d$. Part 1, and Part 2 for $d \ge 1$, are implied by Lemma 2.1. To prove Part 2 for $d \le -1$, note that inequality (2) holds even if $\theta$ and $\theta_0$ are interchanged.

The above lemma verifies the first assumption of Lemma 1.1 for expectations of random variables that are functions of state-decision pairs up to time $k$. The next lemma verifies the second assumption and derives a formula for the gradient.

Lemma 2.3. Consider a family of random variables $\{W_\theta\}$ such that the map $\theta \mapsto W_\theta(\omega)$ is differentiable for each $\omega$, $W_\theta$ is a function of the state-decision pairs up to time $k$, and, for some $d > 1$,
\[ E_{\theta,x}\big[|W_\theta|^d\big] < \infty, \qquad E_{\theta,x}\Big[\sup_{\bar\theta} |\nabla W_{\bar\theta}|^d\Big] < \infty, \qquad \forall\, x. \]
Then, the map $\theta \mapsto E_{\theta,x}[W_\theta]$ is differentiable, with
\[ \nabla E_{\theta,x}[W_\theta] = E_{\theta,x}[\nabla W_\theta] + E_{\theta,x}\big[\dot\Lambda_{\theta,k} W_\theta\big], \]
where
\[ \dot\Lambda_{\theta,k} = \sum_{l=0}^{k} \psi_\theta(X_l, U_l) = \sum_{l=0}^{k} \frac{\nabla \mu_\theta(U_l \mid X_l)}{\mu_\theta(U_l \mid X_l)}. \]

Proof. Note that $E_{\theta,x}[W_\theta] = E_{\theta_0,x}[\Lambda_{\theta,k} W_\theta]$. Using the bounds of the previous lemma, the fact that the $d$-moments of $W_\theta$ and $\sup_{\bar\theta} |\nabla W_{\bar\theta}|$ are finite, and the mean-value theorem, verify the first assumption of Lemma 1.1. Using the fact that $\mu_\theta(u \mid x)$ and $W_\theta$ are differentiable in $\theta$, the second assumption of Lemma 1.1 follows easily. The gradient formula now follows from interchanging the order of differentiation and expectation.

The following is an immediate consequence of the previous result. It uses the formula in the previous result to derive bounds on the rate at which the gradient $\nabla E_{\theta,x}[f_\theta(X_k, U_k)]$ grows with $k$.
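The gradient formula of Lemma 2.3 can be verified exactly on a chain small enough to enumerate. In the sketch below (a toy check, with a hypothetical policy and kernel; the cost $W = X_k$ does not depend on $\theta$, so the $E_{\theta,x}[\nabla W_\theta]$ term vanishes and the score sum runs over the decisions $W$ actually depends on), the score-function expression $E_{\theta,x}[W \sum_l \psi_\theta(X_l, U_l)]$ is compared against a numerical derivative of $E_{\theta,x}[W]$:

```python
import itertools, math

def mu(theta, x, u):
    # hypothetical two-action randomized policy mu_theta(u | x)
    p1 = 1.0 / (1.0 + math.exp(-(theta + 0.3 * x)))
    return p1 if u == 1 else 1.0 - p1

def trans(x, u, x2):
    # hypothetical two-state transition kernel P(x2 | x, u)
    p1 = 0.7 if u == 1 else 0.2
    return p1 if x2 == 1 else 1.0 - p1

def psi(theta, x, u):
    # score psi_theta(x, u) = d/dtheta log mu_theta(u | x), via central differences
    h = 1e-6
    return (math.log(mu(theta + h, x, u)) - math.log(mu(theta - h, x, u))) / (2 * h)

def value_and_grad(theta, k, x0=0):
    # exact enumeration of all trajectories (U_0, X_1, ..., U_{k-1}, X_k)
    tot = grad = 0.0
    for path in itertools.product([0, 1], repeat=2 * k):
        p, score, x = 1.0, 0.0, x0
        for l in range(k):
            u, x2 = path[2 * l], path[2 * l + 1]
            p *= mu(theta, x, u) * trans(x, u, x2)
            score += psi(theta, x, u)
            x = x2
        w = x                     # W = X_k, independent of theta
        tot += p * w
        grad += p * score * w     # E_{theta,x}[ W * sum_l psi_theta(X_l, U_l) ]
    return tot, grad

theta, k, h = 0.4, 3, 1e-5
val, grad = value_and_grad(theta, k)
numeric = (value_and_grad(theta + h, k)[0] - value_and_grad(theta - h, k)[0]) / (2 * h)
print(grad, numeric)
```

Because the expectation is computed exactly over all $2^{2k}$ trajectories, the two gradients agree to numerical-differentiation precision.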

Corollary 2.4. Consider a family of functions $\{f_\theta(x,u)\}$ in $\mathcal{D}$ such that the family $\{\nabla f_\theta(x,u)\}$ is also in $\mathcal{D}$. Then, the map $\theta \mapsto E_{\theta,x}[f_\theta(X_k, U_k)]$ is differentiable for each $x$, and there exists some $K > 0$ such that
\[ \big|\nabla E_{\theta,x}[f_\theta(X_k, U_k)]\big| \le (k+1) K L(x), \qquad \forall\, \theta, x. \]
Furthermore, the families of functions
\[ \{E_{\theta,x}[f_\theta(X_k, U_k)]\}, \qquad \{\nabla E_{\theta,x}[f_\theta(X_k, U_k)]\} \]
belong to $\mathcal{D}$.

Before we move on, we would like to point out that the above result holds for expectations of the form $E_{\theta,x}[\,\cdot \mid U_0 = u]$ as well, if we redefine $\dot\Lambda_{\theta,k}$ as
\[ \dot\Lambda_{\theta,0} = 0, \qquad \dot\Lambda_{\theta,k} = \sum_{l=1}^{k} \psi_\theta(X_l, U_l), \quad k > 0. \]

3 Splitting

In this section, we recall the splitting technique of Athreya and Ney [2], and Nummelin [7], to obtain a regenerative representation for the average cost function $\bar\alpha(\theta)$ and the solutions $Q_\theta$ of the Poisson equation. Let $\delta$, $N$, $\mathcal{X}_0$ and $\vartheta$ be as in Assumption 4.2 of [5]. Consider the $\{0,1\}$-valued process $(B_k)$ constructed as follows:

1. If $k$ is not divisible by $N$, then $B_k = 0$.

2. If $k$ is divisible by $N$, then
\[ P(B_k = 1 \mid X_l, U_l, l = 0, 1, \ldots;\; B_l, l < k) = P(B_k = 1 \mid X_k, X_{k+N}) = f_\theta^{(1)}(X_k, X_{k+N}), \]
where
\[ f_\theta^{(1)}(x, y) = \frac{\delta}{2}\, I_{\mathcal{X}_0}(x)\, \frac{\vartheta(dy)}{P_{\theta,x}(X_N \in dy)}. \]
(Assumption 4.2(a) ensures the existence of $f_\theta^{(1)}$.) Let
\[ Q_{\theta,x}(dy) = \frac{P_{\theta,x}(X_N \in dy) - \frac{\delta}{2} I_{\mathcal{X}_0}(x)\, \vartheta(dy)}{1 - \frac{\delta}{2} I_{\mathcal{X}_0}(x)}. \]
It is easy to see that, for each $\theta, x, y$,
\[ f_\theta^{(0)}(x, y) + f_\theta^{(1)}(x, y) = 1, \]
where
\[ f_\theta^{(0)}(x, \cdot) = \Big(1 - \frac{\delta}{2} I_{\mathcal{X}_0}(x)\Big) \frac{Q_{\theta,x}(dy)}{P_{\theta,x}(X_N \in dy)}, \]

where the existence of the Radon-Nikodym derivative is guaranteed by the fact that
\[ Q_{\theta,x}(dy) \le \frac{2}{2 - \delta}\, P_{\theta,x}(X_N \in dy). \]
With this construction, it is not difficult to see that the process $(X_{kN}, B_{kN})$ is a Markov chain for which $\mathcal{X} \times \{1\}$ is an atom. Consider the first time this atom is hit, i.e.,
\[ \tau = \min\{k > 0 \mid B_k = 1\}. \]
The time $\tau$ is well-defined and finite w.p. 1, as the set $\mathcal{X}_0$ is hit infinitely often from any initial state (cf. Assumption 4.2 in [5]) and the probability that $B_k$ is $1$, given that $X_k = x$, is $\delta/2$ whenever $x \in \mathcal{X}_0$. Using standard results on Markov chains with atoms [6], the average cost $\bar\alpha(\theta)$ can be written as
\[ \bar\alpha(\theta) = \frac{E_{\theta,\vartheta}\Big[\sum_{k=0}^{\tau/N - 1} c_{\theta,N}(X_{kN})\Big]}{E_{\theta,\vartheta}[\tau]}, \]
where
\[ c_{\theta,N}(x) = \sum_{k=0}^{N-1} E_{\theta,x}\big[c(X_k, U_k)\big]. \]
Furthermore, the functions $V_\theta(x)$ given by
\[ V_\theta(x) = E_{\theta,x}\bigg[\sum_{k=0}^{\tau/N - 1} \big(c_{\theta,N}(X_{kN}) - N \bar\alpha(\theta)\big)\bigg] \tag{3} \]
satisfy the Poisson equation
\[ V_\theta(x) = E_{\theta,x}\big[c(x, U_0) - \bar\alpha(\theta) + V_\theta(X_1)\big] \]
for the Markov chain $\{X_k\}$. To see this, note that $V_\theta(x)$ is a solution to the Poisson equation for the cost function $c_{\theta,N}$ and the Markov chain $\{X_{kN}\}$ (see [6, p. 441]). Furthermore, if $\hat V_\theta$ is a solution to the Poisson equation for the cost function $c_{\theta,1}$ and the Markov chain $\{X_k\}$, then $\hat V_\theta$ is a solution to the Poisson equation for the cost function $c_{\theta,N}$ and the Markov chain $\{X_{kN}\}$. Since $\hat V_\theta$ and $V_\theta$ are two solutions to the Poisson equation for a positive Harris chain $\{X_{kN}\}$, they must differ by a constant (cf. [6]), and therefore $V_\theta$ is a solution to the Poisson equation for the cost function $c_{\theta,1}$ and the Markov chain $\{X_k\}$.

4 Likelihood Ratio

For each $\theta$, $k$, let
\[ \tilde\Lambda_{\theta,k} = \Lambda_{\theta,kN} \, \hat\Lambda_{\theta,k}, \]
where
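The regenerative representation of the average cost as a ratio of expected cycle cost to expected cycle length can be checked directly on a toy chain in which one state is itself an atom, so that no splitting is needed. The chain and costs below are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
# hypothetical two-state chain; state 0 is an atom, i.e. a regeneration state
P = np.array([[0.6, 0.4],
              [0.3, 0.7]])       # P[x, x']
c = np.array([1.0, 3.0])         # cost per state

# exact average cost from the stationary distribution
pi = np.linalg.matrix_power(P, 200)[0]
exact = float(pi @ c)

# regenerative estimate: average cost = E[cost over a cycle] / E[cycle length],
# where a cycle starts at the atom {0} and ends just before the next visit to it
total_cost = total_len = 0.0
for _ in range(50_000):
    x, cycle_cost, cycle_len = 0, 0.0, 0
    while True:
        cycle_cost += c[x]
        cycle_len += 1
        x = int(rng.random() < P[x, 1])
        if x == 0:
            break
    total_cost += cycle_cost
    total_len += cycle_len
est = total_cost / total_len
print(exact, est)
```

The cycles are i.i.d., so the ratio estimator converges to the stationary average cost; the splitting construction in the text manufactures exactly this cycle structure when no natural atom exists.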

\[ \hat\Lambda_{\theta,k} = \prod_{l=0}^{k-1} \frac{f_\theta^{(0)}(X_{lN}, X_{lN+N})}{f_{\theta_0}^{(0)}(X_{lN}, X_{lN+N})}, \]
and $0/0$ is interpreted as $1$. It is easy to see that $\tilde\Lambda_{\theta,k}$ is the Radon-Nikodym derivative of the joint distribution of $(X_l, U_l,\ 0 \le l \le kN;\; B_l,\ 0 \le l \le (k-1)N)$ under $P_\theta$ with respect to that under $P_{\theta_0}$, on the set of outcomes for which $\tau > kN$. Therefore, for any $W$ that is a function of only these random variables, we have
\[ E_{\theta,x}\big[W \, I\{\tau > kN\}\big] = E_{\theta_0,x}\big[W \, I\{\tau > kN\} \, \tilde\Lambda_{\theta,k}\big]. \]
To show that the map $\theta \mapsto E_{\theta,x}[W \, I\{\tau > kN\}]$ is differentiable, we need to verify the hypotheses of Lemma 1.1 for the likelihood ratio $\tilde\Lambda_{\theta,k}$. We now derive several bounds to verify the first assumption of Lemma 1.1.

Lemma 4.1.

1. For each $d \ge 1$, there exists $K_d > 0$ such that
\[ E_{\theta_0,x}\Bigg[\bigg|\frac{f_\theta^{(i)}(x, X_N)}{f_{\theta_0}^{(i)}(x, X_N)} - 1\bigg|^d\Bigg] \le K_d |\theta - \theta_0|^d L(x), \qquad \forall\, x, \; i = 0, 1. \]

2. For each $d \ge 1$, there exists $K_d > 0$ such that
\[ E_{\theta_0,x}\Bigg[\bigg|\frac{f_\theta^{(i)}(x, X_N)}{f_{\theta_0}^{(i)}(x, X_N)}\bigg|^d\Bigg] \le K_d L(x), \qquad \forall\, x, \; i = 0, 1, \]
and for all $\theta$ sufficiently close to $\theta_0$.

3. For each $d \ge 1$, there exists $K_d > 0$ such that
\[ E_{\theta_0,x}\big[|\hat\Lambda_{\theta,k} - 1|^d\big] \le K_d |\theta - \theta_0|^d L(x), \qquad \forall\, x. \]

4. For each $d \ge 1$, there exists $K_d > 0$ such that
\[ E_{\theta_0,x}\big[\hat\Lambda_{\theta,k}^d\big] \le K_d L(x), \qquad \forall\, x, \]
and for all $\theta$ sufficiently close to $\theta_0$.

Proof. To prove the first part, note that for $x \notin \mathcal{X}_0$,
\[ \frac{f_\theta^{(1)}(x, y)}{f_{\theta_0}^{(1)}(x, y)} = 1, \]
and for $x \in \mathcal{X}_0$,
\[ \frac{f_\theta^{(1)}(x, y)}{f_{\theta_0}^{(1)}(x, y)} = \frac{P_{\theta_0,x}(X_N \in dy)}{P_{\theta,x}(X_N \in dy)} = \big\{E_{\theta_0,x}[\Lambda_{\theta,N} \mid X_N = y]\big\}^{-1}. \]
Therefore, we have
\[ E_{\theta_0,x}\Bigg[\bigg|\frac{f_\theta^{(1)}(x, X_N)}{f_{\theta_0}^{(1)}(x, X_N)} - 1\bigg|^d\Bigg] = E_{\theta_0,x}\Big[\big|\{E_{\theta_0,x}[\Lambda_{\theta,N} \mid X_N]\}^{-1} - 1\big|^d\Big] \le \Big(E_{\theta_0,x}\big[\{E_{\theta_0,x}[\Lambda_{\theta,N} \mid X_N]\}^{-2d}\big]\Big)^{1/2} \Big(E_{\theta_0,x}\big[\big|E_{\theta_0,x}[\Lambda_{\theta,N} - 1 \mid X_N]\big|^{2d}\big]\Big)^{1/2} \le \Big(E_{\theta_0,x}\big[\Lambda_{\theta,N}^{-2d}\big]\Big)^{1/2} \Big(E_{\theta_0,x}\big[|\Lambda_{\theta,N} - 1|^{2d}\big]\Big)^{1/2}, \]
where the last two inequalities follow from Holder's and Jensen's inequalities, respectively. For $i = 0$, note that
\[ \frac{f_\theta^{(0)}}{f_{\theta_0}^{(0)}} - 1 = -\bigg(\frac{f_\theta^{(1)}}{f_{\theta_0}^{(1)}} - 1\bigg) \frac{f_{\theta_0}^{(1)}}{f_{\theta_0}^{(0)}}, \]
and $f_{\theta_0}^{(1)}/f_{\theta_0}^{(0)} \le 1$, since $f_\theta^{(1)} \le 1/2$ for all $\theta$. The proof of the second part is similar. The last two parts follow from the first two and Lemma 2.1.

The next lemma proves that the first of the two hypotheses of Lemma 1.1 holds for $\tilde\Lambda_{\theta,k}$.

Lemma 4.2.

1. For each $d \ge 1$, there exists $K_d > 0$ such that
\[ E_{\theta_0,x}\big[|\tilde\Lambda_{\theta,k} - 1|^d\big] \le K_d |\theta - \theta_0|^d L(x), \]
for $\theta$ sufficiently close to $\theta_0$.

2. For each $d \ge 1$, there exists $K_d > 0$ such that
\[ E_{\theta_0,x}\big[\tilde\Lambda_{\theta,k}^d\big] \le K_d L(x). \]

Proof. Follows from Lemmas 2.2, 4.1 and 2.1.

We are now ready to state the following result on differentiability of finite-horizon expectations.

Lemma 4.3. For any family of functions $\{f_\theta(x,u)\}$ in $\mathcal{D}$ such that the family $\{\nabla f_\theta(x,u)\}$ is also in $\mathcal{D}$, the map
\[ \theta \mapsto E_{\theta,x}\big[\bar f_\theta(X_{kN}) \, I\{\tau > kN\}\big], \qquad \text{where } \bar f_\theta(x) = \sum_{k=0}^{N-1} E_{\theta,x}\big[f_\theta(X_k, U_k)\big], \]
is differentiable for each $x$, with
\[ \big|\nabla E_{\theta,x}\big[\bar f_\theta(X_{kN}) \, I\{\tau > kN\}\big]\big| \le (k+1) \rho_0^k K L(x), \]
for some $\rho_0 < 1$.

Proof. Since
\[ E_{\theta,x}\big[\bar f_\theta(X_{kN}) \, I\{\tau > kN\}\big] = E_{\theta_0,x}\big[\tilde\Lambda_{\theta,k} \, \bar f_\theta(X_{kN}) \, I\{\tau > kN\}\big], \]
we can use Lemma 1.1 to prove its differentiability and to calculate the derivative. To verify the first assumption of Lemma 1.1, use Holder's inequality to see that
\[ E_{\theta_0,x}\Big[\big|\bar f_\theta(X_{kN}) \, I\{\tau > kN\} \, \tilde\Lambda_{\theta,k} - \bar f_\theta(X_{kN}) \, I\{\tau > kN\}\big|^d\Big] \le E_{\theta_0,x}\Big[|\bar f_\theta(X_{kN})|^d \, |\tilde\Lambda_{\theta,k} - 1|^d\Big] \le \Big(E_{\theta_0,x}\big[|\bar f_\theta(X_{kN})|^{2d}\big]\Big)^{1/2} \Big(E_{\theta_0,x}\big[|\tilde\Lambda_{\theta,k} - 1|^{2d}\big]\Big)^{1/2}. \]
The first assumption now follows from the fact that $f_\theta(x,u)$ (and therefore $\bar f_\theta(x)$) belongs to $\mathcal{D}$, and from the first part of the previous lemma. To verify the second assumption of Lemma 1.1, use Assumption 4.4 of [5] to conclude that the map $\theta \mapsto \Lambda_{\theta,k}(\omega)$ is continuously differentiable for all $\omega$; it is therefore enough to prove that $(\hat\Lambda_{\theta_0+h,k} - 1)/h$ converges in probability. This in turn is verified if the functions $f_\theta^{(i)}/f_{\theta_0}^{(i)}$, $i = 0, 1$, can be shown to be differentiable w.r.t. $\theta$ for all $x, y$. Since $f_\theta^{(1)}/f_{\theta_0}^{(1)}$

is equal to $\{E_{\theta_0,x}[\Lambda_{\theta,N} \mid X_N = y]\}^{-1}$ for all $x \in \mathcal{X}_0$, and is equal to $1$ otherwise, use part 3 of Assumption 4.4 in [5] and the mean-value theorem to see that
\[ \frac{|\Lambda_{\theta_0+h,N} - 1|}{h} \]
is bounded above by an integrable random variable. Therefore, we can invoke the dominated convergence theorem for conditional expectations to show that there exists a version of $f_\theta^{(1)}/f_{\theta_0}^{(1)}$ that is differentiable at $\theta_0$ for all $x, y$. Furthermore,
\[ \frac{\tilde\Lambda_{\theta_0+h,k} - 1}{h} \]
converges in probability to $\dot{\tilde\Lambda}_{\theta_0,k}$, where
\[ \dot{\tilde\Lambda}_{\theta,k} = \sum_{l=0}^{kN} \psi_\theta(X_l, U_l) - \sum_{j=0}^{k-1} I_{\mathcal{X}_0}(X_{jN}) \, E_{\theta,x}\Bigg[\sum_{l=jN}^{jN+N-1} \psi_\theta(X_l, U_l) \,\Bigg|\, X_{jN}, X_{jN+N}\Bigg]. \]
Therefore, we have
\[ \nabla E_{\theta,x}\big[\bar f_\theta(X_{kN}) \, I\{\tau > kN\}\big] = E_{\theta,x}\big[\dot{\tilde\Lambda}_{\theta,k} \, \bar f_\theta(X_{kN}) \, I\{\tau > kN\}\big] + E_{\theta,x}\big[\nabla \bar f_\theta(X_{kN}) \, I\{\tau > kN\}\big]. \]
To derive the bound on the derivative, use Holder's inequality to obtain
\[ \big|\nabla E_{\theta,x}\big[\bar f_\theta(X_{kN}) \, I\{\tau > kN\}\big]\big| \le \Big(E_{\theta,x}\big[|\dot{\tilde\Lambda}_{\theta,k}|^3\big]\Big)^{1/3} \Big(E_{\theta,x}\big[|\bar f_\theta(X_{kN})|^3\big]\Big)^{1/3} P_{\theta,x}(\tau > kN)^{1/3} + \Big(E_{\theta,x}\big[|\nabla \bar f_\theta(X_{kN})|^3\big]\Big)^{1/3} P_{\theta,x}(\tau > kN)^{1/3}. \]
To bound $\big(E_{\theta,x}[|\dot{\tilde\Lambda}_{\theta,k}|^3]\big)^{1/3}$, note that $\dot{\tilde\Lambda}_{\theta,k}$ is a sum of $2kN + 1$ terms. Since $\psi_\theta(x,u)$ belongs to $\mathcal{D}$, the $3$-norm of each of these terms is bounded by $K L(x)^{1/3}$ for some $K$. Similarly, since $f_\theta(x,u)$ and $\nabla f_\theta(x,u)$ (and therefore $\bar f_\theta(x)$ and $\nabla \bar f_\theta(x)$) belong to $\mathcal{D}$, the terms $\big(E_{\theta,x}[|\bar f_\theta(X_{kN})|^3]\big)^{1/3}$ and $\big(E_{\theta,x}[|\nabla \bar f_\theta(X_{kN})|^3]\big)^{1/3}$ are also bounded by $K L(x)^{1/3}$ for some $K$. Finally, due to uniform geometric ergodicity, we have
\[ P_{\theta,x}(\tau > kN) \le K \rho_0^k L(x), \]
for some $\rho_0 < 1$. Combining all these terms, it is easy to establish the stated bound on the derivative.

5 Infinite Horizon

The above lemma concerns the differentiability of expectations w.r.t. finite-dimensional marginals of $P_{\theta,x}$. We use the following result from advanced calculus (e.g., see Apostol [1]) to extend the above result to the infinite-horizon case.

Theorem 5.1. Assume that each $f_k$ is a real-valued function defined on a neighborhood $\Theta$ of $\theta_0$ such that the derivative $\nabla f_k(\theta)$ exists for each $\theta$ in $\Theta$. Assume that:

1. $\sum_k f_k(\theta_0)$ converges;

2. there exists a $g(\theta)$ such that $\sum_k \nabla f_k(\theta) = g(\theta)$ uniformly on $\Theta$.

Then:

1. There exists a function $f(\theta)$ such that $\sum_k f_k(\theta) = f(\theta)$ uniformly on $\Theta$.

2. If $\theta \in \Theta$, the derivative $\nabla f(\theta)$ exists and equals $\sum_k \nabla f_k(\theta)$.

By combining the previous two results, we have the following theorem. Recall that $\eta_\theta$ is the steady-state expectation of the state-decision process $\{(X_k, U_k)\}$.

Theorem 5.2. For any family of functions $\{f_\theta(x,u)\}$ in $\mathcal{D}$ such that the family $\{\nabla f_\theta(x,u)\}$ is also in $\mathcal{D}$, the map
\[ \theta \mapsto E_{\theta,x}\Bigg[\sum_{k=0}^{\tau/N - 1} \bar f_\theta(X_{kN})\Bigg] \]
is differentiable for each $x$, and the families of functions
\[ \Bigg\{E_{\theta,x}\Bigg[\sum_{k=0}^{\tau/N - 1} \bar f_\theta(X_{kN})\Bigg]\Bigg\}, \qquad \Bigg\{\nabla E_{\theta,x}\Bigg[\sum_{k=0}^{\tau/N - 1} \bar f_\theta(X_{kN})\Bigg]\Bigg\} \]
belong to $\mathcal{D}$.

Proof. Note that
\[ E_{\theta,x}\Bigg[\sum_{k=0}^{\tau/N - 1} \bar f_\theta(X_{kN})\Bigg] = \sum_{k=0}^{\infty} E_{\theta,x}\big[\bar f_\theta(X_{kN}) \, I\{\tau > kN\}\big]. \]
The differentiability now follows from the previous two lemmas. To show that the functions
\[ E_{\theta,x}\Bigg[\sum_{k=0}^{\tau/N - 1} \bar f_\theta(X_{kN})\Bigg], \qquad \nabla E_{\theta,x}\Bigg[\sum_{k=0}^{\tau/N - 1} \bar f_\theta(X_{kN})\Bigg] \]
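Theorem 5.1 is the classical result on differentiating a series term by term when the derivative series converges uniformly. A minimal numerical illustration (not tied to the chains in this report) uses the hypothetical family $f_k(\theta) = \theta^k$ on $|\theta| \le 0.6$, whose sum is $1/(1-\theta)$ and whose derivative series converges uniformly there:

```python
theta, K = 0.3, 200
# f_k(theta) = theta^k on |theta| <= 0.6: sum_k f_k = 1/(1-theta), and the
# derivative series sum_k k theta^(k-1) converges uniformly on that neighborhood,
# so by the theorem it equals the derivative of the limit, 1/(1-theta)^2.
series_deriv = sum(k * theta ** (k - 1) for k in range(1, K))
exact_deriv = 1.0 / (1.0 - theta) ** 2
print(series_deriv, exact_deriv)
```

The truncated derivative series agrees with $1/(1-\theta)^2$ up to a geometrically small tail, which is exactly the mechanism used in this section: the finite-horizon terms and their derivatives decay geometrically, so summation and differentiation may be interchanged.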

are in $\mathcal{D}$, consider
\[ E_{\theta,x}\Bigg[\sup_{\bar\theta}\,\Bigg|\sum_{k=0}^{\tau/N - 1} \bar f_{\bar\theta}(X_{kN})\Bigg|^d\Bigg] \le E_{\theta,x}\Bigg[(\tau/N)^{d-1} \sum_{k=0}^{\tau/N - 1} \sup_{\bar\theta} \big|\bar f_{\bar\theta}(X_{kN})\big|^d\Bigg] = \sum_{k=0}^{\infty} E_{\theta,x}\Big[(\tau/N)^{d-1} \sup_{\bar\theta} \big|\bar f_{\bar\theta}(X_{kN})\big|^d \, I\{\tau > kN\}\Big] \le \sum_{k=0}^{\infty} \Big(E_{\theta,x}\big[(\tau/N)^{3(d-1)}\big]\Big)^{1/3} \Big(E_{\theta,x}\Big[\sup_{\bar\theta} \big|\bar f_{\bar\theta}(X_{kN})\big|^{3d}\Big]\Big)^{1/3} P_{\theta,x}(\tau > kN)^{1/3}. \]
It is now easy to see that the right-hand side is bounded by $K_d L(x)$ for some $K_d > 0$.

The following corollaries are immediate consequences of this result.

Corollary 5.3. For any family of functions $\{f_\theta(x,u)\}$ in $\mathcal{D}$ such that the family $\{\nabla f_\theta(x,u)\}$ is also in $\mathcal{D}$, the map $\theta \mapsto \eta_\theta(f_\theta)$ is bounded and differentiable with bounded derivatives.

Corollary 5.4. (Lemma 5.1 in [5]) If $\mathcal{X}_0$ is a singleton (i.e., $\mathcal{X}_0$ is an atom, and therefore splitting is unnecessary) and $\tau$ is the first hitting time of $\mathcal{X}_0$, then the functions defined as
\[ Q_\theta(x, u) = E_{\theta,x}\Bigg[\sum_{k=0}^{\tau - 1} c(X_k, U_k) \,\Bigg|\, U_0 = u\Bigg], \qquad T_\theta(x, u) = E_{\theta,x}[\tau \mid U_0 = u], \]
belong to $\mathcal{D}$. Furthermore, for each $(x, u)$, the maps $\theta \mapsto Q_\theta(x,u)$ and $\theta \mapsto T_\theta(x,u)$ are differentiable, and the derivatives $\nabla Q_\theta(x,u)$, $\nabla T_\theta(x,u)$ also belong to $\mathcal{D}$.

Corollary 5.5. The average expected cost function $\bar\alpha(\theta)$ is bounded and differentiable with bounded derivatives. Furthermore, the solutions $V_\theta(x)$ to the Poisson equation given by (3) belong to $\mathcal{D}$, the map $\theta \mapsto V_\theta(x)$ is differentiable for each $x$, and the derivatives $\nabla V_\theta(x)$ also belong to $\mathcal{D}$. Furthermore, the family of functions
\[ Q_\theta = c - \bar\alpha(\theta)\mathbf{1} + P_\theta V_\theta \]

belongs to $\mathcal{D}$, with the map $\theta \mapsto Q_\theta(x,u)$ being differentiable for all $(x,u)$ and $\nabla Q_\theta(x,u)$ belonging to $\mathcal{D}$.

This corollary is used in [5] to establish Theorem 4.6 therein, which is again proved here for completeness.

Theorem 5.6. For any solution $Q_\theta$ to the Poisson equation, we have
\[ \nabla \bar\alpha(\theta) = \langle \psi_\theta, Q_\theta \rangle_\theta. \]
Furthermore, $\nabla\bar\alpha(\theta)$ has bounded derivatives.

Proof. Note that the $Q_\theta$ defined in the previous corollary satisfies the Poisson equation:
\[ Q_\theta(x, u) = c(x, u) - \bar\alpha(\theta) + E_{\theta,x}\big[Q_\theta(X_1, U_1) \mid U_0 = u\big]. \]
Since $Q_\theta$ belongs to $\mathcal{D}$, $Q_\theta$ is differentiable in $\theta$, and $\nabla Q_\theta$ also belongs to $\mathcal{D}$, we can differentiate both sides of the above equation to obtain
\[ \nabla Q_\theta(x, u) = -\nabla\bar\alpha(\theta) + E_{\theta,x}\big[\nabla Q_\theta(X_1, U_1) \mid U_0 = u\big] + E_{\theta,x}\big[\psi_\theta(X_1, U_1) \, Q_\theta(X_1, U_1) \mid U_0 = u\big]. \]
Taking expectations with respect to the steady-state distribution of $(X_k, U_k)$, we obtain the desired formula with $Q_\theta$ defined as in the previous corollary. To see that $Q_\theta$ can be replaced by any other solution $\hat Q_\theta$ to the Poisson equation, note that $Q_\theta - \hat Q_\theta$ is some constant function $C(\theta)\mathbf{1}$ (cf. [6]) and that $E_{\theta,x}[\psi_\theta(x, U_0)] = 0$. To prove that $\bar\alpha$ is twice differentiable with bounded derivatives, note that $\nabla\bar\alpha(\theta) = \eta_\theta(\psi_\theta Q_\theta)$ and apply Corollary 5.3.

The above result leads to the following theorem on differentiability of a common representation for solutions to the Poisson equation.

Theorem 5.7. For any family of functions $\{f_\theta(x,u)\}$ in $\mathcal{D}$ such that the family $\{\nabla f_\theta(x,u)\}$ is also in $\mathcal{D}$, the map
\[ \theta \mapsto \sum_{k=0}^{\infty} \big(E_{\theta,x}\big[f_\theta(X_k, U_k)\big] - \eta_\theta(f_\theta)\big) \]
is differentiable for each $x$, and the families of functions
\[ \Bigg\{\sum_{k=0}^{\infty} \big(E_{\theta,x}\big[f_\theta(X_k, U_k)\big] - \eta_\theta(f_\theta)\big)\Bigg\}, \qquad \Bigg\{\nabla \sum_{k=0}^{\infty} \big(E_{\theta,x}\big[f_\theta(X_k, U_k)\big] - \eta_\theta(f_\theta)\big)\Bigg\} \]
belong to $\mathcal{D}$.
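The gradient formula $\nabla\bar\alpha(\theta) = \langle \psi_\theta, Q_\theta \rangle_\theta$ can be verified exactly on a finite MDP, where every quantity is computable by linear algebra. The sketch below uses a hypothetical two-state, two-action model with a logistic policy (all numbers invented), solves the Poisson equation for $Q_\theta$ by summing the geometric series $\sum_k \tilde P_\theta^k (c - \bar\alpha(\theta)\mathbf{1})$, and compares $\eta_\theta(\psi_\theta Q_\theta)$ with a numerical derivative of $\bar\alpha$:

```python
import numpy as np

# hypothetical 2-state, 2-action average-cost MDP
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # P[x, u, x']
              [[0.5, 0.5], [0.1, 0.9]]])
c = np.array([[1.0, 0.5], [2.0, 0.2]])      # c[x, u]

def mu(theta):
    # hypothetical logistic randomized policy mu_theta(u | x), shape (x, u)
    p1 = 1.0 / (1.0 + np.exp(-(theta + 0.3 * np.arange(2))))
    return np.stack([1.0 - p1, p1], axis=1)

def avg_cost_and_Q(theta, n_terms=2000):
    m = mu(theta)
    Pxx = np.einsum('xuy,xu->xy', P, m)          # state chain under the policy
    eta_x = np.linalg.matrix_power(Pxx, 500)[0]  # stationary state distribution
    eta = eta_x[:, None] * m                     # stationary state-action distribution
    alpha = float((eta * c).sum())
    # Q solves the Poisson equation Q = (c - alpha) + Ptilde Q; sum the geometric series
    g = c - alpha
    Q, term = np.zeros_like(c), g.copy()
    for _ in range(n_terms):
        Q += term
        term = np.einsum('xuy,yv,yv->xu', P, m, term)
    return alpha, Q, eta

theta, h = 0.2, 1e-5
alpha, Q, eta = avg_cost_and_Q(theta)
psi = (np.log(mu(theta + h)) - np.log(mu(theta - h))) / (2 * h)   # policy score
grad_formula = float((eta * psi * Q).sum())                        # <psi, Q> under eta
grad_numeric = (avg_cost_and_Q(theta + h)[0] - avg_cost_and_Q(theta - h)[0]) / (2 * h)
print(grad_formula, grad_numeric)
```

Since $\sum_u \nabla\mu_\theta(u \mid x) = 0$, the formula is insensitive to the additive constant in $Q_\theta$, matching the remark in the proof that any solution of the Poisson equation may be used.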

Proof. To simplify the proof, assume that $f_\theta = c$. Then the results on differentiability for the finite-horizon case imply that the function $\theta \mapsto E_{\theta,x}[c(X_k, U_k)] - \bar\alpha(\theta)$ is differentiable. Using the representation $\nabla\bar\alpha(\theta) = \sum_{l=0}^{\infty} \langle \psi_\theta, P_\theta^{l}(c - \bar\alpha(\theta)\mathbf{1}) \rangle_\theta$, its derivative can be decomposed as
\[ \nabla\big(E_{\theta,x}\big[c(X_k, U_k)\big] - \bar\alpha(\theta)\big) = \sum_{l=0}^{\lfloor k/2 \rfloor} \Big(E_{\theta,x}\big[\psi_\theta(X_l, U_l) \, E_{\theta,X_{l+1}}\big[c(X_k, U_k) - \bar\alpha(\theta)\big]\big] - \big\langle \psi_\theta, P_\theta^{k-l}(c - \bar\alpha(\theta)\mathbf{1})\big\rangle_\theta\Big) + \sum_{l=\lfloor k/2 \rfloor + 1}^{k} \Big(E_{\theta,x}\big[\psi_\theta(X_l, U_l)\big(c(X_k, U_k) - \bar\alpha(\theta)\big)\big] - \big\langle \psi_\theta, P_\theta^{k-l}(c - \bar\alpha(\theta)\mathbf{1})\big\rangle_\theta\Big) - \sum_{l=k+1}^{\infty} \big\langle \psi_\theta, P_\theta^{l}(c - \bar\alpha(\theta)\mathbf{1})\big\rangle_\theta, \]
where $\lfloor k/2 \rfloor$ represents the floor of $k/2$. Using geometric ergodicity, the fact that $V^{1/d}$ also satisfies the geometric Foster-Lyapunov criterion with the decay factor $\rho^{1/d}$, and the Schwarz inequality, we can see that each of these terms can be bounded by $K \rho_0^k L(x)$ for some $\rho_0 < 1$. The result now follows from an easy application of Theorem 5.1.

6 Verification of Assumption A.3, parts (d) and (e)

In this section, we verify parts (d) and (e) of Assumption A.3 of [5] for the two cases of TD(1) and of TD($\lambda$), $\lambda < 1$, separately. Since $\phi_\theta$, $Q_\theta$, $T_\theta$ belong to $\mathcal{D}$, part (d) in both cases follows from Corollary 5.3. Therefore, we will verify only part (e). We start with the TD(1) case.

6.1 TD(1)

We will now verify part (e) for $\hat h(\cdot)$. Note that
\[ \hat h_\theta(y) = E_{\theta,\bar x}\Bigg[\sum_{k=0}^{\tau - 1} \big(h_\theta(Y_k) - h(\theta)\big) \,\Bigg|\, Y_0 = y\Bigg] = \sum_{k=0}^{\infty} E_{\theta,\bar x}\big[\big(h_\theta(Y_k) - h(\theta)\big) I\{\tau > k\} \,\big|\, Y_0 = y\big]. \]
Since $h_\theta(Y_k)$ is a function of the state-decision pairs up to time $k$ and satisfies the assumptions of Lemma 2.3, a formula can be derived for the derivative of the expectation
\[ E_{\theta,\bar x}\big[\big(h_\theta(Y_k) - h(\theta)\big) I\{\tau > k\} \,\big|\, Y_0 = y\big]. \]

Using this formula, a bound on this derivative of the form
\[ K (k+1)^d \rho_0^k L(x, u) z, \qquad d \ge 1, \; K > 0, \; \rho_0 < 1, \; L \in \mathcal{D}, \]
can be obtained. This bound and Theorem 5.1 can be used to conclude that $\hat h$ is differentiable with respect to $\theta$ and that the derivative is bounded by $K L(x, u) z$ for some other $K > 0$. It is now easy to verify that $E_{\theta,x}[\hat h_\theta(Y_1) \mid Y_0 = y]$ is differentiable with the derivative bounded appropriately. The verification of part (e) for $\hat G$ is similar.

6.2 TD($\lambda$), $\lambda < 1$

Since the verification of part (e) for $\hat h$ is simpler, and the verification for $\hat G$ is similar, we will consider $\hat h$ only. Note that the first component of the vector $\hat h$ does not depend on $z$ and is equal to
\[ \sum_{k=0}^{\infty} E_{\theta,x}\big[\big(c(X_k, U_k) - \bar\alpha(\theta)\big) \,\big|\, U_0 = u\big]. \]
Therefore, part (e) for this component follows from Theorem 5.7. For the second component, consider the sum
\[ E_{\theta,x}\Bigg[\sum_{k=0}^{\infty} \big(c(X_k, U_k) Z_k - h_1(\theta) - \bar\alpha(\theta)\bar Z(\theta)\big) \,\Bigg|\, Y_0 = y\Bigg] = z + \sum_{k=0}^{\infty} \lambda^k \Big(E_{\theta,x}\big[c(X_k, U_k) \,\big|\, Y_0 = y\big] - \langle P^k c, \phi_\theta \rangle\Big) + \sum_{k=0}^{\infty} \sum_{l=0}^{k-1} \lambda^l \Big(E_{\theta,x}\big[c(X_k, U_k)\,\phi_\theta(X_{k-l}, U_{k-l}) \,\big|\, Y_0 = y\big] - \langle P^l c, \phi_\theta \rangle\Big) - \sum_{k=0}^{\infty} \sum_{l=k+1}^{\infty} \lambda^l \langle P^l c, \phi_\theta \rangle. \]
It is now easy to see that the second sums of the first two terms are differentiable with derivatives uniformly bounded in $l$, as the second sum of the second term is an expectation of the solution to a Poisson equation for the Markov chain $(X_{k-l}, U_{k-l}, X_k, U_k)$ (cf. Theorem 5.7). The second sum of the third term is also differentiable, with the bound on the derivative linear in $k$. It now follows from Theorem 5.1 that $\hat h(y)$ is differentiable with the derivative bounded above by $K L(x, u) z$ for some $L \in \mathcal{D}$.

References

[1] T. M. Apostol. Mathematical Analysis: A Modern Approach to Advanced Calculus. Addison-Wesley, 2nd edition.

[2] K. B. Athreya and P. Ney. A new approach to the limit theory of recurrent Markov chains. Trans. Amer. Math. Soc., 245.

[3] V. S. Borkar. Probability Theory: An Advanced Course. Springer-Verlag, New York.

[4] P. W. Glynn and P. L'Ecuyer. Likelihood ratio gradient estimation for stochastic recursions. Advances in Applied Probability, 27.

[5] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. Submitted to the SIAM Journal on Control and Optimization, February.

[6] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag.

[7] E. Nummelin. A splitting technique for Harris recurrent chains. Z. Wahrscheinlichkeitstheorie und Verw. Geb., 43.


University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming University of Warwick, EC9A0 Maths for Economists 1 of 63 University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming Peter J. Hammond Autumn 2013, revised 2014 University of

More information

Laplace s Equation. Chapter Mean Value Formulas

Laplace s Equation. Chapter Mean Value Formulas Chapter 1 Laplace s Equation Let be an open set in R n. A function u C 2 () is called harmonic in if it satisfies Laplace s equation n (1.1) u := D ii u = 0 in. i=1 A function u C 2 () is called subharmonic

More information

Measure Theory on Topological Spaces. Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond

Measure Theory on Topological Spaces. Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond Measure Theory on Topological Spaces Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond May 22, 2011 Contents 1 Introduction 2 1.1 The Riemann Integral........................................ 2 1.2 Measurable..............................................

More information

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R.

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R. Ergodic Theorems Samy Tindel Purdue University Probability Theory 2 - MA 539 Taken from Probability: Theory and examples by R. Durrett Samy T. Ergodic theorems Probability Theory 1 / 92 Outline 1 Definitions

More information

On Reparametrization and the Gibbs Sampler

On Reparametrization and the Gibbs Sampler On Reparametrization and the Gibbs Sampler Jorge Carlos Román Department of Mathematics Vanderbilt University James P. Hobert Department of Statistics University of Florida March 2014 Brett Presnell Department

More information

2 light traffic derivatives for the GI/G/ queue. We shall see that our proof of analyticity is mainly based on some recursive formulas very similar to

2 light traffic derivatives for the GI/G/ queue. We shall see that our proof of analyticity is mainly based on some recursive formulas very similar to Analyticity of Single-Server Queues in Light Traffic Jian-Qiang Hu Manufacturing Engineering Department Boston University Cummington Street Boston, MA 0225 September 993; Revised May 99 Abstract Recently,

More information

Tail inequalities for additive functionals and empirical processes of. Markov chains

Tail inequalities for additive functionals and empirical processes of. Markov chains Tail inequalities for additive functionals and empirical processes of geometrically ergodic Markov chains University of Warsaw Banff, June 2009 Geometric ergodicity Definition A Markov chain X = (X n )

More information

Brownian Motion and Conditional Probability

Brownian Motion and Conditional Probability Math 561: Theory of Probability (Spring 2018) Week 10 Brownian Motion and Conditional Probability 10.1 Standard Brownian Motion (SBM) Brownian motion is a stochastic process with both practical and theoretical

More information

A Probabilistic Proof of the Perron-Frobenius Theorem

A Probabilistic Proof of the Perron-Frobenius Theorem A Probabilistic Proof of the Perron-Frobenius Theorem Peter W. Glynn 1 and Paritosh Y. Desai 2 1 Department of Management Science and Engineering, Stanford University, Stanford CA 2 Target Corporation,

More information

Lecture 22: Variance and Covariance

Lecture 22: Variance and Covariance EE5110 : Probability Foundations for Electrical Engineers July-November 2015 Lecture 22: Variance and Covariance Lecturer: Dr. Krishna Jagannathan Scribes: R.Ravi Kiran In this lecture we will introduce

More information

Limit theorems for dependent regularly varying functions of Markov chains

Limit theorems for dependent regularly varying functions of Markov chains Limit theorems for functions of with extremal linear behavior Limit theorems for dependent regularly varying functions of In collaboration with T. Mikosch Olivier Wintenberger wintenberger@ceremade.dauphine.fr

More information

Chapter 7. Markov chain background. 7.1 Finite state space

Chapter 7. Markov chain background. 7.1 Finite state space Chapter 7 Markov chain background A stochastic process is a family of random variables {X t } indexed by a varaible t which we will think of as time. Time can be discrete or continuous. We will only consider

More information

THE GEOMETRICAL ERGODICITY OF NONLINEAR AUTOREGRESSIVE MODELS

THE GEOMETRICAL ERGODICITY OF NONLINEAR AUTOREGRESSIVE MODELS Statistica Sinica 6(1996), 943-956 THE GEOMETRICAL ERGODICITY OF NONLINEAR AUTOREGRESSIVE MODELS H. Z. An and F. C. Huang Academia Sinica Abstract: Consider the nonlinear autoregressive model x t = φ(x

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

Chapter 2: Markov Chains and Queues in Discrete Time

Chapter 2: Markov Chains and Queues in Discrete Time Chapter 2: Markov Chains and Queues in Discrete Time L. Breuer University of Kent 1 Definition Let X n with n N 0 denote random variables on a discrete space E. The sequence X = (X n : n N 0 ) is called

More information

On the Applicability of Regenerative Simulation in Markov Chain Monte Carlo

On the Applicability of Regenerative Simulation in Markov Chain Monte Carlo On the Applicability of Regenerative Simulation in Markov Chain Monte Carlo James P. Hobert 1, Galin L. Jones 2, Brett Presnell 1, and Jeffrey S. Rosenthal 3 1 Department of Statistics University of Florida

More information

1 Problem Formulation

1 Problem Formulation Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning

More information

Properties of an infinite dimensional EDS system : the Muller s ratchet

Properties of an infinite dimensional EDS system : the Muller s ratchet Properties of an infinite dimensional EDS system : the Muller s ratchet LATP June 5, 2011 A ratchet source : wikipedia Plan 1 Introduction : The model of Haigh 2 3 Hypothesis (Biological) : The population

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course Value Iteration: the Idea 1. Let V 0 be any vector in R N A. LAZARIC Reinforcement

More information

STAT STOCHASTIC PROCESSES. Contents

STAT STOCHASTIC PROCESSES. Contents STAT 3911 - STOCHASTIC PROCESSES ANDREW TULLOCH Contents 1. Stochastic Processes 2 2. Classification of states 2 3. Limit theorems for Markov chains 4 4. First step analysis 5 5. Branching processes 5

More information

Operations Research Letters. Instability of FIFO in a simple queueing system with arbitrarily low loads

Operations Research Letters. Instability of FIFO in a simple queueing system with arbitrarily low loads Operations Research Letters 37 (2009) 312 316 Contents lists available at ScienceDirect Operations Research Letters journal homepage: www.elsevier.com/locate/orl Instability of FIFO in a simple queueing

More information

arxiv: v2 [cs.sy] 27 Sep 2016

arxiv: v2 [cs.sy] 27 Sep 2016 Analysis of gradient descent methods with non-diminishing, bounded errors Arunselvan Ramaswamy 1 and Shalabh Bhatnagar 2 arxiv:1604.00151v2 [cs.sy] 27 Sep 2016 1 arunselvan@csa.iisc.ernet.in 2 shalabh@csa.iisc.ernet.in

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

0.1 Uniform integrability

0.1 Uniform integrability Copyright c 2009 by Karl Sigman 0.1 Uniform integrability Given a sequence of rvs {X n } for which it is known apriori that X n X, n, wp1. for some r.v. X, it is of great importance in many applications

More information

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology Singapore University of Design and Technology Lecture 9: Hypothesis testing, uniformly most powerful tests. The Neyman-Pearson framework Let P be the family of distributions of concern. The Neyman-Pearson

More information

A generalization of Strassen s functional LIL

A generalization of Strassen s functional LIL A generalization of Strassen s functional LIL Uwe Einmahl Departement Wiskunde Vrije Universiteit Brussel Pleinlaan 2 B-1050 Brussel, Belgium E-mail: ueinmahl@vub.ac.be Abstract Let X 1, X 2,... be a sequence

More information

Random Process Lecture 1. Fundamentals of Probability

Random Process Lecture 1. Fundamentals of Probability Random Process Lecture 1. Fundamentals of Probability Husheng Li Min Kao Department of Electrical Engineering and Computer Science University of Tennessee, Knoxville Spring, 2016 1/43 Outline 2/43 1 Syllabus

More information

Symmetric polynomials and symmetric mean inequalities

Symmetric polynomials and symmetric mean inequalities Symmetric polynomials and symmetric mean inequalities Karl Mahlburg Department of Mathematics Louisiana State University Baton Rouge, LA 70803, U.S.A. mahlburg@math.lsu.edu Clifford Smyth Department of

More information

Exercises with solutions (Set D)

Exercises with solutions (Set D) Exercises with solutions Set D. A fair die is rolled at the same time as a fair coin is tossed. Let A be the number on the upper surface of the die and let B describe the outcome of the coin toss, where

More information

MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016

MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 Lecture 14: Information Theoretic Methods Lecturer: Jiaming Xu Scribe: Hilda Ibriga, Adarsh Barik, December 02, 2016 Outline f-divergence

More information

Convergence of generalized entropy minimizers in sequences of convex problems

Convergence of generalized entropy minimizers in sequences of convex problems Proceedings IEEE ISIT 206, Barcelona, Spain, 2609 263 Convergence of generalized entropy minimizers in sequences of convex problems Imre Csiszár A Rényi Institute of Mathematics Hungarian Academy of Sciences

More information

A mathematical framework for Exact Milestoning

A mathematical framework for Exact Milestoning A mathematical framework for Exact Milestoning David Aristoff (joint work with Juan M. Bello-Rivas and Ron Elber) Colorado State University July 2015 D. Aristoff (Colorado State University) July 2015 1

More information

Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 2012

Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 2012 Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 202 BOUNDS AND ASYMPTOTICS FOR FISHER INFORMATION IN THE CENTRAL LIMIT THEOREM

More information

A Note on the Central Limit Theorem for a Class of Linear Systems 1

A Note on the Central Limit Theorem for a Class of Linear Systems 1 A Note on the Central Limit Theorem for a Class of Linear Systems 1 Contents Yukio Nagahata Department of Mathematics, Graduate School of Engineering Science Osaka University, Toyonaka 560-8531, Japan.

More information

MTH 404: Measure and Integration

MTH 404: Measure and Integration MTH 404: Measure and Integration Semester 2, 2012-2013 Dr. Prahlad Vaidyanathan Contents I. Introduction....................................... 3 1. Motivation................................... 3 2. The

More information

arxiv: v1 [cs.lg] 23 Oct 2017

arxiv: v1 [cs.lg] 23 Oct 2017 Accelerated Reinforcement Learning K. Lakshmanan Department of Computer Science and Engineering Indian Institute of Technology (BHU), Varanasi, India Email: lakshmanank.cse@itbhu.ac.in arxiv:1710.08070v1

More information

A Stochastic Online Sensor Scheduler for Remote State Estimation with Time-out Condition

A Stochastic Online Sensor Scheduler for Remote State Estimation with Time-out Condition A Stochastic Online Sensor Scheduler for Remote State Estimation with Time-out Condition Junfeng Wu, Karl Henrik Johansson and Ling Shi E-mail: jfwu@ust.hk Stockholm, 9th, January 2014 1 / 19 Outline Outline

More information

Continuous Functions on Metric Spaces

Continuous Functions on Metric Spaces Continuous Functions on Metric Spaces Math 201A, Fall 2016 1 Continuous functions Definition 1. Let (X, d X ) and (Y, d Y ) be metric spaces. A function f : X Y is continuous at a X if for every ɛ > 0

More information

Weak-Star Convergence of Convex Sets

Weak-Star Convergence of Convex Sets Journal of Convex Analysis Volume 13 (2006), No. 3+4, 711 719 Weak-Star Convergence of Convex Sets S. Fitzpatrick A. S. Lewis ORIE, Cornell University, Ithaca, NY 14853, USA aslewis@orie.cornell.edu www.orie.cornell.edu/

More information

Consistency of the maximum likelihood estimator for general hidden Markov models

Consistency of the maximum likelihood estimator for general hidden Markov models Consistency of the maximum likelihood estimator for general hidden Markov models Jimmy Olsson Centre for Mathematical Sciences Lund University Nordstat 2012 Umeå, Sweden Collaborators Hidden Markov models

More information

Simultaneous drift conditions for Adaptive Markov Chain Monte Carlo algorithms

Simultaneous drift conditions for Adaptive Markov Chain Monte Carlo algorithms Simultaneous drift conditions for Adaptive Markov Chain Monte Carlo algorithms Yan Bai Feb 2009; Revised Nov 2009 Abstract In the paper, we mainly study ergodicity of adaptive MCMC algorithms. Assume that

More information

Research Article Almost Periodic Solutions of Prey-Predator Discrete Models with Delay

Research Article Almost Periodic Solutions of Prey-Predator Discrete Models with Delay Hindawi Publishing Corporation Advances in Difference Equations Volume 009, Article ID 976865, 19 pages doi:10.1155/009/976865 Research Article Almost Periodic Solutions of Prey-Predator Discrete Models

More information

Density estimators for the convolution of discrete and continuous random variables

Density estimators for the convolution of discrete and continuous random variables Density estimators for the convolution of discrete and continuous random variables Ursula U Müller Texas A&M University Anton Schick Binghamton University Wolfgang Wefelmeyer Universität zu Köln Abstract

More information

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs 2015 IEEE 54th Annual Conference on Decision and Control CDC December 15-18, 2015. Osaka, Japan An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs Abhishek Gupta Rahul Jain Peter

More information

Estimation of Dynamic Regression Models

Estimation of Dynamic Regression Models University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.

More information

NUMERICAL RADIUS OF A HOLOMORPHIC MAPPING

NUMERICAL RADIUS OF A HOLOMORPHIC MAPPING Geometric Complex Analysis edited by Junjiro Noguchi et al. World Scientific, Singapore, 1995 pp.1 7 NUMERICAL RADIUS OF A HOLOMORPHIC MAPPING YUN SUNG CHOI Department of Mathematics Pohang University

More information

Information geometry for bivariate distribution control

Information geometry for bivariate distribution control Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic

More information

Stability of optimization problems with stochastic dominance constraints

Stability of optimization problems with stochastic dominance constraints Stability of optimization problems with stochastic dominance constraints D. Dentcheva and W. Römisch Stevens Institute of Technology, Hoboken Humboldt-University Berlin www.math.hu-berlin.de/~romisch SIAM

More information

Stochastic Stabilization of a Noisy Linear System with a Fixed-Rate Adaptive Quantizer

Stochastic Stabilization of a Noisy Linear System with a Fixed-Rate Adaptive Quantizer 2009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 10-12, 2009 ThA06.6 Stochastic Stabilization of a Noisy Linear System with a Fixed-Rate Adaptive Quantizer Serdar Yüksel

More information

Theory and Applications of Stochastic Systems Lecture Exponential Martingale for Random Walk

Theory and Applications of Stochastic Systems Lecture Exponential Martingale for Random Walk Instructor: Victor F. Araman December 4, 2003 Theory and Applications of Stochastic Systems Lecture 0 B60.432.0 Exponential Martingale for Random Walk Let (S n : n 0) be a random walk with i.i.d. increments

More information

TECHNIQUES FOR ESTABLISHING DOMINATED SPLITTINGS

TECHNIQUES FOR ESTABLISHING DOMINATED SPLITTINGS TECHNIQUES FOR ESTABLISHING DOMINATED SPLITTINGS ANDY HAMMERLINDL ABSTRACT. We give theorems which establish the existence of a dominated splitting and further properties, such as partial hyperbolicity.

More information

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1 Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be

More information

RATIO ERGODIC THEOREMS: FROM HOPF TO BIRKHOFF AND KINGMAN

RATIO ERGODIC THEOREMS: FROM HOPF TO BIRKHOFF AND KINGMAN RTIO ERGODIC THEOREMS: FROM HOPF TO BIRKHOFF ND KINGMN HNS HENRIK RUGH ND DMIEN THOMINE bstract. Hopf s ratio ergodic theorem has an inherent symmetry which we exploit to provide a simplification of standard

More information

The Art of Sequential Optimization via Simulations

The Art of Sequential Optimization via Simulations The Art of Sequential Optimization via Simulations Stochastic Systems and Learning Laboratory EE, CS* & ISE* Departments Viterbi School of Engineering University of Southern California (Based on joint

More information

A generalization of Dobrushin coefficient

A generalization of Dobrushin coefficient A generalization of Dobrushin coefficient Ü µ ŒÆ.êÆ ÆÆ 202.5 . Introduction and main results We generalize the well-known Dobrushin coefficient δ in total variation to weighted total variation δ V, which

More information

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true 3 ohn Nirenberg inequality, Part I A function ϕ L () belongs to the space BMO() if sup ϕ(s) ϕ I I I < for all subintervals I If the same is true for the dyadic subintervals I D only, we will write ϕ BMO

More information

Locally convex spaces, the hyperplane separation theorem, and the Krein-Milman theorem

Locally convex spaces, the hyperplane separation theorem, and the Krein-Milman theorem 56 Chapter 7 Locally convex spaces, the hyperplane separation theorem, and the Krein-Milman theorem Recall that C(X) is not a normed linear space when X is not compact. On the other hand we could use semi

More information

On the Central Limit Theorem for an ergodic Markov chain

On the Central Limit Theorem for an ergodic Markov chain Stochastic Processes and their Applications 47 ( 1993) 113-117 North-Holland 113 On the Central Limit Theorem for an ergodic Markov chain K.S. Chan Department of Statistics and Actuarial Science, The University

More information

High Dimensional Inverse Covariate Matrix Estimation via Linear Programming

High Dimensional Inverse Covariate Matrix Estimation via Linear Programming High Dimensional Inverse Covariate Matrix Estimation via Linear Programming Ming Yuan October 24, 2011 Gaussian Graphical Model X = (X 1,..., X p ) indep. N(µ, Σ) Inverse covariance matrix Σ 1 = Ω = (ω

More information

A LOWER BOUND ON BLOWUP RATES FOR THE 3D INCOMPRESSIBLE EULER EQUATION AND A SINGLE EXPONENTIAL BEALE-KATO-MAJDA ESTIMATE. 1.

A LOWER BOUND ON BLOWUP RATES FOR THE 3D INCOMPRESSIBLE EULER EQUATION AND A SINGLE EXPONENTIAL BEALE-KATO-MAJDA ESTIMATE. 1. A LOWER BOUND ON BLOWUP RATES FOR THE 3D INCOMPRESSIBLE EULER EQUATION AND A SINGLE EXPONENTIAL BEALE-KATO-MAJDA ESTIMATE THOMAS CHEN AND NATAŠA PAVLOVIĆ Abstract. We prove a Beale-Kato-Majda criterion

More information

Computations of Critical Groups at a Degenerate Critical Point for Strongly Indefinite Functionals

Computations of Critical Groups at a Degenerate Critical Point for Strongly Indefinite Functionals Journal of Mathematical Analysis and Applications 256, 462 477 (2001) doi:10.1006/jmaa.2000.7292, available online at http://www.idealibrary.com on Computations of Critical Groups at a Degenerate Critical

More information

AN EFFECTIVE METRIC ON C(H, K) WITH NORMAL STRUCTURE. Mona Nabiei (Received 23 June, 2015)

AN EFFECTIVE METRIC ON C(H, K) WITH NORMAL STRUCTURE. Mona Nabiei (Received 23 June, 2015) NEW ZEALAND JOURNAL OF MATHEMATICS Volume 46 (2016), 53-64 AN EFFECTIVE METRIC ON C(H, K) WITH NORMAL STRUCTURE Mona Nabiei (Received 23 June, 2015) Abstract. This study first defines a new metric with

More information

On the fast convergence of random perturbations of the gradient flow.

On the fast convergence of random perturbations of the gradient flow. On the fast convergence of random perturbations of the gradient flow. Wenqing Hu. 1 (Joint work with Chris Junchi Li 2.) 1. Department of Mathematics and Statistics, Missouri S&T. 2. Department of Operations

More information

Lecture Notes 10: Dynamic Programming

Lecture Notes 10: Dynamic Programming University of Warwick, EC9A0 Maths for Economists Peter J. Hammond 1 of 81 Lecture Notes 10: Dynamic Programming Peter J. Hammond 2018 September 28th University of Warwick, EC9A0 Maths for Economists Peter

More information

Wald for non-stopping times: The rewards of impatient prophets

Wald for non-stopping times: The rewards of impatient prophets Electron. Commun. Probab. 19 (2014), no. 78, 1 9. DOI: 10.1214/ECP.v19-3609 ISSN: 1083-589X ELECTRONIC COMMUNICATIONS in PROBABILITY Wald for non-stopping times: The rewards of impatient prophets Alexander

More information

Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University

Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University February 7, 2007 2 Contents 1 Metric Spaces 1 1.1 Basic definitions...........................

More information

Additive functionals of infinite-variance moving averages. Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535

Additive functionals of infinite-variance moving averages. Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535 Additive functionals of infinite-variance moving averages Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535 Departments of Statistics The University of Chicago Chicago, Illinois 60637 June

More information