
Simulation-Based Optimization of Markov Reward Processes: Implementation Issues [1]

Peter Marbach [2], John N. Tsitsiklis [3]

[1] This research was supported by contracts with Siemens AG, Munich, Germany, and Alcatel Bell, Belgium, and by contracts DMI-6254 and ACI-333 with the National Science Foundation.
[2] Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Current affiliation: Center for Communication Systems Research, Cambridge University, UK. p.marbach@ccsr.cam.ac.uk
[3] Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. jnt@mit.edu

Abstract

We consider discrete-time, finite-state-space Markov reward processes which depend on a set of parameters. Previously, we proposed a simulation-based methodology to tune the parameters to optimize the average reward. The resulting algorithms converge with probability 1, but may have a high variance. Here we propose two approaches to reduce the variance, which however introduce a new bias into the update direction. We report numerical results which indicate that the resulting algorithms are robust with respect to a small bias.

1 Introduction

In [MT] we considered finite-state Markov reward processes for which the transition probabilities and one-stage rewards depend on a parameter vector $\theta \in \mathbb{R}^K$, and proposed a simulation-based methodology which uses gradient estimates for tuning the parameter $\theta$ to optimize the average reward. The resulting algorithms can be implemented online and have the property that the gradient of the average reward converges to zero with probability 1 (which is the strongest possible result for gradient-related stochastic approximation algorithms). A drawback of these algorithms is that the updates may have a high variance, which can lead to slow convergence. This is due to the fact that they essentially employ a renewal period to produce an estimate of the gradient. If the length of a typical renewal period is large (as tends to be the case for systems involving a large state space), then the variance in the corresponding estimate can become quite high.

In this paper, we address this issue and propose two approaches to reduce the variance: one which estimates the gradient based on trajectories which tend to be shorter than a renewal period, and one which employs a discount factor. The resulting algorithms reduce the variance of a typical update, but introduce an additional bias into the update direction. As gradient-type methods tend to be robust with respect to small biases, the resulting algorithm may have an improved practical performance. A numerical study that we carried out suggests that this is indeed the case. For a comparison of our approach with other simulation-based optimization methods, such as the likelihood ratio method [Gly6, Gly], infinitesimal perturbation analysis (IPA) [CR4, CC, CW, FH4, FH], and neuro-dynamic programming/reinforcement learning [JSJ5], we refer to [MT].

2 Formulation

In this section we give a summary of the general framework, which is slightly more general than the one in [MT] (see [Mar] for a detailed discussion). Consider a discrete-time, finite-state Markov chain $\{i_n\}$ with state space $S = \{1, \ldots, N\}$, whose transition probabilities depend on a parameter vector $\theta \in \mathbb{R}^K$. We denote the one-step transition probabilities by $P_{ij}(\theta)$, $i, j \in S$, and the $n$-step transition probabilities by $P^n_{ij}(\theta)$. Whenever the state is equal to $i$, we receive a one-stage reward that also depends on $\theta$ and is denoted by $g_i(\theta)$.
For every $\theta \in \mathbb{R}^K$, let $P(\theta)$ be the stochastic matrix with entries $P_{ij}(\theta)$. Let $\mathcal{P} = \{P(\theta) \mid \theta \in \mathbb{R}^K\}$ be the set of all such matrices, and let $\bar{\mathcal{P}}$ be its closure. Note that every element of $\bar{\mathcal{P}}$ is also a stochastic matrix and, therefore, defines a Markov chain on the same state space. We make the following assumptions.

Assumption 1 (Recurrence) The Markov chain corresponding to every $P \in \bar{\mathcal{P}}$ is aperiodic. Furthermore, there exists a state $i^*$ which is recurrent for every such Markov chain.

Assumption 2 (Regularity) For all states $i, j \in S$, the transition probability $P_{ij}(\theta)$ and the one-stage reward $g_i(\theta)$ are bounded, twice differentiable, and have bounded first and second derivatives. Furthermore, we have
$$\nabla P_{ij}(\theta) = P_{ij}(\theta)\, L_{ij}(\theta),$$
where the function $L_{ij}(\theta)$ is bounded.
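To make Assumption 2 concrete, here is a minimal Python sketch, under a hypothetical row-wise softmax parameterization of a three-state chain, of the quantities it involves: the matrix $P(\theta)$ and the ratio $L_{ij}(\theta) = \nabla P_{ij}(\theta)/P_{ij}(\theta)$, with the gradient approximated by finite differences. None of these names or numbers come from the paper; they only illustrate a parameterization that satisfies the positivity and smoothness requirements.

    import numpy as np

    def transition_matrix(theta):
        # Hypothetical 3-state example: a row-wise softmax parameterization, so that
        # P_ij(theta) > 0, every row sums to 1, and P_ij is smooth in theta (Assumption 2).
        logits = np.array([[0.0,      theta[0], theta[1]],
                           [theta[1], 0.0,      theta[0]],
                           [theta[0], theta[1], 0.0]])
        expl = np.exp(logits - logits.max(axis=1, keepdims=True))
        return expl / expl.sum(axis=1, keepdims=True)

    def grad_P(theta, h=1e-6):
        # Central finite-difference approximation of grad P_ij(theta):
        # one (N x N) slice per component of theta.
        K = len(theta)
        grads = np.zeros((K,) + transition_matrix(theta).shape)
        for k in range(K):
            e = np.zeros(K)
            e[k] = h
            grads[k] = (transition_matrix(theta + e) - transition_matrix(theta - e)) / (2 * h)
        return grads

    def L_matrix(theta):
        # L_ij(theta) = grad P_ij(theta) / P_ij(theta), well defined here since P_ij(theta) > 0.
        return grad_P(theta) / transition_matrix(theta)

    theta = np.array([0.5, -0.3])
    print(L_matrix(theta)[0])           # L_ij(theta) with respect to theta[0]
    print(grad_P(theta).sum(axis=2))    # each row of grad P_ij(theta) sums to 0 (used in Section 5.1)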

Assumption 1 (Recurrence) states that under every transition matrix $P \in \bar{\mathcal{P}}$ we will eventually revisit the state $i^*$. This property allows us to employ results of renewal theory (see for example [Gal5]) in our analysis. Assumption 2 (Regularity) ensures that the transition probabilities $P_{ij}(\theta)$ and the one-stage reward $g_i(\theta)$ depend "smoothly" on $\theta$, and that the quotient $\nabla P_{ij}(\theta)/P_{ij}(\theta)$ is "well behaved".

As a performance metric we use the average reward criterion
$$\lambda(\theta) = \lim_{t \to \infty} \frac{1}{t}\, E_\theta\!\left[\sum_{k=0}^{t-1} g_{i_k}(\theta)\right].$$
Here, $i_k$ is the state visited at time $k$, and the notation $E_\theta[\cdot]$ indicates that the expectation is taken with respect to the distribution of the Markov chain with transition probabilities $P_{ij}(\theta)$. Under Assumption 1 (Recurrence), the average reward $\lambda(\theta)$ is well defined for every $\theta$ and does not depend on the initial state. We define the differential reward $v_i(\theta)$ of state $i \in S$ and the mean recurrence time $E_\theta[T]$ by
$$v_i(\theta) = E_\theta\!\left[\sum_{k=0}^{T-1} \bigl(g_{i_k}(\theta) - \lambda(\theta)\bigr) \,\Big|\, i_0 = i\right], \qquad E_\theta[T] = E_\theta[T \mid i_0 = i^*],$$
where $T = \min\{k > 0 \mid i_k = i^*\}$ is the first future time that state $i^*$ is visited. We have that $v_{i^*}(\theta) = 0$.

3 A Gradient-Based Algorithm for Updating $\theta$

Given that our goal is to maximize the average reward $\lambda(\theta)$, it is natural to consider gradient-type methods. Using the following expression for the gradient $\nabla\lambda(\theta)$ of the average reward with respect to $\theta$ (see [MT] for a derivation),
$$\nabla\lambda(\theta) = \sum_{i \in S} \pi_i(\theta) \Bigl(\nabla g_i(\theta) + \sum_{j \in S} \nabla P_{ij}(\theta)\, v_j(\theta)\Bigr), \qquad (1)$$
where $\pi_i(\theta)$ is the stationary probability of state $i$, we could consider an algorithm of the form
$$\theta_{k+1} = \theta_k + \gamma_k \nabla\lambda(\theta_k).$$
Alternatively, we could use simulation to produce an unbiased estimate $F_k(\theta_k)$ of $\nabla\lambda(\theta_k)$ and employ the approximate gradient iteration
$$\theta_{k+1} = \theta_k + \gamma_k F_k(\theta_k)$$
to tune $\theta$. Unbiased estimates of the gradient exist [Gly6], but it is not clear how to use them in an algorithm which has the properties we will require in the following. ([Gly6] develops an algorithm using unbiased estimates of the gradient which updates the parameter vector $\theta$ at visits to the recurrent state $i^*$; it is not known whether this algorithm can be extended so that the parameter gets updated at every time step, a property which we will require based on the discussion in Section 4.) This difficulty is bypassed by the method developed in the following. We will again employ an estimate of the gradient to update the parameter vector, which however is biased. Convergence of the resulting simulation-based method can be established by showing that the bias (asymptotically) vanishes.

3.1 Estimation of $\nabla\lambda(\theta)$

We rewrite the expression for the gradient $\nabla\lambda(\theta)$ as follows (see [MT]),
$$\nabla\lambda(\theta) = \sum_{i \in S} \pi_i(\theta) \Bigl(\nabla g_i(\theta) + \sum_{j \in S} P_{ij}(\theta)\, L_{ij}(\theta)\, v_j(\theta)\Bigr),$$
where $L_{ij}(\theta) = \nabla P_{ij}(\theta)/P_{ij}(\theta)$. This expression suggests the following way to estimate the gradient $\nabla\lambda(\theta)$. Let the parameter vector be fixed to some value $\theta$ and let $\{i_n\}$ be a sample path of the corresponding Markov chain, possibly obtained through simulation. Furthermore, let $t_m$ be the time of the $m$th visit to the recurrent state $i^*$. Consider the estimate of $\nabla\lambda(\theta)$ given by
$$F_m(\theta, \tilde\lambda) = \sum_{n=t_m}^{t_{m+1}-1} \Bigl(\tilde v_{i_n}(\theta, \tilde\lambda)\, L_{i_{n-1} i_n}(\theta) + \nabla g_{i_n}(\theta)\Bigr),$$
where, for $i_n = j$,
$$\tilde v_{i_n}(\theta, \tilde\lambda) = \sum_{k=n}^{t_{m+1}-1} \bigl(g_{i_k}(\theta) - \tilde\lambda\bigr) \qquad (2)$$
is an estimate of the differential reward $v_j(\theta)$, and $\tilde\lambda$ is some estimate of $\lambda(\theta)$. For $n = t_m$ we set $\tilde v_{i_n}(\theta, \tilde\lambda) = 0$. Assumption 1 (Recurrence) allows us to employ renewal theory (see for example [Gal5]) to obtain the following result, which states that the expectation of $F_m(\theta, \tilde\lambda)$ is aligned with $\nabla\lambda(\theta)$ to the extent that $\tilde\lambda$ is close to $\lambda(\theta)$ (see [MT, Mar]).

Proposition 1 We have
$$E_\theta\bigl[F_m(\theta, \tilde\lambda)\bigr] = E_\theta[T]\, \nabla\lambda(\theta) + G(\theta)\bigl(\lambda(\theta) - \tilde\lambda\bigr),$$
where
$$G(\theta) = E_\theta\!\left[\sum_{n=t_m+1}^{t_{m+1}} (t_{m+1} - n)\, L_{i_{n-1} i_n}(\theta)\right].$$
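As an illustration of how the estimate $F_m(\theta, \tilde\lambda)$ of (2) might be computed from one simulated regenerative cycle, here is a short Python sketch. The callables reward, grad_reward and L (returning $g_i(\theta)$, $\nabla g_i(\theta)$ and $L_{ij}(\theta) = \nabla P_{ij}(\theta)/P_{ij}(\theta)$, respectively) are hypothetical user-supplied helpers, and the summation conventions follow the reconstruction written above.

    import numpy as np

    def cycle_gradient_estimate(path, theta, lam_tilde, reward, grad_reward, L):
        # One-cycle gradient estimate F_m(theta, lam_tilde) built from (2).
        #   path        : states i_{t_m}, ..., i_{t_{m+1}} of one regenerative cycle
        #                 (it begins and ends at the recurrent state i*)
        #   lam_tilde   : current estimate of the average reward lambda(theta)
        #   reward      : reward(i, theta)      -> g_i(theta)
        #   grad_reward : grad_reward(i, theta) -> gradient of g_i(theta) w.r.t. theta
        #   L           : L(i, j, theta)        -> grad P_ij(theta) / P_ij(theta)
        T = len(path) - 1                       # cycle length t_{m+1} - t_m
        # v_tilde[n] = sum_{k=n}^{T-1} (g_{i_k}(theta) - lam_tilde), with v_tilde[0] = 0
        v_tilde = np.zeros(T)
        tail = 0.0
        for n in range(T - 1, 0, -1):
            tail += reward(path[n], theta) - lam_tilde
            v_tilde[n] = tail
        F = np.zeros_like(np.asarray(theta, dtype=float))
        for n in range(T):
            F += grad_reward(path[n], theta)
            if n > 0:                           # transition i_{n-1} -> i_n; v_tilde[0] = 0 anyway
                F += v_tilde[n] * L(path[n - 1], path[n], theta)
        return F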

3.2 An Algorithm that Updates at Visits to the Recurrent State

Using $F_m(\theta, \tilde\lambda)$ as an estimate of the gradient direction, we can formulate the following algorithm, which updates $\theta$ at visits to the recurrent state $i^*$. At the time $t_m$ that state $i^*$ is visited for the $m$th time, we have available a current vector $\theta_m$ and an average reward estimate $\tilde\lambda_m$. We then simulate the process according to the transition probabilities $P_{ij}(\theta_m)$ until the next time $t_{m+1}$ that $i^*$ is visited, and update according to
$$\theta_{m+1} = \theta_m + \gamma_m F_m(\theta_m, \tilde\lambda_m), \qquad (3)$$
$$\tilde\lambda_{m+1} = \tilde\lambda_m + \eta\, \gamma_m \sum_{n=t_m}^{t_{m+1}-1} \bigl(g_{i_n}(\theta_m) - \tilde\lambda_m\bigr), \qquad (4)$$
where $\eta$ is a positive scalar and $\gamma_m$ is a step size sequence which satisfies the following assumption.

Assumption 3 (Step Size) The step sizes $\gamma_m$ are nonnegative and satisfy
$$\sum_{m=1}^{\infty} \gamma_m = \infty, \qquad \sum_{m=1}^{\infty} \gamma_m^2 < \infty.$$

This assumption is satisfied, for example, if we let $\gamma_m = 1/m$. Note that the update in (3) is a biased estimate of the gradient direction (see Proposition 1). By showing that the update (4) drives this bias (asymptotically) to zero, we obtain the following convergence result (see [MT, Mar]).

Proposition 2 Let Assumption 1 (Recurrence), Assumption 2 (Regularity), and Assumption 3 (Step Size) hold, and let $\{\theta_m\}$ be the sequence of parameter vectors generated by the above described algorithm. Then, with probability 1, $\lambda(\theta_m)$ converges and
$$\lim_{m \to \infty} \nabla\lambda(\theta_m) = 0.$$

4 Implementation Issues

For systems involving a large state space, the interval between visits to the state $i^*$ can be large. This means that in the algorithm proposed above the parameter vector $\theta$ gets updated only infrequently and the estimate $F_m(\theta, \tilde\lambda)$ can have a large variance. In [MT], we have shown how the above algorithm can be extended so that the parameter vector gets updated at every time step. Here, we will in addition consider two approaches to reduce the variance in the updates, which are based on two alternative estimates of the differential reward.

4.1 Reducing the Variance

If the time until we reach the recurrent state $i^*$ from state $i$ is large, it may be desirable to choose a subset $S_0 \subset S$ containing $i^*$ and to estimate $v_i(\theta)$ through
$$\tilde v^{S_0}_i(\theta, \tilde\lambda) = \sum_{k=0}^{T_0 - 1} \bigl(g_{i_k}(\theta) - \tilde\lambda\bigr), \qquad (5)$$
where $T_0 = \min\{k > 0 \mid i_k \in S_0\}$ is the first future time that a state in the set $S_0$ is visited. Note that it takes fewer time steps to reach the set $S_0$ than to revisit the recurrent state $i^*$; therefore, $\tilde v^{S_0}_i(\theta, \tilde\lambda)$ typically has a smaller variance than the estimate based on (2). As an alternative approach, we may use a factor $\alpha$, $0 < \alpha < 1$, to discount future rewards. This leads to the following estimate,
$$\tilde v^{\alpha}_i(\theta, \tilde\lambda) = \sum_{k=0}^{T-1} \alpha^k \bigl(g_{i_k}(\theta) - \tilde\lambda\bigr), \qquad (6)$$
where $T = \min\{k > 0 \mid i_k = i^*\}$ is the first future time that the state $i^*$ is visited. In Sections 5 and 6 we will use (5) and (6) to derive modified estimates of the gradient $\nabla\lambda(\theta)$.

5 Using the Set $S_0$ to Reduce the Variance

Let the parameter $\theta \in \mathbb{R}^K$ be fixed to some value and let $(i_0, i_1, i_2, \ldots)$ be a trajectory of the corresponding Markov chain. Let $t_m$ be the time of the $m$th visit to the recurrent state $i^*$. Furthermore, given a set $S_0 \subset S$ containing $i^*$, let $\sigma(m)$ be the number of times between $t_m$ and $t_{m+1}$ that a state in the set $S_0 \setminus \{i^*\}$ is visited, let $t_{m,n}$ be the time of the $n$th such visit, and let $t_{m,0}$ and $t_{m,\sigma(m)+1}$ be equal to $t_m$ and $t_{m+1}$, respectively. Using these definitions, we consider the estimate $F^{S_0}_m(\theta, \tilde\lambda)$ of the gradient $\nabla\lambda(\theta)$ given by
$$F^{S_0}_m(\theta, \tilde\lambda) = \sum_{k=t_m}^{t_{m+1}-1} \Bigl(\tilde v^{S_0}_{i_k}(\theta, \tilde\lambda)\, L_{i_{k-1} i_k}(\theta) + \nabla g_{i_k}(\theta)\Bigr),$$
where, for $t_{m,n} \le k < t_{m,n+1}$ and $i_k \ne i^*$, we set
$$\tilde v^{S_0}_{i_k}(\theta, \tilde\lambda) = \sum_{l=k}^{t_{m,n+1}-1} \bigl(g_{i_l}(\theta) - \tilde\lambda\bigr),$$
and, for $k = t_m$, we set $\tilde v^{S_0}_{i_k}(\theta, \tilde\lambda) = 0$.
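For completeness, here is a minimal Python sketch of the cycle-based update (3)-(4) of Section 3.2, in the form written above; the reduced-variance estimators of Sections 5 and 6 would simply be passed in place of gradient_estimate. The callables simulate_cycle (returning one regenerative cycle $i_{t_m}, \ldots, i_{t_{m+1}}$ under $P(\theta)$), gradient_estimate (e.g. the one-cycle estimate sketched earlier, or a variant based on (5) or (6)) and reward (returning $g_i(\theta)$) are hypothetical user-supplied helpers.

    import numpy as np

    def optimize_at_renewals(theta0, lam0, eta, num_cycles,
                             simulate_cycle, gradient_estimate, reward, seed=0):
        # Update theta and lambda_tilde once per visit to the recurrent state i*,
        # following (3)-(4), with step sizes gamma_m = 1/m (which satisfy Assumption 3).
        rng = np.random.default_rng(seed)
        theta = np.asarray(theta0, dtype=float).copy()
        lam_tilde = float(lam0)
        for m in range(1, num_cycles + 1):
            gamma = 1.0 / m
            path = simulate_cycle(theta, rng)                 # states i_{t_m}, ..., i_{t_{m+1}}
            rewards = np.array([reward(i, theta) for i in path[:-1]])
            theta = theta + gamma * gradient_estimate(path, theta, lam_tilde)   # update (3)
            lam_tilde += eta * gamma * float(np.sum(rewards - lam_tilde))       # update (4)
        return theta, lam_tilde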
One can show (see [Mar]) that the expectation $E_\theta[F^{S_0}_m(\theta, \tilde\lambda)]$ is of the same form as the expectation of the original estimate $F_m(\theta, \tilde\lambda)$ in Proposition 1; however, the exact value of the differential reward $v_j(\theta)$ is replaced by the approximation
$$v^{S_0}_j(\theta) = E_\theta\!\left[\sum_{k=0}^{T_0-1} \bigl(g_{i_k}(\theta) - \lambda(\theta)\bigr) \,\Big|\, i_0 = j\right],$$
where $T_0 = \min\{k > 0 \mid i_k \in S_0\}$.

This introduces a new bias term $\Delta^{S_0}(\theta)$ into the estimate of the gradient direction, which is equal to
$$\Delta^{S_0}(\theta) = E_\theta[T] \sum_{i \in S} \pi_i(\theta) \sum_{j \in S} \nabla P_{ij}(\theta) \bigl(v^{S_0}_j(\theta) - v_j(\theta)\bigr).$$

5.1 A Bound on the Bias $\Delta^{S_0}(\theta)$

To derive an upper bound on $\|\Delta^{S_0}(\theta)\|$, we use some suitable assumptions on the bias in the estimate of the differential reward, given by
$$\hat v^{S_0}_i(\theta) = v^{S_0}_i(\theta) - v_i(\theta).$$
The basic idea is the following. As $\sum_{j \in S} \nabla P_{ij}(\theta) = 0$ for all $i \in S$ and all $\theta \in \mathbb{R}^K$, we only have to know the "relative magnitudes" of the differential rewards in order to compute $\nabla\lambda(\theta)$; i.e., we have, for every constant $c$,
$$\nabla\lambda(\theta) = \sum_{i \in S} \pi_i(\theta) \Bigl(\nabla g_i(\theta) + \sum_{j \in S} \nabla P_{ij}(\theta) \bigl(v_j(\theta) - c\bigr)\Bigr).$$
Therefore, if $\hat v^{S_0}_i(\theta) = \hat v^{S_0}_{i'}(\theta)$ for all $i, i' \in S$, then the bias term $\|\Delta^{S_0}(\theta)\|$ is equal to 0. In fact, the term $\|\Delta^{S_0}(\theta)\|$ vanishes under the weaker assumption that, for all states $i \in S$, we have $\hat v^{S_0}_j(\theta) = \hat v^{S_0}_{j'}(\theta)$ whenever $j, j' \in S_i$, where
$$S_i = \{\, j \in S \mid \nabla P_{ij}(\theta) \ne 0 \text{ for some } \theta \in \mathbb{R}^K \,\}.$$
Now assume that there exists an $\epsilon$ such that, for all states $i \in S$, we have
$$\bigl|\hat v^{S_0}_j(\theta) - \hat v^{S_0}_{j'}(\theta)\bigr| \le \epsilon \quad \text{if } j, j' \in S_i.$$
Then we have that, under Assumption 1 (Recurrence) and Assumption 2 (Regularity),
$$\|\Delta^{S_0}(\theta)\| \le \epsilon\, T_{\max}\, \bar N\, C,$$
where $T_{\max}$ is an upper bound on $E_\theta[T]$, $C$ is a bound on $\|\nabla P_{ij}(\theta)\|$, and $\bar N = \max\{N_i \mid i \in S\}$, where $N_i$ is the number of states in the set $S_i$. Therefore, in order to keep the bias $\Delta^{S_0}(\theta)$ small, one should choose $S_0$ such that, for all states $i \in S$ and for all states $j, j' \in S_i$, the difference $\hat v^{S_0}_j(\theta) - \hat v^{S_0}_{j'}(\theta)$ is small. A proof of this result is given in [Mar].

5.2 An Algorithm that Updates at Visits to the Recurrent State

Consider the following version of the algorithm in Section 3, which updates $\theta$ and $\tilde\lambda$ at visits to the recurrent state $i^*$:
$$\theta_{m+1} = \theta_m + \gamma_m F^{S_0}_m(\theta_m, \tilde\lambda_m), \qquad \tilde\lambda_{m+1} = \tilde\lambda_m + \eta\, \gamma_m \sum_{n=t_m}^{t_{m+1}-1} \bigl(g_{i_n}(\theta_m) - \tilde\lambda_m\bigr),$$
where we use the estimate $F^{S_0}_m(\theta_m, \tilde\lambda_m)$ in place of $F_m(\theta_m, \tilde\lambda_m)$. Under Assumption 1 (Recurrence) and Assumption 2 (Regularity), we have that, with probability 1,
$$\liminf_{m \to \infty} \|\nabla\lambda(\theta_m)\| \le \frac{D}{T_{\min}},$$
where $D$ is a bound on the bias $\|\Delta^{S_0}(\theta)\|$ and $T_{\min}$ is a lower bound on $E_\theta[T]$ (see [Mar]). This establishes that if the bias $\|\Delta^{S_0}(\theta)\|$ tends to be small, then the gradient $\nabla\lambda(\theta_m)$ is small at infinitely many visits to the recurrent state $i^*$. However, it does not say anything about the behavior of the sequence $\lambda(\theta_m)$, or about how we might detect instances at which the average reward $\lambda(\theta_m)$ is high. This result can be strengthened in the following sense (see [Mar]): if the upper bound $D$ on $\|\Delta^{S_0}(\theta)\|$ is small enough, then $\tilde\lambda_m$ overestimates the average reward $\lambda(\theta_m)$ at most by a little; i.e., there exists a bound $B(D)$ such that
$$\limsup_{m \to \infty} \bigl(\tilde\lambda_m - \lambda(\theta_m)\bigr) \le B(D).$$
This implies that $\tilde\lambda_m$ can be used to detect instances where the average reward $\lambda(\theta_m)$ is high. Similar to [MT], one can derive a version of this algorithm which updates $\theta$ at each time step (see [Mar]).

6 Using a Discount Factor to Reduce the Variance

Let the parameter $\theta \in \mathbb{R}^K$ be fixed to some value and let $(i_0, i_1, i_2, \ldots)$ be a sample path of the corresponding Markov chain. Furthermore, given a discount factor $\alpha \in (0, 1)$, let $t_m$ be the time of the $m$th visit to the recurrent state $i^*$ and consider the following estimate $F^{\alpha}_m(\theta, \tilde\lambda)$ of the gradient $\nabla\lambda(\theta)$,
$$F^{\alpha}_m(\theta, \tilde\lambda) = \sum_{n=t_m}^{t_{m+1}-1} \Bigl(\tilde v^{\alpha}_{i_n}(\theta, \tilde\lambda)\, L_{i_{n-1} i_n}(\theta) + \nabla g_{i_n}(\theta)\Bigr),$$
where, for $t_m < n \le t_{m+1} - 1$, we set
$$\tilde v^{\alpha}_{i_n}(\theta, \tilde\lambda) = \sum_{k=n}^{t_{m+1}-1} \alpha^{k-n} \bigl(g_{i_k}(\theta) - \tilde\lambda\bigr),$$
and, for $n = t_m$, we set $\tilde v^{\alpha}_{i_n}(\theta, \tilde\lambda) = 0$. As in the previous section, the expectation $E_\theta[F^{\alpha}_m(\theta, \tilde\lambda)]$ is of the same form as the expectation of the original estimate $F_m(\theta, \tilde\lambda)$, except that the exact value of the differential reward $v_j(\theta)$ is replaced by the approximation
$$v^{\alpha}_j(\theta) = E_\theta\!\left[\sum_{k=0}^{T-1} \alpha^k \bigl(g_{i_k}(\theta) - \lambda(\theta)\bigr) \,\Big|\, i_0 = j\right],$$
where $T = \min\{k > 0 \mid i_k = i^*\}$.
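The discounted estimate of Section 6 differs from the earlier one-cycle sketch only in the $\alpha^{k-n}$ weighting of the inner sum, which can be accumulated by a backward recursion; a minimal Python sketch follows, again with hypothetical user-supplied callables reward, grad_reward and L as before.

    import numpy as np

    def discounted_cycle_gradient_estimate(path, theta, lam_tilde, alpha,
                                           reward, grad_reward, L):
        # Discounted one-cycle estimate F^alpha_m(theta, lam_tilde):
        # v_alpha[n] = sum_{k=n}^{T-1} alpha**(k-n) * (g_{i_k}(theta) - lam_tilde), v_alpha[0] = 0.
        T = len(path) - 1                       # cycle length t_{m+1} - t_m
        v_alpha = np.zeros(T)
        tail = 0.0
        for n in range(T - 1, 0, -1):           # backward: tail_n = (g_{i_n} - lam) + alpha * tail_{n+1}
            tail = (reward(path[n], theta) - lam_tilde) + alpha * tail
            v_alpha[n] = tail
        F = np.zeros_like(np.asarray(theta, dtype=float))
        for n in range(T):
            F += grad_reward(path[n], theta)
            if n > 0:
                F += v_alpha[n] * L(path[n - 1], path[n], theta)
        return F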
6.1 A Bound on the Bias $\Delta^{\alpha}(\theta)$

To derive a bound on the resulting new bias $\Delta^{\alpha}(\theta)$, which is equal to
$$\Delta^{\alpha}(\theta) = E_\theta[T] \sum_{i \in S} \pi_i(\theta) \sum_{j \in S} \nabla P_{ij}(\theta) \bigl(v^{\alpha}_j(\theta) - v_j(\theta)\bigr),$$
we use the following assumption.

There exist scalars $\beta$, $0 < \beta < 1$, and $A$ such that, for all states $i, j \in S$ and all integers $n$, we have
$$\bigl|P^n_{ij}(\theta) - \pi_j(\theta)\bigr| \le A\, \beta^n.$$
We can think of $\beta$ as a "mixing constant" for the Markov reward processes associated with $\bar{\mathcal{P}}$. It can be shown that under Assumption 1 (Recurrence) such constants $A$ and $\beta$ exist (see [Mar]). It then follows that, under Assumption 1 (Recurrence) and Assumption 2 (Regularity), $\|\Delta^{\alpha}(\theta)\|$ satisfies a bound of the form
$$\|\Delta^{\alpha}(\theta)\| \le A\, T_{\max}\, C\, \bar N\, c(\alpha, \beta),$$
where $T_{\max}$ is an upper bound on $E_\theta[T]$, $C$ is a bound on $\|\nabla P_{ij}(\theta)\|$, $\bar N$ is the same as in Section 5, and $c(\alpha, \beta)$ is a factor depending only on $\alpha$ and $\beta$. This result states that the bound on the bias $\Delta^{\alpha}(\theta)$ vanishes as $\beta \downarrow 0$ and as $\alpha \uparrow 1$. A proof is given in [Mar]. Again, we can use the estimate $F^{\alpha}_m(\theta, \tilde\lambda)$ to derive an algorithm which updates the parameter vector $\theta$ either at visits to the recurrent state $i^*$ or at every time step (see [Mar]). The results of the previous section also hold in this case.

Figure 1: Trajectory of the exact average reward (top) and threshold parameters (bottom) of the idealized gradient algorithm.

7 Experimental Results

We applied the algorithms proposed above to a problem where the provider of a communication link has to accept and reject incoming calls of several types, while taking into account the current congestion. Calls are charged according to the type they belong to, and the goal of the provider is to maximize the long-term average reward. We consider the following admission control policy, described in terms of a "fuzzy threshold" parameter $\theta(m)$ for each call type $m = 1, \ldots, M$. Given that the currently used link bandwidth is equal to $B$, a new call of type $m$ is accepted with probability
$$\frac{1}{1 + \exp\bigl(B - \theta(m)\bigr)}$$
and is otherwise rejected. A detailed description of the case study can be found in [Mar]. In the following, we provide a short summary of the main results for a case study involving 3 different service types.
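As a small illustration of the "fuzzy threshold" policy just described, the acceptance probability $1/(1 + \exp(B - \theta(m)))$ can be evaluated as in the Python sketch below; the threshold and bandwidth values are made-up examples and are not those of the case study in [Mar].

    import numpy as np

    def accept_probability(B, theta_m):
        # Probability of admitting a new call of type m when the currently used
        # link bandwidth is B and the fuzzy threshold of type m is theta_m.
        return 1.0 / (1.0 + np.exp(B - theta_m))

    thetas = np.array([4.0, 7.0, 10.0])      # hypothetical thresholds for three call types
    for B in (2.0, 6.0, 12.0):               # hypothetical levels of used bandwidth
        print("B =", B, "->", np.round(accept_probability(B, thetas), 3))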

7.1 Idealized Gradient Algorithm

For this case, we were able to compute the exact gradient $\nabla\lambda(\theta)$ and to implement the idealized gradient algorithm of Section 3. The corresponding trajectories are given in Figure 1. The resulting average reward is . (the optimal average reward obtained by dynamic programming is .6).

Figure 2: Trajectory of the average reward (top) and threshold parameters (bottom) of the simulation-based algorithm of Section 3 (the scaling factor for the iteration steps is $10^6$).

7.2 Simulation-Based Optimization

Figures 2, 3, and 4 show the results for the versions of the simulation-based algorithms of Section 3, Section 5 (for the choice of $S_0$ see [Mar]), and Section 6 (with a fixed discount factor $\alpha$), respectively, where the parameter $\theta$ gets updated at every time step. We make the following observations.

Figure 3: Trajectory of the average reward (top) and threshold parameters (bottom) of the simulation-based algorithm of Section 5 (the scaling factor for the iteration steps is $10^5$).

Figure 4: Trajectory of the average reward (top) and threshold parameters (bottom) of the simulation-based algorithm of Section 6 (the scaling factor for the iteration steps is $10^5$).

1. All three algorithms make rapid progress in the beginning, improving the average reward from . to . within the first $10^6$ iteration steps for the algorithm of Section 3 and within the first $10^5$ iteration steps for the algorithms of Sections 5 and 6. After reaching an average reward of ., the algorithms progress only slowly. This is not unlike the behavior of the idealized algorithm (see Figure 1).

2. The simulation-based algorithm of Section 3 attains an average reward of . after $10^6$ iteration steps, while the algorithms of Sections 5 and 6 obtain an average reward of .2 and .5, respectively, after $10^6$ iteration steps.

These results are encouraging, as the algorithms with reduced variance speed up the convergence by an order of magnitude, while introducing a negligible bias.

References

[Ber5a] D. P. Bertsekas. Dynamic Programming and Optimal Control, Vols. I and II. Athena Scientific, Belmont, MA, 1995.
[Ber5b] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
[CC] X. R. Cao and H. F. Chen. Perturbation Realization, Potentials, and Sensitivity Analysis of Markov Processes. IEEE Transactions on Automatic Control, 42:32–33, 1997.
[CR4] E. K. P. Chong and P. J. Ramadge. Stochastic Optimization of Regenerative Systems Using Infinitesimal Perturbation Analysis. IEEE Transactions on Automatic Control, 3:4–4, 1994.
[CW] X. R. Cao and Y. W. Wan. Algorithms for Sensitivity Analysis of Markov Systems through Potentials and Perturbation Realization. IEEE Transactions on Control Systems Technology, 6:42–44, 1998.
[FH4] M. C. Fu and J.-Q. Hu. Smoothed Perturbation Analysis Derivative Estimation for Markov Chains. Operations Research Letters, 5:24–25, 1994.
[FH] M. Fu and J.-Q. Hu. Conditional Monte Carlo: Gradient Estimation and Optimization Applications. Kluwer Academic Publishers, Boston, MA, 1997.
[Gal5] R. G. Gallager. Discrete Stochastic Processes. Kluwer Academic Publishers, Boston/Dordrecht/London, 1995.
[Gly6] P. W. Glynn. Stochastic Approximation for Monte Carlo Optimization. Proceedings of the 1986 Winter Simulation Conference, pages 25–2, 1986.
[Gly] P. W. Glynn. Likelihood Ratio Gradient Estimation: An Overview. Proceedings of the Winter Simulation Conference, pages 366–35, 1987.
[JSJ5] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems. Advances in Neural Information Processing Systems, pages 35–46, 1995.
[Mar] P. Marbach. Simulation-Based Optimization of Markov Decision Processes. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 1998.
[MT] P. Marbach and J. N. Tsitsiklis. Simulation-Based Optimization of Markov Reward Processes. Technical Report LIDS-P 24, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, 1998.
