Simulation-Based Optimization of Markov Reward Processes: Implementation Issues

Peter Marbach (Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, USA; current affiliation: Center for Communication Systems Research, Cambridge University, UK; email: p.marbach@ccsr.cam.ac.uk)

John N. Tsitsiklis (Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, USA; email: jnt@mit.edu)

This research was supported by contracts with Siemens AG, Munich, Germany, and Alcatel Bell, Belgium, and by contracts DMI-6254 and ACI-333 with the National Science Foundation.

Abstract

We consider discrete-time, finite-state-space Markov reward processes which depend on a set of parameters. Previously, we proposed a simulation-based methodology to tune the parameters so as to optimize the average reward. The resulting algorithms converge with probability 1, but may have a high variance. Here we propose two approaches to reduce the variance, which however introduce a new bias into the update direction. We report numerical results which indicate that the resulting algorithms are robust with respect to a small bias.

1 Introduction

In [MT] we considered finite-state Markov reward processes for which the transition probabilities and one-stage rewards depend on a parameter vector $\theta \in \mathbb{R}^K$, and proposed a simulation-based methodology which uses gradient estimates for tuning the parameter $\theta$ to optimize the average reward. The resulting algorithms can be implemented online and have the property that the gradient of the average reward converges to zero with probability 1 (which is the strongest possible result for gradient-related stochastic approximation algorithms). A drawback of these algorithms is that the updates may have a high variance, which can lead to slow convergence. This is because they essentially employ a full renewal period to produce an estimate of the gradient. If the length of a typical renewal period is large (as tends to be the case for systems with a large state space), then the variance of the corresponding estimate can become quite high.

In this paper, we address this issue and propose two approaches to reduce the variance: one which estimates the gradient based on trajectories that tend to be shorter than a renewal period, and one which employs a discount factor. The resulting algorithms reduce the variance of a typical update, but introduce an additional bias into the update direction. As gradient-type methods tend to be robust with respect to small biases, the resulting algorithms may have improved practical performance. A numerical study that we carried out suggests that this is indeed the case. For a comparison of our approach with other simulation-based optimization methods, such as the likelihood-ratio method [Gly6, Gly], infinitesimal perturbation analysis (IPA) [CR4, CC, CW, FH4, FH], and neuro-dynamic programming/reinforcement learning [JSJ5], we refer to [MT].

2 Formulation

In this section we give a summary of the general framework, which is slightly more general than the one in [MT] (see [Mar] for a detailed discussion). Consider a discrete-time, finite-state Markov chain $\{i_n\}$ with state space $S = \{1, \ldots, N\}$, whose transition probabilities depend on a parameter vector $\theta \in \mathbb{R}^K$. We denote the one-step transition probabilities by $P_{ij}(\theta)$, $i, j \in S$, and the $n$-step transition probabilities by $P^n_{ij}(\theta)$. Whenever the state is equal to $i$, we receive a one-stage reward that also depends on $\theta$ and is denoted by $g_i(\theta)$.
For every $\theta \in \mathbb{R}^K$, let $P(\theta)$ be the stochastic matrix with entries $P_{ij}(\theta)$. Let $\mathcal{P} = \{P(\theta) \mid \theta \in \mathbb{R}^K\}$ be the set of all such matrices, and let $\bar{\mathcal{P}}$ be its closure. Note that every element of $\bar{\mathcal{P}}$ is also a stochastic matrix and, therefore, defines a Markov chain on the same state space. We make the following assumptions.

Assumption 1 (Recurrence) The Markov chain corresponding to every $P \in \bar{\mathcal{P}}$ is aperiodic. Furthermore, there exists a state $i^*$ which is recurrent for every such Markov chain.

Assumption 2 (Regularity) For all states $i, j \in S$ and for all $P \in \bar{\mathcal{P}}$, the transition probability $P_{ij}(\theta)$ and the one-stage reward $g_i(\theta)$ are bounded, twice differentiable, and have bounded first and second derivatives. Furthermore, we have $\nabla P_{ij}(\theta) = P_{ij}(\theta) L_{ij}(\theta)$, where the function $L_{ij}(\theta)$ is bounded.
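To make the objects in this framework concrete, here is a minimal Python sketch (ours, not from the paper) of a hypothetical two-state parameterized chain: transition probabilities $P_{ij}(\theta)$ obtained through a sigmoid, a one-stage reward $g_i(\theta)$, and the factor $L_{ij}(\theta) = \nabla P_{ij}(\theta)/P_{ij}(\theta)$ that Assumption 2 requires to be bounded. All names and the specific parameterization are illustrative assumptions; the later sketches assume this kind of interface.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def P(theta):
    """Transition matrix P(theta) of a hypothetical two-state chain.

    State 0 plays the role of the recurrent state i*: for finite theta every
    entry is strictly positive, so the chain keeps returning to state 0.
    """
    p = sigmoid(theta[0])          # probability of moving 0 -> 1
    q = sigmoid(theta[1])          # probability of moving 1 -> 0
    return np.array([[1.0 - p, p],
                     [q, 1.0 - q]])

def grad_P(theta):
    """Gradient of each entry P_ij(theta) with respect to theta, shape (2, 2, 2)."""
    p, q = sigmoid(theta[0]), sigmoid(theta[1])
    dp, dq = p * (1.0 - p), q * (1.0 - q)      # derivative of the sigmoid
    grad = np.zeros((2, 2, 2))
    grad[0, 0, 0], grad[0, 1, 0] = -dp, dp
    grad[1, 0, 1], grad[1, 1, 1] = dq, -dq
    return grad

def g(i, theta):
    """One-stage reward g_i(theta); here it does not depend on theta."""
    return 1.0 if i == 1 else 0.0

def grad_g(i, theta):
    """Gradient of g_i(theta); zero because the reward above ignores theta."""
    return np.zeros_like(theta)

def L(i, j, theta):
    """Likelihood-ratio factor L_ij(theta) = grad P_ij(theta) / P_ij(theta)."""
    return grad_P(theta)[i, j] / P(theta)[i, j]

def step(i, theta):
    """Simulate one transition: draw the next state from row i of P(theta)."""
    return int(rng.choice(2, p=P(theta)[i]))
```

For any finite $\theta$ all transition probabilities of this toy chain are strictly positive, so it is aperiodic and state 0 is recurrent; the sketch is only meant to illustrate the interface, not to verify the assumptions in full generality.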

Assumption 1 (Recurrence) states that under every transition matrix $P \in \bar{\mathcal{P}}$ we will eventually revisit the state $i^*$. This property allows us to employ results from renewal theory (see for example [Gal5]) in our analysis. Assumption 2 (Regularity) assures that the transition probabilities $P_{ij}(\theta)$ and the one-stage rewards $g_i(\theta)$ depend "smoothly" on $\theta$, and that the quotient $\nabla P_{ij}(\theta)/P_{ij}(\theta)$ is "well behaved".

As a performance metric we use the average reward criterion

$$\lambda(\theta) = \lim_{t \to \infty} \frac{1}{t}\, E_\theta\!\left[\sum_{k=0}^{t-1} g_{i_k}(\theta)\right].$$

Here, $i_k$ is the state visited at time $k$, and the notation $E_\theta[\cdot]$ indicates that the expectation is taken with respect to the distribution of the Markov chain with transition probabilities $P_{ij}(\theta)$. Under Assumption 1 (Recurrence), the average reward $\lambda(\theta)$ is well defined for every $\theta$ and does not depend on the initial state. We define the differential reward $v_i(\theta)$ of a state $i \in S$ and the mean recurrence time $E_\theta[T]$ by

$$v_i(\theta) = E_\theta\!\left[\sum_{k=0}^{T-1} \big(g_{i_k}(\theta) - \lambda(\theta)\big) \,\Big|\, i_0 = i\right], \qquad E_\theta[T] = E_\theta[T \mid i_0 = i^*],$$

where $T = \min\{k > 0 \mid i_k = i^*\}$ is the first future time that state $i^*$ is visited. We have that $v_{i^*}(\theta) = 0$.

3 A Gradient-Based Algorithm for Updating $\theta$

Given that our goal is to maximize the average reward $\lambda(\theta)$, it is natural to consider gradient-type methods. Using the following expression for the gradient $\nabla\lambda(\theta)$ of the average reward with respect to $\theta$ (see [MT] for a derivation),

$$\nabla\lambda(\theta) = \sum_{i \in S} \pi_i(\theta) \Big( \nabla g_i(\theta) + \sum_{j \in S} \nabla P_{ij}(\theta)\, v_j(\theta) \Big),$$

where $\pi_i(\theta)$ denotes the steady-state probability of state $i$, we could consider the idealized gradient algorithm

$$\theta_{k+1} = \theta_k + \gamma_k \nabla\lambda(\theta_k).$$

Alternatively, if we could use simulation to produce an unbiased estimate $F_k(\theta)$ of $\nabla\lambda(\theta_k)$, we could employ the approximate gradient iteration

$$\theta_{k+1} = \theta_k + \gamma_k F_k(\theta)$$

to tune $\theta$. Unbiased estimates of the gradient exist [Gly6], but it is not clear how to use them in an algorithm which has the properties we will require in the following.[1] This difficulty is bypassed by the method developed below: we will again employ an estimate of the gradient to update the parameter vector $\theta$, which however is biased. Convergence of the resulting simulation-based method can be established by showing that the bias (asymptotically) vanishes.

3.1 Estimation of $\nabla\lambda(\theta)$

We rewrite the expression for the gradient $\nabla\lambda(\theta)$ as follows (see [MT]):

$$\nabla\lambda(\theta) = \sum_{i \in S} \pi_i(\theta) \Big( \nabla g_i(\theta) + \sum_{j \in S} P_{ij}(\theta) L_{ij}(\theta)\, v_j(\theta) \Big),$$

where $L_{ij}(\theta) = \nabla P_{ij}(\theta)/P_{ij}(\theta)$. This expression suggests the following way to estimate the gradient $\nabla\lambda(\theta)$. Let the parameter vector be fixed to some value $\theta$ and let $\{i_n\}$ be a sample path of the corresponding Markov chain, possibly obtained through simulation. Furthermore, let $t_m$ be the time of the $m$th visit to the recurrent state $i^*$. Consider the estimate of $\nabla\lambda(\theta)$ given by

$$F_m(\theta, \tilde\lambda) = \sum_{n=t_m}^{t_{m+1}-1} \Big( \tilde v_{i_n}(\theta, \tilde\lambda)\, L_{i_{n-1} i_n}(\theta) + \nabla g_{i_n}(\theta) \Big),$$

where, for $i_n = j$,

$$\tilde v_{i_n}(\theta, \tilde\lambda) = \sum_{k=n}^{t_{m+1}-1} \big( g_{i_k}(\theta) - \tilde\lambda \big) \qquad (2)$$

is an estimate of the differential reward $v_j(\theta)$, and $\tilde\lambda$ is some estimate of $\lambda(\theta)$. For $n = t_m$ we set $\tilde v_{i_n}(\theta, \tilde\lambda) = 0$.

Assumption 1 (Recurrence) allows us to employ renewal theory (see for example [Gal5]) to obtain the following result, which states that the expectation of $F_m(\theta, \tilde\lambda)$ is aligned with $\nabla\lambda(\theta)$ to the extent that $\tilde\lambda$ is close to $\lambda(\theta)$ (see [MT, Mar]).

Proposition 1 We have

$$E_\theta\big[F_m(\theta, \tilde\lambda)\big] = E_\theta[T]\, \nabla\lambda(\theta) + G(\theta)\big(\lambda(\theta) - \tilde\lambda\big),$$

where

$$G(\theta) = E_\theta\!\left[\sum_{n=t_m+1}^{t_{m+1}-1} (t_{m+1} - n)\, L_{i_{n-1} i_n}(\theta)\right].$$

[1] [Gly6] develops an algorithm using unbiased estimates of the gradient which updates the parameter vector at visits to the recurrent state $i^*$. It is not known whether this algorithm can be extended so that the parameter $\theta$ gets updated at every time step, a property which we will require based on the discussion in Section 4.
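The estimate $F_m(\theta, \tilde\lambda)$ is easy to compute from a single simulated renewal cycle. The following Python sketch is ours (the paper gives no code) and assumes user-supplied functions `g`, `grad_g` and `L` for $g_i(\theta)$, $\nabla g_i(\theta)$ and $L_{ij}(\theta)$, for instance as in the toy example above.

```python
import numpy as np

def estimate_F(cycle, theta, lam_tilde, grad_g, L, g):
    """Gradient estimate F_m(theta, lam_tilde) from one renewal cycle.

    cycle            : list of states [i_{t_m}, ..., i_{t_{m+1}}], starting and ending at i*
    grad_g(i, theta) : gradient of the one-stage reward g_i(theta), shape (K,)
    L(i, j, theta)   : likelihood-ratio factor grad P_ij(theta) / P_ij(theta), shape (K,)
    g(i, theta)      : one-stage reward g_i(theta)
    """
    K = np.asarray(grad_g(cycle[0], theta)).shape[0]
    F = np.zeros(K)
    # Differential-reward estimates (2): v~_{i_n} = sum_{k=n}^{t_{m+1}-1} (g_{i_k} - lam_tilde),
    # computed as suffix sums over the cycle; v~ = 0 at n = t_m.
    rewards = [g(i, theta) - lam_tilde for i in cycle[:-1]]
    tails = np.cumsum(rewards[::-1])[::-1]
    for n in range(len(cycle) - 1):
        F += grad_g(cycle[n], theta)
        if n > 0:                          # the L-term is skipped at the start of the cycle
            F += tails[n] * L(cycle[n - 1], cycle[n], theta)
    return F
```

In the sense of Proposition 1, averaging many such per-cycle estimates at a fixed $\theta$, with $\tilde\lambda$ close to $\lambda(\theta)$, yields a direction aligned with $\nabla\lambda(\theta)$, scaled by $E_\theta[T]$.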

3.2 An Algorithm that Updates at Visits to the Recurrent State

Using $F_m(\theta, \tilde\lambda)$ as an estimate of the gradient direction, we can formulate the following algorithm which updates at visits to the recurrent state $i^*$. At the time $t_m$ that state $i^*$ is visited for the $m$th time, we have available a current parameter vector $\theta_m$ and an average reward estimate $\tilde\lambda_m$. We then simulate the process according to the transition probabilities $P_{ij}(\theta_m)$ until the next time $t_{m+1}$ that $i^*$ is visited, and update according to

$$\theta_{m+1} = \theta_m + \gamma_m F_m(\theta_m, \tilde\lambda_m), \qquad (3)$$

$$\tilde\lambda_{m+1} = \tilde\lambda_m + \eta\, \gamma_m \sum_{n=t_m}^{t_{m+1}-1} \big( g_{i_n}(\theta_m) - \tilde\lambda_m \big), \qquad (4)$$

where $\eta$ is a positive scalar and $\gamma_m$ is a step size sequence which satisfies the following assumption.

Assumption 3 (Step Size) The step sizes $\gamma_m$ are nonnegative and satisfy

$$\sum_{m=1}^{\infty} \gamma_m = \infty, \qquad \sum_{m=1}^{\infty} \gamma_m^2 < \infty.$$

This assumption is satisfied, for example, if we let $\gamma_m = 1/m$.

Note that the update in (3) is a biased estimate of the gradient direction (see Proposition 1). By showing that the update (4) drives this bias (asymptotically) to zero, we obtain the following convergence result (see [MT, Mar]).

Proposition 2 Let Assumption 1 (Recurrence), Assumption 2 (Regularity), and Assumption 3 (Step Size) hold, and let $\{\theta_m\}$ be the sequence of parameter vectors generated by the algorithm described above. Then, with probability 1, $\lambda(\theta_m)$ converges and

$$\lim_{m \to \infty} \nabla\lambda(\theta_m) = 0.$$
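A compact sketch of the update rules (3)-(4), under our own assumptions about the simulator interface: `step(i, theta)` draws the next state from $P_{i\cdot}(\theta)$, and `F_estimator` is any per-cycle gradient estimate, for instance the `estimate_F` sketch after Section 3.1. This is hypothetical illustration code, not the authors' implementation.

```python
import numpy as np

def optimize(step, F_estimator, i_star, theta0, lam0, grad_g, L, g,
             eta=0.1, num_cycles=1000):
    """Renewal-time updates (3)-(4): theta and lam_tilde change at visits to i*."""
    theta, lam = np.array(theta0, dtype=float), float(lam0)
    for m in range(1, num_cycles + 1):
        gamma = 1.0 / m                        # step sizes satisfying Assumption 3
        # Simulate one renewal cycle i_{t_m}, ..., i_{t_{m+1}} under P(theta);
        # by Assumption 1 the chain returns to i* with probability 1.
        cycle, state = [i_star], step(i_star, theta)
        while state != i_star:
            cycle.append(state)
            state = step(state, theta)
        cycle.append(i_star)
        F = F_estimator(cycle, theta, lam, grad_g, L, g)
        # (4): move the average-reward estimate toward the rewards seen in the cycle.
        lam += eta * gamma * sum(g(i, theta) - lam for i in cycle[:-1])
        # (3): gradient step along the (biased) per-cycle estimate of grad lambda(theta).
        theta = theta + gamma * F
    return theta, lam
```

With the toy chain sketched in Section 2, a call such as `optimize(step, estimate_F, 0, [0.0, 0.0], 0.0, grad_g, L, g)` would, for instance, run the procedure end to end.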
4 Implementation Issues

For systems involving a large state space, the interval between visits to the state $i^*$ can be large. This means that in the algorithm proposed above the parameter vector $\theta$ gets updated only infrequently, and that the estimate $F_m(\theta, \tilde\lambda)$ can have a large variance. In [MT], we have shown how the above algorithm can be extended so that the parameter vector gets updated at every time step. Here, we will in addition consider two approaches to reduce the variance of the updates, which are based on two alternative estimates of the differential reward.

4.1 Reducing the Variance

If the time until we reach the recurrent state $i^*$ from state $i$ is large, it may be desirable to choose a subset $\bar S$ of $S$ containing $i^*$ and to estimate $v_i(\theta)$ through

$$\tilde v^{\bar S}_i(\theta, \tilde\lambda) = \sum_{k=0}^{T_{\bar S}-1} \big( g_{i_k}(\theta) - \tilde\lambda \big), \qquad (5)$$

where $T_{\bar S} = \min\{k > 0 \mid i_k \in \bar S\}$ is the first future time that a state in the set $\bar S$ is visited. Note that it takes fewer time steps to reach the set $\bar S$ than to revisit the recurrent state $i^*$; therefore $\tilde v^{\bar S}_i(\theta, \tilde\lambda)$ typically has a smaller variance than the estimate based on (2). As an alternative approach, we may use a factor $\alpha$, $0 < \alpha < 1$, to discount future rewards. This leads to the following estimate,

$$\tilde v^{\alpha}_i(\theta, \tilde\lambda) = \sum_{k=0}^{T-1} \alpha^k \big( g_{i_k}(\theta) - \tilde\lambda \big), \qquad (6)$$

where $T = \min\{k > 0 \mid i_k = i^*\}$ is the first future time that the state $i^*$ is visited. In Sections 5 and 6 we will use (5) and (6) to derive modified estimates of the gradient $\nabla\lambda(\theta)$.

5 Using the Set $\bar S$ to Reduce the Variance

Let the parameter $\theta \in \mathbb{R}^K$ be fixed to some value and let $(i_0, i_1, i_2, \ldots)$ be a trajectory of the corresponding Markov chain. Let $t_m$ be the time of the $m$th visit to the recurrent state $i^*$. Furthermore, given a set $\bar S \subset S$ containing $i^*$, let $\sigma(m)$ be the number of times between $t_m$ and $t_{m+1}$ that a state in the set $\bar S \setminus \{i^*\}$ is visited, let $t_{m,n}$ be the time of the $n$th such visit, and let $t_{m,0}$ and $t_{m,\sigma(m)+1}$ be equal to $t_m$ and $t_{m+1}$, respectively. Using these definitions, we consider the following estimate $F^{\bar S}_m(\theta, \tilde\lambda)$ of the gradient $\nabla\lambda(\theta)$:

$$F^{\bar S}_m(\theta, \tilde\lambda) = \sum_{k=t_m}^{t_{m+1}-1} \Big( \tilde v^{\bar S}_{i_k}(\theta, \tilde\lambda)\, L_{i_{k-1} i_k}(\theta) + \nabla g_{i_k}(\theta) \Big),$$

where, for $t_{m,n} \le k < t_{m,n+1}$ and $i_k \ne i^*$, we set

$$\tilde v^{\bar S}_{i_k}(\theta, \tilde\lambda) = \sum_{l=k}^{t_{m,n+1}-1} \big( g_{i_l}(\theta) - \tilde\lambda \big),$$

and, for $k = t_m$, we set $\tilde v^{\bar S}_{i_k}(\theta, \tilde\lambda) = 0$. One can show (see [Mar]) that the expectation $E_\theta[F^{\bar S}_m(\theta, \tilde\lambda)]$ is of the same form as the expectation of the original estimate $F_m(\theta, \tilde\lambda)$ in Proposition 1; however, the exact value of the differential reward $v_j(\theta)$ is replaced by the approximation

$$v^{\bar S}_j(\theta) = E_\theta\!\left[\sum_{k=0}^{T_{\bar S}-1} \big( g_{i_k}(\theta) - \lambda(\theta) \big) \,\Big|\, i_0 = j\right],$$

where $T_{\bar S} = \min\{k > 0 \mid i_k \in \bar S\}$.
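A sketch (ours, with the same assumed interface as the earlier sketches) of $F^{\bar S}_m(\theta, \tilde\lambda)$ computed from one renewal cycle: a backward pass builds the truncated differential-reward estimates, restarting the running sum whenever the next state lies in $\bar S$.

```python
import numpy as np

def estimate_F_subset(cycle, S_bar, theta, lam_tilde, grad_g, L, g):
    """Subset-truncated gradient estimate F_m^{S-bar}(theta, lam_tilde) from one cycle.

    cycle : states [i_{t_m}, ..., i_{t_{m+1}}], starting and ending at i*
    S_bar : set of states containing i*; as in (5), the differential-reward
            estimate at time k accumulates g - lam_tilde only until the next
            visit to S_bar, rather than until the next visit to i*
    """
    K = np.asarray(grad_g(cycle[0], theta)).shape[0]
    F = np.zeros(K)
    n = len(cycle) - 1                      # the cycle covers times t_m, ..., t_{m+1}-1
    v = np.zeros(n)                         # truncated estimates; v[0] stays 0 at time t_m
    tail = 0.0
    for k in range(n - 1, 0, -1):           # backward pass over the cycle
        if cycle[k + 1] in S_bar:
            tail = 0.0                      # the running sum restarts at visits to S_bar
        tail += g(cycle[k], theta) - lam_tilde
        v[k] = tail
    for k in range(n):
        F += grad_g(cycle[k], theta)
        if k > 0:
            F += v[k] * L(cycle[k - 1], cycle[k], theta)
    return F
```

Choosing `S_bar = {i_star}` makes the truncation coincide with the renewal times, and the sketch then reduces to the original estimate $F_m(\theta, \tilde\lambda)$.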

Replacing $v_j(\theta)$ by $v^{\bar S}_j(\theta)$ introduces a new bias term $\Delta^{\bar S}(\theta)$ into the estimate of the gradient direction, which is equal to

$$\Delta^{\bar S}(\theta) = E_\theta[T] \sum_{i \in S} \pi_i(\theta) \Big( \sum_{j \in S} \nabla P_{ij}(\theta) \big( v^{\bar S}_j(\theta) - v_j(\theta) \big) \Big).$$

5.1 A Bound on the Bias $\Delta^{\bar S}(\theta)$

To derive an upper bound on $\|\Delta^{\bar S}(\theta)\|$, we use some suitable assumptions on the bias in the estimate of the differential reward, given by

$$\hat v^{\bar S}_i(\theta) = v^{\bar S}_i(\theta) - v_i(\theta).$$

The basic idea is the following. As $\sum_{j \in S} \nabla P_{ij}(\theta) = 0$ for all $i \in S$ and all $\theta \in \mathbb{R}^K$, we only have to know the "relative magnitudes" of the differential rewards to compute $\nabla\lambda(\theta)$; i.e., for every constant $c$ we have

$$\nabla\lambda(\theta) = \sum_{i \in S} \pi_i(\theta) \Big( \nabla g_i(\theta) + \sum_{j \in S} \nabla P_{ij}(\theta) \big( v_j(\theta) - c \big) \Big).$$

Therefore, if $\hat v^{\bar S}_i(\theta) = \hat v^{\bar S}_{i'}(\theta)$ for all $i, i' \in S$, then the bias term $\|\Delta^{\bar S}(\theta)\|$ is equal to 0. In fact, the term $\|\Delta^{\bar S}(\theta)\|$ vanishes under the weaker assumption that for all states $i \in S$ we have $\hat v^{\bar S}_j(\theta) = \hat v^{\bar S}_{j'}(\theta)$ whenever $j, j' \in S_i$, where

$$S_i = \{ j \in S \mid \nabla P_{ij}(\theta) \ne 0 \text{ for some } \theta \in \mathbb{R}^K \}.$$

Now assume that there exists an $\epsilon$ such that, for all states $i \in S$, we have

$$\big| \hat v^{\bar S}_j(\theta) - \hat v^{\bar S}_{j'}(\theta) \big| \le \epsilon \qquad \text{if } j, j' \in S_i.$$

Then, under Assumption 1 (Recurrence) and Assumption 2 (Regularity),

$$\|\Delta^{\bar S}(\theta)\| \le \epsilon\, T_{\max}\, N\, C,$$

where $T_{\max}$ is an upper bound on $E_\theta[T]$, $C$ is a bound on $\|\nabla P_{ij}(\theta)\|$, and $N = \max\{N_i \mid i \in S\}$, where $N_i$ is the number of states in the set $S_i$. Therefore, in order to keep the bias $\Delta^{\bar S}(\theta)$ small, one should choose $\bar S$ such that, for all states $i \in S$ and all states $j, j' \in S_i$, the difference $\hat v^{\bar S}_j(\theta) - \hat v^{\bar S}_{j'}(\theta)$ is small. A proof of this result is given in [Mar].

5.2 An Algorithm that Updates at Visits to the Recurrent State

Consider the following version of the algorithm of Section 3, which updates $\theta$ and $\tilde\lambda$ at visits to the recurrent state $i^*$:

$$\theta_{m+1} = \theta_m + \gamma_m F^{\bar S}_m(\theta_m, \tilde\lambda_m),$$

$$\tilde\lambda_{m+1} = \tilde\lambda_m + \eta\, \gamma_m \sum_{n=t_m}^{t_{m+1}-1} \big( g_{i_n}(\theta_m) - \tilde\lambda_m \big),$$

where we use the estimate $F^{\bar S}_m(\theta_m, \tilde\lambda_m)$ in place of $F_m(\theta_m, \tilde\lambda_m)$. Under Assumption 1 (Recurrence) and Assumption 2 (Regularity), we have, with probability 1,

$$\liminf_{m \to \infty} \|\nabla\lambda(\theta_m)\| \le \frac{D}{T_{\min}},$$

where $D$ is a bound on the bias $\|\Delta^{\bar S}(\theta)\|$ and $T_{\min}$ is a lower bound on $E_\theta[T]$ (see [Mar]). This establishes that if the bias $\|\Delta^{\bar S}(\theta)\|$ tends to be small, then the gradient $\nabla\lambda(\theta_m)$ is small at infinitely many visits to the recurrent state $i^*$. However, it does not say anything about the behavior of the sequence $\lambda(\theta_m)$, or about how we might detect instances at which the average reward $\lambda(\theta_m)$ is high. This result can be strengthened in the following sense (see [Mar]): if the upper bound $D$ on $\|\Delta^{\bar S}(\theta)\|$ is small enough, then $\tilde\lambda_m$ overestimates the average reward $\lambda(\theta_m)$ at most by a little; i.e., there exists a bound $B(D)$ such that

$$\limsup_{m \to \infty} \big( \tilde\lambda_m - \lambda(\theta_m) \big) \le B(D).$$

This implies that $\tilde\lambda_m$ can be used to detect instances where the average reward $\lambda(\theta_m)$ is high. Similarly to [MT], one can derive a version of this algorithm which updates $\theta$ at each time step (see [Mar]).

6 Using a Discount Factor to Reduce the Variance

Let the parameter $\theta \in \mathbb{R}^K$ be fixed to some value and let $(i_0, i_1, i_2, \ldots)$ be a sample path of the corresponding Markov chain. Furthermore, given a discount factor $\alpha \in (0, 1)$, let $t_m$ be the time of the $m$th visit to the recurrent state $i^*$, and consider the following estimate $F^{\alpha}_m(\theta, \tilde\lambda)$ of the gradient $\nabla\lambda(\theta)$:

$$F^{\alpha}_m(\theta, \tilde\lambda) = \sum_{n=t_m}^{t_{m+1}-1} \Big( \tilde v^{\alpha}_{i_n}(\theta, \tilde\lambda)\, L_{i_{n-1} i_n}(\theta) + \nabla g_{i_n}(\theta) \Big),$$

where, for $t_m < n \le t_{m+1} - 1$, we set

$$\tilde v^{\alpha}_{i_n}(\theta, \tilde\lambda) = \sum_{k=n}^{t_{m+1}-1} \alpha^{k-n} \big( g_{i_k}(\theta) - \tilde\lambda \big),$$

and, for $n = t_m$, we set $\tilde v^{\alpha}_{i_n}(\theta, \tilde\lambda) = 0$. As in the previous section, the expectation $E_\theta[F^{\alpha}_m(\theta, \tilde\lambda)]$ is of the same form as the expectation of the original estimate $F_m(\theta, \tilde\lambda)$, except that the exact value of the differential reward $v_j(\theta)$ is replaced by the approximation

$$v^{\alpha}_j(\theta) = E_\theta\!\left[\sum_{k=0}^{T-1} \alpha^k \big( g_{i_k}(\theta) - \lambda(\theta) \big) \,\Big|\, i_0 = j\right],$$

where $T = \min\{k > 0 \mid i_k = i^*\}$.
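A corresponding sketch for the discounted estimate $F^{\alpha}_m(\theta, \tilde\lambda)$, again under our assumed interface; the backward recursion tail $\leftarrow (g - \tilde\lambda) + \alpha \cdot$ tail implements the discounted sum.

```python
import numpy as np

def estimate_F_discounted(cycle, alpha, theta, lam_tilde, grad_g, L, g):
    """Discounted gradient estimate F_m^alpha(theta, lam_tilde) from one cycle.

    cycle : states [i_{t_m}, ..., i_{t_{m+1}}], starting and ending at i*
    alpha : discount factor in (0, 1); terms far in the future contribute
            little, which shortens the effective horizon of the estimate
    """
    K = np.asarray(grad_g(cycle[0], theta)).shape[0]
    F = np.zeros(K)
    n = len(cycle) - 1                      # the cycle covers times t_m, ..., t_{m+1}-1
    v = np.zeros(n)                         # discounted estimates; v[0] stays 0 at time t_m
    tail = 0.0
    for k in range(n - 1, 0, -1):           # backward pass: tail_k = (g_k - lam) + alpha * tail_{k+1}
        tail = (g(cycle[k], theta) - lam_tilde) + alpha * tail
        v[k] = tail
    for k in range(n):
        F += grad_g(cycle[k], theta)
        if k > 0:
            F += v[k] * L(cycle[k - 1], cycle[k], theta)
    return F
```

Taking $\alpha$ close to 1 recovers the original estimate $F_m(\theta, \tilde\lambda)$, while a smaller $\alpha$ shortens the effective horizon (lower variance) at the price of the bias bounded in Section 6.1 below.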
6.1 A Bound on the Bias $\Delta^{\alpha}(\theta)$

To derive a bound on the resulting new bias $\Delta^{\alpha}(\theta)$, which is equal to

$$\Delta^{\alpha}(\theta) = E_\theta[T] \sum_{i \in S} \pi_i(\theta) \Big( \sum_{j \in S} \nabla P_{ij}(\theta) \big( v^{\alpha}_j(\theta) - v_j(\theta) \big) \Big),$$

we use the following assumption. There exist scalars $\rho$, $0 < \rho < 1$, and $A$ such that, for all states $i \in S$, all $\theta$, and all integers $n \ge 0$, we have

$$\Big| \sum_{j \in S} \big( P^n_{ij}(\theta) - \pi_j(\theta) \big)\, g_j(\theta) \Big| \le A \rho^n.$$

We can think of $\rho$ as a "mixing constant" for the Markov reward processes associated with $\bar{\mathcal{P}}$. It can be shown that under Assumption 1 (Recurrence) such constants $A$ and $\rho$ exist (see [Mar]). It then follows that, under Assumption 1 (Recurrence) and Assumption 2 (Regularity), $\|\Delta^{\alpha}(\theta)\|$ is bounded by $A\, T_{\max}\, C\, N$ times a factor that depends only on $\alpha$ and $\rho$, where $T_{\max}$ is an upper bound on $E_\theta[T]$, $C$ is a bound on $\|\nabla P_{ij}(\theta)\|$, and $N$ is the same as in Section 5. This bound on the bias $\Delta^{\alpha}(\theta)$ vanishes as $\rho \downarrow 0$ and as $\alpha \uparrow 1$. A proof, together with the explicit form of the bound, is given in [Mar].

Again, we can use the estimate $F^{\alpha}_m(\theta, \tilde\lambda)$ to derive an algorithm which updates the parameter vector $\theta$ either at visits to the recurrent state $i^*$ or at every time step (see [Mar]). The results of the previous section also hold in this case.

7 Experimental Results

We applied the algorithms proposed above to a problem where the provider of a communication link has to accept or reject incoming calls of several types, while taking into account the current congestion. Calls are charged according to the type they belong to, and the goal of the provider is to maximize the long-term average reward. We consider the following admission control policy, described in terms of a "fuzzy threshold" parameter $\theta(m)$ for each call type $m = 1, \ldots, M$. Given that the currently used link bandwidth is equal to $B$, a new call of type $m$ is accepted with probability

$$\frac{1}{1 + \exp\big(B - \theta(m)\big)}$$

and is otherwise rejected. A detailed description of the case study can be found in [Mar]. In the following, we provide a short summary of the main results for a case study involving three different service types.
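A small illustration of this randomized admission rule (the threshold values below are hypothetical and are not the ones used in the paper's case study):

```python
import numpy as np

rng = np.random.default_rng(0)

def accept_probability(B, theta_m):
    """Probability of accepting a call of type m when the used bandwidth is B.

    theta_m acts as a "fuzzy threshold": for B well below theta_m the call is
    almost surely accepted, for B well above it is almost surely rejected.
    """
    return 1.0 / (1.0 + np.exp(B - theta_m))

def admit(B, theta, call_type):
    """Randomized admission decision for a new call of the given type."""
    return rng.random() < accept_probability(B, theta[call_type])

# Three call types with hypothetical threshold parameters theta(m).
theta = np.array([8.0, 5.0, 3.0])
print(accept_probability(B=6.0, theta_m=theta[0]))   # ~0.88: likely accepted
print(accept_probability(B=6.0, theta_m=theta[2]))   # ~0.05: likely rejected
```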

7.1 Idealized Gradient Algorithm

For this case study, we were able to compute the exact gradient $\nabla\lambda(\theta)$ and to implement the idealized gradient algorithm of Section 3. The corresponding trajectories are given in Figure 1. The resulting average reward is somewhat below the optimal average reward obtained by dynamic programming.

Figure 1: Trajectory of the exact average reward (top) and threshold parameters (bottom) of the idealized gradient algorithm.

7.2 Simulation-Based Optimization

Figures 2, 3, and 4 show the results for the versions of the simulation-based algorithms of Section 3, Section 5 (for the choice of $\bar S$ see [Mar]), and Section 6 (with a fixed discount factor $\alpha$; see [Mar]), respectively, where the parameter $\theta$ gets updated at every time step. We make the following observations.

Figure 2: Trajectory of the average reward (top) and threshold parameters (bottom) of the simulation-based algorithm of Section 3 (the scaling factor for the iteration steps is $10^6$).

Figure 3: Trajectory of the average reward (top) and threshold parameters (bottom) of the simulation-based algorithm of Section 5 (the scaling factor for the iteration steps is $10^5$).

Figure 4: Trajectory of the average reward (top) and threshold parameters (bottom) of the simulation-based algorithm of Section 6 (the scaling factor for the iteration steps is $10^5$).

1. All three algorithms make rapid progress in the beginning, improving the average reward substantially within the first $10^6$ iteration steps for the algorithm of Section 3 and within the first $10^5$ iteration steps for the algorithms of Sections 5 and 6. After this initial phase, the algorithms progress only slowly. This is not unlike the behavior of the idealized algorithm (see Figure 1).

2. The simulation-based algorithm of Section 3 needs on the order of $10^6$ iteration steps to attain its eventual average reward, while the algorithms of Sections 5 and 6 reach comparable average rewards within on the order of $10^5$ iteration steps.

These results are encouraging, as the algorithms with reduced variance speed up the convergence by an order of magnitude, while introducing only a negligible bias.

References

[Ber5a] D. P. Bertsekas. Dynamic Programming and Optimal Control, Vol. I and II. Athena Scientific, Belmont, MA.

[Ber5b] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA.

[CC] X. R. Cao and H. F. Chen. Perturbation Realization, Potentials, and Sensitivity Analysis of Markov Processes. IEEE Transactions on Automatic Control, 42:32–33.

[CR4] E. K. P. Chong and P. J. Ramadge. Stochastic Optimization of Regenerative Systems Using Infinitesimal Perturbation Analysis. IEEE Transactions on Automatic Control, 3:4–4.

[CW] X. R. Cao and Y. W. Wan. Algorithms for Sensitivity Analysis of Markov Systems through Potentials and Perturbation Realization. IEEE Transactions on Control Systems Technology, 6:42–44.

[FH4] M. C. Fu and J.-Q. Hu. Smoothed Perturbation Analysis Derivative Estimation for Markov Chains. Operations Research Letters, 5:24–25.

[FH] M. Fu and J.-Q. Hu. Conditional Monte Carlo: Gradient Estimation and Optimization Applications. Kluwer Academic Publishers, Boston, MA.

[Gal5] R. G. Gallager. Discrete Stochastic Processes. Kluwer Academic Publishers, Boston/Dordrecht/London.

[Gly6] P. W. Glynn. Stochastic Approximation for Monte Carlo Optimization. Proceedings of the Winter Simulation Conference, pages 25–2.

[Gly] P. W. Glynn. Likelihood Ratio Gradient Estimation: An Overview. Proceedings of the Winter Simulation Conference, pages 366–35.

[JSJ5] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems. Advances in Neural Information Processing Systems, pages 35–46.

[Mar] P. Marbach. Simulation-Based Optimization of Markov Decision Processes. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA.

[MT] P. Marbach and J. N. Tsitsiklis. Simulation-Based Optimization of Markov Reward Processes. Technical Report LIDS-P 24, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA.