Strong Points of Weak Convergence: A Study Using RPA Gradient Estimation for Automatic Learning

Felisa J. Vazquez-Abad*
Department of Computer Science and Operations Research
University of Montreal, Montreal, Quebec H3C 3J7
vazquez@iro.umontreal.ca

Revised, March 11, 1998

Abstract. In this paper we focus on the behavior of adaptive control schemes for automatic learning. Estimates of the sensitivities are used in a gradient-based stochastic approximation procedure in order to drive the process along the steepest descent trajectory in search of the optimum. The learning rates are kept constant for adaptability. For such procedures, convergence can be established in a weak sense. We consider a model problem of a flexible machine where the control parameter is a probability vector. We propose a new sensitivity estimator, generalizing the phantom Rare Perturbation Analysis (RPA) estimator to multi-valued decisions. From the basic properties of the estimators, we build several updating rules based on the weak convergence theory to ensure asymptotic optimality. We illustrate the predicted theoretical behavior with computer simulations. Finally, we compare the behavior of our proposed scheme with that of a regenerative one for which we can establish strong convergence. Our results show that weak convergence yields a dramatic improvement in the rate of convergence, in addition to the capability of adaptation, or tracking.

Keywords: Weak Convergence; Gradient Estimation; Rare Perturbation Analysis; Tracking; Automatic Control

* Supported in part by NSERC-Canada grant # WFA and FCAR-Quebec grant # 93-ER-

1 Introduction

The purpose of this paper is to illustrate the typical behavior of learning algorithms using stochastic approximations (SA). In particular, we compare the behavior of SA using constant gains with that using decreasing gains. It is well known, as we shall briefly review, that gradient-based SA can be used as a steepest descent algorithm in search of local minima. When we apply a gradient-based SA to the control parameter of a system that we wish to optimize, the result of the procedure is a (stochastic) adaptive control process. Under some conditions, SA with decreasing gain parameters converges strongly to local optima. The ODE method, introduced by Kushner [15] and used to show a.s. convergence by [2] and [23] among others, can also be used to establish weak convergence of SA algorithms with constant gain [16], [18], [34]. In addition, the weak convergence method establishes verifiable conditions that characterize the ensuing behavior of the control processes [31]. We show in a simple example what each of the different assumptions and properties means and discuss the implications when implementing the method in practice. While strong convergence results are naturally preferable and more reassuring in practice, it is impossible to show strong convergence for a learning algorithm that uses constant gains. If we settle for weak convergence, however, we may be able to track perturbations and adapt to changes in external conditions.

This paper presents explicit implementations of gradient estimation into SA, as well as different updating rules, all of which yield the same asymptotic result for SA under decreasing gains. Our focus is to study the short term behavior of the different implementations. In particular, we illustrate weak convergence of the control processes to an ODE by showing sample trajectories and comparing them with the corresponding ones under decreasing gains, which converge a.s. We also discuss the concept of consistency in the average, a condition on the gradient estimators used to show weak convergence with the ODE method. A subsection is devoted to illustrating time scales, a theoretical device used in the proofs that can also be used to accelerate convergence by choosing the right implementation. Finally, we present the typical behavior of our learning algorithms for tracking perturbations. We believe that the unique strength of the weak convergence approach lies precisely in the type of assumptions and requirements on the algorithms, which allow us to distinguish the behavior of slightly different implementations.

Consider a model where $\theta \in \mathbb{R}^d$ is a control parameter. Suppose that a performance function of the form $F(\theta)$ has been defined, but that we do not have a closed form expression for it in terms of our model. Instead, we are capable of measuring data directly from the physical system, if it evolves in time, or from controlled experiments. Examples of this problem are numerous in applications where the system is subject to uncertainty. A stochastic approximation procedure is a recursive algorithm of the form:

$$\theta_{n+1} = \theta_n + \epsilon_n Y_n \qquad (1)$$

that in some way adjusts the values of the control variable according to the observations $Y_n$ that, we hope, reflect a measure of the sensitivity of the performance as a function of $\theta$.
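To make recursion (1) concrete, here is a minimal sketch in Python of the procedure on a hypothetical scalar example; the quadratic performance function, its maximizer $\theta^* = 0.8$ and the Gaussian noise are illustrative assumptions, not the paper's model.

```python
# Robbins-Monro sketch of recursion (1) on a toy concave F with a
# unique maximum at theta* = 0.8; the noisy gradient measurements
# Y_n play the role of the observations in (1).
import random

def noisy_gradient(theta, theta_star=0.8, noise_sd=0.1):
    # unbiased measurement of dF/dtheta for F(t) = -(t - theta*)^2 / 2
    return -(theta - theta_star) + random.gauss(0.0, noise_sd)

theta = 0.0
for n in range(1, 10001):
    eps_n = 1.0 / n     # gains satisfy (2): sum = inf, sum of squares < inf
    theta += eps_n * noisy_gradient(theta)
print(theta)            # close to 0.8
```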

The work by Robbins and Monro [27] in the 1950's began a fruitful period for the development of control of stochastic systems. They established that such algorithms behave very much like their deterministic counterparts. Suppose that the gradient $\nabla_\theta F(\theta)$ is well defined and that the function $F(\theta)$ is convex, so that there is a unique maximum $\theta^*$. If we can produce a sequence of random variables $\{Y_n\}$ with $E\{Y_n\} = \nabla_\theta F(\theta_n)$ and uniformly bounded variance, then $\theta_n \to \theta^*$, provided that the gain parameters satisfy:

$$\epsilon_n > 0, \qquad \sum_n \epsilon_n = \infty, \qquad \sum_n \epsilon_n^2 < \infty \qquad (2)$$

The condition $\sum_n \epsilon_n^2 < \infty$ can be weakened (see [20]). Intuitively, if we can obtain direct measurements of $\nabla_\theta F(\theta)$ corrupted by noise, then (1) follows (in the average) the direction of increase of the performance function, while the changes in $\theta_n$ become smaller as we approach the optimum.

Until very recently, as we shall mention later on, there were no available methods for measuring derivatives directly from observations of a stochastic process. Kiefer and Wolfowitz [13] proposed the following approach. Use a sequence $c_n \to 0$ to evaluate

$$Y_n = \frac{f(\theta_n + c_n \hat e_k) - f(\theta_n - c_n \hat e_k)}{2 c_n}$$

where $\hat e_k$ is a vector with the $k$-th component set to unity and the remaining ones to zero, and $f(\cdot)$ is a sample or pathwise estimate of the function $F(\cdot)$. We assume that $E[f(\theta)] = F(\theta)$. Their estimators are strongly consistent for the partial derivative $F'_k(\theta)$ w.r.t. $\theta_k$, in the following sense: if we do not perform the updates, but rather only produce the measurements $Y_n(\theta)$, then $\lim_{n\to\infty} E[Y_n(\theta)] = F'_k(\theta)$ a.s. It is generally the case that good estimation is achieved as more samples are used in the construction of each estimate $Y_n$. This is commonly performed as follows. One simulates or observes the system for a period of time $T_n$ during which the control parameter is kept fixed at the value $\theta_n$. At the end of the estimation interval, the parameter is updated using (1). It is often necessary that $T_n \to \infty$ in order to achieve strong consistency of the estimators. Under these conditions, the procedure converges strongly to $\theta^*$. In [15] the basic Robbins-Monro procedure is generalized, considering problems subject to constraints and using a truncated version of (1).

In the context of stochastic optimization, much progress has been achieved in the past decade, with the increasing interest in constructing single path gradient estimators. Such are the Likelihood Ratio (LR) method of [26] and [9], the Perturbation Analysis methods (IPA and SPA) of [10], [8] and, more recently, the Rare Perturbation Analysis (RPA) of [4], [29], the Harmonic Gradient (HG) of [12], the Weak Derivative methods of [25] and the Simultaneous Perturbations (SPSA) of [28]. While many of these gradient estimators need observations of the system as well as detailed information about the system's dynamics, some of them can be applied to specific problems in a very robust manner. Notably, SPSA, HG and RPA can be used without assuming knowledge of the system's dynamics.
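A similar sketch of the Kiefer-Wolfowitz scheme just described, assuming a synthetic pathwise estimate $f(\theta) = F(\theta) + \text{noise}$ for a hypothetical concave $F$; in the paper, $f(\cdot)$ would be measured from the system rather than computed from a formula.

```python
# Kiefer-Wolfowitz finite differences with c_n -> 0; f() is a noisy
# sample of a hypothetical F(theta) = -(theta - 0.8)^2.
import random

def f(theta):
    return -(theta - 0.8) ** 2 + random.gauss(0.0, 0.05)

theta = 0.0
for n in range(1, 20001):
    c_n = n ** -0.25                                  # c_n -> 0 slowly
    eps_n = 1.0 / n
    Y_n = (f(theta + c_n) - f(theta - c_n)) / (2.0 * c_n)
    theta += eps_n * Y_n                              # update (1)
print(theta)                                          # close to 0.8
```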

In our examples we shall discuss one implementation of RPA that requires only observation of the states along a single trajectory. Most of the work has focused on showing strong convergence for simulation optimization of stochastic DES, that is, $\theta_n \to \theta^*$ a.s., as in [1], [5], [6], [7], [11], [21] and [25], among others. Their methods of proof refer to the general ODE method in [15], [23], [20], typically under the requirement (2).

Applications of automatic learning in telecommunications, flexible manufacturing systems (FMS), surveillance policies and dynamic resource allocation require algorithms capable of performing adaptive control under possibly changing environments. Such would be the case, for example, when the call patterns in a telecommunications network change, or when machines fail or their processing capabilities deteriorate in an FMS. The implementation of a learning scheme differs from the common procedure for simulation optimization in that the gain parameters do not decrease, but are kept constant, using:

$$\theta_{n+1} = \theta_n + \epsilon Y_n \qquad (3)$$

The requirement that the gain parameter (or "learning rate") does not decrease is essential for the algorithm to be able to adjust and track the optimal control when the underlying processes vary their behavior. While this is desirable for on-line optimization, strong convergence to the optimum cannot be achieved. An alternative approach has been proposed (see [16], [19], [17], [18], [31], [34]) in which convergence is established in a weak sense. Further, [24] has studied the stationary solution of such control processes.

We shall study the behavior of (3) when estimates of the sensitivities are used in a gradient-based stochastic approximation procedure. The weak convergence method establishes the conditions under which (3) drives the process along the steepest descent trajectory in search of the optimum. The interpretation of the results of the method is intuitively appealing, but differs considerably from statements such as $\theta_n \to \theta^*$ a.s. Instead of studying a sequence of random variables, the method focuses on the stochastic process defined by the time-varying control parameter. The notions of convergence of a process are, naturally, related to the topology introduced in the appropriate functional space. Roughly speaking, the method establishes the conditions under which the random trajectories approach, as $\epsilon \to 0$, the deterministic trajectories of a companion ODE, the asymptotes of which are its stable points. Therefore, establishing asymptotic optimality of the learning scheme requires identifying such stability points as the optimal control values of the original problem.
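The tracking ability that motivates (3) can be seen in a small sketch: when the optimum shifts, the constant-gain recursion keeps adapting while the decreasing-gain recursion has essentially frozen. The drifting target and the noise model are assumptions made for illustration only.

```python
# Constant gains (3) versus decreasing gains (1) under a change of
# environment: the target of the toy gradient jumps halfway through.
import random

def grad(theta, target):
    return -(theta - target) + random.gauss(0.0, 0.05)

th_const, th_decr = 0.0, 0.0
for n in range(1, 50001):
    target = 0.8 if n > 25000 else 0.5              # abrupt change of optimum
    th_const += 0.01 * grad(th_const, target)       # recursion (3)
    th_decr += (1.0 / n) * grad(th_decr, target)    # recursion (1)
print(th_const, th_decr)   # th_const tracks 0.8; th_decr lags near 0.5
```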

Our model example is the following. A single server queue has $V \geq 2$ possible settings or "speeds" under which the service distributions differ. The control parameter is the probability vector determining the fraction of customers served at each setting. The performance measure considers the minimization of the stationary average waiting time as well as the operation costs. Ours is an example of the canonical model in [5], [7] and [18], who implement IPA derivative estimators. In our model, however, the control variable is not scalar, the constraint set requires $\theta$ to satisfy the law of total probability, and IPA is not applicable.

The aim of this paper is to illustrate how the weak convergence methodology can be used in practice in order to construct learning algorithms. In so doing, we extend the notion of sensitivity estimation in terms of generalized gradients that take into account the Karush-Kuhn-Tucker (KKT) conditions for optimality. Although truncated algorithms are shown to converge in the limit ($\epsilon \to 0$ for (3) or $n \to \infty$ for (1)), the actual implementations are constructed with positive gain parameters. In our experience, truncations may yield bad behavior when one or more components of $\theta$ reach zero, which may be absorbing points of the procedure, even if far from the actual optimum. To solve this practical problem, we introduce the generalized gradients, which make truncations of (3) unnecessary. We construct a new estimator for such sensitivities, generalizing the phantom RPA method to the case of multi-valued decisions, and discuss two implementations of our estimators. The predicted theoretical behavior of the updating schemes is illustrated via computer simulations using the regenerative and non-reset versions of the estimators. Finally, we compare the behavior of our proposed scheme with that of a regenerative one for which we can establish strong convergence.

In Section 2, we present a brief review of the weak convergence method, emphasizing the ODE method. We state the main assumptions on the model and sensitivity estimators in the context of optimization of stochastic DES and the main results for strong and weak convergence. In Section 3 we present the model example. Section 4 deals with the construction of the "target" ODE that governs the limiting behavior of the learning scheme. Even if the performance measure is unknown in analytical form, the target ODE is constructed such that its stable points are local minima. In Section 5 we develop the gradient estimators and in Section 6 we describe the ensuing learning algorithms. Section 7 presents the empirical study of the behavior of the algorithms (3) and (1) under different implementations.

2 Weak Convergence Review

We shall briefly present the basic concepts and notation of the weak convergence theory for stochastic approximation. For details on the model and methods of proof, we refer to [16], [18], [31], [19]. We strongly recommend the authoritative reference [20].

2.1 Assumptions

We shall now state the assumptions for strong and weak convergence for a Discrete Event Driven System (DES) under control. Let $\theta \in \mathbb{R}^d$ be the control variable. We consider a stochastic process $\{\tilde\xi_t(\theta); t \geq 0\}$ whose evolution is determined by a sequence of "events", happening at random times $\{T_i; i \geq 0\}$.

We assume that $T_i$ is measurable with respect to the history or $\sigma$-algebra generated by the process, so that $\{T_i \leq t\} \in \mathcal F(t)$. Calling $\xi_i(\theta)$ the state of the embedded process plus the residual clocks at time $T_i$, the sequence $\{\xi_i(\theta)\}$ has a Markovian structure and is generally known as a Generalized Semi-Markov Process (see [8]). We shall use $\mathcal F_i(\theta)$ to denote the $\sigma$-algebra generated by $\{\xi_j(\theta); j \leq i\}$. We shall assume that for some compact set $\Theta$, the process $\{\xi_i(\theta)\}$ regenerates and has a unique, ergodic invariant measure $\tilde\mu_\theta(dx)$. The performance measure of interest is of the form:

$$F(\theta) = \int f(x)\, \tilde\mu_\theta(dx) = \lim_{M\to\infty} \frac{1}{M} \sum_{i=1}^{M} f[\xi_i(\theta)] \qquad (4)$$

We shall also assume that $F$ is continuously differentiable. Before proceeding, we introduce two useful definitions.

Definition: A sequence of measures $\{\mu_u(\cdot); u \in U\}$ defined on a common, complete, separable state space $S$ is said to be tight if for every $\delta > 0$ there exists a compact set $K_\delta \subset S$ such that $\mu_u(K_\delta) > 1 - \delta$ for all $u \in U$. Accordingly, we say that a sequence of random variables $\{\zeta_i; i = 1, 2, \dots\}$ defined on a common, complete, separable state space $S$ is tight if for every $\delta > 0$ there exists a compact set $K_\delta \subset S$ such that $P\{\zeta_i \notin K_\delta\} \leq \delta$ for all $i$. Tightness is the equivalent of compactness in the sense that if a set of measures is tight, then every sequence has a further weakly convergent subsequence, and the limit is a well defined probability measure (see [3]).

Definition: We say that a sequence of random variables $\{Y_n; n = 0, 1, \dots\}$ defined on a common, complete, separable state space $S$ is uniformly integrable if for every $\delta > 0$ there exists a constant $K < \infty$ such that:

$$\sup_n \int_{|y| > K} |y|\, P\{Y_n \in dy\} \leq \delta$$

Uniform integrability follows whenever $\mathrm{Var}(Y_n)$ is uniformly bounded in $n$ (see [3]).

A pathwise derivative estimator is an estimator constructed from the observations or measurements of the system over a finite horizon, which we shall call the estimation interval; we denote its length by $\tau_m$. Consider the model for a fixed value of the control at $\theta$, where no updates take place, but where we compute a pathwise derivative estimator using the observations contained in the $n$-th estimation interval, that is, the observations with indices $i$ satisfying:

$$\sum_{m=0}^{n} \tau_m(\theta) < i \leq \sum_{m=0}^{n+1} \tau_m(\theta)$$

thus obtaining a sequence of estimates. A detailed analysis of weak convergence for general random estimation intervals is in [31]. In the present work, we shall focus only on the commonly used intervals, namely a constant number $M$ of events of the process $\{\xi_i(\theta)\}$ or a constant number $M$ of regenerative cycles of the process $\{\xi_i(\theta)\}$.

Thus, both the estimation intervals as well as the estimators are functions of the past state values. In particular, the derivative estimator $Y_n(M, \theta)$ is constructed via sample averages of quantities that depend on previous state values. In other words, there exists an $\mathcal F_i(\theta)$-measurable process $Z_i(\theta)$ that accounts for the necessary bookkeeping. Call $\chi_i(\theta) = (\xi_i(\theta), Z_i(\theta))$ the enlarged state that describes the process and the derivative estimation. Notice that the $\sigma$-algebra generated by $\{\chi_j(\theta); j \leq i\}$ is $\mathcal F_i(\theta)$. Then the event $\{\tau_n(\theta) > i\} \in \mathcal F_i(\theta)$.

Assumption 1 Consider the fixed control process $\{\chi_i(\theta)\}$, for $\theta \in \Theta$. We assume that $P_\theta(\chi, B) = P\{\chi_{i+1}(\theta) \in B \mid \chi_i(\theta) = \chi\}$ is weakly continuous in $(\theta, \chi)$ and that $\{\chi_i(\theta)\}$ possesses a unique invariant, ergodic measure $\mu_\theta(d\chi)$. In addition, we assume that the set of measures $\{\mu_\theta(\cdot); \theta \in A\}$ is tight for every compact set $A \subset \Theta$.

Recall that we assumed that $\{\xi_i(\theta)\}$ has a unique, ergodic invariant measure. If $\{Z_i(\theta)\}$ are tight, then every sequence has a further weakly convergent subsequence. Since $Z_i(\theta)$ is a function of the states $\{\xi_j(\theta); j \leq i\}$ and these possess a unique limiting measure, it follows that the limiting measure of the process $\chi_i(\theta)$ is unique and uniquely determined by $\tilde\mu_\theta(\cdot)$.

Definition: We say that $Y_n(M, \theta)$ is a strongly consistent estimator of $G(\theta)$ if for any $n$:

$$\lim_{M\to\infty} Y_n(M, \theta) = G(\theta) \quad \text{a.s.}$$

We say that $Y_n(M, \theta)$ is consistent in the average sense for $G(\theta)$ if for any $M$:

$$\lim_{m\to\infty} E\left\{\frac{1}{m} \sum_{n=0}^{m-1} Y_n(M, \theta)\right\} = G(\theta)$$

We now proceed to the description of the model when (3) is used to update the values of $\theta$ at the end of the estimation intervals. A more general description can be found in [31], where a more complex system with asynchronous controllers is studied. In that reference, the updating intervals in a decentralized operation do not necessarily coincide with the estimation intervals of the processors that carry out the estimations. When the control variable changes this way, the process $\chi_i^\epsilon$ evolves according to the dynamics of the fixed control process at value $\theta_n$ over the $n$-th estimation interval of corresponding length $\tau_n$, at the end of which the estimator $Y_n^\epsilon(M)$ is obtained and $\theta_n$ is updated according to (3). Let $t_n = \sum_{m=0}^{n} \tau_m$; we denote $\tilde\theta_i = \theta_n$ for $i = t_n + 1, \dots, t_{n+1}$. This variable keeps the actual value of the control parameter as the process evolves. We denote by $\mathcal F_i^\epsilon$ the $\sigma$-algebra generated by $\{\chi_j^\epsilon, \tilde\theta_j; j \leq i\}$. From the Markovian structure, it follows that $P\{\chi_{i+1}^\epsilon \in B \mid \mathcal F_i^\epsilon\} = P\{\chi_{i+1}^\epsilon \in B \mid \chi_i^\epsilon, \tilde\theta_i\}$ and therefore the process $\{(\chi_i^\epsilon, \tilde\theta_i)\}$ is a Markov Decision Process (MDP) with general state space.

Assumption 2 The random variables $\{Y_n^\epsilon(M)\}$ and $\{\tau_n^\epsilon\}$ are uniformly integrable.

Assumption 3 The sequence $\{(\chi_i^\epsilon, \tilde\theta_i); i = 1, 2, \dots; \epsilon > 0\}$ is tight.

2.2 The Main Results

The main results on strong convergence have been treated in detail in [7] and [5], [21] when IPA derivative estimators are used. The following version of the result considers the most commonly used assumptions on the estimation. We shall apply this result in Section 7.

Theorem 1 Suppose that $\{Y_n(\theta)\}$ are independent and $E[Y_n(\theta) \mid \theta_n = \theta] = \nabla_\theta F(\theta) + \beta_n(\theta)$; that $\sup_n E|Y_n(\theta)|^2 < \infty$, and that $F(\theta)$ is locally convex and has a unique maximum $\theta^* \in \Theta$. Construct $Y_n$ as an estimator using the information available within the $n$-th estimation interval, where the control variable is kept at the value $\theta_n$. Update this value at the end of the estimation interval using (1). In order to ensure that $\theta_n \in \Theta$ with probability 1, use a truncation if necessary. Assume that (2) is satisfied and that $\sum_n \epsilon_n |\beta_n| < \infty$ a.s. Then $\theta_n \to \theta^*$ with probability 1.

This model is known as the "martingale difference noise" model for $Y_n$. In many cases, when the estimators are strongly consistent, their variance decreases as $1/M$. A common approach is to use increasing update intervals $M(n) \sim n$ and $\epsilon_n \sim 1/n$.

As mentioned before, weak convergence is established in the sense of convergence in distribution of the stochastic process defined by the time-varying control parameters. Following [31], we introduce the following control processes:

Definition: The ladder interpolation process $\vartheta^\epsilon(t)$ is defined by:

$$\vartheta^\epsilon(t) = \theta_n, \qquad t \in [n\epsilon, (n+1)\epsilon)$$

and the natural interpolation process by:

$$\tilde\vartheta^\epsilon(t) = \tilde\theta_i, \qquad t \in [i\epsilon, (i+1)\epsilon)$$

As we shall see, these processes reflect the common descriptions of learning algorithms, related to the iteration number (that is, in terms of the updates performed) and the actual time (in terms of the event scale of the process).

It is customary to present results of convergence as a function of the number of iterations performed. This would be related to the behavior of $\vartheta^\epsilon(\cdot)$. If an updating scheme follows a number $M$ of regenerative cycles to construct the estimators, then the length of the updating intervals depends on the control values themselves and it may be more realistic to study the behavior of $\tilde\vartheta^\epsilon(\cdot)$. Naturally, in the case of (1), where $\epsilon_n \to 0$ and $M(n) \to \infty$, the natural interpolation process would stretch the time scale of the corresponding ladder interpolation process. In applications of on-line control, we are interested in the behavior and capability of tracking in real time rather than in iteration number. In Section 7 we shall illustrate the various time scales of interest with computer simulations.

Assumption 4 Suppose that the functions:

$$G(\theta) = \lim_{m\to\infty} \frac{1}{m} \sum_{n=0}^{m-1} E[Y_n(M, \theta)], \qquad \bar\tau(\theta) = \lim_{m\to\infty} \frac{1}{m} \sum_{n=0}^{m-1} E[\tau_n(\theta)]$$

are continuous and $\inf_\theta \bar\tau(\theta) > 0$.

The following result summarizes the weak convergence approach and follows from [16], [18] and [31].

Theorem 2 Under Assumptions 1 to 4, if $Y_n^\epsilon$ is the estimator calculated within the interval $t_n < i \leq t_{n+1}$ and (3) is used to update $\theta_n$ (using a truncation to ensure $\theta_n \in \Theta$ if necessary), then $\vartheta^\epsilon(\cdot)$ converges in distribution to the deterministic solution of the ODE:

$$\frac{d\vartheta(t)}{dt} = G[\vartheta(t)] \qquad (5)$$

and $\tilde\vartheta^\epsilon(\cdot) \Rightarrow \tilde\vartheta(\cdot)$, where $\tilde\vartheta[\alpha(t)] = \vartheta(t)$ and $d\alpha(t)/dt = \bar\tau[\vartheta(t)]$. If, in addition, $\theta^*$ is the only stable point of (5), then $\lim_{t\to\infty} \vartheta(t) = \lim_{t\to\infty} \tilde\vartheta(t) = \theta^*$.

In general, $\bar\tau(\theta) > 1$; for example, $\bar\tau(\theta) = M$ or $M$ times the average number of customers in a cycle, in the case of finite horizon or regenerative estimation, respectively. That is, the time scale of $\tilde\vartheta(\cdot)$ is stretched, making its evolution slower. Recent research has considered the long term behavior of such SA procedures, analyzing the limiting stationary control process [24].

3 A Flexible Machine

We consider a system in which a machine processes items at different speeds $v_k$, $k = 1, \dots, V$. Items arrive at the machine following a renewal process $N(t)$. Let $\sigma_i(\theta)$ be the speed chosen by the machine for the $i$-th item processed. The service times of consecutive customers $\{S_i\}$ are independent random variables with $E[S_i \mid \sigma_i(\theta) = v_k] = \mu_k^{-1}$ and $E[S_i^2 \mid \sigma_i(\theta) = v_k] = \varsigma_k^2$.

The associated operating cost per unit time is $c_k$. We assume $\mu_k < \mu_j$ and $c_k < c_j$ if $k < j$; thus faster modes of operation are more costly. Although we may know which speeds are faster in the average, in practice we may not know the distributions of the consecutive service times. The problem is to select the speeds of operation of the machine that minimize the waiting time at the lowest cost. The simplest strategy is the randomized strategy, in which $P\{\sigma_i(\theta) = v_k\} = \theta_k$. We assume that $\lambda < \min_k \mu_k$, ensuring stability of the process for all $\theta \in \Theta = \{\theta \in [0,1]^V : \sum_{k=1}^{V} \theta_k = 1\}$. Therefore, the limiting probabilities exist, the process is ergodic, and the invariant measure is unique for each $\theta$.

Calling $\theta$ the vector $(\theta_k; k = 1, \dots, V)$, we seek to optimize the performance function $F(\theta) = W(\theta) + C(\theta)$, where $W(\theta)$ is the stationary average waiting time in the system and $C(\theta)$ is the mean stationary cost per item resulting from the use of the machine. To find the expression for $C(\theta) = \lim_{t\to\infty} C(t)/D(t)$, where $C(t)$ is the cumulative operation cost up to time $t$, we define $D(t)$ as the departure process, $\tau_N$ the time of the $N$-th service completion, and $T_k(N)$ as the total time the machine operates at speed $v_k$, having processed $N$ items. Conditioning on the speed chosen for each of the $N$ items, at time $\tau_N$ we have:

$$E[C(\tau_N)] = \sum_{k=1}^{V} c_k\, E[T_k(N)] = N \sum_{k=1}^{V} \frac{c_k\, \theta_k}{\mu_k}$$

Under stability of the process, the arrival rate is the same as the departure rate, so $\lim_{N\to\infty} (N/\tau_N) = \lim_{t\to\infty} (D(t)/t) = \lambda$ a.s., thus:

$$C(\theta) = \lim_{t\to\infty} \frac{E[C(t)]}{E[D(t)]} = \sum_{k=1}^{V} \frac{c_k\, \theta_k}{\mu_k} \qquad (6)$$

From the structure of the model, $F(\theta)$ is continuously differentiable. This problem may be stated as a minimization problem of the function $F(\theta)$ under the feasibility constraints:

Minimize $F(\theta) = W(\theta) + C(\theta)$ subject to $h(\theta) = \sum_{i=1}^{V} \theta_i = 1$ and $\theta_k \geq 0$.
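Under this reading ($W(\theta)$ the stationary mean waiting time, $C(\theta)$ the mean operating cost per item), the flexible machine is easy to simulate with Lindley's recursion. The sketch below borrows the parameters of the numerical example of Section 7 and is an illustration of the model only, not of the gradient estimators.

```python
# Simulation sketch of the flexible machine: Poisson arrivals,
# V = 2 speeds with uniform service laws, randomized speed choice
# P{sigma_i = v_k} = theta_k; returns a sample estimate of
# F(theta) = W(theta) + C(theta).
import random

lam = 0.028
speeds = [dict(a=33.0, b=38.0, c=5.0),     # v_1: slow and cheap
          dict(a=3.0,  b=7.0,  c=145.0)]   # v_2: fast and costly

def estimate_F(theta1, n_customers=200_000, seed=1):
    rng = random.Random(seed)
    W_sum, cost_sum, R = 0.0, 0.0, 0.0
    for _ in range(n_customers):
        A = rng.expovariate(lam)                 # interarrival time
        k = 0 if rng.random() < theta1 else 1    # randomized speed choice
        s = speeds[k]
        S = rng.uniform(s['a'], s['b'])          # service time at the chosen speed
        W = max(0.0, R - A)                      # Lindley's recursion: waiting time
        R = W + S                                # time in system, carried forward
        W_sum += W
        cost_sum += s['c'] * S                   # operating cost of this item
    return (W_sum + cost_sum) / n_customers      # W(theta) + C(theta)

print(estimate_F(0.812))   # roughly 367, cf. F(theta*) = 367.01 in Section 7
```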

4 Limiting ODE

In this section we shall build the "target" ODE and show that its stable points are local minima. Our arguments are general for optimization under randomized multi-valued decisions. Define $F'_k(\theta) = \partial F(\theta)/\partial \theta_k$. Letting $\theta^*$ be the vector of optimal values, the Karush-Kuhn-Tucker (KKT) theory tells us there must exist a real number $u$ such that:

1) $F'_k(\theta^*) - u\, h'_k(\theta^*) = 0$ if $\theta^*_k > 0$, and $\geq 0$ if $\theta^*_k = 0$, for each $k$;
2) $h(\theta^*) = 1$;
3) $\theta^*_k \geq 0$ for $k = 1, \dots, V$.

Since $h(\theta^*) = 1$, the first condition tells us that either $\theta^*_k = 0$ or $F'_k(\theta^*) = u$, which may be synthesized as $u = \sum_{k=1}^{V} \theta^*_k F'_k(\theta^*)$. Call $G_k$ the generalized gradient operator defined by:

$$G_k[F(\theta)] = \frac{\partial F(\theta)}{\partial \theta_k} - \sum_{j=1}^{V} \theta_j\, \frac{\partial F(\theta)}{\partial \theta_j} \qquad (7)$$

Then, a value of $\theta$ such that $\sum_{k=1}^{V} \theta_k = 1$ and

$$G_k[F(\theta)] = 0 \ \text{ if } \theta_k > 0, \qquad G_k[F(\theta)] \geq 0 \ \text{ if } \theta_k = 0, \qquad k = 1, \dots, V \qquad (8)$$

satisfies the KKT conditions. Under convexity of $F(\theta)$, if $\theta$ satisfies (8), then $\theta = \theta^*$. Consider now the following target ODE:

$$\frac{d\vartheta_k(t)}{dt} = -\vartheta_k(t)\, G_k[F(\vartheta(t))] \qquad (9)$$

and notice that no truncation is necessary: if $\vartheta(0) \in \Theta$, then for any $t$, $\vartheta(t) \in \Theta$, which follows by adding (9) over $k$ and using the definition (7) of the generalized gradient, together with $\vartheta(t) > 0$.

Lemma 1 If the starting point $\vartheta(0) \in \Theta$ is such that $\vartheta_k(0) > 0$ for each $k = 1, \dots, V$ and it is not a local maximum of $F(\theta)$, then the stable points of (9) are local minima. If the function $F(\theta)$ has a unique minimum, then (9) has an asymptote at $\theta^* = \arg\min_{\theta \in \Theta} F(\theta)$.

Proof: We shall verify two conditions, namely, that the cost is non-increasing along the trajectory of the ODE, and that the stable points of the ODE are KKT points. The first condition is easily verified, since

$$\frac{d}{dt} F[\vartheta(t)] = \sum_{k=1}^{V} F'_k[\vartheta(t)]\, \frac{d\vartheta_k(t)}{dt} = -\sum_{k=1}^{V} \vartheta_k(t)\, \left[ D_k(t) - \bar D(t) \right]^2 \leq 0 \qquad (10)$$

where we have used the fact that (10) is the negative of a variance, with $D_k(t) = F'_k[\vartheta(t)]$ and $\bar D(t) = \sum_{k=1}^{V} \vartheta_k(t)\, D_k(t)$. Since $F$ is bounded below by zero (costs are non-negative) and is non-increasing along the trajectory, $\lim_{t\to\infty} dF[\vartheta(t)]/dt = 0$ and $F[\vartheta(t)]$ converges to a value $\bar F$, possibly depending upon the initial value $\vartheta(0)$.

The second condition stems from the fact that, from (10), if $\bar\theta = \lim_{t\to\infty} \vartheta(t)$ is any limit of the ODE, then $dF[\bar\theta]/dt = 0$ and each term of the sum vanishes. This implies, for each component, that either $\bar\theta_k = 0$ or $\lim_{t\to\infty} D_k(t) = \lim_{t\to\infty} \bar D(t)$, which may be rewritten as $G_k[F(\bar\theta)] = 0$. In order to verify (8), it suffices to show now that if $\bar\theta_k = 0$ then the corresponding term satisfies $G_k[F(\bar\theta)] \geq 0$. By continuity of the generalized gradient, $G_k[F(\bar\theta)] = \lim_{t\to\infty} G_k[F(\vartheta(t))]$. Using (9), since the limit point is $\bar\theta_k = 0$, this component must decrease as time increases, thus $d\vartheta_k(t)/dt \leq 0$ for large enough $t$, implying that $\lim_{t\to\infty} G_k[F(\vartheta(t))] \geq 0$. The limiting ODE will therefore have a limit point that satisfies the KKT conditions and is a local minimum. If $F$ has a unique minimum, then the gradient driven process (9) is asymptotically optimal in the sense that the trajectories of $\vartheta(t)$ approach the optimal value $\theta^*$ as $t \to \infty$.
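A short numerical sketch of the target ODE (9): with a hypothetical smooth cost (a quadratic stand-in, not the queueing performance), an Euler discretization exhibits the two properties used above, namely that $\sum_k \vartheta_k(t) = 1$ is preserved without truncation and that the flow settles at a KKT point.

```python
# Euler integration of the target ODE (9) on the simplex for a
# stand-in cost F(th) = sum_k (th_k - target_k)^2 with minimizer
# 'target' on the simplex (an illustrative assumption).
V = 3
target = [0.2, 0.5, 0.3]

def grad_F(th):
    return [2.0 * (th[k] - target[k]) for k in range(V)]

th = [1.0 / V] * V
dt = 0.01
for _ in range(20_000):
    g = grad_F(th)
    gbar = sum(th[k] * g[k] for k in range(V))           # sum_j th_j F'_j
    G = [g[k] - gbar for k in range(V)]                  # generalized gradient (7)
    th = [th[k] - dt * th[k] * G[k] for k in range(V)]   # Euler step of (9)
print(th, sum(th))   # th near target; sum(th) stays 1 up to roundoff
```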

5 The RPA Gradient Estimators

We shall now propose two phantom RPA estimators, generalizing [4] and [29] to the case of multi-valued decisions. If we do not know the exact values of $\mu_k$, the gradient of $C(\theta)$ can be strongly consistently estimated via sample averages. Since $F(\theta) = W(\theta) + C(\theta)$, it suffices to build estimators of the generalized gradient in (9) for the waiting time. Call:

$$\bar W(\theta) = E\left[\sum_{i=1}^{N_{bp}(\theta)} W_i\right], \qquad \bar N(\theta) = E[N_{bp}] \qquad (11)$$

where $N_{bp}$ is the number of customers in one busy period of the process at control value $\theta$. Then $W(\theta) = \bar W(\theta)/\bar N(\theta)$.

5.1 Parallel Phantom Systems

Our model can be stated according to the framework of Section 2, as follows. Call $A_i$ the interarrival time between customers $i$ and $i+1$, and let $\{u_i\}$ be a sequence of i.i.d. uniform variates. The service of customer $i$ is

$$S_i(u_i, \sigma_i) = \sum_{k=1}^{V} G_k^{-1}(u_i)\, 1_{\{\sigma_i = v_k\}}$$

where $G_k(\cdot)$ is the service distribution of speed $v_k$. The discrete event process is described via Lindley's equations. If $R_i$ is the total time a customer remains in the system and $W_i$ its waiting time, then given $R_i$, Lindley's equations yield:

$$W_{i+1} = \max(0, R_i - A_i), \qquad R_{i+1} = S_{i+1}(u_{i+1}, \sigma_{i+1}) + W_{i+1}$$

The above equations hold regardless of the way we choose the speeds of the machine, since we are using the information on the decisions themselves. We shall now use the notation $\sigma_i(0)$ for the decisions of the nominal system, taken according to $P\{\sigma_i(0) = v_k\} = \theta_k$, independently of $\{(A_j, u_j); j \leq i\}$.

Fix the index $k$ to estimate a sensitivity, let $\Delta > 0$ be a small number, and let $\tilde\Delta \in \mathbb{R}^V$ be a vector with the value $\Delta$ as its $k$-th component. The other components satisfy $\tilde\Delta_l = -\Delta\, p_l$, $p_l > 0$, $l \neq k$, and shall be determined later, according to the two gradient estimators that we shall describe. Given $N$ customers, we define a parallel phantom system using the same sequence $(A_i, u_i)$ and assigning the decision $\tilde\sigma_i$ of the $i$-th customer as follows:

$$P\{\tilde\sigma_i = v_l \mid \sigma_i(0) = v_l\} = 1, \quad l \neq k$$
$$P\{\tilde\sigma_i = v_k \mid \sigma_i(0) = v_k\} = 1 - \Delta/\theta_k$$
$$P\{\tilde\sigma_i = v_l \mid \sigma_i(0) = v_k\} = \Delta\, p_l/\theta_k, \quad l \neq k$$

Then the decisions of the phantom system satisfy $P\{\tilde\sigma_i = v_l\} = \theta_l - \tilde\Delta_l$ for all $l = 1, \dots, V$, and $\tilde\sigma$ has a distribution according to $\tilde\theta = \theta - \tilde\Delta$. Clearly, the evolution of such a system can be evaluated in parallel to the nominal system using Lindley's equations. For any sequence $\sigma = \{\sigma_i\}$ of decisions, we define the cumulative waiting time over the first $M$ customers as $\varphi_M(\sigma)$. We shall be interested in two cases: $M$ a fixed, deterministic number, and $M = N(\sigma)$, the number of customers in one busy period. To distinguish the two cases we use the notation:

$$\varphi_M(\sigma) = \sum_{i=1}^{M} W_i, \qquad \varphi_N(\sigma) = \sum_{i=1}^{N(\sigma)} W_i \qquad (12)$$

The finite differences are defined by:

$$D_\Delta(M) = \varphi_M(\sigma(0)) - \varphi_M(\tilde\sigma), \qquad D_\Delta(N) = \varphi_N(\sigma(0)) - \varphi_N(\tilde\sigma) \qquad (13)$$

Given $M$ customers, the probability that customer $i$ is a phantom in the sequence $\{\tilde\sigma_i; i = 1, \dots, M\}$ is $P\{\tilde\sigma_i \neq \sigma_i(0)\} = P\{\tilde\sigma_i \neq v_k \mid \sigma_i(0) = v_k\}\, P\{\sigma_i(0) = v_k\} = \Delta$. Therefore,

$$E[D_\Delta(M)] = M\Delta\, (1-\Delta)^{M-1}\, E^{(1)}[D_\Delta(M)] + E\left[\sum_{m=2}^{M} \binom{M}{m} \Delta^m (1-\Delta)^{M-m}\, E^{(m)}[D_\Delta(M)]\right] \qquad (14)$$

where $E^{(m)}$ is the expectation w.r.t. $\tilde\sigma$, conditioning on having exactly $m$ phantoms. As in [29], we use the fact that $\sum_{m=2}^{M} \binom{M}{m} \Delta^m (1-\Delta)^{M-m} \leq M^2 \Delta^2$ to bound the second term.
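Because the nominal and phantom paths share the common random numbers $(A_i, u_i)$, differences such as $\varphi_M(\sigma(0)) - \varphi_M(\sigma(j))$, with $\sigma(j)$ the single-phantom sequence used in the RPA formulas below, can be computed with two parallel Lindley recursions. A sketch for $V = 2$ with uniform service laws (the parameters and the choice of $j$ are illustrative assumptions):

```python
# Parallel phantom evaluation: the phantom path differs from the
# nominal one only in the decision of customer j; both are driven
# by the same (A_i, u_i) through Lindley's recursion.
import random

params = [(33.0, 38.0), (3.0, 7.0)]          # U[a_k, b_k] for the two speeds

def service(k, u):
    a, b = params[k]
    return a + (b - a) * u                   # inverse-transform with common u

def cum_wait(A, u, sigma):
    W, R, total = 0.0, 0.0, 0.0
    for i in range(len(A)):
        W = max(0.0, R - A[i])               # Lindley's recursion
        R = W + service(sigma[i], u[i])
        total += W
    return total                             # phi_M(sigma)

rng = random.Random(7)
M, lam, theta1 = 500, 0.028, 0.6
A = [rng.expovariate(lam) for _ in range(M)]
u = [rng.random() for _ in range(M)]
nominal = [0 if rng.random() < theta1 else 1 for _ in range(M)]
j = next(i for i, k in enumerate(nominal) if k == 0)  # a customer served at v_1
phantom = list(nominal)
phantom[j] = 1                               # swap the decision of customer j
print(cum_wait(A, u, nominal) - cum_wait(A, u, phantom))
```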

If $M < \infty$ is a deterministic number, then $\varphi_M(\sigma) \leq M \sum_{i=1}^{M} S_i$ for any sequence of decisions, and since $E[S_i] \leq 1/\mu_V < \infty$, we can use dominated convergence to establish the phantom RPA formula for any finite $M$:

$$\lim_{\Delta\to 0} \frac{E[D_\Delta(M)]}{\Delta} = \frac{1}{\theta_k}\, E\left\{\sum_{j=1}^{M} I_k(j)\, \left[\varphi_M(\sigma(0)) - \varphi_M(\sigma(j))\right]\right\} \qquad (15)$$

where $I_k(j) = 1_{\{\sigma_j(0) = v_k\}}$ and we have used that all the sequences of phantom decisions that differ in only one component from $\sigma(0)$ are equiprobable. The sequence $\sigma(j)$ is defined by $\sigma_i(j) = \sigma_i(0)$, $i \neq j$, and $P\{\sigma_j(j) = v_l \mid \sigma_j(0) = v_k\} = p_l$.

Consider now the finite difference $D_\Delta(N)$ in (13). In order to use dominated convergence, we assume that the service distributions are stochastically dominated by a system of possibly random decisions $\bar\sigma$ such that $S_i(\sigma_i(0), u_i) \leq S_i(\bar\sigma_i, u_i)$ and $S_i(\tilde\sigma_i, u_i) \leq S_i(\bar\sigma_i, u_i)$. Such is the case if $G_l(s) \geq G_V(s)$, $l < V$, with $\bar\sigma_i \equiv v_V$. Then the finite difference $D_\Delta(N)$ converges as $\Delta \to 0$, provided that $E[N(\bar\sigma)^4] < \infty$. This follows from $\varphi_N(\sigma(0)) - \varphi_N(\tilde\sigma) \leq 2 \sum_{i=1}^{N(\bar\sigma)} W_i(\bar\sigma)$, since then we can use that $W_i(\bar\sigma)$ is bounded by the length of the busy period in the system with decisions $\bar\sigma$, obtaining a bound of order $E[N^2(\bar\sigma)]\, E[S_i(\bar\sigma)]\, \Delta \to 0$ as $\Delta \to 0$. The corresponding RPA formula becomes:

$$\lim_{\Delta\to 0} \frac{E[D_\Delta(N)]}{\Delta} = \frac{1}{\theta_k}\, E\left\{\sum_{j=1}^{N(0)} I_k(j)\, \left[\varphi_N(\sigma(0)) - \varphi_N(\sigma(j))\right]\right\} \qquad (16)$$

The sequences $\sigma(j)$ are defined as before.

5.2 The Swapping Phantoms

Our first estimator of the generalized gradient is based on the following observation. Even if the partial derivatives $F'_k(\theta)$ make sense mathematically, expressed as $\lim_{\Delta\to 0} [F(\theta) - F(\tilde\theta)]/\Delta$, where $\tilde\theta = \theta - \tilde\Delta$ and $\tilde\Delta_l = 0$, $l \neq k$, the law of total probability is not preserved, because $\sum_{i=1}^{V} \tilde\theta_i \neq 1$ and thus the finite difference for $\Delta > 0$ does not correspond to a physical process. An alternative is to look for appropriate directional derivatives. Consider $p_l = \theta_l/(1 - \theta_k)$. Simple algebraic manipulations show that

$$\lim_{\Delta\to 0} \frac{F(\theta) - F(\tilde\theta)}{\Delta} = \frac{1}{1 - \theta_k}\, G_k[F(\theta)] \qquad (17)$$

Therefore, the estimation problem can be stated in terms of estimating the generalized gradient. Phantom customers swap speeds compared to the corresponding nominal decisions, choosing the remaining ones in their original proportion. We then use (15) or (16) with $p_l = \theta_l/(1 - \theta_k)$. This approach helps us define the corresponding estimators to drive the process towards the limiting ODE (9). The practical problem is the required condition for domination. In our numerical examples, we used Poisson arrivals and uniformly distributed service distributions for each speed. In general, for an M/G/1 queue, Takacs' method (see [14]) gives a functional relationship between the moment generating function of the service time and that of the number of customers in a busy period.

If the $m$-th moment of the service distribution is bounded, so is $E[N^m]$, which follows from the above argument, although we omit the details of the proof. In our case, $E[S_i^4] < \infty$.

Proposition 1 Assume that the service times are dominated by $\{S_i(\bar\sigma)\}$, for some random sequence $\bar\sigma$, and that the dominating queueing process satisfies $E[N(\bar\sigma)^4] < \infty$. Then:

$$\theta_k\, G_k[W(\theta)] = \lim_{M\to\infty} (1 - \theta_k)\, E\left\{\sum_{j=1}^{M} I_k(j)\, \frac{\varphi_M(\sigma(0)) - \varphi_M(\sigma(j))}{M}\right\}$$

$$\theta_k\, G_k[\bar W(\theta)] = (1 - \theta_k)\, E\left\{\sum_{j=1}^{N(0)} I_k(j)\, \left[\varphi_N(\sigma(0)) - \varphi_N(\sigma(j))\right]\right\}$$

$$\theta_k\, G_k[\bar N(\theta)] = (1 - \theta_k)\, E\left\{\sum_{j=1}^{N(0)} I_k(j)\, \left[N(0) - N(j)\right]\right\}$$

Proof: The last two statements follow directly from the development of the phantom RPA formula, under the domination assumption. By construction, (15) shows that $(\theta_k M)^{-1} \sum_{j=1}^{M} I_k(j)[\varphi_M(\sigma(0)) - \varphi_M(\sigma(j))]$ is an unbiased estimator of $(1 - \theta_k)^{-1}\, G_k[E \sum_{i=1}^{M} W_i]/M$. The convergence as $M \to \infty$ of the r.h.s. follows from the uniform bound $E|\varphi_M(\sigma(0)) - \varphi_M(\sigma(j))| \leq 2\, E[\sum_{i=1}^{N_j(\bar\sigma)} W_i(\bar\sigma)] < \infty$, where $N_j(\bar\sigma)$ is the number of customers in the busy period of the dominating system to which customer $j$ belongs. This fact follows from the construction of the phantom systems: $W_i(0) = W_i(j)$ for all $i \leq j$. When the system finishes the busy period where $j$ belongs, both the nominal and the $j$-th phantom system with decisions $\{\sigma(j)\}$ finish. After this, their evolution is identical, since $j$ was the only decision that differed. We need only verify that the interchange between the expectation and the limit $M \to \infty$ is valid. Since the invariant measure is unique and ergodic, $G_k[W(\theta)] < \infty$, and the generalized gradient is a linear operator, this is established if

$$\frac{\partial}{\partial \theta_k}\, W(\theta) = \lim_{M\to\infty} \frac{1}{M}\, \frac{\partial}{\partial \theta_k}\, E\left[\sum_{i=1}^{M} W_i\right]$$

for every initial distribution of the process, which in turn follows from [32] for our model, where all the $n$-step transition probabilities of the process are polynomials in $(\theta_i; i = 1, \dots, V)$ and therefore continuously differentiable.

5.3 The Disappearing Phantoms

The requirement that the system be stable if we always choose the slowest speed may be too restrictive, especially when we do not know the service distributions explicitly. Naturally, if we knew the service distributions, we could evaluate the analytical solution and find the optimal setting without need for adaptive control. In the more general situation, we let the machine function and evaluate its own estimates to drive the control towards the optimal value.

In such cases, and when the service distributions as well as the input rate may vary, we need the algorithm to be able to track the optimal value. We propose now a more robust estimator that does not require assigning the services of the phantom customers as $S_i(u_i, v_l)$ with probability $p_l = \theta_l/(1 - \theta_k)$.

Consider the alternative model for the original process, where customers of class $k$ arrive according to a renewal process with rate $\lambda \theta_k$, requiring an amount of service that has distribution $G_k(\cdot)$. In this model, we can use $\tilde\theta = \theta - \tilde\Delta$ with $p_l = 0$, $l \neq k$, meaning that some of the customers, the phantom ones, are not allowed entrance to the machine and thus "disappear" from the system. The corresponding system will have a total incoming rate of $\tilde\lambda = \lambda(1 - \Delta)$. In this case, we obtain:

$$\bar G_k[F(\lambda, \theta)] = \lim_{\Delta\to 0} \frac{F(\lambda, \theta) - F(\tilde\lambda, \tilde\theta)}{\Delta} \qquad (18)$$

and therefore the generalized gradient can be expressed as:

$$G_k[F(\lambda, \theta)] = \bar G_k[F(\lambda, \theta)] - \sum_{l=1}^{V} \theta_l\, \bar G_l[F(\lambda, \theta)] \qquad (19)$$

In the case of the disappearing phantoms, Lindley's equations can be calculated in parallel to the nominal system using $S_j(j) \equiv S_j(\sigma_j(j), u_j) \equiv 0$.

Proposition 2 Assume that $E[N(0)^4] < \infty$. Then:

$$\theta_k\, \bar G_k[W(\theta)] = \lim_{M\to\infty} E\left\{\sum_{j=1}^{M} I_k(j)\, \frac{\varphi_M(\sigma(0)) - \varphi_M(\sigma(j))}{M}\right\}$$

$$\theta_k\, \bar G_k[\bar W(\theta)] = E\left\{\sum_{j=1}^{N(0)} I_k(j)\, \left[\varphi_N(\sigma(0)) - \varphi_N(\sigma(j))\right]\right\}$$

$$\theta_k\, \bar G_k[\bar N(\theta)] = E\left\{\sum_{j=1}^{N(0)} I_k(j)\, \left[N(0) - N(j)\right]\right\}$$

Proof: Since $S_j(j) = 0$, the nominal system dominates all of the possible phantom systems and $N(\tilde\sigma) \leq N(\sigma(0))$ a.s., which requires then the condition $E[N(\sigma(0))^4] < \infty$. Notice that in this case it is no longer required that $\lambda/\mu_V < 1$, but only that $\lambda < \mu(\theta) = (\sum_k \theta_k/\mu_k)^{-1}$. As before, the first statement follows from the a.s. convergence of the derivatives of the finite horizon averages to those of the stationary averages.
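For the disappearing phantoms, the phantom path of customer $j$ is obtained simply by setting $S_j := 0$, and the perturbations $d_i(j) = W_i(0) - W_i(j)$ vanish once the busy period containing $j$ ends. The sketch below illustrates this with one phantom; the service model and the phantom index are assumptions for the example.

```python
# Disappearing-phantom sketch: removing the service of customer j
# only shifts the waits of later customers within the same busy
# period, so sum_{i>j} d_i(j) is computed with one extra Lindley pass.
import random

def waits(A, S):
    W, R, out = 0.0, 0.0, []
    for a, s in zip(A, S):
        W = max(0.0, R - a)                  # Lindley's recursion
        R = W + s
        out.append(W)
    return out

rng = random.Random(3)
n, lam, theta1 = 2000, 0.028, 0.6
A = [rng.expovariate(lam) for _ in range(n)]
S = [rng.uniform(33.0, 38.0) if rng.random() < theta1 else rng.uniform(3.0, 7.0)
     for _ in range(n)]
W0 = waits(A, S)
j = 10                                       # the phantom customer (illustrative)
S_phantom = list(S)
S_phantom[j] = 0.0                           # the phantom "disappears"
Wj = waits(A, S_phantom)
d = [w0 - wj for w0, wj in zip(W0, Wj)]      # d_i(j); zero after the BP ends
print(sum(d[j + 1:]))                        # contribution of phantom j
```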

6 The Learning Algorithms

In this section we shall describe the actual algorithms for estimating the gradients in (18) via the disappearing phantoms. We present two methods based on the regenerative estimation approach, as well as the non-reset version of [29]. The development for the swapping phantoms is analogous and we omit the details. Let $C_k = c_k/\mu_k$ be the partial derivative of $C(\theta)$ w.r.t. $\theta_k$ and call $\Gamma_k(\theta) = \bar G_k[W(\theta)] + C_k$, so that, clearly, using (19),

$$G_k[F(\theta)] = \Gamma_k(\theta) - \sum_{l=1}^{V} \theta_l\, \Gamma_l(\theta) \qquad (20)$$

We shall focus on constructing the estimators of $\Gamma_k(\theta)$. To simplify notation, fix the index $k$ for which we evaluate the estimator of $\Gamma_k(\theta)$, and call:

$$\eta_W(\theta) = \bar G_k[\bar W(\theta)], \qquad \eta_N(\theta) = \bar G_k[\bar N(\theta)]$$

Taking the appropriate derivative of $W(\theta)$ from (11), we get

$$\bar G_k[W(\theta)] = \frac{\eta_W(\theta)\, \bar N(\theta) - \eta_N(\theta)\, \bar W(\theta)}{[\bar N(\theta)]^2}$$

6.1 Regenerative and Non-Reset Estimators

We shall first describe the ensuing estimators when a regenerative approach is chosen. Since the nominal system dominates the phantoms, we shall use $N_m(0)$ to denote the number of customers in the $m$-th busy period of the nominal system, and $N^{(m)}_\nu(j)$ for the $\nu$-th busy period of the $j$-th phantom system within the $m$-th nominal one. In order to obtain unbiased estimates of the numerator, we use the approach proposed in [5], [1]. We estimate $\Gamma_k(\theta)$ using $2M$ busy periods, estimating $\bar W(\theta)$ and $\eta_W(\theta)$ in one busy period of each pair, and $\bar N(\theta)$ and $\eta_N(\theta)$ in the other, so that the expectation of the product is the product of the expectations. The estimators of $\bar N(\theta)$ and $\bar W(\theta)$ in the $m$-th busy period, called $\hat N(m)$ and $\hat W(m)$ respectively, are unbiased using the sample averages. The estimators of $\eta_W(\theta)$ and $\eta_N(\theta)$ are given respectively by:

$$\hat\eta_W(m) = \frac{1}{\theta_k} \sum_{j=1}^{N_m(0)} I_k(j) \sum_{i=j+1}^{N_m(0)} d_i(j), \qquad \hat\eta_N(m) = \frac{1}{\theta_k} \sum_{j=1}^{N_m(0)} I_k(j)\, \left[N_m(0) - N^{(m)}_1(j)\right]$$

where $d_i(j) = W_i(0) - W_i(j) \leq S_j$ a.s. (see the Appendix), therefore:

$$\mathrm{Var}\{\hat\eta_W(m)\} \leq \frac{1}{\theta_k^2}\, E\left\{\theta_k\, N_m^2(0)\; E\left[\Big(\sum_{i=j+1}^{N_m(0)} d_i(j)\Big)^2 \,\Big|\; N_m(0)\right]\right\} \leq \frac{\bar S^2}{\theta_k}\, E[N^4(0)] \equiv \mathcal V(\theta) < \infty \qquad (21)$$

where $\bar S$ is an a.s. bound on the service times,

which is uniformly bounded for $\theta \in \Theta$; clearly, $\mathrm{Var}\{\hat\eta_N(m)\}$ is also uniformly bounded in $\theta$. Using Proposition 2 and independence between different busy periods, $E\{\hat\eta_N(2m+1)\, \hat W(2m)\} = \eta_N(\theta)\, \bar W(\theta)$ for every $m$.

As described, the $n$-th estimation interval considers $M$ pairs $(2m, 2m+1)$ of busy periods, from $m = nM+1$ up to $m = (n+1)M$. To simplify notation, for any function $f(m)$ call:

$$\sum_{(n)} f(m) = \sum_{m=nM+1}^{(n+1)M} f(m)$$

Method 1: The regenerative estimator $\hat\Gamma^{(1)}_n(M, k)$ is given by:

$$\hat\Gamma^{(1)}_n(M, k) = \frac{\sum_{(n)} \hat\eta_W(2m)\; \sum_{(n)} \hat N(2m+1)\; -\; \sum_{(n)} \hat\eta_N(2m+1)\; \sum_{(n)} \hat W(2m)}{\left[\sum_{(n)} \hat N(2m+1)\right]^2} + C_k \qquad (22)$$

Since the $\{\hat\Gamma^{(1)}_n(M, k)\}$ are i.i.d. and their variance is bounded, they are strongly consistent estimators of $\Gamma_k(\theta)$. However, they are not consistent in the average: for any $m$, $\frac{1}{m} \sum_{n=0}^{m-1} E[\hat\Gamma^{(1)}_n(M, k)] = E[\hat\Gamma^{(1)}_1(M, k)]$, but for any fixed $M$, the estimator $\hat\Gamma^{(1)}_1(M, k)$ is biased.

Method 2: As a variant, we consider another regenerative estimator $\hat\Gamma^{(2)}_n(M, k)$ similar to $\hat\Gamma^{(1)}_n(M, k)$, but for which the denominator is replaced by a cumulative version of it. We define

$$\hat N^{(n)}(M) = \frac{1}{(n+1)M} \sum_{m=0}^{(n+1)M} \hat N(2m+1)$$

and use now:

$$\hat\Gamma^{(2)}_n(M, k) = \frac{\frac{1}{M}\sum_{(n)} \hat\eta_W(2m)\; \hat N^{(n)}(M)\; -\; \frac{1}{M}\sum_{(n)} \hat\eta_N(2m+1)\; \frac{1}{M}\sum_{(n)} \hat W(2m)}{\left[\hat N^{(n)}(M)\right]^2} + C_k \qquad (23)$$

This estimator is strongly consistent for $\Gamma_k(\theta)$, and also consistent in the average, since for every $M$, $\hat N^{(n)}(M) \to \bar N(\theta)$ a.s. as $n \to \infty$. In practice, Method 1 has been preferred to Method 2 for stochastic optimization, mostly because the sequence of estimates is i.i.d. Both are strongly consistent, and that is the required property for (1) to converge towards the optimum. We shall discuss their behavior in Section 7.
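Assembling (22) from the per-busy-period statistics is mechanical; the sketch below shows only the ratio structure, with placeholder arrays standing in for measured values of $\hat\eta_W$, $\hat\eta_N$, $\hat N$, $\hat W$ and an assumed value of $C_k$.

```python
# Method 1: the regenerative ratio estimator (22) assembled from
# per-BP statistics; all numbers below are hypothetical placeholders.
def gamma_method1(etaW_even, N_odd, etaN_odd, W_even, C_k):
    num = sum(etaW_even) * sum(N_odd) - sum(etaN_odd) * sum(W_even)
    return num / sum(N_odd) ** 2 + C_k

print(gamma_method1(etaW_even=[1.2, 0.8, 1.1],   # eta_W-hat over even BPs
                    N_odd=[4.0, 6.0, 5.0],       # N-hat over odd BPs
                    etaN_odd=[0.3, 0.5, 0.4],    # eta_N-hat over odd BPs
                    W_even=[40.0, 55.0, 47.0],   # W-hat over even BPs
                    C_k=0.9))                    # assumed cost derivative
```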

The non-reset RPA estimator uses a finite horizon of fixed length $M$, without resetting the propagation of the perturbations to zero. It uses the information available over the service completions from customer $nM+1$ to customer $(n+1)M$. In this case, as we do not need the renewal theorem, we directly estimate $dW/d\theta$.

Method 3: Consider now:

$$\hat\Gamma^{(3)}_n(M, k) = \frac{1}{\theta_k M} \left( \sum_{j=\underline n}^{nM} I_k(j) \sum_{i=nM+1}^{\bar n} d_i(j)\; +\; \sum_{j=nM+1}^{(n+1)M} I_k(j) \sum_{i=j+1}^{(n+1)M} d_i(j) \right) + C_k \qquad (24)$$

where $\underline n$ (respectively, $\bar n$) is the index of the customer that starts (finishes) the nominal busy period where customer $nM+1$ belongs. The second term corresponds to the (unbiased) estimator of the gradient over the finite horizon. The first term accounts for the bookkeeping, and represents the propagation of the perturbations of the previous phantom systems into the current estimation interval $i \geq nM+1$. This estimator is both strongly consistent and (strongly) consistent in the average, provided that $E[N^4(0)] < \infty$. To see this, we use the fact that the double sum in the first term (or accumulator) has expectation bounded by $E\{\theta_k\, N(0)\, E[\sum_{i=1}^{N(0)} S_j(j) \mid N(0)]\} \leq \theta_k\, E[N^2(0)]/\mu_k < \infty$. Its variance is bounded by $\mathcal V(\theta)$. Let $N^\nu(M)$ be the number of busy periods totally contained within $[nM+1, (n+1)M]$; then (24) can be written as a sum over the $N^\nu(M)$ busy periods plus the corresponding terms in the first and last busy periods. Therefore, using independence and Wald's identity (conditioning on $N^\nu(M)$):

$$\mathrm{Var}[\hat\Gamma^{(3)}_n(M, k)] \leq \frac{1}{M}\, E\left[\frac{N^\nu(M)}{M}\right] \mathcal V(\theta) + \frac{2\, \mathcal V(\theta)}{M}$$

which implies strong convergence as $M \to \infty$. As for the consistency in the average, it suffices to remark that our estimator is additive, so that for any $m$,

$$\frac{1}{m} \sum_{n=0}^{m-1} \hat\Gamma^{(3)}_n(M, k) = \hat\Gamma^{(3)}_0(mM, k)$$

and therefore, for any $M$, as $m \to \infty$, from Proposition 2 and the fact that the initial accumulator converges to zero a.s. for any initial condition, this latter average converges to $\Gamma_k(\theta)$ a.s.

Finally, using any of the estimators $\hat\Gamma^{(e)}_n(M, k)$, $e = 1, 2, 3$, let:

$$Y^{(e)}_{n,k}(M, \theta) = -\theta_k \left[ \hat\Gamma^{(e)}_n(M, k) - \sum_{l=1}^{V} \theta_l\, \hat\Gamma^{(e)}_n(M, l) \right]$$

be the estimator of the drift in (9). Summarizing our previous results, it follows that $Y^{(e)}_{n,k}(M, \theta)$ is a strongly consistent estimator of $-\theta_k\, G_k[F(\theta)]$ for $e = 1, 2, 3$ and consistent in the average for $e = 2, 3$.

6.2 Iterative Gradient Search Algorithm

We are now ready to go back to where we started. In the following section we shall analyze the behavior of the stochastic approximation using Methods $e = 1, 2, 3$ and updating each component $\theta_k$ of the control variable as:

$$\theta_{n+1,k} = \theta_{n,k} + \epsilon\, Y^{(e)}_{n,k}(M, \theta_n) \qquad (25)$$

where, for Methods 1 and 2, $\tau_n$ is the total length of the busy periods labeled by $m = nM+1, \dots, (n+1)M$ and, for Method 3, $\tau_n = M$. We shall then compare the performance of the procedures with the more commonly used stochastic approximation using Method 1 and:

$$\theta_{n+1,k} = \theta_{n,k} + \epsilon_n\, Y^{(1)}_{n,k}(M(n), \theta_n) \qquad (26)$$

where $\epsilon_n \sim 1/n$ and $M(n) \sim n$. According to Theorem 1, $\theta_n \to \theta^*$ a.s.
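Putting the pieces together, here is a sketch of the constant-gain loop (25). Since the combination $Y_k = -\theta_k(\hat\Gamma_k - \sum_l \theta_l \hat\Gamma_l)$ sums to zero over $k$, the iterates stay on the simplex without truncation; the noisy $\hat\Gamma$ below is a hypothetical stand-in for the RPA estimators of Section 6.

```python
# Constant-gain learning loop (25) with a stand-in for the Gamma
# estimates; the drift pushes theta toward (0.812, 0.188) in this
# hypothetical model while sum(theta) remains exactly 1.
import random

eps = 5e-4
theta = [0.51, 0.49]

def Gamma_hat(th):
    true = [2.0 * (th[0] - 0.812), 2.0 * (th[1] - 0.188)]
    return [g + random.gauss(0.0, 0.5) for g in true]   # noisy estimates

for _ in range(200_000):
    G = Gamma_hat(theta)
    gbar = sum(t * g for t, g in zip(theta, G))
    Y = [-t * (g - gbar) for t, g in zip(theta, G)]     # sums to zero over k
    theta = [t + eps * y for t, y in zip(theta, Y)]     # recursion (25)
print(theta, sum(theta))   # hovers near (0.812, 0.188); sum stays 1
```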

In the framework of Section 2, we let $\{\xi_i(\theta)\}$ denote the process, $\xi_i(\theta)$ being the waiting time of customer number $i$. Our underlying probability space is defined by $\omega = \{\omega_i = (A_i, u_i, \sigma_i)\}$, so that $\xi_{i+1}(\theta) = \max[0,\, \xi_i(\theta) + S_i(u_i, \sigma_i) - A_i]$. In order to simplify notation, we focus on Methods 1 and 2 with $M = 1$, but the treatment for Method 3 and any $M$ is similar. The enlarged state is constructed using

$$Z_i(\theta) = \left( \Lambda_i(k),\; i - \underline i,\; (d_i(j, k);\, j = \underline i, \dots, i-1),\; k = 1, \dots, V \right)$$

where $\underline i$ is the index of the first customer in the busy period where customer $i$ belongs, $d_i(j, k) = [W_i(0) - W_i(j)]\, I_k(j)$, and the accumulator $\Lambda_i(k)$ is given by:

$$\Lambda_i(k) = \sum_{j=\underline i}^{i-1} I_k(j) \sum_{l=j+1}^{i} d_l(j, k) \qquad (27)$$

Assumptions 1 and 4 are verified for the process at fixed control value. We provide in the Appendix the recursions satisfied by $Z_i(\theta)$, which imply that $P\{\chi_{i+1}(\theta) \in B \mid \mathcal F_i(\theta)\} = P\{\chi_{i+1}(\theta) \in B \mid \chi_i(\theta)\}$, and this latter is a linear function of $\theta$, yielding weak continuity. Notice that the enlarged process also regenerates at $W_i = 0$, when the components of $Z_i$ are set to 0. Under stability of $\{\xi_i(\theta)\}$ for every $\theta \in \Theta$, the number of customers in, and the length of, the busy periods are finite a.s., implying that the invariant measure exists, is ergodic, and is tight for every compact set $A \subset \Theta$. Assumption 4 follows from $E[N^4(0)] < \infty$, as we have seen, with $G_k(\theta) = -\theta_k\, G_k[F(\theta)] + \beta(\theta, M)$ for each component $k$, and $\bar\tau(\theta) = E[N(0)]$ when $M = 1$. For Methods 2 and 3, $\beta(\theta, M) = 0$ and for Method 1, $\beta(\theta, M) \to 0$ as $M \to \infty$.

Assumptions 2 and 3 refer to the MDP model $\{(\chi_i^\epsilon, \tilde\theta_i)\}$. Clearly, when $W_{i+1} = 0$ and an update takes place, the value of $\tilde\theta_{i+1}$ is obtained using the current values of $\Lambda_i(k)$, $k = 1, \dots, V$. Then the future evolution of the state $\chi_j^\epsilon$ depends on the value $\tilde\theta_i$. Since $\sup_{\theta\in\Theta} \mathcal V(\theta) < \infty$, both $\mathrm{Var}\{\tau_n \mid \theta_n = \theta\}$ and $\mathrm{Var}\{Y^{(e)}_{n,k}(M, \theta) \mid \theta_n = \theta\}$ are uniformly bounded for $\theta \in \Theta$; therefore they are uniformly integrable, verifying Assumption 2. Finally, Assumption 3 requires $\{(\chi_i^\epsilon, \tilde\theta_i)\}$ to be tight. Tightness of $\tilde\theta_i$ follows from the boundedness $0 \leq \theta_k \leq 1$. Verifying tightness of $\chi_i^\epsilon$ presents a minor difficulty, which was also present when we coded the algorithms: the dimension of $Z_i$ is random and, in principle, unbounded, since it has as many components $\{d_i(j, k)\}$ as customers present in a busy period. The solution for the analysis of tightness is very similar to the practical solution. Use an array of fixed but "large" dimension $D$, and notice that $P\{\omega : N_m(0) > D\}$ can be made uniformly small for any $\theta \in \Theta$, since the latter is a compact set and the process is stable for all $\theta$. Given $\tilde\theta_i = \theta$ at the start of the estimation interval, the initial distribution is fixed and independent of $\epsilon$. Use now the bounds $d_i(j) \leq \bar S$ (the upper bound of our uniform service distributions), $i - \underline i \leq N_m(0)$ where $m$ is the current BP, $\Lambda_i(k) \leq N_m^2(0)\, \bar S$, and the fact that $W_i$ is a.s. bounded by the length of the current BP. Using our truncation argument, we can now choose constants $K_i$ sufficiently large so that $P\{\xi_i(\theta) > K_1,\ \Lambda_i > K_2,\ (i - \underline i) > K_3\} < \delta$. This implies tightness of $\{\chi_i^\epsilon \mid \tilde\theta_i = \theta\}$ over one regenerative cycle, where the bounds are independent of $\epsilon$.

7 Simulation Results

For the purposes of verifying our simulation results, we consider the case $V = 2$. Then $\theta = P\{\sigma_i = v_1\}$ becomes scalar, and the process regenerates if $\lambda < \mu(\theta)$. Our model is an M/G/1 queue with uniformly distributed service times for each speed. Let $S(k)$, $k = 1, 2$, be the service time of items processed at speed $v_k$; then $S(k) \sim U[a_k, b_k]$. We used $\lambda = 0.028$, $a_1 = 33$, $b_1 = 38$, $c_1 = 5$, $a_2 = 3$, $b_2 = 7$, $c_2 = 145$. Let $\bar C = c_1/\mu_1 - c_2/\mu_2$ be the derivative of $C(\theta)$. Using the Pollaczek-Khinchine formula:

$$\frac{dF}{d\theta} = \frac{\lambda\, (\varsigma_1^2 - \varsigma_2^2)}{2\, (1 - \rho)} + \frac{\lambda^2\, \left[\theta\, \varsigma_1^2 + (1-\theta)\, \varsigma_2^2\right]\, (\mu_1^{-1} - \mu_2^{-1})}{2\, (1 - \rho)^2} + \bar C$$

where $\varsigma_k^2 = E[S^2(k)]$ and $\rho = \lambda\, [\theta/\mu_1 + (1-\theta)/\mu_2]$. Figure 1 shows the graph of the performance function, along with the waiting time and cost as functions of the control parameter. The dotted line plots $C(\theta)$, the dash-dotted line plots $W(\theta)$, and the solid line plots $F(\theta)$. Clearly, for certain combinations of costs $c_1$ and $c_2$, it will be more advantageous to always use one of the two possible speeds, so that $\theta^*$ will be either 0 or 1. In our example, $\theta^* = 0.812$ and $F(\theta^*) = 367.01$, but the method can also be used if $\theta^*$ lies at the boundaries. In all our simulations, the initial point was $\theta(0) = 0.51$. In all our graphs showing the behavior of $\vartheta^\epsilon(\cdot)$, the solid line indicates the evolution of $\vartheta(\cdot)$, given by (9), and the dotted line indicates the evolution of $\vartheta^\epsilon(\cdot)$.
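The closed form above can be checked directly. The brief sketch below recovers the reported optimum by a grid search; reading $W(\theta)$ as the Pollaczek-Khinchine mean waiting time and $C(\theta)$ as the per-item cost (6) reproduces $F(\theta^*) = 367.01$ at $\theta^* = 0.812$.

```python
# Closed-form evaluation of F(theta) for the V = 2 example of
# Section 7, using the Pollaczek-Khinchine formula for W(theta).
lam = 0.028
m1, s2_1, c1 = 35.5, (33**2 + 33*38 + 38**2) / 3.0, 5.0    # U[33, 38]
m2, s2_2, c2 = 5.0,  (3**2 + 3*7 + 7**2) / 3.0,    145.0   # U[3, 7]

def F(th):
    ES = th * m1 + (1.0 - th) * m2           # mean service time
    ES2 = th * s2_1 + (1.0 - th) * s2_2      # second moment of service
    rho = lam * ES                           # traffic intensity
    W = lam * ES2 / (2.0 * (1.0 - rho))      # P-K mean waiting time
    C = th * c1 * m1 + (1.0 - th) * c2 * m2  # mean operating cost per item (6)
    return W + C

best = min((F(t / 1000.0), t / 1000.0) for t in range(1, 1000))
print(best)   # approximately (367.01, 0.812)
```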

Figure 1: Behavior of the Cost Functions

7.1 Understanding Weak Convergence

In this section we show how the choice of $\epsilon$ changes both the convergence rate and the fluctuations of the control process associated with (25) around the limiting ODE. Our aim is to present in a visual manner what the convergence in distribution of a stochastic process to a limiting deterministic process means in practice. Theorem 2 predicts that, as $\epsilon \to 0$, the processes $\vartheta^\epsilon(\cdot)$ approach (in the weak topology) the solid curves in Figure 2, which are the solution of (9). We plotted the values $\tilde\theta_i$ against the number of iterations to show the processes $\vartheta^\epsilon(\cdot)$. The three plots are a sequence of trajectories of the ladder interpolation processes $\vartheta^\epsilon(\cdot)$ obtained using Method 3, with $M = 100$, going (from left to right) from $\epsilon = 5 \times 10^{-4}$ to $\epsilon = 5 \times 10^{-5}$ to $\epsilon = 5 \times 10^{-6}$. This is the only difference in the stochastic approximations.

Figure 2: Weak Convergence to the ODE

Notice, however, that the iteration numbers increase by a corresponding factor in the plots. Naturally, as one increases $\epsilon$ and thus the rate of convergence, one loses accuracy in the stochastic approximation.

7.2 Understanding Consistency in the Average

We show now how closely the different processes converge to the deterministic solution of the target ODE when (25) is used for the updates. Figure 3 shows the control processes when the regenerative estimators given in (22) (Method 1,


More information

Stochastic dominance with imprecise information

Stochastic dominance with imprecise information Stochastic dominance with imprecise information Ignacio Montes, Enrique Miranda, Susana Montes University of Oviedo, Dep. of Statistics and Operations Research. Abstract Stochastic dominance, which is

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

N.G.Bean, D.A.Green and P.G.Taylor. University of Adelaide. Adelaide. Abstract. process of an MMPP/M/1 queue is not a MAP unless the queue is a

N.G.Bean, D.A.Green and P.G.Taylor. University of Adelaide. Adelaide. Abstract. process of an MMPP/M/1 queue is not a MAP unless the queue is a WHEN IS A MAP POISSON N.G.Bean, D.A.Green and P.G.Taylor Department of Applied Mathematics University of Adelaide Adelaide 55 Abstract In a recent paper, Olivier and Walrand (994) claimed that the departure

More information

Renewal theory and its applications

Renewal theory and its applications Renewal theory and its applications Stella Kapodistria and Jacques Resing September 11th, 212 ISP Definition of a Renewal process Renewal theory and its applications If we substitute the Exponentially

More information

Lecture 10: Semi-Markov Type Processes

Lecture 10: Semi-Markov Type Processes Lecture 1: Semi-Markov Type Processes 1. Semi-Markov processes (SMP) 1.1 Definition of SMP 1.2 Transition probabilities for SMP 1.3 Hitting times and semi-markov renewal equations 2. Processes with semi-markov

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Re-sampling and exchangeable arrays University Ave. November Revised January Summary

Re-sampling and exchangeable arrays University Ave. November Revised January Summary Re-sampling and exchangeable arrays Peter McCullagh Department of Statistics University of Chicago 5734 University Ave Chicago Il 60637 November 1997 Revised January 1999 Summary The non-parametric, or

More information

Garrett: `Bernstein's analytic continuation of complex powers' 2 Let f be a polynomial in x 1 ; : : : ; x n with real coecients. For complex s, let f

Garrett: `Bernstein's analytic continuation of complex powers' 2 Let f be a polynomial in x 1 ; : : : ; x n with real coecients. For complex s, let f 1 Bernstein's analytic continuation of complex powers c1995, Paul Garrett, garrettmath.umn.edu version January 27, 1998 Analytic continuation of distributions Statement of the theorems on analytic continuation

More information

Environment (E) IBP IBP IBP 2 N 2 N. server. System (S) Adapter (A) ACV

Environment (E) IBP IBP IBP 2 N 2 N. server. System (S) Adapter (A) ACV The Adaptive Cross Validation Method - applied to polling schemes Anders Svensson and Johan M Karlsson Department of Communication Systems Lund Institute of Technology P. O. Box 118, 22100 Lund, Sweden

More information

Multiplicative Multifractal Modeling of. Long-Range-Dependent (LRD) Trac in. Computer Communications Networks. Jianbo Gao and Izhak Rubin

Multiplicative Multifractal Modeling of. Long-Range-Dependent (LRD) Trac in. Computer Communications Networks. Jianbo Gao and Izhak Rubin Multiplicative Multifractal Modeling of Long-Range-Dependent (LRD) Trac in Computer Communications Networks Jianbo Gao and Izhak Rubin Electrical Engineering Department, University of California, Los Angeles

More information

Point Process Control

Point Process Control Point Process Control The following note is based on Chapters I, II and VII in Brémaud s book Point Processes and Queues (1981). 1 Basic Definitions Consider some probability space (Ω, F, P). A real-valued

More information

Fluid Heuristics, Lyapunov Bounds and E cient Importance Sampling for a Heavy-tailed G/G/1 Queue

Fluid Heuristics, Lyapunov Bounds and E cient Importance Sampling for a Heavy-tailed G/G/1 Queue Fluid Heuristics, Lyapunov Bounds and E cient Importance Sampling for a Heavy-tailed G/G/1 Queue J. Blanchet, P. Glynn, and J. C. Liu. September, 2007 Abstract We develop a strongly e cient rare-event

More information

Functional Limit theorems for the quadratic variation of a continuous time random walk and for certain stochastic integrals

Functional Limit theorems for the quadratic variation of a continuous time random walk and for certain stochastic integrals Functional Limit theorems for the quadratic variation of a continuous time random walk and for certain stochastic integrals Noèlia Viles Cuadros BCAM- Basque Center of Applied Mathematics with Prof. Enrico

More information

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks C.M. Bishop s PRML: Chapter 5; Neural Networks Introduction The aim is, as before, to find useful decompositions of the target variable; t(x) = y(x, w) + ɛ(x) (3.7) t(x n ) and x n are the observations,

More information

g(.) 1/ N 1/ N Decision Decision Device u u u u CP

g(.) 1/ N 1/ N Decision Decision Device u u u u CP Distributed Weak Signal Detection and Asymptotic Relative Eciency in Dependent Noise Hakan Delic Signal and Image Processing Laboratory (BUSI) Department of Electrical and Electronics Engineering Bogazici

More information

Notes on Measure Theory and Markov Processes

Notes on Measure Theory and Markov Processes Notes on Measure Theory and Markov Processes Diego Daruich March 28, 2014 1 Preliminaries 1.1 Motivation The objective of these notes will be to develop tools from measure theory and probability to allow

More information

X. Hu, R. Shonkwiler, and M.C. Spruill. School of Mathematics. Georgia Institute of Technology. Atlanta, GA 30332

X. Hu, R. Shonkwiler, and M.C. Spruill. School of Mathematics. Georgia Institute of Technology. Atlanta, GA 30332 Approximate Speedup by Independent Identical Processing. Hu, R. Shonkwiler, and M.C. Spruill School of Mathematics Georgia Institute of Technology Atlanta, GA 30332 Running head: Parallel iip Methods Mail

More information

On the static assignment to parallel servers

On the static assignment to parallel servers On the static assignment to parallel servers Ger Koole Vrije Universiteit Faculty of Mathematics and Computer Science De Boelelaan 1081a, 1081 HV Amsterdam The Netherlands Email: koole@cs.vu.nl, Url: www.cs.vu.nl/

More information

OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS

OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS Xiaofei Fan-Orzechowski Department of Applied Mathematics and Statistics State University of New York at Stony Brook Stony

More information

3.1 Basic properties of real numbers - continuation Inmum and supremum of a set of real numbers

3.1 Basic properties of real numbers - continuation Inmum and supremum of a set of real numbers Chapter 3 Real numbers The notion of real number was introduced in section 1.3 where the axiomatic denition of the set of all real numbers was done and some basic properties of the set of all real numbers

More information

Competing sources of variance reduction in parallel replica Monte Carlo, and optimization in the low temperature limit

Competing sources of variance reduction in parallel replica Monte Carlo, and optimization in the low temperature limit Competing sources of variance reduction in parallel replica Monte Carlo, and optimization in the low temperature limit Paul Dupuis Division of Applied Mathematics Brown University IPAM (J. Doll, M. Snarski,

More information

1 Introduction This work follows a paper by P. Shields [1] concerned with a problem of a relation between the entropy rate of a nite-valued stationary

1 Introduction This work follows a paper by P. Shields [1] concerned with a problem of a relation between the entropy rate of a nite-valued stationary Prexes and the Entropy Rate for Long-Range Sources Ioannis Kontoyiannis Information Systems Laboratory, Electrical Engineering, Stanford University. Yurii M. Suhov Statistical Laboratory, Pure Math. &

More information

STOCHASTIC DIFFERENTIAL EQUATIONS WITH EXTRA PROPERTIES H. JEROME KEISLER. Department of Mathematics. University of Wisconsin.

STOCHASTIC DIFFERENTIAL EQUATIONS WITH EXTRA PROPERTIES H. JEROME KEISLER. Department of Mathematics. University of Wisconsin. STOCHASTIC DIFFERENTIAL EQUATIONS WITH EXTRA PROPERTIES H. JEROME KEISLER Department of Mathematics University of Wisconsin Madison WI 5376 keisler@math.wisc.edu 1. Introduction The Loeb measure construction

More information

Time is discrete and indexed by t =0; 1;:::;T,whereT<1. An individual is interested in maximizing an objective function given by. tu(x t ;a t ); (0.

Time is discrete and indexed by t =0; 1;:::;T,whereT<1. An individual is interested in maximizing an objective function given by. tu(x t ;a t ); (0. Chapter 0 Discrete Time Dynamic Programming 0.1 The Finite Horizon Case Time is discrete and indexed by t =0; 1;:::;T,whereT

More information

/97/$10.00 (c) 1997 AACC

/97/$10.00 (c) 1997 AACC Optimal Random Perturbations for Stochastic Approximation using a Simultaneous Perturbation Gradient Approximation 1 PAYMAN SADEGH, and JAMES C. SPALL y y Dept. of Mathematical Modeling, Technical University

More information

Optimization Tutorial 1. Basic Gradient Descent

Optimization Tutorial 1. Basic Gradient Descent E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.

More information

2 light traffic derivatives for the GI/G/ queue. We shall see that our proof of analyticity is mainly based on some recursive formulas very similar to

2 light traffic derivatives for the GI/G/ queue. We shall see that our proof of analyticity is mainly based on some recursive formulas very similar to Analyticity of Single-Server Queues in Light Traffic Jian-Qiang Hu Manufacturing Engineering Department Boston University Cummington Street Boston, MA 0225 September 993; Revised May 99 Abstract Recently,

More information

NEW FRONTIERS IN APPLIED PROBABILITY

NEW FRONTIERS IN APPLIED PROBABILITY J. Appl. Prob. Spec. Vol. 48A, 209 213 (2011) Applied Probability Trust 2011 NEW FRONTIERS IN APPLIED PROBABILITY A Festschrift for SØREN ASMUSSEN Edited by P. GLYNN, T. MIKOSCH and T. ROLSKI Part 4. Simulation

More information

TOWARDS BETTER MULTI-CLASS PARAMETRIC-DECOMPOSITION APPROXIMATIONS FOR OPEN QUEUEING NETWORKS

TOWARDS BETTER MULTI-CLASS PARAMETRIC-DECOMPOSITION APPROXIMATIONS FOR OPEN QUEUEING NETWORKS TOWARDS BETTER MULTI-CLASS PARAMETRIC-DECOMPOSITION APPROXIMATIONS FOR OPEN QUEUEING NETWORKS by Ward Whitt AT&T Bell Laboratories Murray Hill, NJ 07974-0636 March 31, 199 Revision: November 9, 199 ABSTRACT

More information

PROOF OF TWO MATRIX THEOREMS VIA TRIANGULAR FACTORIZATIONS ROY MATHIAS

PROOF OF TWO MATRIX THEOREMS VIA TRIANGULAR FACTORIZATIONS ROY MATHIAS PROOF OF TWO MATRIX THEOREMS VIA TRIANGULAR FACTORIZATIONS ROY MATHIAS Abstract. We present elementary proofs of the Cauchy-Binet Theorem on determinants and of the fact that the eigenvalues of a matrix

More information

LECTURE 12 UNIT ROOT, WEAK CONVERGENCE, FUNCTIONAL CLT

LECTURE 12 UNIT ROOT, WEAK CONVERGENCE, FUNCTIONAL CLT MARCH 29, 26 LECTURE 2 UNIT ROOT, WEAK CONVERGENCE, FUNCTIONAL CLT (Davidson (2), Chapter 4; Phillips Lectures on Unit Roots, Cointegration and Nonstationarity; White (999), Chapter 7) Unit root processes

More information

Operations Research Letters. Instability of FIFO in a simple queueing system with arbitrarily low loads

Operations Research Letters. Instability of FIFO in a simple queueing system with arbitrarily low loads Operations Research Letters 37 (2009) 312 316 Contents lists available at ScienceDirect Operations Research Letters journal homepage: www.elsevier.com/locate/orl Instability of FIFO in a simple queueing

More information

IEOR 6711, HMWK 5, Professor Sigman

IEOR 6711, HMWK 5, Professor Sigman IEOR 6711, HMWK 5, Professor Sigman 1. Semi-Markov processes: Consider an irreducible positive recurrent discrete-time Markov chain {X n } with transition matrix P (P i,j ), i, j S, and finite state space.

More information

Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains.

Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains. Institute for Applied Mathematics WS17/18 Massimiliano Gubinelli Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains. [version 1, 2017.11.1] We introduce

More information

Notes on Iterated Expectations Stephen Morris February 2002

Notes on Iterated Expectations Stephen Morris February 2002 Notes on Iterated Expectations Stephen Morris February 2002 1. Introduction Consider the following sequence of numbers. Individual 1's expectation of random variable X; individual 2's expectation of individual

More information

Developing an Algorithm for LP Preamble to Section 3 (Simplex Method)

Developing an Algorithm for LP Preamble to Section 3 (Simplex Method) Moving from BFS to BFS Developing an Algorithm for LP Preamble to Section (Simplex Method) We consider LP given in standard form and let x 0 be a BFS. Let B ; B ; :::; B m be the columns of A corresponding

More information

Acknowledgements I wish to thank in a special way Prof. Salvatore Nicosia and Dr. Paolo Valigi whose help and advices have been crucial for this work.

Acknowledgements I wish to thank in a special way Prof. Salvatore Nicosia and Dr. Paolo Valigi whose help and advices have been crucial for this work. Universita degli Studi di Roma \Tor Vergata" Modeling and Control of Discrete Event Dynamic Systems (Modellazione e Controllo di Sistemi Dinamici a Eventi Discreti) Francesco Martinelli Tesi sottomessa

More information

In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and. Convergence of Indirect Adaptive. Andrew G.

In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and. Convergence of Indirect Adaptive. Andrew G. In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and J. Alspector, (Eds.). Morgan Kaufmann Publishers, San Fancisco, CA. 1994. Convergence of Indirect Adaptive Asynchronous

More information

`First Come, First Served' can be unstable! Thomas I. Seidman. Department of Mathematics and Statistics. University of Maryland Baltimore County

`First Come, First Served' can be unstable! Thomas I. Seidman. Department of Mathematics and Statistics. University of Maryland Baltimore County revision2: 9/4/'93 `First Come, First Served' can be unstable! Thomas I. Seidman Department of Mathematics and Statistics University of Maryland Baltimore County Baltimore, MD 21228, USA e-mail: hseidman@math.umbc.edui

More information

Data analysis and stochastic modeling

Data analysis and stochastic modeling Data analysis and stochastic modeling Lecture 7 An introduction to queueing theory Guillaume Gravier guillaume.gravier@irisa.fr with a lot of help from Paul Jensen s course http://www.me.utexas.edu/ jensen/ormm/instruction/powerpoint/or_models_09/14_queuing.ppt

More information

Sensitivity Analysis for Discrete-Time Randomized Service Priority Queues

Sensitivity Analysis for Discrete-Time Randomized Service Priority Queues Sensitivity Analysis for Discrete-Time Randomized Service Priority Queues George Kesidis 1, Takis Konstantopoulos 2, Michael Zazanis 3 1. Elec. & Comp. Eng. Dept, University of Waterloo, Waterloo, ON,

More information

Q = (c) Assuming that Ricoh has been working continuously for 7 days, what is the probability that it will remain working at least 8 more days?

Q = (c) Assuming that Ricoh has been working continuously for 7 days, what is the probability that it will remain working at least 8 more days? IEOR 4106: Introduction to Operations Research: Stochastic Models Spring 2005, Professor Whitt, Second Midterm Exam Chapters 5-6 in Ross, Thursday, March 31, 11:00am-1:00pm Open Book: but only the Ross

More information

N.G. Dueld. M. Kelbert. Yu.M. Suhov. In this paper we consider a queueing network model under an arrivalsynchronization

N.G. Dueld. M. Kelbert. Yu.M. Suhov. In this paper we consider a queueing network model under an arrivalsynchronization THE BRANCHING DIFFUSION APPROIMATION FOR A MODEL OF A SYNCHRONIZED QUEUEING NETWORK N.G. Dueld AT&T Laboratories, Room 2C-323, 600 Mountain Avenue, Murray Hill, NJ 07974, USA. E-mail: duffield@research.att.com

More information

Completion Time in Dynamic PERT Networks 57 job are nished, as well as that the associated service station has processed the same activity of the prev

Completion Time in Dynamic PERT Networks 57 job are nished, as well as that the associated service station has processed the same activity of the prev Scientia Iranica, Vol. 14, No. 1, pp 56{63 c Sharif University of Technology, February 2007 Project Completion Time in Dynamic PERT Networks with Generating Projects A. Azaron 1 and M. Modarres In this

More information

7 Variance Reduction Techniques

7 Variance Reduction Techniques 7 Variance Reduction Techniques In a simulation study, we are interested in one or more performance measures for some stochastic model. For example, we want to determine the long-run average waiting time,

More information

Carnegie Mellon University Forbes Ave. Pittsburgh, PA 15213, USA. fmunos, leemon, V (x)ln + max. cost functional [3].

Carnegie Mellon University Forbes Ave. Pittsburgh, PA 15213, USA. fmunos, leemon, V (x)ln + max. cost functional [3]. Gradient Descent Approaches to Neural-Net-Based Solutions of the Hamilton-Jacobi-Bellman Equation Remi Munos, Leemon C. Baird and Andrew W. Moore Robotics Institute and Computer Science Department, Carnegie

More information

Solution: The process is a compound Poisson Process with E[N (t)] = λt/p by Wald's equation.

Solution: The process is a compound Poisson Process with E[N (t)] = λt/p by Wald's equation. Solutions Stochastic Processes and Simulation II, May 18, 217 Problem 1: Poisson Processes Let {N(t), t } be a homogeneous Poisson Process on (, ) with rate λ. Let {S i, i = 1, 2, } be the points of the

More information

Least Mean Square Algorithms With Markov Regime-Switching Limit

Least Mean Square Algorithms With Markov Regime-Switching Limit IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 50, NO. 5, MAY 2005 577 Least Mean Square Algorithms With Markov Regime-Switching Limit G. George Yin, Fellow, IEEE, and Vikram Krishnamurthy, Fellow, IEEE

More information

= w 2. w 1. B j. A j. C + j1j2

= w 2. w 1. B j. A j. C + j1j2 Local Minima and Plateaus in Multilayer Neural Networks Kenji Fukumizu and Shun-ichi Amari Brain Science Institute, RIKEN Hirosawa 2-, Wako, Saitama 35-098, Japan E-mail: ffuku, amarig@brain.riken.go.jp

More information

Stochastic Processes

Stochastic Processes Introduction and Techniques Lecture 4 in Financial Mathematics UiO-STK4510 Autumn 2015 Teacher: S. Ortiz-Latorre Stochastic Processes 1 Stochastic Processes De nition 1 Let (E; E) be a measurable space

More information

Notes on Time Series Modeling

Notes on Time Series Modeling Notes on Time Series Modeling Garey Ramey University of California, San Diego January 17 1 Stationary processes De nition A stochastic process is any set of random variables y t indexed by t T : fy t g

More information

Statistical Learning Theory

Statistical Learning Theory Statistical Learning Theory Fundamentals Miguel A. Veganzones Grupo Inteligencia Computacional Universidad del País Vasco (Grupo Inteligencia Vapnik Computacional Universidad del País Vasco) UPV/EHU 1

More information

Power Domains and Iterated Function. Systems. Abbas Edalat. Department of Computing. Imperial College of Science, Technology and Medicine

Power Domains and Iterated Function. Systems. Abbas Edalat. Department of Computing. Imperial College of Science, Technology and Medicine Power Domains and Iterated Function Systems Abbas Edalat Department of Computing Imperial College of Science, Technology and Medicine 180 Queen's Gate London SW7 2BZ UK Abstract We introduce the notion

More information

Lecture 2: Review of Prerequisites. Table of contents

Lecture 2: Review of Prerequisites. Table of contents Math 348 Fall 217 Lecture 2: Review of Prerequisites Disclaimer. As we have a textbook, this lecture note is for guidance and supplement only. It should not be relied on when preparing for exams. In this

More information

Balance properties of multi-dimensional words

Balance properties of multi-dimensional words Theoretical Computer Science 273 (2002) 197 224 www.elsevier.com/locate/tcs Balance properties of multi-dimensional words Valerie Berthe a;, Robert Tijdeman b a Institut de Mathematiques de Luminy, CNRS-UPR

More information

Recap. Probability, stochastic processes, Markov chains. ELEC-C7210 Modeling and analysis of communication networks

Recap. Probability, stochastic processes, Markov chains. ELEC-C7210 Modeling and analysis of communication networks Recap Probability, stochastic processes, Markov chains ELEC-C7210 Modeling and analysis of communication networks 1 Recap: Probability theory important distributions Discrete distributions Geometric distribution

More information

Numerical Solution of Hybrid Fuzzy Dierential Equation (IVP) by Improved Predictor-Corrector Method

Numerical Solution of Hybrid Fuzzy Dierential Equation (IVP) by Improved Predictor-Corrector Method Available online at http://ijim.srbiau.ac.ir Int. J. Industrial Mathematics Vol. 1, No. 2 (2009)147-161 Numerical Solution of Hybrid Fuzzy Dierential Equation (IVP) by Improved Predictor-Corrector Method

More information

STATIC LECTURE 4: CONSTRAINED OPTIMIZATION II - KUHN TUCKER THEORY

STATIC LECTURE 4: CONSTRAINED OPTIMIZATION II - KUHN TUCKER THEORY STATIC LECTURE 4: CONSTRAINED OPTIMIZATION II - KUHN TUCKER THEORY UNIVERSITY OF MARYLAND: ECON 600 1. Some Eamples 1 A general problem that arises countless times in economics takes the form: (Verbally):

More information

Fuzzy and Non-deterministic Automata Ji Mo ko January 29, 1998 Abstract An existence of an isomorphism between a category of fuzzy automata and a cate

Fuzzy and Non-deterministic Automata Ji Mo ko January 29, 1998 Abstract An existence of an isomorphism between a category of fuzzy automata and a cate University of Ostrava Institute for Research and Applications of Fuzzy Modeling Fuzzy and Non-deterministic Automata Ji Mo ko Research report No. 8 November 6, 1997 Submitted/to appear: { Supported by:

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.262 Discrete Stochastic Processes Midterm Quiz April 6, 2010 There are 5 questions, each with several parts.

More information

A Stable Finite Dierence Ansatz for Higher Order Dierentiation of Non-Exact. Data. Bob Anderssen and Frank de Hoog,

A Stable Finite Dierence Ansatz for Higher Order Dierentiation of Non-Exact. Data. Bob Anderssen and Frank de Hoog, A Stable Finite Dierence Ansatz for Higher Order Dierentiation of Non-Exact Data Bob Anderssen and Frank de Hoog, CSIRO Division of Mathematics and Statistics, GPO Box 1965, Canberra, ACT 2601, Australia

More information

A Simple Solution for the M/D/c Waiting Time Distribution

A Simple Solution for the M/D/c Waiting Time Distribution A Simple Solution for the M/D/c Waiting Time Distribution G.J.Franx, Universiteit van Amsterdam November 6, 998 Abstract A surprisingly simple and explicit expression for the waiting time distribution

More information

ECE 3511: Communications Networks Theory and Analysis. Fall Quarter Instructor: Prof. A. Bruce McDonald. Lecture Topic

ECE 3511: Communications Networks Theory and Analysis. Fall Quarter Instructor: Prof. A. Bruce McDonald. Lecture Topic ECE 3511: Communications Networks Theory and Analysis Fall Quarter 2002 Instructor: Prof. A. Bruce McDonald Lecture Topic Introductory Analysis of M/G/1 Queueing Systems Module Number One Steady-State

More information

Adaptive linear quadratic control using policy. iteration. Steven J. Bradtke. University of Massachusetts.

Adaptive linear quadratic control using policy. iteration. Steven J. Bradtke. University of Massachusetts. Adaptive linear quadratic control using policy iteration Steven J. Bradtke Computer Science Department University of Massachusetts Amherst, MA 01003 bradtke@cs.umass.edu B. Erik Ydstie Department of Chemical

More information

Ergodic Subgradient Descent

Ergodic Subgradient Descent Ergodic Subgradient Descent John Duchi, Alekh Agarwal, Mikael Johansson, Michael Jordan University of California, Berkeley and Royal Institute of Technology (KTH), Sweden Allerton Conference, September

More information

Program in Statistics & Operations Research. Princeton University. Princeton, NJ March 29, Abstract

Program in Statistics & Operations Research. Princeton University. Princeton, NJ March 29, Abstract An EM Approach to OD Matrix Estimation Robert J. Vanderbei James Iannone Program in Statistics & Operations Research Princeton University Princeton, NJ 08544 March 29, 994 Technical Report SOR-94-04 Abstract

More information

Fixed Term Employment Contracts. in an Equilibrium Search Model

Fixed Term Employment Contracts. in an Equilibrium Search Model Supplemental material for: Fixed Term Employment Contracts in an Equilibrium Search Model Fernando Alvarez University of Chicago and NBER Marcelo Veracierto Federal Reserve Bank of Chicago This document

More information

Markov decision processes and interval Markov chains: exploiting the connection

Markov decision processes and interval Markov chains: exploiting the connection Markov decision processes and interval Markov chains: exploiting the connection Mingmei Teo Supervisors: Prof. Nigel Bean, Dr Joshua Ross University of Adelaide July 10, 2013 Intervals and interval arithmetic

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

Spurious Chaotic Solutions of Dierential. Equations. Sigitas Keras. September Department of Applied Mathematics and Theoretical Physics

Spurious Chaotic Solutions of Dierential. Equations. Sigitas Keras. September Department of Applied Mathematics and Theoretical Physics UNIVERSITY OF CAMBRIDGE Numerical Analysis Reports Spurious Chaotic Solutions of Dierential Equations Sigitas Keras DAMTP 994/NA6 September 994 Department of Applied Mathematics and Theoretical Physics

More information

Distributed Learning based on Entropy-Driven Game Dynamics

Distributed Learning based on Entropy-Driven Game Dynamics Distributed Learning based on Entropy-Driven Game Dynamics Bruno Gaujal joint work with Pierre Coucheney and Panayotis Mertikopoulos Inria Aug., 2014 Model Shared resource systems (network, processors)

More information

Lecture 5. 1 Chung-Fuchs Theorem. Tel Aviv University Spring 2011

Lecture 5. 1 Chung-Fuchs Theorem. Tel Aviv University Spring 2011 Random Walks and Brownian Motion Tel Aviv University Spring 20 Instructor: Ron Peled Lecture 5 Lecture date: Feb 28, 20 Scribe: Yishai Kohn In today's lecture we return to the Chung-Fuchs theorem regarding

More information

Asymptotics for Polling Models with Limited Service Policies

Asymptotics for Polling Models with Limited Service Policies Asymptotics for Polling Models with Limited Service Policies Woojin Chang School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA 30332-0205 USA Douglas G. Down Department

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information