Asynchronous Control for Coupled Markov Decision Systems


INFORMATION THEORY WORKSHOP (ITW) 2012

Michael J. Neely, University of Southern California

Abstract—This paper considers optimal control for a collection of separate Markov decision systems that operate asynchronously over their own state spaces. Decisions at each system affect: (i) the time spent in the current state, (ii) a vector of penalties incurred, and (iii) the next-state transition probabilities. An example is a network of smart devices that perform separate tasks but share a common wireless channel. The model can also be applied to data center scheduling and to various types of cyber-physical networks. The combined state space grows exponentially with the number of systems. However, a simple strategy is developed where each system makes separate decisions. Total complexity grows only linearly in the number of systems, and the resulting performance can be pushed arbitrarily close to optimal.

I. INTRODUCTION

This paper considers control for a collection of coupled systems. Each system is a semi-Markov decision process that operates in continuous time over its own state space. Decisions at each system affect the time spent in each state, the transition probabilities to the next state, and a vector of penalties or rewards. The systems are coupled through constraints on the sum of time averages of their penalties and rewards.

An example is a collection of smart devices that repeatedly perform complex tasks such as image or video processing, compression, or other types of computation. These tasks may also generate or request data for wireless transmission. Each device has a state space that corresponds to different task functions and/or different energy saving modes of operation. Decisions in each state affect energy expenditure, computation time, and the amount of data generated or requested for wireless communication. The state transition times are not synchronized across devices.
Further, the devices are coupled through the multi-access constraints of the wireless network. This presents a challenging and important problem of asynchronous control of coupled Markov decision systems. Such problems also arise in data center scheduling and in control of cyber-physical networks.

This paper demonstrates that optimality can be achieved by separate controllers at each system. While the size of the combined state space vector grows exponentially in the number of systems, the solution complexity grows only linearly. Indeed, the complexity of the controller at each system depends on the size of its own state space. Thus, the solution can be used even when the number of systems is large, provided that the state space of each system is small.

In Section IV a nonlinear program for the optimal control policy is derived. The problem is non-convex and has fractional terms with different denominators. This is more complex than a linear program or a linear fractional program. General problems of this type are intractable. However, the problem under study has special structure that allows an optimal solution. It is shown to be equivalent to a linear program via a nonlinear change of variables. This change of variables is inspired by techniques used in [1][2] to solve linear fractional programs associated with (single) unconstrained semi-Markov decision systems. The current work can be viewed as a generalization of [1][2] to the case of multiple asynchronous systems with multiple coupled constraints. The linear programming formulation assumes all underlying probabilities of the system are known. Section V treats a more complex scenario where each system can observe a vector of random events with a possibly unknown probability distribution (such as a vector of wireless channel states used for opportunistic transmission).

This material is supported in part by one or more of: the NSF Career grant CCF-0747525, and the Network Science Collaborative Technology Alliance sponsored by the U.S. Army Research Laboratory W911NF-09-2-0053.
Learning-based approaches to discrete time Markov decision problems are considered in [3] using a 2-timescale analysis and in [4] using policy gradients. The current paper takes a different approach that utilizes Lyapunov optimization theory. It builds on the Lyapunov method for optimizing renewal systems in [5] and semi-Markov decision systems in [6]. The result in [6] treats a single Markov system and uses a more complex bisection routine to evaluate a drift-plus-penalty ratio expression. The current paper uses a change of variables that results in a drift-plus-penalty expression without a ratio, and hence does not require a bisection step. The current paper is also related to recent work in [7] that treats asynchronous scheduling at a data center. The work in [7] develops an online policy for asynchronous control, but treats a simpler class of systems that do not have an embedded Markov structure.

II. SYSTEM MODEL

Consider a collection of S separate Markovian systems, where S is a positive integer. Define S = {1, ..., S}. Each system s ∈ S has a finite state space K^(s) and operates in continuous time. The timeline for each system is segmented into back-to-back intervals called frames. Each frame represents the time spent in one state. The size of each frame can vary depending on random events and control actions. Let {T^(s)[r]}_{r=0}^∞ be the sequence of frame sizes for system s, where r is a frame index in the set {0, 1, 2, ...}. Frame boundaries are not necessarily synchronized across systems. Let k^(s)[r] be the state of system s during frame r. At the beginning of each frame r, the system observes a random event ω^(s)[r] that takes values in some abstract event space Ω^(s). It then chooses a control action α^(s)[r] ∈ A^(s), where A^(s) is

an abstract set of possible actions for system s. The 3-tuple (k^(s)[r], ω^(s)[r], α^(s)[r]) determines:

- The frame size T^(s)[r].
- A vector of L+1 penalties for frame r, for some non-negative integer L. This penalty vector has the form: y^(s)[r] = (y_0^(s)[r], y_1^(s)[r], ..., y_L^(s)[r]).
- The next-state transition probabilities P_ij^(s)[r] (assuming that i = k^(s)[r] is the current state for system s).

These are given by functions T̂^(s)(·), ŷ_l^(s)(·), P̂_ij^(s)(·):

T^(s)[r] = T̂^(s)(k^(s)[r], ω^(s)[r], α^(s)[r])
y_l^(s)[r] = ŷ_l^(s)(k^(s)[r], ω^(s)[r], α^(s)[r])  ∀l ∈ {0, 1, ..., L}
P_ij^(s)[r] = P̂_ij^(s)(ω^(s)[r], α^(s)[r])  ∀i, j ∈ K^(s)

A. Assumptions

For simplicity of exposition, assume that for each s ∈ S, the sets A^(s) and Ω^(s) are finite. Assume that the ω^(s)[r] processes are independent across systems. Further, for each system s ∈ S, the processes {ω^(s)[r]}_{r=0}^∞ are independent and identically distributed (i.i.d.) across frames r ∈ {0, 1, 2, ...}. For each ω ∈ Ω^(s), define π^(s)(ω) = Pr[ω^(s)[r] = ω]. The transition probabilities are non-negative and satisfy the following for all (i, ω, α):

Σ_{j∈K^(s)} P̂_ij^(s)(ω, α) = 1  ∀s ∈ S, ∀i ∈ K^(s)

The frame sizes are assumed to be bounded by some positive minimum and maximum values T_min and T_max for all (k, ω, α):

T_min ≤ T̂^(s)(k, ω, α) ≤ T_max

The penalties can be positive, negative, or zero (negative penalties can be used to represent rewards), and are bounded by some finite minimum and maximum values y_{l,min}, y_{l,max} for all (k, ω, α):

y_{l,min} ≤ ŷ_l^(s)(k, ω, α) ≤ y_{l,max}

B. Optimization Objective

The time average penalty of type l ∈ {0, 1, ..., L} incurred by system s up to frame R is given by:

(Σ_{r=0}^{R-1} y_l^(s)[r]) / (Σ_{r=0}^{R-1} T^(s)[r])

Multiplying the numerator and denominator of the above expression by 1/R and taking a limit as R → ∞ gives an expression for the time average penalty of type l in system s:

ȳ_l^(s) / T̄^(s)

where ȳ_l^(s) is a frame average that is defined:

ȳ_l^(s) = lim_{R→∞} (1/R) Σ_{r=0}^{R-1} y_l^(s)[r]

and T̄^(s) is defined similarly. At the beginning of the rth frame for system s, the controller observes the random event ω^(s)[r] and chooses an action α^(s)[r] ∈ A^(s).
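The identity between the running time average and the ratio of frame averages can be checked numerically; the following is a minimal numpy sketch with illustrative random frame data (none of the numbers come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
R = 100_000
# Illustrative frame data for one system: frame sizes in [T_min, T_max]
# and a per-frame penalty (values are arbitrary for the example).
T = rng.uniform(0.5, 2.0, size=R)        # frame sizes T[r]
y = rng.normal(1.0, 0.3, size=R) * T     # type-l penalty over frame r

time_avg = y.sum() / T.sum()             # time average up to frame R
ratio = y.mean() / T.mean()              # ratio of frame averages (1/R in num. and denom.)
assert abs(time_avg - ratio) < 1e-9
```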
The goal is to design decision-making policies for each system so that the resulting time averages solve the following optimization problem:

Minimize: Σ_{s∈S} ȳ_0^(s) / T̄^(s)    (1)
Subject to: Σ_{s∈S} c_l^(s) ȳ_l^(s) / T̄^(s) ≤ d_l  ∀l ∈ {1, ..., L}    (2)
α^(s)[r] ∈ A^(s)  ∀s ∈ S, ∀r ∈ {0, 1, 2, ...}    (3)

where c_l^(s), d_l are given real numbers for l ∈ {1, ..., L} and s ∈ S. It is assumed throughout that the constraints of problem (1)-(3) are feasible. For simplicity, it is assumed that each system s ∈ S has a state k_0^(s) ∈ K^(s) that is positive recurrent under any stationary policy for choosing α^(s)[r]. This occurs, for example, when each state has a positive probability of transitioning to state k_0^(s) under any (ω, α). This assumption is not crucial, but simplifies some technical details. In particular, it can be shown that it ensures the initial states of the system do not affect optimality. Such a state often naturally exists when systems have an idle state that is returned to infinitely often.

III. AN EXAMPLE NETWORK OF SMART DEVICES

Consider a network of M wireless smart devices. Each device contains two embedded chips: a processing chip and a communication chip. The processing chip operates over variable length frames and is used for computation and task processing. The communication chip operates over fixed frame sizes and is used for wireless transmission and reception over one of L possible transmission links. The processing chip at each device m ∈ {1, ..., M} is assumed to have three states:

K^(m) = {idle, processing mode 1, processing mode 2}

The different states can represent different functionalities or tasks that the chip performs, and/or different energy-saving modes that affect computation time and energy expenditure. Let A^(m) be an abstract space of processing actions for each device m ∈ {1, ..., M}. For simplicity, assume there is no random event process ω^(m)[r] for these chips.
The action α^(m)[r] at device m affects the energy expenditure e^(m)[r], the frame duration T^(m)[r], transition probabilities to the next state, and generates b_l^(m)[r] bits for transmission over link l:

e^(m)[r] = ê^(m)(k^(m)[r], α^(m)[r])
T^(m)[r] = T̂^(m)(k^(m)[r], α^(m)[r])
b_l^(m)[r] = b̂_l^(m)(k^(m)[r], α^(m)[r])  ∀l ∈ {1, ..., L}

Finally, define an (M+1)th system that represents all of the L wireless links. This system operates in discrete time with fixed frame sizes T^(M+1)[r] = 1 for all r ∈ {0, 1, 2, ...}, and has only a single Markov state (so that system M+1 has no Markov dynamics). However, this system has a time-varying channel state process ω^(M+1)[r] = (η_1[r], ..., η_L[r]), where η_l[r] represents the state of wireless channel l on frame r. Let A^(M+1) represent

the set of transmission/reception control actions on each frame (for example, this set might restrict the network to transmit over only one link per frame). Let e^(M+1)[r] and μ_l[r] be the energy expended and bits transmitted over link l on frame r:

e^(M+1)[r] = ê^(M+1)(ω^(M+1)[r], α^(M+1)[r])
μ_l[r] = μ̂_l(ω^(M+1)[r], α^(M+1)[r])  ∀l ∈ {1, ..., L}

The goal is to operate each system to minimize total average power expenditure subject to transmission rate constraints:

Minimize: ē^(M+1) + Σ_{m=1}^M ē^(m) / T̄^(m)    (4)
Subject to: Σ_{m=1}^M b̄_l^(m) / T̄^(m) ≤ μ̄_l  ∀l ∈ {1, ..., L}    (5)
α^(m)[r] ∈ A^(m)    (6)

where the final constraint α^(m)[r] ∈ A^(m) holds for all m ∈ {1, ..., M+1} and all r ∈ {0, 1, 2, ...}.

IV. THE NONLINEAR PROGRAM TRANSFORMED

To begin, first assume there are no random event processes ω^(s)[r]. It can be shown that the problem (1)-(3) can be solved by stationary and randomized algorithms (see related results in [8][9][12]). Specifically, each system s ∈ S observes its current state k^(s)[r] and independently chooses a control action α^(s)[r] according to a probability distribution p_k^(s)(α):

Pr[α^(s)[r] = α | k^(s)[r] = k] = p_k^(s)(α)

The p_k^(s)(α) probabilities are non-negative and sum to 1:

Σ_{α∈A^(s)} p_k^(s)(α) = 1  ∀k ∈ K^(s)

The fraction of frames that system s spends in each state under this policy can be viewed as a steady state distribution that satisfies a global balance equation. A standard trick is to define variables φ^(s)(k, α) that intuitively represent the steady state probability that system s is in state k and chooses action α. They should satisfy (see, for example, [1][8][9][12]):

Σ_{α∈A^(s)} φ^(s)(k, α) = Σ_{i∈K^(s), α∈A^(s)} φ^(s)(i, α) P̂_ik^(s)(α)    (7)
φ^(s)(k, α) ≥ 0    (8)
Σ_{k∈K^(s), α∈A^(s)} φ^(s)(k, α) = 1    (9)

where (7) is for all k ∈ K^(s), and (8) is for all k ∈ K^(s), α ∈ A^(s). Constraint (7) can be interpreted as a balance equation. Its left-hand-side represents the steady state probability that system s is in state k. Its right-hand-side represents the probability of transitioning into state k in the next frame.
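For a single toy system, the balance property of φ^(s)(k, α) = π(k) p_k(α) can be verified numerically, where π is the stationary distribution of the embedded chain under the policy. A sketch with illustrative random kernels and a random policy (all sizes and values are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
K, A = 3, 2   # toy state/action space sizes (illustrative)

# Random transition kernels P_hat[a][i][j] and a random stationary policy p[k][a].
P_hat = rng.random((A, K, K))
P_hat /= P_hat.sum(axis=2, keepdims=True)
p = rng.random((K, A))
p /= p.sum(axis=1, keepdims=True)

# Embedded-chain transition matrix under the policy.
P = np.einsum('ka,akj->kj', p, P_hat)

# Stationary distribution pi: left eigenvector of P for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()

# phi(k, a) = pi(k) * p_k(a) satisfies the balance equation (7) and sums to 1 (9).
phi = pi[:, None] * p
lhs = phi.sum(axis=1)                        # sum over a of phi(k, a)
rhs = np.einsum('ia,aik->k', phi, P_hat)     # sum over i, a of phi(i, a) P_hat_ik(a)
assert np.allclose(lhs, rhs) and np.isclose(phi.sum(), 1.0)
```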
It should be noted that this steady state is with respect to frame averages (corresponding to the steady state of the embedded Markov chain), and is not the same as the time average steady state (which would also include the time spent in each state). Given values φ^(s)(k, α) that satisfy (7)-(9), one can define a stationary randomized policy by:

p_k^(s)(α) = φ^(s)(k, α) / Σ_{β∈A^(s)} φ^(s)(k, β)

This gives rise to the following nonlinear program for computing the optimal stationary policy for problem (1)-(3):

Minimize: Σ_{s∈S} [Σ_{k,α} φ^(s)(k, α) ŷ_0^(s)(k, α)] / [Σ_{k,α} φ^(s)(k, α) T̂^(s)(k, α)]    (10)
Subject to: Σ_{s∈S} c_l^(s) [Σ_{k,α} φ^(s)(k, α) ŷ_l^(s)(k, α)] / [Σ_{k,α} φ^(s)(k, α) T̂^(s)(k, α)] ≤ d_l  ∀l ∈ {1, ..., L}    (11)
φ^(s)(k, α) satisfies (7)-(9)    (12)

where the summations Σ_{k,α} above are understood to be over k ∈ K^(s), α ∈ A^(s). The above problem has variables φ^(s)(k, α) and constants c_l^(s), d_l, ŷ_l^(s)(k, α), T̂^(s)(k, α). The constraints (7)-(9) are linear in the variables φ^(s)(k, α). The problem also involves fractional terms where the numerators and denominators are linear functions of the variables φ^(s)(k, α). Problems with fractional terms with different denominators are non-convex and are generally intractable. However, all fractional terms in the problem above have the same denominator for each system s ∈ S. This property is exploited in the first result below, which transforms the problem via a nonlinear change of variables. This change of variables is inspired by similar techniques in [1][2] which treat (single) unconstrained semi-Markov systems.

Consider the following linear program defined over new variables γ^(s)(k, α) for s ∈ S, k ∈ K^(s), α ∈ A^(s):

Minimize: Σ_{s∈S} Σ_{k,α} γ^(s)(k, α) ŷ_0^(s)(k, α)    (13)
Subject to: Σ_{s∈S} Σ_{k,α} γ^(s)(k, α) c_l^(s) ŷ_l^(s)(k, α) ≤ d_l  ∀l ∈ {1, ..., L}    (14)
Σ_α γ^(s)(k, α) = Σ_{i,α} γ^(s)(i, α) P̂_ik^(s)(α)    (15)
γ^(s)(k, α) ≥ 0    (16)
Σ_{k,α} γ^(s)(k, α) T̂^(s)(k, α) = 1    (17)

where summations Σ_α and Σ_{k,α} are understood to be over α ∈ A^(s) and k ∈ K^(s). The constraints (15) are for all s ∈ S, k ∈ K^(s), the constraints (16) are for all s ∈ S, k ∈ K^(s), α ∈ A^(s), and the constraints (17) are for all s ∈ S.

Theorem 1: The optimal objective function value is the same for the original problem (10)-(12) and the new problem (13)-(17).
Further, if γ^(s)(k, α) are variables that solve the new problem, then the following variables φ^(s)(k, α) solve the original problem:

φ^(s)(k, α) = γ^(s)(k, α) / Σ_{i∈K^(s), β∈A^(s)} γ^(s)(i, β)    (18)

Proof: Let φ^(s)(k, α) be values that solve the original problem (10)-(12), and let V_original be the value of the optimal objective function:

V_original = Σ_{s∈S} [Σ_{k,α} φ^(s)(k, α) ŷ_0^(s)(k, α)] / [Σ_{k,α} φ^(s)(k, α) T̂^(s)(k, α)]    (19)

Define:

γ^(s)(k, α) = φ^(s)(k, α) / Σ_{i∈K^(s), β∈A^(s)} φ^(s)(i, β) T̂^(s)(i, β)    (20)

and note that because the T̂^(s)(k, α) values are strictly positive and the φ^(s)(k, α) values are non-negative and sum to 1, the denominator in (20) must be positive. Because the φ^(s)(k, α) values satisfy the constraints (10)-(12), it can be shown that the γ^(s)(k, α) values defined by (20) satisfy the constraints (14)-(17). Indeed, the definition of γ^(s)(k, α) in (20) immediately implies constraint (17), non-negativity of φ^(s)(k, α) immediately implies (16), and dividing the constraint (7) by Σ_{i,β} φ^(s)(i, β) T̂^(s)(i, β) implies (15). Finally, substituting (20) into (11) and using (17) implies constraint (14). Further, by substituting (20) into (19) it is easy to see that the objective function associated with these γ^(s)(k, α) variables is equal to V_original. It follows that the optimal objective function value of the new problem is less than or equal to V_original, that is, V_new ≤ V_original, where V_new is defined as the minimum objective function value (13) for the new problem.

Now let γ^(s)(k, α) represent optimal variables that solve the new problem (13)-(17), and define φ^(s)(k, α) according to (18). By similar substitutions, it can be seen that these φ^(s)(k, α) values satisfy the constraints (10)-(12) of the original problem and produce an objective function value in (10) that is equal to V_new. Hence, V_new = V_original, and these φ^(s)(k, α) values are optimal for the original problem. □

Theorem 1 transforms the original nonlinear problem into a linear program with variables γ^(s)(k, α). Recall that there are S systems. Suppose each system has at most K_max states and an action space size of at most A_max, for some positive numbers K_max and A_max. Thus, the total number of variables γ^(s)(k, α) is at most S·K_max·A_max, which grows linearly in the number of systems. It is easy to see that the number of constraints of the linear program (13)-(17) also grows linearly in the number of systems. The total complexity is essentially the same as the complexity associated with each system separately solving its own Markov decision problem on its own state space.
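The substitution (20) at the heart of Theorem 1 can be checked numerically for a single system: for any stationary randomized policy, the linear objective (13) evaluated at γ equals the fractional objective (10) evaluated at φ, and γ inherits the balance and normalization constraints. A sketch with illustrative random data (all sizes and values are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
K, A = 3, 2   # toy single system (illustrative sizes)

P_hat = rng.random((A, K, K))
P_hat /= P_hat.sum(axis=2, keepdims=True)
T_hat = rng.uniform(0.5, 2.0, size=(K, A))   # frame sizes T_hat(k, a) > 0
y0_hat = rng.random((K, A))                  # type-0 penalties y0_hat(k, a)

# Any stationary randomized policy gives phi(k, a) = pi(k) * p_k(a).
p = rng.random((K, A))
p /= p.sum(axis=1, keepdims=True)
P = np.einsum('ka,akj->kj', p, P_hat)
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()
phi = pi[:, None] * p

# Fractional objective (10) for this policy, and the change of variables (20).
frac = (phi * y0_hat).sum() / (phi * T_hat).sum()
gamma = phi / (phi * T_hat).sum()

# Linear objective (13) at gamma equals the fractional objective, and gamma
# satisfies the normalization (17) and balance (15) constraints.
assert np.isclose((gamma * y0_hat).sum(), frac)
assert np.isclose((gamma * T_hat).sum(), 1.0)
assert np.allclose(gamma.sum(axis=1), np.einsum('ia,aik->k', gamma, P_hat))
```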
V. LYAPUNOV OPTIMIZATION

The previous section solves for the optimal conditional probabilities p_k^(s)(α), but does not treat cases when there are observed random events ω^(s)[r]. For such cases, one needs conditional probabilities p^(s)(α | ω, k). The number of ω vectors can be enormous, in which case it is not practical to consider estimating the probabilities of each and computing the optimal p^(s)(α | ω, k) probabilities. However, Lyapunov optimization can treat related problems of optimizing time averages in systems with random events, without knowing the probabilities of these events and regardless of the cardinality of the event space [5][11][12][13]. Rather than attempting to compute the optimal probabilities for every possible event, the Lyapunov policies make online decisions based on greedily minimizing a drift-plus-penalty expression. Recent work in [6] extends this by developing an online policy for a (single) semi-Markov decision system, provided that certain target information is given. Specifically, suppose that for each system s ∈ S, one is given values P_ij^(s), ȳ_{l,k}^(s), T̄_k^(s) that respectively represent desired targets for the fraction of time the embedded Markov chain transitions from i to j, the average type-l penalties incurred while in state k, and the average time spent in state k. Then one can use the online policy of Section IV in [6] to control the system and meet these targets, without requiring the probability distribution for the random events ω^(s)[r]. In the following, a Lyapunov-based algorithm for computing the optimal targets corresponding to the asynchronous control problem (1)-(3) is developed.

A. The Time Average Problem

As in [6], consider a modified collection of systems with no Markov dynamics, where state variables k^(s)[r] for system s can be chosen as decision variables every frame r. Define the following attributes q_ij^(s)[r] for all s ∈ S and i, j ∈ K^(s):

q_ij^(s)[r] = 1{k^(s)[r] = i} P̂_ij^(s)(ω^(s)[r], α^(s)[r])    (21)

where 1{k^(s)[r] = i} is an indicator function that is 1 if k^(s)[r] = i, and 0 else. Let q̄_ij^(s) be its frame average.
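The attribute q_ij^(s)[r] in (21) ties frame averages of state indicators to transition frequencies: the balance that becomes constraint (24) below can be observed empirically by simulating a toy chain under a fixed policy (the sizes and kernel below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
K, R = 3, 100_000
# Toy embedded chain with a single (fixed) action, so q_ij[r] = 1{k[r]=i} * P_hat[i][j].
P_hat = rng.random((K, K))
P_hat /= P_hat.sum(axis=1, keepdims=True)

ind_avg = np.zeros(K)        # frame average of the indicator 1{k[r] = k}
q_avg = np.zeros((K, K))     # frame average of the attributes q_ij[r]
k = 0
for _ in range(R):
    ind_avg[k] += 1.0 / R
    q_avg[k] += P_hat[k] / R
    k = rng.choice(K, p=P_hat[k])

# Balance: the frequency of state k matches sum over i of the average q_ik.
assert np.allclose(ind_avg, q_avg.sum(axis=0), atol=0.02)
```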
The problem (1)-(3) can be transformed as follows (compare with Section III in [6]):

Minimize: Σ_{s∈S} ȳ_0^(s) / T̄^(s)    (22)
Subject to: Σ_{s∈S} c_l^(s) ȳ_l^(s) / T̄^(s) ≤ d_l  ∀l ∈ {1, ..., L}    (23)
1̄_k^(s) = Σ_{i∈K^(s)} q̄_ik^(s)    (24)
k^(s)[r] ∈ K^(s)    (25)
α^(s)[r] ∈ A^(s)    (26)
k^(s)[r] is chosen before ω^(s)[r] is known    (27)

where 1̄_k^(s) denotes the frame average of the indicator 1{k^(s)[r] = k}, (24) holds for all s ∈ S, k ∈ K^(s), and (25)-(27) hold for all s ∈ S, r ∈ {0, 1, 2, ...}. The objective function (22) is identical to (1), and the constraints (23) and (26) are the same as (2)-(3). Constraint (24) is a balance equation similar to (7) and, together with (25) and (27), ensures the resulting time averages can actually be achieved on the Markov decision system. Constraint (27) is subtle, and ensures the decisions k^(s)[r] are independent of ω^(s)[r].

Now consider the following transformed problem, similar in spirit to the transformation of the previous section: For each system s ∈ S, define new variables θ^(s)[r] that are chosen every frame r ∈ {0, 1, 2, ...} over the interval [1/T_max, 1/T_min]. Consider the problem:

Minimize: Σ_{s∈S} ⟨θ^(s) y_0^(s)⟩    (28)
Subject to: Σ_{s∈S} c_l^(s) ⟨θ^(s) y_l^(s)⟩ ≤ d_l  ∀l ∈ {1, ..., L}    (29)
⟨θ^(s) 1_k^(s)⟩ = Σ_{i∈K^(s)} ⟨θ^(s) q_ik^(s)⟩    (30)
⟨θ^(s) T^(s)⟩ = 1    (31)
k^(s)[r] ∈ K^(s)    (32)
α^(s)[r] ∈ A^(s)    (33)
1/T_max ≤ θ^(s)[r] ≤ 1/T_min    (34)
k^(s)[r] is chosen before ω^(s)[r] is known    (35)

where ⟨·⟩ denotes a frame average, so that:

⟨θ^(s) y_l^(s)⟩ = lim_{R→∞} (1/R) Σ_{r=0}^{R-1} θ^(s)[r] y_l^(s)[r]

and the frame averages ⟨θ^(s) T^(s)⟩, ⟨θ^(s) q_ik^(s)⟩, ⟨θ^(s) 1_k^(s)⟩ are defined similarly (⟨·⟩ denoting a frame average). It can be shown that the original problem (22)-(27) and the new problem (28)-(35) have the same optimal objective function value (proof omitted for brevity). Further, the solution to the new problem can be used to construct optimal targets P_ij^(s), ȳ_{l,k}^(s), T̄_k^(s) for the original problem as follows:

ȳ_{l,k}^(s) = ⟨θ^(s) y_l^(s) 1_k^(s)⟩ / ⟨θ^(s) 1_k^(s)⟩ ,  T̄_k^(s) = ⟨θ^(s) T^(s) 1_k^(s)⟩ / ⟨θ^(s) 1_k^(s)⟩ ,  P_ij^(s) = ⟨θ^(s) q_ij^(s)⟩ / ⟨θ^(s) 1_i^(s)⟩

B. Virtual Queues

Using the drift-plus-penalty technique of [5], the constraints (29)-(31) are treated with virtual queues Z_l[r], H_k^(s)[r], J^(s)[r] for l ∈ {1, ..., L}, s ∈ S, k ∈ K^(s):

Z_l[r+1] = max{ Z_l[r] + Σ_{s∈S} c_l^(s) θ^(s)[r] y_l^(s)[r] − d_l , 0 }
H_k^(s)[r+1] = H_k^(s)[r] + θ^(s)[r] 1_k^(s)[r] − θ^(s)[r] Σ_{i∈K^(s)} q_ik^(s)[r]
J^(s)[r+1] = J^(s)[r] + θ^(s)[r] T^(s)[r] − 1

C. The Drift-Plus-Penalty Algorithm

For a given parameter V ≥ 0, define f^(s)(k, ω, α) by:

f^(s)(k, ω, α) = V ŷ_0^(s)(k, ω, α) + Σ_{l=1}^L Z_l[r] c_l^(s) ŷ_l^(s)(k, ω, α) + H_k^(s)[r] − Σ_{j∈K^(s)} H_j^(s)[r] P̂_kj^(s)(ω, α) + J^(s)[r] T̂^(s)(k, ω, α)

Define g^(s)(θ, k, ω, α) = θ f^(s)(k, ω, α). Define B^(s) as the set of all (θ, α) values that satisfy 1/T_max ≤ θ ≤ 1/T_min, α ∈ A^(s). At the beginning of each frame r and for each s ∈ S, observe the virtual queues and perform the following:

(k^(s)[r] selection) Choose k^(s)[r] as the index k ∈ K^(s) that minimizes the following (breaking ties arbitrarily):

E{ min_{(θ,α)∈B^(s)} g^(s)(θ, k, ω^(s)[r], α) }    (36)

where the expectation above is with respect to the randomness of ω^(s)[r].

(α^(s)[r], θ^(s)[r] selection) Once the decision k^(s)[r] is made, observe the actual ω^(s)[r] and choose α^(s)[r] as the minimizer of f^(s)(k^(s)[r], ω^(s)[r], α) over all α ∈ A^(s), breaking ties arbitrarily. Then choose θ^(s)[r] by: θ^(s)[r] = 1/T_min if f^(s)(k^(s)[r], ω^(s)[r], α^(s)[r]) ≤ 0, and θ^(s)[r] = 1/T_max otherwise.

(Virtual Queue Update) Update the virtual queues according to the update equations in Section V-B.

The resulting algorithm satisfies all constraints whenever it is possible to do so, and yields an objective function that differs by O(1/V) from optimal, with a corresponding polynomial convergence time tradeoff with V [5].

D. Discussion

The above algorithm selects α^(s)[r] and θ^(s)[r] without knowledge of the distribution of ω^(s)[r]. Selection of k^(s)[r] requires evaluation of the expectation in (36). This decision is trivial in special cases such as that given in Section III, where the systems s ∈ S that have random event processes ω^(s)[r] are 1-state systems (without Markov dynamics) for which one always selects the single state, and the systems that have Markov dynamics do not have ω^(s)[r] processes (so that the expectation in (36) reduces to the deterministic minimum). In the general case, the expectation (36) can be efficiently estimated based on a collection of past samples of ω^(s)[r], as justified by the max-weight learning framework of [14].

The algorithm above can be viewed as an offline algorithm for computing desired targets and finding the optimal time average quantities given a sample sequence of observed {ω^(s)[r]} values for each system. In an online implementation where such a sample sequence is gradually observed, the algorithm acts over virtual frames that run slower than the actual system. Specifically, the operations required on the rth virtual frame cannot be performed until the ω^(s)[r] value for each system s is observed. Each observed value is stored in memory as needed. The resulting weighted averages achieved in this virtual system act as progressively updated targets that are passed into an online algorithm such as [6] that runs separately on each actual system.

REFERENCES

[1] B. Fox. Markov renewal programming by linear fractional programming. SIAM J. Appl. Math, vol. 14, no. 6, Nov. 1966.
[2] H. Mine and S. Osaki. Markovian Decision Processes. American Elsevier, New York, 1970.
[3] V. S. Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems and Control Letters (Elsevier), vol. 54, pp. 207-213, 2005.
[4] F. J. Vázquez Abad and V. Krishnamurthy. Policy gradient stochastic approximation algorithms for adaptive control of constrained time varying Markov decision processes. Proc. IEEE Conf. on Decision and Control, Dec. 2003.
[5] M. J. Neely. Stochastic Network Optimization with Application to Communication and Queueing Systems. Morgan & Claypool, 2010.
[6] M. J. Neely. Online fractional programming for Markov decision systems. Proc. Allerton Conf.
on Communication, Control, and Computing, Sept. 2011.
[7] M. J. Neely. Asynchronous scheduling for energy optimality in systems with multiple servers. Proc. 46th Conf. on Information Sciences and Systems (CISS), March 2012.
[8] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2005.
[9] E. Altman. Constrained Markov Decision Processes. Boca Raton, FL: Chapman and Hall/CRC Press, 1999.
[10] S. Ross. Introduction to Probability Models. Academic Press, 8th edition, Dec. 2002.
[11] L. Georgiadis, M. J. Neely, and L. Tassiulas. Resource allocation and cross-layer control in wireless networks. Foundations and Trends in Networking, vol. 1, no. 1, pp. 1-149, 2006.
[12] M. J. Neely, E. Modiano, and C. Li. Fairness and optimal stochastic control for heterogeneous networks. IEEE/ACM Transactions on Networking, vol. 16, no. 2, pp. 396-409, April 2008.
[13] L. Tassiulas and A. Ephremides. Dynamic server allocation to parallel queues with randomly varying connectivity. IEEE Transactions on Information Theory, vol. 39, no. 2, pp. 466-478, March 1993.
[14] M. J. Neely, S. T. Rager, and T. F. La Porta. Max weight learning algorithms for scheduling in unknown environments. IEEE Transactions on Automatic Control, to appear.
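As a self-contained illustration of the machinery in Section V, consider the degenerate case of a single system with one Markov state and unit frames (as for the link system of Section III): the H and J queues vanish, θ = 1, and the algorithm reduces to the classic drift-plus-penalty rule sketched below. All numerical values (penalty functions, V, d) are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two actions; for random event omega ~ Uniform(0,1), action a yields
# (energy y0, negative-bits y1). Rewards appear as negative penalties.
def penalties(omega):
    return [(1.0, -2.0 * omega),   # action 0: high power, high rate
            (0.2, -0.5 * omega)]   # action 1: low power, low rate

V, d, R = 50.0, -0.6, 100_000      # want avg y1 <= d, i.e. avg bits >= 0.6
Z = 0.0                            # virtual queue for the constraint
y0_sum = y1_sum = 0.0
for _ in range(R):
    omega = rng.random()
    opts = penalties(omega)
    # Greedily minimize the drift-plus-penalty expression V*y0 + Z*y1:
    a = min(range(2), key=lambda i: V * opts[i][0] + Z * opts[i][1])
    y0, y1 = opts[a]
    Z = max(Z + y1 - d, 0.0)       # virtual queue update
    y0_sum += y0
    y1_sum += y1

avg_y0, avg_y1 = y0_sum / R, y1_sum / R
# Constraint met to within O(1/V) slack; cost beats always using action 0.
assert avg_y1 <= d + 0.05 and avg_y0 < 0.7
```

The queue Z grows while the rate constraint is violated, which tilts the greedy rule toward the high-rate action on favorable events (large omega) and yields an average cost strictly below the always-transmit policy.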