Stochastic Subgradient Methods with Approximate Lagrange Multipliers
Víctor Valls, Douglas J. Leith
Trinity College Dublin

Abstract: We study the use of approximate Lagrange multipliers in the stochastic subgradient method for the dual problem in convex optimisation. The use of approximate Lagrange multipliers in the optimisation (instead of the true multipliers) is motivated by the fact that we can cast some control problems typically thought of as non-convex as convex optimisations. For example, we can solve some stochastic discrete decision-making problems by simply solving a sequence of unconstrained convex optimisations. The use of approximate multipliers also allows us to solve problems where there are constraints on the order in which control actions can be made.

Index Terms: convex optimisation, max-weight scheduling, subgradient methods, approximate multipliers, stochastic control.

I. INTRODUCTION

In this paper we study the use of approximate Lagrange multipliers in subgradient methods for constrained convex optimisation problems. One motivation for using an approximation of the Lagrange multipliers is that in some problems the exact multipliers might not be available, but a perturbed or outdated version of them is available instead. For example, in distributed optimisation the exchange of system state information might be affected by transmission delays or losses. Another motivation for using approximate Lagrange multipliers is their application to network scheduling problems involving queues. As shown in [1], it is possible to identify scaled queue occupancies with approximate Lagrange multipliers, i.e., the Lagrange multiplier λ_k generated by the subgradient method stays close to the α-scaled queue occupancy αQ_k in a network (where α is the step size used in the subgradient method).
A direct consequence of this is that the scaled queue occupancies in networks can be used as surrogates for the Lagrange multipliers in the subgradient method for the dual problem. In other words, by using approximate multipliers it is possible to cast some discrete decision-making problems as convex optimisations. Our convex approach to solving network scheduling optimisation problems is in marked contrast to approaches like max-weight [2] that are built on Lyapunov optimisation theory. Whereas max-weight approaches focus on constructing a policy that stabilises a system, we instead solve the dual problem and then derive the stability of the system from the properties of the dual problem. This way of looking at the problem allows us to construct scheduling policies in a more flexible manner than with max-weight approaches. In particular, we can design scheduling policies that have constraints on the order in which discrete actions can be selected. (This work was supported by Science Foundation Ireland under Grant No. 11/PI/1177.)

A. Related Work

The max-weight scheduling algorithm was proposed by Tassiulas and Ephremides in [2]. The algorithm is able to find an optimal scheduling policy (which consists of choosing the discrete action that minimises the sum of the squared queue backlogs) without prior knowledge of the mean packet arrival rates in the system. The max-weight algorithm has been extended in a series of papers [3], [4], [5] to consider the minimisation of a convex utility function, and to general functions in [6]. The works in [5], [6] also extend max-weight to allow convex constraints. The construction of max-weight policies with constraints (or penalties) is proposed in [7]; there the authors consider the problem of dynamic allocation of servers with switchover delay. The work in [8] studies the allocation of rates in a network while providing fairness, using a dynamical-system approach and Lyapunov stability.
In [9] the authors use a convex approach and study the optimal allocation of rates using subgradient methods. Importantly, [8] and [9] work with continuous-valued variables rather than with the discrete actions used in the max-weight approaches [2], [3], [4], [5], [6]. The works in [10], [11] are the first to establish rigorous connections between the max-weight and dual approaches in convex optimisation via Lagrange multipliers.

II. CONVEX OPTIMISATION

Consider the following convex optimisation problem P in standard form:

  minimise_{x ∈ X} f(x)   (1)
  subject to g_i(x) ≤ 0, i = 1, ..., m,   (2)

where f, g_i : X → R are convex functions and X ⊆ R^n is a convex set. We will always assume that the set X_0 := {x ∈ X : g_i(x) ≤ 0, i = 1, ..., m} is non-empty, and so problem P is feasible. Further, using standard notation we let f* := min_{x ∈ X_0} f(x) and x* ∈ argmin_{x ∈ X_0} f(x). As is usual in convex optimisation, the Lagrange dual function of problem P is given by

  q(λ) = inf_{x ∈ X} L(x, λ) = inf_{x ∈ X} { f(x) + Σ_{i=1}^m λ^(i) g_i(x) },   (3)
where λ^(j) is the j-th component of vector λ ∈ R^m_+. Recall that since q(λ) is the infimum of a set of affine functions it is concave. To keep notation short we write L(x, λ) = f(x) + λ^T g(x), where g = [g_1, ..., g_m]^T, and define X(λ) := argmin_{x ∈ X} L(x, λ) to be the set of minimisers of L(·, λ) for a given vector λ. The following two assumptions are key in our work.

Assumption 1 (Bounded Set). X is convex and bounded.

Assumption 2 (Slater Condition). relint(X_0) is non-empty, i.e., there exists a point x ∈ X such that g(x) < 0.

Assumption 1 is important because it guarantees that q is Lipschitz continuous with constant L_g := max_{x ∈ X} ||g(x)||, and Assumption 2 ensures that strong duality holds. Namely, the solution of the dual problem P_D,

  maximise_{λ ≥ 0} q(λ),   (4)

coincides with the solution of the primal problem P. That is, max_{λ ≥ 0} q(λ) =: q(λ*) = f*, where λ* ∈ argmax_{λ ≥ 0} q(λ). Another consequence of the Slater condition is that the set of dual optima, Λ* := argmax_{λ ≥ 0} q(λ), is a bounded subset of R^m_+ (see Lemma 1 in [12]).

A. Subgradient Method for the Dual Problem

Problem P_D is an unconstrained concave optimisation problem that can be solved using the subgradient method. Recall that when the dual function is differentiable the subgradient is indeed the gradient. In short, the subgradient method consists of the update

  λ_{k+1} = [λ_k + α g(x_k)]^+,   (5)

where [·]^+ = max{0, ·} componentwise, λ_1 ∈ R^m_+, g(x_k) is a subgradient of q at the point λ_k (with x_k ∈ X(λ_k)), and α > 0 is a constant step size. The subgradient method can make use of more sophisticated step-size rules, but as we will show in Section II-B, a constant step size plays a prominent role when working with averages.

Lemma 1 (Subgradient Method). Consider optimisation problem P_D and the subgradient update (5), where x_k ∈ X(λ_k) and α > 0. Suppose Assumptions 1 and 2 hold. Then, there exists a non-negative and non-increasing sequence {β_k} that converges to α(3/2)L_g² for k large enough, and β_k ≥ q(λ*) − q(λ_k) ≥ 0 for all k.
The intuition behind the lemma is that λ_k is monotonically attracted to a vector λ* ∈ Λ* until it enters a ball around Λ*, the size of which depends on the parameter α. Hence, by the Lipschitz continuity of the dual function we can expect q(λ_k) to converge (possibly not monotonically) to a ball around q(λ*). Moreover, since λ_k is monotonically attracted to a ball around Λ*, and Λ* is a bounded subset of R^m_+ by Assumption 2, it follows that λ_k is bounded for all k; see [13, Lemma 3] for an explicit bound on ||λ_k||. Figure 1 illustrates schematically the convergence of q(λ_k) to the level set q(λ*) − q(λ) ≤ α(3/2)L_g².

Fig. 1. Illustrating the convergence of the subgradient method for the dual problem with constant step size.

B. Approximate Primal Solutions Using Averages

We now proceed to show how we can recover approximate primal solutions using the subgradient method for the dual problem. First we present the following lemma, which relates bounded Lagrange multipliers to the feasibility of averages of the primal variables.

Lemma 2 (Feasible Solutions using Averages). Let {x_k} be a sequence of points in X and consider the update λ_{k+1} = [λ_k + α g(x_k)]^+. Then, g(x̄_k) ≤ λ_{k+1}/(kα), where x̄_k := (1/k) Σ_{i=1}^k x_i.

Observe from Lemma 2 that if λ_{k+1} is bounded for all k (which is the case in the subgradient method) then x̄_k is attracted to a feasible point as k → ∞. We now present one of our main results, which is a refinement of Theorem 3 in [13] with sharper bounds and a simpler proof.

Theorem 1 (Approximate Primal Solutions using Averages). Consider the subgradient method for the dual problem with constant step size α > 0, i.e.,

  λ_{k+1} = [λ_k + α g(x_k)]^+,   (6)

where x_k ∈ X(λ_k) := argmin_{x ∈ X} L(x, λ_k). Suppose Assumptions 1 and 2 hold. Then,

  −β_k − λ̃_k/(kα) ≤ f(x̄_k) − f* ≤ α L_g²/2 + ||λ_1||²/(2kα),   (7)

where λ̃_k := λ_k^T λ_{k+1}, x̄_k := (1/k) Σ_{i=1}^k x_i, and {β_k} is the sequence given by Lemma 1.

Theorem 1 says that f(x̄_k) converges to a value close to f* for k sufficiently large and α sufficiently small. Observe that if we let k → ∞ we obtain

  −α(3/2)L_g² ≤ f(x̄_k) − f* ≤ α L_g²/2,   (8)

and if we make α → 0 then f(x̄_k) → f*.
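The dual subgradient update with constant step size and the primal averaging above can be illustrated numerically. The following minimal Python sketch uses a toy problem of our own choosing (not from the paper): minimise f(x) = x² over X = [−2, 2] subject to g(x) = 1 − x ≤ 0, whose solution is x* = 1, f* = 1 with optimal multiplier λ* = 2.

```python
import numpy as np

# Toy problem (ours, not the paper's): minimise f(x) = x^2 over X = [-2, 2]
# subject to g(x) = 1 - x <= 0, so x* = 1, f* = 1 and lambda* = 2.
def dual_subgradient(alpha=0.01, iters=5000):
    lam = 0.0
    xs = []
    for _ in range(iters):
        # x_k in X(lambda_k): minimise x^2 + lam * (1 - x) over [-2, 2]
        x = float(np.clip(lam / 2.0, -2.0, 2.0))
        xs.append(x)
        # dual update: lambda_{k+1} = [lambda_k + alpha * g(x_k)]^+
        lam = max(0.0, lam + alpha * (1.0 - x))
    return lam, float(np.mean(xs))

lam, x_bar = dual_subgradient()
# lam approaches lambda* = 2; the running average x_bar approaches x* = 1,
# even though no single iterate x_k needs to be feasible along the way.
```

As the theory predicts, the individual iterates x_k chatter around the optimum while the running average x̄_k converges to a (feasible) near-optimal point, with an offset controlled by α.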
Further, since by Lemma 2 we have that g(x̄_k) ≤ 0 as k → ∞, we can conclude that x̄_k → x*. How fast the sequence {β_k} converges to α(3/2)L_g² will depend on the step size and on the characteristics of the objective function and constraints. For example, if q(λ) were strongly concave we could use second-order methods and obtain a sequence {β_k} that converges quickly to α(3/2)L_g².

Theorem 1 can be extended to the case where the subgradient g(x_k) is computed using a point near λ_k. Formally, x_k ∈ X(µ_k), where µ_k is an approximate Lagrange multiplier that satisfies ||λ_k − µ_k|| ≤ ε/L_g for all k, with ε ≥ 0. As we will show in Section III it
Fig. 2. Illustrating the convergence of the subgradient method for the dual problem with constant step size when the subgradients are obtained using nearby (dual) points, i.e., approximate Lagrange multipliers. Observe from the figure that the approximate Lagrange multiplier µ_k (dashed line) is always close to the Lagrange multiplier λ_k (solid line).

Fig. 3. Illustrating the convergence of the stochastic subgradient method with approximate Lagrange multipliers. In contrast to the deterministic case (see Figure 1), we now have that λ̄_k converges to a ball around λ*.

will be possible to identify scaled queue occupancies with approximate multipliers µ_k. A consequence of this is that scaled queue occupancies can be used as surrogate quantities in network optimisation. We have the following corollary to Theorem 1.

Corollary 1. Consider the setup of Theorem 1, where update (6) now uses x_k ∈ X(µ_k) with ||λ_k − µ_k|| ≤ ε/L_g for all k = 1, 2, ... and ε ≥ 0. Then, the bound (7) holds and the sequence {β_k} converges to α(3/2)L_g² + ε for k sufficiently large.

The use of a nearby or approximate Lagrange multiplier to compute the subgradient of the dual function is in fact equivalent to running the subgradient method for the dual problem with ε-subgradients (see [14, pp. 625]). Observe that |q(λ_k) − q(µ_k)| ≤ ||λ_k − µ_k|| L_g ≤ ε. Figure 2 illustrates the convergence of q(λ_k) to a ball around q(λ*) when the subgradient is computed using a nearby Lagrange multiplier. Compare Figures 1 and 2 to see that the ball to which q(λ_k) converges now also depends on the parameter ε (which controls the distance between λ_k and µ_k).

C. Subgradient Method with Noisy Dual Updates

The subgradient method for the dual problem can be extended to consider noisy dual updates. We have the following lemma.

Lemma 3 (Stochastic Subgradient Method with Approximate Lagrange Multipliers). Consider optimisation problem P_D and the subgradient update

  λ_{k+1} = [λ_k + α(g(x_k) + B_k)]^+,   (9)

where x_k ∈ X(µ_k) with ||λ_k − µ_k|| ≤ ε/L_g, ε ≥ 0, and the B_k are i.i.d. random variables with mean b ∈ R^m and finite variance σ_B².
Suppose Assumption 1 holds and that there exists a point x ∈ X such that g(x) + b < 0. Also, suppose the noisy subgradients of the dual function are bounded, i.e., max_{x ∈ X} ||g(x) + B_k|| ≤ L_h for all k. Then,

  −||λ_1 − λ*||²/(2kα) − α L_h²/2 − ε ≤ q(E(λ̄_k)) − q(λ*) ≤ 0,   (10)

where λ̄_k := (1/k) Σ_{i=1}^k λ_i.

In the stochastic subgradient method we have that the expected value of the running average λ̄_k is attracted to a ball around Λ*. The latter is schematically illustrated in Figure 3. An important point to note is that we now require the Slater condition to hold with the mean of the random perturbation, i.e., there must exist a point x ∈ X such that g(x) + b < 0, and not just g(x) < 0.

We can obtain approximate primal solutions using the stochastic subgradient update.

Theorem 2 (Approximate Primal Solutions using Averages, Approximate Lagrange Multipliers and Noisy Dual Updates). Consider the stochastic subgradient method of Lemma 3. Suppose Assumption 1 holds and that there exists a point x ∈ X such that g(x) + b < 0. Also, suppose the noisy subgradients of the dual function are bounded, i.e., max_{x ∈ X} ||g(x) + B_k|| ≤ L_h for all k. Then,

  −β_k − λ̃_k/(kα) ≤ f(E(x̄_k)) − f* ≤ α L_h²/2 + ||λ_1||²/(2kα),   (11)

where x̄_k := (1/k) Σ_{i=1}^k x_i and λ̃_k := E(λ̄_k)^T E(λ_{k+1}).

Theorem 2 says that f(E(x̄_k)) converges to a ball around f* for k sufficiently large, and as k → ∞ we have that

  −α(3/2)L_h² − ε ≤ f(E(x̄_k)) − f* ≤ α L_h²/2.   (12)

Further, if α → 0 then f(E(x̄_k)) → f* as k → ∞.

III. APPROXIMATE LAGRANGE MULTIPLIERS & QUEUE CONTINUITY

Queue updates in networks are usually of the form Q_{k+1} = [Q_k + δ_k]^+, where Q_k ∈ R^m_+ and δ_k ∈ Z^m. The difference δ_k is discrete and represents the change in the discrete quantities (packets, cars, people, etc.) that get in/out of the queue in a time slot. Next, observe that since [·]^+ = max{0, ·} is a homogeneous function, if we let µ_k := αQ_k we can write µ_{k+1} = [µ_k + αδ_k]^+, where α is the step size used in the subgradient method. That is, an α-scaled queue is like a subgradient update with the
difference that δ_k is constrained to be discrete. Recall that µ_k will be an approximate Lagrange multiplier if it stays uniformly close to λ_k for all k. This will actually be the case when the sequence of discrete quantities {δ_k} and the subgradients {g(x_k)} stay close in an appropriate sense, and the constraints in the optimisation P are linear [13]. We have the following lemma, which is a restatement of [15, Proposition 3..].

Lemma 4 (Continuity of the Skorokhod Map). Consider the updates λ_{k+1} = [λ_k + αx_k]^+ and µ_{k+1} = [µ_k + αy_k]^+, where λ_1 = µ_1 ≥ 0, α > 0, and {x_k} and {y_k} are two sequences of points from R such that |Σ_{i=1}^k (x_i − y_i)| ≤ ε for all k. Then,

  |λ_{k+1} − µ_{k+1}| ≤ 2αε.   (13)

Lemma 4 requires Σ_{i=1}^k (x_i − y_i) to be uniformly bounded so that the difference λ_k − µ_k is also bounded. Hence, it is interesting to know under which conditions it is possible to construct a sequence {y_k} ⊆ Y ⊆ R^n that stays close to an arbitrary sequence {x_k} ⊆ X ⊆ R^n. Next we show that this is always possible when Y is a finite collection of points from R^n and X ⊆ conv(Y), the convex hull of Y.

A. Construction of Discrete Sequences

Consider a set of m discrete points Y from R^n such that X ⊆ conv(Y). Collect the points in Y as the columns of a matrix W. We can write any point x ∈ X as a convex combination of the points in Y, i.e., x = Wu, where u ∈ U := {u ∈ R^m : 1^T u = 1, u ≥ 0}. Note there might exist multiple vectors in U such that Wu = x. Similarly, define E := {e ∈ {0,1}^m : 1^T e = 1} to be the set containing the m-dimensional standard basis vectors. There is exactly one vector e ∈ E for each y ∈ Y such that We = y. We can write Σ_{i=1}^k (x_i − y_i) = W Σ_{i=1}^k (u_i − e_i), and since ||W Σ_{i=1}^k (u_i − e_i)|| ≤ ||W|| ||Σ_{i=1}^k (u_i − e_i)|| by the Cauchy-Schwarz inequality, it follows that showing that

  Σ_{i=1}^k (u_i − e_i)   (14)

is bounded for all k is sufficient to establish the boundedness of Σ_{i=1}^k (x_i − y_i). We show this in the following corollary, which is a straightforward extension of [13, Theorem 4].

Corollary 2 (Block Sequence). Let T ∈ N, U := {u ∈ R^m : 1^T u = 1, u ≥ 0}, E := {e ∈ {0,1}^m : 1^T e = 1}, and define S_k := Σ_{i=1}^k u_i, where u_i ∈ U for all i = 1, ..., Tm.
Let δ ∈ D := {w ∈ R^m : 1^T w = 0, ||w|| ≤ 1} and select

  e_k ∈ argmin_{e ∈ E} || δ + S_k − S̃_{k−1} − e ||,   (15)

where S̃_{k−1} := Σ_{i=1}^{k−1} e_i. Then, we have that (δ + S_{Tm} − S̃_{Tm}) ∈ D, where S̃_{Tm} = Σ_{i=1}^{Tm} e_i. (One such vector u can be efficiently found by solving the convex optimisation problem min_{u ∈ U} ||Wu − x|| with standard convex methods; see Chapters 6 and 9 in [16].)

Fig. 4. Illustrating the construction of a subsequence in an online mode, in a delayed or shifted manner, with m = 3 and T = 2. Subsequence {e_7, ..., e_12} is constructed with update (15) from subsequence {u_1, ..., u_6}, and therefore (δ + Σ_{i=1}^6 u_i − Σ_{j=7}^{12} e_j) ∈ D. Subsequence {e_1, ..., e_6} can be selected randomly from E and does not affect the boundedness of (14).

Corollary 2 says that for any subsequence {u_1, ..., u_{Tm}} we can use update (15) to construct a subsequence {e_1, ..., e_{Tm}} such that

  (δ + S_{Tm} − S̃_{Tm}) = δ + Σ_{i=1}^{Tm} (u_i − e_i) =: δ̃ ∈ D.   (16)

Hence, for any further subsequence {u_{Tm+1}, ..., u_{2Tm}} we can use update (15) to construct a subsequence {e_{Tm+1}, ..., e_{2Tm}} such that

  δ + Σ_{i=1}^{2Tm} (u_i − e_i) = ( δ̃ + Σ_{i=Tm+1}^{2Tm} (u_i − e_i) ) ∈ D.

Next, observe that if we let S'_l = Σ_{j=(l−1)Tm+1}^{lTm} u_j and S̃'_l = Σ_{j=(l−1)Tm+1}^{lTm} e_j with l = 1, 2, ..., we will always have that {Σ_{i=1}^l (S'_i − S̃'_i)} is a sequence of points from D := {w ∈ R^m : 1^T w = 0, ||w|| ≤ 1}, and therefore bounded. The boundedness of (14) for all k = 1, 2, ... follows immediately from the fact that subsequences have finite length.

Importantly, note that update (15) requires knowing Tm elements from U in advance in order to select Tm elements from E. This implies that it will only be possible to construct subsequences in an online mode in a delayed or shifted manner; see the example in Figure 4. Nonetheless, since subsequences have finite length, constructing subsequences in a shifted manner does not affect the boundedness of (14). An important observation is that the elements in a subsequence {e_1, ..., e_{Tm}} can be reordered and we still have (δ + Σ_{i=1}^{Tm} (u_i − e_i)) ∈ D. This will be particularly useful in problems where there are constraints on the order in which actions y_k (and so e_k) can be selected.
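The greedy block-sequence selection rule above can be sketched in a few lines of Python. This is a simplified version of our own (with δ = 0 and a randomly generated simplex sequence, not an example from the paper): at each step it picks the standard basis vector whose running sum best tracks the running sum of the u_i, and the tracking gap Σ (u_i − e_i) stays bounded.

```python
import numpy as np

# Sketch of the greedy block-sequence selection (our simplification: delta = 0).
# Given simplex points u_1, ..., u_K, pick basis vectors e_k so that the
# running sum of the e_k tracks the running sum of the u_k; the gap
# sum_i (u_i - e_i) then stays bounded, as in the Block Sequence corollary.
def track_discrete(us):
    m = us.shape[1]
    E = np.eye(m)              # the m standard basis vectors
    S_u = np.zeros(m)          # running sum of the u_i
    S_e = np.zeros(m)          # running sum of the chosen e_i
    chosen = []
    for u in us:
        S_u += u
        # greedy rule: basis vector minimising the tracking error
        j = int(np.argmin(np.linalg.norm(S_u - S_e - E, axis=1)))
        S_e += E[j]
        chosen.append(j)
    return chosen, S_u - S_e

rng = np.random.default_rng(0)
us = rng.dirichlet(np.ones(3), size=90)   # e.g. 90 simplex points with m = 3
chosen, gap = track_discrete(us)          # gap stays small in every coordinate
```

Because the selected index is always the coordinate with the largest accumulated deficit, each coordinate of the gap remains within a small constant of zero regardless of the sequence length, which is the boundedness property the corollary requires.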
In the next section we present a network example where the selection of discrete actions is constrained and the problem cannot be solved with classic max-weight approaches.

IV. WIRELESS NETWORK EXAMPLE

A. Problem Setup

Consider the network shown in Figure 5: an Access Point (AP) that transmits to two wireless nodes. Time is slotted, and in each time slot packets arrive at the queues of the AP (Q_1 and Q_2) with probabilities p_1 and p_2, respectively. In each time slot the AP takes an action from the action set Y := {y_0, y_1, y_2} = {[0,0]^T, [1,0]^T, [0,1]^T}, where each action corresponds, respectively, to not transmitting, to transmitting
Fig. 5. Illustrating the network in the example of Section IV. The Access Point (AP) sends packets from Q_1 to node 1, and from Q_2 to node 2.

one packet from Q_1 to node 1, and to transmitting one packet from Q_2 to node 2. The transmission protocol of the AP has constraints on how actions can be selected. In particular, it is not possible to select action y_2 after y_1 without first selecting y_0. Likewise, it is not possible to select y_1 after y_2 without first selecting y_0. However, y_1 or y_2 can be selected repeatedly. An example of an admissible sequence is {y_1, y_1, y_1, y_0, y_2, y_0, y_1, y_1, y_0, y_2, y_2, ...}. Constraints of this type appear in different areas and are known in the literature as reconfiguration delays [7]. An example is asking for the Channel State Information (CSI) in wireless communications in order to adjust the transmission parameters.

Our goal is to design a scheduling policy for the AP (i.e., select actions from the set Y) in order to minimise a convex cost function of the average throughput x̄, and to ensure that the system is stable, i.e., the queues do not overflow and so all traffic can be served.

B. Problem Formulation

The convex or fluid formulation of the problem is

  minimise_{x ∈ X} f(x)   (17)
  subject to b − x ≤ 0,   (18)

where f : R² → R, b ∈ R²_+, and X is a bounded convex subset of conv(Y) ⊆ R² that depends on the protocol constraints, i.e., on how actions can be selected. If there were no constraints on the order in which actions could be selected we would have X := conv(Y); however, this is not the case here. Characterising X is as simple as noting that if we have a subsequence of actions in which y_1 and y_2 each appear consecutively, then y_0 should appear at least twice in order to have a subsequence that is compliant with the transmission protocol, for example {y_0, y_1, y_1, y_1, y_0, y_2, y_2, y_2, y_2}. Conversely, any subsequence in which y_0 appears at least twice can be reordered to obtain a subsequence that is compliant with the transmission protocol.
Since we can always choose a subsequence of discrete actions and then reorder its elements (see the discussion after Corollary 2), we just need to enforce that y_0 appears twice in a subsequence. This can be obtained if the convex combination of a point x ∈ X uses point y_0 with weight at least 2/(Tm), where Tm is the length of a subsequence. This will be the case when

  X := η conv(Y),   (19)

where η := 1 − 2/(Tm). Note from (19) that as T gets larger, η → 1 and therefore X → conv(Y), i.e., the network capacity region increases. Figure 6 illustrates the network capacity region of the example. (The CSI in wireless communications is in practice requested periodically, and not only at the beginning of a transmission, but we assume this for simplicity of exposition. The extension is nevertheless straightforward.)

Fig. 6. Network capacity region of the example in Section IV: X = η conv(Y) with η = 1 − 2/(Tm).

1) Updates: We use scaled queue occupancies as approximate multipliers. In each time slot we have the updates

  x_k ∈ argmin_{x ∈ X} { f(x) − αQ_k^T x },   (20)
  Q_{k+1} = [Q_k − y_k + B_k]^+,   (21)

where the B_k are i.i.d. random variables that take values in {0,1}² with mean [p_1, p_2]^T. The action is y_k = We_k, where e_k is obtained with (15) in the online mode explained in Section III-A. The elements in a subsequence are reordered to comply with the transmission protocol constraints.

2) Simulation: We run a simulation with objective function f(x) = ||x||² and parameters p_1 = 0.25, p_2 = 0.5, α = 0.05, λ_1 = αQ_1 = 0, and T = 3 (so the number of elements in a subsequence is Tm = 9 since m = 3). Figure 7 shows the convergence of λ̄_k to a ball around λ*. Interestingly, observe that whereas λ̄_k converges to λ*, the Lagrange multipliers λ_k (grey points) do not converge to a ball around λ*. The stability of the system follows from the fact that if the average of the Lagrange multipliers is bounded, then the average of the α-scaled queue occupancies is also bounded. Figure 8 shows that f(x̄_k) converges to a ball around f* as k increases.
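The queue-as-multiplier updates above can be sketched in simulation. The sketch below uses illustrative parameters of our own choosing and a linear transmission cost, and it omits the protocol ordering constraint (which the paper handles via the block construction of Section III-A): in each slot the AP picks the discrete action minimising the Lagrangian with the α-scaled queue as approximate multiplier, and the long-run service rates track the arrival rates while the queues stay bounded.

```python
import numpy as np

# Hypothetical parameters (ours, not the paper's); the protocol ordering
# constraint and the block reordering of Section III-A are omitted here.
rng = np.random.default_rng(1)
p = np.array([0.2, 0.3])                   # packet arrival probabilities
alpha, N = 0.05, 20000                     # step size, number of slots
Y = np.array([[0, 0], [1, 0], [0, 1]])     # idle, serve Q1, serve Q2
Q = np.zeros(2)                            # queue occupancies
served = np.zeros(2)

for _ in range(N):
    # discrete action minimising f(y) - (alpha * Q)^T y with a linear
    # transmission cost f(y) = 0.1 * (y1 + y2); the alpha-scaled queue
    # plays the role of the approximate Lagrange multiplier
    j = int(np.argmin(0.1 * Y.sum(axis=1) - Y @ (alpha * Q)))
    y = np.minimum(Y[j], Q)                # cannot serve an empty queue
    served += y
    arrivals = (rng.random(2) < p).astype(float)
    Q = Q - y + arrivals                   # alpha * Q is the scaled-queue update

x_bar = served / N                         # long-run service (throughput) rates
```

With these parameters the queues stay small (the policy serves whichever queue makes the Lagrangian term negative), and the average served rates x_bar settle near the arrival probabilities p, which is the stability behaviour the convex analysis predicts.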
CONCLUSIONS

We have shown that the stochastic subgradient method for the dual problem in convex optimisation allows the use of approximate Lagrange multipliers. The use of approximate Lagrange multipliers is motivated by the fact that it is possible to identify α-scaled queue occupancies as approximations of the Lagrange multipliers in the subgradient method. As a result, it is possible to cast some discrete decision-making problems as convex optimisations. One of the advantages of our convex approach is that we can make optimal control decisions that have ordering constraints.
Fig. 7. Illustrating the convergence of the average λ̄_{k+1} (blue line) to a ball around λ* = [0.5, 1]^T (grey area). Grey points indicate the values of the Lagrange multipliers λ_k, and λ^(1) and λ^(2) denote the first and second components of vector λ.

Fig. 8. Illustrating the convergence of f(x̄_k) to f* = 0.3125 (straight line) in the network example of Section IV.

APPENDIX

A. Proof of Lemma 1

Consider the update λ_{k+1} = [λ_k + αg(x_k)]^+ and suppose we select x_k ∈ X(µ_k) := {x ∈ X : L(x, µ_k) ≤ q(µ_k) + ε}. (With ε = 0 this reduces to x_k ∈ X(λ_k), so the argument covers Lemma 1 and Corollary 1 simultaneously.) Observe that

  ||λ_{k+1} − λ*||² = ||[λ_k + αg(x_k)]^+ − λ*||²
    ≤ ||λ_k + αg(x_k) − λ*||²
    = ||λ_k − λ*||² + α²||g(x_k)||² + 2α(λ_k − λ*)^T g(x_k)
    ≤ ||λ_k − λ*||² + α²L_g² + 2α(q(λ_k) + ε − q(λ*)),   (22)

where the last inequality follows from the fact that

  (λ_k − λ*)^T g(x_k) = L(x_k, λ_k) − L(x_k, λ*) ≤ L(x_k, λ_k) − q(λ*) ≤ q(λ_k) + ε − q(λ*),

and ||g(x_k)|| ≤ L_g for all k. Rearranging terms yields

  ||λ_{k+1} − λ*||² − ||λ_k − λ*||² ≤ α²L_g² + 2α(q(λ_k) + ε − q(λ*)).   (23)

Now let Λ_δ := {λ ∈ R^m_+ : q(λ*) − q(λ) ≤ α L_g²/2 + ε} and consider two cases.

Case (i) (λ_k ∉ Λ_δ). Observe that the RHS of (23) is strictly negative and therefore λ_k is monotonically attracted to Λ_δ. Hence, for any λ_1 ∈ R^m_+ and k sufficiently large it will eventually happen that λ_k ∈ Λ_δ, and therefore q(λ*) − q(λ_k) ≤ α L_g²/2 + ε.

Case (ii) (λ_k ∈ Λ_δ). We cannot ensure that the RHS of (23) is strictly negative; however, from the Lipschitz continuity of the dual function we have

  |q(λ_{k+1}) − q(λ_k)| ≤ ||λ_{k+1} − λ_k|| L_g = ||[λ_k + αg(x_k)]^+ − λ_k|| L_g ≤ α||g(x_k)|| L_g ≤ α L_g².

Now see that

  q(λ*) − q(λ_{k+1}) = q(λ*) − q(λ_k) + q(λ_k) − q(λ_{k+1}) ≤ (α L_g²/2 + ε) + α L_g² = α(3/2)L_g² + ε.

That is, λ_{k+1} ∈ Λ'_δ := {λ ∈ R^m_+ : q(λ*) − q(λ) ≤ α(3/2)L_g² + ε}. Finally, note that Λ_δ ⊆ Λ'_δ and that, since from case (i) any λ_{k+1} ∈ Λ'_δ \ Λ_δ will be attracted to Λ_δ, we can conclude that for every λ_1 ∈ R^m_+ there exists a non-negative and non-increasing sequence {β_k} that converges to α(3/2)L_g² + ε for k sufficiently large, with β_k ≥ q(λ*) − q(λ_k) ≥ 0 for all k.

B. Proof of Lemma 2

Observe that λ_{i+1} = [λ_i + αg(x_i)]^+ ≥ λ_i + αg(x_i), and rearranging terms, λ_{i+1} − λ_i ≥ αg(x_i). Summing over i = 1, ..., k yields α Σ_{i=1}^k g(x_i) ≤ λ_{k+1} − λ_1 ≤ λ_{k+1}.
Dividing by kα and using the fact that g(x̄_k) ≤ (1/k) Σ_{i=1}^k g(x_i) (by the convexity of the g_i) yields the stated result.

C. Proof of Theorem 1

From Lemma 1 we have that β_k ≥ q(λ*) − q(λ_k) ≥ 0, where {β_k} is a non-increasing sequence that converges to α(3/2)L_g² + ε for k sufficiently large. Next, since q(λ_k) ≤ L(x̄_k, λ_k) = f(x̄_k) + λ_k^T g(x̄_k), by rearranging terms and using the fact that q(λ*) = f* (by strong duality) we obtain

  −β_k − λ_k^T g(x̄_k) ≤ f(x̄_k) − f*.

From Lemma 2 we have that g(x̄_k) ≤ λ_{k+1}/(kα) for all k, and since λ_k ≥ 0 (with ||λ_{k+1}|| ≤ ||λ_k|| + α L_g ensuring the iterates stay bounded) we obtain

  −β_k − λ̃_k/(kα) ≤ f(x̄_k) − f*,   (24)

where λ̃_k = λ_k^T λ_{k+1}. We now show the upper bound in (7). Since

  f* = q(λ*) ≥ (1/k) Σ_{i=1}^k q(λ_i) = (1/k) Σ_{i=1}^k ( f(x_i) + λ_i^T g(x_i) ) ≥ f(x̄_k) + (1/k) Σ_{i=1}^k λ_i^T g(x_i),

by rearranging terms we have that f(x̄_k) − f* ≤ −(1/k) Σ_{i=1}^k λ_i^T g(x_i). Now observe that for any sequence {x_k} in X we can write

  ||λ_{k+1}||² = ||[λ_k + αg(x_k)]^+||² ≤ ||λ_k + αg(x_k)||² = ||λ_k||² + α²||g(x_k)||² + 2α λ_k^T g(x_k).

Applying the latter equation recursively for i = 1, ..., k we obtain

  ||λ_{k+1}||² ≤ ||λ_1||² + α² Σ_{i=1}^k ||g(x_i)||² + 2α Σ_{i=1}^k λ_i^T g(x_i),

and rearranging terms and dividing by 2kα yields

  −(1/k) Σ_{i=1}^k λ_i^T g(x_i) ≤ (α/(2k)) Σ_{i=1}^k ||g(x_i)||² + ||λ_1||²/(2kα).

Finally, since ||g(x_i)|| ≤ L_g by Assumption 1 we obtain

  f(x̄_k) − f* ≤ α L_g²/2 + ||λ_1||²/(2kα),

which concludes the proof.

D. Proof of Lemma 3

In each iteration we aim to decrease the expected distance of λ_k from λ*. In short, observe that for any vector θ ∈ R^m_+ we have

  E(||λ_{k+1} − θ||² | λ_k) = E(||[λ_k + α ĝ(x_k)]^+ − θ||² | λ_k) ≤ E(||λ_k + α ĝ(x_k) − θ||² | λ_k),

where ĝ(x_k) := g(x_k) + B_k.
Expanding the RHS of the last inequality and rearranging terms yields

  E(||λ_{k+1} − θ||² | λ_k) ≤ ||λ_k − θ||² + α² E(||ĝ(x_k)||² | λ_k) + 2α E((λ_k − θ)^T ĝ(x_k) | λ_k)   (25)
    ≤ ||λ_k − θ||² + α² L_h² + 2α E((λ_k − θ)^T ĝ(x_k) | λ_k),   (26)

where the last inequality follows from the boundedness of the noisy subgradients. Next, since

  E((λ_k − θ)^T (g(x_k) + B_k) | λ_k) = (λ_k − θ)^T (g(x_k) + b) = L(x_k, λ_k) − L(x_k, θ) ≤ q(λ_k) + ε − q(θ)

(where L and q now refer to the problem with the perturbed constraint g(x) + b ≤ 0), we have that

  E(||λ_{k+1} − θ||² | λ_k) ≤ ||λ_k − θ||² + α² L_h² + 2α (q(λ_k) + ε − q(θ)),

and taking expectations yields

  E(||λ_{k+1} − θ||²) ≤ E(||λ_k − θ||²) + α² L_h² + 2α E(q(λ_k) + ε − q(θ)).

Applying the latter expression recursively we obtain

  E(||λ_{k+1} − θ||²) ≤ ||λ_1 − θ||² + k α² L_h² + 2α Σ_{i=1}^k E(q(λ_i) + ε − q(θ)).

Dropping the non-negative term E(||λ_{k+1} − θ||²), rearranging and dividing by 2kα yields

  −||λ_1 − θ||²/(2kα) − α L_h²/2 − ε ≤ (1/k) Σ_{i=1}^k E(q(λ_i)) − q(θ).   (27)

Next, see that by the linearity of expectation (1/k) Σ_{i=1}^k E(q(λ_i)) = E((1/k) Σ_{i=1}^k q(λ_i)), where λ_i, i = 1, ..., k, are the Lagrange multipliers obtained from update (9). Next, let λ̄_k := (1/k) Σ_{i=1}^k λ_i and see that by the concavity of q (Jensen's inequality, applied twice) we can write E((1/k) Σ_{i=1}^k q(λ_i)) ≤ E(q(λ̄_k)) ≤ q(E(λ̄_k)). Hence,

  −||λ_1 − θ||²/(2kα) − α L_h²/2 − ε ≤ q(E(λ̄_k)) − q(θ),

and if we let θ = λ* we obtain that q(E(λ̄_k)) converges to a ball around q(λ*).

E. Proof of Theorem 2

We follow the same steps as in the proof of Theorem 1. From the stochastic subgradient method (see Lemma 3) we have that for any θ ∈ R^m_+ there exists a non-negative and non-increasing sequence {β_k} such that β_k ≥ q(θ) − q(E(λ̄_k)) ≥ 0 for all k. Also, since we can always write q(E(λ̄_k)) ≤ L(E(x̄_k), E(λ̄_k)) = f(E(x̄_k)) + E(λ̄_k)^T g(E(x̄_k)), and g(E(x̄_k)) ≤ E(λ_{k+1})/(kα) from Lemma 2, rearranging terms yields

  −β_k − λ̃_k/(kα) ≤ f(E(x̄_k)) − q(θ),   (28)

where λ̃_k = E(λ̄_k)^T E(λ_{k+1}). We now proceed to upper bound f(E(x̄_k)). From (27) we have

  (1/k) Σ_{i=1}^k E(q(λ_i)) − q(θ) = (1/k) Σ_{i=1}^k E(f(x_i) + λ_i^T (g(x_i) + b)) − q(θ) ≥ f(E(x̄_k)) + E((1/k) Σ_{i=1}^k λ_i^T (g(x_i) + b)) − q(θ)

(by the convexity of f), and therefore

  f(E(x̄_k)) − q(λ*(b)) ≤ −E((1/k) Σ_{i=1}^k λ_i^T (g(x_i) + b)),   (29)

where λ*(b) depends on b. Now observe that using θ = 0 in (26) we have

  E(||λ_{k+1}||² | λ_k) ≤ ||λ_k||² + α² L_h² + 2α E(λ_k^T (g(x_k) + b) | λ_k).

Applying the latter equation recursively we obtain

  E(||λ_{k+1}||²) ≤ ||λ_1||² + k α² L_h² + 2α E(Σ_{i=1}^k λ_i^T (g(x_i) + b)).
Rearranging terms and dividing by 2kα yields

  −E((1/k) Σ_{i=1}^k λ_i^T (g(x_i) + b)) ≤ α L_h²/2 + ||λ_1||²/(2kα).   (30)

Finally, letting θ = λ*(b) in (28) and combining (29) with (30) concludes the proof.

REFERENCES

[1] V. Valls and D. J. Leith. On the relationship between queues and multipliers. In Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton Conference on, Sept 2014.
[2] L. Tassiulas and A. Ephremides. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE Transactions on Automatic Control, 37(12):1936-1948, Dec 1992.
[3] A. Eryilmaz and R. Srikant. Fair resource allocation in wireless networks using queue-length-based scheduling and congestion control. In INFOCOM 2005, 24th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings IEEE, volume 3, March 2005.
[4] M. J. Neely. Optimal energy and delay tradeoffs for multiuser wireless downlinks. IEEE Transactions on Information Theory, 53(9), Sept 2007.
[5] A. L. Stolyar. Maximizing queueing network utility subject to stability: Greedy primal-dual algorithm. Queueing Systems, 50(4):401-457, 2005.
[6] M. J. Neely. Stochastic Network Optimization with Application to Communication and Queueing Systems. Morgan and Claypool Publishers, 2010.
[7] G. D. Celik, L. B. Le, and E. Modiano. Dynamic server allocation over time-varying channels with switchover delay. IEEE Transactions on Information Theory, 58(9), Sept 2012.
[8] F. P. Kelly, A. K. Maulloo, and D. K. H. Tan. Rate control for communication networks: Shadow prices, proportional fairness and stability. The Journal of the Operational Research Society, 49(3):237-252, 1998.
[9] A. Nedić and A. Ozdaglar. Subgradient methods in network resource allocation: Rate analysis. In Information Sciences and Systems, 2008. CISS 2008. 42nd Annual Conference on, March 2008.
[10] L. Huang and M. J. Neely. Delay reduction via Lagrange multipliers in stochastic network optimization.
IEEE Transactions on Automatic Control, 56(4):842-857, April 2011.
[11] V. Valls and D. J. Leith. Max-weight revisited: Sequences of nonconvex optimizations solving convex optimizations. IEEE/ACM Transactions on Networking, PP(99), 2015.
[12] A. Nedić and A. Ozdaglar. Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Optimization, 19(4):1757-1780, 2009.
[13] V. Valls and D. J. Leith. Descent with approximate multipliers is enough: Generalising max-weight. arXiv e-print, 2015.
[14] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.
[15] S. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 1st edition, 2007.
[16] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 1st edition, 2004.
Uniqueness of Generalized Equilibrium for Box Constrained Problems and Applications Alp Simsek Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Asuman E.
More informationAsynchronous Non-Convex Optimization For Separable Problem
Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent
More informationConstrained Consensus and Optimization in Multi-Agent Networks
Constrained Consensus Optimization in Multi-Agent Networks The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher
More informationPrimal Solutions and Rate Analysis for Subgradient Methods
Primal Solutions and Rate Analysis for Subgradient Methods Asu Ozdaglar Joint work with Angelia Nedić, UIUC Conference on Information Sciences and Systems (CISS) March, 2008 Department of Electrical Engineering
More informationDistributed Scheduling Algorithms for Optimizing Information Freshness in Wireless Networks
Distributed Scheduling Algorithms for Optimizing Information Freshness in Wireless Networks Rajat Talak, Sertac Karaman, and Eytan Modiano arxiv:803.06469v [cs.it] 7 Mar 208 Abstract Age of Information
More informationChannel Probing in Communication Systems: Myopic Policies Are Not Always Optimal
Channel Probing in Communication Systems: Myopic Policies Are Not Always Optimal Matthew Johnston, Eytan Modiano Laboratory for Information and Decision Systems Massachusetts Institute of Technology Cambridge,
More informationA Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints
A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints Hao Yu and Michael J. Neely Department of Electrical Engineering
More informationLecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem
Lecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem Michael Patriksson 0-0 The Relaxation Theorem 1 Problem: find f := infimum f(x), x subject to x S, (1a) (1b) where f : R n R
More informationInformation Theory vs. Queueing Theory for Resource Allocation in Multiple Access Channels
1 Information Theory vs. Queueing Theory for Resource Allocation in Multiple Access Channels Invited Paper Ali ParandehGheibi, Muriel Médard, Asuman Ozdaglar, and Atilla Eryilmaz arxiv:0810.167v1 cs.it
More informationApproximate Primal Solutions and Rate Analysis for Dual Subgradient Methods
Approximate Primal Solutions and Rate Analysis for Dual Subgradient Methods Angelia Nedich Department of Industrial and Enterprise Systems Engineering University of Illinois at Urbana-Champaign 117 Transportation
More informationRate Control in Communication Networks
From Models to Algorithms Department of Computer Science & Engineering The Chinese University of Hong Kong February 29, 2008 Outline Preliminaries 1 Preliminaries Convex Optimization TCP Congestion Control
More informationDistributed Resource Allocation Using One-Way Communication with Applications to Power Networks
Distributed Resource Allocation Using One-Way Communication with Applications to Power Networks Sindri Magnússon, Chinwendu Enyioha, Kathryn Heal, Na Li, Carlo Fischione, and Vahid Tarokh Abstract Typical
More informationUtility-Maximizing Scheduling for Stochastic Processing Networks
Utility-Maximizing Scheduling for Stochastic Processing Networks Libin Jiang and Jean Walrand EECS Department, University of California at Berkeley {ljiang,wlr}@eecs.berkeley.edu Abstract Stochastic Processing
More informationDistributed Approaches for Proportional and Max-Min Fairness in Random Access Ad Hoc Networks
Distributed Approaches for Proportional and Max-Min Fairness in Random Access Ad Hoc Networks Xin Wang, Koushik Kar Department of Electrical, Computer and Systems Engineering, Rensselaer Polytechnic Institute,
More informationDistributed Multiuser Optimization: Algorithms and Error Analysis
Distributed Multiuser Optimization: Algorithms and Error Analysis Jayash Koshal, Angelia Nedić and Uday V. Shanbhag Abstract We consider a class of multiuser optimization problems in which user interactions
More informationInformation in Aloha Networks
Achieving Proportional Fairness using Local Information in Aloha Networks Koushik Kar, Saswati Sarkar, Leandros Tassiulas Abstract We address the problem of attaining proportionally fair rates using Aloha
More informationIEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 43, NO. 3, MARCH
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 43, NO. 3, MARCH 1998 315 Asymptotic Buffer Overflow Probabilities in Multiclass Multiplexers: An Optimal Control Approach Dimitris Bertsimas, Ioannis Ch. Paschalidis,
More informationDelay Analysis for Max Weight Opportunistic Scheduling in Wireless Systems
Delay Analysis for Max Weight Opportunistic Scheduling in Wireless Systems Michael J. eely arxiv:0806.345v3 math.oc 3 Dec 008 Abstract We consider the delay properties of max-weight opportunistic scheduling
More informationRouting. Topics: 6.976/ESD.937 1
Routing Topics: Definition Architecture for routing data plane algorithm Current routing algorithm control plane algorithm Optimal routing algorithm known algorithms and implementation issues new solution
More informationLagrange Relaxation and Duality
Lagrange Relaxation and Duality As we have already known, constrained optimization problems are harder to solve than unconstrained problems. By relaxation we can solve a more difficult problem by a simpler
More informationNode-based Distributed Optimal Control of Wireless Networks
Node-based Distributed Optimal Control of Wireless Networks CISS March 2006 Edmund M. Yeh Department of Electrical Engineering Yale University Joint work with Yufang Xi Main Results Unified framework for
More informationEnhanced Fritz John Optimality Conditions and Sensitivity Analysis
Enhanced Fritz John Optimality Conditions and Sensitivity Analysis Dimitri P. Bertsekas Laboratory for Information and Decision Systems Massachusetts Institute of Technology March 2016 1 / 27 Constrained
More informationSTABILITY OF MULTICLASS QUEUEING NETWORKS UNDER LONGEST-QUEUE AND LONGEST-DOMINATING-QUEUE SCHEDULING
Applied Probability Trust (7 May 2015) STABILITY OF MULTICLASS QUEUEING NETWORKS UNDER LONGEST-QUEUE AND LONGEST-DOMINATING-QUEUE SCHEDULING RAMTIN PEDARSANI and JEAN WALRAND, University of California,
More informationA State Action Frequency Approach to Throughput Maximization over Uncertain Wireless Channels
A State Action Frequency Approach to Throughput Maximization over Uncertain Wireless Channels Krishna Jagannathan, Shie Mannor, Ishai Menache, Eytan Modiano Abstract We consider scheduling over a wireless
More informationDistributed Stochastic Optimization in Networks with Low Informational Exchange
Distributed Stochastic Optimization in Networs with Low Informational Exchange Wenjie Li and Mohamad Assaad, Senior Member, IEEE arxiv:80790v [csit] 30 Jul 08 Abstract We consider a distributed stochastic
More informationIntroduction to Machine Learning Lecture 7. Mehryar Mohri Courant Institute and Google Research
Introduction to Machine Learning Lecture 7 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Convex Optimization Differentiation Definition: let f : X R N R be a differentiable function,
More informationUnderstanding the Capacity Region of the Greedy Maximal Scheduling Algorithm in Multi-hop Wireless Networks
Understanding the Capacity Region of the Greedy Maximal Scheduling Algorithm in Multi-hop Wireless Networks Changhee Joo, Xiaojun Lin, and Ness B. Shroff Abstract In this paper, we characterize the performance
More informationA Starvation-free Algorithm For Achieving 100% Throughput in an Input- Queued Switch
A Starvation-free Algorithm For Achieving 00% Throughput in an Input- Queued Switch Abstract Adisak ekkittikul ick ckeown Department of Electrical Engineering Stanford University Stanford CA 9405-400 Tel
More informationPrimal/Dual Decomposition Methods
Primal/Dual Decomposition Methods Daniel P. Palomar Hong Kong University of Science and Technology (HKUST) ELEC5470 - Convex Optimization Fall 2018-19, HKUST, Hong Kong Outline of Lecture Subgradients
More informationA Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions
A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions Angelia Nedić and Asuman Ozdaglar April 15, 2006 Abstract We provide a unifying geometric framework for the
More informationScheduling Algorithms for Optimizing Age of Information in Wireless Networks with Throughput Constraints
SUBITTED TO IEEE/AC TRANSACTIONS ON NETWORING Scheduling Algorithms for Optimizing Age of Information in Wireless Networks with Throughput Constraints Igor adota, Abhishek Sinha and Eytan odiano Abstract
More informationLow-Complexity and Distributed Energy Minimization in Multi-hop Wireless Networks
IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. XX, XXXXXXX X Low-Complexity and Distributed Energy Minimization in Multi-hop Wireless Networks Longbi Lin, Xiaojun Lin, Member, IEEE, and Ness B. Shroff,
More informationDuality (Continued) min f ( x), X R R. Recall, the general primal problem is. The Lagrangian is a function. defined by
Duality (Continued) Recall, the general primal problem is min f ( x), xx g( x) 0 n m where X R, f : X R, g : XR ( X). he Lagrangian is a function L: XR R m defined by L( xλ, ) f ( x) λ g( x) Duality (Continued)
More informationEnergy Optimal Control for Time Varying Wireless Networks. Michael J. Neely University of Southern California
Energy Optimal Control for Time Varying Wireless Networks Michael J. Neely University of Southern California http://www-rcf.usc.edu/~mjneely Part 1: A single wireless downlink (L links) L 2 1 S={Totally
More informationNetwork Optimization: Notes and Exercises
SPRING 2016 1 Network Optimization: Notes and Exercises Michael J. Neely University of Southern California http://www-bcf.usc.edu/ mjneely Abstract These notes provide a tutorial treatment of topics of
More informationUnderstanding the Capacity Region of the Greedy Maximal Scheduling Algorithm in Multi-hop Wireless Networks
Understanding the Capacity Region of the Greedy Maximal Scheduling Algorithm in Multi-hop Wireless Networks Changhee Joo Departments of ECE and CSE The Ohio State University Email: cjoo@ece.osu.edu Xiaojun
More informationA Distributed Newton Method for Network Optimization
A Distributed Newton Method for Networ Optimization Ali Jadbabaie, Asuman Ozdaglar, and Michael Zargham Abstract Most existing wor uses dual decomposition and subgradient methods to solve networ optimization
More informationA Distributed Newton Method for Network Utility Maximization, I: Algorithm
A Distributed Newton Method for Networ Utility Maximization, I: Algorithm Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie October 31, 2012 Abstract Most existing wors use dual decomposition and first-order
More informationDECENTRALIZED algorithms are used to solve optimization
5158 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 64, NO. 19, OCTOBER 1, 016 DQM: Decentralized Quadratically Approximated Alternating Direction Method of Multipliers Aryan Mohtari, Wei Shi, Qing Ling,
More informationAdaptive Distributed Algorithms for Optimal Random Access Channels
Forty-Eighth Annual Allerton Conference Allerton House, UIUC, Illinois, USA September 29 - October, 2 Adaptive Distributed Algorithms for Optimal Random Access Channels Yichuan Hu and Alejandro Ribeiro
More informationOn the convergence to saddle points of concave-convex functions, the gradient method and emergence of oscillations
53rd IEEE Conference on Decision and Control December 15-17, 214. Los Angeles, California, USA On the convergence to saddle points of concave-convex functions, the gradient method and emergence of oscillations
More informationAn Improved Bound for Minimizing the Total Weighted Completion Time of Coflows in Datacenters
IEEE/ACM TRANSACTIONS ON NETWORKING An Improved Bound for Minimizing the Total Weighted Completion Time of Coflows in Datacenters Mehrnoosh Shafiee, Student Member, IEEE, and Javad Ghaderi, Member, IEEE
More informationAlternative Decompositions for Distributed Maximization of Network Utility: Framework and Applications
Alternative Decompositions for Distributed Maximization of Network Utility: Framework and Applications Daniel P. Palomar Hong Kong University of Science and Technology (HKUST) ELEC5470 - Convex Optimization
More informationBattery-Aware Selective Transmitters in Energy-Harvesting Sensor Networks: Optimal Solution and Stochastic Dual Approximation
Battery-Aware Selective Transmitters in Energy-Harvesting Sensor Networs: Optimal Solution and Stochastic Dual Approximation Jesús Fernández-Bes, Carlos III University of Madrid, Leganes, 28911, Madrid,
More informationContinuous-Model Communication Complexity with Application in Distributed Resource Allocation in Wireless Ad hoc Networks
Continuous-Model Communication Complexity with Application in Distributed Resource Allocation in Wireless Ad hoc Networks Husheng Li 1 and Huaiyu Dai 2 1 Department of Electrical Engineering and Computer
More informationLecture Note 5: Semidefinite Programming for Stability Analysis
ECE7850: Hybrid Systems:Theory and Applications Lecture Note 5: Semidefinite Programming for Stability Analysis Wei Zhang Assistant Professor Department of Electrical and Computer Engineering Ohio State
More informationThe Multi-Path Utility Maximization Problem
The Multi-Path Utility Maximization Problem Xiaojun Lin and Ness B. Shroff School of Electrical and Computer Engineering Purdue University, West Lafayette, IN 47906 {linx,shroff}@ecn.purdue.edu Abstract
More informationNetwork Control: A Rate-Distortion Perspective
Network Control: A Rate-Distortion Perspective Jubin Jose and Sriram Vishwanath Dept. of Electrical and Computer Engineering The University of Texas at Austin {jubin, sriram}@austin.utexas.edu arxiv:8.44v2
More informationDistributed Random Access Algorithm: Scheduling and Congesion Control
Distributed Random Access Algorithm: Scheduling and Congesion Control The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As
More informationA Distributed Newton Method for Network Utility Maximization, II: Convergence
A Distributed Newton Method for Network Utility Maximization, II: Convergence Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie October 31, 2012 Abstract The existing distributed algorithms for Network Utility
More informationWireless Peer-to-Peer Scheduling in Mobile Networks
PROC. 46TH ANNUAL CONF. ON INFORMATION SCIENCES AND SYSTEMS (CISS), MARCH 212 (INVITED PAPER) 1 Wireless Peer-to-Peer Scheduling in Mobile Networs Abstract This paper considers peer-to-peer scheduling
More informationTHE EFFECT OF DETERMINISTIC NOISE 1 IN SUBGRADIENT METHODS
Submitted: 24 September 2007 Revised: 5 June 2008 THE EFFECT OF DETERMINISTIC NOISE 1 IN SUBGRADIENT METHODS by Angelia Nedić 2 and Dimitri P. Bertseas 3 Abstract In this paper, we study the influence
More informationCHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS. W. Erwin Diewert January 31, 2008.
1 ECONOMICS 594: LECTURE NOTES CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS W. Erwin Diewert January 31, 2008. 1. Introduction Many economic problems have the following structure: (i) a linear function
More informationOptimal Power Control in Decentralized Gaussian Multiple Access Channels
1 Optimal Power Control in Decentralized Gaussian Multiple Access Channels Kamal Singh Department of Electrical Engineering Indian Institute of Technology Bombay. arxiv:1711.08272v1 [eess.sp] 21 Nov 2017
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationLexicographic Max-Min Fairness in a Wireless Ad Hoc Network with Random Access
Lexicographic Max-Min Fairness in a Wireless Ad Hoc Network with Random Access Xin Wang, Koushik Kar, and Jong-Shi Pang Abstract We consider the lexicographic max-min fair rate control problem at the link
More informationWE consider an undirected, connected network of n
On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been
More informationarxiv: v2 [math.oc] 5 May 2018
The Impact of Local Geometry and Batch Size on Stochastic Gradient Descent for Nonconvex Problems Viva Patel a a Department of Statistics, University of Chicago, Illinois, USA arxiv:1709.04718v2 [math.oc]
More informationOptimization and Optimal Control in Banach Spaces
Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,
More informationA POMDP Framework for Cognitive MAC Based on Primary Feedback Exploitation
A POMDP Framework for Cognitive MAC Based on Primary Feedback Exploitation Karim G. Seddik and Amr A. El-Sherif 2 Electronics and Communications Engineering Department, American University in Cairo, New
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This
More informationImperfect Randomized Algorithms for the Optimal Control of Wireless Networks
Imperfect Randomized Algorithms for the Optimal Control of Wireless Networks Atilla ryilmaz lectrical and Computer ngineering Ohio State University Columbus, OH 430 mail: eryilmaz@ece.osu.edu Asuman Ozdaglar,
More informationSome Background Math Notes on Limsups, Sets, and Convexity
EE599 STOCHASTIC NETWORK OPTIMIZATION, MICHAEL J. NEELY, FALL 2008 1 Some Background Math Notes on Limsups, Sets, and Convexity I. LIMITS Let f(t) be a real valued function of time. Suppose f(t) converges
More informationLearning Algorithms for Minimizing Queue Length Regret
Learning Algorithms for Minimizing Queue Length Regret Thomas Stahlbuhk Massachusetts Institute of Technology Cambridge, MA Brooke Shrader MIT Lincoln Laboratory Lexington, MA Eytan Modiano Massachusetts
More informationOPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS
OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS Xiaofei Fan-Orzechowski Department of Applied Mathematics and Statistics State University of New York at Stony Brook Stony
More informationStability of Primal-Dual Gradient Dynamics and Applications to Network Optimization
Stability of Primal-Dual Gradient Dynamics and Applications to Network Optimization Diego Feijer Fernando Paganini Department of Electrical Engineering, Universidad ORT Cuareim 1451, 11.100 Montevideo,
More informationEnergy Harvesting Multiple Access Channel with Peak Temperature Constraints
Energy Harvesting Multiple Access Channel with Peak Temperature Constraints Abdulrahman Baknina, Omur Ozel 2, and Sennur Ulukus Department of Electrical and Computer Engineering, University of Maryland,
More informationNonlinear Programming 3rd Edition. Theoretical Solutions Manual Chapter 6
Nonlinear Programming 3rd Edition Theoretical Solutions Manual Chapter 6 Dimitri P. Bertsekas Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts 1 NOTE This manual contains
More informationResource Pooling for Optimal Evacuation of a Large Building
Proceedings of the 47th IEEE Conference on Decision and Control Cancun, Mexico, Dec. 9-11, 28 Resource Pooling for Optimal Evacuation of a Large Building Kun Deng, Wei Chen, Prashant G. Mehta, and Sean
More informationChannel Allocation Using Pricing in Satellite Networks
Channel Allocation Using Pricing in Satellite Networks Jun Sun and Eytan Modiano Laboratory for Information and Decision Systems Massachusetts Institute of Technology {junsun, modiano}@mitedu Abstract
More informationELE539A: Optimization of Communication Systems Lecture 6: Quadratic Programming, Geometric Programming, and Applications
ELE539A: Optimization of Communication Systems Lecture 6: Quadratic Programming, Geometric Programming, and Applications Professor M. Chiang Electrical Engineering Department, Princeton University February
More informationOn Stochastic Subgradient Mirror-Descent Algorithm with Weighted Averaging
On Stochastic Subgradient Mirror-Descent Algorithm with Weighted Averaging arxiv:307.879v [math.oc] 7 Jul 03 Angelia Nedić and Soomin Lee July, 03 Dedicated to Paul Tseng Abstract This paper considers
More informationOn the Optimality of Multiuser Zero-Forcing Precoding in MIMO Broadcast Channels
On the Optimality of Multiuser Zero-Forcing Precoding in MIMO Broadcast Channels Saeed Kaviani and Witold A. Krzymień University of Alberta / TRLabs, Edmonton, Alberta, Canada T6G 2V4 E-mail: {saeed,wa}@ece.ualberta.ca
More informationAchieving Shannon Capacity Region as Secrecy Rate Region in a Multiple Access Wiretap Channel
Achieving Shannon Capacity Region as Secrecy Rate Region in a Multiple Access Wiretap Channel Shahid Mehraj Shah and Vinod Sharma Department of Electrical Communication Engineering, Indian Institute of
More informationOptimizing Queueing Cost and Energy Efficiency in a Wireless Network
Optimizing Queueing Cost and Energy Efficiency in a Wireless Network Antonis Dimakis Department of Informatics Athens University of Economics and Business Patission 76, Athens 1434, Greece Email: dimakis@aueb.gr
More informationA Proof of the Converse for the Capacity of Gaussian MIMO Broadcast Channels
A Proof of the Converse for the Capacity of Gaussian MIMO Broadcast Channels Mehdi Mohseni Department of Electrical Engineering Stanford University Stanford, CA 94305, USA Email: mmohseni@stanford.edu
More informationThe Power of Online Learning in Stochastic Network Optimization
he Power of Online Learning in Stochastic Networ Optimization Longbo Huang, Xin Liu, Xiaohong Hao {longbohuang haoxh10@tsinghua.edu.cn, IIIS, singhua University liuxin@microsoft.com, Microsoft Research
More informationQueue Proportional Scheduling in Gaussian Broadcast Channels
Queue Proportional Scheduling in Gaussian Broadcast Channels Kibeom Seong Dept. of Electrical Engineering Stanford University Stanford, CA 9435 USA Email: kseong@stanford.edu Ravi Narasimhan Dept. of Electrical
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More informationUnderstanding the Capacity Region of the Greedy Maximal Scheduling Algorithm in Multi-hop Wireless Networks
1 Understanding the Capacity Region of the Greedy Maximal Scheduling Algorithm in Multi-hop Wireless Networks Changhee Joo, Member, IEEE, Xiaojun Lin, Member, IEEE, and Ness B. Shroff, Fellow, IEEE Abstract
More informationNEW LAGRANGIAN METHODS FOR CONSTRAINED CONVEX PROGRAMS AND THEIR APPLICATIONS. Hao Yu
NEW LAGRANGIAN METHODS FOR CONSTRAINED CONVEX PROGRAMS AND THEIR APPLICATIONS by Hao Yu A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment
More informationConvex Optimization and Modeling
Convex Optimization and Modeling Duality Theory and Optimality Conditions 5th lecture, 12.05.2010 Jun.-Prof. Matthias Hein Program of today/next lecture Lagrangian and duality: the Lagrangian the dual
More informationSCHEDULING is one of the core problems in the design
IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 66, NO. 6, JUNE 2017 5387 Head-of-Line Access Delay-Based Scheduling Algorithm for Flow-Level Dynamics Yi Chen, Student Member, IEEE, Xuan Wang, and Lin
More informationOptimization for Communications and Networks. Poompat Saengudomlert. Session 4 Duality and Lagrange Multipliers
Optimization for Communications and Networks Poompat Saengudomlert Session 4 Duality and Lagrange Multipliers P Saengudomlert (2015) Optimization Session 4 1 / 14 24 Dual Problems Consider a primal convex
More informationPrimal-dual Subgradient Method for Convex Problems with Functional Constraints
Primal-dual Subgradient Method for Convex Problems with Functional Constraints Yurii Nesterov, CORE/INMA (UCL) Workshop on embedded optimization EMBOPT2014 September 9, 2014 (Lucca) Yu. Nesterov Primal-dual
More information