Stochastic Gradient Descent Algorithms for Resource Allocation


Amrit Singh Bedi
Supervisor: Dr. Ketan Rajawat
Department of Electrical Engineering, Indian Institute of Technology Kanpur, Uttar Pradesh

Outline
- Gradient descent algorithm
- Subgradient descent algorithm
- Stochastic subgradient algorithm
- Incremental stochastic subgradient algorithm
- Ergodic stochastic algorithm
- Applications in wireless communication and the smart grid
- Future work

Introduction

The standard convex optimization problem is

    minimize    f(x)
    subject to  g(x) ≤ 0
                x ∈ X

- x is the optimization variable
- f(x) is the objective function
- g(x) is the constraint function
- X represents the convex domain for x

First-order methods
- Only the first-order derivative is required
- Every iteration is inexpensive; no second derivative is needed
- Useful for large-scale optimization problems
- Easily extended to settings with uncertainty
- Useful for making optimal decisions on-the-fly

Gradient Descent Algorithm [1]

Motivation: very useful for large-scale problems, and much faster per iteration.

Definition: If f : Rⁿ → R, the gradient is given by

    ∇f(x₁, x₂, …, xₙ) := (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)

Algorithm:

    x^(t+1) = x^(t) − ε^(t) ∇f(x^(t))

where ε^(t) is the step size for the algorithm.

[1] Boyd S, Vandenberghe L. Convex Optimization. Cambridge University Press; 2004.
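The update above can be sketched in a few lines. The objective f(x, y) = x² + 10y² is a hypothetical example (not from the slides); its gradient is L-Lipschitz with L = 20, so any constant step ε < 2/L = 0.1 drives the iterates to the minimizer (0, 0).

```python
def grad(x):
    # Gradient of the example objective f(x, y) = x^2 + 10 y^2.
    return [2.0 * x[0], 20.0 * x[1]]

def gradient_descent(x0, eps=0.05, iters=200):
    # Constant-step gradient descent: x^(t+1) = x^(t) - eps * grad f(x^(t)).
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - eps * gi for xi, gi in zip(x, g)]
    return x

x_star = gradient_descent([3.0, -2.0])  # converges toward (0, 0)
```

With ε = 0.05 < 2/L the iterates contract geometrically, while ε > 0.1 would make the y-coordinate diverge, illustrating the step-size condition in the theorem below.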

Convergence analysis
- The convergence behaviour of the algorithm is governed by ε^(t)
- Too small a value of ε^(t) makes the algorithm converge slowly
- Too large a value can make the algorithm overshoot and diverge

A simple convergence analysis for a constant step size follows.

Definition: A function f : Rⁿ → R has an L-Lipschitz continuous gradient if and only if

    ‖∇f(x) − ∇f(y)‖₂ ≤ L ‖x − y‖₂  for all x, y ∈ Rⁿ

Implications:
- A Lipschitz continuous gradient is denoted f ∈ C_L
- The speed at which the gradient varies is bounded
- The objective function has bounded curvature

Lemma: Let f ∈ C_L. Then the following upper bound holds:

    f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2) ‖y − x‖₂²

Theorem: If f ∈ C_L and f* = min_x f(x) > −∞, then the algorithm with constant step size ε < 2/L converges to a stationary point, in the sense that

    ‖∇f(x^(t))‖² → 0  as t → ∞.

Proof: Using the lemma with y = x^(t+1), we get

    f(x^(t+1)) ≤ f(x^(t)) + ∇f(x^(t))ᵀ(x^(t+1) − x^(t)) + (L/2) ‖x^(t+1) − x^(t)‖²
              = f(x^(t)) − ε ‖∇f(x^(t))‖² + (ε²L/2) ‖∇f(x^(t))‖²
              = f(x^(t)) − ε (1 − (ε/2)L) ‖∇f(x^(t))‖²

Rearranging,

    ‖∇f(x^(t))‖² ≤ ( f(x^(t)) − f(x^(t+1)) ) / ( ε (1 − (ε/2)L) )

and summing over t = 0, …, T − 1,

    Σ_{t=0}^{T−1} ‖∇f(x^(t))‖² ≤ ( f(x^(0)) − f(x^(T)) ) / ( ε (1 − (ε/2)L) )
                               ≤ ( f(x^(0)) − f* ) / ( ε (1 − (ε/2)L) )

Since f* > −∞, the sum on the left must converge as T → ∞, and therefore

    ‖∇f(x^(t))‖² → 0  as t → ∞.

Bound on ‖x − x*‖₂

In a similar way, if the function f is assumed to be strongly convex, i.e., ∇²f(x) ⪰ mI, then we can bound the term ‖x − x*‖₂ as follows:

    ‖x − x*‖₂ ≤ (2/m) ‖∇f(x)‖₂

which follows by substituting y = x* in the inequality

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2) ‖y − x‖₂²

Convergence rate

The smoothness of the objective controls the convergence rate of gradient-based methods.

    Convex objective f(x)           Iterations to ε-accuracy
    Nondifferentiable               O(1/ε²)
    Differentiable                  O(1/ε)
    Smooth (Lipschitz gradient)     O(1/√ε)
    Strongly convex                 O(log(1/ε))

Contrast with Newton's method

In general, when minimizing an n-dimensional objective function:
- Gradient descent requires more iterations, but each one is fast (only first derivatives are computed)
- Newton's method requires fewer iterations, but each one is slow (second derivatives are computed too)

Recent results:
- Accelerated methods are proposed in [2]
- An O(1/k) gradient method for network resource allocation is proposed in [3]

[2] Tseng P. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM J. Optim.
[3] Beck A, Nedic A, Ozdaglar A, Teboulle M. An O(1/k) Gradient Method for Network Resource Allocation Problems. IEEE TCNS.

Subgradient Methods

The subgradient method [4] is a simple algorithm for minimizing a nondifferentiable convex function.
- Applies directly to nondifferentiable objective functions
- In contrast to the gradient method, the function value may increase

Definition: A vector g is a subgradient of f(·) at x if

    f(y) ≥ f(x) + gᵀ(y − x)  for all y

[4] Boyd S, Mutapcic A. Subgradient methods. Lecture notes of EE364b, Stanford University, Winter Quarter 2006-07.

Figure: a one-dimensional example; g₁, g₂, and g₃ are subgradients at x₁ and x₂.

Algorithm

For unconstrained convex problems, the algorithm is

    x^(t+1) = x^(t) − ε^(t) g^(t)

Here, g^(t) is any subgradient of f at x^(t) and ε^(t) > 0 is the step size.

Since it is not a descent method, it is common to keep track of the best point found so far:

    f_best^(t) := min{ f(x^(1)), f(x^(2)), …, f(x^(t)) }
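As a sketch of the update and the best-value bookkeeping, consider f(x) = |x| (a hypothetical one-dimensional example, not from the slides); sign(x) is a valid subgradient everywhere, taking the value 0 at x = 0.

```python
def subgrad_abs(x):
    # A subgradient of |x|: sign(x), taking 0 at x = 0.
    return (x > 0) - (x < 0)

def subgradient_method(x0, eps=0.01, iters=500):
    # Constant-step subgradient method with best-value tracking,
    # since the function value is not guaranteed to decrease.
    x, f_best = x0, abs(x0)
    for _ in range(iters):
        x = x - eps * subgrad_abs(x)
        f_best = min(f_best, abs(x))
    return f_best

best = subgradient_method(1.0)  # f_best approaches f* = 0 up to O(eps)
```

With a constant step the iterates eventually oscillate around the minimizer at a scale set by ε, consistent with the ε-dependent gap in the convergence analysis.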

Convergence Analysis

The method is guaranteed to converge to within some range of the optimal value:

    lim_{t→∞} f_best^(t) − f* < ε̄

where ε̄ is a function of the step size.

Assumptions for the proof:
- There exists a minimizer of f, say x*
- The norm of the subgradient is bounded, i.e., ‖g^(t)‖₂ ≤ G for all t

Convergence Analysis

Consider the Euclidean distance to the optimal point:

    ‖x^(t+1) − x*‖₂² = ‖x^(t) − ε g^(t) − x*‖₂²
                     ≤ ‖x^(t) − x*‖₂² − 2ε ( f(x^(t)) − f* ) + ε²G²

Summation over t yields

    ‖x^(t+1) − x*‖₂² ≤ ‖x^(1) − x*‖₂² − 2ε Σ_{i=1}^{t} ( f(x^(i)) − f* ) + ε²TG²

so that, with ‖x^(1) − x*‖₂ ≤ R,

    2ε Σ_{i=1}^{t} ( f(x^(i)) − f* ) ≤ R² + ε²TG²

and hence

    f_best^(t) − f* ≤ ( R² + ε²TG² ) / ( 2εT )

A simple example is given in [5].

Projected subgradient method

Consider the constrained optimization problem

    minimize    f(x)
    subject to  x ∈ X

The projected subgradient algorithm [4] for this problem is

    x^(t+1) = P_X[ x^(t) − ε^(t) g^(t) ]

[4] Boyd S, Mutapcic A. Subgradient methods. Lecture notes of EE364b, Stanford University, Winter Quarter 2006-07.
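A minimal sketch of the projected update, on the hypothetical problem of minimizing f(x) = (x − 3)² over X = [0, 1] (not from the slides); the projection P_X is a simple clamp and the constrained minimizer is x* = 1.

```python
def project(x, lo=0.0, hi=1.0):
    # Euclidean projection onto the box X = [lo, hi].
    return max(lo, min(hi, x))

def projected_subgradient(x0, eps=0.1, iters=100):
    # x^(t+1) = P_X[x^(t) - eps * g^(t)], with g the gradient of (x - 3)^2.
    x = x0
    for _ in range(iters):
        g = 2.0 * (x - 3.0)
        x = project(x - eps * g)
    return x

x_star = projected_subgradient(0.0)  # converges to the boundary point 1.0
```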

- Simple to implement and applicable to a wide variety of problems
- Subgradient methods were first introduced in the mid-sixties by N. Z. Shor [6]
- An extensive treatment of these subgradient methods is provided in the books [7, 17]
- Nemirovski and Yudin [8] derived the worst-case complexity bound for reaching an ε-solution: O(1/ε²) for Lipschitz continuous nonsmooth problems and O(1/√ε) for smooth problems with Lipschitz continuous gradient
- Combined with primal or dual decomposition techniques, they sometimes yield simple distributed algorithms [9]

[6] Shor NZ. Minimization Methods for Non-differentiable Functions. Springer.
[7] Shor NZ. Nondifferentiable Optimization and Polynomial Problems. Springer Science & Business Media.
[8] Blair C. Problem Complexity and Method Efficiency in Optimization (AS Nemirovsky and DB Yudin). SIAM Review.
[9] Palomar DP, Chiang M. A tutorial on decomposition methods for network utility maximization. IEEE JSAC.

Dual Descent Algorithms [10, 11]

Consider the primal problem

    minimize    f₀(x)
    subject to  f_i(x) ≤ 0,  i = 1, …, m.

The Lagrangian is given by

    L(x, λ) = f₀(x) + Σ_{i=1}^{m} λ_i f_i(x)

We assume that x*(λ) is the unique minimizer of the Lagrangian over x.

[10] Kelly F. Charging and rate control for elastic traffic. European Transactions on Telecommunications.
[11] Bertsekas DP, Nedic A, Ozdaglar AE. Convex Analysis and Optimization.

The dual function is

    g(λ) = inf_x L(x, λ) = f₀(x*(λ)) + Σ_{i=1}^{m} λ_i f_i(x*(λ))    (for λ ⪰ 0)

The dual problem is

    maximize    g(λ)
    subject to  λ ⪰ 0

Assuming Slater's condition holds, we have x* = x*(λ*).

Applying the projected subgradient method on the dual, we get

    x^(t)     = argmin_x ( f₀(x) + Σ_{i=1}^{m} λ_i^(t) f_i(x) )
    λ_i^(t+1) = [ λ_i^(t) + α_t f_i(x^(t)) ]₊
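The two-step dual iteration can be sketched on a toy problem (a hypothetical example, not from the slides): minimize x² subject to 1 − x ≤ 0, whose optimum is x* = 1 with multiplier λ* = 2; the inner argmin has the closed form x(λ) = λ/2.

```python
def dual_descent(lam0=0.0, alpha=0.1, iters=500):
    # Dual subgradient ascent for: minimize x^2  s.t.  1 - x <= 0.
    lam = lam0
    for _ in range(iters):
        x = lam / 2.0                              # argmin_x x^2 + lam * (1 - x)
        lam = max(0.0, lam + alpha * (1.0 - x))    # lam^(t+1) = [lam^(t) + alpha f_1(x^(t))]_+
    return lam / 2.0, lam

x_star, lam_star = dual_descent()  # approaches (1.0, 2.0)
```

Because the problem satisfies Slater's condition, the primal optimum is recovered as x*(λ*), exactly as stated above.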

Motivation

Motivation for solving the dual problem:
- The dual is a convex optimization problem
- The dual may have smaller dimension than the primal
- If the duality gap is zero, the primal optimum can be derived from the dual
- If the duality gap is nonzero, the dual provides a lower bound for the primal

Recent results and scope

Recent convergence results in this direction are discussed in [12], where an averaging scheme is applied:
- Provides convergence-rate estimates for the approximate solutions (under Slater's qualification)
- Bounds the amount of feasibility violation
- Provides upper and lower bounds on the primal objective function

Accelerated dual descent for network flow optimization is proposed in [13].

[12] Nedic A, Ozdaglar A. Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Optimization.
[13] Zargham M, Ribeiro A, Ozdaglar A, Jadbabaie A. Accelerated dual descent for network flow optimization. IEEE TAC.

Standard Convex Stochastic Optimization Problem

    minimize    E[f₀(x, w)]
    subject to  E[f_i(x, w)] ≤ 0,  i = 1, 2, …, m.
                x ∈ X

- x is the optimization variable; w is the random variable
- f_i(x, w) is convex in x for each realization of w
- X represents the box constraints

History and Motivation
- The history of stochastic algorithms dates back to the adaptive filtering work of Robbins and Monro [14] and Widrow and Stearns [15]
- Extensively studied in the context of LMS and RLS algorithms [16]
- Stochastic subgradient methods with detailed analysis are discussed in [18, 19]

[14] Robbins H, Monro S. A stochastic approximation method. The Annals of Mathematical Statistics.
[15] Widrow B, Stearns SD. Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.
[16] Sayed AH. Adaptive Filters. John Wiley & Sons.
[18] Ermoliev Y. Stochastic quasigradient methods and their application to system optimization. Stochastics.

Different applications

Stochastic gradient and subgradient methods have been widely applied to
- neural networks [20],
- parameter tracking [21],
- large-scale machine learning [22, 23], and
- resource allocation problems [24, 25, 27, 28, 34].

[20] Bottou L. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes.
[21] Kushner HJ, Yang J. Analysis of adaptive step size SA algorithms for parameter tracking. Proc. of the IEEE Conference on Decision and Control.
[22] Bottou L. Large-scale machine learning with stochastic gradient descent. In Proc. of COMPSTAT.
[24] Alaei S, Hajiaghayi M, Liaghat V. The online stochastic generalized assignment problem. In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques.
[27] Wang X, Gao N. Stochastic resource allocation over fading multiple access and broadcast channels. IEEE TIT.

Relation with the LMS algorithm

The goal is to minimize E[|e(t)|²]. Utilizing steepest descent, we get

    h(t+1) = h(t) + μ E[u(t) e(t)]

When the statistics are not known, replacing the expectation by its instantaneous approximation gives the LMS update

    h(t+1) = h(t) + μ u(t) e(t)

which is nothing but a stochastic gradient descent algorithm.
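The two updates above differ only in whether the expectation is available. A minimal LMS sketch, identifying a hypothetical 2-tap filter h_true from synthetic noiseless input/output pairs (all data below are assumptions for illustration):

```python
import random

def lms_identify(h_true, mu=0.05, steps=2000, seed=0):
    # LMS adaptive filter: h(t+1) = h(t) + mu * u(t) * e(t),
    # with e(t) = d(t) - h(t)^T u(t) and d(t) generated by h_true.
    rng = random.Random(seed)
    h = [0.0] * len(h_true)
    for _ in range(steps):
        u = [rng.gauss(0.0, 1.0) for _ in h_true]          # input vector u(t)
        d = sum(ht * ut for ht, ut in zip(h_true, u))      # desired output d(t)
        e = d - sum(hi * ui for hi, ui in zip(h, u))       # error e(t)
        h = [hi + mu * ui * e for hi, ui in zip(h, u)]     # stochastic gradient step
    return h

h_hat = lms_identify([1.0, -0.5])  # h_hat approaches h_true
```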

Stochastic Subgradient Methods
- Stochastic subgradient algorithms [29] are a generalization of the (sub)gradient ones
- They use noisy subgradients and a more limited set of step-size rules

Definition: A noisy (unbiased) subgradient of f(·) at x ∈ dom f is a vector g̃ such that

    f(y) ≥ f(x) + (E[g̃])ᵀ(y − x)  for all y

This noisy subgradient can be written as g̃ = g + ν, where g ∈ ∂f(x) and ν is a zero-mean random vector.

If x in the problem is also random (specifically in resource allocation problems), then g̃ is said to be a noisy subgradient of f(·) at x if, for all y,

    f(y) ≥ f(x) + (E[g̃ | x])ᵀ(y − x)

[29] Weng L, Chen Y. Stochastic Subgradient Methods.

Algorithm

For unconstrained minimization of a convex function f : Rⁿ → R, the stochastic subgradient update is given as

    x^(t+1) = x^(t) − ε^(t) g̃^(t)

where ε^(t) > 0 is the t-th step size, and g̃^(t) is a stochastic subgradient satisfying

    E[g̃^(t) | x^(t)] = g^(t) ∈ ∂f(x^(t))

In this algorithm, similar to the subgradient one,

    f_best^(t) := min{ f(x^(1)), f(x^(2)), …, f(x^(t)) }
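A sketch of the update with an unbiased noisy gradient, on a hypothetical problem (not from the slides): minimize f(x) = E[(x − w)²] with w drawn uniformly from {1, 2, 3}, so the minimizer is the mean E[w] = 2. The sampled gradient 2(x − w) satisfies E[g̃ | x] = 2(x − 2).

```python
import random

def stochastic_gradient(steps=5000, eps=0.01, seed=1):
    # x^(t+1) = x^(t) - eps * g~^(t), where g~^(t) = 2 * (x - w_t) is an
    # unbiased estimate of the true gradient 2 * (x - E[w]).
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        w = rng.choice([1.0, 2.0, 3.0])
        x -= eps * 2.0 * (x - w)
    return x

x_star = stochastic_gradient()  # hovers near the minimizer 2.0
```

With a constant step the iterate fluctuates around the optimum at a scale governed by ε and the noise variance, matching the (R² + ε²TG²)/(2εT) bound below.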

Convergence analysis

Assumptions:
- There exists a minimizer of f, say x*
- There exists G such that E[‖g̃^(t)‖₂²] ≤ G² for all t
- There exists R such that ‖x^(1) − x*‖₂ ≤ R

Results:
- Convergence in expectation:

    E[f_best^(t)] → f*  as t → ∞.

- Convergence in probability:

    lim_{t→∞} Prob( f_best^(t) ≥ f* + α ) = 0  for any α > 0.

Proof:

    E[ ‖x^(t+1) − x*‖₂² | x^(t) ] = E[ ‖x^(t) − ε g̃^(t) − x*‖₂² | x^(t) ]
        ≤ ‖x^(t) − x*‖₂² − 2ε E[ g̃^(t)ᵀ(x^(t) − x*) | x^(t) ] + ε²G²
        ≤ ‖x^(t) − x*‖₂² − 2ε ( f(x^(t)) − f* ) + ε²G²

The above inequality holds almost surely. Now take expectations to get

    E[ ‖x^(t+1) − x*‖₂² ] ≤ E[ ‖x^(t) − x*‖₂² ] − 2ε ( E[f(x^(t))] − f* ) + ε²G²

Taking the summation over t = 1, 2, …, T, we get

    E[f_best^(T)] − f* = E[ min_{i=1,…,T} f(x^(i)) ] − f* ≤ ( R² + ε²TG² ) / ( 2εT )

For convergence in probability, use Markov's inequality:

    Prob( f_best^(t) − f* ≥ α ) ≤ E[ f_best^(t) − f* ] / α  for any α > 0.

The RHS goes to zero as t → ∞, and hence so does the LHS.

Incremental Stochastic Subgradient Algorithms [30]

A problem of recent interest in distributed networks is
- to design decentralized algorithms that minimize a sum of functions,
- where each component function is known only to a particular agent.

Consider a network of m agents, indexed by i = 1, 2, …, m. The aim is to solve the following optimization problem:

    minimize    f(x) = Σ_{i=1}^{m} f_i(x)
    subject to  x ∈ X

- x ∈ Rⁿ is the decision parameter vector
- X is a closed and convex subset of Rⁿ
- f_i is a convex function from Rⁿ to R known only to agent i

[30] Ram SS, Nedic A, Veeravalli VV. Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization.

Algorithms

Cyclic incremental subgradient algorithm: in a network where agents are connected in a directed ring structure, the updates are

    z_{0,t+1} = z_{m,t} = x_t
    z_{i,t+1} = P_X[ z_{i−1,t+1} − α^(t+1) ( ∇f_i(z_{i−1,t+1}) + ε_{i,t+1} ) ]

Randomized incremental subgradient algorithm: the agent i that updates is selected randomly according to a distribution. Formally, the updates are

    x^(t+1) = P_X[ x^(t) − α^(t+1) ( ∇f_{s(t+1)}(x^(t)) + ε_{s(t+1),t+1} ) ]

The integer s(t+1) is the index of the agent that performs the update at time t+1.
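A minimal sketch of the cyclic variant, under assumptions not in the slides: three agents hold f_i(x) = (x − t_i)² with hypothetical targets t_i ∈ {1, 2, 3}, noiseless subgradients (ε_{i,t} = 0), and X = [0, 10]; the minimizer of the sum is the mean of the targets.

```python
def cyclic_incremental(targets=(1.0, 2.0, 3.0), alpha=0.01, cycles=2000):
    # Each cycle passes the iterate around the ring; agent i applies
    # z_i = P_X[z_{i-1} - alpha * grad f_i(z_{i-1})] with f_i(x) = (x - t_i)^2.
    x = 0.0
    for _ in range(cycles):
        z = x
        for t in targets:
            z = z - alpha * 2.0 * (z - t)     # local subgradient step
            z = max(0.0, min(10.0, z))        # projection onto X = [0, 10]
        x = z                                 # x_{t+1} = z_{m,t+1}
    return x

x_star = cyclic_incremental()  # settles near the mean target 2.0, with O(alpha) bias
```

The small constant-step bias away from the exact minimizer is what the error bound on the next slides quantifies.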

Convergence analysis

Assumptions:
- The set X ⊆ Rⁿ is closed and convex.
- The function f_i : Rⁿ → R is convex for each i ∈ {1, 2, …, m}.
- There exist scalar sequences μ_t and ν_t such that

    E[ ‖ε_{i,t}‖ | F_t^{i−1} ] ≤ μ_t
    E[ ‖ε_{i,t}‖² | F_t^{i−1} ] ≤ ν_t²

- For every i, ‖∇f_i(x)‖ ≤ C_i for all x ∈ X.

Results (constant step size α)

Bound on iterates:

    E[ ‖x^(t+1) − y‖² | F_t^m ] ≤ ‖x^(t) − y‖² − 2α ( f(x_t) − f(y) )
        + 2αμ_{t+1} Σ_{i=1}^{m} E[ ‖z_{i−1,t+1} − y‖₂ | F_t^m ]
        + α² ( mν_{t+1} + Σ_{i=1}^{m} C_i )²

Error bound: with probability 1,

    inf_{t≥0} f(x^(t)) ≤ f* + mμ max_{x,y∈X} ‖x − y‖ + (α/2) ( mν + Σ_{i=1}^{m} C_i )²

Ergodic Stochastic Optimization Algorithm [31]

Stochastic resource allocation problem:

    (x*, {p_t*}_{t∈N}) = argmax  f₀(x)
                   s.t.  E[ s_t(p_t, x) ] ≥ 0
                         x ∈ X,  p_t ∈ P_t

- f₀ : Rⁿ → R is a concave function
- s_t : Rⁿ × R^p → R^k is a random function, indexed by time t
- x ∈ Rⁿ is the optimization variable
- p_t ∈ R^p is the online policy to be determined for all t ∈ N

[31] Ribeiro A. Ergodic stochastic optimization algorithms for wireless communication and networking. IEEE TSP, Dec. 2010.

ESO algorithm

The ergodic stochastic optimization (ESO) algorithm consists of the iterative application of the following steps:

Primal iteration:

    (x_t, p_t) = argmax_{x∈X, p∈P_t}  f₀(x) + λ_tᵀ s_t(p, x)

Dual iteration:

    λ_{t+1} = [ λ_t − ε s_t(p_t, x_t) ]₊
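A minimal sketch of the two ESO steps on a hypothetical problem (all problem data here are assumptions, not from the slides): maximize f₀(x) = log(1 + x) subject to E[c_t p_t − x] ≥ 0, with allocation p_t ∈ [0, 1], x ∈ [0, 10], and a random channel c_t ~ Uniform[0, 2]. Both primal argmax steps then have closed forms.

```python
import random

def eso(eps=0.01, steps=20000, seed=2):
    # Primal: x_t = argmax_x log(1 + x) - lam * x   ->  x_t = clip(1/lam - 1);
    #         p_t = argmax_{p in [0,1]} lam * c_t * p  ->  p_t = 1 when lam * c_t > 0.
    # Dual:   lam_{t+1} = [lam_t - eps * s_t]_+  with  s_t = c_t * p_t - x_t.
    rng = random.Random(seed)
    lam, x_avg = 1.0, 0.0
    for t in range(steps):
        x = max(0.0, min(10.0, 1.0 / lam - 1.0)) if lam > 0 else 10.0
        c = rng.uniform(0.0, 2.0)
        p = 1.0 if lam * c > 0 else 0.0
        lam = max(0.0, lam - eps * (c * p - x))
        x_avg += (x - x_avg) / (t + 1)        # ergodic (running) average of x_t
    return x_avg, lam

x_avg, lam = eso()  # the ergodic average of x_t approaches E[c_t] = 1
```

The dual variable settles where the time-averaged constraint slack vanishes, which is the almost-sure feasibility property established for ESO.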

Convergence results

The following asymptotic results are established in [31].

Almost sure feasibility:

    lim_{t→∞} (1/t) Σ_{τ=0}^{t−1} s_τ(p_τ, x_τ) ≥ 0

Almost sure near-optimality:

    lim_{t→∞} f₀(x̄_t) ≥ P − εG²/2

where x̄_t denotes the ergodic average of the iterates and P the optimal value.

Recent results

A recent technique for large-scale machine learning problems is proposed in [32]. The decentralized double stochastic averaging gradient (DSA) algorithm is proposed as a solution alternative that relies on:
- the use of local stochastic averaging gradients;
- the determination of descent steps as differences of consecutive stochastic averaging gradients.

Strong convexity of the local functions and Lipschitz continuity of the local gradients are shown to guarantee linear convergence in expectation of the sequence generated by DSA.

Future scope: almost sure convergence-rate results for stochastic dual descent are not yet available.

[32] Mokhtari A, Ribeiro A. DSA: Decentralized Double Stochastic Averaging Gradient Algorithm. arXiv preprint.

Applications
- Resource allocation in OFDM networks [26, 27, 28]
- Load shedding in the smart grid with real-time pricing (RTP) [33, 34]

[26] Wang X, Giannakis GB. Resource allocation for wireless multiuser OFDM networks. IEEE TIT.
[27] Wang X, Gao N. Stochastic resource allocation over fading multiple access and broadcast channels. IEEE TIT.
[33] Gatsis N, Marques AG. A stochastic approximation approach to load shedding in power networks. IEEE ICASSP.

Resource allocation in OFDM networks: modeling preliminaries

Utility-based resource allocation

Optimization problem:

    max_{r̄ ⪰ r_th, (α,p)∈F}  U(r̄)
    s.t.  r̄_j ≤ E[ Σ_{k=1}^{K} c_{j,k}(α^t_{j,k}, p^t_{j,k}) ]  ∀j    (assign μ_j)
          E[ Σ_{j=1}^{J} Σ_{k=1}^{K} p^t_{j,k} ] ≤ P                  (assign λ)

Explanation

- α^t_{j,k} ≥ 0 and p^t_{j,k} ≥ 0 are the time-sharing fraction of a slot and the average transmit power allocated, with

    Σ_{j=1}^{J} α^t_{j,k} ≤ 1,  k = 1, …, K.

- Assuming AWGN of unit variance at the receiver and unit sub-bandwidth, the maximum achievable rate is

    c^t_{j,k} = α^t_{j,k} log( 1 + γ^t_{j,k} p^t_{j,k} / α^t_{j,k} )  if α^t_{j,k} > 0,  and 0 if α^t_{j,k} = 0.

- The set F is

    F := { (α, p) : α_{j,k} ≥ 0, p_{j,k} ≥ 0, Σ_{j=1}^{J} α^t_{j,k} ≤ 1, E[ Σ_{j=1}^{J} Σ_{k=1}^{K} p_{j,k} ] ≤ P }


Offline solution

Forming the Lagrangian yields two separate primal subproblems (across r̄ and (α, p)):

Subproblem I:

    max_{r̄ ⪰ r_th}  U(r̄) − μᵀ r̄

Subproblem II:

    max_{(α,p)∈F}  λP + E[ Σ_{j=1}^{J} Σ_{k=1}^{K} ( μ_j c^t_{j,k}(α^t_{j,k}, p^t_{j,k}) − λ p^t_{j,k} ) ]

Its solution provides r̄*(μ) and { α*_{j,k}(λ, μ), p*_{j,k}(λ, μ), ∀j, k }.

Dual optimum using the projected gradient algorithm

The dual iterations are as follows:

    λ[i+1]   = [ λ[i] + β ( E[ Σ_{j=1}^{J} Σ_{k=1}^{K} p^t_{j,k}(λ[i], μ[i]) ] − P ) ]₊
    μ_j[i+1] = [ μ_j[i] + β ( r̄_j(μ[i]) − E[ Σ_{k=1}^{K} r^t_{j,k}(λ[i], μ[i]) ] ) ]₊

Online version

Primal updates: with γ[t], λ̂[t], and μ̂[t] available per slot, the AP schedules according to the allocations α^t(λ̂[t], μ̂[t], γ[t]) and p^t(λ̂[t], μ̂[t], γ[t]).

Dual updates:

    λ̂[t+1]   = [ λ̂[t] + β ( Σ_{j=1}^{J} Σ_{k=1}^{K} p^t_{j,k}(λ̂[t], μ̂[t]) − P ) ]₊
    μ̂_j[t+1] = [ μ̂_j[t] + β ( r̄_j(μ̂[t]) − Σ_{k=1}^{K} r^t_{j,k}(λ̂[t], μ̂[t]) ) ]₊

Future work
- Almost sure convergence results for the incremental stochastic algorithms.
- Convergence results for the cyclo-stationary case are not available.

114 Thank you 51/65

115 References [1] Boyd S, Vandenberghe L. Convex optimization. Cambridge university press; 2004 Mar 8. [2] Tseng P. On accelerated proximal gradient methods for convex-concave optimization Submitted to SIAM J. Optim [3] Beck A, Nedic A, Ozdaglar A, Teboulle M. An Gradient Method for Network Resource Allocation Problems. IEEE TCNS [4] Boyd S, Mutapcic A. Subgradient methods. Lecture notes of EE364b, Stanford University, Winter Quarter. 2006;2007. [5] Boyd S, Xiao L, Mutapcic A. Subgradient methods. lecture notes of EE392o, Stanford University, Autumn Quarter Oct 1;2004: [6] Shor NZ. Minimization Methods for Non-differentiable Functions. Springer Series in Computational Mathematics. Springer, /65

116 References [7] Shor NZ. Nondifferentiable Optimization and Polynomial Problems. Springer Science & Business Media; [8] Blair C. Problem Complexity and Method Efficiency in Optimization (AS Nemirovsky and DB Yudin). SIAM Review [9] Palomar DP, Chiang M. A tutorial on decomposition methods for network utility maximization. IEEE JSAC [10] Kelly F. Charging and rate control for elastic traffic. European transactions on Telecommunications Jan 1. [11] Bertsekas DP, Nedi A, Ozdaglar AE. Convex analysis and optimization. [12] Nedic A, Ozdaglar A. Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Opt [13] Zargham M, Ribeiro A, Ozdaglar A, Jadbabaie A. Accelerated dual descent for network flow optimization. IEEE TAC /65

117 References [14] Robbins H, Monro S. A stochastic approximation method. The Annals of Mathematical Statistics, 1951. [15] Widrow B, Stearns SD. Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall; 1985. [16] Sayed AH. Adaptive Filters. John Wiley & Sons; 2008. [17] Bertsekas DP. Nonlinear Programming. Athena Scientific; 1999. [18] Ermoliev Y. Stochastic quasigradient methods and their application to system optimization. Stochastics: An International Journal of Probability and Stochastic Processes, 1983. [19] Bertsekas DP, Tsitsiklis JN. Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control, 1995. [20] Bottou L. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes, 1991. 54/65

118 References [21] Kushner HJ, Yang J. Analysis of adaptive step size SA algorithms for parameter tracking. In Proceedings of the 33rd IEEE Conference on Decision and Control, 1994. [22] Bottou L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010. Physica-Verlag HD. [23] Moulines E, Bach FR. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, 2011. [24] Alaei S, Hajiaghayi M, Liaghat V. The online stochastic generalized assignment problem. In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, 2013. Springer Berlin Heidelberg. [25] Legrain A, Jaillet P. Stochastic online bipartite resource allocation problems. CIRRELT; 2013. 55/65

119 References [26] Wang X, Giannakis GB. Resource allocation for wireless multiuser OFDM networks. IEEE Transactions on Information Theory, 2011. [27] Wang X, Gao N. Stochastic resource allocation over fading multiple access and broadcast channels. IEEE Transactions on Information Theory. [28] Ribeiro A. Optimal resource allocation in wireless communication and networking. EURASIP Journal on Wireless Communications and Networking, 2012. [29] Weng L, Chen Y. Stochastic subgradient methods. Lecture notes, UC Irvine. [30] Ram SS, Nedić A, Veeravalli VV. Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 2009. 56/65

120 References [31] Ribeiro A. Ergodic stochastic optimization algorithms for wireless communication and networking. IEEE Transactions on Signal Processing, 2010. [32] Mokhtari A, Ribeiro A. DSA: decentralized double stochastic averaging gradient algorithm. arXiv preprint, 2015. [33] Gatsis N, Marques AG. A stochastic approximation approach to load shedding in power networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014. [34] Deng R, Yang Z, Chen J, Chow MY. Load scheduling with price uncertainty and temporally-coupled constraints in smart grids. IEEE Transactions on Power Systems, 2014. 57/65

121 Load Shedding in Smart Grid System model [Figure: the load serving entity (LSE) buys energy from the grid at the posted price, draws on a renewable source and an energy storage (battery), and serves the demand of users 1, 2, 3, …, K.] 58/65

122 System variables System parameters: π^t: actual demand minus procured energy at slot t; w^t: renewable energy produced at slot t; a^t: real-time energy price at slot t; r^t: state of charge of the battery at the end of slot t. 59/65

123 System variables System parameters: π^t: actual demand minus procured energy at slot t; w^t: renewable energy produced at slot t; a^t: real-time energy price at slot t; r^t: state of charge of the battery at the end of slot t. Optimization variables: s_k^t: amount of load shed for user k at slot t; b^t: amount of energy bought at slot t; e_out^t: energy discharged from the battery at slot t. [33] Gatsis N, Marques AG. A stochastic approximation approach to load shedding in power networks. IEEE ICASSP, 2014. [34] Deng R, Yang Z. Load scheduling with price uncertainty and temporally-coupled constraints in smart grids. IEEE TPS, 2014. 59/65

124 Problem formulation The system variables must satisfy the following relation:
π^t − w^t ≤ Σ_{k=1}^K s_k^t + b^t + η_dis e_out^t   (74)
60/65

125 Problem formulation The system variables must satisfy the following relation:
π^t − w^t ≤ Σ_{k=1}^K s_k^t + b^t + η_dis e_out^t   (74)
Battery dynamics equations:
r^t = r^{t−1} + e_in^t − e_out^t,  t = 1, …, T   (75a)
0 ≤ r^t ≤ R,  t = 1, …, T   (75b)
e_in^t = η_ch min{ e_in^max, [π^t − w^t]_− };  0 ≤ e_out^t ≤ e_out^max   (75c)
where [x]_− := max{−x, 0}.
60/65
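The battery recursion (75) can be simulated directly. This sketch uses illustrative parameter values and a simple greedy discharge rule in place of the optimized e_out^t; only the recursion itself follows the slides.

```python
# Sketch of the battery dynamics (75): r^t = r^{t-1} + e_in^t - e_out^t,
# charging only on renewable surplus (capped by e_in_max and scaled by eta_ch),
# state of charge clipped to [0, R]. Parameter values are illustrative.
eta_ch, e_in_max, R = 0.9, 2.0, 10.0
r = 5.0                                  # initial state of charge
for net in [-3.0, 1.0, -0.5, 4.0]:       # net = pi^t - w^t (demand minus renewables)
    e_in = eta_ch * min(e_in_max, max(-net, 0.0))  # charge on surplus, per (75c)
    e_out = min(r, max(net, 0.0))        # greedy discharge to cover the deficit
    r = min(max(r + e_in - e_out, 0.0), R)
    print(r)                             # final state of charge is about 2.25
```

The greedy discharge is only a stand-in: in the formulation above, e_out^t is an optimization variable priced by a dual multiplier rather than chosen myopically.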

126 Optimization problem
min_{ {s^t}, {b^t}, {e_out^t}, {ŝ_k} }  Σ_{k=1}^K J_k(ŝ_k) + lim_{T→∞} (1/T) Σ_{t=1}^T a^t b^t   (76a)
subject to
lim_{T→∞} (1/T) Σ_{t=1}^T s_k^t ≤ ŝ_k,  k = 1, …, K   (assign σ_k)   (76b)
lim_{T→∞} (1/T) Σ_{t=1}^T e_out^t = lim_{T→∞} (1/T) Σ_{t=1}^T e_in^t   (assign ρ)   (76c)
(74), (75),  0 ≤ b^t ≤ b^max  ∀t   (76d)
0 ≤ s_k^t ≤ s_k^max  ∀t, k   (76e)
61/65

128 Offline solution average primal variables:
ŝ_k(σ_k) = argmin_s { J_k(s) − σ_k s }   (78)
63/65
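For a concrete instance of the scalar minimization in (78), assume a quadratic disutility J_k(s) = c_k s²/2 — an illustrative choice, not specified on the slides — and clip the unconstrained minimizer to the feasible range [0, s_k^max]:

```python
# Closed form of (78) for the hypothetical quadratic J_k(s) = 0.5 * c_k * s^2:
# setting J_k'(s) = sigma_k gives s = sigma_k / c_k, then project onto [0, s_max].
def s_hat(sigma_k, c_k=2.0, s_max=4.0):
    return min(max(sigma_k / c_k, 0.0), s_max)

print(s_hat(3.0))   # 1.5
print(s_hat(10.0))  # 4.0 (hits the cap s_max)
```

The point is that the average primal variable decouples per user and reduces to a one-dimensional problem parameterized by the multiplier σ_k.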

129 Offline solution average primal variables:
ŝ_k(σ_k) = argmin_s { J_k(s) − σ_k s }   (78)
instantaneous primal variables
63/65

130 Offline solution average primal variables:
ŝ_k(σ_k) = argmin_s { J_k(s) − σ_k s }   (78)
instantaneous primal variables If π^t − w^t ≤ 0, there is an instantaneous energy surplus, which is stored in the battery: e_in^t = η_ch min{ e_in^max, w^t − π^t }, while e_out^t(·), b^t(·), s_k^t(·) are 0 ∀k.
63/65

131 Offline solution average primal variables:
ŝ_k(σ_k) = argmin_s { J_k(s) − σ_k s }   (78)
instantaneous primal variables If π^t − w^t ≤ 0, there is an instantaneous energy surplus, which is stored in the battery: e_in^t = η_ch min{ e_in^max, w^t − π^t }, while e_out^t(·), b^t(·), s_k^t(·) are 0 ∀k. If π^t − w^t > 0, there is an instantaneous energy deficit, and the optimization variables are found by solving
min_{ s_k^t, b^t, e_out^t ∈ S }  a^t b^t + ρ e_out^t + Σ_{k=1}^K σ_k s_k^t   (79)
subject to  π^t − w^t ≤ b^t + η_dis e_out^t + Σ_{k=1}^K s_k^t   (80)
63/65
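The per-slot problem (79)-(80) is a small linear program: one coupling constraint plus box constraints, so filling the deficit in merit order (cheapest marginal cost first) is optimal. A sketch with hypothetical prices and capacities, measuring each option in units of deficit covered (so battery discharge costs ρ/η_dis per delivered unit and supplies at most η_dis e_out^max units):

```python
# Greedy merit-order solution of the per-slot LP (79)-(80).
# All prices, capacities, and the function name are illustrative.
def per_slot(deficit, a, rho, sigma, eta_dis, b_max, e_max, s_max):
    # (marginal cost per unit of deficit covered, capacity in those units, label)
    options = [(a, b_max, "grid"), (rho / eta_dis, eta_dis * e_max, "battery")]
    options += [(sig, s_max, f"shed_{k}") for k, sig in enumerate(sigma)]
    options.sort(key=lambda opt: opt[0])   # cheapest source first
    alloc = {}
    for cost, cap, name in options:
        take = min(cap, deficit)
        if take > 0:
            alloc[name] = take
        deficit -= take
        if deficit <= 1e-12:
            break
    return alloc

res = per_slot(5.0, a=1.0, rho=1.8, sigma=[2.5], eta_dis=0.9,
               b_max=3.0, e_max=2.0, s_max=4.0)
print(res)  # grid fills first (3.0), then battery (~1.8), shedding covers ~0.2
```

The "battery" entry is the delivered amount η_dis e_out^t; divide by η_dis to recover e_out^t itself. A generic LP solver would give the same answer, but the merit-order view makes the role of the dual prices ρ and σ_k explicit.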

132 Online version Challenges with the offline technique: the main obstacle is finding the optimal σ_k*, ρ*; knowledge of the joint distribution of {w^t, a^t} is required; the algorithm is computationally expensive. Merits of stochastic approximation: only the current samples w^t, a^t are required; it is computationally efficient. 64/65

133 Online algorithm With µ_σ > 0 and µ_ρ > 0 denoting constant step sizes, Dual updates:
ρ^{t+1} = [ ρ^t − µ_ρ ( η_ch e_in^t(d^t) − e_out^t(d^t) ) ]_+   (81)
σ_k^{t+1} = [ σ_k^t + µ_σ ( s_k^t(d^t) − ŝ_k(σ_k^t) ) ]_+   (82)
The primal variables are calculated as in the offline case, replacing ρ with ρ^t and σ_k with σ_k^t. 65/65
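One step of the online dual updates can be sketched as follows. The sign convention assumed here raises σ_k when the realized shedding exceeds its average target and lowers ρ when charging outpaces discharging; d^t stands for the random state observed in slot t, and all numeric values in the usage example are illustrative.

```python
# One step of the per-slot dual updates with constant step sizes mu_rho, mu_sigma.
# e_in, e_out, s are the slot-t realizations; s_hat holds the average targets.
def dual_step(rho, sigma, e_in, e_out, s, s_hat, eta_ch, mu_rho, mu_sigma):
    rho = max(rho - mu_rho * (eta_ch * e_in - e_out), 0.0)
    sigma = [max(sig + mu_sigma * (s[k] - s_hat[k]), 0.0)
             for k, sig in enumerate(sigma)]
    return rho, sigma

rho, sigma = dual_step(1.0, [2.0], e_in=1.0, e_out=0.5, s=[1.0], s_hat=[0.4],
                       eta_ch=0.9, mu_rho=0.1, mu_sigma=0.1)
print(rho, sigma)  # charging exceeded discharging, so rho drops; sigma_0 rises
```

Running this step each slot, with the primal variables recomputed from the current ρ^t and σ_k^t as in the offline case, gives the full online algorithm; no distributional knowledge of {w^t, a^t} is needed.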


More information

Lecture 6: Conic Optimization September 8

Lecture 6: Conic Optimization September 8 IE 598: Big Data Optimization Fall 2016 Lecture 6: Conic Optimization September 8 Lecturer: Niao He Scriber: Juan Xu Overview In this lecture, we finish up our previous discussion on optimality conditions

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new

More information

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Meisam Razaviyayn meisamr@stanford.edu Mingyi Hong mingyi@iastate.edu Zhi-Quan Luo luozq@umn.edu Jong-Shi Pang jongship@usc.edu

More information

Dual Proximal Gradient Method

Dual Proximal Gradient Method Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method

More information

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization

More information

Dynamic Power Allocation and Routing for Time Varying Wireless Networks

Dynamic Power Allocation and Routing for Time Varying Wireless Networks Dynamic Power Allocation and Routing for Time Varying Wireless Networks X 14 (t) X 12 (t) 1 3 4 k a P ak () t P a tot X 21 (t) 2 N X 2N (t) X N4 (t) µ ab () rate µ ab µ ab (p, S 3 ) µ ab µ ac () µ ab (p,

More information

Decentralized Quadratically Approximated Alternating Direction Method of Multipliers

Decentralized Quadratically Approximated Alternating Direction Method of Multipliers Decentralized Quadratically Approimated Alternating Direction Method of Multipliers Aryan Mokhtari Wei Shi Qing Ling Alejandro Ribeiro Department of Electrical and Systems Engineering, University of Pennsylvania

More information

Convex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014

Convex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014 Convex Optimization Dani Yogatama School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA February 12, 2014 Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12,

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Lecture: Adaptive Filtering

Lecture: Adaptive Filtering ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate

More information

Convex Optimization of Graph Laplacian Eigenvalues

Convex Optimization of Graph Laplacian Eigenvalues Convex Optimization of Graph Laplacian Eigenvalues Stephen Boyd Abstract. We consider the problem of choosing the edge weights of an undirected graph so as to maximize or minimize some function of the

More information

Stochastic Variational Inference

Stochastic Variational Inference Stochastic Variational Inference David M. Blei Princeton University (DRAFT: DO NOT CITE) December 8, 2011 We derive a stochastic optimization algorithm for mean field variational inference, which we call

More information

Dynamic Network Utility Maximization with Delivery Contracts

Dynamic Network Utility Maximization with Delivery Contracts Dynamic Network Utility Maximization with Delivery Contracts N. Trichakis A. Zymnis S. Boyd August 31, 27 Abstract We consider a multi-period variation of the network utility maximization problem that

More information

Lasso: Algorithms and Extensions

Lasso: Algorithms and Extensions ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Distributed Scheduling Algorithms for Optimizing Information Freshness in Wireless Networks

Distributed Scheduling Algorithms for Optimizing Information Freshness in Wireless Networks Distributed Scheduling Algorithms for Optimizing Information Freshness in Wireless Networks Rajat Talak, Sertac Karaman, and Eytan Modiano arxiv:803.06469v [cs.it] 7 Mar 208 Abstract Age of Information

More information

Introduction to Machine Learning Lecture 7. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 7. Mehryar Mohri Courant Institute and Google Research Introduction to Machine Learning Lecture 7 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Convex Optimization Differentiation Definition: let f : X R N R be a differentiable function,

More information

High Order Methods for Empirical Risk Minimization

High Order Methods for Empirical Risk Minimization High Order Methods for Empirical Risk Minimization Alejandro Ribeiro Department of Electrical and Systems Engineering University of Pennsylvania aribeiro@seas.upenn.edu IPAM Workshop of Emerging Wireless

More information

An interior-point stochastic approximation method and an L1-regularized delta rule

An interior-point stochastic approximation method and an L1-regularized delta rule Photograph from National Geographic, Sept 2008 An interior-point stochastic approximation method and an L1-regularized delta rule Peter Carbonetto, Mark Schmidt and Nando de Freitas University of British

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

Optimal Power Control in Decentralized Gaussian Multiple Access Channels

Optimal Power Control in Decentralized Gaussian Multiple Access Channels 1 Optimal Power Control in Decentralized Gaussian Multiple Access Channels Kamal Singh Department of Electrical Engineering Indian Institute of Technology Bombay. arxiv:1711.08272v1 [eess.sp] 21 Nov 2017

More information

Information geometry of mirror descent

Information geometry of mirror descent Information geometry of mirror descent Geometric Science of Information Anthea Monod Department of Statistical Science Duke University Information Initiative at Duke G. Raskutti (UW Madison) and S. Mukherjee

More information

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1, Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,

More information