Lecture 24 November 27

Size: px

Start display at page:

Download "Lecture 24 November 27"

Charlotte Deborah Miles
5 years ago
Views:

1 EE 381V: Large Scale Optimization Fall 01 Lecture 4 November 7 Lecturer: Caramanis & Sanghavi Scribe: Jahshan Bhatti and Ken Pesyna 4.1 Mirror Descent Earlier, we motivated mirror descent as a way to improve the convergence rate of subgradient descent with respect to the dimension of the problem. Recall that the bound on the convergence of sub-gradient descent is given by f ( best) f L R k + 1, where L is the Lipschitz constant of the function f with respect to and R is the distance of the initial guess 0 from the optimal point : 0. Also, recall that the subgradient update is given by + = Proj X ( γ t g) = arg min [ γg ω (), u + ω (u)], u X where g f () and ω (u) = 1 u is the distance generating function (DGF) that is continuous, differentiable, and strongly conve with respect to. he main idea of mirror descent is to replace ω(u) = 1 u with some other DGF so that the bounds are replaced by L L f and R R f where R f is the size of set as measured by the new Bregman divergence DGF ω ( ). f ( best) f Lf R f k + 1, Note that ω ( ) should be α-strongly conve with respect to the norm used Analysis of Convergence In order to analyze the convergence of mirror descent, we ll first consider the Lyapunov function ( k ) used in the Euclidean case for sub-gradient descent. hen we ll discuss how the equation will change for mirror descent. he key inequality for the convergence analysis of sub-gradient descent is a guaranteed decrease in the Lyapunov function, which is given, for any u X, by 1 u 1 + u γ g, u 1 γ g. (4.1) 4-1

2 Recall from last lecture that the Bregman divergence of is given by D (u, v) = ω (u) ω (v) ω (v), u v. herefore, the analog to the Lyapunov key inequality in (4.1) for mirror descent, can be formed by replacing u v by the D(u, v) and can be reformulated, for some iteration t, as D (u, t ) D (u, t+1 ) γ t g t, t u 1 α γ t g t, (4.) (For w(u) = 1 u, 4. this is eactly what we had in 4.1.) Eq. 4. can be rewritten as [ ω ( t ), t u ω ( t )] [ ω ( t+1 ), t+1 u ω ( t+1 )] H u( t) H u( t+1 ) γ t g t, t u 1 α γ t g t. (4.3) o complete the convergence analysis, recall, for any u X and for some iteration t, f (u) f ( t ) + g t, u t γ t (f ( t ) f (u)) γ t g t, t u. hen, summing (4.3) from t = 0 to t = forms a telescoping sum that yields γ t g t, t u H u ( 0 ) H u ( ) t=0 Θ γt (f ( t ) f (u)) f( best) f( t) ( f ( best ) f (u) ) } {{ } Let u= f ( best γt Θ + 1 α γ t g t ) f Θ + 1 α γ t g t γt, + 1 α γ t g t where Θ is the upper bound on 0 = diamx, or generally the size of X measured by D (, ), the Bregman divergence. f( best ) represents the closest that f( ) gets to f over the entire time interval. Now, consider a specific value for the step size γ t at each iteration t, Θ α γ t = g t t. 4-

3 Given this step size, it is possible to show that the error in the optimal function value ɛ f ( best) f can be bounded as follows: ɛ O (1) ΘL f. Note that when using the l-norm L = L and for the DGF w( ) = 1, we recover an upper bound that is eactly what we had for subgradient descent Simple Mirror Descent versus Subgradient Descent A natural question to ask is: when does the convergence rate of mirror descent eceed the convergence rate of standard subgradient descent. Here s one scenario: Let the feasible set be the simple set scaled by k: X + n (k), let the distance generating function beω () = i ln ( i ) and let our norm be the l1-norm, = 1. For these parameters, the mirror descent update is easy. he modulus of strong conveity with respect to the l1-norm is α = O(1)/R and the upper bound Θ O (1) ln (n). hus, for this scenario, the upper bound on the convergence rate is ( ) L F R ɛ O ln (n) 1. Now that we have a convergence bound for Mirror Descent with a simple set, we can compare this to the bound for subgradient descent, i.e. using the Euclidean norm. We consider the following ratio comparing the convergence error of the two methods: ɛ MD Simple ɛ SD = = ( ln ) L F O (n) 1 R. L (4.4) R ( ln ) O (n) ma X y 1 Lf 1 (4.5) }{{ 1 } ma X y }{{ L f } (I) (II) (III) We can break apart Eq. 4.4 into three terms so that we can classify them as either favoring Mirror Descent or Subgradient descent. (I) Always Favors subgradient descent (i.e. the Euclidean norm) since the numerator is 1 (II) his term represents the error in the initial guess. his always favors subgradient descent since the l1-norm-based numerator is at worst equal to n, where n is the 4-3

4 dimension of the system and at best is 1. he denominator is always 1 since the set for subgradient descent is the Euclidean ball. Consequently this ratio is always between 1 and n:1 ratio n (III) Favors Mirror Descent with a simple set over subgradient descent with a Euclidean set ( 1 n ratio 1) From this analysis we can make the following conclusions: 1. If our set X is a Euclidean ball and our function f is sensitive to O (1) coordinate and subgradient descent much better by a factor of n ln (n). If our setx is a simple and our function f is sensitive to O (n) coordinates, then Mirror Descent is better by a factor of: n ln(n) 4. Algorithms that use the Dual Recall the concept of duality: he primal of the problem is he Lagrangian for the problem is: min f() subject to h() 0 A = b L λ 0 (, λ, ν) = f() + λ h() + ν(a b) he dual objective of the problem is g(λ, ν) = min L (, λ, ν) he solution to the dual is λ, ν = arg ma g(λ, ν) λ 0,ν from which we can recover the optimal solution to the primal by: = arg min L (, λ, ν ) 4-4

5 4..1 Primal and Dual Decomposition Oftentimes, it becomes possible to eploit the problem structure to parallelize the solution for faster processing. However during parallelizing we often have to deal with one of two complications: Coupled variables Coupled constraints Primal Decomposition he primal decomposition master problem given a coupled variable can be posed as follows: min y φ 1 (y) + φ (y) where φ 1 and φ are the subproblems defined as follows: φ 1 (y) = min 1 f 1 (, y) φ (y) = min f (, y) he master problem can then be solved by iterating between the master problem and the individual subproblems. he master problem is always feasible, meaning that the constraints are always met during the minimization process Dual Decomposition Similar to primal decomposition, it is possible to parallelize the dual minimization problem by breaking it into subproblems. For following primal master problem, min 1 y 1 y f 1 ( 1, y 1 ) + f (, y ) subject to y 1 = y the Lagrangian can can be formed as follows L ( 1, y 1,, y ) = f 1 ( 1, y 1 ) + f (, y ) + λ(y 1 y ) and then split into two subproblems as follows subproblem 1 min 1 y 1 f 1 ( 1, y 1 ) + λy 1 subproblem min y f (, y ) λy 4-5

6 he Lagrangian multiplier is used to enforce the primal constraint and can be updated using the following update equation λ + = λ α(y y 1 ) where α is the step size. his update equation is effectively gradient ascent on the dual w.r.t λ. he dual can then be solved by iterating between the subproblems, the update equation. Unlike primal decomposition, dual decomposition can be infeasible at times during the minimization process, as the coupled variable can be different in each subproblem. his concludes coupled variables. Net lecture will review coupled variables and introduce coupled constraints. 4-6

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course otes for EE7C (Spring 018): Conve Optimization and Approimation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Ma Simchowitz Email: msimchow+ee7c@berkeley.edu October