Math 273a: Optimization. Lagrange Duality
Instructor: Wotao Yin, Department of Mathematics, UCLA
Winter 2015; online discussions on piazza.com
Gradient descent / forward Euler

assume the function f is proper, closed, convex, and differentiable
consider min_x f(x)
gradient descent iteration (with step size c):
    x^{k+1} = x^k - c ∇f(x^k)
x^{k+1} minimizes the following local quadratic approximation of f:
    f(x^k) + <∇f(x^k), x - x^k> + (1/(2c)) ||x - x^k||_2^2
compare with the forward Euler iteration, a.k.a. the explicit update:
    x(t+1) = x(t) - Δt ∇f(x(t))
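As an illustration (not part of the slides), here is a minimal NumPy sketch of the explicit gradient step on a small strongly convex quadratic; the matrix Q, the vector p, and the step size c are hypothetical choices.

```python
import numpy as np

# Minimize f(x) = 1/2 x^T Q x - p^T x, a simple smooth convex test problem.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])  # illustrative symmetric positive definite matrix
p = np.array([1.0, 1.0])                # illustrative vector

def grad_f(x):
    return Q @ x - p

c = 0.1  # step size; the explicit update needs c < 2 / lambda_max(Q) to converge
x = np.zeros(2)
for _ in range(500):
    x = x - c * grad_f(x)  # forward Euler / explicit gradient step

x_star = np.linalg.solve(Q, p)  # exact minimizer solves Q x = p
assert np.allclose(x, x_star, atol=1e-6)
```

The iterate approaches the exact minimizer geometrically, since each step contracts the error by at most max_i |1 - c*lambda_i(Q)|.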
Backward Euler / implicit gradient descent

backward Euler iteration, also known as the implicit update:
    x(t+1) solves x(t+1) = x(t) - Δt ∇f(x(t+1)).
equivalent to:
    1. u(t+1) solves u = ∇f(x(t) - Δt u),
    2. x(t+1) = x(t) - Δt u(t+1).
we can view it as the implicit gradient descent:
    (x^{k+1}, u^{k+1}) solve x = x^k - c u, u = ∇f(x).
c is the step size; it behaves very differently from a standard (explicit) step size
the explicit (implicit) update uses the gradient at the start (end) point
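For a quadratic f, the implicit equation x = x^k - c ∇f(x) is linear and can be solved exactly, which makes the backward Euler step easy to demonstrate. This sketch (with illustrative Q, p, and c) shows that, unlike the explicit step, a very large step size remains stable:

```python
import numpy as np

# Same kind of quadratic f(x) = 1/2 x^T Q x - p^T x as before (illustrative data).
# For this f, the implicit condition x = x_k - c (Qx - p) rearranges to the
# linear system (I + c Q) x = x_k + c p.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
p = np.array([1.0, 1.0])

def implicit_step(x_k, c):
    return np.linalg.solve(np.eye(2) + c * Q, x_k + c * p)

# A step size of 10 would make the explicit iteration diverge here,
# but the implicit (backward Euler) iteration is unconditionally stable.
x = np.zeros(2)
for _ in range(100):
    x = implicit_step(x, c=10.0)

assert np.allclose(x, np.linalg.solve(Q, p), atol=1e-8)
```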
Implicit gradient step = proximal operation

proximal update:
    prox_{cf}(z) := arg min_x f(x) + (1/(2c)) ||x - z||^2.
optimality condition:
    0 = c ∇f(x*) + (x* - z).
given input z,
    - prox_{cf}(z) returns the solution x*
    - ∇f(prox_{cf}(z)) returns u*
    (x*, u*) solve x = z - c u, u = ∇f(x).

Proposition. The proximal operator is equivalent to an implicit gradient (or backward Euler) step.
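As a numerical check of this proposition (not from the slides), for a quadratic f the prox has a closed form, and the computed point satisfies the backward Euler relation x* = z - c u* with u* = ∇f(x*); Q, p, c, and z are illustrative:

```python
import numpy as np

# prox_{cf}(z) for f(x) = 1/2 x^T Q x - p^T x (illustrative data):
# the optimality condition 0 = c(Qx - p) + (x - z) gives (I + cQ) x = z + c p.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
p = np.array([1.0, 1.0])
c = 0.7

def prox_cf(z):
    return np.linalg.solve(np.eye(2) + c * Q, z + c * p)

z = np.array([2.0, -1.0])
x_star = prox_cf(z)
u_star = Q @ x_star - p            # u* = grad f(x*)

# implicit gradient / backward Euler relation: x* = z - c u*
assert np.allclose(x_star, z - c * u_star)
```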
Proximal operator handles sub-differentiable f

assume that f is a closed, proper, sub-differentiable convex function
∂f(x) denotes the subdifferential of f at x. Recall
    u ∈ ∂f(x) if f(x') ≥ f(x) + <u, x' - x>, for all x' ∈ R^n.
x ↦ ∂f(x) is point-to-set; in neither direction is it unique
prox is well-defined for sub-differentiable f; it is point-to-point: prox maps any input to a unique point
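For a concrete nonsmooth case: f(x) = |x| has ∂f(0) = [-1, 1], a whole interval rather than a single point. A small sketch (the test points are illustrative) can check the subgradient inequality on a grid:

```python
import numpy as np

# u is a subgradient of f at x iff f(x') >= f(x) + u (x' - x) for all x'.
# Check this on a grid for f(x) = |x|.
def is_subgradient(u, x, xs):
    return bool(np.all(np.abs(xs) >= np.abs(x) + u * (xs - x)))

xs = np.linspace(-3.0, 3.0, 1001)
assert is_subgradient(0.3, 0.0, xs)       # 0.3 is in [-1, 1]: a subgradient at 0
assert not is_subgradient(1.5, 0.0, xs)   # 1.5 lies outside [-1, 1]: not one
assert is_subgradient(1.0, 2.0, xs)       # at x > 0 the derivative sign(x) = 1 works
```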
Proximal operator

    prox_{cf}(z) := arg min_x f(x) + (1/(2c)) ||x - z||^2.
since the objective is strongly convex, the solution prox_{cf}(z) is unique
since f is proper, dom prox_{cf} = R^n
the following are equivalent:
    x* = prox_{cf}(z) = arg min_x f(x) + (1/(2c)) ||x - z||^2,
    x* solves 0 ∈ c ∂f(x) + (x - z),
    (x*, u*) solve x = z - c u, u ∈ ∂f(x).
a point x* minimizes f if and only if x* = prox_{cf}(x*).
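A standard nonsmooth example (not worked out on the slide): for f(x) = |x|, the prox is soft-thresholding. A brute-force grid minimization can confirm the closed form, and the fixed-point characterization x* = prox_f(x*) holds at the minimizer 0; z and c are illustrative:

```python
import numpy as np

# prox of f(x) = |x|: the argmin of |x| + (1/(2c)) (x - z)^2 is the
# soft-thresholding (shrinkage) of z by c.
def prox_abs(z, c):
    return np.sign(z) * np.maximum(np.abs(z) - c, 0.0)

# check against brute-force minimization on a fine grid
z, c = 1.7, 0.5
grid = np.linspace(-5.0, 5.0, 200001)
brute = grid[np.argmin(np.abs(grid) + (grid - z) ** 2 / (2 * c))]
assert abs(prox_abs(z, c) - brute) < 1e-3   # closed form gives 1.2 here

# x* minimizes f iff x* = prox_{cf}(x*): the minimizer of |x| is 0
assert prox_abs(0.0, c) == 0.0
```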
Lagrange duality

Convex problem:
    minimize_x f(x) subject to Ax = b.
Relax the constraints and price their violation (pay a price if violated one way; get paid if violated the other way; the payment is linear in the violation):
    L(x; y) := f(x) + y^T (Ax - b)
For later use, define the augmented Lagrangian
    L_A(x; y, c) := f(x) + y^T (Ax - b) + (c/2) ||Ax - b||_2^2
Minimize L for a fixed price y:
    d(y) := -min_x L(x; y).
Always, d(y) is convex.
The Lagrange dual problem:
    minimize_y d(y)
Given a dual solution y*, recover x* = arg min_x L(x; y*) (under which conditions?)
Question: how to compute the explicit/implicit gradients of d(y)?
Dual explicit gradient (ascent) algorithm

Assume d(y) is differentiable (true if f(x) is strictly convex).
Gradient descent iteration (if the maximizing form of the dual is used instead, it is called gradient ascent):
    y^{k+1} = y^k - c ∇d(y^k).
It turns out to be relatively easy to compute ∇d, via an unconstrained subproblem:
    ∇d(y) = b - Ax̄, where x̄ = arg min_x L(x; y).
Dual gradient iteration:
    1. x^k solves min_x L(x; y^k);
    2. y^{k+1} = y^k - c (b - Ax^k).
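As a sketch of this two-step iteration, take f(x) = (1/2)||x||^2 (strictly convex, so d is differentiable); then the x-subproblem min_x L(x; y) has the closed form x̄ = -A^T y, and the iteration converges to the minimum-norm solution of Ax = b. The data A, b and the step size c are illustrative:

```python
import numpy as np

# minimize 1/2 ||x||^2  subject to  Ax = b   (illustrative A, b)
A = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

c = 0.2  # must be small enough relative to the eigenvalues of A A^T
y = np.zeros(2)
for _ in range(2000):
    x = -A.T @ y                 # step 1: x^k solves min_x L(x; y^k)
    y = y - c * (b - A @ x)      # step 2: explicit gradient step on d

# the limit is the minimum-norm solution of Ax = b
x_star = A.T @ np.linalg.solve(A @ A.T, b)
assert np.allclose(x, x_star, atol=1e-6)
assert np.allclose(A @ x, b, atol=1e-6)
```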
Sub-gradient of d(y)

Assume d(y) is sub-differentiable (which condition on the primal can guarantee this?)

Lemma. Given a dual point y and x̄ = arg min_x L(x; y), we have b - Ax̄ ∈ ∂d(y).

Proof. Recall that (i) u ∈ ∂d(y) if d(y') ≥ d(y) + <u, y' - y> for all y'; and (ii) d(y) := -min_x L(x; y).
From (ii) and the definition of x̄,
    d(y) + <b - Ax̄, y' - y> = -L(x̄; y) + (b - Ax̄)^T (y' - y)
                            = -[f(x̄) + y^T (Ax̄ - b)] + (b - Ax̄)^T (y' - y)
                            = -[f(x̄) + (y')^T (Ax̄ - b)]
                            = -L(x̄; y')
                            ≤ d(y').
From (i), b - Ax̄ ∈ ∂d(y).
Dual explicit (sub)gradient iteration

The iteration:
    1. x^k solves min_x L(x; y^k);
    2. y^{k+1} = y^k - c_k (b - Ax^k).
Notes:
    - (b - Ax^k) ∈ ∂d(y^k), as shown on the last slide
    - it does not require d(y) to be differentiable
    - convergence may require a careful choice of c_k (e.g., a diminishing sequence) if d(y) is only sub-differentiable (or lacks a Lipschitz continuous gradient)
Dual implicit gradient

Goal: descend using the (sub)gradient of d at the next point y^{k+1}.
Following from the Lemma, we have b - Ax^{k+1} ∈ ∂d(y^{k+1}), where x^{k+1} = arg min_x L(x; y^{k+1}).
Since the implicit step is y^{k+1} = y^k - c (b - Ax^{k+1}), we can derive
    x^{k+1} = arg min_x L(x; y^{k+1})
    ⟺ 0 ∈ ∂f(x^{k+1}) + A^T y^{k+1} = ∂f(x^{k+1}) + A^T (y^k - c (b - Ax^{k+1})).
Therefore, while x^{k+1} is a solution to min_x L(x; y^{k+1}), it is also a solution to
    min_x L_A(x; y^k, c) = f(x) + (y^k)^T (Ax - b) + (c/2) ||Ax - b||^2,
which is independent of y^{k+1}.
Dual implicit gradient

Proposition. Assuming y⁺ = y - c (b - Ax*), the following are equivalent:
    1. x* solves min_x L(x; y⁺),
    2. x* solves min_x L_A(x; y, c).
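A quick numerical check of this equivalence (not from the slides), for f(x) = (1/2)||x||^2: solve the augmented-Lagrangian subproblem, form the updated multiplier, and verify that the same x* minimizes the plain Lagrangian at the updated multiplier. A, b, c, and y are illustrative:

```python
import numpy as np

# illustrative data
A = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
c = 3.0
y = np.array([0.5, -1.0])
I = np.eye(3)

# 2. x* solves min_x L_A(x; y, c); for f = 1/2||x||^2 this is the linear
#    system (I + c A^T A) x = A^T (c b - y)
x_star = np.linalg.solve(I + c * A.T @ A, A.T @ (c * b - y))
y_plus = y - c * (b - A @ x_star)

# 1. x* solves min_x L(x; y+); for f = 1/2||x||^2 this means x = -A^T y+
assert np.allclose(x_star, -A.T @ y_plus)
```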
Dual implicit gradient iteration

The iteration y^{k+1} = prox_{cd}(y^k) is commonly known as the augmented Lagrangian method, or the method of multipliers.
Implementation:
    1. x^{k+1} solves min_x L_A(x; y^k, c);
    2. y^{k+1} = y^k - c (b - Ax^{k+1}).

Proposition. The following are equivalent:
    1. the augmented Lagrangian iteration;
    2. the implicit gradient iteration on d(y);
    3. the proximal iteration y^{k+1} = prox_{cd}(y^k).
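A sketch of the method of multipliers on the same kind of problem, f(x) = (1/2)||x||^2 subject to Ax = b, where step 1 reduces to a linear system; A, b, and the (deliberately large) c are illustrative:

```python
import numpy as np

# method of multipliers for: minimize 1/2 ||x||^2  s.t.  Ax = b  (illustrative data).
# The x-subproblem min_x L_A(x; y, c) is the linear system
#     (I + c A^T A) x = A^T (c b - y).
A = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
c = 10.0                         # large fixed step; no diminishing sequence needed

y = np.zeros(2)
I = np.eye(3)
for _ in range(50):
    x = np.linalg.solve(I + c * A.T @ A, A.T @ (c * b - y))  # step 1
    y = y - c * (b - A @ x)                                  # step 2: multiplier update

x_star = A.T @ np.linalg.solve(A @ A.T, b)  # minimum-norm solution of Ax = b
assert np.allclose(x, x_star, atol=1e-8)
```

Compared with the explicit dual iteration, the same large c is perfectly safe here, consistent with the proximal (implicit) interpretation.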
Dual explicit/implicit (sub)gradient computation

Definitions:
    L(x; y) = f(x) + y^T (Ax - b)
    L_A(x; y, c) = L(x; y) + (c/2) ||Ax - b||_2^2
Objective: d(y) = -min_x L(x; y).
Explicit (sub)gradient iteration: y^{k+1} = y^k - c ∇d(y^k), or use a subgradient in ∂d(y^k):
    1. x^{k+1} = arg min_x L(x; y^k);
    2. y^{k+1} = y^k - c (b - Ax^{k+1}).
Implicit (sub)gradient step: y^{k+1} = prox_{cd}(y^k):
    1. x^{k+1} = arg min_x L_A(x; y^k, c);
    2. y^{k+1} = y^k - c (b - Ax^{k+1}).
The implicit iteration is more stable; the step size c does not need to diminish.
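To illustrate the stability claim, one can run both iterations side by side on f(x) = (1/2)||x||^2 subject to Ax = b, with a step size far too large for the explicit method; all data below are illustrative:

```python
import numpy as np

# stability comparison for: minimize 1/2 ||x||^2  s.t.  Ax = b  (illustrative data)
A = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
c = 10.0   # far beyond the stable range of the explicit dual iteration
I = np.eye(3)

def residual(implicit, iters=30):
    y = np.zeros(2)
    for _ in range(iters):
        if implicit:   # x-step minimizes the augmented Lagrangian L_A
            x = np.linalg.solve(I + c * A.T @ A, A.T @ (c * b - y))
        else:          # x-step minimizes the plain Lagrangian: x = -A^T y
            x = -A.T @ y
        y = y - c * (b - A @ x)
    return np.linalg.norm(b - A @ x)

assert residual(implicit=True) < 1e-6   # implicit step: converges
assert residual(implicit=False) > 1e6   # explicit step with same c: blows up
```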