First-order methods for structured nonsmooth optimization
Sangwoon Yun
Department of Mathematics Education, Sungkyunkwan University
Oct 19, 2016
Center for Mathematical Analysis & Computation, Yonsei University
Outline I. Coordinate (Gradient) Descent Method II. Incremental Gradient Method III. (Linearized) Alternating Direction Method of Multipliers
I. Coordinate (Gradient) Descent Method
Structured Nonsmooth Optimization

min_x F(x) = f(x) + P(x)

f: real-valued, (convex) smooth on dom f; P: proper, convex, lsc.
In particular, P is separable, i.e., P(x) = ∑_{j=1}^n P_j(x_j).

Bound constraint: P(x) = 0 if l ≤ x ≤ u, ∞ else, where l ≤ u (possibly with −∞ or ∞ components).
ℓ1-norm: P(x) = λ‖x‖_1 with λ > 0.
Or the indicator function of linear constraints (Ax = b).
Bound-constrained Optimization

min_{l ≤ x ≤ u} f(x),

where f: R^N → R is smooth and l ≤ u (possibly with −∞ or ∞ components). This can be reformulated as the following unconstrained optimization problem:

min_x f(x) + P(x), where P(x) = 0 if l ≤ x ≤ u, ∞ else.
ℓ1-regularized Convex Minimization

1. ℓ1-regularized linear least squares problem: find x so that Ax − b ≈ 0 and x has few nonzeros. Formulate this as an unconstrained convex optimization problem:

min_{x ∈ R^n} ‖Ax − b‖_2^2 + λ‖x‖_1.

2. ℓ1-regularized logistic regression problem:

min_{w ∈ R^{n−1}, v ∈ R} (1/m) ∑_{i=1}^m log(1 + exp(−(w^T a_i + v b_i))) + λ‖w‖_1,

where a_i = b_i z_i and (z_i, b_i) ∈ R^{n−1} × {−1, 1}, i = 1, …, m, are a given set of (observed or training) examples.
Support Vector Machines
Support Vector Machines

Support Vector Classification. Training points: z_i ∈ R^p, i = 1, …, n. Consider a simple case with two classes (linearly separable case). Define a label vector a:

a_i = 1 if z_i is in class 1; a_i = −1 if z_i is in class 2.

A hyperplane (0 = w^T z − b) separates the data with the maximal margin. The margin is the distance from the hyperplane to the nearest of the positive and negative points. The nearest points lie on the planes ±1 = w^T z − b.
Support Vector Machines

Convex Quadratic Programming Problem: Support Vector Classification

The (original) optimization problem:

min_{w,b} (1/2)‖w‖_2^2 subject to a_i (w^T z_i − b) ≥ 1, i = 1, …, n.

The modified optimization problem (allows, but penalizes, the failure of a point to reach the correct margin):

min_{w,b,ξ} (1/2)‖w‖_2^2 + C ∑_{i=1}^n ξ_i subject to a_i (w^T z_i − b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, n.
Support Vector Machines

SVM (Dual) Optimization Problem (Convex Quadratic Program):

min_x (1/2) x^T Q x − e^T x subject to 0 ≤ x_i ≤ C, i = 1, …, n, a^T x = 0,

where a ∈ {−1, 1}^n, 0 < C, e = [1, …, 1]^T, Q ∈ R^{n×n} is symmetric positive semidefinite with Q_ij = a_i a_j K(z_i, z_j), K: R^p × R^p → R (kernel function), and z_i ∈ R^p (ith data point), i = 1, …, n.

Popular choices of K:
linear kernel K(z_i, z_j) = z_i^T z_j
radial basis function kernel K(z_i, z_j) = exp(−γ‖z_i − z_j‖_2^2)
sigmoid kernel K(z_i, z_j) = tanh(γ z_i^T z_j)
where γ is a constant. Q is an n×n fully dense matrix and can even be indefinite. (n ≥ 5000)
Rank Minimization

Now imagine that we only observe a few entries of a data matrix. Is it possible to accurately guess the entries that we have not seen?

Netflix problem: given a sparse matrix where M_ij is the rating given by user i on movie j, predict the rating a user would assign to a movie he has not seen, i.e., infer users' preferences for unrated movies. (Impossible in general: the problem is ill-posed.)

Intuitively, users' preferences depend only on a few factors, i.e., rank(M) is small. Thus the task can be formulated as the low-rank matrix completion problem (affine rank minimization):

min_{X ∈ R^{m×n}} { rank(X) : X_ij = M_ij, (i, j) ∈ Ω }, (NP-hard!)

where Ω is an index set of p observed entries.
Rank Minimization

Nuclear norm minimization:

min_{X ∈ R^{m×n}} { ‖X‖_* := ∑_i σ_i(X) : X_ij = M_ij, (i, j) ∈ Ω },

where the σ_i(X) are the singular values of X.

A more general nuclear norm minimization problem:

min_{X ∈ R^{m×n}} { ‖X‖_* : A(X) = b }.

When the matrix variable is restricted to be diagonal, the above problem reduces to the following ℓ1-minimization problem:

min_{x ∈ R^n} { ‖x‖_1 : Ax = b }.
Nuclear Norm Regularized Least Squares Problem

If the observation b is contaminated with noise, consider the following nuclear norm regularized least squares problem:

min_{X ∈ R^{m×n}} (1/2)‖A(X) − b‖_2^2 + µ‖X‖_*,

where µ > 0 is a given parameter. This problem has appeared in many applications of engineering and science, including collaborative filtering, global positioning, system identification, remote sensing, and computer vision.
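Algorithms for this problem typically rely on the proximal map of µ‖·‖_*, which soft-thresholds the singular values. A minimal NumPy sketch of this building block (illustrative code, not from the talk):

```python
import numpy as np

def prox_nuclear(X, mu):
    """Proximal operator of mu*||.||_*: shrink each singular value by mu."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - mu, 0.0)) @ Vt

# Example: a rank-1 matrix whose single singular value is ||u|| * ||v|| = sqrt(5) * 5
X = np.outer([1.0, 2.0], [3.0, 4.0])
Y = prox_nuclear(X, 1.0)   # largest singular value shrinks from 5*sqrt(5) to 5*sqrt(5) - 1
```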
Sparse Portfolio Selection

How to allocate an investor's available capital into a prefixed set of assets with the aims of maximizing the expected return and minimizing the investment risk.

Traditional Markowitz portfolio selection model:

min_x x^T Q x subject to µ^T x = β, e^T x = 1, x ≥ 0,

where β is the desired expected return of the portfolio.

Modified models:

min_x x^T Q x subject to µ^T x = β, e^T x = 1, x ≥ 0, ‖x‖_0 ≤ K.

min_x ‖x‖_0 subject to µ^T x = β, x^T Q x ≤ α, e^T x = 1, x ≥ 0.
Sparse Covariance Selection

Undirected graphical models offer a way to describe and explain relationships among a set of variables, a central element of multivariate data analysis. From a sample covariance matrix, we wish to estimate the true covariance matrix, in which some entries of the inverse are zero, by maximizing the ℓ1-regularized log-likelihood:

max_{X ∈ S^n} log det X − ⟨Σ̂, X⟩ − ∑_{(i,j)∉V} ρ_ij |X_ij| subject to X_ij = 0, (i, j) ∈ V,

where Σ̂ ∈ S^n_+ is an empirical covariance matrix; Σ̂ is singular or nearly so.

One may want to impose structural conditions on Σ^{−1}, such as conditional independence, which is reflected as zero entries in Σ^{−1}. V is a collection of all pairs of conditionally independent nodes. ρ_ij > 0: parameter controlling the trade-off between the goodness-of-fit and the sparsity of X.
Sparse Covariance Selection

−log det X + ⟨Σ̂, X⟩ is strictly convex and continuously differentiable on its domain S^n_{++}; it takes O(n^3) operations to evaluate. In applications, n can exceed 5000.

The dual problem can be expressed as:

min_{X ∈ S^n} −log det X − n subject to |(X − Σ̂)_ij| ≤ υ_ij, i, j = 1, …, n,

where υ_ij = ρ_ij for all (i, j) ∉ V and υ_ij = ∞ for all (i, j) ∈ V.
Coordinate Descent Method

When P ≡ 0. Given x ∈ R^n, choose i ∈ N = {1, …, n}. Update

x_new = arg min_{u: u_j = x_j ∀ j ≠ i} f(u).

Repeat until convergence.

Gauss-Seidel rule: choose i cyclically, 1, 2, …, n, 1, 2, …
Gauss-Southwell rule: choose i with |∂f/∂x_i (x)| maximum.
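For a convex quadratic f(x) = (1/2)x^T Q x − c^T x, each exact coordinate minimization has the closed form x_i = (c_i − ∑_{j≠i} Q_ij x_j)/Q_ii. A minimal sketch of the cyclic (Gauss-Seidel) rule on made-up data (illustrative, not the speaker's code):

```python
import numpy as np

def coordinate_descent(Q, c, x0, sweeps=100):
    """Cyclic exact coordinate minimization of f(x) = 0.5 x^T Q x - c^T x."""
    x = x0.astype(float).copy()
    n = len(x)
    for _ in range(sweeps):
        for i in range(n):                           # Gauss-Seidel: cycle through coordinates
            r = c[i] - Q[i] @ x + Q[i, i] * x[i]     # residual excluding coordinate i
            x[i] = r / Q[i, i]                       # exact minimizer along coordinate i
    return x

Q = np.array([[3.0, 1.0], [1.0, 2.0]])               # symmetric positive definite
c = np.array([1.0, 1.0])
x = coordinate_descent(Q, c, np.zeros(2))            # should approach the solution of Qx = c
```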
Coordinate Descent Method

Properties:
If f is convex, then every cluster point of the x-sequence is a minimizer.
If f is nonconvex, then Gauss-Seidel can cycle [1], but Gauss-Southwell still converges.
Convergence is possible when P ≢ 0 [2].

[1] M. J. D. Powell, On search directions for minimization algorithms, Math. Program. 4 (1973), 193-201.
[2] P. Tseng, Convergence of block coordinate descent method for nondifferentiable minimization, J. Optim. Theory Appl. 109 (2001), 473-492.
Coord. Gradient Descent Method

Descent direction. For x ∈ dom P, choose J ⊆ N and H ≻ 0_n. Then solve the direction subproblem

min_{d: d_j = 0 ∀ j ∉ J} ∇f(x)^T d + (1/2) d^T H d + P(x + d) − P(x).

Let d_H(x; J) and q_H(x; J) be the optimal solution and objective value of the direction subproblem.

Facts:
d_H(x; N) = 0 ⟺ F′(x; d) ≥ 0 ∀ d ∈ R^n.
H diagonal ⟹ d_H(x; J) = ∑_{j∈J} d_H(x; j) and q_H(x; J) = ∑_{j∈J} q_H(x; j).
q_H(x; J) ≤ −(1/2) d^T H d, where d = d_H(x; J).
Coord. Gradient Descent Method

This coordinate gradient descent approach may be viewed as a hybrid of gradient projection and coordinate descent. In particular:

If J = N and P(x) = 0 if l ≤ x ≤ u, ∞ else, then d_H(x; N) is a scaled gradient-projection direction for bound-constrained optimization.
If f is quadratic and we choose H = ∇²f(x), then d_H(x; J) is a (block) coordinate descent direction.
If H is diagonal, then the subproblems can be solved in parallel.
If P ≡ 0, then d_H(x)_j = −∇f(x)_j / H_jj.
If P(x) = 0 if l ≤ x ≤ u, ∞ else, then d_H(x)_j = median{l_j − x_j, −∇f(x)_j / H_jj, u_j − x_j}.
If P is the ℓ1-norm, then d_H(x)_j = median{(−∇f(x)_j − λ)/H_jj, −x_j, (−∇f(x)_j + λ)/H_jj}.
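For the ℓ1 case, the median formula is soft-thresholding of x_j − ∇f(x)_j/H_jj written in terms of the step d. A quick numerical check of that equivalence, with made-up values of the gradient g, point x, and diagonal H (a sketch under the notation above):

```python
import numpy as np

def d_l1(g, x, H, lam):
    """CGD direction for P = lam*||.||_1 with diagonal H (entrywise median formula)."""
    lo = (-g - lam) / H
    hi = (-g + lam) / H
    return np.median(np.stack([lo, -x, hi]), axis=0)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

g = np.array([3.0, -0.5, -4.0])      # hypothetical gradient values
x = np.array([1.0, 0.2, -2.0])
H = np.array([2.0, 1.0, 4.0])        # diagonal Hessian approximation
lam = 1.0
d = d_l1(g, x, H, lam)
# x + d equals the proximal (soft-threshold) step S_{lam/H}(x - g/H)
```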
Coord. Gradient Descent Method

Stepsize: Armijo rule. Choose α to be the largest element of {β^k}_{k=0,1,…} satisfying

F(x + αd) ≤ F(x) + σ α q_H(x; J) (0 < β < 1, 0 < σ < 1).

For the ℓ1-regularized linear least squares problem, the minimization rule α ∈ arg min{F(x + td) : t ≥ 0} or the limited minimization rule α ∈ arg min{F(x + td) : 0 ≤ t ≤ s}, where 0 < s < ∞, can also be used.
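A generic backtracking loop implementing this rule (a sketch: F is the objective, d a descent direction, and q < 0 the predicted descent from the direction subproblem; all inputs here are hypothetical):

```python
def armijo(F, x, d, q, beta=0.5, sigma=0.1, max_backtracks=50):
    """Largest alpha in {beta^k} with F(x + alpha*d) <= F(x) + sigma*alpha*q (q < 0)."""
    alpha = 1.0
    Fx = F(x)
    for _ in range(max_backtracks):
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        if F(trial) <= Fx + sigma * alpha * q:
            return alpha
        alpha *= beta                  # backtrack: shrink by beta
    return alpha

# 1-D example: F(x) = x^2 at x = 1, direction d = -2, predicted descent q = grad*d = -4
alpha = armijo(lambda v: v[0] ** 2, [1.0], [-2.0], -4.0)
# alpha = 1 overshoots (F unchanged), so the rule backtracks once to alpha = 0.5
```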
Coord. Gradient Descent Method

Choose J:
Gauss-Seidel rule: J cycles through {1}, {2}, …, {n}.
Gauss-Southwell-r rule: ‖d_D(x; J)‖_∞ ≥ υ ‖d_D(x; N)‖_∞, where 0 < υ ≤ 1 and D ≻ 0_n is diagonal (e.g., D = diag(H)).
Gauss-Southwell-q rule: q_D(x; J) ≤ υ q_D(x; N), where 0 < υ ≤ 1 and D ≻ 0_n is diagonal (e.g., D = diag(H)).
Coord. Gradient Descent Method

Advantages of CGD: the CGD method is simple, highly parallelizable, and well suited for solving large-scale problems. CGD not only has cheaper iterations than exact coordinate descent; it also has stronger global convergence properties.
Convergence Results

Global convergence: If 0 ≺ λI ⪯ D, H ⪯ λ̄I, J is chosen by the Gauss-Seidel, Gauss-Southwell-r, or Gauss-Southwell-q rule, and α is chosen by the Armijo rule, then every cluster point of the x-sequence generated by the CGD method is a stationary point of F.
Convergence Results

Local convergence rate: Assume 0 ≺ λI ⪯ D, H ⪯ λ̄I. If J is chosen by the Gauss-Seidel or Gauss-Southwell-q rule, α is chosen by the Armijo rule, and in addition P and f satisfy any of the following assumptions, then the x-sequence generated by the CGD method converges at an R-linear rate.

C1: f is strongly convex, ∇f is Lipschitz continuous on dom P.
C2: f is (nonconvex) quadratic; P is polyhedral.
C3: f(x) = g(Ex) + q^T x, where E ∈ R^{m×N}, q ∈ R^N, g is strongly convex, ∇g is Lipschitz continuous on R^m; P is polyhedral.
C4: f(x) = max_{y∈Y} {(Ex)^T y − g(y)} + q^T x, where Y ⊆ R^m is polyhedral, E ∈ R^{m×N}, q ∈ R^N, g is strongly convex, ∇g is Lipschitz continuous on R^m; P is polyhedral.
Convergence Results

Complexity bound: If f is convex with Lipschitz continuous gradient, then the number of iterations for achieving ε-optimality is:

Gauss-Seidel rule: O(n² L r₀ / ε), where L is a Lipschitz constant and r₀ = max{ dist(x, X̄)² : F(x) ≤ F(x⁰) }.

Gauss-Southwell-q rule: O( L r₀/(υε) + max{0, (L/υ) ln(e₀/r₀)} ), where e₀ = F(x⁰) − min_{x∈X̄} F(x).
II. Incremental Gradient Method
Sum of several functions

min_x F_c(x) := f(x) + cP(x),

where c > 0, P: R^n → (−∞, ∞] is a proper, convex, lower semicontinuous (lsc) function, and

f(x) := ∑_{i=1}^m f_i(x),

where each function f_i is real-valued and smooth (i.e., continuously differentiable) on R^n.
Sum of several functions

In applications, m is often large (exceeding 10⁴). In this case, traditional gradient methods are inefficient since they require evaluating ∇f_i(x) for all i before updating x. Incremental gradient methods (IGM), in contrast, update x after evaluating ∇f_i(x) for only one or a few i. In the unconstrained case P ≡ 0, this method has the basic form

x^{k+1} = x^k − α_k ∇f_{i_k}(x^k), k = 0, 1, …,

where i_k is chosen to cycle through 1, …, m (i.e., i₀ = 1, i₁ = 2, …, i_{m−1} = m, i_m = 1, …) and α_k > 0.
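A minimal sketch of this cyclic rule with a diminishing stepsize, on a toy least-squares sum f(x) = ∑_i (1/2)(a_i x − b_i)² (illustrative data and stepsize schedule, not from the talk):

```python
def incremental_gradient(a, b, x0, passes=300):
    """Cyclic IGM for f(x) = sum_i 0.5*(a_i*x - b_i)^2 with diminishing stepsize."""
    x, k = x0, 0
    m = len(a)
    for _ in range(passes):
        for i in range(m):                           # i_k cycles through 1..m
            alpha = 1.0 / (k + 10)                   # diminishing stepsize, sum alpha_k = inf
            x -= alpha * a[i] * (a[i] * x - b[i])    # step uses the gradient of f_i only
            k += 1
    return x

a, b = [1.0, 2.0, 1.0], [1.0, 2.0, 3.0]
x = incremental_gradient(a, b, 0.0)
# the minimizer of sum_i 0.5*(a_i x - b_i)^2 is (sum a_i b_i)/(sum a_i^2) = 8/6
```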
Sum of several functions

For global convergence of IGMs, the stepsize α_k (also called the "learning rate") needs to diminish to zero, which can lead to slow convergence [3]. If a constant stepsize is used, only convergence to an approximate solution can be shown [4]. Methods [5] to overcome this difficulty have been proposed. However, these methods need additional assumptions, such as ∇f_i(x̄) = 0 for all i at a stationary point x̄, to achieve global convergence without the stepsize tending to zero. Moreover, their extension to P ≢ 0 is problematic.

[3] D. P. Bertsekas, A new class of incremental gradient methods for least squares problems, SIAM J. Optim., 7 (1997), 913-926.
[4] M. V. Solodov, Incremental gradient algorithms with stepsizes bounded away from zero, Comput. Optim. Appl., 11 (1998), 23-35.
[5] P. Tseng, An incremental gradient(-projection) method with momentum term and adaptive stepsize rule, SIAM J. Optim., 8 (1998), 506-531.
Sum of several functions

For the case of P ≡ 0, Blatt, Hero and Gauchman [6] proposed a method:

g^k = g^{k−1} + ∇f_{i_k}(x^k) − ∇f_{i_k}(x^{k−m}), x^{k+1} = x^k − α g^k,

with α > 0, g^{−1} = ∑_{i=1}^m ∇f_i(x^{i−m−1}), and x^0, x^{−1}, …, x^{−m} ∈ R^n given.

It computes the gradient of a single component function at each iteration. But instead of updating x using this gradient, it uses the sum of the m most recently computed gradients. This method requires more storage (O(mn) instead of O(n)) and slightly more communication/computation per iteration.

[6] D. Blatt, A. O. Hero, and H. Gauchman, A convergent incremental gradient method with a constant step size, SIAM J. Optim., 18 (2007), 29-51.
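A minimal sketch of this aggregated-gradient idea: store the most recent gradient of each component, keep their running sum, and take constant-stepsize steps along the aggregate. The scalar least-squares data and the stepsize are made up for illustration:

```python
def iag(a, b, x0, alpha, iters=2000):
    """Incremental aggregated gradient: constant step along the sum of m stored gradients."""
    m = len(a)
    x = x0
    grads = [a[i] * (a[i] * x - b[i]) for i in range(m)]  # stored component gradients
    g = sum(grads)                                        # running aggregate
    for k in range(iters):
        i = k % m                                         # cyclic component choice
        new_grad = a[i] * (a[i] * x - b[i])
        g += new_grad - grads[i]                          # replace the stale gradient
        grads[i] = new_grad
        x -= alpha * g                                    # constant stepsize
    return x

a, b = [1.0, 2.0, 1.0], [1.0, 2.0, 3.0]
x = iag(a, b, 0.0, alpha=0.02)
# converges to the least-squares solution 4/3 despite the constant stepsize
```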
IGM: Constant Stepsize

0. Choose x^0, x^{−1}, …, x^{−m} ∈ dom P and α ∈ (0, 1]. Initialize k = 0. Go to 1.
1. Choose H^k ≻ 0 and 0 ≤ τ_i^k ≤ k for i = 1, …, m, and compute g^k, d^k, and x^{k+1} by

g^k = ∑_{i=1}^m ∇f_i(x^{τ_i^k}),
d^k = arg min_{d ∈ R^n} { ⟨g^k, d⟩ + (1/2)⟨d, H^k d⟩ + cP(x^k + d) },
x^{k+1} = x^k + α d^k.

Increment k by 1 and return to 1.

The method of Blatt et al. corresponds to the special case of P ≡ 0, H^k = I, K = m − 1, and

τ_i^k = k if i = (k mod m) + 1; τ_i^k = τ_i^{k−1} otherwise, for 1 ≤ i ≤ m, k ≥ m.
IGM: Constant Stepsize

Assumption 1
(a) τ_i^k ≥ k − K for all i and k, where K ≥ 0 is an integer.
(b) λI ⪯ H^k ⪯ λ̄I for all k, where 0 < λ ≤ λ̄.

Assumption 2
‖∇f_i(y) − ∇f_i(z)‖ ≤ L_i ‖y − z‖ ∀ y, z ∈ dom P, for some L_i ≥ 0, i = 1, …, m. Let L = ∑_{i=1}^m L_i.
IGM: Constant Stepsize

Global Convergence: Let {x^k}, {d^k}, {H^k} be sequences generated by Algorithm 1 under Assumptions 1 and 2, with α < 2λ/(L(2K + 1)). Then {d^k} → 0 and every cluster point of {x^k} is a stationary point.
IGM: Adaptive Stepsize

0. Choose x^0, x^{−1}, …, x^{−m} ∈ dom P, β ∈ (0, 1), σ > 1/2, and α ∈ (0, 1]. Initialize k = 0. Go to 1.
1. Choose H^k ≻ 0 and 0 ≤ τ_i^k ≤ k for i = 1, …, m; compute g^k, d^k by

g^k = ∑_{i=1}^m ∇f_i(x^{τ_i^k}),
d^k = arg min_{d ∈ R^n} { ⟨g^k, d⟩ + (1/2)⟨d, H^k d⟩ + cP(x^k + d) }.

Choose α_k^init ∈ [α, 1] and let α_k be the largest element of {α_k^init β^j}_{j=0,1,…} satisfying the descent-like condition

F_c(x^k + α_k d^k) ≤ F_c(x^k) − σ K L α_k ‖d^k‖₂² + (L/2) ∑_{j=(k−K)_+}^{k−1} α_j ‖d^j‖₂²,

and set x^{k+1} = x^k + α_k d^k. Increment k by 1 and return to 1.
IGM: Adaptive Stepsize

Global Convergence: Let {x^k}, {d^k}, {H^k}, {α_k} be sequences generated by Algorithm 2 under Assumptions 1 and 2. Then the following results hold.
(a) For each k ≥ 0, the descent-like condition holds whenever α_k ≤ ᾱ, where ᾱ = λ/(L(σK + K/2 + 1/2)).
(b) We have α_k ≥ min{α, βᾱ} for all k.
(c) {d^k} → 0 and every cluster point of {x^k} is a stationary point.
IGM: 1-memory with Adaptive Stepsize

0. Choose x^0 ∈ dom P. Initialize k = 0 and g^{−1} = 0. Go to 1.
1. Choose H^k ≻ 0 and compute g^k, d^k, and x^{k+1} by

g^k = (k/(k + 1)) g^{k−1} + (m/(k + 1)) ∇f_{i_k}(x^k), with i_k = (k mod m) + 1,
d^k = arg min_{d ∈ R^n} { ⟨g^k, d⟩ + (1/2)⟨d, H^k d⟩ + cP(x^k + d) },
x^{k+1} = x^k + α_k d^k, with α_k ∈ (0, 1].

Increment k by 1 and return to 1.
IGM: 1-memory with Adaptive Stepsize

Assumption 3
(a) ∑_{k=0}^∞ α_k = ∞.
(b) lim_{l→∞} ∑_{j=0}^{l} ((j + 1)/(l + 1)) δ_j = 0, where δ_j := max_{i=0,1,…,m} ‖x^{k+i} − x^{k+m}‖ with k = jm − 1 (x^{−1} = x^0).
IGM: 1-memory with Adaptive Stepsize

Global Convergence: Let {x^k}, {d^k}, {H^k}, {α_k} be sequences generated by Algorithm 3 under Assumptions 1(b), 2, and 3. Then the following results hold.
(a) {‖x^{k+1} − x^k‖} → 0 and {‖∇f(x^k) − g^k‖} → 0.
(b) lim inf_{k→∞} ‖d^k‖ = 0.
(c) If {x^k} is bounded, then there exists a cluster point of {x^k} that is a stationary point.
III. (Linearized) Alternating Direction Method of Multipliers
Total Variation Regularized Linear Least Squares Problem

Image restoration tasks such as image deconvolution, image inpainting, and image denoising are often formulated as an inverse problem

b = Au + η,

where u ∈ R^n is the unknown true image, b ∈ R^l is the observed image (or measurements), η is Gaussian noise, and A ∈ R^{l×n} is a linear operator: typically a convolution operator in deconvolution, a projection in inpainting, and the identity in denoising.

The unknown image can be recovered by solving the TV regularized linear least squares problem (Primal):

min_{u ∈ R^n} (1/2)‖Au − b‖₂² + µ‖∇u‖,

where µ > 0 and ‖∇u‖ = ∑_{i=1}^n ‖(∇u)_i‖₂ with (∇u)_i ∈ R².
Gaussian Noise

Figure: denoising. (a) original (512×512); (b) noisy image; (c) recovered image.
Gaussian Noise

Figure: deblurring. (a) original (512×512); (b) motion-blurred image; (c) recovered image.
Gaussian Noise

Figure: inpainting. (a) original (512×512); (b) scratched image; (c) recovered image.
Algorithms

Saddle Point Formulation and Dual Problem

1. Dual norm: min_u max_{‖p‖ ≤ 1} ⟨∇u, p⟩ + (1/2)‖Au − b‖₂².

2. Convex conjugate (Legendre-Fenchel transform) of J(u) = ‖∇u‖:

min_u sup_p ⟨p, ∇u⟩ − J*(p) + (1/2)‖Au − b‖₂², where J*(p) = sup_z ⟨p, z⟩ − J(z).

3. Lagrangian function: max_p inf_{u,z} L₁(u, p, z), where

L₁(u, p, z) := (1/2)‖Au − b‖₂² + ‖z‖ + ⟨p, ∇u − z⟩.
Algorithms

4. Dual problem:

max_p { inf_u ⟨p, ∇u⟩ − J*(p) + (1/2)‖Au − b‖₂² } = max_p { −J*(p) − H*(div p) },

where H(u) = (1/2)‖Au − b‖₂².

5. Lagrangian function with y = div p: max_u inf_{p,y} L₂(u, p, y), where

L₂(u, p, y) := −J*(p) − H*(y) + ⟨u, div p − y⟩.
ADMM

Alternating Direction Method of Multipliers [7]

Augmented Lagrangian function:

L_α(u, p, z) := (1/2)‖Au − b‖₂² + ‖z‖ + ⟨p, ∇u − z⟩ + (α/2)‖∇u − z‖₂²

u^{k+1} = arg min_u (1/2)‖Au − b‖₂² + ⟨p^k, ∇u − z^k⟩ + (α/2)‖∇u − z^k‖₂²
z^{k+1} = arg min_z ‖z‖ + ⟨p^k, ∇u^{k+1} − z⟩ + (α/2)‖∇u^{k+1} − z‖₂²
p^{k+1} = p^k + α(∇u^{k+1} − z^{k+1})

ADMM on the primal (or 3) is equivalent to Douglas-Rachford [8] on the dual (4). ADMM on the dual (or 5) is equivalent to Douglas-Rachford on the primal.

[7] E. Esser, X. Zhang, and T. Chan, A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science, SIAM J. Imaging Sci., 2010.
[8] J. Douglas and H. H. Rachford, On the numerical solution of heat conduction problems in two or three space variables, Trans. Amer. Math. Soc., 1956.
ADMM

Alternating Minimization Algorithm & Forward-Backward Splitting Method

AMA [9] on the primal (or 3):

u^{k+1} = arg min_u (1/2)‖Au − b‖₂² + ⟨p^k, ∇u⟩
z^{k+1} = arg min_z ‖z‖ + ⟨p^k, ∇u^{k+1} − z⟩ + (α/2)‖∇u^{k+1} − z‖₂²
p^{k+1} = p^k + α(∇u^{k+1} − z^{k+1})

FBS [10] on the dual (4):

u^{k+1} = arg min_u (1/2)‖Au − b‖₂² + ⟨p^k, ∇u⟩
p^{k+1} = arg min_p J*(p) − ⟨p, ∇u^{k+1}⟩ + (1/(2α))‖p − p^k‖₂²

FBS on the dual is equivalent to AMA on the primal (or 3).

[9] P. Tseng, Applications of a splitting algorithm to decomposition in convex programming and variational inequalities, SIAM J. Control and Optimization, 1991.
[10] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Model. Simul., 2005.
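In the finite-dimensional ℓ1 analogue of the TV problem, forward-backward splitting is the familiar iterative soft-thresholding scheme: a gradient step on the smooth part followed by a prox step on the nonsmooth part. A self-contained sketch with made-up data (illustrative, not the TV solver from the talk):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fbs(A, b, mu, iters=500):
    """Forward-backward splitting for min 0.5*||Ax - b||^2 + mu*||x||_1:
    forward (gradient) step on the least-squares term, backward (prox) step on mu*||.||_1."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2        # 1/L with L = ||A||^2 (Lipschitz const.)
    for _ in range(iters):
        x = soft_threshold(x - step * A.T @ (A @ x - b), step * mu)
    return x

A = np.eye(3)
b = np.array([2.0, 0.5, -3.0])
x = fbs(A, b, mu=1.0)
# with A = I the minimizer is the soft-thresholding of b: [1, 0, -2]
```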
Proximal Splitting Method

Linearly constrained separable convex programming problem:

min_{x,y} { f(x) + g(y) : Qx = y }.

Augmented Lagrangian function:

L_α(x, y, z) := f(x) + g(y) + ⟨z, y − Qx⟩ + (α/2)‖Qx − y‖₂².

x^{k+1} = arg min_x L_α(x, y^k, z^k)
y^{k+1} = arg min_y L_α(x^{k+1}, y, z^k)
z^{k+1} = z^k + α(y^{k+1} − Qx^{k+1})
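In the special case Q = I, f(x) = (1/2)‖Ax − b‖₂², g(y) = µ‖y‖₁, these three steps give ADMM for the lasso: a linear solve in x, soft-thresholding in y, and the multiplier update. A self-contained sketch with made-up data (illustrative, not the TV problem):

```python
import numpy as np

def admm_lasso(A, b, mu, alpha=1.0, iters=300):
    """ADMM for min f(x) + g(y) s.t. x = y, with f = 0.5*||Ax-b||^2 and g = mu*||y||_1."""
    n = A.shape[1]
    x, y, z = np.zeros(n), np.zeros(n), np.zeros(n)
    M = np.linalg.inv(A.T @ A + alpha * np.eye(n))   # x-step matrix, formed once
    for _ in range(iters):
        x = M @ (A.T @ b + z + alpha * y)            # x-step: minimize L_alpha over x
        v = x - z / alpha
        y = np.sign(v) * np.maximum(np.abs(v) - mu / alpha, 0.0)   # y-step: prox of g
        z = z + alpha * (y - x)                      # multiplier update
    return y

A = np.eye(3)
b = np.array([2.0, 0.5, -3.0])
y = admm_lasso(A, b, mu=1.0)
# with A = I and Q = I, the minimizer is the soft-thresholding of b: [1, 0, -2]
```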
Proximal Splitting Method

If f(x) = (1/2)‖Ax − b‖₂², then

x^{k+1} = (A^T A + αQ^T Q)^{−1}(A^T b + Q^T z^k + αQ^T y^k).

If f(x) = ⟨1, Ax⟩ − ⟨b, log(Ax)⟩: need an inner solver or an inversion involving Q and/or A.

If f(x) = ⟨x + b e^{−x}, 1⟩: need an inner solver.
Poisson Noise

Figure: deblurring. (a) original; (b) blurred image; (c) recovered image.
Multiplicative Noise
Multiplicative Noise

Copyright: Sandia Nat. Lab. (http://www.sandia.gov/radar/sar.html)
Multiplicative Noise

4M (2^22) pixels @ 20 sec.
Proximal Splitting Method

In order to avoid any inner iterations or inversions involving the Laplacian operator required in algorithms based on the augmented Lagrangian:

Alternating minimization algorithm:

x^{k+1} = arg min_x L₀(x, y^k, z^k)
y^{k+1} = arg min_y L_α(x^{k+1}, y, z^k)
z^{k+1} = z^k + α(y^{k+1} − Qx^{k+1})

Linearized augmented Lagrangian with proximal function:

x^{k+1} = arg min_x LL_α(x, x^k, y^k, z^k)
y^{k+1} = arg min_y L_α(x^{k+1}, y, z^k)
z^{k+1} = z^k + α(y^{k+1} − Qx^{k+1})

LL_α(x, x^k, y^k, z^k) := ⟨∇_x f(x^k), x − x^k⟩ + g(y^k) + ⟨z^k, y^k − Qx⟩ + α⟨Q^T(Qx^k − y^k), x − x^k⟩ + (1/(2δ))‖x − x^k‖₂²
Thank you!