Sparse Optimization Lecture: Basic Sparse Optimization Models
Instructor: Wotao Yin, July 2013 (online discussions on piazza.com)
Those who complete this lecture will know
- basic ℓ1, ℓ2,1, and nuclear-norm models
- some applications of these models
- how to reformulate them into standard conic programs
- which conic programming solvers to use
Examples of Sparse Optimization Applications
See online seminar at piazza.com
Basis pursuit
min_x { ‖x‖_1 : Ax = b }
- find the least ℓ1-norm point on the affine plane {x : Ax = b}
- tends to return a sparse point (sometimes, the sparsest)
[Figure: the ℓ1 ball]
Basis pursuit
min_x { ‖x‖_1 : Ax = b }
- find the least ℓ1-norm point on the affine plane {x : Ax = b}
- tends to return a sparse point (sometimes, the sparsest)
[Figure: the ℓ1 ball touches the affine plane]
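A minimal numerical sketch of basis pursuit (not part of the original slides), assuming Python with CVXPY and any installed conic solver; the sizes, sparsity level, and data below are made up for illustration:

import numpy as np
import cvxpy as cp

# Hypothetical problem data: a wide random matrix and a sparse ground truth.
np.random.seed(0)
m, n, k = 40, 100, 5
A = np.random.randn(m, n)
x_true = np.zeros(n)
x_true[np.random.choice(n, k, replace=False)] = np.random.randn(k)
b = A @ x_true

# Basis pursuit: minimize ||x||_1 subject to Ax = b.
x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm(x, 1)), [A @ x == b])
prob.solve()  # CVXPY reformulates this into a conic program and calls a solver

print("recovery error:", np.linalg.norm(x.value - x_true))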
Basis pursuit denoising, LASSO
min_x { ‖Ax - b‖_2 : ‖x‖_1 ≤ τ },        (1a)
min_x ‖x‖_1 + (µ/2)‖Ax - b‖_2^2,        (1b)
min_x { ‖x‖_1 : ‖Ax - b‖_2 ≤ σ }.        (1c)
all models allow Ax ≈ b
Basis pursuit denoising, LASSO
min_x { ‖Ax - b‖_2 : ‖x‖_1 ≤ τ },        (2a)
min_x ‖x‖_1 + (µ/2)‖Ax - b‖_2^2,        (2b)
min_x { ‖x‖_1 : ‖Ax - b‖_2 ≤ σ }.        (2c)
- ‖·‖_2 is the most common error measure but can be generalized to a loss function L
- (2a) seeks a least-squares solution with bounded sparsity
- (2b) is known as LASSO (least absolute shrinkage and selection operator); it seeks a balance between sparsity and fitting
- (2c) is referred to as BPDN (basis pursuit denoising), seeking a sparse solution from the tube-like set {x : ‖Ax - b‖_2 ≤ σ}
- they are equivalent (see later slides)
- in terms of regression, they select a (sparse) set of features (i.e., columns of A) to linearly express the observation b
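A hedged sketch of the three formulations in CVXPY (not from the slides); the parameter values τ, µ, σ and the data are arbitrary placeholders:

import numpy as np
import cvxpy as cp

# Hypothetical noisy data and illustrative parameter values.
np.random.seed(1)
m, n = 40, 100
A = np.random.randn(m, n)
b = np.random.randn(m)
tau, mu, sigma = 1.0, 10.0, 0.1

x = cp.Variable(n)

# (2a): least-squares fit with an l1-ball (sparsity) constraint
p2a = cp.Problem(cp.Minimize(cp.norm(A @ x - b, 2)), [cp.norm(x, 1) <= tau])

# (2b): LASSO, balancing sparsity and data fit
p2b = cp.Problem(cp.Minimize(cp.norm(x, 1) + (mu / 2) * cp.sum_squares(A @ x - b)))

# (2c): BPDN, a sparse point in the tube ||Ax - b||_2 <= sigma
p2c = cp.Problem(cp.Minimize(cp.norm(x, 1)), [cp.norm(A @ x - b, 2) <= sigma])

for p in (p2a, p2b, p2c):
    p.solve()
    print(p.value)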
Sparse under basis Ψ / ℓ1-synthesis model
min_s { ‖s‖_1 : AΨs = b }        (3)
- the signal x is sparsely synthesized by atoms from Ψ, so the vector s is sparse
- Ψ is referred to as the dictionary
- commonly used dictionaries include both analytic and trained ones
- analytic examples: Id, DCT, wavelets, curvelets, Gabor, etc., also their combinations; they have analytic properties and are often easy to compute with (for example, multiplying a vector takes O(n log n) instead of O(n^2))
- Ψ can also be numerically learned from training data or a partial signal
- dictionaries can be orthogonal, frames, or general
Sparse under basis Ψ / ℓ1-synthesis model
If Ψ is orthogonal, problem (3) is equivalent to
min_x { ‖Ψ^T x‖_1 : Ax = b }        (4)
by the change of variable x = Ψs, equivalently s = Ψ^T x.
Related models for noise and approximate sparsity:
min_x { ‖Ax - b‖_2 : ‖Ψ^T x‖_1 ≤ τ },  min_x ‖Ψ^T x‖_1 + (µ/2)‖Ax - b‖_2^2,  min_x { ‖Ψ^T x‖_1 : ‖Ax - b‖_2 ≤ σ }.
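As a concrete sketch of the synthesis model (3) (again, not from the slides), Ψ can be taken as an orthogonal DCT dictionary built with SciPy; everything below (sizes, sparsity, the use of idct to form Ψ) is an illustrative assumption:

import numpy as np
import cvxpy as cp
from scipy.fft import idct

# Hypothetical setup: x is sparse under an orthogonal DCT dictionary Psi.
np.random.seed(7)
m, n = 40, 128
Psi = idct(np.eye(n), axis=0, norm='ortho')   # columns are DCT atoms; Psi is orthogonal
s_true = np.zeros(n)
s_true[np.random.choice(n, 5, replace=False)] = np.random.randn(5)
A = np.random.randn(m, n)
b = A @ (Psi @ s_true)

# Synthesis model (3): recover the sparse coefficient vector s, then x = Psi s.
s = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm(s, 1)), [(A @ Psi) @ s == b]).solve()
print("coefficient error:", np.linalg.norm(s.value - s_true))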
Sparse after transform / ℓ1-analysis model
min_x { ‖Ψx‖_1 : Ax = b }        (5)
- the signal x becomes sparse under the transform Ψ (which may not be orthogonal)
- examples of Ψ: DCT, wavelets, curvelets, ridgelets, ..., tight frames, Gabor, ..., (weighted) total variation
- when Ψ is not orthogonal, the analysis is more difficult
Example: sparsify an image
[Figure: (a) DCT coefficients, (b) Haar wavelets, (c) local variation; (d)-(f) sorted coefficient magnitudes showing the decay of the DCT, Haar wavelet, and local-variation coefficients. The DCT and wavelet coefficients are scaled for better visibility.]
Questions
1. Can we trust these models to return intended sparse solutions?
2. When will the solution be unique?
3. Will the solution be robust to noise in b?
4. Are constrained and unconstrained models equivalent? In what sense?
Questions 1-4 will be addressed in the next lecture.
5. How to choose parameters?
- τ (sparsity), µ (weight), and σ (noise level) have different meanings
- applications determine which one is easier to set
- generality: use a test data set, then scale the parameters for real data
- cross validation: reserve a subset of the data to test the solution
Joint/group sparsity
Joint sparse recovery model:
min_X { ‖X‖_2,1 : A(X) = b }        (6)
where
‖X‖_2,1 := Σ_{i=1}^m ‖[x_i1, x_i2, ..., x_in]‖_2.
- the ℓ2-norm is applied to each row of X
- the ℓ2,1-norm ball has sharp boundaries across different rows, which tend to be touched by {X : A(X) = b}, so the solution tends to be row-sparse
- also ‖X‖_p,q for 1 < p; the choice of p affects the magnitudes of entries on the same row
- complex-valued signals are a special case
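A small sketch of the joint sparse model (6) for the common special case A(X) = AX with a matrix of measurement vectors B (assumptions: CVXPY; synthetic data; the ℓ2,1 norm is written out as a sum of row norms):

import numpy as np
import cvxpy as cp

# Hypothetical multiple-measurement-vector data: X_true has few nonzero rows,
# and each column of B = A X_true is one measurement vector.
np.random.seed(9)
m, n, k = 30, 60, 4
A = np.random.randn(m, n)
X_true = np.zeros((n, k))
X_true[np.random.choice(n, 5, replace=False), :] = np.random.randn(5, k)
B = A @ X_true

# Joint sparse recovery (6): minimize the l2,1 norm (sum of row l2 norms).
X = cp.Variable((n, k))
l21 = sum(cp.norm(X[i, :], 2) for i in range(n))
cp.Problem(cp.Minimize(l21), [A @ X == B]).solve()
print("recovery error:", np.linalg.norm(X.value - X_true))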
Joint/group sparsity
Decompose {1, ..., n} = G_1 ∪ G_2 ∪ ... ∪ G_S.
- non-overlapping groups: G_i ∩ G_j = ∅, i ≠ j
- otherwise, groups may overlap (modeling many interesting structures)
Group-sparse recovery model:
min_x { ‖x‖_G,2,1 : Ax = b }        (7)
where
‖x‖_G,2,1 = Σ_{s=1}^S w_s ‖x_{G_s}‖_2.
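A sketch of the group-sparse model (7) with non-overlapping groups (assumptions: CVXPY; the group partition, weights w_s, and data are made up):

import numpy as np
import cvxpy as cp

# Hypothetical data: n variables split into non-overlapping groups of size 5.
np.random.seed(2)
m, n, gsize = 30, 60, 5
A = np.random.randn(m, n)
b = A @ np.random.randn(n)
groups = [list(range(s, s + gsize)) for s in range(0, n, gsize)]
w = np.ones(len(groups))  # group weights w_s

# Group-sparse recovery (7): weighted sum of per-group l2 norms.
x = cp.Variable(n)
group_norm = sum(w_s * cp.norm(x[g], 2) for w_s, g in zip(w, groups))
cp.Problem(cp.Minimize(group_norm), [A @ x == b]).solve()
print("objective value:", group_norm.value)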
Auxiliary constraints
Auxiliary constraints introduce additional structure of the underlying signal into its recovery, which sometimes significantly improves recovery quality:
- nonnegativity: x ≥ 0
- bound (box) constraints: l ≤ x ≤ u
- general inequalities: Qx ≤ q
They can be very effective in practice. They also generate corners.
Reduce to conic programs
Sparse optimization often has nonsmooth objectives. Classic conic programming solvers do not handle nonsmooth functions.
Basic idea: model nonsmoothness by inequality constraints.
Example: for a given x, we have
‖x‖_1 = min_{x1, x2} { 1^T(x1 + x2) : x1 - x2 = x, x1 ≥ 0, x2 ≥ 0 }.        (8)
Therefore,
- min{ ‖x‖_1 : Ax = b } reduces to a linear program (LP) (see the sketch below)
- min ‖x‖_1 + (µ/2)‖Ax - b‖_2^2 reduces to a bound-constrained quadratic program (QP)
- min{ ‖Ax - b‖_2 : ‖x‖_1 ≤ τ } reduces to a bound-constrained QP
- min{ ‖x‖_1 : ‖Ax - b‖_2 ≤ σ } reduces to a second-order cone program (SOCP)
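A minimal sketch of the splitting (8) applied to basis pursuit, using SciPy's linprog as the LP solver; the data are random and purely illustrative:

import numpy as np
from scipy.optimize import linprog

# Hypothetical data.
np.random.seed(3)
m, n = 20, 50
A = np.random.randn(m, n)
b = A @ np.random.randn(n)

# Split x = x1 - x2 with x1, x2 >= 0; then ||x||_1 = 1^T (x1 + x2), and
# min{ ||x||_1 : Ax = b } becomes an LP in the stacked variable [x1; x2].
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None))
x = res.x[:n] - res.x[n:]
print("||x||_1 =", np.abs(x).sum(), " residual:", np.linalg.norm(A @ x - b))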
Conic programming
Basic form:
min_x { c^T x : Fx + g ⪰_K 0, Ax = b }
a ⪰_K b stands for a - b ∈ K, where K is a convex, closed, pointed cone. Examples:
- first orthant (cone): R^n_+ = {x ∈ R^n : x ≥ 0}
- norm cone (second-order cone): Q = {(x, t) : ‖x‖ ≤ t}
- polyhedral cone: P = {x : Ax ≥ 0}
- positive semidefinite cone: S_+ = {X : X ⪰ 0, X^T = X}
Example: {(x, y, z) : [x y; y z] ∈ S_+}
[Figure: boundary of the set {(x, y, z) : [x y; y z] ⪰ 0}]
Linear program
Model
min_x { c^T x : Ax = b, x ⪰_K 0 }
where K is the nonnegative cone (first orthant), so x ⪰_K 0 ⟺ x ≥ 0.
Algorithms
- the simplex method (move between vertices)
- interior-point methods (IPMs) (move inside the polyhedron)
- decomposition approaches (divide and conquer)
In the primal IPM, x ≥ 0 is replaced by its logarithmic barrier ψ(y) = Σ_i log(y_i);
log-barrier formulation: min_x { c^T x - (1/t) Σ_i log(x_i) : Ax = b }
Second-order cone program
Model
min_x { c^T x : Ax = b, x ⪰_K 0 }
where K = K_1 × ... × K_K; each K_k is the second-order cone
K_k = { y ∈ R^{n_k} : y_{n_k} ≥ √(y_1^2 + ... + y_{n_k-1}^2) }.
- IPM is the standard solver (though other options also exist)
- log-barrier of K_k: ψ(y) = log( y_{n_k}^2 - (y_1^2 + ... + y_{n_k-1}^2) )
Semidefinite program
Model
min_X { C • X : A(X) = b, X ⪰_K 0 }
where K = K_1 × ... × K_K; each K_k = S^{n_k}_+.
- IPM is the standard solver (though other options also exist)
- log-barrier of S^{n_k}_+ (still a concave function): ψ(Y) = log det(Y)
(from Boyd & Vandenberghe, Convex Optimization)
Properties (without proof): for y ≻_K 0, ∇ψ(y) ⪰_{K*} 0 and y^T ∇ψ(y) = θ
- nonnegative orthant R^n_+: ψ(y) = Σ_{i=1}^n log y_i; ∇ψ(y) = (1/y_1, ..., 1/y_n), so y^T ∇ψ(y) = n
- positive semidefinite cone S^n_+: ψ(Y) = log det Y; ∇ψ(Y) = Y^{-1}, so tr(Y ∇ψ(Y)) = n
- second-order cone K = { y ∈ R^{n+1} : (y_1^2 + ... + y_n^2)^{1/2} ≤ y_{n+1} }:
  ψ(y) = log( y_{n+1}^2 - y_1^2 - ... - y_n^2 ),
  ∇ψ(y) = 2 / (y_{n+1}^2 - y_1^2 - ... - y_n^2) · (-y_1, ..., -y_n, y_{n+1})^T, so y^T ∇ψ(y) = 2
(from Boyd & Vandenberghe, Convex Optimization)
Central path
- for t > 0, define x*(t) as the solution of
  minimize   t f_0(x) + φ(x)
  subject to Ax = b
  (for now, assume x*(t) exists and is unique for each t > 0)
- central path is { x*(t) : t > 0 }
- example: central path for an LP
  minimize c^T x subject to a_i^T x ≤ b_i, i = 1, ..., 6;
  the hyperplane c^T x = c^T x*(t) is tangent to the level curve of φ through x*(t)
[Figure: central path of an LP with six inequality constraints]
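A small sketch (not from Boyd & Vandenberghe) that traces the central path of a toy LP with six inequalities by solving the centering problem for increasing t; it assumes CVXPY with a solver that supports the log atom, and the LP data are made up:

import numpy as np
import cvxpy as cp

# Toy LP  min c^T x  s.t.  Gx <= h  with 6 inequalities (2 random cuts + a box),
# chosen so that x = 0 is strictly feasible and the feasible set is bounded.
np.random.seed(8)
c = np.array([1.0, 0.5])
G = np.vstack([np.random.randn(2, 2), np.eye(2), -np.eye(2)])
h = np.concatenate([np.abs(np.random.randn(2)) + 0.5, 2.0 * np.ones(4)])

# Centering problem: minimize t*c^T x - sum_i log(h_i - g_i^T x).
x = cp.Variable(2)
for t in [1, 10, 100, 1000]:
    cp.Problem(cp.Minimize(t * (c @ x) - cp.sum(cp.log(h - G @ x)))).solve()
    print(f"t = {t:4d}  x*(t) = {x.value}  c^T x*(t) = {float(c @ x.value):.4f}")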
Log-barrier formulation: min_x { t f_0(x) + φ(x) : Ax = b }
Complexity of the log-barrier interior-point method (number of outer/centering steps):
log( (Σ_{i=1}^k θ_i) / (ε t^(0)) ) / log µ
Primal-dual interior-point methods
- more efficient than the barrier method when high accuracy is needed
- update primal and dual variables at each iteration; no distinction between inner and outer iterations
- often exhibit superlinear asymptotic convergence
- search directions can be interpreted as Newton directions for modified KKT conditions
- can start at infeasible points
- cost per iteration same as barrier method
ℓ1 minimization by interior-point method
Model
min_x { ‖x‖_1 : ‖Ax - b‖_2 ≤ σ }
= min_{x, x1, x2} { 1^T(x1 + x2) : x1 ≥ 0, x2 ≥ 0, x1 - x2 = x, ‖Ax - b‖_2 ≤ σ }
= min_{x1, x2} { 1^T(x1 + x2) : x1 ≥ 0, x2 ≥ 0, ‖A(x1 - x2) - b‖_2 ≤ σ }
= min_{x1, x2} { 1^T x1 + 1^T x2 : Ax1 - Ax2 + y = b, z = σ, (x1, x2, z, y) ⪰_K 0 }
where (x1, x2, z, y) ⪰_K 0 means x1, x2 ∈ R^n_+ and (z, y) ∈ Q^{m+1}.
Solvers: MOSEK, SDPT3, Gurobi. Also, modeling languages CVX and YALMIP.
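A hedged CVXPY sketch of the BPDN model above; CVX/CVXPY perform the conic reformulation automatically and pass the resulting SOCP to an interior-point solver (MOSEK or Gurobi can be selected explicitly if licensed; the data below are synthetic):

import numpy as np
import cvxpy as cp

# Hypothetical data.
np.random.seed(4)
m, n, sigma = 30, 80, 0.1
A = np.random.randn(m, n)
b = np.random.randn(m)

# min ||x||_1  s.t.  ||Ax - b||_2 <= sigma, handed to a conic solver.
x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm(x, 1)),
                  [cp.norm(A @ x - b, 2) <= sigma])
prob.solve()  # e.g., prob.solve(solver=cp.MOSEK) if MOSEK is installed
print(prob.value)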
Nuclear-norm minimization by interior-point method
If we can model ‖X‖_* as an SDP ... (how? see the next slide) ... then we can also model
min_X { ‖X‖_* : ‖A(X) - b‖_F ≤ σ }
min_X { ‖A(X) - b‖_F : ‖X‖_* ≤ τ }
min_X µ‖X‖_* + (1/2)‖A(X) - b‖_F^2
min_X { ‖X‖_* : A(X) = b }        (9)
as well as problems involving tr(X) and the spectral norm ‖X‖_2:
‖X‖_2 ≤ α  ⟺  αI - X ⪰ 0.
Sparse calculus for ℓ1
Inspect √(yz) to get some ideas:
- if y, z ≥ 0 and yz ≥ x^2, then |x| ≤ √(yz) ≤ (1/2)(y + z); moreover, (1/2)(y + z) = √(yz) = |x| if y = z = |x|
- observe: y, z ≥ 0 and yz ≥ x^2  ⟺  [y x; x z] ⪰ 0
- so [y x; x z] ⪰ 0  ⟹  |x| ≤ (1/2)(y + z), and we attain (1/2)(y + z) = |x| if y = z = |x|
Therefore, given x, we have
|x| = min_M { (1/2) tr(M) : M_12 = x, M = M^T, M ⪰ 0 }.
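A quick numerical check of the scalar SDP characterization above (a sketch; assumes CVXPY with an SDP-capable solver such as the bundled SCS, and the value of x is arbitrary):

import cvxpy as cp

# Check |x| = min { (1/2) tr(M) : M_12 = x, M symmetric, M PSD } for a fixed x.
x_val = -3.0
M = cp.Variable((2, 2), symmetric=True)
prob = cp.Problem(cp.Minimize(0.5 * cp.trace(M)),
                  [M[0, 1] == x_val, M >> 0])
prob.solve()
print(prob.value)  # should be close to |x_val| = 3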
Generalization to nuclear norm
Consider X ∈ R^{m×n} (w.l.o.g., assume m ≤ n) and let's try imposing
[Y X; X^T Z] ⪰ 0.
Diagonalize X = UΣV^T, Σ = diag(σ_1, ..., σ_m), ‖X‖_* = Σ_i σ_i. Then
[U^T, -V^T] [Y X; X^T Z] [U; -V] = U^T Y U + V^T Z V - U^T X V - V^T X^T U = U^T Y U + V^T Z V - 2Σ ⪰ 0.
So, tr(U^T Y U + V^T Z V - 2Σ) ≥ 0, i.e., tr(Y) + tr(Z) - 2‖X‖_* ≥ 0.
To attain equality, we can let Y = UΣU^T and Z = VΣ_{n×n}V^T (Σ padded with zeros to an n×n matrix).
Therefore,
‖X‖_* = min_{Y,Z} { (1/2)(tr(Y) + tr(Z)) : [Y X; X^T Z] ⪰ 0 }        (10)
      = min_M { (1/2) tr(M) : the upper-right m×n block of M equals X, M = M^T, M ⪰ 0 }.        (11)
Exercise: express the following problems as SDPs
- min_X { ‖X‖_* : A(X) = b }
- min_X µ‖X‖_* + (1/2)‖A(X) - b‖_F^2
- min_{L,S} { ‖L‖_* + λ‖S‖_1 : A(L + S) = b }
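A sketch of the first exercise problem, min{ ‖X‖_* : A(X) = b } with A sampling entries, written directly in form (11) (assumptions: CVXPY with an SDP-capable solver such as the bundled SCS; the data are synthetic). CVXPY also offers a nuclear-norm atom, cp.normNuc, which could be used instead of the explicit SDP:

import numpy as np
import cvxpy as cp

# Hypothetical matrix-completion-style data: observe a random subset of entries
# of a low-rank matrix, then recover it by nuclear-norm minimization in form (11).
np.random.seed(5)
m, n, r = 8, 10, 2
X_true = np.random.randn(m, r) @ np.random.randn(r, n)
obs = list(zip(*np.where(np.random.rand(m, n) < 0.6)))  # observed index pairs

M = cp.Variable((m + n, m + n), symmetric=True)
X = M[:m, m:]  # the upper-right block of M plays the role of X
constraints = [M >> 0] + [X[i, j] == X_true[i, j] for i, j in obs]
prob = cp.Problem(cp.Minimize(0.5 * cp.trace(M)), constraints)
prob.solve()

print("relative error:", np.linalg.norm(X.value - X_true) / np.linalg.norm(X_true))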
Practice of interior-point methods (IPMs)
- LP, SOCP, SDP are well known and have reliable (commercial, off-the-shelf) solvers
- yet, even the most reliable solvers cannot handle large-scale problems (e.g., images, video, manifold learning, distributed problems, ...)
- example: to recover a still image, there can be 10M variables and 1M constraints; even worse, the constraint coefficients are dense. Result: out of memory.
- simplex and active-set methods: a matrix containing A must be inverted or factorized to compute the next point (unless A is very sparse)
- IPMs approximately solve a Newton system and thus also factorize a matrix involving A
- even when large and dense matrices can be handled, for sparse optimization one should take advantage of the solution sparsity
- some compressive sensing problems have A with structures friendly to operations like Ax and A^T y
Practice of interior-point methods (IPMs)
- the simplex, active-set, and interior-point methods have reliable solvers; good to serve as benchmarks
- they have nice interfaces (including CVX and YALMIP, which save you time)
- CVX and YALMIP are not solvers; they translate problems and then call solvers; see http://goo.gl/zulmk and http://goo.gl/1u0p
- they can return highly accurate solutions; some first-order algorithms (coming later in this course) do not always
- there are other remedies; see the next slide
Papers on large-scale SDPs
Low-rank factorizations:
- S. Burer and R. D. C. Monteiro, A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization, Math. Program., 95:329-357, 2003.
- LMaFit, http://lmafit.blogs.rice.edu/
First-order methods for conic programming:
- Z. Wen, D. Goldfarb, and W. Yin, Alternating direction augmented Lagrangian methods for semidefinite programming, Math. Program. Comput., 2(3-4):203-230, 2010.
Matrix-free IPMs:
- K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, 2012. http://www.maths.ed.ac.uk/~gondzio/reports/mfcs.html
Subgradient methods
Sparse optimization is typically nonsmooth, so it is natural to consider subgradient methods.
- apply subgradient descent to, say, min_x ‖x‖_1 + (µ/2)‖Ax - b‖_2^2
- apply projected subgradient descent to, say, min_x { ‖x‖_1 : Ax = b } (see the sketch below)
Good: subgradients are easy to derive, and the methods are simple to implement.
Bad: convergence requires carefully chosen step sizes (classical results require diminishing step sizes); the convergence rate is weak on paper (and in practice, too?).
Further reading: http://arxiv.org/pdf/1001.4387.pdf, http://goo.gl/qfva6, http://goo.gl/vc21a.
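A minimal NumPy sketch of projected subgradient descent for min{ ‖x‖_1 : Ax = b } with a diminishing step size; every choice below (step rule, iteration count, data) is illustrative:

import numpy as np

# Projected subgradient descent for min { ||x||_1 : Ax = b }.
# A subgradient of ||x||_1 is sign(x); the projection onto {x : Ax = b} is
# x - A^T (A A^T)^{-1} (Ax - b), assuming A has full row rank.
np.random.seed(6)
m, n = 40, 100
A = np.random.randn(m, n)
x_true = np.zeros(n)
x_true[:5] = np.random.randn(5)
b = A @ x_true

AAt_inv = np.linalg.inv(A @ A.T)
def project(v):
    return v - A.T @ (AAt_inv @ (A @ v - b))

x = project(np.zeros(n))
for k in range(1, 5001):
    g = np.sign(x)                   # a subgradient of the l1 norm at x
    x = project(x - (1.0 / k) * g)   # diminishing step size 1/k

print("||x||_1 =", np.abs(x).sum(), " feasibility:", np.linalg.norm(A @ x - b))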