Sparse Optimization Lecture: Basic Sparse Optimization Models


Sparse Optimization Lecture: Basic Sparse Optimization Models. Instructor: Wotao Yin. July 2013. Online discussions on piazza.com. Those who complete this lecture will know: basic l1, l2,1, and nuclear-norm models; some applications of these models; how to reformulate them into standard conic programs; which conic programming solvers to use. 1 / 33

Examples of Sparse Optimization Applications. See the online seminar at piazza.com. 2 / 33

Basis pursuit:
min_x { ||x||_1 : Ax = b }
finds the least l1-norm point on the affine plane {x : Ax = b}; it tends to return a sparse point (sometimes, the sparsest). [Figure: the l1 ball.] 3 / 33

Basis pursuit:
min_x { ||x||_1 : Ax = b }
finds the least l1-norm point on the affine plane {x : Ax = b}; it tends to return a sparse point (sometimes, the sparsest). [Figure: the l1 ball touches the affine plane.] 4 / 33
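
For concreteness, a minimal sketch of basis pursuit in CVXPY (a Python analogue of the CVX modeling language mentioned later in the lecture); the data here are random and only for illustration:

```python
import cvxpy as cp
import numpy as np

# Basis pursuit: min ||x||_1 s.t. Ax = b, on synthetic data with a sparse x_true.
rng = np.random.default_rng(0)
m, n = 20, 60
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, 4, replace=False)] = rng.standard_normal(4)
b = A @ x_true

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm(x, 1)), [A @ x == b])
prob.solve()
print(np.linalg.norm(x.value - x_true))   # typically small: the least-l1 point is the sparse one
```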

Basis pursuit denoising, LASSO:
min_x { ||Ax − b||_2 : ||x||_1 ≤ τ },   (1a)
min_x ||x||_1 + (µ/2) ||Ax − b||_2^2,   (1b)
min_x { ||x||_1 : ||Ax − b||_2 ≤ σ }.   (1c)
All models allow Ax ≠ b. 5 / 33

Basis pursuit denoising, LASSO:
min_x { ||Ax − b||_2 : ||x||_1 ≤ τ },   (2a)
min_x ||x||_1 + (µ/2) ||Ax − b||_2^2,   (2b)
min_x { ||x||_1 : ||Ax − b||_2 ≤ σ }.   (2c)
The l2-norm is the most common error measure but can be generalized to a loss function L. (2a) seeks a least-squares solution with bounded sparsity. (2b) is known as LASSO (least absolute shrinkage and selection operator); it seeks a balance between sparsity and fitting. (2c) is referred to as BPDN (basis pursuit denoising), seeking a sparse solution in the tube-like set {x : ||Ax − b||_2 ≤ σ}. The three are equivalent (see later slides). In terms of regression, they select a (sparse) set of features (i.e., columns of A) to linearly express the observation b. 6 / 33
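
A sketch of the three formulations (2a)-(2c) in CVXPY; τ, µ, σ are arbitrary illustrative values, and A, b are assumed to be NumPy arrays as in the previous sketch:

```python
import cvxpy as cp

# Assumes A (m x n) and b (m,) are NumPy arrays, e.g. from the previous sketch.
tau, mu, sigma = 3.0, 10.0, 0.1
x = cp.Variable(A.shape[1])

lasso_2a = cp.Problem(cp.Minimize(cp.norm(A @ x - b, 2)), [cp.norm(x, 1) <= tau])
lasso_2b = cp.Problem(cp.Minimize(cp.norm(x, 1) + (mu / 2) * cp.sum_squares(A @ x - b)))
bpdn_2c  = cp.Problem(cp.Minimize(cp.norm(x, 1)), [cp.norm(A @ x - b, 2) <= sigma])

for prob in (lasso_2a, lasso_2b, bpdn_2c):
    prob.solve()   # each is translated to a conic program and passed to a backend solver
```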

Sparse under basis Ψ / l1-synthesis model:
min_s { ||s||_1 : AΨs = b }.   (3)
The signal is sparsely synthesized by atoms from Ψ, so the vector s is sparse. Ψ is referred to as the dictionary. Commonly used dictionaries include both analytic and trained ones. Analytic examples: identity, DCT, wavelets, curvelets, Gabor, etc., and also their combinations; they have analytic properties and are often easy to compute (for example, multiplying a vector takes O(n log n) instead of O(n^2)). Ψ can also be numerically learned from training data or a partial signal. Dictionaries can be orthogonal, frames, or general. 7 / 33
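
A sketch of the synthesis model (3) with an explicitly built orthogonal DCT dictionary (one convenient analytic choice; A and b are assumed from the earlier sketches):

```python
import cvxpy as cp
import numpy as np
from scipy.fft import dct

# Assumes A (m x n) and b (m,) as in the earlier sketches.
n = A.shape[1]
Psi = dct(np.eye(n), axis=0, norm="ortho")     # orthogonal DCT dictionary; signal x = Psi @ s

s = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm(s, 1)), [A @ Psi @ s == b])
prob.solve()
x = Psi @ s.value                              # synthesize the signal from its sparse coefficients
```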

Sparse under basis Ψ / l1-synthesis model. If Ψ is orthogonal, problem (3) is equivalent to
min_x { ||Ψ^T x||_1 : Ax = b }   (4)
by the change of variable x = Ψs, equivalently s = Ψ^T x. Related models for noise and approximate sparsity:
min_x { ||Ax − b||_2 : ||Ψ^T x||_1 ≤ τ },  min_x ||Ψ^T x||_1 + (µ/2) ||Ax − b||_2^2,  min_x { ||Ψ^T x||_1 : ||Ax − b||_2 ≤ σ }. 8 / 33

Sparse after transform / l1-analysis model:
min_x { ||Ψx||_1 : Ax = b }.   (5)
The signal becomes sparse under the transform Ψ (which may not be orthogonal). Examples of Ψ: DCT, wavelets, curvelets, ridgelets, ..., tight frames, Gabor, ..., (weighted) total variation. When Ψ is not orthogonal, the analysis is more difficult. 9 / 33
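
A sketch of the analysis model (5) with Ψ taken to be a 1-D finite-difference (total-variation-like) operator; A and b are assumed from the earlier sketches:

```python
import cvxpy as cp
import numpy as np

# Assumes A (m x n) and b (m,) as in the earlier sketches.
n = A.shape[1]
Psi = np.diff(np.eye(n), axis=0)    # (n-1) x n difference operator: (Psi @ x)_i = x_{i+1} - x_i

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm(Psi @ x, 1)), [A @ x == b])
prob.solve()                        # favors piecewise-constant solutions x
```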

Example: sparsify an image. 10 / 33

[Figure: (a) DCT coefficients, (b) Haar wavelets, (c) local variation; (d)-(f) sorted coefficient magnitudes showing the decay of the DCT, Haar wavelet, and local-variation coefficients. The DCT and wavelet coefficients are scaled for better visibility.] 11 / 33

Questions. 1. Can we trust these models to return the intended sparse solutions? 2. When will the solution be unique? 3. Will the solution be robust to noise in b? 4. Are the constrained and unconstrained models equivalent? In what sense? Questions 1-4 will be addressed in the next lecture. 5. How to choose parameters? τ (sparsity), µ (weight), and σ (noise level) have different meanings; applications determine which one is easier to set; generality: use a test data set, then scale the parameters for the real data; cross validation: reserve a subset of the data to test the solution. 12 / 33

Joint/group sparsity. Joint sparse recovery model:
min_X { ||X||_{2,1} : A(X) = b },   (6)
where ||X||_{2,1} := sum_{i=1}^m || [x_{i1}, x_{i2}, ..., x_{in}] ||_2.
The l2-norm is applied to each row of X. The l2,1-norm ball has sharp boundaries across different rows, which tend to be touched by {X : A(X) = b}, so the solution tends to be row-sparse. One can also use ||X||_{p,q} for 1 < p ≤ ∞, which affects the magnitudes of entries on the same row. Complex-valued signals are a special case. 13 / 33
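
A sketch of the joint-sparsity model (6), taking A(X) to be the matrix product AX (one common choice of linear operator; the data are synthetic):

```python
import cvxpy as cp
import numpy as np

# Joint sparse recovery: min ||X||_{2,1} s.t. AX = B, with a row-sparse ground truth.
rng = np.random.default_rng(0)
m, n, r = 20, 60, 4
A = rng.standard_normal((m, n))
X_true = np.zeros((n, r))
X_true[rng.choice(n, 5, replace=False), :] = rng.standard_normal((5, r))   # 5 nonzero rows
B = A @ X_true

X = cp.Variable((n, r))
row_l2_sum = cp.sum(cp.norm(X, 2, axis=1))   # ||X||_{2,1}: sum of row l2-norms (also cp.mixed_norm(X, 2, 1))
prob = cp.Problem(cp.Minimize(row_l2_sum), [A @ X == B])
prob.solve()
```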

Joint/group sparsity. Decompose {1, ..., n} = G_1 ∪ G_2 ∪ ... ∪ G_S. Non-overlapping groups: G_i ∩ G_j = ∅ for i ≠ j; otherwise, groups may overlap (modeling many interesting structures). Group-sparse recovery model:
min_x { ||x||_{G,2,1} : Ax = b },   (7)
where ||x||_{G,2,1} = sum_{s=1}^S w_s ||x_{G_s}||_2. 14 / 33
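
A sketch of the group-sparse model (7) with a hypothetical non-overlapping partition into groups of size 10 and unit weights w_s; A and b are assumed from the earlier sketches:

```python
import cvxpy as cp
import numpy as np

# Assumes A (m x n) and b (m,) as in the earlier sketches; here n is divisible by 10.
n = A.shape[1]
groups = [np.arange(s, s + 10) for s in range(0, n, 10)]   # G_1, ..., G_S (non-overlapping)
w = np.ones(len(groups))                                   # weights w_s

x = cp.Variable(n)
group_norm = sum(w[s] * cp.norm(x[g], 2) for s, g in enumerate(groups))   # ||x||_{G,2,1}
prob = cp.Problem(cp.Minimize(group_norm), [A @ x == b])
prob.solve()
```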

Auxiliary constraints. Auxiliary constraints introduce additional structure of the underlying signal into its recovery, which sometimes significantly improves recovery quality: nonnegativity x ≥ 0; bound (box) constraints l ≤ x ≤ u; general inequalities Qx ≤ q. They can be very effective in practice. They also generate corners. 15 / 33
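
A sketch of adding such auxiliary constraints to basis pursuit in CVXPY; the bounds l, u and the pair Q, q are placeholders for whatever prior information is available, and A, b are assumed from the earlier sketches:

```python
import cvxpy as cp
import numpy as np

# Assumes A, b as before; l, u, Q, q encode hypothetical prior knowledge.
n = A.shape[1]
l, u = np.zeros(n), np.ones(n)              # box bounds (here they also imply nonnegativity)
Q, q = np.ones((1, n)), np.array([10.0])    # one general linear inequality Qx <= q

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm(x, 1)),
                  [A @ x == b, x >= l, x <= u, Q @ x <= q])
prob.solve()
```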

Reduce to conic programs. Sparse optimization often has nonsmooth objectives. Classic conic programming solvers do not handle nonsmooth functions. Basic idea: model nonsmoothness by inequality constraints. Example: for given x, we have
||x||_1 = min_{x1,x2} { 1^T (x1 + x2) : x1 − x2 = x, x1 ≥ 0, x2 ≥ 0 }.   (8)
Therefore,
min { ||x||_1 : Ax = b } reduces to a linear program (LP);
min ||x||_1 + (µ/2) ||Ax − b||_2^2 reduces to a bound-constrained quadratic program (QP);
min { ||Ax − b||_2 : ||x||_1 ≤ τ } reduces to a bound-constrained QP;
min { ||x||_1 : ||Ax − b||_2 ≤ σ } reduces to a second-order cone program (SOCP). 16 / 33
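
A sketch of the LP reduction via the splitting (8), solved with SciPy's LP solver on random data (not from the lecture):

```python
import numpy as np
from scipy.optimize import linprog

# Solve basis pursuit  min ||x||_1 s.t. Ax = b  as the LP (8):
#   min 1^T(x1 + x2)  s.t.  A(x1 - x2) = b,  x1, x2 >= 0.
rng = np.random.default_rng(0)
m, n = 10, 30
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, 3, replace=False)] = rng.standard_normal(3)
b = A @ x_true

c = np.ones(2 * n)                         # objective 1^T x1 + 1^T x2
A_eq = np.hstack([A, -A])                  # A x1 - A x2 = b
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None))
x = res.x[:n] - res.x[n:]                  # recover x = x1 - x2
print(np.linalg.norm(x - x_true))          # typically small: BP recovers the sparse x_true
```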

Conic programming. Basic form:
min { c^T x : Fx + g ⪰_K 0, Ax = b },
where a ⪰_K b stands for a − b ∈ K, and K is a convex, closed, pointed cone. Examples:
first orthant (cone): R^n_+ = { x ∈ R^n : x ≥ 0 };
norm cone (second-order cone): Q = { (x, t) : ||x|| ≤ t };
polyhedral cone: P = { x : Ax ≥ 0 };
positive semidefinite cone: S_+ = { X : X ⪰ 0, X^T = X }.
Example: { (x, y, z) : [x, y; y, z] ∈ S_+ }. [Figure: the boundary surface of this cone in (x, y, z) coordinates.] 17 / 33

Linear program. Model:
min { c^T x : Ax = b, x ⪰_K 0 },
where K is the nonnegative cone (first orthant); x ⪰_K 0 ⟺ x ≥ 0. Algorithms: the simplex method (moves between vertices); interior-point methods (IPMs) (move inside the polyhedron); decomposition approaches (divide and conquer). In a primal IPM, x ≥ 0 is replaced by its logarithmic barrier ψ(y) = Σ_i log(y_i); log-barrier formulation:
min { c^T x − (1/t) Σ_i log(x_i) : Ax = b }. 18 / 33

Second-order cone program. Model:
min { c^T x : Ax = b, x ⪰_K 0 },
where K = K_1 × ... × K_K; each K_k is the second-order cone
K_k = { y ∈ R^{n_k} : y_{n_k} ≥ sqrt( y_1^2 + ... + y_{n_k−1}^2 ) }.
IPM is the standard solver (though other options also exist). Log-barrier of K_k: ψ(y) = log( y_{n_k}^2 − (y_1^2 + ... + y_{n_k−1}^2) ). 19 / 33

Semidefinite program. Model:
min { ⟨C, X⟩ : A(X) = b, X ⪰_K 0 },
where K = K_1 × ... × K_K; each K_k = S^{n_k}_+. IPM is the standard solver (though other options also exist). Log-barrier of S^{n_k}_+ (still a concave function): ψ(Y) = log det(Y). 20 / 33

(from Boyd & Vandenberghe, Convex Optimization) Properties (without proof): for y ≻_K 0,
∇ψ(y) ⪰_{K*} 0,   y^T ∇ψ(y) = θ.
Nonnegative orthant R^n_+: ψ(y) = Σ_{i=1}^n log y_i;
∇ψ(y) = (1/y_1, ..., 1/y_n),   y^T ∇ψ(y) = n.
Positive semidefinite cone S^n_+: ψ(Y) = log det Y;
∇ψ(Y) = Y^{-1},   tr(Y ∇ψ(Y)) = n.
Second-order cone K = { y ∈ R^{n+1} : (y_1^2 + ... + y_n^2)^{1/2} ≤ y_{n+1} }:
∇ψ(y) = 2 / (y_{n+1}^2 − y_1^2 − ... − y_n^2) · (−y_1, ..., −y_n, y_{n+1}),   y^T ∇ψ(y) = 2. 21 / 33
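
A quick numerical sanity check of the identities above (a sketch; the test points are arbitrary, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Nonnegative orthant: psi(y) = sum log y_i, grad = 1/y, y^T grad = n.
y = rng.random(n) + 0.1
print(y @ (1.0 / y))                         # = 5

# PSD cone: psi(Y) = log det Y, grad = Y^{-1}, tr(Y grad) = n.
B = rng.standard_normal((n, n))
Y = B @ B.T + n * np.eye(n)                  # a positive definite test matrix
print(np.trace(Y @ np.linalg.inv(Y)))        # = 5

# Second-order cone {y in R^{n+1} : ||y[:n]|| <= y[n]}: y^T grad(psi)(y) = 2.
u = rng.standard_normal(n)
y = np.concatenate([u, [np.linalg.norm(u) + 1.0]])     # strictly inside the cone
denom = y[-1] ** 2 - np.sum(y[:-1] ** 2)
grad = np.concatenate([-2 * y[:-1], [2 * y[-1]]]) / denom
print(y @ grad)                              # = 2
```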

(from Boyd & Vandenberghe, Convex Optimization) Central path: for t > 0, define x*(t) as the solution of
minimize t f_0(x) + φ(x)   subject to Ax = b
(for now, assume x*(t) exists and is unique for each t > 0). The central path is { x*(t) : t > 0 }. Example: central path for an LP,
minimize c^T x   subject to a_i^T x ≤ b_i, i = 1, ..., 6;
the hyperplane c^T x = c^T x*(t) is tangent to the level curve of φ through x*(t). [Figure: central path of the LP.] 22 / 33

Log-barrier formulation: min { t f_0(x) + φ(x) : Ax = b }. Complexity of the log-barrier interior-point method: the number of outer (centering) iterations is
⌈ log( (Σ_i θ_i) / (ε t^{(0)}) ) / log µ ⌉. 23 / 33

Primal-dual interior-point methods: more efficient than the barrier method when high accuracy is needed; update primal and dual variables at each iteration, with no distinction between inner and outer iterations; often exhibit superlinear asymptotic convergence; search directions can be interpreted as Newton directions for modified KKT conditions; can start at infeasible points; cost per iteration is the same as the barrier method. 24 / 33

l1 minimization by interior-point method. Model:
min_x { ||x||_1 : ||Ax − b||_2 ≤ σ }
⟺ min_{x,x1,x2} { 1^T (x1 + x2) : x1 ≥ 0, x2 ≥ 0, x1 − x2 = x, ||Ax − b||_2 ≤ σ }
⟺ min_{x1,x2} { 1^T (x1 + x2) : x1 ≥ 0, x2 ≥ 0, ||A(x1 − x2) − b||_2 ≤ σ }
⟺ min { 1^T x1 + 1^T x2 : Ax1 − Ax2 + y = b, z = σ, (x1, x2, z, y) ⪰_K 0 },
where (x1, x2, z, y) ⪰_K 0 means x1, x2 ∈ R^n_+ and (z, y) ∈ Q^{m+1}. Solvers: Mosek, SDPT3, Gurobi. Also, the modeling languages CVX and YALMIP. 25 / 33
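
A sketch of the explicit conic reformulation above in CVXPY (random data; the variables x1, x2 mirror the derivation, not any particular solver interface). In practice CVX/CVXPY perform this translation automatically; the splitting is written out only to mirror the slide:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma = 20, 60, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x1 = cp.Variable(n, nonneg=True)
x2 = cp.Variable(n, nonneg=True)
objective = cp.Minimize(cp.sum(x1) + cp.sum(x2))            # 1^T x1 + 1^T x2
constraints = [cp.norm(A @ (x1 - x2) - b, 2) <= sigma]      # second-order cone constraint
cp.Problem(objective, constraints).solve()                  # dispatched to an installed conic solver
x = x1.value - x2.value                                     # recover x = x1 - x2
```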

Nuclear-norm minimization by interior-point method. If we can model ||X||_* as an SDP ... (how? see the next slide) ... then we can also model
min_X { ||X||_* : ||A(X) − b||_F ≤ σ },  min_X { ||A(X) − b||_F : ||X||_* ≤ τ },  min_X µ||X||_* + (1/2)||A(X) − b||_F^2, and
min_X { ||X||_* : A(X) = b },   (9)
as well as problems involving tr(X) and the spectral norm ||X||_2, using ||X||_2 ≤ α ⟺ [αI, X; X^T, αI] ⪰ 0. 26 / 33

Sparse calculus for l1. Inspect |x| to get some ideas:
y, z ≥ 0 and yz ≥ x^2 ⟹ |x| ≤ sqrt(yz) ≤ (1/2)(y + z); moreover, (1/2)(y + z) = sqrt(yz) = |x| if y = z = |x|.
Observe: y, z ≥ 0 and yz ≥ x^2 ⟺ [y, x; x, z] ⪰ 0. So [y, x; x, z] ⪰ 0 ⟹ |x| ≤ (1/2)(y + z), and we attain (1/2)(y + z) = |x| if y = z = |x|.
Therefore, given x, we have
|x| = min_M { (1/2) tr(M) : ⟨[0, 1; 0, 0], M⟩ = x, M = M^T, M ⪰ 0 }. 27 / 33
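
A sketch verifying the SDP characterization of |x| above in CVXPY (x = -3.7 is an arbitrary test value):

```python
import cvxpy as cp

x = -3.7
M = cp.Variable((2, 2), symmetric=True)
prob = cp.Problem(cp.Minimize(0.5 * cp.trace(M)),
                  [M >> 0, M[0, 1] == x])   # PSD constraint and <[0,1;0,0], M> = x
prob.solve()
print(prob.value, abs(x))                   # both approximately 3.7
```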

Generalization to nuclear norm. Consider X ∈ R^{m×n} (w.l.o.g., assume m ≤ n) and let us try imposing
[Y, X; X^T, Z] ⪰ 0.
Diagonalize X = UΣV^T, Σ = diag(σ_1, ..., σ_m), ||X||_* = Σ_i σ_i. Then
[U^T, −V^T] [Y, X; X^T, Z] [U; −V] = U^T Y U + V^T Z V − U^T X V − V^T X^T U = U^T Y U + V^T Z V − 2Σ ⪰ 0.
So, tr(Y) + tr(Z) − 2||X||_* ≥ 0. To attain equality, we can let Y = UΣU^T and Z = VΣV^T. 28 / 33

Therefore,
||X||_* = min_{Y,Z} { (1/2)(tr(Y) + tr(Z)) : [Y, X; X^T, Z] ⪰ 0 }   (10)
        = min_M { (1/2) tr(M) : the upper-right block of M (the positions selected by [0, I; 0, 0]) equals X, M = M^T, M ⪰ 0 }.   (11)
Exercise: express the following problems as SDPs:
min_X { ||X||_* : A(X) = b },  min_X µ||X||_* + (1/2)||A(X) − b||_F^2,  min_{L,S} { ||L||_* + λ||S||_1 : A(L + S) = b }. 29 / 33
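
A sketch of formulation (10) in CVXPY: compute ||X||_* of a fixed random test matrix via the SDP and compare with the sum of its singular values:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 6
X = rng.standard_normal((m, n))

Y = cp.Variable((m, m), symmetric=True)
Z = cp.Variable((n, n), symmetric=True)
block = cp.bmat([[Y, X], [X.T, Z]])                 # [Y, X; X^T, Z]
prob = cp.Problem(cp.Minimize(0.5 * (cp.trace(Y) + cp.trace(Z))), [block >> 0])
prob.solve()                                        # requires an SDP-capable backend solver
print(prob.value, np.linalg.svd(X, compute_uv=False).sum())   # both equal ||X||_*
```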

Practice of interior-point methods (IPMs). LP, SOCP, and SDP are well known and have reliable (commercial, off-the-shelf) solvers. Yet, even the most reliable solvers cannot handle large-scale problems (e.g., images, video, manifold learning, distributed problems, ...). Example: to recover a still image, there can be 10M variables and 1M constraints; even worse, the constraint coefficients are dense. Result: out of memory. Simplex and active-set methods: a matrix containing A must be inverted or factorized to compute the next point (unless A is very sparse). IPMs approximately solve a Newton system and thus also factorize a matrix involving A. Even when large and dense matrices can be handled, for sparse optimization one should take advantage of the solution sparsity. Some compressive sensing problems have A with structures friendly for operations like Ax and A^T y. 30 / 33

Practice of interior-point methods (IPMs). The simplex, active-set, and interior-point methods have reliable solvers; they make good benchmarks. They have nice interfaces (including CVX and YALMIP, which save you time). CVX and YALMIP are not solvers; they translate problems and then call solvers; see http://goo.gl/zulmk and http://goo.gl/1u0p. They can return highly accurate solutions; some first-order algorithms (coming later in this course) do not always. There are other remedies; see the next slide. 31 / 33

Papers on large-scale SDPs. Low-rank factorizations: S. Burer and R. D. C. Monteiro, A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization, Math. Program., 95:329-357, 2003; LMaFit, http://lmafit.blogs.rice.edu/. First-order methods for conic programming: Z. Wen, D. Goldfarb, and W. Yin, Alternating direction augmented Lagrangian methods for semidefinite programming, Math. Program. Comput., 2(3-4):203-230, 2010. Matrix-free IPMs: K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, 2012, http://www.maths.ed.ac.uk/~gondzio/reports/mfcs.html. 32 / 33

Subgradient methods. Sparse optimization is typically nonsmooth, so it is natural to consider subgradient methods. Apply subgradient descent to, say, min_x ||x||_1 + (µ/2)||Ax − b||_2^2. Apply projected subgradient descent to, say, min_x { ||x||_1 : Ax = b }. Good: subgradients are easy to derive, and the methods are simple to implement. Bad: convergence requires carefully chosen step sizes (classical results require diminishing step sizes), and the convergence rate is weak on paper (and in practice, too?). Further reading: http://arxiv.org/pdf/1001.4387.pdf, http://goo.gl/qfva6, http://goo.gl/vc21a. 33 / 33
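
A sketch (synthetic data, not from the lecture) of subgradient descent on min ||x||_1 + (µ/2)||Ax − b||_2^2 with a classical diminishing step size, illustrating the points above:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, mu = 40, 100, 5.0
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, 5, replace=False)] = 1.0
b = A @ x_true

f = lambda x: np.linalg.norm(x, 1) + 0.5 * mu * np.linalg.norm(A @ x - b) ** 2
x = np.zeros(n)
a0 = 1.0 / (mu * np.linalg.norm(A, 2) ** 2)      # conservative base step size
for k in range(5000):
    g = np.sign(x) + mu * A.T @ (A @ x - b)      # a subgradient of the objective
    x = x - a0 / np.sqrt(k + 1) * g              # diminishing step size
print(f(np.zeros(n)), f(x))                      # the objective decreases, but slowly
```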