Minimizing the Difference of L 1 and L 2 Norms with Applications

1/36 Minimizing the Difference of L1 and L2 Norms with Applications. Department of Mathematical Sciences, University of Texas at Dallas. May 31, 2017. Partially supported by NSF DMS-1522786.

2/36 Outline
1. A nonconvex approach: L1-L2
2. Minimization algorithms
3. Some applications
4.

3/36 Background
We aim to find a sparse vector from an under-determined linear system,
    $\hat{x}_0 = \arg\min_x \|x\|_0 \quad \text{s.t.} \quad Ax = b$.
This is NP-hard. A popular approach is to replace L0 by L1, i.e.,
    $\hat{x}_1 = \arg\min_x \|x\|_1 \quad \text{s.t.} \quad Ax = b$.
The equivalence between the L0 and L1 minimization problems holds when the matrix A satisfies the restricted isometry property (RIP). Candès-Romberg-Tao (2006)
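As a quick, hedged illustration of the L1 relaxation (my own sketch, not code from the talk; the test matrix, sparsity level, and solver choice are arbitrary), the equality-constrained L1 problem can be cast as a linear program by splitting x = u - v with u, v >= 0:

```python
import numpy as np
from scipy.optimize import linprog

# Synthetic under-determined system: M equations, N unknowns, K-sparse ground truth.
rng = np.random.default_rng(0)
M, N, K = 64, 256, 8
A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = np.zeros(N)
x_true[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
b = A @ x_true

# Basis pursuit  min ||x||_1  s.t.  Ax = b,  as an LP in (u, v) with x = u - v, u, v >= 0:
#   min  sum(u) + sum(v)   s.t.  [A, -A] [u; v] = b.
res = linprog(c=np.ones(2 * N), A_eq=np.hstack([A, -A]), b_eq=b,
              bounds=[(0, None)] * (2 * N), method="highs")
x_l1 = res.x[:N] - res.x[N:]
print("L1 recovery error:", np.linalg.norm(x_l1 - x_true))
```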

4/36 Coherence
Another sparse recovery guarantee is based on coherence: exact recovery holds if
    $\|x\|_0 < \frac{1}{2}\left(1 + \mu(A)^{-1}\right)$,
where the coherence of a matrix A = [a_1, ..., a_N] is defined as
    $\mu(A) = \max_{i \neq j} \frac{|a_i^T a_j|}{\|a_i\| \, \|a_j\|}$.
Two extreme cases:
    μ close to 0: incoherent matrix;
    μ close to 1: coherent matrix.
What if the matrix is coherent?
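A small numpy sketch of this definition (my own code; the Gaussian test matrix is just one example of an incoherent case):

```python
import numpy as np

def coherence(A):
    """Mutual coherence mu(A): max |a_i^T a_j| / (||a_i|| ||a_j||) over i != j."""
    An = A / np.linalg.norm(A, axis=0)   # normalize each column
    G = np.abs(An.T @ An)                # absolute normalized inner products
    np.fill_diagonal(G, 0.0)             # drop the trivial i == j entries
    return G.max()

rng = np.random.default_rng(0)
print(coherence(rng.standard_normal((100, 1000))))  # random Gaussian: fairly incoherent
```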

5/36 L1-L2 works well for coherent matrices
We consider an over-sampled DCT matrix with columns
    $a_j = \frac{1}{\sqrt{N}} \cos\left(\frac{2\pi w j}{F}\right), \quad j = 1, \dots, N$,
where w is a random vector of length M. The larger F is, the more coherent the matrix. Taking a 100 x 1000 matrix as an example:

    F     coherence
    1     0.3981
    10    0.9981
    20    0.9999

P. Yin, Y. Lou, Q. He and J. Xin, SIAM J. Sci. Comput., 2015
Y. Lou, P. Yin, Q. He and J. Xin, J. Sci. Comput., 2015
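A sketch of this construction (my own code; the slide does not specify the distribution of w, so uniform entries in [0, 1] and the 1/sqrt(N) scaling are assumptions on my part):

```python
import numpy as np

def oversampled_dct(M, N, F, rng):
    """Columns a_j = cos(2*pi*w*j/F)/sqrt(N), j = 1..N, with w a random vector of length M."""
    w = rng.random(M)                    # assumed uniform in [0, 1]^M
    j = np.arange(1, N + 1)
    return np.cos(2.0 * np.pi * np.outer(w, j) / F) / np.sqrt(N)

def coherence(A):
    An = A / np.linalg.norm(A, axis=0)
    G = np.abs(An.T @ An)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(0)
for F in (1, 10, 20):
    print(F, round(coherence(oversampled_dct(100, 1000, F, rng)), 4))
# The coherence climbs toward 1 as F grows, which is the trend shown in the table above.
```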

6/36 Figure: Success rates versus sparsity for incoherent matrices (F = 1), comparing L1-L2, L1/2, and L1.

6/36 Figure: Success rates versus sparsity for coherent matrices (F = 20), comparing L1-L2, L1/2, and L1.

7/36 Comparing metrics
Figure: Level lines of three metrics: L2 (strictly convex), L1 (convex), and L1-L2 (nonconvex).

8/36 Comparing nonconvex metrics
Consider a matrix A of size 17 x 19 and let V ∈ R^{19 x 2} be a basis of the null space of A, i.e., AV = 0. The feasible set is then a two-dimensional affine space,
    $\{x : Ax = Ax_g\} = \left\{ x = x_g + V \begin{bmatrix} s \\ t \end{bmatrix} : s, t \in \mathbb{R} \right\}$.
We visualize the objective functions L0, L1/2, and L1-L2 over the 2D (s, t)-plane.
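A sketch of how this visualization can be set up (my own code; the slide does not give A or the sparse point x_g, so both are synthetic here):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((17, 19))
x_g = np.zeros(19)
x_g[[3, 11]] = [1.0, -2.0]                     # a synthetic sparse ground truth

# Orthonormal basis V of the null space of A: the last two right singular vectors.
V = np.linalg.svd(A)[2][17:].T                 # shape (19, 2), A @ V is (numerically) 0

def l1_minus_l2(x):
    return np.abs(x).sum() - np.linalg.norm(x)

s = np.linspace(-3.0, 3.0, 121)
Z = np.array([[l1_minus_l2(x_g + V @ np.array([si, ti])) for si in s] for ti in s])
# Z[k, l] is the objective at (s, t) = (s[l], s[k]); feed it to a contour or surface plot.
```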

9/36 L0: incoherent (left) and coherent (right)

10/36 L1/2: incoherent (left) and coherent (right)

11/36 L1-L2: incoherent (left) and coherent (right)

12/36 Model failure vs. algorithm failure
Figure: Occurrence rates of model failure (F(x̄) ≥ F(x)) versus algorithmic failure (F(x̄) < F(x)) as a function of sparsity, shown in four panels: Lasso (L1), IRucLq-v (L1/2), ℓ1-2 (L1-L2), and ℓt,1-2 (truncated L1-L2).

13/36 Truncated L1-L2
    $\|x\|_{t,1\text{-}2} := \sum_{i \notin \Gamma_{x,t}} |x_i| \;-\; \sqrt{\textstyle\sum_{i \notin \Gamma_{x,t}} x_i^2}$,
where Γ_{x,t} ⊂ {1, ..., N} with cardinality t is the set of indices of the t largest-magnitude entries of x.
Figure: Success rates versus sparsity for Lasso, ℓ1-2, IRucLq-v, ISD, and ℓt,1-2 (two panels).
T. Ma, Y. Lou, and T. Huang, SIAM J. Imaging Sci., to appear 2017
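A minimal numpy sketch of this metric (my own code, following the definition above; ties among equal magnitudes are broken arbitrarily by the sort):

```python
import numpy as np

def truncated_l1_minus_l2(x, t):
    """||x||_{t,1-2}: apply L1 - L2 only to the entries outside the t largest magnitudes."""
    tail_idx = np.argsort(np.abs(x))[:len(x) - t]   # indices NOT in Gamma_{x,t}
    tail = x[tail_idx]
    return np.abs(tail).sum() - np.linalg.norm(tail)

x = np.array([5.0, -0.1, 0.0, 3.0, 0.2])
print(truncated_l1_minus_l2(x, t=2))                # only the small entries are penalized
```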

14/36 Advantages of L1-L2
- Lipschitz continuous.
- Corrects the bias of L1 by subtracting a term whose gradient is smooth a.e.
- Exact recovery of 1-sparse vectors (the truncated version yields exact recovery of t-sparse vectors).
- Good for coherent compressive sensing.

15/36 Outline
1. A nonconvex approach: L1-L2
2. Minimization algorithms
3. Some applications
4.

16/36 We consider an unconstrained L1-L2 formulation, i.e.,
    $\min_{x \in \mathbb{R}^N} F(x) = \frac{1}{2}\|Ax - b\|_2^2 + \lambda\left(\|x\|_1 - \|x\|_2\right)$.
Our first attempt uses the difference-of-convex algorithm (DCA), decomposing F(x) = G(x) - H(x) with
    $G(x) = \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1, \qquad H(x) = \lambda\|x\|_2$.
The resulting iterative scheme is
    $x^{n+1} = \arg\min_{x \in \mathbb{R}^N} \; \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1 - \lambda\,\frac{\langle x, x^n \rangle}{\|x^n\|_2}$.
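A hedged sketch of this DCA iteration (my own implementation, not the authors' code; the convex subproblem is solved approximately by plain ISTA, and the iteration counts and step size rule are arbitrary choices):

```python
import numpy as np

def shrink(y, tau):
    """Soft-thresholding operator."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def dca_l1_minus_l2(A, b, lam, outer_iters=20, inner_iters=200):
    """DCA for min_x 0.5*||Ax - b||^2 + lam*(||x||_1 - ||x||_2)."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1/L with L = ||A||_2^2
    for _ in range(outer_iters):
        # Linearize H(x) = lam*||x||_2 at x^n; a (sub)gradient is lam * x^n / ||x^n||_2.
        nrm = np.linalg.norm(x)
        g = lam * x / nrm if nrm > 0 else np.zeros_like(x)
        z = x.copy()
        for _ in range(inner_iters):                # ISTA on the convex subproblem
            grad = A.T @ (A @ z - b) - g
            z = shrink(z - step * grad, step * lam)
        x = z
    return x
```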

17/36 We then derive a proximal operator for L1 - αL2 (α ≥ 0),
    $x^* = \arg\min_x \; \lambda\left(\|x\|_1 - \alpha\|x\|_2\right) + \frac{1}{2}\|x - y\|_2^2$,
which has a closed-form solution:
1. If ||y||_∞ > λ, then x* = z(||z||_2 + αλ)/||z||_2, where z = shrink(y, λ);
2. if ||y||_∞ = λ, then ||x*||_2 = αλ and x*_i = 0 for |y_i| < λ;
3. if (1 - α)λ < ||y||_∞ < λ, then x* is a 1-sparse vector satisfying x*_i = 0 for |y_i| < ||y||_∞;
4. if ||y||_∞ ≤ (1 - α)λ, then x* = 0.
Y. Lou and M. Yan, J. Sci. Comput., to appear 2017
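A numpy sketch of this closed-form operator (my own implementation of the four cases above; in the tie cases that admit multiple solutions it simply returns one of them):

```python
import numpy as np

def shrink(y, tau):
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def prox_l1_minus_alpha_l2(y, lam, alpha):
    """argmin_x lam*(||x||_1 - alpha*||x||_2) + 0.5*||x - y||^2, following the case analysis."""
    y = np.asarray(y, dtype=float)
    ymax = np.abs(y).max()
    if ymax > lam:                                   # case 1
        z = shrink(y, lam)
        return z * (np.linalg.norm(z) + alpha * lam) / np.linalg.norm(z)
    if ymax > (1.0 - alpha) * lam:                   # cases 2-3: a 1-sparse solution
        x = np.zeros_like(y)
        i = np.argmax(np.abs(y))
        x[i] = np.sign(y[i]) * (ymax - (1.0 - alpha) * lam)
        return x
    return np.zeros_like(y)                          # case 4
```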

18/36 Remarks
- Most L1 solvers are applicable to L1 - αL2 by replacing soft shrinkage with this proximal operator.
- The algorithm combining ADMM with this operator (nonconvex-ADMM) is much faster than the DCA.
- Both nonconvex-ADMM and the DCA converge to stationary points.
- However, nonconvex-ADMM does not give better performance than the DCA for coherent CS.

19/36 Figure: Success rates versus sparsity for coherent matrices (F = 20), comparing L1 (ADMM), L1-L2 (DCA), L1-L2 (ADMM), and L1-L2 (weighted).

20/36 Figure: Success rates versus sparsity for incoherent matrices (F = 5), comparing L1 (ADMM), L1-L2 (DCA), L1-L2 (ADMM), and L1-L2 (weighted).

21/36 The DCA is more stable than nonconvex-ADMM, as each DCA subproblem is convex. Since the problem is convex at α = 0, we consider a continuation scheme that gradually increases α from 0 to 1, referred to as the weighted scheme. How to update the weight α? For incoherent matrices, α increases linearly with a large slope until reaching one; for coherent cases, α is updated by a sigmoid function, which may or may not reach one.
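A sketch of the two kinds of α-schedules described above (my own illustration; the slope, midpoint, steepness, and cap values are placeholders, not the ones used in the reported experiments):

```python
import numpy as np

def alpha_linear(n, slope=0.0005):
    """Linear ramp in the iteration count n, capped at 1 (the incoherent-case schedule)."""
    return min(1.0, slope * n)

def alpha_sigmoid(n, midpoint=3000.0, steepness=0.002, cap=1.0):
    """Sigmoid ramp (the coherent-case schedule); with cap < 1 it never reaches one."""
    return cap / (1.0 + np.exp(-steepness * (n - midpoint)))

iters = np.arange(0, 8001, 1000)
print([round(alpha_linear(n), 3) for n in iters])
print([round(alpha_sigmoid(n), 3) for n in iters])
```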

22/36 Figure: Different ways of updating α (plotted against iteration number) for incoherent (blue, F = 5) or coherent (red, F = 20) matrices.

23/36 Figure: Success rates versus sparsity for coherent matrices (F = 20), comparing L1 (ADMM), L1-L2 (DCA), L1-L2 (ADMM), and L1-L2 (weighted).

24/36 Outline
1. A nonconvex approach: L1-L2
2. Minimization algorithms
3. Some applications
4.

25/36 Super-resolution
The super-resolution problem discussed here is different from image zooming or magnification; it aims to recover a real-valued signal from its low-frequency measurements. A mathematical model is
    $b_k = \frac{1}{\sqrt{N}} \sum_{t=0}^{N-1} x_t \, e^{-i 2\pi k t / N}, \quad |k| \le f_c$,
where x ∈ R^N is the vector of interest, and b ∈ C^n is the given low-frequency information with n = 2 f_c + 1 (n < N).
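A sketch of this forward model (my own code; it keeps the 1/sqrt(N) normalization written above and builds a small dense DFT sub-matrix rather than using an FFT):

```python
import numpy as np

def low_freq_measurements(x, f_c):
    """b_k = (1/sqrt(N)) * sum_t x_t * exp(-2j*pi*k*t/N) for |k| <= f_c."""
    N = len(x)
    k = np.arange(-f_c, f_c + 1)                     # n = 2*f_c + 1 measurements
    t = np.arange(N)
    F = np.exp(-2j * np.pi * np.outer(k, t) / N) / np.sqrt(N)
    return F @ x

x = np.zeros(256)
x[[30, 70, 200]] = [1.0, -0.5, 2.0]                  # a few point sources on the grid
b = low_freq_measurements(x, f_c=20)                 # 41 complex low-frequency samples
```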

26/36 Point sources with minimum separation
Theorem (Candès and Fernandez-Granda, 2012). Let T = {t_j} be the support of x. If the minimum separation obeys Δ(T) ≥ 2N/f_c, then x is the unique solution to the L1 minimization. If x is real-valued, the minimum gap can be lowered to 1.26 N/f_c.

27/36 Figure: Success rates versus minimum separation factor (MSF), comparing constrained L1 (SDP), constrained L1-2, unconstrained L1-2, unconstrained CL1, and unconstrained L1/2.
Y. Lou, P. Yin and J. Xin, J. Sci. Comput., 2016

28/36 Low-rank recovery
Replacing the nuclear norm with the truncated L1-L2 of the singular values.
Figure: Success rates versus rank, comparing FPC, FPCA, LMaFit, IRucLq-M, and ℓt,1-2.
T. Ma, Y. Lou, and T. Huang, SIAM J. Imaging Sci., to appear 2017

29/36 Image processing
We consider
    $J(u) = \|D_x u\|_1 + \|D_y u\|_1 - \alpha \left\| \sqrt{|D_x u|^2 + |D_y u|^2} \right\|_1$,
which is a weighted difference of the anisotropic and isotropic TV: J(u) = J_ani - α J_iso, where α ∈ [0, 1] is a weighting parameter. Gradient vectors (u_x, u_y) are mostly 1-sparse, and α accounts for the occurrence of non-sparse gradient vectors.
Y. Lou, T. Zeng, S. Osher and J. Xin, SIAM J. Imaging Sci., 2015
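A small numpy sketch of this objective (my own code; forward differences with a replicated boundary are my discretization choice, since the slide does not specify D_x and D_y):

```python
import numpy as np

def weighted_aniso_iso_tv(u, alpha):
    """J(u) = ||Dx u||_1 + ||Dy u||_1 - alpha * || sqrt(|Dx u|^2 + |Dy u|^2) ||_1."""
    dx = np.diff(u, axis=1, append=u[:, -1:])        # forward differences, replicated edge
    dy = np.diff(u, axis=0, append=u[-1:, :])
    j_ani = np.abs(dx).sum() + np.abs(dy).sum()
    j_iso = np.sqrt(dx ** 2 + dy ** 2).sum()
    return j_ani - alpha * j_iso

u = np.tri(64)                                       # a piecewise-constant test image
print(weighted_aniso_iso_tv(u, alpha=0.5))
```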

30/36 MRI reconstruction
Figure panels: original, k-space sampling mask, FBP (ER = 0.99), L1 (ER = 0.1), L1-L2 (ER ≈ 10^-8).

31/36 Block-matching
Figure: Illustration of constructing groups by block-matching (BM). For each w x w reference patch from an n1 x n2 image, we use block-matching to search for its best matched patches in terms of Euclidean distance, and then vectorize and combine those patches to form a group of size w^2 x n.

32/36 Block-matching inpainting (Barbara, 85% missing pixels)

    Method   PSNR    SSIM
    BM3D     26.71   0.8419
    SAIST    29.53   0.9147
    ours     30.65   0.9264

T. Ma, Y. Lou, T. Huang and X. Zhao, ICIP, to appear 2017

33/36 Logistic regression
Given a collection of training data x_i ∈ R^n with labels y_i ∈ {±1} for i = 1, 2, ..., m, we aim to find a hyperplane {x : w^T x + v = 0} by minimizing the objective
    $\min_{w \in \mathbb{R}^n, v \in \mathbb{R}} F(w, v) + \ell_{\mathrm{avg}}(w, v)$,
where the second term is the averaged logistic loss,
    $\ell_{\mathrm{avg}}(w, v) = \frac{1}{m} \sum_{i=1}^{m} \ln\left(1 + \exp\left(-y_i (w^T x_i + v)\right)\right)$.
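A sketch of this objective with an L1 - L2 penalty standing in for F(w, v) (my own code; the regularization weight and the use of np.logaddexp for numerical stability are my choices):

```python
import numpy as np

def averaged_logistic_loss(w, v, X, y):
    """(1/m) * sum_i ln(1 + exp(-y_i * (w^T x_i + v)))."""
    margins = y * (X @ w + v)
    return np.mean(np.logaddexp(0.0, -margins))

def objective(w, v, X, y, lam):
    return lam * (np.abs(w).sum() - np.linalg.norm(w)) + averaged_logistic_loss(w, v, X, y)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 18))                   # synthetic features and labels
y = np.sign(rng.standard_normal(200))
print(objective(np.zeros(18), 0.0, X, y, lam=0.1))
```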

34/36 Preliminary results
Real data of patients with inflammatory bowel disease (IBD) for the years 2011 and 2012. For each year, the data set contains 18 types of medical information, such as prescriptions, number of office visits, and whether the patient was hospitalized. We used the 2011 data to train our classifier and the 2012 data to validate its performance.

               L1       L2       L1-L2
    Recall     0.6494   0.6585   0.6829
    Precision  0.0883   0.0882   0.0912
    F-Score    0.0573   0.0581   0.0622
    AUC        0.7342   0.7321   0.7491

UCLA REU project in 2015, followed by Q. Jin and Y. Lou for a journal submission

35/36
1. L1-L2 is always better than L1, and is better than Lp for highly coherent matrices.
2. The proximal operator can accelerate the minimization, but it tends to find a suboptimal solution.
3. In general, nonconvex methods have better empirical performance than convex ones, but they lack provable guarantees.

36/36 Thank you!