Homotopy methods based on the ℓ0 norm for the compressed sensing problem


Homotopy methods based on the ℓ0 norm for the compressed sensing problem

Wenxing Zhu, Zhengshan Dong
Center for Discrete Mathematics and Theoretical Computer Science, Fuzhou University, Fuzhou 350108, China

Abstract

In this paper, two homotopy methods, which combine the advantage of the homotopy technique with the effectiveness of the iterative hard thresholding method, are presented for solving the compressed sensing problem. Under some mild assumptions, we prove that the limits of the sequences generated by the proposed homotopy methods are feasible solutions of the problem, and that under some conditions they are local minimizers of the problem. The proposed methods overcome the difficulty that the iterative hard thresholding method has with the choice of the regularization parameter by tracing solutions of the sparse problem along a homotopy path. Moreover, to improve the solution quality of the two methods, we modify them and give two empirical algorithms. Numerical experiments demonstrate the effectiveness of the two proposed algorithms in accurately and efficiently generating sparse solutions of the compressed sensing problem.

Key words: Compressed sensing; sparse optimization; homotopy method; iterative hard thresholding method; proximal gradient method.

1 Introduction

In this paper, we present two homotopy methods based on the ℓ0 norm to approximately solve the following compressed sensing (CS) problem,

    min_x ‖x‖_0  s.t.  Ax = y,                                        (1)

where x ∈ R^n is the vector of unknowns, A ∈ R^{m×n} (m < n; we assume that A has full row rank) and y ∈ R^m are the problem data, and ‖x‖_0 is the number of nonzero components of x.

Over the last decade, guided by the principle of using the simplest representation to explain a given problem or phenomenon, the sparse model has been pursued in signal processing and many other fields, such as denoising, linear regression, inverse problems, model selection, machine learning, image processing and so on. Although finding an optimal solution of problem (1) is NP-hard [22], a large number of methods exist to approximately solve the problem. These methods can be categorized into four classes: (i) combinatorial methods, e.g., Matching Pursuit (MP) [21], Orthogonal Matching Pursuit (OMP) [11], Least Angle Regression (LARS) [13]; (ii) ℓ1-norm regularization methods, e.g., Gradient Projection (GP) [14], Iterative Shrinkage-Thresholding (IST) [10], Accelerated Proximal Gradient (APG) [1, 23], the Iterative Reweighted method [7], the Alternating Direction Method (ADM) [27], and homotopy methods [20, 25]; (iii) ℓp-norm (0 < p < 1) regularization methods [8, 9, 16]; (iv) ℓ0-norm regularization methods, e.g., the Penalty Decomposition (PD) method [19], Iterative Hard Thresholding (IHT) methods [3, 4, 18, 24, 25], and so on.

To overcome the difficulty of solving the sparse problem (1), and because the ℓ1 norm can promote sparsity [6], the ℓ1-norm regularization methods replace the ℓ0 norm by the ℓ1 norm,

    min_x ‖x‖_1  s.t.  Ax = y.                                        (2)

Regularized by a parameter λ, problem (2) can be written as the following popular ℓ1-regularized least-squares (ℓ1-LS) problem,

    min_x (1/2)‖Ax − y‖² + λ‖x‖_1,                                    (3)

where ‖x‖_1 = Σ_{i=1}^n |x_i|. Problem (3) can be solved by Nesterov's first-order methods [23], i.e., the primal gradient (PG) method, the dual gradient (DG) method, and the accelerated dual gradient (AC) method. The PG and DG methods have convergence rate O(1/k), where k is the number of iteration steps, and the AC method has a faster convergence rate O(1/k²). Some similar PG methods can be found in [25, 26].

Some methods solve the sparse problem (1) based on the ℓ0 norm directly, such as the Iterative Hard Thresholding (IHT) methods [3, 4, 18, 24, 25] and the Penalty Decomposition (PD) method [19], and they provide much better sparse solutions of the sparsity problem. When regularized directly by the ℓ0 norm, they solve the following ℓ0-regularized least-squares (ℓ0-LS) problem,

    min_x (1/2)‖Ax − y‖² + λ‖x‖_0.                                    (4)

Recently, Blumensath and Davies [3, 4] presented two IHT methods to solve the ℓ0-LS problem (4) and the s-sparse model, respectively. Experimental results show that the IHT methods can be used to improve the results generated by other methods. Lu [18] extended the IHT methods to the ℓ0-regularized convex cone programming problem and showed that the IHT methods converge to a local minimizer. He also presented the iteration complexity of the IHT methods for finding an ε-optimal solution. Moreover, the properties of the global/local minimizers of problem (4) have been studied in [24].

The above-mentioned methods (whether for the ℓ1-norm regularized problem or for the ℓ0-norm regularized problem) use a fixed value of the regularization parameter λ, although suitable values of the parameter can generate better solutions. Unfortunately, there is no rule for selecting a suitable value of the regularization parameter, so finding a good regularization parameter is still a challenging task [2]. Homotopy methods can overcome this difficulty: they efficiently calculate and trace the solutions of the sparse problem along a continuation path. Recently, Xiao and Zhang [26] proposed a homotopy method (the PGH method) for the ℓ1-norm regularized problem (3) that finds an ε-optimal solution quickly and efficiently, with overall iteration complexity O(log(1/ε)). Later, Lin and Xiao [17] proposed an adaptively accelerated version of this method together with a complexity analysis.

Following this line, two homotopy methods (called the HIHT and AHIHT methods), which combine the advantage of the homotopy technique with the effectiveness of the IHT method, are presented in this paper for solving the compressed sensing problem (1) directly. Under some mild assumptions, convergence of the proposed methods is proved. Compared with the convergence analysis of the IHT method in [3], the assumption that ‖A‖ < 1 is not needed. To improve the solution quality of the two methods, we modify them and give two empirical algorithms. Experimental results show that the modified ℓ0-norm based homotopy methods (HIHT/AHIHT) are more efficient and effective than the ℓ1-norm based homotopy methods (PGH/ADGH) [26]. Moreover, although the HIHT/AHIHT methods cannot handle a very small regularization parameter λ well, they can use a smaller λ than the PGH method; in fact, the regularization parameter λ of the PGH method cannot be too small [26].

The rest of this paper is organized as follows. Section 2 contains the preliminaries, in which some notation and the IHT method for ℓ0-norm regularized convex programming are described. Section 3 presents two homotopy methods for the ℓ0-norm regularized problem (4), which combine the homotopy technique with the IHT method. Convergence analyses are given, and we prove that the limits of the sequences generated by the proposed homotopy methods are feasible solutions of problem (1) and, under some conditions, local minimizers of the problem. In Section 4, we modify the two methods and give two empirical algorithms. Numerical experiments are reported in Section 5, and conclusions are drawn in Section 6.

2 Preliminaries

2.1 Notation

In this subsection, some notation is introduced to simplify the presentation. Unless stated otherwise, all norms used are the Euclidean norm, denoted by ‖·‖. The transpose of a vector x ∈ R^n is denoted by x^H. Given an index set I ⊆ {1, ..., n}, x_I denotes the subvector formed by the components of x indexed by I. E denotes the identity matrix.

The index set of nonzero components of a vector x is denoted by S(x) = {i : x_i ≠ 0} (called the support set). Let S̄(x) be the complement of S(x), i.e., S̄(x) = {1, 2, ..., n} \ S(x) = {i : x_i = 0}. Denote the size of S(x) by |S(x)|. Given an index set I ⊆ {1, ..., n}, the set of vectors whose components indexed by I are zero is denoted by B_I = {x ∈ R^n : x_I = 0}.

For ease of statement, problem (4) is rewritten as

    min_x ϕ_λ(x) = f(x) + λ‖x‖_0,                                     (5)

where f(x) = (1/2)‖Ax − y‖² is a differentiable convex function whose gradient is Lipschitz continuous (denote its Lipschitz constant by L_f).

2.2 The IHT method

In this subsection, the IHT method [18] for ℓ0-regularized convex programming and some of its main results are described. To solve problem (5), the main idea of IHT is to use the proximal point technique at each iteration: the function f(x) is approximated by a quadratic function at the current solution x⁰, while the second term in problem (5) is kept unchanged. This yields the following function:

    p_{L,λ}(x⁰, x) = f(x⁰) + ⟨∇f(x⁰), x − x⁰⟩ + (L/2)‖x − x⁰‖² + λ‖x‖_0,   (6)

where the quadratic term is the proximal term and L > 0 is a constant, which should essentially be an upper bound on the Lipschitz constant of ∇f(x), i.e., L ≥ L_f. The minimizer of p_{L,λ}(x⁰, x) is the same as that of the following problem,

    min_x [ ‖x − (x⁰ − (1/L)∇f(x⁰))‖² + (2λ/L)‖x‖_0 ].

Hence the minimizer of p_{L,λ}(x⁰, x) can be obtained by the hard thresholding operator [18, 25]. If we denote

    T_L(x⁰) = argmin_x p_{L,λ}(x⁰, x),                                (7)

then a closed-form solution of T_L(x⁰) is given by the following Lemma 2.1.

Lemma 2.1. [18, 25] A solution T_L(x⁰) of problem (7) is given componentwise by

    [T_L(x⁰)]_i = [s_L(x⁰)]_i,          if [s_L(x⁰)]_i² > 2λ/L;
    [T_L(x⁰)]_i = 0,                    if [s_L(x⁰)]_i² < 2λ/L;
    [T_L(x⁰)]_i = 0 or [s_L(x⁰)]_i,     otherwise,

where s_L(x) = x − (1/L)∇f(x), and [·]_i denotes the i-th component of a vector.
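To make the operator concrete, here is a minimal NumPy sketch of T_L for f(x) = (1/2)‖Ax − y‖²; the helper names grad_f and hard_threshold are ours, not the paper's, and ties [s_L(x⁰)]_i² = 2λ/L are set to zero, matching the convention adopted in the remark that follows.

```python
import numpy as np

def grad_f(A, y, x):
    # Gradient of f(x) = 0.5 * ||Ax - y||^2.
    return A.T @ (A @ x - y)

def hard_threshold(A, y, x0, L, lam):
    """Hard thresholding operator T_L(x0) of Lemma 2.1.

    Components of s_L(x0) = x0 - (1/L) * grad_f(x0) whose square does not
    exceed 2*lam/L are set to zero (ties included)."""
    s = x0 - grad_f(A, y, x0) / L
    return np.where(s ** 2 > 2.0 * lam / L, s, 0.0)
```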

Remark: To obtain a unique solution, in the following algorithms we set [T_L(x⁰)]_i = 0 when [s_L(x⁰)]_i² = 2λ/L.

The core of the basic IHT method is to repeatedly solve subproblem (7) until some termination condition is reached. However, the basic IHT method uses a fixed L throughout all iterations, which may miss some local information of f(x) at the current step. Moreover, an upper bound on L_f may be unknown or not easily calculated. To improve its practical performance, a suitable value of L is obtained dynamically by iteratively increasing its value. The framework viIHT, a variant of the IHT method, is presented as follows.

Algorithm 1: [18] {x*, L*} = viIHT(L⁰, λ, x⁰)
Input: L⁰, x⁰, λ, L_min, L_max;  // L⁰ ∈ [L_min, L_max]
Output: {x*, L*};
1: initialization: k ← 0, γ > 1, η > 0;
2: repeat
3:   x^{k+1} ← T_{L_k}(x^k);
4:   while ϕ_λ(x^k) − ϕ_λ(x^{k+1}) < (η/2)‖x^k − x^{k+1}‖² do
5:     L_k ← min{γ L_k, L_max};
6:     x^{k+1} ← T_{L_k}(x^k);
7:   end while
8:   L_{k+1} ← L_k;
9:   k ← k + 1;
10: until some termination condition is reached
11: x* ← x^k;
12: L* ← L_k.

Remark: (1) Lu [18] presented a strategy for the initial value of L⁰ in Algorithm 1,

    L⁰ = min{ max{ Δf^H Δx / ‖Δx‖², L_min }, L_max },                  (8)

where Δx = x^k − x^{k−1}, Δf = ∇f(x^k) − ∇f(x^{k−1}), and [L_min, L_max] is the interval in which L takes its values.

(2) For each outer loop, the number of iterations between lines 4-7 of Algorithm 1 is finite; in other words, the line search terminates in a finite number of steps. In fact, for an outer iteration of Algorithm 1, if L_k > L_f, then one can show that ([18])

    ϕ_λ(x^k) − ϕ_λ(x^{k+1}) ≥ ((L_k − L_f)/2)‖x^{k+1} − x^k‖²,         (9)

which implies that the inner line search stops once L_k ≥ L_f + η. Thus L_k/γ ≤ L_f + η, that is, L_k ≤ γ(L_f + η). Let n̂_k be the number of inner loops at the k-th outer iteration. Then

    L_min γ^{n̂_k − 1} ≤ L⁰ γ^{n̂_k − 1} ≤ L_k ≤ γ(L_f + η).

Therefore, n̂_k ≤ (log(L_f + η) − log(L_min)) / log γ + 2.
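The following NumPy sketch illustrates the viIHT framework, reusing the hypothetical hard_threshold helper from the previous sketch; the objective function phi, the iteration cap, and the stopping test on successive iterates are illustrative choices of ours, not part of Algorithm 1 as stated.

```python
def phi(A, y, x, lam):
    # Objective phi_lambda(x) = 0.5*||Ax - y||^2 + lam*||x||_0.
    return 0.5 * np.linalg.norm(A @ x - y) ** 2 + lam * np.count_nonzero(x)

def viIHT(A, y, L0, lam, x0, L_min=1e-8, L_max=1e12,
          gamma=2.0, eta=1.0, tol=1e-5, max_iter=1000):
    """Variant IHT (Algorithm 1): hard thresholding steps with a line search
    that increases L until the sufficient-decrease test of step 4 holds."""
    x, L = x0.copy(), min(max(L0, L_min), L_max)
    for _ in range(max_iter):
        x_new = hard_threshold(A, y, x, L, lam)
        # increase L until phi decreases by at least (eta/2)*||x - x_new||^2
        while phi(A, y, x, lam) - phi(A, y, x_new, lam) \
                < 0.5 * eta * np.linalg.norm(x - x_new) ** 2:
            if L >= L_max:
                break
            L = min(gamma * L, L_max)
            x_new = hard_threshold(A, y, x, L, lam)
        done = np.linalg.norm(x_new - x, np.inf) <= tol  # simple stopping test
        x = x_new
        if done:
            break
    return x, L
```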

Assumption. In the sequel, we always assume that L_k ≥ L_f + η (η > 0), which is reasonable since L_k is increased by a factor of γ in step 5 of Algorithm 1.

Before proceeding, the following lemmas are given, which will be used later.

Lemma 2.2. Let {x^k} be generated by Algorithm 1. Then p_{L_k,λ}(x^k, x^{k+1}) is nonincreasing.

Proof. From Algorithm 1, one can get

    p_{L_k,λ}(x^k, x^{k+1}) = f(x^k) + ⟨∇f(x^k), x^{k+1} − x^k⟩ + (L_k/2)‖x^{k+1} − x^k‖² + λ‖x^{k+1}‖_0
        ≤ f(x^k) + λ‖x^k‖_0
        ≤ f(x^{k−1}) + ⟨∇f(x^{k−1}), x^k − x^{k−1}⟩ + (L_{k−1}/2)‖x^k − x^{k−1}‖² + λ‖x^k‖_0
        = p_{L_{k−1},λ}(x^{k−1}, x^k),

where the second inequality follows from the Lipschitz continuity of ∇f and the assumption that L_{k−1} > L_f. Thus p_{L_k,λ}(x^k, x^{k+1}) is nonincreasing.

Lemma 2.3. [18] Let {x^k} be generated by Algorithm 1. Then S(x^k) does not change if k is large enough. Moreover, for all k ≥ 0, if x^{k+1}_j ≠ 0, then |x^{k+1}_j| ≥ √(2λ/L_k).

Next we give a definition of a local minimizer for problem (1). If we used the definition of local minimizer for continuous optimization problems, it would be easy to verify that every feasible solution of problem (1) is a local minimizer, which is not useful for solving the problem. Hence we have to take the characteristics of the problem into account to give a useful definition of local minimizer.

Definition 2.1. If Ax* = y, and the columns of A corresponding to the nonzero components of x* are linearly independent, then x* is called a local minimizer of problem (1).

Definition 2.1 is reasonable, since we have the following necessary and sufficient condition.

Theorem 2.1. x* is a local minimizer of problem (1) if and only if Ax* = y and, for every h ∈ S(x*), the system A_{S(x*)\{h}} x_{S(x*)\{h}} = y has no solution.

Proof. Suppose that x* is a local minimizer of problem (1). Without loss of generality, assume that S(x*) = {1, 2, ..., p}. Since x* is feasible, we have

    A_1 x*_1 + A_2 x*_2 + ... + A_p x*_p = y,                          (10)

where A_i is the i-th column of A. Now take the p-th component of x* out of the support set and solve the equation

    A_1 x_1 + A_2 x_2 + ... + A_{p−1} x_{p−1} = y.

If it has a solution, let it be x̂. Then

    A_1 x̂_1 + A_2 x̂_2 + ... + A_{p−1} x̂_{p−1} = y.                   (11)

By (10)-(11), we have

    A_1(x*_1 − x̂_1) + A_2(x*_2 − x̂_2) + ... + A_{p−1}(x*_{p−1} − x̂_{p−1}) + A_p x*_p = 0.

Thus, if A_1, A_2, ..., A_p are linearly independent, then x*_p = 0, which contradicts the assumption that x*_p ≠ 0, and hence the conclusion holds. The converse direction can be proved in a similar manner and is omitted here.

Hence we have given a discrete version of the definition of local minimizer for problem (1). To prove that the solutions produced by our algorithms are local minimizers in the sense of Definition 2.1, we need the definition of spark, which was first given in [12].

Definition 2.2. [12] The spark of a given matrix A, denoted by Spark(A), is the smallest number of columns of A that are linearly dependent.
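As a concrete illustration of Definitions 2.1 and 2.2, the sketch below checks the local-minimizer condition (feasibility plus linear independence of the support columns) and computes the spark by brute force for very small matrices; the function names and tolerances are our own assumptions, not the paper's.

```python
from itertools import combinations

def is_local_minimizer(A, y, x, tol=1e-8):
    # Definition 2.1: Ax = y and the columns of A on the support of x
    # are linearly independent.
    if np.linalg.norm(A @ x - y) > tol:
        return False
    support = np.flatnonzero(np.abs(x) > tol)
    return np.linalg.matrix_rank(A[:, support]) == len(support)

def spark_bruteforce(A):
    # Definition 2.2: smallest number of linearly dependent columns.
    # Exponential cost, so only usable for tiny matrices.
    m, n = A.shape
    for k in range(1, n + 1):
        for cols in combinations(range(n), k):
            if np.linalg.matrix_rank(A[:, list(cols)]) < k:
                return k
    return n + 1  # every subset of columns is linearly independent
```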

3 Homotopy IHT methods

The sparse optimization problem (5) is regularized by the ℓ0 norm, balanced by a regularization parameter λ. To solve the sparse optimization problem, the iterative hard thresholding methods use a fixed value of the regularization parameter. Obviously, a suitable value of the parameter λ may produce better solutions, but unfortunately, finding a good value of the regularization parameter is a challenging task. Hence a homotopy method is given here to efficiently calculate and trace the solutions of the sparse optimization problem along a homotopy path.

The main idea of the homotopy method based on the ℓ0 norm is to set a large initial value of the regularization parameter λ and gradually decrease it with some strategy. For every fixed value of the regularization parameter λ, the viIHT method is used to find an approximately optimal solution of problem (5), which is then used as the initial solution for the next iteration. This kind of initial solution strategy is called warm-starting. Usually, the next loop with warm-starting requires fewer iterations than the current loop, and dramatically fewer than the number of iterations needed when initialized at zero [25].

The initial value of the regularization parameter is set as

    λ_0 = c‖y‖²,  0 < c < 1,

where y is as in equation (1), since if λ > ‖y‖², then ϕ_λ(x) has a strict global minimum at the origin [24]. In fact, for any x ≠ 0 we have ‖x‖_0 ≥ 1, and hence ϕ_λ(0) = (1/2)‖y‖² < (1/2)‖Ax − y‖² + λ‖x‖_0 = ϕ_λ(x). Moreover, the decrease of λ is always geometric. That is, for an initial value λ_0 and a parameter ρ ∈ (0, 1), set λ_{k+1} = ρλ_k for k = 0, 1, 2, ..., until some termination condition is reached.

The upper bound L_k on the Lipschitz constant L_f is obtained by a line search technique. The line search begins with an initial value L^0_k, given for example by equation (8), and then increases L_k by a factor of γ > 1 until the descent condition tested in step 4 of Algorithm 1 is satisfied. The framework of the proposed homotopy iterative hard thresholding (HIHT) algorithm is described in Algorithm 2.

Algorithm 2: {x*, L*} = HIHT(L⁰, λ_0, x⁰)
Input: L⁰, λ_0, x⁰, L_min, L_max;  // L⁰ ∈ [L_min, L_max]
Output: {x*, L*};
1: initialization: k ← 0, ρ ∈ (0, 1);
2: repeat
3:   i ← 0;
4:   x^{k,0} ← x^k;
5:   L_{k,0} ← L_k;
6:   repeat
7:     x^{k,i+1} ← T_{L_{k,i}}(x^{k,i});
8:     while ϕ_{λ_k}(x^{k,i}) − ϕ_{λ_k}(x^{k,i+1}) < (η/2)‖x^{k,i} − x^{k,i+1}‖² do
9:       L_{k,i} ← min{γ L_{k,i}, L_max};
10:      x^{k,i+1} ← T_{L_{k,i}}(x^{k,i});
11:    end while
12:    L_{k,i+1} ← L_{k,i};
13:    i ← i + 1;
14:  until S(x^{k,i}) does not change when increasing i
15:  x^{k+1} ← x^{k,i};
16:  L_{k+1} ← L_{k,i};
17:  λ_{k+1} ← ρλ_k;
18:  k ← k + 1;
19: until some termination condition is reached
20: x* ← x^k;
21: L* ← L_k.

Remark. According to Lemma 2.3, the iteration between steps 6-14 terminates in a finite number of steps.
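A compact NumPy sketch of the HIHT framework is given below, built on the hypothetical hard_threshold and phi helpers of Section 2. The inner stopping test on successive iterates and the outer stop at a lower bound lam_t are our simplifications of steps 14 and 19, following the practical modifications described later in Section 4; all default parameter values are illustrative.

```python
def HIHT(A, y, L0, lam0, x0, rho=0.7, lam_t=1e-2, gamma=2.0, eta=1.0,
         eps0=1e-1, L_max=1e12, max_inner=10_000):
    """Homotopy IHT (Algorithm 2): for each lam_k, run line-searched IHT
    steps until the iterates (and hence the support) stabilize, then
    shrink lam geometrically and warm-start the next stage."""
    x, L, lam = x0.copy(), L0, lam0
    while lam > lam_t:                       # homotopy loop over lambda
        for _ in range(max_inner):           # inner loop, steps 6-14
            x_new = hard_threshold(A, y, x, L, lam)
            while phi(A, y, x, lam) - phi(A, y, x_new, lam) \
                    < 0.5 * eta * np.linalg.norm(x - x_new) ** 2:
                if L >= L_max:
                    break
                L = min(gamma * L, L_max)
                x_new = hard_threshold(A, y, x, L, lam)
            stable = np.linalg.norm(x_new - x) <= eps0  # proxy for step 14
            x = x_new
            if stable:
                break
        lam *= rho                           # step 17: decrease lambda, warm start
    return x, L
```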

The convergence analysis of Algorithm 2 is presented in the following theorem. The proof is based on the convergence analysis of IHT in [5, 18].

Theorem 3.1. Let {x^{k,i}} be the sequence generated by steps 6-14 of Algorithm 2, and let {x^k} be the sequence generated by step 15 of Algorithm 2. We have:

(i) {x^{k,i}} and {ϕ_{λ_k}(x^{k,i})} converge.

(ii) Either S(x^{k,i}) changes only in a finite number of iterations, or for all h ∈ Γ^{k,i}, x^{k,i}_h or x^{k,i+1}_h is an arbitrarily small amount, i.e., for any ε > 0 there exists K > 0 such that when k ≥ K, |x^{k,i}_h| < ε or |x^{k,i+1}_h| < ε, where Γ^{k,i} is the set of indices of components at which zero components of x^{k,i} are set to nonzero, or nonzero components are set to zero.

(iii) If A has full row rank, then any limit point of {x^{k,i}} is feasible for problem (1), i.e., if x* = lim_{k→∞} x^{k,i}, then Ax* = y.

(iv) Suppose that S(x^{k,i}) does not change after a sufficiently large k. Let ϕ* = lim_{k→∞} ϕ_{λ_k}(x^k). Then the number of changes of S(x^k) is at most

    T = 2(ϕ_{λ_0}(x⁰) − ϕ*) / (ηδ² + 2(1 − ρ)λ̄),

where λ̄ denotes the value of λ_k from which S(x^k) remains unchanged, δ = min_j δ_j, and δ_j = max{ √(2λ_{m_j}/L_{m_j+1}), √(2λ_{m_j−1}/L_{m_j}) }.

Proof. (i) First, for each fixed λ_k, by Lemma 2.3, steps 6-14 terminate in a finite number of iterations. Furthermore, by Lemma 2.2 and the choice of x^{k,i} in Algorithm 2, one can obtain that

    ϕ_{λ_k}(x^k) = f(x^k) + λ_k‖x^k‖_0
        ≥ f(x^k) + ⟨∇f(x^k), x^{k,1} − x^k⟩ + (L_{k,0}/2)‖x^{k,1} − x^k‖² + λ_k‖x^{k,1}‖_0
        = f(x^{k,0}) + ⟨∇f(x^{k,0}), x^{k,1} − x^{k,0}⟩ + (L_{k,0}/2)‖x^{k,1} − x^{k,0}‖² + λ_k‖x^{k,1}‖_0
        ≥ ...
        ≥ f(x^{k,n_k−1}) + ⟨∇f(x^{k,n_k−1}), x^{k+1} − x^{k,n_k−1}⟩ + (L_{k,n_k−1}/2)‖x^{k+1} − x^{k,n_k−1}‖² + λ_k‖x^{k+1}‖_0,   (12)

where n_k is the number of iterations between steps 6-14 at the k-th outer iteration. Then for each λ_k, when i ≥ n_k,

    S(x^{k,i}) = S(x^{k,n_k})  and  S(x^{k,n_k}) = S(x^{k,n_k−1}).      (13)

On the other hand, since ∇f is Lipschitz continuous, one can observe that

    ϕ_{λ_{k+1}}(x^{k+1}) = f(x^{k+1}) + λ_{k+1}‖x^{k+1}‖_0
        ≤ f(x^{k,n_k−1}) + ⟨∇f(x^{k,n_k−1}), x^{k+1} − x^{k,n_k−1}⟩ + (L_f/2)‖x^{k+1} − x^{k,n_k−1}‖² + λ_{k+1}‖x^{k+1}‖_0,

which together with (12), x^{k+1} = x^{k,n_k} and L_{k,n_k−1} ≥ L_f + η implies that

    ϕ_{λ_k}(x^k) − ϕ_{λ_{k+1}}(x^{k+1}) ≥ ((L_{k,n_k−1} − L_f)/2)‖x^{k+1} − x^{k,n_k−1}‖² + (λ_k − λ_{k+1})‖x^{k+1}‖_0
        ≥ (η/2)‖x^{k+1} − x^{k,n_k−1}‖² + λ_k(1 − ρ)‖x^{k+1}‖_0          (14)
        ≥ λ_k(1 − ρ)‖x^{k+1}‖_0 ≥ 0.

Thus ϕ_{λ_k}(x^k) is nonincreasing. Furthermore, ϕ_λ(x) is bounded below for λ ≥ 0. Therefore ϕ_{λ_k}(x^k) converges as k → ∞.

By the proof of Theorem 3.4 in [18] (i.e., for every fixed λ_k, ϕ_{λ_k}(x^{k,i}) is nonincreasing), the convergence of ϕ_{λ_k}(x^k), and the fact that λ_k is nonincreasing, one can observe that the sequence

    ϕ_{λ_0}(x⁰) = ϕ_{λ_0}(x^{0,0}), ϕ_{λ_0}(x^{0,1}), ..., ϕ_{λ_0}(x^{0,n_0}) = ϕ_{λ_0}(x^1) ≥ ϕ_{λ_1}(x^1) = ϕ_{λ_1}(x^{1,0}), ϕ_{λ_1}(x^{1,1}), ..., ϕ_{λ_1}(x^{1,n_1}) = ϕ_{λ_1}(x^2) ≥ ... ≥ ϕ_{λ_k}(x^k) = ϕ_{λ_k}(x^{k,0}), ϕ_{λ_k}(x^{k,1}), ..., ϕ_{λ_k}(x^{k,n_k}) = ϕ_{λ_k}(x^{k+1}) ≥ ...

is nonincreasing and converges.

Next, since ϕ_{λ_k}(x^k) − ϕ_{λ_{k+1}}(x^{k+1}) → 0 as k → ∞, the second inequality of (14) gives ‖x^{k+1} − x^{k,n_k−1}‖ = ‖x^{k,n_k} − x^{k,n_k−1}‖ → 0 as k → ∞. Similarly, ‖x^{k,i+1} − x^{k,i}‖ → 0 as k → ∞ for i = 0, 1, 2, ..., n_k − 1. Thus the sequence

    x⁰ = x^{0,0}, x^{0,1}, ..., x^{0,n_0} = x^1, x^{1,0}, x^{1,1}, ..., x^{1,n_1} = x^2, ..., x^k = x^{k,0}, x^{k,1}, ..., x^{k,n_k} = x^{k+1}, ...

converges, i.e., for any ε > 0 there exists K > 0 such that when k ≥ K, ‖x^{k,i} − x^{k,i+1}‖ < ε holds for all i ∈ {0, 1, 2, ..., n_k − 1}.

(ii) There are two possibilities for S(x^{k,i}): either S(x^{k,i}) does not change after a sufficiently large k, or S(x^{k,i}) changes infinitely often.

In the second case, since {x^{k,i}} converges, for any ε > 0 there exists K > 0 such that when k ≥ K, ‖x^{k,i} − x^{k,i+1}‖ < ε. Thus |x^{k,i}_h − x^{k,i+1}_h| < ε holds for all h ∈ {1, 2, ..., n}. Let Γ^{k,i} be the set of indices of components at which zero components of x^{k,i} are set to nonzero, or nonzero components are set to zero. Then for all h ∈ Γ^{k,i}, |x^{k,i}_h| < ε or |x^{k,i+1}_h| < ε.

(iii) By the hard thresholding operator (see Lemma 2.1), all components that are set to zero must satisfy

    |[x^{k,i} − (1/L_{k,i})∇f(x^{k,i})]_j| ≤ √(2λ_k/L_{k,i}).

Since λ_k → 0 as k → ∞, for any ε > 0 there exists K_1 such that when k ≥ K_1, √(2λ_k/L_{k,i}) < ε and

    |[x^{k,i} − (1/L_{k,i})∇f(x^{k,i})]_j| < ε.                         (15)

Furthermore, since x^{k,i+1}_j is set to 0 and {x^{k,i}} is convergent, for sufficiently large K_1 these components must satisfy

    |x^{k,i}_j| = |x^{k,i}_j − x^{k,i+1}_j| < ε,

which together with (15) implies that

    |∇_j f(x^{k,i})| ≤ L_{k,i}(|x^{k,i}_j| + ε) < 2L_{k,i} ε.

On the other hand, by Lemma 2.1, all components that are set to nonzero must satisfy

    x^{k,i}_j − x^{k,i+1}_j = (1/L_{k,i}) ∇_j f(x^{k,i}).

Since {x^{k,i}} is convergent, the above equality leads to

    |x^{k,i}_j − x^{k,i+1}_j| = (1/L_{k,i}) |∇_j f(x^{k,i})| < ε

for sufficiently large k. Hence, if k is large enough, then for all j ∈ {1, 2, ..., n}, |∇_j f(x^{k,i})| < 2L_{k,i} ε. Thus, letting ε → 0, we have ∇f(x^{k,i}) → 0, i.e., A^H(Ax^{k,i} − y) → 0, so A^H(Ax* − y) = 0. With the assumption that A has full row rank, one can obtain that Ax* − y = 0.

(iv) Suppose that S(x^k) changes only finitely often. Without loss of generality, suppose that S(x^k) changes only at k = m_1 + 1, ..., m_J + 1. In other words,

    S(x^{m_{j−1}+1}) = ... = S(x^{m_j}) ≠ S(x^{m_j+1}) = ... = S(x^{m_{j+1}}),  j = 1, 2, ..., J.   (16)

Let m_0 = 0. For any j ∈ {1, 2, ..., J}, there exists an index i such that x^{m_j+1}_i ≠ 0 and x^{m_j}_i = 0, or x^{m_j+1}_i = 0 and x^{m_j}_i ≠ 0. Then by Lemma 2.1,

    ‖x^{m_j+1} − x^{m_j}‖ ≥ max{ |x^{m_j+1}_i|, |x^{m_j}_i| } ≥ max{ √(2λ_{m_j}/L_{m_j+1}), √(2λ_{m_j−1}/L_{m_j}) }.

Let δ_j = max{ √(2λ_{m_j}/L_{m_j+1}), √(2λ_{m_j−1}/L_{m_j}) } and δ = min_{j∈{1,2,...,J}} δ_j. We have

    ‖x^{m_j+1} − x^{m_j}‖ ≥ δ_j,  j = 1, 2, ..., J,

which together with (14) implies that

    ϕ_{λ_{m_j}}(x^{m_j}) − ϕ_{λ_{m_j+1}}(x^{m_j+1}) ≥ (η/2)δ² + λ_{m_j}(1 − ρ)‖x^{m_j+1}‖_0.

Summing up the above inequalities, one can get

    ϕ_{λ_0}(x⁰) − ϕ* ≥ ϕ_{λ_{m_1}}(x^{m_1}) − ϕ_{λ_{m_J+1}}(x^{m_J+1})
        ≥ (η/2)δ² J + (1 − ρ)(λ_{m_1} + ... + λ_{m_J})
        ≥ (η/2)δ² J + (1 − ρ) λ̄ J,

where λ̄ = λ_{m_J}, and the first inequality follows from the fact that {ϕ_{λ_k}(x^k)} is nonincreasing (see (14)). Thus

    J ≤ 2(ϕ_{λ_0}(x⁰) − ϕ*) / (ηδ² + 2(1 − ρ)λ̄).

Remark. (i) Although the above theorem states that S(x^{k,i}) may change, the relevant components change only by an arbitrarily small amount. Moreover, the experiments in Section 5 empirically show that S(x^{k,i}) changes only in a finite number of steps.

(ii) Let {x^k} be the sequence generated by step 15 of Algorithm 2. By Theorem 3.1, the sequence {ϕ_{λ_k}(x^k)} is nonincreasing; in other words, Algorithm 2 is a descent algorithm.

Now we consider the following problem

    min_x φ_λ(x) = ‖x‖_0 + (1/λ) f(x),                                 (17)

which has the same solution set as problem (5). We use a method similar to that for problem (5) to solve problem (17), and consider

    q_{L,λ}(x⁰, x) = ‖x‖_0 + (1/λ)( f(x⁰) + ⟨∇f(x⁰), x − x⁰⟩ + (L/2)‖x − x⁰‖² ),   (18)

which has the same minimizer as problem (6) and can be solved by the viIHT method. Moreover, it is easy to verify that viIHT applied to the two problems is equivalent. For ease of statement, let

    h_L(x⁰, x) = f(x⁰) + ⟨∇f(x⁰), x − x⁰⟩ + (L/2)‖x − x⁰‖².             (19)

Then we can obtain a bound on the number of nonzero components of the limit of the sequence produced by Algorithm 2 as follows, and prove that the limit is a local minimizer of problem (1).

Theorem 3.2. Let {x^{k,i}} be the sequence generated by steps 6-14 of Algorithm 2 and x* = lim_{k→∞} x^{k,i}. If for all k, 1 > ρ ≥ h_{L_f}(x^{k,n_k−1}, x^{k+1}) / h_{L_{k,n_k−1}}(x^{k,n_k−1}, x^{k+1}), then

(i) ‖x*‖_0 ≤ ‖x⁰‖_0 + (1/λ_0) f(x⁰) − C, where C = lim_{k→∞} f(x^k)/λ_k. In particular, if x⁰ = 0, then ‖x*‖_0 ≤ ‖y‖²/(2λ_0) − C;

(ii) |S(x^k)| is constant when k is large enough;

(iii) if x⁰ = 0 and Spark(A) > ‖y‖²/(2λ_0) − C, then x* is a local minimizer of problem (1).

Proof. (i) Similarly to the proof of Theorem 3.1,

    φ_{λ_k}(x^k) = (1/λ_k) f(x^k) + ‖x^k‖_0
        ≥ (1/λ_k)( f(x^k) + ⟨∇f(x^k), x^{k,1} − x^k⟩ + (L_{k,0}/2)‖x^{k,1} − x^k‖² ) + ‖x^{k,1}‖_0
        = (1/λ_k)( f(x^{k,0}) + ⟨∇f(x^{k,0}), x^{k,1} − x^{k,0}⟩ + (L_{k,0}/2)‖x^{k,1} − x^{k,0}‖² ) + ‖x^{k,1}‖_0
        ≥ ...
        ≥ (1/λ_k)( f(x^{k,n_k−1}) + ⟨∇f(x^{k,n_k−1}), x^{k+1} − x^{k,n_k−1}⟩ + (L_{k,n_k−1}/2)‖x^{k+1} − x^{k,n_k−1}‖² ) + ‖x^{k+1}‖_0.   (20)

On the other hand, since ∇f is Lipschitz continuous, one can observe that

    φ_{λ_{k+1}}(x^{k+1}) = (1/λ_{k+1}) f(x^{k+1}) + ‖x^{k+1}‖_0
        ≤ (1/λ_{k+1})( f(x^{k,n_k−1}) + ⟨∇f(x^{k,n_k−1}), x^{k+1} − x^{k,n_k−1}⟩ + (L_f/2)‖x^{k+1} − x^{k,n_k−1}‖² ) + ‖x^{k+1}‖_0,

which together with (20) and x^{k+1} = x^{k,n_k} implies that

    φ_{λ_k}(x^k) − φ_{λ_{k+1}}(x^{k+1}) ≥ (1/λ_k) h_{L_{k,n_k−1}}(x^{k,n_k−1}, x^{k+1}) − (1/λ_{k+1}) h_{L_f}(x^{k,n_k−1}, x^{k+1}).   (21)

Hence, by the assumption that 1 > ρ = λ_{k+1}/λ_k ≥ h_{L_f}(x^{k,n_k−1}, x^{k+1}) / h_{L_{k,n_k−1}}(x^{k,n_k−1}, x^{k+1}), and the fact that φ_λ(x) is bounded below, we know that {φ_{λ_k}(x^k)} is nonincreasing and converges. Then one can get that

    ‖x⁰‖_0 + (1/λ_0) f(x⁰) ≥ ‖x^k‖_0 + (1/λ_k) f(x^k).                 (22)

Let k → ∞, and let x* be the limit of {x^k}. Then (22) becomes

    ‖x⁰‖_0 + (1/λ_0) f(x⁰) ≥ ‖x*‖_0 + C,

where C = lim_{k→∞} (1/λ_k) f(x^k). In particular, if x⁰ = 0, then we get ‖x*‖_0 ≤ ‖y‖²/(2λ_0) − C.

(ii) Since {φ_{λ_k}(x^k)} converges and the first term of ‖x^k‖_0 + (1/λ_k) f(x^k) takes discrete values, it is evident that ‖x^k‖_0 is unchanged when k is large enough.

(iii) Furthermore, if Spark(A) > ‖y‖²/(2λ_0) − C, then by (i), starting from x⁰ = 0, the limit x* of the sequence {x^k} satisfies ‖x*‖_0 < Spark(A), so the columns of A corresponding to the nonzero components of x* are linearly independent; together with the feasibility of x*, this shows that x* is a local minimizer of problem (1) according to Definitions 2.1 and 2.2.

Remark. (i) Without loss of generality, suppose that the observed data satisfy y ≠ 0. Then it is easy to see that φ_λ(x) > 0. Furthermore, by the fact that h_{L_f}(x^{k,n_k−1}, x^{k+1}) ≥ f(x^{k+1}), if x^{k+1} is not a feasible solution, then h_{L_f}(x^{k,n_k−1}, x^{k+1}) ≥ f(x^{k+1}) > 0. In this case, if L_{k,n_k−1} > L_f, then h_{L_{k,n_k−1}}(x^{k,n_k−1}, x^{k+1}) > h_{L_f}(x^{k,n_k−1}, x^{k+1}) > 0 and h_{L_f}(x^{k,n_k−1}, x^{k+1}) / h_{L_{k,n_k−1}}(x^{k,n_k−1}, x^{k+1}) < 1, which means that we can find values of ρ that satisfy

    1 > ρ = λ_{k+1}/λ_k ≥ h_{L_f}(x^{k,n_k−1}, x^{k+1}) / h_{L_{k,n_k−1}}(x^{k,n_k−1}, x^{k+1}).

(ii) Statements (i) and (iii) of Theorem 3.2 indicate that x⁰ = 0 is a good initial solution for Algorithm 2, since it produces a solution of problem (1) with theoretical guarantees.

For each fixed λ, Algorithm 2 iterates between steps 6-14 to obtain a solution of problem (5).

In the following algorithm, we accelerate Algorithm 2 by calling just one outer loop of steps 2-10 of Algorithm 1 for every regularization parameter λ. The main framework is given as follows.

Algorithm 3: {x*, L*} = AHIHT(L⁰, λ_0, x⁰)
Input: L⁰, λ_0, x⁰, L_min, L_max;
Output: {x*, L*};
1: initialization: k ← 0, ρ ∈ (0, 1);
2: repeat
3:   x^{k+1} ← T_{L_k}(x^k);
4:   while ϕ_{λ_k}(x^k) − ϕ_{λ_k}(x^{k+1}) < (η/2)‖x^k − x^{k+1}‖² do
5:     L_k ← min{γ L_k, L_max};
6:     x^{k+1} ← T_{L_k}(x^k);
7:   end while
8:   L_{k+1} ← L_k;
9:   λ_{k+1} ← ρλ_k;
10:  k ← k + 1;
11: until some termination condition is reached
12: x* ← x^k;
13: L* ← L_k.

Remark. Algorithm 3 differs from Algorithm 1 mainly in that the regularization parameter changes. Like Algorithm 2, it traces possible solutions along a homotopy path to overcome the difficulty of parameter choice.
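For comparison with the HIHT sketch above, here is a hedged NumPy sketch of AHIHT, again reusing the hypothetical hard_threshold and phi helpers; only one line-searched hard thresholding step is performed per value of λ, and the stop at a target value lam_t reflects the empirical variant of Section 4 rather than the generic termination condition of step 11.

```python
def AHIHT(A, y, L0, lam0, x0, rho=0.7, lam_t=1e-2,
          gamma=2.0, eta=1.0, L_max=1e12):
    """Accelerated homotopy IHT (Algorithm 3): one line-searched IHT step
    per regularization value, then lambda is decreased geometrically."""
    x, L, lam = x0.copy(), L0, lam0
    while lam > lam_t:
        x_new = hard_threshold(A, y, x, L, lam)
        while phi(A, y, x, lam) - phi(A, y, x_new, lam) \
                < 0.5 * eta * np.linalg.norm(x - x_new) ** 2:
            if L >= L_max:
                break
            L = min(gamma * L, L_max)
            x_new = hard_threshold(A, y, x, L, lam)
        x = x_new
        lam *= rho                           # step 9: shrink lambda, warm start
    return x, L
```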

The convergence analysis of Algorithm 3 is presented in the following theorem; the proof is similar to that of Theorem 3.1.

Theorem 3.3. Let {x^k} be the sequence generated by steps 2-7 of Algorithm 3. We have:

(i) {x^k} and {ϕ_{λ_k}(x^k)} converge.

(ii) Either S(x^k) changes only in a finite number of iterations, or for all h ∈ Γ^k, x^k_h or x^{k+1}_h is an arbitrarily small amount, i.e., for any ε > 0 there exists K > 0 such that when k ≥ K, |x^k_h| < ε or |x^{k+1}_h| < ε, where Γ^k is the set of indices of components at which zero components of x^k are set to nonzero, or nonzero components are set to zero.

(iii) If A has full row rank, then any limit point of {x^k} is feasible for problem (1), i.e., if x* = lim_{k→∞} x^k, then Ax* = y.

(iv) Suppose S(x^k) does not change after a sufficiently large k. Let ϕ* = lim_{k→∞} ϕ_{λ_k}(x^k). Then the number of changes of S(x^k) is at most T̂ = 2(ϕ_{λ_0}(x⁰) − ϕ*)/(ηδ̂² + 2(1 − ρ)λ̄), where λ̄ denotes the value of λ from which S(x^k) remains unchanged, δ̂ = min_j δ̂_j, and δ̂_j = max{ √(2λ_{m_j−1}/L_{m_j−1}), √(2λ_{m_j}/L_{m_j}) }.

Proof. (i) First, by the choice of x^{k+1} in Algorithm 3, we have

    ϕ_{λ_k}(x^k) = f(x^k) + λ_k‖x^k‖_0 ≥ f(x^k) + ⟨∇f(x^k), x^{k+1} − x^k⟩ + (L_k/2)‖x^{k+1} − x^k‖² + λ_k‖x^{k+1}‖_0.   (23)

On the other hand, from the Lipschitz continuity of ∇f(x), one can observe that

    ϕ_{λ_{k+1}}(x^{k+1}) = f(x^{k+1}) + λ_{k+1}‖x^{k+1}‖_0 ≤ f(x^k) + ⟨∇f(x^k), x^{k+1} − x^k⟩ + (L_f/2)‖x^{k+1} − x^k‖² + λ_{k+1}‖x^{k+1}‖_0.   (24)

Inequalities (23) and (24), together with L_k ≥ L_f + η, imply that

    ϕ_{λ_k}(x^k) − ϕ_{λ_{k+1}}(x^{k+1}) ≥ ((L_k − L_f)/2)‖x^{k+1} − x^k‖² + (λ_k − λ_{k+1})‖x^{k+1}‖_0
        ≥ (η/2)‖x^{k+1} − x^k‖² + λ_k(1 − ρ)‖x^{k+1}‖_0
        ≥ (η/2)‖x^{k+1} − x^k‖².

Hence ϕ_{λ_k}(x^k) is nonincreasing. Furthermore, ϕ_λ(x) is bounded below for λ ≥ 0. Therefore ϕ_{λ_k}(x^k) converges as k → ∞. The proofs of the other statements are similar to the proofs of the corresponding statements in Theorem 3.1.

Similarly to the proof of Theorem 3.2, we can obtain the following results.

Theorem 3.4. Let {x^k} be the sequence generated by steps 2-11 of Algorithm 3 and x* = lim_{k→∞} x^k. If for all k, 1 > ρ ≥ h_{L_f}(x^k, x^{k+1}) / h_{L_k}(x^k, x^{k+1}), then

(i) ‖x*‖_0 ≤ ‖x⁰‖_0 + (1/λ_0) f(x⁰) − C, where C = lim_{k→∞} f(x^k)/λ_k. In particular, if x⁰ = 0, then ‖x*‖_0 ≤ ‖y‖²/(2λ_0) − C;

(ii) |S(x^k)| is constant when k is large enough;

(iii) if x⁰ = 0 and Spark(A) > ‖y‖²/(2λ_0) − C, then x* is a local minimizer of problem (1).

4 Empirical algorithms

In the previous section, we proved that our methods converge to feasible solutions of problem (1). The theorems require that the regularization parameter λ converges to zero.

However, in our experiments (see Subsection 5.1), we found that if λ is too small, the experimental results are not good enough. Hence, similarly to the idea of the PGH method [26], we set a lower bound λ_t on λ in our HIHT/AHIHT methods.

Since Algorithm 1 is a good method for the sparsity problem and may converge to a local minimizer of problem (5) [18], similarly to the idea of the PGH method [26], we use it to improve the solutions obtained by our HIHT/AHIHT methods.

The termination condition in step 14 of Algorithm 2 may not be easy to verify. Hence, for practical considerations, we modify it as follows: for each λ_k, the loop between steps 6-14 stops when

    ‖x^{k,i} − x^{k,i+1}‖ ≤ ε_0,                                       (25)

where ε_0 > 0. The termination condition in step 10 of Algorithm 1 is set as follows: the method stops when the infinity norm of the difference of two adjacent solutions in the generated sequence is smaller than a given precision ε > 0, i.e.,

    ‖x^{k+1} − x^k‖_∞ ≤ ε.                                             (26)

Since the solutions obtained by Algorithms 2 and 3 will be improved by Algorithm 1, we set ε_0 > ε, e.g., ε = 10⁻⁵ and ε_0 = 10⁻¹. The main framework of the empirical algorithm is given as follows.

Algorithm 4: {x*, L*} = HomAlg(L⁰, λ_0, x⁰)
Input: L⁰, λ_0, x⁰;
Output: {x*, L*};
1: initialization;
2: {x̃, L̃} ← HIHT/AHIHT(L⁰, λ_0, x⁰);  // the termination condition is that λ reaches the lower bound λ_t
3: {x*, L*} ← viIHT(L̃, λ_t, x̃).  // the termination condition is that (26) is satisfied
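A two-stage sketch of the empirical algorithm is given below, chaining the hypothetical HIHT/AHIHT and viIHT sketches from earlier sections: the first stage runs the homotopy method down to the lower bound λ_t, and the second stage refines the result with viIHT at λ_t until condition (26) holds. The signature and default values are our assumptions, not the paper's.

```python
def HomAlg(A, y, L0, lam0, x0, lam_t=1e-2, eps=1e-5, accelerated=False):
    """Empirical algorithm (Algorithm 4): homotopy stage followed by a
    viIHT refinement stage at the fixed target value lam_t."""
    homotopy = AHIHT if accelerated else HIHT
    x_tilde, L_tilde = homotopy(A, y, L0, lam0, x0, lam_t=lam_t)    # stage 1
    x_star, L_star = viIHT(A, y, L_tilde, lam_t, x_tilde, tol=eps)  # stage 2, stops by (26)
    return x_star, L_star
```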

5 Experiments

In this section, numerical experiments testing the performance of our HIHT/AHIHT methods for the CS problem are presented. When conducting the experiments, our HIHT/AHIHT methods take the form of Algorithm 4; for simplicity, we still call them the HIHT/AHIHT methods, respectively.

In Subsection 5.1, the performance of our HIHT/AHIHT methods and of Algorithm 1 with different parameters is shown. The results indicate that our algorithms can overcome the shortcoming of Algorithm 1 to some degree. The separable approximation (SpaRSA) method [25] and the fixed point continuation (FPC) method [15] are also state-of-the-art methods for solving the sparse problem, but in the experiments of [26] the PGH and ADGH methods [26] outperform them, so we compare our methods only with the PGH and ADGH methods (the PGH/ADGH packages may be found at http://research.microsoft.com/enus/downloads/fc58804a-616a-4d59-a9c1-244ba5b2a5e9/default.aspx) in Subsection 5.2. The results show that our methods outperform the PGH/ADGH methods both in quality and in running time.

All experiments are performed on a personal computer with an Intel(R) Core(TM)2 Duo CPU E7500 (2.93GHz) and 2GB memory, using a MATLAB toolbox.

5.1 Influence of the parameters

In these experiments, we mainly consider solving the CS problem with noise. When the observations are noisy, the compressed sensing problem can be written as y = Ax̄ + z, where x̄ ∈ R^n is the vector of unknowns, A ∈ R^{m×n} (m < n) and y ∈ R^m are the problem data, and z is the measurement noise. Then, for a fixed error level ε > 0, the CS problem (1) can be rewritten as

    min_x ‖x‖_0  s.t.  ‖Ax − y‖ ≤ ε,

which can also be regularized as problem (4).

In the first experiment, we investigate the influence of the target value λ_t and the descent speed ρ of the regularization parameter λ. An instance of the compressed sensing problem was generated similarly to that in [26]: it was generated randomly with size 1000 × 5000, i.e., m = 1000, n = 5000, and |S(x̄)| = 100, with the elements of the matrix A distributed uniformly on the unit sphere. The vector x̄ was generated with the same distribution at 100 randomly chosen coordinates. The noise z was distributed randomly and uniformly in the sphere with ratio r = 0.01. Finally, the vector y was generated by y = Ax̄ + z.

In the experiment, all parameters were set as follows. The initial value λ_0 of the regularization parameter λ was set to ‖A^H y‖_∞. The initial value L_0 of the line search was set similarly to that in [26], i.e., L_min = max_{1≤j≤n} ‖A_j‖², where A_j is the j-th column of A, and L_max was set to +∞. γ = 2 controls the increasing speed of L, and η = 1. The initial solution for all algorithms was set to x⁰ = 0, and the precisions were set to ε = 10⁻⁵ and ε_0 = 10⁻¹.

The results of the HIHT method and the AHIHT method are shown in Tables 1 and 2, respectively. In the tables, we present the CPU times (in seconds) required by the methods, the sizes of the support sets of the reconstructed data x̂ given by the methods, the mean squared errors (MSE) with respect to x̄, defined as MSE = (1/n)‖x̂ − x̄‖, and the errors of Ax̂ − y, defined as Er = ‖Ax̂ − y‖².
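For readers who wish to reproduce a setup of this kind, the following NumPy sketch generates a synthetic instance in the spirit of the description above and computes the reported metrics. The exact sampling scheme of the paper (columns on the unit sphere, noise in a sphere of ratio r) is only paraphrased here, so the normalizations and helper names should be treated as our assumptions.

```python
def make_instance(m=1000, n=5000, k=100, r=0.01, seed=0):
    """Random CS instance: A with unit-norm columns, k-sparse xbar, noisy y."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=0)               # columns scaled to the unit sphere
    xbar = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    xbar[support] = rng.standard_normal(k)
    z = rng.standard_normal(m)
    z *= r * rng.uniform() / np.linalg.norm(z)   # noise inside a sphere of radius r
    y = A @ xbar + z
    return A, xbar, y

def report(A, y, xbar, xhat):
    """Metrics used in the tables: support size, MSE, and residual error Er."""
    return {"support": int(np.count_nonzero(xhat)),
            "MSE": np.linalg.norm(xhat - xbar) / len(xhat),
            "Er": np.linalg.norm(A @ xhat - y) ** 2}
```

A call such as make_instance() followed by HomAlg(A, y, L0=1.0, lam0=np.linalg.norm(A.T @ y, np.inf), x0=np.zeros(5000)) strings the sketches together at small scale; the specific values here are illustrative only.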

Table 1: Test results of the HIHT method with various λ_t and descent speed ρ of λ.

    ρ     metric    λ_t = 0.0001   0.001      0.01       0.1        1          2
    0.8   time(s)   4.46           3.34       3.18       2.79       2.44       2.26
          |S(x̂)|    125            100        100        99         89         86
          MSE       1.367e-6       7.415e-7   7.417e-7   2.744e-6   4.518e-5   6.358e-5
          Er        0.023          0.031      0.031      0.073      14.218     25.295
    0.6   time(s)   2.33           1.82       1.74       1.56       1.47       1.39
          |S(x̂)|    130            100        100        99         89         86
          MSE       1.441e-6       7.418e-7   7.418e-7   2.744e-6   4.518e-5   6.358e-5
          Er        0.021          0.031      0.031      0.073      14.218     25.295
    0.4   time(s)   1.47           1.18       1.16       1.10       1.03       0.99
          |S(x̂)|    134            100        100        99         91         87
          MSE       1.511e-6       7.413e-7   7.413e-7   2.743e-6   3.638e-5   5.959e-5
          Er        0.021          0.031      0.031      0.073      9.245      21.437

Table 2: Test results of the AHIHT method with various λ_t and descent speed ρ of λ.

    ρ     metric    λ_t = 0.0001   0.001      0.01       0.1        1          2
    0.8   time(s)   3.30           2.49       2.11       1.77       1.40       1.34
          |S(x̂)|    125            100        100        99         89         86
          MSE       1.367e-6       7.415e-7   7.417e-7   2.744e-6   4.518e-5   6.358e-5
          Er        0.023          0.031      0.031      0.073      14.218     25.295
    0.6   time(s)   1.68           1.16       1.08       0.98       0.88       0.83
          |S(x̂)|    130            100        100        99         91         86
          MSE       1.441e-6       7.413e-7   7.413e-7   2.744e-6   3.638e-5   6.358e-5
          Er        0.021          0.031      0.031      0.073      9.245      25.295
    0.4   time(s)   1.02           0.87       0.83       0.80       0.74       0.67
          |S(x̂)|    125            100        100        99         91         86
          MSE       1.364e-6       7.412e-7   7.412e-7   2.744e-6   3.638e-5   6.358e-5
          Er        0.023          0.031      0.031      0.073      9.245      25.295

From the two tables, for every given λ_t, we can see that the descent speed ρ of the regularization parameter λ affects to some degree the CPU times required by the respective methods, but only slightly affects the quality of the solutions generated by our methods. For every given ρ, the target value λ_t evidently affects the performance of our methods both in running time and in solution quality. Although a larger value of λ_t may reduce the CPU time, the quality of the solution deteriorates. Balancing the running times required by the methods against the quality of the solutions, we find that with λ_t = 0.01 our methods give better reconstructed signals.

When comparing our two methods on the same instance and parameters, the AHIHT method is faster than the HIHT method, but the qualities of the solutions generated by the two methods are almost the same.

In the second experiment, we show that the regularization parameter λ significantly affects the quality of the solutions generated by the viIHT method. The size of the instance, the way the data were generated, and the parameter settings are the same as in the first experiment. The results of the viIHT method on the generated instance are reported in Table 3, from which we can see that if the regularization parameter λ is not chosen suitably, the quality of the reconstructed signal is very poor.

Table 3: Influence of the regularization parameter λ on the viIHT method.

    λ         0.01       0.1        0.5        1          10        100
    time(s)   1.57       5.12       3.68       1.47       0.64      0.48
    |S(x̂)|    4066       2117       94         92         78        18
    MSE       1.009e-3   9.916e-4   1.525e-5   2.407e-5   1.685e-4  9.136e-4
    Er        1.776e-6   6.392e-5   1.616      4.162      169.284   5714.609

From the above two experiments, we find that our HIHT/AHIHT methods do not depend on the target value λ_t as strongly as the viIHT method depends on the regularization parameter λ. Hence, our methods can in some sense overcome the shortcoming of the viIHT method.

5.2 Comparison with other homotopy methods

In this subsection, we compare the performance of our HIHT/AHIHT methods with that of the state-of-the-art PGH/ADGH methods for solving the compressed sensing problem, since they are also homotopy methods, but regularized with a different norm. A test instance was generated in the same way as in the first experiment. To be fair, all parameter values were set to the default values in the PGH/ADGH packages, except that the termination condition of all methods (including PGH/ADGH) was changed to (26), and the target value λ_t of our methods was set to 0.01, but to 1 for the PGH/ADGH methods, as this is their default value. When using the PGH/ADGH packages, all other parameters were set to their defaults. Numerical results are reported in Table 4 and Figures 1-3.

Table 4 presents the CPU times required by the compared methods, as well as the sizes of the support sets of the reconstructed data x̂ obtained by the methods, the mean squared errors (MSE) with respect to x̄, and the errors of Ax̂ − y.

Table 4: Numerical comparisons of the methods on an instance with size m = 1000, n = 5000, |S(x̄)| = 100, and noise z uniformly distributed over [-0.01, 0.01].

    Alg.    time(s)   |S(x̂)|   MSE(×10⁻⁶)   Er
    PGH     3.06      115      7.364        0.370
    HIHT    2.24      100      0.623        0.030
    AHIHT   1.55      100      0.623        0.030
    ADGH    11.96     115      7.364        0.370

From the column time(s), we can see that the AHIHT method is faster than the HIHT method, and both methods are much faster than the PGH and ADGH methods. Regarding the size of the support set of the reconstructed signal, the HIHT/AHIHT methods reconstruct signals with the same number of nonzero components as the original signal x̄, but the PGH/ADGH methods do not. Furthermore, from the columns MSE and Er, we can see that our methods reconstruct x̄ with smaller errors. Hence our methods outperform the PGH/ADGH methods both in running time and in solution quality in this experiment.

[Figure 1: Objective gap ϕ(x^k) − ϕ* for each iteration k; curves for PGH, ADGH, HIHT, and AHIHT.]

Figure 1 shows the relationship between the total number of iterations k and the objective gap. (In Figure 1, for the HIHT/AHIHT methods, ϕ* is the minimum value of ϕ(x) = (1/2)‖Ax − y‖² + λ_t‖x‖_0 and ϕ(x^k) = (1/2)‖Ax^k − y‖² + λ_k‖x^k‖_0; for the PGH/ADGH methods, ϕ* is the minimum value of ϕ(x) = (1/2)‖Ax − y‖² + λ_t‖x‖_1 and ϕ(x^k) = (1/2)‖Ax^k − y‖² + λ_k‖x^k‖_1. The minimum values ϕ* are obtained by the respective methods with a lower ε value, such as 10⁻⁶ in our experiments.) Although the descent speeds of the PGH method and the HIHT method are almost the same, they are faster than the ADGH method, while the AHIHT method needs the fewest iterations among them, taking only about 30 iterations to reach the termination condition.

[Figure 2: Sparsity ‖x^k‖_0 of each iterate versus iteration k; curves for PGH, ADGH, HIHT, and AHIHT.]

[Figure 3: The number of inner iterations for each λ_k, plotted against λ_0/λ_k; curves for PGH, ADGH, HIHT, and AHIHT.]

Figure 2 pictures the sparsity of {x^k} or {x^{k,i}} at each iteration as the algorithms progress. It shows that the sizes of the support sets of the sequences {x^k} or {x^{k,i}} generated by our HIHT/AHIHT methods are much more stable than those generated by the PGH/ADGH methods. In the first 10 iterations, they are stable in almost the same manner, but after that, for the PGH/ADGH methods, the sizes of the support sets of the sequences oscillate until 80 or more iterations. To obtain approximately optimal solutions, the total numbers of iterations of our HIHT/AHIHT methods are smaller. Moreover, we find that after dozens of steps, the support sets of the sequences generated by our proposed methods remain unchanged in this experiment.

Figure 3 pictures the number of inner iterations of each homotopy method, i.e., the number of inner iterations for each regularization parameter value λ. It shows that, except for the last one, the number of inner iterations for each fixed regularization parameter λ is small: no more than 10 inner iterations are needed, and the HIHT method takes just 3 or 4 inner iterations to reach the required precision. However, all algorithms need more inner iterations at the last regularization parameter value. This is because, when the homotopy methods stop, a warm-starting strategy is used with the initial values given by the respective homotopy methods and a higher precision ε = 10⁻⁵, to obtain more accurate solutions.

In the following fourth experiment, since the PGH method outperforms the ADGH method in the third experiment above, we compare only the quality and performance of the three main methods HIHT/AHIHT/PGH at different sparsity levels. For each value of the sparsity level |S(x̄)|, we randomly generated 50 instances of the same size using the method of the first experiment. The parameters of PGH were set to their defaults, except that the termination condition was changed to (26). For the HIHT/AHIHT methods, we set the target value λ_t = 0.01, since at this value our methods perform well in the first experiment, and we set ρ = 0.7, since ρ = 0.7 is the default value in the PGH/ADGH packages. The other parameters were set to the same values as in the first experiment, i.e., the same as the default values of the PGH/ADGH packages. The averaged results are presented in Table 5.

Table 5: Averaged results over 50 random instances with size m = 1000, n = 5000, and noise z uniformly distributed over [-0.01, 0.01], for each sparsity level |S(x̄)|.

    |S(x̄)|   Alg.    time(s)   |S(x̂)|   MSE(×10⁻⁶)   Er
    50       PGH     1.90      50       4.629        0.190
             HIHT    1.79      50       0.640        0.034
             AHIHT   1.00      50       0.640        0.034
    100      PGH     2.62      120      7.144        0.360
             HIHT    2.22      99       1.095        0.037
             AHIHT   1.11      99       1.095        0.037
    150      PGH     4.18      255      10.888       0.566
             HIHT    2.62      148      1.623        0.043
             AHIHT   1.26      148      1.697        0.045
    200      PGH     15.56     480      19.311       0.839
             HIHT    3.41      198      1.817        0.040
             AHIHT   2.07      198      3.854        0.041
    250      PGH     286.90    891      165.91       1.357
             HIHT    4.28      248      2.172        0.042
             AHIHT   15.42     703      485.94       0.821
    300      PGH     1111.81   981      707.55       1.480
             HIHT    30.32     788      909.03       0.378
             AHIHT   45.07     922      1232.7       0.206

The results in Table 5 show that the three methods solve the CS problem efficiently at low sparsity levels but poorly at high sparsity levels. The PGH method can reconstruct the signals when the sparsity level is 50, whereas our HIHT/AHIHT methods can reconstruct the signals when the sparsity level is not larger than 250 and 200, respectively. As for the running times, our HIHT/AHIHT methods solve the CS problem faster than the PGH method at the same sparsity level, especially as the sparsity level becomes higher. Comparing our two methods, the AHIHT method is faster than the HIHT method with the same solution quality if the sparsity level is low. But if the sparsity level reaches 250, the AHIHT method cannot reconstruct the signals well, and if the sparsity level is 300, neither of our two methods can reconstruct the signals.

6 Conclusions

In this paper, we have applied the homotopy technique directly to the ℓ0-regularized sparsity problem and have proposed two homotopy methods, HIHT and AHIHT, to approximately solve the compressed sensing problem. Under some mild assumptions, the two methods converge to feasible solutions of the problem. Moreover, we have given a definition of local minimizer for problem (1).

Theoretical analyses indicate that, under some conditions, the numbers of nonzero components of the limits of the sequences produced by our methods are bounded above, and the limit solutions are local minimizers of problem (1). To improve the solution quality of the methods, we have given two empirical algorithms, which consist of two stages: in the first stage, our HIHT/AHIHT methods are called, and in the second stage, the viIHT method is used to improve the solutions given by the first stage.

From the experimental results, we find that our two empirical algorithms can efficiently solve CS problems with high quality and can overcome the shortcoming of the viIHT method to some degree. Moreover, when compared with the two state-of-the-art homotopy methods PGH/ADGH, the solutions generated by our two empirical algorithms are better than those of the PGH/ADGH methods both in running time and in solution quality. In fact, with suitable parameter values, our two proposed homotopy methods can almost always recover the noisy signals.

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1): 183-202, 2009.

[2] T. Blumensath. Accelerated iterative hard thresholding. Signal Processing, 92(3): 752-756, 2012.

[3] T. Blumensath and M. E. Davies. Iterative thresholding for sparse approximations. Journal of Constructive Approximation, 14(5-6): 629-654, 2008.

[4] T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3): 265-274, 2009.

[5] T. Blumensath and M. E. Davies. Normalized iterative hard thresholding: guaranteed stability and performance. IEEE Journal of Selected Topics in Signal Processing, 4(2): 298-309, 2010.

[6] E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2): 489-509, 2006.

[7] E. J. Candes, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5-6): 877-905, 2008.

[8] R. Chartrand. Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Processing Letters, 14(10): 707-710, 2007.

[9] X. Chen, M. Ng, and C. Zhang. Nonconvex ℓp regularization and box constrained model for image restoration. IEEE Transactions on Image Processing, 21(12): 4709-4721, 2012.

[10] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4): 1168-1200, 2005.

[11] G. M. Davis, S. Mallat, and M. Avellaneda. Adaptive greedy approximations. Journal of Constructive Approximation, 13(1): 57-98, 1997.

[12] D. L. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proc. Natl. Acad. Sci. USA, 100: 2197-2202, 2003.

[13] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2): 407-499, 2004.

[14] M. A. Figueiredo, R. D. Nowak, and S. J. Wright. Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1(4): 586-597, 2007.

[15] E. T. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for ℓ1-minimization: methodology and convergence. SIAM Journal on Optimization, 19(3): 1107-1130, 2008.

[16] M. J. Lai, Y. Y. Xu, and W. T. Yin. Improved iteratively reweighted least squares for unconstrained smoothed ℓq minimization. SIAM Journal on Numerical Analysis, 51(2): 927-957, 2013.

[17] Q. Lin and L. Xiao. An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. Technical Report MSR-TR-2013-41, Microsoft Research, http://research.microsoft.com/apps/pubs/default.aspx?id=189743, 2013.

[18] Z. Lu. Iterative hard thresholding methods for ℓ0 regularized convex cone programming. Mathematical Programming, to appear, 2013.

[19] Z. Lu and Y. Zhang. Sparse approximation via penalty decomposition methods. SIAM Journal on Optimization, to appear, 2013.

[20] D. M. Malioutov and A. S. Willsky. Homotopy continuation for sparse signal representation. IEEE International Conference on Acoustics, Speech and Signal Processing, 5: 733-736, 2005.

[21] S. Mallat and Z. Zhang. Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing, 41(12): 3397-3415, 1993.

[22] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2): 227-234, 1995.

[23] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1): 125-161, 2007.

[24] M. Nikolova. Description of the minimizers of least squares regularized with ℓ0-norm. Uniqueness of the global minimizer. SIAM Journal on Imaging Sciences, 6(2): 904-937, 2013.

[25] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7): 2479-2493, 2009.

[26] L. Xiao and T. Zhang. A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization, 23(2): 1062-1091, 2013.

[27] J. Yang and Y. Zhang. Alternating direction algorithms for ℓ1-problems in compressive sensing. SIAM Journal on Scientific Computing, 33(1): 250-278, 2011.