arxiv: v1 [math.na] 10 Oct 2016


GREEDY GAUSS-NEWTON ALGORITHM FOR FINDING SPARSE SOLUTIONS TO NONLINEAR UNDERDETERMINED SYSTEMS OF EQUATIONS

MÅRTEN GULLIKSSON AND ANNA OLEYNIK

arXiv:1610.03095v1 [math.NA] 10 Oct 2016

Abstract. We consider the problem of finding sparse solutions to an underdetermined nonlinear system of equations. The methods are based on a Gauss-Newton approach with line search, where the search direction is found by solving a linearized problem using only a subset of the columns of the Jacobian. The columns of the Jacobian are chosen through a greedy approach, looking at either maximum descent or an approach corresponding to orthogonal matching for linear problems. The methods are shown to be convergent and efficient, and they outperform the $\ell_1$ approach on the test problems presented.

1. Introduction

We consider the nonlinear underdetermined system of equations
$$f_1(x_1,\dots,x_N) = 0,\quad \dots,\quad f_m(x_1,\dots,x_N) = 0,$$
or simply
$$f(x) = 0, \tag{1}$$
where $x \in \mathbb{R}^N$ and $f : D \subset \mathbb{R}^N \to \mathbb{R}^m$, $m < N$, is twice continuously differentiable on the open convex set $D$, i.e., $f_i \in C^2(D)$, $i = 1,\dots,m$. If $0 \in f(D)$, the solution to (1) is not unique, which is a direct consequence of the Implicit Function Theorem []. We refer to [2, 3, 4, 5] for examples from different application areas as motivation for solving (1). In this paper we are interested in sparse solutions to (1), i.e., solutions that contain only a few nonzero components. Let $\|x\|_0$ be the so-called $\ell_0$-norm (which is actually not a norm) on $\mathbb{R}^N$, defined as the number of nonzero elements, $\|x\|_0 = |\{i : x_i \neq 0\}|$. We say that a vector $x$ is $n$-sparse if $\|x\|_0 \le n$, and sparse if $\|x\|_0 \ll m$. The problem of finding the most sparse solution to (1) reads
$$\min_x \|x\|_0 \quad \text{s.t.} \quad f(x) = 0. \tag{2}$$
Due to its combinatorial complexity, problem (2) is considered to be intractable, see [4], and current algorithms cannot guarantee that the (sparse) solution attained is a solution to (2). Linear problems, i.e., $f(x) = Ax - b$, $A \in \mathbb{R}^{m \times N}$, $b \in \mathbb{R}^m$, have been studied extensively. For algorithms solving the linear sparse solution problem we refer to [4]. Important references can also be found in [6].
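The $\ell_0$ count and the notion of $n$-sparsity defined above can be illustrated with a minimal NumPy sketch. The function names `l0_norm` and `is_n_sparse` and the optional threshold `tol` are illustrative choices, not part of the paper.

```python
import numpy as np

def l0_norm(x, tol=0.0):
    """Number of entries of x with magnitude above tol (the l0 'norm')."""
    return int(np.count_nonzero(np.abs(np.asarray(x, dtype=float)) > tol))

def is_n_sparse(x, n, tol=0.0):
    """True if x has at most n nonzero entries."""
    return l0_norm(x, tol) <= n

x = np.array([0.0, 1.5, 0.0, -2.0, 0.0])
print(l0_norm(x))                       # 2
print(is_n_sparse(x, 2))                # True
# Scaling x does not scale the count, which is why it is not a norm.
print(l0_norm(3.0 * x) == l0_norm(x))   # True
```

In floating-point computations a small positive `tol` is usually needed to decide which entries count as zero.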
To the best of our knowledge there are no numerical algorithms specifically developed to find sparse solutions of (1) except the ones described in [2], which we will refer to as the $\ell_1$-method.

2 Mathematics Subject Classification. 68Q25, 68R, 68U5.
Key words and phrases. sparse optimization, underdetermined nonlinear systems of equations, Gauss-Newton, line search, greedy algorithm, sparsity constraints.

We will later compare this method with our approach and therefore we describe the method in more detail. Let $\|x\|_p$, $0 < p < \infty$, be given as
$$\|x\|_p = \Big(\sum_i |x_i|^p\Big)^{1/p}. \tag{3}$$
For $p \ge 1$, (3) defines the $\ell_p$-norm, while for $0 < p < 1$ it is only a quasi-norm. In the sequel, we use $\|\cdot\|$ instead of $\|\cdot\|_2$. The algorithms in [2] are based on solving
$$\min_x \|x\|_p \quad \text{s.t.} \quad f(x) = 0 \tag{4}$$
for $0 < p \le 1$ and $f$ given as above, which is motivated by the fact that $\|x\|_p^p \to \|x\|_0$, $p \to 0^+$, on a bounded set. In particular, the $\ell_1$-norm algorithm described in [2] is realized in the following way. Starting with $x_0 = 0$ one obtains a new approximation as $x_{k+1} = x_k + p_k$, $k = 0, 1, 2, \dots$, where $p_k$ is the solution to
$$\min_p \|p\|_1 \quad \text{s.t.} \quad f_k + J_k p = 0. \tag{5}$$
Here we denote $f_k = f(x_k)$, and $J_k = (\partial f_i(x_k)/\partial x_j)_{ij}$, $i = 1,\dots,m$, $j = 1,\dots,N$, is the Jacobian of $f(x)$ at $x = x_k$. The problem (5) can be recast as a linear programming problem
$$\min_w c^T w \quad \text{s.t.} \quad Aw = b,\ w \ge 0, \tag{6}$$
where $c = \mathbf{1}_{2N}$, $A = (J_k, -J_k)$, $b = -f_k$, $w = (u; v)$, $p = u - v$. In [2] it was shown that the method converges locally to a solution (which is not necessarily sparse) with quadratic convergence rate. However, global convergence was not proven.

There are other methods that are not directly applied to (2) but contain some ideas and properties related to our approach and are thus relevant to mention here. In a series of papers [7, 8, 9, 6] a general theory is developed for the problem
$$\min_x F(x) \quad \text{s.t.} \quad x \in C_s \cap B, \tag{7}$$
where $F : \mathbb{R}^N \to \mathbb{R}$, $B$ is a closed and convex set, and $C_s = \{x \in \mathbb{R}^n : \|x\|_0 \le s\}$. The theory is used for a number of applications and several algorithms are developed and analyzed. In our context, we note that in [9] an algorithm, GESPAR (greedy sparse phase retrieval), is developed to solve a nonlinear overdetermined least squares problem based on a coordinate search, where the sparse (small) overdetermined nonlinear least squares subproblems are solved using a Gauss-Newton approach with line search. Convergence results for the gradient are derived.
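Returning to the linear program (6): the following short derivation sketches why it reproduces the $\ell_1$ subproblem (5). It uses the standard splitting of $p$ into nonnegative parts $u$ and $v$; the symbol $\mathbf{1}_N$ for the all-ones vector is our notation.

```latex
\begin{aligned}
p &= u - v, \qquad u \ge 0,\ v \ge 0, \\
\|p\|_1 = \sum_{i=1}^{N} |u_i - v_i|
  &\le \mathbf{1}_N^T u + \mathbf{1}_N^T v = c^T w,
  \qquad \text{with equality iff } u_i v_i = 0 \text{ for all } i, \\
f_k + J_k p = 0
  &\iff (J_k,\, -J_k)\begin{pmatrix} u \\ v \end{pmatrix} = -f_k
  \iff A w = b .
\end{aligned}
```

At an optimum of (6) one has $u_i v_i = 0$ for every $i$, since otherwise subtracting $\min(u_i, v_i)$ from both entries keeps $Aw = b$ and lowers $c^T w$. Hence the optimal value of (6) equals that of (5), and $p = u - v$ recovers a minimizer.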
In [] the problem (7) with $B = \mathbb{R}^n$ is considered with a coordinate search algorithm based on a local gradient search in a sparse solution set (gradient support pursuit). Estimates of the error in the iterates are developed using the size of the elements in the gradient at the sparse solution. Furthermore, there are combinatorial methods that solve the nonlinear problem (1) using cardinality constraints, see [5], which we do not consider here.

Here we present an alternative method, which we call a Greedy Gauss-Newton algorithm, that combines a greedy approach with the Gauss-Newton method []. The method is based on a line search where at the $k$th iterate we set
$$x_{k+1} = x_k + \alpha_k p_k, \quad k = 0, 1, 2, \dots, \tag{8}$$
where $p_k$ is the search direction and $\alpha_k$ is the step length. We start the iterations with $x_0 = 0$ or $x_0$ sparse enough. In every iteration we use the matrix $L_k$, consisting of the columns of $J_k$ corresponding to the nonzero part of $x_k$, and an additional column of $J_k$, $J_k(:,t)$, to calculate the search direction as
$$p_t = \arg\min_p \|f_k + (L_k, J_k(:,t))\,p\|.$$
The choice of $t$ is discussed and we analyze two choices in detail. The first one is based on maximizing the descent of $\|f(x)\|_2^2/2$ at $x = x_k$ in the direction $p_t$, and we call this method Maximum Descent (MD). The second idea of choosing $t$ is similar to orthogonal matching on the linear problem $\min_p \|f_k + J_k p\|_2$, see [4], and consists of maximizing the absolute value of the cosine of the angle between $r_k = f_k - L_k L_k^+ f_k$ and $J_k(:,t)$, where $L_k^+$ is the pseudo inverse of $L_k$ [2]. We denote our method based on orthogonal matching as OM.

The paper is organized as follows. In Section 2 we describe how to calculate $p_t$ and we show that it is a descent direction, together with some useful corollaries. The MD algorithm is presented in Section 2.1 and OM in Section 2.2. In Section 3 we show results on global and local convergence together with the algorithm in pseudocode, and finally we give some numerical tests in Section 4.

2. The algorithm

Here we describe the line search method (8) to find a sparse solution to (1). We start with $x_0 = 0$ or some sufficiently sparse vector. At iteration $k$, let $x_k$ contain $n_k$ nonzero elements at positions $i \in \Omega_k$ and zero elements at $i \in \bar\Omega_k$, where
$$\Omega_k = \{i_1, i_2, \dots, i_{n_k}\}, \qquad \bar\Omega_k = \{1, 2, \dots, N\} \setminus \Omega_k.$$
We use Matlab [3] inspired notation, that is, $x(i) = x_i$, $x(:) = x$, $x(1:n_k) = (x_1,\dots,x_{n_k})^T$, and $x(\Omega_k) = (x_{i_1},\dots,x_{i_{n_k}})^T$. We aim at finding $p_k$ in (8) such that (i) $p_k$ is a descent direction and (ii) the update $x_{k+1} = x_k + \alpha_k p_k$ is $(n_k+1)$-sparse for any $\alpha_k \in \mathbb{R}$. The most straightforward approach would be to solve the linearized problem to (1), that is,
$$f_k + J_k p_k = 0. \tag{9}$$
However, solving (9) for a sparse $p_k$ is not efficient enough for large $N$, [4]. Thus, for every $t \in \bar\Omega_k$ we define a projection $\Pi_k^t$ as
$$\Pi_k^t(i,j) = \begin{cases} 1, & i = j \text{ and } i \in \Omega_k \cup \{t\}, \\ 0, & \text{otherwise}, \end{cases}$$
where $i, j = 1,\dots,N$. Then instead of (9) we solve the minimization problem
$$\min_p \tfrac12\|p\|^2 \quad \text{s.t.} \quad \min_p \tfrac12\|f_k + J_k \Pi_k^t p\|^2, \tag{10}$$
with $t \in \bar\Omega_k$ to obtain $p_k$. We choose $t \in \bar\Omega_k$ by two different methods: MD, $t = t_{MD}$, or OM, $t = t_{OM}$, which we describe in detail in the coming subsections. Let $p_t$ be a solution of (10) for $t \in \bar\Omega_k$.
It is clear that $p_t$ for any $t \in \bar\Omega_k$ satisfies the sparsity requirement (ii). Indeed, $p_t(\bar\Omega_k \setminus \{t\}) = 0$ for any $t \in \bar\Omega_k$. Below we discuss when $p_t$ is a descent direction. Denote $L_k = J_k(:,\Omega_k)$; then the remaining non-zero part of $p_t$, that is, $q_t = (p_t(\Omega_k)^T, p_t(t))^T \in \mathbb{R}^{n_k+1}$, is the solution to
$$\min_q \tfrac12\|q\|^2 \quad \text{s.t.} \quad \min_q \tfrac12\|f_k + (L_k, J_k(:,t))\,q\|^2, \tag{11}$$
that is,
$$q_t = -(L_k, J_k(:,t))^+ f_k. \tag{12}$$
Note that $q_t$ is the unique minimum of $\|f_k + (L_k, J_k(:,t))\,q\|$ if $\operatorname{rank}((L_k, J_k(:,t))) = n_k + 1$.
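Formula (12) translates almost directly into code. The sketch below, assuming NumPy and zero-based column indices, builds the full-length direction $p_t$ from the pseudo inverse of the selected columns. The function name `search_direction` and the random test data are illustrative only.

```python
import numpy as np

def search_direction(J, f, omega, t):
    """Sparse Gauss-Newton direction p_t: q_t from (12), zero elsewhere."""
    cols = list(omega) + [t]                 # columns Omega_k plus the new one
    q = -np.linalg.pinv(J[:, cols]) @ f      # minimum-norm least squares solution
    p = np.zeros(J.shape[1])
    p[cols] = q
    return p

rng = np.random.default_rng(0)
m, N = 4, 10
J = rng.standard_normal((m, N))              # stand-in for the Jacobian J_k
f = rng.standard_normal(m)                   # stand-in for the residual f_k
p = search_direction(J, f, omega=[1, 5], t=7)
print(np.flatnonzero(p))                     # support is contained in {1, 5, 7}
print(p @ (J.T @ f) < 0)                     # negative slope means descent
```

Using `pinv` covers the rank-deficient case as well, where it returns the minimum-norm solution required by (11).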

Lemma 1. Let $p_t$ be a solution of (10), and let $J_k$ and $f_k$ be given as above. Then
$$p_t^T J_k^T f_k = -\left( f_k^T L_k L_k^+ f_k + \frac{\big(f_k^T (I - L_k L_k^+) J_k(:,t)\big)^2}{\|(I - L_k L_k^+) J_k(:,t)\|^2} \right) \tag{13}$$
and $p_t$ is a descent direction of $\tfrac12\|f(x)\|^2$ at $x = x_k$ if and only if
$$f_k^T L_k L_k^+ f_k + \frac{\big(f_k^T (I - L_k L_k^+) J_k(:,t)\big)^2}{\|(I - L_k L_k^+) J_k(:,t)\|^2} > 0. \tag{14}$$

Proof. Let $t \in \bar\Omega_k$, $a = J_k(:,t)$, and $P = I - L_k L_k^+$. Observe that $L_k L_k^+$ and $P$ define the orthogonal projections on $R(L_k)$ and $R(L_k)^\perp$, respectively. We have
$$p_t^T J_k^T f_k = q_t^T (L_k, a)^T f_k = -f_k^T (L_k, a)(L_k, a)^+ f_k.$$
Theorem 2 in Ch. 7, Section 5 in [2] yields
$$(L_k, a)^+ = \begin{pmatrix} L_k^+ - L_k^+ a b^T \\ b^T \end{pmatrix}, \tag{15}$$
where
$$b^T = \begin{cases} (Pa)^+, & Pa \neq 0, \\[4pt] \dfrac{a^T (L_k^+)^T L_k^+}{1 + a^T (L_k^+)^T L_k^+ a}, & Pa = 0. \end{cases}$$
Thus, using (12) and (15) we obtain
$$p_t^T J_k^T f_k = -f_k^T (L_k, a)(L_k, a)^+ f_k = -\left( f_k^T L_k L_k^+ f_k + \frac{(f_k^T P a)^2}{\|Pa\|^2} \right). \tag{16}$$
The second claim of the lemma follows from (13) and the definition of a descent direction.

Corollary 1. The solution $p_t$ of (10) is a descent direction of $\tfrac12\|f(x)\|^2$ at $x = x_k$ unless $f_k \in R(L_k)^\perp$ and $J_k(:,t) \in R(L_k)$ simultaneously.

Proof. Observe that $L_k L_k^+$ is positive semi-definite. Thus the two terms in (14) are nonnegative. Assume that $f_k^T L_k L_k^+ f_k = 0$. Then $f_k \in R(L_k)^\perp$ and the last term in (13) is equal to zero only if $a \in R(L_k)$.

It is clear from (13) that adding an extra column of $J_k$ will improve the descent as long as the added column does not belong to $R(L_k)$. We formulate it as a corollary.

Corollary 2. The descent of $p_t$, $t \in \bar\Omega_k$, is not less than the descent of $\bar p$, where $\bar p(\bar\Omega_k) = 0$ and $\bar p(\Omega_k)$ is the solution to
$$\min_d \tfrac12\|x(\Omega_k) + d\|^2 \quad \text{s.t.} \quad \min_d \tfrac12\|f_k + L_k d\|^2. \tag{17}$$

Proof. Simple calculations show that $\bar p^T J_k^T f_k = -f_k^T L_k L_k^+ f_k$. Together with (13) it implies $p_t^T J_k^T f_k \le \bar p^T J_k^T f_k$.

After $p_k$ is constructed, the step length $\alpha_k$ is found by using a standard step length algorithm, see [4], that satisfies the Goldstein-Armijo rule. We get $x_{k+1} = x_k + \alpha_k p_k$, which has at most $n_k + 1$ nonzero elements, and $\Omega_{k+1} = \Omega_k \cup \{t^*\}$.
If the step length $\alpha_k$ is too small, it indicates that the descent is insufficient, and we restart the algorithm with a sparse enough $x$ where the positions and the values of the nonzero elements are chosen randomly, see Section 3.1. If there are elements in $x_{k+1}$ close to zero, it could make sense to set these values to zero and then recalculate the set of non-zero entries $\Omega_{k+1}$. This approach would, however, be very much problem dependent and we do not consider it here.
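The step length computation can be sketched as a simple backtracking search on the merit function $\|f(x_k + \alpha p_k)\|^2/2$. This is a minimal illustration that enforces only the sufficient-decrease (Armijo) part of the rule. The function name `armijo_step`, the constants `c1`, `shrink` and `alpha_min`, and the small example system are assumptions for the sketch, not values from the paper.

```python
import numpy as np

def armijo_step(f, x, p, slope, alpha0=1.0, c1=1e-4, shrink=0.5, alpha_min=1e-10):
    """Backtracking on phi(alpha) = ||f(x + alpha p)||^2 / 2.

    slope is phi'(0) = p^T J(x)^T f(x) and should be negative.
    Returns the accepted alpha, or 0.0 if no alpha >= alpha_min is accepted.
    """
    fx = f(x)
    phi0 = 0.5 * np.dot(fx, fx)
    alpha = alpha0
    while alpha >= alpha_min:
        ft = f(x + alpha * p)
        if 0.5 * np.dot(ft, ft) <= phi0 + c1 * alpha * slope:
            return alpha
        alpha *= shrink
    return 0.0

# One equation in two unknowns: f(x) = x_0^2 + x_1 - 1.
f = lambda x: np.array([x[0] ** 2 + x[1] - 1.0])
jac = lambda x: np.array([[2.0 * x[0], 1.0]])
x = np.array([2.0, 0.0])
p = -np.linalg.pinv(jac(x)) @ f(x)        # Gauss-Newton direction
slope = p @ (jac(x).T @ f(x))
alpha = armijo_step(f, x, p, slope)
print(alpha, np.linalg.norm(f(x + alpha * p)))
```

A returned value of 0.0, or any value below a threshold such as $\delta_\alpha$, corresponds to the "step length too small" situation that triggers a restart.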

2.1. Maximum descent method (MD). MD is based on choosing $p_k = p_{t_{MD}}$ where
$$t_{MD} = \arg\max_{t \in \bar\Omega_k} \big(-p_t^T J_k^T f_k\big)$$
or, equivalently,
$$t_{MD} = \arg\max_{t \in \bar\Omega_k} \big(-q_t^T (L_k, J_k(:,t))^T f_k\big). \tag{18}$$
The next lemma gives us the explicit formula for computing $t = t_{MD}$.

Lemma 2. Let $p_t$ be the solution to (10) for $t \in \bar\Omega_k$. If there exists a $t \in \bar\Omega_k$ such that $p_t$ is a descent direction of $\tfrac12\|f(x)\|^2$ at $x_k$, then the maximum descent direction is given as $p_k = p_{t_{MD}}$ where
$$t_{MD} = \arg\max_t \frac{|f_k^T (I - L_k L_k^+) J_k(:,t)|}{\|(I - L_k L_k^+) J_k(:,t)\|}. \tag{19}$$
Moreover, $p_{t_{MD}}$ provides the minimum of the norm $\|f_k + J_k p_t\|$, i.e.,
$$p_{t_{MD}} = \arg\min_{p_t} \|f_k + J_k p_t\|.$$

Proof. Let $t \in \bar\Omega_k$, $a = J_k(:,t)$, $P = I - L_k L_k^+$, and $S = P a a^T P / \|Pa\|^2$, where $P$ and $S$ define the orthogonal projections on $R(L_k)^\perp$ and $R(Pa)$, respectively. Descent is given by (13), where the first term on the right-hand side, $f_k^T L_k L_k^+ f_k$, does not depend on $a$, and thus the maximum descent is achieved when $|f_k^T P a| / \|Pa\|$ is maximum. Thus, we obtain the expression in (19). To prove the second claim of the lemma we compute the squared norm using the expression for $q_t$ in (12):
$$\|f_k + J_k \Pi_k^t p_t\|^2 = \|f_k + (L_k, a)\,q_t\|^2 = \big\|\big(I - (L_k, a)(L_k, a)^+\big) f_k\big\|^2. \tag{20}$$
Using (15) we obtain
$$\big\|\big(I - (L_k, a)(L_k, a)^+\big) f_k\big\|^2 = \|(P - P a b^T) f_k\|^2 = \|(P - S) f_k\|^2 = f_k^T (P^2 - PS) f_k = f_k^T P f_k - \frac{(f_k^T P a)^2}{\|Pa\|^2}. \tag{21}$$
The term $f_k^T P f_k$ does not depend on $a$, and the norm $\|f_k + J_k \Pi_k^t p_t\|$ reaches its minimum when $|f_k^T P a| / \|Pa\|$ is maximum.

From Corollary 1 and Lemma 2 it is clear that $p_k = p_{t_{MD}}$ is always a descent direction if $\operatorname{rank}(J_k) > \operatorname{rank}(L_k)$.

Let us assume that $q_t$ in (12) is calculated with a QR-decomposition, see [3], that $N \gg n_k$, and that $t_{MD}$ is calculated using (18). Then the complexity (number of flops, i.e., one addition, subtraction, multiplication, or division of two floating-point numbers) of MD in iteration $k$ is $\big(2 m n_k^2 + (m+1)(n_k+1)\big)(N - n_k)$. If instead we use (19), the complexity is $2m(n_k+1)(N - n_k)$.
Assuming that the term including $N - n_k$ is the largest, the complexity of MD can be reduced by accepting a descent that is large enough without considering the whole set $\bar\Omega_k$. However, we have not considered this generalization here.

2.2. Orthogonal matching method (OM). Let $L_k = J_k(:,\Omega_k)$ as before and consider
$$\min_d \tfrac12\|d\|^2 \quad \text{s.t.} \quad \min_d \tfrac12\|f_k + L_k d\|^2. \tag{22}$$
The solution of (22) is $d_k = -L_k^+ f_k$, which is the unique minimum of $\|f_k + L_k d\|$ if $\operatorname{rank}(L_k) = n_k$, and the minimum norm solution otherwise.

OM aims at finding the column $J_k(:,t_{OM})$ that is the most strongly correlated with the linear residual $r_k = f_k + L_k d_k$, i.e.,
$$t_{OM} = \arg\max_{t \in \bar\Omega_k} \frac{|J_k(:,t)^T r_k|}{\|J_k(:,t)\|}$$
or, equivalently,
$$t_{OM} = \arg\max_{t \in \bar\Omega_k} \frac{|f_k^T (I - L_k L_k^+) J_k(:,t)|}{\|J_k(:,t)\|}, \tag{23}$$
to obtain $p_k = p_{t_{OM}}$. Following the assumptions made for MD regarding the complexity analysis, we get the complexity of OM to be $4 m n_k^2 + 2m(N - n_k)$, where the first term is the calculation of $r_k$ and $q_{t_{OM}}$ in (12) and the second is from solving the maximization problem in (23).

Let us consider (10) where we set $p(\Omega_k) = d_k$, that is,
$$\min_p \tfrac12\|p\|^2 \quad \text{s.t.} \quad \begin{cases} \min_p \tfrac12\|f_k + J_k \Pi_k^t p\|^2, \\ p(\Omega_k) = d_k. \end{cases} \tag{24}$$
Then (24) can be rewritten as
$$\min_\delta \tfrac12 \left\| \begin{pmatrix} d_k \\ \delta \end{pmatrix} \right\|^2 \quad \text{s.t.} \quad \min_\delta \tfrac12\|f_k + L_k d_k + J_k(:,t)\,\delta\|^2, \tag{25}$$
or, equivalently,
$$\min_\delta \tfrac12 \delta^2 \quad \text{s.t.} \quad \min_\delta \tfrac12\|(I - L_k L_k^+) f_k + J_k(:,t)\,\delta\|^2,$$
with the solution
$$\delta_t = -\frac{J_k(:,t)^T}{\|J_k(:,t)\|^2} (I - L_k L_k^+) f_k. \tag{26}$$
Hence, the solution to (24) is $\tilde p_t$, where $\tilde p_t(\Omega_k) = d_k$, $\tilde p_t(t) = \delta_t$ and $\tilde p_t(\bar\Omega_k \setminus \{t\}) = 0$.

Lemma 3. Let $p_t$ be a solution to (10) and $\tilde p_t$ to (24) for $t \in \bar\Omega_k$, and let $t_{OM}$ be given by (23). If there exists a descent direction among $p_t$, then $p_{t_{OM}}$ and $\tilde p_{t_{OM}}$ are descent directions. Moreover, $\tilde p_{t_{OM}}$ gives the minimum norm of $\|f_k + J_k \tilde p_t\|$, i.e.,
$$\tilde p_{t_{OM}} = \arg\min_{\tilde p_t} \|f_k + J_k \tilde p_t\|.$$

Proof. Let $P$ define the orthogonal projection on $R(L_k)^\perp$, i.e., $P = I - L_k L_k^+$. From Corollary 1, it is seen that $p_{t_{OM}}$ is a descent direction. Indeed, if $f_k \notin R(L_k)^\perp$ then any $p_t$ gives a descent. Assume that $f_k \in R(L_k)^\perp$. Let $p_{t^*}$, $t^* \in \bar\Omega_k$, be a descent direction. Hence, $J_k(:,t^*) \notin R(L_k)$, that is, $|f_k^T P J_k(:,t^*)| > 0$, which implies $|f_k^T P J_k(:,t_{OM})| > 0$.

Let $t \in \bar\Omega_k$, $a = J_k(:,t)$, and let $Q = a a^T / \|a\|^2$ define the orthogonal projection on $R(a)$. To show that $\tilde p_{t_{OM}}$ gives a descent we calculate
$$\tilde p_t^T J_k^T f_k = (d_k^T, \delta_t) \begin{pmatrix} L_k^T f_k \\ a^T f_k \end{pmatrix}.$$
Using the formulas for $d_k$ and (26) we have
$$\tilde p_t^T J_k^T f_k = -\left( f_k^T L_k L_k^+ f_k + \frac{f_k^T P a a^T f_k}{\|a\|^2} \right) \le 0,$$
as $P a a^T$ is positive semi-definite, which can be seen by looking at the eigenvalue equation $P a a^T u = \lambda u$, giving $\lambda \ge 0$. Similarly to the above, $f_k \notin R(L_k)^\perp$ implies that $\tilde p_t$ is a descent direction for any $t \in \bar\Omega_k$. Assume that this is not the case and $f_k \in R(L_k)^\perp$. Then $J_k(:,t_{OM}) \notin R(L_k)$, which implies $f_k^T P a a^T f_k / \|a\|^2 > 0$ for $a = J_k(:,t_{OM})$, and $\tilde p_{t_{OM}}$ gives a descent direction. To show that $\tilde p_{t_{OM}}$ provides the minimum norm we compute
$$\|f_k + J_k \tilde p_t\|^2 = \|P f_k - Q P f_k\|^2 = f_k^T P (I - Q) P f_k = f_k^T P f_k - f_k^T P Q P f_k = f_k^T P f_k - \frac{(f_k^T P a)^2}{\|a\|^2}.$$
The term $f_k^T P f_k$ does not depend on $a$, and the norm $\|f_k + J_k \tilde p_t\|$ reaches its minimum when $|f_k^T P a| / \|a\|$ is maximum.

From Lemma 3 it follows that one can use $p_k = \tilde p_{t_{OM}}$ instead of $p_k = p_{t_{OM}}$. However, the complexity of this approach would be only $2n_k$ less than OM. As Lemma 3 and Lemma 2 imply
$$\|f_k + J_k p_{t_{MD}}\| \le \|f_k + J_k p_{t_{OM}}\| \le \|f_k + J_k \tilde p_{t_{OM}}\|$$
and we have not seen any real advantages of this approach compared to OM, we do not consider it further.

2.3. Comparison and generalizations of OM and MD. There are some interesting common features between MD and OM. In (19) we notice that the new column is chosen so as to maximize the absolute value of the cosine of the angle between the vectors $f_k$ and $v_k^t = (I - L_k L_k^+) J_k(:,t) / \|(I - L_k L_k^+) J_k(:,t)\|$. Geometrically this means that we choose the column $J_k(:,t)$ whose projection onto $R(L_k)^\perp$ is as parallel as possible to the nonlinear residual $f_k$. In OM we instead choose $t_{OM}$ from (23), which maximizes the absolute value of the cosine of the angle between the linear residual $r_k$ and $J_k(:,t)$. This is the same orthogonal matching principle as for linear problems [4], but here applied to the linearized problem $\min_p \|f_k + J_k p\|$. From a complexity point of view the two methods are comparable if we assume that $N \gg n_k$, but if $n_k \approx m$ MD will be more expensive since the large term is $O(m^2(N - n_k))$ compared to $O(m(N - n_k))$ using OM. We note that when $n_k = \operatorname{rank}(J_k)$ no column will be added and we then choose to remain in the corresponding subspace.
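The two selection rules (19) and (23) differ only in the normalization of the score, which a short NumPy sketch makes explicit. The function name `select_column`, the guard `eps` against vanishing denominators, and the random data are illustrative assumptions. Indices are zero-based.

```python
import numpy as np

def select_column(J, f, omega, rule="MD", eps=1e-12):
    """Return (t, score) for rule (19) if rule == 'MD', else for rule (23)."""
    m, N = J.shape
    omega = list(omega)
    P = np.eye(m)
    if omega:
        L = J[:, omega]
        P = P - L @ np.linalg.pinv(L)    # projector onto R(L)^perp
    Pf = P @ f                           # linear residual r_k
    best_t, best = None, -1.0
    for t in range(N):
        if t in omega:
            continue
        a = J[:, t]
        # MD normalizes by ||P a||, OM by ||a||; the numerator is the same.
        denom = np.linalg.norm(P @ a) if rule == "MD" else np.linalg.norm(a)
        if denom <= eps:
            continue
        score = abs(Pf @ a) / denom
        if score > best:
            best_t, best = t, score
    return best_t, best

rng = np.random.default_rng(1)
J = rng.standard_normal((5, 12))
f = rng.standard_normal(5)
print(select_column(J, f, [0, 3], "MD"))
print(select_column(J, f, [0, 3], "OM"))
```

Since $\|Pa\| \le \|a\|$, the MD score is never smaller than the OM score for the same column, and by Lemma 2 the MD choice gives the smallest linearized residual among all single-column extensions.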
There are some more or less obvious variants or generalizations of MD and OM, and we mention some here. Firstly, more than one column can be added in every iteration, simplifying the algorithm and possibly making it more efficient. Secondly, the search of the columns may not be exhaustive, i.e., as soon as a column is found satisfying the criteria for being added, the search can be terminated. Specifically, this is an attractive approach for MD since only sufficient descent is necessary, not necessarily maximum descent. Finally, it is possible to iterate in the corresponding subspace at each step, possibly using a line search or any other approach.

3. Convergence properties

The global convergence is given by the following classical theorem that we state here for the sake of completeness. For the reference see Theorem 6.3.3. in [5] or Theorem 4.2.4 in [6].

Theorem 1 (Global Convergence of a Descent method). Let $F : D \subset \mathbb{R}^N \to \mathbb{R}$ be continuously differentiable on the open convex set $D$ and assume that $\nabla F$ satisfies the Lipschitz condition
$$\|\nabla F(x) - \nabla F(z)\|_2 \le \gamma \|x - z\|$$
for every $x, z \in D$ and some $\gamma > 0$. Given $x_0 \in D$, assume that the level set $\Lambda = \{x \in D : F(x) \le F(x_0)\}$ is compact. Consider the sequence $\{x_k\}$ defined by (8) with $\alpha_k$ satisfying the Armijo-Goldstein condition, and $-p_k^T \nabla F(x_k) > 0$ for all $k \in \mathbb{N}$. Then $\{x_k\} \subset \Lambda$ and
$$\lim_{k \to \infty} \frac{p_k^T \nabla F(x_k)}{\|p_k\|} = 0. \tag{27}$$

Next we show that the algorithm in Section 3.1 with $p_k$ chosen using the MD method or OM has the same convergence properties as the Gauss-Newton method for underdetermined nonlinear problems.

Lemma 4. Let $f$ be given as in (1), $x_0 \in D$, where $D \subset \mathbb{R}^N$ is a convex open set such that $\Lambda = \{x \in D : \|f(x)\| \le \|f(x_0)\|\}$ is compact. Consider the sequence $\{x_k\}$ given by (8) with the descent direction $p_k$ chosen using MD or OM, and $\alpha_k > 0$ satisfying the Armijo-Goldstein rule. If $\operatorname{rank}(J(x)) = \rho$ for all $x \in \Lambda$, then there is $k_\rho \in \mathbb{N}$ such that for $k \ge k_\rho$
$$p_k^T J_k^T f_k = -f_k^T J_k J_k^+ f_k. \tag{28}$$

Proof. Under the conditions of Theorem 1, $x_k \in \Lambda$, see 4.2.3 in [6], and thus $\operatorname{rank}(J_k) = \rho$, $k \in \mathbb{N}$. Let $a = J_k(:,t^*)$ where $t^* = t_{MD}$ or $t^* = t_{OM}$, see (19) and (23). From Lemma 1 and Corollary 1, $\operatorname{rank}(L_k) = \rho$ for all $k \ge k_\rho$ for some $k_\rho \in \mathbb{N}$, and thus $(I - L_k L_k^+) a = 0$. Hence, from (13) we have
$$p_k^T J_k^T f_k = -f_k^T L_k L_k^+ f_k. \tag{29}$$
Without loss of generality assume $J_k = (L_k, \bar L_k)$ and let $E \in \mathbb{R}^{N \times N}$ be a product of elementary matrices such that
$$J_k = (L_k, \bar L_k) = (L_k, 0)\,E.$$
Then
$$J_k J_k^+ = (L_k, 0)\,E E^{-1} (L_k, 0)^+ = (L_k, 0) \begin{pmatrix} L_k^+ \\ 0 \end{pmatrix} = L_k L_k^+,$$
which yields (28).

Notice that from Lemma 4 the algorithm becomes equivalent to the Gauss-Newton method only starting from some $k_\rho$th iterate, when we (hopefully) have already reached the vicinity of a sparse local minimum of $\tfrac12\|f\|^2$, say $x^*$. This minimum is a solution to $f(x) = 0$ if $\operatorname{rank}(J(x^*)) = m$, but this is not necessarily the case when $\operatorname{rank}(J(x^*)) < m$. In practice we exclude the convergence to a stationary point $x^*$ giving $\|f(x^*)\| > 0$ by restarting the algorithm. We also do a restart when $p_k$ fails to give a significant descent, see Section 3.1.

Let $\{x_k\}$ be generated by the Greedy Gauss-Newton method and $\{x_k\} \to x^*$ where $f(x^*) = 0$. Then the convergence rate is quadratic given $\alpha_k = 1$ in a vicinity of $x^*$, see [5]. However, from Lemma 4 this rate of convergence is only guaranteed for $k > k_\rho$. With the next proposition we show that this assumption on $k$ can be omitted.

Proposition 1 (Rate of Convergence).
Let $f$ be given as in (1) and $\hat x \in \mathbb{R}^N$ be such that $f(\hat x) = 0$. Let the sequence $\{x_k\}$ given by (8), with the descent direction $p_k$ chosen using MD or OM and $\alpha_k = 1$, converge to $\hat x$ as $k \to \infty$. If $\|p_k\| \le C \|f_k\|$ for all $k \ge K$, for some $K \in \mathbb{N}$, then $\{x_k\}$ converges to $\hat x$ quadratically.

Proof. Let $A_k(:, \Omega_k \cup \{t^*\}) = J_k(:, \Omega_k \cup \{t^*\})$ and $A_k(:, \bar\Omega_k \setminus \{t^*\}) = O$, where $t^* = t_{MD}$ or $t^* = t_{OM}$. Then $p_k = -A_k^+ f_k$ and $\|A_k^+\| \le C$. In a vicinity of $\hat x$ the Taylor expansion is valid,
$$f(\hat x) = f(x_k) + J_k(\hat x - x_k) + r(x_k) = f_k + A_k(\hat x - x_k) + r_k,$$
with $r_k = O(\|x_k - \hat x\|^2)$, as the Hessian is continuous and thus uniformly bounded in a closed neighbourhood of $\hat x$. We have
$$A_k^+ f(\hat x) = A_k^+ f_k + A_k^+ A_k(\hat x - x_k) + A_k^+ r_k.$$
Remembering that $f(\hat x) = 0$ and $A_k^+ A_k = I$ we obtain
$$x_k - \hat x = A_k^+ f_k + A_k^+ r_k.$$
Next,
$$x_{k+1} - \hat x = (x_k - \hat x) - A_k^+ f_k = A_k^+ r_k = O(\|x_k - \hat x\|^2),$$
which completes our proof.

3.1. The Greedy Gauss-Newton Algorithm in pseudocode. Below we outline the algorithm we use in our numerical tests. For the values of the constants in step 1 we refer to the numerical tests in Section 4. The parameter $k_{\max}$ stands for the maximum number of iterations (counting throughout restarts); $\varepsilon_f$, $\delta_x$, $\delta_\alpha$, tol, and grad are tolerances. In step 13 the sign $\circ$ stands for the Hadamard product, rand(N, 1) returns a vector of $N$ uniformly distributed random numbers in the interval $(0, 1)$, and prob $\in (0, 1]$. The merit function $\varphi(\alpha)$ in step 9 is given as $\varphi(\alpha) = \|f(x_k + \alpha p_k)\|_2^2 / 2$.

Greedy Gauss-Newton Algorithm
Predefined functions are $f : \mathbb{R}^N \to \mathbb{R}^m$ and the Jacobian $J(x) : \mathbb{R}^N \to \mathbb{R}^{m \times N}$, $m < N$.
 1. Input: $k_{\max}$, $\varepsilon_f$, $\delta_x$, $\delta_\alpha$, tol, grad, prob
 2. $k = 0$, $x_0 = 0$, $\Omega_0 = \emptyset$, $n_{restarts} = 0$
 3. while $\|f(x_k)\| > \varepsilon_f$ and $k < k_{\max}$
 4.     Find $t_{\max}$ from (19) if MD or (23) if OM (or any other method)
 5.     if the maximum in (19) or (23), respectively, is larger than tol
 6.         Set $\Omega_{k+1} = \Omega_k \cup \{t_{\max}\}$
        else
 7.         Set $\Omega_{k+1} = \Omega_k$
        end
 8.     Compute $p_k(\Omega_{k+1}) = -J_k(:, \Omega_{k+1})^+ f(x_k)$, with all other entries of $p_k$ equal to zero
 9.     Find $\alpha_k$ using the merit function $\varphi(\alpha)$
10.     Set $x_{k+1} = x_k + \alpha_k p_k$
11.     if $\alpha_k < \delta_\alpha$ or $\|J_k^T f_k\| / \|f_k\| <$ grad
12.         $n_{restarts} = n_{restarts} + 1$
13.         Set $x_{k+1} = (2\,\mathrm{rand}(N, 1) - 1) \circ (\mathrm{rand}(N, 1) < \mathrm{prob})$
14.         Update $\Omega_{k+1} = \{i : |x_{k+1}(i)| > \delta_x\}$
        end
15.     Update $k = k + 1$
    end
16. Update $\Omega_k = \{i : |x_k(i)| > \delta_x\}$ and set $x_k(\bar\Omega_k) = 0$
17. Output: Solution to $f(x) = 0$ or, if $k = k_{\max}$, the vector $x_{k_{\max}}$

A restart, see step 13, is performed if either the step length is too small, indicating not enough descent, or if the gradient is small while the norm of $f$ is not small, see step 11. The first case appears when the Gauss-Newton method does not converge locally, i.e., the solution has a large residual $\|f\|$ and/or a small curvature, see [7] for details. The second case for a restart may occur when the algorithm is converging to a local minimum where the norm of $f$ is not close to zero. In the next section we use $k_{\max}$ = 2, $\delta_x$ = 8, $\varepsilon_f$ = 3, $\delta_\alpha$ = 3, tol = , and grad = 6. The other constants vary for different problems and are given below.
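The pseudocode above can be turned into a compact NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function name `greedy_gauss_newton`, all default tolerances, the Armijo constant `1e-4`, the halving line search, and the small test system are placeholders chosen for the demo, and indices are zero-based.

```python
import numpy as np

def greedy_gauss_newton(f, jac, N, rule="MD", k_max=200, eps_f=1e-8,
                        delta_x=1e-8, delta_alpha=1e-3, tol=1e-10,
                        grad_tol=1e-6, prob=0.2, seed=0):
    """Sketch of the Greedy Gauss-Newton iteration; returns (x, n_restarts)."""
    rng = np.random.default_rng(seed)
    x, omega, restarts = np.zeros(N), [], 0
    for _ in range(k_max):
        fk = f(x)
        if np.linalg.norm(fk) <= eps_f:
            break
        J = jac(x)
        m = J.shape[0]
        # Step 4: score every unused column with rule (19) (MD) or (23) (OM).
        P = np.eye(m)
        if omega:
            L = J[:, omega]
            P = P - L @ np.linalg.pinv(L)
        Pf = P @ fk
        best_t, best = None, 0.0
        for t in range(N):
            if t in omega:
                continue
            a = J[:, t]
            den = np.linalg.norm(P @ a) if rule == "MD" else np.linalg.norm(a)
            if den > 1e-14 and abs(Pf @ a) / den > best:
                best_t, best = t, abs(Pf @ a) / den
        # Steps 5-7: extend the support only if the score exceeds tol.
        if best_t is not None and best > tol:
            omega = omega + [best_t]
        # Step 8: Gauss-Newton direction restricted to the support.
        p = np.zeros(N)
        if omega:
            p[omega] = -np.linalg.pinv(J[:, omega]) @ fk
        # Step 9: backtracking on the merit function (Armijo part only).
        slope = p @ (J.T @ fk)
        phi0 = 0.5 * (fk @ fk)
        alpha = 1.0 if slope < 0.0 else 0.0
        while alpha >= delta_alpha:
            ft = f(x + alpha * p)
            if 0.5 * (ft @ ft) <= phi0 + 1e-4 * alpha * slope:
                break
            alpha *= 0.5
        x_new = x + alpha * p                      # step 10
        # Steps 11-14: restart from a random sparse vector if progress stalls.
        g = np.linalg.norm(J.T @ fk) / np.linalg.norm(fk)
        if alpha < delta_alpha or g < grad_tol:
            restarts += 1
            x_new = (2.0 * rng.random(N) - 1.0) * (rng.random(N) < prob)
            omega = [int(i) for i in np.flatnonzero(np.abs(x_new) > delta_x)]
        x = x_new
    x = np.where(np.abs(x) > delta_x, x, 0.0)      # step 16
    return x, restarts

# Hypothetical mildly nonlinear system with m = 3 equations and N = 8 unknowns.
rng = np.random.default_rng(1)
m, N = 3, 8
A = rng.standard_normal((m, N))
b = rng.standard_normal(m)
f = lambda x: A @ x + 0.1 * x[:m] ** 2 - b
jac = lambda x: A + 0.2 * np.hstack([np.diag(x[:m]), np.zeros((m, N - m))])
x, n_restarts = greedy_gauss_newton(f, jac, N, rule="OM")
print(np.count_nonzero(x), np.linalg.norm(f(x)), n_restarts)
```

The sketch keeps the support fixed once `rank` many useful columns have been added, because the score of every remaining column then falls below `tol`, which mirrors the remark that the iteration then stays in the corresponding subspace.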

4. Numerical tests

We test our method on three different problems where the solution space is known. The first is a small problem that is considered in [2]. The second and the third have quadratic and exponential nonlinearities, respectively. These are large problems whose size can be changed. We illustrate the results from both a qualitative and a quantitative point of view and test the algorithm against the $\ell_1$-method described in [2].

4.1. Small test problem. Let $f$ in (1) be given as
$$f(x) = Ax + \varphi(x) - y,$$
where
A = φx) = 3.933.7.26 9.99 48.83 7.64.987 22.95 28.37.2.235 5.67.92 6.5.68.96.7.727x2)x3) + 8.39x3)x4) 684.4x4)x5) + 63.5x4)x7).949x)x2).578x)x4).32x4)x7).76x)x2).578x)x4) +.32x4)x7) x)x5) x)x4)
y =.999,.485,.567,.84,.96) T.
We run the $\ell_1$-method and both MD and OM starting with $x_0 = 0 \in \mathbb{R}^8$. It turns out that for this setup MD and OM are equivalent. All the methods converged to the same sparse solution $\hat x$ =,,,,.,.5,, ) T. After three iterations we obtained $\|f(x_3)\|$ < e 5. Below we print the matrix $X_{\ell_1} = (x_1, x_2, x_3)$, where $x_k$, $k = 1, 2, 3$, are the iterates obtained using the $\ell_1$-method,
X l =.94e 5 5.7e 6.64e 4 8.53e 5.2e 5 2.3e 7.9e 4 9.99e 5...5.5 2.38e 5.44e 4 5.83e 5 2.42e 5,
and $X = (x_1, x_2, x_3)$ with $x_k$, $k = 1, 2, 3$, obtained using OM (or MD),
X =...5.5
The matrices above give a good illustration of the difference between the two algorithms. In particular, the choice of the parameter $\delta_x$ plays a more significant role for the $\ell_1$-method than for the Greedy Gauss-Newton algorithm. Moreover, the maximum sparsity of a solution obtained by the Greedy Gauss-Newton algorithm is not greater than $m$, which cannot be guaranteed by the $\ell_1$-method.

4.2. Quadratic test problem. Consider the quadratic function
$$f(x) = A(x - \bar x) + \frac12 \begin{pmatrix} (x - \bar x)^T H_1 (x - \bar x) \\ (x - \bar x)^T H_2 (x - \bar x) \\ \vdots \\ (x - \bar x)^T H_m (x - \bar x) \end{pmatrix}, \tag{30}$$
where $A \in \mathbb{R}^{m \times N}$ and $H_i \in \mathbb{R}^{N \times N}$, $i = 1,\dots,m$. Let $s$, $n$ be such that $s < n + s \le N$ and $Q = (Q_1, Q_2)$, $Q_1 \in \mathbb{R}^{(n+s) \times n}$, $Q_2 \in \mathbb{R}^{(n+s) \times s}$, $Q^T Q = I$.

We define
$$A = (B Q_1^T,\ C) \quad \text{and} \quad H_i = \begin{pmatrix} Q_1 T_i Q_1^T & S_i \\ S_i^T & R_i \end{pmatrix},$$
where $B$, $C$, $T_i$, $S_i$, and $R_i$, $i = 1,\dots,m$, are all random matrices of the corresponding sizes whose elements are uniformly distributed in , ). We assume that $\bar x$ is $(n+s)$-sparse with the $n + s$ first elements non-zero. Let $z = x(1:n+s)$; then any $x$ such that
$$x - \bar x = \begin{pmatrix} z - \bar z \\ 0 \end{pmatrix} = \begin{pmatrix} Q_2 y \\ 0 \end{pmatrix}, \quad y \in \mathbb{R}^s, \tag{31}$$
is a solution to (30). Moreover, as one can always find $y \in \mathbb{R}^s$ such that $z = Q_2 y + \bar z$ has additional $s$ zeros, we conclude that there are solutions $x$ of sparsity $n$. The Jacobian, $J(x) \in \mathbb{R}^{m \times N}$, of $f$ is given by
$$J_{ij}(x) = a_{ij} + e_j^T H_i (x - \bar x), \quad i = 1,\dots,m,\ j = 1,\dots,N,$$
where $e_j$ is the $j$th unit vector, and $\nabla^2 f_i = H_i$. Thus, for $x$ as in (31) we obtain
$$J(x) = (B Q_1^T,\ C + J_2), \qquad J_2 = (e_j^T S_i^T Q_2 y)_{ij},$$
which most probably has rank $m$.

All the tests were run with N = , m = 2, s = 2, prob = .2 and with the constants given in Section 3.1. In Figures 1-4 we demonstrate the qualitative behaviour of the Greedy Gauss-Newton method and compare it with the $\ell_1$-method. In Figure 1 we show the results of the algorithm for solving (30) using MD with n = 6. In particular, we plot the absolute value of the solution $x$ obtained using MD, and the minus absolute value of the solution obtained using the $\ell_1$-method, in Figure 1 (left upper). The sparsity of the solution obtained by MD is equal to n = 6 and the sparsity of the solution obtained by the $\ell_1$-method is 56. In Figure 1 (left lower) one can see which columns of $J_k$ were added at each iteration step $k = 1, 2, \dots$ We plot $\|f(x_k)\|$ in logarithmic scale in Figure 1 (right upper) and the size of $\Omega_k$ in Figure 1 (right lower) at each iteration. The same test problem as in Figure 1 is then solved using OM. We display the results in Figure 2. Note that the solution with OM is not the same as the one for MD even if the sparsity is the same.

Figure 1. MD method performance for the test problem (30) with N = , m = 2, n = 6 and s = 2. Panels: $|x|$ (MD) and $-|x|$ ($\ell_1$); spy$(x_1,\dots,x_k)$; $\log\|f(x_k)\|$; size of $\Omega_k$.
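The construction of the quadratic test problem (30)-(31) can be checked numerically. The sketch below is a hedged illustration: the sizes `N, m, n, s`, the use of `rng.random` for the random blocks, and the explicit symmetrization of $T_i$ and $R_i$ (so that $H_i$ is symmetric and the stated Jacobian formula holds) are assumptions for the demo rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, n, s = 12, 4, 3, 2             # illustrative sizes only
r = n + s

# Orthogonal Q = (Q1, Q2) from a QR factorization of a random square matrix.
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))
Q1, Q2 = Q[:, :n], Q[:, n:]

def sym(M):
    """Symmetric part of a square matrix."""
    return 0.5 * (M + M.T)

B = rng.random((m, n))
C = rng.random((m, N - r))
A = np.hstack([B @ Q1.T, C])         # A = (B Q1^T, C)
H = []
for _ in range(m):
    T = sym(rng.random((n, n)))
    S = rng.random((r, N - r))
    R = sym(rng.random((N - r, N - r)))
    H.append(np.block([[Q1 @ T @ Q1.T, S], [S.T, R]]))

xbar = np.zeros(N)
xbar[:r] = rng.random(r) + 0.5       # (n+s)-sparse reference point

def f(x):
    d = x - xbar
    return A @ d + 0.5 * np.array([d @ Hi @ d for Hi in H])

def jac(x):
    d = x - xbar
    return A + np.vstack([Hi @ d for Hi in H])

# Any x of the form (31) should be a root of f.
y = rng.standard_normal(s)
x = xbar.copy()
x[:r] += Q2 @ y
print(np.linalg.norm(f(x)))          # zero up to rounding
```

The root property follows from $Q_1^T Q_2 = 0$: both the linear term $B Q_1^T Q_2 y$ and the quadratic terms $y^T Q_2^T Q_1 T_i Q_1^T Q_2 y$ vanish.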

[Figure 2 panels: absolute values of the solutions (OM above the axis, $\ell_1$-method below); spy$(x_1, \dots, x_k)$; $\log f(x_k)$; size of $\Omega_k$.]

Figure 2. MD method performance for the test problem (30) with N =, m = 2, n = 6 and s = 2.

For the chosen parameters, convergence to a sparse solution, as in Figure 1 and Figure 2, is the most common case. However, the algorithm may fail to produce a sequence that converges to a solution when started from $x_0 = 0$ (see Figure 3), or it may produce an $m$-sparse solution (as in Figure 4).

In Figure 3 (right upper) one can see an example of the case when the algorithm got stuck in a subspace with a local minimum of $\|f(x)\|^2/2$ that does not yield a solution to $f(x) = 0$. The ranks of the Jacobian at these minima are equal to 8, 9, 9, which can be seen from Figure 3 (right lower). The algorithm converged to a sparse solution after three (different) restarts. We have plotted the absolute value of the solution, and minus the absolute value of the solution of sparsity 56 obtained by the $\ell_1$-method, in Figure 3 (left upper). In Figure 3 (left lower) the subspace of the local minimum and the subspace of the solution are shown.

Finally, in Figure 4 we show the case where the algorithm does not find a sparse solution but converges to a solution of sparsity $m$, m = 2. The sparsity of the solution obtained by the $\ell_1$-method is equal to 54, see Figure 4 (lower left). Since we have not found a significant difference in the qualitative behaviour between OM and MD, we have displayed the results for the last two tests only for MD.

We would like to note that while the sparsity of solutions obtained by the Greedy Gauss-Newton method cannot exceed $m$, the $\ell_1$-method may produce a solution of sparsity even larger than $m$, which was the case for the considered test problem (30) in all our runs.

In Figures 5 and 6 we illustrate the performance of the algorithm averaged over runs where N =, and $m$ and $n$ vary as m = 8,,..., 98 and n = 2, 4,..., 6. The upper two plots in Figure 5 show that the sparsity $n$ of the solution is attained except for a curved ridge.
It has been shown in [8] that for linear problems orthogonal matching pursuit can provably recover $n$-sparse signals when $n \lesssim m/(2\log(n))$. This estimate is illustrated by the cutting plane in the figures. It is seen that MD and OM manage to find solutions that are less sparse than the estimate. In the lower right plots in Figures 5 and 6 it is seen that MD outperforms OM for most problem sizes. The number of restarts was insignificantly small for these tests.

[Figure 3 panels: absolute values of the solutions (MD above the axis, $\ell_1$-method below); spy$(x_1, \dots, x_k)$; $\log f(x_k)$; size of $\Omega_k$.]

Figure 3. MD method performance for the test problem (30) with N =, m = 2, n = 6 and s = 2.

[Figure 4 panels: absolute values of the solutions (above the axis, $\ell_1$-method below); spy$(x_1, \dots, x_k)$; $\log f(x_k)$; size of $\Omega_k$.]

Figure 4. MD method performance for the test problem (30) with N =, m = 2, n = 6 and s = 2.

[Figure 5 panels: Sparsity for MD; Sparsity for MD compared to n; Iterations for MD; Iterations for OM compared to MD. Horizontal axes: m and n.]

Figure 5. The 3D plots of the performance of MD and OM methods for the test problem (30) over the average of runs, N =, s = 2, and m = 8,,..., 98, n = 2, 4,..., 6.

[Figure 6 panels: Sparsity for MD; Sparsity for MD compared to n; Iterations for MD; Iterations for OM compared to MD. Axes: m and n.]

Figure 6. The contour plots of the performance of MD and OM methods for the test problem (30) over the average of runs, N =, s = 2, and m = 8,,..., 98, n = 2, 4,..., 6.

4.3. Exponential Test Problem. This problem is taken from [3]. Define

$$
f(x) = A e^{Bx} - b, \quad A \in \mathbb{R}^{m \times N}, \quad b \in \mathbb{R}^m, \quad e^{x} = (e^{x_1}, \dots, e^{x_N})^T, \qquad (32)
$$

where the elements of $A$ are first chosen uniformly at random in (, ), and the singular value decomposition is then used to obtain $\operatorname{rank}(A) = p$. The matrix $B$ is constructed in the following way. First, we generate an $N \times N$ random matrix whose elements are uniformly distributed in [, ]. Next, using the singular value decomposition, we modify this matrix so that its first $n + s < m$ columns have rank $n$ for some $n, s \in \mathbb{N}$. That is, $\operatorname{rank}(B(:, 1 : n+s)) = n$ and $B$ most probably has rank $N - s$. We choose $\bar{x} = (\bar{z}, 0)^T$ with some $\bar{z} \in \mathbb{R}^{n+s}$ and set $b = A \exp(B\bar{x})$. Then for any $y \in \mathbb{R}^s$,

$$
x = \bar{x} + \begin{pmatrix} V_2 y \\ 0 \end{pmatrix} \qquad (33)
$$

solves $f(x) = 0$, with $V_2 \in \mathbb{R}^{(n+s) \times s}$ such that $\mathcal{R}(V_2) = \mathcal{N}(B(:, 1 : n+s))$. From this construction it is clear that some of the solutions $x$ in (33) have sparsity $n$.

By the chain rule applied to (32), the Jacobian and second derivatives are given by

$$
J(x) = A \operatorname{diag}\left(e^{(Bx)_1}, \dots, e^{(Bx)_N}\right) B = A \operatorname{diag}(e^{Bx})\, B,
$$

$$
f_i'' = B^T \operatorname{diag}\left(a_{i1} e^{(Bx)_1}, \dots, a_{iN} e^{(Bx)_N}\right) B, \quad i = 1, \dots, m,
$$

where $a_{ij}$, $j = 1, \dots, N$, are the elements of $A$. The matrix $J(x)$ is always rank deficient. Indeed, since $A \operatorname{diag}(e^{Bx})$ has the same rank as $A$, we have

$$
\operatorname{rank}(J(x)) \le \min\{\operatorname{rank}(A), \operatorname{rank}(B)\} = \min\{p, N - s\}.
$$
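The exponential test problem can be assembled along the same lines. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the function names, the sizes, the sampling intervals ($[0, 1)$ for $A$ and $[-1, 1)$ for $B$, since the intervals are not legible in the text), and the relative threshold used for the numerical rank are all hypothetical choices. It truncates singular values to impose the ranks, takes $V_2$ from the right singular vectors of $B(:, 1 : n+s)$, and checks that points of the form (33) solve $f(x) = 0$ and that the Jacobian $A \operatorname{diag}(e^{Bx}) B$ has rank at most $\min\{p, N - s\}$.

```python
import numpy as np


def make_exponential_problem(N, m, n, s, p, seed=0):
    """Hypothetical helper: build A (rank p), B, b, V2 and the reference point x_bar."""
    rng = np.random.default_rng(seed)
    k = n + s
    # A: random m x N matrix, truncated to rank p via the SVD.
    U, sig, Vt = np.linalg.svd(rng.uniform(size=(m, N)), full_matrices=False)
    sig[p:] = 0.0
    A = (U * sig) @ Vt
    # B: random N x N matrix whose first n+s columns are truncated to rank n.
    B = rng.uniform(-1.0, 1.0, size=(N, N))
    U1, sig1, V1t = np.linalg.svd(B[:, :k], full_matrices=False)
    sig1[n:] = 0.0
    B[:, :k] = (U1 * sig1) @ V1t
    # The last s right singular vectors span the null space of B[:, :k].
    V2 = V1t[n:, :].T
    x_bar = np.zeros(N)
    x_bar[:k] = rng.uniform(0.0, 1.0, size=k)
    b = A @ np.exp(B @ x_bar)
    return A, B, b, V2, x_bar


def f(x, A, B, b):
    """Exponential test function f(x) = A exp(Bx) - b."""
    return A @ np.exp(B @ x) - b


def jacobian(x, A, B):
    """Chain rule: A diag(exp(Bx)) B, with the diagonal applied as column scaling."""
    return (A * np.exp(B @ x)) @ B


N, m, n, s, p = 30, 12, 4, 2, 8
A, B, b, V2, x_bar = make_exponential_problem(N, m, n, s, p)
y = np.array([0.3, -0.7])          # arbitrary y in R^s
x = x_bar.copy()
x[: n + s] += V2 @ y               # a point of the form (33)
rel_residual = np.linalg.norm(f(x, A, B, b)) / np.linalg.norm(b)
svals = np.linalg.svd(jacobian(x, A, B), compute_uv=False)
num_rank = int(np.sum(svals > 1e-8 * svals[0]))  # assumed relative threshold
```

A sparse member of the family (33) can then be obtained, as in the quadratic case, by choosing `y` so that `s` of the first `n + s` entries of `x` vanish.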

All the tests were run with N =, s = 2, prob = 2 + /)/ and the constants given in Section 3.. Furthermore, an additional condition for a restart, $\max_i |x_k(i)| >$ 3, is added to the condition of the if-statement in the pseudocode to prevent the iterates from diverging to infinity. In Figures 7 and 8 we illustrate the performance of the algorithm averaged over runs where N =, s = 4, and $m$, $n$ vary as m = 2, 6,..., 96, n = 2, 6,...,.

[Figure 7 panels: Sparsity for MD; Sparsity for MD compared to n; Iterations for MD; Iterations for OM compared to MD. Horizontal axes: m and n.]

Figure 7. The performance of MD and OM methods for the test problem with (32) over the average of runs, N =, s = 4, and m = 2, 6,..., 96, n = 2, 6,...,.

[Figure 8 panels: Sparsity for MD; Sparsity for MD compared to n; Iterations for MD; Iterations for OM compared to MD. Axes: m and n.]

Figure 8. The performance of MD and OM methods for the test problem with (32) over the average of runs, N =, s = 4, and m = 2, 6,..., 96, n = 2, 6,...,.

The upper right plots in Figure 5 and 8 show that the sparsity of the solution is attained very close to the estimate $n \lesssim m/(2\log(n))$ obtained for linear problems. We do not, however, have a theoretical justification of this estimate for nonlinear cases. Figure 5 (lower right) and Figure 8 (lower right) show that MD outperforms OM for all problem sizes. Restarts were more frequent for this test problem than for the quadratic test problem, see Section 4.2. However, there were a few cases, when $n$ and $m$ are large, where there was no convergence.

References

[1] Tom M. Apostol. Mathematical Analysis, 2nd ed. Addison-Wesley Series in Mathematics. Addison-Wesley, Reading, MA, 1974.
[2] Philipp Kuegler. A sparse update method for solving underdetermined systems of nonlinear equations applied to the manipulation of biological signaling pathways. SIAM Journal on Applied Mathematics, 72(4):982-1001, 2012.
[3] José Mario Martínez. Quasi-Newton methods for solving underdetermined nonlinear simultaneous equations. Journal of Computational and Applied Mathematics, 34(2):171-190, 1991.
[4] J. A. Tropp and S. J. Wright. Computational methods for sparse solution of linear inverse problems. Proceedings of the IEEE, 98(6):948-958, Jun 2010.
[5] Xiaoling Sun, Xiaojin Zheng, and Duan Li. Recent advances in mathematical programming with semi-continuous variables and cardinality constraint. Journal of the Operations Research Society of China, 1(1):55-77, 2013.
[6] Amir Beck and Nadav Hallak. On the minimization over sparse symmetric sets: Projections, optimality conditions, and algorithms. Mathematics of Operations Research, 41(1):196-223, 2016.
[7] Amir Beck and Yonina C. Eldar. Sparsity constrained nonlinear optimization: Optimality conditions and algorithms. SIAM Journal on Optimization, 23(3):1480-1509, 2013.
[8] A. Beck and Y. C. Eldar. Sparse signal recovery from nonlinear measurements. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5464-5468, May 2013.
[9] Y. Shechtman, A. Beck, and Y. C. Eldar. GESPAR: Efficient phase retrieval of sparse signals. IEEE Transactions on Signal Processing, 62(4):928-938, Feb 2014.
[10] S. Bahmani, P. Boufounos, and B. Raj. Greedy sparsity-constrained optimization. In 2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pages 1148-1152, Nov 2011.
[11] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, 1996.
[12] A. Ben-Israel and T. N. E. Greville. Generalized Inverses: Theory and Applications. CMS Books in Mathematics. Springer, 2003.
[13] Gene H. Golub and Charles F. Van Loan. Matrix Computations (4th Ed.). Johns Hopkins University Press, Baltimore, MD, USA, 2013.
[14] C. Kelley. Iterative Methods for Optimization. Society for Industrial and Applied Mathematics, 1999.
[15] J. Dennis and R. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Society for Industrial and Applied Mathematics, 1996.
[16] J. Ortega and W. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Society for Industrial and Applied Mathematics, 2000.
[17] J. Eriksson, P. A. Wedin, M. E. Gulliksson, and I. Söderkvist. Regularization methods for uniformly rank-deficient nonlinear least-squares problems. Journal of Optimization Theory and Applications, 127(1):1-26, 2005.
[18] Joel A. Tropp. On the conditioning of random subdictionaries. Applied and Computational Harmonic Analysis, 25(1):1-24, 2008.

M. Gulliksson, School of Science and Technology, Örebro University, Sweden
E-mail address: marten.gulliksson@oru.se

A. Oleynik, Department of Mathematical Sciences and Technology, Norwegian University of Life Sciences, Postboks 5003 NMBU, 1432 Ås
E-mail address: anna.oleynik@nmbu.no