Finding a sparse vector in a subspace: linear sparsity using alternating directions

Size: px

Start display at page:

Download "Finding a sparse vector in a subspace: linear sparsity using alternating directions"

Allen Pope
6 years ago
Views:

1 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 Finding a sarse vector in a subsace: linear sarsity using alternating directions Qing Qu Student Member IEEE Ju Sun Student Member IEEE and John Wright Member IEEE Abstract Is it ossible to find the sarsest vector direction in a generic subsace S R with dim S = n <? This roblem can be considered a homogeneous variant of the sarse recovery roblem and finds connections to sarse dictionary learning sarse PCA and many other roblems in signal rocessing and machine learning In this aer we focus on a lanted sarse model for the subsace: the target sarse vector is embedded in an otherwise random subsace Simle convex heuristics for this lanted recovery roblem rovably break down when the fraction of nonzero entries in the target sarse vector substantially exceeds O/ n In contrast we exhibit a relatively simle nonconvex aroach based on alternating directions which rovably succeeds even when the fraction of nonzero entries is Ω To the best of our knowledge this is the first ractical algorithm to achieve linear scaling under the lanted sarse model Emirically our roosed algorithm also succeeds in more challenging data models eg sarse dictionary learning Index Terms Sarse vector Subsace modeling Sarse recovery Homogeneous recovery Dictionary learning Nonconvex otimization Alternating direction method I INTRODUCTION Suose that a linear subsace S embedded in R contains a sarse vector x 0 0 Given an arbitrary basis of S can we efficiently recover x 0 u to scaling? Equivalently rovided a matrix A R n with NullA = S can we efficiently find a nonzero sarse vector x such that Ax = 0? In the language of sarse recovery can we solve min x x 0 st Ax = 0 x 0? I In contrast to the standard sarse recovery roblem Ax = b b 0 for which convex relaxations erform nearly otimally for broad classes of designs A 3] the comutational roerties of roblem I are not This work was artially suorted by grants ONR N NSF 3438 NSF and funding from the Moore and Sloan Foundations Q Qu J Sun and J Wright are all with the Electrical Engineering Deartment Columbia University New York NY 007 USA {qq05 js4038 jw966}@columbiaedu This aer is an extension of our revious conference version ] NullA = {x R Ax = 0} denotes the null sace of A nearly as well understood It has been known for several decades that the basic formulation min x x 0 st x S \ {0} I is NP-hard for an arbitrary subsace 4 5] In this aer we assume a secific random lanted sarse model for the subsace S: a target sarse vector is embedded in an otherwise random subsace We will show that under the secific random model roblem I is tractable by an efficient algorithm based on nonconvex otimization A Motivation The general version of Problem I in which S can be an arbitrary subsace takes several forms in numerical comutation and comuter science and underlies several imortant roblems in modern signal rocessing and machine learning Below we rovide a samle of these alications Sarse Null Sace and Matrix Sarsification: The sarse null sace roblem is finding the sarsest matrix N whose columns san the null sace of a given matrix A The roblem arises in the context of solving linear equality roblems in constrained otimization 5] null sace methods for quadratic rogramming 6] and solving underdetermined linear equations 7] The matrix sarsification roblem is of similar flavor the task is finding the sarsest matrix B which is equivalent to a given full rank matrix A under elementary column oerations Sarsity hels simlify many fundamental matrix oerations see 8] and the roblem has alications in areas such as machine learning 9] and in discovering cycle bases of grahs 0] ] discusses connections between the two roblems and also to other roblems in comlexity theory Sarse Comlete Dictionary Learning: In dictionary learning given a data matrix Y one seeks an aroximation Y AX such that A is a reresentation dictionary with certain desired structure and X collects the reresentation coefficients with maximal sarsity Such comact reresentation naturally allows signal comression and also facilitates efficient signal acquisition and classification see relevant discussion in ] When

2 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 A is required to be comlete ie square and invertible by linear algebra we have rowy = rowx 3] Then the roblem reduces to finding sarsest vectors directions in the known subsace rowy ie I Insights into this roblem have led to new theoretical develoments on comlete dictionary learning 3 5] Sarse Princial Comonent Analysis Sarse PCA: In geometric terms Sarse PCA see eg 6 8] for early develoments and 9 0] for discussion of recent results concerns stable estimation of a linear subsace sanned by a sarse basis in the data-oor regime ie when the available data are not numerous enough to allow one to decoule the subsace estimation and sarsification tasks Formally given a data matrix Z = U 0 X 0 + E 3 where Z R n collects column-wise n data oints U 0 R r is the sarse basis and E is a noise matrix one is asked to estimate U 0 u to sign scale and ermutation Such a factorization finds alications in gene exression financial data analysis and attern recognition 4] When the subsace is known say by the PCA estimator with enough data samles the roblem again reduces to instances of I and is already nontrivial 4 The full geometric sarse PCA can be treated as finding sarse vectors in a subsace that is subject to erturbation In addition variants and generalizations of the roblem I have also been studied in alications regarding control and otimization 5] nonrigid structure from motion 6] sectral estimation and Prony s roblem 7] outlier rejection in PCA 8] blind source searation 9] grahical model learning 30] and sarse coding on manifolds 3]; see also 3] and the references therein B Prior Arts Desite these otential alications of roblem I it is only very recently that efficient comutational surrogates with nontrivial recovery guarantees have been discovered for some cases of ractical interest In the context of sarse dictionary learning Sielman et al 3] introduced a convex relaxation which relaces the nonconvex roblem I with a sequence of linear rograms l /l Relaxation: min x x st xi = x S i I3 They roved that when S is generated as a san of n random sarse vectors with high robability wh the Here row denotes the row sace 3 Variants of multile-comonent formulations often add an additional orthonormality constraint on U 0 but involve a different notation of sarsity; see eg 6 3] 4 4] has also discussed this data-rich sarse PCA setting relaxation recovers these vectors rovided the robability of an entry being nonzero is at most θ O / n In the lanted sarse model in which S is formed as direct sum of a single sarse vector x 0 and a generic subsace Hand and Demanet roved that I3 also correctly recovers x 0 rovided the fraction of nonzeros in x 0 scales as θ O / n 4] One might imagine imroving these results by tightening the analyses Unfortunately the results of 3 4] are essentially shar: when θ substantially exceeds Ω/ n in both models the relaxation I3 rovably breaks down Moreover the most natural semidefinite rogramming SDP relaxation of I min X st A A X = 0 tracex] = I4 X 0 also breaks down at exactly the same threshold of θ O/ n 5 One might naturally conjecture that this / n threshold is simly an intrinsic rice we must ay for having an efficient algorithm even in these random models Some evidence towards this conjecture might be borrowed from the suerficial similarity of I-I4 and sarse PCA 6] In sarse PCA there is a substantial ga between what can be achieved with currently available efficient algorithms and the information theoretic otimum 9 33] Is this also the case for recovering a sarse vector in a subsace? Is θ O / n simly the best we can do with efficient guaranteed algorithms? Remarkably this is not the case Recently Barak et al introduced a new rounding technique for sum-of-squares relaxations and showed that the sarse vector x 0 in the lanted sarse model can be recovered when Ω n and θ = Ω 34] It is erhas surrising that this is ossible at all with a olynomial time algorithm Unfortunately the runtime of this aroach is a highdegree olynomial in see Table I; for machine learning roblems in which is often either the feature dimension or the samle size this algorithm is mostly of theoretical interest only However it raises an interesting algorithmic question: Is there a ractical algorithm that rovably recovers a sarse vector with θ / n ortion of nonzeros from a generic subsace S? C Contributions and Recent Develoments In this aer we address the above roblem under the lanted sarse model We allow x 0 to have u to θ 0 nonzero entries where θ 0 0 is a constant 5 This breakdown behavior is again in shar contrast to the standard sarse aroximation roblem with b 0 in which it is ossible to handle very large fractions of nonzeros say θ = Ω/ log n or even θ = Ω using a very simle l relaxation 3]

3 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 3 TABLE I COMPARISON OF EXISTING METHODS FOR RECOVERING A PLANTED SPARSE VECTOR IN A SUBSPACE Method Recovery Condition Time Comlexity 6 l /l Relaxation 4] θ O/ n On 3 log/ε SDP Relaxation θ O/ n O 35 log /ε SOS Relaxation 34] Ωn θ O O 7 log/ε 7 Sectral Method 35] Ωn oly logn θ O O n log/ɛ This work Ωn 4 log n θ O On 5 log n + n 3 log/ε We rovide a relatively simle algorithm which wh exactly recovers x 0 rovided that Ω n 4 log n A comarison of our results with existing methods is shown in Table I After initial submission of our aer Hokins et al 35] roosed a different simle algorithm based on the sectral method This algorithm guarantees recovery of the lanted sarse vector also u to linear sarsity whenever Ωn olylogn and comes with better time comlexity 8 Our algorithm is based on alternating directions with two secial twists First we introduce a secial data driven initialization which seems to be imortant for achieving θ = Ω Second our theoretical results require a second linear rogramming based rounding hase which is similar to 3] Our core algorithm has very simle iterations of linear comlexity in the size of the data and hence should be scalable to moderateto-large scale roblems Besides enjoying the θ Ω guarantee that is out of the reach of revious ractical algorithms our algorithm erforms well in simulations emirically succeeding with Ω n olylogn It also erforms well emirically on more challenging data models such as the comlete dictionary learning model in which the subsace of interest contains not one but n random target sarse vectors This is encouraging as breaking the O/ n sarsity barrier with a ractical algorithm and otimal guarantee is an imortant roblem in theoretical dictionary learning 36 40] In this regard our recent work 5] resents an efficient algorithm based on Riemannian otimization that guarantees recovery u to linear sarsity However the result is based on different ideas: a different nonconvex formulation otimization algorithm and analysis methodology 8 Desite these imroved guarantees in the lanted sarse model our method still roduces more aealing results on real imagery data see Section V-B for examles D Paer Organization Notations and Reroducible Research The rest of the aer is organized as follows In Section II we rovide a nonconvex formulation and show its caability of recovering the sarse vector Section III introduces the alternating direction algorithm In Section IV we resent our main results and sketch the roof ideas Exerimental evaluation of our method is rovided in Section V We conclude the aer by drawing connections to related work and discussing otential imrovements in Section VI Full roofs are all deferred to the aendix sections For a matrix X we use x i and x j to denote its i-th column and j-th row resectively all in column vector form Moreover we use xi to denote the i-th comonent of a vector x We use the comact notation k] = { k} for any ositive integer k and use c or C and their indexed versions to denote absolute numerical constants The scoe of these constants are always local namely within a articular lemma roosition or roof such that the aarently same constant in different contexts may carry different values For robability events sometimes we will just say the event holds with high robability wh if the robability of failure is dominated by κ for some κ > 0 The codes to reroduce all the figures and exerimental results can be found online at: htts://githubcom/sunju/sv II PROBLEM FORMULATION AND GLOBAL OPTIMALITY We study the roblem of recovering a sarse vector x 0 0 u to scale which is an element of a known subsace S R of dimension n rovided an arbitrary orthonormal basis Y R n for S Our starting oint is the nonconvex formulation I Both the objective and the constraint set are nonconvex and hence it is not easy to otimize over We relax I by relacing the l 0 norm with the l norm For the constraint x 0 since in most alications we only care about the solution u to

4 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 4 scaling it is natural to force x to live on the unit shere S n giving min x x st x S x = II This formulation is still nonconvex and for general nonconvex roblems it is known to be NP-hard to find even a local minimizer 4] Nevertheless the geometry of the shere is benign enough such that for well-structured inuts it actually will be ossible to give algorithms that find the global otimizer The formulation II can be contrasted with I3 in which effectively we otimize the l norm subject to the constraint x = : because the set {x : x = } is olyhedral the l -constrained roblem immediately yields a sequence of linear rograms This is very convenient for comutation and analysis However it suffers from the aforementioned breakdown behavior around x 0 0 / n In contrast though the shere x = is a more comlicated geometric constraint it will allow much larger number of nonzeros in x 0 Indeed if we consider the global otimizer of a reformulation of II: min Yq q R n st q = II where Y is any orthonormal basis for S the sufficient condition that guarantees exact recovery under the lanted sarse model for the subsace is as follows: Theorem II l /l recovery lanted sarse model There exists a constant θ 0 > 0 such that if the subsace S follows the lanted sarse model S = san x 0 g g n R where g i iid N 0 I and x 0 iid θ Berθ are all jointly indeendent and / n < θ < θ 0 then the unique u to sign otimizer q to II for any orthonormal basis Y of S roduces Yq = ξx 0 for some ξ 0 with robability at least c rovided Cn Here c and C are ositive constants Hence if we could find the global otimizer of II we would be able to recover x 0 whose number of nonzero entries is quite large even linear in the dimension θ = Ω On the other hand it is not obvious that this should be ossible: II is nonconvex In the next section we will describe a simle heuristic algorithm for aroximately solving a relaxed version of the l /l roblem II More surrisingly we will then rove that for a class of random roblem instances this algorithm lus an auxiliary rounding technique actually recovers the global otimizer the target sarse vector x 0 The roof requires a detailed robabilistic analysis which is sketched in Section IV-B Before continuing it is worth noting that the formulation II is in no way novel see eg the work of 9] in blind source searation for recedent However our algorithms and subsequent analysis are novel III ALGORITHM BASED ON ALTERNATING DIRECTION METHOD ADM To develo an algorithm for solving II it is useful to consider a slight relaxation of II in which we introduce an auxiliary variable x Yq: min qx fq x = Yq x + λ x st q = III Here λ > 0 is a enalty arameter It is not difficult to see that this roblem is equivalent to minimizing the Huber M-estimator over Yq This relaxation makes it ossible to aly the alternating direction method to this roblem This method starts from some initial oint q 0 alternates between otimizing with resect to wrt x and otimizing wrt q: x k+ = arg min x q k+ = arg min q S n Yq k x + λ x III Yq x k+ III3 where x k and q k denote the values of x and q in the k-th iteration Both III and III3 have simle closed form solutions: x k+ = S λ Yq k ] q k+ = Y x k+ Y x k+ III4 where S λ x] = signx max { x λ 0} is the softthresholding oerator The roosed ADM algorithm is summarized in Algorithm Algorithm Nonconvex ADM Alogrithm Inut: A matrix Y R n with Y Y = I initialization q 0 threshold arameter λ > 0 Outut: The recovered sarse vector ˆx 0 = Yq k : for k = 0 O n 4 log n do : x k+ = S λ Yq k ] 3: q k+ = Y x k+ Y x k+ 4: end for The algorithm is simle to state and easy to imlement However if our goal is to recover the sarsest vector x 0 some additional tricks are needed Initialization Because the roblem II is nonconvex an arbitrary or random initialization may not roduce a

5 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 5 global minimizer 9 In fact good initializations are critical for the roosed ADM algorithm to succeed in the linear sarsity regime For this urose we suggest using every normalized row of Y as initializations for q and solving a sequence of nonconvex rograms II by the ADM algorithm To get an intuition of why our initialization works recall the lanted sarse model S = sanx 0 g g n and suose Y = x 0 g g n ] R n III5 If we take a row y i of Y in which x 0 i is nonzero then x 0 i = Θ / θ Meanwhile the entries of g i g n i are all N 0 / and so their magnitude have size about / Hence when θ is not too large x 0 i will be somewhat bigger than most of the other entries in y i Put another way y i is biased towards the first standard basis vector e Now under our robabilistic model assumtions Y is very well conditioned: Y Y I 0 Using the Gram-Schmidt rocess we can find an orthonormal basis Y for S via: Y = YR III6 where R is uer triangular and R is itself wellconditioned: R I Since the i-th row y i of Y is biased in the direction of e and R is well-conditioned the i-th row y i of Y is also biased in the direction of e In other words with this canonical orthobasis Y for the subsace the i-th row of Y is biased in the direction of the global otimizer The heuristic arguments are made rigorous in Aendix B and Aendix D What if we are handed some other basis Ŷ = YU where U is an arbitary orthogonal matrix? Suose q is a global otimizer to II with the inut matrix Y then it is easy to check that U q is a global otimizer to II with the inut matrix Ŷ Because YU e i U q = Y e i q our initialization is invariant to any rotation of the orthobasis Hence even if we are handed an arbitrary orthobasis for S the i-th row is still biased in the direction of the global otimizer Rounding by linear rogramming LP Let q denote the outut of Algorithm As illustrated in Fig we 9 More recisely in our models random initialization does work but only when the subsace dimension n is extremely low comared to the ambient dimension 0 This is the common heuristic that tall random matrices are well conditioned 4] QR decomosition in general with restriction that R = will rove that with our articular initialization and an aroriate choice of λ ADM algorithm uniformly moves towards the otimal over a large ortion of the shere and its solution falls within a certain small radius of the globally otimal solution q to II To exactly recover q or equivalently to recover the exact sarse vector x 0 = γyq for some γ 0 we solve the linear rogram min q Yq st r q = III7 with r = q Since the feasible set {q q q = } is essentially the tangent sace of the shere S n at q whenever q is close enough to q one should exect that the otimizer of III7 exactly recovers q and hence x 0 u to scale We will rove that this is indeed true under aroriate conditions IV MAIN RESULTS AND SKETCH OF ANALYSIS A Main Results In this section we describe our main theoretical result which shows that wh the algorithm described in the revious section succeeds Theorem IV Suose that S obeys the lanted sarse model and let the columns of Y form an arbitrary orthonormal basis for the subsace S Let y y R n denote the transoses of the rows of Y Aly Algorithm with λ = / using initializations q 0 = y / y y / y to roduce oututs q q Solve the linear rogram III7 with r = q q to roduce q q Set i arg min i Y q i Then Y q i = γx 0 IV for some γ 0 with robability at least c rovided Cn 4 log n and n θ θ 0 Here C c and θ 0 are ositive constants IV Remark IV We can see that the result in Theorem IV is subotimal in samle comlexity comared to the global otimality result in Theorem II and Barak et al s result 34] and the subsequent work 35] For successful recovery we require Ω n 4 log n while the global otimality and Barak et al demand Ω n and Ω n resectively Aside from ossible deficiencies in our current analysis comared to Barak et al we believe this is still the first ractical and efficient method which is guaranteed to achieve θ Ω rate The lower bound on θ in Theorem IV is mostly for convenience in

6 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 6 the roof; in fact the LP rounding stage of our algorithm already succeeds wh when θ O / n B A Sketch of Analysis In this section we briefly sketch the main ideas of roving our main result in Theorem IV to show that the initialization + ADM + LP rounding ieline recovers x 0 under the stated technical conditions as illustrated in Fig The roof of our main result requires rather detailed technical analysis of the iteration-by-iteration roerties of Algorithm most of which is deferred to the aendices As noted in Section III the ADM algorithm is invariant to change of basis So wlog let us assume Y = x 0 g g n ] and let Y to be its orthogonalization ie x0 ] / Y = P x 0 x 0 G G P x 0 G IV3 When is large Y is nearly orthogonal and hence Y is very close to Y Thus in our roofs whenever convenient we make the arguments on Y first and then roagate the quantitative results onto Y by erturbation arguments With that noted let y y be the transose of the rows of Y and note that these are all indeendent random vectors To rove the result of Theorem IV we need the following results First given the secified Y we show that our initialization is biased towards the global otimum: Proosition IV3 Good initialization Suose θ > / n and Cn It holds with robability at least c that at least one of our initialization vectors suggested in Section III say q 0 i = y i / y i obeys y i y i e 0 θn IV4 Here C c are ositive constants Proof See Aendix D Second we define a vector-valued random rocess Qq on q S n via Qq = y i S λ q y i] IV5 i= Note that with robability one the inverse matrix square-root in Y is well defined So Y is well defined wh ie excet for x 0 = 0 See more quantitative characterization of Y in Aendix B so that based on III4 one ste of the ADM algorithm takes the form: q k+ = Q q k Q q k IV6 This is a very favorable form for analysis: the term in the numerator Q q k is a sum of indeendent random vectors with q k viewed as fixed We study the behavior of the iteration IV6 through the random rocess Q q k We want to show that wh the ADM iterate sequence q k converges to some small neighborhood of ±e so that the ADM algorithm lus the LP rounding described in Section III successfully retrieves the sarse vector x 0 / x 0 = Ye Thus we hoe that in general Qq is more concentrated on the first coordinate than q S n Let us artition the vector q as q = q ; q ] with q R and q R n ; and corresondingly Qq = Q q; Q q] The inner roduct of Qq/ Qq and e is strictly larger than the inner roduct of q and e if and only if Q q q > Q q q In the following roosition we show that wh this inequality holds uniformly over a significant ortion of the shere Γ = { q S n 0 nθ q 3 θ q } 0 IV7 so the algorithm moves in the correct direction Let us define the ga Gq between the two quantities Q q / q and Q q / q as Gq = Q q q Q q q IV8 and we show that the following result is true: Proosition IV4 Uniform lower bound for finite samle ga There exists a constant θ 0 0 such that when Cn 4 log n the estimate inf Gq q Γ 0 4 θ n holds with robability at least c rovided θ / n θ 0 Here C c are ositive constants Proof See Aendix E Next we show that whenever q 3 θ wh the iterates stay in a safe region with q θ which is enough for LP rounding III7 to succeed Proosition IV5 Safe region for rounding There exists

7 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 7 No Jum Away from the Ca e 3 θ θ Gq = jq qj jq j jq qj jjqqjj > θ Otimizer q? jjq qjj jjq jj LP Rounding Succeeds > C θ n Stoing Point q Uniform Progress by ADM Algorithm Initializer q 0 0 θn O e? Fig An illustration of the roof sketch for our ADM algorithm a constant θ 0 0 such that when Cn 4 log n it holds with robability at least c that Q q Qq θ for all q S n satisfying q > 3 θ rovided θ / n θ 0 Here C c are ositive constants Proof See Aendix F In addition the following result shows that the number of iterations for the ADM algorithm to reach the safe region can be bounded grossly by On 4 log n wh Proosition IV6 Iteration comlexity of reaching the safe region There is a constant θ 0 0 such that when Cn 4 log n it holds with robability at least c that the ADM algorithm in Algorithm with any initialization q 0 S n satisfying q 0 0 will θn roduce some iterate q with q > 3 θ at least once in at most On 4 log n iterations rovided θ / n θ 0 Here C c are ositive constants Proof See Aendix G Moreover we show that the LP rounding III7 with inut r = q exactly recovers the otimal solution wh whenever the ADM algorithm returns a solution q with first coordinate q > θ Proosition IV7 Success of rounding There is a constant θ 0 0 such that when Cn the following holds with robability at least c rovided θ / n θ 0 : Suose the inut basis is Y defined in IV3 and the ADM algorithm roduces an outut q S n with q > θ Then the rounding rocedure with r = q returns the desired solution ±e Here C c are ositive constants Proof See Aendix H Finally given Cn 4 log n for a sufficiently large constant C we combine all the results above to comlete the roof of Theorem IV Proof of Theorem IV Wlog let us again first consider Y as defined in III5 and its orthogonalization Y in a natural/canonical form IV3 We show that wh our algorithmic ieline described in Section III exactly recovers the otimal solution u to scale via the following argument: Good initializers Proosition IV3 shows that wh at least one of the initialization vectors say q 0 i = y i / y i obeys q 0 i e 0 θn which imlies that q 0 i is biased towards the global otimal solution Uniform rogress away from the equator By Proosition IV4 for any θ / n θ 0 with a constant θ 0 0 Gq = Q q q Q q q 0 4 θ n IV9 holds uniformly for all q S n in the region 0 q θn 3 θ wh This imlies that with an inut q 0 such that q 0 the ADM 0 θn algorithm will eventually obtain a oint q k for

8 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 8 which q k 3 θ if sufficiently many iterations are allowed 3 No jums away from the cas Proosition IV5 shows that for any θ / n θ 0 with a constant θ 0 0 wh Q q Qq θ holds for all q S n with q 3 θ This imlies that once q k 3 θ for some iterate k all the future iterates roduced by the ADM algorithm stay in a sherical ca region around the otimum with q θ 4 Location of stoing oints As shown in Proosition IV6 wh the strictly ositive ga Gq in IV9 ensures that one needs to run at most O n 4 log n iterations to first encounter an iterate q k such that q k 3 θ Hence the stes above imly that wh Algorithm fed with the roosed initialization scheme successively roduces iterates q S n with its first coordinate q θ after O n 4 log n stes 5 Rounding succeeds when r θ Proosition IV7 roves that wh the LP rounding III7 with an inut r = q roduces the solution ±x 0 u to scale Taken together these claims imly that from at least one of the initializers q 0 the ADM algorithm will roduce an outut q which is accurate enough for LP rounding to exactly return x 0 / x 0 On the other hand our l /l otimality theorem Theorem II imlies that ±x 0 are the unique vectors with the smallest l norm among all unit vectors in the subsace Since wh x 0 / x 0 is among the unit vectors q q our row initializers finally roduce our minimal l norm selector will successfully locate x 0 / x 0 vector For the general case when the inut is an arbitrary orthonormal basis Ŷ = YU for some orthogonal matrix U the target solution is U e The following technical ieces are erfectly arallel to the argument above for Y Discussion at the end of Aendix D imlies that wh at least one row of Ŷ rovides an initial oint q 0 such that q 0 U e 0 θn Discussion following Proosition IV4 in Aendix E indicates that for all q such that 0 θn q U e 3 θ there is a strictly ositive ga indicating steady rogress towards a oint q k such that q k U e 3 θ 3 Discussion at the end of Aendix F imlies that once q satisfies q U e the next iterate will not move far away from the target: Q q; Ŷ Ŷ / Q q; U e θ 4 Reeating the argument in Aendix G for general inut Ŷ shows it is enough to run the ADM algorithm O n 4 log n iterations to cross the range 0 q U e θn 3 θ So the argument above together dictates that with the roosed initialization wh the ADM algorithm roduces an outut q that satisfies q U e θ if we run at least O n 4 log n iterations 5 Since the ADM returns q satisfying q R e θ discussion at the end of Aendix H imlies that we will obtain a solution q = ±U e u to scale as the otimizer of the rounding rogram exactly the target solution Hence we comlete the roof Remark IV8 Under the lanted sarse model in ractice the ADM algorithm with the roosed initialization converges to a global otimizer of III that correctly recovers x 0 In fact simle calculation shows such desired oint for successful recovery is indeed the only critical oint of III near the ole in Fig Unfortunately using the current analytical framework we did not succeed in roving such convergence in theory Proosition IV5 and IV6 imly that after On 4 log n iterations however the ADM sequence will stay in a small neighborhood of the target Hence we roosed to sto after On 4 log n stes and then round the outut using the LP that rovable recover the target as imlied by Proosition IV5 and IV7 So the LP rounding rocedure is for the urose of comleting the theory and seems not necessary in ractice We susect alternative analytical strategies such as the geometrical analysis that we will discuss in Section VI can likely get around the artifact V EXPERIMENTAL RESULTS In this section we show the erformance of the roosed ADM algorithm on both synthetic and real datasets On the synthetic dataset we show the hase transition of our algorithm on both the lanted sarse and the dictionary learning models; for the real dataset we demonstrate how seeking sarse vectors can hel discover interesting atterns on face images A Phase Transition on Synthetic Data For the lanted sarse model for each air of k we generate the n dimensional subsace S R by direct sum of x 0 and G: x 0 R is a k-sarse vector with uniformly random suort and all nonzero entries

9 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 9 equal to and G R n is an iid Gaussian matrix distributed by N 0 / So one basis Y of the subsace S can be constructed by Y = GS x 0 G] U where GS denotes the Gram-Schmidt orthonormalization oerator and U R n n is an arbitrary orthogonal matrix For each we set the regularization arameter in III as λ = / use all the normalized rows of Y as initializations of q for the roosed ADM algorithm and run the alternating stes for 0 4 iterations We determine the recovery to be successful whenever x 0 / x 0 Yq 0 for at least one of the trials we set the tolerance relatively large as we have shown that LP rounding exactly recovers the solutions with aroximate inut To determine the emirical recovery erformance of our ADM algorithm first we fix the relationshi between n and as = 5n log n and lot out the hase transition between k and Next we fix the sarsity level θ = 0 or k = 0 and lot out the hase transition between and n For each air of k or n we reeat the simulation for 0 times Fig shows both hase transition lots We also exeriment with the comlete dictionary learning model as in 3] see also 5] Secifically the observation is assumed to be Y = A 0 X 0 where A 0 is a square invertible matrix and X 0 a n sarse matrix Since A 0 is invertible the row sace of Y is the same as that of X 0 For each air of k n we generate X 0 = x x n ] where each vector x i R is k-sarse with every nonzero entry following iid Gaussian distribution and construct the observation by Y = GS X 0 U We reeat the same exeriment as for the lanted sarse model described above The only difference is that here we determine the recovery to be successful as long as one sarse row of X 0 is recovered by one of those rograms Fig 3 shows both hase transition lots Fig a and Fig 3a suggest our ADM algorithm could work into the linear sarsity regime for both models rovided Ωn log n Moreover for both models the log n factor seems necessary for working into the linear sarsity regime as suggested by Fig b and Fig 3b: there are clear nonlinear transition boundaries between success and failure regions For both models On log n samle requirement is near otimal: for the lanted sarse model obviously Ωn is necessary; for the comlete dictionary learning model 3] roved that Ωn log n is required for exact recovery For the lanted sarse model our result Ωn 4 log n is far from this much lower emirical requirement Fig b further suggests that alternative reformulation and algorithm are needed to solve II so that the otimal recovery guarantee as deicted in Theorem II can be obtained B Exloratory Exeriments on Faces It is well known in comuter vision that the collection of images of a convex object only subject to illumination changes can be well aroximated by a low-dimensional subsaces in raw-ixel sace 43] We will lay with face subsaces here First we extract face images of one erson 65 images under different illumination conditions Then we aly robust rincial comonent analysis 44] to the data and get a low dimensional subsace of dimension 0 ie the basis Y R We aly the ADM + LP algorithm to find the sarsest elements in such a subsace by randomly selecting 0% rows of Y as initializations for q We judge the sarsity in the l /l sense that is the sarsest vector x 0 = Yq should roduce the smallest Yq / Yq among all results Once some sarse vectors are found we roject the subsace onto orthogonal comlement of the sarse vectors already found 3 and continue the seeking rocess in the rojected subsace Fig 4To shows the first four sarse vectors we get from the data We can see they corresond well to different extreme illumination conditions We also imlemented the sectral method with the LP ost-rocessing roosed in 35] for comarison under the same rotocol The result is resented as Fig 4Bottom: the ratios l / l are significantly higher and the ratios l 4 / l this is the metric to be maximized in 35] to romote sarsity are significantly lower By these two criteria the sectral method with LP rounding consistently roduces vectors with higher sarsity levels under our evaluation rotocol Moreover the resulting images are harder to interret hysically Second we manually select ten different ersons faces under the normal lighting condition Again the dimension of the subsace is 0 and Y R We reeat the same exeriment as stated above Fig 5 shows four sarse vectors we get from the data Interestingly the sarse vectors roughly corresond to differences of face images concentrated around facial arts that different eole tend to differ from each other eg eye brows forehead hair nose etc By comarison the vectors returned by the sectral method 35] are relatively denser and the sarsity atterns in the images are less structured hysically In sum our algorithm seems to find useful sarse vectors for otential alications such as eculiarity discovery in first setting and locating differences in second setting Nevertheless the main goal of this 3 The idea is to build a sarse orthonormal basis for the subsace in a greedy manner

IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 0 Fig Phase transition for the lanted sarse model using the ADM algorithm: a with fixed relationshi between and n: = 5n log n; b with fixed

and n: = 5n log n; b with fixed relationshi between and k: k = 0 White indicates success and black indicates failure exeriment is to invite readers to think about similar attern discovery roblems

and in roducing meaningful results even beyond the idealized lanted sarse model that we adoted for analysis VI CONNECTIONS AND DISCUSSION For the lanted sarse model there is a substantial erformance

10 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 0 Fig Phase transition for the lanted sarse model using the ADM algorithm: a with fixed relationshi between and n: = 5n log n; b with fixed relationshi between and k: k = 0 White indicates success and black indicates failure Fig 3 Phase transition for the dictionary learning model using the ADM algorithm: a with fixed relationshi between and n: = 5n log n; b with fixed relationshi between and k: k = 0 White indicates success and black indicates failure exeriment is to invite readers to think about similar attern discovery roblems that might be cast as the roblem of seeking sarse vectors in a subsace The exeriment also demonstrates in a concrete way the racticality of our algorithm both in handling data sets of realistic size and in roducing meaningful results even beyond the idealized lanted sarse model that we adoted for analysis VI CONNECTIONS AND DISCUSSION For the lanted sarse model there is a substantial erformance ga in terms of -n relationshi between the our otimality theorem Theorem II emirical simulations and guarantees we have obtained via efficient algorithm Theorem IV More careful and tighter analysis based on decouling 45] and chaining 46 47] and geometrical analysis described below can robably hel bridge the ga between our theoretical and emirical results Matching the theoretical limit deicted in Theorem II seems to require novel algorithmic ideas The random models we assume for the subsace can be extended to other random models articularly for dictionary learning where all the bases are sarse eg Bernoulli-Gaussian random model This work is art of a recent surge of research efforts on deriving rovable and ractical nonconvex algorithms to central roblems in modern signal rocessing and machine learning These roblems include low-rank matrix recovery/comletion 48 56] tensor recovery/decomosition 57 6] hase retrieval 6 65] dictionary learning ] and so on 4 Our aroach like the others is to start with a carefully chosen roblem-secific initialization and then erform a local analysis of the subsequent iterates to guarantee 4 The webage htt://sunjuorg/research/nonconvex/ maintained by the second author contains ointers to the growing list of work in this direction

IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 Fig 4 The first four sarse vectors extracted for one erson in the Yale B database under different illuminations To by our ADM algorithm; Bottom

11 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 Fig 4 The first four sarse vectors extracted for one erson in the Yale B database under different illuminations To by our ADM algorithm; Bottom by the seeding-u SOS algorithm roosed in 35] Fig 5 The first four sarse vectors extracted for 0 ersons in the Yale B database under normal illuminations To by our ADM algorithm; Bottom by the seeding-u SOS algorithm roosed in 35]

12 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 convergence to a good solution In comarison our subsequent work on comlete dictionary learning 5] and generalized hase retrieval 65] has taken a geometrical aroach by characterizing the function landscae and designing efficient algorithm accordingly The geometric aroach has allowed rovable recovery via efficient algorithms with an arbitrary initialization The article 66] summarizes the geometric aroach and its alicability to several other roblems of interest A hybrid of the initialization and the geometric aroach discussed above is likely to be a owerful comutational framework To see it in action for the current lanted sarse vector roblem in Fig 6 we rovide the asymtotic function landscae ie of the Huber loss on the shere S aka the relaxed formulation we tried to solve III It is clear that with an initialization that is biased towards either the north or the south ole we are situated in a region where the gradients are always nonzero and oints to the favorable directions such that many reasonable otimization algorithms can take the gradient information and make steady rogress towards the target This will robably ease the algorithm develoment and analysis and hel yield tight erformance guarantees We rovide a very efficient algorithm for finding a sarse vector in a subsace with strong guarantee Our algorithm is ractical for handling large datasets in the exeriment on the face dataset we successfully extracted some meaningful features from the human face images However the otential of seeking sarse/structured element in a subsace seems largely unexlored desite the cases we mentioned at the start We hoe this work could insire more alication ideas ACKNOWLEDGEMENT We thank the anonymous reviewers for their valuable comments and constructive criticism JS thanks the Wei Family Private Foundation for their generous suort We thank Cun Mu IEOR Deartment of Columbia University for helful discussion and inut regarding this work We thank the anonymous reviewers for their constructive comments that heled imrove the manuscrit This work was artially suorted by grants ONR N NSF 3438 NSF and funding from the Moore and Sloan Foundations APPENDIX A TECHNICAL TOOLS AND PRELIMINARIES In this aendix we record several lemmas that are useful for our analysis Lemma A Let ψx and Ψx to denote the robability density function df and the cumulative distribution function cdf for the standard normal distribution: Standard Normal df ψx = π ex Standard Normal cdf Ψx = π x t t { x ex } { t Suose a random variable X N 0 with the df f x = ψ x then for any t > t we have t t t f xdx = Ψ Ψ xf xdx = ψ t x f xdx = Ψ t t t ψ t ψ Ψ t t t t t ψ ] ] t ] Lemma A Taylor Exansion of Standard Gaussian cdf and df Assume ψx and Ψx be defined as above There exists some universal constant C ψ > 0 such that for any x 0 x R ψx ψx 0 x 0 ψ x 0 x x 0 ] C ψ x x 0 Ψx Ψx 0 + ψx 0 x x 0 ] C ψ x x 0 Lemma A3 Matrix Induced Norms For any matrix A R n the induced matrix norm from l l q is defined as A l l = q su Ax q x = } dt In articular let A = a a n ] = a a ] we have A l l = su a k x A l l = max a k x = k= AB l l r A l q l r B l l q and B is any matrix of size comatible with A k Lemma A4 Moments of the Gaussian Random Variable If X N 0 X then it holds for all integer m that ] E X m ] = X m m!! π m=k+ + m=k m X m!! k = m/ Lemma A5 Moments of the χ Random Variable If X χ n ie X = x for x N 0 I then it

IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 3 Fig 6 Function landscae of fq with θ = 04 for n = 3 Left fq over the shere S Note that near the sherical cas around the north and south oles

obtained via the rearameterization qw = w; w ] Corresonding to the left there is no undesired critical oint around 0 within a large radius holds for all integer m that E X m m/ Γ m/ + n/ ] = m!

13 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 3 Fig 6 Function landscae of fq with θ = 04 for n = 3 Left fq over the shere S Note that near the sherical cas around the north and south oles there are no critical oints and the gradients are always nonzero; Right Projected function landscae by rojecting the uer hemishere onto the equatorial lane Mathematically the function gw : e 3 R obtained via the rearameterization qw = w; w ] Corresonding to the left there is no undesired critical oint around 0 within a large radius holds for all integer m that E X m m/ Γ m/ + n/ ] = m!! n m/ Γ n/ Lemma A6 Moments of the χ Random Variable If X χ n ie X = x for x N 0 I then it holds for all integer m that E X m m Γ m + n/ ] = Γ n/ m = n + k m! nm k= Lemma A7 Moment-Control Bernstein s Inequality for Random Variables 67] Let X X be iid realvalued random variables Suose that there exist some ositive numbers R and X such that E X k m ] m! XR m for all integers m Let S = k= X k then for all t > 0 it holds that t P S E S] t] ex X + Rt Lemma A8 Moment-Control Bernstein s Inequality for Random Vectors 5] Let x x R d be iid random vectors Suose there exist some ositive number R and X such that E x k m ] m! XR m for all integers m Let s = k= x k then for any t > 0 it holds that t P s E s] t] d + ex X + Rt Lemma A9 Gaussian Concentration Inequality Let x N 0 I Let f : R R be an L-Lischitz function Then we have for all t > 0 that P fx EfX t] ex t L Lemma A0 Bounding Maximum Norm of Gaussian Vector Sequence Let x x n be a sequence of not necessarily indeendent standard Gaussian vectors in R n It holds that P max x i > n + ] logn n 3 i n ] Proof Since the function is -Lischitz by Gaussian concentration inequality for any i n ] we have ] P x i E x i > t P x i E x i > t] ex t for all t > 0 Since E x i = n by a simle union bound we obtain P max x i > ] n + t ex t i n ] + log n for all t > 0 Taking t = logn gives the claimed result Corollary A Let Φ R n n iid N 0 It

14 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 4 holds that n Φx + A simle alication of union bound yields logn x for all x R n P E c ] ex δ n π + n log + δ with robability at least n 3 Proof Let Φ = φ ] Choosing δ small enough such that φ n Without loss of generality let us only consider x S n 3δ δ ε + δ δ + ε we have Φx = max x φ i max φ i then conditioned on E we can conclude that i n ] i n A ] ε Invoking Lemma A0 returns the claimed result π n Φx + ε π n x S n Lemma A Covering Number of a Unit Shere 4] Let S n = {x R n x = } be the unit shere For any ε 0 there exists some ε cover of S n wrt the l norm denoted as N ε such that N ε + ε n 3 ε n Lemma A3 Sectrum of Gaussian Matrices 4] Let Φ R n n n > n contain iid standard normal entries Then for every t 0 with robability at least ex t / one has n n t min Φ max Φ n + n + t Lemma A4 For any ε 0 there exists a constant C ε > such that rovided n > C ε n the random matrix Φ R n n iid N 0 obeys ε π n x Φx + ε π n x for all x R n with robability at least ex c ε n for some c ε > 0 Geometrically this lemma roughly corresonds to the well known almost sherical section theorem 68 69] see also 70] A slight variant of this version has been roved in 3] borrowing ideas from 7] Proof By homogeneity it is enough to show that the bounds hold for every x of unit l norm For a fixed x 0 with x 0 = Φx 0 N 0 I So E Φx = π n Note that is n -Lischitz by concentration of measure for Gaussian vectors in Lemma A9 we have P Φx E Φx ] > t] ex t n for any t > 0 For a fixed δ 0 S n can be covered by a δ-net N δ with cardinality #N δ + /δ n Now consider the event E = { x N δ δ π n Φx + δ } π n Indeed suose E holds Then it can easily be seen that any z S n can be written as z = λ k x k with λ k δ k x k N δ for all k k=0 Hence we have Φz = Φ λ k x k k=0 Similarly Φz = Φ λ k x k k=0 δ k Φx k k=0 + δ δ π n δ δ + δ δ ] π n = 3δ δ π n Hence the choice of δ above leads to the claimed result Finally given n > Cn to make the robability P E c ] decaying in n it is enough to set C = π δ log + δ This comletes the roof APPENDIX B THE RANDOM BASIS VS ITS ORTHONORMALIZED VERSION In this aendix we consider the lanted sarse model Y = x 0 g g n ] = x 0 G] R n as defined in III5 where x 0 k iid Ber θ k θ g l iid N 0 I l n B

15 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 5 Recall that one natural/canonical orthonormal basis for the subsace sanned by columns of Y is x0 ] / Y = P x 0 x 0 G G P x 0 G which is well-defined with high robability as P x 0 G is well-conditioned roved in Lemma B We write G / = Px 0 G G P x 0 G B for convenience When is large Y has nearly orthonormal columns and so we exect that Y closely aroximates Y In this section we make this intuition rigorous We rove several results that are needed for the roof of Theorem II and for translating results for Y to results for Y in Aendix E-D For any realization of x 0 let I = sux 0 = {i x 0 i 0} By Bernstein s inequality in Lemma A7 with X = θ and R = the event { } E 0 = θ I θ B3 holds with robability at least ex θ/6 Moreover we show the following: Lemma B When Cn and θ > / n the bound x 0 4 n log 5 θ B4 holds with robability at least c Here C c are ositive constants ] Proof Because E x 0 = by Bernstein s inequality in Lemma A7 with X = /θ and R = /θ we have ] ] ] x0 P E x 0 x0 > t = P > t ex θt 4 + t for all t > 0 which imlies ] t P x 0 > x 0 + = P x 0 x 0 + > t] ex θt 4 + t On the intersection with E 0 x /4 and setting t = P n log θ we obtain x n log θ ] E 0 ex n log Unconditionally this imlies that with robability at least ex θ/6 ex n log we have x 0 = x 0 4 n log x 0 5 θ as desired Let M = G P x 0 G / Then G = GM x 0x 0 GM We show the following results hold: x 0 Lemma B Provided Cn it holds that n M M I log with robability at least Here C is a ositive constant Proof First observe that / M = min G P x 0 G = min Px 0 G Now suose B is an orthonormal basis sanning x 0 Then it is not hard to see the sectrum of P x 0 G is the same as that of B G R n ; in articular min Px 0 G = min B G Since each entry of G iid N 0 and B has orthonormal rows B G iid N 0 we can invoke the sectrum results for Gaussian matrices in Lemma A3 and obtain that n log min B G n max B log G + + with robability at least Thus when C n for some sufficiently large constant C by using the results above we have M = min B G n = I M = max max M min M = max min B G max B G log

16 IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 6 n log max n log + + { n log = max + + n log n log + + n log + + n log + + n log with robability at least Lemma B3 Let Y I be a submatrix of Y whose rows are indexed by the set I There exists a constant C > 0 such that when Cn and / > θ > / n the following Y l l 3 Y I l l 7 θ G G l 4 n + 7 log l YI Y l n log I 0 l θ Y Yl n log 0 l θ hold simultaneously with robability at least c for a ositive constant c Proof First of all we have x 0 x 0 x 0 GM x l x l 0 x 0 l l 0 GM l l = x x 0 x 0 0 G where in the last inequality we have alied the fact M from Lemma B Now x 0 G is an iid Gaussian vectors with each entry distributed as N 0 x0 where x 0 = I θ So by Gaussian concentration inequality in Lemma A9 we have x log 0 G x 0 with robability at least c On the intersection with E 0 this imlies x 0 x 0 x 0 GM θ log l l with robability at least c rovided θ > / n Moreover when intersected with E 0 Lemma A4 imlies that when C n G l l G I l l θ with robability at least c 3 rovided θ > / n Hence by Lemma B when > C n G G l G l l l I M + x 0 x 0 x 0 GM l l n log + θ log 4 n + 7 log Yl x l 0 l l + G l l x 0 + θ + 3 G l I G l I l l M + x 0 x 0 x 0 GM l l θ + θ log 4 θ GI G I l G l I l l I M + x 0 x 0 x 0 GM n θ log l l + θ log 4 θn + 6 θ log Y I l l x 0 x 0 + G l I l l l x 0 x θ 7 θ with robability at least c 4 rovided θ > / n Finally by Lemma B and the results above we obtain Y Yl l x 0 x 0 + G G l l n log 0 θ

4. Score normalization technical details We now discuss the technical details of the score normalization method.

4. Score normalization technical details We now discuss the technical details of the score normalization method. SMT SCORING SYSTEM This document describes the scoring system for the Stanford Math Tournament We begin by giving an overview of the changes to scoring and a non-technical descrition of the scoring rules