Signal and Noise Statistics Oblivious Orthogonal Matching Pursuit

Sreejith Kallummil * Sheetal Kalyani *

* Equal contribution. Department of Electrical Engineering, IIT Madras, India. Correspondence to: Sreejith Kallummil <sreejith.k.venugopal@gmail.com>. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

Orthogonal matching pursuit (OMP) is a widely used algorithm for recovering sparse high dimensional vectors in linear regression models. The optimal performance of OMP requires a priori knowledge of either the sparsity of the regression vector or the noise statistics. Both these statistics are rarely known a priori and are very difficult to estimate. In this paper, we present a novel technique called residual ratio thresholding (RRT) to operate OMP without any a priori knowledge of sparsity and noise statistics, and establish finite sample and large sample support recovery guarantees for the same. Both analytical results and numerical simulations in real and synthetic data sets indicate that RRT has a performance comparable to OMP with a priori knowledge of sparsity and noise statistics.

1. Introduction

This article deals with the estimation of the regression vector β ∈ R^p in the linear regression model y = Xβ + w, where X ∈ R^{n×p} is a known design matrix with unit Euclidean norm columns, w is the noise vector and y is the observation vector. Throughout this article, we assume that the entries of the noise w are independent, zero mean and Gaussian distributed with variance σ². We consider the high dimensional and sample starved scenario of n < p or n ≈ p, where classical techniques like ordinary least squares (OLS) are no longer applicable. This problem of estimating high dimensional vectors in sample starved scenarios is ill-posed even in the absence of noise, unless strong structural assumptions are made on X and β. A widely used and practically valid assumption is sparsity. The vector β ∈ R^p is sparse if the support of β given by S = supp(β) = {k : β_k ≠ 0} has cardinality k₀ = card(S) ≪ p.
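To fix notation, the sketch below builds a toy instance of the model y = Xβ + w with unit-norm columns and a k₀-sparse β. All dimensions, seeds and values here are illustrative choices of ours, not the settings used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k0, sigma = 64, 256, 5, 0.5          # illustrative sizes only
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)             # unit Euclidean norm columns
S = rng.choice(p, size=k0, replace=False)  # support S = supp(beta), card(S) = k0 << p
beta = np.zeros(p)
beta[S] = rng.choice([-1.0, 1.0], size=k0)
y = X @ beta + sigma * rng.standard_normal(n)   # y = X beta + w,  w ~ N(0, sigma^2 I_n)
```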
A number of algorithms like the least absolute shrinkage and selection operator (LASSO) (Tropp, 2006; Tibshirani, 1996), the Dantzig selector (DS) (Candes & Tao, 2007), subspace pursuit (SP) (Dai & Milenkovic, 2009), OMP (Pati et al., 1993; Mallat & Zhang, 1993; Tropp, 2004; Cai & Wang, 2011), the elastic net (Zou & Hastie, 2005) etc. have been proposed to efficiently estimate β. Tuning the hyper parameters of the aforementioned algorithms to achieve optimal performance requires a priori knowledge of signal parameters like the sparsity k₀ or noise statistics like σ². Unfortunately, these parameters are rarely known a priori. To the best of our knowledge, no computationally efficient technique to estimate k₀ has been reported in the open literature. Limited success on the estimation of σ² has been reported (Dicker, 2014; Fan et al., 2012; Dicker & Erdogdu, 2016; Bayati et al., 2013); however, the performance of these σ² estimates when used for tuning hyper parameters in LASSO, DS, OMP etc. is largely unknown. Generalised techniques for hyper parameter selection like cross validation (CV) (Arlot & Celisse, 2010) and re-sampling (Meinshausen & Bühlmann, 2010) are computationally challenging. Further, CV is reported to have poor variable selection behaviour (Chichignoud et al., 2016; Arlot & Celisse, 2010). Indeed, algorithms that are oblivious to signal and noise statistics have also been proposed in the literature. These include algorithms inspired by or related to LASSO, like the square root LASSO (Belloni et al., 2011), AV (Chichignoud et al., 2016) and approximate message passing (Mousavi et al., 2013; Bayati et al., 2013), and ridge regression inspired techniques like least squares adaptive thresholding (LAT) and ridge adaptive thresholding (RAT) (Wang et al., 2016). However, most existing signal and noise statistics oblivious sparse recovery techniques have only large sample performance guarantees. Further, many of these techniques assume that the design matrix X is sampled from a random ensemble, a condition which is rarely satisfied in practice.

1.1. Contributions of this paper

This article presents a novel technique called residual ratio thresholding (RRT) for finding a good estimate of the support S from the data dependent/adaptive sequence of supports generated by OMP.

RRT is analytically shown to accomplish exact support recovery, i.e., identifying S, under the same finite sample and deterministic constraints on X (restricted isometry constants (RIC) or mutual coherence) required by OMP with a priori knowledge of k₀ or σ². However, the signal to noise ratio (SNR = ‖Xβ‖₂²/(nσ²)) required for support recovery using RRT is slightly higher than that of OMP with a priori knowledge of k₀ or σ². This extra SNR requirement is shown to decrease with increasing sample size n. RRT and OMP with a priori knowledge of k₀ or σ² are shown to be equivalent as n → ∞ in terms of the SNR required for support recovery. RRT involves a tuning parameter α that can be set independent of the ambient SNR or noise statistics. The hyper parameter α in RRT has an interesting semantic interpretation of being the high SNR upper bound on the support recovery error. Also, RRT is asymptotically tuning free in the sense that a very wide range of α deliver similar performances as n → ∞. Numerical simulations indicate that RRT can deliver a highly competitive performance when compared to OMP having a priori knowledge of k₀ or σ², OMP with k₀ estimated using CV and the recently proposed LAT algorithm. Further, RRT also delivered a highly competitive performance when applied to identify outliers in real data sets, an increasingly popular application of sparse estimation algorithms (Mitra et al., 2010; 2013).

The remainder of this article is organised as follows. In Section 2 we discuss the OMP algorithm. The RRT algorithm is presented in Section 3. Section 4 presents theoretical performance guarantees for RRT. Section 5 presents numerical simulation results. All proofs are provided in the supplementary material.

1.2. Notations used

‖x‖_q = (Σ_{k=1}^p |x_k|^q)^{1/q} is the l_q norm of x ∈ R^p. 0_n is the n×1 zero vector and I_n is the n×n identity matrix. span(X) is the column space of X. X† = (X^T X)^{-1} X^T is the Moore-Penrose pseudo inverse of X. X_J denotes the sub-matrix of X formed using the columns indexed by J. N(u, C) represents a Gaussian random vector (R.V) with mean u and covariance matrix C. B(a, b) denotes a Beta R.V with parameters a and b. a ~ b implies that a and b are identically distributed. ⌊x⌋ represents the floor operator. φ represents the null set. For any two sets J₁ and J₂, J₁/J₂ denotes the set difference. a →_P b represents the convergence of R.V a to R.V b in probability.

2. Orthogonal Matching Pursuit (OMP)

OMP (Algorithm 1) starts with a null support estimate and in each iteration adds to the current support the column index most correlated with the previous residual r^{k-1}, i.e., t^k = arg max_j |X_j^T r^{k-1}|. Then an LS estimate of β restricted to the current support S^k_omp is computed as an intermediate estimate of β, and this estimate is used to update the residual.

Algorithm 1: Orthogonal matching pursuit
  Input: observation y, matrix X
  Initialize: S⁰_omp = φ, k = 1 and residual r⁰ = y
  repeat
    Identify the next column: t^k = arg max_j |X_j^T r^{k-1}|
    Expand the current support: S^k_omp = S^{k-1}_omp ∪ {t^k}
    Restricted LS estimate: β̂_{S^k_omp} = X†_{S^k_omp} y, β̂_{{1,...,p}/S^k_omp} = 0_{p-k}
    Update residual: r^k = y − Xβ̂ = (I_n − P_k) y
    Increment k ← k + 1
  until stopping condition (SC) is true
  Output: support estimate Ŝ = S^k_omp and vector estimate β̂

Note that P_k in Algorithm 1 refers to X_{S^k_omp} X†_{S^k_omp}, the projection matrix onto span(X_{S^k_omp}). Since the residual r^k is orthogonal to span(X_{S^k_omp}), X_j^T r^k = 0 for all j ∈ S^k_omp. Consequently, t^{k+1} ∉ S^k_omp, i.e., the same index will not be selected in two different iterations. Hence, S^{k+1}_omp ⊋ S^k_omp, i.e., the support sequence is monotonically increasing. The monotonicity of S^k_omp in turn implies that the residual norm ‖r^k‖₂ is a non increasing function of k, i.e., ‖r^{k+1}‖₂ ≤ ‖r^k‖₂.
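A minimal NumPy sketch of Algorithm 1 is given below; it runs OMP for a fixed number of iterations and records the residual norms needed later for RRT. The function and variable names are ours, and the restricted LS estimate is recomputed from scratch each iteration for clarity rather than speed.

```python
import numpy as np

def omp(y, X, k_max):
    """Orthogonal matching pursuit (Algorithm 1), run for k_max iterations.
    Returns the selected indices in order, the residual norms ||r^0||,...,||r^k_max||
    and the final restricted LS estimate of beta."""
    n, p = X.shape
    selected = []                      # S^k_omp, in selection order
    r = y.copy()                       # r^0 = y
    res_norms = [np.linalg.norm(y)]
    beta_hat = np.zeros(p)
    for _ in range(k_max):
        t = int(np.argmax(np.abs(X.T @ r)))           # column most correlated with r^{k-1}
        selected.append(t)
        Xs = X[:, selected]
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)  # restricted LS estimate
        beta_hat = np.zeros(p)
        beta_hat[selected] = coef
        r = y - Xs @ coef                              # r^k = (I_n - P_k) y
        res_norms.append(np.linalg.norm(r))
    return selected, np.array(res_norms), beta_hat
```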
Most of the theoretical properties of OMP are derived assuming a priori knowledge of the true sparsity level k₀, in which case OMP stops after exactly k₀ iterations (Tropp, 2004; Wang, 2015). When k₀ is not known, one has to rely on stopping conditions (SC) based on the properties of the residual r^k as k varies. For example, one can stop the OMP iterations once the residual power is too low compared to the expected noise power. Mathematically, when the noise w is l₂ bounded, i.e., ‖w‖₂ ≤ ε for some a priori known ε, then OMP can be stopped once ‖r^k‖₂ ≤ ε. For a Gaussian noise vector w ~ N(0_n, σ²I_n), ε_σ = σ√(n + 2√(n log(n))) satisfies (Cai & Wang, 2011)

P(‖w‖₂ ≤ ε_σ) ≥ 1 − 1/n,   (1)

i.e., Gaussian noise is l₂ bounded with a very high probability. Consequently, one can stop the OMP iterations in Gaussian noise once ‖r^k‖₂ ≤ ε_σ.
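The noise-aware stopping rule can be written as a one-line helper. This is a sketch assuming the threshold ε_σ = σ√(n + 2√(n log n)) reconstructed in (1); it is only usable when σ is known a priori.

```python
import numpy as np

def epsilon_sigma(sigma, n):
    # epsilon_sigma from (1): ||w||_2 <= epsilon_sigma with probability >= 1 - 1/n
    return sigma * np.sqrt(n + 2.0 * np.sqrt(n * np.log(n)))

# Noise-statistics aware stopping: run OMP iterations until ||r^k||_2 <= epsilon_sigma(sigma, n).
```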

A number of deterministic recovery guarantees have been proposed for OMP. Among these guarantees, the conditions based on RIC are the most popular. The RIC of order j, denoted by δ_j, is defined as the smallest value of δ such that

(1 − δ)‖b‖₂² ≤ ‖Xb‖₂² ≤ (1 + δ)‖b‖₂²   (2)

holds true for all b ∈ R^p with ‖b‖₀ = card(supp(b)) ≤ j. A smaller value of δ_j implies that X acts as a near orthogonal matrix for all j sparse vectors b. Such a situation is ideal for the recovery of a j-sparse vector b using any sparse recovery technique. The latest RIC based support recovery guarantee for OMP is given in Lemma 1 (Liu et al., 2017).

Lemma 1. OMP with k₀ iterations or SC ‖r^k‖₂ ≤ ‖w‖₂ can recover any k₀ sparse vector β provided that δ_{k₀+1} < 1/√(k₀+1) and

‖w‖₂ ≤ ε_omp = √(1 − δ_{k₀+1}) (1 − √(k₀+1) δ_{k₀+1}) β_min / (√(1 + δ_{k₀+1}) + 1 − √(k₀+1) δ_{k₀+1}),

where β_min = min_{j∈S} |β_j|.

Since P(‖w‖₂ ≤ ε_σ) ≥ 1 − 1/n when w ~ N(0_n, σ²I_n), it follows from Lemma 1 that OMP with k₀ iterations or SC ‖r^k‖₂ ≤ ε_σ can recover any k₀-sparse vector β with probability greater than 1 − 1/n, provided that δ_{k₀+1} < 1/√(k₀+1) and ε_σ ≤ ε_omp. Lemma 1 implies that OMP with a priori knowledge of k₀ or σ² can recover the support S once the matrix satisfies the regularity condition δ_{k₀+1} < 1/√(k₀+1) and the SNR is high. It is also known that this RIC condition is worst case necessary. Consequently, Lemma 1 is one of the best deterministic guarantees for OMP available in the literature. Note that the mutual incoherence condition µ_X = max_{j≠k} |X_j^T X_k| < 1/(2k₀ − 1) also ensures exact support recovery at high SNR. Note that the a priori knowledge of k₀ or σ² required to materialise the recovery guarantees in Lemma 1 is not available in practical problems. Further, k₀ and σ² are very difficult to estimate. This motivates the proposed RRT algorithm, which does not require a priori knowledge of k₀ or σ².

3. Residual Ratio Thresholding (RRT)

RRT is a novel signal and noise statistics oblivious technique to estimate the support S based on the behaviour of the residual ratio statistic RR(k) = ‖r^k‖₂/‖r^{k-1}‖₂ as k increases from k = 1 to a predefined value k = k_max > k₀. As mentioned above, identifying the support using the behaviour of ‖r^k‖₂ requires a priori knowledge of σ². However, as we will show in this section, support detection using RR(k) does not require a priori knowledge of σ². Since the residual norms are non negative and non increasing, RR(k) always satisfies 0 ≤ RR(k) ≤ 1.

3.1. Minimal superset and implications

Consider running k_max > k₀ iterations of OMP and let {S^k_omp}_{k=1}^{k_max} be the support sequence generated by OMP. Recall that S^k_omp is monotonically increasing.

Definition 1:- The minimal superset in the OMP support sequence {S^k_omp}_{k=1}^{k_max} is given by S^{k_min}_omp, where k_min = min{k : S ⊆ S^k_omp}. When the set {k : S ⊆ S^k_omp} = φ, we set k_min = ∞ and S^{k_min}_omp = φ.

In words, the minimal superset is the smallest superset of the support S present in a particular realization of the support estimate sequence {S^k_omp}_{k=1}^{k_max}. Note that both k_min and S^{k_min}_omp are unobservable random variables. Since card(S^k_omp) = k, S^k_omp for k < k₀ cannot satisfy S ⊆ S^k_omp and hence k_min ≥ k₀. Further, the monotonicity of S^k_omp implies that S ⊆ S^k_omp for all k ≥ k_min.

Case 1:- When k_min = k₀, then S^{k₀}_omp = S and S^k_omp ⊊ S for k < k₀, i.e., S is present in the solution path. Further, when k_min = k₀, it is true that S^k_omp ⊋ S for k > k₀.

Case 2:- When k₀ < k_min ≤ k_max, then S^k_omp ≠ S for all k and S^k_omp ⊋ S for k ≥ k_min, i.e., S is not present in the solution path. However, a superset of S is present.

Case 3:- When k_min = ∞, then S^k_omp ⊉ S for all k, i.e., neither S nor a superset of S is present in {S^k_omp}_{k=1}^{k_max}.

To summarize, exact support recovery using any OMP based scheme, including the signal and noise statistics aware schemes, is possible only if k_min = k₀. Whenever k_min > k₀, it is possible to estimate the true support S without any false negatives; however, one then has to suffer from false positives.
When k_min = ∞, any support in {S^k_omp}_{k=1}^{k_max} has to suffer from false negatives, and all supports S^k_omp for k > k₀ have to suffer from false positives also. Note that the matrix and SNR conditions required for exact support recovery in Lemma 1 automatically imply that k_min = k₀. We formulate the proposed RRT scheme assuming that k_min = k₀.

3.2. Behaviour of RR(k₀)

Next we consider the behaviour of the residual ratio statistic at the k₀-th iteration, i.e., RR(k₀) = ‖r^{k₀}‖₂/‖r^{k₀−1}‖₂, under the assumption that ‖w‖₂ ≤ ε_omp and δ_{k₀+1} < 1/√(k₀+1), which ensures k_min = k₀ and S^k_omp ⊊ S for all k < k₀. Since Xβ = X_S β_S ∈ span(X_S), (I_n − P_k)Xβ ≠ 0_n if S ⊄ S^k_omp and (I_n − P_k)Xβ = 0_n if S ⊆ S^k_omp. This, along with the monotonicity of S^k_omp, implies the following: (I_n − P_k)Xβ ≠ 0_n for k < k_min = k₀ and (I_n − P_k)Xβ = 0_n for k ≥ k_min = k₀. Thus r^k = (I_n − P_k)y = (I_n − P_k)X_S β_S + (I_n − P_k)w for k < k_min = k₀, whereas r^k = (I_n − P_k)w for k ≥ k_min = k₀. Consequently, at k = k₀, the numerator ‖r^{k₀}‖₂ of RR(k₀) contains contributions only from the noise term (I_n − P_{k₀})w, whereas the denominator ‖r^{k₀−1}‖₂ of RR(k₀) contains contributions from both the signal term (I_n − P_{k₀−1})X_S β_S and the noise term (I_n − P_{k₀−1})w. This behaviour of RR(k₀), along with the fact that ‖w‖₂ →_P 0 as σ² → 0, implies the following theorem.

Theorem 1. Assume that the matrix X satisfies the RIC constraint δ_{k₀+1} < 1/√(k₀+1) and k_max > k₀. Then
a). RR(k_min) →_P 0 as σ² → 0.
b). lim_{σ²→0} P(k_min = k₀) = 1.
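The two quantities introduced in this section are easy to compute from the output of the OMP sketch above: the residual ratio sequence RR(k) and, when the true support is known (e.g., in simulations), the minimal superset index k_min. The helper names below are ours.

```python
import numpy as np

def residual_ratios(res_norms):
    # RR(k) = ||r^k||_2 / ||r^{k-1}||_2 for k = 1, ..., k_max  (res_norms[0] = ||r^0||_2 = ||y||_2)
    res_norms = np.asarray(res_norms)
    return res_norms[1:] / res_norms[:-1]

def minimal_superset_index(selected, true_support):
    # k_min = min{k : S is a subset of S^k_omp}; None stands in for k_min = infinity
    S = set(true_support)
    for k in range(1, len(selected) + 1):
        if S.issubset(selected[:k]):
            return k
    return None
```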

Algorithm 2: Residual ratio thresholding
  Input: observation y, matrix X
  Step 1: Run k_max iterations of OMP (Algorithm 1).
  Step 2: Compute RR(k) for k = 1, ..., k_max.
  Step 3: Estimate k_RRT = max{k : RR(k) ≤ Γ^α_RRT(k)}.
  Output: support estimate Ŝ = S^{k_RRT}_omp and vector estimate β̂(S^{k_RRT}_omp) = X†_{S^{k_RRT}_omp} y, β̂({1,...,p}/S^{k_RRT}_omp) = 0_{p−k_RRT}.

3.3. Behaviour of RR(k) for k > k_min

Next we discuss the behaviour of RR(k) for k > k_min. By the definition of k_min we have S ⊆ S^k_omp, which implies that r^k = (I_n − P_k)w for k ≥ k_min. The absence of signal terms in the numerator and the denominator of RR(k) = ‖(I_n − P_k)w‖₂/‖(I_n − P_{k−1})w‖₂ for k > k_min implies that even when ‖w‖₂ → 0 or σ² → 0, RR(k) for k > k_min does not converge to zero. This behaviour of RR(k) for k > k_min is captured in Theorem 2, where we provide explicit σ² or SNR independent lower bounds on RR(k) for k > k_min.

Theorem 2. Let F_{a,b}(x) denote the cumulative distribution function of a B(a, b) random variable. Then, for every σ² > 0, the deterministic sequence

Γ^α_RRT(k) = √( F^{-1}_{(n−k)/2, 0.5}( α / (k_max (p − k + 1)) ) )

satisfies

P(RR(k) > Γ^α_RRT(k), ∀k > k_min) ≥ 1 − α.   (3)

Theorem 2 states that the residual ratio statistic RR(k) for k > k_min is lower bounded by the deterministic sequence {Γ^α_RRT(k)}_{k=k_min+1}^{k_max} with a high probability (for small values of α). Please note that k_min is itself a R.V. Note that the sequence Γ^α_RRT(k) is dependent only on the matrix dimensions n and p. Further, Theorem 2 does not make any assumptions on the noise variance σ² or the design matrix X. Theorem 2 is extremely non trivial considering the fact that the support estimate sequence {S^k_omp}_{k=1}^{k_max} produced by OMP is adaptive and data dependent.

Lemma 2. The following important properties of Γ^α_RRT(k) are direct consequences of the monotonicity of the CDF and the fact that a Beta R.V takes values only in [0, 1].
1). Γ^α_RRT(k) is defined only in the interval α ∈ [0, k_max(p − k + 1)].
2). 0 ≤ Γ^α_RRT(k) ≤ 1.
3). Γ^α_RRT(k) is a monotonically increasing function of α.
4). Γ^α_RRT(k) = 0 when α = 0 and Γ^α_RRT(k) = 1 when α = k_max(p − k + 1).

3.4. Residual ratio thresholding framework

From Theorem 1, it is clear that P(k_min = k₀) and P(S^{k₀}_omp = S) increase with increasing SNR (or decreasing σ²), whereas RR(k_min) decreases to zero with increasing SNR. At the same time, for small values of α like α = 0.1, RR(k) for k > k_min is lower bounded by Γ^α_RRT(k) with a very high probability at all SNR. Hence, finding the last index k such that RR(k) ≤ Γ^α_RRT(k), i.e., k_RRT = max{k : RR(k) ≤ Γ^α_RRT(k)}, gives k₀ and equivalently S^{k₀}_omp = S with a probability increasing with increasing SNR. This motivates the proposed signal and noise statistics oblivious RRT algorithm presented in Algorithm 2.

Remark 1. An important aspect regarding RRT in Algorithm 2 is the choice of k_RRT when the set {k : RR(k) ≤ Γ^α_RRT(k)} = φ. This situation happens only at very low SNR. When {k : RR(k) ≤ Γ^α_RRT(k)} = φ for a given value of α, we increase the value of α to the smallest value α_new > α such that {k : RR(k) ≤ Γ^{α_new}_RRT(k)} ≠ φ. Mathematically, we set k_RRT = max{k : RR(k) ≤ Γ^{α_new}_RRT(k)}, where α_new = min_{a > α} {a : {k : RR(k) ≤ Γ^a_RRT(k)} ≠ φ}. Since α = pk_max gives Γ^α_RRT(1) = 1 and RR(1) ≤ 1, a value of α_new ≤ pk_max always exists. α_new can be easily computed by first pre-computing {Γ^a_RRT(k)}_{k=1}^{k_max} for a number of prefixed values of a in the interval (α, pk_max].
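Under the Beta-quantile form of Γ^α_RRT(k) stated in Theorem 2 (as reconstructed above), the threshold sequence and the RRT estimate k_RRT can be computed with scipy. The sketch below, with names of our choosing, signals the empty-set situation of Remark 1 by returning 0, in which case α should be raised to the smallest α_new that makes the set non-empty.

```python
import numpy as np
from scipy.stats import beta

def gamma_rrt(n, p, k_max, alpha):
    """Gamma^alpha_RRT(k) for k = 1, ..., k_max: square root of the
    alpha/(k_max*(p-k+1)) quantile of a Beta((n-k)/2, 1/2) distribution."""
    k = np.arange(1, k_max + 1)
    q = alpha / (k_max * (p - k + 1.0))
    return np.sqrt(beta.ppf(q, (n - k) / 2.0, 0.5))

def k_rrt(rr, gammas):
    # k_RRT = max{k : RR(k) <= Gamma^alpha_RRT(k)}; 0 means the set is empty (Remark 1)
    ok = np.where(rr <= gammas[:len(rr)])[0]
    return int(ok[-1]) + 1 if ok.size else 0

# Putting the pieces together (Algorithm 2), using the omp() and residual_ratios() sketches above:
# selected, res_norms, _ = omp(y, X, k_max)
# k_hat = k_rrt(residual_ratios(res_norms), gamma_rrt(n, p, k_max, alpha=1 / np.log(n)))
# S_hat = selected[:k_hat]
```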
Remark 2. RRT requires performing k_max iterations of OMP. All the quantities required for RRT, including RR(k) and the final estimates, can be computed while performing these k_max iterations itself. Consequently, RRT has complexity O(k_max np). As we will see later, a good choice of k_max is k_max = ⌊0.5(n + 1)⌋, which results in a complexity order O(n²p). This complexity is approximately n/k₀ times higher than the O(npk₀) complexity of OMP when k₀ or σ² is known a priori. This is the computational cost being paid for not knowing k₀ or σ² a priori. In contrast, L-fold CV requires running (1 − 1/L)n iterations of OMP L times, resulting in O(L(1 − 1/L)²n²p) = O(Ln²p) complexity, i.e., RRT is approximately L times computationally less complex than CV.

Remark 3. The RRT algorithm is developed only assuming that the support sequence generated by the sparse recovery algorithm is monotonically increasing. Apart from OMP, algorithms such as orthogonal least squares (Wen et al., 2017) and OMP with thresholding (Yang & de Hoog, 2015) also produce monotonic support sequences. The RRT principle can be directly applied to operate these algorithms in a signal and noise statistics oblivious fashion.

4. Analytical Results for RRT

In this section we present support recovery guarantees for RRT and compare them with the results available for OMP with a priori knowledge of k₀ or σ². The first result in this section deals with the finite sample and finite SNR performance of RRT.

Theorem 3. Let k_max ≥ k₀ and suppose that the matrix X satisfies δ_{k₀+1} < 1/√(k₀+1). Then RRT can recover the true support S with probability greater than 1 − 1/n − α provided that ε_σ ≤ min(ε_omp, ε_rrt), where

ε_rrt = Γ^α_RRT(k₀) √(1 − δ_{k₀}) β_min / (1 + Γ^α_RRT(k₀)).   (4)

Theorem 3 implies that RRT can identify the support S at a higher SNR or lower noise level than that required by OMP with a priori knowledge of k₀ and σ². For small values of α like α = 0.1, the probability of exact support recovery, i.e., 1 − 1/n − α, is similar to the 1 − 1/n probability of exact support recovery in Lemma 1. Also, please note that the RRT framework does not impose any extra conditions on the design matrix X. Consequently, the only appreciable difference between RRT and OMP with a priori knowledge of k₀ and σ² is in the extra SNR required by RRT, which is quantified next using the metric ε_extra = ε_omp/ε_rrt. Note that the larger the value of ε_extra, the larger should be the SNR or, equivalently, the smaller should be the noise level required for RRT to accomplish exact support recovery. Substituting the values of ε_omp and ε_rrt and using the bound δ_{k₀} ≤ δ_{k₀+1} gives

ε_extra ≤ ( (1 − √(k₀+1) δ_{k₀+1}) / (√(1 + δ_{k₀+1}) + 1 − √(k₀+1) δ_{k₀+1}) ) (1 + 1/Γ^α_RRT(k₀)).   (5)

Note that √(1 + δ_{k₀+1}) ≥ 1 ≥ 1 − √(k₀+1) δ_{k₀+1}. Consequently,

ε_extra ≤ 0.5 (1 + 1/Γ^α_RRT(k₀)).   (6)

Since Γ^α_RRT(k₀) ≤ 1, it follows that 0.5(1 + 1/Γ^α_RRT(k₀)) is always greater than or equal to one. However, ε_extra decreases with the increase in Γ^α_RRT(k₀). In particular, when Γ^α_RRT(k₀) = 1, there is no extra SNR requirement.

Remark 4. The RRT algorithm involves two hyper parameters, viz. k_max and α. Exact support recovery using RRT requires only that k_max ≥ k₀. However, k₀ is an unknown quantity. In our numerical simulations, we set k_max = min(p, ⌊0.5(rank(X) + 1)⌋). This choice is motivated by the facts that k₀ < ⌊0.5(rank(X) + 1)⌋ is a necessary condition for exact support recovery using any sparse estimation algorithm (Elad, 2010) when n < p, and that min(n, p) is the maximum possible number of iterations in OMP. Since evaluating rank(X) requires extra computations, one can always use rank(X) ≤ n to set k_max = min(p, ⌊0.5(n + 1)⌋). Please note that this choice of k_max is independent of the operating SNR, the design matrix and the vector to be estimated, and the user is not required to tune this parameter. Hence, α is the only user specified hyper parameter in the RRT algorithm.

4.1. Large sample behaviour of RRT

Next we discuss the behaviour of RRT as n → ∞. From (6), it is clear that the extra SNR required for support recovery using RRT decreases with increasing Γ^α_RRT(k₀). However, by Lemma 2, increasing Γ^α_RRT(k₀) requires an increase in the value of α, and increasing α decreases the probability of support recovery given by 1 − 1/n − α. In other words, one cannot have exact support recovery using RRT at lower SNR without increasing the probability of error in the process. An answer to this conundrum is available in the large sample regime, where it is possible to achieve both α → 0 and Γ^α_RRT(k₀) → 1, i.e., no extra SNR requirement and no decrease in the probability of support recovery. The following theorem states the conditions required for Γ^α_RRT(k₀) → 1 for large values of n.

Theorem 4. Define k_lim = lim_{n→∞} k₀/n, p_lim = lim_{n→∞} log(p)/n and α_lim = lim_{n→∞} log(α)/n. Let k_max = min(p, ⌊0.5(n + 1)⌋). Then Γ^α_RRT(k₀) = √( F^{-1}_{(n−k₀)/2, 0.5}( α / (k_max (p − k₀ + 1)) ) ) satisfies the following asymptotic limits.
Case 1:- lim_{n→∞} Γ^α_RRT(k₀) = 1 whenever k_lim < 0.5, p_lim = 0 and α_lim = 0.
Case 2:- 0 < lim_{n→∞} Γ^α_RRT(k₀) < 1 if k_lim < 0.5, α_lim = 0 and p_lim > 0. In particular, lim_{n→∞} Γ^α_RRT(k₀) = exp(−p_lim/(1 − k_lim)).
Case 3:- lim_{n→∞} Γ^α_RRT(k₀) = 0 if k_lim < 0.5, α_lim = −∞ and p_lim = 0.
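The limits in Theorem 4 are easy to probe numerically. The short script below (using the same reconstructed Beta-quantile form of Γ^α_RRT and k_max = min(p, ⌊0.5(n+1)⌋)) evaluates Γ^α_RRT(k₀) along a Regime-2-style sequence p = n², α = 1/log(n); the printed values climb towards 1, and the bound 0.5(1 + 1/Γ^α_RRT(k₀)) on ε_extra falls towards 1.

```python
import numpy as np
from scipy.stats import beta

def gamma_at_k0(n, p, k0, alpha):
    k_max = min(p, int(0.5 * (n + 1)))
    q = alpha / (k_max * (p - k0 + 1.0))
    return np.sqrt(beta.ppf(q, (n - k0) / 2.0, 0.5))

for n in (100, 1000, 10000, 100000):
    g = gamma_at_k0(n=n, p=n ** 2, k0=10, alpha=1.0 / np.log(n))
    print(n, round(g, 4), round(0.5 * (1.0 + 1.0 / g), 4))   # Gamma and the extra-SNR factor
```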
Theorem 4 states that all choices of (n, p, k₀) satisfying p_lim = 0 and k_lim < 0.5 result in lim_{n→∞} Γ^α_RRT(k₀) = 1, provided that the parameter α satisfies α_lim = 0. Note that α_lim = 0 for a wide variety of α including α = constant, α = 1/n^δ for some δ > 0, α = 1/log(n) etc. It is interesting to see which (n, p, k₀) scenarios give p_lim = 0 and k_lim < 0.5. Note that exact recovery in the n < p scenario is possible only if k₀ ≤ ⌊0.5(n + 1)⌋. Thus, the assumption k_lim < 0.5 will be satisfied in all interesting problem scenarios.

Regime 1:- lim_{n→∞} Γ^α_RRT(k₀) = 1 in low dimensional regression problems with p fixed and n → ∞, or for all (n, p) → (∞, ∞) with lim_{n→∞} p/n < ∞.

Regime 2:- lim_{n→∞} Γ^α_RRT(k₀) = 1 in the high dimensional case where p increases sub exponentially with n as exp(n^δ) for some δ < 1, or where p increases polynomially w.r.t n, i.e., p = n^δ for some δ > 0. In both cases, p_lim = lim_{n→∞} log(n^δ)/n = 0 and p_lim = lim_{n→∞} log(exp(n^δ))/n = 0.

Regime 3:- lim_{n→∞} Γ^α_RRT(k₀) = 1 in the extreme high dimensional case where (n, p, k₀) → (∞, ∞, ∞) satisfying n ≈ ck₀ log(p) for some constant c > 0. Here p_lim = lim_{n→∞} log(p)/n = lim_{n→∞} 1/(ck₀) = 0 and k_lim = lim_{n→∞} k₀/n = lim_{n→∞} 1/(c log(p)) = 0. Note that the sampling regime n ≈ 2k₀ log(p) is the best known asymptotic guarantee available for OMP (Fletcher & Rangan, 2012).

Regime 4:- Consider a sampling regime where (n, p) → (∞, ∞) such that k₀ is fixed and n = ck₀ log(p), i.e., p is exponentially increasing with n. Here p_lim = 1/(ck₀) > 0 and k_lim = 0, so lim_{n→∞} Γ^α_RRT(k₀) < 1. A good example of this sampling regime is (Tropp & Gilbert, 2007), where it was shown that OMP can recover a particular (not every) k₀ dimensional signal from n random measurements (in the noiseless case) when n = ck₀ log(p). Even if we assume that only n = 4k₀ log(p) measurements are sufficient for recovering a k₀ sparse signal, we have lim_{n→∞} Γ^α_RRT(k₀) = exp(−0.05) ≈ 0.95 for k₀ = 5 (i.e., ε_extra ≤ 1.0257) and lim_{n→∞} Γ^α_RRT(k₀) = exp(−0.025) ≈ 0.9753 for k₀ = 10 (i.e., ε_extra ≤ 1.0127).

Note that Γ^α_RRT(k₀) → 1 as n → ∞ implies that ε_extra → 1 and min(ε_omp, ε_rrt) → ε_omp. This asymptotic behaviour of Γ^α_RRT(k₀) and ε_extra implies the large sample consistency of RRT as stated in the following theorem.

Theorem 5. Suppose that the sample size n → ∞ such that the matrix X satisfies δ_{k₀+1} < 1/√(k₀+1), ε_σ ≤ ε_omp and p_lim = 0. Then,
a). OMP running k₀ iterations and OMP with SC ‖r^k‖₂ ≤ ε_σ are large sample consistent, i.e., lim_{n→∞} P(Ŝ = S) = 1.
b). RRT with hyper parameter α satisfying lim_{n→∞} α = 0 and α_lim = 0 is also large sample consistent.

Theorem 5 implies that at large sample sizes, RRT can accomplish exact support recovery under the same SNR and matrix conditions required by OMP with a priori knowledge of k₀ or σ². Theorem 5 has a very important corollary.

Remark 5. Theorem 5 implies that all choices of α satisfying α → 0 and α_lim = 0 deliver similar performances as n → ∞. Note that the range of adaptations satisfying α → 0 and α_lim = 0 includes α = 1/log(n), α = 1/n^δ for δ > 0 etc. Since a very wide range of tuning parameters deliver similar results as n → ∞, RRT is in fact asymptotically tuning free.

Remark 6. Based on the large sample analysis of RRT, one can make the following guidelines on the choice of α. When the sample size n is large, one can choose α as a function of n that satisfies both lim_{n→∞} α = 0 and α_lim = 0. Also, since the support recovery guarantees are of the form 1 − 1/n − α, it does not make sense to choose a value of α that decays to zero faster than 1/n. Hence, it is preferable to choose values of α that decrease to zero slower than 1/n, like α = 1/log(n), α = 1/√n etc.

4.2. A high SNR operational interpretation of α

Having discussed the large sample behaviour of RRT, we next discuss the finite sample and high SNR behaviour of RRT. Define the support recovery error event E = {Ŝ ≠ S}, the false positive event F = {card(Ŝ/S) > 0} and the missed discovery or false negative event M = {card(S/Ŝ) > 0}. The following theorem characterizes the likelihood of these events as the SNR increases to infinity, i.e., as σ² → 0.

Theorem 6. Let k_max > k₀ and suppose the matrix X satisfies δ_{k₀+1} < 1/√(k₀+1). Then,
a). lim_{σ²→0} P(M) = 0.
b). lim_{σ²→0} P(E) = lim_{σ²→0} P(F) ≤ α.

Theorem 6 states that when the matrix X allows for exact support recovery in the noiseless or low noise situation, RRT will not suffer from missed discoveries. Under such favourable conditions, α is a high SNR upper bound on both the probability of error and the probability of false positives. Please note that such explicit characterizations of hyper parameters are not available for the hyper parameters in square root LASSO, RAT, LAT etc.
5. Numerical Simulations

In this section, we provide extensive numerical simulations comparing the performance of RRT with state of the art sparse recovery techniques. In particular, we compare the performance of RRT with OMP with k₀ estimated using five fold CV and the least squares adaptive thresholding (LAT) proposed in (Wang et al., 2016). In synthetic data sets, we also compare RRT with OMP running exactly k₀ iterations and OMP with SC ‖r^k‖₂ ≤ σ√(n + 2√(n log(n))) (Cai & Wang, 2011). These algorithms are denoted in Figures 1-4 by CV, LAT, OMP1 and OMP2 respectively. RRT1 and RRT2 represent RRT with parameter α set to α = 1/log(n) and α = 1/√n respectively. By Theorem 5, RRT1 and RRT2 are large sample consistent.

5.1. Synthetic data sets

The synthetic data sets are generated as follows. We consider two models for the matrix X. Model 1 samples each entry of the design matrix X ∈ R^{n×p} independently according to N(0, 1). The matrix X in Model 2 is formed by concatenating I_n with an n×n Hadamard matrix H_n, i.e., X = [I_n, H_n]. This matrix guarantees exact support recovery using OMP at high SNR once k₀ < (1 + √n)/2 (Elad, 2010). The columns of X in both models are normalised to have unit l₂-norm (a sketch of both constructions is given below). Based on the choice of X and the support S, we conduct four experiments.
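The two design models can be generated as below (a sketch; the function names are ours). Model 2 uses scipy's Hadamard construction, which requires n to be a power of two.

```python
import numpy as np
from scipy.linalg import hadamard

def model1_design(n, p, rng=None):
    # Model 1: i.i.d. N(0, 1) entries, columns normalised to unit l2-norm
    rng = rng or np.random.default_rng(0)
    X = rng.standard_normal((n, p))
    return X / np.linalg.norm(X, axis=0)

def model2_design(n):
    # Model 2: X = [I_n, H_n] with H_n an n x n Hadamard matrix, so p = 2n;
    # columns normalised to unit l2-norm
    X = np.hstack([np.eye(n), hadamard(n).astype(float)])
    return X / np.linalg.norm(X, axis=0)
```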

[Figure 1. Experiment 1: Box plots of l₂ error ‖β̂ − β‖₂ (left), false positives (middle) and false negatives (right) for OMP1, OMP2, RRT1, RRT2, LAT and CV.]
[Figure 2. Experiment 2: Box plots of l₂ error ‖β̂ − β‖₂ (left), false positives (middle) and false negatives (right).]
[Figure 3. Experiment 3: Box plots of l₂ error ‖β̂ − β‖₂ (left), false positives (middle) and false negatives (right).]
[Figure 4. Experiment 4: Box plots of l₂ error ‖β̂ − β‖₂ (left), false positives (middle) and false negatives (right).]
[Table 1. Outliers detected by the various algorithms (RRT, CV and LAT) on the Stack Loss, AR, Brain and Body Weight, and Stars data sets, compared with the outliers reported in the literature (Rousseeuw & Leroy, 2005; Atkinson & Riani, 2000). RRT with both α = 1/log(n) and α = 1/√n delivered similar results. Existing results on the Stack Loss, Brain and Body Weight and Stars data sets are based on the combinatorially complex least median of squares (LMedS) algorithm. Existing results on the AR data set are based on extensive graphical analysis.]

Experiments 1-2 involve matrices of Model 1, with the support S sampled randomly from the set {1, ..., p}. Experiments 3 and 4 involve the matrix of Model 2 with (n, p) = (128, 256). For experiment 3, the support S is sampled randomly from the set {1, ..., p}, whereas in experiment 4 the support S is fixed at {1, 2, ..., k₀}. The noise w is sampled according to N(0_n, σ²I_n) with σ² = 1. The non zero entries of β are randomly assigned β_j = ±1. Subsequently, these entries are scaled to achieve SNR = ‖Xβ‖₂²/n = 3. The number of non zero entries k₀ in all experiments is fixed at six. We compare the algorithms in terms of the l₂ error, the number of false positives and the number of false negatives produced over repeated runs of each experiment.

From the box plots given in Figures 1-4, it is clear that RRT with both values of α performs very similarly to OMP1. They differ only in one run of experiment 3, where RRT1 and RRT2 suffer from a false negative. Further, RRT1 and RRT2 outperform CV and LAT in all four experiments in terms of all three metrics considered for evaluation. This is primarily because LAT and CV are more prone to making false positives, whereas RRT1 and RRT2 do not report any false positives. OMP2 consistently made false negatives, which explains its poor performance in terms of l₂ error. We have observed that once the SNR is made slightly higher, OMP2 delivers a performance similar to OMP1. Also note that RRT with two significantly different choices of α, viz. α = 1/√n and α = 1/log(n), delivered similar performances. This observation is in agreement with the claim of asymptotic tuning freeness made in Remark 5. Similar trends are also visible in the simulation results presented in the supplementary materials.

5.2. Outlier detection in real data sets

We next consider the application of sparse estimation techniques, including RRT, to identify outliers in low dimensional or full column rank (i.e., n > p) real life data sets, an approach first considered in (Mitra et al., 2010; 2013). Consider a robust regression model of the form y = Xβ + w + g_out with the usual interpretations for X, β and w. The extra term g_out ∈ R^n represents the gross errors in the regression model that cannot be modelled using the distributional assumptions on w. The outlier detection problem in linear regression refers to the identification of the support S_g = supp(g_out). Since X has full rank, one can always annihilate the signal component Xβ by projecting onto a subspace orthogonal to span(X). This results in a simple linear regression model of the form

ỹ = (I_n − XX†)y = (I_n − XX†)g_out + (I_n − XX†)w,   (7)

i.e., identifying S_g in robust regression is equivalent to a sparse support identification problem in linear regression. Even though this is a regression problem with n observations and n variables, the design matrix (I_n − XX†) in (7) is rank deficient (i.e., rank(I_n − XX†) = n − rank(X) < n). Hence, classical techniques based on LS are not useful for identifying S_g. Since card(S_g) and the variance of w are unknown, we only consider the application of RRT, OMP with CV and LAT in detecting S_g.
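A sketch of the annihilating projection in (7) is given below; after forming ỹ and the rank-deficient design (I_n − XX†) (with columns renormalised to unit norm, as OMP assumes), the outlier support S_g can be estimated with the same RRT-OMP pipeline sketched earlier. The names are ours.

```python
import numpy as np

def annihilate_signal(y, X):
    """Return y_tilde = (I_n - X X^+) y and the annihilator matrix itself,
    which plays the role of the design matrix in the sparse problem (7)."""
    n = X.shape[0]
    M = np.eye(n) - X @ np.linalg.pinv(X)   # I_n - X X^+, rank n - rank(X)
    return M @ y, M

# Usage sketch:
# y_t, M = annihilate_signal(y, X)
# M = M / np.linalg.norm(M, axis=0)         # unit-norm columns before running OMP
# selected, res_norms, _ = omp(y_t, M, k_max=int(0.5 * (n + 1)))
# ... then apply residual ratio thresholding as before to estimate S_g.
```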
We consider four widely studied real life data sets and compare the outliers identified by these algorithms with the existing and widely replicated studies on these data sets. More details on these data sets are given in the supplementary materials. The outliers detected by the aforementioned algorithms and the outliers reported in the existing literature are tabulated in Table 1. Among the four data sets considered, the outliers detected by RRT and the existing results are in consensus in two data sets, viz. the Stack Loss and Stars data sets. In the AR data set, RRT identifies all the reported outliers; however, RRT also includes two additional observations as outliers, and these identifications can be potential false positives. In the Brain and Body Weight data set, RRT agrees with the existing results on several observations; however, RRT misses two observations which are claimed to be outliers by the existing results. LAT agrees with RRT in all data sets except the Stack Loss data set, where it missed two of the reported outlier indices. CV correctly identified all the outliers identified by the other algorithms in all four data sets; however, it made a lot of false positives in three data sets. To summarize, among all the three algorithms considered, RRT delivered an outlier detection performance which is the most similar to the results reported in the literature.

6. Conclusions

This article proposed a novel signal and noise statistics independent sparse recovery technique based on OMP called residual ratio thresholding and derived finite and large sample guarantees for the same. Numerical simulations in real and synthetic data sets demonstrate a highly competitive performance of RRT when compared to OMP with a priori knowledge of signal and noise statistics. The RRT technique developed in this article can be used to operate sparse recovery techniques that produce a monotonic sequence of support estimates in a signal and noise statistics oblivious fashion. However, the support estimate sequences generated by algorithms like LASSO, DS, SP etc. are not monotonic in nature. Hence, extending the concept of RRT to operate sparse estimation techniques that produce non monotonic support sequences in a signal and noise statistics oblivious fashion is an interesting direction of future research.

References

Arlot, S. and Celisse, A. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40-79, 2010.

Atkinson, A. and Riani, M. Robust Diagnostic Regression Analysis. Springer Science & Business Media, 2000.

Bayati, M., Erdogdu, M. A., and Montanari, A. Estimating LASSO risk and noise level. In Advances in Neural Information Processing Systems, 2013.

Belloni, A., Chernozhukov, V., and Wang, L. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791-806, 2011.

Cai, T. T. and Wang, L. Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory, 57(7):4680-4688, 2011.

Candes, E. and Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313-2351, 2007.

Chichignoud, M., Lederer, J., and Wainwright, M. J. A practical scheme and fast algorithm to tune the lasso with optimality guarantees. Journal of Machine Learning Research, 2016.

Dai, W. and Milenkovic, O. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 55(5):2230-2249, May 2009.

Dicker, L. H. Variance estimation in high-dimensional linear models. Biometrika, 101(2):269-284, 2014.

Dicker, L. H. and Erdogdu, M. A. Maximum likelihood for variance estimation in high-dimensional linear models. In Artificial Intelligence and Statistics, 2016.

Elad, M. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.

Fan, J., Guo, S., and Hao, N. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):37-65, 2012.

Fletcher, A. K. and Rangan, S. Orthogonal matching pursuit: A Brownian motion analysis. IEEE Transactions on Signal Processing, 60(3), March 2012.

Liu, C., Fang, Y., and Liu, J. Some new results about sufficient conditions for exact support recovery of sparse signals via orthogonal matching pursuit. IEEE Transactions on Signal Processing, 65(17), September 2017.

Mallat, S. G. and Zhang, Z. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.

Meinshausen, N. and Bühlmann, P. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417-473, 2010.

Mitra, K., Veeraraghavan, A., and Chellappa, R. Robust regression using sparse learning for high dimensional parameter estimation problems. In IEEE International Conference on Acoustics, Speech and Signal Processing, March 2010.

Mitra, K., Veeraraghavan, A., and Chellappa, R. Analysis of sparse regularization based robust regression approaches. IEEE Transactions on Signal Processing, 61(5):1249-1257, March 2013.

Mousavi, A., Maleki, A., and Baraniuk, R. G. Parameterless optimal approximate message passing. arXiv preprint, 2013.

Pati, Y. C., Rezaiifar, R., and Krishnaprasad, P. S. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Conference Record of the Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, pp. 40-44. IEEE, 1993.

Rousseeuw, P. J. and Leroy, A. M. Robust Regression and Outlier Detection. John Wiley & Sons, 2005.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267-288, 1996.

Tropp, J. A. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231-2242, 2004.
Tropp, J. A. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030-1051, 2006.

Tropp, J. A. and Gilbert, A. C. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655-4666, 2007.

Wang, J. Support recovery with orthogonal matching pursuit in the presence of noise. IEEE Transactions on Signal Processing, 63(21):5868-5877, November 2015.

Wang, X., Dunson, D., and Leng, C. No penalty no tears: Least squares in high-dimensional linear models. In International Conference on Machine Learning, 2016.

Wen, J., Wang, J., and Zhang, Q. Nearly optimal bounds for orthogonal least squares. IEEE Transactions on Signal Processing, 65(20):5347-5356, 2017.

Yang, M. and de Hoog, F. Orthogonal matching pursuit with thresholding and its application in compressive sensing. IEEE Transactions on Signal Processing, 63(20):5479-5486, 2015.

Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301-320, 2005.