Topics on LASSO and Approximate Message Passing
by Ali Mousavi


ABSTRACT

Topics on LASSO and Approximate Message Passing

by Ali Mousavi

This thesis studies the performance of the LASSO (also known as basis pursuit denoising) for recovering sparse signals from undersampled, randomized, noisy measurements. We consider the recovery of the signal x_o ∈ R^N from n random and noisy linear observations y = Ax_o + w, where A is the measurement matrix and w is the noise. The LASSO estimate of x_o is given by the solution of the optimization problem

    x̂_λ = argmin_x (1/2)||y − Ax||_2^2 + λ||x||_1.

Despite major progress in the theoretical analysis of the LASSO solution, little is known about its behavior as a function of the regularization parameter λ. In this thesis we study two questions in the asymptotic setting (i.e., where N → ∞ and n → ∞ while the ratio n/N converges to a fixed number δ ∈ (0, 1)): (i) How does the size of the active set ||x̂_λ||_0/N behave as a function of λ, and (ii) how does the mean square error ||x̂_λ − x_o||_2^2/N behave as a function of λ? We then employ these results in a new, reliable algorithm for solving LASSO based on approximate message passing (AMP). Furthermore, we propose a parameter-free approximate message passing (AMP) algorithm that sets the threshold parameter at each iteration in a fully automatic way, without either requiring any information about the signal to be reconstructed or needing any tuning from the user. We show that the proposed method attains the minimum reconstruction error in the least number of iterations. Our method is based on applying the Stein unbiased risk estimate (SURE)

along with a modified gradient descent to find the optimal threshold in each iteration. Motivated by the connections between AMP and LASSO, it could be employed to find the solution of the LASSO for the optimal regularization parameter. To the best of our knowledge, this is the first work concerning parameter tuning that obtains the smallest MSE in the least number of iterations with theoretical guarantees.

Contents

Abstract
List of Illustrations
List of Tables

1 Introduction
  1.1 Motivation for Analysis of LASSO's Solution Path
  1.2 Analysis of LASSO's Solution Path
  1.3 Implications for Approximate Message Passing Algorithms
  1.4 Motivation for Designing Parameterless Approximate Message Passing
  1.5 Implications of Parameter Tuning for LASSO
  1.6 Related Work in Parameter Tuning
  1.7 Notation
  1.8 Organization of the Thesis

2 Analysis of LASSO's Solution Path
  2.1 Asymptotic CS Framework
  2.2 LASSO's Solution Path
  2.3 Implications for AMP
    2.3.1 AMP in Asymptotic Settings
    2.3.2 Connection Between AMP and LASSO
    2.3.3 Fixed Detection Thresholding

3 Parameter Free Approximate Message Passing
  3.1 Tuning the AMP
    3.1.1 Intuitive Explanation of the AMP Features
    3.1.2 Tuning Scheme
  3.2 Optimal Parameter Tuning for Denoising Problems
    3.2.1 Optimizing the Ideal Risk
    3.2.2 Approximate Gradient Descent Algorithm
    3.2.3 Accuracy of the Gradient Descent Algorithm
  3.3 Optimal Tuning of AMP

4 Simulation Results
  4.1 Phase Transition of AMP
  4.2 Details of Simulations
  4.3 Practical Approximate Gradient Descent Algorithm (estimating the noise variance, setting the step size, avoiding local minima)
  4.4 Accuracy of the Approximate Gradient Descent
  4.5 Impact of Sample Size N (setting Δ_N)
  4.6 Comparison with Other Tuning Procedures

A Proofs of the Main Results
  A.1 Background (quasiconvex functions; risk of the soft thresholding function)
  A.2-A.12 Proofs of the theorems, lemmas, and corollaries of Chapters 2 and 3

Bibliography

Illustrations

1.1 The number of active elements in the solution of LASSO as a function of λ; the size of the active set grows at one location as λ increases
1.2 Histogram of v^t for three different iterations, with the best Gaussian fit
2.1 The number of active elements in the solution of LASSO as a function of λ; the size of the active set decreases monotonically as λ increases
2.2 Behavior of the MSE of LASSO as a function of λ for two different noise variances
3.1 Risk function r(τ, σ) as a function of the threshold parameter τ
3.2 Three different forms for MSE vs. τ, corresponding to different standard deviations of the noise in the observation
3.3 The risk function and its estimate for noiseless measurements
3.4 Risk function and its estimate, with the regions in which the estimate can have local minima
4.1 Comparison between the empirical phase transition of AMP (heatmap of the probability of successful recovery) and the theoretical phase transition curve
4.2 Performance of the proposed tuning algorithm in estimating τ̂_opt at different iterations of AMP (noiseless measurements)
4.3 Performance of the proposed tuning algorithm in estimating τ̂_opt at different iterations of AMP (noisy measurements)
4.4 Performance of the proposed tuning algorithm in estimating τ̂_opt for different values of N (noiseless measurements)
4.5 Performance of the proposed tuning algorithm in estimating τ̂_opt for different values of N (noisy measurements)
4.6 MSE of AMP at each iteration for four different threshold-setting approaches
4.7 Semi-logarithmic version of Figure 4.6
4.8 Logarithmic version of Figure 4.6
A.1 The derivative of the soft-thresholding risk, which exhibits a single sign change (quasi-convexity)
A.2 Dividing [0, τ_max] into equally spaced points, used to show that r̂(τ) is close to E[r̂(τ)] for all τ ∈ [0, τ_max]
A.3 Two monotonically decreasing functions whose supremum distance is achieved at the jump points of the piecewise constant one

Tables

2.1 Some observables and their abbreviations. The function ψ for each observable is also specified.

Chapter 1

Introduction

1.1 Motivation for Analysis of LASSO's Solution Path

Consider the problem of recovering a vector x_o ∈ R^N from a set of undersampled random linear measurements y = Ax_o + w, where A ∈ R^{n×N} is the measurement matrix and w ∈ R^n denotes the noise. One of the most successful recovery algorithms, called basis pursuit denoising or LASSO [1, 2], employs the following optimization problem to obtain an estimate of x_o:

    x̂_λ = argmin_x (1/2)||y − Ax||_2^2 + λ||x||_1.    (1.1)

A rich literature has provided a detailed analysis of this algorithm [3-22]. Most of the work published in this area falls into two categories: (i) non-asymptotic and (ii) asymptotic results. The non-asymptotic results consider N and n to be large but finite numbers and characterize the reconstruction error as a function of N and n. These analyses provide qualitative guidelines on how to design compressed sensing (CS) systems. However, they suffer from loose constants and are incapable of providing quantitative guidelines. Therefore, inspired by the seminal work of Donoho and Tanner [3], researchers have started the asymptotic analysis of LASSO. Such analyses provide sharp quantitative guidelines for designing CS systems.

Despite the major progress in our understanding of LASSO, one aspect of the method that is of major algorithmic importance has remained unexplored. In most of the theoretical work, it is assumed that an oracle has given the optimal value of λ to the statistician/engineer, and the analysis is performed for that optimal value of λ. However, in practice the optimal value of λ is not known a priori. One important analysis that may help both in searching for the optimal value of λ and in designing efficient algorithms for solving LASSO is the behavior of the solution x̂_λ as a function of λ. In this thesis, we conduct such an analysis and demonstrate how the results can be employed for designing efficient approximate message passing algorithms.

1.2 Analysis of LASSO's Solution Path

In this thesis we aim to analyze the properties of the solution of the LASSO as λ changes. The two main problems that we address are:

Q1: How does (1/N)||x̂_λ||_0 change as λ varies?

Q2: How does (1/N)||x̂_λ − x_o||_2^2 change as λ varies?

The first question is about the number of active elements in the solution of the LASSO, and the second one is about the mean squared error. Intuitively speaking, one would expect the size of the active set to shrink as λ increases and the mean square error to be a bowl-shaped function of λ. Unfortunately, the peculiar behavior of LASSO breaks this intuition. See Figure 1.1 for a counter-example; we will clarify the details of this experiment in Section 4.2. This figure exhibits the number of active elements in the solution as we increase the value of λ. It is clear that the size of the active set is not monotonically decreasing. Such pathological examples have discouraged further investigation of these problems in the literature.

[Figure 1.1: The number of active elements in the solution of LASSO as a function of λ. It is clear that this function does not match the intuition: the size of the active set at one location grows as we increase λ. For the details of this experiment, see Section 4.2.]

In this thesis we show that such pathological examples are quite rare, and if we consider the asymptotic setting (that will be described in Section 2.1), then we can provide quite intuitive answers to the two questions raised above. Let us summarize our results here in a non-rigorous way. We will formalize these statements and clarify the conditions under which they hold in Section 2.2.

A1: In the asymptotic setting, (1/N)||x̂_λ||_0 is a decreasing function of λ.

A2: In the asymptotic setting, (1/N)||x̂_λ − x_o||_2^2 is a quasi-convex function of λ.

1.3 Implications for Approximate Message Passing Algorithms

Traditional techniques for solving LASSO, such as the interior point method, have failed to address high-dimensional CS-type problems. Therefore, researchers have started exploring iterative algorithms with inexpensive per-iteration computations.

One such algorithm is called approximate message passing (AMP) [3]; it is given by the following iteration:

    x^{t+1} = η(x^t + A^T z^t; τ_t),    z^t = y − Ax^t + (|I_t|/n) z^{t−1}.    (1.2)

AMP is an iterative algorithm, and t is the index of iteration. x^t is the estimate of x_o at iteration t. η is the soft thresholding function applied component-wise to the elements of the vector: for a ∈ R, η(a; τ) ≜ (|a| − τ)_+ sign(a). I_t ≜ {i : x^t_i ≠ 0}. Finally, τ_t is called the threshold parameter. One of the most interesting features of AMP is that, in the asymptotic setting (which will be clarified later), the distribution of v^t ≜ x^t + A^T z^t − x_o is Gaussian at every iteration, and it can be considered to be independent of x_o. Figure 1.2 shows the empirical distribution of v^t at three different iterations.

[Figure 1.2: Histogram of v^t for three different iterations. The red curve displays the best Gaussian fit.]

As is clear from (1.2), the only parameter that exists in the AMP algorithm is the threshold parameter τ_t at different iterations. It turns out that different choices of this parameter can lead to very different performance.

One choice that has interesting theoretical properties was first introduced in [3, 4] and is based on the Gaussianity of v^t. Suppose that an oracle gives us the standard deviation of v^t at time t, called σ_t. Then one way of determining the threshold is to set τ_t = ασ_t, where α is a fixed number. This is called the fixed false alarm thresholding policy. It turns out that if we set α properly in terms of λ (the regularization parameter of LASSO), then x^t will eventually converge to x̂_λ. The nice theoretical properties of the fixed false alarm thresholding policy come at a price, however, and that is the requirement of estimating σ_t at every iteration, which is not straightforward since we observe x_o + v^t and not v^t. However, the fact that the size of the active set of LASSO is a monotonic function of λ provides a practical and easy way of setting τ_t. We call this approach fixed detection thresholding.

Definition 1.3.1 (Fixed detection thresholding policy). Let 0 < γ < 1. Set the threshold value τ_t to the absolute value of the ⌊γn⌋-th largest element (in absolute value) of x^t + A^T z^t.

Note that a similar thresholding policy has been employed for iterative hard thresholding [5, 6], iterative soft thresholding [7], and AMP [8] in a slightly different way. In these works, it is assumed that the signal is sparse and its sparsity level is known, and γ is set according to the sparsity level. However, here γ is assumed to be a free parameter. In the asymptotic setting, AMP with this thresholding policy is also equivalent to the LASSO in the following sense: for every λ > 0 there exists a unique γ ∈ (0, 1) for which AMP converges to the solution of LASSO as t → ∞. This result is a conclusion of the monotonicity of the size of the active set of LASSO in terms of λ. We will formally state our results regarding the AMP algorithm with fixed detection thresholding policy in Section 2.3.
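To make the iteration (1.2) and the two thresholding policies concrete, the following minimal numpy sketch performs one AMP step and sets the threshold either from an assumed-known noise level σ_t (fixed false alarm) or as the ⌊γn⌋-th largest magnitude of x^t + A^T z^t (fixed detection). The function names are mine and the sketch is illustrative, not the thesis's implementation.

    import numpy as np

    def soft_threshold(a, tau):
        # eta(a; tau) = (|a| - tau)_+ sign(a), applied component-wise
        return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

    def fixed_detection_threshold(x_tilde, gamma, n):
        # Magnitude of the floor(gamma*n)-th largest (in absolute value) entry of the
        # pseudo-data x^t + A^T z^t, so that about gamma*n coefficients stay active.
        k = max(int(np.floor(gamma * n)), 1)
        return np.sort(np.abs(x_tilde))[-k]

    def amp_iteration(y, A, x, z, tau):
        # One step of (1.2): x^{t+1} = eta(x^t + A^T z^t; tau_t),
        # z^{t+1} = y - A x^{t+1} + (|I_{t+1}|/n) z^t   (Onsager correction).
        n = A.shape[0]
        x_new = soft_threshold(x + A.T @ z, tau)
        z_new = y - A @ x_new + (np.count_nonzero(x_new) / n) * z
        return x_new, z_new

    # The two policies inside the AMP loop would look like:
    #   x_tilde = x + A.T @ z
    #   tau = fixed_detection_threshold(x_tilde, gamma, n)   # fixed detection, gamma in (0, 1)
    #   tau = alpha * sigma_t                                # fixed false alarm, needs sigma_t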

1.4 Motivation for Designing Parameterless Approximate Message Passing

One of the main issues in using iterative thresholding algorithms in practice is the tuning of their free parameters. For instance, in AMP one should tune τ_1, τ_2, ... properly to obtain the best performance. The τ_t have a major impact on the following aspects of the algorithm:

(i) The final reconstruction error, lim_{t→∞} ||x^t − x_o||_2^2/N. An improper choice of τ_t could lead the algorithm not to converge to the smallest final reconstruction error.

(ii) The convergence rate of the algorithm to its final solution. A bad choice of τ_t leads to extremely slow convergence of the algorithm.

Ideally speaking, one would like to select the parameters in a way that the final reconstruction error is the smallest while simultaneously the algorithm converges to this solution in the least number of iterations. Addressing these challenges seems to require certain knowledge about x_o. In particular, it seems that for a fixed choice of the thresholds τ_t, lim_{t→∞} ||x^t − x_o||_2^2 depends on x_o; therefore, the optimal choice of the τ_t depends on x_o as well. This issue has motivated researchers to consider the least favorable signals that achieve the maximum value of the mean square error (MSE) for a given choice of thresholds and then tune the τ_t to obtain the minimum MSE for the least favorable signal [4, 9, 30]. These schemes are usually too pessimistic for practical purposes.

One of the main objectives of this thesis is to show that the properties of the AMP algorithm plus the high dimensionality of the problem enable us to set the threshold parameters τ_t such that (i) the algorithm converges to its final solution in the least number of iterations, and (ii) the final solution of the algorithm has the minimum MSE that is achievable for AMP with the optimal set of parameters. The result is a parameter-free AMP algorithm that requires no tuning by the user and at the same time achieves the minimum reconstruction error in the least number of iterations. The statements claimed above are true asymptotically as N → ∞. However, our simulation results show that the algorithm is successful even for medium problem sizes such as N in the low thousands. We will formalize these statements in Sections 3.2 and 3.3.

1.5 Implications of Parameter Tuning for LASSO

As mentioned in Section 1.1, LASSO minimizes the following cost function:

    x̂_λ = argmin_x (1/2)||y − Ax||_2^2 + λ||x||_1.

λ ∈ (0, ∞) is called the regularization parameter. The optimal choice of this parameter has a major impact on the performance of LASSO. It has been shown that the final solutions of AMP with different threshold parameters correspond to the solutions of the LASSO for different values of λ [3, 3, 4, 9, 0]. This equivalence implies that if the parameters of the AMP algorithm are tuned optimally, then the final solution of AMP corresponds to the solution of LASSO for the optimal value of λ, i.e., the value of λ that minimizes the MSE ||x̂_λ − x_o||_2^2/N. Therefore, finding the optimal parameters for AMP automatically provides the optimal parameters for LASSO as well.

1.6 Related Work in Parameter Tuning

Several other papers consider various threshold-setting strategies to improve the convergence rate; see [3, 33]. However, these schemes are based on heuristic arguments and lack theoretical justification. Optimal tuning of parameters to obtain the smallest final reconstruction error has been the focus of major research in CS, machine learning, and statistics. The methods considered in the literature fall into the following three categories:

(i) The first approach is based on obtaining an upper bound for the reconstruction error and setting the parameters to obtain the smallest upper bound. For many of the algorithms proposed in the literature, there exists a theoretical analysis based on certain properties of the matrix, such as RIP [34, 35], coherence [36], and RSC [37]. These analyses can potentially provide a simple approach for tuning parameters. However, they suffer from two issues: (i) the inaccuracy of the upper bounds derived for the risk of the final estimates usually leads to pessimistic parameter choices that are not useful for practical purposes, and (ii) they require an upper bound on the sparsity level [38, 7], which is often not available in practice.

(ii) The second approach is based on the asymptotic analysis of recovery algorithms. The first step in this approach is to employ asymptotic settings to obtain an accurate estimate of the reconstruction error of the recovery algorithms. This is done through either pencil-and-paper analysis or computer simulation. The next step is to employ this asymptotic analysis to obtain the optimal value of the parameters. This approach is employed in [4].

The main drawback of this approach is that the user must know the signal model (or at least an upper bound on the sparsity level of the signal) to obtain the optimal value of the parameters. Usually, an accurate signal model is not available in practice, and hence the tuning should consider the least favorable signal, which leads to a pessimistic tuning of the parameters.

(iii) The third approach involves model selection ideas that are popular in statistics. For a review of these schemes refer to Chapter 7 of [39]. Since the number of parameters that must be tuned in AMP is too large (one parameter per iteration), such schemes are of limited applicability. However, as described in Section 3.1.2, the features of AMP enable us to employ these techniques in certain optimization algorithms and tune the parameters efficiently.

Apart from these general methods, other approaches that skip the parameter tuning of AMP are proposed in [40, 4, 4, 43]. These approaches are inspired by the Bayesian framework; a Gaussian mixture model is considered for x_o, and then the parameters of that mixture are estimated at every iteration of AMP by using an expectation-maximization technique [4]. While these schemes perform well in practice, there is no theoretical result to confirm these observations. A first step toward a mathematical understanding of these methods is taken in [43].

1.7 Notation

Capital letters denote both matrices and random variables. As we may consider a sequence of vectors with different sizes, we sometimes denote x with x(N) to emphasize its dependency on the ambient dimension. For a matrix A, A^T, σ_min(A), and σ_max(A) denote the transpose of A and the minimum and maximum singular values of A, respectively. Calligraphic letters such as 𝒜 denote sets.

For a set 𝒜, |𝒜| and 𝒜^c denote the size of the set and its complement, respectively. For a vector x ∈ R^n, x_i, ||x||_p ≜ (Σ|x_i|^p)^{1/p}, and ||x||_0 = |{i : x_i ≠ 0}| represent the i-th component, the ℓ_p norm, and the ℓ_0 norm, respectively. We use P and E to denote probability and expected value with respect to the measure that will be clear from the context. The notation E_X denotes the expected value with respect to the randomness in the random variable X. The two functions φ and Φ denote the probability density function and cumulative distribution function of the standard normal distribution. I(·) and sign(·) denote the indicator and sign functions, respectively. Finally, O(·) and o(·) denote big-O and small-o notation, respectively.

1.8 Organization of the Thesis

The organization of the thesis is as follows: Chapter 2 sets up the framework and formally states the main contributions regarding the analysis of LASSO's solution path. Chapter 3 considers the tuning of the threshold parameter for the problem of denoising by soft thresholding and connects the results of optimal denoising with the problem of optimal tuning of the parameters of AMP. Chapter 4 presents and summarizes our simulation results. Finally, the Appendix contains the proofs of our main results.

Chapter 2

Analysis of LASSO's Solution Path

2.1 Asymptotic CS Framework

In this thesis we consider the problem of recovering an approximately sparse vector x_o ∈ R^N from n noisy linear observations y = Ax_o + w. Our goal is to analyze the properties of the solution of LASSO, defined in (1.1), on CS problems with the following two main features: (i) the measurement matrix has iid Gaussian elements (with the recent results in CS [44], our results can be easily extended to subgaussian matrices; for notational simplicity we consider the Gaussian setting here), and (ii) the ambient dimension and the number of measurements are large. We adopt the asymptotic framework to incorporate these two features. Here is the formal definition of this framework [4, 9]. Let n, N → ∞ while δ = n/N is fixed. We write the vectors and matrices as x_o(N), A(N), y(N), and w(N) to emphasize the ambient dimension of the problem. Clearly, the number of rows of the matrix A is equal to δN, but we assume that δ is fixed and therefore we do not include n in our notation for A. The same argument applies to y(N) and w(N).

Definition 2.1.1. A sequence of instances {x_o(N), A(N), w(N)} is called a converging sequence if the following conditions hold:

- The empirical distribution of x_o(N) ∈ R^N converges weakly to a probability measure p_{X_o} with bounded second moment.

22 - The empirical distribution of w(n) R n (n = N) converges weakly to a probability measure p W with bounded second moment. - If {e i } N i= denotes the standard basis for R N, then max i ka(n)e i k, min i ka(n)e i k! as N!. Note that we have not imposed any constraint on the limiting distributions p or p W.Infactforthepurposeofthissection,p is not necessarily a sparsity promoting prior. Furthermore, unlike most of the other works that assumes p W is Gaussian, we do not even impose this constraint on the noise. Also, the last condition is equivalent to saying that all the columns have asymptotically unit ` norm. For each problem instance x o (N),A(N), and w(n) we solve LASSO and obtain ˆx (N) astheestimate. We would now like to evaluate certain measures of performance for this estimate such as the mean squared error. The next generalization formalizes the types of measure we are interested in. Definition... Let ˆx (N) be the sequence of solutions of the LASSO problem for the converging sequence of instances {x o (N),A(N),w(N)}. Consider a function : R! R. An observable J is defined as N J x o, ˆx, lim N! N i= A popular choice of the function is M (u, v) =(u observable has the form: J M x o, ˆx, lim N! N N i= x o,i (N) x o,i (N), ˆx i (N). v). For this function the ˆx i (N) = lim N! N kx o ˆx k. Another example of function that we consider in this thesis is D (u, v) =I(v 6= 0), which leads us to J D x o, ˆx, lim N! N N i= kˆx k 0 I(ˆx i 6=0)= lim N! N. (.)

[Table 2.1: Some observables and their abbreviations. The function ψ for each observable is also specified.]

    Name                 Abbreviation    ψ = ψ(u, w)
    Mean Square Error    MSE             ψ = (u − w)^2
    False Alarm Rate     FA              ψ = I(w ≠ 0, u = 0)
    Detection Rate       DR              ψ = I(w ≠ 0)
    Missed Detection     MD              ψ = I(w = 0, u ≠ 0)

Some of the popular observables are summarized in Table 2.1 with their corresponding ψ functions. Note that so far we have not made any major assumption on the sequence of matrices. Following the other works in CS, we now consider random measurement matrices. While all of our discussion can be extended to more general classes of random matrices [44], for notational simplicity we consider A_ij ~ N(0, 1/n). Clearly, these matrices satisfy the unit norm column condition of converging sequences with high probability. Since x̂_λ(N) is random, there are two questions that need to be addressed about lim_{N→∞} (1/N) Σ_{i=1}^N ψ(x_o,i(N), x̂_λ,i(N)): (i) Does the limit exist, and in what sense (e.g., in probability or almost surely)? (ii) Does it converge to a random variable or to a deterministic quantity? The following theorem, conjectured in [4] and proved in [0], shows that under some restrictions on the function ψ, not only does the almost sure limit exist in this scenario, but it also converges to a non-random number.

Theorem 2.1.3. Consider a converging sequence {x_o(N), A(N), w(N)} and let the elements of A be drawn iid from N(0, 1/n). Suppose that x̂_λ(N) is the solution of the LASSO problem. Then for any pseudo-Lipschitz function ψ : R^2 → R, almost surely

    lim_{N→∞} (1/N) Σ_i ψ(x̂_λ,i(N), x_o,i) = E_{X_o,W}[ψ(η(X_o + σ̂W; τ̂), X_o)],    (2.2)

where on the right hand side X_o and W are two random variables with distributions p_{X_o} and N(0, 1), respectively, η is the soft thresholding operator, and σ̂ and τ̂ satisfy the following equations:

    σ̂^2 = σ_w^2 + (1/δ) E_{X_o,W}[(η(X_o + σ̂W; τ̂) − X_o)^2],    (2.3)
    λ = τ̂ (1 − (1/δ) P(|X_o + σ̂W| > τ̂)).    (2.4)

This theorem provides the first step in our analysis of the LASSO's solution path. Before we proceed to the implications of this theorem, let us explain some of its interesting features. Suppose that x̂_λ had iid elements, and each element were in law equal to η(X_o + σ̂W; τ̂), where X_o ~ p_{X_o} and W ~ N(0, 1). Also, x_o,i ~_iid p_{X_o}. If these two assumptions were true, then we could use the strong law of large numbers (SLLN) and argue that (2.2) holds under some mild conditions (required for the SLLN). While this heuristic is not quite correct, and the elements of x̂_λ are not necessarily independent, at the level of calculating the observables defined in Definition 2.1.2 (with ψ pseudo-Lipschitz) this theorem confirms the heuristic. Note that the key element that has led to this heuristic is the randomness in the measurement matrix and the large size of the problem. As we see in (2.2), there are two constants (τ̂, σ̂) that are calculated according to (2.3) and (2.4).

(A function ψ : R^2 → R is pseudo-Lipschitz if there exists a constant L > 0 such that for all x, y ∈ R^2 we have |ψ(x) − ψ(y)| ≤ L(1 + ||x||_2 + ||y||_2) ||x − y||_2.)

It has been shown in [4, 3] that for a fixed λ, these two equations have a unique solution for (τ̂, σ̂). Note that here σ̂ ≥ σ_w, i.e., the variance of the noise that we observe after the reconstruction, σ̂^2, is larger than the input noise variance (according to (2.3)). The extra noise that we observe after the reconstruction is due to subsampling. In fact, if we keep σ_w fixed and decrease δ, then we see that σ̂ increases. This phenomenon is sometimes called noise folding in the CS literature [45, 46]. One of the main applications of Theorem 2.1.3 is in characterizing the normalized mean squared error of the LASSO problem, as summarized by the next corollary.

Corollary 2.1.4. If {x_o(N), A(N), w(N)} is a converging sequence and x̂_λ(N) is the solution of the LASSO problem, then almost surely

    lim_{N→∞} (1/N) ||x̂_λ(N) − x_o(N)||_2^2 = E_{X_o,W}[(η(X_o + σ̂W; τ̂) − X_o)^2],

where σ̂ and τ̂ satisfy (2.3) and (2.4).

As we mentioned before, we are also interested in another observable, namely lim_{N→∞} ||x̂_λ||_0/N. As described in (2.1), this observable can be constructed by using ψ(u, v) = I(v ≠ 0). However, it is not difficult to see that for this observable the function ψ is not pseudo-Lipschitz, and hence Theorem 2.1.3 does not apply. However, as conjectured in [4] and proved in [0], we can still characterize the almost sure limit of this observable.

Theorem 2.1.5. [0] If {x_o(N), A(N), w(N)} is a converging sequence and x̂_λ(N) is the solution of the LASSO problem, then almost surely

    lim_{N→∞} (1/N) Σ_i I(x̂_λ,i(N) ≠ 0) = P(|η(X_o + σ̂W; τ̂)| > 0),

where σ̂ and τ̂ satisfy (2.3) and (2.4).
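Equations (2.3)-(2.4) have no closed form for a general prior p_{X_o}, but they can be evaluated numerically. The sketch below is my own illustrative code, not the thesis's: it fixes the threshold τ̂, iterates (2.3) to convergence by Monte Carlo, and then reads off the corresponding λ from (2.4); sweeping τ̂ traces out the asymptotic solution path predicted by Theorem 2.1.3. The Bernoulli-Gaussian prior at the end is purely an example.

    import numpy as np

    def soft_threshold(a, tau):
        return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

    def lasso_fixed_point(tau_hat, delta, sigma_w, sample_prior, m=200_000,
                          iters=100, rng=np.random.default_rng(0)):
        # Solve (2.3) for sigma_hat at a given threshold tau_hat, then return the
        # lambda that (2.4) pairs with this tau_hat.
        x = sample_prior(m, rng)                        # draws from p_{X_o}
        sigma2 = sigma_w**2 + np.mean(x**2) / delta     # any positive starting point
        for _ in range(iters):
            w = rng.standard_normal(m)
            mse = np.mean((soft_threshold(x + np.sqrt(sigma2) * w, tau_hat) - x)**2)
            sigma2 = sigma_w**2 + mse / delta           # iterate (2.3)
        w = rng.standard_normal(m)
        detect = np.mean(np.abs(x + np.sqrt(sigma2) * w) > tau_hat)
        lam = tau_hat * (1.0 - detect / delta)          # (2.4); valid when detect < delta
        return np.sqrt(sigma2), lam

    # Example prior: Bernoulli-Gaussian with sparsity eps (illustrative only).
    eps = 0.1
    bg = lambda m, rng: rng.standard_normal(m) * (rng.random(m) < eps)
    sigma_hat, lam = lasso_fixed_point(tau_hat=1.0, delta=0.5, sigma_w=0.1, sample_prior=bg)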

2.2 LASSO's Solution Path

In Section 2.1 we characterized two simple expressions for the asymptotic behavior of the normalized mean square error and the normalized number of detections. These two expressions enable us to formalize the two questions that we raised in the Introduction. As mentioned in the Introduction, if we consider a generic CS problem, there are some pathological examples for which the behavior of LASSO is quite unpredictable and inconsistent with our intuition. See Figure 1.1 for an example and Section 4.2 for a detailed description of it. Here, we consider the asymptotic regime introduced in the last section. It turns out that in this setting the solution of LASSO behaves as expected.

Theorem 2.2.1. Let {x_o(N), A(N), w(N)} denote a converging sequence of problem instances as defined in Definition 2.1.1. Suppose that A_ij ~_iid N(0, 1/n). If x̂_λ(N) is the solution of LASSO with regularization parameter λ, then

    lim_{N→∞} (1/N) Σ_i I(x̂_λ,i(N) ≠ 0) ≤ δ.

Furthermore,

    (d/dλ) lim_{N→∞} (1/N) Σ_i I(x̂_λ,i(N) ≠ 0) < 0.

We summarize the proof of this theorem in Section A.2. Intuitively speaking, Theorem 2.2.1 claims that, as we increase the regularization parameter λ, the number of elements in the active set decreases. Also, according to the condition lim_{N→∞} (1/N) Σ_i I(x̂_λ,i(N) ≠ 0) ≤ δ, the largest this fraction can get is δ = n/N. Since the number of active elements is a decreasing function of λ, δ appears only in the limit λ → 0. Figure 2.1 shows the number of active elements as a function of λ for a setting described in Section 4.2. In the next section, we will exploit this property to design and tune AMP for solving the LASSO.

[Figure 2.1: The number of active elements in the solution of LASSO as a function of λ. The size of the active set decreases monotonically as we increase λ.]

Our next result concerns the behavior of the normalized MSE in terms of the regularization parameter λ. In the asymptotic setting, we prove that the normalized MSE is a quasi-convex function of λ. See Section 3.4 of [47] for a short introduction to quasi-convex functions. Figure 2.2 exhibits the behavior of the MSE as a function of λ. The detailed description of this problem instance can be found in Section 4.2. Before we proceed further, we define bowl-shaped functions.

Definition 2.2.2. A quasi-convex function f : R → R is called bowl-shaped if and only if there exists x_0 ∈ R at which f achieves its minimum, i.e., f(x_0) ≤ f(x) for all x ∈ R.

Here is the formal statement of this result.

Theorem 2.2.3. Let {x_o(N), A(N), w(N)} denote a converging sequence of problem instances as defined in Definition 2.1.1. Suppose A_ij ~_iid N(0, 1/n). If x̂_λ(N) is the solution of LASSO with regularization parameter λ, then lim_{N→∞} (1/N) ||x̂_λ(N) − x_o||_2^2 is a quasi-convex function of λ. Furthermore, if p_{X_o}({0}) ≠ 1, then the function is bowl-shaped.

See the proof in Section A.3.

[Figure 2.2: Behavior of the MSE of LASSO as a function of λ for two different noise variances.]

2.3 Implications for AMP

2.3.1 AMP in Asymptotic Settings

In this section we show how the result of Theorem 2.2.1 can lead to an efficient method for setting the threshold in the AMP algorithm. We first review some background on the asymptotic analysis of AMP. This section is mainly based on the results in [3, 4, 9], and the interested reader is referred to these papers for further details. As we mentioned in Section 1.3, AMP is an iterative thresholding algorithm. Therefore, we would like to know the discrepancy of its estimate at every iteration from the original vector x_o.

The following definition formalizes different discrepancy measures for the AMP estimates.

Definition 2.3.1. Let {x_o(N), A(N), w(N)} denote a converging sequence of instances. Let x^t(N) be the sequence of estimates of AMP at iteration t. Consider a function ψ : R^2 → R. An observable J_ψ at time t is defined as

    J_ψ(x_o, x^t) ≜ lim_{N→∞} (1/N) Σ_{i=1}^N ψ(x_o,i(N), x^t_i(N)).

As before, we can consider ψ(u, v) = (u − v)^2, which leads to the normalized MSE of AMP at iteration t. The following result, which was conjectured in [3, 4] and finally proved in [9], provides a simple description of the almost sure limits of the observables.

Theorem 2.3.2. Consider the converging sequence {x_o(N), A(N), w(N)} and let the elements of A be drawn iid from N(0, 1/n). Suppose that x^t(N) is the estimate of AMP at iteration t. Then for any pseudo-Lipschitz function ψ : R^2 → R,

    lim_{N→∞} (1/N) Σ_i ψ(x^t_i(N), x_o,i) = E_{X_o,W}[ψ(η(X_o + σ_t W; τ_t), X_o)]

almost surely, where on the right hand side X_o and W are two random variables with distributions p_{X_o} and N(0, 1), respectively, and σ_t satisfies

    σ_{t+1}^2 = σ_w^2 + (1/δ) E_{X_o,W}[(η(X_o + σ_t W; τ_t) − X_o)^2],    σ_0^2 = σ_w^2 + (1/δ) E[X_o^2].    (2.5)

Similar to our discussion of the solution of the LASSO, this theorem claims that, as far as the calculation of pseudo-Lipschitz observables is concerned, we can assume that the estimates of AMP are modeled as iid elements, with each element modeled in law as η(X_o + σ_t W; τ_t), where X_o ~ p_{X_o} and W ~ N(0, 1). As before, we are also interested in the normalized number of detections.

The following theorem establishes this result.

Theorem 2.3.3. Consider the converging sequence {x_o(N), A(N), w(N)} and let the elements of A be drawn iid from N(0, 1/n). Suppose that x^t(N) is the estimate of AMP at iteration t. Then

    lim_{N→∞} ||x^t(N)||_0 / N = P(|X_o + σ_t W| ≥ τ_t)

almost surely, where on the right hand side X_o and W are two random variables with distributions p_{X_o} and N(0, 1), respectively, and σ_t satisfies (2.5). In other words, the result of Theorem 2.3.2 can be extended to ψ(u, v) = I(v ≠ 0), even though this function is not pseudo-Lipschitz.

2.3.2 Connection Between AMP and LASSO

The AMP algorithm in its general form can be considered as a sparse signal recovery algorithm (for a more general form of AMP, refer to Chapter 5 of [8]). The choice of the threshold parameters τ_t has a major impact on the performance of AMP. It turns out that if we set τ_t appropriately, then the fixed point of AMP corresponds to the solution of LASSO in the asymptotic regime. One such choice of parameters is the fixed false alarm threshold given by τ_t = ασ_t, where σ_t satisfies (2.5). The following result, conjectured in [4, 3] and later proved in [0], formalizes this statement.

Theorem 2.3.4. Consider the converging sequence {x_o(N), A(N), w(N)} and let the elements of A be drawn iid from N(0, 1/n). Let x^t(N) be the estimate of the AMP algorithm with parameter τ_t = ασ_t, where σ_t satisfies (2.5). Assume that lim_{t→∞} σ_t = σ̂. Finally, let x̂_λ denote the solution of the LASSO with parameter λ that satisfies

    λ = ασ̂ (1 − (1/δ) P(|X_o + σ̂W| ≥ ασ̂)).

Then, almost surely,

    lim_{t→∞} lim_{N→∞} (1/N) ||x̂_λ(N) − x^t(N)||_2^2 = 0.

This promising result indicates that AMP can potentially be used as a fast iterative algorithm for solving the LASSO problem. However, it is not readily useful for practical scenarios in which σ_t is not known (since neither x_o nor its distribution is known). Therefore, in the first implementations of AMP, σ_t has been estimated at every iteration from the observations x^t + A^T z^t. From Section 1.3 we know that v^t = x^t + A^T z^t − x_o can be modeled as Gaussian N(0, σ_t^2 I). Therefore, if we had access to v^t we could easily estimate σ_t. However, we only observe x^t + A^T z^t = x_o + v^t, and we have to estimate σ_t from this observation. The estimates that have been proposed so far exploit the fact that x_o is sparse and provide a biased estimate of σ_t. While such biased estimates still work well in practice, our discussion of LASSO provides an easier way to set the threshold. In the next section, based on our analysis of LASSO, we discuss the performance of the fixed detection thresholding policy introduced in Section 1.3, and show that not only can this thresholding policy be implemented in its exact form, but it also has the nice properties of the fixed false alarm threshold.

2.3.3 Fixed Detection Thresholding

AMP looks for the sparsest solution of y = Ax_o + w through the following iterations:

    x^{t+1} = η(x^t + A^T z^t; τ_t),    z^t = y − Ax^t + (|I_t|/n) z^{t−1}.    (2.6)

As was discussed in Chapter 1, a good choice of the threshold parameters τ_t is vital to the good performance of AMP. We proved in Section 2.2 that the number of active elements in the solution of LASSO is a monotonic function of the parameter λ. This motivates us to set the threshold of AMP in a way that at every iteration a certain number of coefficients remains in the active set. To understand this claim better, compare (2.3) for the fixed point of LASSO and (2.5) for the iterations of AMP. Let us replace τ̂ = α̂σ̂ in (2.3). In addition, assume that λ is such that (1/δ) P(|X_o + σ̂W| ≥ α̂σ̂) is equal to γ for some γ ∈ (0, 1). Under these two assumptions, (2.3) and (2.4) can be converted to

    σ̂^2 = σ_w^2 + (1/δ) E_{X_o,W}[(η(X_o + σ̂W; α̂σ̂) − X_o)^2],    λ = α̂σ̂ (1 − γ).    (2.7)

Let us now consider the fixed point of AMP. By letting t → ∞ in (2.5) we obtain

    σ_∞^2 = σ_w^2 + (1/δ) E_{X_o,W}[(η(X_o + σ_∞ W; τ_∞) − X_o)^2],    (2.8)

where σ_∞ ≜ lim_{t→∞} σ_t and τ_∞ ≜ lim_{t→∞} τ_t. Comparing (2.7) and (2.8), we conclude that if we set τ_t in a way that τ_t → α̂σ̂ as t → ∞, then AMP has a fixed point that corresponds to the solution of LASSO. One such approach is the fixed detection thresholding policy that was introduced in Section 1.3. According to this thresholding policy, we keep the size of the active set of AMP fixed at every iteration. Then clearly, if the algorithm converges, the final solution will have the desired number of active elements. In other words, the final solution of the AMP will also satisfy the two equations:

    σ^2 = σ_w^2 + (1/δ) E_{X_o,W}[(η(X_o + σW; τ) − X_o)^2],    γ = (1/δ) P(|X_o + σW| ≥ τ).    (2.9)
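The calibration in (2.7)-(2.9) can be evaluated numerically in the same Monte Carlo style as before. The sketch below is illustrative code of my own (not from the thesis): it solves (2.9) for a given detection fraction γ by alternating the two equations and returns the regularization parameter λ = τ(1 − γ) that the fixed detection policy targets; `sample_prior` is assumed to draw from p_{X_o} as in the earlier sketch, and Lemma 2.3.5 below guarantees that the fixed point being computed is unique.

    import numpy as np

    def soft_threshold(a, tau):
        return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

    def fixed_detection_calibration(gamma, delta, sigma_w, sample_prior, m=200_000,
                                    iters=200, rng=np.random.default_rng(1)):
        # Alternate the two equations in (2.9): given sigma, pick tau so that
        # (1/delta) P(|X_o + sigma W| >= tau) = gamma; given tau, update sigma by the
        # state-evolution equation. Returns (sigma, tau, lambda = tau * (1 - gamma)).
        x = sample_prior(m, rng)
        sigma2 = sigma_w**2 + np.mean(x**2) / delta
        tau = 0.0
        for _ in range(iters):
            w = rng.standard_normal(m)
            pseudo = x + np.sqrt(sigma2) * w
            tau = np.quantile(np.abs(pseudo), 1.0 - gamma * delta)   # detection constraint
            mse = np.mean((soft_threshold(pseudo, tau) - x)**2)
            sigma2 = sigma_w**2 + mse / delta                         # state evolution
        return np.sqrt(sigma2), tau, tau * (1.0 - gamma)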

The first question that we shall address here is whether the above two equations have a unique fixed point. Otherwise, depending on the initialization, AMP may converge to different fixed points.

Lemma 2.3.5. The fixed point of (2.9) is unique, i.e., for every 0 < γ < 1 there is a unique (σ, τ) that satisfies (2.9).

See Section A.4 for the proof of this lemma. The heuristic discussion we have had so far suggests that the fixed point of the AMP algorithm with fixed detection thresholding corresponds to the solution of LASSO. The following theorem formalizes this result.

Theorem 2.3.6. Let x^t(N) be the estimate of AMP with fixed detection thresholding for parameter γ. Let (σ̂, τ̂) satisfy the fixed point equations (2.9). In addition, let x̂_λ(N) be the solution of LASSO for λ = τ̂(1 − γ). Then we have

    lim_{t→∞} lim_{N→∞} (1/N) ||x^t − x̂_λ||_2^2 = 0.

As we will show in Section A.5, the proof of this theorem is essentially the same as the proof of the analogous theorem in [0]. There is a slight change in the proof due to the different thresholding policy that we consider here.

Chapter 3

Parameter Free Approximate Message Passing

3.1 Tuning the AMP

3.1.1 Intuitive Explanation of the AMP Features

In this section, we summarize some of the main features of AMP intuitively. Consider the iterations of AMP defined in (1.2). Define x̃^t ≜ x^t + A^T z^t and v^t ≜ x̃^t − x_o. We call v^t the noise term at the t-th iteration. Clearly, at every iteration AMP calculates x̃^t; in our new notation this can be written as x_o + v^t. If the noise term v^t has an iid zero-mean Gaussian distribution and is independent of x_o, then we can conclude that at every iteration of AMP the soft thresholding is playing the role of a denoiser. The Gaussianity of v^t, if it holds, leads to deeper implications that will be discussed as we proceed. To test the validity of this noise model, we have presented a simulation result in Figure 1.2. This figure exhibits the histogram of v^t overlaid with its Gaussian fit for a CS problem. It has been proved that the Gaussian behavior we observe for the noise term is accurate in the asymptotic setting [3, 4, 9]. In most calculations, if N is large enough, we can assume that v^t is iid Gaussian noise. This astonishing feature of AMP leads to the following theoretically and practically important implications:

(i) The MSE of AMP, i.e., ||x^t − x_o||_2^2/N, can be theoretically predicted (with certain knowledge of x_o) through what is known as state evolution (SE).

(ii) The MSE of AMP can be estimated through the Stein unbiased risk estimate (SURE). This will enable us to optimize the threshold parameters. This scheme will be described in the next section.

3.1.2 Tuning Scheme

In this section we assume that each noisy estimate of AMP, x̃^t, can be modeled as x̃^t = x_o + v^t, where v^t is iid Gaussian noise as claimed in the last section, i.e., v^t ~ N(0, σ_t^2 I), where σ_t denotes the standard deviation of the noise. The goal is to obtain a better estimate of x_o. Since x_o is sparse, AMP applies soft thresholding to obtain a sparse estimate x^{t+1} = η(x̃^t; τ_t). The main question is: how shall we set the threshold parameter τ_t? To address this question, first define the risk (MSE) of the soft thresholding estimator as

    r(τ; σ) = (1/N) E||η(x_o + σu; τ) − x_o||_2^2,

where u ~ N(0, I). Figure 3.1 depicts r(τ, σ) as a function of τ for a given signal x_o and given noise level σ. In order to maximally reduce the MSE we have to set τ to τ_opt, defined as τ_opt = argmin_τ r(τ). There are two major issues in finding the optimizing parameter τ_opt: (i) r(τ, σ) is a function of x_o and hence is not known. (ii) Even if the risk is known, it seems that we still require an exhaustive search over all values of τ (at a certain resolution) to obtain τ_opt. This is due to the fact that r(τ, σ) is not necessarily a well-behaved function, and hence more efficient algorithms such as gradient descent or the Newton method do not necessarily converge to τ_opt.
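When x_o and σ are both known (the oracle setting of this discussion), r(τ; σ) can be approximated by averaging over noise draws and minimized over a grid. A small illustrative sketch, not part of the thesis's algorithms:

    import numpy as np

    def soft_threshold(a, tau):
        return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

    def oracle_risk(x_o, sigma, tau, trials=50, rng=np.random.default_rng(0)):
        # Monte Carlo estimate of r(tau; sigma) = E ||eta(x_o + sigma*u; tau) - x_o||^2 / N
        N = x_o.size
        errs = [np.sum((soft_threshold(x_o + sigma * rng.standard_normal(N), tau) - x_o)**2)
                for _ in range(trials)]
        return np.mean(errs) / N

    # Grid search for tau_opt (only possible with oracle knowledge of x_o):
    # taus = np.linspace(0.0, 3 * sigma, 100)
    # tau_opt = taus[np.argmin([oracle_risk(x_o, sigma, t) for t in taus])]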

[Figure 3.1: Risk function r(τ, σ) as a function of the threshold parameter τ for a k-sparse signal x_o ∈ R^N; τ_opt marks its minimizer.]

Let us first discuss the problem of finding τ_opt when the risk function r(τ, σ) and the noise standard deviation σ are given. In Lemma 3.2.1 we prove that r(τ, σ) is a quasi-convex function of τ. Furthermore, the derivative of r(τ, σ) with respect to τ is zero only at τ_opt. In other words, the MSE does not have any local minima except for the global minimum. Combining these two facts, we will prove in Section 3.2.1 that if the gradient descent algorithm is applied to r(τ, σ), then it will converge to τ_opt. The ideal gradient descent is presented in Algorithm 1. We call this algorithm the ideal gradient descent since it employs r(τ, σ), which is not available in practice.

The other issue we raised above is that in practice the risk (MSE) r(τ, σ) is not given. To address this issue we employ an estimate of r(τ, σ) in the gradient descent algorithm. The following lemma, known as Stein's unbiased risk estimate (SURE) [48], provides an unbiased estimate of the risk function:

Lemma 3.1.1. [49] Let g(x̃) denote the denoiser. If g is weakly differentiable, then

    E||g(x̃) − x_o||_2^2/N = E||g(x̃) − x̃||_2^2/N + σ^2 + 2σ^2 E(1^T(∇g(x̃) − 1))/N,    (3.1)

where ∇g(x̃) denotes the gradient of g and 1 is the all-ones vector.

Algorithm 1 Gradient descent algorithm when the risk function is exactly known. The goal of this thesis is to approximate the iterations of this algorithm.
    Require: r(τ), step size α, tolerance ε
    Ensure: arg min_τ r(τ)
    while |dr(τ)/dτ| > ε do
        τ ← τ − α dr(τ)/dτ
    end while

This lemma provides a simple unbiased estimate of the risk (MSE) of the soft thresholding denoiser:

    r̂(τ, σ) = ||η(x̃; τ) − x̃||_2^2/N + σ^2 + (2σ^2/N) 1^T(η'(x̃; τ) − 1).

We will study the properties of r̂(τ, σ) in Section 3.2.3, and we will show that this estimate is very accurate for high dimensional problems. Furthermore, we will show how this estimate can be employed to provide an estimate of the derivative of r(τ, σ) with respect to τ. Once these two estimates are calculated, we can run the gradient descent algorithm for finding τ_opt. We will show that the gradient descent algorithm that is based on empirical estimates converges to τ̂_opt, which is close to τ_opt and converges to τ_opt in probability as N → ∞. We formalize these statements in Section 3.2.3.
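Because η'(x̃_i; τ) = I(|x̃_i| > τ), the correction term in r̂(τ, σ) reduces to counting the coordinates that survive thresholding. A minimal sketch of the estimate, assuming (as in this section) that the noise level σ is known; the function name is mine:

    import numpy as np

    def soft_threshold(a, tau):
        return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

    def sure_soft_threshold(x_tilde, tau, sigma):
        # SURE estimate r_hat(tau, sigma) of the soft-thresholding MSE, computed from the
        # noisy observation x_tilde = x_o + sigma * u alone (x_o itself is not needed).
        N = x_tilde.size
        denoised = soft_threshold(x_tilde, tau)
        fit = np.sum((denoised - x_tilde) ** 2) / N
        # 1^T (eta'(x_tilde; tau) - 1) = #{ |x_tilde_i| > tau } - N
        div_term = 2.0 * sigma**2 * (np.count_nonzero(np.abs(x_tilde) > tau) - N) / N
        return fit + sigma**2 + div_term

Averaged over noise realizations this quantity is unbiased for r(τ; σ), which is what makes it usable when x_o is unknown.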

3.2 Optimal Parameter Tuning for Denoising Problems

This section considers the problem of tuning the threshold parameter in the soft-thresholding denoising scheme. Section 3.3 connects the results of this section to the problem of tuning the threshold parameters in AMP.

3.2.1 Optimizing the Ideal Risk

Let x̃ ∈ R^N denote a noisy observation of the vector x_o, i.e., x̃ = x_o + w, where w ~ N(0, σ^2 I). Further assume that the noise variance σ^2 is known. Since x_o is either a sparse or an approximately sparse vector, we can employ the soft thresholding function to obtain an estimate of x_o: x̂_τ = η(x̃; τ). This denoising scheme has been proposed in [50], and its optimality properties have been studied in the minimax framework. As is clear from the above formulation, the quality of this estimate is determined by the parameter τ. Furthermore, the optimal value of τ depends both on the signal and on the noise level. Suppose that we consider the MSE to measure the goodness of the estimate x̂_τ:

    r(τ) ≜ (1/N) E||x̂_τ − x_o||_2^2.

According to this criterion, the optimal value of τ is the one that minimizes r(τ). For the moment, assume that r(τ) is given and forget the fact that r(τ) is a function of x_o and hence is not known in practice. Can we find the optimal value of τ, defined as

    τ_opt = argmin_τ r(τ),    (3.2)

efficiently? The following lemma simplifies the answer to this question.

Lemma 3.2.1. [3] r(τ) is a quasi-convex function of τ. Furthermore, the derivative of the function is equal to zero at at most one finite value of τ, and that is τ_opt.

In other words, we will in general observe three different forms for r(τ). These three forms are shown in Figure 3.2. Suppose that we aim to obtain τ_opt.

[Figure 3.2: Three different forms for MSE vs. τ. The plots correspond to different standard deviations of the noise in the observation.]

Lemma 3.2.1 implies that the gradient of r(τ) at any τ points toward τ_opt. Therefore, we expect the gradient descent algorithm to converge to τ_opt. Let τ_t denote the estimate of the gradient descent algorithm at iteration t. Then the updates of the algorithm are given by

    τ_{t+1} = τ_t − α (dr(τ_t)/dτ),    (3.3)

where α is the step size parameter. For instance, if L is an upper bound on the second derivative of r(τ), then we can set α = 1/L. (In practice, we employ backtracking to set the step size.) Our first result shows that, even though the function is not convex, the gradient descent algorithm converges to the optimal value of τ.

Lemma 3.2.2. Let α = 1/L and suppose that the optimizing τ is finite. Then

    lim_{t→∞} dr(τ_t)/dτ = 0.

See Section A.6 for the proof of this lemma.

Note that the properties of the risk function summarized in Lemma 3.2.1 enable us to employ standard techniques to prove the convergence of (3.3). The discussion above is useful if the risk function and its derivative are given. But these two quantities are usually not known in practice. Hence we need to estimate them. The next section explains how we estimate these two quantities.

3.2.2 Approximate Gradient Descent Algorithm

In Section 3.1.2 we described a method to estimate the risk of the soft thresholding function. Here we formally define this empirical unbiased estimate of the risk in the following way:

Definition 3.2.3. The empirical unbiased estimate of the risk is defined as

    r̂(τ) ≜ (1/N) ||η(x̃; τ) − x̃||_2^2 + σ^2 + (2σ^2/N) 1^T(η'(x̃; τ) − 1).    (3.4)

Here, for notational simplicity, we assume that the variance of the noise σ^2 is given. In Chapter 4 we show that estimating σ is straightforward for AMP. Instead of estimating the optimal parameter τ_opt through (3.2), one may employ the following optimization:

    τ̂_opt ≜ argmin_τ r̂(τ).    (3.5)

This approach was proposed by Donoho and Johnstone [5], and the properties of this estimator are derived in [48]. However, [48] does not provide an algorithm for finding τ̂_opt. Exhaustive search approaches are computationally very demanding and hence not very useful for practical purposes (note that τ_opt must be estimated at every iteration of AMP, so we seek very efficient algorithms for this purpose). As discussed in Section 3.2.1, one approach to reduce the computational complexity is to use the gradient descent algorithm. Needless to say, the gradient of r(τ) is not given, and hence it has to be estimated.

One simple idea to estimate the gradient of r(τ) is the following: fix Δ_N and estimate the derivative according to

    dr̂(τ)/dτ = (r̂(τ + Δ_N) − r̂(τ)) / Δ_N.    (3.6)

We will prove in Section 3.2.3 that, if Δ_N is chosen properly, then as N → ∞, dr̂(τ)/dτ → dr(τ)/dτ in probability. Therefore, intuitively speaking, if we plug the estimate of the gradient into (3.3), the resulting algorithm will perform well for large values of N. We will prove in the next section that this intuition is in fact true. Note that since we have introduced Δ_N in the algorithm, it is not completely free of parameters. However, we will show both theoretically and empirically that the performance of the algorithm is not sensitive to the actual value of Δ_N. Hence, the problem of setting Δ_N is simple, and inspired by our theoretical results we will provide suggestions for the value of this parameter in Chapter 4. Therefore, our approximate gradient descent algorithm uses the following iteration:

    τ_{t+1} = τ_t − α (dr̂(τ_t)/dτ),    (3.7)

where as before τ_t is the estimate of τ_opt at iteration t and α denotes the step size. Before we proceed to the analysis section, let us clarify some of the issues that may cause problems for our approximate gradient descent algorithm. First note that since r̂(τ) is an estimate of r(τ), it is not quasi-convex any more. Figure 3.3 compares r(τ) and r̂(τ). As is clear from this figure, r̂(τ) may have more than one local minimum. One important challenge is to ensure that our algorithm is only trapped in a local minimum that is close to the global minimum of r(τ). We will address this issue in Section 3.2.3.
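A compact sketch of the resulting approximate gradient descent (3.6)-(3.7), reusing the SURE estimate from the earlier sketch. The starting point, the iteration count, the particular choice of Δ_N, and the backtracking rule below are illustrative choices of mine, not the tuned values the thesis recommends.

    import numpy as np

    def soft_threshold(a, tau):
        return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

    def sure(x_tilde, tau, sigma):
        N = x_tilde.size
        fit = np.sum((soft_threshold(x_tilde, tau) - x_tilde) ** 2) / N
        return fit + sigma**2 + 2.0 * sigma**2 * (np.count_nonzero(np.abs(x_tilde) > tau) - N) / N

    def approximate_gradient_descent(x_tilde, sigma, tau0=None, iters=50, delta_N=None):
        # Iterate (3.7) with the finite-difference derivative (3.6) of the SURE curve;
        # the step size is set by simple backtracking, as suggested for practice.
        N = x_tilde.size
        tau = sigma if tau0 is None else tau0                    # start near the noise level
        delta_N = N ** (-0.25) if delta_N is None else delta_N   # one admissible choice of Delta_N
        for _ in range(iters):
            r0 = sure(x_tilde, tau, sigma)
            grad = (sure(x_tilde, tau + delta_N, sigma) - r0) / delta_N
            step = 1.0
            while step > 1e-8 and sure(x_tilde, max(tau - step * grad, 0.0), sigma) > r0:
                step /= 2.0
            tau = max(tau - step * grad, 0.0)                    # keep the threshold nonnegative
        return tau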

[Figure 3.3: The dashed black curve denotes the risk function and the solid blue curve its estimate; the measurements are noiseless. For the model used to produce this plot, refer to Section 4.2.]

3.2.3 Accuracy of the Gradient Descent Algorithm

Our Approach

The goal of this section is to provide performance guarantees for the empirical gradient descent algorithm described in Section 3.2.2. We achieve this goal in three steps: (i) characterizing the accuracy of the empirical unbiased risk estimate r̂(τ), (ii) characterizing the accuracy of the empirical estimate of the derivative of the risk dr̂/dτ, and finally (iii) providing a performance guarantee for the approximate gradient descent algorithm.

Accuracy of Empirical Risk

Our first result is concerned with the accuracy of the risk estimate r̂(τ). Consider the following assumption: we know a value τ_max, where τ_opt < τ_max. Note that this is not a major loss of generality, since τ_max can be as large as we require.

Theorem 3.2.4. Let r(τ) be defined according to (3.2) and r̂(τ) be as defined in Definition 3.2.3. Then

    P( sup_{0 < τ < τ_max} |r(τ) − r̂(τ)| ≥ (2 + 4τ_max) N^{−1/2+ε} ) ≤ N e^{−c_1 N^{2ε}} + τ_max N^3 e^{−c_2 N / τ_max^2},

where 0 < ε < 1/2 is an arbitrary but fixed number and c_1, c_2 are positive constants.

See Section A.7 for the proof of Theorem 3.2.4. First note that the probability on the right hand side goes to zero as N → ∞. Therefore, we can conclude that according to Theorem 3.2.4 the difference between r(τ) and r̂(τ) is negligible when N is large (with very high probability). Let τ_opt = argmin_τ r(τ) and τ̂_opt = argmin_τ r̂(τ). The following simple corollary of Theorem 3.2.4 shows that even if we minimize r̂(τ) instead of r(τ), r(τ̂_opt) is still close to r(τ_opt).

Corollary 3.2.5. Let τ_opt and τ̂_opt denote the optimal parameters derived from the actual and empirical risks, respectively. Then

    P( r(τ̂_opt) − r(τ_opt) > (4 + 8τ_max) N^{−1/2+ε} ) ≤ N e^{−c_1 N^{2ε}} + τ_max N^3 e^{−c_2 N / τ_max^2}.

See Section A.8 for the proof of Corollary 3.2.5. Corollary 3.2.5 shows that if we could find the global minimizer of the empirical risk, it would provide a good estimate of τ_opt for high dimensional problems. The only limitation of this result is that finding the global minimizer of r̂(τ) is computationally demanding, as it requires an exhaustive search. Therefore, in the next sections we analyze the fixed points of the approximate gradient descent algorithm.

Accuracy of the Derivative of the Empirical Risk

Our next step is to prove that our estimate of the gradient is also accurate when N is large. The estimate of the gradient of r(τ) is given by

    dr̂(τ)/dτ = (r̂(τ + Δ_N) − r̂(τ)) / Δ_N.    (3.8)

The following theorem describes the accuracy of this estimate:

Theorem 3.2.6. Let Δ_N = ω(N^{−1/2+ε}) and Δ_N = o(1) simultaneously. Then there exists τ_0 ∈ (τ, τ + Δ_N) such that

    P( |dr̂(τ)/dτ − dr(τ_0)/dτ| ≥ (8 + 16τ_max) N^{−1/2+ε} / Δ_N ) ≤ N e^{−c_1 N^{2ε}} + τ_max N^3 e^{−c_2 N / τ_max^2}.

In particular, as N → ∞, dr̂(τ)/dτ converges to dr(τ)/dτ in probability.

The proof of Theorem 3.2.6 is available in Section A.9. The following remarks highlight some of the main implications of Theorem 3.2.6.

Remark: The difference between the actual derivative of the risk and the estimated one is small for large values of N. Therefore, if the actual derivative is positive (and not too small), then the estimated derivative remains positive, and if the actual derivative is negative (and not too small), then the estimated derivative will also be negative. This feature enables gradient descent with an estimate of the derivative to converge to a point that is close to τ_opt.

Remark: Note that the small error that we have in the estimate of the derivative may cause difficulties at the places where the derivative is small. There are two regions in which the derivative is small.

[Figure 3.4: Risk function and its estimate. The estimate of the risk function can have local minima at points where dr(τ)/dτ = O(N^{−1/2+ε}/Δ_N); the two regions where this phenomenon can happen are marked by ellipses.]

As shown in Figure 3.4, the first region is around the optimal value τ_opt, and the second region is at very large values of τ. Note that the small error of the estimates may lead to local minima in these two regions. We show how the algorithm will avoid the local minima that occur at large values of τ. Furthermore, we will show that all the local minima that occur around τ_opt have risk that is close to the optimal risk.

Accuracy of Empirical Gradient Descent

In order to prove the convergence of the gradient descent algorithm we require two assumptions:

(i) We know a value τ_max, where τ_opt < τ_max.

(ii) The magnitude of the second derivative of r(τ) is bounded from above by L, and L is known.

Before we proceed further, let us describe why these two assumptions are required. Note from Figure 3.4 that for very large values of τ, where the derivative of the ideal risk is close to zero, the empirical risk may have many local minima. Therefore, the gradient descent algorithm is not necessarily successful if it goes to this region. Our first condition ensures that we avoid this region: we modify the gradient descent algorithm so that if at a certain iteration it returns τ_t > τ_max, we realize that this is not a correct estimate. The second condition is used to provide a simple way to set the step size in the gradient descent. It is standard in the convex optimization literature to avoid the second condition by setting the step size using the backtracking method. However, for notational simplicity we avoid backtracking in our theoretical analysis, though we employ it in our final implementation of the algorithm. Similarly, the first constraint can be avoided as well; we will propose an approach in the simulation section to avoid the first condition.

Let τ̂_t denote the estimates of the empirical gradient descent algorithm with step size α = 1/L. Also, let τ_t denote the estimates of the gradient descent on the ideal risk function as introduced in (3.3). We can then prove the following.

Theorem 3.2.7. For every iteration t we have

    lim_{N→∞} |τ̂_t − τ_t| = 0,

in probability.

See Section A.10 for the proof.

3.3 Optimal Tuning of AMP

Inspired by the formulation of AMP, define the following Bayesian risk function for the soft thresholding algorithm:

    R_B(τ, σ; p_{X_o}) = E[(η(X_o + σW; τ) − X_o)^2],    (3.9)

where the expected value is with respect to two independent random variables X_o ~ p_{X_o} and W ~ N(0, 1). One of the main features of this risk function is the following.

Lemma 3.3.1. inf_τ R_B(τ, σ; p_{X_o}) is an increasing function of σ.

See the Appendix for the proof of this lemma. While this result is quite intuitive and simple to prove, it has an important implication for the AMP algorithm. Let τ_1, τ_2, ... denote the thresholds of the AMP algorithm at iterations t = 1, 2, .... Clearly, the variance of the noise at iteration T depends on all the preceding thresholds (see Theorem 2.3.2 for the definition of σ_t). Therefore, consider the notation σ_{t+1}(τ_1, τ_2, ..., τ_t) for the value of σ at iteration t + 1.

Definition 3.3.2. A sequence of threshold parameters τ_{*,1}, τ_{*,2}, ..., τ_{*,T} is called optimal for iteration T if and only if

    σ_{T+1}(τ_{*,1}, τ_{*,2}, ..., τ_{*,T}) ≤ σ_{T+1}(τ_1, τ_2, ..., τ_T),    ∀(τ_1, τ_2, ..., τ_T) ∈ [0, ∞)^T.

Note that in the above definition we have assumed that the optimal value is achieved by (τ_{*,1}, ..., τ_{*,T}). This assumption is violated for the case X_o = 0. While we can generalize the definition to include this case, for notational simplicity we skip this special case. The optimal sequence of thresholds has the following two properties:

1. It achieves a certain MSE in the least number of iterations.

2. If we plan to stop the algorithm after T iterations, then it gives the best achievable MSE.

These two claims will be clarified as we proceed. According to Definition 3.3.1, it seems that, in order to tune AMP optimally, we need to know the number of iterations we plan to run it. However, this is not the case for AMP. In fact, at each step of AMP we can optimize the threshold as if we plan to stop the algorithm in the next iteration, and the resulting sequence of thresholds will be optimal for any iteration T. The following theorem formally states this result.

Theorem. Let τ*_1, τ*_2, ..., τ*_T be optimal for iteration T. Then τ*_1, τ*_2, ..., τ*_t is optimal for any iteration t < T. See Section A.11 for the proof of this result.

This theorem, while simple to prove, provides a connection between optimizing the parameters of AMP and the optimal parameter tuning we discussed for the soft thresholding function. For instance, a special case of the above theorem implies that τ*_1 must be optimal for the first iteration. Intuitively speaking, the signal plus Gaussian noise model is correct for this iteration, so we can apply the approximate gradient descent algorithm to obtain τ*_1. Once τ*_1 is calculated, we compute the next iterate x^2 and, again from the above theorem, we know that τ*_2 should be optimal for the denoising problem we obtain in this step. Therefore, we apply approximate gradient descent to obtain an estimate of τ*_2. We continue this process until the algorithm converges to the right solution.

If we had access to the risk function, the above procedure could be applied directly: at every iteration we would find the optimal parameter with the strategy described in Section 3.2, and the resulting algorithm would be optimal for any iteration t. However, as we discussed before, the risk function is not available. Hence we have to estimate it.

Once we estimate the risk function, we can employ the approximate gradient descent strategy described in Section 3.2.2. Consider the following risk estimate that is inspired by SURE:

r̂_t(τ_t)/N = (1/N) ‖η(x^t + A^T z^t; τ_t) − (x^t + A^T z^t)‖₂² + σ_t² + (2σ_t²/N) 1^T (η′(x^t + A^T z^t; τ_t) − 1).   (3.10)

As is clear from our discussion of the soft thresholding function in Section 3.2, we would like to apply the approximate gradient descent algorithm to r̂_t(τ_t)/N. Nevertheless, the question we have to address is whether it really converges to R_B(τ, σ; p_{X_o}). The next theorem establishes this result.

Theorem. Let r̂_t(τ_t)/N denote the estimate of the risk at iteration t of AMP, as defined in (3.10). Then

lim_{N→∞} r̂_t(τ_t)/N = E_{X_o,W}(η(X_o + σ_t W; τ_t) − X_o)²,   (3.11)

almost surely, where X_o and W are two independent random variables with distributions p_{X_o} and N(0, 1), respectively, and σ_t satisfies (2.5). See Section A.12 for the proof of this theorem.

This result justifies the application of the approximate gradient descent to the iterations of AMP. However, as we discussed in Section 3.2, a rigorous proof of the accuracy of the approximate gradient descent requires a stronger notion of convergence; hence, the result of the preceding theorem is not sufficient. One sufficient condition is stated in the next theorem. Let τ̂_{t,s} denote the estimate at the s-th iteration of the approximate gradient descent algorithm within the t-th iteration of AMP. In addition, let τ_{t,s} denote the corresponding estimate of the gradient descent algorithm on the ideal risk at the t-th iteration of AMP.

Theorem. Suppose that there exists a constant γ > 0 such that, for the t-th iteration of AMP,

P( sup_{τ_t} | r̂_t(τ_t)/N − E_{X_o,W}(η(X_o + σ_t W; τ_t) − X_o)² | > c N^{−γ} ) → 0,   (3.12)

as N → ∞. If ε_N = N^{−γ/2}, then

lim_{N→∞} |τ̂_{t,s} − τ_{t,s}| = 0

in probability. The proof of this result is a combination of the proofs of the two theorems of Section 3.2 on the accuracy of the approximate gradient descent and is omitted. Note that (3.12) has not been proved for the iterations of AMP and remains an open problem.
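To make the tuning procedure concrete, the sketch below implements the SURE-type estimate (3.10) for the soft thresholding denoiser and a structural version of AMP in which each iteration's threshold is chosen greedily, as justified by the optimality theorem above. The helper choose_threshold stands for the approximate gradient descent applied to τ → sure_risk_estimate(·, τ, σ̂), and the estimate σ̂_t = ‖z^t‖/√n is a common choice that the thesis may implement differently; this is a sketch of the idea, not the exact algorithm.

```python
import numpy as np

def soft_threshold(x, tau):
    """Soft thresholding denoiser eta(x; tau)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def sure_risk_estimate(pseudo_data, tau, sigma_hat):
    """SURE-type estimate r_hat_t(tau)/N from (3.10) for soft thresholding.

    pseudo_data: the vector x^t + A^T z^t at the current AMP iteration.
    sigma_hat:   an estimate of sigma_t (e.g., ||z^t|| / sqrt(n)).
    """
    N = pseudo_data.size
    eta = soft_threshold(pseudo_data, tau)
    eta_prime = (np.abs(pseudo_data) > tau).astype(float)  # derivative of soft threshold
    return (np.sum((eta - pseudo_data) ** 2) / N
            + sigma_hat ** 2
            + 2.0 * sigma_hat ** 2 * np.sum(eta_prime - 1.0) / N)

def amp_auto_tuned(y, A, choose_threshold, n_iter=30):
    """AMP in which every iteration's threshold is tuned as if the algorithm stopped next.

    choose_threshold(pseudo_data, sigma_hat) should return a threshold that
    (approximately) minimizes the estimated risk of the current denoising problem,
    e.g., by running the approximate gradient descent of Section 3.2 on
    tau -> sure_risk_estimate(pseudo_data, tau, sigma_hat).
    """
    n, N = A.shape
    x = np.zeros(N)
    z = y.copy()
    for _ in range(n_iter):
        pseudo_data = x + A.T @ z                     # behaves like x_o plus Gaussian noise
        sigma_hat = np.linalg.norm(z) / np.sqrt(n)    # common estimate of sigma_t
        tau = choose_threshold(pseudo_data, sigma_hat)
        x = soft_threshold(pseudo_data, tau)
        onsager = (z / n) * np.count_nonzero(x)       # Onsager correction for soft thresholding
        z = y - A @ x + onsager
    return x
```

In practice, choose_threshold can be warm-started from the previous iteration's threshold, which is consistent with the greedy per-iteration optimality described above.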

Chapter 4

Simulation Results

In this chapter, we first study the performance of AMP with the fixed threshold policy from the phase transition point of view. We then give the details of the simulations whose results we reported in the previous sections. Finally, we evaluate the performance of the automatically tuned AMP proposed in Algorithm 1 via simulations. We specifically discuss the effect of the measurement noise, the choice of the parameter ε_N used in (3.6), and the impact of the sample size N on the performance of our method.

4.1 Phase Transition of AMP

Sparse recovery via ℓ₁ minimization is successful only if the number of non-zero values in the signal, i.e., ‖x_o‖₀, is smaller than a certain fraction of n. Let ρ = k/n and δ = n/p be normalized measures of the sparsity level and the problem indeterminacy, respectively. As a result, we have a two-dimensional phase space (δ, ρ) ∈ [0, 1]² in which each point determines a certain sparse recovery problem, and the success or failure of sparse recovery can be read off from this phase space. In most cases, there is a curve (δ, ρ(δ)) such that the probability of successful recovery tends to 0 or 1 as the sparsity level goes above or below this curve, respectively [3]. In this section, we determine the phase space corresponding to AMP with the fixed threshold policy by observing the fraction of successful recoveries for different values of δ and ρ. In other words, we measure the empirical phase transition of AMP with the fixed thresholding policy and compare it with the available theoretical bound from [5, 3, 3].

The theoretical bound for the phase transition under linear programming (LP) reconstruction is given parametrically by

δ(z) = 2φ(z) / (z + 2(φ(z) − zΦ(−z))),    ρ(z) = 1 − zΦ(−z)/φ(z),   (4.1)

where φ and Φ denote the standard normal density and distribution function.

On the other hand, in order to measure the empirical phase transition of AMP with the fixed thresholding policy, i.e., in order to observe the fraction of successful sparse recoveries for different values of δ and ρ, we first have to determine the threshold parameter of AMP with which to run the algorithm and measure its performance. Therefore, according to the definition of the fixed thresholding policy, we have to set its free parameter. To find this free parameter, we use the relationship between this parameter and the regularization parameter λ of the LASSO. The solution of the LASSO is given by

x̂_λ = argmin_x (1/2)‖y − Ax‖₂² + λ‖x‖₁.   (4.2)

As we showed in Theorem 2.6, the number of elements in the active set of x̂_λ, i.e., ‖x̂_λ‖₀, ranges between 0 and the number of measurements n. Therefore, the least sparse solution corresponds to the case ‖x̂_λ‖₀ = n. In addition, it is well known that if we let λ → 0, then the solution of the LASSO tends to the least sparse solution. As a result, in order to measure the phase transition, we have to let λ → 0, which corresponds to recovery of the least sparse signals. Through the relationship above, this determines the corresponding limiting value of the free parameter, and this is the value we use when calculating the empirical phase transition of the AMP algorithm with the fixed thresholding policy.

In order to calculate the empirical phase transition, we produce a heatmap in which each cell corresponds to the probability of successful recovery.

We consider a square matrix P in which the columns correspond to a set of equi-spaced values of δ between 0.1 and 0.9, and the rows correspond to an equal number of equi-spaced values of ρ between ρ(δ = 0.1) and ρ(δ = 0.9). For each δ̂, we consider 50 equi-spaced values of ρ in [0.8 ρ(δ̂), 1.2 ρ(δ̂)], where ρ(δ̂) is obtained from (4.1); we call this set of values the ρ-grid associated with δ̂. In order to calculate the probability of correct recovery, for each δ̂ and ρ̂ we use M Monte Carlo trials. For the j-th Monte Carlo trial, we define the success variable

S_{δ̂,ρ̂,j} = I( ‖x̂_o − x_o‖ / ‖x_o‖ < tol ),

where tol is the threshold on the relative error used for evaluating the performance of AMP, and x̂_o is the AMP estimate obtained with the fixed thresholding policy. We then define the empirical success probability as P̂_{δ̂,ρ̂} = (1/M) Σ_j S_{δ̂,ρ̂,j}. Having calculated P̂_{δ̂,ρ̂} for every δ̂ and ρ̂, we use linear interpolation to fill in the entries of P. In running the AMP algorithm with the fixed thresholding policy, we use the following setup:

- The size of x_o is set to a fixed value p; the size of the measurement vector and the sparsity level are obtained according to n = ⌊δp⌋ and k = ⌊ρn⌋.

- The measurements are noise-free and are obtained according to y = Ax_o, where A has i.i.d. elements drawn from the Gaussian distribution N(0, 1/n).

- The threshold tol on the relative error is set to a small fixed value.

- The number of iterations of AMP is set to 500.

Figure 4.1 shows the empirical heatmap of the probability of success we obtained. The black line is the theoretical curve from (4.1). The red and blue colors refer to successful and unsuccessful recovery, respectively. We can see the close agreement between the theoretical curve and the empirical phase transition in this figure.
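For concreteness, the sketch below shows how the two ingredients of Figure 4.1 could be computed: the theoretical curve is traced from the parametric form (4.1) as reconstructed above, and one cell of the empirical heatmap is estimated by Monte Carlo as described in the text. The helper run_amp_fixed_threshold (AMP with the fixed thresholding policy), the distribution of the nonzero entries of x_o, and the grid of z values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def lp_phase_transition(z_grid=np.linspace(1e-3, 6.0, 2000)):
    """Trace the LP phase transition curve (4.1) parametrically; returns (delta, rho)."""
    phi = norm.pdf(z_grid)       # standard normal density
    Phi_neg = norm.cdf(-z_grid)  # standard normal tail probability
    delta = 2.0 * phi / (z_grid + 2.0 * (phi - z_grid * Phi_neg))
    rho = 1.0 - z_grid * Phi_neg / phi
    return delta, rho

def empirical_success_prob(delta_hat, rho_hat, p, M, tol, run_amp_fixed_threshold,
                           rng=np.random.default_rng(0)):
    """Fraction of M Monte Carlo trials in which AMP with the fixed thresholding policy
    recovers x_o within relative error tol, for one cell (delta_hat, rho_hat) of the grid.

    run_amp_fixed_threshold(y, A) is assumed to return the AMP estimate x_hat.
    """
    n = int(np.floor(delta_hat * p))
    k = max(int(np.floor(rho_hat * n)), 1)            # at least one nonzero entry
    successes = 0
    for _ in range(M):
        A = rng.standard_normal((n, p)) / np.sqrt(n)  # i.i.d. N(0, 1/n) entries
        x_o = np.zeros(p)
        support = rng.choice(p, size=k, replace=False)
        x_o[support] = rng.standard_normal(k)         # nonzero values: an illustrative choice
        y = A @ x_o                                   # noise-free measurements
        x_hat = run_amp_fixed_threshold(y, A)
        if np.linalg.norm(x_hat - x_o) / np.linalg.norm(x_o) < tol:
            successes += 1
    return successes / M
```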

Figure 4.1 : Comparison between the empirical phase transition (heatmap of the probability of success) and the theoretical phase transition curve (black curve) obtained from (4.1). In the heatmap, red corresponds to probability 1 (successful recovery) and blue corresponds to probability 0 (unsuccessful recovery).

4.2 Details of Simulations

Here we include the details of the simulations whose results we reported in the previous sections.

4.2.1 Figure 1.1

The dataset we used in this simulation is taken from [53]. The response variable y ∈ R^442 corresponds to the one-year progression of diabetes in 442 patients. We have 10 predictor variables, namely age, sex, body mass index, average blood pressure, and six blood serum measurements; therefore, the design matrix lies in R^{442×10}. We solved the LASSO for different values of λ and plotted the number of nonzero elements of x̂_λ, i.e., ‖x̂_λ‖₀, as a function of λ in Figure 1.1. As mentioned previously, for this specific problem, this function is not monotonically decreasing.
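As a pointer for reproducing this experiment, the sketch below uses scikit-learn's bundled copy of the same diabetes dataset; the choice of library and of the alpha grid are assumptions (the thesis does not specify its software), and scikit-learn's Lasso scales the quadratic term by 1/(2n), so its alpha is a rescaled version of the λ in (4.2).

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)      # 442 patients, 10 predictors

alphas = np.logspace(-3, 1, 60)            # sweep of regularization strengths
active_set_sizes = []
for alpha in alphas:
    # scikit-learn's Lasso minimizes (1/(2n))*||y - Xw||^2 + alpha*||w||_1,
    # so alpha corresponds to a rescaled version of the lambda in (4.2).
    model = Lasso(alpha=alpha, fit_intercept=True, max_iter=100_000).fit(X, y)
    active_set_sizes.append(np.count_nonzero(model.coef_))

# Plotting active_set_sizes against alphas gives a curve analogous to Figure 1.1;
# the thesis reports that for this dataset the curve is not monotone decreasing
# (a fine grid of regularization values may be needed to observe this).
```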
