Topics on LASSO and Approximate Message Passing
by Ali Mousavi


ABSTRACT

Topics on LASSO and Approximate Message Passing

by Ali Mousavi

This thesis studies the performance of the LASSO (also known as basis pursuit denoising) for recovering sparse signals from undersampled, randomized, noisy measurements. We consider the recovery of the signal x_o ∈ R^N from n random and noisy linear observations y = Ax_o + w, where A is the measurement matrix and w is the noise. The LASSO estimate of x_o is given by the solution of the optimization problem

    x̂_λ = argmin_x (1/2)||y − Ax||_2^2 + λ||x||_1.

Despite major progress in the theoretical analysis of the LASSO solution, little is known about its behavior as a function of the regularization parameter λ. In this thesis we study two questions in the asymptotic setting (i.e., where N → ∞ and n → ∞ while the ratio n/N converges to a fixed number δ ∈ (0, 1)): (i) How does the size of the active set ||x̂_λ||_0/N behave as a function of λ, and (ii) how does the mean square error ||x̂_λ − x_o||_2^2/N behave as a function of λ? We then employ these results in a new, reliable algorithm for solving LASSO based on approximate message passing (AMP). Furthermore, we propose a parameter-free approximate message passing (AMP) algorithm that sets the threshold parameter at each iteration in a fully automatic way, without either requiring any information about the signal to be reconstructed or needing any tuning from the user. We show that the proposed method attains the minimum reconstruction error in the least number of iterations. Our method is based on applying the Stein unbiased risk estimate (SURE)

along with a modified gradient descent to find the optimal threshold in each iteration. Motivated by the connections between AMP and LASSO, it could be employed to find the solution of the LASSO for the optimal regularization parameter. To the best of our knowledge, this is the first work concerning parameter tuning that obtains the smallest MSE in the least number of iterations with theoretical guarantees.

Contents

Abstract
List of Illustrations
List of Tables

1 Introduction
  1.1 Motivation for Analysis of LASSO's Solution Path
  1.2 Analysis of LASSO's Solution Path
  1.3 Implications for Approximate Message Passing Algorithms
  1.4 Motivation for Designing Parameterless Approximate Message Passing
  1.5 Implications of Parameter Tuning for LASSO
  1.6 Related Work in Parameter Tuning
  1.7 Notation
  1.8 Organization of the Thesis

2 Analysis of LASSO's Solution Path
  2.1 Asymptotic CS Framework
  2.2 LASSO's Solution Path
  2.3 Implications for AMP
    2.3.1 AMP in Asymptotic Settings
    2.3.2 Connection Between AMP and LASSO
    2.3.3 Fixed Detection Thresholding

3 Parameter Free Approximate Message Passing
  3.1 Tuning the AMP
    3.1.1 Intuitive Explanation of the AMP Features
    3.1.2 Tuning Scheme
  3.2 Optimal Parameter Tuning for Denoising Problems
    3.2.1 Optimizing the Ideal Risk
    3.2.2 Approximate Gradient Descent Algorithm
    3.2.3 Accuracy of the Gradient Descent Algorithm
  3.3 Optimal Tuning of AMP

4 Simulation Results
  4.1 Phase Transition of AMP
  4.2 Details of Simulations
  4.3 Practical Approximate Gradient Descent Algorithm (estimating the noise variance, setting the step size, avoiding local minima)
  4.4 Accuracy of the Approximate Gradient Descent
  4.5 Impact of Sample Size N (setting Δ_N)
  4.6 Comparison with Other Tuning Procedures

A Proofs of the Main Results
  A.1 Background (quasiconvex functions; risk of the soft thresholding function)
  A.2-A.12 Proofs of the theorems, lemmas, and corollaries of Chapters 2 and 3

Bibliography

Illustrations

1.1 The number of active elements in the solution of LASSO as a function of λ; the size of the active set grows at one location as λ increases
1.2 Histogram of v^t for three different iterations, with the best Gaussian fit
2.1 The number of active elements in the solution of LASSO as a function of λ; the size of the active set decreases monotonically as λ increases
2.2 Behavior of the MSE of LASSO as a function of λ for two different noise variances
3.1 Risk function r(τ, σ) as a function of the threshold parameter τ
3.2 Three different forms for MSE vs. τ, corresponding to different standard deviations of the noise in the observation
3.3 The risk function and its estimate for noiseless measurements
3.4 Risk function and its estimate, with the regions in which the estimate can have local minima
4.1 Comparison between the empirical phase transition of AMP (heatmap of the probability of successful recovery) and the theoretical phase transition curve
4.2 Performance of the proposed tuning algorithm in estimating τ̂_opt at different iterations of AMP (noiseless measurements)
4.3 Performance of the proposed tuning algorithm in estimating τ̂_opt at different iterations of AMP (noisy measurements)
4.4 Performance of the proposed tuning algorithm in estimating τ̂_opt for different values of N (noiseless measurements)
4.5 Performance of the proposed tuning algorithm in estimating τ̂_opt for different values of N (noisy measurements)
4.6 MSE of AMP at each iteration for four different threshold-setting approaches
4.7 Semi-logarithmic version of Figure 4.6
4.8 Logarithmic version of Figure 4.6
A.1 The derivative of the soft-thresholding risk, which exhibits a single sign change (quasi-convexity)
A.2 Dividing [0, τ_max] into equally spaced points, used to show that r̂(τ) is close to E[r̂(τ)] for all τ ∈ [0, τ_max]
A.3 Two monotonically decreasing functions whose supremum distance is achieved at the jump points of the piecewise constant one

Tables

2.1 Some observables and their abbreviations. The function ψ for each observable is also specified.

Chapter 1

Introduction

1.1 Motivation for Analysis of LASSO's Solution Path

Consider the problem of recovering a vector x_o ∈ R^N from a set of undersampled random linear measurements y = Ax_o + w, where A ∈ R^{n×N} is the measurement matrix and w ∈ R^n denotes the noise. One of the most successful recovery algorithms, called basis pursuit denoising or LASSO [1, 2], employs the following optimization problem to obtain an estimate of x_o:

    x̂_λ = argmin_x (1/2)||y − Ax||_2^2 + λ||x||_1.    (1.1)

A rich literature has provided a detailed analysis of this algorithm [3-22]. Most of the work published in this area falls into two categories: (i) non-asymptotic and (ii) asymptotic results. The non-asymptotic results consider N and n to be large but finite numbers and characterize the reconstruction error as a function of N and n. These analyses provide qualitative guidelines on how to design compressed sensing (CS) systems. However, they suffer from loose constants and are incapable of providing quantitative guidelines. Therefore, inspired by the seminal work of Donoho and Tanner [3], researchers have started the asymptotic analysis of LASSO. Such analyses provide sharp quantitative guidelines for designing CS systems.

Despite the major progress in our understanding of LASSO, one aspect of the method that is of major algorithmic importance has remained unexplored. In most of the theoretical work, it is assumed that an oracle has given the optimal value of λ to the statistician/engineer, and the analysis is performed for that optimal value of λ. However, in practice the optimal value of λ is not known a priori. One important analysis that may help both in searching for the optimal value of λ and in designing efficient algorithms for solving LASSO is the behavior of the solution x̂_λ as a function of λ. In this thesis, we conduct such an analysis and demonstrate how the results can be employed for designing efficient approximate message passing algorithms.

1.2 Analysis of LASSO's Solution Path

In this thesis we aim to analyze the properties of the solution of the LASSO as λ changes. The two main problems that we address are:

Q1: How does (1/N)||x̂_λ||_0 change as λ varies?

Q2: How does (1/N)||x̂_λ − x_o||_2^2 change as λ varies?

The first question is about the number of active elements in the solution of the LASSO, and the second one is about the mean squared error. Intuitively speaking, one would expect the size of the active set to shrink as λ increases and the mean square error to be a bowl-shaped function of λ. Unfortunately, the peculiar behavior of LASSO breaks this intuition. See Figure 1.1 for a counter-example; we will clarify the details of this experiment in Section 4.2. This figure exhibits the number of active elements in the solution as we increase the value of λ. It is clear that the size of the active set is not monotonically decreasing. Such pathological examples have discouraged further investigation of these problems in the literature.

[Figure 1.1: The number of active elements in the solution of LASSO as a function of λ. It is clear that this function does not match the intuition: the size of the active set at one location grows as we increase λ. For the details of this experiment, see Section 4.2.]

In this thesis we show that such pathological examples are quite rare, and if we consider the asymptotic setting (that will be described in Section 2.1), then we can provide quite intuitive answers to the two questions raised above. Let us summarize our results here in a non-rigorous way. We will formalize these statements and clarify the conditions under which they hold in Section 2.2.

A1: In the asymptotic setting, (1/N)||x̂_λ||_0 is a decreasing function of λ.

A2: In the asymptotic setting, (1/N)||x̂_λ − x_o||_2^2 is a quasi-convex function of λ.

1.3 Implications for Approximate Message Passing Algorithms

Traditional techniques for solving LASSO, such as the interior point method, have failed to address high-dimensional CS-type problems. Therefore, researchers have started exploring iterative algorithms with inexpensive per-iteration computations.

One such algorithm is called approximate message passing (AMP) [3]; it is given by the following iteration:

    x^{t+1} = η(x^t + A^T z^t; τ_t),    z^t = y − Ax^t + (|I_t|/n) z^{t−1}.    (1.2)

AMP is an iterative algorithm, and t is the index of iteration. x^t is the estimate of x_o at iteration t. η is the soft thresholding function applied component-wise to the elements of the vector: for a ∈ R, η(a; τ) ≜ (|a| − τ)_+ sign(a). I_t ≜ {i : x^t_i ≠ 0}. Finally, τ_t is called the threshold parameter. One of the most interesting features of AMP is that, in the asymptotic setting (which will be clarified later), the distribution of v^t ≜ x^t + A^T z^t − x_o is Gaussian at every iteration, and it can be considered to be independent of x_o. Figure 1.2 shows the empirical distribution of v^t at three different iterations.

[Figure 1.2: Histogram of v^t for three different iterations. The red curve displays the best Gaussian fit.]

As is clear from (1.2), the only parameter that exists in the AMP algorithm is the threshold parameter τ_t at different iterations. It turns out that different choices of this parameter can lead to very different performance.

One choice that has interesting theoretical properties was first introduced in [3, 4] and is based on the Gaussianity of v^t. Suppose that an oracle gives us the standard deviation of v^t at time t, called σ_t. Then one way of determining the threshold is to set τ_t = ασ_t, where α is a fixed number. This is called the fixed false alarm thresholding policy. It turns out that if we set α properly in terms of λ (the regularization parameter of LASSO), then x^t will eventually converge to x̂_λ. The nice theoretical properties of the fixed false alarm thresholding policy come at a price, however, and that is the requirement of estimating σ_t at every iteration, which is not straightforward since we observe x_o + v^t and not v^t. However, the fact that the size of the active set of LASSO is a monotonic function of λ provides a practical and easy way of setting τ_t. We call this approach fixed detection thresholding.

Definition 1.3.1 (Fixed detection thresholding policy). Let 0 < γ < 1. Set the threshold value τ_t to the absolute value of the ⌊γn⌋-th largest element (in absolute value) of x^t + A^T z^t.

Note that a similar thresholding policy has been employed for iterative hard thresholding [5, 6], iterative soft thresholding [7], and AMP [8] in a slightly different way. In these works, it is assumed that the signal is sparse and its sparsity level is known, and γ is set according to the sparsity level. However, here γ is assumed to be a free parameter. In the asymptotic setting, AMP with this thresholding policy is also equivalent to the LASSO in the following sense: for every λ > 0 there exists a unique γ ∈ (0, 1) for which AMP converges to the solution of LASSO as t → ∞. This result is a conclusion of the monotonicity of the size of the active set of LASSO in terms of λ. We will formally state our results regarding the AMP algorithm with fixed detection thresholding policy in Section 2.3.
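To make the iteration (1.2) and the two thresholding policies concrete, the following minimal numpy sketch performs one AMP step and sets the threshold either from an assumed-known noise level σ_t (fixed false alarm) or as the ⌊γn⌋-th largest magnitude of x^t + A^T z^t (fixed detection). The function names are mine and the sketch is illustrative, not the thesis's implementation.

    import numpy as np

    def soft_threshold(a, tau):
        # eta(a; tau) = (|a| - tau)_+ sign(a), applied component-wise
        return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

    def fixed_detection_threshold(x_tilde, gamma, n):
        # Magnitude of the floor(gamma*n)-th largest (in absolute value) entry of the
        # pseudo-data x^t + A^T z^t, so that about gamma*n coefficients stay active.
        k = max(int(np.floor(gamma * n)), 1)
        return np.sort(np.abs(x_tilde))[-k]

    def amp_iteration(y, A, x, z, tau):
        # One step of (1.2): x^{t+1} = eta(x^t + A^T z^t; tau_t),
        # z^{t+1} = y - A x^{t+1} + (|I_{t+1}|/n) z^t   (Onsager correction).
        n = A.shape[0]
        x_new = soft_threshold(x + A.T @ z, tau)
        z_new = y - A @ x_new + (np.count_nonzero(x_new) / n) * z
        return x_new, z_new

    # The two policies inside the AMP loop would look like:
    #   x_tilde = x + A.T @ z
    #   tau = fixed_detection_threshold(x_tilde, gamma, n)   # fixed detection, gamma in (0, 1)
    #   tau = alpha * sigma_t                                # fixed false alarm, needs sigma_t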

1.4 Motivation for Designing Parameterless Approximate Message Passing

One of the main issues in using iterative thresholding algorithms in practice is the tuning of their free parameters. For instance, in AMP one should tune τ_1, τ_2, ... properly to obtain the best performance. The τ_t have a major impact on the following aspects of the algorithm:

(i) The final reconstruction error, lim_{t→∞} ||x^t − x_o||_2^2/N. An improper choice of τ_t could lead the algorithm not to converge to the smallest final reconstruction error.

(ii) The convergence rate of the algorithm to its final solution. A bad choice of τ_t leads to extremely slow convergence of the algorithm.

Ideally speaking, one would like to select the parameters in a way that the final reconstruction error is the smallest while simultaneously the algorithm converges to this solution in the least number of iterations. Addressing these challenges seems to require certain knowledge about x_o. In particular, it seems that for a fixed choice of the thresholds τ_t, lim_{t→∞} ||x^t − x_o||_2^2 depends on x_o; therefore, the optimal choice of the τ_t depends on x_o as well. This issue has motivated researchers to consider the least favorable signals that achieve the maximum value of the mean square error (MSE) for a given choice of thresholds and then tune the τ_t to obtain the minimum MSE for the least favorable signal [4, 9, 30]. These schemes are usually too pessimistic for practical purposes.

One of the main objectives of this thesis is to show that the properties of the AMP algorithm plus the high dimensionality of the problem enable us to set the threshold parameters τ_t such that (i) the algorithm converges to its final solution in the least number of iterations, and (ii) the final solution of the algorithm has the minimum MSE that is achievable for AMP with the optimal set of parameters. The result is a parameter-free AMP algorithm that requires no tuning by the user and at the same time achieves the minimum reconstruction error in the least number of iterations. The statements claimed above are true asymptotically as N → ∞. However, our simulation results show that the algorithm is successful even for medium problem sizes such as N in the low thousands. We will formalize these statements in Sections 3.2 and 3.3.

1.5 Implications of Parameter Tuning for LASSO

As mentioned in Section 1.1, LASSO minimizes the following cost function:

    x̂_λ = argmin_x (1/2)||y − Ax||_2^2 + λ||x||_1.

λ ∈ (0, ∞) is called the regularization parameter. The optimal choice of this parameter has a major impact on the performance of LASSO. It has been shown that the final solutions of AMP with different threshold parameters correspond to the solutions of the LASSO for different values of λ [3, 3, 4, 9, 0]. This equivalence implies that if the parameters of the AMP algorithm are tuned optimally, then the final solution of AMP corresponds to the solution of LASSO for the optimal value of λ, i.e., the value of λ that minimizes the MSE ||x̂_λ − x_o||_2^2/N. Therefore, finding the optimal parameters for AMP automatically provides the optimal parameters for LASSO as well.

1.6 Related Work in Parameter Tuning

Several other papers consider various threshold-setting strategies to improve the convergence rate; see [3, 33]. However, these schemes are based on heuristic arguments and lack theoretical justification. Optimal tuning of parameters to obtain the smallest final reconstruction error has been the focus of major research in CS, machine learning, and statistics. The methods considered in the literature fall into the following three categories:

(i) The first approach is based on obtaining an upper bound for the reconstruction error and setting the parameters to obtain the smallest upper bound. For many of the algorithms proposed in the literature, there exists a theoretical analysis based on certain properties of the matrix, such as RIP [34, 35], coherence [36], and RSC [37]. These analyses can potentially provide a simple approach for tuning parameters. However, they suffer from two issues: (i) the inaccuracy of the upper bounds derived for the risk of the final estimates usually leads to pessimistic parameter choices that are not useful for practical purposes, and (ii) they require an upper bound on the sparsity level [38, 7], which is often not available in practice.

(ii) The second approach is based on the asymptotic analysis of recovery algorithms. The first step in this approach is to employ asymptotic settings to obtain an accurate estimate of the reconstruction error of the recovery algorithms. This is done through either pencil-and-paper analysis or computer simulation. The next step is to employ this asymptotic analysis to obtain the optimal value of the parameters. This approach is employed in [4].

The main drawback of this approach is that the user must know the signal model (or at least an upper bound on the sparsity level of the signal) to obtain the optimal value of the parameters. Usually, an accurate signal model is not available in practice, and hence the tuning should consider the least favorable signal, which leads to a pessimistic tuning of the parameters.

(iii) The third approach involves model selection ideas that are popular in statistics. For a review of these schemes refer to Chapter 7 of [39]. Since the number of parameters that must be tuned in AMP is too large (one parameter per iteration), such schemes are of limited applicability. However, as described in Section 3.1.2, the features of AMP enable us to employ these techniques in certain optimization algorithms and tune the parameters efficiently.

Apart from these general methods, other approaches that skip the parameter tuning of AMP are proposed in [40, 4, 4, 43]. These approaches are inspired by the Bayesian framework; a Gaussian mixture model is considered for x_o, and then the parameters of that mixture are estimated at every iteration of AMP by using an expectation-maximization technique [4]. While these schemes perform well in practice, there is no theoretical result to confirm these observations. A first step toward a mathematical understanding of these methods is taken in [43].

1.7 Notation

Capital letters denote both matrices and random variables. As we may consider a sequence of vectors with different sizes, we sometimes denote x with x(N) to emphasize its dependency on the ambient dimension. For a matrix A, A^T, σ_min(A), and σ_max(A) denote the transpose of A and the minimum and maximum singular values of A, respectively. Calligraphic letters such as 𝒜 denote sets.

For a set 𝒜, |𝒜| and 𝒜^c denote the size of the set and its complement, respectively. For a vector x ∈ R^n, x_i, ||x||_p ≜ (Σ|x_i|^p)^{1/p}, and ||x||_0 = |{i : x_i ≠ 0}| represent the i-th component, the ℓ_p norm, and the ℓ_0 norm, respectively. We use P and E to denote probability and expected value with respect to the measure that will be clear from the context. The notation E_X denotes the expected value with respect to the randomness in the random variable X. The two functions φ and Φ denote the probability density function and cumulative distribution function of the standard normal distribution. I(·) and sign(·) denote the indicator and sign functions, respectively. Finally, O(·) and o(·) denote big-O and small-o notation, respectively.

1.8 Organization of the Thesis

The organization of the thesis is as follows: Chapter 2 sets up the framework and formally states the main contributions regarding the analysis of LASSO's solution path. Chapter 3 considers the tuning of the threshold parameter for the problem of denoising by soft thresholding and connects the results of optimal denoising with the problem of optimal tuning of the parameters of AMP. Chapter 4 presents and summarizes our simulation results. Finally, the Appendix contains the proofs of our main results.

Chapter 2

Analysis of LASSO's Solution Path

2.1 Asymptotic CS Framework

In this thesis we consider the problem of recovering an approximately sparse vector x_o ∈ R^N from n noisy linear observations y = Ax_o + w. Our goal is to analyze the properties of the solution of LASSO, defined in (1.1), on CS problems with the following two main features: (i) the measurement matrix has iid Gaussian elements (with the recent results in CS [44], our results can be easily extended to subgaussian matrices; for notational simplicity we consider the Gaussian setting here), and (ii) the ambient dimension and the number of measurements are large. We adopt the asymptotic framework to incorporate these two features. Here is the formal definition of this framework [4, 9]. Let n, N → ∞ while δ = n/N is fixed. We write the vectors and matrices as x_o(N), A(N), y(N), and w(N) to emphasize the ambient dimension of the problem. Clearly, the number of rows of the matrix A is equal to δN, but we assume that δ is fixed and therefore we do not include n in our notation for A. The same argument applies to y(N) and w(N).

Definition 2.1.1. A sequence of instances {x_o(N), A(N), w(N)} is called a converging sequence if the following conditions hold:

- The empirical distribution of x_o(N) ∈ R^N converges weakly to a probability measure p_{X_o} with bounded second moment.

22 - The empirical distribution of w(n) R n (n = N) converges weakly to a probability measure p W with bounded second moment. - If {e i } N i= denotes the standard basis for R N, then max i ka(n)e i k, min i ka(n)e i k! as N!. Note that we have not imposed any constraint on the limiting distributions p or p W.Infactforthepurposeofthissection,p is not necessarily a sparsity promoting prior. Furthermore, unlike most of the other works that assumes p W is Gaussian, we do not even impose this constraint on the noise. Also, the last condition is equivalent to saying that all the columns have asymptotically unit ` norm. For each problem instance x o (N),A(N), and w(n) we solve LASSO and obtain ˆx (N) astheestimate. We would now like to evaluate certain measures of performance for this estimate such as the mean squared error. The next generalization formalizes the types of measure we are interested in. Definition... Let ˆx (N) be the sequence of solutions of the LASSO problem for the converging sequence of instances {x o (N),A(N),w(N)}. Consider a function : R! R. An observable J is defined as N J x o, ˆx, lim N! N i= A popular choice of the function is M (u, v) =(u observable has the form: J M x o, ˆx, lim N! N N i= x o,i (N) x o,i (N), ˆx i (N). v). For this function the ˆx i (N) = lim N! N kx o ˆx k. Another example of function that we consider in this thesis is D (u, v) =I(v 6= 0), which leads us to J D x o, ˆx, lim N! N N i= kˆx k 0 I(ˆx i 6=0)= lim N! N. (.)

[Table 2.1: Some observables and their abbreviations. The function ψ for each observable is also specified.]

    Name                 Abbreviation    ψ = ψ(u, w)
    Mean Square Error    MSE             ψ = (u − w)^2
    False Alarm Rate     FA              ψ = I(w ≠ 0, u = 0)
    Detection Rate       DR              ψ = I(w ≠ 0)
    Missed Detection     MD              ψ = I(w = 0, u ≠ 0)

Some of the popular observables are summarized in Table 2.1 with their corresponding ψ functions. Note that so far we have not made any major assumption on the sequence of matrices. Following the other works in CS, we now consider random measurement matrices. While all of our discussion can be extended to more general classes of random matrices [44], for notational simplicity we consider A_ij ~ N(0, 1/n). Clearly, these matrices satisfy the unit norm column condition of converging sequences with high probability. Since x̂_λ(N) is random, there are two questions that need to be addressed about lim_{N→∞} (1/N) Σ_{i=1}^N ψ(x_o,i(N), x̂_λ,i(N)): (i) Does the limit exist, and in what sense (e.g., in probability or almost surely)? (ii) Does it converge to a random variable or to a deterministic quantity? The following theorem, conjectured in [4] and proved in [0], shows that under some restrictions on the function ψ, not only does the almost sure limit exist in this scenario, but it also converges to a non-random number.

Theorem 2.1.3. Consider a converging sequence {x_o(N), A(N), w(N)} and let the elements of A be drawn iid from N(0, 1/n). Suppose that x̂_λ(N) is the solution of the LASSO problem. Then for any pseudo-Lipschitz function ψ : R^2 → R, almost surely

    lim_{N→∞} (1/N) Σ_i ψ(x̂_λ,i(N), x_o,i) = E_{X_o,W}[ψ(η(X_o + σ̂W; τ̂), X_o)],    (2.2)

where on the right hand side X_o and W are two random variables with distributions p_{X_o} and N(0, 1), respectively, η is the soft thresholding operator, and σ̂ and τ̂ satisfy the following equations:

    σ̂^2 = σ_w^2 + (1/δ) E_{X_o,W}[(η(X_o + σ̂W; τ̂) − X_o)^2],    (2.3)
    λ = τ̂ (1 − (1/δ) P(|X_o + σ̂W| > τ̂)).    (2.4)

This theorem provides the first step in our analysis of the LASSO's solution path. Before we proceed to the implications of this theorem, let us explain some of its interesting features. Suppose that x̂_λ had iid elements, and each element were in law equal to η(X_o + σ̂W; τ̂), where X_o ~ p_{X_o} and W ~ N(0, 1). Also, x_o,i ~_iid p_{X_o}. If these two assumptions were true, then we could use the strong law of large numbers (SLLN) and argue that (2.2) holds under some mild conditions (required for the SLLN). While this heuristic is not quite correct, and the elements of x̂_λ are not necessarily independent, at the level of calculating the observables defined in Definition 2.1.2 (with ψ pseudo-Lipschitz) this theorem confirms the heuristic. Note that the key element that has led to this heuristic is the randomness in the measurement matrix and the large size of the problem. As we see in (2.2), there are two constants (τ̂, σ̂) that are calculated according to (2.3) and (2.4).

(A function ψ : R^2 → R is pseudo-Lipschitz if there exists a constant L > 0 such that for all x, y ∈ R^2 we have |ψ(x) − ψ(y)| ≤ L(1 + ||x||_2 + ||y||_2) ||x − y||_2.)

It has been shown in [4, 3] that for a fixed λ, these two equations have a unique solution for (τ̂, σ̂). Note that here σ̂ ≥ σ_w, i.e., the variance of the noise that we observe after the reconstruction, σ̂^2, is larger than the input noise variance (according to (2.3)). The extra noise that we observe after the reconstruction is due to subsampling. In fact, if we keep σ_w fixed and decrease δ, then we see that σ̂ increases. This phenomenon is sometimes called noise folding in the CS literature [45, 46]. One of the main applications of Theorem 2.1.3 is in characterizing the normalized mean squared error of the LASSO problem, as summarized by the next corollary.

Corollary 2.1.4. If {x_o(N), A(N), w(N)} is a converging sequence and x̂_λ(N) is the solution of the LASSO problem, then almost surely

    lim_{N→∞} (1/N) ||x̂_λ(N) − x_o(N)||_2^2 = E_{X_o,W}[(η(X_o + σ̂W; τ̂) − X_o)^2],

where σ̂ and τ̂ satisfy (2.3) and (2.4).

As we mentioned before, we are also interested in another observable, namely lim_{N→∞} ||x̂_λ||_0/N. As described in (2.1), this observable can be constructed by using ψ(u, v) = I(v ≠ 0). However, it is not difficult to see that for this observable the function ψ is not pseudo-Lipschitz, and hence Theorem 2.1.3 does not apply. However, as conjectured in [4] and proved in [0], we can still characterize the almost sure limit of this observable.

Theorem 2.1.5. [0] If {x_o(N), A(N), w(N)} is a converging sequence and x̂_λ(N) is the solution of the LASSO problem, then almost surely

    lim_{N→∞} (1/N) Σ_i I(x̂_λ,i(N) ≠ 0) = P(|η(X_o + σ̂W; τ̂)| > 0),

where σ̂ and τ̂ satisfy (2.3) and (2.4).
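Equations (2.3)-(2.4) have no closed form for a general prior p_{X_o}, but they can be evaluated numerically. The sketch below is my own illustrative code, not the thesis's: it fixes the threshold τ̂, iterates (2.3) to convergence by Monte Carlo, and then reads off the corresponding λ from (2.4); sweeping τ̂ traces out the asymptotic solution path predicted by Theorem 2.1.3. The Bernoulli-Gaussian prior at the end is purely an example.

    import numpy as np

    def soft_threshold(a, tau):
        return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

    def lasso_fixed_point(tau_hat, delta, sigma_w, sample_prior, m=200_000,
                          iters=100, rng=np.random.default_rng(0)):
        # Solve (2.3) for sigma_hat at a given threshold tau_hat, then return the
        # lambda that (2.4) pairs with this tau_hat.
        x = sample_prior(m, rng)                        # draws from p_{X_o}
        sigma2 = sigma_w**2 + np.mean(x**2) / delta     # any positive starting point
        for _ in range(iters):
            w = rng.standard_normal(m)
            mse = np.mean((soft_threshold(x + np.sqrt(sigma2) * w, tau_hat) - x)**2)
            sigma2 = sigma_w**2 + mse / delta           # iterate (2.3)
        w = rng.standard_normal(m)
        detect = np.mean(np.abs(x + np.sqrt(sigma2) * w) > tau_hat)
        lam = tau_hat * (1.0 - detect / delta)          # (2.4); valid when detect < delta
        return np.sqrt(sigma2), lam

    # Example prior: Bernoulli-Gaussian with sparsity eps (illustrative only).
    eps = 0.1
    bg = lambda m, rng: rng.standard_normal(m) * (rng.random(m) < eps)
    sigma_hat, lam = lasso_fixed_point(tau_hat=1.0, delta=0.5, sigma_w=0.1, sample_prior=bg)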

2.2 LASSO's Solution Path

In Section 2.1 we characterized two simple expressions for the asymptotic behavior of the normalized mean square error and the normalized number of detections. These two expressions enable us to formalize the two questions that we raised in the Introduction. As mentioned in the Introduction, if we consider a generic CS problem, there are some pathological examples for which the behavior of LASSO is quite unpredictable and inconsistent with our intuition. See Figure 1.1 for an example and Section 4.2 for a detailed description of it. Here, we consider the asymptotic regime introduced in the last section. It turns out that in this setting the solution of LASSO behaves as expected.

Theorem 2.2.1. Let {x_o(N), A(N), w(N)} denote a converging sequence of problem instances as defined in Definition 2.1.1. Suppose that A_ij ~_iid N(0, 1/n). If x̂_λ(N) is the solution of LASSO with regularization parameter λ, then

    lim_{N→∞} (1/N) Σ_i I(x̂_λ,i(N) ≠ 0) ≤ δ.

Furthermore,

    (d/dλ) lim_{N→∞} (1/N) Σ_i I(x̂_λ,i(N) ≠ 0) < 0.

We summarize the proof of this theorem in Section A.2. Intuitively speaking, Theorem 2.2.1 claims that, as we increase the regularization parameter λ, the number of elements in the active set decreases. Also, according to the condition lim_{N→∞} (1/N) Σ_i I(x̂_λ,i(N) ≠ 0) ≤ δ, the largest this fraction can get is δ = n/N. Since the number of active elements is a decreasing function of λ, δ appears only in the limit λ → 0. Figure 2.1 shows the number of active elements as a function of λ for a setting described in Section 4.2. In the next section, we will exploit this property to design and tune AMP for solving the LASSO.

[Figure 2.1: The number of active elements in the solution of LASSO as a function of λ. The size of the active set decreases monotonically as we increase λ.]

Our next result concerns the behavior of the normalized MSE in terms of the regularization parameter λ. In the asymptotic setting, we prove that the normalized MSE is a quasi-convex function of λ. See Section 3.4 of [47] for a short introduction to quasi-convex functions. Figure 2.2 exhibits the behavior of the MSE as a function of λ. The detailed description of this problem instance can be found in Section 4.2. Before we proceed further, we define bowl-shaped functions.

Definition 2.2.2. A quasi-convex function f : R → R is called bowl-shaped if and only if there exists x_0 ∈ R at which f achieves its minimum, i.e., f(x_0) ≤ f(x) for all x ∈ R.

Here is the formal statement of this result.

Theorem 2.2.3. Let {x_o(N), A(N), w(N)} denote a converging sequence of problem instances as defined in Definition 2.1.1. Suppose A_ij ~_iid N(0, 1/n). If x̂_λ(N) is the solution of LASSO with regularization parameter λ, then lim_{N→∞} (1/N) ||x̂_λ(N) − x_o||_2^2 is a quasi-convex function of λ. Furthermore, if p_{X_o}({0}) ≠ 1, then the function is bowl-shaped.

See the proof in Section A.3.

[Figure 2.2: Behavior of the MSE of LASSO as a function of λ for two different noise variances.]

2.3 Implications for AMP

2.3.1 AMP in Asymptotic Settings

In this section we show how the result of Theorem 2.2.1 can lead to an efficient method for setting the threshold in the AMP algorithm. We first review some background on the asymptotic analysis of AMP. This section is mainly based on the results in [3, 4, 9], and the interested reader is referred to these papers for further details. As we mentioned in Section 1.3, AMP is an iterative thresholding algorithm. Therefore, we would like to know the discrepancy of its estimate at every iteration from the original vector x_o.

The following definition formalizes different discrepancy measures for the AMP estimates.

Definition 2.3.1. Let {x_o(N), A(N), w(N)} denote a converging sequence of instances. Let x^t(N) be the sequence of estimates of AMP at iteration t. Consider a function ψ : R^2 → R. An observable J_ψ at time t is defined as

    J_ψ(x_o, x^t) ≜ lim_{N→∞} (1/N) Σ_{i=1}^N ψ(x_o,i(N), x^t_i(N)).

As before, we can consider ψ(u, v) = (u − v)^2, which leads to the normalized MSE of AMP at iteration t. The following result, which was conjectured in [3, 4] and finally proved in [9], provides a simple description of the almost sure limits of the observables.

Theorem 2.3.2. Consider the converging sequence {x_o(N), A(N), w(N)} and let the elements of A be drawn iid from N(0, 1/n). Suppose that x^t(N) is the estimate of AMP at iteration t. Then for any pseudo-Lipschitz function ψ : R^2 → R,

    lim_{N→∞} (1/N) Σ_i ψ(x^t_i(N), x_o,i) = E_{X_o,W}[ψ(η(X_o + σ_t W; τ_t), X_o)]

almost surely, where on the right hand side X_o and W are two random variables with distributions p_{X_o} and N(0, 1), respectively, and σ_t satisfies

    σ_{t+1}^2 = σ_w^2 + (1/δ) E_{X_o,W}[(η(X_o + σ_t W; τ_t) − X_o)^2],    σ_0^2 = σ_w^2 + (1/δ) E[X_o^2].    (2.5)

Similar to our discussion of the solution of the LASSO, this theorem claims that, as far as the calculation of pseudo-Lipschitz observables is concerned, we can assume that the estimates of AMP are modeled as iid elements, with each element modeled in law as η(X_o + σ_t W; τ_t), where X_o ~ p_{X_o} and W ~ N(0, 1). As before, we are also interested in the normalized number of detections.

The following theorem establishes this result.

Theorem 2.3.3. Consider the converging sequence {x_o(N), A(N), w(N)} and let the elements of A be drawn iid from N(0, 1/n). Suppose that x^t(N) is the estimate of AMP at iteration t. Then

    lim_{N→∞} ||x^t(N)||_0 / N = P(|X_o + σ_t W| ≥ τ_t)

almost surely, where on the right hand side X_o and W are two random variables with distributions p_{X_o} and N(0, 1), respectively, and σ_t satisfies (2.5). In other words, the result of Theorem 2.3.2 can be extended to ψ(u, v) = I(v ≠ 0), even though this function is not pseudo-Lipschitz.

2.3.2 Connection Between AMP and LASSO

The AMP algorithm in its general form can be considered as a sparse signal recovery algorithm (for a more general form of AMP, refer to Chapter 5 of [8]). The choice of the threshold parameters τ_t has a major impact on the performance of AMP. It turns out that if we set τ_t appropriately, then the fixed point of AMP corresponds to the solution of LASSO in the asymptotic regime. One such choice of parameters is the fixed false alarm threshold given by τ_t = ασ_t, where σ_t satisfies (2.5). The following result, conjectured in [4, 3] and later proved in [0], formalizes this statement.

Theorem 2.3.4. Consider the converging sequence {x_o(N), A(N), w(N)} and let the elements of A be drawn iid from N(0, 1/n). Let x^t(N) be the estimate of the AMP algorithm with parameter τ_t = ασ_t, where σ_t satisfies (2.5). Assume that lim_{t→∞} σ_t = σ̂. Finally, let x̂_λ denote the solution of the LASSO with parameter λ that satisfies

    λ = ασ̂ (1 − (1/δ) P(|X_o + σ̂W| ≥ ασ̂)).

Then, almost surely,

    lim_{t→∞} lim_{N→∞} (1/N) ||x̂_λ(N) − x^t(N)||_2^2 = 0.

This promising result indicates that AMP can potentially be used as a fast iterative algorithm for solving the LASSO problem. However, it is not readily useful for practical scenarios in which σ_t is not known (since neither x_o nor its distribution is known). Therefore, in the first implementations of AMP, σ_t has been estimated at every iteration from the observations x^t + A^T z^t. From Section 1.3 we know that v^t = x^t + A^T z^t − x_o can be modeled as Gaussian N(0, σ_t^2 I). Therefore, if we had access to v^t we could easily estimate σ_t. However, we only observe x^t + A^T z^t = x_o + v^t, and we have to estimate σ_t from this observation. The estimates that have been proposed so far exploit the fact that x_o is sparse and provide a biased estimate of σ_t. While such biased estimates still work well in practice, our discussion of LASSO provides an easier way to set the threshold. In the next section, based on our analysis of LASSO, we discuss the performance of the fixed detection thresholding policy introduced in Section 1.3, and show that not only can this thresholding policy be implemented in its exact form, but it also has the nice properties of the fixed false alarm threshold.

2.3.3 Fixed Detection Thresholding

AMP looks for the sparsest solution of y = Ax_o + w through the following iterations:

    x^{t+1} = η(x^t + A^T z^t; τ_t),    z^t = y − Ax^t + (|I_t|/n) z^{t−1}.    (2.6)

As was discussed in Chapter 1, a good choice of the threshold parameters τ_t is vital to the good performance of AMP. We proved in Section 2.2 that the number of active elements in the solution of LASSO is a monotonic function of the parameter λ. This motivates us to set the threshold of AMP in a way that at every iteration a certain number of coefficients remains in the active set. To understand this claim better, compare (2.3) for the fixed point of LASSO and (2.5) for the iterations of AMP. Let us replace τ̂ = α̂σ̂ in (2.3). In addition, assume that λ is such that (1/δ) P(|X_o + σ̂W| ≥ α̂σ̂) is equal to γ for some γ ∈ (0, 1). Under these two assumptions, (2.3) and (2.4) can be converted to

    σ̂^2 = σ_w^2 + (1/δ) E_{X_o,W}[(η(X_o + σ̂W; α̂σ̂) − X_o)^2],    λ = α̂σ̂ (1 − γ).    (2.7)

Let us now consider the fixed point of AMP. By letting t → ∞ in (2.5) we obtain

    σ_∞^2 = σ_w^2 + (1/δ) E_{X_o,W}[(η(X_o + σ_∞ W; τ_∞) − X_o)^2],    (2.8)

where σ_∞ ≜ lim_{t→∞} σ_t and τ_∞ ≜ lim_{t→∞} τ_t. Comparing (2.7) and (2.8), we conclude that if we set τ_t in a way that τ_t → α̂σ̂ as t → ∞, then AMP has a fixed point that corresponds to the solution of LASSO. One such approach is the fixed detection thresholding policy that was introduced in Section 1.3. According to this thresholding policy, we keep the size of the active set of AMP fixed at every iteration. Then clearly, if the algorithm converges, the final solution will have the desired number of active elements. In other words, the final solution of the AMP will also satisfy the two equations:

    σ^2 = σ_w^2 + (1/δ) E_{X_o,W}[(η(X_o + σW; τ) − X_o)^2],    γ = (1/δ) P(|X_o + σW| ≥ τ).    (2.9)
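The calibration in (2.7)-(2.9) can be evaluated numerically in the same Monte Carlo style as before. The sketch below is illustrative code of my own (not from the thesis): it solves (2.9) for a given detection fraction γ by alternating the two equations and returns the regularization parameter λ = τ(1 − γ) that the fixed detection policy targets; `sample_prior` is assumed to draw from p_{X_o} as in the earlier sketch, and Lemma 2.3.5 below guarantees that the fixed point being computed is unique.

    import numpy as np

    def soft_threshold(a, tau):
        return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

    def fixed_detection_calibration(gamma, delta, sigma_w, sample_prior, m=200_000,
                                    iters=200, rng=np.random.default_rng(1)):
        # Alternate the two equations in (2.9): given sigma, pick tau so that
        # (1/delta) P(|X_o + sigma W| >= tau) = gamma; given tau, update sigma by the
        # state-evolution equation. Returns (sigma, tau, lambda = tau * (1 - gamma)).
        x = sample_prior(m, rng)
        sigma2 = sigma_w**2 + np.mean(x**2) / delta
        tau = 0.0
        for _ in range(iters):
            w = rng.standard_normal(m)
            pseudo = x + np.sqrt(sigma2) * w
            tau = np.quantile(np.abs(pseudo), 1.0 - gamma * delta)   # detection constraint
            mse = np.mean((soft_threshold(pseudo, tau) - x)**2)
            sigma2 = sigma_w**2 + mse / delta                         # state evolution
        return np.sqrt(sigma2), tau, tau * (1.0 - gamma)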

The first question that we shall address here is whether the above two equations have a unique fixed point. Otherwise, depending on the initialization, AMP may converge to different fixed points.

Lemma 2.3.5. The fixed point of (2.9) is unique, i.e., for every 0 < γ < 1 there is a unique (σ, τ) that satisfies (2.9).

See Section A.4 for the proof of this lemma. The heuristic discussion we have had so far suggests that the fixed point of the AMP algorithm with fixed detection thresholding corresponds to the solution of LASSO. The following theorem formalizes this result.

Theorem 2.3.6. Let x^t(N) be the estimate of AMP with fixed detection thresholding for parameter γ. Let (σ̂, τ̂) satisfy the fixed point equations (2.9). In addition, let x̂_λ(N) be the solution of LASSO for λ = τ̂(1 − γ). Then we have

    lim_{t→∞} lim_{N→∞} (1/N) ||x^t − x̂_λ||_2^2 = 0.

As we will show in Section A.5, the proof of this theorem is essentially the same as the proof of the analogous theorem in [0]. There is a slight change in the proof due to the different thresholding policy that we consider here.

Chapter 3

Parameter Free Approximate Message Passing

3.1 Tuning the AMP

3.1.1 Intuitive Explanation of the AMP Features

In this section, we summarize some of the main features of AMP intuitively. Consider the iterations of AMP defined in (1.2). Define x̃^t ≜ x^t + A^T z^t and v^t ≜ x̃^t − x_o. We call v^t the noise term at the t-th iteration. Clearly, at every iteration AMP calculates x̃^t; in our new notation this can be written as x_o + v^t. If the noise term v^t has an iid zero-mean Gaussian distribution and is independent of x_o, then we can conclude that at every iteration of AMP the soft thresholding is playing the role of a denoiser. The Gaussianity of v^t, if it holds, leads to deeper implications that will be discussed as we proceed. To test the validity of this noise model, we have presented a simulation result in Figure 1.2. This figure exhibits the histogram of v^t overlaid with its Gaussian fit for a CS problem. It has been proved that the Gaussian behavior we observe for the noise term is accurate in the asymptotic setting [3, 4, 9]. In most calculations, if N is large enough, we can assume that v^t is iid Gaussian noise. This astonishing feature of AMP leads to the following theoretically and practically important implications:

(i) The MSE of AMP, i.e., ||x^t − x_o||_2^2/N, can be theoretically predicted (with certain knowledge of x_o) through what is known as state evolution (SE).

(ii) The MSE of AMP can be estimated through the Stein unbiased risk estimate (SURE). This will enable us to optimize the threshold parameters. This scheme will be described in the next section.

3.1.2 Tuning Scheme

In this section we assume that each noisy estimate of AMP, x̃^t, can be modeled as x̃^t = x_o + v^t, where v^t is iid Gaussian noise as claimed in the last section, i.e., v^t ~ N(0, σ_t^2 I), where σ_t denotes the standard deviation of the noise. The goal is to obtain a better estimate of x_o. Since x_o is sparse, AMP applies soft thresholding to obtain a sparse estimate x^{t+1} = η(x̃^t; τ_t). The main question is: how shall we set the threshold parameter τ_t? To address this question, first define the risk (MSE) of the soft thresholding estimator as

    r(τ; σ) = (1/N) E||η(x_o + σu; τ) − x_o||_2^2,

where u ~ N(0, I). Figure 3.1 depicts r(τ, σ) as a function of τ for a given signal x_o and given noise level σ. In order to maximally reduce the MSE we have to set τ to τ_opt, defined as τ_opt = argmin_τ r(τ). There are two major issues in finding the optimizing parameter τ_opt: (i) r(τ, σ) is a function of x_o and hence is not known. (ii) Even if the risk is known, it seems that we still require an exhaustive search over all values of τ (at a certain resolution) to obtain τ_opt. This is due to the fact that r(τ, σ) is not necessarily a well-behaved function, and hence more efficient algorithms such as gradient descent or the Newton method do not necessarily converge to τ_opt.
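When x_o and σ are both known (the oracle setting of this discussion), r(τ; σ) can be approximated by averaging over noise draws and minimized over a grid. A small illustrative sketch, not part of the thesis's algorithms:

    import numpy as np

    def soft_threshold(a, tau):
        return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

    def oracle_risk(x_o, sigma, tau, trials=50, rng=np.random.default_rng(0)):
        # Monte Carlo estimate of r(tau; sigma) = E ||eta(x_o + sigma*u; tau) - x_o||^2 / N
        N = x_o.size
        errs = [np.sum((soft_threshold(x_o + sigma * rng.standard_normal(N), tau) - x_o)**2)
                for _ in range(trials)]
        return np.mean(errs) / N

    # Grid search for tau_opt (only possible with oracle knowledge of x_o):
    # taus = np.linspace(0.0, 3 * sigma, 100)
    # tau_opt = taus[np.argmin([oracle_risk(x_o, sigma, t) for t in taus])]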

[Figure 3.1: Risk function r(τ, σ) as a function of the threshold parameter τ for a k-sparse signal x_o ∈ R^N; τ_opt marks its minimizer.]

Let us first discuss the problem of finding τ_opt when the risk function r(τ, σ) and the noise standard deviation σ are given. In Lemma 3.2.1 we prove that r(τ, σ) is a quasi-convex function of τ. Furthermore, the derivative of r(τ, σ) with respect to τ is zero only at τ_opt. In other words, the MSE does not have any local minima except for the global minimum. Combining these two facts, we will prove in Section 3.2.1 that if the gradient descent algorithm is applied to r(τ, σ), then it will converge to τ_opt. The ideal gradient descent is presented in Algorithm 1. We call this algorithm the ideal gradient descent since it employs r(τ, σ), which is not available in practice.

The other issue we raised above is that in practice the risk (MSE) r(τ, σ) is not given. To address this issue we employ an estimate of r(τ, σ) in the gradient descent algorithm. The following lemma, known as Stein's unbiased risk estimate (SURE) [48], provides an unbiased estimate of the risk function:

Lemma 3.1.1. [49] Let g(x̃) denote the denoiser. If g is weakly differentiable, then

    E||g(x̃) − x_o||_2^2/N = E||g(x̃) − x̃||_2^2/N + σ^2 + 2σ^2 E(1^T(∇g(x̃) − 1))/N,    (3.1)

where ∇g(x̃) denotes the gradient of g and 1 is the all-ones vector.

Algorithm 1 Gradient descent algorithm when the risk function is exactly known. The goal of this thesis is to approximate the iterations of this algorithm.
    Require: r(τ), step size α, tolerance ε
    Ensure: arg min_τ r(τ)
    while |dr(τ)/dτ| > ε do
        τ ← τ − α dr(τ)/dτ
    end while

This lemma provides a simple unbiased estimate of the risk (MSE) of the soft thresholding denoiser:

    r̂(τ, σ) = ||η(x̃; τ) − x̃||_2^2/N + σ^2 + (2σ^2/N) 1^T(η'(x̃; τ) − 1).

We will study the properties of r̂(τ, σ) in Section 3.2.3, and we will show that this estimate is very accurate for high dimensional problems. Furthermore, we will show how this estimate can be employed to provide an estimate of the derivative of r(τ, σ) with respect to τ. Once these two estimates are calculated, we can run the gradient descent algorithm for finding τ_opt. We will show that the gradient descent algorithm that is based on empirical estimates converges to τ̂_opt, which is close to τ_opt and converges to τ_opt in probability as N → ∞. We formalize these statements in Section 3.2.3.
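Because η'(x̃_i; τ) = I(|x̃_i| > τ), the correction term in r̂(τ, σ) reduces to counting the coordinates that survive thresholding. A minimal sketch of the estimate, assuming (as in this section) that the noise level σ is known; the function name is mine:

    import numpy as np

    def soft_threshold(a, tau):
        return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

    def sure_soft_threshold(x_tilde, tau, sigma):
        # SURE estimate r_hat(tau, sigma) of the soft-thresholding MSE, computed from the
        # noisy observation x_tilde = x_o + sigma * u alone (x_o itself is not needed).
        N = x_tilde.size
        denoised = soft_threshold(x_tilde, tau)
        fit = np.sum((denoised - x_tilde) ** 2) / N
        # 1^T (eta'(x_tilde; tau) - 1) = #{ |x_tilde_i| > tau } - N
        div_term = 2.0 * sigma**2 * (np.count_nonzero(np.abs(x_tilde) > tau) - N) / N
        return fit + sigma**2 + div_term

Averaged over noise realizations this quantity is unbiased for r(τ; σ), which is what makes it usable when x_o is unknown.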

3.2 Optimal Parameter Tuning for Denoising Problems

This section considers the problem of tuning the threshold parameter in the soft-thresholding denoising scheme. Section 3.3 connects the results of this section to the problem of tuning the threshold parameters in AMP.

3.2.1 Optimizing the Ideal Risk

Let x̃ ∈ R^N denote a noisy observation of the vector x_o, i.e., x̃ = x_o + w, where w ~ N(0, σ^2 I). Further assume that the noise variance σ^2 is known. Since x_o is either a sparse or an approximately sparse vector, we can employ the soft thresholding function to obtain an estimate of x_o: x̂_τ = η(x̃; τ). This denoising scheme has been proposed in [50], and its optimality properties have been studied in the minimax framework. As is clear from the above formulation, the quality of this estimate is determined by the parameter τ. Furthermore, the optimal value of τ depends both on the signal and on the noise level. Suppose that we consider the MSE to measure the goodness of the estimate x̂_τ:

    r(τ) ≜ (1/N) E||x̂_τ − x_o||_2^2.

According to this criterion, the optimal value of τ is the one that minimizes r(τ). For the moment, assume that r(τ) is given and forget the fact that r(τ) is a function of x_o and hence is not known in practice. Can we find the optimal value of τ, defined as

    τ_opt = argmin_τ r(τ),    (3.2)

efficiently? The following lemma simplifies the answer to this question.

Lemma 3.2.1. [3] r(τ) is a quasi-convex function of τ. Furthermore, the derivative of the function is equal to zero at at most one finite value of τ, and that is τ_opt.

In other words, we will in general observe three different forms for r(τ). These three forms are shown in Figure 3.2. Suppose that we aim to obtain τ_opt.

[Figure 3.2: Three different forms for MSE vs. τ. The plots correspond to different standard deviations of the noise in the observation.]

Lemma 3.2.1 implies that the gradient of r(τ) at any τ points toward τ_opt. Therefore, we expect the gradient descent algorithm to converge to τ_opt. Let τ_t denote the estimate of the gradient descent algorithm at iteration t. Then the updates of the algorithm are given by

    τ_{t+1} = τ_t − α (dr(τ_t)/dτ),    (3.3)

where α is the step size parameter. For instance, if L is an upper bound on the second derivative of r(τ), then we can set α = 1/L. (In practice, we employ backtracking to set the step size.) Our first result shows that, even though the function is not convex, the gradient descent algorithm converges to the optimal value of τ.

Lemma 3.2.2. Let α = 1/L and suppose that the optimizing τ is finite. Then

    lim_{t→∞} dr(τ_t)/dτ = 0.

See Section A.6 for the proof of this lemma.

Note that the properties of the risk function summarized in Lemma 3.2.1 enable us to employ standard techniques to prove the convergence of (3.3). The discussion above is useful if the risk function and its derivative are given. But these two quantities are usually not known in practice. Hence we need to estimate them. The next section explains how we estimate these two quantities.

3.2.2 Approximate Gradient Descent Algorithm

In Section 3.1.2 we described a method to estimate the risk of the soft thresholding function. Here we formally define this empirical unbiased estimate of the risk in the following way:

Definition 3.2.3. The empirical unbiased estimate of the risk is defined as

    r̂(τ) ≜ (1/N) ||η(x̃; τ) − x̃||_2^2 + σ^2 + (2σ^2/N) 1^T(η'(x̃; τ) − 1).    (3.4)

Here, for notational simplicity, we assume that the variance of the noise σ^2 is given. In Chapter 4 we show that estimating σ is straightforward for AMP. Instead of estimating the optimal parameter τ_opt through (3.2), one may employ the following optimization:

    τ̂_opt ≜ argmin_τ r̂(τ).    (3.5)

This approach was proposed by Donoho and Johnstone [5], and the properties of this estimator are derived in [48]. However, [48] does not provide an algorithm for finding τ̂_opt. Exhaustive search approaches are computationally very demanding and hence not very useful for practical purposes (note that τ_opt must be estimated at every iteration of AMP, so we seek very efficient algorithms for this purpose). As discussed in Section 3.2.1, one approach to reduce the computational complexity is to use the gradient descent algorithm. Needless to say, the gradient of r(τ) is not given, and hence it has to be estimated.

One simple idea to estimate the gradient of r(τ) is the following: fix Δ_N and estimate the derivative according to

    dr̂(τ)/dτ = (r̂(τ + Δ_N) − r̂(τ)) / Δ_N.    (3.6)

We will prove in Section 3.2.3 that, if Δ_N is chosen properly, then as N → ∞, dr̂(τ)/dτ → dr(τ)/dτ in probability. Therefore, intuitively speaking, if we plug the estimate of the gradient into (3.3), the resulting algorithm will perform well for large values of N. We will prove in the next section that this intuition is in fact true. Note that since we have introduced Δ_N in the algorithm, it is not completely free of parameters. However, we will show both theoretically and empirically that the performance of the algorithm is not sensitive to the actual value of Δ_N. Hence, the problem of setting Δ_N is simple, and inspired by our theoretical results we will provide suggestions for the value of this parameter in Chapter 4. Therefore, our approximate gradient descent algorithm uses the following iteration:

    τ_{t+1} = τ_t − α (dr̂(τ_t)/dτ),    (3.7)

where as before τ_t is the estimate of τ_opt at iteration t and α denotes the step size. Before we proceed to the analysis section, let us clarify some of the issues that may cause problems for our approximate gradient descent algorithm. First note that since r̂(τ) is an estimate of r(τ), it is not quasi-convex any more. Figure 3.3 compares r(τ) and r̂(τ). As is clear from this figure, r̂(τ) may have more than one local minimum. One important challenge is to ensure that our algorithm is only trapped in a local minimum that is close to the global minimum of r(τ). We will address this issue in Section 3.2.3.
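A compact sketch of the resulting approximate gradient descent (3.6)-(3.7), reusing the SURE estimate from the earlier sketch. The starting point, the iteration count, the particular choice of Δ_N, and the backtracking rule below are illustrative choices of mine, not the tuned values the thesis recommends.

    import numpy as np

    def soft_threshold(a, tau):
        return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

    def sure(x_tilde, tau, sigma):
        N = x_tilde.size
        fit = np.sum((soft_threshold(x_tilde, tau) - x_tilde) ** 2) / N
        return fit + sigma**2 + 2.0 * sigma**2 * (np.count_nonzero(np.abs(x_tilde) > tau) - N) / N

    def approximate_gradient_descent(x_tilde, sigma, tau0=None, iters=50, delta_N=None):
        # Iterate (3.7) with the finite-difference derivative (3.6) of the SURE curve;
        # the step size is set by simple backtracking, as suggested for practice.
        N = x_tilde.size
        tau = sigma if tau0 is None else tau0                    # start near the noise level
        delta_N = N ** (-0.25) if delta_N is None else delta_N   # one admissible choice of Delta_N
        for _ in range(iters):
            r0 = sure(x_tilde, tau, sigma)
            grad = (sure(x_tilde, tau + delta_N, sigma) - r0) / delta_N
            step = 1.0
            while step > 1e-8 and sure(x_tilde, max(tau - step * grad, 0.0), sigma) > r0:
                step /= 2.0
            tau = max(tau - step * grad, 0.0)                    # keep the threshold nonnegative
        return tau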

[Figure 3.3: The dashed black curve denotes the risk function and the solid blue curve its estimate; the measurements are noiseless. For the model used to produce this plot, refer to Section 4.2.]

3.2.3 Accuracy of the Gradient Descent Algorithm

Our Approach

The goal of this section is to provide performance guarantees for the empirical gradient descent algorithm described in Section 3.2.2. We achieve this goal in three steps: (i) characterizing the accuracy of the empirical unbiased risk estimate r̂(τ), (ii) characterizing the accuracy of the empirical estimate of the derivative of the risk dr̂/dτ, and finally (iii) providing a performance guarantee for the approximate gradient descent algorithm.

Accuracy of Empirical Risk

Our first result is concerned with the accuracy of the risk estimate r̂(τ). Consider the following assumption: we know a value τ_max, where τ_opt < τ_max. Note that this is not a major loss of generality, since τ_max can be as large as we require.

Theorem 3.2.4. Let r(τ) be defined according to (3.2) and r̂(τ) be as defined in Definition 3.2.3. Then

    P( sup_{0 < τ < τ_max} |r(τ) − r̂(τ)| ≥ (2 + 4τ_max) N^{−1/2+ε} ) ≤ N e^{−c_1 N^{2ε}} + τ_max N^3 e^{−c_2 N / τ_max^2},

where 0 < ε < 1/2 is an arbitrary but fixed number and c_1, c_2 are positive constants.

See Section A.7 for the proof of Theorem 3.2.4. First note that the probability on the right hand side goes to zero as N → ∞. Therefore, we can conclude that according to Theorem 3.2.4 the difference between r(τ) and r̂(τ) is negligible when N is large (with very high probability). Let τ_opt = argmin_τ r(τ) and τ̂_opt = argmin_τ r̂(τ). The following simple corollary of Theorem 3.2.4 shows that even if we minimize r̂(τ) instead of r(τ), r(τ̂_opt) is still close to r(τ_opt).

Corollary 3.2.5. Let τ_opt and τ̂_opt denote the optimal parameters derived from the actual and empirical risks, respectively. Then

    P( r(τ̂_opt) − r(τ_opt) > (4 + 8τ_max) N^{−1/2+ε} ) ≤ N e^{−c_1 N^{2ε}} + τ_max N^3 e^{−c_2 N / τ_max^2}.

See Section A.8 for the proof of Corollary 3.2.5. Corollary 3.2.5 shows that if we could find the global minimizer of the empirical risk, it would provide a good estimate of τ_opt for high dimensional problems. The only limitation of this result is that finding the global minimizer of r̂(τ) is computationally demanding, as it requires an exhaustive search. Therefore, in the next sections we analyze the fixed points of the approximate gradient descent algorithm.

Accuracy of the Derivative of the Empirical Risk

Our next step is to prove that our estimate of the gradient is also accurate when N is large. The estimate of the gradient of r(τ) is given by

    dr̂(τ)/dτ = (r̂(τ + Δ_N) − r̂(τ)) / Δ_N.    (3.8)

The following theorem describes the accuracy of this estimate:

Theorem 3.2.6. Let Δ_N = ω(N^{−1/2+ε}) and Δ_N = o(1) simultaneously. Then there exists τ_0 ∈ (τ, τ + Δ_N) such that

    P( |dr̂(τ)/dτ − dr(τ_0)/dτ| ≥ (8 + 16τ_max) N^{−1/2+ε} / Δ_N ) ≤ N e^{−c_1 N^{2ε}} + τ_max N^3 e^{−c_2 N / τ_max^2}.

In particular, as N → ∞, dr̂(τ)/dτ converges to dr(τ)/dτ in probability.

The proof of Theorem 3.2.6 is available in Section A.9. The following remarks highlight some of the main implications of Theorem 3.2.6.

Remark: The difference between the actual derivative of the risk and the estimated one is small for large values of N. Therefore, if the actual derivative is positive (and not too small), then the estimated derivative remains positive, and if the actual derivative is negative (and not too small), then the estimated derivative will also be negative. This feature enables gradient descent with an estimate of the derivative to converge to a point that is close to τ_opt.

Remark: Note that the small error that we have in the estimate of the derivative may cause difficulties at the places where the derivative is small. There are two regions in which the derivative is small.

[Figure 3.4: Risk function and its estimate. The estimate of the risk function can have local minima at points where dr(τ)/dτ = O(N^{−1/2+ε}/Δ_N); the two regions where this phenomenon can happen are marked by ellipses.]

As shown in Figure 3.4, the first region is around the optimal value τ_opt, and the second region is at very large values of τ. Note that the small error of the estimates may lead to local minima in these two regions. We show how the algorithm will avoid the local minima that occur at large values of τ. Furthermore, we will show that all the local minima that occur around τ_opt have risk that is close to the optimal risk.

Accuracy of Empirical Gradient Descent

In order to prove the convergence of the gradient descent algorithm we require two assumptions:

(i) We know a value τ_max, where τ_opt < τ_max.

(ii) The magnitude of the second derivative of r(τ) is bounded from above by L, and L is known.

Before we proceed further, let us describe why these two assumptions are required. Note from Figure 3.4 that for very large values of τ, where the derivative of the ideal risk is close to zero, the empirical risk may have many local minima. Therefore, the gradient descent algorithm is not necessarily successful if it goes to this region. Our first condition ensures that we avoid this region: we modify the gradient descent algorithm so that if at a certain iteration it returns τ_t > τ_max, we realize that this is not a correct estimate. The second condition is used to provide a simple way to set the step size in the gradient descent. It is standard in the convex optimization literature to avoid the second condition by setting the step size using the backtracking method. However, for notational simplicity we avoid backtracking in our theoretical analysis, though we employ it in our final implementation of the algorithm. Similarly, the first constraint can be avoided as well; we will propose an approach in the simulation section to avoid the first condition.

Let τ̂_t denote the estimates of the empirical gradient descent algorithm with step size α = 1/L. Also, let τ_t denote the estimates of the gradient descent on the ideal risk function as introduced in (3.3). We can then prove the following.

Theorem 3.2.7. For every iteration t we have

    lim_{N→∞} |τ̂_t − τ_t| = 0,

in probability.

See Section A.10 for the proof.

3.3 Optimal Tuning of AMP

Inspired by the formulation of AMP, define the following Bayesian risk function for the soft thresholding algorithm:

    R_B(τ, σ; p_{X_o}) = E[(η(X_o + σW; τ) − X_o)^2],    (3.9)

where the expected value is with respect to two independent random variables X_o ~ p_{X_o} and W ~ N(0, 1). One of the main features of this risk function is the following.

Lemma 3.3.1. inf_τ R_B(τ, σ; p_{X_o}) is an increasing function of σ.

See the Appendix for the proof of this lemma. While this result is quite intuitive and simple to prove, it has an important implication for the AMP algorithm. Let τ_1, τ_2, ... denote the thresholds of the AMP algorithm at iterations t = 1, 2, .... Clearly, the variance of the noise at iteration T depends on all the preceding thresholds (see Theorem 2.3.2 for the definition of σ_t). Therefore, consider the notation σ_{t+1}(τ_1, τ_2, ..., τ_t) for the value of σ at iteration t + 1.

Definition 3.3.2. A sequence of threshold parameters τ_{*,1}, τ_{*,2}, ..., τ_{*,T} is called optimal for iteration T if and only if

    σ_{T+1}(τ_{*,1}, τ_{*,2}, ..., τ_{*,T}) ≤ σ_{T+1}(τ_1, τ_2, ..., τ_T),    ∀(τ_1, τ_2, ..., τ_T) ∈ [0, ∞)^T.

Note that in the above definition we have assumed that the optimal value is achieved by (τ_{*,1}, ..., τ_{*,T}). This assumption is violated for the case X_o = 0. While we can generalize the definition to include this case, for notational simplicity we skip this special case. The optimal sequence of thresholds has the following two properties:

1. It achieves a certain MSE in the least number of iterations.

2. If we plan to stop the algorithm after T iterations, then it gives the best achievable MSE.

These two claims will be clarified as we proceed. According to Definition 3.3.1, it seems that, in order to tune AMP optimally, we need to know the number of iterations we plan to run it. However, this is not the case for AMP. In fact, at each step of AMP we can optimize the threshold as if we plan to stop the algorithm in the next iteration, and the resulting sequence of thresholds will be optimal for any iteration T. The following theorem formally states this result.

Theorem. Let τ*_1, τ*_2, ..., τ*_T be optimal for iteration T. Then τ*_1, τ*_2, ..., τ*_t is optimal for any iteration t < T. See Section A.11 for the proof of this result.

This theorem, while simple to prove, provides a connection between optimizing the parameters of AMP and the optimal parameter tuning we discussed for the soft thresholding function. For instance, a special case of the above theorem implies that τ*_1 must be optimal for the first iteration. Intuitively speaking, the signal plus Gaussian noise model is correct for this iteration, so we can apply the approximate gradient descent algorithm to obtain τ*_1. Once τ*_1 is calculated, we compute the next iterate x^2 and, again from the above theorem, we know that τ*_2 should be optimal for the denoising problem we obtain in this step. Therefore, we apply approximate gradient descent to obtain an estimate of τ*_2. We continue this process until the algorithm converges to the right solution.

If we had access to the risk function, the above procedure could be applied directly: at every iteration we would find the optimal parameter with the strategy described in Section 3.2, and the resulting algorithm would be optimal for any iteration t. However, as we discussed before, the risk function is not available. Hence we have to estimate it.

Once we estimate the risk function, we can employ the approximate gradient descent strategy described in Section 3.2.2. Consider the following risk estimate that is inspired by SURE:

r̂_t(τ_t)/N = (1/N) ‖η(x^t + A^T z^t; τ_t) − (x^t + A^T z^t)‖₂² + σ_t² + (2σ_t²/N) 1^T (η′(x^t + A^T z^t; τ_t) − 1).   (3.10)

As is clear from our discussion of the soft thresholding function in Section 3.2, we would like to apply the approximate gradient descent algorithm to r̂_t(τ_t)/N. Nevertheless, the question we have to address is whether it really converges to R_B(τ, σ; p_{X_o}). The next theorem establishes this result.

Theorem. Let r̂_t(τ_t)/N denote the estimate of the risk at iteration t of AMP, as defined in (3.10). Then

lim_{N→∞} r̂_t(τ_t)/N = E_{X_o,W}(η(X_o + σ_t W; τ_t) − X_o)²,   (3.11)

almost surely, where X_o and W are two independent random variables with distributions p_{X_o} and N(0, 1), respectively, and σ_t satisfies (2.5). See Section A.12 for the proof of this theorem.

This result justifies the application of the approximate gradient descent to the iterations of AMP. However, as we discussed in Section 3.2, a rigorous proof of the accuracy of the approximate gradient descent requires a stronger notion of convergence; hence, the result of the preceding theorem is not sufficient. One sufficient condition is stated in the next theorem. Let τ̂_{t,s} denote the estimate at the s-th iteration of the approximate gradient descent algorithm within the t-th iteration of AMP. In addition, let τ_{t,s} denote the corresponding estimate of the gradient descent algorithm on the ideal risk at the t-th iteration of AMP.

Theorem. Suppose that there exists a constant γ > 0 such that, for the t-th iteration of AMP,

P( sup_{τ_t} | r̂_t(τ_t)/N − E_{X_o,W}(η(X_o + σ_t W; τ_t) − X_o)² | > c N^{−γ} ) → 0,   (3.12)

as N → ∞. If ε_N = N^{−γ/2}, then

lim_{N→∞} |τ̂_{t,s} − τ_{t,s}| = 0

in probability. The proof of this result is a combination of the proofs of the two theorems of Section 3.2 on the accuracy of the approximate gradient descent and is omitted. Note that (3.12) has not been proved for the iterations of AMP and remains an open problem.
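To make the tuning procedure concrete, the sketch below implements the SURE-type estimate (3.10) for the soft thresholding denoiser and a structural version of AMP in which each iteration's threshold is chosen greedily, as justified by the optimality theorem above. The helper choose_threshold stands for the approximate gradient descent applied to τ → sure_risk_estimate(·, τ, σ̂), and the estimate σ̂_t = ‖z^t‖/√n is a common choice that the thesis may implement differently; this is a sketch of the idea, not the exact algorithm.

```python
import numpy as np

def soft_threshold(x, tau):
    """Soft thresholding denoiser eta(x; tau)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def sure_risk_estimate(pseudo_data, tau, sigma_hat):
    """SURE-type estimate r_hat_t(tau)/N from (3.10) for soft thresholding.

    pseudo_data: the vector x^t + A^T z^t at the current AMP iteration.
    sigma_hat:   an estimate of sigma_t (e.g., ||z^t|| / sqrt(n)).
    """
    N = pseudo_data.size
    eta = soft_threshold(pseudo_data, tau)
    eta_prime = (np.abs(pseudo_data) > tau).astype(float)  # derivative of soft threshold
    return (np.sum((eta - pseudo_data) ** 2) / N
            + sigma_hat ** 2
            + 2.0 * sigma_hat ** 2 * np.sum(eta_prime - 1.0) / N)

def amp_auto_tuned(y, A, choose_threshold, n_iter=30):
    """AMP in which every iteration's threshold is tuned as if the algorithm stopped next.

    choose_threshold(pseudo_data, sigma_hat) should return a threshold that
    (approximately) minimizes the estimated risk of the current denoising problem,
    e.g., by running the approximate gradient descent of Section 3.2 on
    tau -> sure_risk_estimate(pseudo_data, tau, sigma_hat).
    """
    n, N = A.shape
    x = np.zeros(N)
    z = y.copy()
    for _ in range(n_iter):
        pseudo_data = x + A.T @ z                     # behaves like x_o plus Gaussian noise
        sigma_hat = np.linalg.norm(z) / np.sqrt(n)    # common estimate of sigma_t
        tau = choose_threshold(pseudo_data, sigma_hat)
        x = soft_threshold(pseudo_data, tau)
        onsager = (z / n) * np.count_nonzero(x)       # Onsager correction for soft thresholding
        z = y - A @ x + onsager
    return x
```

In practice, choose_threshold can be warm-started from the previous iteration's threshold, which is consistent with the greedy per-iteration optimality described above.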

Chapter 4

Simulation Results

In this chapter, we first study the performance of AMP with the fixed threshold policy from the phase transition point of view. We then give the details of the simulations whose results we reported in the previous sections. Finally, we evaluate the performance of the automatically tuned AMP proposed in Algorithm 1 via simulations. We specifically discuss the effect of the measurement noise, the choice of the parameter ε_N used in (3.6), and the impact of the sample size N on the performance of our method.

4.1 Phase Transition of AMP

Sparse recovery via ℓ₁ minimization is successful only if the number of non-zero values in the signal, i.e., ‖x_o‖₀, is smaller than a certain fraction of n. Let ρ = k/n and δ = n/p be normalized measures of the sparsity level and the problem indeterminacy, respectively. As a result, we have a two-dimensional phase space (δ, ρ) ∈ [0, 1]² in which each point determines a certain sparse recovery problem, and the success or failure of sparse recovery can be read off from this phase space. In most cases, there is a curve (δ, ρ(δ)) such that the probability of successful recovery tends to 0 or 1 as the sparsity level goes above or below this curve, respectively [3]. In this section, we determine the phase space corresponding to AMP with the fixed threshold policy by observing the fraction of successful recoveries for different values of δ and ρ. In other words, we measure the empirical phase transition of AMP with the fixed thresholding policy and compare it with the available theoretical bound from [5, 3, 3].

The theoretical bound for the phase transition under linear programming (LP) reconstruction is given parametrically by

δ(z) = 2φ(z) / (z + 2(φ(z) − zΦ(−z))),    ρ(z) = 1 − zΦ(−z)/φ(z),   (4.1)

where φ and Φ denote the standard normal density and distribution function.

On the other hand, in order to measure the empirical phase transition of AMP with the fixed thresholding policy, i.e., in order to observe the fraction of successful sparse recoveries for different values of δ and ρ, we first have to determine the threshold parameter of AMP with which to run the algorithm and measure its performance. Therefore, according to the definition of the fixed thresholding policy, we have to set its free parameter. To find this free parameter, we use the relationship between this parameter and the regularization parameter λ of the LASSO. The solution of the LASSO is given by

x̂_λ = argmin_x (1/2)‖y − Ax‖₂² + λ‖x‖₁.   (4.2)

As we showed in Theorem 2.6, the number of elements in the active set of x̂_λ, i.e., ‖x̂_λ‖₀, ranges between 0 and the number of measurements n. Therefore, the least sparse solution corresponds to the case ‖x̂_λ‖₀ = n. In addition, it is well known that if we let λ → 0, then the solution of the LASSO tends to the least sparse solution. As a result, in order to measure the phase transition, we have to let λ → 0, which corresponds to recovery of the least sparse signals. Through the relationship above, this determines the corresponding limiting value of the free parameter, and this is the value we use when calculating the empirical phase transition of the AMP algorithm with the fixed thresholding policy.

In order to calculate the empirical phase transition, we produce a heatmap in which each cell corresponds to the probability of successful recovery.

We consider a square matrix P in which the columns correspond to a set of equi-spaced values of δ between 0.1 and 0.9, and the rows correspond to an equal number of equi-spaced values of ρ between ρ(δ = 0.1) and ρ(δ = 0.9). For each δ̂, we consider 50 equi-spaced values of ρ in [0.8 ρ(δ̂), 1.2 ρ(δ̂)], where ρ(δ̂) is obtained from (4.1); we call this set of values the ρ-grid associated with δ̂. In order to calculate the probability of correct recovery, for each δ̂ and ρ̂ we use M Monte Carlo trials. For the j-th Monte Carlo trial, we define the success variable

S_{δ̂,ρ̂,j} = I( ‖x̂_o − x_o‖ / ‖x_o‖ < tol ),

where tol is the threshold on the relative error used for evaluating the performance of AMP, and x̂_o is the AMP estimate obtained with the fixed thresholding policy. We then define the empirical success probability as P̂_{δ̂,ρ̂} = (1/M) Σ_j S_{δ̂,ρ̂,j}. Having calculated P̂_{δ̂,ρ̂} for every δ̂ and ρ̂, we use linear interpolation to fill in the entries of P. In running the AMP algorithm with the fixed thresholding policy, we use the following setup:

- The size of x_o is set to a fixed value p; the size of the measurement vector and the sparsity level are obtained according to n = ⌊δp⌋ and k = ⌊ρn⌋.

- The measurements are noise-free and are obtained according to y = Ax_o, where A has i.i.d. elements drawn from the Gaussian distribution N(0, 1/n).

- The threshold tol on the relative error is set to a small fixed value.

- The number of iterations of AMP is set to 500.

Figure 4.1 shows the empirical heatmap of the probability of success we obtained. The black line is the theoretical curve from (4.1). The red and blue colors refer to successful and unsuccessful recovery, respectively. We can see the close agreement between the theoretical curve and the empirical phase transition in this figure.
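For concreteness, the sketch below shows how the two ingredients of Figure 4.1 could be computed: the theoretical curve is traced from the parametric form (4.1) as reconstructed above, and one cell of the empirical heatmap is estimated by Monte Carlo as described in the text. The helper run_amp_fixed_threshold (AMP with the fixed thresholding policy), the distribution of the nonzero entries of x_o, and the grid of z values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def lp_phase_transition(z_grid=np.linspace(1e-3, 6.0, 2000)):
    """Trace the LP phase transition curve (4.1) parametrically; returns (delta, rho)."""
    phi = norm.pdf(z_grid)       # standard normal density
    Phi_neg = norm.cdf(-z_grid)  # standard normal tail probability
    delta = 2.0 * phi / (z_grid + 2.0 * (phi - z_grid * Phi_neg))
    rho = 1.0 - z_grid * Phi_neg / phi
    return delta, rho

def empirical_success_prob(delta_hat, rho_hat, p, M, tol, run_amp_fixed_threshold,
                           rng=np.random.default_rng(0)):
    """Fraction of M Monte Carlo trials in which AMP with the fixed thresholding policy
    recovers x_o within relative error tol, for one cell (delta_hat, rho_hat) of the grid.

    run_amp_fixed_threshold(y, A) is assumed to return the AMP estimate x_hat.
    """
    n = int(np.floor(delta_hat * p))
    k = max(int(np.floor(rho_hat * n)), 1)            # at least one nonzero entry
    successes = 0
    for _ in range(M):
        A = rng.standard_normal((n, p)) / np.sqrt(n)  # i.i.d. N(0, 1/n) entries
        x_o = np.zeros(p)
        support = rng.choice(p, size=k, replace=False)
        x_o[support] = rng.standard_normal(k)         # nonzero values: an illustrative choice
        y = A @ x_o                                   # noise-free measurements
        x_hat = run_amp_fixed_threshold(y, A)
        if np.linalg.norm(x_hat - x_o) / np.linalg.norm(x_o) < tol:
            successes += 1
    return successes / M
```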

Figure 4.1 : Comparison between the empirical phase transition (heatmap of the probability of success) and the theoretical phase transition curve (black curve) obtained from (4.1). In the heatmap, red corresponds to probability 1 (successful recovery) and blue corresponds to probability 0 (unsuccessful recovery).

4.2 Details of Simulations

Here we include the details of the simulations whose results we reported in the previous sections.

4.2.1 Figure 1.1

The dataset we used in this simulation is taken from [53]. The response variable y ∈ R^442 corresponds to the one-year progression of diabetes in 442 patients. We have 10 predictor variables, namely age, sex, body mass index, average blood pressure, and six blood serum measurements; therefore, the design matrix lies in R^{442×10}. We solved the LASSO for different values of λ and plotted the number of nonzero elements of x̂_λ, i.e., ‖x̂_λ‖₀, as a function of λ in Figure 1.1. As mentioned previously, for this specific problem, this function is not monotonically decreasing.
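As a pointer for reproducing this experiment, the sketch below uses scikit-learn's bundled copy of the same diabetes dataset; the choice of library and of the alpha grid are assumptions (the thesis does not specify its software), and scikit-learn's Lasso scales the quadratic term by 1/(2n), so its alpha is a rescaled version of the λ in (4.2).

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)      # 442 patients, 10 predictors

alphas = np.logspace(-3, 1, 60)            # sweep of regularization strengths
active_set_sizes = []
for alpha in alphas:
    # scikit-learn's Lasso minimizes (1/(2n))*||y - Xw||^2 + alpha*||w||_1,
    # so alpha corresponds to a rescaled version of the lambda in (4.2).
    model = Lasso(alpha=alpha, fit_intercept=True, max_iter=100_000).fit(X, y)
    active_set_sizes.append(np.count_nonzero(model.coef_))

# Plotting active_set_sizes against alphas gives a curve analogous to Figure 1.1;
# the thesis reports that for this dataset the curve is not monotone decreasing
# (a fine grid of regularization values may be needed to observe this).
```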
