arxiv: v1 [stat.me] 11 Jun 2017


Some Analysis of the Knockoff Filter and its Variants

Jiajie Chen, Anthony Hou, Thomas Y. Hou

June 2017

Abstract

In many applications, we need to study a linear regression model that consists of a response variable and a large number of potential explanatory variables and determine which variables are truly associated with the response. In 2015, Barber and Candès introduced a new variable selection procedure called the knockoff filter to control the false discovery rate (FDR) and proved that this method achieves exact FDR control. In this paper, we provide some analysis of the knockoff filter and its variants. Based on our analysis, we propose a PCA prototype group selection filter that has exact group FDR control and several advantages over existing group selection methods for strongly correlated features. Another contribution is that we propose a new noise estimator that can be incorporated into the knockoff statistic from a penalized method without violating the exchangeability property. Our analysis also reveals that some knockoff statistics, including the Lasso path and the marginal correlation statistics, suffer from the alternating sign effect. To overcome this deficiency, we introduce the notion of a good statistic and propose several alternative statistics that take advantage of the good statistic property. Finally, we present a number of numerical experiments to demonstrate the effectiveness of our methods and confirm our analysis.

1 Introduction

In many scientific endeavors, we need to determine from a response variable together with a large number of potential explanatory variables which variables are truly associated with the response. In order for this study to be meaningful, we need to make sure that the discoveries are indeed true and replicable. Thus it is highly desirable to obtain exact control of the false discovery rate (FDR) within a certain prescribed level.
In [3], Barber and Candès introduce a new variable selection procedure called the knockoff filter to control the FDR for a linear model. This method achieves exact FDR control in finite sample settings. One important property of this method is that its performance is independent of the design or covariates, the number of variables in the model, and the amplitudes of the unknown regression coefficients. Moreover, it does not require any knowledge of the noise level. A key observation is that by constructing knockoff variables that mimic the correlation structure found within the existing variables one can obtain accurate FDR control. The method is very general and flexible. It can be applied to a number of statistics and has far more power (the proportion of true signals being discovered) than existing selection rules when the proportion of null variables is high.

Author affiliations: School of Mathematical Sciences, Peking University; ciaie@pku.edu.cn. Department of Statistics, Harvard University; ahou@college.harvard.edu. Applied and Computational Mathematics, Caltech; hou@cms.caltech.edu.

1.1 A brief review of the knockoff filter

Before we introduce the main results of our paper, we first provide a brief overview of the knockoff filter. Consider the following linear regression model $y = X\beta + \epsilon$, where the feature matrix $X$ is an $n \times p$ ($n \ge 2p$) matrix with full rank, its columns normalized to be unit vectors in the $\ell_2$ norm, and $\epsilon$ is a Gaussian noise $N(0, \sigma^2)$. The knockoff filter begins with the construction of a knockoff matrix $\tilde X$ that obeys

$$\tilde X^T \tilde X = X^T X, \qquad \tilde X^T X = X^T X - \mathrm{diag}(s), \tag{1}$$

where $s_i \in [0,1]$. The positive definiteness of the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ requires

$$\mathrm{diag}(s) \preceq 2 X^T X. \tag{2}$$

The first condition in (1) ensures that $\tilde X$ has the same covariance structure as the original feature matrix $X$. The second condition in (1) guarantees that the correlations between distinct original and knockoff variables are the same as those between the originals. To ensure that the method has good statistical power to detect signals, we should choose $s$ as large as possible to maximize the difference between $X$ and its knockoff $\tilde X$. These two conditions are critical in guaranteeing that the distribution of a knockoff statistic is invariant when a particular pair $X_j, \tilde X_j$ is swapped. This is called the exchangeability property in [3]. The next step is to calculate a statistic, $W_j$, for each pair $X_j, \tilde X_j$ using the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the marginal correlation $[X\ \tilde X]^T y$. The final step is to run the knockoff (knockoff+) selection procedure at level $q$:

$$T \triangleq \min\left\{ t > 0 : \frac{0/1 + \#\{j : W_j \le -t\}}{\#\{j : W_j \ge t\}} \le q \right\}, \qquad \hat S \triangleq \{ j : W_j \ge T \}, \tag{3}$$

where the offset $0$ in the numerator corresponds to the knockoff procedure and the offset $1$ to the knockoff+ procedure.

There are several ways to construct a statistic $W_j$. Among them, the Lasso path statistic is discussed in detail in [3]. It first fits a Lasso regression of $y$ on $[X\ \tilde X]$ for a list of regularizing parameters $\lambda$ in descending order and then calculates the first $\lambda$ at which a variable enters the model, i.e. $Z_j \triangleq \sup\{\lambda : \hat\beta_j(\lambda) \neq 0\}$ for feature $X_j$ and $\tilde Z_j = \sup\{\lambda : \tilde\beta_j(\lambda) \neq 0\}$ for its knockoff $\tilde X_j$. The Lasso path signed max statistic is defined as $W_j = \max(Z_j, \tilde Z_j) \cdot \mathrm{sign}(Z_j - \tilde Z_j)$. The main result in [3] is that the knockoff procedure and the knockoff+ procedure have exact control of mFDR and FDR, respectively:

$$\mathrm{mFDR} \triangleq E\left[ \frac{\#\{j \in \hat S : \beta_j = 0\}}{\#\{j \in \hat S\} + q^{-1}} \right] \le q, \qquad \mathrm{FDR} \triangleq E\left[ \frac{\#\{j \in \hat S : \beta_j = 0\}}{\#\{j \in \hat S\} \vee 1} \right] \le q.$$

A knockoff filter for high-dimensional selective inference and model-free knockoffs have been recently established in [4,5].
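The selection rule (3) can be sketched in a few lines of Python. This is only an illustrative sketch and not code from [3]: the function names `knockoff_threshold` and `knockoff_select` are hypothetical, candidate thresholds are restricted to the nonzero values $|W_j|$, and the denominator is guarded with `max(..., 1)`. Both of these are assumptions made here for convenience.

```python
import math

def knockoff_threshold(W, q, plus=True):
    """Threshold T of rule (3); returns math.inf when no t qualifies.

    plus=True uses the offset 1 (knockoff+), plus=False the offset 0 (knockoff).
    Assumption: candidate thresholds are the nonzero values |W_j|.
    """
    offset = 1 if plus else 0
    for t in sorted({abs(w) for w in W if w != 0}):
        negatives = sum(1 for w in W if w <= -t)
        positives = sum(1 for w in W if w >= t)
        # max(positives, 1) guards against division by zero (an assumption).
        if (offset + negatives) / max(positives, 1) <= q:
            return t
    return math.inf

def knockoff_select(W, q, plus=True):
    """Indices j with W_j >= T."""
    T = knockoff_threshold(W, q, plus)
    return [j for j, w in enumerate(W) if w >= T]

# Hypothetical statistics: mostly large positive values, two small negatives.
W = [5.0, 4.0, 3.5, 3.0, 2.5, -0.5, 0.4, 2.0, -1.0, 1.5]
print(knockoff_threshold(W, q=0.2, plus=True))   # knockoff+ threshold
print(knockoff_threshold(W, q=0.2, plus=False))  # knockoff threshold
print(knockoff_select(W, q=0.2))
```

For this toy input the knockoff+ threshold is larger than the knockoff threshold, which reflects the extra 1 in the numerator of (3).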
This line of research has inspired a number of follow-up works [6,8,13–15].

1.2 Alternating sign effect and the notion of a good statistic

In this paper, we perform some analysis of the knockoff filter and some knockoff statistics, including the Lasso path and the marginal correlation statistics. Our analysis shows that the marginal correlation statistic and the Lasso path statistic suffer from the so-called alternating sign effect for certain design matrices whose features are only weakly correlated. The alternating sign effect refers to the existence of a feature $j$ that satisfies $\mathrm{sign}(\beta_j) \neq \mathrm{sign}(X_j^T r_\lambda)$, where $r_\lambda = y - X_E \hat\beta_E$ is the residue, $j \notin E$, and $E$ is the active set, i.e. $\{j : \hat\beta_j(\lambda) \neq 0\}$, $\lambda$ being the regularizing parameter in front of the $\ell_1$ norm in the Lasso path method. In Section 2, we describe a general mechanism for generating the alternating sign effect for a family of design matrices. We show that the alternating sign effect can lead to large negative $W_j$ for strong features that are only weakly correlated. This limitation reduces the power of the knockoff filter for these statistics. To alleviate this difficulty, we introduce the notion of a good statistic. Specifically, a knockoff statistic $W$ is called a good statistic if it satisfies the positivity of non-null features: for a fixed noise $\epsilon$, $W_j \ge 0$ if the signal amplitude $|\beta_j|$ is large enough relative to noise. Based on our analysis, we propose an alternative method, which we call the half penalized method. This method penalizes only $\hat\beta - \tilde\beta$ instead of penalizing both parameters,

$$\min_{\hat\beta, \tilde\beta} \ \| y - X\hat\beta - \tilde X \tilde\beta \|_2^2 + P(\hat\beta - \tilde\beta),$$

hence the name half penalized method. This method takes full advantage of the property of the knockoff filter. In the case when $P(x) = \lambda \|x\|_1$ (or $P(x) = -\lambda \|x\|_1$), we obtain the half Lasso method (or the negative half Lasso), which reduces the $2p$-dimensional $\ell_1$ optimization problem into $p$ one-dimensional optimization problems, which can be solved explicitly using the soft threshold operator. We further prove that this new statistic and the least squares statistic satisfy the good statistic property and do not suffer from the alternating sign effect.

To gain some understanding of the performance of different statistics, we investigate a variety of knockoff statistics numerically, including the least squares, half Lasso, forward selection, orthogonal matching pursuit (OMP), and Lasso path statistics. From our simulations, the forward selection, the OMP and the Lasso path statistics have similar power and computational cost. However, the alternating sign test in Section 2.4.2 shows that the Lasso path and the forward selection statistics suffer from the alternating sign effect and are less robust than the OMP statistic. Our simulation also shows that the power of the OMP is more than that of the least squares and the (negative) half Lasso in the sparse case (the proportion of the null features is large). The improvement of the OMP statistic over the least squares and the negative half Lasso statistics is not as significant in the non-sparse case. The OMP statistic seems to be the most robust among the six statistics that we consider. On the other hand, the OMP and other path statistics are computationally much more expensive. The computational cost of least squares and of the half Lasso is $O(np^2)$, while that of the Lasso path and the OMP statistic is $O(np^3)$.
If $p$ is large, the advantage of the OMP statistic over the negative half Lasso diminishes due to the increase of computational cost.

1.3 Extension of the sufficiency property and noise level estimate

In [3], the authors introduce the sufficiency property of a statistic $W$, which states that $W$ depends only on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the feature-response product $[X\ \tilde X]^T y$. We observe that in the definition of the sufficiency property, only part of the information of the response variable $y$, i.e. $[X\ \tilde X]^T y$, is utilized. By using the remaining information of $y$ in the knockoff filter, we can incorporate the noise estimate into the statistic without violating the exchangeability property. More specifically, we generalize the sufficiency property by requiring that $W$ depends only on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the feature-response product $[X\ \tilde X\ U]^T y$ for any orthonormal matrix $U$ that satisfies $U^T [X\ \tilde X] = 0$. Moreover, we prove that if a statistic obeys the generalized sufficiency property and the antisymmetry property, then it satisfies the exchangeability property. Inspired by the generalized sufficiency property, we propose to use the noise level $\sigma$ as a reference for the regularizing parameter and estimate the noise level as follows:

$$\hat\sigma \triangleq \| U^T y \|_2 / \sqrt{n - 2p},$$

where $U$ is an orthonormal matrix satisfying $U^T [X\ \tilde X] = 0$. Since $\hat\sigma$ depends only on $U^T y$, we can define a knockoff statistic $W$ that incorporates $\hat\sigma$ and satisfies the generalized sufficiency property. Consequently, we can use the estimated noise level in the knockoff filter without violating the exchangeability property and maintain FDR control.

1.4 A PCA prototype knockoff filter

We also introduce a PCA prototype knockoff filter for group selection that has exact group FDR control (defined in Section 3) for strongly correlated features. More specifically, assume that $X$ can be clustered into $k$ groups $X = (X_{C_1}, X_{C_2}, \ldots, X_{C_k})$ in a way such that within-group correlation is relatively strong but between-group correlation is relatively weak. We first use singular value decomposition (SVD) to decompose the feature vectors within each group, $X_{C_i} = U_{C_i} D_i V_i^T$, and then reformulate the linear model as follows:

$$y = \sum_{i=1}^{k} X_{C_i} \beta_{C_i} + \epsilon = \sum_{i=1}^{k} U_{C_i} \alpha_{C_i} + \epsilon.$$

We aim to pick out non-null groups ($\beta_{C_i} \neq 0$) with exact group FDR control. To capture most of the information and reduce redundant features in each group, we choose the first principal component $U_{C_i,1}$ as a prototype of this group and then construct knockoff pairs on the prototype set $U_P = (U_{C_1,1}, U_{C_2,1}, \ldots, U_{C_k,1})$, $|P| = k$. Specifically, we denote by $Q = \{1,2,\ldots,p\} \setminus P$ the remaining part, $U = [U_P, U_Q]$, and then construct the knockoff matrix $\tilde U = [\tilde U_P, U_Q]$ as follows (we choose $\tilde U_Q = U_Q$ since we do not select features in $U_Q$):

$$\tilde U^T \tilde U = U^T U, \qquad U^T U - \tilde U^T U = \mathrm{diag}(s_P, 0),$$

where we apply the localized knockoff construction from [15] to increase the amplitude of $s_P$. Inspired by [3], we implement the standard knockoff procedure on $y$ and $[U_P, \tilde U_P]$ and calculate the knockoff statistic $W_P = \{W_{C_1,1}, W_{C_2,1}, \ldots, W_{C_k,1}\}$. Finally, we run the knockoff filter on $W_{C_j,1}$, $1 \le j \le k$, to select groups. Moreover, we can prove that the PCA prototype knockoff filter has the same group FDR control for the original feature matrix as in Dai-Barber's group knockoff filter [6]. Compared to Dai-Barber's group knockoff filter, our PCA method achieves greater computational efficiency since the augmented design matrix in our method is $n \times 2k$, which is much smaller than $n \times 2p$ in Dai-Barber's method if $p \gg k$. Since the most significant computational cost in implementing the knockoff filter with a path statistic comes from regressing $y$ on the augmented design matrix in an iterative manner, a smaller augmented design matrix leads to greater computational efficiency. Note that the group statistic for group $C_j$ is $W_{C_j,1}$ and is different from that in Dai-Barber's group knockoff filter [6].

1.5 Comparison with other existing works

There are several recent works that have an objective similar to ours. Our work is inspired by Barber and Candès' knockoff filter as well as by Reid-Tibshirani's prototype knockoff filter and Dai-Barber's group knockoff filter [3,4,6,13]. We show in Section 3.4 that our PCA prototype filter has more power than Reid-Tibshirani's prototype knockoff filter. When the between-group correlation is zero and within-group correlation is strong, we analyze why the PCA prototype filter performs much better than Reid-Tibshirani's prototype filter. We also show that the performance of the PCA prototype filter is comparable to that of Dai-Barber's group knockoff filter, but with greater computational efficiency if $p \gg k$. More details on these two methods and their comparison with ours can be found in Section 3. We note that a localized knockoff filter has been proposed by Xu et al. in [15] in which they construct a modified knockoff matrix that has FDR control for a subset of the feature vectors. Although this localized knockoff filter guarantees FDR control, it still suffers a loss in power for strongly correlated features. There are several feature selection methods that offer some level of FDR control, see e.g. [1, 2, 7, 9–12]. Refer to [3] for a thorough comparison between the knockoff filter and these approaches. This paper focuses on the knockoff filter and does not consider these other approaches.

The rest of the paper is organized as follows. In Section 2, we analyze the alternating sign effect for the Lasso path, the marginal correlation, and the forward selection statistics. We also introduce the notion of a good statistic and show that the least squares method and the half penalized method produce good statistics. Moreover, we generalize the sufficiency property of a knockoff statistic and propose a new method to estimate noise level. In Section 3, we introduce our PCA prototype filter for highly correlated features. We compare it to other group knockoff filters and provide numerical experiments to demonstrate the performance of various methods.

2 Alternating sign effect, good statistics, a half penalized method

In this section, we perform some analysis of the knockoff filter. Our analysis reveals some limitations of several statistics associated with the knockoff filter. Based on our understanding of these limitations, we propose some modifications of the knockoff filter to alleviate these difficulties.

2.1 Construction of the knockoff matrix

First, we review the construction of the knockoff matrix. In [3], the authors give a simple construction of the knockoff matrix $\tilde X$. It seems that we may have other alternative constructions of $\tilde X$. In the following proposition, we show that, given $s_i$, different constructions are essentially the same.

Proposition 2.1. Let $\Sigma = X^T X$. Then

$$[X\ \tilde X]^T [X\ \tilde X] = \begin{bmatrix} \Sigma & \Sigma - \mathrm{diag}(s) \\ \Sigma - \mathrm{diag}(s) & \Sigma \end{bmatrix} \tag{4}$$

if and only if

$$\tilde X = X\left(I - \Sigma^{-1} \mathrm{diag}(s)\right) + UC, \tag{5}$$

where $U \in \mathbb{R}^{n \times (n-p)}$ is an orthonormal matrix whose column space is orthogonal to that of $X$, i.e. $U^T X = 0$, and $C \in \mathbb{R}^{(n-p) \times p}$ satisfies $C^T C = 2\,\mathrm{diag}(s) - \mathrm{diag}(s) \Sigma^{-1} \mathrm{diag}(s)$.

We will defer the proof of the above proposition to the Appendix. The knockoff matrix $\tilde X$ presented in [3] has the same form as (5) except that $U \in \mathbb{R}^{n \times p}$ and $C \in \mathbb{R}^{p \times p}$ in their formula. Using Proposition 2.1, we can reproduce the result in [3] by choosing an orthonormal matrix $U = (U_1\ U_2) \in \mathbb{R}^{n \times (n-p)}$, $U_1 \in \mathbb{R}^{n \times p}$, $U_2 \in \mathbb{R}^{n \times (n-2p)}$, whose column space is orthogonal to that of $X$, and

$$C = \begin{pmatrix} C_1 \\ 0 \end{pmatrix} \in \mathbb{R}^{(n-p) \times p}, \qquad C_1 \in \mathbb{R}^{p \times p}, \qquad C^T C = C_1^T C_1 = 2\,\mathrm{diag}(s) - \mathrm{diag}(s) \Sigma^{-1} \mathrm{diag}(s).$$

The identity $UC = (U_1\ U_2) \begin{pmatrix} C_1 \\ 0 \end{pmatrix} = U_1 C_1$ and Proposition 2.1 reproduce $\tilde X$ in [3].

2.2 Alternating sign effect for the marginal correlation statistic

In this section, we discuss the alternating sign effect for certain statistics and propose alternative statistics that do not suffer from this effect. According to (3), the knockoff filter threshold $T$ is determined by the ratio of large negative and positive $W_j$'s. Using this threshold, the knockoff filter selects large positive statistics $W_j > T$ and rejects all negative $W_j$'s. In order for the knockoff filter to achieve its power, the $W_j$'s should be large and positive for $\beta_j \neq 0$ so that the knockoff filter can pick out such features. Large, negative $W_j$'s result in a large $T$ and fewer selected features, which lead to a decrease in power. Our analysis shows that in some feature designs, certain knockoff statistics may yield large negative $W_j$ for non-null $j$, which would decrease the power of the knockoff filter.
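The sensitivity of the threshold in (3) to large negative statistics can be seen on a toy input. The sketch below is only an illustration under stated assumptions: `knockoff_plus_threshold` is a hypothetical helper that searches the candidate thresholds $|W_j|$ with the knockoff+ offset, and both vectors of statistics are made up for the example.

```python
import math

def knockoff_plus_threshold(W, q):
    """Knockoff+ threshold from rule (3), searched over the values |W_j| (assumption)."""
    for t in sorted({abs(w) for w in W if w != 0}):
        negatives = sum(1 for w in W if w <= -t)
        positives = sum(1 for w in W if w >= t)
        if (1 + negatives) / max(positives, 1) <= q:
            return t
    return math.inf  # no admissible threshold: nothing is selected

q = 0.2
# Six strong features with large positive statistics, plus four weak ones.
W_good = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 0.5, -0.4, 0.3, -0.2]
# Same magnitudes, but three strong features received the wrong sign.
W_bad = [10.0, 9.0, 8.0, -7.0, -6.0, -5.0, 0.5, -0.4, 0.3, -0.2]

for name, W in (("good", W_good), ("bad", W_bad)):
    T = knockoff_plus_threshold(W, q)
    selected = [j for j, w in enumerate(W) if w >= T]
    print(name, T, selected)
```

In this toy case, flipping the sign of three strong statistics pushes the threshold to infinity, so even the features whose statistics stayed large and positive are lost.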
We use the marginal correlation statistic to illustrate the alternating sign effect. The following example shows that the marginal correlation statistic could lose its power even for strong signals.

Design matrix and signal amplitude. Let $A, B$ be a partition of $\{1,2,\ldots,p\}$, i.e. $A \cup B = \{1,2,\ldots,p\}$, $A \cap B = \emptyset$. We choose a feature matrix $X$ that satisfies $\langle X_i, X_j \rangle = \rho$ for $i \neq j$ if $i$ and $j$ belong to the same set $A$ or $B$ and $\langle X_i, X_j \rangle = -\rho$ for $i \neq j$ if $i$ and $j$ belong to two different sets. A concrete example that satisfies the above design criterion is given as follows: X v ai p R n p, where v R p, v i = { λa i A λa i B, λ = ρ ρ. (6) We take $\rho = 1/2$ for simplicity. Once the knockoff matrix $\tilde X$ is constructed, we have the relation $X_i^T \tilde X_i = 1 - s_i$ with $s_i \in [0,1]$. The value of $s_i$ is not small because columns of $X$ are only weakly correlated. Since the knockoff matrix is constructed without any knowledge of $y$ and the coefficient $\beta$, we can choose any $\beta$ after $\tilde X$ is constructed. Next, we take

$$\beta_i = \begin{cases} 0.9\, M / s_i, & i \in A,\ s_i \neq 0, \\ M / s_i, & i \in B,\ s_i \neq 0, \end{cases} \tag{7}$$

and $\beta_i = 0$ if $s_i = 0$, where $M$ is a parameter that is used to control the signal amplitude. In the following discussion, we set $M = 1$ and assume that the number of indices with $s_i = 0$ is either $0$ or small.

Derivation of the marginal correlation statistic. Let $S_A, S_B$ be the sum of $\beta_i$ in group $A$, $B$, respectively, i.e. $S_A = \sum_{i \in A} \beta_i$, $S_B = \sum_{i \in B} \beta_i$. Assume that the noise level $\sigma$ is small compared to $M$, say $\sigma = .3$ (otherwise, we can multiply all $\beta$ by a large constant), and $y = X\beta + \epsilon$, $\epsilon \sim N(0, \sigma^2 I_n)$. Under this setting, we first calculate the marginal correlation for $k \in A$ (the case for $B$ can be carried out similarly):

$$X_k^T y = \sum_{i \in A} X_k^T X_i \beta_i + \sum_{i \in B} X_k^T X_i \beta_i + X_k^T \epsilon = \tfrac{1}{2}\left(S_A + \beta_k - S_B\right) + X_k^T \epsilon,$$

$$\tilde X_k^T y = \sum_{i \in A} \tilde X_k^T X_i \beta_i + \sum_{i \in B} \tilde X_k^T X_i \beta_i + \tilde X_k^T \epsilon = \tfrac{1}{2}\left(S_A + (1 - 2 s_k)\beta_k - S_B\right) + \tilde X_k^T \epsilon.$$

Further, we assume that $|S_A - S_B|$ is large compared to all $\beta_k$ and $S_A - S_B > 0$ (this can be done if we choose different sizes for $A$ and $B$, with $|A|$ sufficiently larger than $|B|$). From the assumption that the noise level $\sigma$ is small compared to $M$, $\mathrm{sign}(X_k^T y)$, $k \in A$, depends on $\mathrm{sign}(S_A - S_B)$ and we have an explicit expression for $W_j$, $j \in A$, with large probability (noise is too small to affect the sign): for all $k \in A$,

$$\begin{aligned} W_k &= |X_k^T y| - |\tilde X_k^T y| = \left| \tfrac{1}{2}(S_A - S_B + \beta_k) + X_k^T \epsilon \right| - \left| \tfrac{1}{2}\left(S_A - S_B + (1 - 2 s_k)\beta_k\right) + \tilde X_k^T \epsilon \right| \\ &=_p \mathrm{sign}(S_A - S_B) \left[ \left( \tfrac{1}{2}(S_A - S_B + \beta_k) + X_k^T \epsilon \right) - \left( \tfrac{1}{2}\left(S_A - S_B + (1 - 2 s_k)\beta_k\right) + \tilde X_k^T \epsilon \right) \right] \\ &= \mathrm{sign}(S_A - S_B) \left( s_k \beta_k + (X_k - \tilde X_k)^T \epsilon \right), \end{aligned} \tag{8}$$

where we have used the notation $=_p$ to denote an identity that holds with large probability. Based on the symmetry, $\mathrm{sign}(X_k^T y)$, $k \in B$, depends on $\mathrm{sign}(S_B - S_A)$ and

$$W_k =_p \mathrm{sign}(S_B - S_A) \left( s_k \beta_k + (X_k - \tilde X_k)^T \epsilon \right), \qquad k \in B. \tag{9}$$

By using the signal amplitude defined in (7) and the assumption $S_A > S_B$, we have the expression

$$W_k =_p \begin{cases} 0.9 + (X_k - \tilde X_k)^T \epsilon, & k \in A,\ s_k \neq 0, \\ -1 - (X_k - \tilde X_k)^T \epsilon, & k \in B,\ s_k \neq 0. \end{cases} \tag{10}$$
Since the noise level is small compared to the signal amplitude, the estimate above shows that $W_j$, $j \in A$, are approximately $0.9$ and $W_j$, $j \in B$, are approximately $-1$ with large probability.

Selection. If $T > 0.95$, the features selected by the knockoff filter are $\hat S = \{j : W_j \ge T\} \subset \{j : W_j \ge 0.95\} = \emptyset$. In this case, no features will be selected. Now we consider the case of $T \le 0.95$. The definition of the threshold (3) implies that

$$q \ge \frac{0/1 + \#\{j : W_j \le -T\}}{\#\{j : W_j \ge T\}} = \frac{0/1 + |B|}{\#\{j : W_j \ge T\}} \ge \frac{0/1 + |B|}{|A|} \ \Longrightarrow\ q|A| \ge 0/1 + |B|. \tag{11}$$

If we further take $q|A| < |B|$, this would contradict (11) and thus $T$ must be greater than $0.95$. As a result, no features will be selected. Note that taking $q|A| < |B|$ does not contradict the previous assumption on $A, B$ that $|A| > \gamma |B|$, $\gamma > 1$, which guarantees $S_A > S_B$. If we assume that $|S_A - S_B|$ is large compared to all $\beta_k$, we conclude from (8) and (9) that all features in either $A$ or $B$ are not selected according to the knockoff procedure (only positive statistics will be selected). This example illustrates that the marginal correlation statistic cannot exploit the knockoff power due to the large negative $W_j$ for a significant number of the true features.

The mechanism for generating the alternating sign effect. Next, we describe a more general mechanism that could lead to the alternating sign effect. First of all, such a feature matrix can be clustered into two groups $A$ and $B$. Secondly, the features from the same group are positively correlated and those from different groups are negatively correlated, i.e. $\langle X_i, X_j \rangle > 0$ if $(i,j) \in A \times A$ or $B \times B$ and $\langle X_i, X_j \rangle < 0$ if $(i,j) \in A \times B$. Let $\tilde X$ be the knockoff matrix. Without loss of generality, we may assume that $\|X_j\|_2 = 1$, which implies that $s_j = 1 - X_j^T \tilde X_j$. To see why such a feature matrix may suffer from the alternating sign effect, we generate the signal $\beta$ by setting $\beta_i = M / s_i$. By definition, $(X_i - \tilde X_i)^T y = s_i \beta_i + (X_i - \tilde X_i)^T \epsilon \sim N(M, 2 s_i \sigma^2)$. Assume that the noise level $\sigma$ is small enough. If $X_i^T y < 0$, we obtain

$$W_i = |X_i^T y| - |\tilde X_i^T y| \le -X_i^T y + \tilde X_i^T y \approx -M < 0,$$

and thus the non-null feature $i$ is rejected by the knockoff filter. A similar result holds for $i \in B$. Next, we find out under what condition we have $X_i^T y < 0$. Denote $S_A(i) \triangleq |X_i^T (\sum_{j \in A} X_j \beta_j)|$ and $S_B(i) \triangleq |X_i^T (\sum_{j \in B} X_j \beta_j)|$. Using the correlation structure of $X$ and the definition of $S_A, S_B, \beta$, we have $X_i^T y = S_A(i) - S_B(i) + X_i^T \epsilon$ if $i \in A$ and $X_i^T y = S_B(i) - S_A(i) + X_i^T \epsilon$ if $i \in B$. One can interpret $S_A(i)$ as a weighted sum of $\beta_j$, $j \in A$, with weight $|X_i^T X_j|$. Similarly, $S_B(i)$ is a weighted sum of $\beta_j$, $j \in B$.
If the noise level $\sigma$ is small enough, $|X_i^T X_j|$ does not vary much and the size of one group is larger than the size of another group, e.g. $|B| < |A|$, it is likely that $S_B(i) < S_A(i)$ for some $i \in B$. As a result, the features in group $B$ may not be picked out, which reduces the power. In the previous example, we construct a special example of $X$ that satisfies $|\langle X_i, X_j \rangle| = 0.5$ and $|B| < |A|$. We define the signal $\beta$ in a similar way. Equation (10) justifies that the features in group $B$ are not selected by the knockoff filter. In Section 2.4.2, we construct another example to show that the Lasso path and the forward selection statistics suffer from the alternating sign effect. Another mechanism for generating the alternating sign effect is when the columns of a design matrix $X$ are all positively correlated. In this case, we can apply the same argument as above by choosing the signal via $\beta_i = M / s_i$, $i \in A$, and $\beta_i = -M / s_i$, $i \in B$, where $(A, B)$ is a partition of $\{1,2,\ldots,p\}$. For these two types of design matrices, one needs to choose a statistic that will not suffer from the alternating sign effect.

Testing the alternating sign effect for the marginal correlation statistic. To confirm our previous analysis, we choose the group size of A, B to be A = 6, B = 4 with and 8 signals in each group, which corresponds to % sparsity. We draw the rows of $X$ from a multivariate normal distribution with mean $0$ and covariance matrix $\Sigma$, which satisfies $\Sigma_{ii} = 1$, $\Sigma_{ij} = \rho$ for $i \neq j$ in the same group, and $\Sigma_{ij} = -\rho$ for $i \neq j$ in a different group. We then normalize the columns. The correlation factor is $\rho = 0.5$, the noise level is σ =, and the signal amplitude is

$$\beta_i = \begin{cases} 0.9\, M / s_i, & i \in S^{tr} \cap A, \\ M / s_i, & i \in S^{tr} \cap B, \end{cases}$$

where $S^{tr}$ is the set of true signals. We assume that $s_i$ constructed by SDP is nonzero. Otherwise, we generate another design matrix $X \sim N(0, \Sigma)$ and then construct another group of $s_i$ by SDP.
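The covariance matrix $\Sigma$ used in this test has a simple closed form, $\Sigma = (1-\rho) I + \rho\, v v^T$ with $v_i = +1$ on $A$ and $v_i = -1$ on $B$, which also shows that it is positive semidefinite for $0 \le \rho < 1$. The sketch below builds it with plain Python lists. The function name `two_group_covariance` and the tiny group sizes are hypothetical choices for illustration, not the sizes used in the experiment, and the sampling of $X$ and the SDP step are not shown.

```python
def two_group_covariance(size_a, size_b, rho):
    """Sigma with unit diagonal, +rho within a group and -rho across groups."""
    v = [1.0] * size_a + [-1.0] * size_b   # group labels: +1 on A, -1 on B
    p = size_a + size_b
    return [[1.0 if i == j else rho * v[i] * v[j] for j in range(p)]
            for i in range(p)]

def rank_one_form(size_a, size_b, rho):
    """The same matrix written as (1 - rho) * I + rho * v v^T."""
    v = [1.0] * size_a + [-1.0] * size_b
    p = size_a + size_b
    return [[(1.0 - rho) * (i == j) + rho * v[i] * v[j] for j in range(p)]
            for i in range(p)]

# Hypothetical small sizes: |A| = 3, |B| = 2, rho = 0.5.
Sigma = two_group_covariance(3, 2, 0.5)
for row in Sigma:
    print(row)
print(Sigma == rank_one_form(3, 2, 0.5))
```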
To study the alternating sign effect, we compare the performance of the least squares statistic $W_j^{ls} = |\hat\beta_j^{ls}| - |\tilde\beta_j^{ls}|$ and the marginal correlation statistic $W_j^{mc} = |X_j^T y| - |\tilde X_j^T y|$ using the knockoff and the knockoff+ filters at the nominal FDR q = %. We then vary the signal parameter $M = 1, 2, 3, \ldots, 10$ and repeat each experiment times. The results are summarized in Table 1.

M | LS: FDR(%) (mFDR(%)) | MC: FDR(%) (mFDR(%)) | LS: knockoff+ power(%) (knockoff power(%)) | MC: knockoff+ power(%) (knockoff power(%))
9.3 (8.9).6 (.93) 46.4 (47.). (.73) 9.48 (9.46). (.) (93.95).3 (.47) (9.78). (.) (99.66). (.33) (9.6). (.) (99.98). (.3) 5. (.8). (.). (.). (.) (9.5). (.). (.). (.8) (9.4). (.). (.). (.6) 8.34 (.8). (.). (.). (.3) (9.79). (.). (.). (.) 9.76 (9.8). (.). (.). (.)

Table 1: Alternating sign effect of the marginal correlation statistic, nominal FDR q = %. LS: the least squares statistic, MC: the marginal correlation statistic.

We focus on the power of the two statistics. The results from Table 1 show that the marginal correlation statistic loses most of its power and can hardly discover any true signal while the least squares statistic maintains about % power in this test. Thus, the marginal correlation statistic suffers from the alternating sign effect, which is consistent with the analysis above.

2.3 Potential challenge of the path method statistics

In this subsection, we point out a potential challenge for the path method statistics. To demonstrate this, we first observe that the knockoff matrix properties imply

$$(X_i - \tilde X_i)^T y = (X_i - \tilde X_i)^T (X\beta + \epsilon) = s_i \beta_i + (X_i - \tilde X_i)^T \epsilon, \qquad 1 \le i \le p.$$

The right hand side also appears in many path method statistics, including the Lasso path, the forward selection, and the orthogonal matching pursuit statistics. We now illustrate the potential difficulty that we may encounter for a path method statistic. After performing $l$ steps in one of the path methods (or at $\lambda$ for the Lasso path), we use $E$ to denote the set of features that have entered the model. We assume that $E$ does not include $X_j, \tilde X_j$ at the $l$th step, but at the next step either $X_j$ or $\tilde X_j$ will enter the model. After $l$ steps, the residue is $r_l = y - X_E \hat\beta_E$. Since $X_j, \tilde X_j \notin E$, we have $X_j^T X_i = \tilde X_j^T X_i$ for all $X_i \in E$. The same equality holds for $\tilde X_i \in E$. For $X_j, \tilde X_j$, their marginal correlation with $r_l$ determines which one of these two features will enter into the model first at the $(l+1)$st step:

$$X_j^T r_l = X_j^T (y - X_E \hat\beta_E) = X_j^T (X\beta - X_E \hat\beta_E) + X_j^T \epsilon,$$

$$\tilde X_j^T r_l = \tilde X_j^T (y - X_E \hat\beta_E) = \tilde X_j^T (X\beta - X_E \hat\beta_E) + \tilde X_j^T \epsilon,$$

$$(X_j - \tilde X_j)^T r_l = (X_j - \tilde X_j)^T y = s_j \beta_j + (X_j - \tilde X_j)^T \epsilon.$$

Assume that the noise level is relatively small. If $\mathrm{sign}(\beta_j) \neq \mathrm{sign}(X_j^T (X\beta - X_E \hat\beta_E))$ and $|X_j^T (X\beta - X_E \hat\beta_E)| > s_j |\beta_j|$, then $\tilde X_j$ will enter into the model at the $(l+1)$th step since

$$\begin{aligned} |\tilde X_j^T r_l| - |X_j^T r_l| &\approx \left| X_j^T (X\beta - X_E \hat\beta_E) - s_j \beta_j \right| - \left| X_j^T (X\beta - X_E \hat\beta_E) \right| \\ &= \mathrm{sign}\left( X_j^T (X\beta - X_E \hat\beta_E) \right) \left[ \left( X_j^T (X\beta - X_E \hat\beta_E) - s_j \beta_j \right) - X_j^T (X\beta - X_E \hat\beta_E) \right] = s_j |\beta_j| > 0. \end{aligned}$$

This may reduce the power of the knockoff filter. We call such an effect the alternating sign effect.

Definition 2.2 (Alternating sign effect). Let $r_l$ denote the residue at the $l$th step in a path method statistic or $y$ in the marginal correlation statistic. The alternating sign effect refers to the existence of a feature $j$ that satisfies $\mathrm{sign}(\beta_j) \neq \mathrm{sign}(X_j^T r_l)$.

In the counterexample above for the marginal correlation statistic, the design matrix $X$ and signal coefficient $\beta$ are constructed to generate the alternating sign effect. From our discussion, the alternating sign effect can lead to large negative $W_j$ and reduce the power of the knockoff filter.

2.4 Alternating sign effect on the Lasso path and other knockoff statistics

We will construct an example in which the Z-score is large enough to reject the null hypothesis. For this example, some knockoff statistics can only pick out a small subset of the false nulls.

2.4.1 The Z-score and signal amplitude

The Z-score of a classical linear model $y = X\beta + \epsilon$, $\epsilon \sim N(0, \sigma^2 I_n)$, is defined by $Z_j = \hat\beta_j^{ls} / \left( \sigma \sqrt{(\Sigma^{-1})_{jj}} \right)$, where $\hat\beta^{ls}$ is the least squares coefficient of regressing $y$ on $X$. Obviously, $Z_j \sim N(0,1)$ if $\beta_j = 0$. In our example and numerical experiments to be presented later, we choose $\sigma = 1$ and $\beta_i = M / s_i$ for $s_i \neq 0$ and $\beta_i = 0$ for $s_i = 0$. This setting guarantees that the Z-score of a false null is large. In fact, we have the following estimate for $Z_j$.

Lemma 2.3. Let $\sigma = 1$ and $Z_j = \hat\beta_j^{ls} / \sqrt{(\Sigma^{-1})_{jj}}$. For any $j$ with $\beta_j \neq 0$ defined above we have Z ξ + M s ξ + M, $\xi \sim N(0,1)$.

This result shows that for large amplitude $M$, the Z-score of the false null is large enough to reject the null hypothesis. We defer the proof of this lemma to the Appendix.

2.4.2 An example to illustrate the alternating sign effect for several knockoff statistics

In this subsection, we construct an example to demonstrate that the Lasso path and the forward selection statistics could lose their power due to the alternating sign effect.
In our example, the feature matrix, X, consists of four groups X = (X A,X A,X B,X B ) with correlations given as follows ρ (i,) (A A)\(A A ) or (i,) (B B)\(B B ), i X i,x = ρ (i,) A B ρ (i,) A A or (i,) B B where A = A A, B = B B, and ρ > ρ. For example, in the case A = A = B = B =, the Gram matrix of X has the following structure ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ X T X = Σ = ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ. ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ Given the above covariance matrix Σ, the rows of X N(,Σ) with columns normalized. 9

Testing the alternating sign effect for different statistics. We perform numerical experiments for the example above. Let A = B =, A = k, B = k, ρ =.3, ρ =.9, M = 6 (recall $\beta_j = M / s_j$), the noise level σ = and the nominal FDR q = %. We compare the performance of five statistics: the knockoff least squares, the weighted half Lasso defined in Section 2.5.2 (λ =.5, Z = diag{ s /, s /,..., s p /}), the Lasso path, the forward selection, and the orthogonal matching pursuit (OMP) statistics. Here, $\lambda$ is the regularizing parameter in the penalized model. We use the difference statistic $W_j = |\hat\beta_j| - |\tilde\beta_j|$ for the least squares, and the signed max statistic for the half Lasso, the Lasso path, the forward selection and the OMP statistics, i.e. $W_j = \max(|\hat\beta_j|, |\tilde\beta_j|) \cdot \mathrm{sign}(|\hat\beta_j| - |\tilde\beta_j|)$, where $\hat\beta_j, \tilde\beta_j$ is the enter time (the step at which the original and knockoff feature enters) in the path statistic or the solution of the half Lasso. We vary the size parameter k =,4,.., but keep the sparsity level at %, e.g. the number of true features is 7. = 4 if k =. All the signals are randomly selected from $\{1,2,\ldots,p\}$. We run each experiment times and present the results in the left panel of Figure 1.

Smaller size and more trials. We rerun this same experiment with a smaller size but a larger number of trials. Let A = B = 4, A = k, B = k, k = 4,8,,...,4 and the sparsity level at %. Each experiment is repeated times. The results are plotted in the right panel of Figure 1.

[Figure 1: two panels (left: Large Test, right: Small Test) showing Power(%) and FDR against the size parameter k for the Least Squares, Half Lasso, Lasso Path, Forward Selection and OMP statistics.]

Figure 1: Alternating sign tests for five methods at the nominal FDR q = % with varying sizes of special groups. The left and right subplots show the results of the first and second tests, respectively. The feature size of the first test (left subplot) is five times that of the second test (right subplot).
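The two ways of turning a pair $(\hat\beta_j, \tilde\beta_j)$ into a knockoff statistic used in this experiment, the difference statistic and the signed max statistic, can be written down directly. The sketch below uses hypothetical function names and made-up input values. It only illustrates the formulas and the antisymmetry under swapping a feature with its knockoff, not the fitting procedures that produce the coefficients.

```python
def difference_statistic(b, b_knock):
    """W_j = |b_j| - |b~_j|."""
    return abs(b) - abs(b_knock)

def signed_max_statistic(b, b_knock):
    """W_j = max(|b_j|, |b~_j|) * sign(|b_j| - |b~_j|)."""
    a, a_k = abs(b), abs(b_knock)
    sign = (a > a_k) - (a < a_k)   # +1, 0 or -1
    return max(a, a_k) * sign

# Hypothetical coefficient pairs (original, knockoff).
pairs = [(2.0, -0.5), (-0.3, 1.2), (0.7, 0.7)]
for b, b_knock in pairs:
    print(difference_statistic(b, b_knock), signed_max_statistic(b, b_knock))

# Swapping a feature with its knockoff flips the sign of both statistics.
print(signed_max_statistic(-0.5, 2.0), difference_statistic(-0.5, 2.0))
```

The signed max statistic keeps the magnitude of the larger of the two values, whereas the difference statistic shrinks it by the size of the competitor.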
We focus on the power of the Lasso path and the forward selection statistics and find that these methods lose most of their power in this alternating sign test, which confirms that they suffer from the alternating sign effect. The other methods maintain nearly % power with the desired FDR control. The FDR of the Lasso path and the forward selection statistics in the left subplot is not stable, which can be attributed to the relatively small number of trials. This example indicates that the alternating sign effect could be a problem for the knockoff filter for certain statistics.

2.5 Notion of good statistic and a half penalized method

2.5.1 Good statistic

In the previous section, we show that certain knockoff statistics suffer from the alternating sign effect (i.e. loss of power) even for strong signals that are only weakly correlated, due to the many large negative knockoff statistics $W_j$ that are generated. Based on the knockoff property, we propose an alternative statistic that does not suffer from the alternating sign effect. We first introduce the notion of a good statistic.

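Definition 2.4 (Good statistic). A knockoff statistic $W$ is a good statistic if it satisfies the positivity of non-null features: for fixed noise $\epsilon$, $W_j \ge 0$ if the signal amplitude $|\beta_j|$ is large enough relative to the noise.

The assumption of a good statistic ensures that $W_j$ is non-negative for a strong signal, so that the feature can potentially be selected by the knockoff filter.

Least squares statistic

An example of a good statistic is the least squares statistic. Denote the Gram matrix $G = [X\ \tilde X]^T [X\ \tilde X]$. We observe that the original construction of $\mathrm{diag}(s)$ used in [3] could lead to a singular Gram matrix. To alleviate this difficulty, we modify the criterion used to construct $\mathrm{diag}(s)$:
\[
\tfrac{1}{2}\lambda_{\min}(\Sigma) I_p \preceq \mathrm{diag}(s) \preceq \tfrac{3}{2}\Sigma .
\]
We will discuss different construction criteria in Section 2.7. The least squares coefficients $\hat\beta, \tilde\beta$ obtained by regressing $y$ on $[X, \tilde X]$ satisfy
\[
\begin{pmatrix} \hat\beta \\ \tilde\beta \end{pmatrix}
= \begin{pmatrix} \beta \\ 0 \end{pmatrix} + G^{-1} \begin{pmatrix} X^T \epsilon \\ \tilde X^T \epsilon \end{pmatrix}
\triangleq \begin{pmatrix} \beta \\ 0 \end{pmatrix} + \begin{pmatrix} \eta^{(1)} \\ \eta^{(2)} \end{pmatrix}.
\]
We show that $W_j \triangleq |\hat\beta_j| - |\tilde\beta_j|$ satisfies the definition of a good statistic. In fact, if $\beta_j \neq 0$, we have $W_j = |\beta_j + \eta^{(1)}_j| - |\eta^{(2)}_j|$. For a fixed noise $\epsilon$, $\eta^{(1)}, \eta^{(2)}$ are fixed and thus $W_j$ is positive if $|\beta_j|$ is large enough. In general, if $|\beta_j|$ is large compared to the noise level $\sigma$, then $W_j > 0$ with large probability, since $\mathrm{var}(\eta^{(1)}_j) = \sigma^2 (G^{-1})_{j,j}$ and $\mathrm{var}(\eta^{(2)}_j) = \sigma^2 (G^{-1})_{j+p,j+p}$.

The following formula for the least squares coefficients $\hat\beta, \tilde\beta$ will be useful in later sections:
\[
\begin{pmatrix} \hat\beta + \tilde\beta - \beta \\ \hat\beta - \tilde\beta - \beta \end{pmatrix}
= \left( \left[ \tfrac{X + \tilde X}{2}, \tfrac{X - \tilde X}{2} \right]^T \left[ \tfrac{X + \tilde X}{2}, \tfrac{X - \tilde X}{2} \right] \right)^{-1}
\begin{pmatrix} \left(\tfrac{X + \tilde X}{2}\right)^T \epsilon \\ \left(\tfrac{X - \tilde X}{2}\right)^T \epsilon \end{pmatrix}
= \begin{pmatrix} (2\Sigma - S)^{-1} & 0 \\ 0 & S^{-1} \end{pmatrix}
\begin{pmatrix} (X + \tilde X)^T \epsilon \\ (X - \tilde X)^T \epsilon \end{pmatrix},
\quad S = \mathrm{diag}\{s_1, s_2, \dots, s_p\}. \tag{12}
\]

2.5.2 A half penalized method

In this subsection, we introduce a half penalized model based on the knockoff property. This method naturally suggests a good statistic. Consider the following penalized model
\[
\min_{\hat\beta, \tilde\beta}\ \tfrac{1}{2} \| y - X\hat\beta - \tilde X \tilde\beta \|_2^2 + P(\hat\beta - \tilde\beta) + Q(\hat\beta + \tilde\beta), \tag{13}
\]
where $P(x)$ and $Q(x)$ are even functions. The statistic defined by $W_j = |\hat\beta_j| - |\tilde\beta_j|$ or $W_j = \max(|\hat\beta_j|, |\tilde\beta_j|) \cdot \mathrm{sign}(|\hat\beta_j| - |\tilde\beta_j|)$ satisfies the sufficiency and the antisymmetry properties, since swapping $X_j, \tilde X_j$ leads to swapping $\hat\beta_j, \tilde\beta_j$.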
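Let $\hat\beta^{ls}, \tilde\beta^{ls}$ be the least squares coefficients obtained by regressing $y$ on $[X, \tilde X]$. We denote by $r \triangleq y - X\hat\beta^{ls} - \tilde X \tilde\beta^{ls}$ the residual. The geometric property of the least squares method implies $r \perp X, \tilde X$, which leads to
\[
y - X\hat\beta - \tilde X\tilde\beta
= r + X(\hat\beta^{ls} - \hat\beta) + \tilde X(\tilde\beta^{ls} - \tilde\beta)
= r + \frac{X + \tilde X}{2}\left(\hat\beta^{ls} + \tilde\beta^{ls} - \hat\beta - \tilde\beta\right) + \frac{X - \tilde X}{2}\left(\hat\beta^{ls} - \tilde\beta^{ls} - (\hat\beta - \tilde\beta)\right).
\]
The residual $r$ does not depend on $\hat\beta$ and $\tilde\beta$. Thus we can exclude $r$ from the penalized model (13). Note that the knockoff constraint on $\tilde X$ implies an important property:
\[
(X + \tilde X)^T (X - \tilde X) = 0, \qquad X + \tilde X \perp X - \tilde X. \tag{14}
\]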
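The orthogonality property can be checked numerically. The sketch below builds a fixed-design knockoff matrix with the standard construction from Barber and Candès, $\tilde X = X(I - \Sigma^{-1}S) + \tilde U C$ with $C^T C = 2S - S\Sigma^{-1}S$, which is not spelled out in this section and is therefore an assumption here. The choice $s_j = \lambda_{\min}(\Sigma)$, the sizes and the random design are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 5                      # illustrative sizes with n >= 2p
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)    # unit-norm columns
Sigma = X.T @ X

# Illustrative choice s_j = lambda_min(Sigma); it keeps 2S - S Sigma^{-1} S positive definite.
s = np.full(p, np.linalg.eigvalsh(Sigma)[0])
S = np.diag(s)

Sigma_inv = np.linalg.inv(Sigma)
# Upper-triangular C with C^T C = 2S - S Sigma^{-1} S.
C = np.linalg.cholesky(2 * S - S @ Sigma_inv @ S).T
# Orthonormal n x p block orthogonal to the column space of X.
Q, _ = np.linalg.qr(X, mode="complete")
U_tilde = Q[:, p:2 * p]

X_tilde = X @ (np.eye(p) - Sigma_inv @ S) + U_tilde @ C

plus, minus = X + X_tilde, X - X_tilde
cross = plus.T @ minus            # should vanish by the orthogonality property
gram_minus = minus.T @ minus      # should equal 2 * diag(s)
print(np.abs(cross).max(), np.abs(gram_minus - 2 * S).max())
```

Both printed numbers should be at the level of floating-point round-off; the second one shows that the columns of $X - \tilde X$ are mutually orthogonal with squared norms $2 s_i$.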

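The orthogonality property (14) enables us to separate the squared norm of the left-hand side into the sum of three terms:
\[
\| y - X\hat\beta - \tilde X\tilde\beta \|_2^2
= \| r \|_2^2
+ \left\| \frac{X + \tilde X}{2}\left(\hat\beta^{ls} + \tilde\beta^{ls} - \hat\beta - \tilde\beta\right) \right\|_2^2
+ \left\| \frac{X - \tilde X}{2}\left(\hat\beta^{ls} - \tilde\beta^{ls} - (\hat\beta - \tilde\beta)\right) \right\|_2^2 .
\]
We can then rewrite (13) in the following equivalent form:
\[
\min_{\hat\beta, \tilde\beta}\ \frac{1}{2}\left\| \frac{X + \tilde X}{2}\left(\hat\beta^{ls} + \tilde\beta^{ls} - \hat\beta - \tilde\beta\right) \right\|_2^2
+ \frac{1}{2}\left\| \frac{X - \tilde X}{2}\left(\hat\beta^{ls} - \tilde\beta^{ls} - (\hat\beta - \tilde\beta)\right) \right\|_2^2
+ P(\hat\beta - \tilde\beta) + Q(\hat\beta + \tilde\beta), \tag{15}
\]
which can be further reformulated as two equivalent subproblems
\[
\min_{\hat\alpha}\ \frac{1}{2}\left\| \frac{X + \tilde X}{2}\left(\hat\beta^{ls} + \tilde\beta^{ls} - \hat\alpha\right) \right\|_2^2 + Q(\hat\alpha), \tag{16}
\]
\[
\min_{\tilde\alpha}\ \frac{1}{2}\left\| \frac{X - \tilde X}{2}\left(\hat\beta^{ls} - \tilde\beta^{ls} - \tilde\alpha\right) \right\|_2^2 + P(\tilde\alpha), \tag{17}
\]
where we have replaced $\hat\beta + \tilde\beta$, $\hat\beta - \tilde\beta$ in (15) by $\hat\alpha$, $\tilde\alpha$, respectively. A key observation of the knockoff filter is that the column vectors of $X - \tilde X$ are mutually orthogonal, since
\[
(X - \tilde X)^T (X - \tilde X) = 2\Sigma - 2(\Sigma - \mathrm{diag}(s)) = 2\,\mathrm{diag}(s). \tag{18}
\]
Consequently, the second subproblem reduces to
\[
\min_{\tilde\alpha}\ \sum_{i=1}^p \frac{1}{8} \| X_i - \tilde X_i \|_2^2 \left(\tilde\alpha_i - (\hat\beta^{ls}_i - \tilde\beta^{ls}_i)\right)^2 + P(\tilde\alpha)
= \min_{\tilde\alpha}\ \sum_{i=1}^p \frac{s_i}{4} \left(\tilde\alpha_i - (\hat\beta^{ls}_i - \tilde\beta^{ls}_i)\right)^2 + P(\tilde\alpha). \tag{19}
\]
If $P(x)$ can be expressed as $P(x) = \sum_{i=1}^p P_i(x_i)$, we can solve (19) easily by solving $p$ one-dimensional optimization problems separately.

Example 1: A half penalized method and a good statistic

We construct a good statistic to make sure that $W_j > 0$ for a true feature. We choose $Q \equiv 0$ and (13) becomes
\[
\min_{\hat\beta, \tilde\beta}\ \tfrac{1}{2} \| y - X\hat\beta - \tilde X\tilde\beta \|_2^2 + P(\hat\beta - \tilde\beta).
\]
This is different from other penalized models since it only penalizes $\hat\beta - \tilde\beta$. We call this model the half penalized method. This problem can also be divided into the two subproblems (16) and (17). The solution of (16) is trivial, and by (12) we have the following explicit formula:
\[
\hat\alpha = \hat\beta^{ls} + \tilde\beta^{ls} = \beta + (2\Sigma - S)^{-1} (X + \tilde X)^T \epsilon. \tag{20}
\]
We introduce the following notation, which will be used very often later on:
\[
\epsilon^{(1)} \triangleq (2\Sigma - S)^{-1} (X + \tilde X)^T \epsilon, \qquad
\epsilon^{(2)} \triangleq S^{-1} (X - \tilde X)^T \epsilon, \tag{21}
\]
\[
\mathrm{Var}(\epsilon^{(1)}) = \sigma^2 (\Sigma - S/2)^{-1}, \qquad
\mathrm{Var}(\epsilon^{(2)}) = \sigma^2 (S/2)^{-1}, \tag{22}
\]
where $\sigma^2 = \mathrm{var}(\epsilon_i)$. Substituting the expression of $\hat\beta^{ls} - \tilde\beta^{ls}$ given in (12) into (19) yields
\[
\min_{\tilde\alpha}\ \sum_{i=1}^p \frac{1}{8} \| X_i - \tilde X_i \|_2^2 \left(\tilde\alpha_i - \beta_i - \epsilon^{(2)}_i\right)^2 + P(\tilde\alpha)
= \min_{\tilde\alpha}\ \sum_{i=1}^p \frac{s_i}{4} \left(\tilde\alpha_i - \beta_i - \epsilon^{(2)}_i\right)^2 + P(\tilde\alpha). \tag{23}
\]
The minimizer $\tilde\alpha$ in (23) satisfies the following lemma.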

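Lemma 2.5. Assume $P(x)$ is even. The minimizer of (23) satisfies $\mathrm{sign}(\tilde\alpha_j) = \mathrm{sign}(\beta_j + \epsilon^{(2)}_j)$ or $0$ if $\beta_j + \epsilon^{(2)}_j,\ s_j \neq 0$.

Proof. Since $P$ is even, we have $P(x) = P(|x|)$. Recall the minimization problem
\[
\tilde\alpha = \arg\min_{\alpha}\ \sum_{i=1}^p \frac{s_i}{4} \left(\alpha_i - \beta_i - \epsilon^{(2)}_i\right)^2 + P(|\alpha|) \triangleq \arg\min_{\alpha} f(\alpha).
\]
If $\mathrm{sign}(\tilde\alpha_j) \neq \mathrm{sign}(\beta_j + \epsilon^{(2)}_j)$ or $0$, we can modify $\tilde\alpha$ as follows,
\[
\tilde\alpha^{new}_j = -\tilde\alpha_j, \qquad \tilde\alpha^{new}_i = \tilde\alpha_i,\ i \neq j,
\]
to obtain a smaller value. In fact, this modification only changes one term in $f(\tilde\alpha)$, and the following inequality leads to a contradiction:
\[
f(\tilde\alpha^{new}) - f(\tilde\alpha)
= \frac{s_j}{4} \left(\tilde\alpha^{new}_j - \beta_j - \epsilon^{(2)}_j\right)^2 - \frac{s_j}{4} \left(\tilde\alpha_j - \beta_j - \epsilon^{(2)}_j\right)^2
= s_j \tilde\alpha_j (\beta_j + \epsilon^{(2)}_j) < 0.
\]

Assume that the knockoff statistic takes the difference formula, i.e. $W_j = |\hat\beta_j| - |\tilde\beta_j|$ (the signed max formula can be considered similarly). Equation (22) yields
\[
\mathrm{var}(\epsilon^{(1)}_j) = \sigma^2 \left((\Sigma - S/2)^{-1}\right)_{jj}, \qquad \mathrm{var}(\epsilon^{(2)}_j) = 2\sigma^2 / s_j .
\]
When $|\beta_j|$ is large compared to the noise level $\sigma$, we have $|\beta_j| > |\epsilon^{(1)}_j|, |\epsilon^{(2)}_j|$ with large probability. Consequently, we obtain $\mathrm{sign}(\hat\alpha_j) = \mathrm{sign}(\tilde\alpha_j) = \mathrm{sign}(\beta_j)$. Combining the solution of the first problem (20), Lemma 2.5 and the transform between $\alpha$ and $\beta$ ($\hat\alpha = \hat\beta + \tilde\beta$, $\tilde\alpha = \hat\beta - \tilde\beta$), we conclude that
\[
W_j = |\hat\beta_j| - |\tilde\beta_j| = \tfrac{1}{2}\left( |\hat\alpha_j + \tilde\alpha_j| - |\hat\alpha_j - \tilde\alpha_j| \right) \ge 0,
\]
which implies that $W$ is a good statistic.

Example 2: A half Lasso statistic

We choose $Q(x) \equiv 0$ and $P(x) = \lambda \|x\|_1$. As a result, the Lasso problem (17) or (19) can be solved directly:
\[
\tilde\alpha = \arg\min_{\alpha}\ \sum_{i=1}^p \frac{s_i}{4} \left(\alpha_i - (\hat\beta^{ls}_i - \tilde\beta^{ls}_i)\right)^2 + \lambda |\alpha_i|,
\]
\[
\tilde\alpha_i = \mathrm{Sh}\left(\hat\beta^{ls}_i - \tilde\beta^{ls}_i,\ 2\lambda / s_i\right)
\triangleq \mathrm{sign}(\hat\beta^{ls}_i - \tilde\beta^{ls}_i) \left( |\hat\beta^{ls}_i - \tilde\beta^{ls}_i| - 2\lambda / s_i \right)_+ ,
\]
where Sh (Shrinkage) is the soft threshold operator and $a_+ \triangleq \max(0, a)$. We can rewrite the formula above in vector form, $\tilde\alpha = \mathrm{Sh}(\hat\beta^{ls} - \tilde\beta^{ls}, 2\lambda S_{inv})$, where $S_{inv} = [1/s_1, 1/s_2, \dots, 1/s_p]^T$. This vector identity should be interpreted as a collection of pointwise identities. Since $Q \equiv 0$, the solution of (16) is given by $\hat\alpha = \hat\beta^{ls} + \tilde\beta^{ls}$. Combining the formulas for $\tilde\alpha$ and $\hat\alpha$, we obtain the solution of (13):
\[
\begin{aligned}
\hat\beta &= \tfrac{1}{2}\left( \hat\beta^{ls} + \tilde\beta^{ls} + \mathrm{Sh}(\hat\beta^{ls} - \tilde\beta^{ls}, 2\lambda S_{inv}) \right) = \tfrac{1}{2}\left( \beta + \epsilon^{(1)} + \mathrm{Sh}(\beta + \epsilon^{(2)}, 2\lambda S_{inv}) \right), \\
\tilde\beta &= \tfrac{1}{2}\left( \hat\beta^{ls} + \tilde\beta^{ls} - \mathrm{Sh}(\hat\beta^{ls} - \tilde\beta^{ls}, 2\lambda S_{inv}) \right) = \tfrac{1}{2}\left( \beta + \epsilon^{(1)} - \mathrm{Sh}(\beta + \epsilon^{(2)}, 2\lambda S_{inv}) \right),
\end{aligned} \tag{24}
\]
where $\epsilon^{(1)}, \epsilon^{(2)}$ are defined in (21) with variance (22).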
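The closed-form half Lasso solution can be sketched in a few lines. The code assumes the least squares coefficients and the vector $s$ are already available; the function names and the toy inputs are hypothetical. It uses the threshold $2\lambda/s_i$, which follows from minimizing the scalar objective $\frac{s_i}{4}(\alpha - d_i)^2 + \lambda|\alpha|$ with $d_i = \hat\beta^{ls}_i - \tilde\beta^{ls}_i$.

```python
import numpy as np

def soft_threshold(x, t):
    """Sh(x, t) = sign(x) * max(|x| - t, 0), applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def half_lasso(beta_ls, beta_tilde_ls, s, lam):
    """Half Lasso solution from least squares coefficients (sketch).

    alpha_hat is unpenalized (Q = 0); alpha_tilde is the soft-thresholded
    difference with per-coordinate threshold 2 * lam / s_i.
    """
    alpha_hat = beta_ls + beta_tilde_ls
    alpha_tilde = soft_threshold(beta_ls - beta_tilde_ls, 2.0 * lam / s)
    beta_hat = 0.5 * (alpha_hat + alpha_tilde)
    beta_tilde = 0.5 * (alpha_hat - alpha_tilde)
    return beta_hat, beta_tilde

# Toy inputs (made-up numbers): one strong positive, one weak, one strong negative.
beta_ls = np.array([3.0, 0.2, -1.5])
beta_tilde_ls = np.array([0.5, 0.1, 0.3])
s = np.array([0.5, 0.5, 0.5])

beta_hat, beta_tilde = half_lasso(beta_ls, beta_tilde_ls, s, lam=0.1)
W = np.abs(beta_hat) - np.abs(beta_tilde)   # difference statistic
print(beta_hat, beta_tilde, W)
```

In this toy input the weak second coordinate falls below the threshold, so its two coefficients coincide and its statistic is zero, while the strong coordinates receive positive statistics.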
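It is interesting to note that if $\beta_j$ is small, so that $|\beta_j + \epsilon^{(2)}_j| \le 2\lambda / s_j$, the soft-threshold yields $\hat\beta_j = \tilde\beta_j$, which implies $W_j = 0$.

A weighted half Lasso statistic

We can add a weight to $\beta_i$ to balance the noise level and the soft-threshold. Consider the following penalized model
\[
\min_{\hat\beta, \tilde\beta}\ \tfrac{1}{2} \| y - X Z^{-1} \hat\beta - \tilde X Z^{-1} \tilde\beta \|_2^2 + \lambda \| \hat\beta - \tilde\beta \|_1, \tag{25}
\]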

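where $Z = \mathrm{diag}\{z_1, z_2, \dots, z_p\}$ is a positive diagonal matrix chosen in advance. Note that $\hat\beta, \tilde\beta$ only depend on $[X\ \tilde X]^T y$ and $[X\ \tilde X]^T [X\ \tilde X]$. Similarly, we derive the solution as follows:
\[
\begin{aligned}
\hat\beta &= \tfrac{1}{2}\left( Z(\beta + \epsilon^{(1)}) + \mathrm{Sh}(Z(\beta + \epsilon^{(2)}), 2\lambda Z^2 S_{inv}) \right), \\
\tilde\beta &= \tfrac{1}{2}\left( Z(\beta + \epsilon^{(1)}) - \mathrm{Sh}(Z(\beta + \epsilon^{(2)}), 2\lambda Z^2 S_{inv}) \right),
\end{aligned} \tag{26}
\]
where $S = \mathrm{diag}\{s_1, s_2, \dots, s_p\}$ and $\epsilon^{(1)}, \epsilon^{(2)}$ are defined in (21). The weighted half Lasso statistic that satisfies the sufficiency property is defined as follows:
\[
W_j = \frac{1}{z_j}\left( |\hat\beta_j| - |\tilde\beta_j| \right)
= \tfrac{1}{2}\left( \left| \beta_j + \epsilon^{(1)}_j + \mathrm{Sh}(\beta_j + \epsilon^{(2)}_j, 2\lambda z_j / s_j) \right| - \left| \beta_j + \epsilon^{(1)}_j - \mathrm{Sh}(\beta_j + \epsilon^{(2)}_j, 2\lambda z_j / s_j) \right| \right). \tag{27}
\]
We can also define the associated signed max statistic $W_j = \frac{1}{z_j} \max(|\hat\beta_j|, |\tilde\beta_j|) \cdot \mathrm{sign}(|\hat\beta_j| - |\tilde\beta_j|)$, which also satisfies the sufficiency property. The difference between (26) and (24) is the addition of a different weight to the threshold. Note that the covariance matrix of $\epsilon^{(2)}$ is $2 S^{-1} \sigma^2$. The weighted half Lasso can balance the variance of the noise $\epsilon^{(2)}$ and the soft-threshold. We suggest using $Z = \mathrm{diag}(\sqrt{s_1/2}, \sqrt{s_2/2}, \dots, \sqrt{s_p/2})$. With this choice of $Z$, we have
\[
\mathrm{Var}(\epsilon^{(2)}) = 2\sigma^2 S^{-1} = 4\sigma^2\, \mathrm{diag}(z_1^2/s_1^2, z_2^2/s_2^2, \dots, z_p^2/s_p^2) \sim \lambda^2\, \mathrm{diag}(4 z_1^2/s_1^2, 4 z_2^2/s_2^2, \dots, 4 z_p^2/s_p^2),
\]
where the right-hand side collects the squared thresholds $(2\lambda z_j / s_j)^2$ appearing in (27).

Example 3: A negative half Lasso

Choosing $Q(x) \equiv 0$ and $P(x) = -\lambda \sum_{i=1}^p \mu_i |x_i|$, $\mu_i \ge 0$, we can deduce the solution of (16) and (17) (or (19)):
\[
\hat\beta + \tilde\beta = \hat\alpha = \hat\beta^{ls} + \tilde\beta^{ls} = \beta + \epsilon^{(1)},
\]
\[
\hat\beta - \tilde\beta = \tilde\alpha = \arg\min_{\alpha}\ \sum_{i=1}^p \frac{s_i}{4} \left(\alpha_i - (\hat\beta^{ls}_i - \tilde\beta^{ls}_i)\right)^2 - \lambda \mu_i |\alpha_i|
\]
\[
\Longrightarrow\quad
\hat\beta_i = \beta_i + \tfrac{1}{2}\left( \epsilon^{(1)}_i + \epsilon^{(2)}_i \right) + \frac{\lambda \mu_i}{s_i}\, \mathrm{sign}(\beta_i + \epsilon^{(2)}_i), \qquad
\tilde\beta_i = \tfrac{1}{2}\left( \epsilon^{(1)}_i - \epsilon^{(2)}_i \right) - \frac{\lambda \mu_i}{s_i}\, \mathrm{sign}(\beta_i + \epsilon^{(2)}_i),
\]
where we have used $\hat\beta^{ls}_i - \tilde\beta^{ls}_i = \beta_i + \epsilon^{(2)}_i$. We see that a negative $P(x)$ can increase the difference between $\hat\beta$ and $\tilde\beta$, which can be useful to distinguish the true feature from its knockoff.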
When $\mu_i = s_i$, our numerical results show that the negative penalty enlarges the gap between $\hat\beta$ and $\tilde\beta$ and increases the power by 5 % compared to least squares, while the half Lasso shrinks the gap between $\hat\beta$ and $\tilde\beta$ and reduces the power by 5 %.

2.6 Extension of the knockoff sufficiency property

In [3], the sufficiency property of a knockoff statistic states that the statistic $W$ depends only on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the feature-response product $[X\ \tilde X]^T y$. In this subsection, we generalize the sufficiency property so that we can apply the knockoff filter to more general scenarios. In addition, we propose a method to estimate the noise level and determine the prior regularizing parameter for a half penalized method.

Let $U \in R^{n \times (n-2p)}$ be an orthonormal matrix such that $[X\ \tilde X]^T U = 0$ and the columns of $[X\ \tilde X\ U]$ form a basis of $R^n$. Recall that the knockoff condition implies $(X + \tilde X)^T (X - \tilde X) = X^T X - \tilde X^T \tilde X = 0$. Hence, we can decompose $R^n$ as follows:
\[
R^n = \mathrm{span}(X + \tilde X) \oplus \mathrm{span}(X - \tilde X) \oplus \mathrm{span}(U).
\]
Our key observation is that swapping any pair of an original feature $X_j$ and its knockoff $\tilde X_j$ does not modify the spaces $\mathrm{span}(X + \tilde X)$, $\mathrm{span}(X - \tilde X)$ and $\mathrm{span}(U)$. Therefore, the probability distributions of the projections of the response $y$ onto these spaces are independent and invariant under swapping an arbitrary pair $X_j, \tilde X_j$. Inspired by this observation, we can generalize the sufficiency property.
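The role of $U$ can be illustrated numerically. The sketch below computes an orthonormal basis $U$ of the orthogonal complement of $\mathrm{span}([X\ \tilde X])$ via a full QR factorization, checks that $U^T y$ contains only noise, and checks that swapping one pair leaves $X + \tilde X$ unchanged and only flips the sign of one column of $X - \tilde X$. The matrix `X_tilde` here is a random stand-in rather than a valid knockoff matrix (an assumption made to keep the sketch short); the checks shown only use the orthogonality of $U$ to $[X\ \tilde X]$. The last line forms $\|U^T y\|_2 / \sqrt{n - 2p}$, the quantity on which the noise-level estimate later in this section is built. All sizes and coefficients are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 400, 4, 2.0                 # illustrative sizes and noise level
X = rng.standard_normal((n, p))
X_tilde = rng.standard_normal((n, p))     # stand-in for a knockoff matrix (assumption)
beta = np.array([3.0, 0.0, -2.0, 0.0])
eps = sigma * rng.standard_normal(n)
y = X @ beta + eps

# Full QR of [X, X_tilde]: the last n - 2p columns of Q span the orthogonal complement.
Q, _ = np.linalg.qr(np.hstack([X, X_tilde]), mode="complete")
U = Q[:, 2 * p:]                          # shape (n, n - 2p)

# Swap the pair j between the original and the stand-in knockoff.
j = 1
X_s, X_tilde_s = X.copy(), X_tilde.copy()
X_s[:, j], X_tilde_s[:, j] = X_tilde[:, j], X[:, j]

# Since U^T X = 0, the projection U^T y equals U^T eps and carries no signal.
sigma_hat = np.linalg.norm(U.T @ y) / np.sqrt(n - 2 * p)
print(sigma_hat)
```

Because $U$ is orthogonal to both $X$ and $\tilde X$, it is automatically orthogonal to the swapped pair as well, which is why a statistic may depend on $U^T y$ without breaking the swap symmetry.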

Definition 2.6 (Generalized Sufficiency Property). The statistic $W$ is said to obey the generalized sufficiency property if $W$ depends only on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the feature-response product $[X\ \tilde X\ U]^T y$; that is, we can write
\[
W = f\left([X\ \tilde X]^T [X\ \tilde X],\ [X\ \tilde X\ U]^T y\right)
\]
for some $f : S^+_{2p} \times R^n \to R^p$ and an orthonormal matrix $U \in R^{n \times (n-2p)}$ that satisfies $U^T [X\ \tilde X] = 0$.

Remark. Compared with the original sufficiency property, the generalized sufficiency property includes the addition of $U^T y$, which is the coefficient vector of the orthogonal projection of $y$ onto $\mathrm{span}([X\ \tilde X])^{\perp}$. As an application, we will use this extra component to estimate the noise level and incorporate the estimated noise level into the knockoff statistic from a penalized method without violating the exchangeability property and FDR control.

The definition of the antisymmetry property remains the same: swapping $X_j$ and $\tilde X_j$ has the same effect as changing the sign of $W_j$, i.e.
\[
W_j\left([X\ \tilde X]_{swap(\hat S)}, U, y\right) = W_j\left([X\ \tilde X], U, y\right) \cdot
\begin{cases} -1, & j \in \hat S, \\ +1, & j \notin \hat S, \end{cases}
\]
where $\hat S$ is a subset of nulls. For any knockoff matrix $\tilde X$ and associated statistic $W$ that satisfy the above definition, we call $W$ a generalized knockoff statistic. We will prove that this generalized statistic satisfies the exchangeability property. Then we can apply the same super-martingale argument as in [3] to establish rigorous FDR control.

According to the analysis establishing exchangeability in [3], we need to prove the counterparts of Lemma 2 (Pairwise exchangeability for the features) and Lemma 3 (Pairwise exchangeability for the response) in [3]. Lemma 2 is a direct result of the knockoff constraint. We need to prove the following lemma.

Lemma 2.7. For any generalized knockoff statistic $W$ and a subset $\hat S$ of nulls, we have
\[
W_{swap(\hat S)} = f\left([X\ \tilde X]^T_{swap(\hat S)} [X\ \tilde X]_{swap(\hat S)},\ \left[[X\ \tilde X]_{swap(\hat S)}\ U\right]^T y\right)
\overset{d}{=} f\left([X\ \tilde X]^T [X\ \tilde X],\ [X\ \tilde X\ U]^T y\right) = W.
\]

Proof. Since $\tilde X$ is a knockoff matrix, we get
\[
[X\ \tilde X]^T_{swap(\hat S)} [X\ \tilde X]_{swap(\hat S)} = [X\ \tilde X]^T [X\ \tilde X], \tag{28}
\]
and thus the first argument of $f$ is the same on both sides.
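Next, we verify
\[
\left[[X\ \tilde X]_{swap(\hat S)}\ U\right]^T y \overset{d}{=} [X\ \tilde X\ U]^T y.
\]
Since $y$ is a Gaussian random vector, it is equivalent to verify that the means and the covariances of both sides are the same. We first check the means of both sides:
\[
E\left(\left[[X\ \tilde X]_{swap(\hat S)}\ U\right]^T y\right)
= \left[[X\ \tilde X]_{swap(\hat S)}\ U\right]^T X\beta
= \begin{pmatrix} [X\ \tilde X]^T X\beta \\ U^T X\beta \end{pmatrix}
= E\left([X\ \tilde X\ U]^T y\right). \tag{29}
\]
The second equality is guaranteed by $X_j^T X\beta = \tilde X_j^T X\beta$ for all $j \in \hat S$, since $X_j^T X_i = \tilde X_j^T X_i$ for all $i \neq j$ and $\hat S$ is a subset of nulls. Using $y \sim N(X\beta, \sigma^2 I_n)$, $[X\ \tilde X]^T U = 0$ and (28), we obtain
\[
\begin{aligned}
\mathrm{Var}\left(\left[[X\ \tilde X]_{swap(\hat S)}\ U\right]^T y\right)
&= \sigma^2 \left[[X\ \tilde X]_{swap(\hat S)}\ U\right]^T \left[[X\ \tilde X]_{swap(\hat S)}\ U\right] \\
&= \sigma^2\, \mathrm{diag}\left([X\ \tilde X]^T_{swap(\hat S)} [X\ \tilde X]_{swap(\hat S)},\ U^T U\right)
= \sigma^2\, \mathrm{diag}\left([X\ \tilde X]^T [X\ \tilde X],\ U^T U\right) \\
&= \mathrm{Var}\left([X\ \tilde X\ U]^T y\right).
\end{aligned} \tag{30}
\]
Combining (29) and (30), we conclude the proof.

The exchangeability property of a generalized knockoff statistic is a result of this lemma and the antisymmetry property of the knockoff statistic.

Lemma 2.8 (i.i.d. signs for the nulls). Let $\eta \in \{\pm 1\}^p$ be a sign sequence independent of $W$, with $\eta_j = +1$ for all non-null $j$ and $\eta_j$ i.i.d. uniform on $\{\pm 1\}$ for null $j$. Then
\[
(W_1, \dots, W_p) \overset{d}{=} (W_1 \eta_1, \dots, W_p \eta_p).
\]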

Estimate of the noise level and an application

As an application of the generalized knockoff statistic, we propose a new method to estimate the noise level in the knockoff filter without violating the exchangeability property and FDR control. Let $U \in R^{n \times (n-2p)}$ be an orthonormal matrix such that $U^T [X\ \tilde X] = 0$. From the identity $U^T y = U^T (X\beta + \epsilon) = U^T \epsilon$, we obtain an estimate of the noise level that depends only on $U^T y$:
\[
\hat\sigma \triangleq \| U^T y \|_2 / \sqrt{n - 2p}. \tag{31}
\]
For any problem with an unknown noise level, we consider the knockoff half Lasso whose regularizing parameter is determined by $\hat\sigma$, i.e.
\[
\min_{\hat\beta, \tilde\beta}\ \tfrac{1}{2} \| y - X\hat\beta - \tilde X\tilde\beta \|_2^2 + \lambda \hat\sigma \| \hat\beta - \tilde\beta \|_1, \tag{32}
\]
where λ = or can be decided empirically. Since the solution $(\hat\beta, \tilde\beta)$ of (32) depends on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$, the marginal correlation $[X\ \tilde X]^T y$ and the regularizing parameter $\lambda\hat\sigma$ ($\hat\sigma$ is determined by $U^T y$), we deduce that $\hat\beta, \tilde\beta$ are functions of the Gram matrix and $[X\ \tilde X\ U]^T y$. Consequently, the statistic $W_j \triangleq |\hat\beta_j| - |\tilde\beta_j| = W_j([X\ \tilde X]^T [X\ \tilde X], [X\ \tilde X\ U]^T y)$ (or the signed max version) satisfies the generalized sufficiency property. The antisymmetry property can be verified easily. Hence, we can choose $W$ as a knockoff statistic with exact FDR control.

2.7 A modified SDP construction

In [3], the authors propose to construct $\mathrm{diag}(s)$ ($s = (s_1, s_2, \dots, s_p)$) via the convex optimization problem
\[
\text{maximize: } \sum_{i=1}^p s_i, \qquad \text{subject to: } \mathrm{diag}(s) \preceq 2\Sigma;\quad 0 \le s_i \le 1,\ i = 1, 2, \dots, p. \tag{33}
\]
Such a construction sometimes produces zero $s_i$. In this case, feature $i$ cannot be selected by the knockoff filter. To illustrate this point, we construct a simple but by no means extreme example in which this construction criterion gives zero $s_i$ for some $i$. Let $\Sigma_{a,b}$ be the $3 \times 3$ matrix defined as
\[
\Sigma_{a,b} = \begin{pmatrix} 1 & b & a \\ b & 1 & a \\ a & a & 1 \end{pmatrix}.
\]
Using the CVX solver in MATLAB, we can solve (33) for several $\Sigma_{a,b}$. When $(a,b) = (0.8, 0.4)$, we have $s_1 = s_2 = 0.24$ and $s_3 = 0$; when $(a,b) = (0.9, 0.7)$, we have $s_1 = s_2 = 0.16$ and $s_3 = 0$; when $(a,b) = (0.7, 0.4)$, we have $s_1 = s_2 = 0.84$ and $s_3 = 0$. We observe that $s_3 = 0$ in all of these examples.
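The zero-output examples can be sanity-checked without an SDP solver. The sketch below assumes the example matrix has unit diagonal with off-diagonal entries $b$ (between features 1 and 2) and $a$ (between feature 3 and the others), and that the reported solutions are $(a,b) = (0.8, 0.4)$ with $s_1 = s_2 = 0.24$, $(0.9, 0.7)$ with $0.16$, and $(0.7, 0.4)$ with $0.84$; these values are a reading of the example and should be treated as assumptions. The check does not solve the optimization problem: it only verifies that each reported $s$ sits on the boundary of the constraint $\mathrm{diag}(s) \preceq 2\Sigma$ and that, with $s_1, s_2$ held at the reported values, raising $s_3$ above zero violates the constraint. The function names are hypothetical.

```python
import numpy as np

def sigma_ab(a, b):
    """3 x 3 example correlation matrix (assumed layout of the example)."""
    return np.array([[1.0, b, a],
                     [b, 1.0, a],
                     [a, a, 1.0]])

def min_eig_slack(a, b, s):
    """Smallest eigenvalue of 2 * Sigma - diag(s); feasibility needs it >= 0."""
    return np.linalg.eigvalsh(2 * sigma_ab(a, b) - np.diag(s))[0]

cases = [((0.8, 0.4), 0.24), ((0.9, 0.7), 0.16), ((0.7, 0.4), 0.84)]
results = []
for (a, b), s12 in cases:
    on_boundary = min_eig_slack(a, b, [s12, s12, 0.0])   # expected to be ~0
    pushed = min_eig_slack(a, b, [s12, s12, 0.05])       # expected to be negative
    results.append((on_boundary, pushed))
    print((a, b), s12, on_boundary, pushed)
```

A smallest eigenvalue near zero means the constraint is active at the reported point, and the negative value after increasing $s_3$ shows that feature 3 cannot receive a positive $s_3$ unless $s_1$ and $s_2$ are reduced.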
Modified SDP construction

To overcome the zero output problem, we propose to slightly modify the original SDP construction by solving the following optimization problem:
\[
\text{minimize: } \sum_{i=1}^p (1 - s_i), \qquad \text{subject to: } \mathrm{diag}(s) \preceq 2\beta\Sigma,\quad \alpha\lambda_{\min}(\Sigma) \le s_i \le 1,\quad \alpha \in [0, 1),\ \beta \in (0, 1].
\]
The half penalized method requires that $\Sigma - \mathrm{diag}(s)/2$ and $\mathrm{diag}(s)$ be invertible (see the least squares coefficient formula (12)), and we suggest $(\alpha, \beta) = (0.5, 0.75)$. For path statistics, to alleviate zero output in the SDP construction, we suggest $(\alpha, \beta) = (0.5, 1)$.

3 A PCA prototype filter

In this section, we propose a PCA prototype group selection method with group FDR control to overcome the difficulty associated with strong within-group correlation. It is well known that the grouping strategy provides an effective way to handle strongly correlated features. Our work is inspired by Reid-Tibshirani's prototype filter [3] and Dai-Barber's group knockoff filter [6]. We provide a brief summary of the two methods below before introducing our PCA prototype filter.

3.1 Reid-Tibshirani's prototype filter

In [3], Reid and Tibshirani introduce a prototype filter. They choose a prototype for each group of features and then use the knockoff filter to select among these prototypes, which performs group selection. Specifically, the method consists of the following steps. First, cluster the columns of $X$ into $K$ groups $\{C_1, \dots, C_K\}$. Then split the data by rows into two (roughly) equal parts,
\[
y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \end{pmatrix}, \qquad X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}.
\]
Choose a prototype for each cluster via the maximal marginal correlation, using only the first part of the data $y^{(1)}, X^{(1)}$. This generates the prototype set $\hat P$. Next, form a knockoff matrix $\tilde X^{(2)}_{\hat P}$ from $X^{(2)}_{\hat P}$ and perform the knockoff filter using $y^{(2)}$ and $[X^{(2)}_{\hat P}\ \tilde X^{(2)}_{\hat P}]$. Finally, group $C_i$ is selected if and only if $X^{(2)}_{\hat P_i}$ is chosen in the filter process.

This method satisfies the exchangeability property, and the authors establish group FDR control based on a super-martingale argument similar to that in [3]. We point out that this method does not benefit much from the group structure. Assume that $X_i, X_j$ are in the same group and $\langle X_i, X_j \rangle \ge 1 - \delta$. For this pair, we define a unit vector $v$ by $v \triangleq (e_i - e_j)/\sqrt{2}$, where $\{e_i\}_{i \le p}$ is the standard orthonormal basis of $R^p$. From the knockoff constraint $\mathrm{diag}(s) \preceq 2\Sigma$, we have $v^T \mathrm{diag}(s) v \le 2 v^T X^T X v$, which further implies
\[
\min(s_i, s_j) \le v^T \mathrm{diag}(s) v \le 2 v^T X^T X v = 2 \| X v \|_2^2 = \| X_i - X_j \|_2^2 \le 2\delta. \tag{34}
\]
If the within-group correlation is strong, $\delta$ is small and the inequality above implies that either $s_i$ or $s_j$ is small. Hence, the power of this method may be limited for strongly correlated features. Our numerical results in Sections 3.4 and 3.5 confirm this limitation.

3.2 Dai-Barber's group knockoff filter

In [6], Dai and Barber investigate a group-wise knockoff filter, which is a generalization of the knockoff filter. Assume that the columns of $X$ can be divided into $k$ groups $\{X_{G_1}, X_{G_2}, \dots, X_{G_k}\}$. The authors construct the group knockoff matrix according to
\[
\tilde X^T \tilde X = X^T X, \qquad \tilde X^T X = \Sigma - S, \qquad \Sigma = X^T X,
\]
where $S$ is group-block-diagonal, i.e. $S_{G_i, G_j} = 0$ for any two distinct groups $i \neq j$. Then let $S = \mathrm{diag}(S_1, S_2, \dots, S_k)$ with $S_i = \gamma \Sigma_{G_i, G_i} = \gamma X_{G_i}^T X_{G_i}$, $i = 1, 2, \dots, k$. The constraint $S \preceq 2\Sigma$ implies
\[
\gamma \cdot \mathrm{diag}(\Sigma_{G_1, G_1}, \Sigma_{G_2, G_2}, \dots, \Sigma_{G_k, G_k}) = S \preceq 2\Sigma.
\]
In order to maximize the difference between $X$ and $\tilde X$, $\gamma$ is chosen as large as possible: $\gamma = \min\{1, 2\lambda_{\min}(D \Sigma D)\}$, where $D = \mathrm{diag}(\Sigma^{-1/2}_{G_1, G_1}, \Sigma^{-1/2}_{G_2, G_2}, \dots, \Sigma^{-1/2}_{G_k, G_k})$. The group-wise statistic introduced in [6] can be obtained after the construction of the group knockoff matrix. The construction above guarantees group-wise exchangeability. Finally, group FDR control, i.e.
\[
\mathrm{FDR}_{group} \triangleq E\left[ \frac{\#\{ i : \beta_{G_i} = 0,\ i \in \hat S \}}{|\hat S| \vee 1} \right] \le q,
\]
is a result of group-wise exchangeability. Here $\hat S = \{ j : W_j \ge T \}$ is the set of selected groups for a chosen group statistic $W$.

3.3 PCA Reformulation

Assume that $X$ can be clustered into $k$ groups $X = (X_{C_1}, X_{C_2}, \dots, X_{C_k})$ in such a way that the within-group correlation is relatively strong while the between-group correlation is relatively weak. First, we apply the singular value decomposition (SVD) to decompose the feature vectors within each group into
\[
X_{C_i} = U_{C_i} D_i V_i^T, \qquad U_{C_i} \in O^{n \times c_i},\ D_i \in R^{c_i \times c_i},\ V_i \in O^{c_i \times c_i},\ c_i = |C_i|.
\]
Then we reformulate the linear model as follows:


More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

EE 381V: Large Scale Optimization Fall Lecture 24 April 11

EE 381V: Large Scale Optimization Fall Lecture 24 April 11 EE 381V: Large Scale Optimization Fall 2012 Lecture 24 April 11 Lecturer: Caramanis & Sanghavi Scribe: Tao Huang 24.1 Review In past classes, we studied the problem of sparsity. Sparsity problem is that

More information

Preprocessing & dimensionality reduction

Preprocessing & dimensionality reduction Introduction to Data Mining Preprocessing & dimensionality reduction CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Dimensionality reduction Yale - Fall 2016

More information

DS-GA 1002 Lecture notes 10 November 23, Linear models

DS-GA 1002 Lecture notes 10 November 23, Linear models DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis Massimiliano Pontil 1 Today s plan SVD and principal component analysis (PCA) Connection

More information

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington Mingon Kang, Ph.D. Computer Science, Kennesaw State University Problems

More information

Machine Learning. Regression basics. Marc Toussaint University of Stuttgart Summer 2015

Machine Learning. Regression basics. Marc Toussaint University of Stuttgart Summer 2015 Machine Learning Regression basics Linear regression, non-linear features (polynomial, RBFs, piece-wise), regularization, cross validation, Ridge/Lasso, kernel trick Marc Toussaint University of Stuttgart

More information

Sparse representation classification and positive L1 minimization

Sparse representation classification and positive L1 minimization Sparse representation classification and positive L1 minimization Cencheng Shen Joint Work with Li Chen, Carey E. Priebe Applied Mathematics and Statistics Johns Hopkins University, August 5, 2014 Cencheng

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

SVD, PCA & Preprocessing

SVD, PCA & Preprocessing Chapter 1 SVD, PCA & Preprocessing Part 2: Pre-processing and selecting the rank Pre-processing Skillicorn chapter 3.1 2 Why pre-process? Consider matrix of weather data Monthly temperatures in degrees

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Principal Component Analysis

Principal Component Analysis I.T. Jolliffe Principal Component Analysis Second Edition With 28 Illustrations Springer Contents Preface to the Second Edition Preface to the First Edition Acknowledgments List of Figures List of Tables

More information

The dual simplex method with bounds

The dual simplex method with bounds The dual simplex method with bounds Linear programming basis. Let a linear programming problem be given by min s.t. c T x Ax = b x R n, (P) where we assume A R m n to be full row rank (we will see in the

More information

Sparse principal component analysis via regularized low rank matrix approximation

Sparse principal component analysis via regularized low rank matrix approximation Journal of Multivariate Analysis 99 (2008) 1015 1034 www.elsevier.com/locate/jmva Sparse principal component analysis via regularized low rank matrix approximation Haipeng Shen a,, Jianhua Z. Huang b a

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation

More information

Sparse orthogonal factor analysis

Sparse orthogonal factor analysis Sparse orthogonal factor analysis Kohei Adachi and Nickolay T. Trendafilov Abstract A sparse orthogonal factor analysis procedure is proposed for estimating the optimal solution with sparse loadings. In

More information

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010 Department of Biostatistics Department of Statistics University of Kentucky August 2, 2010 Joint work with Jian Huang, Shuangge Ma, and Cun-Hui Zhang Penalized regression methods Penalized methods have

More information

Lecture 8. Principal Component Analysis. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 13, 2016

Lecture 8. Principal Component Analysis. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 13, 2016 Lecture 8 Principal Component Analysis Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza December 13, 2016 Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 1 / 31 Outline 1 Eigen

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods.

Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods. TheThalesians Itiseasyforphilosopherstoberichiftheychoose Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods Ivan Zhdankin

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

arxiv: v3 [math.oc] 19 Oct 2017

arxiv: v3 [math.oc] 19 Oct 2017 Gradient descent with nonconvex constraints: local concavity determines convergence Rina Foygel Barber and Wooseok Ha arxiv:703.07755v3 [math.oc] 9 Oct 207 0.7.7 Abstract Many problems in high-dimensional

More information

Financial Econometrics

Financial Econometrics Material : solution Class : Teacher(s) : zacharias psaradakis, marian vavra Example 1.1: Consider the linear regression model y Xβ + u, (1) where y is a (n 1) vector of observations on the dependent variable,

More information

Stochastic Design Criteria in Linear Models

Stochastic Design Criteria in Linear Models AUSTRIAN JOURNAL OF STATISTICS Volume 34 (2005), Number 2, 211 223 Stochastic Design Criteria in Linear Models Alexander Zaigraev N. Copernicus University, Toruń, Poland Abstract: Within the framework

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

Applied Numerical Linear Algebra. Lecture 8

Applied Numerical Linear Algebra. Lecture 8 Applied Numerical Linear Algebra. Lecture 8 1/ 45 Perturbation Theory for the Least Squares Problem When A is not square, we define its condition number with respect to the 2-norm to be k 2 (A) σ max (A)/σ

More information

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study

More information

An Introduction to Sparse Approximation

An Introduction to Sparse Approximation An Introduction to Sparse Approximation Anna C. Gilbert Department of Mathematics University of Michigan Basic image/signal/data compression: transform coding Approximate signals sparsely Compress images,

More information

https://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Inverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1

Inverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1 Inverse of a Square Matrix For an N N square matrix A, the inverse of A, 1 A, exists if and only if A is of full rank, i.e., if and only if no column of A is a linear combination 1 of the others. A is

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Approximate Principal Components Analysis of Large Data Sets

Approximate Principal Components Analysis of Large Data Sets Approximate Principal Components Analysis of Large Data Sets Daniel J. McDonald Department of Statistics Indiana University mypage.iu.edu/ dajmcdon April 27, 2016 Approximation-Regularization for Analysis

More information

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear

More information

IV. Matrix Approximation using Least-Squares

IV. Matrix Approximation using Least-Squares IV. Matrix Approximation using Least-Squares The SVD and Matrix Approximation We begin with the following fundamental question. Let A be an M N matrix with rank R. What is the closest matrix to A that

More information

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015 Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

University of Luxembourg. Master in Mathematics. Student project. Compressed sensing. Supervisor: Prof. I. Nourdin. Author: Lucien May

University of Luxembourg. Master in Mathematics. Student project. Compressed sensing. Supervisor: Prof. I. Nourdin. Author: Lucien May University of Luxembourg Master in Mathematics Student project Compressed sensing Author: Lucien May Supervisor: Prof. I. Nourdin Winter semester 2014 1 Introduction Let us consider an s-sparse vector

More information

Regression. Oscar García

Regression. Oscar García Regression Oscar García Regression methods are fundamental in Forest Mensuration For a more concise and general presentation, we shall first review some matrix concepts 1 Matrices An order n m matrix is

More information

PANEL DATA RANDOM AND FIXED EFFECTS MODEL. Professor Menelaos Karanasos. December Panel Data (Institute) PANEL DATA December / 1

PANEL DATA RANDOM AND FIXED EFFECTS MODEL. Professor Menelaos Karanasos. December Panel Data (Institute) PANEL DATA December / 1 PANEL DATA RANDOM AND FIXED EFFECTS MODEL Professor Menelaos Karanasos December 2011 PANEL DATA Notation y it is the value of the dependent variable for cross-section unit i at time t where i = 1,...,

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Lecture 25: November 27

Lecture 25: November 27 10-725: Optimization Fall 2012 Lecture 25: November 27 Lecturer: Ryan Tibshirani Scribes: Matt Wytock, Supreeth Achar Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

Lecture 12 April 25, 2018

Lecture 12 April 25, 2018 Stats 300C: Theory of Statistics Spring 2018 Lecture 12 April 25, 2018 Prof. Emmanuel Candes Scribe: Emmanuel Candes, Chenyang Zhong 1 Outline Agenda: The Knockoffs Framework 1. The Knockoffs Framework

More information

Linear Programming Redux

Linear Programming Redux Linear Programming Redux Jim Bremer May 12, 2008 The purpose of these notes is to review the basics of linear programming and the simplex method in a clear, concise, and comprehensive way. The book contains

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

Chapter 6: Orthogonality

Chapter 6: Orthogonality Chapter 6: Orthogonality (Last Updated: November 7, 7) These notes are derived primarily from Linear Algebra and its applications by David Lay (4ed). A few theorems have been moved around.. Inner products

More information

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

are Banach algebras. f(x)g(x) max Example 7.4. Similarly, A = L and A = l with the pointwise multiplication

are Banach algebras. f(x)g(x) max Example 7.4. Similarly, A = L and A = l with the pointwise multiplication 7. Banach algebras Definition 7.1. A is called a Banach algebra (with unit) if: (1) A is a Banach space; (2) There is a multiplication A A A that has the following properties: (xy)z = x(yz), (x + y)z =

More information