arxiv: v1 [stat.me] 11 Jun 2017


Some Analysis of the Knockoff Filter and its Variants

Jiajie Chen, Anthony Hou, Thomas Y. Hou

June 2017

Abstract

In many applications, we need to study a linear regression model that consists of a response variable and a large number of potential explanatory variables and determine which variables are truly associated with the response. In 2015, Barber and Candès introduced a new variable selection procedure called the knockoff filter to control the false discovery rate (FDR) and proved that this method achieves exact FDR control. In this paper, we provide some analysis of the knockoff filter and its variants. Based on our analysis, we propose a PCA prototype group selection filter that has exact group FDR control and several advantages over existing group selection methods for strongly correlated features. Another contribution is that we propose a new noise estimator that can be incorporated into the knockoff statistic from a penalized method without violating the exchangeability property. Our analysis also reveals that some knockoff statistics, including the Lasso path and the marginal correlation statistics, suffer from the alternating sign effect. To overcome this deficiency, we introduce the notion of a good statistic and propose several alternative statistics that take advantage of the good statistic property. Finally, we present a number of numerical experiments to demonstrate the effectiveness of our methods and confirm our analysis.

1 Introduction

In many scientific endeavors, we need to determine from a response variable together with a large number of potential explanatory variables which variables are truly associated with the response. For such a study to be meaningful, the discoveries must be true and replicable. Thus it is highly desirable to obtain exact control of the false discovery rate (FDR) within a certain prescribed level.
In [3], Barber and Candès introduce a new variable selection procedure called the knockoff filter to control the FDR for a linear model. This method achieves exact FDR control in finite sample settings. One important property of this method is that its FDR guarantee holds regardless of the design or covariates, the number of variables in the model, and the amplitudes of the unknown regression coefficients. Moreover, it does not require any knowledge of the noise level. A key observation is that by constructing knockoff variables that mimic the correlation structure found within the existing variables one can obtain accurate FDR control. The method is very general and flexible. It can be applied to a number of statistics and has far more power (the proportion of true signals being discovered) than existing selection rules when the proportion of null variables is high.

Author affiliations: School of Mathematical Sciences, Peking University; ciaie@pku.edu.cn. Department of Statistics, Harvard University; ahou@college.harvard.edu. Applied and Computational Mathematics, Caltech; hou@cms.caltech.edu.

1.1 A brief review of the knockoff filter

Before we introduce the main results of our paper, we first provide a brief overview of the knockoff filter. Consider the following linear regression model $y = X\beta + \epsilon$, where the feature matrix $X$ is an $n \times p$ ($n \ge 2p$) matrix with full rank, its columns normalized to be unit vectors in the $\ell_2$ norm,

and $\epsilon$ is Gaussian noise with distribution $N(0, \sigma^2 I_n)$. The knockoff filter begins with the construction of a knockoff matrix $\tilde X$ that obeys

$$\tilde X^T \tilde X = X^T X, \qquad X^T \tilde X = X^T X - \mathrm{diag}(s), \tag{1}$$

where $s_i \in [0,1]$. Positive semidefiniteness of the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ requires

$$\mathrm{diag}(s) \preceq 2 X^T X. \tag{2}$$

The first condition in (1) ensures that $\tilde X$ has the same covariance structure as the original feature matrix $X$. The second condition in (1) guarantees that the correlations between distinct original and knockoff variables are the same as those between the originals. To ensure that the method has good statistical power to detect signals, we should choose $s$ as large as possible to maximize the difference between $X$ and its knockoff $\tilde X$. These two conditions are critical in guaranteeing that the distribution of a knockoff statistic is invariant when a particular pair $X_j, \tilde X_j$ is swapped. This is called the exchangeability property in [3].

The next step is to calculate a statistic $W_j$ for each pair $X_j, \tilde X_j$ using the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the marginal correlation $[X\ \tilde X]^T y$. The final step is to run the knockoff (knockoff+) selection procedure at level $q$:

$$T \triangleq \min\left\{ t > 0 : \frac{0/1 + \#\{j : W_j \le -t\}}{\#\{j : W_j \ge t\} \vee 1} \le q \right\}, \qquad \hat S \triangleq \{ j : W_j \ge T \}, \tag{3}$$

where the constant $0/1$ in the numerator is $0$ for the knockoff procedure and $1$ for the knockoff+ procedure.

There are several ways to construct a statistic $W_j$. Among them, the Lasso path statistic is discussed in detail in [3]. It first fits a Lasso regression of $y$ on $[X\ \tilde X]$ for a list of regularizing parameters $\lambda$ in descending order and then calculates the first $\lambda$ at which a variable enters the model, i.e. $Z_j \triangleq \sup\{\lambda : \hat\beta_j(\lambda) \neq 0\}$ for feature $X_j$ and $\tilde Z_j \triangleq \sup\{\lambda : \tilde\beta_j(\lambda) \neq 0\}$ for its knockoff $\tilde X_j$. The Lasso path signed max statistic is defined as $W_j = \max(Z_j, \tilde Z_j) \cdot \mathrm{sign}(Z_j - \tilde Z_j)$.

The main result in [3] is that the knockoff procedure and the knockoff+ procedure have exact control of the mFDR and the FDR, respectively:

$$\mathrm{mFDR} \triangleq E\left[ \frac{\#\{ j \in \hat S : \beta_j = 0 \}}{\#\{ j \in \hat S \} + q^{-1}} \right] \le q, \qquad \mathrm{FDR} \triangleq E\left[ \frac{\#\{ j \in \hat S : \beta_j = 0 \}}{\#\{ j \in \hat S \} \vee 1} \right] \le q.$$

A knockoff filter for high-dimensional selective inference and model-free knockoffs have been recently established in [4,5].
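The selection rule (3) can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the paper: the function names `knockoff_threshold` and `knockoff_select` are hypothetical, and the search is restricted to the nonzero magnitudes $|W_j|$, which is assumed here to be the set of candidate thresholds.

```python
import numpy as np

def knockoff_threshold(W, q, plus=True):
    """Threshold T of rule (3); plus=True adds 1 to the numerator (knockoff+)."""
    W = np.asarray(W, dtype=float)
    offset = 1.0 if plus else 0.0
    # Assumed candidate set: the distinct nonzero magnitudes |W_j|, ascending.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        negatives = np.sum(W <= -t)
        positives = max(np.sum(W >= t), 1)   # guard against an empty selection
        if (offset + negatives) / positives <= q:
            return float(t)
    return np.inf  # no admissible threshold: nothing is selected

def knockoff_select(W, q, plus=True):
    """Indices j with W_j >= T."""
    T = knockoff_threshold(W, q, plus)
    return np.flatnonzero(np.asarray(W, dtype=float) >= T)
```

Because the numerator counts the statistics at or below $-t$, a few large negative $W_j$ push the threshold up and shrink the selected set, which is the effect studied in the rest of the paper.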
This line of research has inspired a number of follow-up works [6,8,3 5].

1.2 Alternating sign effect and the notion of a good statistic

In this paper, we perform some analysis of the knockoff filter and some knockoff statistics, including the Lasso path and the marginal correlation statistics. Our analysis shows that the marginal correlation statistic and the Lasso path statistic suffer from the so-called alternating sign effect for certain design matrices whose features are only weakly correlated. The alternating sign effect refers to the existence of a feature $j$ that satisfies $\mathrm{sign}(\beta_j) \neq \mathrm{sign}(X_j^T r_\lambda)$, where $r_\lambda = y - X_E \hat\beta_E$ is the residue, $j \notin E$, and $E = \{ j : \hat\beta_j(\lambda) \neq 0 \}$ is the active set, $\lambda$ being the regularizing parameter in front of the $\ell_1$ norm in the Lasso path method. In Section 2, we describe a general mechanism for generating the alternating sign effect for a family of design matrices. We show that the alternating sign effect can lead to large negative $W_j$ for strong features that are only weakly correlated. This limitation reduces the power of the knockoff filter for these statistics.

To alleviate this difficulty, we introduce the notion of a good statistic. Specifically, a knockoff statistic $W$ is called a good statistic if it satisfies the positivity of non-null features: for a fixed noise $\epsilon$, $W_j \ge 0$ if the signal amplitude $|\beta_j|$ is large enough relative to noise. Based on our analysis, we propose an alternative method, which we call the half penalized method. This method penalizes only $\hat\beta - \tilde\beta$ instead of penalizing both parameters,

$$\min_{\hat\beta, \tilde\beta}\ \frac{1}{2} \| y - X\hat\beta - \tilde X \tilde\beta \|_2^2 + P(\hat\beta - \tilde\beta),$$

hence the name half penalized method. This method takes full advantage of the property of the knockoff filter. In the case when $P(x) = \lambda \|x\|_1$ (or $P(x) = -\lambda \|x\|_1$), we obtain the half Lasso method (or the negative half Lasso), which reduces the $2p$-dimensional $\ell_1$ optimization problem into $p$ one-dimensional optimization problems, which can be solved explicitly using the soft threshold operator. We further prove that this new statistic and the least squares statistic satisfy the good statistic property and do not suffer from the alternating sign effect.

To gain some understanding of the performance of different statistics, we investigate a variety of knockoff statistics numerically, including the least squares, half Lasso, forward selection, orthogonal matching pursuit (OMP), and Lasso path statistics. In our simulations, the forward selection, the OMP and the Lasso path statistics have similar power and computational cost. However, the alternating sign test in Section 2.4.2 shows that the Lasso path and the forward selection statistics suffer from the alternating sign effect and are less robust than the OMP statistic. Our simulation also shows that the power of the OMP is greater than that of the least squares and the (negative) half Lasso in the sparse case (the proportion of the null features is large). The improvement of the OMP statistic over the least squares and the negative half Lasso statistics is not as significant in the non-sparse case. The OMP statistic seems to be the most robust among the six statistics that we consider. On the other hand, the OMP and other path statistics are computationally much more expensive. The computational cost of least squares and of the half Lasso is $O(np^2)$, while that of the Lasso path and the OMP statistic is $O(np^3)$.
If $p \gg 1$, the advantage of the OMP statistic over the negative half Lasso diminishes due to the increase of computational cost.

1.3 Extension of the sufficiency property and noise level estimate

In [3], the authors introduce the sufficiency property of a statistic $W$, which states that $W$ depends only on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the feature-response product $[X\ \tilde X]^T y$. We observe that in the definition of the sufficiency property, only part of the information of the response variable $y$, i.e. $[X\ \tilde X]^T y$, is utilized. By using the remaining information of $y$ in the knockoff filter, we can incorporate the noise estimate into the statistic without violating the exchangeability property. More specifically, we generalize the sufficiency property by requiring that $W$ depends only on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the feature-response product $[X\ \tilde X\ U]^T y$ for any orthonormal matrix $U$ that satisfies $U^T [X\ \tilde X] = 0$. Moreover, we prove that if a statistic obeys the generalized sufficiency property and the antisymmetry property, then it satisfies the exchangeability property.

Inspired by the generalized sufficiency property, we propose to use the noise level $\sigma$ as a reference for the regularizing parameter and estimate the noise level as follows:

$$\hat\sigma \triangleq \| U^T y \|_2 / \sqrt{n - 2p},$$

where $U \in R^{n \times (n-2p)}$ is an orthonormal matrix satisfying $U^T [X\ \tilde X] = 0$. Note that $U^T y = U^T \epsilon$ because $U^T X = 0$, so $\hat\sigma$ does not depend on $\beta$. Since $\hat\sigma$ depends only on $U^T y$, we can define a knockoff statistic $W$ that incorporates $\hat\sigma$ and satisfies the generalized sufficiency property. Consequently, we can use the estimated noise level in the knockoff filter without violating the exchangeability property and maintain FDR control.

1.4 A PCA prototype knockoff filter

We also introduce a PCA prototype knockoff filter for group selection that has exact group FDR control (defined in Theorem 3.) for strongly correlated features. More specifically, assume that $X$ can be clustered into $k$ groups $X = (X_{C_1}, X_{C_2}, \ldots, X_{C_k})$ in a way such that within-group correlation is relatively strong but between-group correlation is relatively weak.
We first use singular value decomposition (SVD) to decompose the feature vectors within each group, $X_{C_i} = U_{C_i} D_i V_i^T$, and then reformulate the linear model as follows:

$$y = \sum_{i=1}^{k} X_{C_i} \beta_{C_i} + \epsilon = \sum_{i=1}^{k} U_{C_i} \alpha_{C_i} + \epsilon.$$
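The group-wise reformulation above can be checked numerically. The following Python sketch is illustrative only: the helper names `group_svd`, `reformulate` and `prototypes` are hypothetical, groups are assumed to be given as lists of column indices, and $\alpha_{C_i} = D_i V_i^T \beta_{C_i}$ is the change of variables implied by the SVD.

```python
import numpy as np

def group_svd(X, groups):
    """Thin SVD X_C = U_C diag(d_C) V_C^T for each group C of column indices."""
    return [np.linalg.svd(X[:, C], full_matrices=False) for C in groups]

def reformulate(svds, groups, beta):
    """alpha_C = diag(d_C) V_C^T beta_C, so that U_C alpha_C = X_C beta_C."""
    return [(d[:, None] * Vt) @ beta[C] for (U, d, Vt), C in zip(svds, groups)]

def prototypes(svds):
    """First left singular vector of every group, stacked as columns."""
    return np.column_stack([U[:, 0] for U, d, Vt in svds])
```

With $k$ groups, `prototypes` returns an $n \times k$ matrix, one unit-norm column per group.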

We aim to pick out non-null groups $\beta_{C_i} \neq 0$ with exact group FDR control. To capture most of the information and reduce redundant features in each group, we choose the first principal component $U_{C_i,1}$ as a prototype of this group and then construct knockoff pairs on the prototype set $U_P = (U_{C_1,1}, U_{C_2,1}, \ldots, U_{C_k,1})$, $|P| = k$. Specifically, we denote by $Q = \{1,2,\ldots,p\} \setminus P$ the remaining part, $U = [U_P, U_Q]$, and then construct the knockoff matrix $\tilde U = [\tilde U_P, U_Q]$ as follows (we choose $\tilde U_Q = U_Q$ since we do not select features in $U_Q$):

$$\tilde U^T \tilde U = U^T U, \qquad U^T U - \tilde U^T U = \mathrm{diag}(s_P, 0),$$

where we apply the localized knockoff construction from [5] to increase the amplitude of $s_P$. Inspired by [3], we implement the standard knockoff procedure on $y$ and $[U_P, \tilde U_P]$ and calculate the knockoff statistic $W_P = \{W_{C_1,1}, W_{C_2,1}, \ldots, W_{C_k,1}\}$. Finally, we run the knockoff filter on $W_{C_j,1}$, $1 \le j \le k$, to select groups. Moreover, we can prove that the PCA prototype knockoff filter has the same group FDR control for the original feature matrix as in Dai-Barber's group knockoff filter [6].

Compared to Dai-Barber's group knockoff filter, our PCA method achieves greater computational efficiency since the augmented design matrix in our method is $n \times 2k$, which is much smaller than $n \times 2p$ in Dai-Barber's method if $p \gg k$. Since the most significant computational cost in implementing the knockoff filter with a path statistic comes from regressing $y$ on the augmented design matrix in an iterative manner, a smaller augmented design matrix leads to greater computational efficiency. Note that the group statistic for group $C_j$ is $W_{C_j,1}$ and is different from that in Dai-Barber's group knockoff filter [6].

1.5 Comparison with other existing works

There are several recent works that have an objective similar to ours. Our work is inspired by Barber and Candès' knockoff filter as well as by Reid-Tibshirani's prototype knockoff filter and Dai-Barber's group knockoff filter [3,4,6,3].
We show in Section 3.4 that our PCA prototype filter has more power than Reid-Tibshirani's prototype knockoff filter. When the between-group correlation is zero and within-group correlation is strong, we analyze why the PCA prototype filter performs much better than Reid-Tibshirani's prototype filter. We also show that the performance of the PCA prototype filter is comparable to that of Dai-Barber's group knockoff filter, but with greater computational efficiency if $p \gg k$. More details on these two methods and their comparison with ours can be found in Section 3.

We note that a localized knockoff filter has been proposed by Xu et al. in [5] in which they construct a modified knockoff matrix that has FDR control for a subset of the feature vectors. Although this localized knockoff filter guarantees FDR control, it still suffers a loss in power for strongly correlated features.

There are several feature selection methods that offer some level of FDR control, see e.g. [,, 7, 9 ]. Refer to [3] for a thorough comparison between the knockoff filter and these approaches. This paper focuses on the knockoff filter and does not consider these other approaches.

The rest of the paper is organized as follows. In Section 2, we analyze the alternating sign effect for the Lasso path, the marginal correlation, and the forward selection statistics. We also introduce the notion of a good statistic and show that the least squares method and the half penalized method produce good statistics. Moreover, we generalize the sufficiency property of a knockoff statistic and propose a new method to estimate noise level. In Section 3, we introduce our PCA prototype filter for highly correlated features. We compare it to other group knockoff filters and provide numerical experiments to demonstrate the performance of various methods.

2 Alternating sign effect, good statistics, a half penalized method

In this section, we perform some analysis of the knockoff filter.
Our analysis reveals some limitations of several statistics associated with the knockoff filter. Based on our understanding of these limitations, we propose some modifications of the knockoff filter to alleviate these difficulties.

2.1 Construction of the knockoff matrix

First, we review the construction of the knockoff matrix. In [3], the authors give a simple construction of the knockoff matrix $\tilde X$. It may seem that alternative constructions of $\tilde X$ exist. In the following proposition, we show that, given $s_i$, different constructions are essentially the same.

Proposition 2.1. Let $\Sigma = X^T X$. Then

$$[X\ \tilde X]^T [X\ \tilde X] = \begin{pmatrix} \Sigma & \Sigma - \mathrm{diag}(s) \\ \Sigma - \mathrm{diag}(s) & \Sigma \end{pmatrix} \tag{4}$$

if and only if

$$\tilde X = X (I - \Sigma^{-1} \mathrm{diag}(s)) + U C, \tag{5}$$

where $U \in R^{n \times (n-p)}$ is an orthonormal matrix whose column space is orthogonal to that of $X$, i.e. $U^T X = 0$, and $C \in R^{(n-p) \times p}$ satisfies $C^T C = 2\,\mathrm{diag}(s) - \mathrm{diag}(s) \Sigma^{-1} \mathrm{diag}(s)$.

We defer the proof of the above proposition to the Appendix. The knockoff matrix $\tilde X$ presented in [3] has the same form as (5) except that $U \in R^{n \times p}$ and $C \in R^{p \times p}$ in their formula. Using Proposition 2.1, we can reproduce the result in [3] by choosing an orthonormal matrix $U = (U_1\ U_2) \in R^{n \times (n-p)}$, $U_1 \in R^{n \times p}$, $U_2 \in R^{n \times (n-2p)}$, whose column space is orthogonal to that of $X$, and

$$C = \begin{pmatrix} C_1 \\ 0 \end{pmatrix} \in R^{(n-p) \times p}, \quad C_1 \in R^{p \times p}, \quad C^T C = C_1^T C_1 = 2\,\mathrm{diag}(s) - \mathrm{diag}(s) \Sigma^{-1} \mathrm{diag}(s).$$

The identity $UC = (U_1\ U_2) \begin{pmatrix} C_1 \\ 0 \end{pmatrix} = U_1 C_1$ and Proposition 2.1 reproduce $\tilde X$ in [3].

2.2 Alternating sign effect for the marginal correlation statistic

In this section, we discuss the alternating sign effect for certain statistics and propose alternative statistics that do not suffer from this effect. According to (3), the knockoff filter threshold $T$ is determined by the ratio of the numbers of large negative and large positive $W_j$. Using this threshold, the knockoff filter selects large positive statistics $W_j \ge T$ and rejects all negative $W_j$. For the knockoff filter to achieve its power, $W_j$ should be large and positive for $\beta_j \neq 0$ so that the knockoff filter can pick out such features. Large negative $W_j$ result in a large $T$ and fewer selected features, which leads to a decrease in power. Our analysis shows that in some feature designs, certain knockoff statistics may yield large negative $W_j$ for non-null features, which would decrease the power of the knockoff filter.
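As a numerical check of the construction (5), the sketch below builds a knockoff matrix and leaves the Gram conditions to be verified by the caller. It is a minimal illustration under stated assumptions, not the authors' implementation: the function name `knockoff_from_formula` is hypothetical, $s$ is taken to be the equi-correlated choice $s_j = \min(2\lambda_{\min}(\Sigma), 1)$, and $C$ is taken as $(C_1; 0)$ with $C_1$ the symmetric square root of $2\,\mathrm{diag}(s) - \mathrm{diag}(s)\Sigma^{-1}\mathrm{diag}(s)$.

```python
import numpy as np

def knockoff_from_formula(X, seed=0):
    """Knockoff matrix via X(I - Sigma^{-1} diag(s)) + U1 C1 (assumes n >= 2p)."""
    n, p = X.shape
    if n < 2 * p:
        raise ValueError("need n >= 2p")
    rng = np.random.default_rng(seed)
    Sigma = X.T @ X
    lam_min = np.linalg.eigvalsh(Sigma)[0]
    s = np.full(p, min(2.0 * lam_min, 1.0))      # assumed equi-correlated choice
    S = np.diag(s)
    Sigma_inv_S = np.linalg.solve(Sigma, S)      # Sigma^{-1} diag(s)
    A = 2.0 * S - S @ Sigma_inv_S                # C^T C required by (5)
    w, V = np.linalg.eigh((A + A.T) / 2.0)
    C1 = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T   # symmetric square root
    # p orthonormal columns orthogonal to col(X): QR of [X, random block].
    Q, _ = np.linalg.qr(np.hstack([X, rng.standard_normal((n, p))]))
    U1 = Q[:, p:2 * p]
    X_knock = X @ (np.eye(p) - Sigma_inv_S) + U1 @ C1
    return X_knock, s
```

For a design with unit-norm columns, the returned matrix should satisfy $\tilde X^T \tilde X = X^T X$ and $X^T \tilde X = X^T X - \mathrm{diag}(s)$ up to floating-point error.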
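We use the marginal correlation statistic to illustrate the alternating sign effect. The following example shows that the marginal correlation statistic could lose its power even for strong signals.

Design matrix and signal amplitude

Let $A, B$ be a partition of $\{1,2,\ldots,p\}$, i.e. $A \cup B = \{1,2,\ldots,p\}$, $A \cap B = \emptyset$. We choose a feature matrix $X$ that satisfies $\langle X_i, X_j \rangle = \rho$ for $i \neq j$ if $i$ and $j$ belong to the same set $A$ or $B$, and $\langle X_i, X_j \rangle = -\rho$ if $i$ and $j$ belong to two different sets. A concrete example that satisfies the above design criterion is given as follows:

$$X \triangleq \begin{pmatrix} v^T \\ a I_p \\ 0 \end{pmatrix} \in R^{n \times p}, \quad \text{where } v \in R^p, \quad v_i = \begin{cases} \lambda a, & i \in A, \\ -\lambda a, & i \in B, \end{cases} \quad \lambda = \sqrt{\frac{\rho}{1-\rho}}, \tag{6}$$

and $a > 0$ is chosen so that the columns of $X$ have unit norm. We take $\rho = 1/2$ for simplicity. Once the knockoff matrix is constructed, we have the relation $\tilde X_i^T X_i = 1 - s_i$ with $s_i \in [0,1]$. The value of $s_i$ is not small because columns of $X$ are only weakly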

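correlated. Since the knockoff matrix is constructed without any knowledge of $y$ and the coefficient $\beta$, we can choose any $\beta$ after $\tilde X$ is constructed. Next, we take

$$\beta_i = \begin{cases} 0.9\, M / s_i, & i \in A,\ s_i \neq 0, \\ M / s_i, & i \in B,\ s_i \neq 0, \end{cases} \tag{7}$$

and $\beta_i = 0$ if $s_i = 0$, where $M$ is a parameter that is used to control the signal amplitude. In the following discussion, we set $M = 1$ and assume that the number of $s_i = 0$ is either $0$ or small.

Derivation of the marginal correlation statistic

Let $S_A, S_B$ be the sum of $\beta_i$ in group $A$, $B$, respectively, i.e. $S_A = \sum_{i \in A} \beta_i$, $S_B = \sum_{i \in B} \beta_i$. Assume that the noise level $\sigma$ is small compared to $M$, say $\sigma =$ .3 (otherwise, we can multiply all $\beta$ by a large constant), and $y = X\beta + \epsilon$, $\epsilon \sim N(0, \sigma^2 I_n)$. Under this setting, we first calculate the marginal correlation for $k \in A$ (the case for $B$ can be carried out similarly):

$$X_k^T y = \sum_{i \in A} X_k^T X_i \beta_i + \sum_{i \in B} X_k^T X_i \beta_i + X_k^T \epsilon = \frac{S_A + \beta_k - S_B}{2} + X_k^T \epsilon,$$

$$\tilde X_k^T y = \sum_{i \in A} \tilde X_k^T X_i \beta_i + \sum_{i \in B} \tilde X_k^T X_i \beta_i + \tilde X_k^T \epsilon = \frac{S_A + (1 - 2 s_k) \beta_k - S_B}{2} + \tilde X_k^T \epsilon.$$

Further, we assume that $|S_A - S_B|$ is large compared to all $\beta_k$ and $S_A - S_B > 0$ (this can be arranged by choosing different sizes for $A$ and $B$, with $|A|$ sufficiently larger than $|B|$). From the assumption that the noise level $\sigma$ is small compared to $M$, $\mathrm{sign}(X_k^T y)$, $k \in A$, depends on $\mathrm{sign}(S_A - S_B)$ and we have an explicit expression for $W_j$, $j \in A$, with large probability (the noise is too small to affect the sign): for $k \in A$,

$$\begin{aligned} W_k &= |X_k^T y| - |\tilde X_k^T y| = \left| \frac{S_A - S_B + \beta_k}{2} + X_k^T \epsilon \right| - \left| \frac{S_A - S_B + (1 - 2 s_k)\beta_k}{2} + \tilde X_k^T \epsilon \right| \\ &=_p \mathrm{sign}(S_A - S_B) \left[ \left( \frac{S_A - S_B + \beta_k}{2} + X_k^T \epsilon \right) - \left( \frac{S_A - S_B + (1 - 2 s_k)\beta_k}{2} + \tilde X_k^T \epsilon \right) \right] \\ &= \mathrm{sign}(S_A - S_B) \left( s_k \beta_k + (X_k - \tilde X_k)^T \epsilon \right), \end{aligned} \tag{8}$$

where we have used the notation $=_p$ to denote an identity that holds with large probability. By symmetry, $\mathrm{sign}(X_k^T y)$, $k \in B$, depends on $\mathrm{sign}(S_B - S_A)$ and

$$W_k =_p \mathrm{sign}(S_B - S_A) \left( s_k \beta_k + (X_k - \tilde X_k)^T \epsilon \right), \quad k \in B. \tag{9}$$

By using the signal amplitude defined in (7) and the assumption $S_A > S_B$, we have the expression

$$W_k =_p \begin{cases} 0.9 + (X_k - \tilde X_k)^T \epsilon, & k \in A,\ s_k \neq 0, \\ -1 - (X_k - \tilde X_k)^T \epsilon, & k \in B,\ s_k \neq 0. \end{cases} \tag{10}$$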
Since the noise level is small compared to the signal amplitude, the estimate above shows that $W_j$, $j \in A$, are approximately $0.9$ and $W_j$, $j \in B$, are approximately $-1$ with large probability.

Selection

If $T > 0.95$, the features selected by the knockoff filter are $\hat S = \{ j : W_j \ge T \} \subset \{ j : W_j \ge 0.95 \} = \emptyset$. In this case, no features will be selected. Now we consider the case of $T \le 0.95$. The definition of the threshold (3) implies that

$$q \ge \frac{0/1 + \#\{ W_j \le -T \}}{\#\{ W_j \ge T \}} = \frac{0/1 + |B|}{\#\{ W_j \ge T \}} \ge \frac{0/1 + |B|}{|A|}, \quad \text{i.e.} \quad q |A| \ge 0/1 + |B|. \tag{11}$$

If we further take $q|A| < |B|$, this would contradict (11) and thus $T$ must be greater than $0.95$. As a result, no features will be selected. Note that taking $q|A| < |B|$ does not contradict the previous assumption on $A$, $B$ that $|A| > \gamma |B|$ for a suitable constant $\gamma$, which guarantees $S_A > S_B$. If we assume that $|S_A - S_B|$ is large compared to all $\beta_k$, we conclude from (8) and (9) that all features in either $A$ or $B$ are not selected according to the knockoff procedure (only positive statistics will be selected). This example illustrates that the marginal correlation statistic cannot exploit the knockoff power due to the large negative $W_j$ for a significant number of the true features.

The mechanism for generating the alternating sign effect

Next, we describe a more general mechanism that could lead to the alternating sign effect. First of all, such a feature matrix can be clustered into two groups $A$ and $B$. Secondly, the features from the same group are positively correlated and those from different groups are negatively correlated, i.e. $\langle X_i, X_j \rangle > 0$ if $(i,j) \in A \times A$ or $B \times B$ and $\langle X_i, X_j \rangle < 0$ if $(i,j) \in A \times B$. Let $\tilde X$ be the knockoff matrix. Without loss of generality, we may assume that $\tilde X_j \neq X_j$, which implies that $s_j = 1 - X_j^T \tilde X_j \neq 0$.

To see why such a feature matrix may suffer from the alternating sign effect, we generate the signal $\beta$ by setting $\beta_i = M/s_i$. By definition, $(X_i - \tilde X_i)^T y = s_i \beta_i + (X_i - \tilde X_i)^T \epsilon \sim N(M, 2 s_i \sigma^2)$. Assume that the noise level $\sigma$ is small enough. If $X_i^T y < 0$, we obtain

$$W_i = |X_i^T y| - |\tilde X_i^T y| \le \tilde X_i^T y - X_i^T y \approx -M < 0$$

and thus the non-null feature $i$ is rejected by the knockoff filter. A similar result holds for $i \in B$. Next, we find out under what condition we have $X_i^T y < 0$. Denote $S_A(i) \triangleq |X_i^T (\sum_{j \in A} X_j \beta_j)|$ and $S_B(i) \triangleq |X_i^T (\sum_{j \in B} X_j \beta_j)|$. Using the correlation structure of $X$ and the definition of $S_A$, $S_B$, $\beta$, we have $X_i^T y = S_A(i) - S_B(i) + X_i^T \epsilon$ if $i \in A$ and $X_i^T y = S_B(i) - S_A(i) + X_i^T \epsilon$ if $i \in B$. One can interpret $S_A(i)$ as a weighted sum of $\beta_j$, $j \in A$, with weight $|X_i^T X_j|$. Similarly, $S_B(i)$ is a weighted sum of $\beta_j$, $j \in B$.
If the noise level $\sigma$ is small enough, $X_i^T X_j$ does not vary much, and the size of one group is larger than the size of the other, e.g. $|B| < |A|$, it is likely that $S_B(i) < S_A(i)$ for some $i \in B$. As a result, the features in group $B$ may not be picked out, which reduces the power. In the previous example, we construct a special example of $X$ that satisfies $\langle X_i, X_j \rangle = \pm 0.5$ and $|B| < |A|$. We define the signal $\beta$ in a similar way. Equation (10) justifies that the features in group $B$ are not selected by the knockoff filter. In Section 2.4.2, we construct another example to show that the Lasso path and the forward selection statistics suffer from the alternating sign effect.

Another mechanism for generating the alternating sign effect is when the columns of a design matrix $X$ are all positively correlated. In this case, we can apply the same argument as above by choosing the signal via $\beta_i = M/s_i$, $i \in A$, and $\beta_i = -M/s_i$, $i \in B$, where $(A, B)$ is a partition of $\{1,2,\ldots,p\}$. For these two types of design matrices, one needs to choose a statistic that will not suffer from the alternating sign effect.

Testing the alternating sign effect for the marginal correlation statistic

To confirm our previous analysis, we choose the group size of A, B to be A = 6, B = 4 with and 8 signals in each group, which corresponds to % sparsity. We draw the rows of $X$ from a multivariate normal distribution with mean $0$ and covariance matrix $\Sigma$, which satisfies $\Sigma_{ii} = 1$, $\Sigma_{ij} = \rho$ for $i \neq j$ in the same group, and $\Sigma_{ij} = -\rho$ for $i$, $j$ in different groups. We then normalize the columns. The correlation factor is ρ =.5, the noise level is σ =, and the signal amplitude is

$$\beta_i = \begin{cases} 0.9\, M / s_i, & i \in S^{tr} \cap A, \\ M / s_i, & i \in S^{tr} \cap B, \end{cases}$$

where $S^{tr}$ is the set of true signals. We assume that $s_i$ constructed by SDP is nonzero. Otherwise, we generate another design matrix $X \sim N(0, \Sigma)$ and then construct another group of $s_i$ by SDP.
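The two-group design used in this test can be generated as in the sketch below. It is an illustration only: the function name `two_group_design` and all argument values used with it are hypothetical choices, and the covariance is written as $\Sigma = (1-\rho) I + \rho\, v v^T$ with $v_i = +1$ on $A$ and $v_i = -1$ on $B$, which reproduces $\Sigma_{ii} = 1$ and $\Sigma_{ij} = \pm\rho$.

```python
import numpy as np

def two_group_design(n, size_a, size_b, rho, seed=0):
    """Rows drawn from N(0, Sigma), columns normalized to unit l2 norm."""
    rng = np.random.default_rng(seed)
    # Group indicator: +1 for the first size_a columns (A), -1 for the rest (B).
    v = np.concatenate([np.ones(size_a), -np.ones(size_b)])
    # Sigma_ii = 1, Sigma_ij = rho within a group, -rho across groups.
    Sigma = (1.0 - rho) * np.eye(v.size) + rho * np.outer(v, v)
    X = rng.multivariate_normal(np.zeros(v.size), Sigma, size=n)
    X /= np.linalg.norm(X, axis=0)
    return X, Sigma
```

Since $\Sigma$ is a positive multiple of the identity plus a rank-one positive semidefinite term, it is positive definite for any $0 \le \rho < 1$.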
To study the alternating sign effect, we compare the performance of the least squares statistic $W_j^{ls} = |\hat\beta_j^{ls}| - |\tilde\beta_j^{ls}|$ and the marginal correlation statistic $W_j^{mc} = |X_j^T y| - |\tilde X_j^T y|$ using the knockoff and the knockoff+ filters at the nominal FDR q = %. We then vary the signal parameter $M = 1, 2, 3, \ldots$ and repeat each experiment over multiple trials. The results are summarized in Table 1.
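The two statistics compared here can be computed as in the following sketch, which assumes that a valid knockoff matrix `Xk` has already been constructed. The function names are hypothetical, and the least squares fit uses NumPy's generic solver rather than any routine from the paper.

```python
import numpy as np

def marginal_correlation_stat(X, Xk, y):
    """W_j = |X_j^T y| - |Xk_j^T y|."""
    return np.abs(X.T @ y) - np.abs(Xk.T @ y)

def least_squares_stat(X, Xk, y):
    """W_j = |beta_hat_j| - |beta_tilde_j| from regressing y on [X, Xk]."""
    p = X.shape[1]
    coef, *_ = np.linalg.lstsq(np.hstack([X, Xk]), y, rcond=None)
    return np.abs(coef[:p]) - np.abs(coef[p:])
```

Both functions are antisymmetric: exchanging the roles of `X` and `Xk` flips the sign of every $W_j$.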

Columns: M; LS: FDR(%) (mFDR(%)); MC: FDR(%) (mFDR(%)); LS: knockoff+ power(%) (knockoff power(%)); MC: knockoff+ power(%) (knockoff power(%)).

9.3 (8.9).6 (.93) 46.4 (47.). (.73) 9.48 (9.46). (.) (93.95).3 (.47) (9.78). (.) (99.66). (.33) (9.6). (.) (99.98). (.3) 5. (.8). (.). (.). (.) (9.5). (.). (.). (.8) (9.4). (.). (.). (.6) 8.34 (.8). (.). (.). (.3) (9.79). (.). (.). (.) 9.76 (9.8). (.). (.). (.)

Table 1: Alternating sign effect of the marginal correlation statistic, nominal FDR q = %. LS: the least squares statistic, MC: the marginal correlation statistic.

We focus on the power of the two statistics. The results from Table 1 show that the marginal correlation statistic loses most of its power and can hardly discover any true signal while the least squares statistic maintains about % power in this test. Thus, the marginal correlation statistic suffers from the alternating sign effect, which is consistent with the analysis above.

2.3 Potential challenge of the path method statistics

In this subsection, we point out a potential challenge for the path method statistics. To demonstrate this, we first observe that the knockoff matrix properties imply

$$(X_i - \tilde X_i)^T y = (X_i - \tilde X_i)^T (X\beta + \epsilon) = s_i \beta_i + (X_i - \tilde X_i)^T \epsilon, \quad 1 \le i \le p.$$

The right hand side also appears in many path method statistics, including the Lasso path, the forward selection, and the orthogonal matching pursuit statistics. We now illustrate the potential difficulty that we may encounter for a path method statistic. After performing $l$ steps in one of the path methods (or at $\lambda$ for the Lasso path), we use $E$ to denote the set of features that have entered the model. We assume that $E$ does not include $X_j, \tilde X_j$ at the $l$th step, but at the next step either $X_j$ or $\tilde X_j$ will enter the model. After $l$ steps, the residue is $r_l = y - X_E \hat\beta_E$. Since $X_j, \tilde X_j \notin E$, we have $X_j^T X_i = \tilde X_j^T X_i$ for all $X_i \in E$. The same equality holds for $\tilde X_i \in E$.
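For $X_j, \tilde X_j$, their marginal correlation with $r_l$ determines which one of these two features will enter the model first at the $(l+1)$st step:

$$\begin{aligned} X_j^T r_l &= X_j^T (y - X_E \hat\beta_E) = X_j^T (X\beta - X_E \hat\beta_E) + X_j^T \epsilon, \\ \tilde X_j^T r_l &= \tilde X_j^T (y - X_E \hat\beta_E) = \tilde X_j^T (X\beta - X_E \hat\beta_E) + \tilde X_j^T \epsilon, \\ (X_j - \tilde X_j)^T r_l &= (X_j - \tilde X_j)^T y = s_j \beta_j + (X_j - \tilde X_j)^T \epsilon. \end{aligned}$$

Assume that the noise level is relatively small. If $\mathrm{sign}(\beta_j) \neq \mathrm{sign}(X_j^T (X\beta - X_E \hat\beta_E))$ and $|X_j^T (X\beta - X_E \hat\beta_E)| > s_j |\beta_j|$, then $\tilde X_j$ will enter the model at the $(l+1)$th step since

$$\begin{aligned} |\tilde X_j^T r_l| - |X_j^T r_l| &\approx \left| X_j^T (X\beta - X_E \hat\beta_E) - s_j \beta_j \right| - \left| X_j^T (X\beta - X_E \hat\beta_E) \right| \\ &= \mathrm{sign}\left( X_j^T (X\beta - X_E \hat\beta_E) \right) \left[ \left( X_j^T (X\beta - X_E \hat\beta_E) - s_j \beta_j \right) - X_j^T (X\beta - X_E \hat\beta_E) \right] = s_j |\beta_j| > 0. \end{aligned}$$

This may reduce the power of the knockoff filter. We call such an effect the alternating sign effect.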

Definition 2.2 (Alternating sign effect). Let $r_l$ denote the residue at the $l$th step in a path method statistic or $y$ in the marginal correlation statistic. The alternating sign effect refers to the existence of a feature $j$ that satisfies $\mathrm{sign}(\beta_j) \neq \mathrm{sign}(X_j^T r_l)$.

In the counterexample above for the marginal correlation statistic, the design matrix $X$ and signal coefficient $\beta$ are constructed to generate the alternating sign effect. From our discussion, the alternating sign effect can lead to large negative $W_j$ and reduce the power of the knockoff filter.

2.4 Alternating sign effect on the Lasso path and other knockoff statistics

We will construct an example in which the Z-score is large enough to reject the null hypothesis. For this example, some knockoff statistics can only pick out a small subset of the false nulls.

2.4.1 The Z-score and signal amplitude

The Z-score of a classical linear model $y = X\beta + \epsilon$, $\epsilon \sim N(0, \sigma^2 I_n)$, is defined by

$$Z_j = \frac{\hat\beta_j^{ls}}{\sigma \sqrt{(\Sigma^{-1})_{jj}}},$$

where $\hat\beta^{ls}$ is the least squares coefficient of regressing $y$ on $X$. Obviously, $Z_j \sim N(0,1)$ if $\beta_j = 0$. In our example and numerical experiments to be presented later, we choose $\sigma = 1$ and $\beta_i = M/s_i$ for $s_i \neq 0$ and $\beta_i = 0$ for $s_i = 0$. This setting guarantees that the Z-score of a false null is large. In fact, we have the following estimate for $Z_j$.

Lemma 2.3. Let $\sigma = 1$ and $Z_j = \hat\beta_j^{ls} / \sqrt{(\Sigma^{-1})_{jj}}$. For any $j$ with $\beta_j \neq 0$ defined above we have

$$Z_j \ge \xi + \frac{M}{\sqrt{2 s_j}} \ge \xi + \frac{M}{\sqrt{2}}, \quad \xi \sim N(0,1).$$

This result shows that for large amplitude $M$, the Z-score of the false null is large enough to reject the null hypothesis. We defer the proof of this lemma to the Appendix.

2.4.2 An example to illustrate the alternating sign effect for several knockoff statistics

In this subsection, we construct an example to demonstrate that the Lasso path and the forward selection statistics could lose their power due to the alternating sign effect.
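In our example, the feature matrix $X$ consists of four groups $X = (X_{A_1}, X_{A_2}, X_{B_1}, X_{B_2})$ with correlations given as follows: for $i \neq j$,

$$\langle X_i, X_j \rangle = \begin{cases} \rho_1, & (i,j) \in (A \times A) \setminus (A_2 \times A_2) \text{ or } (i,j) \in (B \times B) \setminus (B_2 \times B_2), \\ -\rho_1, & (i,j) \in A \times B, \\ \rho_2, & (i,j) \in A_2 \times A_2 \text{ or } (i,j) \in B_2 \times B_2, \end{cases}$$

where $A = A_1 \cup A_2$, $B = B_1 \cup B_2$, and $\rho_2 > \rho_1$. For example, in the case $|A_1| = |A_2| = |B_1| = |B_2| = 2$, the Gram matrix of $X$ has the following structure:

$$X^T X = \Sigma = \begin{pmatrix} 1 & \rho_1 & \rho_1 & \rho_1 & -\rho_1 & -\rho_1 & -\rho_1 & -\rho_1 \\ \rho_1 & 1 & \rho_1 & \rho_1 & -\rho_1 & -\rho_1 & -\rho_1 & -\rho_1 \\ \rho_1 & \rho_1 & 1 & \rho_2 & -\rho_1 & -\rho_1 & -\rho_1 & -\rho_1 \\ \rho_1 & \rho_1 & \rho_2 & 1 & -\rho_1 & -\rho_1 & -\rho_1 & -\rho_1 \\ -\rho_1 & -\rho_1 & -\rho_1 & -\rho_1 & 1 & \rho_1 & \rho_1 & \rho_1 \\ -\rho_1 & -\rho_1 & -\rho_1 & -\rho_1 & \rho_1 & 1 & \rho_1 & \rho_1 \\ -\rho_1 & -\rho_1 & -\rho_1 & -\rho_1 & \rho_1 & \rho_1 & 1 & \rho_2 \\ -\rho_1 & -\rho_1 & -\rho_1 & -\rho_1 & \rho_1 & \rho_1 & \rho_2 & 1 \end{pmatrix}.$$

Given the above covariance matrix $\Sigma$, we draw the rows of $X$ from $N(0, \Sigma)$ and normalize the columns.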

Testing the alternating sign effect for different statistics

We perform numerical experiments for the example above. Let A = B =, A = k, B = k, $\rho_1$ =.3, $\rho_2$ =.9, M = 6 (recall $\beta_j = M/s_j$), the noise level σ = and the nominal FDR q = %. We compare the performance of five statistics: the knockoff least squares, the weighted half Lasso defined in Section 2.5.2 (λ =.5, Z = diag{ s /, s /,..., s p /}), the Lasso path, the forward selection, and the orthogonal matching pursuit (OMP) statistics. Here, $\lambda$ is the regularizing parameter in the penalized model. We use the difference statistic $W_j = |\hat\beta_j| - |\tilde\beta_j|$ for the least squares, and the signed max statistic for the half Lasso, the Lasso path, the forward selection and the OMP statistics, i.e. $W_j = \max(|\hat\beta_j|, |\tilde\beta_j|) \cdot \mathrm{sign}(|\hat\beta_j| - |\tilde\beta_j|)$, where $\hat\beta_j, \tilde\beta_j$ is the enter time (the step at which the original and knockoff feature enters) in the path statistic or the solution of the half Lasso. We vary the size parameter k =,4,.., but keep the sparsity level at %, e.g. the number of true features is 7. = 4 if k =. All the signals are randomly selected from $\{1,2,\ldots,p\}$. We run each experiment times and present the results in the left panel of Figure 1.

Smaller size and more trials

We rerun this same experiment with a smaller size but a larger number of trials. Let A = B = 4, A = k, B = k, k = 4,8,,...,4 and the sparsity level at %. Each experiment is repeated times. The results are plotted in the right panel of Figure 1.

[Figure 1: Power(%) and FDR versus the size parameter k of the special groups, for the Least Squares, Half Lasso (λ =.5), Lasso Path, Forward Selection and OMP statistics. Left panel: large test. Right panel: small test.]

Figure 1: Alternating sign tests for five methods at the nominal FDR q = % with varying sizes of special groups. Left and right subplots show the results of the first and second tests, respectively. The feature size in the first test (left subplot) is five times that in the second test (right subplot).
We focus on the power of the Lasso path and the forward selection statistics and find that these methods lose most of their power in this alternating sign test, which confirms that they suffer from the alternating sign effect. The other methods maintain nearly % power with the desired FDR control. The FDR of the Lasso path and the forward selection statistics in the left subplot is not stable, which can be attributed to the relatively small number of trials. This example indicates that the alternating sign effect could be a problem for the knockoff filter for certain statistics.

2.5 Notion of good statistic and a half penalized method

2.5.1 Good statistic

In the previous section, we showed that certain knockoff statistics suffer from the alternating sign effect (i.e. loss of power) even for strong signals that are only weakly correlated, due to the many large negative knockoff statistics $W_j$ that are generated. Based on the knockoff property, we propose an alternative statistic that does not suffer from the alternating sign effect. We first introduce the notion of a good statistic.

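Definition 2.4 (Good statistic). A knockoff statistic $W$ is a good statistic if it satisfies the positivity of non-null features: for fixed noise $\epsilon$, $W_j \ge 0$ if the signal amplitude $|\beta_j|$ is large enough relative to the noise.

The assumption of a good statistic ensures that $W_j$ is non-negative for a strong signal, so that the feature can potentially be selected by the knockoff filter.

Least squares statistic

An example of a good statistic is the least squares statistic. Denote the Gram matrix $G = [X\ \tilde X]^T [X\ \tilde X]$. We observe that the original construction of $\mathrm{diag}(s)$ used in [3] could lead to a singular Gram matrix. To alleviate this difficulty, we modify the criterion to construct $\mathrm{diag}(s)$ so that it is bounded below by a positive multiple of $\lambda_{\min}(\Sigma) I_p$ and above by $\tfrac{3}{2}\Sigma$. We will discuss different construction criteria in Section 2.7.

The least squares coefficients $\hat\beta, \tilde\beta$ obtained by regressing $y$ on $[X, \tilde X]$ satisfy
\[
\begin{pmatrix} \hat\beta \\ \tilde\beta \end{pmatrix}
= \begin{pmatrix} \beta \\ 0 \end{pmatrix}
+ G^{-1} \begin{pmatrix} X^T \epsilon \\ \tilde X^T \epsilon \end{pmatrix}
\triangleq \begin{pmatrix} \beta \\ 0 \end{pmatrix} + \begin{pmatrix} \eta^{(1)} \\ \eta^{(2)} \end{pmatrix}.
\]
We show that $W_j \triangleq |\hat\beta_j| - |\tilde\beta_j|$ satisfies the definition of a good statistic. In fact, if $\beta_j \neq 0$, we have $W_j = |\beta_j + \eta^{(1)}_j| - |\eta^{(2)}_j|$. For a fixed noise $\epsilon$, $\eta^{(1)}, \eta^{(2)}$ are fixed and thus $W_j$ is positive if $|\beta_j|$ is large enough. In general, if $|\beta_j|$ is large compared to the noise level $\sigma$, then $W_j > 0$ with large probability, since $\mathrm{var}(\eta^{(1)}_j) = \sigma^2 (G^{-1})_{jj}$ and $\mathrm{var}(\eta^{(2)}_j) = \sigma^2 (G^{-1})_{j+p,\,j+p}$.

The following formula for the least squares coefficients $\hat\beta, \tilde\beta$ will be useful in later sections:
\[
\begin{pmatrix} \hat\beta + \tilde\beta - \beta \\ \hat\beta - \tilde\beta - \beta \end{pmatrix}
= 2\left( [X+\tilde X,\ X-\tilde X]^T [X+\tilde X,\ X-\tilde X] \right)^{-1}
\begin{pmatrix} (X+\tilde X)^T \epsilon \\ (X-\tilde X)^T \epsilon \end{pmatrix}
= \begin{pmatrix} (2\Sigma - S)^{-1} (X+\tilde X)^T \epsilon \\ S^{-1} (X-\tilde X)^T \epsilon \end{pmatrix},
\quad S = \mathrm{diag}\{s_1, s_2, \dots, s_p\}. \tag{12}
\]

2.5.2 A half penalized method

In this subsection, we introduce a half penalized model based on the knockoff property. This method naturally suggests a good statistic. Consider the following penalized model
\[
\min_{\hat\beta, \tilde\beta}\ \frac{1}{2} \| y - X\hat\beta - \tilde X\tilde\beta \|_2^2 + P(\hat\beta - \tilde\beta) + Q(\hat\beta + \tilde\beta), \tag{13}
\]
where $P(x)$ and $Q(x)$ are even functions. The statistic defined by $W_j = |\hat\beta_j| - |\tilde\beta_j|$ or $W_j = \max(|\hat\beta_j|, |\tilde\beta_j|)\cdot\mathrm{sign}(|\hat\beta_j| - |\tilde\beta_j|)$ satisfies the sufficiency and the antisymmetry properties, since swapping $X_j, \tilde X_j$ leads to swapping $\hat\beta_j, \tilde\beta_j$.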
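As a small illustration of the two statistic forms and of the antisymmetry property just mentioned, the following sketch computes the difference and signed max statistics from a pair of coefficient vectors and checks that swapping one original/knockoff pair flips the sign of the corresponding entry. The function names and the numerical values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def difference_stat(b_hat, b_tilde):
    """Difference statistic: W_j = |b_hat_j| - |b_tilde_j|."""
    return [abs(a) - abs(b) for a, b in zip(b_hat, b_tilde)]


def signed_max_stat(b_hat, b_tilde):
    """Signed max statistic: max(|b_hat_j|, |b_tilde_j|) * sign(|b_hat_j| - |b_tilde_j|)."""
    out = []
    for a, b in zip(b_hat, b_tilde):
        diff = abs(a) - abs(b)
        sign = (diff > 0) - (diff < 0)  # -1, 0 or 1
        out.append(max(abs(a), abs(b)) * sign)
    return out


def swap_pairs(b_hat, b_tilde, subset):
    """Swap original and knockoff coefficients for the indices in `subset`."""
    b_hat, b_tilde = list(b_hat), list(b_tilde)
    for j in subset:
        b_hat[j], b_tilde[j] = b_tilde[j], b_hat[j]
    return b_hat, b_tilde


# Illustrative coefficients only (hypothetical values).
b_hat = [2.0, -0.3, 0.1]
b_tilde = [0.5, 0.4, -0.1]

w = difference_stat(b_hat, b_tilde)
# Swapping pair 1 negates W_1 and leaves the other entries unchanged.
w_swapped = difference_stat(*swap_pairs(b_hat, b_tilde, {1}))
w_max = signed_max_stat(b_hat, b_tilde)
```

The swap acts on the coefficients here because, for the penalized model above, swapping a column pair swaps the corresponding fitted coefficients.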
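Let $\hat\beta^{ls}, \tilde\beta^{ls}$ be the least squares coefficients obtained by regressing $y$ on $[X, \tilde X]$. We denote by $r \triangleq y - X\hat\beta^{ls} - \tilde X\tilde\beta^{ls}$ the residual. The geometric property of the least squares method implies $r \perp X, \tilde X$, which leads to
\[
\begin{aligned}
\| y - X\hat\beta - \tilde X\tilde\beta \|_2^2
&= \| r + X(\hat\beta^{ls} - \hat\beta) + \tilde X(\tilde\beta^{ls} - \tilde\beta) \|_2^2
= \| r \|_2^2 + \| X(\hat\beta^{ls} - \hat\beta) + \tilde X(\tilde\beta^{ls} - \tilde\beta) \|_2^2 \\
&= \| r \|_2^2 + \Big\| \frac{X + \tilde X}{2} (\hat\beta^{ls} + \tilde\beta^{ls} - \hat\beta - \tilde\beta) + \frac{X - \tilde X}{2} \big(\hat\beta^{ls} - \tilde\beta^{ls} - (\hat\beta - \tilde\beta)\big) \Big\|_2^2 .
\end{aligned}
\]
The residual $r$ does not depend on $\hat\beta$ and $\tilde\beta$. Thus we can exclude $r$ from the penalized model (13). Note that the knockoff constraint on $\tilde X$ implies an important property:
\[
(X + \tilde X)^T (X - \tilde X) = 0, \qquad X + \tilde X \perp X - \tilde X. \tag{14}
\]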

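The orthogonality property (14) enables us to separate the left hand side into the sum of three terms:
\[
\| y - X\hat\beta - \tilde X\tilde\beta \|_2^2 = \| r \|_2^2
+ \Big\| \frac{X + \tilde X}{2} (\hat\beta^{ls} + \tilde\beta^{ls} - \hat\beta - \tilde\beta) \Big\|_2^2
+ \Big\| \frac{X - \tilde X}{2} \big(\hat\beta^{ls} - \tilde\beta^{ls} - (\hat\beta - \tilde\beta)\big) \Big\|_2^2 .
\]
We can then rewrite (13) in the following equivalent form:
\[
\min_{\hat\beta, \tilde\beta}\ \frac{1}{2} \Big\| \frac{X + \tilde X}{2} (\hat\beta^{ls} + \tilde\beta^{ls} - \hat\beta - \tilde\beta) \Big\|_2^2
+ \frac{1}{2} \Big\| \frac{X - \tilde X}{2} \big(\hat\beta^{ls} - \tilde\beta^{ls} - (\hat\beta - \tilde\beta)\big) \Big\|_2^2
+ P(\hat\beta - \tilde\beta) + Q(\hat\beta + \tilde\beta), \tag{15}
\]
which can be further reformulated as two equivalent subproblems
\[
\min_{\hat\alpha}\ \frac{1}{2} \Big\| \frac{X + \tilde X}{2} (\hat\beta^{ls} + \tilde\beta^{ls} - \hat\alpha) \Big\|_2^2 + Q(\hat\alpha), \tag{16}
\]
\[
\min_{\tilde\alpha}\ \frac{1}{2} \Big\| \frac{X - \tilde X}{2} (\hat\beta^{ls} - \tilde\beta^{ls} - \tilde\alpha) \Big\|_2^2 + P(\tilde\alpha), \tag{17}
\]
where we have replaced $\hat\beta + \tilde\beta$, $\hat\beta - \tilde\beta$ in (15) by $\hat\alpha$, $\tilde\alpha$, respectively. A key observation of the knockoff filter is that the column vectors of $X - \tilde X$ are mutually orthogonal since
\[
(X - \tilde X)^T (X - \tilde X) = 2\Sigma - 2(\Sigma - \mathrm{diag}(s)) = 2\,\mathrm{diag}(s). \tag{18}
\]
Consequently, the second subproblem is reduced to
\[
\min_{\tilde\alpha} \sum_{i=1}^p \frac{1}{8} \| X_i - \tilde X_i \|_2^2 \big(\tilde\alpha_i - (\hat\beta^{ls}_i - \tilde\beta^{ls}_i)\big)^2 + P(\tilde\alpha)
= \min_{\tilde\alpha} \sum_{i=1}^p \frac{s_i}{4} \big(\tilde\alpha_i - (\hat\beta^{ls}_i - \tilde\beta^{ls}_i)\big)^2 + P(\tilde\alpha). \tag{19}
\]
If $P(x)$ can be expressed as $P(x) = \sum_{i=1}^p P(x_i)$, we can solve (19) easily by solving $p$ one-dimensional optimization problems separately.

Example 1: A half penalized method and a good statistic

We construct a good statistic to make sure that $W_j > 0$ for a true feature. We choose $Q \equiv 0$ and (13) becomes
\[
\min_{\hat\beta, \tilde\beta}\ \frac{1}{2} \| y - X\hat\beta - \tilde X\tilde\beta \|_2^2 + P(\hat\beta - \tilde\beta).
\]
This is different from other penalized models since it only penalizes $\hat\beta - \tilde\beta$. We call this model the half penalized method. This problem can also be divided into the two subproblems (16) and (17). The solution of (16) is trivial and by (12) we have the following explicit formula:
\[
\hat\alpha = \hat\beta^{ls} + \tilde\beta^{ls} = \beta + (2\Sigma - S)^{-1} (X + \tilde X)^T \epsilon. \tag{20}
\]
We introduce the following notation, which will be used very often later on:
\[
\epsilon^{(1)} \triangleq (2\Sigma - S)^{-1} (X + \tilde X)^T \epsilon, \qquad \epsilon^{(2)} \triangleq S^{-1} (X - \tilde X)^T \epsilon, \tag{21}
\]
\[
\mathrm{Var}(\epsilon^{(1)}) = \sigma^2 (\Sigma - S/2)^{-1}, \qquad \mathrm{Var}(\epsilon^{(2)}) = \sigma^2 (S/2)^{-1}, \tag{22}
\]
where $\sigma^2 = \mathrm{var}(\epsilon_i)$. Substituting the expression of $\hat\beta^{ls} - \tilde\beta^{ls}$ given in (12) into (19) yields
\[
\min_{\tilde\alpha} \sum_{i=1}^p \frac{1}{8} \| X_i - \tilde X_i \|_2^2 (\tilde\alpha_i - \beta_i - \epsilon^{(2)}_i)^2 + P(\tilde\alpha)
= \min_{\tilde\alpha} \sum_{i=1}^p \frac{s_i}{4} (\tilde\alpha_i - \beta_i - \epsilon^{(2)}_i)^2 + P(\tilde\alpha). \tag{23}
\]
The minimizer $\tilde\alpha$ in (23) satisfies the following lemma.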

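Lemma 2.5. Assume $P(x)$ is even. The minimizer of (23) satisfies $\mathrm{sign}(\tilde\alpha_j) = \mathrm{sign}(\beta_j + \epsilon^{(2)}_j)$ or $\tilde\alpha_j = 0$, provided $\beta_j + \epsilon^{(2)}_j \neq 0$ and $s_j \neq 0$.

Proof. Since $P$ is even, we have $P(x) = P(|x|)$. Recall the minimization problem
\[
\tilde\alpha = \arg\min_{\alpha} \sum_{i=1}^p \frac{s_i}{4} (\alpha_i - \beta_i - \epsilon^{(2)}_i)^2 + P(|\alpha|) \triangleq \arg\min_\alpha f(\alpha).
\]
If $\mathrm{sign}(\tilde\alpha_j)$ is neither $\mathrm{sign}(\beta_j + \epsilon^{(2)}_j)$ nor $0$, we can modify $\tilde\alpha$ as follows,
\[
\tilde\alpha^{new}_j = -\tilde\alpha_j, \qquad \tilde\alpha^{new}_i = \tilde\alpha_i, \quad i \neq j,
\]
to obtain a smaller value. In fact, this modification only changes one term in $f(\tilde\alpha)$ and the following inequality leads to a contradiction:
\[
f(\tilde\alpha^{new}) - f(\tilde\alpha) = \frac{s_j}{4} (\tilde\alpha^{new}_j - \beta_j - \epsilon^{(2)}_j)^2 - \frac{s_j}{4} (\tilde\alpha_j - \beta_j - \epsilon^{(2)}_j)^2 = s_j \tilde\alpha_j (\beta_j + \epsilon^{(2)}_j) < 0.
\]

Assume that the knockoff statistic takes the difference formula, i.e. $W_j = |\hat\beta_j| - |\tilde\beta_j|$ (the signed max formula can be considered similarly). Equation (22) yields
\[
\mathrm{var}(\epsilon^{(1)}_j) = \sigma^2 \big((\Sigma - S/2)^{-1}\big)_{jj}, \qquad \mathrm{var}(\epsilon^{(2)}_j) = 2\sigma^2 / s_j .
\]
When $|\beta_j|$ is large compared to the noise level $\sigma$, we have $|\beta_j| > |\epsilon^{(1)}_j|, |\epsilon^{(2)}_j|$ with large probability. Consequently, we obtain $\mathrm{sign}(\hat\alpha_j) = \mathrm{sign}(\tilde\alpha_j) = \mathrm{sign}(\beta_j)$. Combining the solution of the first problem (20), Lemma 2.5 and the transform between $\alpha$ and $\beta$ ($\hat\alpha = \hat\beta + \tilde\beta$, $\tilde\alpha = \hat\beta - \tilde\beta$), we conclude that
\[
W_j = |\hat\beta_j| - |\tilde\beta_j| = \frac{1}{2} \big( |\hat\alpha_j + \tilde\alpha_j| - |\hat\alpha_j - \tilde\alpha_j| \big) \ge 0,
\]
which implies that $W$ is a good statistic.

Example 2: A half Lasso statistic

We choose $Q(x) \equiv 0$ and $P(x) = \lambda \|x\|_1$. As a result, the Lasso problem (17) or (19) can be solved directly:
\[
\tilde\alpha = \arg\min_{\alpha} \sum_{i=1}^p \frac{s_i}{4} \big(\alpha_i - (\hat\beta^{ls}_i - \tilde\beta^{ls}_i)\big)^2 + \lambda |\alpha_i|,
\]
\[
\tilde\alpha_i = \mathrm{Sh}(\hat\beta^{ls}_i - \tilde\beta^{ls}_i,\ 2\lambda/s_i) \triangleq \mathrm{sign}(\hat\beta^{ls}_i - \tilde\beta^{ls}_i) \big( |\hat\beta^{ls}_i - \tilde\beta^{ls}_i| - 2\lambda/s_i \big)_+ ,
\]
where Sh (Shrinkage) is the soft threshold operator and $a_+ \triangleq \max(0, a)$. We can rewrite the formula above in vector form, $\tilde\alpha = \mathrm{Sh}(\hat\beta^{ls} - \tilde\beta^{ls}, 2\lambda S_{inv})$, where $S_{inv} = [1/s_1, 1/s_2, \dots, 1/s_p]^T$. This vector identity should be interpreted as a collection of pointwise identities. Since $Q \equiv 0$, the solution of (16) is given by $\hat\alpha = \hat\beta^{ls} + \tilde\beta^{ls}$. Combining the formulas for $\tilde\alpha$ and $\hat\alpha$, we obtain the solution of (13):
\[
\begin{aligned}
\hat\beta &= \frac{1}{2} \big( \hat\beta^{ls} + \tilde\beta^{ls} + \mathrm{Sh}(\hat\beta^{ls} - \tilde\beta^{ls}, 2\lambda S_{inv}) \big) = \frac{1}{2} \big( \beta + \epsilon^{(1)} + \mathrm{Sh}(\beta + \epsilon^{(2)}, 2\lambda S_{inv}) \big), \\
\tilde\beta &= \frac{1}{2} \big( \hat\beta^{ls} + \tilde\beta^{ls} - \mathrm{Sh}(\hat\beta^{ls} - \tilde\beta^{ls}, 2\lambda S_{inv}) \big) = \frac{1}{2} \big( \beta + \epsilon^{(1)} - \mathrm{Sh}(\beta + \epsilon^{(2)}, 2\lambda S_{inv}) \big),
\end{aligned} \tag{24}
\]
where $\epsilon^{(1)}, \epsilon^{(2)}$ are defined in (21) with variance (22).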
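The closed-form half Lasso solution can be sketched in a few lines. The code below assumes the least squares coefficients and the vector s are already available (computing them is outside this sketch), and it uses the threshold 2λ/s_j, which is what minimizing (s_j/4)(α − d)² + λ|α| over α gives. The function names and input values are hypothetical.

```python
def soft_threshold(x, t):
    """Soft threshold Sh(x, t) = sign(x) * max(|x| - t, 0), for t >= 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def half_lasso(beta_ls_hat, beta_ls_tilde, s, lam):
    """Half Lasso coefficients from least squares coefficients.

    The sum beta_hat + beta_tilde is left unpenalized, while the
    difference beta_hat - beta_tilde is soft-thresholded at 2*lam/s_j.
    """
    beta_hat, beta_tilde = [], []
    for bh, bt, sj in zip(beta_ls_hat, beta_ls_tilde, s):
        alpha_hat = bh + bt
        alpha_tilde = soft_threshold(bh - bt, 2.0 * lam / sj)
        beta_hat.append(0.5 * (alpha_hat + alpha_tilde))
        beta_tilde.append(0.5 * (alpha_hat - alpha_tilde))
    return beta_hat, beta_tilde


# Illustrative inputs only (hypothetical values).
beta_hat, beta_tilde = half_lasso(
    beta_ls_hat=[3.0, 0.2], beta_ls_tilde=[0.5, 0.1], s=[1.0, 1.0], lam=0.25
)
W = [abs(a) - abs(b) for a, b in zip(beta_hat, beta_tilde)]
```

In this example the first feature has a large least squares gap and keeps a positive statistic, while the second has a gap below the threshold, so its two coefficients coincide and its statistic is zero.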
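It is interesting to note that if $|\beta_j|$ is small, the soft threshold yields $\hat\beta_j = \tilde\beta_j$, which implies $W_j = 0$.

A weighted half Lasso statistic

We can add a weight to $\beta_i$ to balance the noise level and the soft threshold. Consider the following penalized model
\[
\min_{\hat\beta, \tilde\beta}\ \frac{1}{2} \| y - X Z^{-1} \hat\beta - \tilde X Z^{-1} \tilde\beta \|_2^2 + \lambda \| \hat\beta - \tilde\beta \|_1, \tag{25}
\]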

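where $Z = \mathrm{diag}\{z_1, z_2, \dots, z_p\}$ is a positive diagonal matrix chosen in advance. Note that $\hat\beta, \tilde\beta$ only depend on $[X\ \tilde X]^T y$ and $[X\ \tilde X]^T [X\ \tilde X]$. Similarly, we derive the solution as follows:
\[
\begin{aligned}
\hat\beta &= \frac{1}{2} \big( Z(\beta + \epsilon^{(1)}) + \mathrm{Sh}(Z(\beta + \epsilon^{(2)}),\ 2\lambda Z^2 S_{inv}) \big), \\
\tilde\beta &= \frac{1}{2} \big( Z(\beta + \epsilon^{(1)}) - \mathrm{Sh}(Z(\beta + \epsilon^{(2)}),\ 2\lambda Z^2 S_{inv}) \big),
\end{aligned} \tag{26}
\]
where $S = \mathrm{diag}\{s_1, s_2, \dots, s_p\}$ and $\epsilon^{(1)}, \epsilon^{(2)}$ are defined in (21). The weighted half Lasso statistic that satisfies the sufficiency property is defined as follows:
\[
W_j = \frac{2}{z_j} \big( |\hat\beta_j| - |\tilde\beta_j| \big)
= \big| \beta_j + \epsilon^{(1)}_j + \mathrm{Sh}(\beta_j + \epsilon^{(2)}_j,\ 2\lambda z_j/s_j) \big| - \big| \beta_j + \epsilon^{(1)}_j - \mathrm{Sh}(\beta_j + \epsilon^{(2)}_j,\ 2\lambda z_j/s_j) \big|. \tag{27}
\]
We can also define the associated signed max statistic $W_j = \frac{2}{z_j} \max(|\hat\beta_j|, |\tilde\beta_j|)\cdot\mathrm{sign}(|\hat\beta_j| - |\tilde\beta_j|)$, which also satisfies the sufficiency property. The difference between (26) and (24) is the addition of a different weight to the threshold. Note that the covariance matrix of $\epsilon^{(2)}$ is $2\sigma^2 S^{-1}$. The weighted half Lasso can balance the variance of the noise $\epsilon^{(2)}$ and the soft threshold. We suggest using $Z = \mathrm{diag}(\sqrt{s_1/2}, \sqrt{s_2/2}, \dots, \sqrt{s_p/2})$. With this choice of $Z$, we have
\[
\mathrm{Var}(\epsilon^{(2)}) = 2\sigma^2 S^{-1} = 4\sigma^2\, \mathrm{diag}(z_1^2/s_1^2, z_2^2/s_2^2, \dots, z_p^2/s_p^2),
\]
which matches the squared thresholds $\lambda^2\, \mathrm{diag}(4z_1^2/s_1^2, 4z_2^2/s_2^2, \dots, 4z_p^2/s_p^2)$ up to the common factor $\sigma^2/\lambda^2$.

Example 3: A negative half Lasso

Choosing $Q(x) \equiv 0$ and $P(x) = -\lambda \sum_{i=1}^p \mu_i |x_i|$, $\mu_i \ge 0$, we can deduce the solution of (16) and (17) (or (19)):
\[
\hat\beta + \tilde\beta = \hat\alpha = \hat\beta^{ls} + \tilde\beta^{ls} = \beta + \epsilon^{(1)}, \qquad
\hat\beta - \tilde\beta = \tilde\alpha = \arg\min_{\alpha} \sum_{i=1}^p \frac{s_i}{4} \big(\alpha_i - (\hat\beta^{ls}_i - \tilde\beta^{ls}_i)\big)^2 - \lambda \mu_i |\alpha_i|,
\]
\[
\Longrightarrow\quad \hat\beta_i = \beta_i + \frac{1}{2} \big( \epsilon^{(1)}_i + \epsilon^{(2)}_i \big) + \frac{\lambda \mu_i}{s_i}\, \mathrm{sign}(\beta_i + \epsilon^{(2)}_i), \qquad
\tilde\beta_i = \frac{1}{2} \big( \epsilon^{(1)}_i - \epsilon^{(2)}_i \big) - \frac{\lambda \mu_i}{s_i}\, \mathrm{sign}(\beta_i + \epsilon^{(2)}_i),
\]
where we have used $\hat\beta^{ls}_i - \tilde\beta^{ls}_i = \beta_i + \epsilon^{(2)}_i$. We see that a negative $P(x)$ can increase the difference between $\hat\beta$ and $\tilde\beta$, which can be useful to distinguish the true feature from its knockoff.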
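To make the contrast between the two penalties concrete, the sketch below compares the fitted gap $\hat\beta_i - \tilde\beta_i$ under the half Lasso and the negative half Lasso for a single coordinate, using the one-dimensional closed forms obtained by minimizing (s/4)(a − d)² ± (penalty)·|a| over a, where d stands for the least squares gap. All names and numerical values are hypothetical and serve only as an illustration.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def gap_half_lasso(d, s, lam):
    """Minimizer of (s/4)*(a - d)**2 + lam*|a| over a (soft threshold)."""
    return sign(d) * max(abs(d) - 2.0 * lam / s, 0.0)


def gap_negative_half_lasso(d, s, lam, mu):
    """Minimizer of (s/4)*(a - d)**2 - lam*mu*|a| over a, assuming d != 0."""
    return d + (2.0 * lam * mu / s) * sign(d)


def objective(a, d, s, weight):
    """(s/4)*(a - d)**2 + weight*|a|; a negative weight gives the negative penalty."""
    return 0.25 * s * (a - d) ** 2 + weight * abs(a)


# Illustrative values only (not taken from the paper's experiments).
d, s, lam, mu = 1.0, 0.5, 0.1, 0.5
shrunk = gap_half_lasso(d, s, lam)                 # gap is pulled toward zero
enlarged = gap_negative_half_lasso(d, s, lam, mu)  # gap is pushed away from zero
```

Evaluating `objective` on a grid around these values is a simple way to confirm that each closed form is indeed the minimizer of its one-dimensional problem.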
When $\mu_i = s_i$, our numerical results show that the negative penalty enlarges the gap between $\hat\beta$ and $\tilde\beta$ and increases the power by 5 % compared to least squares, while the half Lasso shrinks the gap between $\hat\beta$ and $\tilde\beta$ and reduces the power by 5 %.

2.6 Extension of the knockoff sufficiency property

In [3], the sufficiency property of a knockoff statistic states that the statistic $W$ depends only on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the feature-response product $[X\ \tilde X]^T y$. In this subsection, we generalize the sufficiency property so that we can apply the knockoff filter to more general scenarios. In addition, we propose a method to estimate the noise level and determine the prior regularizing parameter for a half penalized method.

Let $U \in R^{n \times (n-2p)}$ be an orthonormal matrix such that $[X\ \tilde X]^T U = 0$ and $[X\ \tilde X\ U]$ admits a basis of $R^n$. Recall that the knockoff condition implies $(X + \tilde X)^T (X - \tilde X) = X^T X - \tilde X^T \tilde X = 0$. Hence, we can decompose $R^n$ as follows:
\[
R^n = \mathrm{span}(X + \tilde X) \oplus \mathrm{span}(X - \tilde X) \oplus \mathrm{span}(U).
\]
Our key observation is that swapping each pair of the original $X_j$ and its knockoff $\tilde X_j$ does not modify these spaces: $\mathrm{span}(X + \tilde X)$, $\mathrm{span}(X - \tilde X)$ and $\mathrm{span}(U)$. Therefore, the probability distributions of the projections of the response $y$ onto these spaces are independent and invariant after swapping an arbitrary pair $X_j, \tilde X_j$. Inspired by this observation, we can generalize the sufficiency property.

Definition 2.6 (Generalized Sufficiency Property). The statistic $W$ is said to obey the generalized sufficiency property if $W$ depends only on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the feature-response product $[X\ \tilde X\ U]^T y$; that is, we can write
\[
W = f\big([X\ \tilde X]^T [X\ \tilde X],\ [X\ \tilde X\ U]^T y\big)
\]
for some $f : S^+_{2p} \times R^n \to R^p$ and an orthonormal matrix $U \in R^{n \times (n-2p)}$ that satisfies $U^T [X\ \tilde X] = 0$.

Remark. Compared with the original sufficiency property, the generalized sufficiency property includes the addition of $U^T y$, which is the coefficient vector of the orthogonal projection of $y$ onto $\mathrm{span}([X\ \tilde X])^{\perp}$. As an application, we will use this extra component to estimate the noise level and incorporate the estimated noise level into the knockoff statistic from a penalized method without violating the exchangeability property and FDR control.

The definition of the antisymmetry property remains the same: swapping $X_j$ and $\tilde X_j$ has the same effect as changing the sign of $W_j$, i.e.
\[
W_j\big([X\ \tilde X]_{\mathrm{swap}(\hat S)}, U, y\big) = W_j\big([X\ \tilde X], U, y\big) \cdot
\begin{cases} -1, & j \in \hat S, \\ +1, & j \notin \hat S, \end{cases}
\]
where $\hat S$ is a subset of nulls. For any knockoff matrix $\tilde X$ and an associated statistic $W$ that satisfies the above definition, we call $W$ a generalized knockoff statistic. We will prove that this generalized statistic satisfies the exchangeability property. Then we can apply the same super-martingale argument as in [3] to establish rigorous FDR control. According to the analysis of exchangeability in [3], we need to prove the counterparts of Lemma 2 (Pairwise exchangeability for the features) and Lemma 3 (Pairwise exchangeability for the response) in [3]. Lemma 2 is a direct result of the knockoff constraint. We need to prove the following lemma.

Lemma 2.7. For any generalized knockoff statistic $W$ and a subset $\hat S$ of nulls, we have
\[
W_{\mathrm{swap}(\hat S)} \triangleq f\big([X\ \tilde X]^T_{\mathrm{swap}(\hat S)} [X\ \tilde X]_{\mathrm{swap}(\hat S)},\ [[X\ \tilde X]_{\mathrm{swap}(\hat S)}\ U]^T y\big)
\overset{d}{=} f\big([X\ \tilde X]^T [X\ \tilde X],\ [X\ \tilde X\ U]^T y\big) = W.
\]

Proof. Since $\tilde X$ is a knockoff matrix, we get
\[
[X\ \tilde X]^T_{\mathrm{swap}(\hat S)} [X\ \tilde X]_{\mathrm{swap}(\hat S)} = [X\ \tilde X]^T [X\ \tilde X], \tag{28}
\]
and thus the first arguments of $f$ on the two sides of the claimed identity are the same.
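Next, we verify $[[X\ \tilde X]_{\mathrm{swap}(\hat S)}\ U]^T y \overset{d}{=} [X\ \tilde X\ U]^T y$. Since $y$ is a Gaussian random vector, it is equivalent to verify that the means and the variances of both sides are the same. We first check the means of both sides:
\[
E\big([[X\ \tilde X]_{\mathrm{swap}(\hat S)}\ U]^T y\big) = [[X\ \tilde X]_{\mathrm{swap}(\hat S)}\ U]^T X\beta
= \begin{pmatrix} [X\ \tilde X]^T X\beta \\ U^T X\beta \end{pmatrix}
= E\big([X\ \tilde X\ U]^T y\big). \tag{29}
\]
The second equality is guaranteed by $X_j^T X\beta = \tilde X_j^T X\beta$ for all $j \in \hat S$, since $X_j^T X_i = \tilde X_j^T X_i$ for all $i \neq j$ and $\hat S$ is a subset of nulls. Using $y \sim N(X\beta, \sigma^2 I_n)$, $[X\ \tilde X]^T U = 0$ and (28), we obtain
\[
\begin{aligned}
\mathrm{Var}\big([[X\ \tilde X]_{\mathrm{swap}(\hat S)}\ U]^T y\big)
&= \sigma^2 [[X\ \tilde X]_{\mathrm{swap}(\hat S)}\ U]^T [[X\ \tilde X]_{\mathrm{swap}(\hat S)}\ U] \\
&= \sigma^2\, \mathrm{diag}\big([X\ \tilde X]^T_{\mathrm{swap}(\hat S)} [X\ \tilde X]_{\mathrm{swap}(\hat S)},\ U^T U\big)
= \sigma^2\, \mathrm{diag}\big([X\ \tilde X]^T [X\ \tilde X],\ U^T U\big) \\
&= \mathrm{Var}\big([[X\ \tilde X]\ U]^T y\big).
\end{aligned} \tag{30}
\]
Combining (29) and (30), we conclude the proof.

The exchangeability property of a generalized knockoff statistic is a result of this lemma and the antisymmetry property of the knockoff statistic.

Lemma 2.8 (i.i.d. signs for the nulls). Let $\eta \in \{\pm 1\}^p$ be a sign sequence independent of $W$, with $\eta_j = +1$ for all non-null $j$ and $\eta_j$ i.i.d. in $\{\pm 1\}$ for null $j$. Then
\[
(W_1, \dots, W_p) \overset{d}{=} (W_1 \eta_1, \dots, W_p \eta_p).
\]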

Estimate of the noise level and an application

As an application of the generalized knockoff statistic, we propose a new method to estimate the noise level in the knockoff filter without violating the exchangeability property and FDR control. Let $U \in R^{n \times (n-2p)}$ be an orthonormal matrix such that $U^T [X\ \tilde X] = 0$. From the identity $U^T y = U^T (X\beta + \epsilon) = U^T \epsilon$, we obtain an estimate of the noise level depending on $U^T y$:
\[
\hat\sigma \triangleq \| U^T y \|_2 / \sqrt{n - 2p}. \tag{31}
\]
For any problem with an unknown noise level, we consider the knockoff half Lasso whose regularizing parameter is decided by $\hat\sigma$, i.e.
\[
\min_{\hat\beta, \tilde\beta}\ \frac{1}{2} \| y - X\hat\beta - \tilde X\tilde\beta \|_2^2 + \lambda \hat\sigma \| \hat\beta - \tilde\beta \|_1, \tag{32}
\]
where $\lambda$ is set to a fixed constant or can be decided empirically. Since the solution of (32), i.e. $(\hat\beta, \tilde\beta)$, depends on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$, the marginal correlation $[X\ \tilde X]^T y$ and the regularizing parameter $\lambda\hat\sigma$ (where $\hat\sigma$ is decided by $U^T y$), we derive that $\hat\beta, \tilde\beta$ are functions of the Gram matrix and $[X\ \tilde X\ U]^T y$. Consequently, the statistic $W_j \triangleq |\hat\beta_j| - |\tilde\beta_j| = W_j\big([X\ \tilde X]^T [X\ \tilde X], [X\ \tilde X\ U]^T y\big)$ (or the signed max version) satisfies the generalized sufficiency property. The antisymmetry property can be verified easily. Hence, we can choose $W$ as a knockoff statistic with exact FDR control.

2.7 A modified SDP construction

In [3], the authors propose to construct $\mathrm{diag}(s)$ ($s = (s_1, s_2, \dots, s_p)$) via convex optimization:
\[
\text{maximize: } \sum_{i=1}^p s_i, \qquad \text{subject to: } \mathrm{diag}(s) \preceq 2\Sigma; \quad 0 \le s_i \le 1,\ i = 1, 2, \dots, p. \tag{33}
\]
Such a construction sometimes produces zero $s_i$. In this case, feature $i$ cannot be selected by the knockoff filter. To illustrate this point, we construct a simple but by no means extreme example in which this construction criterion gives zero $s_i$ for some $i$. Let $\Sigma_{a,b}$ be a $3 \times 3$ matrix defined as
\[
\Sigma_{a,b} = \begin{pmatrix} 1 & b & a \\ b & 1 & a \\ a & a & 1 \end{pmatrix}.
\]
Using the CVX solver in MATLAB, we can solve (33) for several $\Sigma_{a,b}$. When $(a,b) = (0.8, 0.4)$, we have $s_1 = s_2 = 0.24$ and $s_3 = 0$; when $(a,b) = (0.9, 0.7)$, we have $s_1 = s_2 = 0.16$ and $s_3 = 0$; when $(a,b) = (0.7, 0.4)$, we have $s_1 = s_2 = 0.84$ and $s_3 = 0$. We observe that $s_3 = 0$ in these examples.
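The values in this example can be cross-checked by hand without an SDP solver. The sketch below does not solve the optimization problem; it only evaluates, under the assumptions $s_3 = 0$, $s_1 = s_2 = s$, and a matrix with unit diagonal, entry $b$ between the first two features and entry $a$ elsewhere, the largest $s$ for which $2\Sigma_{a,b} - \mathrm{diag}(s, s, 0)$ remains positive semidefinite. The function name is hypothetical.

```python
def max_equal_s_with_s3_zero(a, b):
    """Largest s such that 2*Sigma_ab - diag(s, s, 0) is positive semidefinite.

    Sigma_ab = [[1, b, a], [b, 1, a], [a, a, 1]] (assumed layout).
    The matrix is symmetric under exchanging features 1 and 2, so the
    constraint splits into two pieces:
      * on span{(1, -1, 0)}: 2*(1 - b) - s >= 0;
      * on span{(1, 1, 0), (0, 0, 1)}: a 2x2 block whose determinant
        condition reads 2*(2 + 2*b - s) - 8*a**2 >= 0.
    """
    bound_antisymmetric = 2.0 * (1.0 - b)
    bound_symmetric = 2.0 * (1.0 + b) - 4.0 * a * a
    return min(bound_antisymmetric, bound_symmetric)


# The three (a, b) pairs discussed in the text.
pairs = [(0.8, 0.4), (0.9, 0.7), (0.7, 0.4)]
boundaries = [max_equal_s_with_s3_zero(a, b) for a, b in pairs]
```

Up to floating point rounding, the three boundary values are 0.24, 0.16 and 0.84, so in each case the binding constraint comes from the block that couples the first two features with the third.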
Modified SDP construction

To overcome the zero output problem, we propose to slightly modify the original SDP construction by solving the following optimization problem:
\[
\text{minimize: } \sum_{i=1}^p (1 - s_i), \qquad \text{subject to: } \mathrm{diag}(s) \preceq 2\beta\Sigma, \quad \alpha\lambda_{\min}(\Sigma) \le s_i \le 1, \quad \alpha \in [0, 1),\ \beta \in (0, 1].
\]
The half penalized method requires that $\Sigma - \mathrm{diag}(s)/2$ and $\mathrm{diag}(s)$ be invertible (see the least squares coefficient formula (12)), and we suggest (α, β) = (.5, .75). For path statistics, to alleviate zero output in the SDP construction, we suggest (α,β) = (.5,).

3 A PCA prototype filter

In this section, we propose a PCA prototype group selection method with group FDR control to overcome the difficulty associated with strong within-group correlation. It is well known that the

grouping strategy provides an effective way to handle strongly correlated features. Our work is inspired by Reid-Tibshirani's prototype filter [3] and Dai-Barber's group knockoff filter [6]. We provide a brief summary of the two methods below before introducing our PCA prototype filter.

3.1 Reid-Tibshirani's prototype filter

In [3], Reid and Tibshirani introduce a prototype filter. They choose a prototype for each group of features, then use the knockoff filter on these prototypes to perform group selection. Specifically, the method consists of the following steps. First, cluster the columns of $X$ into $K$ groups $\{C_1, \dots, C_K\}$. Then split the data by rows into two (roughly) equal parts
\[
y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \end{pmatrix}, \qquad X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}.
\]
Choose a prototype for each cluster via the maximal marginal correlation, using only the first part of the data $y^{(1)}, X^{(1)}$. This generates the prototype set $\hat P$. Next, form a knockoff matrix $\tilde X^{(2)}_{\hat P}$ from $X^{(2)}_{\hat P}$ and perform the knockoff filter using $y^{(2)}$ and $[X^{(2)}_{\hat P}\ \tilde X^{(2)}_{\hat P}]$. Finally, group $C_i$ is selected if and only if $X^{(2)}_{\hat P_i}$ is chosen in the filter process.

This method satisfies the exchangeability property, and the authors establish group FDR control based on a super-martingale argument similar to the one in [3]. We point out that this method does not benefit much from the group structure. Assume that $X_i, X_j$ are in the same group and $\langle X_i, X_j \rangle \ge 1 - \delta$. For this pair of $X_i$ and $X_j$, we define a unit vector $v$ by $v \triangleq (e_i - e_j)/\sqrt{2}$, where $\{e_i\}_{1 \le i \le p}$ is the standard orthonormal basis of $R^p$. From the knockoff constraint $\mathrm{diag}(s) \preceq 2\Sigma$, we have $v^T \mathrm{diag}(s) v \le 2 v^T X^T X v$, which further implies
\[
\min(s_i, s_j) \le v^T \mathrm{diag}(s) v \le 2 v^T X^T X v = 2 \| X v \|_2^2 = \| X_i - X_j \|_2^2 \le 2\delta. \tag{34}
\]
If the within-group correlation is strong, $\delta$ is small and the inequality above implies that either $s_i$ or $s_j$ is small. Hence, the power of this method may be limited for strongly correlated features. Our numerical results in Sections 3.4 and 3.5 confirm this limitation.

3.2 Dai-Barber's group knockoff filter

In [6], Dai and Barber investigate a group-wise knockoff filter, which is a generalization of the knockoff filter. Assume that the columns of $X$ can be divided into $k$ groups $\{X_{G_1}, X_{G_2}, \dots, X_{G_k}\}$. The authors construct the group knockoff matrix according to
\[
\tilde X^T \tilde X = X^T X, \qquad \tilde X^T X = \Sigma - S, \qquad \Sigma = X^T X,
\]
where $S$ is group-block-diagonal, i.e. $S_{G_i, G_j} = 0$ for any two distinct groups $i \neq j$. Then let $S = \mathrm{diag}(S_1, S_2, \dots, S_k)$ with $S_i = \gamma \Sigma_{G_i, G_i} = \gamma X_{G_i}^T X_{G_i}$, $i = 1, 2, \dots, k$. The constraint $S \preceq 2\Sigma$ implies
\[
\gamma \cdot \mathrm{diag}(\Sigma_{G_1, G_1}, \Sigma_{G_2, G_2}, \dots, \Sigma_{G_k, G_k}) = S \preceq 2\Sigma.
\]
In order to maximize the difference between $X$ and $\tilde X$, $\gamma$ is chosen as large as possible: $\gamma = \min\{1,\ 2\lambda_{\min}(D \Sigma D)\}$, where $D = \mathrm{diag}(\Sigma^{-1/2}_{G_1, G_1}, \Sigma^{-1/2}_{G_2, G_2}, \dots, \Sigma^{-1/2}_{G_k, G_k})$. The group-wise statistic introduced in [6] can be obtained after the construction of the group knockoff matrix. The construction above guarantees group-wise exchangeability. Finally, group FDR control, i.e.
\[
\mathrm{FDR}_{group} \triangleq E\left[ \frac{\#\{ i : \beta_{G_i} = 0,\ i \in \hat S \}}{|\hat S| \vee 1} \right] \le q,
\]
is a result of group-wise exchangeability. Here $\hat S = \{ j : W_j \ge T \}$ is the set of selected groups for a chosen group statistic $W$.

3.3 PCA Reformulation

Assume that $X$ can be clustered into $k$ groups $X = (X_{C_1}, X_{C_2}, \dots, X_{C_k})$ in such a way that the within-group correlation is relatively strong while the between-group correlation is relatively weak. First, we apply the singular value decomposition (SVD) to decompose the feature vectors within each group into
\[
X_{C_i} = U_{C_i} D_i V_i^T, \qquad U_{C_i} \in O^{n \times c_i},\ D_i \in R^{c_i \times c_i},\ V_i \in O^{c_i \times c_i},\ c_i = |C_i|.
\]
Then we reformulate the linear model as follows:
