arxiv: v1 [stat.me] 11 Jun 2017
Some Analysis of the Knockoff Filter and its Variants

Jiajie Chen, Anthony Hou, Thomas Y. Hou

June 2017

Abstract

In many applications, we need to study a linear regression model that consists of a response variable and a large number of potential explanatory variables, and determine which variables are truly associated with the response. In 2015, Barber and Candès introduced a new variable selection procedure called the knockoff filter to control the false discovery rate (FDR) and proved that this method achieves exact FDR control. In this paper, we provide some analysis of the knockoff filter and its variants. Based on our analysis, we propose a PCA prototype group selection filter that has exact group FDR control and several advantages over existing group selection methods for strongly correlated features. Another contribution is a new noise estimator that can be incorporated into the knockoff statistic from a penalized method without violating the exchangeability property. Our analysis also reveals that some knockoff statistics, including the Lasso path and the marginal correlation statistics, suffer from the alternating sign effect. To overcome this deficiency, we introduce the notion of a good statistic and propose several alternative statistics that take advantage of the good statistic property. Finally, we present a number of numerical experiments to demonstrate the effectiveness of our methods and confirm our analysis.

1 Introduction

In many scientific endeavors, we need to determine, from a response variable together with a large number of potential explanatory variables, which variables are truly associated with the response. For this study to be meaningful, we need to make sure that the discoveries are indeed true and replicable. Thus it is highly desirable to obtain exact control of the false discovery rate (FDR) within a certain prescribed level.
In [3], Barber and Candès introduce a new variable selection procedure called the knockoff filter to control the FDR for a linear model. This method achieves exact FDR control in finite sample settings. One important property of this method is that its performance is independent of the design or covariates, the number of variables in the model, and the amplitudes of the unknown regression coefficients. Moreover, it does not require any knowledge of the noise level. A key observation is that by constructing knockoff variables that mimic the correlation structure found within the existing variables, one can obtain accurate FDR control. The method is very general and flexible. It can be applied to a number of statistics and has far more power (the proportion of true signals being discovered) than existing selection rules when the proportion of null variables is high.

Author affiliations: School of Mathematical Sciences, Peking University (ciaie@pku.edu.cn); Department of Statistics, Harvard University (ahou@college.harvard.edu); Applied and Computational Mathematics, Caltech (hou@cms.caltech.edu).

1.1 A brief review of the knockoff filter

Before we introduce the main results of our paper, we first provide a brief overview of the knockoff filter. Consider the following linear regression model

$$y = X\beta + \epsilon,$$

where the feature matrix $X$ is an $n \times p$ ($n \ge 2p$) matrix with full rank, its columns normalized to be unit vectors in the $\ell_2$ norm, and $\epsilon$ is a Gaussian noise $N(0, \sigma^2)$. The knockoff filter begins with the construction of a knockoff matrix $\tilde X$ that obeys

$$\tilde X^T \tilde X = X^T X, \qquad X^T \tilde X = X^T X - \mathrm{diag}(s), \tag{1}$$

where $s_i \in [0, 1]$. The positive definiteness of the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ requires

$$\mathrm{diag}(s) \preceq 2 X^T X. \tag{2}$$

The first condition in (1) ensures that $\tilde X$ has the same covariance structure as the original feature matrix $X$. The second condition in (1) guarantees that the correlations between distinct original and knockoff variables are the same as those between the originals. To ensure that the method has good statistical power to detect signals, we should choose $s_j$ as large as possible to maximize the difference between $X_j$ and its knockoff $\tilde X_j$. These two conditions are critical in guaranteeing that the distribution of a knockoff statistic is invariant when a particular pair $X_j, \tilde X_j$ is swapped. This is called the exchangeability property in [3].

The next step is to calculate a statistic $W_j$ for each pair $X_j, \tilde X_j$ using the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the marginal correlation $[X\ \tilde X]^T y$. The final step is to run the knockoff (knockoff+) selection procedure at level $q$:

$$T \triangleq \min\left\{ t > 0 : \frac{0/1 + \#\{j : W_j \le -t\}}{\#\{j : W_j \ge t\}} \le q \right\}, \qquad \hat S \triangleq \{ j : W_j \ge T \}, \tag{3}$$

where the offset in the numerator is 0 for the knockoff procedure and 1 for the knockoff+ procedure.

There are several ways to construct a statistic $W_j$. Among them, the Lasso path statistic is discussed in detail in [3]. It first fits a Lasso regression of $y$ on $[X\ \tilde X]$ for a list of regularizing parameters $\lambda$ in descending order and then records the first $\lambda$ at which a variable enters the model, i.e. $Z_j \triangleq \sup\{\lambda : \hat\beta_j(\lambda) \ne 0\}$ for feature $X_j$ and $\tilde Z_j = \sup\{\lambda : \tilde\beta_j(\lambda) \ne 0\}$ for its knockoff $\tilde X_j$. The Lasso path signed max statistic is defined as

$$W_j = \max(Z_j, \tilde Z_j) \cdot \mathrm{sign}(Z_j - \tilde Z_j).$$

The main result in [3] is that the knockoff procedure and the knockoff+ procedure have exact control of the mFDR and the FDR, respectively:

$$\mathrm{mFDR} \triangleq E\left[ \frac{\#\{j \in \hat S : \beta_j = 0\}}{\#\{j \in \hat S\} + q^{-1}} \right] \le q, \qquad \mathrm{FDR} \triangleq E\left[ \frac{\#\{j \in \hat S : \beta_j = 0\}}{\#\{j \in \hat S\} \vee 1} \right] \le q.$$

A knockoff filter for high-dimensional selective inference and model-free knockoffs have been recently established in [4,5].
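The selection rule (3) and the signed max statistic can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the candidate thresholds are restricted to the nonzero values of $|W_j|$, and the denominator is guarded with `max(..., 1)` to avoid division by zero, both of which are assumptions of this sketch.

```python
import numpy as np

def knockoff_threshold(W, q, plus=True):
    """Threshold T of rule (3); plus=True uses the knockoff+ offset of 1."""
    W = np.asarray(W, dtype=float)
    offset = 1.0 if plus else 0.0
    # Assumption: only the nonzero magnitudes |W_j| need to be tried as t.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        num = offset + np.sum(W <= -t)
        den = max(np.sum(W >= t), 1)  # guard against an empty selection
        if num / den <= q:
            return t
    return np.inf  # no threshold meets the level q: nothing is selected

def knockoff_select(W, q, plus=True):
    """Indices j with W_j >= T."""
    W = np.asarray(W, dtype=float)
    return np.flatnonzero(W >= knockoff_threshold(W, q, plus))

def signed_max(Z, Z_tilde):
    """Signed max statistic W_j = max(Z_j, Z~_j) * sign(Z_j - Z~_j)."""
    Z = np.asarray(Z, dtype=float)
    Z_tilde = np.asarray(Z_tilde, dtype=float)
    return np.maximum(Z, Z_tilde) * np.sign(Z - Z_tilde)
```

A single large negative $W_j$ raises the numerator for every $t \le |W_j|$, which is why many large negative statistics push $T$ up and reduce the number of selected features.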
This line of research has inspired a number of follow-up works [6,8,3 5].

1.2 Alternating sign effect and the notion of a good statistic

In this paper, we perform some analysis of the knockoff filter and some knockoff statistics, including the Lasso path and the marginal correlation statistics. Our analysis shows that the marginal correlation statistic and the Lasso path statistic suffer from the so-called alternating sign effect for certain design matrices whose features are only weakly correlated. The alternating sign effect refers to the existence of a feature $j$ that satisfies $\mathrm{sign}(\beta_j) \ne \mathrm{sign}(X_j^T r_\lambda)$, where $r_\lambda = y - X_E \hat\beta_E$ is the residue, $j \notin E$, and $E$ is the active set, i.e. $\{j : \hat\beta_j(\lambda) \ne 0\}$, with $\lambda$ being the regularizing parameter in front of the $\ell_1$ norm in the Lasso path method. In Section 2, we describe a general mechanism for generating the alternating sign effect for a family of design matrices. We show that the alternating sign effect can lead to large negative $W_j$ for strong features that are only weakly correlated. This limitation reduces the power of the knockoff filter for these statistics.

To alleviate this difficulty, we introduce the notion of a good statistic. Specifically, a knockoff statistic $W$ is called a good statistic if it satisfies the positivity of non-null features: for a fixed noise $\epsilon$, $W_j \ge 0$ if the signal amplitude $|\beta_j|$ is large enough relative to noise. Based on our analysis, we propose an alternative method, which we call the half penalized method. This method penalizes only $\hat\beta - \tilde\beta$ instead of penalizing both parameters,

$$\min_{\hat\beta, \tilde\beta}\ \frac{1}{2} \| y - X\hat\beta - \tilde X\tilde\beta \|_2^2 + P(\hat\beta - \tilde\beta),$$

hence the name half penalized method. This method takes full advantage of the property of the knockoff filter. In the case when $P(x) = \lambda \|x\|_1$ (or $P(x) = -\lambda \|x\|_1$), we obtain the half Lasso method (or the negative half Lasso), which reduces the $p$-dimensional $\ell_1$ optimization problem to $p$ one-dimensional optimization problems that can be solved explicitly using the soft threshold operator. We further prove that this new statistic and the least squares statistic satisfy the good statistic property and do not suffer from the alternating sign effect.

To gain some understanding of the performance of different statistics, we investigate a variety of knockoff statistics numerically, including the least squares, half Lasso, forward selection, orthogonal matching pursuit (OMP), and Lasso path statistics. In our simulations, the forward selection, the OMP, and the Lasso path statistics have similar power and computational cost. However, the alternating sign test in Section 2.4.2 shows that the Lasso path and the forward selection statistics suffer from the alternating sign effect and are less robust than the OMP statistic. Our simulations also show that the OMP statistic has more power than the least squares and the (negative) half Lasso statistics in the sparse case (the proportion of null features is large). The improvement of the OMP statistic over the least squares and the negative half Lasso statistics is not as significant in the non-sparse case. The OMP statistic seems to be the most robust among the six statistics that we consider. On the other hand, the OMP and other path statistics are computationally much more expensive. The computational cost of least squares and of the half Lasso is $O(np^2)$, while that of the Lasso path and the OMP statistic is $O(np^3)$.
If $p \gg 1$, the advantage of the OMP statistic over the negative half Lasso diminishes due to the increase in computational cost.

1.3 Extension of the sufficiency property and noise level estimate

In [3], the authors introduce the sufficiency property of a statistic $W$, which states that $W$ depends only on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the feature-response product $[X\ \tilde X]^T y$. We observe that in the definition of the sufficiency property, only part of the information in the response variable $y$, i.e. $[X\ \tilde X]^T y$, is utilized. By using the remaining information in $y$, we can incorporate a noise estimate into the statistic without violating the exchangeability property. More specifically, we generalize the sufficiency property by requiring that $W$ depends only on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the feature-response product $[X\ \tilde X\ U]^T y$ for any orthonormal matrix $U$ that satisfies $U^T [X\ \tilde X] = 0$. Moreover, we prove that if a statistic obeys the generalized sufficiency property and the antisymmetry property, then it satisfies the exchangeability property.

Inspired by the generalized sufficiency property, we propose to use the noise level $\sigma$ as a reference for the regularizing parameter and estimate the noise level as follows:

$$\hat\sigma \triangleq \| U^T y \|_2 / \sqrt{n - 2p},$$

where $U$ is an orthonormal matrix satisfying $U^T [X\ \tilde X] = 0$. Since $\hat\sigma$ depends only on $U^T y$, we can define a knockoff statistic $W$ that incorporates $\hat\sigma$ and satisfies the generalized sufficiency property. Consequently, we can use the estimated noise level in the knockoff filter without violating the exchangeability property, and FDR control is maintained.

1.4 A PCA prototype knockoff filter

We also introduce a PCA prototype knockoff filter for group selection that has exact group FDR control (defined in Theorem 3.) for strongly correlated features. More specifically, assume that $X$ can be clustered into $k$ groups $X = (X_{C_1}, X_{C_2}, \dots, X_{C_k})$ in such a way that within-group correlation is relatively strong but between-group correlation is relatively weak.
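The noise level estimate above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name is hypothetical, the augmented matrix $[X\ \tilde X]$ is assumed to have full column rank $2p$ with $n > 2p$, and $U$ is taken to be the last $n - 2p$ columns of a complete QR factorization, which is one convenient choice of an orthonormal basis of the orthogonal complement.

```python
import numpy as np

def estimate_noise_level(X_aug, y):
    """Estimate sigma as ||U^T y||_2 / sqrt(n - 2p).

    X_aug is the n x 2p augmented matrix [X, X_tilde] (assumed full column
    rank); U spans the orthogonal complement of its column space.
    """
    n, m = X_aug.shape  # m = 2p
    if n <= m:
        raise ValueError("need n > 2p to leave room for the residual space")
    # Complete QR: the first m columns of Q span col(X_aug), the rest are U.
    Q, _ = np.linalg.qr(X_aug, mode="complete")
    U = Q[:, m:]
    return np.linalg.norm(U.T @ y) / np.sqrt(n - m)
```

Because $U^T [X\ \tilde X] = 0$, the quantity $U^T y$ equals $U^T \epsilon$ and does not depend on $\beta$, so the estimate is unaffected by the signal.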
We first use the singular value decomposition (SVD) to decompose the feature vectors within each group, $X_{C_i} = U_{C_i} D_i V_i^T$, and then reformulate the linear model as follows:

$$y = \sum_{i=1}^{k} X_{C_i} \beta_{C_i} + \epsilon = \sum_{i=1}^{k} U_{C_i} \alpha_{C_i} + \epsilon.$$
We aim to pick out non-null groups $\beta_{C_i} \ne 0$ with exact group FDR control. To capture most of the information and reduce redundant features in each group, we choose the first principal component $U_{C_i,1}$ as a prototype of this group and then construct knockoff pairs on the prototype set $U_P = (U_{C_1,1}, U_{C_2,1}, \dots, U_{C_k,1})$, $|P| = k$. Specifically, we denote by $Q = \{1, 2, \dots, p\} \setminus P$ the remaining part, $U = [U_P, U_Q]$, and then construct the knockoff matrix $\tilde U = [\tilde U_P, U_Q]$ as follows (we choose $\tilde U_Q = U_Q$ since we do not select features in $U_Q$):

$$\tilde U^T \tilde U = U^T U, \qquad U^T U - \tilde U^T U = \mathrm{diag}(s_P, 0),$$

where we apply the localized knockoff construction from [5] to increase the amplitude of $s_P$. Inspired by [3], we implement the standard knockoff procedure on $y$ and $[U_P, \tilde U_P]$ and calculate the knockoff statistic $W_P = \{W_{C_1,1}, W_{C_2,1}, \dots, W_{C_k,1}\}$. Finally, we run the knockoff filter on $W_{C_j,1}$, $1 \le j \le k$, to select groups.

Moreover, we can prove that the PCA prototype knockoff filter has the same group FDR control for the original feature matrix as Dai-Barber's group knockoff filter [6]. Compared to Dai-Barber's group knockoff filter, our PCA method achieves greater computational efficiency since the augmented design matrix in our method is $n \times 2k$, which is much smaller than the $n \times 2p$ matrix in Dai-Barber's method if $p \gg k$. Since the most significant computational cost in implementing the knockoff filter with a path statistic comes from regressing $y$ on the augmented design matrix in an iterative manner, a smaller augmented design matrix leads to greater computational efficiency. Note that the group statistic for group $C_j$ is $W_{C_j,1}$, which is different from that in Dai-Barber's group knockoff filter [6].

1.5 Comparison with other existing works

There are several recent works that have an objective similar to ours. Our work is inspired by Barber and Candès' knockoff filter as well as by Reid-Tibshirani's prototype knockoff filter and Dai-Barber's group knockoff filter [3,4,6,3].
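The prototype extraction step can be illustrated as below. This is only a sketch of the first step (one SVD per group and selection of the first left singular vector); it does not construct the knockoffs $\tilde U_P$. The function name and the representation of groups as lists of column indices are assumptions made for illustration.

```python
import numpy as np

def pca_prototypes(X, groups):
    """Return the n x k matrix U_P of first principal components.

    groups is a list of k lists of column indices (the clusters C_1..C_k).
    For each group, X_C = U_C D V^T and the prototype is the first column
    of U_C (defined up to sign).
    """
    prototypes = []
    for idx in groups:
        U_C, _, _ = np.linalg.svd(X[:, idx], full_matrices=False)
        prototypes.append(U_C[:, 0])
    return np.column_stack(prototypes)
```

The resulting matrix has only $k$ columns, so the augmented design $[U_P, \tilde U_P]$ is $n \times 2k$ rather than $n \times 2p$, which is the source of the computational saving described above.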
We show in Section 3.4 that our PCA prototype filter has more power than Reid-Tibshirani's prototype knockoff filter. When the between-group correlation is zero and the within-group correlation is strong, we analyze why the PCA prototype filter performs much better than Reid-Tibshirani's prototype filter. We also show that the performance of the PCA prototype filter is comparable to that of Dai-Barber's group knockoff filter, but with greater computational efficiency if $p \gg k$. More details on these two methods and their comparison with ours can be found in Section 3.

We note that a localized knockoff filter has been proposed by Xu et al. in [5], in which they construct a modified knockoff matrix that has FDR control for a subset of the feature vectors. Although this localized knockoff filter guarantees FDR control, it still suffers a loss in power for strongly correlated features.

There are several feature selection methods that offer some level of FDR control, see e.g. [,, 7, 9 ]. We refer to [3] for a thorough comparison between the knockoff filter and these approaches. This paper focuses on the knockoff filter and does not consider these other approaches.

The rest of the paper is organized as follows. In Section 2, we analyze the alternating sign effect for the Lasso path, the marginal correlation, and the forward selection statistics. We also introduce the notion of a good statistic and show that the least squares method and the half penalized method produce good statistics. Moreover, we generalize the sufficiency property of a knockoff statistic and propose a new method to estimate the noise level. In Section 3, we introduce our PCA prototype filter for highly correlated features. We compare it to other group knockoff filters and provide numerical experiments to demonstrate the performance of various methods.

2 Alternating sign effect, good statistics, a half penalized method

In this section, we perform some analysis of the knockoff filter. Our analysis reveals some limitations of several statistics associated with the knockoff filter. Based on our understanding of these limitations, we propose some modifications of the knockoff filter to alleviate these difficulties.
2.1 Construction of the knockoff matrix

First, we review the construction of the knockoff matrix. In [3], the authors give a simple construction of the knockoff matrix $\tilde X$. It may seem that other, alternative constructions of $\tilde X$ are possible. In the following proposition, we show that, given $s_i$, different constructions are essentially the same.

Proposition 2.1. Let $\Sigma = X^T X$. Then

$$[X\ \tilde X]^T [X\ \tilde X] = \begin{bmatrix} \Sigma & \Sigma - \mathrm{diag}(s) \\ \Sigma - \mathrm{diag}(s) & \Sigma \end{bmatrix} \tag{4}$$

if and only if

$$\tilde X = X\left(I - \Sigma^{-1} \mathrm{diag}(s)\right) + UC, \tag{5}$$

where $U \in \mathbb{R}^{n \times (n-p)}$ is an orthonormal matrix whose column space is orthogonal to that of $X$, i.e. $U^T X = 0$, and $C \in \mathbb{R}^{(n-p) \times p}$ satisfies $C^T C = 2\,\mathrm{diag}(s) - \mathrm{diag}(s) \Sigma^{-1} \mathrm{diag}(s)$.

We defer the proof of the above proposition to the Appendix. The knockoff matrix $\tilde X$ presented in [3] has the same form as (5) except that $U \in \mathbb{R}^{n \times p}$ and $C \in \mathbb{R}^{p \times p}$ in their formula. Using Proposition 2.1, we can reproduce the result in [3] by choosing an orthonormal matrix $U = (U_1\ U_2) \in \mathbb{R}^{n \times (n-p)}$, $U_1 \in \mathbb{R}^{n \times p}$, $U_2 \in \mathbb{R}^{n \times (n-2p)}$, whose column space is orthogonal to that of $X$, and

$$C = \begin{pmatrix} C_1 \\ 0 \end{pmatrix} \in \mathbb{R}^{(n-p) \times p}, \quad C_1 \in \mathbb{R}^{p \times p}, \quad C^T C = C_1^T C_1 = 2\,\mathrm{diag}(s) - \mathrm{diag}(s) \Sigma^{-1} \mathrm{diag}(s).$$

The identity $UC = (U_1\ U_2) \begin{pmatrix} C_1 \\ 0 \end{pmatrix} = U_1 C_1$ and Proposition 2.1 reproduce $\tilde X$ in [3].

2.2 Alternating sign effect for the marginal correlation statistic

In this section, we discuss the alternating sign effect for certain statistics and propose alternative statistics that do not suffer from this effect. According to (3), the knockoff filter threshold $T$ is determined by the ratio of the numbers of large negative and large positive $W_j$'s. Using this threshold, the knockoff filter selects large positive statistics $W_j > T$ and rejects all negative $W_j$'s. For the knockoff filter to achieve its power, the $W_j$'s should be large and positive for $\beta_j \ne 0$ so that the knockoff filter can pick out such features. Large negative $W_j$'s result in a large $T$ and fewer selected features, which leads to a decrease in power. Our analysis shows that for some feature designs, certain knockoff statistics may yield large negative $W_j$'s for non-null $j$, which would decrease the power of the knockoff filter.
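Formula (5) can be checked numerically with a short sketch. The code below is an illustration under explicit assumptions, not the construction used in the paper's experiments: the function name is hypothetical, $s$ is chosen as the constant vector $\min(\lambda_{\min}(\Sigma), 1)$ (a conservative equi-correlated choice that keeps $C^T C$ positive definite), $U_1$ is taken from a complete QR factorization of $X$, and $C_1$ comes from a Cholesky factorization.

```python
import numpy as np

def make_knockoffs(X):
    """Build X_tilde = X (I - Sigma^{-1} diag(s)) + U_1 C_1 as in (5).

    Assumes X is n x p with full column rank, unit-norm columns, n >= 2p.
    Returns (X_tilde, s).
    """
    n, p = X.shape
    if n < 2 * p:
        raise ValueError("this sketch needs n >= 2p")
    Sigma = X.T @ X
    lam_min = np.linalg.eigvalsh(Sigma)[0]  # smallest eigenvalue
    # Assumption: constant s, small enough that C^T C below is positive definite.
    s = np.full(p, min(lam_min, 1.0))
    S = np.diag(s)
    Sigma_inv_S = np.linalg.solve(Sigma, S)
    # C_1^T C_1 = 2 diag(s) - diag(s) Sigma^{-1} diag(s)
    M = 2.0 * S - S @ Sigma_inv_S
    C1 = np.linalg.cholesky(M).T  # upper factor, so C1.T @ C1 == M
    # U_1: p orthonormal columns orthogonal to col(X).
    Q, _ = np.linalg.qr(X, mode="complete")
    U1 = Q[:, p:2 * p]
    X_tilde = X @ (np.eye(p) - Sigma_inv_S) + U1 @ C1
    return X_tilde, s
```

With this output one can verify directly that $\tilde X^T \tilde X = \Sigma$ and $X^T \tilde X = \Sigma - \mathrm{diag}(s)$, i.e. the block structure in (4).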
We use the marginal correlation statistic to illustrate the alternating sign effect. The following example shows that the marginal correlation statistic could lose its power even for strong signals. Design matrix and signal amplitude Let A,B be a partition of {,,..,p}, i.e. A B = {,,..,p}, A B =. We choose a feature matrix X that satisfies X i,x = ρ for i if i and belong to the same set A or B and X i,x = ρ for i if i and belong to two different sets. A concrete example that satisfies the above design criterion is given as follows: X v ai p R n p, where v R p, v i = { λa i A λa i B, λ = ρ ρ. (6) We take ρ = for simplicity. Once the knockoff matrix is constructed, we have the relation X i = s i, < s i. The value of s i is not small because columns of X are only weakly X T i 5
6 correlated. Since the knockoff matrix is constructed without any knowledge of y and the coefficient β, we can choose any β after X is constructed. Next, we take β i = {.9M s i i A,s i, M s i i B,s i, (7) and β i = if s i =, where M is a parameter that is used to control the signal amplitude. In the following discussion, we set M = and assume that the number of s i = is either or small. Derivation of the marginal correlation statistic Let S A,S B be the sum of β i in group A, B, respectively, i.e S A = i A β i, S B = i B β i. Assume that the noise level σ is small compared to M, say σ =.3 (otherwise, we can multiply all β by a large constant) and y = Xβ + ǫ, ǫ N(,σ I p ). Under this setting, we first calculate the marginal correlation in A (the case for B can be carried out similarly) X T k y = i A X T k X iβ i + i B X T k X iβ i +X T k ǫ = S A +β k S B +XT k ǫ, X T k y = i A X T k X iβ i + i B X T k X iβ i + X T k ǫ = S A +( s k)β k S B + X T k ǫ. Further, we assume that S A S B is large compared to all β k and S A S B > (this can be done if we choose different sizes for A,B, such as A = B ). From the assumption that the noise level σ is small compared to M, sign(x T k y),k A depends on sign(s A S B ) and we have an explicit expression for W, A with large probability (noise is too small to affect the sign) k A, W k = Xk T y X k T y = S A S B + β k +XT k ǫ S A S B +( s k)β k + X k T ǫ [ = p sign(s A S B ) ( S A S B + β k +XT k ǫ) (S A S B +( ] s k)β k + X k T ǫ) ( = p sign(s A S B ) s k β k +(X k X ) k ) T ǫ, (8) where we have used the notation = p to denote an identity that holds with large probability. Based on the symmetry, sign(xk Ty),k B depends on sign(s B S A ) and ( W k = p sign(s B S A ) s k β k +(X k X ) k ) T ǫ, k B. (9) By using the signal amplitude defined in (7) and the assumption S A > S B, we have the expression {.9+(X k W k = X k ) T ǫ, k A,s k, p (X k X k ) T () ǫ, k B,s k. 
Since the noise level is small compared to the signal amplitude, the estimate above shows that W, A are approximately.9 and W, B are approximately with large probability. Selection If T >.95, the features selected by the knockoff filter are Ŝ = { : W T} { : W.95} =. In this case, no features will be selected. Now we consider the case of T.95. The definition of the threshold (3) implies that q /+#{W T} #{W T} = /+ B #{W T} /+ B = q A /+ B. () A 6
7 If we further take q A < B, this would contradict () and thus T must be greater than.95. As a result, no features will be selected. Note that taking q A < B does not contradict with the previous assumption on A, B that A > γ B,γ >, which guarantees S A > S B. If we assume that S A S B is large compared to all β k, we conclude from (8) and (9) that all features in either A or B are not selected according to the knockoff procedure (only positive statistics will be selected). This example illustrates that the marginal correlation statistic cannot exploit the knockoff power dueto thelarge negative W for asignificant numberof thetruefeatures. The mechanism for generating the alternating sign effect Next, we describe a more general mechanism that could lead to the alternating sign effect. First of all, such a feature matrix can be clustered into two groups A and B. Secondly, the features from the same group are positively correlatedandthosefromdifferentgroupsarenegatively correlated, i.e. X i,x > if(i,) A A orb B and X i,x < if(i,) A B. Let X betheknockoffmatrix. Without lossofgenerality, we may assume that X X, which implies that s = X T X. To see why such a feature matrix may suffer from the alternating sign effect, we generate the signal β by setting β i = M/s i. By definition, (X i X i ) T y = s i β i +(X i X i ) T ǫ N(M, s i σ ). Assume that the noise level σ is small enough. If Xi Ty <, we obtain W i = Xi Ty X i Ty XT i y XT i y M < and thus the non-null feature i is reected by the knockoff filter. A similar result holds for i B. Next, we find out under what condition we have Xi Ty <. Denote S A(i) Xi T( A X β ) ands B (i) Xi T( B X β ). Usingthecorrelation structureofx andthedefinitionofs A,S B,β, we have Xi Ty = S A(i) S B (i)+xi Tǫ if i A and XT i y = S B(i) S A (i)+xi T ǫ if i B. One can interpret S A (i) as a weighted sum of β, A with weight Xi TX. Similarly, S B (i) is a weighted sum of β, B. 
If the noise level σ is small enough, Xi TX does not vary much and the size of one group is larger than the size of another group, e.g. B < A, it is likely that S B (i) < S A (i) for some i B. As a result, the features in group B may not be picked out, which reduces the power. In the previous example, we construct a special example of X that satisfies X i,x =.5 and B < A. We define the signal β in a similar way. Equation () ustifies that the features in group B are not selected by the knockoff filter. In Section.4., we construct another example to show that the Lasso path and the forward selection statistics suffer from the alternating sign effect. Another mechanism for generating the alternating sign effect is when the columns of a design matrix X are all positively correlated. In this case, we can apply the same argument as above by choosing the signal via β i = M/s i,i A and β i = M/s i,i B, where (A,B) is a partition of,,..,p. For these two types of design matrices, one needs to choose a statistic that will not suffer from the alternating sign effect. Testing the alternating sign effect for the marginal correlation statistic To confirm our previous analysis, we choose the group size of A, B to be A = 6, B = 4 with and 8 signals in each group, which corresponds to % sparsity. We draw the rows of X from a multivariate normal distribution with mean and covariance matrix Σ, which satisfies Σ ii =, Σ i = ρ for i in the same group, and Σ i = ρ for i in a different group. We then normalize the columns. The correlation factor is ρ =.5, the noise level is σ =, and the signal amplitude is β i = {.9M s i M s i i S tr A, i S tr B, where S tr is the set of true signals. We assume that s i constructed by SDP is nonzero. Otherwise, we generate another design matrix X N(,Σ) and then construct another group of s i by SDP. 
To study the alternating sign effect, we compare the performance of the least squares statistic W ls = ˆβ ls β ls and the marginal correlation statistic Wmc = X Ty X T y using the knockoff and the knockoff+ filters at the nominal FDR q = %. We then vary the signal parameter M =,,3,..., and repeat each experiment times. The results are summarized in Table. 7
8 LS: FDR(%) MC: FDR(%) LS:knockoff+power(%) MC: knockoff+(%) M (mfdr(%)) (mfdr(%)) (knockoff power(%)) (knockoff(%)) 9.3 (8.9).6 (.93) 46.4 (47.). (.73) 9.48 (9.46). (.) (93.95).3 (.47) (9.78). (.) (99.66). (.33) (9.6). (.) (99.98). (.3) 5. (.8). (.). (.). (.) (9.5). (.). (.). (.8) (9.4). (.). (.). (.6) 8.34 (.8). (.). (.). (.3) (9.79). (.). (.). (.) 9.76 (9.8). (.). (.). (.) Table : Alternating sign effect of the marginal correlation statistic, nominal FDR q = %. LS: the least squares, MC: the marginal correlation statistic. We focus on the power of the two statistics. The results from Table show that the marginal correlation statistic loses most of its power and can hardly discover any true signal while the least squares statistic maintains about % power in this test. Thus, the marginal correlation statistic suffers from the alternating sign effect, which is consistent with the analysis above..3 Potential challenge of the path method statistics In this subsection, we point out a potential challenge for the path method statistics. To demonstrate this, we first observe that the knockoff matrix properties imply (X i X i ) T y = (X i X i ) T (Xβ +ǫ) = s i β i +(X i X i ) T ǫ, i p. The right hand side also appears in many path method statistics, including the Lasso path, the forward selection, and the orthogonal matching pursuit statistics. We now illustrate the potential difficulty that we may encounter for a path method statistic. After performing l steps in one of the path methods (or at λ for the Lasso path), we use E to denote the set of features that have entered the model. We assume that E does not include X, X at the lth step, but at the next step either X or X will enter the model. After l steps, the residue is r l = y X EˆβE. Since X, X / E, we have X T X i = X X i, X i E. The same equality holds for X i. 
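For $X_j, \tilde X_j$, their marginal correlation with $r_l$ determines which one of these two features will enter the model first at the $(l+1)$st step:

$$X_j^T r_l = X_j^T (y - X_E \hat\beta_E) = X_j^T (X\beta - X_E \hat\beta_E) + X_j^T \epsilon,$$
$$\tilde X_j^T r_l = \tilde X_j^T (y - X_E \hat\beta_E) = \tilde X_j^T (X\beta - X_E \hat\beta_E) + \tilde X_j^T \epsilon,$$
$$(X_j - \tilde X_j)^T r_l = (X_j - \tilde X_j)^T y = s_j \beta_j + (X_j - \tilde X_j)^T \epsilon.$$

Assume that the noise level is relatively small. If $\mathrm{sign}(\beta_j) \ne \mathrm{sign}\big(X_j^T (X\beta - X_E \hat\beta_E)\big)$ and $|X_j^T (X\beta - X_E \hat\beta_E)| > |s_j \beta_j|$, then $\tilde X_j$ will enter the model at the $(l+1)$th step since

$$|\tilde X_j^T r_l| - |X_j^T r_l| \approx \big|X_j^T (X\beta - X_E \hat\beta_E) - s_j \beta_j\big| - \big|X_j^T (X\beta - X_E \hat\beta_E)\big|$$
$$= \mathrm{sign}\big(X_j^T (X\beta - X_E \hat\beta_E)\big) \Big[ \big(X_j^T (X\beta - X_E \hat\beta_E) - s_j \beta_j\big) - X_j^T (X\beta - X_E \hat\beta_E) \Big] = |s_j \beta_j| > 0.$$

This may reduce the power of the knockoff filter. We call this effect the alternating sign effect.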
9 Definition. (Alternating sign effect). Let r l denote the residue at the lth step in a path method statistic or y in the marginal correlation statistic. The alternating sign effect refers to the existence of feature that satisfies sign(β ) sign(x T r l). In the counterexample above for the marginal correlation statistic, the design matrix X and signal coefficient β are constructed to generate the alternating sign effect. From our discussion, the alternating sign effect can lead to large negative W and reduce the power of the knockoff filter..4 Alternating sign effect on the Lasso path and other knockoff statistics We will construct an example in which the Z-score is large enough to reect the null hypothesis. For this example, some knockoff statistics can only pick out a small subset of the false nulls..4. The Z-score and signal amplitude ˆβ ls /σ (Σ ), TheZ-scoreofaclassicallinearmodely = Xβ+ǫ, ǫ N(,σ I p ), isdefinedbyz = where ˆβ ls is the least squares coefficient of regressing y on X. Obviously, Z N(,), β =. In our example and numerical experiments to be presented later, we choose σ = and β i = M/s i for s i and β i = for s i =. This setting guarantees that the Z-score of a false null is large. In fact, we have the following estimate for Z. Lemma.3. Let σ = and Z = ˆβ ls / (Σ ). For any : β defined above we have Z ξ + M s ξ + M, ξ N(,). This result shows that for large amplitude, M, the Z-score of the false null is large enough to reect the null hypothesis. We defer the proof of this lemma to the Appendix..4. An example to illustrate the alternating sign effect for several knockoff statistics In this subsection, we construct an example to demonstrate that the Lasso path and the forward selection statistics could lose their power due to the alternating sign effect. 
In our example, the feature matrix, X, consists of four groups X = (X A,X A,X B,X B ) with correlations given as follows ρ (i,) (A A)\(A A ) or (i,) (B B)\(B B ), i X i,x = ρ (i,) A B ρ (i,) A A or (i,) B B where A = A A, B = B B, and ρ > ρ. For example, in the case A = A = B = B =, the Gram matrix of X has the following structure ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ X T X = Σ = ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ. ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ ρ Given the above covariance matrix Σ, the rows of X N(,Σ) with columns normalized. 9
10 Testing the alternating sign effect for different statistics We perform numerical experiments for the example above. Let A = B =, A = k, B = k, ρ =.3, ρ =.9, M = 6 (recall β = M/s ), the noise level σ = and the nominal FDR q = %. We compare the performance of five statistics: the knockoff least squares, the weighted half Lasso defined in (5) of Section.5. (λ =.5, Z = diag{ s /, s /,..., s p /}), the Lasso path, the forward selection, and the orthogonal matching pursuit (OMP) statistics. Here, λ is the regularizing parameter in the penalized model. We use the difference statistic W = ˆβ β for the least squares, the signed max statistic for the half Lasso, the Lasso path, the forward selection and the OMP statistics, i.e. W = max( ˆβ, β ) sign( ˆβ β ), where ˆβ, β is the enter time (the step at which the original and knockoff feature enters) in the path statistic or the solution of the half Lasso. We vary thesize parameter k =,4,.., but keep thesparsity level at %, e.g. thenumber of true features is 7. = 4 if k =. All the signals are randomly selected from {,,..,p}. We run each experiment times and present the results in the left panel of Figure. Smaller size and more trials We rerun this same experiment with a smaller size but a larger number of trials. Let A = B = 4, A = k, B = k, k = 4,8,,...,4 and the sparsity level at %. Each experiment is repeated times. The results are plotted in the right panel of Figure. Power(%) Least Square Half Lasso(λ=.5) Lasso Path Forward Selection OMP FDR Size of (A,B )=(k,k): k (Large Test, trials) Size of (A,B )=(k,k): k (Small Test, trials) Figure : Alternating sign tests for five methods at the nominal FDR q = % with varying sizes of special groups. Left and right subplots show first and second test s results, respectively. First test s feature size (left subplot) is five times that of the second test s (right subplot). 
We focus on the power of the Lasso path and the forward selection statistics and find that these methods lose most of their power in this alternating sign test, which confirms that they suffer from the alternating sign effect. The other methods maintain nearly full power with the desired FDR control. The FDR of the Lasso path and the forward selection statistics in the left subplot is not stable, which can be attributed to the relatively small number of trials. This example indicates that the alternating sign effect could be a problem for the knockoff filter for certain statistics.

2.5 Notion of good statistic and a half penalized method

2.5.1 Good statistic

In the previous section, we showed that certain knockoff statistics suffer from the alternating sign effect (i.e. loss of power), even for strong signals that are only weakly correlated, due to the many large negative knockoff statistics $W_j$ that are generated. Based on the knockoff property, we propose an alternative statistic that does not suffer from the alternating sign effect. We first introduce the notion of a good statistic.
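Definition 2.4 (Good statistic). A knockoff statistic $W$ is a good statistic if it satisfies the positivity of non-null features: for fixed noise $\epsilon$, $W_j \ge 0$ if the signal amplitude $|\beta_j|$ is large enough relative to the noise.

The assumption of a good statistic ensures that $W_j$ is non-negative for a strong signal, so that feature $j$ can potentially be selected by the knockoff filter.

Least squares statistic. An example of a good statistic is the least squares statistic. Denote the Gram matrix $G = [X\ \tilde X]^T [X\ \tilde X]$. We observe that the original construction of $\mathrm{diag}(s)$ used in [3] could lead to a singular Gram matrix. To alleviate this difficulty, we modify the criterion used to construct $\mathrm{diag}(s)$:
$$\alpha\,\lambda_{\min}(\Sigma)\, I_p \preceq \mathrm{diag}(s) \preceq \tfrac{3}{2}\Sigma,$$
with $\alpha$ the lower-bound constant of Section 2.7. We will discuss different construction criteria in Section 2.7. The least squares coefficients $\hat\beta, \tilde\beta$ obtained by regressing $y$ on $[X, \tilde X]$ satisfy
$$\begin{pmatrix} \hat\beta \\ \tilde\beta \end{pmatrix} = \begin{pmatrix} \beta \\ 0 \end{pmatrix} + G^{-1} \begin{pmatrix} X^T \epsilon \\ \tilde X^T \epsilon \end{pmatrix} \triangleq \begin{pmatrix} \beta + \eta^{(1)} \\ \eta^{(2)} \end{pmatrix}.$$
We show that $W_j \triangleq |\hat\beta_j| - |\tilde\beta_j|$ satisfies the definition of a good statistic. In fact, if $\beta_j \neq 0$, we have $W_j = |\beta_j + \eta^{(1)}_j| - |\eta^{(2)}_j|$. For a fixed noise $\epsilon$, $\eta^{(1)}, \eta^{(2)}$ are fixed and thus $W_j$ is positive if $|\beta_j|$ is large enough. In general, if $|\beta_j|$ is large compared to the noise level $\sigma$, then $W_j > 0$ with large probability, since $\mathrm{var}(\eta^{(1)}_j) = \sigma^2 (G^{-1})_{jj}$ and $\mathrm{var}(\eta^{(2)}_j) = \sigma^2 (G^{-1})_{j+p,\,j+p}$.

The following formula for the least squares coefficients $\hat\beta, \tilde\beta$ will be useful in later sections:
$$\begin{pmatrix} \hat\beta + \tilde\beta - \beta \\ \hat\beta - \tilde\beta - \beta \end{pmatrix} = \left( \left[ \tfrac{X + \tilde X}{2}, \tfrac{X - \tilde X}{2} \right]^T \left[ \tfrac{X + \tilde X}{2}, \tfrac{X - \tilde X}{2} \right] \right)^{-1} \begin{pmatrix} \left(\tfrac{X + \tilde X}{2}\right)^T \epsilon \\ \left(\tfrac{X - \tilde X}{2}\right)^T \epsilon \end{pmatrix} = \begin{pmatrix} \left(\Sigma - \tfrac{S}{2}\right)^{-1} \left(\tfrac{X + \tilde X}{2}\right)^T \epsilon \\ \left(\tfrac{S}{2}\right)^{-1} \left(\tfrac{X - \tilde X}{2}\right)^T \epsilon \end{pmatrix}, \quad S = \mathrm{diag}\{s_1, s_2, \ldots, s_p\}. \tag{12}$$

2.5.2 A half penalized method

In this subsection, we introduce a half penalized model based on the knockoff property. This method naturally suggests a good statistic. Consider the following penalized model
$$\min_{\hat\beta, \tilde\beta}\ \tfrac{1}{2} \| y - X\hat\beta - \tilde X \tilde\beta \|_2^2 + P(\hat\beta - \tilde\beta) + Q(\hat\beta + \tilde\beta), \tag{13}$$
where $P(x)$ and $Q(x)$ are even functions. The statistic defined by $W_j = |\hat\beta_j| - |\tilde\beta_j|$ or $W_j = \max(|\hat\beta_j|, |\tilde\beta_j|) \cdot \mathrm{sign}(|\hat\beta_j| - |\tilde\beta_j|)$ satisfies the sufficiency and the antisymmetry properties, since swapping $X_j, \tilde X_j$ leads to swapping $\hat\beta_j, \tilde\beta_j$.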
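Let $\hat\beta^{ls}, \tilde\beta^{ls}$ be the least squares coefficients obtained by regressing $y$ on $[X, \tilde X]$. We denote by $r \triangleq y - X\hat\beta^{ls} - \tilde X \tilde\beta^{ls}$ the residual. The geometric property of the least squares method implies $r \perp X, \tilde X$, which leads to
$$y - X\hat\beta - \tilde X\tilde\beta = r + X(\hat\beta^{ls} - \hat\beta) + \tilde X(\tilde\beta^{ls} - \tilde\beta) = r + \frac{X + \tilde X}{2} \bigl(\hat\beta^{ls} + \tilde\beta^{ls} - \hat\beta - \tilde\beta\bigr) + \frac{X - \tilde X}{2} \bigl(\hat\beta^{ls} - \tilde\beta^{ls} - (\hat\beta - \tilde\beta)\bigr).$$
The residual $r$ does not depend on $\hat\beta$ and $\tilde\beta$. Thus we can exclude $r$ from the penalized model (13). Note that the constraint on the knockoff matrix implies an important property:
$$(X + \tilde X)^T (X - \tilde X) = 0, \qquad X + \tilde X \perp X - \tilde X. \tag{14}$$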
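The orthogonality property above is easy to check numerically. The sketch below builds a knockoff matrix with the equi-correlated construction of Barber and Candès, which is used here only as a convenient assumption (any valid knockoff matrix would do); the function name `equicorrelated_knockoffs`, the `shrink` factor and the random design are hypothetical choices made for this illustration.

```python
import numpy as np


def equicorrelated_knockoffs(X, shrink=0.9, seed=0):
    """Return (X_tilde, s) with X_tilde^T X_tilde = Sigma and
    X^T X_tilde = Sigma - diag(s), where Sigma = X^T X.
    Assumes n >= 2p, unit-norm columns and full column rank."""
    n, p = X.shape
    Sigma = X.T @ X
    lam_min = np.linalg.eigvalsh(Sigma)[0]
    # Shrink slightly so that 2*diag(s) - diag(s) Sigma^{-1} diag(s) stays positive definite.
    s = np.full(p, shrink * min(2.0 * lam_min, 1.0))
    S = np.diag(s)
    Sigma_inv_S = np.linalg.solve(Sigma, S)
    # C satisfies C^T C = 2S - S Sigma^{-1} S (Cholesky returns the lower factor).
    C = np.linalg.cholesky(2.0 * S - S @ Sigma_inv_S).T
    # U_tilde: p orthonormal columns orthogonal to the column space of X.
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(np.hstack([X, rng.standard_normal((n, p))]))
    U_tilde = Q[:, p:2 * p]
    X_tilde = X @ (np.eye(p) - Sigma_inv_S) + U_tilde @ C
    return X_tilde, s


rng = np.random.default_rng(1)
X = rng.standard_normal((60, 5))
X /= np.linalg.norm(X, axis=0)  # normalize the columns
X_tilde, s = equicorrelated_knockoffs(X)

cross = (X + X_tilde).T @ (X - X_tilde)       # should vanish
diff_gram = (X - X_tilde).T @ (X - X_tilde)   # should equal 2 * diag(s)
print(np.abs(cross).max())
print(np.abs(diff_gram - 2.0 * np.diag(s)).max())
```

Both printed quantities should be at the level of floating-point round-off. The second check also illustrates why the columns of X - X_tilde are mutually orthogonal, which is what allows the penalized problem to decouple coordinate by coordinate.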
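The orthogonality property (14) enables us to separate the squared norm of the left hand side into the sum of three terms, because the three components are mutually orthogonal:
$$\| y - X\hat\beta - \tilde X\tilde\beta \|_2^2 = \| r \|_2^2 + \Bigl\| \frac{X + \tilde X}{2} \bigl(\hat\beta^{ls} + \tilde\beta^{ls} - \hat\beta - \tilde\beta\bigr) \Bigr\|_2^2 + \Bigl\| \frac{X - \tilde X}{2} \bigl(\hat\beta^{ls} - \tilde\beta^{ls} - (\hat\beta - \tilde\beta)\bigr) \Bigr\|_2^2.$$
We can then rewrite (13) in the following equivalent form:
$$\min_{\hat\beta, \tilde\beta}\ \frac{1}{2} \Bigl\| \frac{X + \tilde X}{2} \bigl(\hat\beta^{ls} + \tilde\beta^{ls} - \hat\beta - \tilde\beta\bigr) \Bigr\|_2^2 + \frac{1}{2} \Bigl\| \frac{X - \tilde X}{2} \bigl(\hat\beta^{ls} - \tilde\beta^{ls} - (\hat\beta - \tilde\beta)\bigr) \Bigr\|_2^2 + P(\hat\beta - \tilde\beta) + Q(\hat\beta + \tilde\beta), \tag{15}$$
which can be further reformulated as two equivalent subproblems
$$\min_{\hat\alpha}\ \frac{1}{2} \Bigl\| \frac{X + \tilde X}{2} \bigl(\hat\beta^{ls} + \tilde\beta^{ls} - \hat\alpha\bigr) \Bigr\|_2^2 + Q(\hat\alpha), \tag{16}$$
$$\min_{\tilde\alpha}\ \frac{1}{2} \Bigl\| \frac{X - \tilde X}{2} \bigl(\hat\beta^{ls} - \tilde\beta^{ls} - \tilde\alpha\bigr) \Bigr\|_2^2 + P(\tilde\alpha), \tag{17}$$
where we have replaced $\hat\beta + \tilde\beta$ and $\hat\beta - \tilde\beta$ in (15) by $\hat\alpha$ and $\tilde\alpha$, respectively. A key observation for the knockoff filter is that the column vectors of $X - \tilde X$ are mutually orthogonal, since
$$(X - \tilde X)^T (X - \tilde X) = 2\Sigma - 2(\Sigma - \mathrm{diag}(s)) = 2\,\mathrm{diag}(s). \tag{18}$$
Consequently, the second subproblem reduces to
$$\min_{\tilde\alpha} \sum_{i=1}^p \frac{1}{8} \| X_i - \tilde X_i \|_2^2 \bigl(\tilde\alpha_i - (\hat\beta^{ls}_i - \tilde\beta^{ls}_i)\bigr)^2 + P(\tilde\alpha) = \min_{\tilde\alpha} \sum_{i=1}^p \frac{s_i}{4} \bigl(\tilde\alpha_i - (\hat\beta^{ls}_i - \tilde\beta^{ls}_i)\bigr)^2 + P(\tilde\alpha). \tag{19}$$
If $P(x)$ can be expressed as $P(x) = \sum_{i=1}^p P_i(x_i)$, we can solve (19) easily by solving $p$ one-dimensional optimization problems separately.

Example 1: A half penalized method and a good statistic. We construct a good statistic to make sure that $W_j > 0$ for a true feature. We choose $Q \equiv 0$, and (13) becomes
$$\min_{\hat\beta, \tilde\beta}\ \tfrac{1}{2} \| y - X\hat\beta - \tilde X\tilde\beta \|_2^2 + P(\hat\beta - \tilde\beta).$$
This is different from other penalized models since it only penalizes $\hat\beta - \tilde\beta$. We call this model the half penalized method. This problem can also be divided into the two subproblems (16) and (17). The solution of (16) is trivial, and by (12) we have the following explicit formula:
$$\hat\alpha = \hat\beta^{ls} + \tilde\beta^{ls} = \beta + \Bigl(\Sigma - \frac{S}{2}\Bigr)^{-1} \Bigl(\frac{X + \tilde X}{2}\Bigr)^T \epsilon. \tag{20}$$
We introduce the following notation, which will be used very often later on:
$$\epsilon^{(1)} \triangleq \Bigl(\Sigma - \frac{S}{2}\Bigr)^{-1} \Bigl(\frac{X + \tilde X}{2}\Bigr)^T \epsilon, \qquad \epsilon^{(2)} \triangleq \Bigl(\frac{S}{2}\Bigr)^{-1} \Bigl(\frac{X - \tilde X}{2}\Bigr)^T \epsilon, \tag{21}$$
$$\mathrm{Var}(\epsilon^{(1)}) = \sigma^2 \Bigl(\Sigma - \frac{S}{2}\Bigr)^{-1}, \qquad \mathrm{Var}(\epsilon^{(2)}) = \sigma^2 \Bigl(\frac{S}{2}\Bigr)^{-1}, \tag{22}$$
where $\sigma^2 = \mathrm{var}(\epsilon_i)$. Substituting the expression of $\hat\beta^{ls} - \tilde\beta^{ls}$ given in (12) into (19) yields
$$\min_{\tilde\alpha} \sum_{i=1}^p \frac{1}{8} \| X_i - \tilde X_i \|_2^2 \bigl(\tilde\alpha_i - \beta_i - \epsilon^{(2)}_i\bigr)^2 + P(\tilde\alpha) = \min_{\tilde\alpha} \sum_{i=1}^p \frac{s_i}{4} \bigl(\tilde\alpha_i - \beta_i - \epsilon^{(2)}_i\bigr)^2 + P(\tilde\alpha). \tag{23}$$
The minimizer $\tilde\alpha$ of (23) satisfies the following lemma.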
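Lemma 2.5. Assume $P(x)$ is even. The minimizer of (23) satisfies $\mathrm{sign}(\tilde\alpha_j) = \mathrm{sign}(\beta_j + \epsilon^{(2)}_j)$ or $0$ if $\beta_j + \epsilon^{(2)}_j \neq 0$ and $s_j > 0$.

Proof. Since $P$ is even, we have $P(x) = P(|x|)$. Recall the minimization problem
$$\tilde\alpha = \arg\min_{\alpha} \sum_{i=1}^p \frac{s_i}{4} \bigl(\alpha_i - \beta_i - \epsilon^{(2)}_i\bigr)^2 + P(|\alpha|) \triangleq \arg\min_\alpha f(\alpha).$$
If $\mathrm{sign}(\tilde\alpha_j)$ is neither $\mathrm{sign}(\beta_j + \epsilon^{(2)}_j)$ nor $0$, we can modify $\tilde\alpha$ as follows,
$$\alpha^{new}_j = -\tilde\alpha_j, \qquad \alpha^{new}_i = \tilde\alpha_i, \quad i \neq j,$$
to obtain a smaller value. In fact, this modification only changes one term in $f(\tilde\alpha)$, and the following inequality leads to a contradiction:
$$f(\alpha^{new}) - f(\tilde\alpha) = \frac{s_j}{4} \bigl(\alpha^{new}_j - \beta_j - \epsilon^{(2)}_j\bigr)^2 - \frac{s_j}{4} \bigl(\tilde\alpha_j - \beta_j - \epsilon^{(2)}_j\bigr)^2 = s_j\, \tilde\alpha_j \bigl(\beta_j + \epsilon^{(2)}_j\bigr) < 0.$$

Assume that the knockoff statistic takes the difference formula, i.e. $W_j = |\hat\beta_j| - |\tilde\beta_j|$ (the signed max formula can be treated similarly). Equation (22) yields
$$\mathrm{var}(\epsilon^{(1)}_j) = \sigma^2 \bigl((\Sigma - S/2)^{-1}\bigr)_{jj}, \qquad \mathrm{var}(\epsilon^{(2)}_j) = 2\sigma^2 / s_j.$$
When $|\beta_j|$ is large compared to the noise level $\sigma$, we have $|\beta_j| > |\epsilon^{(1)}_j|, |\epsilon^{(2)}_j|$ with large probability. Consequently, we obtain $\mathrm{sign}(\hat\alpha_j) = \mathrm{sign}(\tilde\alpha_j) = \mathrm{sign}(\beta_j)$. Combining the solution of the first problem (20), Lemma 2.5 and the transform between $\alpha$ and $\beta$ ($\hat\alpha = \hat\beta + \tilde\beta$, $\tilde\alpha = \hat\beta - \tilde\beta$), we conclude that
$$W_j = |\hat\beta_j| - |\tilde\beta_j| = \tfrac{1}{2} \bigl( |\hat\alpha_j + \tilde\alpha_j| - |\hat\alpha_j - \tilde\alpha_j| \bigr) \ge 0,$$
which implies that $W$ is a good statistic.

Example 2: A half Lasso statistic. We choose $Q(x) \equiv 0$ and $P(x) = \lambda \|x\|_1$. As a result, the Lasso problem (17) or (19) can be solved directly:
$$\tilde\alpha = \arg\min_{\alpha} \sum_{i=1}^p \frac{s_i}{4} \bigl(\alpha_i - (\hat\beta^{ls}_i - \tilde\beta^{ls}_i)\bigr)^2 + \lambda |\alpha_i|,$$
$$\tilde\alpha_i = \mathrm{Sh}\bigl(\hat\beta^{ls}_i - \tilde\beta^{ls}_i,\ 2\lambda / s_i\bigr) \triangleq \mathrm{sign}\bigl(\hat\beta^{ls}_i - \tilde\beta^{ls}_i\bigr) \bigl( |\hat\beta^{ls}_i - \tilde\beta^{ls}_i| - 2\lambda / s_i \bigr)_+,$$
where Sh (shrinkage) is the soft threshold operator and $a_+ \triangleq \max(0, a)$. We can rewrite the formula above in vector form, $\tilde\alpha = \mathrm{Sh}(\hat\beta^{ls} - \tilde\beta^{ls}, 2\lambda S_{inv})$, where $S_{inv} = [1/s_1, 1/s_2, \ldots, 1/s_p]^T$. This vector identity should be interpreted componentwise. Since $Q \equiv 0$, the solution of (16) is given by $\hat\alpha = \hat\beta^{ls} + \tilde\beta^{ls}$. Combining the formulas for $\tilde\alpha$ and $\hat\alpha$, we obtain the solution of (13):
$$\hat\beta = \tfrac{1}{2} \bigl( \hat\beta^{ls} + \tilde\beta^{ls} + \mathrm{Sh}(\hat\beta^{ls} - \tilde\beta^{ls}, 2\lambda S_{inv}) \bigr) = \tfrac{1}{2} \bigl( \beta + \epsilon^{(1)} + \mathrm{Sh}(\beta + \epsilon^{(2)}, 2\lambda S_{inv}) \bigr),$$
$$\tilde\beta = \tfrac{1}{2} \bigl( \hat\beta^{ls} + \tilde\beta^{ls} - \mathrm{Sh}(\hat\beta^{ls} - \tilde\beta^{ls}, 2\lambda S_{inv}) \bigr) = \tfrac{1}{2} \bigl( \beta + \epsilon^{(1)} - \mathrm{Sh}(\beta + \epsilon^{(2)}, 2\lambda S_{inv}) \bigr), \tag{24}$$
where $\epsilon^{(1)}, \epsilon^{(2)}$ are defined in (21) with variance (22).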
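The closed form of the half Lasso can be written out in a few lines. The sketch below is a minimal illustration rather than the authors' implementation: it assumes that the least squares coefficients and the vector s are already available, and that the objective carries the factor 1/2 in front of the squared residual, so that each coordinate solves min (s_i/4)(α_i − d_i)² + λ|α_i| and the threshold is 2λ/s_i. The function names and the toy numbers are hypothetical.

```python
def soft_threshold(x, t):
    """Sh(x, t) = sign(x) * max(|x| - t, 0)."""
    mag = max(abs(x) - t, 0.0)
    return mag if x >= 0 else -mag


def half_lasso(beta_ls, beta_tilde_ls, s, lam):
    """Closed-form half Lasso solution from the least squares coefficients.

    alpha_hat   = beta_ls + beta_tilde_ls              (unpenalized sum)
    alpha_tilde = Sh(beta_ls - beta_tilde_ls, 2*lam/s)  (penalized difference)
    """
    beta_hat, beta_tilde = [], []
    for b, bt, s_i in zip(beta_ls, beta_tilde_ls, s):
        alpha_hat = b + bt
        alpha_tilde = soft_threshold(b - bt, 2.0 * lam / s_i)
        beta_hat.append(0.5 * (alpha_hat + alpha_tilde))
        beta_tilde.append(0.5 * (alpha_hat - alpha_tilde))
    return beta_hat, beta_tilde


# Toy least squares coefficients: the first pair differs by more than the
# threshold 2*lam/s = 1.0, the second pair by less.
beta_hat, beta_tilde = half_lasso(
    beta_ls=[3.0, 0.25], beta_tilde_ls=[1.0, 0.75], s=[0.5, 0.5], lam=0.25
)
W = [abs(b) - abs(bt) for b, bt in zip(beta_hat, beta_tilde)]
print(beta_hat, beta_tilde, W)
```

For the second feature the difference of the least squares coefficients falls below the threshold, so the two coefficients are pulled to a common value and the statistic is exactly zero.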
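It is interesting to note that if $\beta_j$ is small, the soft-threshold yields $\hat\beta_j = \tilde\beta_j$, which implies $W_j = 0$.

A weighted half Lasso statistic. We can add a weight to $\beta_i$ to balance the noise level and the soft-threshold. Consider the following penalized model
$$\min_{\hat\beta, \tilde\beta}\ \tfrac{1}{2} \| y - X Z^{-1} \hat\beta - \tilde X Z^{-1} \tilde\beta \|_2^2 + \lambda \| \hat\beta - \tilde\beta \|_1, \tag{25}$$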
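where $Z = \mathrm{diag}\{z_1, z_2, \ldots, z_p\}$ is a positive diagonal matrix chosen in advance. Note that $\hat\beta, \tilde\beta$ only depend on $[X\ \tilde X]^T y$ and $[X\ \tilde X]^T [X\ \tilde X]$. Similarly, we derive the solution as follows:
$$\hat\beta = \tfrac{1}{2} \bigl( Z(\beta + \epsilon^{(1)}) + \mathrm{Sh}(Z(\beta + \epsilon^{(2)}), 2\lambda Z^2 S_{inv}) \bigr), \qquad \tilde\beta = \tfrac{1}{2} \bigl( Z(\beta + \epsilon^{(1)}) - \mathrm{Sh}(Z(\beta + \epsilon^{(2)}), 2\lambda Z^2 S_{inv}) \bigr), \tag{26}$$
where $S = \mathrm{diag}\{s_1, s_2, \ldots, s_p\}$ and $\epsilon^{(1)}, \epsilon^{(2)}$ are defined in (21). The weighted half Lasso statistic, which satisfies the sufficiency property, is defined as follows:
$$W_j = \frac{1}{z_j} \bigl( |\hat\beta_j| - |\tilde\beta_j| \bigr) = \tfrac{1}{2} \Bigl( \bigl| \beta_j + \epsilon^{(1)}_j + \mathrm{Sh}(\beta_j + \epsilon^{(2)}_j, 2\lambda z_j / s_j) \bigr| - \bigl| \beta_j + \epsilon^{(1)}_j - \mathrm{Sh}(\beta_j + \epsilon^{(2)}_j, 2\lambda z_j / s_j) \bigr| \Bigr). \tag{27}$$
We can also define the associated signed max statistic $W_j = z_j^{-1} \max(|\hat\beta_j|, |\tilde\beta_j|) \cdot \mathrm{sign}(|\hat\beta_j| - |\tilde\beta_j|)$, which also satisfies the sufficiency property. The difference between (26) and (24) is the addition of a weight to the threshold. Note that the covariance matrix of $\epsilon^{(2)}$ is $2 S^{-1} \sigma^2$. The weighted half Lasso can balance the variance of the noise $\epsilon^{(2)}$ and the soft-threshold. We suggest using $Z = \mathrm{diag}(\sqrt{s_1/2}, \sqrt{s_2/2}, \ldots, \sqrt{s_p/2})$. With this choice of $Z$, we have
$$\mathrm{Var}(\epsilon^{(2)}) = 2\sigma^2 S^{-1} = 4\sigma^2\, \mathrm{diag}\bigl(z_1^2/s_1^2, z_2^2/s_2^2, \ldots, z_p^2/s_p^2\bigr) \propto \lambda^2\, \mathrm{diag}\bigl(4 z_1^2/s_1^2, 4 z_2^2/s_2^2, \ldots, 4 z_p^2/s_p^2\bigr),$$
so the variance of each $\epsilon^{(2)}_j$ is proportional to the square of the corresponding threshold $2\lambda z_j / s_j$.

Example 3: A negative half Lasso. Choosing $Q(x) \equiv 0$ and $P(x) = -\lambda \sum_{i=1}^p \mu_i |x_i|$ with $\mu_i \ge 0$, we can deduce the solution of (16) and (17) (or (19)):
$$\hat\beta + \tilde\beta = \hat\alpha = \hat\beta^{ls} + \tilde\beta^{ls} = \beta + \epsilon^{(1)},$$
$$\hat\beta - \tilde\beta = \tilde\alpha = \arg\min_\alpha \sum_{i=1}^p \frac{s_i}{4} \bigl(\alpha_i - (\hat\beta^{ls}_i - \tilde\beta^{ls}_i)\bigr)^2 - \lambda \mu_i |\alpha_i|,$$
$$\Longrightarrow\quad \hat\beta_i = \beta_i + \tfrac{1}{2} \bigl( \epsilon^{(1)}_i + \epsilon^{(2)}_i \bigr) + \frac{\lambda \mu_i}{s_i}\, \mathrm{sign}(\beta_i + \epsilon^{(2)}_i), \qquad \tilde\beta_i = \tfrac{1}{2} \bigl( \epsilon^{(1)}_i - \epsilon^{(2)}_i \bigr) - \frac{\lambda \mu_i}{s_i}\, \mathrm{sign}(\beta_i + \epsilon^{(2)}_i),$$
where we have used $\hat\beta^{ls}_i - \tilde\beta^{ls}_i = \beta_i + \epsilon^{(2)}_i$. We see that a negative $P(x)$ can increase the difference between $\hat\beta$ and $\tilde\beta$, which can be useful for distinguishing a true feature from its knockoff.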
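To see how the negative penalty widens the gap between a feature and its knockoff, the coordinate-wise solution of the negative half Lasso can be sketched as follows. This is a hypothetical illustration under the same assumptions as the formulas above (objective with the factor 1/2, least squares coefficients given); when the least squares difference is exactly zero the minimizer is not unique, and this sketch simply returns zero for the difference in that case.

```python
def negative_half_lasso(beta_ls, beta_tilde_ls, s, mu, lam):
    """Closed-form solution with P(x) = -lam * sum(mu_i * |x_i|) and Q = 0.

    Each coordinate minimizes (s_i/4)*(a - d)^2 - lam*mu_i*|a|, where d is the
    least squares difference, giving a = d + (2*lam*mu_i/s_i) * sign(d).
    """
    beta_hat, beta_tilde = [], []
    for b, bt, s_i, mu_i in zip(beta_ls, beta_tilde_ls, s, mu):
        d = b - bt
        sign = (d > 0) - (d < 0)
        alpha_hat = b + bt
        alpha_tilde = d + sign * 2.0 * lam * mu_i / s_i
        beta_hat.append(0.5 * (alpha_hat + alpha_tilde))
        beta_tilde.append(0.5 * (alpha_hat - alpha_tilde))
    return beta_hat, beta_tilde


# Toy example: the least squares gap is 3.0 - 1.0 = 2.0.
bh, bt = negative_half_lasso(
    beta_ls=[3.0], beta_tilde_ls=[1.0], s=[0.5], mu=[0.5], lam=0.25
)
print(bh, bt, bh[0] - bt[0])
```

In this toy example the gap between the two coefficients grows from 2.0 in the least squares fit to 2.5, in line with the explicit formulas above.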
When $\mu_i = s_i$, our numerical results show that the negative penalty enlarges the gap between $\hat\beta$ and $\tilde\beta$ and increases the power by 5 % compared to least squares, while the half Lasso shrinks the gap between $\hat\beta$ and $\tilde\beta$ and reduces the power by 5 %.

2.6 Extension of the knockoff sufficiency property

In [3], the sufficiency property of a knockoff statistic states that the statistic $W$ depends only on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the feature-response product $[X\ \tilde X]^T y$. In this subsection, we generalize the sufficiency property so that we can apply the knockoff filter to more general scenarios. In addition, we propose a method to estimate the noise level and determine the prior regularizing parameter for a half penalized method.

Let $U \in \mathbb{R}^{n \times (n - 2p)}$ be an orthonormal matrix such that $[X\ \tilde X]^T U = 0$ and the columns of $[X\ \tilde X\ U]$ form a basis of $\mathbb{R}^n$. Recall that the knockoff condition implies $(X + \tilde X)^T (X - \tilde X) = X^T X - \tilde X^T \tilde X = 0$. Hence, we can decompose $\mathbb{R}^n$ as follows:
$$\mathbb{R}^n = \mathrm{span}(X + \tilde X) \oplus \mathrm{span}(X - \tilde X) \oplus \mathrm{span}(U).$$
Our key observation is that swapping each pair of an original feature $X_j$ and its knockoff $\tilde X_j$ does not modify the spaces $\mathrm{span}(X + \tilde X)$, $\mathrm{span}(X - \tilde X)$ and $\mathrm{span}(U)$. Therefore, the projections of the response $y$ onto these three spaces are independent, and their probability distributions are invariant after swapping an arbitrary pair $X_j, \tilde X_j$. Inspired by this observation, we can generalize the sufficiency property.
Definition 2.6 (Generalized Sufficiency Property). The statistic $W$ is said to obey the generalized sufficiency property if $W$ depends only on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$ and the feature-response product $[X\ \tilde X\ U]^T y$; that is, we can write
$$W = f\bigl([X\ \tilde X]^T [X\ \tilde X],\ [X\ \tilde X\ U]^T y\bigr)$$
for some $f : S^+_{2p} \times \mathbb{R}^n \to \mathbb{R}^p$ and an orthonormal matrix $U \in \mathbb{R}^{n \times (n - 2p)}$ that satisfies $U^T [X\ \tilde X] = 0$.

Remark. Compared with the original sufficiency property, the generalized sufficiency property includes the addition of $U^T y$, which is the coefficient vector of the orthogonal projection of $y$ onto the orthogonal complement of $\mathrm{span}([X\ \tilde X])$. As an application, we will use this extra component to estimate the noise level and incorporate the estimated noise level into the knockoff statistic from a penalized method without violating the exchangeability property and FDR control.

The definition of the antisymmetry property remains the same: swapping $X_j$ and $\tilde X_j$ has the same effect as changing the sign of $W_j$, i.e.
$$W_j\bigl([X\ \tilde X]_{\mathrm{swap}(\hat S)}, U, y\bigr) = W_j\bigl([X\ \tilde X], U, y\bigr) \cdot \begin{cases} +1, & j \notin \hat S, \\ -1, & j \in \hat S, \end{cases}$$
where $\hat S$ is a subset of nulls. For any knockoff matrix $\tilde X$ and any associated statistic $W$ that satisfies the above definition, we call $W$ a generalized knockoff statistic. We will prove that this generalized statistic satisfies the exchangeability property. Then we can apply the same super-martingale argument as in [3] to establish rigorous FDR control.

Following the analysis of exchangeability in [3], we need to prove the counterparts of Lemma 2 (pairwise exchangeability for the features) and Lemma 3 (pairwise exchangeability for the response) in [3]. Lemma 2 is a direct result of the knockoff constraint. We need to prove the following lemma.

Lemma 2.7. For any generalized knockoff statistic $W$ and a subset $\hat S$ of nulls, we have
$$W_{\mathrm{swap}(\hat S)} \triangleq f\bigl([X\ \tilde X]^T_{\mathrm{swap}(\hat S)} [X\ \tilde X]_{\mathrm{swap}(\hat S)},\ [[X\ \tilde X]_{\mathrm{swap}(\hat S)}\ U]^T y\bigr) \overset{d}{=} f\bigl([X\ \tilde X]^T [X\ \tilde X],\ [X\ \tilde X\ U]^T y\bigr) = W.$$

Proof. Since $\tilde X$ is a knockoff matrix, we get
$$[X\ \tilde X]^T_{\mathrm{swap}(\hat S)} [X\ \tilde X]_{\mathrm{swap}(\hat S)} = [X\ \tilde X]^T [X\ \tilde X], \tag{28}$$
and thus the first argument of $f$ is the same on both sides of the claimed identity.
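Next, we verify $[[X\ \tilde X]_{\mathrm{swap}(\hat S)}\ U]^T y \overset{d}{=} [X\ \tilde X\ U]^T y$. Since $y$ is a Gaussian random vector, it is equivalent to verify that the means and the covariances of both sides are the same. We first check the means:
$$E\bigl([[X\ \tilde X]_{\mathrm{swap}(\hat S)}\ U]^T y\bigr) = [[X\ \tilde X]_{\mathrm{swap}(\hat S)}\ U]^T X\beta = \begin{pmatrix} [X\ \tilde X]^T X\beta \\ U^T X\beta \end{pmatrix} = E\bigl([X\ \tilde X\ U]^T y\bigr). \tag{29}$$
The second equality is guaranteed by $[X\ \tilde X]^T_{\mathrm{swap}(\hat S)} X\beta = [X\ \tilde X]^T X\beta$, which holds since $X_i^T X_j = \tilde X_i^T X_j$ for all $i \neq j$ and $\hat S$ is a subset of nulls. Using $y \sim N(X\beta, \sigma^2 I_n)$, $[X\ \tilde X]^T U = 0$ and (28), we obtain
$$\mathrm{Var}\bigl([[X\ \tilde X]_{\mathrm{swap}(\hat S)}\ U]^T y\bigr) = \sigma^2\, [[X\ \tilde X]_{\mathrm{swap}(\hat S)}\ U]^T [[X\ \tilde X]_{\mathrm{swap}(\hat S)}\ U] = \sigma^2\, \mathrm{diag}\bigl([X\ \tilde X]^T_{\mathrm{swap}(\hat S)} [X\ \tilde X]_{\mathrm{swap}(\hat S)},\ U^T U\bigr) = \sigma^2\, \mathrm{diag}\bigl([X\ \tilde X]^T [X\ \tilde X],\ U^T U\bigr) = \mathrm{Var}\bigl([X\ \tilde X\ U]^T y\bigr). \tag{30}$$
Combining (29) and (30), we conclude the proof.

The exchangeability property of a generalized knockoff statistic is a result of this lemma and the antisymmetry property of the knockoff statistic.

Lemma 2.8 (i.i.d. signs for the nulls). Let $\eta \in \{\pm 1\}^p$ be a sign sequence independent of $W$, with $\eta_j = +1$ for all non-null $j$ and $\eta_j$ i.i.d. uniform on $\{\pm 1\}$ for null $j$. Then
$$(W_1, \ldots, W_p) \overset{d}{=} (W_1 \eta_1, \ldots, W_p \eta_p).$$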
Estimate of the noise level and an application. As an application of the generalized knockoff statistic, we propose a new method to estimate the noise level in the knockoff filter without violating the exchangeability property and FDR control. Let $U \in \mathbb{R}^{n \times (n - 2p)}$ be an orthonormal matrix such that $U^T [X\ \tilde X] = 0$. From the identity $U^T y = U^T (X\beta + \epsilon) = U^T \epsilon$, we obtain an estimate of the noise level depending only on $U^T y$:
$$\hat\sigma \triangleq \| U^T y \|_2 / \sqrt{n - 2p}. \tag{31}$$
For any problem with an unknown noise level, we consider the knockoff half Lasso whose regularizing parameter is determined by $\hat\sigma$, i.e.
$$\min_{\hat\beta, \tilde\beta}\ \tfrac{1}{2} \| y - X\hat\beta - \tilde X\tilde\beta \|_2^2 + \lambda \hat\sigma \| \hat\beta - \tilde\beta \|_1, \tag{32}$$
where λ = or can be decided empirically. Since the solution $(\hat\beta, \tilde\beta)$ of (32) depends on the Gram matrix $[X\ \tilde X]^T [X\ \tilde X]$, the marginal correlation $[X\ \tilde X]^T y$ and the regularizing parameter $\lambda\hat\sigma$ (where $\hat\sigma$ is determined by $U^T y$), we deduce that $\hat\beta, \tilde\beta$ are functions of the Gram matrix and $[X\ \tilde X\ U]^T y$. Consequently, the statistic $W \triangleq |\hat\beta| - |\tilde\beta| = W([X\ \tilde X]^T [X\ \tilde X], [X\ \tilde X\ U]^T y)$ (or the signed max version) satisfies the generalized sufficiency property. The antisymmetry property can be verified easily. Hence, we can choose $W$ as a knockoff statistic with exact FDR control.

2.7 A modified SDP construction

In [3], the authors propose to construct $\mathrm{diag}(s)$ ($s = (s_1, s_2, \ldots, s_p)$) via convex optimization:
$$\text{maximize: } \sum_{i=1}^p s_i, \qquad \text{subject to: } \mathrm{diag}(s) \preceq 2\Sigma;\ \ 0 \le s_i \le 1,\ i = 1, 2, \ldots, p. \tag{33}$$
Such a construction sometimes produces zero $s_i$. In this case, feature $i$ cannot be selected by the knockoff filter. To illustrate this point, we construct a simple but by no means extreme example in which this construction criterion gives zero $s_i$ for some $i$. Let $\Sigma_{a,b}$ be the $3 \times 3$ matrix defined as
$$\Sigma_{a,b} = \begin{pmatrix} 1 & b & a \\ b & 1 & a \\ a & a & 1 \end{pmatrix}.$$
Using the CVX solver in MATLAB, we can solve (33) for several $\Sigma_{a,b}$. When (a,b) = (.8,.4), we have $s_1 = s_2$ =.4 and $s_3 = 0$; when (a,b) = (.9,.7), we have $s_1 = s_2$ =.6 and $s_3 = 0$; when (a,b) = (.7,.4), we have $s_1 = s_2$ =.84 and $s_3 = 0$. We observe that $s_3 = 0$ in these examples.
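Returning to the noise-level estimate introduced earlier in this section, the estimator only needs an orthonormal basis U of the orthogonal complement of the span of [X X̃]. A minimal NumPy sketch is given below. It is not the authors' code: the function name `estimate_noise_level` is hypothetical, the complement is obtained from a complete QR factorization (one of several possible choices), and a random matrix with 2p columns is used as a stand-in for [X X̃], since the estimator does not depend on how the knockoffs were built.

```python
import numpy as np


def estimate_noise_level(XX, y):
    """Estimate sigma by ||U^T y|| / sqrt(n - m), where the columns of U span
    the orthogonal complement of the column space of XX (n x m, full rank)."""
    n, m = XX.shape
    Q, _ = np.linalg.qr(XX, mode="complete")  # Q is an n x n orthogonal matrix
    U = Q[:, m:]                              # last n - m columns: U^T XX = 0
    sigma_hat = np.linalg.norm(U.T @ y) / np.sqrt(n - m)
    return sigma_hat, U


# Synthetic check with a known noise level (all values are illustrative).
rng = np.random.default_rng(0)
n, p, sigma = 2000, 5, 2.0
XX = rng.standard_normal((n, 2 * p))          # stand-in for [X, X_tilde]
beta = np.array([3.0, -2.0, 0.0, 0.0, 1.5])
y = XX[:, :p] @ beta + sigma * rng.standard_normal(n)

sigma_hat, U = estimate_noise_level(XX, y)
print(float(sigma_hat))
```

Because U^T y = U^T ε does not involve β, the estimate is unaffected by the signal, and with n − 2p degrees of freedom it concentrates around the true σ as n grows.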
Modified SDP construction. To overcome the zero output problem, we propose to slightly modify the original SDP construction by solving the following optimization problem:
$$\text{minimize: } \sum_{i=1}^p (1 - s_i), \qquad \text{subject to: } \mathrm{diag}(s) \preceq 2\beta\Sigma,\ \ \alpha\lambda_{\min}(\Sigma) \le s_i \le 1,\ \ \alpha \in [0, 1),\ \beta \in (0, 1].$$
The half penalized method requires that $\Sigma - \mathrm{diag}(s)/2$ and $\mathrm{diag}(s)$ be invertible (see the least squares coefficient formula (12)), and we suggest (α, β) = (.5,.75). For path statistics, to alleviate zero output in the SDP construction, we suggest (α,β) = (.5,).

3 A PCA prototype filter

In this section, we propose a PCA prototype group selection method with group FDR control to overcome the difficulty associated with strong within-group correlation. It is well known that the
grouping strategy provides an effective way to handle strongly correlated features. Our work is inspired by Reid-Tibshirani's prototype filter [3] and Dai-Barber's group knockoff filter [6]. We provide a brief summary of the two methods below before introducing our PCA prototype filter.

3.1 Reid-Tibshirani's prototype filter

In [3], Reid and Tibshirani introduce a prototype filter. They choose a prototype for each group of features, and then use the knockoff filter on these prototypes to perform group selection. Specifically, the method consists of the following steps. First, cluster the columns of $X$ into $K$ groups $\{C_1, \ldots, C_K\}$. Then split the data by rows into two (roughly) equal parts,
$$y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \end{pmatrix}, \qquad X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}.$$
Choose a prototype for each cluster via the maximal marginal correlation, using only the first part of the data $y^{(1)}, X^{(1)}$. This generates the prototype set $\hat P$. Next, form a knockoff matrix $\tilde X^{(2)}_{\hat P}$ from $X^{(2)}_{\hat P}$ and perform the knockoff filter using $y^{(2)}$ and $[X^{(2)}_{\hat P}\ \tilde X^{(2)}_{\hat P}]$. Finally, group $C_i$ is selected if and only if $X^{(2)}_{\hat P_i}$ is chosen in the filter process.

This method satisfies the exchangeability property, and the authors establish group FDR control based on a super-martingale argument similar to that in [3]. We point out that this method does not benefit much from the group structure. Assume that $X_i, X_j$ are in the same group and $\langle X_i, X_j \rangle \ge 1 - \delta$. For this pair, we define a unit vector $v \triangleq (e_i - e_j)/\sqrt{2}$, where $\{e_i\}_{1 \le i \le p}$ is the standard orthonormal basis of $\mathbb{R}^p$. From the knockoff constraint, we have $v^T \mathrm{diag}(s) v \le 2 v^T X^T X v$, which further implies
$$\min(s_i, s_j) \le v^T \mathrm{diag}(s) v \le 2 v^T X^T X v = 2 \| X v \|_2^2 = \| X_i - X_j \|_2^2 \le 2\delta. \tag{34}$$
If the within-group correlation is strong, $\delta$ is small and the inequality above implies that either $s_i$ or $s_j$ is small. Hence, the power of this method may be limited for strongly correlated features. Our numerical results in Sections 3.4 and 3.5 confirm this limitation.

3.2 Dai-Barber's group knockoff filter

In [6], Dai and Barber investigate a group-wise knockoff filter, which is a generalization of the knockoff filter. Assume that the columns of $X$ can be divided into $k$ groups $\{X_{G_1}, X_{G_2}, \ldots, X_{G_k}\}$. The authors construct the group knockoff matrix according to
$$\tilde X^T \tilde X = X^T X, \qquad \tilde X^T X = \Sigma - S, \qquad \Sigma = X^T X,$$
where $S$ is group-block-diagonal, i.e. $S_{G_i, G_j} = 0$ for any two distinct groups $i \neq j$. Then let $S = \mathrm{diag}(S_1, S_2, \ldots, S_k)$ with $S_i = \gamma \Sigma_{G_i, G_i} = \gamma X_{G_i}^T X_{G_i}$, $i = 1, 2, \ldots, k$. The constraint $S \preceq 2\Sigma$ implies
$$\gamma \cdot \mathrm{diag}(\Sigma_{G_1, G_1}, \Sigma_{G_2, G_2}, \ldots, \Sigma_{G_k, G_k}) = S \preceq 2\Sigma.$$
In order to maximize the difference between $X$ and $\tilde X$, $\gamma$ is chosen as large as possible: $\gamma = \min\{1, 2\lambda_{\min}(D \Sigma D)\}$, where $D = \mathrm{diag}(\Sigma^{-1/2}_{G_1, G_1}, \Sigma^{-1/2}_{G_2, G_2}, \ldots, \Sigma^{-1/2}_{G_k, G_k})$. The group-wise statistic introduced in [6] can be obtained after the construction of the group knockoff matrix. The construction above guarantees group-wise exchangeability. Finally, group FDR control, i.e.
$$\mathrm{FDR}_{group} \triangleq E\left[ \frac{\#\{ i : \beta_{G_i} = 0,\ i \in \hat S \}}{|\hat S| \vee 1} \right] \le q,$$
is a result of group-wise exchangeability. Here $\hat S = \{ j : W_j \ge T \}$ is the set of selected groups for a chosen group statistic $W$.

3.3 PCA Reformulation

Assume that $X$ can be clustered into $k$ groups $X = (X_{C_1}, X_{C_2}, \ldots, X_{C_k})$ in such a way that the within-group correlation is relatively strong while the between-group correlation is relatively weak. First, we apply the singular value decomposition (SVD) to decompose the feature vectors within each group:
$$X_{C_i} = U_{C_i} D_i V_i^T, \qquad U_{C_i} \in O^{n \times c_i},\ D_i \in \mathbb{R}^{c_i \times c_i},\ V_i \in O^{c_i \times c_i},\ c_i = |C_i|.$$
Then we reformulate the linear model as follows:
More informationMachine Learning. Regression basics. Marc Toussaint University of Stuttgart Summer 2015
Machine Learning Regression basics Linear regression, non-linear features (polynomial, RBFs, piece-wise), regularization, cross validation, Ridge/Lasso, kernel trick Marc Toussaint University of Stuttgart
More informationSparse representation classification and positive L1 minimization
Sparse representation classification and positive L1 minimization Cencheng Shen Joint Work with Li Chen, Carey E. Priebe Applied Mathematics and Statistics Johns Hopkins University, August 5, 2014 Cencheng
More informationLinear Algebra Massoud Malek
CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product
More informationSVD, PCA & Preprocessing
Chapter 1 SVD, PCA & Preprocessing Part 2: Pre-processing and selecting the rank Pre-processing Skillicorn chapter 3.1 2 Why pre-process? Consider matrix of weather data Monthly temperatures in degrees
More informationPCA with random noise. Van Ha Vu. Department of Mathematics Yale University
PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical
More informationPrincipal Component Analysis
I.T. Jolliffe Principal Component Analysis Second Edition With 28 Illustrations Springer Contents Preface to the Second Edition Preface to the First Edition Acknowledgments List of Figures List of Tables
More informationThe dual simplex method with bounds
The dual simplex method with bounds Linear programming basis. Let a linear programming problem be given by min s.t. c T x Ax = b x R n, (P) where we assume A R m n to be full row rank (we will see in the
More informationSparse principal component analysis via regularized low rank matrix approximation
Journal of Multivariate Analysis 99 (2008) 1015 1034 www.elsevier.com/locate/jmva Sparse principal component analysis via regularized low rank matrix approximation Haipeng Shen a,, Jianhua Z. Huang b a
More informationSparsity Regularization
Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation
More informationCross-Validation with Confidence
Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation
More informationSparse orthogonal factor analysis
Sparse orthogonal factor analysis Kohei Adachi and Nickolay T. Trendafilov Abstract A sparse orthogonal factor analysis procedure is proposed for estimating the optimal solution with sparse loadings. In
More informationThe MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010
Department of Biostatistics Department of Statistics University of Kentucky August 2, 2010 Joint work with Jian Huang, Shuangge Ma, and Cun-Hui Zhang Penalized regression methods Penalized methods have
More informationLecture 8. Principal Component Analysis. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 13, 2016
Lecture 8 Principal Component Analysis Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza December 13, 2016 Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 1 / 31 Outline 1 Eigen
More informationTECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection
DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model
More informationData Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods.
TheThalesians Itiseasyforphilosopherstoberichiftheychoose Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods Ivan Zhdankin
More informationLinear regression methods
Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response
More informationarxiv: v3 [math.oc] 19 Oct 2017
Gradient descent with nonconvex constraints: local concavity determines convergence Rina Foygel Barber and Wooseok Ha arxiv:703.07755v3 [math.oc] 9 Oct 207 0.7.7 Abstract Many problems in high-dimensional
More informationFinancial Econometrics
Material : solution Class : Teacher(s) : zacharias psaradakis, marian vavra Example 1.1: Consider the linear regression model y Xβ + u, (1) where y is a (n 1) vector of observations on the dependent variable,
More informationStochastic Design Criteria in Linear Models
AUSTRIAN JOURNAL OF STATISTICS Volume 34 (2005), Number 2, 211 223 Stochastic Design Criteria in Linear Models Alexander Zaigraev N. Copernicus University, Toruń, Poland Abstract: Within the framework
More informationStability and the elastic net
Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for
More informationApplied Numerical Linear Algebra. Lecture 8
Applied Numerical Linear Algebra. Lecture 8 1/ 45 Perturbation Theory for the Least Squares Problem When A is not square, we define its condition number with respect to the 2-norm to be k 2 (A) σ max (A)/σ
More informationIterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming
Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study
More informationAn Introduction to Sparse Approximation
An Introduction to Sparse Approximation Anna C. Gilbert Department of Mathematics University of Michigan Basic image/signal/data compression: transform coding Approximate signals sparsely Compress images,
More informationhttps://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57
More informationInverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1
Inverse of a Square Matrix For an N N square matrix A, the inverse of A, 1 A, exists if and only if A is of full rank, i.e., if and only if no column of A is a linear combination 1 of the others. A is
More informationLinear Model Selection and Regularization
Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In
More informationApproximate Principal Components Analysis of Large Data Sets
Approximate Principal Components Analysis of Large Data Sets Daniel J. McDonald Department of Statistics Indiana University mypage.iu.edu/ dajmcdon April 27, 2016 Approximation-Regularization for Analysis
More informationECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis
ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear
More informationIV. Matrix Approximation using Least-Squares
IV. Matrix Approximation using Least-Squares The SVD and Matrix Approximation We begin with the following fundamental question. Let A be an M N matrix with rank R. What is the closest matrix to A that
More informationPart IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015
Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)
More informationUniversity of Luxembourg. Master in Mathematics. Student project. Compressed sensing. Supervisor: Prof. I. Nourdin. Author: Lucien May
University of Luxembourg Master in Mathematics Student project Compressed sensing Author: Lucien May Supervisor: Prof. I. Nourdin Winter semester 2014 1 Introduction Let us consider an s-sparse vector
More informationRegression. Oscar García
Regression Oscar García Regression methods are fundamental in Forest Mensuration For a more concise and general presentation, we shall first review some matrix concepts 1 Matrices An order n m matrix is
More informationPANEL DATA RANDOM AND FIXED EFFECTS MODEL. Professor Menelaos Karanasos. December Panel Data (Institute) PANEL DATA December / 1
PANEL DATA RANDOM AND FIXED EFFECTS MODEL Professor Menelaos Karanasos December 2011 PANEL DATA Notation y it is the value of the dependent variable for cross-section unit i at time t where i = 1,...,
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationLecture 25: November 27
10-725: Optimization Fall 2012 Lecture 25: November 27 Lecturer: Ryan Tibshirani Scribes: Matt Wytock, Supreeth Achar Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior
More informationLecture 12 April 25, 2018
Stats 300C: Theory of Statistics Spring 2018 Lecture 12 April 25, 2018 Prof. Emmanuel Candes Scribe: Emmanuel Candes, Chenyang Zhong 1 Outline Agenda: The Knockoffs Framework 1. The Knockoffs Framework
More informationLinear Programming Redux
Linear Programming Redux Jim Bremer May 12, 2008 The purpose of these notes is to review the basics of linear programming and the simplex method in a clear, concise, and comprehensive way. The book contains
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More informationChapter 6: Orthogonality
Chapter 6: Orthogonality (Last Updated: November 7, 7) These notes are derived primarily from Linear Algebra and its applications by David Lay (4ed). A few theorems have been moved around.. Inner products
More informationLearning Multiple Tasks with a Sparse Matrix-Normal Penalty
Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1
More informationMatrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =
30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can
More informationThe Hilbert Space of Random Variables
The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2
More informationare Banach algebras. f(x)g(x) max Example 7.4. Similarly, A = L and A = l with the pointwise multiplication
7. Banach algebras Definition 7.1. A is called a Banach algebra (with unit) if: (1) A is a Banach space; (2) There is a multiplication A A A that has the following properties: (xy)z = x(yz), (x + y)z =
More information