Lecture 12 April 25, 2018


Stats 300C: Theory of Statistics, Spring 2018
Lecture 12, April 25, 2018
Prof. Emmanuel Candès
Scribe: Emmanuel Candès, Chenyang Zhong

1 Outline

Agenda: The Knockoffs Framework

1. The Knockoffs Framework
2. Proof Sketch of FDR Control
3. Knockoffs for Random Features
4. Knockoffs for Hidden Markov Models

In this lecture, we first review the knockoffs framework and rigorously establish FDR control using a martingale argument. Then we discuss how to construct knockoffs for random features. Finally, we introduce a general construction of knockoffs through sequential conditional independent pairs, and apply it to construct knockoffs for hidden Markov models.

2 The knockoffs framework

We are interested in controlled variable selection. Specifically, we have a response variable $Y$ (for example, a disease status) and several features $X_1, X_2, \ldots, X_p$ (for example, SNPs). The goal of controlled variable selection is to select a set of features $X_j$ that are likely to be relevant without too many false positives, so that we do not run into the problem of irreproducibility.

Many statistics can help us assess the importance of each feature and decide which variables to report. For example, we can obtain feature importance statistics $Z_j$ from random forests. In our framework, knockoffs serve as negative controls: by comparing the importance of each variable with that of its knockoff, we obtain a selection of features. We denote by

$$(Z_1, \ldots, Z_p, \tilde Z_1, \ldots, \tilde Z_p) = z([X, \tilde X], y) \tag{1}$$

the original and knockoff scores. (Recall that swapping a variable and its knockoff just swaps the corresponding scores.) In this lecture, we will show that we can construct knockoff features such that when $j$ is null, $j \in H_0$, we have

$$(Z_j, \tilde Z_j) \overset{d}{=} (\tilde Z_j, Z_j) \tag{2}$$

and more generally, for any $T \subset H_0$, we have

$$(Z, \tilde Z)_{\mathrm{swap}(T)} \overset{d}{=} (Z, \tilde Z). \tag{3}$$

Based on $Z$ and $\tilde Z$, we can construct adjusted scores $W_j$ with the flip-sign property. Specifically, set

$$W_j = w_j(Z_j, \tilde Z_j) \tag{4}$$

such that

$$w_j(\tilde Z_j, Z_j) = -w_j(Z_j, \tilde Z_j).$$

For example, we can take

$$W_j = Z_j - \tilde Z_j \tag{5}$$

or

$$W_j = (Z_j \vee \tilde Z_j) \cdot \begin{cases} +1, & Z_j > \tilde Z_j, \\ -1, & Z_j \le \tilde Z_j. \end{cases} \tag{6}$$

We let

$$S^+(t) = \{j : W_j \ge t\}, \qquad S^-(t) = \{j : W_j \le -t\}. \tag{7}$$

If we select those features for which $W_j \ge t$ (i.e., the selected set is $S^+(t)$), we have a knockoff estimate of the FDP:

$$\widehat{\mathrm{FDP}}(t) = \frac{1 + |S^-(t)|}{1 \vee |S^+(t)|}. \tag{8}$$

We define the stopping times $\tau_0$ and $\tau_1$ as

$$\tau_0 = \min\left\{t : \frac{|S^-(t)|}{1 \vee |S^+(t)|} \le q\right\}, \tag{9}$$

$$\tau_1 = \min\{t : \widehat{\mathrm{FDP}}(t) \le q\}. \tag{10}$$

The knockoff procedure selects the model

$$\hat S_0 = \{j : W_j \ge \tau_0\}. \tag{11}$$

The knockoff+ procedure selects the model

$$\hat S_1 = \{j : W_j \ge \tau_1\}. \tag{12}$$

For the selection sets $\hat S_0$ and $\hat S_1$, we have the following rigorous result on FDR control.

Theorem 1 (Barber & Candès (2015) [1]): For the knockoff procedure with selection set $\hat S_0$,

$$\mathbb{E}\left[\frac{\#\text{false positives}}{\#\text{selections} + q^{-1}}\right] \le q. \tag{13}$$

For the knockoff+ procedure with selection set $\hat S_1$,

$$\mathbb{E}\left[\frac{\#\text{false positives}}{\#\text{selections}}\right] \le q. \tag{14}$$
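As a rough illustration of definitions (7)-(12), the sketch below computes the threshold and the selected set from a vector of adjusted scores. It is a minimal sketch, not the authors' implementation: the function names `knockoff_threshold` and `knockoff_select` and the example scores are hypothetical, and restricting the search to the nonzero magnitudes $|W_j|$ is an assumption made here so that the minimum is taken over a finite set.

```python
def knockoff_threshold(W, q, plus=True):
    """Return tau_1 (plus=True, knockoff+) or tau_0 (plus=False, knockoff).

    Assumption: candidate thresholds are the nonzero magnitudes |W_j|.
    Returns float('inf') when no candidate reaches the target level q,
    in which case nothing is selected.
    """
    offset = 1 if plus else 0
    for t in sorted({abs(w) for w in W if w != 0}):
        n_pos = sum(1 for w in W if w >= t)     # |S^+(t)|
        n_neg = sum(1 for w in W if w <= -t)    # |S^-(t)|
        if (offset + n_neg) / max(1, n_pos) <= q:
            return t
    return float("inf")


def knockoff_select(W, q, plus=True):
    """Indices j with W_j >= tau, i.e. the selected set."""
    tau = knockoff_threshold(W, q, plus)
    return [j for j, w in enumerate(W) if w >= tau]


# Hypothetical adjusted scores: large positive values suggest real signals.
W_example = [5.0, 4.2, -0.5, 3.1, 0.3, 2.7, -1.0, 6.3, 1.8, 2.2]
selected = knockoff_select(W_example, q=0.2)
```

The extra 1 in the numerator makes knockoff+ slightly more conservative than the knockoff procedure, which is what the difference between (13) and (14) reflects.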

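3 Proof sketch of FDR control

In this part, we sketch the proof of the FDR control stated in the theorem. For simplicity, we only discuss the knockoff+ procedure, and let $\tau = \tau_1$. We recall that

$$S^+(t) = \{j : W_j \ge t\}, \qquad S^-(t) = \{j : W_j \le -t\}, \tag{15}$$

and

$$\tau = \min\left\{t : \frac{1 + |S^-(t)|}{|S^+(t)| \vee 1} \le q\right\}. \tag{16}$$

For the selection set

$$\hat S = \{j : W_j \ge \tau\}, \tag{17}$$

we have

$$\mathrm{FDP}(\tau) = \frac{\#\{j \text{ null} : j \in S^+(\tau)\}}{\#\{j : j \in S^+(\tau)\} \vee 1} \tag{18}$$

$$= \frac{\#\{j \text{ null} : j \in S^+(\tau)\}}{1 + \#\{j \text{ null} : j \in S^-(\tau)\}} \cdot \frac{1 + \#\{j \text{ null} : j \in S^-(\tau)\}}{\#\{j : j \in S^+(\tau)\} \vee 1} \tag{19}$$

$$\le q \, \frac{\#\{j \text{ null} : j \in S^+(\tau)\}}{1 + \#\{j \text{ null} : j \in S^-(\tau)\}} \tag{20}$$

$$= q \, \frac{V^+(\tau)}{1 + V^-(\tau)}, \tag{21}$$

where

$$V^+(\tau) = \#\{j \text{ null} : j \in S^+(\tau)\}, \tag{22}$$

$$V^-(\tau) = \#\{j \text{ null} : j \in S^-(\tau)\}. \tag{23}$$

The inequality (20) holds because $\#\{j \text{ null} : j \in S^-(\tau)\} \le |S^-(\tau)|$, so the second factor in (19) is at most $\widehat{\mathrm{FDP}}(\tau) \le q$ by the definition of $\tau$.

Next we use a martingale argument to show that

$$\mathbb{E}\left[\frac{V^+(\tau)}{1 + V^-(\tau)}\right] \le 1, \tag{24}$$

and thus

$$\mathrm{FDR} = \mathbb{E}[\mathrm{FDP}] \le q. \tag{25}$$

For the filtration $\mathcal{F}_t = \sigma(\{V^\pm(u)\}_{u \le t})$, we now show that $\frac{V^+(t)}{1 + V^-(t)}$ is a supermartingale (Figure 1). For $s \ge t$, conditional on $V^+(s) + V^-(s)$, $V^+(s)$ is hypergeometric, and thus

$$\mathbb{E}\left[\frac{V^+(s)}{1 + V^-(s)} \,\Big|\, V^\pm(t),\ V^+(s) + V^-(s)\right] \le \frac{V^+(t)}{1 + V^-(t)}. \tag{26}$$

Hence, by the optional stopping theorem,

$$\mathrm{FDR} \le q \, \mathbb{E}\left[\frac{V^+(\tau)}{1 + V^-(\tau)}\right] \le q \, \mathbb{E}\left[\frac{V^+(0)}{1 + V^-(0)}\right] \le q, \tag{27}$$

where the last inequality holds because $V^+(0) \sim \mathrm{Bin}(\#\text{nulls}, \tfrac{1}{2})$.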
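The last inequality in (27) can be checked exactly for small numbers of nulls. The sketch below does so under the assumption (made here for illustration) that no null has $W_j = 0$, so that $V^-(0) = \#\text{nulls} - V^+(0)$; the function name `expected_ratio` is hypothetical.

```python
from fractions import Fraction
from math import comb


def expected_ratio(n_nulls):
    """Exact E[V+ / (1 + V-)] for V+ ~ Bin(n_nulls, 1/2), V- = n_nulls - V+."""
    total = Fraction(0)
    for k in range(n_nulls + 1):
        prob = Fraction(comb(n_nulls, k), 2 ** n_nulls)   # P(V+ = k)
        total += prob * Fraction(k, 1 + n_nulls - k)      # k / (1 + V-)
    return total


# The expectation stays below 1 for every number of nulls tried here.
values = {n: expected_ratio(n) for n in (1, 5, 20)}
```

Under this assumption the expectation equals $1 - 2^{-n}$ for $n$ nulls, which follows from the identity $\binom{n}{k}\frac{k}{n-k+1} = \binom{n}{k-1}$; it is strictly below 1, consistent with (24).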

Figure 1: martingale argument

4 Knockoffs for random features

Here, we discuss variable selection in fairly arbitrary models. Specifically, we have a random pair $(X, Y)$ (where $X$ may contain thousands or even millions of covariates) and ask the following basic question: through which variables does the conditional distribution $p(Y \mid X)$ depend on $X$? In the literature on graphical models, this set of variables is known as the Markov blanket of $Y$. However, because the Markov blanket does not always exist, we adopt a working definition of a null which is unambiguous: we say that the variable $j$ is null if and only if

$$Y \perp X_j \mid X_{-j}.$$

It turns out that if the local Markov property holds, then the set of non-nulls is the Markov blanket. Hence we are interested in testing for conditional independence, which is another way of asking whether a variable influences the response once we know the values of all the others. In the logistic model where

$$\mathbb{P}(Y = 0 \mid X) = \frac{1}{1 + e^{X^T \beta}}, \tag{28}$$

testing for conditional independence is the same as testing $\beta_j = 0$. This holds with the proviso that the variables $X_1, \ldots, X_p$ are not perfectly dependent.

Now we discuss how to construct knockoffs for random features. In this setting, we have i.i.d. samples from $p(X, Y)$. We assume that the distribution of $X$ is known, while the distribution of $Y \mid X$ (the likelihood) is completely unknown. If we denote the original features by $X = (X_1, \ldots, X_p)$ and the knockoff features by $\tilde X = (\tilde X_1, \ldots, \tilde X_p)$, we require the following two properties:

1. Pairwise exchangeability:
$$(X, \tilde X)_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde X), \tag{29}$$
where $S$ is any subset of nulls.
2. Conditional independence: $\tilde X \perp Y \mid X$; this is achieved if we ignore $Y$ when constructing knockoffs.

For knockoff features satisfying the above conditions, we have the following result on exchangeability of feature importance statistics:

Theorem 2 (Candès, Fan, Janson, Lv (2016) [2]): For any subset $T$ of nulls, the scores obey

$$(Z, \tilde Z)_{\mathrm{swap}(T)} \overset{d}{=} (Z, \tilde Z). \tag{30}$$

This holds conditionally on $Y$, no matter the relationship between $Y$ and $X$.

In the following, we show how to construct knockoffs for Gaussian features. Suppose $X \sim N(\mu, \Sigma)$. Note that we must have $\tilde X \overset{d}{=} X$. A possible solution is to construct knockoffs such that the joint distribution of $X$ and $\tilde X$ is Gaussian:

$$(X, \tilde X) \sim N(\mu_*, \Sigma_*), \quad \text{where } \mu_* = \begin{bmatrix} \mu \\ \mu \end{bmatrix}, \quad \Sigma_* = \begin{bmatrix} \Sigma & \Sigma - \mathrm{diag}\{s\} \\ \Sigma - \mathrm{diag}\{s\} & \Sigma \end{bmatrix}, \tag{31}$$

and $s$ is chosen such that $\Sigma_* \succeq 0$. Thus, given random features $X$, we can construct knockoff features $\tilde X$ by sampling from the conditional distribution $\tilde X \mid X$.

5 Knockoffs for hidden Markov models

A general construction algorithm for knockoffs is as follows (Sequential Conditional Independent Pairs, abbreviated as SCIP) [2]:

For $j = 1, \ldots, p$, sample $\tilde X_j$ from the law of $X_j \mid X_{-j}, \tilde X_{1:j-1}$.

Here we use $X_{-j}$ to denote everything other than the $j$-th variable. When $p = 3$, we first sample $\tilde X_1$ from $\mathcal{L}(X_1 \mid X_{-1})$, then we sample $\tilde X_2$ from $\mathcal{L}(X_2 \mid X_{-2}, \tilde X_1)$, and finally we sample $\tilde X_3$ from $\mathcal{L}(X_3 \mid X_{-3}, \tilde X_1, \tilde X_2)$. The joint distribution of $(X, \tilde X)$ is known, and it can be shown to be pairwise exchangeable by an induction argument. While this general construction is usually not practical, it can be easy to implement in some cases, such as the Markov chains that we discuss next.

A Markov chain $X = (X_1, \ldots, X_p)$ can be characterized by an initial distribution $q_1$ and transition probabilities $Q$; we write $X \sim \mathrm{MC}(q_1, Q)$. The probability density function of $X$ is given by

$$p(X_1, \ldots, X_p) = q_1(X_1) \prod_{j=2}^{p} Q_j(X_j \mid X_{j-1}).$$

In this case, the general algorithm above can be implemented efficiently. For instance, when $p = 3$ and $X_1, X_2, X_3$ take values in a finite space $\chi$, we proceed sequentially. By the Markov property, sampling $\tilde X_1$ from $X_1 \mid X_{-1}$ amounts to sampling from $X_1 \mid X_2$, that is, from the distribution

$$p(\tilde X_1 = \tilde x_1 \mid X_{-1} = x_{-1}) \propto q_1(\tilde x_1) \, Q_2(x_2 \mid \tilde x_1).$$

The normalizing constant for this distribution is $N_1(x_2)$, where $N_1(k) = \sum_{l \in \chi} q_1(l) \, Q_2(k \mid l)$.

Then we sample $\tilde X_2$ from $X_2 \mid X_1, X_3, \tilde X_1$, that is, from

$$p(\tilde X_2 = \tilde x_2 \mid X_{-2} = x_{-2}, \tilde X_1 = \tilde x_1) \propto Q_2(\tilde x_2 \mid x_1) \, Q_3(x_3 \mid \tilde x_2) \, \frac{Q_2(\tilde x_2 \mid \tilde x_1)}{N_1(\tilde x_2)}.$$

[Figures 2-5: Knockoff construction for Markov chain]

The normalizing constant for the $\tilde X_2$ step is $N_2(x_3)$, where

$$N_2(k) = \sum_{l \in \chi} Q_2(l \mid x_1) \, Q_3(k \mid l) \, \frac{Q_2(l \mid \tilde x_1)}{N_1(l)}.$$

Then we sample $\tilde X_3$ from $X_3 \mid X_2, \tilde X_2$, that is, from

$$p(\tilde X_3 = \tilde x_3 \mid X_{-3} = x_{-3}, \tilde X_1 = \tilde x_1, \tilde X_2 = \tilde x_2) \propto Q_3(\tilde x_3 \mid x_2) \, \frac{Q_3(\tilde x_3 \mid \tilde x_2)}{N_2(\tilde x_3)}.$$

The normalizing constant is

$$N_3 = \sum_{l \in \chi} \frac{Q_3(l \mid x_2) \, Q_3(l \mid \tilde x_2)}{N_2(l)}.$$

The procedure is presented in Figures 2-5.

Next we discuss the construction of knockoffs for hidden Markov models (HMMs). Generally, $X = (X_1, \ldots, X_p)$ is distributed as a hidden Markov model if

$$\begin{cases} H \sim \mathrm{MC}(q_1, Q) & \text{(latent Markov chain)} \\ X_j \mid H \sim X_j \mid H_j \overset{\text{ind.}}{\sim} f_j(X_j; H_j) & \text{(emission distribution)} \end{cases} \tag{32}$$

In an HMM, the $H$ variables are latent and only the $X$ variables are observed (see Figure 6).

Figure 6: Hidden Markov model

One can show that the following algorithm constructs a knockoff copy $\tilde X$ of $X$ that can be described as an HMM.

Theorem 3 (Sesia, Sabatti, Candès (2017) [3]): A knockoff copy $\tilde X$ of $X$ can be constructed as follows:

1. Sample $H$ from $p(H \mid X)$ using the forward-backward algorithm.
2. Generate a knockoff $\tilde H$ of $H$ using the SCIP algorithm for a Markov chain.
3. Sample $\tilde X$ from the emission distribution of $X$ given $H = \tilde H$.

We present the above procedures in Figures 7-9.

Hidden Markov models are widely used as phenomenological models for haplotype estimation and phasing, and for imputation of missing SNPs. Each genotype can be modelled as the sum of two independent HMM haplotype sequences. Though we do not know the exact model parameters, we can estimate them from data, for example with the software fastPHASE. In a GWAS, the variables are the genotypes for a set of SNPs, and the response can be the status of a disease or a quantitative trait of interest. With the above procedure, we can generate approximate knockoff copies of genotype sequences and use them to make inference.
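The SCIP recursion for a Markov chain, which is also the building block for the latent chain in step 2 of Theorem 3, can be sketched for a chain of general length $p$ by extending the $p = 3$ pattern in the obvious way. This is a minimal sketch under stated assumptions rather than the authors' software: the function name `markov_chain_knockoff`, the list-of-matrices layout for the transition probabilities, and the example numbers are all hypothetical, and every transition probability is assumed strictly positive so that no normalizing constant vanishes.

```python
import random


def markov_chain_knockoff(x, q1, Q, rng=random):
    """Sample a knockoff copy of one observed Markov chain path via SCIP.

    x  : observed states, integers in range(K), positions 0..p-1
    q1 : initial distribution, q1[l] = P(first state = l)
    Q  : list of p-1 transition matrices; Q[j][a][b] is the probability
         that position j+1 equals b given that position j equals a
    Assumption: all transition probabilities are strictly positive.
    """
    p, K = len(x), len(q1)
    x_tilde = []
    N_prev = [1.0] * K  # normalizing constants from the previous step
    for j in range(p):
        if j == 0:
            w = list(q1)
        else:
            # transition into position j from the observed and from the
            # knockoff previous state, divided by the previous constant
            w = [Q[j - 1][x[j - 1]][l] * Q[j - 1][x_tilde[j - 1]][l] / N_prev[l]
                 for l in range(K)]
        if j < p - 1:
            # unnormalized knockoff law: w(l) times the transition out of l
            # into the observed next state
            weights = [w[l] * Q[j][l][x[j + 1]] for l in range(K)]
            # normalizing constant as a function of the next state k
            N_prev = [sum(w[l] * Q[j][l][k] for l in range(K)) for k in range(K)]
        else:
            weights = w
        x_tilde.append(rng.choices(range(K), weights=weights)[0])
    return x_tilde


# Hypothetical two-state chain of length 3.
q1_example = [0.3, 0.7]
Q_example = [[[0.8, 0.2], [0.4, 0.6]], [[0.5, 0.5], [0.1, 0.9]]]
knockoff_path = markov_chain_knockoff([0, 1, 1], q1_example, Q_example,
                                      random.Random(0))
```

Each step costs on the order of $|\chi|^2$ operations, so the whole recursion is linear in $p$, which is what makes the general SCIP construction practical in the Markov case.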

References

[1] R. Foygel Barber and E. J. Candès. Controlling the false discovery rate via knockoffs. Annals of Statistics 43(5), pp arxiv:
[2] Emmanuel Candès, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: Model-free knockoffs for high-dimensional controlled variable selection. To appear in Journal of the Royal Statistical Society Series B. arxiv:
[3] Matteo Sesia, Chiara Sabatti, and Emmanuel Candès. Gene Hunting with Knockoffs for Hidden Markov Models. arxiv:
