Covariate-Assisted Variable Ranking

1 Covariate-Assisted Variable Ranking
Tracy Ke, Department of Statistics, Harvard University
Louis, Sep. 8,

2 Sparse linear regression
$Y = X\beta + z$, $X \in \mathbb{R}^{n,p}$, $z \sim N(0, \sigma^2 I_n)$
Signals (nonzeros of $\beta$) are Rare/Weak
A column of $X$ may be significantly correlated with a few others
Goal: Rank variables so that the top-ranked ones contain as many signals as possible

3 Ranking by marginal scores
In this talk, we assume the design is normalized, i.e., $\|x_j\|_2 = 1$
$T_j = \left( \frac{(x_j, Y)}{(x_j, x_j)} \right)^2 = (x_j, Y)^2$, where $x_j$ is the $j$-th column of $X$
Pros: Computationally efficient
Cons: Signal Cancellation
$(x_j, Y) = \underbrace{\beta_j + \sum_{k: k \neq j, \beta_k \neq 0} (x_j, x_k)\beta_k}_{\text{may cancel each other}} + (x_j, z)$

4 Multivariate scores
$P_I$: projection from $\mathbb{R}^n$ to $\mathrm{span}\{x_j, j \in I\}$
$T_{j|I} = \|P_I Y\|^2 - \|P_{I \setminus \{j\}} Y\|^2$
Reduces to the marginal score when $I = \{j\}$
$T_{j|I}$ is the log-likelihood ratio between $\mathrm{Supp}(\beta) = I$ and $\mathrm{Supp}(\beta) = I \setminus \{j\}$
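The multivariate score above is a difference of two squared projection norms, which can be sketched numerically. The snippet below is an illustration only: the function names `proj_norm_sq` and `multivariate_score` and the random toy data are assumptions made for this example, not part of the talk.

```python
import numpy as np

def proj_norm_sq(X, Y, idx):
    """Return ||P_I Y||^2 for the column index set idx (0 for the empty set)."""
    idx = list(idx)
    if not idx:
        return 0.0
    XI = X[:, idx]
    coef, *_ = np.linalg.lstsq(XI, Y, rcond=None)  # least-squares fit on span{x_j : j in I}
    fitted = XI @ coef                             # this is P_I Y
    return float(fitted @ fitted)

def multivariate_score(X, Y, j, I):
    """T_{j|I}: ||P_I Y||^2 minus the same quantity with j removed from I."""
    I = list(I)
    if j not in I:
        raise ValueError("j must belong to I")
    rest = [k for k in I if k != j]
    return proj_norm_sq(X, Y, I) - proj_norm_sq(X, Y, rest)

# Toy data (hypothetical): normalized columns, as the talk assumes.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
X /= np.linalg.norm(X, axis=0)
Y = rng.standard_normal(50)

marginal = float(X[:, 1] @ Y) ** 2                 # (x_j, Y)^2 for j = 1
single = multivariate_score(X, Y, 1, [1])          # should coincide with the marginal score
pair = multivariate_score(X, Y, 1, [0, 1])         # score of variable 1 given variable 0
```

With `I = [j]` the function returns the marginal score, which is the reduction stated on the slide.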

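6 Example: Blockwise diagonal design
Gram matrix $X'X$ is blockwise diagonal with $2 \times 2$ blocks $\begin{pmatrix} 1 & h \\ h & 1 \end{pmatrix}$, where $h \in (-1, 1)$
$\beta$ has 3 signals: $\beta_1 = \tau$, $\beta_2 = \beta_3 = a\tau$
$(h, a) = (-1/3, 1/3)$, $\sigma^2 = 0$

| Variable | Marginal Score (MaS) | Bivariate Score (BiS) | Rank by MaS | Rank by max(MaS, BiS) |
| $\beta_1 = \tau$ | $(8/9)\tau$ | $(2\sqrt{2}/3)\tau$ | 1 | 1 |
| $\beta_2 = (1/3)\tau$ | 0 | $(2\sqrt{2}/9)\tau$ | 4 | 3 |
| $\beta_3 = (1/3)\tau$ | $(1/3)\tau$ | $(2\sqrt{2}/9)\tau$ | 2 | 2 |
| $\beta_4 = 0$ | $(1/9)\tau$ | | | |
| $\beta_5 = 0$ | | | | |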
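The legible entries of this example can be checked numerically. The sketch below assumes $\tau = 1$, reads $(h, a)$ as $(-1/3, 1/3)$, pairs variables as blocks {1,2}, {3,4}, {5,6}, and reports scores on the unsquared scale, which appears to be the scale used in the table. Building $X$ from a Cholesky factor is just a convenient way to obtain a design with the stated Gram matrix.

```python
import numpy as np

h, a, tau = -1.0 / 3.0, 1.0 / 3.0, 1.0
block = np.array([[1.0, h], [h, 1.0]])
gram = np.kron(np.eye(3), block)            # 6 x 6 blockwise diagonal Gram matrix
X = np.linalg.cholesky(gram).T              # any X with X'X = gram would do
beta = tau * np.array([1.0, a, a, 0.0, 0.0, 0.0])
Y = X @ beta                                # noiseless case, sigma^2 = 0

def proj_norm_sq(idx):
    """||P_I Y||^2 for the column index list idx."""
    XI = X[:, idx]
    coef, *_ = np.linalg.lstsq(XI, Y, rcond=None)
    fitted = XI @ coef
    return float(fitted @ fitted)

marginal = np.abs(X.T @ Y)                  # |(x_j, Y)|
bivariate = np.zeros(6)
for j in range(6):
    partner = j + 1 if j % 2 == 0 else j - 1        # the other variable in the same block
    t = proj_norm_sq([j, partner]) - proj_norm_sq([partner])
    bivariate[j] = np.sqrt(max(t, 0.0))             # guard against tiny negative rounding

combined = np.maximum(marginal, bivariate)
top_four = [int(j) for j in np.argsort(-combined)[:4]]   # 0-based indices, best first
```

Under these assumptions the three signals occupy the top three positions of `top_four`, whereas the marginal score of the second variable is zero.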

8 Blockwise design (noiseless case)
Marginal ranking mis-ranks some signals below noise when $|ah| > |a + h|$
Our proposal: ranking by the maximum of the marginal score and the bivariate score
It correctly ranks all signals above noise if h < 1/
In the noiseless case, least-squares always gives the correct ranking

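9 Rare/Weak signal model and three regions
$\beta_j = 0$ with prob. $1 - \epsilon_p$, and $\beta_j = \pm\tau_p$ with prob. $\epsilon_p/2$ each
$\epsilon_p = p^{-\vartheta}$, $\tau_p = \sqrt{2r\log(p)}$, $0 < \vartheta, r < 1$
$R_\alpha = \frac{\text{Rank of } j_\alpha}{\alpha s}$, where $j_\alpha$ is the variable with rank $\alpha s$ among all $s$ signals
Exactly Rankable: $R_\alpha = 1$ for all $\alpha \in (0, 1)$, w.h.p.
Rankable: $1 < R_\alpha \leq 1 + o_p(1)$ for any $\alpha \in (0, 1)$
Not Rankable: $R_\alpha \gg 1$ for some $\alpha \in (0, 1)$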
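To make the Rare/Weak model concrete, here is a small sampler. It assumes the garbled formulas read $\epsilon_p = p^{-\vartheta}$ and $\tau_p = \sqrt{2r\log(p)}$, the usual calibration for this model; the function name `rare_weak_beta` and the seed handling are illustrative choices.

```python
import math
import random

def rare_weak_beta(p, theta, r, seed=0):
    """Draw beta with entries 0 (prob. 1 - eps), +tau or -tau (prob. eps/2 each)."""
    eps = p ** (-theta)                       # sparsity level, assumed p^(-theta)
    tau = math.sqrt(2.0 * r * math.log(p))    # signal strength, assumed sqrt(2 r log p)
    rng = random.Random(seed)
    beta = []
    for _ in range(p):
        u = rng.random()
        if u < eps / 2:
            beta.append(tau)
        elif u < eps:
            beta.append(-tau)
        else:
            beta.append(0.0)
    return beta, eps, tau

beta, eps, tau = rare_weak_beta(p=1000, theta=0.5, r=0.5)
n_signals = sum(1 for b in beta if b != 0.0)  # on average about p * eps
```

Larger $\vartheta$ makes signals rarer and smaller $r$ makes them weaker, which is what the $(\vartheta, r)$ phase diagrams later in the talk are indexed by.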

10 Blockwise design (phase diagram)
[Figure: phase diagrams in the $(\vartheta, r)$ plane, with $\vartheta$ on the horizontal axis and $r$ on the vertical axis. Panels: Our proposal, Least-squares, MR, MR (zoom-out). Each panel is divided into Exactly Rankable, Rankable, and Not Rankable regions.]

11 Graph Of Strong Dependence (GOSD)
Define GOSD $G = (V, E)$:
$V = \{1, 2, \ldots, p\}$: each variable is a node
Nodes $i$ and $j$ have an edge iff $|(x_i, x_j)| \geq \delta$ ($\delta = \frac{1}{\log(p)}$, say)
Under our assumptions, $G$ is sparse
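A minimal sketch of building the GOSD from a design matrix follows. It assumes the edge rule is a threshold on the absolute inner product of normalized columns, with default $\delta = 1/\log(p)$; the function name `build_gosd` and the adjacency-dictionary representation are choices made for this example.

```python
import numpy as np

def build_gosd(X, delta=None):
    """Adjacency sets of the graph with an edge (i, j) iff |(x_i, x_j)| >= delta."""
    n, p = X.shape
    if delta is None:
        delta = 1.0 / np.log(p)                   # the slide's suggested default (needs p >= 2)
    Xn = X / np.linalg.norm(X, axis=0)            # normalize columns to unit length
    gram = Xn.T @ Xn
    adj = {i: set() for i in range(p)}
    upper = np.abs(np.triu(gram, k=1))            # strict upper triangle: each pair once, no self-loops
    rows, cols = np.where(upper >= delta)
    for i, j in zip(rows, cols):
        adj[int(i)].add(int(j))
        adj[int(j)].add(int(i))
    return adj

# Toy design (hypothetical) whose Gram matrix has two 2 x 2 blocks with correlation 0.6.
block = np.array([[1.0, 0.6], [0.6, 1.0]])
X_demo = np.linalg.cholesky(np.kron(np.eye(2), block)).T
adj_demo = build_gosd(X_demo, delta=0.5)
max_degree = max(len(nbrs) for nbrs in adj_demo.values())
```

Forming the full Gram matrix costs $O(np^2)$; for large $p$ one would likely want a blocked or sparse computation instead.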

13 Covariate-Assisted Ranking (CAR)
Rank variables by $T_j = \max_{I \in A_j(m)} T_{j|I}$, where $T_{j|I} = \|P_I Y\|^2 - \|P_{I \setminus \{j\}} Y\|^2$
$A_j(m)$: size m connected subgraphs containing $j$
Let $d$ be the maximum degree of $G$. Then $\sum_{j=1}^{p} |A_j(m)| \leq Cp(2.718d)^m \ll \sum_{k=1}^{m} \binom{p}{k}$
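The CAR score can be sketched as an enumeration of small connected node sets followed by a maximum over projection-based scores. The code below is an illustration under assumptions: it takes $A_j(m)$ to contain connected sets of every size from 1 to $m$ (so the marginal score is included), and the helper names are invented for this example.

```python
import numpy as np

def connected_subsets(adj, j, m):
    """All connected node sets of size at most m that contain j (sizes 1..m assumed)."""
    found = {frozenset([j])}
    frontier = [frozenset([j])]
    for _ in range(m - 1):                       # grow sets one neighbor at a time
        grown = []
        for s in frontier:
            for v in s:
                for w in adj[v]:
                    if w not in s:
                        t = s | {w}
                        if t not in found:
                            found.add(t)
                            grown.append(t)
        frontier = grown
    return found

def proj_norm_sq(X, Y, idx):
    """||P_I Y||^2 for the column index list idx (0 for the empty list)."""
    if not idx:
        return 0.0
    XI = X[:, idx]
    coef, *_ = np.linalg.lstsq(XI, Y, rcond=None)
    fitted = XI @ coef
    return float(fitted @ fitted)

def car_scores(X, Y, adj, m=2):
    """T_j = max over connected sets I containing j of T_{j|I}."""
    p = X.shape[1]
    scores = np.zeros(p)
    for j in range(p):
        for I in connected_subsets(adj, j, m):
            rest = sorted(I - {j})
            t = proj_norm_sq(X, Y, sorted(I)) - proj_norm_sq(X, Y, rest)
            scores[j] = max(scores[j], t)
    return scores

# Blockwise example with (h, a) = (-1/3, 1/3), tau = 1, noiseless.
h, a = -1.0 / 3.0, 1.0 / 3.0
X = np.linalg.cholesky(np.kron(np.eye(3), np.array([[1.0, h], [h, 1.0]]))).T
Y = X @ np.array([1.0, a, a, 0.0, 0.0, 0.0])
adj = {0: {1}, 1: {0}, 2: {3}, 3: {2}, 4: {5}, 5: {4}}
scores = car_scores(X, Y, adj, m=2)
top_three = {int(j) for j in np.argsort(-scores)[:3]}
```

Restricting the search to connected sets of the GOSD is what keeps the number of candidate sets near $p(ed)^m$ rather than the number of all subsets of size up to $m$.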

14 A real example
Data: gene expression of human immortalized B cells ($(p, n) = (4238, 148)$; Nayak et al. (2009))
Remove the first singular vector:
$\text{Data} = \sum_{k=1}^{n} \sigma_k u_k v_k' = \sigma_1 u_1 v_1' + \underbrace{\sum_{k=2}^{n} \sigma_k u_k v_k'}_{\text{design matrix } X}$
Synthetic data for regression: $Y = N(X\beta, I_n)$, with $\beta_j \sim N(0, \eta^2)$ for $1 \leq j \leq s$ and $\beta_j = 0$ otherwise
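The preprocessing step can be sketched as follows. The gene expression data is not reproduced here, so the snippet uses a small random matrix as a stand-in; the sizes `n`, `p`, `s`, the value of `eta`, and the planted rank-one component are all placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s, eta = 20, 60, 5, 1.0           # placeholders; the talk uses (p, n) = (4238, 148)

# Stand-in data matrix (n x p) with one dominant rank-one component.
data = rng.standard_normal((n, p)) + 3.0 * np.outer(rng.standard_normal(n),
                                                    rng.standard_normal(p))

# Data = sum_k sigma_k u_k v_k'; dropping the k = 1 term leaves the design matrix X.
U, sing, Vt = np.linalg.svd(data, full_matrices=False)
X = data - sing[0] * np.outer(U[:, 0], Vt[0])

# Synthetic response: beta_j ~ N(0, eta^2) for the first s coordinates, 0 otherwise.
beta = np.zeros(p)
beta[:s] = rng.normal(0.0, eta, size=s)
Y = X @ beta + rng.standard_normal(n)   # Y ~ N(X beta, I_n)

remaining = np.linalg.svd(X, compute_uv=False)   # singular values of X
```

After the removal, the singular values of `X` are $\sigma_2, \ldots, \sigma_n$ followed by a value that is numerically zero.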

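15 Comparison of the ROC curve
For CAR, $(m, \delta) = (2, 0.5)$
$\beta_j \sim N(0, \eta^2)$ for $1 \leq j \leq s$, and $\beta_j = 0$ otherwise
Left: $(\eta, s) = (0.1, 50)$. Right: $(\eta, s) = (5, 50)$
[Figure: two panels of ROC curves, each comparing CAR and MR.]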
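For readers who want to reproduce such a comparison, an ROC curve for a variable ranking can be traced by walking down the ranking and counting signals and non-signals. This is a generic sketch, not the talk's evaluation code: the function names are illustrative, ties in scores are broken arbitrarily, and both signals and non-signals are assumed to be present.

```python
def roc_points(scores, is_signal):
    """(false positive rate, true positive rate) after each position of the ranking."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])   # best score first
    n_pos = sum(1 for flag in is_signal if flag)
    n_neg = len(is_signal) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for j in order:
        if is_signal[j]:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy ranking (hypothetical): both signals outrank both noise variables.
scores = [0.9, 0.1, 0.8, 0.3]
is_signal = [True, False, True, False]
points = roc_points(scores, is_signal)
auc = area_under_curve(points)
```

A ranking that places every signal above every noise variable gives an area of 1.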

16 Extensions
Gram matrix is non-sparse but is sparsifiable: $Y = X\beta + z \Longrightarrow HY = HX\beta + Hz$
Change-point or time-series design: linear filtering
Low-rank plus sparse design: PCA projection
Generalized linear models: $\|P_I y\|^2 \Longrightarrow$ log-likelihood $\hat{\ell}_I(y)$
Ke, Jin and Fan ('14), Ke and Yang ('17)
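As one hypothetical instance of sparsifying by linear filtering, consider a change-point style design whose columns are step functions. The choice of design and of a first-order differencing filter `H` below is an assumption made for illustration; the talk does not specify which filter it uses.

```python
import numpy as np

n = 6
X = np.tril(np.ones((n, n)))            # column j is a step that switches on at row j
H = np.eye(n) - np.eye(n, k=-1)         # first-order differencing: (HY)_i = Y_i - Y_{i-1}

gram_before = X.T @ X                   # dense: every pair of steps overlaps
HX = H @ X
gram_after = HX.T @ HX                  # Gram matrix of the filtered design

dense_entries = int(np.count_nonzero(gram_before))
sparse_entries = int(np.count_nonzero(gram_after))
```

In this toy case the filtered design is the identity, so its Gram matrix is diagonal. Note that the filtered noise $Hz$ is no longer white, which a full treatment would have to account for.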

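17 CASE for variable selection
Screen. Rank variables by CAR and let $\hat{S}_t = \{1 \leq j \leq p: T_j \geq t\}$
$\underbrace{G_{\hat{S}}}_{\text{post-screening subgraph}} = \underbrace{G_{\hat{S},1} \cup G_{\hat{S},2} \cup \ldots \cup G_{\hat{S},\hat{M}}}_{\text{small-size components}}$
Clean. If $j \notin \hat{S}_t$, set $\hat{\beta}_j = 0$. Otherwise, we must have $j \in G_{\hat{S},k}$ for some $k$. Estimate $\{\beta_j : j \in G_{\hat{S},k}\}$ by minimizing
$\left\| P_{G_{\hat{S},k}} \left( Y - \sum_{j \in G_{\hat{S},k}} \beta_j x_j \right) \right\|^2 + u\|\beta\|_0$, s.t. $\beta_j = 0$ or $|\beta_j| \geq v$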
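The screen-and-clean procedure can be sketched as below. This is a simplified illustration, not the authors' implementation: the clean step brute-forces supports inside each component, fits ordinary least squares on each support, and simply discards fits that violate $|\beta_j| \geq v$ rather than solving the constrained problem exactly. All function names and the toy inputs are assumptions, and squared marginal scores stand in for CAR scores.

```python
import itertools
import numpy as np

def components(nodes, adj):
    """Connected components of the subgraph induced by `nodes`."""
    nodes, seen, comps = set(nodes), set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, stack = [], [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w in nodes and w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(sorted(comp))
    return comps

def clean_component(X, Y, comp, u, v):
    """Heuristic minimizer of ||P_I(Y - X_I b)||^2 + u * ||b||_0 over one component."""
    XI = X[:, comp]
    Q, _ = np.linalg.qr(XI)                  # orthonormal basis of span{x_j : j in comp}
    Yp = Q @ (Q.T @ Y)                       # P_I Y
    best_obj, best_b = None, None
    for k in range(len(comp) + 1):
        for supp in itertools.combinations(range(len(comp)), k):
            b = np.zeros(len(comp))
            if supp:
                coef, *_ = np.linalg.lstsq(XI[:, list(supp)], Yp, rcond=None)
                if np.any(np.abs(coef) < v):  # simplification: drop infeasible fits
                    continue
                b[list(supp)] = coef
            resid = Yp - XI @ b
            obj = float(resid @ resid) + u * k
            if best_obj is None or obj < best_obj:
                best_obj, best_b = obj, b
    return best_b

def case_select(X, Y, scores, adj, t, u, v):
    """Screen by thresholding scores at t, then clean each surviving component."""
    p = X.shape[1]
    beta_hat = np.zeros(p)
    survivors = [j for j in range(p) if scores[j] >= t]
    for comp in components(survivors, adj):
        beta_hat[comp] = clean_component(X, Y, comp, u, v)
    return beta_hat

# Toy noiseless example (hypothetical): three 2 x 2 blocks with h = 0.5.
X = np.linalg.cholesky(np.kron(np.eye(3), np.array([[1.0, 0.5], [0.5, 1.0]]))).T
beta = np.array([1.0, 0.0, 0.0, 0.8, 0.0, 0.0])
Y = X @ beta
adj = {0: {1}, 1: {0}, 2: {3}, 3: {2}, 4: {5}, 5: {4}}
scores = (X.T @ Y) ** 2                      # stand-in for CAR scores
beta_hat = case_select(X, Y, scores, adj, t=0.1, u=0.05, v=0.3)
```

The brute-force search is only workable because the post-screening components are small, which is the point of the decomposition on the slide.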

18 Signal archipelago
$\underbrace{G_S}_{\text{signal subgraph}} = \underbrace{G_{S,1} \cup G_{S,2} \cup \ldots \cup G_{S,M}}_{\text{components}}$, $S = S(\beta)$

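19 Rare/Weak signal model and three regions
$\beta_j = b_j \mu_j$, $b_j \overset{iid}{\sim} \mathrm{Bernoulli}(\epsilon)$, $\tau \leq |\mu_j| \leq a\tau$
$\epsilon = \epsilon_p = p^{-\vartheta}$, $\tau = \tau_p = \sqrt{2r\log(p)}$
Hamming distance: $\mathrm{Hamm}_p(\hat{\beta}, \vartheta, r) = \sup_{\mu} \left\{ \sum_{j=1}^{p} P\left( \mathrm{sgn}(\hat{\beta}_j) \neq \mathrm{sgn}(\beta_j) \right) \right\}$
No recovery: $\mathrm{Hamm}_p(\hat{\beta}, \vartheta, r) \gtrsim p\epsilon_p$
Almost Full Recovery: $1 \ll \mathrm{Hamm}_p(\hat{\beta}, \vartheta, r) \ll p\epsilon_p$
Exact recovery: $\mathrm{Hamm}_p(\hat{\beta}, \vartheta, r) = 0$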
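The quantity inside the Hamming distance is a count of sign disagreements, which is simple to compute for a single realization. The sketch below computes that empirical count only; the slide's $\mathrm{Hamm}_p$ is its expectation, maximized over $\mu$, and would need to be approximated by simulation. The function names are illustrative.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def sign_hamming(beta_hat, beta):
    """Number of coordinates where sgn(beta_hat_j) differs from sgn(beta_j)."""
    if len(beta_hat) != len(beta):
        raise ValueError("vectors must have the same length")
    return sum(1 for bh, b in zip(beta_hat, beta) if sign(bh) != sign(b))

# Toy vectors (hypothetical): one false positive and one missed signal.
beta = [1.5, 0.0, -1.5, 0.0]
beta_hat = [1.2, 0.3, 0.0, 0.0]
errors = sign_hamming(beta_hat, beta)
```

Because signs are compared rather than values, an estimate with the right support but a flipped sign still counts as an error.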

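20 Phase Diagram (blockwise design)
$Y \sim N(X\beta, I_n)$, rows of $X$ iid $N(0, \frac{1}{n}\Omega)$, $\Omega$ = (matrix not recoverable from the source; the surviving entries are $a$ and 1)
Left: CASE/optimal. Right: Lasso
[Figure: two phase diagrams in the $(\vartheta, r)$ plane, with $\vartheta$ on the horizontal axis and $r$ on the vertical axis. Region labels include Exact Recovery, Almost Full Recovery, No Recovery, Optimal, and Non-optimal.]
Ji and Jin ('12), Jin, Zhang and Zhang ('14)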

21 Take-home messages
CAR: a variable ranking method that mitigates signal cancellation in marginal ranking
Appealing ROC curves
CASE: a screen-and-clean method for variable selection
Phase transition of Hamming distance
