Covariate-Assisted Variable Ranking

Tracy Ke, Department of Statistics, Harvard University
WHOA-PSI @ St. Louis, Sep. 8, 2018
Sparse linear regression

$$Y = X\beta + z, \qquad X \in \mathbb{R}^{n \times p}, \qquad z \sim N(0, \sigma^2 I_n)$$

- Signals (nonzeros of $\beta$) are Rare/Weak
- A column of $X$ may be significantly correlated with a few others
- Goal: rank variables so that the top-ranked ones contain as many signals as possible
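To make the setup concrete, here is a minimal simulation sketch of this model; the dimensions, noise level, and the rarity/strength values `eps` and `tau` are illustrative choices, not values from the talk.

```python
import numpy as np

# Minimal simulation of Y = X beta + z with rare/weak signals.
# n, p, sigma, eps, tau are illustrative, not from the talk.
rng = np.random.default_rng(0)
n, p, sigma = 200, 1000, 1.0
eps, tau = 0.01, 3.0

X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)          # normalized design: ||x_j||_2 = 1

beta = np.zeros(p)
signals = rng.random(p) < eps           # rare: each variable is a signal w.p. eps
beta[signals] = tau * rng.choice([-1.0, 1.0], size=int(signals.sum()))

Y = X @ beta + sigma * rng.standard_normal(n)
```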
Ranking by marginal scores

In this talk, we assume the design is normalized, i.e., $\|x_j\|_2 = 1$.

$$T_j = \frac{(x_j, Y)^2}{(x_j, x_j)} = (x_j, Y)^2, \qquad x_j: \text{$j$-th column of } X$$

- Pros: computationally efficient
- Cons: signal cancellation:

$$(x_j, Y) = \underbrace{\beta_j + \sum_{k:\, k \ne j,\, \beta_k \ne 0} (x_j, x_k)\beta_k}_{\text{may cancel each other}} + (x_j, z)$$
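A sketch of marginal ranking, continuing the simulation above (it reuses `X`, `Y`, and `signals` from that snippet):

```python
# Marginal scores T_j = (x_j, Y)^2 and the induced ranking.
T_marginal = (X.T @ Y) ** 2
ranking = np.argsort(-T_marginal)            # variables sorted by score
s = int(signals.sum())
hits = np.isin(ranking[:s], np.flatnonzero(signals)).sum()
print(f"signals among the top {s} ranked variables: {hits}/{s}")
```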
Multivariate scores

$P_I$: projection from $\mathbb{R}^n$ to $\mathrm{span}\{x_j : j \in I\}$

$$T_j^I = \|P_I Y\|^2 - \|P_{I \setminus \{j\}} Y\|^2$$

- Reduces to the marginal score when $I = \{j\}$
- $T_j^I$ is the log-likelihood ratio between $\mathrm{Supp}(\beta) = I$ vs. $\mathrm{Supp}(\beta) = I \setminus \{j\}$
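The two quantities translate directly into code. The sketch below assumes a column-normalized design `X` as above and represents $I$ as a set of column indices; the function names are my own.

```python
import numpy as np

def proj_norm_sq(X, Y, I):
    """||P_I Y||^2, where P_I projects onto span{x_j : j in I}."""
    I = list(I)
    if not I:
        return 0.0
    Q, _ = np.linalg.qr(X[:, I])             # orthonormal basis of the span
    return float(np.sum((Q.T @ Y) ** 2))

def multivariate_score(X, Y, j, I):
    """T_j^I = ||P_I Y||^2 - ||P_{I \\ {j}} Y||^2."""
    I = set(I)
    return proj_norm_sq(X, Y, I) - proj_norm_sq(X, Y, I - {j})
```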
Example: Blockwise diagonal design

Gram matrix $X'X$ is blockwise diagonal with $2 \times 2$ blocks $\begin{pmatrix} 1 & h \\ h & 1 \end{pmatrix}$, where $h \in (-1, 1)$

$\beta$ has 3 signals: $\beta_1 = \tau$, $\beta_2 = \beta_3 = a\tau$; $(h, a) = (-1/3, 1/3)$, $\sigma^2 = 0$

Variable               | Marginal score (MaS) | Bivariate score (BiS) | Rank by MaS | Rank by max(MaS, BiS)
$\beta_1 = \tau$       | $(8/9)\tau$          | $(2\sqrt{2}/3)\tau$   | 1           | 1
$\beta_2 = (1/3)\tau$  | $0$                  | $(2\sqrt{2}/9)\tau$   | 4           | 3
$\beta_3 = (1/3)\tau$  | $(1/3)\tau$          | $(2\sqrt{2}/9)\tau$   | 2           | 2
$\beta_4 = 0$          | $(1/9)\tau$          | $0$                   | 3           | 4
$\beta_5 = 0$          | $0$                  | $0$                   | 4           | 5
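The table can be checked numerically. The self-contained sketch below rebuilds a design with this Gram matrix (taking $\tau = 1$ for simplicity) and reproduces the marginal and bivariate columns on the $\tau$ scale:

```python
import numpy as np

h, a, tau = -1/3, 1/3, 1.0
X = np.zeros((6, 6))
for k in range(3):                          # three 2x2 blocks
    X[2*k, 2*k] = 1.0
    X[2*k, 2*k+1] = h                       # so that (x_{2k}, x_{2k+1}) = h
    X[2*k+1, 2*k+1] = np.sqrt(1 - h**2)
beta = np.array([tau, a*tau, a*tau, 0.0, 0.0, 0.0])
Y = X @ beta                                # noiseless: sigma^2 = 0

marginal = np.abs(X.T @ Y)                  # |(x_j, Y)|

def bivariate_sqrt(j, k):
    """sqrt of T_j^{jk} = ||P_{jk} Y||^2 - ||P_k Y||^2."""
    Q, _ = np.linalg.qr(X[:, [j, k]])
    T = np.sum((Q.T @ Y) ** 2) - (X[:, k] @ Y) ** 2
    return np.sqrt(max(T, 0.0))

biv = [bivariate_sqrt(j, k) for j, k in [(0, 1), (1, 0), (2, 3), (3, 2), (4, 5)]]
print(np.round(marginal[:5], 3))   # [0.889 0.    0.333 0.111 0.   ]
print(np.round(biv, 3))            # [0.943 0.314 0.314 0.    0.   ]
```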
Blockwise design (noiseless case)

- Marginal ranking mis-ranks some signals below noise variables when $|ah| > |a + h|$ (as in the example, where $a + h = 0$)
- Our proposal: rank by the maximum of the marginal score and the bivariate score
- It correctly ranks all signals above the noise if $|h| < 1/\sqrt{2} \approx 0.7$
- In the noiseless case, least-squares always gives the correct ranking
Rare/Weak signal model and three regions

$$\beta_j = \begin{cases} 0, & \text{with prob. } 1 - \epsilon_p \\ \pm\tau_p, & \text{with prob. } \epsilon_p/2 \text{ each} \end{cases}$$

$$\epsilon_p = p^{-\vartheta}, \qquad \tau_p = \sqrt{2r\log(p)}, \qquad 0 < \vartheta, r < 1$$

Let $s$ be the number of signals, let $j_\alpha$ be the variable ranked $\lceil \alpha s \rceil$-th among all $s$ signals, and let $R_\alpha$ = (rank of $j_\alpha$ among all $p$ variables)$/(\alpha s)$.

- Exactly Rankable: $R_\alpha = 1$ for all $\alpha \in (0, 1)$, w.h.p.
- Rankable: $1 < R_\alpha \le 1 + o_p(1)$ for any $\alpha \in (0, 1)$
- Not Rankable: $R_\alpha$ stays bounded away from $1$ for some $\alpha \in (0, 1)$
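A sketch of drawing $\beta$ from this model; the values of $\vartheta$ and $r$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
p, vartheta, r = 10_000, 0.5, 0.6
eps_p = p ** (-vartheta)                     # epsilon_p = p^{-vartheta}
tau_p = np.sqrt(2 * r * np.log(p))           # tau_p = sqrt(2 r log p)
signs = rng.choice([-1.0, 1.0], size=p)      # +/- tau_p with equal probability
beta = np.where(rng.random(p) < eps_p, tau_p * signs, 0.0)
```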
Blockwise design (phase diagram)

[Figure: phase diagrams in the $(\vartheta, r)$-plane, partitioned into Not Rankable, Rankable, and Exactly Rankable regions, with the marginal-ranking (MR) boundary overlaid. Left: our proposal. Right: least-squares, together with a zoomed-out view.]
Graph Of Strong Dependence (GOSD)

Define the GOSD $\mathcal{G} = (V, E)$:

- $V = \{1, 2, \ldots, p\}$: each variable is a node
- Nodes $i$ and $j$ have an edge iff $|(x_i, x_j)| \ge \delta$ (e.g., $\delta = 1/\log(p)$)
- Under our assumptions, $\mathcal{G}$ is sparse
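A direct sketch of the GOSD as adjacency sets, assuming a column-normalized design `X` as before (the function name is my own):

```python
import numpy as np

def gosd_adjacency(X, delta):
    """GOSD: nodes i ~ j iff |(x_i, x_j)| >= delta, i != j."""
    G = np.abs(X.T @ X)
    np.fill_diagonal(G, 0.0)                 # no self-loops
    return {j: set(np.flatnonzero(G[j] >= delta).tolist())
            for j in range(X.shape[1])}
```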
Covariate-Assisted Ranking (CAR)

Rank variables by

$$T_j = \max_{I \in \mathcal{A}_j(m)} T_j^I, \qquad T_j^I = \|P_I Y\|^2 - \|P_{I \setminus \{j\}} Y\|^2$$

$\mathcal{A}_j(m)$: connected subgraphs of $\mathcal{G}$ of size $\le m$ containing node $j$

Let $d$ be the maximum degree of $\mathcal{G}$. Then

$$\sum_{j=1}^p |\mathcal{A}_j(m)| \le C p\, (2.718\, d)^m \ll \sum_{k=1}^m \binom{p}{k}$$
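A sketch of CAR, reusing `multivariate_score` and `gosd_adjacency` from the earlier snippets; the subgraph enumeration below is deliberately naive (duplicates are deduplicated via a set), which is adequate for small $m$:

```python
def connected_subgraphs(adj, j, m):
    """All connected node sets of size <= m that contain node j (naive DFS)."""
    results = set()
    def grow(S, frontier):
        results.add(frozenset(S))
        if len(S) == m:
            return
        for v in frontier:
            # extend S by v; the new frontier is N(S + v) \ (S + v)
            grow(S | {v}, (frontier | adj[v]) - S - {v})
    grow({j}, set(adj[j]))
    return results

def car_score(X, Y, adj, j, m):
    """T_j = max over I in A_j(m) of T_j^I; includes I = {j}, the marginal score."""
    return max(multivariate_score(X, Y, j, I)
               for I in connected_subgraphs(adj, j, m))
```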
A real example

Data: gene expression of human immortalized B cells; $(p, n) = (4238, 148)$; Nayak et al. (2009)

Remove the first singular vector:

$$\mathrm{Data} = \sum_{k=1}^n \sigma_k u_k v_k^\top = \sigma_1 u_1 v_1^\top + \underbrace{\sum_{k=2}^n \sigma_k u_k v_k^\top}_{\text{design matrix } X}$$

Synthetic data for regression: $Y \sim N(X\beta, I_n)$, with $\beta_j \sim N(0, \eta^2)$ for $1 \le j \le s$ and $\beta_j = 0$ otherwise
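A sketch of this construction; `data` stands in for the $n \times p$ expression matrix, and the column normalization is an assumption carried over from the earlier slides:

```python
import numpy as np

def make_design_and_response(data, s, eta, rng):
    """Strip the leading singular component, then simulate Y ~ N(X beta, I_n)."""
    U, svals, Vt = np.linalg.svd(data, full_matrices=False)
    X = data - svals[0] * np.outer(U[:, 0], Vt[0])   # remove first singular vector
    X = X / np.linalg.norm(X, axis=0)                # normalize columns (assumed)
    p = X.shape[1]
    beta = np.zeros(p)
    beta[:s] = eta * rng.standard_normal(s)          # beta_j ~ N(0, eta^2), j <= s
    Y = X @ beta + rng.standard_normal(X.shape[0])
    return X, Y, beta
```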
Comparison of the ROC curves

For CAR, $(m, \delta) = (2, 0.5)$

$$\beta_j \sim N(0, \eta^2) \text{ for } 1 \le j \le s; \qquad \beta_j = 0 \text{ otherwise}$$

Left: $(\eta, s) = (0.1, 50)$. Right: $(\eta, s) = (5, 50)$

[Figure: ROC curves of CAR vs. MR in the two settings.]
Extensions

- Gram matrix is non-sparse but sparsifiable:
  $$Y = X\beta + z \;\Longrightarrow\; HY = HX\beta + Hz$$
  - Change-point or time-series designs: linear filtering
  - Low-rank-plus-sparse designs: PCA projection
- Generalized linear models: replace $\|P_I y\|^2$ by the log-likelihood $\hat{\ell}_I(y)$

Ke, Jin and Fan (2014), Ke and Yang (2017)
CASE for variable selection

Screen. Rank variables by CAR and let $\hat{S}_t = \{1 \le j \le p : T_j \ge t\}$. The post-screening subgraph splits into small-size components:

$$\mathcal{G}_{\hat{S}} = \mathcal{G}_{\hat{S},1} \cup \mathcal{G}_{\hat{S},2} \cup \ldots \cup \mathcal{G}_{\hat{S},\hat{M}}$$

Clean. If $j \notin \hat{S}_t$, set $\hat{\beta}_j = 0$. Otherwise, we must have $j \in \mathcal{G}_{\hat{S},k}$ for some $k$. Estimate $\{\beta_j : j \in \mathcal{G}_{\hat{S},k}\}$ by minimizing

$$\Big\|P_{\mathcal{G}_{\hat{S},k}}\Big(Y - \sum_{j \in \mathcal{G}_{\hat{S},k}} \beta_j x_j\Big)\Big\|^2 + u\|\beta\|_0, \qquad \text{s.t. } \beta_j = 0 \text{ or } |\beta_j| \ge v$$
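A sketch of the cleaning step on a single component: because components are small after screening, the $\ell_0$-penalized fit can be found by exhaustive search over support patterns ($u$ and $v$ are the tuning parameters above). Supports violating the minimum-signal constraint are simply skipped here, a simplification of the constrained fit:

```python
from itertools import combinations
import numpy as np

def clean_component(X, Y, nodes, u, v):
    """Exhaustive L0 search over one small post-screening component (sketch)."""
    Xc = X[:, nodes]
    Q, _ = np.linalg.qr(Xc)
    Yc = Q @ (Q.T @ Y)                       # P_{G_hatS,k} Y
    best_b, best_obj = np.zeros(len(nodes)), np.sum(Yc ** 2)  # empty support
    for k in range(1, len(nodes) + 1):
        for S in combinations(range(len(nodes)), k):
            coef, *_ = np.linalg.lstsq(Xc[:, list(S)], Yc, rcond=None)
            if np.min(np.abs(coef)) < v:     # enforce beta_j = 0 or |beta_j| >= v
                continue
            obj = np.sum((Yc - Xc[:, list(S)] @ coef) ** 2) + u * k
            if obj < best_obj:
                best_obj = obj
                best_b = np.zeros(len(nodes))
                best_b[list(S)] = coef
    return best_b
```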
Signal archipelago

The signal subgraph splits into small components:

$$\mathcal{G}_S = \mathcal{G}_{S,1} \cup \mathcal{G}_{S,2} \cup \ldots \cup \mathcal{G}_{S,M}, \qquad S = S(\beta)$$
Rare/Weak signal model and three regions

$$\beta_j = b_j \mu_j, \qquad b_j \overset{iid}{\sim} \mathrm{Bernoulli}(\epsilon), \qquad \tau \le \mu_j \le a\tau$$

$$\epsilon = \epsilon_p = p^{-\vartheta}, \qquad \tau = \tau_p = \sqrt{2r\log(p)}$$

Hamming distance:

$$\mathrm{Hamm}_p(\hat{\beta}, \vartheta, r) = \sup_{\mu} \sum_{j=1}^p P\big(\mathrm{sgn}(\hat{\beta}_j) \ne \mathrm{sgn}(\beta_j)\big)$$

- No Recovery: $\mathrm{Hamm}_p(\hat{\beta}, \vartheta, r) \gtrsim p\epsilon_p$
- Almost Full Recovery: $1 \ll \mathrm{Hamm}_p(\hat{\beta}, \vartheta, r) \ll p\epsilon_p$
- Exact Recovery: $\mathrm{Hamm}_p(\hat{\beta}, \vartheta, r) = o(1)$
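In simulations, the population quantity above is typically replaced by its empirical counterpart; a trivial sketch:

```python
import numpy as np

def hamming_error(beta_hat, beta):
    """Number of coordinates where sgn(beta_hat) differs from sgn(beta)."""
    return int(np.sum(np.sign(beta_hat) != np.sign(beta)))
```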
Phase Diagram (blockwise design)

$Y \sim N(X\beta, I_n)$, rows of $X$ iid $N(0, \frac{1}{n}\Omega)$, where $\Omega$ is blockwise diagonal with $2 \times 2$ blocks $\begin{pmatrix} 1 & a \\ a & 1 \end{pmatrix}$

Left: CASE/optimal. Right: Lasso.

[Figure: phase diagrams in the $(\vartheta, r)$-plane with regions No Recovery, Almost Full Recovery, and Exact Recovery. Left: CASE achieves the optimal boundaries. Right: the Lasso's exact-recovery boundary is non-optimal.]

Ji and Jin (2012), Jin, Zhang and Zhang (2014)
Take-home messages

- CAR: a variable ranking method that mitigates signal cancellation in marginal ranking; appealing ROC curves
- CASE: a screen-and-clean method for variable selection; phase transition of the Hamming distance