Covariate-Assisted Variable Ranking

Covariate-Assisted Variable Ranking
Tracy Ke, Department of Statistics, Harvard University
WHOA-PSI @ St. Louis, Sep. 8, 2018

Sparse linear regression

$Y = X\beta + z$, $X \in \mathbb{R}^{n \times p}$, $z \sim N(0, \sigma^2 I_n)$

Signals (nonzero entries of $\beta$) are Rare/Weak.
A column of $X$ may be significantly correlated with a few others.
Goal: rank variables so that the top-ranked ones contain as many signals as possible.

Ranking by marginal scores

In this talk, we assume the design is normalized, i.e., $\|x_j\|_2 = 1$.

$T_j = (x_j, Y)^2 / (x_j, x_j) = (x_j, Y)^2$, where $x_j$ is the $j$-th column of $X$.

Pros: computationally efficient.
Cons: signal cancellation:
$(x_j, Y) = \beta_j + \sum_{k \neq j,\ \beta_k \neq 0} (x_j, x_k)\beta_k + (x_j, z)$,
where the first two terms may cancel each other.
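
As a quick illustration, here is a minimal numpy sketch (mine, not from the talk) of marginal-score ranking on a toy design; the design, coefficients, and noise level are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)          # normalize so that ||x_j||_2 = 1

beta = np.zeros(p)
beta[[0, 1]] = [1.0, -1.0]              # two signals whose effects can partially cancel
Y = X @ beta + 0.1 * rng.standard_normal(n)

T_marginal = (X.T @ Y) ** 2             # T_j = (x_j, Y)^2
ranking = np.argsort(-T_marginal)       # variables ranked by marginal score
print(ranking[:5])
```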

Multivariate scores

$P_I$: projection from $\mathbb{R}^n$ onto $\mathrm{span}\{x_j,\ j \in I\}$

$T_j^I = \|P_I Y\|^2 - \|P_{I \setminus \{j\}} Y\|^2$

Reduces to the marginal score when $I = \{j\}$.
$T_j^I$ is the log-likelihood ratio between $\mathrm{Supp}(\beta) = I$ vs. $\mathrm{Supp}(\beta) = I \setminus \{j\}$.
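
A sketch of computing the multivariate score $T_j^I$ via least-squares projections; the helper names are my own, and only numpy is assumed.

```python
import numpy as np

def proj_norm_sq(X, Y, idx):
    """Squared norm of the projection of Y onto span{x_j : j in idx}."""
    if len(idx) == 0:
        return 0.0
    fitted = X[:, idx] @ np.linalg.lstsq(X[:, idx], Y, rcond=None)[0]
    return float(fitted @ fitted)

def multivariate_score(X, Y, j, I):
    """T_j^I = ||P_I Y||^2 - ||P_{I\{j}} Y||^2; equals the marginal score when I = {j}."""
    I = list(I)
    I_minus_j = [k for k in I if k != j]
    return proj_norm_sq(X, Y, I) - proj_norm_sq(X, Y, I_minus_j)

# Toy usage on a hypothetical normalized design
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6)); X /= np.linalg.norm(X, axis=0)
Y = X @ np.array([2.0, -2.0, 0, 0, 0, 0]) + 0.1 * rng.standard_normal(50)
print(multivariate_score(X, Y, j=0, I=[0, 1]))   # bivariate score of variable 0
```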

Example: Blockwise diagonal design

Gram matrix $X'X$ is blockwise diagonal with $2 \times 2$ blocks $\begin{pmatrix} 1 & h \\ h & 1 \end{pmatrix}$, where $h \in (-1, 1)$.

$\beta$ has 3 signals: $\beta_1 = \tau$, $\beta_2 = \beta_3 = a\tau$; $(h, a) = (-1/3, 1/3)$, $\sigma^2 = 0$.

Variable              | Marginal score (MaS) | Bivariate score (BiS) | Rank by MaS | Rank by max(MaS, BiS)
$\beta_1 = \tau$      | $(8/9)\tau$          | $(2\sqrt{2}/3)\tau$   | 1           | 1
$\beta_2 = (1/3)\tau$ | $0$                  | $(2\sqrt{2}/9)\tau$   | 4           | 3
$\beta_3 = (1/3)\tau$ | $(1/3)\tau$          | $(2\sqrt{2}/9)\tau$   | 2           | 2
$\beta_4 = 0$         | $(1/9)\tau$          | $0$                   | 3           | 4
$\beta_5 = 0$         | $0$                  | $0$                   | 4           | 5
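
To sanity-check the table, here is a small numpy sketch (not from the slides) that builds an explicit design whose Gram matrix has these $2\times 2$ blocks and prints the scores on the $\tau$ scale, i.e., $|(x_j, Y)|$ and the square root of the bivariate score; the particular column construction is my own.

```python
import numpy as np

h, a, tau = -1/3, 1/3, 1.0
p = 6                                   # three 2x2 blocks
X = np.zeros((p, p))
for k in range(0, p, 2):                # columns k, k+1 are unit vectors with inner product h
    X[k, k] = 1.0
    X[k, k + 1] = h
    X[k + 1, k + 1] = np.sqrt(1 - h**2)

beta = np.array([tau, a * tau, a * tau, 0, 0, 0])
Y = X @ beta                            # noiseless case (sigma^2 = 0)

def proj_norm_sq(idx):
    if not idx:
        return 0.0
    fitted = X[:, idx] @ np.linalg.lstsq(X[:, idx], Y, rcond=None)[0]
    return float(fitted @ fitted)

for j in range(5):
    block = [j - 1, j] if j % 2 else [j, j + 1]
    marginal = abs(X[:, j] @ Y)          # |(x_j, Y)| = sqrt of the marginal score
    gap = max(proj_norm_sq(block) - proj_norm_sq([k for k in block if k != j]), 0.0)
    bivariate = np.sqrt(gap)             # sqrt of the bivariate score
    print(f"beta_{j+1}: MaS = {marginal:.3f} tau, BiS = {bivariate:.3f} tau")
```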

Blockwise design (noiseless case)

Marginal ranking mis-ranks some signals below noise when $|ah| > |a + h|$.
Our proposal: rank by the maximum of the marginal score and the bivariate score.
It correctly ranks all signals above noise if $|h| < 1/\sqrt{2} \approx 0.7$.
In the noiseless case, least-squares always gives the correct ranking.

Rare/Weak signal model and three regions

$\beta_j = 0$ with prob. $1 - \epsilon_p$, and $\beta_j = \pm\tau_p$ each with prob. $\epsilon_p/2$;
$\epsilon_p = p^{-\vartheta}$, $\tau_p = \sqrt{2r\log(p)}$, $0 < \vartheta, r < 1$.

$R_\alpha$ = (rank of $j_\alpha$)/$(\alpha s)$, where $j_\alpha$ is the variable ranked $\alpha s$ among all $s$ signals.

Exactly Rankable: $R_\alpha = 1$ for all $\alpha \in (0, 1)$, w.h.p.
Rankable: $1 < R_\alpha \leq 1 + o_p(1)$ for any $\alpha \in (0, 1)$.
Not Rankable: $R_\alpha \gg 1$ for some $\alpha \in (0, 1)$.
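
A minimal sketch of drawing $\beta$ under this Rare/Weak model (the function name and parameter values are illustrative, not from the talk):

```python
import numpy as np

def rare_weak_beta(p, vartheta, r, rng):
    """Draw beta with eps_p = p^(-vartheta) and tau_p = sqrt(2 r log p)."""
    eps_p = p ** (-vartheta)
    tau_p = np.sqrt(2 * r * np.log(p))
    signs = rng.choice([-1.0, 0.0, 1.0], size=p,
                       p=[eps_p / 2, 1 - eps_p, eps_p / 2])
    return signs * tau_p

beta = rare_weak_beta(p=10_000, vartheta=0.5, r=0.6, rng=np.random.default_rng(0))
print((beta != 0).sum(), "signals out of", beta.size)
```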

Blockwise design (phase diagram)

[Figure: phase diagrams in the $(\vartheta, r)$ plane, each partitioned into Exactly Rankable, Rankable, and Not Rankable regions. Panels: our proposal, least-squares, MR, and MR (zoom-out).]

Graph Of Strong Dependence (GOSD)

Define the GOSD $G = (V, E)$:
$V = \{1, 2, \ldots, p\}$: each variable is a node.
Nodes $i$ and $j$ have an edge iff $|(x_i, x_j)| \geq \delta$ ($\delta = 1/\log(p)$, say).
Under our assumptions, $G$ is sparse.
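
A small sketch of building the GOSD adjacency matrix from the Gram matrix (helper name is mine; columns are assumed normalized, and the random design below is a stand-in):

```python
import numpy as np

def gosd_adjacency(X, delta):
    """Adjacency of the GOSD: nodes i, j connected iff |(x_i, x_j)| >= delta, i != j."""
    G = X.T @ X                         # Gram matrix (normalized columns)
    A = np.abs(G) >= delta
    np.fill_diagonal(A, False)
    return A

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p)); X /= np.linalg.norm(X, axis=0)
A = gosd_adjacency(X, delta=1 / np.log(p))
print("max degree:", A.sum(axis=1).max())
```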

Covariate-Assisted Ranking (CAR)

Rank variables by
$T_j = \max_{I \in \mathcal{A}_j(m)} T_j^I$, where $T_j^I = \|P_I Y\|^2 - \|P_{I \setminus \{j\}} Y\|^2$,
$\mathcal{A}_j(m)$: connected subgraphs with at most $m$ nodes that contain $j$.

Let $d$ be the maximum degree of $G$. Then
$\sum_{j=1}^p |\mathcal{A}_j(m)| \leq C p (2.718 d)^m \ll \sum_{k=1}^m \binom{p}{k}$.
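
A naive reference implementation along these lines might look as follows; this is a sketch under my own helper names, enumerating the connected subgraphs of size at most $m$ containing each node by breadth-first expansion of neighbor sets.

```python
import numpy as np

def connected_subsets(A, j, m):
    """All connected node sets of size <= m in graph A (boolean adjacency) containing node j."""
    found = {frozenset([j])}
    frontier = [frozenset([j])]
    while frontier:
        new_frontier = []
        for S in frontier:
            if len(S) == m:
                continue
            nbrs = set(np.flatnonzero(A[list(S)].any(axis=0))) - S
            for k in nbrs:
                T = S | {k}
                if T not in found:
                    found.add(T)
                    new_frontier.append(T)
        frontier = new_frontier
    return [sorted(S) for S in found]

def proj_norm_sq(X, Y, idx):
    if not idx:
        return 0.0
    fitted = X[:, idx] @ np.linalg.lstsq(X[:, idx], Y, rcond=None)[0]
    return float(fitted @ fitted)

def car_scores(X, Y, m=2, delta=0.5):
    """CAR scores: T_j = max over connected I with |I| <= m, j in I, of T_j^I."""
    p = X.shape[1]
    A = np.abs(X.T @ X) >= delta
    np.fill_diagonal(A, False)
    scores = np.zeros(p)
    for j in range(p):
        for I in connected_subsets(A, j, m):
            t = proj_norm_sq(X, Y, I) - proj_norm_sq(X, Y, [k for k in I if k != j])
            scores[j] = max(scores[j], t)
    return scores
```

The enumeration cost scales with the node degrees rather than with all $\binom{p}{k}$ subsets, which is the point of restricting to connected subgraphs of the sparse GOSD.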

A real example

Data: gene expression of human immortalized B cells; $(p, n) = (4238, 148)$; Nayak et al. (2009).

Remove the first singular vector:
$\mathrm{Data} = \sum_{k=1}^n \sigma_k u_k v_k' = \sigma_1 u_1 v_1' + \underbrace{\textstyle\sum_{k=2}^n \sigma_k u_k v_k'}_{\text{design matrix } X}$

Synthetic data for regression: $Y \sim N(X\beta, I_n)$, with $\beta_j \sim N(0, \eta^2)$ for $1 \leq j \leq s$ and $\beta_j = 0$ otherwise.
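
A sketch of this preprocessing and of the synthetic-data generation; the random matrix `data` below is only a stand-in for the actual gene-expression matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, s, eta = 4238, 148, 50, 0.1
data = rng.standard_normal((n, p))          # placeholder for the n x p expression matrix

# Remove the leading singular component; the remainder serves as the design matrix X.
U, svals, Vt = np.linalg.svd(data, full_matrices=False)
X = data - svals[0] * np.outer(U[:, 0], Vt[0])

# Synthetic regression data: first s coefficients are N(0, eta^2), the rest are zero.
beta = np.zeros(p)
beta[:s] = eta * rng.standard_normal(s)
Y = X @ beta + rng.standard_normal(n)       # Y ~ N(X beta, I_n)
```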

Comparison of the ROC curve

For CAR, $(m, \delta) = (2, 0.5)$; $\beta_j \sim N(0, \eta^2)$ for $1 \leq j \leq s$ and $\beta_j = 0$ otherwise.

[Figure: ROC curves of CAR vs. MR. Left panel: $(\eta, s) = (0.1, 50)$; right panel: $(\eta, s) = (5, 50)$.]
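
For reference, a ranking method's ROC curve can be traced by sweeping down the induced ordering; a minimal sketch with my own helper name:

```python
import numpy as np

def roc_curve(scores, support):
    """TPR/FPR along the ranking induced by the scores (higher score = ranked earlier).
    `support` is a boolean array marking the true signals."""
    order = np.argsort(-scores)
    hits = support[order].astype(float)
    tpr = np.cumsum(hits) / max(hits.sum(), 1.0)
    fpr = np.cumsum(1 - hits) / max((1 - hits).sum(), 1.0)
    return fpr, tpr
```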

Extensions

Gram matrix is non-sparse but is sparsifiable:
$Y = X\beta + z \;\Rightarrow\; HY = HX\beta + Hz$
- Change-point or time-series design: linear filtering
- Low-rank plus sparse design: PCA projection

Generalized linear models: replace $\|P_I Y\|^2$ by the log-likelihood $\hat{\ell}_I(Y)$.

Ke, Jin and Fan ('14), Ke and Yang ('17)
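
A toy illustration (mine, not from the talk) of sparsification by linear filtering in the change-point design, where the columns of $X$ are step functions and $H$ is a first-difference filter:

```python
import numpy as np

n = 10
# Change-point design: column j is the step function that jumps from 0 to 1 at position j.
X = np.tril(np.ones((n, n)))
H = np.eye(n) - np.eye(n, k=-1)          # first-difference filter: rows e_i - e_{i-1}
print(np.count_nonzero(X), "->", np.count_nonzero(H @ X))   # HX is much sparser than X
```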

CASE for variable selection

Screen. Rank variables by CAR and let $\hat{S}_t = \{1 \leq j \leq p: T_j \geq t\}$.
$\underbrace{G_{\hat{S}}}_{\text{post-screening subgraph}} = \underbrace{G_{\hat{S},1} \cup G_{\hat{S},2} \cup \ldots \cup G_{\hat{S},\hat{M}}}_{\text{small-size components}}$

Clean. If $j \notin \hat{S}_t$, set $\hat{\beta}_j = 0$. Otherwise, we must have $j \in G_{\hat{S},k}$ for some $k$. Estimate $\{\beta_j : j \in G_{\hat{S},k}\}$ by minimizing
$\|P_{G_{\hat{S},k}}(Y - \sum_{j \in G_{\hat{S},k}} \beta_j x_j)\|^2 + u\|\beta\|_0$, subject to $\beta_j = 0$ or $|\beta_j| \geq v$.
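
A simplified sketch of the clean step on one small post-screening component: brute-force $\ell_0$ search, with the constraint $\beta_j = 0$ or $|\beta_j| \geq v$ enforced by discarding violating fits. Helper names are mine, and the search is only practical because the components are small.

```python
import numpy as np
from itertools import combinations

def clean_component(X, Y, comp, u, v):
    """Brute-force l0 fit on one small component `comp` (a list of column indices)."""
    Q, _ = np.linalg.qr(X[:, comp])
    Y_proj = Q @ (Q.T @ Y)                       # P_G Y, projection onto the component's span
    best_obj, best_beta = np.inf, np.zeros(len(comp))
    for size in range(len(comp) + 1):
        for S in combinations(range(len(comp)), size):
            beta = np.zeros(len(comp))
            if S:
                cols = [comp[i] for i in S]
                coef = np.linalg.lstsq(X[:, cols], Y_proj, rcond=None)[0]
                if np.any(np.abs(coef) < v):     # enforce beta_j = 0 or |beta_j| >= v
                    continue
                beta[list(S)] = coef
            resid = Y_proj - X[:, comp] @ beta
            obj = resid @ resid + u * size       # ||P_G(Y - X beta)||^2 + u * ||beta||_0
            if obj < best_obj:
                best_obj, best_beta = obj, beta
    return best_beta
```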

Signal archipelago

$\underbrace{G_S}_{\text{signal subgraph}} = \underbrace{G_{S,1} \cup G_{S,2} \cup \ldots \cup G_{S,M}}_{\text{components}}$, where $S = S(\beta)$.

Rare/Weak signal model and three regions

$\beta_j = b_j \mu_j$, $b_j \overset{iid}{\sim} \mathrm{Bernoulli}(\epsilon)$, $\tau \leq \mu_j \leq a\tau$;
$\epsilon = \epsilon_p = p^{-\vartheta}$, $\tau = \tau_p = \sqrt{2r\log(p)}$.

Hamming distance: $\mathrm{Hamm}_p(\hat{\beta}, \vartheta, r) = \sup_{\mu} \Big\{ \sum_{j=1}^p P\big(\mathrm{sgn}(\hat{\beta}_j) \neq \mathrm{sgn}(\beta_j)\big) \Big\}$

No Recovery: $\mathrm{Hamm}_p(\hat{\beta}, \vartheta, r) \asymp p\epsilon_p$
Almost Full Recovery: $1 \ll \mathrm{Hamm}_p(\hat{\beta}, \vartheta, r) \ll p\epsilon_p$
Exact Recovery: $\mathrm{Hamm}_p(\hat{\beta}, \vartheta, r) \to 0$
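
For a single realization, the Hamming error is simply the number of sign mismatches; averaging it over Monte Carlo draws of $\beta$ approximates $\mathrm{Hamm}_p$ (a minimal sketch, with my own function name):

```python
import numpy as np

def hamming_error(beta_hat, beta):
    """Number of sign errors: sum_j 1{sgn(beta_hat_j) != sgn(beta_j)}."""
    return int(np.sum(np.sign(beta_hat) != np.sign(beta)))
```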

Phase Diagram (blockwise design)

$Y \sim N(X\beta, I_n)$, rows of $X$ iid $N(0, \frac{1}{n}\Omega)$, where
$\Omega = \begin{pmatrix} 1 & a & 0 & 0 & \cdots & 0 & 0 \\ a & 1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 1 & a & \cdots & 0 & 0 \\ 0 & 0 & a & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & a \\ 0 & 0 & 0 & 0 & \cdots & a & 1 \end{pmatrix}$

[Figure: phase diagrams in the $(\vartheta, r)$ plane. Left: CASE/optimal, with Exact Recovery, Almost Full Recovery, and No Recovery regions. Right: Lasso, with Exact Recovery (optimal and non-optimal sub-regions) and No Recovery regions.]

Ji and Jin ('12), Jin, Zhang and Zhang ('14)

Take-home messages

CAR: a variable ranking method that mitigates signal cancellation in marginal ranking; appealing ROC curves.
CASE: a screen-and-clean method for variable selection; phase transition of the Hamming distance.