Marginal Screening and Post-Selection Inference


Ian McKeague (Columbia University)
August 13, 2017

Outline
1. Background on Marginal Screening
2. 2 × 2 Tables in Case-Control Studies (joint work with Min Qian)
3. Binary Screening Test (BST)
4. Forward Stepwise BST
5. Censored Survival Data (Tzu-Jung Huang, McKeague and Min Qian)

Wald Lecture at JSM 2017 by Candès

Panning for gold

Are there any active predictors (in the sense defined by Candès)? This is the most relevant question with a low signal-to-noise ratio (e.g., epidemiology) and sparse signals (if any).

The knockoff approach furnishes rigorous FDR control (not FWER control):
- Barber and Candès (2016): knockoff filter for testing associations in a high-dimensional linear model.
- Candès, Fan, Janson and Lv (2017): Panning for gold: model-free knockoffs for high-dimensional controlled variable selection.

Fighting words: "To constrain oneself to marginal testing is to completely ignore the vast modern literature on sparse regression that, while lacking finite-sample Type I error control, has had tremendous success establishing other useful inferential guarantees such as model selection consistency under high-dimensional asymptotics..."

The unreasonable effectiveness of marginal testing

Assumption-lean full linear model:

Y = α_0 + Σ_{k=1}^p β_k X_k + ε,

where ε has mean 0, finite variance, and is uncorrelated with each X_k.

The power of marginal testing derives from the fact that, if (X_1, ..., X_p) has a non-singular covariance matrix,

H_0: β_k = 0, k = 1, ..., p

holds if and only if Y is marginally uncorrelated with each X_k. That is, marginal testing does address the question of whether there are any active predictors in the full model (not the wrong question after all).

Marginal screening

Least squares fitting of

E(Y | X_k) = α_k + β_k X_k

to each (standardized) predictor X_k. This abuses the notation for β_k, but it doesn't matter: Y is marginally uncorrelated with X_k if and only if β_k = 0.

Parameter of interest: θ_0 = β_{k_0}, where k_0 ∈ arg max_{k=1,...,p} |Corr(X_k, Y)|.

The Problem: Test whether θ_0 ≠ 0 and provide a CI for θ_0.

Marginal screening (cont'd)

Test statistics for H_0: θ_0 = 0 versus H_a: θ_0 ≠ 0.

Maximally-selected slope:

θ̂_n = Ĉov(X_{k̂_n}, Y) / V̂ar(X_{k̂_n}), where k̂_n ∈ arg max_{k=1,...,p} |Ĉorr(X_k, Y)|.

Reject H_0 for large |θ̂_n|. Equivalently, maximally-selected correlation:

ρ̂_n = Ĉorr(X_{k̂_n}, Y).

Reject H_0 for large |ρ̂_n|.
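As a concrete illustration, the maximally-selected statistics above can be computed in a few lines of NumPy (a sketch only; the function name `marginal_screen` and the toy data are ours, not from the talk):

```python
import numpy as np

def marginal_screen(X, y):
    """Maximally-selected marginal statistics for an n x p design X.

    Returns the index k_hat of the predictor with the largest absolute
    sample correlation with y, together with the selected slope and
    correlation (illustrative sketch).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)              # center each predictor
    yc = y - y.mean()
    cov = Xc.T @ yc / len(y)             # Cov(X_k, Y) for each k
    corr = cov / (Xc.std(axis=0) * yc.std())
    k_hat = int(np.argmax(np.abs(corr)))           # selected predictor
    slope = cov[k_hat] / Xc[:, k_hat].var()        # maximally-selected slope
    return k_hat, slope, corr[k_hat]

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = 0.5 * X[:, 3] + rng.standard_normal(n)   # only the 4th predictor is active
k_hat, slope, rho = marginal_screen(X, y)
```

The point of the sketch is that selection (the argmax) and estimation (the slope) use the same data, which is exactly what makes naive inference on θ̂_n invalid.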

Marginal screening (cont'd)

This has the hallmarks of post-selection inference: non-regular asymptotics at the null hypothesis, unstable behavior in small samples. How can we provide an accurate p-value and CI for θ_0 while preserving the assumption-lean approach, i.e. without adding assumptions such as independent normal errors, as needed for conditional testing approaches, say?

McKeague and Qian (2015, JASA): adaptive resampling test (ART). A modified nonparametric bootstrap provides valid (post-selection) p-values. Forward stepwise ART is used to identify the presence of additional active predictors.

Luedtke and van der Laan (2017, JASA): regularization approach. Asymptotically normal test statistic, CIs easily constructed, p can grow exponentially with n. Restricted to independent errors.

2 × 2 Tables in Case-Control Studies (GWAS)

Motivating Example: Risk Assessment of Cerebrovascular Events (RACE Study, 2017)
- Ongoing case-control study of stroke: over 5,000 imaging-confirmed cases of stroke and 5,000 controls recruited from seven medical centers in Pakistan.
- Subset of 1,220 cases with early-onset stroke (stroke before age 60) and 1,273 controls.
- Genome-wide SNP data available for this subset of subjects.
- Goal: identify novel genetic factors associated with early-onset stroke.
- Few if any known genetic associations (in contrast to Crohn's disease, say, where 30 genes are known).

Case-Control Set-Up
- Standard unmatched case-control study: N = M_1 (cases) + M_2 (controls).
- Binary disease status: D ∈ {case, control}.
- Binary risk factors: W_k ∈ {exposed, unexposed}, k = 1, ..., p.
- Fixed margins.
- Log-odds ratio (instead of correlation) used to quantify the association.

Is D significantly associated with any of the risk factors W_1, ..., W_p? The ideal approach would be model-free, avoiding the requirement of a high-dimensional logistic regression model.

Claim: the unreasonable effectiveness of marginal screening still holds.

Multiple 2 × 2 Tables

For the k-th risk factor, k = 1, ..., p:

             Cases        Controls            Total
Exposed      X_k          N_1k − X_k          N_1k
Unexposed    M_1 − X_k    X_k + N_2k − M_1    N_2k
Total        M_1          M_2                 N

X_k is noncentral hypergeometric, parameterized by the odds ratio

θ_k = [P(D = 1 | W_k = 1) / P(D = 0 | W_k = 1)] / [P(D = 1 | W_k = 0) / P(D = 0 | W_k = 0)].

Hypotheses: H_0: θ_1 = ... = θ_p = 1 versus H_a: at least one θ_k ≠ 1.

Overview of Inference for 2 × 2 Tables
- Classical chi-squared and Fisher exact tests.
- Mantel-Haenszel test of H_0: θ_1 = ... = θ_p = 1 versus H_a: at least one θ_k ≠ 1; requires all the odds ratios θ_k to be identical.
- Kou and Ying (1996): asymptotic theory of the empirical log-odds ratio for a single table.
- Kou and Ying (2006): studied the problem of estimating a common odds ratio from sequences of dependent 2 × 2 tables.
- There is an extensive literature on tests for homogeneity of multiple odds ratios, e.g. Reis, Hirji and Afifi (1999).

Existing screening methods

H_0: θ_1 = ... = θ_p = 1 versus H_a: at least one θ_k ≠ 1.
- Marginal p-values: control of FWER using Bonferroni; highly conservative for large p.
- Permutation test (D randomly permuted among subjects): heavy computational burden.
- FDR control for the lasso (logistic regression) based on knockoffs.

Is there a more powerful, computationally efficient, and model-free approach?

Hypotheses

Question: Is D significantly related to any of the risk factors W_1, ..., W_p?

Define

k_0 ∈ arg max_{k=1,...,p} |log θ_k| / σ_k,

where σ_k > 0 is a prescribed sequence of normalizing constants.

Hypotheses: H_0: log θ_0 = 0 versus H_a: log θ_0 ≠ 0, where θ_0 = θ_{k_0}.

Binary Screening Test (BST)

Empirical odds ratio:

θ̂_k = X_k (X_k + N_2k − M_1) / [(N_1k − X_k)(M_1 − X_k)], k = 1, ..., p.

Estimate of k_0:

k̂_N ∈ arg max_k |log θ̂_k| / τ̂_k,

where τ̂_k is the standard error of log θ̂_k.

Test statistic:

T_N = |log θ̂_N| = |log θ̂_{k̂_N}|.
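The empirical odds ratios and the selected statistic T_N can be computed directly from the table counts. In this sketch we use the usual Woolf formula sqrt(1/a + 1/b + 1/c + 1/d) for the standard error of a log odds ratio; that choice is our assumption, since the talk only says that τ̂_k is the standard error of log θ̂_k:

```python
import numpy as np

def bst_statistic(X, N1, M1, N):
    """BST test statistic from p 2x2 tables with fixed margins.

    X[k]  = number of exposed cases for risk factor k,
    N1[k] = total exposed for factor k, M1 = total cases, N = sample size.
    """
    X, N1 = np.asarray(X, float), np.asarray(N1, float)
    a = X                      # exposed cases
    b = N1 - X                 # exposed controls
    c = M1 - X                 # unexposed cases
    d = X + (N - N1) - M1      # unexposed controls (N2k = N - N1k)
    log_or = np.log(a * d) - np.log(b * c)         # log empirical odds ratio
    tau = np.sqrt(1/a + 1/b + 1/c + 1/d)           # Woolf SE of log OR (assumed)
    k_hat = int(np.argmax(np.abs(log_or) / tau))   # selected risk factor
    return k_hat, float(np.abs(log_or[k_hat]))     # (k_hat, T_N)

# Two tables: the first shows a strong association, the second none.
k_hat, T = bst_statistic(X=[70, 50], N1=[100, 100], M1=100, N=200)
```

For the first table a = 70, b = 30, c = 30, d = 70, giving an odds ratio of about 5.4, so it is selected.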

Asymptotic behavior of θ̂_N under local alternatives

Local parameterization: log θ^(N) = log θ^(0) + b/√N, where θ = (θ_1, ..., θ_p)^T.

Hypotheses: H_0: log θ_N = 0 versus H_a: log θ_N ≠ 0, where θ_N = θ_{k_N} and k_N ∈ arg max_k |log θ_k| / σ_k.

Theorem. Under regularity conditions,

√N (log θ̂_N − log θ_N) →_d  σ_{k_0} Z_{k_0}           if θ^(0) ≠ 1,
                              σ_K Z_K + b_K − b_{k*}    if θ^(0) = 1,

where (Z_1, ..., Z_p)^T ~ N(0, C_X), with C_X the limit of Corr(X_1, ..., X_p); k_0 = arg max_k |log θ^(0)_k| / σ_k, assumed to be unique when θ^(0) ≠ 1; k* = arg max_k |b_k| / σ_k, assumed to be unique when θ^(0) = 1 and b ≠ 0; and K = arg max_{k=1,...,p} (Z_k + b_k/σ_k)^2.

Key regularity condition

Kou and Ying (1996) established the (marginal) representation

X_k =_d Σ_{s=1}^{M_1} I(η_sk ≤ (1 + θ_k^{-1} λ_sk)^{-1}, s ≤ N_1k),

where η_sk are iid Unif(0, 1) and λ_sk ≥ 0 are the roots of the Jacobi polynomial

φ_k(z) = Σ_{u=max(0, M_1−N_2k)}^{min(M_1, N_1k)} (N_1k choose u) (N_2k choose M_1−u) z^u.

We need to assume this representation holds jointly over all k = 1, ..., p.

Calibration of BST
- Calibrate under the null θ_N = 1, i.e. θ^(0) = 1 and b = 0. We only need to estimate the distribution of σ_K Z_K, where K = arg max_{k=1,...,p} Z_k^2.
- σ_k^2 can be consistently estimated by N τ̂_k^2, where τ̂_k is the standard error of log θ̂_k.
- (Z_1, ..., Z_p)^T ~ N(0, C_X), with C_X consistently estimated by the sample correlation matrix of the vector of risk factors (W_1, ..., W_p) restricted to the data with D = 1.
- Draw from the estimated null distribution of log θ̂_N using Monte Carlo simulation to obtain critical values.
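The calibration recipe above amounts to simulating the null limit σ_K Z_K. A minimal Monte Carlo sketch (the function name, defaults, and seed are ours):

```python
import numpy as np

def bst_critical_value(C_hat, sigma_hat, alpha=0.05, n_draws=20000, seed=1):
    """Monte Carlo critical value for the BST under the global null.

    Draws (Z_1, ..., Z_p) ~ N(0, C_hat), selects K = argmax_k Z_k^2, and
    returns the (1 - alpha) quantile of |sigma_hat_K * Z_K|, the null
    limit of sqrt(N) |log theta_hat_N| described in the calibration.
    """
    rng = np.random.default_rng(seed)
    p = len(sigma_hat)
    Z = rng.multivariate_normal(np.zeros(p), C_hat, size=n_draws)
    K = np.argmax(Z**2, axis=1)                    # selected index per draw
    vals = np.abs(np.asarray(sigma_hat)[K] * Z[np.arange(n_draws), K])
    return float(np.quantile(vals, 1 - alpha))

# Sanity check with p = 20 independent, unit-variance components:
p = 20
crit = bst_critical_value(np.eye(p), sigma_hat=np.ones(p))
```

With independent components this reduces to the 95% quantile of the maximum of 20 absolute standard normals, which is close to 3.0; correlation in C_hat shrinks the critical value, which is where BST gains power over Bonferroni.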

Confidence intervals for log θ_0

Three possibilities:
- CI_0: use the same Monte Carlo calibration as BST.
- CI_max: use the most conservative critical values given by the limiting distributions in the Theorem as a function of b, with the k̂_N-th component of b allowed to vary freely and all its other components set to zero.
- CI_boot: select the value of b_{k̂_N} on a grid that provides the best agreement with the nominal 95% level in terms of the coverage of a bootstrapped version of log θ̂_N.

Only CI_boot works well.

Simulation studies

Three scenarios:
A) Null: W_k ~ Ber(0.5) for k = 1, ..., p;
B) Alternative (weak dense signal): W_k ~ Ber(0.6) for k = 1, ..., p/2 and W_k ~ Ber(0.5) for k = p/2 + 1, ..., p in cases; W_k ~ Ber(0.5) for k = 1, ..., p/2 and W_k ~ Ber(0.6) for k = p/2 + 1, ..., p in controls;
C) Alternative (strong sparse signal): W_1 ~ Ber(0.65), W_2 ~ Ber(0.6), and W_3 ~ Ber(0.55) in cases; W_k ~ Ber(0.4), k = 1, 2, 3, in controls; and W_k ~ Ber(0.5) for k = 4, ..., p.

N = 200 with M_1 = M_2 = 100. p varies from 10 to 400. Three correlation structures: independent, exchangeable, AR(1).
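Scenario C, for instance, can be generated as follows (a sketch of the stated design with independent risk factors; the helper name and seed are ours):

```python
import numpy as np

def simulate_scenario_C(p=50, M1=100, M2=100, seed=0):
    """One dataset from scenario C (strong sparse signal), independent W_k.

    Returns (W, D): W is an N x p 0/1 exposure matrix whose first M1 rows
    are cases, and D is the corresponding case indicator.
    """
    rng = np.random.default_rng(seed)
    case_probs = np.full(p, 0.5)
    case_probs[:3] = [0.65, 0.60, 0.55]     # sparse signal in cases
    ctrl_probs = np.full(p, 0.5)
    ctrl_probs[:3] = 0.40                   # shifted exposure in controls
    W_cases = rng.random((M1, p)) < case_probs
    W_ctrls = rng.random((M2, p)) < ctrl_probs
    W = np.vstack([W_cases, W_ctrls]).astype(int)
    D = np.r_[np.ones(M1, int), np.zeros(M2, int)]
    return W, D

W, D = simulate_scenario_C()
```

Tabulating each W_k against D then yields the p 2 × 2 tables that BST, Bonferroni, and the permutation test are compared on below.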

Independent risk factors

Model   p     BST          Bonferroni    Permutation
A       10    4.8 (0.5)    3.1 (0.15)    4.7 (10.7)
A       50    5.8 (0.9)    1.9 (0.06)    4.5 (51)
A       100   5.2 (1.6)    2.7 (0.1)     4.7 (102)
A       200   6.5 (2.1)    3.6 (0.2)     5.6 (193)
A       400   6.0 (4.3)    2.3 (0.4)     4.6 (389)
B       10    60.5 (0.5)   51.0 (0.02)   60.3 (11)
B       50    82.0 (1.0)   64.0 (0.06)   81.0 (55)
B       100   86.4 (1.6)   73.0 (0.1)    85.0 (102)
B       200   94.3 (2.4)   85.7 (0.2)    91.9 (200)
B       400   98.0 (4.3)   86.1 (0.4)    96.3 (387)
C       10    93.9 (0.6)   90.9 (0.02)   93.6 (11)
C       50    80.0 (1.0)   68.5 (0.06)   77.5 (51)
C       100   73.8 (1.4)   64.0 (0.1)    71.3 (100)
C       200   63.2 (2.2)   54.8 (0.2)    60.4 (200)
C       400   58.7 (4.0)   42.7 (0.4)    54.3 (388)

Table: Empirical rejection rates (%) over 1,000 Monte Carlo iterations and average runtime (seconds) per iteration when W_k, k = 1, ..., p, are independent.

AR(1) risk factors

Model   p     BST    Bonferroni   Permutation
A       10    3.9    1.8          4.0
A       50    5.5    2.4          4.8
A       100   5.3    2.3          4.3
A       200   5.2    2.4          4.5
A       400   6.0    2.5          5.3
B       10    53.9   43.8         54.0
B       50    76.2   59.2         73.6
B       100   82.7   66.4         79.2
B       200   91.0   79.4         89.1
B       400   96.2   81.2         93.3
C       10    88.5   84.9         88.7
C       50    72.0   59.8         70.1
C       100   67.9   56.3         65.1
C       200   59.5   51.6         57.1
C       400   51.5   37.9         48.0

Table: Empirical rejection rates (%) with AR(1) correlation structure Corr(W_j, W_k) = 0.5^|j−k|.

Exchangeable risk factors

Model   p     BST    Bonferroni   Permutation
A       10    4.6    2.3          7.2
A       50    4.5    1.2          6.7
A       100   5.8    2.1          5.4
A       200   6.0    1.2          5.1
A       400   5.8    1.1          4.9
B       10    59.8   42.1         58.1
B       50    67.0   38.9         65.1
B       100   72.4   37.5         67.8
B       200   75.3   42.7         69.6
B       400   79.2   38.1         73.1
C       10    90.6   85.4         90.2
C       50    76.6   58.3         74.5
C       100   73.4   54.4         70.3
C       200   69.1   48.9         65.6
C       400   61.1   34.6         55.2

Table: Empirical rejection rates (%) based on 1,000 samples generated from models A, B and C with exchangeable correlation structure Corr(W_j, W_k) = 0.5 for j ≠ k.

Performance Comparisons

Simulation studies show:
- BST has good control of the type I error rate, while consistently maintaining the highest power compared with the Bonferroni and permutation test approaches.
- The advantage of BST is most evident when the 2 × 2 tables are highly correlated.
- BST is 10 times slower than Bonferroni (due to the computationally intensive simulation step).
- BST is 100 times faster than the permutation test (using 1,000 permutations and 1,000 Monte Carlo draws).
- BST is 1,000 times faster than ART (our marginal screening test based on linear regression, which needs the double bootstrap).

Forward Stepwise BST

Run BST. If a significant risk factor is found (say k̂_N), then:
1. Split the data on the remaining risk factors into two collections of p − 1 tables: exposed or unexposed to k̂_N.
2. For each of the remaining p − 1 risk factors, calculate the Mantel-Haenszel OR estimate θ̂_k and the standard error τ̂_k of the log-OR from the corresponding pair of 2 × 2 tables. This yields a new test statistic T_N.
3. Estimate the null distribution with the new correlation matrix C_X estimated by the (p − 1) × (p − 1) submatrix of the original estimate of C_X, excluding the entries involving k̂_N.

Repeat steps 1-3 until no more significant risk factors are found.
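Step 2 combines each pair of stratified tables with the classical Mantel-Haenszel estimator, which for completeness is (a minimal sketch, not the authors' code):

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel common odds-ratio estimate across strata.

    tables: iterable of (a, b, c, d) counts per stratum, laid out as
    a = exposed cases, b = exposed controls, c = unexposed cases,
    d = unexposed controls. Here the two strata would be the subjects
    exposed and unexposed to the previously selected risk factor.
    """
    num = den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        num += a * d / n       # weighted concordant cross-product
        den += b * c / n       # weighted discordant cross-product
    return num / den

# Two strata, each with an odds ratio of exactly 2:
theta = mantel_haenszel_or([(40, 20, 30, 30), (20, 15, 10, 15)])
```

Because each stratum here has odds ratio 2, the pooled estimate is also 2; in general the estimator is a weighted compromise across the strata.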

Example: RACE Study
- N = 2,493 with 1,220 cases and 1,273 controls.
- p = 2,000: the first 2,000 genetic variants on chromosome 5.

Example: RACE Study (cont'd)

95% confidence intervals based on CI_boot: [figure omitted from transcription]

Censored survival data (Huang, McKeague and Qian, 2017, submitted to Statistica Sinica)

Only observe Ỹ = min(Y, C), δ = 1{Y ≤ C}, X = (X_1, ..., X_p), where C is independent of X and ε.

ART extends using a synthetic response Y^S in place of Y. Koul, Susarla and Van Ryzin (1981): linear regression based on

Y^S = δ Ỹ / S(Ỹ),

where S(t) = P(C > t) is the survival function of C, with the Kaplan-Meier estimator plugged in for S.

Correlations preserved: Corr(Y^S, X_k) = Corr(Y, X_k) for all k, so the unreasonable effectiveness of marginal screening still applies.
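A sketch of the synthetic-response construction, with a hand-rolled Kaplan-Meier estimate of the censoring survival (tie-breaking and the left-limit convention at observed times are glossed over; all names and the toy data are ours):

```python
import numpy as np

def km_survival(times, events):
    """Kaplan-Meier survival estimate evaluated at each input time.

    events = 1 marks an observed event for the distribution being
    estimated (here: censoring). Illustrative sketch only.
    """
    order = np.argsort(times)
    t, e = times[order], events[order]
    n = len(t)
    at_risk = n - np.arange(n)                       # risk set sizes
    factors = np.where(e == 1, 1 - 1 / at_risk, 1.0)
    S_sorted = np.cumprod(factors)
    S = np.empty(n)
    S[order] = S_sorted                              # back to input order
    return S

def synthetic_response(y_tilde, delta):
    """Koul-Susarla-Van Ryzin synthetic response Y^S = delta * Y~ / S_C(Y~),
    where S_C is the Kaplan-Meier estimate of the censoring survival
    (censoring indicator = 1 - delta)."""
    S_C = km_survival(y_tilde, 1 - delta)
    S_C = np.maximum(S_C, 1e-8)      # guard against division by zero
    return delta * y_tilde / S_C

rng = np.random.default_rng(0)
n = 500
Y = rng.exponential(1.0, n)          # latent survival times
C = rng.exponential(2.0, n)          # independent censoring times
y_tilde = np.minimum(Y, C)
delta = (Y <= C).astype(int)
YS = synthetic_response(y_tilde, delta)
```

Censored observations get Y^S = 0, while uncensored ones are inflated by 1/S_C(Ỹ); this reweighting is what preserves the marginal association with each X_k.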

Selected References

McKeague, I. W. and Qian, M. (2017). Marginal screening of 2 × 2 tables in large-scale case-control studies. In preparation for Biometrics.

McKeague, I. W. and Qian, M. (2015). An adaptive resampling test for detecting the presence of significant predictors (with discussion). JASA 110, 1422-1433.

Wang, H. J., McKeague, I. W. and Qian, M. (2017). Testing for marginal linear effects in quantile regression. JRSS-B, to appear.

Huang, T.-J., McKeague, I. W. and Qian, M. (2017). Marginal screening for high-dimensional predictors of survival outcomes. Submitted to Statistica Sinica.