Covariance test and selective inference. Patrick Breheny. April 18. High-Dimensional Data Analysis (BIOS 7600).


Introduction

In our final lecture on inferential approaches for penalized regression, we will discuss two rather recent approaches:
- The covariance test, for testing the significance of additional terms along the solution path
- Selective inference, or post-selection inference, in which we carry out tests and construct confidence intervals conditional on the selected model

Motivation

The classical test for the significance of adding a variable to a model is to calculate the test statistic
$$T_R = \frac{\mathrm{RSS}_0 - \mathrm{RSS}_1}{\sigma^2}$$
and compare it to a $\chi^2_1$ distribution (for simplicity, let's assume $\sigma^2$ is known).

This is valid, of course, when the variable is prespecified. In the model selection context, however, it is far too liberal, as we have seen.

Failure of the $\chi^2$ result

$n = p = 100$, $\beta = 0$, significance of the first predictor added:

[Figure: histogram of $T_R$ against the $\chi^2_1$ density; the selected test statistic is far larger than the reference distribution, with $P(T_R > 3.84) > 0.99$.]
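A minimal simulation sketch of the experiment above (my own illustration, not code from the lecture; sample size, seed, and $\sigma^2 = 1$ are assumptions): the drop in RSS from adding the single best of $p$ candidate predictors is $\max_j (x_j^T y)^2 / \|x_j\|^2$, which behaves like the maximum of 100 $\chi^2_1$-like draws rather than a single $\chi^2_1$.

```python
# Sketch: why the classical chi-square test fails after selection.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = p = 100
reps = 2000
T_R = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)  # global null: beta = 0, sigma^2 = 1
    # RSS_0 - RSS_1 for the best single predictor is max_j (x_j'y)^2 / ||x_j||^2
    T_R[r] = np.max((X.T @ y) ** 2 / (X ** 2).sum(axis=0))

# Compare to the nominal chi-square(1) critical value 3.84:
print("P(T_R > 3.84):", np.mean(T_R > stats.chi2.ppf(0.95, df=1)))
```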

The covariance test: Motivation

The goal of the covariance test is to develop a test statistic for an added variable whose distribution can be characterized despite the fact that we searched over a large number of candidate predictors to find it.

Let $\lambda_1 > \lambda_2 > \cdots$ denote the values of the regularization parameter at which a new variable enters the model, and let $A_{k-1}$ denote the active set, not including the variable added at step $k$.

Now, let us consider two solutions at the same value of $\lambda$: the full lasso solution $\hat\beta(\lambda_{k+1})$ and the solution $\tilde\beta_{A_{k-1}}(\lambda_{k+1})$ fit using only the variables in $A_{k-1}$.

Consider, then, the following quantity:
$$T_C = \frac{y^T X \hat\beta(\lambda_{k+1}) - y^T X \tilde\beta_{A_{k-1}}(\lambda_{k+1})}{\sigma^2};$$
i.e., how much of the covariance between the fitted model and the outcome can be attributed to the new predictor, as opposed to the increase in covariance we would get from simply lowering $\lambda$ without adding any new variables.

Under the null hypothesis that all variables with $\beta_j \neq 0$ are included in $A_{k-1}$, it can be shown that as $n, p \to \infty$,
$$T_C \overset{d}{\longrightarrow} \mathrm{Exp}(1).$$
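As a concrete illustration, here is a rough sketch of computing $T_C$ for the first variable to enter the path, using scikit-learn's lars_path to find the knots $\lambda_k$ (this is my own sketch, not the covTest package; it assumes standardized predictors and known $\sigma^2$, and uses the fact that with $A_0$ empty the second covariance term vanishes):

```python
# Sketch: covariance test statistic for the *first* step of the lasso path.
import numpy as np
from sklearn.linear_model import lars_path

def cov_test_first_step(X, y, sigma2=1.0):
    # alphas are the knots of the path; coefs[:, k] is the solution at alphas[k]
    alphas, active, coefs = lars_path(X, y, method="lasso")
    # coefs[:, 0] = 0 (the empty model at lambda_1); coefs[:, 1] is the fit
    # at lambda_2, where the second variable enters
    fit_full = X @ coefs[:, 1]
    # With A_0 empty, y'X beta~_{A_0}(lambda_2) = 0, so the statistic is:
    return (y @ fit_full) / sigma2
```

Under the global null, repeated draws of this statistic should be approximately Exp(1)-distributed, as on the next slide.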

Accuracy of the approximating distribution

As before, $n = p = 100$, $\beta = 0$, first predictor added:

[Figure: histogram of $T_C$ against the Exp(1) density; the agreement is close, with $P(T_C > 3.00) = 0.08$.]

Results: Example data

Applying the covariance test to the example data from our previous lecture, we have:

Variable      T_c        p
A1         13.0701
A2         60.0133
A3          0.4991   0.6108
A6          5.0203   0.0113
C29         0.0351   0.9655
A4          0.3442   0.7109
B3          0.0438   0.9572
A5          5.5515   0.0075
C4          0.1197   0.8875
B1          0.0477   0.9535

Remarks

When $\sigma^2$ is not known but can be estimated with rdf degrees of freedom, the estimate can be substituted for $\sigma^2$, and the covariance test statistic will follow an $F_{2,\mathrm{rdf}}$ distribution.

The distribution of the test statistic is thus somewhat inflated by searching over a number of candidate predictors, but far less so for the lasso than for forward selection, due to the shrinkage imposed by the lasso.

It is possible to translate the sequential tests into an FDR rule, giving a false discovery rate for the included variables at a given point; for our example, however, this only allows the first two variables to enter.

Our final, and most recent, approach to inference is selective inference, or post-selection inference. The idea here is to explicitly condition on the model we have selected in carrying out inference.

Like the covariance test, selective inference adjusts for the fact that we have cherry-picked the top $k$ variables and eliminated the other $p - k$ variables to arrive at this model; it is more comprehensive, however, offering post-selection confidence intervals and p-values for all of the terms in the selected model.

The lasso selection event

To proceed, we will need to explicitly characterize the event that the lasso selects a given model $A$ and eliminates the variables in the set $B = A^C$.

Theorem: For a fixed value of $\lambda$, the event that the lasso sets $\hat\beta_j = 0$ for all $j \in B$ can be written
$$\left| \frac{1}{n} X_B^T (I - P_A) y + \lambda X_B^T X_A (X_A^T X_A)^{-1} s \right| \le \lambda,$$
where $P_A = X_A (X_A^T X_A)^{-1} X_A^T$, $s = \mathrm{sign}(\hat\beta_A)$, and the above inequality applies elementwise.

Remarks

Thus, the event that the lasso selects a certain model (and assigns its nonzero coefficients certain signs) can be written as a set of linear constraints $\{Ay \le b\}$.

In other words, the set $\{y : Ay \le b\}$ is the set of random response vectors $y$ that would yield the same active set and coefficient signs as the model we've selected.
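To make this concrete, here is a sketch of how one might assemble $(A, b)$ from the theorem on the previous slide (my own construction for illustration, assuming the objective $\frac{1}{2n}\|y - X\beta\|^2 + \lambda\|\beta\|_1$ and a given active set and sign vector; not the selectiveInference package itself):

```python
# Sketch: polyhedron {y : Ay <= b} encoding "lasso selects active set A
# with signs s" -- inactive KKT constraints plus sign constraints.
import numpy as np

def selection_polyhedron(X, active, signs, lam):
    n, p = X.shape
    inactive = np.setdiff1d(np.arange(p), active)
    XA, XB = X[:, active], X[:, inactive]
    XAi = np.linalg.inv(XA.T @ XA)      # (X_A' X_A)^{-1}
    PA = XA @ XAi @ XA.T                # projection onto col(X_A)
    # Inactive constraints: |X_B'(I - P_A)y / n + lam X_B'X_A XAi s| <= lam
    c = lam * XB.T @ XA @ XAi @ signs
    A0 = XB.T @ (np.eye(n) - PA) / n
    A_rows = np.vstack([A0, -A0])
    b_rows = np.concatenate([lam - c, lam + c])
    # Sign constraints: s_j * betahat_j > 0, betahat_A = XAi(X_A'y - n lam s)
    D = np.diag(signs)
    A_rows = np.vstack([A_rows, -D @ XAi @ XA.T])
    b_rows = np.concatenate([b_rows, -n * lam * D @ XAi @ signs])
    return A_rows, b_rows   # the event is {y : A_rows @ y <= b_rows}
```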

Selective inference: Big picture

Now, suppose that $y \sim N(\mu, \sigma^2 I)$. The main idea of selective inference is to make inference on $\mu$, or more generally on a linear combination $\theta = \eta^T \mu$, conditional on the event $\{Ay \le b\}$.

For example, we would likely be interested in $(X_A^T X_A)^{-1} X_A^T \mu$; to address this question, we would set $\eta$ equal to the various columns of $X_A (X_A^T X_A)^{-1}$.

Carrying out this conditional inference turns out to be quite a bit easier than one would expect, due to a remarkable result known as the polyhedral lemma, which we will state on the next slide without proof.
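For instance, a one-line sketch of the contrast vector for the $j$th selected coefficient (an illustrative helper of my own, not from the lecture):

```python
import numpy as np

def eta_for_coefficient(XA, j):
    # eta'y is the j-th coefficient of the OLS fit on the selected columns,
    # and eta'mu is the corresponding estimand
    return (XA @ np.linalg.inv(XA.T @ XA))[:, j]
```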

Polyhedral lemma

Theorem: The conditional distribution of $\eta^T y \mid \{Ay \le b\}$ is equivalent to the distribution of $\eta^T y$ given
$$\mathcal{V}^-(y) \le \eta^T y \le \mathcal{V}^+(y), \qquad \mathcal{V}^0(y) \ge 0,$$
where, with $\alpha = A\eta / \|\eta\|^2$,
$$\mathcal{V}^-(y) = \max_{j:\,\alpha_j < 0} \frac{b_j - (Ay)_j + \alpha_j \eta^T y}{\alpha_j}$$
$$\mathcal{V}^+(y) = \min_{j:\,\alpha_j > 0} \frac{b_j - (Ay)_j + \alpha_j \eta^T y}{\alpha_j}$$
$$\mathcal{V}^0(y) = \min_{j:\,\alpha_j = 0} \{b_j - (Ay)_j\}$$

Note that $\mathcal{V}^-$, $\mathcal{V}^+$, and $\mathcal{V}^0$ depend on $y$ only through quantities independent of $\eta^T y$, which is what allows us to treat them as fixed truncation limits.

Using the lemma for inference

The polyhedral lemma appears complex, but its upshot is actually quite simple: $\eta^T y$ would ordinarily follow a normal distribution, but conditional on the model we have selected, it follows a truncated normal distribution, with support determined by $A$ and $b$.

Thus, letting $F$ denote the CDF of this truncated normal distribution, we can carry out hypothesis tests of $\theta = \eta^T \mu = 0$ by calculating tail areas with respect to this distribution.

Likewise, we can construct confidence intervals by inverting the above tests, searching for values of $\theta$ satisfying $F(\theta) = \alpha/2$ and $F(\theta) = 1 - \alpha/2$.
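A compact sketch of the resulting selective p-value (my own illustration, assuming $\sigma$ known and the $(A, b)$ constructed earlier; the selectiveInference software used later implements this machinery in full):

```python
# Sketch: two-sided selective p-value for H0: eta'mu = 0, given {Ay <= b},
# via the truncated normal distribution from the polyhedral lemma.
import numpy as np
from scipy.stats import norm

def truncnorm_pvalue(eta, y, A, b, sigma):
    eta2 = eta @ eta
    u = eta @ y                           # observed value of eta'y
    alpha = (A @ eta) / eta2
    resid = b - A @ y + alpha * u         # = b_j - (Az)_j; free of eta'y
    vlo = np.max(resid[alpha < 0] / alpha[alpha < 0]) if np.any(alpha < 0) else -np.inf
    vup = np.min(resid[alpha > 0] / alpha[alpha > 0]) if np.any(alpha > 0) else np.inf
    sd = sigma * np.sqrt(eta2)            # sd of eta'y under N(mu, sigma^2 I)
    # CDF at u of N(0, sd^2) truncated to [vlo, vup]:
    denom = norm.cdf(vup / sd) - norm.cdf(vlo / sd)
    F = (norm.cdf(u / sd) - norm.cdf(vlo / sd)) / denom
    return 2 * min(F, 1 - F)
```

Confidence intervals follow by shifting the null value $\theta$ and locating where this truncated-normal CDF equals $\alpha/2$ and $1 - \alpha/2$.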

Application to example data

[Figure: selective confidence intervals for the coefficients of the selected variables (A1, A2, A3, A4, A5, A6, B3, C29), plotted on a $\hat\beta$ axis from $-1.5$ to $1.5$, with p-values A3 < 0.001, A6 < 0.01, A4 = 0.02, A5 = 0.03, B3 = 0.48, C29 = 0.64.]

Lasso-OLS hybrid intervals

[Figure: lasso-OLS hybrid intervals for the same variables, on the same $\hat\beta$ axis from $-1.5$ to $1.5$; labeled p-values B3 = 0.07 and C29 = 0.23.]

covtest: TCGA data

Variable        T_c        p
C17orf53   126.0942
VPS25       12.4525
NBR2         8.8638   0.0001
DTL         22.4862
PSME3        2.1363   0.1181
TOP2A        8.0346   0.0003
TIMELESS     4.5714   0.0103
CDC25C       1.2952   0.2738
CCDC56       2.1602   0.1153
CENPK        0.3062   0.7363

selectiveinference: TCGA data

[Figure: selective confidence intervals for the selected TCGA genes (C17orf53, DTL, NBR2, PSME3, TIMELESS, TOP2A, VPS25), plotted on a $\hat\beta$ axis from $-0.05$ to $0.30$, with labeled p-values C17orf53 < 0.001, PSME3 < 0.01, TIMELESS = 0.07, TOP2A < 0.001.]

Final remarks

The covariance test lies somewhere in between the classical and false inclusion definitions with respect to the error rate it controls. The covariance test approach scales up well to high dimensions, but how powerful it is has not been extensively studied.

Selective inference is a very promising approach that appears to work very well in the $n > p$ case. How well it works in the $p > n$ case is something of an unanswered question (the method is very new at this point); for the TCGA data, once the 8th predictor enters the model, all confidence intervals become infinitely wide.