Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square


Yuxin Chen, Electrical Engineering, Princeton University

Coauthors: Pragya Sur (Stanford Statistics) and Emmanuel Candès (Stanford Statistics & Math)

In memory of Tom Cover (1938-2012), Stanford EE: "We all know the feeling that follows when one investigates a problem, goes through a large amount of algebra, and finally investigates the answer, to find that the entire problem is illuminated not by the analysis but by the inspection of the answer."

Inference in regression problems

Example: logistic regression

    y_i ~ logistic-model(X_i^T β),    1 ≤ i ≤ n

One wishes to determine which covariates are of importance, i.e. to test

    β_j = 0  vs.  β_j ≠ 0    (1 ≤ j ≤ p)

Classical tests: β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)

Standard approaches (widely used in R, Matlab, etc.) use asymptotic distributions of certain statistics:
- Wald test: Wald statistic → χ²
- Likelihood ratio test: log-likelihood ratio statistic → χ²
- Score test: score → N(0, Fisher Info)

Example: logistic regression in R (n = 100, p = 30)

> fit = glm(y ~ X, family = binomial)
> summary(fit)

Call:
glm(formula = y ~ X, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7727  -0.8718   0.3307   0.8637   2.3141

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  0.086602    0.247561    0.350   0.72647
X1           0.268556    0.307134    0.874   0.38190
X2           0.412231    0.291916    1.412   0.15790
X3           0.667540    0.363664    1.836   0.06642 .
X4          -0.293916    0.331553   -0.886   0.37536
X5           0.207629    0.272031    0.763   0.44531
X6           1.104661    0.345493    3.197   0.00139 **
...
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Can these inference calculations (e.g. p-values) be trusted?
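For reference, a minimal sketch of how such a fit could be reproduced under the global null; the seed, the Gaussian design, and the use of an intercept here are illustrative assumptions, not the speaker's exact setup:

# Simulate a global null (beta = 0): the responses carry no signal
set.seed(1)
n <- 100; p <- 30
X <- matrix(rnorm(n * p), n, p)          # i.i.d. Gaussian design
y <- rbinom(n, 1, 0.5)                   # beta = 0  =>  P(y = 1 | X) = 1/2
fit <- glm(y ~ X, family = binomial)     # logistic regression with intercept
summary(fit)                             # per-coefficient z-tests and p-values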

This talk: likelihood ratio test (LRT)

    β_j = 0  vs.  β_j ≠ 0    (1 ≤ j ≤ p)

Log-likelihood ratio (LLR) statistic:

    LLR_j := ℓ(β̂) − ℓ(β̂^(−j))

where ℓ(·) is the log-likelihood, β̂ = argmax_β ℓ(β) is the unconstrained MLE, and β̂^(−j) = argmax_{β : β_j = 0} ℓ(β) is the constrained MLE.

Wilks phenomenon (Samuel Wilks, Princeton, 1938)

    β_j = 0  vs.  β_j ≠ 0    (1 ≤ j ≤ p)

Under the null, the LRT statistic asymptotically follows a chi-square distribution:

    2 LLR_j →d χ²₁    (p fixed, n → ∞)
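In code, the classical recipe amounts to two nested fits and a χ²₁ tail probability. A short sketch, reusing the simulated y and X from above (testing j = 1 is an arbitrary choice):

# Classical LRT p-value for H0: beta_1 = 0, calibrated via Wilks: 2*LLR ~ chi2_1
fit_full <- glm(y ~ X, family = binomial)          # unconstrained MLE
fit_null <- glm(y ~ X[, -1], family = binomial)    # constrained MLE (beta_1 = 0)
LLR <- as.numeric(logLik(fit_full) - logLik(fit_null))
pval_classical <- pchisq(2 * LLR, df = 1, lower.tail = FALSE)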

Classical LRT in high dimensions: p/n → κ ∈ (0, 1)

Linear regression: y = Xβ + η, where η has i.i.d. Gaussian entries.

[Figure: histogram of p-values; classical p-values are uniform.]

For linear regression (with Gaussian noise) in high dimensions, 2 LLR_j ∼ χ²₁: the classical test always works.

Classical LRT in high dimensions: p = 1200, n = 4000

Logistic regression: y ∼ logistic-model(Xβ).

[Figure: histogram of p-values; classical p-values are highly non-uniform.]

Wilks' theorem seems inadequate for logistic regression in high dimensions.
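An experiment in the spirit of this histogram can be sketched as follows; the repetition count, the choice of coordinate j = 1, the column scaling, and the absence of an intercept are illustrative assumptions (each fit is large, so this is slow):

# Classical LRT p-values under the global null at n = 4000, p = 1200
set.seed(1)
n <- 4000; p <- 1200; reps <- 20
pvals <- replicate(reps, {
  X <- matrix(rnorm(n * p, sd = 1 / sqrt(n)), n, p)
  y <- rbinom(n, 1, 0.5)                           # beta = 0
  f1 <- glm(y ~ X - 1, family = binomial)          # no intercept, as in the model
  f0 <- glm(y ~ X[, -1] - 1, family = binomial)    # beta_1 constrained to 0
  pchisq(2 * as.numeric(logLik(f1) - logLik(f0)), 1, lower.tail = FALSE)
})
hist(pvals, breaks = 20)   # visibly skewed toward 0, not uniform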

Bartlett correction? (n = 4000, p = 1200)

[Figures: histograms of p-values, classical Wilks vs. Bartlett-corrected.]

Bartlett correction (a finite-sample adjustment):

    2 LLR_j / (1 + α_n/n) ∼ χ²₁

The p-values are still non-uniform, so this is NOT a finite-sample effect. What happens in high dimensions?

Our findings

[Figures: histograms of p-values calibrated by classical Wilks, Bartlett correction, and a rescaled χ²; only the last is flat.]

A glimpse of our theory: the LRT statistic follows a rescaled χ² distribution.

Problem formulation (formal)

Unknown signal: β ∈ ℝᵖ; observations: y; design: X ∈ ℝⁿˣᵖ; likelihood: f_{Y|X}(y|x).

- i.i.d. Gaussian design: X_i ∼ N(0, Σ)
- Logistic model: for 1 ≤ i ≤ n,
      y_i = +1 with prob. 1/(1 + exp(−X_i^T β)),
      y_i = −1 with prob. 1/(1 + exp(X_i^T β))
- Proportional growth: p/n → constant
- Global null: β = 0

When does MLE exist?

Class labels: y_i ∈ {0, 1} or y_i ∈ {−1, 1}.

    (MLE)  maximize_β  ℓ(β) = −∑_{i=1}^n log{1 + exp(−y_i X_i^T β)} ≤ 0

If there exists a hyperplane that perfectly separates the {y_i}, i.e. some β̂ s.t. y_i X_i^T β̂ > 0 for all i, then the MLE is unbounded:

    lim_{a→∞} ℓ(a β̂) = 0

When does MLE exist? Separating capacity (Tom Cover, Ph.D. thesis, 1965)

[Figure: labeled point clouds (y_i = 1 vs. y_i = −1) for n = 2, n = 4, n = 12.]

As the number of samples n increases, it becomes more difficult to find a separating hyperplane.

Theorem 1 (Cover 1965). Under i.i.d. Gaussian design, a separating hyperplane exists with high probability iff n/p < 2 (asymptotically).
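Cover's threshold is easy to probe numerically, since separability is a linear feasibility problem. A rough sketch, where the lpSolve package and the split b = u − v into nonnegative parts are conveniences of this sketch, not part of the talk:

# Does a separating hyperplane exist, i.e. is y_i * X_i' b >= 1 feasible?
library(lpSolve)
is_separable <- function(X, y) {
  A <- cbind(y * X, -(y * X))                   # unknowns (u, v) >= 0, b = u - v
  lp("min", rep(0, 2 * ncol(X)), A,
     rep(">=", nrow(X)), rep(1, nrow(X)))$status == 0
}
set.seed(1)
p <- 50
for (ratio in c(1.5, 2, 2.5)) {                 # n/p below, at, above the threshold
  n <- round(ratio * p)
  X <- matrix(rnorm(n * p), n, p)
  y <- sample(c(-1, 1), n, replace = TRUE)      # global-null labels
  cat("n/p =", ratio, "separable:", is_separable(X, y), "\n")
}
# Expect separable for n/p < 2 and non-separable for n/p > 2; the transition
# sharpens as p grows (p = 50 is kept small so the LP runs quickly).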

Main result: asymptotic distribution of the LRT

Theorem 2 (Sur, Chen, Candès 2017). Suppose n/p > 2. Under i.i.d. Gaussian design and the global null,

    2 LLR_j →d α(p/n) · χ²₁    (a rescaled χ²)

The factor α(p/n) can be determined by solving a system of 2 nonlinear equations in 2 unknowns (τ, b):

    τ² = (n/p) · E[(Ψ(τZ; b))²]
    p/n = E[Ψ′(τZ; b)]

where Z ∼ N(0, 1), Ψ is a certain operator, and α(p/n) = τ²/b.

- α(·) depends only on the aspect ratio p/n; it is not a finite-sample effect
- α(0) = 1: matches classical theory
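The slides leave Ψ unspecified; in the accompanying paper it is the proximal-type map Ψ(z; b) = b·ρ′(prox_{bρ}(z)) with ρ(t) = log(1 + eᵗ). Assuming that definition, a rough numerical sketch for solving the system is:

# Solve the 2x2 nonlinear system for (tau, b) and return alpha = tau^2 / b.
# Assumes Psi(z; b) = b * rho'(prox_{b*rho}(z)) with rho(t) = log(1 + exp(t)).
rho1 <- function(x) plogis(x)                    # rho'
rho2 <- function(x) plogis(x) * (1 - plogis(x))  # rho''

prox <- function(z, b) {                # prox_{b*rho}(z): solve x + b*rho'(x) = z
  x <- z
  for (it in 1:50) x <- x - (x + b * rho1(x) - z) / (1 + b * rho2(x))  # Newton
  x
}
Psi  <- function(z, b) b * rho1(prox(z, b))
dPsi <- function(z, b) { x <- prox(z, b); b * rho2(x) / (1 + b * rho2(x)) }

Egauss <- function(g)                   # E[g(Z)] for Z ~ N(0,1), by quadrature
  integrate(function(z) g(z) * dnorm(z), -8, 8)$value

alpha_of <- function(kappa) {           # kappa = p/n, must lie in (0, 1/2)
  resid <- function(par) {
    tau <- exp(par[1]); b <- exp(par[2])          # log-parametrize to stay > 0
    c(tau^2 - Egauss(function(z) Psi(tau * z, b)^2) / kappa,
      kappa - Egauss(function(z) dPsi(tau * z, b)))
  }
  sol <- optim(c(0, 0), function(par) sum(resid(par)^2),
               control = list(reltol = 1e-14, maxit = 5000))
  tau <- exp(sol$par[1]); b <- exp(sol$par[2])
  tau^2 / b                             # alpha(kappa)
}
alpha_of(0.3)   # rescaling factor at p/n = 0.3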

Our adjusted LRT theory in practice

[Figures: the rescaling constant α(κ) plotted against κ = p/n, rising from 1 toward 2 over κ ∈ [0, 0.4]; histogram of adjusted p-values, which is flat, i.e. the empirical p-values are Unif(0, 1).]

Empirically, the LRT statistic matches the rescaled χ²₁ law, as predicted.
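In practice the adjustment is a one-line change to the classical calibration; a sketch, where alpha_of refers to the solver sketched after Theorem 2:

# Adjusted p-value: calibrate 2*LLR against alpha(p/n) * chi2_1, not chi2_1
alpha <- alpha_of(p / n)
pval_adjusted <- pchisq(2 * LLR / alpha, df = 1, lower.tail = FALSE)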

Validity of the tail approximation

[Figure: empirical CDF of adjusted p-values on [0, 0.01], n = 4000, p = 1200.]

The empirical CDF is in near-perfect agreement with the diagonal, suggesting that our theory is remarkably accurate even when we zoom in on the tail.

Efficacy under moderate sample sizes

[Figure: empirical CDF of adjusted p-values on [0, 1], n = 200, p = 60.]

Our theory seems adequate even for moderately large samples.

Universality: non-Gaussian covariates

[Figures: histograms of p-values under classical Wilks, Bartlett-corrected, and rescaled χ² calibrations for an i.i.d. Bernoulli design, n = 4000, p = 1200; only the rescaled χ² histogram is flat.]

Connection to convex geometry

Perfect separation means that there exists b with sign(y_i X_i^T b) = 1 for all i. WLOG, if y_1 = · · · = y_n = 1, this reads sign(X_i^T b) = 1 for all i, i.e. Xb > 0. Hence

    a separating hyperplane exists  ⟺  range(X) ∩ ℝⁿ₊ ≠ {0}

So the question becomes: what is the probability that the random subspace {Xb : b ∈ ℝᵖ} intersects the positive orthant ℝⁿ₊ beyond {0}? This can be analyzed via convex geometry (e.g. Amelunxen et al.).

Connection to robust M-estimation

Since y_i = ±1 with X_i ∼ind N(0, Σ), and under the global null y_i is independent of X_i (so that y_i X_i =d X_i),

    maximize_β  ℓ(β) = −∑_{i=1}^n log{1 + exp(−y_i X_i^T β)}
                    =d −∑_{i=1}^n log{1 + exp(−X_i^T β)} =: −∑_{i=1}^n ρ(−X_i^T β)

    ⟹  MLE:  β̂ = argmin_β ∑_{i=1}^n ρ(y_i − X_i^T β)  with y = 0

i.e. robust M-estimation with response y = 0.
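The reduction can be written out in full (a worked restatement; the independence of y_i from X_i under the null justifies the distributional identity):

\[
\ell(\beta)
= -\sum_{i=1}^{n} \log\bigl\{1 + e^{-y_i X_i^\top \beta}\bigr\}
\overset{d}{=} -\sum_{i=1}^{n} \log\bigl\{1 + e^{-X_i^\top \beta}\bigr\}
= -\sum_{i=1}^{n} \rho\bigl(0 - X_i^\top \beta\bigr),
\qquad \rho(t) := \log\bigl(1 + e^{t}\bigr),
\]
\[
\text{so that}\qquad
\widehat{\beta} = \arg\max_{\beta}\,\ell(\beta)
\;\overset{d}{=}\; \arg\min_{\beta}\,\sum_{i=1}^{n} \rho\bigl(y_i - X_i^\top \beta\bigr)
\quad\text{with } y = 0.
\]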

Connection to robust M-estimation

    MLE:  β̂ = argmin_β ∑_{i=1}^n ρ(y_i − X_i^T β)    (robust M-estimation)

Variance inflation as p/n grows (El Karoui et al. '13; Donoho, Montanari '13).

[Figure: the rescaling constant α(κ) rising from 1 toward 2 over κ = p/n ∈ [0, 0.4].]

Variance inflation ⟺ an increasing rescaling factor.

This is not just about logistic regression. Our theory is applicable to:
- the logistic model
- the probit model
- the linear model under Gaussian noise, where the rescaling constant α(p/n) ≡ 1 (consistent with classical theory)
- the linear model under non-Gaussian noise

Key ingredients in establishing our theory

The key step is to realize that

    2 LLR_j ≈d p · b(p/n) · β̂_j²    (a rescaled χ²)

where b(·) depends only on p/n and β̂_j ∼ N(0, α(p/n) / (p · b(p/n))).

1. Convex geometry: show ‖β̂‖ = O(1)
2. Approximate message passing: determine the asymptotic distribution of β̂
3. Leave-one-out analysis: characterize the marginal distribution of β̂_j (a rescaled Gaussian) and ensure higher-order terms vanish
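A one-line check that these pieces combine into Theorem 2:

\[
\widehat{\beta}_j \sim \mathcal{N}\!\Bigl(0,\; \frac{\alpha(p/n)}{p\, b(p/n)}\Bigr)
\;\Longrightarrow\;
2\,\mathrm{LLR}_j \approx p\, b(p/n)\, \widehat{\beta}_j^{\,2}
\;\overset{d}{\to}\; \alpha(p/n)\, \chi^2_1 .
\]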

A small sample of prior work

- Candès, Fan, Janson, Lv '16: empirically observed the non-uniformity of p-values in logistic regression
- Fan, Demirkaya, Lv '17: the classical asymptotic normality of the MLE (the basis of the Wald test) fails to hold in logistic regression once p grows faster than n^{2/3}
- El Karoui, Bean, Bickel, Lim, Yu '13; El Karoui '13; Donoho, Montanari '13; El Karoui '15: robust M-estimation for linear models (assuming strong convexity)

Summary

Caution needs to be exercised when applying classical statistical procedures in the high-dimensional regime, a regime of growing interest and relevance. What shall we do under the non-null (β ≠ 0)?

Paper: "The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square," Pragya Sur, Yuxin Chen, Emmanuel Candès, 2017.