The Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square

Yuxin Chen (Electrical Engineering, Princeton University)
Coauthors: Pragya Sur (Stanford Statistics), Emmanuel Candès (Stanford Statistics & Mathematics)
In memory of Tom Cover (1938–2012)

[Photo: Tom Cover at Stanford EE]

"We all know the feeling that follows when one investigates a problem, goes through a large amount of algebra, and finally investigates the answer to find that the entire problem is illuminated not by the analysis but by the inspection of the answer."
Inference in regression problems

Example: logistic regression
$$y_i \sim \text{logistic-model}(x_i^\top \beta), \qquad 1 \le i \le n$$

One wishes to determine which covariates are important, i.e. test
$$\beta_j = 0 \quad \text{vs.} \quad \beta_j \neq 0 \qquad (1 \le j \le p)$$
Classical tests: $\beta_j = 0$ vs. $\beta_j \neq 0$ $(1 \le j \le p)$

Standard approaches (widely used in R, Matlab, etc.) rely on asymptotic distributions of certain statistics:
- Wald test: Wald statistic $\to \chi^2$
- Likelihood ratio test: log-likelihood ratio statistic $\to \chi^2$
- Score test: score $\to \mathcal{N}(0, \text{Fisher information})$
Example: logistic regression in R (n = 100, p = 30)

> fit = glm(y ~ X, family = binomial)
> summary(fit)

Call:
glm(formula = y ~ X, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7727  -0.8718   0.3307   0.8637   2.3141

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.086602   0.247561   0.350  0.72647
X1           0.268556   0.307134   0.874  0.38190
X2           0.412231   0.291916   1.412  0.15790
X3           0.667540   0.363664   1.836  0.06642 .
X4          -0.293916   0.331553  -0.886  0.37536
X5           0.207629   0.272031   0.763  0.44531
X6           1.104661   0.345493   3.197  0.00139 **
...
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Can these inference calculations (e.g. p-values) be trusted?
This talk: likelihood ratio test (LRT)

$\beta_j = 0$ vs. $\beta_j \neq 0$ $(1 \le j \le p)$

Log-likelihood ratio (LLR) statistic:
$$\mathrm{LLR}_j := \ell\big(\hat\beta\big) - \ell\big(\hat\beta^{(-j)}\big)$$
- $\ell(\cdot)$: log-likelihood
- $\hat\beta = \arg\max_\beta \ell(\beta)$: unconstrained MLE
- $\hat\beta^{(-j)} = \arg\max_{\beta:\, \beta_j = 0} \ell(\beta)$: constrained MLE
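As a concrete illustration (a minimal sketch, not from the slides; variable names are illustrative), $\mathrm{LLR}_j$ can be computed in R by fitting the full and constrained models and comparing log-likelihoods; the classical Wilks p-value then comes from a $\chi^2_1$ tail:

```r
# Minimal sketch: compute LLR_j for one coordinate and the classical p-value
set.seed(1)
n <- 100; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)                                 # global null: beta = 0

j <- 1
fit_full <- glm(y ~ X, family = binomial)              # unconstrained MLE
fit_null <- glm(y ~ X[, -j], family = binomial)        # MLE with beta_j = 0
llr_j <- as.numeric(logLik(fit_full) - logLik(fit_null))

# Classical Wilks: 2 * LLR_j is asymptotically chi-square with 1 d.o.f.
p_value <- pchisq(2 * llr_j, df = 1, lower.tail = FALSE)
```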
Wilks phenomenon (1938)

Samuel Wilks, Princeton

$\beta_j = 0$ vs. $\beta_j \neq 0$ $(1 \le j \le p)$

Wilks' theorem: under the null, the LRT statistic asymptotically follows a chi-square distribution:
$$2\,\mathrm{LLR}_j \overset{d}{\to} \chi^2_1 \qquad (p \text{ fixed},\ n \to \infty)$$
Classical LRT in high dimensions

Linear regression: $y = X\beta + \underbrace{\eta}_{\text{i.i.d. Gaussian}}$

[Histogram: classical p-values are uniform]

For linear regression (with Gaussian noise) in high dimensions, even with $p$ and $n$ growing proportionally, $2\,\mathrm{LLR}_j \sim \chi^2_1$ still holds: the classical test always works.
Classical LRT in high dimensions (n = 4000, p = 1200)

Logistic regression: $y \sim \text{logistic-model}(X\beta)$

[Histogram: classical p-values are highly non-uniform]

Wilks' theorem seems inadequate for logistic regression in high dimensions.
Bartlett correction? Our findings (n = 4000, p = 1200)

Bartlett correction (a finite-sample adjustment):
$$\frac{2\,\mathrm{LLR}_j}{1 + \alpha_n/n} \overset{d}{\approx} \chi^2_1$$

[Histograms of p-values: classical Wilks | Bartlett-corrected | rescaled $\chi^2$]

Bartlett-corrected p-values are still non-uniform: this is NOT a finite-sample effect. What happens in high dimensions?

A glimpse of our theory: the LRT follows a rescaled $\chi^2$ distribution, under which the p-values become uniform.
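This non-uniformity is easy to reproduce. A minimal simulation sketch under the global null (illustrative code, not the authors'; dimensions scaled down from the slides for runtime):

```r
# Sketch: classical LRT p-values under the global null with p/n = 0.3
set.seed(2)
n <- 400; p <- 120; reps <- 500
pvals <- replicate(reps, {
  X <- matrix(rnorm(n * p), n, p) / sqrt(n)
  y <- rbinom(n, 1, 0.5)                           # global null: beta = 0
  fit_full <- glm(y ~ X, family = binomial)
  fit_null <- glm(y ~ X[, -1], family = binomial)  # drop covariate 1
  llr <- as.numeric(logLik(fit_full) - logLik(fit_null))
  pchisq(2 * llr, df = 1, lower.tail = FALSE)
})
hist(pvals, breaks = 20)  # visibly skewed toward small values, not uniform
```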
Problem formulation (formal)

- Unknown signal: $\beta \in \mathbb{R}^p$; observation: $y \in \mathbb{R}^n$; likelihood: $f_{Y|X}(y \mid x)$
- Gaussian design: $X_i \overset{\text{ind.}}{\sim} \mathcal{N}(0, \Sigma)$
- Logistic model:
$$y_i = \begin{cases} 1, & \text{with prob. } \dfrac{1}{1+\exp(-X_i^\top \beta)} \\[4pt] -1, & \text{with prob. } \dfrac{1}{1+\exp(X_i^\top \beta)} \end{cases} \qquad 1 \le i \le n$$
- Proportional growth: $p/n \to$ constant
- Global null: $\beta = 0$
When does the MLE exist?

$$\text{(MLE)} \qquad \text{maximize}_\beta \;\; \ell(\beta) = -\sum_{i=1}^n \underbrace{\log\big\{1 + \exp(-y_i X_i^\top \beta)\big\}}_{\ge 0}$$

If there is a hyperplane that perfectly separates the class labels $\{y_i\}$, i.e.
$$\exists\, \hat b \quad \text{s.t.} \quad y_i X_i^\top \hat b > 0 \;\; \text{for all } i,$$
then the MLE is unbounded:
$$\lim_{a \to \infty} \ell(\underbrace{a \hat b}_{\text{unbounded}}) = 0$$
When does the MLE exist? Separating capacity (Tom Cover, Ph.D. thesis, 1965)

[Figure: points labeled $y_i = 1$ vs. $y_i = -1$ for n = 2, 4, 12; as the number of samples n increases, it becomes more difficult to find a separating hyperplane]

Theorem 1 (Cover 1965). Under i.i.d. Gaussian design, a separating hyperplane exists with high probability iff $n/p < 2$ (asymptotically).
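Cover's function-counting argument gives an exact separation probability for $n$ points in general position with random labels: the number of linearly realizable dichotomies is $2\sum_{k=0}^{p-1}\binom{n-1}{k}$, so the probability of separability is $2^{1-n}\sum_{k=0}^{p-1}\binom{n-1}{k}$, which equals a binomial tail. A short sketch evaluating this (illustrative code; the formula is the standard result from Cover 1965):

```r
# Probability that n randomly labeled points in general position in R^p
# admit a separating hyperplane through the origin (Cover 1965):
# P = 2^(1-n) * sum_{k=0}^{p-1} choose(n-1, k) = pbinom(p - 1, n - 1, 1/2)
separable_prob <- function(n, p) pbinom(p - 1, size = n - 1, prob = 0.5)

separable_prob(2400, 1200)   # n/p = 2: probability ~ 1/2 (the phase transition)
separable_prob(4000, 1200)   # n/p > 2: probability ~ 0 (MLE exists)
separable_prob(2000, 1200)   # n/p < 2: probability ~ 1 (MLE unbounded)
```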
Main result: asymptotic distribution of the LRT

Theorem 2 (Sur, Chen, Candès 2017). Suppose $n/p > 2$. Under i.i.d. Gaussian design and the global null,
$$2\,\mathrm{LLR}_j \overset{d}{\to} \underbrace{\alpha(p/n)\,\chi^2_1}_{\text{rescaled } \chi^2}$$

The factor $\alpha(p/n)$ can be determined by solving a system of 2 nonlinear equations in 2 unknowns:
$$\tau^2 = \frac{n}{p}\,\mathbb{E}\Big[\big(\Psi(\tau Z; b)\big)^2\Big], \qquad \frac{p}{n} = \mathbb{E}\big[\Psi'(\tau Z; b)\big],$$
where $Z \sim \mathcal{N}(0,1)$, $\Psi$ is a certain operator, and $\alpha(p/n) = \tau^2 / b$.

- $\alpha(\cdot)$ depends only on the aspect ratio $p/n$; it is not a finite-sample effect
- $\alpha(0) = 1$: matches classical theory
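For intuition, here is a numerical sketch of how such a system can be solved. The slides leave $\Psi$ unspecified; the code below assumes the proximal form used in the paper, $\Psi(z; b) = b\,\rho'(\mathrm{prox}_{b\rho}(z))$ with $\rho(t) = \log(1 + e^t)$, and should be read as illustrative rather than a reference implementation:

```r
# Numerical sketch (assumes Psi(z; b) = b * rho'(prox_{b*rho}(z)),
# rho(t) = log(1 + exp(t)); see the paper for the exact definitions)
sigmoid <- function(x) 1 / (1 + exp(-x))

# prox_{b*rho}(z): solve x + b * sigmoid(x) = z by Newton's method
prox <- function(z, b) {
  x <- z
  for (iter in 1:50) {
    s <- sigmoid(x)
    x <- x - (x + b * s - z) / (1 + b * s * (1 - s))
  }
  x
}
Psi  <- function(z, b) b * sigmoid(prox(z, b))
dPsi <- function(z, b) {                       # derivative of Psi in z
  s <- sigmoid(prox(z, b))
  b * s * (1 - s) / (1 + b * s * (1 - s))
}

# Gaussian expectations by quadrature, then solve for (tau, b)
gauss_E <- function(f) integrate(function(z) f(z) * dnorm(z), -8, 8)$value
alpha_of <- function(kappa) {                  # kappa = p/n
  resid <- function(par) {
    tau <- exp(par[1]); b <- exp(par[2])       # enforce positivity
    e1 <- tau^2 - (1 / kappa) * gauss_E(function(z) Psi(tau * z, b)^2)
    e2 <- kappa - gauss_E(function(z) dPsi(tau * z, b))
    e1^2 + e2^2
  }
  sol <- optim(c(0, 0), resid)$par
  exp(sol[1])^2 / exp(sol[2])                  # alpha = tau^2 / b
}
alpha_of(0.1)   # > 1, and alpha -> 1 as p/n -> 0, matching classical theory
```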
Our adjusted LRT theory in practice

[Left figure: rescaling constant $\alpha(p/n)$ vs. $\kappa = p/n$, increasing from 1 at $\kappa = 0$ to about 2 near $\kappa = 0.4$]
[Right figure: histogram of empirical p-values, close to Unif(0, 1)]

Empirically, LRT $\approx$ rescaled $\chi^2_1$ (as predicted).
Validity of tail approximation

[Figure: empirical CDF of adjusted p-values near zero (n = 4000, p = 1200)]

The empirical CDF is in near-perfect agreement with the diagonal, suggesting that our theory is remarkably accurate even when we zoom in on the tail.
Efficacy under moderate sample sizes

[Figure: empirical CDF of adjusted p-values (n = 200, p = 60)]

Our theory seems adequate even for moderately large samples.
Universality: non-Gaussian covariates

[Histograms of p-values: classical Wilks | Bartlett-corrected | rescaled $\chi^2$]

i.i.d. Bernoulli design, n = 4000, p = 1200: the same rescaled-$\chi^2$ behavior appears.
Connection to convex geometry

Perfect separation: $\exists\, b$ such that $\mathrm{sign}(y_i X_i^\top b) = 1$ for all $i$.

WLOG, if $y_1 = \cdots = y_n = 1$, then separability is equivalent to
$$\underbrace{\mathrm{range}(X) \cap \mathbb{R}^n_+ \neq \{0\}}_{\text{can be analyzed via convex geometry (e.g. Amelunxen et al.)}}$$
i.e. the event that the random subspace $\{Xb \mid b \in \mathbb{R}^p\}$ intersects the positive orthant nontrivially.
Connection to robust M-estimation

Since $y_i = \pm 1$ and $X_i \overset{\text{ind.}}{\sim} \mathcal{N}(0, \Sigma)$ is symmetric,
$$\text{maximize}_\beta \;\; \ell(\beta) = -\sum_{i=1}^n \log\big\{1 + \exp(-y_i X_i^\top \beta)\big\} \;\overset{d}{=}\; -\sum_{i=1}^n \underbrace{\log\big\{1 + \exp(-X_i^\top \beta)\big\}}_{=:\, \rho(-X_i^\top \beta)}$$

$$\Longrightarrow \quad \text{MLE} \quad \hat\beta = \arg\min_\beta \underbrace{\sum_{i=1}^n \rho(y_i - X_i^\top \beta)}_{\text{robust M-estimation, with } y = 0}$$
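A quick numerical sanity check of this correspondence (a sketch with illustrative names; $\rho(t) = \log(1 + e^t)$ as above): the glm fit and the M-estimation formulation with $y = 0$ applied to the label-flipped design produce the same coefficients.

```r
# Sketch: the logistic MLE is an M-estimator. With labels y_i in {-1, +1},
# maximizing l(beta) equals minimizing sum rho(0 - z_i' beta) with
# rho(t) = log(1 + exp(t)) and the flipped design z_i = y_i * x_i.
set.seed(4)
n <- 500; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- sample(c(-1, 1), n, replace = TRUE)           # global null

rho <- function(t) log1p(exp(t))
obj <- function(beta, Z) sum(rho(-Z %*% beta))     # M-estimation with y = 0

beta_m   <- optim(rep(0, p), obj, Z = y * X, method = "BFGS")$par
beta_glm <- coef(glm(as.numeric(y == 1) ~ X - 1, family = binomial))
max(abs(beta_m - beta_glm))                        # ~ 0: same estimator
```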
Connection to robust M-estimation

$$\text{MLE} \quad \hat\beta = \arg\min_\beta \underbrace{\sum_{i=1}^n \rho(y_i - X_i^\top \beta)}_{\text{robust M-estimation}}$$

Variance inflation as $p/n \uparrow$ (El Karoui et al. '13; Donoho, Montanari '13)

[Figure: rescaling constant $\alpha(p/n)$ vs. $\kappa = p/n$, increasing from 1]

Variance inflation $\Longleftrightarrow$ increasing rescaling factor
This is not just about logistic regression

Our theory is applicable to:
- the logistic model
- the probit model
- the linear model under Gaussian noise: rescaling constant $\alpha(p/n) \equiv 1$ (consistent with classical theory)
- the linear model under non-Gaussian noise
Key ingredients in establishing our theory

The key step is to realize that
$$2\,\mathrm{LLR}_j \overset{d}{\approx} \underbrace{\frac{p}{b(p/n)}\,\hat\beta_j^{\,2}}_{\text{rescaled }\chi^2}, \qquad \text{where } b(\cdot) \text{ depends only on } \tfrac{p}{n} \text{ and } \hat\beta_j \sim \mathcal{N}\Big(0,\ \tfrac{\alpha(p/n)\, b(p/n)}{p}\Big)$$

1. Convex geometry: show $\|\hat\beta\| = O(1)$
2. Approximate message passing: determine the asymptotic distribution of $\|\hat\beta\|$
3. Leave-one-out analysis: characterize the marginal distribution of $\hat\beta_j$ (a rescaled Gaussian) and ensure that higher-order terms vanish
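As an illustration of the Gaussian marginal in step 3 (a simulation sketch, not the authors' proof technique): under the global null, a normal Q-Q plot of a null coordinate $\hat\beta_j$ across replications should hug a straight line.

```r
# Sketch: Monte Carlo check that the marginal of beta_hat_j looks
# (rescaled) Gaussian under the global null; small dimensions for runtime
set.seed(5)
n <- 400; p <- 80; reps <- 300
beta1 <- replicate(reps, {
  X <- matrix(rnorm(n * p), n, p) / sqrt(n)
  y <- rbinom(n, 1, 0.5)                    # global null
  coef(glm(y ~ X, family = binomial))["X1"]
})
qqnorm(beta1); qqline(beta1)  # points near the line: Gaussian marginal
```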
A small sample of prior work

- Candès, Fan, Janson, Lv '16: observed empirically the non-uniformity of p-values in logistic regression
- Fan, Demirkaya, Lv '17: classical asymptotic normality of the MLE (the basis of the Wald test) fails to hold in logistic regression when $p \gtrsim n^{2/3}$
- El Karoui, Bean, Bickel, Lim, Yu '13; El Karoui '13; Donoho, Montanari '13; El Karoui '15: robust M-estimation for linear models (assuming strong convexity)
Summary

- Caution needs to be exercised when applying classical statistical procedures in a high-dimensional regime, a regime of growing interest and relevance
- Open question: what shall we do under the non-null ($\beta \neq 0$)?

Paper: "The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square," Pragya Sur, Yuxin Chen, Emmanuel Candès, 2017.