A Significance Test for the Lasso

A Significance Test for the Lasso Lockhart R, Taylor J, Tibshirani R, and Tibshirani R Ashley Petersen May 14, 2013 1

Last time Problem: Many clinical covariates which are important to a certain medical outcome? Want to choose the important variables and say how important these variables are Bad solution: Forward stepwise regression very anti-conservative p-values Better solution: Lasso with p-values from newly proposed covariance test statistic 2

Framework Consider regression setup with outcome vector y R n with covariate matrix X R n p and y = βx + ɛ with ɛ N(0, σ 2 I ). The lasso estimator is obtained by finding β that minimizes where λ is the lasso penalty. 1 2 y X β 2 + λ p β i, i=1 3

Lasso solution path (λ 1 > λ 2 > λ 3 > λ 4 >...) Coefficients 0.2 0.0 0.2 0.4 0.6 λ 1 λ 2 λ 3 λ 4 constraint ˆβ lasso = arg min β 1 2 y X β 2 + λ p β i i=1 4

Obtain p-value for covariate entering the model Coefficients 0.2 0.0 0.2 0.4 0.6 λ 1 λ 2 λ 3 λ 4 constraint 5

Form of test statistic 1 Forward stepwise regression: Lasso: RSS null RSS σ 2 = y ŷ null 2 y ŷ 2 σ [ 2 y T ŷ y T ] ŷ = 2 null + ŷ null 2 ŷ 2 σ 2 T k = y T ŷ y T ŷ null σ 2 σ 2 1 Taking σ 2 as known (for now) 6

What is ŷ? testing that variable that enters at λ 3 has β = 0 ŷ = X ˆβ(λ 4 ) Coefficients 0.2 0.0 0.2 0.4 0.6 λ 1 λ 2 λ 3 λ 4 constraint 7

What about ŷ null? testing that variable that enters at λ 3 has β = 0 ŷ = X ˆβ(λ 4 ) ŷ null = X null ˆβ null (λ 4 ) Coefficients 0.2 0.0 0.2 0.4 0.6 λ 1 λ 2 λ 3 λ 4 constraint 8

Putting this together The covariance test statistic for testing the predictor that enters at the kth step is T k = y T ŷ y T ŷ null σ 2 = y T X ˆβ(λ k+1 ) y T X null ˆβ null (λ k+1 ) σ 2. 9

10 What exactly is the null? Under the global null (β = 0), then T 1 d Exp(1) T 2 d Exp(1/2) T 3 d Exp(1/3). for orthogonal predictor matrix X. Asymptotic distributions are stochastically smaller for general X.

Does it work for finite samples? Simulation of distribution of test statistics for first covariate to enter model under global null (β = 0) n = 100, p = 10 Forward Stepwise Lasso Test statistic 0 2 4 6 8 10 Test statistic 0 1 2 3 4 5 6 0 2 4 6 8 10 0 1 2 3 4 5 6 Chi squared on 1 df Exp(1) 11

12 What exactly is the null? Under the weaker null where there are k 0 truly active covariates (and they have entered the model), then T k0 +1 d Exp(1) T k0 +2 d Exp(1/2) T k0 +3 d Exp(1/3) for orthogonal predictor matrix X. Asymptotic distributions are stochastically smaller for general X..

13 See, it works... Simulation of distribution of test statistics when true β has three non-zero components n = 100, p = 10 F 1 (p) = θ log(1 p) for Exp(θ) 4th predictor 5th predictor 6th predictor Test statistic 0 1 2 3 4 5 Test statistic 0 1 2 3 4 5 Test statistic 0 1 2 3 4 5 0 1 2 3 4 5 Exp(1) 0 1 2 3 4 5 Exp(1) 0 1 2 3 4 5 Exp(1)

14 Simulation setup Distribution of T 1 under global null (β = 0) n = 100 and p (10, 50, 200) Varying correlation structure of predictors with ρ (0, 0.2, 0.4, 0.6, 0.8) Exchangeable AR(1) Block diagonal Mean, variance, and tail probability of distribution

The authors results 15

Some commentary... I don t have any applied or technical comments on the paper at hand (except for feeling strongly that Tables 2 and 3 should really really really be made into a graph... do we really care that a certain number is 315.216?) Andrew Gelman 2 2 via his blog 16

The authors results 17

18 Sampling distribution of simulation results 100 replications of the simulation for given parameters Note large variance of each distribution Larger number of replications needed for accurate estimate n=100, p=10, rho=0 n=100, p=10, rho=0 n=100, p=10, rho=0 Frequency 0 5 10 15 Frequency 0 5 10 15 20 Frequency 0 5 10 15 20 0.90 1.00 1.10 Mean 1.0 1.4 1.8 Variance 0.04 0.06 0.08 0.10 Tail probability

Lots of sampling distributions 19

20 My results mean Exchangeable correlation AR(1) correlation Block diagonal correlation mean 0.90 0.95 1.00 1.05 1.10 p=10 p=50 p=200 mean 0.90 0.95 1.00 1.05 1.10 p=10 p=50 p=200 mean 0.90 0.95 1.00 1.05 1.10 p=10 p=50 p=200 0.0 0.2 0.4 0.6 0.8 ρ 0.0 0.2 0.4 0.6 0.8 ρ 0.0 0.2 0.4 0.6 0.8 ρ

21 My results variance Exchangeable correlation AR(1) correlation Block diagonal correlation variance 0.9 1.1 1.3 1.5 p=10 p=50 p=200 variance 0.9 1.1 1.3 1.5 p=10 p=50 p=200 variance 0.9 1.1 1.3 1.5 p=10 p=50 p=200 0.0 0.2 0.4 0.6 0.8 ρ 0.0 0.2 0.4 0.6 0.8 ρ 0.0 0.2 0.4 0.6 0.8 ρ

22 My results tail probability Exchangeable correlation AR(1) correlation Block diagonal correlation tail probability 0.00 0.02 0.04 0.06 0.08 0.10 p=10 p=50 p=200 tail probability 0.00 0.02 0.04 0.06 0.08 0.10 p=10 p=50 p=200 tail probability 0.00 0.02 0.04 0.06 0.08 0.10 p=10 p=50 p=200 0.0 0.2 0.4 0.6 0.8 ρ 0.0 0.2 0.4 0.6 0.8 ρ 0.0 0.2 0.4 0.6 0.8 ρ

23 What to do when σ 2 is unknown? (n > p) Estimate in the usual way: ˆσ 2 = RSS n p = y X ˆβ LS 2 n p Asymptotic distribution is now F 2,n p Numerator is Exp(1) = χ 2 2 /2 Denominator is χ 2 n p /(n p) Numerator and denominator independent

24 What to do when σ 2 is unknown? (n p) Estimate from least squares fit from model selected by cross-validation No rigorous theory here (fingers crossed!)

25 What s the big idea? Use covariance test statistic to obtain p-value for covariates as they enter the lasso model Compare to asymptotic distribution Exp(1) to obtain p-values Reasonable performance in finite samples Possibly extend this to obtaining inference for all coefficients from a model for a specific lasso penalty

26 What s next! To do: Obtain data for p > n case (HIV data) Finish simulations Next time: Real data examples More on assumptions and theory