Lecture 24: Variable selection in linear models

Consider the linear model X = Zβ + ε, β ∈ R^p, Var(ε) = σ²I_n.
Like the LSE, the ridge regression estimator does not give a 0 estimate to a component of β even if that component is 0.
Variable or model selection refers to eliminating covariates (columns of Z) corresponding to zero components of β.

Example 1. Linear regression models
A = a subset of {1,...,p}, indices of nonzero components of β
The dimension of A is dim(A) = q ≤ p
β_A: sub-vector of β with indices in A
Z_A: the corresponding sub-matrix of Z
The number of models could be as large as 2^p

Approximation to a response surface
The ith row of Z_A = (1, t_i, t_i², ..., t_i^h), t_i ∈ R
A = {1,...,h}: a polynomial of order h, h = 0, 1, ..., p

UW-Madison Statistics, Stat 709, Lecture 24, 2018
Example 2. 1-mean vs p-mean

n = pr, p = p_n, r = r_n
There are p groups, each has r identically distributed observations.
Select one model from two models:
1-mean model: all groups have the same mean μ₁
p-mean model: the p groups have different means μ₁, ..., μ_p
A = A₁ or A_p

With the reparametrization β = (μ₁, μ₂ − μ₁, ..., μ_p − μ₁)^τ, the full design matrix is

    Z = [ 1_r  0    ⋯  0
          1_r  1_r  ⋯  0
          ⋮          ⋱
          1_r  0    ⋯  1_r ]

Z_{A_p} = Z, β_{A_p} = β;  Z_{A₁} = 1_n (the first column of Z), β_{A₁} = μ₁

In traditional studies, p is fixed and n is large, or p/n is small.
In modern applications, both p and n are large, and in some cases p > n.
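The two design matrices in this example can be built explicitly. Below is a minimal NumPy sketch; the function name `design_matrices` and the parametrization β = (μ₁, μ₂ − μ₁, ..., μ_p − μ₁) are illustrative choices, not fixed by the lecture.

```python
import numpy as np

def design_matrices(p, r):
    """Design matrices for the 1-mean and p-mean models (Example 2).

    Rows are grouped: group i contributes r consecutive rows.  Under the
    (assumed) parametrization beta = (mu_1, mu_2 - mu_1, ..., mu_p - mu_1),
    the full design Z has first column 1_n and, for i >= 2, an indicator
    column for group i; Z_{A_1} is the first column alone.
    """
    n = p * r
    # group-membership indicators: an n x p block matrix via a Kronecker product
    G = np.kron(np.eye(p), np.ones((r, 1)))   # column i = indicator of group i
    Z = G.copy()
    Z[:, 0] = 1.0                             # first column becomes 1_n
    Z_A1 = np.ones((n, 1))                    # 1-mean model: intercept only
    Z_Ap = Z                                  # p-mean model: all p columns
    return Z_A1, Z_Ap
```

With this parametrization, Z_{A_p} β reproduces the group means: rows in group 1 pick up μ₁, and rows in group i ≥ 2 pick up μ₁ + (μ_i − μ₁) = μ_i.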
Methods for variable selection

Generalized information criterion (GIC)
Put a penalty on the dimension of the parameter: we minimize

    ‖X − Z_A β̂_A‖² + λ_n σ̂² dim(β_A)

over A to obtain a suitable Â, and then estimate β by β̂_Â.
σ̂² is a suitable estimator of the error variance σ².
The term ‖X − Z_A β̂_A‖² measures the goodness of fit of model A, whereas the term λ_n σ̂² dim(β_A) controls the "size" of A.
If λ_n = 2, this is the C_p method, and close to the AIC.
If λ_n = log n, this is close to the BIC.

Regularization or penalized optimization
Simultaneously select variables and estimate β by minimizing

    ‖X − Zβ‖² + p_λ(β),

where p_λ is a penalty function indexed by the penalty parameter λ ≥ 0, which may depend on n and the data.
Zero components of β are estimated as zeros and automatically eliminated.
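For small p, the GIC can be minimized by brute force over all 2^p − 1 nonempty column subsets. The sketch below is illustrative only (the helper name `gic_select` is mine; σ̂² is passed in, e.g., the residual variance estimate from the full model):

```python
import itertools
import numpy as np

def gic_select(X, Z, lam, sigma2_hat):
    """Exhaustive GIC: minimize ||X - Z_A beta_hat_A||^2 + lam * sigma2_hat * |A|
    over all nonempty subsets A of columns.  Feasible only for small p."""
    n, p = Z.shape
    best_A, best_val = None, np.inf
    for q in range(1, p + 1):
        for A in itertools.combinations(range(p), q):
            ZA = Z[:, A]
            beta_A, *_ = np.linalg.lstsq(ZA, X, rcond=None)   # LSE under model A
            rss = np.sum((X - ZA @ beta_A) ** 2)              # goodness of fit
            val = rss + lam * sigma2_hat * len(A)             # GIC value
            if val < best_val:
                best_A, best_val = set(A), val
    return best_A, best_val
```

With `lam = 2` this is the C_p criterion; with `lam = np.log(n)` it behaves like the BIC.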
Examples of penalty functions

Ridge regression: p_λ(β) = λ‖β‖²;
LASSO (least absolute shrinkage and selection operator): p_λ(β) = λ‖β‖₁ = λ ∑_{j=1}^p |β_j|, where β_j is the jth component of β;
Adaptive LASSO: p_λ(β) = λ ∑_{j=1}^p τ_j |β_j|, where the τ_j's are nonnegative leverage factors chosen adaptively so that large penalties are used for unimportant β_j's and small penalties for important ones;
Elastic net: p_λ(β) = λ₁‖β‖₁ + λ₂‖β‖²;
Minimax concave penalty: p_λ(β) = ∑_{j=1}^p (aλ − |β_j|)₊ / a for some a > 0;
SCAD (smoothly clipped absolute deviation): p_λ(β) = ∑_{j=1}^p λ{ I(|β_j| ≤ λ) + (aλ − |β_j|)₊ / [(a − 1)λ] I(|β_j| > λ) } for some a > 2;
There are also many modified versions of the previously listed methods.

Resampling methods
Cross validation, bootstrap

Thresholding
Compare |β̂_j| with a threshold (which may depend on n and the data) and eliminate estimates that are smaller than the threshold.
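To see why the ℓ₁ penalty performs selection while the ridge penalty does not, consider the special case Z^τZ = I_p, where both penalized problems decouple coordinatewise and have closed forms (a standard textbook fact; the function names below are mine):

```python
import numpy as np

def soft_threshold(b, lam):
    """LASSO solution per coordinate when Z^T Z = I:
    argmin ||X - Z beta||^2 + lam * ||beta||_1 has components
    sign(b_j) * max(|b_j| - lam/2, 0), where b = Z^T X is the LSE."""
    return np.sign(b) * np.maximum(np.abs(b) - lam / 2.0, 0.0)

def ridge_shrink(b, lam):
    """Ridge solution per coordinate when Z^T Z = I: b_j / (1 + lam)."""
    return b / (1.0 + lam)
```

For b = (3, 0.2, −1) and λ = 1, the LASSO returns (2.5, 0, −0.5), setting the small coefficient exactly to zero, whereas ridge returns (1.5, 0.1, −0.5): every component is shrunk but none becomes zero.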
Assessment of variable/model selection procedures

A = the set containing exactly the indices of nonzero components of β
Â = a set of variables (a model) selected based on a selection procedure
The selection procedure is selection consistent if lim_n P(Â = A) = 1.

Sometimes the following weaker version of consistency is desired.
Under model A, μ = E(X|Z) is estimated by μ̂_A = Z_A β̂_A.
We want to minimize the squared error loss

    L_n(A) = ‖μ − μ̂_A‖²/n

over A, which is equivalent to minimizing the average prediction error

    E[ ‖X* − μ̂_A‖² | X, Z ]/n

over A, where X* is a future independent copy of X.
The selection procedure is loss consistent if L_n(Â)/min_A L_n(A) →_p 1.
Consistency of the GIC

Let M denote a set of indices (a model). If A ⊂ M, then M is a correct model; otherwise, M is a wrong model.
The loss under model M is equal to

    L_n(M) = Δ_n(M) + ε^τ H_M ε/n,
    H_M = Z_M (Z_M^τ Z_M)^{-1} Z_M^τ,   Δ_n(M) = ‖μ − H_M μ‖²/n  (= 0 if M is correct)

Let Γ_{n,λ_n}(M) = n^{-1}[ ‖X − Z_M β̂_M‖² + λ_n σ̂² dim(β_M) ] be the quantity to be minimized.

    ‖X − Z_M β̂_M‖² = ‖X − H_M X‖² = ‖μ − H_M μ + ε − H_M ε‖²
                   = nΔ_n(M) + ‖ε‖² − ε^τ H_M ε + 2ε^τ(I − H_M)μ

When M is a wrong model,

    Γ_{n,λ_n}(M) − ‖ε‖²/n = Δ_n(M) − ε^τ H_M ε/n + 2ε^τ(I − H_M)μ/n + λ_n σ̂² dim(M)/n
                          = Δ_n(M) + O_P(dim(M)/n) + O_P(λ_n dim(M)/n) + O_P(√(Δ_n(M)/n)),

i.e., Γ_{n,λ_n}(M) = ‖ε‖²/n + L_n(M) + o_P(L_n(M)),
provided that liminf_n min_{M wrong} Δ_n(M) > 0 and λ_n p/n → 0.
The first condition implies that a wrong model is always worse than a correct one.
Among all wrong M, minimizing Γ_{n,λ_n}(M) is asymptotically the same as minimizing L_n(M).
Hence, the GIC is loss consistent when all models are wrong.
The GIC selects the best wrong model, i.e., the best approximation to a correct model in terms of Δ_n(M), the leading term in the loss L_n(M).

For correct models, however, Δ_n(M) = 0 and L_n(M) = ε^τ H_M ε/n.
Correct models are nested, A has the smallest dimension, and

    ε^τ H_A ε = min_{M correct} ε^τ H_M ε

For a correct model M,

    Γ_{n,λ_n}(M) = [ ‖ε‖² − ε^τ H_M ε + λ_n σ̂² dim(M) ]/n
                 = ‖ε‖²/n + L_n(M) + [ λ_n σ̂² dim(M) − 2ε^τ H_M ε ]/n
If λ, the domiatig term i Γ,λ M is λ σ 2 dimm /. Amog correct models, the GIC selects a model by miimizig dimm, i.e., it selects A. Combiig the results, we showed that the GIC is selectio cosistet. O the other had, if λ = 2 the C p method, AIC, the term 2 σ 2 dimm 2ετ H M ε is of the same order as L M = ε τ H M ε/ uless dimm for all but oe correct model. Uder some coditios, the GIC with λ = 2 is loss cosistet if ad oly if there does ot exist two correct models with fixed dimesios. Coclusio 1 The GIC with a bouded λ C p, AIC is loss cosistet whe there is at most oe fixed-dimesio correct model; otherwise it is icosistet. 2 The GIC with λ ad λp/ 0 BIC are selectio cosistet or loss cosistet. UW-Madiso Statistics Stat 709 Lecture 24 2018 8 / 15
Example 2. 1-mean vs p-mean (A₁ vs A_p; A_p is always correct)

p groups, each with r observations

    Δ_n(A₁) = ∑_{j=1}^p (μ_j − μ̄)²/p,   μ̄ = ∑_{j=1}^p μ_j/p

n = p_n r_n → ∞ means that either p_n → ∞ or r_n → ∞.

1. p_n = p is fixed and r_n → ∞
   The dimensions of correct models are fixed.
   The GIC with λ_n → ∞ and λ_n/n → 0 is selection consistent.
   The GIC with λ_n = 2 is inconsistent.
2. p_n → ∞ and r_n = r is fixed
   Only one correct model has a fixed dimension.
   The GIC with λ_n = 2 is loss consistent.
   The GIC with λ_n → ∞ is inconsistent, because λ_n p_n/n = λ_n/r ↛ 0.
3. p_n → ∞ and r_n → ∞
   Only one correct model has a fixed dimension.
   The GIC is selection consistent, provided that λ_n/r_n → 0.
More on the case where p_n → ∞ and r_n = r is fixed

σ̂² = S(A_p)/(n − p_n),  S(A) = ‖X − Z_A β̂_A‖².
It can be shown that

    L_n(A₁) = Δ_n(A₁) + ē²,   Δ = lim_p Δ_n(A₁) = lim_p (1/p) ∑_{j=1}^p (μ_j − μ̄)²,
    L_n(A_p) = (1/p) ∑_{i=1}^p ē_i² →_p σ²/r,

where the e_ij's are iid, E(e_ij) = 0, E(e_ij²) = σ², ē_i = r^{-1} ∑_{j=1}^r e_ij, and ē = p^{-1} ∑_{i=1}^p ē_i.
Then

    L_n(A₁)/L_n(A_p) →_p rΔ/σ²

The one-mean model is better if and only if rΔ < σ².
The wrong model may be better!
The GIC with λ_n → ∞ compares

    [ S(A₁) + λ_n σ̂² ]/n   and   [ S(A_p) + λ_n σ̂² p ]/n
Because

    S(A₁)/n = Δ_n(A₁) + (1/n) ∑_{i=1}^p ∑_{j=1}^r (e_ij − ē)² →_p Δ + σ²,
    S(A_p)/n = (1/n) ∑_{i=1}^p ∑_{j=1}^r (e_ij − ē_i)² →_p (r − 1)σ²/r,

and λ_n σ̂² p/n = λ_n σ̂²/r → ∞,

    P{ GIC with λ_n → ∞ selects A₁ } → 1

On the other hand, the C_p (GIC with λ_n = 2) is loss consistent, because the C_p compares

    [ S(A₁) + 2σ̂² ]/n →_p Δ + σ²   and   [ S(A_p) + 2σ̂² p ]/n = S(A_p)/n + 2σ̂²/r →_p σ² + σ²/r

Asymptotically, the C_p selects A₁ iff Δ < σ²/r, which is the same as saying the one-mean model is better.
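The comparison on this slide is easy to reproduce by simulation. The sketch below is my own illustration: it assumes σ̂² = S(A_p)/(n − p) and uses alternating group means so that Δ ≈ 1 > σ²/r = 0.5 (the p-mean model has smaller loss); C_p then favors A_p while λ_n = log n favors A₁.

```python
import numpy as np

def gic_choice(X_groups, lam):
    """Compare the 1-mean and p-mean models by GIC.
    X_groups: p x r array of observations; sigma2_hat = S(A_p)/(n - p)."""
    p, r = X_groups.shape
    n = p * r
    group_means = X_groups.mean(axis=1)
    grand_mean = X_groups.mean()
    S_A1 = np.sum((X_groups - grand_mean) ** 2)          # RSS of 1-mean model
    S_Ap = np.sum((X_groups - group_means[:, None]) ** 2)  # RSS of p-mean model
    sigma2_hat = S_Ap / (n - p)
    gic_A1 = S_A1 + lam * sigma2_hat * 1
    gic_Ap = S_Ap + lam * sigma2_hat * p
    return "A1" if gic_A1 <= gic_Ap else "Ap"

rng = np.random.default_rng(1)
p, r = 2000, 2
n = p * r
mu = np.where(np.arange(p) % 2 == 0, 0.0, 2.0)   # Delta ~ 1 > sigma^2/r = 0.5
X = mu[:, None] + rng.standard_normal((p, r))     # sigma^2 = 1
```

With this seed, `gic_choice(X, 2.0)` returns "Ap" (the model with smaller loss), while `gic_choice(X, np.log(n))` returns "A1": the large penalty overrides the loss comparison when r is fixed, matching the inconsistency claim.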
Variable selection by thresholding

Can we do variable selection using p-values? Or, can we simply select variables by using the values |β̂_j|, j = 1,...,p?
Here β̂_j is the jth component of β̂, the least squares estimator of β.
For simplicity, assume that X | Z ~ N(Zβ, σ²I_n). Then

    β̂_j − β_j = ∑_{i=1}^n l_ij ε_i | Z ~ N( 0, σ² ∑_{i=1}^n l_ij² )

where ε_i is the ith component of ε = X − Zβ, l_ij is the jth component of (Z^τ Z)^{-1} z_i, and z_i^τ is the ith row of Z.
Because

    1 − Φ(t) ≤ e^{−t²/2} / (√(2π) t),   t > 0,

where Φ is the standard normal cdf,

    P( |β̂_j − β_j| > t √(var(β̂_j | Z)) | Z ) ≤ (2/(√(2π) t)) e^{−t²/2},   t > 0

Let J_j be the p-vector whose jth component is 1 and other components are 0. By the Cauchy–Schwarz inequality,

    l_ij² = [ J_j^τ (Z^τ Z)^{-1} z_i ]² ≤ J_j^τ (Z^τ Z)^{-1} J_j · z_i^τ (Z^τ Z)^{-1} z_i
Hence

    ∑_{i=1}^n l_ij² ≤ c_j ∑_{i=1}^n z_i^τ (Z^τ Z)^{-1} z_i = p c_j ≤ p/η_n

where c_j is the jth diagonal element of (Z^τ Z)^{-1} and η_n is the smallest eigenvalue of Z^τ Z.
Thus, for any j,

    P( |β̂_j − β_j| > tσ √(p/η_n) | Z ) ≤ (2/(√(2π) t)) e^{−t²/2},   t > 0,

and letting t = a_n/(σ √(p/η_n)),

    P( |β̂_j − β_j| > a_n | Z ) ≤ C e^{−a_n² η_n/(2σ² p)}

for some constant C > 0, and

    P( max_{j=1,...,p} |β̂_j − β_j| > a_n | Z ) ≤ p C e^{−a_n² η_n/(2σ² p)}

Suppose that p/n → 0 and (p/η_n) log n → 0 (typically, η_n is of the order n).
Then we can choose a_n such that a_n → 0 and a_n² η_n/p ≥ 2 log n, so that
    P( max_{j=1,...,p} |β̂_j − β_j| > c a_n | Z ) = O(n^{−s})

for any c > 0 and some s ≥ 1; e.g., a_n = M √((p/η_n) log n) = O(n^{−α}) for some constants M > 0 and α ∈ (0, 1/2).

What can we conclude from this?
Let A = {j : β_j ≠ 0} and Â = {j : |β̂_j| > a_n}.
That is, Â contains the indices of the variables we select by thresholding |β̂_j| at a_n.

Selection consistency:

    P( Â ≠ A | Z ) ≤ P( |β̂_j| > a_n for some j ∉ A | Z ) + P( |β̂_j| ≤ a_n for some j ∈ A | Z )

Since β_j = 0 for j ∉ A, the first term on the right hand side is bounded by

    P( max_{j=1,...,p} |β̂_j − β_j| > a_n | Z ) = O(n^{−s})
On the other hand, if we assume that min_{j ∈ A} |β_j| ≥ c₀ a_n for some c₀ > 1, then

    P( |β̂_j| ≤ a_n for some j ∈ A | Z ) ≤ P( |β_j| − |β̂_j − β_j| ≤ a_n for some j ∈ A | Z )
                                        ≤ P( c₀ a_n − |β̂_j − β_j| ≤ a_n for some j ∈ A | Z )
                                        ≤ P( max_{j=1,...,p} |β̂_j − β_j| ≥ (c₀ − 1) a_n | Z )
                                        = O(n^{−s})

Hence, we have selection consistency; in fact, the convergence rate is O(n^{−s}).
We can also obtain similar results by thresholding |β̂_j| / √(∑_{i=1}^n l_ij²).

This approach may not work if p/n ↛ 0. If p > n, then Z^τ Z is not of full rank.
There exist several other approaches for the case where p > n; e.g., we may replace (Z^τ Z)^{-1} by some other matrix, or use ridge regression instead of the LSE.
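The thresholding rule above translates directly into code. A minimal sketch for the p/n → 0 regime follows; the constant M and the function name are my choices, and in practice one might instead threshold the studentized values mentioned above.

```python
import numpy as np

def threshold_select(X, Z, M=1.0):
    """Select A_hat = {j : |beta_hat_j| > a_n} with
    a_n = M * sqrt((p / eta_n) * log(n)), eta_n = smallest eigenvalue of Z^T Z.
    A sketch for the p/n -> 0 case; M is a user-chosen constant."""
    n, p = Z.shape
    G = Z.T @ Z
    eta_n = np.linalg.eigvalsh(G)[0]          # smallest eigenvalue of Z^T Z
    beta_hat = np.linalg.solve(G, Z.T @ X)    # least squares estimator
    a_n = M * np.sqrt((p / eta_n) * np.log(n))
    return {j for j in range(p) if abs(beta_hat[j]) > a_n}, a_n
```

When the nonzero coefficients are well separated from a_n (the min_{j∈A} |β_j| ≥ c₀a_n condition), the selected set recovers the true support with high probability.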