Lecture 24: Variable selection in linear models


Consider the linear model $X = Z\beta + \varepsilon$, $\beta \in \mathbb{R}^p$, $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$.

Like the LSE, the ridge regression estimator does not give a 0 estimate to a component of $\beta$ even if that component is 0. Variable or model selection refers to eliminating covariates (columns of $Z$) corresponding to zero components of $\beta$.

Example 1. Linear regression models
$A$ = a subset of $\{1,\dots,p\}$, the indices of nonzero components of $\beta$
The dimension of $A$ is $\dim(A) = q \le p$
$\beta_A$: the sub-vector of $\beta$ with indices in $A$
$Z_A$: the corresponding sub-matrix of $Z$
The number of models could be as large as $2^p$

Approximation to a response surface
The $i$th row of $Z_A$ is $(1, t_i, t_i^2, \dots, t_i^h)$, $t_i \in \mathbb{R}$, $A = \{1,\dots,h+1\}$: a polynomial of order $h$, $h = 0,1,\dots,p-1$
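As a quick illustration of the notation above (not part of the original slides), the sketch below builds the sub-matrix $Z_A$ for a given index set $A$ and the polynomial design matrix used in the response-surface example; the function names are mine.

```python
import numpy as np

def submodel_matrix(Z, A):
    """Sub-matrix Z_A: the columns of Z indexed by A (1-based, as on the slides)."""
    return Z[:, [j - 1 for j in sorted(A)]]

def polynomial_design(t, h):
    """Design matrix whose ith row is (1, t_i, t_i^2, ..., t_i^h)."""
    return np.vander(np.asarray(t), N=h + 1, increasing=True)

# Toy example: n = 50 points, full polynomial of order 4, submodel keeping (1, t, t^2).
rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=50)
Z = polynomial_design(t, h=4)            # 50 x 5
Z_A = submodel_matrix(Z, A={1, 2, 3})    # 50 x 3
print(Z.shape, Z_A.shape)
```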

Example 2. 1-mean vs p-mean

$n = p_n r_n$, $p = p_n$, $r = r_n$
There are $p$ groups, each with $r$ identically distributed observations
Select one model from two models
1-mean model: all groups have the same mean $\mu_1$
$p$-mean model: the $p$ groups have different means $\mu_1,\dots,\mu_p$
$A = A_1$ or $A_p$
\[
Z = \begin{pmatrix} 1_r & 0 & \cdots & 0\\ 1_r & 1_r & \cdots & 0\\ \vdots & & \ddots & \\ 1_r & 0 & \cdots & 1_r \end{pmatrix},
\qquad
\beta = \begin{pmatrix} \mu_1\\ \mu_2 - \mu_1\\ \vdots\\ \mu_p - \mu_1 \end{pmatrix}
\]
$Z_{A_p} = Z$, $\beta_{A_p} = \beta$; $Z_{A_1} = 1_n$ (the first column of $Z$), $\beta_{A_1} = \mu_1$

In traditional studies, $p$ is fixed and $n$ is large, or $p/n$ is small
In modern applications, both $p$ and $n$ are large, and in some cases $p > n$

Methods for variable selection

Generalized Information Criterion (GIC)
Put a penalty on the dimension of the parameter: we minimize
\[
\|X - Z_A \hat\beta_A\|^2 + \lambda_n \hat\sigma^2 \dim(\beta_A)
\]
over $A$, to obtain a suitable $\hat A$, and then estimate $\beta_{\hat A}$.
$\hat\sigma^2$ is a suitable estimator of the error variance $\sigma^2$.
The term $\|X - Z_A \hat\beta_A\|^2$ measures the goodness of fit of model $A$, whereas the term $\lambda_n \hat\sigma^2 \dim(\beta_A)$ controls the "size" of $A$.
If $\lambda_n = 2$, this is the $C_p$ method, and close to the AIC.
If $\lambda_n = \log n$, this is close to the BIC.

Regularization (or penalized optimization)
Simultaneously select variables and estimate $\beta$ by minimizing
\[
\|X - Z\beta\|^2 + p_\lambda(\beta),
\]
where $p_\lambda$ is a penalty function indexed by the penalty parameter $\lambda \ge 0$, which may depend on $n$ and the data. Zero components of $\beta$ are estimated as zeros and automatically eliminated.
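A minimal sketch of the GIC search (mine, not from the slides): it enumerates all non-empty index sets, which is only feasible for small $p$, and uses the full-model residuals for $\hat\sigma^2$, one common choice of "a suitable estimator".

```python
import numpy as np
from itertools import combinations

def gic_select(X, Z, lam):
    """Minimize ||X - Z_A beta_hat_A||^2 + lam * sigma2_hat * dim(A) over index sets A."""
    n, p = Z.shape
    resid_full = X - Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    sigma2_hat = resid_full @ resid_full / (n - p)   # full-model variance estimate
    best_A, best_val = None, np.inf
    for q in range(1, p + 1):
        for A in combinations(range(p), q):
            ZA = Z[:, list(A)]
            resid = X - ZA @ np.linalg.lstsq(ZA, X, rcond=None)[0]
            val = resid @ resid + lam * sigma2_hat * q
            if val < best_val:
                best_A, best_val = A, val
    return best_A

# lam = 2 gives the C_p / AIC-type rule; lam = log(n) gives the BIC-type rule.
```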

Examples of penalty functions

Ridge regression: $p_\lambda(\beta) = \lambda\|\beta\|^2$;
LASSO (least absolute shrinkage and selection operator): $p_\lambda(\beta) = \lambda\|\beta\|_1 = \lambda\sum_{j=1}^p |\beta_j|$, where $\beta_j$ is the $j$th component of $\beta$;
Adaptive LASSO: $p_\lambda(\beta) = \lambda\sum_{j=1}^p \tau_j|\beta_j|$, where the $\tau_j$'s are non-negative leverage factors chosen adaptively so that large penalties are used for unimportant $\beta_j$'s and small penalties for important ones;
Elastic net: $p_\lambda(\beta) = \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|^2$;
Minimax concave penalty (MCP): $p_\lambda(\beta) = \sum_{j=1}^p \int_0^{|\beta_j|}(a\lambda - t)_+/a\,dt$ for some $a > 0$;
SCAD (smoothly clipped absolute deviation): $p_\lambda(\beta) = \sum_{j=1}^p \int_0^{|\beta_j|}\lambda\big\{I(t \le \lambda) + \tfrac{(a\lambda - t)_+}{(a-1)\lambda}I(t > \lambda)\big\}\,dt$ for some $a > 2$;
There are also many modified versions of the previously listed methods.

Resampling methods
Cross-validation, bootstrap.

Thresholding
Compare $\hat\beta_j$ with a threshold (which may depend on $n$ and the data) and eliminate estimates that are smaller than the threshold.
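None of these penalized criteria has a closed-form minimizer in general (ridge excepted), and the slides do not prescribe an algorithm; as one hedged illustration, the sketch below minimizes the LASSO criterion by plain iterative soft-thresholding (ISTA), a standard choice.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(X, Z, lam, n_iter=500):
    """Minimize ||X - Z beta||^2 + lam * ||beta||_1 by iterative soft-thresholding."""
    n, p = Z.shape
    beta = np.zeros(p)
    step = 1.0 / (2 * np.linalg.norm(Z, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = -2 * Z.T @ (X - Z @ beta)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta   # zero components are eliminated automatically
```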

Assessment of variable/model selection procedures

$A_*$ = the set containing exactly the indices of the nonzero components of $\beta$
$\hat A$: a set of variables (a model) selected by a selection procedure
The selection procedure is selection consistent if $\lim_n P(\hat A = A_*) = 1$.

Sometimes the following weaker version of consistency is desired.
Under model $A$, $\mu = E(X\mid Z)$ is estimated by $\hat\mu_A = Z_A\hat\beta_A$.
We want to minimize the squared error loss
\[
L(A) = \frac{1}{n}\|\mu - \hat\mu_A\|^2
\]
over $A$, which is equivalent to minimizing the average prediction error
\[
\frac{1}{n}E\big[\|X^* - \hat\mu_A\|^2 \mid X, Z\big]
\]
over $A$, where $X^*$ is a future independent copy of $X$ (the two criteria differ only by the constant $\sigma^2$).
The selection procedure is loss consistent if $L(\hat A)/\min_A L(A) \to_p 1$.
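In simulations where $\mu = Z\beta$ is known, $L(A)$ can be computed directly, which is how loss consistency is usually checked numerically; a minimal sketch (names mine):

```python
import numpy as np

def loss(mu, X, Z, A):
    """Squared error loss L(A) = ||mu - Z_A beta_hat_A||^2 / n for candidate model A."""
    ZA = Z[:, list(A)]
    mu_hat = ZA @ np.linalg.lstsq(ZA, X, rcond=None)[0]
    return float(np.sum((mu - mu_hat) ** 2)) / len(X)

# A procedure returning A_hat is loss consistent if
# loss(mu, X, Z, A_hat) / min_A loss(mu, X, Z, A) -> 1 in probability.
```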

Consistency of the GIC

Let $M$ denote a set of indices (a model). If $A_* \subset M$, then $M$ is a correct model; otherwise, $M$ is a wrong model.
The loss under model $M$ is
\[
L(M) = \Delta_M + \varepsilon^\tau H_M\varepsilon/n,
\qquad
H_M = Z_M(Z_M^\tau Z_M)^{-1}Z_M^\tau,
\qquad
\Delta_M = \|\mu - H_M\mu\|^2/n \ \ (= 0 \text{ if } M \text{ is correct}).
\]
Let
\[
\Gamma_{n,\lambda}(M) = \frac{1}{n}\Big[\|X - Z_M\hat\beta_M\|^2 + \lambda_n\hat\sigma^2\dim(\beta_M)\Big]
\]
be the quantity to be minimized.
\[
\|X - Z_M\hat\beta_M\|^2 = \|X - H_M X\|^2 = \|(\mu - H_M\mu) + (\varepsilon - H_M\varepsilon)\|^2
= n\Delta_M + \|\varepsilon\|^2 - \varepsilon^\tau H_M\varepsilon + 2\varepsilon^\tau(I - H_M)\mu.
\]
When $M$ is a wrong model,
\[
\Gamma_{n,\lambda}(M)
= \frac{\|\varepsilon\|^2}{n} + \Delta_M - \frac{\varepsilon^\tau H_M\varepsilon}{n} + \frac{2\varepsilon^\tau(I - H_M)\mu}{n} + \frac{\lambda_n\hat\sigma^2\dim(M)}{n}
\]
\[
= \frac{\|\varepsilon\|^2}{n} + L(M) + O_P\!\Big(\frac{\dim(M)}{n}\Big) + O_P\!\Big(\sqrt{\frac{\Delta_M}{n}}\Big) + O_P\!\Big(\frac{\lambda_n\dim(M)}{n}\Big)
= \frac{\|\varepsilon\|^2}{n} + L(M) + o_P(L(M)),
\]

provided that
\[
\liminf_n \min_{M \text{ is wrong}} \Delta_M > 0
\qquad\text{and}\qquad
\lambda_n p/n \to 0.
\]
The first condition implies that a wrong model is always asymptotically worse than a correct one.
Among all wrong $M$, minimizing $\Gamma_{n,\lambda}(M)$ is asymptotically the same as minimizing $L(M)$.
Hence, the GIC is loss consistent when all models are wrong.
The GIC selects the best wrong model, i.e., the best approximation to a correct model in terms of $\Delta_M$, the leading term in the loss $L(M)$.

For correct models, however, $\Delta_M = 0$ and $L(M) = \varepsilon^\tau H_M\varepsilon/n$.
Correct models are nested, $A_*$ has the smallest dimension, and
\[
\varepsilon^\tau H_{A_*}\varepsilon = \min_{M \text{ is correct}}\varepsilon^\tau H_M\varepsilon,
\qquad
\Gamma_{n,\lambda}(M) = \frac{\|\varepsilon\|^2}{n} - \frac{\varepsilon^\tau H_M\varepsilon}{n} + \frac{\lambda_n\hat\sigma^2\dim(M)}{n}
= \frac{\|\varepsilon\|^2}{n} + L(M) + \frac{\lambda_n\hat\sigma^2\dim(M) - 2\varepsilon^\tau H_M\varepsilon}{n}.
\]

If $\lambda_n \to \infty$, the dominating term in $\Gamma_{n,\lambda}(M) - \|\varepsilon\|^2/n$ is $\lambda_n\hat\sigma^2\dim(M)/n$.
Among correct models, the GIC then selects a model by minimizing $\dim(M)$, i.e., it selects $A_*$.
Combining the results, we have shown that the GIC with $\lambda_n \to \infty$ and $\lambda_n p/n \to 0$ is selection consistent.

On the other hand, if $\lambda_n = 2$ (the $C_p$ method, AIC), the term $[2\hat\sigma^2\dim(M) - 2\varepsilon^\tau H_M\varepsilon]/n$ is of the same order as $L(M) = \varepsilon^\tau H_M\varepsilon/n$ unless $\dim(M) \to \infty$ for all but one correct model.
Under some conditions, the GIC with $\lambda_n = 2$ is loss consistent if and only if there do not exist two correct models with fixed dimensions.

Conclusion
1. The GIC with a bounded $\lambda_n$ ($C_p$, AIC) is loss consistent when there is at most one fixed-dimension correct model; otherwise it is inconsistent.
2. The GIC with $\lambda_n \to \infty$ and $\lambda_n p/n \to 0$ (e.g., BIC) is selection consistent (when a correct model exists) or loss consistent (when all models are wrong).

Example 2. 1-mean vs p-mean: $A_1$ vs $A_p$ ($A_p$ is always correct)

$p$ groups, each with $r$ observations
\[
\Delta_{A_1} = \frac{1}{p}\sum_{j=1}^p(\mu_j - \bar\mu)^2,
\qquad
\bar\mu = \frac{1}{p}\sum_{j=1}^p\mu_j
\]
$n = p_n r_n \to \infty$ means that either $p_n \to \infty$ or $r_n \to \infty$ (or both).

1. $p_n = p$ is fixed and $r_n \to \infty$
The dimensions of the correct models are fixed.
The GIC with $\lambda_n \to \infty$ and $\lambda_n/n \to 0$ is selection consistent.
The GIC with $\lambda_n = 2$ is inconsistent.

2. $p_n \to \infty$ and $r_n = r$ is fixed
Only one correct model has a fixed dimension.
The GIC with $\lambda_n = 2$ is loss consistent.
The GIC with $\lambda_n \to \infty$ is inconsistent, because $\lambda_n p_n/n = \lambda_n/r \not\to 0$.

3. $p_n \to \infty$ and $r_n \to \infty$
Only one correct model has a fixed dimension.
The GIC is selection consistent, provided that $\lambda_n \to \infty$ and $\lambda_n/r_n \to 0$.
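A hedged simulation of case 2 ($p_n \to \infty$, $r$ fixed) with purely illustrative constants: the group means are spread enough that $\Delta > \sigma^2/r$, so by the analysis on the next slides the $p$-mean model has the smaller loss; the $C_p$ rule should tend to pick it, while $\lambda_n = \log n$ tends to pick the one-mean model.

```python
import numpy as np

rng = np.random.default_rng(1)
p, r, sigma = 200, 5, 1.0                     # many groups, few observations per group
mu = rng.normal(0.0, 0.7, size=p)             # Delta ~ 0.5 > sigma^2 / r = 0.2
n = p * r

def gic_picks_one_mean(lam):
    X = np.repeat(mu, r) + rng.normal(0.0, sigma, size=n)
    group_means = X.reshape(p, r).mean(axis=1)
    S_A1 = np.sum((X - X.mean()) ** 2)                    # RSS, 1-mean model
    S_Ap = np.sum((X - np.repeat(group_means, r)) ** 2)   # RSS, p-mean model
    sigma2_hat = S_Ap / (n - p)                           # full-model variance estimate
    return S_A1 + lam * sigma2_hat * 1 < S_Ap + lam * sigma2_hat * p

for lam, name in [(2.0, "C_p (lam = 2)"), (np.log(n), "lam = log n")]:
    freq = np.mean([gic_picks_one_mean(lam) for _ in range(200)])
    print(f"{name}: selects the 1-mean model in {freq:.0%} of 200 runs")
```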

More on the case where $p_n \to \infty$ and $r_n = r$ is fixed

$\hat\sigma^2 = S(A_p)/(n - p)$, where $S(A) = \|X - Z_A\hat\beta_A\|^2$.
It can be shown that
\[
L(A_1) = \Delta_{A_1} + \bar e^2,
\qquad
L(A_p) = \frac{1}{p}\sum_{i=1}^p\bar e_i^2 \to_p \frac{\sigma^2}{r},
\qquad
\Delta = \lim_{p\to\infty}\frac{1}{p}\sum_{j=1}^p\Big(\mu_j - \frac{1}{p}\sum_{i=1}^p\mu_i\Big)^2,
\]
where the $e_{ij}$'s are i.i.d., $Ee_{ij} = 0$, $Ee_{ij}^2 = \sigma^2$, $\bar e_i = r^{-1}\sum_{j=1}^r e_{ij}$, and $\bar e = p^{-1}\sum_{i=1}^p\bar e_i$.
Then
\[
L(A_1) - L(A_p) \to_p \Delta - \frac{\sigma^2}{r}.
\]
The one-mean model is better if and only if $r\Delta < \sigma^2$. The wrong model may be better!
The GIC with $\lambda_n \to \infty$ minimizes, over $\{A_1, A_p\}$,
\[
S(A_1) + \lambda_n\hat\sigma^2
\qquad\text{and}\qquad
S(A_p) + \lambda_n p\,\hat\sigma^2 = S(A_p) + \frac{\lambda_n}{r-1}S(A_p).
\]

Because
\[
\frac{S(A_1)}{n} = \Delta_{A_1} + \frac{1}{n}\sum_{i=1}^p\sum_{j=1}^r(e_{ij} - \bar e)^2 + o_P(1) \to_p \Delta + \sigma^2,
\qquad
\frac{S(A_p)}{n} = \frac{1}{n}\sum_{i=1}^p\sum_{j=1}^r(e_{ij} - \bar e_i)^2 \to_p \frac{(r-1)\sigma^2}{r},
\]
and $\lambda_n/(r-1) \to \infty$ when $r$ is fixed, the criterion for $A_p$ diverges while that for $A_1$ stays bounded in probability, so
\[
P\{\text{GIC with }\lambda_n \to \infty \text{ selects } A_1\} \to 1.
\]
On the other hand, the $C_p$ (GIC with $\lambda_n = 2$) is loss consistent, because the $C_p$ minimizes
\[
\frac{1}{n}\big[S(A_1) + 2\hat\sigma^2\big] \to_p \Delta + \sigma^2
\qquad\text{and}\qquad
\frac{1}{n}\Big[S(A_p) + \frac{2}{r-1}S(A_p)\Big] \to_p \sigma^2 + \frac{\sigma^2}{r}.
\]
Asymptotically, the $C_p$ selects $A_1$ if and only if $\Delta + \sigma^2 < \sigma^2 + \sigma^2/r$, i.e., $\Delta < \sigma^2/r$, which is the same as "the one-mean model is better".

Variable selection by thresholding

Can we do variable selection using p-values? Or, can we simply select variables by using the values $|\hat\beta_j|$, $j = 1,\dots,p$? Here $\hat\beta_j$ is the $j$th component of $\hat\beta$, the least squares estimator of $\beta$.
For simplicity, assume that $X \mid Z \sim N(Z\beta, \sigma^2 I_n)$. Then
\[
\hat\beta_j - \beta_j = \sum_{i=1}^n l_{ij}\varepsilon_i \ \Big|\ Z \ \sim\ N\Big(0,\ \sigma^2\sum_{i=1}^n l_{ij}^2\Big),
\]
where $\varepsilon_i$ is the $i$th component of $\varepsilon = X - Z\beta$ and $l_{ij}$ is the $j$th component of $(Z^\tau Z)^{-1}z_i$, with $z_i^\tau$ the $i$th row of $Z$.
Because
\[
1 - \Phi(t) \le \frac{e^{-t^2/2}}{\sqrt{2\pi}\,t}, \qquad t > 0,
\]
where $\Phi$ is the standard normal cdf,
\[
P\Big(|\hat\beta_j - \beta_j| > t\sqrt{\mathrm{Var}(\hat\beta_j \mid Z)}\ \Big|\ Z\Big) \le \frac{2e^{-t^2/2}}{\sqrt{2\pi}\,t}, \qquad t > 0.
\]
Let $J_j$ be the $p$-vector whose $j$th component is 1 and whose other components are 0. By the Cauchy–Schwarz inequality,
\[
l_{ij}^2 = \big[J_j^\tau(Z^\tau Z)^{-1}z_i\big]^2 \le J_j^\tau(Z^\tau Z)^{-1}J_j \cdot z_i^\tau(Z^\tau Z)^{-1}z_i.
\]

Hence
\[
\sum_{i=1}^n l_{ij}^2 \le c_j\sum_{i=1}^n z_i^\tau(Z^\tau Z)^{-1}z_i = p\,c_j \le p/\eta_n,
\]
where $c_j$ is the $j$th diagonal element of $(Z^\tau Z)^{-1}$ and $\eta_n$ is the smallest eigenvalue of $Z^\tau Z$.
Thus, for any $j$,
\[
P\Big(|\hat\beta_j - \beta_j| > t\sigma\sqrt{p/\eta_n}\ \Big|\ Z\Big) \le \frac{2e^{-t^2/2}}{\sqrt{2\pi}\,t}, \qquad t > 0,
\]
and letting $t = a_n/(\sigma\sqrt{p/\eta_n})$,
\[
P\big(|\hat\beta_j - \beta_j| > a_n \mid Z\big) \le C e^{-a_n^2\eta_n/(2\sigma^2 p)}
\]
for some constant $C > 0$, so that
\[
P\Big(\max_{j=1,\dots,p}|\hat\beta_j - \beta_j| > a_n\ \Big|\ Z\Big) \le pC e^{-a_n^2\eta_n/(2\sigma^2 p)}.
\]
Suppose that $p/n \to 0$ and $(p/\eta_n)\log n \to 0$ (typically, $\eta_n$ is of the order $n$). Then we can choose $a_n$ such that $a_n \to 0$ and $a_n^2\eta_n/(2\sigma^2 p) \ge (s+1)\log n$ such that

\[
P\Big(\max_{j=1,\dots,p}|\hat\beta_j - \beta_j| > c\,a_n\ \Big|\ Z\Big) = O(n^{-s})
\]
for any $c > 0$ and some $s \ge 1$; e.g.,
\[
a_n = M\sqrt{(p/\eta_n)\log n} = O(n^{-\alpha})
\]
for some constants $M > 0$ and $\alpha \in (0, 1/2)$.

What can we conclude from this?
Let $A_* = \{j : \beta_j \ne 0\}$ and $\hat A = \{j : |\hat\beta_j| > a_n\}$.
That is, $\hat A$ contains the indices of the variables we select by thresholding $|\hat\beta_j|$ at $a_n$.
Selection consistency:
\[
P(\hat A \ne A_* \mid Z) \le P\big(|\hat\beta_j| > a_n \text{ for some } j \notin A_* \mid Z\big) + P\big(|\hat\beta_j| \le a_n \text{ for some } j \in A_* \mid Z\big).
\]
Since $\beta_j = 0$ for $j \notin A_*$, the first term on the right-hand side is bounded by
\[
P\Big(\max_{j=1,\dots,p}|\hat\beta_j - \beta_j| > a_n\ \Big|\ Z\Big) = O(n^{-s}).
\]

On the other hand, if we assume that $\min_{j\in A_*}|\beta_j| \ge c_0 a_n$ for some $c_0 > 1$, then
\[
P\big(|\hat\beta_j| \le a_n \text{ for some } j \in A_* \mid Z\big)
\le P\big(|\beta_j| - |\hat\beta_j - \beta_j| \le a_n \text{ for some } j \in A_* \mid Z\big)
\]
\[
\le P\big(c_0 a_n - |\hat\beta_j - \beta_j| \le a_n \text{ for some } j \in A_* \mid Z\big)
\le P\Big(\max_{j=1,\dots,p}|\hat\beta_j - \beta_j| \ge (c_0 - 1)a_n\ \Big|\ Z\Big) = O(n^{-s}).
\]
Hence, we have selection consistency; in fact, the convergence rate is $O(n^{-s})$.
We can also obtain similar results by thresholding $|\hat\beta_j|/\sqrt{\sum_{i=1}^n l_{ij}^2}$.
This approach may not work if $p/n$ does not converge to 0. If $p > n$, then $Z^\tau Z$ is not of full rank.
There exist several other approaches for the case where $p > n$; e.g., we can replace $(Z^\tau Z)^{-1}$ by some other matrix, or use ridge regression instead of the LSE.
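To close, a minimal numerical sketch of the thresholding rule (for the $p/n \to 0$ regime only; the constant $M$ and all names are mine): compute the LSE, set $a_n = M\sqrt{(p/\eta_n)\log n}$, and keep the coefficients exceeding $a_n$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 2000, 10
beta = np.zeros(p)
beta[:3] = [1.5, -1.0, 0.8]                      # A* = {1, 2, 3} in 1-based indexing
Z = rng.normal(size=(n, p))
X = Z @ beta + rng.normal(size=n)                # sigma = 1

beta_hat = np.linalg.lstsq(Z, X, rcond=None)[0]  # least squares estimator
eta_n = np.linalg.eigvalsh(Z.T @ Z).min()        # smallest eigenvalue of Z^tau Z
a_n = 2.0 * np.sqrt(p / eta_n * np.log(n))       # threshold; the constant M = 2 is arbitrary
A_hat = [j + 1 for j in range(p) if abs(beta_hat[j]) > a_n]
print(f"a_n = {a_n:.3f}, selected variables: {A_hat}")
```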