Nonparametric Estimation of Regression Functions in the Presence of Irrelevant Regressors

Peter Hall, Qi Li, Jeff Racine

1 Introduction

- Nonparametric techniques are robust to functional form specification.
- The curse of dimensionality and the number of discrete cells; the conventional frequency estimation method.
- Treatment effects: Hahn (1998), Hirano, Imbens and Ridder (2003).
- Smoothing the discrete variables: Aitchison and Aitken (1976).
- Bandwidth selection: data-driven methods.
- Irrelevant covariates (regressors) should be removed, but pre-testing procedures are neither powerful nor efficient.
- The data-driven cross-validation method delivers automatic removal of irrelevant variables.

2 Regression Models with Irrelevant Regressors

Let $X_i^d$ be a $q \times 1$ vector of discrete variables and $X_i^c \in \mathbb{R}^p$ the continuous regressors. $X_{is}^d$, the $s$th component of $X_i^d$, takes $c_s \ge 2$ different values. Write $X_i = (X_i^d, X_i^c)$. We want to estimate $E(Y_i \mid X_i)$.

The first $p_1$ ($1 \le p_1 \le p$) components of $X^c$ and the first $q_1$ ($0 \le q_1 \le q$) components of $X^d$ are the relevant regressors. Let $\bar X$ denote the relevant components of $X$ and let $\tilde X = X \setminus \{\bar X\}$ denote the irrelevant components of $X$. We assume that

$$(\bar X, Y) \text{ is independent of } \tilde X. \qquad (2.1)$$

Condition (2.1) implies that

$$E(Y \mid X) = E(Y \mid \bar X), \qquad (2.2)$$

so that $Y = g(X) + \text{error}$ becomes

$$Y_i = \bar g(\bar X_i) + u_i, \quad E(u_i \mid \bar X_i) = 0. \qquad (2.3)$$

For $1 \le s \le q$, define

$$l(X_{is}^d, x_s^d, \lambda_s) = \begin{cases} 1 & \text{if } X_{is}^d = x_s^d, \\ \lambda_s & \text{if } X_{is}^d \ne x_s^d. \end{cases} \qquad (2.4)$$

The product kernel for $x^d = (x_1^d, \dots, x_q^d)$ is given by

$$K^d(x^d, X_i^d) = \prod_{s=1}^{q} l(X_{is}^d, x_s^d, \lambda_s), \quad 0 \le \lambda_s \le 1.$$

When $\lambda_s = 1$, $K^d(x^d, X_i^d)$ is unrelated to $(x_s^d, X_{is}^d)$: $x_s^d$ is smoothed out.

For the continuous variables $x^c = (x_1^c, \dots, x_p^c)$, the product kernel is

$$K^c(x^c, X_i^c) = \prod_{s=1}^{p} \frac{1}{h_s} K\!\left(\frac{x_s^c - X_{is}^c}{h_s}\right),$$

where $K$ is a symmetric univariate density function and $0 < h_s < \infty$. Note that when $h_s = \infty$, $x_s^c$ is smoothed out.

Let $K(x, X_i) = K^c(x^c, X_i^c)\, K^d(x^d, X_i^d)$. We estimate $E(Y \mid X = x)$ by

$$\hat g(x) = \frac{n^{-1}\sum_{i=1}^{n} Y_i K(x, X_i)}{n^{-1}\sum_{i=1}^{n} K(x, X_i)}. \qquad (2.5)$$

We choose $(h, \lambda) = (h_1, \dots, h_p, \lambda_1, \dots, \lambda_q)$ by minimizing the cross-validation function

$$CV(h, \lambda) = \frac{1}{n}\sum_{i=1}^{n} \big(Y_i - \hat g_{-i}(X_i)\big)^2 w(X_i), \qquad (2.6)$$

where $\hat g_{-i}(X_i) = \sum_{j \ne i} Y_j K(X_i, X_j) \big/ \sum_{j \ne i} K(X_i, X_j)$ is the leave-one-out kernel estimator of $E(Y_i \mid X_i)$, and $w(\cdot)$ is a weight function.

Define $\bar\sigma^2(\bar x) = E(u_i^2 \mid \bar X_i = \bar x)$ and let $S^c$ denote the support of $w$. We also assume that

$$g,\ f \text{ and } \sigma^2 \text{ have two continuous derivatives; } w \text{ is continuous, nonnegative and has compact support; } f \text{ is bounded away from zero for } x = (x^c, x^d) \in S. \qquad (2.7)$$

We impose the following conditions on the bandwidths and kernel function. Define

$$H = \Big(\prod_{s=1}^{p_1} h_s\Big) \prod_{s=p_1+1}^{p} \min(h_s, 1). \qquad (2.8)$$

Letting $0 < \epsilon < 1/(p+4)$ and for some constant $c > 0$,

$$n^{\epsilon - 1} \le H \le n^{-\epsilon}; \quad n^{-c} < h_s < n^{c} \text{ for all } s = 1, \dots, p; \quad K \text{ is a symmetric, compactly supported, Hölder-continuous probability density; and } K(0) > K(\delta) \text{ for all } \delta > 0. \qquad (2.9)$$

The above conditions basically ask that each $h_s$ does not converge to zero, or to infinity, too fast, and that $n h_1 \cdots h_{p_1} \to \infty$. We expect that, as $n \to \infty$, the smoothing parameters associated with the relevant regressors will converge to zero, while those associated with the irrelevant regressors will not. It would be convenient to further assume that $h_s \to 0$ for $s = 1, \dots, p_1$ and $\lambda_s \to 0$ for $s = 1, \dots, q_1$; however, that would amount to assuming the relevant components are known a priori.
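To make the mechanics concrete, here is a minimal Python sketch (ours, not the authors' code) of the mixed-kernel estimator (2.5) and the leave-one-out criterion (2.6). The Gaussian kernel, the unit weight function $w(\cdot) \equiv 1$, and the toy data are illustrative assumptions; the theory above assumes a compactly supported $K$.

```python
import numpy as np

def mixed_kernel(xc, xd, Xc, Xd, h, lam):
    """Product kernel K(x, X_i) of (2.4)-(2.5) at one evaluation point x.

    xc (p,), xd (q,): evaluation point; Xc (n, p), Xd (n, q): sample;
    h (p,): bandwidths; lam (q,): discrete smoothing parameters in [0, 1].
    """
    u = (xc - Xc) / h
    Kc = np.prod(np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h), axis=1)
    Kd = np.prod(np.where(Xd == xd, 1.0, lam), axis=1)   # kernel (2.4)
    return Kc * Kd

def g_hat(xc, xd, Xc, Xd, Y, h, lam):
    """Nadaraya-Watson estimate of E(Y | X = x), eq. (2.5)."""
    K = mixed_kernel(xc, xd, Xc, Xd, h, lam)
    return K @ Y / K.sum()

def cv(h, lam, Xc, Xd, Y):
    """Leave-one-out cross-validation CV(h, lambda), eq. (2.6), w(.) = 1."""
    n = len(Y)
    out = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        out += (Y[i] - g_hat(Xc[i], Xd[i], Xc[keep], Xd[keep],
                             Y[keep], h, lam))**2
    return out / n

# Toy check: the second continuous regressor and the discrete regressor are
# irrelevant, so CV typically prefers a large h2 and lambda near 1.
rng = np.random.default_rng(0)
Xc = rng.normal(size=(200, 2)); Xd = rng.integers(0, 2, size=(200, 1))
Y = np.sin(2 * Xc[:, 0]) + 0.1 * rng.normal(size=200)
print(cv(np.array([0.3, 0.3]), np.array([0.1]), Xc, Xd, Y))
print(cv(np.array([0.3, 50.0]), np.array([1.0]), Xc, Xd, Y))  # typically smaller
```

In practice one would minimize `cv` over $(h, \lambda)$ with a numerical optimizer, clipping each $\lambda_s$ to $[0, 1]$.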

Therefore, we shall assume the following condition holds. Define $\mu_g(\bar x) = E[\hat g(x)\hat f(x)]/E[\hat f(x)]$. (Note that $\mu_g(\bar x)$ depends neither on $\tilde x$ nor on $(h_{p_1+1}, \dots, h_p, \lambda_{q_1+1}, \dots, \lambda_q)$, because the components of the numerator related to the irrelevant variables cancel with those in the denominator.) We assume that

$$\int_{\operatorname{supp} w} [\mu_g(\bar x) - \bar g(\bar x)]^2\, \bar w(\bar x)\, \bar f(\bar x)\, d\bar x, \text{ as a function of } h_1, \dots, h_{p_1} \text{ and } \lambda_1, \dots, \lambda_{q_1}, \text{ vanishes if and only if all of these smoothing parameters vanish,} \qquad (2.10)$$

where $\bar w$ is the weight function defined in (2.16) below.

Define the indicator function

$$I_s(v^d, x^d) = I(v_s^d \ne x_s^d) \prod_{t=1,\, t \ne s}^{q} I(v_t^d = x_t^d). \qquad (2.11)$$

Note that $I_s(v^d, x^d) = 1$ if and only if $v^d$ and $x^d$ differ in their $s$th component only.

The leading term of CV is

$$\frac{\kappa^{p_1}}{n h_1 \cdots h_{p_1}} \int \bar\sigma^2(\bar x)\, R(x)\, w(x)\, \tilde f(\tilde x)\, dx \qquad (2.12)$$

$$+ \int \Bigg( \sum_{s=1}^{q_1} \lambda_s \Big[ \sum_{v^d} I_s(v^d, x^d)\, \big\{\bar g(\bar x^c, v^d) - \bar g(\bar x)\big\}\, \bar f(\bar x^c, v^d) \Big] + \frac{1}{2}\kappa_2 \sum_{s=1}^{p_1} h_s^2\, \big\{\bar g_{ss}(\bar x)\bar f(\bar x) + 2 \bar f_s(\bar x)\bar g_s(\bar x)\big\} \Bigg)^2 \bar f(\bar x)^{-1}\, \bar w(\bar x)\, d\bar x, \qquad (2.13)$$

with $\kappa = \int K(v)^2\, dv$ and $\kappa_2 = \int K(v) v^2\, dv$, and where $R(x) = R(x, h_{p_1+1}, \dots, h_p, \lambda_{q_1+1}, \dots, \lambda_q)$ is given by

$$R(x) = \frac{\nu_2(x)}{\nu_1(x)^2}, \qquad (2.14)$$

where, for $j = 1, 2$,

$$\nu_j(x) = E\Bigg[\Bigg(\prod_{s=p_1+1}^{p} h_s^{-1} K\Big(\frac{X_{is}^c - x_s^c}{h_s}\Big) \prod_{s=q_1+1}^{q} \lambda_s^{I(X_{is}^d \ne x_s^d)}\Bigg)^{\!j}\,\Bigg], \qquad (2.15)$$

and

$$\bar w(\bar x) = \int \tilde f(\tilde x)\, w(x)\, d\tilde x, \qquad (2.16)$$

where, again, we define $\int \cdot\, d\tilde x = \sum_{\tilde x^d} \int \cdot\, d\tilde x^c$.

Only the irrelevant components $\tilde x$ of $x$ appear in $R$. By Hölder's inequality, $R(x) \ge 1$ for all choices of $x$, $h_{p_1+1}, \dots, h_p$, and $\lambda_{q_1+1}, \dots, \lambda_q$. Also, $R \to 1$ as $h_s \to \infty$ ($p_1+1 \le s \le p$) and $\lambda_s \to 1$ ($q_1+1 \le s \le q$). One needs to select $h_s$ ($s = p_1+1, \dots, p$) and $\lambda_s$ ($s = q_1+1, \dots, q$) to minimize $R$. The only smoothing parameter values for which $R(x, h_{p_1+1}, \dots, h_p, \lambda_{q_1+1}, \dots, \lambda_q) = 1$ are $h_s = \infty$ for $p_1+1 \le s \le p$ and $\lambda_s = 1$ for $q_1+1 \le s \le q$.

Define

$$Z_n(x) = \prod_{s=p_1+1}^{p} K\Big(\frac{x_s^c - X_{is}^c}{h_s}\Big) \prod_{s=q_1+1}^{q} \lambda_s^{I(X_{is}^d \ne x_s^d)}.$$

If at least one $h_s$ is finite (for $p_1+1 \le s \le p$), or one $\lambda_s < 1$ (for $q_1+1 \le s \le q$), then since $K(0) > K(\delta)$ for all $\delta > 0$, we know that $\operatorname{Var}(Z_n) = E[Z_n^2] - [E(Z_n)]^2 > 0$, so that $R = E[Z_n^2]/[E(Z_n)]^2 > 1$.
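The inequality is easy to check by simulation. The following sketch assumes one irrelevant continuous regressor, one irrelevant binary regressor, and a Gaussian $K$ (all our illustrative choices); the simulated $R$ stays above 1 and approaches 1 only as $h$ grows and $\lambda \to 1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
Xc = rng.normal(size=n)           # irrelevant continuous regressor
Xd = rng.integers(0, 2, size=n)   # irrelevant binary regressor
x_c, x_d = 0.0, 0                 # evaluation point

def K(u):
    """Gaussian kernel, used for convenience."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def R(h, lam):
    """Monte Carlo estimate of R = E[Z_n^2] / (E[Z_n])^2."""
    Z = K((x_c - Xc) / h) * np.where(Xd == x_d, 1.0, lam)
    return np.mean(Z**2) / np.mean(Z)**2

for h, lam in [(0.5, 0.2), (2.0, 0.5), (10.0, 0.9), (100.0, 1.0)]:
    print(f"h = {h:6.1f}  lam = {lam:.1f}  R = {R(h, lam):.4f}")
# R decreases toward 1 as h grows and lam -> 1, matching the argument above.
```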

Only when all $h_s = \infty$ and all $\lambda_s = 1$ do we have $Z_n \equiv K(0)^{p - p_1}$ and $\operatorname{Var}(Z_n) = 0$; $R = 1$ only in this case. The smoothing parameters related to the irrelevant regressors must therefore all converge to their upper extremities, so that $R(x) \to 1$ as $n \to \infty$ for all $x \in S$.

Replacing $R(x)$ by 1, the first term (2.12) becomes

$$\frac{\kappa^{p_1}}{n h_1 \cdots h_{p_1}} \int \bar\sigma^2(\bar x)\, \bar w(\bar x)\, d\bar x. \qquad (2.17)$$

Let $a_s = h_s n^{1/(p_1+4)}$ and $b_s = \lambda_s n^{2/(p_1+4)}$. Then (2.12)-(2.13) (with (2.17) as its first term, since $R(x) \to 1$) becomes $n^{-4/(p_1+4)} \chi(a_1, \dots, a_{p_1}, b_1, \dots, b_{q_1})$, where

$$\chi(a, b) = \frac{\kappa^{p_1}}{a_1 \cdots a_{p_1}} \int \bar\sigma^2(\bar x)\, \bar w(\bar x)\, d\bar x + \int \Bigg( \sum_{s=1}^{q_1} b_s \sum_{v^d} I_s(v^d, x^d)\, \big(\bar g(\bar x^c, v^d) - \bar g(\bar x)\big)\, \bar f(\bar x^c, v^d) + \frac{1}{2}\kappa_2 \sum_{s=1}^{p_1} a_s^2\, \big(\bar g_{ss}(\bar x)\bar f(\bar x) + 2 \bar f_s(\bar x)\bar g_s(\bar x)\big) \Bigg)^2 \bar f(\bar x)^{-1}\, \bar w(\bar x)\, d\bar x. \qquad (2.18)$$

Let $a_1^0, \dots, a_{p_1}^0, b_1^0, \dots, b_{q_1}^0$ denote the values of $a_1, \dots, a_{p_1}, b_1, \dots, b_{q_1}$ that minimize $\chi$ subject to each of them being nonnegative. We require that

each $a_s^0$ is positive and each $b_s^0$ nonnegative, and all are finite and uniquely defined. (2.19)

Theorem 2.1. Assume conditions (2.7), (2.9), (2.10), and (2.19) hold, and let $\hat h_1, \dots, \hat h_p, \hat\lambda_1, \dots, \hat\lambda_q$ denote the smoothing parameters that minimize CV.

Then

$$n^{1/(p_1+4)}\, \hat h_s \to a_s^0 \text{ in probability, for } 1 \le s \le p_1;$$
$$P(\hat h_s > C) \to 1 \text{ for } p_1+1 \le s \le p \text{ and for all } C > 0;$$
$$n^{2/(p_1+4)}\, \hat\lambda_s \to b_s^0 \text{ in probability, for } 1 \le s \le q_1;$$
$$\hat\lambda_s \to 1 \text{ in probability, for } q_1+1 \le s \le q; \qquad (2.20)$$

and $n^{4/(p_1+4)}\, CV(\hat h_1, \dots, \hat h_p, \hat\lambda_1, \dots, \hat\lambda_q) \to \inf \chi$ in probability.

We also have an asymptotic normality result for $\hat g(x)$.

Theorem 2.2. Under the same conditions as in Theorem 2.1, for $x = (x^c, x^d) \in S = S^c \times S^d$,

$$\big(n \hat h_1 \cdots \hat h_{p_1}\big)^{1/2} \Big[\hat g(x) - \bar g(\bar x) - \sum_{s=1}^{q_1} B_{1s}(\bar x)\hat\lambda_s - \sum_{s=1}^{p_1} B_{2s}(\bar x)\hat h_s^2\Big] \overset{d}{\to} N\big(0, \Omega(\bar x)\big), \qquad (2.21)$$

where

$$B_{1s}(\bar x) = \sum_{v^d} I_s(v^d, x^d)\, \big(\bar g(\bar x^c, v^d) - \bar g(\bar x)\big)\, \bar f(\bar x^c, v^d)\, \bar f(\bar x)^{-1},$$
$$B_{2s}(\bar x) = \frac{1}{2}\kappa_2 \Big(\bar g_{ss}(\bar x) + 2\, \bar f_s(\bar x)\bar g_s(\bar x)\big/\bar f(\bar x)\Big),$$
$$\Omega(\bar x) = \kappa^{p_1}\, \bar\sigma^2(\bar x)\big/\bar f(\bar x)$$

are the asymptotic bias and variance terms, respectively.

The Presence of Ordered Discrete Regressors

If $x_s^d$ takes $c_s$ different values $\{0, 1, \dots, c_s - 1\}$ and is ordered, we suggest the use of

$$K^d(x_s^d, v_s^d) = \lambda_s^{|x_s^d - v_s^d|}, \quad \lambda_s \in [0, 1]. \qquad (2.22)$$

Aitchison and Aitken (1976, p. 29) suggested using

$$K^d(x_s^d, v_s^d, \lambda_s) = \binom{c_s}{t} \lambda_s^t (1 - \lambda_s)^{c_s - t} \quad \text{when } |x_s^d - v_s^d| = t \ (0 \le t \le c_s),$$

where $\binom{c_s}{t} = c_s!/[t!(c_s - t)!]$. This kernel cannot smooth out irrelevant regressors.
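The contrast between the two kernels is easy to see numerically. In this sketch, $c_s = 4$ and the evaluation points are our illustrative choices: the geometric kernel (2.22) becomes flat at $\lambda_s = 1$, while no value of $\lambda_s$ flattens the Aitchison-Aitken weights.

```python
from math import comb

def ordered_kernel(x, v, lam):
    """Geometric kernel (2.22): lam**|x - v|; lam = 1 gives uniform weights."""
    return lam ** abs(x - v)

def aitchison_aitken_ordered(x, v, lam, c):
    """AA (1976, p. 29) kernel: C(c, t) lam^t (1 - lam)^(c - t), t = |x - v|."""
    t = abs(x - v)
    return comb(c, t) * lam**t * (1.0 - lam) ** (c - t)

c = 4
for lam in (0.1, 0.5, 1.0):
    w_geo = [round(ordered_kernel(0, v, lam), 3) for v in range(c)]
    print(f"lam = {lam}: geometric weights {w_geo}")
# At lam = 1 the geometric weights are all 1: the regressor is smoothed out.

for lam in (0.1, 0.5):
    w_aa = [round(aitchison_aitken_ordered(0, v, lam, c), 3) for v in range(c)]
    print(f"lam = {lam}: AA weights {w_aa}")
# No lam makes the AA weights constant in |x - v|, so an irrelevant ordered
# regressor can never be smoothed out with the AA kernel.
```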

Cross-Validation and Estimation of Conditional Densities

Peter Hall, Jeff Racine, Qi Li

Let $f(x, y)$ denote the joint pdf of $(X, Y)$, $m(x)$ the marginal pdf of $X$, and $g(y \mid x) = f(x, y)/m(x)$. We estimate $g(y \mid x) = f(x, y)/m(x)$ by

$$\hat g(y \mid x) = \frac{\hat f(x, y)}{\hat m(x)}. \qquad (2.23)$$

Using the notation $\int \cdot\, dz = \sum_{z^d} \int \cdot\, dz^c$, a weighted integrated squared difference between $\hat g(\cdot)$ and $g(\cdot)$ is

$$I_n = \int [\hat g(y \mid x) - g(y \mid x)]^2\, m(x)\, dz = \int [\hat g(y \mid x)]^2\, m(x)\, dz - 2 \int \hat g(y \mid x)\, g(y \mid x)\, m(x)\, dz + \int [g(y \mid x)]^2\, m(x)\, dz \equiv I_{1n} - 2 I_{2n} + I_{3n}. \qquad (2.24)$$

$I_{3n}$ is independent of $(h, \lambda)$. Define

$$\hat G(x) = \int [\hat f(x, y)]^2\, dy = n^{-2} \sum_i \sum_j K_{X_i,x} K_{X_j,x} \int K_{Y_i,y} K_{Y_j,y}\, dy = n^{-2} \sum_i \sum_j K_{X_i,x} K_{X_j,x} K^{(2)}_{Y_i,Y_j}, \qquad (2.25)$$

where $K^{(2)}_{Y_i,Y_j} = \int K_{Y_i,y} K_{Y_j,y}\, dy = \sum_{y^d} \int K_{Y_i,y} K_{Y_j,y}\, dy^c$ is the second-order convolution kernel, and $K_{Y_i,y} = W_{Y_i,y} L_{Y_i,y}$.

Using eq. (2.25), we have

$$I_{1n} = \int [\hat g(y \mid x)]^2\, m(x)\, dz = \int \frac{\int [\hat f(x, y)]^2\, dy}{[\hat m(x)]^2}\, m(x)\, dx = \int \frac{\hat G(x)}{[\hat m(x)]^2}\, m(x)\, dx = E_X\Big[\frac{\hat G(X)}{[\hat m(X)]^2}\Big], \qquad (2.26)$$

where $E_X(\cdot)$ denotes the expectation with respect to $X$ only. Also,

$$I_{2n} = \int \hat g(y \mid x)\, g(y \mid x)\, m(x)\, dz = \int \hat g(y \mid x)\, f(x, y)\, dx\, dy = \int \frac{\hat f(x, y)}{\hat m(x)}\, f(x, y)\, dx\, dy = E_Z\Big[\frac{\hat f(Z)}{\hat m(X)}\Big], \qquad (2.27)$$

where $E_Z$ denotes the expectation with respect to $Z = (X, Y)$ only.

Choosing $m(x)$ as the weighting function gives $I_{1n}$ and $I_{2n}$ these simple forms:

$$I_{1n} - 2 I_{2n} = E_X\Big\{\frac{\hat G(X)}{[\hat m(X)]^2}\Big\} - 2 E_Z\Big[\frac{\hat f(X, Y)}{\hat m(X)}\Big]. \qquad (2.28)$$

Use the leave-one-out estimators for $\hat f(X_l, Y_l)$ and $\hat m(X_l)$, and also use a leave-one-out estimator for $G(X_l)$:

$$\hat G_{-l}(X_l) = n^{-2} \sum_{i \ne l} \sum_{j \ne l} K_{X_i,X_l} K_{X_j,X_l} K^{(2)}_{Y_i,Y_j}. \qquad (2.29)$$

$$CV(h, \lambda) \stackrel{\text{def}}{=} \frac{1}{n} \sum_{l=1}^{n} \frac{\hat G_{-l}(X_l)}{[\hat m_{-l}(X_l)]^2} - \frac{2}{n} \sum_{l=1}^{n} \frac{\hat f_{-l}(X_l, Y_l)}{\hat m_{-l}(X_l)}. \qquad (2.30)$$

We will choose $(\lambda, h)$ to minimize $CV(h, \lambda)$.
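Here is a minimal sketch of the objective (2.30), assuming a single continuous $X$ and continuous $Y$ with Gaussian kernels, so the convolution kernel $K^{(2)}_{Y_i,Y_j}$ has the closed form of the $N(0, 2h_y^2)$ density evaluated at $Y_i - Y_j$. The discrete-kernel factors $L$ are omitted, and the normalizations follow the text ($n^{-2}$ in (2.29), sample means for the leave-one-out $\hat m$ and $\hat f$).

```python
import numpy as np

def gauss(u, h):
    """Gaussian kernel with bandwidth h, evaluated at u."""
    return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def cv_conditional_density(hx, hy, X, Y):
    """CV(h) of eq. (2.30) for scalar continuous X and Y, Gaussian kernels."""
    n = len(X)
    KX = gauss(X[:, None] - X[None, :], hx)                  # K_{X_i, X_j}
    KY = gauss(Y[:, None] - Y[None, :], hy)                  # K_{Y_i, Y_j}
    K2 = gauss(Y[:, None] - Y[None, :], np.sqrt(2.0) * hy)   # K^(2)_{Y_i, Y_j}
    cv = 0.0
    for l in range(n):
        keep = np.arange(n) != l
        m_l = KX[keep, l].mean()                     # leave-one-out m-hat(X_l)
        f_l = (KX[keep, l] * KY[keep, l]).mean()     # leave-one-out f-hat(X_l, Y_l)
        kx = KX[keep, l]
        G_l = (np.outer(kx, kx) * K2[np.ix_(keep, keep)]).sum() / n**2  # (2.29)
        cv += G_l / m_l**2 - 2.0 * f_l / m_l
    return cv / n
```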

Theorem 2.3. Assume some regularity conditions, and let $\hat h_0, \hat h_1, \dots, \hat h_p, \hat\lambda_1, \dots, \hat\lambda_q$ denote the smoothing parameters that minimize CV. Then

$$n^{1/(p_1+5)}\, \hat h_s \to a_s^0 \text{ in probability, for } 0 \le s \le p_1;$$
$$P(\hat h_s > C) \to 1 \text{ for } p_1+1 \le s \le p \text{ and for all } C > 0;$$
$$n^{2/(p_1+5)}\, \hat\lambda_s \to b_s^0 \text{ in probability, for } 1 \le s \le q_1;$$
$$\hat\lambda_s \to 1 \text{ in probability, for } q_1+1 \le s \le q; \qquad (2.31)$$

and $n^{4/(p_1+5)}\, CV(\hat h_0, \hat h_1, \dots, \hat h_p, \hat\lambda_1, \dots, \hat\lambda_q) \to \inf \chi_p(\cdot)$ in probability.

There is also an asymptotic normality result.

Theorem 2.4. Under some regularity conditions, we have

$$\big(n \hat h^p\big)^{1/2} \big(\hat g(y \mid x) - g(y \mid x) - \hat h^2 B_1(z) - \hat\lambda B_2(z)\big) \to N\big(0, \Omega(z)\big)$$

in distribution, where

$$B_1(z) = \frac{1}{2\, m(x)}\, \operatorname{tr}[\nabla^2 f(z)] \int w(v)\, v^2\, dv,$$
$$B_2(z) = \frac{1}{m(x)} \sum_{\tilde z^d:\, \|\tilde z^d - z^d\| = 1} [f(z^c, \tilde z^d) - f(z^c, z^d)],$$
$$\Omega(z) = \frac{f(z)}{m^2(x)} \int W^2(v)\, dv,$$

and $\nabla^2$ is taken with respect to $z^c$.

In the conditional probability paper, we define the kernel function for discrete variables, for $1 \le s \le q$, via

$$l(X_{is}^d, x_s^d, \lambda_s) = \begin{cases} 1 - \lambda_s & \text{if } X_{is}^d = x_s^d, \\ \dfrac{\lambda_s}{c_s - 1} & \text{if } X_{is}^d \ne x_s^d, \end{cases} \quad \lambda_s \in [0, (c_s - 1)/c_s]. \qquad (2.32)$$

If $x_s^d \in \{0, 1\}$ is a dummy variable with $c_s = 2$, then

$$l(X_{is}^d, x_s^d, \lambda_s) = \begin{cases} 1 - \lambda_s & \text{if } X_{is}^d = x_s^d, \\ \lambda_s & \text{if } X_{is}^d \ne x_s^d, \end{cases} \quad \lambda_s \in [0, 1/2]. \qquad (2.33)$$

The reason for this normalization is that the kernel weights add up to one: $\sum_{x^d \in S^d} L(x^d, X_i^d, \lambda) = 1$. This is needed in unconditional density estimation; it is not needed in regression estimation, or for the conditioning covariates.

The relationship between the two $\lambda$'s: earlier, the relative (off-diagonal to diagonal) weight was $\lambda/1 = \lambda$; now the relative weight is $\lambda/[(c - 1)(1 - \lambda)]$. For example, if $c = 2$ and $\lambda = 0.2$ in the second expression, then the equivalent $\lambda$ in the first expression is $\lambda = 0.2/[(1)(1 - 0.2)] = 0.2/0.8 = 0.25$.
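The conversion in the example above is a one-liner; the function name is ours.

```python
def lambda_regression_equivalent(lam_density, c):
    """Map lambda from the density kernel (2.32) to the regression kernel (2.4)
    by matching relative off-diagonal weights: lam / [(c - 1)(1 - lam)]."""
    return lam_density / ((c - 1) * (1.0 - lam_density))

# The worked example from the text: c = 2, lambda = 0.2 -> 0.25.
print(lambda_regression_equivalent(0.2, 2))  # 0.25
```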

3 Applications

Two empirical applications, judged by out-of-sample predictive performance.

3.1 Count Survival Data: Veteran's Administration Lung Cancer Trial

The data set is from Kalbfleisch and Prentice (1980, pp. 223-224). We model survival in days of cancer patients with six discrete explanatory variables: treatment type, cell type, Karnofsky score, months from diagnosis, age in years, and prior therapy. The dataset contains 137 observations, and the number of discrete cells greatly exceeds the number of observations. The goal is to model expected survival in days.

The estimation sample is $n_1 = 132$ and the prediction sample is $n_2 = 5$. The out-of-sample MSE is $n_2^{-1} \sum_{i=1}^{n_2} (y_i - \hat y_i)^2$. We randomly split the sample into two subsamples of sizes $n_1$ and $n_2$ 100 times.

Table 1: Veteran's Lung Cancer Data

Model     Median MSE   Mean MSE    Mean MSE_NP / Mean MSE
NONP       9,770.5     17,614.9    100%
OLS       13,831.7     28,422.6     62%
POISSON   15,319.8     29,052.9     60%
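The evaluation protocol used in both applications (100 random splits, out-of-sample MSE) can be sketched as follows; `fit_predict` is a placeholder for whichever estimator is being compared (NONP, OLS, POISSON, ...).

```python
import numpy as np

def split_mse(fit_predict, X, y, n2, n_reps=100, seed=0):
    """Median and mean out-of-sample MSE over repeated random splits.

    fit_predict(X_train, y_train, X_test) must return predictions y_hat.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    mses = []
    for _ in range(n_reps):
        perm = rng.permutation(n)
        test, train = perm[:n2], perm[n2:]        # prediction / estimation split
        y_hat = fit_predict(X[train], y[train], X[test])
        mses.append(np.mean((y[test] - y_hat) ** 2))
    mses = np.asarray(mses)
    return np.median(mses), mses.mean()
```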

Median values of the CV smoothing parameters over the 100 sample splits:

Table 2: Summary of Median Smoothing Parameters [number of categories in brackets]

λ1 [2]   λ2 [4]   λ3 [12]   λ4 [28]   λ5 [40]   λ6 [2]
0.934    0.056    0.004     0.157     1.000     1.000

The relevant explanatory variables for predicting survival are cancer cell type (λ2), the Karnofsky score (λ3), and months from diagnosis (λ4). The remaining regressors are smoothed out by cross-validation.

(The Karnofsky score measures patient performance of activities of daily living. The score has proven useful not only for following the course of the illness, usually progressive deficit and ultimately death, but also as a prognosticator: patients with the highest (best) Karnofsky scores at the time of tumor diagnosis have the best survival and quality of life over the course of their illness.)

3.2 Count Data: Fair's 1977 Extramarital Affairs Data

We consider Fair's (1978) dataset, which models how often an individual engaged in extramarital affairs during the past year. This often-cited paper, with a complete description of the data, is available at http://fairmodel.econ.yale.edu/rayfair/pdf/1978a200.pdf. The dataset contains 601 observations and eight discrete explanatory variables: sex, age, number of years married, whether the respondent has children, how religious they are, level of education, occupation, and how they rate their marriage.

The estimation sample is $n_1 = 500$ and the prediction sample is $n_2 = 101$. The goal is to model the expected number of affairs. We split the sample into two subsamples of sizes $n_1$ and $n_2$ to compute the out-of-sample squared prediction errors, and repeat this 100 times. Median and mean out-of-sample squared prediction errors by estimation method:

Table 3: Fair's 1977 Extramarital Affairs Data

Model     Median MSE   Mean MSE   Mean MSE_NP / Mean MSE
NONP       9.97        10.18      100%
OLS       11.90        12.39       82%
PROBIT    14.66        14.74       69%
TOBIT     12.81        13.04       78%
POISSON   12.88        12.99       78%

The nonparametric MSPE is 69% to 82% of that of the various parametric methods.

Median values of the CV smoothing parameters over the 100 sample splits:

Table 4: Summary of Median Smoothing Parameters [number of categories in brackets]

λ1 [2]   λ2 [9]   λ3 [8]   λ4 [2]   λ5 [5]   λ6 [7]   λ7 [7]   λ8 [5]
0.416    0.064    1.000    0.998    0.507    0.999    0.282    0.004

The most relevant explanatory variables are age (λ2) and how a person rates their marriage (λ8). Number of years married (λ3), having children or not (λ4), and level of education (λ6) appear to be irrelevant for predicting the number of affairs, while the remaining variables appear to have some relevance, again from a cross-validation perspective.

In comparison, using a Tobit model, Fair obtained a t-statistic of 3.63 for the number of years married, while our nonparametric method removes this variable from the resulting estimate. The out-of-sample prediction results suggest that the Tobit model is likely misspecified.