Nonparametric and Semiparametric Estimation Whitney K. Newey Fall 2007


Introduction

Functional form misspecification error is important in elementary econometrics. Flexible functional forms, e.g. the translog
$$y = \beta_1 + \beta_2 \ln(x) + \beta_3 [\ln(x)]^2,$$
are fine for simple nonlinearity, e.g. diminishing returns. Economic theory does not restrict the form. Nonparametric methods allow for complete flexibility. Good for graphs. Good for complete flexibility with a few dimensions.

An Empirical Example

An example illustrates. Deaton (1989): the effect of rice prices on the distribution of incomes in Thailand. Let $p$ be the price of rice, $q$ the amount purchased, and $y$ the amount sold. The change in benefits from $dp$ is $dB = (q-y)\,dp = p(q-y)\,d\ln(p)$. In elasticity form, $dB/[x\,d\ln(p)] = w - py/x$, where $w$ is the budget share of rice purchases and $x$ is total expenditure. The benefit/expenditure measure is the negative of the right-hand side.

Empirical Distribution Function

A simple nonparametric estimation problem. The CDF of $z_i$ is $F(z) = \Pr(z_i \le z)$. Let $z_1,\dots,z_n$ be i.i.d. data and $1(A)$ the indicator of the event $A$, so $F(z) = E[1(z_i \le z)]$. Then
$$\hat F(z) = \frac{\#\{i : z_i \le z\}}{n} = \frac{1}{n}\sum_{i=1}^n 1(z_i \le z)$$
is the empirical CDF. It puts probability weight $1/n$ on each observation. It is consistent and asymptotically normal, and nonparametrically efficient. It is no good for density estimation.
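As a minimal sketch (not part of the original notes), the empirical CDF can be computed directly from its definition; the sample and the evaluation grid below are made up for illustration.

```python
# Empirical CDF: F_hat(t) = (1/n) * sum_i 1(z_i <= t), evaluated on a grid.
import numpy as np

def empirical_cdf(z, grid):
    z = np.asarray(z)
    return np.array([(z <= t).mean() for t in grid])

z = np.random.normal(size=200)          # hypothetical i.i.d. sample
grid = np.linspace(-3, 3, 7)
print(empirical_cdf(z, grid))           # consistent, asymptotically normal estimates of F(t)
```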

Kernel Density Estimator

Add a little continuous noise to smooth out the empirical CDF. Let $\tilde z$ have the empirical CDF (i.e. $\tilde z$ equals each observation $z_i$ with probability $1/n$), let $U$ be a continuous random variable with pdf $K(u)$, independent of $\tilde z$, and let $h$ be a positive scalar. Define
$$z_h = \tilde z + hU,$$
the empirical CDF plus noise. The kernel density estimator is the density of $z_h$.

Derivation: Let $F_U(u) = \int_{-\infty}^{u} K(t)\,dt$ be the CDF of $U$. By iterated expectations and $1(z_h \le z) = 1(U \le (z-\tilde z)/h)$,
$$F_h(z) = \Pr(z_h \le z) = E[1(z_h \le z)] = E\big[E[1(z_h \le z)\mid \tilde z]\big] = E\Big[F_U\Big(\frac{z-\tilde z}{h}\Big)\Big] = \sum_{i=1}^n F_U\Big(\frac{z-z_i}{h}\Big)/n.$$
Differentiating gives the pdf
$$\hat f_h(z) = dF_h(z)/dz = \sum_{i=1}^n K_h(z-z_i)/n, \qquad K_h(u) = \frac{1}{h}K(u/h).$$
This is a kernel density estimator. The function $K(u)$ is the kernel and the scalar $h$ is the bandwidth. The bandwidth controls the amount of smoothing. As $h$ increases the density gets smoother, but there is more noise from $U$, i.e. more bias. As $h \to 0$ we get a rough density, with spikes at the data points, but the bias shrinks. Choosing $h$ is important in practice; see below. $\hat f_h(z)$ will be consistent if $h \to 0$ and $nh \to \infty$.

Examples:
Gaussian kernel: $K(u) = (2\pi)^{-1/2}e^{-u^2/2}$.
Epanechnikov kernel: $K(u) = 1(|u| \le 1)(1-u^2)(3/4)$.
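A minimal sketch (not part of the original notes) of the estimator $\hat f_h(z) = \sum_i K_h(z-z_i)/n$ with the two kernels just listed; the data and bandwidth are made up.

```python
# Kernel density estimator: average of K((z - z_i)/h)/h over the sample.
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def epanechnikov_kernel(u):
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def kde(points, data, h, kernel=gaussian_kernel):
    points = np.asarray(points, dtype=float)[:, None]    # evaluation points as a column
    u = (points - np.asarray(data)[None, :]) / h
    return kernel(u).mean(axis=1) / h

z = np.random.normal(size=500)
print(kde([-1.0, 0.0, 1.0], z, h=0.3))                   # rough estimates of the N(0,1) density
print(kde([-1.0, 0.0, 1.0], z, h=0.3, kernel=epanechnikov_kernel))
```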

The choice of $K$ does not matter as much as the choice of $h$. The Epanechnikov kernel has slightly smaller mean square error, and so is optimal.

Bias and Variance of Kernel Estimators

$(z_1,\dots,z_n)$ are i.i.d. with pdf $f_0(z)$.

Bias: the expectation of the kernel estimator is
$$E[\hat f_h(z)] = \int K_h(z-t)f_0(t)\,dt = \int \frac{1}{h}K\Big(\frac{z-t}{h}\Big)f_0(t)\,dt = \int K(u)f_0(z-hu)\,du,$$
by the change of variables $u = (z-t)/h$. Taylor expand $f_0(z-hu)$ around $h=0$:
$$f_0(z-hu) = f_0(z) - f_0'(z)hu + \Gamma(h,u,z)h^2, \qquad \Gamma(h,u,z) = f_0''(z+\bar h(z,u)u)u^2/2,$$
where $|\bar h(z,u)| \le h$. For $\int K(u)u^2\,du < \infty$, $\int K(u)u\,du = 0$, and $f_0''(z)$ continuous and bounded,
$$\int K(u)\Gamma(h,u,z)\,du \to \Big[\int K(u)u^2\,du\Big]f_0''(z)/2 \quad \text{as } h \to 0.$$
Then, writing $o(h^2)$ for any $a(h)$ with $\lim_{h\to 0}a(h)/h^2 = 0$, multiplying the expansion by $K(u)$ and integrating gives
$$h^2\int K(u)\Gamma(h,u,z)\,du = h^2\Big[\int K(u)u^2\,du\Big]f_0''(z)/2 + o(h^2),$$
$$E[\hat f_h(z)] = f_0(z) + h^2 f_0''(z)\int K(u)u^2\,du/2 + o(h^2).$$
We can summarize these calculations in the following result:

Proposition 1: If $f_0(z)$ is twice continuously differentiable with bounded second derivative, $\int K(u)\,du = 1$, $\int K(u)u\,du = 0$, and $\int u^2K(u)\,du < \infty$, then
$$E[\hat f_h(z)] - f_0(z) = h^2 f_0''(z)\int K(u)u^2\,du/2 + o(h^2).$$

Variance: From Proposition 1, $E[K_h(z-z_i)] = E[\hat f_h(z)]$ is bounded as $h \to 0$. Let $O(1/n)$ denote a sequence $(a_n)_{n=1}^\infty$ such that $na_n$ is bounded. Then, since $\hat f_h(z)$ is a sample average of $K_h(z-z_i)$, for $h \to 0$,
$$\mathrm{Var}(\hat f_h(z)) = \big\{E[K_h(z-z_i)^2] - \{E[K_h(z-z_i)]\}^2\big\}/n = \int \frac{1}{h^2}K\Big(\frac{z-t}{h}\Big)^2 f_0(t)\,dt/n + O(1/n) = \int K(u)^2 f_0(z-hu)\,du/(nh) + O(1/n).$$
For $f_0(z)$ continuous and bounded and $\int K(u)^2\,du < \infty$,
$$\int K(u)^2 f_0(z-hu)\,du \to f_0(z)\int K(u)^2\,du.$$
By $h \to 0$ it follows that $nh\,O(1/n) \to 0$, so that $O(1/n) = o(1/(nh))$. Plugging into the variance formula above we find
$$\mathrm{Var}(\hat f_h(z)) = f_0(z)\int K(u)^2\,du/(nh) + o(1/(nh)).$$
We can summarize these calculations in the following result:

Proposition 2: If $f_0(z)$ is continuous and bounded, $\int K(u)^2\,du < \infty$, $h \to 0$, and $nh \to \infty$, then
$$\mathrm{Var}[\hat f_h(z)] = f_0(z)\int K(u)^2\,du/(nh) + o(1/(nh)).$$

Consistency and Convergence Rate of Kernel Estimators

Consistency is implied by $h \to 0$ (the bias goes to zero) and $nh \to \infty$ (the variance goes to zero): the bandwidth shrinks to zero more slowly than $1/n$. Intuition for $h \to 0$: the smoothing noise must go away asymptotically to remove all bias. Intuition for $nh \to \infty$: for the Epanechnikov kernel, $K((z-z_i)/h) > 0$ if and only if $|z-z_i| < h$. If $h$ shrinks as fast as or faster than $1/n$, the number of observations with $|z-z_i| < h$ will not grow, so we are averaging over a finite number of observations and hence the variance does not go to zero.

Explicit form for the mean squared error (MSE) under $h \to 0$, $nh \to \infty$:
$$\mathrm{MSE}(\hat f_h(z)) = \mathrm{Var}(\hat f_h(z)) + \mathrm{Bias}^2(\hat f_h(z)) = f_0(z)\int K(u)^2\,du/(nh) + h^4\Big\{f_0''(z)\int K(u)u^2\,du/2\Big\}^2 + o(h^4 + 1/(nh)).$$
By $h \to 0$, the MSE vanishes more slowly than $1/n$. Thus the kernel estimator converges more slowly than $n^{-1/2}$. Avoiding bias by $h \to 0$ means the fraction of the observations used goes to zero.

Bandwidth Choice for Density Estimation:

Graphical: choose one that looks good, report several.

Minimize asymptotic integrated MSE. Integrating over $z$,
$$\int \mathrm{MSE}(\hat f_h(z))\,dz = \frac{C_1}{nh} + \int f_0''(z)^2\,dz\,\frac{C_2}{4}h^4 + o(h^4 + 1/(nh)), \qquad C_1 = \int K(u)^2\,du, \quad C_2 = \Big[\int K(u)u^2\,du\Big]^2.$$
Minimizing over $h$ has the first-order condition
$$0 = -n^{-1}h^{-2}C_1 + h^3 C_2\int f_0''(z)^2\,dz.$$
Solving gives
$$h^* = \Big[C_1\Big/\Big\{nC_2\int f_0''(z)^2\,dz\Big\}\Big]^{1/5},$$
the asymptotically optimal bandwidth. It could be estimated by plugging in an estimator of $f_0''(z)$. This procedure depends on an initial bandwidth, but the final estimator is not as sensitive to the choice of bandwidth for $f_0''(z)$ as to the choice of bandwidth for $f_0(z)$.

Silverman's rule of thumb: the optimal bandwidth when $f_0(z)$ is Gaussian and $K(u)$ is a standard normal pdf,
$$h^* = 1.06\,\sigma n^{-1/5}, \qquad \sigma = \mathrm{Var}(z_i)^{1/2}.$$
Use an estimator of the standard deviation $\sigma$.
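A minimal sketch (not part of the original notes) of the rule of thumb, using the sample standard deviation as the estimator of $\sigma$; the data are simulated.

```python
# Silverman's rule of thumb: h = 1.06 * sigma_hat * n^(-1/5).
import numpy as np

def silverman_bandwidth(z):
    z = np.asarray(z)
    return 1.06 * z.std(ddof=1) * z.size ** (-1 / 5)

z = np.random.normal(size=1000)
print(silverman_bandwidth(z))
```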

Estimate directly the integrated MSE:
$$\int \mathrm{MSE}(\hat f_h(z))\,dz = \int E[\{\hat f_h(z) - f_0(z)\}^2]\,dz = E\Big[\int \{\hat f_h(z)-f_0(z)\}^2\,dz\Big] = E\Big[\int \hat f_h(z)^2\,dz\Big] - 2E\Big[\int \hat f_h(z)f_0(z)\,dz\Big] + \int f_0(z)^2\,dz.$$
An unbiased estimator of $E[\int \hat f_h(z)^2\,dz]$ is
$$\int \hat f_h(z)^2\,dz = \sum_{i,j}\int K_h(z_i - t)K_h(t - z_j)\,dt\big/n^2.$$
To find an unbiased estimator of the second term, note that
$$E\Big[\int \hat f_h(z)f_0(z)\,dz\Big] = \int\!\!\int K_h(z-t)f_0(z)f_0(t)\,dz\,dt.$$
Because the observations are independent, we can average $K_h(z_i - z_j)$ over pairs $i \ne j$ to estimate this term. The last term in the MSE does not depend on $h$, so we can drop it. Combining the estimates of the first two terms gives the criterion
$$\widehat{CV}(h) = \int \hat f_h(z)^2\,dz - \frac{2}{n(n-1)}\sum_{i\ne j}K_h(z_i - z_j).$$
Choosing $h$ by minimizing $\widehat{CV}(h)$ is called cross-validation. The motivation for the terminology is that the second term is
$$-2\sum_{i=1}^n \hat f_{-i,h}(z_i)/n, \qquad \hat f_{-i,h}(z) = \sum_{j\ne i}K_h(z-z_j)/(n-1).$$
Here $\hat f_{-i,h}(z)$ is the estimator that uses all observations but the $i$th, so that $\hat f_{-i,h}(z_i)$ is cross-validated.
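A minimal sketch (not part of the original notes) of this criterion for a Gaussian kernel, where $\int\hat f_h(z)^2dz$ has a closed form because the convolution of two $N(0,h^2)$ densities is the $N(0,2h^2)$ density; the data and the bandwidth grid are made up.

```python
# Least-squares cross-validation for the density bandwidth:
# CV(h) = int f_hat_h^2 dz - 2/(n(n-1)) * sum_{i != j} K_h(z_i - z_j).
import numpy as np

def gauss_pdf(x, var):
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

def cv_criterion(h, z):
    z = np.asarray(z)
    n = z.size
    d = z[:, None] - z[None, :]                       # all pairwise differences
    term1 = gauss_pdf(d, 2 * h**2).sum() / n**2       # closed form of int f_hat_h^2 dz
    off_diag = gauss_pdf(d, h**2).sum() - n * gauss_pdf(0.0, h**2)
    term2 = 2 * off_diag / (n * (n - 1))              # leave-one-out term
    return term1 - term2

z = np.random.normal(size=300)
grid = np.linspace(0.05, 1.0, 20)
print(min(grid, key=lambda h: cv_criterion(h, z)))    # cross-validated bandwidth
```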

Multivariate Density Estimation:

Multivariate density estimation can be important, as in the example. Let $z$ be $r \times 1$, let $K(u)$ denote a pdf for an $r \times 1$ random vector, e.g. $K(u) = \prod_{j=1}^r k(u_j)$ for a univariate pdf $k(u)$, and let $\hat\Sigma$ be the sample covariance matrix of $z_i$. For a bandwidth $h$ let
$$K_h(u) = h^{-r}\det(\hat\Sigma)^{-1/2}K(\hat\Sigma^{-1/2}u/h),$$
where $A^{-1/2}$ is the inverse square root of a positive definite matrix $A$. Often $K(u) = \rho(u'u)$ for some $\rho$, so $K_h(u) = h^{-r}\det(\hat\Sigma)^{-1/2}\rho(u'\hat\Sigma^{-1}u/h^2)$. The multivariate kernel estimator, with scale normalization, is
$$\hat f_h(z) = \sum_{i=1}^n K_h(z-z_i)/n.$$
Gaussian kernel: $K(u) = (2\pi)^{-r/2}\exp(-u'u/2)$.
Epanechnikov kernel: $K(u) = C_r(1-u'u)1(u'u \le 1)$.

The Curse of Dimensionality for Kernel Estimation

It is difficult to nonparametrically estimate pdfs of high-dimensional $z_i$: one needs many observations within a small distance of a point. As the dimension rises with the distance fixed, the proportion of observations that are close shrinks very rapidly. Mathematically, in one dimension $[0,1]$ can be covered with $1/h$ balls of radius $h$, while it requires $(1/h)^r$ balls to cover $[0,1]^r$. Thus, if data are equally likely to fall in each ball, there tend to be many fewer data points in any one ball for high-dimensional data.

Silverman (1986, Density Estimation) has a famous table: for a multivariate normal density at zero, the sample size required for the MSE to be 10 percent of the density is

r:  1   2   3    4    5     6       7        8        9         10
n:  4  19  67  223  768  2,790  10,700  43,700  187,000  842,000

The curse of dimensionality shows up in the convergence rate. Expanding again,
$$f_0(z-hu) = f_0(z) - h[\partial f_0(z)/\partial z]'u + h^2 u'[\partial^2 f_0(z+\bar h(z,u)u)/\partial z\partial z']u/2,$$
so the bias is asymptotically $C_3h^2$ no matter how big $r$ is. For the variance, setting $u = (z-t)/h$,
$$\int K_h(z-t)^2f_0(t)\,dt = h^{-2r}\int K((z-t)/h)^2f_0(t)\,dt = h^{-r}\int K(u)^2f_0(z-hu)\,du.$$
Integrating, with $\hat\Sigma = I$,
$$\int E[\{\hat f_h(z)-f_0(z)\}^2]\,dz \approx C_1/(nh^r) + C_2h^4.$$
First-order condition for the optimal $h$: $0 = -C_1rn^{-1}h^{-r-1} + 4C_2h^3$. Solving gives $h^* = C_0n^{-1/(r+4)}$. Plugging back in gives
$$\int E[\{\hat f_h(z)-f_0(z)\}^2]\,dz \approx Cn^{-4/(r+4)}$$
for a constant $C$.

The convergence rate declines as $r$ increases.

Nonparametric Regression

Often in econometrics the object of estimation is a regression function. A classical formulation is $Y = X'\beta + \varepsilon$ with $E[\varepsilon|X] = 0$. A more direct way to write this model is $E[Y|X] = X'\beta$. A nonparametric version of this model, that allows for an unknown functional form in the regression, is $E[Y|X] = g_0(X)$, where $g_0(x)$ is an unknown function.

Kernel Regression

To estimate $g_0(x)$ nonparametrically, one can start with a kernel density estimate and plug it into the formula
$$g_0(x) = \int y f(y|x)\,dy = \int y f(y,x)\,dy\Big/\int f(y,x)\,dy.$$
Assume that $X$ is a scalar. Let $k(u_1,u_2)$ be a bivariate kernel with $\int t\,k(t,u_2)\,dt = 0$. The data are $(Y_1,X_1),\dots,(Y_n,X_n)$. Let $K(u_2) = \int k(t,u_2)\,dt$. By the change of variables $t = (y-Y_i)/h$,
$$\int y\hat f_h(y,x)\,dy = n^{-1}h^{-2}\sum_{i=1}^n\int y\,k\Big(\frac{y-Y_i}{h},\frac{x-X_i}{h}\Big)dy = n^{-1}h^{-1}\sum_{i=1}^n\int (Y_i+ht)k\Big(t,\frac{x-X_i}{h}\Big)dt = n^{-1}h^{-1}\sum_{i=1}^n Y_iK\Big(\frac{x-X_i}{h}\Big),$$
$$\int \hat f_h(y,x)\,dy = n^{-1}h^{-2}\sum_{i=1}^n\int k\Big(\frac{y-Y_i}{h},\frac{x-X_i}{h}\Big)dy = n^{-1}h^{-1}\sum_{i=1}^n K\Big(\frac{x-X_i}{h}\Big).$$
Plugging $\hat f_h(y,x)$ into the formula for $g_0(x)$,
$$\hat g_h(x) = \frac{\int y\hat f_h(y,x)\,dy}{\int \hat f_h(y,x)\,dy} = \frac{\sum_{i=1}^n Y_iK\big(\frac{x-X_i}{h}\big)}{\sum_{i=1}^n K\big(\frac{x-X_i}{h}\big)}.$$
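A minimal sketch (not part of the original notes) of $\hat g_h(x)$ with a Gaussian kernel and simulated data; the kernel's normalizing constant cancels in the ratio.

```python
# Kernel (Nadaraya-Watson) regression: g_hat(x) = sum_i Y_i K((x-X_i)/h) / sum_i K((x-X_i)/h).
import numpy as np

def kernel_regression(x_eval, X, Y, h):
    u = (np.asarray(x_eval, dtype=float)[:, None] - np.asarray(X)[None, :]) / h
    K = np.exp(-0.5 * u**2)                           # Gaussian kernel weights
    return (K * np.asarray(Y)[None, :]).sum(axis=1) / K.sum(axis=1)

X = np.random.uniform(0, 1, 400)
Y = np.sin(2 * np.pi * X) + 0.3 * np.random.normal(size=400)
print(kernel_regression([0.25, 0.5, 0.75], X, Y, h=0.05))
```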

The kernel regression estimator $\hat g_h(x)$ just defined is a weighted average,
$$\hat g_h(x) = \sum_{i=1}^n w_i(x)Y_i, \qquad w_i(x) = \frac{K\big(\frac{x-X_i}{h}\big)}{\sum_{j=1}^n K\big(\frac{x-X_j}{h}\big)}.$$
By construction $\sum_i w_i(x) = 1$, while $w_i(x)$ is nonnegative when $K(u)$ is nonnegative. For symmetric $K(u)$ with a unique mode at $u=0$, more weight is given to observations with $X_i$ closer to $x$. The bandwidth controls how fast the weights decline: as $h$ declines, more weight is given to closer observations, which reduces bias but increases variance. For multivariate $X$ the formula is the same, with $K(u)$ replaced by a multivariate kernel, such as $K(u) = \det(\hat\Sigma)^{-1/2}k(\hat\Sigma^{-1/2}u)$ for some kernel $k(u)$. Consistency and convergence rate results are similar. Details are not reported here because the calculations are complicated by the ratio form of the estimator. Bandwidth choice is discussed below.

Series Regression

Another approach to nonparametric regression takes flexible functional forms to complete flexibility by letting the number of terms grow with the sample size. Think of approximating $g_0(x)$ by a linear combination $\sum_{j=1}^K p_{jK}(x)\beta_j$ of approximating functions $p_{jK}(x)$, e.g. polynomials or splines. The estimator of $g_0(x)$ is the predicted value from regressing $Y_i$ on $p^K(X_i)$, for $p^K(x) = (p_{1K}(x),\dots,p_{KK}(x))'$. It is consistent as $K$ grows with $n$. Let $Y = (Y_1,\dots,Y_n)'$ and $P = [p^K(X_1),\dots,p^K(X_n)]'$. Then
$$\hat g(x) = p^K(x)'\hat\beta, \qquad \hat\beta = (P'P)^{-}P'Y,$$
where $A^{-}$ denotes a generalized inverse, $AA^{-}A = A$. Different from the kernel estimator: a global fit rather than a local average.

Examples:

Power series: for scalar $x$, $p_{jK}(x) = x^{j-1}$, $(j = 1,2,\dots)$, so $p^K(x)'\beta$ is a polynomial. Such approximations are good for global approximation of smooth functions, not when most variation is in a narrow range or when there is a jump or kink. Using polynomials that are orthogonal with respect to a density can mitigate multicollinearity.
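A minimal sketch (not part of the original notes) of a power-series fit; a regression-spline version would simply swap in the spline basis defined next. The data and the choice $K=5$ are made up.

```python
# Series regression: regress Y on p^K(X) = (1, X, ..., X^{K-1})' and form predicted values.
import numpy as np

def power_basis(x, K):
    x = np.asarray(x, dtype=float)
    return np.column_stack([x**j for j in range(K)])

def series_fit(x_eval, X, Y, K):
    P = power_basis(X, K)
    beta, *_ = np.linalg.lstsq(P, Y, rcond=None)      # beta_hat = (P'P)^- P'Y
    return power_basis(x_eval, K) @ beta

X = np.random.uniform(0, 1, 500)
Y = np.exp(X) + 0.2 * np.random.normal(size=500)
print(series_fit([0.3, 0.7], X, Y, K=5))              # global polynomial fit of E[Y|X]
```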

Regression splines: $x$ is scalar,
$$p_{jK}(x) = x^{j-1}, \quad (j = 1,2,3,4), \qquad p_{jK}(x) = 1(x > \ell_{j-4,K})(x - \ell_{j-4,K})^3, \quad (j = 5,\dots,K).$$
The $\ell_{1,K},\dots,\ell_{K-4,K}$ are knots; $p^K(x)'\beta$ is cubic between knots and twice continuously differentiable everywhere. It picks up local variation but is still a global fit. One needs to place the knots. B-splines can mitigate multicollinearity. Both approaches extend to the multivariate case by including products of individual components.

Convergence Rate for Series Regression

The rate depends on the approximation rate, i.e. the bias.

Assumption A: There exist $\gamma > 0$, $\beta_K$, and $C$ such that
$$\big\{E[\{g_0(X_i) - p^K(X_i)'\beta_K\}^2]\big\}^{1/2} \le CK^{-\gamma}.$$
This comes from approximation theory. For power series, with $X_i$ in a compact set and $g_0(x)$ continuously differentiable of order $s$, $\gamma = s/r$. For splines, $\gamma = \min\{4,s\}/r$. For the multivariate case the approximation depends on the order of the included terms (e.g. on the power), which grows more slowly with $K$ when $x$ is higher dimensional.

Proposition 3: If Assumption A is satisfied and $\mathrm{Var}(Y_i|X_i) \le C$, then
$$E[\{\hat g_K(X_i) - g_0(X_i)\}^2] \le CK/n + C^2K^{-2\gamma}.$$

Proof: Let $Q = P(P'P)^{-}P'$, $\hat g = (\hat g(X_1),\dots,\hat g(X_n))' = QY$, $g_0 = (g_0(X_1),\dots,g_0(X_n))'$, $\varepsilon = Y - g_0$, and $\bar g = P\beta_K$. $Q$ is idempotent, so $I-Q$ is idempotent and hence has eigenvalues that are zero or one. Therefore, by Assumption A,
$$E[(g_0-\bar g)'(I-Q)(g_0-\bar g)] \le E[(g_0-\bar g)'(g_0-\bar g)] = nE[\{g_0(X_i)-p^K(X_i)'\beta_K\}^2] \le nC^2K^{-2\gamma}.$$
Also, for $X = (X_1,\dots,X_n)$, by independence and iterated expectations, for $i \ne j$,
$$E[\varepsilon_i\varepsilon_j|X] = E[\varepsilon_i\varepsilon_j|X_i,X_j] = E[\varepsilon_iE[\varepsilon_j|X_i,X_j,\varepsilon_i]|X_i,X_j] = E[\varepsilon_iE[\varepsilon_j|X_j]|X_i,X_j] = 0.$$
Then for $\Lambda_{ii} = \mathrm{Var}(Y_i|X_i)$ and $\Lambda = \mathrm{diag}(\Lambda_{11},\dots,\Lambda_{nn})$ we have $E[\varepsilon\varepsilon'|X] = \Lambda$. It follows that, for $\mathrm{tr}(A)$ the trace of a square matrix $A$, by $\mathrm{rank}(Q) \le K$,

$$E[\varepsilon'Q\varepsilon|X] = \mathrm{tr}(QE[\varepsilon\varepsilon'|X]) = \mathrm{tr}(Q\Lambda) = \mathrm{tr}(Q\Lambda Q) \le C\,\mathrm{tr}(Q) \le CK.$$
Then by iterated expectations, $E[\varepsilon'Q\varepsilon] \le CK$. Also,
$$\sum_i\{\hat g(X_i)-g_0(X_i)\}^2 = (\hat g-g_0)'(\hat g-g_0) = (Q\varepsilon-(I-Q)g_0)'(Q\varepsilon-(I-Q)g_0) = \varepsilon'Q\varepsilon + g_0'(I-Q)g_0 = \varepsilon'Q\varepsilon + (g_0-\bar g)'(I-Q)(g_0-\bar g).$$
Then by i.i.d. observations,
$$E[\{\hat g(X_i)-g_0(X_i)\}^2] = E[(\hat g_i-g_{0i})^2] = E[(\hat g-g_0)'(\hat g-g_0)]/n \le CK/n + C^2K^{-2\gamma}. \quad \text{Q.E.D.}$$

Choosing the Bandwidth or Number of Terms

A data-based choice operationalizes complete flexibility: the number of terms or the bandwidth adjusts to the data. Cross-validation is common. Let $\hat g_{-i,h}(X_i)$ and $\hat g_{-i,K}(X_i)$ be the predicted values for the $i$th observation using all the others, for the kernel and series estimators respectively:
$$\hat g_{-i,h}(X_i) = \frac{\sum_{j\ne i}Y_jK\big(\frac{X_i-X_j}{h}\big)}{\sum_{j\ne i}K\big(\frac{X_i-X_j}{h}\big)}, \qquad \hat g_{-i,K}(X_i) = Y_i - \frac{Y_i-\hat g_K(X_i)}{1 - p^K(X_i)'(P'P)^{-}p^K(X_i)},$$
the second equality by recursive residuals. The criteria are
$$\widehat{CV}(h) = \sum_i v(X_i)[Y_i-\hat g_{-i,h}(X_i)]^2, \qquad \widehat{CV}(K) = \sum_i[Y_i-\hat g_{-i,K}(X_i)]^2,$$
where $v(x)$ is a weight function, equal to zero near the boundary of the support. Choose $h$ or $K$ to minimize.
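A minimal sketch (not part of the original notes) of both criteria, with $v(x)\equiv 1$, a Gaussian kernel, a power-series basis, and simulated data; the grids of $h$ and $K$ values are made up.

```python
# Leave-one-out cross-validation for the kernel bandwidth and the number of series terms.
import numpy as np

def cv_kernel(h, X, Y):
    u = (X[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u**2)
    np.fill_diagonal(K, 0.0)                          # drop the own observation
    g_loo = (K @ Y) / K.sum(axis=1)
    return np.mean((Y - g_loo)**2)

def cv_series(K_terms, X, Y):
    P = np.column_stack([X**j for j in range(K_terms)])
    Q = P @ np.linalg.pinv(P.T @ P) @ P.T             # projection matrix P(P'P)^- P'
    loo = (Y - Q @ Y) / (1.0 - np.diag(Q))            # recursive-residual formula
    return np.mean(loo**2)

X = np.random.uniform(0, 1, 300)
Y = np.sin(2 * np.pi * X) + 0.3 * np.random.normal(size=300)
print(min(np.linspace(0.02, 0.3, 15), key=lambda h: cv_kernel(h, X, Y)))
print(min(range(2, 10), key=lambda K: cv_series(K, X, Y)))
```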

The (weighted) sample MSE is
$$\mathrm{MSE}(h) = \sum_i v(X_i)\{\hat g_h(X_i)-g_0(X_i)\}^2/n.$$
The cross-validation criterion is optimal in the sense that $\mathrm{MSE}(\hat h)/\min_h\mathrm{MSE}(h) \to_p 1$; the same holds for series. $\hat h$ itself converges slowly.

Another Empirical Example: Hausman and Newey (1995, "Nonparametric Estimation of Exact Consumers Surplus," Econometrica) report kernel and series estimates; $Y$ is the log of gasoline purchased, a function of price, income, and time and location dummies. Six cross-sections of individuals from the Energy Department, a total of 18,109 observations. [Table of cross-validation criterion values for the kernel bandwidth, the number of spline knots, and the power-series order; the numerical entries are not recoverable here.]

Locally Linear Regression:

There is another local method, locally linear regression, that is thought to be superior to kernel regression. It is based on locally fitting a line rather than a constant. Unlike kernel regression, locally linear estimation would have no bias if the true model were linear. In general, locally linear estimation removes a bias term from the kernel estimator, which makes it behave better near the boundary of the $x$'s and have smaller MSE everywhere. To describe this estimator, let $K_h(u) = h^{-r}K(u/h)$ as before. Consider the estimator $\hat g(x)$ given by the solution to
$$\min_{g,\beta}\ \sum_i\big(Y_i - g - (x-X_i)'\beta\big)^2K_h(x-X_i).$$

That is, $\hat g(x)$ is the constant term in a weighted least squares regression of $Y_i$ on $(1, x-X_i)$, with weights $K_h(x-X_i)$. For
$$Y = \begin{pmatrix}Y_1\\ \vdots\\ Y_n\end{pmatrix}, \qquad \tilde X = \begin{pmatrix}1 & (x-X_1)'\\ \vdots & \vdots\\ 1 & (x-X_n)'\end{pmatrix}, \qquad W = \mathrm{diag}\big(K_h(x-X_1),\dots,K_h(x-X_n)\big),$$
and $e_1$ an $(r+1)\times 1$ vector with 1 in the first position and zeros elsewhere, we have
$$\hat g(x) = e_1'(\tilde X'W\tilde X)^{-1}\tilde X'WY.$$
This estimator depends on $x$ both through the weights $K_h(x-X_i)$ and through the regressors $x-X_i$. It is a locally linear fit of the data: it runs a regression with weights that are smaller for observations that are farther from $x$. In contrast, the kernel regression estimator solves this same minimization problem but with $\beta$ constrained to be zero, i.e., kernel regression minimizes
$$\sum_i(Y_i-g)^2K_h(x-X_i).$$
Removing the constraint $\beta = 0$ leads to lower bias without increasing the variance when $g_0(x)$ is twice differentiable. It is also of interest to note that $\hat\beta$ from the above minimization problem estimates the gradient $\partial g_0(x)/\partial x$. Like kernel regression, this estimator can be interpreted as a weighted average of the $Y_i$ observations, though the weights are a bit more complicated. Let
$$S_0 = \sum_i K_h(x-X_i), \quad S_1 = \sum_i K_h(x-X_i)(x-X_i), \quad S_2 = \sum_i K_h(x-X_i)(x-X_i)(x-X_i)',$$
$$\hat m_0 = \sum_i K_h(x-X_i)Y_i, \qquad \hat m_1 = \sum_i K_h(x-X_i)(x-X_i)Y_i.$$
Then, by the usual partitioned inverse formula,
$$\hat g(x) = e_1'\begin{pmatrix}S_0 & S_1'\\ S_1 & S_2\end{pmatrix}^{-1}\begin{pmatrix}\hat m_0\\ \hat m_1\end{pmatrix} = (S_0 - S_1'S_2^{-1}S_1)^{-1}(\hat m_0 - S_1'S_2^{-1}\hat m_1) = \frac{\sum_i a_iY_i}{\sum_i a_i}, \qquad a_i = K_h(x-X_i)\big[1 - S_1'S_2^{-1}(x-X_i)\big].$$
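A minimal sketch (not part of the original notes) of the locally linear estimator for scalar $x$ with a Gaussian kernel and simulated data; the intercept of each local weighted regression is $\hat g(x)$ and the slope estimates $g_0'(x)$.

```python
# Locally linear regression: weighted LS of Y on (1, x - X_i) with weights K_h(x - X_i).
import numpy as np

def locally_linear(x_eval, X, Y, h):
    out = []
    for x in np.atleast_1d(x_eval):
        w = np.exp(-0.5 * ((x - X) / h)**2) / h       # Gaussian K_h(x - X_i)
        Xt = np.column_stack([np.ones_like(X), x - X])
        A = Xt.T @ (w[:, None] * Xt)                  # X~'W X~
        b = Xt.T @ (w * Y)                            # X~'W Y
        out.append(np.linalg.solve(A, b)[0])          # intercept = g_hat(x)
    return np.array(out)

X = np.random.uniform(0, 1, 400)
Y = np.sin(2 * np.pi * X) + 0.3 * np.random.normal(size=400)
print(locally_linear([0.1, 0.5, 0.9], X, Y, h=0.05))
```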

It is straightforward, though a little involved, to find asymptotic approximations to the MSE. For simplicity we do this for the scalar $x$ case. Note that for $g_0 = (g_0(X_1),\dots,g_0(X_n))'$,
$$\hat g(x) - g_0(x) = e_1'(\tilde X'W\tilde X)^{-1}\tilde X'W(Y-g_0) + e_1'(\tilde X'W\tilde X)^{-1}\tilde X'Wg_0 - g_0(x).$$
Then for $\Sigma = \mathrm{diag}(\sigma^2(X_1),\dots,\sigma^2(X_n))$,
$$E\big[\{\hat g(x)-g_0(x)\}^2\mid X_1,\dots,X_n\big] = e_1'(\tilde X'W\tilde X)^{-1}\tilde X'W\Sigma W\tilde X(\tilde X'W\tilde X)^{-1}e_1 + \big\{e_1'(\tilde X'W\tilde X)^{-1}\tilde X'Wg_0 - g_0(x)\big\}^2.$$
An asymptotic approximation to the MSE is obtained by taking the limit as $n$ grows. Note that we have
$$n^{-1}h^{-j}S_j = n^{-1}\sum_{i=1}^n K_h(x-X_i)[(x-X_i)/h]^j.$$
Then, by the change of variables $u = (x-X_i)/h$,
$$E[n^{-1}h^{-j}S_j] = E\big[K_h(x-X_i)\{(x-X_i)/h\}^j\big] = \int K(u)u^jf_0(x-hu)\,du = \mu_jf_0(x) + o(1)$$
for $\mu_j = \int K(u)u^j\,du$ and $h \to 0$. Also,
$$\mathrm{Var}(n^{-1}h^{-j}S_j) \le n^{-1}E\big[K_h(x-X_i)^2\{(x-X_i)/h\}^{2j}\big] \le n^{-1}h^{-1}\int K(u)^2u^{2j}f_0(x-hu)\,du \le C/(nh) \to 0$$
for $nh \to \infty$. Therefore, for $h \to 0$ and $nh \to \infty$,
$$n^{-1}h^{-j}S_j = \mu_jf_0(x) + o_p(1).$$
Now let $H = \mathrm{diag}(1,h)$. Then by $\mu_0 = 1$ and $\mu_1 = 0$ we have
$$n^{-1}H^{-1}\tilde X'W\tilde XH^{-1} = \begin{pmatrix}n^{-1}S_0 & n^{-1}h^{-1}S_1\\ n^{-1}h^{-1}S_1 & n^{-1}h^{-2}S_2\end{pmatrix} = f_0(x)\begin{pmatrix}1 & 0\\ 0 & \mu_2\end{pmatrix} + o_p(1).$$
Next let $\nu_j = \int K(u)^2u^j\,du$. Then by a similar argument we have
$$n^{-1}h\sum_{i=1}^n K_h(x-X_i)^2[(x-X_i)/h]^j\sigma^2(X_i) = \nu_jf_0(x)\sigma^2(x) + o_p(1).$$
It follows by $\nu_1 = 0$ that
$$n^{-1}hH^{-1}\tilde X'W\Sigma W\tilde XH^{-1} = f_0(x)\sigma^2(x)\begin{pmatrix}\nu_0 & 0\\ 0 & \nu_2\end{pmatrix} + o_p(1).$$
Then we have, for the variance term, by $H^{-1}e_1 = e_1$,
$$e_1'(\tilde X'W\tilde X)^{-1}\tilde X'W\Sigma W\tilde X(\tilde X'W\tilde X)^{-1}e_1 = n^{-1}h^{-1}e_1'\Big(\frac{H^{-1}\tilde X'W\tilde XH^{-1}}{n}\Big)^{-1}\Big(\frac{hH^{-1}\tilde X'W\Sigma W\tilde XH^{-1}}{n}\Big)\Big(\frac{H^{-1}\tilde X'W\tilde XH^{-1}}{n}\Big)^{-1}e_1$$
$$= n^{-1}h^{-1}e_1'\begin{pmatrix}1 & 0\\ 0 & \mu_2\end{pmatrix}^{-1}\begin{pmatrix}\nu_0 & \nu_1\\ \nu_1 & \nu_2\end{pmatrix}\begin{pmatrix}1 & 0\\ 0 & \mu_2\end{pmatrix}^{-1}e_1\,\frac{\sigma^2(x)}{f_0(x)} + o_p(n^{-1}h^{-1}).$$

It then follows that
$$e_1'(\tilde X'W\tilde X)^{-1}\tilde X'W\Sigma W\tilde X(\tilde X'W\tilde X)^{-1}e_1 = n^{-1}h^{-1}\Big(\nu_0\frac{\sigma^2(x)}{f_0(x)} + o_p(1)\Big).$$
For the bias, consider the expansion
$$g_0(X_i) = g_0(x) + g_0'(x)(X_i-x) + \tfrac{1}{2}g_0''(x)(X_i-x)^2 + \tfrac{1}{6}g_0'''(\bar X_i)(X_i-x)^3.$$
Let $r_i = g_0(X_i) - g_0(x) - g_0'(x)(X_i-x)$ and $r = (r_1,\dots,r_n)'$. Then by the form of $\tilde X$ we have
$$g_0 = (g_0(X_1),\dots,g_0(X_n))' = g_0(x)\tilde Xe_1 - g_0'(x)\tilde Xe_2 + r.$$
It follows by $e_1'(\tilde X'W\tilde X)^{-1}\tilde X'W\tilde Xe_1 = 1$ and $e_1'(\tilde X'W\tilde X)^{-1}\tilde X'W\tilde Xe_2 = e_1'e_2 = 0$ that the bias term is
$$e_1'(\tilde X'W\tilde X)^{-1}\tilde X'Wg_0 - g_0(x) = e_1'(\tilde X'W\tilde X)^{-1}\tilde X'Wr.$$
Recall that $n^{-1}h^{-j}S_j = \mu_jf_0(x) + o_p(1)$. Therefore, by $\mu_3 = 0$,
$$n^{-1}h^{-2}H^{-1}\tilde X'W\big((x-X_1)^2,\dots,(x-X_n)^2\big)'\tfrac{1}{2}g_0''(x) = \begin{pmatrix}n^{-1}h^{-2}S_2\\ n^{-1}h^{-3}S_3\end{pmatrix}\tfrac{1}{2}g_0''(x) = f_0(x)\begin{pmatrix}\mu_2\\ 0\end{pmatrix}\tfrac{1}{2}g_0''(x) + o_p(1).$$
Assuming that $g_0'''(\bar X_i)$ is bounded,
$$\Big|n^{-1}h^{-2}H^{-1}\tilde X'W\big((x-X_1)^3g_0'''(\bar X_1),\dots,(x-X_n)^3g_0'''(\bar X_n)\big)'\Big| \le C\max\Big\{n^{-1}h^{-2}\sum_i K_h(x-X_i)|x-X_i|^3,\; n^{-1}h^{-3}S_4\Big\} \to_p 0.$$
Therefore, we have
$$e_1'(\tilde X'W\tilde X)^{-1}\tilde X'Wr = h^2e_1'\Big(\frac{H^{-1}\tilde X'W\tilde XH^{-1}}{n}\Big)^{-1}\big(n^{-1}h^{-2}H^{-1}\tilde X'Wr\big) = \frac{h^2}{2}g_0''(x)e_1'\begin{pmatrix}1 & 0\\ 0 & \mu_2\end{pmatrix}^{-1}\begin{pmatrix}\mu_2\\ 0\end{pmatrix} + o_p(h^2) = \frac{h^2}{2}\big[g_0''(x)\mu_2 + o_p(1)\big].$$

Exercise: Apply an analogous calculation to show that the kernel regression bias is
$$\frac{h^2\mu_2}{2}\Big[g_0''(x) + 2g_0'(x)\frac{f_0'(x)}{f_0(x)}\Big].$$
Notice that the locally linear bias derived above is zero if the function is linear. Combining the bias and variance expressions, we have the following form for the asymptotic MSE of the locally linear estimator:
$$\frac{1}{nh}\nu_0\frac{\sigma^2(x)}{f_0(x)} + \frac{h^4}{4}g_0''(x)^2\mu_2^2.$$
In contrast, the kernel regression MSE is
$$\frac{1}{nh}\nu_0\frac{\sigma^2(x)}{f_0(x)} + \frac{h^4}{4}\Big[g_0''(x) + 2g_0'(x)\frac{f_0'(x)}{f_0(x)}\Big]^2\mu_2^2.$$
The kernel bias will be much bigger near the boundary of the support, where $f_0'(x)/f_0(x)$ is large. For example, if $f_0(x)$ is approximately $x^\alpha$ for $x > 0$ near zero, then $f_0'(x)/f_0(x)$ grows like $1/x$ as $x$ gets close to zero. Thus, locally linear estimation has smaller boundary bias. Also, locally linear estimation has no bias if $g_0(x)$ is linear, but kernel regression obviously does. A simple bandwidth choice method is to minimize an estimate of the asymptotic MSE; one could use a plug-in method to minimize the asymptotic MSE integrated over $\omega(x)f_0(x)\,dx$ for some weight $\omega(x)$.

Reducing the Curse of Dimensionality

Idea: restrict the form of the regression so that it only depends on low-dimensional components. An additive model has the regression additive in lower-dimensional components. An index model has the regression depending only on a linear combination. For the additive model, with $X = (X_1,\dots,X_r)'$,
$$E[Y|X] = \sum_{j=1}^r g_{j0}(X_j),$$
and a one-dimensional convergence rate is attainable. A series estimator is simplest: restrict the approximating functions to depend on only one component each. For scalar $u$ and approximating functions $p_{\ell L}(u)$, $\ell = 1,\dots,L$, let $p^L(u) = (p_{1L}(u),\dots,p_{LL}(u))'$, $K = Lr + 1$, and $p^K(x) = (1, p^L(x_1)',\dots,p^L(x_r)')'$. Regress $Y_i$ on $p^K(X_i)$. For $\hat\beta_0$ the constant and $\hat\beta_j$, $(j = 1,\dots,r)$, the coefficient vector on $p^L(x_j)$,
$$\hat g(x) = \hat\beta_0 + \sum_{j=1}^r p^L(x_j)'\hat\beta_j.$$
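A minimal sketch (not part of the original notes) of the additive series fit with a power basis in each component; the data, $r = 3$, and $L = 4$ are made up.

```python
# Additive series regression: the design is a constant plus a univariate basis in each
# component of X separately, so K = L*r + 1.
import numpy as np

def additive_series_fit(x_eval, X, Y, L):
    def design(Xmat):
        cols = [np.ones(Xmat.shape[0])]
        for j in range(Xmat.shape[1]):                # one univariate basis per component
            cols += [Xmat[:, j]**p for p in range(1, L + 1)]
        return np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(design(X), Y, rcond=None)
    return design(np.atleast_2d(x_eval)) @ beta

n, r = 1000, 3
X = np.random.uniform(-1, 1, size=(n, r))
Y = X[:, 0]**2 + np.sin(X[:, 1]) + X[:, 2] + 0.2 * np.random.normal(size=n)
print(additive_series_fit([[0.5, 0.5, 0.5]], X, Y, L=4))
```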

Proposition 4: If Assumption A is satisfied with $g_0(X)$ equal to each component $g_{j0}(X_j)$ and $p^K(X)$ replaced by $(1, p^L(X_j)')'$, where the constants $C,\gamma$ do not depend on $j$, and $\mathrm{Var}(Y|X) \le C$, then
$$E[\{\hat g_K(X_i) - g_0(X_i)\}^2] \le C/n + r\big[CL/n + rC^2(L+1)^{-2\gamma}\big].$$

Proof: Let $\tilde p^L(u) = (1, p^L(u)')'$. By Assumption A, there are $C$ and $\gamma$ such that for each $j$ and $L$ there is an $(L+1)\times 1$ vector $\tilde\beta_{jL} = (\beta_{1jL}, \beta_{jL}')'$ with
$$E[\{g_{j0}(X_j) - \tilde p^L(X_j)'\tilde\beta_{jL}\}^2] \le C^2(L+1)^{-2\gamma}.$$
Let $\tilde\beta = \big(\sum_{j=1}^r\beta_{1jL}, \beta_{1L}',\dots,\beta_{rL}'\big)'$, so that by the triangle inequality in $L^2$,
$$E[\{g_0(X) - p^K(X)'\tilde\beta\}^2] = E\Big[\Big\{\sum_{j=1}^r\big(g_{j0}(X_j) - \tilde p^L(X_j)'\tilde\beta_{jL}\big)\Big\}^2\Big] \le r^2C^2(L+1)^{-2\gamma}.$$
Then as in the proof of Proposition 3 we have
$$E[\{\hat g_K(X) - g_0(X)\}^2] \le CK/n + r^2C^2(L+1)^{-2\gamma} = C/n + r\big[CL/n + rC^2(L+1)^{-2\gamma}\big]. \quad \text{Q.E.D.}$$

The convergence rate in terms of $L$ and $n$ does not depend on $r$, although $r$ does affect the constant. Here $r$ could even grow with the sample size at some power of $n$. The additivity condition is satisfied in Hausman and Newey (1995).

Semiparametric Models

Data: $z_1, z_2,\dots$ i.i.d. Model: $\mathcal F$, a set of pdfs. Correct specification: the pdf $f_0$ of $z_i$ is in $\mathcal F$. Semiparametric model: $\mathcal F$ has a parametric component $\theta$ and nonparametric components.

Example: Linear model $E[Y|X] = X'\beta_0$; the parametric component is $\beta$, everything else is nonparametric.

Example: Probit, $Y \in \{0,1\}$, $\Pr(Y = 1|X) = \Phi(X'\beta_0)$; $\beta$ is the parametric component, the nonparametric component is the distribution of $X$.

Binary Choice with Unknown Disturbance Distribution: $z = (Y,X)$, $v(x,\beta)$ a known function. The model is
$$Y = 1(Y^* > 0), \qquad Y^* = v(X,\beta_0) - \varepsilon, \qquad \varepsilon \text{ independent of } X,$$
which implies $\Pr(Y = 1|X) = G(v(X,\beta_0))$, where $G$ is the CDF of $\varepsilon$. The parameter is $\beta$; everything else, including $G(u)$, is nonparametric. The $v(x,\beta)$ notation allows location and scale normalizations, e.g. $v(x,\beta) = x_1 + x_2'\beta$, $x = (x_1, x_2')'$, $x_1$ a scalar.

Censored Regression with Unknown Disturbance Distribution: $z = (Y,X)$,
$$Y = \max\{0, Y^*\}, \qquad Y^* = X'\beta_0 + \varepsilon, \qquad \varepsilon \text{ independent of } X.$$
The parameter is $\beta$; everything else, including the distribution of $\varepsilon$, is nonparametric.

Binary choice and censored regression are limited dependent variable models. Semiparametric models are important here because misspecifying the distribution of the disturbances leads to inconsistency of the MLE.

Partially Linear Regression: $z = (Y,X,W)$,
$$E[Y|X,W] = X'\beta_0 + g_0(W).$$
The parameter is $\beta$; everything else is nonparametric, including the additive component of the regression. This can help with the curse of dimensionality, with the covariates $X$ entering parametrically. In Hausman and Newey (1995), $W$ is log income and log price, and $X$ includes about 20 time and location dummies. $X$ may be the variable of interest and $g_0(\cdot)$ may involve some covariates, e.g. sample selection.

Index Regression: $z = (Y,X)$, $v(x,\beta)$ a known function,
$$E[Y|X] = \tau(v(X,\beta_0)),$$
where the function $\tau(\cdot)$ is unknown. The binary choice model has $E[Y|X] = \Pr(Y = 1|X) = \tau(v(X,\beta_0))$ with $\tau(\cdot) = G(\cdot)$. If we allow the conditional distribution of $\varepsilon$ given $X$ to depend (only) on $v(X,\beta_0)$, then the binary choice model becomes an index model.

Semiparametric Estimators

Estimators of $\beta_0$. Two kinds: those that do and those that do not require nonparametric estimation. The choice is really model specific, but it is beyond our scope to say why. One general kind of estimator is
$$\hat\beta = \arg\min_{\beta\in B}\sum_i q(z_i,\beta)/n, \qquad \beta_0 = \arg\min_{\beta\in B}E[q(z_i,\beta)],$$

where $B$ is the set of parameter values. This is an extremum estimator. There are clever choices of $q(z,\beta)$ in some semiparametric models.

Conditional median estimators use two facts:

Fact 1: The median of a monotonic transformation is the transformation of the median.

Fact 2: The median minimizes the expected absolute deviation, i.e. $\mathrm{med}(Y|X)$ minimizes $E[|Y - m(X)|]$ over functions $m(\cdot)$ of $X$.

Binary Choice: If $v(X,\beta)$ includes a constant and $\varepsilon$ has zero median, then $\mathrm{med}(Y^*|X) = v(X,\beta_0)$. $1(y > 0)$ is a monotonic transformation, so Fact 1 implies
$$\mathrm{med}(Y|X) = 1(\mathrm{med}(Y^*|X) > 0) = 1(v(X,\beta_0) > 0).$$
By Fact 2, $\beta_0$ minimizes $E[|Y - 1(v(X,\beta) > 0)|]$, so
$$\hat\beta = \arg\min_\beta\sum_i|Y_i - 1(v(X_i,\beta) > 0)|.$$
This is the maximum score estimator of Manski (1977). It only requires $\mathrm{med}(Y^*|X) = v(X,\beta_0)$, so it allows for heteroskedasticity.

Censored Regression: If $X'\beta$ includes a constant and $\varepsilon$ has median zero, then $\mathrm{med}(Y^*|X) = X'\beta_0$. $\max\{0,y\}$ is a monotonic transformation, so Fact 1 says $\mathrm{med}(Y|X) = \max\{0, X'\beta_0\}$. By Fact 2, $\beta_0$ minimizes $E[|Y - \max\{0,X'\beta\}|]$, so
$$\hat\beta = \arg\min_\beta\sum_i|Y_i - \max\{0, X_i'\beta\}|.$$
This is the censored least absolute deviations estimator of Powell (1984). It only requires $\mathrm{med}(Y^*|X) = X'\beta_0$ and allows for heteroskedasticity.

Generalize: $\mathrm{med}(Y^*|X) = v(X,\beta_0)$ and $Y = T(Y^*)$ for a monotonic transformation $T(y)$. By Fact 1, $\mathrm{med}(Y|X) = T(v(X,\beta_0))$. Use Fact 2 to form
$$\hat\beta = \arg\min_\beta\sum_i|Y_i - T(v(X_i,\beta))|.$$
Global minimization, rather than solving first-order conditions, is important. For maximum score there are no first-order conditions. For censored LAD the first-order conditions are zero whenever $X_i'\beta < 0$ for all $i = 1,\dots,n$. The approach provides estimates of parameters and conditional median predictions, not conditional means. It generalizes to conditional quantiles.
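A minimal sketch (not part of the original notes) of the censored LAD objective, minimized here by a derivative-free search (SciPy is assumed available; a grid or global search would be more faithful to the global-minimization point made above). The design is simulated.

```python
# Censored LAD: minimize sum_i |Y_i - max(0, X_i'beta)| over beta.
import numpy as np
from scipy.optimize import minimize

def clad_objective(beta, X, Y):
    return np.abs(Y - np.maximum(0.0, X @ beta)).sum()

n = 500
X = np.column_stack([np.ones(n), np.random.normal(size=n)])
beta0 = np.array([0.5, 1.0])
eps = np.random.standard_t(df=3, size=n)              # non-Gaussian, median-zero errors
Y = np.maximum(0.0, X @ beta0 + eps)                  # censoring at zero
res = minimize(clad_objective, x0=np.zeros(2), args=(X, Y), method="Nelder-Mead")
print(res.x)                                          # should be near (0.5, 1.0)
```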

Consistency and Asymptotic Normality of Minimization Estimators

A consistency result:

Proposition 5: If i) $E[q(z,\beta)]$ has a unique minimum at $\beta_0$; ii) $\beta_0 \in B$ and $B$ is compact; iii) $q(z,\beta)$ is continuous at each $\beta$ with probability one; iv) $E[\sup_{\beta\in B}|q(z_i,\beta)|] < \infty$; then
$$\hat\beta = \arg\min_{\beta\in B}\sum_i q(z_i,\beta)/n \to_p \beta_0.$$
This is well known. It allows $q(z,\beta)$ to be discontinuous. In the binary choice model above, assumption iii) is satisfied if $v(x,\beta) = x'\beta$ and $X_i$ includes a continuously distributed regressor with the corresponding component of $\beta$ bounded away from zero on $B$. All the conditions are straightforward to check.

An asymptotic normality result, Van der Vaart (1995):

Proposition 6: If $\hat\beta \to_p \beta_0$, $\beta_0$ is in the interior of $B$, and i) $E[q(z_i,\beta)]$ is twice differentiable at $\beta_0$ with nonsingular Hessian $H$; ii) there is $d(z)$ such that $E[d(z)^2]$ exists and for all $\tilde\beta,\beta\in B$, $|q(z,\tilde\beta) - q(z,\beta)| \le d(z)\|\tilde\beta - \beta\|$; iii) with probability one $q(z,\beta)$ is differentiable at $\beta_0$ with derivative $m(z)$; then
$$n^{1/2}(\hat\beta - \beta_0) \to_d N\big(0, H^{-1}E[m(z)m(z)']H^{-1}\big).$$
These conditions are straightforward to check for censored LAD. They do not hold for maximum score; instead $n^{1/3}(\hat\beta - \beta_0)$ converges in distribution (to a nonnormal limit).

Estimators with Nonparametric Components

Some models require the use of nonparametric estimators. These include the partially linear and index regressions. We discuss least squares estimation when there is a nonparametric component in the regression. The basic idea is to concentrate out the nonparametric component, to find a profile squared residual function, by substituting an estimator for the nonparametric component.

Partially linear model, as in Robinson (1988). We know $E[Y|X,W]$ minimizes $E[(Y - G(X,W))^2]$ over $G$, so that
$$(\beta_0, g_0(\cdot)) = \arg\min_{\beta, g(\cdot)}E[\{Y_i - X_i'\beta - g(W_i)\}^2].$$
Do the minimization in two steps: first solve for the minimum over $g$ for fixed $\beta$, substitute that minimum into the objective function, then minimize over $\beta$. The minimizer over $g$ for fixed $\beta$ is $E[Y_i - X_i'\beta|W_i] = E[Y_i|W_i] - E[X_i|W_i]'\beta$. Substituting,
$$\beta_0 = \arg\min_\beta E\big[\{Y_i - E[Y_i|W_i] - (X_i - E[X_i|W_i])'\beta\}^2\big].$$

Estimate using $\hat E[Y_i|W_i]$ and $\hat E[X_i|W_i]$ and replace the outside expectation by a sample average,
$$\hat\beta = \arg\min_\beta\sum_i\{Y_i - \hat E[Y_i|W_i] - (X_i - \hat E[X_i|W_i])'\beta\}^2/n,$$
i.e. least squares of $Y_i - \hat E[Y_i|W_i]$ on $X_i - \hat E[X_i|W_i]$. Kernel or series estimators are fine.

Index regression, as in Ichimura (1993). By $E[Y|X] = \tau_0(v(X,\beta_0))$,
$$(\beta_0, \tau_0(\cdot)) = \arg\min_{\beta,\tau(\cdot)}E[\{Y_i - \tau(v(X_i,\beta))\}^2].$$
Concentrating out $\tau$, $\tau(x,\beta) = E[Y|v(X,\beta) = v(x,\beta)]$. Letting $\hat\tau(X_i,\beta)$ be a nonparametric estimator of $E[Y|v(X,\beta)]$, the estimator is
$$\hat\beta = \arg\min_\beta\sum_i\{Y_i - \hat\tau(X_i,\beta)\}^2.$$
This generalizes to log-likelihood and other objective functions. Let $q(z,\beta,\eta)$ depend on a parametric component $\beta$ and a nonparametric component $\eta$. The true values minimize $E[q(z,\beta,\eta)]$. With an estimator $\hat\eta(\beta)$ of $\eta(\beta) = \arg\min_\eta E[q(z,\beta,\eta)]$,
$$\hat\beta = \arg\min_\beta\sum_i q(z_i,\beta,\hat\eta(\beta)).$$
The asymptotic theory is difficult because of the presence of the nonparametric estimator. It is known that $\hat\beta$ is often $n^{1/2}$-consistent and asymptotically normal, and that the estimation of $\eta(\beta)$ may not even affect the asymptotic distribution. Other estimators that depend on nonparametric estimators may have the nonparametric estimation affect the limiting distribution, e.g. the average derivative estimator of Stoker (1987) and Powell, Stock, and Stoker (1989). Joint maximization is possible but can be difficult because of the need to smooth; one cannot allow $\hat\eta(\beta)$ to be $n$-dimensional. An empirical example is Hausman and Newey (1995): the graphs there are actually those for $\hat g(w)$ from a partially linear model.

Example of the theory: a series estimator of the partially linear model. Let $p^K(w)$ be a $K\times 1$ vector of approximating functions, such as power series or splines. Also let $Y = (Y_1,\dots,Y_n)'$, $X = [X_1,\dots,X_n]'$, $P = [p^K(W_1),\dots,p^K(W_n)]'$, $Q = P(P'P)^{-}P'$, and let $\hat E[Y_i|W_i] = p^K(W_i)'(P'P)^{-}P'Y$ and $\hat E[X_i|W_i]' = p^K(W_i)'(P'P)^{-}P'X$ be series estimators. The residuals are $(I-Q)Y$ and $(I-Q)X$, respectively, so by $I-Q$ idempotent,
$$\hat\beta = \big(X'(I-Q)X\big)^{-1}X'(I-Q)Y.$$
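A minimal sketch (not part of the original notes) of this estimator for scalar $W$ with a power-series basis; the data and the choice $K = 6$ are made up.

```python
# Series estimator of the partially linear model: residualize Y and X on P, then regress.
import numpy as np

def partially_linear_beta(Y, X, W, K):
    P = np.column_stack([W**j for j in range(K)])     # power series in scalar W
    Q = P @ np.linalg.pinv(P.T @ P) @ P.T             # projection P(P'P)^- P'
    MY, MX = Y - Q @ Y, X - Q @ X                     # (I - Q)Y and (I - Q)X
    return np.linalg.solve(MX.T @ MX, MX.T @ MY)

n = 1000
W = np.random.uniform(0, 1, n)
X = np.column_stack([W + np.random.normal(size=n), np.random.normal(size=n)])
Y = X @ np.array([1.0, -0.5]) + np.sin(2 * np.pi * W) + 0.3 * np.random.normal(size=n)
print(partially_linear_beta(Y, X, W, K=6))            # roughly (1.0, -0.5)
```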

Here $\hat\beta$ is also the least squares regression of $Y$ on $X$ and $P$. A result on $n^{1/2}$-consistency and asymptotic normality:

Proposition 7: If i) $Y_i$ and $X_i$ have finite second moments; ii) $H = E[\mathrm{Var}(X_i|W_i)]$ is nonsingular; iii) $\mathrm{Var}(Y_i|X_i,W_i)$ and $\mathrm{Var}(X_i|W_i)$ are bounded; iv) there are $C$, $\gamma_g$, and $\gamma_x$ such that for every $K$ there are $\alpha_K$ and $\beta_K$ with $E[\{g_0(W_i) - p^K(W_i)'\alpha_K\}^2] \le CK^{-2\gamma_g}$ and $E[\|E[X_i|W_i] - p^K(W_i)'\beta_K\|^2] \le CK^{-2\gamma_x}$; v) $K/n \to 0$ and $n^{1/2}K^{-(\gamma_g+\gamma_x)} \to 0$; then for $\Sigma = E\big[\mathrm{Var}(Y_i|X_i,W_i)\{X_i - E[X_i|W_i]\}\{X_i - E[X_i|W_i]\}'\big]$,
$$n^{1/2}(\hat\beta - \beta_0) \to_d N(0, H^{-1}\Sigma H^{-1}).$$

Condition ii) is an identification condition that says essentially there is no perfect multicollinearity between $X_i$ and any function of $W_i$. Intuitively, if one or more of the $X_i$ variables were functions of $W_i$, then we could not separately identify $g_0(W_i)$ and the coefficients on those variables. Furthermore, we know that a necessary and sufficient condition for identification of $\beta$ from the least squares objective function where $g(\cdot)$ has been partialled out is that $X_i - E[X_i|W_i]$ have a nonsingular second moment matrix. By iterated expectations that second moment matrix is
$$E\big[(X_i - E[X_i|W_i])(X_i - E[X_i|W_i])'\big] = E\big[E[\{(X_i - E[X_i|W_i])(X_i - E[X_i|W_i])'\}|W_i]\big] = H.$$
Thus, condition ii) is the same as the usual identification condition for least squares after partialling out the nonparametric component.

The requirement $K/n \to 0$ is a small variance condition and $n^{1/2}K^{-(\gamma_g+\gamma_x)} \to 0$ a small bias condition. The bias here is of order $K^{-(\gamma_g+\gamma_x)}$, which is of smaller order than just the bias in approximating $g_0(w)$ (which is only $K^{-\gamma_g}$). Indeed, the order of the bias of $\hat\beta$ is the product of the biases from approximating $g_0(w)$ and from approximating $E[X|w]$. So one sufficient condition is that the bias in each of the nonparametric estimates vanish faster than $n^{-1/4}$. This faster-than-$n^{-1/4}$ condition is common to many semiparametric estimators. Some amount of smoothness is required for root-$n$ consistency: existence of $K$ satisfying the rate condition v) requires that $\gamma_g + \gamma_x > 1/2$, i.e. (with $\gamma = s/r$ as above) that the total number of derivatives exceeds $r/2$. An analogous smoothness requirement (or even a stronger one) is generally needed for root-$n$ consistency of any semiparametric estimator that requires estimation of a nonparametric component.

Proof of Proposition 7: For simplicity we give the proof when $X_i$ is a scalar. Let $M = I - Q$, $W = (W_1,\dots,W_n)'$, $\bar X = [E[X_1|W_1],\dots,E[X_n|W_n]]'$, $V = X - \bar X$, $g_0 = (g_0(W_1),\dots,g_0(W_n))'$, and $\varepsilon = Y - X\beta_0 - g_0$. Substituting $Y = X\beta_0 + g_0 + \varepsilon$ in the formula for $\hat\beta$, subtracting $\beta_0$, and multiplying by $n^{1/2}$ gives
$$n^{1/2}(\hat\beta - \beta_0) = (X'MX/n)^{-1}\big(X'Mg_0/n^{1/2} + X'M\varepsilon/n^{1/2}\big).$$
By the law of large numbers $V'V/n \to_p H$. Also, similarly to the proof of Proposition 3,
$$V'QV/n = O_p(K/n), \qquad \bar X'M\bar X/n = O_p(K^{-2\gamma_x}), \qquad g_0'Mg_0/n = O_p(K^{-2\gamma_g}).$$
Then $V'MV/n \to_p H$ and
$$|V'M\bar X/n| \le (V'MV/n)^{1/2}(\bar X'M\bar X/n)^{1/2} \to_p 0,$$
so that
$$X'MX/n = (\bar X + V)'M(\bar X + V)/n = \bar X'M\bar X/n + 2\bar X'MV/n + V'MV/n \to_p H.$$
Next, similarly to the proof of Proposition 3, we have $E[VV'|W] \le CI_n$ and $E[\varepsilon\varepsilon'|W,X] \le CI_n$. Then
$$E[\{V'Mg_0/n^{1/2}\}^2] = E[g_0'ME[VV'|W]Mg_0]/n \le CE[g_0'Mg_0]/n = O(K^{-2\gamma_g}) \to 0,$$
$$\{\bar X'Mg_0/n^{1/2}\}^2 \le n(\bar X'M\bar X/n)(g_0'Mg_0/n) = O_p(nK^{-2\gamma_x-2\gamma_g}) \to_p 0,$$

so that $X'Mg_0/n^{1/2} = \bar X'Mg_0/n^{1/2} + V'Mg_0/n^{1/2} \to_p 0$. Also,
$$E[\{\bar X'M\varepsilon/n^{1/2}\}^2] = E[\bar X'ME[\varepsilon\varepsilon'|X,W]M\bar X]/n \le CE[\bar X'M\bar X]/n = O(K^{-2\gamma_x}) \to 0,$$
$$E[\{V'Q\varepsilon/n^{1/2}\}^2] = E[V'QE[\varepsilon\varepsilon'|X,W]QV]/n \le CE[V'QV]/n = O(K/n) \to 0,$$
and by the central limit theorem, $V'\varepsilon/n^{1/2} \to_d N(0,\Sigma)$. Therefore,
$$X'M\varepsilon/n^{1/2} = \bar X'M\varepsilon/n^{1/2} + V'\varepsilon/n^{1/2} - V'Q\varepsilon/n^{1/2} \to_d N(0,\Sigma).$$
Then by the continuous mapping and Slutzky theorems it follows that
$$n^{1/2}(\hat\beta - \beta_0) = (H + o_p(1))^{-1}\big[V'\varepsilon/n^{1/2} + o_p(1)\big] \to_d H^{-1}N(0,\Sigma) = N(0, H^{-1}\Sigma H^{-1}). \quad \text{Q.E.D.}$$


More information

lecture 26: Richardson extrapolation

lecture 26: Richardson extrapolation 43 lecture 26: Ricardson extrapolation 35 Ricardson extrapolation, Romberg integration Trougout numerical analysis, one encounters procedures tat apply some simple approximation (eg, linear interpolation)

More information

Bandwidth Selection in Nonparametric Kernel Testing

Bandwidth Selection in Nonparametric Kernel Testing Te University of Adelaide Scool of Economics Researc Paper No. 2009-0 January 2009 Bandwidt Selection in Nonparametric ernel Testing Jiti Gao and Irene Gijbels Bandwidt Selection in Nonparametric ernel

More information

Local Orthogonal Polynomial Expansion (LOrPE) for Density Estimation

Local Orthogonal Polynomial Expansion (LOrPE) for Density Estimation Local Ortogonal Polynomial Expansion (LOrPE) for Density Estimation Alex Trindade Dept. of Matematics & Statistics, Texas Tec University Igor Volobouev, Texas Tec University (Pysics Dept.) D.P. Amali Dassanayake,

More information

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems 5 Ordinary Differential Equations: Finite Difference Metods for Boundary Problems Read sections 10.1, 10.2, 10.4 Review questions 10.1 10.4, 10.8 10.9, 10.13 5.1 Introduction In te previous capters we

More information

Introduction to Derivatives

Introduction to Derivatives Introduction to Derivatives 5-Minute Review: Instantaneous Rates and Tangent Slope Recall te analogy tat we developed earlier First we saw tat te secant slope of te line troug te two points (a, f (a))

More information

Sin, Cos and All That

Sin, Cos and All That Sin, Cos and All Tat James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 9, 2017 Outline Sin, Cos and all tat! A New Power Rule Derivatives

More information

3. THE EXCHANGE ECONOMY

3. THE EXCHANGE ECONOMY Essential Microeconomics -1-3. THE EXCHNGE ECONOMY Pareto efficient allocations 2 Edgewort box analysis 5 Market clearing prices 13 Walrasian Equilibrium 16 Equilibrium and Efficiency 22 First welfare

More information

= 0 and states ''hence there is a stationary point'' All aspects of the proof dx must be correct (c)

= 0 and states ''hence there is a stationary point'' All aspects of the proof dx must be correct (c) Paper 1: Pure Matematics 1 Mark Sceme 1(a) (i) (ii) d d y 3 1x 4x x M1 A1 d y dx 1.1b 1.1b 36x 48x A1ft 1.1b Substitutes x = into teir dx (3) 3 1 4 Sows d y 0 and states ''ence tere is a stationary point''

More information

Bootstrap prediction intervals for Markov processes

Bootstrap prediction intervals for Markov processes arxiv: arxiv:0000.0000 Bootstrap prediction intervals for Markov processes Li Pan and Dimitris N. Politis Li Pan Department of Matematics University of California San Diego La Jolla, CA 92093-0112, USA

More information

Volume 29, Issue 3. Existence of competitive equilibrium in economies with multi-member households

Volume 29, Issue 3. Existence of competitive equilibrium in economies with multi-member households Volume 29, Issue 3 Existence of competitive equilibrium in economies wit multi-member ouseolds Noriisa Sato Graduate Scool of Economics, Waseda University Abstract Tis paper focuses on te existence of

More information

Math Spring 2013 Solutions to Assignment # 3 Completion Date: Wednesday May 15, (1/z) 2 (1/z 1) 2 = lim

Math Spring 2013 Solutions to Assignment # 3 Completion Date: Wednesday May 15, (1/z) 2 (1/z 1) 2 = lim Mat 311 - Spring 013 Solutions to Assignment # 3 Completion Date: Wednesday May 15, 013 Question 1. [p 56, #10 (a)] 4z Use te teorem of Sec. 17 to sow tat z (z 1) = 4. We ave z 4z (z 1) = z 0 4 (1/z) (1/z

More information

Solutions to the Multivariable Calculus and Linear Algebra problems on the Comprehensive Examination of January 31, 2014

Solutions to the Multivariable Calculus and Linear Algebra problems on the Comprehensive Examination of January 31, 2014 Solutions to te Multivariable Calculus and Linear Algebra problems on te Compreensive Examination of January 3, 24 Tere are 9 problems ( points eac, totaling 9 points) on tis portion of te examination.

More information

A Goodness-of-fit test for GARCH innovation density. Hira L. Koul 1 and Nao Mimoto Michigan State University. Abstract

A Goodness-of-fit test for GARCH innovation density. Hira L. Koul 1 and Nao Mimoto Michigan State University. Abstract A Goodness-of-fit test for GARCH innovation density Hira L. Koul and Nao Mimoto Micigan State University Abstract We prove asymptotic normality of a suitably standardized integrated square difference between

More information

2.11 That s So Derivative

2.11 That s So Derivative 2.11 Tat s So Derivative Introduction to Differential Calculus Just as one defines instantaneous velocity in terms of average velocity, we now define te instantaneous rate of cange of a function at a point

More information

1 Introduction to Optimization

1 Introduction to Optimization Unconstrained Convex Optimization 2 1 Introduction to Optimization Given a general optimization problem of te form min x f(x) (1.1) were f : R n R. Sometimes te problem as constraints (we are only interested

More information

Solution. Solution. f (x) = (cos x)2 cos(2x) 2 sin(2x) 2 cos x ( sin x) (cos x) 4. f (π/4) = ( 2/2) ( 2/2) ( 2/2) ( 2/2) 4.

Solution. Solution. f (x) = (cos x)2 cos(2x) 2 sin(2x) 2 cos x ( sin x) (cos x) 4. f (π/4) = ( 2/2) ( 2/2) ( 2/2) ( 2/2) 4. December 09, 20 Calculus PracticeTest s Name: (4 points) Find te absolute extrema of f(x) = x 3 0 on te interval [0, 4] Te derivative of f(x) is f (x) = 3x 2, wic is zero only at x = 0 Tus we only need

More information

Exponential and logarithmic functions (pp ) () Supplement October 14, / 1. a and b positive real numbers and x and y real numbers.

Exponential and logarithmic functions (pp ) () Supplement October 14, / 1. a and b positive real numbers and x and y real numbers. MA123, Supplement Exponential and logaritmic functions pp. 315-319) Capter s Goal: Review te properties of exponential and logaritmic functions. Learn ow to differentiate exponential and logaritmic functions.

More information

AMS 147 Computational Methods and Applications Lecture 09 Copyright by Hongyun Wang, UCSC. Exact value. Effect of round-off error.

AMS 147 Computational Methods and Applications Lecture 09 Copyright by Hongyun Wang, UCSC. Exact value. Effect of round-off error. Lecture 09 Copyrigt by Hongyun Wang, UCSC Recap: Te total error in numerical differentiation fl( f ( x + fl( f ( x E T ( = f ( x Numerical result from a computer Exact value = e + f x+ Discretization error

More information