An introduction to nonparametric and semi-parametric econometric methods


1 An introduction to nonparametric and semi-parametric econometric methods Robert Breunig Australian National University March 1,

2 Outline
1. Introduction
2. Density Estimation: (a) Kernel techniques (b) Bandwidth selection (c) Estimating derivatives of densities (d) Non-kernel techniques
3. Conditional Mean Estimation
4. Semi-parametric estimation: (a) Robinson's method (b) Differencing (c) Binary choice models (d) Mixed categorical and continuous variables

3 Objectives 1. Introduce nonparametric and semiparametric techniques 2. Introduce some of the key issues in the literature 3. Introduce several key tools and techniques 4. Provide examples of the use of techniques 5. Provide reference literature so that interested students can pursue these techniques in their applied work 2

4 Objects of Interest. All statistical objects studied by applied econometricians may be expressed as functions of unknown distributions. Measurement of inequality: $F_a(x) - F_b(x) = \int_{-\infty}^{x} f_a(t)\,dt - \int_{-\infty}^{x} f_b(t)\,dt$. Regression modelling: $m(x) = E[Y|x] = \int y\,\frac{f(y,x)}{f_1(x)}\,dy$

5 Measurement of response: $\beta(x) = \frac{dE[Y|x]}{dx} = \frac{d}{dx}\left[\int y\,\frac{f(y,x)}{f_1(x)}\,dy\right]$. Market risk: $\sigma^2(x) = \int \left(y - E[Y|x]\right)^2 \frac{f(y,x)}{f_1(x)}\,dy$. Discrete choice: $\mathrm{Prob}[Y=1|x] = \frac{f(1,x)}{f_1(x)}$, with $Y=1$ predicted when this probability exceeds $0.5$

6 Parametric Models. Parametric econometric methods require the prior specification of the functional form of the object being estimated. For example, one might assume that the conditional mean function is linear: $m(x) = E[Y|x] = \int y\,\frac{f(y,x)}{f_1(x)}\,dy = \beta_0 + \beta_1 x$. This specification implies a constant response: $\beta(x) = \frac{dE[Y|x]}{dx} = \frac{d}{dx}\left[\int y\,\frac{f(y,x)}{f_1(x)}\,dy\right] = \beta_1$

7 Parametric Methods: Drawbacks. Parametric models impose a priori structure on the underlying DGP. Having assumed that this structure is known, we then estimate a handful of unknown parameters. The choice of model is frequently not based upon any attempt to select the correct parametric specification from the space of admissible models. Rather, model selection is usually made on the basis of tractability and ease of interpretation. The risk is that inference, prediction, and policy are all based upon an incorrectly specified parametric model. The consequences of such mis-specification are well known.

8 Nonparametric Methods Nonparametric estimators estimate objects of interest to economists by replacing unknown densities and distribution functions with their nonparametric density estimators. They are consistent under less restrictive assumptions than those underlying their parametric counterparts. When there is sufficient data, these estimators frequently reveal features of the data that are invisible under parametric techniques. Different features and structures revealed by nonparametric estimators often lead to different conclusions and policy prescriptions than those based upon parametric methods. 7

9 Four uses of nonparametric methods: 1. Visualizing the data 2. Testing and comparing models 3. Conditional mean estimation (regression) 4. Combining parametric and nonparametric methods (semi-parametric estimation)

10 Basic building block: the nonparametric kernel density estimator, $\hat f(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)$

11 We would like to estimate the density, $f(x)$, from a sample $x_1, x_2, \ldots, x_n$. Histogram / naive nonparametric (local histogram) estimator: $\hat f_I(x) = \frac{1}{nh}\sum_{i=1}^{n} I\!\left(-\tfrac{1}{2} < \frac{x_i - x}{h} < \tfrac{1}{2}\right) = \frac{n_x}{nh}$, where $n_x$ is the number of points which lie between $x - \frac{h}{2}$ and $x + \frac{h}{2}$. The choice of $h$ determines the smoothness of the estimate.

12 Replace the indicator function with a smooth weighting function, called a kernel, satisfying $\int K(\psi)\,d\psi = 1$: $\hat f(y) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{y_i - y}{h}\right)$. $K(\psi)$ should be large for small $|\psi|$, small for large $|\psi|$, and $K(\cdot)$ should be symmetric.

13 A large class of functions satisfy these assumptions, for example: (i) Standard normal: $K(\psi) = (2\pi)^{-1/2}\exp\left(-\tfrac{1}{2}\psi^2\right)$. (ii) Uniform: $K(\psi) = (2c)^{-1}$ for $-c < \psi < c$, and $0$ otherwise. (iii) Epanechnikov (1969) [optimal kernel]: $K_0(\psi) = \tfrac{3}{4}\left(1 - \psi^2\right)$ for $|\psi| \le 1$, and $0$ otherwise.
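As an illustrative sketch (not from the slides; function names are my own), the estimator $\hat f(x) = (nh)^{-1}\sum_i K((x_i - x)/h)$ with the standard normal kernel might be coded as:

```python
import numpy as np

def gaussian_kernel(psi):
    """Standard normal kernel: K(psi) = (2*pi)^(-1/2) exp(-psi^2 / 2)."""
    return np.exp(-0.5 * psi ** 2) / np.sqrt(2.0 * np.pi)

def kde(x_eval, data, h, kernel=gaussian_kernel):
    """Kernel density estimate f_hat evaluated at each point of x_eval."""
    data = np.asarray(data, dtype=float)
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    n = len(data)
    # (n, n_eval) array of kernel arguments (x_i - x) / h
    psi = (data[:, None] - x_eval[None, :]) / h
    return kernel(psi).sum(axis=0) / (n * h)
```

Because the kernel integrates to one, so does the resulting estimate, whatever the bandwidth.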

14 In order to implement this estimator, we have to make two choices. 1. Kernel (weight) function: K 2. The smoothing parameter (bandwidth): h It turns out that the choice of kernel does not have much effect on the optimality of the estimator, but that the choice of bandwidth (or window width) has important repercussions for our results. 13

15 Bandwidth selection methods 1. Plug-in methods 2. Likelihood cross-validation 3. Least-squares cross-validation 14

16 All of these methods begin from the same starting point: the bandwidth, $h$, should be chosen so that the estimated density, $\hat f(x)$, is as close as possible to the true density, $f(x)$. Most of the time we employ some kind of global criterion. The most common is the integrated squared error, $\mathrm{ISE} = \int \left(\hat f(x) - f(x)\right)^2 dx$, or its expected value, the mean integrated squared error, $\mathrm{MISE} = E\int \left(\hat f(x) - f(x)\right)^2 dx$. These two quantities correspond to loss and risk, respectively.

17 For independent and identically distributed (i.i.d.) data, it is straightforward to show that $\mathrm{Bias}(\hat f) = E\hat f - f = \int K(\psi)\left[f(h\psi + x) - f(x)\right] d\psi$ and $V(\hat f) = (nh)^{-1}\int K^2(\psi)\,f(h\psi + x)\,d\psi - n^{-1}\left[\int K(\psi)\,f(h\psi + x)\,d\psi\right]^2$

18 The expressions for exact bias and variance are not useful without knowledge of the quantity that we are attempting to estimate: the true underlying density. We can, however, derive approximations to these quantities by expanding $f(h\psi + x)$ in a Taylor series for small $h$: $f(h\psi + x) = f(x) + h\psi f^{(1)}(x) + \frac{h^2}{2}\psi^2 f^{(2)}(x) + \cdots$

19 Given the i.i.d. assumption above, the assumptions made regarding the kernel function, and the following additional assumptions: (A3) the second-order derivatives of $f$ are continuous and bounded in some neighborhood of $x$; (A4) $h = h_n \to 0$ as $n \to \infty$; (A5) $nh_n \to \infty$ as $n \to \infty$; we can show that up to $O(h^2)$ the bias is given by $\mathrm{Bias}(\hat f) = \frac{h^2}{2}\mu_2 f^{(2)}(x)$, where $\mu_2 = \int \psi^2 K(\psi)\,d\psi$, and up to $O\!\left((nh)^{-1}\right)$ the variance is given by $V(\hat f) = (nh)^{-1} f(x)\int K^2(\psi)\,d\psi$

20 $\mathrm{MISE} = \int\left[\left(\mathrm{Bias}(\hat f)\right)^2 + \mathrm{Var}(\hat f)\right] dx$. The approximate MISE, using the above expressions, is $\mathrm{AMISE} = \frac{h^4}{4}\mu_2^2\int\left(f^{(2)}(x)\right)^2 dx + (nh)^{-1}\int f(x)\,dx\int K^2(\psi)\,d\psi = \frac{1}{4}\lambda_1 h^4 + \lambda_2 (nh)^{-1}$, where $\lambda_1 = \mu_2^2\int\left(f^{(2)}(x)\right)^2 dx$, $\lambda_2 = \int K^2(\psi)\,d\psi$, and $\int f(x)\,dx = 1$. (1)

21 The optimal window width, in the sense that the approximate mean integrated squared error is minimized, will be $h^{*} = c\,n^{-1/5}$, where $c = (\lambda_2/\lambda_1)^{1/5}$ follows from minimizing (1) with respect to $h$.

22 Assuming a normal kernel and a normal density, $f(x)$, both $\lambda_1$ and $\lambda_2$ can be evaluated numerically. This provides $h^{*} = 1.06\,\sigma_x\,n^{-1/5}$. Software packages which implement nonparametric density estimation (SAS, Shazam, Stata) use this as the default window width. For non-normal distributions it works well as a first approximation. It can also provide a good starting point for data-driven methods of bandwidth selection (see below). It is by far the most commonly used window width in the literature.

23 Silverman (1986) provides several other alternatives which work well for heavily skewed or multi-modal data. A simple improvement is to replace $\sigma$ by a robust estimator of spread; he specifies two alternatives that seem to work well: $h = 0.79\,R\,n^{-1/5}$ and $h = 0.9\,A\,n^{-1/5}$, where $R$ is the inter-quartile range and $A = \min\left(\sigma, R/1.34\right)$.
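These rules of thumb are trivial to compute; a sketch (function and key names are illustrative):

```python
import numpy as np

def silverman_bandwidths(x):
    """Normal-reference rule h = 1.06 * sigma * n^(-1/5) and Silverman's
    robust variants using the inter-quartile range R and A = min(sigma, R/1.34)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sigma = x.std(ddof=1)
    q75, q25 = np.percentile(x, [75.0, 25.0])
    R = q75 - q25
    A = min(sigma, R / 1.34)
    scale = n ** (-1.0 / 5.0)
    return {
        "normal_reference": 1.06 * sigma * scale,
        "robust_R": 0.79 * R * scale,
        "robust_A": 0.9 * A * scale,
    }
```

Since $A \le \sigma$, the robust bandwidth $0.9\,A\,n^{-1/5}$ never exceeds the normal-reference one, so it guards against oversmoothing.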

24 Least Squares Cross-Validation. This is a data-driven technique for choosing the optimal bandwidth; the idea is to minimize a particular criterion function. In least-squares cross-validation the function minimized is $\mathrm{ISE}(h) = \int\left(\hat f(x) - f(x)\right)^2 dx = \int \hat f^2\,dx + \int f^2\,dx - 2\int \hat f f\,dx$. Since $\int f^2\,dx$ does not depend upon $h$, the function minimized in practice is actually $\int \hat f^2\,dx - 2\int \hat f f\,dx$.

25 Further manipulation yields $\mathrm{ISE}^{*}(h) = n^{-2}h^{-1}\sum_{i=1}^{n}\sum_{j=1}^{n} \bar K\!\left(\frac{x_i - x_j}{h}\right) - 2n^{-1}\sum_{i=1}^{n} \hat f_{-i}(x_i)$ as the function that is actually minimized, where $\bar K = K * K$ denotes the convolution of the kernel with itself. Here $\hat f_{-i}(x_i)$ is the leave-one-out estimator, formed as a standard kernel density estimator omitting the $i$th observation, and $n^{-1}\sum_{i=1}^{n}\hat f_{-i}(x_i)$ provides an unbiased estimate of $\int \hat f f\,dx$. Most programs actually implement the leave-one-out estimator as the density estimate itself, since it minimizes the influence of solitary observations.
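For a Gaussian kernel the double sum has a closed form, because the convolution of two standard normal kernels is the $N(0,2)$ density. A sketch of the criterion plus a simple grid search over $h$ (all names illustrative):

```python
import numpy as np

def lscv_criterion(h, data):
    """Least-squares CV objective for a Gaussian kernel:
    integral of f_hat^2 minus twice the mean leave-one-out density."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    d = (x[:, None] - x[None, :]) / h
    # K*K for the standard normal kernel is the N(0, 2) density
    kbar = np.exp(-0.25 * d ** 2) / (2.0 * np.sqrt(np.pi))
    term1 = kbar.sum() / (n ** 2 * h)
    # leave-one-out densities: drop the i == j terms
    k = np.exp(-0.5 * d ** 2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(k, 0.0)
    loo = k.sum(axis=1) / ((n - 1) * h)
    return term1 - 2.0 * loo.mean()

def lscv_bandwidth(data, grid):
    """Return the h on `grid` minimizing the LSCV criterion."""
    scores = [lscv_criterion(h, data) for h in grid]
    return float(grid[int(np.argmin(scores))])
```

The criterion blows up as $h \to 0$ (the diagonal of the double sum dominates), which is what prevents the degenerate zero-bandwidth solution.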

26 Likelihood Cross-Validation. The basic idea behind this method is to choose an $h$ which maximizes the likelihood $\log L = \sum_{i=1}^{n}\log f(x_i)$. An estimated (pseudo) log-likelihood can be written as $\log \hat L = \sum_{i=1}^{n}\log \hat f(x_i) = \log L(h)$, where $\hat f(x_i)$ is a density estimator of $f$ and depends on $h$. Maximizing $\log \hat L$ with respect to $h$ produces a trivial maximum at $h = 0$. To overcome this problem, the cross-validation principle may be adopted, in which $\hat f(x_i)$ is replaced by the leave-one-out estimator $\hat f_{-i}(x_i)$.

27 This leave-one-out version of the estimator can be written as $\hat f_{-i}(x_i) = \left((n-1)h\right)^{-1}\sum_{j \ne i} K\!\left(\frac{x_j - x_i}{h}\right)$. Thus the likelihood CV principle is to choose $h$ such that $\log L(h) = \sum_{i=1}^{n}\log \hat f_{-i}(x_i)$ is a maximum. The procedure is also known as Kullback-Leibler cross-validation, in the sense that it gives an $h$ for which the Kullback-Leibler distance between the two densities $f$ and $\hat f$, $I(f, \hat f) = \int f(x)\log\left\{\frac{f(x)}{\hat f(x)}\right\} dx$, is a minimum; see Hall (1987).

28 A disadvantage of the $h$ obtained by likelihood CV is that it can be severely affected by the tail behavior of $f$. Furthermore, Hall (1987) has indicated that selecting $h$ by minimizing the Kullback-Leibler measure may be useful for the statistical discrimination problem but not for curve estimation. Thus the likelihood CV procedure has not proven to be of much current interest in the literature.

29 Other density estimation techniques: Nearest Neighbor Density Estimation. Let $d(x_1, x)$ represent the distance of the point $x_1$ from the point $x$, and for each $x$ denote by $d_k(x)$ the distance of $x$ from its $k$th nearest neighbor ($k$-NN) among $x_1, \ldots, x_n$. Then, taking $h = 2d_k(x)$, the estimator can be written as $\hat f_{k\text{-NN}}(x) = \frac{\#\{x_1, \ldots, x_n\} \text{ in } \left[x - d_k(x),\, x + d_k(x)\right]}{2nd_k(x)} = \frac{k}{2nd_k(x)} = \frac{1}{2nd_k(x)}\sum_{i=1}^{n} I\!\left(\left|\frac{x_i - x}{2d_k(x)}\right| < \tfrac{1}{2}\right)$. The degree of smoothing is controlled by the integer $k$, typically $k \approx n^{1/2}$.
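A direct sketch of the univariate $k$-NN estimator (names illustrative):

```python
import numpy as np

def knn_density(x_eval, data, k):
    """k-nearest-neighbour density estimate f_hat(x) = k / (2 n d_k(x)),
    with d_k(x) the distance from x to its k-th nearest sample point."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    out = []
    for x in np.atleast_1d(np.asarray(x_eval, dtype=float)):
        d_k = np.sort(np.abs(data - x))[k - 1]
        out.append(k / (2.0 * n * d_k))
    return np.array(out)
```

Unlike the fixed-bandwidth kernel estimator, the effective window $2d_k(x)$ adapts to the local density; the price is that the resulting estimate has heavy tails and does not integrate to one.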

30 Series estimation. Suppose $X$ is a random variable with density $f$ on the unit interval $[0,1]$. Under these circumstances it can be expressed as the Fourier series $f(x) = \sum_{j=0}^{\infty} a_j \zeta_j(x)$, where, for each $j \ge 0$, the coefficients are $a_j = \int_0^1 f(x)\zeta_j(x)\,dx = E\zeta_j(X)$, and the sequence $\zeta_j(x)$ is given by $\zeta_0(x) = 1$, $\zeta_j(x) = \sqrt{2}\cos \pi(j+1)x$ when $j$ is odd, and $\zeta_j(x) = \sqrt{2}\sin \pi j x$ when $j$ is even.

31 Using $\hat a_j = n^{-1}\sum_{i=1}^{n}\zeta_j(x_i)$ as an estimator of $a_j$, the orthogonal series estimator is defined as $\hat f(x) = \sum_{j=0}^{m}\hat a_j\zeta_j(x)$, where $m$ is the cutoff point in the infinite sum and determines the amount of smoothing. The regression analog of this is to express the conditional mean of $y$ as an infinite polynomial in $x$.
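A sketch of the series estimator on $[0,1]$, with the basis as defined on the previous slide (function names illustrative):

```python
import numpy as np

def zeta(j, x):
    """Orthonormal Fourier basis on [0,1]: zeta_0 = 1,
    zeta_j = sqrt(2) cos(pi (j+1) x) for odd j, sqrt(2) sin(pi j x) for even j."""
    x = np.asarray(x, dtype=float)
    if j == 0:
        return np.ones_like(x)
    if j % 2 == 1:
        return np.sqrt(2.0) * np.cos(np.pi * (j + 1) * x)
    return np.sqrt(2.0) * np.sin(np.pi * j * x)

def series_density(x_eval, data, m):
    """Orthogonal series estimator f_hat(x) = sum_{j<=m} a_hat_j zeta_j(x),
    where a_hat_j is the sample mean of zeta_j."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    data = np.asarray(data, dtype=float)
    f = np.zeros_like(x_eval)
    for j in range(m + 1):
        f += zeta(j, data).mean() * zeta(j, x_eval)
    return f
```

Since $\hat a_0 = 1$ by construction and the other basis functions integrate to zero, the estimate always integrates to one, though it can go negative where the series truncation bites.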

32 Variable window width estimators. Another option is to let the window width vary with each point in the data according to some rule. The estimator will then have the form $\hat f_{vww}(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_{ni}} K\!\left(\frac{x_i - x}{h_{ni}}\right)$. In general, the rule should allow larger $h$ in regions where there are few observations and smaller $h$ where observations are densely located.

33 Penalized likelihood estimators Local Log-Likelihood estimators Both of these techniques treat f(x) as an unknown parameter and try to employ likelihood methods to estimate the unknown quantity. The global likelihood has no finite maximum over the class of all densities, so options are to instead maximize a penalized likelihood function (which imposes some pre-determined amount of smoothness on the function) or the local, kernel-weighted, log-likelihood. 32

34 Example 1: Eruption length of the Old Faithful geyser in Yellowstone National Park. Example 2: Hamilton and Lin (1996) model of excess stock returns from the Standard and Poor's 500. Example 3: Aït-Sahalia (1996) nonparametric test of interest rate diffusion models.

35 Multivariate Density Estimation. Consider a bivariate distribution where the $i$th sample observation is given by $(y_i, x_i)$ and $z = (y, x)$ is a fixed point. This can be estimated nonparametrically by $\hat f(y, x) = \hat f(z) = \frac{1}{nh^2}\sum_{i=1}^{n} K_1\!\left(\frac{z_i - z}{h}\right)$

36 The kernel estimator of the marginal density $f_1(x)$ of $X$ is $\hat f_1(x) = \int \hat f(y, x)\,dy = \frac{1}{nh^2}\sum_{i=1}^{n}\int K_1\!\left(\frac{y_i - y}{h}, \frac{x_i - x}{h}\right) dy = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)$, where $K(x) = \int K_1(y, x)\,dy$ is such that $\int K(x)\,dx = 1$. The estimator of the conditional density of $Y$ given $X$ can then be written as $\hat f(y|x) = \frac{\hat f(y, x)}{\hat f_1(x)}$

37 In general, for a multivariate density estimation problem of dimension $d$, the optimal $h$ which minimizes the approximate MISE can be found by substituting $nh^d$ for $nh$ in the MISE expression given earlier and minimizing with respect to $h$. It is easy to show that $h^{*} = c\,n^{-1/(4+d)}$, and, for this $h$, $\mathrm{AMISE} = O\!\left(n^{-4/(4+d)}\right)$. When the kernel is multivariate standard normal, $c = \left\{4/(2d+1)\right\}^{1/(d+4)}$.

38 Curse of dimensionality. It is clear from this result that the higher the dimension $d$, the slower the speed of convergence of $\hat f$ to $f$. Thus one may need a large sample to estimate a multivariate density in high dimensions.

39 Multivariate Kernels. Standard multivariate normal density, where $d = \dim(\psi)$: $K(\psi) = (2\pi)^{-d/2}\exp\left(-\tfrac{1}{2}\psi'\psi\right)$. Multivariate Epanechnikov kernel: $K_c(\psi) = \tfrac{1}{2}\,c_d^{-1}(d+2)\left(1 - \psi'\psi\right)$ if $\psi'\psi < 1$, and $0$ otherwise, where $c_d$ is the volume of the unit $d$-dimensional sphere ($c_1 = 2$, $c_2 = \pi$, $c_3 = 4\pi/3$).

40 One disadvantage with direct application of the kernels above is that the variables may exhibit disparate variation. To overcome this problem it is good practice to work with standardized data, i.e., normalized by the standard deviation or some measure of scale. Then each of the elements in ψ will have unit variance and application of a kernel such as the multivariate standard normal is appropriate. 39

41 Conditional mean estimation. Consider $q + 1 = p$ economic variables $(Y, X')$ where $Y$ is the dependent variable and $X$ is a $(q \times 1)$ vector of regressors; these $p$ variables are taken to be completely characterized by their unknown joint density $f(y, x_1, \ldots, x_q) = f(y, x)$ at the points $y, x$. As noted in the introduction, interest frequently centres upon the conditional mean $m(x) = E(Y|X = x)$, where $x$ is some fixed value of $X$. Now suppose that we have $n$ data points $(y_i, x_i)$. By definition, $Y_i = E(Y_i|X_i = x_i) + u_i = m(x_i) + u_i$, where the error term $u_i$ has the properties $E(u_i|x_i) = 0$ and $E(u_i^2|x_i) = \sigma^2(x_i)$.

42 Parametric Estimation. Parametric methods specify a form for $m(x_i)$; in the case of a linear specification, $y_i = \alpha + x_i\beta + u_i$. The least squares estimators of $\alpha$ and $\beta$ are $\hat\alpha = \bar y - \bar x\hat\beta$ and $\hat\beta = \left(\sum_{i=1}^{n}(x_i - \bar x)^2\right)^{-1}\left(\sum_{i=1}^{n}(x_i - \bar x)y_i\right)$. The best unbiased parametric estimator of $m(x) = \alpha + x\beta$ is $m^{*}(x) = \hat\alpha + x\hat\beta = \sum_{i=1}^{n} a_{ni}(x)\,y_i$ (2), where $a_{ni}(x) = n^{-1} + (x - \bar x)(x_i - \bar x)\left(\sum_{i=1}^{n}(x_i - \bar x)^2\right)^{-1}$. The $m^{*}$ in (2) is a weighted sum of the $y_i$, where the weights $a_{ni}$ are linear in $x$ and depend on the distance of $x_i$ from $\bar x$.

43 The assumption that $m(x_i) = \alpha + x_i\beta$ implies certain assumptions about the data generating process (joint density). For example, if $(y_i, x_i)$ is bivariate normal then it can be shown that the mean of the conditional density of $y_i$ given $x_i$ is $E(y_i|x_i) = \alpha + x_i\beta$, where $\alpha = Ey_i - (Ex_i)\beta$ and $\beta = \left(\mathrm{var}(x_i)\right)^{-1}\mathrm{cov}(x_i, y_i)$. This implies that the assumption of a linear specification for $m(x)$ holds if the data come from the normal distribution. However, if the true distribution is not normal then the linear specification for the conditional expectation may be invalid, and the least squares estimator of $m(x)$ will be biased and inconsistent.

44 For example, suppose the true relationship is $y_i = \alpha + x_i\beta + x_i^2\gamma + u_i$; then the parameter of interest is $\beta + 2\gamma x_i = \partial E(y_i|x_i)/\partial x_i$. However, if a linear approximation is taken, $\partial E(y_i|x_i)/\partial x_i$ is being estimated under the false restriction that $\gamma = 0$. Typically, the exact functional form connecting $m(x)$ with $x$ is unknown. Because forcing the function to be linear or quadratic may affect the accuracy of estimation of $m(x)$, it is worthwhile considering nonparametric estimation of the unknown function, and this task is taken up in the following sections.

45 Kernel-Based Estimation. Suppose that the $x_i$ are i.i.d. random variables. Because $m(x_i)$ is the mean of the conditional density $f(y_i|x_i) = f(y|X = x_i)$, there is a potential to employ the methods of density estimation seen earlier. By definition the conditional mean is $m = \int \left(y\,f(y, x)/f_1(x)\right) dy$, (3) where $f_1(x)$ is the marginal density of $X$ at $x$. Nadaraya (1964) and Watson (1964) therefore proposed that $m$ be estimated by replacing $f(y, x)$ by $\hat f(y, x)$ and $f_1(x)$ by $\hat f_1(x)$, where these density estimators are the kernel estimators discussed above.

46 The expressions for $\hat f(y, x)$ and $\hat f_1(x)$ from the first part of this talk may be substituted into (3) to give $\hat m = \int y\left[\frac{(nh^p)^{-1}\sum_{i=1}^{n} K_1\!\left(\frac{y_i - y}{h}, \frac{x_i - x}{h}\right)}{(nh^q)^{-1}\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)}\right] dy$, (4) where $p = q + 1$ and $h$ is the window width. Some simplification yields $\hat m = \left[(nh^q)^{-1}\sum_{i=1}^{n} y_i K\!\left(\frac{x_i - x}{h}\right)\right] \Big/ \left[(nh^q)^{-1}\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)\right] = \sum_{i=1}^{n} y_i K\!\left(\frac{x_i - x}{h}\right) \Big/ \sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)$
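The final ratio is easy to code; a sketch for scalar $x$ with a Gaussian kernel (names illustrative):

```python
import numpy as np

def nadaraya_watson(x_eval, x, y, h):
    """Nadaraya-Watson estimator:
    m_hat(x) = sum_i y_i K((x_i - x)/h) / sum_i K((x_i - x)/h)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    psi = (x[:, None] - np.atleast_1d(np.asarray(x_eval, dtype=float))[None, :]) / h
    w = np.exp(-0.5 * psi ** 2)  # unnormalized Gaussian kernel weights
    return (w * y[:, None]).sum(axis=0) / w.sum(axis=0)
```

Note that any normalizing constant of the kernel cancels between numerator and denominator, so unnormalized weights suffice.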

47 A feature of the Nadaraya-Watson estimator is that it is a weighted sum of those $y_i$'s that correspond to $x_i$ in a neighborhood of $x$. The weights are low for $x_i$'s far away from $x$ and high for $x_i$'s closer to $x$. With this motivation, a general class of nonparametric estimators of $m(x)$ can be written as $\hat m = \hat m(x) = \sum_{i=1}^{n} w_{ni}(x)\,y_i$, where $w_{ni}(x) = w_n(x_i, x)$ represents the weight assigned to the $i$th observation $y_i$, and depends on the distance of $x_i$ from the point $x$. Note that the parametric estimator $m^{*}(x)$ in (2) is a special case with linear weights $w_{ni}(x) = a_{ni}(x)$ such that $\sum w_{ni}(x) = 1$, though $w_{ni}(x) \ge 0$ is not necessarily true.

48 An implicit assumption in nonparametric estimation is that $m(x)$ is smooth over $x$, implying that $y_i$ contains information about $m(x)$ whenever $x_i$ is near $x$. The estimator $\hat m(x)$ is a smoothed estimator in the sense that it is constructed, at every point, by local averaging of the observations $y_i$ corresponding to those $x_i$ close to $x$ in some sense. In parametric regression, a functional form is specified for the conditional mean $m(x)$. This functional form, say $m(x, \beta)$, depends on a finite number of unknown parameters $\beta$. The least squares estimate of $m = m(x)$ is $m(x, \hat\beta)$, where $\hat\beta$ is chosen to minimize $\sum_{i=1}^{n}\left(y_i - m(x_i, \hat\beta)\right)^2$. (5)

49 Compare (5) with the following weighted least squares criterion for the nonparametric estimation of $m(x)$: $\sum_{i=1}^{n} w^{*}_{ni}(x)\left[y_i - m(x)\right]^2$. (6) In (6), $m(x)$ replaces the $m(x, \beta)$ that appears in (5). If $m(x)$ is regarded as a single unknown parameter $m$, it may be estimated by minimizing $\sum_{i=1}^{n} w^{*}_{ni}(x)\left[y_i - m\right]^2$. (7) The resulting estimate, $\hat m$, of $m(x)$ is precisely the Nadaraya-Watson estimator. Thus the kernel estimator $\hat m$ is also a least squares estimator, with $w^{*}_{ni}(x) = K\left((x_i - x)/h\right)$.

50 One might also think of $\hat m(x)$ as a method of moments estimator. Since $E(u_i|x_i) = 0$, we have $E\,w^{*}_{ni}(x)\left(y_i - m(x_i)\right) = 0$, (8) that is, $E\left[w^{*}_{ni}(x)(y_i - m) + w^{*}_{ni}(x)(m - m(x_i))\right] = 0$. (9) If the second term in (9) is ignored and a sample estimate of the first, $n^{-1}\sum_{i=1}^{n} w^{*}_{ni}(x)(y_i - m)$, is used, the value of $m$ for which this is zero is again the Nadaraya-Watson estimator.

51 Whether the second term can be ignored depends upon the weights $w^{*}_{ni}(x)$. If the weights were the indicator functions of the local histogram presented earlier, the second term would be identically zero, whereas with kernel weights it is only asymptotically zero. Because the orthogonality relation only holds as $n \to \infty$, the situation is outside the framework described by Hansen (1982), but it is close to work reported in Powell (1986), in that the expected value of the function the parameter solves changes with the sample size (through $h$), and so its large-sample limit has to be used instead.

52 Local Linear Nonparametric Regression. The Nadaraya-Watson estimator of $m(x)$ minimizes $\sum_{i=1}^{n}\left(y_i - \alpha\right)^2 K\!\left(\frac{x_i - x}{h}\right)$ with respect to $\alpha$, giving $\hat m(x) = \hat\alpha = \left[\sum K\!\left(\frac{x_i - x}{h}\right)\right]^{-1}\sum K\!\left(\frac{x_i - x}{h}\right) y_i$. Stone (1977) and Cleveland (1979) suggested that one instead minimize $\sum_{i=1}^{n}\left(y_i - \alpha - (x_i - x)\beta\right)^2 K\!\left(\frac{x_i - x}{h}\right)$ with respect to $\alpha$ and $\beta$, and set $\hat m(x)$ equal to the resulting estimate of $\alpha$.

53 This estimate can be found by performing a weighted least squares regression of $y_i$ against $z_i = (1, (x_i - x))'$ with weights $\left[K\!\left(\frac{x_i - x}{h}\right)\right]^{1/2}$. Thus, while the Nadaraya-Watson estimator fits a constant to the data close to $x$, the local linear approximation fits a straight line. This local linear smoothing estimator has been extensively investigated by Fan (1992a, 1993), Fan and Gijbels (1992), and Ruppert and Wand (1994).

54 The resulting estimator has the form $\hat m_{LL}(x) = \sum_{i=1}^{n} w^{LL}_{ni}(x)\,y_i$, with weights $w^{LL}_{ni} = e_1'\left(\sum_i z_i K_i z_i'\right)^{-1} z_i K_i$, where $e_1$ is a column vector of the same dimension as $z_i$ with unity as first element and zeros elsewhere. One advantage of this estimator is that it can be analysed with standard regression techniques, and it has the same first-order statistical properties irrespective of whether the $x_i$ are stochastic or non-stochastic. The optimal window width is proportional to $n^{-1/5}$.
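A sketch of the weighted-least-squares form (names illustrative). A useful property worth noting: because the local fit includes a slope term, the estimator reproduces a linear $m(x)$ exactly, which is also the intuition for its reduced boundary bias relative to Nadaraya-Watson.

```python
import numpy as np

def local_linear(x_eval, x, y, h):
    """Local linear estimator: at each point x0, regress y on (1, x_i - x0)
    by kernel-weighted least squares and keep the intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    out = []
    for x0 in np.atleast_1d(np.asarray(x_eval, dtype=float)):
        d = x - x0
        sqrt_k = np.exp(-0.25 * (d / h) ** 2)  # square root of Gaussian weights
        Z = np.column_stack([np.ones_like(d), d])
        coef, *_ = np.linalg.lstsq(sqrt_k[:, None] * Z, sqrt_k * y, rcond=None)
        out.append(coef[0])  # alpha_hat = m_hat(x0)
    return np.array(out)
```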

55 Applications of the idea in econometrics include McManus (1994) on estimation of cost functions, Gouriéroux and Scaillet (1994) on the term structure, Lin and Shu (1994) on estimation of a disequilibrium transition model, Bossaerts and Hillion (1997) on option prices and their determinants, and Ullah and Roy (1996) on a nutrition/income relation. Implementation and computation are discussed in Cleveland et al. (1988). Hastie and Loader (1993) provide an excellent account of the history and potential of the method.

56 The logic of local linear regression smoothing can be seen by expanding $m(x_i)$ around $x$ to get $m(x_i) = m(x) + \frac{\partial m}{\partial x}(x^{*})\,(x_i - x)$, (10) where $x^{*}$ lies between $x_i$ and $x$. This may be expressed as $m(x_i) = \alpha + \beta(x^{*})(x_i - x)$. (11)

57 Now, since $E(y_i|x_i) = m(x_i)$, the objective function $\sum\left(y_i - m(x_i)\right)^2 K_i = \sum\left(y_i - \alpha - \beta(x^{*})(x_i - x)\right)^2 K_i$ is essentially the residual sum of squares from a regression using only observations close to $x_i = x$. Notice that this means $\beta(x^{*})$ will be very close to constant, as $x^{*}$ must lie between $x_i$ and $x$. This also points to the fact that improvements might be available from expanding $m(x_i)$ as a $j$th-order polynomial in $(x_i - x)$, but doing so requires the derivatives $m^{(j)}$ to exist.

58 Example 4 Eruption Length of Old Faithful Geyser Conditional on Waiting Time 57

59 Other Notes. The optimal $h$ can be found by minimizing the MISE, as in the density case, and it can be shown that $h_{opt} \propto n^{-1/(q+4)}$. Cross-validation may be performed by minimizing the estimated prediction error (EPE), $n^{-1}\sum\left(y_i - \hat m_{-i}(x_i)\right)^2$, where $\hat m_{-i}(x_i)$ is computed as the leave-one-out estimator deleting the $i$th observation from the sums. To appreciate why minimizing EPE is sensible, notice that when the leave-one-out estimator is employed and observations are independent, $\hat m_{-i}$ is independent of $y_i$, so that $E\left[(\hat m_{-i} - m_i)(y_i - m_i)\right] = 0$, and hence $E(\mathrm{EPE}) = \sigma^2 + E\left(n^{-1}\sum(\hat m_{-i} - m_i)^2\right) = \sigma^2 + \mathrm{MASE}$

60 Minimizing $E(\mathrm{EPE})$ with respect to $h$ is therefore equivalent to minimizing MASE with respect to $h$. Unfortunately, minimizing the sample EPE tends to produce an estimator of $h$ that converges only extremely slowly, at order $n^{-1/10}$, to the value of $h$ minimizing $E(\mathrm{EPE})$. The curse of dimensionality means that pure nonparametric regression is difficult to use in higher-dimensional problems.

61 Semi-parametric estimation. A number of models in the literature have the distinguishing feature that part of the model is linear and part constitutes an unknown non-linear form: $y_i = x_{1i}'\beta + g_1(x_{2i}) + u_i$, (12) which could be written in matrix form as $y = X_1\beta + g_1 + u$. (13) In (12), $x_{1i}$ cannot have unity as an element.

62 This intercept restriction is an identification condition arising from the fact that $g_1(x_{2i})$ is unconstrained and can therefore have a constant term as part of its definition. Hence, it would always be possible to add any constant to (12) and then absorb it into $g_1(x_{2i})$, showing that, without some further restriction upon the nature of $g_1(x_{2i})$, it is impossible to consistently estimate an intercept. This issue of identification of parameters, particularly as regards the intercept, but sometimes a scale parameter as well, arises a good deal in the semi-parametric literature and needs to be dealt with by imposing some restrictions. The parameter of interest is $\beta$, so the issue is how to estimate it in the presence of the unknown function $g_1$.

63 A Semi-Parametric Estimator of $\beta$. Taking the conditional expectation of (13) leads to $E(y_i|x_{2i}) = E(x_{1i}|x_{2i})'\beta + g_1(x_{2i})$. Consequently, $y_i - E(y_i|x_{2i}) = \left(x_{1i} - E(x_{1i}|x_{2i})\right)'\beta + u_i$ (14) and $g_1(x_{2i}) = E(y_i|x_{2i}) - E(x_{1i}|x_{2i})'\beta$. (15)

64 Since (14) has the properties of a linear regression model with dependent variable $y_i - E(y_i|x_{2i})$ and independent variables $x_{1i} - E(x_{1i}|x_{2i})$, an obvious estimator of $\beta$ is $\hat\beta = \left[\sum_{i=1}^{n}\left(x_{1i} - \hat m_{12i}\right)\left(x_{1i} - \hat m_{12i}\right)'\right]^{-1}\left[\sum_{i=1}^{n}\left(x_{1i} - \hat m_{12i}\right)\left(y_i - \hat m_{2i}\right)\right]$, (16) where $\hat m_{12i}$ and $\hat m_{2i}$ are the kernel-based estimators of $m_{12i} = E(x_{1i}|x_{2i})$ and $m_{2i} = E(y_i|x_{2i})$.

65 Once $\hat\beta$ is found, $g_1(x_{2i})$ can be estimated from (15) as $\hat g_1(x_{2i}) = \hat m_{2i} - \hat m_{12i}'\hat\beta$. (17) For example, Stock (1989) works with this model but is particularly interested in estimating $g_1(x_{2i})$ rather than $\beta$. The kernel estimator for $\beta$ in the context of (13) was analyzed by Robinson (1988).
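A sketch of the double-residual procedure for scalar $x_1$ and $x_2$, with a Nadaraya-Watson first stage (all names and the simulated DGP in the test are illustrative, not from the slides):

```python
import numpy as np

def nw(x_eval, x, y, h):
    """Scalar Nadaraya-Watson regression with a Gaussian kernel."""
    psi = (np.asarray(x, dtype=float)[:, None]
           - np.atleast_1d(np.asarray(x_eval, dtype=float))[None, :]) / h
    w = np.exp(-0.5 * psi ** 2)
    return (w * np.asarray(y, dtype=float)[:, None]).sum(axis=0) / w.sum(axis=0)

def robinson(y, x1, x2, h):
    """Robinson-style estimator of beta in y = x1*beta + g1(x2) + u:
    regress y - E_hat[y|x2] on x1 - E_hat[x1|x2], as in (14)-(17)."""
    y = np.asarray(y, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    m2 = nw(x2, x2, y, h)    # E_hat[y | x2] at the sample points
    m12 = nw(x2, x2, x1, h)  # E_hat[x1 | x2] at the sample points
    ry, rx = y - m2, x1 - m12
    beta = (rx @ ry) / (rx @ rx)
    g_hat = m2 - m12 * beta  # g1_hat as in (17)
    return beta, g_hat
```

The key point is that only the part of $x_1$ not explained by $x_2$ identifies $\beta$, so $x_1$ must vary independently of $x_2$ to some degree.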

66 Differencing. Consider again the partial linear model $y_i = x_{1i}\beta + g_1(x_{2i}) + \varepsilon_i$, (18) where $x_1$ is a scalar. Order the data by $x_2$, from smallest to largest, so that $x_{21} \le x_{22} \le \cdots \le x_{2n}$. Suppose that $x_1$ is a smooth function of $x_2$, with $E[x_1|x_2] = g(x_2)$, and therefore $x_1 = g(x_2) + u$.

67 Differencing adjacent observations gives $y_i - y_{i-1} = \left(x_{1i} - x_{1,i-1}\right)\beta + \left(g_1(x_{2i}) - g_1(x_{2,i-1})\right) + \varepsilon_i - \varepsilon_{i-1} = \left(g(x_{2i}) - g(x_{2,i-1})\right)\beta + (u_i - u_{i-1})\beta + \left(g_1(x_{2i}) - g_1(x_{2,i-1})\right) + \varepsilon_i - \varepsilon_{i-1}$. Provided that the functions $g_1$ and $g$ are sufficiently smooth and the data are sufficiently dense, the differences $g_1(x_{2i}) - g_1(x_{2,i-1})$ and $g(x_{2i}) - g(x_{2,i-1})$ should be very small, justifying the approximations $x_{1i} - x_{1,i-1} \approx u_i - u_{i-1}$ and $y_i - y_{i-1} \approx (u_i - u_{i-1})\beta + \varepsilon_i - \varepsilon_{i-1}$

68 The non-parametric difference estimator of $\beta$ is simply $\hat\beta_{diff} = \frac{\sum\left(x_{1i} - x_{1,i-1}\right)\left(y_i - y_{i-1}\right)}{\sum\left(x_{1i} - x_{1,i-1}\right)^2}$, which converges at the usual $\sqrt{n}$ rate, with limiting normal distribution $\hat\beta_{diff} \overset{D}{\sim} N\!\left(\beta,\ \frac{1.5\,\sigma_\varepsilon^2}{n\,\sigma_u^2}\right)$
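The estimator is a one-liner once the data are sorted by $x_2$; a sketch on simulated data (the DGP in the test is illustrative):

```python
import numpy as np

def difference_estimator(y, x1, x2):
    """First-difference estimator of beta in y = x1*beta + g1(x2) + eps:
    sort by x2, difference adjacent observations, regress through the origin."""
    order = np.argsort(np.asarray(x2, dtype=float))
    y = np.asarray(y, dtype=float)[order]
    x1 = np.asarray(x1, dtype=float)[order]
    dy, dx = np.diff(y), np.diff(x1)
    return float((dx @ dy) / (dx @ dx))
```

Note that no bandwidth is needed at all; the price is the 1.5 inflation factor in the asymptotic variance shown above.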

69 Example 5: Yatchew and No (2001) Gasoline Demand in Canada 68

70 Binary Choice Models. We often start with the idea of an underlying linear (latent variable) model $y_i^{*} = x_i'\beta + u_i$, (19) with $y_i = 1$ when $y_i^{*} > 0$ and $y_i = 0$ otherwise. The standard approach to estimating $\beta$ in (19) is via maximum likelihood. The log-likelihood function for a sample of size $n$ is $\log L = \sum_{i=1}^{n}\left[y_i\ln(G_i) + (1 - y_i)\ln(1 - G_i)\right]$, (20) where $G_i = \int_{-\infty}^{x_i'\beta} g(u)\,du = \mathrm{Prob}(u_i < x_i'\beta)$

71 $G$ is assumed to be normal (probit) or logistic (logit) in most applications. Klein and Spady (1993) propose to estimate a smooth version of the likelihood that locally approximates the parametric likelihood. Note that $x_i'\beta$ could be written in more general terms, but Klein and Spady retain the linear index function in their method. The key transformation is to note that $G$ in (20) is the probability that $u_i$ is less than $x_i'\beta$, conditional on the index function and the parameter $\beta$. By Bayes' rule ($\mathrm{Prob}(A|B) = \mathrm{Prob}(B|A)\,\mathrm{Prob}(A)/\mathrm{Prob}(B)$), this can be written as $G[x_i'\beta; \beta] = \mathrm{Prob}(y = 1)\,\frac{g_{\upsilon|y=1}}{g_{\upsilon}}$, (21) where $g_{\upsilon|y=1}$ is the density of the index function conditional on $y = 1$ and $g_{\upsilon}$ is the unconditional density of the index function.

72 These can both be estimated nonparametrically using standard kernel techniques while the Prob(y = 1) can be estimated as the sample fraction of observations with y i =1. 71

73 Ichimura and Thompson (1998) propose a wider class of estimators based upon a random coefficients approach: $y_i^{*} = x_i'\beta_i + u_i$, (22) with $y_i = 1$ when $y_i^{*} > 0$ and $y_i = 0$ otherwise. The distribution of $\beta_i$ is estimated by nonparametric methods with few restrictions. Ai and Chen (Econometrica, 2003) have proposed a better method for estimating binary choice models which is currently considered the state of the art.

74 Additional notes on bandwidth selection Plug-in methods Usually reserved for simple density estimation Fan and Gijbels (1996) provide plug-in estimators for regression estimation Least-squares cross-validation popular in many applications Ichimura and Todd (2004, Handbook of Econometrics V) find that this method works well in a simulation study The biggest problem with least-squares cross-validation happens when the data are sparse. In this case the method tends to choose a bandwidth which is too large in order to avoid having zero densities in any area (the criterion takes on an unbounded value if the density is zero at any point). 73

75 Variable bandwidth selection methods result in estimates that are no longer densities. Thus global bandwidth selection methods tend to be preferred There are also bootstrap bandwidth selection methods which tend to be very computationally intensive 74

76 Reducing the curse of dimensionality Restricting the class of models ex: Separable models of Robinson and Yatchew ex: Klein and Spady Binary Choice Model Changing the Parameter of Interest ex: Average derivative methods 75

77 Specifying different stochastic assumptions; see Powell (1984, J. of Econometrics). I won't discuss this last one, but these methods essentially involve making some restriction on the conditional distribution of observable variables, though not enough to estimate the model parametrically. Powell applies these to various limited dependent variable models, including the Tobit model.

78 Average Derivative Method. Consider the model $y_i = g(x_i) + u_i$. (23) Suppose that instead of estimating the derivative $g'(x)$ at every point, we are interested in $E(g'(x))$. (24) The advantage is that by averaging over all points, the curse of dimensionality is eliminated: even though the function $g$ cannot be estimated at the parametric rate of convergence, the average of its derivatives can.

79 These estimators have achieved great popularity and are discussed in Stoker (1986, Econometrica), Härdle and Stoker (1989, JASA), and Powell, Stock and Stoker (1989, Econometrica). The simplest form is the direct average derivative estimator, $\hat\beta = \frac{\sum_{i=1}^{n} \frac{\partial \hat E(y_i|x_i)}{\partial x}\,t_i}{\sum_{i=1}^{n} t_i}$, (25) where $t$ is a trimming function that removes points which have zero or negative densities.

80 What affects the results? Bandwidth choice and trimming.

81 Trimming Trimming essentially refers to the practice of dropping some observations which meet a particular criterion. In other cases, it may mean rounding values at or near zero up to some acceptable level. (ex: Klein and Spady.) Practical reasons In all of the regression estimators that we have looked at, some type of density estimate appears in the denominator of the expression. If this is zero or near zero, the estimate of the conditional mean function is undefined. So it is sometimes necessary to drop data points in order to avoid the boundary problem. 80

82 Technical reasons. Semiparametric estimators use nonparametric estimators in their construction, and the nonparametric estimators need uniform rates of convergence in order to establish the asymptotic properties of the semiparametric estimators. This generally involves the use of bounded kernels and bounded densities (for $x$, typically). So most technical proofs involve the introduction of some trimming function. (See Robinson (1988) or Klein and Spady (1993) for examples.)

83 Additively Separable Models. This represents another way to restrict the class of models: $y_i = \beta_0 + g_1(x_{i1}) + g_2(x_{i2}) + \cdots + g_k(x_{ik}) + u_i$. (26) This is less restrictive than it appears, because some variables could involve interactions with other variables. Estimates achieve the univariate rate of convergence, $n^{2/5}$. These models are complicated to estimate: use backfitting or the integration approach of Newey (1994, Econometric Theory) and Härdle and Linton (1996, Biometrika). They are less commonly applied than the partially linear model.

84 Partially Linear Models: Recent developments Refinements have been proposed by Ahn and Powell (1993, Journal of Econometrics) and Heckman, Ichimura, and Todd (1998, U. of Chicago, still unpublished). These deal with the case where instrumental variables are needed and where a sample selection correction of unknown functional form is estimated. 83
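For concreteness, Robinson's method for the partially linear model y = xβ + g(z) + u (listed earlier in the outline) can be sketched as a residual-on-residual regression. Everything below — kernel, bandwidth, simulated design — is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def nw(z0, z, v, h):
    """Nadaraya-Watson estimate of E[v | Z = z0] with a Gaussian kernel."""
    w = np.exp(-0.5 * ((z0 - z) / h) ** 2)
    return np.sum(w * v) / np.sum(w)

def robinson(y, x, z, h=0.15):
    """Robinson (1988) two-step estimator for y = x*beta + g(z) + u:
    regress y - E[y|z] on x - E[x|z].  Univariate sketch; bandwidth
    and kernel are illustrative choices."""
    ey = np.array([nw(zi, z, y, h) for zi in z])
    ex = np.array([nw(zi, z, x, h) for zi in z])
    yt, xt = y - ey, x - ex
    return np.sum(xt * yt) / np.sum(xt ** 2)

# simulated design where x is correlated with z, so a naive OLS of
# y on x alone is biased while the partially linear estimate is not
rng = np.random.default_rng(3)
n = 500
z = rng.uniform(0, 1, n)
x = z + rng.normal(scale=0.5, size=n)
y = 1.5 * x + z ** 2 + rng.normal(scale=0.2, size=n)

b_hat = robinson(y, x, z)
b_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
```

In this design the omitted nonlinear term g(z) = z² is correlated with x, so b_ols is pushed away from the true β = 1.5 while the Robinson step removes the dependence on z before regressing.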

85 Other Notes The book by Pagan and Ullah (1998) remains an excellent reference. The new book by Li and Racine (2006) is written to serve more as a teaching text, complete with problem sets and examples. More recent developments are discussed by Ichimura and Todd (Handbook of Econometrics, Volume 5, 2004). I particularly like their section on bandwidth selection (chapter 6) for semi-parametric, parametric, and average derivative regression estimation techniques. 84


More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

Working Paper No Maximum score type estimators

Working Paper No Maximum score type estimators Warsaw School of Economics Institute of Econometrics Department of Applied Econometrics Department of Applied Econometrics Working Papers Warsaw School of Economics Al. iepodleglosci 64 02-554 Warszawa,

More information

Transformation and Smoothing in Sample Survey Data

Transformation and Smoothing in Sample Survey Data Scandinavian Journal of Statistics, Vol. 37: 496 513, 2010 doi: 10.1111/j.1467-9469.2010.00691.x Published by Blackwell Publishing Ltd. Transformation and Smoothing in Sample Survey Data YANYUAN MA Department

More information

Applied Health Economics (for B.Sc.)

Applied Health Economics (for B.Sc.) Applied Health Economics (for B.Sc.) Helmut Farbmacher Department of Economics University of Mannheim Autumn Semester 2017 Outlook 1 Linear models (OLS, Omitted variables, 2SLS) 2 Limited and qualitative

More information

Ninth ARTNeT Capacity Building Workshop for Trade Research "Trade Flows and Trade Policy Analysis"

Ninth ARTNeT Capacity Building Workshop for Trade Research Trade Flows and Trade Policy Analysis Ninth ARTNeT Capacity Building Workshop for Trade Research "Trade Flows and Trade Policy Analysis" June 2013 Bangkok, Thailand Cosimo Beverelli and Rainer Lanz (World Trade Organization) 1 Selected econometric

More information

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University.

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University. Summer School in Statistics for Astronomers V June 1 - June 6, 2009 Regression Mosuk Chow Statistics Department Penn State University. Adapted from notes prepared by RL Karandikar Mean and variance Recall

More information

Linear Regression. Junhui Qian. October 27, 2014

Linear Regression. Junhui Qian. October 27, 2014 Linear Regression Junhui Qian October 27, 2014 Outline The Model Estimation Ordinary Least Square Method of Moments Maximum Likelihood Estimation Properties of OLS Estimator Unbiasedness Consistency Efficiency

More information