Nonparametrics: Density estimation, Nonparametric conditional mean estimation, Semiparametric conditional mean estimation
Gabriel Montes-Rojas
2 Motivation: Regression discontinuity (Angrist & Pischke). [Figure: three panels plotting the outcome against X. Outcome A: Linear E[Y0i|Xi]. Outcome B: Nonlinear E[Y0i|Xi]. Outcome C: Nonlinearity mistaken for discontinuity.]
3 Objectives of the slides
Overview of nonparametric density estimation, which plays a central role in nonparametric analysis.
Methods for estimating conditional means: Nadaraya-Watson kernel regression.
Semiparametric models: partially linear (Robinson, 1988; Yatchew, 1998) and single index models (Ichimura, 1993; Klein and Spady, 1993).
Familiarization with implementation in Stata.
4 Outline: Histograms; Kernel density estimation
5 Distribution function
Definition: Cumulative distribution function. The cumulative distribution function (c.d.f.) of a random variable X, denoted F_X(·), is a function with domain the real line and counterdomain the interval [0, 1] which satisfies F_X(x) = P[X ≤ x] = P[{ω : X(ω) ≤ x}] for every real number x.
F_X(−∞) = 0 and F_X(+∞) = 1.
F_X(·) is a monotone non-decreasing function, i.e. F_X(a) ≤ F_X(b) if a < b.
F_X(·) is continuous from the right, i.e. lim_{0<h→0} F_X(x + h) = F_X(x).
6 Discrete random variables
Definition: Discrete random variable. A random variable is discrete if the range of X is countable. If a random variable is discrete, then its cumulative distribution function is also said to be discrete.
Definition: Discrete density function. If X is a discrete random variable with values x_1, x_2, ..., x_n, ..., the function f_X(x) = P[X = x_j] if x = x_j, j = 1, ..., n, ..., and zero otherwise, is called the discrete density function of X.
7 Continuous random variables
Definition: Continuous random variable. A random variable is continuous if there exists a function f_X(·) such that F_X(x) = ∫_{−∞}^{x} f_X(u) du for every real number x.
Definition: Probability density function. If X is a continuous random variable, the function f_X(·) in F_X(x) = ∫_{−∞}^{x} f_X(u) du is called the probability density function (or continuous density function).
Note: It is important to recognize that f is not by itself a probability. Instead, the probability that X lies in the interval (x, x + dx) is f(x) dx, and for a finite interval (a, b) it is ∫_a^b f(x) dx.
Any function f(·) with domain the real line and counterdomain [0, ∞) is a probability density function if:
f(x) ≥ 0 for all x
∫ f(x) dx = 1
8 Histograms
Constructing a histogram is straightforward.
If X is a discrete random variable with domain {x_1, x_2, ...}, then select M = #X point-bins. Then,
f̂(x) = (1/n) Σ_i 1[x_i = x]
If X is a continuous random variable with domain X, then consider a series of bins, intervals that cover the domain of X, assumed to be bounded. Let M be the number of bins, indexed by m, each of width 2h, of the form
[x_m − h, x_m + h), m = 1, 2, ..., M, with x_1 < x_2 < ... < x_{M−1} < x_M and x_m + h = x_{m+1} − h.
Then,
f̂(x) = (1/(2nh)) Σ_i 1[x_i in same bin as x]
How to select the bins optimally? In particular, this depends on the bandwidth h.
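The continuous-case estimator above can be sketched in a few lines. This is a Python/NumPy illustration rather than the slides' Stata implementation; treating the evaluation point x itself as the bin centre (so the bin is [x − h, x + h) of width 2h) is my simplification.

```python
import numpy as np

def hist_density(x, data, h):
    """Histogram density estimate at x: the share of observations falling in
    the bin [x - h, x + h), divided by the bin width 2h.
    (Sketch only; using x itself as the bin centre is a simplification.)"""
    data = np.asarray(data)
    in_bin = (data >= x - h) & (data < x + h)
    return in_bin.sum() / (len(data) * 2 * h)
```

For example, with data {0.1, 0.2, 0.9} and h = 0.1, the bin around x = 0.15 contains two of the three observations, giving 2/(3 · 0.2).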
9 Kernel
The kernel K is a symmetric function satisfying:
∫ K(ψ) dψ = 1; ∫ ψ K(ψ) dψ = 0; ∫ ψ² K(ψ) dψ = μ₂ < ∞.
Unless otherwise specified (i.e. unless the domain of X is bounded with known bounds), the limits of integration are (−∞, ∞).
10 Kernel density estimation
Let h be the bandwidth. Let x be a particular value where we want to estimate f(x). A kernel density estimator is
f̂_h(x) = (1/(nh)) Σ_i K((x_i − x)/h)
In practice, we need to report a density for the entire domain of x, X, and not only for one particular x. Then we need to specify a grid of values X = [x(1), x(2), ..., x(M)], with #X = M, at which to estimate the density. A density estimate then corresponds to the graph {f̂_h(x(m)), x(m)}_{m=1}^{M}. We could also consider different bandwidths h(x(m)).
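The slides implement this with Stata's kdensity; as a language-agnostic sketch, the estimator and its evaluation on a grid look like the following in Python/NumPy (the Gaussian kernel and the particular grid are my choices, not the slides'):

```python
import numpy as np

def gaussian_kernel(psi):
    # Standard normal density: integrates to 1, is symmetric, and has mu_2 = 1.
    return np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    """Kernel density estimate f_hat_h(x) = (1/(n h)) * sum_i K((x_i - x)/h)."""
    data = np.asarray(data)
    psi = (data - x) / h
    return gaussian_kernel(psi).sum() / (len(data) * h)

# Report the density on a grid, as the slide describes.
data = np.array([-1.0, 0.0, 0.5, 1.2])
grid = np.linspace(-3, 3, 61)
fhat = np.array([kde(x, data, h=0.5) for x in grid])
```

The estimate is nonnegative everywhere and integrates (numerically, over the grid) to approximately one, as a density should.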
11 Kernel density estimation: Mean squared error and asymptotic requirements
How to evaluate kernel density estimators? Mean squared error (MSE):
MSE(x) = E[(f̂_h(x) − f(x))²] = var(f̂_h(x)) + (bias(f̂_h(x)))²
Following Pagan & Ullah, pp. 23-24:
bias(f̂_h(x)) ≈ (h²/2) f''(x) μ₂
var(f̂_h(x)) ≈ (f(x)/(nh)) ∫ K(ψ)² dψ
Then note that for a consistent estimator of f(x) we face a trade-off between bias and variance:
1. h → 0 as n → ∞. As the sample size increases, we can make the bins smaller to get a more precise estimate of f(x), i.e. to reduce the bias.
2. However, as the bins become smaller, the variance increases! Then we also require nh → ∞, that is, n must grow faster than h shrinks.
12 Kernel density estimation: Rate of convergence
Since we want to evaluate the density over the entire domain X, we consider the integrated mean squared error (IMSE):
IMSE = ∫_X MSE(x) dx ≈ (1/(nh)) ∫ K(ψ)² dψ + (μ₂² h⁴/4) ∫_X f''(x)² dx
We can now minimize with respect to h, to get
h_opt = [ ∫ K(ψ)² dψ / (μ₂² ∫_X f''(x)² dx) ]^{1/5} n^{−1/5}
That is, h should decrease at rate n^{−1/5}. The resulting rate of convergence is
f̂_h(x) − f(x) = O_p(n^{−2/5}),
which is slower than that of the (optimal) maximum likelihood estimator of f(x) under a correctly specified density, O_p(n^{−1/2}), i.e. √n-consistency.
13 Kernel density estimation: Bandwidth selection
The most important step in nonparametric density estimation is selecting a bandwidth.
If h → 0, i.e. too small, then there is no smoothing. The density estimator has too many spikes, one for each observation.
If h → ∞, i.e. too large, then there is too much smoothing. The density estimator fits the density of the selected kernel.
Note that for choosing h_opt we need:
∫ K(ψ)² dψ. This depends on the chosen kernel and is readily available.
μ₂ = ∫ ψ² K(ψ) dψ. This depends on the chosen kernel and is readily available.
∫_X f''(x)² dx. This depends on the unknown f. Bandwidth selection hinges on this term.
14 Kernel density estimation: Bandwidth selection
Rule-of-thumb: Use a standard family of distributions to compute ∫_X f''(x)² dx, e.g. Gaussian, for which ∫_X f''(x)² dx = 3/(8√π σ⁵). Then, ĥ_opt = 1.059 σ̂ n^{−1/5}, with σ̂ = [(1/n) Σ_i (x_i − x̄)²]^{1/2}.
Plug-in: use the rule-of-thumb to get a prior estimate of f, then compute f'' and plug in again.
Cross-validation: choose h by minimizing an estimate of IMSE(f̂_h). This is computationally intensive, as it requires estimating f leaving each observation out of the sample and using the rest. The rate of convergence is extremely slow and the selected bandwidth is volatile. (See Pagan & Ullah.)
Use intuition! Bandwidth selection is an art...
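The rule-of-thumb is a one-liner. A Python sketch (the function name is mine; the formula is the slide's, using the MLE standard deviation as written there):

```python
import numpy as np

def silverman_h(data):
    """Rule-of-thumb bandwidth h = 1.059 * sigma_hat * n^(-1/5), derived for a
    Gaussian kernel with a Gaussian reference density."""
    data = np.asarray(data)
    sigma = np.sqrt(((data - data.mean())**2).mean())  # MLE sd, as on the slide
    return 1.059 * sigma * len(data)**(-0.2)
```

For data {0, 1, 2, 3, 4}: σ̂ = √2 and n = 5, so h = 1.059 · √2 · 5^{−1/5} ≈ 1.086.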
15 Histograms can be created with the hist command. Kernel density estimation can be implemented using the kdensity command.
16 Outline: Local constant conditional mean; Marginal effect
17 Model misspecification
Assume that E(y|x) = m(x), where m(·) is a continuous and differentiable (not necessarily linear) function of x. Then we can always define y = m(x) + e with E(e|x) = 0.
What does OLS estimate in this case? Suppose we estimate the model y_i = β x_i + u_i. Then the OLS estimand β satisfies
β = Cov(y, x)/Var(x) = Cov(m(x) + e, x)/Var(x) = Cov(m(x), x)/Var(x).
18 Model misspecification
Consider the Taylor expansion of m(x) around x*:
m(x) ≈ m(x*) + m'(x*)(x − x*) + (m''(x*)/2)(x − x*)²
Then,
β = m'(x*) + (m''(x*)/2) (Cov(x², x) − 2x* Var(x)) / Var(x)
If m(x) = a + bx, then β = b.
If m(x) = a + bx + cx², then β = b + 2cx* + c (Cov(x², x) − 2x* Var(x))/Var(x) = b + c Cov(x², x)/Var(x) (the same as if x² were an omitted variable).
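A quick numerical check of the omitted-x² formula (a Python sketch, not from the slides): take a noiseless quadratic with x symmetric around zero, so that Cov(x², x) = 0 and the OLS slope collapses to b exactly.

```python
import numpy as np

# Noiseless quadratic m(x) = a + b*x + c*x^2 with x symmetric around zero,
# so Cov(x^2, x) = 0 and the OLS slope equals b exactly.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
a, b, c = 1.0, 2.0, 3.0
y = a + b * x + c * x**2

beta = np.cov(y, x, bias=True)[0, 1] / np.var(x)  # OLS slope = Cov(y, x)/Var(x)
```

Here beta equals b = 2 despite the strong curvature: with a symmetric design, the quadratic term is uncorrelated with x and drops out of the OLS slope.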
19 Local constant conditional mean
Consider now the estimation of m(x) ≡ E[y|x] at a point x:
E[y|x] = ∫ y f(y|x) dy = ∫ y f(y, x)/f(x) dy = g(x)/f(x)
where f(y|x) is the conditional pdf of y given x, defined by f(y|x) ≡ f(y, x)/f(x); f(y, x) is the joint density of y and x; and g(x) ≡ ∫ y f(y, x) dy.
The estimator below locally averages those values of y that are close in terms of x.
20 Nadaraya-Watson kernel regression estimator
The Nadaraya-Watson estimator is a weighted average of those y_i's that correspond to x_i in a neighborhood of x. Consider the kernel-based estimator of the conditional mean. Define
ψ_i(x) = (x − x_i)/h
where h is the (fixed) bandwidth that weights the distance of each x_i to the value x, and define a kernel K(·). Then we have
m̂_h(x) = Σ_i y_i K(ψ_i) / Σ_i K(ψ_i)
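The estimator is just a weighted mean. A Python/NumPy sketch (the Gaussian kernel is my choice; its normalizing constant cancels in the ratio):

```python
import numpy as np

def nw(x, xdata, ydata, h):
    """Nadaraya-Watson: kernel-weighted average of the y_i whose x_i lie near x.
    Gaussian kernel; the normalizing constant cancels in the ratio."""
    psi = (x - np.asarray(xdata)) / h
    w = np.exp(-0.5 * psi**2)
    return (w * np.asarray(ydata)).sum() / w.sum()
```

With a very large bandwidth every observation gets (almost) the same weight, so the fit collapses to the unconditional mean of y, which previews the h → ∞ limit discussed next.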
21 Nadaraya-Watson kernel regression estimator
Note that as h → ∞, m̂_h(x) → ȳ = n^{−1} Σ_i y_i, the unconditional mean of y. What does it mean? For a large bandwidth,
lim_{h→∞} ψ_i(x) = lim_{h→∞} (x − x_i)/h = 0,
and then lim_{h→∞} K(ψ_i(x)) = K(0) = max = constant. In this case there is no differential weighting based on x. Result: Too much smoothing.
Note that as h → 0, m̂_h(x) becomes the nearest neighbor (NN) estimator. What does it mean? For a small bandwidth,
lim_{h→0} ψ_i(x) = lim_{h→0} (x − x_i)/h = ±∞ except when x_i = x.
Note that K(±∞) = 0, so only the x_i's equal to x are considered. This is identical to setting
m̂_0(x) = Σ_i 1[x_i = x] y_i / Σ_i 1[x_i = x],
that is, for each value of x it takes the corresponding value of y if there is only one pair (y_i, x_i = x); it takes the average of all values of y that have x_i = x if there are more such observations; or it takes the average over the observations with the closest x_j to x. Result: No smoothing.
22 Local constant response - Marginal effect
Consider now the marginal effect of x on y. Define
β(x) ≡ d m(x)/dx = m'(x) = (g'(x) f(x) − g(x) f'(x)) / f(x)² = g'(x)/f(x) − g(x) f'(x)/f(x)²
Then a local kernel estimator of the marginal effect is
β̂(x) = ĝ'(x)/f̂(x) − ĝ(x) f̂'(x)/f̂(x)²
with
ĝ(x) = (1/(nh)) Σ_i y_i K(ψ_i),  ĝ'(x) = (1/(nh²)) Σ_i y_i K'(ψ_i),
f̂(x) = (1/(nh)) Σ_i K(ψ_i),  f̂'(x) = (1/(nh²)) Σ_i K'(ψ_i)
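A simple way to sanity-check the marginal-effect formula is to differentiate the Nadaraya-Watson fit numerically instead of analytically. This Python sketch uses a centered finite difference as a numerical stand-in for the slide's kernel-derivative estimator (the Gaussian kernel and the step size eps are my choices):

```python
import numpy as np

def nw(x, xdata, ydata, h):
    # Nadaraya-Watson fit (Gaussian kernel; the normalizing constant cancels).
    psi = (x - np.asarray(xdata)) / h
    w = np.exp(-0.5 * psi**2)
    return (w * np.asarray(ydata)).sum() / w.sum()

def beta_hat(x, xdata, ydata, h, eps=1e-5):
    """Marginal effect beta(x) = m'(x), approximated by a centered finite
    difference of the Nadaraya-Watson fit -- a numerical stand-in for the
    analytic estimator built from kernel derivatives."""
    return (nw(x + eps, xdata, ydata, h) - nw(x - eps, xdata, ydata, h)) / (2 * eps)
```

On linear data y = 2x evaluated at an interior point, the estimated marginal effect is close to 2, as it should be.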
23 Local polynomial regression
The Nadaraya-Watson estimator can be obtained as m̂_h(x) = argmin_a Σ_i (y_i − a)² K(ψ_i). Now consider an extension:
(â_h(x), b̂_h(x)) = argmin_{(a,b)} Σ_i (y_i − a − b(x_i − x))² K(ψ_i)
Note that this is a weighted regression estimator, where the weights are given by the kernel:
(â_h(x), b̂_h(x))' = [ Σ_i K(ψ_i) ( 1, (x_i − x) ; (x_i − x), (x_i − x)² ) ]^{−1} [ Σ_i K(ψ_i) ( y_i ; y_i (x_i − x) ) ]
This can be extended to a higher-order polynomial, i.e. y_i − a − b(x_i − x) − c(x_i − x)². In fact, for the estimation of β(x) = d m(x)/dx it is better to include a quadratic term to reduce bias. Note that as h → ∞, the local linear regression estimator becomes the OLS estimator: lim_{h→∞} â_h(x) = β̂_0 + β̂_1 x and lim_{h→∞} b̂_h(x) = β̂_1.
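The weighted-least-squares form above translates directly into code. A Python sketch (Gaussian kernel assumed): the intercept estimates m(x) and the slope estimates m'(x).

```python
import numpy as np

def local_linear(x, xdata, ydata, h):
    """Local linear fit at x: weighted least squares of y on (1, x_i - x) with
    Gaussian kernel weights. a_hat estimates m(x), b_hat estimates m'(x)."""
    xdata, ydata = np.asarray(xdata), np.asarray(ydata)
    psi = (x - xdata) / h
    w = np.exp(-0.5 * psi**2)
    X = np.column_stack([np.ones_like(xdata), xdata - x])
    XtW = X.T * w                                  # weight each observation
    a_hat, b_hat = np.linalg.solve(XtW @ X, XtW @ ydata)
    return a_hat, b_hat
```

On exactly linear data the weighted residuals are zero for any bandwidth, so the fit recovers the line exactly: for y = 1 + 3x evaluated at x = 0.5 we get a ≈ 2.5 and b ≈ 3.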
24 Some asymptotic properties: Assumptions
Consider the following assumptions:
1. m and f are twice differentiable in a neighborhood of x; f is bounded in a neighborhood of x; x ∈ int(X). [What does it mean? It rules out points with jumps.]
2. The kernel K is a symmetric function satisfying (i) ∫ K(ψ) dψ = 1; (ii) ∫ ψ K(ψ) dψ = 0; (iii) ∫ ψ² K(ψ) dψ = μ₂ < ∞. [What does it mean? Same properties as for density estimation.]
3. h = h_n → 0 and nh → ∞ as n → ∞. [What does it mean? Same properties as for density estimation.]
4. The x's are iid and independent of the error term in the model y = m(x) + u. [What does it mean? Exogeneity assumption.]
25 Some asymptotic properties: Nadaraya-Watson estimator
Theorem (Pagan & Ullah, p. 101). Under the assumptions above,
BIAS(m̂_h(x)) = (h²/(2f)) μ₂ (m'' f + 2 f' m') + O(n^{−1} h^{−1}) + o(h²)
V(m̂_h(x)) = (σ²/(nhf)) ∫ K²(ψ) dψ + o(n^{−1} h^{−1})
Then the optimal bandwidth should satisfy h_n ∝ n^{−1/5}.
26 Some asymptotic properties: Local linear regression
Theorem (Pagan & Ullah, p. 105). Under the assumptions above,
BIAS(m̂_h(x)) = (m''/2) h² μ₂ + O(n^{−1} h^{−1}) + o(h²)
V(m̂_h(x)) = (σ²/(nhf)) ∫ K²(ψ) dψ + o(n^{−1} h^{−1})
27 The curse of dimensionality
The results above can be adjusted for multiple covariates, q. As the number of covariates increases, the rate of convergence deteriorates; in particular, it becomes O_p(n^{−2/(q+4)}). Compare that with OLS models, where the rate of convergence is O_p(n^{−1/2}) for any q. In particular, h_opt ∝ n^{−1/(q+4)}.
28 Nadaraya-Watson and local polynomial kernel regression can be implemented using the lpoly command.
29 Outline: Partially linear models; Index models
30 Partially linear models: y = x'β + g(z) + u
A semiparametric partially linear model is given by
y_i = x_i'β + g(z_i) + u_i, i = 1, 2, ..., n,
where x_i is a p×1 vector of covariates; z_i is a q×1 vector of covariates; g(·) is an unspecified function; and u_i ≡ y_i − E(y_i|x_i, z_i), so E(u_i|x_i, z_i) = 0 and E(u_i²|x_i, z_i) = σ²(x_i, z_i) (i.e. potentially heteroskedastic).
Example: Suppose we are able to assume linearity in some covariates (i.e. x) but cannot assume it for others (i.e. z).
31 Partially linear models: y = x'β + g(z) + u. Robinson's (1988) and Yatchew's (1998) estimators
This model avoids the curse of dimensionality if few variables, or a single variable, enter z. Separating the covariates into linear and nonlinear ones increases the precision of the estimates. The estimator β̂ is √n-consistent (usual OLS asymptotics for the linear part).
32 Partially linear models: Robinson's (1988) estimator
This model can be estimated by Robinson's (1988) estimator. Consider
E(y_i|z_i) = E(x_i|z_i)'β + g(z_i) + E(u_i|z_i) = E(x_i|z_i)'β + g(z_i) (because E(u|z) = 0).
Then, subtracting this from the original equation,
y_i − E(y_i|z_i) = (x_i − E(x_i|z_i))'β + u_i.
Denoting ỹ_i = y_i − E(y_i|z_i) and x̃_i = x_i − E(x_i|z_i), we get
β̂ = [Σ_i x̃_i x̃_i']^{−1} Σ_i x̃_i ỹ_i.
However, we do not observe the conditional expectations E(y|z) and E(x|z); they are estimated nonparametrically to obtain Ê(y|z) and Ê(x|z). g(z_i) can then be estimated from a nonparametric regression of y_i − x_i'β̂ on z_i.
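The slides point to Stata's semipar command; as a minimal Python sketch of the two steps (scalar x and z, leave-in Nadaraya-Watson first stage with a Gaussian kernel — all simplifying choices of mine):

```python
import numpy as np

def nw(z0, z, v, h):
    # Nadaraya-Watson estimate of E[v | z = z0], Gaussian kernel.
    w = np.exp(-0.5 * ((z - z0) / h)**2)
    return (w * v).sum() / w.sum()

def robinson(y, x, z, h):
    """Robinson's two-step estimator for y = x*beta + g(z) + u (scalar x, z):
    kernel-regress y and x on z, difference out the fits, then OLS on residuals."""
    ey = np.array([nw(zi, z, y, h) for zi in z])
    ex = np.array([nw(zi, z, x, h) for zi in z])
    ytil, xtil = y - ey, x - ex
    beta = (xtil * ytil).sum() / (xtil**2).sum()
    g_input = y - x * beta   # regress this on z nonparametrically to recover g
    return beta, g_input
```

A clean special case: when g(·) is constant, the kernel fits absorb it exactly and β̂ recovers the true coefficient up to rounding.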
33 Partially linear models: Yatchew's (1998) estimator
Sort the data according to z, that is, z_(1) ≤ z_(2) ≤ ... ≤ z_(n). Consider the regression in first differences:
Δy = Δx'β + Δg(z) + Δu
If g(·) is smooth and single-valued with bounded first derivative on a compact support, then Δg(z) → 0 as n → ∞. Then β can be estimated from a regression of Δy on Δx. Obtain β̂. [Note that we do not need to know the form of g(·).]
g(z) can then be estimated from a nonparametric regression of y − x'β̂ on z.
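The differencing step fits in a few lines. A Python sketch for scalar x and z (the function name is mine): with a constant g(·) the differencing removes it exactly, and with a smooth nonconstant g(·) it removes it only approximately as n grows.

```python
import numpy as np

def yatchew(y, x, z):
    """Yatchew's differencing estimator (scalar x, z): sort by z, first-difference
    to knock out the smooth g(z), then OLS of the differenced y on differenced x."""
    order = np.argsort(z)
    dy = np.diff(y[order])
    dx = np.diff(x[order])
    return (dx * dy).sum() / (dx**2).sum()
```

Note the design choice: no kernel or bandwidth is needed for β̂ itself; nonparametric smoothing enters only in the second step that recovers g.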
34 Index models: y = g(x'β) + u
A semiparametric single index model is given by
y_i = g(x_i'β) + u_i, i = 1, 2, ..., n,
where x_i is a q×1 vector of covariates and g(·) is an unspecified function.
Note: this is different from a model where y_i = g(x_i) + u_i.
u_i ≡ y_i − E(y_i|x_i), so E(u_i|x_i) = 0 and E(u_i²|x_i) = σ²(x_i) (i.e. potentially heteroskedastic).
35 Index models: y = g(x'β) + u
Ichimura's (1993) method consists of assuming there is one parameter value, β₀, such that y_i = g(x_i'β₀) + u_i, i = 1, 2, ..., n. However, we can define E(y|x'β) for any β. Note that E(y|x'β) ≠ g(x'β) unless β = β₀.
Consider a grid of β ∈ B, B = [β_(1), β_(2), ..., β_(M)].
For each j = 1, 2, ..., M and for each observation i = 1, 2, ..., n, estimate ĝ_{−i}(x_i'β_(j)) as a leave-one-out nonparametric kernel estimator of g(x_i'β).
Choose β̂ = argmin_{β∈B} Σ_i (y_i − ĝ_{−i}(x_i'β))²
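The grid-search version of the procedure can be sketched directly. A Python illustration (Gaussian kernel, a coarse hypothetical grid with the first coefficient normalized to 1, and a simulated smooth link — all assumptions of mine, not the slides'):

```python
import numpy as np

def loo_nw(index, y, h):
    """Leave-one-out Nadaraya-Watson fit of y on a scalar index, Gaussian kernel."""
    n = len(index)
    fits = np.empty(n)
    for i in range(n):
        psi = (np.delete(index, i) - index[i]) / h
        w = np.exp(-0.5 * psi**2)
        fits[i] = (w * np.delete(y, i)).sum() / w.sum()
    return fits

def ichimura(y, X, grid, h):
    """Semiparametric least squares: pick the beta on the grid whose index
    best explains y through an unspecified link, via leave-one-out fits."""
    best, best_sse = None, np.inf
    for beta in grid:
        sse = ((y - loo_nw(X @ beta, y, h))**2).sum()
        if sse < best_sse:
            best, best_sse = beta, sse
    return best

# Demo: y = sin(3 * x'beta0) with beta0 = (1, 2); first coefficient normalized to 1.
rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 2))
y = np.sin(3 * (X @ np.array([1.0, 2.0])))
grid = [np.array([1.0, b]) for b in (0.0, 1.0, 2.0, 3.0)]
best = ichimura(y, X, grid, h=0.1)
```

At the true index the leave-one-out fit tracks the smooth link closely, while a wrong β leaves y scattered around its index, so the criterion selects β = (1, 2) from the grid.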
36 Binary choice semiparametric single index models
Suppose now y is a binary variable, i.e. y ∈ {0, 1}. The single index model can be applied as an alternative to logit or probit models. This is the Klein and Spady (1993) estimator. Let g(x_i'β) = Pr[y = 1|x_i]. Then apply Ichimura's method but maximize a quasi-log-likelihood:
β̂ = argmax_{β∈B} Σ_i (1 − y_i) ln(1 − ĝ_{−i}(x_i'β)) + y_i ln(ĝ_{−i}(x_i'β))
Note: Compare with probit and logit models, where
β̂ = argmax_β Σ_i (1 − y_i) ln(1 − F(x_i'β)) + y_i ln(F(x_i'β))
and F is either the normal or the logistic cdf.
37 Robinson's (1988) semiparametric partially linear model can be estimated with the semipar command (ssc install semipar). Yatchew's (1998) semiparametric partially linear model can be estimated with the plreg command. sml fits univariate binary-choice models by the semiparametric maximum likelihood estimator of Klein and Spady (1993).
38 References
These slides are based on:
Gutierrez, R.G., Linhart, J.M. and Pitblado, J.S. (2003), From the help desk: Local polynomial regression and Stata plugins, Stata Journal, 3(4).
Ichimura, H. (1993), Semiparametric least squares (SLS) and weighted SLS estimation of single-index models, Journal of Econometrics, 58.
Klein, R.W. and Spady, R.H. (1993), An efficient semiparametric estimator for binary response models, Econometrica, 61.
Pagan, A. and Ullah, A. (1999), Nonparametric Econometrics. Cambridge: Cambridge University Press.
Racine, J. (2008), Nonparametric Econometrics: A Primer, Foundations and Trends in Econometrics, 3(1).
Robinson, P. (1988), Root-n-consistent semi-parametric regression, Econometrica, 56.
Yatchew, A. (1998), Nonparametric regression techniques in economics, Journal of Economic Literature, 36.
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationApplied Economics. Regression with a Binary Dependent Variable. Department of Economics Universidad Carlos III de Madrid
Applied Economics Regression with a Binary Dependent Variable Department of Economics Universidad Carlos III de Madrid See Stock and Watson (chapter 11) 1 / 28 Binary Dependent Variables: What is Different?
More informationNon-linear panel data modeling
Non-linear panel data modeling Laura Magazzini University of Verona laura.magazzini@univr.it http://dse.univr.it/magazzini May 2010 Laura Magazzini (@univr.it) Non-linear panel data modeling May 2010 1
More informationLecture 2: Repetition of probability theory and statistics
Algorithms for Uncertainty Quantification SS8, IN2345 Tobias Neckel Scientific Computing in Computer Science TUM Lecture 2: Repetition of probability theory and statistics Concept of Building Block: Prerequisites:
More informationO Combining cross-validation and plug-in methods - for kernel density bandwidth selection O
O Combining cross-validation and plug-in methods - for kernel density selection O Carlos Tenreiro CMUC and DMUC, University of Coimbra PhD Program UC UP February 18, 2011 1 Overview The nonparametric problem
More informationAdditive Isotonic Regression
Additive Isotonic Regression Enno Mammen and Kyusang Yu 11. July 2006 INTRODUCTION: We have i.i.d. random vectors (Y 1, X 1 ),..., (Y n, X n ) with X i = (X1 i,..., X d i ) and we consider the additive
More informationContinuous Random Variables
1 / 24 Continuous Random Variables Saravanan Vijayakumaran sarva@ee.iitb.ac.in Department of Electrical Engineering Indian Institute of Technology Bombay February 27, 2013 2 / 24 Continuous Random Variables
More informationRandom Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R
In probabilistic models, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. As a function or a map, it maps from an element (or an outcome) of a sample
More informationHistogram Härdle, Müller, Sperlich, Werwatz, 1995, Nonparametric and Semiparametric Models, An Introduction
Härdle, Müller, Sperlich, Werwatz, 1995, Nonparametric and Semiparametric Models, An Introduction Tine Buch-Kromann Construction X 1,..., X n iid r.v. with (unknown) density, f. Aim: Estimate the density
More informationSemiparametric Estimation of a Sample Selection Model in the Presence of Endogeneity
Semiparametric Estimation of a Sample Selection Model in the Presence of Endogeneity Jörg Schwiebert Abstract In this paper, we derive a semiparametric estimation procedure for the sample selection model
More informationNonparametric Estimation of Regression Functions In the Presence of Irrelevant Regressors
Nonparametric Estimation of Regression Functions In the Presence of Irrelevant Regressors Peter Hall, Qi Li, Jeff Racine 1 Introduction Nonparametric techniques robust to functional form specification.
More informationNon-parametric Inference and Resampling
Non-parametric Inference and Resampling Exercises by David Wozabal (Last update 3. Juni 2013) 1 Basic Facts about Rank and Order Statistics 1.1 10 students were asked about the amount of time they spend
More informationIntroduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones
Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive
More informationPreface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation
Preface Nonparametric econometrics has become one of the most important sub-fields in modern econometrics. The primary goal of this lecture note is to introduce various nonparametric and semiparametric
More informationprobability of k samples out of J fall in R.
Nonparametric Techniques for Density Estimation (DHS Ch. 4) n Introduction n Estimation Procedure n Parzen Window Estimation n Parzen Window Example n K n -Nearest Neighbor Estimation Introduction Suppose
More informationMLE and GMM. Li Zhao, SJTU. Spring, Li Zhao MLE and GMM 1 / 22
MLE and GMM Li Zhao, SJTU Spring, 2017 Li Zhao MLE and GMM 1 / 22 Outline 1 MLE 2 GMM 3 Binary Choice Models Li Zhao MLE and GMM 2 / 22 Maximum Likelihood Estimation - Introduction For a linear model y
More informationDo Markov-Switching Models Capture Nonlinearities in the Data? Tests using Nonparametric Methods
Do Markov-Switching Models Capture Nonlinearities in the Data? Tests using Nonparametric Methods Robert V. Breunig Centre for Economic Policy Research, Research School of Social Sciences and School of
More informationSimple Estimators for Monotone Index Models
Simple Estimators for Monotone Index Models Hyungtaik Ahn Dongguk University, Hidehiko Ichimura University College London, James L. Powell University of California, Berkeley (powell@econ.berkeley.edu)
More informationCopula Regression RAHUL A. PARSA DRAKE UNIVERSITY & STUART A. KLUGMAN SOCIETY OF ACTUARIES CASUALTY ACTUARIAL SOCIETY MAY 18,2011
Copula Regression RAHUL A. PARSA DRAKE UNIVERSITY & STUART A. KLUGMAN SOCIETY OF ACTUARIES CASUALTY ACTUARIAL SOCIETY MAY 18,2011 Outline Ordinary Least Squares (OLS) Regression Generalized Linear Models
More informationECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria
ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria SOLUTION TO FINAL EXAM Friday, April 12, 2013. From 9:00-12:00 (3 hours) INSTRUCTIONS:
More information7 Semiparametric Estimation of Additive Models
7 Semiparametric Estimation of Additive Models Additive models are very useful for approximating the high-dimensional regression mean functions. They and their extensions have become one of the most widely
More informationOptimal bandwidth selection for the fuzzy regression discontinuity estimator
Optimal bandwidth selection for the fuzzy regression discontinuity estimator Yoichi Arai Hidehiko Ichimura The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP49/5 Optimal
More informationIntroduction to Regression
Introduction to Regression Chad M. Schafer May 20, 2015 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression Nonparametric Procedures Cross Validation Local Polynomial Regression
More informationIntroduction to Maximum Likelihood Estimation
Introduction to Maximum Likelihood Estimation Eric Zivot July 26, 2012 The Likelihood Function Let 1 be an iid sample with pdf ( ; ) where is a ( 1) vector of parameters that characterize ( ; ) Example:
More informationStatistical inference on Lévy processes
Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline
More informationAppendix A : Introduction to Probability and stochastic processes
A-1 Mathematical methods in communication July 5th, 2009 Appendix A : Introduction to Probability and stochastic processes Lecturer: Haim Permuter Scribe: Shai Shapira and Uri Livnat The probability of
More information3. Probability and Statistics
FE661 - Statistical Methods for Financial Engineering 3. Probability and Statistics Jitkomut Songsiri definitions, probability measures conditional expectations correlation and covariance some important
More informationFormulas for probability theory and linear models SF2941
Formulas for probability theory and linear models SF2941 These pages + Appendix 2 of Gut) are permitted as assistance at the exam. 11 maj 2008 Selected formulae of probability Bivariate probability Transforms
More informationTransformation and Smoothing in Sample Survey Data
Scandinavian Journal of Statistics, Vol. 37: 496 513, 2010 doi: 10.1111/j.1467-9469.2010.00691.x Published by Blackwell Publishing Ltd. Transformation and Smoothing in Sample Survey Data YANYUAN MA Department
More informationBinary Choice Models Probit & Logit. = 0 with Pr = 0 = 1. decision-making purchase of durable consumer products unemployment
BINARY CHOICE MODELS Y ( Y ) ( Y ) 1 with Pr = 1 = P = 0 with Pr = 0 = 1 P Examples: decision-making purchase of durable consumer products unemployment Estimation with OLS? Yi = Xiβ + εi Problems: nonsense
More informationLocal regression I. Patrick Breheny. November 1. Kernel weighted averages Local linear regression
Local regression I Patrick Breheny November 1 Patrick Breheny STA 621: Nonparametric Statistics 1/27 Simple local models Kernel weighted averages The Nadaraya-Watson estimator Expected loss and prediction
More information4 Nonparametric Regression
4 Nonparametric Regression 4.1 Univariate Kernel Regression An important question in many fields of science is the relation between two variables, say X and Y. Regression analysis is concerned with the
More informationRewrap ECON November 18, () Rewrap ECON 4135 November 18, / 35
Rewrap ECON 4135 November 18, 2011 () Rewrap ECON 4135 November 18, 2011 1 / 35 What should you now know? 1 What is econometrics? 2 Fundamental regression analysis 1 Bivariate regression 2 Multivariate
More information3 Nonparametric Density Estimation
3 Nonparametric Density Estimation Example: Income distribution Source: U.K. Family Expenditure Survey (FES) 1968-1995 Approximately 7000 British Households per year For each household many different variables
More information1 Empirical Likelihood
Empirical Likelihood February 3, 2016 Debdeep Pati 1 Empirical Likelihood Empirical likelihood a nonparametric method without having to assume the form of the underlying distribution. It retains some of
More informationFormulary Applied Econometrics
Department of Economics Formulary Applied Econometrics c c Seminar of Statistics University of Fribourg Formulary Applied Econometrics 1 Rescaling With y = cy we have: ˆβ = cˆβ With x = Cx we have: ˆβ
More informationAdaptive Nonparametric Density Estimators
Adaptive Nonparametric Density Estimators by Alan J. Izenman Introduction Theoretical results and practical application of histograms as density estimators usually assume a fixed-partition approach, where
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationECO Class 6 Nonparametric Econometrics
ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................
More informationPerhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.
Chapter 5 Two Random Variables In a practical engineering problem, there is almost always causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage
More informationThe logistic regression model is thus a glm-model with canonical link function so that the log-odds equals the linear predictor, that is
Example The logistic regression model is thus a glm-model with canonical link function so that the log-odds equals the linear predictor, that is log p 1 p = β 0 + β 1 f 1 (y 1 ) +... + β d f d (y d ).
More informationRegression Discontinuity Designs
Regression Discontinuity Designs Kosuke Imai Harvard University STAT186/GOV2002 CAUSAL INFERENCE Fall 2018 Kosuke Imai (Harvard) Regression Discontinuity Design Stat186/Gov2002 Fall 2018 1 / 1 Observational
More informationStatistics: Learning models from data
DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial
More informationAnalogy Principle. Asymptotic Theory Part II. James J. Heckman University of Chicago. Econ 312 This draft, April 5, 2006
Analogy Principle Asymptotic Theory Part II James J. Heckman University of Chicago Econ 312 This draft, April 5, 2006 Consider four methods: 1. Maximum Likelihood Estimation (MLE) 2. (Nonlinear) Least
More information