ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation
Petra E. Todd
Fall 2014


Contents

1 Review of Stochastic Order Symbols

2 Nonparametric Density Estimation
  2.1 Histogram Estimator
    2.1.1 What is the variance and bias of this estimator?
    2.1.2 Selecting the bin size
  2.2 Symmetric Binning Histogram Estimator
  2.3 Standard kernel density estimator
    2.3.1 Bias
    2.3.2 Variance
    2.3.3 Choice of Bandwidth
  2.4 Alternative Estimator - k nearest neighbor
  2.5 Asymptotic Distribution of Kernel Density

3 Nonparametric Regression
  3.1 Overview
  3.2 The Local approach
    3.2.1 Two types of potential problems
    3.2.2 Interpretation of Local Polynomial Estimators as a local regression
  3.3 Properties of Nonparametric Regression Estimators
    3.3.1 Consistency
    3.3.2 Asymptotic Normality
    3.3.3 How might we choose the bandwidth?

4 Bandwidth Selection for Density Estimation
  4.1 First Generation Methods
    4.1.1 Plug-in method (local version)
    4.1.2 Global plug-in method
    4.1.3 Rule-of-thumb Method (Silverman)
    4.1.4 Global Method #2: Least Squares Cross-validation
    4.1.5 Biased Cross-validation
  4.2 Second Generation Methods

5 Bandwidth selection for Nonparametric Regression
  5.1 Plug-in Estimator (local method)
  5.2 Global Plug-in Estimator
  5.3 Bandwidth Choice in Nonparametric Regression
  5.4 Global MISE criterion
  5.5 Blocking method
  5.6 Bandwidth Selection for Nonparametric Regression - Cross-Validation

Chapter 1: Review of Stochastic Order Symbols

Let $X_n$ be a sequence of random variables. Then
\[
X_n = o_P(1) \quad \text{if } X_n \xrightarrow{P} 0 \text{ as } n \to \infty,
\]
and
\[
X_n = O_P(a_n) \quad \text{if } X_n/a_n \text{ is stochastically bounded.}
\]
Recall that $X_n/a_n$ is stochastically bounded if
\[
\forall \varepsilon > 0, \ \exists M < \infty : \ P\big(|X_n/a_n| > M\big) < \varepsilon \quad \forall n.
\]
If a sequence is $o_P(1)$, then it is $O_P(1)$: convergence in probability implies stochastic boundedness, but not vice versa. For example, if $x_1, \dots, x_n$ are iid with mean $\mu$ and finite variance, then the CLT implies $\bar{x}_n - \mu = O_P(n^{-1/2})$.


Chapter 2: Nonparametric Density Estimation

Nonparametric density estimators allow estimation of densities without having to specify the functional form of the density. Instead, weaker assumptions such as continuity and differentiability are imposed.

2.1 Histogram Estimator

The most commonly used nonparametric density estimator is the histogram estimator. Suppose we estimate the density by a histogram with bin width equal to $h$. The proportion of data in a bin approximates
\[
\frac{1}{n}\sum_{i=1}^n 1\big(x_i \in [x_0, x_0 + h]\big) \approx \hat f(x_0)\,h.
\]
This suggests the density estimator
\[
\hat f(x_0) = \frac{1}{nh}\sum_{i=1}^n 1\big(x_i \in [x_0, x_0 + h]\big).
\]
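To make the construction concrete, here is a minimal sketch of the histogram estimator evaluated at a point, in Python with NumPy (the function names and the simulated data are my own illustration, not part of the notes):

```python
import numpy as np

def hist_density(x0, data, h):
    """Histogram density estimate at x0: share of observations in
    [x0, x0 + h], divided by the bin width h."""
    n = len(data)
    in_bin = (data >= x0) & (data <= x0 + h)
    return in_bin.sum() / (n * h)

# Example: estimate a standard normal density at 0 with n = 10,000 draws.
rng = np.random.default_rng(0)
data = rng.standard_normal(10_000)
print(hist_density(0.0, data, h=0.2))   # close to phi(0) ~ 0.399
```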

2.1.1 What is the variance and bias of this estimator?

Bias

\[
E\hat f(x_0) = \frac{1}{h}\,E\,1\big(x_i \in [x_0, x_0+h]\big) = \frac{1}{h}\int_{x_0}^{x_0+h} f(x_i)\,dx_i
= \frac{1}{h}\int_{x_0}^{x_0+h} \big\{f(x_0) + f'(\bar x)(x_i - x_0)\big\}\,dx_i, \qquad \bar x \in [x_0, x_i],
\]
\[
= \underbrace{f(x_0)\,\frac{1}{h}\int_{x_0}^{x_0+h} dx_i}_{f(x_0)} + \underbrace{\frac{1}{h}\int_{x_0}^{x_0+h} f'(\bar x)(x_i - x_0)\,dx_i}_{R(x_0,h)}.
\]
Assume the first derivative is bounded: $|f'(x)| \le C < \infty$ for all $x$. In that case, the remainder term can be bounded from above by
\[
|R(x_0,h)| \le \frac{1}{h}\int_{x_0}^{x_0+h} |f'(\bar x)|\,|x_i - x_0|\,dx_i \le C\,\frac{h^2}{2}\,\frac{1}{h} = O(h).
\]
Thus $|E\hat f(x_0) - f(x_0)| \le Ch$. Consistency of the histogram requires that $h \to 0$ as $n \to \infty$. Note that these derivations assume $f$ is differentiable with a bounded derivative.

Variance

Mean-square consistency requires that the variance go to zero as well. If observations are iid, we can write
\[
\operatorname{var}\hat f(x_0) = \frac{1}{n^2h^2}\sum_{i=1}^n \operatorname{var}\big\{1(x_i \in [x_0, x_0+h])\big\}.
\]
We will show that the variance is bounded above by something that goes to zero (from below, it is bounded by 0):
\[
\operatorname{var}\big\{1(x_i \in [x_0,x_0+h])\big\} = E\big(1(x_i \in [x_0,x_0+h])^2\big) - \big(E\,1(\cdot)\big)^2
\le E\big(1(x_i \in [x_0,x_0+h])^2\big) = E\,1\big(x_i \in [x_0,x_0+h]\big).
\]
We already analyzed this term above in studying the bias, where we showed that
\[
E\,1\big(x_i \in [x_0, x_0+h]\big) = hf(x_0) + O(h^2).
\]
Assume that $f(x_0) \le C$, so
\[
\operatorname{var}\hat f(x_0) \le \frac{Cnh}{n^2h^2} = \frac{C}{nh} = O\Big(\frac{1}{nh}\Big).
\]
Consistency therefore requires $nh \to \infty$.

Remarks
(i) The estimator is not root-n consistent, because the variance gets large as $h \to 0$.
(ii) For consistency, we require both $h \to 0$ and $nh \to \infty$.

2.1.2 Selecting the bin size

How might you choose $h$? One way is to minimize the mean-squared error (MSE), focusing on the lowest-order terms (the terms that converge the slowest):
\[
MSE = C_2 h^2 + \frac{C_1}{nh}.
\]

\[
\frac{\partial MSE}{\partial h} = 2C_2h - \frac{C_1}{nh^2} = 0
\;\Longrightarrow\; 2C_2h^3 = \frac{C_1}{n}
\;\Longrightarrow\; h^* = \underbrace{\Big[\frac{C_1}{2C_2}\Big]^{1/3} n^{-1/3}}_{\text{optimal choice of bin width}}.
\]

2.2 Symmetric Binning Histogram Estimator

Instead of constructing the bins as before, consider constructing the bins symmetrically around the point of evaluation:
\[
\hat f(x_0) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2h}\,1\big(|x_0 - x_i| < h\big).
\]
This can be rewritten as
\[
\hat f(x_0) = \frac{1}{nh}\sum_{i=1}^n \underbrace{\frac{1}{2}\,1\Big(\Big|\frac{x_0 - x_i}{h}\Big| < 1\Big)}_{\text{uniform kernel}}.
\]
Note that the area under the kernel function integrates to 1 and the kernel function is symmetric.

[Figure 2.1: Symmetric uniform kernel function]

2.3 Standard kernel density estimator

Instead of the uniform kernel, use a kernel with weights that give more weight to closer observations and go smoothly to zero. We also require that it integrate to 1 and be symmetric. Let
\[
s = \frac{x_0 - x_i}{h_n}.
\]
Examples:
\[
k(s) = \frac{15}{16}(1 - s^2)^2 \ \text{ if } |s| \le 1, \quad 0 \text{ otherwise} \qquad \text{(biweight or quartic kernel)},
\]
\[
k(s) = \frac{1}{\sqrt{2\pi}}\,e^{-s^2/2} \qquad \text{(normal kernel)}.
\]
With the normal kernel, the weights are always positive. The standard kernel density estimator is given by
\[
\hat f(x_0) = \frac{1}{nh_n}\sum_{i=1}^n k\Big(\frac{x_0 - x_i}{h_n}\Big).
\]
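A minimal sketch of this estimator in Python (NumPy only; the function names are my own, and the bandwidth is taken as given here):

```python
import numpy as np

def quartic_kernel(s):
    """Biweight/quartic kernel: (15/16)(1 - s^2)^2 on |s| <= 1."""
    return np.where(np.abs(s) <= 1, (15 / 16) * (1 - s**2) ** 2, 0.0)

def gaussian_kernel(s):
    """Normal kernel: standard normal density."""
    return np.exp(-0.5 * s**2) / np.sqrt(2 * np.pi)

def kde(x0, data, h, kernel=quartic_kernel):
    """Kernel density estimate: (1/(n h)) * sum_i k((x0 - x_i)/h)."""
    s = (x0 - data) / h
    return kernel(s).mean() / h

rng = np.random.default_rng(0)
data = rng.standard_normal(5_000)
for k in (quartic_kernel, gaussian_kernel):
    print(k.__name__, kde(0.0, data, h=0.3, kernel=k))
```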

[Figure 2.2: Kernel function with weights that go smoothly to zero]

2.3.1 Bias

\[
E\hat f(x_0) = \frac{1}{h_n}\int k\Big(\frac{x_0 - x_i}{h_n}\Big) f(x_i)\,dx_i.
\]
Change variables: let
\[
s = \frac{x_0 - x_i}{h_n}, \qquad ds = -\frac{dx_i}{h_n}, \qquad x_i = x_0 - sh_n,
\]
so that (the limits of integration flip, cancelling the minus sign)
\[
E\hat f(x_0) = \int k(s)\,f(x_0 - sh_n)\,ds.
\]
Assume $f$ is $r$ times differentiable with bounded $r$th derivative, and Taylor expand:
\[
f(x_0 - sh_n) = f(x_0) + f'(x_0)(-sh_n) + \frac{1}{2}f''(x_0)s^2h_n^2 + \dots + \frac{1}{r!}f^{(r)}(x_0)(-sh_n)^r
+ \underbrace{\frac{1}{r!}\big[f^{(r)}(\bar x) - f^{(r)}(x_0)\big](-sh_n)^r}_{\text{remainder term } R(x_0,s)}.
\]
If
\[
\int k(s)s^R\,ds = 0 \qquad \text{for } R = 1,\dots,r,
\]

then the bias is
\[
\int k(s)\,R(x_0,s)\,ds \le h_n^r\,\frac{2C}{r!}\int |s|^r\,|k(s)|\,ds = O(h_n^r)
\]
(since we assumed a bounded $r$th derivative). Thus, this estimator improves on the histogram if some moments of the kernel equal 0. Typically, $k$ satisfies
\[
\int k(s)\,ds = 1, \qquad \int k(s)s\,ds = 0.
\]
If we additionally require $\int k(s)s^2\,ds = 0$, then $k(s)$ must be negative in places, as in the figure below.

Remark. If $f^{(r)}$ satisfies a Lipschitz condition, i.e. $|f^{(r)}(\bar x) - f^{(r)}(x_0)| \le m|\bar x - x_0|$, then
\[
\text{Bias} \le C_0\,h_n^{r+1}\int |s|^{r+1}\,|k(s)|\,ds = O(h_n^{r+1}).
\]

2.3.2 Variance

\[
\operatorname{Var}\hat f(x_0) = \frac{1}{n^2h_n^2}\sum_{i=1}^n \operatorname{Var} k\Big(\frac{x_0 - x_i}{h_n}\Big)
= \frac{1}{nh_n^2}\Big[E\,k^2\Big(\frac{x_0 - x_i}{h_n}\Big) - \Big(E\,k\Big(\frac{x_0 - x_i}{h_n}\Big)\Big)^2\Big]
\le \frac{C_1}{nh_n} = O\Big(\frac{1}{nh_n}\Big),
\]
which can be shown using the same techniques as before. The variance does not improve on the histogram.

[Figure 2.3: Kernel function that is sometimes negative]

2.3.3 Choice of Bandwidth

Choose $h_n$ to minimize the asymptotic mean-squared error (AMSE):
\[
\min_{h_n}\,AMSE(h_n) = C_2^2 h_n^{2r} + \frac{C_1}{nh_n},
\]
\[
\frac{\partial AMSE}{\partial h_n} = 2rC_2^2 h_n^{2r-1} - \frac{C_1}{nh_n^2} = 0
\;\Longrightarrow\; h_n^{2r+1} = \frac{C_1}{2rC_2^2\,n}
\;\Longrightarrow\; h_n^* = \Big(\frac{C_1}{2rC_2^2}\Big)^{\frac{1}{2r+1}}\,n^{-\frac{1}{2r+1}},
\]
which shrinks to zero as $n$ grows large.

Now let's look at the order of the AMSE (without making the Lipschitz assumption). Plug in the optimal value for $h_n$ obtained above, and let $C_3 = \big(\frac{C_1}{2rC_2^2}\big)^{\frac{1}{2r+1}}$.

\[
MSE(h_n^*) = C_2^2 C_3^{2r}\,n^{-\frac{2r}{2r+1}} + \frac{C_1}{C_3}\,n^{-\frac{2r}{2r+1}} = O\big(n^{-\frac{2r}{2r+1}}\big),
\]
so
\[
RMSE = \sqrt{MSE} = O\big(n^{-\frac{r}{2r+1}}\big), \qquad \frac{r}{2r+1} < \frac{1}{2},
\]
which is slower than the parametric rate $n^{-1/2}$. Recall that for OLS,
\[
\sqrt{N}\big(\hat\beta_{ols} - \beta_0\big) \sim N\Big(0,\ \sigma^2\underbrace{\Big(\frac{X'X}{N}\Big)^{-1}}_{\text{Var}}\Big),
\]
so the RMSE of $\hat\beta_{ols}$ is $O_p(N^{-1/2})$.

How do you generalize the kernel density estimator to more dimensions? With a product kernel,
\[
\hat f(x_0, z_0) = \frac{1}{n\,h_{n,x}\,h_{n,z}}\sum_{i=1}^n k_1\Big(\frac{x_0 - x_i}{h_{n,x}}\Big)\,k_2\Big(\frac{z_0 - z_i}{h_{n,z}}\Big).
\]

2.4 Alternative Estimator - k nearest neighbor

Alternatively, the bandwidth can be chosen so as to include a certain number of neighbors in each point estimation. In this case, the bandwidth is variable. Let $k = \#\text{data in bin}$ and let $r_k$ be the distance from $x_0$ to its $k$th nearest neighbor, so the bin is the interval of width $2r_k$ around $x_0$. Then
\[
\hat f(x_0) = \frac{k-1}{2r_k\,n}.
\]
For $m$-dimensional data,
\[
\hat f(x_0) = \frac{k-1}{(2r_k)^m\,n}.
\]
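A minimal sketch of the one-dimensional k-nearest-neighbor density estimator, assuming the $(k-1)/(2r_k n)$ form reconstructed above (function and variable names are my own):

```python
import numpy as np

def knn_density(x0, data, k):
    """k-NN density estimate at x0: (k - 1) / (2 * r_k * n), where r_k is
    the distance from x0 to its k-th nearest observation."""
    n = len(data)
    dist = np.sort(np.abs(data - x0))
    r_k = dist[k - 1]               # distance to the k-th nearest neighbor
    return (k - 1) / (2 * r_k * n)

rng = np.random.default_rng(0)
data = rng.standard_normal(5_000)
print(knn_density(0.0, data, k=100))  # variable bandwidth: r_k adapts to x0
```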

2.5 Asymptotic Distribution of Kernel Density

We already did consistency (we showed the conditions required for convergence in mean square):
\[
\hat f(x_0) = \frac{1}{nh_n}\sum_{i=1}^n k\Big(\frac{x_0 - x_i}{h_n}\Big) \xrightarrow{p} f(x_0),
\]
since $E\hat f(x_0) \to f(x_0)$ as $h_n \to 0$, $n \to \infty$ (by the LLN) and $\operatorname{Var}\hat f(x_0) \to 0$ as $nh_n \to \infty$.

Now consider the asymptotic distribution. Decompose:
\[
\sqrt{nh_n}\Big(\frac{1}{nh_n}\sum_{i=1}^n k\Big(\frac{x_0 - x_i}{h_n}\Big) - f(x_0)\Big)
= \underbrace{\sqrt{nh_n}\Big(\frac{1}{nh_n}\sum_{i=1}^n k\Big(\frac{x_0 - x_i}{h_n}\Big) - E\,\frac{1}{h_n}k\Big(\frac{x_0 - x_i}{h_n}\Big)\Big)}_{(1)}
+ \underbrace{\sqrt{nh_n}\Big(E\,\frac{1}{h_n}k\Big(\frac{x_0 - x_i}{h_n}\Big) - f(x_0)\Big)}_{(2)}
\]

Analyze term (1): we can find a CLT for this part.
\[
(1) = \frac{1}{\sqrt n}\sum_{i=1}^n \Big[\frac{1}{\sqrt{h_n}}k\Big(\frac{x_0 - x_i}{h_n}\Big) - E\,\frac{1}{\sqrt{h_n}}k\Big(\frac{x_0 - x_i}{h_n}\Big)\Big]
\xrightarrow{d} N\Big(0,\ \lim \operatorname{VAR}\,\frac{1}{\sqrt{h_n}}k\Big(\frac{x_0 - x_i}{h_n}\Big)\Big).
\]
Assume $\int k(s)\,ds = 1$ and $\int k(s)s\,ds = 0$. Then
\[
\operatorname{VAR}\,\frac{1}{\sqrt{h_n}}k\Big(\frac{x_0 - x_i}{h_n}\Big)
= \underbrace{\frac{1}{h_n}\,E\,k^2\Big(\frac{x_0 - x_i}{h_n}\Big)}_{\to f(x_0)\int k^2(s)\,ds}
- \underbrace{\frac{1}{h_n}\Big[E\,k\Big(\frac{x_0 - x_i}{h_n}\Big)\Big]^2}_{O(h_n)\,\to\,0},
\]
so
\[
\text{Term (1)} \xrightarrow{d} N\Big(0,\ f(x_0)\int k^2(s)\,ds\Big).
\]
Now show that Term (2) $\xrightarrow{p} 0$ (this will require more stringent conditions on the bandwidth). Note
\[
E\,\frac{1}{h_n}k\Big(\frac{x_0 - x_i}{h_n}\Big) = \frac{1}{h_n}\int k\Big(\frac{x_0 - x_i}{h_n}\Big)f(x_i)\,dx_i
= \int k(s)\,f(x_0 - sh_n)\,ds \qquad \Big(s = \frac{x_0 - x_i}{h_n},\ ds = -\frac{dx_i}{h_n}\Big)
\]
\[
= \int k(s)\Big[f(x_0) + f'(x_0)(-sh_n) + f''(\bar x)\frac{s^2h_n^2}{2}\Big]\,ds
= f(x_0)\int k(s)\,ds - f'(x_0)h_n\int k(s)s\,ds + \frac{h_n^2}{2}\int f''(\bar x)s^2k(s)\,ds,
\]
where $\bar x$ lies between $x_0$ and $x_0 - sh_n$. Assume $|f''(\bar x)| \le M$ and $\int k(s)s\,ds = 0$ (satisfied if $k$ is symmetric).

Then
\[
E\,\frac{1}{h_n}k\Big(\frac{x_0 - x_i}{h_n}\Big) - f(x_0) = O(h_n^2),
\]
which implies that
\[
\sqrt{nh_n}\Big(E\,\frac{1}{h_n}k\Big(\frac{x_0 - x_i}{h_n}\Big) - f(x_0)\Big) = O\big(\sqrt{nh_n}\,h_n^2\big) = O\big(\sqrt{nh_n^5}\big).
\]
Hence, we require $nh_n^5 \to 0$ in addition to $h_n \to 0$ and $nh_n \to \infty$. The condition required for asymptotic normality is stronger than the one required for consistency.
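As an illustration of this result, here is a small Monte Carlo sketch (my own, not from the notes) that compares the simulated standard deviation of $\sqrt{nh_n}(\hat f(x_0) - f(x_0))$ with the asymptotic value $\sqrt{f(x_0)\int k^2(s)\,ds}$, using a Gaussian kernel, for which $\int k^2(s)\,ds = 1/(2\sqrt{\pi})$:

```python
import numpy as np
from scipy.stats import norm

def kde(x0, data, h):
    # Gaussian-kernel density estimate at x0.
    return norm.pdf((x0 - data) / h).mean() / h

rng = np.random.default_rng(1)
n, x0 = 20_000, 0.0
h = n ** (-1 / 4)                      # satisfies n h^5 -> 0 and n h -> infinity
draws = [np.sqrt(n * h) * (kde(x0, rng.standard_normal(n), h) - norm.pdf(x0))
         for _ in range(500)]
print("simulated sd:", np.std(draws))
print("asymptotic sd:", np.sqrt(norm.pdf(x0) / (2 * np.sqrt(np.pi))))
```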

Chapter 3: Nonparametric Regression

3.1 Overview

Consider the model
\[
y_i = g(x_i) + \epsilon_i,
\]
where we assume
\[
E(\epsilon_i\,|\,x_i) = 0, \qquad E(\epsilon_i^2\,|\,x_i) = c < \infty,
\]
and we would like to estimate $g(x)$ without having to impose functional form assumptions on $g$, apart from assumptions such as continuity and differentiability.

There are two types of nonparametric estimation methods: local and global. These two approaches reflect two different ways to reduce the problem of estimating a function to the estimation of real numbers.

Local approaches consider a real-valued function $g(x)$ at a single point $x = x_0$. The problem of estimating a function becomes that of estimating a real number, $g(x_0)$. If we are interested in evaluating the function in the neighborhood of the point $x_0$, we can approximate the function by $g(x_0)$ or, if $g(x)$ is continuously differentiable at $x_0$, a better approximation might be $g(x_0) + g'(x_0)(x - x_0)$. Thus, the problem of estimating a function at a

point may be thought of as estimating two real numbers, $g(x_0)$ and $g'(x_0)$, making use of observations in the neighborhood. Either way, if we want to estimate the function over a wider range of $x$ values, the same pointwise problem can be solved at the different points of evaluation.

Global approaches introduce a coordinate system in a space of functions, which reduces the problem of estimating a function to that of estimating a set of real numbers. Any element $v$ in a $d$-dimensional vector space can be uniquely expressed using a system of independent vectors $\{b_j\}_{j=1}^d$ as $v = \sum_{j=1}^d \theta_j b_j$, where one can think of $\{b_j\}_{j=1}^d$ as a system of coordinates and $(\theta_1, \dots, \theta_d)$ as the representation of $v$ using that coordinate system. Likewise, given an appropriate set of linearly independent functions $\{\phi_j(x)\}_{j=1}^\infty$ as coordinates, any square integrable real-valued function $g(x)$ has unique coefficients $\{\theta_j\}_{j=1}^\infty$ such that
\[
g(x) = \sum_{j=1}^\infty \theta_j \phi_j(x).
\]
One can think of $\{\phi_j(x)\}_{j=1}^\infty$ as a system of coordinates and $(\theta_1, \theta_2, \dots)$ as the representation of $g(x)$ using the coordinate system. This observation allows us to translate the problem of estimating a function into a problem of estimating a sequence of real numbers $\{\theta_j\}_{j=1}^\infty$. Well known bases are polynomial series and Fourier series; these bases are infinitely differentiable everywhere. Other well known bases are polynomial spline bases and wavelet bases.

The literature on nonparametric estimation methods is vast. These notes will focus on local methods, in particular kernel-based methods, which are among the most commonly used (at least by economists!).

3.2 The Local approach

Examples: kernel regression, local polynomial regression.

Early smoothing methods date back to the 1860s in the actuarial literature (see Cleveland (1979, JASA)), where they were used to study life expectancies at different ages.

One possibility is to estimate the conditional mean by
\[
E(y|x) = \int y\,f(y|x)\,dy = \int y\,\frac{f(x,y)}{f(x)}\,dy, \qquad
\hat E(y|x) = \int y\,\frac{\hat f(x,y)}{\hat f(x)}\,dy.
\]
Or, with repeated data at each point, one could estimate $g(x_0)$ by
\[
\hat g(x_0) = \frac{\sum_i y_i\,1(x_i = x_0)}{\sum_i 1(x_i = x_0)}.
\]
A drawback is that we can only estimate this way at points where we have data. We need a way of interpolating or extrapolating between and outside the data observations. We could create cells as in the histogram estimator:
\[
\hat g(x_0) = \frac{\sum_{i=1}^n y_i\,1\big(x_i \in [x_0, x_0 + h_n]\big)}{\sum_{i=1}^n 1\big(x_i \in [x_0, x_0 + h_n]\big)}
\qquad \text{or, binning symmetrically,} \qquad
\hat g(x_0) = \frac{\sum_{i=1}^n y_i\,1\big(\big|\frac{x_0 - x_i}{h_n}\big| < 1\big)}{\sum_{i=1}^n 1\big(\big|\frac{x_0 - x_i}{h_n}\big| < 1\big)}.
\]
If we want $\hat g(x)$ to be smooth, we need to choose a kernel function that goes to 0 smoothly.

[Figure 3.1: Repeated data estimator]

\[
\hat g(x_0) = \frac{\sum_{i=1}^n y_i\,k\big(\frac{x_0 - x_i}{h_n}\big)}{\sum_{i=1}^n k\big(\frac{x_0 - x_i}{h_n}\big)} = \sum_{i=1}^n y_i\,w_i(x_0),
\qquad
w_i(x_0) = \frac{k\big(\frac{x_0 - x_i}{h_n}\big)}{\sum_{j=1}^n k\big(\frac{x_0 - x_j}{h_n}\big)}, \qquad \sum_{i=1}^n w_i(x_0) = 1.
\]
Heuristically, dividing numerator and denominator by $nh_n$,
\[
\hat g(x_0) = \frac{\frac{1}{nh_n}\sum_{i=1}^n y_i\,k\big(\frac{x_0 - x_i}{h_n}\big)}{\frac{1}{nh_n}\sum_{i=1}^n k\big(\frac{x_0 - x_i}{h_n}\big)}
\xrightarrow{p} \frac{g(x_0)f(x_0)}{f(x_0)} = g(x_0).
\]
Note: we don't need $\int k(s)\,ds = 1$, but we do require that the kernels used in the numerator and denominator integrate to the same value.
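A minimal sketch of this (Nadaraya-Watson) estimator, with my own function names and the quartic kernel defined earlier:

```python
import numpy as np

def nw_regression(x0, x, y, h):
    """Nadaraya-Watson estimator: kernel-weighted average of y near x0."""
    s = (x0 - x) / h
    w = np.where(np.abs(s) <= 1, (15 / 16) * (1 - s**2) ** 2, 0.0)  # quartic
    return (w * y).sum() / w.sum()   # division-by-zero risk where f(x0) ~ 0

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 2_000)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(2_000)
print(nw_regression(0.5, x, y, h=0.1))   # close to sin(pi) = 0
```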

3.2.1 Two types of potential problems

(i) When the density at $x_0$ is low, the denominator tends to be close to 0 and we get a division-by-zero problem.
(ii) At boundary points, the estimator has higher-order bias, even with an appropriately chosen $k(\cdot)$. This problem is known as boundary bias.

3.2.2 Interpretation of Local Polynomial Estimators as a local regression

Solve this problem at each point of evaluation $x_0$ (which may or may not correspond to points in the data):
\[
(\hat a, \hat b_1, \dots, \hat b_k) = \operatorname*{argmin}_{\{a, b_1, \dots, b_k\}}\ \sum_{i=1}^n \big(y_i - a - b_1(x_i - x_0) - b_2(x_i - x_0)^2 - \dots - b_k(x_i - x_0)^k\big)^2\,k_i,
\qquad k_i = k\Big(\frac{x_0 - x_i}{h_n}\Big).
\]
Here $\hat a$ provides an estimator for $g(x_0)$, $\hat b_1$ for $g'(x_0)$, and $\hat b_2$ for $g''(x_0)/2$.

If $k = 0$,
\[
\hat a = \frac{\sum_{i=1}^n y_i\,k\big(\frac{x_0 - x_i}{h_n}\big)}{\sum_{i=1}^n k\big(\frac{x_0 - x_i}{h_n}\big)} = \sum_{i=1}^n y_i\,w_i(x_0), \qquad \sum_{i=1}^n w_i(x_0) = 1,
\]
the standard Nadaraya-Watson kernel regression estimator.

If $k = 1$ (local linear regression, LLR), the weighted least squares problem has the closed-form solution
\[
\hat a = \frac{\sum_{i=1}^n y_i k_i \sum_{j=1}^n k_j (x_j - x_0)^2 - \sum_{k=1}^n y_k k_k (x_k - x_0)\,\sum_{l=1}^n k_l (x_l - x_0)}
{\sum_{i=1}^n k_i \sum_{j=1}^n k_j (x_j - x_0)^2 - \big\{\sum_{k=1}^n k_k (x_k - x_0)\big\}^2},
\qquad k_i = k\Big(\frac{x_0 - x_i}{h_n}\Big).
\]
We can again write the expression as $\sum_{i=1}^n y_i\,w_i(x_0)$, where the weights satisfy
\[
\sum_{i=1}^n w_i(x_0) = 1 \qquad \text{and} \qquad \sum_{i=1}^n w_i(x_0)(x_i - x_0) = 0.
\]
The $k = 0$ weights (the standard Nadaraya-Watson kernel estimator) do not satisfy this second property.

The LLR estimator ($k = 1$) improves on the kernel estimator in two ways:
(a) the asymptotic bias does not depend on the design density of the data (i.e. $f(x)$ does not appear in the bias expression);
(b) the bias has a higher order of convergence at boundary points.

Intuitively, the standard kernel estimator fits a local constant, so when there is no data on the other side of $x_0$, the estimate will tend to be too high or too low: it does not capture the slope. LLR gets the slope. A code sketch of the estimator is given below.
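A minimal local linear regression sketch via weighted least squares (my own implementation; np.polyfit is used for the local fit, and since its `w` argument multiplies the residuals before squaring, we pass the square root of the kernel weights):

```python
import numpy as np

def quartic(s):
    return np.where(np.abs(s) <= 1, (15 / 16) * (1 - s**2) ** 2, 0.0)

def llr(x0, x, y, h):
    """Local linear regression at x0: weighted LS of y on (1, x - x0).
    The intercept estimates g(x0); the slope estimates g'(x0)."""
    w = quartic((x0 - x) / h)
    keep = w > 0                            # drop zero-weight points
    # np.polyfit minimizes sum((w_i * resid_i)^2), so pass sqrt of weights.
    slope, intercept = np.polyfit(x[keep] - x0, y[keep], 1, w=np.sqrt(w[keep]))
    return intercept, slope

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 2_000)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(2_000)
g0, g1 = llr(0.0, x, y, h=0.1)   # boundary point: LLR still captures the slope
print(g0, g1)                     # roughly sin(0) = 0 and 2*pi*cos(0) ~ 6.28
```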

[Figure 3.2: Boundary bias - how LLR improves on kernel regression]

With $y_i = g(x_i) + \varepsilon_i$,
\[
\hat g_{LLR}(x_0) - g(x_0) = \sum_{i=1}^n w_i(x_0)\,y_i - g(x_0)
= \sum_{i=1}^n w_i(x_0)\varepsilon_i + \sum_{i=1}^n w_i(x_0)\big(g(x_i) - g(x_0)\big)
\]
\[
= \sum_{i=1}^n w_i(x_0)\varepsilon_i + \sum_{i=1}^n w_i(x_0)\big[g'(x_0)(x_i - x_0) + \big(g'(\bar x) - g'(x_0)\big)(x_i - x_0)\big]
\]
\[
= \underbrace{\sum_{i=1}^n w_i(x_0)\varepsilon_i}_{\text{variance part}}
+ g'(x_0)\underbrace{\sum_{i=1}^n w_i(x_0)(x_i - x_0)}_{=0 \text{ with LLR}}
+ \sum_{i=1}^n w_i(x_0)\big(g'(\bar x) - g'(x_0)\big)(x_i - x_0).
\]

If $g'$ satisfies a Lipschitz condition, the last term can be bounded.

3.3 Properties of Nonparametric Regression Estimators

1. Consistency (we will show convergence in mean square)
2. Asymptotic Normality

In showing the asymptotic properties, we will work with the standard kernel regression estimator, because it is simpler to analyze than LLR and the basic techniques are the same:
\[
\hat m(x_0) = \frac{\frac{1}{nh_n}\sum_i y_i\,k\big(\frac{x_0 - x_i}{h_n}\big)}{\frac{1}{nh_n}\sum_i k\big(\frac{x_0 - x_i}{h_n}\big)}.
\]
Assume
(i) $h_n \to 0$, $nh_n \to \infty$;
(ii) $\int sk(s)\,ds = 0$, $\int k(s)\,ds \ne 0$, $\int k(s)s^2\,ds \le C$;
(iii) $f(x)$ and $g(x)$ are $C^2$ with bounded second derivatives.

3.3.1 Consistency

We already showed
\[
\frac{1}{nh_n}\sum_i k\Big(\frac{x_0 - x_i}{h_n}\Big) \xrightarrow{MS} f(x_0)\int k(s)\,ds \qquad \text{(denominator)}.
\]
We will now show
\[
\frac{1}{nh_n}\sum_i y_i\,k\Big(\frac{x_0 - x_i}{h_n}\Big) \xrightarrow{MS} f(x_0)g(x_0)\int k(s)\,ds \qquad \text{(numerator)},
\]
and then apply the Mann-Wald theorem (the plim of a continuous function is the continuous function of the plim) to conclude that the ratio $\xrightarrow{P} g(x_0)$.

Denominator:

\[
\operatorname{VAR}\Big(\frac{1}{nh_n}\sum_i k\Big(\frac{x_0 - x_i}{h_n}\Big)\Big) \le \frac{1}{nh_n^2}\,E\,k^2\Big(\frac{x_0 - x_i}{h_n}\Big) = \frac{1}{nh_n^2}\,O(h_n) = O\Big(\frac{1}{nh_n}\Big).
\]
Thus
\[
\frac{1}{nh_n}\sum_i k\Big(\frac{x_0 - x_i}{h_n}\Big) \xrightarrow{MS} E\Big(\frac{1}{h_n}k\Big(\frac{x_0 - x_i}{h_n}\Big)\Big) = f(x_0)\int k(s)\,ds + O(h_n^2).
\]
Numerator:
\[
E\Big(\frac{1}{nh_n}\sum_i y_i\,k\Big(\frac{x_0 - x_i}{h_n}\Big)\Big) = \frac{1}{h_n}E\Big(E(y_i|x_i)\,k\Big(\frac{x_0 - x_i}{h_n}\Big)\Big)
= \frac{1}{h_n}E\Big(g(x_i)\,k\Big(\frac{x_0 - x_i}{h_n}\Big)\Big)
= \frac{1}{h_n}\int g(x_i)\,k\Big(\frac{x_0 - x_i}{h_n}\Big)f(x_i)\,dx_i.
\]
Let $\phi(x_i) = f(x_i)g(x_i)$ and Taylor expand (as before):
\[
= \phi(x_0)\int k(s)\,ds + O(h_n^2), \qquad \text{where } \phi(x_0) = g(x_0)f(x_0). \tag{3.1}
\]
Also, one can show that
\[
\operatorname{VAR}\Big(\frac{1}{nh_n}\sum_i y_i\,k\Big(\frac{x_0 - x_i}{h_n}\Big)\Big) \le c_n, \qquad c_n \to 0 \text{ as } n \to \infty. \tag{3.2}
\]
Hence,

\[
\text{ratio} \xrightarrow{P} \frac{g(x_0)f(x_0)\int k(s)\,ds}{f(x_0)\int k(s)\,ds} = g(x_0),
\]
provided that $h_n \to 0$ and $nh_n \to \infty$.

3.3.2 Asymptotic Normality

We will show
\[
\sqrt{nh_n}\big(\hat g(x_0) - g(x_0)\big) \xrightarrow{d}
N\Big(\lim \Big[\frac{1}{2}g''(x_0) + \frac{g'(x_0)f'(x_0)}{f(x_0)}\Big]\sqrt{nh_n}\,h_n^2\int u^2k(u)\,du,\;
f(x_0)^{-1}\sigma^2(x_0)\int k^2(u)\,du\Big),
\]
where $\sigma^2(x_0) = E(\varepsilon_i^2\,|\,x_i = x_0)$ is the conditional variance. Write
\[
\sqrt{nh_n}\Big[\frac{\sum y_i k_i}{\sum k_i} - g(x_0)\Big] = \sqrt{nh_n}\,\frac{\sum (y_i - g(x_0))k_i}{\sum k_i}. \tag{3.3}
\]
With $y_i = g(x_i) + \varepsilon_i$, $E(\varepsilon_i|x_i) = 0$:
\[
= \sqrt{nh_n}\Bigg[\underbrace{\frac{\sum \varepsilon_i k_i}{\sum k_i}}_{\text{variance part}} + \underbrace{\frac{\sum (g(x_i) - g(x_0))k_i}{\sum k_i}}_{\text{bias part}}\Bigg]
= \frac{\frac{1}{\sqrt{nh_n}}\sum \varepsilon_i k_i}{\frac{1}{nh_n}\sum k_i} + \frac{\frac{1}{\sqrt{nh_n}}\sum (g(x_i) - g(x_0))k_i}{\frac{1}{nh_n}\sum k_i}.
\]
Assuming $\int k(s)\,ds = 1$, we showed before that

\[
\frac{1}{nh_n}\sum k_i \xrightarrow{MS} f(x_0) + O(h_n^2).
\]
We will now analyze the distribution of the numerator of the variance part:
\[
\frac{1}{\sqrt{nh_n}}\sum_i \varepsilon_i k_i = \frac{1}{\sqrt n}\sum_i \frac{1}{\sqrt{h_n}}\,\varepsilon_i\,k\Big(\frac{x_0 - x_i}{h_n}\Big),
\]
and apply a CLT to get
\[
\xrightarrow{d} N\Big(0,\ \lim \operatorname{VAR}\Big(\frac{1}{\sqrt{h_n}}\,\varepsilon_i\,k\Big(\frac{x_0 - x_i}{h_n}\Big)\Big)\Big). \tag{3.4}
\]
\[
\operatorname{VAR}\Big(\frac{1}{\sqrt{h_n}}\,\varepsilon_i\,k\Big(\frac{x_0 - x_i}{h_n}\Big)\Big)
= \frac{1}{h_n}E\Big(\varepsilon_i^2\,k^2\Big(\frac{x_0 - x_i}{h_n}\Big)\Big) + \text{higher order terms}
= \frac{1}{h_n}E\Big(\underbrace{E(\varepsilon_i^2|x_i)}_{=\sigma^2(x_i)}\,k^2\Big(\frac{x_0 - x_i}{h_n}\Big)\Big)
\]
\[
= \frac{1}{h_n}\int \sigma^2(x_i)\,k^2\Big(\frac{x_0 - x_i}{h_n}\Big)f(x_i)\,dx_i
= \sigma^2(x_0)f(x_0)\int k^2(s)\,ds + O(h_n) \qquad \text{(by change of variables)}.
\]
Now analyze the bias term:

\[
\frac{1}{\sqrt{nh_n}}\sum_i \big(g(x_i) - g(x_0)\big)k_i = \sqrt{\frac{n}{h_n}}\;\frac{1}{n}\sum_i \big(g(x_i) - g(x_0)\big)k\Big(\frac{x_0 - x_i}{h_n}\Big)
\]
\[
\approx \sqrt{\frac{n}{h_n}}\int \big(g(x_i) - g(x_0)\big)k\Big(\frac{x_0 - x_i}{h_n}\Big)f(x_i)\,dx_i
= \sqrt{nh_n}\int \big(g(x_0 - sh_n) - g(x_0)\big)k(s)\,f(x_0 - sh_n)\,ds
\]
\[
= \sqrt{nh_n}\int \Big\{g'(x_0)(-sh_n) + \frac{1}{2}g''(x_0)s^2h_n^2 + \frac{1}{2}\big[g''(\bar x) - g''(x_0)\big]s^2h_n^2\Big\}
\Big\{f(x_0) - sh_n f'(x_0) + \frac{s^2h_n^2}{2}f''(\bar x)\Big\}k(s)\,ds
+ \text{higher order (will ignore)}.
\]
If $g''$ satisfies a Lipschitz condition, $|g''(\bar x) - g''(x_0)| \le C|\bar x - x_0|$, then the $[g''(\bar x) - g''(x_0)]s^2h_n^2$ term is of higher order. Multiplying out, the other terms are
\[
\underbrace{-g'(x_0)f(x_0)\,h_n\int sk(s)\,ds}_{=0} + \big[g'(x_0)f'(x_0) + \tfrac{1}{2}g''(x_0)f(x_0)\big]h_n^2\int s^2k(s)\,ds
\underbrace{-\;\tfrac{1}{2}g'(x_0)f''(\bar x)\,h_n^3\int s^3k(s)\,ds}_{\text{higher order}} + \text{higher order terms},
\]
so we get
\[
\frac{1}{\sqrt{nh_n}}\sum_i \big(g(x_i) - g(x_0)\big)k_i \;\to\; \big[g'(x_0)f'(x_0) + \tfrac{1}{2}g''(x_0)f(x_0)\big]\,\sqrt{nh_n}\,h_n^2\int s^2k(s)\,ds.
\]

This requires $\sqrt{nh_n}\,h_n^2 \to 0$, i.e. $nh_n^5 \to 0$. Putting these results together, we get
\[
\sqrt{nh_n}\big(\hat g(x_0) - g(x_0)\big) \xrightarrow{d}
N\Big(\lim \Big[\frac{g'(x_0)f'(x_0)}{f(x_0)} + \frac{1}{2}g''(x_0)\Big]\sqrt{nh_n}\,h_n^2\int k(s)s^2\,ds,\;
f(x_0)^{-1}\sigma^2(x_0)\int k(s)^2\,ds\Big),
\]
which is what we set out to show.

Remarks: Using the asymptotic distribution requires plug-in estimators of $g'$, $g''$, $f$, $f'$, and $\sigma^2(x_0)$ (all the ingredients of the bias and variance).

$g'$, $g''$: obtain by local linear or local quadratic regression.

$f'$: the derivative of the estimator for $f$ gives an estimator of $f'$:
\[
\hat f'(x_0) = \frac{1}{nh_n^2}\sum_{i=1}^n k'\Big(\frac{x_0 - x_i}{h_n}\Big).
\]

$f$: obtain by the standard density estimator.

$\int k^2(s)\,ds$, $\int k(s)s^2\,ds$: constants that depend on the kernel function $k(s)$.

$\sigma^2(x_0)$: the conditional variance. Estimate by
\[
\hat E(\varepsilon_i^2\,|\,x_0) = \frac{\sum_i w_i(x_0)\,\hat\varepsilon_i^2}{\sum_i w_i(x_0)}, \qquad \hat\varepsilon_i = y_i - \hat m(x_i) \ \text{(fitted residuals)}.
\]

We showed the distribution of the kernel estimator. What about the distribution of LLR? Fan (JASA, 1992) showed
\[
\sqrt{nh_n}\big(\hat g_{LLR}(x_0) - g(x_0)\big) \xrightarrow{d}
N\Big(\underbrace{\lim \frac{1}{2}g''(x_0)\,\sqrt{nh_n}\,h_n^2\int k(s)s^2\,ds}_{\text{one less term in the bias expression}},\;
\underbrace{f(x_0)^{-1}\sigma^2(x_0)\int k^2(s)\,ds}_{\text{same variance}}\Big).
\]
Remarks:

The bias of LLR does not depend on $f(x_0)$ (the design density of the data). Because of this feature, Fan refers to the estimator as being "design-adaptive."

Fan showed that, in general, it is better to use an odd-order local polynomial: there is no cost in variance, and the bias of an odd-order fit does not depend on $f(x)$.

The LLR estimator does not suffer from the boundary bias problem.

3.3.3 How might we choose the bandwidth?

Could choose the bandwidth to minimize the pointwise asymptotic MSE (AMSE):
\[
AMSE_{LLR} = \Big[h_n^2\,\frac{1}{2}g''(x_0)\int k(s)s^2\,ds\Big]^2 + \frac{f(x_0)^{-1}\sigma^2(x_0)\int k^2(s)\,ds}{nh_n},
\]
\[
\frac{\partial AMSE}{\partial h_n} = h_n^3\,g''(x_0)^2\Big[\int k(s)s^2\,ds\Big]^2 - \frac{f(x_0)^{-1}\sigma^2(x_0)\int k^2(s)\,ds}{nh_n^2} = 0,
\]
\[
\Longrightarrow\ h_n^* = \underbrace{\Bigg[\frac{f(x_0)^{-1}\sigma^2(x_0)\int k^2(s)\,ds}{g''(x_0)^2\big[\int k(s)s^2\,ds\big]^2}\Bigg]^{1/5}}_{\text{constant}} n^{-1/5}.
\]
Remarks:
(1) Higher variance $\sigma^2(x_0)$: wider bandwidth.
(2) More variability in the function (larger $g''(x_0)$): narrower bandwidth.
(3) More data: narrower bandwidth. (With kernel regression, i.e. a local polynomial of degree 0, the bias term also depends on $f(x)$, and therefore the optimal bandwidth choice will depend on $f(x)$.)
(4) A difficulty in obtaining the optimal bandwidth is that it requires estimates of $\sigma^2(x_0)$ and $g''(x_0)$ (and, in the case of the usual kernel regression estimator, $f(x_0)$). This is the problem of how to choose a pilot bandwidth. One could assume normality in choosing a pilot for $f(x_0)$ (this is Silverman's rule-of-thumb method). One could also use a fixed bandwidth or nearest-neighbor methods for $\sigma^2(x_0)$ and $g''(x_0)$.

[Figure 3.3: Functions of different smoothness]

(5) Could minimize the asymptotic mean integrated squared error,
\[
AMISE = \int E\big(\hat g(x_0) - g(x_0)\big)^2\,dx_0,
\]
and pick a global bandwidth.

For further information on choosing the bandwidth, see below and see the book by Fan and Gijbels (1996).


Chapter 4: Bandwidth Selection for Density Estimation

4.1 First Generation Methods

4.1.1 Plug-in method (local version)

One could derive the optimal bandwidth at each point of evaluation $x_0$, which would lead to a pointwise, localized bandwidth estimator:
\[
\hat h(x_0) = \arg\min_h AMSE\big(\hat f(x_0)\big), \qquad
AMSE = E\big(\hat f(x_0) - f(x_0)\big)^2 = \frac{h^4}{4}\,f''(x_0)^2\Big[\int k(s)s^2\,ds\Big]^2 + \frac{1}{nh}\,f(x_0)\int k^2(s)\,ds,
\]
\[
\hat h(x_0) = \Bigg[\frac{\int k^2(s)\,ds\;f(x_0)}{\big[\int k(s)s^2\,ds\big]^2\,f''(x_0)^2}\Bigg]^{1/5}\,n^{-1/5}.
\]
We need to estimate $f(x_0)$ and $f''(x_0)$ to get the optimal plug-in bandwidth, and the optimal bandwidth would be different at different points of estimation.

4.1.2 Global plug-in method

Because the above method is computationally intensive, one could instead derive an optimal bandwidth common to all points of evaluation by minimizing a different criterion, the mean integrated squared error:

\[
MISE\big(\hat f\big) = \int E\big(\hat f(x_0) - f(x_0)\big)^2\,dx_0 \qquad \text{(removes $x_0$ by integration)}
\]
\[
= \frac{1}{nh}\underbrace{\int k^2(s)\,ds}_{A} + \frac{h^4}{4}\underbrace{\Big(\int k(s)s^2\,ds\Big)^2\int f''(x_0)^2\,dx_0}_{B},
\]
using $\int f(x_0)\,dx_0 = 1$, so
\[
h_{MISE} = \Big(\frac{A}{B}\Big)^{1/5}\,n^{-1/5}.
\]
Note: $\int f''(x_0)^2\,dx_0$ is a measure of the variation in $f$. If $f$ is highly variable, then $h$ will be small. Here we still need an estimate of $f''(x_0)$, but we no longer require an estimate of $f(x_0)$ (it was eliminated by integrating).

4.1.3 Rule-of-thumb Method (Silverman)

If $x$ is normally distributed, then Silverman shows (in his book Density Estimation) that, for the normal kernel,
\[
h_{MISE} = \Big(\frac{4}{3}\Big)^{1/5} SD(x)\,n^{-1/5} \approx 1.06\,SD(x)\,n^{-1/5}, \qquad SD = \text{standard deviation}.
\]
This method is often applied to nonnormal data, but it has a tendency to oversmooth, particularly when the data density is multimodal. Sometimes the standard deviation is replaced by the 75%-25% interquartile range. This is an older method, now available in some software packages, but its performance is not so good; other methods are better.
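A one-line implementation of the rule of thumb (my own sketch, using the Gaussian-kernel constant 1.06 and the common IQR-robustified variant; the min(SD, IQR/1.34) form is my assumption, not the notes' exact formula):

```python
import numpy as np

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth: 1.06 * min(SD, IQR/1.34) * n^(-1/5).
    The IQR/1.34 term robustifies the scale estimate; 1.34 is roughly
    the interquartile range of a standard normal."""
    n = len(data)
    sd = data.std(ddof=1)
    iqr = np.subtract(*np.percentile(data, [75, 25]))
    return 1.06 * min(sd, iqr / 1.34) * n ** (-1 / 5)

rng = np.random.default_rng(0)
print(silverman_bandwidth(rng.standard_normal(1_000)))
```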

4.1.4 Global Method #2: Least Squares Cross-validation

The idea is to minimize an estimate of the integrated squared error (ISE):
\[
\min_h ISE = \int \big\{\hat f(x_0) - f(x_0)\big\}^2\,dx_0
= \underbrace{\int \hat f(x_0)^2\,dx_0}_{\text{term } \#1}
\;-\; \underbrace{2\int \hat f(x_0)f(x_0)\,dx_0}_{\text{term } \#2}
\;+\; \underbrace{\int f(x_0)^2\,dx_0}_{\text{doesn't depend on } h}.
\]
Term #1:
\[
\int \hat f(x_0)^2\,dx_0 = \frac{1}{n^2h^2}\sum_i\sum_j \int k\Big(\frac{x_i - x_0}{h}\Big)k\Big(\frac{x_j - x_0}{h}\Big)\,dx_0.
\]
Let $s = \frac{x_j - x_0}{h}$, so $x_0 = x_j - sh$ and $\frac{x_i - x_0}{h} = \frac{x_i - x_j}{h} + s$:
\[
= \frac{1}{n^2h}\sum_i\sum_j \int k\Big(\frac{x_i - x_j}{h} + s\Big)k(s)\,ds
= \frac{1}{n^2h}\sum_i\sum_j \bar k\Big(\frac{x_i - x_j}{h}\Big),
\]
where $\bar k(a) = \int k(a - s)k(s)\,ds$. Note: $\bar k$ is the density of the sum of two random variables each with density $k$ (the convolution).

Now examine Term #2: $2\int \hat f(x_0)f(x_0)\,dx_0 = 2\,E\hat f(x)$, the expectation being taken over a new draw $x$ from $f$, independent of the sample. An unbiased estimator of this term is
\[
\frac{2}{N(N-1)h}\sum_i\sum_{j\ne i} k\Big(\frac{x_i - x_j}{h}\Big) \approx \frac{2}{N^2h}\sum_i\sum_{j\ne i} k\Big(\frac{x_i - x_j}{h}\Big).
\]
Note that we need to use the leave-one-out estimator,
\[
\hat f_{-i}(x_i) = \frac{1}{(n-1)h}\sum_{j\ne i} k\Big(\frac{x_i - x_j}{h}\Big),
\]
to get independence: the $i$th point does not get used, which amounts to removing the own-observation term $k(0)/(nh)$ from $\hat f(x_i)$. Putting the pieces together,
\[
\hat h_{CV} = \arg\min_h\; \underbrace{\frac{1}{n^2h}\sum_i\sum_j \bar k\Big(\frac{x_i - x_j}{h}\Big)}_{\text{need to calculate the convolution for this part}}
\;-\; \frac{2}{n^2h}\sum_i\sum_{j\ne i} k\Big(\frac{x_i - x_j}{h}\Big).
\]
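A minimal LSCV sketch for the Gaussian kernel, for which the convolution $\bar k$ is the $N(0,2)$ density (my own implementation; grid search as in remark (3) below):

```python
import numpy as np
from scipy.stats import norm

def lscv_criterion(h, data):
    """LSCV objective: term #1 (convolution part) minus 2 * leave-one-out term."""
    n = len(data)
    d = (data[:, None] - data[None, :]) / h          # all pairwise (x_i - x_j)/h
    term1 = norm.pdf(d, scale=np.sqrt(2)).sum() / (n**2 * h)   # kbar = N(0, 2)
    off_diag = norm.pdf(d).sum() - n * norm.pdf(0)   # drop the i == j terms
    term2 = 2 * off_diag / (n * (n - 1) * h)
    return term1 - term2

rng = np.random.default_rng(0)
data = rng.standard_normal(300)
grid = np.linspace(0.05, 1.0, 60)
h_cv = grid[np.argmin([lscv_criterion(h, data) for h in grid])]
print(h_cv)
```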

Remarks:
(1) It turns out that $\hat h_{CV}$ converges to the optimal bandwidth only at rate $n^{-1/10}$ (a result due to Stone), which is very slow. However, in Monte Carlo studies LSCV often does better than the rule-of-thumb method.
(2) Marron (1990) found a tendency for local minima of the criterion near small values of $h$, which leads to a tendency to undersmooth.
(3) LSCV is usually implemented by searching over a grid of bandwidth values.

4.1.5 Biased Cross-validation

\[
AMISE = \frac{1}{nh}\int k^2(s)\,ds + \frac{h^4}{4}\Big(\int s^2k(s)\,ds\Big)^2\int f''(x)^2\,dx + o\Big(\frac{1}{nh} + h^4\Big).
\]
What if we plug in $\hat f''(x)$? Note:
\[
\hat f(x) = \frac{1}{nh}\sum_i k\Big(\frac{x_i - x}{h}\Big), \qquad
\hat f'(x) = -\frac{1}{nh^2}\sum_i k'\Big(\frac{x_i - x}{h}\Big), \qquad
\hat f''(x) = \frac{1}{nh^3}\sum_i k''\Big(\frac{x_i - x}{h}\Big),
\]
\[
\hat f''(x)^2 = \frac{1}{n^2h^6}\sum_{i=1}^n\sum_{j=1}^n k''\Big(\frac{x_i - x}{h}\Big)k''\Big(\frac{x_j - x}{h}\Big).
\]
What is the bias in estimating $\int f''(x)^2\,dx$? One can show
\[
E\Big\{\int \hat f''(x)^2\,dx\Big\} = \int f''(x)^2\,dx + \frac{1}{nh^5}\int k''(s)^2\,ds + \text{higher order terms}.
\]

\[
h_{BCV} = \arg\min_h\; \frac{1}{nh}\int k^2(s)\,ds + \frac{h^4}{4}\Big(\int s^2k(s)\,ds\Big)^2\Big[\int \hat f''(x)^2\,dx \underbrace{-\ \frac{1}{nh^5}\int k''(s)^2\,ds}_{\text{subtract the bias here}}\Big].
\]

4.2 Second Generation Methods

Sheather and Jones (1991): Solve-the-equation Plug-in Method

\[
h_{SJPI} = \Bigg[\frac{\int k^2(s)\,ds}{\big[\int s^2k(s)\,ds\big]^2\,\int \hat f''_{g(h)}(x)^2\,dx}\Bigg]^{1/5}\,n^{-1/5};
\]
find the $h$ that solves this equation. We need to plug in an estimate of $f''(x)$, and the bandwidth that is optimal for estimating $f(x)$ is not the same as the optimal bandwidth for estimating $f''(x)$. Writing $R(f'') = \int f''(x)^2\,dx$, we can find an analogue of the AMSE for $R(\hat f'')$ and get
\[
g = C_1(k)\,\{R(f''')\}^{-1/7}\,n^{-1/7},
\]
so $R(f''') = \int f'''(x)^2\,dx$ enters into the formula for the optimal bandwidth $g$; $C_1(k)$ and the constants below are functions of the kernel. Now solve for $g$ as a function of $h$:
\[
g(h) = C_3(k)\,\Big\{\frac{R(f'')}{R(f''')}\Big\}^{1/7}\,h^{5/7}.
\]
Now we need an estimate of $f'''$. At this point, SJ suggest using a rule-of-thumb method based on a normal density assumption.

What is the main advantage of this method over other methods? For most methods,
\[
\frac{\hat h - h_{MISE}}{h_{MISE}} \sim n^{-p}.
\]

Recall that for CV, $p = 1/10$. For SJPI, $p = 5/14$ (close to $1/2$), so the rate of convergence of the bandwidth is faster. This method is available in some software packages.

Chapter 5: Bandwidth selection for Nonparametric Regression

Recall that the distribution of the LLR estimator was
\[
\sqrt{nh_n}\big(\hat g_{LLR}(x_0) - g(x_0)\big) \xrightarrow{d}
N\Big(\lim \frac{1}{2}g''(x_0)\,\sqrt{nh_n}\,h_n^2\int k(s)s^2\,ds,\;
f(x_0)^{-1}\sigma^2(x_0)\int k^2(s)\,ds\Big).
\]

5.1 Plug-in Estimator (local method)

Choose the bandwidth to minimize the (data-dependent) AMSE:
\[
\min_{h_n}\;\underbrace{\Big[\frac{1}{2}g''(x_0)\int k(s)s^2\,ds\Big]^2 h_n^4}_{\text{Bias}^2\,=\,C_0h_n^4}
+\ \underbrace{\frac{f(x_0)^{-1}\sigma^2(x_0)\int k^2(s)\,ds}{nh_n}}_{\text{Variance}\,=\,C_1n^{-1}h_n^{-1}}.
\]
Note:
\[
\min_{h_n}\ C_0h_n^4 + C_1n^{-1}h_n^{-1}: \quad 4C_0h_n^3 - C_1n^{-1}h_n^{-2} = 0
\;\Longrightarrow\; h_n^5 = \frac{C_1n^{-1}}{4C_0}
\;\Longrightarrow\; h_n^* = \Big(\frac{C_1}{4C_0}\Big)^{1/5}n^{-1/5},
\]
i.e.
\[
h_n^*(x_0) = \Bigg[\frac{f(x_0)^{-1}\sigma^2(x_0)\int k^2(s)\,ds}{g''(x_0)^2\big[\int k(s)s^2\,ds\big]^2}\Bigg]^{1/5}\,n^{-1/5},
\]
a variable (pointwise) bandwidth.

Remarks:
$g''$ high $\Rightarrow$ smaller bandwidth: we want a smaller bandwidth in regions where the function is highly variable.
$\sigma^2(x_0)$ high $\Rightarrow$ use a larger bandwidth.

5.2 Global Plug-in Estimator

Minimize a global criterion such as the MISE. For LLR this gives
\[
h_n = \Bigg[\frac{\int \sigma^2(x_0)\,dx_0\,\int k^2(s)\,ds}{\big[\int k(s)s^2\,ds\big]^2\,\int g''(x_0)^2 f(x_0)\,dx_0}\Bigg]^{1/5}\,n^{-1/5}.
\]
Reference: Fan & Gijbels (1996), Local Polynomial Regression.

5.3 Bandwidth Choice in Nonparametric Regression

More generally, for the local polynomial estimator of
\[
y_i = m(x_i) + \varepsilon_i,
\]
\[
\min \sum_{i=1}^n \big\{y_i - \alpha - \beta_1(x_i - x) - \dots - \beta_p(x_i - x)^p\big\}^2\,k\Big(\frac{x_i - x}{h_n}\Big),
\]
how do we choose $h_n$, given $p$?

5.4 Global MISE criterion

\[
h_{MISE} = \Bigg[\frac{\big((p+1)!\big)^2\,R(k_p)\,\overbrace{\int \sigma^2(x)\,dx}^{\operatorname{VAR}(y|x)}}{2(p+1)\,\mu_{p+1}(k_p)^2\,\int m^{(p+1)}(x)^2 f(x)\,dx}\Bigg]^{\frac{1}{2p+3}}\,n^{-\frac{1}{2p+3}},
\]
for $p$ odd (Fan & Gijbels (1996), Local Polynomial Regression). For LLR, $p = 1$ and we get the rate $n^{-1/5}$. Here $k_p$ is the equivalent kernel for the $p$th-order fit (a function of the kernel),
\[
R(k_p) = \int k_p(s)^2\,ds, \qquad \mu_l(k_p) = \int \mu^l\,k_p(\mu)\,d\mu.
\]
For $p = 1$ this reduces to
\[
h_{MISE} = \Bigg[\frac{R(k)\int \sigma^2(x)\,dx}{\mu_2(k)^2\int m''(x)^2 f(x)\,dx}\Bigg]^{1/5}\,n^{-1/5}.
\]

The unknowns are $\sigma^2(x)$, $m''(x)$, and $f(x)$. Apply the same plug-in idea to estimate $\hat m''(x)$ pointwise and then take an average. We still need $\int \sigma^2(x)\,dx$.

5.5 Blocking method

(1) Divide the support of $x$ into $N$ blocks $X_1, \dots, X_N$.
(2) Estimate $y = m(x) + \varepsilon$ by a $Q$-degree polynomial within each block, with fitted values $\hat m_j^Q(x)$ in block $j$. Then
\[
\hat\sigma^2(N) = \frac{1}{n - (Q+1)N}\sum_i\sum_j \big\{y_i - \hat m_j^Q(x_i)\big\}^2\,1(x_i \in X_j).
\]
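A sketch of the blocking variance estimator (my own; quartic within-block fits, i.e. Q = 4, with the degrees-of-freedom correction written as n - (Q+1)N, as reconstructed above):

```python
import numpy as np

def blocked_sigma2(x, y, n_blocks, q=4):
    """Estimate the error variance from residuals of block-wise
    degree-q polynomial fits of y on x."""
    n = len(x)
    edges = np.quantile(x, np.linspace(0, 1, n_blocks + 1))
    block = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_blocks - 1)
    rss = 0.0
    for j in range(n_blocks):
        inb = block == j
        coef = np.polyfit(x[inb], y[inb], q)          # Q-degree fit in block j
        rss += ((y[inb] - np.polyval(coef, x[inb])) ** 2).sum()
    return rss / (n - (q + 1) * n_blocks)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 2_000)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(2_000)
print(blocked_sigma2(x, y, n_blocks=5))   # roughly 0.3^2 = 0.09
```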

5.6 Bandwidth Selection for Nonparametric Regression - Cross-Validation

Recall that the LLR estimator is defined by
\[
\hat a = \arg\min_{a,b}\ \sum_{i=1}^n \big(y_i - a - b(x_i - x_0)\big)^2\,k\Big(\frac{x_i - x_0}{h_n}\Big).
\]
Let
\[
y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \qquad
X = \begin{pmatrix} 1 & (x_1 - x_0) \\ \vdots & \vdots \\ 1 & (x_N - x_0) \end{pmatrix}, \qquad
W = \operatorname{diag}\Big(k\Big(\frac{x_1 - x_0}{h}\Big), \dots, k\Big(\frac{x_N - x_0}{h}\Big)\Big),
\]
where the weights depend on $x_0$ and $h$. Then
\[
\hat g_{LLR}(x_0) = [1\ \ 0]\,(X'WX)^{-1}X'Wy,
\]
the intercept coefficient from a weighted least squares problem. Stacking the fitted values at the observation points $x_1, \dots, x_N$ defines the $N \times N$ smoother matrix $H(h)$, whose $i$th row is $[1\ \ 0](X'WX)^{-1}X'W$ evaluated at $x_0 = x_i$, so that
\[
\hat g_{LLR} = H(h)\,y.
\]
Choose $h$ to minimize $E\big((\hat g_{LLR} - g)'(\hat g_{LLR} - g)\,\big|\,x\big)$, the sum of conditional MSEs at the points of evaluation:
\[
E\big((H(h)y - g)'(H(h)y - g)\,|\,x\big)
= E\big(\big((H(h) - I)g + H(h)\varepsilon\big)'\big((H(h) - I)g + H(h)\varepsilon\big)\,|\,x\big)
\]
\[
= g'(H(h) - I)'(H(h) - I)g + E\big(\varepsilon'H(h)'H(h)\varepsilon\,|\,x\big)
= g'(H(h) - I)'(H(h) - I)g + \sigma^2\operatorname{tr}\big(H(h)'H(h)\big),
\]
using $E(\varepsilon|x) = 0$, $E(\varepsilon\varepsilon'|x) = \sigma^2 I$, $\operatorname{tr}(AB) = \operatorname{tr}(BA)$, $\operatorname{tr}(A+B) = \operatorname{tr}(A) + \operatorname{tr}(B)$, and the fact that the trace is a linear mapping. We need an estimator for this criterion. Consider what is estimated by the sum of squared fitted residuals:
\[
(y - H(h)y)'(y - H(h)y) = y'(I - H(h))'(I - H(h))y.
\]
With $y = g + \varepsilon$:
\[
= g'(I - H(h))'(I - H(h))g + 2g'(I - H(h))'(I - H(h))\varepsilon + \varepsilon'(I - H(h))'(I - H(h))\varepsilon,
\]
and, conditional on $x$,
\[
E\big(g'(I - H(h))'(I - H(h))g\,|\,x\big) = g'(I - H(h))'(I - H(h))g,
\]
\[
E\big(g'(I - H(h))'(I - H(h))\varepsilon\,|\,x\big) = 0,
\]
\[
E\big(\varepsilon'(I - H(h))'(I - H(h))\varepsilon\,|\,x\big)
= \sigma^2\operatorname{tr}\big((I - H(h))'(I - H(h))\big)
= \sigma^2\big[\operatorname{tr} I - 2\operatorname{tr} H(h) + \operatorname{tr}\big(H(h)'H(h)\big)\big].
\]
Together,
\[
E\big[(y - H(h)y)'(y - H(h)y)\,|\,x\big] = g'(I - H(h))'(I - H(h))g + \sigma^2\operatorname{tr} I
\underbrace{-\ 2\sigma^2\operatorname{tr} H(h)}_{\text{don't want this term}} + \sigma^2\operatorname{tr}\big(H(h)'H(h)\big).
\]
Apart from the constant $\sigma^2\operatorname{tr} I = N\sigma^2$, this matches the criterion except for the unwanted $-2\sigma^2\operatorname{tr} H(h)$ term. If we use $\hat g_{-i}(x_i)$ instead of $\hat g(x_i)$ (the leave-one-out estimator),

then the corresponding smoother matrix has zeros on its diagonal, so $\operatorname{tr} H(h) = 0$, and
\[
\hat h_{CV} = \arg\min_{h \in H}\ \sum_{i=1}^n \big(y_i - \hat g_{-i}(x_i)\big)^2.
\]
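A minimal leave-one-out cross-validation sketch for the bandwidth (my own implementation; it uses Nadaraya-Watson rather than LLR to keep the code short, but the leave-one-out logic is the same):

```python
import numpy as np

def loo_cv_score(h, x, y):
    """Leave-one-out CV criterion for Nadaraya-Watson with a Gaussian kernel."""
    d = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * d**2)          # kernel weights; row i = weights for x_i
    np.fill_diagonal(w, 0.0)         # leave one out: drop the own observation
    g_loo = (w @ y) / w.sum(axis=1)  # leave-one-out fitted values
    return ((y - g_loo) ** 2).sum()

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(500)
grid = np.linspace(0.01, 0.3, 30)
h_cv = grid[np.argmin([loo_cv_score(h, x, y) for h in grid])]
print(h_cv)
```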


More information

Pointwise convergence rates and central limit theorems for kernel density estimators in linear processes

Pointwise convergence rates and central limit theorems for kernel density estimators in linear processes Pointwise convergence rates and central limit theorems for kernel density estimators in linear processes Anton Schick Binghamton University Wolfgang Wefelmeyer Universität zu Köln Abstract Convergence

More information

1 Empirical Likelihood

1 Empirical Likelihood Empirical Likelihood February 3, 2016 Debdeep Pati 1 Empirical Likelihood Empirical likelihood a nonparametric method without having to assume the form of the underlying distribution. It retains some of

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Nonparametric Inference in Cosmology and Astrophysics: Biases and Variants

Nonparametric Inference in Cosmology and Astrophysics: Biases and Variants Nonparametric Inference in Cosmology and Astrophysics: Biases and Variants Christopher R. Genovese Department of Statistics Carnegie Mellon University http://www.stat.cmu.edu/ ~ genovese/ Collaborators:

More information

Stat 5100 Handout #26: Variations on OLS Linear Regression (Ch. 11, 13)

Stat 5100 Handout #26: Variations on OLS Linear Regression (Ch. 11, 13) Stat 5100 Handout #26: Variations on OLS Linear Regression (Ch. 11, 13) 1. Weighted Least Squares (textbook 11.1) Recall regression model Y = β 0 + β 1 X 1 +... + β p 1 X p 1 + ε in matrix form: (Ch. 5,

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

Lecture Notes 15 Prediction Chapters 13, 22, 20.4.

Lecture Notes 15 Prediction Chapters 13, 22, 20.4. Lecture Notes 15 Prediction Chapters 13, 22, 20.4. 1 Introduction Prediction is covered in detail in 36-707, 36-701, 36-715, 10/36-702. Here, we will just give an introduction. We observe training data

More information

36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1

36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1 36. Multisample U-statistics jointly distributed U-statistics Lehmann 6.1 In this topic, we generalize the idea of U-statistics in two different directions. First, we consider single U-statistics for situations

More information

Single Equation Linear GMM with Serially Correlated Moment Conditions

Single Equation Linear GMM with Serially Correlated Moment Conditions Single Equation Linear GMM with Serially Correlated Moment Conditions Eric Zivot October 28, 2009 Univariate Time Series Let {y t } be an ergodic-stationary time series with E[y t ]=μ and var(y t )

More information