3 Nonparametric Density Estimation


Example: Income distribution

Source: U.K. Family Expenditure Survey (FES)
- Approximately 7000 British households per year
- For each household many different variables are observed: income, expenditures, age of household members, professions, etc.

Nominal net incomes (136 of 7041 households):

Characterizing distributions

i.i.d. random sample $X_1,\dots,X_n$ with underlying density $f$

Traditional density estimator: the histogram

Histogram for FES income data (year = 1976):

[Figure: histogram of income]

There are some drawbacks when using a histogram for estimating a density:
- How should the binwidth (as well as the starting point) be chosen?
- A histogram is a step function (non-smooth); it is an inefficient estimator of the underlying density $f$.

[Figure: histogram for FES income data in the year 1983 (large binwidth)]

[Figure: histogram for FES income data in the year 1983 (small binwidth)]

3.1 Kernel density estimator: basic properties

Data: i.i.d. sample $X_1,\dots,X_n$ of a continuous random variable $X$

Problem: estimate the density function $f(x)$

Qualitative assumption: $f$ is smooth (at least twice differentiable)

[Figure: histogram of income]

Histogram with binwidth $2h$: intervals $[x_{j-1}, x_j)$ with $x_j - x_{j-1} = 2h$; estimation at the center points $x = (x_{j-1}+x_j)/2$ with

$\hat f_{hist}(x) = \frac{\#\{X_i \in [x_{j-1}, x_j)\}}{2hn} = \frac{1}{nh}\sum_{i=1}^n K\Big(\frac{x - X_i}{h}\Big)$,

where

$K(z) = \begin{cases} 1/2 & \text{if } z \in [-1, 1) \\ 0 & \text{else} \end{cases}$

Moving histogram (= elementary kernel estimator with rectangular kernel): estimation at each point $x$ by

$\hat f_h(x) = \frac{1}{nh}\sum_{i=1}^n K\Big(\frac{x - X_i}{h}\Big)$

$h$ – bandwidth (the original histogram has binwidth $2h$)

Motivation:

$\frac{\#\{X_i \in [x-h, x+h]\}}{2hn} = \frac{P(X_i \in [x-h, x+h])}{2h}\Big(1 + O_P\big(\tfrac{1}{\sqrt{nh}}\big)\Big)$, where $\frac{P(X_i \in [x-h, x+h])}{2h} = f(x) + O(h^2)$.

General definition of a kernel density estimator: replace the rectangular kernel by a smooth, continuously differentiable kernel function $K$ and estimate $f(x)$ by

$\hat f_h(x) = \frac{1}{nh}\sum_{i=1}^n K\Big(\frac{x - X_i}{h}\Big)$ for all $x \in \mathbb{R}$.

The bandwidth $h$ as well as the kernel function $K$ have to be chosen by the econometrician.

Simple properties of kernel density estimators:
- Positivity: $K \ge 0$ implies $\hat f_h \ge 0$
- Smoothness: $K$ continuous, differentiable implies $\hat f_h$ continuous, differentiable
- $K$ density function implies $\hat f_h$ density function
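As a numerical illustration, the following is a minimal sketch of this estimator with a Gaussian kernel; the function names and the simulated toy data are illustrative and not part of the notes.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel K(u) = exp(-u^2/2) / sqrt(2*pi)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(x_grid, data, h, kernel=gaussian_kernel):
    """Kernel density estimator f_hat_h(x) = (1/(n h)) * sum_i K((x - X_i)/h), evaluated on x_grid."""
    n = len(data)
    u = (np.asarray(x_grid)[:, None] - np.asarray(data)[None, :]) / h   # (x - X_i)/h
    return kernel(u).sum(axis=1) / (n * h)

# toy example: estimate the density of a simulated sample
rng = np.random.default_rng(0)
sample = rng.normal(size=500)
grid = np.linspace(-4.0, 4.0, 201)
f_hat = kde(grid, sample, h=0.4)
```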

Second-order kernel $K$: $K$ is a density function which is symmetric around 0, i.e. $\int K(x)\,dx = 1$ and $\int x K(x)\,dx = 0$.

Important (second-order) kernel functions:

Family of symmetric Beta densities: for $p = 0, 1, 2, \dots$

$K(u; p) = \mathrm{Const}_p\,(1 - u^2)^p$ for $u \in [-1, 1]$ and $0$ else

Resulting kernels for different values of $p$ ($u \in [-1,1]$):
- $p = 0$ Uniform kernel: $K(u) = \frac{1}{2}$
- $p = 1$ Epanechnikov kernel: $K(u) = \frac{3}{4}(1 - u^2)$
- $p = 2$ Quartic/Biweight kernel: $K(u) = \frac{15}{16}(1 - u^2)^2$
- $p = 3$ Triweight kernel: $K(u) = \frac{35}{32}(1 - u^2)^3$

Gaussian kernel: $K(u) = \varphi(u) = \frac{1}{\sqrt{2\pi}}\exp(-u^2/2)$, $u \in \mathbb{R}$

Possible generalization: for $m = 3, 4, 5, \dots$ an $m$-th order kernel function has to satisfy

$\int K(x)\,dx = 1$, $\int x^q K(x)\,dx = 0$ for $q = 1,\dots,m-1$, and $\int x^m K(x)\,dx \ne 0$

Note: Higher order kernels are almost never used in practice. The problem is that for $m > 2$ the above conditions imply that $K(x) < 0$ for some $x \in \mathbb{R}$. This implies that in general the resulting estimate $\hat f_h(x)$ is not a density.

Kernel density estimators for different bandwidths (Gaussian kernel):

[Figure: Family Expenditure Survey (1990), income before housing costs — two panels with different bandwidths $h$]

[Figure: Family Expenditure Survey (1990), income before housing costs — two further panels with different bandwidths $h$]

Kernel estimator with normal-reference bandwidth:

[Figure: Family Expenditure Survey (1990), normal reference bandwidth, income before housing costs]

Kernel density estimator with estimated optimal bandwidth (plug-in):

[Figure: Family Expenditure Survey (1990), Sheather/Jones plug-in bandwidth, income before housing costs]

3.2 The accuracy of kernel density estimators

In the following we will consider the asymptotic behavior of a kernel density estimator as $n \to \infty$ and $h \equiv h_n \to 0$ such that $\frac{1}{nh} \to 0$.

We will need some additional assumptions:
- The underlying density $f$ is twice continuously differentiable.
- $K$ is a continuous second order kernel function with compact support $[-1, 1]$.

Note: A compact support of $K$ is assumed in order to simplify arguments. The asymptotic expansions for bias and variance of the estimator remain valid under weaker assumptions: $K$ is a bounded function which is continuous on its support $S_K \subset \mathbb{R}$, and $\lim_{|y|\to\infty} y^2 K(y) = 0$.

We now first derive the pointwise bias of $\hat f_h(x)$ at an arbitrary point $x$:

$E(\hat f_h(x)) = \frac{1}{n}\sum_{i=1}^n E\Big(\frac{1}{h}K\big(\tfrac{x - X_i}{h}\big)\Big) = E\Big(\frac{1}{h}K\big(\tfrac{x - X_i}{h}\big)\Big) = \int \frac{1}{h} K\Big(\frac{x-u}{h}\Big) f(u)\,du$

and

$\int \frac{1}{h}K\Big(\frac{x-u}{h}\Big) f(u)\,du = \int K(y)\, f(x + yh)\,dy = \int K(y)\Big\{f(x) + f'(x)\,yh + \frac{1}{2!} f''(x)\,y^2 h^2\Big\}dy + o(h^2) = f(x) + \frac{h^2}{2} f''(x) \underbrace{\int K(y) y^2\,dy}_{\nu_2(K)} + o(h^2)$

$\Rightarrow \quad \mathrm{Bias}(\hat f_h(x)) = E(\hat f_h(x)) - f(x) = \frac{h^2\,\nu_2(K)}{2} f''(x) + o(h^2)$

For the variance we obtain:

$\mathrm{Var}(\hat f_h(x)) = \frac{1}{n}\, E\Big(\Big(\frac{1}{h}K\big(\tfrac{x-X_i}{h}\big) - E\big(\tfrac{1}{h}K\big(\tfrac{x-X_i}{h}\big)\big)\Big)^2\Big) = \frac{f(x)}{nh}\underbrace{\int K(y)^2\,dy}_{R(K)} + o\Big(\frac{1}{nh}\Big)$

This implies that the mean squared error is given by

$\mathrm{MSE}(\hat f_h(x)) = \mathrm{Bias}(\hat f_h(x))^2 + \mathrm{Var}(\hat f_h(x)) = \frac{h^4\,\nu_2(K)^2}{4} f''(x)^2 + \frac{f(x)}{nh} R(K) + o\Big(h^4 + \frac{1}{nh}\Big)$

We can immediately infer that as $n \to \infty$ and $h \equiv h_n \to 0$, $\frac{1}{nh} \to 0$, the kernel estimator $\hat f_h(x)$ is a weakly consistent estimator of $f(x)$,

$\hat f_h(x) \;\to^P\; f(x)$

It can even be shown that there is uniform convergence in probability,

$\sup_{x\in\mathbb{R}} |\hat f_h(x) - f(x)| \;\to^P\; 0$

At the same time the central limit theorem implies that

$\sqrt{nh}\,\big(\hat f_h(x) - E(\hat f_h(x))\big) \;\to^D\; N\big(0, f(x)R(K)\big)$,

which can also be written in the form

$\hat f_h(x) \sim AN\Big(E(\hat f_h(x)),\ \frac{f(x)}{nh}R(K)\Big)$

A general measure of the accuracy of a kernel density estimator over all points $x \in \mathbb{R}$ is the mean integrated squared error (MISE):

$\mathrm{MISE}(\hat f_h) = \int E\big(\hat f_h(x) - f(x)\big)^2 dx = \int \mathrm{MSE}(\hat f_h(x))\,dx = \int \mathrm{Bias}(\hat f_h(x))^2\,dx + \int \mathrm{Var}(\hat f_h(x))\,dx = \frac{h^4\,\nu_2(K)^2}{4}\int f''(x)^2\,dx + \frac{R(K)}{nh} + o\Big(h^4 + \frac{1}{nh}\Big)$

The formula shows that the choice of an appropriate bandwidth is crucial for the accuracy of a kernel density estimator:
- The bias decreases as $h$ decreases.
- The variance increases as $h$ decreases.

When ignoring the smaller order terms $o(h^4 + \frac{1}{nh})$, an asymptotically optimal bandwidth balancing squared bias and variance minimizes the asymptotic MISE:

$h_{opt} = \Big\{\frac{R(K)}{n\,\nu_2(K)^2 \int f''(x)^2\,dx}\Big\}^{1/5}$,

and for large $n$ the corresponding minimal value of the asymptotic mean integrated squared error is given by

$\mathrm{MISE}(\hat f_{h_{opt}}) = \min_{h>0} \mathrm{MISE}(\hat f_h) = \frac{5}{4}\Big\{\nu_2(K)^2 R(K)^4 \int f''(x)^2\,dx\Big\}^{1/5} n^{-4/5}$

This implies that when using an optimal bandwidth a kernel density estimator has the rate of convergence $n^{-2/5}$:

$\hat f_{h_{opt}}(x) - f(x) = O_P(n^{-2/5})$,

while the resulting MISE depends on
- the curvature $\int f''(x)^2\,dx$ of the (unknown) true density $f$,
- the choice of the kernel function, i.e. the constant $\nu_2(K)^2 R(K)^4$.

Note: If a kernel density estimator is applied with a bandwidth $h \ne h_{opt}$, then it is common to speak of undersmoothing if $h < h_{opt}$ and of oversmoothing if $h > h_{opt}$.

Theory of optimal kernels: An obvious idea is to choose the kernel function which provides the minimal value of $C(K) = \nu_2(K)^2 R(K)^4$.

This leads to the variational problem of minimizing $C(K)$ with respect to all possible second order kernels $K$. Hodges and Lehmann (1956) showed that the Epanechnikov kernel is optimal in the sense that it provides the minimal value $C_{opt}$.

The efficiency of some kernel $K$ relative to the Epanechnikov kernel is usually defined as $C_{opt}/C(K)$. The following table shows that, although the Epanechnikov kernel is optimal, the efficiency loss when using other popular kernels is very limited. This explains why in practice the biweight or triweight kernels are often preferred to the Epanechnikov kernel, since the latter does not possess continuous first derivatives.

[Table: relative efficiencies $C_{opt}/C(K)$ for the Epanechnikov, Uniform, Biweight, Triweight and Normal kernels]

3.3 Optimal rates of convergence

Consider an estimator $\hat\theta_n \equiv \hat\theta_n(X_1,\dots,X_n)$ of some parameter (vector) $\theta \in \mathbb{R}^d$, $d \ge 1$, of interest. For some $r > 0$, $\hat\theta_n$ possesses the rate of convergence $n^{-r}$ if

$\|\hat\theta_n - \theta\| = O_P(n^{-r})$ and $n^{-r} = O_P(\|\hat\theta_n - \theta\|)$

The rate of convergence tells us how fast the accuracy of the estimator improves as the sample size increases. An important justification for the use of a specific estimator consists in verifying that the estimator achieves the optimal rate of convergence.

Optimal rates of convergence depend on the nature of the estimation problem to be studied. Let $\Omega$ denote the space of all possible values of the unknown parameter $\theta$. Note that the probability distribution $P_\theta$ of an estimator $\hat\theta_n$ will then depend on the true value of $\theta \in \Omega$.

For some $r > 0$, $n^{-r}$ is a lower rate of convergence if there exists a constant $c > 0$ such that for any possible estimator $\hat\theta_n \equiv \hat\theta_n(X_1,\dots,X_n)$

$\liminf_{n\to\infty}\ \sup_{\theta\in\Omega}\ P_\theta\big(\|\hat\theta_n - \theta\| > c\,n^{-r}\big) > 0$

Moreover, $n^{-r}$ is an achievable rate of convergence if there exists an estimator $\hat\theta_n$ such that

$\lim_{c\to\infty}\ \limsup_{n\to\infty}\ \sup_{\theta\in\Omega}\ P_\theta\big(\|\hat\theta_n - \theta\| > c\,n^{-r}\big) = 0$

The rate $n^{-r}$ is called an optimal rate of convergence if it is both a lower and an achievable rate of convergence.

Optimal rates of convergence are usually only of interest in nonparametric settings. In standard parametric problems (there are exceptions!) the optimal rate of convergence is $n^{-1/2}$. This follows from the fact that for such estimators the bias is asymptotically negligible, while the variance is of order $n^{-1}$, i.e. the standard deviation is of order $n^{-1/2}$. In many situations there also exist bounds for the variance of the most efficient estimator (Cramér-Rao lower bound).

The situation is different in nonparametric function estimation. Any estimator then has to balance bias and variance, which means that the rate $n^{-1/2}$ cannot be reached. In nonparametric regression or density estimation we generally have to estimate a $d$-dimensional function $f$, $d \ge 1$, from given data. If the aim is to estimate the $r$-th order derivative of $f$,

$r = 0, 1, 2, \dots$, then for classes of $p$-times continuously differentiable functions with bounded $p$-th derivatives, the optimal rates of convergence are

$\hat f^{(r)}(x) - f^{(r)}(x) = O_P\big(n^{-\frac{p-r}{2p+d}}\big)$

For the problem of estimating a density function $f$ (i.e. $r = 0$) this means:
- If $f$ is twice continuously differentiable, i.e. $p = 2$, then the optimal rate of convergence is $n^{-2/5}$. Hence, using a second-order kernel, a kernel density estimator achieves the optimal rate of convergence.
- If $f$ is four times continuously differentiable, i.e. $p = 4$, then the optimal rate of convergence is $n^{-4/9}$. When using a second-order kernel, the corresponding kernel density estimator still only possesses the rate of convergence $n^{-2/5}$. In this case, optimal rates of convergence can in principle be reached by using a fourth order kernel (often not a good idea in practice).

3.4 Estimating derivatives of f

Based on a smooth kernel it is easy to estimate derivatives $f^{(r)}(x)$, $r = 1, 2, \dots$. If $K$ is $r$-times continuously differentiable, then $\hat f_h(x)$ is also $r$-times continuously differentiable, and an estimate of $f^{(r)}(x)$ is given by

$\hat f_h^{(r)}(x) = \frac{1}{nh^{r+1}}\sum_{i=1}^n K^{(r)}\Big(\frac{x - X_i}{h}\Big)$

Assume a second-order kernel $K$ which is $r$-times continuously differentiable and has the compact support $[-1, 1]$ (e.g. the biweight kernel

if $r = 1$, or the triweight kernel if $r = 2$). If $f$ is at least $(r+2)$-times continuously differentiable, then it is easily verified that

$E\big(\hat f_h^{(r)}(x)\big) = f^{(r)}(x) + \frac{h^2\,\nu_2(K)}{2} f^{(r+2)}(x) + o(h^2)$

and

$\mathrm{Var}\big(\hat f_h^{(r)}(x)\big) = \frac{f(x)}{nh^{2r+1}} R(K^{(r)}) + o\Big(\frac{1}{nh^{2r+1}}\Big)$

$\mathrm{MSE}\big(\hat f_h^{(r)}(x)\big) = h^4\,\frac{\nu_2(K)^2}{4} f^{(r+2)}(x)^2 + \frac{f(x)}{nh^{2r+1}} R(K^{(r)}) + o\Big(h^4 + \frac{1}{nh^{2r+1}}\Big)$

An optimal bandwidth $h_{opt}^{(r)}$ for estimating $f^{(r)}(x)$ is thus very different from an optimal bandwidth $h_{opt}$ for estimating $f(x)$:
- If $r = 1$, then an optimal bandwidth for estimating $f'(x)$ is of order $h_{opt}^{(1)} \sim n^{-1/7}$.
- If $r = 2$, then an optimal bandwidth for estimating $f''(x)$ is of order $h_{opt}^{(2)} \sim n^{-1/9}$.

3.5 Bandwidth selection

Normal reference bandwidth: A straightforward approach is to use a standard family of distributions to assign a value to the term $\int f''(x)^2\,dx$ in the asymptotic expression of $h_{opt}$. In many applications one may expect that the structure of the true density $f$ is not extremely different from the structure of a normal density. A reasonable approximation of an optimal bandwidth for estimating $f$ may then be obtained by referring to the optimal bandwidth of a normal density.

Normal density: $\varphi_{\mu,\sigma}(x) = \frac{1}{\sigma}\varphi\big(\frac{x-\mu}{\sigma}\big)$, where $\varphi$ is the standard normal density, and where $\mu$, $\sigma^2$ are mean and variance of $X_i$. Some calculations lead to

$\int \varphi_{\mu,\sigma}''(x)^2\,dx = \frac{3}{8\sqrt{\pi}\,\sigma^5}$

Normal reference bandwidth:

$h_{NR} = \Big\{\frac{8\sqrt{\pi}\,R(K)}{3\,\nu_2(K)^2\,n}\Big\}^{1/5}\hat\sigma$,

where $\hat\sigma$ denotes a suitable estimate of the standard deviation of $X$.

Cross-validation: We obviously obtain

$\int\big(\hat f_h(x) - f(x)\big)^2 dx = \int \hat f_h(x)^2\,dx - 2\int \hat f_h(x)f(x)\,dx + \int f(x)^2\,dx$

Since $\int f(x)^2\,dx$ does not depend on $h$, minimizing $\mathrm{MISE}(\hat f_h)$ over $h$ is equivalent to minimizing

$E\Big(\int \hat f_h(x)^2\,dx\Big) - 2\,E\Big(\int \hat f_h(x)f(x)\,dx\Big)$

These terms can be estimated by cross-validation:

$CV(h) = \int \hat f_h(x)^2\,dx - \frac{2}{n}\sum_{i=1}^n \hat f_{h,-i}(X_i)$,

where $\hat f_{h,-i}$ denotes the kernel estimator obtained when dropping the $i$-th observation.
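The following sketch implements both selectors for the Gaussian kernel ($R(K) = 1/(2\sqrt{\pi})$, $\nu_2(K) = 1$), reusing the `kde` and `gaussian_kernel` helpers from the earlier sketch; the grid-based approximation of $\int \hat f_h^2$ and the candidate bandwidth range are choices made here for illustration.

```python
import numpy as np

def h_normal_reference(data):
    """h_NR = {8 sqrt(pi) R(K) / (3 nu2(K)^2 n)}^(1/5) * sigma_hat for the Gaussian kernel."""
    n = len(data)
    R_K, nu2_K = 1.0 / (2.0 * np.sqrt(np.pi)), 1.0
    return (8.0 * np.sqrt(np.pi) * R_K / (3.0 * nu2_K ** 2 * n)) ** 0.2 * np.std(data, ddof=1)

def cv_score(h, data, grid):
    """CV(h) = int f_hat_h(x)^2 dx - (2/n) sum_i f_hat_{h,-i}(X_i); integral approximated on a uniform grid."""
    n = len(data)
    f_hat = kde(grid, data, h)
    int_f2 = np.sum(f_hat ** 2) * (grid[1] - grid[0])
    K = gaussian_kernel((data[:, None] - data[None, :]) / h)
    np.fill_diagonal(K, 0.0)                       # leave-one-out: drop the i-th observation
    loo = K.sum(axis=1) / ((n - 1) * h)            # f_hat_{h,-i}(X_i)
    return int_f2 - 2.0 * loo.mean()

# compare the normal reference bandwidth with the CV-selected one on simulated data
rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=0.5, size=400)
grid = np.linspace(data.min() - 1.0, data.max() + 1.0, 500)
h_nr = h_normal_reference(data)
candidates = np.linspace(0.05, 1.0, 40)
h_cv = candidates[np.argmin([cv_score(h, data, grid) for h in candidates])]
```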

In other words, for each $i = 1,\dots,n$ the estimator $\hat f_{h,-i}$ is determined from the reduced sample $X_1,\dots,X_{i-1},X_{i+1},\dots,X_n$.

Under fairly general conditions, minimizing $CV(h)$ over $h$ leads to a consistent estimator $\hat h_{opt,CV}$ of $h_{opt}$. The rate of convergence is slow, however: the relative error $\hat h_{opt,CV}/h_{opt} - 1$ is only of order $O_P(n^{-1/10})$.

Plug-in methods: In the expression

$h_{opt} = \Big\{\frac{R(K)}{n\,\nu_2(K)^2\int f''(x)^2\,dx}\Big\}^{1/5}$

the quantities $R(K)$ and $\nu_2(K)$ can be directly computed from the selected kernel function. The only problem in calculating $h_{opt}$ is the term $\int f''(x)^2\,dx$, which depends on the unknown true density. But of course this integral may be estimated by $\int \hat f_h''(x)^2\,dx$. The theory of kernel derivative estimation implies that consistent estimation of $f''(x)$ requires a bandwidth $\tilde h > h_{opt}$. It obviously does not make much sense to complicate the problem by looking for an optimal $\tilde h$. A general idea adopted by plug-in methods is to look over a reasonable range of bandwidths $h$ and to use corresponding inflated bandwidths $\tilde h := g_n(h) > h$, with $\frac{g_n(h)}{h} \to \infty$ as $n \to \infty$, for estimating the functional. An estimate $\hat h_{opt,PI}$ is then determined by solving

the fixed point problem

$\hat h_{opt,PI} = \Bigg\{\frac{R(K)}{n\,\nu_2(K)^2\int \hat f''_{g_n(\hat h_{opt,PI})}(x)^2\,dx}\Bigg\}^{1/5}$

A solution is obtained iteratively, starting e.g. from a normal reference bandwidth. Gasser, Kneip and Köhler (JASA, 1992) propose $g_n(h) = n^{1/10}h$. Sheather and Jones (JRSSB, 1991) use a more complicated relation based on a reference normal model.

3.6 Pointwise confidence intervals

Recall that asymptotically

$\sqrt{nh}\,\big(\hat f_h(x) - E(\hat f_h(x))\big) \;\to^D\; N\big(0, f(x)R(K)\big)$

This asymptotic normality result allows us to establish pointwise confidence intervals. For given $\alpha > 0$ an approximate $(1-\alpha)$-confidence interval for the variability of $\hat f_h(x)$ is thus given by

$\hat f_h(x) \pm z_{1-\alpha/2}\sqrt{\frac{\hat f_h(x)R(K)}{nh}}$.

Here, $z_{1-\alpha/2}$ denotes the corresponding quantile of the standard normal distribution. If $\alpha = 0.05$, then $z_{1-\alpha/2} = z_{0.975} = 1.96$. Then

$P\Big(E(\hat f_h(x)) \in \Big[\hat f_h(x) \pm z_{1-\alpha/2}\sqrt{\tfrac{\hat f_h(x)R(K)}{nh}}\Big]\Big) \to 1-\alpha$ as $n \to \infty$

But obviously this interval only focuses on random fluctuations of $\hat f_h(x)$; the bias is not taken into account.
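A minimal sketch of these variability intervals, again assuming the Gaussian kernel ($R(K) = 1/(2\sqrt{\pi})$) and reusing the `kde` helper from above; it implements only the interval just displayed, without any bias correction.

```python
import numpy as np
from scipy.stats import norm

def pointwise_ci(x_grid, data, h, alpha=0.05):
    """f_hat_h(x) +/- z_{1-alpha/2} * sqrt(f_hat_h(x) * R(K) / (n h)) for the Gaussian kernel."""
    n = len(data)
    z = norm.ppf(1.0 - alpha / 2.0)
    R_K = 1.0 / (2.0 * np.sqrt(np.pi))
    f_hat = kde(x_grid, data, h)                   # kernel estimate from the earlier sketch
    half = z * np.sqrt(f_hat * R_K / (n * h))
    return f_hat - half, f_hat + half
```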

Recall that $E(\hat f_h(x)) = f(x) + O(h^2)$. When using an approximately optimal bandwidth $h \sim n^{-1/5}$, the probability that the interval contains the true value $f(x)$ may therefore be much smaller than $1-\alpha$.

A possible trick to circumvent the bias problem is to use an undersmoothing bandwidth. Instead of using $h \sim n^{-1/5}$, confidence intervals may be constructed with respect to a bandwidth $\tilde h = h\,n^{-\beta}$, $\beta > 0$. Then $\tilde h^2 = o\big(\frac{1}{\sqrt{n\tilde h}}\big)$ and therefore

$\sqrt{n\tilde h}\,\big(\hat f_{\tilde h}(x) - E(\hat f_{\tilde h}(x))\big) = \sqrt{n\tilde h}\,\big(\hat f_{\tilde h}(x) - f(x)\big) + o(1)$,

which yields

$\sqrt{n\tilde h}\,\big(\hat f_{\tilde h}(x) - f(x)\big) \;\to^D\; N\big(0, f(x)R(K)\big)$.

We can conclude that then

$\hat f_{\tilde h}(x) \pm z_{1-\alpha/2}\sqrt{\frac{\hat f_{\tilde h}(x)R(K)}{n\tilde h}}$

provides an asymptotically valid $(1-\alpha)$-confidence interval. Of course, a suitable choice of $\beta$ is a very difficult problem in any practical application.

Note: There are more sophisticated methods for constructing confidence intervals based on the bootstrap.

3.7 Testing normality

Consider an i.i.d. sample $X_1,\dots,X_n$. Many standard procedures in parametric statistics rely on the assumption that $X_i$ is normally distributed. In practice it is often useful to test the hypothesis of normally distributed observations. Formally this leads to the testing problem:

$H_0: X_i \sim N(\mu, \sigma^2)$

against the alternative

$H_1: X_i$ not normally distributed

Kernel density estimation allows us to define a sensible test of this problem. Consider a kernel estimator $\hat f_h(x)$ based on the Gaussian kernel $K(u) = \varphi(u)$. If the null hypothesis is correct, i.e. $f = \varphi_{\mu,\sigma}$, it can then be shown that for any possible bandwidth $h$

$E(\hat f_h(x)) = \varphi_{\mu,\sqrt{\sigma^2+h^2}}(x)$

This implies that

$\int E\Big(\big(\hat f_h(x) - \varphi_{\mu,\sqrt{\sigma^2+h^2}}(x)\big)^2\Big)dx = \int \mathrm{Var}(\hat f_h(x))\,dx$

Consequently, if $H_0$ is correct, then any difference between $\hat f_h(x)$ and $\varphi_{\mu,\sqrt{\sigma^2+h^2}}(x)$ is only due to random fluctuations (no systematic error, no additional bias). We thus arrive at the following test procedure:
- Estimate mean and variance of $X_i$ by $\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i = \bar X$ and $\hat\sigma^2 = S^2$.
- Determine the normal reference bandwidth $h_{NR}$ and the corresponding kernel density estimator $\hat f_{h_{NR}}$.
- Calculate $D = \int\big(\hat f_{h_{NR}}(x) - \varphi_{\hat\mu,\sqrt{\hat\sigma^2+h_{NR}^2}}(x)\big)^2 dx$.
- Reject $H_0$ if $D$ is too large.

The distribution of $D$ under $H_0$ can be approximated by Monte Carlo simulations.
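A hedged sketch of this test: the statistic $D$ is computed on a grid, and its null distribution is approximated by simulating normal samples with the estimated mean and variance (a parametric-bootstrap-style calibration chosen here for illustration; the notes tabulate critical values instead). It reuses the `kde` and `h_normal_reference` helpers from the earlier sketches.

```python
import numpy as np
from scipy.stats import norm

def test_statistic(data):
    """D = int ( f_hat_{h_NR}(x) - phi_{mu_hat, sqrt(sigma_hat^2 + h_NR^2)}(x) )^2 dx (grid approximation)."""
    mu_hat, sigma2_hat = data.mean(), data.var(ddof=1)
    h = h_normal_reference(data)
    grid = np.linspace(data.min() - 3 * h, data.max() + 3 * h, 400)
    f_hat = kde(grid, data, h)
    f0 = norm.pdf(grid, loc=mu_hat, scale=np.sqrt(sigma2_hat + h ** 2))
    return np.sum((f_hat - f0) ** 2) * (grid[1] - grid[0])

def critical_value(n, mu, sigma, alpha=0.05, n_sim=1000, seed=0):
    """Approximate the (1 - alpha) quantile of D under H0 by simulating normal samples of size n."""
    rng = np.random.default_rng(seed)
    sims = [test_statistic(rng.normal(mu, sigma, size=n)) for _ in range(n_sim)]
    return np.quantile(sims, 1 - alpha)

# reject H0 at level alpha if
#   test_statistic(data) > critical_value(len(data), data.mean(), data.std(ddof=1))
```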

In a first-order approximation this distribution only depends on the sample size $n$ and is independent of the values of $\mu$, $\sigma^2$. The following table presents critical values for a test of level $\alpha = 5\%$.

[Table: critical values of D for different sample sizes n]

Example: In (older) economic literature it is frequently assumed that the income distribution is lognormal. This means that $\log X_i$ follows a normal distribution.

FES data (1990): An application of the kernel-based test to the log-income data yields a value of $D$ exceeding the 5% critical value, so $H_0$ is rejected: the income distribution is not lognormal.

[Figure: Family Expenditure Survey (1990), ln(income) before housing costs — kernel estimate with normal reference bandwidth and fitted normal density N(5.23, ·); the L2 distance between the two curves is indicated]

3.8 Boundary problems

The above calculations implicitly assume that the density is supported on the entire real line. Boundary problems can arise if the support of $f$ is only a subset of $\mathbb{R}$. In order to exemplify this point we will in the following assume that $f$ has the support $[0,\infty)$. Then $f(x) = 0$ if $x < 0$, which means that $X_i \ge 0$ with probability 1. In economic applications this is an important situation, since many variables of interest (e.g. income, wages, working hours, etc.) are positive.

Consider the behavior of a kernel density estimator based on a second order kernel with compact support $[-1, 1]$. Then the interval $[0, h]$ is called the boundary region. Depending on the structure of $f(x)$ for $x \in [0, h]$, the kernel estimator may produce poor estimates in the boundary region.

More precisely, $\hat f_h(0)$ usually underestimates $f(0)$. This is because $\hat f_h$ does not "feel" the boundary, and penalizes for the lack of data on the negative axis. From a theoretical point of view the bias increases, while the variance is still of order $1/(nh)$.
- If $f(0) > 0$, then as $n \to \infty$, $\hat f_h(0) \to^P \frac{1}{2}f(0)$.
- If $f(0) = 0$, then $\hat f_h(0) = O_P(h)$.
- If $f(0) = 0$ and $f'(0) = 0$, then $\hat f_h(0) = O_P(h^2)$.

There are several possibilities to deal with boundary problems.

1) Reflection method: If $f(0) > 0$, then the reflection of data method produces consistent estimates of $f(x)$ in the boundary region. The simple idea is to create a new sample of $2n$ observations by simply adding $-X_1, -X_2, \dots, -X_n$ to the data. The estimator then becomes

$\tilde f_h(x) = \frac{1}{nh}\sum_{i=1}^n\Big(K\big(\tfrac{x - X_i}{h}\big) + K\big(\tfrac{x + X_i}{h}\big)\Big)$

for $x \ge 0$ (and $\tilde f_h(x) = 0$ for $x < 0$). Then
- if $f'(0) \ne 0$, then $\tilde f_h(0) - f(0) = O_P(h)$;
- if $f'(0) = 0$, then $\tilde f_h(0) - f(0) = O_P(h^2)$.

There are more sophisticated reflection methods which even in the general case are able to guarantee that the bias is of order $h^2$ (see e.g. Cowling and Hall, JRSSB, 1996).
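A minimal sketch of the reflection estimator just described, using the `gaussian_kernel` helper from the first sketch (any kernel could be substituted):

```python
import numpy as np

def kde_reflected(x_grid, data, h):
    """f_tilde_h(x) = (1/(n h)) sum_i [K((x - X_i)/h) + K((x + X_i)/h)] for x >= 0, and 0 for x < 0."""
    n = len(data)
    u_minus = (x_grid[:, None] - data[None, :]) / h
    u_plus = (x_grid[:, None] + data[None, :]) / h
    f = (gaussian_kernel(u_minus) + gaussian_kernel(u_plus)).sum(axis=1) / (n * h)
    return np.where(x_grid >= 0.0, f, 0.0)
```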

2) Boundary kernel method: At each point in the boundary region, use a different kernel for estimating the function. These new kernels give up the symmetry property and put more weight on the positive axis (see e.g. Scott (1992): Multivariate Density Estimation, Wiley).

3) Transformation of data: Data transformations have the potential to improve the quality of density estimates (not only at possible boundaries). If $f(0) = 0$, i.e. if the support of $f$ is actually $(0,\infty)$, then such transformations can avoid any boundary problem. The idea is to define a strictly monotonically increasing, differentiable transformation $T: (0,\infty) \to \mathbb{R}$ (in economic applications one will often use $T(x) = \log x$). Let $g$ denote the density of the transformed random variables $Y_i = T(X_i)$, $i = 1,\dots,n$. Then $f$ and $g$ are linked by the relation

$f(x) = g(T(x))\,T'(x)$, $x \in (0,\infty)$.

The density $g$ can be estimated by ordinary kernel estimation from $Y_1 = T(X_1),\dots,Y_n = T(X_n)$,

$\hat g_h(y) = \frac{1}{nh}\sum_{i=1}^n K\Big(\frac{y - Y_i}{h}\Big)$,

and an estimate of $f$ is then given by

$\tilde f_h(x) = \hat g_h(T(x))\,T'(x) = \frac{T'(x)}{nh}\sum_{i=1}^n K\Big(\frac{T(x) - T(X_i)}{h}\Big)$

Using such a transformation may lead to a general improvement in the quality of the density estimation if $g$ is structurally much simpler than $f$ (if e.g. $\int g''(y)^2\,dy \ll \int f''(x)^2\,dx$).

3.9 Multivariate density estimation

Kernel estimators can also be used to estimate multivariate density functions. Consider an i.i.d. sample of random vectors $X_i = (X_{i1}, X_{i2},\dots,X_{id})^T$ with underlying density $f(x)$, $x = (x_1,\dots,x_d)^T$.

Data: random sample $X_1 = (X_{11},\dots,X_{1d})^T,\ \dots,\ X_n = (X_{n1},\dots,X_{nd})^T$

$d$-dimensional kernel estimator (with kernel function $K$ and bandwidths $h_1,\dots,h_d$):

$\hat f_h(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h_1\cdots h_d}\,K\Big(\frac{x_1 - X_{i1}}{h_1},\dots,\frac{x_d - X_{id}}{h_d}\Big)$, $\quad x = (x_1,\dots,x_d)^T \in \mathbb{R}^d$

The kernel function $K: \mathbb{R}^d \to \mathbb{R}$ is now a $d$-dimensional density function which is symmetric around 0. Hence, in particular

$\int_{\mathbb{R}^d} K(x_1,x_2,\dots,x_d)\,dx_1\,dx_2\cdots dx_d = 1$

Frequently used kernel functions:

Product kernels: Let $K$ denote a one-dimensional second-order kernel function. A $d$-dimensional kernel function can then be defined as the product of the univariate kernels evaluated at the $d$ coordinates:

$K(x_1,x_2,\dots,x_d) = K(x_1)\cdot K(x_2)\cdots K(x_d)$

Examples: $d$-dimensional Gaussian kernel = density of the $N_d(0, I)$-distribution

Multivariate Epanechnikov kernel:

$K(x_1,\dots,x_d) = \begin{cases} \frac{d+2}{2c_d}\big(1 - \sum_{i=1}^d x_i^2\big) & \text{if } \sum_{i=1}^d x_i^2 \le 1 \\ 0 & \text{else} \end{cases}$

Here, $c_d$ denotes the volume of the $d$-dimensional unit ball: $c_1 = 2$, $c_2 = \pi$, $c_3 = 4\pi/3$, etc.

Smooth kernels (in the case $d = 2$):

$K(x_1,x_2) = \begin{cases} \frac{3}{\pi}\big(1 - \sum_{i=1}^2 x_i^2\big)^2 & \text{if } \sum_{i=1}^2 x_i^2 \le 1 \\ 0 & \text{else} \end{cases}$
$\qquad K(x_1,x_2) = \begin{cases} \frac{4}{\pi}\big(1 - \sum_{i=1}^2 x_i^2\big)^3 & \text{if } \sum_{i=1}^2 x_i^2 \le 1 \\ 0 & \text{else} \end{cases}$

In practice, usually one basic bandwidth $h$ is selected. In order to eliminate the effects of different scalings of the variables, the $d$ bandwidths $h_1,\dots,h_d$ are then determined as $h_j = \hat\sigma_j h$, where

$\hat\sigma_j := \Big(\frac{1}{n}\sum_{i=1}^n (X_{ij} - \bar X_j)^2\Big)^{1/2}$

For a given point $x$ the MSE of $\hat f_h(x)$ is then of order

$\mathrm{MSE}(\hat f_h(x)) = O\Big(h^4 + \frac{1}{nh^d}\Big)$,

which means that an optimal bandwidth is of order $h \sim n^{-\frac{1}{4+d}}$. The corresponding rate of convergence is then $\hat f_h(x) - f(x) = O_P\big(n^{-\frac{2}{4+d}}\big)$.
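A short sketch of a product-kernel estimator with scaled bandwidths $h_j = \hat\sigma_j h$, using the univariate Gaussian kernel in each coordinate as an illustrative choice:

```python
import numpy as np

def kde_multivariate(x, data, h):
    """Product-kernel estimator; x: (m, d) evaluation points, data: (n, d) sample, h: basic bandwidth."""
    n, d = data.shape
    h_j = data.std(axis=0) * h                            # coordinate-wise bandwidths h_j = sigma_hat_j * h
    u = (x[:, None, :] - data[None, :, :]) / h_j          # (m, n, d) array of scaled differences
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)      # univariate Gaussian kernel per coordinate
    return K.prod(axis=2).sum(axis=1) / (n * np.prod(h_j))
```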

Example: FES data (1984); $X_{i1}$ – relative income of household $i$; $X_{i2}$ – age of the household head.

[Figure: kernel estimator of the joint density of $(X_1, X_2)$]

The curse of dimensionality

Kernel estimators are a useful tool for the nonparametric estimation of one-, two-, or three-dimensional densities. The accuracy of estimation, however, decreases rapidly as $d$ increases. In really high-dimensional problems kernel estimators are practically useless. Indeed, this is a substantial problem which emerges in all types of nonparametric function estimation (density estimation as well as regression, autoregression, etc.). One generally speaks of the curse of dimensionality.

The reason is the emptiness of a high-dimensional space $\mathbb{R}^d$. If $d \gg 1$, then even for large sample sizes $n$ there will only exist very few observations which are close by.

As an example, consider the estimation of a $d$-variate standard normal density at the point $x = 0$. This is obviously the center of the distribution, and the density has its maximal value at $x = 0$. First assume that we use a kernel density estimator based on the Epanechnikov kernel with bandwidths $h_1 = h_2 = \dots = h_d = 1$. Of course, these are fairly large bandwidths which lead to a substantial bias.
- If $d = 1$, then $P(|X_i| \le 1) \approx 0.68$, i.e. one will expect that approximately 68% of all observations satisfy $K(X_i/h) > 0$ and thus possess a positive weight when calculating $\hat f_h(0)$ ($h = 1$).
- If $d = 2$, then $P(|X_{i1}| \le 1 \text{ and } |X_{i2}| \le 1) \approx 0.46$. This means that approximately 46% of all observations contribute a positive weight when calculating $\hat f_h(0)$ ($h = 1$).

- If $d = 10$, then $P(|X_{ij}| \le 1 \text{ for all } j = 1,\dots,10) \approx 0.02$. This means that only approximately 2% of all observations contribute a positive weight when calculating $\hat f_h(0)$ ($h = 1$).

Now assume that an optimal bandwidth $h = h_{opt}$ is used. Then the following sample sizes $n$ are necessary in order to keep $E\big(\hat f_{h_{opt}}(0) - f(0)\big)^2$ small relative to $f(0)^2$:

[Table: required sample size n as a function of the dimension d]

4 Nonparametric Regression

We start by considering univariate regression with one single explanatory variable $X \in \mathbb{R}$.

Data: $(Y_i, X_i)$, $i = 1,\dots,n$, where
- $Y_i$ – response variable
- $X_i \in [a, b] \subset \mathbb{R}$ – explanatory variable
- $n$ sufficiently large (e.g. $n \ge 40$)

Nonparametric regression model:

$Y_i = m(X_i) + \epsilon_i$

- $m(X_i) = E(Y_i \mid X = X_i)$ – regression function
- $\epsilon_1, \epsilon_2,\dots$ i.i.d., $E(\epsilon_i) = 0$, $\mathrm{var}(\epsilon_i) = \sigma^2$

Linear regression: $m(X)$ is a straight line, $m(X) = \beta_0 + \beta_1 X$

Possible generalizations: $m(X)$ quadratic or cubic polynomial,

$m(X) = \beta_0 + \beta_1 X + \beta_2 X^2$ or $m(X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3$

Many important applications lead to regression functions possessing a complicated structure. Standard models then are too simple and do not provide useful approximations of $m(x)$.

"All models are wrong, but some are useful" (G. Box)

Example: Total expenditure in dependence of age. The data stem from a random sample of British households.

[Figure: income versus age]

Nonparametric regression: There are no specific assumptions about the structure of the regression function. It is only assumed that $m$ is smooth.

An important point in theoretical analysis is the way in which the observations $X_1,\dots,X_n$ have been generated. One distinguishes between fixed and random design.

Fixed design: The observation points $X_1,\dots,X_n$ are fixed (non-stochastic) values. Example: crop yield ($Y$) in dependence of the amount of fertilizer ($X$) used. Most important special case: equidistant design, $X_{i+1} - X_i = \frac{b-a}{n}$.

Random design: The observation points $X_1,\dots,X_n$ are (realizations of) i.i.d. random variables with density $f$. Here $f$ is called the design density. Throughout this chapter it will be assumed that $f(x) > 0$ for all $x \in [a, b]$. Example: sample $(Y_1, X_1),\dots,(Y_n, X_n)$ of income ($Y$) and age ($X$) of $n \approx 7000$ randomly selected British households.

In the case of random design $m(x)$ is the conditional expectation of $Y$ given $X = x$,

$m(x) = E(Y \mid X = x)$,

and $\mathrm{var}(\epsilon_i \mid X_i) = \sigma^2$.

4.1 Basis function expansions

Some frequently used approaches to nonparametric regression rely on expansions of the form

$m(x) \approx \sum_{j=1}^p \beta_j b_j(x)$,

where $b_1(x), b_2(x),\dots$ are suitable basis functions. The functions $b_1, b_2,\dots$ have to be chosen in such a way that for any possible smooth function $m$ the approximation error $\inf_\beta \big\|m - \sum_{j=1}^p \beta_j b_j\big\|$ tends to zero as $p \to \infty$ (approximation theory). Examples are approximations by polynomials, spline functions, wavelets or Fourier expansions (for periodic functions).

Simplest approach: For a fixed value $p$ an estimator $\hat m_p$ is determined by

$\hat m_p(x) = \sum_{j=1}^p \hat\beta_j b_j(x)$,

where the coefficients $\hat\beta_j$ are obtained by ordinary least squares,

$\sum_{i=1}^n\Big(Y_i - \sum_{j=1}^p \hat\beta_j b_j(X_i)\Big)^2 = \min_{\beta_1,\dots,\beta_p}\sum_{i=1}^n\Big(Y_i - \sum_{j=1}^p \beta_j \underbrace{b_j(X_i)}_{X_{ij}}\Big)^2$

The quality of the approximation obviously depends on the choice of $p$, which serves as a smoothing parameter:
- $p$ small: variability of the estimator is small, but there may exist a high systematic error (bias)
- $p$ large: bias is small, but variability of the estimator is high

4.1.1 Polynomial Regression

Approximation theory: Every smooth function can be well approximated by a polynomial of sufficiently high degree.

Approach: Choose $p$ and fit a polynomial of degree $p-1$:

$\sum_{i=1}^n\Big(Y_i - \sum_{j=1}^p \hat\beta_j X_i^{j-1}\Big)^2 = \min$, $\qquad \hat m_p(x) = \hat\beta_1 + \sum_{j=2}^p \hat\beta_j x^{j-1}$

This corresponds to an approximation with basis functions $b_1(x) = 1$, $b_2(x) = x$, $b_3(x) = x^2$, ..., $b_p(x) = x^{p-1}$.

Note: It is only assumed that $m$ is well approximated by a polynomial; there will usually still exist an approximation error (bias $\ne 0$).

Remark: Polynomial regression is not very popular in practice. Reasons are numerical problems in fitting high-order polynomials. Furthermore, high-order polynomials often possess an erratic, difficult-to-interpret behavior at the boundaries.
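A minimal sketch of this OLS polynomial fit; the design matrix has the columns $1, x,\dots,x^{p-1}$ described above, and the function names are illustrative.

```python
import numpy as np

def fit_polynomial(x, y, p):
    """OLS fit of m_hat_p(x) = beta_1 + beta_2 * x + ... + beta_p * x^(p-1)."""
    X = np.vander(x, N=p, increasing=True)       # columns 1, x, ..., x^(p-1)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

def predict_polynomial(beta_hat, x_new):
    return np.vander(x_new, N=len(beta_hat), increasing=True) @ beta_hat
```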

[Figure: polynomial fits of income on age for p = 2 and p = 3]

[Figure: polynomial fits of income on age for p = 5 and p = 7]

4.1.2 Spline Approximation

A frequently used method consists in a local polynomial approximation using spline functions. A spline is a piecewise polynomial function. Splines are defined with respect to a pre-specified sequence of $q$ knots $a = \tau_1 < \tau_2 < \dots < \tau_q = b$. Different specifications of the knot sequence lead to different splines.

More precisely, for a given knot sequence a spline function $s(x)$ of degree $k$ is defined by the following properties:
- $s(x)$ is a polynomial of degree $k$ on every interval $[\tau_j, \tau_{j+1}]$, i.e. $s(x) = s_0 + s_1 x + s_2 x^2 + \dots + s_k x^k$, $s_0,\dots,s_k \in \mathbb{R}$, for all $x \in [\tau_j, \tau_{j+1}]$
- $s(x)$ is $k-1$ times continuously differentiable at each knot point $x = \tau_j$, $j = 1,\dots,q$.

$s(x)$ is called a linear spline if $k = 1$, a quadratic spline if $k = 2$, and a cubic spline if $k = 3$. In practice, the most frequently used splines are cubic spline functions based on an equidistant sequence of $q$ knots ($\tau_{j+1} - \tau_j = \tau_j - \tau_{j-1}$ for all $j$).

The space of all spline functions of degree $k$ defined with respect to a given knot sequence $a = \tau_1,\dots,\tau_q = b$ is a $p := q + k - 1$ dimensional linear function space $S_{k,\tau_1,\dots,\tau_q}$.

Possible basis functions are

$b_1(x) = 1,\ b_2(x) = x,\ \dots,\ b_k(x) = x^{k-1},\ b_{k+1}(x) = (x - \tau_1)_+^k,\ \dots,\ b_{k+q-1}(x) = (x - \tau_{q-1})_+^k$,

where

$(x - \tau_j)_+^k = \begin{cases} (x - \tau_j)^k & \text{if } x \ge \tau_j \\ 0 & \text{else} \end{cases}$

Each spline function $s \in S_{k,\tau_1,\dots,\tau_q}$ can then be written as

$s(x) = \sum_{j=1}^k \beta_j x^{j-1} + \sum_{j=1}^{q-1} \beta_{j+k}(x - \tau_j)_+^k$ for $x \in [a, b]$

and suitable parameters $\beta_1,\dots,\beta_{k+q-1}$.

The so-called B-spline functions generate an alternative basis. Such B-spline representations are almost always used in practice, since they possess a number of advantages from a numerical point of view. The B-spline basis functions $b_{j,k}$, $j = 1,\dots,q+k-1$, for splines of degree $k$ based on a knot sequence $a = \tau_1,\dots,\tau_q = b$ are calculated by a recursive procedure:

$b_{j,0}(x) = \begin{cases} 1 & \text{if } x \in [\tau_j^*, \tau_{j+1}^*] \\ 0 & \text{else} \end{cases}$, $\quad j = 1,\dots,q+2k-1$,

and

$b_{j,l}(x) = \frac{x - \tau_j^*}{\tau_{l+j}^* - \tau_j^*}\,b_{j,l-1}(x) + \frac{\tau_{l+j+1}^* - x}{\tau_{l+j+1}^* - \tau_{j+1}^*}\,b_{j+1,l-1}(x)$,

for $l = 1,\dots,k$, $j = 1,\dots,q+k-1$, and $x \in [a, b]$. Here $\tau_1^*,\dots,\tau_{2k+q}^*$ denotes the extended knot sequence with $\tau_1^* = \dots = \tau_{k+1}^* = \tau_1$, $\tau_{k+2}^* = \tau_2,\dots,\tau_{k+q}^* = \tau_q$, and $\tau_{k+q+1}^* = \dots = \tau_{2k+q}^* = \tau_q$.

The so-called regression spline (or B-spline) approach to estimating a regression function $m(x)$ is based on fitting a spline function to the data. Frequently, cubic splines ($k = 3$) with equidistant knots are applied. Then $\tau_1 = a$, $\tau_q = b$ and $\tau_{j+1} - \tau_j = \frac{b-a}{q-1}$. In this case only the number of knots (or more precisely $p = q + 2$) is a smoothing parameter which has to be selected by the statistician.

An estimator $\hat m_p(x)$ is then given by

$\hat m_p(x) = \sum_{j=1}^p \hat\beta_j b_{j,k}(x)$,

and the coefficients $\hat\beta_j$ are determined by ordinary least squares. Here, again $p = q + k - 1$. With $Y = (Y_1,\dots,Y_n)^T$ and $X$ denoting the $n \times p$ matrix with elements $b_j(X_i)$, the estimated vector $\hat\beta = (\hat\beta_1,\dots,\hat\beta_p)^T$ of coefficients can be written as

$\hat\beta = (X^TX)^{-1}X^TY$, $\qquad \begin{pmatrix}\hat m_p(X_1)\\ \vdots \\ \hat m_p(X_n)\end{pmatrix} = X\hat\beta = \underbrace{X(X^TX)^{-1}X^T}_{S_p}\,Y$

Remark: Quite generally, the most important nonparametric regression procedures are linear smoothing methods. This means that, in dependence of some smoothing parameter $\lambda$, estimates of the vector $(m(X_1),\dots,m(X_n))^T$ are obtained by multiplying a smoother matrix $S_\lambda$ with $Y$.
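The following sketch fits a cubic regression spline by OLS using the truncated power basis introduced above (in practice a B-spline basis is numerically preferable, as the notes point out); the function names and the placement of knots at the sample range are illustrative choices.

```python
import numpy as np

def spline_design(x, knots, k=3):
    """Truncated power basis: 1, x, ..., x^(k-1), (x - tau_1)^k_+, ..., (x - tau_{q-1})^k_+."""
    cols = [x ** j for j in range(k)]
    cols += [np.clip(x - t, 0.0, None) ** k for t in knots[:-1]]
    return np.column_stack(cols)                 # n x (q + k - 1) design matrix

def fit_regression_spline(x, y, q, k=3):
    """Least-squares spline fit with q equidistant knots on [min(x), max(x)]."""
    knots = np.linspace(x.min(), x.max(), q)
    X = spline_design(x, knots, k)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat, knots
```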

Approximation properties of spline functions

As already mentioned above, nonparametric regression does not assume that $m(x)$ exactly corresponds to a spline function. $\hat m_p$ thus possesses a systematic error. But if the number of knots is large, then splines can approximate any smooth function with high accuracy.

The accuracy of spline approximations: Results of approximation theory imply that for any spline order $k$ and any $\nu$ times continuously differentiable function $m$, $1 \le \nu \le k+1$, we have

$\min_{s \in S_{k,\tau_1,\dots,\tau_q}}\ \max_{x\in[a,b]} |m(x) - s(x)| \;\le\; C_{\nu,k}\,\Big(\max_{j=1,\dots,q}|\tau_{j+1} - \tau_j|\Big)^\nu\,\max_{x\in[a,b]}|m^{(\nu)}(x)|$

for some universal constant $C_{\nu,k}$ which only depends on $\nu$ and $k$.

Let $k = 3$ (cubic spline functions). A cubic spline function satisfying the boundary constraints $s''(\tau_1) = s''(\tau_q) = 0$ is usually called a (cubic) natural spline. Note that for any cubic natural spline the effective number of parameters to be estimated reduces to $q$ (instead of $q + 2$).

Now assume that for some twice continuously differentiable function $m$ only the functional values $m(\tau_1),\dots,m(\tau_q)$ at $\tau_1,\dots,\tau_q$ are known. We then have to interpolate these functional values in order to obtain some suitable reconstruction of $m(x)$ on $[\tau_1, \tau_q]$.

Spline interpolation: For all possible values $m(\tau_1),\dots,m(\tau_q)$ there exists a unique cubic natural spline $s_{m,q}$ interpolating these values, i.e. $s_{m,q}(\tau_j) = m(\tau_j)$ for all $j = 1,\dots,q$. Spline theory now states that $s_{m,q}$ is the smoothest function interpolating these values:

$\int_{\tau_1}^{\tau_q} s_{m,q}''(x)^2\,dx \;\le\; \int_{\tau_1}^{\tau_q} \tilde m''(x)^2\,dx$

for any other twice continuously differentiable function $\tilde m$ with $\tilde m(\tau_j) = m(\tau_j)$ for all $j = 1,\dots,q$.

Literature: C. de Boor, A Practical Guide to Splines, Springer (1978); R. Eubank, Spline Smoothing and Nonparametric Regression, Marcel Dekker (1988)

Mean average squared error

The behavior of nonparametric function estimates is usually evaluated with respect to quadratic risk. To simplify notation, I will in the following write $E_\epsilon$ as well as $\mathrm{var}_\epsilon$ to denote expectation and variance with respect to the r.v. $\epsilon_i$ only. In the case of random design, $E_\epsilon$ and $\mathrm{var}_\epsilon$ thus denote the conditional expectation $E(\cdot \mid X_1,\dots,X_n)$ and variance given the observed $X$-values. For random design, these conditional expectations depend on the observed sample, and thus are random. For fixed design, such expectations are of course fixed values. It will always be assumed that the matrix $X^TX$ is invertible (under our conditions on the design density this holds with probability 1 for random design).

A common measure of accuracy of a spline estimator $\hat m_p$ is the

mean average squared error (MASE):

$\mathrm{MASE}(\hat m_p) := \frac{1}{n}\sum_{i=1}^n E_\epsilon\big(m(X_i) - \hat m_p(X_i)\big)^2 = \underbrace{\frac{1}{n}\sum_{i=1}^n\big(m(X_i) - E_\epsilon(\hat m_p(X_i))\big)^2}_{\mathrm{Bias}^2(\hat m_p)} + \underbrace{\frac{1}{n}\sum_{i=1}^n E_\epsilon\big(\hat m_p(X_i) - E_\epsilon(\hat m_p(X_i))\big)^2}_{Var(\hat m_p)}$

Another frequently used measure is the mean integrated squared error (MISE),

$\mathrm{MISE}(\hat m_p) := \int_a^b E_\epsilon\big(m(x) - \hat m_p(x)\big)^2\,dx$

Equidistant design: $\mathrm{MISE}(\hat m_p) = \mathrm{MASE}(\hat m_p) + O(n^{-1})$

MISE and MASE are generally not asymptotically equivalent in the case of random design:

$\mathrm{MASE}(\hat m_p) = \int_a^b E_\epsilon\big(m(x) - \hat m_p(x)\big)^2 f(x)\,dx + O_P(n^{-1})$.

In the following we have to analyze bias and variance of the estimator. As already mentioned above, nonparametric regression does not assume that $m(x)$ exactly corresponds to a spline function. $\hat m_p$ thus possesses a systematic error (bias),

$\bar m_p(x) := E_\epsilon(\hat m_p(x)) \ne m(x)$.

Let $m = (m(X_1),\dots,m(X_n))^T$. Then

$\bar\beta = E_\epsilon(\hat\beta) = (X^TX)^{-1}X^Tm$

Consequently, $\bar\beta = (\bar\beta_1,\dots,\bar\beta_p)^T$ is a solution of

$\sum_i\Big(m(X_i) - \sum_{j=1}^p \bar\beta_j b_{j,k}(X_i)\Big)^2 = \inf_{\vartheta_1,\dots,\vartheta_p}\sum_i\Big(m(X_i) - \sum_{j=1}^p \vartheta_j b_{j,k}(X_i)\Big)^2 = \inf_{s\in S_{k,\tau_1,\dots,\tau_q}}\sum_i\big(m(X_i) - s(X_i)\big)^2$

$\Rightarrow \quad \bar m_p(x) := \sum_{j=1}^p \bar\beta_j b_j(x) = E_\epsilon(\hat m_p(x))$

is the best approximation of $m(x)$ by spline functions in $S_{k,\tau_1,\dots,\tau_q}$, and the $\hat\beta_j$ estimate the corresponding coefficients $\bar\beta_j$.

By the general approximation properties of cubic splines ($k = 3$) with $q = p - 2$ equidistant knots, we will thus expect that
- if $m$ is twice continuously differentiable, then $\mathrm{Bias}(\hat m_p)^2 = \frac{1}{n}\sum_{i=1}^n\big(m(X_i) - \bar m_p(X_i)\big)^2 = O_P(p^{-4})$,
- if $m$ is four times continuously differentiable, then $\mathrm{Bias}(\hat m_p)^2 = \frac{1}{n}\sum_{i=1}^n\big(m(X_i) - \bar m_p(X_i)\big)^2 = O_P(p^{-8})$.

The average variance of the estimator can be obtained by the usual type of arguments applied in parametric regression.

46 m p = ( m(x 1 ),..., m(x n )) T and ϵ = (ϵ 1,..., ϵ n ) T. Then V ar( ˆm p ) =: 1 ( ) n E ϵ X(X T X) 1 X T Y X(X T X) 1 X T m p 2 2 = 1 ( ) n E ϵ X(X T X) 1 X T ϵ 2 2 = 1 ( ) n E ϵ ϵ T X(X T X) 1 X T ϵ = 1 ( ) n trace X T X) 1 X T E(ϵϵ T )X = 1 ( ) n σ2 trace X T X) 1 X T X = σ 2 p n Remark: For any j l matrix A and any l j matrix B we have the identity trace(ab) = trace(ba) These arguments imply that there is a tradeoff between average squared bias and variance. For cubic splines with equidistant knots and a twice differentiable function m we will expect that Bias( ˆm p ) 2 = O P (p 4 ) Since V ar( ˆm p ) = σ 2 p n an optimal p, balancing bias and variance, will be of order p opt n 1/5. Then MASE( ˆm popt ) = O P (n 4/5 ) Note: For an estimator ˆm based on a valid (!) parametric model we have MASE( ˆm popt ) = O P (n 1 ). Similar results can be obtained for the mean integrated squared error (MISE): If m is twice continuously differentiable, and p opt n 1/5, then ( ) b MISE( ˆm popt ) = E ϵ (m(x) ˆm popt (x)) 2 dx = O P (n 4/5 ) a EconometricsII-Kneip 4 15

[Figure: squared bias, variance, and AMSE of the estimated model as functions of the number of parameters p]

- $\mathrm{Bias}^2(\hat m_p)$ decreases as $p$ increases
- $Var(\hat m_p)$ increases as $p$ increases

Problem: $m$ is unknown, so MASE and $p_{opt}$ cannot be directly computed.

Approach: Determine an estimate $\hat p_{opt}$ of the optimal number $p$ of basis functions by minimizing a suitable error criterion with the following properties:
- For every possible $p$ the corresponding criterion function can be calculated from the data.
- For any $p$ the error criterion provides information about the respective MASE.

Recall: With $\hat{\mathbf m}_p = (\hat m_p(X_1),\dots,\hat m_p(X_n))^T$ we have $\hat{\mathbf m}_p = X\hat\beta = X(X^TX)^{-1}X^TY =: S_pY$ and $\frac{p}{n} = \frac{\mathrm{tr}(S_p)}{n}$. For given $p$, the number of parameters to estimate by the spline method (one also speaks of the degrees of freedom of the smoothing procedure) is equal to $p$. This corresponds to the trace of the smoother matrix $S_p = X(X^TX)^{-1}X^T$.

Most frequently used error criteria:

Cross-validation (CV): For a given value $p$, cross-validation tries to approximate the corresponding prediction error:

$CV(p) = \frac{1}{n}\sum_{i=1}^n\big(Y_i - \hat m_{p,-i}(X_i)\big)^2$

Here, for any $i = 1,\dots,n$, $\hat m_{p,-i}$ is the leave-one-out estimator of $m$ obtained when a spline function is fitted to the $n-1$ observations $(Y_1, X_1),\dots,(Y_{i-1}, X_{i-1}), (Y_{i+1}, X_{i+1}),\dots,(Y_n, X_n)$.

Motivation: We have

$E_\epsilon(CV(p)) = \frac{1}{n}\sum_{i=1}^n E_\epsilon\Big(\big(m(X_i) + \epsilon_i - \hat m_{p,-i}(X_i)\big)^2\Big) = \underbrace{\frac{1}{n}\sum_{i=1}^n E_\epsilon\Big(\big(m(X_i) - \hat m_{p,-i}(X_i)\big)^2\Big)}_{\approx\,\mathrm{MASE}(\hat m_p)} + \underbrace{\frac{2}{n}\sum_{i=1}^n E_\epsilon\Big(\big(m(X_i) - \hat m_{p,-i}(X_i)\big)\epsilon_i\Big)}_{=0} + \sigma^2$

Generalized cross-validation (GCV):

$GCV(p) = \frac{1}{n\big(1 - \frac{p}{n}\big)^2}\sum_{i=1}^n\big(Y_i - \hat m_p(X_i)\big)^2$

Motivation: It is easily verified that with

$ARSS(p) := \frac{1}{n}\sum_{i=1}^n\big(Y_i - \hat m_p(X_i)\big)^2$

we have

$E_\epsilon(ARSS(p)) = \mathrm{MASE}(\hat m_p) - 2\sigma^2\frac{p}{n} + \sigma^2$

If $p \to \infty$ such that $p/n \to 0$, a Taylor expansion yields

$GCV(p) = ARSS(p) + 2\,\frac{p}{n}\underbrace{ARSS(p)}_{=\sigma^2 + o_P(1)} + O_P\Big(\big(\tfrac{p}{n}\big)^2\Big)$

As motivated above, for large $n$, $CV(p)$ as well as $GCV(p)$ can be seen as estimates of $\mathrm{MASE}(\hat m_p) + \sigma^2$. More precisely, as $n \to \infty$, $\frac{p}{n} \to 0$,

$E_\epsilon(CV(p)) = E_\epsilon(GCV(p)) = \mathrm{MASE}(\hat m_p)\,(1 + o_P(1)) + \sigma^2$

There are theoretical results which show that if $\hat p_{opt}$ is determined by minimizing $CV(p)$ or $GCV(p)$, then for large $n$, $\mathrm{MASE}(\hat m_{\hat p_{opt}})$ will be close to $\mathrm{MASE}(\hat m_{p_{opt}})$.

There are more advanced procedures which estimate $p$ as well as a best placement of the knots $\tau_1,\dots,\tau_q$ simultaneously from the data (MARS algorithm).

[Figure: GCV as a function of p]
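A short sketch of GCV-based selection of the number of knots for the regression spline sketched earlier (it reuses `spline_design`); the grid of candidate values of $q$ is an illustrative choice.

```python
import numpy as np

def gcv(x, y, q, k=3):
    """GCV(p) = (1 / (n (1 - p/n)^2)) * sum_i (Y_i - m_hat_p(X_i))^2 for the spline with q knots."""
    n = len(y)
    X = spline_design(x, np.linspace(x.min(), x.max(), q), k)
    p = X.shape[1]
    residuals = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum(residuals ** 2) / (n * (1.0 - p / n) ** 2)

def select_q_by_gcv(x, y, q_grid, k=3):
    """Return the candidate number of knots minimizing GCV."""
    return q_grid[int(np.argmin([gcv(x, y, q, k) for q in q_grid]))]
```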

4.2 Approaches Based on Roughness Penalties

A different approach to spline fitting, which is widely used in practice, is based on the use of a roughness penalty. The basic idea can be described as follows: In order to guarantee a small systematic error, spline functions are defined with respect to a large number of knots ($\frac{p}{n}$ close to 1). Variability of the estimator is controlled by fitting the coefficients subject to a penalty which penalizes roughness (non-smoothness) of the resulting function. A convenient measure of smoothness is $\int_a^b m''(x)^2\,dx$.

Remark: The so-called (cubic) smoothing spline approach relies on cubic splines with knots at each observation point. More precisely, $q = n$ and $\tau_1 = X_1$, $\tau_2 = X_2,\dots,\tau_n = X_n$. The side conditions $s''(a) = 0$ and $s''(b) = 0$ are additionally imposed in order to ensure that the number of coefficients to be estimated is equal to $n$.

In the following we will consider cubic splines ($k = 3$). For a smoothing parameter $\lambda > 0$ (to be selected by the statistician), an estimate $\hat m_\lambda(x) = \sum_j \hat\beta_j b_j(x)$ is determined by

$\frac{1}{n}\sum_i\big(Y_i - \hat m_\lambda(X_i)\big)^2 + \lambda\int_a^b \hat m_\lambda''(x)^2\,dx = \inf_{s\in S_{3,\tau_1,\dots,\tau_q}}\Big\{\frac{1}{n}\sum_i\big(Y_i - s(X_i)\big)^2 + \lambda\int_a^b s''(x)^2\,dx\Big\}$,

or equivalently,

$\frac{1}{n}\sum_i\Big(Y_i - \sum_j\hat\beta_j b_j(X_i)\Big)^2 + \lambda\int_a^b\Big(\sum_j\hat\beta_j b_j''(x)\Big)^2dx = \inf_{\vartheta_1,\dots,\vartheta_p}\Big\{\frac{1}{n}\sum_i\Big(Y_i - \sum_j\vartheta_j b_j(X_i)\Big)^2 + \lambda\int_a^b\Big(\sum_j\vartheta_j b_j''(x)\Big)^2dx\Big\}$.

Let $X$ denote the $n\times p$ matrix with elements $b_j(X_i)$, and let $B$ denote the $p\times p$ matrix with elements $n\int_a^b b_j''(x)b_l''(x)\,dx$, $j, l = 1,\dots,p$. Then the solutions are given by

$\hat\beta = (X^TX + \lambda B)^{-1}X^TY$, $\qquad \hat{\mathbf m}_\lambda = \underbrace{X(X^TX + \lambda B)^{-1}X^T}_{S_\lambda}\,Y$,

where $\hat{\mathbf m}_\lambda = (\hat m_\lambda(X_1),\dots,\hat m_\lambda(X_n))^T$.

[Figure: two example functions illustrating a small versus a large value of $\int m''(x)^2 dx$]
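The following sketch implements this penalized fit for the cubic truncated power basis from the earlier regression-spline sketch (`spline_design`); the penalty matrix $B$ is approximated by a Riemann sum over a fine grid, and the number of knots is an illustrative default. In practice B-splines and exact penalty matrices would be preferred for numerical stability.

```python
import numpy as np

def second_derivatives(x, knots):
    """Second derivatives of the cubic truncated power basis (k = 3): (1)''=0, (x)''=0, (x^2)''=2, ((x-tau)^3_+)''=6(x-tau)_+."""
    cols = [np.zeros_like(x), np.zeros_like(x), 2.0 * np.ones_like(x)]
    cols += [6.0 * np.clip(x - t, 0.0, None) for t in knots[:-1]]
    return np.column_stack(cols)

def fit_penalized_spline(x, y, lam, q=40):
    """beta_hat = (X'X + lambda * B)^(-1) X'Y with B_{jl} = n * int b_j''(t) b_l''(t) dt (numerical integral)."""
    n = len(y)
    knots = np.linspace(x.min(), x.max(), q)
    X = spline_design(x, knots, k=3)
    grid = np.linspace(x.min(), x.max(), 2000)
    D2 = second_derivatives(grid, knots)
    B = n * (grid[1] - grid[0]) * (D2.T @ D2)     # Riemann-sum approximation of the penalty matrix
    beta_hat = np.linalg.solve(X.T @ X + lam * B, X.T @ y)
    return beta_hat, knots
```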

In the following we will concentrate on the situation that $p$ is large compared to $n$ (e.g. $p \approx n$) such that the bias of the spline approximation is negligible. Then only the choice of $\lambda$ is crucial for the quality of the estimator. We will additionally assume that the true regression function $m$ is twice continuously differentiable.

- $\mathrm{Bias}^2(\hat m_\lambda)$ increases as $\lambda$ increases; extreme case: $\lambda = \infty$ gives a straight-line fit.
- $Var(\hat m_\lambda)$ decreases as $\lambda$ increases; the estimated function $\hat m_\lambda$ is the smoother the larger $\lambda$.

An optimal smoothing parameter $\lambda_{opt}$ will again balance bias and variance.

- It can be verified that

$\frac{1}{n}\sum_i \mathrm{var}_\epsilon(\hat m_\lambda(X_i)) = \frac{1}{n}E_\epsilon\big(\epsilon^T S_\lambda^2\epsilon\big) = \frac{\sigma^2}{n}\mathrm{tr}(S_\lambda^2)$

As $n \to \infty$, $n\lambda \to \infty$, we have $\mathrm{tr}(S_\lambda^2) = O_P(\lambda^{-1/4})$, so that the average variance is of order $\frac{1}{n\lambda^{1/4}}$.

- The degrees of freedom of the estimation procedure are defined as $df_\lambda = \mathrm{tr}(S_\lambda)$ (sometimes also $df_\lambda = \mathrm{tr}(S_\lambda^2)$ is considered). These degrees of freedom can be seen as a nonparametric equivalent of the number of parameters to estimate in parametric regression.

Bias and MASE

Let $\bar{\mathbf m}_\lambda = (\bar m_\lambda(X_1),\dots,\bar m_\lambda(X_n))^T = E_\epsilon(S_\lambda Y) = X(X^TX + \lambda B)^{-1}X^Tm$. It is then easily seen that $\bar m_\lambda$ is a solution of

$\frac{1}{n}\sum_i\big(m(X_i) - \bar m_\lambda(X_i)\big)^2 + \lambda\int_a^b \bar m_\lambda''(x)^2\,dx = \inf_{s\in S_{k,\tau_1,\dots,\tau_q}}\Big\{\frac{1}{n}\sum_i\big(m(X_i) - s(X_i)\big)^2 + \lambda\int_a^b s''(x)^2\,dx\Big\}$

If the number of knots is sufficiently large, the bias of a best possible spline approximation $s_{opt}\in S_{3,\tau_1,\dots,\tau_q}$ to $m$ is negligible, $m \approx s_{opt}$. The above relation then implies that for a large number of knots

$\frac{1}{n}\sum_i\big(m(X_i) - \bar m_\lambda(X_i)\big)^2 + \lambda\int_a^b \bar m_\lambda''(x)^2\,dx \;\lesssim\; \lambda\int_a^b m''(x)^2\,dx$

For a twice continuously differentiable function $m$ it can indeed be shown that $\mathrm{Bias}(\hat m_\lambda)^2$ is proportional to $\lambda$. Hence, as $n \to \infty$, $\lambda \to 0$, $n\lambda \to \infty$,

$\mathrm{MASE}(\hat m_\lambda) = O_P\Big(\lambda + \frac{1}{n\lambda^{1/4}}\Big)$

- The above result implies that an optimal smoothing parameter balancing bias and variance will be of order $\lambda_{opt} \sim n^{-4/5}$.
- Then $\mathrm{MASE}(\hat m_{\lambda_{opt}}) = O_P(n^{-4/5})$.

Similar results can be obtained for the mean integrated squared error (MISE):

$\mathrm{MISE}(\hat m_{\lambda_{opt}}) = E_\epsilon\Big(\int_a^b\big(m(x) - \hat m_{\lambda_{opt}}(x)\big)^2\,dx\Big) = O_P(n^{-4/5})$

Again, estimates of $\lambda_{opt}$ may be determined by minimizing $CV(\lambda)$ or $GCV(\lambda)$:

Cross-validation (CV):

$CV(\lambda) = \frac{1}{n}\sum_{i=1}^n\big(Y_i - \hat m_{\lambda,-i}(X_i)\big)^2$

Here, for any $i = 1,\dots,n$, $\hat m_{\lambda,-i}$ is the leave-one-out estimator of $m$ obtained when only the $n-1$ observations $(Y_1, X_1),\dots,(Y_{i-1}, X_{i-1}), (Y_{i+1}, X_{i+1}),\dots,(Y_n, X_n)$ are used.

Generalized cross-validation (GCV):

$GCV(\lambda) = \frac{1}{n\big(1 - \frac{df_\lambda}{n}\big)^2}\sum_{i=1}^n\big(Y_i - \hat m_\lambda(X_i)\big)^2$,

where $df_\lambda = \mathrm{tr}(S_\lambda)$ (= degrees of freedom).

Remark: Under some regularity conditions it can be shown that

$\big|\mathrm{MASE}(\hat m_{\hat\lambda_{opt}}) - \mathrm{MASE}(\hat m_{\lambda_{opt}})\big| = O_P\Big(n^{-1/2}\,\mathrm{MASE}(\hat m_{\lambda_{opt}})^{1/2}\Big)$,

where $\hat\lambda_{opt}$ denotes the smoothing parameter estimated by CV or GCV.

[Figure: smoothing spline fit of income on age with df = 3]

[Figure: smoothing spline fit of income on age with df = 10]

4.3 Estimating the error variance

The magnitude of the variance $\sigma^2$ of the error terms $\epsilon_i$ influences the accuracy of the estimators. For simplicity it will in the following be assumed that the observations $X_i$ are ordered, $X_1 \le X_2 \le \dots \le X_n$, and that $m$ is a smooth, twice continuously differentiable function.

a) Based on a nonparametric estimate $\hat m$ of $m$, a simple estimate of $\sigma^2$ is obtained by averaging squared residuals:

$\hat\sigma^2 := \frac{1}{n}\sum_i\big(Y_i - \hat m(X_i)\big)^2$

b) The method of Rice:

$\hat\sigma^2 := \frac{1}{2(n-1)}\sum_{i=2}^n\big(Y_i - Y_{i-1}\big)^2$

It can be shown that $E_\epsilon(\hat\sigma^2) = \sigma^2 + O_P\big(\frac{1}{n^2}\big)$ and $Var_\epsilon(\hat\sigma^2) = O_P\big(\frac{1}{n}\big)$.

c) The method of Gasser et al.: In a first step pseudo-residuals

$\hat\epsilon_i = \frac{X_{i+1} - X_i}{X_{i+1} - X_{i-1}}\,Y_{i-1} + \frac{X_i - X_{i-1}}{X_{i+1} - X_{i-1}}\,Y_{i+1} - Y_i$

are calculated. Then

$\hat\sigma^2 := \frac{1}{n-2}\sum_{i=2}^{n-1} c_i^2\,\hat\epsilon_i^2$,

where $c_i^2 = \big(a_i^2 + b_i^2 + 1\big)^{-1}$ with $a_i = \frac{X_{i+1} - X_i}{X_{i+1} - X_{i-1}}$ and $b_i = \frac{X_i - X_{i-1}}{X_{i+1} - X_{i-1}}$, so that $E_\epsilon(c_i^2\hat\epsilon_i^2) = \sigma^2$.

Often methods b) or c) are preferred to a). The important point is that the bias of the estimators in b) or c) is much smaller than the bias of the estimator in a). However, all procedures a), b), c) yield consistent estimators of $\sigma^2$.
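A compact sketch of the two difference-based estimators, assuming the data are already sorted by $X$; the normalizing weights $c_i^2$ in the pseudo-residual estimator follow the standard form stated above.

```python
import numpy as np

def rice_variance(y):
    """Rice: sigma_hat^2 = (1/(2(n-1))) * sum_{i=2}^n (Y_i - Y_{i-1})^2, Y sorted by X."""
    d = np.diff(y)
    return np.sum(d ** 2) / (2.0 * (len(y) - 1))

def gasser_variance(x, y):
    """Pseudo-residual estimator; weights c_i^2 = 1/(a_i^2 + b_i^2 + 1) give E(c_i^2 eps_i^2) = sigma^2."""
    a = (x[2:] - x[1:-1]) / (x[2:] - x[:-2])
    b = (x[1:-1] - x[:-2]) / (x[2:] - x[:-2])
    eps = a * y[:-2] + b * y[2:] - y[1:-1]
    c2 = 1.0 / (a ** 2 + b ** 2 + 1.0)
    return np.sum(c2 * eps ** 2) / (len(y) - 2)
```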

Confidence intervals for spline methods

Consider spline fitting based on a roughness penalty with smoothing parameter $\lambda$. Under some suitable regularity conditions it can easily be shown that as $n \to \infty$, $\lambda \to 0$, $n\lambda \to \infty$,

$\frac{\hat m_\lambda(x) - \bar m_\lambda(x)}{\sqrt{\mathrm{var}_\epsilon(\hat m_\lambda(x))}} \;\to^L\; N(0, 1)$

holds for all $x$ (central limit theorem). Here again $\bar m_\lambda(x) = E_\epsilon(\hat m_\lambda(x))$. Note that with

$\mathrm{cov}_\epsilon(\hat\beta) = \sigma^2\underbrace{(X^TX + \lambda B)^{-1}X^TX(X^TX + \lambda B)^{-1}}_{Q_\lambda}$

this implies that with $b(x) = (b_1(x),\dots,b_p(x))^T$

$\mathrm{var}_\epsilon(\hat m_\lambda(x)) = \mathrm{var}_\epsilon\big(b(x)^T\hat\beta\big) = \sigma^2\,b(x)^TQ_\lambda b(x)$.

Based on an estimate $\hat\sigma^2$ of $\sigma^2$ this leads to asymptotically valid $(1-\alpha)$ confidence intervals for $\bar m_\lambda(x)$:

$\hat m_\lambda(x) \pm z_{1-\alpha/2}\sqrt{\hat\sigma^2\,b(x)^TQ_\lambda b(x)}$,

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of the standard normal distribution (e.g. $z_{0.975} = 1.96$). These intervals can be calculated for any point $x$, which yields confidence bands for the function $\bar m_\lambda$.

In the literature one speaks of confidence intervals for the variability of $\hat m_\lambda(x)$ (i.e. error bounds for the random fluctuation due to the error terms $\epsilon_i$). Quite obviously, bias is not taken into account when calculating these intervals.

4.4 The Nadaraya-Watson Kernel Estimator

Idea: approximation of $m(x)$ by a local average of the observations $Y_i$:

$\hat m_h(x) = \sum_{i=1}^n w(x, X_i, h)\,Y_i$

The weight function $w$ is constructed in such a way that the weight of an observation $Y_i$ is the smaller the larger the distance $|x - X_i|$. A smoothing parameter ("bandwidth") $h$ determines the rate of decrease of the weights $w(x, X_i, h)$ as $|x - X_i|$ increases. Kernel estimators calculate weights on the basis of a pre-specified kernel function $K$. Usually $K$ is chosen as a symmetric density function.

Nadaraya-Watson kernel estimator:

$\hat m_h(x) = \sum_{i=1}^n \frac{K\big(\frac{x - X_i}{h}\big)}{\sum_{j=1}^n K\big(\frac{x - X_j}{h}\big)}\,Y_i = \frac{\frac{1}{nh}\sum_{i=1}^n K\big(\frac{x - X_i}{h}\big)Y_i}{\frac{1}{nh}\sum_{j=1}^n K\big(\frac{x - X_j}{h}\big)}$

For every possible bandwidth $h > 0$ the sum of all weights

$w(x, X_i, h) = K\Big(\frac{x - X_i}{h}\Big)\Big/\sum_{j=1}^n K\Big(\frac{x - X_j}{h}\Big)$

is always equal to 1, $\sum_i w(x, X_i, h) = 1$.

Kernel estimators are linear smoothing methods:

$\hat{\mathbf m}_h = (\hat m_h(X_1),\dots,\hat m_h(X_n))^T = S_hY$,

where the elements of the $n\times n$ matrix $S_h$ are given by

$(S_h)_{ij} = \frac{K\big(\frac{X_i - X_j}{h}\big)}{\sum_{l=1}^n K\big(\frac{X_i - X_l}{h}\big)}$, $\qquad \mathrm{tr}(S_h) = O\Big(\frac{1}{h}\Big)$
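A minimal sketch of the Nadaraya-Watson estimator, with a Gaussian kernel as an illustrative choice:

```python
import numpy as np

def nadaraya_watson(x_grid, x, y, h):
    """m_hat_h(x) = sum_i K((x - X_i)/h) * Y_i / sum_j K((x - X_j)/h)."""
    u = (x_grid[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel weights
    return (K @ y) / K.sum(axis=1)
```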


More information

Nonparametric Regression

Nonparametric Regression Nonparametric Regression Econ 674 Purdue University April 8, 2009 Justin L. Tobias (Purdue) Nonparametric Regression April 8, 2009 1 / 31 Consider the univariate nonparametric regression model: where y

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

Preface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation

Preface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation Preface Nonparametric econometrics has become one of the most important sub-fields in modern econometrics. The primary goal of this lecture note is to introduce various nonparametric and semiparametric

More information

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model 1. Introduction Varying-coefficient partially linear model (Zhang, Lee, and Song, 2002; Xia, Zhang, and Tong, 2004;

More information

12 - Nonparametric Density Estimation

12 - Nonparametric Density Estimation ST 697 Fall 2017 1/49 12 - Nonparametric Density Estimation ST 697 Fall 2017 University of Alabama Density Review ST 697 Fall 2017 2/49 Continuous Random Variables ST 697 Fall 2017 3/49 1.0 0.8 F(x) 0.6

More information

Log-Density Estimation with Application to Approximate Likelihood Inference

Log-Density Estimation with Application to Approximate Likelihood Inference Log-Density Estimation with Application to Approximate Likelihood Inference Martin Hazelton 1 Institute of Fundamental Sciences Massey University 19 November 2015 1 Email: m.hazelton@massey.ac.nz WWPMS,

More information

Bickel Rosenblatt test

Bickel Rosenblatt test University of Latvia 28.05.2011. A classical Let X 1,..., X n be i.i.d. random variables with a continuous probability density function f. Consider a simple hypothesis H 0 : f = f 0 with a significance

More information

Defect Detection using Nonparametric Regression

Defect Detection using Nonparametric Regression Defect Detection using Nonparametric Regression Siana Halim Industrial Engineering Department-Petra Christian University Siwalankerto 121-131 Surabaya- Indonesia halim@petra.ac.id Abstract: To compare

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

1 Piecewise Cubic Interpolation

1 Piecewise Cubic Interpolation Piecewise Cubic Interpolation Typically the problem with piecewise linear interpolation is the interpolant is not differentiable as the interpolation points (it has a kinks at every interpolation point)

More information

Threshold Autoregressions and NonLinear Autoregressions

Threshold Autoregressions and NonLinear Autoregressions Threshold Autoregressions and NonLinear Autoregressions Original Presentation: Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Threshold Regression 1 / 47 Threshold Models

More information

CURRENT STATUS LINEAR REGRESSION. By Piet Groeneboom and Kim Hendrickx Delft University of Technology and Hasselt University

CURRENT STATUS LINEAR REGRESSION. By Piet Groeneboom and Kim Hendrickx Delft University of Technology and Hasselt University CURRENT STATUS LINEAR REGRESSION By Piet Groeneboom and Kim Hendrickx Delft University of Technology and Hasselt University We construct n-consistent and asymptotically normal estimates for the finite

More information

A Novel Nonparametric Density Estimator

A Novel Nonparametric Density Estimator A Novel Nonparametric Density Estimator Z. I. Botev The University of Queensland Australia Abstract We present a novel nonparametric density estimator and a new data-driven bandwidth selection method with

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo Outline in High Dimensions Using the Rodeo Han Liu 1,2 John Lafferty 2,3 Larry Wasserman 1,2 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University

More information

NADARAYA WATSON ESTIMATE JAN 10, 2006: version 2. Y ik ( x i

NADARAYA WATSON ESTIMATE JAN 10, 2006: version 2. Y ik ( x i NADARAYA WATSON ESTIMATE JAN 0, 2006: version 2 DATA: (x i, Y i, i =,..., n. ESTIMATE E(Y x = m(x by n i= ˆm (x = Y ik ( x i x n i= K ( x i x EXAMPLES OF K: K(u = I{ u c} (uniform or box kernel K(u = u

More information

UNIVERSITÄT POTSDAM Institut für Mathematik

UNIVERSITÄT POTSDAM Institut für Mathematik UNIVERSITÄT POTSDAM Institut für Mathematik Testing the Acceleration Function in Life Time Models Hannelore Liero Matthias Liero Mathematische Statistik und Wahrscheinlichkeitstheorie Universität Potsdam

More information

Integrated Likelihood Estimation in Semiparametric Regression Models. Thomas A. Severini Department of Statistics Northwestern University

Integrated Likelihood Estimation in Semiparametric Regression Models. Thomas A. Severini Department of Statistics Northwestern University Integrated Likelihood Estimation in Semiparametric Regression Models Thomas A. Severini Department of Statistics Northwestern University Joint work with Heping He, University of York Introduction Let Y

More information

Alternatives. The D Operator

Alternatives. The D Operator Using Smoothness Alternatives Text: Chapter 5 Some disadvantages of basis expansions Discrete choice of number of basis functions additional variability. Non-hierarchical bases (eg B-splines) make life

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Studies. The response in a Monte Carlo study is a random variable. Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Nonparametric Modal Regression

Nonparametric Modal Regression Nonparametric Modal Regression Summary In this article, we propose a new nonparametric modal regression model, which aims to estimate the mode of the conditional density of Y given predictors X. The nonparametric

More information

On variable bandwidth kernel density estimation

On variable bandwidth kernel density estimation JSM 04 - Section on Nonparametric Statistics On variable bandwidth kernel density estimation Janet Nakarmi Hailin Sang Abstract In this paper we study the ideal variable bandwidth kernel estimator introduced

More information

36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1

36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1 36. Multisample U-statistics jointly distributed U-statistics Lehmann 6.1 In this topic, we generalize the idea of U-statistics in two different directions. First, we consider single U-statistics for situations

More information

Local linear multiple regression with variable. bandwidth in the presence of heteroscedasticity

Local linear multiple regression with variable. bandwidth in the presence of heteroscedasticity Local linear multiple regression with variable bandwidth in the presence of heteroscedasticity Azhong Ye 1 Rob J Hyndman 2 Zinai Li 3 23 January 2006 Abstract: We present local linear estimator with variable

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 1 Bootstrapped Bias and CIs Given a multiple regression model with mean and

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

On the Robust Modal Local Polynomial Regression

On the Robust Modal Local Polynomial Regression International Journal of Statistical Sciences ISSN 683 5603 Vol. 9(Special Issue), 2009, pp 27-23 c 2009 Dept. of Statistics, Univ. of Rajshahi, Bangladesh On the Robust Modal Local Polynomial Regression

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Test for Discontinuities in Nonparametric Regression

Test for Discontinuities in Nonparametric Regression Communications of the Korean Statistical Society Vol. 15, No. 5, 2008, pp. 709 717 Test for Discontinuities in Nonparametric Regression Dongryeon Park 1) Abstract The difference of two one-sided kernel

More information

Regression: Lecture 2

Regression: Lecture 2 Regression: Lecture 2 Niels Richard Hansen April 26, 2012 Contents 1 Linear regression and least squares estimation 1 1.1 Distributional results................................ 3 2 Non-linear effects and

More information

Smooth simultaneous confidence bands for cumulative distribution functions

Smooth simultaneous confidence bands for cumulative distribution functions Journal of Nonparametric Statistics, 2013 Vol. 25, No. 2, 395 407, http://dx.doi.org/10.1080/10485252.2012.759219 Smooth simultaneous confidence bands for cumulative distribution functions Jiangyan Wang

More information

Adaptive Nonparametric Density Estimators

Adaptive Nonparametric Density Estimators Adaptive Nonparametric Density Estimators by Alan J. Izenman Introduction Theoretical results and practical application of histograms as density estimators usually assume a fixed-partition approach, where

More information

Local Modal Regression

Local Modal Regression Local Modal Regression Weixin Yao Department of Statistics, Kansas State University, Manhattan, Kansas 66506, U.S.A. wxyao@ksu.edu Bruce G. Lindsay and Runze Li Department of Statistics, The Pennsylvania

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

From Histograms to Multivariate Polynomial Histograms and Shape Estimation. Assoc Prof Inge Koch

From Histograms to Multivariate Polynomial Histograms and Shape Estimation. Assoc Prof Inge Koch From Histograms to Multivariate Polynomial Histograms and Shape Estimation Assoc Prof Inge Koch Statistics, School of Mathematical Sciences University of Adelaide Inge Koch (UNSW, Adelaide) Poly Histograms

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

Nonparametric Statistics

Nonparametric Statistics Nonparametric Statistics Jessi Cisewski Yale University Astrostatistics Summer School - XI Wednesday, June 3, 2015 1 Overview Many of the standard statistical inference procedures are based on assumptions

More information

Variance Function Estimation in Multivariate Nonparametric Regression

Variance Function Estimation in Multivariate Nonparametric Regression Variance Function Estimation in Multivariate Nonparametric Regression T. Tony Cai 1, Michael Levine Lie Wang 1 Abstract Variance function estimation in multivariate nonparametric regression is considered

More information

Function of Longitudinal Data

Function of Longitudinal Data New Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data Weixin Yao and Runze Li Abstract This paper develops a new estimation of nonparametric regression functions for

More information

University, Tempe, Arizona, USA b Department of Mathematics and Statistics, University of New. Mexico, Albuquerque, New Mexico, USA

University, Tempe, Arizona, USA b Department of Mathematics and Statistics, University of New. Mexico, Albuquerque, New Mexico, USA This article was downloaded by: [University of New Mexico] On: 27 September 2012, At: 22:13 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered

More information

Estimation of cumulative distribution function with spline functions

Estimation of cumulative distribution function with spline functions INTERNATIONAL JOURNAL OF ECONOMICS AND STATISTICS Volume 5, 017 Estimation of cumulative distribution function with functions Akhlitdin Nizamitdinov, Aladdin Shamilov Abstract The estimation of the cumulative

More information

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive

More information

A New Method for Varying Adaptive Bandwidth Selection

A New Method for Varying Adaptive Bandwidth Selection IEEE TRASACTIOS O SIGAL PROCESSIG, VOL. 47, O. 9, SEPTEMBER 1999 2567 TABLE I SQUARE ROOT MEA SQUARED ERRORS (SRMSE) OF ESTIMATIO USIG THE LPA AD VARIOUS WAVELET METHODS A ew Method for Varying Adaptive

More information

USING A BIMODAL KERNEL FOR A NONPARAMETRIC REGRESSION SPECIFICATION TEST

USING A BIMODAL KERNEL FOR A NONPARAMETRIC REGRESSION SPECIFICATION TEST Statistica Sinica 25 (2015), 1145-1161 doi:http://dx.doi.org/10.5705/ss.2014.008 USING A BIMODAL KERNEL FOR A NONPARAMETRIC REGRESSION SPECIFICATION TEST Cheolyong Park 1, Tae Yoon Kim 1, Jeongcheol Ha

More information

Histogram Härdle, Müller, Sperlich, Werwatz, 1995, Nonparametric and Semiparametric Models, An Introduction

Histogram Härdle, Müller, Sperlich, Werwatz, 1995, Nonparametric and Semiparametric Models, An Introduction Härdle, Müller, Sperlich, Werwatz, 1995, Nonparametric and Semiparametric Models, An Introduction Tine Buch-Kromann Construction X 1,..., X n iid r.v. with (unknown) density, f. Aim: Estimate the density

More information

Integral approximation by kernel smoothing

Integral approximation by kernel smoothing Integral approximation by kernel smoothing François Portier Université catholique de Louvain - ISBA August, 29 2014 In collaboration with Bernard Delyon Topic of the talk: Given ϕ : R d R, estimation of

More information

Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions

Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions Vadim Marmer University of British Columbia Artyom Shneyerov CIRANO, CIREQ, and Concordia University August 30, 2010 Abstract

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Introduction to Curve Estimation

Introduction to Curve Estimation Introduction to Curve Estimation Density 0.000 0.002 0.004 0.006 700 800 900 1000 1100 1200 1300 Wilcoxon score Michael E. Tarter & Micheal D. Lock Model-Free Curve Estimation Monographs on Statistics

More information

Nonparametric Function Estimation with Infinite-Order Kernels

Nonparametric Function Estimation with Infinite-Order Kernels Nonparametric Function Estimation with Infinite-Order Kernels Arthur Berg Department of Statistics, University of Florida March 15, 2008 Kernel Density Estimation (IID Case) Let X 1,..., X n iid density

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Analysis methods of heavy-tailed data

Analysis methods of heavy-tailed data Institute of Control Sciences Russian Academy of Sciences, Moscow, Russia February, 13-18, 2006, Bamberg, Germany June, 19-23, 2006, Brest, France May, 14-19, 2007, Trondheim, Norway PhD course Chapter

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Least Squares Model Averaging. Bruce E. Hansen University of Wisconsin. January 2006 Revised: August 2006

Least Squares Model Averaging. Bruce E. Hansen University of Wisconsin. January 2006 Revised: August 2006 Least Squares Model Averaging Bruce E. Hansen University of Wisconsin January 2006 Revised: August 2006 Introduction This paper developes a model averaging estimator for linear regression. Model averaging

More information

Nonparametric Density Estimation (Multidimension)

Nonparametric Density Estimation (Multidimension) Nonparametric Density Estimation (Multidimension) Härdle, Müller, Sperlich, Werwarz, 1995, Nonparametric and Semiparametric Models, An Introduction Tine Buch-Kromann February 19, 2007 Setup One-dimensional

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR

More information

Density estimators for the convolution of discrete and continuous random variables

Density estimators for the convolution of discrete and continuous random variables Density estimators for the convolution of discrete and continuous random variables Ursula U Müller Texas A&M University Anton Schick Binghamton University Wolfgang Wefelmeyer Universität zu Köln Abstract

More information

Non-parametric Inference and Resampling

Non-parametric Inference and Resampling Non-parametric Inference and Resampling Exercises by David Wozabal (Last update. Juni 010) 1 Basic Facts about Rank and Order Statistics 1.1 10 students were asked about the amount of time they spend surfing

More information

Direct Learning: Linear Classification. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Classification. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Classification Logistic regression models for classification problem We consider two class problem: Y {0, 1}. The Bayes rule for the classification is I(P(Y = 1 X = x) > 1/2) so

More information

Non-Parametric Bootstrap Mean. Squared Error Estimation For M- Quantile Estimators Of Small Area. Averages, Quantiles And Poverty

Non-Parametric Bootstrap Mean. Squared Error Estimation For M- Quantile Estimators Of Small Area. Averages, Quantiles And Poverty Working Paper M11/02 Methodology Non-Parametric Bootstrap Mean Squared Error Estimation For M- Quantile Estimators Of Small Area Averages, Quantiles And Poverty Indicators Stefano Marchetti, Nikos Tzavidis,

More information

Nonparametric Estimation of Luminosity Functions

Nonparametric Estimation of Luminosity Functions x x Nonparametric Estimation of Luminosity Functions Chad Schafer Department of Statistics, Carnegie Mellon University cschafer@stat.cmu.edu 1 Luminosity Functions The luminosity function gives the number

More information

Alternatives to Basis Expansions. Kernels in Density Estimation. Kernels and Bandwidth. Idea Behind Kernel Methods

Alternatives to Basis Expansions. Kernels in Density Estimation. Kernels and Bandwidth. Idea Behind Kernel Methods Alternatives to Basis Expansions Basis expansions require either choice of a discrete set of basis or choice of smoothing penalty and smoothing parameter Both of which impose prior beliefs on data. Alternatives

More information