Preface
1 Nonparametric Density Estimation and Testing
1.1 Introduction
1.2 Univariate Density Estimation
Preface

Nonparametric econometrics has become one of the most important subfields of modern econometrics. The primary goal of this lecture note is to introduce various nonparametric and semiparametric techniques that are widely applied in empirical research, with a focus on kernel and sieve estimation. The kernel and sieve methods form the backbone of nonparametrics despite their distinctive nature in terms of local versus global approximation. We occasionally touch upon alternative estimation methodologies but refer the reader directly to the related books or articles.

The lecture note comprises three parts. In the first part, we provide a rigorous introduction to nonparametric methods. After we study nonparametric density estimation and testing, we move quickly to nonparametric regression analysis. Kernel estimation with mixed data will also be addressed. Then we study the sieve estimation of the conditional mean function. In the second part, we examine various semiparametric models, which bridge the gap between parametric and nonparametric models. Here the primary interest is typically to estimate a finite-dimensional parameter in the presence of one or several infinite-dimensional nuisance parameters (i.e., nonparametric components). We provide a unified framework to analyze the asymptotic properties of the semiparametric estimator of the finite-dimensional parameter and then study various semiparametric regression models in detail, including partially linear models, index models, and additive models. In the third part, we focus on various topics in nonparametric and semiparametric econometrics. We will mainly discuss nonparametric kernel and sieve estimation with endogenous regressors and study the estimation of various nonparametric and semiparametric panel data models. In each chapter, we shall first introduce the theory for nonparametric or semiparametric estimation, followed by one or two applications in related areas.
To help the reader grasp the material, we also include a small number of theoretical exercises together with some real-data exercises.

1 Nonparametric Density Estimation and Testing

1.1 Introduction

In this chapter we describe the most important method of estimating density functions, namely kernel density estimation. As Pagan and Ullah (1999) remarked, there are three areas in which one needs to estimate densities. First, density estimates may be required to capture the stylized facts that need explanation and to judge how well a potential model is likely to fit the data. Second, in the case where we need a complete picture of the distribution of an estimator, we need density estimates to summarize the information. Third, some parametric estimators (e.g., quantile estimators) have asymptotic distributions that depend on a density evaluated at a specific point.

Let $X$ be a generic random variable or vector with cumulative distribution function (CDF) $F(\cdot)$ and probability density function (PDF) $f(\cdot)$. Let the observations $\{X_i\}_{i=1}^n$ be drawn from the unknown distribution $F$. We are interested in estimating $f$ at a point $x$.

1.2 Univariate Density Estimation

Several estimators have been proposed to estimate the density function nonparametrically. These include the kernel density estimator of Rosenblatt (1956) and Parzen (1962), the nearest neighbor estimator
of Fix and Hodges (1951), the series estimator of Cencov (1962), the penalized likelihood estimator of Good and Gaskins (1971, 1980), and, more recently, the local likelihood estimator of Loader (1993). Pagan and Ullah (1999) discuss all of these estimators, among which the kernel density estimator is the best known; it is also better developed and more widely used than the others. Therefore, we focus exclusively on nonparametric kernel density estimation.

1.2.1 Motivation for the Kernel Density Estimator

For simplicity, we first look at the issue of estimating the density $f(x)$ of a scalar continuously distributed random variable $X$ at a particular point $x$. To motivate the method, note that $f(x) = F'(x)$ a.e., so one can obtain a simple estimator of $f(x)$:
$$\tilde f(x) = \frac{F_n(x+h) - F_n(x-h)}{2h}, \qquad (1.1)$$
where $h = h(n)$ is a sequence of positive constants and $F_n(\cdot)$ is the empirical distribution function of $\{X_i\}_{i=1}^n$:
$$F_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(X_i \le x),$$
where $\mathbf{1}(\cdot)$ is the indicator function: $\mathbf{1}(A) = 1$ if $A$ holds and $0$ otherwise. Note that
$$2nh\,\tilde f(x) = \sum_{i=1}^n \mathbf{1}(x - h < X_i \le x + h),$$
as a summation of $n$ independent Bernoulli random variables, has the binomial distribution Binomial$(n,\,F(x+h) - F(x-h))$. It follows that
$$E[\tilde f(x)] = \frac{F(x+h) - F(x-h)}{2h} \to f(x) \quad\text{if } h \to 0,$$
and
$$\mathrm{Var}[\tilde f(x)] = \frac{[F(x+h) - F(x-h)]\,\{1 - [F(x+h) - F(x-h)]\}}{4nh^2} \to 0 \quad\text{if } h \to 0 \text{ and } nh \to \infty.$$
Thus, to guarantee good behavior of $\tilde f(x)$, we should choose $h$ such that $h \to 0$ and $nh \to \infty$ as $n \to \infty$. One can also calculate the MSE of $\tilde f(x)$ and establish asymptotic normality for it.

Rewrite $\tilde f(x)$ as
$$\tilde f(x) = \frac{1}{nh}\sum_{i=1}^n \frac{1}{2}\,\mathbf{1}\!\left(\left|\frac{x - X_i}{h}\right| \le 1\right). \qquad (1.2)$$
One can then propose a useful class of kernel density estimators of the form
$$\hat f(x) = \frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x - X_i}{h}\right) = \frac{1}{n}\sum_{i=1}^n k_h(x - X_i), \qquad (1.3)$$
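The naive estimator built from the empirical CDF is easy to code directly. The following sketch (NumPy; the function name `naive_density` and the simulated standard normal sample are ours, for illustration only) evaluates it at a point:

```python
import numpy as np

def naive_density(x, data, h):
    """Naive (uniform-kernel) density estimate from the empirical CDF:
    f_tilde(x) = [F_n(x + h) - F_n(x - h)] / (2h)."""
    data = np.asarray(data)
    n = data.size
    Fn_hi = np.sum(data <= x + h) / n   # F_n(x + h)
    Fn_lo = np.sum(data <= x - h) / n   # F_n(x - h)
    return (Fn_hi - Fn_lo) / (2.0 * h)

rng = np.random.default_rng(0)
sample = rng.standard_normal(10_000)
# The true N(0,1) density at x = 0 is about 0.3989; with n = 10,000 and a
# moderate h the estimate should be close to it.
est = naive_density(0.0, sample, h=0.2)
```

Shrinking $h$ while keeping $nh$ large trades bias for variance, exactly as the mean and variance expressions above indicate.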
where we refer to $k(\cdot)$ as a kernel function on $\mathbb{R}$ and to $h$ as a smoothing parameter (or, alternatively, a bandwidth), and $k_h(u) = k(u/h)/h$. (1.2) is called a uniform kernel estimator because the kernel function $k(u) = \frac{1}{2}\mathbf{1}(|u| \le 1)$ corresponds to a uniform PDF on $[-1, 1]$. It is sometimes referred to as a naive kernel estimator and was first introduced in Fix and Hodges (1951).

In practice there are a variety of kernel functions that might be chosen, among which three are the most popular:

(1) the Gaussian kernel
$$k(u) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{u^2}{2}\right); \qquad (1.4)$$
(2) the Epanechnikov kernel
$$k(u) = \frac{3}{4}(1 - u^2)\,\mathbf{1}(|u| \le 1); \qquad (1.5)$$
(3) the Biweight or Quartic kernel
$$k(u) = \frac{15}{16}(1 - u^2)^2\,\mathbf{1}(|u| \le 1). \qquad (1.6)$$

Three less frequent choices of kernels are:

(4) the Uniform kernel
$$k(u) = \frac{1}{2}\,\mathbf{1}(|u| \le 1); \qquad (1.7)$$
(5) the Triangular kernel
$$k(u) = (1 - |u|)\,\mathbf{1}(|u| \le 1); \qquad (1.8)$$
(6) the Triweight kernel
$$k(u) = \frac{35}{32}(1 - u^2)^3\,\mathbf{1}(|u| \le 1). \qquad (1.9)$$

It turns out that the choice among these kernels rarely makes a significant difference in the estimates. The kernel functions are used to smooth the data, whereas the amount of smoothing is controlled by the bandwidth $h > 0$. Intuitively, $\hat f(x)$ is the average of a set of weights. If a large number of the observations are near $x$, the weights are relatively large and $\hat f(x)$ is large. Conversely, if only a few $X_i$ are close to $x$, the weights are small and $\hat f(x)$ is small. The bandwidth controls the degree of closeness.

1.2.2 Asymptotic Properties of the Kernel Density Estimator

The properties of $\hat f(x)$ are well studied in the literature. To summarize some important properties of $\hat f(x)$, we make the following assumptions.

A1 The observations $\{X_i\}_{i=1}^n$ are IID with density $f$.
A2 The second-order derivatives of $f$ are continuous and bounded in a neighborhood of $x$.
A3 The kernel $k$ is a symmetric PDF around zero satisfying (i) $\int k(u)\,du = 1$; (ii) $\int u^2 k(u)\,du = \kappa_2 \in (0, \infty)$; (iii) $\int k(u)^2\,du < \infty$.
A4 As $n \to \infty$, $h = h(n) \to 0$ and $nh \to \infty$.

$k$ is a second-order kernel if $\int k(u)\,du = 1$, $\int u\,k(u)\,du = 0$, and $0 < \int u^2 k(u)\,du < \infty$. So Assumption A3 implies that $k$ is a second-order kernel.
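The six kernels above can be coded in a few lines. The sketch below (NumPy; the function names are ours) also checks numerically that each kernel integrates to one, as a valid second-order kernel must:

```python
import numpy as np

# Common second-order kernels; u may be a scalar or a NumPy array.
def gaussian(u):     return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
def epanechnikov(u): return 0.75 * (1 - u**2) * (np.abs(u) <= 1)
def biweight(u):     return (15 / 16) * (1 - u**2) ** 2 * (np.abs(u) <= 1)
def uniform(u):      return 0.5 * (np.abs(u) <= 1)
def triangular(u):   return (1 - np.abs(u)) * (np.abs(u) <= 1)
def triweight(u):    return (35 / 32) * (1 - u**2) ** 3 * (np.abs(u) <= 1)

# Numerical check on a fine grid: each kernel integrates to one.
u = np.linspace(-5, 5, 200_001)
checks = {k.__name__: np.trapz(k(u), u) for k in
          (gaussian, epanechnikov, biweight, uniform, triangular, triweight)}
```

Swapping one kernel for another in (1.3) changes the estimate very little in practice; the bandwidth $h$ is what matters.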
1. $\hat f(x)$ is a valid density. That is, $\hat f(x) \ge 0$ for all $x$, and it integrates to one:
$$\int \hat f(x)\,dx = \frac{1}{nh}\sum_{i=1}^n \int k\!\left(\frac{x - X_i}{h}\right) dx = \frac{1}{n}\sum_{i=1}^n \int k(u)\,du = 1,$$
where the second equality applies the Fubini theorem and the change of variables $u = (x - X_i)/h$.

2. The moments of the density $\hat f(x)$ can be calculated easily. The mean is
$$\int x\,\hat f(x)\,dx = \frac{1}{n}\sum_{i=1}^n \int (X_i + hu)\,k(u)\,du = \frac{1}{n}\sum_{i=1}^n X_i = \bar X,$$
where the last equality follows from the facts $\int k(u)\,du = 1$ and $\int u\,k(u)\,du = 0$. The second moment is
$$\int x^2\,\hat f(x)\,dx = \frac{1}{n}\sum_{i=1}^n \int (X_i + hu)^2 k(u)\,du = \frac{1}{n}\sum_{i=1}^n X_i^2 + h^2\kappa_2,$$
where $\kappa_2 = \int u^2 k(u)\,du$ is the variance of the kernel. Consequently, $\int x^2 \hat f(x)\,dx \to E X_1^2$ in probability if $h \to 0$ as $n \to \infty$.

3. MSE of $\hat f(x)$. Suppose that $f(x)$ is second-order continuously differentiable. Then
$$E\big[\hat f(x)\big] = E[k_h(x - X_1)] = \int k(u)\,f(x + hu)\,du = \int k(u)\left[f(x) + hu\,f'(x) + \frac{h^2 u^2}{2} f^{(2)}(x) + o(h^2)\right] du = f(x) + \frac{h^2}{2}\kappa_2 f^{(2)}(x) + o(h^2).$$
This leads to the bias expression
$$\mathrm{Bias}\big[\hat f(x)\big] = \frac{h^2}{2}\kappa_2 f^{(2)}(x) + o(h^2). \qquad (1.10)$$
For the variance, we have
$$\mathrm{Var}\big[\hat f(x)\big] = \frac{1}{n}\mathrm{Var}[k_h(x - X_1)] = \frac{1}{n}E\big[k_h(x - X_1)^2\big] - \frac{1}{n}\big\{E[k_h(x - X_1)]\big\}^2 = \frac{1}{nh}\int k(u)^2 f(x + hu)\,du + O(n^{-1}) = \frac{f(x)\,\kappa_{02}}{nh} + o\!\left(\frac{1}{nh}\right), \qquad (1.11)$$
where $\kappa_{02} = \int k(u)^2\,du$. Since the variance is of order $(nh)^{-1}$, Assumption A4 ensures that $\mathrm{Var}[\hat f(x)]$ converges to zero. Adding the square of (1.10) to (1.11), we obtain
$$\mathrm{MSE}\big[\hat f(x)\big] = E\big[\hat f(x) - f(x)\big]^2 = \frac{h^4}{4}\kappa_2^2\big[f^{(2)}(x)\big]^2 + \frac{f(x)\,\kappa_{02}}{nh} + o\!\left(h^4 + \frac{1}{nh}\right).$$
Integrating this expression over $x$, we obtain the mean integrated squared error (MISE):
$$\mathrm{MISE}(h) = \mathrm{AMISE}(h) + o\!\left(h^4 + \frac{1}{nh}\right), \qquad (1.12)$$
where
$$\mathrm{AMISE}(h) = \frac{h^4}{4}\kappa_2^2\,R\big(f^{(2)}\big) + \frac{\kappa_{02}}{nh} \qquad (1.13)$$
and $R(f^{(2)}) \equiv \int \big[f^{(2)}(x)\big]^2 dx$ is a measure of the total curvature of $f$. We call AMISE the asymptotic MISE, since it provides a useful large-sample approximation to the MISE. Both MISE and AMISE are global measures of precision for the estimation of $f(\cdot)$.

Notice that the integrated squared bias is asymptotically proportional to $h^4$, so for this quantity to decrease we need to take $h$ small. However, since the leading term of the integrated variance is proportional to $(nh)^{-1}$, we need to choose $h$ large to reduce the variance. Therefore, as $n$ increases, $h$ should vary in such a way that each of the components of the MISE or AMISE becomes smaller. This is known as the variance-bias trade-off; it is a mathematical quantification of the critical role of the bandwidth.

Example 1 (Density estimate of annual salary income). In Figure 1 we use the Panel Study of Income Dynamics (PSID) dataset to estimate the density of annual salary income in the USA. We have $n = 445$ observations on annual salary income for 2003. We choose the standard normal kernel. The bandwidth is chosen according to $h = c\,s\,n^{-1/5}$ for $c = 0.25$, $1.06$, and $4$, where $s$ is the sample standard deviation. For comparison, we also plot the density estimate using the least squares cross-validated (LSCV, see below) bandwidth. Figure 1 illustrates the effect of the choice of bandwidth. For a small value of $c$ ($c = 0.25$), the estimated density function is very spiky and hence very variable, in the sense that, over repeated sampling from the true density, the spikes would appear in different places. There is, however, very little bias.
When $c$ is increased, the variability is reduced at the expense of introducing bias. When the over-smoothing bandwidth is applied ($c = 4$), the estimated density function is too flat, indicating significant bias. For the given dataset, we observe that the density estimate using the ROT bandwidth ($c = 1.06$, to be introduced below) almost coincides with that using the LSCV bandwidth.

Nevertheless, the AMISE is a much simpler expression to comprehend than the expression for the MISE given by (1.12). The main advantage of the AMISE is that we can define the optimal bandwidth with
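The under/over-smoothing pattern in Example 1 is easy to reproduce on simulated data. The sketch below (NumPy; the simulated lognormal "income-like" sample and all names are ours, not the PSID data) evaluates a Gaussian-kernel estimate for $c = 0.25$, $1.06$, and $4$ and confirms that the estimate gets visibly smoother, here measured by the summed squared second differences on a grid, as $c$ grows:

```python
import numpy as np

def kde(xgrid, data, h):
    """Gaussian kernel density estimate on a grid of evaluation points."""
    u = (xgrid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
data = rng.lognormal(mean=10.0, sigma=0.7, size=2_000)   # skewed, income-like
s = data.std(ddof=1)
xgrid = np.linspace(data.min(), data.max(), 400)

# Bandwidth h = c * s * n^{-1/5} for the three choices of c from the example.
estimates = {c: kde(xgrid, data, c * s * data.size ** (-1 / 5))
             for c in (0.25, 1.06, 4.0)}
# Roughness (summed squared second differences) falls as c grows.
rough = {c: np.sum(np.diff(est, 2) ** 2) for c, est in estimates.items()}
```

Small $c$ gives a spiky, high-variance estimate; large $c$ gives a flat, biased one.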
Figure 1: Density estimate based on the PSID annual salary income data in 2003 (kernel: standard normal; bandwidth $h = c\,s\,n^{-1/5}$ for $c = 0.25$, $1.06$, and $4$; the least squares cross-validated (LSCV) bandwidth is also used).
respect to this criterion, and it has a closed-form expression. To be specific, we define the asymptotically optimal bandwidth $h_0$ as the value that minimizes $\mathrm{AMISE}(h)$: $h_0 = \arg\min_h \mathrm{AMISE}(h)$. The solution can be found by solving the first-order condition, yielding
$$h_0 = \left[\frac{\kappa_{02}}{\kappa_2^2\,R(f^{(2)})}\right]^{1/5} n^{-1/5}. \qquad (1.14)$$
Aside from its dependence on the known kernel constants $\kappa_2$ and $\kappa_{02}$, this expression shows that $h_0$ is inversely proportional to $R(f^{(2)})^{1/5}$. Thus, for a density with little curvature, $R(f^{(2)})$ will be small and a large bandwidth is called for. On the other hand, when $R(f^{(2)})$ is large, little smoothing will be desired.

Unfortunately, direct use of (1.14) to choose a good bandwidth in practice is not possible, since $R(f^{(2)})$ is unknown. However, there now exist several rules for choosing $h$ which may or may not be based on estimating $R(f^{(2)})$. Let $s$ denote the sample standard deviation. For the Gaussian, Epanechnikov, and Biweight kernels discussed above, one can choose $h_0 = 1.06\,s\,n^{-1/5}$, $2.34\,s\,n^{-1/5}$, and $2.78\,s\,n^{-1/5}$, respectively. Such choices of bandwidth are called bandwidths selected by Silverman's rule of thumb (ROT) in the literature.

Substituting (1.14) into (1.13) leads to
$$\inf_h \mathrm{AMISE}(h) = \frac{5}{4}\left\{\kappa_2^2\,\kappa_{02}^4\,R\big(f^{(2)}\big)\right\}^{1/5} n^{-4/5}, \qquad (1.15)$$
which is the smallest possible AMISE for the estimation of $f$ using the kernel $k$. We can restate the information conveyed by (1.14) and (1.15) in terms of the MISE itself using asymptotic notation. Let $h_*$ be the minimizer of $\mathrm{MISE}(h)$ defined in (1.12). Then
$$h_* \sim \left[\frac{\kappa_{02}}{\kappa_2^2\,R(f^{(2)})}\right]^{1/5} n^{-1/5} \quad\text{and}\quad \inf_h \mathrm{MISE}(h) \sim \frac{5}{4}\left\{\kappa_2^2\,\kappa_{02}^4\,R\big(f^{(2)}\big)\right\}^{1/5} n^{-4/5}.$$
These expressions give the rates of convergence to zero of the MISE-optimal bandwidth and the minimum MISE, respectively, as $n \to \infty$. Under the stated assumptions, the best obtainable rate of convergence of the MISE of the kernel density estimator is of order $n^{-4/5}$, which is slower than the typical parametric rate of order $n^{-1}$ for MSE convergence.

1.2.3 Univariate Bandwidth Selection

There are several ways to choose the bandwidth in practice. We introduce only the most widely used methods.
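Formula (1.14) can be evaluated in closed form under a normal reference. This sketch (ours, not from the text) plugs in the Gaussian-kernel constants $\kappa_{02} = 1/(2\sqrt{\pi})$ and $\kappa_2 = 1$ and the normal-reference curvature $R(f^{(2)}) = 3/(8\sqrt{\pi}\sigma^5)$, which recovers Silverman's familiar $1.06\,\sigma\,n^{-1/5}$ rule:

```python
import numpy as np

def rot_bandwidth(sigma, n):
    """AMISE-optimal bandwidth h0 = [kappa02 / (kappa2^2 R(f''))]^{1/5} n^{-1/5}
    under a normal reference f = N(0, sigma^2), Gaussian kernel."""
    kappa02 = 1.0 / (2.0 * np.sqrt(np.pi))      # integral of k(u)^2
    kappa2 = 1.0                                # variance of the kernel
    Rf2 = 3.0 / (8.0 * np.sqrt(np.pi) * sigma**5)  # R(f'') for N(0, sigma^2)
    return (kappa02 / (kappa2**2 * Rf2)) ** 0.2 * n ** (-0.2)

h = rot_bandwidth(sigma=1.0, n=1000)
# (4/3)^{1/5} is about 1.0592, so h is about 1.0592 * 1000^{-1/5}.
```

The constant $(4/3)^{1/5} \approx 1.06$ is exactly the Gaussian-kernel ROT constant quoted above.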
Rule of thumb and plug-in method. (1.14) shows that the optimal smoothing parameter depends on the integrated second derivative of the unknown density. In practice, we may choose an initial pilot value of $h$ (say, the ROT one) to estimate $R(f^{(2)}) \equiv \int \big[f^{(2)}(x)\big]^2 dx$ nonparametrically, and then use
this value to obtain an estimate of $h_0$. Such a method is referred to as a plug-in method in the literature. It is worth mentioning that the estimation of $R(f^{(2)})$ requires estimation of the second-order derivative of $f$, which is introduced in the next subsection.

A popular way of choosing the initial value of $h$ is to assume that $f(\cdot)$ belongs to a certain parametric family of distributions and then estimate the optimal bandwidth using (1.14). For example, if we assume $f(\cdot)$ is a normal density with variance $\sigma^2$, then $R(f^{(2)}) = 3/(8\sqrt{\pi}\sigma^5)$ and the pilot bandwidth is $h_{pilot} = (4/3)^{1/5}\sigma n^{-1/5} \approx 1.06\,\sigma n^{-1/5}$, which is plugged into $\int \big[\hat f^{(2)}(x)\big]^2 dx$. The latter is then used to obtain the estimate of the optimal bandwidth. In practice we replace $\sigma$ by the sample standard deviation of $\{X_i\}$, whereas Silverman (1986) advocates a robust measure of spread that replaces $\sigma$ by the adaptive measure min(standard deviation, interquartile range/1.34).

Least squares cross validation. Another popular method for choosing the bandwidth parameter is least squares cross validation, which is a fully automatic and data-driven method. It is based on the principle that the bandwidth should minimize the integrated squared error of the resulting estimate. The integrated squared difference between $\hat f$ and $f$ is
$$\int \big[\hat f(x) - f(x)\big]^2 dx = \int \hat f(x)^2\,dx - 2\int \hat f(x) f(x)\,dx + \int f(x)^2\,dx. \qquad (1.16)$$
Note that the last term does not depend on the bandwidth, and $\int \hat f(x) f(x)\,dx = E_X\big[\hat f(X)\big]$, where $E_X(\cdot)$ denotes expectation with respect to $X$, not with respect to the random observations $\{X_i\}$ used in defining $\hat f(\cdot)$. Therefore, $\int \hat f(x) f(x)\,dx$ can be estimated by
$$\frac{1}{n}\sum_{i=1}^n \hat f_{-i}(X_i) = \frac{1}{n(n-1)h}\sum_{i=1}^n \sum_{j \ne i} k\!\left(\frac{X_i - X_j}{h}\right),$$
where
$$\hat f_{-i}(x) = \frac{1}{(n-1)h}\sum_{j \ne i} k\!\left(\frac{x - X_j}{h}\right)$$
is the leave-one-out kernel estimator of $f(x)$. We ask the reader to verify as an exercise that we cannot use the usual kernel density estimate $\hat f(X_i)$ in estimating $\int \hat f(x) f(x)\,dx$. For the first term we have
$$\int \hat f(x)^2\,dx = \frac{1}{n^2 h^2}\sum_{i=1}^n \sum_{j=1}^n \int k\!\left(\frac{x - X_i}{h}\right) k\!\left(\frac{x - X_j}{h}\right) dx = \frac{1}{n^2 h}\sum_{i=1}^n \sum_{j=1}^n \bar k\!\left(\frac{X_i - X_j}{h}\right),$$
where $\bar k(v) = \int k(u)\,k(v - u)\,du$ is the convolution kernel derived from $k(\cdot)$. Given the exact form of $k(\cdot)$, we may obtain an analytic expression for $\bar k(\cdot)$. For example, if $k(u) = \frac{1}{\sqrt{2\pi}}\exp(-u^2/2)$, then $\bar k(v) = \frac{1}{2\sqrt{\pi}}\exp(-v^2/4)$, a normal density with zero mean and variance 2. This follows from the fact that two independent $N(0,1)$ random variables sum to an $N(0,2)$ random variable. So we choose $h$ to minimize
$$\mathrm{CV}(h) = \frac{1}{n^2 h}\sum_{i=1}^n \sum_{j=1}^n \bar k\!\left(\frac{X_i - X_j}{h}\right) - \frac{2}{n(n-1)h}\sum_{i=1}^n \sum_{j \ne i} k\!\left(\frac{X_i - X_j}{h}\right). \qquad (1.17)$$
Let $\hat h$ denote the solution to the above cross-validation problem. Härdle et al. (1988) show that
$$\frac{\hat h - h_0}{h_0} \to 0 \text{ in probability}.$$
That is, $\hat h / h_0$ converges to 1, as we would expect.

Likelihood cross validation. Another data-driven way to choose the bandwidth is likelihood cross validation. This approach yields a density estimate with an entropy interpretation: the bandwidth is chosen to minimize the Kullback-Leibler distance between the estimated density and the true one. In general, the Kullback-Leibler distance between two density functions $f$ and $g$ is given by
$$D(f, g) = \int f(x)\ln\!\left(\frac{f(x)}{g(x)}\right) dx = \int f(x)\ln(f(x))\,dx - \int f(x)\ln(g(x))\,dx.$$
Taking $g = \hat f$, we have
$$D(f, \hat f) = \int f(x)\ln(f(x))\,dx - \int f(x)\ln\big(\hat f(x)\big)\,dx.$$
The first term on the right-hand side does not depend on the bandwidth, and the second term is $E_X\big[\ln \hat f(X)\big]$, which can be estimated by $\frac{1}{n}\sum_{i=1}^n \ln \hat f_{-i}(X_i)$. Therefore, minimizing $D(f, \hat f)$ is equivalent to maximizing the log-likelihood. So we choose $h$ to maximize
$$\mathcal{L}(h) = \sum_{i=1}^n \ln \hat f_{-i}(X_i).$$
The main problem with likelihood cross validation is that it is severely affected by the tail behavior of $f(\cdot)$ and works poorly when the true distribution has fat tails.

1.2.4 Univariate Density Derivative Estimation

Estimation of density derivatives may be needed in several cases. First and second derivatives may be of intrinsic interest as measures of slope and curvature. Other important functions, such as the score function $f'/f$, depend on density derivatives. Automatic plug-in bandwidth selection methods require estimation of quantities involving density derivatives too.
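The criterion (1.17) with a Gaussian kernel has the closed-form convolution kernel $\bar k$, an $N(0,2)$ density, so LSCV is a few lines of NumPy. This sketch (function and variable names are ours) minimizes CV($h$) over a grid on simulated standard normal data:

```python
import numpy as np

def lscv(h, data):
    """Least squares cross-validation criterion CV(h), Gaussian kernel.
    kbar (= k convolved with k) is an N(0, 2) density."""
    n = data.size
    d = (data[:, None] - data[None, :]) / h
    kbar = np.exp(-0.25 * d**2) / np.sqrt(4 * np.pi)
    k = np.exp(-0.5 * d**2) / np.sqrt(2 * np.pi)
    term1 = kbar.sum() / (n**2 * h)
    # Drop the i = j diagonal terms from the second (leave-one-out) sum.
    term2 = 2 * (k.sum() - n * k[0, 0]) / (n * (n - 1) * h)
    return term1 - term2

rng = np.random.default_rng(2)
data = rng.standard_normal(300)
grid = np.linspace(0.05, 1.5, 60)
h_cv = grid[np.argmin([lscv(h, data) for h in grid])]
```

For standard normal data with $n = 300$, the ROT bandwidth is roughly $1.06 \cdot 300^{-1/5} \approx 0.34$, and the LSCV choice should land in the same neighborhood.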
When the density function $f(x)$ is $r$th-order differentiable at $x$, a natural estimator of its $r$th derivative $f^{(r)}(x)$ is
$$\hat f^{(r)}(x) = \frac{1}{n h^{1+r}}\sum_{i=1}^n k^{(r)}\!\left(\frac{x - X_i}{h}\right), \qquad (1.18)$$
which is the $r$th derivative of $\hat f(x)$. The mean squared error properties of $\hat f^{(r)}(x)$ can be derived straightforwardly to obtain
$$\mathrm{MSE}\big[\hat f^{(r)}(x)\big] = \frac{f(x)\,R\big(k^{(r)}\big)}{n h^{2r+1}} + \left\{\frac{h^2}{2}\kappa_2 f^{(r+2)}(x)\right\}^2 + o\!\left(\frac{1}{n h^{2r+1}} + h^4\right), \qquad (1.19)$$
where $R(k^{(r)}) = \int \big[k^{(r)}(u)\big]^2 du$ and $\kappa_2 = \int u^2 k(u)\,du$. It follows that the MSE-optimal bandwidth for estimating $f^{(r)}(x)$ is of order $n^{-1/(2r+5)}$. Therefore, estimation of $f'(x)$ requires a bandwidth of order $n^{-1/7}$, compared with the optimal rate $n^{-1/5}$ for the estimation of $f(x)$ itself. Moreover, the optimal MSE, or its integrated version, is of order $n^{-4/(2r+5)}$. This rate becomes slower for higher values of $r$, which reflects the increasing difficulty inherent in the problem of estimating higher-order derivatives.

1.2.5 Univariate Cumulative Distribution Function Estimation

We can estimate the cumulative distribution function (CDF) $F(x)$ by the empirical distribution function (EDF) $F_n(x)$. Nevertheless, $F_n(x)$ is not smooth, as it jumps by $1/n$ at each sample realization point. We can obtain a smoothed kernel estimate of $F(x)$ by integrating $\hat f$:
$$\hat F(x) = \int_{-\infty}^x \hat f(v)\,dv = \frac{1}{n}\sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right),$$
where $K(v) = \int_{-\infty}^v k(u)\,du$. To calculate the MSE of $\hat F(x)$, we first calculate $E\big[\hat F(x)\big]$ and $\mathrm{Var}\big[\hat F(x)\big]$. For the mean,
$$E\big[\hat F(x)\big] = E\!\left[K\!\left(\frac{x - X_1}{h}\right)\right] = \int K\!\left(\frac{x - v}{h}\right) f(v)\,dv = \int k(u)\,F(x - hu)\,du = F(x) + \frac{h^2}{2}\kappa_2 F^{(2)}(x) + \frac{h^4}{4!}\kappa_4 F^{(4)}(x) + o(h^4),$$
where the odd-order terms in the Taylor expansion vanish by the symmetry of $k$,
and where $\kappa_j = \int u^j k(u)\,du$ ($j = 2, 4$) and $F^{(j)}(x)$ is the $j$th-order derivative of $F$. Similarly, we can show that
$$\mathrm{Var}\big[\hat F(x)\big] = \frac{1}{n}E\!\left[K\!\left(\frac{x - X_1}{h}\right)^2\right] - \frac{1}{n}\left\{E\!\left[K\!\left(\frac{x - X_1}{h}\right)\right]\right\}^2 = \frac{1}{n}F(x)[1 - F(x)] - \frac{h}{n}\psi_0 f(x) + o\!\left(\frac{h}{n}\right),$$
where $\psi_0 = 2\int u\,k(u)K(u)\,du$. Consequently,
$$\mathrm{MSE}\big[\hat F(x)\big] = \mathrm{Var}\big[\hat F(x)\big] + \big\{\mathrm{Bias}\big[\hat F(x)\big]\big\}^2 = \frac{1}{n}F(x)[1 - F(x)] - \frac{h}{n}\psi_0 f(x) + \frac{h^4}{4}\kappa_2^2\big[f'(x)\big]^2 + o\!\left(\frac{h}{n} + h^4\right).$$
To choose the bandwidth, we can minimize the mean integrated squared error of $\hat F$. It turns out that the optimal bandwidth in this case is
$$h_0 = \left[\frac{\psi_0}{\kappa_2^2\,R(f')}\right]^{1/3} n^{-1/3}, \qquad (1.20)$$
where $R(f') = \int \big[f'(x)\big]^2 dx$. Hence the optimal bandwidth for estimating a univariate CDF converges to zero at a faster rate than the optimal bandwidth for estimating a univariate PDF. Bowman et al. (1998) suggest choosing $h$ to minimize the following cross-validation function:
$$\mathrm{CV}(h) = \frac{1}{n}\sum_{i=1}^n \int \left[\mathbf{1}(X_i \le x) - \hat F_{-i}(x)\right]^2 dx, \qquad (1.21)$$
where $\hat F_{-i}(x) = (n-1)^{-1}\sum_{j \ne i} K\big((x - X_j)/h\big)$ is the leave-one-out estimator of $F(x)$. Let $\hat h$ denote the value of $h$ minimizing $\mathrm{CV}(h)$. It can be shown that $\hat h / h_0 \to 1$ in probability. Using $h = \hat h$, we can easily derive the asymptotic distribution of $\hat F(x)$:
$$\sqrt{n}\,\big[\hat F(x) - F(x)\big] \to N\big(0,\ F(x)(1 - F(x))\big) \text{ in distribution}. \qquad (1.22)$$
So the asymptotic property of $\hat F(x)$ is the same as that of the empirical distribution function $F_n(x)$. In particular, it converges to its probability limit $F(x)$ at the parametric rate $\sqrt{n}$.

Example 2 (Cumulative distribution estimates of annual salary income). Using the same data as in Example 1, we now estimate the cumulative distribution function $F(\cdot)$ of annual salary income in the USA. We choose the standard normal kernel. We could choose the bandwidth by minimizing the cross-validation criterion function in (1.21). For simplicity, we instead choose the bandwidth by using the LSCV criterion function for estimating the density function (see (1.17)) and then adjust it to have the optimal rate for estimating the CDF. That is, we set $\tilde h = \hat h^{5/3}$, which is of order $n^{-1/3}$, where $\hat h$ is the least squares cross-validated bandwidth for estimating the PDF. Figure 2 plots the smoothed kernel estimate $\hat F(x)$ of $F(x)$ together with the EDF estimate $F_n(x)$. The two curves almost coincide, except that the EDF curve is wiggly even for a dataset with a large sample size ($n = 445$).
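The smoothed CDF estimator and the bandwidth rescaling used in Example 2 can be sketched as follows (NumPy plus the standard library `erf`; simulated standard normal data and all names are ours, not the PSID data):

```python
import numpy as np
from math import erf, sqrt

def smoothed_cdf(x, data, h):
    """Smoothed kernel CDF estimate F_hat(x) = (1/n) sum_i K((x - X_i)/h),
    where K is the standard normal CDF (the integral of the Gaussian kernel)."""
    u = (x - data) / h
    return float(np.mean([0.5 * (1 + erf(v / sqrt(2))) for v in u]))

rng = np.random.default_rng(5)
data = rng.standard_normal(2_000)
n = data.size
h_pdf = 1.06 * data.std(ddof=1) * n ** (-1 / 5)   # ROT bandwidth for the pdf
h_cdf = h_pdf ** (5 / 3)                          # rescaled to the n^{-1/3} rate
est = smoothed_cdf(0.0, data, h_cdf)
# True value F(0) = 0.5; the smoothed estimate is root-n consistent, per (1.22).
```

Unlike the EDF, this estimate is smooth in $x$, yet it shares the EDF's parametric $\sqrt{n}$ rate.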
Figure 2: Smoothed kernel estimate and empirical distribution estimate for the CDF of annual salary income (the LSCV bandwidth is used in the kernel estimate; see the text for details).

1.2.6 Higher-Order Kernels and Bias Reduction

Recall that we have used so-called second-order kernels in the previous sections. For the univariate kernel density estimate, the best obtainable rate of convergence of the MISE is then of order $n^{-4/5}$. In this subsection, we demonstrate that it is possible to obtain better rates of convergence by relaxing the restriction that the kernel be a probability density function. To this end, higher-order kernels are needed, because they can help reduce the bias of the kernel density estimates. Recall the result for the bias of $\hat f(x)$ in (1.10):
$$\mathrm{Bias}\big[\hat f(x)\big] = \frac{h^2}{2}\kappa_2 f^{(2)}(x) + o(h^2).$$
Here the kernel is constrained to be a probability density function, so it is necessary to have $\kappa_2 = \int u^2 k(u)\,du > 0$. Without this restriction, however, it is possible to construct $k$ such that $\kappa_2 = 0$, which has the effect of reducing the bias to order $h^4$, provided that $f$ has a continuous, square-integrable fourth derivative. It is easy to check that the MSE and MISE then have the optimal rate of convergence of order $n^{-8/9}$ if such a kernel is used.

In general, the order of a kernel $k(\cdot)$, denoted $\nu$ ($\nu > 0$), is defined as the order of its first nonzero moment. A general $\nu$th-order kernel must satisfy the following conditions:
(i) $\int k(u)\,du = 1$;
(ii) $\int u^j k(u)\,du = 0$ for $j = 1, \ldots, \nu - 1$;
(iii) $0 < \int u^\nu k(u)\,du < \infty$.
For example, the standard normal (Gaussian) kernel $k(u) = (2\pi)^{-1/2}\exp(-u^2/2)$ is a second-order kernel. If we use a $\nu$th-order kernel in estimating a $d$-variate density, we can show that
$$\mathrm{Bias}\big[\hat f(x)\big] = O(h^\nu), \qquad (1.23)$$
$$\mathrm{Var}\big[\hat f(x)\big] = O\!\left(\frac{1}{nh^d}\right). \qquad (1.24)$$
Consequently,
$$\mathrm{MSE}\big[\hat f(x)\big] = O\!\left(h^{2\nu} + \frac{1}{nh^d}\right),$$
and, in the univariate case, the MSE-optimal bandwidth is of order $n^{-1/(2\nu+1)}$ with optimal MSE of order $n^{-2\nu/(2\nu+1)}$.

It is simple to construct symmetric higher-order kernel functions. In order to construct a $\nu$th-order ($\nu \ge 2$ even) kernel, one can begin with a second-order kernel such as the standard normal (Gaussian) kernel $\phi(u) = (2\pi)^{-1/2}\exp(-u^2/2)$, set up a polynomial in its argument,
$$k(u) = \sum_{j=0}^{\nu/2-1} a_j u^{2j}\,\phi(u), \qquad (1.25)$$
and solve for the constants $a_j$ ($j = 0, \ldots, \nu/2 - 1$) subject to the moment restrictions of a symmetric $\nu$th-order kernel. One can verify that the 4th-, 6th-, and 8th-order Gaussian kernels are given respectively by
$$k_4(u) = \left(\frac{3}{2} - \frac{1}{2}u^2\right)\phi(u),$$
$$k_6(u) = \left(\frac{15}{8} - \frac{5}{4}u^2 + \frac{1}{8}u^4\right)\phi(u),$$
$$k_8(u) = \left(\frac{35}{16} - \frac{35}{16}u^2 + \frac{7}{16}u^4 - \frac{1}{48}u^6\right)\phi(u).$$
The general formula for the kernels that are higher-order extensions of the second-order normal kernel is
$$k_\nu(u) = \sum_{j=0}^{\nu/2-1} \frac{(-1)^j}{2^j j!}\,\phi^{(2j)}(u), \qquad \nu = 2, 4, \ldots$$
See Wand and Schucany (1990) or Wand and Jones (1995, p. 34). If we start with the second-order Epanechnikov kernel (standardized to have unit variance),
$$k_2(u) = \frac{3}{4\sqrt{5}}\left(1 - \frac{u^2}{5}\right)\mathbf{1}\big(|u| \le \sqrt{5}\big),$$
the 4th-, 6th-, and 8th-order Epanechnikov kernels can be constructed analogously, by multiplying $k_2(u)$ by even polynomials in $u$ chosen to satisfy the corresponding moment restrictions.
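The Gaussian-based formula $k_\nu(u) = \sum_{j<\nu/2} \frac{(-1)^j}{2^j j!}\phi^{(2j)}(u)$ can be implemented and checked numerically, since $\phi^{(m)}(u) = (-1)^m \mathrm{He}_m(u)\phi(u)$ for the probabilists' Hermite polynomials. A sketch (NumPy; function names ours):

```python
import numpy as np
from math import factorial

def phi_deriv(u, m):
    """m-th derivative of the standard normal pdf via Hermite polynomials:
    phi^(m)(u) = (-1)^m He_m(u) phi(u)."""
    He = np.polynomial.hermite_e.HermiteE.basis(m)(u)
    return (-1) ** m * He * np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def gaussian_kernel(u, nu):
    """nu-th order (nu even) Gaussian-based kernel
    k_nu(u) = sum_{j < nu/2} (-1)^j / (2^j j!) * phi^(2j)(u)."""
    return sum((-1) ** j / (2 ** j * factorial(j)) * phi_deriv(u, 2 * j)
               for j in range(nu // 2))

u = np.linspace(-6, 6, 120_001)
k4 = gaussian_kernel(u, 4)
# A fourth-order kernel: zeroth moment 1, second moment 0, fourth moment nonzero.
moments = [np.trapz(u ** m * k4, u) for m in range(5)]
```

For $\nu = 4$ the sum reduces to $(3/2 - u^2/2)\phi(u)$, matching the closed form above; its fourth moment works out to $-3$, so condition (iii) holds with a negative sign.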
Figure 3: Gaussian kernels of various orders.

Figure 4: Epanechnikov kernels of various orders.
Obviously, when $\nu > 2$ the kernels can take negative values on their support. This is a drawback of higher-order kernels when used in density estimation, because they may assign negative weight in the density estimate; in some cases, we may obtain a negative density estimate.

Example 3 (Higher-order kernels). Figures 3 and 4 plot the Gaussian and Epanechnikov kernels of various orders. From the two figures, we clearly see that kernels of order higher than 2 can indeed take negative values.

There are other rules for constructing higher-order kernels (e.g., Jones and Foster, 1993). Let $k_\nu$ denote a $\nu$th-order kernel. Then one can use the following recursive formula to generate higher-order kernels:
$$k_{\nu+2}(u) = \frac{\nu+1}{\nu}\,k_\nu(u) + \frac{1}{\nu}\,u\,k_\nu'(u), \qquad (1.26)$$
where $k_\nu(\cdot)$ is assumed to be differentiable. For example, application of (1.26) to the second-order Gaussian kernel $k_2(u) = \phi(u)$ gives $\frac{3}{2}\phi(u) + \frac{1}{2}u\,\phi'(u)$, which leads directly to the fourth-order Gaussian kernel given above.

1.3 Multivariate Density Estimation

We now investigate the extension of the univariate kernel density estimator to the multivariate setting. As Wand and Jones (1995) remark, the need for nonparametric density estimates for recovering structure in multivariate data is greater, since parametric modelling is more difficult than in the univariate case. The extension of the univariate kernel methodology is not without its problems. First, the most general parametrization of the kernel density estimator in higher dimensions requires the specification of many more bandwidth parameters than in the univariate setting. This leads us to consider simpler smoothing parametrizations as well. Second, the sparseness of data in higher-dimensional space makes kernel smoothing difficult unless the sample size is very large. This phenomenon is usually referred to as the curse of dimensionality in the nonparametric literature.
It means that, with practical sample sizes, reasonable nonparametric density estimation is very difficult in more than about five dimensions (see the exercises in this chapter). Nevertheless, there have been many studies in which the kernel density estimator has been demonstrated to be an effective tool for displaying structure in bivariate samples (e.g., Silverman, 1986; Scott, 1992). The multivariate kernel density estimator has also played an important role in developments in the visualization of structure in three- and four-dimensional data sets (Scott, 1992). Also, many nonparametric hypothesis tests (e.g., tests for conditional independence; Su and White, 2007, 2008, 2013; Song, 2009; Huang, 2010) require the estimation of multivariate densities.

1.3.1 The Multivariate Kernel Density Estimator

Let $\{X_i\}_{i=1}^n$ denote a $d$-variate random sample having density $f$. We use the notation $X_i = (X_{i1}, \ldots, X_{id})'$ to denote the components of $X_i$, and a generic vector $x \in \mathbb{R}^d$ has the representation $x = (x_1, \ldots, x_d)'$. In its most general form, the $d$-dimensional kernel density estimator of $f(x)$ takes the form
$$\hat f(x) = \frac{1}{n}\sum_{i=1}^n K_H(x - X_i),$$
where $H$ is a symmetric positive definite $d \times d$ matrix called the bandwidth matrix, $K_H(u) = |H|^{-1/2} K\big(H^{-1/2} u\big)$, $K$ is a $d$-variate kernel function satisfying $\int K(u)\,du = 1$, and $|H|$ is the determinant of $H$.
The kernel function $K$ is often chosen to be a $d$-variate probability density function. There are two common techniques for generating multivariate kernels from a symmetric univariate kernel $k$:
$$K(u) = \prod_{s=1}^d k(u_s) \quad\text{and}\quad K(u) = \frac{k\big((u'u)^{1/2}\big)}{\int k\big((v'v)^{1/2}\big)\,dv}.$$
The first of these is often called a product kernel, and the second has the property of being spherically or radially symmetric. A popular choice for $K$ is the standard $d$-variate normal density
$$K(u) = (2\pi)^{-d/2}\exp\!\left(-\frac{1}{2}u'u\right),$$
in which case $K_H(x - X_i)$ is the $N(X_i, H)$ density evaluated at the vector $x$. It is well known that the $d$-variate normal kernel can be constructed from the univariate standard normal density using either the product or the spherically symmetric extension.

In general, the bandwidth matrix $H$ has $d(d+1)/2$ independent entries, which, even for moderate $d$, can be a substantial number of smoothing parameters to choose. The other extreme is to specify $H = h^2 I_d$, where $I_d$ is the identity matrix. In this case, the kernel estimator of the density is a straightforward generalization of (1.3):
$$\hat f(x) = \frac{1}{nh^d}\sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right), \qquad (1.27)$$
where $K$ is a multivariate kernel function such that $\int_{\mathbb{R}^d} K(u)\,du = 1$. Nevertheless, the use of a single bandwidth may not be appropriate if the variation in one component of $X_i$ is much greater than in the others. In that case, it may be more appropriate to use a vector or matrix of bandwidth parameters. An attractive practical approach is to linearly transform the data to have unit covariance matrix, apply (1.27) to the transformed data, and finally transform back to the original metric. Almost all the results in the previous subsection go through for the multivariate density estimator $\hat f(x)$ with obvious modifications. In particular, the ROT bandwidth is proportional to $n^{-1/(4+d)}$.

A realistic choice is to set $H = \mathrm{diag}(h_1^2, \ldots, h_d^2)$. Just as in the univariate case, some important choices have to be made when constructing a multivariate kernel density estimator, and the extension to higher dimensions means there are more degrees of freedom. Below we restrict ourselves to the product kernel and a diagonal $H$.
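A product-kernel estimator with componentwise bandwidths (the diagonal-$H$ case just described) can be sketched in a few lines. This illustration (NumPy; names and the simulated bivariate normal data are ours) uses a Gaussian product kernel and the $n^{-1/(4+d)}$ ROT rate:

```python
import numpy as np

def product_kde(x, data, h):
    """Product Gaussian kernel density estimate at a point x in R^d:
    f_hat(x) = (1/n) sum_i prod_s k((x_s - X_is)/h_s) / h_s."""
    u = (x[None, :] - data) / h              # (n, d) standardized differences
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return float(np.mean(np.prod(k / h, axis=1)))

rng = np.random.default_rng(3)
d, n = 2, 5_000
data = rng.standard_normal((n, d))
# Componentwise ROT-style bandwidths at the multivariate rate n^{-1/(4+d)}.
h = 1.06 * data.std(axis=0, ddof=1) * n ** (-1 / (4 + d))
est = product_kde(np.zeros(d), data, h)
# True bivariate standard normal density at the origin is 1/(2*pi), about 0.159.
```

The slower $n^{-1/(4+d)}$ bandwidth rate is one concrete face of the curse of dimensionality.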
For brevity, we write $K(u)$ for the product kernel $\prod_{s=1}^d k(u_s)$. We estimate $f(x)$ by
$$\hat f(x) = \frac{1}{n}\sum_{i=1}^n K_h(x - X_i) = \frac{1}{n}\sum_{i=1}^n \prod_{s=1}^d \frac{1}{h_s}\,k\!\left(\frac{x_s - X_{is}}{h_s}\right), \qquad (1.28)$$
where $h = (h_1, \ldots, h_d)$ and $K_h(x - X_i) = \prod_{s=1}^d h_s^{-1} k\big((x_s - X_{is})/h_s\big)$. We study the asymptotic properties of $\hat f(x)$ below.

1.3.2 Asymptotic Properties of the Multivariate Kernel Density Estimator

As in the univariate setting, we can obtain a simple asymptotic approximation to the MISE of the multivariate kernel density estimator under certain smoothness assumptions on the density and standard assumptions on the kernel and bandwidth. Before studying this, we first study the asymptotic normality
of the nonparametric kernel density estimator $\hat f(x)$. For this purpose, we make the following assumptions on $f$, $K$, and $h$.

A1* The observations $\{X_i\}_{i=1}^n$ are IID with density $f$.
A2* $f$ is third-order continuously differentiable at the interior point $x$ of its support.
A3* $K$ is a product of univariate kernels $k$, where $k$ is a symmetric function around zero satisfying (i) $\int k(u)\,du = 1$; (ii) $\int u^2 k(u)\,du = \kappa_2 \in (0, \infty)$; (iii) $\int k(u)^2\,du = \kappa_{02} < \infty$.
A4* As $n \to \infty$, $n h_1 \cdots h_d \to \infty$ and $h_s \to 0$ for $s = 1, \ldots, d$.

Clearly, Assumptions A1*-A4* parallel those in Section 1.2, and they are not the minimal requirements. The choice of a product kernel and a diagonal bandwidth matrix is just for simplicity of notation. One can generalize the results below straightforwardly to general choices of kernel and bandwidth (e.g., Wand and Jones, 1995). Just as in the univariate setting, under the above assumptions we can easily show that
$$\mathrm{Bias}\big[\hat f(x)\big] = \frac{\kappa_2}{2}\sum_{s=1}^d h_s^2\,f_{ss}(x) + o\!\left(\sum_{s=1}^d h_s^2\right) \qquad (1.29)$$
and
$$\mathrm{Var}\big[\hat f(x)\big] = \frac{\kappa_{02}^d\,f(x)}{n h_1 \cdots h_d} + o\!\left(\frac{1}{n h_1 \cdots h_d}\right), \qquad (1.30)$$
where $f_{ss}(x)$ is the second-order derivative of $f(x)$ with respect to $x_s$, $\kappa_2 = \int u^2 k(u)\,du$, and $\kappa_{02} = \int k(u)^2\,du$. Consequently,
$$\mathrm{MSE}\big[\hat f(x)\big] = \big\{\mathrm{Bias}\big[\hat f(x)\big]\big\}^2 + \mathrm{Var}\big[\hat f(x)\big] = O\!\left(\Big(\sum_{s=1}^d h_s^2\Big)^2 + \frac{1}{n h_1 \cdots h_d}\right), \qquad (1.31)$$
which converges to zero under Assumption A4*. Since convergence in MSE implies convergence in probability by the Chebyshev inequality, we have $\hat f(x) \to f(x)$ in probability.

To derive the asymptotic normality of $\hat f(x)$, we can write it as a summation over a double array of random variables and apply the Liapounov central limit theorem (CLT) in the appendix. Thus we state the following theorem.

Theorem 1.1. Suppose Assumptions A1*-A4* hold, and suppose that $\int |k(u)|^{2+\delta}\,du < \infty$ for some $\delta > 0$. If $n h_1 \cdots h_d \sum_{s=1}^d h_s^6 \to 0$, then
$$\sqrt{n h_1 \cdots h_d}\left(\hat f(x) - f(x) - \frac{\kappa_2}{2}\sum_{s=1}^d h_s^2\,f_{ss}(x)\right) \to N\big(0,\ \kappa_{02}^d\,f(x)\big) \text{ in distribution}. \qquad (1.32)$$

Proof. We only sketch the proof. Write
$$\hat f(x) - f(x) = \big\{\hat f(x) - E[\hat f(x)]\big\} + \big\{E[\hat f(x)] - f(x)\big\}.$$
The second term contributes to the bias of the estimator and the first term to the variance. By (1.29), it suffices to show that
$$\sqrt{n h_1 \cdots h_d}\,\big\{\hat f(x) - E[\hat f(x)]\big\} \to N\big(0,\ \kappa_{02}^d\,f(x)\big) \text{ in distribution}.$$
Define
$$Z_{ni} = \sqrt{\frac{h_1 \cdots h_d}{n}}\left\{\prod_{s=1}^d \frac{1}{h_s}\,k\!\left(\frac{x_s - X_{is}}{h_s}\right) - E\!\left[\prod_{s=1}^d \frac{1}{h_s}\,k\!\left(\frac{x_s - X_{is}}{h_s}\right)\right]\right\}.$$
Then $\sqrt{n h_1 \cdots h_d}\,\{\hat f(x) - E[\hat f(x)]\} = \sum_{i=1}^n Z_{ni}$. Note that $E(Z_{ni}) = 0$ and $\sum_{i=1}^n \mathrm{Var}(Z_{ni}) = \kappa_{02}^d f(x) + o(1)$. To verify the other conditions of the Liapounov CLT, it suffices to check that $\sum_{i=1}^n E|Z_{ni}|^{2+\delta} = o(1)$, which follows from the $c_r$ and Jensen inequalities together with $\int |k(u)|^{2+\delta}\,du < \infty$.¹ So we can apply the Liapounov CLT to conclude that $\sum_{i=1}^n Z_{ni} \to N\big(0,\ \kappa_{02}^d f(x)\big)$ in distribution.

Remark. It can easily be shown that $\mathrm{Cov}\big(\hat f(x_{(1)}), \hat f(x_{(2)})\big) = o\big((n h_1 \cdots h_d)^{-1}\big)$ for any two distinct points $x_{(1)}$ and $x_{(2)}$ in the support of $f(\cdot)$. By the Cramér-Wold device, we can then show that, for any finite collection of distinct points $x_{(1)}, \ldots, x_{(m)}$, the vector
$$\sqrt{n h_1 \cdots h_d}\,\Big(\hat f(x_{(1)}) - E\hat f(x_{(1)}),\ \ldots,\ \hat f(x_{(m)}) - E\hat f(x_{(m)})\Big) \to N\big(0,\ \kappa_{02}^d\,\mathrm{diag}\big(f(x_{(1)}), \ldots, f(x_{(m)})\big)\big) \text{ in distribution}.$$
That is, $\hat f(x_{(1)}), \ldots, \hat f(x_{(m)})$ are asymptotically independent of each other. This observation is very useful when we construct pointwise confidence intervals for $f(x)$ evaluated at distinct points, since we do not need to take into account any dependence between the estimators at different points.

For the asymptotic MISE approximation, we need the second-order partial derivatives of $f$ to be square integrable. From (1.29)-(1.31), we can obtain the MISE of $\hat f(x)$:
$$\mathrm{MISE} = \mathrm{AMISE} + o\!\left(\Big(\sum_{s=1}^d h_s^2\Big)^2 + \frac{1}{n h_1 \cdots h_d}\right), \qquad (1.33)$$
where
$$\mathrm{AMISE} = \frac{\kappa_2^2}{4}\int \left[\sum_{s=1}^d h_s^2\,f_{ss}(x)\right]^2 dx + \frac{\kappa_{02}^d}{n h_1 \cdots h_d}. \qquad (1.34)$$
Unlike the univariate setting, explicit expressions for the AMISE-optimal bandwidths are not available in general, and they can only be obtained numerically (see Wand, 1992a). In the special case $h_1 = \cdots = h_d = h$, the optimal bandwidth has an explicit formula and is given by
$$h_{opt} = \left[\frac{d\,\kappa_{02}^d}{\kappa_2^2\,R_2(f)}\right]^{1/(d+4)} n^{-1/(d+4)},$$

¹The $c_r$ inequality says that $E|X + Y|^r \le c_r\big(E|X|^r + E|Y|^r\big)$, with $c_r = 2^{r-1}$ for $r \ge 1$.
where $R_2(f) = \int \big[\sum_{s=1}^d f_{ss}(x)\big]^2 dx$. One can then obtain the minimum AMISE as
$$\inf_h \mathrm{AMISE} = \frac{d+4}{4}\left\{\frac{\kappa_2^{2d}\,\big(\kappa_{02}^d\big)^4\,R_2(f)^d}{d^d}\right\}^{1/(d+4)} n^{-4/(d+4)}.$$
The last expression implies that the rate of convergence of $\inf_h \mathrm{AMISE}$ is of order $n^{-4/(d+4)}$, a rate which becomes slower as the dimension $d$ increases. This slower rate reflects the curse of dimensionality discussed previously.

1.3.3 Multivariate Bandwidth Selection

As in the case of univariate density estimation, the optimal bandwidth $h = (h_1, \ldots, h_d)$ should balance the squared bias and variance terms. This requires $h_s^4 = O\big((n h_1 \cdots h_d)^{-1}\big)$, i.e., $h_s = O\big(n^{-1/(4+d)}\big)$ for each $s$. Let $h^0 = (h_1^0, \ldots, h_d^0)$ denote the optimal bandwidth vector that minimizes the AMISE of $\hat f(x)$. Let $\bar K(v) = \prod_{s=1}^d \bar k(v_s)$, where $\bar k(\cdot)$ is the convolution kernel of $k(\cdot)$. In practice, we often choose the bandwidth $h = (h_1, \ldots, h_d)$ by minimizing the following least squares cross-validation function:
$$\mathrm{CV}(h) = \frac{1}{n^2 h_1 \cdots h_d}\sum_{i=1}^n \sum_{j=1}^n \bar K\!\left(\frac{X_i - X_j}{h}\right) - \frac{2}{n(n-1) h_1 \cdots h_d}\sum_{i=1}^n \sum_{j \ne i} K\!\left(\frac{X_i - X_j}{h}\right), \qquad (1.35)$$
where division by $h$ is understood componentwise. This is a generalization of (1.17) from the univariate case to the multivariate density estimation setting. Let $\hat h = (\hat h_1, \ldots, \hat h_d)$ denote the solution to the above cross-validation problem. Then one can show that $\hat h_s / h_s^0 \to 1$ in probability for $s = 1, \ldots, d$. It is worth mentioning that one can also apply the plug-in principle to choose the bandwidth in the multivariate setting. Using the asymptotic MISE approximations developed in the last subsection, it is possible to develop a multivariate version of the plug-in bandwidth. See Wand and Jones (1995) for details.

1.3.4 Construction of Confidence Intervals

If undersmoothing bandwidths are chosen so that $n h_1 \cdots h_d \sum_{s=1}^d h_s^4 \to 0$, then Theorem 1.1 implies that the pointwise $100(1-\alpha)\%$ (or simply $1-\alpha$) asymptotic confidence interval (CI) for $f(x)$ can be constructed as follows:
$$\left[\hat f(x) - z_{\alpha/2}\sqrt{\frac{\kappa_{02}^d\,\hat f(x)}{n h_1 \cdots h_d}},\ \ \hat f(x) + z_{\alpha/2}\sqrt{\frac{\kappa_{02}^d\,\hat f(x)}{n h_1 \cdots h_d}}\right], \qquad (1.36)$$
where $z_{\alpha/2}$ denotes the $(1 - \alpha/2)$-percentile of the standard normal distribution. Nevertheless, the above CI can be badly behaved in finite samples, in that its coverage probability may be significantly smaller than the nominal level $1 - \alpha$. An alternative is to use the bootstrap to construct the CI for the kernel density estimator.
Since CIs for high-dimensional density estimators ($d\ge3$) are seldom used, we now turn to the construction of the confidence band for a univariate density estimator. In this case, the CI in (1.36) can be written as
$$\left[\hat{f}(x)-z_{\alpha/2}\sqrt{\frac{\kappa\,\hat{f}(x)}{nh}},\ \ \hat{f}(x)+z_{\alpha/2}\sqrt{\frac{\kappa\,\hat{f}(x)}{nh}}\right],\tag{1.37}$$
where $\kappa=\int K(v)^{2}dv$. The pointwise confidence band for $f(\cdot)$ is
$$B(h)=\left\{(x,t):x\in\mathcal{X},\ \hat{f}(x)-z_{\alpha/2}\sqrt{\frac{\kappa\,\hat{f}(x)}{nh}}\le t\le\hat{f}(x)+z_{\alpha/2}\sqrt{\frac{\kappa\,\hat{f}(x)}{nh}}\right\},$$
where $\mathcal{X}$ is the support of $X$. The coverage probability of $B(h)$ at a point $x$ is given by
$$\pi(x,h)=P\left\{(x,f(x))\in B(h)\right\}.\tag{1.38}$$
The limit of $\pi(x,h)$ is not $1-\alpha$ unless one uses an undersmoothing bandwidth to remove the asymptotic bias of $\hat{f}(x)$.

To consider the bootstrap CI for $f(x)$ or $E[\hat{f}(x)]$, we define another estimator of the variance of $\hat{f}(x)$. Note that the variance of $\hat{f}(x)$ (in the case $d=1$) is given by
$$\sigma_{n}^{2}(x)=\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}K\left(\frac{x-X_{i}}{h}\right)\right)=\frac{1}{n}\operatorname{Var}\left(\frac{1}{h}K\left(\frac{x-X_{i}}{h}\right)\right)=\frac{1}{n}\left\{E\left[\frac{1}{h}K\left(\frac{x-X_{i}}{h}\right)\right]^{2}-\left[E\,\frac{1}{h}K\left(\frac{x-X_{i}}{h}\right)\right]^{2}\right\}.$$
Hall (1992) proposes to estimate $\sigma_{n}^{2}(x)$ by
$$\hat{\sigma}_{n}^{2}(x)=\frac{1}{n}\left[\frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^{2}}K\left(\frac{x-X_{i}}{h}\right)^{2}-\hat{f}(x)^{2}\right].$$
We now describe how to construct the bootstrap CI for $E[\hat{f}(x)]$ in several steps:

1. Draw a bootstrap sample $\{X_{i}^{*}\}_{i=1}^{n}$ randomly (with replacement) from the original sample $\mathcal{D}\equiv\{X_{i}\}_{i=1}^{n}$.

2. Compute, with an undersmoothing choice of bandwidth $h$:
$$\hat{f}^{*}(x)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}K\left(\frac{x-X_{i}^{*}}{h}\right),\quad \hat{\sigma}_{n}^{*2}(x)=\frac{1}{n}\left[\frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^{2}}K\left(\frac{x-X_{i}^{*}}{h}\right)^{2}-\hat{f}^{*}(x)^{2}\right],\quad t^{*}(x)=\frac{\hat{f}^{*}(x)-\hat{f}(x)}{\hat{\sigma}_{n}^{*}(x)}.$$

3. Repeat Steps 1-2 a large number of times, say $B$ times, to obtain $\{t_{b}^{*}(x)\}_{b=1}^{B}$. Let $t_{\alpha/2}^{*}$ and $t_{1-\alpha/2}^{*}$ denote the $\alpha/2$ and $1-\alpha/2$ empirical percentiles of $\{t_{b}^{*}(x)\}_{b=1}^{B}$. Then the symmetric two-sided CI for $E[\hat{f}(x)]$ is given by
$$\left[\hat{f}(x)-\hat{\sigma}_{n}(x)\,t_{1-\alpha/2}^{*},\ \ \hat{f}(x)-\hat{\sigma}_{n}(x)\,t_{\alpha/2}^{*}\right].\tag{1.39}$$
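The three bootstrap steps above are easy to vectorize. Below is a hedged Python sketch of the percentile-$t$ construction at a single point; the sample, the (undersmoothed) bandwidth, the number of replications, and the evaluation point are all illustrative choices rather than prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(300)          # illustrative sample
n, h, x0 = len(x), 0.25, 0.0          # undersmoothed bandwidth (assumption)

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def fhat(sample):
    return gauss((x0 - sample) / h).mean() / h

def sig(sample):
    # Hall (1992)-type variance estimate at x0
    k2 = (gauss((x0 - sample) / h) ** 2).mean() / h ** 2
    return np.sqrt((k2 - fhat(sample) ** 2) / n)

f0, s0 = fhat(x), sig(x)
tstar = np.empty(499)
for b in range(tstar.size):
    xb = rng.choice(x, size=n, replace=True)   # Step 1
    tstar[b] = (fhat(xb) - f0) / sig(xb)       # Step 2
lo, hi = np.quantile(tstar, [0.025, 0.975])    # Step 3
ci = (f0 - hi * s0, f0 - lo * s0)              # 95% percentile-t CI
```

With an undersmoothed bandwidth the interval can be read as a CI for $f(x_0)$ itself, as discussed in the text.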
Note that $E^{*}[\hat{f}^{*}(x)\mid\mathcal{D}]=\hat{f}(x)$. That is, $\hat{f}^{*}(x)$ is an unbiased estimator of $\hat{f}(x)$ conditional on the data, even though $\hat{f}(x)$ is a biased estimator of $f(x)$. Hence, $t^{*}(x)$ is a bootstrap $t$-statistic for forming a CI for $E[\hat{f}(x)]$. The CI in (1.39) can be regarded as a CI for $f(x)$ when we use an undersmoothing bandwidth, so that the bias of $\hat{f}(x)$ is asymptotically negligible. Horowitz (2001, Handbook of Econometrics) demonstrates that the bootstrap provides asymptotic refinements for tests of hypotheses and confidence intervals in density estimation.

More recently, Hall and Horowitz (2013, AoS) propose a simple bootstrap method for constructing the CI for the kernel density estimator without the need for undersmoothing. Instead of drawing the bootstrap observations randomly from $\mathcal{D}$, Hall and Horowitz (2013) suggest that we draw a random sample $\mathcal{D}^{*}=\{X_{i}^{*}\}_{i=1}^{n}$ from the distribution with density $\hat{f}(\cdot)$ and define $\hat{f}^{*}$ to be the corresponding kernel estimator of $\hat{f}$:
$$\hat{f}^{*}(x)=\frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-X_{i}^{*}}{h}\right).$$
One can show that, conditional on $\mathcal{D}$, the asymptotic bias and variance of $\hat{f}^{*}(x)$ are the same as those of $\hat{f}(x)$, ignoring asymptotically negligible terms, so that we can construct the CI or confidence band for $f(x)$ based on $\hat{f}^{*}(x)$. The bootstrap version of $B(h)$ in (1.38) is given by
$$B^{*}(h)=\left\{(x,t):x\in\mathcal{X},\ \hat{f}^{*}(x)-z_{\alpha/2}\sqrt{\frac{\kappa\,\hat{f}^{*}(x)}{nh}}\le t\le\hat{f}^{*}(x)+z_{\alpha/2}\sqrt{\frac{\kappa\,\hat{f}^{*}(x)}{nh}}\right\}.$$
The bootstrap estimator $\hat{\pi}(x,h)$ of the coverage probability $\pi(x,h)$ that $B(h)$ covers $(x,f(x))$ is defined by
$$\hat{\pi}(x,h)=P^{*}\left\{(x,\hat{f}(x))\in B^{*}(h)\mid\mathcal{D}\right\}$$
and can be computed, by Monte Carlo simulation, in the form
$$\frac{1}{B}\sum_{b=1}^{B}\mathbf{1}\left\{(x,\hat{f}(x))\in B_{b}^{*}(h)\right\},\tag{1.40}$$
where $B_{b}^{*}(h)$ is calculated as $B^{*}(h)$ based on the $b$th bootstrap resample. [The bootstrap resamples are independent of each other conditional on $\mathcal{D}$.] For large enough $B$, we can treat (1.40) as $\hat{\pi}(x,h)$, ignoring the small simulation error.
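Drawing $\mathcal{D}^{*}$ from the estimated density $\hat{f}$, rather than from the raw data, has a convenient closed form for a Gaussian kernel: pick a data point uniformly at random and perturb it by $h$ times a standard normal draw. A short sketch (the sample and bandwidth are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(500)          # illustrative sample
n, h = len(x), 0.3

# A draw from the kernel density estimate fhat with a Gaussian kernel:
# X* = X_I + h * eps, with I uniform on {0,...,n-1} and eps ~ N(0,1),
# because fhat is an equal-weight mixture of N(X_i, h^2) components.
idx = rng.integers(0, n, size=n)
x_star = x[idx] + h * rng.standard_normal(n)
```

The same device extends to any kernel one can sample from; the resample has (conditional) variance inflated by roughly $h^{2}$ relative to the data, which is part of what makes the method mimic the smoothing bias.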
Then we can define $\hat{h}(x,\alpha_{0})$ to be the solution, in $h$, of $\hat{\pi}(x,h)=1-\alpha_{0}$. Take $\mathcal{X}_{0}=[a,b]$ as a subset of $\mathcal{X}$. Let $x_{1},\dots,x_{J}$ be evenly distributed on $[a,b]$ such that $x_{j}=a+j(b-a)/(J+1)$ for $j=1,\dots,J$, so that the distance between two adjacent points is $(b-a)/(J+1)$. Let $\hat{h}_{\xi}(\alpha_{0})$ denote the $\xi$-level empirical quantile of the points in the set $\{\hat{h}(x_{1},\alpha_{0}),\dots,\hat{h}(x_{J},\alpha_{0})\}$. For a value $\xi\in(0,1/2]$, construct the $1-\alpha_{0}$ confidence band $B(\hat{h}_{\xi}(\alpha_{0}))$ by taking $\xi$ sufficiently small. Hall and Horowitz (2013) recommend $\xi=0.1$ and remark that $\xi=0.05$ may be warranted in the case of large samples (p. 899).

1.3.5 Conditional Density Estimation

Even though conditional pdfs form the backbone of most popular statistical methods in use today, they are rarely modeled directly in the parametric setting, and they have received even less attention in the nonparametric literature. A few exceptions include Chen et al. (2003) and Li and Racine (2007). Let $W=(Y,X')'$, where $Y$ is a scalar and $X=(X_{1},\dots,X_{d})'$ is a $d\times1$ vector. We are interested in estimating the conditional density of $Y$ given $X=x$, which is denoted $f(y|x)$. Writing
$$f(y|x)=\frac{f(x,y)}{f(x)},$$
we estimate it by
$$\hat{f}(y|x)=\frac{\hat{f}(x,y)}{\hat{f}(x)},$$
where
$$\hat{f}(x,y)=\frac{1}{n}\sum_{i=1}^{n}K_{h}(x-X_{i})\,k_{h_{0}}(y-Y_{i})\quad\text{and}\quad\hat{f}(x)=\frac{1}{n}\sum_{i=1}^{n}K_{h}(x-X_{i}),$$
with $K_{h}(x-X_{i})=\prod_{s=1}^{d}h_{s}^{-1}k((x_{s}-X_{is})/h_{s})$ and $k_{h_{0}}(y-Y_{i})=h_{0}^{-1}k((y-Y_{i})/h_{0})$. The asymptotic properties of $\hat{f}(y|x)$ can easily be derived from those of $\hat{f}(x,y)$ and $\hat{f}(x)$.

To consider the choice of the bandwidths, we can consider the following criterion function based upon the weighted integrated squared error (ISE):
$$\operatorname{ISE}(h_{0},h)=\int\left\{\hat{f}(y|x)-f(y|x)\right\}^{2}f(x)\,w(x)\,dx\,dy=I_{1n}-2I_{2n}+\int f(y|x)^{2}f(x)\,w(x)\,dx\,dy,\tag{1.41}$$
where the last term does not depend on the bandwidths,
$$I_{1n}=\int\hat{f}(y|x)^{2}f(x)\,w(x)\,dx\,dy\quad\text{and}\quad I_{2n}=\int\hat{f}(y|x)\,f(y|x)\,f(x)\,w(x)\,dx\,dy.$$
Here we use the weight function $w(\cdot)$ to mitigate the random denominator problem. Let $\hat{G}(x)=\int\hat{f}(x,y)^{2}dy$. Then $I_{1n}=\int\{\hat{G}(x)/\hat{f}(x)^{2}\}f(x)\,w(x)\,dx$. One can verify that
$$\hat{G}(x)=\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}K_{h}(x-X_{i})\,K_{h}(x-X_{j})\,\bar{k}_{h_{0}}(Y_{i}-Y_{j}),$$
where $\bar{k}_{h_{0}}(v)=h_{0}^{-1}\bar{k}(v/h_{0})$ and $\bar{k}(v)=\int k(u)k(v-u)\,du$ is the convolution kernel of $k(\cdot)$. We estimate $I_{1n}$ and $I_{2n}$ by
$$\hat{I}_{1n}=\frac{1}{n}\sum_{i=1}^{n}\frac{\hat{G}_{-i}(X_{i})\,w(X_{i})}{\hat{f}_{-i}(X_{i})^{2}}\quad\text{and}\quad\hat{I}_{2n}=\frac{1}{n}\sum_{i=1}^{n}\frac{\hat{f}_{-i}(X_{i},Y_{i})\,w(X_{i})}{\hat{f}_{-i}(X_{i})},$$
respectively, where the subscript $-i$ denotes the leave-one-out estimators; for example,
$$\hat{f}_{-i}(X_{i},Y_{i})=\frac{1}{n-1}\sum_{j\neq i}K_{h}(X_{i}-X_{j})\,k_{h_{0}}(Y_{i}-Y_{j}).$$
Thus, we can choose $(h_{0},h_{1},\dots,h_{d})$ to minimize the following cross-validation function:
$$CV(h_{0},h_{1},\dots,h_{d})=\hat{I}_{1n}-2\hat{I}_{2n}.$$
Let $(\hat{h}_{0},\hat{h}_{1},\dots,\hat{h}_{d})$ be the solution to the above minimization problem, and let $(h_{0}^{0},h_{1}^{0},\dots,h_{d}^{0})$ be the minimizer of (1.41). Hall et al. (2003a) show that $\hat{h}_{s}/h_{s}^{0}\to_{p}1$ for $s=0,1,\dots,d$, where the optimal bandwidths are of order $n^{-1/(d+5)}$.
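The ratio estimator $\hat{f}(y|x)=\hat{f}(x,y)/\hat{f}(x)$ is only a few lines of code. Below is a Python sketch for scalar $x$; the data-generating process and the fixed bandwidths are illustrative assumptions (in practice they would come from the cross-validation procedure above).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.standard_normal(n)
y = 0.5 * x + rng.standard_normal(n)   # so Y | X = x is N(0.5 x, 1)

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def f_cond(y0, x0, hx=0.3, hy=0.3):
    """Kernel estimate of f(y0 | x0) as fhat(x0, y0) / fhat(x0)."""
    kx = gauss((x0 - x) / hx) / hx
    ky = gauss((y0 - y) / hy) / hy
    return (kx * ky).mean() / kx.mean()

# True value at (y0, x0) = (0.5, 1.0) is the N(0.5, 1) density at 0.5
est = f_cond(0.5, 1.0)
```

The same ratio construction carries over verbatim to vector $x$ with a product kernel in the numerator and denominator.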
1.3.6 Uniform Rates of Convergence

Up to now we have only demonstrated pointwise and mean integrated squared error consistency of the density estimator $\hat{f}(x)$. In fact, the consistency of $\hat{f}(x)$ can be strengthened to uniform consistency: nonparametric kernel estimators are uniformly strongly (almost surely) consistent. This result is important for theoretical purposes, so we report it below; for the proof, we refer the reader to Masry (1996a,b).

Theorem 1.2 Under some regularity conditions given in Masry (1996a,b), we have

(i) $\sup_{x\in\mathcal{S}}\left|\hat{f}(x)-f(x)\right|=O_{a.s.}\!\left(\left(\dfrac{\ln n}{nh_{1}\cdots h_{d}}\right)^{1/2}+\sum_{s=1}^{d}h_{s}^{2}\right)$,

(ii) $\sup_{x\in\mathcal{S}}E\left[\hat{f}(x)-f(x)\right]^{2}=O\!\left(\dfrac{1}{nh_{1}\cdots h_{d}}+\sum_{s=1}^{d}h_{s}^{4}\right)$,

where $\mathcal{S}$ is a compact set in the interior of the support of $f$.

In the above theorem, we restrict ourselves to a compact set in the interior of the support of $f$. Suppose $f$ is compactly supported. It is well known that when $x$ lies at the boundary of the support, we cannot estimate $f(x)$ at the usual rate; in fact, the MSE of $\hat{f}(x)$ is not even $o(1)$ in this case. Some modifications are needed to consistently estimate $f(x)$ for $x$ at the boundary of the support. For details, see Gasser and Müller (1979), Rice (1984), Hall and Wehrly (1991), Scott (1992), and Wand and Jones (1995).

Example 4 (Boundary bias problem). Suppose $X$ has compact support $[0,1]$ and $f(0)>0$. Suppose we estimate $f(0)$ by the usual nonparametric kernel estimator, that is, $\hat{f}(0)=(nh)^{-1}\sum_{i=1}^{n}K\left(\frac{0-X_{i}}{h}\right)$. Then
$$E\left[\hat{f}(0)\right]=\frac{1}{h}\int_{0}^{1}K\left(\frac{0-x}{h}\right)f(x)\,dx=\int_{0}^{1/h}K(v)\,f(vh)\,dv\to f(0)\int_{0}^{\infty}K(v)\,dv=\frac{f(0)}{2},$$
where we have used the dominated convergence theorem and the fact that $K(\cdot)$ is a symmetric kernel (implying $\int_{0}^{\infty}K(v)\,dv=1/2$). So $\hat{f}(0)$ is a biased estimator of $f(0)$ even asymptotically, and the asymptotic bias is of magnitude $-f(0)/2$.

When $f$ is infinitely supported, however, the above uniform-consistency result can be extended to the full support of $f$. For more details, see Hansen (2008).

1.4 Testing Hypotheses about Densities

In this section we consider testing hypotheses about densities.
Suppose $f$ and $g$ are two possible densities for the random variable or vector $X$. We may wish to test several types of hypotheses regarding these densities, each of which can be formulated as testing
$$H_{0}:f(x)=g(x)\quad\text{versus}\quad H_{1}:f(x)\neq g(x).\tag{1.42}$$
Pagan and Ullah (1999) consider several examples, which we reformulate below.

Examples. (a) It is sometimes desirable to test whether a nonparametrically estimated density has a particular form, say that of a normal density. In this example, we estimate $f(x)$ nonparametrically and estimate $g(x)$ parametrically according to the parametric assumption $g(x)=g(x;\theta)$, where $\theta$ is a vector of unknown parameters.
It is interesting to note that the finite-dimensional parameter $\theta$ can usually be estimated at the parametric $\sqrt{n}$-rate, which is faster than any nonparametric rate of estimation, so that whether $\theta$ is estimated or known has no asymptotic impact on the test.

(b) Symmetry of a density around some point, say zero, is frequently assumed in the literature. If this assumption is not met, subsequent statistical inference may not be valid, so it is desirable to have a test for symmetry. For example, if we are testing whether the density $f(\cdot)$ of $X$ is symmetric around zero, we may let $g(x)=f(-x)$ in (1.42).

(c) Conditional symmetry of a conditional density may also be of great interest. It turns out that several tests are available for conditional symmetry. See, e.g., Su (2006).

(d) Independence is a fundamental concept in statistics, and a large number of parametric and nonparametric tests are available for it. In testing whether two random variables $Y$ and $Z$ are independent, we may put $X=(Y,Z)$. Let $f_{1}$, $f_{2}$, and $f$ denote the marginal density of $Y$, the marginal density of $Z$, and the joint density of $Y$ and $Z$, respectively, so that the null hypothesis in (1.42) can be written as $H_{0}:f(y,z)=f_{1}(y)f_{2}(z)$. Similarly, we can formulate hypotheses for testing various variants of independence, such as serial independence, spatial independence, or conditional independence.

(e) It is useful to compare densities $f(x)$ and $g(x)$ that come from two different groups (male or female, white or non-white), regions (rural or urban, coastal or inland), or time periods.

As Pagan and Ullah (1999) remark, the above testing problems can be tackled by considering a widely accepted measure of global distance (closeness) between the two densities $f(x)$ and $g(x)$. In practice, one frequently uses the weighted integrated squared error
$$J(f,g)=\int\left[f(x)-g(x)\right]^{2}w(x)\,dx,\tag{1.43}$$
where $w(x)$ is a nonnegative weight function.
For example, if one takes $w(x)=f(x)$ or $g(x)$, then (1.43) can be estimated by its sample analogue
$$\hat{J}=\frac{1}{n}\sum_{i=1}^{n}\left[\hat{f}(X_{i})-\hat{g}(X_{i})\right]^{2}.\tag{1.44}$$
Another measure of distance (affinity) between two densities is the well-known Kullback-Leibler (KL) distance (information) measure introduced earlier. Under the null hypothesis, the KL distance between $f$ and $g$ is zero, and it is nonzero otherwise.

1.4.1 Comparison with a Parametric Density Function

Now consider the problem of testing $H_{0}:f(x)=g(x;\theta)$, where $g(\cdot;\theta)$ is a fully specified (known) density up to the finite-dimensional parameter vector $\theta$. Given data $\{X_{i}\}_{i=1}^{n}$, let $\hat{f}(x)$ be the nonparametric kernel density estimator of $f$, and let $\hat{\theta}$ be the maximum likelihood estimator of $\theta$ based upon the parametric density $g(\cdot;\theta)$. The test is based on the observation that
$$I(\theta)=\int\left[f(x)-g(x;\theta)\right]^{2}dx=\int f(x)^{2}dx+\int g(x;\theta)^{2}dx-2\int f(x)\,g(x;\theta)\,dx=E\left[f(X_{i})\right]+\int g(x;\theta)^{2}dx-2E\left[g(X_{i};\theta)\right].$$
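The identity above suggests estimating each expectation by a sample average. Below is a hedged sketch comparing a kernel estimate with a fitted normal density; the simulated data, the bandwidth, and the use of the plain (rather than leave-one-out) kernel estimator are illustrative simplifications, and no studentization is performed, so this is the raw distance, not the full test statistic.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(300)
n, h = len(x), 0.35

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

# Kernel density estimate evaluated at each data point
fhat = (gauss((x[:, None] - x[None, :]) / h) / h).mean(axis=1)

# Fitted parametric (normal) density g(x; theta_hat) via MLE
mu, s = x.mean(), x.std()
g = gauss((x - mu) / s) / s

# Sample analogue of I = E f(X) + int g^2 dx - 2 E g(X); for the normal,
# int g(x; theta_hat)^2 dx = 1 / (2 s sqrt(pi))
I_hat = fhat.mean() + 1.0 / (2.0 * s * np.sqrt(np.pi)) - 2.0 * g.mean()
```

Because the data really are normal here, the estimated distance should sit near zero, up to smoothing bias and the self-term in the kernel average.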
Following Fan (1994), we can propose a feasible test statistic by replacing $f(\cdot)$ and $\theta$ with $\hat{f}(\cdot)$ and $\hat{\theta}$:
$$\hat{I}_{n}=\frac{1}{n}\sum_{i=1}^{n}\hat{f}(X_{i})+\int g\big(x;\hat{\theta}\big)^{2}dx-\frac{2}{n}\sum_{i=1}^{n}g\big(X_{i};\hat{\theta}\big).\tag{1.45}$$
Under certain conditions, we can follow the proof of Theorem 4.1 of Fan (1994) to prove the following theorem.

Theorem 1.3 Under some regularity conditions and $H_{0}$, we have
$$T_{n}=\frac{n(h_{1}\cdots h_{d})^{1/2}\,\hat{I}_{n}-c_{n}}{\hat{\sigma}_{n}}\ \stackrel{d}{\to}\ N(0,1),$$
where $c_{n}$ is a centering term and $\hat{\sigma}_{n}^{2}=\frac{2}{n^{2}h_{1}\cdots h_{d}}\sum_{i=1}^{n}\sum_{j\neq i}\bar{K}\left(\frac{X_{i}-X_{j}}{h}\right)^{2}$, with $K(u)=\prod_{s=1}^{d}k(u_{s})$ the product kernel and $\bar{K}$ its convolution analogue.

Fan (1994) works with a smoothed version of $\int g(x;\hat{\theta})^{2}dx$ and proves an analog of the above theorem. Since our test is one-sided, we reject the null when $T_{n}$ exceeds the upper $\alpha$-percentile of the standard normal distribution; the same applies to the other tests in this section.

Note that the integration in (1.45) may not need to be done numerically in practice. For example, if $g(\cdot;\hat{\theta})$ is the pdf of a normal distribution with mean $\hat{\mu}$ and variance $\hat{\sigma}^{2}$, then $g(x;\hat{\theta})^{2}$ is proportional to the pdf of a $N(\hat{\mu},\hat{\sigma}^{2}/2)$ random variable, and it is easy to verify that $\int g(x;\hat{\theta})^{2}dx=(4\pi\hat{\sigma}^{2})^{-1/2}$ in this case.

1.4.2 Testing for Symmetry

To test whether a density function $f(\cdot)$ is symmetric around zero, we write the null and alternative hypotheses as
$$H_{0}:f(x)=f(-x)\ \text{for all }x\quad\text{versus}\quad H_{1}:f(x)\neq f(-x)\ \text{for some }x.\tag{1.46}$$
Noting that
$$I=\int\left[f(x)-f(-x)\right]^{2}dx=\int\left[f(x)-f(-x)\right]f(x)\,dx-\int\left[f(x)-f(-x)\right]f(-x)\,dx=2\int\left[f(x)-f(-x)\right]f(x)\,dx,$$
Ahmad and Li (1997) propose a test based upon the last functional. Clearly, we can estimate it (up to the factor 2) by
$$\hat{I}_{n}=\frac{1}{n}\sum_{i=1}^{n}\left[\hat{f}(X_{i})-\hat{f}(-X_{i})\right]=\frac{1}{n^{2}h}\sum_{i=1}^{n}\sum_{j=1}^{n}\left[K\left(\frac{X_{i}-X_{j}}{h}\right)-K\left(\frac{X_{i}+X_{j}}{h}\right)\right].\tag{1.47}$$
Under the null hypothesis and the standard assumptions that $h\to0$ and $nh\to\infty$, Ahmad and Li (1997) prove the following theorem.

Theorem 1.4 Under some regularity conditions and $H_{0}$, we have
$$\frac{nh^{1/2}\left(\hat{I}_{n}-\hat{c}_{n}\right)}{\hat{\sigma}_{n}}\ \stackrel{d}{\to}\ N(0,1),$$
where $\hat{\sigma}_{n}^{2}=\frac{4}{n}\sum_{i=1}^{n}\hat{f}(X_{i})\int K(v)^{2}dv$ and $\hat{c}_{n}$, a term involving $K(0)$, is used to correct for the finite-sample bias coming from the $i=j$ terms in (1.47). One can prove the above theorem by a simple application of the CLT for degenerate second-order U-statistics; see Theorem 1.9 in the appendix and Exercise 4 in this chapter.

1.4.3 Comparison with Unknown Densities

Comparison of two densities is important in empirical work. For example, we may be interested in comparing income distributions across two groups, regions, or time periods. Let $\{X_{i}\}_{i=1}^{n_{1}}$ and $\{Z_{i}\}_{i=1}^{n_{2}}$ be two samples of $d$-dimensional random vectors. Assume that $X$ and $Z$ have densities $f$ and $g$ and distribution functions $F$ and $G$, respectively. The null hypothesis of interest is $H_{0}:f(x)=g(x)$. Noticing that
$$I=\int\left[f(x)-g(x)\right]^{2}dx=\int f\,dF+\int g\,dG-\int f\,dG-\int g\,dF,$$
we can propose a feasible test statistic by replacing $f$, $g$, $F$, and $G$ with $\hat{f}$, $\hat{g}$, $\hat{F}$, and $\hat{G}$, respectively, where $\hat{f}(x)=\frac{1}{n_{1}h_{1}\cdots h_{d}}\sum_{i=1}^{n_{1}}K\left(\frac{x-X_{i}}{h}\right)$, $\hat{g}(x)=\frac{1}{n_{2}h_{1}\cdots h_{d}}\sum_{i=1}^{n_{2}}K\left(\frac{x-Z_{i}}{h}\right)$, and $\hat{F}$ and $\hat{G}$ are the empirical distribution functions of $\{X_{i}\}_{i=1}^{n_{1}}$ and $\{Z_{i}\}_{i=1}^{n_{2}}$, respectively. This leads to
$$\hat{I}_{n}=\int\hat{f}\,d\hat{F}+\int\hat{g}\,d\hat{G}-\int\hat{f}\,d\hat{G}-\int\hat{g}\,d\hat{F}=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\hat{f}(X_{i})+\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\hat{g}(Z_{i})-\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\hat{g}(X_{i})-\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\hat{f}(Z_{i}),$$
where $K\left(\frac{x-X_{i}}{h}\right)=\prod_{s=1}^{d}k\left(\frac{x_{s}-X_{is}}{h_{s}}\right)$. The following theorem states the main result.

Theorem 1.5 Under some regularity conditions and $H_{0}$, we have
$$\frac{n_{1}(h_{1}\cdots h_{d})^{1/2}\left(\hat{I}_{n}-\hat{c}_{n}\right)}{\hat{\sigma}_{n}}\ \stackrel{d}{\to}\ N(0,1),$$
where $\hat{c}_{n}$ is a centering term involving $K(0)$ and the sample sizes $n_{1}$ and $n_{2}$, and $\hat{\sigma}_{n}^{2}$ is a consistent variance estimator built from within- and between-sample second moments of the kernel, i.e., terms of the form $\sum_{i}\sum_{j}K\left(\frac{X_{i}-X_{j}}{h}\right)^{2}$, $\sum_{i}\sum_{j}K\left(\frac{X_{i}-Z_{j}}{h}\right)^{2}$, and $\sum_{i}\sum_{j}K\left(\frac{Z_{i}-Z_{j}}{h}\right)^{2}$.

For a proof of the above result, see Li and Racine (2006); for a variant of the above test, see Li (1996).
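The four sample averages in the two-sample statistic can be computed from a pair of kernel matrices. Below is a univariate Python sketch; the samples, the common bandwidth, and the omission of the centering and studentization from the theorem are illustrative simplifications, so this shows the raw statistic, not the full test.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal(400)            # sample 1
z = rng.standard_normal(300)            # sample 2, same law under H0
z_alt = 1.0 + rng.standard_normal(300)  # a shifted alternative
h = 0.35

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def I_stat(a, b):
    """Sample analogue of int f dF + int g dG - int f dG - int g dF."""
    fa = (gauss((a[:, None] - a[None, :]) / h) / h).mean(axis=1)  # fhat at a
    gb = (gauss((b[:, None] - b[None, :]) / h) / h).mean(axis=1)  # ghat at b
    ga = (gauss((a[:, None] - b[None, :]) / h) / h).mean(axis=1)  # ghat at a
    fb = (gauss((b[:, None] - a[None, :]) / h) / h).mean(axis=1)  # fhat at b
    return fa.mean() + gb.mean() - ga.mean() - fb.mean()

I_null = I_stat(x, z)      # near zero when the densities coincide
I_alt = I_stat(x, z_alt)   # clearly positive under the shifted alternative
```

The small positive offset of `I_null` away from exactly zero comes from the $i=j$ self-terms, which is precisely what the centering term in the theorem removes.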
1.4.4 Testing for Independence

Let $(Y',Z')'$ be a $(d_{1}+d_{2})$-dimensional random vector with joint cdf $F(y,z)$ and pdf $f(y,z)$. Further, let $F_{1}(y)$ and $F_{2}(z)$ denote the marginal cdfs of $Y$ and $Z$, with marginal pdfs $f_{1}(y)$ and $f_{2}(z)$, respectively. The null hypothesis of interest is $H_{0}:f(y,z)=f_{1}(y)f_{2}(z)$. Observing that
$$I=\int\left[f(y,z)-f_{1}(y)f_{2}(z)\right]^{2}dy\,dz=E\left[f(Y_{i},Z_{i})\right]+E\left[f_{1}(Y_{i})\right]E\left[f_{2}(Z_{i})\right]-2E\left[f_{1}(Y_{i})\,f_{2}(Z_{i})\right],$$
we can propose a feasible test statistic by replacing $f(\cdot,\cdot)$, $f_{1}(\cdot)$, and $f_{2}(\cdot)$ with their leave-one-out kernel estimators $\hat{f}_{-i}(\cdot,\cdot)$, $\hat{f}_{1,-i}(\cdot)$, and $\hat{f}_{2,-i}(\cdot)$. This leads to the following expression:
$$\hat{I}_{n}=\frac{1}{n}\sum_{i=1}^{n}\hat{f}_{-i}(Y_{i},Z_{i})+\left[\frac{1}{n}\sum_{i=1}^{n}\hat{f}_{1,-i}(Y_{i})\right]\left[\frac{1}{n}\sum_{i=1}^{n}\hat{f}_{2,-i}(Z_{i})\right]-\frac{2}{n}\sum_{i=1}^{n}\hat{f}_{1,-i}(Y_{i})\,\hat{f}_{2,-i}(Z_{i}),$$
where $\hat{f}_{-i}(Y_{i},Z_{i})=\frac{1}{n-1}\sum_{j\neq i}K_{h}(Y_{i}-Y_{j})L_{\lambda}(Z_{i}-Z_{j})$, $\hat{f}_{1,-i}(Y_{i})=\frac{1}{n-1}\sum_{j\neq i}K_{h}(Y_{i}-Y_{j})$, and $\hat{f}_{2,-i}(Z_{i})=\frac{1}{n-1}\sum_{j\neq i}L_{\lambda}(Z_{i}-Z_{j})$, with product kernels $K_{h}(u)=\prod_{s=1}^{d_{1}}h_{s}^{-1}k(u_{s}/h_{s})$ and $L_{\lambda}(v)=\prod_{s=1}^{d_{2}}\lambda_{s}^{-1}k(v_{s}/\lambda_{s})$. Under certain conditions, Ahmad and Li (1997) prove the following theorem.

Theorem 1.6 Under some regularity conditions and $H_{0}$, we have
$$T_{n}=\frac{n\left(h_{1}\cdots h_{d_{1}}\lambda_{1}\cdots\lambda_{d_{2}}\right)^{1/2}\hat{I}_{n}}{\hat{\sigma}_{n}}\ \stackrel{d}{\to}\ N(0,1),$$
where $\hat{\sigma}_{n}^{2}=\frac{2}{n^{2}h_{1}\cdots h_{d_{1}}\lambda_{1}\cdots\lambda_{d_{2}}}\sum_{i=1}^{n}\sum_{j\neq i}K\left(\frac{Y_{i}-Y_{j}}{h}\right)^{2}L\left(\frac{Z_{i}-Z_{j}}{\lambda}\right)^{2}$ with, e.g., $K\left(\frac{Y_{i}-Y_{j}}{h}\right)=\prod_{s=1}^{d_{1}}k\left(\frac{Y_{is}-Y_{js}}{h_{s}}\right)$.

Like the other nonparametric tests defined in this section, large values of $T_{n}$ are evidence in favor of the alternative, and we reject the null hypothesis if $T_{n}$ exceeds the upper $\alpha$-percentile of the standard normal distribution. It is worth mentioning that the previously developed theory for kernel density estimation and testing goes through under weak data-dependence conditions. In the next application, we consider testing for structural change in a time series framework.

1.4.5 Testing for Structural Change in Densities

Since Page (1956), the problem of testing for a structural change has generated much interest in both statistics and econometrics. Early studies mainly focused on parameter change in the parametric framework. More recently, much attention has been paid to testing for structural change at the level of the whole distribution or density. Let $\{X_{t}\}$ be a stationary strong mixing process, i.e., its mixing coefficients satisfy
$$\alpha(\tau)=\sup_{t}\sup\left\{\left|P(A\cap B)-P(A)P(B)\right|:A\in\mathcal{F}_{-\infty}^{t},\ B\in\mathcal{F}_{t+\tau}^{\infty}\right\}\to0\quad\text{as }\tau\to\infty,$$
where $\mathcal{F}_{s}^{t}=\sigma(X_{s},\dots,X_{t})$ is the $\sigma$-field generated by $X_{s},\dots,X_{t}$. We wish to test for a change in the marginal density of $\{X_{t}\}_{t=1}^{n}$. So the null hypothesis is
$$H_{0}:X_{1},\dots,X_{n}\ \text{have a common marginal density }f,$$
and the alternative hypothesis is
$$H_{1}:\ \text{for some }\tau\in(0,1),\ X_{1},\dots,X_{\lfloor n\tau\rfloor}\ \text{have a common density }f_{1}\ \text{and}\ X_{\lfloor n\tau\rfloor+1},\dots,X_{n}\ \text{have a common density }f_{2}\neq f_{1},$$
where $\lfloor a\rfloor$ denotes the largest integer less than or equal to $a$, and $f$, $f_{1}$, and $f_{2}$ are all unknown. To test $H_{0}$, define the subsample estimators
$$\hat{f}_{\lfloor ns\rfloor}(x)=\frac{1}{\lfloor ns\rfloor h}\sum_{t=1}^{\lfloor ns\rfloor}K\left(\frac{x-X_{t}}{h}\right)\quad\text{and}\quad\tilde{f}_{\lfloor ns\rfloor}(x)=\frac{1}{(n-\lfloor ns\rfloor)h}\sum_{t=\lfloor ns\rfloor+1}^{n}K\left(\frac{x-X_{t}}{h}\right),$$
and, with $\sigma^{2}(x)=f(x)\int K(v)^{2}dv$, the process
$$T_{n}(s,x)=\frac{\lfloor ns\rfloor\,(n-\lfloor ns\rfloor)}{n^{3/2}}\sqrt{\frac{h}{\sigma^{2}(x)}}\left\{\hat{f}_{\lfloor ns\rfloor}(x)-\tilde{f}_{\lfloor ns\rfloor}(x)\right\},\tag{1.48}$$
provided $\sigma(x)\neq0$; if $\sigma(x)=0$, (1.48) is defined to be zero. Under the null $H_{0}$, we can define the partial-sum process
$$W_{n}(s,x)=\sqrt{\frac{h}{n\,\sigma^{2}(x)}}\sum_{t=1}^{\lfloor ns\rfloor}\left\{\frac{1}{h}K\left(\frac{x-X_{t}}{h}\right)-E\left[\frac{1}{h}K\left(\frac{x-X_{t}}{h}\right)\right]\right\}$$
if $\sigma(x)\neq0$, and $W_{n}(s,x)=0$ if $\sigma(x)=0$. Then we can write
$$T_{n}(s,x)=W_{n}(s,x)-\frac{\lfloor ns\rfloor}{n}W_{n}(1,x).$$
Lee and Na (2004) show that, for fixed $x$, $\{W_{n}(s,x):0\le s\le1\}$ converges weakly to a standard Brownian motion, which implies that $\{T_{n}(s,x):0\le s\le1\}$ converges to a Brownian bridge. Let $x_{1},\dots,x_{J}$ be distinct real numbers. Define
$$T_{n}=\max_{1\le j\le J}\ \sup_{0\le s\le1}\left|T_{n}(s,x_{j})\right|.$$
Lee and Na (2004) prove the following theorem.

Theorem 1.7 Suppose the regularity conditions given in Lee and Na (2004) hold. (i) Under $H_{0}$, as $n\to\infty$,
$$T_{n}\ \stackrel{d}{\to}\ \max_{1\le j\le J}\ \sup_{0\le s\le1}\left|B_{j}^{0}(s)\right|,$$
where $B_{1}^{0},\dots,B_{J}^{0}$ are independent Brownian bridges. (ii) Under $H_{1}$, as $n\to\infty$, $T_{n}\to\infty$ in probability if $f_{1}(x_{j})\neq f_{2}(x_{j})$ for some $j\in\{1,\dots,J\}$.

Thus we reject the null if $T_{n}$ is large enough. In practice, one can tabulate the critical values based on simulations of Brownian bridges.
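The idea behind the test, comparing kernel density estimates before and after each candidate break point, can be illustrated directly. Below is a simplified Python sketch that scans break points at a single evaluation point; it omits the studentization and the Brownian-bridge critical values of the theorem, and all tuning choices (sample, break magnitude, bandwidth, evaluation point, trimming) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 600
# Break in the marginal density: N(0,1) in the first half, N(2,1) after
x = np.concatenate([rng.standard_normal(n // 2),
                    2.0 + rng.standard_normal(n // 2)])
h, pt = 0.4, 0.0   # bandwidth and evaluation point (illustrative)

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def fhat(sample):
    return (gauss((pt - sample) / h) / h).mean()

# |pre-break estimate - post-break estimate| over candidate break points,
# trimming the first and last 50 observations so both subsamples are large
gaps = [abs(fhat(x[:k]) - fhat(x[k:])) for k in range(50, n - 50)]
T = max(gaps)
k_hat = 50 + int(np.argmax(gaps))   # rough estimate of the break location
```

With a density break this pronounced, the scan statistic peaks near the true break point; in the actual test, the supremum is taken over the properly normalized process (1.48) at several points $x_{j}$ and compared against simulated Brownian-bridge quantiles.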
Queen s Economics Department Working Paper No. 1054 Inference via Kernel Smoothing of Bootstrap P Values Jeff Racine McMaster University James G. MacKinnon Queen s University Department of Economics Queen
More informationThree Papers by Peter Bickel on Nonparametric Curve Estimation
Three Papers by Peter Bickel on Nonparametric Curve Estimation Hans-Georg Müller 1 ABSTRACT The following is a brief review of three landmark papers of Peter Bickel on theoretical and methodological aspects
More informationSupplement to Quantile-Based Nonparametric Inference for First-Price Auctions
Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions Vadim Marmer University of British Columbia Artyom Shneyerov CIRANO, CIREQ, and Concordia University August 30, 2010 Abstract
More informationDensity Estimation (II)
Density Estimation (II) Yesterday Overview & Issues Histogram Kernel estimators Ideogram Today Further development of optimization Estimating variance and bias Adaptive kernels Multivariate kernel estimation
More informationWhat s New in Econometrics. Lecture 13
What s New in Econometrics Lecture 13 Weak Instruments and Many Instruments Guido Imbens NBER Summer Institute, 2007 Outline 1. Introduction 2. Motivation 3. Weak Instruments 4. Many Weak) Instruments
More informationEco517 Fall 2004 C. Sims MIDTERM EXAM
Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering
More informationTesting Homogeneity Of A Large Data Set By Bootstrapping
Testing Homogeneity Of A Large Data Set By Bootstrapping 1 Morimune, K and 2 Hoshino, Y 1 Graduate School of Economics, Kyoto University Yoshida Honcho Sakyo Kyoto 606-8501, Japan. E-Mail: morimune@econ.kyoto-u.ac.jp
More informationBayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples
Bayesian inference for sample surveys Roderick Little Module : Bayesian models for simple random samples Superpopulation Modeling: Estimating parameters Various principles: least squares, method of moments,
More informationA Goodness-of-fit Test for Copulas
A Goodness-of-fit Test for Copulas Artem Prokhorov August 2008 Abstract A new goodness-of-fit test for copulas is proposed. It is based on restrictions on certain elements of the information matrix and
More informationG. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication
G. S. Maddala Kajal Lahiri WILEY A John Wiley and Sons, Ltd., Publication TEMT Foreword Preface to the Fourth Edition xvii xix Part I Introduction and the Linear Regression Model 1 CHAPTER 1 What is Econometrics?
More informationReview. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda
Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with
More informationChapter 2: Resampling Maarten Jansen
Chapter 2: Resampling Maarten Jansen Randomization tests Randomized experiment random assignment of sample subjects to groups Example: medical experiment with control group n 1 subjects for true medicine,
More informationIntroduction. Chapter 1
Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics
More information14.30 Introduction to Statistical Methods in Economics Spring 2009
MIT OpenCourseWare http://ocw.mit.edu 4.0 Introduction to Statistical Methods in Economics Spring 009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationResearch Article A Nonparametric Two-Sample Wald Test of Equality of Variances
Advances in Decision Sciences Volume 211, Article ID 74858, 8 pages doi:1.1155/211/74858 Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances David Allingham 1 andj.c.w.rayner
More informationOne-Sample Numerical Data
One-Sample Numerical Data quantiles, boxplot, histogram, bootstrap confidence intervals, goodness-of-fit tests University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html
More informationBusiness Statistics. Lecture 10: Course Review
Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,
More informationMaximum-Likelihood Estimation: Basic Ideas
Sociology 740 John Fox Lecture Notes Maximum-Likelihood Estimation: Basic Ideas Copyright 2014 by John Fox Maximum-Likelihood Estimation: Basic Ideas 1 I The method of maximum likelihood provides estimators
More informationV. Properties of estimators {Parts C, D & E in this file}
A. Definitions & Desiderata. model. estimator V. Properties of estimators {Parts C, D & E in this file}. sampling errors and sampling distribution 4. unbiasedness 5. low sampling variance 6. low mean squared
More informationSubject CS1 Actuarial Statistics 1 Core Principles
Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and
More informationBTRY 4090: Spring 2009 Theory of Statistics
BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)
More informationApplied Statistics and Econometrics
Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple
More informationMeasure-Transformed Quasi Maximum Likelihood Estimation
Measure-Transformed Quasi Maximum Likelihood Estimation 1 Koby Todros and Alfred O. Hero Abstract In this paper, we consider the problem of estimating a deterministic vector parameter when the likelihood
More informationA nonparametric two-sample wald test of equality of variances
University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 211 A nonparametric two-sample wald test of equality of variances David
More informationLocal Polynomial Regression
VI Local Polynomial Regression (1) Global polynomial regression We observe random pairs (X 1, Y 1 ),, (X n, Y n ) where (X 1, Y 1 ),, (X n, Y n ) iid (X, Y ). We want to estimate m(x) = E(Y X = x) based
More informationKullback-Leibler Designs
Kullback-Leibler Designs Astrid JOURDAN Jessica FRANCO Contents Contents Introduction Kullback-Leibler divergence Estimation by a Monte-Carlo method Design comparison Conclusion 2 Introduction Computer
More informationIntegrated Likelihood Estimation in Semiparametric Regression Models. Thomas A. Severini Department of Statistics Northwestern University
Integrated Likelihood Estimation in Semiparametric Regression Models Thomas A. Severini Department of Statistics Northwestern University Joint work with Heping He, University of York Introduction Let Y
More informationSTATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN
Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 2: MODES OF CONVERGENCE AND POINT ESTIMATION Lecture 2:
More informationPrevious lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure)
Previous lecture Single variant association Use genome-wide SNPs to account for confounding (population substructure) Estimation of effect size and winner s curse Meta-Analysis Today s outline P-value
More informationIEOR E4703: Monte-Carlo Simulation
IEOR E4703: Monte-Carlo Simulation Output Analysis for Monte-Carlo Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Output Analysis
More informationStatistical inference
Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall
More informationInference For High Dimensional M-estimates. Fixed Design Results
: Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and
More informationLecture 2: CDF and EDF
STAT 425: Introduction to Nonparametric Statistics Winter 2018 Instructor: Yen-Chi Chen Lecture 2: CDF and EDF 2.1 CDF: Cumulative Distribution Function For a random variable X, its CDF F () contains all
More informationTime Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY
Time Series Analysis James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY & Contents PREFACE xiii 1 1.1. 1.2. Difference Equations First-Order Difference Equations 1 /?th-order Difference
More informationOn the Power of Tests for Regime Switching
On the Power of Tests for Regime Switching joint work with Drew Carter and Ben Hansen Douglas G. Steigerwald UC Santa Barbara May 2015 D. Steigerwald (UCSB) Regime Switching May 2015 1 / 42 Motivating
More informationSINGLE-STEP ESTIMATION OF A PARTIALLY LINEAR MODEL
SINGLE-STEP ESTIMATION OF A PARTIALLY LINEAR MODEL DANIEL J. HENDERSON AND CHRISTOPHER F. PARMETER Abstract. In this paper we propose an asymptotically equivalent single-step alternative to the two-step
More informationChapter 1 Statistical Inference
Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations
More informationA COMPARISON OF HETEROSCEDASTICITY ROBUST STANDARD ERRORS AND NONPARAMETRIC GENERALIZED LEAST SQUARES
A COMPARISON OF HETEROSCEDASTICITY ROBUST STANDARD ERRORS AND NONPARAMETRIC GENERALIZED LEAST SQUARES MICHAEL O HARA AND CHRISTOPHER F. PARMETER Abstract. This paper presents a Monte Carlo comparison of
More information