7 Semiparametric Estimation of Additive Models

Additive models are very useful for approximating high-dimensional regression mean functions. They and their extensions have become one of the most widely used nonparametric techniques since the excellent monograph by Hastie and Tibshirani (1990) and the companion software described in Chambers and Hastie (1991). For a recent survey on additive models, see Horowitz (2014).

Much applied research in economics and statistics is concerned with the estimation of a conditional mean or quantile function. Specifically, let $(X, Y)$ be a random pair, where $Y$ is a scalar random variable and $X$ is a $d \times 1$ random vector that is continuously distributed. We are interested in the estimation of either the conditional mean function $m(x) \equiv E(Y \mid X = x)$ or the $\tau$th conditional quantile function $m_\tau(x) = \arg\min_a E[\rho_\tau(Y - a) \mid X = x]$ of $Y$ given $X = x$, which satisfies $P(Y \le m_\tau(x) \mid X = x) = \tau$.

In a classical nonparametric additive model, $m$ or $m_\tau$ is assumed to have the form

$$m(x) = c_0 + \sum_{j=1}^d m_j(x_j) \qquad (7.1)$$

or

$$m_\tau(x) = c_0 + \sum_{j=1}^d m_j(x_j), \qquad (7.2)$$

where $c_0$ is a constant, $x_j$ is the $j$th element of $x$, and $m_1, \ldots, m_d$ are one-dimensional smooth functions that are unknown and estimated nonparametrically. Model (7.1) or (7.2) can be extended to

$$m(x) = G\Big(c_0 + \sum_{j=1}^d m_j(x_j)\Big) \qquad (7.3)$$

or

$$m_\tau(x) = G\Big(c_0 + \sum_{j=1}^d m_j(x_j)\Big), \qquad (7.4)$$

where $G$ is a strictly increasing function that may be known or unknown. Below we focus on the estimation of model (7.1) and its extensions, and then briefly touch upon the models in (7.2)-(7.4).

7.1 The Additive Model and the Backfitting Algorithm

7.1.1 The Basic Additive Model

In the regression framework, a simple additive model is defined by

$$Y = c_0 + \sum_{j=1}^d m_j(X_j) + \varepsilon, \qquad (7.5)$$

where $E(\varepsilon \mid X) = 0$, $\sigma^2(x) = \mathrm{Var}(\varepsilon \mid X = x)$, and the $m_j$'s are arbitrary univariate functions that are assumed to be smooth and unknown. Note that we can add a constant to a component $m_j$ or to $c_0$ and subtract the same constant from another component in (7.5). Thus the $m_j$'s and $c_0$ are not identified without further restrictions. To prevent ambiguity, various identification conditions can be assumed.

For example, one can assume that either

$$E[m_j(X_j)] = 0, \quad j = 1, \ldots, d, \qquad (7.6)$$

or

$$m_j(0) = 0, \quad j = 1, \ldots, d, \qquad (7.7)$$

or

$$\int m_j(x_j)\, dx_j = 0, \quad j = 1, \ldots, d, \qquad (7.8)$$

whichever is convenient for the estimation method at hand. We also assume that the $m_j$'s are smooth functions, so that they can be estimated as well as in the one-dimensional nonparametric regression problem (Stone, 1985, 1986). Hence the curse of dimensionality is avoided. Frequently we will write

$$m(x) = E(Y \mid X = x) = c_0 + \sum_{j=1}^d m_j(x_j), \qquad (7.9)$$

where $x = (x_1, \ldots, x_d)'$ and $X = (X_1, \ldots, X_d)'$.

Model (7.5) allows us to examine the extent of the nonlinear contribution of each explanatory variable to the dependent variable $Y$. Under the identification conditions that $E[m_j(X_j)] = 0$ for each $j = 1, \ldots, d$ and $E(\varepsilon \mid X) = 0$, we have $c_0 = E(Y)$, so that the single finite-dimensional parameter $c_0$ can be estimated by the sample mean $\bar Y = n^{-1}\sum_{i=1}^n Y_i$. Since $\bar Y$ converges to $c_0$ at the parametric $\sqrt{n}$-rate, which is faster than any nonparametric convergence rate, below we simply work with the model without $c_0$ in (7.5) by assuming $E(Y) = 0$.

Additive models of the form (7.5) have been shown to be useful in practice. They naturally generalize linear regression models and allow interpretation of marginal changes, i.e., the effect of one variable, say $X_j$, on the conditional mean function $m(x)$, holding everything else constant. They are also interesting from a theoretical perspective since they combine flexible nonparametric modeling of many variables with the statistical precision that is typical of just one explanatory variable.

Example 7.1 (Additive AR(p) models) In the time series literature, a useful class of nonlinear autoregressive models are the additive models

$$Y_t = \sum_{j=1}^p m_j(Y_{t-j}) + \varepsilon_t. \qquad (7.10)$$

In this case, the model is called an additive autoregressive model of order $p$ and denoted AAR($p$). In particular, it includes the AR($p$) model as a special case and allows us to test whether an AR($p$) model holds reasonably well for a given time series; see the simulation sketch at the end of this subsection.

Restricting ourselves to the class of additive models (7.5), the prediction error can be written as

$$E\Big[Y - c_0 - \sum_{j=1}^d m_j(X_j)\Big]^2 = E[Y - m(X)]^2 + E\Big[m(X) - c_0 - \sum_{j=1}^d m_j(X_j)\Big]^2, \qquad (7.11)$$

where $m(X) = E(Y \mid X)$. Thus finding the best additive model to minimize the least squares prediction error is equivalent to finding the one that best approximates the conditional mean function, in the sense that $c_0$ and the $m_j$'s minimize the second term in (7.11). In the case where the additive model is not correctly specified (i.e., $\Pr\big(m(X) = c_0 + \sum_{j=1}^d m_j(X_j)\big) < 1$), we can interpret it as an approximation to the conditional mean function.
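To make Example 7.1 concrete, the following minimal sketch simulates an AAR(2) process; the particular component functions and error scale are illustrative assumptions, not taken from the text.

```python
# Simulate the additive autoregressive model (7.10) with p = 2.
# The component functions below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, burn = 500, 100
y = np.zeros(n + burn)
for t in range(2, n + burn):
    y[t] = np.sin(y[t - 1]) - 0.5 * np.tanh(y[t - 2]) + 0.5 * rng.standard_normal()
y = y[burn:]  # drop the burn-in so the retained series is near-stationary
```

Because both components are bounded, the recursion is stable; fitting an AAR(2) to the simulated series and comparing it with a linear AR(2) fit is exactly the kind of specification check mentioned above.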

7.1.2 The Backfitting Algorithm

The estimation of $m_1, \ldots, m_d$ can easily be done using the backfitting algorithm from the nonparametric literature. To do so, we first introduce some background on global spline approximation.

Local linear modelling cannot be directly applied to fit the additive model (7.5) with $c_0 = 0$. To approximate the unknown functions $m_1, \ldots, m_d$ locally at the point $(x_1, \ldots, x_d)$, we would need to localize simultaneously in the variables $x_1, \ldots, x_d$. This yields a $d$-dimensional hypercube that contains hardly any data points for small to moderate sample sizes unless the neighborhood is very large; but when the neighborhood is large, the approximation error is large. This is the key problem underlying the curse of dimensionality.

To attenuate the problem, we can approximate the nonlinear functions $m_1, \ldots, m_d$ by polynomial splines or Hermite polynomials, among others. For example, we can approximate $m_j(x_j)$ by

$$g_j(x_j, \beta_j) = \sum_{k=1}^{K_j} \beta_{jk} B_{jk}(x_j), \qquad (7.12)$$

where $\beta_j = (\beta_{j1}, \ldots, \beta_{jK_j})'$ and the $\{B_{jk}(\cdot)\}$ can be chosen from various bases of functions, including the trigonometric series $\{\sin(x), \cos(x), \ldots\}$, the polynomial series $\{1, x, x^2, \ldots\}$, and Gallant's (1982) flexible Fourier form $\{x, x^2, \sin(x), \cos(x), \sin(2x), \cos(2x), \ldots\}$. Below we introduce two popular choices of approximating functions, namely polynomial splines and Hermite polynomials.

Spline methods are very useful for nonparametric modelling. They are based on global approximation and are useful extensions of polynomial regression techniques. Let $t_1 < \cdots < t_J$ be a sequence of given knots. The knots can be chosen either by the researcher or by the data. A spline function of order $r$ is an $(r-2)$-times continuously differentiable function whose restriction to each of the intervals $(-\infty, t_1], [t_1, t_2), \ldots, [t_J, \infty)$ is a polynomial of degree $r-1$. The following formal definition is adapted from Eubank (1999, p. 281).

Definition 7.1 A polynomial spline function $s(x)$ of order $r$ with knots $t_1 < \cdots < t_J$ is a function of the form

$$s(x) = \sum_{k=1}^{r+J} \theta_k s_k(x) \qquad (7.13)$$

for some set of coefficients $\theta_1, \ldots, \theta_{r+J}$, where

$$s_k(x) = x^{k-1}, \quad k = 1, \ldots, r, \qquad s_{r+k}(x) = (x - t_k)_+^{r-1}, \quad k = 1, \ldots, J, \qquad (7.14)$$

and $(x - t_k)_+^{r-1} = \max\{(x - t_k)^{r-1}, 0\}$.

The above definition is equivalent to saying that (i) $s$ is a piecewise polynomial of degree $r-1$ on any subinterval $[t_k, t_{k+1})$; (ii) $s$ has $r-2$ continuous derivatives; and (iii) $s$ has a discontinuous $(r-1)$st derivative with jumps at the knots. Thus a spline is a piecewise polynomial whose polynomial segments are joined together at the knots in a fashion that ensures these continuity properties. Let $S(r; t_1, \ldots, t_J)$ denote the set of all functions of the form (7.13). Then $S(r; t_1, \ldots, t_J)$ is a vector space, in the sense that sums of functions in $S(r; t_1, \ldots, t_J)$ remain in the set, etc. Since the functions $1, x, \ldots, x^{r-1}, (x - t_1)_+^{r-1}, \ldots, (x - t_J)_+^{r-1}$ are linearly independent, it follows that $S(r; t_1, \ldots, t_J)$ has dimension $r + J$.
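The truncated power basis of Definition 7.1 is straightforward to construct. The following is a minimal sketch in NumPy; the function name spline_basis and the quantile-based knot choice are our own conventions, not from the text.

```python
# Truncated power basis of (7.13)-(7.14): columns 1, x, ..., x^(r-1),
# followed by (x - t_k)_+^(r-1) for each knot t_k. `order` is the spline
# order r, so each polynomial piece has degree r - 1.
import numpy as np

def spline_basis(x, knots, order=4):
    x = np.asarray(x, dtype=float)
    powers = np.column_stack([x**k for k in range(order)])
    truncated = np.column_stack([np.maximum(x - t, 0.0)**(order - 1) for t in knots])
    return np.hstack([powers, truncated])      # n x (r + J) design matrix

# Example: order r = 2 gives the continuous piecewise linear splines of
# Example 7.2 below, with knots at empirical quantiles.
x = np.random.uniform(-1, 1, 200)
knots = np.quantile(x, [0.25, 0.50, 0.75])
B = spline_basis(x, knots, order=2)            # shape (200, 2 + 3)
```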

Example 7.2 (Polynomial splines) Returning to our case, we can approximate $m_j$ ($j = 1, \ldots, d$) by a polynomial spline of order $r$ with knots $\{t_{j1}, \ldots, t_{jJ}\}$:

$$m_j(x_j) \simeq \sum_{k=0}^{r-1} \alpha_{jk} x_j^k + \sum_{k=1}^{J} \beta_{jk}(x_j - t_{jk})_+^{r-1} \equiv g_j(x_j, \theta_j). \qquad (7.15)$$

In applications, for any given number of knots $J$, the knots $\{t_{jk}\}$ can simply be chosen as empirical quantiles of $\{X_{ij}\}_{i=1}^n$, i.e., $t_{jk}$ = the $k/(J+1)$th quantile of $\{X_{ij}\}_{i=1}^n$ for $k = 1, \ldots, J$. When the knots are fine enough on the support of $X_j$, which is usually assumed to be compact, the resulting spline function $g_j(\cdot, \theta_j)$ can approximate the smooth function $m_j$ quite well. Two popular choices of $r$ are $r = 2$ and $r = 4$. When $r = 2$, $g_j(x_j, \theta_j)$ is simply the piecewise linear function

$$g_j(x_j, \theta_j) = \alpha_{j0} + \alpha_{j1} x_j + \sum_{k=1}^{J} \beta_{jk}(x_j - t_{jk})_+. \qquad (7.16)$$

One can easily verify that $g_j(\cdot, \theta_j)$ is piecewise linear and continuous, with kinks at the knots $t_{j1}, \ldots, t_{jJ}$.

Example 7.3 (Hermite polynomials of order $K$) We can also approximate the unknown function $m_j$ ($j = 1, \ldots, d$) by Hermite polynomials of order $K$:

$$m_j(x_j) \simeq \sum_{k=0}^{K} \beta_{jk}\Big(\frac{x_j - \mu_j}{\sigma_j}\Big)^k \exp\Big(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\Big) \equiv g_j(x_j, \beta_j), \qquad (7.17)$$

where $\mu_j$ and $\sigma_j^2$ can be chosen as the sample mean and sample variance of the data $\{X_{ij}\}_{i=1}^n$. Hermite polynomials are often chosen when the underlying variables have infinite support.

After the approximation, we can estimate the unknown parameters by the least squares method. That is, we choose $\beta_1, \ldots, \beta_d$ to minimize the criterion function

$$\frac{1}{n}\sum_{i=1}^n\{Y_i - g_1(X_{i1}, \beta_1) - \cdots - g_d(X_{id}, \beta_d)\}^2. \qquad (7.18)$$

Let the solution be $\hat\beta_1, \ldots, \hat\beta_d$. Then the estimated functions are simply

$$\hat m_j(x_j) = g_j(x_j, \hat\beta_j), \quad j = 1, \ldots, d. \qquad (7.19)$$

The above least squares problem can be solved directly, resulting in a large parametric problem requiring the inversion of a matrix of high order (see the sketch below). Alternatively, the optimization problem can be solved using the backfitting algorithm. Conditional expectations provide a simple intuitive motivation for backfitting. If the additive model (7.5) is correct with $c_0 = 0$ (otherwise replace $Y$ by $Y$ minus its sample mean), then for any $j = 1, \ldots, d$,

$$E\Big[Y - \sum_{k \ne j} m_k(X_k)\,\Big|\, X_j\Big] = m_j(X_j). \qquad (7.20)$$
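Before turning to backfitting, here is a hedged sketch of the direct least squares fit (7.18)-(7.19), reusing spline_basis() from the sketch above; the simulated design and the two component functions are illustrative assumptions.

```python
# Direct least squares fit of the additive model via stacked spline bases.
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.uniform(-1, 1, size=(n, 2))
Y = np.sin(np.pi * X[:, 0]) + (X[:, 1]**2 - 1/3) + 0.3 * rng.standard_normal(n)

knots = [np.quantile(X[:, j], [0.2, 0.4, 0.6, 0.8]) for j in range(2)]
# Drop each basis's constant column; the intercept is handled by centering Y,
# which also keeps the stacked design of (7.18) from being rank deficient.
bases = [spline_basis(X[:, j], knots[j], order=4)[:, 1:] for j in range(2)]
B = np.hstack(bases)
beta, *_ = np.linalg.lstsq(B, Y - Y.mean(), rcond=None)

cols = np.cumsum([0] + [b.shape[1] for b in bases])
def m_hat(j, x_new):
    """Estimated j-th component (7.19), centered in the spirit of (7.6)."""
    Bj = spline_basis(x_new, knots[j], order=4)[:, 1:]
    fit = Bj @ beta[cols[j]:cols[j + 1]]
    return fit - fit.mean()
```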

Equation (7.20) immediately suggests an iterative algorithm for computing $\hat m_1, \ldots, \hat m_d$:

Step 1. Given initial values of $\hat\beta_2, \ldots, \hat\beta_d$ (say, from the direct least squares solution), minimize (7.18) with respect to $\beta_1$. This is a much smaller parametric problem and can be solved relatively easily.

Step 2. With the estimated value $\hat\beta_1$ and the values $\hat\beta_3, \ldots, \hat\beta_d$, minimize (7.18) with respect to $\beta_2$. This yields an updated estimate $\hat\beta_2$. Repeat this exercise until $\hat\beta_d$ is updated.

Step 3. Repeat Steps 1-2 until a convergence criterion is met.

This is the basic idea of the backfitting algorithm (Ezekiel, 1924; Buja et al., 1989). Let $\hat\varepsilon_i^{(j)} = Y_i - \sum_{k \ne j} g_k(X_{ik}, \hat\beta_k)$ be the partial residuals computed without using the regressor $X_j$. Then the backfitting algorithm finds $\hat\beta_j$ by minimizing

$$\frac{1}{n}\sum_{i=1}^n\{\hat\varepsilon_i^{(j)} - g_j(X_{ij}, \beta_j)\}^2. \qquad (7.21)$$

This is a nonparametric regression of $\hat\varepsilon_i^{(j)}$ on the variable $X_j$. The resulting estimate is linear in the partial residuals $\{\hat\varepsilon_i^{(j)}\}$ and can be written as

$$\begin{pmatrix}\hat m_j(X_{1j})\\ \hat m_j(X_{2j})\\ \vdots\\ \hat m_j(X_{nj})\end{pmatrix} = S_j\begin{pmatrix}\hat\varepsilon_1^{(j)}\\ \hat\varepsilon_2^{(j)}\\ \vdots\\ \hat\varepsilon_n^{(j)}\end{pmatrix}, \qquad (7.22)$$

where $S_j$ is an $n \times n$ smoothing matrix. For ease of presentation, denote the left-hand side of (7.22) by $\hat{\mathbf g}_j$ and write $\mathbf Y = (Y_1, \ldots, Y_n)'$. Then (7.22) can be written as

$$\hat{\mathbf g}_j = S_j\Big(\mathbf Y - \sum_{k \ne j}\hat{\mathbf g}_k\Big). \qquad (7.23)$$

The above example uses polynomial splines or Hermite polynomials as the nonparametric smoother, but the idea applies to any nonparametric smoother. Let $S_j$ be the smoothing matrix obtained by regressing the partial residuals $\{\hat\varepsilon_i^{(j)}\}$ nonparametrically on $\{X_{ij}\}$. The general backfitting algorithm can be outlined as follows; a code sketch appears after the steps.

Step 1. Initialize the functions $\hat{\mathbf g}_1, \ldots, \hat{\mathbf g}_d$.

Step 2. For $j = 1, \ldots, d$, compute $\hat{\mathbf g}_j = S_j(\mathbf Y - \sum_{k \ne j}\hat{\mathbf g}_k)$ and center the estimator to obtain

$$\hat g_j(X_{ij}) = \hat g_j(X_{ij}) - \frac{1}{n}\sum_{l=1}^n \hat g_j(X_{lj}), \qquad (7.24)$$

where $\hat g_j(X_{ij})$ denotes the $i$th element of $\hat{\mathbf g}_j$.

Step 3. Repeat Step 2 until convergence.

See Hastie and Tibshirani (1990, p. 91) for a discussion of the above algorithm. The re-centering in Step 2 enforces the constraint in (7.6).
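The following minimal sketch implements the general backfitting loop with a Nadaraya-Watson smoother standing in for $S_j$; any linear smoother could be substituted, and the Gaussian kernel and fixed bandwidth are illustrative assumptions.

```python
# General backfitting (Steps 1-3) with an NW smoother playing the role of S_j.
import numpy as np

def nw_smooth(x, r, h):
    """Row-normalized Gaussian-kernel smoother matrix applied to r."""
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h)**2)
    return (W / W.sum(axis=1, keepdims=True)) @ r

def backfit(X, Y, h=0.2, max_iter=100, tol=1e-6):
    n, d = X.shape
    g = np.zeros((n, d))                     # Step 1: initialize g_1, ..., g_d
    Yc = Y - Y.mean()                        # absorb c_0 by centering Y
    for _ in range(max_iter):
        g_old = g.copy()
        for j in range(d):                   # Step 2: cycle over components
            partial = Yc - g.sum(axis=1) + g[:, j]   # partial residuals
            g[:, j] = nw_smooth(X[:, j], partial, h)
            g[:, j] -= g[:, j].mean()        # re-center as in (7.24)
        if np.max(np.abs(g - g_old)) < tol:  # Step 3: stop at convergence
            break
    return g                                 # g[:, j] estimates m_j at X[:, j]
```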

The convergence of the backfitting algorithm is a delicate issue and has been addressed via the concept of concurvity by Buja et al. (1989). Concurvity is the analogue of collinearity in linear regression models. Assuming that concurvity is not present, Buja et al. (1989) show that the backfitting algorithm converges and solves the system of equations

$$\begin{pmatrix} I & S_1 & \cdots & S_1\\ S_2 & I & \cdots & S_2\\ \vdots & & \ddots & \vdots\\ S_d & S_d & \cdots & I\end{pmatrix}\begin{pmatrix}\hat{\mathbf g}_1\\ \hat{\mathbf g}_2\\ \vdots\\ \hat{\mathbf g}_d\end{pmatrix} = \begin{pmatrix}S_1\mathbf Y\\ S_2\mathbf Y\\ \vdots\\ S_d\mathbf Y\end{pmatrix}. \qquad (7.25)$$

Solving (7.25) directly involves inverting an $nd \times nd$ matrix and can hardly be implemented on an average computer for moderate to large sample sizes. In contrast, backfitting does not share this drawback and is frequently used in practical implementations. See Mammen, Linton, and Nielsen (1999, Annals of Statistics) for the existence and asymptotic properties of a backfitting projection algorithm under weak conditions.

7.1.3 Generalized Additive Models: Logistic Regression

As Hastie and Tibshirani (1990) remark, the linear model is used for regression in a wide variety of contexts other than ordinary regression, including log-linear models, logistic regression, the proportional-hazards model for survival data, models for ordinal categorical responses, and transformation models. It is a convenient but crude first-order approximation to the regression function, and in many cases it is adequate. The additive model can be used to generalize all these models in an obvious way. For clarity, we focus on the logistic regression model.

In this setting the response variable $Y$ is dichotomous, such as yes/no, survived/died, or increase/decrease, and the data analysis is aimed at relating this outcome to the predictors. One quantity of interest is the proportion of outcomes as a function of the predictors (explanatory variables). In linear modelling of binary data, the most popular approach is logistic regression, which models the logit of the response probability in a linear form:

$$\mathrm{logit}\{p(x)\} \equiv \log\Big\{\frac{p(x)}{1 - p(x)}\Big\} = x'\beta, \qquad (7.26)$$

where $p(x) = P(Y = 1 \mid X = x)$. Alternatively, we can write

$$p(x) = \frac{\exp(x'\beta)}{1 + \exp(x'\beta)}. \qquad (7.27)$$

There are several reasons for its popularity, but the most compelling is that the logit model ensures that the proportions $p(x)$ lie in $(0, 1)$ (see (7.27)) without any constraints on the linear predictor $x'\beta$. We can generalize the model in (7.26) by replacing the linear predictor with an additive one,

$$\log\Big\{\frac{p(x)}{1 - p(x)}\Big\} = c_0 + \sum_{j=1}^d m_j(x_j), \qquad (7.28)$$

or that in (7.27) by

$$p(x) = G\Big(c_0 + \sum_{j=1}^d m_j(x_j)\Big), \qquad (7.29)$$

where $G(v) = \exp(v)/(1 + \exp(v))$ denotes the CDF of the standard logistic distribution. Thus (7.29) is a special case of the generalized additive model in (7.4) in which $G$ is a strictly increasing function with a known functional form.

Insight from the Linear Logistic Regression Model

To estimate the model (7.28), we can gain some insight from linear logistic regression methodology. Maximum likelihood is the most popular method for estimating the linear logistic model. For the present problem the log-likelihood has the form

$$\ell(\beta) = \sum_{i=1}^n\{Y_i\log p(X_i) + (1 - Y_i)\log(1 - p(X_i))\}, \qquad (7.30)$$

where $p(X_i) = \exp(X_i'\beta)/(1 + \exp(X_i'\beta))$. The score equations

$$\frac{\partial\ell(\beta)}{\partial\beta} = \sum_{i=1}^n X_i[Y_i - p(X_i)] = 0 \qquad (7.31)$$

are nonlinear in the parameters, so the solution must be found iteratively. The Newton-Raphson iterative method can be expressed in an appealing form. Given the current estimate $\hat\beta$, we estimate the probabilities $p(X_i)$ by $\hat p_i = \exp(X_i'\hat\beta)/(1 + \exp(X_i'\hat\beta))$ and form the linearized response

$$Z_i = X_i'\hat\beta + (Y_i - \hat p_i)/\{\hat p_i(1 - \hat p_i)\}, \qquad (7.32)$$

where $Z_i$ represents the first-order Taylor series approximation to $\mathrm{logit}(Y_i)$ about the current estimate $\hat p_i$. (Pretending that $Y_i$ is bounded away from 0 and 1 and is close to $\hat p_i$, we have, by a first-order Taylor expansion, $\mathrm{logit}(Y_i) = \log Y_i - \log(1 - Y_i) \approx X_i'\hat\beta + (Y_i - \hat p_i)/\{\hat p_i(1 - \hat p_i)\}$.) Denote $\varepsilon_i = (Y_i - \hat p_i)/\{\hat p_i(1 - \hat p_i)\}$. If $\hat\beta$, and hence $\hat p_i$, are treated as fixed, the variance of $Z_i$ is $1/\{\hat p_i(1 - \hat p_i)\}$, and hence we choose the weights $w_i = \hat p_i(1 - \hat p_i)$. Alternatively, we can verify that $E(\varepsilon_i \mid X_i) = 0$ and

$$E(\varepsilon_i^2 \mid X_i) = \frac{1}{p(X_i)(1 - p(X_i))} \qquad (7.33)$$

in the extreme case where $\hat p_i = p(X_i)$. So when $\hat p_i$ approximates $p(X_i)$, we expect that $E(\varepsilon_i \mid X_i) \approx 0$ and

$$E(\varepsilon_i^2 \mid X_i) \approx \frac{1}{p(X_i)(1 - p(X_i))} \approx \frac{1}{\hat p_i(1 - \hat p_i)}. \qquad (7.34)$$

Consequently, a new $\hat\beta$ can be obtained by weighted linear regression of $Z_i$ on $X_i$ with weights $w_i = \hat p_i(1 - \hat p_i)$. This is repeated until $\hat\beta$ converges.

The Local Scoring Algorithm

The above iterative algorithm lends itself ideally to the generalized additive model in (7.28). Define

$$Z_i = \tilde c_0 + \sum_{j=1}^d \tilde m_j(X_{ij}) + (Y_i - \hat p_i)/\{\hat p_i(1 - \hat p_i)\}, \qquad (7.35)$$

where $(\tilde c_0, \tilde m_1, \ldots, \tilde m_d)$ are the current estimates of the additive model components and

$$\hat p_i = \frac{\exp\big(\tilde c_0 + \sum_{j=1}^d \tilde m_j(X_{ij})\big)}{1 + \exp\big(\tilde c_0 + \sum_{j=1}^d \tilde m_j(X_{ij})\big)}. \qquad (7.36)$$

Define the weights

$$w_i = \hat p_i(1 - \hat p_i). \qquad (7.37)$$

The new estimates of $c_0$ and $m_j$ ($j = 1, \ldots, d$) are computed by fitting a weighted additive model to $Z_i$. Of course, this additive model fitting procedure is itself iterative. Fortunately, the functions from the previous step are good starting values for the next step. This procedure is called the local scoring algorithm in the literature. The new estimates from each local scoring step are monitored, and the iterations are stopped when their relative change is negligible. A code sketch of the local scoring loop is given below.
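The sketch below reuses backfit() from the previous sketch. For simplicity the inner additive fit ignores the weights (7.37), i.e., it is an unweighted approximation to the weighted step described above; a weighted smoother would implement (7.35)-(7.37) exactly.

```python
# Local scoring for the additive logistic model (7.28): outer IRLS loop on
# the linearized response, inner (here unweighted) backfitting of Z on X.
import numpy as np

def local_scoring(X, Y, h=0.2, outer_iter=25):
    n, d = X.shape
    c0, g = 0.0, np.zeros((n, d))
    for _ in range(outer_iter):
        eta = np.clip(c0 + g.sum(axis=1), -10, 10)   # guard against p = 0 or 1
        p = 1.0 / (1.0 + np.exp(-eta))               # (7.36)
        Z = eta + (Y - p) / (p * (1 - p))            # linearized response (7.35)
        c0 = Z.mean()                                # new intercept estimate
        g = backfit(X, Z, h=h)                       # new additive components
    return c0, g
```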

7.2 The Marginal Integration Method

7.2.1 The Marginal Integration Estimator

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a random sample from the additive model

$$Y_i = c_0 + \sum_{j=1}^d m_j(X_{ij}) + \varepsilon_i, \qquad (7.38)$$

where $E(\varepsilon_i \mid X_i) = 0$, $\mathrm{Var}(\varepsilon_i \mid X_i = x) = \sigma^2(x)$, and $\{m_j(\cdot)\}$ is a set of unknown functions satisfying $E[m_j(X_{ij})] = 0$. We follow Chen et al. (1995) in defining the marginal integration estimator.

Let $f_j(x_j)$ be the marginal density of $X_{ij}$, and let $m(x_1, \ldots, x_d) = c_0 + \sum_{j=1}^d m_j(x_j)$ be the conditional mean function. For any $j \in \{1, \ldots, d\}$, define $X_{i,\underline j} = (X_{i1}, \ldots, X_{i,j-1}, X_{i,j+1}, \ldots, X_{id})'$, and denote the joint density of $X_{i,\underline j}$ by $f_{\underline j}$. Then, for fixed $x = (x_1, \ldots, x_d)'$, the functional

$$\psi_j(x_j) \equiv \int m(x_1, \ldots, x_d)\, f_{\underline j}(x_{\underline j})\prod_{k \ne j} dx_k \qquad (7.39)$$

equals $c_0 + m_j(x_j)$, because each of the remaining components integrates to zero under the identification condition. Let $K(\cdot)$ and $L(\cdot)$ be kernel functions with compact support, and let

$$K_h(u) = h^{-1}K(u/h) \quad\text{and}\quad L_g(v) = g^{-(d-1)}L(v/g). \qquad (7.40)$$

Using the Nadaraya-Watson (NW) kernel method to estimate the mean function $m(\cdot)$ and then averaging over the observations $X_{i,\underline j}$, we obtain the following estimator. For $1 \le j \le d$ and any $x_j$ in the domain of $m_j(\cdot)$, define

$$\hat\psi_j(x_j) = \frac{1}{n}\sum_{i=1}^n \tilde m(x_j, X_{i,\underline j}) = \frac{1}{n}\sum_{i=1}^n\Bigg[\frac{\sum_{l=1}^n K_h(X_{lj} - x_j)\,L_g(X_{l,\underline j} - X_{i,\underline j})\,Y_l}{\sum_{l=1}^n K_h(X_{lj} - x_j)\,L_g(X_{l,\underline j} - X_{i,\underline j})}\Bigg]. \qquad (7.41)$$

If the regressors were independent, we might use $\sum_{l=1}^n K_h(X_{lj} - x_j)Y_l\big/\sum_{l=1}^n K_h(X_{lj} - x_j)$ to estimate $\psi_j(x_j)$; this is a one-dimensional NW estimator. Nevertheless, this estimator has a larger variance than the estimator above even in this restricted situation; see Härdle and Tsybakov (1995).

Let $f(\cdot)$ denote the joint density of $X_i = (X_{i1}, \ldots, X_{id})'$. The following assumptions are modified from Chen et al. (1995).

Assumptions

A1. $\{(X_i, Y_i)\}$ is IID.

A2. The densities $f$ and $f_{\underline j}$ are bounded, Lipschitz continuous, and bounded away from zero. The function $m$ has Lipschitz continuous derivatives.

A3. The conditional variance function $\sigma^2(x) = E(\varepsilon_i^2 \mid X_i = x)$ is Lipschitz continuous.

A4. The kernel function $K(\cdot)$ is a bounded, nonnegative, second-order kernel that is compactly supported and Lipschitz continuous. Write $\kappa_{02} = \int K^2(u)\,du$ and $\kappa_{21} = \int u^2K(u)\,du$.

A5. The kernel function $L$ is bounded, compactly supported, and Lipschitz continuous. $L$ is a $q$th-order kernel with $\|L\|_2^2 = \int L^2(u)\,du < \infty$.

A6. As $n \to \infty$, $h \propto n^{-1/5}$, $g \to 0$, and $ng^{d-1}/\log n \to \infty$.

Theorem 7.2 Suppose that Assumptions A1-A6 hold and the order $q$ of $L$ satisfies $q \ge (d-1)/2$. Then

$$n^{2/5}\{\hat\psi_j(x_j) - m_j(x_j) - c_0\} \to N\big(B_j(x_j), V_j(x_j)\big), \qquad (7.42)$$

where

$$B_j(x_j) = \lim_{n\to\infty} n^{2/5}h^2\,\frac{\kappa_{21}}{2}\int\Big(\frac{\partial^2 m(x)}{\partial x_j^2} + 2\,\frac{\partial m(x)}{\partial x_j}\,\frac{\partial\log f(x)}{\partial x_j}\Big)f_{\underline j}(x_{\underline j})\,dx_{\underline j},$$

$$V_j(x_j) = \lim_{n\to\infty}\frac{\kappa_{02}}{n^{1/5}h}\int\frac{\sigma^2(x)\,f_{\underline j}(x_{\underline j})^2}{f(x)}\,dx_{\underline j}.$$

Proof. See Chen et al. (1995).

Theorem 7.2 says that the rate of convergence to the asymptotic normal limit distribution does not suffer from the curse of dimensionality. Nevertheless, to achieve this rate of convergence, we must impose some restrictions on the bandwidth sequences and choices of kernels. Note that the above bandwidth conditions do not exclude the optimal one-dimensional smoothing rate $n^{-1/5}$ for both $h$ and $g$ when $d \le 4$; more importantly, we can then take $g \propto n^{-1/5}$. When $d \ge 5$ we can no longer use $g$ at the rate $n^{-1/5}$, and we need a higher-order kernel $L$ to reduce the bias associated with the use of $g$.

Estimation of the regression surface

Define

$$\hat m(x) = \sum_{j=1}^d \hat\psi_j(x_j) - (d-1)\hat c_0, \qquad (7.43)$$

where $\hat c_0 = \bar Y = n^{-1}\sum_{i=1}^n Y_i$. The following theorem gives the asymptotic distribution of $\hat m(x)$.

Theorem 7.3 Under the conditions of Theorem 7.2,

$$n^{2/5}\{\hat m(x) - m(x)\} \to N\big(B(x), V(x)\big), \qquad (7.44)$$

where $B(x) = \sum_{j=1}^d B_j(x_j)$ and $V(x) = \sum_{j=1}^d V_j(x_j)$.

Proof. See Chen et al. (1995).

Theorem 7.3 says that the covariance between $\hat\psi_j(x_j)$ and $\hat\psi_k(x_k)$ for $j \ne k$ is asymptotically negligible, i.e., it is of smaller order than the variances of the component function estimators. A code sketch of (7.41) and (7.43) follows.
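The sketch assumes Gaussian product kernels and a single bandwidth h playing the roles of both h and g; the compact-support and higher-order kernel conditions of A4-A6 are ignored in this illustration.

```python
# Marginal integration: psi_hat implements (7.41), m_hat_mi implements (7.43).
import numpy as np

def nw_full(point, X, Y, h):
    """d-dimensional NW estimate of m at `point` with a Gaussian product kernel."""
    w = np.exp(-0.5 * np.sum(((X - point) / h)**2, axis=1))
    return np.sum(w * Y) / np.sum(w)

def psi_hat(xj, j, X, Y, h):
    """Average the full-dimensional NW fit over the sample values X_{i,-j}."""
    vals = []
    for i in range(len(Y)):
        point = X[i].copy()
        point[j] = xj
        vals.append(nw_full(point, X, Y, h))
    return np.mean(vals)                     # estimates c_0 + m_j(x_j)

def m_hat_mi(x, X, Y, h):
    """Additive surface estimate: sum of psi_hat's minus (d - 1) * Ybar."""
    d = X.shape[1]
    return sum(psi_hat(x[j], j, X, Y, h) for j in range(d)) - (d - 1) * Y.mean()
```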

7.2.2 Marginal Integration Estimation of Additive Models with Known Links

For many situations, especially binary and survival time data, the model (7.38) may not be appropriate. In the parametric case, a more appropriate modelling framework is provided by the generalized linear models of McCullagh and Nelder (1989). Hastie and Tibshirani (1990) extend these ideas to nonparametric modelling. In the nonparametric case, the model can be fully or partially specified. In the fully specified case, the conditional distribution of $Y$ given $X$ is assumed to belong to an exponential family with known link function $G$ and mean function $m$ such that

$$G\{m(x)\} = c_0 + \sum_{j=1}^d m_j(x_j), \qquad (7.45)$$

where $E[m_j(X_{ij})] = 0$. This model is usually called a generalized additive model in the literature. It implies, for example, that the variance is functionally related to the mean. In the partially specified case, we keep the form (7.45) but do not restrict ourselves to the exponential family; in this latter case, the variance function is unrestricted.

Example 7.4 Clearly, when $G$ is the identity function we have the additive regression model examined above. Other examples include the logit and probit link functions for binary data, the logarithm transform for Poisson count data (McCullagh and Nelder, 1989, p. 30), and the Box-Cox transformation (e.g., $G(v) = (v^\lambda - 1)/\lambda$). It also includes cases where the regression function is multiplicative.

The backfitting procedure in conjunction with Fisher scoring is widely used to estimate (7.45) (Hastie and Tibshirani, 1990, p. 141). It exploits the likelihood structure. Nevertheless, it is even less tractable from a statistical perspective when $G$ is not the identity, because the estimate is not linear in $Y$.

Estimation of the Additive Components

Linton and Härdle (1996) propose a marginal-integration-based method of estimating the components in (7.45). The main advantage of their method is that one can derive its asymptotic properties. They also suggest how to take into account the additional information provided by the exponential family structure. Notice that under the additive structure (7.45),

$$\int G\{m(x_j, x_{\underline j})\}\,f_{\underline j}(x_{\underline j})\,dx_{\underline j} = c_0 + m_j(x_j). \qquad (7.46)$$

The general strategy is to replace both $m$ and $f_{\underline j}$ in (7.46) by their estimates. We estimate $m$ by the NW estimator

$$\tilde m(x) = \sum_{i=1}^n w_i(x)\,Y_i, \qquad (7.47)$$

with kernel weights

$$w_i(x) = \frac{K_h(X_i - x)}{\sum_{l=1}^n K_h(X_l - x)}, \qquad (7.48)$$

and $f_{\underline j}(\cdot)$ is estimated by its sample analogue (the empirical distribution of $X_{i,\underline j}$), which yields

$$\hat\psi_j(x_j) = \frac{1}{n}\sum_{i=1}^n G\{\tilde m(x_j, X_{i,\underline j})\}. \qquad (7.49)$$

When $G$ is the identity function, $\hat\psi_j(x_j)$ is linear in $Y_1, \ldots, Y_n$ (cf. equation (7.41)). In general, $\hat\psi_j(x_j)$ is a nonlinear function of $Y_1, \ldots, Y_n$; a code sketch follows.
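For a known link, the estimator (7.49) only changes the marginal integration sketch above by applying $G$ to the NW pilot before averaging. A hedged sketch with a logit link (an illustrative assumption) follows; nw_full() is the helper defined earlier.

```python
# Known-link marginal integration (7.49): apply G to the pilot fit, then average.
import numpy as np

def G_logit(m):
    m = np.clip(m, 1e-6, 1 - 1e-6)           # guard: the logit needs m in (0, 1)
    return np.log(m / (1 - m))

def psi_hat_link(xj, j, X, Y, h, G=G_logit):
    """Estimates c_0 + m_j(x_j) in the generalized additive model (7.45)."""
    vals = []
    for i in range(len(Y)):
        point = X[i].copy()
        point[j] = xj
        vals.append(G(nw_full(point, X, Y, h)))
    return np.mean(vals)
```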

The above procedure is carried out for each $j = 1, \ldots, d$, so we obtain estimates of each $m_j(\cdot)$ evaluated at each sample point. Let

$$\hat c_0 = \frac{1}{nd}\sum_{j=1}^d\sum_{i=1}^n \hat\psi_j(X_{ij}) \quad\text{and}\quad \hat m_j(x_j) = \hat\psi_j(x_j) - \hat c_0.$$

We re-estimate $m(x)$ by

$$\hat m(x) = G^{-1}\Big(\hat c_0 + \sum_{j=1}^d \hat m_j(x_j)\Big),$$

where $G^{-1}$ is the inverse function of $G$. Let $\hat\varepsilon_i = Y_i - \hat m(X_i)$ be the additive regression residuals, which estimate the errors $\varepsilon_i = Y_i - m(X_i)$. These residuals can be used to test the additive structure, i.e., to look for possible interactions ignored in the simple additive structure: when (7.45) is true, $\hat\varepsilon_i$ should be approximately uncorrelated with any function of $X_i$.

We modify Assumption A6 as follows.

A6*. As $n \to \infty$, $h \propto n^{-1/5}$, $g \to 0$, and $n^{2/5}g^{d-1} \to \infty$.

Theorem 7.4 Suppose that Assumptions A1-A5 and A6* hold and the order $q$ of $L$ satisfies $q \ge d - 1$. Assume that $G$ is twice continuously differentiable. Then

$$n^{2/5}\{\hat\psi_j(x_j) - m_j(x_j) - c_0\} \to N\big(B_j^G(x_j), V_j^G(x_j)\big), \qquad (7.50)$$

where

$$B_j^G(x_j) = \lim_{n\to\infty} n^{2/5}h^2\,\frac{\kappa_{21}}{2}\int\Big\{\frac{\partial^2 m(x)}{\partial x_j^2} + 2\,\frac{\partial m(x)}{\partial x_j}\,\frac{\partial\ln f(x)}{\partial x_j}\Big\}\,G'(m(x))\,f_{\underline j}(x_{\underline j})\,dx_{\underline j},$$

$$V_j^G(x_j) = \lim_{n\to\infty}\frac{\kappa_{02}}{n^{1/5}h}\int\big[G'(m(x))\big]^2\,\frac{\sigma^2(x)\,f_{\underline j}(x_{\underline j})^2}{f(x)}\,dx_{\underline j}.$$

Proof. See Linton and Härdle (1996).

As Linton and Härdle (1996) remark, we can use the local linear smoother as a pilot in place of the NW estimator. In this case, the asymptotic variance of $\hat\psi_j(x_j)$ is the same, but the bias takes the simpler form

$$B_j^G(x_j) = \lim_{n\to\infty} n^{2/5}h^2\,\frac{\kappa_{21}}{2}\int\frac{\partial^2 m(x)}{\partial x_j^2}\,G'(m(x))\,f_{\underline j}(x_{\underline j})\,dx_{\underline j}.$$

To construct an asymptotic confidence interval for $\hat\psi_j(x_j)$, we need to estimate $B_j^G(x_j)$ and $V_j^G(x_j)$. It is easy to show that $V_j^G(x_j)$ can be estimated consistently by a sample analogue that replaces $\varepsilon_i^2$, $G'(m(\cdot))$, and the densities with their estimates, where $\hat\varepsilon_i = Y_i - \hat m(X_i)$. The formula for the estimate of $B_j^G(x_j)$ is quite complicated and we thus omit it. Nevertheless, in the case of undersmoothing, i.e., $h = o(n^{-1/5})$, it suffices to estimate $V_j^G(x_j)$. Alternatively, we can use bootstrap methods (e.g., the wild bootstrap) to approximate the desired asymptotic confidence interval.

Estimation of the Regression Surface

After we obtain estimates of the additive components, we can obtain an estimate of the regression function. Note that we do not limit ourselves to the exponential family structure discussed earlier.

Theorem 7.5 Suppose the conditions of Theorem 7.4 hold and $G^{-1}$ is twice continuously differentiable. Then

$$n^{2/5}\{\hat m(x) - m(x)\} \to N\big(B^G(x), V^G(x)\big), \qquad (7.51)$$

where $B^G(x) = (G^{-1})'\big(c_0 + \sum_{j=1}^d m_j(x_j)\big)\sum_{j=1}^d B_j^G(x_j)$ and $V^G(x) = \big\{(G^{-1})'\big(c_0 + \sum_{j=1}^d m_j(x_j)\big)\big\}^2\sum_{j=1}^d V_j^G(x_j)$.

Proof. The result follows from Theorem 7.4, the delta method, and the fact that $\hat\psi_j(x_j)$ and $\hat\psi_k(x_k)$ are asymptotically independent for $j \ne k$.

Theorem 7.5 says that the rate of convergence of $\hat m(x)$ is free from the curse of dimensionality, as desired.

Remark. Yang, Sperlich, and Härdle (2003, Journal of Statistical Planning and Inference) study derivative estimation in generalized additive models via the kernel method. In addition, they study hypothesis tests on the derivatives.

7.2.3 Efficient Estimation of Generalized Additive Models

Linton (2000, Econometric Theory) defines new procedures for estimating generalized additive nonparametric regression models that are more efficient than those of Linton and Härdle (1996). He considers criterion functions based on the linear exponential family. When the linear exponential family specification is correct, the new estimator achieves certain oracle bounds. For brevity, we refer the reader to Linton (2000).

7.3 Additive Partially Linear Models

In this section we introduce two methods for estimating additive partially linear models: the series method of Li (2000, International Economic Review) and the kernel method of Fan and Li (2003, Statistica Sinica).

7.3.1 Series Estimation

A typical additive partially linear model is of the form

$$Y_i = Z_i'\gamma_0 + m_1(X_{i1}) + \cdots + m_d(X_{id}) + \varepsilon_i, \qquad (7.52)$$

where $E(\varepsilon_i \mid Z_i, X_i) = 0$, $Z_i$ is a $p \times 1$ vector of random variables that does not contain a constant term, $\gamma_0$ is a $p \times 1$ vector of unknown parameters, and the $m_j(\cdot)$ are unknown smooth functions. The arguments $X_{i1}, \ldots, X_{id}$ are non-overlapping subvectors of $X_i$, with $X_{ij}$ of dimension $d_j \ge 1$ and $\sum_{j=1}^d d_j = \dim(X_i)$. In practice, the most widely used case is $d_j = 1$ for all $j$, so that each $X_{ij}$ is a scalar random variable; we focus on this case below.

Clearly, the individual functions $m_j(\cdot)$ ($j = 1, \ldots, d$) are not identified without some identification conditions. In the literature on kernel estimation, it is convenient to impose $E[m_j(X_{ij})] = 0$ for all $j = 2, \ldots, d$. Nevertheless, such conditions are not easily imposed in series estimation. Instead, here it is more convenient to impose

$$m_j(0) = 0, \quad j = 2, \ldots, d. \qquad (7.53)$$

To construct a series estimator of the unknown parameters in the model, we use Li's (2000) definition of the class of additive functions.

Definition. A function $g(x)$ is said to belong to an additive class of functions $\mathcal G$ ($g \in \mathcal G$) if (i) $g(x) = \sum_{j=1}^d g_j(x_j)$, where each $g_j(\cdot)$ is continuous on its support $\mathcal Z_j$, which is a compact subset of $\mathbb R$ ($j = 1, \ldots, d$); (ii) $\sum_{j=1}^d E[g_j(X_{ij})^2] < \infty$; and (iii) $g_j(0) = 0$ for $j = 2, \ldots, d$. When $g(x)$ is a vector-valued function, we say that $g \in \mathcal G$ if each of its components belongs to $\mathcal G$.

In vector notation, we can write (7.52) as

$$\mathbf Y = \mathbf Z\gamma_0 + \mathbf m + \boldsymbol\varepsilon, \qquad (7.54)$$

where $\mathbf Y = (Y_1, \ldots, Y_n)'$, $\mathbf Z = (Z_1, \ldots, Z_n)'$, $\mathbf m = (m(X_1), \ldots, m(X_n))'$ with $m(X_i) = \sum_{j=1}^d m_j(X_{ij})$, and $\boldsymbol\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$.

For $j = 1, \ldots, d$, we use a linear combination of $k_j$ functions, $p^{k_j}(x_j) = [p_{j1}(x_j), \ldots, p_{jk_j}(x_j)]'$, to approximate $m_j(x_j)$. Let

$$p^K(x) = [p^{k_1}(x_1)', \ldots, p^{k_d}(x_d)']',$$

where $K = \sum_{j=1}^d k_j$. A linear combination of $p^K(x)$, say $p^K(x)'\beta$, forms an approximating function for $m(x) = \sum_{j=1}^d m_j(x_j)$. The approximating functions have the following properties: (i) $p^K(x)'\beta \in \mathcal G$; and (ii) as $k_j$ grows for all $j = 1, \ldots, d$, there is a linear combination of $p^K(x)$ that can approximate any $g \in \mathcal G$ arbitrarily well in the mean squared error sense.

Define

$$P = \big(p^K(X_1), \ldots, p^K(X_n)\big)', \qquad (7.55)$$

which is of dimension $n \times K$. Define $M = P(P'P)^-P'$, where $(P'P)^-$ is a symmetric generalized inverse of $P'P$. For a matrix $A$ with $n$ rows, let $\tilde A = A - MA$. If we premultiply both sides of (7.54) by $M$, we have

$$M\mathbf Y = M\mathbf Z\gamma_0 + M\mathbf m + M\boldsymbol\varepsilon. \qquad (7.56)$$

Subtracting (7.56) from (7.54) gives

$$\tilde{\mathbf Y} = \tilde{\mathbf Z}\gamma_0 + (\mathbf m - M\mathbf m) + \tilde{\boldsymbol\varepsilon}. \qquad (7.57)$$

So we can estimate $\gamma_0$ by regressing $\tilde{\mathbf Y}$ on $\tilde{\mathbf Z}$ to obtain

$$\hat\gamma = (\tilde{\mathbf Z}'\tilde{\mathbf Z})^{-1}\tilde{\mathbf Z}'\tilde{\mathbf Y}. \qquad (7.58)$$

After obtaining $\hat\gamma$, we can estimate $m(x) = \sum_{j=1}^d m_j(x_j)$ by $\hat m(x) = p^K(x)'\hat\beta$, where

$$\hat\beta = (P'P)^-P'(\mathbf Y - \mathbf Z\hat\gamma). \qquad (7.59)$$

The individual components $m_j(x_j)$ can also be estimated easily from $\hat\beta$. For example, $\hat m_1(x_1) = p^{k_1}(x_1)'\hat\beta_1$, where $\hat\beta_1$ collects the first $k_1$ elements of $\hat\beta$. A code sketch of this procedure follows.
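The sketch below implements (7.55)-(7.59) with a piecewise linear spline basis; the data-generating process, basis sizes, and knot placement are illustrative assumptions.

```python
# Series estimation of the additive partially linear model (7.52).
import numpy as np

rng = np.random.default_rng(2)
n = 800
X = rng.uniform(0, 1, size=(n, 2))
Z = (rng.uniform(size=(n, 1)) + X[:, :1]) / 2          # Z correlated with X_1
gamma0 = np.array([1.5])
Y = (Z @ gamma0 + np.sin(2 * np.pi * X[:, 0]) + X[:, 1]**3
     + 0.3 * rng.standard_normal(n))

def basis_1d(x, n_knots=5):
    """Linear spline basis [x, (x - t_k)_+]; every column vanishes at 0,
    matching the identification condition (7.53)."""
    t = np.quantile(x, np.linspace(0.1, 0.9, n_knots))
    return np.column_stack([x] + [np.maximum(x - tk, 0.0) for tk in t])

# A global constant column stands in for the free intercept absorbed by m_1.
P = np.hstack([np.ones((n, 1))] + [basis_1d(X[:, j]) for j in range(2)])  # (7.55)
M = P @ np.linalg.pinv(P.T @ P) @ P.T                  # projection onto series space
Zt, Yt = Z - M @ Z, Y - M @ Y                          # tilde variables of (7.57)
gamma_hat = np.linalg.solve(Zt.T @ Zt, Zt.T @ Yt)      # (7.58)
beta_hat = np.linalg.pinv(P.T @ P) @ P.T @ (Y - Z @ gamma_hat)   # (7.59)
m_hat_fitted = P @ beta_hat                            # fitted additive part
```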

Define $\theta(x) \equiv E(Z_i \mid X_i = x)$ and $\sigma^2(z, x) \equiv \mathrm{Var}(Y_i \mid Z_i = z, X_i = x)$. We use $h(x)$ to denote the projection of $\theta(x)$ onto $\mathcal G$; that is, $h(x) = E_{\mathcal G}[\theta(x)]$, where by the definition of $E_{\mathcal G}(\cdot)$ we know that $h(x)$ is an additive function, i.e., $h(x) = \sum_{j=1}^d h_j(x_j) \in \mathcal G$, and $h(\cdot)$ solves the minimization problem

$$E\{[\theta(X_i) - h(X_i)][\theta(X_i) - h(X_i)]'\} = \inf_{g\in\mathcal G} E\{[\theta(X_i) - g(X_i)][\theta(X_i) - g(X_i)]'\}.$$

Noting that $\theta(x)$ is of dimension $p \times 1$, we can write $h(x) = (h^{(1)}(x), \ldots, h^{(p)}(x))'$.

To state the main result, we need some assumptions.

Assumptions.

A1. (i) $\{(Y_i, Z_i, X_i)\}$ are IID, and the support of $(Z_i, X_i)$ is a compact subset of $\mathbb R^{p+d}$; (ii) both $\theta(x) = E(Z_i \mid X_i = x)$ and $\sigma^2(z, x) = \mathrm{Var}(Y_i \mid Z_i = z, X_i = x)$ are bounded functions on the support of $(Z_i, X_i)$.

A2. (i) For every $K$ there is a nonsingular matrix $B$ such that, for $q^K(x) = Bp^K(x)$, the smallest eigenvalue of $E[q^K(X_i)q^K(X_i)']$ is bounded away from zero uniformly in $K$; (ii) there is a sequence of constants $\zeta_0(K)$ satisfying $\sup_{x\in\mathcal Z}\|q^K(x)\| \le \zeta_0(K)$ and $K = K(n)$ such that $\zeta_0(K)^2K/n \to 0$ as $n \to \infty$, where $\mathcal Z$ is the support of $X_i$.

A3. (i) For each $f \in \{m, h^{(1)}, \ldots, h^{(p)}\}$, there exist some $\delta > 0$ and $\beta_f$ such that $\sup_{x\in\mathcal Z}|f(x) - p^K(x)'\beta_f| = O\big(\sum_{j=1}^d k_j^{-\delta}\big)$ as $\min\{k_1, \ldots, k_d\} \to \infty$; (ii) $\sqrt n\,\sum_{j=1}^d k_j^{-\delta} \to 0$ as $n \to \infty$.

A4. $\Phi = E\{[Z_i - h(X_i)][Z_i - h(X_i)]'\}$ is positive definite.

Assumption A1 is quite standard in the literature on estimating additive models. Assumption A2 ensures that $P'P$ is asymptotically nonsingular. While Assumptions A2-A3 are not primitive conditions, it is known that many series functions satisfy them. Newey (1997) gives primitive conditions for power series and splines such that Assumptions A2-A3 hold: for power series, $\zeta_0(K) = O(K)$; for B-splines, $\zeta_0(K) = O(\sqrt K)$. For $d_j = 1$, if $f$ is continuously differentiable of order $c$, then one can take $\delta = c$.

The following theorem states the asymptotic property of $\hat\gamma$.

Theorem 7.6 Under Assumptions A1-A4, we have

$$\sqrt n(\hat\gamma - \gamma_0) \to N\big(0, \Phi^{-1}\Psi\Phi^{-1}\big),$$

where $\Psi = E\{\sigma^2(Z_i, X_i)\,u_iu_i'\}$ and $u_i = Z_i - h(X_i)$.

For statistical inference, we need consistent estimates of $\Phi$ and $\Psi$. Li (2000) shows that we can estimate them consistently by

$$\hat\Phi = \frac 1n\sum_{i=1}^n \tilde Z_i\tilde Z_i' \quad\text{and}\quad \hat\Psi = \frac 1n\sum_{i=1}^n \hat\varepsilon_i^2\,\tilde Z_i\tilde Z_i',$$

where $\tilde Z_i'$ is the $i$th row of $\tilde{\mathbf Z} = \mathbf Z - M\mathbf Z$ and $\hat\varepsilon_i = Y_i - Z_i'\hat\gamma - \hat m(X_i)$. Li (2000) also gives the convergence rate of $\hat m(x) = p^K(x)'\hat\beta$ to $m(x) = \sum_{j=1}^d m_j(x_j)$:

Theorem 7.7 Under Assumptions A1-A4, we have

(i) $\sup_{x\in\mathcal Z}|\hat m(x) - m(x)| = O_p\big(\zeta_0(K)\big(\sqrt{K/n} + \sum_{j=1}^d k_j^{-\delta}\big)\big)$;

(ii) $\frac 1n\sum_{i=1}^n[\hat m(X_i) - m(X_i)]^2 = O_p\big(K/n + \sum_{j=1}^d k_j^{-2\delta}\big)$;

(iii) $\int[\hat m(x) - m(x)]^2\,dF(x) = O_p\big(K/n + \sum_{j=1}^d k_j^{-2\delta}\big)$, where $F(\cdot)$ is the CDF of $X_i$.

The properties of the component estimators $\hat m_j(x_j)$ are similar and thus omitted for brevity.

7.3.2 Kernel Method

We now study the kernel method of estimating additive partially linear models. Like Fan and Li (2003), we consider the additive partially linear model

$$Y_i = \mu_0 + Z_i'\gamma_0 + m_1(X_{i1}) + \cdots + m_d(X_{id}) + \varepsilon_i, \qquad (7.60)$$

where $E(\varepsilon_i \mid Z_i, X_i) = 0$, $Z_i$ is a $p\times 1$ vector of random variables that does not contain a constant term, $\mu_0$ is a scalar parameter, $\gamma_0 = (\gamma_{01}, \ldots, \gamma_{0p})'$ is a $p\times 1$ vector of unknown parameters, the $X_{ij}$'s are univariate continuous random variables, and the $m_j(\cdot)$ are unknown smooth functions.

For $j = 1, \ldots, d$, let $X_{i,\underline j} = (X_{i1}, \ldots, X_{i,j-1}, X_{i,j+1}, \ldots, X_{id})'$, i.e., $X_{ij}$ is removed from $X_i = (X_{i1}, \ldots, X_{id})'$. Define $m_{\underline j}(X_{i,\underline j}) = \sum_{k\ne j} m_k(X_{ik})$. Then we can rewrite (7.60) as

$$Y_i = \mu_0 + Z_i'\gamma_0 + m_j(X_{ij}) + m_{\underline j}(X_{i,\underline j}) + \varepsilon_i. \qquad (7.61)$$

Fan, Härdle, and Mammen (1998) consider the case where $Z_i$ is a $p\times 1$ vector of discrete variables and suggest two ways of estimating model (7.61). Neither method makes full use of the information that $Z_i$ enters the regression function linearly. Motivated by this observation, Fan and Li (2003) consider a two-stage estimation procedure that applies to the case where $Z_i$ contains both discrete and continuous elements and makes full use of the information that $Z_i$ enters the regression function linearly.

Write $m_Y(x) = E(Y_i \mid X_i = x)$ and $m_Z(x) = E(Z_i \mid X_i = x)$, and for $j = 1, \ldots, d$ define the marginally integrated functions

$$m_{Y,j}(x_j) = \int m_Y(x_j, x_{\underline j})\,f_{\underline j}(x_{\underline j})\,dx_{\underline j} \quad\text{and}\quad m_{Z,j}(x_j) = \int m_Z(x_j, x_{\underline j})\,f_{\underline j}(x_{\underline j})\,dx_{\underline j},$$

where $f_{\underline j}$ denotes the density of $X_{i,\underline j}$. Taking conditional expectations given $X_i = x$ on both sides of (7.61) gives

$$m_Y(x) = \mu_0 + m_Z(x)'\gamma_0 + m_j(x_j) + m_{\underline j}(x_{\underline j}). \qquad (7.62)$$

Integrating both sides of (7.62) over $x_{\underline j}$ with weight $f_{\underline j}(x_{\underline j})$ leads to

$$m_{Y,j}(x_j) = \mu_0 + m_{Z,j}(x_j)'\gamma_0 + m_j(x_j), \qquad (7.63)$$

where we have used the identification condition that $E[m_k(X_{ik})] = 0$ for each $k$. Evaluating (7.63) at $x_j = X_{ij}$ and then summing over $j = 1, \ldots, d$ gives

$$\sum_{j=1}^d m_{Y,j}(X_{ij}) = d\mu_0 + \Big(\sum_{j=1}^d m_{Z,j}(X_{ij})\Big)'\gamma_0 + \sum_{j=1}^d m_j(X_{ij}). \qquad (7.64)$$

Subtracting (7.64) from (7.60), we can eliminate $\sum_{j=1}^d m_j(X_{ij})$ and get

$$Y_i - \sum_{j=1}^d m_{Y,j}(X_{ij}) = (1 - d)\mu_0 + \Big(Z_i - \sum_{j=1}^d m_{Z,j}(X_{ij})\Big)'\gamma_0 + \varepsilon_i. \qquad (7.65)$$

Let $\mathcal Y_i = Y_i - \sum_{j=1}^d m_{Y,j}(X_{ij})$ and $\mathcal X_i = \big(1, (Z_i - \sum_{j=1}^d m_{Z,j}(X_{ij}))'\big)'$. Then in vector notation we can write (7.65) as

$$\mathcal Y = \mathcal X\delta_0 + \boldsymbol\varepsilon, \qquad (7.66)$$

where $\mathcal Y$ and $\mathcal X$ are $n\times 1$ and $n\times(p+1)$ matrices whose $i$th rows are given by $\mathcal Y_i$ and $\mathcal X_i'$, respectively, $\boldsymbol\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$, and $\delta_0 = ((1-d)\mu_0, \gamma_0')'$. We can apply OLS to (7.66) to obtain

$$\bar\delta = (\mathcal X'\mathcal X)^{-1}\mathcal X'\mathcal Y = \delta_0 + (\mathcal X'\mathcal X)^{-1}\mathcal X'\boldsymbol\varepsilon. \qquad (7.67)$$

Under standard conditions, we can show that $\bar\delta$ converges to $\delta_0$ at the parametric $\sqrt n$-rate. Nevertheless, $\bar\delta$ is an infeasible estimator because it depends on the unknown quantities $\sum_j m_{Y,j}(X_{ij})$ and $\sum_j m_{Z,j}(X_{ij})$. To obtain a feasible estimator, we replace these unknown quantities by consistent estimates. A consistent estimator of $m_{Y,j}(X_{ij})$ is given by

$$\hat m_{Y,j}(X_{ij}) = \frac 1n\sum_{l=1}^n\frac{\sum_{k\ne i}K_h(X_{kj} - X_{ij})\,L_g(X_{k,\underline j} - X_{l,\underline j})\,Y_k}{\sum_{k\ne i}K_h(X_{kj} - X_{ij})\,L_g(X_{k,\underline j} - X_{l,\underline j})} \equiv \sum_{k=1}^n w_{ik}Y_k, \qquad (7.68)$$

where the definition of the weights $w_{ik}$ is clear from the first equality, $K_h(u) = h^{-1}K(u/h)$ and $L_g(v) = g^{-(d-1)}L(v/g)$, $K$ and $L$ are kernel functions, and $h$ and $g$ are smoothing parameters. Fan and Li (2003) use the leave-one-out method, which simplifies the proofs but does not change the asymptotic results. Note that $\hat m_{Y,j}(X_{ij}) = \sum_{k=1}^n w_{ik}Y_k$ is a weighted average of the $Y_k$'s. Similarly, a consistent estimator of $m_{Z,j}(X_{ij})$ is given by

$$\hat m_{Z,j}(X_{ij}) = \sum_{k=1}^n w_{ik}Z_k. \qquad (7.69)$$

Let $\hat{\mathcal Y}_i = Y_i - \sum_{j=1}^d\hat m_{Y,j}(X_{ij})$ and $\hat{\mathcal X}_i = \big(1, (Z_i - \sum_{j=1}^d\hat m_{Z,j}(X_{ij}))'\big)'$. We could obtain a feasible estimator of $\delta_0$ by replacing $\mathcal Y$ and $\mathcal X$ in (7.67) by $\hat{\mathcal Y}$ and $\hat{\mathcal X}$. Nevertheless, near the boundary the density $f_{\underline j}(\cdot)$ of $X_{i,\underline j}$ cannot be estimated well, so Fan and Li (2003) trim observations near the boundary. Assume that $X_{ij} \in [a_j, b_j]$, where $a_j < b_j$ are finite constants, $j = 1, \ldots, d$. Define the trimming set $S_n = \prod_{j=1}^d[a_j + c_n, b_j - c_n]$, where $c_n = \kappa b_n$ for some $\kappa > 0$ and $b_n = \max\{h, g\}$, and let $\mathbf 1_i = \mathbf 1(X_i \in S_n)$. Fan and Li (2003) estimate $\delta_0$ by

$$\hat\delta = \begin{pmatrix}(1-d)\hat\mu_0\\ \hat\gamma\end{pmatrix} = \Big(\sum_{i=1}^n\hat{\mathcal X}_i\hat{\mathcal X}_i'\,\mathbf 1_i\Big)^{-1}\sum_{i=1}^n\hat{\mathcal X}_i\hat{\mathcal Y}_i\,\mathbf 1_i = (\hat{\mathcal X}'\hat{\mathcal X})^{-1}\hat{\mathcal X}'\hat{\mathcal Y}, \qquad (7.70)$$

where $\hat{\mathcal Y}$ and $\hat{\mathcal X}$ are the $n\times 1$ and $n\times(p+1)$ matrices with $i$th rows given by $\hat{\mathcal Y}_i\mathbf 1_i$ and $\hat{\mathcal X}_i'\mathbf 1_i$, respectively. A code sketch of this two-stage procedure (without trimming) follows.
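The following hedged sketch computes marginal integration estimates of $m_{Y,j}$ and $m_{Z,j}$ as in (7.68)-(7.69), then runs OLS on the transformed regression (7.65)-(7.67). Leave-one-out and boundary trimming are omitted, the Gaussian kernels and single bandwidth are ad hoc choices, and $d \ge 2$ is assumed.

```python
# Fan-Li two-stage estimator for the additive partially linear model (7.60).
import numpy as np

def mi_fit(j, X, V, h):
    """Marginal integration estimate of m_{V,j} at every sample point, cf. (7.68)."""
    X_rest = np.delete(X, j, axis=1)
    L = np.exp(-0.5 * np.sum(((X_rest[:, None, :] - X_rest[None, :, :]) / h)**2, axis=2))
    Kj = np.exp(-0.5 * ((X[:, j][:, None] - X[:, j][None, :]) / h)**2)
    out = np.empty(len(V))
    for i in range(len(V)):
        w = Kj[:, [i]] * L      # w[k, l]: weight of obs k at the point (X_ij, X_{l,-j})
        out[i] = np.mean((w * V[:, None]).sum(axis=0) / w.sum(axis=0))
    return out

def fan_li(X, Z, Y, h=0.15):
    """OLS on the transformed regression (7.65); returns (mu0_hat, gamma_hat)."""
    n, d = X.shape
    mY = sum(mi_fit(j, X, Y, h) for j in range(d))
    mZ = sum(np.column_stack([mi_fit(j, X, Z[:, c], h) for c in range(Z.shape[1])])
             for j in range(d))
    Ycal = Y - mY                                        # the variable cal-Y_i
    Xcal = np.column_stack([np.ones(n), Z - mZ])         # the variable cal-X_i
    delta = np.linalg.lstsq(Xcal, Ycal, rcond=None)[0]   # (7.67)
    return delta[0] / (1 - d), delta[1:]                 # undo the (1 - d) factor
```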

To state the main result, we make the following assumptions.

Assumptions.

A1. $\{(Y_i, Z_i, X_i)\}$ are IID; $(Z_i, X_i)$ has finite support, with the support of $X_i$ being the product set $\prod_{j=1}^d[a_j, b_j]$; the density function of $X_i$ is bounded from below by a positive constant on its support.

A2. $m_Y(\cdot)$, $m_Z(\cdot)$, $m(\cdot)$, and $f(\cdot)$ all belong to $\mathcal G_\nu^4$, where $\nu \ge 2$ is an integer and $\mathcal G_\nu^4$ is a smooth-function analogue of the additive class $\mathcal G$ defined in Section 7.3.1 (see Fan and Li, 2003, for the precise definition). Let $Z_i = (Z_i^c, Z_i^d)$, where $Z_i^c$ and $Z_i^d$ denote the continuous and discrete components of $Z_i$, respectively; then for all values $z^d$ of $Z_i^d$, $E(Z_i^c \mid Z_i^d = z^d, X_i = \cdot) \in \mathcal G_1^4$.

A3. $\Phi = E(\mathcal X_i\mathcal X_i')$ is positive definite.

A4. The kernel functions $K$ and $L$ are bounded and symmetric, and both are of order $\nu$.

A5. As $n \to \infty$, $nhg^{d-1} \to \infty$ and $n(h^{2\nu} + g^{2\nu}) \to 0$. Note that when $h = g$, Assumption A5 allows the use of a second-order kernel if $d \le 5$; it also implies that the data need to be undersmoothed.

The following theorem states the asymptotic property of $\hat\delta$.

Theorem 7.8 Under Assumptions A1-A5, we have

$$\sqrt n(\hat\delta - \delta_0) \to N\big(0, \Phi^{-1}\Psi\Phi^{-1}\big),$$

where $\Psi = E(\varepsilon_i^2\,\eta_i\eta_i')$, with $\eta_i = \mathcal X_i + (\mathcal X_i - E(\mathcal X_i))\big(1 - \sum_{j=1}^d\omega_j(X_i)\big)$ and $\omega_j(X_i) = f_j(X_{ij})f_{\underline j}(X_{i,\underline j})/f(X_i)$. Here $f(\cdot)$, $f_j(\cdot)$, and $f_{\underline j}(\cdot)$ are the density functions of $X_i$, $X_{ij}$, and $X_{i,\underline j}$, respectively.

Given the $\sqrt n$-consistent estimator $\hat\gamma$, we can rewrite (7.60) as

$$Y_i - Z_i'\hat\gamma = \mu_0 + m_1(X_{i1}) + \cdots + m_d(X_{id}) + \varepsilon_i + Z_i'(\gamma_0 - \hat\gamma). \qquad (7.71)$$

The intercept $\mu_0$ can be $\sqrt n$-consistently estimated by

$$\hat\mu_0 = \bar Y - \bar Z'\hat\gamma,$$

where $\bar Y = n^{-1}\sum_{i=1}^n Y_i$ and $\bar Z = n^{-1}\sum_{i=1}^n Z_i$. Note that (7.71) is essentially an additive regression model with $Y_i - Z_i'\hat\gamma$ as the new dependent variable and $\varepsilon_i + Z_i'(\gamma_0 - \hat\gamma)$ as the new error term. Since $\hat\gamma - \gamma_0 = O_p(n^{-1/2})$, a rate faster than any nonparametric convergence rate, the asymptotic distribution of any nonparametric estimator of $m_j(\cdot)$ based on (7.71) remains the same as if $\hat\gamma$ were replaced by $\gamma_0$.

To make statistical inference on $\delta_0$, we need to estimate $\Phi$ and $\Psi$ consistently. Let $\hat f(\cdot)$, $\hat f_j(\cdot)$, and $\hat f_{\underline j}(\cdot)$ denote the kernel estimators of $f(\cdot)$, $f_j(\cdot)$, and $f_{\underline j}(\cdot)$, respectively. That is,

$$\hat f_j(X_{ij}) = \frac 1n\sum_{l=1}^n K_h(X_{lj} - X_{ij}), \qquad \hat f_{\underline j}(X_{i,\underline j}) = \frac 1n\sum_{l=1}^n L_g(X_{l,\underline j} - X_{i,\underline j}),$$

and analogously for $\hat f(X_i)$ with a product kernel in all $d$ directions.

Define $\hat\omega_j(X_i) = \hat f_j(X_{ij})\hat f_{\underline j}(X_{i,\underline j})/\hat f(X_i)$, $\hat\eta_i = \hat{\mathcal X}_i + (\hat{\mathcal X}_i - \bar{\mathcal X})\big(1 - \sum_{j=1}^d\hat\omega_j(X_i)\big)$ with $\bar{\mathcal X} = n^{-1}\sum_{i=1}^n\hat{\mathcal X}_i$, and $\hat\varepsilon_i = \hat{\mathcal Y}_i - \hat{\mathcal X}_i'\hat\delta$. Then we can estimate $\Phi$ and $\Psi$ consistently by

$$\hat\Phi = \frac 1n\sum_{i=1}^n\hat{\mathcal X}_i\hat{\mathcal X}_i'\,\mathbf 1_i \quad\text{and}\quad \hat\Psi = \frac 1n\sum_{i=1}^n\hat\varepsilon_i^2\,\hat\eta_i\hat\eta_i'\,\mathbf 1_i.$$

7.4 Specification Test for Additive Models

7.4.1 Test for Additive Partially Linear Models via the Series Method

Li, Hsiao, and Zinn (2003, Journal of Econometrics) consider consistent specification tests for semiparametric/nonparametric models where the null models contain nonparametric components, based on series estimation methods. A leading case is testing for an additive partially linear model. The null hypothesis is

$$H_0:\ E(Y_i \mid Z_i, X_i) = Z_i'\gamma + \sum_{j=1}^d m_j(X_{ij})\ \text{a.s. for some}\ \gamma\in\mathcal B\ \text{and some}\ \textstyle\sum_{j=1}^d m_j(\cdot)\in\mathcal G, \qquad (7.72)$$

where $Z_i$ is a $p\times 1$ vector of regressors, $\gamma$ is a $p\times 1$ unknown parameter, the $X_{ij}$'s are the $d$ non-overlapping scalar components of $X_i$, $\mathcal B$ is a compact subset of $\mathbb R^p$, and $\mathcal G$ is the class of additive functions defined in the last section. In particular, the identification condition is $m_j(0) = 0$ for $j = 2, \ldots, d$. The alternative hypothesis is

$$H_1:\ E(Y_i \mid Z_i, X_i) \ne Z_i'\gamma + \sum_{j=1}^d m_j(X_{ij}) \qquad (7.73)$$

on a set with positive measure, for any $\gamma\in\mathcal B$ and any $\sum_{j=1}^d m_j(\cdot)\in\mathcal G$.

Let $\varepsilon_i = Y_i - Z_i'\gamma - \sum_{j=1}^d m_j(X_{ij})$ and $W_i = (Z_i', X_i')'$. The null hypothesis $H_0$ is equivalent to

$$H_0:\ E(\varepsilon_i \mid W_i) = 0\ \text{a.s.} \qquad (7.74)$$

Noting that $E(\varepsilon_i \mid W_i) = 0$ a.s. if and only if $E[\varepsilon_i\,a(W_i)] = 0$ for all $a(\cdot)\in\mathcal A$, the class of bounded measurable functions of $W_i$, Li, Hsiao, and Zinn (2003) follow Bierens and Ploberger (1997), Stute (1997), and Stinchcombe and White (1998) and consider the unconditional moment test

$$E[\varepsilon_i H(W_i, w)] = 0 \quad\text{for almost all}\ w \in \mathcal W \subset \mathbb R^{p+d}, \qquad (7.75)$$

where $H(\cdot, \cdot)$ is a proper choice of weight function such that (7.75) is equivalent to (7.74). Stinchcombe and White (1998) show that there exists a wide class of weight functions $H(\cdot, \cdot)$ that makes (7.75) equivalent to (7.74).

Choices of weight functions include the exponential function $H(W_i, w) = \exp(W_i'w)$, the logistic function $H(W_i, w) = 1/(1 + \exp(c - W_i'w))$ with $c \ne 0$, the trigonometric function $H(W_i, w) = \cos(W_i'w) + \sin(W_i'w)$, and the usual indicator function $H(W_i, w) = \mathbf 1(W_i \le w)$. See Stinchcombe and White (1998) and Bierens and Ploberger (1997) for more discussion. The advantage of switching from the conditional moment test of (7.74) to the unconditional moment test of (7.75) is that it avoids estimating the alternative model nonparametrically, as in Chen and Fan (1999) and Delgado and González Manteiga (2001).

If $\varepsilon_i$ were observable, one could construct a test based upon the sample analogue of $E[\varepsilon_i H(W_i, w)] = 0$:

$$J_{0n}(w) = \frac{1}{\sqrt n}\sum_{i=1}^n\varepsilon_i H(W_i, w). \qquad (7.76)$$

$J_{0n}(\cdot)$ can be viewed as a random element taking values in the separable space $\mathcal L^2(\mathcal W, \mu)$ of all real, Borel measurable functions on $\mathcal W$ such that $\int_{\mathcal W}g(w)^2\,d\mu(w) < \infty$ (a separable space is a topological space that has a countable dense subset; an example is the Euclidean space $\mathbb R^d$), endowed with the $L^2$ norm

$$\|g\| = \Big\{\int_{\mathcal W}g(w)^2\,d\mu(w)\Big\}^{1/2}.$$

Chen and White (1997) show that, for a sequence of IID $\mathcal L^2(\mathcal W, \mu)$-valued elements $\{Z_i(\cdot)\}$, $n^{-1/2}\sum_{i=1}^n Z_i(\cdot)$ converges weakly to $Z(\cdot)$ in the topology of $(\mathcal L^2(\mathcal W, \mu), \|\cdot\|)$ if and only if $\int_{\mathcal W}E[Z_i(w)^2]\,d\mu(w) < \infty$, where $Z$ is a Gaussian element with the same covariance function $\Omega(w, w') = E[Z_i(w)Z_i(w')]$.

Let $\sigma^2(W_i) = E(\varepsilon_i^2 \mid W_i)$. It is easy to check that for $Z_i(w) = \varepsilon_i H(W_i, w)$ we have

$$\int_{\mathcal W}E[Z_i(w)^2]\,d\mu(w) = \int_{\mathcal W}E[\sigma^2(W_i)H(W_i, w)^2]\,d\mu(w) \le \bar C^2\,E[\sigma^2(W_i)]\,\mu(\mathcal W) < \infty,$$

provided the weight function $H(\cdot, \cdot)$ is bounded above by $\bar C$ on $\mathcal W$ and $\mu(\mathcal W) < \infty$. This implies that $J_{0n}(\cdot)$ converges weakly to $J_0(\cdot)$ in $\mathcal L^2(\mathcal W, \mu)$, where $J_0(\cdot)$ is a Gaussian process centered at zero with covariance function

$$\Omega(w, w') = E[Z_i(w)Z_i(w')] = E[\sigma^2(W_i)H(W_i, w)H(W_i, w')]. \qquad (7.77)$$

Since $\varepsilon_i$ is unobservable, we replace it by a consistent estimate $\hat\varepsilon_i$ and construct the feasible test statistic

$$\hat J_n(w) = \frac{1}{\sqrt n}\sum_{i=1}^n\hat\varepsilon_i H(W_i, w). \qquad (7.78)$$

There are several ways to obtain $\hat\varepsilon_i$. One is to apply the series method of Li (2000) to obtain consistent estimates of the parameters of the additive partially linear model; another is to apply the kernel method of Fan and Li (2003). In either case, let $\hat\gamma$ be the consistent estimator of $\gamma$ and $\hat m_j(\cdot)$ the consistent estimator of $m_j(\cdot)$. Then we estimate $\varepsilon_i$ consistently by $\hat\varepsilon_i = Y_i - Z_i'\hat\gamma - \sum_{j=1}^d\hat m_j(X_{ij})$.

Now one can construct a Cramér-von Mises (CM) statistic for testing $H_0$:

$$CM_n = \int \hat J_n(w)^2\,dF_n(w) = \frac 1n\sum_{l=1}^n\hat J_n(W_l)^2,$$

where $F_n(\cdot)$ is the empirical distribution of $W_1, \ldots, W_n$. Alternatively, one can construct the Kolmogorov-Smirnov (KS) statistic

$$KS_n = \sup_{w\in\mathcal W}|\hat J_n(w)| \quad\text{or}\quad KS_n = \max_{1\le l\le n}|\hat J_n(W_l)|.$$

The following theorem states the asymptotic distributions of the test statistics.

Theorem 7.9 Under some regularity conditions and under $H_0$:

(i) $\hat J_n(\cdot)$ converges weakly to $J(\cdot)$ in $\mathcal L^2(\mathcal W, \mu)$, where $J(\cdot)$ is a Gaussian process with zero mean and covariance function $\Omega_1(w, w') = E[\sigma^2(W_i)\,\nu_i(w)\nu_i(w')]$, in which

$$\nu_i(w) = H(W_i, w) - E_{\mathcal G}[H(\cdot, w)](W_i) - \Delta(w)'u_i,$$

with $\Delta(w) = [E(u_iu_i')]^{-1}E[u_iH(W_i, w)]$, $u_i = Z_i - E_{\mathcal G}[Z_i]$, and $E_{\mathcal G}[\cdot]$ denoting the projection onto the space $\mathcal G$ of additive functions;

(ii) $CM_n \Rightarrow \int J(w)^2\,dF(w)$ and $KS_n \Rightarrow \sup_{w\in\mathcal W}|J(w)|$, where $F(\cdot)$ is the CDF of $W_i$.

Li, Hsiao, and Zinn (2003) also consider a special case of the additive partially linear model in which the nonparametric part is a vector of known functions of $X_i$; the covariance function then takes the same form, with the projection onto $\mathcal G$ replaced by the corresponding parametric projection.

The asymptotic null distribution is not pivotal. Li, Hsiao, and Zinn (2003) suggest using a residual-based wild bootstrap method to approximate the critical values of the null limiting distributions of $CM_n$ and $KS_n$. The procedure is standard, and we discuss it only briefly here, based on the method of sieves. Let $\varepsilon_i^*$ denote the wild bootstrap error generated via the two-point distribution: $\varepsilon_i^* = [(1 - \sqrt 5)/2]\hat\varepsilon_i$ with probability $(1 + \sqrt 5)/(2\sqrt 5)$ and $\varepsilon_i^* = [(1 + \sqrt 5)/2]\hat\varepsilon_i$ with probability $(\sqrt 5 - 1)/(2\sqrt 5)$. We generate $Y_i^*$ according to the null model: $Y_i^* = Z_i'\hat\gamma + \sum_{j=1}^d\hat m_j(X_{ij}) + \varepsilon_i^*$. Based on the wild bootstrap sample $\{(Y_i^*, Z_i, X_i)\}$, we re-estimate the model under the null to obtain estimates $\hat\gamma^*$ and $\hat m_j^*(\cdot)$. Let $\hat\varepsilon_i^* = Y_i^* - Z_i'\hat\gamma^* - \sum_{j=1}^d\hat m_j^*(X_{ij})$ denote the bootstrap residuals. Then the bootstrap test statistic is given by

$$\hat J_n^*(w) = \frac{1}{\sqrt n}\sum_{i=1}^n\hat\varepsilon_i^* H(W_i, w).$$

Using $\hat J_n^*(\cdot)$ we can compute a bootstrap version of the $CM_n$ statistic, i.e., $CM_n^* = \frac 1n\sum_{l=1}^n[\hat J_n^*(W_l)]^2$. The bootstrap version of the $KS_n$ statistic is defined analogously. A code sketch of this bootstrap procedure follows.
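The sketch assumes the exponential weight $H(W, w) = \exp(W'w)$ (with $W$ rescaled in practice to avoid overflow). The helper fit_null(), which re-estimates the null model and returns fitted values and residuals, is hypothetical and would wrap, e.g., the series estimator of Section 7.3.1.

```python
# Wild bootstrap for the Cramer-von Mises statistic CM_n.
import numpy as np

def cm_stat(resid, W):
    """CM_n = n^{-1} sum_l [ n^{-1/2} sum_i e_i H(W_i, W_l) ]^2 with H = exp."""
    Hmat = np.exp(W @ W.T)                       # H(W_i, W_l); rescale W first if large
    J = (resid @ Hmat) / np.sqrt(len(resid))     # J_n evaluated at each W_l
    return np.mean(J**2)

def wild_bootstrap_pvalue(W, fitted, resid, fit_null, n_boot=399, seed=0):
    rng = np.random.default_rng(seed)
    stat = cm_stat(resid, W)
    # Two-point distribution: values a, b taken with probabilities p_a, 1 - p_a,
    # so that the bootstrap error has mean 0 and variance resid_i^2.
    a, b = (1 - np.sqrt(5)) / 2, (1 + np.sqrt(5)) / 2
    p_a = (1 + np.sqrt(5)) / (2 * np.sqrt(5))
    boot = np.empty(n_boot)
    for r in range(n_boot):
        eps = resid * np.where(rng.uniform(size=len(resid)) < p_a, a, b)
        Ystar = fitted + eps                      # generate Y* under the null model
        _, resid_star = fit_null(Ystar)           # re-estimate under the null
        boot[r] = cm_stat(resid_star, W)
    return np.mean(boot >= stat)                  # bootstrap p-value
```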

7.5 Generalized Additive Partially Linear Models

In this section, we introduce Gozalo and Linton's (2001) estimation and testing of generalized additive partially linear models. Let $(Y, X, Z)$ be a random vector with $X$ of dimension $d$, $Z$ of dimension $p$, and $Y$ a scalar. Let $m(x, z) = E(Y \mid X = x, Z = z)$, and suppose $m(\cdot)$ has a generalized additive partially linear structure:

$$G_\lambda(m(x, z)) = z'\gamma + c_0 + \sum_{j=1}^d m_j(x_j), \qquad (7.79)$$

where $\{G_\lambda(\cdot):\lambda\in\Lambda\}$, $\Lambda\subset\mathbb R^k$, is a parametric family of transformations (a link function known up to the finite-dimensional parameter $\lambda$), $\gamma$ is a $p\times 1$ vector of unknown parameters, $x = (x_1, \ldots, x_d)$ collects the $d$ one-dimensional components of $X$, and the $m_j(\cdot)$, $j = 1, \ldots, d$, are one-dimensional unknown smooth functions. For identification purposes, it is convenient to assume that $E[m_j(X_j)] = 0$. For future use, let $\theta_0 = (\lambda_0, \gamma_0)$ denote the true parameter value.

There are many cases where the specification in (7.79) can arise. It allows us to nest the renowned standard logit and probit models within more flexible structures. It also nests the Box-Cox type of transformation as a special case, with $G_\lambda(m) = (m^\lambda - 1)/\lambda$.

Example 7.5 (Misclassification of a binary dependent variable) The specification in (7.79) can arise from misclassification of a binary dependent variable, as in Copas (1988) and Hausman, Abrevaya, and Scott-Morton (1998). Suppose that

$$P(Y^* = 1 \mid X = x, Z = z) = m^*(x, z) = F\Big(z'\gamma + c_0 + \sum_{j=1}^d m_j(x_j)\Big) \qquad (7.80)$$

for some known link function $F$, but that when $Y^* = 1$ we erroneously observe $Y = 0$ with probability $\pi_1$, and when $Y^* = 0$ we erroneously observe $Y = 1$ with probability $\pi_2$. Then

$$P(Y = 1 \mid X = x, Z = z) = P(Y^* = 1 \mid x, z)(1 - \pi_1) + P(Y^* = 0 \mid x, z)\,\pi_2 = \pi_2 + (1 - \pi_1 - \pi_2)\,F\Big(z'\gamma + c_0 + \sum_{j=1}^d m_j(x_j)\Big),$$

which is of the form (7.79) with $G_\lambda^{-1}(v) = \pi_2 + (1 - \pi_1 - \pi_2)F(v)$ and $\lambda = (\pi_1, \pi_2)$.

7.5.1 Estimation

To simplify the presentation, we follow Gozalo and Linton (2001) and assume that $X$ is continuous while $Z$ is discretely valued, specifically $Z \in \{z_1, \ldots, z_M\}$. Let $W = (X, Z)$, and write the values that $X$ and $W$ take as $x = (x_1, \ldots, x_d)$ and $w = (x, z)$. As before, for any $j = 1, \ldots, d$ we partition $x = (x_j, x_{\underline j})$ and $X = (X_j, X_{\underline j})$. Let $f(x\mid z)$, $f_j(x_j\mid z)$, and $f_{\underline j}(x_{\underline j}\mid z)$ denote the conditional density functions of $X$, $X_j$, and $X_{\underline j}$ given $Z = z$, respectively. Define

$$\rho_j(x\mid z) = \frac{f(x\mid z)}{f_j(x_j\mid z)\,f_{\underline j}(x_{\underline j}\mid z)}, \qquad (7.81)$$

which is a measure of the dependence between $X_j$ and $X_{\underline j}$ in the conditional distribution. Obviously, $\rho_j(x\mid z)$ lies between zero and infinity, and it equals one when $X_j$ and $X_{\underline j}$ are independent given $Z = z$.

Let $\mathcal X$ be the support of $X$ and let $\mathcal X^0$ be a rectangular strict subset of $\mathcal X$, i.e., $\mathcal X^0 = \prod_{j=1}^d\mathcal X_j^0$, where the $\mathcal X_j^0$ ($j = 1, \ldots, d$) are intervals. Let $\mathcal X_{\underline j}^0 = \prod_{k\ne j}\mathcal X_k^0$. For any $\theta = (\lambda, \gamma)$, let

$$\alpha_j^0(x_j;\theta) = \int\big\{G_\lambda(m(x_j, x_{\underline j}, z)) - z'\gamma\big\}\,dQ(x_{\underline j}, z), \qquad (7.82)$$

$$m^0(w;\theta) = G_\lambda^{-1}\Big(z'\gamma + c^0 + \sum_{j=1}^d\alpha_j^0(x_j;\theta)\Big), \qquad (7.83)$$

where $Q$ is a marginal integration measure on $\mathcal X_{\underline j}^0\times\{z_1, \ldots, z_M\}$ and $c^0 = (1 - d)\,c_Q$, with $c_Q$ the $Q$-mean of $G_\lambda(m(W)) - Z'\gamma$; the constant $c^0$ corrects for the fact that each $\alpha_j^0$ contains the intercept. When the additive structure in (7.79) is true, $\alpha_j^0(x_j;\theta_0) = c_0 + m_j(x_j)$ and $m^0(w;\theta_0) = m(w)$ for all $w$. Equations (7.82) and (7.83) form the basis of the marginal integration method of estimating additive nonparametric models. As we shall see, the estimation of the model (7.79) is similar in spirit to profile least squares: one first pretends that $\theta = (\lambda, \gamma)$ is known and estimates the additive index using the integration method, and then estimates $\theta$ by the generalized method of moments.

Estimation of $m$ for given $\theta$. We can estimate $m(w)$ in two ways. One is consistent when (7.79) holds, and the other is consistent more generally. First, we estimate $m(w)$ by the NW method

$$\hat m(w) = \frac{\sum_{i=1}^n K_h(X_i - x)\,\mathbf 1(Z_i = z)\,Y_i}{\sum_{i=1}^n K_h(X_i - x)\,\mathbf 1(Z_i = z)}, \qquad (7.84)$$

where $K_h(X_i - x) = \prod_{j=1}^d k_h(X_{ij} - x_j)$, $k_h(u) = h^{-1}k(u/h)$, $k(\cdot)$ is a one-dimensional kernel of order $q$, and $h = h(n)$ is a bandwidth sequence. The properties of $\hat m(w)$ are standard: at any interior point,

$$\sqrt{nh^d}\Big(\hat m(w) - m(w) - h^q\,b(w)\Big) \to N\Big(0,\ \kappa_{02}^d\,\frac{\sigma^2(w)}{f(w)}\Big), \qquad (7.85)$$

where $\kappa_{02} = \int k^2(u)\,du$, $\sigma^2(w) = \mathrm{Var}(Y \mid W = w)$ is the conditional variance function, and $f(w) = f(x\mid z)P(Z = z)$. The bias function $b(w)$ is the probability limit of $b_n(w)$, where

$$b_n(w) = \frac{1}{nh^q\,\hat f(w)}\sum_{i=1}^n K_h(X_i - x)\,\mathbf 1(Z_i = z)\,\{m(W_i) - m(w)\}.$$

When $m(\cdot)$ satisfies the generalized additive model structure (7.79), we can estimate $m(w)$ with a better rate of convergence by imposing the additive restrictions. The empirical versions of $\alpha_j^0(x_j;\theta)$ and $m^0(w;\theta)$ are given by

$$\tilde\alpha_j(x_j;\theta) = \frac 1n\sum_{i=1}^n\big\{G_\lambda(\hat m(x_j, X_{i,\underline j}, Z_i)) - Z_i'\gamma\big\}\,\mathbf 1(X_{i,\underline j}\in\mathcal X_{\underline j}^0), \qquad (7.86)$$

$$\tilde m(w;\theta) = G_\lambda^{-1}\Big(z'\gamma + \tilde c + \sum_{j=1}^d\tilde\alpha_j(x_j;\theta)\Big), \qquad (7.87)$$

where $\tilde c$ is the empirical counterpart of $c^0$ and a bandwidth $h_0 = h_0(n)$ is used throughout this estimation. The asymptotic properties of these estimators can be studied using methods similar to those of Linton and Härdle (1996), who derive the pointwise asymptotics of $\tilde\alpha_j(x_j;\theta)$ and $\tilde m(w;\theta)$ in the absence of discrete variables. Specifically,

$$\sqrt{nh_0}\Big(\tilde\alpha_j(x_j;\theta_0) - \alpha_j^0(x_j;\theta_0) - h_0^q\,b_j^0(x_j;\theta_0)\Big) \to N\big(0,\ v_j^0(x_j;\theta_0)\big), \qquad (7.88)$$

where $\kappa_q = \int u^qk(u)\,du$, and the bias and variance functions $b_j^0(x_j;\theta)$ and $v_j^0(x_j;\theta)$ are $Q$-weighted averages over $(x_{\underline j}, z)$ of terms involving $G_\lambda'(m(w))$ and of $[G_\lambda'(m(w))]^2\,\sigma^2(w)\,\rho_j(x\mid z)$, respectively; see Gozalo and Linton (2001) for the exact expressions.

Estimation of $\theta$. Gozalo and Linton (2001) suggest estimating the parameters $\theta = (\lambda, \gamma)$ by the generalized method of moments. That is, they choose $\theta \in \Psi$ to minimize the criterion function

$$Q_n(\theta) = \frac 1n\sum_{i=1}^n\big\|\mathbf m\big(\widetilde{\mathbf m}(W_i;\theta), Y_i\big)\big\|_A^2, \qquad (7.89)$$

where $\Psi$ is a compact parameter space in $\mathbb R^{k+p+1}$, $\mathbf m(\cdot)$ is a given moment function, $A$ is a symmetric and positive definite weighting matrix, $\widetilde{\mathbf m}(W_i;\theta) = \big(\tilde m(W_i;\theta), \tilde\alpha_1(X_{i1};\theta), \ldots, \tilde\alpha_d(X_{id};\theta)\big)$, and $\|v\|_A^2 = v'Av$ for any positive definite matrix $A$ and conformable vector $v$.

To study the asymptotic properties of the estimator $\hat\theta$ obtained by minimizing the above criterion function, we suppose that the $s\times 1$ vector ($s \ge k + p + 1$) of moment functions satisfies

$$E\big[\mathbf m\big(\mathbf m^0(W_i;\theta), Y_i\big)\big] = 0 \qquad (7.90)$$

if and only if $\theta = \theta_0$, where $\mathbf m^0(W_i;\theta) = \big(m^0(W_i;\theta), \alpha_1^0(X_{i1};\theta), \ldots, \alpha_d^0(X_{id};\theta)\big)$. For example, from a Gaussian likelihood for homoskedastic regression we obtain

$$\mathbf m\big(\mathbf m^0(W_i;\theta), Y_i\big) = \big(Y_i - m^0(W_i;\theta)\big)\,\frac{\partial m^0(W_i;\theta)}{\partial\theta}, \qquad (7.91)$$

while from a binary choice likelihood we obtain

$$\mathbf m\big(\mathbf m^0(W_i;\theta), Y_i\big) = \frac{Y_i - m^0(W_i;\theta)}{m^0(W_i;\theta)\,\big(1 - m^0(W_i;\theta)\big)}\,\frac{\partial m^0(W_i;\theta)}{\partial\theta}. \qquad (7.92)$$

Given the estimate $\hat\theta$, we can finally estimate $m(w)$ by $\tilde m(w;\hat\theta)$.

Let $G(\theta) = E\big[\partial\mathbf m(\mathbf m^0(W_i;\theta), Y_i)/\partial\theta'\big]$ and write $G = G(\theta_0)$. Under weak conditions, the sample moment $n^{-1/2}\sum_{i=1}^n\mathbf m(\widetilde{\mathbf m}(W_i;\theta_0), Y_i)$, in which $\mathbf m^0$ is replaced by its marginal integration estimate $\widetilde{\mathbf m}$, is asymptotically normal with mean zero and some finite positive definite variance matrix $V$. The following theorem states the asymptotic property of the GMM estimator $\hat\theta$.

Theorem 7.10 Under certain conditions,

$$\sqrt n\big(\hat\theta - \theta_0\big) \to N\Big(0,\ (G'AG)^{-1}G'AVAG\,(G'AG)^{-1}\Big).$$

The above theorem implies that the optimal choice of the weighting matrix is given by $A = \hat V^{-1}$, where $\hat V$ is a consistent estimate of $V$. The asymptotic variance then becomes $(G'V^{-1}G)^{-1}$.

7.5.2 Specification Test

We now test the validity of the additive specification (7.79) of the regression function $m(w)$ over a subset of interest $\mathcal J_0 \subset \mathbb R^{d+p}$ of the support of $W$. The null hypothesis is

$$H_0:\ m(w) = m^0(w;\theta_0)\ \text{for some}\ \theta_0\in\Psi\ \text{and all}\ w\in\mathcal J_0. \qquad (7.93)$$

The alternative hypothesis $H_1$ is the negation of $H_0$. Gozalo and Linton (2001) replace $\theta_0$ by the estimator $\hat\theta$, which is $\sqrt n$-consistent under the null hypothesis. Let $\Gamma$ be a family of monotonic transformations. They consider the following test statistics:

$$\hat T_{0n} = \frac 1n\sum_{i=1}^n\big[\Gamma(\hat m(W_i)) - \Gamma(\tilde m(W_i;\hat\theta))\big]^2\,\pi(W_i), \qquad (7.94)$$

$$\hat T_{1n} = \frac 1n\sum_{i=1}^n\tilde\varepsilon_i\,\big[\hat m(W_i) - \tilde m(W_i;\hat\theta)\big]\,\pi(W_i), \qquad (7.95)$$

$$\hat T_{2n} = \frac{1}{n(n-1)}\sum_{i=1}^n\sum_{l\ne i}\tilde\varepsilon_i\,\tilde\varepsilon_l\,K_{il}\,\pi(W_i), \qquad (7.96)$$

$$\hat T_{3n} = \frac 1n\sum_{i=1}^n(\tilde\varepsilon_i - \hat\varepsilon_i)^2\,\pi(W_i), \qquad (7.97)$$

where $K_{il} = K\big((X_i - X_l)/h\big)\,\mathbf 1(Z_i = Z_l)$, $\hat\varepsilon_i = Y_i - \hat m(W_i)$ and $\tilde\varepsilon_i = Y_i - \tilde m(W_i;\hat\theta)$ are the unrestricted and restricted (additive) residuals, respectively, and $\pi(\cdot)$ is a pre-specified nonnegative weighting function that is continuous and strictly positive on $\mathcal J_0$. In (7.94), likely candidates for the function $\Gamma$ are the identity $\Gamma = I$ and the link $\Gamma = G_{\hat\lambda}$. One rejects the null for large values of $\hat T_{jn}$, $j = 0, 1, 2, 3$.

Under certain conditions, the test statistics $\hat T_{jn}$, $j = 0, 1, 2, 3$, are, after location and scale adjustments, asymptotically standard normal under the null hypothesis. That is, for each test statistic there exist random sequences $\{C_{jn}\}$ and $\{V_{jn}\}$ such that under $H_0$,

$$t_{jn} = \frac{nh^{d/2}\,\hat T_{jn} - C_{jn}}{\sqrt{V_{jn}}} \to N(0, 1), \quad j = 0, \ldots, 3. \qquad (7.98)$$

The expressions for $C_{jn}$ and $V_{jn}$ are given in Gozalo and Linton (2001). As usual, the asymptotic normal approximations here can work poorly. To implement the test, one needs to rely on the bootstrap to compute bootstrapped p-values or critical values. The procedure is standard; for more details, see Gozalo and Linton (2001).

7.6 Exercises

1. Generate $n = 100$ binary observations according to the model

$$\mathrm{logit}\{p(x)\} \equiv \log\Big\{\frac{p(x)}{1 - p(x)}\Big\} = m(x),$$

where $p(x) = P(Y = 1 \mid X = x)$, $m(\cdot)$ is a smooth nonlinear function, and $X$ is uniformly distributed on $[-1, 1]$. Fit the model $\mathrm{logit}\{p(x)\} = c_0 + m(x)$ as described in this chapter. Plot both the estimated logits and probabilities together with their true values.

2. Derive approximate formulae for the standard errors of the estimates in the previous exercise and add these to the plots above.


More information

UNIVERSITÄT POTSDAM Institut für Mathematik

UNIVERSITÄT POTSDAM Institut für Mathematik UNIVERSITÄT POTSDAM Institut für Mathematik Testing the Acceleration Function in Life Time Models Hannelore Liero Matthias Liero Mathematische Statistik und Wahrscheinlichkeitstheorie Universität Potsdam

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Advanced Methods for Data Analysis (36-402/36-608 Spring 2014 1 Generalized linear models 1.1 Introduction: two regressions So far we ve seen two canonical settings for regression.

More information

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i,

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i, A Course in Applied Econometrics Lecture 18: Missing Data Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. When Can Missing Data be Ignored? 2. Inverse Probability Weighting 3. Imputation 4. Heckman-Type

More information

Econ 423 Lecture Notes: Additional Topics in Time Series 1

Econ 423 Lecture Notes: Additional Topics in Time Series 1 Econ 423 Lecture Notes: Additional Topics in Time Series 1 John C. Chao April 25, 2017 1 These notes are based in large part on Chapter 16 of Stock and Watson (2011). They are for instructional purposes

More information

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models University of Illinois Fall 2016 Department of Economics Roger Koenker Economics 536 Lecture 7 Introduction to Specification Testing in Dynamic Econometric Models In this lecture I want to briefly describe

More information

Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models

Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models Optimum Design for Mixed Effects Non-Linear and generalized Linear Models Cambridge, August 9-12, 2011 Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models

More information

Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, / 91. Bruce E.

Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, / 91. Bruce E. Forecasting Lecture 3 Structural Breaks Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, 2013 1 / 91 Bruce E. Hansen Organization Detection

More information

ESTIMATION OF NONPARAMETRIC MODELS WITH SIMULTANEITY

ESTIMATION OF NONPARAMETRIC MODELS WITH SIMULTANEITY ESTIMATION OF NONPARAMETRIC MODELS WITH SIMULTANEITY Rosa L. Matzkin Department of Economics University of California, Los Angeles First version: May 200 This version: August 204 Abstract We introduce

More information

Issues on quantile autoregression

Issues on quantile autoregression Issues on quantile autoregression Jianqing Fan and Yingying Fan We congratulate Koenker and Xiao on their interesting and important contribution to the quantile autoregression (QAR). The paper provides

More information

Threshold Autoregressions and NonLinear Autoregressions

Threshold Autoregressions and NonLinear Autoregressions Threshold Autoregressions and NonLinear Autoregressions Original Presentation: Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Threshold Regression 1 / 47 Threshold Models

More information

Adaptive Nonparametric Density Estimators

Adaptive Nonparametric Density Estimators Adaptive Nonparametric Density Estimators by Alan J. Izenman Introduction Theoretical results and practical application of histograms as density estimators usually assume a fixed-partition approach, where

More information

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data July 2012 Bangkok, Thailand Cosimo Beverelli (World Trade Organization) 1 Content a) Classical regression model b)

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

G. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication

G. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication G. S. Maddala Kajal Lahiri WILEY A John Wiley and Sons, Ltd., Publication TEMT Foreword Preface to the Fourth Edition xvii xix Part I Introduction and the Linear Regression Model 1 CHAPTER 1 What is Econometrics?

More information

Defect Detection using Nonparametric Regression

Defect Detection using Nonparametric Regression Defect Detection using Nonparametric Regression Siana Halim Industrial Engineering Department-Petra Christian University Siwalankerto 121-131 Surabaya- Indonesia halim@petra.ac.id Abstract: To compare

More information

Cross-fitting and fast remainder rates for semiparametric estimation

Cross-fitting and fast remainder rates for semiparametric estimation Cross-fitting and fast remainder rates for semiparametric estimation Whitney K. Newey James M. Robins The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP41/17 Cross-Fitting

More information

IDENTIFICATION OF THE BINARY CHOICE MODEL WITH MISCLASSIFICATION

IDENTIFICATION OF THE BINARY CHOICE MODEL WITH MISCLASSIFICATION IDENTIFICATION OF THE BINARY CHOICE MODEL WITH MISCLASSIFICATION Arthur Lewbel Boston College December 19, 2000 Abstract MisclassiÞcation in binary choice (binomial response) models occurs when the dependent

More information

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Statistics 203: Introduction to Regression and Analysis of Variance Course review Statistics 203: Introduction to Regression and Analysis of Variance Course review Jonathan Taylor - p. 1/?? Today Review / overview of what we learned. - p. 2/?? General themes in regression models Specifying

More information

PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA

PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA Kasun Rathnayake ; A/Prof Jun Ma Department of Statistics Faculty of Science and Engineering Macquarie University

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

What s New in Econometrics? Lecture 14 Quantile Methods

What s New in Econometrics? Lecture 14 Quantile Methods What s New in Econometrics? Lecture 14 Quantile Methods Jeff Wooldridge NBER Summer Institute, 2007 1. Reminders About Means, Medians, and Quantiles 2. Some Useful Asymptotic Results 3. Quantile Regression

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION. 1. Introduction

INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION. 1. Introduction INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION VICTOR CHERNOZHUKOV CHRISTIAN HANSEN MICHAEL JANSSON Abstract. We consider asymptotic and finite-sample confidence bounds in instrumental

More information

Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix

Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix Yingying Dong and Arthur Lewbel California State University Fullerton and Boston College July 2010 Abstract

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

Multivariate Distributions

Multivariate Distributions Copyright Cosma Rohilla Shalizi; do not distribute without permission updates at http://www.stat.cmu.edu/~cshalizi/adafaepov/ Appendix E Multivariate Distributions E.1 Review of Definitions Let s review

More information

Model Selection for Semiparametric Bayesian Models with Application to Overdispersion

Model Selection for Semiparametric Bayesian Models with Application to Overdispersion Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS020) p.3863 Model Selection for Semiparametric Bayesian Models with Application to Overdispersion Jinfang Wang and

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

A Course in Applied Econometrics Lecture 7: Cluster Sampling. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Course in Applied Econometrics Lecture 7: Cluster Sampling. Jeff Wooldridge IRP Lectures, UW Madison, August 2008 A Course in Applied Econometrics Lecture 7: Cluster Sampling Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. The Linear Model with Cluster Effects 2. Estimation with a Small Number of roups and

More information

A Note on Data-Adaptive Bandwidth Selection for Sequential Kernel Smoothers

A Note on Data-Adaptive Bandwidth Selection for Sequential Kernel Smoothers 6th St.Petersburg Workshop on Simulation (2009) 1-3 A Note on Data-Adaptive Bandwidth Selection for Sequential Kernel Smoothers Ansgar Steland 1 Abstract Sequential kernel smoothers form a class of procedures

More information

Regression models for multivariate ordered responses via the Plackett distribution

Regression models for multivariate ordered responses via the Plackett distribution Journal of Multivariate Analysis 99 (2008) 2472 2478 www.elsevier.com/locate/jmva Regression models for multivariate ordered responses via the Plackett distribution A. Forcina a,, V. Dardanoni b a Dipartimento

More information

Chapter 2 Inference on Mean Residual Life-Overview

Chapter 2 Inference on Mean Residual Life-Overview Chapter 2 Inference on Mean Residual Life-Overview Statistical inference based on the remaining lifetimes would be intuitively more appealing than the popular hazard function defined as the risk of immediate

More information

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive

More information

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y

More information

On the econometrics of the Koyck model

On the econometrics of the Koyck model On the econometrics of the Koyck model Philip Hans Franses and Rutger van Oest Econometric Institute, Erasmus University Rotterdam P.O. Box 1738, NL-3000 DR, Rotterdam, The Netherlands Econometric Institute

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Econometric Analysis of Cross Section and Panel Data

Econometric Analysis of Cross Section and Panel Data Econometric Analysis of Cross Section and Panel Data Jeffrey M. Wooldridge / The MIT Press Cambridge, Massachusetts London, England Contents Preface Acknowledgments xvii xxiii I INTRODUCTION AND BACKGROUND

More information

Generalized Linear Models (GLZ)

Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) are an extension of the linear modeling process that allows models to be fit to data that follow probability distributions other than the

More information

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley Time Series Models and Inference James L. Powell Department of Economics University of California, Berkeley Overview In contrast to the classical linear regression model, in which the components of the

More information

Asymptotic distribution of GMM Estimator

Asymptotic distribution of GMM Estimator Asymptotic distribution of GMM Estimator Eduardo Rossi University of Pavia Econometria finanziaria 2010 Rossi (2010) GMM 2010 1 / 45 Outline 1 Asymptotic Normality of the GMM Estimator 2 Long Run Covariance

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

A Goodness-of-fit Test for Copulas

A Goodness-of-fit Test for Copulas A Goodness-of-fit Test for Copulas Artem Prokhorov August 2008 Abstract A new goodness-of-fit test for copulas is proposed. It is based on restrictions on certain elements of the information matrix and

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics Jiti Gao Department of Statistics School of Mathematics and Statistics The University of Western Australia Crawley

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

Recall the Basics of Hypothesis Testing

Recall the Basics of Hypothesis Testing Recall the Basics of Hypothesis Testing The level of significance α, (size of test) is defined as the probability of X falling in w (rejecting H 0 ) when H 0 is true: P(X w H 0 ) = α. H 0 TRUE H 1 TRUE

More information

Generalized Linear Models

Generalized Linear Models York SPIDA John Fox Notes Generalized Linear Models Copyright 2010 by John Fox Generalized Linear Models 1 1. Topics I The structure of generalized linear models I Poisson and other generalized linear

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

A comparison of different nonparametric methods for inference on additive models

A comparison of different nonparametric methods for inference on additive models A comparison of different nonparametric methods for inference on additive models Holger Dette Ruhr-Universität Bochum Fakultät für Mathematik D - 44780 Bochum, Germany Carsten von Lieres und Wilkau Ruhr-Universität

More information

Least Squares Model Averaging. Bruce E. Hansen University of Wisconsin. January 2006 Revised: August 2006

Least Squares Model Averaging. Bruce E. Hansen University of Wisconsin. January 2006 Revised: August 2006 Least Squares Model Averaging Bruce E. Hansen University of Wisconsin January 2006 Revised: August 2006 Introduction This paper developes a model averaging estimator for linear regression. Model averaging

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club

Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club 36-825 1 Introduction Jisu Kim and Veeranjaneyulu Sadhanala In this report

More information

Ninth ARTNeT Capacity Building Workshop for Trade Research "Trade Flows and Trade Policy Analysis"

Ninth ARTNeT Capacity Building Workshop for Trade Research Trade Flows and Trade Policy Analysis Ninth ARTNeT Capacity Building Workshop for Trade Research "Trade Flows and Trade Policy Analysis" June 2013 Bangkok, Thailand Cosimo Beverelli and Rainer Lanz (World Trade Organization) 1 Selected econometric

More information

Analogy Principle. Asymptotic Theory Part II. James J. Heckman University of Chicago. Econ 312 This draft, April 5, 2006

Analogy Principle. Asymptotic Theory Part II. James J. Heckman University of Chicago. Econ 312 This draft, April 5, 2006 Analogy Principle Asymptotic Theory Part II James J. Heckman University of Chicago Econ 312 This draft, April 5, 2006 Consider four methods: 1. Maximum Likelihood Estimation (MLE) 2. (Nonlinear) Least

More information

13 Endogeneity and Nonparametric IV

13 Endogeneity and Nonparametric IV 13 Endogeneity and Nonparametric IV 13.1 Nonparametric Endogeneity A nonparametric IV equation is Y i = g (X i ) + e i (1) E (e i j i ) = 0 In this model, some elements of X i are potentially endogenous,

More information

Economic modelling and forecasting

Economic modelling and forecasting Economic modelling and forecasting 2-6 February 2015 Bank of England he generalised method of moments Ole Rummel Adviser, CCBS at the Bank of England ole.rummel@bankofengland.co.uk Outline Classical estimation

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

Robust Backtesting Tests for Value-at-Risk Models

Robust Backtesting Tests for Value-at-Risk Models Robust Backtesting Tests for Value-at-Risk Models Jose Olmo City University London (joint work with Juan Carlos Escanciano, Indiana University) Far East and South Asia Meeting of the Econometric Society

More information

University of Pavia. M Estimators. Eduardo Rossi

University of Pavia. M Estimators. Eduardo Rossi University of Pavia M Estimators Eduardo Rossi Criterion Function A basic unifying notion is that most econometric estimators are defined as the minimizers of certain functions constructed from the sample

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

Identification and Estimation of Regression Models with Misclassification

Identification and Estimation of Regression Models with Misclassification Identification and Estimation of Regression Models with Misclassification Aprajit Mahajan 1 First Version: October 1, 2002 This Version: December 1, 2005 1 I would like to thank Han Hong, Bo Honoré and

More information