CALCULATING DEGREES OF FREEDOM IN MULTIVARIATE LOCAL POLYNOMIAL REGRESSION
NADINE MCCLOUD AND CHRISTOPHER F. PARMETER

Abstract. The matrix that transforms the response variable in a regression into its predicted value is commonly referred to as the hat matrix. The trace of the hat matrix is a standard metric for calculating degrees of freedom. Nonparametric hat matrices do not enjoy all the properties of their parametric counterparts, in part because the former do not always stem directly from a traditional ANOVA decomposition. In the multivariate, local polynomial setup with a mix of continuous and discrete covariates, including some irrelevant covariates, we formulate asymptotic expressions for the traces of the resultant non-ANOVA and ANOVA-based hat matrices from the estimator of the unknown conditional mean. The asymptotic expression for the trace of the non-ANOVA hat matrix associated with the conditional mean estimator is equal, up to a linear combination of kernel-dependent constants, to that of the ANOVA-based hat matrix. Additionally, we document that the trace of the ANOVA-based hat matrix converges to 0 in any setting where the bandwidths diverge. This attrition outcome can occur in the presence of irrelevant continuous covariates, or it can arise when the underlying data generating process is in fact of polynomial order. Simulated examples demonstrate that our theoretical contributions are valid in finite-sample settings.

1. Introduction

The hat matrix plays a fundamental role in regression analysis; the elements of this matrix have well-known properties and are used to construct variances and covariances of the residuals. In particular, the trace of the hat matrix is commonly used to calculate degrees of freedom, and appears in regression diagnostics, in constructing measures of fit, and in conceptualizing what a residual is.
And while the hat matrix is commonly called upon in parametric estimation and inference, its use in nonparametric settings is much less prevalent. One potential reason that the hat matrix has been less commonly deployed in nonparametric regression analysis is that several notions of the hat matrix exist, which leads to alternative versions of both overall model fit and degrees of freedom (which can subsequently be used to penalize in-sample fit).¹

¹ In the nonparametric literature, this hat matrix is also referred to as the smoother matrix.
Date: November 21,
Key words and phrases. Trace, Degrees of Freedom, Effective Parameters, Nonparametric Regression, Irrelevant Regressors, Bandwidth, Goodness-of-fit.

Within a univariate framework, Huang & Chen (2008) provide an ANOVA decomposition of the total sum of squares into its explained and residual components. However, this ANOVA decomposition is local in nature and needs to be integrated to achieve a global hat matrix. Moreover, in multivariate settings, which are of practical appeal, calculation of this global ANOVA hat matrix is likely to be difficult. This is irksome for the calculation of the degrees of freedom of the model as well, given that the proper hat matrix stemming from this ANOVA decomposition needs to be integrated. An alternative would be to use the trace of the hat matrix stemming directly from the local polynomial method of estimating the unknown conditional mean (Ruppert & Wand 1994, Fan & Gijbels 1996). In fact, Zhang (2003) discusses exactly this non-ANOVA case in the univariate setting. Here our goal is to augment the work of Huang & Chen (2008) and Zhang (2003) by considering the calculation of degrees of freedom in a multivariate, local polynomial setting with a mix of continuous and discrete covariates. From this platform we can compare the effective number of parameters, k_eff, stemming from the trace of the global ANOVA hat matrix and its non-ANOVA counterpart. Our generalizations and juxtapositions allow us to make the following nontrivial contributions to the existing literature. One, in the presence of mixed discrete and continuous covariates, the difference in the asymptotic expressions for the traces of the ANOVA and non-ANOVA hat matrices is driven by a linear combination of moments of the underlying kernel. This suggests that the non-ANOVA hat matrix taken directly from the multivariate, local polynomial estimator can be used to approximate degrees of freedom for the ANOVA hat matrix, as the latter is more computationally intensive in applied settings with multiple continuous covariates.
For example, using a bivariate regression, and for local constant, local linear, and local cubic estimators, we show that the absolute differences between the asymptotic expressions for the traces of the ANOVA and non-ANOVA hat matrices lie in the unit interval. Two, to improve the usefulness of our work in applied settings, we also give consideration to nonparametric regression models in which some covariates are irrelevant. The non-ANOVA nonparametric framework has been the workhorse for the analysis of irrelevant covariates. We show that this framework also lends itself well to meaningful asymptotic expressions for the trace of the implied hat matrix in the presence of irrelevant continuous and discrete covariates. Intuitively, the trace of the non-ANOVA hat matrix is the ratio of two kernel terms that are of equal order of magnitude in the bandwidth vector for the continuous covariates; thus, the influence of the bandwidth vector for the irrelevant continuous covariates on the kernel ratio is dominated by that of its relevant counterpart. We show that the bandwidth vector for the irrelevant continuous covariates has an attrition effect on the trace of the ANOVA hat matrix, resulting in the latter converging to zero in
probability. Although the trace of the ANOVA hat matrix is also a ratio of two kernel terms, these kernel terms are of different orders of magnitude in the bandwidth vector for the continuous covariates. This paves the way for a sizable influence of the bandwidth vector for the irrelevant continuous covariates relative to its relevant counterpart. In fact, our simulation results confirm this attrition effect of irrelevant regressors on the trace of the ANOVA hat matrix for the local constant, local linear, and local cubic estimators. One implication of this attrition effect is that the nonparametric ANOVA-based F-tests developed by Huang & Chen (2008), Huang & Su (2009) and Huang & Davidson (2010) may not be operational in the presence of such covariates; degrees of freedom from the non-ANOVA framework may be suitable substitutes for their ANOVA counterparts when irrelevant continuous variables are likely to be present in the underlying nonparametric model. Three, we formalize the trace concept for the non-ANOVA and ANOVA hat matrices that are predicated on only discrete covariates to provide a measure of the degrees of freedom for the underlying nonparametric model. In the presence of only relevant discrete covariates, the traces of the ANOVA and non-ANOVA hat matrices all converge in probability to the cardinality of the discrete support. Although this result holds when irrelevant discrete covariates are also present in the nonparametric model, the asymptotic trace values in this case can exceed their purely relevant counterparts.
This latter result draws on the theoretical contributions of Ouyang, Li & Racine (2009), who establish, in a purely discrete-covariate setting with least-squares cross-validation (LSCV), that the irrelevant regressors cannot be smoothed out with probability approaching one as the sample size increases; that is, there is a positive probability that the bandwidths selected via LSCV do not converge to their upper extreme values even as n → ∞. In essence, with positive probability, the presence of irrelevant discrete covariates can lead to asymptotic trace values of the ANOVA and non-ANOVA hat matrices that are larger than the cardinality of the support for the relevant discrete covariates. This also means that these asymptotic trace values will exhibit large variances in the presence of irrelevant discrete covariates. Four, unlike the parametric hat matrix, the geometric properties of the non-ANOVA hat matrix stemming directly from multivariate local polynomial estimation of the unknown conditional mean with mixed data have yet to be formalized. We show that while the non-ANOVA hat matrix is not a projection matrix, it shares many of the same geometric properties as its parametric counterpart. These properties of the hat matrix are of importance in, for example, assessing the amount of leverage or influence that y_j has on ŷ_i, which is related to the (i, j)-th entry of the hat matrix. In the special case of a local constant estimator, we
deduce that each ŷ_i is a convex combination of the response vector Y; this convexity property indicates how large the leverage of y_i on its corresponding fitted value ŷ_i is. Thus, our work can also be used to identify high-leverage points and improve model fit in multivariate local polynomial estimation. In essence, our theoretical contributions are of independent interest to the wider nonparametric literature. Additionally, we can use the trace of the hat matrix as a measure of the effective number of parameters used by the local polynomial model, and this can provide insight into which covariates in the model are relevant. Whereas the theories of Hall, Li & Racine (2007), Hall & Racine (2015) and Ouyang et al. (2009) can shed light on relevancy through the size of the bandwidth and the order of the local polynomial (when selected in a data-driven manner), they do not provide an exact number of parameters. Thus, use of the hat matrix here, with these data-driven methods, can generate additional insight into how the local polynomial estimator adapts to the data.

The remainder of the paper is organized as follows. Section 2 provides a short review of the geometric properties of the hat matrix from a linear parametric model. Section 3 derives nonasymptotic and asymptotic results for the trace of the non-ANOVA-based hat matrix from the multivariate, local polynomial model with a mix of continuous and discrete, and relevant and irrelevant, covariates. Under similar model specifications, Section 4 derives asymptotic results for the trace of the ANOVA-based hat matrix. Section 5 explores the implications of our theoretical results using simulated data. Section 6 contains the conclusion. We place all proofs in the technical appendix.

2. A Brief Review of the Hat Matrix for the Canonical Parametric Model

Consider the situation where one is interested in estimation of the regression

(1) y_i = m(x_i) + ε_i,

where i = 1, 2,..., n, y_i is our regressand, x_i is a q-vector of regressors, and ε_i is the idiosyncratic noise. If we parameterize our function in (1) to be linear in parameters, m(x_i) = x_i^T β, where β ∈ R^q, then we can estimate the model via least squares to obtain β̂ = (X^T X)^{-1} X^T y, where X is the full n × q design matrix with rank q and y is the n × 1 vector of responses. The vector of fitted values, ŷ, is given by

(2) ŷ = X(X^T X)^{-1} X^T y = Hy.
The matrix H in (2) is a projection matrix and thus, by definition, H is idempotent: H² = H. By construction, H is symmetric, H^T = H (in this case H is an orthogonal projection matrix). Since premultiplying y by this matrix H puts a hat on y, it is often called the hat matrix. Using basic properties of projection matrices, Hoaglin & Welsch (1978) show that the elements of H, h_ij, satisfy the first four of the following geometric properties:

(i) 0 ≤ h_ii ≤ 1.
(ii) |h_ij| ≤ 1 for i ≠ j.
(iii) h_ii = 1 iff h_ij = 0 for all j ≠ i.
(iv) If X contains a column of ones, then ∑_{j=1}^n h_ij = 1.
(v) HX = X. Equivalently, HX_c = X_c, where X_c is any column of the design matrix X.

Properties (i) and (ii) are boundedness conditions on the entries of H. Note that by symmetry of H, property (iv) is equivalent to ∑_{i=1}^n h_ij = 1. Thus, each ŷ_i is an affine combination of the elements of y. We add the invariance property, see (v), as it will help us establish some important results in the subsequent section. Note that property (v) nests (iv). The trace of the parametric hat matrix is commonly used to calculate degrees of freedom since, by virtue of the cyclic permutation property of the trace operator,

(3) tr(H) = tr(X(X^T X)^{-1} X^T) = tr((X^T X)^{-1} X^T X) = tr(I_q) = q,

the number of covariates that we included in the parametric model. For nonparametric regressions, three orthodox definitions of k_eff, which are identical for linear models, are tr(H_{τ,γ}^T H_{τ,γ}), tr(H_{τ,γ}), and tr(2H_{τ,γ} − H_{τ,γ}^T H_{τ,γ}) (see, e.g., Hastie & Tibshirani 1990). In subsequent sections, we show that properties (i) to (v) hold for the multivariate local polynomial regression estimator of the unknown conditional mean. The equality in (3) between the trace of the hat matrix and the rank of the design matrix is one of the distinguishing characteristics of the parametric framework that is not always possessed by its nonparametric counterpart.
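As a quick numerical check of these facts, the sketch below (our own illustration, not from the paper; the design matrix is an arbitrary simulated one) builds H for a linear model with an intercept and verifies symmetry, idempotency, properties (i) and (iv)-(v), and the trace identity (3):

```python
# Sketch (not from the paper): verify hat matrix properties for an
# assumed random design with an intercept column.
import numpy as np

rng = np.random.default_rng(0)
n, q = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])  # column of ones included
H = X @ np.linalg.solve(X.T @ X, X.T)                           # H = X (X'X)^{-1} X'

assert np.allclose(H, H.T)                   # symmetric
assert np.allclose(H @ H, H)                 # idempotent (projection)
assert np.allclose(np.trace(H), q)           # (3): tr(H) = q
d = np.diag(H)
assert np.all((d >= 0) & (d <= 1 + 1e-12))   # property (i)
assert np.allclose(H.sum(axis=1), 1.0)       # property (iv): rows sum to one
assert np.allclose(H @ X, X)                 # property (v): invariance, HX = X
```

With q = 4 columns, tr(H) equals 4 regardless of n, which is exactly the parametric degrees-of-freedom count.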
Heuristically, the hat matrix in the local polynomial regression is predicated on an effective design matrix with column rank that increases with the order of the local polynomial. For the multivariate, local polynomial estimator of the unknown conditional mean, we show that the rank of the effective design matrix is the infimum for the trace of the resultant hat matrix.
3. The non-ANOVA hat matrix for the multivariate local polynomial estimator

To generalize the results in Zhang (2003), we embed a mix of continuous and discrete regressors into the multivariate smooth varying-coefficient model (Hastie & Tibshirani 1993) with conditionally linear structure

(4) y = ∑_{k=1}^{d_z} m_k(X) Z_k + ε,

where y is a scalar regressand, X = (X_1^c,..., X_q^c, X_1^d,..., X_r^d)^T and Z = (Z_1,..., Z_{d_z})^T are the given covariates with Z_1 ≡ 1, E(ε | X, Z) = 0, and var(ε | X, Z) = σ²(X, Z). When d_z = 1 the model in (4) is just a multivariate nonparametric regression model; thus, allowing d_z ≥ 1 allows for a broader array of models. Denote X_i^d in X_i = (X_i^c, X_i^d) as the r × 1 vector of regressors that takes discrete values, and denote X_i^c ∈ R^q as the vector of continuous regressors. Let Ω^d and Ω be the support of X_i^d and X_i^c, respectively. Let the s-th component of x^d be x_s^d, which takes c_s different values in Ω_s^d = {0, 1,..., c_s − 1} for s = 1,..., r, where c_s ≥ 2 is a finite positive constant. Then the cardinality of the set Ω_s^d is c_s, which we denote as |Ω_s^d|. Assume X has a sampling density f_X with a known bounded support Ω × Ω^d. Furthermore, assume the square matrix E(ZZ^T | X = x) has strictly positive eigenvalues for each x ∈ Ω × Ω^d to guarantee identifiability of the model in (4). We let γ be the bandwidth vector for X. As in Li & Racine (2007), we use the partition γ = (h^T, λ^T)^T to reflect the presence of continuous and discrete regressors in X, with bandwidth subvectors h and λ, respectively. For the case of unordered discrete regressors X_i^d, we follow Li & Racine (2007), who use a variant of the kernel function of Aitchison & Aitken (1976) that is defined by

(5) l(X_is^d, x_s^d, λ_s) = 1 if X_is^d = x_s^d, and λ_s if X_is^d ≠ x_s^d,

where 0 ≤ λ_s ≤ 1 is the smoothing parameter of x_s^d.
Then the product kernel for x^d = (x_1^d,..., x_r^d)^T is

L(x^d, X_i^d, λ) = ∏_{s=1}^r l(X_is^d, x_s^d, λ_s) = ∏_{s=1}^r λ_s^{1(X_is^d ≠ x_s^d)},

where 1(X_is^d ≠ x_s^d) is an indicator function that equals 1 when X_is^d ≠ x_s^d, and 0 otherwise.
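A minimal sketch of the kernel in (5) and the product kernel above (our own illustration; the function name and the two-covariate example are assumptions, not from the paper):

```python
# Sketch of the unordered-discrete product kernel:
# L(x^d, X_i^d, lambda) = prod_s lambda_s ** 1(X_is^d != x_s^d).
import numpy as np

def aitchison_aitken_product(x_d, X_d, lam):
    """Product kernel weight of observation X_d at evaluation point x_d."""
    x_d, X_d, lam = map(np.asarray, (x_d, X_d, lam))
    mismatch = (X_d != x_d).astype(float)   # indicator 1(X_is^d != x_s^d)
    return float(np.prod(lam ** mismatch))

# lam_s = 0 gives an indicator kernel; lam_s = 1 gives uniform weights.
w_match = aitchison_aitken_product([1, 0], [1, 0], [0.3, 0.5])  # all components match
w_mix   = aitchison_aitken_product([1, 0], [1, 1], [0.3, 0.5])  # one mismatch
print(w_match, w_mix)  # 1.0 0.5
```

The weight is 1 when the observation matches the evaluation point in every discrete component, and shrinks by a factor λ_s for each mismatching component s.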
We let K_h(·) be the generalized product kernel (Li & Racine 2007, Henderson & Parmeter 2015),

(6) K_h(x^c, X_i^c) = ∏_{s=1}^q h_s^{-1} K((x_s^c − X_is^c)/h_s),

where K(·) is a symmetric density function on R. Define K_γ(x, X_i) to be the kernel function for the mixed regressor (x^c, x^d):

K_γ(x, X_i) = K_h(x^c, X_i^c) L(x^d, X_i^d, λ).

To obtain an estimate of the unknown smooth functions {m_k(x)}_{k=1}^{d_z} and their population mean regression function m(x_1,..., x_q, z_1,..., z_{d_z}) = ∑_{k=1}^{d_z} m_k(x) z_k from the observations {Y_i, X_i, Z_i}_{i=1}^n, we employ the p-th order local polynomial estimation method. In what follows, we adopt the notation of Masry (1996a, 1996b). Thus, for a p-th order local polynomial estimation the corresponding objective function is

(7) min_β n^{-1} ∑_{i=1}^n ( Y_i − ∑_{k=1}^{d_z} ∑_{0≤|j|≤p} β_j (X_i^c − x^c)^j Z_ik )² K_γ(x, X_i),

where j = (j_1,..., j_q), |j| = ∑_{i=1}^q j_i, x^j = ∏_{i=1}^q x_i^{j_i}, j! = ∏_{i=1}^q j_i! = j_1! ⋯ j_q!, and

∑_{0≤|j|≤p} = ∑_{l=0}^p ∑_{j_1=0}^l ⋯ ∑_{j_q=0}^l with j_1 + ⋯ + j_q = l.

Here j! β_j(x) corresponds to (D^j m)(x), the partial derivative of m(x) = m(x^c, x^d) with respect to x^c, which is defined as

(8) (D^j m)(x) ≡ ∂^{|j|} m(x) / [∂(x_1^c)^{j_1} ⋯ ∂(x_q^c)^{j_q}],

and β vertically concatenates β_j (0 ≤ |j| ≤ p) in lexicographical order (with highest priority to the last position, so that (0,..., 0, i) is the first element in the sequence and (i, 0,..., 0) is the last element), and g_i^{-1} denotes this one-to-one map. Note that (7) handles the continuous regressor vector x^c in a local polynomial manner but the discrete regressor vector x^d in a local constant manner. Let β̂(x; γ) be the estimator for the weighted least squares problem in (7), and define S_n(x) = D(x^c)^T W(x) D(x^c) and T_n(x) = D(x^c)^T W(x), with D(x^c) = [D_1(x^c),..., D_n(x^c)]^T, where D_i(x^c) vertically concatenates (X_i^c − x^c)^j ⊗ Z_i for 0 ≤ |j| ≤ p in lexicographical order, with Z_i = (Z_1i,..., Z_{d_z i})^T, and ⊗ denotes the Kronecker operator.
Thus, for example, D_i(x^c) = Z_i for p = 0, and D_i(x^c) = [1, (X_i^c − x^c)^T]^T ⊗ Z_i for p = 1. Here W(x) is a diagonal matrix with the i-th diagonal element being K_γ(x, X_i). Now, let

N_{p,l} = (l + q − 1)! / [l! (q − 1)!]

be the number of distinct q-tuples j with |j| = l for 0 ≤ l ≤ p. That is, N_{p,l} is the number of distinct l-th order partial derivatives of m(x) with respect to x^c. Set N_p ≡ ∑_{l=0}^p N_{p,l}. The minimizer of the local polynomial least squares procedure at x is

(9) β̂(x; γ) = S_n^{-1}(x) T_n(x) Y.

If our estimate of interest is the unknown regression function, it is the first d_z elements of this vector, so that we have m̂(x) = (e_{1,N_p}^T ⊗ I_{d_z}) β̂(x; γ), where e_{τ+1,N_p} is the (τ + 1)-th standard basis vector in the coordinate space R^{N_p} and I_{d_z} is the identity matrix in R^{d_z}. In practice, however, interest may lie in a specific derivative, say of first or second order, of the regression function. Thus, we define m̂_τ(x) = (e_{τ+1,N_p}^T ⊗ I_{d_z}) β̂(x; γ), where τ = 0, 1,..., N_p − 1. That is, (τ + 1) is the lexicographical position of the derivative vector m of interest. Thus the relationship between τ and m is:

(10) τ = g_{|m|}^{-1}(m) + ∑_{r=0}^{|m|−1} N_{p,r}.

Then, we can cast our local polynomial regression estimator as

(11) m̂_τ(x, z) = ∑_{k=1}^{d_z} m̂_{τ,k}(x) z_k = m̂_τ(x)^T z = (e_{τ+1,N_p}^T ⊗ z^T) S_n^{-1}(x) D(x^c)^T W(x) Y = ∑_{j=1}^n A_{τ,j}(x) y_j,

where z = (z_1,..., z_{d_z})^T and

(12) A_{τ,j}(x) ≡ (e_{τ+1,N_p}^T ⊗ z^T) S_n^{-1}(x) D_j(x^c) K_γ(x, X_j).

If we replace the generic x with our n observations, X_i, and set τ = 0, then we obtain the fitted values for our data. Further, we can use these n observations to construct our hat matrix H_{τ,γ}. From (11) we have that the (i, j)-th element of our hat matrix is

(13) H_{τ,γ}(i, j) = A_{τ,i}(X_j), for i, j = 1,..., n.
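The pipeline (9)-(13) can be illustrated in the simplest setting. The sketch below (our own, assuming d_z = 1 with Z ≡ 1, q = 1, p = 1, and a Gaussian kernel) builds the non-ANOVA hat matrix row by row and checks that each row sums to one, i.e., that the fit at each point is an affine combination of the responses:

```python
# Minimal sketch (assumed univariate design, d_z = 1, Z = 1, p = 1):
# build the non-ANOVA hat matrix for the local linear estimator.
import numpy as np

def local_linear_hat(x, h):
    """H[i, j] = A_{0,j}(x_i): weight of y_j in the fit at x_i."""
    n = len(x)
    H = np.empty((n, n))
    for i in range(n):
        d = x - x[i]                               # (X_j^c - x^c)
        K = np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2 * np.pi))
        D = np.column_stack([np.ones(n), d])       # D(x^c) for p = 1
        S = D.T @ (K[:, None] * D)                 # S_n(x) = D' W D
        e1 = np.array([1.0, 0.0])                  # picks out the fit (tau = 0)
        H[i] = (np.linalg.solve(S, e1) @ D.T) * K  # A_{0,j}(x_i), j = 1..n
    return H

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 80)
H = local_linear_hat(x, h=0.15)
assert np.allclose(H.sum(axis=1), 1.0)             # each fit is an affine combination
print(np.trace(H))                                  # effective number of parameters
```

Note that H is not symmetric and not idempotent here, yet its trace still serves as a k_eff measure, which is the theme of the results that follow.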
Geometric Properties of the non-ANOVA hat matrix. Clearly, H_{τ,γ} is not symmetric in this generalization. To show that the properties in (i)-(iv) hold for H_{τ,γ}(i, j), i, j = 1,..., n, we will use the following result, which is the local polynomial, multivariate, generalized kernel extension of Zhang (2003, Lemma A.1).

Lemma 3.1. For a nonnegative kernel satisfying K_γ(0) = sup_{u,v} K_γ(u, v), we have that
(1) ∑_{j=1}^n H²_{τ,γ}(i, j) ≤ H_{τ,γ}(i, i) for i = 1,..., n;
(2) Let X^l = (X_1^l,..., X_n^l)^T for any l such that 0 ≤ |l| ≤ p. Assume the relationship between τ and its derivative vector m is as defined in (10). Then

∑_{j=1}^n H_{τ,γ}(i, j) (X_j^l Z_j) = [∏_{s=1}^q (l_s choose m_s)] X_i^{l−m} Z_i,

and, consequently, H_{0,γ} X^l = X^l.

Part (1) of Lemma 3.1 implies that 0 ≤ H_{τ,γ}(i, i) ≤ 1 for i = 1,..., n and |H_{τ,γ}(i, j)| ≤ 1 for i ≠ j. Thus, properties (i) and (ii) hold. To show that property (iv) also holds, observe from Proposition A.1 (see appendix) that ∑_{j=1}^n A_{τ,j}(x) Z_j = δ_{0,τ} z, and thus, as the first entry of Z_j is normalized to 1,

(14) ∑_{j=1}^n H_{τ,γ}(i, j) = 1 if τ = 0, and 0 if τ > 0.

Property (iii) holds by virtue of properties (i), (ii) and (iv). For τ = 0, part (2) of Lemma 3.1 is equivalent to property (v). Equation (14) shows that for each i-th observation, the conditional mean estimator is an affine combination of the i-th row of the associated hat matrix, whereas its derivative estimator counterpart is a zero-sum linear combination. Note that the invariance property of H_{τ,γ} is absent for τ > 0 (see part (2)), which is intuitive: for derivative estimators, H_{τ,γ} is associated with dimensionality reduction but preservation of the span of a polynomial ring. Thus, Lemma 3.1 reveals that for nonparametric regression, H_{τ,γ} can be exploited to conduct diagnostic analyses, such as leverage effects, which have been used in parametric settings. In fact, the hat matrix for the local-constant least-squares (LCLS) estimator renders easy interpretation of leverage effects.
To see this, assume that for model (4), d_z = 1 and p = 0. Then the LCLS estimator of our model is

(15) m̂_0(x_i) = ∑_{j=1}^n K_γ(x_j, x_i) y_j / ∑_{l=1}^n K_γ(x_l, x_i) = ∑_{j=1}^n A_{0,j}(x_i) y_j,

where

(16) A_{0,j}(x_i) = K_γ(x_j, x_i) / ∑_{l=1}^n K_γ(x_l, x_i).

We use the notation A_ij = A_{0,i}(x_j) = H_{0,γ}(i, j) from (15) for parsimony. Observe that by the definition of the product kernel, K_γ(x_j, x_i), and of A_ij for H_{0,γ}, properties (i) and (ii) become 0 ≤ A_ij ≤ 1 for all i, j. However, unlike the parametric model, the elements of the local constant hat matrix are all positive. Clearly, from (16), H_{0,γ} is a symmetric matrix, as in the parametric case. Thus, ∑_{j=1}^n A_ij = ∑_{i=1}^n A_ij = 1. More important, this normalization property holds unconditionally, rather than conditionally on the regressor matrix. Combining this restriction with property (iii) yields that each ŷ_i is a convex combination of the response vector Y; this convexity restriction, although clearly a stronger condition than its parametric counterpart, renders easier detection and comparison of high-leverage points relative to other points.

Degrees of Freedom for Mixed Regressors. Following Zhang (2003), we first formulate a relationship among three common measures of k_eff: tr(H_{τ,γ}^T H_{τ,γ}), tr(H_{τ,γ}), and tr(2H_{τ,γ} − H_{τ,γ}^T H_{τ,γ}). To do so, we make use of the bound implied by part (1) of Lemma 3.1:

(17) tr(H_{τ,γ}^T H_{τ,γ}) = ∑_{i=1}^n ∑_{j=1}^n {H_{τ,γ}(i, j)}² ≤ ∑_{i=1}^n H_{τ,γ}(i, i) = tr(H_{τ,γ}).

In addition, using the fact that (I_n − H_{τ,γ})^T (I_n − H_{τ,γ}) is positive definite, we have

tr[(I_n − H_{τ,γ})^T (I_n − H_{τ,γ})] = n − tr(2H_{τ,γ} − H_{τ,γ}^T H_{τ,γ}) > 0,

given equality of the trace of a matrix and that of its transpose. This result, coupled with (17), implies that tr(H_{τ,γ}^T H_{τ,γ}) ≤ tr(H_{τ,γ}) ≤ tr(2H_{τ,γ} − H_{τ,γ}^T H_{τ,γ}) < n. An implication of part (2) of Lemma 3.1 is that H_{τ,γ} projects matrices onto a space with column span of dimension d_z N_p. Thus, tr(H_{τ,γ}^T H_{τ,γ}) ≥ d_z N_p. This follows from Schur's inequality
(Lütkepohl 1996, p. 43), along with the fact that the trace of a matrix is equal to the sum of its eigenvalues. We now formalize our generalization of Zhang's (2003) nonasymptotic bound as follows.

Proposition 3.1. Assume the kernel condition in Lemma 3.1 holds. For the multivariate local polynomial regression defined in (4) and any h ∈ R_+^q and λ ∈ [0, 1]^r, we have

d_z N_p ≤ tr(H_{τ,γ}^T H_{τ,γ}) ≤ tr(H_{τ,γ}) ≤ tr(2H_{τ,γ} − H_{τ,γ}^T H_{τ,γ}) < n,

where N_p is defined above.

Next, we present the asymptotic results for tr(H_{τ,γ}), tr(H_{τ,γ}^T H_{τ,γ}), and tr(2H_{τ,γ} − H_{τ,γ}^T H_{τ,γ}). A few assumptions are in order.

Assumption A.1. (x_i, z_i, y_i) are i.i.d. as (X, Z, Y). The covariates X = (X^c, X^d) have bounded support Ω × Ω^d, and the density f(x) of X is Lipschitz continuous in x^c and bounded away from zero on Ω × Ω^d.

Assumption A.2. m_k(x^c, x^d), for k = 1,..., d_z, has continuous (p + 1)-th derivatives in x^c on Ω for each x^d ∈ Ω^d.

Assumption A.3. The fourth moment of ε exists and is strictly positive.

Assumption A.4. The kernel K(·) is a symmetric, nonnegative, and bounded continuous probability density function having compact support. Specifically, |u|^{4p} K(u) ∈ L¹ and |u|^{4p+q} K(u) → 0 as |u| → ∞.

Assumption A.5. The matrix E(ZZ^T | X = x) has strictly positive eigenvalues for each x ∈ Ω × Ω^d, and each entry is Lipschitz continuous in x^c.

These conditions are identical or similar, up to generalizations, to those made by Zhang (2003) and Huang & Chen (2008). For example, Assumptions A.1 and A.4 ensure convergence in the mean-square sense of the matrices of multivariate moments of the kernel K from the multivariate local polynomial estimation, due to Masry (1996b).²

² Masry (1996a) establishes uniform strong consistency of these moment matrices.

To state our main theorem, we introduce some additional notation. Define Γ(x) = f_X(x) E(ZZ^T | X = x). For each j with 0 ≤ |j| ≤ 2p, define

µ_j = ∫_{R^q} u^j K(u) du,    ν_j = ∫_{R^q} u^j K²(u) du,
and the N_p × N_p dimensional block matrices

M = (M_{i,j})_{0 ≤ i,j ≤ p},    S = (S_{i,j})_{0 ≤ i,j ≤ p},

where M_{i,j} and S_{i,j} are N_i × N_j dimensional matrices whose (l, m) elements are µ_{g_i(l)+g_j(m)} and ν_{g_i(l)+g_j(m)}, respectively. Hence, the matrices M and S are the multivariate moments of the kernels K and K², respectively. Given that our kernel function is a probability density function, µ_j is the j-th raw moment of the kernel and ν_j is the kernel-weighted j-th raw moment.

Let Δ_{h,i} = (X_i^c − x^c) ⊘ h, where ⊘ represents Hadamard division, and let Δ_{h,i}^{(p)} vertically concatenate [Δ_{h,i}]^j for 0 ≤ |j| ≤ p in lexicographical order. Also, let Δ_{d,i} = X_i^d − x^d. Then, for a fully nonparametric regression model with mixed continuous and discrete regressors, we define the associated multivariate equivalent kernel, K*_τ(·), as

(18) K*_τ(Δ_{h,i}, Δ_{d,i}) = e_{τ+1,N_p}^T M^{-1} Δ_{h,i}^{(p)} K(Δ_{h,i}) L(x^d, X_i^d, λ).

Clearly, K*_τ(0, 0) = e_{τ+1,N_p}^T M^{-1} e_{1,N_p} [K(0)]^q, and we set K*_τ(0, 0) = K*_τ(0). As noted in Fan & Gijbels (1996), the equivalent kernel is the weighting scheme that arises from the specific kernel, the polynomial order chosen, and the location of the design points relative to the point of evaluation. The multivariate equivalent kernel is the effective weighting scheme that produces the estimator for β_j(x) for given bandwidth γ. When p = 0, the multivariate equivalent kernel is identical to the product kernel, but for p > 0, the multivariate equivalent kernel can automatically adapt to alternative data designs as well as account for boundary estimation.

Theorem 3.2. Assume the relationship between τ and its derivative vector m is as defined in (10), and that the hat matrix H_{τ,γ} is based on (12) with γ = (h^T, λ^T)^T. Let the bandwidth vector λ be such that λ ∈ [0, b_n]^r, where b_n is a positive sequence that converges to zero as n → ∞.
Under Assumptions A.1 to A.5, with h_s → 0, for s = 1,..., q, as n → ∞,
and nh_1 ⋯ h_q → ∞, we have

(19) tr(H_{τ,γ}) = [d_z K*_τ(0) / (h^m ∏_{s=1}^q h_s)] |Ω × Ω^d| {1 + o_P(1)},

(20) tr(H_{τ,γ}^T H_{τ,γ}) = [d_z (K*_τ ∗ K*_τ)(0) / (h^{2m} ∏_{s=1}^q h_s)] |Ω × Ω^d| {1 + o_P(1)},

(21) tr(2H_{τ,γ} − H_{τ,γ}^T H_{τ,γ}) = [d_z / ∏_{s=1}^q h_s] [2K*_τ(0)/h^m − (K*_τ ∗ K*_τ)(0)/h^{2m}] |Ω × Ω^d| {1 + o_P(1)},

where h^m = ∏_{s=1}^q h_s^{m_s}, |Ω × Ω^d| is the volume of the Cartesian product Ω × Ω^d, ∗ is the convolution operator, and (K*_τ ∗ K*_τ)(0) = e_{τ+1,N_p}^T M^{-1} S M^{-1} e_{τ+1,N_p}.

Theorem 3.2 shows that the asymptotic k_eff are proportional to the total number of covariates in model (4), d_z, inversely proportional to the bandwidths for the continuous regressors, but unrelated to the bandwidths for the discrete regressors. This latter result on the discrete covariates cannot be inferred from Zhang's (2003) asymptotic approximations for the trace of the resultant hat matrices for the local polynomial estimator of the conditional mean (τ = 0) for model (4) with a scalar, continuous X. Theorem 3.2 also shows that each asymptotic k_eff is independent of the mixed design density f. In the case where the discrete regressors are ordered, Li & Racine (2007) suggest, in lieu of (5), the use of the following kernel function:

(22) l(X_is^d, x_s^d, λ_s) = 1 if X_is^d = x_s^d, and λ_s^{|X_is^d − x_s^d|} if X_is^d ≠ x_s^d,

where the range of the smoothing parameter λ_s of x_s^d is [0, 1]. As in the unordered case, if λ_s is equal to its minimum value the function in (22) is an indicator function; if λ_s is equal to its maximum then (22) is a uniform weight function. In this case, the results of Theorem 3.2 continue to hold.

Degrees of Freedom for Continuous Regressors. In the purely continuous case, we assume that j! β̂_j(x^c) estimates (D^j m)(x^c). Then the p-th order local polynomial estimation has the following objective function:

min_β n^{-1} ∑_{i=1}^n ( Y_i − ∑_{k=1}^{d_z} ∑_{0≤|j|≤p} β_j (X_i^c − x^c)^j Z_ik )² K_h(x^c, X_i^c).
Thus, D(x^c) is as defined earlier, and we redefine the diagonal weighting matrix W(x^c) so that its i-th diagonal entry is K_h(x^c, X_i^c). Then, for the continuous regressor case, we have

(23) A_{τ,j}(x^c) ≡ (e_{τ+1,N_p}^T ⊗ z^T) S_n^{-1}(x^c) D_j(x^c) K_h(x^c, X_j^c).

Using (23) to define the (i, j)-th element of our hat matrix, H_{τ,h}, as is done in (13), note that the results of Lemma 3.1 and Proposition A.1 continue to hold in the continuous regressor case. To generate asymptotic expressions for tr(H_{τ,h}), tr(H_{τ,h}^T H_{τ,h}), and tr(2H_{τ,h} − H_{τ,h}^T H_{τ,h}) in the continuous regressor case, we need to modify Assumptions A.1, A.2, and A.5, respectively, as follows:

Assumption B.1. (x_i^c, z_i, y_i) are i.i.d. as (X^c, Z, Y). The covariates X^c have bounded support Ω, and the density f(x^c) of X^c is Lipschitz continuous and bounded away from zero.

Assumption B.2. m_k(x^c), for k = 1,..., d_z, has continuous (p + 1)-th derivatives in Ω.

Assumption B.3. The matrix E(ZZ^T | X^c = x^c) has strictly positive eigenvalues for each x^c ∈ Ω, and each entry is Lipschitz continuous.

Note that for d_z = 1 in our multivariate model (4) with only continuous regressors, the corresponding multivariate equivalent kernel, K*_τ(·), is K*_τ(Δ_{h,i}) = e_{τ+1,N_p}^T M^{-1} Δ_{h,i}^{(p)} K(Δ_{h,i}).

Theorem 3.3. Assume the relationship between τ and its derivative vector m is as defined in (10), and that the hat matrix H_{τ,h} is based on (23). Under Assumptions B.1 to B.3 and A.3 to A.4, with h_s → 0, for s = 1,..., q, as n → ∞, and nh_1 ⋯ h_q → ∞, the asymptotic results in Theorem 3.2 continue to hold, so that

(24) tr(H_{τ,h}) = [d_z K*_τ(0) / (h^m ∏_{s=1}^q h_s)] |Ω| {1 + o_P(1)},
tr(H_{τ,h}^T H_{τ,h}) = [d_z (K*_τ ∗ K*_τ)(0) / (h^{2m} ∏_{s=1}^q h_s)] |Ω| {1 + o_P(1)},
tr(2H_{τ,h} − H_{τ,h}^T H_{τ,h}) = [d_z / ∏_{s=1}^q h_s] [2K*_τ(0)/h^m − (K*_τ ∗ K*_τ)(0)/h^{2m}] |Ω| {1 + o_P(1)}.
Clearly, for τ = 0 and q = 1, Theorem 3.3 nests Zhang's (2003) results for the degrees of freedom of local polynomial hat matrices (see Theorems 1 and 3, pages 612 and 616, respectively).
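The rate in Theorem 3.3 is easy to eyeball in simulation. The sketch below (our own rough check, not from the paper) assumes τ = 0, q = d_z = 1, p = 0, a Gaussian kernel, and a uniform design on [0, 1], so that the theorem predicts tr(H) ≈ K(0)|Ω|/h = 1/(h√(2π)):

```python
# Rough finite-sample check of the 1/h rate for the local constant trace.
import numpy as np

rng = np.random.default_rng(2)
n = 1500
x = rng.uniform(0, 1, n)

def trace_nw(x, h):
    """tr(H_0) for the local constant fit: sum_i K_h(0) / sum_l K_h(x_l - x_i)."""
    K0 = 1.0 / (h * np.sqrt(2 * np.pi))
    d = (x[:, None] - x[None, :]) / h
    Kmat = np.exp(-0.5 * d ** 2) / (h * np.sqrt(2 * np.pi))
    return float(np.sum(K0 / Kmat.sum(axis=1)))

for h in [0.2, 0.1, 0.05]:
    print(h, trace_nw(x, h), 1.0 / (h * np.sqrt(2 * np.pi)))
```

Boundary effects inflate the finite-sample trace somewhat above the asymptotic prediction, but halving h roughly doubles tr(H), as the 1/h rate requires.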
Degrees of Freedom for Discrete Regressors. The case in which all regressors are discrete is also common in applied research. We therefore assume that for model (4), d_z = 1, p = 0, and X_i = X_i^d. Then the resulting LCLS estimator implied by (15) is

m̂(x_i^d) = ∑_{j=1}^n L(x_j^d, x_i^d, λ) y_j / ∑_{l=1}^n L(x_l^d, x_i^d, λ) = ∑_{j=1}^n A_j(x_i^d) y_j,

where

(25) A_j(x_i^d) = L(x_j^d, x_i^d, λ) / ∑_{l=1}^n L(x_l^d, x_i^d, λ).

We let H_λ(i, j) := A_i(x_j^d). This formulation of H_λ gives rise to the following asymptotic expressions.

Theorem 3.4. Assume (x_i^d, y_i) are i.i.d. as (X^d, Y), and the probability mass function f(x^d) is bounded away from zero on Ω^d. Let the bandwidth vector λ be such that λ ∈ [0, b_n]^r, where b_n is a positive sequence that converges to zero as n → ∞. Then

(26) tr(H_λ) = |Ω^d| {1 + o_P(1)},
(27) tr(H_λ^T H_λ) = |Ω^d| {1 + o_P(1)},
(28) tr(2H_λ − H_λ^T H_λ) = |Ω^d| {1 + o_P(1)},

uniformly in λ ∈ [0, 1]^r.

The results in Theorem 3.4 suggest that in the presence of only discrete regressors, tr(H_λ), tr(H_λ^T H_λ), and tr(2H_λ − H_λ^T H_λ) are asymptotically equivalent, in the sense that any pairwise difference among them tends to zero in probability. Thus, the computational cost of tr(H_λ^T H_λ), which is O(n²), relative to that of tr(H_λ), which is O(n), may suggest use of tr(H_λ).

Degrees of Freedom in the Presence of Relevant and Irrelevant Regressors. Our preceding analyses assume that all variables included in the regression are relevant. It is common in applied work to have a mix of irrelevant and relevant regressors included in the same regression. In this setting, we are interested in the asymptotic expressions for tr(H_{0,γ}) when one set of bandwidths moves toward its theoretical upper bounds and another set moves toward zero. To proceed with this scenario, without loss of generality, we assume the first r_1 (0 ≤ r_1 ≤ r) components of X_i^d are relevant, and the first q_1 (1 ≤ q_1 ≤ q) components of X_i^c are relevant.
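As an aside on the purely discrete case: at the λ = 0 endpoint, the conclusion of Theorem 3.4 holds exactly in finite samples, because the indicator kernel turns H_λ into a within-cell averaging matrix whose trace equals the number of observed support points. A small sketch (our own illustration, with an assumed two-covariate design where c_1 = 3 and c_2 = 2):

```python
# Sketch: with purely discrete regressors and lambda = 0, the LCLS hat
# matrix averages within cells, so tr(H_lambda) = number of observed cells.
import numpy as np

rng = np.random.default_rng(3)
n, r = 300, 2
Xd = np.column_stack([rng.integers(0, 3, n), rng.integers(0, 2, n)])  # c_1 = 3, c_2 = 2

lam = np.zeros(r)                                   # indicator kernel, lambda_s = 0
L = np.where(Xd[:, None, :] != Xd[None, :, :], lam, 1.0).prod(axis=2)
H = L / L.sum(axis=1, keepdims=True)                # H(i, j) = A_j(x_i^d)

cells = len(np.unique(Xd, axis=0))                  # observed support points (6 here)
assert np.isclose(np.trace(H), cells)
print(np.trace(H))
```

Each diagonal entry is 1/n_c, where n_c is the size of observation i's cell, so the trace sums to exactly one per observed cell, matching the |Ω^d| limit in (26).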
We denote $\tilde X_i^c$ and $\tilde X_i^d$ as the relevant components, and $\bar X_i^c$ and $\bar X_i^d$ as the irrelevant components. We adopt the concept of irrelevant regressors from Hall et al. (2007); thus we assume that

(29) $(Y, \tilde X)$ and $\bar X$ are independent of each other.

By virtue of this independence assumption, $f(x) = \tilde f(\tilde x)\,\bar f(\bar x)$, where $\tilde f(\tilde x)$ and $\bar f(\bar x)$ are the marginal densities of $\tilde X_i$ and $\bar X_i$, respectively. We use $\tilde\Omega$ and $\tilde\Omega^d$ to denote the supports of $\tilde x^c$ and $\tilde x^d$, and $\bar\Omega$ and $\bar\Omega^d$ those of $\bar x^c$ and $\bar x^d$, respectively.^3 In the ensuing sections, we will make use of the following kernel partitions:

(30) $L(X_i^d, x^d, \lambda) = \tilde L(\tilde X_i^d, \tilde x^d, \tilde\lambda)\,\bar L(\bar X_i^d, \bar x^d, \bar\lambda) = \prod_{s=1}^{r_1} l(\tilde X_{is}^d, \tilde x_s^d, \lambda_s) \prod_{s=r_1+1}^{r} l(\bar X_{is}^d, \bar x_s^d, \lambda_s),$

and

(31) $K_h(X_i^c - x^c) = \tilde K_h(\tilde X_i^c - \tilde x^c)\,\bar K_h(\bar X_i^c - \bar x^c) = \prod_{s=1}^{q_1} h_s^{-1} K\!\left(\frac{\tilde X_{is}^c - \tilde x_s^c}{h_s}\right) \prod_{s=q_1+1}^{q} h_s^{-1} K\!\left(\frac{\bar X_{is}^c - \bar x_s^c}{h_s}\right).$

With $\tilde\nu^d, \tilde x^d \in \tilde\Omega^d$, define $1_s(\tilde\nu^d, \tilde x^d) = 1(\tilde\nu_s^d \neq \tilde x_s^d) \prod_{t=1, t\neq s}^{r_1} 1(\tilde\nu_t^d = \tilde x_t^d)$. That is, $1_s(\tilde\nu^d, \tilde x^d)$ is an indicator function equal to one if $\tilde\nu^d$ and $\tilde x^d$ differ only in their $s$th element, and zero otherwise. For $l = 1, 2$, define

$\bar m_l(\bar x) = E\{[\bar K_h(\bar X_i^c - \bar x^c)\,\bar L(\bar X_i^d, \bar x^d, \bar\lambda)]^l\},$

$\breve m_l(\bar x) = E\left\{\left[\left(\prod_{s=q_1+1}^{q} K\!\left(\frac{\bar X_{is}^c - \bar x_s^c}{h_s}\right)\right)\bar L(\bar X_i^d, \bar x^d, \bar\lambda)\right]^l\right\},$

$\bar m_{L,l}(\bar x^d) = E\{[\bar L(\bar X_i^d, \bar x^d, \bar\lambda)]^l\}.$

Note the subtle difference between $\bar m_l(\bar x)$ and $\breve m_l(\bar x)$: $\breve m_l(\bar x)$ does not contain the division by $\bar h$ that exists in $\bar m_l(\bar x)$. This feature will be important when we study the limiting behavior of our ANOVA-based hat matrix. What do these three terms capture? $\bar m_l(\bar x)$ is the $l$th raw moment of the kernel weights for the irrelevant covariates at the point $\bar x$; $\breve m_l(\bar x)$ is the $l$th raw moment of the kernel weights for the irrelevant covariates at the point $\bar x$, but scaled by the irrelevant continuous covariates' bandwidth vector; and $\bar m_{L,l}(\bar x^d)$ is the $l$th raw moment of the kernel weights for the irrelevant, discrete covariates at the point $\bar x^d$. All three of these moment-based functions are expectations taken over the design points.

To shed light on the asymptotic behavior of $\mathrm{tr}(H_{0,\gamma})$ in the presence of some irrelevant regressors, we examine the entries of the resultant $H_{0,\gamma}$. Appealing to equations (15), (30) and (31), we obtain

$\hat m_0(x_i) = \dfrac{\sum_{j=1}^{n}\Big[\prod_{s=1}^{q_1} k\!\big(\frac{\tilde x_{js}^c - \tilde x_{is}^c}{h_s}\big) \prod_{s=q_1+1}^{q} k\!\big(\frac{\bar x_{js}^c - \bar x_{is}^c}{h_s}\big) \prod_{s=1}^{r_1} l(\tilde x_{js}^d, \tilde x_{is}^d, \lambda_s) \prod_{s=r_1+1}^{r} l(\bar x_{js}^d, \bar x_{is}^d, \lambda_s)\Big] y_j}{\sum_{j=1}^{n}\Big[\prod_{s=1}^{q_1} k\!\big(\frac{\tilde x_{js}^c - \tilde x_{is}^c}{h_s}\big) \prod_{s=q_1+1}^{q} k\!\big(\frac{\bar x_{js}^c - \bar x_{is}^c}{h_s}\big) \prod_{s=1}^{r_1} l(\tilde x_{js}^d, \tilde x_{is}^d, \lambda_s) \prod_{s=r_1+1}^{r} l(\bar x_{js}^d, \bar x_{is}^d, \lambda_s)\Big]}.$

For this general setting with a mix of discrete and continuous regressors, Hall et al. (2007) establish that the cross-validated bandwidths for the irrelevant regressors converge in probability to the suprema of their ranges. Ideally, if we set $h_s = \infty$ for $s = q_1+1, \ldots, q$ and $\lambda_s = 1$ for $s = r_1+1, \ldots, r$, we have

$\hat m_0(x_i) = \dfrac{\sum_{j=1}^{n}\Big[\prod_{s=1}^{q_1} k\!\big(\frac{\tilde x_{js}^c - \tilde x_{is}^c}{h_s}\big) \prod_{s=1}^{r_1} l(\tilde x_{js}^d, \tilde x_{is}^d, \lambda_s)\Big] k(0)^{q-q_1}\, 1^{r-r_1}\, y_j}{\sum_{j=1}^{n}\Big[\prod_{s=1}^{q_1} k\!\big(\frac{\tilde x_{js}^c - \tilde x_{is}^c}{h_s}\big) \prod_{s=1}^{r_1} l(\tilde x_{js}^d, \tilde x_{is}^d, \lambda_s)\Big] k(0)^{q-q_1}\, 1^{r-r_1}} = \dfrac{\sum_{j=1}^{n}\Big[\prod_{s=1}^{q_1} k\!\big(\frac{\tilde x_{js}^c - \tilde x_{is}^c}{h_s}\big) \prod_{s=1}^{r_1} l(\tilde x_{js}^d, \tilde x_{is}^d, \lambda_s)\Big] y_j}{\sum_{j=1}^{n}\Big[\prod_{s=1}^{q_1} k\!\big(\frac{\tilde x_{js}^c - \tilde x_{is}^c}{h_s}\big) \prod_{s=1}^{r_1} l(\tilde x_{js}^d, \tilde x_{is}^d, \lambda_s)\Big]} = \sum_{j=1}^{n} A_{ij} y_j,$

where $A_{ij}$ contains only the relevant regressors. What is apparent is that the algebraic form of the local constant estimator suggests that, regardless of the number of variables in the model, when variables are smoothed out, it is only the bandwidths associated with variables not smoothed away that dictate the behavior of $A_{ij}$. Only when all variables are smoothed away is $A_{ij}$ impacted by the increasing bandwidths. Thus, calculating degrees of freedom for the local constant kernel estimator is not influenced by the presence of irrelevant regressors when relevant regressors are present. What this suggests is that the degrees of freedom is influenced only by the number of relevant regressors.

^3 Hall et al. (2007) highlight that a more practically appealing yet theoretically challenging variant of (29) is: conditional on $\tilde X$, the variables $\bar X$ and $Y$ are independent. We therefore follow Hall et al. (2007) and choose the more restrictive of these two independence assumptions.
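The cancellation above is easy to verify numerically. The sketch below (our own construction; the data generating process, kernel, and bandwidths are illustrative assumptions) builds the LCLS weight matrix with and without an irrelevant continuous covariate whose bandwidth is pushed toward infinity, and checks that the two hat matrices, and hence their traces, coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x_rel = rng.uniform(0, 1, n)       # relevant continuous covariate
x_irr = rng.normal(0, 1, n)        # irrelevant continuous covariate

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

h_rel, h_irr = 0.1, 1e8            # irrelevant bandwidth pushed toward infinity

# product-kernel LCLS weights with both covariates
# (the 1/h normalizations cancel in the weight ratio and are omitted)
K = gauss((x_rel[:, None] - x_rel[None, :]) / h_rel) * \
    gauss((x_irr[:, None] - x_irr[None, :]) / h_irr)
H_both = K / K.sum(axis=1, keepdims=True)

# LCLS weights using the relevant covariate only
K1 = gauss((x_rel[:, None] - x_rel[None, :]) / h_rel)
H_rel = K1 / K1.sum(axis=1, keepdims=True)

# the k(0) factors from the smoothed-out covariate cancel in the ratio,
# so the two hat matrices (and their traces) agree
print(np.abs(H_both - H_rel).max())
print(np.trace(H_both), np.trace(H_rel))
```

The maximum elementwise difference is numerically zero, so the degrees of freedom are driven by the relevant covariate alone.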
This is clearly not the case in OLS, where adding an additional (irrelevant) regressor always contributes 1 to the degrees of freedom. Drawing on the theoretical contributions of Hall et al. (2007), we now provide the asymptotic expressions for $\mathrm{tr}(H_{0,\gamma})$, $\mathrm{tr}(H_{0,\gamma}^T H_{0,\gamma})$, and $\mathrm{tr}(2H_{0,\gamma} - H_{0,\gamma}^T H_{0,\gamma})$ as follows.
Theorem 3.5. Suppose $d_z = 1$ in model (4), and condition (29) and Assumptions A.1 to A.5 are satisfied. Assume as $n \to \infty$, (i) $h_s \to 0$ for $s = 1, \ldots, q_1$, $h_s \to \infty$ for $s = q_1+1, \ldots, q$, and $nh_1 \cdots h_{q_1} \to \infty$; and (ii) $\lambda_s \to 0$ for $s = 1, \ldots, r_1$, and $\lambda_s \to 1$ for $s = r_1+1, \ldots, r$. Then for the LCLS estimator we have

(32) $\mathrm{tr}(H_{0,\gamma}) = \dfrac{[K(0)]^{q_1}\, |\tilde\Omega|\, |\tilde\Omega^d|}{\prod_{s=1}^{q_1} h_s}\{1 + o_P(1)\},$

(33) $\mathrm{tr}(H_{0,\gamma}^T H_{0,\gamma}) = \dfrac{\nu_0^{q_1}\, |\tilde\Omega|\, |\tilde\Omega^d|}{\prod_{s=1}^{q_1} h_s}\{1 + o_P(1)\},$

(34) $\mathrm{tr}(2H_{0,\gamma} - H_{0,\gamma}^T H_{0,\gamma}) = \dfrac{\{2[K(0)]^{q_1} - \nu_0^{q_1}\}\, |\tilde\Omega|\, |\tilde\Omega^d|}{\prod_{s=1}^{q_1} h_s}\{1 + o_P(1)\}.$

Theorem 3.5 formally establishes that the influence of the irrelevant regressors on $\mathrm{tr}(H_{0,\gamma})$ is asymptotically negligible (compare its counterpart in the absence of irrelevant regressors, which can be deduced from Theorem 3.2). Similar asymptotic expressions for the case of continuous-only regressors can be deduced from Theorem 3.5. In this latter case, the influence of irrelevant regressors also has an asymptotically negligible effect on the trace of each of the three non-ANOVA hat matrices. Our foregoing results and discussions highlight that all three non-ANOVA trace measures are well defined and useful in the presence of irrelevant covariates.

3.5.1. Hat matrices with only discrete regressors. As in the case with only relevant regressors, we assume that for model (4) $d_z = 1$, $p = 0$, and $X = X^d$. Then, by virtue of the kernel partition in (30), the LCLS estimator in (15) simplifies to

(35) $\hat m(x_i^d) = \dfrac{\sum_{j=1}^{n} \tilde L(\tilde X_j^d, \tilde x_i^d, \tilde\lambda)\,\bar L(\bar X_j^d, \bar x_i^d, \bar\lambda)\, y_j}{\sum_{l=1}^{n} \tilde L(\tilde X_l^d, \tilde x_i^d, \tilde\lambda)\,\bar L(\bar X_l^d, \bar x_i^d, \bar\lambda)} = \sum_{j=1}^{n} A_j(x_i^d)\, y_j,$

where

(36) $A_j(x_i^d) = \dfrac{\tilde L(\tilde X_j^d, \tilde x_i^d, \tilde\lambda)\,\bar L(\bar X_j^d, \bar x_i^d, \bar\lambda)}{\sum_{l=1}^{n} \tilde L(\tilde X_l^d, \tilde x_i^d, \tilde\lambda)\,\bar L(\bar X_l^d, \bar x_i^d, \bar\lambda)}.$

In this case with only discrete regressors, Ouyang et al.
(2009) establish that the bandwidths associated with the irrelevant regressors from the least-squares cross-validation method are not guaranteed to be smoothed out as the sample size increases; that is, there is a positive probability that these bandwidths do not converge to their upper extreme values even as $n \to \infty$. Thus, $\bar\lambda_s = 1$ for all $s = r_1+1, \ldots, r$ is not guaranteed, and hence in (36) $\bar L(\bar X_l^d, \bar x_i^d, \bar\lambda) = 1$ is not guaranteed asymptotically. In essence, Ouyang et al.'s (2009) result implies that $\mathrm{tr}(H_\lambda)$ can exceed $|\tilde\Omega^d|$ with positive probability, where $H_\lambda$ is predicated on (36).^4 This implication also holds for $\mathrm{tr}(H_\lambda^T H_\lambda)$ and $\mathrm{tr}(2H_\lambda - H_\lambda^T H_\lambda)$.

Theorem 3.6. Assume $(x_i^d, y_i)$ are i.i.d. as $(X^d, Y)$, and the probability mass function $f(x^d)$ is bounded away from zero on $\Omega^d$. Assume $H_\lambda$ is predicated on (36). Suppose that $\tilde\lambda = (\lambda_1, \ldots, \lambda_{r_1}) \in [0, b_n]^{r_1}$, where $b_n$ is a positive sequence that converges to zero as $n \to \infty$, and $\lim_{n\to\infty} \Pr(\bar\lambda_{r_1+1} = 1, \ldots, \bar\lambda_{r} = 1) \ge \alpha$ for some $\alpha \in (0, 1)$. Then

(37) $\mathrm{tr}(H_\lambda) = |\tilde\Omega^d| \sum_{\bar x^d} \dfrac{\bar f(\bar x^d)}{\bar m_{L,1}(\bar x^d)}\{1 + o_P(1)\},$

(38) $\mathrm{tr}(H_\lambda^T H_\lambda) = |\tilde\Omega^d| \sum_{\bar x^d} \dfrac{\bar f(\bar x^d)}{\bar m_{L,1}(\bar x^d)}\{1 + o_P(1)\},$

(39) $\mathrm{tr}(2H_\lambda - H_\lambda^T H_\lambda) = |\tilde\Omega^d| \sum_{\bar x^d} \dfrac{\bar f(\bar x^d)}{\bar m_{L,1}(\bar x^d)}\{1 + o_P(1)\},$

uniformly in $\lambda \in [0, 1]^r$.

Theorem 3.6 suggests that for nonparametric regressions with only discrete covariates, the asymptotic equivalence between any pair of $\mathrm{tr}(H_\lambda)$, $\mathrm{tr}(H_\lambda^T H_\lambda)$ and $\mathrm{tr}(2H_\lambda - H_\lambda^T H_\lambda)$ is valid even if some of the covariates are irrelevant. Note also that $\bar f(\bar x^d) \le \bar m_{L,1}(\bar x^d)$ for each $\bar x^d \in \bar\Omega^d$ and, in particular, $\bar f(\bar x^d) < \bar m_{L,1}(\bar x^d)$ for each $\bar x^d \in \bar\Omega^d$. Thus, $\sum_{\bar x^d} \bar f(\bar x^d)/\bar m_{L,1}(\bar x^d) < |\bar\Omega^d|$. Therefore, Theorem 3.6 implies that in the presence of irrelevant discrete variables, $\mathrm{tr}(H_\lambda) < |\tilde\Omega^d|\,|\bar\Omega^d| = |\Omega^d|$ asymptotically. In essence, $\sum_{\bar x^d} \bar f(\bar x^d)/\bar m_{L,1}(\bar x^d)$ is a measure of the degree of irrelevance.

4. A Multivariate generalization of the hat matrix from the ANOVA framework

Huang & Chen (2008) consider the local polynomial estimator for model (4) with a scalar continuous regressor $X_i$ under an ANOVA framework.

^4 Zhang (2003) notes this observation.
To do this, from (7) Huang & Chen (2008) define a local SSE, SST, and SSR, respectively, as

$SSE_p(x, h) = \dfrac{n^{-1}\sum_{i=1}^{n}\big(Y_i - \sum_{j=0}^{p}\hat\beta_j (X_i - x)^j\big)^2 K_h(X_i - x)}{n^{-1}\sum_{i=1}^{n} K_h(X_i - x)},$

$SST_p(x, h) = \dfrac{n^{-1}\sum_{i=1}^{n}\big(Y_i - \bar Y\big)^2 K_h(X_i - x)}{n^{-1}\sum_{i=1}^{n} K_h(X_i - x)},$

$SSR_p(x, h) = \dfrac{n^{-1}\sum_{i=1}^{n}\big(\sum_{j=0}^{p}\hat\beta_j (X_i - x)^j - \bar Y\big)^2 K_h(X_i - x)}{n^{-1}\sum_{i=1}^{n} K_h(X_i - x)},$
so that $SST_p(x, h) = SSE_p(x, h) + SSR_p(x, h)$. Their global counterparts to this local ANOVA decomposition are

$SSE_p(h) = \int SSE_p(x, h)\,\hat f(x; h)\,dx, \quad SST(h) = \int SST_p(x, h)\,\hat f(x; h)\,dx, \quad SSR_p(h) = \int SSR_p(x, h)\,\hat f(x; h)\,dx,$

and $SST(h) = SST \equiv n^{-1}\sum_{i=1}^{n}(Y_i - \bar Y)^2$ under some conditions. The hat matrix corresponding to the global ANOVA decomposition is denoted $H^*$ and is defined as $H^* = \int W^* H\,\hat f(x; h)\,dx$, with $W^*$ a diagonal matrix having entries $K_h(X_i - x)/\hat f(x; h)$, and

(40) $H = X(X^T W X)^{-1} X^T W,$

with $X$ being the effective design matrix generated by the local polynomial expansion. We now extend the Huang & Chen (2008) framework by allowing the regression model to have $q$ continuous regressors in the vector $X^c$ and $r$ discrete regressors in a vector $X^d$. In light of the foregoing local and global ANOVA decompositions, we proceed in the following way:

$SSE_p(x, \gamma) = \dfrac{n^{-1}\sum_{i=1}^{n}\big(Y_i - \hat\beta_0 - \sum_{1 \le |j| \le p}\hat\beta_j (X_i^c - x^c)^j\big)^2 K_\gamma(X_i, x)}{n^{-1}\sum_{i=1}^{n} K_\gamma(X_i, x)},$

$SST(x, \gamma) = \dfrac{n^{-1}\sum_{i=1}^{n}\big(Y_i - \bar Y\big)^2 K_\gamma(X_i, x)}{n^{-1}\sum_{i=1}^{n} K_\gamma(X_i, x)},$

$SSR_p(x, \gamma) = \dfrac{n^{-1}\sum_{i=1}^{n}\big(\hat\beta_0 + \sum_{1 \le |j| \le p}\hat\beta_j (X_i^c - x^c)^j - \bar Y\big)^2 K_\gamma(X_i, x)}{n^{-1}\sum_{i=1}^{n} K_\gamma(X_i, x)}.$

Their global counterparts to this local ANOVA decomposition are

$SSE_p(\gamma) = \sum_{x^d}\int SSE_p(x, \gamma)\,\hat f(x; \gamma)\,dx^c, \quad SST(\gamma) = \sum_{x^d}\int SST(x, \gamma)\,\hat f(x; \gamma)\,dx^c, \quad SSR_p(\gamma) = \sum_{x^d}\int SSR_p(x, \gamma)\,\hat f(x; \gamma)\,dx^c,$
where $\sum_{x^d}$ refers to summation over all atoms $x^d = (x_1^d, \ldots, x_r^d)$ of the distribution of $X^d$. Then, for this generalization,

(41) $H^* = \sum_{x^d}\int W^* H\,\hat f(x; \gamma)\,dx^c,$

with $H$ as defined in (3.3), $X = D(x^c)$, and $W^* = W(x)/\hat f(x; \gamma)$ a diagonal matrix having entries $K_\gamma(x, X_i)/\hat f(x; \gamma)$, where $\hat f(x; \gamma) = \frac{1}{n}\sum_{i=1}^{n} K_\gamma(x, X_i)$, and we assume the following normalization:

$\sum_{x^d}\int K_\gamma(X_i, x)\,dx^c = 1,$

which ensures that $SST(\gamma) = \sum_{x^d}\int SST(x, \gamma)\,\hat f(x; \gamma)\,dx^c = n^{-1}\sum_{i=1}^{n}(Y_i - \bar Y)^2$. Define

$M^{-1} = \begin{pmatrix} M^{0,0} & M^{0,1} & \cdots & M^{0,p} \\ M^{1,0} & M^{1,1} & \cdots & M^{1,p} \\ \vdots & \vdots & & \vdots \\ M^{p,0} & M^{p,1} & \cdots & M^{p,p} \end{pmatrix}.$

The immediate result is a generalization of the trace result in Theorem 4(c) in Huang & Chen (2008).

Theorem 4.1. Assume the conditions in Theorem 3.2 hold and $d_z = 1$ in (4). The conditional trace of $H^*$, as defined in (41), for the multivariate local polynomial estimator of the conditional mean is asymptotically

(42) $\mathrm{tr}(H^*) = \dfrac{|\Omega|\,|\Omega^d|}{\prod_{s=1}^{q} h_s} \sum_{r,c=0}^{p} \mathrm{tr}\big(M^{r,c} S_{c,r}\big)\{1 + o_P(1)\}.$

Theorem 4.1 shows that the asymptotic expansion for $\mathrm{tr}(H^*)$ is inversely related to the bandwidths for the continuous regressors but is unrelated to the bandwidths for the discrete regressors and the mixed design density. In the absence of discrete regressors, the following corollary is immediate.

Corollary 4.2. Assume the conditions in Theorem 4.1 hold with only continuous regressors in (4). The conditional trace of $H^*$ for the multivariate local polynomial estimator is
asymptotically

(43) $\mathrm{tr}(H^*) = \dfrac{|\Omega|}{\prod_{s=1}^{q} h_s} \sum_{r,c=0}^{p} \mathrm{tr}\big(M^{r,c} S_{c,r}\big)\{1 + o_P(1)\}.$

Clearly, the difference between the non-ANOVA asymptotic expressions for the hat matrix implied by the conditional mean estimator in Theorem 3.2 and that of their ANOVA counterpart in Theorem 4.1 is driven by a linear combination of kernel-dependent constants, which can be easily calculated. Furthermore, this result remains in the presence of only continuous regressors, as can be seen from Theorem 3.3 and Corollary 4.2. More important, under certain model restrictions this linear combination of kernel-dependent constants can be quite minuscule; we illustrate this in the ensuing subsection.

4.1. Comparing Degrees of Freedom from the univariate ANOVA and non-ANOVA frameworks. In light of the regression function with a scalar smooth covariate that is used in both Zhang (2003) and Huang & Chen (2008), we now gauge the size of the linear combination of kernel-dependent constants that drives the difference between the asymptotic expressions for the trace of their resultant hat matrices for the conditional mean estimator. We consider the three most popular local polynomial estimators: LCLS, local linear, and local cubic; thus, $p \in \{0, 1, 3\}$. For a scalar covariate, we have $M = (\mu_{i+j-2})_{1\le i,j\le p+1}$, $M^{-1} := (m_{ij})_{1\le i,j\le p+1}$, and $S = (\nu_{i+j-2})_{1\le i,j\le p+1}$ (see pages 12 and 21). Thus, Theorem 3.3 and Corollary 4.2 simplify, respectively, to

$\mathrm{tr}(H_{0,h}) = h^{-1}\, e_{1,p+1}^T M^{-1} e_{1,p+1} K(0)\, |\Omega|\,\{1 + o_P(1)\},$
$\mathrm{tr}(H_{0,h}^T H_{0,h}) = h^{-1}\, e_{1,p+1}^T M^{-1} S M^{-1} e_{1,p+1}\, |\Omega|\,\{1 + o_P(1)\},$
$\mathrm{tr}(H^*) = h^{-1}\Big(\sum_{i,j=1}^{p+1} m_{ij}\nu_{i+j-2}\Big) |\Omega|\,\{1 + o_P(1)\},$

where the sum runs over $i + j$ even. We define

${}^1C_\kappa^p = \sum_{i,j=1}^{p+1} m_{ij}\nu_{i+j-2} - e_{1,p+1}^T M^{-1} e_{1,p+1} K(0),$
${}^2C_\kappa^p = \sum_{i,j=1}^{p+1} m_{ij}\nu_{i+j-2} - e_{1,p+1}^T M^{-1} S M^{-1} e_{1,p+1},$
${}^3C_\kappa^p = \sum_{i,j=1}^{p+1} m_{ij}\nu_{i+j-2} - \big(2 e_{1,p+1}^T M^{-1} e_{1,p+1} K(0) - e_{1,p+1}^T M^{-1} S M^{-1} e_{1,p+1}\big),$

where the sums run over $i + j$ even, and where ${}^1C_\kappa^p$, ${}^2C_\kappa^p$, and ${}^3C_\kappa^p$ are associated with the differences between $\mathrm{tr}(H^*)$ and $\mathrm{tr}(H_{0,h})$, $\mathrm{tr}(H_{0,h}^T H_{0,h})$, and $\mathrm{tr}(2H_{0,h} - H_{0,h}^T H_{0,h})$, respectively. For $p = 0, 1$, for example, the ANOVA-based result in Corollary 4.2 degenerates to

(44) $\mathrm{tr}(H^*) = h^{-1}|\Omega|\,\nu_0\{1 + o_P(1)\},$
(45) $\mathrm{tr}(H^*) = h^{-1}|\Omega|\,(\nu_0 + \nu_2/\mu_2)\{1 + o_P(1)\},$

respectively, whereas the non-ANOVA counterparts implied by Theorem 3.3 degenerate to

(46) $\mathrm{tr}(H_{0,h}) = h^{-1}|\Omega|\, K(0)\{1 + o_P(1)\},$
(47) $\mathrm{tr}(H_{0,h}^T H_{0,h}) = h^{-1}|\Omega|\,\nu_0\{1 + o_P(1)\}.$

Clearly, ${}^2C_\kappa^0 = 0$; that is, for the LCLS estimator the asymptotic difference between $\mathrm{tr}(H^*)$ and $\mathrm{tr}(H_{0,h}^T H_{0,h})$ is zero for any kernel, assuming an identical bandwidth parameter $h$.^5 We undertake a more general comparison between $\mathrm{tr}(H^*)$ and both $\mathrm{tr}(H_{0,h})$ and $\mathrm{tr}(H_{0,h}^T H_{0,h})$ for the popular class of symmetric beta kernels, defined as

(48) $K(t) = \dfrac{1}{\mathrm{Beta}(1/2, \kappa + 1)}(1 - t^2)_+^{\kappa}, \quad \kappa = 0, 1, 2, \ldots,$

where the subscript $+$ denotes the positive part, which is understood to be taken prior to exponentiation (see, e.g., Fan & Gijbels 1996, p. 15). This class nests the uniform, Epanechnikov, biweight, and triweight kernels for $\kappa = 0, 1, 2$, and $3$, respectively, and the Gaussian kernel as the limiting kernel function as $\kappa \to \infty$. For $\kappa = 0, 1, 2$, and $3$,

$\mu_{2j} = \dfrac{\mathrm{Beta}(j + 1/2, \kappa + 1)}{\mathrm{Beta}(1/2, \kappa + 1)} \quad \text{and} \quad \nu_{2j} = \dfrac{\mathrm{Beta}(j + 1/2, 2\kappa + 1)}{\{\mathrm{Beta}(1/2, \kappa + 1)\}^2},$

and for the Gaussian kernel, $K(u) = (1/\sqrt{2\pi})\, e^{-u^2/2}$, with $\mu_{2j} = (2j-1)(2j-3)\cdots 3\cdot 1$ and $\nu_{2j} = 2^{-(j+1)}\mu_{2j}/\sqrt{\pi}$ (see Fan & Gijbels 1996, p. 78). Table 1 reports the values of ${}^1C_\kappa^p$, ${}^2C_\kappa^p$, and ${}^3C_\kappa^p$ for this class of kernels and for $p \in \{0, 1, 3\}$. Table 1 shows that for smoother kernels, that is, kernels with a larger $\kappa \ge 1$, ${}^1C_\kappa^1$ and ${}^1C_\kappa^3$ become smaller.
For the local linear estimator, the asymptotic difference between $\mathrm{tr}(H^*)$ and $\mathrm{tr}(H_{0,h})$ can be quite minute; specifically, for the Gaussian kernel, ${}^1C_\kappa^1 \approx 0.024$. More important, $0 \le [{}^1C_\kappa^p] \le 1$, $0 \le [{}^2C_\kappa^p] \le 1$, and $0 \le [{}^3C_\kappa^p] \le 1$ for all $\kappa$ and $p \in \{0, 1, 3\}$, where $[c]$ denotes the nearest integer to the real number $c$.

^5 In fact, ${}^2C_\kappa^0 = 0$ for any $q$-variate smooth covariate.
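These kernel-dependent constants are straightforward to compute from the moment formulas above. The following sketch (our own illustration; the function names are ours, and only the Gaussian-kernel moments $\mu_{2j} = (2j-1)!!$ and $\nu_{2j} = 2^{-(j+1)}\mu_{2j}/\sqrt{\pi}$ from the text are used) evaluates ${}^1C^p$ and ${}^2C^p$ for $p \in \{0, 1\}$:

```python
import math
import numpy as np

def gaussian_moments(j):
    """mu_{2j} and nu_{2j} for the Gaussian kernel."""
    mu = 1.0
    for k in range(1, j + 1):           # (2j-1)(2j-3)...3*1
        mu *= 2 * k - 1
    nu = mu / (2 ** (j + 1) * math.sqrt(math.pi))
    return mu, nu

def constants(p):
    """1C^p and 2C^p for the Gaussian kernel (odd-order moments vanish)."""
    def moment(i, j, which):
        if (i + j) % 2:                 # odd-order kernel moments are zero
            return 0.0
        return gaussian_moments((i + j) // 2)[which]
    M = np.array([[moment(i, j, 0) for j in range(p + 1)] for i in range(p + 1)])
    S = np.array([[moment(i, j, 1) for j in range(p + 1)] for i in range(p + 1)])
    Minv = np.linalg.inv(M)
    K0 = 1.0 / math.sqrt(2 * math.pi)   # K(0) for the Gaussian kernel
    trMS = np.trace(Minv @ S)           # sum of m_ij nu_{i+j-2} over i+j even
    e = np.zeros(p + 1); e[0] = 1.0
    C1 = trMS - (e @ Minv @ e) * K0
    C2 = trMS - e @ Minv @ S @ Minv @ e
    return C1, C2

print(constants(0))   # 2C^0 is exactly 0 for the LCLS estimator
print(constants(1))   # 1C^1 for the Gaussian kernel is small (~0.02)
```

The $p = 0$ case confirms ${}^2C^0 = 0$ numerically, while the $p = 1$ case reproduces the small local linear discrepancy discussed above.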
[Table 1 about here.]

4.2. Hat matrices with only discrete regressors. For the LCLS estimator with only discrete regressors, $H^* = \sum_{x^d} W^* H\,\hat f(x^d; \lambda)$, with $W^*$ a diagonal matrix having entries $L(X_i^d, x^d, \lambda)/\hat f(x^d; \lambda)$, and with $X = \iota$, the vector of ones in $\mathbb{R}^n$. Then

$\mathrm{tr}(H^*) = \sum_{x^d} \mathrm{tr}\big(W^* \iota(\iota^T W^* \iota)^{-1}\iota^T W^*\big)\,\hat f(x^d; \lambda) = \sum_{x^d} (\iota^T W^* \iota)^{-1}\big(\iota^T W^{*2} \iota\big)\,\hat f(x^d; \lambda)$

(49) $= \sum_{x^d}\left(\sum_{i=1}^{n} \frac{L(X_i^d, x^d, \lambda)}{\hat f(x^d; \lambda)}\right)^{-1}\left(\sum_{i=1}^{n} \frac{L^2(X_i^d, x^d, \lambda)}{[\hat f(x^d; \lambda)]^2}\right)\hat f(x^d; \lambda).$

In light of (49), we have the following result:

Theorem 4.3. Under the conditions of Theorem 3.4,

(50) $\mathrm{tr}(H^*) = |\Omega^d|\{1 + o_P(1)\},$

where $\mathrm{tr}(H^*)$ is defined in (49).

Theorem 4.3 suggests that in the purely discrete case with all relevant regressors, the differences between the non-ANOVA-based approaches in Zhang (2003) and the ANOVA-based approach in Huang & Chen (2008) are asymptotically negligible. Hence, for example, $\mathrm{tr}(H^*) \approx \mathrm{tr}(H_\lambda)$.

4.3. Degrees of Freedom in the presence of Relevant and Irrelevant Regressors. In Subsection 3.5, we show that the non-ANOVA nonparametric framework also lends itself well to meaningful asymptotic expressions for the trace of the implied hat matrix in the presence of irrelevant continuous and discrete covariates. Intuitively, $\mathrm{tr}(H_{0,\gamma})$ is the ratio of two kernel terms that are of equal order of magnitude in the bandwidth vector for the continuous covariates; thus, the influence of $h_s \to \infty$ for $s = q_1+1, \ldots, q$ on the ratio is dominated by the influence of $h_s \to 0$ for $s = 1, \ldots, q_1$. In light of our juxtaposition of non-ANOVA and ANOVA frameworks, it is interesting to examine whether the trace of the resultant hat matrix from Huang & Chen's (2008) ANOVA framework has a meaningful expression when some covariates are irrelevant. We now consider the LCLS estimator for $\mathrm{tr}(H^*)$ with a mix of continuous and discrete regressors. Observe that

$\mathrm{tr}(H^*) = \sum_{x^d}\int (\iota^T W \iota)^{-1}\big(\iota^T W^2 \iota\big)\,\hat f(x; \gamma)^{-1}\hat f(x; \gamma)\,dx^c,$
which also depends on the ratio of two kernel terms. However, unlike the non-ANOVA framework, the kernel terms are of different orders of magnitude in the bandwidth vector for the continuous covariates. This suggests that there can be sizable influence of $h_s \to \infty$ for $s = q_1+1, \ldots, q$, relative to $h_s \to 0$ for $s = 1, \ldots, q_1$, on $\mathrm{tr}(H^*)$. Hence, $\mathrm{tr}(H^*) \to 0$ is possible under the condition that $h_s \to \infty$ for $s = q_1+1, \ldots, q$. Formally, we provide this attrition effect of the irrelevant bandwidths on $\mathrm{tr}(H^*)$ in the following result:

Theorem 4.4. Assume the conditions of Theorem 3.5 hold. Let, for some constant $c$, $n^{-c} < h_s < n^{c}$ for $s = 1, \ldots, q$, and $\bar h \equiv \bar h_{q_1+1}\cdots\bar h_q \ge n^{\kappa}$, where $\kappa = (q_1(\eta + 1) + 4\eta)/(q_1 + 4)$ and $\eta \ge 1$. The $\mathrm{tr}(H^*)$ associated with the LCLS estimator is such that

(51) $\mathrm{tr}(H^*) = \dfrac{\nu_0^{q_1}\,|\tilde\Omega|\,|\tilde\Omega^d|}{\prod_{s=1}^{q_1} h_s}\left(\sum_{\bar x^d}\int \bar m(\bar x)\,d\bar x^c\right)\{1 + o_P(1)\},$

where $\bar m(\bar x) = \bar m_2(\bar x)/\bar m_1(\bar x) = O\big(\{\bar h_{q_1+1}\cdots\bar h_q\}^{-1}\big)$.

Thus, Theorem 4.4 shows that in the presence of irrelevant continuous covariates, $\mathrm{tr}(H^*) \to_p 0$.^6 In fact, simulation results confirm this behavior of $\mathrm{tr}(H^*)$ for the LCLS, local linear and local cubic estimators. One implication of Theorem 4.4 is that the nonparametric ANOVA-based F-tests developed by Huang & Chen (2008), Huang & Su (2009) and Huang & Davidson (2010) may not be operational in the presence of such covariates. In particular, $\mathrm{tr}(H^*) \to 0$ will render a residual degrees of freedom close to $n$, and hence a large global mean square error, which is used to compute an unbiased estimate of the error variance in finite-sample settings; also, $\mathrm{tr}(H^*) \to 0$ will render a negative regression degrees of freedom. The measure and interpretability of their ANOVA-based adjusted R-squared are also impaired by $\mathrm{tr}(H^*) \to 0$. This feature will be true for other data-driven bandwidth selection measures with the capability of selecting bandwidths which diverge.
For example, the AIC$_c$ bandwidth selection criterion of Hurvich, Simonoff & Tsai (1998) has been shown to perform in a similar fashion to LSCV (Li & Racine 2004), though no formal theory currently exists that demonstrates that AIC$_c$ bandwidth selection will produce large bandwidths for irrelevant variables. We further conjecture that this result will hold in the local polynomial setting when all of the continuous covariates enter the model in a polynomial fashion. The reason for this is that, as mentioned in Hall & Racine (2015), when the underlying data generating process is

^6 To the best of our knowledge, there is no study in the extant literature that documents the rate at which the bandwidths associated with the irrelevant covariates diverge to infinity. In practice, however, in a given model specification each $\bar h_s$ is often larger than the $\tilde h_s$ by a factor in excess of $n^{\eta}$. Therefore, the restriction we impose on the bandwidths for the irrelevant covariates is quite conservative.
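The attrition effect of Theorem 4.4 can be visualized with a short simulation (our own sketch, not the paper's Monte Carlo design; the data generating process, kernel, grid, and bandwidth values are illustrative assumptions). For a continuous-only LCLS fit, our reading of the continuous analogue of (49) gives $\mathrm{tr}(H^*) \approx \int \sum_i K_\gamma(X_i, x)^2 / \sum_i K_\gamma(X_i, x)\,dx$, which we evaluate on a grid while the bandwidth of an irrelevant covariate grows:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.uniform(0, 1, (n, 2))        # column 0 relevant, column 1 irrelevant

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

# grid over the support [0, 1]^2 for the integral in tr(H*)
g = np.linspace(0, 1, 41)
G1, G2 = np.meshgrid(g, g, indexing="ij")

def trace_anova(h1, h2):
    # product kernel evaluated at every (grid point, observation) pair
    K = (gauss((G1[..., None] - X[:, 0]) / h1) / h1 *
         gauss((G2[..., None] - X[:, 1]) / h2) / h2)
    ratio = (K**2).sum(axis=-1) / K.sum(axis=-1)   # sum_i K_i^2 / sum_i K_i
    return ratio.mean()                # grid average over the unit square

for h2 in (0.2, 2.0, 20.0, 200.0):
    print(h2, trace_anova(0.1, h2))    # trace shrinks as the irrelevant bandwidth grows
```

Unlike the non-ANOVA trace, which stabilizes once the irrelevant covariate is smoothed out, this ANOVA-based trace decays roughly like $1/\bar h$, in line with $\bar m(\bar x) = O(\bar h^{-1})$ in Theorem 4.4.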
Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adopted from Prof. H.R. Rabiee s and also Prof. R. Gutierrez-Osuna
More informationSUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA
SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA By Lingsong Zhang, Haipeng Shen and Jianhua Z. Huang Purdue University, University of North Carolina,
More informationDS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.
DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1
More informationUnit Roots in White Noise?!
Unit Roots in White Noise?! A.Onatski and H. Uhlig September 26, 2008 Abstract We show that the empirical distribution of the roots of the vector auto-regression of order n fitted to T observations of
More informationPreface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation
Preface Nonparametric econometrics has become one of the most important sub-fields in modern econometrics. The primary goal of this lecture note is to introduce various nonparametric and semiparametric
More informationIntroduction to Regression
Introduction to Regression David E Jones (slides mostly by Chad M Schafer) June 1, 2016 1 / 102 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression Nonparametric Procedures
More informationAPPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.
APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product
More informationECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd
ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation Petra E. Todd Fall, 2014 2 Contents 1 Review of Stochastic Order Symbols 1 2 Nonparametric Density Estimation 3 2.1 Histogram
More informationRegression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin
Regression Review Statistics 149 Spring 2006 Copyright c 2006 by Mark E. Irwin Matrix Approach to Regression Linear Model: Y i = β 0 + β 1 X i1 +... + β p X ip + ɛ i ; ɛ i iid N(0, σ 2 ), i = 1,..., n
More informationstatistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors:
Wooldridge, Introductory Econometrics, d ed. Chapter 3: Multiple regression analysis: Estimation In multiple regression analysis, we extend the simple (two-variable) regression model to consider the possibility
More informationLecture 20: Linear model, the LSE, and UMVUE
Lecture 20: Linear model, the LSE, and UMVUE Linear Models One of the most useful statistical models is X i = β τ Z i + ε i, i = 1,...,n, where X i is the ith observation and is often called the ith response;
More informationLocal linear multiple regression with variable. bandwidth in the presence of heteroscedasticity
Local linear multiple regression with variable bandwidth in the presence of heteroscedasticity Azhong Ye 1 Rob J Hyndman 2 Zinai Li 3 23 January 2006 Abstract: We present local linear estimator with variable
More informationSimple and Efficient Improvements of Multivariate Local Linear Regression
Journal of Multivariate Analysis Simple and Efficient Improvements of Multivariate Local Linear Regression Ming-Yen Cheng 1 and Liang Peng Abstract This paper studies improvements of multivariate local
More informationPart 6: Multivariate Normal and Linear Models
Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of
More informationNew Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data
ew Local Estimation Procedure for onparametric Regression Function of Longitudinal Data Weixin Yao and Runze Li The Pennsylvania State University Technical Report Series #0-03 College of Health and Human
More informationInference For High Dimensional M-estimates: Fixed Design Results
Inference For High Dimensional M-estimates: Fixed Design Results Lihua Lei, Peter Bickel and Noureddine El Karoui Department of Statistics, UC Berkeley Berkeley-Stanford Econometrics Jamboree, 2017 1/49
More informationLecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices
Lecture 3: Simple Linear Regression in Matrix Format To move beyond simple regression we need to use matrix algebra We ll start by re-expressing simple linear regression in matrix form Linear algebra is
More informationEstimation of the Conditional Variance in Paired Experiments
Estimation of the Conditional Variance in Paired Experiments Alberto Abadie & Guido W. Imbens Harvard University and BER June 008 Abstract In paired randomized experiments units are grouped in pairs, often
More informationSTAT5044: Regression and Anova. Inyoung Kim
STAT5044: Regression and Anova Inyoung Kim 2 / 51 Outline 1 Matrix Expression 2 Linear and quadratic forms 3 Properties of quadratic form 4 Properties of estimates 5 Distributional properties 3 / 51 Matrix
More informationSparse Nonparametric Density Estimation in High Dimensions Using the Rodeo
Outline in High Dimensions Using the Rodeo Han Liu 1,2 John Lafferty 2,3 Larry Wasserman 1,2 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University
More informationReproducing Kernel Hilbert Spaces
9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we
More informationPanel Data Models. James L. Powell Department of Economics University of California, Berkeley
Panel Data Models James L. Powell Department of Economics University of California, Berkeley Overview Like Zellner s seemingly unrelated regression models, the dependent and explanatory variables for panel
More informationx. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).
.8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics
More informationIntroduction to Regression
Introduction to Regression Chad M. Schafer May 20, 2015 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression Nonparametric Procedures Cross Validation Local Polynomial Regression
More informationVectors and Matrices Statistics with Vectors and Matrices
Vectors and Matrices Statistics with Vectors and Matrices Lecture 3 September 7, 005 Analysis Lecture #3-9/7/005 Slide 1 of 55 Today s Lecture Vectors and Matrices (Supplement A - augmented with SAS proc
More informationMatrix Factorizations
1 Stat 540, Matrix Factorizations Matrix Factorizations LU Factorization Definition... Given a square k k matrix S, the LU factorization (or decomposition) represents S as the product of two triangular
More information1. Stochastic Processes and Stationarity
Massachusetts Institute of Technology Department of Economics Time Series 14.384 Guido Kuersteiner Lecture Note 1 - Introduction This course provides the basic tools needed to analyze data that is observed
More informationDimension Reduction Techniques. Presented by Jie (Jerry) Yu
Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage
More informationIntegrated Likelihood Estimation in Semiparametric Regression Models. Thomas A. Severini Department of Statistics Northwestern University
Integrated Likelihood Estimation in Semiparametric Regression Models Thomas A. Severini Department of Statistics Northwestern University Joint work with Heping He, University of York Introduction Let Y
More informationCS 195-5: Machine Learning Problem Set 1
CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of
More informationIntroduction to Regression
Introduction to Regression p. 1/97 Introduction to Regression Chad Schafer cschafer@stat.cmu.edu Carnegie Mellon University Introduction to Regression p. 1/97 Acknowledgement Larry Wasserman, All of Nonparametric
More information3 Multiple Linear Regression
3 Multiple Linear Regression 3.1 The Model Essentially, all models are wrong, but some are useful. Quote by George E.P. Box. Models are supposed to be exact descriptions of the population, but that is
More informationFitting Linear Statistical Models to Data by Least Squares: Introduction
Fitting Linear Statistical Models to Data by Least Squares: Introduction Radu Balan, Brian R. Hunt and C. David Levermore University of Maryland, College Park University of Maryland, College Park, MD Math
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2
MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized
More informationWooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model
Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model Most of this course will be concerned with use of a regression model: a structure in which one or more explanatory
More informationEconomics 573 Problem Set 5 Fall 2002 Due: 4 October b. The sample mean converges in probability to the population mean.
Economics 573 Problem Set 5 Fall 00 Due: 4 October 00 1. In random sampling from any population with E(X) = and Var(X) =, show (using Chebyshev's inequality) that sample mean converges in probability to..
More informationPermutation-invariant regularization of large covariance matrices. Liza Levina
Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work
More informationCOUNTEREXAMPLES TO THE COARSE BAUM-CONNES CONJECTURE. Nigel Higson. Unpublished Note, 1999
COUNTEREXAMPLES TO THE COARSE BAUM-CONNES CONJECTURE Nigel Higson Unpublished Note, 1999 1. Introduction Let X be a discrete, bounded geometry metric space. 1 Associated to X is a C -algebra C (X) which
More informationNonparametric Econometrics
Applied Microeconometrics with Stata Nonparametric Econometrics Spring Term 2011 1 / 37 Contents Introduction The histogram estimator The kernel density estimator Nonparametric regression estimators Semi-
More informationRegression and Statistical Inference
Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF
More informationUnderstanding Regressions with Observations Collected at High Frequency over Long Span
Understanding Regressions with Observations Collected at High Frequency over Long Span Yoosoon Chang Department of Economics, Indiana University Joon Y. Park Department of Economics, Indiana University
More informationPartitioned Covariance Matrices and Partial Correlations. Proposition 1 Let the (p + q) (p + q) covariance matrix C > 0 be partitioned as C = C11 C 12
Partitioned Covariance Matrices and Partial Correlations Proposition 1 Let the (p + q (p + q covariance matrix C > 0 be partitioned as ( C11 C C = 12 C 21 C 22 Then the symmetric matrix C > 0 has the following
More informationDESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA
Statistica Sinica 18(2008), 515-534 DESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA Kani Chen 1, Jianqing Fan 2 and Zhezhen Jin 3 1 Hong Kong University of Science and Technology,
More informationLOCAL POLYNOMIAL AND PENALIZED TRIGONOMETRIC SERIES REGRESSION
Statistica Sinica 24 (2014), 1215-1238 doi:http://dx.doi.org/10.5705/ss.2012.040 LOCAL POLYNOMIAL AND PENALIZED TRIGONOMETRIC SERIES REGRESSION Li-Shan Huang and Kung-Sik Chan National Tsing Hua University
More informationA nonparametric method of multi-step ahead forecasting in diffusion processes
A nonparametric method of multi-step ahead forecasting in diffusion processes Mariko Yamamura a, Isao Shoji b a School of Pharmacy, Kitasato University, Minato-ku, Tokyo, 108-8641, Japan. b Graduate School
More informationRegularization Methods for Additive Models
Regularization Methods for Additive Models Marta Avalos, Yves Grandvalet, and Christophe Ambroise HEUDIASYC Laboratory UMR CNRS 6599 Compiègne University of Technology BP 20529 / 60205 Compiègne, France
More information18.S096 Problem Set 3 Fall 2013 Regression Analysis Due Date: 10/8/2013
18.S096 Problem Set 3 Fall 013 Regression Analysis Due Date: 10/8/013 he Projection( Hat ) Matrix and Case Influence/Leverage Recall the setup for a linear regression model y = Xβ + ɛ where y and ɛ are
More informationCh 2: Simple Linear Regression
Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component
More informationNeed for Several Predictor Variables
Multiple regression One of the most widely used tools in statistical analysis Matrix expressions for multiple regression are the same as for simple linear regression Need for Several Predictor Variables
More informationA Bootstrap Test for Conditional Symmetry
ANNALS OF ECONOMICS AND FINANCE 6, 51 61 005) A Bootstrap Test for Conditional Symmetry Liangjun Su Guanghua School of Management, Peking University E-mail: lsu@gsm.pku.edu.cn and Sainan Jin Guanghua School
More information1 Appendix A: Matrix Algebra
Appendix A: Matrix Algebra. Definitions Matrix A =[ ]=[A] Symmetric matrix: = for all and Diagonal matrix: 6=0if = but =0if 6= Scalar matrix: the diagonal matrix of = Identity matrix: the scalar matrix
More information