CALCULATING DEGREES OF FREEDOM IN MULTIVARIATE LOCAL POLYNOMIAL REGRESSION


NADINE MCCLOUD AND CHRISTOPHER F. PARMETER

Abstract. The matrix that transforms the response variable in a regression into its predicted value is commonly referred to as the hat matrix. The trace of the hat matrix is a standard metric for calculating degrees of freedom. Nonparametric hat matrices do not enjoy all of the properties of their parametric counterpart, in part because they do not always stem directly from a traditional ANOVA decomposition. In the multivariate, local polynomial setup with a mix of continuous and discrete covariates, some of which may be irrelevant, we formulate asymptotic expressions for the trace of the resultant non-ANOVA and ANOVA-based hat matrices from the estimator of the unknown conditional mean. The asymptotic expression for the trace of the non-ANOVA hat matrix associated with the conditional mean estimator is equal, up to a linear combination of kernel-dependent constants, to that of the ANOVA-based hat matrix. Additionally, we document that the trace of the ANOVA-based hat matrix converges to 0 in any setting where the bandwidths diverge. This attrition outcome can occur in the presence of irrelevant continuous covariates, or it can arise when the underlying data generating process is in fact of polynomial order. Simulated examples demonstrate that our theoretical contributions are valid in finite-sample settings.

1. Introduction

The hat matrix plays a fundamental role in regression analysis; the elements of this matrix have well-known properties and are used to construct variances and covariances of the residuals. In particular, the trace of the hat matrix is commonly used to calculate degrees of freedom, and it appears in regression diagnostics, in constructing measures of fit, and in conceptualizing what a residual is. And while the hat matrix is commonly called upon in parametric estimation and inference, its use in nonparametric settings is much less prevalent. One potential reason that the hat matrix has been less commonly deployed in nonparametric regression analysis is that several notions of the hat matrix exist, which leads to alternative versions of both overall model fit and degrees of freedom (which can subsequently be used to penalize in-sample fit).¹ Within a univariate framework, Huang & Chen (2008) provide an ANOVA decomposition of the total sum of squares into its respective

Date: November 21.
Key words and phrases. Trace, Degrees of Freedom, Effective Parameters, Nonparametric Regression, Irrelevant Regressors, Bandwidth, Goodness-of-fit.
¹ In the nonparametric literature, this hat matrix is also referred to as the smoother matrix.

2 2 HAT MATRIX explained and residual components. However, this ANOVA decomposition is local in nature and needs to be integrated to achieve a global hat matrix. Moreover, in multivariate settings which are of practical appeal calculation of this global ANOVA is likely to be difficult. This is irksome for the calculation of the degrees of freedom of the model as well given that the proper hat matrix stemming from this ANOVA decomposition needs to be integrated. An alternative would be to use the trace of the hat matrix stemming directly from the local polynomial method of estimating the unknown conditional mean (Ruppert & Wand 1994, Fan & Gijbels 1996). In fact, Zhang (2003) discusses exactly this non-anova case in the univariate setting. Here our goal is to augment the work of Huang & Chen (2008) and Zhang (2003) by considering calculation of degrees of freedom in a multivariate, local polynomial setting with a mix of continuous and discrete covariates. From this platform we can compare the effective number of parameters, k eff, stemming from the trace of the global ANOVA hat matrix and its non-anova counterpart. Our generalizations and juxtapositions allow us to make the following nontrivial contributions to the existing literature. One, in the presence of mixed discrete and continuous covariates, the difference in asymptotic expressions of the trace of the ANOVA and non-anova hat matrices is driven by a linear combination of moments of the underlying kernel. This suggests that the non-anova hat matrix taken directly from the multivariate, local polynomial estimator can be used to approximate degrees of freedom for the ANOVA hat matrix as the latter is more computationally intensive in applied settings with multiple continuous covariates. For example, using a bivariate regression, and for local constant, local linear, and local cubic estimators, we show that the absolute differences between asymptotic expressions for the trace of the ANOVA and non-anova hat matrices are in the unit interval. Two, to improve the usefulness of our work to applied settings, we also give consideration to nonparametric regression models in which some covariates are irrelevant. The non-anova nonparametric framework has been the workhorse for the analysis of irrelevant covariates. We show that this framework also lends itself well to meaningful asymptotic expressions for the trace of the implied hat matrix in the presence of irrelevant continuous and discrete covariates. Intuitively, the trace of the non-anova hat matrix is the ratio of two kernel terms that are of equal order of magnitude in the bandwidth vector for the continuous covariates; thus, the influence of bandwidth vector for the irrelevant continuous covariates on the kernel ratio is dominated by the influence of its relevant counterpart. We show that the bandwidth vector for the irrelevant continuous covariates has an attrition effect on the trace of the ANOVA hat matrix resulting in the latter converging to zero in

3 HAT MATRIX 3 probability. Although the trace of the ANOVA hat matrix is also a ratio of two kernel terms, these kernel terms are of different orders of magnitude in the bandwidth vector for the continuous covariates. This paves the way for a sizable influence of the bandwidth vector for the irrelevant continuous covariates relative to its relevant counterpart. In fact, our simulation results confirm this attrition effect of irrelevant regressors on the trace of the ANOVA hat matrix, for the local constant, local linear, and local cubic estimators. One implication of this attrition effect is that the nonparametric ANOVA-based F-tests developed by Huang & Chen (2008), Huang & Su (2009) and Huang & Davidson (2010) may not be operational in the presence of such covariates; degrees of freedom from the non-anova framework may be suitable substitutes for their ANOVA counterparts when irrelevant continuous variables are likely to be present in the underlying nonparametric model. Three, we formalize the trace concept of the non-anova and ANOVA hat matrices that are predicated on only discrete covariates to provide a measure of the degrees of freedom for the underlying nonparametric model. In the presence of only relevant discrete covariates, the traces of ANOVA and non-anova hat matrices all converge in probability to the cardinality of the discrete support. Although this result holds when irrelevant discrete covariates are also present in the nonparametric model, the asymptotic trace values in this case can exceed their purely relevant counterparts. This latter result draws on the theoretical contributions of Ouyang, Li & Racine (2009) who establish, in a purely discrete-covariate setting with least-squares cross-validation (LSCV), that the irrelevant regressors cannot be smoothed out with probability approaching one as the sample size increases; that is, there is a positive probability that the bandwidths selected via LSCV do not converge to their upper extreme values even a n. In essence, with positive probability, the presence of irrelevant discrete covariates can lead to asymptotic trace values of the ANOVA and non-anova hat matrices that are larger than the cardinality of the support for the relevant discrete covariates. This also means that these asymptotic trace values will exhibit large variances in the presence of irrelevant discrete covariates. Four, unlike the parametric hat matrix, the geometric properties of the non-anova hat matrix stemming directly from multivariate local polynomial estimation with mixed data of the unknown conditional mean have yet to be formalized. We show that while the non- ANOVA hat matrix is not a projection matrix, it shares many of the same geometric properties as its parametric counterpart. These properties of the hat matrix are of importance in, for example, assessing the amount of leverage or influence that y j has on ŷ i, which is related to the (i, j)-th entry of the hat matrix. In the special case of a local constant estimator, we

deduce that each ŷ_i is a convex combination of the response vector Y; this convexity property indicates how large the leverage of y_i on its corresponding fitted value ŷ_i is. Thus, our work can also be used to identify high-leverage points and improve model fit in multivariate local polynomial estimation. In essence, our theoretical contributions are of independent interest to the wider nonparametric literature. Additionally, we can use the trace of the hat matrix as a measure of the effective number of parameters used/constructed by the local polynomial model, and this can provide insight into which covariates in the model are relevant. Whereas the theories of Hall, Li & Racine (2007), Hall & Racine (2015) and Ouyang et al. (2009) can shed light on relevancy through the size of the bandwidth and the order of the local polynomial (when selected in a data-driven manner), they do not provide an exact number of parameters. Thus, use of the hat matrix here, together with these data-driven methods, can generate additional insight into how the local polynomial estimator adapts to the data.

The remainder of the paper is organized as follows. Section 2 provides a short review of the geometric properties of the hat matrix from a linear parametric model. Section 3 derives nonasymptotic and asymptotic results for the trace of the non-ANOVA based hat matrix from the multivariate, local polynomial model with a mix of continuous and discrete, and relevant and irrelevant, covariates. Under similar model specifications, Section 4 derives asymptotic results for the trace of the ANOVA based hat matrix. Section 5 explores the implications of our theoretical results using simulated data. Section 6 contains the conclusion. We place all proofs in the technical appendix.

2. A Brief Review of the Hat Matrix for the Canonical Parametric Model

Consider the situation where one is interested in estimation of the regression

(1)  y_i = m(x_i) + ε_i,

where i = 1, 2, ..., n, y_i is our regressand, x_i is a q-vector of regressors, and ε_i is the idiosyncratic noise. If we parameterize our function in (1) to be linear in parameters, m(x_i) = x_i^T β, where β ∈ R^q, then we can estimate the model via least squares to obtain β̂ = (X^T X)^{-1} X^T y, where X is the full n × q design matrix with rank q and y is the n × 1 vector of responses. The vector of fitted values, ŷ, is given by

(2)  ŷ = X(X^T X)^{-1} X^T y = Hy.
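
To fix ideas, here is a small numerical sketch (ours, not part of the paper) of (2): it simulates a design with an intercept column, forms H, and verifies that Hy reproduces the least-squares fitted values. All names and values are illustrative.

```python
import numpy as np

# Sketch of (2): the hat matrix H = X (X'X)^{-1} X' maps y to the OLS fitted values.
rng = np.random.default_rng(0)
n, q = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])   # design with intercept, rank q
beta = np.array([1.0, 0.5, -2.0, 0.3])
y = X @ beta + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)             # hat matrix in (2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares coefficients
print(np.allclose(H @ y, X @ beta_hat))           # True: Hy are the fitted values
```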

The matrix H in (2) is a projection matrix and thus, by definition, H is idempotent, H² = H. By construction, H is symmetric, H^T = H (in this case H is an orthogonal projection matrix). Since premultiplying y by this matrix H puts a hat on y, it is often called the hat matrix. Using basic properties of projection matrices, Hoaglin & Welsch (1978) show that the elements of H, h_ij, satisfy the first four geometric properties:

(i) 0 ≤ h_ii ≤ 1.
(ii) |h_ij| ≤ 1 for i ≠ j.
(iii) h_ii = 1 iff h_ij = 0 for all j ≠ i.
(iv) If X contains a column of ones then ∑_{j=1}^n h_ij = 1.
(v) HX = X. Equivalently, HX_c = X_c, where X_c is any column of the design matrix X.

Properties (i) and (ii) are boundedness conditions on the entries of H. Note that by symmetry of H, property (iv) is equivalent to ∑_{i=1}^n h_ij = 1. Thus, each ŷ_i is an affine combination of the entries of y. We add the invariance property, see (v), as it will help us to establish some important results in the subsequent section. Note that property (v) nests (iv). The trace of the parametric hat matrix is commonly used to calculate degrees of freedom since, by virtue of cyclic permutation under the trace operator,

(3)  tr(H) = tr( X(X^T X)^{-1} X^T ) = tr( (X^T X)^{-1} X^T X ) = tr(I_q) = q,

the number of covariates that we included in the parametric model. For nonparametric regressions, three orthodox definitions of k_eff, which are identical for linear models, are tr(H^T_{τ,γ} H_{τ,γ}), tr(H_{τ,γ}), and tr(2H_{τ,γ} − H^T_{τ,γ} H_{τ,γ}) (see, e.g., Hastie & Tibshirani 1990). In subsequent sections, we show that properties (i) to (v) hold for the multivariate local polynomial regression estimator of the unknown conditional mean. The equality in (3) between the trace of the hat matrix and the rank of the design matrix is one of the distinguishing characteristics of the parametric framework that is not always possessed by its nonparametric counterpart. Heuristically, the hat matrix in local polynomial regression is predicated on an effective design matrix whose column rank increases with the order of the local polynomial. For the multivariate, local polynomial estimator of the unknown conditional mean, we show that the rank of the effective design matrix is the infimum for the trace of the resultant hat matrix.
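
Continuing the sketch above (ours), the next few lines check properties (i)-(v), the trace identity (3), and the fact that the three k_eff definitions coincide for a projection matrix.

```python
# Properties (i)-(v) and the trace identity (3) for the parametric hat matrix H above.
print(np.trace(H))                                       # = q = 4, eq. (3)
print(np.allclose(H, H @ H), np.allclose(H, H.T))        # idempotent and symmetric
print(H.diagonal().min() >= 0, H.diagonal().max() <= 1)  # property (i)
print(np.abs(H - np.diag(H.diagonal())).max() <= 1)      # property (ii)
print(np.allclose(H.sum(axis=1), 1.0))                   # property (iv): intercept column present
print(np.allclose(H @ X, X))                             # property (v): invariance
# The three k_eff definitions agree for a projection: tr(H'H) = tr(H) = tr(2H - H'H) = q.
print(np.trace(H.T @ H), np.trace(H), 2 * np.trace(H) - np.trace(H.T @ H))
```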

6 6 HAT MATRIX 3. The non-anova hat matrix for the multivariate local polynomial estimator To generalize the results in Zhang (2003), we embed a mix of continuous and discrete regressors into the multivariate smooth varying-coefficient model (Hastie & Tibshirani 1993) with conditionally linear structure d z (4) y = m k (X)Z k + ε, k=1 and where y is a scalar regressand, X = (X c 1,..., X c q, X d 1,..., X d r ) T and Z = (Z 1,..., Z dz ) T are the given covariates with Z 1 1, E(ε X, Z) = 0 and var(ε X, Z) = σ 2 (X, Z). When d z = 1 the model in (4) is just a multivariate nonparametric regression model. Thus, allowing d z 1 allows for a broader array of models. Denote Xi d in X i = (Xi c, Xi d ) as the r 1 vector of regressors that takes discrete values, and denote Xi c R q as the vector of continuous regressors. Let Ω d and Ω be the support of Xi d and Xi c, respectively. Let the s th component of x d be x d s which takes c s different values in Ωs d = {0, 1,..., c s 1} for s = 1,..., r and c s 2 is a finite positive constant. Then the cardinality of the set Ωs d is c s, which we denote as Ωs d. Assume X has a sampling density f X with a known bounded support Ω Ω d. Furthermore, assume the square matrix E(ZZ T X = x) has strictly positive eigenvalues for each x Ω Ω d to guarantee identifiability of the model in (4). We let γ be the bandwidth vector for X. As in Li & Racine (2007), we use the partition γ = (h T, λ T ) T to reflect the presence of continuous and discrete regressors in X, with bandwidth subvectors h and λ, respectively. For the case of unordered discrete regressors X d i, we follow Li & Racine (2007) who use a variant of the kernel function of Aitchison & Aitken (1976) that is defined by 1 if X (5) l(xis, d x d is d = x d s s, λ s ) = λ s if Xis d x d s where 0 λ s 1 is the smoothing parameter of x d s. Then the product kernel for x d = (x d 1,..., x d r) T is L(x d, X d i, λ) = r l(xis, d x d s, λ s ) = r λ 1(Xd is xd s) s, where 1(X d is x d s) is an indicator function that equals 1 when X d is x d s, and 0 otherwise.
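
A minimal sketch (ours) of the unordered-discrete product kernel in (5); the function name and the toy data are illustrative only.

```python
import numpy as np

# Aitchison-Aitken-type product kernel in (5) for unordered discrete regressors:
# L(x^d, X_i^d, lambda) = prod_s lambda_s^{1(X_is^d != x_s^d)}.
def discrete_product_kernel(Xd, xd, lam):
    """Xd: (n, r) discrete covariates; xd: (r,) evaluation point; lam: (r,) bandwidths in [0, 1]."""
    mismatch = (Xd != xd)                        # indicator 1(X_is^d != x_s^d)
    return np.prod(np.where(mismatch, lam, 1.0), axis=1)

# lam_s = 0 reproduces an indicator (frequency) estimator; lam_s = 1 smooths x_s^d out entirely.
Xd = np.array([[0, 1], [1, 1], [0, 0]])
print(discrete_product_kernel(Xd, np.array([0, 1]), np.array([0.2, 0.5])))
# -> [1.0, 0.2, 0.5]
```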

7 HAT MATRIX 7 We let K h ( ) be the generalized product kernel (Li & Racine 2007, Henderson & Parmeter 2015), (6) K h (x c, X c i ) = q h 1 s K where K( ) be a symmetric, density function on R. function for the mixed regressor (x c, x d ). Then ( x c s X c is h s ), K γ (x, X i ) = K h (x c, X c i )L(x d, X d i, λ). Define K γ (x, X i ) to be the kernel To obtain an estimate of the unknown smooth functions {m k (x)} dz k=1 and their population mean regression function m(x 1,..., x q, z 1,..., z dz ) = d z k=1 m k(x)z k from the observations {Y i, X i, Z i } n, we employ the p th -order local polynomial estimation method. In what follows, we adopt the notation of Masry (1996a, 1996b). Thus, for a p th -order local polynomial estimation the corresponding objection function is ( ) d z 2 ( (7) min n 1 Y i β j X c i x c) j Zik K γ (x, X i ), β k=1 0 j p where j = (j 1,..., j q ), j = q j i, x j = q xj i i, j! = q j i! = j 1! j q! and p l l =, 0 j p l=0 j 1 =0 j q=0 j 1 + +j q=l j!β j (x) corresponds to ( D j m ) (x), the partial derivative of m(x) = m(x c, x d ) with respect to x c, which is defined as: (8) ( D j m ) (x) j m(x) (x c 1) j 1... (x c q ) jq, and β vertically concatenates β j (0 j p) in lexicographical order (with highest priority to last position so that (0,..., 0, i) is the first element in the sequence and (i, 0,..., 0) is the last element), and g 1 i denotes this one-to-one map. Note that (7) handles the continuous regressor vector x c in a local polynomial manner but the discrete regressor vector x d in a local constant manner. Let β(x; γ) (0 j p) be the estimator for the weighted least square problem in (7), and define S n (x) = D(x c ) T W (x)d(x c ) and T n (x) = D(x c ) T W (x), with D(x c ) = [ T D 1 (x c ),..., D n (x )] c where Di (x c ) vertically concatenates (Xi c x c ) j Z i for 0 j p in lexicographical order, with Z i = (Z 1i,..., Z dzi) T, and denotes the Kronecker operator.
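
A companion sketch (ours) of the generalized product kernel in (6) and the mixed-data kernel K_γ(x, X_i) = K_h(x^c, X^c_i) L(x^d, X^d_i, λ), using a Gaussian K; it reuses discrete_product_kernel from the previous sketch.

```python
import numpy as np

def continuous_product_kernel(Xc, xc, h):
    """K_h(x^c, X_i^c) in (6): product of scaled univariate kernels, here Gaussian."""
    u = (Xc - xc) / h
    return np.prod(np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h), axis=1)

def generalized_product_kernel(Xc, Xd, xc, xd, h, lam):
    """K_gamma(x, X_i) = K_h(x^c, X_i^c) * L(x^d, X_i^d, lambda) for mixed data."""
    return continuous_product_kernel(Xc, xc, h) * discrete_product_kernel(Xd, xd, lam)
```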

8 8 HAT MATRIX Thus, for example, D i (x c ) = Z i for p = 0, and D i (x c ) = [1, (X c i x c ) T ] T Z i for p = 1. Here W (x) is a diagonal matrix with the i th diagonal element being K γ (x, X i ). Now, let N p,l = (l + q 1)! l!(q 1)! be the number of distinct q-tuples j with j = l for 0 l p. That is, N p,l is the number of distinct l th -order partial derivative of m(x) with respect to x c. Set N p p l=0 N p,l. The minimizer of the local polynomial least squares procedure at x is (9) β(x; γ) = S 1 n (x)t n (x)y. If our estimate of interest in the unknown regression function is the first d z elements of this vector, then we have m(x) = ( e T 1,N p I dz ) β(x; γ), where eτ+1,np is the (τ + 1) th standard basis vector in the coordinate space R Np and I dz is the identity matrix in R dz. In practice, however, interest may lie in a specific derivative, say of first or second order, of the regression function. Thus, we define m τ (x) = ( e T τ+1,n p I dz ) β(x; γ), where τ = 0, 1,..., Np 1. That is, (τ + 1) is the lexicographical position of the derivative vector of m, of interest. Thus the relationship between τ and m is: m 1 (10) τ = g 1 m (m) + N p,r. (11) Then, we can cast our local polynomial regression estimator as m τ (x, z) = where z = (z 1,..., z dz ) T and d z k=1 m τ,k(x)z k r=0 = m τ (x) T z, = ( e T τ+1,n p z ) T Sn 1 (x)d(x c ) T W (x)y, = A τ,j (x)y j (12) A τ,j (x) ( e T τ+1,n p z T ) S 1 n (x)d j (x c )K γ (x, X j ). If we replace the generic x with our n observations, X i, and τ = 0, then we obtain the fitted values for our data. Further, we can use these n observations to construct our hat matrix H τ,γ. From (11) we have that the (i, j)th element of our hat matrix is (13) H τ,γ (i, j) = A τ,i (X j ), for i, j = 1,..., n.
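
To make (11)-(13) concrete, the sketch below (ours, not the authors' code) assembles the hat matrix for the special case d_z = 1, p = 1 (local linear), τ = 0, continuous covariates only, and a Gaussian product kernel; row i collects the equivalent-kernel weights that produce the fitted value at X_i, and each row summing to one anticipates the affine-combination property discussed in the next section.

```python
import numpy as np

def local_linear_hat_matrix(Xc, h):
    """Non-ANOVA hat matrix H_{0,h} for d_z = 1, p = 1, Gaussian product kernel."""
    n, q = Xc.shape
    H = np.empty((n, n))
    for i in range(n):
        D = np.column_stack([np.ones(n), Xc - Xc[i]])    # effective design D(x^c), N_p = 1 + q
        u = (Xc - Xc[i]) / h
        w = np.prod(np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h), axis=1)  # diag of W(x)
        S = D.T @ (w[:, None] * D)                       # S_n(x) = D(x^c)' W(x) D(x^c)
        H[i, :] = np.linalg.solve(S, D.T * w)[0, :]      # e_1' S_n^{-1} D' W: weights A_{0,j}(X_i)
    return H

rng = np.random.default_rng(1)
Xc = rng.uniform(size=(150, 2))
H = local_linear_hat_matrix(Xc, h=np.array([0.15, 0.15]))
print(np.allclose(H.sum(axis=1), 1.0))   # rows sum to one: fitted values are affine in y
print(np.trace(H))                        # effective number of parameters tr(H_{0,h})
```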

9 HAT MATRIX Geometric Properties of the non-anova hat matrix. Clearly, H τ,γ is not symmetric in this generalization. To show that the properties in (i)-(iv) hold for H τ,γ (i, j) for i, j = 1,..., n, we will use the following result which is the local polynomial, multivariate, generalized kernel extension of Zhang (2003, Lemma A.1). Lemma 3.1. For a nonnegative kernel satisfying K γ (0) = sup u,v K γ (u, v), we have that (1) n H2 τ,γ(i, j) H τ,γ (i, i) for i = 1,..., n, (2) Let X l = (X l 1,..., X l n) T for any l such that 0 l p. Assume the relationship between τ and and its derivative vector m is as defined in (10). Then ( ) ) q H τ,γ (i, j) (X lj l ( s Z j = X l m i Z j ), and, consequently, H 0,γ X l = X l. m s Part (1) of Lemma 3.1 implies that 0 H τ,γ (i, i) 1 for i = 1,..., n and H τ,γ (i, j) 1, for i j. Thus, properties (i) and (ii) hold. To show that property (iv) also holds, observe from Proposition A.1 (see appendix) that Z j A τ,j (x) = δ 0,τ Z j, and thus (14) 1 if τ = 0 H τ,γ (i, j) = 0 if τ > 0, as the first entry of Z j is normalized to 1. Property (iii) holds by virtue of properties (i), (ii) and (iv). For τ = 0, part (2) of Lemma 3.1 is equivalent to property (v). Equation (14) shows that for each i th observation, the conditional mean estimator is an affine combination of the i th row of the associated hat matrix, whereas that of its derivative estimator counterpart is a zero-sum linear combination. Note that there is an absence of the invariance property of H τ,γ for τ > 0 (see part 2), which is intuitive. For derivative estimators, H τ,γ is associated with dimensionality reduction but preservation of the span of a polynomial ring. Thus, Lemma 3.1 reveals that for nonparametric regression, H τ,γ can be exploited to conduct diagnostic analyses, such as leverage effects, which have be used in parametric settings. In fact, the hat matrix for the local-constant least-squares (LCLS) estimator renders easy interpretation of leverage effects. To see this, assume that for model (4), d z = 1 and

10 10 HAT MATRIX p = 0. Then the LCLS estimator of our model is K γ (x j, x i )y j (15) m 0 (x i ) = = K γ (x l, x i ) where (16) A 0,j (x i ) = K γ(x j, x i ). K γ (x l, x i ) l=1 l=1 A 0,j (x i )y j, We use the notation A ij = A 0,i (x j ) = H 0,γ (i, j) from (15) for parsimony. Observe that by the definition of the product kernel, K γ (x j, x i ) and A ij for H 0,γ, properties (i) and (ii) become: 0 A ij 1 i, j. However, unlike the parametric model, the elements of the local constant hat matrix are all positive. Clearly, from (16), H 0,γ is a symmetric matrix, as in the parametric case. Thus, A ij = n A ij = 1. More important, this normalization property holds unconditionally, rather than conditionally on the regressor matrix. Combining this restriction with property (iii) yields that each ŷ i is a convex combination of the response vector Y ; this convexity restriction, although clearly a stronger condition than its parametric counterpart, renders easier detection and comparison of high-leverage points relative to other points Degrees of Freedom for Mixed Regressors. Following Zhang (2003), first we formulate a relationship among three common measures of k eff : tr(h T τ,γh τ,γ ), tr(h τ,γ ), tr(2h γ H T γ H γ ). To do so, we make use of the bound implied by Part (1) of Lemma 3.1: (17) tr(h T τ,γh τ,γ ) = {H τ,γ (i, j)} 2 H τ,γ (i, i) = tr(h τ,γ ). In addition, using the fact that (I n H τ,γ ) T (I n H τ,γ ) is positive definite, we have ( ) tr (I n H τ,γ ) T (I n H τ,γ ) = n tr(2h τ,γ Hτ,γH T τ,γ ) > 0, given equivalence of the trace of a transposed matrix. This result, coupled with (17), implies that tr(h T τ,γh τ,γ ) tr(h τ,γ ) tr(2h τ,γ H T τ,γh τ,γ ) < n. An implication of Part (2) of Lemma 3.1 is that H τ,γ projects matrices onto a space with column span of d z N p. Thus, tr(h T τ,γh τ,γ ) d z N p. This follows from Schur s inequality

11 HAT MATRIX 11 (Lütkepohl 1996, p. 43), along with the fact that the trace of a matrix is equal to the sum of its eigenvalues. We now formalize our generalization of Zhang s (2003) nonasymptotic bound as follows. Proposition 3.1. Assume the kernel condition in Lemma 3.1 holds. For the multivariate local polynomial regression defined in (4) and any h R q + and λ [0, 1] r, we have d z N p tr(hτ,γh T τ,γ ) tr(h τ,γ ) tr(2h τ,γ Hτ,γH T τ,γ ) < n, where N p is defined above. Next, we present the asymptotic results for tr(h τ,γ ), tr(hτ,γh T τ,γ ), and tr(2h τ,γ Hτ,γH T τ,γ ). A few assumptions are in order. Assumption A.1. (x i, z i, y i ) are i.i.d. as (X, Z, Y ). The covariates X = (X c, X d ) have bounded support Ω Ω d, and the density f(x) of X is Lipschitz continuous in x c and bounded away from zero on Ω Ω d. Assumption A.2. m k (x c, x d ), for k = 1,..., d z, has continuous (p + 1) th derivative in Ω x d Ω d. Assumption A.3. The fourth moment of ε exists and is strictly positive. Assumption A.4. The kernel K( ) is a symmetric, nonnegative, and bounded continuous probability density function having compact support. Specifically, u 4p K(u) L 1 and u 4p+q K(u) 0 as u. Assumption A.5. The matrix E(ZZ T X = x) has strictly positive eigenvalues for each x Ω Ω d, and each entry is Lipschitz continuous in x c. These conditions are identical or similar up to generalizations to those made by Zhang (2003) and Huang & Chen (2008). For example, Assumptions A.1 and A.4 ensure convergence in the mean-square sense of the matrices of multivariate moments of the kernel K from the multivariate local polynomial estimation due to Masry (1996b). 2 To state our main theorem, we introduce some additional notation. Define Γ (x) = f X (x)e(zz T X = x). For each j with 0 j 2p, define µ j = u j K(u)du, ν j = u j K 2 (u)du, R q R q 2 Masry (1996a) establishes uniform strong consistency of these moment matrices.

12 12 HAT MATRIX and the N p N p dimensional matrices M 0,0 M 0,1... M 0,p S 0,0 S 0,1... S 0,p M M = 1,0 M 1,1... M 1,p.., S = S 1,0 S 1,1... S 1,p.., M p,0 M p,1... M p,p S p,0 S p,1... S p,p where M i,j and S i,j are N i N j dimensional matrices whose (l, m) elements are µ gi (l)+g j (m) and ν gi (l)+g j (m), respectively. Hence, the matrices M and S are the multivariate moments of the kernels K and K 2, respectively. Given that our kernel function is a probability density function, µ j is the j th raw moment of the kernel and ν j is the kernel weighted j th raw moment. Let c h,,i = [(Xc i x c ) h] and c h,,i vertically concatenates [ c h,,i] j for 0 j p in lexicographical order, where represents Hadamard division. Also, let d,i = X d i x d. Then, for a fully nonparametric regression model with mixed continuous and discrete regressors, we define the associated multivariate equivalent kernel, K τ( ), as (18) K τ( c h,,i, d,i) = e T τ+1,n p M 1 c h,,i K( c h,,i)l(x d, X d i, λ). Clearly, K τ(0, 0) = e T τ+1,n p M 1 e T τ+1,n p [K(0)] q, and we set K τ(0, 0) = K τ(0). As noted in Fan & Gijbels (1996, sect ) the equivalent kernel is the weighting scheme that arises based on the specific kernel, polynomial order chosen and the design points location relative to the point of evaluation. The multivariate equivalent kernel is the effective weighting scheme that produces the estimator for β j (x) for given bandwidth γ. When p = 0, the multivariate equivalent kernel is identical to the product kernel, but for p > 0, the multivariate equivalent kernel can automatically adapt to alternative data designs as well as account for boundary estimation. Theorem 3.2. Assume the relationship between τ and and its derivative vector m is as defined in (10), and the hat matrix H τ,γ is based on (12) with γ = (h T, λ T ) T. Let the vector of bandwidth λ be such that λ [0, b n ] r, where b n is a positive sequence that converges to zero as n. Under Assumptions A.1 to A.5, with h s 0, for s = 1,..., q, and n,

and nh_1 ⋯ h_q → ∞, we have

(19)  tr(H_{τ,γ}) = d_z K^*_τ(0) |Ω × Ω^d| / ( h^m ∏_{s=1}^q h_s ) {1 + o_P(1)},

(20)  tr(H^T_{τ,γ} H_{τ,γ}) = d_z (K^*_τ ∗ K^*_τ)(0) |Ω × Ω^d| / ( h^{2m} ∏_{s=1}^q h_s ) {1 + o_P(1)},

(21)  tr(2H_{τ,γ} − H^T_{τ,γ} H_{τ,γ}) = [ d_z / ( h^m ∏_{s=1}^q h_s ) ] ( 2K^*_τ(0) − (K^*_τ ∗ K^*_τ)(0)/h^m ) |Ω × Ω^d| {1 + o_P(1)},

where |Ω × Ω^d| is the volume of the Cartesian product Ω × Ω^d, ∗ is the convolution operator, and (K^*_τ ∗ K^*_τ)(0) = e^T_{τ+1,N_p} M^{-1} S M^{-1} e_{τ+1,N_p}.

Theorem 3.2 shows that the asymptotic k_eff are proportional to the total number of covariates in model (4), d_z, and inversely proportional to the bandwidths for the continuous regressors, but are unrelated to the bandwidths for the discrete regressors. This latter result on the discrete covariates cannot be inferred from Zhang's (2003) asymptotic approximations for the trace of the resultant hat matrices for the local polynomial estimator of the conditional mean (τ = 0) for model (4) with a scalar, continuous X. Theorem 3.2 also shows that each asymptotic k_eff is independent of the mixed design density f.

In the case where the discrete regressors are ordered, Li & Racine (2007) suggest, in lieu of (5), the use of the following kernel function:

(22)  l(X^d_{is}, x^d_s, λ_s) = 1 if X^d_{is} = x^d_s, and λ_s^{|X^d_{is} − x^d_s|} if X^d_{is} ≠ x^d_s,

where the range of the smoothing parameter of x^d_s, λ_s, is [0, 1]. As in the unordered case, if λ_s is equal to its minimum value, the function in (22) is an indicator function; if λ_s is equal to its maximum, then (22) is a uniform weight function. In this case, the results of Theorem 3.2 continue to hold.

3.3. Degrees of Freedom for Continuous Regressors. In the purely continuous case, we assume that j! β_j(x^c) estimates (D^j m)(x^c). Then the p-th order local polynomial estimation has the following objective function:

min_β  n^{-1} ∑_{i=1}^n ( Y_i − ∑_{k=1}^{d_z} ∑_{0 ≤ |j| ≤ p} β_j (X^c_i − x^c)^j Z_{ik} )² K_h(x^c, X^c_i).
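
A rough finite-sample check of (19) (ours): for the LCLS estimator (p = 0, τ = 0, d_z = 1, no discrete covariates) the equivalent kernel satisfies K^*_0(0) = [K(0)]^q, so with X uniform on [0,1]² (so |Ω| = 1) and a Gaussian kernel, the theorem predicts tr(H_{0,h}) ≈ [K(0)]²/(h_1 h_2). Boundary effects and a moderate value of n·h_1·h_2 keep the two numbers from matching exactly; they should only agree in order of magnitude.

```python
import numpy as np

def lcls_trace(Xc, h):
    """tr(H_{0,h}) for the LCLS estimator with a Gaussian product kernel."""
    n = Xc.shape[0]
    tr = 0.0
    for i in range(n):
        u = (Xc - Xc[i]) / h
        k = np.prod(np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi), axis=1)  # the h's cancel in the ratio
        tr += k[i] / k.sum()                     # diagonal entry H_{0,h}(i, i)
    return tr

rng = np.random.default_rng(2)
Xc = rng.uniform(size=(2000, 2))                 # q = 2, X uniform on [0,1]^2, |Omega| = 1
h = np.array([0.08, 0.08])
print(lcls_trace(Xc, h))                                # empirical trace
print((1.0 / np.sqrt(2.0 * np.pi))**2 / np.prod(h))     # asymptotic value [K(0)]^2/(h_1 h_2) ~ 24.9
```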

14 14 HAT MATRIX Thus, D(x c ) is as defined on page 7, and we redefine the diagonal weighting matrix W (x c ) so that its i th diagonal entry is K h (x c, X c i ). Then, for the continuous regressor case, we have (23) A τ,j (x c ) ( e T τ+1,n p z T ) S 1 n (x c )D j (x c )K h (x c, X c j ). Using (23) to defined the (i, j)th element of our hat matrix, H τ,h, as is done in (13), note that the results of Lemma 3.1 and Proposition A.1 continue to hold in the continuous regressor case. To generate asymptotic expressions for tr(h τ,γ ), tr(h T τ,γh τ,γ ), and tr(2h τ,γ H T τ,γh τ,γ ) in the continuous regressor case, we need to modify Assumptions A.1, A.2, and A.5, respectively, as follows: Assumption B.1. (x c i, z i, y i ) are i.i.d. as (X c, Z, Y ). support Ω, and the density f(x c ) of X c zero. The covariates X c have bounded is Lipschitz continuous and bounded away from Assumption B.2. m k (x c ), for k = 1,..., d z, has continuous (p + 1) th derivative in Ω. Assumption B.3. The matrix E(ZZ T X c = x c ) has strictly positive eigenvalues for each x c Ω, and each entry is Lipschitz continuous. Note that for d z = 1 in our multivariate model (4) with only continuous regressors, the corresponding multivariate equivalent kernel, K τ( ), is K τ( c h,,i) = e T τ+1,n p M 1 c h,,i K h ( c h,,i). Theorem 3.3. Assume the relationship between τ and and its derivative vector m is as defined in (10), and the hat matrix H τ,h is based on (23). Under Assumptions B.1 to B.3 and A.3 to A.4, with h s 0, for s = 1,..., q, and n, and nh 1... h q, the asymptotic results in Theorem 3.2 continue to hold so that (24) tr(2h τ,h H T τ,hh τ,h ) = tr(h τ,h ) = d zk τ(0) Ω h m q h s h m q h s { 1 + op (1) }, tr(hτ,hh T τ,h ) = d zkτ Kτ(0) Ω h { 2m q 1 + h op (1) }, s ( d z 2Kτ(0) K τ Kτ(0) h m ) Ω { 1 + o P (1) }. Clearly, for τ = 0 and q = 1, Theorem 3.3 nests Zhang s (2003) results for degrees of freedom of local polynomial hat matrices (see Theorems 1 and 3, pages 612 and 616, respectively).

3.4. Degrees of Freedom for Discrete Regressors. The case in which all regressors are discrete is also common in applied research. We therefore assume that for model (4), d_z = 1, p = 0, and X_i = X^d_i. Then the resulting LCLS estimator implied by (15) is

m̂(x^d_i) = ∑_{j=1}^n L(x^d_j, x^d_i, λ) y_j / ∑_{l=1}^n L(x^d_l, x^d_i, λ) = ∑_{j=1}^n A_j(x^d_i) y_j,

where

(25)  A_j(x^d_i) = L(x^d_j, x^d_i, λ) / ∑_{l=1}^n L(x^d_l, x^d_i, λ).

We let H_λ(i, j) := A_i(x^d_j). This formulation of H_λ gives rise to the following asymptotic expressions.

Theorem 3.4. Assume (x^d_i, y_i) are i.i.d. as (X^d, Y), and the probability mass function f(x^d) is bounded away from zero on Ω^d. Let the vector of bandwidths λ be such that λ ∈ [0, b_n]^r, where b_n is a positive sequence that converges to zero as n → ∞. Then

(26)  tr(H_λ) = |Ω^d| {1 + o_P(1)},
(27)  tr(H^T_λ H_λ) = |Ω^d| {1 + o_P(1)},
(28)  tr(2H_λ − H^T_λ H_λ) = |Ω^d| {1 + o_P(1)},

uniformly in λ ∈ [0, 1]^r.

The results in Theorem 3.4 suggest that in the presence of only discrete regressors, tr(H_λ), tr(H^T_λ H_λ) and tr(2H_λ − H^T_λ H_λ) are asymptotically equivalent, in the sense that any pairwise difference tends to zero in probability. Thus, the computational cost of tr(H^T_λ H_λ), which is O(n²), relative to that of tr(H_λ), which is O(n), may suggest use of tr(H_λ).

3.5. Degrees of Freedom in the Presence of Relevant and Irrelevant Regressors. Our preceding analyses assume that all variables included in the regression are relevant. It is common in applied work to have a mix of irrelevant and relevant regressors included in the same regression. In this setting, we are interested in the asymptotic expressions for tr(H_{0,γ}) when a set of bandwidths moves toward their theoretical upper bounds and another set moves toward zero. To proceed with this scenario, without loss of generality, we assume the first r_1 (0 ≤ r_1 ≤ r) components of X^d_i are relevant, and the first q_1 (1 ≤ q_1 ≤ q) components of X^c_i are relevant. We denote X̃^c_i and X̃^d_i as the relevant components, and X̄^c_i

16 16 HAT MATRIX and X i d as the irrelevant components. We adopt the concept of irrelevant regressors from Hall et al. (2007), thus we assume that (29) (Y, X) and X are independent of each other. By virtue of this independence assumption, f(x) = f( x) f( x), where f( x) and f( x) are the marginal densities of Xi and X i respectively. We use Ω and Ω d to denote the support of x c and x d, respectively. 3 In the ensuing sections, we will make use of the following kernel partitions: (30) and L(Xi d, x d, λ) = L( X i d, x d, λ)l( X i d, x d, λ), r 1 = l( X is, d x d s, λ r s ) l( X is, d x d s, λ s ), s=r 1 +1 (31) K h (Xi c x c ) = K h( Xc i x c) K h( Xc i x c), q 1 ( ) q Xc = h 1 s K is x c s h s s=q 1 +1 h 1 s K ( ) Xc is x c s. h s With ν d, x d Ω d, define 1 s ( ν d, x d ) = 1 s ( ν d s x d s) r 1 t=1,t s 1 s( ν d = x d ). That is, 1 s ( ν d, x d ) is an indicator function equal to one if ν d and x d only differ in their sth element, and zero otherwise. For l = 1, 2, define m l ( x) = E{[K h( X i c x c )L( X i d, x d, λ)] l }, {[( q ( Xc m l ( x) = E K is x c ) ) ] l } s L( h X i d, x d, λ), s s=q 1 +1 m L,l ( x d ) = E{[L( X d i, x d, λ)] l }. Note the subtle difference between m l ( x) and m l ( x): m l ( x) does not contain the division by h that exists in m l ( x). This feature will be important when we study the limiting behavior of our ANOVA based hat matrix. What do these three terms capture? m l ( x) is the l th raw moment of the kernel weights for the irrelevant covariates at the point x, m l ( x) is the l th raw moment of the kernel weights for the irrelevant covariates at the point x, but scaled by the irrelevant continuous covariates bandwidth vector, and m L,l ( x d ) is the l th raw moment 3 Hall et al. (2007) highlight that a more practically appealing yet theoretically challenging variant of (29) is: conditional on X, the variables X and Y are independent. We therefore follow Hall et al. (2007) and choose the more restrictive of these two independence assumption.

17 HAT MATRIX 17 of the kernel weights for the irrelevant, discrete covariates at the point x d. All three of these moment based functions are expectations taken over the design points. To shed light on the asymptotic behavior of the tr(h 0,γ ) in the presence of some irrelevant regressors, we examine the entries of the resultant (H 0,γ ). Appealing to equations (15), (30) and (31), we obtain [ q1 m 0 (x i ) = k [ q1 k ( x c js x c is ) q h s s=q 1 +1 ( x c js x c is ) q h s k s=q 1 +1 ( x c js x c is k h s ( x c js x c is h s ) r1 l(x d js, x d is, λ s ) ) r1 l(x d js, xd is, λ s) r s=r 1 +1 r s=r 1 +1 ] l(x d js, x d is, λ s ) y j ]. l(x d js, xd is, λ s) For this general setting with a mix of discrete and continuous regressors, Hall et al. (2007) establish and prove that the cross-validated bandwidths for the irrelevant regressors converge in probability to the suprema of their ranges. Ideally, if we set h s = for s = q 1 + 1,..., q and λ s = 1 for s = r 1 + 1,..., r, we have [ q1 m 0 (x i ) = = k [ q1 [ q1 k k [ q1 k ( x c js x c is h s ( x c js x c is h s ( x c js x c is h s ( x c js x c is h s ) r1 ] l(x d js, x d is, λ s ) y j ) r1 ] k(0)q q 1 1 r r 1 l(x d js, xd is, λ k(0) s) q q 1 1 r r 1 ) r1 ] l(x d js, x d is, λ s ) y j ) r1 ] = l(x d js, xd is, λ s) A ij y j, where A ij only contains the relevant regressors. What is apparent is that the algebraic form of the local constant estimator suggests that, regardless of the number of variables in the model, when variables are smoothed out, it is only the bandwidths associated with variables not smoothed away that dictate the behavior of A ij. Only when all variables are smoothed away is A ij impacted by the increasing bandwidths. Thus, calculating degrees of freedom for the local constant kernel estimator is not influenced by the presence of irrelevant regressors when relevant regressors are present. What this suggests is that the degrees of freedom is only influenced by the number of relevant regressors. This is clearly not the case in OLS where adding an additional (irrelevant) regressor always contributes 1 to the degrees of freedom. Drawing of the theoretical contributions of Hall et al. (2007), we now provide the asymptotic expression for tr(h 0,γ ), tr(h T 0,γH 0,γ ), and tr(2h 0,γ H T 0,γH 0,γ ) as follows.

18 18 HAT MATRIX Theorem 3.5. Suppose d z = 1 in model (4), and condition (29) and Assumptions A.1 to A.5, are satisfied. Assume as n, (i) h s 0, for s = 1,..., q 1, h s for s = q 1 + 1,..., q, and nh 1... h q1 and (ii) λ s 0, for s = 1,..., r 1, and λ s 1 for s = r 1 + 1,..., r. Then for the LCLS estimator we have (32) (33) (34) tr(h 0,γ ) = [K(0)]q 1 Ω Ω d q1 h s { 1 + op (1) }, tr(h T 0,γH 0,γ ) = νq 1 0 Ω Ω d q1 h s { 1 + op (1) } tr(2h 0,γ H T 0,γH 0,γ ) = {2[K(0)]q 1 ν q 1 0 } Ω Ω d q1 h s { 1 + op (1) }. Theorem 3.5 formally establishes that the influence of the irrelevant regressors on tr(h 0,γ ) is asymptotically negligible (see its counterpart in absence of irrelevant regressors that can be deduced from Theorem 3.2). Similar asymptotic expressions for the case of continuous only regressors can be deduced from Theorem 3.5. In this latter case, the influence of irrelevant regressors will also have asymptotically negligible effect on the trace of each of the three non-anova hat matrices. Our foregoing results and discussions highlight that all three non-anova trace measures are well-defined and useful in the presence of irrelevant covariates Hat matrices with only discrete regressors. As in the case with only relevant regressors, we assume that for model (4) d z = 1, p = 0, and X = X d. Then, by virtue of the kernel partition in (30), the LCLS estimator in (15) simplifies to L( X j d, x d i, λ)l( X j d, x d i, λ)y j (35) m(x d i ) = L( X l d, xd i, λ)l( X l d, xd i, λ) = where l=1 (36) A j (x d i ) = L( X j d, x d i, λ)l( X j d, x d i, λ) L( X l d, xd i, λ)l( X l d, xd i, λ). l=1 A j (x d i )y j, In this case with only discrete regressors, Ouyang et al. (2009) establish that the irrelevant regressors the associated bandwidths from the least-squares cross-validated method can be smoothed out with probability approaching one as the sample size increases; that is, there is a positive probability that these bandwidths do not converge to their upper extreme 4 Zhang (2003) notes this observation.

19 HAT MATRIX 19 values even a n. Thus λ s 1 for at least one s = r 1 + 1,..., r, and hence in (36) L( X d l, xd i, λ) = 1 is not guaranteed asymptotically. In essence, Ouyang et al. s (2009) result implies that tr(h λ ) can exceed Ω d with positive probability, where H λ is predicated on (36). This implication also holds for tr(h T λ H λ) and tr(2h λ H T λ H λ). Theorem 3.6. Assume (x d i, y i ) are i.i.d. as (X d, Y ), and the probability mass function f(x d ) is bounded away from zero on Ω d. Assume H λ is predicated on (36). Suppose that λ = (λ 1,..., λ r1 ) [0, b n ] r 1, where b n is a positive sequence that converges to zero as n and lim n P r( λ r1 +1 = 1,..., λ r1 +1 = 1) α for some α (0, 1). Then (37) tr(h λ ) = Ω d x f( x d ) { 1 + op (1) }, m L,1 ( x d ) d (38) (39) tr(hλ T H λ ) = Ω d x f( x d ) { 1 + op (1) }, m L,1 ( x d ) d tr(2h λ Hλ T H λ ) = Ω d x f( x d ) { 1 + op (1) }, m L,1 ( x d ) d uniformly in λ [0, 1] r. Theorem 3.6 suggests that for nonparametric regressions with only discrete covariates the asymptotic equivalence between any pair of tr(h λ ), tr(h T λ H λ) and tr(2h λ H T λ H λ) is valid even if some of the covariates are irrelevant. Note also that f( x d ) m L,1 ( x d ) for each x d Ω d. In particular, f( x d ) < m L,1 ( x d ) for each x d Ω d. Thus, x f( x d ) < Ω d d m L,1. ( x d ) Therefore, Theorem 3.6 implies that in the presence of irrelevant discrete variables tr(h λ ) < Ω d Ω d asymptotically. In essence, x f( x d ) d m L,1 is a measure of the degree of irrelevance. ( x d ) 4. A Multivariate generalization of the hat matrix from the ANOVA framework Huang & Chen (2008) consider the local polynomial estimator for model (4) with a scalar continuous regressor X i under an ANOVA framework. To do this, from (7) Huang & Chen (2008) define a local SSE, SST, and SSR, respectively as SSE p (x, h) = n 1 n ( Yi p ˆβ j=0 j (X i x) j) 2 Kh (X i x) n 1 n K, h(x i x) SST p (x, h) = n 1 n ( Yi Ȳ ) 2 Kh (X i x) n 1 n K, h(x i x) SSR p (x, h) = n 1 n ( p ˆβ j=0 j (X i x) j Ȳ ) 2 Kh (X i x) n 1 n K, h(x i x)

20 20 HAT MATRIX so that SST p (x, h) = SSE p (x, h) + SSR p (x, h). ANOVA decomposition are SSE p (h) = SST (h) = SSR p (h) = Their global counterparts to this local SSE p (x, h) ˆf(x; h)dx, SST (x, h) ˆf(x; h)dx, SSR p (x, h) ˆf(x; h)dx, and SST (h) = SST n (Y i Ȳ )2 under some conditions. Their corresponding hat matrix to the global ANOVA decomposition is denoted as H, and is defined as H = W H ˆf(x; h)dx, with W a diagonal matrix having entries Kh (X i x)/ ˆf(x; h), and (40) H = X(X T W X) 1 X T W, with X being the effective design matrix generated by the local polynomial expansion. We now extend the Huang & Chen (2008) framework by allowing the regression model to have q continuous regressors in the vector X c and r discrete regressors in a vector X d. In light of the foregoing local and global ANOVA decompositions, we proceed in the following way: SSE p (x, γ) = n ( 1 n Y i ˆβ ( 0 j p j X c i x c) ) j 2 Kγ (X i x) n 1 n K, γ(x i x) SST (x, γ) = n 1 n ( Yi Ȳ ) 2 Kγ (X i x) n 1 n K, γ(x i x) n ( 1 n ˆβ ( 0 j p j X c i x c) ) j 2Kγ Ȳ (X i x) SSR p (x, γ) = n 1 n K. γ(x i x) Their global counterparts to this local ANOVA decomposition are SSE p (γ) = x d SSE p (x, γ) ˆf(x; γ)dx c, SST (γ) = x d SSR p (γ) = x d SST (x, γ) ˆf(x; γ)dx c, SSR p (x, γ) ˆf(x; γ)dx c,

21 HAT MATRIX 21 where x d refers to summation over all atoms x d = (x d 1,..., x d r) of the distribution of X d. Then, for this generalization, (41) H = x d W H ˆf(x; γ)dx c, with H as defined in (3.3), X = D(x c ), W = W (x)/ ˆf(x; γ) a diagonal matrix having entries K γ (x, X i )/ ˆf(x; γ), where ˆf(x; γ) = 1 n K γ (x, X i ) and we assume the following normalization: K γ (X i x)dx c = 1, x d which ensures that SST = x SST (x, γ) ˆf(x; γ)dx c = n 1 n d (Y i Ȳ )2. Define M 0,0 M 0,1... M 0,p M M 1 = 1,0 M 1,1... M 1,p... M p,0 M p,1... M p,p The immediate result is a generalization of the trace result in Theorem 4(c) in Huang & Chen (2008). Theorem 4.1. Assume the conditions in Theorem 3.2 hold and d z = 1 in (4). The conditional trace of H, as defined in (41), for the multivariate local polynomial estimator of the conditional mean is asymptotically (42) tr(h ) = Ω Ωd q h s p ( ) { M r,c S c,r 1 + op (1) }. r,c=0 Theorem 4.1 shows that the asymptotic expansion for tr(h ) is inversely related to the bandwidths for the continuous regressors but are unrelated to the bandwidths for the discrete regressors and the mixed design density. In the absence of discrete regressors, the following corollary is immediate. Corollary 4.2. Assume the conditions in Theorem 4.1 hold with only continuous regressors in (4). The conditional trace of H for the multivariate local polynomial estimator is

22 22 HAT MATRIX asymptotically (43) tr(h ) = Ω q h s p ( ) { M r,c S c,r 1 + op (1) }. r,c=0 Clearly, the difference between the non-anova asymptotic expressions for the hat matrix implied by the conditional mean estimator in Theorems 3.2 and and that of their ANOVA counterpart in Theorem 4.1 is driven by a linear combination of kernel-dependent constants, which can be easily calculated. Furthermore, this result remains in the presence of only continuous regressors, as can be seen from Theorem 3.3 and Corollary 4.2. More important, under certain model restrictions this linear combination of kernel dependent constants can be quite minuscule; we illustrate this in the ensuing subsection Comparing Degrees of Freedom from the univariate ANOVA and non-anova frameworks. In light of the regression function with scalar smooth covariate that is in both Zhang (2003) and Huang & Chen (2008), we now gauge the size of the linear combination of kernel-dependent constants that drives the difference between the asymptotic expressions for the trace of their resultant hat matrices for the conditional mean estimator. We consider the three most popular local polynomial estimator: LCLS, local linear, and local cubic; thus, p {0, 1, 3}. For a scalar covariate, we have M = (µ i+j 2 ) 1 i,j p+1, M 1 := (m ij ) 1 i,j p+1, S = (ν i+j 2 ) 1 i,j p+1, (see pages 12 and 21). Thus, Theorem 3.3 and Corollary 4.2 simplify, respectively, to We define tr(h 0,h ) = h 1 e T 1,p+1M 1 e 1,p+1 K(0) Ω { 1 + o P (1) }, tr(h0,hh T 0,h ) = h 1 e T 1,p+1M 1 SM 1 e 1,p+1 Ω { 1 + o P (1) }, ( p+1 ) tr(h ) = h 1 m ij ν i+j 2 Ω { 1 + o P (1) }, with i + j is even. 1 C p κ 2 C p κ 3 C p κ p+1 i, p+1 i, p+1 i, i, m ij ν i+j 2 e T 1,p+1M 1 e 1,p+1 K(0), m ij ν i+j 2 e T 1,p+1M 1 SM 1 e 1,p+1, m ij ν i+j 2 ( 2e T 1,p+1M 1 e 1,p+1 K(0) e T 1,p+1M 1 SM 1 e 1,p+1 ),

with i + j even, and where ¹C^p_κ, ²C^p_κ, and ³C^p_κ are associated with the differences between tr(H^*) and tr(H_{0,h}), tr(H^T_{0,h} H_{0,h}), and tr(2H_{0,h} − H^T_{0,h} H_{0,h}), respectively. For p = 0, 1, for example, the ANOVA-based result in Corollary 4.2 degenerates to

(44)  tr(H^*) = h^{-1} |Ω| ν_0 {1 + o_P(1)},
(45)  tr(H^*) = h^{-1} |Ω| (ν_0 + ν_2/µ_2) {1 + o_P(1)},

respectively, whereas the non-ANOVA counterparts implied by Theorem 3.3 degenerate to

(46)  tr(H_{0,h}) = h^{-1} |Ω| K(0) {1 + o_P(1)},
(47)  tr(H^T_{0,h} H_{0,h}) = h^{-1} |Ω| ν_0 {1 + o_P(1)}.

Clearly, ²C⁰_κ = 0; that is, for the LCLS estimator the asymptotic difference between tr(H^*) and tr(H^T_{0,h} H_{0,h}) is zero for any kernel, assuming an identical bandwidth parameter h.⁵ We undertake a more general comparison between tr(H^*) and tr(H_{0,h}) and tr(H^T_{0,h} H_{0,h}) for the popular class of symmetric beta kernels defined as

(48)  K(t) = (1 − t²)^κ_+ / Beta(1/2, κ + 1),  κ = 0, 1, 2, ...,

where the subscript + denotes the positive part, which is understood to be taken prior to exponentiation (see, e.g., Fan & Gijbels 1996, p. 15). This class nests the uniform, Epanechnikov, biweight, and triweight kernels for κ = 0, 1, 2, and 3, respectively, and the Gaussian kernel as the limiting kernel function as κ → ∞. For κ = 0, 1, 2, and 3,

µ_{2j} = Beta(j + 1/2, κ + 1) / Beta(1/2, κ + 1),  and  ν_{2j} = Beta(j + 1/2, 2κ + 1) / {Beta(1/2, κ + 1)}²,

and for the Gaussian kernel, K(u) = (1/√(2π)) e^{−u²/2}, with

µ_{2j} = (2j − 1)(2j − 3) ⋯ 3 ⋅ 1,  and  ν_{2j} = 2^{−(j+1)} µ_{2j} / √π

(see Fan & Gijbels 1996, p. 78). Table 1 reports the values of ¹C^p_κ, ²C^p_κ, and ³C^p_κ for this class of kernels and for p ∈ {0, 1, 3}. Table 1 shows that for smoother kernels, that is, kernels with larger κ ≥ 1, ¹C¹_κ and ¹C³_κ become smaller. For the local linear estimator, the asymptotic difference between tr(H^*) and tr(H_{0,h}) can be quite minute; it is smallest for the Gaussian kernel. More important, 0 ≤ [¹C^p_κ] ≤ 1, 0 ≤ [²C^p_κ] ≤ 1, and 0 ≤ [³C^p_κ] ≤ 1 for all κ and p ∈ {0, 1, 3}, where [c] denotes the nearest-integer function of the real number c.

⁵ In fact, ²C⁰_κ = 0 for any q-variate smooth covariate.
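
A sketch (ours) of the kernel constants entering (44)-(48): the closed-form beta-function expressions for µ_2j and ν_2j are checked against direct numerical integration for the Epanechnikov member (κ = 1), and the p = 0 constants ν_0 (ANOVA, eq. (44)) and K(0) (non-ANOVA, eq. (46)) are printed.

```python
import numpy as np
from scipy.special import beta as Beta
from scipy.integrate import quad

def beta_kernel(t, kappa):
    """Symmetric beta kernel in (48)."""
    return np.where(np.abs(t) <= 1, (1 - t**2)**kappa, 0.0) / Beta(0.5, kappa + 1)

def mu_nu_closed_form(j, kappa):
    """Closed-form raw moments mu_2j and nu_2j for the symmetric beta family."""
    mu = Beta(j + 0.5, kappa + 1) / Beta(0.5, kappa + 1)
    nu = Beta(j + 0.5, 2 * kappa + 1) / Beta(0.5, kappa + 1)**2
    return mu, nu

kappa = 1                                                   # Epanechnikov kernel
mu2, nu2 = mu_nu_closed_form(1, kappa)
print(mu2, quad(lambda t: t**2 * beta_kernel(t, kappa), -1, 1)[0])      # both 0.2
print(nu2, quad(lambda t: t**2 * beta_kernel(t, kappa)**2, -1, 1)[0])   # both ~0.0857

# p = 0 asymptotic trace constants: ANOVA uses nu_0 (eq. 44), non-ANOVA uses K(0) (eq. 46).
nu0 = Beta(0.5, 2 * kappa + 1) / Beta(0.5, kappa + 1)**2
print(beta_kernel(0.0, kappa), nu0)    # K(0) = 0.75 and nu_0 = 0.6 for the Epanechnikov kernel
```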

24 24 HAT MATRIX [Table 1 about here.] Hat matrices with only discrete regressors. For the LCLS estimator with only discrete regressors, H = x W H ˆf(x d ; λ), with W a diagonal matrix having entries L(X d d i, x d, λ)/ ˆf(x d ; λ), and with X = ι, the vector of ones in R n. Then ( tr(h ) = tr W ι(ι T W ι) 1 ι T W ˆf(x ) d ; λ) x d = (ι T W ι) 1 ι T W 2 ι ˆf(x d ; λ) x d (49) = x d ( n L(Xd i, x d, λ) ˆf(x d ; λ) ) 1 ( n ) L2 (Xi d, x d, λ) [ ˆf(x ˆf(x d ; λ). d ; λ)] 2 In light of (49), we have the following result: Theorem 4.3. Under the conditions of Theorem 3.4, (50) tr(h ) = Ω d { 1 + o P (1) } where tr(h ) is defined in (49). Theorem 4.3 suggests that in the purely discrete case with all relevant regressors, the differences between the non-anova based approaches in Zhang (2003) and the ANOVAbased approach in Huang & Chen (2008) are asymptotically negligible. Hence, for example, tr(h ) tr(h λ ) Degrees of Freedom in the presence of Relevant and Irrelevant Regressors. In Subsection 3.5, we show that the non-anova nonparametric framework also lends itself well to meaningful asymptotic expressions for the trace of the implied hat matrix in the presence of irrelevant continuous and discrete covariates. Intuitively, the tr(h 0,γ ) is the ratio of two kernel terms that are of equal order of magnitude in the bandwidth vector for the continuous covariates; thus, the influence of h s for s = q 1 + 1,..., q on the ratio is dominated by the influence of h s 0 for s = 1,..., q 1. In light of our juxtaposition of non-anova and ANOVA frameworks, it is interesting to examine whether the trace of the resultant hat matrix from Huang & Chen s (2008) ANOVA framework has a meaningful expression when some covariates are irrelevant. We now consider the LCLS estimator for tr(h ) with a mix of continuous and discrete regressors. Observe that tr(h ) = x d (ι T W ι) 1 (ι T W 2 ι) ˆ f( x; γ) ˆ f( x; γ)dx c,

25 which depends also on the ratio of two kernel terms. HAT MATRIX 25 However, unlike the non-anova framework, the kernel terms are of different orders of magnitude in the bandwidth vector for the continuous covariates. This suggests that there can be sizable influence of h s for s = q 1 + 1,..., q, relative to h s 0 for s = 1,..., q 1, on the tr(h ). Hence, tr(h ) 0 is possible under the condition that h s for s = q 1 + 1,..., q. Formally, we provide this attrition effect of the irrelevant bandwidths on tr(h ) in the following result: Theorem 4.4. Assume the conditions of Theorem 3.5 hold. Let for some constant c, n c < h s < n c s = 1,..., q, and h h q1 +1 h q n κ, where κ = (q 1 (η + 1) + 4η) /(q 1 + 4), and η 1. The tr(h ) associated with the LCLS estimator is such that (51) tr(h ) = νq 1 0 Ω Ω ( d ) {1 q1 h m( x) d x c + op (1) }, s x d where m( x) = m 2 ( x)/m 1 ( x) = O ({ h q1 +1 h ) q } 1. Thus, Theorem 4.4 shows that in the presence of irrelevant continuous covariates tr(h ) p 0. 6 In fact, simulation results confirm this behavior for the tr(h ), for the LCLS, local linear and local cubic estimators. One implication of Theorem 4.4 is that the nonparametric ANOVA-based F-tests developed by Huang & Chen (2008), Huang & Su (2009) and Huang & Davidson (2010) may not be operational in the presence of such covariates. In particular, tr(h ) 0 will render a residual degrees of freedom close to n and hence a large global mean square error which is used to compute an unbiased estimate for the error variance in finitesample settings; also, tr(h ) 0 will render a negative regression degrees of freedom. The measure and interpretability of their ANOVA-based adjusted R-squared are also impaired by tr(h ) 0. This feature will be true for other data-driven bandwidth selection measures with the capability of selecting bandwidths which diverge. For example, the AIC c bandwidth selection criterion of Hurvich, Simonoff & Tsai (1998) has been shown to perform in a similar fashion to LSCV (Li & Racine 2004) though no formal theory currently exists that demonstrates that AIC c bandwidth selection will produce large bandwidths for irrelevant variables. We further conjecture that this result will hold in the local polynomial setting when all of the continuous covariates enter the model in a polynomial fashion. The reason for this is that as mentioned in Hall & Racine (2015), when the underlying data generating process is 6 To the best of our knowledge, there is no study in the extant literature that documents the rate at which these bandwidths associated with the irrelevant covariates diverge to infinity. In practice, however, in a given model specification each h s is often larger than the h s by a factor in excess of n η. Therefore, the restriction we impose on the bandwidths for the irrelevant covariates is quite conservative.
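
To illustrate the attrition effect, the following sketch (ours, not the authors' simulation) approximates tr(H^*) for the LCLS case by numerically integrating the ratio of kernel sums implied by (41) and (49) over a grid; the integrand and grid construction are our own simplification, assuming continuous covariates on [0,1]² and a Gaussian product kernel. Letting the bandwidth of an irrelevant covariate diverge drives tr(H^*) toward zero, in line with Theorem 4.4, while the non-ANOVA trace of Theorem 3.5 stays pinned down by the relevant bandwidth.

```python
import numpy as np

def anova_trace_lcls(X, h, grid_size=60):
    """Crude tensor-grid approximation of tr(H*) = Int sum_i K_h(x,X_i)^2 / sum_i K_h(x,X_i) dx."""
    axes = [np.linspace(0.0, 1.0, grid_size)] * X.shape[1]
    mesh = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, X.shape[1])
    vol = (1.0 / (grid_size - 1)) ** X.shape[1]          # cell volume for the Riemann sum
    tr = 0.0
    for x in mesh:
        k = np.prod(np.exp(-0.5 * ((X - x) / h) ** 2) / (np.sqrt(2.0 * np.pi) * h), axis=1)
        tr += (k**2).sum() / k.sum() * vol
    return tr

rng = np.random.default_rng(5)
X = rng.uniform(size=(500, 2))                # column 0 relevant, column 1 irrelevant
for h2 in (0.1, 1.0, 10.0, 100.0):
    print(h2, anova_trace_lcls(X, np.array([0.1, h2])))
# tr(H*) keeps shrinking as h2 grows, illustrating the convergence of tr(H*) to zero.
```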


Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

On a Nonparametric Notion of Residual and its Applications

On a Nonparametric Notion of Residual and its Applications On a Nonparametric Notion of Residual and its Applications Bodhisattva Sen and Gábor Székely arxiv:1409.3886v1 [stat.me] 12 Sep 2014 Columbia University and National Science Foundation September 16, 2014

More information

The Linear Regression Model

The Linear Regression Model The Linear Regression Model Carlo Favero Favero () The Linear Regression Model 1 / 67 OLS To illustrate how estimation can be performed to derive conditional expectations, consider the following general

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

PANEL DATA RANDOM AND FIXED EFFECTS MODEL. Professor Menelaos Karanasos. December Panel Data (Institute) PANEL DATA December / 1

PANEL DATA RANDOM AND FIXED EFFECTS MODEL. Professor Menelaos Karanasos. December Panel Data (Institute) PANEL DATA December / 1 PANEL DATA RANDOM AND FIXED EFFECTS MODEL Professor Menelaos Karanasos December 2011 PANEL DATA Notation y it is the value of the dependent variable for cross-section unit i at time t where i = 1,...,

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Introduction. Linear Regression. coefficient estimates for the wage equation: E(Y X) = X 1 β X d β d = X β

Introduction. Linear Regression. coefficient estimates for the wage equation: E(Y X) = X 1 β X d β d = X β Introduction - Introduction -2 Introduction Linear Regression E(Y X) = X β +...+X d β d = X β Example: Wage equation Y = log wages, X = schooling (measured in years), labor market experience (measured

More information

Nonparametric Methods

Nonparametric Methods Nonparametric Methods Michael R. Roberts Department of Finance The Wharton School University of Pennsylvania July 28, 2009 Michael R. Roberts Nonparametric Methods 1/42 Overview Great for data analysis

More information

Multiple Regression Model: I

Multiple Regression Model: I Multiple Regression Model: I Suppose the data are generated according to y i 1 x i1 2 x i2 K x ik u i i 1...n Define y 1 x 11 x 1K 1 u 1 y y n X x n1 x nk K u u n So y n, X nxk, K, u n Rks: In many applications,

More information

Quantile Regression for Extraordinarily Large Data

Quantile Regression for Extraordinarily Large Data Quantile Regression for Extraordinarily Large Data Shih-Kang Chao Department of Statistics Purdue University November, 2016 A joint work with Stanislav Volgushev and Guang Cheng Quantile regression Two-step

More information

Simple Linear Regression: The Model

Simple Linear Regression: The Model Simple Linear Regression: The Model task: quantifying the effect of change X in X on Y, with some constant β 1 : Y = β 1 X, linear relationship between X and Y, however, relationship subject to a random

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

Inverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1

Inverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1 Inverse of a Square Matrix For an N N square matrix A, the inverse of A, 1 A, exists if and only if A is of full rank, i.e., if and only if no column of A is a linear combination 1 of the others. A is

More information

10-725/36-725: Convex Optimization Prerequisite Topics

10-725/36-725: Convex Optimization Prerequisite Topics 10-725/36-725: Convex Optimization Prerequisite Topics February 3, 2015 This is meant to be a brief, informal refresher of some topics that will form building blocks in this course. The content of the

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Local Polynomial Regression

Local Polynomial Regression VI Local Polynomial Regression (1) Global polynomial regression We observe random pairs (X 1, Y 1 ),, (X n, Y n ) where (X 1, Y 1 ),, (X n, Y n ) iid (X, Y ). We want to estimate m(x) = E(Y X = x) based

More information

Large Sample Properties of Estimators in the Classical Linear Regression Model

Large Sample Properties of Estimators in the Classical Linear Regression Model Large Sample Properties of Estimators in the Classical Linear Regression Model 7 October 004 A. Statement of the classical linear regression model The classical linear regression model can be written in

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

5.1 Consistency of least squares estimates. We begin with a few consistency results that stand on their own and do not depend on normality.

5.1 Consistency of least squares estimates. We begin with a few consistency results that stand on their own and do not depend on normality. 88 Chapter 5 Distribution Theory In this chapter, we summarize the distributions related to the normal distribution that occur in linear models. Before turning to this general problem that assumes normal

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

AUTOMATIC CONTROL COMMUNICATION SYSTEMS LINKÖPINGS UNIVERSITET. Questions AUTOMATIC CONTROL COMMUNICATION SYSTEMS LINKÖPINGS UNIVERSITET

AUTOMATIC CONTROL COMMUNICATION SYSTEMS LINKÖPINGS UNIVERSITET. Questions AUTOMATIC CONTROL COMMUNICATION SYSTEMS LINKÖPINGS UNIVERSITET The Problem Identification of Linear and onlinear Dynamical Systems Theme : Curve Fitting Division of Automatic Control Linköping University Sweden Data from Gripen Questions How do the control surface

More information

VAR Model. (k-variate) VAR(p) model (in the Reduced Form): Y t-2. Y t-1 = A + B 1. Y t + B 2. Y t-p. + ε t. + + B p. where:

VAR Model. (k-variate) VAR(p) model (in the Reduced Form): Y t-2. Y t-1 = A + B 1. Y t + B 2. Y t-p. + ε t. + + B p. where: VAR Model (k-variate VAR(p model (in the Reduced Form: where: Y t = A + B 1 Y t-1 + B 2 Y t-2 + + B p Y t-p + ε t Y t = (y 1t, y 2t,, y kt : a (k x 1 vector of time series variables A: a (k x 1 vector

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

Variance Function Estimation in Multivariate Nonparametric Regression

Variance Function Estimation in Multivariate Nonparametric Regression Variance Function Estimation in Multivariate Nonparametric Regression T. Tony Cai 1, Michael Levine Lie Wang 1 Abstract Variance function estimation in multivariate nonparametric regression is considered

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Greene, Econometric Analysis (7th ed, 2012)

Greene, Econometric Analysis (7th ed, 2012) EC771: Econometrics, Spring 2012 Greene, Econometric Analysis (7th ed, 2012) Chapters 2 3: Classical Linear Regression The classical linear regression model is the single most useful tool in econometrics.

More information

Lecture 13: Simple Linear Regression in Matrix Format

Lecture 13: Simple Linear Regression in Matrix Format See updates and corrections at http://www.stat.cmu.edu/~cshalizi/mreg/ Lecture 13: Simple Linear Regression in Matrix Format 36-401, Section B, Fall 2015 13 October 2015 Contents 1 Least Squares in Matrix

More information

Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology

Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adopted from Prof. H.R. Rabiee s and also Prof. R. Gutierrez-Osuna

More information

SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA

SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA By Lingsong Zhang, Haipeng Shen and Jianhua Z. Huang Purdue University, University of North Carolina,

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Unit Roots in White Noise?!

Unit Roots in White Noise?! Unit Roots in White Noise?! A.Onatski and H. Uhlig September 26, 2008 Abstract We show that the empirical distribution of the roots of the vector auto-regression of order n fitted to T observations of

More information

Preface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation

Preface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation Preface Nonparametric econometrics has become one of the most important sub-fields in modern econometrics. The primary goal of this lecture note is to introduce various nonparametric and semiparametric

More information

Introduction to Regression

Introduction to Regression Introduction to Regression David E Jones (slides mostly by Chad M Schafer) June 1, 2016 1 / 102 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression Nonparametric Procedures

More information

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2. APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product

More information

ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd

ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation Petra E. Todd Fall, 2014 2 Contents 1 Review of Stochastic Order Symbols 1 2 Nonparametric Density Estimation 3 2.1 Histogram

More information

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin Regression Review Statistics 149 Spring 2006 Copyright c 2006 by Mark E. Irwin Matrix Approach to Regression Linear Model: Y i = β 0 + β 1 X i1 +... + β p X ip + ɛ i ; ɛ i iid N(0, σ 2 ), i = 1,..., n

More information

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors:

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors: Wooldridge, Introductory Econometrics, d ed. Chapter 3: Multiple regression analysis: Estimation In multiple regression analysis, we extend the simple (two-variable) regression model to consider the possibility

More information

Lecture 20: Linear model, the LSE, and UMVUE

Lecture 20: Linear model, the LSE, and UMVUE Lecture 20: Linear model, the LSE, and UMVUE Linear Models One of the most useful statistical models is X i = β τ Z i + ε i, i = 1,...,n, where X i is the ith observation and is often called the ith response;

More information

Local linear multiple regression with variable. bandwidth in the presence of heteroscedasticity

Local linear multiple regression with variable. bandwidth in the presence of heteroscedasticity Local linear multiple regression with variable bandwidth in the presence of heteroscedasticity Azhong Ye 1 Rob J Hyndman 2 Zinai Li 3 23 January 2006 Abstract: We present local linear estimator with variable

More information

Simple and Efficient Improvements of Multivariate Local Linear Regression

Simple and Efficient Improvements of Multivariate Local Linear Regression Journal of Multivariate Analysis Simple and Efficient Improvements of Multivariate Local Linear Regression Ming-Yen Cheng 1 and Liang Peng Abstract This paper studies improvements of multivariate local

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

New Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data

New Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data ew Local Estimation Procedure for onparametric Regression Function of Longitudinal Data Weixin Yao and Runze Li The Pennsylvania State University Technical Report Series #0-03 College of Health and Human

More information

Inference For High Dimensional M-estimates: Fixed Design Results

Inference For High Dimensional M-estimates: Fixed Design Results Inference For High Dimensional M-estimates: Fixed Design Results Lihua Lei, Peter Bickel and Noureddine El Karoui Department of Statistics, UC Berkeley Berkeley-Stanford Econometrics Jamboree, 2017 1/49

More information

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices Lecture 3: Simple Linear Regression in Matrix Format To move beyond simple regression we need to use matrix algebra We ll start by re-expressing simple linear regression in matrix form Linear algebra is

More information

Estimation of the Conditional Variance in Paired Experiments

Estimation of the Conditional Variance in Paired Experiments Estimation of the Conditional Variance in Paired Experiments Alberto Abadie & Guido W. Imbens Harvard University and BER June 008 Abstract In paired randomized experiments units are grouped in pairs, often

More information

STAT5044: Regression and Anova. Inyoung Kim

STAT5044: Regression and Anova. Inyoung Kim STAT5044: Regression and Anova Inyoung Kim 2 / 51 Outline 1 Matrix Expression 2 Linear and quadratic forms 3 Properties of quadratic form 4 Properties of estimates 5 Distributional properties 3 / 51 Matrix

More information

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo Outline in High Dimensions Using the Rodeo Han Liu 1,2 John Lafferty 2,3 Larry Wasserman 1,2 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

Panel Data Models. James L. Powell Department of Economics University of California, Berkeley

Panel Data Models. James L. Powell Department of Economics University of California, Berkeley Panel Data Models James L. Powell Department of Economics University of California, Berkeley Overview Like Zellner s seemingly unrelated regression models, the dependent and explanatory variables for panel

More information

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ). .8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics

More information

Introduction to Regression

Introduction to Regression Introduction to Regression Chad M. Schafer May 20, 2015 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression Nonparametric Procedures Cross Validation Local Polynomial Regression

More information

Vectors and Matrices Statistics with Vectors and Matrices

Vectors and Matrices Statistics with Vectors and Matrices Vectors and Matrices Statistics with Vectors and Matrices Lecture 3 September 7, 005 Analysis Lecture #3-9/7/005 Slide 1 of 55 Today s Lecture Vectors and Matrices (Supplement A - augmented with SAS proc

More information

Matrix Factorizations

Matrix Factorizations 1 Stat 540, Matrix Factorizations Matrix Factorizations LU Factorization Definition... Given a square k k matrix S, the LU factorization (or decomposition) represents S as the product of two triangular

More information

1. Stochastic Processes and Stationarity

1. Stochastic Processes and Stationarity Massachusetts Institute of Technology Department of Economics Time Series 14.384 Guido Kuersteiner Lecture Note 1 - Introduction This course provides the basic tools needed to analyze data that is observed

More information

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage

More information

Integrated Likelihood Estimation in Semiparametric Regression Models. Thomas A. Severini Department of Statistics Northwestern University

Integrated Likelihood Estimation in Semiparametric Regression Models. Thomas A. Severini Department of Statistics Northwestern University Integrated Likelihood Estimation in Semiparametric Regression Models Thomas A. Severini Department of Statistics Northwestern University Joint work with Heping He, University of York Introduction Let Y

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Introduction to Regression

Introduction to Regression Introduction to Regression p. 1/97 Introduction to Regression Chad Schafer cschafer@stat.cmu.edu Carnegie Mellon University Introduction to Regression p. 1/97 Acknowledgement Larry Wasserman, All of Nonparametric

More information

3 Multiple Linear Regression

3 Multiple Linear Regression 3 Multiple Linear Regression 3.1 The Model Essentially, all models are wrong, but some are useful. Quote by George E.P. Box. Models are supposed to be exact descriptions of the population, but that is

More information

Fitting Linear Statistical Models to Data by Least Squares: Introduction

Fitting Linear Statistical Models to Data by Least Squares: Introduction Fitting Linear Statistical Models to Data by Least Squares: Introduction Radu Balan, Brian R. Hunt and C. David Levermore University of Maryland, College Park University of Maryland, College Park, MD Math

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model

Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model Most of this course will be concerned with use of a regression model: a structure in which one or more explanatory

More information

Economics 573 Problem Set 5 Fall 2002 Due: 4 October b. The sample mean converges in probability to the population mean.

Economics 573 Problem Set 5 Fall 2002 Due: 4 October b. The sample mean converges in probability to the population mean. Economics 573 Problem Set 5 Fall 00 Due: 4 October 00 1. In random sampling from any population with E(X) = and Var(X) =, show (using Chebyshev's inequality) that sample mean converges in probability to..

More information

Permutation-invariant regularization of large covariance matrices. Liza Levina

Permutation-invariant regularization of large covariance matrices. Liza Levina Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work

More information

COUNTEREXAMPLES TO THE COARSE BAUM-CONNES CONJECTURE. Nigel Higson. Unpublished Note, 1999

COUNTEREXAMPLES TO THE COARSE BAUM-CONNES CONJECTURE. Nigel Higson. Unpublished Note, 1999 COUNTEREXAMPLES TO THE COARSE BAUM-CONNES CONJECTURE Nigel Higson Unpublished Note, 1999 1. Introduction Let X be a discrete, bounded geometry metric space. 1 Associated to X is a C -algebra C (X) which

More information

Nonparametric Econometrics

Nonparametric Econometrics Applied Microeconometrics with Stata Nonparametric Econometrics Spring Term 2011 1 / 37 Contents Introduction The histogram estimator The kernel density estimator Nonparametric regression estimators Semi-

More information

Regression and Statistical Inference

Regression and Statistical Inference Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF

More information

Understanding Regressions with Observations Collected at High Frequency over Long Span

Understanding Regressions with Observations Collected at High Frequency over Long Span Understanding Regressions with Observations Collected at High Frequency over Long Span Yoosoon Chang Department of Economics, Indiana University Joon Y. Park Department of Economics, Indiana University

More information

Partitioned Covariance Matrices and Partial Correlations. Proposition 1 Let the (p + q) (p + q) covariance matrix C > 0 be partitioned as C = C11 C 12

Partitioned Covariance Matrices and Partial Correlations. Proposition 1 Let the (p + q) (p + q) covariance matrix C > 0 be partitioned as C = C11 C 12 Partitioned Covariance Matrices and Partial Correlations Proposition 1 Let the (p + q (p + q covariance matrix C > 0 be partitioned as ( C11 C C = 12 C 21 C 22 Then the symmetric matrix C > 0 has the following

More information

DESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA

DESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA Statistica Sinica 18(2008), 515-534 DESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA Kani Chen 1, Jianqing Fan 2 and Zhezhen Jin 3 1 Hong Kong University of Science and Technology,

More information

LOCAL POLYNOMIAL AND PENALIZED TRIGONOMETRIC SERIES REGRESSION

LOCAL POLYNOMIAL AND PENALIZED TRIGONOMETRIC SERIES REGRESSION Statistica Sinica 24 (2014), 1215-1238 doi:http://dx.doi.org/10.5705/ss.2012.040 LOCAL POLYNOMIAL AND PENALIZED TRIGONOMETRIC SERIES REGRESSION Li-Shan Huang and Kung-Sik Chan National Tsing Hua University

More information

A nonparametric method of multi-step ahead forecasting in diffusion processes

A nonparametric method of multi-step ahead forecasting in diffusion processes A nonparametric method of multi-step ahead forecasting in diffusion processes Mariko Yamamura a, Isao Shoji b a School of Pharmacy, Kitasato University, Minato-ku, Tokyo, 108-8641, Japan. b Graduate School

More information

Regularization Methods for Additive Models

Regularization Methods for Additive Models Regularization Methods for Additive Models Marta Avalos, Yves Grandvalet, and Christophe Ambroise HEUDIASYC Laboratory UMR CNRS 6599 Compiègne University of Technology BP 20529 / 60205 Compiègne, France

More information

18.S096 Problem Set 3 Fall 2013 Regression Analysis Due Date: 10/8/2013

18.S096 Problem Set 3 Fall 2013 Regression Analysis Due Date: 10/8/2013 18.S096 Problem Set 3 Fall 013 Regression Analysis Due Date: 10/8/013 he Projection( Hat ) Matrix and Case Influence/Leverage Recall the setup for a linear regression model y = Xβ + ɛ where y and ɛ are

More information

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

Need for Several Predictor Variables

Need for Several Predictor Variables Multiple regression One of the most widely used tools in statistical analysis Matrix expressions for multiple regression are the same as for simple linear regression Need for Several Predictor Variables

More information

A Bootstrap Test for Conditional Symmetry

A Bootstrap Test for Conditional Symmetry ANNALS OF ECONOMICS AND FINANCE 6, 51 61 005) A Bootstrap Test for Conditional Symmetry Liangjun Su Guanghua School of Management, Peking University E-mail: lsu@gsm.pku.edu.cn and Sainan Jin Guanghua School

More information

1 Appendix A: Matrix Algebra

1 Appendix A: Matrix Algebra Appendix A: Matrix Algebra. Definitions Matrix A =[ ]=[A] Symmetric matrix: = for all and Diagonal matrix: 6=0if = but =0if 6= Scalar matrix: the diagonal matrix of = Identity matrix: the scalar matrix

More information