CALCULATING DEGREES OF FREEDOM IN MULTIVARIATE LOCAL POLYNOMIAL REGRESSION
NADINE MCCLOUD AND CHRISTOPHER F. PARMETER

Abstract. The matrix that transforms the response variable in a regression into its predicted value is commonly referred to as the hat matrix. The trace of the hat matrix is a standard metric for calculating degrees of freedom. Nonparametric hat matrices do not enjoy all the properties of their parametric counterparts, in part because the former do not always stem directly from a traditional ANOVA decomposition. In the multivariate, local polynomial setup with a mix of continuous and discrete covariates, including some irrelevant covariates, we formulate asymptotic expressions for the traces of the resultant non-ANOVA and ANOVA-based hat matrices from the estimator of the unknown conditional mean. The asymptotic expression for the trace of the non-ANOVA hat matrix associated with the conditional mean estimator is equal, up to a linear combination of kernel-dependent constants, to that of the ANOVA-based hat matrix. Additionally, we document that the trace of the ANOVA-based hat matrix converges to 0 in any setting where the bandwidths diverge. This attrition outcome can occur in the presence of irrelevant continuous covariates, or it can arise when the underlying data generating process is in fact of polynomial order. Simulated examples demonstrate that our theoretical contributions are valid in finite-sample settings.

1. Introduction

The hat matrix plays a fundamental role in regression analysis; the elements of this matrix have well-known properties and are used to construct variances and covariances of the residuals. In particular, the trace of the hat matrix is commonly used to calculate degrees of freedom, and appears in regression diagnostics, in constructing measures of fit, and in conceptualizing what a residual is.
And while the hat matrix is commonly called upon in parametric estimation and inference, its use in nonparametric settings is much less prevalent. One potential reason that the hat matrix has been less commonly deployed in nonparametric regression analysis is that several notions of the hat matrix exist, which leads to alternative versions of both overall model fit and degrees of freedom (which can subsequently be used to penalize in-sample fit).¹

¹ In the nonparametric literature, this hat matrix is also referred to as the smoother matrix.
Date: November 21,
Key words and phrases. Trace, Degrees of Freedom, Effective Parameters, Nonparametric Regression, Irrelevant Regressors, Bandwidth, Goodness-of-fit.

Within a univariate framework, Huang & Chen (2008) provide an ANOVA decomposition of the total sum of squares into its explained and residual components. However, this ANOVA decomposition is local in nature and needs to be integrated to achieve a global hat matrix. Moreover, in multivariate settings, which are of practical appeal, calculation of this global ANOVA hat matrix is likely to be difficult. This is irksome for the calculation of the degrees of freedom of the model as well, given that the proper hat matrix stemming from this ANOVA decomposition needs to be integrated. An alternative would be to use the trace of the hat matrix stemming directly from the local polynomial method of estimating the unknown conditional mean (Ruppert & Wand 1994, Fan & Gijbels 1996). In fact, Zhang (2003) discusses exactly this non-ANOVA case in the univariate setting. Here our goal is to augment the work of Huang & Chen (2008) and Zhang (2003) by considering the calculation of degrees of freedom in a multivariate, local polynomial setting with a mix of continuous and discrete covariates. From this platform we can compare the effective number of parameters, k_eff, stemming from the trace of the global ANOVA hat matrix and its non-ANOVA counterpart. Our generalizations and juxtapositions allow us to make the following nontrivial contributions to the existing literature. One, in the presence of mixed discrete and continuous covariates, the difference in the asymptotic expressions for the traces of the ANOVA and non-ANOVA hat matrices is driven by a linear combination of moments of the underlying kernel. This suggests that the non-ANOVA hat matrix taken directly from the multivariate, local polynomial estimator can be used to approximate degrees of freedom for the ANOVA hat matrix, as the latter is more computationally intensive in applied settings with multiple continuous covariates.
For example, using a bivariate regression, and for local constant, local linear, and local cubic estimators, we show that the absolute differences between the asymptotic expressions for the traces of the ANOVA and non-ANOVA hat matrices lie in the unit interval. Two, to improve the usefulness of our work in applied settings, we also give consideration to nonparametric regression models in which some covariates are irrelevant. The non-ANOVA nonparametric framework has been the workhorse for the analysis of irrelevant covariates. We show that this framework also lends itself well to meaningful asymptotic expressions for the trace of the implied hat matrix in the presence of irrelevant continuous and discrete covariates. Intuitively, the trace of the non-ANOVA hat matrix is the ratio of two kernel terms that are of equal order of magnitude in the bandwidth vector for the continuous covariates; thus, the influence of the bandwidth vector for the irrelevant continuous covariates on the kernel ratio is dominated by that of its relevant counterpart. We show that the bandwidth vector for the irrelevant continuous covariates has an attrition effect on the trace of the ANOVA hat matrix, resulting in the latter converging to zero in
probability. Although the trace of the ANOVA hat matrix is also a ratio of two kernel terms, these kernel terms are of different orders of magnitude in the bandwidth vector for the continuous covariates. This paves the way for a sizable influence of the bandwidth vector for the irrelevant continuous covariates relative to its relevant counterpart. In fact, our simulation results confirm this attrition effect of irrelevant regressors on the trace of the ANOVA hat matrix for the local constant, local linear, and local cubic estimators. One implication of this attrition effect is that the nonparametric ANOVA-based F-tests developed by Huang & Chen (2008), Huang & Su (2009) and Huang & Davidson (2010) may not be operational in the presence of such covariates; degrees of freedom from the non-ANOVA framework may be suitable substitutes for their ANOVA counterparts when irrelevant continuous variables are likely to be present in the underlying nonparametric model. Three, we formalize the trace concept for the non-ANOVA and ANOVA hat matrices that are predicated on only discrete covariates to provide a measure of the degrees of freedom for the underlying nonparametric model. In the presence of only relevant discrete covariates, the traces of the ANOVA and non-ANOVA hat matrices all converge in probability to the cardinality of the discrete support. Although this result holds when irrelevant discrete covariates are also present in the nonparametric model, the asymptotic trace values in this case can exceed their purely relevant counterparts.
This latter result draws on the theoretical contributions of Ouyang, Li & Racine (2009), who establish, in a purely discrete-covariate setting with least-squares cross-validation (LSCV), that the irrelevant regressors cannot be smoothed out with probability approaching one as the sample size increases; that is, there is a positive probability that the bandwidths selected via LSCV do not converge to their upper extreme values even as n → ∞. In essence, with positive probability, the presence of irrelevant discrete covariates can lead to asymptotic trace values of the ANOVA and non-ANOVA hat matrices that are larger than the cardinality of the support for the relevant discrete covariates. This also means that these asymptotic trace values will exhibit large variances in the presence of irrelevant discrete covariates. Four, unlike the parametric hat matrix, the geometric properties of the non-ANOVA hat matrix stemming directly from multivariate local polynomial estimation of the unknown conditional mean with mixed data have yet to be formalized. We show that while the non-ANOVA hat matrix is not a projection matrix, it shares many of the same geometric properties as its parametric counterpart. These properties of the hat matrix are of importance in, for example, assessing the amount of leverage or influence that y_j has on ŷ_i, which is related to the (i, j)-th entry of the hat matrix. In the special case of a local constant estimator, we
deduce that each ŷ_i is a convex combination of the response vector Y; this convexity property indicates how large the leverage of y_i on its corresponding fitted value ŷ_i is. Thus, our work can also be used to identify high-leverage points and improve model fit in multivariate local polynomial estimation. In essence, our theoretical contributions are of independent interest to the wider nonparametric literature. Additionally, we can use the trace of the hat matrix as a measure of the effective number of parameters used by the local polynomial model, and this can provide insight into which covariates in the model are relevant. Whereas the theories of Hall, Li & Racine (2007), Hall & Racine (2015) and Ouyang et al. (2009) can shed light on relevancy through the size of the bandwidth and the order of the local polynomial (when selected in a data-driven manner), they do not provide an exact number of parameters. Thus, use of the hat matrix here, with these data-driven methods, can generate additional insight into how the local polynomial estimator adapts to the data.

The remainder of the paper is organized as follows. Section 2 provides a short review of the geometric properties of the hat matrix from a linear parametric model. Section 3 derives nonasymptotic and asymptotic results for the trace of the non-ANOVA-based hat matrix from the multivariate, local polynomial model with a mix of continuous and discrete, and relevant and irrelevant, covariates. Under similar model specifications, Section 4 derives asymptotic results for the trace of the ANOVA-based hat matrix. Section 5 explores the implications of our theoretical results using simulated data. Section 6 contains the conclusion. We place all proofs in the technical appendix.

2. A Brief Review of the Hat Matrix for the Canonical Parametric Model

Consider the situation where one is interested in estimation of the regression

(1) y_i = m(x_i) + ε_i,

where i = 1, 2,..., n, y_i is our regressand, x_i is a q-vector of regressors, and ε_i is the idiosyncratic noise. If we parameterize our function in (1) to be linear in parameters, m(x_i) = x_i^T β, where β ∈ R^q, then we can estimate the model via least squares to obtain β̂ = (X^T X)^{-1} X^T y, where X is the full n × q design matrix with rank q and y is the n × 1 vector of responses. The vector of fitted values, ŷ, is given by

(2) ŷ = X(X^T X)^{-1} X^T y = Hy.
The matrix H in (2) is a projection matrix and thus, by definition, H is idempotent: H² = H. By construction, H is symmetric, H^T = H (in this case H is an orthogonal projection matrix). Since premultiplying y by this matrix H puts a hat on y, it is often called the hat matrix. Using basic properties of projection matrices, Hoaglin & Welsch (1978) show that the elements of H, h_ij, satisfy the first four of the following geometric properties:

(i) 0 ≤ h_ii ≤ 1.
(ii) |h_ij| ≤ 1 for i ≠ j.
(iii) h_ii = 1 iff h_ij = 0 for all j ≠ i.
(iv) If X contains a column of ones, then ∑_{j=1}^n h_ij = 1.
(v) HX = X. Equivalently, HX_c = X_c, where X_c is any column of the design matrix X.

Properties (i) and (ii) are boundedness conditions on the entries of H. Note that by symmetry of H, property (iv) is equivalent to ∑_{i=1}^n h_ij = 1. Thus, each ŷ_i is an affine combination of the elements of y. We add the invariance property, see (v), as it will help us establish some important results in the subsequent section. Note that property (v) nests (iv). The trace of the parametric hat matrix is commonly used to calculate degrees of freedom since, by virtue of the cyclic permutation property of the trace operator,

(3) tr(H) = tr(X(X^T X)^{-1} X^T) = tr((X^T X)^{-1} X^T X) = tr(I_q) = q,

the number of covariates that we included in the parametric model. For nonparametric regressions, three orthodox definitions of k_eff, which are identical for linear models, are tr(H_{τ,γ}^T H_{τ,γ}), tr(H_{τ,γ}), and tr(2H_{τ,γ} − H_{τ,γ}^T H_{τ,γ}) (see, e.g., Hastie & Tibshirani 1990). In subsequent sections, we show that properties (i) to (v) hold for the multivariate local polynomial regression estimator of the unknown conditional mean. The equality in (3) between the trace of the hat matrix and the rank of the design matrix is one of the distinguishing characteristics of the parametric framework that is not always possessed by its nonparametric counterpart.
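As a quick numerical check of these facts, the sketch below (our own illustration, not from the paper; the design matrix is an arbitrary simulated one) builds H for a linear model with an intercept and verifies symmetry, idempotency, properties (i) and (iv)-(v), and the trace identity (3):

```python
# Sketch (not from the paper): verify hat matrix properties for an
# assumed random design with an intercept column.
import numpy as np

rng = np.random.default_rng(0)
n, q = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])  # column of ones included
H = X @ np.linalg.solve(X.T @ X, X.T)                           # H = X (X'X)^{-1} X'

assert np.allclose(H, H.T)                   # symmetric
assert np.allclose(H @ H, H)                 # idempotent (projection)
assert np.allclose(np.trace(H), q)           # (3): tr(H) = q
d = np.diag(H)
assert np.all((d >= 0) & (d <= 1 + 1e-12))   # property (i)
assert np.allclose(H.sum(axis=1), 1.0)       # property (iv): rows sum to one
assert np.allclose(H @ X, X)                 # property (v): invariance, HX = X
```

With q = 4 columns, tr(H) equals 4 regardless of n, which is exactly the parametric degrees-of-freedom count.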
Heuristically, the hat matrix in the local polynomial regression is predicated on an effective design matrix with column rank that increases with the order of the local polynomial. For the multivariate, local polynomial estimator of the unknown conditional mean, we show that the rank of the effective design matrix is the infimum for the trace of the resultant hat matrix.
3. The non-ANOVA hat matrix for the multivariate local polynomial estimator

To generalize the results in Zhang (2003), we embed a mix of continuous and discrete regressors into the multivariate smooth varying-coefficient model (Hastie & Tibshirani 1993) with conditionally linear structure

(4) y = ∑_{k=1}^{d_z} m_k(X) Z_k + ε,

where y is a scalar regressand, X = (X_1^c,..., X_q^c, X_1^d,..., X_r^d)^T and Z = (Z_1,..., Z_{d_z})^T are the given covariates with Z_1 ≡ 1, E(ε | X, Z) = 0, and var(ε | X, Z) = σ²(X, Z). When d_z = 1 the model in (4) is just a multivariate nonparametric regression model; thus, allowing d_z ≥ 1 allows for a broader array of models. Denote X_i^d in X_i = (X_i^c, X_i^d) as the r × 1 vector of regressors that takes discrete values, and denote X_i^c ∈ R^q as the vector of continuous regressors. Let Ω^d and Ω be the support of X_i^d and X_i^c, respectively. Let the s-th component of x^d be x_s^d, which takes c_s different values in Ω_s^d = {0, 1,..., c_s − 1} for s = 1,..., r, where c_s ≥ 2 is a finite positive constant. Then the cardinality of the set Ω_s^d is c_s, which we denote as |Ω_s^d|. Assume X has a sampling density f_X with a known bounded support Ω × Ω^d. Furthermore, assume the square matrix E(ZZ^T | X = x) has strictly positive eigenvalues for each x ∈ Ω × Ω^d to guarantee identifiability of the model in (4). We let γ be the bandwidth vector for X. As in Li & Racine (2007), we use the partition γ = (h^T, λ^T)^T to reflect the presence of continuous and discrete regressors in X, with bandwidth subvectors h and λ, respectively. For the case of unordered discrete regressors X_i^d, we follow Li & Racine (2007), who use a variant of the kernel function of Aitchison & Aitken (1976) that is defined by

(5) l(X_is^d, x_s^d, λ_s) = 1 if X_is^d = x_s^d, and λ_s if X_is^d ≠ x_s^d,

where 0 ≤ λ_s ≤ 1 is the smoothing parameter of x_s^d.
Then the product kernel for x^d = (x_1^d,..., x_r^d)^T is

L(x^d, X_i^d, λ) = ∏_{s=1}^r l(X_is^d, x_s^d, λ_s) = ∏_{s=1}^r λ_s^{1(X_is^d ≠ x_s^d)},

where 1(X_is^d ≠ x_s^d) is an indicator function that equals 1 when X_is^d ≠ x_s^d, and 0 otherwise.
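A minimal sketch of the kernel in (5) and the product kernel above (our own illustration; the function name and the two-covariate example are assumptions, not from the paper):

```python
# Sketch of the unordered-discrete product kernel:
# L(x^d, X_i^d, lambda) = prod_s lambda_s ** 1(X_is^d != x_s^d).
import numpy as np

def aitchison_aitken_product(x_d, X_d, lam):
    """Product kernel weight of observation X_d at evaluation point x_d."""
    x_d, X_d, lam = map(np.asarray, (x_d, X_d, lam))
    mismatch = (X_d != x_d).astype(float)   # indicator 1(X_is^d != x_s^d)
    return float(np.prod(lam ** mismatch))

# lam_s = 0 gives an indicator kernel; lam_s = 1 gives uniform weights.
w_match = aitchison_aitken_product([1, 0], [1, 0], [0.3, 0.5])  # all components match
w_mix   = aitchison_aitken_product([1, 0], [1, 1], [0.3, 0.5])  # one mismatch
print(w_match, w_mix)  # 1.0 0.5
```

The weight is 1 when the observation matches the evaluation point in every discrete component, and shrinks by a factor λ_s for each mismatching component s.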
We let K_h(·) be the generalized product kernel (Li & Racine 2007, Henderson & Parmeter 2015),

(6) K_h(x^c, X_i^c) = ∏_{s=1}^q h_s^{-1} K((x_s^c − X_is^c)/h_s),

where K(·) is a symmetric density function on R. Define K_γ(x, X_i) to be the kernel function for the mixed regressor (x^c, x^d):

K_γ(x, X_i) = K_h(x^c, X_i^c) L(x^d, X_i^d, λ).

To obtain an estimate of the unknown smooth functions {m_k(x)}_{k=1}^{d_z} and their population mean regression function m(x_1,..., x_q, z_1,..., z_{d_z}) = ∑_{k=1}^{d_z} m_k(x) z_k from the observations {Y_i, X_i, Z_i}_{i=1}^n, we employ the p-th order local polynomial estimation method. In what follows, we adopt the notation of Masry (1996a, 1996b). Thus, for a p-th order local polynomial estimation the corresponding objective function is

(7) min_β n^{-1} ∑_{i=1}^n ( Y_i − ∑_{k=1}^{d_z} ∑_{0≤|j|≤p} β_j (X_i^c − x^c)^j Z_ik )² K_γ(x, X_i),

where j = (j_1,..., j_q), |j| = ∑_{i=1}^q j_i, x^j = ∏_{i=1}^q x_i^{j_i}, j! = ∏_{i=1}^q j_i! = j_1! ⋯ j_q!, and

∑_{0≤|j|≤p} = ∑_{l=0}^p ∑_{j_1=0}^l ⋯ ∑_{j_q=0}^l with j_1 + ⋯ + j_q = l.

Here j! β_j(x) corresponds to (D^j m)(x), the partial derivative of m(x) = m(x^c, x^d) with respect to x^c, which is defined as

(8) (D^j m)(x) ≡ ∂^{|j|} m(x) / [∂(x_1^c)^{j_1} ⋯ ∂(x_q^c)^{j_q}],

and β vertically concatenates β_j (0 ≤ |j| ≤ p) in lexicographical order (with highest priority to the last position, so that (0,..., 0, i) is the first element in the sequence and (i, 0,..., 0) is the last element), and g_i^{-1} denotes this one-to-one map. Note that (7) handles the continuous regressor vector x^c in a local polynomial manner but the discrete regressor vector x^d in a local constant manner. Let β̂(x; γ) be the estimator for the weighted least squares problem in (7), and define S_n(x) = D(x^c)^T W(x) D(x^c) and T_n(x) = D(x^c)^T W(x), with D(x^c) = [D_1(x^c),..., D_n(x^c)]^T, where D_i(x^c) vertically concatenates (X_i^c − x^c)^j ⊗ Z_i for 0 ≤ |j| ≤ p in lexicographical order, with Z_i = (Z_1i,..., Z_{d_z i})^T, and ⊗ denotes the Kronecker operator.
Thus, for example, D_i(x^c) = Z_i for p = 0, and D_i(x^c) = [1, (X_i^c − x^c)^T]^T ⊗ Z_i for p = 1. Here W(x) is a diagonal matrix with the i-th diagonal element being K_γ(x, X_i). Now, let

N_{p,l} = (l + q − 1)! / [l! (q − 1)!]

be the number of distinct q-tuples j with |j| = l for 0 ≤ l ≤ p. That is, N_{p,l} is the number of distinct l-th order partial derivatives of m(x) with respect to x^c. Set N_p ≡ ∑_{l=0}^p N_{p,l}. The minimizer of the local polynomial least squares procedure at x is

(9) β̂(x; γ) = S_n^{-1}(x) T_n(x) Y.

If our estimate of interest is the unknown regression function, it is the first d_z elements of this vector, so that we have m̂(x) = (e_{1,N_p}^T ⊗ I_{d_z}) β̂(x; γ), where e_{τ+1,N_p} is the (τ + 1)-th standard basis vector in the coordinate space R^{N_p} and I_{d_z} is the identity matrix in R^{d_z}. In practice, however, interest may lie in a specific derivative, say of first or second order, of the regression function. Thus, we define m̂_τ(x) = (e_{τ+1,N_p}^T ⊗ I_{d_z}) β̂(x; γ), where τ = 0, 1,..., N_p − 1. That is, (τ + 1) is the lexicographical position of the derivative vector m of interest. Thus the relationship between τ and m is:

(10) τ = g_{|m|}^{-1}(m) + ∑_{r=0}^{|m|−1} N_{p,r}.

Then, we can cast our local polynomial regression estimator as

(11) m̂_τ(x, z) = ∑_{k=1}^{d_z} m̂_{τ,k}(x) z_k = m̂_τ(x)^T z = (e_{τ+1,N_p}^T ⊗ z^T) S_n^{-1}(x) D(x^c)^T W(x) Y = ∑_{j=1}^n A_{τ,j}(x) y_j,

where z = (z_1,..., z_{d_z})^T and

(12) A_{τ,j}(x) ≡ (e_{τ+1,N_p}^T ⊗ z^T) S_n^{-1}(x) D_j(x^c) K_γ(x, X_j).

If we replace the generic x with our n observations, X_i, and set τ = 0, then we obtain the fitted values for our data. Further, we can use these n observations to construct our hat matrix H_{τ,γ}. From (11) we have that the (i, j)-th element of our hat matrix is

(13) H_{τ,γ}(i, j) = A_{τ,i}(X_j), for i, j = 1,..., n.
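The pipeline (9)-(13) can be illustrated in the simplest setting. The sketch below (our own, assuming d_z = 1 with Z ≡ 1, q = 1, p = 1, and a Gaussian kernel) builds the non-ANOVA hat matrix row by row and checks that each row sums to one, i.e., that the fit at each point is an affine combination of the responses:

```python
# Minimal sketch (assumed univariate design, d_z = 1, Z = 1, p = 1):
# build the non-ANOVA hat matrix for the local linear estimator.
import numpy as np

def local_linear_hat(x, h):
    """H[i, j] = A_{0,j}(x_i): weight of y_j in the fit at x_i."""
    n = len(x)
    H = np.empty((n, n))
    for i in range(n):
        d = x - x[i]                               # (X_j^c - x^c)
        K = np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2 * np.pi))
        D = np.column_stack([np.ones(n), d])       # D(x^c) for p = 1
        S = D.T @ (K[:, None] * D)                 # S_n(x) = D' W D
        e1 = np.array([1.0, 0.0])                  # picks out the fit (tau = 0)
        H[i] = (np.linalg.solve(S, e1) @ D.T) * K  # A_{0,j}(x_i), j = 1..n
    return H

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 80)
H = local_linear_hat(x, h=0.15)
assert np.allclose(H.sum(axis=1), 1.0)             # each fit is an affine combination
print(np.trace(H))                                  # effective number of parameters
```

Note that H is not symmetric and not idempotent here, yet its trace still serves as a k_eff measure, which is the theme of the results that follow.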
Geometric Properties of the non-ANOVA hat matrix. Clearly, H_{τ,γ} is not symmetric in this generalization. To show that the properties in (i)-(iv) hold for H_{τ,γ}(i, j), i, j = 1,..., n, we will use the following result, which is the local polynomial, multivariate, generalized kernel extension of Zhang (2003, Lemma A.1).

Lemma 3.1. For a nonnegative kernel satisfying K_γ(0) = sup_{u,v} K_γ(u, v), we have that
(1) ∑_{j=1}^n H²_{τ,γ}(i, j) ≤ H_{τ,γ}(i, i) for i = 1,..., n;
(2) Let X^l = (X_1^l,..., X_n^l)^T for any l such that 0 ≤ |l| ≤ p. Assume the relationship between τ and its derivative vector m is as defined in (10). Then

∑_{j=1}^n H_{τ,γ}(i, j) (X_j^l Z_j) = [∏_{s=1}^q (l_s choose m_s)] X_i^{l−m} Z_i,

and, consequently, H_{0,γ} X^l = X^l.

Part (1) of Lemma 3.1 implies that 0 ≤ H_{τ,γ}(i, i) ≤ 1 for i = 1,..., n and |H_{τ,γ}(i, j)| ≤ 1 for i ≠ j. Thus, properties (i) and (ii) hold. To show that property (iv) also holds, observe from Proposition A.1 (see appendix) that ∑_{j=1}^n A_{τ,j}(x) Z_j = δ_{0,τ} z, and thus, as the first entry of Z_j is normalized to 1,

(14) ∑_{j=1}^n H_{τ,γ}(i, j) = 1 if τ = 0, and 0 if τ > 0.

Property (iii) holds by virtue of properties (i), (ii) and (iv). For τ = 0, part (2) of Lemma 3.1 is equivalent to property (v). Equation (14) shows that for each i-th observation, the conditional mean estimator is an affine combination of the i-th row of the associated hat matrix, whereas its derivative estimator counterpart is a zero-sum linear combination. Note that the invariance property of H_{τ,γ} is absent for τ > 0 (see part (2)), which is intuitive: for derivative estimators, H_{τ,γ} is associated with dimensionality reduction but preservation of the span of a polynomial ring. Thus, Lemma 3.1 reveals that for nonparametric regression, H_{τ,γ} can be exploited to conduct diagnostic analyses, such as leverage effects, which have been used in parametric settings. In fact, the hat matrix for the local-constant least-squares (LCLS) estimator renders easy interpretation of leverage effects.
To see this, assume that for model (4), d_z = 1 and p = 0. Then the LCLS estimator of our model is

(15) m̂_0(x_i) = ∑_{j=1}^n K_γ(x_j, x_i) y_j / ∑_{l=1}^n K_γ(x_l, x_i) = ∑_{j=1}^n A_{0,j}(x_i) y_j,

where

(16) A_{0,j}(x_i) = K_γ(x_j, x_i) / ∑_{l=1}^n K_γ(x_l, x_i).

We use the notation A_ij = A_{0,i}(x_j) = H_{0,γ}(i, j) from (15) for parsimony. Observe that by the definition of the product kernel, K_γ(x_j, x_i), and of A_ij for H_{0,γ}, properties (i) and (ii) become 0 ≤ A_ij ≤ 1 for all i, j. However, unlike the parametric model, the elements of the local constant hat matrix are all positive. Clearly, from (16), H_{0,γ} is a symmetric matrix, as in the parametric case. Thus, ∑_{j=1}^n A_ij = ∑_{i=1}^n A_ij = 1. More important, this normalization property holds unconditionally, rather than conditionally on the regressor matrix. Combining this restriction with property (iii) yields that each ŷ_i is a convex combination of the response vector Y; this convexity restriction, although clearly a stronger condition than its parametric counterpart, renders easier detection and comparison of high-leverage points relative to other points.

Degrees of Freedom for Mixed Regressors. Following Zhang (2003), we first formulate a relationship among three common measures of k_eff: tr(H_{τ,γ}^T H_{τ,γ}), tr(H_{τ,γ}), and tr(2H_{τ,γ} − H_{τ,γ}^T H_{τ,γ}). To do so, we make use of the bound implied by part (1) of Lemma 3.1:

(17) tr(H_{τ,γ}^T H_{τ,γ}) = ∑_{i=1}^n ∑_{j=1}^n {H_{τ,γ}(i, j)}² ≤ ∑_{i=1}^n H_{τ,γ}(i, i) = tr(H_{τ,γ}).

In addition, using the fact that (I_n − H_{τ,γ})^T (I_n − H_{τ,γ}) is positive definite, we have

tr[(I_n − H_{τ,γ})^T (I_n − H_{τ,γ})] = n − tr(2H_{τ,γ} − H_{τ,γ}^T H_{τ,γ}) > 0,

given equality of the trace of a matrix and that of its transpose. This result, coupled with (17), implies that tr(H_{τ,γ}^T H_{τ,γ}) ≤ tr(H_{τ,γ}) ≤ tr(2H_{τ,γ} − H_{τ,γ}^T H_{τ,γ}) < n. An implication of part (2) of Lemma 3.1 is that H_{τ,γ} projects matrices onto a space with column span of dimension d_z N_p. Thus, tr(H_{τ,γ}^T H_{τ,γ}) ≥ d_z N_p. This follows from Schur's inequality
(Lütkepohl 1996, p. 43), along with the fact that the trace of a matrix is equal to the sum of its eigenvalues. We now formalize our generalization of Zhang's (2003) nonasymptotic bound as follows.

Proposition 3.1. Assume the kernel condition in Lemma 3.1 holds. For the multivariate local polynomial regression defined in (4) and any h ∈ R_+^q and λ ∈ [0, 1]^r, we have

d_z N_p ≤ tr(H_{τ,γ}^T H_{τ,γ}) ≤ tr(H_{τ,γ}) ≤ tr(2H_{τ,γ} − H_{τ,γ}^T H_{τ,γ}) < n,

where N_p is defined above.

Next, we present the asymptotic results for tr(H_{τ,γ}), tr(H_{τ,γ}^T H_{τ,γ}), and tr(2H_{τ,γ} − H_{τ,γ}^T H_{τ,γ}). A few assumptions are in order.

Assumption A.1. (x_i, z_i, y_i) are i.i.d. as (X, Z, Y). The covariates X = (X^c, X^d) have bounded support Ω × Ω^d, and the density f(x) of X is Lipschitz continuous in x^c and bounded away from zero on Ω × Ω^d.

Assumption A.2. m_k(x^c, x^d), for k = 1,..., d_z, has continuous (p + 1)-th derivatives in x^c on Ω for each x^d ∈ Ω^d.

Assumption A.3. The fourth moment of ε exists and is strictly positive.

Assumption A.4. The kernel K(·) is a symmetric, nonnegative, and bounded continuous probability density function having compact support. Specifically, |u|^{4p} K(u) ∈ L¹ and |u|^{4p+q} K(u) → 0 as |u| → ∞.

Assumption A.5. The matrix E(ZZ^T | X = x) has strictly positive eigenvalues for each x ∈ Ω × Ω^d, and each entry is Lipschitz continuous in x^c.

These conditions are identical or similar, up to generalizations, to those made by Zhang (2003) and Huang & Chen (2008). For example, Assumptions A.1 and A.4 ensure convergence in the mean-square sense of the matrices of multivariate moments of the kernel K from the multivariate local polynomial estimation, due to Masry (1996b).²

² Masry (1996a) establishes uniform strong consistency of these moment matrices.

To state our main theorem, we introduce some additional notation. Define Γ(x) = f_X(x) E(ZZ^T | X = x). For each j with 0 ≤ |j| ≤ 2p, define

µ_j = ∫_{R^q} u^j K(u) du,    ν_j = ∫_{R^q} u^j K²(u) du,
and the N_p × N_p dimensional block matrices

M = (M_{i,j})_{0 ≤ i,j ≤ p},    S = (S_{i,j})_{0 ≤ i,j ≤ p},

where M_{i,j} and S_{i,j} are N_i × N_j dimensional matrices whose (l, m) elements are µ_{g_i(l)+g_j(m)} and ν_{g_i(l)+g_j(m)}, respectively. Hence, the matrices M and S are the multivariate moments of the kernels K and K², respectively. Given that our kernel function is a probability density function, µ_j is the j-th raw moment of the kernel and ν_j is the kernel-weighted j-th raw moment.

Let Δ_{h,i} = (X_i^c − x^c) ⊘ h, where ⊘ represents Hadamard division, and let Δ_{h,i}^{(p)} vertically concatenate [Δ_{h,i}]^j for 0 ≤ |j| ≤ p in lexicographical order. Also, let Δ_{d,i} = X_i^d − x^d. Then, for a fully nonparametric regression model with mixed continuous and discrete regressors, we define the associated multivariate equivalent kernel, K*_τ(·), as

(18) K*_τ(Δ_{h,i}, Δ_{d,i}) = e_{τ+1,N_p}^T M^{-1} Δ_{h,i}^{(p)} K(Δ_{h,i}) L(x^d, X_i^d, λ).

Clearly, K*_τ(0, 0) = e_{τ+1,N_p}^T M^{-1} e_{1,N_p} [K(0)]^q, and we set K*_τ(0, 0) = K*_τ(0). As noted in Fan & Gijbels (1996), the equivalent kernel is the weighting scheme that arises from the specific kernel, the polynomial order chosen, and the location of the design points relative to the point of evaluation. The multivariate equivalent kernel is the effective weighting scheme that produces the estimator for β_j(x) for given bandwidth γ. When p = 0, the multivariate equivalent kernel is identical to the product kernel, but for p > 0, the multivariate equivalent kernel can automatically adapt to alternative data designs as well as account for boundary estimation.

Theorem 3.2. Assume the relationship between τ and its derivative vector m is as defined in (10), and that the hat matrix H_{τ,γ} is based on (12) with γ = (h^T, λ^T)^T. Let the bandwidth vector λ be such that λ ∈ [0, b_n]^r, where b_n is a positive sequence that converges to zero as n → ∞.
Under Assumptions A.1 to A.5, with h_s → 0, for s = 1,..., q, as n → ∞,
and nh_1 ⋯ h_q → ∞, we have

(19) tr(H_{τ,γ}) = [d_z K*_τ(0) / (h^m ∏_{s=1}^q h_s)] |Ω × Ω^d| {1 + o_P(1)},

(20) tr(H_{τ,γ}^T H_{τ,γ}) = [d_z (K*_τ ∗ K*_τ)(0) / (h^{2m} ∏_{s=1}^q h_s)] |Ω × Ω^d| {1 + o_P(1)},

(21) tr(2H_{τ,γ} − H_{τ,γ}^T H_{τ,γ}) = [d_z / ∏_{s=1}^q h_s] [2K*_τ(0)/h^m − (K*_τ ∗ K*_τ)(0)/h^{2m}] |Ω × Ω^d| {1 + o_P(1)},

where h^m = ∏_{s=1}^q h_s^{m_s}, |Ω × Ω^d| is the volume of the Cartesian product Ω × Ω^d, ∗ is the convolution operator, and (K*_τ ∗ K*_τ)(0) = e_{τ+1,N_p}^T M^{-1} S M^{-1} e_{τ+1,N_p}.

Theorem 3.2 shows that the asymptotic k_eff are proportional to the total number of covariates in model (4), d_z, inversely proportional to the bandwidths for the continuous regressors, but unrelated to the bandwidths for the discrete regressors. This latter result on the discrete covariates cannot be inferred from Zhang's (2003) asymptotic approximations for the trace of the resultant hat matrices for the local polynomial estimator of the conditional mean (τ = 0) for model (4) with a scalar, continuous X. Theorem 3.2 also shows that each asymptotic k_eff is independent of the mixed design density f. In the case where the discrete regressors are ordered, Li & Racine (2007) suggest, in lieu of (5), the use of the following kernel function:

(22) l(X_is^d, x_s^d, λ_s) = 1 if X_is^d = x_s^d, and λ_s^{|X_is^d − x_s^d|} if X_is^d ≠ x_s^d,

where the range of the smoothing parameter λ_s of x_s^d is [0, 1]. As in the unordered case, if λ_s is equal to its minimum value the function in (22) is an indicator function; if λ_s is equal to its maximum then (22) is a uniform weight function. In this case, the results of Theorem 3.2 continue to hold.

Degrees of Freedom for Continuous Regressors. In the purely continuous case, we assume that j! β̂_j(x^c) estimates (D^j m)(x^c). Then the p-th order local polynomial estimation has the following objective function:

min_β n^{-1} ∑_{i=1}^n ( Y_i − ∑_{k=1}^{d_z} ∑_{0≤|j|≤p} β_j (X_i^c − x^c)^j Z_ik )² K_h(x^c, X_i^c).
Thus, D(x^c) is as defined earlier, and we redefine the diagonal weighting matrix W(x^c) so that its i-th diagonal entry is K_h(x^c, X_i^c). Then, for the continuous regressor case, we have

(23) A_{τ,j}(x^c) ≡ (e_{τ+1,N_p}^T ⊗ z^T) S_n^{-1}(x^c) D_j(x^c) K_h(x^c, X_j^c).

Using (23) to define the (i, j)-th element of our hat matrix, H_{τ,h}, as is done in (13), note that the results of Lemma 3.1 and Proposition A.1 continue to hold in the continuous regressor case. To generate asymptotic expressions for tr(H_{τ,h}), tr(H_{τ,h}^T H_{τ,h}), and tr(2H_{τ,h} − H_{τ,h}^T H_{τ,h}) in the continuous regressor case, we need to modify Assumptions A.1, A.2, and A.5, respectively, as follows:

Assumption B.1. (x_i^c, z_i, y_i) are i.i.d. as (X^c, Z, Y). The covariates X^c have bounded support Ω, and the density f(x^c) of X^c is Lipschitz continuous and bounded away from zero.

Assumption B.2. m_k(x^c), for k = 1,..., d_z, has continuous (p + 1)-th derivatives in Ω.

Assumption B.3. The matrix E(ZZ^T | X^c = x^c) has strictly positive eigenvalues for each x^c ∈ Ω, and each entry is Lipschitz continuous.

Note that for d_z = 1 in our multivariate model (4) with only continuous regressors, the corresponding multivariate equivalent kernel, K*_τ(·), is K*_τ(Δ_{h,i}) = e_{τ+1,N_p}^T M^{-1} Δ_{h,i}^{(p)} K(Δ_{h,i}).

Theorem 3.3. Assume the relationship between τ and its derivative vector m is as defined in (10), and that the hat matrix H_{τ,h} is based on (23). Under Assumptions B.1 to B.3 and A.3 to A.4, with h_s → 0, for s = 1,..., q, as n → ∞, and nh_1 ⋯ h_q → ∞, the asymptotic results in Theorem 3.2 continue to hold, so that

(24) tr(H_{τ,h}) = [d_z K*_τ(0) / (h^m ∏_{s=1}^q h_s)] |Ω| {1 + o_P(1)},
tr(H_{τ,h}^T H_{τ,h}) = [d_z (K*_τ ∗ K*_τ)(0) / (h^{2m} ∏_{s=1}^q h_s)] |Ω| {1 + o_P(1)},
tr(2H_{τ,h} − H_{τ,h}^T H_{τ,h}) = [d_z / ∏_{s=1}^q h_s] [2K*_τ(0)/h^m − (K*_τ ∗ K*_τ)(0)/h^{2m}] |Ω| {1 + o_P(1)}.
Clearly, for τ = 0 and q = 1, Theorem 3.3 nests Zhang's (2003) results for the degrees of freedom of local polynomial hat matrices (see Theorems 1 and 3, pages 612 and 616, respectively).
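The rate in Theorem 3.3 is easy to eyeball in simulation. The sketch below (our own rough check, not from the paper) assumes τ = 0, q = d_z = 1, p = 0, a Gaussian kernel, and a uniform design on [0, 1], so that the theorem predicts tr(H) ≈ K(0)|Ω|/h = 1/(h√(2π)):

```python
# Rough finite-sample check of the 1/h rate for the local constant trace.
import numpy as np

rng = np.random.default_rng(2)
n = 1500
x = rng.uniform(0, 1, n)

def trace_nw(x, h):
    """tr(H_0) for the local constant fit: sum_i K_h(0) / sum_l K_h(x_l - x_i)."""
    K0 = 1.0 / (h * np.sqrt(2 * np.pi))
    d = (x[:, None] - x[None, :]) / h
    Kmat = np.exp(-0.5 * d ** 2) / (h * np.sqrt(2 * np.pi))
    return float(np.sum(K0 / Kmat.sum(axis=1)))

for h in [0.2, 0.1, 0.05]:
    print(h, trace_nw(x, h), 1.0 / (h * np.sqrt(2 * np.pi)))
```

Boundary effects inflate the finite-sample trace somewhat above the asymptotic prediction, but halving h roughly doubles tr(H), as the 1/h rate requires.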
Degrees of Freedom for Discrete Regressors. The case in which all regressors are discrete is also common in applied research. We therefore assume that for model (4), d_z = 1, p = 0, and X_i = X_i^d. Then the resulting LCLS estimator implied by (15) is

m̂(x_i^d) = ∑_{j=1}^n L(x_j^d, x_i^d, λ) y_j / ∑_{l=1}^n L(x_l^d, x_i^d, λ) = ∑_{j=1}^n A_j(x_i^d) y_j,

where

(25) A_j(x_i^d) = L(x_j^d, x_i^d, λ) / ∑_{l=1}^n L(x_l^d, x_i^d, λ).

We let H_λ(i, j) := A_i(x_j^d). This formulation of H_λ gives rise to the following asymptotic expressions.

Theorem 3.4. Assume (x_i^d, y_i) are i.i.d. as (X^d, Y), and the probability mass function f(x^d) is bounded away from zero on Ω^d. Let the bandwidth vector λ be such that λ ∈ [0, b_n]^r, where b_n is a positive sequence that converges to zero as n → ∞. Then

(26) tr(H_λ) = |Ω^d| {1 + o_P(1)},
(27) tr(H_λ^T H_λ) = |Ω^d| {1 + o_P(1)},
(28) tr(2H_λ − H_λ^T H_λ) = |Ω^d| {1 + o_P(1)},

uniformly in λ ∈ [0, 1]^r.

The results in Theorem 3.4 suggest that in the presence of only discrete regressors, tr(H_λ), tr(H_λ^T H_λ), and tr(2H_λ − H_λ^T H_λ) are asymptotically equivalent, in the sense that any pairwise difference among them tends to zero in probability. Thus, the computational cost of tr(H_λ^T H_λ), which is O(n²), relative to that of tr(H_λ), which is O(n), may suggest use of tr(H_λ).

Degrees of Freedom in the Presence of Relevant and Irrelevant Regressors. Our preceding analyses assume that all variables included in the regression are relevant. It is common in applied work to have a mix of irrelevant and relevant regressors included in the same regression. In this setting, we are interested in the asymptotic expressions for tr(H_{0,γ}) when one set of bandwidths moves toward its theoretical upper bounds and another set moves toward zero. To proceed with this scenario, without loss of generality, we assume the first r_1 (0 ≤ r_1 ≤ r) components of X_i^d are relevant, and the first q_1 (1 ≤ q_1 ≤ q) components of X_i^c are relevant.
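As an aside on the purely discrete case: at the λ = 0 endpoint, the conclusion of Theorem 3.4 holds exactly in finite samples, because the indicator kernel turns H_λ into a within-cell averaging matrix whose trace equals the number of observed support points. A small sketch (our own illustration, with an assumed two-covariate design where c_1 = 3 and c_2 = 2):

```python
# Sketch: with purely discrete regressors and lambda = 0, the LCLS hat
# matrix averages within cells, so tr(H_lambda) = number of observed cells.
import numpy as np

rng = np.random.default_rng(3)
n, r = 300, 2
Xd = np.column_stack([rng.integers(0, 3, n), rng.integers(0, 2, n)])  # c_1 = 3, c_2 = 2

lam = np.zeros(r)                                   # indicator kernel, lambda_s = 0
L = np.where(Xd[:, None, :] != Xd[None, :, :], lam, 1.0).prod(axis=2)
H = L / L.sum(axis=1, keepdims=True)                # H(i, j) = A_j(x_i^d)

cells = len(np.unique(Xd, axis=0))                  # observed support points (6 here)
assert np.isclose(np.trace(H), cells)
print(np.trace(H))
```

Each diagonal entry is 1/n_c, where n_c is the size of observation i's cell, so the trace sums to exactly one per observed cell, matching the |Ω^d| limit in (26).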
We denote $\tilde X_i^c$ and $\tilde X_i^d$ as the relevant components, and $\bar X_i^c$ and $\bar X_i^d$ as the irrelevant components. We adopt the concept of irrelevant regressors from Hall et al. (2007); thus we assume that

(29) $(Y, \tilde X)$ and $\bar X$ are independent of each other.

By virtue of this independence assumption, $f(x) = \tilde f(\tilde x)\,\bar f(\bar x)$, where $\tilde f(\tilde x)$ and $\bar f(\bar x)$ are the marginal densities of $\tilde X_i$ and $\bar X_i$, respectively. We use $\tilde\Omega$ and $\tilde\Omega^d$ to denote the supports of $\tilde x^c$ and $\tilde x^d$, and $\bar\Omega$ and $\bar\Omega^d$ those of $\bar x^c$ and $\bar x^d$, respectively.^3 In the ensuing sections, we will make use of the following kernel partitions:

(30) $L(X_i^d, x^d, \lambda) = \tilde L(\tilde X_i^d, \tilde x^d, \tilde\lambda)\,\bar L(\bar X_i^d, \bar x^d, \bar\lambda) = \prod_{s=1}^{r_1} l(\tilde X_{is}^d, \tilde x_s^d, \lambda_s) \prod_{s=r_1+1}^{r} l(\bar X_{is}^d, \bar x_s^d, \lambda_s),$

and

(31) $K_h(X_i^c - x^c) = \tilde K_h(\tilde X_i^c - \tilde x^c)\,\bar K_h(\bar X_i^c - \bar x^c) = \prod_{s=1}^{q_1} h_s^{-1} K\!\left(\frac{\tilde X_{is}^c - \tilde x_s^c}{h_s}\right) \prod_{s=q_1+1}^{q} h_s^{-1} K\!\left(\frac{\bar X_{is}^c - \bar x_s^c}{h_s}\right).$

With $\tilde\nu^d, \tilde x^d \in \tilde\Omega^d$, define $1_s(\tilde\nu^d, \tilde x^d) = 1(\tilde\nu_s^d \neq \tilde x_s^d) \prod_{t=1, t\neq s}^{r_1} 1(\tilde\nu_t^d = \tilde x_t^d)$. That is, $1_s(\tilde\nu^d, \tilde x^d)$ is an indicator function equal to one if $\tilde\nu^d$ and $\tilde x^d$ differ only in their $s$th element, and zero otherwise. For $l = 1, 2$, define

$\bar m_l(\bar x) = E\{[\bar K_h(\bar X_i^c - \bar x^c)\,\bar L(\bar X_i^d, \bar x^d, \bar\lambda)]^l\},$

$\breve m_l(\bar x) = E\left\{\left[\left(\prod_{s=q_1+1}^{q} K\!\left(\frac{\bar X_{is}^c - \bar x_s^c}{h_s}\right)\right)\bar L(\bar X_i^d, \bar x^d, \bar\lambda)\right]^l\right\},$

$\bar m_{L,l}(\bar x^d) = E\{[\bar L(\bar X_i^d, \bar x^d, \bar\lambda)]^l\}.$

Note the subtle difference between $\bar m_l(\bar x)$ and $\breve m_l(\bar x)$: $\breve m_l(\bar x)$ does not contain the division by $\bar h$ that exists in $\bar m_l(\bar x)$. This feature will be important when we study the limiting behavior of our ANOVA-based hat matrix. What do these three terms capture? $\bar m_l(\bar x)$ is the $l$th raw moment of the kernel weights for the irrelevant covariates at the point $\bar x$; $\breve m_l(\bar x)$ is the $l$th raw moment of the kernel weights for the irrelevant covariates at the point $\bar x$, but scaled by the irrelevant continuous covariates' bandwidth vector; and $\bar m_{L,l}(\bar x^d)$ is the $l$th raw moment of the kernel weights for the irrelevant, discrete covariates at the point $\bar x^d$. All three of these moment-based functions are expectations taken over the design points.

To shed light on the asymptotic behavior of $\mathrm{tr}(H_{0,\gamma})$ in the presence of some irrelevant regressors, we examine the entries of the resultant $H_{0,\gamma}$. Appealing to equations (15), (30) and (31), we obtain

$\hat m_0(x_i) = \dfrac{\sum_{j=1}^{n}\Big[\prod_{s=1}^{q_1} k\!\big(\frac{\tilde x_{js}^c - \tilde x_{is}^c}{h_s}\big) \prod_{s=q_1+1}^{q} k\!\big(\frac{\bar x_{js}^c - \bar x_{is}^c}{h_s}\big) \prod_{s=1}^{r_1} l(\tilde x_{js}^d, \tilde x_{is}^d, \lambda_s) \prod_{s=r_1+1}^{r} l(\bar x_{js}^d, \bar x_{is}^d, \lambda_s)\Big] y_j}{\sum_{j=1}^{n}\Big[\prod_{s=1}^{q_1} k\!\big(\frac{\tilde x_{js}^c - \tilde x_{is}^c}{h_s}\big) \prod_{s=q_1+1}^{q} k\!\big(\frac{\bar x_{js}^c - \bar x_{is}^c}{h_s}\big) \prod_{s=1}^{r_1} l(\tilde x_{js}^d, \tilde x_{is}^d, \lambda_s) \prod_{s=r_1+1}^{r} l(\bar x_{js}^d, \bar x_{is}^d, \lambda_s)\Big]}.$

For this general setting with a mix of discrete and continuous regressors, Hall et al. (2007) establish that the cross-validated bandwidths for the irrelevant regressors converge in probability to the suprema of their ranges. Ideally, if we set $h_s = \infty$ for $s = q_1+1, \ldots, q$ and $\lambda_s = 1$ for $s = r_1+1, \ldots, r$, we have

$\hat m_0(x_i) = \dfrac{\sum_{j=1}^{n}\Big[\prod_{s=1}^{q_1} k\!\big(\frac{\tilde x_{js}^c - \tilde x_{is}^c}{h_s}\big) \prod_{s=1}^{r_1} l(\tilde x_{js}^d, \tilde x_{is}^d, \lambda_s)\Big] k(0)^{q-q_1}\, 1^{r-r_1}\, y_j}{\sum_{j=1}^{n}\Big[\prod_{s=1}^{q_1} k\!\big(\frac{\tilde x_{js}^c - \tilde x_{is}^c}{h_s}\big) \prod_{s=1}^{r_1} l(\tilde x_{js}^d, \tilde x_{is}^d, \lambda_s)\Big] k(0)^{q-q_1}\, 1^{r-r_1}} = \dfrac{\sum_{j=1}^{n}\Big[\prod_{s=1}^{q_1} k\!\big(\frac{\tilde x_{js}^c - \tilde x_{is}^c}{h_s}\big) \prod_{s=1}^{r_1} l(\tilde x_{js}^d, \tilde x_{is}^d, \lambda_s)\Big] y_j}{\sum_{j=1}^{n}\Big[\prod_{s=1}^{q_1} k\!\big(\frac{\tilde x_{js}^c - \tilde x_{is}^c}{h_s}\big) \prod_{s=1}^{r_1} l(\tilde x_{js}^d, \tilde x_{is}^d, \lambda_s)\Big]} = \sum_{j=1}^{n} A_{ij} y_j,$

where $A_{ij}$ contains only the relevant regressors. What is apparent is that the algebraic form of the local constant estimator suggests that, regardless of the number of variables in the model, when variables are smoothed out, it is only the bandwidths associated with variables not smoothed away that dictate the behavior of $A_{ij}$. Only when all variables are smoothed away is $A_{ij}$ impacted by the increasing bandwidths. Thus, calculating degrees of freedom for the local constant kernel estimator is not influenced by the presence of irrelevant regressors when relevant regressors are present. What this suggests is that the degrees of freedom is influenced only by the number of relevant regressors.

^3 Hall et al. (2007) highlight that a more practically appealing yet theoretically challenging variant of (29) is: conditional on $\tilde X$, the variables $\bar X$ and $Y$ are independent. We therefore follow Hall et al. (2007) and choose the more restrictive of these two independence assumptions.
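The cancellation above is easy to verify numerically. The sketch below (our own construction; the data generating process, kernel, and bandwidths are illustrative assumptions) builds the LCLS weight matrix with and without an irrelevant continuous covariate whose bandwidth is pushed toward infinity, and checks that the two hat matrices, and hence their traces, coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x_rel = rng.uniform(0, 1, n)       # relevant continuous covariate
x_irr = rng.normal(0, 1, n)        # irrelevant continuous covariate

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

h_rel, h_irr = 0.1, 1e8            # irrelevant bandwidth pushed toward infinity

# product-kernel LCLS weights with both covariates
# (the 1/h normalizations cancel in the weight ratio and are omitted)
K = gauss((x_rel[:, None] - x_rel[None, :]) / h_rel) * \
    gauss((x_irr[:, None] - x_irr[None, :]) / h_irr)
H_both = K / K.sum(axis=1, keepdims=True)

# LCLS weights using the relevant covariate only
K1 = gauss((x_rel[:, None] - x_rel[None, :]) / h_rel)
H_rel = K1 / K1.sum(axis=1, keepdims=True)

# the k(0) factors from the smoothed-out covariate cancel in the ratio,
# so the two hat matrices (and their traces) agree
print(np.abs(H_both - H_rel).max())
print(np.trace(H_both), np.trace(H_rel))
```

The maximum elementwise difference is numerically zero, so the degrees of freedom are driven by the relevant covariate alone.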
This is clearly not the case in OLS, where adding an additional (irrelevant) regressor always contributes 1 to the degrees of freedom. Drawing on the theoretical contributions of Hall et al. (2007), we now provide the asymptotic expressions for $\mathrm{tr}(H_{0,\gamma})$, $\mathrm{tr}(H_{0,\gamma}^T H_{0,\gamma})$, and $\mathrm{tr}(2H_{0,\gamma} - H_{0,\gamma}^T H_{0,\gamma})$ as follows.
Theorem 3.5. Suppose $d_z = 1$ in model (4), and condition (29) and Assumptions A.1 to A.5 are satisfied. Assume as $n \to \infty$, (i) $h_s \to 0$ for $s = 1, \ldots, q_1$, $h_s \to \infty$ for $s = q_1+1, \ldots, q$, and $nh_1 \cdots h_{q_1} \to \infty$; and (ii) $\lambda_s \to 0$ for $s = 1, \ldots, r_1$, and $\lambda_s \to 1$ for $s = r_1+1, \ldots, r$. Then for the LCLS estimator we have

(32) $\mathrm{tr}(H_{0,\gamma}) = \dfrac{[K(0)]^{q_1}\, |\tilde\Omega|\, |\tilde\Omega^d|}{\prod_{s=1}^{q_1} h_s}\{1 + o_P(1)\},$

(33) $\mathrm{tr}(H_{0,\gamma}^T H_{0,\gamma}) = \dfrac{\nu_0^{q_1}\, |\tilde\Omega|\, |\tilde\Omega^d|}{\prod_{s=1}^{q_1} h_s}\{1 + o_P(1)\},$

(34) $\mathrm{tr}(2H_{0,\gamma} - H_{0,\gamma}^T H_{0,\gamma}) = \dfrac{\{2[K(0)]^{q_1} - \nu_0^{q_1}\}\, |\tilde\Omega|\, |\tilde\Omega^d|}{\prod_{s=1}^{q_1} h_s}\{1 + o_P(1)\}.$

Theorem 3.5 formally establishes that the influence of the irrelevant regressors on $\mathrm{tr}(H_{0,\gamma})$ is asymptotically negligible (compare its counterpart in the absence of irrelevant regressors, which can be deduced from Theorem 3.2). Similar asymptotic expressions for the case of continuous-only regressors can be deduced from Theorem 3.5. In this latter case, the influence of irrelevant regressors also has an asymptotically negligible effect on the trace of each of the three non-ANOVA hat matrices. Our foregoing results and discussions highlight that all three non-ANOVA trace measures are well defined and useful in the presence of irrelevant covariates.

3.5.1. Hat matrices with only discrete regressors. As in the case with only relevant regressors, we assume that for model (4) $d_z = 1$, $p = 0$, and $X = X^d$. Then, by virtue of the kernel partition in (30), the LCLS estimator in (15) simplifies to

(35) $\hat m(x_i^d) = \dfrac{\sum_{j=1}^{n} \tilde L(\tilde X_j^d, \tilde x_i^d, \tilde\lambda)\,\bar L(\bar X_j^d, \bar x_i^d, \bar\lambda)\, y_j}{\sum_{l=1}^{n} \tilde L(\tilde X_l^d, \tilde x_i^d, \tilde\lambda)\,\bar L(\bar X_l^d, \bar x_i^d, \bar\lambda)} = \sum_{j=1}^{n} A_j(x_i^d)\, y_j,$

where

(36) $A_j(x_i^d) = \dfrac{\tilde L(\tilde X_j^d, \tilde x_i^d, \tilde\lambda)\,\bar L(\bar X_j^d, \bar x_i^d, \bar\lambda)}{\sum_{l=1}^{n} \tilde L(\tilde X_l^d, \tilde x_i^d, \tilde\lambda)\,\bar L(\bar X_l^d, \bar x_i^d, \bar\lambda)}.$

In this case with only discrete regressors, Ouyang et al.
(2009) establish that the bandwidths associated with the irrelevant regressors from the least-squares cross-validation method are not guaranteed to be smoothed out as the sample size increases; that is, there is a positive probability that these bandwidths do not converge to their upper extreme values even as $n \to \infty$. Thus, $\bar\lambda_s = 1$ for all $s = r_1+1, \ldots, r$ is not guaranteed, and hence in (36) $\bar L(\bar X_l^d, \bar x_i^d, \bar\lambda) = 1$ is not guaranteed asymptotically. In essence, Ouyang et al.'s (2009) result implies that $\mathrm{tr}(H_\lambda)$ can exceed $|\tilde\Omega^d|$ with positive probability, where $H_\lambda$ is predicated on (36).^4 This implication also holds for $\mathrm{tr}(H_\lambda^T H_\lambda)$ and $\mathrm{tr}(2H_\lambda - H_\lambda^T H_\lambda)$.

Theorem 3.6. Assume $(x_i^d, y_i)$ are i.i.d. as $(X^d, Y)$, and the probability mass function $f(x^d)$ is bounded away from zero on $\Omega^d$. Assume $H_\lambda$ is predicated on (36). Suppose that $\tilde\lambda = (\lambda_1, \ldots, \lambda_{r_1}) \in [0, b_n]^{r_1}$, where $b_n$ is a positive sequence that converges to zero as $n \to \infty$, and $\lim_{n\to\infty} \Pr(\bar\lambda_{r_1+1} = 1, \ldots, \bar\lambda_{r} = 1) \ge \alpha$ for some $\alpha \in (0, 1)$. Then

(37) $\mathrm{tr}(H_\lambda) = |\tilde\Omega^d| \sum_{\bar x^d} \dfrac{\bar f(\bar x^d)}{\bar m_{L,1}(\bar x^d)}\{1 + o_P(1)\},$

(38) $\mathrm{tr}(H_\lambda^T H_\lambda) = |\tilde\Omega^d| \sum_{\bar x^d} \dfrac{\bar f(\bar x^d)}{\bar m_{L,1}(\bar x^d)}\{1 + o_P(1)\},$

(39) $\mathrm{tr}(2H_\lambda - H_\lambda^T H_\lambda) = |\tilde\Omega^d| \sum_{\bar x^d} \dfrac{\bar f(\bar x^d)}{\bar m_{L,1}(\bar x^d)}\{1 + o_P(1)\},$

uniformly in $\lambda \in [0, 1]^r$.

Theorem 3.6 suggests that for nonparametric regressions with only discrete covariates, the asymptotic equivalence between any pair of $\mathrm{tr}(H_\lambda)$, $\mathrm{tr}(H_\lambda^T H_\lambda)$ and $\mathrm{tr}(2H_\lambda - H_\lambda^T H_\lambda)$ is valid even if some of the covariates are irrelevant. Note also that $\bar f(\bar x^d) \le \bar m_{L,1}(\bar x^d)$ for each $\bar x^d \in \bar\Omega^d$ and, in particular, $\bar f(\bar x^d) < \bar m_{L,1}(\bar x^d)$ for each $\bar x^d \in \bar\Omega^d$. Thus, $\sum_{\bar x^d} \bar f(\bar x^d)/\bar m_{L,1}(\bar x^d) < |\bar\Omega^d|$. Therefore, Theorem 3.6 implies that in the presence of irrelevant discrete variables, $\mathrm{tr}(H_\lambda) < |\tilde\Omega^d|\,|\bar\Omega^d| = |\Omega^d|$ asymptotically. In essence, $\sum_{\bar x^d} \bar f(\bar x^d)/\bar m_{L,1}(\bar x^d)$ is a measure of the degree of irrelevance.

4. A Multivariate generalization of the hat matrix from the ANOVA framework

Huang & Chen (2008) consider the local polynomial estimator for model (4) with a scalar continuous regressor $X_i$ under an ANOVA framework.

^4 Zhang (2003) notes this observation.
To do this, from (7) Huang & Chen (2008) define a local SSE, SST, and SSR, respectively, as

$SSE_p(x, h) = \dfrac{n^{-1}\sum_{i=1}^{n}\big(Y_i - \sum_{j=0}^{p}\hat\beta_j (X_i - x)^j\big)^2 K_h(X_i - x)}{n^{-1}\sum_{i=1}^{n} K_h(X_i - x)},$

$SST_p(x, h) = \dfrac{n^{-1}\sum_{i=1}^{n}\big(Y_i - \bar Y\big)^2 K_h(X_i - x)}{n^{-1}\sum_{i=1}^{n} K_h(X_i - x)},$

$SSR_p(x, h) = \dfrac{n^{-1}\sum_{i=1}^{n}\big(\sum_{j=0}^{p}\hat\beta_j (X_i - x)^j - \bar Y\big)^2 K_h(X_i - x)}{n^{-1}\sum_{i=1}^{n} K_h(X_i - x)},$
so that $SST_p(x, h) = SSE_p(x, h) + SSR_p(x, h)$. Their global counterparts to this local ANOVA decomposition are

$SSE_p(h) = \int SSE_p(x, h)\,\hat f(x; h)\,dx, \quad SST(h) = \int SST_p(x, h)\,\hat f(x; h)\,dx, \quad SSR_p(h) = \int SSR_p(x, h)\,\hat f(x; h)\,dx,$

and $SST(h) = SST \equiv n^{-1}\sum_{i=1}^{n}(Y_i - \bar Y)^2$ under some conditions. The hat matrix corresponding to the global ANOVA decomposition is denoted $H^*$ and is defined as $H^* = \int W^* H\,\hat f(x; h)\,dx$, with $W^*$ a diagonal matrix having entries $K_h(X_i - x)/\hat f(x; h)$, and

(40) $H = X(X^T W X)^{-1} X^T W,$

with $X$ being the effective design matrix generated by the local polynomial expansion. We now extend the Huang & Chen (2008) framework by allowing the regression model to have $q$ continuous regressors in the vector $X^c$ and $r$ discrete regressors in a vector $X^d$. In light of the foregoing local and global ANOVA decompositions, we proceed in the following way:

$SSE_p(x, \gamma) = \dfrac{n^{-1}\sum_{i=1}^{n}\big(Y_i - \hat\beta_0 - \sum_{1 \le |j| \le p}\hat\beta_j (X_i^c - x^c)^j\big)^2 K_\gamma(X_i, x)}{n^{-1}\sum_{i=1}^{n} K_\gamma(X_i, x)},$

$SST(x, \gamma) = \dfrac{n^{-1}\sum_{i=1}^{n}\big(Y_i - \bar Y\big)^2 K_\gamma(X_i, x)}{n^{-1}\sum_{i=1}^{n} K_\gamma(X_i, x)},$

$SSR_p(x, \gamma) = \dfrac{n^{-1}\sum_{i=1}^{n}\big(\hat\beta_0 + \sum_{1 \le |j| \le p}\hat\beta_j (X_i^c - x^c)^j - \bar Y\big)^2 K_\gamma(X_i, x)}{n^{-1}\sum_{i=1}^{n} K_\gamma(X_i, x)}.$

Their global counterparts to this local ANOVA decomposition are

$SSE_p(\gamma) = \sum_{x^d}\int SSE_p(x, \gamma)\,\hat f(x; \gamma)\,dx^c, \quad SST(\gamma) = \sum_{x^d}\int SST(x, \gamma)\,\hat f(x; \gamma)\,dx^c, \quad SSR_p(\gamma) = \sum_{x^d}\int SSR_p(x, \gamma)\,\hat f(x; \gamma)\,dx^c,$
where $\sum_{x^d}$ refers to summation over all atoms $x^d = (x_1^d, \ldots, x_r^d)$ of the distribution of $X^d$. Then, for this generalization,

(41) $H^* = \sum_{x^d}\int W^* H\,\hat f(x; \gamma)\,dx^c,$

with $H$ as defined in (3.3), $X = D(x^c)$, and $W^* = W(x)/\hat f(x; \gamma)$ a diagonal matrix having entries $K_\gamma(x, X_i)/\hat f(x; \gamma)$, where $\hat f(x; \gamma) = \frac{1}{n}\sum_{i=1}^{n} K_\gamma(x, X_i)$, and we assume the following normalization:

$\sum_{x^d}\int K_\gamma(X_i, x)\,dx^c = 1,$

which ensures that $SST(\gamma) = \sum_{x^d}\int SST(x, \gamma)\,\hat f(x; \gamma)\,dx^c = n^{-1}\sum_{i=1}^{n}(Y_i - \bar Y)^2$. Define

$M^{-1} = \begin{pmatrix} M^{0,0} & M^{0,1} & \cdots & M^{0,p} \\ M^{1,0} & M^{1,1} & \cdots & M^{1,p} \\ \vdots & \vdots & & \vdots \\ M^{p,0} & M^{p,1} & \cdots & M^{p,p} \end{pmatrix}.$

The immediate result is a generalization of the trace result in Theorem 4(c) in Huang & Chen (2008).

Theorem 4.1. Assume the conditions in Theorem 3.2 hold and $d_z = 1$ in (4). The conditional trace of $H^*$, as defined in (41), for the multivariate local polynomial estimator of the conditional mean is asymptotically

(42) $\mathrm{tr}(H^*) = \dfrac{|\Omega|\,|\Omega^d|}{\prod_{s=1}^{q} h_s} \sum_{r,c=0}^{p} \mathrm{tr}\big(M^{r,c} S_{c,r}\big)\{1 + o_P(1)\}.$

Theorem 4.1 shows that the asymptotic expansion for $\mathrm{tr}(H^*)$ is inversely related to the bandwidths for the continuous regressors but is unrelated to the bandwidths for the discrete regressors and the mixed design density. In the absence of discrete regressors, the following corollary is immediate.

Corollary 4.2. Assume the conditions in Theorem 4.1 hold with only continuous regressors in (4). The conditional trace of $H^*$ for the multivariate local polynomial estimator is
asymptotically

(43) $\mathrm{tr}(H^*) = \dfrac{|\Omega|}{\prod_{s=1}^{q} h_s} \sum_{r,c=0}^{p} \mathrm{tr}\big(M^{r,c} S_{c,r}\big)\{1 + o_P(1)\}.$

Clearly, the difference between the non-ANOVA asymptotic expressions for the hat matrix implied by the conditional mean estimator in Theorem 3.2 and that of their ANOVA counterpart in Theorem 4.1 is driven by a linear combination of kernel-dependent constants, which can be easily calculated. Furthermore, this result remains in the presence of only continuous regressors, as can be seen from Theorem 3.3 and Corollary 4.2. More important, under certain model restrictions this linear combination of kernel-dependent constants can be quite minuscule; we illustrate this in the ensuing subsection.

4.1. Comparing Degrees of Freedom from the univariate ANOVA and non-ANOVA frameworks. In light of the regression function with a scalar smooth covariate that is used in both Zhang (2003) and Huang & Chen (2008), we now gauge the size of the linear combination of kernel-dependent constants that drives the difference between the asymptotic expressions for the trace of their resultant hat matrices for the conditional mean estimator. We consider the three most popular local polynomial estimators: LCLS, local linear, and local cubic; thus, $p \in \{0, 1, 3\}$. For a scalar covariate, we have $M = (\mu_{i+j-2})_{1\le i,j\le p+1}$, $M^{-1} := (m_{ij})_{1\le i,j\le p+1}$, and $S = (\nu_{i+j-2})_{1\le i,j\le p+1}$ (see pages 12 and 21). Thus, Theorem 3.3 and Corollary 4.2 simplify, respectively, to

$\mathrm{tr}(H_{0,h}) = h^{-1}\, e_{1,p+1}^T M^{-1} e_{1,p+1} K(0)\, |\Omega|\,\{1 + o_P(1)\},$
$\mathrm{tr}(H_{0,h}^T H_{0,h}) = h^{-1}\, e_{1,p+1}^T M^{-1} S M^{-1} e_{1,p+1}\, |\Omega|\,\{1 + o_P(1)\},$
$\mathrm{tr}(H^*) = h^{-1}\Big(\sum_{i,j=1}^{p+1} m_{ij}\nu_{i+j-2}\Big) |\Omega|\,\{1 + o_P(1)\},$

where the sum runs over $i + j$ even. We define

${}^1C_\kappa^p = \sum_{i,j=1}^{p+1} m_{ij}\nu_{i+j-2} - e_{1,p+1}^T M^{-1} e_{1,p+1} K(0),$
${}^2C_\kappa^p = \sum_{i,j=1}^{p+1} m_{ij}\nu_{i+j-2} - e_{1,p+1}^T M^{-1} S M^{-1} e_{1,p+1},$
${}^3C_\kappa^p = \sum_{i,j=1}^{p+1} m_{ij}\nu_{i+j-2} - \big(2 e_{1,p+1}^T M^{-1} e_{1,p+1} K(0) - e_{1,p+1}^T M^{-1} S M^{-1} e_{1,p+1}\big),$

where the sums run over $i + j$ even, and where ${}^1C_\kappa^p$, ${}^2C_\kappa^p$, and ${}^3C_\kappa^p$ are associated with the differences between $\mathrm{tr}(H^*)$ and $\mathrm{tr}(H_{0,h})$, $\mathrm{tr}(H_{0,h}^T H_{0,h})$, and $\mathrm{tr}(2H_{0,h} - H_{0,h}^T H_{0,h})$, respectively. For $p = 0, 1$, for example, the ANOVA-based result in Corollary 4.2 degenerates to

(44) $\mathrm{tr}(H^*) = h^{-1}|\Omega|\,\nu_0\{1 + o_P(1)\},$
(45) $\mathrm{tr}(H^*) = h^{-1}|\Omega|\,(\nu_0 + \nu_2/\mu_2)\{1 + o_P(1)\},$

respectively, whereas the non-ANOVA counterparts implied by Theorem 3.3 degenerate to

(46) $\mathrm{tr}(H_{0,h}) = h^{-1}|\Omega|\, K(0)\{1 + o_P(1)\},$
(47) $\mathrm{tr}(H_{0,h}^T H_{0,h}) = h^{-1}|\Omega|\,\nu_0\{1 + o_P(1)\}.$

Clearly, ${}^2C_\kappa^0 = 0$; that is, for the LCLS estimator the asymptotic difference between $\mathrm{tr}(H^*)$ and $\mathrm{tr}(H_{0,h}^T H_{0,h})$ is zero for any kernel, assuming an identical bandwidth parameter $h$.^5 We undertake a more general comparison between $\mathrm{tr}(H^*)$ and both $\mathrm{tr}(H_{0,h})$ and $\mathrm{tr}(H_{0,h}^T H_{0,h})$ for the popular class of symmetric beta kernels, defined as

(48) $K(t) = \dfrac{1}{\mathrm{Beta}(1/2, \kappa + 1)}(1 - t^2)_+^{\kappa}, \quad \kappa = 0, 1, 2, \ldots,$

where the subscript $+$ denotes the positive part, which is understood to be taken prior to exponentiation (see, e.g., Fan & Gijbels 1996, p. 15). This class nests the uniform, Epanechnikov, biweight, and triweight kernels for $\kappa = 0, 1, 2$, and $3$, respectively, and the Gaussian kernel as the limiting kernel function as $\kappa \to \infty$. For $\kappa = 0, 1, 2$, and $3$,

$\mu_{2j} = \dfrac{\mathrm{Beta}(j + 1/2, \kappa + 1)}{\mathrm{Beta}(1/2, \kappa + 1)} \quad \text{and} \quad \nu_{2j} = \dfrac{\mathrm{Beta}(j + 1/2, 2\kappa + 1)}{\{\mathrm{Beta}(1/2, \kappa + 1)\}^2},$

and for the Gaussian kernel, $K(u) = (1/\sqrt{2\pi})\, e^{-u^2/2}$, with $\mu_{2j} = (2j-1)(2j-3)\cdots 3\cdot 1$ and $\nu_{2j} = 2^{-(j+1)}\mu_{2j}/\sqrt{\pi}$ (see Fan & Gijbels 1996, p. 78). Table 1 reports the values of ${}^1C_\kappa^p$, ${}^2C_\kappa^p$, and ${}^3C_\kappa^p$ for this class of kernels and for $p \in \{0, 1, 3\}$. Table 1 shows that for smoother kernels, that is, kernels with a larger $\kappa \ge 1$, ${}^1C_\kappa^1$ and ${}^1C_\kappa^3$ become smaller.
For the local linear estimator, the asymptotic difference between $\mathrm{tr}(H^*)$ and $\mathrm{tr}(H_{0,h})$ can be quite minute; specifically, for the Gaussian kernel, ${}^1C_\kappa^1 \approx 0.024$. More important, $0 \le [{}^1C_\kappa^p] \le 1$, $0 \le [{}^2C_\kappa^p] \le 1$, and $0 \le [{}^3C_\kappa^p] \le 1$ for all $\kappa$ and $p \in \{0, 1, 3\}$, where $[c]$ denotes the nearest integer to the real number $c$.

^5 In fact, ${}^2C_\kappa^0 = 0$ for any $q$-variate smooth covariate.
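These kernel-dependent constants are straightforward to compute from the moment formulas above. The following sketch (our own illustration; the function names are ours, and only the Gaussian-kernel moments $\mu_{2j} = (2j-1)!!$ and $\nu_{2j} = 2^{-(j+1)}\mu_{2j}/\sqrt{\pi}$ from the text are used) evaluates ${}^1C^p$ and ${}^2C^p$ for $p \in \{0, 1\}$:

```python
import math
import numpy as np

def gaussian_moments(j):
    """mu_{2j} and nu_{2j} for the Gaussian kernel."""
    mu = 1.0
    for k in range(1, j + 1):           # (2j-1)(2j-3)...3*1
        mu *= 2 * k - 1
    nu = mu / (2 ** (j + 1) * math.sqrt(math.pi))
    return mu, nu

def constants(p):
    """1C^p and 2C^p for the Gaussian kernel (odd-order moments vanish)."""
    def moment(i, j, which):
        if (i + j) % 2:                 # odd-order kernel moments are zero
            return 0.0
        return gaussian_moments((i + j) // 2)[which]
    M = np.array([[moment(i, j, 0) for j in range(p + 1)] for i in range(p + 1)])
    S = np.array([[moment(i, j, 1) for j in range(p + 1)] for i in range(p + 1)])
    Minv = np.linalg.inv(M)
    K0 = 1.0 / math.sqrt(2 * math.pi)   # K(0) for the Gaussian kernel
    trMS = np.trace(Minv @ S)           # sum of m_ij nu_{i+j-2} over i+j even
    e = np.zeros(p + 1); e[0] = 1.0
    C1 = trMS - (e @ Minv @ e) * K0
    C2 = trMS - e @ Minv @ S @ Minv @ e
    return C1, C2

print(constants(0))   # 2C^0 is exactly 0 for the LCLS estimator
print(constants(1))   # 1C^1 for the Gaussian kernel is small (~0.02)
```

The $p = 0$ case confirms ${}^2C^0 = 0$ numerically, while the $p = 1$ case reproduces the small local linear discrepancy discussed above.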
[Table 1 about here.]

4.2. Hat matrices with only discrete regressors. For the LCLS estimator with only discrete regressors, $H^* = \sum_{x^d} W^* H\,\hat f(x^d; \lambda)$, with $W^*$ a diagonal matrix having entries $L(X_i^d, x^d, \lambda)/\hat f(x^d; \lambda)$, and with $X = \iota$, the vector of ones in $\mathbb{R}^n$. Then

$\mathrm{tr}(H^*) = \sum_{x^d} \mathrm{tr}\big(W^* \iota(\iota^T W^* \iota)^{-1}\iota^T W^*\big)\,\hat f(x^d; \lambda) = \sum_{x^d} (\iota^T W^* \iota)^{-1}\big(\iota^T W^{*2} \iota\big)\,\hat f(x^d; \lambda)$

(49) $= \sum_{x^d}\left(\sum_{i=1}^{n} \frac{L(X_i^d, x^d, \lambda)}{\hat f(x^d; \lambda)}\right)^{-1}\left(\sum_{i=1}^{n} \frac{L^2(X_i^d, x^d, \lambda)}{[\hat f(x^d; \lambda)]^2}\right)\hat f(x^d; \lambda).$

In light of (49), we have the following result:

Theorem 4.3. Under the conditions of Theorem 3.4,

(50) $\mathrm{tr}(H^*) = |\Omega^d|\{1 + o_P(1)\},$

where $\mathrm{tr}(H^*)$ is defined in (49).

Theorem 4.3 suggests that in the purely discrete case with all relevant regressors, the differences between the non-ANOVA-based approaches in Zhang (2003) and the ANOVA-based approach in Huang & Chen (2008) are asymptotically negligible. Hence, for example, $\mathrm{tr}(H^*) \approx \mathrm{tr}(H_\lambda)$.

4.3. Degrees of Freedom in the presence of Relevant and Irrelevant Regressors. In Subsection 3.5, we show that the non-ANOVA nonparametric framework also lends itself well to meaningful asymptotic expressions for the trace of the implied hat matrix in the presence of irrelevant continuous and discrete covariates. Intuitively, $\mathrm{tr}(H_{0,\gamma})$ is the ratio of two kernel terms that are of equal order of magnitude in the bandwidth vector for the continuous covariates; thus, the influence of $h_s \to \infty$ for $s = q_1+1, \ldots, q$ on the ratio is dominated by the influence of $h_s \to 0$ for $s = 1, \ldots, q_1$. In light of our juxtaposition of non-ANOVA and ANOVA frameworks, it is interesting to examine whether the trace of the resultant hat matrix from Huang & Chen's (2008) ANOVA framework has a meaningful expression when some covariates are irrelevant. We now consider the LCLS estimator for $\mathrm{tr}(H^*)$ with a mix of continuous and discrete regressors. Observe that

$\mathrm{tr}(H^*) = \sum_{x^d}\int (\iota^T W \iota)^{-1}\big(\iota^T W^2 \iota\big)\,\hat f(x; \gamma)^{-1}\hat f(x; \gamma)\,dx^c,$
which also depends on the ratio of two kernel terms. However, unlike the non-ANOVA framework, the kernel terms are of different orders of magnitude in the bandwidth vector for the continuous covariates. This suggests that there can be sizable influence of $h_s \to \infty$ for $s = q_1+1, \ldots, q$, relative to $h_s \to 0$ for $s = 1, \ldots, q_1$, on $\mathrm{tr}(H^*)$. Hence, $\mathrm{tr}(H^*) \to 0$ is possible under the condition that $h_s \to \infty$ for $s = q_1+1, \ldots, q$. Formally, we provide this attrition effect of the irrelevant bandwidths on $\mathrm{tr}(H^*)$ in the following result:

Theorem 4.4. Assume the conditions of Theorem 3.5 hold. Let, for some constant $c$, $n^{-c} < h_s < n^{c}$ for $s = 1, \ldots, q$, and $\bar h \equiv \bar h_{q_1+1}\cdots\bar h_q \ge n^{\kappa}$, where $\kappa = (q_1(\eta + 1) + 4\eta)/(q_1 + 4)$ and $\eta \ge 1$. The $\mathrm{tr}(H^*)$ associated with the LCLS estimator is such that

(51) $\mathrm{tr}(H^*) = \dfrac{\nu_0^{q_1}\,|\tilde\Omega|\,|\tilde\Omega^d|}{\prod_{s=1}^{q_1} h_s}\left(\sum_{\bar x^d}\int \bar m(\bar x)\,d\bar x^c\right)\{1 + o_P(1)\},$

where $\bar m(\bar x) = \bar m_2(\bar x)/\bar m_1(\bar x) = O\big(\{\bar h_{q_1+1}\cdots\bar h_q\}^{-1}\big)$.

Thus, Theorem 4.4 shows that in the presence of irrelevant continuous covariates, $\mathrm{tr}(H^*) \to_p 0$.^6 In fact, simulation results confirm this behavior of $\mathrm{tr}(H^*)$ for the LCLS, local linear and local cubic estimators. One implication of Theorem 4.4 is that the nonparametric ANOVA-based F-tests developed by Huang & Chen (2008), Huang & Su (2009) and Huang & Davidson (2010) may not be operational in the presence of such covariates. In particular, $\mathrm{tr}(H^*) \to 0$ will render a residual degrees of freedom close to $n$, and hence a large global mean square error, which is used to compute an unbiased estimate of the error variance in finite-sample settings; also, $\mathrm{tr}(H^*) \to 0$ will render a negative regression degrees of freedom. The measure and interpretability of their ANOVA-based adjusted R-squared are also impaired by $\mathrm{tr}(H^*) \to 0$. This feature will be true for other data-driven bandwidth selection measures with the capability of selecting bandwidths which diverge.
For example, the AIC$_c$ bandwidth selection criterion of Hurvich, Simonoff & Tsai (1998) has been shown to perform in a similar fashion to LSCV (Li & Racine 2004), though no formal theory currently exists that demonstrates that AIC$_c$ bandwidth selection will produce large bandwidths for irrelevant variables. We further conjecture that this result will hold in the local polynomial setting when all of the continuous covariates enter the model in a polynomial fashion. The reason for this is that, as mentioned in Hall & Racine (2015), when the underlying data generating process is

^6 To the best of our knowledge, there is no study in the extant literature that documents the rate at which the bandwidths associated with the irrelevant covariates diverge to infinity. In practice, however, in a given model specification each $\bar h_s$ is often larger than the $\tilde h_s$ by a factor in excess of $n^{\eta}$. Therefore, the restriction we impose on the bandwidths for the irrelevant covariates is quite conservative.
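The attrition effect of Theorem 4.4 can be visualized with a short simulation (our own sketch, not the paper's Monte Carlo design; the data generating process, kernel, grid, and bandwidth values are illustrative assumptions). For a continuous-only LCLS fit, our reading of the continuous analogue of (49) gives $\mathrm{tr}(H^*) \approx \int \sum_i K_\gamma(X_i, x)^2 / \sum_i K_\gamma(X_i, x)\,dx$, which we evaluate on a grid while the bandwidth of an irrelevant covariate grows:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.uniform(0, 1, (n, 2))        # column 0 relevant, column 1 irrelevant

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

# grid over the support [0, 1]^2 for the integral in tr(H*)
g = np.linspace(0, 1, 41)
G1, G2 = np.meshgrid(g, g, indexing="ij")

def trace_anova(h1, h2):
    # product kernel evaluated at every (grid point, observation) pair
    K = (gauss((G1[..., None] - X[:, 0]) / h1) / h1 *
         gauss((G2[..., None] - X[:, 1]) / h2) / h2)
    ratio = (K**2).sum(axis=-1) / K.sum(axis=-1)   # sum_i K_i^2 / sum_i K_i
    return ratio.mean()                # grid average over the unit square

for h2 in (0.2, 2.0, 20.0, 200.0):
    print(h2, trace_anova(0.1, h2))    # trace shrinks as the irrelevant bandwidth grows
```

Unlike the non-ANOVA trace, which stabilizes once the irrelevant covariate is smoothed out, this ANOVA-based trace decays roughly like $1/\bar h$, in line with $\bar m(\bar x) = O(\bar h^{-1})$ in Theorem 4.4.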
Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adopted from Prof. H.R. Rabiee s and also Prof. R. Gutierrez-Osuna
More informationSUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA
SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA By Lingsong Zhang, Haipeng Shen and Jianhua Z. Huang Purdue University, University of North Carolina,
More informationDS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.
DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1
More informationUnit Roots in White Noise?!
Unit Roots in White Noise?! A.Onatski and H. Uhlig September 26, 2008 Abstract We show that the empirical distribution of the roots of the vector auto-regression of order n fitted to T observations of
More informationPreface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation
Preface Nonparametric econometrics has become one of the most important sub-fields in modern econometrics. The primary goal of this lecture note is to introduce various nonparametric and semiparametric
More informationIntroduction to Regression
Introduction to Regression David E Jones (slides mostly by Chad M Schafer) June 1, 2016 1 / 102 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression Nonparametric Procedures
More informationAPPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.
APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product
More informationECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd
ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation Petra E. Todd Fall, 2014 2 Contents 1 Review of Stochastic Order Symbols 1 2 Nonparametric Density Estimation 3 2.1 Histogram
More informationRegression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin
Regression Review Statistics 149 Spring 2006 Copyright c 2006 by Mark E. Irwin Matrix Approach to Regression Linear Model: Y i = β 0 + β 1 X i1 +... + β p X ip + ɛ i ; ɛ i iid N(0, σ 2 ), i = 1,..., n
More informationstatistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors:
Wooldridge, Introductory Econometrics, d ed. Chapter 3: Multiple regression analysis: Estimation In multiple regression analysis, we extend the simple (two-variable) regression model to consider the possibility
More informationLecture 20: Linear model, the LSE, and UMVUE
Lecture 20: Linear model, the LSE, and UMVUE Linear Models One of the most useful statistical models is X i = β τ Z i + ε i, i = 1,...,n, where X i is the ith observation and is often called the ith response;
More informationLocal linear multiple regression with variable. bandwidth in the presence of heteroscedasticity
Local linear multiple regression with variable bandwidth in the presence of heteroscedasticity Azhong Ye 1 Rob J Hyndman 2 Zinai Li 3 23 January 2006 Abstract: We present local linear estimator with variable
More informationSimple and Efficient Improvements of Multivariate Local Linear Regression
Journal of Multivariate Analysis Simple and Efficient Improvements of Multivariate Local Linear Regression Ming-Yen Cheng 1 and Liang Peng Abstract This paper studies improvements of multivariate local
More informationPart 6: Multivariate Normal and Linear Models
Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of
More informationNew Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data
ew Local Estimation Procedure for onparametric Regression Function of Longitudinal Data Weixin Yao and Runze Li The Pennsylvania State University Technical Report Series #0-03 College of Health and Human
More informationInference For High Dimensional M-estimates: Fixed Design Results
Inference For High Dimensional M-estimates: Fixed Design Results Lihua Lei, Peter Bickel and Noureddine El Karoui Department of Statistics, UC Berkeley Berkeley-Stanford Econometrics Jamboree, 2017 1/49
More informationLecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices
Lecture 3: Simple Linear Regression in Matrix Format To move beyond simple regression we need to use matrix algebra We ll start by re-expressing simple linear regression in matrix form Linear algebra is
More informationEstimation of the Conditional Variance in Paired Experiments
Estimation of the Conditional Variance in Paired Experiments Alberto Abadie & Guido W. Imbens Harvard University and BER June 008 Abstract In paired randomized experiments units are grouped in pairs, often
More informationSTAT5044: Regression and Anova. Inyoung Kim
STAT5044: Regression and Anova Inyoung Kim 2 / 51 Outline 1 Matrix Expression 2 Linear and quadratic forms 3 Properties of quadratic form 4 Properties of estimates 5 Distributional properties 3 / 51 Matrix
More informationSparse Nonparametric Density Estimation in High Dimensions Using the Rodeo
Outline in High Dimensions Using the Rodeo Han Liu 1,2 John Lafferty 2,3 Larry Wasserman 1,2 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University
More informationReproducing Kernel Hilbert Spaces
9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we
More informationPanel Data Models. James L. Powell Department of Economics University of California, Berkeley
Panel Data Models James L. Powell Department of Economics University of California, Berkeley Overview Like Zellner s seemingly unrelated regression models, the dependent and explanatory variables for panel
More informationx. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).
.8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics
More informationIntroduction to Regression
Introduction to Regression Chad M. Schafer May 20, 2015 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression Nonparametric Procedures Cross Validation Local Polynomial Regression
More informationVectors and Matrices Statistics with Vectors and Matrices
Vectors and Matrices Statistics with Vectors and Matrices Lecture 3 September 7, 005 Analysis Lecture #3-9/7/005 Slide 1 of 55 Today s Lecture Vectors and Matrices (Supplement A - augmented with SAS proc
More informationMatrix Factorizations
1 Stat 540, Matrix Factorizations Matrix Factorizations LU Factorization Definition... Given a square k k matrix S, the LU factorization (or decomposition) represents S as the product of two triangular
More information1. Stochastic Processes and Stationarity
Massachusetts Institute of Technology Department of Economics Time Series 14.384 Guido Kuersteiner Lecture Note 1 - Introduction This course provides the basic tools needed to analyze data that is observed
More informationDimension Reduction Techniques. Presented by Jie (Jerry) Yu
Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage
More informationIntegrated Likelihood Estimation in Semiparametric Regression Models. Thomas A. Severini Department of Statistics Northwestern University
Integrated Likelihood Estimation in Semiparametric Regression Models Thomas A. Severini Department of Statistics Northwestern University Joint work with Heping He, University of York Introduction Let Y
More informationCS 195-5: Machine Learning Problem Set 1
CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of
More informationIntroduction to Regression
Introduction to Regression p. 1/97 Introduction to Regression Chad Schafer cschafer@stat.cmu.edu Carnegie Mellon University Introduction to Regression p. 1/97 Acknowledgement Larry Wasserman, All of Nonparametric
More information3 Multiple Linear Regression
3 Multiple Linear Regression 3.1 The Model Essentially, all models are wrong, but some are useful. Quote by George E.P. Box. Models are supposed to be exact descriptions of the population, but that is
More informationFitting Linear Statistical Models to Data by Least Squares: Introduction
Fitting Linear Statistical Models to Data by Least Squares: Introduction Radu Balan, Brian R. Hunt and C. David Levermore University of Maryland, College Park University of Maryland, College Park, MD Math
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2
MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized
More informationWooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model
Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model Most of this course will be concerned with use of a regression model: a structure in which one or more explanatory
More informationEconomics 573 Problem Set 5 Fall 2002 Due: 4 October b. The sample mean converges in probability to the population mean.
Economics 573 Problem Set 5 Fall 00 Due: 4 October 00 1. In random sampling from any population with E(X) = and Var(X) =, show (using Chebyshev's inequality) that sample mean converges in probability to..
More informationPermutation-invariant regularization of large covariance matrices. Liza Levina
Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work
More informationCOUNTEREXAMPLES TO THE COARSE BAUM-CONNES CONJECTURE. Nigel Higson. Unpublished Note, 1999
COUNTEREXAMPLES TO THE COARSE BAUM-CONNES CONJECTURE Nigel Higson Unpublished Note, 1999 1. Introduction Let X be a discrete, bounded geometry metric space. 1 Associated to X is a C -algebra C (X) which
More informationNonparametric Econometrics
Applied Microeconometrics with Stata Nonparametric Econometrics Spring Term 2011 1 / 37 Contents Introduction The histogram estimator The kernel density estimator Nonparametric regression estimators Semi-
More informationRegression and Statistical Inference
Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF
More informationUnderstanding Regressions with Observations Collected at High Frequency over Long Span
Understanding Regressions with Observations Collected at High Frequency over Long Span Yoosoon Chang Department of Economics, Indiana University Joon Y. Park Department of Economics, Indiana University
More informationPartitioned Covariance Matrices and Partial Correlations. Proposition 1 Let the (p + q) (p + q) covariance matrix C > 0 be partitioned as C = C11 C 12
Partitioned Covariance Matrices and Partial Correlations Proposition 1 Let the (p + q (p + q covariance matrix C > 0 be partitioned as ( C11 C C = 12 C 21 C 22 Then the symmetric matrix C > 0 has the following
More informationDESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA
Statistica Sinica 18(2008), 515-534 DESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA Kani Chen 1, Jianqing Fan 2 and Zhezhen Jin 3 1 Hong Kong University of Science and Technology,
More informationLOCAL POLYNOMIAL AND PENALIZED TRIGONOMETRIC SERIES REGRESSION
Statistica Sinica 24 (2014), 1215-1238 doi:http://dx.doi.org/10.5705/ss.2012.040 LOCAL POLYNOMIAL AND PENALIZED TRIGONOMETRIC SERIES REGRESSION Li-Shan Huang and Kung-Sik Chan National Tsing Hua University
More informationA nonparametric method of multi-step ahead forecasting in diffusion processes
A nonparametric method of multi-step ahead forecasting in diffusion processes Mariko Yamamura a, Isao Shoji b a School of Pharmacy, Kitasato University, Minato-ku, Tokyo, 108-8641, Japan. b Graduate School
More informationRegularization Methods for Additive Models
Regularization Methods for Additive Models Marta Avalos, Yves Grandvalet, and Christophe Ambroise HEUDIASYC Laboratory UMR CNRS 6599 Compiègne University of Technology BP 20529 / 60205 Compiègne, France
More information18.S096 Problem Set 3 Fall 2013 Regression Analysis Due Date: 10/8/2013
18.S096 Problem Set 3 Fall 013 Regression Analysis Due Date: 10/8/013 he Projection( Hat ) Matrix and Case Influence/Leverage Recall the setup for a linear regression model y = Xβ + ɛ where y and ɛ are
More informationCh 2: Simple Linear Regression
Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component
More informationNeed for Several Predictor Variables
Multiple regression One of the most widely used tools in statistical analysis Matrix expressions for multiple regression are the same as for simple linear regression Need for Several Predictor Variables
More informationA Bootstrap Test for Conditional Symmetry
ANNALS OF ECONOMICS AND FINANCE 6, 51 61 005) A Bootstrap Test for Conditional Symmetry Liangjun Su Guanghua School of Management, Peking University E-mail: lsu@gsm.pku.edu.cn and Sainan Jin Guanghua School
More information1 Appendix A: Matrix Algebra
Appendix A: Matrix Algebra. Definitions Matrix A =[ ]=[A] Symmetric matrix: = for all and Diagonal matrix: 6=0if = but =0if 6= Scalar matrix: the diagonal matrix of = Identity matrix: the scalar matrix
More information