ADDITIVE COEFFICIENT MODELING VIA POLYNOMIAL SPLINE

Lan Xue and Lijian Yang

Michigan State University

Abstract: A flexible nonparametric regression model is considered in which the response depends linearly on some covariates, with regression coefficients as additive functions of other covariates. Polynomial spline estimators are proposed for the unknown coefficient functions, with optimal univariate mean square convergence rate under a geometric mixing condition. A consistent model selection method is also proposed, based on a nonparametric Bayes Information Criterion (BIC). Simulated and real data examples demonstrate that the polynomial spline estimators are computationally efficient and as accurate as the existing local polynomial estimators.

Short Running Title: Additive Coefficient Model

Key words and phrases: AIC, approximation space, BIC, German real GNP, knot, mean square convergence, spline approximation.

1. Introduction

Parametric regression models are based on the assumption that the regression function follows a pre-determined parametric form with finitely many unknown parameters. A wrong model can lead to excessive estimation biases and erroneous inferences. In contrast, nonparametric models impose less stringent assumptions on the regression function. General nonparametric models, however, need large sample sizes to obtain reasonable estimators when the predictors are high dimensional. Much effort has been made to alleviate the curse of dimensionality by imposing appropriate structure on the regression function. Of special importance is the varying coefficient model (Hastie & Tibshirani, 1993), whose regression function depends linearly on some regressors, with coefficients that are smooth functions of other predictor variables, called tuning variables. A special type of varying coefficient model, in which all tuning variables are the same univariate variable, is called the functional coefficient model by Chen & Tsay (1993b). It was studied in the time series context by Cai, Fan & Yao (2000) and Huang & Shen (2004). Xue & Yang (2005a) extended the functional coefficient model to the case of a multivariate tuning variable, with an additive structure on the regression coefficients to avoid the

curse of dimensionality. The regression function of the new additive coefficient model is

m(X, T) = \sum_{l=1}^{d_1} α_l(X) T_l,  α_l(X) = α_{l0} + \sum_{s=1}^{d_2} α_{ls}(X_s),  (1.1)

in which the predictor vector (X, T) ∈ R^{d_2} × R^{d_1}, with X = (X_1, ..., X_{d_2})^T, T = (T_1, ..., T_{d_1})^T. This additive coefficient model includes as special cases the varying/functional coefficient models, as well as the additive model (Chen & Tsay, 1993a; Hastie & Tibshirani, 1990) and the linear regression model; see Xue & Yang (2005a), which obtained the asymptotic distributions of local polynomial marginal integration estimators of the unknown coefficients. The parameters {α_{l0}}_{l=1}^{d_1} are estimated at the parametric rate 1/\sqrt{n}, and the nonparametric functions {α_{ls}(x_s)}_{l=1,s=1}^{d_1,d_2} are estimated at the univariate smoothing rate.

Due to the integration step and its local nature, the kernel method of Xue & Yang (2005a) is computationally expensive. Based on a sample of size n, to estimate the coefficient functions {α_l(x)}_{l=1}^{d_1} in (1.1) at any fixed point x, a total of (d_2 + 1) × n least squares estimations have to be carried out. So the computational burden increases dramatically as the sample size n and the dimension d_2 of the tuning variables increase. In this paper, we propose a faster polynomial spline estimator for model (1.1). In contrast to the local polynomial, the polynomial spline is a global smoothing method: one solves only a single least squares problem to estimate all the components of the coefficient functions, regardless of the sample size n and the dimension d_2 of the tuning variable, so the computation is substantially reduced. As an attractive alternative to local polynomials, polynomial splines have been used to estimate various models, for example, the additive model (Stone, 1985), the functional ANOVA model (Huang, 1998a, b), the varying coefficient model (Huang, Wu & Zhou, 2002), and the additive model for weakly dependent data (Huang & Yang, 2004), with asymptotic results. We give a complete proof of the polynomial spline estimators' rate of convergence under a geometric mixing condition. A major innovation is the use of an approximation space with possibly unbounded basis. Huang, Wu & Zhou (2002), for instance, imposed the assumption that T = (T_1, ..., T_{d_1})^T in (1.1) has a compactly supported distribution in order to make their basis bounded. Our method, in contrast, imposes only mild moment conditions on T.

The paper is organized as follows. Section 2 discusses the identification issue for model (1.1). Section 3 presents the polynomial spline estimators, their L_2 consistency and a model selection procedure based on the Bayes Information Criterion (BIC). These estimation and model selection procedures adapt automatically to the varying coefficient model (Hastie & Tibshirani, 1993), the functional coefficient model (Chen & Tsay, 1993b), the additive model (Hastie & Tibshirani, 1990; Chen & Tsay, 1993a), and the linear regression model, a feature not shared by any kernel type estimator. Section 4 applies the methods to simulated and empirical examples. Technical assumptions and proofs are given in the Appendix.

2. The Model

Let {(Y_i, X_i, T_i)}_{i=1}^{n} be a sequence of strictly stationary observations, with univariate response Y_i and d_2- and d_1-variate predictors X_i and T_i. With unknown conditional mean and variance functions

m(X_i, T_i) = E(Y_i | X_i, T_i), σ²(X_i, T_i) = var(Y_i | X_i, T_i), the observations satisfy

Y_i = m(X_i, T_i) + σ(X_i, T_i) ε_i.  (2.1)

The errors {ε_i}_{i=1}^{n} are i.i.d. with E(ε_i | X_i, T_i) = 0, E(ε_i² | X_i, T_i) = 1, and ε_i independent of the σ-field F_i = σ{(X_j, T_j), j ≤ i}, for i = 1, ..., n. So the variables (X_i, T_i) can consist of either exogenous variables or lagged values of Y_i. For the additive coefficient model, the regression function m takes the form in (1.1) and satisfies the identification conditions

E{α_{ls}(X_{is})} = 0, 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2.  (2.2)

The conditional variance function σ²(x, t) is assumed to be continuous and bounded. As in most work on nonparametric smoothing, estimation of the functions {α_{ls}(x_s)}_{l=1,s=1}^{d_1,d_2} is conducted on compact sets. Without loss of generality, let the compact set be χ = [0, 1]^{d_2}.

Following Stone (1985), p.693, the space of s-centered square integrable functions on [0, 1] is

H_s^0 = {α : E{α(X_s)} = 0, E{α²(X_s)} < +∞}, 1 ≤ s ≤ d_2.

Next define the model space M, a collection of functions on χ × R^{d_1}, as

M = {m(x, t) = \sum_{l=1}^{d_1} α_l(x) t_l ; α_l(x) = α_{l0} + \sum_{s=1}^{d_2} α_{ls}(x_s); α_{ls} ∈ H_s^0},

in which the {α_{l0}}_{l=1}^{d_1} are finite constants. The constraints E{α_{ls}(X_s)} = 0, 1 ≤ s ≤ d_2, ensure a unique additive representation of α_l, but are not necessary for the definition of the space M. In what follows, denote by E_n the empirical expectation, E_n φ = \sum_{i=1}^{n} φ(X_i, T_i)/n. We introduce two inner products on M. For functions m_1, m_2 ∈ M, the theoretical and empirical inner products are defined respectively as

⟨m_1, m_2⟩ = E{m_1(X, T) m_2(X, T)}, ⟨m_1, m_2⟩_n = E_n{m_1(X, T) m_2(X, T)}.

The corresponding induced norms are ‖m_1‖² = E m_1²(X, T) and ‖m_1‖²_n = E_n m_1²(X, T). The model space M is called theoretically (empirically) identifiable if, for any m ∈ M, ‖m‖ = 0 (‖m‖_n = 0) implies that m = 0 a.s.

Lemma 1. Under Assumptions (C1) and (C2) in the Appendix, there exists a constant C > 0 such that

‖m‖² ≥ C \sum_{l=1}^{d_1} {α_{l0}² + \sum_{s=1}^{d_2} ‖α_{ls}‖²}, m = \sum_{l=1}^{d_1} (α_{l0} + \sum_{s=1}^{d_2} α_{ls}) t_l ∈ M.

Hence for any m ∈ M, ‖m‖ = 0 implies α_{l0} = 0 and α_{ls} = 0 a.s., for all 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2. Consequently the model space M is theoretically identifiable.

Proof. Let A_l(X) = α_{l0} + \sum_{s=1}^{d_2} α_{ls}(X_s) and A(X) = (A_1(X), ..., A_{d_1}(X))^T. Under Assumption (C2),

‖m‖² = E[\sum_{l=1}^{d_1} {α_{l0} + \sum_{s=1}^{d_2} α_{ls}(X_s)} T_l]² = E[A(X)^T TT^T A(X)] ≥ c_3 E[A(X)^T A(X)] = c_3 \sum_{l=1}^{d_1} E{α_{l0} + \sum_{s=1}^{d_2} α_{ls}(X_s)}²,

which, by (2.2), is c_3 [\sum_{l=1}^{d_1} α_{l0}² + \sum_{l=1}^{d_1} E{\sum_{s=1}^{d_2} α_{ls}(X_s)}²]. Applying Lemma 1 of Stone (1985),

‖m‖² ≥ c_3 [\sum_{l=1}^{d_1} α_{l0}² + {(1 − δ)/2}^{d_2−1} \sum_{l=1}^{d_1} \sum_{s=1}^{d_2} E α_{ls}²(X_s)],

where δ = (1 − c_1/c_2)^{1/2}, with 0 < c_1 ≤ c_2 as specified in Assumption (C1). Taking C = c_3 {(1 − δ)/2}^{d_2−1}, the first part is proved. To show identifiability, notice that for any m = \sum_{l=1}^{d_1} (α_{l0} + \sum_{s=1}^{d_2} α_{ls}) t_l ∈ M with ‖m‖ = 0, we have

0 = E[\sum_{l=1}^{d_1} {α_{l0} + \sum_{s=1}^{d_2} α_{ls}(X_s)} T_l]² ≥ C [\sum_{l=1}^{d_1} α_{l0}² + \sum_{l=1}^{d_1} \sum_{s=1}^{d_2} E{α_{ls}²(X_s)}],

which entails that α_{l0} = 0 and α_{ls}(X_s) = 0 a.s. for all 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2, that is, m = 0 a.s.

3. Polynomial Spline Estimation

3.1 The estimators

In this paper, denote by C^{(p)}[0,1] the space of p-times continuously differentiable functions. For each tuning variable direction s = 1, ..., d_2, a knot sequence k_{s,n} with N_n interior knots is introduced:

k_{s,n} = {0 = x_{s,0} < x_{s,1} < ··· < x_{s,N_n} < x_{s,N_n+1} = 1}.

For an integer p ≥ 1, define φ_s = φ^{(p)}([0,1], k_{s,n}), a subspace of C^{(p−1)}[0,1] consisting of functions that are polynomials of degree p (or less) on the intervals [x_{s,i}, x_{s,i+1}), i = 0, ..., N_n − 1, and [x_{s,N_n}, x_{s,N_n+1}]. Functions in φ_s are called polynomial splines; they are piecewise polynomials connected smoothly at the interior knots, so a polynomial spline of degree p = 1 is a continuous piecewise linear function, etc. The space φ_s is determined by the polynomial degree p and the knot sequence k_{s,n}. Let h_s = h_{s,n} = max_{i=0,...,N_n} (x_{s,i+1} − x_{s,i}), called the mesh size of k_{s,n}, and define h = max_{s=1,...,d_2} h_s, the overall smoothness measure.

Lemma 2. For 1 ≤ s ≤ d_2, define φ_s^0 = {g_s : g_s ∈ φ_s, E g_s(X_s) = 0}, the space of centered polynomial splines. There exists a constant c > 0 such that, for any α_s ∈ H_s^0 ∩ C^{(p+1)}[0,1], there exists a g_s ∈ φ_s^0 with ‖α_s − g_s‖_∞ ≤ c h^{p+1} ‖α_s^{(p+1)}‖_∞.

Proof. According to de Boor (2001), p.149, there exist a constant c > 0 and a spline function g_s* ∈ φ_s such that ‖α_s − g_s*‖_∞ ≤ c h^{p+1} ‖α_s^{(p+1)}‖_∞. Note next that |E g_s*| ≤ |E(g_s* − α_s)| + |E α_s| ≤ ‖g_s* − α_s‖_∞. Thus, for g_s = g_s* − E g_s* ∈ φ_s^0, one has ‖α_s − g_s‖_∞ ≤ ‖α_s − g_s*‖_∞ + |E g_s*| ≤ 2c h^{p+1} ‖α_s^{(p+1)}‖_∞.

Lemma 2 entails that if the functions {α_{ls}(x_s)}_{l=1,s=1}^{d_1,d_2} in (1.1) are smooth, they are approximated well by centered splines {g_{ls}(x_s) ∈ φ_s^0}_{l=1,s=1}^{d_1,d_2}. As the definition of φ_s^0 depends on the

unknown distribution of X_s, the empirically defined space φ_s^{0,n} = {g_s : g_s ∈ φ_s, E_n(g_s) = 0} is used. Intuitively, a function m ∈ M is approximated by some function from the approximation space

M_n = {m_n(x, t) = \sum_{l=1}^{d_1} g_l(x) t_l ; g_l(x) = α_{l0} + \sum_{s=1}^{d_2} g_{ls}(x_s); g_{ls} ∈ φ_s^{0,n}}.

Given observations {(Y_i, X_i, T_i)}_{i=1}^{n} from model (2.1), the estimator of the unknown regression function m is defined as its best approximation from M_n,

ˆm = argmin_{m_n ∈ M_n} \sum_{i=1}^{n} {Y_i − m_n(X_i, T_i)}².  (3.1)

To be precise, write J_n = N_n + p, and let {w_{s,0}, w_{s,1}, ..., w_{s,J_n}} be a basis of the spline space φ_s, for 1 ≤ s ≤ d_2. For example, the truncated power basis is used in the implementation:

{1, x_s, ..., x_s^p, (x_s − x_{s,1})_+^p, ..., (x_s − x_{s,N_n})_+^p},

in which (x)_+^p = (x_+)^p, with x_+ = max(x, 0). Let w = {1, w_{1,1}, ..., w_{1,J_n}, ..., w_{d_2,1}, ..., w_{d_2,J_n}}; then {w t_1, ..., w t_{d_1}} is an R_n = d_1(d_2 J_n + 1)-dimensional basis of M_n, and (3.1) amounts to

ˆm(x, t) = \sum_{l=1}^{d_1} {ĉ_{l0} + \sum_{s=1}^{d_2} \sum_{j=1}^{J_n} ĉ_{ls,j} w_{s,j}(x_s)} t_l,  (3.2)

in which the coefficients {ĉ_{l0}, ĉ_{ls,j}, 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2, 1 ≤ j ≤ J_n} minimize the sum of squares

\sum_{i=1}^{n} [Y_i − \sum_{l=1}^{d_1} {c_{l0} + \sum_{s=1}^{d_2} \sum_{j=1}^{J_n} c_{ls,j} w_{s,j}(X_{is})} T_{il}]²  (3.3)

with respect to {c_{l0}, c_{ls,j}, 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2, 1 ≤ j ≤ J_n}. Note that Lemma A.5 entails that, with probability approaching one, the sum of squares in (3.3) has a unique minimizer. For 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2, denote α*_{ls}(x_s) = \sum_{j=1}^{J_n} ĉ_{ls,j} w_{s,j}(x_s). Then the estimators of {α_{l0}}_{l=1}^{d_1} and {α_{ls}(x_s)}_{l=1,s=1}^{d_1,d_2} in (1.1) are given as

ˆα_{l0} = ĉ_{l0} + \sum_{s=1}^{d_2} E_n α*_{ls}, 1 ≤ l ≤ d_1; ˆα_{ls}(x_s) = α*_{ls}(x_s) − E_n α*_{ls}, 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2,  (3.4)

where the {ˆα_{ls}(x_s)}_{l=1,s=1}^{d_1,d_2} are empirically centered so as to consistently estimate the theoretically centered function components in (1.1). These estimators are determined by the knot sequences {k_{s,n}}_{s=1}^{d_2} and the polynomial degree p, which relates to the smoothness of the regression function. We will refer to an estimator by its degree p; for example, a linear spline fit corresponds to p = 1.

Theorem 1. Under Assumptions (C1)-(C5) in the Appendix, if α_{ls} ∈ C^{(p+1)}[0,1] for 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2, one has

‖ˆm − m‖ = O_p(h^{p+1} + \sqrt{1/(nh)}), max_{1≤l≤d_1} |ˆα_{l0} − α_{l0}| + max_{1≤l≤d_1, 1≤s≤d_2} ‖ˆα_{ls} − α_{ls}‖ = O_p(h^{p+1} + \sqrt{1/(nh)}).
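To make the single least squares fit of (3.1)-(3.4) concrete, the following is a minimal sketch in Python/NumPy; it is not the authors' implementation, and the function names, knot placement and interface are illustrative assumptions. It builds the truncated power basis, multiplies each spline column by the corresponding regressor T_l as in (3.3), solves one least squares problem for all components at once, and recenters the fitted components as in (3.4).

```python
import numpy as np

def trunc_power_basis(x, knots, p=1):
    # w_{s,j}(x): x, ..., x^p, (x - knot_1)_+^p, ..., (x - knot_Nn)_+^p
    cols = [x ** q for q in range(1, p + 1)]
    cols += [np.maximum(x - k, 0.0) ** p for k in knots]
    return np.column_stack(cols)              # n x J_n, with J_n = N_n + p

def fit_additive_coefficient(Y, X, T, knots_list, p=1):
    n, d2 = X.shape
    d1 = T.shape[1]
    bases = [trunc_power_basis(X[:, s], knots_list[s], p) for s in range(d2)]
    blocks = []
    for l in range(d1):
        cols = [T[:, [l]]]                    # column for the constant c_{l0} T_l
        for s in range(d2):
            cols.append(bases[s] * T[:, [l]]) # columns for alpha_{ls}(X_s) T_l
        blocks.append(np.hstack(cols))
    Z = np.hstack(blocks)                     # one design matrix, one solve (3.3)
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)

    # unpack, then recenter each fitted component empirically as in (3.4)
    J = [b.shape[1] for b in bases]
    per_l = 1 + sum(J)
    alpha0_hat, alpha_hat = [], []
    for l in range(d1):
        c = coef[l * per_l:(l + 1) * per_l]
        pos, means, comps = 1, [], []
        for s in range(d2):
            raw = bases[s] @ c[pos:pos + J[s]]   # alpha*_{ls} at the sample points
            pos += J[s]
            means.append(raw.mean())             # E_n alpha*_{ls}
            comps.append(raw - raw.mean())       # hat alpha_{ls}, centered
        alpha0_hat.append(c[0] + sum(means))     # hat alpha_{l0} of (3.4)
        alpha_hat.append(comps)
    return alpha0_hat, alpha_hat
```

The point of the sketch is the single call to `lstsq`: unlike the marginal integration method, no refitting per evaluation point is needed.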

Theorem 1 entails that the optimal order of h is n^{−1/(2p+3)}, in which case ‖ˆα_{ls} − α_{ls}‖ = O_p(n^{−(p+1)/(2p+3)}), which is the same rate for the mean square errors as that of the marginal integration estimators in Xue & Yang (2005a).

3.2 Knot number selection

An appropriate selection of the knot sequence is important for an efficient implementation of the proposed polynomial spline estimation method. Stone (1986) found that the number of knots is more crucial than their locations. Thus we discuss an approach to selecting the number of interior knots N_n using the AIC criterion. For the knot locations, we use either equally spaced knots (the same distance between any two adjacent knots) or quantile knots (sample quantiles, with the same number of observations between any two adjacent knots). According to Theorem 1, the optimal order of N_n is n^{1/(2p+3)}. Thus we propose to select the optimal N_n, denoted ˆN_n^{opt}, from the set of integers in [0.5N_r, min(5N_r, Tb)], with N_r = n^{1/(2p+3)} and Tb = {n/(4d_1) − 1}/d_2, which ensures that the total number of parameters in the least squares estimation is less than n/4. To be specific, we denote the estimator of the i-th response Y_i by Ŷ_i(N_n) = ˆm(X_i, T_i), for i = 1, ..., n; here ˆm depends on the knot sequence, as given in (3.2). Let q_n = (1 + d_2 N_n) d_1 be the total number of parameters in the least squares problem (3.3). Then ˆN_n^{opt} is the one minimizing the AIC value,

ˆN_n^{opt} = argmin_{N_n ∈ [0.5N_r, min(5N_r, Tb)]} AIC(N_n),  (3.5)

where AIC(N_n) = log(MSE) + 2q_n/n with MSE = \sum_{i=1}^{n} {Y_i − Ŷ_i(N_n)}²/n.

3.3 Model selection

For the full model (1.1), a natural question to ask is whether the functions {α_{ls}(x_s)}_{l=1,s=1}^{d_1,d_2} are all significant. A simpler model, obtained by setting some of the {α_{ls}(x_s)}_{l=1,s=1}^{d_1,d_2} to zero, may perform as well as the full model. For 1 ≤ l ≤ d_1, let S_l denote the set of indices of the tuning variables that are significant in the coefficient function of T_l, and let S be the collection of indices from all the sets S_l. The set S is called the model indices. In particular, the model indices of the full model are S_f = {S_{f1}, ..., S_{fd_1}}, where S_{fl} = {1, ..., d_2}, 1 ≤ l ≤ d_1. For two indices S = {S_1, ..., S_{d_1}} and S′ = {S′_1, ..., S′_{d_1}}, we say that S ⊂ S′ if and only if S_l ⊆ S′_l for all 1 ≤ l ≤ d_1 and S_l ≠ S′_l for some l. The goal is to select the smallest sub-model, with indices S ⊆ S_f, which gives the same information as the full additive coefficient model. Following Huang & Yang (2004), the Akaike Information Criterion (AIC) and the Bayes Information Criterion (BIC) are considered. For a submodel m_S with indices S = {S_1, ..., S_{d_1}}, let N_{n,S} be the number of interior knots used to estimate the model m_S and J_{n,S} = N_{n,S} + p. As in the full model estimation, let {ĉ_{l0}, ĉ_{ls,j}, 1 ≤ l ≤ d_1, s ∈ S_l, 1 ≤ j ≤ J_{n,S}} be the minimizer of the sum of squares

\sum_{i=1}^{n} [Y_i − \sum_{l=1}^{d_1} {c_{l0} + \sum_{s∈S_l} \sum_{j=1}^{J_{n,S}} c_{ls,j} w_{s,j}(X_{is})} T_{il}]².  (3.6)
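Before completing the submodel estimator below, here is a compact sketch of the two selection devices of this section: the AIC search (3.5) over the number of interior knots, and the BIC score used for submodel comparison. The refitting interface `fit_mse` is a hypothetical placeholder for a routine that solves (3.3) with a given number of knots and reports the mean squared residual; the parameter count follows the q_n formula above.

```python
import numpy as np

def aic_knot_search(fit_mse, n, d1, d2, p=1):
    """Knot-number search (3.5): scan N_n over the integers in
    [0.5 N_r, min(5 N_r, Tb)]. `fit_mse(N)` is assumed to refit (3.3)
    with N interior knots per direction and return the mean squared
    residual."""
    N_r = n ** (1.0 / (2 * p + 3))
    Tb = (n / (4.0 * d1) - 1.0) / d2      # keeps the parameter count below n/4
    lo, hi = max(1, round(0.5 * N_r)), int(min(5 * N_r, Tb))

    def aic(N):
        q_n = (1 + d2 * N) * d1           # parameter count of (3.3)
        return np.log(fit_mse(N)) + 2.0 * q_n / n

    return min(range(lo, hi + 1), key=aic)

def bic_score(mse_S, q_S, n):
    # BIC_S = log(MSE_S) + log(n) q_S / n, as defined in Section 3.3
    return np.log(mse_S) + np.log(n) * q_S / n
```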

Define

ˆm_S(x, t) = \sum_{l=1}^{d_1} {ĉ_{l0} + \sum_{s∈S_l} \sum_{j=1}^{J_{n,S}} ĉ_{ls,j} w_{s,j}(x_s)} t_l.  (3.7)

Denote Ŷ_{i,S} = ˆm_S(X_i, T_i), i = 1, ..., n, MSE_S = \sum_{i=1}^{n} (Y_i − Ŷ_{i,S})²/n, and q_S = \sum_{l=1}^{d_1} {1 + #(S_l) J_{n,S}}, the total number of parameters in (3.6). Then the submodel with the smallest AIC (or BIC) value is selected, where

AIC_S = log(MSE_S) + 2q_S/n, BIC_S = log(MSE_S) + log(n) q_S/n.

Let S_0 and Ŝ be the index sets of the true model and the selected model, respectively. The outcome is defined as correct fitting if Ŝ = S_0; overfitting if S_0 ⊂ Ŝ; and underfitting if S_0 ⊄ Ŝ, that is, S_{0l} ⊄ Ŝ_l for some l. For either overfitting or underfitting, we write Ŝ ≠ S_0.

Theorem 2. Under the same conditions as in Theorem 1, and with N_{n,S} ≡ N_{n,S_0} ≍ n^{1/(2p+3)}, the BIC is consistent: for any S ≠ S_0, lim_{n→∞} P(BIC_S > BIC_{S_0}) = 1, hence lim_{n→∞} P(Ŝ = S_0) = 1.

The condition that N_{n,S} ≡ N_{n,S_0} is essential for the BIC to be consistent. As a referee pointed out, the number of parameters q_S depends on both the number of knots and the number of additive terms used in the model function. To ensure BIC consistency, roughly the same sufficient number of knots should be used to estimate the various models, so that q_S depends only on the number of function terms. In the implementation, we have used the same number of interior knots, ˆN_n^{opt} (see (3.5), the optimal knot number for the full additive coefficient model), in the estimation of all the submodels.

4. Examples

In this section, we first analyze two simulated data sets, with an i.i.d. and a time series set-up respectively. Both data sets have sample sizes n = 100, 250 and 500, with 100 replications. The proposed methods are then successfully applied to an empirical example: the West German real GNP. The performance of the function estimators is assessed by the averaged integrated squared error (AISE). Denoting the estimator of α_{ls} in the i-th replication as ˆα_{i,ls}, and {x_m}_{m=1}^{n_grid} the grid points where the functions are evaluated, we define

ISE(ˆα_{i,ls}) = (1/n_grid) \sum_{m=1}^{n_grid} {ˆα_{i,ls}(x_m) − α_{ls}(x_m)}² and AISE(ˆα_{ls}) = (1/100) \sum_{i=1}^{100} ISE(ˆα_{i,ls}).

4.1 Simulated example 1

The data are generated from the model

Y = {c_1 + α_{11}(X_1) + α_{12}(X_2)} T_1 + {c_2 + α_{21}(X_1) + α_{22}(X_2)} T_2 + ε,

with c_1 = 2, c_2 = 1 and

α_{11}(x) = sin(4x − 2) + exp{−2(x − 0.5)²}, α_{12}(x) = x, α_{21}(x) = sin(2x), α_{22}(x) = 0.

The vector X = (X_1, X_2)^T is uniformly distributed on [−π, π]², independent of the standard bivariate normal T = (T_1, T_2)^T. The error ε is a standard normal variable independent of (X, T).
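For reference, a minimal sketch of this data-generating process, assuming the functional forms displayed above; the output (Y, X, T) could be fed to a fitting routine such as the sketch following Theorem 1.

```python
import numpy as np

def generate_example1(n, rng):
    # X uniform on [-pi, pi]^2, T standard bivariate normal, eps standard normal
    X = rng.uniform(-np.pi, np.pi, size=(n, 2))
    T = rng.standard_normal((n, 2))
    a11 = np.sin(4.0 * X[:, 0] - 2.0) + np.exp(-2.0 * (X[:, 0] - 0.5) ** 2)
    a12 = X[:, 1]
    a21 = np.sin(2.0 * X[:, 0])
    a22 = np.zeros(n)
    Y = (2.0 + a11 + a12) * T[:, 0] + (1.0 + a21 + a22) * T[:, 1] \
        + rng.standard_normal(n)
    return Y, X, T

Y, X, T = generate_example1(250, np.random.default_rng(1))
```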

The functions are estimated by linear spline (p = 1), cubic spline (p = 3) and the marginal integration method of Xue & Yang (2005a). For s = 1, 2, let x^i_{s,min}, x^i_{s,max} denote the smallest and largest observations of the variable x_s in the i-th replication. Knots are placed evenly on the interval [x^i_{s,min}, x^i_{s,max}], with the number of interior knots N_n selected by AIC as in subsection 3.2. The functions {α_{ls}}_{l,s=1,2} are estimated on a grid of equally-spaced points x_m, m = 1, ..., n_grid, with x_1 = −0.975π, x_{n_grid} = 0.975π and n_grid = 26.

Table 1 reports the means and standard errors (in parentheses) of {ĉ_l}_{l=1,2} and the AISEs of {ˆα_{ls}}_{l,s=1,2} for all three fits. The two spline fits (p = 1, 3) are generally comparable, with the cubic fit (p = 3) better than the linear fit (p = 1) for the larger sample sizes (n = 250, 500); the standard errors of the constant estimators and the AISEs of the function estimators decrease as the sample size increases, confirming Theorem 1. The polynomial spline methods also perform better than the marginal integration method. Figure 1 gives the plots of the 100 cubic spline fits for all sample sizes, clearly illustrating the improvement in estimation as the sample size increases. Plots d1-d4, of the typical estimated curves (those whose ISE is the median of the 100 ISEs from the replications), seem satisfactory for a sample size as small as 100. As mentioned earlier, the polynomial spline method enjoys great computational efficiency: it takes the polynomial spline less than 20 seconds to run 100 simulations on a Pentium 4 PC, regardless of the sample size, whereas it takes marginal integration about 2 hours to run 100 simulations with n = 100, and about 20 hours with n = 500.

For each replication, model selection is also conducted according to the criteria proposed in subsection 3.3, for polynomial splines with p = 1, 2, 3. The model selection results are presented in Table 2, indexed as Example 1. The BIC is rather accurate: more than 86% correct selection when the sample size is as small as 100, and entirely correct selection when the sample size increases to 250 and 500, corroborating the consistency of the BIC asserted in Theorem 2. AIC clearly tends to over-fit and never under-fits.

(Insert Table 1 about here)

(Insert Table 2 about here)

(Insert Figure 1 about here)

4.2 Simulated example 2

The data are generated from the nonlinear AR model

Y_t = {c_1 + α_{11}(Y_{t−1}) + α_{12}(Y_{t−2})} Y_{t−3} + {c_2 + α_{21}(Y_{t−1}) + α_{22}(Y_{t−2})} Y_{t−4} + ε_t,

with i.i.d. standard normal noise ε_t, c_1 = 0.2, c_2 = 0.3 and

α_{11}(u) = (2u) exp(−4u²), α_{12}(u) = 0.3/{1 + (2u − 1)⁴}, α_{21}(u) = 0, α_{22}(u) = (2u) exp(−4u²).

In each replication, a total of 2n observations are generated and only the last n observations are used, to ensure approximate stationarity. In this example, we have used linear splines (p = 1) on quantile knot sequences. The coefficient functions {α_{ls}}_{l,s=1,2} are estimated on a

grid of equally-spaced points on the interval [−1, 1], with the number of grid points n_grid = 41. Table 3 contains the means and standard errors of {ĉ_l}_{l=1,2} and the AISEs of {ˆα_{ls}}_{l,s=1,2}. The results are also graphically presented in Figure 2. Similar to Example 1, the estimation improves as the sample size increases, supporting the asymptotic result (Theorem 1). The model selection results are presented in Table 2, indexed as Example 2. As in Example 1, AIC tends to overfit compared with BIC, and for n = 250, 500 the model selection result is satisfactory.

(Insert Table 3 about here)

(Insert Figure 2 about here)

4.3 West German real GNP

We now compare the estimation and prediction performance of the polynomial spline and marginal integration methods on the West German real GNP. For other empirical examples, see Xue & Yang (2005b). The data consist of the quarterly West German real GNP from January 1960 to December 1990, denoted {G_t}_{t=1}^{124}, where G_t is the real GNP in the t-th quarter (the first quarter being from January 1, 1960 to April 1, 1960). From its time plot (Figure 3), {G_t}_{t=1}^{124} appears to have both trend and seasonality. After removing the seasonal means from {log(G_{t+4}/G_{t+3})}_{t=1}^{120}, we obtain a more stationary time series denoted {Y_t}_{t=1}^{120}, whose time plot is given in Figure 4. As the nonparametric alternative to the linear autoregressive model selected by BIC,

Y_t = a_1 Y_{t−2} + a_2 Y_{t−4} + σε_t,  (4.1)

Xue & Yang (2005a) proposed the additive coefficient model

Y_t = {c_1 + α_{11}(Y_{t−1}) + α_{12}(Y_{t−8})} Y_{t−2} + {c_2 + α_{21}(Y_{t−1}) + α_{22}(Y_{t−8})} Y_{t−4} + σε_t,  (4.2)

which is fitted by linear and cubic splines (p = 1, 3). Following Xue & Yang (2005a), we use the first 110 observations for estimation and the last 10 observations for one-step prediction. Table 4 gives the averaged squared estimation errors (ASE) and averaged squared prediction errors (ASPE) of the different fits. The polynomial spline is better than the marginal integration method overall, while model (4.2) significantly improves over model (4.1) in both estimation and prediction. For visualization, plots of the function estimates are given in Figure 5.

(Insert Table 4 about here)

(Insert Figure 3 about here)

(Insert Figure 4 about here)

(Insert Figure 5 about here)

5. Conclusion

A polynomial spline estimation method, together with a BIC model selection procedure, has been proposed for the semiparametric additive coefficient model. Consistency has been established under broad geometric mixing assumptions. The procedures apply to nonlinear regression with weakly dependent as well as i.i.d. observations. Implementation of the proposed method is as easy and

fast as linear regression, with excellent performance, as Theorem 1 stipulates. The empirical study illustrates that these methods are very useful for statistical inference in the multivariate regression setting.

Acknowledgements

The paper is substantially improved thanks to the helpful comments of Editor Jane-Ling Wang, an associate editor and the referees. Research has been supported in part by National Science Foundation grants BCS, DMS and SES.

References

Bosq, D. (1998). Nonparametric Statistics for Stochastic Processes: Estimation and Prediction, 2nd Ed. Springer-Verlag, New York.

Cai, Z., Fan, J. and Yao, Q. W. (2000). Functional-coefficient regression models for nonlinear time series. J. Amer. Statist. Assoc. 95, 941-956.

Chen, R. and Tsay, R. S. (1993a). Nonlinear additive ARX models. J. Amer. Statist. Assoc. 88, 955-967.

Chen, R. and Tsay, R. S. (1993b). Functional-coefficient autoregressive models. J. Amer. Statist. Assoc. 88, 298-308.

de Boor, C. (2001). A Practical Guide to Splines. Springer, New York.

DeVore, R. A. and Lorentz, G. G. (1993). Constructive Approximation. Springer-Verlag, Berlin Heidelberg.

Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall, London.

Hastie, T. J. and Tibshirani, R. J. (1993). Varying-coefficient models. J. Roy. Statist. Soc. Ser. B 55, 757-796.

Huang, J. Z. (1998a). Projection estimation in multiple regression with application to functional ANOVA models. Ann. Statist. 26, 242-272.

Huang, J. Z. (1998b). Functional ANOVA models for generalized regression. J. Multivariate Anal. 67, 49-71.

Huang, J. Z. and Shen, H. (2004). Functional coefficient regression models for non-linear time series: a polynomial spline approach. Scand. J. Statist. 31, 515-534.

Huang, J. Z., Wu, C. O. and Zhou, L. (2002). Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika 89, 111-128.

Huang, J. Z. and Yang, L. (2004). Identification of nonlinear additive autoregressive models. J. Roy. Statist. Soc. Ser. B 66, 463-477.

Stone, C. J. (1985). Additive regression and other nonparametric models. Ann. Statist. 13, 689-705.

Xue, L. and Yang, L. (2005a). Estimation of semiparametric additive coefficient model. Journal of Statistical Planning and Inference, in press.

Xue, L. and Yang, L. (2005b). Additive coefficient modeling via polynomial spline, downloadable at yangli/addsplinefull.pdf.

Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA. xuelan@stt.msu.edu

Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA. yang@stt.msu.edu

Appendix

A.1 Assumptions and notations

The following assumptions are needed for our theoretical results.

(C1) The tuning variables X = (X_1, ..., X_{d_2}) are compactly supported and, without loss of generality, we assume that the support is χ = [0, 1]^{d_2}. The joint density of X, denoted f(x), is absolutely continuous and bounded away from zero and infinity; that is, 0 < c_1 ≤ min_{x∈χ} f(x) ≤ max_{x∈χ} f(x) ≤ c_2 < ∞.

(C2) (i) There exist positive constants 0 < c_3 ≤ c_4 such that c_3 I_{d_1} ≤ E(TT^T | X = x) ≤ c_4 I_{d_1} for all x ∈ χ; here I_{d_1} denotes the d_1 × d_1 identity matrix. There also exist positive constants c_5, c_6 such that c_5 ≤ E{|T_l T_{l′}|^{2+δ_0} | X = x} ≤ c_6 a.s., for some δ_0 > 0 and l, l′ = 1, ..., d_1. (ii) For some sufficiently large m > 0, E|T_l|^m < +∞, for l = 1, ..., d_1.

(C3) The d_2 sets of knots, denoted k_{s,n} = {0 = x_{s,0} ≤ x_{s,1} ≤ ··· ≤ x_{s,N_n} ≤ x_{s,N_n+1} = 1}, s = 1, ..., d_2, are quasi-uniform; that is, there exists c_7 > 0 such that

max_{s=1,...,d_2} {max(x_{s,j+1} − x_{s,j}, j = 0, ..., N_n) / min(x_{s,j+1} − x_{s,j}, j = 0, ..., N_n)} ≤ c_7.

The number of interior knots N_n ≍ n^{1/(2p+3)}, where p is the degree of the spline and ≍ means that both sides have the same order. In particular, h ≍ n^{−1/(2p+3)}.

(C4) The vector process {ς_t}_{t=−∞}^{+∞} = {(Y_t, X_t, T_t)}_{t=−∞}^{+∞} is strictly stationary and geometrically strongly mixing; that is, its α-mixing coefficient satisfies α(k) ≤ cρ^k for constants c > 0, 0 < ρ < 1, where α(k) = sup_{A∈σ(ς_t, t≤0), B∈σ(ς_t, t≥k)} |P(A)P(B) − P(AB)|.

(C5) The conditional variance function σ²(x, t) is measurable and bounded.

Assumptions (C1)-(C5) are common in the nonparametric regression literature. Assumption (C1) is the same as Condition 1, p.693 of Stone (1985), and assumption (c), p.468 of Huang & Yang (2004). Assumption (C2) (i) is a direct extension of condition (ii), p.531 of Huang & Shen (2004). Assumption (C2) (ii) is a direct extension of condition (v), p.531 of Huang & Shen (2004), and of the moment condition (A2c), p.952 of Cai, Fan & Yao (2000). Assumption (C3) is the same as in equation (6), p.249 of Huang (1998a), and also p.59 of Huang (1998b). Assumption (C4) is similar to condition (iv), p.531 of Huang & Shen (2004). Assumption (C5) is the same as on p.242 of Huang (1998a) and p.465 of Huang & Yang (2004).

In this Appendix, whenever proofs are brief, see Xue & Yang (2005b) for details.

A.2 Technical lemmas

For notational convenience we introduce, for 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2,

ᾱ_{l0} = ĉ_{l0} + \sum_{s=1}^{d_2} E α*_{ls}, ᾱ_{ls}(x_s) = α*_{ls}(x_s) − E α*_{ls}.  (A.1)

Then one can rewrite ˆm of (3.1) as ˆm = \sum_{l=1}^{d_1} {ᾱ_{l0} + \sum_{s=1}^{d_2} ᾱ_{ls}(x_s)} t_l, with {ᾱ_{ls}(x_s)}_{1≤l≤d_1} ⊂ φ_s^0. The terms in (A.1) are not directly observable and serve only as an intermediate step in the proof of Theorem 1. Observing that, for 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2,

ˆα_{l0} = ᾱ_{l0} + \sum_{s=1}^{d_2} E_n ᾱ_{ls}, ˆα_{ls}(x_s) = ᾱ_{ls}(x_s) − E_n ᾱ_{ls},  (A.2)

the terms {ᾱ_{l0}}_{l=1}^{d_1}, {ᾱ_{ls}(x_s)}_{l=1,s=1}^{d_1,d_2} and {ˆα_{l0}}_{l=1}^{d_1}, {ˆα_{ls}(x_s)}_{l=1,s=1}^{d_1,d_2} differ only by constants. In Section A.3, we first prove the consistency of {ᾱ_{l0}}_{l=1}^{d_1}, {ᾱ_{ls}(x_s)}_{l=1,s=1}^{d_1,d_2} in Theorem 3; Theorem 1 then follows by showing that the terms {E_n ᾱ_{ls}}_{l=1,s=1}^{d_1,d_2} are negligible.

The B-spline basis is used in the proofs; it is equivalent to the truncated power basis used in the implementation, but has nice local properties (de Boor, 2001). With J_n = N_n + p, we denote the B-spline basis of φ_s by b_s = {b_{s,0}, ..., b_{s,J_n}}. For 1 ≤ s ≤ d_2, denote B_s = {B_{s,1}, ..., B_{s,J_n}} with

B_{s,j} = \sqrt{N_n} {b_{s,j} − E(b_{s,j}) b_{s,0} / E(b_{s,0})}, j = 1, ..., J_n.  (A.3)

Note that Assumption (C1) ensures that all the B_{s,j} are well defined. Now let B = (1, B_{1,1}, ..., B_{1,J_n}, ..., B_{d_2,1}, ..., B_{d_2,J_n})^T, with 1 denoting the function identically equal to one on χ. Define G = (B t_1, ..., B t_{d_1})^T = (G_1, ..., G_{R_n})^T, with R_n = d_1(d_2 J_n + 1). Then G is a set of basis functions of M_n. By (3.1), one has ˆm(x, t) = \sum_{l=1}^{d_1} {ĉ_{l0} + \sum_{s=1}^{d_2} \sum_{j=1}^{J_n} ĉ_{ls,j} B_{s,j}(x_s)} t_l, in which {ĉ_{l0}, ĉ_{ls,j}, 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2, 1 ≤ j ≤ J_n} minimize the sum of squares as in (3.3), with w_{s,j} replaced by B_{s,j}. Then Lemma A.1 leads to ᾱ_{l0} = ĉ_{l0}, ᾱ_{ls} = \sum_{j=1}^{J_n} ĉ_{ls,j} B_{s,j}(x_s), 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2.
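The centered basis (A.3) is straightforward to construct numerically. Below is a minimal sketch assuming a recent scipy (BSpline.design_matrix, available from scipy 1.8), with the expectations E(b_{s,j}) replaced by sample means over the observed x; that plug-in, and the clamped knot handling, are illustrative simplifications rather than the exact construction used in the proofs.

```python
import numpy as np
from scipy.interpolate import BSpline   # assumes scipy >= 1.8

def centered_bspline_basis(x, interior_knots, p=1):
    """B_{s,j} of (A.3): sqrt(N_n) * {b_{s,j} - E(b_{s,j}) b_{s,0} / E(b_{s,0})},
    j = 1..J_n, with the expectations approximated by sample means over x
    (assumes x lies in [0, 1] with some points near the left boundary)."""
    Nn = len(interior_knots)
    t = np.r_[[0.0] * (p + 1), interior_knots, [1.0] * (p + 1)]  # clamped knots
    b = BSpline.design_matrix(x, t, p).toarray()  # n x (N_n + p + 1) raw B-splines
    means = b.mean(axis=0)                        # stand-in for E b_{s,j}
    B = b[:, 1:] - np.outer(b[:, 0], means[1:] / means[0])
    return np.sqrt(Nn) * B                        # n x J_n, J_n = N_n + p
```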

Theorem 3. Under Assumptions (C1)-(C5), if α_{ls} ∈ C^{(p+1)}[0,1] for 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2, one has

‖ˆm − m‖ = O_p(h^{p+1} + \sqrt{1/(nh)}), max_{1≤l≤d_1} |ᾱ_{l0} − α_{l0}| + max_{1≤l≤d_1, 1≤s≤d_2} ‖ᾱ_{ls} − α_{ls}‖ = O_p(h^{p+1} + \sqrt{1/(nh)}).

To prove Theorem 3, we first present the properties of the basis G in Lemmas A.1-A.3.

Lemma A.1. For any 1 ≤ s ≤ d_2 and the spline basis B_{s,j} of (A.3), one has (i) E(B_{s,j}) = 0, E|B_{s,j}|^k ≤ cN_n^{k/2−1} for k > 1, j = 1, ..., J_n; (ii) there exists a constant C > 0 such that, for any vector a = (a_1, ..., a_{J_n})^T, as n → ∞,

‖\sum_{j=1}^{J_n} a_j B_{s,j}‖² ≥ C \sum_{j=1}^{J_n} a_j².

Proof. Part (i) follows from Theorem 5.4.2 of DeVore & Lorentz (1993) and Assumptions (C1), (C3). To prove (ii), we introduce the auxiliary knots x_{s,−p} = ··· = x_{s,−1} = x_{s,0} = 0 and x_{s,N_n+p+1} = ··· = x_{s,N_n+2} = x_{s,N_n+1} = 1 for the knot sequence k_s. Then

‖\sum_{j=1}^{J_n} a_j B_{s,j}‖² ≥ c_1 |\sum_{j=1}^{J_n} a_j B_{s,j}|² = c_1 N_n |\sum_{j=1}^{J_n} a_j b_{s,j} − {\sum_{j=1}^{J_n} a_j E(b_{s,j})/E(b_{s,0})} b_{s,0}|²,

where |f|² = ∫ f²(x) dx for any square integrable function f. Let d_{s,j} = (x_{s,j+1} − x_{s,j−p})/(p + 1). Then Theorem 5.4.2 of DeVore & Lorentz (1993) ensures that, for some C > 0, the above is

≥ c_1 C N_n [\sum_{j=1}^{J_n} a_j² d_{s,j} + {\sum_{j=1}^{J_n} a_j E(b_{s,j})/E(b_{s,0})}² d_{s,0}] ≥ c_1 C \sum_{j=1}^{J_n} a_j² N_n d_{s,j} ≥ c_1 C {(p + 1)/(2c_7)} \sum_{j=1}^{J_n} a_j².

Lemma A.2. There exists a constant C > 0 such that, as n → ∞, for any set of coefficients {c_{l0}, c_{ls,j}, l = 1, ..., d_1; s = 1, ..., d_2; j = 1, ..., J_n},

‖\sum_{l=1}^{d_1} (c_{l0} + \sum_{s=1}^{d_2} \sum_{j=1}^{J_n} c_{ls,j} B_{s,j}) t_l‖² ≥ C \sum_{l=1}^{d_1} (c_{l0}² + \sum_{s=1}^{d_2} \sum_{j=1}^{J_n} c_{ls,j}²).

Proof. The result follows immediately from Lemmas 1 and A.1.

Lemma A.3. Let ⟨G, G⟩ be the R_n × R_n matrix (⟨G_i, G_j⟩)_{i,j=1}^{R_n}. Define ⟨G, G⟩_n similarly, but with the theoretical inner product replaced by the empirical one, and let D = diag(⟨G, G⟩). Define

Q_n = sup |D^{−1/2} (⟨G, G⟩_n − ⟨G, G⟩) D^{−1/2}|,

where the supremum is taken over all the elements of the random matrix. Then, as n → ∞, Q_n = O_p((log n)/\sqrt{nh}).

Proof. For notational simplicity, we consider the diagonal terms. For any fixed 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2, 1 ≤ j ≤ J_n, let ξ = (E_n − E){B²_{s,j}(X_s) T_l²} = n^{−1} \sum_{i=1}^{n} ξ_i, in which ξ_i = B²_{s,j}(X_{is}) T_{il}² − E{B²_{s,j}(X_{is}) T_{il}²}. Define T̃_{il} = T_{il} I{|T_{il}| ≤ n^δ}, for some 0 < δ < 1, and define ξ̃, ξ̃_i as ξ and ξ_i, but with T_l replaced by T̃_l. Then for any ε > 0 one has

P{|ξ| ≥ ε(log n)/\sqrt{nh}} ≤ P{|ξ̃| ≥ ε(log n)/\sqrt{nh}} + P(ξ ≠ ξ̃),  (A.4)

in which P(ξ ≠ ξ̃) ≤ P(T_{il} ≠ T̃_{il}, for some i = 1, ..., n) ≤ n P(|T_l| ≥ n^δ) ≤ E|T_l|^m / n^{mδ−1}.

Also note that sup_{0≤x_s≤1} |B_{s,j}(x_s)| = sup_{0≤x_s≤1} \sqrt{N_n} |b_{s,j} − E(b_{s,j}) b_{s,0}/E(b_{s,0})| ≤ c\sqrt{N_n}, for some c > 0. Then by Minkowski's inequality, for any integer k ≥ 3,

E|ξ̃_i|^k ≤ 2^{k−1} [E{B²_{s,j}(X_s) T̃_l²}^k + {E B²_{s,j}(X_s) T̃_l²}^k] ≤ 2^{k−1} n^{2δk} {c^{2k} N_n^k + (cN_n)^k} ≤ n^{2δk} c^{2k} N_n^k.

On the other hand, E ξ̃_i² ≥ E{B⁴_{s,j}(X_s) T_l⁴} − E{B⁴_{s,j}(X_s) T_l⁴ I(|T_l| > n^δ)} − [E{B²_{s,j}(X_s) T_l²}]², in which, under Assumption (C2),

E{B⁴_{s,j}(X_s) T_l⁴ I(|T_l| > n^δ)} ≤ E{B⁴_{s,j}(X_s) E(T_l^{4+δ_0} | X)}/n^{δδ_0} ≤ c_6 E B⁴_{s,j}(X_s)/n^{δδ_0} ≤ cN_n/n^{δδ_0},

where δ_0 is as in Assumption (C2). Furthermore,

[E{B²_{s,j}(X_s) T_l²}]² ≤ c_4² [E{B²_{s,j}(X_s)}]² ≤ c, E{B⁴_{s,j}(X_s) T_l⁴} ≥ c_5 E B⁴_{s,j}(X_s) ≥ c_1 c_5 ∫ B⁴_{s,j}(x_s) dx_s ≥ cc_1 c_5 N_n.

Thus E ξ̃_i² ≥ cN_n − c − cN_n/n^{δδ_0} ≥ cN_n. So there exists a constant c > 0 such that, for all k ≥ 3,

E|ξ̃_i|^k ≤ n^{2δk} c^{2k} N_n^k ≤ (cn^{6δ} N_n)^{k−2} k! E ξ̃_i².

Then one can apply Theorem 1.4 of Bosq (1998) to \sum_{i=1}^{n} ξ̃_i, with the Cramér constant c_r = cn^{6δ} N_n. That is, for any ε > 0, q ∈ [1, n/2], and k ≥ 3, one has

P{n^{−1} |\sum_{i=1}^{n} ξ̃_i| ≥ ε(log n)/\sqrt{nh}} ≤ a_1 exp[−q ε²(log n)²/(nh) / {25m_2² + 5εc_r (log n)/\sqrt{nh}}] + a_2(k) α([n/(q + 1)])^{2k/(2k+1)},

where

a_1 = 2n/q + 2[1 + ε²(log n)²/(nh) / {25m_2² + 5εc_r (log n)/\sqrt{nh}}], a_2(k) = 11n [1 + 5m_p^{k/(2k+1)} / {ε(log n)/\sqrt{nh}}], m_2² = E ξ̃_i², m_p = ‖ξ̃_i‖_p.

Observe that 5εc_r (log n)/\sqrt{nh} = 5εc n^{6δ} N_n (log n)/\sqrt{nh} = o(1), by taking δ < p/{6(2p+3)}. Taking q = n/{c_0 (log n)}, one has a_1 = O(n/q) = O{c_0 (log n)} and, for n large enough, a_2(k) α([n/(q + 1)])^{2k/(2k+1)} = o(n^{−3}) by taking k large enough. Thus

P{n^{−1} |\sum_{i=1}^{n} ξ̃_i| ≥ ε(log n)/\sqrt{nh}} ≤ c(log n) exp{−ε²(log n)/(50c_0 m_2²)} + cn³ exp{(log ρ) c_0 (log n)}.

Thus by (A.4), taking c_0, ε, m large enough and using Assumption (C4), one has

\sum_{n=1}^{∞} P{sup |⟨G, G⟩_n − ⟨G, G⟩| ≥ ε(log n)/\sqrt{nh}} ≤ \sum_{n=1}^{∞} {d_1(d_2 N_n + 2)}² [c(log n) exp{−ε²(log n)/(50c_0 m_2²)} + cn³ exp{(log ρ) c_0 (log n)} + E|T_l|^m / n^{mδ−1}] ≤ \sum_{n=1}^{∞} {d_1(d_2 N_n + 2)}² n^{−3} < +∞,

in which N_n ≍ n^{1/(2p+3)}. Then the lemma follows from the Borel-Cantelli Lemma and Lemma A.1.

Lemma A.4. As n → ∞, one has

sup_{φ_1 ∈ M_n, φ_2 ∈ M_n} |⟨φ_1, φ_2⟩_n − ⟨φ_1, φ_2⟩| / (‖φ_1‖ ‖φ_2‖) = O_p((log n)/\sqrt{nh}).

In particular, there exist constants 0 < c < 1 < C such that, except on an event whose probability tends to zero as n → ∞,

c‖m‖ ≤ ‖m‖_n ≤ C‖m‖, m ∈ M_n.

Proof. With vector notation, one can write φ_1 = a_1^T G, φ_2 = a_2^T G, for R_n × 1 vectors a_1, a_2. Then

|⟨φ_1, φ_2⟩_n − ⟨φ_1, φ_2⟩| = |\sum_{i,j=1}^{R_n} a_{1i} a_{2j} (⟨G_i, G_j⟩_n − ⟨G_i, G_j⟩)| ≤ Q_n C \sqrt{a_1^T a_1 · a_2^T a_2}.

On the other hand, by Lemma A.2,

‖φ_1‖² ‖φ_2‖² = (a_1^T ⟨G, G⟩ a_1)(a_2^T ⟨G, G⟩ a_2) ≥ C (a_1^T a_1)(a_2^T a_2).

Then

|⟨φ_1, φ_2⟩_n − ⟨φ_1, φ_2⟩| / (‖φ_1‖ ‖φ_2‖) ≤ C Q_n \sqrt{a_1^T a_1 · a_2^T a_2} / \sqrt{a_1^T a_1 · a_2^T a_2} = O_p(Q_n) = O_p((log n)/\sqrt{nh}).

Lemma A.4 shows that the empirical and theoretical inner products are uniformly close over the approximation space M_n. This lemma plays the crucial role analogous to that of Lemma 10 of Huang (1998a). Our result is new in that (i) the spline basis of Huang (1998a) must be bounded,

whereas the term t in the basis G makes it possibly unbounded; (ii) Huang (1998a)'s setting is i.i.d., with uniform approximation rate o_p(1), while our setting is α-mixing, broadly applicable to time series data, with the sharper approximation rate O_p((log n)/\sqrt{nh}).

The next lemma follows immediately from Lemmas A.2 and A.4.

Lemma A.5. There exists a constant C > 0 such that, except on an event whose probability tends to zero as n → ∞,

‖\sum_{l=1}^{d_1} (c_{l0} + \sum_{s=1}^{d_2} \sum_{j=1}^{J_n} c_{ls,j} B_{s,j}) t_l‖²_n ≥ C \sum_{l=1}^{d_1} (c_{l0}² + \sum_{s=1}^{d_2} \sum_{j=1}^{J_n} c_{ls,j}²).

A.3 Proof of mean square consistency

Proof of Theorem 3. We denote

Y = (Y_1, ..., Y_n)^T, m = {m(X_1, T_1), ..., m(X_n, T_n)}^T, E = {σ(X_1, T_1)ε_1, ..., σ(X_n, T_n)ε_n}^T.

Note that Y = m + E, and projecting it onto M_n, one has ˆm = m̃ + ẽ, where ˆm is defined in (3.1), and m̃, ẽ are the solutions of (3.1) with Y_i replaced by m(X_i, T_i) and σ(X_i, T_i)ε_i respectively. Also, one can uniquely represent m̃ as m̃ = \sum_{l=1}^{d_1} (α̃_{l0} + \sum_{s=1}^{d_2} α̃_{ls}) t_l, α̃_{ls} ∈ φ_s^{0,n}. With these notations, one has the error decomposition

ˆm − m = (m̃ − m) + ẽ,

where m̃ − m is the bias term and ẽ is the variance term. Since α_{ls} ∈ C^{(p+1)}[0,1] for 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2, by Lemma 2 there exist C > 0 and spline functions g_{ls} ∈ φ_s^0 such that ‖α_{ls} − g_{ls}‖_∞ ≤ Ch^{p+1}. Let m_n(x, t) = \sum_{l=1}^{d_1} {α_{l0} + \sum_{s=1}^{d_2} g_{ls}(x_s)} t_l ∈ M_n. One has

‖m − m_n‖² = ‖\sum_{l=1}^{d_1} \sum_{s=1}^{d_2} {α_{ls} − g_{ls}} t_l‖² ≤ c_4 \sum_{l=1}^{d_1} \sum_{s=1}^{d_2} ‖α_{ls} − g_{ls}‖² ≤ Ch^{2(p+1)}.  (A.5)

Also, ‖m − m_n‖_n ≤ Ch^{p+1} a.s. Then by the definition of projection, one has ‖m̃ − m‖_n ≤ ‖m − m_n‖_n ≤ Ch^{p+1}, which also implies ‖m̃ − m_n‖_n ≤ ‖m̃ − m‖_n + ‖m − m_n‖_n ≤ 2Ch^{p+1}. By Lemma A.4,

‖m̃ − m_n‖ ≤ ‖m̃ − m_n‖_n (1 − Q_n)^{−1/2} = O_p(h^{p+1}).

Together with (A.5), one has

‖m̃ − m‖ = O_p(h^{p+1}).  (A.6)

Next we consider the variance term ẽ. For some set of coefficients â = (â_1, ..., â_{R_n})^T, one can write ẽ(x, t) = \sum_{j=1}^{R_n} â_j G_j(x, t). Denote N = [N(G_1), ..., N(G_{R_n})]^T, with N(G_j) = n^{−1} \sum_{i=1}^{n} G_j(X_i, T_i) σ(X_i, T_i) ε_i. By the definition of projection, one has

(⟨G_j, G_{j′}⟩_n)_{j,j′=1}^{R_n} â = N.

Multiplying both sides with the same vector, one gets â^T (⟨G_j, G_{j′}⟩_n)_{j,j′=1}^{R_n} â = â^T N. Now, by Lemmas A.2 and A.4, the LHS is ‖\sum_{j=1}^{R_n} â_j G_j‖²_n ≥ C(1 − Q_n) \sum_{j=1}^{R_n} â_j², while the RHS is

≤ (\sum_{j=1}^{R_n} â_j²)^{1/2} [\sum_{j=1}^{R_n} {n^{−1} \sum_{i=1}^{n} G_j(X_i, T_i) σ(X_i, T_i) ε_i}²]^{1/2}.

Hence C(1 − Q_n) \sum_{j=1}^{R_n} â_j² ≤ (\sum_{j=1}^{R_n} â_j²)^{1/2} [\sum_{j=1}^{R_n} {n^{−1} \sum_{i=1}^{n} G_j(X_i, T_i) σ(X_i, T_i) ε_i}²]^{1/2}, entailing

(\sum_{j=1}^{R_n} â_j²)^{1/2} ≤ C^{−1}(1 − Q_n)^{−1} [\sum_{j=1}^{R_n} {n^{−1} \sum_{i=1}^{n} G_j(X_i, T_i) σ(X_i, T_i) ε_i}²]^{1/2},

and as a result ‖ẽ‖² ≤ C(1 − Q_n)^{−2} \sum_{j=1}^{R_n} {n^{−1} \sum_{i=1}^{n} G_j(X_i, T_i) σ(X_i, T_i) ε_i}². Since ε_i is independent of {(X_j, T_j), j ≤ i}, for i = 1, ..., n, one has

\sum_{j=1}^{R_n} E{n^{−1} \sum_{i=1}^{n} G_j(X_i, T_i) σ(X_i, T_i) ε_i}² = \sum_{j=1}^{R_n} n^{−2} \sum_{i=1}^{n} E{G_j(X_i, T_i) σ(X_i, T_i) ε_i}² ≤ CJ_n/n = O(1/(nh)),

by Assumptions (C2), (C5) and Lemma A.1 (i). Therefore ‖ẽ‖² = O_p(n^{−1}h^{−1}). This, together with (A.6), proves ‖ˆm − m‖ = O_p(h^{p+1} + \sqrt{1/(nh)}). The second part of Theorem 3 follows from Lemma 1, which entails that, for some C > 0, ‖ˆm − m‖² ≥ C \sum_{l=1}^{d_1} [(ᾱ_{l0} − α_{l0})² + \sum_{s=1}^{d_2} ‖ᾱ_{ls} − α_{ls}‖²].

Proof of Theorem 1. By (A.2), one only needs to show E_n ᾱ_{ls} = O_p(h^{p+1} + \sqrt{1/(nh)}), for 1 ≤ l ≤ d_1, 1 ≤ s ≤ d_2. Note that |E_n ᾱ_{ls}| ≤ |E_n{ᾱ_{ls} − α_{ls}}| + |E_n α_{ls}|, whose first term satisfies

|E_n{ᾱ_{ls} − α_{ls}}| ≤ ‖ᾱ_{ls} − α_{ls}‖_n ≤ ‖ᾱ_{ls} − α̃_{ls}‖_n + ‖α̃_{ls} − g_{ls}‖_n + ‖α_{ls} − g_{ls}‖_n,

with ‖α_{ls} − g_{ls}‖_n ≤ ‖α_{ls} − g_{ls}‖_∞ ≤ Ch^{p+1}, and, applying Lemmas 1 and A.3, one has

‖α̃_{ls} − g_{ls}‖_n ≤ (1 + Q_n) ‖α̃_{ls} − g_{ls}‖ ≤ (1 + Q_n) C ‖m̃ − m_n‖ = O_p(h^{p+1}),

‖ᾱ_{ls} − α̃_{ls}‖_n ≤ (1 + Q_n) ‖ᾱ_{ls} − α̃_{ls}‖ ≤ (1 + Q_n) C ‖ẽ‖ = O_p(\sqrt{1/(nh)}).

Thus E_n{ᾱ_{ls} − α_{ls}} = O_p(h^{p+1} + \sqrt{1/(nh)}). Since E_n α_{ls} = O_p(1/\sqrt{n}), one now has E_n ᾱ_{ls} = O_p(h^{p+1} + \sqrt{1/(nh)}). Theorem 1 now follows from the triangular inequality.

A.4 Proof of BIC consistency

We denote the model space M_S and the approximation space M_{n,S} of m_S respectively as

M_S = {m(x, t) = \sum_{l=1}^{d_1} α_l(x) t_l ; α_l(x) = α_{l0} + \sum_{s∈S_l} α_{ls}(x_s); α_{ls} ∈ H_s^0},

M_{n,S} = {m_n(x, t) = \sum_{l=1}^{d_1} g_l(x) t_l ; g_l(x) = α_{l0} + \sum_{s∈S_l} g_{ls}(x_s); g_{ls} ∈ φ_s^{0,n}}.

If S ⊆ S_f, then M_S ⊆ M_{S_f} and M_{n,S} ⊆ M_{n,S_f}. Let Proj_S (Proj_{n,S}) be the orthogonal least squares projector onto M_S (M_{n,S}) with respect to the empirical inner product. Then ˆm_S of (3.7) can be viewed as ˆm_S = Proj_{n,S} Y. As a special case of Theorem 1, one has the following result.

Lemma A.6. Under the same conditions as in Theorem 1, one has ‖ˆm_S − m_S‖ = O_p(1/N_S^{p+1} + \sqrt{N_S/n}).

Now denote c(S, m) = ‖Proj_S m − m‖². If m ∈ M_{S_0}, then Proj_{S_0} m = m, thus c(S_0, m) = 0; if S overfits, since m ∈ M_{S_0} ⊆ M_S, c(S, m) = 0; and if S underfits, c(S, m) > 0.

Proof of Theorem 2. Notice that

BIC_S − BIC_{S_0} = {(MSE_S − MSE_{S_0})/MSE_{S_0}} {1 + o_p(1)} + (q_S − q_{S_0})(log n)/n = {(MSE_S − MSE_{S_0})/[E{σ²(X, T)}(1 + o_p(1))]} {1 + o_p(1)} + O(n^{−(2p+2)/(2p+3)} log n),

since q_S − q_{S_0} ≍ n^{1/(2p+3)}, and

MSE_{S_0} ≈ n^{−1} \sum_{i=1}^{n} {Y_i − m(X_i, T_i)}² + n^{−1} \sum_{i=1}^{n} {ˆm_{S_0}(X_i, T_i) − m(X_i, T_i)}² = E{σ²(X, T)}(1 + o_p(1)).

Case 1 (Overfitting): Suppose that S_0 ⊂ S and S_0 ≠ S. One has

|MSE_S − MSE_{S_0}| = ‖ˆm_S − ˆm_{S_0}‖²_n = ‖ˆm_S − ˆm_{S_0}‖² {1 + o_p(1)} ≤ 2(‖ˆm_S − m‖² + ‖ˆm_{S_0} − m‖²) {1 + o_p(1)} = O_p(n^{−(2p+2)/(2p+3)}),

which the positive penalty term (q_S − q_{S_0})(log n)/n ≍ n^{−(2p+2)/(2p+3)} log n dominates. Thus lim_{n→+∞} P(BIC_S − BIC_{S_0} > 0) = 1.

To see why the assumption q_S − q_{S_0} ≍ n^{1/(2p+3)} is necessary, suppose q_{S_0} ≍ n^r, with r > 1/(2p+3), instead. Then it can be shown that

MSE_S − MSE_{S_0} = E{σ²(X, T)} n^{r−1} {1 + o_p(1)}, while (q_S − q_{S_0})(log n)/n = −n^{r−1}(log n) {1 + o_p(1)},

which leads to lim_{n→+∞} P(BIC_S − BIC_{S_0} < 0) = 1 instead.

Case 2 (Underfitting): Similarly as in Huang & Yang (2004), we can show that if S underfits, MSE_S − MSE_{S_0} ≥ c(S, m) + o_p(1). Then

BIC_S − BIC_{S_0} ≥ {c(S, m) + o_p(1)}/[E{σ²(X, T)}(1 + o_p(1))] + o_p(1),

which implies that lim_{n→+∞} P(BIC_S − BIC_{S_0} > 0) = 1.
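As a small illustration of how the fitting outcomes tallied in Table 2 are determined, the following sketch classifies a selected index set Ŝ against the true S_0, directly implementing the definitions of Section 3.3; representing the index sets as Python sets is an assumption of this sketch.

```python
def classify_selection(S_hat, S_0):
    """Classify Ŝ against S_0 as in Section 3.3: 'correct' if equal,
    'overfit' if every true index set is contained in the selected one,
    'underfit' otherwise (some true component is missed).
    Each argument: a list of d_1 sets of tuning-variable indices."""
    if all(sel == true for sel, true in zip(S_hat, S_0)):
        return "correct"
    if all(true <= sel for sel, true in zip(S_hat, S_0)):
        return "overfit"
    return "underfit"

# e.g. with d_1 = 2, d_2 = 2:
print(classify_selection([{1, 2}, {1}], [{1, 2}, set()]))  # 'overfit'
```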

Table 1: Simulated example 1: the means and standard errors (in parentheses) of ĉ_1, ĉ_2 and the AISEs of ˆα_{11}, ˆα_{12}, ˆα_{21}, ˆα_{22} by two methods: marginal integration and polynomial spline. (Rows: Integration fit, Spline fit p = 1, Spline fit p = 3, each at n = 100, 250, 500; columns: c_1 = 2, c_2 = 1, α_{11}, α_{12}, α_{21}, α_{22}. Table entries not reproduced.)

Table 2: Simulation results of model selection using the polynomial spline fit with the BIC and AIC. For each setup, the first, second and third columns give the numbers of underfitting, correct fitting and overfitting over 100 simulations. (Rows: Example 1 and Example 2, each at n = 100, 250, 500; columns: BIC and AIC, each for p = 1, 2, 3. Table entries not reproduced.)

Table 3: Results of simulated example 2: the means and standard errors (in parentheses) of ĉ_1 and ĉ_2, and the AISEs of ˆα_{11}, ˆα_{12}, ˆα_{21}, ˆα_{22} by linear spline (p = 1). (Rows: n = 100, 250, 500; columns: c_1 = 0.2, c_2 = 0.3, α_{11}, α_{12}, α_{21}, α_{22}. Table entries not reproduced.)

Table 4: German real GNP: the ASEs and ASPEs of five fits. (Rows: Integration fit p = 1, Integration fit p = 3, Spline fit p = 1, Spline fit p = 3, Linear AR fit; columns: ASE, ASPE. Table entries not reproduced.)

Figure 1: Plots of the estimated coefficient functions in simulated example 1. (a1)-(a4) are plots of the 100 estimated curves for α_{11}(x) = sin(4x − 2) + exp{−2(x − 0.5)²}, α_{12}(x) = x, α_{21}(x) = sin(2x), α_{22}(x) = 0, for n = 100. (b1)-(b4) and (c1)-(c4) are the same as (a1)-(a4), but for n = 250 and n = 500 respectively. (d1)-(d4) are plots of the typical estimators: the solid curve represents the true curve, the dotted curve is the typical estimated curve for n = 100, and the dot-dashed and dashed curves are for n = 250, 500 respectively.

Figure 2: Plots of the estimated coefficient functions in simulated example 2. (a1)-(a4) are plots of the 100 estimated curves for α_{11}(u) = (2u) exp(−4u²), α_{12}(u) = 0.3/{1 + (2u − 1)⁴}, α_{21}(u) = 0, α_{22}(u) = (2u) exp(−4u²), when n = 100. (b1)-(b4) and (c1)-(c4) are the same as (a1)-(a4), but when n = 250, 500 respectively. (d1)-(d4) are plots of the typical estimators: the solid curve represents the true curve, the dotted curve is the typical estimated curve for n = 100, and the dot-dashed and dashed curves are for n = 250, 500 respectively.

Original quarterly GNP data

Figure 3: GNP data: time plot of the series {G_t}_{t=1}^{124}.

Transformed quarterly GNP data

Figure 4: GNP data after transformation: time plot of the series {Y_t}_{t=1}^{120}.

Figure 5: GNP data: spline approximations of the functions in model (4.2): (a) ˆα_{11}; (b) ˆα_{12}; (c) ˆα_{21}; (d) ˆα_{22}, in which solid lines denote the estimates using linear spline (p = 1), and dotted lines denote the estimates using cubic spline (p = 3).


More information

Nonparametric Econometrics

Nonparametric Econometrics Applied Microeconometrics with Stata Nonparametric Econometrics Spring Term 2011 1 / 37 Contents Introduction The histogram estimator The kernel density estimator Nonparametric regression estimators Semi-

More information

Stochastic volatility models: tails and memory

Stochastic volatility models: tails and memory : tails and memory Rafa l Kulik and Philippe Soulier Conference in honour of Prof. Murad Taqqu 19 April 2012 Rafa l Kulik and Philippe Soulier Plan Model assumptions; Limit theorems for partial sums and

More information

Goodness-of-Fit Tests for Time Series Models: A Score-Marked Empirical Process Approach

Goodness-of-Fit Tests for Time Series Models: A Score-Marked Empirical Process Approach Goodness-of-Fit Tests for Time Series Models: A Score-Marked Empirical Process Approach By Shiqing Ling Department of Mathematics Hong Kong University of Science and Technology Let {y t : t = 0, ±1, ±2,

More information

Quantile Regression for Extraordinarily Large Data

Quantile Regression for Extraordinarily Large Data Quantile Regression for Extraordinarily Large Data Shih-Kang Chao Department of Statistics Purdue University November, 2016 A joint work with Stanislav Volgushev and Guang Cheng Quantile regression Two-step

More information

A New Method for Varying Adaptive Bandwidth Selection

A New Method for Varying Adaptive Bandwidth Selection IEEE TRASACTIOS O SIGAL PROCESSIG, VOL. 47, O. 9, SEPTEMBER 1999 2567 TABLE I SQUARE ROOT MEA SQUARED ERRORS (SRMSE) OF ESTIMATIO USIG THE LPA AD VARIOUS WAVELET METHODS A ew Method for Varying Adaptive

More information

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model 1. Introduction Varying-coefficient partially linear model (Zhang, Lee, and Song, 2002; Xia, Zhang, and Tong, 2004;

More information

Professors Lin and Ying are to be congratulated for an interesting paper on a challenging topic and for introducing survival analysis techniques to th

Professors Lin and Ying are to be congratulated for an interesting paper on a challenging topic and for introducing survival analysis techniques to th DISCUSSION OF THE PAPER BY LIN AND YING Xihong Lin and Raymond J. Carroll Λ July 21, 2000 Λ Xihong Lin (xlin@sph.umich.edu) is Associate Professor, Department ofbiostatistics, University of Michigan, Ann

More information

Wavelet Shrinkage for Nonequispaced Samples

Wavelet Shrinkage for Nonequispaced Samples University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Wavelet Shrinkage for Nonequispaced Samples T. Tony Cai University of Pennsylvania Lawrence D. Brown University

More information

Regression: Lecture 2

Regression: Lecture 2 Regression: Lecture 2 Niels Richard Hansen April 26, 2012 Contents 1 Linear regression and least squares estimation 1 1.1 Distributional results................................ 3 2 Non-linear effects and

More information

Mohsen Pourahmadi. 1. A sampling theorem for multivariate stationary processes. J. of Multivariate Analysis, Vol. 13, No. 1 (1983),

Mohsen Pourahmadi. 1. A sampling theorem for multivariate stationary processes. J. of Multivariate Analysis, Vol. 13, No. 1 (1983), Mohsen Pourahmadi PUBLICATIONS Books and Editorial Activities: 1. Foundations of Time Series Analysis and Prediction Theory, John Wiley, 2001. 2. Computing Science and Statistics, 31, 2000, the Proceedings

More information

Estimating GARCH models: when to use what?

Estimating GARCH models: when to use what? Econometrics Journal (2008), volume, pp. 27 38. doi: 0./j.368-423X.2008.00229.x Estimating GARCH models: when to use what? DA HUANG, HANSHENG WANG AND QIWEI YAO, Guanghua School of Management, Peking University,

More information

A CENTRAL LIMIT THEOREM FOR NESTED OR SLICED LATIN HYPERCUBE DESIGNS

A CENTRAL LIMIT THEOREM FOR NESTED OR SLICED LATIN HYPERCUBE DESIGNS Statistica Sinica 26 (2016), 1117-1128 doi:http://dx.doi.org/10.5705/ss.202015.0240 A CENTRAL LIMIT THEOREM FOR NESTED OR SLICED LATIN HYPERCUBE DESIGNS Xu He and Peter Z. G. Qian Chinese Academy of Sciences

More information

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive

More information

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics Jiti Gao Department of Statistics School of Mathematics and Statistics The University of Western Australia Crawley

More information

Research Article Exponential Inequalities for Positively Associated Random Variables and Applications

Research Article Exponential Inequalities for Positively Associated Random Variables and Applications Hindawi Publishing Corporation Journal of Inequalities and Applications Volume 008, Article ID 38536, 11 pages doi:10.1155/008/38536 Research Article Exponential Inequalities for Positively Associated

More information

Modelling Non-linear and Non-stationary Time Series

Modelling Non-linear and Non-stationary Time Series Modelling Non-linear and Non-stationary Time Series Chapter 2: Non-parametric methods Henrik Madsen Advanced Time Series Analysis September 206 Henrik Madsen (02427 Adv. TS Analysis) Lecture Notes September

More information

DEPARTMENT MATHEMATIK ARBEITSBEREICH MATHEMATISCHE STATISTIK UND STOCHASTISCHE PROZESSE

DEPARTMENT MATHEMATIK ARBEITSBEREICH MATHEMATISCHE STATISTIK UND STOCHASTISCHE PROZESSE Estimating the error distribution in nonparametric multiple regression with applications to model testing Natalie Neumeyer & Ingrid Van Keilegom Preprint No. 2008-01 July 2008 DEPARTMENT MATHEMATIK ARBEITSBEREICH

More information

Multiscale Adaptive Inference on Conditional Moment Inequalities

Multiscale Adaptive Inference on Conditional Moment Inequalities Multiscale Adaptive Inference on Conditional Moment Inequalities Timothy B. Armstrong 1 Hock Peng Chan 2 1 Yale University 2 National University of Singapore June 2013 Conditional moment inequality models

More information

Forecasting Levels of log Variables in Vector Autoregressions

Forecasting Levels of log Variables in Vector Autoregressions September 24, 200 Forecasting Levels of log Variables in Vector Autoregressions Gunnar Bårdsen Department of Economics, Dragvoll, NTNU, N-749 Trondheim, NORWAY email: gunnar.bardsen@svt.ntnu.no Helmut

More information

Heteroskedasticity; Step Changes; VARMA models; Likelihood ratio test statistic; Cusum statistic.

Heteroskedasticity; Step Changes; VARMA models; Likelihood ratio test statistic; Cusum statistic. 47 3!,57 Statistics and Econometrics Series 5 Febrary 24 Departamento de Estadística y Econometría Universidad Carlos III de Madrid Calle Madrid, 126 2893 Getafe (Spain) Fax (34) 91 624-98-49 VARIANCE

More information

More Powerful Tests for Homogeneity of Multivariate Normal Mean Vectors under an Order Restriction

More Powerful Tests for Homogeneity of Multivariate Normal Mean Vectors under an Order Restriction Sankhyā : The Indian Journal of Statistics 2007, Volume 69, Part 4, pp. 700-716 c 2007, Indian Statistical Institute More Powerful Tests for Homogeneity of Multivariate Normal Mean Vectors under an Order

More information

Interaction effects for continuous predictors in regression modeling

Interaction effects for continuous predictors in regression modeling Interaction effects for continuous predictors in regression modeling Testing for interactions The linear regression model is undoubtedly the most commonly-used statistical model, and has the advantage

More information

Identification and Estimation for Generalized Varying Coefficient Partially Linear Models

Identification and Estimation for Generalized Varying Coefficient Partially Linear Models Identification and Estimation for Generalized Varying Coefficient Partially Linear Models Mingqiu Wang, Xiuli Wang and Muhammad Amin Abstract The generalized varying coefficient partially linear model

More information

Bahadur representations for bootstrap quantiles 1

Bahadur representations for bootstrap quantiles 1 Bahadur representations for bootstrap quantiles 1 Yijun Zuo Department of Statistics and Probability, Michigan State University East Lansing, MI 48824, USA zuo@msu.edu 1 Research partially supported by

More information

PARAMETER ESTIMATION OF CHIRP SIGNALS IN PRESENCE OF STATIONARY NOISE

PARAMETER ESTIMATION OF CHIRP SIGNALS IN PRESENCE OF STATIONARY NOISE PARAMETER ESTIMATION OF CHIRP SIGNALS IN PRESENCE OF STATIONARY NOISE DEBASIS KUNDU AND SWAGATA NANDI Abstract. The problem of parameter estimation of the chirp signals in presence of stationary noise

More information

SPLINE ESTIMATION OF SINGLE-INDEX MODELS

SPLINE ESTIMATION OF SINGLE-INDEX MODELS SPLINE ESIMAION OF SINGLE-INDEX MODELS Li Wang and Lijian Yang University of Georgia and Mihigan State University Supplementary Material his note ontains proofs for the main results he following two propositions

More information

Lecture 2: Univariate Time Series

Lecture 2: Univariate Time Series Lecture 2: Univariate Time Series Analysis: Conditional and Unconditional Densities, Stationarity, ARMA Processes Prof. Massimo Guidolin 20192 Financial Econometrics Spring/Winter 2017 Overview Motivation:

More information

Convergence rates in weighted L 1 spaces of kernel density estimators for linear processes

Convergence rates in weighted L 1 spaces of kernel density estimators for linear processes Alea 4, 117 129 (2008) Convergence rates in weighted L 1 spaces of kernel density estimators for linear processes Anton Schick and Wolfgang Wefelmeyer Anton Schick, Department of Mathematical Sciences,

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

On prediction and density estimation Peter McCullagh University of Chicago December 2004

On prediction and density estimation Peter McCullagh University of Chicago December 2004 On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating

More information

1. Introduction Over the last three decades a number of model selection criteria have been proposed, including AIC (Akaike, 1973), AICC (Hurvich & Tsa

1. Introduction Over the last three decades a number of model selection criteria have been proposed, including AIC (Akaike, 1973), AICC (Hurvich & Tsa On the Use of Marginal Likelihood in Model Selection Peide Shi Department of Probability and Statistics Peking University, Beijing 100871 P. R. China Chih-Ling Tsai Graduate School of Management University

More information

arxiv: v1 [stat.me] 30 Dec 2017

arxiv: v1 [stat.me] 30 Dec 2017 arxiv:1801.00105v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng e-mail: 96354501@nccu.edu.tw

More information

Preface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation

Preface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation Preface Nonparametric econometrics has become one of the most important sub-fields in modern econometrics. The primary goal of this lecture note is to introduce various nonparametric and semiparametric

More information

Jump detection in time series nonparametric regression models: a polynomial spline approach

Jump detection in time series nonparametric regression models: a polynomial spline approach Ann Inst Stat Math (2014) 66:325 344 DOI 10.1007/s10463-013-0411-3 Jump detection in time series nonparametric regression models: a polynomial spline approach Yujiao Yang Qiongxia Song Received: 7 June

More information

11. Further Issues in Using OLS with TS Data

11. Further Issues in Using OLS with TS Data 11. Further Issues in Using OLS with TS Data With TS, including lags of the dependent variable often allow us to fit much better the variation in y Exact distribution theory is rarely available in TS applications,

More information

ON THE ESTIMATION OF EXTREME TAIL PROBABILITIES. By Peter Hall and Ishay Weissman Australian National University and Technion

ON THE ESTIMATION OF EXTREME TAIL PROBABILITIES. By Peter Hall and Ishay Weissman Australian National University and Technion The Annals of Statistics 1997, Vol. 25, No. 3, 1311 1326 ON THE ESTIMATION OF EXTREME TAIL PROBABILITIES By Peter Hall and Ishay Weissman Australian National University and Technion Applications of extreme

More information

Vector Autoregressive Model. Vector Autoregressions II. Estimation of Vector Autoregressions II. Estimation of Vector Autoregressions I.

Vector Autoregressive Model. Vector Autoregressions II. Estimation of Vector Autoregressions II. Estimation of Vector Autoregressions I. Vector Autoregressive Model Vector Autoregressions II Empirical Macroeconomics - Lect 2 Dr. Ana Beatriz Galvao Queen Mary University of London January 2012 A VAR(p) model of the m 1 vector of time series

More information

Plug-in Approach to Active Learning

Plug-in Approach to Active Learning Plug-in Approach to Active Learning Stanislav Minsker Stanislav Minsker (Georgia Tech) Plug-in approach to active learning 1 / 18 Prediction framework Let (X, Y ) be a random couple in R d { 1, +1}. X

More information

Classification via kernel regression based on univariate product density estimators

Classification via kernel regression based on univariate product density estimators Classification via kernel regression based on univariate product density estimators Bezza Hafidi 1, Abdelkarim Merbouha 2, and Abdallah Mkhadri 1 1 Department of Mathematics, Cadi Ayyad University, BP

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Forecasting Lecture 2: Forecast Combination, Multi-Step Forecasts

Forecasting Lecture 2: Forecast Combination, Multi-Step Forecasts Forecasting Lecture 2: Forecast Combination, Multi-Step Forecasts Bruce E. Hansen Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Forecast Combination and Multi-Step Forecasts

More information

Bayesian Estimation and Inference for the Generalized Partial Linear Model

Bayesian Estimation and Inference for the Generalized Partial Linear Model Bayesian Estimation Inference for the Generalized Partial Linear Model Haitham M. Yousof 1, Ahmed M. Gad 2 1 Department of Statistics, Mathematics Insurance, Benha University, Egypt. 2 Department of Statistics,

More information

Notes on Time Series Modeling

Notes on Time Series Modeling Notes on Time Series Modeling Garey Ramey University of California, San Diego January 17 1 Stationary processes De nition A stochastic process is any set of random variables y t indexed by t T : fy t g

More information

High-dimensional two-sample tests under strongly spiked eigenvalue models

High-dimensional two-sample tests under strongly spiked eigenvalue models 1 High-dimensional two-sample tests under strongly spiked eigenvalue models Makoto Aoshima and Kazuyoshi Yata University of Tsukuba Abstract: We consider a new two-sample test for high-dimensional data

More information

Tutorial lecture 2: System identification

Tutorial lecture 2: System identification Tutorial lecture 2: System identification Data driven modeling: Find a good model from noisy data. Model class: Set of all a priori feasible candidate systems Identification procedure: Attach a system

More information

STAT Financial Time Series

STAT Financial Time Series STAT 6104 - Financial Time Series Chapter 4 - Estimation in the time Domain Chun Yip Yau (CUHK) STAT 6104:Financial Time Series 1 / 46 Agenda 1 Introduction 2 Moment Estimates 3 Autoregressive Models (AR

More information

Additive Isotonic Regression

Additive Isotonic Regression Additive Isotonic Regression Enno Mammen and Kyusang Yu 11. July 2006 INTRODUCTION: We have i.i.d. random vectors (Y 1, X 1 ),..., (Y n, X n ) with X i = (X1 i,..., X d i ) and we consider the additive

More information

Gaussian processes. Basic Properties VAG002-

Gaussian processes. Basic Properties VAG002- Gaussian processes The class of Gaussian processes is one of the most widely used families of stochastic processes for modeling dependent data observed over time, or space, or time and space. The popularity

More information

Nonparametric Density Estimation

Nonparametric Density Estimation Nonparametric Density Estimation Econ 690 Purdue University Justin L. Tobias (Purdue) Nonparametric Density Estimation 1 / 29 Density Estimation Suppose that you had some data, say on wages, and you wanted

More information

DS-GA 1002 Lecture notes 12 Fall Linear regression

DS-GA 1002 Lecture notes 12 Fall Linear regression DS-GA Lecture notes 1 Fall 16 1 Linear models Linear regression In statistics, regression consists of learning a function relating a certain quantity of interest y, the response or dependent variable,

More information

Asymptotic Normality under Two-Phase Sampling Designs

Asymptotic Normality under Two-Phase Sampling Designs Asymptotic Normality under Two-Phase Sampling Designs Jiahua Chen and J. N. K. Rao University of Waterloo and University of Carleton Abstract Large sample properties of statistical inferences in the context

More information