Cross-validation in model-assisted estimation


Graduate Theses and Dissertations, Iowa State University Capstones, Theses and Dissertations, 2009. Part of the Statistics and Probability Commons.

Recommended Citation: You, Lifeng, "Cross-validation in model-assisted estimation" (2009). Graduate Theses and Dissertations.

This dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository.

Cross-validation in model-assisted estimation

by

Lifeng You

A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Major: Statistics

Program of Study Committee: Jean D. Opsomer, Major Professor; Michael Larsen; Huaiqing Wu; Cindy L. Yu; Helen H. Jensen

Iowa State University, Ames, Iowa, 2009

Copyright © Lifeng You, 2009. All rights reserved.

3 ii DEDICATION I would like to dedicate this thesis to my husband Lei without whose support I would not have been able to complete this work. I would also like to thank my friends and family for their loving guidance and financial assistance during the writing of this work.

TABLE OF CONTENTS

LIST OF TABLES
ACKNOWLEDGEMENTS
ABSTRACT

CHAPTER 1. Cross-Validation in Penalized Spline Model-Assisted Estimation
1.1 Introduction
1.2 Definition of the Estimator and Smoothing Parameter Selection
1.2.1 Definition of the Estimator
1.2.2 Smoothing Parameter Selection
1.3 Theoretical Properties
1.4 Simulation Results
1.5 Conclusion

CHAPTER 2. CV as Improved Variance Estimation for Model-Assisted Estimators
2.1 Introduction
2.2 Definition of the Estimator
2.3 Theoretical Properties
2.4 Simulation Results
2.5 Conclusion

CHAPTER 3. Shrinking for Linear Regression Estimation
3.1 Definition of the Estimator
3.2 Theoretical Properties
3.3 Simulation Results
3.4 Conclusion

CHAPTER 4. Model Selection for Regression Estimators
4.1 Definition of the Estimator
4.2 Theoretical Properties
4.3 Simulation Results
4.4 Conclusion

APPENDIX A. Technical Lemmas
A.1 Lemmas I
A.2 Lemmas II

APPENDIX B. Additional Simulation Results
B.1 More Simulation Results in Chapter 1
B.2 More Simulation Results in Chapter 2
B.3 More Simulation Results in Chapter 3
B.4 More Simulation Results in Chapter 4

BIBLIOGRAPHY

ACKNOWLEDGEMENTS

I would like to take this opportunity to express my thanks to my major professor, Dr. Opsomer, for his guidance, patience and support throughout this research and the writing of this thesis. I would also like to thank my committee members for their efforts and contributions to this work: Dr. Larsen, Dr. Wu, Dr. Yu and Dr. Jensen. I would additionally like to thank Dr. Hofmann for her guidance throughout the initial stages of my graduate study. Finally, I would like to thank Dr. Kling and Dr. Maiti for their valuable comments and suggestions on my research.

ABSTRACT

Variance estimation for survey estimators that involve modeling relies on approximations that ignore the effect of fitting the models. A cross-validation (CV) criterion provides a way to incorporate this effect. We explore this idea in four ways in this dissertation.

Penalized spline regression, a main type of nonparametric model-assisted method, is a common technique for improving the precision of finite population estimators. In Chapter 1, we propose a CV-based criterion to select the smoothing parameter for the penalized spline regression estimator. The design-based asymptotic properties of the method are derived, and simulation studies show how well it works in practice.

The regression estimator is a common technique for improving the precision of finite population estimators by using available auxiliary information about the population. In Chapter 2, we propose a CV-based variance estimator and compare it to two other variance estimators. The design-based asymptotic properties of the estimator are derived, and simulation studies show how well it works in practice.

The regression estimator works well when there is a strong linear relationship between the regressors and the response. By contrast, when the relationship is weak, the π estimator is a good choice. In Chapter 3, a new estimator, formed as a linear combination of those two estimators, is proposed to select between them. We introduce a CV-based variance estimator for the newly proposed estimator. The design-based asymptotic properties of the estimator are explored, and simulation studies show how well it works in practice.

In linear regression estimation, how to choose the set of control variables x is a difficult practical problem. In Chapter 4, a CV criterion is introduced for choosing among combinations of the x variables to be included in the model. The design-based asymptotic properties of the estimator are explored, and simulation studies show how well it works in practice.

12 CHAPTER. Cross-Validation in Penalized Spline Model-Assisted Estimation. Introduction In many surveys, the available auxiliary information for the population can be used to improve the precision of design-based estimators. Thereinto ratio and regression estimators have been used for a long time in survey estimation, e.g., Cochran (977. Breidt and Opsomer (000 proposed a nonparametric model-assisted regression estimator with the relationship between the variables to be any smooth function. They used kernel-based local polynomial regression and showed that the nonparametric estimator has the same asymptotic design properties of the parametric model-assisted estimators. The practical properties of the estimator depend on the choice of a smoothness tuning parameter, i.e., the bandwidth in local polynomial regression. In Breidt and Opsomer (000, the bandwidth is treated as a fixed quantity and the issue of how to best select a bandwidth value is not addressed. In Opsomer and Miller (005, the issue of smoothing parameter selection for nonparametric model-assisted estimation was explored and a samplebased criterion that could be used for this purpose was proposed. The proposed smoothing parameter selection method was based on minimizing a type of cross-validation criterion, suitably adjusted for the effect of the finite population setting and the survey design. Penalized spline regression, often called P-splines, is a main type of nonparametric modelassisted methods introduced by Eilers and Marx (996. P-splines are flexible and can be incorporated into a wide range of modelling contexts. Ruppert et al. (003 gave an overview of applications of P-splines to different settings. P-splines are also a natural candidate for constructing nonparametric small area estimators in terms of their close connections with

linear mixed models, as discussed in Wand (2003). The ability to combine nonparametric regression and mixed model regression with P-splines has been used in different contexts, e.g., Parise et al. (2001) and Coull et al. (2001). They all provided examples of using penalized splines in the construction of mixed effect regression models to analyze data with random effects. In the survey context, Zheng and Little (2003) proposed a model-based estimator for cluster sampling, where the regression model combines a spline model with a random effect for the clusters. Opsomer et al. (2008) proposed a new small area estimation approach, which combines small area random effects with a smooth, nonparametrically specified trend. The small area estimation problem can be expressed as a mixed effect regression model by using penalized splines as the representation for the nonparametric trend. They showed consistency of the estimator, computed its mean squared error and provided tests for small area effects and non-linearities.

Breidt et al. (2005) proposed a class of estimators based on penalized spline regression. Those estimators are weighted linear combinations of sample observations, with weights calibrated to known control totals. The estimators are design consistent and asymptotically normal under standard design conditions, and they admit consistent variance estimation using design-based methods. Breidt et al. (2005) considered data-driven penalty selection in the context of unequal probability sampling designs and showed that the estimators are more efficient than parametric regression estimators when the parametric model is incorrectly specified, while being approximately as efficient when the parametric model specification is correct.

Regression spline smoothing involves modelling a regression function as a piecewise polynomial with a large number of pieces relative to the sample size.
Since the number of possible models is so large, efficient strategies are required for choosing among them. Wand (2000) reviewed some approaches to this problem and compared them through a simulation study. For simplicity, Wand (2000) considered the univariate smoothing setting with Gaussian noise and the truncated polynomial regression spline basis. Several other approaches for knot selection exist, e.g., the TURBO algorithm in Friedman and Silverman (1989) and its subsequent generalization, the MARS algorithm in Friedman (1991).

The knots of a penalized spline are generally placed at fixed quantiles of the independent variable, and the only parameters left to adjust are the number of knots and the penalty parameter. Ruppert (2002) studied the effect of the number of knots on the performance of penalized splines. Two algorithms for the automatic selection of the number of knots, a myopic algorithm and a full search algorithm, were proposed. Ruppert (2002) also described a Demmler-Reinsch type diagonalization for computing univariate and additive penalized splines, which is very useful for super-fast generalized cross-validation, though it is not effective for smoothing splines because of their large number of knots. The choices of the number and positioning of the knots are much less crucial than that of the smoothing parameter. Ruppert et al. (2003) introduced several model selection approaches, e.g., cross-validation (CV), generalized cross-validation (GCV) and Mallows's C_p criterion. The optimal amount of smoothing in penalized spline regression was investigated in Wand (1999). In that article, a simple closed-form approximation to the optimal smoothing parameter was derived. The approach was based on the mean average squared error (MASE), a mathematical measure of the global discrepancy between m and its estimator, and was shown to be a useful starting point for determining the optimal amount of smoothing in penalized spline regression.

In nonparametric regression, one can select the smoothing parameter by minimizing a mean squared error (MSE) based criterion. For spline smoothing, the smooth estimation can be rewritten as a linear mixed model, and maximum likelihood (ML) theory can then be applied to estimate the smoothing parameter as a variance component. The relationship between spline smoothing and mixed models was discussed in Green and Silverman (1994), Brumback and Rice (1998) and Verbyla et al. (1999). In Kauermann (2005), smoothing parameter selection for P-spline smoothing based on MSE minimization was compared with REML estimation. The results for the MSE minimization method are similar to those provided in Wand (1999). It was shown that REML-based smoothing parameter selection is asymptotically biased towards undersmoothing, i.e., this approach chooses a more complex model than the MSE method. The result accords with classical spline smoothing, although the asymptotic arguments are different.

The P-spline smoother is rapidly becoming more popular, and theoretical results are much easier to prove for it. In this chapter, we propose a new CV-based criterion for smoothing
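As an illustration of the default knot-placement rule just described, knots at fixed quantiles of the independent variable, the following Python sketch builds a truncated-polynomial spline basis with knots at equally spaced sample quantiles. The helper name and the specific choices of q and K are illustrative, not taken from the text.

```python
import numpy as np

def truncated_poly_basis(x, knots, q=1):
    """Design matrix with rows [1, x, ..., x^q, (x-k1)_+^q, ..., (x-kK)_+^q]."""
    cols = [x**d for d in range(q + 1)]
    cols += [np.maximum(x - k, 0.0)**q for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)

# Knots at K equally spaced interior quantiles of the covariate.
K = 20
knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])
X = truncated_poly_basis(x, knots, q=1)
print(X.shape)  # (500, 22): intercept, slope, and 20 truncated-line terms
```

Placing knots at quantiles rather than on an equispaced grid keeps roughly the same number of observations between consecutive knots, which is why it is a common default.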

parameter selection, which has almost exactly the same expression as criterion (9) in Opsomer and Miller (2005), except that a penalized spline estimator is used to estimate the smooth function instead of the local polynomial estimator of Opsomer and Miller (2005). Section 1.2.1 gives the definition of the spline estimator. Section 1.2.2 introduces the smoothing parameter selection method. In Section 1.3, we state the assumptions used in the theoretical derivations and describe our main theoretical results. In Section 1.4, we report simulation results, which show how well the CV-based criterion works in practice.

1.2 Definition of the Estimator and Smoothing Parameter Selection

1.2.1 Definition of the Estimator

In survey sampling, the estimation of a finite population total, t_y = \sum_{i \in U} y_i, is a common problem. Here U = \{1, 2, \ldots, N\} is a finite population with N identifiable elements and y_i is a response variable for the ith element. A sample of population elements s \subset U is selected with probability p(s). Let \pi_i = \Pr(i \in s) = \sum_{s: i \in s} p(s) > 0 denote the inclusion probability for element i. Then the Horvitz-Thompson estimator (Horvitz and Thompson 1952) for t_y is

\hat{t}_{y,HT} = \sum_{i \in s} \frac{y_i}{\pi_i}.  (1.1)

The variance of the Horvitz-Thompson estimator under the sampling design is

\mathrm{Var}(\hat{t}_{y,HT}) = \sum_{i \in U} \sum_{j \in U} (\pi_{ij} - \pi_i \pi_j) \frac{y_i}{\pi_i} \frac{y_j}{\pi_j},  (1.2)

where \pi_{ij} = \Pr(i \in s, j \in s) is the joint inclusion probability for elements i, j \in U.

Suppose there is auxiliary information x_i available for all of U. We then hope to improve the estimation of t_y by using this auxiliary information. One approach to incorporating the auxiliary information is to postulate a superpopulation model, say \xi, which describes the relationship between the response variable y and the auxiliary variable x.
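For concreteness, here is a minimal Python sketch of the Horvitz-Thompson estimator of (1.1) under simple random sampling without replacement, where every \pi_i = n/N. The function name and the toy population are illustrative, not from the text.

```python
import numpy as np

def horvitz_thompson(y_s, pi_s):
    """Horvitz-Thompson estimator of a population total: sum over s of y_i / pi_i."""
    y_s, pi_s = np.asarray(y_s, float), np.asarray(pi_s, float)
    return float(np.sum(y_s / pi_s))

# Under simple random sampling without replacement, pi_i = n/N for every element.
rng = np.random.default_rng(1)
N, n = 1000, 100
y = rng.normal(5.0, 1.0, N)                      # hypothetical response values
s = rng.choice(N, size=n, replace=False)         # SRS without replacement
t_hat = horvitz_thompson(y[s], np.full(n, n / N))
```

Because each sampled y_i is weighted by 1/\pi_i, the estimator is design unbiased for the total regardless of the response values; with a census (\pi_i = 1 for all i) it reproduces the total exactly.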

Consider the superpopulation regression model

y_i = m(x_i) + \varepsilon_i,  (1.3)

where the \varepsilon_i are independent random variables with mean zero and variance v(x_i), m(x_i) is a smooth function of x_i, and v(x_i) is smooth and strictly positive. In order to introduce the estimator, we treat \{(x_i, y_i) : i \in U\} as a realization from the superpopulation model (1.3). If the entire realization were observed, we could define a P-spline estimator for m(\cdot) as follows:

m(x; \beta) = \beta_0 + \beta_1 x + \cdots + \beta_q x^q + \sum_{k=1}^{K} \beta_{q+k} (x - \kappa_k)_+^q,  (1.4)

where (t)_+^q = t^q if t > 0 and 0 otherwise, q is the degree of the spline, \kappa_1 < \cdots < \kappa_K is a set of fixed knots and \beta = (\beta_0, \ldots, \beta_{q+K})^T is the coefficient vector. Typically, q is kept fixed and low. If the number of knots K is sufficiently large, the class of functions m(x; \beta) is very large and can approximate most smooth functions with a high degree of accuracy. The population estimator for \beta is defined as the minimizer of

\sum_{i \in U} (y_i - m(x_i; \beta))^2 + \alpha \sum_{k=1}^{K} \beta_{q+k}^2  (1.5)

for some fixed constant \alpha \geq 0. The smoothness of the resulting fit depends on the value of \alpha, with larger values corresponding to smoother fits. Let X represent the matrix with rows x_i^T = \{1, x_i, \ldots, x_i^q, (x_i - \kappa_1)_+^q, \ldots, (x_i - \kappa_K)_+^q\} for i \in U, and let Y denote the column vector of response values y_i for i \in U. Define a diagonal matrix A_\alpha = \mathrm{diag}\{0, \ldots, 0, \alpha, \ldots, \alpha\}, which has q + 1 zeros followed by K copies of the penalty constant \alpha. If the population U is fully observed, the penalized least squares estimator for the coefficient vector of (1.4) has the ridge-regression representation

\beta_U = (X^T X + A_\alpha)^{-1} X^T Y.  (1.6)

Let m_i = m(x_i; \beta_U) = x_i^T \beta_U, i \in U, denote the P-spline fit obtained from this hypothetical population fit at x_i. If these fitted values were known, they could be incorporated into the survey estimation by constructing the difference estimator (Särndal et al. 1992)

\hat{t}_{y,diff} = \sum_{i \in U} m_i + \sum_{i \in s} \frac{y_i - m_i}{\pi_i}.  (1.7)
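The ridge-regression representation (1.6) translates directly into a few lines of linear algebra. The following Python sketch fits the truncated-polynomial P-spline by solving the penalized normal equations; the helper names and the toy data are illustrative, not part of the text.

```python
import numpy as np

def pspline_basis(x, knots, q=1):
    """Rows x_i^T = [1, x, ..., x^q, (x-k_1)_+^q, ..., (x-k_K)_+^q]."""
    return np.column_stack([x**d for d in range(q + 1)] +
                           [np.maximum(x - k, 0.0)**q for k in knots])

def pspline_coef(X, y, alpha, q, K):
    """Ridge form of (1.6): beta = (X'X + A_alpha)^{-1} X'y, where A_alpha
    penalizes only the K truncated-polynomial (knot) coefficients."""
    A = np.diag(np.r_[np.zeros(q + 1), np.full(K, alpha)])
    return np.linalg.solve(X.T @ X + A, X.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 400)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 400)   # toy smooth population
K = 15
knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])
X = pspline_basis(x, knots, q=1)
beta = pspline_coef(X, y, alpha=1.0, q=1, K=K)
```

Note the structure of A_alpha: the polynomial part of the fit is never penalized, so as \alpha grows the knot coefficients shrink toward zero and the fit collapses to an ordinary degree-q polynomial regression.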

The difference estimator is design unbiased and its design variance is

\mathrm{Var}(\hat{t}_{y,diff}) = \sum_{i \in U} \sum_{j \in U} (\pi_{ij} - \pi_i \pi_j) \frac{y_i - m_i}{\pi_i} \frac{y_j - m_j}{\pi_j}.  (1.8)

Obviously, the efficiency of \hat{t}_{y,diff} depends on how well the m_i approximate the variable y_i. The estimator (1.7) is infeasible because the m_i cannot be calculated. However, given a sample s, the m_i in (1.7) can be replaced by sample-based estimators, denoted by \hat{m}_i and constructed as follows. Define the diagonal matrix of inverse inclusion probabilities W = \mathrm{diag}_{j \in U}\{1/\pi_j\} and its sample submatrix W_s = \mathrm{diag}_{j \in s}\{1/\pi_j\}. Similarly, let X_s be the submatrix of X consisting of those rows for which j \in s, and let Y_s denote the column vector of response values y_j for j \in s. For fixed \alpha and under suitable regularity conditions, the \pi-weighted estimator

\hat{\beta} = (X_s^T W_s X_s + A_\alpha)^{-1} X_s^T W_s Y_s = G_\alpha Y_s  (1.9)

is a design-consistent estimator of \beta_U in (1.6). Define \hat{m}_i = m(x_i, \hat{\beta}) = x_i^T \hat{\beta}. Then the model-assisted P-spline estimator is defined as

\hat{t}_{y,spl} = \sum_{i \in U} \hat{m}_i + \sum_{i \in s} \frac{y_i - \hat{m}_i}{\pi_i}.  (1.10)

1.2.2 Smoothing Parameter Selection

Introducing the indicator function I_i = 1 if i \in s and 0 otherwise, and the indicator vector e_i, which is a zero vector except for an entry of one at position i, we can rewrite (1.10) as

\hat{t}_{y,spl} = \sum_{i \in s} \Big\{ \frac{1}{\pi_i} + \sum_{j \in U} (1 - I_j/\pi_j)\, x_j^T G_\alpha e_i \Big\} y_i \equiv \sum_{i \in s} w_i(s)\, y_i,

which shows that \hat{t}_{y,spl} is a linear estimator. In Breidt et al. (2005), it was shown that this estimator is design consistent and asymptotically design unbiased. Asymptotically, the design mean squared error of \hat{t}_{y,spl} is equivalent to the variance of the generalized difference estimator given in (1.8):

\mathrm{MSE}_p(\hat{t}_{y,spl}) = E_p(\hat{t}_{y,spl} - t_y)^2 \approx \sum_{i,j \in U} (\pi_{ij} - \pi_i \pi_j) \frac{y_i - m_i}{\pi_i} \frac{y_j - m_j}{\pi_j}.  (1.11)
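A minimal Python sketch of the \pi-weighted fit (1.9) and the model-assisted estimator (1.10) follows; the function name and the toy setup are illustrative. When \alpha = 0 and y lies exactly in the span of the spline basis, the sample fit reproduces the population fit and the estimator returns the true total, which is a useful sanity check.

```python
import numpy as np

def model_assisted_total(X_U, y_s, s_idx, pi_s, alpha, q, K):
    """Model-assisted P-spline estimator (1.10):
    t_hat = sum_U m_hat_i + sum_s (y_i - m_hat_i) / pi_i,
    with beta_hat = (X_s' W_s X_s + A_alpha)^{-1} X_s' W_s y_s as in (1.9)."""
    X_s = X_U[s_idx]
    w = 1.0 / pi_s                     # design weights: diagonal of W_s
    A = np.diag(np.r_[np.zeros(q + 1), np.full(K, alpha)])
    beta = np.linalg.solve(X_s.T @ (w[:, None] * X_s) + A, X_s.T @ (w * y_s))
    m_hat = X_U @ beta                 # fitted values for the whole population
    return float(m_hat.sum() + np.sum((y_s - m_hat[s_idx]) / pi_s))

# Toy check: a response that is exactly spline in x, fitted with alpha = 0.
rng = np.random.default_rng(3)
N, n, q, K = 500, 80, 1, 5
x = rng.uniform(0, 1, N)
knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])
X_U = np.column_stack([x**d for d in range(q + 1)] +
                      [np.maximum(x - k, 0.0)**q for k in knots])
y = X_U @ rng.normal(0, 1, q + 1 + K)
s_idx = rng.choice(N, size=n, replace=False)
t_hat = model_assisted_total(X_U, y[s_idx], s_idx, np.full(n, n / N), 0.0, q, K)
```

In this noiseless case the sample residuals y_i - \hat{m}_i vanish, so the difference correction in (1.10) is zero and t_hat equals the population total of the fitted values.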

Finally, Breidt et al. (2005) provide a design consistent and asymptotically design unbiased estimator of \mathrm{MSE}_p(\hat{t}_{y,spl}),

\hat{V}(\hat{t}_{y,spl}) = \sum_{i,j \in s} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \frac{y_i - \hat{m}_i}{\pi_i} \frac{y_j - \hat{m}_j}{\pi_j}.  (1.12)

The problem is that minimizing \hat{V}(\hat{t}_{y,spl}) does not lead to the minimizer of the MSE, since the \hat{m}_i can be made arbitrarily close to the y_i by letting \alpha \to 0. In Opsomer and Miller (2005), \hat{V}(\hat{t}_{y,spl}) is modified so that it provides a more suitable criterion. Specifically, each estimator \hat{m}_i is replaced by the leave-one-out estimator \hat{m}_{(i)}. This estimator is readily derived by defining a modified smoothing vector w_{si}^* with elements

w_{sij}^* = w_{sij}/(1 - w_{sii}) if j \neq i, and w_{sij}^* = 0 if j = i,

where w_{sij} denotes the jth element of the vector w_{si} = (x_i^T G_\alpha)^T, and setting \hat{m}_{(i)} := \sum_{j \in s} w_{sij}^* y_j. The proposed modification of \hat{V}(\hat{t}_{y,spl}) is defined as

\hat{V}_{CV}(\hat{t}_{y,spl}) := \sum_{i,j \in s} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \frac{y_i - \hat{m}_{(i)}}{\pi_i} \frac{y_j - \hat{m}_{(j)}}{\pi_j}.  (1.13)

We will refer to \hat{V}_{CV}(\hat{t}_{y,spl}), denoted by \hat{V}_{CV}(\alpha), as the CV criterion for smoothing parameter selection in function estimation. We will write \hat{\alpha}_{CV} for the minimizer of \hat{V}_{CV}(\alpha), and use it as an estimator of \alpha_{opt}, the minimizer of \mathrm{MSE}_p(\hat{t}_{y,spl}).

1.3 Theoretical Properties

In order to prove our theoretical results, we make the following technical assumptions (see Breidt and Opsomer (2000) and Breidt et al. (2005)). For simplicity, we will only consider the case with the sample size, denoted by n_N, fixed for each N, and we also assume that n_N \to \infty as N \to \infty. As above, q, K and \{\kappa_k\} are fixed.

A1. (Sampling rate) As N \to \infty, n_N/N \to \pi \in (0, 1).

A2. (Inclusion probabilities) For all N, \min_{i \in U} \pi_i \geq \lambda > 0 and \min_{i,j \in U} \pi_{ij} \geq \lambda^* > 0, and \limsup_{N \to \infty} n_N \max_{i,j \in U, i \neq j} |\Delta_{ij}| < \infty, where \Delta_{ij} = \pi_{ij} - \pi_i \pi_j.
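The leave-one-out fits in (1.13) need not be obtained by refitting n times: for a linear smoother, the modified weights above imply y_i - \hat{m}_{(i)} = (y_i - \hat{m}_i)/(1 - w_{sii}). The Python sketch below uses this shortcut and then evaluates the criterion specialized to SRS without replacement, where \pi_i = n/N and \pi_{ij} = n(n-1)/(N(N-1)); function names are illustrative.

```python
import numpy as np

def loo_residuals(X_s, y_s, pi_s, alpha, q, K):
    """Leave-one-out residuals y_i - m_hat_(i) for the pi-weighted P-spline
    smoother, via y_i - m_hat_(i) = (y_i - m_hat_i) / (1 - w_sii)."""
    w = 1.0 / pi_s
    A = np.diag(np.r_[np.zeros(q + 1), np.full(K, alpha)])
    G = np.linalg.solve(X_s.T @ (w[:, None] * X_s) + A, X_s.T * w)  # beta = G y_s
    S = X_s @ G                                  # smoother matrix: m_hat = S y_s
    return (y_s - S @ y_s) / (1.0 - np.diag(S))

def v_cv_srs(e, n, N):
    """Criterion (1.13) under SRS: pi_i = n/N, pi_ij = n(n-1)/(N(N-1))."""
    pi1 = n / N
    pi2 = n * (n - 1) / (N * (N - 1))
    E = np.outer(e, e) / pi1**2
    diag = np.trace(E)
    # i = j terms use (pi_i - pi_i^2)/pi_i = 1 - pi_i; off-diagonal terms
    # use (pi_ij - pi_i pi_j)/pi_ij.
    return (1 - pi1) * diag + (pi2 - pi1**2) / pi2 * (E.sum() - diag)
```

The shortcut is exact for any ridge-type linear smoother with a fixed penalty, which is what makes minimizing \hat{V}_{CV}(\alpha) over a grid of \alpha values cheap.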

A3. Let D_s = \frac{1}{N}(X_s^T W_s X_s + A_\alpha) and D_U = \frac{1}{N}(X^T X + A_\alpha) = E_p[D_s]. Assume that D_s^{-1} and D_U^{-1} exist for all \alpha \in H_\alpha, where H_\alpha is a fixed interval between 0 and some constant C_\alpha with 0 < C_\alpha < \infty.

A4. \max_{i \in U} |y_i| \leq C_y and \max_{i \in U, j \in \{1,\ldots,p\}} |x_{ij}| \leq C_x, where p = 1 + q + K is the dimension of x_i, and C_y and C_x are positive constants.

A5. Additional assumptions involving higher-order inclusion probabilities:

\lim_{N \to \infty} \max_{(i,j,k) \in D_{3,N}} |E_p[(I_i - \pi_i)(I_j I_k - \pi_{jk})]| = 0,

\lim_{N \to \infty} \max_{(i,j,k,l) \in D_{4,N}} |E_p[(I_i I_j - \pi_{ij})(I_k I_l - \pi_{kl})]| = 0,

\limsup_{N \to \infty} \frac{n_N^2}{N} \max_{(i,j,k) \in D_{3,N}} |E_p[(I_i - \pi_i)^2 (I_j - \pi_j)(I_k - \pi_k)]| < \infty,

\limsup_{N \to \infty} \frac{n_N^2}{N^2} \max_{(i,j,k,l) \in D_{4,N}} |E_p[(I_i - \pi_i)(I_j - \pi_j)(I_k - \pi_k)(I_l - \pi_l)]| < \infty,

where D_{t,N} denotes the set of all distinct t-tuples from U.

Assumption A3 ensures that \hat{\beta} and \beta_U exist for all \alpha \in H_\alpha; this assumption depends on the knots, the penalty constant \alpha, and the distribution of the x_i. The following result establishes design consistency of the variance estimator of \hat{t}_{y,spl}.

Theorem 1.3.1. Let w_{sii} = \frac{1}{N} x_i^T D_s^{-1} x_i / \pi_i, and let assumptions A1-A5 hold. Then the auxiliary population \{x_j\}_{j \in U}, error population \{\varepsilon_j\}_{j \in U}, and sample s are such that

\sup_{\alpha \in H_\alpha} |\hat{V}_{CV}(\hat{t}_{y,spl}) - \mathrm{Var}(\hat{t}_{y,diff})| = o_p(N^2/n_N),

where \mathrm{Var}(\hat{t}_{y,diff}) = \sum_{i,j \in U} \Delta_{ij} \frac{y_i - m_i}{\pi_i} \frac{y_j - m_j}{\pi_j}.

Proof of Theorem 1.3.1: Since y_i - \hat{m}_{(i)} = (y_i - \hat{m}_i)/(1 - w_{sii}), we can write

\hat{V}_{CV}(\hat{t}_{y,spl}) = \sum_{i \in s} \sum_{j \in s} \frac{\Delta_{ij}}{\pi_{ij}} \frac{y_i - \hat{m}_i}{\pi_i (1 - w_{sii})} \frac{y_j - \hat{m}_j}{\pi_j (1 - w_{sjj})}

= \sum_{i \in s} \sum_{j \in s} \frac{\Delta_{ij}}{\pi_{ij}} \frac{y_i - m_i}{\pi_i (1 - w_{sii})} \frac{y_j - m_j}{\pi_j (1 - w_{sjj})}
+ \sum_{i \in s} \sum_{j \in s} \frac{\Delta_{ij}}{\pi_{ij}} \frac{m_i - \hat{m}_i}{\pi_i (1 - w_{sii})} \frac{m_j - \hat{m}_j}{\pi_j (1 - w_{sjj})}
+ 2 \sum_{i \in s} \sum_{j \in s} \frac{\Delta_{ij}}{\pi_{ij}} \frac{y_i - m_i}{\pi_i (1 - w_{sii})} \frac{m_j - \hat{m}_j}{\pi_j (1 - w_{sjj})}

= V_1 + V_2 + V_3.

20 9 In this expression, V = j U ij y i m i + y i m i ij j U = V + V. y j m j w sii w sjj y j m j w sii w sjj ( Ii I j j And, V = + j U ij y i m i j U ij y i m i = Var ( t y,diff + V. y j m j y j m j ( w sii w sjj Suppose assumptions A, A hold, from Lemma A..3 and A..5, we can show that sup V sup α H α α H α α H α ( (y i m i sup ij (i,j D,N { λ π i y i m i sup max y i m i α H α + { max λ ij (i,j D,N (i,j D,N ( N = O ( + O n ( N N = O, y j m j } sup g ii ( Ds + α H α max i,j U ( g ij Ds ( g ij Ds sup max y i m i α H α } sup α H α max i,j U ( g ij Ds as the fact that max ii = max (. Rewrite V as follows, V = ( y i m i y j m j Ii I j ij j j U + ( ( y i m i y j m j Ii I j ij w sii w sjj j j U = V + V.

21 0 Where, E p [V ] = 0, and Var p [V ] = E p [ V ] = i,j,k,l U ij kl y i m i y j m j y k m k π k [( ( ] y l m l Ii I j Ik I l E p π l j π kl Then, by assumptions A, A, A5 and Lemma A..3, sup Var p [V ] { max α H α λ ij max kl sup max y 4 i m i i,j U k,l U α H i,j,k,l U α = max i,j,k,l U λ 5 λ + 4 λ 4 λ + 4 λ 5 λ + λ 6 [( ( E Ii I j Ik I l p ] j π kl ( max E p [(I i I j j (I k π k ] (i,j,k D 3,N ( max E p [(I i I j j (I i I k k ] (i,j,k D 3,N (i,j,k D 3,N O (i,j,k D 3,N O (i,j D,N O O ( max ( n N } 4 max (i,j D,N E p [(I i I j j (I i ] [ E p (Ii ] + ( λ 4 λ O max E n p [(I i I j j (I k I l π kl ] (i,j,k,l D N (i,j,k,l D 4,N 4,N ( ( ( ( N 3 N 3 N N 4 = o + O + O + O (N + o n N n N ( N 4 = o, n N which implies that sup α Hα V = o p ( N. By assumption A, max I i = max (, λ, (.4 and, max (i,j D,N I i I j j = max (, j λ, (.5

22 then V should have the same order as V. Therefore, by assumptions A, A, Lemma A..3 and A..5, it can be shown that sup α H α V { } sup max λ y 3 i m i sup α H α + λ λ (i,j D,N ( N = O ( + O n ( N N = O, { max ij (i,j D,N α H α max i,j U ( g ij Ds sup max y i m i α H α } sup α H α max i,j U as the fact that max ii = max (. Thus, it follows that Note Next, we will show that V sup V sup V + sup V α H α α H α α H ( ( α N N = o p + O ( N = o p. sup α H α V = o p = m i m i ij j U + m i m i ij j U = V + V. ( N. m j m j w sii w sjj m j m j w sii w sjj ( Ii I j j ( g ij Ds

23 And, V = m i m i ij j U + m i m i ij j U T = ( β βu j U m j m j m j m j ij x i x T j ( β βu j U ( w sii w sjj T x i x + ( β T ( j βu ij g ij Ds ( β βu = V + V. Suppose assumptions A, A and A4 hold for all α H α, by Lemma A.., it can be shown that sup V α H α λ sup T β βu π i ( x i x T i sup β βu α H α α H α and by Lemma A..5 + λ sup α H α β βu T (i,j D,N = o p (O (No p ( + o p (O ( N = o p, max ij x i x T j (i,j D,N ( N o p ( sup V α H α λ sup T β βu π i ( x i x T i sup α H α + λ sup α H α β βu T (i,j D,N ( sup max g ij Ds sup β βu α H α i,j U α H α ( N = o p (O (o p ( + o p (O n ( N N = o p, max α H α i,j U max ij x i x T j (i,j D,N o p ( sup α H α β βu ( g ij Ds sup β βu α H α

24 3 as the fact that max ii = max (. Then, it follows that Rewrite V as V = m i m i ij j U + m i m i ij j U T = ( β βu sup V sup V + sup V α H α α H α α H ( ( α N = o p + o p NnN j U T + ( β βu = V + V. = o p ( N m j m j. ( Ii I j j ( m j m j ij x i x T j j U ( Ii I j w sjj ( β βu w sii ( Ii I j j ij x i x T j g ij ( Ds ( Ii I j j ( β βu By (.4 and (.5, V should have the same order as V. It follows that ( N sup V = o p. α H α Therefore, it can be shown that sup V sup V + sup V α H α α H α α H ( ( α N N = o p + o p = o p ( N. j Finally, we will show that sup α H α V 3 = o p ( N.

25 4 Note V 3 = j U ij y i m i + y i m i ij j U = V 3 + V 3, m j m j w sii w sjj m j m j w sii w sjj ( Ii I j j where V 3 = j U ij y i m i + j U = y i m i ij j U + y i m i ij j U = V 3 + V 3. ij y i m i m j m j m j m j x T ( j β π U β j ( w sii w sjj x T ( ( j g ij Ds β π U β j Suppose assumptions A, A and A4 hold, from Lemma A.., A..3 and A..5 we can show that sup V 3 = sup α H α α H α j U ij y i m i x T ( j β π U β j sup max λ y i m i max x i T sup β βu α H α α H α + max λ ij sup max y i m i max x j T sup β βu (i,j D,N α H (i,j D α j U α H α,n ( N = O (No p ( + O o p ( ( N = o p,

26 5 and, sup V 3 = sup α H α α H α j U ij y i m i x T ( ( j g ij Ds β π U β j sup max λ y i m i max x i T sup α H α max α H α i,j U + max λ ij sup max y i m i max x j T (i,j D,N α H (i,j D α j U,N ( sup max g ij Ds sup β βu α H α i,j U α H ( α N = O (o p ( + O o p ( n ( N N = o p, as the fact that max ii = max (. Thereby, Write V 3 as follows V 3 = j U sup V 3 sup V 3 + sup V 3 α H α α H α α H ( ( α N = o p + o p NnN ij y i m i + j U = y i m i ij j U + y i m i ij j U = V 3 + V 3. ij y i m i = o p ( N m j m j x T j. ( Ii I j j ( m j m j w sii ( Ii I ( β j β π U ij x T j g ij ( Ds ( Ii I j j ( g ij Ds sup β βu α H α ( Ii I j w sjj ( β U β j

Similarly, from (1.14) and (1.15), V_{32} has the same order as V_{31}. It follows that

\sup_{\alpha \in H_\alpha} |V_3| \leq \sup_{\alpha \in H_\alpha} |V_{31}| + \sup_{\alpha \in H_\alpha} |V_{32}| = o_p(N^2/n_N) + o_p(N^2/n_N) = o_p(N^2/n_N).

Therefore,

\sup_{\alpha \in H_\alpha} |\hat{V}_{CV}(\hat{t}_{y,spl}) - \mathrm{Var}(\hat{t}_{y,diff})| \leq \sup_{\alpha \in H_\alpha} |V_1 - \mathrm{Var}(\hat{t}_{y,diff})| + \sup_{\alpha \in H_\alpha} |V_2| + \sup_{\alpha \in H_\alpha} |V_3| = o_p(N^2/n_N),

and the result follows.

Theorem 1.3.2. Let assumptions A1-A5 hold. Then the auxiliary population \{x_j\}_{j \in U}, error population \{\varepsilon_j\}_{j \in U}, and sample s are such that

\lim_{N \to \infty} \sup_{\alpha \in H_\alpha} \frac{n_N}{N^2} |\mathrm{MSE}_p(\hat{t}_{y,spl}) - \mathrm{Var}(\hat{t}_{y,diff})| = 0,

where \mathrm{Var}(\hat{t}_{y,diff}) = \sum_{i,j \in U} \Delta_{ij} \frac{y_i - m_i}{\pi_i} \frac{y_j - m_j}{\pi_j}.

Proof of Theorem 1.3.2: Let

a_N = \sum_{i \in U} (y_i - m_i)\Big(\frac{I_i}{\pi_i} - 1\Big), \qquad b_N = \sum_{i \in U} (m_i - \hat{m}_i)\Big(\frac{I_i}{\pi_i} - 1\Big).

Then

E_p[a_N^2] = \sum_{i,j \in U} \Delta_{ij} \frac{y_i - m_i}{\pi_i} \frac{y_j - m_j}{\pi_j} = \mathrm{Var}(\hat{t}_{y,diff}).

28 7 And under assumptions A, A, by Lemma A..3, [ ] E p a { } N max λ π i ( sup max y i m i α H α + { max λ ij sup max y i m i (i,j D,N α H (i,j D α,n ( N = O (N + O ( N = O, as the fact that max ii = max (. Then it follows that ( sup α H α N E [ ] p a N N = O. }

29 8 Let b k denote the kth element of β β U, from assumptions A, A, A4 and A5, by Lemma A.. we can show that [ ] E p b N = E p ( ( Ii Ij x ik x jl b k b l (i,j U k,l {,...,p} k,l {,...,p} k,l {,...,p} ( = O (i,j D,N (i,j D,N E p [b k b l ] E p (i,j,i D 3,N λ sup (i,j,i,j D 4,N α H α (i,j U ( ( Ii Ij x ik x jl max E p [b 4 k ] { } 4 max k {,...,p} x ik max E p [(I i 4],k {,...,p} { } 4 [ max x Ep ik max (I i (I j ],k {,...,p} (i,j D,N { } 4 [ max x Ep ik max (I i 3 (I j ],k {,...,p} (i,j D,N { } 4 [ max x Ep ik max (I i (I j (I i ],k {,...,p} (i,j,i D 3,N { max x ik,k {,...,p} } max E p [(I i (I j (I i (I j ] (i,j,i,j D 4,N ( ( ( O N + O O (N + O ( +O ( N = o which implies that and,, ( N O sup α H α N E p [ a N b N ] } 4 ( sup α H α N E [ ] p b N N = o, = ( ( N 3/ O (N + O O nn sup α H α N E p [a N ] sup α H α N E p [b N ] ( ( N N O o ( N = o.

Then, rewriting \mathrm{MSE}_p(\hat{t}_{y,spl}), we obtain

\mathrm{MSE}_p(\hat{t}_{y,spl}) = E_p[(\hat{t}_{y,spl} - t_y)^2]
= E_p\Big[\Big(\sum_{i \in U} \hat{m}_i + \sum_{i \in s} \frac{y_i - \hat{m}_i}{\pi_i} - \sum_{i \in U} y_i\Big)^2\Big]
= E_p\Big[\Big(\sum_{i \in U} (y_i - \hat{m}_i)\Big(\frac{I_i}{\pi_i} - 1\Big)\Big)^2\Big]
= E_p[(a_N + b_N)^2]
= E_p[a_N^2] + E_p[b_N^2] + 2 E_p[a_N b_N].

Therefore,

\sup_{\alpha \in H_\alpha} \frac{n_N}{N^2} |\mathrm{MSE}_p(\hat{t}_{y,spl}) - \mathrm{Var}(\hat{t}_{y,diff})|
= \sup_{\alpha \in H_\alpha} \frac{n_N}{N^2} |\mathrm{MSE}_p(\hat{t}_{y,spl}) - E_p[a_N^2]|
\leq \sup_{\alpha \in H_\alpha} \frac{n_N}{N^2} E_p[b_N^2] + 2 \sup_{\alpha \in H_\alpha} \frac{n_N}{N^2} E_p[|a_N b_N|] = o(1),

and the result follows.

Corollary 1.3.1. Let assumptions A1-A5 hold. Then the auxiliary population \{x_j\}_{j \in U}, error population \{\varepsilon_j\}_{j \in U}, and sample s are such that

\sup_{\alpha \in H_\alpha} |\hat{V}_{CV}(\hat{t}_{y,spl}) - \mathrm{MSE}_p(\hat{t}_{y,spl})| = o_p(N^2/n_N).

The theory derived above for the P-spline estimator therefore shows that it is possible to use \hat{V}_{CV}(\hat{t}_{y,spl}), denoted by \hat{V}_{CV}(\alpha), as an asymptotically equivalent criterion to \mathrm{MSE}_p(\hat{t}_{y,spl}), denoted by \mathrm{MSE}_p(\alpha), for selecting an optimal smoothing parameter \alpha_{opt}. In Section 1.4, we evaluate how well this selection criterion works.

1.4 Simulation Results

In this section, we follow Opsomer and Miller (2005) in the design of a simulation study. A random population of N = 1000 values of x is generated from the uniform distribution on [0, 1], and 1000 values for the errors \varepsilon are drawn from N(0, 1). This one error population is used for all simulations, up to multiplication by \sigma. Eight populations of y are generated as

y_{il} = m_l(x_i) + \varepsilon_i, \quad 1 \leq i \leq 1000, \; 1 \leq l \leq 8,

where \{m_l\}_{l=1}^{8} are predefined functions given in Table 1.1. The finite population quantities of interest are t_{y,l} = \sum_{i=1}^{1000} y_{il} for each l.

1. Linear: m_1(x) = 2x
2. Quadratic: m_2(x) = 1 + 2(x - 0.5)^2
3. Bump: m_3(x) = 2x + \exp(-200(x - 0.5)^2)
4. Jump: m_4(x) = 2x I_{\{x \leq 0.65\}} + 0.65 I_{\{x > 0.65\}}
5. Normal CDF: m_5(x) = \Phi(1.5 - 2x), where \Phi is the standard normal cdf
6. Exponential: m_6(x) = \exp(-8x)
7. Slow sine: m_7(x) = 2 + \sin(2\pi x)
8. Fast sine: m_8(x) = 2 + \sin(8\pi x)

Table 1.1 Eight population mean functions.

The samples are drawn by one of two designs, simple random sampling without replacement (SI) or stratified simple random sampling without replacement (STSI). For each simulation run, M = 1000 samples are drawn from \{(x_i, y_i)\}. For each sample, we compute the estimator \hat{t}_{y,spl} in equation (1.10) for \alpha_{opt} and \hat{\alpha}_{CV}. Following Opsomer and Miller (2005), the optimal smoothing parameters \alpha_{opt} for each population are not sample-based. We compute them by minimizing a simulation-based approximation to the function \mathrm{MSE}_p(\alpha), which is constructed by simulating repeated samples from these populations for a grid of smoothing parameters over the interval [0.0001, 10], and finding the functions \mathrm{MSE}_p(\alpha) by averaging over these simulations. For each sample, the smoothing parameter \hat{\alpha}_{CV} is sought through a search algorithm

implemented in R, which minimizes the expression

\hat{V}_{CV}(\alpha) := \sum_{i,j \in s} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \frac{y_i - \hat{m}_{(i)}}{\pi_i} \frac{y_j - \hat{m}_{(j)}}{\pi_j},

the same expression as (1.13). A simulation run is determined by the sample size n, the error variance \sigma^2, and the degree of the spline regression q. For the design of simple random sampling without replacement, simulations are done for n \in \{100, 200, 500\}, \sigma^2 \in \{0.01, 0.16\} and q \in \{1\}. The design of stratified simple random sampling without replacement uses 4 strata, each containing 250 elements; the stratification is based on a random variable z_i and a ratio r. First, we generate v_i from a normal distribution N(0, \sigma_v^2), with \sigma_v^2 satisfying r = \sigma_v^2/(\sigma^2 + \sigma_v^2). Then the z_i are derived as follows: z_i = v_i + \varepsilon_i if 0 < r < 1, z_i = v_i if r = 1, and z_i = \varepsilon_i if r = 0. After sorting by z_i (i = 1, \ldots, N), the population is separated into 4 strata with boundaries given by equally-spaced quantiles of z. Simulations are then conducted with stratum sample sizes \{(15, 20, 30, 35), (30, 40, 60, 70), (75, 100, 150, 175)\}, r \in \{0, 0.25, 0.5, 0.75, 1\}, \sigma^2 \in \{0.01, 0.16\}, and q = 1. Thus, the strata have different sampling rates, with the inclusion probability correlated with the model error.

As mentioned in Breidt et al. (2005), for m_1 and m_2 the models are polynomial; the remaining mean functions represent various departures from the polynomial model. The mean function m_3 is mostly linear over its range, except for a bump over a small portion of the range of x. Function m_4 is not a smooth function. The sigmoidal function m_5 is a cumulative distribution function, and m_6 is an exponential curve. The function m_7 is a sinusoid completing one full cycle on [0, 1], while m_8 completes four full cycles. Since the true \alpha_{opt} in the case where the model is correctly specified increases to infinity, for simplicity we restrict the range for searching for the minimizers \alpha_{opt} and \hat{\alpha}_{CV} to (0, 10].
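A hypothetical re-creation of one SI run, generating a population, drawing one SRS sample, and grid-searching \alpha by minimizing the CV criterion, might look as follows in Python (the thesis used R; the grid, seed, error scale and population below are illustrative choices, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, q, K = 1000, 100, 1, 20

# Illustrative "linear" population y = 2x + error; error scale chosen arbitrarily.
x = rng.uniform(0, 1, N)
y = 2 * x + rng.normal(0, 0.4, N)

knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])
X = np.column_stack([x**d for d in range(q + 1)] +
                    [np.maximum(x - k, 0.0)**q for k in knots])
s = rng.choice(N, size=n, replace=False)
Xs, ys = X[s], y[s]

def v_cv(alpha):
    # Constant SRS weight N/n is absorbed into the penalty: A_eff = (n/N) * alpha.
    A = np.diag(np.r_[np.zeros(q + 1), np.full(K, (n / N) * alpha)])
    S = Xs @ np.linalg.solve(Xs.T @ Xs + A, Xs.T)      # smoother matrix
    e = (ys - S @ ys) / (1.0 - np.diag(S))             # leave-one-out residuals
    pi1, pi2 = n / N, n * (n - 1) / (N * (N - 1))
    E = np.outer(e, e) / pi1**2
    diag = np.trace(E)
    return (1 - pi1) * diag + (pi2 - pi1**2) / pi2 * (E.sum() - diag)

grid = 10.0 ** np.linspace(-4, 1, 12)                  # illustrative search grid
vals = [v_cv(a) for a in grid]
alpha_cv = float(grid[int(np.argmin(vals))])
```

Repeating this over M samples, and over the populations and designs of the study, would yield the kind of \hat{\alpha}_{CV} summaries reported in the tables that follow.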

33 Population σ n α CV MSE bαcv α opt MSE αopt.linear Quadratic Bump Jump Normal CDF Exponential Slow sine Continued...

34 3 Population σ n α CV MSE bαcv α opt MSE αopt 8.Fast sine Table.: CV smoothing parameters α CV and optimal smoothing parameters α opt with their corresponding MSEs based on 000 replications of simple random sampling from all populations of size N = 000. Table. shows the cross-validation smoothing parameters α CV and the optimal smoothing parameters α opt for linear spline regression under SI simulation runs. Apparently, in agreement with α opt, α CV varies widely across functions. CV and optimal smoothing parameters are generally an increasing function of the closeness of the relationship between y and x (as measured by σ. The difference between CV and optimal smoothing parameters is generally a decreasing function of the sample size. Population σ n α CV MSE bαcv α opt MSE αopt.linear Quadratic Bump Continued...

Table 1.3: CV smoothing parameters α_CV and optimal smoothing parameters α_opt, with their corresponding MSEs, for all eight populations, based on 1000 replications of stratified simple random sampling from all populations of size N = 1000 with r = 0.5. (Numeric entries omitted.)

The same overall behavior can be seen in the results of Table 1.3, which displays the CV and optimal smoothing parameters for linear spline regression estimation under the STSI design with r = 0.5.

Table 1.4: CV smoothing parameters α_CV and optimal smoothing parameters α_opt, with their corresponding MSEs, for all eight populations, based on 1000 replications of stratified simple random sampling from populations of size N = 1000, with model variance σ² = 0.01 and stratum sample sizes n = (30, 40, 60, 70). (Numeric entries omitted.)

Table 1.4 displays the CV smoothing parameters α_CV and the optimal smoothing parameters α_opt, with their corresponding MSEs, under the design of stratified simple random sampling without replacement for linear spline regression estimation with model variance σ² = 0.01 and stratum sample sizes n = (30, 40, 60, 70). We find that α_CV varies within a narrow range and that it tracks α_opt well. Because a decrease in r weakens the relationship between the inclusion probabilities and the model errors, it is reasonable to find that the MSE decreases as r increases. Also, the optimal α and its estimator remain stable across the values of r, so the method we propose works even when the design effect is important.

From the above tables, we can see that the smoothing parameter selection method provides a reasonably accurate estimate of the minimizer of the MSE. Even in the cases where it leads to a value further from the true minimizer, it performs well in the sense of selecting a smoothing parameter that attains close to the best possible MSE. This finding holds across the sample sizes, model variances and sampling designs we considered. Generally, when the MSE has a minimum as a function of the smoothing parameter, the CV criterion (1.3) successfully approximates that minimum.

1.5 Conclusion

In this chapter, we proposed a design-based CV criterion for selecting the smoothing parameter in penalized spline regression estimation.
First, we developed theoretical results by proving that the design-based properties of the P-spline regression estimator hold uniformly for a range of smoothing parameter values, making smoothing parameter selection possible.

Those results were then applied to show that the proposed method for smoothing parameter selection is asymptotically equivalent to minimizing the MSE of the estimator. Through a simulation study, we showed that the estimated smoothing parameter usually tracks the optimal parameter quite well. Hence, we recommend the design-based CV criterion whenever data-driven smoothing parameter selection for survey estimation is required, until alternative methods are developed for smoothing parameter selection in the finite population estimation context.

39 8 CHAPTER. CV as Improved Variance Estimation for Model-Assisted Estimators. Introduction A characteristic of sampling survey is to use the available auxiliary information of the population to improve the precision of design-based estimators. The regression estimators have been used for a long time in survey estimation to make auxiliary information be efficiently used. Särndal (98 proposed a variance estimator for the regression estimator. And Särndal, Swensson, and Wretman (989 studied its properties with respect to the design and the model respectively. It has been shown that the estimator is design consistent and approximately unbiased. In this chapter, we will propose a new variance estimator based on the leave-one-out or cross-validation (CV principle following the construction of CV criterion in Opsomer and Miller (005. The purpose of this chapter is to introduce the new variance estimator, explore its general properties and compare it to other variance estimators used currently. Section. will give the definition of the regression estimator and introduce the CV-based variance estimator. In section.3 we state assumptions used in the theoretical derivations and our main theoretical results are described. In section.4, we report simulation results, which show how well the CV-based variance estimator works in practice.

2.2 Definition of the Estimator

We consider the estimation of a finite population mean, ȳ_N = (1/N) Σ_{i∈U} y_i. As in Chapter 1, U = {1, 2, ..., N} is a finite population with N identifiable elements and y_i is the value of a response variable for the ith element. A sample of population elements s ⊂ U is selected with probability p(s). Let π_i = Pr(i ∈ s) = Σ_{s: i∈s} p(s) > 0 denote the inclusion probability for element i. Then the π estimator of ȳ_N is

ȳ_π = (1/N) Σ_{i∈s} y_i/π_i,   (2.1)

which is based on the Horvitz-Thompson estimator (Horvitz and Thompson 1952). The variance of ȳ_π under the sampling design is

Var(ȳ_π) = (1/N²) Σ_{i,j∈U} (π_ij − π_i π_j) (y_i/π_i)(y_j/π_j),   (2.2)

where π_ij = Pr(i ∈ s, j ∈ s) is the joint inclusion probability for elements i, j ∈ U. Suppose there is auxiliary information x_i available for all of U, and denote the value of the auxiliary variable vector for the ith element by x_i = (x_i1, ..., x_iJ)^T, where J is the number of auxiliary variables. Then the general regression estimator, denoted by ȳ_reg, is defined from t_yr in Särndal et al. (1992) as follows:

ȳ_reg = ȳ_π + (X̄_N − X̄_π)^T β̂,   (2.3)

where

X̄_π = (1/N) Σ_{i∈s} x_i/π_i = t̂_xπ/N

is the π estimator of the known

X̄_N = (1/N) Σ_{i∈U} x_i = t_x/N.
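As a small concrete check of the π estimator ȳ_π defined above, the following sketch (base Python with illustrative names, not code from the thesis) computes it for a toy population:

```python
# The pi estimator of the population mean: ybar_pi = (1/N) sum_{i in s} y_i / pi_i.

def pi_estimator(y, pi, sample, N):
    return sum(y[i] / pi[i] for i in sample) / N

# Under simple random sampling without replacement, pi_i = n/N for every i,
# and the pi estimator reduces to the sample mean of the observed y's:
y = [float(v) for v in range(10)]     # toy population, N = 10
sample = [1, 3, 5, 7, 9]              # n = 5
pi = [5 / 10] * 10
print(pi_estimator(y, pi, sample, 10))   # 5.0, equal to the sample mean
```

For unequal inclusion probabilities the two no longer coincide, which is exactly the inverse-probability weighting that makes ȳ_π design unbiased.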

And β̂ is

β̂ = (β̂_1, ..., β̂_J)^T = (Σ_{i∈s} x_i x_i^T/(σ_i² π_i))^{-1} Σ_{i∈s} x_i y_i/(σ_i² π_i),   (2.4)

with the σ_i² assumed known up to a proportionality constant. As shown in (2.3), the regression estimator is explicitly the π estimator plus an adjustment term. When the regression estimator works well, the adjustment term is negatively correlated with the error of the π estimator. The regression model ξ motivating (2.3), with y as the response and x_1, ..., x_J as the regressors, has the following features:

i. y_1, ..., y_N are assumed to be realized values of independent random variables Y_1, ..., Y_N;

ii. E_ξ(Y_i) = x_i^T β;

iii. V_ξ(Y_i) = σ_i² (i = 1, ..., N);

where E_ξ and V_ξ denote expected value and variance with respect to the model ξ, and β and the σ_i² are model parameters. Under the model ξ, the population-level weighted least-squares estimator of β = (β_1, ..., β_J)^T is

β_N = (β_N1, ..., β_NJ)^T = (Σ_{i∈U} x_i x_i^T/σ_i²)^{-1} Σ_{i∈U} x_i y_i/σ_i²,   (2.5)

which can be written in the more familiar form from regression analysis,

β_N = (X^T V^{-1} X)^{-1} X^T V^{-1} Y,

where X is the N × J matrix with rows x_i^T = (x_i1, ..., x_iJ), Y = (y_1, ..., y_N)^T, and V is the N × N diagonal matrix with diagonal elements σ_1², ..., σ_N². It has been shown that β_N is the best linear unbiased estimator of β under the model. Note that β_N is a finite population characteristic unknown to us, but we can estimate it using

sample data by applying π estimation (inverse probability weighting). Write the unknown β_N as

β_N = T^{-1} t,   (2.6)

where

T = Σ_{i∈U} x_i x_i^T/σ_i²;  t = Σ_{i∈U} x_i y_i/σ_i².   (2.7)

T is a symmetric J × J matrix and t is a J-vector. The typical elements of T and t are population product totals, denoted respectively by

t_jj' = Σ_{i∈U} x_ij x_ij'/σ_i² = t_j'j;  t_j0 = Σ_{i∈U} x_ij y_i/σ_i².   (2.8)

The π estimators of T and t are, respectively,

T̂ = Σ_{i∈s} x_i x_i^T/(σ_i² π_i);  t̂ = Σ_{i∈s} x_i y_i/(σ_i² π_i).   (2.9)

Their typical elements are given by

t̂_jj' = Σ_{i∈s} x_ij x_ij'/(σ_i² π_i) = t̂_j'j;  t̂_j0 = Σ_{i∈s} x_ij y_i/(σ_i² π_i),   (2.10)

which are unbiased for t_jj' and t_j0, respectively. The population parameter β_N is then estimated by

β̂ = T̂^{-1} t̂ = (Σ_{i∈s} x_i x_i^T/(σ_i² π_i))^{-1} Σ_{i∈s} x_i y_i/(σ_i² π_i),   (2.11)

which is the same expression as proposed in (2.4). Following Särndal et al. (1992), note that

ȳ_reg = ȳ_π + (X̄_N − X̄_π)^T β̂ = (1/N) Σ_{i∈s} g_is y_i/π_i = (1/N) Σ_{i∈U} ŷ_i + (1/N) Σ_{i∈s} e_is/π_i = (1/N) Σ_{i∈U} y_i^0 + (1/N) Σ_{i∈s} g_is E_i/π_i,   (2.12)

where g_is = 1 + (t_x − t̂_xπ)^T T̂^{-1} x_i/σ_i², e_is = y_i − ŷ_i = y_i − x_i^T β̂, and E_i = y_i − y_i^0 = y_i − x_i^T β_N. The regression estimator ȳ_reg is approximated through Taylor linearization by

ȳ_reg,0 = ȳ_π + (X̄_N − X̄_π)^T β_N = (1/N) Σ_{i∈U} y_i^0 + (1/N) Σ_{i∈s} Ě_i,   (2.13)

where Ě_i = E_i/π_i. The approximate variance is then given by

AV(ȳ_reg) = (1/N²) Σ_{i,j∈U} Δ_ij Ě_i Ě_j,   (2.14)

with Δ_ij = π_ij − π_i π_j. Replacing the unobservable Ě_i by the observable ě_is, we obtain the naive variance estimator

V_n(ȳ_reg) = (1/N²) Σ_{i,j∈s} Δ̌_ij ě_is ě_js,   (2.15)

where Δ̌_ij = Δ_ij/π_ij and ě_is = e_is/π_i. From (2.12), it follows that V(ȳ_reg) = (1/N²) V(Σ_{i∈s} g_is Ě_i). Disregarding the fact that the weights g_is are sample dependent, and inserting ě_is for Ě_i, we obtain the g-corrected variance estimator

V_g(ȳ_reg) = (1/N²) Σ_{i,j∈s} Δ̌_ij (g_is ě_is)(g_js ě_js).   (2.16)

Both estimators (2.15) and (2.16) are based on large-sample approximations, and for a given level α, confidence intervals based on either variance estimator attain approximately 100(1 − α)% coverage in repeated large samples; these intervals also rely on asymptotic normality. However, the available evidence suggests that (2.16) is preferable in a number of cases. This estimator was proposed in Särndal (1982), and Särndal, Swensson and Wretman (1989) studied its properties with respect to the design and the model, respectively. The estimator (2.16) has been shown to be design consistent and approximately unbiased. In this chapter, we propose a new variance estimator based on the leave-one-out or cross-validation (CV) principle. The CV variance estimator for ȳ_reg is given by

V_CV(ȳ_reg) = (1/N²) Σ_{i,j∈s} Δ̌_ij ě_is^(i) ě_js^(j),

where ě_is^(i) = e_is^(i)/π_i and e_is^(i) = y_i − ŷ_i^(i). In this expression, we replace ŷ_i by the leave-one-out estimator ŷ_i^(i). It can be shown that

ŷ_i^(i) = x_i^T (T̂ − x_i x_i^T/(σ_i² π_i))^{-1} (t̂ − x_i y_i/(σ_i² π_i))
       = x_i^T (T̂^{-1} + T̂^{-1} x_i x_i^T T̂^{-1}/(σ_i² π_i (1 − w_sii))) (t̂ − x_i y_i/(σ_i² π_i))
       = ŷ_i − w_sii y_i + w_sii (ŷ_i − w_sii y_i)/(1 − w_sii)
       = (ŷ_i − w_sii y_i)/(1 − w_sii),   (2.17)

where w_sii = x_i^T T̂^{-1} x_i/(σ_i² π_i). This implies that

e_is^(i) = y_i − ŷ_i^(i) = (y_i − ŷ_i)/(1 − w_sii) = e_is/(1 − w_sii).   (2.18)

Then, V_CV(ȳ_reg) can be expressed as

V_CV(ȳ_reg) = (1/N²) Σ_{i,j∈s} Δ̌_ij (ě_is/(1 − w_sii)) (ě_js/(1 − w_sjj)).   (2.19)

2.3 Theoretical Properties

In order to prove our theoretical results, we make the following technical assumptions. For simplicity, we consider only the case in which the sample size, denoted by n_N, is fixed for each N, and we also assume that n_N → ∞ as N → ∞. As above, J is fixed.

A1. (Sampling rate) As N → ∞, n_N/N → π ∈ (0, 1).

A2. (Inclusion probabilities π_i and π_ij) For all N, min_{i∈U} π_i ≥ λ_1 > 0 and min_{i,j∈U} π_ij ≥ λ_2 > 0, and lim sup_{N→∞} n_N max_{i,j∈U: i≠j} |Δ_ij| < ∞.

A3. Assume that (N^{-1} T)^{-1} and (N^{-1} T̂)^{-1} exist for all samples.

A4. max_{i∈U} |y_i| < C_y and max_{i∈U, j∈{1,...,J}} |x_ij| < C_x, where C_y and C_x are positive constants.

A5. 0 < σ_L² ≤ min_{i∈U} σ_i² ≤ max_{i∈U} σ_i² ≤ σ_U² < ∞, where σ_L² and σ_U² are positive constants.

A6. (Additional assumptions involving higher-order inclusion probabilities)

lim_{N→∞} max_{(i,j,k)∈D_{3,N}} |E_p[(I_i − π_i)(I_j I_k − π_j π_k)]| = 0,
lim_{N→∞} max_{(i,j,k,l)∈D_{4,N}} |E_p[(I_i I_j − π_ij)(I_k I_l − π_kl)]| = 0,
lim sup_{N→∞} n_N max_{(i,j,k)∈D_{3,N}} |E_p[(I_i − π_i)(I_j − π_j)(I_k − π_k)]| < ∞,
lim sup_{N→∞} n_N max_{(i,j,k,l)∈D_{4,N}} |E_p[(I_i − π_i)(I_j − π_j)(I_k − π_k)(I_l − π_l)]| < ∞,

where D_{t,N} denotes the set of all distinct t-tuples from U. Assumption A3 ensures that β̂ and β_N exist. The following results establish the design consistency of the variance estimator of ȳ_reg.

Theorem 2.3.1. Let assumptions A1-A6 hold. Then we have the following result:

MSE_p(ȳ_reg) = E_p(ȳ_reg − ȳ_N)² = (1/N²) Σ_{i,j∈U} Δ_ij (E_i/π_i)(E_j/π_j) + o(n_N^{-1}).

Proof of Theorem 2.3.1: Let

a_N = Σ_{i∈U} (y_i − y_i^0)(I_i/π_i − 1),  b_N = Σ_{i∈U} (y_i^0 − ŷ_i)(I_i/π_i − 1).

Then

E_p[a_N²] = Σ_{i,j∈U} (y_i − y_i^0)(y_j − y_j^0) Δ_ij/(π_i π_j) = Σ_{i,j∈U} Δ_ij E_i E_j/(π_i π_j).

And by assumptions A1 and A2 and Lemma A.1.3,

E_p[a_N²] ≤ (1/λ_1²) { N max_{i∈U} (y_i − y_i^0)² max_{i∈U} Δ_ii + N² max_{(i,j)∈D_{2,N}} |Δ_ij| max_{i∈U} (y_i − y_i^0)² }
= O(N) + O(N²/n_N)
= O(N²/n_N),

using the fact that max_{i∈U} Δ_ii = max_{i∈U} π_i(1 − π_i) ≤ 1. It then follows that

N^{-2} E_p[a_N²] = O(n_N^{-1}).

Let b_k denote the kth element of β̂ − β_N. Then

b_N = − Σ_{k∈{1,...,J}} b_k Σ_{i∈U} x_ik (I_i/π_i − 1),

so that

E_p[b_N²] = Σ_{k,l∈{1,...,J}} E_p[ b_k b_l Σ_{i,j∈U} x_ik x_jl (I_i/π_i − 1)(I_j/π_j − 1) ].

From assumptions A1, A2, A4 and A6, and by Lemma A.1.1, applying the Cauchy-Schwarz inequality to each summand, bounding the fourth moments E_p[b_k⁴] of the elements of β̂ − β_N, and bounding the design moments of the products of the terms (I_i − π_i) by A2 and A6, every term on the right-hand side is of smaller order than N²/n_N, so that

E_p[b_N²] = o(N²/n_N),

which implies that

N^{-2} E_p[b_N²] = o(n_N^{-1}).

Finally, by the Cauchy-Schwarz inequality,

N^{-2} |E_p[a_N b_N]| ≤ (N^{-2} E_p[a_N²])^{1/2} (N^{-2} E_p[b_N²])^{1/2} = O(n_N^{-1/2}) o(n_N^{-1/2}) = o(n_N^{-1}).

Rewrite MSE_p(ȳ_reg) as follows:

MSE_p(ȳ_reg) = E_p[(ȳ_reg − ȳ_N)²]
= N^{-2} E_p[ ( Σ_{i∈U} ŷ_i + Σ_{i∈s} (y_i − ŷ_i)/π_i − Σ_{i∈U} y_i )² ]
= N^{-2} E_p[ ( Σ_{i∈U} (y_i − ŷ_i)(I_i/π_i − 1) )² ]
= N^{-2} E_p[ ( Σ_{i∈U} ((y_i − y_i^0) + (y_i^0 − ŷ_i))(I_i/π_i − 1) )² ]
= N^{-2} E_p[ (a_N + b_N)² ]
= N^{-2} { E_p[a_N²] + E_p[b_N²] + 2 E_p[a_N b_N] }.

Then,

MSE_p(ȳ_reg) = (1/N²) Σ_{i,j∈U} Δ_ij (E_i/π_i)(E_j/π_j) + o(n_N^{-1}),

and therefore the result follows.

Theorem 2.3.2. Under assumptions A1-A6, we have that the estimator

V_CV(ȳ_reg) = (1/N²) Σ_{i,j∈s} Δ̌_ij (ě_is/(1 − w_sii))(ě_js/(1 − w_sjj))

satisfies

V_CV(ȳ_reg) = (1/N²) Σ_{i,j∈U} Δ_ij (E_i/π_i)(E_j/π_j) + o_p(n_N^{-1}).
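The leave-one-out identity e_is^(i) = e_is/(1 − w_sii) is what makes V_CV computable without refitting the regression once per sampled unit, and it can be checked numerically. The sketch below (base Python, single auxiliary variable J = 1, made-up data; the helper names are ours, not from the thesis) compares a brute-force refit without unit i against the closed-form shortcut:

```python
# With J = 1, T-hat and t-hat are scalars, so beta-hat = t-hat / T-hat.

def fit(x, y, pi, s2, sample):
    t_hat = sum(x[k] * y[k] / (s2[k] * pi[k]) for k in sample)
    T_hat = sum(x[k] * x[k] / (s2[k] * pi[k]) for k in sample)
    return t_hat / T_hat, T_hat

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.3, 2.8, 4.2, 5.1]
pi = [0.6, 0.5, 0.5, 0.6, 0.4]
s2 = [1.0] * 5
sample = [0, 1, 2, 3, 4]

beta, T_hat = fit(x, y, pi, s2, sample)
i = 2
w_sii = x[i] * x[i] / (s2[i] * pi[i]) / T_hat      # leverage of unit i
e_is = y[i] - x[i] * beta                          # full-sample residual

beta_loo, _ = fit(x, y, pi, s2, [k for k in sample if k != i])
e_loo = y[i] - x[i] * beta_loo                     # brute-force leave-one-out residual

print(abs(e_loo - e_is / (1.0 - w_sii)) < 1e-9)    # True: the shortcut matches
```

The same Sherman-Morrison argument used in (2.17) applies for general J, with w_sii = x_i^T T̂^{-1} x_i/(σ_i² π_i) playing the role of the scalar leverage here.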


More information

ASYMPTOTICS FOR PENALIZED SPLINES IN ADDITIVE MODELS

ASYMPTOTICS FOR PENALIZED SPLINES IN ADDITIVE MODELS Mem. Gra. Sci. Eng. Shimane Univ. Series B: Mathematics 47 (2014), pp. 63 71 ASYMPTOTICS FOR PENALIZED SPLINES IN ADDITIVE MODELS TAKUMA YOSHIDA Communicated by Kanta Naito (Received: December 19, 2013)

More information

Bootstrap inference for the finite population total under complex sampling designs

Bootstrap inference for the finite population total under complex sampling designs Bootstrap inference for the finite population total under complex sampling designs Zhonglei Wang (Joint work with Dr. Jae Kwang Kim) Center for Survey Statistics and Methodology Iowa State University Jan.

More information

Introduction to Regression

Introduction to Regression Introduction to Regression Chad M. Schafer May 20, 2015 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression Nonparametric Procedures Cross Validation Local Polynomial Regression

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

arxiv: v2 [math.st] 20 Jun 2014

arxiv: v2 [math.st] 20 Jun 2014 A solution in small area estimation problems Andrius Čiginas and Tomas Rudys Vilnius University Institute of Mathematics and Informatics, LT-08663 Vilnius, Lithuania arxiv:1306.2814v2 [math.st] 20 Jun

More information

New Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data

New Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data ew Local Estimation Procedure for onparametric Regression Function of Longitudinal Data Weixin Yao and Runze Li The Pennsylvania State University Technical Report Series #0-03 College of Health and Human

More information

Smooth simultaneous confidence bands for cumulative distribution functions

Smooth simultaneous confidence bands for cumulative distribution functions Journal of Nonparametric Statistics, 2013 Vol. 25, No. 2, 395 407, http://dx.doi.org/10.1080/10485252.2012.759219 Smooth simultaneous confidence bands for cumulative distribution functions Jiangyan Wang

More information

IEOR 165 Lecture 7 1 Bias-Variance Tradeoff

IEOR 165 Lecture 7 1 Bias-Variance Tradeoff IEOR 165 Lecture 7 Bias-Variance Tradeoff 1 Bias-Variance Tradeoff Consider the case of parametric regression with β R, and suppose we would like to analyze the error of the estimate ˆβ in comparison to

More information

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive

More information

Frontier estimation based on extreme risk measures

Frontier estimation based on extreme risk measures Frontier estimation based on extreme risk measures by Jonathan EL METHNI in collaboration with Ste phane GIRARD & Laurent GARDES CMStatistics 2016 University of Seville December 2016 1 Risk measures 2

More information

Nonparametric Econometrics

Nonparametric Econometrics Applied Microeconometrics with Stata Nonparametric Econometrics Spring Term 2011 1 / 37 Contents Introduction The histogram estimator The kernel density estimator Nonparametric regression estimators Semi-

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Department of Mathematics Carl von Ossietzky University Oldenburg Sonja Greven Department of

More information

REGRESSION WITH SPATIALLY MISALIGNED DATA. Lisa Madsen Oregon State University David Ruppert Cornell University

REGRESSION WITH SPATIALLY MISALIGNED DATA. Lisa Madsen Oregon State University David Ruppert Cornell University REGRESSION ITH SPATIALL MISALIGNED DATA Lisa Madsen Oregon State University David Ruppert Cornell University SPATIALL MISALIGNED DATA 10 X X X X X X X X 5 X X X X X 0 X 0 5 10 OUTLINE 1. Introduction 2.

More information

Regression: Lecture 2

Regression: Lecture 2 Regression: Lecture 2 Niels Richard Hansen April 26, 2012 Contents 1 Linear regression and least squares estimation 1 1.1 Distributional results................................ 3 2 Non-linear effects and

More information

Introduction to Regression

Introduction to Regression Introduction to Regression David E Jones (slides mostly by Chad M Schafer) June 1, 2016 1 / 102 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression Nonparametric Procedures

More information

4 Nonparametric Regression

4 Nonparametric Regression 4 Nonparametric Regression 4.1 Univariate Kernel Regression An important question in many fields of science is the relation between two variables, say X and Y. Regression analysis is concerned with the

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Monte Carlo Study on the Successive Difference Replication Method for Non-Linear Statistics

Monte Carlo Study on the Successive Difference Replication Method for Non-Linear Statistics Monte Carlo Study on the Successive Difference Replication Method for Non-Linear Statistics Amang S. Sukasih, Mathematica Policy Research, Inc. Donsig Jang, Mathematica Policy Research, Inc. Amang S. Sukasih,

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

A Shape Constrained Estimator of Bidding Function of First-Price Sealed-Bid Auctions

A Shape Constrained Estimator of Bidding Function of First-Price Sealed-Bid Auctions A Shape Constrained Estimator of Bidding Function of First-Price Sealed-Bid Auctions Yu Yvette Zhang Abstract This paper is concerned with economic analysis of first-price sealed-bid auctions with risk

More information

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo Outline in High Dimensions Using the Rodeo Han Liu 1,2 John Lafferty 2,3 Larry Wasserman 1,2 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University

More information

Preface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation

Preface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation Preface Nonparametric econometrics has become one of the most important sub-fields in modern econometrics. The primary goal of this lecture note is to introduce various nonparametric and semiparametric

More information

Combining Non-probability and Probability Survey Samples Through Mass Imputation

Combining Non-probability and Probability Survey Samples Through Mass Imputation Combining Non-probability and Probability Survey Samples Through Mass Imputation Jae-Kwang Kim 1 Iowa State University & KAIST October 27, 2018 1 Joint work with Seho Park, Yilin Chen, and Changbao Wu

More information

Chapter 5: Models used in conjunction with sampling. J. Kim, W. Fuller (ISU) Chapter 5: Models used in conjunction with sampling 1 / 70

Chapter 5: Models used in conjunction with sampling. J. Kim, W. Fuller (ISU) Chapter 5: Models used in conjunction with sampling 1 / 70 Chapter 5: Models used in conjunction with sampling J. Kim, W. Fuller (ISU) Chapter 5: Models used in conjunction with sampling 1 / 70 Nonresponse Unit Nonresponse: weight adjustment Item Nonresponse:

More information

Regularization Methods for Additive Models

Regularization Methods for Additive Models Regularization Methods for Additive Models Marta Avalos, Yves Grandvalet, and Christophe Ambroise HEUDIASYC Laboratory UMR CNRS 6599 Compiègne University of Technology BP 20529 / 60205 Compiègne, France

More information

Accounting for Complex Sample Designs via Mixture Models

Accounting for Complex Sample Designs via Mixture Models Accounting for Complex Sample Designs via Finite Normal Mixture Models 1 1 University of Michigan School of Public Health August 2009 Talk Outline 1 2 Accommodating Sampling Weights in Mixture Models 3

More information

Robust Parameter Estimation in the Weibull and the Birnbaum-Saunders Distribution

Robust Parameter Estimation in the Weibull and the Birnbaum-Saunders Distribution Clemson University TigerPrints All Theses Theses 8-2012 Robust Parameter Estimation in the Weibull and the Birnbaum-Saunders Distribution Jing Zhao Clemson University, jzhao2@clemson.edu Follow this and

More information

Regularization in Cox Frailty Models

Regularization in Cox Frailty Models Regularization in Cox Frailty Models Andreas Groll 1, Trevor Hastie 2, Gerhard Tutz 3 1 Ludwig-Maximilians-Universität Munich, Department of Mathematics, Theresienstraße 39, 80333 Munich, Germany 2 University

More information

Local Polynomial Regression

Local Polynomial Regression VI Local Polynomial Regression (1) Global polynomial regression We observe random pairs (X 1, Y 1 ),, (X n, Y n ) where (X 1, Y 1 ),, (X n, Y n ) iid (X, Y ). We want to estimate m(x) = E(Y X = x) based

More information

Linear Regression Linear Regression with Shrinkage

Linear Regression Linear Regression with Shrinkage Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle

More information

Local Polynomial Regression with Application to Sea Surface Temperatures

Local Polynomial Regression with Application to Sea Surface Temperatures Clemson University TigerPrints All Theses Theses 8-2012 Local Polynomial Regression with Application to Sea Surface Temperatures Michael Finney Clemson University, finney2@clemson.edu Follow this and additional

More information

12 - Nonparametric Density Estimation

12 - Nonparametric Density Estimation ST 697 Fall 2017 1/49 12 - Nonparametric Density Estimation ST 697 Fall 2017 University of Alabama Density Review ST 697 Fall 2017 2/49 Continuous Random Variables ST 697 Fall 2017 3/49 1.0 0.8 F(x) 0.6

More information

Chapter 2. Section Section 2.9. J. Kim (ISU) Chapter 2 1 / 26. Design-optimal estimator under stratified random sampling

Chapter 2. Section Section 2.9. J. Kim (ISU) Chapter 2 1 / 26. Design-optimal estimator under stratified random sampling Chapter 2 Section 2.4 - Section 2.9 J. Kim (ISU) Chapter 2 1 / 26 2.4 Regression and stratification Design-optimal estimator under stratified random sampling where (Ŝxxh, Ŝxyh) ˆβ opt = ( x st, ȳ st )

More information

A Simulation Study on Confidence Interval Procedures of Some Mean Cumulative Function Estimators

A Simulation Study on Confidence Interval Procedures of Some Mean Cumulative Function Estimators Statistics Preprints Statistics -00 A Simulation Study on Confidence Interval Procedures of Some Mean Cumulative Function Estimators Jianying Zuo Iowa State University, jiyizu@iastate.edu William Q. Meeker

More information

High-dimensional regression with unknown variance

High-dimensional regression with unknown variance High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f

More information

Chapter 3: Element sampling design: Part 1

Chapter 3: Element sampling design: Part 1 Chapter 3: Element sampling design: Part 1 Jae-Kwang Kim Fall, 2014 Simple random sampling 1 Simple random sampling 2 SRS with replacement 3 Systematic sampling Kim Ch. 3: Element sampling design: Part

More information

Data Integration for Big Data Analysis for finite population inference

Data Integration for Big Data Analysis for finite population inference for Big Data Analysis for finite population inference Jae-kwang Kim ISU January 23, 2018 1 / 36 What is big data? 2 / 36 Data do not speak for themselves Knowledge Reproducibility Information Intepretation

More information

Density estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas

Density estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas 0 0 5 Motivation: Regression discontinuity (Angrist&Pischke) Outcome.5 1 1.5 A. Linear E[Y 0i X i] 0.2.4.6.8 1 X Outcome.5 1 1.5 B. Nonlinear E[Y 0i X i] i 0.2.4.6.8 1 X utcome.5 1 1.5 C. Nonlinearity

More information

Introduction The framework Bias and variance Approximate computation of leverage Empirical evaluation Discussion of sampling approach in big data

Introduction The framework Bias and variance Approximate computation of leverage Empirical evaluation Discussion of sampling approach in big data Discussion of sampling approach in big data Big data discussion group at MSCS of UIC Outline 1 Introduction 2 The framework 3 Bias and variance 4 Approximate computation of leverage 5 Empirical evaluation

More information

Finite Population Sampling and Inference

Finite Population Sampling and Inference Finite Population Sampling and Inference A Prediction Approach RICHARD VALLIANT ALAN H. DORFMAN RICHARD M. ROYALL A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane

More information

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Final Review. Yang Feng.   Yang Feng (Columbia University) Final Review 1 / 58 Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple

More information