Single-index model-assisted estimation in survey sampling


Journal of Nonparametric Statistics, Vol. 21, No. 4, May 2009

Li Wang*
Department of Statistics, University of Georgia, Athens, GA 30602, USA

(Received 4 June 2008; final version received 19 January 2009)

A model-assisted semiparametric method of estimating finite-population totals is investigated to improve the precision of survey estimators by incorporating multivariate auxiliary information. The proposed superpopulation model is a single-index model (SIM), which has proven to be a simple and efficient semiparametric tool in multivariate regression. A class of estimators based on polynomial spline regression is proposed. These estimators are robust against deviation from SIMs. Under standard design conditions, the proposed estimators are asymptotically design-unbiased, consistent and asymptotically normal. An iterative optimisation routine is provided that is sufficiently fast for users to analyse large and complex survey data within seconds. The proposed method has been applied to simulated datasets and the MU281 dataset, which provide strong evidence corroborating the asymptotic theory.

Keywords: Horvitz–Thompson estimator; model-assisted estimation; semiparametric; spline smoothing; superpopulation

AMS Subject Classification: 62D05; 62G08

1. Introduction

In this article, the classic finite-population estimation problem is investigated. In what follows, let $U_N = \{1,\dots,i,\dots,N\}$ denote the units of a finite population. For each $i \in U_N$, let $y_i$ be a generic characteristic; the objective is to estimate $t_y = \sum_{i\in U_N} y_i$. A probability sample $s$ is drawn from $U_N$ according to a fixed sampling design $p_N(\cdot)$, where $p_N(s)$ is the probability of drawing the sample $s$. Let $\pi_{iN} = \pi_i = \Pr\{i \in s\} = \sum_{s \ni i} p_N(s)$ denote the inclusion probability for element $i \in U_N$, and let $\pi_{ijN} = \pi_{ij} = \Pr\{i, j \in s\} = \sum_{s \ni i,j} p_N(s)$ denote the joint inclusion probability for elements $i, j \in U_N$.
If no information other than the inclusion probabilities is used to estimate $t_y$, a well-known design-unbiased estimator is the Horvitz–Thompson (HT) estimator

$$\hat t_y = \sum_{i\in s} \frac{y_i}{\pi_i}. \tag{1}$$

*Email: lilywang@uga.edu
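As a minimal sketch of Equation (1), the following Python function computes the HT estimate from sampled values and their inclusion probabilities. The function name and the toy data are illustrative assumptions, not part of the original article.

```python
def horvitz_thompson_total(y_sample, pi_sample):
    """HT estimate of t_y: sum of y_i / pi_i over the sampled units (Equation (1))."""
    if len(y_sample) != len(pi_sample):
        raise ValueError("y_sample and pi_sample must have equal length")
    if any(p <= 0.0 or p > 1.0 for p in pi_sample):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    return sum(y / p for y, p in zip(y_sample, pi_sample))


# Hypothetical toy data: simple random sample of n = 3 units from N = 6,
# so every inclusion probability is n / N = 0.5.
y_s = [1.0, 2.0, 3.0]
pi_s = [0.5, 0.5, 0.5]
t_hat = horvitz_thompson_total(y_s, pi_s)  # (1 + 2 + 3) / 0.5
```

Each sampled unit is weighted by the inverse of its inclusion probability, so a unit drawn with probability 0.5 stands in for two population units.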

The variance of the HT estimator under the sampling design is

$$\mathrm{Var}_p(\hat t_y) = \sum_{i,j\in U_N} (\pi_{ij} - \pi_i\pi_j)\,\frac{y_i}{\pi_i}\,\frac{y_j}{\pi_j}.$$

The efficiency of the HT estimator can be significantly improved by incorporating cheap auxiliary information at the population level in addition to sample data. Such auxiliary information is often available for all elements of the population of interest. For instance, in many countries, administrative registers provide extensive sources of auxiliary information. Complete registers can give access to variables such as sex, age, income and country of birth. Studies of labour force characteristics or household expenditure patterns, for example, might benefit from these auxiliary data. Another example is the satellite images or GPS data used in spatial sampling. These data are often collected at the population level and are available at little or no extra cost, especially compared with the cost of collecting the survey data. For more examples of auxiliary information, see [1–4].

The use of auxiliary information to improve the accuracy of survey estimators dates back to post-stratification, calibration, ratio and regression estimation; see [5–8] for a general review of these methods. Auxiliary information can also be used to increase the accuracy of estimators of the finite-population distribution function, for example, [9].

In this article, let $x_i = \{x_{i1},\dots,x_{id}\}$ be a $d$-dimensional auxiliary variable vector, $i \in U_N$, and assume that $\{(x_i, y_i)\}_{i\in U_N}$ is a realisation of $(X, Y)$ from an infinite superpopulation, $\xi$, satisfying

$$Y = m(X) + \sigma(X)\varepsilon, \tag{2}$$

in which the $d$-variate function $m$ is the unknown mean function of $Y$ conditional on the auxiliary information vector $X$, and is often assumed to be smooth; $\sigma$ is the unknown standard deviation (SD) function. The error $\varepsilon$ satisfies $E_\xi(\varepsilon \mid X) = 0$ and $E_\xi(\varepsilon^2 \mid X) = 1$, where $E_\xi$ is the expectation with respect to the superpopulation $\xi$.
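The design-variance formula for the HT estimator given above can be checked numerically in the simplest case. The sketch below evaluates the double sum directly and compares it with the textbook closed form for simple random sampling without replacement; the helper names and the four-unit population are illustrative assumptions.

```python
from itertools import product


def ht_design_variance(y_pop, pi, pi_joint):
    """Double sum over i, j in U_N of (pi_ij - pi_i pi_j) (y_i/pi_i) (y_j/pi_j)."""
    size = len(y_pop)
    total = 0.0
    for i, j in product(range(size), repeat=2):
        total += (pi_joint(i, j) - pi[i] * pi[j]) * (y_pop[i] / pi[i]) * (y_pop[j] / pi[j])
    return total


def srs_inclusion(pop_size, n):
    """First- and second-order inclusion probabilities under SRS without replacement."""
    pi = [n / pop_size] * pop_size

    def pi_joint(i, j):
        # pi_ii is the first-order probability; off-diagonal uses n(n-1)/(N(N-1)).
        if i == j:
            return n / pop_size
        return n * (n - 1) / (pop_size * (pop_size - 1))

    return pi, pi_joint


# Hypothetical population of N = 4 units, sample size n = 2.
y_pop = [1.0, 2.0, 3.0, 4.0]
N, n = len(y_pop), 2
pi, pi_joint = srs_inclusion(N, n)
var_general = ht_design_variance(y_pop, pi, pi_joint)

# Closed form for SRS: N^2 (1 - n/N) S^2 / n, with S^2 the population variance.
mean = sum(y_pop) / N
S2 = sum((y - mean) ** 2 for y in y_pop) / (N - 1)
var_srs = N ** 2 * (1 - n / N) * S2 / n
```

The two quantities agree, which is a useful sanity check before applying the general formula to unequal-probability designs.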
The interesting problem is how to take advantage of the regression relationship (2) to better estimate $t_y$. The traditional parametric approach to analysing a regression relationship assumes that the superpopulation model is fully described by a finite set of parameters, for example, the linear regression (LREG) estimator discussed in [7]. However, this approach sometimes requires prohibitively complex models with a very large number of parameters to address various hypotheses. It is very difficult to obtain any prior model information about the regression function $m$ in Equation (2), and substantial estimation bias can result if a preselected parametric model is too restrictive to fit unexpected features. As an alternative, one can try to estimate the unknown regression relationship nonparametrically without reference to a specific form. The flexibility of nonparametric smoothing/regression is extremely helpful in exploratory data analysis as well as in obtaining robust predictions; see [10,11] for details.

Nonparametric methods for survey data are still rather sparse but have begun to emerge as important and practical tools; see [12–17]. Breidt and Opsomer [12] first proposed a nonparametric model-assisted estimator based on local polynomial regression, which generalised the parametric framework in survey sampling and improved the precision of the survey estimators. Their investigation is restricted to the scalar case, i.e., $d = 1$. Nowadays most surveys involve more than one auxiliary variable [18]. For example, the auxiliary information obtained from remote sensing data, satellite images and GPS data provides a wide and growing range of variables to be employed in spatial sampling. The Northeastern lakes survey discussed in [19,20] is a good example. In that study, information such as the longitude, latitude and elevation of every lake in the population is known for the Environmental Monitoring and Assessment Program of the US Environmental Protection Agency.
In addition, the growing possibilities of information and communication technology have made it possible to develop very large and complex surveys. In this article, a $d$-dimensional auxiliary vector is considered to improve the efficiency of estimating $t_y$ for both small and large surveys.

Research in nonparametric survey theory and methodology when the dimension of the auxiliary information vector is high, however, is quite challenging. A key difficulty is the "curse of dimensionality": the optimal rate of convergence decreases with dimensionality [21]. One solution is regression in the form of the additive model popularised by Hastie and Tibshirani [22]; see [13,15,19] for possible applications of the additive model to survey sampling. A weakness of the purely additive model is that interactions between the explanatory variables are completely ignored [23]. An attractive alternative to the additive model is the single-index model (SIM) given in Equation (3). Similar to the first step of projection pursuit regression, SIM reduces dimensionality while still accommodating interactions; see [24–28] for instance. The basic appeal of SIM is that it is by nature a hybrid of parametric and nonparametric regression. It preserves the simplicity of parametric regression where simplicity is sufficient: the $d$-variate function $m(x) = m(x_1,\dots,x_d)$ is expressed as a univariate function of $x^T\theta_0 = \sum_{q=1}^{d} x_q\theta_{0,q}$; it also employs the flexibility of nonparametric regression where flexibility is necessary.

In this article, I investigate the SIM-assisted estimator for the finite-population total, that is, the superpopulation model in Equation (2) is assumed to be an SIM. Under standard design conditions, a design-consistent estimator of $\theta_0$ is obtained using polynomial splines, and the proposed estimator of $t_y$ is asymptotically design-unbiased (ADU), consistent and asymptotically normal.
By taking advantage of spline smoothing and iterative optimisation routines, the proposed method is particularly computationally efficient compared with the kernel additive model approaches in the literature of nonparametric survey estimation, in which iterative approaches such as a backfitting algorithm [19,22] or marginal integration [29] are necessary.

The rest of the article is organised as follows. Section 2 gives details of the model specification and proposes the estimation method. Section 3 establishes the asymptotic properties of the proposed estimators. Section 4 provides the actual procedure to implement the method. Section 5 reports the empirical results. Technical proofs are contained in the Appendix.

2. Superpopulation model and proposed estimator

2.1. Single-index superpopulation model

In this article, the proposed superpopulation model $\xi$ in (2) is an SIM, where

$$Y = m(X^T\theta_0) + \sigma(X)\varepsilon, \tag{3}$$

in which the unknown parameter $\theta_0$ is called the single-index coefficient, used for simple interpretation once estimated, and the function $m$ is an unknown smooth function used for further data summary. If the SIM is misspecified, however, a goodness-of-fit test is necessary and the estimation of $\theta_0$ must be rethought; see [30]. So in this article, instead of presuming that the underlying true function $m$ is a single-index function like the one defined in Equation (3), the single index is identified by the best approximation to the multivariate function $m$. Specifically, a univariate function $g$ is estimated, which optimally approximates the multivariate function $m$ in the sense of

$$g(\nu) = E_\xi[m(X) \mid X^T\theta_0 = \nu]. \tag{4}$$

The advantage of this method is that it works well even under model misspecification, so it is much more useful in applications than the traditional SIM given in Equation (3).

For the superpopulation model defined by Equations (2) and (4), let

$$m_\theta(X^T\theta) = E_\xi[Y \mid X^T\theta] = E_\xi[m(X) \mid X^T\theta]$$

for any fixed $\theta$, where, as noted in the introduction, $E_\xi$ denotes the expected value with respect to the superpopulation $\xi$ in (2) and (4). Define the risk function of $\theta$ as

$$R(\theta) = E_\xi[\{Y - m_\theta(X^T\theta)\}^2] = E_\xi\{m(X) - m_\theta(X^T\theta)\}^2 + E_\xi\sigma^2(X), \tag{5}$$

which is uniquely minimised at $\theta_0 \in S_+^{d-1} = \{(\theta_1,\dots,\theta_d) \mid \sum_{q=1}^{d}\theta_q^2 = 1,\ \theta_d > 0\}$.

Remark 2.1 Without constraints, the coefficient vector $\theta_0$ is identified only up to a constant factor. Typically, one requires that $\|\theta_0\| = 1$, which entails that at least one of the coordinates $\theta_{0,1},\dots,\theta_{0,d}$ is nonzero. One can assume without loss of generality that $\theta_{0,d} > 0$, and the candidate $\theta_0$ then belongs to the upper unit hemisphere $S_+^{d-1}$.

2.2. Spline smoothing

Estimation of both $\theta_0$ and $g(\cdot)$ in model (4) requires a degree of statistical smoothing. In this article, all estimation is carried out via polynomial splines. The use of polynomial spline smoothing in generalised nonparametric models can be traced back to [21]. As pointed out in [13,31], one of the important advantages of spline smoothing is the relative ease with which spline estimators can be computed, even for large datasets or datasets with regions of sparse data. In addition, spline smoothing is a global smoothing method. After the spline basis is chosen, the coefficients can be estimated by an efficient optimisation procedure. In contrast, kernel-based methods, such as kernel-based backfitting [19,22] and marginal integration approaches [29], in which the optimisation has to be conducted repeatedly at every local data point, are very time-consuming.
To introduce the function space of splines of order $p$, one pre-selects an integer $J = J_N$ with $N^{1/6} \ll J \ll N^{1/5}(\log N)^{-2/5}$, see Assumption (A4), and divides $[0, 1]$ into $(J + 1)$ subintervals, $[k_j, k_{j+1})$, $j = 0,\dots,J-1$, and $[k_J, 1]$, where $\{k_j\}_{j=1}^{J}$ is a sequence of equally spaced points, called interior knots, given as

$$k_{1-p} = \cdots = k_{-1} = k_0 = 0 < k_1 < \cdots < k_J < 1 = k_{J+1} = \cdots = k_{J+p},$$

in which $k_j = j/(J + 1)$, $j = 0, 1,\dots,J + 1$. The $j$th B-spline of order $p$, denoted by $B_{j,p}$, is recursively defined as in [32]. In the following, let $\Gamma^{(2)} = \Gamma^{(2)}[0, 1]$ be the space of all functions with second-order smoothness that are polynomials of degree 3 on each subinterval.

Direct calculation shows that under Assumption (A1) in Section 3.2, for any $\theta \in S_+^{d-1}$, the variable $X^T\theta$ has a Lebesgue probability density function (pdf) that is uniformly bounded below and above by the pdf of a rescaled centred Beta$\{(d + 1)/2, (d + 1)/2\}$,

$$f_d(\nu) = \frac{\Gamma(d + 1)}{\Gamma\{(d + 1)/2\}^2\, 2^d\, a}\left(1 - \frac{\nu^2}{a^2}\right)^{(d-1)/2} I_{[-a,a]}(\nu),$$

which vanishes at the boundary points $-a$ and $a$. This makes nonparametric smoothing of $Y$ on $X^T\theta$ difficult. I therefore first transform the variable $X^T\theta$ by using the cumulative distribution function $F_d$ of $f_d$,

$$F_d(\nu) = \int_{-1}^{\nu/a} \frac{\Gamma(d + 1)}{\Gamma\{(d + 1)/2\}^2\, 2^d}\,(1 - t^2)^{(d-1)/2}\, dt, \quad \nu \in [-a, a]. \tag{6}$$

For the rest of the article, denote the transformed variable of the single-index variable $X^T\theta$ by $Z_\theta = F_d(X^T\theta)$ and let $\varphi_\theta$ be the conditional expectation of $m$ given the transformed variable $Z_\theta$, i.e.,

$$\varphi_\theta(Z_\theta) = E_\xi\{m(X) \mid Z_\theta\} = E_\xi\{m(X) \mid X^T\theta\} = m_\theta(X^T\theta). \tag{7}$$
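The transformation in Equation (6) can be sketched in a few lines of standard-library Python. The code below evaluates $f_d$ in closed form and integrates it numerically with composite Simpson's rule; the function names, the choice of Simpson's rule and the step count are assumptions made for illustration, not the routine used in the article.

```python
import math


def beta_type_density(v, d, a):
    """f_d(v): rescaled centred Beta{(d+1)/2, (d+1)/2} density on [-a, a]."""
    if abs(v) > a:
        return 0.0
    const = math.gamma(d + 1) / (math.gamma((d + 1) / 2) ** 2 * 2 ** d * a)
    base = max(0.0, 1.0 - (v / a) ** 2)  # guard against tiny negative rounding
    return const * base ** ((d - 1) / 2)


def cdf_transform(v, d, a, n_steps=2000):
    """F_d(v): integral of f_d over [-a, v], by composite Simpson's rule."""
    if n_steps % 2:
        raise ValueError("n_steps must be even for Simpson's rule")
    v = max(-a, min(a, v))  # clip to the support
    if v == -a:
        return 0.0
    h = (v + a) / n_steps
    total = beta_type_density(-a, d, a) + beta_type_density(v, d, a)
    for k in range(1, n_steps):
        weight = 4 if k % 2 else 2
        total += weight * beta_type_density(-a + k * h, d, a)
    return total * h / 3.0


# For d = 1 the density is uniform on [-a, a], so F_1(v) = (v + a) / (2a).
z_mid = cdf_transform(0.0, d=1, a=2.0)
z_right = cdf_transform(1.0, d=1, a=2.0)
```

For $d = 1$ the transformation reduces to a linear rescaling, which gives a simple way to check an implementation. Simpson's rule converges more slowly when $d$ is even, because the integrand then has unbounded derivatives at $\pm a$.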

Remark 2.2 The transformed variable $Z_\theta$ has a quasi-uniform $[0, 1]$ distribution, i.e., the pdf of the transformed variable is supported on $[0, 1]$ with a positive lower bound. In practice, the radius $a$ can take the value of the $100(1 - \alpha)$ percentile of $\{\|x_i\|\}_{i\in U_N}$ for a small $\alpha$.

2.3. Oracle population-based estimator

If the entire realisation were known by an oracle, one could create an oracle estimator of $\theta_0$ and $g$ in Equation (4) through a profile least-squares method. Specifically, one estimates the single-index coefficient $\theta_0$ by a consistent estimator $\tilde\theta$ obtained by minimising the empirical version of the risk function $R(\theta)$ defined in Equation (5), i.e.,

$$\tilde\theta = \arg\min_{\theta\in S_+^{d-1}} \tilde R(\theta), \tag{8}$$

where

$$\tilde R(\theta) = \frac{1}{N}\sum_{i\in U_N}\{y_i - \tilde\varphi_\theta(z_{\theta i})\}^2, \tag{9}$$

with

$$\tilde\varphi_\theta(\cdot) = \arg\min_{\varphi(\cdot)\in\Gamma^{(2)}[0,1]} \sum_{i\in U_N}\{y_i - \varphi(z_{\theta i})\}^2.$$

Then the link function $g$ can be estimated by $\tilde g$, a spline smoother of $\{y_i\}_{i\in U_N}$ on $\{z_{\tilde\theta i}\}_{i\in U_N}$, i.e., $\tilde g(\nu) = \tilde\varphi_{\tilde\theta}(F_d(\nu))$, where $F_d(\cdot)$ is defined in Equation (6). Thus, the best single-index approximation to $m(x)$ is $\tilde m(x) = \tilde g(x^T\tilde\theta) = \tilde\varphi_{\tilde\theta}(z_{\tilde\theta})$.

Let $y = (y_1, y_2,\dots,y_N)^T$, let $B_\theta = \{B_{j,4}(z_{\theta i})\}_{i\in U_N,\ j=-3,\dots,J}$ be the B-spline matrix for any fixed $\theta$, and let $e_i$ be an $N$-vector with a 1 in the $i$th position and 0 elsewhere. Write

$$\tilde m_i = \tilde g(x_i^T\tilde\theta) = \tilde\varphi_{\tilde\theta}(z_{\tilde\theta i}) = e_i^T B_{\tilde\theta}(B_{\tilde\theta}^T B_{\tilde\theta})^{-1}B_{\tilde\theta}^T y. \tag{10}$$

Clearly, $\tilde m_i$ is the spline single-index prediction at $x_i$ based on the entire finite population. If these pseudo-predictions $\tilde m_i$ were known, then a design-unbiased estimator of $t_y$ would be the generalised difference estimator

$$\tilde t_{y,\mathrm{diff}} = \sum_{i\in s}\frac{y_i - \tilde m_i}{\pi_i} + \sum_{i\in U_N}\tilde m_i, \tag{11}$$

as given in [7, p. 221]. The design variance of $\tilde t_{y,\mathrm{diff}}$ in Equation (11) is

$$\mathrm{Var}_p(\tilde t_{y,\mathrm{diff}}) = \sum_{i,j\in U_N}(\pi_{ij} - \pi_i\pi_j)\,\frac{y_i - \tilde m_i}{\pi_i}\,\frac{y_j - \tilde m_j}{\pi_j}.$$

2.4. Sample-based estimator

However, the predictions $\tilde m_i$ for $m(x_i)$ cannot be computed directly from data, because the only $y_i$'s observed are those with $i \in s$. Therefore, each $\tilde m_i$ needs to be replaced by a sample-based consistent estimator.
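The generalised difference estimator in Equation (11) has a simple computational form once predictions are available. The sketch below takes the predictions as given inputs; the function name and the toy numbers are illustrative assumptions.

```python
def generalised_difference_total(y_sample, m_sample, pi_sample, m_population):
    """HT estimate of the residual total plus the population total of predictions."""
    ht_of_residuals = sum(
        (y - m) / p for y, m, p in zip(y_sample, m_sample, pi_sample)
    )
    return ht_of_residuals + sum(m_population)


# Hypothetical population of N = 4 units with predictions m_i known for all units;
# units 0 and 2 are sampled, each with inclusion probability 0.5.
m_pop = [1.0, 2.0, 3.0, 4.0]
y_s = [1.5, 3.25]
m_s = [m_pop[0], m_pop[2]]
pi_s = [0.5, 0.5]

t_diff = generalised_difference_total(y_s, m_s, pi_s, m_pop)

# With all predictions set to zero the estimator reduces to the HT estimator.
t_ht = generalised_difference_total(y_s, [0.0, 0.0], pi_s, [0.0] * 4)
```

The better the predictions track $y_i$, the smaller the residuals whose total must be estimated from the sample, which is the source of the efficiency gain over the HT estimator.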
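For any fixed $\theta$, the sample-based cubic spline estimator $\hat\varphi_\theta$ of $\varphi_\theta$ in Equation (7) is defined as

$$\hat\varphi_\theta(\cdot) = \arg\min_{\varphi(\cdot)\in\Gamma^{(2)}[0,1]} \sum_{i\in s}\frac{1}{\pi_i}\{y_i - \varphi(z_{\theta i})\}^2.$$

Define the sample-based empirical risk function of $\theta$,

$$\hat R(\theta) = \frac{1}{N}\sum_{i\in s}\frac{1}{\pi_i}\{y_i - \hat\varphi_\theta(z_{\theta i})\}^2, \tag{12}$$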

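then the sample design-based spline estimator of $\theta_0$ is defined as

$$\hat\theta = \arg\min_{\theta\in S_+^{d-1}} \hat R(\theta), \tag{13}$$

and the spline estimator of $g$ is $\hat g$, i.e., $\hat g(v) = \hat\varphi_{\hat\theta}(F_d(v))$. For any $i \in s$, let

$$\hat m_i = \hat g(x_i^T\hat\theta) = \hat\varphi_{\hat\theta}(z_{\hat\theta i}) = e_i^T B_{\hat\theta,s}(B_{\hat\theta,s}^T W_s B_{\hat\theta,s})^{-1}B_{\hat\theta,s}^T W_s y_s, \tag{14}$$

where $y_s = \{y_i\}_{i\in s}$ is the $n_N$-vector of $y_i$ obtained in the sample and

$$B_{\hat\theta,s} = \{B_{j,4}(z_{\hat\theta i})\}_{i\in s,\ j=-3,\dots,J}, \qquad W_s = \mathrm{diag}\left\{\frac{1}{\pi_i}\right\}_{i\in s}.$$

For a unit $i \notin s$, $\hat m_i = \hat g(x_i^T\hat\theta)$ is obtained by evaluating the same fitted spline at $z_{\hat\theta i}$. Then the sample design-based spline estimator of $t_y$ is

$$\hat t_{y,\mathrm{diff}} = \sum_{i\in s}\frac{y_i - \hat m_i}{\pi_i} + \sum_{i\in U_N}\hat m_i. \tag{15}$$

3. Properties of the estimator

3.1. A simple alternative expression for the estimator

Like the ratio and LREG estimators [7] and the penalised spline estimators [13], the B-spline estimator defined in Equation (15) can also be represented in a simple form. Let $t_z$ and $\hat t_z$ be two vectors:

$$t_z = \Big\{\sum_{i\in U_N} B_{j,4}(z_{\hat\theta i})\Big\}^T_{j=-3,\dots,J}, \qquad \hat t_z = \Big\{\sum_{i\in s}\frac{1}{\pi_i}B_{j,4}(z_{\hat\theta i})\Big\}^T_{j=-3,\dots,J}.$$

Then the estimator in Equation (15) can be written as

$$\hat t_{y,\mathrm{diff}} = \hat t_y + (t_z - \hat t_z)\hat\gamma, \quad \text{where } \hat\gamma = (B_{\hat\theta,s}^T W_s B_{\hat\theta,s})^{-1}B_{\hat\theta,s}^T W_s y_s.$$

Noting that $(1,\dots,1)_{J+4}\,B_{\hat\theta,s}^T = (1,\dots,1)_{n_N}$ and

$$\Big\{\sum_{i\in s}\frac{1}{\pi_i}B_{j,4}(z_{\hat\theta i})\Big\}^T_{j=-3,\dots,J} = (1,\dots,1)_{J+4}\,B_{\hat\theta,s}^T W_s B_{\hat\theta,s},$$

one has

$$\hat t_z\hat\gamma = (1,\dots,1)_{J+4}\,B_{\hat\theta,s}^T W_s B_{\hat\theta,s}(B_{\hat\theta,s}^T W_s B_{\hat\theta,s})^{-1}B_{\hat\theta,s}^T W_s y_s = (1,\dots,1)_{J+4}\,B_{\hat\theta,s}^T W_s y_s = (1,\dots,1)_{n_N} W_s y_s = \hat t_y.$$

So the proposed estimator takes the simple and attractive form

$$\hat t_{y,\mathrm{diff}} = t_z\hat\gamma = \sum_{i\in U_N}\hat m_i.$$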
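The simplification above rests on the fact that the B-spline basis functions sum to one at every point of $[0, 1]$, which is what $(1,\dots,1)_{J+4}\,B_{\hat\theta,s}^T = (1,\dots,1)_{n_N}$ expresses. The following sketch builds the cubic basis on the equally spaced knots of Section 2.2 with the standard Cox–de Boor recursion and evaluates a row sum. It is an illustrative standard-library implementation with assumed function names, not the code used in the article.

```python
def clamped_uniform_knots(num_interior, order=4):
    """Knots k_j = j/(J+1) with `order` repeated boundary knots at 0 and at 1."""
    interior = [j / (num_interior + 1) for j in range(1, num_interior + 1)]
    return [0.0] * order + interior + [1.0] * order


def bspline_basis(x, knots, order=4):
    """All B-spline basis values B_{j,order}(x) via the Cox-de Boor recursion."""
    n_intervals = len(knots) - 1
    # Order-1 (piecewise constant) basis on half-open intervals.
    basis = [1.0 if knots[j] <= x < knots[j + 1] else 0.0 for j in range(n_intervals)]
    if x == knots[-1]:
        # Assign the right end point to the last non-empty interval.
        for j in reversed(range(n_intervals)):
            if knots[j] < knots[j + 1]:
                basis[j] = 1.0
                break
    for k in range(2, order + 1):
        new_basis = []
        for j in range(len(knots) - k):
            left_den = knots[j + k - 1] - knots[j]
            right_den = knots[j + k] - knots[j + 1]
            # Terms with a zero denominator (repeated knots) are dropped by convention.
            left = (x - knots[j]) / left_den * basis[j] if left_den > 0 else 0.0
            right = (knots[j + k] - x) / right_den * basis[j + 1] if right_den > 0 else 0.0
            new_basis.append(left + right)
        basis = new_basis
    return basis


# With J = 3 interior knots there are J + 4 = 7 cubic basis functions.
knots = clamped_uniform_knots(3)
row = bspline_basis(0.37, knots)
row_sum = sum(row)
```

Because every row of the basis matrix sums to one, the constant function lies in the spline space, and the weighted least-squares residuals therefore have a weighted sum of zero. That is exactly why the HT term of the residuals drops out of Equation (15).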

3.2. Assumptions

For the asymptotic properties of the estimators, I use the traditional asymptotic framework given in [12,33], in which both the population and sample sizes increase as $N \to \infty$. There are two sources of variation to be considered here. The first is introduced by the random sample design and the corresponding measure is denoted by $p$. The "with $p$-probability 1", $O_p$, $o_p$ and $E_p(\cdot)$ notation below is with respect to this measure. The second is associated with the superpopulation from which the finite population is viewed as a sample. The corresponding measure and notation are $\xi$, "with $\xi$-probability 1", $O_\xi$, $o_\xi$ and $E_\xi(\cdot)$.

Before stating the asymptotic results, I formulate some assumptions. Let $B_a^d = \{x \in R^d \mid \|x\| \le a\}$ be the $d$-dimensional ball with radius $a$, centre 0 and volume $\mathrm{Vol}_d(B_a^d)$. Let $C^{(k)}(B_a^d) = \{m \mid \text{the } k\text{th order partial derivatives of } m \text{ are continuous on } B_a^d\}$ be the space of $k$th order smooth functions.

(A1) The density function of $X$ satisfies $f(x) \in C^{(4)}(B_a^d)$ for some $a > 0$, and there are positive constants $c_f \le C_f$ such that $c_f/\mathrm{Vol}_d(B_a^d) \le f(x) \le C_f/\mathrm{Vol}_d(B_a^d)$ if $x \in B_a^d$, and $f(x) = 0$ if $x \notin B_a^d$.

(A2) The regression function in Equation (2) satisfies $m \in C^{(4)}(B_a^d)$.

(A3) The error $\varepsilon$ in Equation (2) satisfies $E_\xi(\varepsilon \mid X) = 0$, $E_\xi(\varepsilon^2 \mid X) = 1$, and there exists a positive constant $M$ such that $\sup_{x\in B_a^d} E_\xi(|\varepsilon|^3 \mid X = x) < M$. The SD function $\sigma(x)$ is continuous on $B_a^d$, with $0 < c_\sigma \le \inf_{x\in B_a^d}\sigma(x) \le \sup_{x\in B_a^d}\sigma(x) \le C_\sigma < \infty$.

(A4) As $N \to \infty$, $n_N N^{-1} \to \pi \in (0, 1)$ and the number of interior knots $J$ satisfies $n_N^{1/6} \ll J \ll n_N^{1/5}\{\log(n_N)\}^{-2/5}$.

(A5) For all $N$, $\min_{i\in U_N}\pi_i \ge \lambda > 0$, $\min_{i,j\in U_N}\pi_{ij} \ge \lambda^* > 0$ and

$$\limsup_{N\to\infty}\ n_N \max_{i,j\in U_N,\ i\ne j}|\pi_{ij} - \pi_i\pi_j| < \infty.$$

(A6) Let $D_{k,N}$ be the set of all distinct $k$-tuples $(i_1, i_2,\dots,i_k)$ from $U_N$. Then

$$\limsup_{N\to\infty}\ n_N^2 \max_{(i_1,i_2,i_3,i_4)\in D_{4,N}} \big|E_p[(I_{i_1} - \pi_{i_1})(I_{i_2} - \pi_{i_2})(I_{i_3} - \pi_{i_3})(I_{i_4} - \pi_{i_4})]\big| < \infty,$$

$$\limsup_{N\to\infty}\ n_N^2 \max_{(i_1,i_2,i_3,i_4)\in D_{4,N}} \big|E_p[(I_{i_1}I_{i_2} - \pi_{i_1 i_2})(I_{i_3}I_{i_4} - \pi_{i_3 i_4})]\big| < \infty,$$

and

$$\limsup_{N\to\infty}\ n_N^2 \max_{(i_1,i_2,i_3)\in D_{3,N}} \big|E_p[(I_{i_1} - \pi_{i_1})^2(I_{i_2} - \pi_{i_2})(I_{i_3} - \pi_{i_3})]\big| < \infty,$$

where $I_i = 1$ if $i \in s$ and $I_i = 0$ otherwise.

(A7) The risk function $\tilde R$ in Equation (9) is locally convex at $\tilde\theta$: for all $\varepsilon > 0$, there exists $\delta > 0$ such that $\|\theta - \tilde\theta\|_2 < \varepsilon$ if $\tilde R(\theta) - \tilde R(\tilde\theta) < \delta$.

(A8) The second-order derivative of $\tilde R(\theta)$ is bounded at $\theta = \tilde\theta$.

Remark 3.1 Assumptions (A1)–(A3) are typical in the nonparametric smoothing literature; see, for instance, [10,11,28]. Assumption (A4) concerns how to choose the number of knots in order to achieve the optimal nonparametric rate of convergence. In practice, the number of interior knots $J$ is chosen according to Equation (17). Assumptions (A5) and (A6) involve the inclusion probabilities of the design and are also assumed in [12]. Assumption (A7) is used to derive the design consistency of $\hat\theta$ to $\tilde\theta$, and Assumption (A8) is used to obtain the rate of the consistency.

3.3. Asymptotic properties of the estimator

The estimator $\hat\theta$ in Equation (13) of the single-index coefficient $\theta_0$ is asymptotically design-consistent, as the following theorem demonstrates.

Theorem 3.1 Under Assumptions (A1)–(A5) and (A7), $\hat\theta$ is asymptotically design-consistent in the sense that, with $p$-probability 1,

$$\lim_{N\to\infty}\|\hat\theta - \tilde\theta\| = 0,$$

and further, if (A8) holds, then $\|\hat\theta - \tilde\theta\| = O_p(J n_N^{-1/2})$, where $\tilde\theta$ and $\hat\theta$ are the population- and sample-based estimators of $\theta_0$ in Equations (8) and (13).

Like the local polynomial estimators in [12], the estimator $\hat t_{y,\mathrm{diff}}$ in Equation (15) is ADU and design-consistent, as the following theorem shows.

Theorem 3.2 Under Assumptions (A1)–(A5), (A7) and (A8), the model-assisted spline estimator $\hat t_{y,\mathrm{diff}}$ in Equation (15) is ADU in the sense that

$$\lim_{N\to\infty} E_p\left[\frac{\hat t_{y,\mathrm{diff}} - t_y}{N}\right] = 0 \quad \text{with } \xi\text{-probability 1},$$

and is design-consistent in the sense that, for all $\eta > 0$,

$$\lim_{N\to\infty} E_p\big[I_{\{|\hat t_{y,\mathrm{diff}} - t_y|/N > \eta\}}\big] = 0 \quad \text{with } \xi\text{-probability 1}.$$

Like the local polynomial estimators in [12], the estimator in Equation (15) also inherits the limiting distribution of the generalised difference estimator, as the following theorem shows.

Theorem 3.3 Under Assumptions (A1)–(A8), for $\tilde t_{y,\mathrm{diff}}$ and $\hat t_{y,\mathrm{diff}}$ in Equations (11) and (15), as $N \to \infty$,

$$\frac{N^{-1}(\tilde t_{y,\mathrm{diff}} - t_y)}{\mathrm{Var}_p^{1/2}(N^{-1}\tilde t_{y,\mathrm{diff}})} \xrightarrow{d} N(0, 1)$$

implies

$$\frac{N^{-1}(\hat t_{y,\mathrm{diff}} - t_y)}{\hat V^{1/2}(N^{-1}\hat t_{y,\mathrm{diff}})} \xrightarrow{d} N(0, 1),$$

where

$$\hat V(N^{-1}\hat t_{y,\mathrm{diff}}) = \frac{1}{N^2}\sum_{i,j\in U_N}(\pi_{ij} - \pi_i\pi_j)\,\frac{y_i - \hat m_i}{\pi_i}\,\frac{y_j - \hat m_j}{\pi_j}\,\frac{I_i I_j}{\pi_{ij}}. \tag{16}$$

Details of the proofs of Theorems 3.1–3.3 are given in the Appendix.

Remark 3.2 In [13], the number of knots is fixed, so the bias caused by spline approximation is ignored in developing the asymptotic theory. It has been shown in many contexts of function estimation that, by letting the number of knots increase with the sample size at an appropriate rate, the spline estimate of an unknown function can achieve the optimal nonparametric rate of convergence; see [31,34]. For this purpose, in this article, $n_N^{1/6} \ll J \ll n_N^{1/5}\{\log(n_N)\}^{-2/5}$, as stated in Assumption (A4).

Remark 3.3 As one referee pointed out, the asymptotics with the number of knots allowed to grow are much more challenging, and only very recent work tackles this problem, e.g., [35,36]. However, the results obtained in this article are not directly comparable to those obtained in [35,36] because of the different settings of the model. The problem in [35,36] is a purely nonparametric curve estimation problem, and the objective is to study the asymptotics of the curve estimators fitted with penalised splines, whereas the problem here is a semiparametric one and the main interest is in estimating the parametric component $\theta$. At the population level, it has been shown that $\theta_0$ should be estimable at the usual root-$n$ rate of convergence using techniques similar to those used in deriving the asymptotics of maximum likelihood estimators. In this article, examination of the approximation of the derivatives (up to the second order) of the risk function in Equation (5) by their empirical versions implies that a range of smoothing parameters is allowed for the desired asymptotics; see Appendix A of [37]. This differs from nonparametric curve estimation in [35,36], in which the optimal choice of the smoothing parameter is required to achieve the optimal rate of convergence.

4. Algorithm

In this section, the actual procedure to implement the estimation of $\theta_0$ and $t_y$ is described. I first introduce some new notation. For any fixed $\theta$, write $P_{\theta,s} = B_{\theta,s}(B_{\theta,s}^T W_s B_{\theta,s})^{-1}B_{\theta,s}^T W_s$ as the sample projection matrix onto the cubic spline space.
For any $q = 1,\dots,d$, write $\dot B_q = (\partial/\partial\theta_q)B_\theta$ and $\dot P_q = (\partial/\partial\theta_q)P_{\theta,s}$ as the first-order derivatives of $B_\theta$ and $P_{\theta,s}$ with respect to $\theta$. Write $\theta_{-d} = (\theta_1,\dots,\theta_{d-1})^T$. Let $\hat S^*(\theta_{-d})$ be the score vector of the risk function $\hat R^*(\theta_{-d}) = \hat R\big(\theta_1, \theta_2,\dots,\theta_{d-1}, \sqrt{1 - \|\theta_{-d}\|_2^2}\big)$, that is, $\hat S^*(\theta_{-d}) = (\partial/\partial\theta_{-d})\hat R^*(\theta_{-d})$. The exact form of $\hat S^*(\theta_{-d})$ is given in Lemma 4.1 of [37]. In practice, the estimation is implemented via the following procedure.

Step 1. Standardise the auxiliary variables $\{x_i\}_{i\in U_N}$ and find the radius $a$ used in the CDF transformation (6) by calculating the $100(1 - \alpha)$ percentile of $\{\|x_i\|\}_{i\in U_N}$ ($\alpha = 0.01$ or $0.05$, for example).

Step 2. Find the estimator $\hat\theta$ of $\theta_0$ by minimising $\hat R$ in Equation (12) through the port optimisation routine in the technical report of [38], with $(0, 0,\dots,1)^T$ as the initial value and the gradient vector $\hat S^*$ in Equation (17) of [37]. If $d < n$, one can instead take as the initial value the simple OLS estimator (after standardisation) for $\{y_i, x_i\}_{i\in s}$ with its last coordinate positive.

Step 3. Obtain the estimator $\hat m_i$ of $m(x_i)$, $i \in U_N$, by applying formula (14).

Step 4. Calculate the sample design-based spline estimator of $t_y$ in Equation (15).

Remark 4.1 In Step 2, the number of interior knots is

$$J = \min\{c_1[n_N^{1/5.5}],\ c_2\}, \tag{17}$$

where $c_1$ and $c_2$ are positive integers and $[\nu]$ denotes the integer part of $\nu$. The choice of the tuning parameter $c_1$ makes little difference for a large sample, and according to the asymptotic theory there is no optimal way to set $c_1$ and $c_2$. I recommend using $c_1 = 1$ to save computing time for massive datasets and $c_2 = 5,\dots,10$ for smooth monotonic or smooth unimodal regression, as suggested by Yu and Ruppert [39].
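Two small pieces of the procedure above are easy to sketch: the knot rule in Equation (17), and the normalisation that places an initial value for Step 2 on the upper unit hemisphere $S_+^{d-1}$. The function names and defaults below are assumptions made for illustration.

```python
import math


def number_of_interior_knots(n, c1=1, c2=5):
    """Equation (17): J = min{c1 * [n^(1/5.5)], c2}, with [.] the integer part."""
    return min(c1 * int(n ** (1 / 5.5)), c2)


def to_upper_hemisphere(theta):
    """Rescale theta to unit Euclidean norm and flip its sign if the last
    coordinate is negative, so that it is a candidate point of S_+^{d-1}."""
    norm = math.sqrt(sum(t * t for t in theta))
    if norm == 0.0:
        raise ValueError("theta must be nonzero")
    unit = [t / norm for t in theta]
    if unit[-1] < 0:
        unit = [-t for t in unit]
    return unit


# Sample sizes used in the simulations of Section 5.
J_200 = number_of_interior_knots(200)
J_5000 = number_of_interior_knots(5000)

# A hypothetical raw coefficient vector, e.g. from an OLS fit.
theta_init = to_upper_hemisphere([3.0, -4.0])
```

With $c_1 = 1$ the rule yields only a handful of interior knots even for large samples, which keeps the spline design matrix small and is one reason the procedure is fast.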

5. Empirical results

In this section, empirical results are provided to demonstrate the applicability of the methodology. Besides the spline single-index (SIM) estimators proposed in this article, I have obtained for comparison the performance of three other estimators: the HT estimator in Equation (1), the LREG estimator without interaction terms in Chapter 6 of [7], and the spline additive estimator (AM) in [13] with degrees 1, 2 and 3 and adaptive knots. The number of knots $J$ for the spline SIM estimator is selected according to Equation (17).

5.1. Simulated population

To illustrate the finite-sample behaviour of the estimator $\hat t_{y,\mathrm{diff}}$, some simulation results are presented. For the superpopulation model (2), the following six mean functions are considered:

2-dimension (linear): $m_1(x) = x_1 + x_2$;
2-dimension (quadratic): $m_2(x) = 1 + (x_1 + x_2)^2$;
2-dimension (bump 1): $m_3(x) = x_1 + x_2 + 4\exp\{-(x_1 + x_2)^2\}$;
2-dimension (bump 2): $m_4(x) = x_1 + x_2 + 4\exp\{-(x_1 + x_2)^2\} + \sqrt{x_1^2 + x_2^2}$;
4-dimension (sinusoid): $m_5(x) = \sin(x^T\theta_0)$, $\theta_0 = (1, 1, 0, 1)^T/\sqrt{3}$;
10-dimension (sinusoid): $m_6(x) = \sin(x^T\theta_0)$, $\theta_0 = (1, 1, 0,\dots,0, 1)^T/\sqrt{3}$.

These represent various correct and incorrect SIM specifications. Function $m_1$ is a simple linear additive function with two auxiliary variables, and it is also a linear single-index function. Functions $m_2$, $m_3$, $m_5$ and $m_6$ are common SIMs but, unlike $m_1$, they are not additive, so a purely linear or additive model would be misspecified. Function $m_4$ is neither a genuine single-index nor a genuine additive function, so any of the above models would be misspecified. However, because the SIM in this article is identified by the best approximation (see Equation (4)) to the multivariate mean function, the estimator $\hat t_{y,\mathrm{diff}}$ is expected to be robust in this case. The auxiliary vectors $\{x_i\}_{i\in U_N}$ are generated as i.i.d. $d$-dimensional uniform $(0, 1)$ random vectors.
The population values $y_i$ are generated from the mean functions by adding i.i.d. $N(0, \sigma^2)$ errors with $\sigma = 0.1$ and $0.4$. The population is of size $N$. Samples are generated by simple random sampling using sample sizes $n = 50$, 100 and 200. For each combination of mean function, SD and sample size, 1000 replicates are selected from the same population, the estimators are calculated, and the design bias, design variance and design mean squared errors (MSEs) are estimated.

Table 1 lists the average mean squared errors (AMSEs) of the spline estimators $\hat\theta$ in Equation (13) based on $d$ dimensions,

$$\mathrm{AMSE}(\hat\theta) = \frac{1}{d}\sum_{q=1}^{d}\mathrm{MSE}(\hat\theta_q), \tag{18}$$

from which one sees that, even for small sample sizes, the estimators $\hat\theta$ are very accurate for all the population models, and the precision improves as the sample size $n$ increases.

Table 1. AMSE of the spline estimators $\hat\theta$ defined in Equation (18), by $\sigma$ and $n$.
Note: Based on 1000 replications of simple random samples from the population of size $N$.

In terms of the design biases, the percent relative design biases

$$\frac{E_p[\hat t_{y,\mathrm{diff}}] - t_y}{t_y}\times 100\%$$

defined in [12] have been measured for all the above models. The relative design biases of the SIM estimators are quite small (<1% for all cases in the simulation) even for sample size $n = 50$.

Table 2 shows the ratios of the design MSEs of the HT, LREG and AM estimators to the MSE of the proposed spline SIM estimator. From this table, one sees that the model-assisted estimators (LREG, AM and SIM) perform much better than the simple HT estimator regardless of the type of mean function and the error SD. For $m_1$, LREG is expected to be the preferred estimator, since the assumed model is correctly specified. The AM and SIM estimators behave similarly in this case, and the MSE ratios of AM to SIM are close to 1. However, not much efficiency is lost by using SIM or AM instead of LREG: the MSE ratios of LREG to SIM are at least 0.78 in all cases. For the remaining populations, the SIM estimators perform consistently better than the LREG and AM estimators, because the interactions between the auxiliary variables are completely ignored by the LREG and AM estimators. Although $m_4$ is not a genuine single-index function, the SIM estimators are still much more accurate than the HT, LREG and AM estimators, consistent with the theory that the proposed estimators are robust against deviation from the SIM.

To show how fast the computation is, Table 2 also provides the average time (based on 1000 replications) needed to obtain the SIM estimators on an ordinary PC with an Intel Pentium IV 1.86 GHz processor and 1.0 GB RAM. The proposed SIM estimation is extremely fast. For instance, for Model 6, the SIM estimation for a 10-dimensional sample of size 200 takes on average 0.23 s.
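The two summary measures used in the simulation study, the AMSE in Equation (18) and the percent relative design bias, can be computed from Monte Carlo replicates as in the sketch below. The function names and the toy replicates are illustrative assumptions and do not reproduce any result of the article.

```python
def amse(theta_hats, theta_ref):
    """Equation (18): average over the d coordinates of the Monte Carlo MSE
    of each coordinate of theta-hat about the reference vector."""
    n_rep = len(theta_hats)
    d = len(theta_ref)
    mse_per_coord = [
        sum((th[q] - theta_ref[q]) ** 2 for th in theta_hats) / n_rep
        for q in range(d)
    ]
    return sum(mse_per_coord) / d


def percent_relative_bias(t_hats, t_true):
    """Monte Carlo version of {E_p[t_hat] - t_y} / t_y * 100%."""
    mean_estimate = sum(t_hats) / len(t_hats)
    return (mean_estimate - t_true) / t_true * 100.0


# Hypothetical replicates, chosen only so the arithmetic is easy to follow.
theta_replicates = [[1.0, 0.0], [0.0, 1.0]]
amse_value = amse(theta_replicates, [0.0, 1.0])

total_replicates = [9.0, 11.0, 13.0]
bias_value = percent_relative_bias(total_replicates, 10.0)
```

In a design-based simulation the replicates come from repeated samples of one fixed population, so the averages approximate expectations with respect to the sampling design rather than the superpopulation model.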
I have also carried out the simulation with sample size n = 5000 generated from the population of size 50,000. Remarkably, it takes on average <8 s to get the SIM estimators for all the above models MU281 data The MU284 data set from Appendix B of [7] contains data about Swedish municipalities. The study variable y is RMT , where RMT85 is municipal tax receipts in Two auxiliary variables x 1 (CS82) and x 2 (SS82) are used, where x 1 is the number of Conservative Party seats in the municipal council, and x 2 is the number of Social Democrat Party seats. The largest three cities according to the variable population in 1975 (pop75) are discarded because they are huge outliers and would be treated separately in practice. The population total of = 281 Swedish Municipalities, t y, is found to be The oracle estimator θ (Equation (8)) at the population level is found to be (0.8412, ) T. A Monte Carlo simulation is carried out in which 1000 repeated SRS samples (each with n = 50 and 100) are drawn from the MU281 population of Swedish municipalities. To demonstrate the closeness of the spline estimator ˆθ to the oracle index parameter θ, Table 3 lists the sample mean (MEA), design bias (BIAS), design SD, the design MSE and the AMSE in Equation (18) of ˆθ. From this table, one sees that the sample-based estimators ˆθ are very accurate even for sample

12 498 L. Wang Table 2. Ratio of MSE of the HT, LREG and additive model-assisted estimators (AM) to the SIM-assisted estimators and the average computing time of the SIM. MSE ratio Model σ n HT LREG Degree = 1 Degree = 2 Degree = 3 Time of SIM (s) AM ote: Based on 1000 replications of simple random sampling from population of size = Table 3. Spline estimators ˆθ on MU281 data. n θ MEA BIAS SD MSE AMSE 50 θ θ θ θ ote: Based on 1000 replications of simple random sampling from population of 281 Swedish Municipalities.

13 Journal of onparametric Statistics 499 Table 4. Estimators of t y on MU281 data. n Estimator MEA BIAS SD MSE 50 HT LREG AM (degree = 1) AM (degree = 2) AM (degree = 3) SIM HT LREG AM (degree = 1) AM (degree = 2) AM (degree = 3) SIM ote: Based on 1000 replications of simple random sampling from population of 281 Swedish Municipalities. size 50. As what is expected, when the sample size increases, the coefficient is more accurately estimated. Table 4 shows the performance of the HT, LREG, AM and SIM estimators of t y. One sees from this table that the model-assisted estimators are much more accurate than the simple HT estimators. Among all the model-assisted estimators, the spline SIM estimators are better than other estimators in terms of the MSE. ote The corresponding computing package in R, svyty_1.0.zip, can be freely downloaded from edu/research/svyty_1.0.zip. References [1] R.L. Chambers, Robust case-weighting for multipurpose establishment surveys. J. Off. Statist. 12 (1996), pp [2] R.L. Chambers, A.H. Dorfman, and T.E. Wehrly, Bias robust estimation in finite populations using nonparametric calibration, J. Amer. Statist. Assoc. 88 (1993), pp [3] A.H. Dorfman, onparametric regression for estimating totals in finite populations. Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, VA., 1992, pp [4] A.H. Dorfman and P. Hall, Estimators of the finite population distribution function using nonparametric regression, Ann. Statist. 21 (1993), pp [5] R.L. Chambers, A.H. Dorfman, and P. Hall, Properties of estimators of the finite distribution function, Biometrika 79 (1992), pp [6] R.L. Chambers and C.J. Skinner, Analysis of Survey Data, Wiley, Chichester, [7] C.E. Särndal, B. Swensson, and J. Wretman, Model Assisted Survey Sampling, Springer-Verlag, ew York, [8] M.E. Thompson, Theory of Sample Surveys, Chapman and Hall, London, [9] S. Wang and A.H. 
Dorfman, A new estimator for the finite population distribution function, Biometrika 83 (1997), pp.
[10] J. Fan and I. Gijbels, Local Polynomial Modelling and Its Applications, Chapman and Hall, London,
[11] W. Härdle, Applied Nonparametric Regression, Cambridge University Press, Cambridge,
[12] F.J. Breidt and J.D. Opsomer, Local polynomial regression estimators in survey sampling, Ann. Statist. 28 (2000), pp.
[13] F.J. Breidt, G. Claeskens, and J.D. Opsomer, Model-assisted estimation for complex surveys using penalised splines, Biometrika 92 (2005), pp.
[14] G.E. Montanari and M.G. Ranalli, Nonparametric model calibration estimation in survey sampling, J. Amer. Statist. Assoc. 100 (2005), pp.
[15] J.D. Opsomer, F.J. Breidt, G.G. Moisen, and G. Kauermann, Model-assisted estimation of forest resources with generalized additive models (with discussion), J. Amer. Statist. Assoc. 102 (2007), pp.

[16] H. Zheng and R.J.A. Little, Penalized spline model-based estimation of finite population total from probability-proportional-to-size samples, J. Off. Statist. 19 (2003), pp.
[17] H. Zheng and R.J.A. Little, Penalized spline nonparametric mixed models for inference about a finite population mean from two-stage samples, Survey Methodol. 30 (2004), pp.
[18] C.E. Särndal and S. Lundström, Estimation in Surveys with Nonresponse, Wiley, New York,
[19] F.J. Breidt, J.D. Opsomer, A.A. Johnson, and M.G. Ranalli, Semiparametric model-assisted estimation for natural resource surveys, Survey Methodol. 33 (2007), pp.
[20] S. Everson-Stewart, Nonparametric survey regression estimation in two-stage spatial sampling, unpublished masters project, Colorado State University. Available at
[21] C.J. Stone, The dimensionality reduction principle for generalized additive models, Ann. Statist. 14 (1986), pp.
[22] T.J. Hastie and R.J. Tibshirani, Generalized Additive Models, Chapman and Hall, London,
[23] S. Sperlich, D. Tjøstheim, and L. Yang, Nonparametric estimation and testing of interaction in additive models, Econ. Theory 18 (2002), pp.
[24] R. Carroll, J. Fan, I. Gijbels, and M.P. Wand, Generalized partially linear single-index models, J. Amer. Statist. Assoc. 92 (1997), pp.
[25] P. Hall, On projection pursuit regression, Ann. Statist. 17 (1989), pp.
[26] W. Härdle, P. Hall, and H. Ichimura, Optimal smoothing in single-index models, Ann. Statist. 21 (1993), pp.
[27] J.L. Horowitz and W. Härdle, Direct semiparametric estimation of single-index models with discrete covariates, J. Amer. Statist. Assoc. 91 (1996), pp.
[28] Y. Xia, H. Tong, W.K. Li, and L. Zhu, An adaptive estimation of dimension reduction space, J. R. Stat. Soc. Ser. B Stat. Methodol. 64 (2002), pp.
[29] O.B. Linton and J.P. Nielsen, A kernel method of estimating structured nonparametric regression based on marginal integration, Biometrika 82 (1995), pp.
[30] Y. Xia, W.K. Li, H. Tong, and D.
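Zhang, A goodness-of-fit test for single-index models, Statist. Sinica 14 (2004), pp.
[31] L. Wang and L. Yang, Spline-backfitted kernel smoothing of nonlinear additive autoregression model, Ann. Statist. 35 (2007), pp.
[32] C. de Boor, A Practical Guide to Splines, Springer-Verlag, New York,
[33] C. Isaki and W.A. Fuller, Survey design under the regression superpopulation model, J. Amer. Statist. Assoc. 77 (1982), pp.
[34] J.Z. Huang, Local asymptotics for polynomial spline regression, Ann. Statist. 31 (2003), pp.
[35] G. Claeskens, T. Krivobokova, and J. Opsomer, Asymptotic properties of penalized spline regression estimators, to appear in Biometrika.
[36] Y. Li and D. Ruppert, On the asymptotics of penalized splines, Biometrika 95 (2008), pp.
[37] L. Wang, Single-index model-assisted estimation in survey sampling, Technical Report. Available at
[38] D.M. Gay, Usage summary for selected optimization routines, Computing Science Technical Report No. Available at
[39] Y. Yu and D. Ruppert, Penalized spline estimation for partially linear single index models, J. Amer. Statist. Assoc. 97 (2002), pp.

Appendix

A.1. Proof of Theorem 3.1

Let $(\Omega, \mathcal{A}, P)$ be the design probability space with respect to the sampling design measure. By Lemma A.3 of [37], for any $\delta > 0$ and $\omega \in \Omega$, there exists an integer $N_0(\omega)$ such that, when $N > N_0(\omega)$, $\hat R(\tilde\theta, \omega) - R(\tilde\theta) < \delta/2$. Note that $\hat\theta = \hat\theta(\omega)$ is the minimiser of $\hat R(\theta, \omega)$, so $\hat R(\hat\theta(\omega), \omega) \le \hat R(\tilde\theta, \omega)$ and hence $\hat R(\hat\theta(\omega), \omega) - R(\tilde\theta) < \delta/2$. Using Lemma A.3 of [37] again, there exists $N_1(\omega)$ such that, when $N > N_1(\omega)$, $R(\hat\theta(\omega)) - \hat R(\hat\theta(\omega), \omega) < \delta/2$. Thus, when $N > \max(N_0(\omega), N_1(\omega))$,

$$R(\hat\theta(\omega)) - R(\tilde\theta) < \frac{\delta}{2} + \hat R(\hat\theta(\omega), \omega) - R(\tilde\theta) < \frac{\delta}{2} + \frac{\delta}{2} = \delta.$$

By Assumption (A7), for any $\varepsilon > 0$, if $R(\hat\theta(\omega)) - R(\tilde\theta) < \delta$, then one would have $\|\hat\theta(\omega) - \tilde\theta\|_2 < \varepsilon$ for $N$ large enough, which is true for any $\omega$, and the strong consistency holds.

Next, note that

$$\left.\frac{\partial \hat R(\theta)}{\partial\theta}\right|_{\theta=\hat\theta} - \left.\frac{\partial \hat R(\theta)}{\partial\theta}\right|_{\theta=\tilde\theta} = \left.\frac{\partial^2 \hat R(\theta)}{\partial\theta\,\partial\theta^T}\right|_{\theta=\bar\theta} (\hat\theta - \tilde\theta),$$

with $\bar\theta = t\hat\theta + (1 - t)\tilde\theta$ for some $t \in (0, 1)$. Since $\hat\theta$ minimises $\hat R$, the first term on the left vanishes. So

$$\hat\theta - \tilde\theta = -\left\{\left.\frac{\partial^2 \hat R(\theta)}{\partial\theta\,\partial\theta^T}\right|_{\theta=\bar\theta}\right\}^{-1} \left.\frac{\partial \hat R(\theta)}{\partial\theta}\right|_{\theta=\tilde\theta},$$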

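where, according to Equation (A.6) of [37] and the consistency of $\hat\theta$ established above,

$$\lim_{N\to\infty} \left.\frac{\partial^2 \hat R(\theta)}{\partial\theta\,\partial\theta^T}\right|_{\theta=\bar\theta} = \left.\frac{\partial^2 R(\theta)}{\partial\theta\,\partial\theta^T}\right|_{\theta=\tilde\theta}$$

in $p$-probability, and, by Equation (A.6) of [37] again, one has

$$\left\|\left.\frac{\partial \hat R(\theta)}{\partial\theta}\right|_{\theta=\tilde\theta}\right\| \le \sup_{\theta\in S_c^{d-1}} \left\|\frac{\partial \hat R(\theta)}{\partial\theta} - \frac{\partial R(\theta)}{\partial\theta}\right\| = O_p\left(J_N n^{-1/2}\right).$$

Thus $\hat\theta - \tilde\theta = O_p(J_N / n^{1/2})$ by Assumption (A8).

A.2. Proof of Theorem 3.2

Lemma A.1 Under Assumptions (A1)–(A5) and (A7), one has

$$\lim_{N\to\infty} \frac{1}{N} E_p \sum_{i\in U_N} (\tilde m_i - \hat m_i)^2 = 0,$$

where $\tilde m_i$ and $\hat m_i$ are defined in Equations (10) and (14).

Proof Let $\hat{\tilde m}_i = e_i^T B_{\hat\theta} (B_{\hat\theta}^T B_{\hat\theta})^{-1} B_{\hat\theta}^T y$; then one can write

$$(\tilde m_i - \hat m_i)^2 = (\tilde m_i - \hat{\tilde m}_i)^2 + (\hat{\tilde m}_i - \hat m_i)^2 + 2(\tilde m_i - \hat{\tilde m}_i)(\hat{\tilde m}_i - \hat m_i).$$

By Lemma A.2 of [37], $(1/N) E_p[\sum_{i\in U_N} (\hat{\tilde m}_i - \hat m_i)^2] \to 0$, so by the Cauchy-Schwarz inequality it suffices to show

$$\frac{1}{N} E_p \sum_{i\in U_N} (\tilde m_i - \hat{\tilde m}_i)^2 \to 0. \tag{A.1}$$

Let $f(t) = e_i^T P_{t\hat\theta + (1-t)\tilde\theta}\, y$, where $P_\theta = B_\theta (B_\theta^T B_\theta)^{-1} B_\theta^T$; then

$$\frac{df(t)}{dt} = e_i^T \sum_{q=1}^{d} \frac{\partial}{\partial\theta_q} P_{t\hat\theta + (1-t)\tilde\theta}\, (\hat\theta_q - \tilde\theta_q)\, y.$$

Thus,

$$\hat{\tilde m}_i - \tilde m_i = e_i^T B_{\hat\theta} (B_{\hat\theta}^T B_{\hat\theta})^{-1} B_{\hat\theta}^T y - e_i^T B_{\tilde\theta} (B_{\tilde\theta}^T B_{\tilde\theta})^{-1} B_{\tilde\theta}^T y = f(1) - f(0) = e_i^T \sum_{q=1}^{d} \frac{\partial}{\partial\theta_q} P_{t^*\hat\theta + (1-t^*)\tilde\theta}\, (\hat\theta_q - \tilde\theta_q)\, y,$$

where $t^* \in (0, 1)$. Therefore,

$$\frac{1}{N} E_p \sum_{i\in U_N} (\tilde m_i - \hat{\tilde m}_i)^2 = \frac{1}{N} E_p \sum_{i\in U_N} \left\{ e_i^T \sum_{q=1}^{d} \frac{\partial}{\partial\theta_q} P_{t^*\hat\theta + (1-t^*)\tilde\theta}\, (\hat\theta_q - \tilde\theta_q)\, y \right\}^2.$$

Note that, according to Theorem 3.1, with $p$-probability 1,

$$\frac{\partial}{\partial\theta_q} P_{t^*\hat\theta + (1-t^*)\tilde\theta} \to \frac{\partial}{\partial\theta_q} P_{\tilde\theta},$$

and $\hat\theta - \tilde\theta = O_p(J_N / n^{1/2})$. By Lemma A.4 of [37], there exists a positive constant $C_0$ such that

$$\sup_{1\le k\le d}\ \sup_{\theta\in S_c^{d-1}} \left\| \frac{\partial}{\partial\theta_k} P_\theta \right\| \le C_0 J_N$$

with $\xi$-probability 1. Thus Equation (A.1) follows directly from the above arguments and Assumption (A4). Hence the result.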
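To make the objects in Lemma A.1 concrete, the following Python sketch computes population-level fits of the form $e_i^T P_\theta y$ at two nearby index directions and reports the average squared difference that the lemma controls. It is only an illustration under simplifying assumptions: the basis is piecewise constant on equal-width bins rather than the degree 1 to 3 splines used in the paper, and the data, the function names and the directions `theta_tilde` and `theta_hat` are hypothetical.

```python
import math
from typing import List, Sequence


def index_values(x: Sequence[Sequence[float]], theta: Sequence[float]) -> List[float]:
    """Single-index values u_i = x_i^T theta."""
    return [sum(a * b for a, b in zip(row, theta)) for row in x]


def projection_fit(u: Sequence[float], y: Sequence[float], n_bins: int) -> List[float]:
    """Fitted values P y for an indicator basis on equal-width bins of u.

    With indicator columns, B (B^T B)^{-1} B^T y is the vector of within-bin means.
    """
    lo, hi = min(u), max(u)
    width = (hi - lo) / n_bins or 1.0  # a constant index puts every unit in bin 0
    bins = [min(int((ui - lo) / width), n_bins - 1) for ui in u]
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for b, yi in zip(bins, y):
        sums[b] += yi
        counts[b] += 1
    return [sums[b] / counts[b] for b in bins]


# Hypothetical population of N = 40 units with two auxiliary variables.
x = [[i / 40, ((7 * i) % 40) / 40] for i in range(40)]
theta_tilde = [0.6, 0.8]                       # assumed population-level direction
raw = [0.62, 0.78]                             # assumed nearby estimate, then normalised
norm = math.sqrt(sum(r * r for r in raw))
theta_hat = [r / norm for r in raw]
y = [u ** 2 for u in index_values(x, theta_tilde)]

m_tilde = projection_fit(index_values(x, theta_tilde), y, n_bins=4)
m_hat_tilde = projection_fit(index_values(x, theta_hat), y, n_bins=4)

# Average squared difference between the two population-level fits.
avg_sq_diff = sum((a - b) ** 2 for a, b in zip(m_tilde, m_hat_tilde)) / len(y)
print(avg_sq_diff)
```

Because the fit is a projection, applying `projection_fit` to its own output along the same index returns the same values; the lemma's content is that moving the index direction by a small amount moves the fit by a small amount on average.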

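Note that

$$\frac{\hat t_{y,\mathrm{diff}} - t_y}{N} = \sum_{i\in U_N} \frac{y_i - \tilde m_i}{N}\left(\frac{I_i}{\pi_i} - 1\right) + \sum_{i\in U_N} \frac{\hat m_i - \tilde m_i}{N}\left(1 - \frac{I_i}{\pi_i}\right).$$

Then, by the Cauchy-Schwarz inequality,

$$E_p\left|\frac{\hat t_{y,\mathrm{diff}} - t_y}{N}\right| \le E_p\left|\sum_{i\in U_N} \frac{y_i - \tilde m_i}{N}\left(\frac{I_i}{\pi_i} - 1\right)\right| + \left\{E_p \sum_{i\in U_N} \frac{(\hat m_i - \tilde m_i)^2}{N}\right\}^{1/2} \left\{E_p \sum_{i\in U_N} \frac{(1 - I_i/\pi_i)^2}{N}\right\}^{1/2}. \tag{A.2}$$

According to the definition of Equation (9), under Assumptions (A1)–(A4), one has

$$\limsup_{N\to\infty} \frac{1}{N} \sum_{i\in U_N} (y_i - \tilde m_i)^2 < \infty.$$

Following the same argument as in Theorem 1 of [12], the first term on the right-hand side of Equation (A.2) converges to zero as $N \to \infty$. For the second term, Assumption (A5) implies that

$$E_p \sum_{i\in U_N} \frac{(1 - I_i/\pi_i)^2}{N} = \sum_{i\in U_N} \frac{\pi_i(1 - \pi_i)}{N\pi_i^2} \le \frac{1}{\lambda}.$$

According to Lemma A.1,

$$\lim_{N\to\infty} \frac{1}{N} \sum_{i\in U_N} E_p[(\hat m_i - \tilde m_i)^2] = 0 \quad \text{with } \xi\text{-probability 1},$$

and the result follows from Markov's inequality.

A.3. Proof of Theorem 3.3

The next lemma derives the asymptotic MSE of the proposed spline estimator in Equation (15).

Lemma A.2 Under Assumptions (A1)–(A5) and (A7),

$$n E_p\left(\frac{\hat t_{y,\mathrm{diff}} - t_y}{N}\right)^2 = \frac{n}{N^2} \sum_{i,j\in U_N} (\pi_{ij} - \pi_i\pi_j) \frac{y_i - \tilde m_i}{\pi_i} \frac{y_j - \tilde m_j}{\pi_j} + o(1). \tag{A.3}$$

Proof Let

$$a_n = n^{1/2} \sum_{i\in U_N} \frac{y_i - \tilde m_i}{N}\left(\frac{I_i}{\pi_i} - 1\right), \qquad b_n = n^{1/2} \sum_{i\in U_N} \frac{\hat m_i - \tilde m_i}{N}\left(1 - \frac{I_i}{\pi_i}\right),$$

then

$$n^{1/2}\, \frac{\hat t_{y,\mathrm{diff}} - t_y}{N} = a_n + b_n.$$

Note that

$$E_p[a_n^2] = \frac{n}{N^2} \sum_{i,j\in U_N} (\pi_{ij} - \pi_i\pi_j) \frac{y_i - \tilde m_i}{\pi_i} \frac{y_j - \tilde m_j}{\pi_j} \le \left(\frac{1}{\lambda} + \frac{n \max_{i,j\in U_N, i\ne j} |\pi_{ij} - \pi_i\pi_j|}{\lambda^2}\right) \frac{1}{N} \sum_{i\in U_N} (y_i - \tilde m_i)^2 < \infty,$$

$$E_p[b_n^2] = \frac{n}{N^2} E_p \sum_{i,j\in U_N} (\hat m_i - \tilde m_i)(\hat m_j - \tilde m_j)\left(1 - \frac{I_i}{\pi_i}\right)\left(1 - \frac{I_j}{\pi_j}\right) \le \left(\frac{1}{\lambda} + \frac{n \max_{i,j\in U_N, i\ne j} |\pi_{ij} - \pi_i\pi_j|}{\lambda^2}\right) \frac{1}{N} \sum_{i\in U_N} E_p[\{\hat m_i - \tilde m_i\}^2].$$

By Lemma A.1, one has $E_p[b_n^2] = o(1)$, and the Cauchy-Schwarz inequality implies $E_p[a_n b_n] = o(1)$. Therefore,

$$n E_p\left(\frac{\hat t_{y,\mathrm{diff}} - t_y}{N}\right)^2 = E_p[a_n^2] + 2E_p[a_n b_n] + E_p[b_n^2] = E_p[a_n^2] + o(1).$$

Thus the desired result holds.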
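The error decomposition used in the proofs of Theorem 3.2 and Lemma A.2 is an exact algebraic identity, which the following Python sketch checks numerically. It is a minimal illustration, not the paper's implementation: the population, the linear stand-in for the population fit `m_tilde`, the perturbed stand-in for the sample fit `m_hat`, and all function names are hypothetical, and simple random sampling with $\pi_i = n/N$ is assumed.

```python
import random
from typing import List, Sequence, Tuple


def ht_total(y: Sequence[float], pi: Sequence[float], sample: Sequence[int]) -> float:
    """Horvitz-Thompson estimator of the total."""
    return sum(y[i] / pi[i] for i in sample)


def difference_total(y: Sequence[float], m: Sequence[float],
                     pi: Sequence[float], sample: Sequence[int]) -> float:
    """Difference estimator: sum_{i in s} (y_i - m_i)/pi_i + sum_{i in U} m_i."""
    return sum((y[i] - m[i]) / pi[i] for i in sample) + sum(m)


def error_terms(y: Sequence[float], m_tilde: Sequence[float], m_hat: Sequence[float],
                pi: Sequence[float], sample: Sequence[int]) -> Tuple[float, float]:
    """The two sums whose total equals t_hat_diff - t_y (a_n and b_n without scaling)."""
    in_s = set(sample)
    w = [(1.0 / pi[i] if i in in_s else 0.0) for i in range(len(y))]  # I_i / pi_i
    a = sum((y[i] - m_tilde[i]) * (w[i] - 1.0) for i in range(len(y)))
    b = sum((m_hat[i] - m_tilde[i]) * (1.0 - w[i]) for i in range(len(y)))
    return a, b


rng = random.Random(0)
N, n = 60, 15
x: List[float] = [rng.uniform(0, 1) for _ in range(N)]
y = [2 + 3 * xi + rng.gauss(0, 0.2) for xi in x]       # hypothetical study variable
m_tilde = [2 + 3 * xi for xi in x]                      # stand-in population-level fit
m_hat = [mi + rng.gauss(0, 0.05) for mi in m_tilde]     # stand-in sample-based fit
pi = [n / N] * N                                        # simple random sampling
sample = rng.sample(range(N), n)

t_y = sum(y)
t_hat = difference_total(y, m_hat, pi, sample)
a, b = error_terms(y, m_tilde, m_hat, pi, sample)
print(t_hat - t_y, a + b)                 # the two numbers agree up to rounding
print(ht_total(y, pi, sample) - t_y)      # HT error, for comparison
```

The first term depends on the sample only through the indicators $I_i$, which is why its design variance has the closed form in (A.3); the second term carries all of the estimation error in the fitted values.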

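Denote

$$\mathrm{AMSE}(N^{-1}\hat t_{y,\mathrm{diff}}) = \frac{1}{N^2} \sum_{i,j\in U_N} (\pi_{ij} - \pi_i\pi_j) \frac{y_i - \tilde m_i}{\pi_i} \frac{y_j - \tilde m_j}{\pi_j}$$

as the asymptotic MSE in (A.3). The next result shows that it can be estimated consistently by $\hat V(N^{-1}\hat t_{y,\mathrm{diff}})$ in Equation (16).

Lemma A.3 Under Assumptions (A1)–(A7), one has

$$\lim_{N\to\infty} n E_p\left|\hat V(N^{-1}\hat t_{y,\mathrm{diff}}) - \mathrm{AMSE}(N^{-1}\hat t_{y,\mathrm{diff}})\right| = 0.$$

Proof Denote

$$S_1 = \frac{1}{N^2} \sum_{i,j\in U_N} (\pi_{ij} - \pi_i\pi_j) \frac{y_i - \tilde m_i}{\pi_i} \frac{y_j - \tilde m_j}{\pi_j} \frac{I_i I_j}{\pi_{ij}},$$

$$S_2 = \frac{2}{N^2} \sum_{i,j\in U_N} (\pi_{ij} - \pi_i\pi_j) \frac{y_i - \tilde m_i}{\pi_i} \frac{\tilde m_j - \hat m_j}{\pi_j} \frac{I_i I_j}{\pi_{ij}},$$

$$S_3 = \frac{1}{N^2} \sum_{i,j\in U_N} (\pi_{ij} - \pi_i\pi_j) \frac{\tilde m_i - \hat m_i}{\pi_i} \frac{\tilde m_j - \hat m_j}{\pi_j} \frac{I_i I_j}{\pi_{ij}},$$

then

$$\hat V(N^{-1}\hat t_{y,\mathrm{diff}}) - \mathrm{AMSE}(N^{-1}\hat t_{y,\mathrm{diff}}) = S_1 - \mathrm{AMSE}(N^{-1}\hat t_{y,\mathrm{diff}}) + S_2 + S_3.$$

For the first term $S_1$, one has

$$n E_p\left|S_1 - \mathrm{AMSE}(N^{-1}\hat t_{y,\mathrm{diff}})\right| \le \frac{n}{N^2} \left\{ E_p\left( \sum_{i,j\in U_N} \frac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j} (y_i - \tilde m_i)(y_j - \tilde m_j) \frac{I_i I_j - \pi_{ij}}{\pi_{ij}} \right)^2 \right\}^{1/2},$$

and

$$\begin{aligned}
\frac{n^2}{N^4} E_p&\left( \sum_{i,j\in U_N} \frac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j} (y_i - \tilde m_i)(y_j - \tilde m_j) \frac{I_i I_j - \pi_{ij}}{\pi_{ij}} \right)^2 \\
&= \frac{n^2}{N^4} \sum_{i,k\in U_N} (y_i - \tilde m_i)^2 (y_k - \tilde m_k)^2 \left(\frac{1 - \pi_i}{\pi_i}\right) \left(\frac{1 - \pi_k}{\pi_k}\right) \left(\frac{\pi_{ik} - \pi_i\pi_k}{\pi_i\pi_k}\right) \\
&\quad + \frac{2n^2}{N^4} \sum_{i\in U_N} \sum_{k\ne l} (y_i - \tilde m_i)^2 \left(\frac{1 - \pi_i}{\pi_i}\right) \left(\frac{\pi_{kl} - \pi_k\pi_l}{\pi_k\pi_l}\right) (y_k - \tilde m_k)(y_l - \tilde m_l)\, E_p\left[\left(\frac{I_i - \pi_i}{\pi_i}\right)\left(\frac{I_k I_l - \pi_{kl}}{\pi_{kl}}\right)\right] \\
&\quad + \frac{n^2}{N^4} \sum_{i\ne j,\, k\ne l} (y_i - \tilde m_i)(y_j - \tilde m_j)(y_k - \tilde m_k)(y_l - \tilde m_l) \left(\frac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j}\right) \left(\frac{\pi_{kl} - \pi_k\pi_l}{\pi_k\pi_l}\right) E_p\left[\left(\frac{I_i I_j - \pi_{ij}}{\pi_{ij}}\right)\left(\frac{I_k I_l - \pi_{kl}}{\pi_{kl}}\right)\right] \\
&\equiv s_1 + s_2 + s_3.
\end{aligned} \tag{A.4}$$

Now

$$s_1 \le \frac{n^2}{\lambda^3 N^4} \sum_{i\in U_N} (y_i - \tilde m_i)^4 + \frac{n^2}{\lambda^4 N^4} \sum_{i,k\in U_N,\, i\ne k} (y_i - \tilde m_i)^2 (y_k - \tilde m_k)^2\, |\pi_{ik} - \pi_i\pi_k| \le \left( \frac{n^2}{\lambda^3 N^3} + \frac{n^2 \max_{i,k\in U_N, i\ne k} |\pi_{ik} - \pi_i\pi_k|}{\lambda^4 N^2} \right) \frac{1}{N} \sum_{i\in U_N} (y_i - \tilde m_i)^4,$$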

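and $\limsup_{N\to\infty} N^{-1} \sum_{i\in U_N} (y_i - \tilde m_i)^4 < \infty$. Thus $s_1$ goes to zero as $N \to \infty$. Next,

$$\begin{aligned}
s_3 &\le \frac{\left(n \max_{i,k\in U_N, i\ne k} |\pi_{ik} - \pi_i\pi_k|\right)^2}{\lambda^4 \lambda^{*2} N^4} \sum_{i\ne j,\, k\ne l} |y_i - \tilde m_i|\,|y_j - \tilde m_j|\,|y_k - \tilde m_k|\,|y_l - \tilde m_l|\, \left|E_p\left[(I_i I_j - \pi_{ij})(I_k I_l - \pi_{kl})\right]\right| \\
&\le O(N^{-1}) + \frac{\left(n \max_{i,k\in U_N, i\ne k} |\pi_{ik} - \pi_i\pi_k|\right)^2}{\lambda^4 \lambda^{*2}} \max_{(i,j,k,l)\in D_{4,N}} \left|E_p\left[(I_i I_j - \pi_{ij})(I_k I_l - \pi_{kl})\right]\right| \frac{1}{N} \sum_{i\in U_N} (y_i - \tilde m_i)^4,
\end{aligned}$$

which converges to zero as $N \to \infty$ by Assumption (A6). As a result of the Cauchy-Schwarz inequality, one can show that $s_2$ goes to zero as $N \to \infty$. Therefore,

$$n E_p\left|S_1 - \mathrm{AMSE}(N^{-1}\hat t_{y,\mathrm{diff}})\right| \to 0, \quad \text{as } N \to \infty. \tag{A.5}$$

Next, for $S_2$, by Lemma A.1,

$$n E_p|S_2| = n E_p\left| \frac{2}{N^2} \sum_{i,j\in U_N} (\pi_{ij} - \pi_i\pi_j) \frac{y_i - \tilde m_i}{\pi_i} \frac{\tilde m_j - \hat m_j}{\pi_j} \frac{I_i I_j}{\pi_{ij}} \right| \le \left( \frac{2n \max_{i,k\in U_N, i\ne k} |\pi_{ik} - \pi_i\pi_k|}{\lambda^2 \lambda^*} + \frac{2n}{\lambda^2 N} \right) \left\{ \frac{1}{N} \sum_{i\in U_N} (y_i - \tilde m_i)^2 \right\}^{1/2} \left\{ \frac{1}{N} E_p \sum_{i\in U_N} (\tilde m_i - \hat m_i)^2 \right\}^{1/2}, \tag{A.6}$$

which converges to zero. For $S_3$, applying Lemma A.1 again, one has

$$n E_p|S_3| = \frac{n}{N^2} E_p\left| \sum_{i,j\in U_N} (\pi_{ij} - \pi_i\pi_j) \frac{\tilde m_i - \hat m_i}{\pi_i} \frac{\tilde m_j - \hat m_j}{\pi_j} \frac{I_i I_j}{\pi_{ij}} \right| \le \left( \frac{n \max_{i,j\in U_N, i\ne j} |\pi_{ij} - \pi_i\pi_j|}{\lambda^2 \lambda^*} + \frac{n}{\lambda^2 N} \right) \frac{1}{N} E_p \sum_{i\in U_N} (\tilde m_i - \hat m_i)^2 \to 0. \tag{A.7}$$

The desired result follows from (A.4)–(A.7).

Proof of Theorem 3.3 By the proof of Lemma A.2,

$$\frac{1}{N}(\hat t_{y,\mathrm{diff}} - t_y) = \frac{1}{N} \sum_{i\in U_N} (y_i - \tilde m_i)\left(\frac{I_i}{\pi_i} - 1\right) + \frac{1}{N} \sum_{i\in U_N} (\hat m_i - \tilde m_i)\left(1 - \frac{I_i}{\pi_i}\right) = \frac{1}{N}(\tilde t_{y,\mathrm{diff}} - t_y) + o_p(n^{-1/2}),$$

so the desired result follows from Lemma A.3.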
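For simple random sampling without replacement, the double-sum form of the variance estimator implied by the terms $S_1$, $S_2$ and $S_3$ collapses to a familiar closed form. The Python sketch below checks this numerically, assuming $\pi_i = n/N$ and $\pi_{ij} = n(n-1)/\{N(N-1)\}$ for $i \ne j$. The residual values stand in for $y_i - \hat m_i$ on the sampled units and are hypothetical, as are the function names; $N = 281$ matches the MU281 population size.

```python
from typing import Sequence


def variance_estimate(residuals: Sequence[float], N: int) -> float:
    """N^{-2} * sum over sampled pairs of (pi_ij - pi_i pi_j)/pi_ij * (e_i/pi_i)(e_j/pi_j)."""
    n = len(residuals)
    pi1 = n / N                                # first-order inclusion probability
    pi2 = n * (n - 1) / (N * (N - 1))          # second-order, for i != j
    total = 0.0
    for i, ei in enumerate(residuals):
        for j, ej in enumerate(residuals):
            pij = pi1 if i == j else pi2       # pi_ii is taken to be pi_i
            total += (pij - pi1 * pi1) / pij * (ei / pi1) * (ej / pi1)
    return total / N ** 2


def srs_closed_form(residuals: Sequence[float], N: int) -> float:
    """(1 - n/N) * s_e^2 / n, with s_e^2 the sample variance of the residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    s2 = sum((e - mean) ** 2 for e in residuals) / (n - 1)
    return (1 - n / N) * s2 / n


residuals = [0.3, -0.1, 0.4, -0.6, 0.2, 0.1, -0.3, 0.5]   # hypothetical y_i - m_hat_i
N = 281
v_double = variance_estimate(residuals, N)
v_closed = srs_closed_form(residuals, N)
print(v_double, v_closed)   # the two values agree up to rounding
```

The double loop costs $O(n^2)$ operations, so for equal-probability designs the closed form is the practical choice; the general form is needed only when the $\pi_{ij}$ vary across pairs.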
