Model Based Direct Estimation of Small Area Distributions

University of Wollongong Research Online
Centre for Statistical & Survey Methodology Working Paper Series, Faculty of Engineering and Information Sciences, 2010

Model Based Direct Estimation of Small Area Distributions

Nicola Salvati, University of Pisa, Italy
Hukum Chandra, University of Wollongong, hchandra@uow.edu.au
Ray Chambers, University of Wollongong, ray@uow.edu.au

Recommended Citation: Salvati, Nicola; Chandra, Hukum; and Chambers, Ray, Model Based Direct Estimation of Small Area Distributions, Centre for Statistical and Survey Methodology, University of Wollongong, Working Paper 20-10, 2010, 28p. http://ro.uow.edu.au/cssmwp/70

Research Online is the open access institutional repository for the University of Wollongong. For further information contact the UOW Library: research-pubs@uow.edu.au

Centre for Statistical and Survey Methodology, The University of Wollongong. Working Paper 20-10. Model Based Direct Estimation of Small Area Distributions. Nicola Salvati, Hukum Chandra and Ray Chambers.

Copyright 2008 by the Centre for Statistical & Survey Methodology, UOW. Work in progress, no part of this paper may be reproduced without permission from the Centre.

Centre for Statistical & Survey Methodology, University of Wollongong, Wollongong NSW 2522. Phone +61 2 4221 5435, Fax +61 2 4221 4845. Email: anica@uow.edu.au

Model Based Direct Estimation of Small Area Distributions

Nicola Salvati 1, Hukum Chandra 2 and Ray Chambers 3

1 Dipartimento di Statistica e Matematica Applicata all'Economia, University of Pisa, Italy. E-mail: salvati@ec.unipi.it
2 Centre for Statistical and Survey Methodology, University of Wollongong, Wollongong, Australia. E-mail: hchandra@uow.edu.au
3 Centre for Statistical and Survey Methodology, University of Wollongong, Wollongong, Australia. E-mail: ray@uow.edu.au

Summary

Much of the small area estimation literature focuses on population totals and means. However, users of survey data are often interested in the finite population distribution of a survey variable, and the measures (e.g. medians, quartiles, percentiles) that characterise the shape of this distribution at small area level. In this paper we propose a model-based direct estimator (MBDE, see Chandra and Chambers, 2009) of the small area distribution function. The MBDE is defined as a weighted sum of sample data from the area of interest, with weights derived from the calibrated spline-based estimate of the finite population distribution function introduced by Harms and Duchesne (2006), under an appropriately specified regression model with random area effects. We also discuss mean squared error estimation for the MBDE. Monte Carlo simulations based on both simulated and real datasets show that the proposed MBDE and its associated mean squared error estimator perform well when compared with alternative estimators of the area-specific finite population distribution function.

Key words: Indicator function; Model-based direct estimator; Mean squared error estimator; Simulation experiments.

1. Introduction

Let $U = \{1, 2, \ldots, N\}$ be the finite population of size $N$ and let $y$ denote a variable of interest that takes values over this population. A common target of inference is then the proportion of values $y_j$ that are bounded by a given constant (e.g. the proportion of households whose monthly per capita expenditure is below the poverty line). More generally, the target of inference is the value of the finite population distribution function for a variable $y$ at a specified value $t$. This is

$$F_N(t) = N^{-1} \sum_{j=1}^{N} I(y_j \le t),$$

i.e. the proportion of the population whose values for $y$ are less than or equal to $t$, where $I(y_j \le t)$ is the indicator function that takes the value 1 if $y_j \le t$ and 0 otherwise, and $t$ is a specified constant. Clearly, once we obtain an estimator of the finite population distribution function, we can evaluate its inverse to obtain the associated estimator of the finite population quantile function. See Chambers and Dunstan (1986), Rao et al. (1990), Harms and Duchesne (2006) and Rueda et al. (2007, 2010).

Small area estimation (SAE) is an important objective of many surveys. Small areas or small domains are subsets of the population with small sample sizes, so standard survey estimation methods for these areas, which only use information from the small area samples, are unreliable. In this context SAE methods that borrow strength via statistical models (Rao, 2003) can be used to produce reliable small area estimates. However, virtually all of these methods focus on estimation of linear parameters, e.g. small area means or totals. In this paper we focus on estimation of the small area distribution of a study variable and the measures (e.g. medians, quartiles, percentiles) that characterise the shape of this distribution. This is especially useful if there are extreme values in the small area sample data, or if the small area distribution of the variable of interest is highly skewed (Tzavidis et al., 2010).
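The finite population distribution function $F_N(t)$ and its inverse, as defined at the start of this Section, can be illustrated with a minimal Python sketch. The function names and the toy expenditure values are hypothetical and are not taken from the paper; the quantile is taken here to be the smallest population value $t$ with $F_N(t) \ge \alpha$, which is one common convention and is an assumption of this sketch.

```python
import numpy as np

def finite_population_cdf(y, t):
    """F_N(t): proportion of the population values y_j that are <= t."""
    y = np.asarray(y, dtype=float)
    return float(np.mean(y <= t))

def finite_population_quantile(y, alpha):
    """Smallest population value t such that F_N(t) >= alpha."""
    y_sorted = np.sort(np.asarray(y, dtype=float))
    n_pop = y_sorted.size
    # F_N evaluated at the k-th order statistic is at least k / N,
    # so the smallest admissible k is ceil(alpha * N), clipped to 1..N.
    k = int(np.ceil(alpha * n_pop))
    k = min(max(k, 1), n_pop)
    return float(y_sorted[k - 1])

# Hypothetical monthly per capita expenditures for a population of 8 households.
y_pop = np.array([120.0, 80.0, 300.0, 150.0, 95.0, 210.0, 60.0, 175.0])
poverty_line = 100.0

share_below = finite_population_cdf(y_pop, poverty_line)
median = finite_population_quantile(y_pop, 0.5)
print(share_below, median)
```

In this toy population three of the eight households fall at or below the hypothetical poverty line, and the quantile function returns a value of $t$ at which $F_N(t)$ reaches the requested level.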

We propose a model-based direct estimator (MBDE) for the small area distribution function, extending the MBDE approach (Chandra and Chambers, 2009) to the estimation of the small area distribution function. This MBDE is a weighted sum of the sample data from the small area of interest, with weights that are derived from a spline-based calibrated estimator of the population distribution function (Harms and Duchesne, 2006) under a regression model with random area effects.

The rest of the article is organized as follows. The following Section describes SAE based on the linear mixed model and the nonparametric regression model based on penalized splines and then uses these models to motivate estimators of the small area distribution function. Section 3 introduces the concept of calibrated sample weights for a finite population distribution function and uses these to define the MBDE for this function. A bias-robust estimator of the mean squared error of the MBDE is also developed, based on the approach of Chambers et al. (2009). The empirical performances of the proposed MBDE as well as alternative estimators of the small area distribution function are evaluated in Section 4, using both model-based and design-based simulations, with the design-based simulations based on two real data sets. Concluding remarks are set out in Section 5.

2. Estimation of the Small Area Distribution Function

We assume that a finite population $U$ containing $N$ units can be partitioned into $A$ non-overlapping domains, referred to from now on as small areas, or simply areas, indexed by $i = 1, \ldots, A$, with area $i$ containing $N_i$ units, so $N = \sum_{i=1}^{A} N_i$. Let $y_j$ denote the value of the variable of interest $y$ for unit $j$ ($j = 1, \ldots, N_i$) in area $i$ ($i = 1, \ldots, A$). The area-specific distribution function of $y$ for area $i$ is

$$F_i(t) = N_i^{-1} \sum_{j=1}^{N_i} I(y_j \le t). \tag{1}$$
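As an illustration of definition (1), the following Python sketch evaluates the area-specific distribution functions for all areas at once from unit-level values and area labels. The function name, the zero-based area labels and the toy data are assumptions made for this example only, and every area is assumed to contain at least one unit.

```python
import numpy as np

def area_cdfs(y, area, t, n_areas):
    """Return F_i(t) for areas i = 0, ..., n_areas - 1 (zero-based labels)."""
    y = np.asarray(y, dtype=float)
    area = np.asarray(area, dtype=int)
    below = (y <= t).astype(float)                                # I(y_j <= t)
    sizes = np.bincount(area, minlength=n_areas)                  # N_i
    hits = np.bincount(area, weights=below, minlength=n_areas)    # sum of indicators
    return hits / sizes

# Hypothetical population of six units split over two areas.
y_units = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
area_labels = np.array([0, 0, 0, 1, 1, 1])

f_area = area_cdfs(y_units, area_labels, t=2.5, n_areas=2)
print(f_area)
```

A quick consistency check is that the size-weighted average of the area values, $N^{-1} \sum_i N_i F_i(t)$, reproduces the population value $F_N(t)$.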

Let $s$ denote a sample of $n$ units drawn from $U$ by some specified sampling design, and assume that values of the variable of interest $y$ are available for each of these $n$ sample units. The non-sample component of $U$, containing $N - n$ units, is denoted by $r$. In what follows, we use a subscript of $i$ to denote quantities specific to area $i$ ($i = 1, \ldots, A$). For example, $s_i$ and $r_i$ denote the $n_i$ sample and $N_i - n_i$ non-sample units respectively for area $i$. With this notation, the conventional estimators of the area $i$ distribution function, $F_i(t)$, are the Horvitz-Thompson (HT) estimator

$$\hat{F}_i^{HT}(t) = N_i^{-1} \sum_{j \in s_i} \pi_j^{-1} I(y_j \le t), \tag{2}$$

and the Hajek estimator

$$\hat{F}_i^{Hajek}(t) = \frac{\sum_{j \in s_i} \pi_j^{-1} I(y_j \le t)}{\sum_{j \in s_i} \pi_j^{-1}}. \tag{3}$$

Here $\pi_j$ denotes the sample inclusion probability of unit $j$. Both (2) and (3) are area-specific design-based direct estimators and do not depend on an assumed model for their validity (Cochran, 1977). Unfortunately, empirical evidence presented in Rueda et al. (2007) shows that these estimators can be substantially biased, while the fact that they only use information from the area sample makes them too unstable for SAE.

Model-based small area estimators based on the linear mixed model are widely used in SAE. However, if the functional form of the regression relationship between the variable of interest and the available auxiliary variables is unknown or complicated, then SAE based on a nonparametric regression model can offer significant advantages compared with SAE based on a linear model. In particular, a nonparametric regression model based on p-splines is attractive because it represents a relatively straightforward extension of a linear regression model (Eilers and Marx, 1996). Opsomer et al. (2008) describe the use of a spline-based nonparametric regression model for SAE. See also Salvati et al. (2010). In the rest of this Section we therefore summarize the model-based

approach to estimation of the small area distribution function under the linear mixed model and under a nonparametric regression model.

2.1 Estimation under the linear mixed model

SAE theory for this case is now well established, see Rao (2003). We briefly describe it below since this allows us to introduce notation that will be used elsewhere in the paper. To start, we note that throughout this paper we will assume that we have access to the population values of $p$ auxiliary scalar variables that are, to a greater or lesser extent, correlated with $y$. Let $x_j$ denote the vector of values of these auxiliary variables that are associated with $y_j$ and let $z_j$ denote a vector of auxiliary contextual variables whose values are known for all units in the population. Let $y_U$, $X_U$ and $Z_U$ denote the population level vector and matrices defined by $y_j$, $x_j$ and $z_j$, respectively. Then the linear mixed model is

$$y_U = X_U \beta + Z_U u + e_U, \tag{4}$$

where $\beta$ is a $p$-vector of regression coefficients, $u$ is a random vector of area effects and $e_U$ is a population $N$-vector of random individual effects. In general, area effects are vector-valued, so $u^T = (u_1^T, u_2^T, \ldots, u_A^T)$ and $Z_U = \mathrm{diag}\{Z_i;\ i = 1, \ldots, A\}$, where $i$ indexes the $A$ areas that make up the population and $Z_i$ is of dimension $N_i \times q$. The area effects $\{u_i;\ i = 1, \ldots, A\}$ are assumed to be independent and identically distributed realisations of a random vector of dimension $q$ with zero mean and covariance matrix $\Sigma_u$. Similarly, the scalar individual effects making up $e_U$ are assumed to be independent and identically distributed realisations of a random variable with zero mean and variance $\sigma_e^2$, with area and individual effects mutually independent. The covariance matrix of the vector $y_U$ is then

$$\mathrm{Var}(y_U) = V_U = Z_U \Sigma_u Z_U^T + \sigma_e^2 I_N,$$

where $I_k$ denotes the identity matrix of dimension $k$. The parameters $\theta = (\Sigma_u, \sigma_e^2)$ are typically referred to as the variance components of (4).
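To make the covariance structure implied by (4) concrete, the sketch below builds $V_U = Z_U \Sigma_u Z_U^T + \sigma_e^2 I_N$ for the simplest special case of a scalar random intercept per area ($q = 1$, so $\Sigma_u$ reduces to $\sigma_u^2$ for each area). This restriction, the function name and the toy area labels are assumptions of the example; the two variance values are those quoted for the set 1 simulations in Section 4.1.

```python
import numpy as np

def random_intercept_covariance(area, sigma2_u, sigma2_e):
    """Return (Z_U, V_U) for a random-intercept model with zero-based area labels."""
    area = np.asarray(area, dtype=int)
    n_units = area.size
    n_areas = int(area.max()) + 1
    # Z_U: column i is the indicator of membership of area i.
    Z = np.zeros((n_units, n_areas))
    Z[np.arange(n_units), area] = 1.0
    # Covariance of the stacked area effects (independent across areas).
    Sigma_u = sigma2_u * np.eye(n_areas)
    V = Z @ Sigma_u @ Z.T + sigma2_e * np.eye(n_units)
    return Z, V

# Five hypothetical units: two in area 0 and three in area 1.
area_labels = np.array([0, 0, 1, 1, 1])
Z_U, V_U = random_intercept_covariance(area_labels, sigma2_u=23.52, sigma2_e=94.09)

# Units in the same area share covariance sigma2_u; units in different areas are uncorrelated.
intra_area_corr = V_U[0, 1] / V_U[0, 0]
print(np.round(V_U, 2))
print(round(intra_area_corr, 3))
```

The resulting matrix is block diagonal by area, and the ratio of the within-area covariance to the unit variance gives the intra-area correlation $\sigma_u^2 / (\sigma_u^2 + \sigma_e^2)$.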

We also assume throughout this paper that the method of sampling is non-informative given the auxiliary variables, so the model (4) holds for both sampled and non-sampled population units. Consequently, we can partition $y_U$, $X_U$, $Z_U$ and $e_U$ into components defined by the $n$ sampled and $N - n$ non-sampled population units, denoted by subscripts of $s$ and $r$ respectively, and re-express (4) as follows:

$$y_U = \begin{pmatrix} y_s \\ y_r \end{pmatrix} = \begin{pmatrix} X_s \\ X_r \end{pmatrix} \beta + \begin{pmatrix} Z_s \\ Z_r \end{pmatrix} u + \begin{pmatrix} e_s \\ e_r \end{pmatrix},$$

with the variance of $y_U$ similarly partitioned,

$$V_U = \begin{pmatrix} V_{ss} & V_{sr} \\ V_{rs} & V_{rr} \end{pmatrix}.$$

Thus $X_s$ represents the matrix defined by the $n$ sample values of the auxiliary variable vector, while

$$V_{ss} = \mathrm{diag}\{V_{iss};\ i = 1, \ldots, A\} = \mathrm{diag}\{Z_{is} \Sigma_u Z_{is}^T + \sigma_e^2 I_{n_i};\ i = 1, \ldots, A\}$$

and

$$V_{sr} = \mathrm{diag}\{V_{isr};\ i = 1, \ldots, A\} = \mathrm{diag}\{Z_{is} \Sigma_u Z_{ir}^T;\ i = 1, \ldots, A\}.$$

Here $Z_{is}$ and $Z_{ir}$ respectively denote the restriction of $Z_i$ to sampled and non-sampled units in area $i$. The distribution function for small area $i$ given by (1) can be expressed as

$$F_i(t) = N_i^{-1} \Big\{ \sum_{j \in s_i} I(y_j \le t) + \sum_{j \in r_i} I(y_j \le t) \Big\},$$

where the first sum is known and the second is unknown. The problem of estimating $F_i(t)$ therefore reduces to predicting the values $y_j$ for the non-sample units in area $i$. Given estimated values $\hat{\theta} = (\hat{\Sigma}_u, \hat{\sigma}_e^2)$ of the variance components we can define the estimated covariance matrix $\hat{V}_U$, and the predicted values of $y_j$ are $\hat{y}_j^{EBLUP} = x_j^T \hat{\beta}^{EBLUE} + z_j^T \hat{u}^{EBLUP}$, where $\hat{\beta}^{EBLUE} = (X_s^T \hat{V}_{ss}^{-1} X_s)^{-1} X_s^T \hat{V}_{ss}^{-1} y_s$ is the

empirical best linear unbiased estimator (EBLUE) of $\beta$ and $\hat{u}^{EBLUP} = \hat{\Sigma}_u Z_s^T \hat{V}_{ss}^{-1} (y_s - X_s \hat{\beta}^{EBLUE})$ is the empirical best linear unbiased predictor (EBLUP) of $u$. Substituting estimated values for the parameters of (4) then allows us to define an estimator for $F_i(t)$ of the form

$$\hat{F}_i^{EBP}(t) = N_i^{-1} \Big\{ \sum_{j \in s_i} I(y_j \le t) + \sum_{j \in r_i} I(\hat{y}_j^{EBLUP} \le t) \Big\}. \tag{5}$$

We refer to (5) as the empirical best predictor or EBP. An alternative way of predicting $F_i(t)$ is via the Chambers and Dunstan (hereafter CD) estimator. See Chambers and Dunstan (1986) for details. Since the within area residuals are homoskedastic under (4), the CD estimator of $F_i(t)$ can be written

$$\hat{F}_i^{CD}(t) = N_i^{-1} \Big\{ \sum_{j \in s_i} I(y_j \le t) + n_i^{-1} \sum_{j \in r_i} \sum_{k \in s_i} I\big(\hat{y}_j^{EBLUP} + (y_k - \hat{y}_k^{EBLUP}) \le t\big) \Big\}. \tag{6}$$

Note that the CD estimator is asymptotically unbiased if (4) is correctly specified.

2.2 Estimation under a nonparametric mixed model

The CD estimator (6) will be biased if the functional form of the relationship between the response variable and the auxiliary variables (i.e. the regression function) is not linear or the variance term in the regression model is misspecified (Tzavidis et al., 2010). This susceptibility of parametric model-based methods to misspecification bias provides motivation for the use of alternative nonparametric model-based methods.

We now summarize application of the p-spline nonparametric regression model to SAE (Opsomer et al., 2008), and, for simplicity, consider the univariate case. The underlying regression model is then $y_j = m(x_j) + e_j$, where the $e_j$ are independent random variables with zero means. The function $m(x)$ is unknown and assumed to be approximated sufficiently well by

$$m(x; \beta, \gamma) = \beta_0 + \beta_1 x + \cdots + \beta_b x^b + \sum_{k=1}^{K} \gamma_k (x - \kappa_k)_+^b, \tag{7}$$

where $b$ is the degree of the spline, $(c)_+^b = c^b I(c > 0)$, $\kappa_k$, $k = 1, \ldots, K$, is a set of fixed constants called knots, $\beta = (\beta_0, \ldots, \beta_b)^T$ is the coefficient vector of the parametric part of the model and $\gamma = (\gamma_1, \ldots, \gamma_K)^T$ is the vector of spline coefficients. The approximating function $m(x; \beta, \gamma)$ in (7) uses truncated polynomial basis functions for simplicity and, if the number of knots $K$ is sufficiently large, can approximate most smooth functions. Ruppert et al. (2003, Chapter 5) suggest the use of a knot for every four observations, up to a maximum of about 40 knots for a univariate application.

Using a large number of knots in (7) can lead to an unstable fit. In order to overcome this problem, an upper limit is usually imposed on the size of the spline coefficient vector $\gamma$. Estimating $\beta$ and $\gamma$ by minimizing the squared deviations of model (7) from the actual data values subject to this constraint is equivalent to minimizing the penalized loss function

$$\sum_j \big( y_j - m(x_j; \beta, \gamma) \big)^2 + \lambda \gamma^T \gamma. \tag{8}$$

Here $\lambda$ is a Lagrange multiplier that controls the level of smoothness of the resulting fit. Wand (2003) and Ruppert et al. (2003, Chapter 4) note the equivalence between minimizing (8) and maximizing the likelihood of the response variable under the linear model (7) where the spline coefficients are treated as random effects. In particular, let $y_U = (y_1, y_2, \ldots, y_N)^T$,

$$X_U = \begin{pmatrix} 1 & x_1 & \cdots & x_1^b \\ \vdots & \vdots & & \vdots \\ 1 & x_N & \cdots & x_N^b \end{pmatrix} \quad \text{and} \quad \Delta_U = \begin{pmatrix} (x_1 - \kappa_1)_+^b & \cdots & (x_1 - \kappa_K)_+^b \\ \vdots & & \vdots \\ (x_N - \kappa_1)_+^b & \cdots & (x_N - \kappa_K)_+^b \end{pmatrix}.$$

The spline approximation (7) can then be written as the linear mixed model

$$y_U = X_U \beta + \Delta_U \gamma + e_U, \tag{9}$$

where $\gamma$ and $e_U$ are now assumed to be independent Gaussian random vectors of dimension $K$ and $N$ respectively. In particular, it is assumed that

$\gamma \sim N(0, \sigma_\gamma^2 I_K)$ and $e_U \sim N(0, \sigma_e^2 I_N)$. Opsomer et al. (2008) adapt p-splines to the SAE context by adding area random effects to (9), which then becomes

$$y_U = X_U \beta + \Delta_U \gamma + Z_U u + e_U, \tag{10}$$

where, as in Section 2.1, $Z_U = (z_1, \ldots, z_N)^T$ is a matrix of known covariates of dimension $N \times A$ characterising differences among the areas and $u$ is the $A$-vector of random area effects. In the simplest case, $Z_U$ is given by a matrix whose $i$-th column, for $i = 1, \ldots, A$, is an indicator variable that takes the value 1 if a unit is in area $i$ and is zero otherwise. It is assumed that the area effects are distributed independently of the spline effects $\gamma$ and the individual effects $e_U$, with $u \sim N(0, \Sigma_u)$, so that the covariance matrix of the vector $y_U$ is

$$\mathrm{Var}(y_U) = V_U = \sigma_\gamma^2 \Delta_U \Delta_U^T + Z_U \Sigma_u Z_U^T + \sigma_e^2 I_N.$$

The variance components of (10) are then given by $\theta = (\sigma_\gamma^2, \Sigma_u, \sigma_e^2)$. Note that, as in the previous Section, the use of non-informative sampling given the auxiliary variables means that (10) also holds at the sample level. When the variance components are known, well-established theory (McCulloch and Searle, 2001, Chapter 9) leads to the generalised least squares estimator of $\beta$, i.e. $\hat{\beta} = (X_s^T V_{ss}^{-1} X_s)^{-1} X_s^T V_{ss}^{-1} y_s$, and the best linear unbiased predictors (BLUPs) for $\gamma$ and $u$, i.e. $\hat{\gamma} = \sigma_\gamma^2 \Delta_s^T V_{ss}^{-1} (y_s - X_s \hat{\beta})$ and $\hat{u} = \Sigma_u Z_s^T V_{ss}^{-1} (y_s - X_s \hat{\beta})$.

In practice, the variance components are unknown and must be estimated from sample data using methods such as maximum likelihood or restricted maximum likelihood; see Harville (1977). In what follows we use $(\hat{\sigma}_\gamma^2, \hat{\Sigma}_u, \hat{\sigma}_e^2)$ to denote such estimates, allowing us to define the plug-in estimator $\hat{V}_{ss} = \hat{\sigma}_\gamma^2 \Delta_s \Delta_s^T + Z_s \hat{\Sigma}_u Z_s^T + \hat{\sigma}_e^2 I_n$, where $I_n$ is the identity matrix of order $n$. This leads to the nonparametric model-based EBLUE for $\beta$, $\hat{\beta}^{NPEBLUE} = (X_s^T \hat{V}_{ss}^{-1} X_s)^{-1} X_s^T \hat{V}_{ss}^{-1} y_s$, and to the

corresponding nonparametric EBLUPs (NPEBLUPs) for the spline and area effects in (10), $\hat{\gamma}^{NPEBLUP} = \hat{\sigma}_\gamma^2 \Delta_s^T \hat{V}_{ss}^{-1} (y_s - X_s \hat{\beta}^{NPEBLUE})$ and $\hat{u}^{NPEBLUP} = \hat{\Sigma}_u Z_s^T \hat{V}_{ss}^{-1} (y_s - X_s \hat{\beta}^{NPEBLUE})$. Under (10), the nonparametric empirical best predictor of the distribution function for area $i$ (denoted by NPEBP) is

$$\hat{F}_i^{NPEBP}(t) = N_i^{-1} \Big\{ \sum_{j \in s_i} I(y_j \le t) + \sum_{j \in r_i} I(\hat{y}_j^{NPEBLUP} \le t) \Big\}, \tag{11}$$

where $\hat{y}_j^{NPEBLUP} = x_j^T \hat{\beta}^{NPEBLUE} + \delta_j^T \hat{\gamma}^{NPEBLUP} + z_j^T \hat{u}^{NPEBLUP}$, and $x_j^T$, $\delta_j^T$ and $z_j^T$ denote respectively the rows of $X_U$, $\Delta_U$ and $Z_U$ that correspond to unit $j$ in area $i$. Similarly, under (10), the nonparametric version of the CD estimator of the distribution function for area $i$ is

$$\hat{F}_i^{NPCD}(t) = N_i^{-1} \Big\{ \sum_{j \in s_i} I(y_j \le t) + n_i^{-1} \sum_{j \in r_i} \sum_{k \in s_i} I\big(\hat{y}_j^{NPEBLUP} + (y_k - \hat{y}_k^{NPEBLUP}) \le t\big) \Big\}. \tag{12}$$

3. The Model-Based Direct Estimator for the Small Area Distribution Function

A direct estimate for a small area is simple to interpret, since the estimated value of the variable of interest for the area is just a weighted average of the sample data from the same area. This is not true of an indirect estimator like the EBLUP, which is a weighted sum over the entire sample. Unfortunately, when weights are the inverses of sample inclusion probabilities, conventional direct estimators like (2) and (3) can be quite inefficient. The Model-Based Direct Estimator (MBDE) of a small area mean improves upon the efficiency of these conventional direct estimators by using the weights that define the EBLUP for the population total under a model with random area effects. See Chandra and Chambers (2009) and Salvati et al. (2010). MBDEs for the population mean of $y$ using weights based on the linear model (4) as well as those based on the nonparametric model (10) are therefore possible. However, the finite population distribution function is the population mean of an indicator variable, which does not satisfy either (4) or (10). Consequently, 'standard' EBLUP

weights are not appropriate for defining the MBDE of this function. Instead, we use sample weights that are calibrated to the known finite population distribution of the auxiliary variables in $x$ and are based on a model with random area effects. For simplicity, we restrict our discussion below to a single scalar covariate $x$, noting that the extension to multiple scalar covariates is straightforward.

The calibrated estimator of a finite population distribution function $F_N(t)$ was defined in Harms and Duchesne (2006) as a weighted empirical distribution function

$$\hat{F}_N^{HD}(t) = N^{-1} \sum_{j \in s} w_j I(y_j \le t), \tag{13}$$

where the sample weights $w_j$ in (13) are calibrated to the known finite population distribution of $x$. In particular, let $0 < \alpha_1 < \alpha_2 < \cdots < \alpha_K < 1$ denote an ordered set of constants. Then the weights used in (13) sum to $N$ and, for $k = 1, \ldots, K$, also satisfy

$$N^{-1} \sum_{j \in s} w_j I\{x_j \le Q_x(\alpha_k)\} = \alpha_k, \tag{14}$$

where $Q_x(\alpha_k)$ is the known $\alpha_k$-quantile of the finite population distribution of $x$. That is, the weights used in (13) are calibrated to both the population size $N$ and to the population totals of the auxiliary variables defined by the indicators $I\{x_j \le Q_x(\alpha_k)\}$. Standard results from calibration theory (Deville and Särndal, 1992; Chambers, 1996) can be used to show that if these calibrated weights $w_j$ are then chosen to minimise their chi-square distance from the weights used in the Horvitz-Thompson estimator (2), as is commonly done, then (13) is a regression estimator of $F_N(t)$ under the linear model

$$I(y_j \le t) = \beta_{0t} + \sum_{k=1}^{K} \beta_{kt} I\{x_j \le Q_x(\alpha_k)\} + \varepsilon_{jt}, \tag{15}$$

where the $\varepsilon_{jt}$ are uncorrelated errors with zero expectation and variance $\sigma_{\varepsilon t}^2$ (Chambers, 2005). However, (15) is also easily seen to be a p-spline model with knots at the $\alpha_k$-th

quantiles of the finite population distribution of $x$. That is, $\hat{F}_N^{HD}(t)$ is actually a p-spline estimator of $F_N(t)$.

Define $g_{jk} = I\{x_j \le Q_x(\alpha_k)\}$ and let $g_{Uk} = (g_{jk};\ j = 1, \ldots, N)$ be the corresponding population $N$-vector, so $G_U = [1_N, g_{U1}, \ldots, g_{UK}]$ denotes the population level matrix of values of these variables, where $1_N$ denotes an $N$-vector of ones. Also, define $d_{jt} = I(y_j \le t)$ and put $d_{Ut}$ equal to the $N$-vector of population values of the $d_{jt}$. The population level version of model (15) is then

$$d_{Ut} = G_U \beta_t + \varepsilon_{Ut}. \tag{16}$$

Given the appropriate sample and non-sample components of $d_{Ut}$ and $G_U$, and the covariance matrix $V_{Ut} = \sigma_{\varepsilon t}^2 I_N$ of $\varepsilon_{Ut}$, the vector of sample weights $w_{jt}^{DF}$ that define the EBLUP of the population total of the $d_{jt}$ under (16) is then

$$w_{st}^{DF} = (w_{jt}^{DF};\ j \in s) = 1_n + \hat{H}_{st}^T (G_U^T 1_N - G_s^T 1_n) + (I_n - \hat{H}_{st}^T G_s^T) \hat{V}_{sst}^{-1} \hat{V}_{srt} 1_{N-n}, \tag{17}$$

where $\hat{H}_{st} = (G_s^T \hat{V}_{sst}^{-1} G_s)^{-1} G_s^T \hat{V}_{sst}^{-1}$. Under (16), $\hat{V}_{sst} = \hat{\sigma}_{\varepsilon t}^2 I_n$ and $\hat{V}_{srt} = 0$, so these weights simplify to

$$w_s^{DF} = (w_j^{DF};\ j \in s) = 1_n + G_s (G_s^T G_s)^{-1} (G_U^T 1_N - G_s^T 1_n).$$

The model (16) is easily adapted to small area estimation by including random area effects. That is, we replace (16) by

$$d_{Ut} = G_U \beta_t + Z_U u_t + \varepsilon_{Ut}, \tag{18}$$

where $Z_U$ was defined following (4) and $u_t \sim N(0, \Omega_t)$ is an $A$-vector of random area effects. As usual, we assume that $u_t$ and $\varepsilon_{Ut}$ are independently distributed, so that $\mathrm{Var}(d_{Ut}) = V_{Ut} = Z_U \Omega_t Z_U^T + \sigma_{\varepsilon t}^2 I_N$. The sample weights $w_{jt}^{DF}$ that define the EBLUP of the population total of the $d_{jt}$ under (18) are then still given by (17), but now with

$\hat{V}_{sst} = Z_s \hat{\Omega}_t Z_s^T + \hat{\sigma}_{\varepsilon t}^2 I_n$ and $\hat{V}_{srt} = Z_s \hat{\Omega}_t Z_r^T$, where $\hat{\Omega}_t$ and $\hat{\sigma}_{\varepsilon t}^2$ are the estimated values of the variance components of (18).

In practice, one first needs to decide on the calibration constraints (14) before (18) can be fitted and (17) calculated. This in turn requires that one has chosen the values $0 < \alpha_1 < \alpha_2 < \cdots < \alpha_K < 1$. We adapt the ordered half-sample cross validation procedure described in Chambers (2005) for this purpose. In particular, we fix $K = 1$ and then search for the value $\alpha_t^{opt}$ that maximises the concordance between the sample values of $d_{jt}$ and the sample values of $g_j = I\{x_j \le Q_x(\alpha)\}$. The steps in this procedure are as follows:

1. Order the sample x-values: $x_{(1)}, x_{(2)}, x_{(3)}, \ldots, x_{(n-1)}, x_{(n)}$;
2. Create two sets $E = \{x_{(1)}, x_{(3)}, \ldots\}$ and $V = \{x_{(2)}, x_{(4)}, \ldots\}$;
3. For given $\alpha$ and $t$, fit the model (18) and then compute the weights (17), treating $E$ as the 'sample' and $V$ as the 'nonsample'. Denote the corresponding value of (13) based on these weights by $\hat{F}_N^{HD(n)}(t, \alpha)$;
4. The optimal value $\alpha_t^{opt}$ then satisfies

$$\Big\{ \hat{F}_N^{HD(n)}(t, \alpha_t^{opt}) - n^{-1} \sum_{j \in s} I(y_j \le t) \Big\}^2 = \min_{0 < \alpha < 1} \Big\{ \hat{F}_N^{HD(n)}(t, \alpha) - n^{-1} \sum_{j \in s} I(y_j \le t) \Big\}^2.$$

We note that although this procedure only identifies a single 'most concordant' calibration constraint to use in (14), there is nothing to stop it being extended to identification of multiple calibration constraints. However, some care must then be taken to ensure that the resulting values of $Q_x(\alpha)$ are separated sufficiently in the interval spanned by the sample values of the auxiliary $x$. Failure to do this could result in the sample design matrix defined by (18) not being of full rank.

Finally, given the weights (17), we write down the MBDE for the area $i$ distribution function $F_i(t)$ as

$$\hat{F}_i^{MBDE}(t) = \frac{\sum_{j \in s_i} w_{jt}^{DF} I(y_j \le t)}{\sum_{j \in s_i} w_{jt}^{DF}}. \tag{19}$$

We refer to (19) as a direct estimator because it is a weighted average of the sample data from the area of interest. However, this does not mean that it can be calculated from these data alone. The weights (17) are a function of the data from the entire sample. That is, they borrow strength from other areas via the model (18). It should also be pointed out that since the weights (17) depend on $t$, there is no guarantee that (19) defines a monotone function of $t$, i.e. one where $t_1 < t_2$ implies $\hat{F}_i^{MBDE}(t_1) \le \hat{F}_i^{MBDE}(t_2)$. This issue will usually not be relevant when one wishes to estimate the distribution of interest at points that are well separated, but can be a problem when the aim is to invert (19) as a function of $t$ in order to estimate quantiles. In such a situation we recommend that (19) first be transformed to be monotone in $t$, e.g. using the approach described in He (1997).

3.1 Mean squared error estimation for the MBDE

A bias-robust estimator of the mean squared error (MSE) of the MBDE is described in Chandra and Chambers (2009), see also Chambers et al. (2009), and we use this approach here to define a corresponding MSE estimator for (19). This is the estimator

$$\hat{M}\{\hat{F}_i^{MBDE}(t)\} = \hat{V}_{it} + \hat{B}_{it}^2, \tag{20}$$

where $\hat{V}_{it}$ is a heteroskedasticity-robust estimator of the conditional prediction variance of $\hat{F}_i^{MBDE}(t)$ (Royall and Cumberland, 1978), $\hat{B}_{it}$ is an estimator of the corresponding conditional prediction bias, and the conditioning is with respect to the value of the area effect. In particular, we use

$$\hat{V}_{it} = N_i^{-2} \sum_{j \in s_i} \Big\{ \big(N_i w_{jt}^{DF(i)} - 1\big)^2 + (N_i - n_i)\, n_i^{-1} \Big\} (d_{jt} - \hat{\mu}_{jt})^2, \tag{21}$$

where $w_{jt}^{DF(i)} = w_{jt}^{DF} \big/ \sum_{k \in s_i} w_{kt}^{DF}$ and $\hat{\mu}_{jt}$ is an unbiased linear estimator of the conditional expected value $\mu_{jt} = E(d_{jt} \mid g_j, u_t)$. Chambers et al. (2009) recommend that $\hat{\mu}_{jt}$ be computed as the unshrunken version of the EBLUP for $\mu_{jt}$, i.e.

$$\hat{\mu}_{jt} = \hat{\beta}_{0t} + g_j \hat{\beta}_{1t} + z_j^T (Z_s^T Z_s)^{-1} Z_s^T (I_n - G_s \hat{H}_{st})\, d_{st}.$$

For the conditional bias of the MBDE, we use a simple plug-in estimator of the form

$$\hat{B}_{it} = \sum_{j \in s_i} w_{jt}^{DF(i)} \hat{\mu}_{jt} - N_i^{-1} \sum_{j \in U_i} \hat{\mu}_{jt}. \tag{22}$$

Note that the MSE estimator (20) treats the weights (17) as fixed, i.e. it ignores the extra variability associated with estimation of the variance components, and is therefore a heteroskedasticity-robust first order approximation to the actual conditional MSE of the MBDE. Chambers et al. (2009) refer to this as a pseudo-linearization assumption, since for large overall sample sizes the contribution to the overall MSE of (19) arising from the variability of the estimated variance components will be of smaller order of magnitude than the fixed weights prediction variance estimated by (21). However, the extent of this underestimation will depend on the small area sample sizes and the characteristics of the population of interest, particularly the strength of the small area effects. Finally, we note that the square of (22) is a conservative estimator of the squared bias, since $E(\hat{B}_{it}^2) = \mathrm{Var}(\hat{B}_{it}) + \{E(\hat{B}_{it})\}^2$. However, the extent of this overestimation is typically very small.

4. Empirical Evaluations

In this Section we report the results from model-based and design-based simulation studies that illustrate the performance of the different estimators of the small area distribution function defined in the preceding two Sections. These estimators are set out in Table 1. Their

performance in the simulation studies is evaluated by computing for each small area the absolute relative bias (ARB), the relative root mean squared error (RRMSE) and the coverage rate (CR) of nominal 95 per cent confidence intervals, defined as follows:

$$ARB_i = \Big\{ R^{-1} \sum_{r=1}^{R} F_{ir} \Big\}^{-1} \Big| R^{-1} \sum_{r=1}^{R} (\hat{F}_{ir} - F_{ir}) \Big| \times 100,$$

$$RRMSE_i = \Big( R^{-1} \sum_{r=1}^{R} F_{ir} \Big)^{-1} \sqrt{R^{-1} \sum_{r=1}^{R} (\hat{F}_{ir} - F_{ir})^2} \times 100,$$

and

$$CR_i = R^{-1} \sum_{r=1}^{R} I\big( |\hat{F}_{ir} - F_{ir}| \le 2 \hat{M}_{ir}^{1/2} \big) \times 100.$$

Here $R$ denotes the number of simulations, $F_{ir}$ denotes the true value of the area $i$ distribution function at simulation $r$, $\hat{F}_{ir}$ denotes an estimate of this value, and $\hat{M}_{ir}$ denotes an estimate of the MSE of $\hat{F}_{ir}$. The value of the true MSE for $\hat{F}_{ir}$ is calculated as $R^{-1} \sum_{r=1}^{R} (\hat{F}_{ir} - F_{ir})^2$. Note that in the design-based simulations $F_{ir} = F_i$.

4.1 Model-based simulations

In the model-based simulations we set $A = 30$ and use two types of models to generate the population values of $y$. The first is a linear model, $y_j = 500 + 1.5 x_j + u_i + e_j$, where $x_j \sim \chi^2(20)$, $j = 1, \ldots, N_i$ and $i = 1, \ldots, A$, with random area effects $u_i$ generated as independent realizations from a $N(0, 23.52)$ distribution and $e_j$ distributed as $N(0, 94.09)$, corresponding to an intra-area correlation of $\sigma_u^2 / (\sigma_u^2 + \sigma_\varepsilon^2) = 0.2$. Simulations based on this model are referred to as set 1 simulations. The second model is a multiplicative model, $y_j = 5 x_j^{\beta} u_i e_j$, where the values of $x_j$ are independently drawn from the lognormal distribution $\log(x_j) \sim N(6, \sigma_x^2)$, and the individual effects and area effects are independently drawn as $\log(e_j) \sim N(0, \sigma_e^2)$ and $\log(u_i) \sim N(0, \sigma_u^2)$ respectively. We use two sets of parameters for this model, defined by $\beta$ (1 or 2), $\sigma_u$ (0.4 or 0.6), $\sigma_e$ (0.7 or 1.0) and $\sigma_x$ (2.25

or 1.20). These are referred to from now on as set 2a and set 2b. Data values for $y$ generated under set 2a are almost linear in $x$ while those generated under set 2b are quite non-linear in $x$. The small area population sizes $N_i$ are randomly drawn from a uniform distribution on [450, 550] and kept fixed over the simulations. The small area sample sizes $n_i$ are determined by first selecting a simple random sample of size $n = 600$ from the population and noting the resulting sample sizes in each small area. These area-specific sample sizes $n_i$ are then fixed in the simulations by treating the small areas as strata and carrying out stratified random sampling. A total of $R = 1000$ simulations are then carried out for each combination of model and individual error distribution, with each simulation corresponding to first generating the population values and then drawing a sample. The average ARB values and the average RRMSE values of the different small area distribution function estimators are shown in Tables 2 and 3 respectively. These values are in percentage terms, and the averages are over the 30 small areas. All estimators are evaluated at the 0.1, 0.25, 0.5, 0.75 and 0.9 quantiles of $y$.

4.2 Design-based simulations

The design-based simulations are based on two real survey data sets. The first survey data set is based on data collected in the 1995-96 Australian Agricultural Grazing Industry Survey (AAGIS) conducted by the Australian Bureau of Agricultural and Resource Economics. In the original sample there were 759 farms from 12 regions (the small areas of interest), which make up the wheat-sheep zone for Australian broadacre agriculture. We used these sample data to generate a synthetic population of size $N = 39{,}562$ farms by re-sampling the original AAGIS sample of $n = 759$ farms with probability proportional to a farm's sample weight. This fixed population was then repeatedly sampled using stratified random sampling with regions corresponding to strata and with stratum sample sizes the same as in the original sample.
The variable of interest is total cash costs (TCC) and the auxiliary variable is land area. Based on the original AAGIS sample data, the fit of the linear mixed model (AIC = 20012.32) and the fit of the nonparametric p-spline regression model (AIC = 19998.02) were essentially the same, indicating that addition of the nonparametric spline component does not improve the fit of the mixed model. We therefore do not expect to see much difference between the distribution function estimates generated by these two models. The aim is to estimate the values of the regional distribution functions at the 0.1, 0.25, 0.5, 0.75 and 0.9 quantiles of the finite population distribution of TCC.

The data for the second design-based simulation come from the Environmental Monitoring and Assessment Program (EMAP) survey carried out by the Space Time Aquatic Resources Modelling and Analysis Program (STARMAP) at Colorado State University, and we replicate the design-based simulation experiment carried out by Salvati et al. (2010). The background to this data set is that EMAP conducted a survey of lakes in the North-Eastern states of the United States of America between 1991 and 1996. The data collected in this survey included 551 measurements of Acid Neutralizing Capacity (ANC), an indicator of the acidification risk of water bodies in water resource surveys, from a sample of 349 of the 21,028 lakes located in this area. Here we define lakes grouped by 6-digit Hydrologic Unit Code (HUC) as our small areas of interest. Since three HUCs have sample sizes of one, these are combined with adjacent HUCs, leading to a total of 23 small areas. Sample sizes in these 23 areas vary from 2 to 45. A (fixed) pseudo-population of $N = 21{,}028$ lakes is defined by sampling $N$ times with replacement and with probability proportional to a lake's sample weight from the original sample of 349 lakes. A total of $R = 1000$ independent stratified random samples of the same size as the original sample are selected from this pseudo-population, with HUCs corresponding to strata and stratum sample sizes fixed to be the same as in the original sample. The survey variable of interest is the ANC value of a lake, with its elevation defining the auxiliary variable.
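The design-based set-up described above (a fixed pseudo-population obtained by weight-proportional resampling, followed by repeated stratified random sampling with the original stratum sample sizes) can be sketched in Python as follows. This is a toy illustration under stated assumptions rather than the authors' code: the function names, the six-unit 'original sample', its weights and the pseudo-population size of 200 are all hypothetical, and each stratum of the pseudo-population is assumed to contain at least as many units as its required sample size.

```python
import numpy as np

def build_pseudo_population(weights, pop_size, rng):
    """Draw pop_size indices of original sample units, with replacement,
    with probability proportional to each unit's sample weight."""
    w = np.asarray(weights, dtype=float)
    return rng.choice(w.size, size=pop_size, replace=True, p=w / w.sum())

def stratified_sample(strata, stratum_sizes, rng):
    """Simple random sample without replacement within each stratum.
    Returns indices into the pseudo-population."""
    strata = np.asarray(strata)
    picked = []
    for h, n_h in stratum_sizes.items():
        members = np.flatnonzero(strata == h)
        picked.append(rng.choice(members, size=n_h, replace=False))
    return np.concatenate(picked)

rng = np.random.default_rng(2010)

# Hypothetical original sample: six units in two areas, with sample weights.
orig_area = np.array([0, 0, 0, 1, 1, 1])
orig_weight = np.array([30.0, 45.0, 25.0, 40.0, 20.0, 40.0])

# Fixed pseudo-population, generated once and reused across simulations.
pop_idx = build_pseudo_population(orig_weight, pop_size=200, rng=rng)
pop_area = orig_area[pop_idx]

# Stratum sample sizes are fixed to those of the original sample.
stratum_sizes = {0: 3, 1: 3}
sample_idx = stratified_sample(pop_area, stratum_sizes, rng)
print(np.bincount(pop_area), np.bincount(pop_area[sample_idx]))
```

In a full simulation the last step would be repeated $R$ times with the pseudo-population held fixed, so that the true area distribution functions do not change across replications.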
Using the original EMAP data, the fit of the linear mixed model (AIC = 6714.31) is worse than that of the nonparametric regression model (AIC = 6580.2). In this case, therefore, there are gains from including the spline component in the mixed model, and so we expect that estimates of the distribution function based on the nonparametric regression model will perform better than those based on the linear mixed model. Again, the aim is to estimate the values of the individual HUC distribution functions at the 0.1, 0.25, 0.5, 0.75 and 0.9 quantiles of the finite population distribution of ANC.

Tables 4 and 5 show the average over small areas of the ARB and RRMSE values of the different distribution function estimators based on the R = 1000 independent stratified samples taken from the AAGIS and EMAP populations respectively. Similarly, Table 6 shows the corresponding averages over the areas of the true RMSEs and estimated RMSEs, and the actual coverage rates of nominal 95 per cent confidence intervals for the true area-specific distribution function values based on the MBDE estimator (19) and its associated MSE estimator (20). Figures 1 and 2 show the area-specific values of the true RMSE and estimated RMSE of the MBDE (19) for the design-based simulations of the AAGIS and EMAP data.

4.3 Discussion

Two things stand out in Tables 2 and 3. The first is that the MBDE offers substantial bias gains over the other DF estimators, at all quantiles, when the relationship between the study variable and the covariate is complicated and/or the usual mixed model distributional assumptions are invalid (sets 2a and 2b). If the underlying population structure is linear and the usual mixed model assumptions hold (set 1), the CD and NPCD estimators have slightly smaller absolute biases than the MBDE. The larger biases of the 'plug-in' EBP and NPEBP estimators are not unexpected in set 1 because these estimators ignore unit level variability in y. Second, the NPCD estimator generally records the lowest RRMSE among the alternatives to the MBDE, but when the relationship between y and x is complicated, as under sets 2a and 2b, the RRMSE values recorded by the MBDE are comparable to, and sometimes lower than, those recorded by the NPCD estimator. On the other hand, under the linear specification (set 1), the MBDE is clearly less efficient than its alternatives.

Design-based simulations serve to complement model-based simulations for SAE, providing evidence of comparative performance and robustness in realistic data scenarios. Table 4 shows the results for the design-based simulations using the AAGIS data. Here we see that the MBDE has lower bias and RRMSE than the other predictors at all quantiles. As expected, given the linear relationship between y and x, the CD-based estimators of the DF based on the linear mixed model are generally more efficient than those based on the nonparametric spline regression model. However, the reverse is true for the EBP-based estimators, perhaps reflecting the lower (but still substantial) biases of the NPEBP. Table 5 reports the design-based simulation results for the EMAP data. These again indicate that the MBDE dominates the other estimators in terms of bias. The results for RRMSE are not as clear-cut as in the AAGIS simulations, but still show that the performance of the MBDE is comparable with the performance of the NPCD estimator, which was consistently the best of the alternative estimators in terms of RRMSE.

We now turn to an examination of the performance of the MSE estimator (20) for the MBDE. Figures 1 and 2 show that this estimator accurately tracks the simulation (i.e. repeated sampling) area-specific MSEs of the MBDE at all five target quantiles for y. This good performance is confirmed by the results in Table 6, which shows that the area averages of the true RMSEs and the estimated RMSEs obtained using (20) are very close. Finally, we note that one can combine the MBDE estimator (19) with the MSE estimator (20) to generate normal theory confidence intervals for the area-specific value of the distribution function, i.e. as the small area estimate plus or minus twice its corresponding estimated RMSE.
Table 6 shows that the actual coverage rates achieved by these intervals, though generally less than 95 per cent, are still close enough to their target value to be practically useful.
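The interval construction and coverage calculation described here can be illustrated with a short sketch. It assumes that, for one area and one quantile, the simulation has produced a vector of replicate estimates and a matching vector of estimated RMSEs. The function names and the toy numbers are hypothetical and are not taken from the study.

```python
import numpy as np

def coverage_rate(estimates, rmse_hats, truth):
    """Percentage of replicates whose interval
    estimate +/- 2 * estimated RMSE contains the true value."""
    estimates = np.asarray(estimates, dtype=float)
    rmse_hats = np.asarray(rmse_hats, dtype=float)
    lower = estimates - 2.0 * rmse_hats
    upper = estimates + 2.0 * rmse_hats
    return 100.0 * float(np.mean((lower <= truth) & (truth <= upper)))

def true_rmse(estimates, truth):
    """Repeated-sampling RMSE of the estimates around the true value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - truth) ** 2)))

# Toy replicate results for a single area (hypothetical values).
truth = 0.50
estimates = [0.48, 0.52, 0.60, 0.50]
rmse_hats = [0.03, 0.03, 0.03, 0.03]

cr = coverage_rate(estimates, rmse_hats, truth)   # 3 of 4 intervals cover 0.50
rmse = true_rmse(estimates, truth)
avg_rmse_hat = float(np.mean(rmse_hats))          # average estimated RMSE
```

Averaging cr, rmse and avg_rmse_hat over areas would give summaries of the kind reported in Table 6, under the assumption that the table entries are simple averages over areas.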

Finally, we note that an alternative to the CD estimator that is both model-consistent and design-consistent has been proposed by Rao et al. (1990). Although the relevant results are not reported here, we also explored the performance of both parametric and nonparametric versions of this estimator in our simulations. In all cases, this performance was almost identical to that of the parametric and nonparametric versions of the CD predictor.

5. Conclusions

This paper develops an MBDE estimator for the value of the area-specific finite population distribution of a response variable y. This estimator is based on sample weights that are calibrated to the finite population distribution of an auxiliary variable x, and also allow for random area effects. We then compare the performance of this MBDE estimator with two competing estimators based on either a linear mixed model or a nonparametric mixed model for y. Our results indicate that the proposed MBDE can sometimes be much better than these alternatives, particularly in realistic applications where fitted models are approximations at best. On the other hand, if the model assumptions are valid (e.g. set 1 in the model-based simulations), then area-specific distribution function estimators based on the CD representation are preferable. We also provide a method for estimating the MSE of the MBDE and demonstrate empirically that it performs well.

References

Chambers, R. (1996). Robust case-weighting for multipurpose establishment surveys. Journal of Official Statistics, 12, 3-32.
Chambers, R. (2005). Imputation vs. Estimation of Finite Population Distributions. Southampton Statistical Sciences Research Paper. S3RI Methodology Working Papers, M05/06.

Chambers, R., Chandra, H. and Tzavidis, N. (2009). On Bias-Robust Mean Squared Error Estimation for Linear Predictors for Domains. Working Papers, Centre for Statistical and Survey Methodology, The University of Wollongong, Australia. (Available from: http://cssm.uow.edu.au/publications).
Chambers, R. and Dunstan, R. (1986). Estimating distribution functions from survey data. Biometrika, 73, 597-604.
Chandra, H. and Chambers, R. (2009). Multipurpose weighting for small area estimation. Journal of Official Statistics, 25, 3, 379-395.
Cochran, W.G. (1977). Sampling Techniques, 3rd edition. Wiley & Sons, NY.
Deville, J.C. and Särndal, C.E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376-382.
Eilers, P. and Marx, B. (1996). Flexible Smoothing using B-splines and Penalized Likelihood (with comments and rejoinder). Statistical Science, 11, 1200-1224.
Harms, T. and Duchesne, P. (2006). On calibration estimation for quantiles. Survey Methodology, 32, 37-52.
Harville, D.A. (1977). Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72, 320-338.
He, X. (1997). Quantile curves without crossing. American Statistician, 51, 186-192.
McCulloch, C.E. and Searle, S.R. (2001). Generalized Linear and Mixed Models. Wiley, New York.
Opsomer, J.D., Claeskens, G., Ranalli, M.G., Kauermann, G. and Breidt, F.J. (2008). Nonparametric small area estimation using penalized spline regression. Journal of the Royal Statistical Society, Series B, 70, 265-286.
Rao, J.N.K., Kovar, J.G. and Mantel, H.J. (1990). On estimating distribution functions and quantiles from survey data using auxiliary information. Biometrika, 77, 365-375.

Rao, J.N.K. (2003). Small Area Estimation. New York: Wiley.
Rueda, M., Martínez, S., Martínez, H. and Arcos, A. (2007). Estimation of the distribution function with calibration methods. Journal of Statistical Planning and Inference, 137, 435-448.
Rueda, M., Sánchez-Borrego, I., Arcos, A. and Martínez, S. (2010). Model-calibration estimation of the distribution function using nonparametric regression. Metrika, 71, 33-44.
Ruppert, D., Wand, M.P. and Carroll, R. (2003). Semiparametric Regression. Cambridge University Press, Cambridge.
Royall, R.M. (1976). The linear least-squares prediction approach to two-stage sampling. Journal of the American Statistical Association, 71, 657-664.
Royall, R.M. and Cumberland, W.G. (1978). Variance estimation in finite population sampling. Journal of the American Statistical Association, 71, 351-358.
Salvati, N., Chandra, H., Ranalli, M.G. and Chambers, R. (2010). Small area estimation using a nonparametric model-based direct estimator. Computational Statistics and Data Analysis, 54, 2159-2171.
Tzavidis, N., Marchetti, S. and Chambers, R. (2010). Robust prediction of small area means and quantiles. Australian and New Zealand Journal of Statistics, 52, 167-186.
Wand, M.P. (2003). Smoothing and mixed models. Computational Statistics, 18, 223-249.

Table 1. Description of the estimators considered in the simulation studies.

Estimator  Description
MBDE       MBDE (19) with sample weights (17) based on model (18)
EBP        EBLUP-based EBP estimator (5) under linear mixed model (4)
CD         EBLUP-based CD estimator (6) under linear mixed model (4)
NPEBP      NPEBLUP-based EBP estimator (11) under spline-based mixed model (10)
NPCD       NPEBLUP-based CD estimator (12) under spline-based mixed model (10)

Table 2. Area averages of absolute relative bias (ARB, %) generated by model-based simulations.

Set  Population quantile     MBDE      EBP       CD    NPEBP     NPCD
1    0.10                    2.41    71.94     1.24    71.83     1.28
1    0.25                    1.29    30.92     0.61    30.83     0.62
1    0.50                    0.84     2.61     0.40     2.65     0.39
1    0.75                    0.52     9.17     0.26     9.14     0.25
1    0.90                    0.27     5.46     0.15     5.43     0.15
2a   0.10                    2.40   127.28   141.01   114.80   160.20
2a   0.25                    1.30     3.13    17.97     4.57    24.39
2a   0.50                    0.80    39.42    10.49    16.33     8.94
2a   0.75                    0.51    19.18     9.05     7.42     8.97
2a   0.90                    0.28     1.12     4.00     1.35     3.92
2b   0.10                    2.18   444.41   344.70   175.30   202.23
2b   0.25                    1.38   120.62    80.84    21.72    33.14
2b   0.50                    0.79    13.75     5.82    17.00    10.75
2b   0.75                    0.53    17.62    29.28    12.20    11.48
2b   0.90                    0.29    17.36    23.47     3.30     5.67

Table 3. Area averages of relative root mean squared error (RRMSE, %) generated by model-based simulations.

Set  Population quantile     MBDE      EBP       CD    NPEBP     NPCD
1    0.10                   63.22    82.54    38.12    82.52    38.21
1    0.25                   36.55    39.05    22.35    39.21    22.40
1    0.50                   21.17    15.25    12.76    15.45    12.78
1    0.75                   12.38    11.23     6.93    11.22     6.92
1    0.90                    7.16     6.46     3.61     6.43     3.60
2a   0.10                   65.17   314.20   179.08   242.08   180.82
2a   0.25                   37.57   115.75    41.11    80.16    36.71
2a   0.50                   21.66    71.40    16.70    39.96    14.19
2a   0.75                   12.54    37.44    11.32    18.98    10.97
2a   0.90                    7.23     6.43     6.04     5.81     5.67
2b   0.10                   64.88   455.68   351.48   297.19   218.47
2b   0.25                   37.30   128.20    86.91    92.85    43.29
2b   0.50                   21.53    26.30    18.50    44.63    16.34
2b   0.75                   12.43    21.17    30.80    26.43    13.67
2b   0.90                    7.19    18.70    24.98    10.84     7.47

Table 4. Average values over 12 regions of absolute relative bias (ARB, %) and relative root mean squared error (RRMSE, %) for the AAGIS data.

Population quantile     MBDE      EBP       CD    NPEBP     NPCD
ARB (%)
0.10                    1.51    97.14    87.92    95.03   143.74
0.25                    0.92    94.18    50.74    64.10    53.45
0.50                    0.35    67.99    13.96    38.27    16.34
0.75                    0.30    24.39     3.97    15.34    10.70
0.90                    0.15    10.26     1.83     6.76     2.69
RRMSE (%)
0.10                   47.75   131.26   108.26   117.60   155.65
0.25                   23.40   114.07    59.25    81.29    58.53
0.50                   14.48    81.50    19.17    45.62    19.06
0.75                    7.59    29.43     8.53    20.23    12.26
0.90                    3.81    10.67     4.20     8.51     4.36

Table 5. Average values over 23 HUCs of absolute relative bias (ARB, %) and relative root mean squared error (RRMSE, %) for the EMAP data.

Population quantile     MBDE      EBP       CD    NPEBP     NPCD
ARB (%)
0.10                    2.10    71.13    32.37    50.85    21.14
0.25                    0.74    51.53    17.20    42.38    18.74
0.50                    0.67    43.44    13.83    33.09    11.86
0.75                    0.43    21.92     6.22    18.12     9.17
0.90                    0.25    11.55     2.23    11.92     3.61
RRMSE (%)
0.10                   46.76    72.02    47.71    58.38    43.91
0.25                   28.41    58.93    32.64    47.92    29.02
0.50                   30.51    52.17    25.18    36.83    21.60
0.75                   14.76    27.94    16.04    21.70    15.21
0.90                    5.30    14.13     6.29    13.57     6.06

Table 6. Average values of true RMSE and estimated RMSE and actual coverage rate (CR, %) of nominal 95 per cent confidence intervals generated by the MBDE (19) and associated MSE estimator (20) for the AAGIS and EMAP data. Averages are over regions.

Population quantile  AAGIS true RMSE  AAGIS est. RMSE  AAGIS CR  EMAP true RMSE  EMAP est. RMSE  EMAP CR
0.10                           0.034            0.034        89           0.018           0.021       95
0.25                           0.051            0.052        91           0.041           0.041       92
0.50                           0.061            0.061        95           0.054           0.055       93
0.75                           0.051            0.051        94           0.052           0.058       93
0.90                           0.031            0.032        90           0.028           0.034       93

Figure 1. Region-specific values of actual repeated sampling RMSE (solid line) and average estimated RMSE (dashed line) of MBDE (19) for the AAGIS data.

Figure 2. HUC-specific values of actual repeated sampling RMSE (solid line) and average estimated RMSE (dashed line) of MBDE (19) for the EMAP data.