Constrained Small Area Estimators Based on M-quantile Methods


Journal of Official Statistics, Vol. 28, No. 1, 2012

Enrico Fabrizi 1, Nicola Salvati 2, and Monica Pratesi 3

Small area estimators associated with M-quantile regression methods have been recently proposed by Chambers and Tzavidis (2006). These estimators do not rely on normality or other distributional assumptions, do not require explicit modelling of the random components of the model and are robust with respect to outliers and influential observations. In this article we consider two remaining problems which are relevant to practical applications. The first is benchmarking, that is the consistency of a collection of small area estimates with a reliable estimate obtained according to ordinary design-based methods for the union of the areas. The second is the correction of the under/over-shrinkage of small area estimators. In fact, it is often the case that, if we consider a collection of small area estimates, they misrepresent the variability of the underlying ensemble of population parameters. We propose benchmarked M-quantile estimators to solve the first problem, while for the second we propose an algorithm that is quite similar to the one used to obtain Constrained Empirical Bayes estimators, but that, consistently with the principles of M-estimation, does not make use of distributional assumptions and tries to achieve robustness with respect to the presence of outliers. The article is essentially about point estimation; we also introduce estimators of the mean squared error, but we do not deal with interval estimation.

Key words: Over-shrinkage; benchmarking; robust estimation.

1. Introduction

In statistical inference about a finite population, estimates of population descriptive quantities for a target variable y are usually needed for the population as a whole and for different collections of subpopulations (domains or areas). A small area estimation problem arises when the available samples are not large enough to allow for reliable estimation using standard design-based methods for all or most of the domains (areas) being considered.
A large and growing literature is devoted to this subject; see Rao (2003) and Jiang and Lahiri (2006) for general introduction and recent reviews of the literature.

1 DISES, Università Cattolica del S. Cuore, Piacenza, Italy. Email: enrico.fabrizi@unicatt.it
2 Dipartimento di Statistica e Matematica Applicata all'Economia, Università di Pisa, Italy. Email: salvati@ec.unipi.it
3 Dipartimento di Statistica e Matematica Applicata all'Economia, Università di Pisa, Italy. Email: m.pratesi@ec.unipi.it
Acknowledgments: This article is based on a conference presentation at the ITACOSM, Siena, Italy. The work of Salvati and Pratesi is supported by the project PRIN 2007 "Efficient use of auxiliary information at the design and at the estimation stage of complex surveys: methodological aspects and applications for producing official statistics" awarded by the Italian Government to the Universities of Perugia, Cassino, Florence, Pisa and Trieste. The work of Salvati and Pratesi is also supported by the project SAMPLE "Small Area Methods for Poverty and Living Condition Estimates" financed by the European Commission under the 7th FP.
© Statistics Sweden

All small area estimation methods are based on the availability of population-level auxiliary information to improve the precision of the estimation. They have their own specificity in the way of linking auxiliary information and target variable and in the properties of the obtained small area predictors. When the samples available for each area are very small, model-based predictors are popular. Among these methods, Best and Empirical Best (EB) predictors have become the industry-standard estimators. Let us suppose, for simplicity's sake, that a linear model linking the target variable y to a set of auxiliary variables x is plausible. EB predictors are based on the assumption of a linear mixed model in which random effects are introduced to account for the correlation of residuals within the same area. Common criticisms of the EB based on linear mixed models are that they require explicit assumptions on the random effects and that the estimation of the parameters relying on normality or on least squares is sensitive to the presence of outliers or influential observations in the data, a situation that is likely to occur in the analysis of survey data. This limitation can be overcome using Robust EB (Sinha and Rao 2009). Nonetheless, here we prefer to focus on the alternative approach based on M-quantile regression. Small area estimation based on linear quantile regression and M-estimation has been recently proposed by Chambers and Tzavidis (2006). M-quantile estimation is based on the assumption of a linear relationship between y and x at each quantile of the $y|x$ distribution, but is free of any distributional assumption, is robust with respect to the presence of outliers and influential observations and does not require explicit specification of the random part of the model. How M-quantile regression may be used to obtain small area predictors is reviewed in Section 2.

Model-based methods may not satisfy coherence properties that may be relevant to final users of small area estimates. In this article we focus on two of these properties.
The first is benchmarking, while the second may be labelled as neutral shrinkage. Let the small areas be a partition of a larger area. A set of estimates is said to be benchmarked if the estimated totals of y for the small areas sum to the total estimated for the larger area (typically using design unbiased or design consistent methods). The EB predictors do not fulfill the benchmarking property (see Rao 2003 for a discussion of this problem and also of adjusted predictors). As may be expected, this is true also for M-quantile regression based estimators; in fact they are model-based and do not incorporate the sampling weights. We propose a modification of the estimation algorithm of the M-quantile (MQ) predictors to obtain benchmarked estimates. Formally they will be constrained optimal MQ estimators and will be referred to as benchmarked MQ (BMQ).

The second coherence property we consider is neutral shrinkage, which is a special ensemble property, that is, a property related to the estimation of a functional of an ensemble of parameters (Frey and Cressie 2003). Specifically, we focus on the estimation of the variance of the underlying population means or totals of y pertaining to an ensemble of small areas, a problem often considered in the literature (see, for instance, Ghosh 1992; Judkins and Liu 2000; Ugarte et al. 2009). An ensemble of estimators has neutral shrinkage if the variance of the ensemble of the parameters can be unbiasedly estimated by the variance of the ensemble of the estimators. Design-based estimates are typically over-dispersed (they are more spread than the actual population parameters), while EB predictors are under-dispersed, that is they over-shrink.

The behaviour of MQ predictors in this respect has not been studied in the literature. By means of simulations, it may be easily shown that they can either over- or under-shrink depending on the actual distribution of the underlying parameters, as will be made clear in later sections. In the article we propose an adjustment of the benchmarked MQ predictors in order to obtain estimators with approximately neutral shrinkage. This adjustment parallels the one used to adjust EB predictors (Rao 2003, Section 9.6).

The structure of the article is as follows. In Section 2 MQ estimators are reviewed; in Section 3 we propose benchmarked MQ estimators and benchmarked MQ estimators with approximately neutral shrinkage. A simulation exercise is introduced and its results discussed in Section 4. Section 5 is devoted to the description of the application of the method to a well-known data set, which will test the method also in the presence of outliers. Concluding remarks are contained in Section 6.

2. Small Area Estimators Based on M-quantile Methods

Let us suppose that a population U of size N is divided into m nonoverlapping subsets (domains or areas of interest) of size $N_i$, $i = 1, \dots, m$. We are interested in a target variable y and more specifically in estimating the area level means

$$\bar{Y}_i = N_i^{-1} \sum_{j=1}^{N_i} y_{ij}$$

Suppose that a random sample is drawn from the population, so that area-specific samples of size $n_i > 0$ are available. It may also be the case that $n_i = 0$ for some areas. The problem of estimation (and benchmarking) for these areas will be addressed at the end of Section 3.1. Values of y are known only for sampled values, but we assume that a vector of p auxiliary variables $x_{ij}$ is known for each unit in the population. We use subscript i to denote restriction to small area i, so that $s_i$ ($r_i$) denotes the set of sample (nonsample) population units from area i, and $U_i = s_i \cup r_i$ denotes the set of population units making up the small area i.
A recently proposed approach to small area estimation is based on the use of M-quantile models (see Chambers and Tzavidis 2006, Section 4). Since much of the development in this article is based on the application of linear quantile/M-quantile regression, we now give a brief definition of these concepts. Ordinary linear regression is based on the idea of modelling the expected value of the dependent variable as a function of the regressors; that is, in our notation, on the assumption that $E(y_{ij} | x_{ij}) = x_{ij}^T \beta$. In quantile regression it is the qth quantile that is assumed to be a linear function of the auxiliary information, i.e.,

$$Q_q(y_{ij} | x_{ij}) = x_{ij}^T \beta(q), \quad q \in (0, 1)$$

This means that a distinct (hyper)plane is fitted to the data for each $q \in (0, 1)$ according to quantile-specific regression coefficients $\beta(q)$. See Koenker and Bassett (1978) for a general introduction to quantile regression. The vector $\beta(q)$ may be estimated according to some minimization criterion such as least absolute deviations considered in Koenker and D'Orey (1987). Breckling and Chambers (1988) introduced the application of robust M-estimation to quantile regression. M-quantile regression provides a quantile-like generalization of

regression based on influence functions. For specified q and influence function $\psi$, an estimate of the vector of the regression parameters $\beta_\psi(q)$ may then be obtained by solving the following normal equations

$$\sum_{j=1}^{n} \psi_q\{y_j - x_j^T \beta_\psi(q)\}\, x_j = 0$$

in $\beta_\psi(q)$, where $\psi_q(r) = 2\psi(s^{-1} r)\{q I(r > 0) + (1 - q) I(r \le 0)\}$ and $r = (r_j) = y_j - x_j^T \beta_\psi(q)$. The influence function $\psi(\cdot)$ may for instance be chosen to be the Huber proposal 2, i.e., $\psi(u) = u I(-c \le u \le c) + c\,\mathrm{sgn}(u) I(|u| > c)$, $u \in \mathbb{R}$, as we have done in the application in Sections 4 and 5; in this case c is a tuning constant assumed to be bounded away from 0. Consistently with most applications, we set $c = 1.345$. This value gives reasonably high efficiency in the normal case; more specifically, it produces 95% efficiency when the errors are normal and still offers protection against outliers (Huber 1981). The quantity s in the definition of $\psi_q(r)$ is a robust estimate of the scale of the data such as the median absolute deviation $s = \mathrm{median}|r| / 0.6745$. For specified q an estimate $\hat\beta_\psi(q)$ of $\beta_\psi(q)$ is then obtained via iterative reweighted least squares.

Following Chambers and Tzavidis (2006), an alternative to random effects for characterizing the variability across the population not accounted for by the regressors is to use the M-quantile coefficients of the population units. For unit j in area i, this coefficient is the value $\theta_{ij}$ such that $Q_{\theta_{ij}}(y_{ij} | x_{ij}; \psi) = y_{ij}$. If a hierarchical structure does explain part of the variability in the population data, units within clusters (areas) defined by this hierarchy are expected to have similar M-quantile coefficients. When the conditional M-quantiles are assumed to follow a linear model, with $\beta_\psi(q)$ a sufficiently smooth function of q, this suggests a predictor of $\bar{Y}_i$ of the form

$$\hat{\bar{Y}}_i^{MQ} = N_i^{-1} \left[ \sum_{j \in s_i} y_{ij} + \sum_{j \in r_i} x_{ij}^T \hat\beta_\psi(\hat\theta_i) + \frac{N_i - n_i}{n_i} \sum_{j \in s_i} \{ y_{ij} - x_{ij}^T \hat\beta_\psi(\hat\theta_i) \} \right] \qquad (1)$$

(Tzavidis et al. 2010), where $\hat\theta_i = n_i^{-1} \sum_{j \in s_i} \hat\theta_{ij}$ is an estimate of the average value of the M-quantile coefficients $\theta_{ij}$ for units in area i.
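As a rough illustration of the normal equations above, the following Python sketch codes the Huber influence function, its asymmetric version $\psi_q$, and an iteratively reweighted least squares fit for the special case of an intercept and a single covariate. The function names (`huber_psi`, `psi_q`, `mq_fit`), the least squares starting value, the stopping rule and the treatment of exactly zero residuals are assumptions made for this example; they are not taken from the authors' R code. The defaults 1.345 and 0.6745 are the conventional Huber tuning constant and the consistency factor of the median absolute deviation.

```python
from statistics import median


def huber_psi(u, c=1.345):
    """Huber influence function: identity on [-c, c], clipped to +/-c outside."""
    return max(-c, min(c, u))


def psi_q(r, s, q, c=1.345):
    """Asymmetric version: 2 * psi(r / s) * {q if r > 0 else (1 - q)}."""
    return 2.0 * huber_psi(r / s, c) * (q if r > 0 else 1.0 - q)


def mq_fit(x, y, q=0.5, c=1.345, tol=1e-8, max_iter=200):
    """IRLS fit of the line y = b0 + b1 * x at M-quantile q (illustrative only)."""
    w = [1.0] * len(y)          # unit weights: the first pass is ordinary least squares
    b0 = b1 = 0.0
    for _ in range(max_iter):
        # Weighted least squares for two parameters in closed form.
        sw = sum(w)
        sx = sum(wj * xj for wj, xj in zip(w, x))
        sy = sum(wj * yj for wj, yj in zip(w, y))
        sxx = sum(wj * xj * xj for wj, xj in zip(w, x))
        sxy = sum(wj * xj * yj for wj, xj, yj in zip(w, x, y))
        new_b1 = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
        new_b0 = (sy - new_b1 * sx) / sw
        converged = max(abs(new_b0 - b0), abs(new_b1 - b1)) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
        # Residuals, robust scale and new weights w_j = psi_q(r_j) / r_j.
        r = [yj - b0 - b1 * xj for xj, yj in zip(x, y)]
        s = median(abs(rj) for rj in r) / 0.6745
        if s == 0.0:            # degenerate scale (assumed handling): stop here
            break
        w = [psi_q(rj, s, q, c) / rj if rj != 0.0 else 2.0 * (1.0 - q) / s
             for rj in r]
    return b0, b1
```

With `q=0.5` and a very large `c` no residual is clipped and all weights are equal, so the fit reduces to ordinary least squares; smaller `c` downweights large residuals, and `q` different from 0.5 tilts the weights towards positive or negative residuals.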
These $\hat\theta_{ij}$ are obtained by solving $\hat{Q}_{\theta_{ij}}(y_{ij} | x_{ij}; \psi) = y_{ij}$ for $\theta_{ij}$ with $\hat{Q}_q$ denoting the estimated value of $Q_q(y_{ij} | x_{ij}; \psi)$ at q. For possible alternative choices of $\hat\theta_i$ see Chambers and Tzavidis (2006). Tzavidis et al. (2010) refer to Expression (1) as the bias adjusted M-quantile predictor of $\bar{Y}_i$, derived as the mean functional of the Chambers and Dunstan (1986) estimator of the distribution function.

2.1. Failure of Benchmarking and Neutral Shrinkage Properties

The MQ predictors do not satisfy the benchmarking property. To see this, note first that $\hat{\bar{Y}}_i^{MQ}$ has an interesting GREG-like representation

$$\hat{\bar{Y}}_i^{MQ} = N_i^{-1} \sum_{j \in U_i} x_{ij}^T \hat\beta_\psi(\hat\theta_i) + n_i^{-1} \sum_{j \in s_i} \{ y_{ij} - x_{ij}^T \hat\beta_\psi(\hat\theta_i) \} = n_i^{-1} \sum_{j \in s_i} y_{ij} + \left\{ N_i^{-1} \sum_{j \in U_i} x_{ij}^T - n_i^{-1} \sum_{j \in s_i} x_{ij}^T \right\} \hat\beta_\psi(\hat\theta_i) \qquad (2)$$
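The equivalence between the bias adjusted form (1) and the GREG-like form (2) can be checked numerically. The sketch below is only an illustration for a single area: the function names and the toy data are hypothetical, and `beta` stands for an already estimated coefficient vector $\hat\beta_\psi(\hat\theta_i)$.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def mq_predictor(y_s, x_s, x_r, beta):
    """Form (1): sampled y's, predictions for non-sampled units, bias adjustment."""
    n, N = len(y_s), len(y_s) + len(x_r)
    resid_sum = sum(yj - dot(xj, beta) for yj, xj in zip(y_s, x_s))
    pred_r = sum(dot(xj, beta) for xj in x_r)
    return (sum(y_s) + pred_r + (N - n) / n * resid_sum) / N


def mq_predictor_greg(y_s, x_s, x_r, beta):
    """Form (2): population mean of fitted values plus sample mean residual."""
    n, N = len(y_s), len(y_s) + len(x_r)
    fitted_pop = sum(dot(xj, beta) for xj in x_s + x_r)
    resid_mean = sum(yj - dot(xj, beta) for yj, xj in zip(y_s, x_s)) / n
    return fitted_pop / N + resid_mean


# Hypothetical area with 2 sampled and 3 non-sampled units, x = (1, covariate).
y_s = [3.0, 5.0]
x_s = [[1.0, 1.0], [1.0, 2.0]]
x_r = [[1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
beta = [0.5, 2.0]
form_1 = mq_predictor(y_s, x_s, x_r, beta)
form_2 = mq_predictor_greg(y_s, x_s, x_r, beta)
```

Both functions return the same value for any input, because the sample residuals in (1) carry a total weight of $1 + (N_i - n_i)/n_i = N_i/n_i$ before division by $N_i$.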

Using a more compact notation (2) may be rewritten as

$$\hat{\bar{Y}}_i^{MQ} = \hat{\bar{y}}_i + (\bar{X}_i^T - \hat{\bar{x}}_i^T)\, \hat\beta_\psi(\hat\theta_i)$$

with $\bar{X}_i^T = N_i^{-1} \sum_{j \in U_i} x_{ij}^T$, $\hat{\bar{x}}_i^T = n_i^{-1} \sum_{j \in s_i} x_{ij}^T$ and $\hat{\bar{y}}_i = n_i^{-1} \sum_{j \in s_i} y_{ij}$. Let us assume, for simplicity, that we select a simple random sample from each area and let $w_i = n_i / n$ be the fraction of the overall sample falling in area i, with $n = \sum_{i=1}^{m} n_i$. The direct estimator of the overall population mean $\bar{Y}$, $\hat{\bar{y}} = n^{-1} \sum_{i=1}^{m} \sum_{j=1}^{n_i} y_{ij}$, may be written as $\hat{\bar{y}} = n^{-1} \sum_{i=1}^{m} n_i \hat{\bar{y}}_i$, which is the weighted average of small area mean estimators. This property is desirable for all small area estimators, and it is, in this case, an alternative statement of the benchmarking property. When the small area mean is estimated by $\hat{\bar{Y}}_i^{MQ}$ we have that

$$n^{-1} \sum_{i=1}^{m} n_i \hat{\bar{Y}}_i^{MQ} = \hat{\bar{y}} + \sum_{i=1}^{m} w_i (\bar{X}_i^T - \hat{\bar{x}}_i^T)\, \hat\beta_\psi(\hat\theta_i) \qquad (3)$$

from which it is clear that the estimator is not benchmarked to the overall direct estimator of the mean because, in general, $\sum_{i=1}^{m} d_i = \sum_{i=1}^{m} w_i (\bar{X}_i^T - \hat{\bar{x}}_i^T)\, \hat\beta_\psi(\hat\theta_i) \ne 0$. Furthermore, this property will not be satisfied for general sampling designs and weighted estimators of the overall mean $\bar{Y}$. About (3) we note that the $d_i$ will be small for large $n_i$; more precisely they will be $O_p(n_i^{-1/2})$ whenever $\bar{X}_i^T - \hat{\bar{x}}_i^T = O_p(n_i^{-1/2})$, $w_i = O_p(1)$ and $\hat\beta_\psi = O_p(1)$, where $O_p(\cdot)$ denotes the order of convergence in probability. Of course, as we are focusing on small area estimation, large $n_i$'s are not of special interest here.

As far as neutral shrinkage is concerned, we note that, assuming a normal linear mixed model, the set $\{\hat{\bar{Y}}_i^{MQ}, i = 1, \dots, m\}$ is overdispersed with respect to the underlying population parameters

$$\sum_{i=1}^{m} w_i \left( \hat{\bar{Y}}_i^{MQ} - \hat{\bar{Y}}_w^{MQ} \right)^2 > \sum_{i=1}^{m} w_i \left( \bar{Y}_i - \bar{Y} \right)^2 \qquad (4)$$

where $\hat{\bar{Y}}_w^{MQ} = \sum_{i=1}^{m} w_i \hat{\bar{Y}}_i^{MQ}$, as will be confirmed by the simulation results of Section 4. The behaviour of M-quantile based predictors is then more similar to that of direct estimators and in contrast with that of the over-shrinking EB predictors (see EURAREA Consortium 2004 Section B.3).
When outliers are present in the data and normality fails, the over-shrinkage of EB predictors becomes severe and also the ensemble of $\hat{\bar{Y}}_i^{MQ}$ exhibits a variance smaller than the actual set of population parameters. This effect, detected by the authors using simulations, will also be apparent when analysing the outlier affected data of Section 5.

3. Modified M-quantile Small Area Estimators

In this section we introduce adjusted MQ estimators. We consider two alternative approaches. The first is based on constraining M-quantile regression. It can be applied to obtain benchmarked MQ small area estimates, but cannot be easily extended to the correction of over/under-shrinkage as this would involve quadratic constraining that is very difficult to manage. The second approach is based on an ex-post procedure matching the first two moments, parallel to that commonly used to adjust the over-shrinkage of EB estimators (Rao 2003, Section 9.6). The two approaches may be integrated as the output of the benchmarking procedure may be used for the over/under-shrinkage correction.

In this case the output of the second procedure would be an MQ estimator benchmarked and satisfying neutral shrinkage. In Section 3.1 we illustrate the method to obtain benchmarked MQ estimators (denoted BMQ); in Section 3.2 we describe the procedure to achieve neutral shrinkage.

3.1. Constrained M-estimation of Regression Parameters in Quantile Regression

The constrained robust regression model (Eddy and Kadane 1982) can be generalized to a model for the M-quantile of order q of the conditional distribution of y given x. Let H be an $h \times p$ matrix and suppose we want the vector $\beta_\psi(q)$ to match

$$H \beta_\psi(q) = d \qquad (5)$$

where d is an $h \times 1$ vector of values (d may be a vector of zeroes). For specified q, a constrained estimate of the vector of the regression parameters $\beta_\psi(q)$ may then be obtained by minimizing

$$\sum_{j=1}^{n} \rho_q\{ y_j - x_j^T \beta_\psi(q) \} + \Lambda^T (d - H \beta_\psi(q)) \qquad (6)$$

where $\Lambda^T = (\lambda_1, \lambda_2, \dots, \lambda_h)$ is a $1 \times h$ vector of Lagrange multipliers and $\rho_q(\cdot)$ is a loss function associated with the influence function $\psi_q(\cdot)$ introduced in Section 2. The constrained estimate of $\beta_\psi(q)$ may then be obtained by differentiating (6) with respect to $\beta_\psi(q)$ and $\Lambda$, setting the derivatives equal to zero and solving the normal equations

$$\sum_{j=1}^{n} \psi_q\{ y_j - x_j^T \beta_\psi(q) \}\, x_j - H^T \Lambda = 0 \qquad (7)$$

Because (7) is a system of nonlinear equations, an iterative method is used for its solution. Let $w_\psi(r) = \psi_q(r) / r$ and $w_{\psi j} = w_\psi(r_j)$, with $r_j = y_j - x_j^T \beta_\psi(q)$. Then (7) can be written as

$$\sum_{j=1}^{n} w_{\psi j} \{ y_j - x_j^T \beta_\psi(q) \}\, x_j - H^T \Lambda = 0 \qquad (8)$$

The steps of the iteratively reweighted least squares algorithm are as follows:

1. For specified q define an initial estimate $\beta_\psi^{(0)}(q)$.
2. At each iteration t, calculate the residuals $r_j^{(t-1)} = y_j - x_j^T \beta_\psi^{(t-1)}(q)$ and associated weights $w_{\psi j}^{(t-1)}$ from the previous iteration.
3. Compute the new weighted least squares estimates subject to the constraint $H \beta_\psi(q) = d$:

$$\hat\beta_\psi^{(t)}(q) = \left[ A^{-1} - A^{-1} H^T (H A^{-1} H^T)^{-1} H A^{-1} \right] X^T W^{(t-1)} y + A^{-1} H^T (H A^{-1} H^T)^{-1} d \qquad (9)$$

Here X is the matrix of order $n \times p$ of sample x values, $A = X^T W^{(t-1)} X$, and y is the vector of n sample values for y. The matrix $W^{(t-1)} = \mathrm{diag}(w_{\psi j})$ is a diagonal matrix of order n with the entry corresponding to a particular sample observation set equal to the weight $w_{\psi j}$.

4. Repeat Steps 1-3 until convergence. Convergence is achieved when the difference between the estimated model parameters obtained from two successive iterations is negligible.

For details about the convergence of iterative reweighted least squares with constraints see Dempster et al. (1980). The algorithm has been used in the simulation experiments (Section 4) and in the application (Section 5) and a satisfactory convergence has been obtained in 10 to 20 iterations. The R code that implements this algorithm is available from the authors.

The algorithm is applied to achieve benchmarking as follows. Assume simple random sampling within the areas and also that we are interested in the consistency of the estimated small area means with the estimate of the population mean using the overall sample. We have the following benchmarking equation

$$\sum_{i=1}^{m} w_i \hat{\bar{Y}}_i^{MQ} = \sum_{i=1}^{m} w_i \left\{ \hat{\bar{y}}_i + (\bar{X}_i^T - \hat{\bar{x}}_i^T)\, \hat\beta_\psi(\hat\theta_i) \right\} = \hat{\bar{y}}$$

so, in view of (3), Equation (5) may be rewritten as

$$\sum_{i=1}^{m} w_i (\bar{X}_i^T - \hat{\bar{x}}_i^T)\, \hat\beta_\psi(\hat\theta_i) = 0 \qquad (10)$$

The benchmarking equation is expressed in terms of means, but it can be equivalently written in terms of totals. To derive the benchmarked MQ predictors, for short $\{\hat{\bar{Y}}_i^{BMQ}\}$, we have to consider that the constraint (10) acts simultaneously on all the area-specific regression coefficients, so simultaneous constrained estimation of $\{\hat\beta_\psi(\hat\theta_1), \dots, \hat\beta_\psi(\hat\theta_m)\}$ for the m small areas is required. The size of the vector of the M-quantile regression parameters becomes $(m \times p) \times 1$ and consequently the solution of the system of normal equations requires the inversion of a matrix of size $(n \times m) \times (m \times p)$. As a consequence, obtaining these constrained estimates may be computationally demanding in applications with a large number of areas.

Relaxing the assumption of simple random sampling and assuming more general sampling designs we will have that $\hat{\bar{Y}} = N^{-1} \sum_{i=1}^{m} \sum_{j=1}^{n_i} g_{ij} y_{ij}$ with $g_{ij} = \pi_{ij}^{-1}$ or defined in some more complex way but such that $\sum_{i=1}^{m} \sum_{j=1}^{n_i} g_{ij} = N$. A popular direct estimator is in this case given by $\hat{\bar{y}}_{iw} = g_i^{-1} \sum_{j=1}^{n_i} g_{ij} y_{ij}$ with $g_i = \sum_{j=1}^{n_i} g_{ij}$.
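As a consequence $\hat{\bar{Y}} = N^{-1} \sum_{i=1}^{m} g_i \hat{\bar{y}}_{iw} = \sum_{i=1}^{m} w_i^{*} \hat{\bar{y}}_{iw}$ with $\sum_{i=1}^{m} w_i^{*} = 1$ as $\sum_{i=1}^{m} g_i = N$. The benchmarking constraint is then expressed by this equation

$$\sum_{i=1}^{m} w_i^{**} \hat{\bar{Y}}_i^{MQ} = \sum_{i=1}^{m} w_i^{*} \hat{\bar{y}}_{iw}$$

We may define $w_i^{**} = w_i^{*}$ even if formally this is not necessary. If we want to avoid the use of weights, we may keep $w_i^{**} = n_i / n$ or, if this information is available, $w_i^{**} = N_i / N$. Anyway, we will have that $\sum_{i=1}^{m} w_i^{**} \hat{\bar{y}}_i \ne \sum_{i=1}^{m} w_i^{*} \hat{\bar{y}}_{iw}$ so we need to modify (10) as follows

$$\sum_{i=1}^{m} w_i^{**} (\bar{X}_i^T - \hat{\bar{x}}_i^T)\, \hat\beta_\psi(\hat\theta_i) = d$$

with $d = \sum_{i=1}^{m} w_i^{*} \hat{\bar{y}}_{iw} - \sum_{i=1}^{m} w_i^{**} \hat{\bar{y}}_i$.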
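The constrained weighted least squares update in Equation (9) of this section can be sketched in a few lines. The Python code below is a minimal illustration, not the authors' R implementation: the helper names `solve` and `constrained_wls` are hypothetical, the weights are taken as given (in the algorithm they would come from Step 2), and the update is computed in the algebraically equivalent form $b_u - A^{-1} H^T (H A^{-1} H^T)^{-1} (H b_u - d)$, where $b_u = A^{-1} X^T W y$ is the unconstrained weighted least squares solution.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]       # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    z = [0.0] * n
    for k in reversed(range(n)):
        z[k] = (m[k][n] - sum(m[k][j] * z[j] for j in range(k + 1, n))) / m[k][k]
    return z


def constrained_wls(X, y, w, H, d):
    """One weighted least squares step subject to H beta = d, as in Equation (9)."""
    p, h = len(X[0]), len(H)
    # A = X' W X and the right-hand side X' W y.
    A = [[sum(wj * xj[r] * xj[c] for wj, xj in zip(w, X)) for c in range(p)]
         for r in range(p)]
    xwy = [sum(wj * xj[r] * yj for wj, xj, yj in zip(w, X, y)) for r in range(p)]
    b_u = solve(A, xwy)                                 # unconstrained solution
    Z = [solve(A, list(row)) for row in H]              # Z[k] = A^{-1} H_k'
    M = [[sum(H[k][r] * Z[l][r] for r in range(p)) for l in range(h)]
         for k in range(h)]                             # H A^{-1} H'
    gap = [sum(H[k][r] * b_u[r] for r in range(p)) - d[k] for k in range(h)]
    lam = solve(M, gap)
    return [b_u[r] - sum(Z[k][r] * lam[k] for k in range(h)) for r in range(p)]


# Toy example: force the slope of a straight line to equal 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 2.0, 4.0]
beta_c = constrained_wls(X, y, [1.0] * 4, H=[[0.0, 1.0]], d=[1.0])
```

Inside the full algorithm this step would be repeated with weights recomputed from the latest residuals; for the benchmarking constraint, H would hold the row vector built from the terms $w_i (\bar{X}_i^T - \hat{\bar{x}}_i^T)$ stacked over the areas.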

For the set $E = \{ i \mid n_i = 0 \}$ of the out of sample areas, i.e., areas where $n_i = 0$, consistently with Chambers and Tzavidis (2006), we may define $\hat{\bar{Y}}_i^{MQ} = \bar{X}_i^T \hat\beta_\psi(0.5)$. In the benchmarking equation $w_i^{*}$ cannot be used since $g_i = 0$, $i \in E$. In this case $w_i^{**} = N_i / N$ is a more sensible option.

3.2. Adjusted M-quantile Small Area Estimators With Neutral Shrinkage

In order to obtain a set of modified MQ predictors that satisfies benchmarking and neutral shrinkage (at least approximately), we propose a strategy that is similar to that of Rao (2003, Section 9.6). More specifically, given a set $\{\hat{\bar{Y}}_i^{*}\}_{1 \le i \le m}$ of predictors of the small area means, we look for a new set of estimators $\{t_i\}_{1 \le i \le m}$ that minimizes $\sum_{i=1}^{m} w_i (\hat{\bar{Y}}_i^{*} - t_i)^2$ and satisfies benchmarking and neutral shrinkage, i.e., is subject to the constraints

1. $\sum_{i=1}^{m} w_i t_i = c_1$;
2. $\sum_{i=1}^{m} w_i (t_i - \bar{t})^2 = c_2$,

where $w_i = n_i / n$ or any other weight such that $\sum_{i=1}^{m} w_i = 1$, $\bar{t} = \sum_{i=1}^{m} w_i t_i$ and $c_1$ and $c_2$ are known constants. The constant $c_1$ will be a reliable estimator of the overall population mean, typically $\hat{\bar{y}}$ or some other (possibly survey weighted) model-free estimator. The constant $c_2$, which under constraint 1 can be rewritten as $c_2 = \sum_{i=1}^{m} w_i t_i^2 - c_1^2$, should be a suitable measure of the variance between the areas. Note that constraint 1 is redundant when the neutral shrinkage correction is applied to benchmarked estimators such as the BMQ of Section 3.1. For nonbenchmarked estimators our procedure may be seen as a crude way to attain the benchmarking property, which may be applied when the method of Section 3.1 is impractical because of its computational complexity.
It may be shown (see the Appendix) that

$$t_i^{opt} = c_1 + \left( \hat{\bar{Y}}_i^{*} - \hat{\bar{Y}}_{\cdot}^{*} \right) \sqrt{ \frac{c_2}{\sum_{i=1}^{m} w_i \left( \hat{\bar{Y}}_i^{*} - \hat{\bar{Y}}_{\cdot}^{*} \right)^2} } \qquad (11)$$

If we set $c_1 = \hat{\bar{Y}}_{\cdot}^{*}$ with $\hat{\bar{Y}}_{\cdot}^{*} = \sum_{i=1}^{m} w_i \hat{\bar{Y}}_i^{*}$ we will also have that

$$t_i^{opt} = \sqrt{ \frac{c_2}{\sum_{i=1}^{m} w_i \left( \hat{\bar{Y}}_i^{*} - \hat{\bar{Y}}_{\cdot}^{*} \right)^2} }\; \hat{\bar{Y}}_i^{*} + \left( 1 - \sqrt{ \frac{c_2}{\sum_{i=1}^{m} w_i \left( \hat{\bar{Y}}_i^{*} - \hat{\bar{Y}}_{\cdot}^{*} \right)^2} } \right) \hat{\bar{Y}}_{\cdot}^{*}$$

In practice, in the calculation of (11) the problem is to find a reasonable value for $c_2$. Under a linear mixed model for the data, $c_2$ is a measure of the variation between unobservable area-specific model parameters that may be calculated using the estimates of the variance components. Without recourse to an explicit model, we define the ideal $c_2$ as the unweighted variance between the area-specific population means, i.e.,

$$c_2^{*} = \frac{1}{m - 1} \sum_{i=1}^{m} \left( \bar{Y}_i - \bar{Y} \right)^2$$
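The adjustment in Equation (11) amounts to recentring the initial predictors at $c_1$ and rescaling their deviations so that the weighted variance equals $c_2$. A minimal Python sketch follows; the function name `neutral_shrinkage` and the toy numbers are hypothetical, and the weights are assumed to sum to one.

```python
from math import sqrt


def neutral_shrinkage(est, w, c1, c2):
    """Equation (11): rescale deviations from the weighted mean, recentre at c1."""
    centre = sum(wi * ei for wi, ei in zip(w, est))               # weighted mean
    spread = sum(wi * (ei - centre) ** 2 for wi, ei in zip(w, est))
    scale = sqrt(c2 / spread)       # below 1 when the input is over-dispersed
    return [c1 + scale * (ei - centre) for ei in est]


# Hypothetical initial predictors for four equally weighted areas.
est = [1.0, 2.0, 3.0, 6.0]
w = [0.25, 0.25, 0.25, 0.25]
t_opt = neutral_shrinkage(est, w, c1=3.0, c2=1.0)

# Both constraints hold by construction.
t_mean = sum(wi * ti for wi, ti in zip(w, t_opt))
t_var = sum(wi * (ti - t_mean) ** 2 for wi, ti in zip(w, t_opt))
```

The ranking of the areas is preserved because every deviation is multiplied by the same positive factor.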

To estimate this quantity we cannot use ordinary design-based estimators of $\bar{Y}_i$ and $\bar{Y}$ since they are known to be over-dispersed when the sample sizes in the areas are small (Fabrizi 2009). Let us write $y_{ij} = x_{ij}^T \beta_\psi(0.5) + e_{ij}$. It follows that

$$\bar{Y}_i = N_i^{-1} \sum_{j=1}^{N_i} y_{ij} = \bar{X}_i^T \beta_\psi(0.5) + h_i$$

The term $h_i = N_i^{-1} \sum_{j=1}^{N_i} e_{ij} = \bar{Y}_i - \bar{X}_i^T \beta_\psi(0.5)$ may be thought of as a pseudo random effect, since it captures the average deviation of units in the same area from the median regression plane. We have that $\bar{Y} = N^{-1} \sum_{i=1}^{m} N_i \bar{Y}_i = \bar{X}^T \beta_\psi(0.5) + \bar{h}$ with $\bar{h} = N^{-1} \sum_{i=1}^{m} N_i h_i$ and, as a consequence

$$c_2^{*} \approx \beta_\psi(0.5)^T S_{XX}\, \beta_\psi(0.5) + \frac{1}{m - 1} \sum_{i=1}^{m} (h_i - \bar{h})^2$$

where

$$S_{XX} = \frac{1}{m - 1} \sum_{i=1}^{m} (\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})^T$$

The equality is only approximate since, in developing the square, we omit the double product term that can be shown to be negligible with respect to the two main addends. We may estimate $h_i$ with $\tilde{h}_i = n_i^{-1} \sum_{j=1}^{n_i} \{ y_{ij} - x_{ij}^T \hat\beta_\psi(0.5) \}$; the associated estimator of the second term in $c_2^{*}$, i.e., $(m - 1)^{-1} \sum_{i=1}^{m} (\tilde{h}_i - \tilde{\bar{h}})^2$, with $\tilde{\bar{h}} = N^{-1} \sum_{i=1}^{m} N_i \tilde{h}_i$, is likely to be unstable and liable to the influence of outlying residuals. In line with the robust estimation approach adopted in this article we then propose the following estimator of $c_2^{*}$

$$\hat{c}_2^{*} \approx \hat\beta_\psi(0.5)^T S_{XX}\, \hat\beta_\psi(0.5) + \frac{1}{m - 1} \sum_{i=1}^{m} \left[ \psi(\tilde{h}_i - \tilde{\bar{h}}) \right]^2 \qquad (12)$$

where $\psi(\cdot)$ is an influence function such as the Huber proposal 2 already mentioned in Section 2. This function depends on a tuning constant c that should be chosen in accordance with the data at hand and the sizes of the area-specific samples on which the calculation of the $\tilde{h}_i$ is based. The more likely the occurrence of outliers and the smaller the area-specific sample sizes, the more pronounced the smoothing operated through $\psi(\cdot)$ is expected to be (i.e., the smaller c should be). We denote the estimators obtained following this procedure $\hat{\bar{Y}}_i^{CBMQ}$.
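To make the structure of the estimator in Equation (12) concrete, the following Python sketch computes it from area-level inputs. It is an illustration under stated assumptions: the function names and the toy numbers are hypothetical, the influence function is applied to the raw differences $\tilde{h}_i - \tilde{\bar{h}}$ exactly as written in (12), and the quadratic form $\hat\beta^T S_{XX} \hat\beta$ is evaluated as $(m-1)^{-1} \sum_i \{(\bar{X}_i - \bar{X})^T \hat\beta\}^2$, which is the same quantity.

```python
def huber_psi(u, c):
    """Huber influence function with tuning constant c."""
    return max(-c, min(c, u))


def c2_hat(xbar, N, h_tilde, beta, c):
    """Equation (12): explained between-area variance plus robustified residual part.

    xbar     list of area population mean vectors of the auxiliary variables
    N        list of area population sizes
    h_tilde  list of estimated pseudo random effects (area mean residuals)
    beta     estimated median (q = 0.5) M-quantile coefficients
    c        tuning constant of the influence function
    """
    m, n_tot, p = len(N), sum(N), len(beta)
    x_all = [sum(Ni * xi[a] for Ni, xi in zip(N, xbar)) / n_tot for a in range(p)]
    h_all = sum(Ni * hi for Ni, hi in zip(N, h_tilde)) / n_tot
    explained = sum(
        sum((xi[a] - x_all[a]) * beta[a] for a in range(p)) ** 2 for xi in xbar
    ) / (m - 1)
    residual = sum(huber_psi(hi - h_all, c) ** 2 for hi in h_tilde) / (m - 1)
    return explained + residual


# Hypothetical inputs for three areas; the third pseudo random effect is outlying.
xbar = [[1.0, 0.0], [1.0, 2.0], [1.0, 4.0]]
N = [10, 10, 10]
h_tilde = [-1.0, 0.0, 10.0]
beta = [5.0, 1.0]
robust = c2_hat(xbar, N, h_tilde, beta, c=2.0)
plain = c2_hat(xbar, N, h_tilde, beta, c=100.0)
```

With a large tuning constant the second term reduces to the ordinary between-area variance of the $\tilde{h}_i$; a small constant caps the contribution of outlying areas, which is the smoothing effect described in the text.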
In Sections 4 and 5 they will be compared to the estimators obtained constraining the Empirical Best $\{\hat{\bar{Y}}_i^{EB}\}$ according to the procedure illustrated in Rao (2003, Section 9.6), which will be denoted as $\{\hat{\bar{Y}}_i^{CEB}\}$.

3.3. MSE Estimation of MQ Estimators

Mean Squared Error (MSE) estimation of M-quantile based small area mean estimators relies on the approach described in Chambers et al. (2008). Since the estimates $\hat\beta_\psi(q)$ of the M-quantile regression coefficients can be expressed as linear combinations of the sample

y values, it follows that, for fixed $\hat\theta_i$, the estimator of the area i mean, $\hat{\bar{Y}}_i^{MQ}$, can be written as a linear combination of these sample values; a first order approximation to its MSE can be developed using the arguments in Royall and Cumberland (1978). Let $\{b_{ij}; j \in s\}$ denote the set of weights that define each of the M-quantile predictors. This approach then leads to an MSE estimator of the form

$$\mathrm{mse}\left( \hat{\bar{Y}}_i^{MQ} \right) = \frac{1}{N_i^2} \left[ \sum_{j \in s_i} \left\{ f_{ij}^2 + \frac{N_i - n_i}{n_i} \right\} \{ y_j - x_j^T \hat\beta_\psi(\hat\theta_i) \}^2 + \sum_{j \in s \setminus s_i} f_{ij}^2 \{ y_j - x_j^T \hat\beta_\psi(\hat\theta_i) \}^2 \right] \qquad (13)$$

with $f_{ij} = b_{ij} - 1$ if $j \in s_i$ and $f_{ij} = b_{ij}$ otherwise. Since $\hat{\bar{Y}}_i^{MQ}$ is an approximately unbiased estimator of the small area mean, the squared bias will not significantly impact the MSE. The main limitation of the MSE estimator is that it does not account for the variability introduced in estimating the area specific $\theta_i$'s. The MSE estimator (13) can be used to formulate an estimator of the MSE of the constrained estimators. Following Rao (2003, p. 279) a measure of uncertainty associated with $\hat{\bar{Y}}_i^{BMQ}$ and $\hat{\bar{Y}}_i^{CBMQ}$ can be obtained by

$$\mathrm{mse}\left( \hat{\bar{Y}}_i^{BMQ/CBMQ} \right) = \mathrm{mse}\left( \hat{\bar{Y}}_i^{MQ} \right) + \left( \hat{\bar{Y}}_i^{MQ} - \hat{\bar{Y}}_i^{BMQ/CBMQ} \right)^2 \qquad (14)$$

This is a somewhat crude method and an empirical alternative to the analytical estimator of MSE may be represented by a bootstrap procedure. The definition and evaluation of such a procedure is an object of our current research.

4. A Simulation Study

In this section we present a Monte Carlo study for checking whether the adjustment procedure illustrated in Section 3.2, including the proposed estimator for $c_2$, effectively works; we also aim at assessing the impact of the adjustment on the MSE and the bias of the predictors. To do this we consider a population generated according to a normal linear mixed model, for which we know that Empirical Best predictors are very efficient and CEB predictors correct for over-shrinkage. We can then use them as sound terms of comparison. We also investigate how the proposed MSE estimator (14) tracks the true MSE of the CBMQ estimator.
We consider a model-based simulation in which properties of the traditional and proposed estimators are evaluated with respect to the process that generates the finite population from which the samples are drawn. Let $\{\bar{Y}_i\}_{1 \le i \le m}$ be the set of area-specific population means and $\{\hat{\bar{Y}}_i^{*}\}_{1 \le i \le m}$ the corresponding predictors, with * = DIR, EB, CEB, MQ, CBMQ; $\hat{c}_2^{*}$ defined in (12) is used as a guess for $c_2$. At each Monte Carlo iteration, the values of the study variable y defined on a finite population of size $N = k \times 4{,}200$, where k is a positive integer number, are generated according to the following Battese-Harter-Fuller model

$$y_{ij} = x_{ij}^T \beta + v_i + e_{ij}$$

with $i = 1, \dots, m = 36$, $j = 1, \dots, N_i$ and $N_i$ ranging from $k \times 50$ to $k \times 200$. To study the behaviour of the estimators for growing sample sizes in a framework consistent with the

ordinary asymptotic of finite populations (see Isaki and Fuller 1982) we consider sequences of sample and population sizes setting k = 2, 3, 5, 10, 20. The random components are drawn from the following normal distributions: $v_i \overset{ind}{\sim} N(0, \sigma_v^2 = 16)$ and $e_{ij} \overset{ind}{\sim} N(0, \sigma_e^2 = 100)$. The auxiliary variables $x_{ij}^T = (1, x_{ij})^T$ are generated only once and held fixed for all Monte Carlo iterations. In particular $x_{ij} = x_i + u_{ij}$ with $x_i \overset{ind}{\sim} N(\mu_x = 194, \sigma_x^2 = 2)$ and $u_{ij} \overset{ind}{\sim} N(0, \sigma_u^2 = 25)$. This population structure reflects a situation in favour of the application of linear mixed models, as only a small part of the difference among area means is explained by the auxiliary information, the larger part being attributable to the unobservable random effect. This population is essentially the same as the one considered in Torabi et al. (2009). From the above population a stratified simple random sample is drawn, with strata given by the areas. The allocation of the sample is assumed to be proportional to the population size. Different total sample sizes with $n = k \times 84$ are considered; as a result of proportional allocation area-specific sample sizes range from $k \times 1$ to $k \times 4$. The predictors are compared in terms of

1. their ability to estimate the actual descriptive variance of the ensemble of the area means, i.e., calculating the ratio between the variance of the ensemble of the estimates and the same quantity defined on the set of the underlying population parameters

$$\mathrm{AVR}[\{\hat{\bar{Y}}_i^{*}\}] = R^{-1} \sum_{r=1}^{R} \frac{(m - 1)^{-1} \sum_{i=1}^{m} \left( \hat{\bar{Y}}_{i,r}^{*} - \hat{\bar{Y}}_{r}^{*} \right)^2}{(m - 1)^{-1} \sum_{i=1}^{m} \left( \bar{Y}_{i,r} - \bar{Y}_{r} \right)^2} \qquad (15)$$

2. their average bias

$$\mathrm{AB}[\{\hat{\bar{Y}}_i^{*}\}] = R^{-1} \sum_{r=1}^{R} m^{-1} \sum_{i=1}^{m} \left( \hat{\bar{Y}}_{i,r}^{*} - \bar{Y}_{i,r} \right) \qquad (16)$$

3. their average MSE

$$\mathrm{AMSE}[\{\hat{\bar{Y}}_i^{*}\}] = R^{-1} \sum_{r=1}^{R} m^{-1} \sum_{i=1}^{m} \left( \hat{\bar{Y}}_{i,r}^{*} - \bar{Y}_{i,r} \right)^2 \qquad (17)$$

The index r is the counter of Monte Carlo replications, whose total number, R, is set equal to 5,000. Note that 2. and 3. are properties evaluated on average with respect to the set of the small areas being studied. Average evaluation of small area estimators is in line with many simulation exercises in the current literature (see Rao 2003, Section 7.2.6).
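The three performance measures (15)-(17) are straightforward to compute once the estimates and the true area means of each replication are stored. The following Python sketch is only an illustration: the function names and the toy arrays are hypothetical, `est[r][i]` and `true[r][i]` hold the estimate and the population mean for area i in replication r, and the ensemble means inside (15) are assumed to be unweighted averages over the areas.

```python
def ensemble_var(v):
    """Unweighted variance, with divisor m - 1, of one ensemble of area values."""
    mean = sum(v) / len(v)
    return sum((vi - mean) ** 2 for vi in v) / (len(v) - 1)


def avr(est, true):
    """Equation (15): average ratio of ensemble variances (1 = neutral shrinkage)."""
    return sum(ensemble_var(e) / ensemble_var(t) for e, t in zip(est, true)) / len(est)


def ab(est, true):
    """Equation (16): bias averaged over areas and replications."""
    return sum(
        sum(ei - ti for ei, ti in zip(e, t)) / len(e) for e, t in zip(est, true)
    ) / len(est)


def amse(est, true):
    """Equation (17): squared error averaged over areas and replications."""
    return sum(
        sum((ei - ti) ** 2 for ei, ti in zip(e, t)) / len(e) for e, t in zip(est, true)
    ) / len(est)


# Two hypothetical replications with three areas each.
true = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
est = [[0.0, 2.0, 4.0], [2.0, 4.0, 6.0]]
results = (avr(est, true), ab(est, true), amse(est, true))
```

An AVR above 1 signals over-dispersion (under-shrinkage) of the ensemble of estimates, an AVR below 1 signals over-shrinkage.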
The results of the simulation study are illustrated in Table 1 and Table 2. In Table 1 the quantities reported within parentheses for the CBMQ estimators are those calculated using the same guess for $c_2$ as in the case of CEB. Focusing on the ability of the predictors to achieve neutral shrinkage, we have that both EB and MQ estimators converge to the value 1 of AVR as the average area-specific sample sizes grow large, but from different sides. As expected, EB predictors over-shrink the distribution; the estimates based on the quantile method are more dispersed than the actual parameters, although not as much as the area-specific sample means. This overdispersion may be attributed to the poor estimation of $\theta_i$ when area-specific samples are small; it decreases as k grows; this should be expected in view of (1).

Table 1. Model-based simulation results. Within parentheses the values of AVR, AB and AMSE computed using the same values for $c_2$ used for the CEB estimators

k    n       Estimator   AVR           AB             AMSE
             DIR
             EB
             CEB
             MQ
             CBMQ        1.29 (1.06)        (-0.21)         (19.81)
             DIR
             EB
             CEB
             MQ
             CBMQ        1.14 (1.06)        (-0.23)         (14.60)
             DIR
             EB
             CEB
             MQ
             CBMQ        1.09 (1.04)        (-0.06)         (9.56)
             DIR
             EB
             CEB
             MQ
             CBMQ        1.04 (1.04)        (-0.03)    5.61 (5.70)
20   1,680   DIR
20   1,680   EB
20   1,680   CEB
20   1,680   MQ
20   1,680   CBMQ        1.01 (1.03)   0.00 (-0.01)    3.05 (3.23)

Comparing the CEB and CBMQ estimates, the performances in terms of AVR are really close whenever the same value of $c_2$ is plugged into the constraining procedure. The use of $\hat{c}_2^{*}$ is not as effective when k is small even if the impact goes in the right direction in all the cases; anyway it should be noted that the simulation is conducted under the assumption of the normal linear mixed model which favours the correction incorporated into the CEB estimator.

The bias is moderate in all cases, with the one exception of the CBMQ predictor when k = 2 and k = 3. This implies that the distributions of the ensembles of the estimates based on both the EB and the MQ methods are centered about their true means.

As regards the efficiency measured by AMSE, EB predictors are far more efficient than MQ estimators when k is small. For larger sample sizes this difference dwindles. This confirms the expectation that when the assumptions of EB predictors hold, as in this simulation, they yield big gains in efficiency, especially when area-specific sample sizes are very small. Methods based on robust modeling and weaker assumptions, such as the M-quantile, become valid alternatives when more than a few units are observed in each area.

Table 2. Across areas distribution of true (i.e., Monte Carlo) root mean squared errors (True RMSE) and area averages of estimated root mean squared errors (Est. RMSE)
Columns: k; Indicator (True RMSE, Est. RMSE); percentiles of the across areas distribution, including the Median; Mean.

Moreover the CBMQ predictors, less dispersed than their unconstrained counterparts, show a lower AMSE. The gain in efficiency is not very big and depends on the amount of under-shrinkage effectively corrected; for this reason it reduces when k grows. In any case, this is in sharp contrast with the behaviour of the constrained EB predictors that are, by construction, suboptimal in terms of MSE with respect to unconstrained EB predictors. Intuitively, if we consider (11), we may note that, because of the overdispersion of the MQ predictors, the quantity under the square root is less than 1, thus leading to a variance reduction that over-compensates the increase in the bias. As a consequence of this fact the gap between constrained MQ and EB predictors is smaller than that between ordinary (unconstrained) MQ and EB predictors.

To sum up, we found that M-quantile estimators are comparable to EB predictors even when data are generated under a normal linear mixed model, provided that the area-specific sample sizes are not too small. The over-shrinkage correction illustrated in the previous section is effective under the same circumstances.

Table 2 shows key percentiles of the across area distributions of the area level true and estimated root mean squared errors (the latter based on (14) and averaged over the simulations) of the CBMQ predictor. In general the proposed MSE estimator (14) provides a good approximation to the true MSE.

We also ran simulations in which $e_{ij} \overset{ind}{\sim} (10 / \sqrt{3})\, t_3$ (a re-scaled Student's t with 3 degrees of freedom) instead of $e_{ij} \overset{ind}{\sim} N(0, 100)$ to check the robustness of the introduced constraining methods to the presence of outliers.
The performances of CBMQ and its MSE estimator are still good. For the latter, however, a tendency to overestimate emerges when the average area-specific sample sizes are very small (i.e., very small k). Detailed results are available from the authors upon request. We note that the aim of the simulation exercise of this section was not to compare the efficiency of the various methods when outliers are present (concerning this, see Salvati et al. 2011), as we may expect that M-quantile regression based methods do better than Empirical Best predictors in this case, but to compare the effectiveness of the adjustments needed to achieve neutral shrinkage and their impact in terms of efficiency.
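The heavy-tailed error scenario mentioned above can be sketched in Python as follows. This is only an illustration of the re-scaling, not the authors' simulation code: the function names, the `seed` argument and the use of NumPy are assumptions. The sketch relies on the fact that a Student's t with 3 degrees of freedom has variance 3, so multiplying by $10/\sqrt{3}$ gives errors with the same variance (100) as the normal scenario.

```python
import numpy as np


def rescaled_t_errors(size, target_sd=10.0, df=3, seed=None):
    """Draw Student's t errors re-scaled to have variance target_sd**2.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if df <= 2:
        raise ValueError("the variance of a Student's t is finite only for df > 2")
    rng = np.random.default_rng(seed)
    # Var(t_df) = df / (df - 2); for df = 3 the scale is 10 / sqrt(3).
    scale = target_sd / np.sqrt(df / (df - 2))
    return scale * rng.standard_t(df, size=size)


def rescaled_t_variance(target_sd=10.0, df=3):
    """Theoretical variance of the re-scaled errors (checks the scale factor)."""
    scale = target_sd / np.sqrt(df / (df - 2))
    return scale**2 * df / (df - 2)
```

Because the fourth moment of a $t_3$ variable is infinite, the sample variance of such draws converges slowly, so the check above uses the theoretical variance rather than an empirical one.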

5. An Application to Survey and Satellite Data

Battese et al. (1988) analyse survey and satellite data for corn and soybean in a part of Iowa. Their objective is to predict the mean areas under corn and soybean for twelve counties (small areas) in North Central Iowa. Data are from the 1978 June Enumerative Survey and include information on corn and soybean areas at individual pixel and segment level. The data set contains the number of segments in each county, the number of hectares of corn and soybean for each sample segment, and the number of pixels per segment in each county classified as corn and soybean. The linear mixed model used for the small area mean hectares of corn and soybean per segment is described in Battese et al. (1988). We use this well-known data set to compare the performances of the EB and MQ predictors and of their adjusted versions aimed at achieving neutral shrinkage (CEB, CBMQ). The analysis of the soybean variable represents a situation in which normality approximately holds, while for corn there is an influential outlier (in Hardin county). The presence of an outlier in the data is a typical departure from normality. We analyse the corn variable with and without the Hardin county outlier, to evaluate its impact on the behaviour of the ensemble predictors.

Table 3 presents the EB, CEB, MQ and CBMQ estimates for the mean hectares of corn and soybean per segment for each county. The procedure described in Section 3.2 involves the unknown quantities $c_1$ and $c_2$. The first constant $c_1$ is equal to the direct estimate of the mean hectares of crop (soybean) obtained for the union of the twelve counties; the second constant $c_2$ is estimated by (13) for the adjustment of the MQ estimators and according to the method described in Rao (2003, Section 9.6) for the EB predictors. The averages and the between-areas variances of the small area estimates are reported in the last two rows of Table 3. For CEB and CBMQ the results coincide with the $c_1$ and $c_2$ constraints.
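The two ensemble summaries reported in the last rows of Table 3, the average and the between-areas variance (with divisor m - 1) of a set of small area estimates, can be computed as in the following Python sketch. The function names are hypothetical and the ratio to a target $c_2$ is only an illustrative diagnostic: a value below 1 suggests over-shrinkage of the ensemble, a value above 1 under-shrinkage.

```python
import numpy as np


def ensemble_summary(estimates):
    """Return the average and the between-areas variance of the estimates.

    The variance uses the divisor m - 1, where m is the number of areas.
    """
    est = np.asarray(estimates, dtype=float)
    return est.mean(), est.var(ddof=1)


def shrinkage_ratio(estimates, c2):
    """Ratio of the between-areas variance of the estimates to a target c2."""
    _, between_var = ensemble_summary(estimates)
    return between_var / c2
```

Applied to a constrained ensemble, the first summary would be expected to reproduce $c_1$, in line with the statement that the CEB and CBMQ results coincide with the constraints.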
The small area predictions from all these methods are somewhat similar for most counties, with a few exceptions. As may be expected, differences are larger for areas with one or very few observations and smaller for the counties with somewhat larger sample sizes. Moreover, we note that constraining the EB and MQ area predictors to known values for planned domains reduces the influence of the outlying observation in both cases. In fact the CEB and CBMQ predictors are similar for all the counties, Hardin included. For the soybean data, where normality approximately holds, our results are in line with the results of the simulation experiment in Section 4. The MQ estimator has a between-areas variance equal to [...] against a $\hat{c}^*_2$ of [...]. The EB predictors tend to over-shrink the distribution (307.81) and the CEB corrects this behaviour ($\hat{c}_2 = 33922$). As regards corn data, when the outlier is included EB predictors over-shrink the distribution dramatically, but they are far less shrunken when the outlier is removed. The estimates of $c_2$ with and without the outlier are widely different. The reason for this lies in the big impact that an outlier has on the estimation of variance components. MQ predictors, although influenced to some extent by the presence of the outlier, are more robust in this respect; in particular the estimates of $c_2$ based on (12) with and without the outlier are reasonably close. In a situation where normality does not hold, MQ estimators, differently from the normal linear mixed model setting of previous sections, tend to over-shrink the actual variance of area-specific population parameters. So, unlike the EB predictors, which always over-shrink, the MQ predictors may over- or under-shrink depending on the actual distribution of the data. In other cases we have found that they tend to over-shrink whenever the study variable has a skewed distribution.

Table 3. Predicted mean hectares of soybean and corn per segment. Columns: County, n, and the EB, CEB, MQ and CBMQ predictors for each of Soybean, Corn, and Corn with outliers. Rows: Cerro Gordo, Hamilton, Worth, Humboldt, Franklin, Pocahontas, Winnebago, Wright, Webster, Hancock, Kossuth and Hardin (n = 5 (6)), followed by the average $m^{-1}\sum_i$ and the between-areas variance $(m-1)^{-1}\sum_i$ of the estimates. The numerical entries are not legible.

6. Conclusions

The MQ-based estimators are a recent and promising proposal in the small area literature and the analysis of their properties is an area of active research. In this article we explored the behaviour of MQ predictors with respect to two coherence properties, benchmarking and neutral shrinkage, that are of interest to final users of small area estimates. Since the estimators introduced in Chambers and Tzavidis (2006) do not satisfy these properties, we proposed modified estimators. As regards benchmarking, our solution is consistent with the M-quantile regression framework, thus it is theoretically more interesting than a simple ratio adjustment. It should be noted that obtaining these benchmarked MQ estimators may be computationally demanding when the (overall) sample size and the number of areas are large. With respect to neutral shrinkage we found that the MQ estimators may under-shrink (under normality) or over-shrink (when the distribution of actual small area parameters is skewed); this behaviour is different from that of EB predictors, which always over-shrink. The solution proposed to obtain MQ predictors adjusted in this sense suffers from some limitations. Similarly to what is usually done for EB predictors, the correction is based on the first two moments; so the focus is mainly on normal or close-to-normal situations, and it may not be sensible when the distribution of actual small area parameters is very skewed. Nevertheless, we keep the correction based on the first two moments as it represents the standard in the literature and in practical applications. A possible area of future research is the consideration of other coherence properties that are desirable to users, especially in official statistical agencies, and in particular the design consistency of the M-quantile predictors, also specifying their asymptotic behaviour and the role of sampling weights.
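The first-two-moments correction discussed in this section, i.e., the closed form (11) whose proof is given in the Appendix, can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name, the default of equal weights and the normalisation of the weights to sum to one are assumptions made here.

```python
import numpy as np


def constrain_ensemble(estimates, c1, c2, weights=None):
    """Rescale estimates so their weighted mean is c1 and weighted variance is c2.

    Each estimate is moved to c1 + sqrt(c2 / spread) * (estimate - centre),
    where centre and spread are the weighted mean and weighted variance
    of the unconstrained estimates.
    """
    est = np.asarray(estimates, dtype=float)
    m = est.size
    w = np.full(m, 1.0 / m) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()  # assumption: weights are normalised to sum to one
    centre = np.sum(w * est)
    spread = np.sum(w * (est - centre) ** 2)
    if spread <= 0:
        raise ValueError("the estimates must not all be equal")
    factor = np.sqrt(c2 / spread)  # below 1 shrinks the ensemble, above 1 expands it
    return c1 + factor * (est - centre)
```

For example, `constrain_ensemble([100.0, 120.0, 95.0, 130.0], c1=110.0, c2=400.0)` returns four values whose equally weighted mean is 110 and whose equally weighted variance is 400, while preserving the ranking of the areas. The values in this example are made up for illustration.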
Appendix

Proof of (11)

Throughout, $\hat{\theta}_i$ denotes the unconstrained MQ predictor for area $i$, $t_i$ its constrained counterpart, $\hat{\theta}_w = \sum_{i=1}^{m} w_i \hat{\theta}_i$ and $\bar{t}_w = \sum_{i=1}^{m} w_i t_i$, with weights such that $\sum_{i=1}^{m} w_i = 1$. First note that

\[
\sum_{i=1}^{m} w_i (t_i - \bar{t}_w)^2 = \sum_{i=1}^{m} w_i t_i^2 + \bar{t}_w^2 - 2\,\bar{t}_w \sum_{i=1}^{m} w_i t_i = \sum_{i=1}^{m} w_i t_i^2 - \bar{t}_w^2 .
\]

We want to minimize

\[
f = \sum_{i=1}^{m} w_i (\hat{\theta}_i - t_i)^2 - a_1 \left( \sum_{i=1}^{m} w_i t_i - c_1 \right) - a_2 \left( \sum_{i=1}^{m} w_i t_i^2 - c_2 - c_1^2 \right).
\]

The first partial derivative in $t_i$ is

\[
\frac{\partial f}{\partial t_i} = -2 w_i (\hat{\theta}_i - t_i) - a_1 w_i - 2 a_2 w_i t_i .
\]

Equating this derivative to 0 and solving for $t_i$ we obtain

\[
t_i^{opt} = \frac{\hat{\theta}_i + a_1/2}{1 - a_2} .
\]

Imposing the first constraint $\sum_{i=1}^{m} w_i t_i = c_1$ and solving in $a_1/2$ we obtain

\[
\frac{a_1}{2} = c_1 (1 - a_2) - \hat{\theta}_w ,
\]

leading to

\[
t_i^{opt} = \frac{\hat{\theta}_i - \hat{\theta}_w}{1 - a_2} + c_1 .
\]

Imposing the second constraint $\sum_{i=1}^{m} w_i (t_i - \bar{t}_w)^2 = c_2$ and solving in $1/(1 - a_2)$ we obtain

\[
\frac{1}{1 - a_2} = \sqrt{\frac{c_2}{\sum_{i=1}^{m} w_i (\hat{\theta}_i - \hat{\theta}_w)^2}} ,
\qquad
t_i^{opt} = \sqrt{\frac{c_2}{\sum_{i=1}^{m} w_i (\hat{\theta}_i - \hat{\theta}_w)^2}} \, (\hat{\theta}_i - \hat{\theta}_w) + c_1 .
\]

7. References

Battese, G., Harter, R., and Fuller, W. (1988). An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data. Journal of the American Statistical Association, 83.

Breckling, J. and Chambers, R. (1988). M-quantiles. Biometrika, 75.

Chambers, R. and Dunstan, R. (1986). Estimating Distribution Function from Survey Data. Biometrika, 73.

Chambers, R. and Tzavidis, N. (2006). M-quantile Models for Small Area Estimation. Biometrika, 93.

Chambers, R., Chandra, H., and Tzavidis, N. (2008). On Bias-Robust Mean Squared Error Estimation for Linear Predictors for Domains. Working Papers, Centre for Statistical and Survey Methodology, The University of Wollongong, Australia. (Available from http://cssm.uow.edu.au/publications).

Dempster, A., Laird, N., and Rubin, D. (1980). Iteratively Reweighted Least Squares for Linear Regression When Errors are Normal/Independent Distributed. Multivariate Analysis V, (ed.) P.R. Krishnaiah. Amsterdam: North Holland.

Eddy, W. and Kadane, J. (1982). The Cost of Drilling for Oil and Gas: An Application of Constrained Robust Regression. Journal of the American Statistical Association, 77.

EURAREA Consortium. (2004). Enhancing Small Area Estimation Techniques to meet European Needs, Project Reference Volume. Available at http// eurarea/download.asp.

Fabrizi, E. (2009). A Comparison of Adjusted Bayes Estimators of an Ensemble of Small Area Parameters. Statistica, LXIX.

Frey, J. and Cressie, N. (2003). Some Results on Constrained Bayes Estimators. Statistics and Probability Letters, 65.

Ghosh, M. (1992). Constrained Bayes Estimation with Applications. Journal of the American Statistical Association, 87.

Huber, P.J. (1981). Robust Statistics. John Wiley & Sons.

Isaki, C. and Fuller, W.A. (1982). Survey Design under the Regression Superpopulation Model. Journal of the American Statistical Association, 77.

Jiang, J. and Lahiri, P. (2006). Mixed Model Prediction and Small Area Estimation. Test, 15.

Judkins, D. and Liu, J. (2000). Correcting the Bias in the Range of a Statistic Across Small Areas. Journal of Official Statistics, 16.

Koenker, R. and Bassett, G. (1978). Regression Quantiles. Econometrica, 46.

Koenker, R. and D'Orey, V. (1987). Computing Regression Quantiles. Biometrika, 93.

Rao, J.N.K. (2003). Small Area Estimation. John Wiley & Sons.

Royall, R. and Cumberland, W. (1978). Variance Estimation in Finite Population Sampling. Journal of the American Statistical Association, 73.

Salvati, N., Ranalli, M., and Pratesi, M. (2011). Small Area Estimation of the Mean Using Non-parametric M-quantile Regression: A Comparison when a Linear Mixed Model Does Not Hold. Journal of Statistical Computation and Simulation, 81.

Sinha, S. and Rao, J. (2009). Robust Small Area Estimation. Canadian Journal of Statistics, 37.

Torabi, M., Datta, G., and Rao, J. (2009). Empirical Bayes Estimation under a Nested Error Linear Regression Model with Measurement Errors in the Covariates. Scandinavian Journal of Statistics, 36.

Tzavidis, N., Marchetti, S., and Chambers, R. (2010). Robust Prediction of Small Area Means and Distributions. Australian and New Zealand Journal of Statistics, 52.

Ugarte, M.D., Militino, A., and Goicoa, T. (2009). Benchmarked Estimates in Small Areas Using Linear Mixed Models with Restrictions. Test, 18.

Received November 2009
Revised September 2011
