Small Area Estimation Under Transformation To Linearity

University of Wollongong Research Online
Centre for Statistical & Survey Methodology Working Paper Series, Faculty of Engineering and Information Sciences, 2008

Small Area Estimation Under Transformation To Linearity
Hukum Chandra, University of Wollongong, hchandra@uow.edu.au
R. Chambers, University of Wollongong, ray@uow.edu.au

Recommended Citation: Chandra, Hukum and Chambers, R., Small Area Estimation Under Transformation To Linearity, Centre for Statistical and Survey Methodology, University of Wollongong, Working Paper 10-08, 2008, 29p. http://ro.uow.edu.au/cssmwp/9

Research Online is the open access institutional repository for the University of Wollongong. For further information contact the UOW Library: research-pubs@uow.edu.au

Centre for Statistical and Survey Methodology, The University of Wollongong
Working Paper 10-08
Small Area Estimation Under Transformation To Linearity
Hukum Chandra and Ray Chambers

Copyright 2008 by the Centre for Statistical & Survey Methodology, UOW. Work in progress, no part of this paper may be reproduced without permission from the Centre.

Centre for Statistical & Survey Methodology, University of Wollongong, Wollongong NSW 2522. Phone +61 2 4221 5435, Fax +61 2 4221 4845. Email: anica@uow.edu.au

Hukum Chandra and Ray Chambers¹

¹ Hukum Chandra, Division of Social Statistics, University of Southampton, Southampton, SO17 1BJ, UK. Email: hchandra@soton.ac.uk; Ray Chambers, Centre for Statistical and Survey Methodology, University of Wollongong, Wollongong, NSW, 2522, Australia. Email: ray@uow.edu.au.

Abstract

Small area estimation based on linear mixed models can be inefficient when the underlying relationships are non-linear. In this paper we introduce SAE techniques for variables that can be modelled linearly following a non-linear transformation. In particular, we extend the model-based direct estimator of Chandra and Chambers (2005) to data that are consistent with a linear mixed model in the logarithmic scale, using model calibration to define appropriate weights for use in this estimator. Our results show that the resulting transformation-based estimator is both efficient and robust with respect to the distribution of the random effects in the model. An application to business survey data demonstrates the satisfactory performance of the method.

Key Words: Sample surveys; Survey estimation; Business surveys; Model calibration; Skewed data; Model-based direct estimation; Empirical best linear unbiased prediction.

1. Introduction

Commonly used methods for small area estimation (SAE) assume that a linear mixed model can be used to characterize the regression relationship between the survey variable Y and an auxiliary variable X in the small areas of interest. In particular, empirical best linear unbiased prediction (EBLUP), see Rao (2003, chapters 6-8), is typically based on a linear mixed model assumption. However, when the data are skewed, as is often the case in business surveys, the relationship between Y and X may not be linear in the original (raw) scale, but can be linear in a transformed scale, e.g. the

logarithmic (log) scale. In such cases we would expect estimation based on a linear mixed model for Y to be inefficient compared with one based on a similar model for a transformed version of Y. See Hidiroglou and Smith (2005). The use of transformations in inference has a long history, see for example Carroll and Ruppert (1988, chapter 4). Recently, Chen and Chen (1996) and Karlberg (2000a) have investigated the use of a transform to linearity approach for regression estimation of survey variables that behave non-linearly. However, to the best of our knowledge there has been no application of this idea in SAE, even though economic theory (and casual observation) suggests that regression relationships in business survey data are typically multiplicative, and hence linear in the log scale.

In this paper we extend the model-based direct (MBD) estimation idea described in Chandra and Chambers (2005) to the situation where the linear mixed model underpinning SAE holds on the log scale, using weights derived via model calibration (Wu and Sitter, 2001). In doing so, we note that our approach easily generalises to other monotone (i.e. invertible) transformations. In contrast, extension of the EBLUP approach to where the data follow a linear mixed model under transformation is complicated. We also relax the usual normality assumption for the area effects in order to examine robustness with respect to this assumption.

In the following section we summarise the MBD approach to SAE under a linear mixed model. In section 3 we use a model-based perspective to motivate model calibrated estimation of population quantities where the underlying variable is linear after suitable transformation. In section 4 we bring these two ideas together, introducing the concept of a fitted value model derived from a linear mixed model in the transformed scale. We then use this fitted value model to specify survey weights for use in an MBD estimator in SAE. In section 5 we present empirical results from a number

of simulation studies that contrast the proposed transformation-based MBD estimators with both the EBLUP and the usual MBD estimator defined by fitting a linear mixed model to the data. Section 6 concludes the paper with a discussion of outstanding issues.

Note that the approach taken in this article is model-based. Consequently all moments are evaluated with respect to a model for the population data. Also, all sample data are assumed to have been obtained via a non-informative sampling method, e.g. probability sampling with inclusion probabilities defined by known model covariates.

2. Model-Based Direct Estimation for Small Areas

To start, we fix our notation. Let $U$ denote a population of size $N$ and let $y_U$ denote the $N$-vector of population values of a characteristic $Y$ of interest. Suppose that our primary aim is estimation of the total $t_{Uy} = \sum_U y_j$ of these population values (or their mean $m_{Uy} = N^{-1} \sum_U y_j$). Let $X$ denote a $p$-vector of auxiliary variables that are related, in some sense, to $Y$ and let $x_U$ denote the corresponding $N \times p$ matrix of population values of these variables. We assume that the individual sample values of $X$ are known. The non-sample values of $X$ may not be individually known, but are assumed known at some aggregate level. At a minimum, we know the vector of population totals $t_{Ux}$ of the columns of $x_U$. Suppose that it is reasonable to assume that the regression of $Y$ on $X$ in the population is linear, i.e.

$$E(y_U \mid x_U) = x_U \beta \quad \text{and} \quad \mathrm{Var}(y_U \mid x_U) = v_U \tag{1}$$

where $v_U$ is known up to a multiplicative constant. Given a sample $s$ of size $n$ from this population, we can partition

$$x_U = \begin{bmatrix} x_s \\ x_r \end{bmatrix} \quad \text{and} \quad v_U = \begin{bmatrix} v_{ss} & v_{sr} \\ v_{rs} & v_{rr} \end{bmatrix}$$

into their sample and non-sample components. Here $r = U - s$ denotes the population units that are not in sample. The vector of weights that define the Best Linear Unbiased Predictor (BLUP) of $t_{Uy}$ is then (Royall, 1976; Valliant, Dorfman and Royall, 2000, section 2.4)

$$w^{BLUP} = (w_j^{BLUP};\, j \in s) = 1_s + H_s'(t_{Ux} - t_{sx}) + (I_s - H_s' x_s')\, v_{ss}^{-1} v_{sr} 1_r \tag{2}$$

where $H_s = (x_s' v_{ss}^{-1} x_s)^{-1} x_s' v_{ss}^{-1}$, $I_s$ is the identity matrix of order $n$, $t_{sx}$ is the vector of sample totals of $X$ and $1_s$ ($1_r$) denotes a vector of ones of size $n$ ($N - n$).

As noted in section 1, linear mixed models are often used in SAE. Such models can be written in the form

$$y_U = x_U \beta + g_U u + e_U \tag{3}$$

where $u$ is a random vector of so-called area effects, $e_U$ is a population $N$-vector of random individual effects and $g_U$ is a known matrix. In general, area effects are vector-valued, so $u' = (u_1'\; u_2'\; \cdots\; u_D')$ and $g_U = \mathrm{diag}\{g_i;\, i = 1, \ldots, D\}$, where $i$ indexes the $D$ small areas that make up the population and $g_i$ is of dimension $N_i \times q$, where $N_i$ is the population size of area $i$. The area specific effects $\{u_i;\, i = 1, \ldots, D\}$ are assumed to be independent and identically distributed realisations of a random vector of dimension $q$ with zero mean and covariance matrix $\Sigma_u$. Similarly, the scalar individual effects making up $e_U$ are assumed to be independent and identically distributed realisations of a random variable with zero mean and variance $\sigma_e^2$, with area and individual effects mutually independent. The parameters $\theta = (\Sigma_u, \sigma_e^2)$ are typically referred to as the variance components of (3). Throughout, we assume that there is at least one sample unit in each small area of interest.

Given the values of the variance components, it is straightforward to see that (3) is

just a special case of the general linear model (1) that underpins the BLUP weights (2). In particular, under (3)

$$v_{ss} = \mathrm{diag}\{v_{iss};\, i = 1, \ldots, D\} = \mathrm{diag}\{g_{is} \Sigma_u g_{is}' + \sigma_e^2 I_{is};\, i = 1, \ldots, D\} \tag{4}$$

and

$$v_{sr} = \mathrm{diag}\{v_{isr};\, i = 1, \ldots, D\} = \mathrm{diag}\{g_{is} \Sigma_u g_{ir}';\, i = 1, \ldots, D\}. \tag{5}$$

Here $g_{is}$ and $g_{ir}$ denote the restriction of $g_i$ to sampled and non-sampled units in area $i$ respectively. Given estimated values $\hat\theta = (\hat\Sigma_u, \hat\sigma_e^2)$ of the variance components we can substitute these in (4) and (5) to obtain estimates $\hat v_{ss}$ and $\hat v_{sr}$ of $v_{ss}$ and $v_{sr}$ respectively, and therefore compute empirical BLUP weights, or EBLUP weights

$$w^{EBLUP} = (w_{ij}^{EBLUP};\, j \in s_i;\, i = 1, \ldots, D) = 1_s + \hat H_s'(t_{Ux} - t_{sx}) + (I_s - \hat H_s' x_s')\, \hat v_{ss}^{-1} \hat v_{sr} 1_r \tag{6}$$

where $\hat H_s = (x_s' \hat v_{ss}^{-1} x_s)^{-1} x_s' \hat v_{ss}^{-1}$. Note that we now use a double index of $ij$ to differentiate between population units in different areas. We also use $s_i$ to denote the $n_i$ sample units in area $i$. The MBD estimator for the mean $m_{iy}$ of $Y$ in area $i$ (Chandra and Chambers, 2005) based on the EBLUP weights (6) is simply the corresponding weighted average of the sample values of $Y$ in area $i$,

$$\hat m_{iy}^{MBD} = \Big\{\sum_{j \in s_i} w_{ij}^{EBLUP}\Big\}^{-1} \sum_{j \in s_i} w_{ij}^{EBLUP} y_{ij}. \tag{7}$$

Note that (7) is not the EBLUP for $m_{iy}$ under (3). This is (see Rao, 2003, section 6.2.3)

$$\hat m_{iy}^{EBLUP} = \hat E(m_{iy} \mid y_{is}, x_{is}, x_{ir}) = N_i^{-1}\Big[\sum_{j \in s_i} y_{ij} + 1_{ir}'\big\{x_{ir}\hat\beta + \hat v_{irs} \hat v_{iss}^{-1}(y_{is} - x_{is}\hat\beta)\big\}\Big]$$

$$= N_i^{-1}\Big[n_i \bar y_{is} + (N_i - n_i)\big\{\bar x_{ir}'\hat\beta + \bar g_{ir}' \hat\Sigma_u g_{is}' (g_{is} \hat\Sigma_u g_{is}' + \hat\sigma_e^2 I_{is})^{-1}(y_{is} - x_{is}\hat\beta)\big\}\Big]. \tag{8}$$

Here $\hat E$ denotes the expectation operator under (3) with unknown parameters replaced by estimates, $x_{is}$ and $x_{ir}$ are the matrices of sample and non-sample values of $X$ in area $i$, $y_{is}$ is the vector of sample values of $Y$ in the same area, $\hat\beta$ is the empirical BLUE of $\beta$, $\hat v_{irs}$ is the transpose of the estimated value of $v_{isr}$ with $\hat v_{iss}$ the corresponding estimate of $v_{iss}$, see (4) and (5), and $1_{ir}$ is a vector of ones of length $N_i - n_i$. Note that the last expression on the right hand side of (8) follows directly by substitution of (4) and (5), with $\bar x_{ir}$ and $\bar g_{ir}$ denoting the vectors defined by averaging the columns of $x_{ir}$ and $g_{ir}$ respectively.

By construction, (7) is a direct estimator of $m_{iy}$, because it is a weighted mean of the area $i$ sample values of $Y$. In contrast, (8) is an indirect estimator because it cannot be expressed in this form, being a weighted mean of all the sample values of $Y$. Clearly, under (3), (8) is more efficient than (7). However, (7) has the advantage of being a simple weighted mean of the area $i$ sample data, and therefore should be more robust to misspecification of (3) than the more model-dependent estimator (8). Some empirical evidence for this is set out in Chandra and Chambers (2005) and in Chandra, Salvati and Chambers (2007), with more extensive evidence available from the unpublished PhD thesis, Chandra (2007). Direct estimators like (7), i.e. estimators that are defined as weighted averages of the sample data from the small area of interest, have a number of practical advantages, including simplicity of construction and aggregation consistency. Also, as we shall see later, (7) is easily generalised to models that are more complex than (3). Corresponding generalisations of (8) usually lead to rather complex non-linear estimators.

MSE estimation for (8) is usually carried out using the theory described in Prasad and Rao (1990). Although this MSE estimator is somewhat complicated, it works well

under (3). However, when (3) fails it can be misleading. It is also inadequate as an estimator of the repeated sampling MSE of (8), as has been pointed out by Longford (2007). In contrast, MSE estimation for (7) is quite straightforward. This is because if one treats the weights defining this estimator as fixed, then it is a linear estimator of a domain mean, and so its prediction variance $V_i$ under (1) can be estimated using well-known methods (see Royall and Cumberland, 1978). Since in general the EBLUP weights (6) are not locally calibrated (i.e. they do not reproduce the area $i$ mean $\bar x_i$ of $X$), (7) has a bias $B_i$ under (1). A simple plug-in estimate of this bias is the difference between (7) and $\bar x_i' \hat\beta$. The final MSE estimator used with (7) is therefore defined by summing the estimate of $V_i$ and the square of this estimate of $B_i$. This method of MSE estimation has been empirically demonstrated to have good model-based as well as repeated sampling properties. See Chandra and Chambers (2005), Chambers and Tzavidis (2006), Chandra, Salvati and Chambers (2007) and Tzavidis, Salvati, Pratesi and Chambers (2007).

3. Model Calibrated Weighting

Model calibration was introduced by Wu and Sitter (2001) as a model-assisted method of calibrated weighting when the underlying regression relationship is non-linear. Here we provide a model-based perspective on the method, as a precursor to using it for constructing weights for use in an MBD estimator in a similar situation.

Suppose that the underlying population model is non-linear, with the relationship between $Y$ and $X$ in the population of form

$$E(y_j \mid x_j) = h(x_j; \eta) \quad \text{and} \quad \mathrm{Var}(y_j \mid x_j) = \sigma_j^2. \tag{9}$$

Here $j = 1, \ldots, N$, $\eta$ (typically vector-valued) and $\sigma_j^2$ are unknown model parameters

and the mean function $h(x_j; \eta)$ is a known function of $x_j$ and $\eta$. We also assume that population units are mutually uncorrelated given their respective values of $X$. Note that (9) is quite general, and includes linear, non-linear, and generalized linear models as special cases. In this situation, Wu and Sitter (2001) define the model-calibrated estimator of the population total $t_{Uy}$ as $\hat t_y^{mc} = \sum_{j \in s} w_j^{mc} y_j$, where the vector of weights $w^{mc} = (w_j^{mc})$ is chosen to minimise an appropriately chosen measure of the distance from $w^{mc}$ to the vector of Horvitz-Thompson weights $w^{\pi} = (\pi_j^{-1})$, subject to the model calibration constraints

$$\sum_{j \in s} w_j^{mc} = N \quad \text{and} \quad \sum_{j \in s} w_j^{mc}\, h(x_j; \hat\eta_\pi) = \sum_{j \in U} h(x_j; \hat\eta_\pi) \tag{10}$$

with $\hat\eta_\pi$ a design consistent estimator of $\eta$. Note that unlike standard calibration, the constraint (10) requires that we know the individual population values of $X$. The key idea behind this approach is that provided (9) fits reasonably, then $y_j$ is (at least approximately) a linear function of its fitted value $h(x_j; \hat\eta_\pi)$ under this model and so we can carry out linear estimation using these fitted values as auxiliary information.

A model-based perspective on model calibration can be developed as follows. Let $\hat\eta$ denote a model-efficient estimator of $\eta$ in (9), e.g. its maximum likelihood (ML) estimator, with associated fitted values $h(x_j; \hat\eta)$. In general, these fitted values will not be unbiased. They will also be correlated. However, there will still be a systematic relationship between the actual values of $Y$ and their corresponding fitted values that we can approximate. Although there is nothing to stop us looking at more complex approximations, a linear model for the relationship between the population values $y_j$ and the fitted values $\hat y_j = h(x_j; \hat\eta)$ seems a reasonable starting point. We therefore

replace the non-linear model (9) by the linear model

$$E(y_j \mid \hat y_j) = \alpha_0 + \alpha_1 \hat y_j \quad \text{and} \quad \mathrm{Cov}(y_j, y_k \mid \hat y_j, \hat y_k) = \omega_{jk}. \tag{11}$$

We refer to (11) as the fitted value model corresponding to (9). Let $J_U$ denote the population design matrix under (11), i.e. $J_U = [1_U \;\; \hat y_U]$, where $1_U$ denotes the unit vector of size $N$ and $\hat y_U = (\hat y_j;\, j = 1, \ldots, N)$, and put $\Omega_U = [\omega_{jk};\, j = 1, \ldots, N;\, k = 1, \ldots, N]$. We can then partition $J_U$ and $\Omega_U$ according to sample ($s$) and non-sample ($r$) units as

$$J_U = \begin{bmatrix} J_s \\ J_r \end{bmatrix} \quad \text{and} \quad \Omega_U = \begin{bmatrix} \Omega_{ss} & \Omega_{sr} \\ \Omega_{rs} & \Omega_{rr} \end{bmatrix},$$

and hence write down the weights that define the BLUP of $t_{Uy}$ under (11). These are the model-based model-calibrated weights

$$w^{mbmc} = (w_j^{mbmc};\, j \in s) = 1_s + H_{mc}'(J_U' 1_U - J_s' 1_s) + (I_s - H_{mc}' J_s')\, \Omega_{ss}^{-1} \Omega_{sr} 1_r \tag{12}$$

where $H_{mc} = (J_s' \Omega_{ss}^{-1} J_s)^{-1} J_s' \Omega_{ss}^{-1}$. Clearly, these weights are model-calibrated since $\sum_{j \in s} w_j^{mbmc} = N$ and $\sum_{j \in s} w_j^{mbmc} \hat y_j = \sum_{j \in U} \hat y_j$. However, unlike the linear model EBLUP weights (2), they are not calibrated on $X$. In practice, the components of $\Omega_U$ will not be known and will need to be estimated. When these estimates are substituted in (12), we obtain the empirical version $w^{embmc}$ of these model-calibrated weights.

4. Small Area Estimation under Transformation

In this section we extend the MBD approach to SAE when the underlying regression relationships are non-linear, exploring its use with model-based model-calibrated weights. In doing so, we shall focus on the important case where the population values of $Y$ follow a non-linear model in their original (raw) scale, but their logarithms can be modelled linearly. The extension to other transform to linear models is straightforward.
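As an illustration of the model-based model-calibrated weights (12) that this section builds on, the following is a minimal numerical sketch rather than the authors' implementation. It assumes NumPy is available, that the fitted values and the blocks of the covariance matrix of the fitted value model (11) have already been computed, and that this covariance matrix is symmetric. The function name `model_calibrated_weights` and the simulated inputs are hypothetical.

```python
import numpy as np


def model_calibrated_weights(yhat_s, yhat_r, omega_ss, omega_sr):
    """Weights of the form (12) for a fitted value model of the form (11).

    yhat_s, yhat_r : fitted values for sample and non-sample units
    omega_ss       : n x n covariance block for the sample units (symmetric)
    omega_sr       : n x (N - n) sample/non-sample covariance block
    """
    yhat_s = np.asarray(yhat_s, dtype=float)
    yhat_r = np.asarray(yhat_r, dtype=float)
    n, n_r = yhat_s.size, yhat_r.size

    # Sample design matrix J_s = [1_s, yhat_s].
    J_s = np.column_stack([np.ones(n), yhat_s])
    # J_U'1_U - J_s'1_s: non-sample count and non-sample total of fitted values.
    t_r = np.array([n_r, yhat_r.sum()])

    # H_mc = (J_s' Omega_ss^{-1} J_s)^{-1} J_s' Omega_ss^{-1}, using symmetry.
    oinv_J = np.linalg.solve(omega_ss, J_s)
    H_mc = np.linalg.solve(J_s.T @ oinv_J, oinv_J.T)

    # z = Omega_ss^{-1} Omega_sr 1_r.
    z = np.linalg.solve(omega_ss, omega_sr @ np.ones(n_r))

    # 1_s + H_mc'(J_U'1_U - J_s'1_s) + (I_s - H_mc' J_s') z
    return np.ones(n) + H_mc.T @ t_r + z - H_mc.T @ (J_s.T @ z)


# Hypothetical inputs: 12 population units, the first 5 of them sampled.
rng = np.random.default_rng(0)
N, n = 12, 5
yhat = rng.uniform(1.0, 10.0, size=N)
A = rng.normal(size=(N, N))
omega = A @ A.T + N * np.eye(N)  # an arbitrary positive definite matrix

w = model_calibrated_weights(yhat[:n], yhat[n:], omega[:n, :n], omega[:n, n:])
print(w.sum(), N)                    # calibration on the population size
print(w @ yhat[:n], yhat.sum())      # calibration on the fitted values
```

Up to rounding error, the two printed pairs should agree, which is the model calibration property stated after (12). Linear systems are solved directly instead of forming explicit inverses, a choice made here for numerical stability only.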

Without loss of generality, suppose that both $Y$ and $X$ are scalar and strictly positive, with skewed population marginal distributions and clear evidence of non-linearity in their relationship, e.g. as in many business survey applications. Furthermore, a linear mixed model is appropriate for characterising how the regression of $\log(Y)$ on $\log(X)$ varies between the small areas. That is, for $i = 1, \ldots, D$; $j = 1, \ldots, N_i$ we have

$$l_{ij} = \log(y_{ij}) = \beta_0 + \beta_1 \log(x_{ij}) + g_{ij}' u_i + e_{ij} \tag{13}$$

where $y_{ij}$ and $x_{ij}$ are the values of $Y$ and $X$ respectively for population unit $j$ in small area $i$, $g_{ij}$ denotes a contextual covariate of dimension $q$, $u_i$ denotes a random effect for area $i$ also of dimension $q$ and $e_{ij}$ is a scalar individual random effect. As usual with this type of model, we assume that all random effects are normally distributed and mutually uncorrelated, with zero expected values, $\mathrm{Var}(u_i) = \Sigma_u$ and $\mathrm{Var}(e_{ij}) = \sigma_e^2$. Note that $\mathrm{Var}(l_{ij} \mid x_{ij}) = v_{ijj} = g_{ij}' \Sigma_u g_{ij} + \sigma_e^2$ and $\mathrm{Cov}(l_{ij}, l_{ik} \mid x_{ij}, x_{ik}, g_{ij}, g_{ik}) = v_{ijk} = g_{ij}' \Sigma_u g_{ik}$ under (13).

Given sample values of $y_{ij}$, $x_{ij}$ and $g_{ij}$, standard methods of estimation (e.g. ML or REML, see Harville, 1977) can be used to estimate the parameters of (13). Let $\hat\Sigma_u$ and $\hat\sigma_e^2$ denote the resulting estimates of the variance components of this linear mixed model. The estimate of $\beta = (\beta_0 \;\; \beta_1)'$ is then

$$\hat\beta = \Big(\sum_i d_{is}' \hat v_{iss}^{-1} d_{is}\Big)^{-1} \sum_i d_{is}' \hat v_{iss}^{-1} l_{is} \tag{14}$$

where $\hat v_{iss}$, $d_{is}$ and $l_{is}$ are the sample components of $\hat v_i = [\hat v_{ijk}] = g_i \hat\Sigma_u g_i' + \hat\sigma_e^2 I_i$, $d_i = [d_{ijk}] = [1_i \;\; \log(x_i)]$ and $l_i = (l_{ij};\, j = 1, \ldots, N_i)$ respectively. Here $g_i$ is the $N_i \times q$ matrix defined by the covariates $g_{ij}$ in area $i$, $I_i$ is the identity matrix of order $N_i$, $1_i$

denotes a vector of ones of dimension $N_i$ and $\log(x_i)$ denotes the vector of $N_i$ values of $\log(X)$ in area $i$. Note that when the variance components $\Sigma_u$ and $\sigma_e^2$ are known, (14) is the BLUE for $\beta$. Consequently, $E(\hat\beta) \approx \beta$ and $\mathrm{Var}(\hat\beta) \approx \big(\sum_i d_{is}' \hat v_{iss}^{-1} d_{is}\big)^{-1}$. Put $\hat\mu_i = (\hat\mu_{ij}) = d_i \hat\beta$. Then $E(\hat\mu_i) \approx d_i \beta$ and $\mathrm{Var}(\hat\mu_i) = A_i = [a_{ijk}] \approx d_i \big(\sum_g d_{gs}' \hat v_{gss}^{-1} d_{gs}\big)^{-1} d_i'$, where $a_{ijk} = d_{ij}' \mathrm{Var}(\hat\beta)\, d_{ik} \to 0$ as $n \to \infty$.

Our aim is to use the log scale linear mixed model (13) for estimation of the small area mean $m_{iy}$. In particular, we use model calibration based on this model to develop sample weights for use in the MBD estimator (7) of this quantity. From the development in the previous section it can be seen that this requires us to first specify a fitted value model (11) for $Y$ based on (13), i.e. we need to calculate appropriate fitted values $\hat y_{ij}$ as well as estimates $\hat\omega_{ijk}$ of $\omega_{ijk} = \mathrm{Cov}(y_{ij}, y_{ik} \mid x_{ij}, x_{ik}, g_{ij}, g_{ik})$ under (13). The sample weights to use in the MBD estimator (7) are then given by (12).

A simple method of defining fitted values $\hat y_{ij}$ under (13) is one where parameter estimates derived under this model are used to obtain predicted values on the log scale which are then back-transformed. Unfortunately, as is well known, this approach is biased. We therefore develop the first and second order moments of an appropriate bias-corrected fitted value model based on (13). Let $x_s$ and $g_s$ denote the sample values of $x_{ij}$ and $g_{ij}$ respectively. Under (13),

$$E(y_{ij} \mid x_{ij}, g_{ij}) = E\{e^{l_{ij}} \mid x_{ij}, g_{ij}\} = e^{\mu_{ij} + v_{ijj}/2} \neq E\big(e^{\hat\mu_{ij} + \hat v_{ijj}/2} \mid x_s, g_s\big) = E(\hat y_{ij} \mid x_{ij}, g_{ij})$$

so the usual bias correction that makes use of the fact that the conditional distribution of

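$y_{ij}$ is lognormal is inadequate. Let $\hat\theta_{ij} = (\hat\beta', \hat v_{ijj})'$ be an estimate of $\theta_{ij} = (\beta', v_{ijj})'$ such that $E(\hat\theta_{ij} - \theta_{ij}) \approx 0$ for large $n$. Put $z(\theta_{ij}) = e^{\mu_{ij} + v_{ijj}/2}$. Using a second order Taylor series approximation we can write

$$z(\hat\theta_{ij}) \approx z(\theta_{ij}) + (\hat\theta_{ij} - \theta_{ij})' z^{(1)}(\theta_{ij}) + \tfrac{1}{2} (\hat\theta_{ij} - \theta_{ij})' z^{(2)}(\theta_{ij}) (\hat\theta_{ij} - \theta_{ij})$$

and so

$$E\{z(\hat\theta_{ij})\} \approx z(\theta_{ij}) + \tfrac{1}{2}\, \mathrm{tr}\Big[E\big\{z^{(2)}(\theta_{ij}) (\hat\theta_{ij} - \theta_{ij})(\hat\theta_{ij} - \theta_{ij})'\big\}\Big].$$

Here

$$z^{(1)}(\theta_{ij}) = \begin{pmatrix} d_{ij}\, e^{\mu_{ij} + v_{ijj}/2} \\ \tfrac{1}{2}\, e^{\mu_{ij} + v_{ijj}/2} \end{pmatrix} \quad \text{and} \quad z^{(2)}(\theta_{ij}) = \begin{pmatrix} d_{ij} d_{ij}'\, e^{\mu_{ij} + v_{ijj}/2} & \tfrac{1}{2} d_{ij}\, e^{\mu_{ij} + v_{ijj}/2} \\ \tfrac{1}{2} d_{ij}'\, e^{\mu_{ij} + v_{ijj}/2} & \tfrac{1}{4}\, e^{\mu_{ij} + v_{ijj}/2} \end{pmatrix}$$

are the vector and matrix respectively containing the first and second order derivatives of $z(\theta_{ij})$ with respect to $\theta_{ij}$. Since the asymptotic covariance between ML (or REML) estimators of the fixed and variance components of a linear mixed model is zero (McCulloch and Searle, 2001, chapter 2, pp 40-45), the covariance between $\hat\beta$ and $\hat v_{ijj}$ will be negligible. It follows that

$$\mathrm{tr}\Big[E\big\{z^{(2)}(\theta_{ij}) (\hat\theta_{ij} - \theta_{ij})(\hat\theta_{ij} - \theta_{ij})'\big\}\Big] = \mathrm{tr}\Big[z^{(2)}(\theta_{ij})\, E\big\{(\hat\theta_{ij} - \theta_{ij})(\hat\theta_{ij} - \theta_{ij})'\big\}\Big]$$

$$\approx e^{\mu_{ij} + v_{ijj}/2} \Big\{ d_{ij}' \Big(\sum_g d_{gs}' \hat v_{gss}^{-1} d_{gs}\Big)^{-1} d_{ij} + \tfrac{1}{4} \mathrm{Var}(\hat v_{ijj}) \Big\} = E(y_{ij} \mid x_{ij}, g_{ij}) \Big\{ \hat a_{ijj} + \tfrac{1}{4} \mathrm{Var}(\hat v_{ijj}) \Big\}$$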

where $\hat a_{ijj} = d_{ij}' \hat V(\hat\beta)\, d_{ij}$ and $\hat V(\hat\beta) = \big(\sum_i d_{is}' \hat v_{iss}^{-1} d_{is}\big)^{-1}$ is the usual estimator of $\mathrm{Var}(\hat\beta)$. Our fitted values are therefore defined by the second order bias corrected estimator of $E(y_{ij} \mid x_{ij}, g_{ij})$,

$$\hat y_{ij} = h(d_{ij}; \hat\theta_{ij}) = \hat k_{ij}^{-1}\, e^{\hat\mu_{ij} + \hat v_{ijj}/2} \tag{15}$$

where $\hat k_{ij} = 1 + \tfrac{1}{2}\big\{\hat a_{ijj} + \tfrac{1}{4} \hat V(\hat v_{ijj})\big\}$ and $\hat V(\hat v_{ijj})$ is the estimated asymptotic variance of $\hat v_{ijj}$. Under ML and REML estimation of the variance components of (13), this estimated asymptotic variance is obtained from the inverse of the relevant information matrix. Note that the bias adjustment of Karlberg (2000a) is a special case of (15).

In order to use (12) to define model-based model-calibrated sample weights, we also need estimates of the second order moments of the population values of $Y$ given these fitted values. The conditional moments $\omega_{ijk}$ are a first order approximation to these moments. In particular, given normal random effects

$$\omega_{ijk} = e^{(\mu_{ij} + \mu_{ik}) + (v_{ijj} + v_{ikk})/2} \big(e^{v_{ijk}} - 1\big). \tag{16}$$

Our estimate $\hat\omega_{ijk}$ of $\omega_{ijk}$ is obtained by substituting $\hat\mu_{ij}$ and $\hat v_{ijk}$ for $\mu_{ij}$ and $v_{ijk}$ in (16). The empirical model-based model-calibrated weights (12) corresponding to the fitted value model defined by (15) and (16) are

$$w^{embmc} = (w_{ij}^{embmc};\, j \in s_i;\, i = 1, \ldots, D) = 1_s + \hat H_{mc}'(J_U' 1_U - J_s' 1_s) + (I_s - \hat H_{mc}' J_s')\, \hat\Omega_{ss}^{-1} \hat\Omega_{sr} 1_r. \tag{17}$$

Here $J_U = [1_U \;\; \hat y_U]$, so $J_U' 1_U - J_s' 1_s = \big(N - n, \; \sum_i \sum_{j \in r_i} \hat y_{ij}\big)'$, and $\hat H_{mc} = (J_s' \hat\Omega_{ss}^{-1} J_s)^{-1} J_s' \hat\Omega_{ss}^{-1}$. Also $\hat\Omega_{ss} = \mathrm{diag}\{\hat\Omega_{iss};\, i = 1, \ldots, D\}$ and $\hat\Omega_{sr} = \mathrm{diag}\{\hat\Omega_{isr};\, i = 1, \ldots, D\}$, where $\hat\Omega_{iss}$ and

$\hat\Omega_{isr}$ are defined by the sample/non-sample decomposition of $\hat\Omega_i$. For example, when (13) corresponds to a random intercept specification, $\hat v_{ijk} = \hat\sigma_u^2 + \hat\sigma_e^2 I(j = k)$ and so the components of $\hat\Omega_i$ are

$$\hat\omega_{ijk} = e^{\hat\mu_{ij} + \hat\mu_{ik} + \hat\sigma_u^2 + \hat\sigma_e^2} \Big[ e^{\hat\sigma_u^2} \big\{ 1 + I(j = k)\big(e^{\hat\sigma_e^2} - 1\big) \big\} - 1 \Big].$$

The development so far has assumed normality of log-scale random effects. However, there is no good reason (beyond convenience) to assume that with skewed data these random area effects should be normal. One alternative, given a scalar area effect in (13), is to assume that the random effects in this model are drawn from the gamma family of distributions. From the properties of this distribution and using binomial and exponential expansions (ignoring higher order terms) we can show that $E(y_{ij} \mid x_{ij}, g_{ij}) \approx e^{\mu_{ij} + v_{ijj}/2} = z(\theta_{ij})$ as in the normal case. This indicates that an MBD estimator based on the model-based model-calibrated weights (17) should be robust with respect to the distribution of the random effects in (13).

Finally, we consider definition of the MBD estimator itself. As noted in section 2, this estimator is just the weighted average of the sample Y-values in an area. However, use of such a weighted average pre-supposes that the weights are reasonably close to being locally calibrated on $N$, i.e. when summed over the sample units in small area $i$ we obtain a value that is not too different from the actual small area population size $N_i$. This property usually holds if the weights are the EBLUP weights (6) defined by a linear mixed model for $Y$. It does not necessarily hold for the model-based model-calibrated weights (17). Consequently, we consider two specifications for the MBD estimator given these weights. The first, which we refer to as a Hajek specification, is just the weighted average (7), with weights defined by (17). The second, which we refer to as a Horvitz-Thompson specification, replaces the denominator in (7) by the actual

value of $N_i$. That is, the two types of MBD estimator under model-based model-calibrated weighting that we consider are

$$\hat m_{iy}^{Hajek\text{-}TrMBD} = \Big\{\sum_{j \in s_i} w_{ij}^{embmc}\Big\}^{-1} \sum_{j \in s_i} w_{ij}^{embmc} y_{ij} \tag{18}$$

and

$$\hat m_{iy}^{HT\text{-}TrMBD} = N_i^{-1} \sum_{j \in s_i} w_{ij}^{embmc} y_{ij}. \tag{19}$$

Estimation of the mean squared error of (18) and (19) is carried out in the usual way for MBD estimators, i.e. via the MSE estimation approach described in section 2.

5. An Empirical Evaluation

In this section we provide empirical results on the comparative performance of four different methods of SAE. These are the two transformation-based MBD estimators (18) and (19), both based on the model-based model-calibrated weights (17) and denoted Hajek-TrMBD and HT-TrMBD respectively; the standard MBD estimator (7) based on the linear mixed model (3) and the EBLUP weights (6), which we denote by Hajek-LinMBD to emphasise that it is a Hajek-type weighted mean based on weights derived under a linear mixed model; and the EBLUP (8) derived under the same linear mixed model, which we denote LinEBLUP. Note that the mean squared errors for all three MBD estimators were estimated using the method described in section 2, while the mean squared error of LinEBLUP was estimated using the method described in Prasad and Rao (1990).

Our empirical results are based on two types of simulation studies. The first type used model-based simulation to generate artificial population and sample data. These data were then used to compare the performance of the different estimators. We carried out two sets of model-based simulations. In the first set of simulations (Set A), we

investigated the performance of these estimators given population data generated using the log-scale linear mixed model (13). In the second set of simulations (Set B), we examined the robustness of these estimators to misspecification of this model. The second type of simulation study was design-based. Here we evaluated these estimators in the context of repeated sampling from a real population using realistic sampling methods.

Four measures of estimator performance were computed using the various estimates generated in these simulation studies. They were the relative bias (RB) and the relative root mean squared error (RRMSE) of these estimates, together with the coverage rate and average width of the nominal 95 per cent confidence intervals based on them. In Tables 2 to 4 these measures are presented as averages over the small areas of interest.

5.1 The Model-Based Simulation Study

In our model-based simulations we fixed the population size at $N = 15{,}000$ and randomly generated the small area population sizes $N_i$, $i = 1, \ldots, D = 30$, so that $\sum_i N_i = N$. We used an overall sample size of $n = 600$ with small area sample sizes set so that they were proportional to the corresponding small area population sizes. These area-specific sample sizes were kept fixed in all our simulations.

In Set A of our model-based simulations the population values $y_{ij}$ were generated using the multiplicative model $y_{ij} = 5.0\, x_{ij}^{\beta} u_i e_{ij}$, with random samples then taken from each small area. Here the values of $x_{ij}$ were independently drawn from the log-normal distribution $\log(x_{ij}) \sim N(6, \sigma_x^2)$, with the individual effects and area effects independently drawn as $\log(e_{ij}) \sim N(0, \sigma_e^2)$ and $\log(u_i) \sim N(0, \sigma_u^2)$ respectively. The

values of $\sigma_e$ and $\sigma_u$ were chosen so that the intra-area correlation in the population varied between 0.20 and 0.25. Table 1 shows the six different sets of parameter values that were used in Set A. These ensured that the simulated populations contained a wide range of variation. Using the sample data in each case, parameter values were estimated using the lme function in R (Bates and Pinheiro, 1998), and estimates for the small area means then calculated, along with appropriate nominal 95% confidence intervals. The process of generating population and sample data, estimation of parameters and calculation of small area estimates was independently replicated 1000 times. The results from this part of the simulation study are shown in Table 2.

In Set B of the model-based simulations, population data were generated using the model $y_{ij} = 5.0\, x_{ij} [\exp(\log^2(x_{ij}))]^{\lambda} u_i e_{ij}$. Here the individual effects $e_{ij}$ and the area effects $u_i$ were independently drawn as $\log(e_{ij}) \sim N(0, 1)$ and $\log(u_i) \sim N(0, 0.25)$ respectively, while the covariate values $x_{ij}$ were drawn as $\log(x_{ij}) \sim N(3, 0.04)$. Five different values for the parameter $\lambda$ (-1.0, -0.5, 0.0, 0.5, 1.0) were investigated, thus generating population data with different degrees of curvature. All other aspects of these simulations, including the estimators considered, were the same as in Set A. Table 3 presents results from this component of the simulation study.

5.2 The Design-Based Simulation Study

This study used the same population and samples as the simulation studies described in Chandra and Chambers (2005) and Chambers and Tzavidis (2006), which was based on data obtained from a sample of 1652 farms that participated in the Australian Agricultural and Grazing Industries Survey (AAGIS). A realistic population of 81982 farms was defined by sampling with replacement from the original sample of 1652 farms with

probabilities proportional to their sample weights, all of which were strictly greater than one. A total of 1000 independent samples, each of size n = 1652, were drawn from this fixed population by simple random sampling without replacement within strata defined by the 29 Australian agricultural regions represented in the AAGIS sample. These regions are the small areas of interest. Regional sample sizes were fixed to be the same as in this original sample, varying from a low of 6 to a high of 117, which allows an evaluation of the performance of the different estimation methods across a range of realistic small area sample sizes. Note that sampling fractions in these strata also varied disproportionately, ranging between 0.70 and 15.87 percent.

The aim is to estimate average annual farm costs (TCC, measured in A$) in each region using farm size (hectares) as the auxiliary variable. The same mixed model specification as in Chandra and Chambers (2005) is used. This includes an interaction term (zone by size) in the fixed effects and a random slope specification for the area effect. In its linear form the model does not fit the AAGIS sample data terribly well. This fit is improved (albeit marginally) when a log-scale linear specification is used. Our results are summarized in Table 4.

5.3 Discussion of Simulation Results

The most striking feature of Table 2 is the extremely large values of the average relative bias of Hajek-TrMBD under model-based model-calibrated weighting. On the other hand, HT-TrMBD, which is based on the same weights as Hajek-TrMBD, is clearly the best of the four estimators whose results are shown in this Table. An investigation of the reasons for this anomaly revealed that summing the model-based model-calibrated weights (17) within small areas produced extremely variable estimates of the small area population sizes, implying that these weights cannot be considered as multipurpose: they function well when used with variables that are reasonably

correlated with the variable that defines the fitted value model, but can fail with other, less well correlated, variables (e.g. the indicator variable for small area inclusion). We further note that this problem does not arise with the standard EBLUP weights (6), as Hajek-LinMBD performs consistently for all six of the scenarios explored in Set A of the simulation study. From now on we therefore focus our discussion on the three estimators, HT-TrMBD, Hajek-LinMBD and LinEBLUP.

Table 2 shows that the average relative biases and the average relative RMSEs for HT-TrMBD are consistently lower than those generated by Hajek-LinMBD and LinEBLUP. Furthermore, average coverage rates and interval widths for HT-TrMBD are better than those generated by Hajek-LinMBD and LinEBLUP. In comparison, for the same order of RB, the RRMSE of LinEBLUP is smaller than that of Hajek-LinMBD, and, although both estimators generate very similar coverage rates, confidence intervals generated via LinEBLUP tend to have smaller average width than those generated via Hajek-LinMBD.

The plots in Figure 1 display the region-specific performance measures generated by these three estimators for the Set A simulations. These show that the RB and the RRMSE values generated by HT-TrMBD are smaller than corresponding values for Hajek-LinMBD and LinEBLUP in all regions. Further, the RB and the RRMSE of Hajek-LinMBD and LinEBLUP increase as the non-linearity in the data increases (i.e. as we move from parameter set 1 to parameter set 6). We also see that HT-TrMBD generates better coverage rates across all regions compared with the coverage rates generated by LinEBLUP and Hajek-LinMBD. Overall, these results show that when the model for the underlying population is non-linear there can be significant gains from the use of HT-type MBD estimators for small area means based on the model-calibrated weights (17) compared with standard linear mixed model-based estimators like Hajek-LinMBD and LinEBLUP. They also show

that the indirect estimator LinEBLUP performs relatively better than the direct estimator Hajek-LinMBD in these situations.

In Set B of the model-based simulations we investigated the robustness of model-based model-calibrated direct estimation to misspecification of the non-linear model. The results in Table 3 show that in this case the biases generated by HT-TrMBD increase as the actual non-linear model deviates more from the assumed non-linear model ([?] = 0.0 in the table). However, these biases are offset by small variability, so in terms of average RRMSE, HT-TrMBD still performs as well as or better than LinEBLUP and continues to dominate Hajek-LinMBD. The biases generated by Hajek-LinMBD and LinEBLUP are of the same order, while the average RRMSE of LinEBLUP dominates that of Hajek-LinMBD. Average coverage rates for LinEBLUP are marginally better than those of Hajek-LinMBD and HT-TrMBD, but the average widths of the confidence intervals underpinning these rates tended to be smallest for HT-TrMBD, followed by LinEBLUP and then Hajek-LinMBD. Overall, our model-based simulation results for Set B indicate that although MBD-based SAE with model-based model-calibrated weights is susceptible to model misspecification bias, the overall performance of this approach appears relatively unaffected by slight deviations from the assumed non-linear model.

In Table 4 and Figure 2 we present, respectively, the average and region-specific performance measures generated by the different SAE methods for the AAGIS data. These results show that the average relative bias of HT-TrMBD is smaller than that of LinEBLUP but larger than that of Hajek-LinMBD, while the average RRMSE of HT-TrMBD is marginally larger than the corresponding values for Hajek-LinMBD and LinEBLUP. Inspection of Figure 2 shows that this result is essentially due to one region (21) in the original AAGIS sample that contained a massive outlier (TCC > A$30,000,000). This outlier was included

in the simulation population (twice) and then selected (in one case, twice) in 37 of the 1000 simulation samples, leading to completely unrealistic estimates for region 21 being generated by HT-TrMBD and Hajek-LinMBD. The right-hand column in Table 4 therefore shows the average performance of the different methods when this region is excluded. Here we see that HT-TrMBD and Hajek-LinMBD are now essentially on a par, with both dominating LinEBLUP. The fact that HT-TrMBD does not provide significant gains over Hajek-LinMBD in this case reflects the fact that the raw-scale and log-scale linear mixed models used in these estimators both provide relatively poor fits to the AAGIS data.

6. Conclusions and Further Research

The simulation results discussed in the previous section show that combining model-based model-calibrated weights with direct estimation can bring significant gains in SAE efficiency if the population data are clearly non-linear. As one would expect, these gains are smaller when the assumed non-linear model is misspecified. Although we do not provide the details, our conclusions were essentially unaffected when we carried out similar simulations using gamma distributed random effects.

Our main caveat concerning the use of the model-based model-calibrated weights (17) for SAE is their specificity. These weights do not appear to have the same multipurpose characteristics as standard EBLUP weights based on linear mixed models. Further research is therefore required on how to build model-calibrated weights for SAE that are more general purpose. It is to be expected that such weights would not be as efficient as the variable-specific weights (17), but hopefully this will be more than offset by their increased utility.

A further issue that is extremely important in practice is that positively skewed survey variables can also take zero (or even negative) values. For

example, economic variables like debt and capital expenditure often take zero values, while variables defined as the difference of two non-negative quantities (e.g. profit, which is the difference between income and expenditure) can be negative. Karlberg (2000b) uses a mixture model to characterise data that are a mix of zeros and strictly positive values. This type of model can be used in model-based model-calibrated weighting.

Finally, we note that using a transformation-based MBD approach where the usual linear model assumptions are only approximately valid (the situation considered in this paper) is not the only approach that has been suggested for this problem. Two alternative approaches in the literature are the pseudo-EBLUP (Rao, 2003, section 7.2.7) and the model-assisted EB-type estimator of Jiang and Lahiri (2006). Recollect from (8) that the EBLUP is defined by replacing the unknown area i mean $m_{iy}$ by an estimate of its expected value given the observed sample values of Y in area i and the area i values of X. Let $\pi_{ij}$ denote the sample inclusion probability of population unit j in small area i. The pseudo-EBLUP is then defined by replacing $m_{iy}$ by an estimate of its expected value given the value of its design-consistent estimate

$$\hat{m}_{iy}^{\pi} = \Big( \sum_{j \in s_i} \pi_{ij}^{-1} \Big)^{-1} \sum_{j \in s_i} \pi_{ij}^{-1} y_{ij} = \sum_{j \in s_i} \tilde{w}_{ij} y_{ij} \qquad (20)$$

and the area i values of X. That is, under (3) the pseudo-EBLUP of $m_{iy}$ is

$$\hat{m}_{iy}^{pseudoEBLUP} = \hat{E}\big\{ m_{iy} \mid \hat{m}_{iy}^{\pi}, x_{is}, x_{ir} \big\} = \bar{x}_i' \hat{\beta}_{\tilde{w}} + \big( \bar{g}_i' \hat{\Sigma}_{u\tilde{w}} \bar{g}_{i\tilde{w}} \big) \Big( \bar{g}_{i\tilde{w}}' \hat{\Sigma}_{u\tilde{w}} \bar{g}_{i\tilde{w}} + \hat{\sigma}_{e\tilde{w}}^{2} \sum_{j \in s_i} \tilde{w}_{ij}^{2} \Big)^{-1} \big( \hat{m}_{iy}^{\pi} - \bar{x}_{i\tilde{w}}' \hat{\beta}_{\tilde{w}} \big) \qquad (21)$$

where $\hat{\beta}_{\tilde{w}}$, $\hat{\Sigma}_{u\tilde{w}}$ and $\hat{\sigma}_{e\tilde{w}}^{2}$ are pseudo-maximum likelihood estimates based on the weights $\tilde{w}_{ij}$, and $\bar{g}_{i\tilde{w}}$ and $\bar{x}_{i\tilde{w}}$ are design-consistent estimates of $\bar{g}_i$ and $\bar{x}_i$ that are

defined in exactly the same way as $\hat{m}_{iy}^{\pi}$ above. Under the same model the Jiang and Lahiri (2006) model-assisted EB-type approach leads to an estimator that is also defined by conditioning on the value of $\hat{m}_{iy}^{\pi}$,

$$\hat{m}_{iy}^{JL} = \sum_{j \in s_i} \tilde{w}_{ij} \hat{E}\big\{ \hat{E}( y_{ij} \mid x_{ij}, u_i ) \mid \hat{m}_{iy}^{\pi}, x_{is} \big\} = \bar{x}_{i\tilde{w}}' \hat{\beta} + \big\{ \tilde{w}_i' g_i \hat{\Sigma}_u g_i' \tilde{w}_i \big\} \big\{ \tilde{w}_i' ( g_i \hat{\Sigma}_u g_i' + \hat{\sigma}_e^{2} I_i ) \tilde{w}_i \big\}^{-1} \big( \hat{m}_{iy}^{\pi} - \bar{x}_{i\tilde{w}}' \hat{\beta} \big) \qquad (22)$$

where $\tilde{w}_i$ is the vector of standardised sample weights $\tilde{w}_{ij}$ in area i. Note that in (22) we use optimal (i.e. ML or REML) estimates for model parameters.

Both (21) and (22) are essentially motivated by the idea of estimating the area i mean by its conditional expectation under (3) given the value of the usual design-consistent estimator (20) for this quantity. As such, they are indirect estimators like the EBLUP. Under (3), neither will be as efficient as the EBLUP, while if (13) rather than (3) holds, then both estimators rely on the design consistency of $\hat{m}_{iy}^{\pi}$ for robustness. Since relying on a large sample property of a small sample statistic seems rather optimistic, we prefer to tackle the model specification problem directly, replacing (3) by (13) and using the transformation-based MBD approach described in section 4.

Values of ARB and ARRMSE for the pseudo-EBLUP (21) and the Jiang and Lahiri estimator (22) are shown in Table 4. It is interesting to note that neither estimator appears to perform any better than the standard EBLUP in these design-based simulations, and all three are substantially outperformed in terms of average RRMSE by the two MBD-type estimators that were investigated in this study. Clearly the results of a single (but reasonably realistic) simulation study should not be considered as anything more than indicative. However, they do provide some evidence that asymptotic design-based properties are no guarantee of small area estimation performance.
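As a rough illustration of how the design-weighted mean (20) feeds into the shrinkage form of the pseudo-EBLUP (21), the following Python sketch works through the calculation for a single area. It is a minimal sketch under stated assumptions, not the authors' implementation: the function and argument names are hypothetical, numpy is assumed to be available, and the parameter values (beta, Sigma_u, sigma2_e) are taken as already estimated, so the pseudo-maximum likelihood step is not shown.

```python
import numpy as np


def standardised_weights(pi):
    """Standardised weights: inverse inclusion probabilities scaled to sum to one."""
    w = 1.0 / np.asarray(pi, dtype=float)
    return w / w.sum()


def design_weighted_mean(y, pi):
    """Design-consistent (Hajek-type) estimate of the area mean, as in (20)."""
    return float(standardised_weights(pi) @ np.asarray(y, dtype=float))


def pseudo_eblup_mean(y, pi, X_s, G_s, xbar_pop, gbar_pop, beta, Sigma_u, sigma2_e):
    """Shrinkage form of (21) for one area, given pre-estimated parameters.

    X_s, G_s  : sample covariates for the fixed and random parts (n_i rows)
    xbar_pop  : area population mean of the fixed-part covariates
    gbar_pop  : area population mean of the random-part covariates
    beta, Sigma_u, sigma2_e : assumed to be supplied by a separate fitting step
    """
    X_s = np.asarray(X_s, dtype=float)
    G_s = np.asarray(G_s, dtype=float)
    xbar_pop = np.asarray(xbar_pop, dtype=float)
    gbar_pop = np.asarray(gbar_pop, dtype=float)
    beta = np.asarray(beta, dtype=float)
    Sigma_u = np.asarray(Sigma_u, dtype=float)

    w = standardised_weights(pi)
    m_hat = float(w @ np.asarray(y, dtype=float))
    xbar_w = w @ X_s          # weighted sample mean of fixed-part covariates
    gbar_w = w @ G_s          # weighted sample mean of random-part covariates

    # Covariance of the area mean with its design-weighted estimate, and the
    # variance of that estimate, both under the linear mixed model.
    cov = gbar_pop @ Sigma_u @ gbar_w
    var = gbar_w @ Sigma_u @ gbar_w + sigma2_e * np.sum(w ** 2)

    return float(xbar_pop @ beta + (cov / var) * (m_hat - xbar_w @ beta))
```

With equal inclusion probabilities the standardised weights reduce to 1/n_i, so the estimate shrinks the ordinary sample mean towards the synthetic prediction. The shrinkage factor moves towards zero as the error variance grows relative to the area-effect variance.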

Acknowledgement

The first author gratefully acknowledges the financial support provided by a PhD scholarship from the U.K. Commonwealth Scholarship Commission.

References

Bates, D.M. and Pinheiro, J.C. (1998). Computational Methods for Multilevel Models. http://franz.stat.wisc.edu/pub/nlme/.

Carroll, R. and Ruppert, D. (1988). Transformation and Weighting in Regression. New York: Chapman and Hall.

Chambers, R. and Tzavidis, N. (2006). M-quantile Models for Small Area Estimation. Biometrika, 93, 255-268.

Chandra, H. (2007). Improved Direct Estimators for Small Areas. Unpublished PhD Thesis, School of Social Sciences, University of Southampton.

Chandra, H. and Chambers, R.L. (2005). Comparing EBLUP and C-EBLUP for Small Area Estimation. Statistics in Transition, 7, 637-648.

Chandra, H., Salvati, N. and Chambers, R. (2007). Small Area Estimation for Spatially Correlated Populations. A Comparison of Direct and Indirect Model-Based Methods. Statistics in Transition, 8, 887-906.

Chen, G. and Chen, J. (1996). A Transformation Method for Finite Population Sampling Calibrated with Empirical Likelihood. Survey Methodology, 22, 139-146.

Harville, D.A. (1977). Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems. Journal of the American Statistical Association, 72, 320-338.

Hidiroglou, M.A. and Smith, P.A. (2005). Developing Small Area Estimates for Business Surveys at the ONS. Statistics in Transition, 7, 527-539.

Jiang, J. and Lahiri, P. (2006). Estimation of Finite Population Domain Means: A Model-Assisted Empirical Best Prediction Approach. Journal of the American Statistical Association, 101, 301-311.

Karlberg, F. (2000a). Population Total Prediction Under a Lognormal Superpopulation Model. Metron, LVIII, 53-80.

Karlberg, F. (2000b). Survey Estimation for Highly Skewed Populations in the Presence of Zeroes. Journal of Official Statistics, 16, 229-241.

Longford, N.T. (2007). On Standard Errors of Model-Based Small-Area Estimators. Survey Methodology, 33, 69-79.

McCulloch, C.E. and Searle, S.R. (2001). Generalized, Linear and Mixed Models. New York: Wiley.

Prasad, N.G.N. and Rao, J.N.K. (1990). The Estimation of the Mean Squared Error of Small Area Estimators. Journal of the American Statistical Association, 85, 163-171.

Rao, J.N.K. (2003). Small Area Estimation. New York: Wiley.

Royall, R.M. (1976). The Linear Least-Squares Prediction Approach to Two-Stage Sampling. Journal of the American Statistical Association, 71, 657-664.

Royall, R.M. and Cumberland, W.G. (1978). Variance Estimation in Finite Population Sampling. Journal of the American Statistical Association, 73, 351-358.

Tzavidis, N., Salvati, N., Pratesi, M. and Chambers, R. (2007). M-quantile Models With Application to Poverty Mapping. Statistical Methods and Applications. In press.

Valliant, R., Dorfman, A.H. and Royall, R.M. (2000). Finite Population Sampling and Inference. New York: Wiley.

Wu, C. and Sitter, R.R. (2001). A Model Calibration Approach to Using Complete Auxiliary Information from Survey Data. Journal of the American Statistical Association, 96, 185-193.

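Table 1 Population specifications for model-based simulations Set A.

| Parameter set | [?] | [?]_u | [?]_e | [?]_x |
|---|---|---|---|---|
| 1 | 0.5 | 0.30 | 0.50 | 3.00 |
| 2 | 0.8 | 0.35 | 0.60 | 2.50 |
| 3 | 1.0 | 0.40 | 0.70 | 2.25 |
| 4 | 1.3 | 0.45 | 0.80 | 1.75 |
| 5 | 1.5 | 0.50 | 0.90 | 1.50 |
| 6 | 2.0 | 0.60 | 1.00 | 1.20 |

Table 2 Average relative bias (ARB, %), average relative RMSE (ARRMSE, %), average coverage rate (ACR) and average interval width (AW) for model-based simulations Set A. Columns 1 to 6 are the parameter sets of Table 1.

| Criterion | Estimator | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| ARB | Hajek-TrMBD | -75.20 | -95.97 | -97.97 | -98.55 | -98.12 | -98.66 |
| ARB | HT-TrMBD | 0.02 | -0.07 | 0.28 | 0.11 | -0.39 | 0.75 |
| ARB | Hajek-LinMBD | 10.98 | 4.11 | -0.29 | -6.28 | -7.81 | -9.59 |
| ARB | LinEBLUP | 12.65 | 5.44 | 0.49 | -5.85 | -7.68 | -9.32 |
| ARRMSE | Hajek-TrMBD | 7.98 | 1.25 | 1.22 | 1.30 | 1.44 | 1.59 |
| ARRMSE | HT-TrMBD | 0.15 | 0.29 | 0.39 | 0.52 | 0.70 | 0.88 |
| ARRMSE | Hajek-LinMBD | 1.03 | 1.47 | 1.79 | 1.89 | 1.98 | 2.78 |
| ARRMSE | LinEBLUP | 0.76 | 0.69 | 0.61 | 0.75 | 0.98 | 1.29 |
| ACR | Hajek-TrMBD | 0.99 | 0.98 | 0.96 | 0.95 | 0.94 | 0.92 |
| ACR | HT-TrMBD | 0.94 | 0.91 | 0.89 | 0.89 | 0.89 | 0.89 |
| ACR | Hajek-LinMBD | 0.87 | 0.85 | 0.85 | 0.87 | 0.88 | 0.87 |
| ACR | LinEBLUP | 0.85 | 0.85 | 0.85 | 0.87 | 0.87 | 0.87 |
| AW | Hajek-TrMBD | 1753 | 22487 | 141001 | 27 × 10^4 | 35 × 10^5 | 43 × 10^6 |
| AW | HT-TrMBD | 220 | 4426 | 33722 | 8 × 10^4 | 11 × 10^5 | 16 × 10^6 |
| AW | Hajek-LinMBD | 1007 | 19318 | 139346 | 28 × 10^4 | 38 × 10^5 | 56 × 10^6 |
| AW | LinEBLUP | 380 | 7253 | 55498 | 13 × 10^4 | 20 × 10^5 | 31 × 10^6 |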

Table 3 Average relative bias (ARB, %), average relative RMSE (ARRMSE, %), average coverage rate (ACR) and average interval width (AW) for model-based simulations Set B.

| Criterion | Estimator | [?] = -1.0 | [?] = -0.5 | [?] = 0.0 | [?] = 0.5 | [?] = 1.0 |
|---|---|---|---|---|---|---|
| ARB | HT-TrMBD | 4.92 | 0.66 | 0.14 | -1.50 | -8.75 |
| ARB | Hajek-LinMBD | -0.21 | 0.04 | 0.12 | 0.16 | -0.85 |
| ARB | LinEBLUP | -0.19 | 0.04 | 0.13 | 0.17 | -0.77 |
| ARRMSE | HT-TrMBD | 0.38 | 0.35 | 0.33 | 0.37 | 0.41 |
| ARRMSE | Hajek-LinMBD | 0.56 | 0.36 | 0.34 | 0.53 | 1.20 |
| ARRMSE | LinEBLUP | 0.38 | 0.30 | 0.29 | 0.36 | 0.56 |
| ACR | HT-TrMBD | 0.94 | 0.92 | 0.92 | 0.91 | 0.87 |
| ACR | Hajek-LinMBD | 0.91 | 0.92 | 0.92 | 0.92 | 0.90 |
| ACR | LinEBLUP | 0.93 | 0.94 | 0.94 | 0.93 | 0.92 |
| AW | HT-TrMBD | 0.04 | 2.50 | 211 | 29070 | 5 × 10^6 |
| AW | Hajek-LinMBD | 0.06 | 2.70 | 214 | 38660 | 13 × 10^6 |
| AW | LinEBLUP | 0.05 | 2.60 | 214 | 33442 | 10 × 10^6 |

Table 4 Average relative bias (ARB, %), average relative RMSE (ARRMSE, %) and average coverage rate (ACR) for design-based simulations using AAGIS data. Simulation standard errors of ARB and ARRMSE are shown in parentheses.

| Criterion | Estimator | Average of 29 regions | Average of 28 regions |
|---|---|---|---|
| ARB | HT-TrMBD | 1.96 (0.20) | 1.92 (0.11) |
| ARB | Hajek-LinMBD | -2.13 (0.15) | -2.21 (0.12) |
| ARB | LinEBLUP | 2.98 (0.18) | 3.36 (0.16) |
| ARB | PseudoEBLUP | 4.01 (0.22) | 4.41 (0.20) |
| ARB | JL | 1.89 (0.19) | 2.23 (0.17) |
| ARRMSE | HT-TrMBD | 21.93 (4.47) | 17.41 (1.18) |
| ARRMSE | Hajek-LinMBD | 20.15 (3.80) | 16.91 (2.20) |
| ARRMSE | LinEBLUP | 19.87 (1.78) | 19.30 (1.63) |
| ARRMSE | PseudoEBLUP | 22.42 (2.52) | 21.95 (2.46) |
| ARRMSE | JL | 20.97 (1.48) | 20.48 (1.31) |
| ACR | HT-TrMBD | 0.89 | 0.92 |
| ACR | Hajek-LinMBD | 0.93 | 0.95 |
| ACR | LinEBLUP | 0.85 | 0.85 |
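The summary criteria reported in Tables 2 to 4 can be illustrated with a short Python sketch. The definitions used here are assumptions: they follow one common convention for relative bias, relative RMSE, coverage and interval width, and may differ in detail from the definitions given earlier in the paper. The function name, the array layout (simulations in rows, areas in columns) and the normal-theory interval with multiplier z are all hypothetical choices made for illustration.

```python
import numpy as np


def performance_measures(est, true, se, z=1.96):
    """Average performance criteria over areas (assumed definitions).

    est, true, se : arrays of shape (K simulations, D areas) holding the
                    estimates, the true area means and the estimated
                    standard errors for each simulation and area.
    z             : multiplier for the nominal confidence interval.
    """
    est = np.asarray(est, dtype=float)
    true = np.asarray(true, dtype=float)
    se = np.asarray(se, dtype=float)

    err = est - true
    mean_true = true.mean(axis=0)

    # Area-specific relative bias and relative RMSE, in percent.
    rb = 100.0 * err.mean(axis=0) / mean_true
    rrmse = 100.0 * np.sqrt((err ** 2).mean(axis=0)) / mean_true

    # Area-specific coverage rate and mean width of est +/- z * se.
    cr = (np.abs(err) <= z * se).mean(axis=0)
    width = (2.0 * z * se).mean(axis=0)

    # Averaging over areas gives the table-level summaries.
    return {
        "ARB": float(rb.mean()),
        "ARRMSE": float(rrmse.mean()),
        "ACR": float(cr.mean()),
        "AW": float(width.mean()),
    }
```

Note that averaging signed relative biases over areas lets positive and negative area-level biases cancel, so a small ARB under this convention does not by itself imply small bias in every area.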

Figure 1. Area-specific results for HT-TrMBD (thick line, 0), LinEBLUP (thin line, Δ) and Hajek-LinMBD (dashed line, Δ) under parameter sets 1 (ParA1), 3 (ParA3), 5 (ParA5) and 6 (ParA6). Left column is RB (%) and right column is RRMSE (%).

Figure 2. Region-specific simulation results for HT-TrMBD (thick line, 0), LinEBLUP (thin line, Δ) and Hajek-LinMBD (dashed line, Δ) in design-based simulations based on the AAGIS data. Plots show (in order from the top) RB (%), RRMSE (%) and CR. Regions are ordered in terms of increasing population size.