Chapter 2: Section 2.4 - Section 2.9
J. Kim (ISU)

2.4 Regression and stratification

Design-optimal estimator under stratified random sampling:
$$\bar{y}_{reg} = \bar{y}_{st} + (\bar{x}_N - \bar{x}_{st})\hat{\beta}_{opt},$$
where
$$\hat{\beta}_{opt} = \left\{\sum_{h=1}^H W_h^2 (1-f_h) n_h^{-1}\hat{S}_{xxh}\right\}^{-1}\sum_{h=1}^H W_h^2 (1-f_h) n_h^{-1}\hat{S}_{xyh},$$
$$(\hat{S}_{xxh}, \hat{S}_{xyh}) = (n_h-1)^{-1}\sum_{j=1}^{n_h}(x_{hj}-\bar{x}_h)'(x_{hj}-\bar{x}_h,\; y_{hj}-\bar{y}_h),$$
and
$$(\bar{x}_{st}, \bar{y}_{st}) = \sum_{h=1}^H W_h(\bar{x}_h, \bar{y}_h).$$
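The sketch below is a minimal numpy illustration of these formulas for a scalar auxiliary variable: it computes $\hat{S}_{xxh}$, $\hat{S}_{xyh}$, $\hat{\beta}_{opt}$, the stratified means, and the resulting $\bar{y}_{reg}$ for two strata. The stratum sizes, sample sizes, simulated data, and known mean $\bar{x}_N$ are assumptions made purely for illustration.

```python
# Minimal sketch (illustrative data): design-optimal regression estimator
# under stratified simple random sampling, for a scalar auxiliary variable x.
import numpy as np

rng = np.random.default_rng(0)

N_h = np.array([1500, 8500])        # assumed stratum population sizes
n_h = np.array([100, 100])          # assumed stratum sample sizes
W_h = N_h / N_h.sum()
f_h = n_h / N_h
xbar_N = 0.0                        # assumed known population mean of x

num, den, xbar_st, ybar_st = 0.0, 0.0, 0.0, 0.0
for W, f, n in zip(W_h, f_h, n_h):
    x = rng.normal(0.0, 2.0, n)                   # simulated stratum sample
    y = 3.0 * x + rng.normal(0.0, 1.0, n)
    S_xx = np.sum((x - x.mean()) ** 2) / (n - 1)              # S_xxh
    S_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)  # S_xyh
    c = W ** 2 * (1.0 - f) / n                    # W_h^2 (1 - f_h) / n_h
    den += c * S_xx
    num += c * S_xy
    xbar_st += W * x.mean()
    ybar_st += W * y.mean()

beta_opt = num / den
ybar_reg = ybar_st + (xbar_N - xbar_st) * beta_opt
print(beta_opt, ybar_reg)
```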
Note that
$$\hat{\beta}_{opt} = \left\{\sum_{h=1}^H\sum_{j=1}^{n_h} K_h (x_{hj}-\bar{x}_h)'(x_{hj}-\bar{x}_h)\right\}^{-1}\sum_{h=1}^H\sum_{j=1}^{n_h} K_h (x_{hj}-\bar{x}_h)' y_{hj},$$
where $K_h = W_h^2(1-f_h)n_h^{-1}(n_h-1)^{-1} \doteq W_h^2(1-f_h)n_h^{-2} = N^{-2}(1-f_h)\pi_{hi}^{-2}$.

On the other hand,
$$\hat{\beta}_{GREG} = \left\{\sum_{h=1}^H\sum_{j=1}^{n_h} \pi_{hj}^{-1}(x_{hj}-\bar{x}_h)'(x_{hj}-\bar{x}_h)\right\}^{-1}\sum_{h=1}^H\sum_{j=1}^{n_h} \pi_{hj}^{-1}(x_{hj}-\bar{x}_h)' y_{hj}.$$

Roughly speaking, $\hat{\beta}_{opt}$ is the first part of the slope for the regression of $\pi_{hi}^{-1}y_i$ on $\pi_{hi}^{-1}x_i$ and $z_i$, where $z_i$ is a vector of stratum indicator functions.

Given $\hat{\beta}$, consider a regression estimator under stratified sampling,
$$\bar{y}_{st,reg} = \bar{y}_{st} + (\bar{x}_N - \bar{x}_{st})\hat{\beta}.$$
Write $y_{hi} = x_{hi}\beta_h + e_{hi}$, $e_{hi} \sim (0, \sigma^2_{e,h})$. The large-sample variance of the regression estimator is
$$V(\bar{y}_{st,reg}) = \sum_{h=1}^H W_h^2(1-f_h)n_h^{-1}\sigma^2_{a,h},$$
where $\sigma^2_{a,h} = \sigma^2_{e,h} + (\beta_h - \beta_N)'\Sigma_{xx,h}(\beta_h - \beta_N)$, $\Sigma_{xx,h} = V\{x_{hi}\}$, and $\beta_N$ is the probability limit of $\hat{\beta}$.
Example 2.4.1

Two estimators of $\beta$:
$$\hat{\beta}_{wls} = (X'D_w X)^{-1}X'D_w y,$$
$$\hat{\beta}_{opt} = (X'D_w^2 X)^{-1}X'D_w^2 y,$$
where $D_w$ is a diagonal matrix with diagonal elements equal to $W_h n_h^{-1}$ for units in stratum $h$.

Probability limits:
$$\beta_{ols,N} = \mathrm{plim}\;\hat{\beta}_{wls} = (X_N'X_N)^{-1}X_N'y_N,$$
$$\beta_{opt,N} = \mathrm{plim}\;\hat{\beta}_{opt} = (X_N'D_{w,N}X_N)^{-1}X_N'D_{w,N}y_N.$$

Example 2.4.1 (Cont'd)

For example, assume $H = 2$ with $W_1 = 0.15$ and $W_2 = 0.85$. Stratum parameters:
$$\sigma^2_{x,h} = \begin{cases} 4.3 & \text{if } h = 1 \\ 0.6 & \text{if } h = 2, \end{cases}
\qquad
\beta_{1,h} = \begin{cases} 3.0 & \text{if } h = 1 \\ 1.0 & \text{if } h = 2. \end{cases}$$

Population regression coefficients (under $n_1 = n_2$):
$$\beta_{ols,N} = \frac{\sum_{h=1}^H W_h\sigma^2_{xh}\beta_{1h}}{\sum_{h=1}^H W_h\sigma^2_{xh}} = 2.1169,
\qquad
\beta_{opt,N} = \frac{\sum_{h=1}^H W_h^2\sigma^2_{xh}\beta_{1h}}{\sum_{h=1}^H W_h^2\sigma^2_{xh}} = 1.3649.$$
Example 2.4.1 (Cont'd)

To compare the variances, assume that
$$\sigma^2_{e,h} = \begin{cases} 24 & \text{if } h = 1 \\ 0.8 & \text{if } h = 2. \end{cases}$$

Stratum variances of the residuals at $\beta_{ols,N}$:
$$\sigma^2_{a,h} = \begin{cases} (3 - 2.1169)^2(4.3) + 24 = 27.3537 & \text{if } h = 1 \\ (1 - 2.1169)^2(0.6) + 0.8 = 1.5485 & \text{if } h = 2 \end{cases}$$

Stratum variances of the residuals at $\beta_{opt,N}$:
$$\sigma^2_{a,h} = \begin{cases} (3 - 1.3649)^2(4.3) + 24 = 35.4960 & \text{if } h = 1 \\ (1 - 1.3649)^2(0.6) + 0.8 = 0.8106 & \text{if } h = 2 \end{cases}$$

Under $n_h$ constant, the large-sample variances of the regression estimator satisfy
$$n_h V\{\bar{y}_{st,reg,wls}\} = (0.15)^2(27.3537) + (0.85)^2(1.5485) = 1.7345$$
and
$$n_h V\{\bar{y}_{st,reg,opt}\} = (0.15)^2(35.4960) + (0.85)^2(0.8106) = 1.3845.$$

Roughly speaking, $\beta_{ols,N}$ minimizes $\sum_h W_h\sigma^2_{a,h}$ while $\beta_{opt,N}$ minimizes $\sum_h W_h^2 n_h^{-1}\sigma^2_{a,h}$, where $\sigma^2_{a,h} = E\{(y_{hi} - x_{hi}\beta)^2\}$.
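As a quick numerical check, the snippet below recomputes the two probability limits from the stated stratum parameters and then combines the stratum residual variances quoted above into the two variance figures. It only reproduces the example's arithmetic; all numbers are taken from the text.

```python
# Numerical check of Example 2.4.1 (numbers taken from the text).
import numpy as np

W   = np.array([0.15, 0.85])    # stratum weights W_h
sx2 = np.array([4.3, 0.6])      # sigma^2_{x,h}
b1  = np.array([3.0, 1.0])      # stratum slopes beta_{1,h}

beta_ols = np.sum(W * sx2 * b1) / np.sum(W * sx2)         # ~ 2.1169
beta_opt = np.sum(W**2 * sx2 * b1) / np.sum(W**2 * sx2)   # ~ 1.3649

# Combine the quoted stratum residual variances sigma^2_{a,h} into
# n_h * V{ybar_st,reg} = sum_h W_h^2 sigma^2_{a,h}  (n_1 = n_2, fpc ignored).
v_wls = np.sum(W**2 * np.array([27.3537, 1.5485]))        # ~ 1.7345
v_opt = np.sum(W**2 * np.array([35.4960, 0.8106]))        # ~ 1.3845
print(beta_ols, beta_opt, v_wls, v_opt)
```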
If $\bar{x}_{h,N} = N_h^{-1}\sum_{i=1}^{N_h}x_{hi}$ are available, then we can construct a separate regression estimator
$$\bar{y}_{s,reg} = \sum_{h=1}^H W_h\left\{\bar{y}_h + (\bar{x}_{h,N} - \bar{x}_h)\hat{\beta}_h\right\},$$
where
$$\hat{\beta}_h = \left\{\sum_{i=1}^{n_h}(x_{hi} - \bar{x}_h)'(x_{hi} - \bar{x}_h)\right\}^{-1}\sum_{i=1}^{n_h}(x_{hi} - \bar{x}_h)' y_{hi}.$$

Because the weights are the same within each stratum, the GREG-type estimator is the same as the design-optimal estimator when separate regression estimation is used. The bias can be sizable if the $n_h$ are small in some strata.

2.6 Regression for two-stage samples

Basic Setup

Two-stage cluster sampling:
1. Stage One: select $n$ clusters.
2. Stage Two: within the selected cluster $i$, select $m_i$ second-stage units (from the $M_i$ units).

$\pi_{(ij)}$: the inclusion probability of element $j$ in primary sampling unit $i$, with $\pi_{(ij)} = \pi_{1i}\pi_{2j|i}$.

The analysis unit is the element, not the cluster. Thus, we want to construct weights for the sample elements.
Two types of auxiliary information:
- $x_{ij}$: element-level auxiliary information
- $z_i$: cluster-level auxiliary information

We want to incorporate the auxiliary information through the calibration constraints
$$\sum_{i\in A_I}\sum_{j\in A_i} w_{ij}x_{ij} = \sum_{i\in U_I}\sum_{j=1}^{M_i}x_{ij},
\qquad
\sum_{i\in A_I}\sum_{j\in A_i} w_{ij}z_i = \sum_{i\in U_I} z_i.$$

Approach 1

Construct $z_{ij}$ from $z_i$ and apply the regression weighting method using $(x_{ij}, z_{ij})$ in the sample. Use $z_{ij} = z_i m_i^{-1}\pi_{2j|i}$. Note that $\sum_{j\in A_i}\pi_{2j|i}^{-1}z_{ij} = z_i$, and so
$$E\left\{\sum_{i\in A_I}\sum_{j\in A_i}\pi_{1i}^{-1}\pi_{2j|i}^{-1}z_{ij}\right\} = E\left\{\sum_{i\in A_I}\pi_{1i}^{-1}z_i\right\} = \sum_{i\in U_I}z_i.$$
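Below is a minimal sketch of Approach 1 for a cluster-level auxiliary variable: it builds $z_{ij} = z_i m_i^{-1}\pi_{2j|i}$ for the sampled elements, confirms that the within-cluster weighted sum recovers $z_i$, and forms the element-level weighted total that estimates $\sum_{i\in U_I} z_i$. The two clusters and their inclusion probabilities are assumed purely for illustration.

```python
# Sketch of Approach 1 (illustrative clusters): spread a cluster-level z_i
# over its sampled elements as z_ij = z_i * pi_{2j|i} / m_i.
import numpy as np

clusters = [
    # (pi_1i, z_i, pi_{2j|i} for the m_i sampled elements) -- assumed values
    (0.10, 5.0, np.array([0.50, 0.50, 0.25])),
    (0.20, 2.0, np.array([0.40, 0.40])),
]

z_total_hat = 0.0
for pi1, z_i, pi2 in clusters:
    m_i = len(pi2)
    z_ij = z_i * pi2 / m_i                       # element-level version of z_i
    assert np.isclose(np.sum(z_ij / pi2), z_i)   # sum_j z_ij / pi_{2j|i} = z_i
    z_total_hat += np.sum(z_ij / (pi1 * pi2))    # HT sum over sampled elements
print(z_total_hat)   # estimates the population cluster total of z_i
```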
Approach 2: design-consistent model-based approach

Model for the two-stage sample:
$$y_{ij} = x_{ij}\beta + u_{ij}, \qquad u_{ij} = b_i + e_{ij},$$
where $b_i \sim iid(0, \sigma_b^2)$, $e_{ij} \sim iid(0, \sigma_e^2)$, and $e_{ij}$ is independent of $b_k$ for all $i, j, k$. Writing $u_i = (u_{i1}, \ldots, u_{im})'$, we have $u_i \sim (0, \Sigma_{uu})$, where $\Sigma_{uu} = I_m\sigma_e^2 + J_m J_m'\sigma_b^2$. For illustration, see Example 2.6.1.

2.7 Calibration

Minimize $\omega'V\omega$ subject to $\omega'X = \bar{x}_N$.

By the Cauchy-Schwarz inequality, for any vector $a$,
$$(\omega'V\omega)(aX'V^{-1}Xa') \ge (\omega'Xa')^2,$$
with equality iff $\omega'V^{1/2} \propto aX'V^{-1/2}$, i.e., $\omega' = k\,aX'V^{-1}$ for some constant $k$. Then $\omega'X = k\,aX'V^{-1}X$, and the constraint $\omega'X = \bar{x}_N$ gives $\bar{x}_N(X'V^{-1}X)^{-1} = ka$. Hence
$$\omega' = \bar{x}_N(X'V^{-1}X)^{-1}X'V^{-1}
\quad\text{and}\quad
\omega'V\omega \ge \bar{x}_N(X'V^{-1}X)^{-1}\bar{x}_N'.$$

Note: this is the problem of minimizing $V_\xi(\omega'y)$ subject to $E_\xi(\omega'y) = E_\xi(\bar{y}_N)$.
Alternative Minimization

Lemma. Let $\alpha$ be a given $n$-dimensional vector, and let
$$\omega_a = \arg\min_\omega\;\omega'V\omega \;\;\text{s.t.}\;\; \omega'X = \bar{x}_N,
\qquad
\omega_b = \arg\min_\omega\;(\omega - \alpha)'V(\omega - \alpha) \;\;\text{s.t.}\;\; \omega'X = \bar{x}_N.$$
If $V\alpha \in C(X)$, then $\omega_a = \omega_b$.

Proof: Write $V\alpha = X\lambda$ for some $\lambda$. Then, for any $\omega$ satisfying $\omega'X = \bar{x}_N$,
$$(\omega - \alpha)'V(\omega - \alpha) = \omega'V\omega - \alpha'V\omega - \omega'V\alpha + \alpha'V\alpha
= \omega'V\omega - \lambda'X'\omega - \omega'X\lambda + \alpha'V\alpha
= \omega'V\omega - 2\bar{x}_N\lambda + \alpha'V\alpha.$$
The last two terms do not depend on $\omega$, so the two problems have the same minimizer.

If $\alpha = D_\pi^{-1}J_n$ (the vector of design weights), then $V\alpha \in C(X)$ is the condition for design consistency in Corollary 2.2.3.1.

General Objective Function

$$\min \sum_{i\in A} G(\omega_i, \alpha_i) \;\;\text{s.t.}\;\; \sum_{i\in A}\omega_i x_i = \bar{x}_N$$

Lagrange multiplier method:
$$g(\omega_i, \alpha_i) - \lambda'x_i = 0, \quad\text{where } g(\omega_i, \alpha_i) = \partial G/\partial\omega_i,$$
so that $\omega_i = g^{-1}(\lambda'x_i, \alpha_i)$, where $\lambda$ solves $\sum_{i\in A} g^{-1}(\lambda'x_i, \alpha_i)x_i = \bar{x}_N$.
GREG Estimator

$$\min\; Q(\omega, d) = \sum_{i\in A} d_i\left(\frac{\omega_i}{d_i} - 1\right)^2 q_i
\;\;\text{s.t.}\;\; \sum_{i\in A}\omega_i x_i = \bar{x}_N.$$

The Lagrangian condition $2 d_i^{-1}(\omega_i - d_i)q_i - 2\lambda'x_i = 0$ gives $\omega_i = d_i + \lambda' d_i x_i'/q_i$, so the constraint
$$\sum_{i\in A}\omega_i x_i = \sum_{i\in A} d_i x_i + \lambda'\sum_{i\in A} d_i x_i'x_i/q_i = \bar{x}_N$$
yields
$$\lambda' = (\bar{x}_N - \bar{x}_{HT})\left(\sum_{i\in A} d_i x_i'x_i/q_i\right)^{-1}$$
and hence
$$\omega_i = d_i + (\bar{x}_N - \bar{x}_{HT})\left(\sum_{j\in A} d_j x_j'x_j/q_j\right)^{-1} d_i x_i'/q_i.$$

Other Objective Functions

Pseudo empirical likelihood:
$$Q(\omega, d) = -\sum_{i\in A} d_i\log\left(\frac{\omega_i}{d_i}\right), \qquad \omega_i = d_i/(1 + x_i\lambda).$$

Kullback-Leibler distance:
$$Q(\omega, d) = \sum_{i\in A}\omega_i\log\left(\frac{\omega_i}{d_i}\right), \qquad \omega_i = d_i\exp(x_i\lambda).$$
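The following sketch implements two of the calibration solutions above on simulated data: the closed-form chi-square (GREG) weights and the Kullback-Leibler weights $\omega_i = d_i\exp(x_i\lambda)$ obtained by Newton iteration on the calibration equations. The design weights, auxiliary variables, and control totals are all assumptions made for illustration.

```python
# Sketch: GREG (chi-square) calibration weights in closed form, and
# Kullback-Leibler (exponential tilting) weights via Newton's method.
# All data below are simulated assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.column_stack([np.ones(n), rng.normal(2.0, 1.0, n)])  # x_i includes 1
d = np.full(n, 1.0 / n)                                     # design weights d_i
q = np.ones(n)
xbar_N = np.array([1.0, 2.1])                               # assumed known controls

# GREG: w_i = d_i + (xbar_N - xbar_HT) M^{-1} d_i x_i / q_i,  M = sum d_i x_i'x_i/q_i
xbar_HT = d @ x
M = (x * (d / q)[:, None]).T @ x
w_greg = d + (x * (d / q)[:, None]) @ np.linalg.solve(M, xbar_N - xbar_HT)

# Kullback-Leibler: w_i = d_i exp(x_i lambda); solve sum_i w_i x_i = xbar_N by Newton
lam = np.zeros(x.shape[1])
for _ in range(25):
    w = d * np.exp(x @ lam)
    U = w @ x - xbar_N             # calibration equations
    J = (x * w[:, None]).T @ x     # Jacobian of U with respect to lambda
    lam -= np.linalg.solve(J, U)
w_kl = d * np.exp(x @ lam)

print(w_greg @ x, w_kl @ x)        # both reproduce xbar_N
```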
Theorem 2.7.1 (Deville and Särndal, 1992)

Theorem. Let $G(\omega, \alpha)$ be a continuous, convex function of $\omega$ with a first derivative that is zero at $\omega = \alpha$. Under some regularity conditions, the solution $\omega_i$ that minimizes $\sum_{i\in A} G(\omega_i, \alpha_i)$ subject to $\sum_{i\in A}\omega_i x_i = \bar{x}_N$ satisfies
$$\sum_{i\in A}\omega_i y_i = \sum_{i\in A}\alpha_i y_i + (\bar{x}_N - \bar{x}_\alpha)\hat{\beta} + O_p(n^{-1}),$$
where $\bar{x}_\alpha = \sum_{i\in A}\alpha_i x_i$,
$$\hat{\beta} = \left(\sum_{i\in A} x_i'x_i/\phi_{ii}\right)^{-1}\sum_{i\in A} x_i'y_i/\phi_{ii},$$
and $\phi_{ii} = \partial^2 G(\alpha_i, \alpha_i)/\partial\omega_i^2$.

Proof of Theorem 2.7.1

Using the Lagrange multiplier method, $\omega_i = \omega_i(\lambda) = g^{-1}(\lambda'x_i, \alpha_i)$, where $g(\omega_i, \alpha_i) = \partial G/\partial\omega_i$. By assumption, $g^{-1}(0, \alpha_i) = \alpha_i$.

Define $\hat{U}(\lambda) = \sum_{i\in A}\omega_i(\lambda)x_i - \bar{x}_N$ and let $\hat{\lambda}$ satisfy $\hat{U}(\hat{\lambda}) = 0$. By a Taylor expansion,
$$0 = \hat{U}(\hat{\lambda}) = \hat{U}(0) + \frac{\partial\hat{U}(0)}{\partial\lambda}(\hat{\lambda} - 0) + O_p(n^{-1}).$$
Here $\hat{U}(0) = \sum_{i\in A}\alpha_i x_i - \bar{x}_N$ and
$$\frac{\partial\hat{U}(0)}{\partial\lambda} = \sum_{i\in A}\frac{1}{g'(\alpha_i, \alpha_i)}x_i'x_i = \sum_{i\in A} x_i'x_i/\phi_{ii},$$
where
$$g'(\alpha_i, \alpha_i) = \left.\frac{\partial^2 G(\omega_i, \alpha_i)}{\partial\omega_i^2}\right|_{\omega_i = \alpha_i} = \phi_{ii}.$$
Proof of Theorem 2.7.1, continued

$$\bar{y}_{cal}(\hat{\lambda}) = \sum_{i\in A}\omega_i(\hat{\lambda})y_i
= \bar{y}_{cal}(0) + \left[\frac{\partial\bar{y}_{cal}(0)}{\partial\lambda}\right](\hat{\lambda} - 0) + O_p(n^{-1})
= \sum_{i\in A}\alpha_i y_i + (\bar{x}_N - \bar{x}_\alpha)\left[\sum_{i\in A}\frac{x_i'x_i}{\phi_{ii}}\right]^{-1}\sum_{i\in A}\frac{x_i'y_i}{\phi_{ii}} + O_p(n^{-1}).$$

2.8 Weight Bounds

The GREG weights $\omega_i = d_i + d_i\lambda'x_i'/q_i$ can take negative values (or very large values). One remedy is to add the bounds $L_1 \le \omega_i \le L_2$ to the calibration constraint $\sum_{i\in A}\omega_i x_i = \bar{x}_N$.

Approaches:
1. Huang and Fuller: $Q(w_i, d_i) = \sum_{i\in A} d_i\Psi(w_i/d_i)$, where $\Psi$ is a Huber-type function.
2. Husain (1969): minimize $\omega'\omega + \gamma(\omega'X - \bar{x}_N)\Sigma_{xx}^{-1}(\omega'X - \bar{x}_N)'$ for some $\gamma$.
3. Other methods, e.g., quadratic programming.
2.9 Maximum likelihood and raking ratio

Basic Setup

Two-way ($r \times c$) categorical data:
$$a_{km} = \frac{n_{km}}{n}, \qquad p_{km} = E\left[\frac{n_{km}}{n}\right], \qquad k = 1, \ldots, r,\; m = 1, \ldots, c,$$
with the margins $p_{k\cdot}$, $p_{\cdot m}$ known. We are interested in estimating $p_{km}$ subject to the constraints
$$\sum_m\hat{p}_{km} = p_{k\cdot}, \qquad \sum_k\hat{p}_{km} = p_{\cdot m}.$$

Maximum likelihood approach

Multinomial log-likelihood (up to a constant):
$$\sum_{k=1}^r\sum_{m=1}^c a_{km}\log(p_{km})$$

Lagrange multiplier method:
$$\sum_{k=1}^r\sum_{m=1}^c a_{km}\log(p_{km})
+ \sum_{k=1}^r\lambda_k\left(\sum_{m=1}^c p_{km} - p_{k\cdot}\right)
+ \sum_{m=1}^c\lambda_{r+m}\left(\sum_{k=1}^r p_{km} - p_{\cdot m}\right),$$
which gives a solution of the form
$$p_{km} = \frac{a_{km}}{\lambda_k + \lambda_{r+m}},$$
with the $\lambda$'s determined by the constraints.
Raking ratio method

Deming and Stephan (1940) idea: approximate
$$\sum_{k=1}^r\sum_{m=1}^c a_{km}\log(p_{km})
\doteq \sum_{k=1}^r\sum_{m=1}^c\left\{a_{km}\log(a_{km}) + (p_{km} - a_{km}) - \frac{1}{2}a_{km}^{-1}(p_{km} - a_{km})^2\right\}.$$
Thus, maximizing $\sum_{k=1}^r\sum_{m=1}^c a_{km}\log(p_{km})$ is asymptotically equivalent to minimizing $\sum_{k=1}^r\sum_{m=1}^c a_{km}^{-1}(p_{km} - a_{km})^2$.

If there is only one set of constraints,
$$\sum_{m=1}^c p_{km} = p_{k\cdot}, \qquad k = 1, \ldots, r,$$
then the solution to minimizing $\sum_{m=1}^c a_{km}^{-1}(p_{km} - a_{km})^2$ subject to the constraint is
$$p_{km} = a_{km}\frac{p_{k\cdot}}{\sum_{m=1}^c a_{km}}.$$

Raking ratio method (Cont'd)

For the two sets of constraints,
$$\sum_{m=1}^c p_{km} = p_{k\cdot} \quad (k = 1, \ldots, r),
\qquad
\sum_{k=1}^r p_{km} = p_{\cdot m} \quad (m = 1, \ldots, c),$$
iterate between the row and column adjustments:
$$p^{(t+1)}_{km} = p^{(t)}_{km}\,\frac{p_{k\cdot}}{\sum_{m=1}^c p^{(t)}_{km}},
\qquad
p^{(t+2)}_{km} = p^{(t+1)}_{km}\,\frac{p_{\cdot m}}{\sum_{k=1}^r p^{(t+1)}_{km}}.$$
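A short numerical sketch of these raking updates (iterative proportional fitting) is given below; the 2 x 2 table and the margins are assumed purely for illustration.

```python
# Sketch of the raking ratio (iterative proportional fitting) updates for a
# two-way table; the table and margins below are illustrative assumptions.
import numpy as np

a = np.array([[0.20, 0.10],            # observed cell proportions a_km
              [0.30, 0.40]])
p_row = np.array([0.25, 0.75])         # known row margins p_k.
p_col = np.array([0.55, 0.45])         # known column margins p_.m

p = a.copy()
for _ in range(100):
    p = p * (p_row / p.sum(axis=1))[:, None]   # row step:    p^(t+1)
    p = p * (p_col / p.sum(axis=0))[None, :]   # column step: p^(t+2)

print(p)
print(p.sum(axis=1), p.sum(axis=0))    # margins match p_row, p_col at convergence
```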