Two Applications of Nonparametric Regression in Survey Estimation

1 Two Applications of Nonparametric Regression in Survey Estimation 1/56
Jean Opsomer, Iowa State University
Joint work with:
- Jay Breidt, Colorado State University
- Gerda Claeskens, Université Catholique de Louvain
- Göran Kauermann, Universität Bielefeld
- Giovanna Ranalli, Università di Perugia
April 9, 2004

2 Outline 2/56
1. Introduction
   (a) Two examples
   (b) Generic and specific inference
2. Nonparametric Regression Using Penalized Splines
3. Generic Inference for Survey Data
   (a) Nonparametric model-assisted estimator
   (b) Utah Forest Survey
4. Specific Inference for Survey Data
   (a) Nonparametric small area estimator
   (b) Northeastern Lakes Survey
5. Conclusion

3 1. Introduction: Utah Forest Survey 3/56 Forest inventory survey conducted by U.S. Forest Service

4 Utah Forest Survey (2) 4/56
Phase: Geographic Information System (GIS), N = 67,216
  X, Y      coordinates (E-W, N-S)
  elev      elevation (m)
  asp       aspect (deg)
  slope     slope (deg)
  hillshd   hillshade (solar radiation)
  nlcd      vegetation cover type (class)
Phase: Forest Inventory and Analysis (FIA), n_I = 3,107
  fortyp    forest type (class)
  trees     number of trees
  agemax    max tree age (years)
  ageavg    avg tree age (years)
  bamax     max tree basal area (sq in)
  crcov     tree crown cover (%)
Phase: Forest Health Monitoring (FHM), n = 71
  lichen    lichen species present (count)
  ...
Goal: estimate mean number of lichen in region/domains

5 Introduction: Northeastern Lakes Survey 5/56 Ecological condition survey of Northeastern lakes conducted by U.S. EPA Data collected for 338 lakes

6 Northeastern Lakes Survey (2) Region includes digit Hydrologic Unit Codes (HUC) 6/56 Goal: estimate mean lake Acid Neutralizing Capacity (ANC) for all HUCs

8 Types of Statistical Inference 7/56
Specific Inference: expensive, high quality, targeted
- using custom-built method (or model) to achieve best possible estimator for particular variable(s)
- willing to defend model
Generic Inference: cheap, reasonable quality, good for many purposes
- using method appropriate for large number of variables that need to be estimated jointly
[Image: generic-brand box labeled "Corn Flakes NET WT. 12 OZ."]

9 Statistical Inference in Surveys 8/56
Number of observations   Inference   Modelling
Large                    generic     none
                         generic     model-assisted
Moderate                 specific    model-based
Small                    specific    small domain estimation
- Number of observations depends on domain (subpopulation) size
- For moderate sample size, use of generic inference depends on model goodness-of-fit
- Nonparametric methods can improve both generic and specific inference

10 2. Nonparametric Regression Using Penalized Splines 9/56
Many nonparametric regression methods are available:
- Kernel and local polynomial methods
- Smoothing splines, regression splines, ...
- Orthogonal decomposition (wavelet, Fourier series)
Penalized splines or P-splines regression (Eilers and Marx, 1996) is a simple, flexible and computationally attractive smoothing method

11 Definition of Penalized Splines Model 10/56
Regression model:
$$ y_i = m(x_i) + \varepsilon_i $$
Function $m(\cdot)$ is unknown but assumed well approximated by a polynomial spline:
$$ m(x; \beta) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \sum_{k=1}^{K} \beta_{p+k} (x - \kappa_k)_+^p $$
- $p$: degree of spline (fixed)
- $\kappa_1 < \cdots < \kappa_K$: set of $K$ knots (fixed)
- $\beta = (\beta_0, \ldots, \beta_{p+K})$: vector of parameters (unknown)
(Ruppert, Wand and Carroll, 2003)

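12 Polynomial Spline Basis Functions 11/56
$$ (x - \kappa)_+^p = \begin{cases} (x - \kappa)^p & \text{if } x - \kappa > 0 \\ 0 & \text{otherwise} \end{cases} $$
[Figure: two panels of truncated polynomial basis functions plotted against x, one panel per spline degree p (the first is p = 1)]
Other spline basis functions are possible (B-splines, radial splines)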

13 Choosing K 12/56
If $K$ is sufficiently large, $m(\cdot; \beta)$ can approximate a large class of functions
[Figure: four panels of y against x, each showing the fit for a different value of K]
Rule of thumb: K = min(#x/4, 35)

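14 Expressing Spline Model as Parametric Model 13/56
$$ m(x; \beta) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \sum_{k=1}^{K} \beta_{p+k} (x - \kappa_k)_+^p \equiv \mathbf{x} \beta $$
with
$$ \mathbf{x} = \left( 1, x, \ldots, x^p, (x - \kappa_1)_+^p, \ldots, (x - \kappa_K)_+^p \right), \qquad \beta = (\beta_0, \ldots, \beta_{p+K})^T $$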
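To make the row vector on this slide concrete, here is a minimal Python sketch that stacks it into a design matrix for a few x values. The function name `truncated_power_basis`, the toy x values and the knot positions are assumptions chosen for illustration and do not come from the slides.

```python
import numpy as np

def truncated_power_basis(x, knots, p=1):
    """Design matrix with rows (1, x, ..., x^p, (x - k_1)_+^p, ..., (x - k_K)_+^p).

    Assumes p >= 1 (for p = 0 the power 0**0 would need special handling).
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    poly = np.vander(x, N=p + 1, increasing=True)                 # 1, x, ..., x^p
    trunc = np.maximum(x[:, None] - knots[None, :], 0.0) ** p     # (x - kappa_k)_+^p
    return np.hstack([poly, trunc])

# Toy example: three x values, two knots, linear spline (p = 1)
x = np.array([0.0, 0.5, 1.0])
X = truncated_power_basis(x, knots=[0.25, 0.75], p=1)
print(X)
```

Each row of `X` has p + 1 + K entries, matching the length of the parameter vector β.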

15 Fitting by Penalized Splines Regression 14/56
Minimize the penalized sum of squares:
$$ \min_{\beta} \sum_{i=1}^{n} \left( y_i - m(x_i; \beta) \right)^2 + \lambda \sum_{k=1}^{K} \beta_{p+k}^2 $$
$$ \hat{\beta}_\lambda = \left( X^T X + \lambda D \right)^{-1} X^T Y $$
- $\lambda$ = smoothing penalty (fixed)
- $D = \mathrm{diag}\{0, \ldots, 0, 1, \ldots, 1\}$
- $X$ = design matrix (including spline terms)
$\hat{\beta}_\lambda$ is a ridge regression estimator, with ridge penalty on the nonlinear (spline) terms of the model
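The closed-form ridge solution can be sketched in a few lines of NumPy. This is an illustration under assumed inputs: the simulated sine data, the quantile-based knot placement and the helper name `pspline_fit` are not from the slides.

```python
import numpy as np

def pspline_fit(x, y, knots, p=1, lam=1.0):
    """Return (beta_hat, X) for the penalized fit (X'X + lam * D)^{-1} X'y; assumes p >= 1."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    X = np.hstack([np.vander(x, p + 1, increasing=True),
                   np.maximum(x[:, None] - knots[None, :], 0.0) ** p])
    # D penalizes only the K spline coefficients, not the polynomial part
    D = np.diag([0.0] * (p + 1) + [1.0] * len(knots))
    beta_hat = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
    return beta_hat, X

# Simulated data (assumption made for this example)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
knots = np.quantile(x, np.linspace(0.0, 1.0, 22)[1:-1])   # 20 interior knots

beta_small, X = pspline_fit(x, y, knots, lam=1e-4)   # light penalty: wiggly fit
beta_large, _ = pspline_fit(x, y, knots, lam=1e4)    # heavy penalty: close to a straight line
rss_small = np.sum((y - X @ beta_small) ** 2)
rss_large = np.sum((y - X @ beta_large) ** 2)
print(rss_small, rss_large)
```

A larger λ shrinks the spline coefficients toward zero, so the fit moves toward the degree-p polynomial and the residual sum of squares grows.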

16 Fitting by Penalized Splines Regression (2) 15/56
λ protects against overfitting and determines smoothness of fit
[Figure: four panels of y against x, each showing the fit for a different value of lambda]

17 Choosing the Penalty λ 16/56
- Cross-Validation: minimize CV sum of squares with respect to λ
- Mixed model approach: treat spline parameters $\beta_{p+1}, \ldots, \beta_{p+K}$ as a random effect with common variance $\sigma_\beta^2$ and fit the regression using a Maximum Likelihood approach (MLE, REML)
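The cross-validation option can be sketched as a grid search over λ. This sketch assumes leave-one-out CV computed with the standard shortcut for linear smoothers (residuals divided by one minus the diagonal of the smoother matrix); the slides do not say which CV variant was used, and the data, knot grid and λ grid below are illustrative.

```python
import numpy as np

def loocv_score(X, y, D, lam):
    """Leave-one-out CV sum of squares for the fit (X'X + lam * D)^{-1} X'y."""
    S = X @ np.linalg.solve(X.T @ X + lam * D, X.T)   # smoother matrix: y_hat = S y
    resid = y - S @ y
    return float(np.sum((resid / (1.0 - np.diag(S))) ** 2))

# Simulated data and linear truncated power basis (assumptions for this example)
rng = np.random.default_rng(0)
n, p = 100, 1
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=n)
knots = np.linspace(0.05, 0.95, 15)
X = np.hstack([np.vander(x, p + 1, increasing=True),
               np.maximum(x[:, None] - knots[None, :], 0.0) ** p])
D = np.diag([0.0] * (p + 1) + [1.0] * len(knots))

lambdas = 10.0 ** np.arange(-6, 4)                    # candidate penalties
scores = [loocv_score(X, y, D, lam) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(scores))]
print(best_lambda)
```

The mixed model alternative replaces this grid search with variance component estimates, so that λ comes out of the ML or REML fit.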

18 P-spline: Hybrid Regression Method 17/56
P-spline is a nonparametric regression method:
- can fit very large classes of functions
- adaptive to local features in the data
- smoothness of function is determined by penalty parameter λ
P-spline is a parametric regression method:
- model can be written as $\mathbf{x} \beta$
- fitted by (global) least squares method
- number of parameters $p + K$ puts upper bound on flexibility of model

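20 Extending the model 18/56
Semi-parametric regression
- Model: $y_i = m(x_{1i}; \beta_1) + x_{2i} \beta_2 + \varepsilon_i$
$$ \min_{\beta_1, \beta_2} \sum_{i=1}^{n} \left( y_i - m(x_{1i}; \beta_1) - x_{2i} \beta_2 \right)^2 + \lambda \sum_{k=1}^{K} \beta_{1,p+k}^2 $$
Additive model
- Model: $y_i = m_1(x_{1i}; \beta_1) + m_2(x_{2i}; \beta_2) + \varepsilon_i$
$$ \min_{\beta_1, \beta_2} \sum_{i=1}^{n} \left( y_i - m_1(x_{1i}; \beta_1) - m_2(x_{2i}; \beta_2) \right)^2 + \lambda_1 \sum_{k=1}^{K} \beta_{1,p+k}^2 + \lambda_2 \sum_{k=1}^{K} \beta_{2,p+k}^2 $$
Other...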

22 3.1 Generic Inference for Survey Data 19/56
Population $U = \{1, \ldots, k, \ldots, N\}$ with unknown population parameters
$$ \bar{y}_N = \frac{1}{N} \sum_{U} y_i \quad \text{and} \quad \bar{z}_N, \bar{x}_N, \ldots $$
Sample $s$ selected from $U$ according to known sampling design $p(s)$:
- stratification
- clustering
- multiple phases
Generic estimator for the population mean:
$$ \hat{y}_s = \sum_{s} w_i y_i \qquad \left[ \hat{z}_s = \sum_{s} w_i z_i, \; \hat{x}_s, \ldots \right] $$

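23 Simple Generic Estimation: Design-based 20/56
Horvitz-Thompson estimator (1952):
$$ \hat{y}_\pi = \frac{1}{N} \sum_{s} \frac{y_i}{\pi_i} $$
with inclusion probabilities $\pi_i = \Pr(i \in s)$
Properties of $\hat{y}_\pi$ (for any design $p$):
$$ E_p(\hat{y}_\pi) = \frac{1}{N} \sum_{U} y_i = \bar{y}_N $$
$$ \mathrm{Var}_p(\hat{y}_\pi) = \frac{1}{N^2} \sum\sum_{U} (\pi_{ij} - \pi_i \pi_j) \frac{y_i}{\pi_i} \frac{y_j}{\pi_j} $$
with $\pi_{ij} = \Pr(i, j \in s)$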
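The design unbiasedness of the Horvitz-Thompson estimator can be checked by brute force on a tiny population. The sketch below assumes simple random sampling without replacement, so that every inclusion probability is n/N; the four y values and the helper name `ht_mean` are made up for illustration.

```python
import itertools
import numpy as np

def ht_mean(y_s, pi_s, N):
    """Horvitz-Thompson estimator of the population mean: (1/N) * sum_s y_i / pi_i."""
    return float(np.sum(np.asarray(y_s) / np.asarray(pi_s)) / N)

# Toy population and design (assumptions): SRS without replacement, n = 2 out of N = 4
y = np.array([3.0, 7.0, 1.0, 9.0])
N, n = 4, 2
pi = np.full(N, n / N)                       # inclusion probabilities under SRS

# Enumerate all equally likely samples and average the estimates over the design
samples = list(itertools.combinations(range(N), n))
estimates = [ht_mean(y[list(s)], pi[list(s)], N) for s in samples]
design_expectation = float(np.mean(estimates))
print(design_expectation, y.mean())
```

Averaging over all six possible samples reproduces the population mean, which is what E_p(ŷ_π) = ȳ_N states for this design.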

25 Better Generic Estimation: Model-assisted 21/56
Superpopulation model ξ: $y_i$ are iid with
$$ E_\xi(y_i) = \beta_0 + \beta_1 x_i = x_i^T \beta, \qquad \mathrm{Var}_\xi(y_i) = \sigma^2 $$
Least squares population fit for β:
$$ B_U = (X_U^T X_U)^{-1} X_U^T Y_U $$
$B_U$ is estimated by the sample-based estimator
$$ \hat{B} = \left( X_s^T \Pi_s^{-1} X_s \right)^{-1} X_s^T \Pi_s^{-1} Y_s \quad \text{with } \Pi_s = \mathrm{diag}\{\pi_i, \; i \in s\} $$
Model-assisted ("regression") estimator (Cassel et al., 1977):
$$ \hat{y}_r = \frac{1}{N} \sum_{U} x_i^T \hat{B} + \frac{1}{N} \sum_{s} \frac{y_i - x_i^T \hat{B}}{\pi_i} $$

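26 Properties of Regression Estimator 22/56
- Generic estimator: $\hat{y}_r = \sum_{s} w_i(s) y_i$
- Consistent, asymptotically design unbiased: $E_p(\hat{y}_r) \approx \bar{y}_N$
- Approximate design variance:
$$ \mathrm{Var}_p(\hat{y}_r) \approx \frac{1}{N^2} \sum\sum_{U} (\pi_{ij} - \pi_i \pi_j) \frac{y_i - x_i^T B_U}{\pi_i} \frac{y_j - x_j^T B_U}{\pi_j} $$
- If model is true, smaller than HT variance:
$$ \left[ \text{HT}: \; \mathrm{Var}_p(\hat{y}_\pi) = \frac{1}{N^2} \sum\sum_{U} (\pi_{ij} - \pi_i \pi_j) \frac{y_i}{\pi_i} \frac{y_j}{\pi_j} \right] $$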

29 Nonparametric Regression Estimation 23/56
Superpopulation model ξ:
$$ E_\xi(y_i) = x_i^T \beta, \qquad \mathrm{Var}_\xi(y_i) = \sigma^2 $$
Replace by superpopulation model ξ:
$$ E_\xi(y_i) = m(x_i), \qquad \mathrm{Var}_\xi(y_i) = v(x_i) $$
Breidt and Opsomer (2000) developed a model-assisted nonparametric estimator based on local polynomial regression
Easier to do with P-splines!

30 Sample-weighted Penalized Splines Regression 24/56
Superpopulation model ξ:
$$ E_\xi(y_i) = m(x_i; \beta) = \mathbf{x}_i \beta, \qquad \mathrm{Var}_\xi(y_i) = v(x_i) $$
Least squares P-spline population fit (for fixed λ):
$$ B_{U,\lambda} = \left( X_U^T X_U + \lambda D \right)^{-1} X_U^T Y_U $$
$B_{U,\lambda}$ is estimated by the design-based estimator
$$ \hat{B}_\lambda = \left( X_s^T \Pi_s^{-1} X_s + \lambda D \right)^{-1} X_s^T \Pi_s^{-1} Y_s $$

31 Penalized Splines Regression Estimation 25/56
Model-assisted P-splines estimator:
$$ \hat{y}_p = \frac{1}{N} \sum_{U} x_i^T \hat{B}_\lambda + \frac{1}{N} \sum_{s} \frac{y_i - x_i^T \hat{B}_\lambda}{\pi_i} $$
Design properties: identical to regression estimator (!)
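A minimal NumPy sketch of this estimator follows, assuming a single auxiliary variable known for the whole population and simple random sampling. The simulated population, the knot grid and the function names `spline_design` and `pspline_model_assisted_mean` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def spline_design(x, knots, p=1):
    """Rows (1, x, ..., x^p, (x - kappa_1)_+^p, ..., (x - kappa_K)_+^p); assumes p >= 1."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    return np.hstack([np.vander(x, p + 1, increasing=True),
                      np.maximum(x[:, None] - knots[None, :], 0.0) ** p])

def pspline_model_assisted_mean(x_U, y_s, sample_idx, pi_s, knots, p=1, lam=1.0):
    """Mean of fitted values over U plus the HT-weighted mean of sample residuals."""
    X_U = spline_design(x_U, knots, p)
    X_s = X_U[sample_idx]
    w = 1.0 / np.asarray(pi_s, dtype=float)                    # design weights 1 / pi_i
    D = np.diag([0.0] * (p + 1) + [1.0] * len(knots))
    # Sample-weighted penalized fit: (X_s' Pi^-1 X_s + lam * D)^-1 X_s' Pi^-1 Y_s
    B_hat = np.linalg.solve(X_s.T @ (w[:, None] * X_s) + lam * D, X_s.T @ (w * y_s))
    g_U = X_U @ B_hat                                          # fitted values on the population
    N = len(x_U)
    return float(g_U.sum() / N + np.sum(w * (y_s - g_U[sample_idx])) / N)

# Simulated population and simple random sample (assumptions for this example)
rng = np.random.default_rng(0)
N, n = 2000, 100
x_U = rng.uniform(0.0, 1.0, N)
y_U = np.sin(2.0 * np.pi * x_U) + rng.normal(scale=0.3, size=N)
sample_idx = rng.choice(N, size=n, replace=False)
pi_s = np.full(n, n / N)
knots = np.linspace(0.1, 0.9, 9)

estimate = pspline_model_assisted_mean(x_U, y_U[sample_idx], sample_idx, pi_s, knots)
print(estimate, y_U.mean())
```

Only the sampled y values enter the fit, while the auxiliary variable is used for every population element, which is where the efficiency gain over the Horvitz-Thompson estimator comes from.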

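32 Improved Efficiency from Nonparametric Fit 26/56
Approximate design variance:
$$ \mathrm{Var}_p(\hat{y}_p) \approx \frac{1}{N^2} \sum\sum_{U} (\pi_{ij} - \pi_i \pi_j) \frac{y_i - x_i^T B_{U,\lambda}}{\pi_i} \frac{y_j - x_j^T B_{U,\lambda}}{\pi_j} $$
$$ \left[ \mathrm{Var}_p(\hat{y}_r) \approx \frac{1}{N^2} \sum\sum_{U} (\pi_{ij} - \pi_i \pi_j) \frac{y_i - x_i^T B_U}{\pi_i} \frac{y_j - x_j^T B_U}{\pi_j} \right] $$
$$ \left[ \text{HT}: \; \mathrm{Var}_p(\hat{y}_\pi) = \frac{1}{N^2} \sum\sum_{U} (\pi_{ij} - \pi_i \pi_j) \frac{y_i}{\pi_i} \frac{y_j}{\pi_j} \right] $$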

34 P-Splines or Local Polynomials for Model-Assisted Estimation? 27/56
Advantages of P-Splines
- Closely related to parametric modelling (ratio, linear models)
- Model is easy to extend to multivariate, additive, semiparametric cases
- Handles data sparseness easily, very fast to compute
Disadvantages of P-Splines
- Flexibility of model limited by number of parameters $p + K$

36 Multi-phase Sampling 28/56
- Population U (N elements): Geographic Information System (GIS), N = 67,216
- Phase 1 sample S (n_I elements): Forest Inventory and Analysis (FIA), n_I = 3,107
- Phase 2 sample R (n elements): Forest Health Monitoring (FHM), n = 71

38 Model-assisted Estimation for Multi-phase samples 29/56
Different information available at different phases:
  population   $x_{ak}, z_{ak}, \; k \in U$   GIS variables
  phase 1      $x_{bk}, z_{bk}, \; k \in S$   FIA measurements
  phase 2      $y_k, \; k \in R$              FHM measurements (lichen count)
($x_{bk}, z_{bk}$ contains $x_{ak}, z_{ak}$)
Goal: estimate $\bar{y}_N$ with generic but efficient estimator
Approach:
1. use P-splines to fit semiparametric additive model for each level of auxiliary info
2. construct model-assisted estimator

40 Model-assisted Estimation for Multi-phase samples (2) 30/56
Models
- Model a: using predictors available for U
$$ E_\xi(y_i) = g_a(x_{ai}, z_{ai}) = m_a(x_{ai}; \beta_a) + z_{ai} \gamma_a $$
- Model b: using predictors available for S
$$ E_\xi(y_i) = g_b(x_{bi}, z_{bi}) = m_b(x_{bi}; \beta_b) + z_{bi} \gamma_b $$
P-splines regression estimator:
$$ \hat{y}_p = \frac{1}{N} \sum_{U} \hat{g}_{ai} + \frac{1}{N} \sum_{S} \frac{\hat{g}_{bi} - \hat{g}_{ai}}{\pi_{i(S)}} + \frac{1}{N} \sum_{R} \frac{y_i - \hat{g}_{bi}}{\pi_{i(S)} \, \pi_{i(R \mid S)}} $$
Design properties: follow nonparametric model-assisted framework

41 3.2 Utah Forest Survey 31/56
Phase: Geographic Information System (GIS), N = 67,216
  X, Y      coordinates (E-W, N-S)
  elev      elevation (m)
  asp       aspect (deg)
  slope     slope (deg)
  hillshd   hillshade (solar radiation)
  nlcd      vegetation cover type (class)
Phase: Forest Inventory and Analysis (FIA), n_I = 3,107
  fortyp    forest type (class)
  trees     number of trees
  agemax    max tree age (years)
  ageavg    avg tree age (years)
  bamax     max tree basal area (sq in)
  crcov     tree crown cover (%)
Phase: Forest Health Monitoring (FHM), n = 71
  lichen    lichen species present (count)
  ...

42 Semiparametric Model a 32/56
$$ E_\xi(\text{LICHEN}) = m(\text{hillshd}; \beta) + z_{\text{NLCD}} \gamma $$
[Figure: fitted term ps(hillshd) against hillshd; partial for nlcd against nlcd2]
Fitted using Splus functions gam(), ps() with p = 2, K = 15, λ = 1

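43 Semiparametric Model b 33/56
$$ E_\xi(\text{LICHEN}) = m_1(\text{CRCOV}; \beta_1) + m_2(\text{AGEMAX}; \beta_2) + m_3(\text{BAMAX}; \beta_3) + z_{\text{NLCD}} \gamma $$
[Figure: fitted terms ps(crcov) against crcov, ps(agemax) against agemax, ps(bamax) against bamax; partial for nlcd against nlcd2]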

44 Forest Health Monitoring Estimates 34/56
- HT = Generic estimator, ignores all auxiliary information
- Linear = Generic estimator, all models are fitted by linear regression
- Semiparametric = Generic estimator, semiparametric (P-spline) models
Table columns: HT, Linear, Semiparametric
Table rows: Estimate; Est. St. Dev (69%) (44%)
Nonparametric regression can improve the efficiency of generic estimators in surveys

45 Estimators for Domains 35/56
- Domain: subpopulation for which separate estimator is needed
- Models can improve precision of domain estimators, if they have good local properties
- Nonparametric model better able to adapt to local features of data
[Figure: fitted term s(crcov) against crcov]

46 Model-Assisted Estimators for Domains 36/56
(Särndal, 1984)
- Obtain sample-weighted model fit for complete sample $s$
- Estimator for domain $U_d \subset U$ with realized sample $s_d \subset s$ is
$$ \hat{y}_p = \frac{1}{N_d} \sum_{U_d} \hat{g}_i + \frac{1}{N_d} \sum_{s_d} \frac{y_i - \hat{g}_i}{\pi_i} $$
- Variance follows from model-assisted estimation theory
- Approach maintains additivity of domain estimates

47 Example: Estimation for Domain with NLCD > 50 37/56
[Figure: map of sample locations by phase, Longitude Index against Latitude Index]
Only n = 27 observations in domain

48 Example (2) 38/56
All Data (n = 71)
  Table columns: HT, Linear, Semiparametric
  Table rows: Estimate; Est. St. Dev
Domain (n = 27)
  Table columns: HT, Linear, Semiparametric
  Table rows: Estimate; Est. St. Dev (1.58) (1.56) (1.06)
Nonparametric regression makes it possible to maintain generic approach at smaller scales

49 4.1 Specific Inference for Survey Data 39/56 Data contain 557 observations over 113 HUCs Goal: produce estimates of ANC for all HUCs

50 HUC Sample Means 40/56 Problems: unreliable estimates, missing HUCs

51 HUC as Small Areas 41/56
Few sample observations available in most HUCs
- Average sample size/HUC: HUCs contain fewer than 5 observations
- 27 out of 113 HUCs contain no sample observations
Modelling required to construct reliable HUC-level estimates
- called small area estimation in survey statistics
- mixed model/prediction problem = type of specific inference

52 Nonparametric Small Area Estimation 42/56
P-splines are ideally suited for small area estimation when the mean function is difficult to specify a priori:
- close relationship between classical small area estimation models and P-splines
- availability of existing software
- ability to evaluate need for nonlinearity in model and significance of small area effects

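55 P-splines as Random Effects 43/56
$$ y_i = m(x_i; \beta) + \varepsilon_i = \mathbf{x}_i \beta + \varepsilon_i = \mathbf{x}_i^F \beta_F + z_i \gamma + \varepsilon_i $$
$$ \mathbf{x}_i^F \beta_F \equiv \beta_0 + x_i \beta_1 + \cdots + x_i^p \beta_p \quad \text{(parametric, fixed component)} $$
$$ z_i \gamma = z_{1i} \gamma_1 + \cdots + z_{Ki} \gamma_K \equiv (x_i - \kappa_1)_+^p \beta_{p+1} + \cdots + (x_i - \kappa_K)_+^p \beta_{p+K} $$
(deviations from parametric, treated as random effect)
$$ \gamma = (\gamma_1, \ldots, \gamma_K) \overset{iid}{\sim} F_\gamma(0, \sigma_\gamma^2), \qquad \varepsilon_i \sim F_\varepsilon(0, \sigma_\varepsilon^2) $$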

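56 P-splines Estimator as BLUP 44/56
Assuming $\sigma_\gamma^2, \sigma_\varepsilon^2$ known, the Best Linear Unbiased Estimator/Predictor (BLUE/BLUP) is the solution to
$$ \min_{\beta_F, \gamma} \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^F \beta_F - z_i \gamma \right)^2 + \frac{\sigma_\varepsilon^2}{\sigma_\gamma^2} \sum_{k=1}^{K} \gamma_k^2 $$
(Henderson et al., 1959)
$$ \begin{bmatrix} \hat{\beta}_F \\ \hat{\gamma} \end{bmatrix} = \left( X^T X + \frac{\sigma_\varepsilon^2}{\sigma_\gamma^2} D \right)^{-1} X^T Y $$
with $D = \mathrm{diag}\{0, \ldots, 0, 1, \ldots, 1\}$ and $X = [\, \mathbf{x}^F \;\; z \,]$
$[\hat{\beta}_F, \hat{\gamma}]$ = P-splines (ridge) regression estimator $\hat{\beta}_\lambda$ with $\lambda = \sigma_\varepsilon^2 / \sigma_\gamma^2$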
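The equivalence between the ridge solution and the BLUP can be checked numerically. The sketch below assumes known variance components, simulated data and a linear truncated power basis (all chosen for illustration); it compares the ridge formula with the textbook BLUP computed from the marginal covariance V = σ_γ² Z Zᵀ + σ_ε² I.

```python
import numpy as np

# Simulated data and basis (assumptions for this example)
rng = np.random.default_rng(1)
n, K = 60, 8
x = np.sort(rng.uniform(0.0, 1.0, n))
knots = np.linspace(0.1, 0.9, K)
XF = np.column_stack([np.ones(n), x])                      # fixed part: intercept and slope
Z = np.maximum(x[:, None] - knots[None, :], 0.0)           # random part: (x - kappa_k)_+
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=n)
sig2_eps, sig2_gam = 0.04, 0.5                             # variance components, assumed known

# Ridge form: (C'C + lam * D)^{-1} C'y with lam = sig2_eps / sig2_gam
C = np.hstack([XF, Z])
D = np.diag([0.0, 0.0] + [1.0] * K)
lam = sig2_eps / sig2_gam
ridge = np.linalg.solve(C.T @ C + lam * D, C.T @ y)

# Mixed model form: GLS for the fixed effects, then BLUP of the random effects
V = sig2_gam * (Z @ Z.T) + sig2_eps * np.eye(n)
V_inv = np.linalg.inv(V)
beta_F = np.linalg.solve(XF.T @ V_inv @ XF, XF.T @ V_inv @ y)
gamma = sig2_gam * (Z.T @ V_inv @ (y - XF @ beta_F))
blup = np.concatenate([beta_F, gamma])

print(np.max(np.abs(ridge - blup)))
```

The two coefficient vectors agree up to floating-point error, so the smoothing penalty and the variance ratio describe the same fit.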

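58 P-splines Estimator as EBLUP 45/56
If $\sigma_\gamma^2, \sigma_\varepsilon^2$ are unknown, estimates can be obtained by Maximum Likelihood (ML) or Restricted Maximum Likelihood (REML) (e.g. Searle, Casella and McCulloch, 1992), and
$$ \hat{\beta}_{\hat{\lambda}} = \begin{bmatrix} \hat{\beta}_F \\ \hat{\gamma} \end{bmatrix} = \left( X^T X + \frac{\hat{\sigma}_\varepsilon^2}{\hat{\sigma}_\gamma^2} D \right)^{-1} X^T Y $$
$\hat{\beta}_{\hat{\lambda}}$ is the Empirical BLUP (EBLUP) for β
Smoothing penalty $\hat{\lambda} = \hat{\sigma}_\varepsilon^2 / \hat{\sigma}_\gamma^2$ is determined by data
Automatically adjusts λ to patterns in data:
- small deviations from parametric shape ⇒ $\hat{\sigma}_\gamma^2$ small ⇒ more smoothing
- data exhibit significant deviations from parametric shape ⇒ $\hat{\sigma}_\gamma^2$ large ⇒ less smoothing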

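59 Spatial Smoothing using P-splines 46/56
NE Lakes data require bivariate (spatial) smoothing
Low-rank radial basis (thin-plate spline):
$$ z = \left[ C(x_i - \kappa_k) \right]_{1 \le i \le n, \; 1 \le k \le K} \left[ C(\kappa_k - \kappa_{k'}) \right]^{-1/2}_{1 \le k, k' \le K} $$
with $C(r) = \|r\|^2 \log \|r\|$ (Ruppert et al., 2003)
Mixed model:
$$ y_i = \mathbf{x}_i^F \beta_F + z_i \gamma + \varepsilon_i, \qquad \gamma = (\gamma_1, \ldots, \gamma_K) \overset{iid}{\sim} F_\gamma(0, \sigma_\gamma^2), \qquad \varepsilon_i \sim F_\varepsilon(0, \sigma_\varepsilon^2) $$
Knot selection: regular spatial grid, space-filling algorithm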

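60 Small Area Estimation as Mixed Effect Regression 47/56
Classical small area estimation (Battese, Harter and Fuller, 1988): $T$ small areas, with $u_t$ the random effect for small area $t = 1, \ldots, T$
Model:
$$ y_i = \mathbf{x}_i \beta + u_t + \varepsilon_i = \mathbf{x}_i \beta + d_i u + \varepsilon_i $$
$$ d_i = (d_{i1}, \ldots, d_{iT}), \qquad d_{it} = \begin{cases} 1 & \text{if } i \in \text{small area } t \\ 0 & \text{otherwise} \end{cases} $$
$$ u = (u_1, \ldots, u_T) \overset{iid}{\sim} F_u(0, \sigma_u^2), \qquad \varepsilon_i \sim F_\varepsilon(0, \sigma_\varepsilon^2) $$
Target $\bar{Y}_t = \bar{X}_t \beta + u_t$ estimated by (E)BLUP $\bar{y}_t = \bar{X}_t \hat{\beta} + \hat{u}_t$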

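62 Nonparametric Small Area Model 48/56
Combine both random effects models:
$$ y_i = m(x_i; \beta) + d_i u + \varepsilon_i = \mathbf{x}_i^F \beta_F + z_i \gamma + d_i u + \varepsilon_i $$
Variance components:
$$ \gamma \overset{iid}{\sim} F_\gamma(0, \sigma_\gamma^2), \qquad u \overset{iid}{\sim} F_u(0, \sigma_u^2), \qquad \varepsilon_i \sim F_\varepsilon(0, \sigma_\varepsilon^2) $$
EBLUP can be computed by REML, and
$$ \bar{y}_t = \bar{X}_t^F \hat{\beta}_F + \bar{Z}_t \hat{\gamma} + \hat{u}_t $$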
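The combined model can be sketched as one penalized least squares problem with a separate penalty for each random effect. This sketch assumes the variance components are known rather than estimated by REML, uses simulated data with one area deliberately left without observations, and evaluates predictions at a single covariate value `x0` instead of true area means; all of these are simplifications made for illustration.

```python
import numpy as np

# Simulated data (assumptions for this example)
rng = np.random.default_rng(2)
n, K, T = 80, 6, 5
x = rng.uniform(0.0, 1.0, n)
area = rng.integers(0, T - 1, n)             # observations fall in areas 0..T-2; area T-1 is empty
knots = np.linspace(0.15, 0.85, K)

XF = np.column_stack([np.ones(n), x])                      # fixed effects
Z = np.maximum(x[:, None] - knots[None, :], 0.0)           # spline random effects
Dmat = np.eye(T)[area]                                     # d_it = 1 if unit i is in area t
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.5, T)[area] + rng.normal(0.0, 0.2, n)

# Variance components assumed known here; in practice they would come from ML or REML
sig2_eps, sig2_gam, sig2_u = 0.04, 1.0, 0.25
C = np.hstack([XF, Z, Dmat])
penalty = np.diag(np.concatenate([np.zeros(2),
                                  np.full(K, sig2_eps / sig2_gam),
                                  np.full(T, sig2_eps / sig2_u)]))
theta = np.linalg.solve(C.T @ C + penalty, C.T @ y)
beta_F, gamma, u = theta[:2], theta[2:2 + K], theta[2 + K:]

# Area predictions at a hypothetical covariate value x0 (stand-in for area means)
x0 = 0.5
z0 = np.maximum(x0 - knots, 0.0)
pred = beta_F[0] + beta_F[1] * x0 + z0 @ gamma + u
print(u)
print(pred)
```

The predicted effect for the empty area is zero, so its prediction rests entirely on the fixed and spline terms.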

63 4.2 Small Area Estimation for NE Lakes 49/56
557 measurements on 338 lakes
Dependent variable
  ANC    Acid Neutralizing Capacity
Independent variables
  Fixed effects
    INT    intercept
    ELEV   elevation
  Random effects
    TPS    spatial thin-plate spline for K = 81
    HUC    113 small areas

64 Knot Locations for Spatial Spline 50/56
[Figure: knot locations plotted as utmy against utmx]

65 Full Model: Model Fit 51/56
Estimates
  Fixed effects ($\hat{\beta}_F$)
    INT     586
    ELEV
  Random effects ($\hat{\sigma}$)
    TPS     79
    HUC     420
    Error   173
Correlation between model predictions and ANC: 0.96

66 Full Model: HUC Predictions 52/56
[Figure: HUC predictions from the full model]
Correlation between HUC means and model predictions: 0.97

67 Are both random effects needed? 53/56
AIC model selection criterion
  [Table: TPS (yes/no) by HUC (yes/no)]
Correlation between ANC and model prediction
  [Table: TPS (yes/no) by HUC (yes/no)]

68 Spline in Small Area Model? 54/56
[Figure: HUC predictions from the HUC model against HUC predictions from the HUC + TPS model, for observed HUCs and non-observed HUCs]
Spline model provides better predictions for empty HUCs

69 Small Area Random Effect in Spatial Model? 55/56
[Figure: HUC predictions from the TPS model against HUC predictions from the HUC + TPS model, for observed HUCs and non-observed HUCs]
HUC effect results in predictions that are closer to observed data

70 5. Conclusions 56/56
- Generic and specific inference can be improved with nonparametric regression methods
- P-splines easily extend existing survey estimation approaches
- Utah: more efficient generic estimators for region and for domains
- NE Lakes: improved prediction for HUCs without observations
To do:
- smoothing parameter selection
- mixed model inference, test for significance of random effects
Contact information: http// jopsomer/home.html
