Large sample theory for merged data from multiple sources

1 Large sample theory for merged data from multiple sources Takumi Saegusa University of Maryland Division of Statistics August

2 Section 1 Introduction

3 Problem: Data Integration
Massive data are collected from various sources:
- online surveys
- social networks
- business transactions
- sensor networks
- scientific research
(In a public health setting) disease registries, clinical trials, epidemiological studies, health surveys, hospital records, healthcare databases, etc.
The representativeness of a sample critically depends on the technology used for data collection.
A remedy: merge multiple data sets with different coverage.

4 Motivating Examples: Telephone Surveys
[Diagram: Population → i.i.d. sample → Cell Phone Users (Data Source 1) and Landline Users (Data Source 2) → sampling without replacement → Sample 1 and Sample 2, with duplicated selection.]

5 Motivating Examples: Study on Rare Populations
[Diagram: Population → i.i.d. sample → Maori People (Data Source 1) and Auckland Population (Data Source 2) → sampling without replacement.]

6 Motivating Examples: Combining Medical Studies
[Diagram: Population → i.i.d. sample → Disease Registry, Cohort Study, Clinical Trial, Health Survey → sampling without replacement.]

7 Two-Stage Formulation
[Diagram: Population $P_0$ → i.i.d. sample $V$ → Data Source 1 $\mathcal{V}^{(1)}$ and Data Source 2 $\mathcal{V}^{(2)}$ → sampling without replacement with $\pi(v) = n^{(1)}/N^{(1)}$ and $\pi(v) = n^{(2)}/N^{(2)}$ → Sample 1 $(X^{(1)}, V^{(1)})$ and Sample 2 $(X^{(2)}, V^{(2)})$.]
The issue: biased and dependent sample with duplicated selection
Biasedness
- Data sources of different sizes
- Overlapping data sources
Dependence
- (Across samples) Duplicated selection
- (Within samples) Sampling without replacement
Lack of identification of duplicated items
- Independent data collection across data sources

11 Resemblance to Other Frameworks
Stratified Sampling with Overlapping Strata
- (Practice) A single entity designs the entire sampling so that duplicated items can be identified
- (Method) The naive method produces bias from overlaps
- (Theory) The quantity of interest is decomposed into stratum means, and they are asymptotically independent due to the disjoint nature of strata.
Multiple-frame surveys: sampling from overlapping sampling frames
- (Practice) Applications are limited to survey sampling
- (Method) Hartley's estimator works well
- (Theory) Finite population framework
- (Theory) No empirical process theory
Meta-analysis
- (Practice) Does not cover overlaps in samples
Record Linkage: identification of duplications
- (Issue) Produces bias from wrong links and non-links
- (Theory) Requires a correctly specified model of linking errors

12 Comparison with Approaches in Sampling Theory
Columns: Finite Population | Super Population | Ours
Randomness: sampling from data sources; distribution on variables
Model on variables
Parameter: finite population parameter; parameter in the model
Dependence: within samples; across samples
Applications: sample mean; generalized linear model?; semiparametric model?
Asymptotics: LLN (?); CLT (?); U-LLN for a class of functions; U-CLT for a class of functions
Conditions: super population; design

13 Section 2 Empirical Process

15 Empirical Process Approach
The empirical process is a stochastic process that is very useful in semiparametric and nonparametric models.
Major tools for statistical theory:
- Uniform LLN and Uniform CLT
- Rate of convergence
- Concentration inequalities, etc.
Let $X_1, \ldots, X_n$ be i.i.d. $P$ taking values in $\mathcal{X}$. The empirical measure is defined as
$$\mathbb{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i},$$
where $\delta_x$ is a Dirac measure putting a unit mass at $x$. The empirical measure is a probability measure. The probability of the event $A \subset \mathcal{X}$ under $\mathbb{P}_n$ is
$$\mathbb{P}_n(A) = \frac{1}{n} \sum_{i=1}^n 1_A(X_i) = \frac{\#\{X_i : X_i \in A\}}{n},$$
and the expectation of $f(X)$ under $\mathbb{P}_n$ is
$$\mathbb{P}_n f = \frac{1}{n} \sum_{i=1}^n f(X_i).$$
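The definitions above can be sketched numerically. This is a minimal illustration only: the five-point sample, the event $A = (-\infty, 0]$, the function $f(x) = x^2$ and the helper names `empirical_prob` and `empirical_mean` are assumptions made for the example, not part of the slides.

```python
import numpy as np

def empirical_prob(sample, indicator):
    """P_n(A): fraction of sample points for which indicator(x) is True."""
    sample = np.asarray(sample, dtype=float)
    return float(np.mean(indicator(sample)))

def empirical_mean(sample, f):
    """P_n f: average of f over the sample."""
    sample = np.asarray(sample, dtype=float)
    return float(np.mean(f(sample)))

# Hypothetical sample of size n = 5
x = np.array([-1.0, 0.5, 2.0, -0.5, 1.0])

p_A = empirical_prob(x, lambda s: s <= 0.0)    # event A = (-inf, 0]
pn_f = empirical_mean(x, lambda s: s ** 2)     # f(x) = x^2
# The empirical measure puts total mass 1 on the whole space.
total_mass = empirical_prob(x, lambda s: np.ones_like(s, dtype=bool))
```

Here two of the five points fall in $A$, so `p_A` is 2/5, and `total_mass` equals 1 because $\mathbb{P}_n$ is a probability measure.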

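18 The empirical process indexed by the class $\mathcal{F}$ of functions on $\mathcal{X}$ is defined as
$$\mathbb{G}_n = \sqrt{n}(\mathbb{P}_n - P).$$
This is a stochastic process indexed by $\mathcal{F}$, i.e., given $f \in \mathcal{F}$, the following random variable is obtained:
$$\mathbb{G}_n f = \sqrt{n}(\mathbb{P}_n f - Pf) = \sqrt{n}\left( \frac{1}{n} \sum_{i=1}^n f(X_i) - Pf \right).$$
Here $Pf = E_P f(X)$ is the expectation of $f(X)$ under $P$.
Examples of index sets are
- $\mathcal{F} = \{ s \mapsto 1_{(-\infty, t]}(s) : t \in \mathbb{R} \}$, which yields $\mathbb{P}_n 1_{(-\infty, t]} = F_n(t)$
- $\mathcal{F} = \{ x \mapsto \log p_\theta(x) : \theta \in \Theta \}$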

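22 An important goal of modern empirical process theory is to provide uniform control of the sample average over a class of functions.
The class $\mathcal{F}$ of functions on $\mathcal{X}$ is called P-Glivenko-Cantelli if
$$\|\mathbb{P}_n - P\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |\mathbb{P}_n f - Pf| \to 0 \quad \text{in probability or almost surely.}$$
The class $\mathcal{F}$ of functions on $\mathcal{X}$ is called P-Donsker if
$$\mathbb{G}_n \rightsquigarrow \mathbb{G} \quad \text{in } \ell^\infty(\mathcal{F}),$$
where $\mathbb{G}$ is the P-Brownian bridge, a Gaussian process with covariance function
$$\rho_P(f, g) = \mathrm{Cov}_P(f(X), g(X)) = Pfg - (Pf)(Pg) \quad \text{for } f, g \in \mathcal{F}.$$
At $f, g \in \mathcal{F}$, this implies
$$\begin{pmatrix} \mathbb{G}_n f \\ \mathbb{G}_n g \end{pmatrix} \to_d \begin{pmatrix} \mathbb{G} f \\ \mathbb{G} g \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \rho_P(f, f) & \rho_P(f, g) \\ \rho_P(f, g) & \rho_P(g, g) \end{pmatrix} \right).$$
There exists a version of the Gaussian process with continuous sample paths. Here we further have asymptotic equicontinuity:
$$\sup_{\rho_P(f, g) < \delta} |\mathbb{G}_n(f - g)| = o_P(1), \quad \text{as } \delta \to 0.$$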
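The Glivenko-Cantelli property can be checked numerically for the class of indicators $1_{(-\infty, t]}$. The sketch below assumes Uniform(0, 1) data, so that $F(t) = t$; the seed, the sample sizes and the function name `sup_distance_uniform` are illustrative choices, not taken from the slides.

```python
import numpy as np

def sup_distance_uniform(n, rng):
    """sup_t |F_n(t) - F(t)| for n i.i.d. Uniform(0, 1) draws, where F(t) = t."""
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    # The supremum is attained at a jump of F_n, so it suffices to compare
    # F at each order statistic with the values of F_n just after and just before it.
    return float(max(np.max(i / n - x), np.max(x - (i - 1) / n)))

rng = np.random.default_rng(0)  # arbitrary seed
distances = {n: sup_distance_uniform(n, rng) for n in (100, 10_000, 1_000_000)}

# Rescaling by sqrt(n) gives sup_t |G_n 1_(-inf, t]|, which stays stochastically bounded.
rescaled = {n: np.sqrt(n) * d for n, d in distances.items()}
```

The unscaled distances shrink toward zero as $n$ grows (uniform LLN), while the rescaled ones remain of order one, in line with the Donsker property of this class.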

23 Why Empirical Process Theory?
We have enough tools already?
- Regularity conditions
- Calculus
- Law of Large Numbers (LLN)
- Central Limit Theorem (CLT)
- Martingale theory if you like

27 Motivating Example: Uniform LLN
M-estimator:
$$\hat\theta_n = \arg\max_{\theta \in \Theta} \mathbb{P}_n m(\theta)$$
Condition for consistency (van der Vaart 1998, Theorem 5.7):
$$\sup_{\theta \in \Theta} |\mathbb{P}_n m(\theta) - P m(\theta)| \to_P 0.$$
In some literature, it is claimed that under regularity conditions, the law of large numbers yields
$$\frac{1}{n} \sum_{i=1}^n \log p_{\hat\theta_n}(X_i) \to_{a.s.} E \log p_{\theta_0}(X) \quad (?)$$
The law of large numbers requires independent summands:
$$\frac{1}{n} \sum_{i=1}^n \underbrace{\log p_{\hat\theta_n}(X_i)}_{\text{Independent?}}$$
The sample $X_1, \ldots, X_n$ is independent, but $\hat\theta_n$ depends on $X_1, \ldots, X_n$. Hence $\log p_{\hat\theta_n}(X_1), \ldots, \log p_{\hat\theta_n}(X_n)$ are dependent.

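28 The Glivenko-Cantelli theorem and the dominated convergence theorem yield
$$\frac{1}{n} \sum_{i=1}^n \log p_{\hat\theta_n}(X_i) = \left\{ \frac{1}{n} \sum_{i=1}^n \log p_{\hat\theta_n}(X_i) - E \log p_{\hat\theta_n}(X) \right\} + E \log p_{\hat\theta_n}(X)$$
$$\le \sup_{\theta \in \Theta} \left| \frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i) - E \log p_\theta(X) \right| + E \log p_{\hat\theta_n}(X)$$
$$\to 0 + E \log p_{\theta_0}(X).$$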

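30 Motivating Example: Asymptotic Equicontinuity
The MLE solves the likelihood equation $(1/n) \sum_{i=1}^n \dot\ell_{\hat\theta_n}(X_i) = 0$. For asymptotic normality, Taylor's theorem from calculus yields
$$0 = \frac{1}{n} \sum_{i=1}^n \dot\ell_{\hat\theta_n}(X_i) = \frac{1}{n} \sum_{i=1}^n \dot\ell_{\theta_0}(X_i) + \frac{1}{n} \sum_{i=1}^n \ddot\ell_{\tilde\theta_n}(X_i)(\hat\theta_n - \theta_0),$$
where $\tilde\theta_n$ is an intermediate point between $\hat\theta_n$ and $\theta_0$. Hence we can apply LLN and CLT to obtain
$$\sqrt{n}(\hat\theta_n - \theta_0) = -\left( \frac{1}{n} \sum_{i=1}^n \ddot\ell_{\tilde\theta_n}(X_i) \right)^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n \dot\ell_{\theta_0}(X_i) \to_d X \sim N(0, I^{-1}).$$
Rubin-Bleuer and Kratina (Annals of Statistics, 2005) adopted a two-phase framework for estimating a Euclidean parameter from estimating equations:
$$\sqrt{N}(\hat\theta_N - \theta_0) = \underbrace{\sqrt{N}(\hat\theta_N - \theta_N)}_{\text{Asymptotic normality from design conditions}} + \underbrace{\sqrt{N}(\theta_N - \theta_0)}_{\text{Asymptotic normality from superpopulation condition}}$$
where $\hat\theta_N$ is a solution to weighted estimating equations and $\theta_N$ is a solution of unweighted estimating equations.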

32 The previous argument works for many parametric models. In survival analysis, however, semiparametric models play a pivotal role in determining effects of treatments and risk factors.
A semiparametric model is a collection of probability measures indexed by a finite-dimensional parameter and an infinite-dimensional parameter.
An example is the Cox proportional hazards model with regression parameter $\beta \in \mathbb{R}^d$ and the cumulative hazard function $\Lambda$ in the class of positive increasing functions. The conditional hazard function given covariates $X = x$ is
$$\lambda(t \mid x) = \lambda_0(t) \exp(x^T \beta).$$
The following is the score for $\beta$ from the likelihood for the Cox model with current status data. Can you use the Taylor expansion around $\theta_0 = (\beta_0, \Lambda_0)$ as usual?
$$\dot\ell_\beta(\theta) = \frac{1}{n} \sum_{i=1}^n X_i e^{\beta^T X_i} \Lambda(Y_i) \left\{ \Delta_i \frac{e^{-e^{\beta^T X_i} \Lambda(Y_i)}}{1 - e^{-e^{\beta^T X_i} \Lambda(Y_i)}} - (1 - \Delta_i) \right\}$$

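33 Suppose the asymptotic equicontinuity condition holds:
$$\sqrt{n}\left\{ \frac{1}{n} \sum_{i=1}^n \dot\ell_{\hat\theta_n}(X_i) - E \dot\ell_{\hat\theta_n}(X) \right\} - \sqrt{n}\left\{ \frac{1}{n} \sum_{i=1}^n \dot\ell_{\theta_0}(X_i) - E \dot\ell_{\theta_0}(X) \right\} = o_P(1 + \sqrt{n}|\hat\theta_n - \theta_0|).$$
Since $(1/n) \sum_{i=1}^n \dot\ell_{\hat\theta_n}(X_i) = 0$ and $E \dot\ell_{\theta_0}(X) = 0$, it follows that
$$-\sqrt{n}\left( E \dot\ell_{\hat\theta_n}(X) - E \dot\ell_{\theta_0}(X) \right) = \sqrt{n}\left\{ \frac{1}{n} \sum_{i=1}^n \dot\ell_{\hat\theta_n}(X_i) - E \dot\ell_{\hat\theta_n}(X) \right\}$$
$$= \sqrt{n}\left\{ \frac{1}{n} \sum_{i=1}^n \dot\ell_{\theta_0}(X_i) - E \dot\ell_{\theta_0}(X) \right\} + o_P(1 + \sqrt{n}|\hat\theta_n - \theta_0|).$$
If $\theta \mapsto E \dot\ell_\theta(X)$ is differentiable at $\theta_0$ and $\sqrt{n}|\hat\theta_n - \theta_0| = O_P(1)$, we obtain
$$-\sqrt{n}\, E \ddot\ell_{\theta_0}(X)(\hat\theta_n - \theta_0) = \sqrt{n}\left\{ \frac{1}{n} \sum_{i=1}^n \dot\ell_{\theta_0}(X_i) - E \dot\ell_{\theta_0}(X) \right\} + o_P(1).$$
For semiparametric models, the derivative $E \ddot\ell_{\theta_0}$ is replaced by the functional derivative, and the number of likelihood equations becomes infinite.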

35 Motivating Example: Martingale
Cox's partial likelihood score can be written as
$$\sum_{i=1}^n \int_0^\tau \underbrace{H_i(s)}_{\text{Predictable process}} \, d\underbrace{M_i(s)}_{\text{Martingale}}$$
to which the martingale central limit theorem applies.
In the analysis of complex sampling data where sampling depends on the event, we analyze the inverse probability weighted partial likelihood score
$$\sum_{i=1}^n \int_0^\tau \underbrace{W_i}_{\text{Not predictable}} \underbrace{H_i(s)}_{\text{Predictable process}} \, d\underbrace{M_i(s)}_{\text{Martingale}}$$
so that the martingale CLT does not apply.
Both the martingale CLT and empirical process approaches must address dependence issues from complex sampling, but the former intrinsically fails to address sampling that depends on events, even if dependence can be addressed.

36 Some Literature on the Cox Model in Sampling Theory
D.Y. Lin, On fitting Cox's proportional hazards models to survey data, Biometrika 87 (2000)
- The paper simply cites Andersen and Gill (Annals of Statistics 10(4)) for consistency, but too many difficulties are left to the reader (martingale, LLN, etc.)
- The paper simply assumes the existence of the U-CLT a priori.
S. Rubin-Bleuer, The proportional hazards model for survey data from independent and clustered super-populations, Journal of Multivariate Analysis 102 (2011)
- Most parts assume that sampling does not depend on the event, so that the martingale CLT can be used
- Consistency results rely on K.H. Yuan and R. Jennrich (J. Multivariate Anal. 65, 1998), where the uniform LLN is assumed; this condition is not verified in the paper.
- The last part, where sampling depends on the event, relies on Lin (2000).

37 Some Literature on Empirical Process Theory on Complex Surveys
Bertail, P., Chautru, E. and Clémençon, S. (2017). Empirical processes in survey sampling with (conditional) Poisson designs. Scand. J. Stat.
- Rejective sampling
- U-LLN is not obtained
Boistard, H., Lopuhaä, H. P. and Ruiz-Gazen, A. (2017). Functional central limit theorems for single-stage sampling designs. Ann. Statist.
- Single-stage sampling in a general way
- Finite-dimensional CLT is assumed
- U-LLN is not obtained
- The class of functions is restricted to indicator functions of variables less than some number
Daniel Bonnéry, F. Jay Breidt, and François Coquet, Uniform convergence of the empirical cumulative distribution function under informative selection from a finite population
- The class of functions is restricted to indicator functions of variables less than some number

38 Section 3 Our Approach

39 Setting (2 Data Sources)
- $V$: auxiliary variables available for all items; $V_1, \ldots, V_N$ i.i.d.
- $\mathcal{V}^{(j)}$, $j = 1, 2$: sampling frames of size $N^{(j)}$ ($\mathcal{V}^{(1)} \cup \mathcal{V}^{(2)} = \mathcal{V}$)
- $V \in \mathcal{V}^{(j)}$ means the item belongs to source $j$
- $n^{(j)}$ items are selected without replacement from source $j$
- $R^{(j)}$, $j = 1, 2$: sampling indicator from source $j$ (the $R^{(j)}_i$'s with $V_i \in \mathcal{V}^{(j)}$ are dependent)
- $\pi^{(j)}(v)$: sampling probability from source $j$ (e.g. $\pi^{(1)}(v) = (n^{(1)}/N^{(1)}) I_{\mathcal{V}^{(1)}}(v)$)
- $n^{(j)}/N^{(j)} \to p^{(j)} > 0$.
- $X$: only available on selected items
[Diagram: the two-stage sampling scheme of slide 7.]

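41 Canonical Estimator: Hartley's Estimator
Solution to duplicated selection: Hartley's estimator (1962, 1974)
Reweighting function $\rho : \mathcal{V} \to \mathbb{R}^2$,
$$\rho(v) = \left( \rho^{(1)}(v), \rho^{(2)}(v) \right) = \begin{cases} (1, 0) & \text{if } v \in \mathcal{V}^{(1)} \cap (\mathcal{V}^{(2)})^c \\ (0, 1) & \text{if } v \in (\mathcal{V}^{(1)})^c \cap \mathcal{V}^{(2)} \\ (c_1, c_2) & \text{if } v \in \mathcal{V}^{(1)} \cap \mathcal{V}^{(2)} \end{cases}$$
where $c_1 + c_2 = 1$.
Hartley's estimator of $\bar{X} = (1/N) \sum_{i=1}^N X_i$ is
$$\frac{1}{N} \sum_{i=1}^N \left\{ \underbrace{\frac{R^{(1)}_i}{\pi^{(1)}(V_i)}}_{\text{Inverse probability weighting for source 1}} \underbrace{\rho^{(1)}(V_i)}_{\text{Reweighting for source 1}} + \underbrace{\frac{R^{(2)}_i}{\pi^{(2)}(V_i)}}_{\text{Inverse probability weighting for source 2}} \underbrace{\rho^{(2)}(V_i)}_{\text{Reweighting for source 2}} \right\} X_i$$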

42 Unbiasedness:
- Biased sampling: $E[R^{(j)} \mid V, X] = \pi^{(j)}(V)$.
- $\rho^{(1)}(v) + \rho^{(2)}(v) = 1$ for every $v$.
No identification of duplicated items:
$$\underbrace{\frac{1}{N} \sum_{i=1}^N \frac{R^{(1)}_i}{\pi^{(1)}(V_i)} \rho^{(1)}(V_i) X_i}_{\text{computed from sample from source 1}} + \underbrace{\frac{1}{N} \sum_{i=1}^N \frac{R^{(2)}_i}{\pi^{(2)}(V_i)} \rho^{(2)}(V_i) X_i}_{\text{computed from sample from source 2}}$$
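A small simulation can illustrate Hartley's estimator and its unbiasedness under the two-source setting. This is a sketch under stated assumptions: the population model, the source definitions `in1` and `in2`, the sample sizes, the constant overlap weight `c1` and the helper names are all hypothetical choices for illustration, not the simulation design of the talk.

```python
import numpy as np

def hartley_mean(x, in1, in2, r1, r2, c1=0.5):
    """Hartley's estimator of the mean (1/N) * sum_i x_i.

    in1, in2 : boolean membership of each item in source 1 / source 2
    r1, r2   : boolean selection indicators R_i^(1), R_i^(2)
    c1       : weight given to source 1 on the overlap (source 2 gets 1 - c1)
    """
    N = len(x)
    pi1 = r1.sum() / in1.sum()   # n^(1) / N^(1)
    pi2 = r2.sum() / in2.sum()   # n^(2) / N^(2)
    both = in1 & in2
    rho1 = np.where(both, c1, in1.astype(float))
    rho2 = np.where(both, 1.0 - c1, in2.astype(float))
    w = r1 * rho1 / pi1 + r2 * rho2 / pi2
    return float(np.sum(w * x) / N)

def draw_without_replacement(member, n, rng):
    """Select n items without replacement among the items with member == True."""
    r = np.zeros(len(member), dtype=bool)
    r[rng.choice(np.flatnonzero(member), size=n, replace=False)] = True
    return r

rng = np.random.default_rng(1)           # arbitrary seed
N = 2000
v = rng.standard_normal(N)               # auxiliary variable V
x = 1.0 + v + rng.standard_normal(N)     # variable of interest X
in1 = v < 0.5                            # hypothetical source 1
in2 = v > -0.5                           # hypothetical source 2 (overlaps source 1)

estimates = []
for _ in range(500):
    r1 = draw_without_replacement(in1, 300, rng)
    r2 = draw_without_replacement(in2, 300, rng)
    estimates.append(hartley_mean(x, in1, in2, r1, r2, c1=0.5))

mc_mean = float(np.mean(estimates))      # Monte Carlo average of the estimator
finite_pop_mean = float(x.mean())        # target (1/N) * sum_i X_i
```

Each source contributes its own term, so the function never needs to know which selected items appear in both samples. Averaged over repeated draws, `mc_mean` should be close to `finite_pop_mean`.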

43 Hartley-Type Empirical Process
The empirical measure is replaced by Hartley's estimator of the distribution function.
Define the Hartley-type inverse probability weighted (H-IPW) empirical measure by
$$\mathbb{P}^H_N = \frac{1}{N} \sum_{i=1}^N \left( \frac{R^{(1)}_i}{\pi^{(1)}(V_i)} \rho^{(1)}(V_i) + \frac{R^{(2)}_i}{\pi^{(2)}(V_i)} \rho^{(2)}(V_i) \right) \delta_{X_i}$$
Note that this is NOT a probability measure:
$$\mathbb{P}^H_N 1 = \frac{1}{N} \sum_{i=1}^N \left( \frac{R^{(1)}_i}{\pi^{(1)}(V_i)} \rho^{(1)}(V_i) + \frac{R^{(2)}_i}{\pi^{(2)}(V_i)} \rho^{(2)}(V_i) \right) \ne 1$$
in general, in contrast to $\mathbb{P}_n 1 = 1$.
Define the H-IPW empirical process by
$$\mathbb{G}^H_N = \sqrt{N}(\mathbb{P}^H_N - P).$$

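46 Decomposition of H-Empirical Process
Key Idea 1: Decompose the H-empirical process into different sampling stages (Stage 1 + Stage 2):
$$\mathbb{G}^H_N f = \sqrt{N}(\mathbb{P}^H_N - P)f + \sqrt{N}(\mathbb{P}_N - \mathbb{P}_N)f = \sqrt{N}(\mathbb{P}_N - P)f + \sqrt{N}(\mathbb{P}^H_N - \mathbb{P}_N)f$$
$$= \underbrace{\mathbb{G}_N f}_{\text{Sampling from population}} + \underbrace{\sqrt{N}(\mathbb{P}^H_N - \mathbb{P}_N)f}_{\text{Sampling from data sources}}$$
It can be shown that $\mathbb{G}_N f$ and $\sqrt{N}(\mathbb{P}^H_N - \mathbb{P}_N)f$ are uncorrelated.
If the latter processes converge to a Gaussian process, the limiting process of $\mathbb{G}^H_N$ is a sum of independent Gaussian processes.
By the Donsker theorem, $\mathbb{G}_N \rightsquigarrow \mathbb{G}$.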

48 Key Idea 2: View sampling from sources as a single realization of the m out of n without-replacement bootstrap with $m = n^{(j)}$ and $n = N^{(j)}$.
The average within data source $j$ before sampling,
$$\mathbb{P}^{(j)}_{N^{(j)}} f = \frac{1}{N^{(j)}} \sum_{i : V_i \in \mathcal{V}^{(j)}} f(X_i),$$
plays the role of the sample average in a bootstrap framework.
The average within data source $j$ after sampling,
$$\hat{\mathbb{P}}^{(j)}_{n^{(j)}} f = \frac{1}{n^{(j)}} \sum_{i : V_i \in \mathcal{V}^{(j)}} R^{(j)}_i f(X_{(j),i}),$$
plays the role of the bootstrap sample average in a bootstrap framework.
Sampling from each source:
$$\sqrt{N}(\mathbb{P}^H_N - \mathbb{P}_N)f = \sum_{j=1}^2 \sqrt{\frac{N^{(j)}}{N}} \sqrt{N^{(j)}} \left( \hat{\mathbb{P}}^{(j)}_{n^{(j)}} - \mathbb{P}^{(j)}_{N^{(j)}} \right) \rho^{(j)} f$$
with reindexing $X_{(j),i}$, $i = 1, \ldots, N^{(j)}$, $j = 1, 2$.
It can be shown that $\mathbb{G}_N$ and $\sqrt{N^{(j)}/N} \sqrt{N^{(j)}} ( \hat{\mathbb{P}}^{(j)}_{n^{(j)}} - \mathbb{P}^{(j)}_{N^{(j)}} )$ are all uncorrelated.

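49 Theorem (Uniform Law of Large Numbers)
Suppose the class $\mathcal{F}$ of measurable functions is P-Glivenko-Cantelli. Then
$$\|\mathbb{P}^H_N - P\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |\mathbb{P}^H_N f - Pf| \to_P 0.$$
Theorem (Uniform Central Limit Theorem)
Suppose the class $\mathcal{F}$ of measurable functions is P-Donsker. Then
$$\mathbb{G}^H_N \rightsquigarrow \mathbb{G} + \sum_{j=1}^2 \sqrt{P(V \in \mathcal{V}^{(j)})} \sqrt{\frac{1 - p^{(j)}}{p^{(j)}}} \, \mathbb{G}^{(j)} \rho^{(j)},$$
where the P-Brownian bridge $\mathbb{G}$ and the $P^{(j)}$-Brownian bridges $\mathbb{G}^{(j)}$ are all independent. Here $P^{(j)}$ is the conditional probability measure given membership in source $j$.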

50 Implications
Asymptotic distribution of $\mathbb{G}^H_N f$:
$$\mathbb{G}^H_N f \to_d Z^H \sim N(0, \Sigma^H)$$
with
$$\Sigma^H = \underbrace{\mathrm{Var}(f(X))}_{\text{Population variance}} + \underbrace{\sum_{j=1}^2 P(V \in \mathcal{V}^{(j)}) \frac{1 - p^{(j)}}{p^{(j)}} \mathrm{Var}\left\{ \rho^{(j)}(V) f(X) \mid V \in \mathcal{V}^{(j)} \right\}}_{\text{Design variance}}.$$
The optimal $\rho^{(j)}$ and optimal calibration (Deville and Särndal, JASA, 1992) can be derived from this variance formula.
- (Our approach) Optimality based on the limiting distribution of $\mathbb{G}^H_N f = \sqrt{N}(\mathbb{P}^H_N - P)f$.
- (Finite population approach; Lohr and Rao, Estimation in multiple-frame surveys, JASA, 2006, 101) Optimality based on the variable $\mathbb{P}^H_N f$.
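The variance formula can be evaluated numerically by replacing each population moment with a large-sample average. The sketch below is illustrative only: it assumes a constant weight `c1` on the overlap, hypothetical sampling fractions 0.2 and 0.4, and a simple grid search in place of the analytical derivation of the optimal $\rho^{(j)}$ mentioned above; the function name is also an assumption.

```python
import numpy as np

def hartley_asymptotic_variance(f, in1, in2, p1, p2, c1):
    """Moment-based evaluation of
    Sigma^H = Var f + sum_j P(V in V^(j)) * (1 - p_j) / p_j * Var(rho^(j) f | V in V^(j))."""
    both = in1 & in2
    rho1 = np.where(both, c1, in1.astype(float))
    rho2 = np.where(both, 1.0 - c1, in2.astype(float))
    sigma = np.var(f)                                   # population variance term
    for member, rho, p in ((in1, rho1, p1), (in2, rho2, p2)):
        # design variance term contributed by one data source
        sigma += member.mean() * (1.0 - p) / p * np.var((rho * f)[member])
    return float(sigma)

rng = np.random.default_rng(2)                          # arbitrary seed
v = rng.standard_normal(5000)
f = 1.0 + v + rng.standard_normal(5000)                 # values f(X_i)
in1, in2 = v < 0.5, v > -0.5                            # hypothetical sources

grid = np.linspace(0.0, 1.0, 21)
sigmas = [hartley_asymptotic_variance(f, in1, in2, 0.2, 0.4, c) for c in grid]
best_c1 = float(grid[int(np.argmin(sigmas))])           # grid minimizer over c1
```

When both sampling fractions equal 1 the design terms vanish and only the population variance remains, which gives a quick sanity check on the implementation.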

52 Calibration
General idea (Deville and Särndal, JASA 1992): Improve the Horvitz-Thompson estimator of $\mathbb{P}_N X = (1/N) \sum_{i=1}^N X_i$ by using the following relationship:
$$\underbrace{\mathbb{P}_N V = \frac{1}{N} \sum_{i=1}^N V_i}_{\text{Sample average}} \approx \underbrace{\frac{1}{N} \sum_{i=1}^N \frac{R_i}{\pi(V_i)} V_i}_{\text{Horvitz-Thompson estimator}}$$
For multiple-frame sampling, Ranalli et al. (2018) use several different relationships to improve the estimation of $\mathbb{P}_N X$. One of them is the following:
$$\underbrace{\mathbb{P}_N V}_{\text{Sample average}} \approx \underbrace{\mathbb{P}^H_N V}_{\text{Hartley's estimator}}$$
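As a rough illustration of the general idea, the sketch below implements the linear (chi-square distance) form of calibration for a single source: design weights $d_i = 1/\pi(V_i)$ are adjusted to $w_i = d_i(1 + a_i^T \lambda)$ so that the weighted total of the auxiliary vector $a_i$ matches its known population total exactly. The population model, the sample size and the function name `linear_calibration` are assumptions for the example; this is neither the multiple-frame version of Ranalli et al. nor the calibration proposed in the talk.

```python
import numpy as np

def linear_calibration(d, aux, totals):
    """Calibrated weights w_i = d_i * (1 + aux_i @ lam) satisfying sum_i w_i aux_i = totals."""
    d = np.asarray(d, dtype=float)
    aux = np.asarray(aux, dtype=float)                   # shape (n, k)
    gap = np.asarray(totals, dtype=float) - d @ aux      # known totals minus HT totals
    lam = np.linalg.solve(aux.T @ (d[:, None] * aux), gap)
    return d * (1.0 + aux @ lam)

rng = np.random.default_rng(3)                           # arbitrary seed
N, n = 1000, 100
v = rng.standard_normal(N)                               # auxiliary variable, known for all items
x = 2.0 + 3.0 * v + rng.standard_normal(N)               # observed only on the sample
s = rng.choice(N, size=n, replace=False)                 # simple random sample without replacement

d = np.full(n, N / n)                                    # inverse sampling probabilities
aux = np.column_stack([np.ones(n), v[s]])                # calibrate on (1, V)
w = linear_calibration(d, aux, [N, v.sum()])

ht_mean = float(d @ x[s] / N)                            # Horvitz-Thompson estimate of P_N X
cal_mean = float(w @ x[s] / N)                           # calibrated estimate of P_N X
```

After calibration the weighted sample reproduces both $N$ and $\sum_i V_i$. When $X$ is strongly related to $V$, as in this synthetic population, the calibrated estimate typically varies less than the Horvitz-Thompson estimate.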

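55 Another method considered by Ranalli et al. (2018) uses the relationship given by
$$\underbrace{\mathbb{P}_N V I\{V \in \mathcal{V}^{(j)}\}}_{\text{Sample average in source } j} \approx \underbrace{\mathbb{P}^H_N V I\{V \in \mathcal{V}^{(j)}\}}_{\text{Hartley's estimator in source } j}$$
Our approach:
$$\underbrace{\frac{1}{N^{(j)}} \sum_{i : V_i \in \mathcal{V}^{(j)}} \rho^{(j)}(V_i) V_i I\{V_i \in \mathcal{V}^{(j)}\}}_{\text{Sample average in source } j} \approx \underbrace{\frac{1}{N^{(j)}} \sum_{i : V_i \in \mathcal{V}^{(j)}} \frac{R^{(j)}_i}{\pi^{(j)}(V_i)} \rho^{(j)}(V_i) V_i}_{\text{Horvitz-Thompson estimator in source } j}$$

Method | Ours | Ranalli (2)
Which variables? | $\rho^{(1)}(V)V$ | $V$ in source 1
Which variables? | $\rho^{(2)}(V)V$ | $V$ in source 2
Where do the variables come from? | Sampling from source 1 | Both samplings
Where do the variables come from? | Sampling from source 2 | Both samplings
What is computed | Horvitz-Thompson | Hartley
What is computed | Horvitz-Thompson | Hartley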

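56 General semiparametric model
Semiparametric model: $X \sim P_{\theta, \eta} \in \mathcal{P}$
- (parametric) $\theta \in \Theta \subset \mathbb{R}^p$
- (nonparametric) $\eta \in H \subset \mathcal{B}$, $\mathcal{B}$ a Banach space
Scores $\dot\ell_{\theta, \eta}$ for $\theta$ and $B_{\theta, \eta} h$ for $\eta$, with $h \in \mathcal{H}$, a Hilbert space
Efficient influence function $\tilde\ell_0$
Semiparametric efficiency bound $I_0^{-1} = E \tilde\ell_0^{\otimes 2}$
Assumptions (for complete data)
- Smoothness of the model
- Asymptotic equicontinuity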

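58 Hartley-type Weighted Likelihood Estimator (H-WLE)
The MLE $(\hat\theta, \hat\eta)$ for complete data is obtained from the likelihood equations:
$$\frac{1}{N} \sum_{i=1}^N \dot\ell_{\hat\theta, \hat\eta}(X_i) = o_P(N^{-1/2}), \qquad \frac{1}{N} \sum_{i=1}^N B_{\hat\theta, \hat\eta} h(X_i) = o_P(N^{-1/2}).$$
The WLE $(\hat\theta_N, \hat\eta_N)$ is obtained from the weighted likelihood equations:
$$\frac{1}{N} \sum_{i=1}^N \left\{ \frac{R^{(1)}_i}{\pi^{(1)}(V_i)} \rho^{(1)}(V_i) + \frac{R^{(2)}_i}{\pi^{(2)}(V_i)} \rho^{(2)}(V_i) \right\} \dot\ell_{\hat\theta_N, \hat\eta_N}(X_i) = o_P(N^{-1/2}),$$
$$\frac{1}{N} \sum_{i=1}^N \left\{ \frac{R^{(1)}_i}{\pi^{(1)}(V_i)} \rho^{(1)}(V_i) + \frac{R^{(2)}_i}{\pi^{(2)}(V_i)} \rho^{(2)}(V_i) \right\} B_{\hat\theta_N, \hat\eta_N} h(X_i) = o_P(N^{-1/2}).$$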

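59 Theorem (H-WLE for Data Integration)
Assume the WLEs are consistent, and $\|\hat\eta_N - \eta_0\| = O_P(N^{-\beta})$. Then
$$\sqrt{N}(\hat\theta_N - \theta_0) \to_d Z \sim N(0, \Sigma),$$
and
$$\Sigma = \underbrace{I_0^{-1}}_{\text{Superpopulation variance}} + \underbrace{\sum_{j=1}^J P(V \in \mathcal{V}^{(j)}) \frac{1 - p_j}{p_j} \mathrm{Var}\left( \rho^{(j)}(V) \tilde\ell_0 \mid V \in \mathcal{V}^{(j)} \right)}_{\text{Design variances}}.$$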

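60 Variance Estimation: Plug-in Estimator
For the sampling fraction and the data source probability:
$$\hat{p}_j = \frac{n^{(j)}}{N^{(j)}}, \qquad \hat{P}(V \in \mathcal{V}^{(j)}) = \frac{N^{(j)}}{N}$$
For the inverse of the efficient information $I_0^{-1} = E \tilde\ell_0^{\otimes 2}$:
$$\hat{I}_0^{-1} = \mathbb{P}^H_N \tilde\ell^{\otimes 2}_{\hat\theta_N, \hat\eta_N}$$
For the variance from data sources:
$$\widehat{\mathrm{Var}}(\rho^{(j)} \tilde\ell_0 \mid \mathcal{V}^{(j)}) = \frac{1}{N^{(j)}} \sum_{i=1}^{N^{(j)}} \frac{R^{(j)}_{(j),i}}{\pi^{(j)}(V_{(j),i})} \left\{ \rho^{(j)}(V_{(j),i}) \tilde\ell_{\hat\theta_N, \hat\eta_N}(X_{(j),i}) \right\}^{\otimes 2} - \left\{ \frac{1}{N^{(j)}} \sum_{i=1}^{N^{(j)}} \frac{R^{(j)}_{(j),i}}{\pi^{(j)}(V_{(j),i})} \rho^{(j)}(V_{(j),i}) \tilde\ell_{\hat\theta_N, \hat\eta_N}(X_{(j),i}) \right\}^{\otimes 2}$$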

61 Section 4 Numerical Study

62 Simulation for Cox Model with Right Censoring
- $T \sim \mathrm{Weibull}(\alpha, \beta)$: time to event
- $C \sim \mathrm{Uniform}(0, c)$: censoring variable
- $Y = \min\{T, C\}$, $\Delta = I\{T \le C\}$
- Covariates $Z_1 \sim \mathrm{Bernoulli}(1/2)$, $Z_2 \sim N(0, 1)$
- $Z_1$ is collected only in the final combined sample
- Data sources $\mathcal{V}^{(j)}$ are created from $V = (Y, \Delta, Z_2)$
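A data-generating step of this kind might be sketched as follows. Several details are assumptions, since the slides do not state them: the Weibull parameterization (baseline cumulative hazard `scale * t**shape`), the way covariates enter through the proportional hazards form, the parameter values in the example call, and the function name `simulate_cox_weibull`.

```python
import numpy as np

def simulate_cox_weibull(n, theta, shape, scale, c_max, rng):
    """Right-censored data from a Cox model with an assumed Weibull baseline hazard."""
    z1 = rng.binomial(1, 0.5, size=n)                 # Z_1 ~ Bernoulli(1/2)
    z2 = rng.standard_normal(n)                       # Z_2 ~ N(0, 1)
    lin = theta[0] * z1 + theta[1] * z2               # linear predictor
    # Assumed baseline cumulative hazard: Lambda_0(t) = scale * t**shape.
    # Inverting S(t | z) = exp(-Lambda_0(t) * exp(lin)) at a uniform draw gives T.
    u = rng.uniform(size=n)
    t = (-np.log1p(-u) / (scale * np.exp(lin))) ** (1.0 / shape)
    c = rng.uniform(0.0, c_max, size=n)               # C ~ Uniform(0, c_max)
    y = np.minimum(t, c)                              # Y = min(T, C)
    delta = (t <= c).astype(int)                      # Delta = I{T <= C}
    return y, delta, z1, z2

rng = np.random.default_rng(4)                        # arbitrary seed
# Hypothetical parameter values chosen only for illustration.
y, delta, z1, z2 = simulate_cox_weibull(
    n=500, theta=(np.log(2.0), np.log(2.0)), shape=0.5, scale=0.2, c_max=10.0, rng=rng
)
```

The data sources would then be defined as subsets of the values of $V = (Y, \Delta, Z_2)$, with $Z_1$ treated as observed only for selected items.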

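63 [Table: for Scenarios 1 to 3, columns $\mathcal{V}^{(1)}$, $\mathcal{V}^{(2)}$, $N$, $N^{(1)}$, $N^{(2)}$, $n^{(1)}$, $n^{(2)}$, Duplication; for Scenario 4, columns $N$, $N^{(1)}$, $N^{(2)}$, $N^{(3)}$, $n^{(1)}$, $n^{(2)}$, $n^{(3)}$, Duplication (twice, thrice). Table entries are not recoverable from the source.]
Table: Sample sizes and the numbers of duplications based on 2000 simulated datasets. In Scenario 4, $\mathcal{V}^{(1)} = \{\Delta = 1\}$, and membership in $\mathcal{V}^{(2)} \cap \{\mathcal{V}^{(3)}\}^c$, $\mathcal{V}^{(2)} \cap \mathcal{V}^{(3)}$, and $\{\mathcal{V}^{(2)}\}^c \cap \mathcal{V}^{(3)}$ is determined via multinomial logistic regression on $Z_2$.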

64 [Table: Bias, SD and SEE for $\theta_1$ and $\theta_2$ by $N$ under Scenarios 1 to 4, with column groups for $\theta_1 = \theta_2$ (log 2, 0). Table entries are not recoverable from the source.]
Table: Bias, absolute Monte Carlo sample bias; SD, Monte Carlo sample standard deviation; SEE, average of the plug-in estimator of the standard error.

65 [Table: $(\alpha, \beta) = (.2, .5)$; column groups for $N = 500$ and a second value of $N$, each with columns w/o, SC, C, DC; rows MLE, S, SF, B for $\theta_1 = \log(2)$ and for $\theta_2 = \log(2)$. Table entries are not recoverable from the source.]
Note: S, the proposed weights; SF, $\rho$ for a single-frame estimator; B, balanced weights; w/o, no calibration; SC, the proposed calibration; C, the standard calibration; DC, the data-source-specific calibration. All calibrations use U and Y.

66 National Wilms Tumor Study
Complete information is available for comparison of designs
- N = 1957
- Histology is determined in the final sample
- Event = Relapse of Wilms Tumor
Data integration (n = 506 with 68 duplications)
- Data Source 1: Death (all sampled)
- Data Source 2: Unfavorable Histology (50% sampled)
- Data Source 3: Entire Cohort (10% sampled)
Stratified Sample (n = 502)
- Stratum 1: Death (all sampled)
- Stratum 2: Alive with Unfavorable Histology (50% sampled)
- Stratum 3: the rest (14% sampled)

67 [Table: columns Full cohort | Data integration ($\rho$: Proposed, Balanced) | Stratified sampling. # subjects: data integration (506 with duplication), stratified sampling 502. Duplication: full cohort 0; data integration 64 (twice), 2 (thrice); stratified sampling 0. Partial likelihood estimates $\hat\theta$ and SE are reported for the variables Histology, Age, Stage (III/IV), Tumor, Stage Tumor. Remaining table entries are not recoverable from the source.]
Note: Histology is measured at a central laboratory.
Table: Point estimates and estimated standard errors in the analysis of the NWTS study with different sampling schemes. Proposed means results for the estimator with the proposed $\rho^{(j)}$, and Balanced means results for the estimator with the value for $\rho^{(j)}$ across sources.

68 Section 5 Discussion

69 Discussion points:
- Empirical process theory is shown to extend to data integration problems.
- Empirical process theory is a powerful tool for studying semiparametric models under complex surveys.
- Extensions of the discussion above to more than 2 sources, and to stratified sampling from each source as an alternative to sampling without replacement, are straightforward.
- Other sampling designs to combine different data sources are to be investigated.
- Many methods proposed in the i.i.d. setting should be modified to accommodate sampling procedures.

70 Thank you!
