Large sample theory for merged data from multiple sources

Size: px
Start display at page:

Download "Large sample theory for merged data from multiple sources"

Transcription

1 Large sample theory for merged data from multiple sources Takumi Saegusa University of Maryland Division of Statistics August

2 Section 1 Introduction

3 Problem: Data Integration Massive data are collected from various sources: online surveys social networks business transactions sensor networks scientific research (in a public health setting) disease registry, clinical trials, epidemiological studies, health surveys, hospital records, healthcare databases, etc. etc. The representativeness of a sample critically depends on technology for data collection A remedy: to merge multiple data sets with different coverage

4 Motivating Examples: Telephone Surveys Cell Phone Users (Data Source 1) Duplicated Selection Sample 1 Population i.i.d. Sample Landline Users (Data Source 2) Sample 2 Sampling without Replacement

5 Motivating Examples: Study on Rare Populations Maori People (Data Source 1) Population i.i.d. Sample Sampling without Replacement Auckland Population (Data Source 2)

6 Motivating Examples: Combining Medical Studies Disease Registry Cohort Study Clinical Trial Population i.i.d. Sample Health Survey Sampling without Replacement

7 Two-Stage Formulation i.i.d. Sample Sampling Data Source 1 V (1) without Replacement Sample 1 ( X (1),V (1) ) Population P 0 π(v )=n (1) / N (1) Data Source 2 V (2) π(v )=n (2) / N (2) Sample 2 ( X (2),V (2) ) V X,V The issue: Biased and Dependent Sample with Duplicated Selection Biasedness Data Sources of Different Sizes Overlapping Data Sources Dependence (Across Samples) Duplicated Selection (Within Samples) Sampling without Replacement Lack of Identification of Duplicated Items Independent Data Collection across Data Sources

8 Resemblance to Other Frameworks Stratified Sampling with Overlapping Strata (Practice) Single entity designs the entire sampling so that duplicated items can be identified (Method) Naive method produces bias from overlaps (Theory) The quantity of interest is decomposed into stratum means, and they are asymptotically independent due to the disjoint nature of strata.

9 Resemblance to Other Frameworks Stratified Sampling with Overlapping Strata (Practice) Single entity designs the entire sampling so that duplicated items can be identified (Method) Naive method produces bias from overlaps (Theory) The quantity of interest is decomposed into stratum means, and they are asymptotically independent due to the disjoint nature of strata. Multiple-frame surveys: sampling from overlapping sampling frames (Practice) Applications are limited to survey sampling (Method) Hartley s estimator works well (Theory) Finite population framework (Theory) No empirical process theory

10 Resemblance to Other Frameworks Stratified Sampling with Overlapping Strata (Practice) Single entity designs the entire sampling so that duplicated items can be identified (Method) Naive method produces bias from overlaps (Theory) The quantity of interest is decomposed into stratum means, and they are asymptotically independent due to the disjoint nature of strata. Multiple-frame surveys: sampling from overlapping sampling frames (Practice) Applications are limited to survey sampling (Method) Hartley s estimator works well (Theory) Finite population framework (Theory) No empirical process theory Meta-analysis (Practice) Does not cover overlaps in samples

11 Resemblance to Other Frameworks Stratified Sampling with Overlapping Strata (Practice) Single entity designs the entire sampling so that duplicated items can be identified (Method) Naive method produces bias from overlaps (Theory) The quantity of interest is decomposed into stratum means, and they are asymptotically independent due to the disjoint nature of strata. Multiple-frame surveys: sampling from overlapping sampling frames (Practice) Applications are limited to survey sampling (Method) Hartley s estimator works well (Theory) Finite population framework (Theory) No empirical process theory Meta-analysis (Practice) Does not cover overlaps in samples Record Linkage: Identification of duplications (Issue) Produces bias of wrong links and non-links (Theory) Requires a correctly specified model of linking errors

12 Comparison with Approaches in Sampling Theory Finite Population Super Population Ours Randomness sampling from data sources distribution on variables Model on Variables Parameter finite population parameter parameter in the model Dependence within samples across samples Applications sample mean generalized linear model? semiparametric model? Asymptotics LLN (?) CLT (?) U-LLN for a class of functions U-CLT for a class of functions Conditions super population design

13 Section 2 Empirical Process

14 Empirical Process Approach Empirical process is a stochastic process very useful in semiparametric and nonparametric models. Major tools for statistical theory Uniform LLN and Uniform CLT Rate of convergence Concentration inequalities, etc.

15 Empirical Process Approach Empirical process is a stochastic process very useful in semiparametric and nonparametric models. Major tools for statistical theory Uniform LLN and Uniform CLT Rate of convergence Concentration inequalities, etc. Let X 1,..., X n i.i.d. P taking values in X. The empirical measure is defined as P n = 1 n δ Xi n where δ x is a Dirac measure putting a unit mass at x. The empirical measure is a probability measure. The probability of the event A X under P n is P n(a) = 1 n n 1 A (X i ) = #{X i : X i A}, n and the expectation of f (X ) under P n is P nf = 1 n f (X i ). n

16 The empirical process indexed by the class F of functions on X is defined as G n = n(p n P).

17 The empirical process indexed by the class F of functions on X is defined as G n = n(p n P). This is a stochastic process indexed by F, i.e., given f F, the following random variable is obtained: G n f = n(p n f Pf ) = ( ) 1 n n f (X i ) Pf. n Here Pf = E P f (X ) is the expectation of f (X ) under P.

18 The empirical process indexed by the class F of functions on X is defined as G n = n(p n P). This is a stochastic process indexed by F, i.e., given f F, the following random variable is obtained: G n f = n(p n f Pf ) = ( ) 1 n n f (X i ) Pf. n Here Pf = E P f (X ) is the expectation of f (X ) under P. Examples of index sets are F = {t 1 (,t] (s) : t R} yields P n 1 (,t] = F n (t) F = {x log p θ (x) : θ Θ}

19 An important goal of modern empirical process theory is to provide a uniform control of the sample average over the class of functions.

20 An important goal of modern empirical process theory is to provide a uniform control of the sample average over the class of functions. The class F of functions on X is called P-Glivenko-Cantelli if P n P F sup P nf Pf P or a.s. 0. f F

21 An important goal of modern empirical process theory is to provide a uniform control of the sample average over the class of functions. The class F of functions on X is called P-Glivenko-Cantelli if P n P F sup P nf Pf P or a.s. 0. f F The class F of functions on X is called P-Donsker if G n G in l (F), where G is the P-Brownian bridge, a Gaussian process with covariance function ρ P (f, g) = Cov P (f (X ), g(x )) = Pfg (Pf )(Pg) for f, g F.

22 An important goal of modern empirical process theory is to provide a uniform control of the sample average over the class of functions. The class F of functions on X is called P-Glivenko-Cantelli if P n P F sup P nf Pf P or a.s. 0. f F The class F of functions on X is called P-Donsker if G n G in l (F), where G is the P-Brownian bridge, a Gaussian process with covariance function ρ P (f, g) = Cov P (f (X ), g(x )) = Pfg (Pf )(Pg) for f, g F. At f, g F, this implies ( Gnf G ng ) d ( Gf Gg ) (( 0 N 0 ) ( ρp (f, f ) ρ, P (f, g) ρ P (f, g) ρ P (g, g) )). There exits a version of a Gaussian process with sample continuity. Here we further have asymptotic equicontinuity: sup G n(f g) = o P (1), as δ 0. ρ P (f,g)<δ

23 Why Empirical Process Theory? We have enough tools already? Regularity conditions Calculus Law of Large Numbers (LLN) Central Limit Theorem (CLT) Martingale theory if you like

24 Motivating Example: Uniform LLN M-estimator ˆθ n = argmax θ Θ P n m(θ) Condition for Consistency (van der Vaart 1998, Theorem 5.7) sup P n m(θ) Pm(θ) P 0. θ Θ

25 Motivating Example: Uniform LLN M-estimator ˆθ n = argmax θ Θ P n m(θ) Condition for Consistency (van der Vaart 1998, Theorem 5.7) sup P n m(θ) Pm(θ) P 0. θ Θ In some literature, it is claimed that under regularity conditions, the law of large numbers yields 1 n log (X pˆθn i ) a.s. E log p θ0 (X ) (?) n

26 Motivating Example: Uniform LLN M-estimator ˆθ n = argmax θ Θ P n m(θ) Condition for Consistency (van der Vaart 1998, Theorem 5.7) sup P n m(θ) Pm(θ) P 0. θ Θ In some literature, it is claimed that under regularity conditions, the law of large numbers yields 1 n log (X pˆθn i ) a.s. E log p θ0 (X ) (?) n The law of large numbers requires the independent summand: 1 n log pˆθ n n (X i ) }{{} Independent?

27 Motivating Example: Uniform LLN M-estimator ˆθ n = argmax θ Θ P n m(θ) Condition for Consistency (van der Vaart 1998, Theorem 5.7) sup P n m(θ) Pm(θ) P 0. θ Θ In some literature, it is claimed that under regularity conditions, the law of large numbers yields 1 n log (X pˆθn i ) a.s. E log p θ0 (X ) (?) n The law of large numbers requires the independent summand: 1 n log pˆθ n n (X i ) }{{} Independent? The sample X 1,..., X n is independent but ˆθ n depends on X 1,..., X n. Hence log pˆθn (X 1 ),..., log pˆθn (X n ) are dependent.

28 The Glivenko-Cantelli theorem and the dominated convergence theorem yield 1 n n log p ˆθ n (X i ) = 1 n log p n ˆθ n (X i ) E log p ˆθ n (X ) + E log p ˆθ n (X ) 1 n sup log p θ (X i ) E log p θ (X ) n + E log p (X ) ˆθn θ Θ 0 + E log p θ0 (X )

29 Motivating Example: Asymptotic Equicontinuity The MLE solves the likelihood equation (1/n) n l ˆθ n (X i ) = 0. For asymptotic normality, Taylor s theorem from Calculus yields 0 = 1 n l n ˆθn (X i ) = 1 n l θ0 (X i ) + 1 n l θ n n n (X i )(ˆθ n θ 0 ). Hence we can apply LLN and CLT to obtain n(ˆθ n θ 0 ) = ( 1 n ) n 1 n 1 n l θ n (X i ) l θ0 (X i ) d X N(0, I 1 ). n

30 Motivating Example: Asymptotic Equicontinuity The MLE solves the likelihood equation (1/n) n l ˆθ n (X i ) = 0. For asymptotic normality, Taylor s theorem from Calculus yields 0 = 1 n l n ˆθn (X i ) = 1 n l θ0 (X i ) + 1 n l θ n n n (X i )(ˆθ n θ 0 ). Hence we can apply LLN and CLT to obtain n(ˆθ n θ 0 ) = ( 1 n ) n 1 n 1 n l θ n (X i ) l θ0 (X i ) d X N(0, I 1 ). n Rubin-Bleuer and Kratina (Annals of Statistics, 2005) adopted a two-phase framework for estimating a Euclidean parameter from estimating equations: N(ˆθN θ 0 ) = N(ˆθN θ N ) }{{} Asymptotic Normality from Design Conditions + N(θN θ 0 ) }{{} Asymptotic Normality from Superpopulation Condition where ˆθ N is a solution to weighted estimating equations and θ N is a solution of unweighted estimating equations.

31 The previous argument works for many parametric models. In survival analysis, however, semiparametric models play a pivotal role in determining effects of treatments and risk factors. A semiparametric model is a collection of probability measures indexed by a finite-dimensional parameter, and a infinite-dimensional parameter. An example is the Cox proportional hazards model with regression parameter β R d and the cumulative hazard function Λ in the class of positive increasing functions. The conditional hazard function given covariates X = x is λ(t x) = λ 0 (t) exp(x T β).

32 The previous argument works for many parametric models. In survival analysis, however, semiparametric models play a pivotal role in determining effects of treatments and risk factors. A semiparametric model is a collection of probability measures indexed by a finite-dimensional parameter, and a infinite-dimensional parameter. An example is the Cox proportional hazards model with regression parameter β R d and the cumulative hazard function Λ in the class of positive increasing functions. The conditional hazard function given covariates X = x is λ(t x) = λ 0 (t) exp(x T β). The following is the likelihood for the Cox model with current status data. Can you use the Taylor expansion around θ 0 = (β 0, Λ 0 ) as usual? l β (θ) = 1 n X i e βt X i 1 e Λ(Y i ) eβ T Xi Λ(Yi ) i (1 i ) n e eβt X i Λ(Yi )

33 Suppose the asymptotic equicontinuity condition holds: 1 n n l n ˆθn (X i ) E log l ˆθn (X ) 1 n n l θ0 (X i ) E log l θ0 (X ) n = o P (1 + n ˆθ n θ 0 ). Since (1/n) n l ˆθn (X i ) = 0 and E l θ0 (X ) = 0, it follows n(e l ˆθ n (X ) E l θ0 (X )) = { } 1 n n l n ˆθn (X i ) E log l ˆθn (X ) = { } 1 n n l θ0 (X i ) E log n l θ0 (X ) + o P (1 + n ˆθ n θ 0 ) If θ E l θ (X ) is differentiable at θ 0 and n ˆθ n θ 0 = O P (1), we obtain ne l θ0 (X )(ˆθ n θ 0 ) = n { 1 n } n l θ0 (X i ) E log l θ0 (X ) + o P (1). For semiparametric models, the derivative E l θ0 is replaced by the functional derivative. the number of the likelihood equations becomes infinitely many.

34 Motivating Example: Martingale The Cox s partial likelihood score can be written as n τ H i (s) d M i (s) 0 }{{}}{{} Predictable Process Martingale to which Martingale Central Limit Theorem applies.

35 Motivating Example: Martingale The Cox s partial likelihood score can be written as n τ H i (s) d M i (s) 0 }{{}}{{} Predictable Process Martingale to which Martingale Central Limit Theorem applies. In the analysis of complex samling data where sampling depends on the event, we analyze inverse probability weighted partial likelihood score n τ 0 W }{{} i H(s) }{{} Not Predictable Predictable Process so that the Martingale CLT does not apply. d M(s) }{{} Martingale the Martingale CLT and Empirical process approaches must address dependence issues from complex sampling but the former approach intrinsically fails to address sampling that depends on events even if dependence can be addressed.

36 Some Literature on the Cox Model in Sampling Theory D.Y. Lin, On fitting Cox s proportional hazards models to survey data, Biometrika 87 (2000) The paper simply cited Andersen and Gill (Annals of Statistics 10(4) ) for consistency but there are too many difficulties left to the reader (martingale, LLN, etc.) The paper simply assumes the existence of the U-CLT a priori. S. Rubin-Bleuer, The proportional hazards model for survey data from independent and clustered super-populations, Journal of Multivariate Analysis 102 (2011), Most parts assumes sampling does not depend on the event so that the martingale CLT can be used Consistency results counts on K.H. Yuan an R. Jennrich ( J. Multivariate Anal. 65,1998, ) where the uniform LLN are assumed that this condition is not verified in the paper. The last part where sampling depend on the event counts on Lin (2000).

37 Some Literature on Empirical Process Theory on Complex Surveys Bertail, P., Chautru, E. and Clémençon, S. (2017). Empirical processes in survey sampling with (conditional) Poisson designs. Scand. J. Stat Rejective Sampling U-LLN is not obtained Boistard, H., Lopuha a, H. P. and Ruiz-Gazen, A. (2017). Functional central limit theorems for single-stage sampling designs. Ann. Statist Single-stage sampling in a general way finite dimensional CLT is assumed U-LLN is not obtained A class of functions is restricted to a indicator function of variables less than some number Daniel Bonnéry, F. Jay Breidt, and François Coquet, Uniform convergence of the empirical cumulative distribution function under informative selection from a finite population A class of functions is restricted to a indicator function of variables less than some number

38 Section 3 Our Approach

39 Setting (2 Data Sources) V : auxiliary variables available for all items V 1,..., V N i.i.d. V (j), j = 1, 2: sampling frames of size N (j) (V (1) V (2) ) V V (j) means the item belong to source j n (j) items are selected without replacement from source j R (j), j = 1, 2: sampling indicator from source j (R (j) i s with V i V (j) are dependent) π (j) (v): sampling probability from source j (e.g. π (1) (v) = n (1) /N (1) I V (1) (v)) n (j) /N (j) p (j) > 0. X : only available on selected items i.i.d. Sample Sampling Data Source 1 V (1) without Replacement Sample 1 ( X (1),V (1) ) Population P 0 π(v )=n (1) / N (1) Data Source 2 V (2) π(v )=n (2) / N (2) Sample 2 ( X (2),V (2) ) V X,V

40 Canonical Estimator: Hartley s Estimator Solution to Duplicated Selection: Hartleys estimator (1962, 1974) reweighing function ρ : V R 2 ρ(v) = ( ) ρ (1) (v), ρ (2) (v) = where c 1 + c 2 = 1. (1, 0) if v V (1) ( V (2)) c (0, 1) if v ( V (1)) c V (2) (c 1, c 2 ) if v V (1) V (2)

41 Canonical Estimator: Hartley s Estimator Solution to Duplicated Selection: Hartleys estimator (1962, 1974) reweighing function ρ : V R 2 ρ(v) = ( ) ρ (1) (v), ρ (2) (v) = where c 1 + c 2 = 1. Hartley s estimator of X = (1/N) N X i is 1 N N R (1) i π (1) (V i ) }{{} Inverse probability ρ (1) (V i ) }{{} Reweighing for Source 1 weighting for source 1 (1, 0) if v V (1) ( V (2)) c (0, 1) if v ( V (1)) c V (2) (c 1, c 2 ) if v V (1) V (2) + R (2) i π (2) (V i ) }{{} Inverse probability ρ (2) (V i ) }{{} Reweighing for Source 2 weighting for source 2 X i

42 Unbiasedness: Biased Sampling: E[R (j) V, X ] = π (j) (V ). ρ (1) (v) + ρ (2) (v) = 1 for every v. No identification of duplicated items: 1 N R (1) i N π (1) (V i ) ρ(1) (V i )X i + 1 N R (2) i N π (2) (V i ) ρ(2) (V i )X i }{{}}{{} computed from sample from source 1 computed from sample from source 2

43 Hartley-Type Empirical Process The empirical measure is replaced by Hartley s estimator of the distribution function. Define Hartley-type inverse probability weighted (H-IPW) empirical measure by P H N = 1 N ( N R (1) i π (1) (V i ) ρ(1) (V i ) + ) R(2) i π (2) (V i ) ρ(2) (V i ) δ Xi Note that this is NOT a probability measure measure: ( ) P H N1 = 1 N R (1) i N π (1) (V i ) ρ(1) (V i ) + R(2) i π (2) (V i ) ρ(2) (V i ) 1 in general in contrast to P n 1 = 1. Define H-IPW empirical process by G H N = N(P H N P).

44 Decomposition of H-Empirical Process Key Idea 1: Decompose H-Empirical Process into different sampling: Stage 1 + Stage 2: G H Nf = N(P H N P)f + N(P N P N )f = N(P N P)f + N(P H N P N )f = G N f }{{} Sampling from Population + N(P H N P N )f }{{} Sampling from Data Sources

45 Decomposition of H-Empirical Process Key Idea 1: Decompose H-Empirical Process into different sampling: Stage 1 + Stage 2: G H Nf = N(P H N P)f + N(P N P N )f = N(P N P)f + N(P H N P N )f = G N f + N(P H N P }{{} N )f }{{} Sampling from Population Sampling from Data Sources It can be shown that G N f and N(P H N P N )f are uncorrelated. If the latter processes converge to Gaussian process, the limiting process of G H N is a sum of independent Gaussian processes.

46 Decomposition of H-Empirical Process Key Idea 1: Decompose H-Empirical Process into different sampling: Stage 1 + Stage 2: G H Nf = N(P H N P)f + N(P N P N )f = N(P N P)f + N(P H N P N )f = G N f + N(P H N P }{{} N )f }{{} Sampling from Population Sampling from Data Sources It can be shown that G N f and N(P H N P N )f are uncorrelated. If the latter processes converge to Gaussian process, the limiting process of G H N is a sum of independent Gaussian processes. It follows by Donsker theorem, G N G.

47 Key Idea 2: View sampling from sources as a single realization of m out of n without-replacement bootstrap with m = n (j) and n = N (j). The average within data source j before sampling P (j) N (j) f = 1 N (j) i:v i V (j) f (X i ) plays a role of sample average in a bootstrap framework. The average within data source j after sampling ˆP (j) n (j) f = 1 n (j) i:v i V (j) R (j) i f (X (j),i ) plays a role of bootstrap sample average in a bootstrap framework.

48 Key Idea 2: View sampling from sources as a single realization of m out of n without-replacement bootstrap with m = n (j) and n = N (j). The average within data source j before sampling P (j) N (j) f = 1 N (j) i:v i V (j) f (X i ) plays a role of sample average in a bootstrap framework. The average within data source j after sampling ˆP (j) n (j) f = 1 n (j) i:v i V (j) R (j) i f (X (j),i ) plays a role of bootstrap sample average in a bootstrap framework. Sampling from each source: N(P H N P N )f = 2 N (j) N (j) (ˆP (j) P (j) )ρ (j) f N n (j) N (j) j=1 where with reindexing X (j),i, i = 1,..., N (j), j = 1, 2. It can be shown that G N and N (j) /N N (j) (ˆP (j) P (j) ) are all n (j) N uncorrelated. (j)

49 Theorem (Uniform Law of Large Numbers) Suppose the class F of measurable functions is P-Glivenko-Cantelli. Then P H N P F = sup P H Nf Pf P 0. f F Theorem (Uniform Central Limit Theorem) Suppose the class F of measurable functions is the P-Donsker. Then 2 G H N G + P(V V (j) 1 p ) (j) G (j) ρ (j), j=1 where P-Brownian bridge G, and P (j) -Brownian bridge G (j) are all independent. Here P (j) is a conditional probability measure given membership in source j. p (j)

50 Implications Asymptotic distribution of G H N f : with Σ H = Var(f (X )) + }{{} Population Variance G H Nf d Z H N(0, Σ H ) 2 P(V V (j) ) 1 p(j) Var {ρ (j) (V )f (X ) V (j)}. j=1 p (j) } {{ } Design Variance Optimal ρ (j) and Optimal Calibration (Deville and Sarndal, JASA, 1992) can be derived from this variance formula. (Our Approach) Optimal based on the limiting distribution of G H N f = N(P H N P)f. (Finite Population Approach, Lohr and Rao, Estimation in multiple-frame surveys, JASA, 2006, 101, ) Optimality based on the variable P H N f.

51 Calibration General Idea (Deville and Sarndal, JASA 1992): Improve estimation of Horvitz-Thompson estimator of P N X = (1/N) N X i by the following relationship 1 N 1 N R i P N V = V i N N π(v i ) V i }{{}}{{} Sample Average Horvitz-Thompson Estimator

52 Calibration General Idea (Deville and Sarndal, JASA 1992): Improve estimation of Horvitz-Thompson estimator of P N X = (1/N) N X i by the following relationship 1 N 1 N R i P N V = V i N N π(v i ) V i }{{}}{{} Sample Average Horvitz-Thompson Estimator For Multiple-frame sampling, Ranalli et al (2018) uses several different relationship to improve the estimation of P N X. One of them is to use the following idea: P N V }{{} P H }{{ NV } Sample Average Hartley s estimator

53 Another method considered by Ranalli et al. (2018) uses the relationship given by P N VI {V V (j) } }{{} P H NVI {V V (j) } }{{} Sample Average in Source j Hartley s estimator in Source j

54 Another method considered by Ranalli et al. (2018) uses the relationship given by P N VI {V V (j) } }{{} P H NVI {V V (j) } }{{} Sample Average in Source j Hartley s estimator in Source j Our approach: 1 N (j) ρ (j) (V i )V i I {V i V (j) } i:v i V (j) }{{} Sample Average in Source j 1 R (j) i N (j) π (j) (V i ) ρ(j) (V i )V i i:v i V (j) }{{} Horvitz-Thompson Estimator in Sourcej

55 Another method considered by Ranalli et al. (2018) uses the relationship given by P N VI {V V (j) } }{{} P H NVI {V V (j) } }{{} Sample Average in Source j Hartley s estimator in Source j Our approach: 1 N (j) ρ (j) (V i )V i I {V i V (j) } i:v i V (j) }{{} Sample Average in Source j 1 R (j) i N (j) π (j) (V i ) ρ(j) (V i )V i i:v i V (j) }{{} Horvitz-Thompson Estimator in Sourcej Method Ours Ranalli (2) Which variables? ρ (1) (V )V V in source 1 ρ (2) (V )V V in source 2 Where variable come from? Sampling from source 1 Both sampling Sampling from source 2 Both sampling What is computed Horvitz-Thompson Hartley Horvitz-Thompson Hartley

56 General semiparametric model Semiparametric model: X P θ,η P (parametric) θ Θ R p (nonparametric) η H B, B, a Banach space Scores l θ,η for θ and B θ,η h for η with h H, Hilbert space Efficient influence function l 0 Semiparametric efficiency bound I = E l 0 Assumptions (for complete data) Smoothness of the model Asymptotic equicontinuity

57 Hartley-type Weighted Likelihood Estimator (H-WLE) The MLE (ˆθ, ˆη) for complete data is obtained from the likelihood equations; 1 N 1 N N lˆθ,ˆη (X i) = o P (N 1/2 ), N Bˆθ,ˆη h(x i) = o P (N 1/2 ). The WLE (ˆθ N, ˆη N ) is obtained from the weighted likelihood equations; 1 N 1 N N N { R (1) i π (1) (V i ) ρ(1) (V i ) + { (1) R i π (1) (V i ) ρ(1) (V i ) + R } (2) i π (2) (V i ) ρ(2) (V i ) l ˆθ N, ˆη (X N i ) = o P (N 1/2 ), R (2) i π (2) (V i ) ρ(2) (V i ) } B ˆθ N, ˆη N h(x i ) = o P (N 1/2 ).

58 Hartley-type Weighted Likelihood Estimator (H-WLE) The MLE (ˆθ, ˆη) for complete data is obtained from the likelihood equations; 1 N 1 N N lˆθ,ˆη (X i) = o P (N 1/2 ), N Bˆθ,ˆη h(x i) = o P (N 1/2 ). The WLE (ˆθ N, ˆη N ) is obtained from the weighted likelihood equations; 1 N 1 N N N { R (1) i π (1) (V i ) ρ(1) (V i ) + { (1) R i π (1) (V i ) ρ(1) (V i ) + R } (2) i π (2) (V i ) ρ(2) (V i ) l ˆθ N, ˆη (X N i ) = o P (N 1/2 ), R (2) i π (2) (V i ) ρ(2) (V i ) } B ˆθ N, ˆη N h(x i ) = o P (N 1/2 ).

59 Theorem (H-WLE for Data Integration) Assume the WLE s are consistent, and ˆη N η 0 = O P (N β ). Then N(ˆθ N θ 0 ) Z N(0, Σ), and Σ = J I P(V V (j) ) 1 p j Var(ρ (j) (V ) l 0 V (j) ). }{{} p j=1 j Superpopulation Variance }{{} Design Variances

60 Variance Estimation: Plug-in Estimator For sampling fraction and data source probability ˆp j = n(j) N (j), P(V V (j) ) = N(j) N For the inverse of the efficient information I = E l 0, For variance from data sources, I 1 0 = P H 2 N l ˆθ N,ˆη N Var(ρ (j) l 0 V (j) ) = 1 N (j) N (j) R (j) { } (j),i 2 ρ (j) (V π (j) (j),i ) lˆθ (V (j),i ) N,ˆη N (X (j),i ) N (j) R (j) (j),i π (j) (V (j),i ) ρ(j) (V (j),i ) lˆθ N,ˆη N (X (j),i ) 2

61 Section 4 Numerical Study

62 Simulation for Cox Model with Right Censoring T Weibull(α, β): time to event C Uniform(0, c): censoring variable Y = min{t, C}, = I {T C}. covariates Z 1 Bernoulli(1/2), Z 2 N(0, 1) Z 1 is collected only in the final combined sample Data sources V (j) are created from V = (Y,, Z 2 )

63 V (1) V (2) N N (1) N (2) n (1) n (2) Duplication Scenario 1 Z 2 1 Z Scenario 2 V Z Scenario 3 V = Duplication N N (1) N (2) N (3) n (1) n (2) n (3) twice thrice Scenario Table: Sample sizes and the numbers of duplications based on 2000 simulated datasets. In Scenarion 4, V (1) = { = 1}, and membership in V (2) {V (3) } C, V (2) V (3), and {V (2) } C V (3) are determined via multinomial logistic regression on Z 2

64 θ 1 = θ 2 log 2 0 N Scenario 1 Scenario 3 θ 1 Bias SD SEE θ 2 Bias SD SEE Scenario 2 Scenario 4 θ 1 Bias SD SEE θ 2 Bias SD SEE Table: Bias, an absolute Monte Carlo sample bias; SD, a Monte Carlo sample standard deviation; SEE, average of a plug-in estimator of a standard error.

65 (α, β) = (.2,.5) N = 500 N = θ 1 = log(2) w/o SC C DC w/o SC C DC MLE S SF B θ 2 = log(2) w/o SC C DC w/o SC C DC MLE S SF B Note: S, the proposed weights; SF, ρ for a single-frame estimator; B, a balanced weights; w/o, non-calibration; SC, the proposed calibration; C, the standard calibration; DC, the data-source-specific calibration. All calibrations use U and Y.

66 National Wilms Tumor Study Complete information is available for comparison of designs N = 1957 Histology is determined in the final sample Event = Relapse of Wilms Tumor Data integration (n = 506 with 68 duplications) Data Source 1: Death (all sampled) Data Source 2: Unfavorable Histology (50% sampled) Data Source 3: Entire Cohort (10% sampled) Stratified Sample (n = 502) Stratum 1: Death (all sampled) Stratum 2: Alive with Unfavorable Histology (50% sampled) Stratum 3: the rest (14% sampled)

67 Full cohort Data integration Stratified sampling ρ Proposed Balanced # subjects (506 with duplication) 502 Duplication 0 64 (twice) 2 (thrice) 0 Partial likelihood Variable ˆθ SE ˆθ SE ˆθ SE ˆθ SE Histology Age Stage (III/IV) Tumor Stage Tumor Note: Histology is measured at a central laboratory. Table: Point estimates and estimated standard errors in the analysis of the NWTS study with different sampling schemes. Proposed means results for the estimator with proposed ρ (j) and Balanced means results for the estimator with the value for ρ (j) across sources.

68 Section 5 Discussion

69 Empirical process theory is shown to be extended to data integration problems. Empirical process theory is powerful tool to study semiparametric models under complex surveys. Discussion above for more than 2 sources and stratified sampling from each source as an alternative to sampling without replacement are straightforward. Other sampling designs to combine different data sources are to be investigated. Many methods proposed in the i.i.d. setting should be modified to accomodate sampling procedures.

70 Thank you!

Estimation for two-phase designs: semiparametric models and Z theorems

Estimation for two-phase designs: semiparametric models and Z theorems Estimation for two-phase designs:semiparametric models and Z theorems p. 1/27 Estimation for two-phase designs: semiparametric models and Z theorems Jon A. Wellner University of Washington Estimation for

More information

Efficiency of Profile/Partial Likelihood in the Cox Model

Efficiency of Profile/Partial Likelihood in the Cox Model Efficiency of Profile/Partial Likelihood in the Cox Model Yuichi Hirose School of Mathematics, Statistics and Operations Research, Victoria University of Wellington, New Zealand Summary. This paper shows

More information

Weighted likelihood estimation under two-phase sampling

Weighted likelihood estimation under two-phase sampling Weighted likelihood estimation under two-phase sampling Takumi Saegusa Department of Biostatistics University of Washington Seattle, WA 98195-7232 e-mail: tsaegusa@uw.edu and Jon A. Wellner Department

More information

An empirical-process central limit theorem for complex sampling under bounds on the design

An empirical-process central limit theorem for complex sampling under bounds on the design An empirical-process central limit theorem for complex sampling under bounds on the design effect Thomas Lumley 1 1 Department of Statistics, Private Bag 92019, Auckland 1142, New Zealand. e-mail: t.lumley@auckland.ac.nz

More information

Survival Analysis for Case-Cohort Studies

Survival Analysis for Case-Cohort Studies Survival Analysis for ase-ohort Studies Petr Klášterecký Dept. of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, harles University, Prague, zech Republic e-mail: petr.klasterecky@matfyz.cz

More information

Power and Sample Size Calculations with the Additive Hazards Model

Power and Sample Size Calculations with the Additive Hazards Model Journal of Data Science 10(2012), 143-155 Power and Sample Size Calculations with the Additive Hazards Model Ling Chen, Chengjie Xiong, J. Philip Miller and Feng Gao Washington University School of Medicine

More information

Estimation and Inference of Quantile Regression. for Survival Data under Biased Sampling

Estimation and Inference of Quantile Regression. for Survival Data under Biased Sampling Estimation and Inference of Quantile Regression for Survival Data under Biased Sampling Supplementary Materials: Proofs of the Main Results S1 Verification of the weight function v i (t) for the lengthbiased

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models

Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

On Testing for Informative Selection in Survey Sampling 1. (plus Some Estimation) Jay Breidt Colorado State University

On Testing for Informative Selection in Survey Sampling 1. (plus Some Estimation) Jay Breidt Colorado State University On Testing for Informative Selection in Survey Sampling 1 (plus Some Estimation) Jay Breidt Colorado State University Survey Methods and their Use in Related Fields Neuchâtel, Switzerland August 23, 2018

More information

11 Survival Analysis and Empirical Likelihood

11 Survival Analysis and Empirical Likelihood 11 Survival Analysis and Empirical Likelihood The first paper of empirical likelihood is actually about confidence intervals with the Kaplan-Meier estimator (Thomas and Grunkmeier 1979), i.e. deals with

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

M- and Z- theorems; GMM and Empirical Likelihood Wellner; 5/13/98, 1/26/07, 5/08/09, 6/14/2010

M- and Z- theorems; GMM and Empirical Likelihood Wellner; 5/13/98, 1/26/07, 5/08/09, 6/14/2010 M- and Z- theorems; GMM and Empirical Likelihood Wellner; 5/13/98, 1/26/07, 5/08/09, 6/14/2010 Z-theorems: Notation and Context Suppose that Θ R k, and that Ψ n : Θ R k, random maps Ψ : Θ R k, deterministic

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 24 Paper 153 A Note on Empirical Likelihood Inference of Residual Life Regression Ying Qing Chen Yichuan

More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

Lecture 5 Models and methods for recurrent event data

Lecture 5 Models and methods for recurrent event data Lecture 5 Models and methods for recurrent event data Recurrent and multiple events are commonly encountered in longitudinal studies. In this chapter we consider ordered recurrent and multiple events.

More information

1 Glivenko-Cantelli type theorems

1 Glivenko-Cantelli type theorems STA79 Lecture Spring Semester Glivenko-Cantelli type theorems Given i.i.d. observations X,..., X n with unknown distribution function F (t, consider the empirical (sample CDF ˆF n (t = I [Xi t]. n Then

More information

Combining multiple observational data sources to estimate causal eects

Combining multiple observational data sources to estimate causal eects Department of Statistics, North Carolina State University Combining multiple observational data sources to estimate causal eects Shu Yang* syang24@ncsuedu Joint work with Peng Ding UC Berkeley May 23,

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

Graybill Conference Poster Session Introductions

Graybill Conference Poster Session Introductions Graybill Conference Poster Session Introductions 2013 Graybill Conference in Modern Survey Statistics Colorado State University Fort Collins, CO June 10, 2013 Small Area Estimation with Incomplete Auxiliary

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information

Modification and Improvement of Empirical Likelihood for Missing Response Problem

Modification and Improvement of Empirical Likelihood for Missing Response Problem UW Biostatistics Working Paper Series 12-30-2010 Modification and Improvement of Empirical Likelihood for Missing Response Problem Kwun Chuen Gary Chan University of Washington - Seattle Campus, kcgchan@u.washington.edu

More information

Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling

Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling Jae-Kwang Kim 1 Iowa State University June 26, 2013 1 Joint work with Shu Yang Introduction 1 Introduction

More information

Nonresponse weighting adjustment using estimated response probability

Nonresponse weighting adjustment using estimated response probability Nonresponse weighting adjustment using estimated response probability Jae-kwang Kim Yonsei University, Seoul, Korea December 26, 2006 Introduction Nonresponse Unit nonresponse Item nonresponse Basic strategy

More information

PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA

PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA Kasun Rathnayake ; A/Prof Jun Ma Department of Statistics Faculty of Science and Engineering Macquarie University

More information

Missing covariate data in matched case-control studies: Do the usual paradigms apply?

Missing covariate data in matched case-control studies: Do the usual paradigms apply? Missing covariate data in matched case-control studies: Do the usual paradigms apply? Bryan Langholz USC Department of Preventive Medicine Joint work with Mulugeta Gebregziabher Larry Goldstein Mark Huberman

More information

EMPIRICAL ENVELOPE MLE AND LR TESTS. Mai Zhou University of Kentucky

EMPIRICAL ENVELOPE MLE AND LR TESTS. Mai Zhou University of Kentucky EMPIRICAL ENVELOPE MLE AND LR TESTS Mai Zhou University of Kentucky Summary We study in this paper some nonparametric inference problems where the nonparametric maximum likelihood estimator (NPMLE) are

More information

Goodness-of-fit tests for the cure rate in a mixture cure model

Goodness-of-fit tests for the cure rate in a mixture cure model Biometrika (217), 13, 1, pp. 1 7 Printed in Great Britain Advance Access publication on 31 July 216 Goodness-of-fit tests for the cure rate in a mixture cure model BY U.U. MÜLLER Department of Statistics,

More information

Philosophy and Features of the mstate package

Philosophy and Features of the mstate package Introduction Mathematical theory Practice Discussion Philosophy and Features of the mstate package Liesbeth de Wreede, Hein Putter Department of Medical Statistics and Bioinformatics Leiden University

More information

Calibration Estimation of Semiparametric Copula Models with Data Missing at Random

Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Shigeyuki Hamori 1 Kaiji Motegi 1 Zheng Zhang 2 1 Kobe University 2 Renmin University of China Institute of Statistics

More information

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3 University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.

More information

Weighted Likelihood for Semiparametric Models and Two-phase Stratified Samples, with Application to Cox Regression

Weighted Likelihood for Semiparametric Models and Two-phase Stratified Samples, with Application to Cox Regression doi:./j.467-9469.26.523.x Board of the Foundation of the Scandinavian Journal of Statistics 26. Published by Blackwell Publishing Ltd, 96 Garsington Road, Oxford OX4 2DQ, UK and 35 Main Street, Malden,

More information

A note on multiple imputation for general purpose estimation

A note on multiple imputation for general purpose estimation A note on multiple imputation for general purpose estimation Shu Yang Jae Kwang Kim SSC meeting June 16, 2015 Shu Yang, Jae Kwang Kim Multiple Imputation June 16, 2015 1 / 32 Introduction Basic Setup Assume

More information

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Glenn Heller and Jing Qin Department of Epidemiology and Biostatistics Memorial

More information

WEIGHTED LIKELIHOOD ESTIMATION UNDER TWO-PHASE SAMPLING. BY TAKUMI SAEGUSA 1 AND JON A. WELLNER 2 University of Washington

WEIGHTED LIKELIHOOD ESTIMATION UNDER TWO-PHASE SAMPLING. BY TAKUMI SAEGUSA 1 AND JON A. WELLNER 2 University of Washington The Annals of Statistics 2013, Vol. 41, o. 1, 269 295 DOI: 10.1214/12-AOS1073 Institute of Mathematical Statistics, 2013 WEIGHTED LIKELIHOOD ESTIMATIO UDER TWO-PHASE SAMPLIG BY TAKUMI SAEGUSA 1 AD JO A.

More information

Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations

Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations Takeshi Emura and Hisayuki Tsukuma Abstract For testing the regression parameter in multivariate

More information

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky Empirical likelihood with right censored data were studied by Thomas and Grunkmier (1975), Li (1995),

More information

Efficient Semiparametric Estimators via Modified Profile Likelihood in Frailty & Accelerated-Failure Models

Efficient Semiparametric Estimators via Modified Profile Likelihood in Frailty & Accelerated-Failure Models NIH Talk, September 03 Efficient Semiparametric Estimators via Modified Profile Likelihood in Frailty & Accelerated-Failure Models Eric Slud, Math Dept, Univ of Maryland Ongoing joint project with Ilia

More information

A Least-Squares Approach to Consistent Information Estimation in Semiparametric Models

A Least-Squares Approach to Consistent Information Estimation in Semiparametric Models A Least-Squares Approach to Consistent Information Estimation in Semiparametric Models Jian Huang Department of Statistics University of Iowa Ying Zhang and Lei Hua Department of Biostatistics University

More information

Graduate Econometrics I: Maximum Likelihood I

Graduate Econometrics I: Maximum Likelihood I Graduate Econometrics I: Maximum Likelihood I Yves Dominicy Université libre de Bruxelles Solvay Brussels School of Economics and Management ECARES Yves Dominicy Graduate Econometrics I: Maximum Likelihood

More information

Data Integration for Big Data Analysis for finite population inference

Data Integration for Big Data Analysis for finite population inference for Big Data Analysis for finite population inference Jae-kwang Kim ISU January 23, 2018 1 / 36 What is big data? 2 / 36 Data do not speak for themselves Knowledge Reproducibility Information Intepretation

More information

Other Survival Models. (1) Non-PH models. We briefly discussed the non-proportional hazards (non-ph) model

Other Survival Models. (1) Non-PH models. We briefly discussed the non-proportional hazards (non-ph) model Other Survival Models (1) Non-PH models We briefly discussed the non-proportional hazards (non-ph) model λ(t Z) = λ 0 (t) exp{β(t) Z}, where β(t) can be estimated by: piecewise constants (recall how);

More information

Comparing Distribution Functions via Empirical Likelihood

Comparing Distribution Functions via Empirical Likelihood Georgia State University ScholarWorks @ Georgia State University Mathematics and Statistics Faculty Publications Department of Mathematics and Statistics 25 Comparing Distribution Functions via Empirical

More information

Theoretical Statistics. Lecture 17.

Theoretical Statistics. Lecture 17. Theoretical Statistics. Lecture 17. Peter Bartlett 1. Asymptotic normality of Z-estimators: classical conditions. 2. Asymptotic equicontinuity. 1 Recall: Delta method Theorem: Supposeφ : R k R m is differentiable

More information

Statistical Methods for Alzheimer s Disease Studies

Statistical Methods for Alzheimer s Disease Studies Statistical Methods for Alzheimer s Disease Studies Rebecca A. Betensky, Ph.D. Department of Biostatistics, Harvard T.H. Chan School of Public Health July 19, 2016 1/37 OUTLINE 1 Statistical collaborations

More information

Exercises. (a) Prove that m(t) =

Exercises. (a) Prove that m(t) = Exercises 1. Lack of memory. Verify that the exponential distribution has the lack of memory property, that is, if T is exponentially distributed with parameter λ > then so is T t given that T > t for

More information

Survival Regression Models

Survival Regression Models Survival Regression Models David M. Rocke May 18, 2017 David M. Rocke Survival Regression Models May 18, 2017 1 / 32 Background on the Proportional Hazards Model The exponential distribution has constant

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Introduction to Statistical Analysis

Introduction to Statistical Analysis Introduction to Statistical Analysis Changyu Shen Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology Beth Israel Deaconess Medical Center Harvard Medical School Objectives Descriptive

More information

Lecture 22 Survival Analysis: An Introduction

Lecture 22 Survival Analysis: An Introduction University of Illinois Department of Economics Spring 2017 Econ 574 Roger Koenker Lecture 22 Survival Analysis: An Introduction There is considerable interest among economists in models of durations, which

More information

Fractional Imputation in Survey Sampling: A Comparative Review

Fractional Imputation in Survey Sampling: A Comparative Review Fractional Imputation in Survey Sampling: A Comparative Review Shu Yang Jae-Kwang Kim Iowa State University Joint Statistical Meetings, August 2015 Outline Introduction Fractional imputation Features Numerical

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Economics 583: Econometric Theory I A Primer on Asymptotics

Economics 583: Econometric Theory I A Primer on Asymptotics Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:

More information

Functional central limit theorems for single-stage sampling designs

Functional central limit theorems for single-stage sampling designs Functional central limit theorems for single-stage sampling designs Hélène Boistard Toulouse School of Economics, 2 allée de Brienne, 3000 Toulouse, France Hendrik P. Lopuhaä Delft Institute of Applied

More information

Survival Analysis Math 434 Fall 2011

Survival Analysis Math 434 Fall 2011 Survival Analysis Math 434 Fall 2011 Part IV: Chap. 8,9.2,9.3,11: Semiparametric Proportional Hazards Regression Jimin Ding Math Dept. www.math.wustl.edu/ jmding/math434/fall09/index.html Basic Model Setup

More information

On the Breslow estimator

On the Breslow estimator Lifetime Data Anal (27) 13:471 48 DOI 1.17/s1985-7-948-y On the Breslow estimator D. Y. Lin Received: 5 April 27 / Accepted: 16 July 27 / Published online: 2 September 27 Springer Science+Business Media,

More information

Model Assisted Survey Sampling

Model Assisted Survey Sampling Carl-Erik Sarndal Jan Wretman Bengt Swensson Model Assisted Survey Sampling Springer Preface v PARTI Principles of Estimation for Finite Populations and Important Sampling Designs CHAPTER 1 Survey Sampling

More information

Size and Shape of Confidence Regions from Extended Empirical Likelihood Tests

Size and Shape of Confidence Regions from Extended Empirical Likelihood Tests Biometrika (2014),,, pp. 1 13 C 2014 Biometrika Trust Printed in Great Britain Size and Shape of Confidence Regions from Extended Empirical Likelihood Tests BY M. ZHOU Department of Statistics, University

More information

ASYMPTOTIC PROPERTIES AND EMPIRICAL EVALUATION OF THE NPMLE IN THE PROPORTIONAL HAZARDS MIXED-EFFECTS MODEL

ASYMPTOTIC PROPERTIES AND EMPIRICAL EVALUATION OF THE NPMLE IN THE PROPORTIONAL HAZARDS MIXED-EFFECTS MODEL Statistica Sinica 19 (2009), 997-1011 ASYMPTOTIC PROPERTIES AND EMPIRICAL EVALUATION OF THE NPMLE IN THE PROPORTIONAL HAZARDS MIXED-EFFECTS MODEL Anthony Gamst, Michael Donohue and Ronghui Xu University

More information

arxiv: v2 [math.st] 20 Jun 2014

arxiv: v2 [math.st] 20 Jun 2014 A solution in small area estimation problems Andrius Čiginas and Tomas Rudys Vilnius University Institute of Mathematics and Informatics, LT-08663 Vilnius, Lithuania arxiv:1306.2814v2 [math.st] 20 Jun

More information

THESIS for the degree of MASTER OF SCIENCE. Modelling and Data Analysis

THESIS for the degree of MASTER OF SCIENCE. Modelling and Data Analysis PROPERTIES OF ESTIMATORS FOR RELATIVE RISKS FROM NESTED CASE-CONTROL STUDIES WITH MULTIPLE OUTCOMES (COMPETING RISKS) by NATHALIE C. STØER THESIS for the degree of MASTER OF SCIENCE Modelling and Data

More information

Chapter 3. Point Estimation. 3.1 Introduction

Chapter 3. Point Estimation. 3.1 Introduction Chapter 3 Point Estimation Let (Ω, A, P θ ), P θ P = {P θ θ Θ}be probability space, X 1, X 2,..., X n : (Ω, A) (IR k, B k ) random variables (X, B X ) sample space γ : Θ IR k measurable function, i.e.

More information

Calibration Estimation of Semiparametric Copula Models with Data Missing at Random

Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Shigeyuki Hamori 1 Kaiji Motegi 1 Zheng Zhang 2 1 Kobe University 2 Renmin University of China Econometrics Workshop UNC

More information

Modelling geoadditive survival data

Modelling geoadditive survival data Modelling geoadditive survival data Thomas Kneib & Ludwig Fahrmeir Department of Statistics, Ludwig-Maximilians-University Munich 1. Leukemia survival data 2. Structured hazard regression 3. Mixed model

More information

Ensemble estimation and variable selection with semiparametric regression models

Ensemble estimation and variable selection with semiparametric regression models Ensemble estimation and variable selection with semiparametric regression models Sunyoung Shin Department of Mathematical Sciences University of Texas at Dallas Joint work with Jason Fine, Yufeng Liu,

More information

Lecture 3. Truncation, length-bias and prevalence sampling

Lecture 3. Truncation, length-bias and prevalence sampling Lecture 3. Truncation, length-bias and prevalence sampling 3.1 Prevalent sampling Statistical techniques for truncated data have been integrated into survival analysis in last two decades. Truncation in

More information

SEMIPARAMETRIC LIKELIHOOD RATIO INFERENCE. By S. A. Murphy 1 and A. W. van der Vaart Pennsylvania State University and Free University Amsterdam

SEMIPARAMETRIC LIKELIHOOD RATIO INFERENCE. By S. A. Murphy 1 and A. W. van der Vaart Pennsylvania State University and Free University Amsterdam The Annals of Statistics 1997, Vol. 25, No. 4, 1471 159 SEMIPARAMETRIC LIKELIHOOD RATIO INFERENCE By S. A. Murphy 1 and A. W. van der Vaart Pennsylvania State University and Free University Amsterdam Likelihood

More information

UNIVERSITY OF CALIFORNIA, SAN DIEGO

UNIVERSITY OF CALIFORNIA, SAN DIEGO UNIVERSITY OF CALIFORNIA, SAN DIEGO Estimation of the primary hazard ratio in the presence of a secondary covariate with non-proportional hazards An undergraduate honors thesis submitted to the Department

More information

Optimal Auxiliary Variable Assisted Two-Phase Sampling Designs

Optimal Auxiliary Variable Assisted Two-Phase Sampling Designs MASTER S THESIS Optimal Auxiliary Variable Assisted Two-Phase Sampling Designs HENRIK IMBERG Department of Mathematical Sciences Division of Mathematical Statistics CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY

More information

FULL LIKELIHOOD INFERENCES IN THE COX MODEL

FULL LIKELIHOOD INFERENCES IN THE COX MODEL October 20, 2007 FULL LIKELIHOOD INFERENCES IN THE COX MODEL BY JIAN-JIAN REN 1 AND MAI ZHOU 2 University of Central Florida and University of Kentucky Abstract We use the empirical likelihood approach

More information

5601 Notes: The Sandwich Estimator

5601 Notes: The Sandwich Estimator 560 Notes: The Sandwich Estimator Charles J. Geyer December 6, 2003 Contents Maximum Likelihood Estimation 2. Likelihood for One Observation................... 2.2 Likelihood for Many IID Observations...............

More information

Manifold Learning for Subsequent Inference

Manifold Learning for Subsequent Inference Manifold Learning for Subsequent Inference Carey E. Priebe Johns Hopkins University June 20, 2018 DARPA Fundamental Limits of Learning (FunLoL) Los Angeles, California http://arxiv.org/abs/1806.01401 Key

More information

On the bias of the multiple-imputation variance estimator in survey sampling

On the bias of the multiple-imputation variance estimator in survey sampling J. R. Statist. Soc. B (2006) 68, Part 3, pp. 509 521 On the bias of the multiple-imputation variance estimator in survey sampling Jae Kwang Kim, Yonsei University, Seoul, Korea J. Michael Brick, Westat,

More information

A semiparametric generalized proportional hazards model for right-censored data

A semiparametric generalized proportional hazards model for right-censored data Ann Inst Stat Math (216) 68:353 384 DOI 1.17/s1463-14-496-3 A semiparametric generalized proportional hazards model for right-censored data M. L. Avendaño M. C. Pardo Received: 28 March 213 / Revised:

More information

Published online: 10 Apr 2012.

Published online: 10 Apr 2012. This article was downloaded by: Columbia University] On: 23 March 215, At: 12:7 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 172954 Registered office: Mortimer

More information

An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data

An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data Jae-Kwang Kim 1 Iowa State University June 28, 2012 1 Joint work with Dr. Ming Zhou (when he was a PhD student at ISU)

More information

An augmented inverse probability weighted survival function estimator

An augmented inverse probability weighted survival function estimator An augmented inverse probability weighted survival function estimator Sundarraman Subramanian & Dipankar Bandyopadhyay Abstract We analyze an augmented inverse probability of non-missingness weighted estimator

More information

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Yi-Hau Chen Institute of Statistical Science, Academia Sinica Joint with Nilanjan

More information

4. Comparison of Two (K) Samples

4. Comparison of Two (K) Samples 4. Comparison of Two (K) Samples K=2 Problem: compare the survival distributions between two groups. E: comparing treatments on patients with a particular disease. Z: Treatment indicator, i.e. Z = 1 for

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

Review and continuation from last week Properties of MLEs

Review and continuation from last week Properties of MLEs Review and continuation from last week Properties of MLEs As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that

More information

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Previous lecture P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Interaction Outline: Definition of interaction Additive versus multiplicative

More information

Attributable Risk Function in the Proportional Hazards Model

Attributable Risk Function in the Proportional Hazards Model UW Biostatistics Working Paper Series 5-31-2005 Attributable Risk Function in the Proportional Hazards Model Ying Qing Chen Fred Hutchinson Cancer Research Center, yqchen@u.washington.edu Chengcheng Hu

More information

Empirical Likelihood Methods for Sample Survey Data: An Overview

Empirical Likelihood Methods for Sample Survey Data: An Overview AUSTRIAN JOURNAL OF STATISTICS Volume 35 (2006), Number 2&3, 191 196 Empirical Likelihood Methods for Sample Survey Data: An Overview J. N. K. Rao Carleton University, Ottawa, Canada Abstract: The use

More information

ML Testing (Likelihood Ratio Testing) for non-gaussian models

ML Testing (Likelihood Ratio Testing) for non-gaussian models ML Testing (Likelihood Ratio Testing) for non-gaussian models Surya Tokdar ML test in a slightly different form Model X f (x θ), θ Θ. Hypothesist H 0 : θ Θ 0 Good set: B c (x) = {θ : l x (θ) max θ Θ l

More information

Statistics 262: Intermediate Biostatistics Non-parametric Survival Analysis

Statistics 262: Intermediate Biostatistics Non-parametric Survival Analysis Statistics 262: Intermediate Biostatistics Non-parametric Survival Analysis Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics p.1/?? Overview of today s class Kaplan-Meier Curve

More information

Definitions and examples Simple estimation and testing Regression models Goodness of fit for the Cox model. Recap of Part 1. Per Kragh Andersen

Definitions and examples Simple estimation and testing Regression models Goodness of fit for the Cox model. Recap of Part 1. Per Kragh Andersen Recap of Part 1 Per Kragh Andersen Section of Biostatistics, University of Copenhagen DSBS Course Survival Analysis in Clinical Trials January 2018 1 / 65 Overview Definitions and examples Simple estimation

More information

Robust Backtesting Tests for Value-at-Risk Models

Robust Backtesting Tests for Value-at-Risk Models Robust Backtesting Tests for Value-at-Risk Models Jose Olmo City University London (joint work with Juan Carlos Escanciano, Indiana University) Far East and South Asia Meeting of the Econometric Society

More information

Marginal Screening and Post-Selection Inference

Marginal Screening and Post-Selection Inference Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2

More information

Maximum likelihood estimation for Cox s regression model under nested case-control sampling

Maximum likelihood estimation for Cox s regression model under nested case-control sampling Biostatistics (2004), 5, 2,pp. 193 206 Printed in Great Britain Maximum likelihood estimation for Cox s regression model under nested case-control sampling THOMAS H. SCHEIKE Department of Biostatistics,

More information

Asymptotical distribution free test for parameter change in a diffusion model (joint work with Y. Nishiyama) Ilia Negri

Asymptotical distribution free test for parameter change in a diffusion model (joint work with Y. Nishiyama) Ilia Negri Asymptotical distribution free test for parameter change in a diffusion model (joint work with Y. Nishiyama) Ilia Negri University of Bergamo (Italy) ilia.negri@unibg.it SAPS VIII, Le Mans 21-24 March,

More information

δ -method and M-estimation

δ -method and M-estimation Econ 2110, fall 2016, Part IVb Asymptotic Theory: δ -method and M-estimation Maximilian Kasy Department of Economics, Harvard University 1 / 40 Example Suppose we estimate the average effect of class size

More information

Asymptotic Distributions for the Nelson-Aalen and Kaplan-Meier estimators and for test statistics.

Asymptotic Distributions for the Nelson-Aalen and Kaplan-Meier estimators and for test statistics. Asymptotic Distributions for the Nelson-Aalen and Kaplan-Meier estimators and for test statistics. Dragi Anevski Mathematical Sciences und University November 25, 21 1 Asymptotic distributions for statistical

More information

Full likelihood inferences in the Cox model: an empirical likelihood approach

Full likelihood inferences in the Cox model: an empirical likelihood approach Ann Inst Stat Math 2011) 63:1005 1018 DOI 10.1007/s10463-010-0272-y Full likelihood inferences in the Cox model: an empirical likelihood approach Jian-Jian Ren Mai Zhou Received: 22 September 2008 / Revised:

More information

Bootstrap, Jackknife and other resampling methods

Bootstrap, Jackknife and other resampling methods Bootstrap, Jackknife and other resampling methods Part III: Parametric Bootstrap Rozenn Dahyot Room 128, Department of Statistics Trinity College Dublin, Ireland dahyot@mee.tcd.ie 2005 R. Dahyot (TCD)

More information

A Simulation Study on Confidence Interval Procedures of Some Mean Cumulative Function Estimators

A Simulation Study on Confidence Interval Procedures of Some Mean Cumulative Function Estimators Statistics Preprints Statistics -00 A Simulation Study on Confidence Interval Procedures of Some Mean Cumulative Function Estimators Jianying Zuo Iowa State University, jiyizu@iastate.edu William Q. Meeker

More information

Negative Multinomial Model and Cancer. Incidence

Negative Multinomial Model and Cancer. Incidence Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence S. Lahiri & Sunil K. Dhar Department of Mathematical Sciences, CAMS New Jersey Institute of Technology, Newar,

More information

6. Fractional Imputation in Survey Sampling

6. Fractional Imputation in Survey Sampling 6. Fractional Imputation in Survey Sampling 1 Introduction Consider a finite population of N units identified by a set of indices U = {1, 2,, N} with N known. Associated with each unit i in the population

More information

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University

More information

Tests of independence for censored bivariate failure time data

Tests of independence for censored bivariate failure time data Tests of independence for censored bivariate failure time data Abstract Bivariate failure time data is widely used in survival analysis, for example, in twins study. This article presents a class of χ

More information