Large sample theory for merged data from multiple sources
|
|
- Cameron Wheeler
- 5 years ago
- Views:
Transcription
1 Large sample theory for merged data from multiple sources Takumi Saegusa University of Maryland Division of Statistics August
2 Section 1 Introduction
3 Problem: Data Integration Massive data are collected from various sources: online surveys social networks business transactions sensor networks scientific research (in a public health setting) disease registry, clinical trials, epidemiological studies, health surveys, hospital records, healthcare databases, etc. etc. The representativeness of a sample critically depends on technology for data collection A remedy: to merge multiple data sets with different coverage
4 Motivating Examples: Telephone Surveys Cell Phone Users (Data Source 1) Duplicated Selection Sample 1 Population i.i.d. Sample Landline Users (Data Source 2) Sample 2 Sampling without Replacement
5 Motivating Examples: Study on Rare Populations Maori People (Data Source 1) Population i.i.d. Sample Sampling without Replacement Auckland Population (Data Source 2)
6 Motivating Examples: Combining Medical Studies Disease Registry Cohort Study Clinical Trial Population i.i.d. Sample Health Survey Sampling without Replacement
7 Two-Stage Formulation i.i.d. Sample Sampling Data Source 1 V (1) without Replacement Sample 1 ( X (1),V (1) ) Population P 0 π(v )=n (1) / N (1) Data Source 2 V (2) π(v )=n (2) / N (2) Sample 2 ( X (2),V (2) ) V X,V The issue: Biased and Dependent Sample with Duplicated Selection Biasedness Data Sources of Different Sizes Overlapping Data Sources Dependence (Across Samples) Duplicated Selection (Within Samples) Sampling without Replacement Lack of Identification of Duplicated Items Independent Data Collection across Data Sources
8 Resemblance to Other Frameworks Stratified Sampling with Overlapping Strata (Practice) Single entity designs the entire sampling so that duplicated items can be identified (Method) Naive method produces bias from overlaps (Theory) The quantity of interest is decomposed into stratum means, and they are asymptotically independent due to the disjoint nature of strata.
9 Resemblance to Other Frameworks Stratified Sampling with Overlapping Strata (Practice) Single entity designs the entire sampling so that duplicated items can be identified (Method) Naive method produces bias from overlaps (Theory) The quantity of interest is decomposed into stratum means, and they are asymptotically independent due to the disjoint nature of strata. Multiple-frame surveys: sampling from overlapping sampling frames (Practice) Applications are limited to survey sampling (Method) Hartley s estimator works well (Theory) Finite population framework (Theory) No empirical process theory
10 Resemblance to Other Frameworks Stratified Sampling with Overlapping Strata (Practice) Single entity designs the entire sampling so that duplicated items can be identified (Method) Naive method produces bias from overlaps (Theory) The quantity of interest is decomposed into stratum means, and they are asymptotically independent due to the disjoint nature of strata. Multiple-frame surveys: sampling from overlapping sampling frames (Practice) Applications are limited to survey sampling (Method) Hartley s estimator works well (Theory) Finite population framework (Theory) No empirical process theory Meta-analysis (Practice) Does not cover overlaps in samples
11 Resemblance to Other Frameworks Stratified Sampling with Overlapping Strata (Practice) Single entity designs the entire sampling so that duplicated items can be identified (Method) Naive method produces bias from overlaps (Theory) The quantity of interest is decomposed into stratum means, and they are asymptotically independent due to the disjoint nature of strata. Multiple-frame surveys: sampling from overlapping sampling frames (Practice) Applications are limited to survey sampling (Method) Hartley s estimator works well (Theory) Finite population framework (Theory) No empirical process theory Meta-analysis (Practice) Does not cover overlaps in samples Record Linkage: Identification of duplications (Issue) Produces bias of wrong links and non-links (Theory) Requires a correctly specified model of linking errors
12 Comparison with Approaches in Sampling Theory Finite Population Super Population Ours Randomness sampling from data sources distribution on variables Model on Variables Parameter finite population parameter parameter in the model Dependence within samples across samples Applications sample mean generalized linear model? semiparametric model? Asymptotics LLN (?) CLT (?) U-LLN for a class of functions U-CLT for a class of functions Conditions super population design
13 Section 2 Empirical Process
14 Empirical Process Approach Empirical process is a stochastic process very useful in semiparametric and nonparametric models. Major tools for statistical theory Uniform LLN and Uniform CLT Rate of convergence Concentration inequalities, etc.
15 Empirical Process Approach Empirical process is a stochastic process very useful in semiparametric and nonparametric models. Major tools for statistical theory Uniform LLN and Uniform CLT Rate of convergence Concentration inequalities, etc. Let X 1,..., X n i.i.d. P taking values in X. The empirical measure is defined as P n = 1 n δ Xi n where δ x is a Dirac measure putting a unit mass at x. The empirical measure is a probability measure. The probability of the event A X under P n is P n(a) = 1 n n 1 A (X i ) = #{X i : X i A}, n and the expectation of f (X ) under P n is P nf = 1 n f (X i ). n
16 The empirical process indexed by the class F of functions on X is defined as G n = n(p n P).
17 The empirical process indexed by the class F of functions on X is defined as G n = n(p n P). This is a stochastic process indexed by F, i.e., given f F, the following random variable is obtained: G n f = n(p n f Pf ) = ( ) 1 n n f (X i ) Pf. n Here Pf = E P f (X ) is the expectation of f (X ) under P.
18 The empirical process indexed by the class F of functions on X is defined as G n = n(p n P). This is a stochastic process indexed by F, i.e., given f F, the following random variable is obtained: G n f = n(p n f Pf ) = ( ) 1 n n f (X i ) Pf. n Here Pf = E P f (X ) is the expectation of f (X ) under P. Examples of index sets are F = {t 1 (,t] (s) : t R} yields P n 1 (,t] = F n (t) F = {x log p θ (x) : θ Θ}
19 An important goal of modern empirical process theory is to provide a uniform control of the sample average over the class of functions.
20 An important goal of modern empirical process theory is to provide a uniform control of the sample average over the class of functions. The class F of functions on X is called P-Glivenko-Cantelli if P n P F sup P nf Pf P or a.s. 0. f F
21 An important goal of modern empirical process theory is to provide a uniform control of the sample average over the class of functions. The class F of functions on X is called P-Glivenko-Cantelli if P n P F sup P nf Pf P or a.s. 0. f F The class F of functions on X is called P-Donsker if G n G in l (F), where G is the P-Brownian bridge, a Gaussian process with covariance function ρ P (f, g) = Cov P (f (X ), g(x )) = Pfg (Pf )(Pg) for f, g F.
22 An important goal of modern empirical process theory is to provide a uniform control of the sample average over the class of functions. The class F of functions on X is called P-Glivenko-Cantelli if P n P F sup P nf Pf P or a.s. 0. f F The class F of functions on X is called P-Donsker if G n G in l (F), where G is the P-Brownian bridge, a Gaussian process with covariance function ρ P (f, g) = Cov P (f (X ), g(x )) = Pfg (Pf )(Pg) for f, g F. At f, g F, this implies ( Gnf G ng ) d ( Gf Gg ) (( 0 N 0 ) ( ρp (f, f ) ρ, P (f, g) ρ P (f, g) ρ P (g, g) )). There exits a version of a Gaussian process with sample continuity. Here we further have asymptotic equicontinuity: sup G n(f g) = o P (1), as δ 0. ρ P (f,g)<δ
23 Why Empirical Process Theory? We have enough tools already? Regularity conditions Calculus Law of Large Numbers (LLN) Central Limit Theorem (CLT) Martingale theory if you like
24 Motivating Example: Uniform LLN M-estimator ˆθ n = argmax θ Θ P n m(θ) Condition for Consistency (van der Vaart 1998, Theorem 5.7) sup P n m(θ) Pm(θ) P 0. θ Θ
25 Motivating Example: Uniform LLN M-estimator ˆθ n = argmax θ Θ P n m(θ) Condition for Consistency (van der Vaart 1998, Theorem 5.7) sup P n m(θ) Pm(θ) P 0. θ Θ In some literature, it is claimed that under regularity conditions, the law of large numbers yields 1 n log (X pˆθn i ) a.s. E log p θ0 (X ) (?) n
26 Motivating Example: Uniform LLN M-estimator ˆθ n = argmax θ Θ P n m(θ) Condition for Consistency (van der Vaart 1998, Theorem 5.7) sup P n m(θ) Pm(θ) P 0. θ Θ In some literature, it is claimed that under regularity conditions, the law of large numbers yields 1 n log (X pˆθn i ) a.s. E log p θ0 (X ) (?) n The law of large numbers requires the independent summand: 1 n log pˆθ n n (X i ) }{{} Independent?
27 Motivating Example: Uniform LLN M-estimator ˆθ n = argmax θ Θ P n m(θ) Condition for Consistency (van der Vaart 1998, Theorem 5.7) sup P n m(θ) Pm(θ) P 0. θ Θ In some literature, it is claimed that under regularity conditions, the law of large numbers yields 1 n log (X pˆθn i ) a.s. E log p θ0 (X ) (?) n The law of large numbers requires the independent summand: 1 n log pˆθ n n (X i ) }{{} Independent? The sample X 1,..., X n is independent but ˆθ n depends on X 1,..., X n. Hence log pˆθn (X 1 ),..., log pˆθn (X n ) are dependent.
28 The Glivenko-Cantelli theorem and the dominated convergence theorem yield 1 n n log p ˆθ n (X i ) = 1 n log p n ˆθ n (X i ) E log p ˆθ n (X ) + E log p ˆθ n (X ) 1 n sup log p θ (X i ) E log p θ (X ) n + E log p (X ) ˆθn θ Θ 0 + E log p θ0 (X )
29 Motivating Example: Asymptotic Equicontinuity The MLE solves the likelihood equation (1/n) n l ˆθ n (X i ) = 0. For asymptotic normality, Taylor s theorem from Calculus yields 0 = 1 n l n ˆθn (X i ) = 1 n l θ0 (X i ) + 1 n l θ n n n (X i )(ˆθ n θ 0 ). Hence we can apply LLN and CLT to obtain n(ˆθ n θ 0 ) = ( 1 n ) n 1 n 1 n l θ n (X i ) l θ0 (X i ) d X N(0, I 1 ). n
30 Motivating Example: Asymptotic Equicontinuity The MLE solves the likelihood equation (1/n) n l ˆθ n (X i ) = 0. For asymptotic normality, Taylor s theorem from Calculus yields 0 = 1 n l n ˆθn (X i ) = 1 n l θ0 (X i ) + 1 n l θ n n n (X i )(ˆθ n θ 0 ). Hence we can apply LLN and CLT to obtain n(ˆθ n θ 0 ) = ( 1 n ) n 1 n 1 n l θ n (X i ) l θ0 (X i ) d X N(0, I 1 ). n Rubin-Bleuer and Kratina (Annals of Statistics, 2005) adopted a two-phase framework for estimating a Euclidean parameter from estimating equations: N(ˆθN θ 0 ) = N(ˆθN θ N ) }{{} Asymptotic Normality from Design Conditions + N(θN θ 0 ) }{{} Asymptotic Normality from Superpopulation Condition where ˆθ N is a solution to weighted estimating equations and θ N is a solution of unweighted estimating equations.
31 The previous argument works for many parametric models. In survival analysis, however, semiparametric models play a pivotal role in determining effects of treatments and risk factors. A semiparametric model is a collection of probability measures indexed by a finite-dimensional parameter, and a infinite-dimensional parameter. An example is the Cox proportional hazards model with regression parameter β R d and the cumulative hazard function Λ in the class of positive increasing functions. The conditional hazard function given covariates X = x is λ(t x) = λ 0 (t) exp(x T β).
32 The previous argument works for many parametric models. In survival analysis, however, semiparametric models play a pivotal role in determining effects of treatments and risk factors. A semiparametric model is a collection of probability measures indexed by a finite-dimensional parameter, and a infinite-dimensional parameter. An example is the Cox proportional hazards model with regression parameter β R d and the cumulative hazard function Λ in the class of positive increasing functions. The conditional hazard function given covariates X = x is λ(t x) = λ 0 (t) exp(x T β). The following is the likelihood for the Cox model with current status data. Can you use the Taylor expansion around θ 0 = (β 0, Λ 0 ) as usual? l β (θ) = 1 n X i e βt X i 1 e Λ(Y i ) eβ T Xi Λ(Yi ) i (1 i ) n e eβt X i Λ(Yi )
33 Suppose the asymptotic equicontinuity condition holds: 1 n n l n ˆθn (X i ) E log l ˆθn (X ) 1 n n l θ0 (X i ) E log l θ0 (X ) n = o P (1 + n ˆθ n θ 0 ). Since (1/n) n l ˆθn (X i ) = 0 and E l θ0 (X ) = 0, it follows n(e l ˆθ n (X ) E l θ0 (X )) = { } 1 n n l n ˆθn (X i ) E log l ˆθn (X ) = { } 1 n n l θ0 (X i ) E log n l θ0 (X ) + o P (1 + n ˆθ n θ 0 ) If θ E l θ (X ) is differentiable at θ 0 and n ˆθ n θ 0 = O P (1), we obtain ne l θ0 (X )(ˆθ n θ 0 ) = n { 1 n } n l θ0 (X i ) E log l θ0 (X ) + o P (1). For semiparametric models, the derivative E l θ0 is replaced by the functional derivative. the number of the likelihood equations becomes infinitely many.
34 Motivating Example: Martingale The Cox s partial likelihood score can be written as n τ H i (s) d M i (s) 0 }{{}}{{} Predictable Process Martingale to which Martingale Central Limit Theorem applies.
35 Motivating Example: Martingale The Cox s partial likelihood score can be written as n τ H i (s) d M i (s) 0 }{{}}{{} Predictable Process Martingale to which Martingale Central Limit Theorem applies. In the analysis of complex samling data where sampling depends on the event, we analyze inverse probability weighted partial likelihood score n τ 0 W }{{} i H(s) }{{} Not Predictable Predictable Process so that the Martingale CLT does not apply. d M(s) }{{} Martingale the Martingale CLT and Empirical process approaches must address dependence issues from complex sampling but the former approach intrinsically fails to address sampling that depends on events even if dependence can be addressed.
36 Some Literature on the Cox Model in Sampling Theory D.Y. Lin, On fitting Cox s proportional hazards models to survey data, Biometrika 87 (2000) The paper simply cited Andersen and Gill (Annals of Statistics 10(4) ) for consistency but there are too many difficulties left to the reader (martingale, LLN, etc.) The paper simply assumes the existence of the U-CLT a priori. S. Rubin-Bleuer, The proportional hazards model for survey data from independent and clustered super-populations, Journal of Multivariate Analysis 102 (2011), Most parts assumes sampling does not depend on the event so that the martingale CLT can be used Consistency results counts on K.H. Yuan an R. Jennrich ( J. Multivariate Anal. 65,1998, ) where the uniform LLN are assumed that this condition is not verified in the paper. The last part where sampling depend on the event counts on Lin (2000).
37 Some Literature on Empirical Process Theory on Complex Surveys Bertail, P., Chautru, E. and Clémençon, S. (2017). Empirical processes in survey sampling with (conditional) Poisson designs. Scand. J. Stat Rejective Sampling U-LLN is not obtained Boistard, H., Lopuha a, H. P. and Ruiz-Gazen, A. (2017). Functional central limit theorems for single-stage sampling designs. Ann. Statist Single-stage sampling in a general way finite dimensional CLT is assumed U-LLN is not obtained A class of functions is restricted to a indicator function of variables less than some number Daniel Bonnéry, F. Jay Breidt, and François Coquet, Uniform convergence of the empirical cumulative distribution function under informative selection from a finite population A class of functions is restricted to a indicator function of variables less than some number
38 Section 3 Our Approach
39 Setting (2 Data Sources) V : auxiliary variables available for all items V 1,..., V N i.i.d. V (j), j = 1, 2: sampling frames of size N (j) (V (1) V (2) ) V V (j) means the item belong to source j n (j) items are selected without replacement from source j R (j), j = 1, 2: sampling indicator from source j (R (j) i s with V i V (j) are dependent) π (j) (v): sampling probability from source j (e.g. π (1) (v) = n (1) /N (1) I V (1) (v)) n (j) /N (j) p (j) > 0. X : only available on selected items i.i.d. Sample Sampling Data Source 1 V (1) without Replacement Sample 1 ( X (1),V (1) ) Population P 0 π(v )=n (1) / N (1) Data Source 2 V (2) π(v )=n (2) / N (2) Sample 2 ( X (2),V (2) ) V X,V
40 Canonical Estimator: Hartley s Estimator Solution to Duplicated Selection: Hartleys estimator (1962, 1974) reweighing function ρ : V R 2 ρ(v) = ( ) ρ (1) (v), ρ (2) (v) = where c 1 + c 2 = 1. (1, 0) if v V (1) ( V (2)) c (0, 1) if v ( V (1)) c V (2) (c 1, c 2 ) if v V (1) V (2)
41 Canonical Estimator: Hartley s Estimator Solution to Duplicated Selection: Hartleys estimator (1962, 1974) reweighing function ρ : V R 2 ρ(v) = ( ) ρ (1) (v), ρ (2) (v) = where c 1 + c 2 = 1. Hartley s estimator of X = (1/N) N X i is 1 N N R (1) i π (1) (V i ) }{{} Inverse probability ρ (1) (V i ) }{{} Reweighing for Source 1 weighting for source 1 (1, 0) if v V (1) ( V (2)) c (0, 1) if v ( V (1)) c V (2) (c 1, c 2 ) if v V (1) V (2) + R (2) i π (2) (V i ) }{{} Inverse probability ρ (2) (V i ) }{{} Reweighing for Source 2 weighting for source 2 X i
42 Unbiasedness: Biased Sampling: E[R (j) V, X ] = π (j) (V ). ρ (1) (v) + ρ (2) (v) = 1 for every v. No identification of duplicated items: 1 N R (1) i N π (1) (V i ) ρ(1) (V i )X i + 1 N R (2) i N π (2) (V i ) ρ(2) (V i )X i }{{}}{{} computed from sample from source 1 computed from sample from source 2
43 Hartley-Type Empirical Process The empirical measure is replaced by Hartley s estimator of the distribution function. Define Hartley-type inverse probability weighted (H-IPW) empirical measure by P H N = 1 N ( N R (1) i π (1) (V i ) ρ(1) (V i ) + ) R(2) i π (2) (V i ) ρ(2) (V i ) δ Xi Note that this is NOT a probability measure measure: ( ) P H N1 = 1 N R (1) i N π (1) (V i ) ρ(1) (V i ) + R(2) i π (2) (V i ) ρ(2) (V i ) 1 in general in contrast to P n 1 = 1. Define H-IPW empirical process by G H N = N(P H N P).
44 Decomposition of H-Empirical Process Key Idea 1: Decompose H-Empirical Process into different sampling: Stage 1 + Stage 2: G H Nf = N(P H N P)f + N(P N P N )f = N(P N P)f + N(P H N P N )f = G N f }{{} Sampling from Population + N(P H N P N )f }{{} Sampling from Data Sources
45 Decomposition of H-Empirical Process Key Idea 1: Decompose H-Empirical Process into different sampling: Stage 1 + Stage 2: G H Nf = N(P H N P)f + N(P N P N )f = N(P N P)f + N(P H N P N )f = G N f + N(P H N P }{{} N )f }{{} Sampling from Population Sampling from Data Sources It can be shown that G N f and N(P H N P N )f are uncorrelated. If the latter processes converge to Gaussian process, the limiting process of G H N is a sum of independent Gaussian processes.
46 Decomposition of H-Empirical Process Key Idea 1: Decompose H-Empirical Process into different sampling: Stage 1 + Stage 2: G H Nf = N(P H N P)f + N(P N P N )f = N(P N P)f + N(P H N P N )f = G N f + N(P H N P }{{} N )f }{{} Sampling from Population Sampling from Data Sources It can be shown that G N f and N(P H N P N )f are uncorrelated. If the latter processes converge to Gaussian process, the limiting process of G H N is a sum of independent Gaussian processes. It follows by Donsker theorem, G N G.
47 Key Idea 2: View sampling from sources as a single realization of m out of n without-replacement bootstrap with m = n (j) and n = N (j). The average within data source j before sampling P (j) N (j) f = 1 N (j) i:v i V (j) f (X i ) plays a role of sample average in a bootstrap framework. The average within data source j after sampling ˆP (j) n (j) f = 1 n (j) i:v i V (j) R (j) i f (X (j),i ) plays a role of bootstrap sample average in a bootstrap framework.
48 Key Idea 2: View sampling from sources as a single realization of m out of n without-replacement bootstrap with m = n (j) and n = N (j). The average within data source j before sampling P (j) N (j) f = 1 N (j) i:v i V (j) f (X i ) plays a role of sample average in a bootstrap framework. The average within data source j after sampling ˆP (j) n (j) f = 1 n (j) i:v i V (j) R (j) i f (X (j),i ) plays a role of bootstrap sample average in a bootstrap framework. Sampling from each source: N(P H N P N )f = 2 N (j) N (j) (ˆP (j) P (j) )ρ (j) f N n (j) N (j) j=1 where with reindexing X (j),i, i = 1,..., N (j), j = 1, 2. It can be shown that G N and N (j) /N N (j) (ˆP (j) P (j) ) are all n (j) N uncorrelated. (j)
49 Theorem (Uniform Law of Large Numbers) Suppose the class F of measurable functions is P-Glivenko-Cantelli. Then P H N P F = sup P H Nf Pf P 0. f F Theorem (Uniform Central Limit Theorem) Suppose the class F of measurable functions is the P-Donsker. Then 2 G H N G + P(V V (j) 1 p ) (j) G (j) ρ (j), j=1 where P-Brownian bridge G, and P (j) -Brownian bridge G (j) are all independent. Here P (j) is a conditional probability measure given membership in source j. p (j)
50 Implications Asymptotic distribution of G H N f : with Σ H = Var(f (X )) + }{{} Population Variance G H Nf d Z H N(0, Σ H ) 2 P(V V (j) ) 1 p(j) Var {ρ (j) (V )f (X ) V (j)}. j=1 p (j) } {{ } Design Variance Optimal ρ (j) and Optimal Calibration (Deville and Sarndal, JASA, 1992) can be derived from this variance formula. (Our Approach) Optimal based on the limiting distribution of G H N f = N(P H N P)f. (Finite Population Approach, Lohr and Rao, Estimation in multiple-frame surveys, JASA, 2006, 101, ) Optimality based on the variable P H N f.
51 Calibration General Idea (Deville and Sarndal, JASA 1992): Improve estimation of Horvitz-Thompson estimator of P N X = (1/N) N X i by the following relationship 1 N 1 N R i P N V = V i N N π(v i ) V i }{{}}{{} Sample Average Horvitz-Thompson Estimator
52 Calibration General Idea (Deville and Sarndal, JASA 1992): Improve estimation of Horvitz-Thompson estimator of P N X = (1/N) N X i by the following relationship 1 N 1 N R i P N V = V i N N π(v i ) V i }{{}}{{} Sample Average Horvitz-Thompson Estimator For Multiple-frame sampling, Ranalli et al (2018) uses several different relationship to improve the estimation of P N X. One of them is to use the following idea: P N V }{{} P H }{{ NV } Sample Average Hartley s estimator
53 Another method considered by Ranalli et al. (2018) uses the relationship given by P N VI {V V (j) } }{{} P H NVI {V V (j) } }{{} Sample Average in Source j Hartley s estimator in Source j
54 Another method considered by Ranalli et al. (2018) uses the relationship given by P N VI {V V (j) } }{{} P H NVI {V V (j) } }{{} Sample Average in Source j Hartley s estimator in Source j Our approach: 1 N (j) ρ (j) (V i )V i I {V i V (j) } i:v i V (j) }{{} Sample Average in Source j 1 R (j) i N (j) π (j) (V i ) ρ(j) (V i )V i i:v i V (j) }{{} Horvitz-Thompson Estimator in Sourcej
55 Another method considered by Ranalli et al. (2018) uses the relationship given by P N VI {V V (j) } }{{} P H NVI {V V (j) } }{{} Sample Average in Source j Hartley s estimator in Source j Our approach: 1 N (j) ρ (j) (V i )V i I {V i V (j) } i:v i V (j) }{{} Sample Average in Source j 1 R (j) i N (j) π (j) (V i ) ρ(j) (V i )V i i:v i V (j) }{{} Horvitz-Thompson Estimator in Sourcej Method Ours Ranalli (2) Which variables? ρ (1) (V )V V in source 1 ρ (2) (V )V V in source 2 Where variable come from? Sampling from source 1 Both sampling Sampling from source 2 Both sampling What is computed Horvitz-Thompson Hartley Horvitz-Thompson Hartley
56 General semiparametric model Semiparametric model: X P θ,η P (parametric) θ Θ R p (nonparametric) η H B, B, a Banach space Scores l θ,η for θ and B θ,η h for η with h H, Hilbert space Efficient influence function l 0 Semiparametric efficiency bound I = E l 0 Assumptions (for complete data) Smoothness of the model Asymptotic equicontinuity
57 Hartley-type Weighted Likelihood Estimator (H-WLE) The MLE (ˆθ, ˆη) for complete data is obtained from the likelihood equations; 1 N 1 N N lˆθ,ˆη (X i) = o P (N 1/2 ), N Bˆθ,ˆη h(x i) = o P (N 1/2 ). The WLE (ˆθ N, ˆη N ) is obtained from the weighted likelihood equations; 1 N 1 N N N { R (1) i π (1) (V i ) ρ(1) (V i ) + { (1) R i π (1) (V i ) ρ(1) (V i ) + R } (2) i π (2) (V i ) ρ(2) (V i ) l ˆθ N, ˆη (X N i ) = o P (N 1/2 ), R (2) i π (2) (V i ) ρ(2) (V i ) } B ˆθ N, ˆη N h(x i ) = o P (N 1/2 ).
58 Hartley-type Weighted Likelihood Estimator (H-WLE) The MLE (ˆθ, ˆη) for complete data is obtained from the likelihood equations; 1 N 1 N N lˆθ,ˆη (X i) = o P (N 1/2 ), N Bˆθ,ˆη h(x i) = o P (N 1/2 ). The WLE (ˆθ N, ˆη N ) is obtained from the weighted likelihood equations; 1 N 1 N N N { R (1) i π (1) (V i ) ρ(1) (V i ) + { (1) R i π (1) (V i ) ρ(1) (V i ) + R } (2) i π (2) (V i ) ρ(2) (V i ) l ˆθ N, ˆη (X N i ) = o P (N 1/2 ), R (2) i π (2) (V i ) ρ(2) (V i ) } B ˆθ N, ˆη N h(x i ) = o P (N 1/2 ).
59 Theorem (H-WLE for Data Integration) Assume the WLE s are consistent, and ˆη N η 0 = O P (N β ). Then N(ˆθ N θ 0 ) Z N(0, Σ), and Σ = J I P(V V (j) ) 1 p j Var(ρ (j) (V ) l 0 V (j) ). }{{} p j=1 j Superpopulation Variance }{{} Design Variances
60 Variance Estimation: Plug-in Estimator For sampling fraction and data source probability ˆp j = n(j) N (j), P(V V (j) ) = N(j) N For the inverse of the efficient information I = E l 0, For variance from data sources, I 1 0 = P H 2 N l ˆθ N,ˆη N Var(ρ (j) l 0 V (j) ) = 1 N (j) N (j) R (j) { } (j),i 2 ρ (j) (V π (j) (j),i ) lˆθ (V (j),i ) N,ˆη N (X (j),i ) N (j) R (j) (j),i π (j) (V (j),i ) ρ(j) (V (j),i ) lˆθ N,ˆη N (X (j),i ) 2
61 Section 4 Numerical Study
62 Simulation for Cox Model with Right Censoring T Weibull(α, β): time to event C Uniform(0, c): censoring variable Y = min{t, C}, = I {T C}. covariates Z 1 Bernoulli(1/2), Z 2 N(0, 1) Z 1 is collected only in the final combined sample Data sources V (j) are created from V = (Y,, Z 2 )
63 V (1) V (2) N N (1) N (2) n (1) n (2) Duplication Scenario 1 Z 2 1 Z Scenario 2 V Z Scenario 3 V = Duplication N N (1) N (2) N (3) n (1) n (2) n (3) twice thrice Scenario Table: Sample sizes and the numbers of duplications based on 2000 simulated datasets. In Scenarion 4, V (1) = { = 1}, and membership in V (2) {V (3) } C, V (2) V (3), and {V (2) } C V (3) are determined via multinomial logistic regression on Z 2
64 θ 1 = θ 2 log 2 0 N Scenario 1 Scenario 3 θ 1 Bias SD SEE θ 2 Bias SD SEE Scenario 2 Scenario 4 θ 1 Bias SD SEE θ 2 Bias SD SEE Table: Bias, an absolute Monte Carlo sample bias; SD, a Monte Carlo sample standard deviation; SEE, average of a plug-in estimator of a standard error.
65 (α, β) = (.2,.5) N = 500 N = θ 1 = log(2) w/o SC C DC w/o SC C DC MLE S SF B θ 2 = log(2) w/o SC C DC w/o SC C DC MLE S SF B Note: S, the proposed weights; SF, ρ for a single-frame estimator; B, a balanced weights; w/o, non-calibration; SC, the proposed calibration; C, the standard calibration; DC, the data-source-specific calibration. All calibrations use U and Y.
66 National Wilms Tumor Study Complete information is available for comparison of designs N = 1957 Histology is determined in the final sample Event = Relapse of Wilms Tumor Data integration (n = 506 with 68 duplications) Data Source 1: Death (all sampled) Data Source 2: Unfavorable Histology (50% sampled) Data Source 3: Entire Cohort (10% sampled) Stratified Sample (n = 502) Stratum 1: Death (all sampled) Stratum 2: Alive with Unfavorable Histology (50% sampled) Stratum 3: the rest (14% sampled)
67 Full cohort Data integration Stratified sampling ρ Proposed Balanced # subjects (506 with duplication) 502 Duplication 0 64 (twice) 2 (thrice) 0 Partial likelihood Variable ˆθ SE ˆθ SE ˆθ SE ˆθ SE Histology Age Stage (III/IV) Tumor Stage Tumor Note: Histology is measured at a central laboratory. Table: Point estimates and estimated standard errors in the analysis of the NWTS study with different sampling schemes. Proposed means results for the estimator with proposed ρ (j) and Balanced means results for the estimator with the value for ρ (j) across sources.
68 Section 5 Discussion
69 Empirical process theory is shown to be extended to data integration problems. Empirical process theory is powerful tool to study semiparametric models under complex surveys. Discussion above for more than 2 sources and stratified sampling from each source as an alternative to sampling without replacement are straightforward. Other sampling designs to combine different data sources are to be investigated. Many methods proposed in the i.i.d. setting should be modified to accomodate sampling procedures.
70 Thank you!
Estimation for two-phase designs: semiparametric models and Z theorems
Estimation for two-phase designs:semiparametric models and Z theorems p. 1/27 Estimation for two-phase designs: semiparametric models and Z theorems Jon A. Wellner University of Washington Estimation for
More informationEfficiency of Profile/Partial Likelihood in the Cox Model
Efficiency of Profile/Partial Likelihood in the Cox Model Yuichi Hirose School of Mathematics, Statistics and Operations Research, Victoria University of Wellington, New Zealand Summary. This paper shows
More informationWeighted likelihood estimation under two-phase sampling
Weighted likelihood estimation under two-phase sampling Takumi Saegusa Department of Biostatistics University of Washington Seattle, WA 98195-7232 e-mail: tsaegusa@uw.edu and Jon A. Wellner Department
More informationAn empirical-process central limit theorem for complex sampling under bounds on the design
An empirical-process central limit theorem for complex sampling under bounds on the design effect Thomas Lumley 1 1 Department of Statistics, Private Bag 92019, Auckland 1142, New Zealand. e-mail: t.lumley@auckland.ac.nz
More informationSurvival Analysis for Case-Cohort Studies
Survival Analysis for ase-ohort Studies Petr Klášterecký Dept. of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, harles University, Prague, zech Republic e-mail: petr.klasterecky@matfyz.cz
More informationPower and Sample Size Calculations with the Additive Hazards Model
Journal of Data Science 10(2012), 143-155 Power and Sample Size Calculations with the Additive Hazards Model Ling Chen, Chengjie Xiong, J. Philip Miller and Feng Gao Washington University School of Medicine
More informationEstimation and Inference of Quantile Regression. for Survival Data under Biased Sampling
Estimation and Inference of Quantile Regression for Survival Data under Biased Sampling Supplementary Materials: Proofs of the Main Results S1 Verification of the weight function v i (t) for the lengthbiased
More informationIntroduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models
Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations
More informationOn Testing for Informative Selection in Survey Sampling 1. (plus Some Estimation) Jay Breidt Colorado State University
On Testing for Informative Selection in Survey Sampling 1 (plus Some Estimation) Jay Breidt Colorado State University Survey Methods and their Use in Related Fields Neuchâtel, Switzerland August 23, 2018
More information11 Survival Analysis and Empirical Likelihood
11 Survival Analysis and Empirical Likelihood The first paper of empirical likelihood is actually about confidence intervals with the Kaplan-Meier estimator (Thomas and Grunkmeier 1979), i.e. deals with
More informationSTATS 200: Introduction to Statistical Inference. Lecture 29: Course review
STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout
More informationM- and Z- theorems; GMM and Empirical Likelihood Wellner; 5/13/98, 1/26/07, 5/08/09, 6/14/2010
M- and Z- theorems; GMM and Empirical Likelihood Wellner; 5/13/98, 1/26/07, 5/08/09, 6/14/2010 Z-theorems: Notation and Context Suppose that Θ R k, and that Ψ n : Θ R k, random maps Ψ : Θ R k, deterministic
More informationUniversity of California, Berkeley
University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 24 Paper 153 A Note on Empirical Likelihood Inference of Residual Life Regression Ying Qing Chen Yichuan
More informationStatistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach
Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score
More informationLecture 5 Models and methods for recurrent event data
Lecture 5 Models and methods for recurrent event data Recurrent and multiple events are commonly encountered in longitudinal studies. In this chapter we consider ordered recurrent and multiple events.
More information1 Glivenko-Cantelli type theorems
STA79 Lecture Spring Semester Glivenko-Cantelli type theorems Given i.i.d. observations X,..., X n with unknown distribution function F (t, consider the empirical (sample CDF ˆF n (t = I [Xi t]. n Then
More informationCombining multiple observational data sources to estimate causal eects
Department of Statistics, North Carolina State University Combining multiple observational data sources to estimate causal eects Shu Yang* syang24@ncsuedu Joint work with Peng Ding UC Berkeley May 23,
More informationSTAT 512 sp 2018 Summary Sheet
STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}
More informationIntroduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued
Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research
More informationGraybill Conference Poster Session Introductions
Graybill Conference Poster Session Introductions 2013 Graybill Conference in Modern Survey Statistics Colorado State University Fort Collins, CO June 10, 2013 Small Area Estimation with Incomplete Auxiliary
More informationStatistical Properties of Numerical Derivatives
Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective
More informationModification and Improvement of Empirical Likelihood for Missing Response Problem
UW Biostatistics Working Paper Series 12-30-2010 Modification and Improvement of Empirical Likelihood for Missing Response Problem Kwun Chuen Gary Chan University of Washington - Seattle Campus, kcgchan@u.washington.edu
More informationFractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling
Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling Jae-Kwang Kim 1 Iowa State University June 26, 2013 1 Joint work with Shu Yang Introduction 1 Introduction
More informationNonresponse weighting adjustment using estimated response probability
Nonresponse weighting adjustment using estimated response probability Jae-kwang Kim Yonsei University, Seoul, Korea December 26, 2006 Introduction Nonresponse Unit nonresponse Item nonresponse Basic strategy
More informationPENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA
PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA Kasun Rathnayake ; A/Prof Jun Ma Department of Statistics Faculty of Science and Engineering Macquarie University
More informationMissing covariate data in matched case-control studies: Do the usual paradigms apply?
Missing covariate data in matched case-control studies: Do the usual paradigms apply? Bryan Langholz USC Department of Preventive Medicine Joint work with Mulugeta Gebregziabher Larry Goldstein Mark Huberman
More informationEMPIRICAL ENVELOPE MLE AND LR TESTS. Mai Zhou University of Kentucky
EMPIRICAL ENVELOPE MLE AND LR TESTS Mai Zhou University of Kentucky Summary We study in this paper some nonparametric inference problems where the nonparametric maximum likelihood estimator (NPMLE) are
More informationGoodness-of-fit tests for the cure rate in a mixture cure model
Biometrika (217), 13, 1, pp. 1 7 Printed in Great Britain Advance Access publication on 31 July 216 Goodness-of-fit tests for the cure rate in a mixture cure model BY U.U. MÜLLER Department of Statistics,
More informationPhilosophy and Features of the mstate package
Introduction Mathematical theory Practice Discussion Philosophy and Features of the mstate package Liesbeth de Wreede, Hein Putter Department of Medical Statistics and Bioinformatics Leiden University
More informationCalibration Estimation of Semiparametric Copula Models with Data Missing at Random
Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Shigeyuki Hamori 1 Kaiji Motegi 1 Zheng Zhang 2 1 Kobe University 2 Renmin University of China Institute of Statistics
More informationPrerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3
University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.
More informationWeighted Likelihood for Semiparametric Models and Two-phase Stratified Samples, with Application to Cox Regression
doi:./j.467-9469.26.523.x Board of the Foundation of the Scandinavian Journal of Statistics 26. Published by Blackwell Publishing Ltd, 96 Garsington Road, Oxford OX4 2DQ, UK and 35 Main Street, Malden,
More informationA note on multiple imputation for general purpose estimation
A note on multiple imputation for general purpose estimation Shu Yang Jae Kwang Kim SSC meeting June 16, 2015 Shu Yang, Jae Kwang Kim Multiple Imputation June 16, 2015 1 / 32 Introduction Basic Setup Assume
More informationPairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion
Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Glenn Heller and Jing Qin Department of Epidemiology and Biostatistics Memorial
More informationWEIGHTED LIKELIHOOD ESTIMATION UNDER TWO-PHASE SAMPLING. BY TAKUMI SAEGUSA 1 AND JON A. WELLNER 2 University of Washington
The Annals of Statistics 2013, Vol. 41, o. 1, 269 295 DOI: 10.1214/12-AOS1073 Institute of Mathematical Statistics, 2013 WEIGHTED LIKELIHOOD ESTIMATIO UDER TWO-PHASE SAMPLIG BY TAKUMI SAEGUSA 1 AD JO A.
More informationHypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations
Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations Takeshi Emura and Hisayuki Tsukuma Abstract For testing the regression parameter in multivariate
More informationA COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky
A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky Empirical likelihood with right censored data were studied by Thomas and Grunkmier (1975), Li (1995),
More informationEfficient Semiparametric Estimators via Modified Profile Likelihood in Frailty & Accelerated-Failure Models
NIH Talk, September 03 Efficient Semiparametric Estimators via Modified Profile Likelihood in Frailty & Accelerated-Failure Models Eric Slud, Math Dept, Univ of Maryland Ongoing joint project with Ilia
More informationA Least-Squares Approach to Consistent Information Estimation in Semiparametric Models
A Least-Squares Approach to Consistent Information Estimation in Semiparametric Models Jian Huang Department of Statistics University of Iowa Ying Zhang and Lei Hua Department of Biostatistics University
More informationGraduate Econometrics I: Maximum Likelihood I
Graduate Econometrics I: Maximum Likelihood I Yves Dominicy Université libre de Bruxelles Solvay Brussels School of Economics and Management ECARES Yves Dominicy Graduate Econometrics I: Maximum Likelihood
More informationData Integration for Big Data Analysis for finite population inference
for Big Data Analysis for finite population inference Jae-kwang Kim ISU January 23, 2018 1 / 36 What is big data? 2 / 36 Data do not speak for themselves Knowledge Reproducibility Information Intepretation
More informationOther Survival Models. (1) Non-PH models. We briefly discussed the non-proportional hazards (non-ph) model
Other Survival Models (1) Non-PH models We briefly discussed the non-proportional hazards (non-ph) model λ(t Z) = λ 0 (t) exp{β(t) Z}, where β(t) can be estimated by: piecewise constants (recall how);
More informationComparing Distribution Functions via Empirical Likelihood
Georgia State University ScholarWorks @ Georgia State University Mathematics and Statistics Faculty Publications Department of Mathematics and Statistics 25 Comparing Distribution Functions via Empirical
More informationTheoretical Statistics. Lecture 17.
Theoretical Statistics. Lecture 17. Peter Bartlett 1. Asymptotic normality of Z-estimators: classical conditions. 2. Asymptotic equicontinuity. 1 Recall: Delta method Theorem: Supposeφ : R k R m is differentiable
More informationStatistical Methods for Alzheimer s Disease Studies
Statistical Methods for Alzheimer s Disease Studies Rebecca A. Betensky, Ph.D. Department of Biostatistics, Harvard T.H. Chan School of Public Health July 19, 2016 1/37 OUTLINE 1 Statistical collaborations
More informationExercises. (a) Prove that m(t) =
Exercises 1. Lack of memory. Verify that the exponential distribution has the lack of memory property, that is, if T is exponentially distributed with parameter λ > then so is T t given that T > t for
More informationSurvival Regression Models
Survival Regression Models David M. Rocke May 18, 2017 David M. Rocke Survival Regression Models May 18, 2017 1 / 32 Background on the Proportional Hazards Model The exponential distribution has constant
More informationStatistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation
Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider
More informationIntroduction to Statistical Analysis
Introduction to Statistical Analysis Changyu Shen Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology Beth Israel Deaconess Medical Center Harvard Medical School Objectives Descriptive
More informationLecture 22 Survival Analysis: An Introduction
University of Illinois Department of Economics Spring 2017 Econ 574 Roger Koenker Lecture 22 Survival Analysis: An Introduction There is considerable interest among economists in models of durations, which
More informationFractional Imputation in Survey Sampling: A Comparative Review
Fractional Imputation in Survey Sampling: A Comparative Review Shu Yang Jae-Kwang Kim Iowa State University Joint Statistical Meetings, August 2015 Outline Introduction Fractional imputation Features Numerical
More informationFall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.
1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n
More informationEconomics 583: Econometric Theory I A Primer on Asymptotics
Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:
More informationFunctional central limit theorems for single-stage sampling designs
Functional central limit theorems for single-stage sampling designs Hélène Boistard Toulouse School of Economics, 2 allée de Brienne, 3000 Toulouse, France Hendrik P. Lopuhaä Delft Institute of Applied
More informationSurvival Analysis Math 434 Fall 2011
Survival Analysis Math 434 Fall 2011 Part IV: Chap. 8,9.2,9.3,11: Semiparametric Proportional Hazards Regression Jimin Ding Math Dept. www.math.wustl.edu/ jmding/math434/fall09/index.html Basic Model Setup
More informationOn the Breslow estimator
Lifetime Data Anal (27) 13:471 48 DOI 1.17/s1985-7-948-y On the Breslow estimator D. Y. Lin Received: 5 April 27 / Accepted: 16 July 27 / Published online: 2 September 27 Springer Science+Business Media,
More informationModel Assisted Survey Sampling
Carl-Erik Sarndal Jan Wretman Bengt Swensson Model Assisted Survey Sampling Springer Preface v PARTI Principles of Estimation for Finite Populations and Important Sampling Designs CHAPTER 1 Survey Sampling
More informationSize and Shape of Confidence Regions from Extended Empirical Likelihood Tests
Biometrika (2014),,, pp. 1 13 C 2014 Biometrika Trust Printed in Great Britain Size and Shape of Confidence Regions from Extended Empirical Likelihood Tests BY M. ZHOU Department of Statistics, University
More informationASYMPTOTIC PROPERTIES AND EMPIRICAL EVALUATION OF THE NPMLE IN THE PROPORTIONAL HAZARDS MIXED-EFFECTS MODEL
Statistica Sinica 19 (2009), 997-1011 ASYMPTOTIC PROPERTIES AND EMPIRICAL EVALUATION OF THE NPMLE IN THE PROPORTIONAL HAZARDS MIXED-EFFECTS MODEL Anthony Gamst, Michael Donohue and Ronghui Xu University
More informationarxiv: v2 [math.st] 20 Jun 2014
A solution in small area estimation problems Andrius Čiginas and Tomas Rudys Vilnius University Institute of Mathematics and Informatics, LT-08663 Vilnius, Lithuania arxiv:1306.2814v2 [math.st] 20 Jun
More informationTHESIS for the degree of MASTER OF SCIENCE. Modelling and Data Analysis
PROPERTIES OF ESTIMATORS FOR RELATIVE RISKS FROM NESTED CASE-CONTROL STUDIES WITH MULTIPLE OUTCOMES (COMPETING RISKS) by NATHALIE C. STØER THESIS for the degree of MASTER OF SCIENCE Modelling and Data
More informationChapter 3. Point Estimation. 3.1 Introduction
Chapter 3 Point Estimation Let (Ω, A, P θ ), P θ P = {P θ θ Θ}be probability space, X 1, X 2,..., X n : (Ω, A) (IR k, B k ) random variables (X, B X ) sample space γ : Θ IR k measurable function, i.e.
More informationCalibration Estimation of Semiparametric Copula Models with Data Missing at Random
Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Shigeyuki Hamori 1 Kaiji Motegi 1 Zheng Zhang 2 1 Kobe University 2 Renmin University of China Econometrics Workshop UNC
More informationModelling geoadditive survival data
Modelling geoadditive survival data Thomas Kneib & Ludwig Fahrmeir Department of Statistics, Ludwig-Maximilians-University Munich 1. Leukemia survival data 2. Structured hazard regression 3. Mixed model
More informationEnsemble estimation and variable selection with semiparametric regression models
Ensemble estimation and variable selection with semiparametric regression models Sunyoung Shin Department of Mathematical Sciences University of Texas at Dallas Joint work with Jason Fine, Yufeng Liu,
More informationLecture 3. Truncation, length-bias and prevalence sampling
Lecture 3. Truncation, length-bias and prevalence sampling 3.1 Prevalent sampling Statistical techniques for truncated data have been integrated into survival analysis in last two decades. Truncation in
More informationSEMIPARAMETRIC LIKELIHOOD RATIO INFERENCE. By S. A. Murphy 1 and A. W. van der Vaart Pennsylvania State University and Free University Amsterdam
The Annals of Statistics 1997, Vol. 25, No. 4, 1471 159 SEMIPARAMETRIC LIKELIHOOD RATIO INFERENCE By S. A. Murphy 1 and A. W. van der Vaart Pennsylvania State University and Free University Amsterdam Likelihood
More informationUNIVERSITY OF CALIFORNIA, SAN DIEGO
UNIVERSITY OF CALIFORNIA, SAN DIEGO Estimation of the primary hazard ratio in the presence of a secondary covariate with non-proportional hazards An undergraduate honors thesis submitted to the Department
More informationOptimal Auxiliary Variable Assisted Two-Phase Sampling Designs
MASTER S THESIS Optimal Auxiliary Variable Assisted Two-Phase Sampling Designs HENRIK IMBERG Department of Mathematical Sciences Division of Mathematical Statistics CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY
More informationFULL LIKELIHOOD INFERENCES IN THE COX MODEL
October 20, 2007 FULL LIKELIHOOD INFERENCES IN THE COX MODEL BY JIAN-JIAN REN 1 AND MAI ZHOU 2 University of Central Florida and University of Kentucky Abstract We use the empirical likelihood approach
More information5601 Notes: The Sandwich Estimator
560 Notes: The Sandwich Estimator Charles J. Geyer December 6, 2003 Contents Maximum Likelihood Estimation 2. Likelihood for One Observation................... 2.2 Likelihood for Many IID Observations...............
More informationManifold Learning for Subsequent Inference
Manifold Learning for Subsequent Inference Carey E. Priebe Johns Hopkins University June 20, 2018 DARPA Fundamental Limits of Learning (FunLoL) Los Angeles, California http://arxiv.org/abs/1806.01401 Key
More informationOn the bias of the multiple-imputation variance estimator in survey sampling
J. R. Statist. Soc. B (2006) 68, Part 3, pp. 509 521 On the bias of the multiple-imputation variance estimator in survey sampling Jae Kwang Kim, Yonsei University, Seoul, Korea J. Michael Brick, Westat,
More informationA semiparametric generalized proportional hazards model for right-censored data
Ann Inst Stat Math (216) 68:353 384 DOI 1.17/s1463-14-496-3 A semiparametric generalized proportional hazards model for right-censored data M. L. Avendaño M. C. Pardo Received: 28 March 213 / Revised:
More informationPublished online: 10 Apr 2012.
This article was downloaded by: Columbia University] On: 23 March 215, At: 12:7 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 172954 Registered office: Mortimer
More informationAn Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data
An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data Jae-Kwang Kim 1 Iowa State University June 28, 2012 1 Joint work with Dr. Ming Zhou (when he was a PhD student at ISU)
More informationAn augmented inverse probability weighted survival function estimator
An augmented inverse probability weighted survival function estimator Sundarraman Subramanian & Dipankar Bandyopadhyay Abstract We analyze an augmented inverse probability of non-missingness weighted estimator
More informationConstrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources
Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Yi-Hau Chen Institute of Statistical Science, Academia Sinica Joint with Nilanjan
More information4. Comparison of Two (K) Samples
4. Comparison of Two (K) Samples K=2 Problem: compare the survival distributions between two groups. E: comparing treatments on patients with a particular disease. Z: Treatment indicator, i.e. Z = 1 for
More informationMultilevel Statistical Models: 3 rd edition, 2003 Contents
Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction
More informationReview and continuation from last week Properties of MLEs
Review and continuation from last week Properties of MLEs As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that
More informationPrevious lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.
Previous lecture P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Interaction Outline: Definition of interaction Additive versus multiplicative
More informationAttributable Risk Function in the Proportional Hazards Model
UW Biostatistics Working Paper Series 5-31-2005 Attributable Risk Function in the Proportional Hazards Model Ying Qing Chen Fred Hutchinson Cancer Research Center, yqchen@u.washington.edu Chengcheng Hu
More informationEmpirical Likelihood Methods for Sample Survey Data: An Overview
AUSTRIAN JOURNAL OF STATISTICS Volume 35 (2006), Number 2&3, 191 196 Empirical Likelihood Methods for Sample Survey Data: An Overview J. N. K. Rao Carleton University, Ottawa, Canada Abstract: The use
More informationML Testing (Likelihood Ratio Testing) for non-gaussian models
ML Testing (Likelihood Ratio Testing) for non-gaussian models Surya Tokdar ML test in a slightly different form Model X f (x θ), θ Θ. Hypothesist H 0 : θ Θ 0 Good set: B c (x) = {θ : l x (θ) max θ Θ l
More informationStatistics 262: Intermediate Biostatistics Non-parametric Survival Analysis
Statistics 262: Intermediate Biostatistics Non-parametric Survival Analysis Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics p.1/?? Overview of today s class Kaplan-Meier Curve
More informationDefinitions and examples Simple estimation and testing Regression models Goodness of fit for the Cox model. Recap of Part 1. Per Kragh Andersen
Recap of Part 1 Per Kragh Andersen Section of Biostatistics, University of Copenhagen DSBS Course Survival Analysis in Clinical Trials January 2018 1 / 65 Overview Definitions and examples Simple estimation
More informationRobust Backtesting Tests for Value-at-Risk Models
Robust Backtesting Tests for Value-at-Risk Models Jose Olmo City University London (joint work with Juan Carlos Escanciano, Indiana University) Far East and South Asia Meeting of the Econometric Society
More informationMarginal Screening and Post-Selection Inference
Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2
More informationMaximum likelihood estimation for Cox s regression model under nested case-control sampling
Biostatistics (2004), 5, 2,pp. 193 206 Printed in Great Britain Maximum likelihood estimation for Cox s regression model under nested case-control sampling THOMAS H. SCHEIKE Department of Biostatistics,
More informationAsymptotical distribution free test for parameter change in a diffusion model (joint work with Y. Nishiyama) Ilia Negri
Asymptotical distribution free test for parameter change in a diffusion model (joint work with Y. Nishiyama) Ilia Negri University of Bergamo (Italy) ilia.negri@unibg.it SAPS VIII, Le Mans 21-24 March,
More informationδ -method and M-estimation
Econ 2110, fall 2016, Part IVb Asymptotic Theory: δ -method and M-estimation Maximilian Kasy Department of Economics, Harvard University 1 / 40 Example Suppose we estimate the average effect of class size
More informationAsymptotic Distributions for the Nelson-Aalen and Kaplan-Meier estimators and for test statistics.
Asymptotic Distributions for the Nelson-Aalen and Kaplan-Meier estimators and for test statistics. Dragi Anevski Mathematical Sciences und University November 25, 21 1 Asymptotic distributions for statistical
More informationFull likelihood inferences in the Cox model: an empirical likelihood approach
Ann Inst Stat Math 2011) 63:1005 1018 DOI 10.1007/s10463-010-0272-y Full likelihood inferences in the Cox model: an empirical likelihood approach Jian-Jian Ren Mai Zhou Received: 22 September 2008 / Revised:
More informationBootstrap, Jackknife and other resampling methods
Bootstrap, Jackknife and other resampling methods Part III: Parametric Bootstrap Rozenn Dahyot Room 128, Department of Statistics Trinity College Dublin, Ireland dahyot@mee.tcd.ie 2005 R. Dahyot (TCD)
More informationA Simulation Study on Confidence Interval Procedures of Some Mean Cumulative Function Estimators
Statistics Preprints Statistics -00 A Simulation Study on Confidence Interval Procedures of Some Mean Cumulative Function Estimators Jianying Zuo Iowa State University, jiyizu@iastate.edu William Q. Meeker
More informationNegative Multinomial Model and Cancer. Incidence
Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence S. Lahiri & Sunil K. Dhar Department of Mathematical Sciences, CAMS New Jersey Institute of Technology, Newar,
More information6. Fractional Imputation in Survey Sampling
6. Fractional Imputation in Survey Sampling 1 Introduction Consider a finite population of N units identified by a set of indices U = {1, 2,, N} with N known. Associated with each unit i in the population
More informationAsymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University
More informationTests of independence for censored bivariate failure time data
Tests of independence for censored bivariate failure time data Abstract Bivariate failure time data is widely used in survival analysis, for example, in twins study. This article presents a class of χ
More information