Double-Sampling Designs for Dropouts

Size: px

Start display at page:

Download "Double-Sampling Designs for Dropouts"

Spencer Fowler
5 years ago
Views:

1 Double-Sampling Designs for Dropouts Dave Glidden Division of Biostatistics, UCSF 26 February

2 Outline Mbarra s Dropouts Future Directions Inverse Prob. Weights/Horvitz Thompson Double Sampling for Dropouts An IPW Estimator and Variance

3 PEPFAR Massive scale-up of ARV treatment Outcome results essential Required large-scale follow-up 1.3 Million initiated ARV from

4 Mbarara ISS Cohort of 3,340 HIV+ infected individuals Started on ARV from 1/04 to 7/07 Followed through 7/1/07 56 died 2,530 alive as of 7/1/ lost to FU prior to 7/1/07

5 Multiple Events Three possible events Death Dropout Administrative Censoring Competing risks

6 Calendar Time Scale Patient 1 E D 2 E 3 E L A 4 E A Date

7 Follow Up Time Scale Patient D L Administrative Limit Time Since Initiation

8 Notation C i: time to administrative censoring always known L i: time to dropout censored by Ci and Ti T i: time to death censored by Ci and Li Xi: min(ti,ci), i = I(Ti Ci) data in absence of dropouts Robs =0 dropout, R obs =1 non-dropout

9 Administrative Censoring Patients can only dropout if Li < min(ti, Ci) Patients can only die if Di < min(li, Ci) Some people may have dropout later...

10 Administrative Censoring (T,C) are independent very standard assumption violated if demographics change over time Can be relaxed to (T,C) independent given a series of covariates Conditional on Robs, (Ti,Ci) NOT indep example of collider stratification creates dependent censoring

11 Dependent Dropout (T,L) are likely correlated Hastens death Not easily handled Dropout suggest ARV discontinuation What about observing after dropout how about sampling?

12 Sampling Plan 3,340 HIV+ initiated ART Robs =1: n1=2,625 2,569 Alive and in FU as of 7/1/09 56 Died in Follow-Up Robs=0: n 0=715, ñ0 =79 95 sought, 79 vital status ascertained

13 Advantages of Sampling Very flexible Sampling prob can vary by individual using ancillary data Valid framework for dependent censoring in a way that is model-independent no need to specify cor(t,l) will get information on this

14 Horvitz Thompson Have finite population of size n Want to estimate µ = n 1 n i=1 x i ξi=1 indicated if ith person sampled Sample with probability E(ξi=1)=πi HT estimator ˆµ = n 1 n i=1 ξ i π i x i

15 Variance var{ n( µ ˆµ)} n n = n 1 πij π i π j π ij π i π j i=1 j=1 ξ i ξ j x i x j πij is the probability that i and j are selected termed the second-order inclusion probability

16 Simple Sample Total population is n Sample with equal probability πi = ñ/n Sample with replacement chose ñ from n (putting balls back in jar) Sample with quota chose exactly ñ different from n Sample without quota chose ith person with probability ñ/n Same estimate, different variances!!

17 Second-Order Weights under equal probability schemes πii πij w/ replace (ñ/n) 2 (ñ/n) 2 quota ñ/n ñ(ñ-1)/n(n-1) no quota ñ/n (ñ/n) 2

18 Variance for Finite Population under quota sampling var{ˆµ} = (π 1 1)σ 2 n n: size of total popn (1 π)σ2 var{ˆµ} = ñ For the standard sample mean, π is effectively 0 var{ x} = σ2 ñ ñ: size of sampled popn

19 Double Sampling group of i=1,...,n subjects sampled from a population (first level of sampling) some additional data avail. (z1,...,zn) second level of data richer data collection on subset sampling determined by (z1,...,zn) pr(ξi=1 zi)=πi Neyman, 1938

20 Example: Two-Stage Case Control Study Bladder cancer case/control study Exposure: metal working fluids (MWF) Phase 1: Data collected including Q: Have you ever worked in metal industry? Phase II: Detailed work history/exposure collected Metal workers oversampled in Phase II

21 Other Examples Two-phase genotyping studies Nested case-control studies Case-cohort studies With significant cost savings All valid, comparatively efft designs

22 Back to Mbarara

23 Double Sample for Dropouts Robs =1: observed death/admin censoring data already completely observed sampling fraction 1.00 Robs =0: observed dropouts n 0 dropouts sample ñ 0 for vital ascertainment sampling fraction ñ0/n0 = π

24 Observed Data (Xi, i) if ξi=1 ignored otherwise ξ=i if R obs =1 or R obs =0 & sampled pr(sampled R obs =0) = π Data is MAR conditional on Robs

25 Frangakis and Rubin Double sampling estimate of survivor function Ignore data if ξ=0 if dropout & not dble sampled, drop data Constructed survivor estimate Showed that (T,C) not independent conditional on the value of R obs Estimate hazard and transform to survival

26 Frangakis Rubin Estimator consistent with a Gaussian limiting distribution

27 Their representation Cumulative hazard is weighted sum Of crude hazards in 2 groups Weight varies with time Not an intuitive representation why does it work?

28 Double-Sampling Λ(t) Λ(t) True cumulative hazard function Nelson-Aalen estimator if complete data avail on cohort ˆΛ(t) The FR estimator based on dble-sampled data

29 More Notation N(t)=I(X t, Δ=1) Y(t) = I(X t) H 1 : set of people with R obs =1 has size n1, ñ1 sampled (n1 = ñ1) i in H1 πi= 1 H 0 : set of people with R obs =0 has size n0, ñ0 sampled i in H0: πi= ñ0/n0

30 Complete Data on Cohort N g (t) =n 1 g Ȳ g (t) =n 1 g N(t) = g N i (t) i H g Y i (t) i H g n g n N g (t) Ȳ (t) = g n g n Ȳg(t)

31 Other Representation Λ(t) = t 0 N (du) Y(u) as n N(t) N (t) Ȳ (t) Y(t) Λ(t) = t 0 N(du) Ȳ (u)

32 Horvitz Thompson ˆN g (t) =ñ 1 g Ŷ g (t) =ñ 1 g i H g ˆN(t) = g i H g ξ i π i N i (t) ξ i π i Y i (t) n g n ˆN g (t) Ŷ (t) = g n g n Ŷg(t)

33 Approximations ˆN 1 (t) = N 1 (t) Ŷ 1 (t) =Ȳ1(t) ˆN 0 (t) N 0 (t) Ŷ 0 (t) Ȳ0(t)

34 A Natural Estimator ˆΛ(t) = t 0 ˆN(du) Ŷ (u) identical to FR estimator and to ˆΛ(t) = t 0 i i ξ i and has a IPW representation N π i i (du) ξ i Y π i i (u)

35 But what about variance?

36 FR Variance

37 Two-Part Variance n{ˆλ(t) Λ(t)} = n{ˆλ(t) Λ(t)} + n{ Λ(t) Λ(t)} last two terms are independent σ 2 (t) = var{ˆλ(t) Λ(t)} σ1(t) 2 = var{ˆλ(t) Λ(t)} σ2(t) 2 = var{ Λ(t) Λ(t)} σ 2 (t) =σ1(t)+σ 2 2(t) 2

38 Variance Decomposition σ 2 (t) : Total Variance σ 2 2(t) : Variance if full cohort observed Usual variance for Nelson-Aalen estimate Easily estimated σ 2 1(t) : Variance due to double sampling HT type variance Estimation more complicated

39 Nelson-Aalen Variance σ 2 2(t) = ˆσ 2 2(t) = t 0 t 0 = n 0 n = t 0 Λ(du) Y(u) ˆΛ(du) Ŷ (u) t 0 ˆN 0 (du) Ŷ 2 + n 1 (u) n ˆN(du) Ŷ 2 (u) t 0 ˆN 1 (du) Ŷ 2 (u)

40 Double-Sample Variance looong HT-based variance arguments lead to ˆσ 2 1(t) = (π 1 1) n 0 = n 1 n t i:r obs 0 =0 t 0 ˆN 0 (du) Ŷ 2 (u) ξ i (π 1 i π i 1)N i (du) Ỹ 2 (u). = n 1 n i=1 t 0 ξ i (π 1 i π i 1)N i (du) Ŷ 2 (u)

41 The total variance ˆσ 2 (t) =ˆσ 2 1(t)+ˆσ 2 2(t) = n 1 n n 1 i=1 n i=1 = n 1 n i=1 t 0 t 0 t 0 ξ i π i N i (du) Ŷ 2 (u) + ξ i (π 1 i π i ξ i π 1 i π i 1)N i (du) Ŷ 2 (u) N i (du) Ŷ 2 (u)

42 Source of Variance R obs =0 R obs =1 ˆσ 2 1(t) (π 1 1) n 0 n t 0 ˆN 0 (du) Ŷ 2 (u) 0 ˆσ 2 2(t) n 0 n t 0 ˆN 0 (du) Ŷ 2 (u) n 1 n t 0 ˆN 1 (du) Ŷ 2 (u) ˆσ 2 (t) π 1 n 0 n t 0 ˆN 0 (du) Ŷ 2 (u) n 1 n t 0 ˆN 1 (du) Ŷ 2 (u)

43 Some observations Robs =1: no contribution to double-sample variance Robs =0: ratio of NA variance to FR variance = 1/π = n0/ñ0 just the ratio of sample size in Robs=0 between DS data and full data Intuitive look at the variance

44 Variance Estimate Easily computed Demystifies the form Facilitates sample size calculations look at effect of various sample fractions Performs great in simulations

45 Data Example Cohort of 3,340 HIV+ infected individuals Robs =1: n1=2,625 (56 died) Robs=0: n 0=715, ñ0 =79 (26 died) π = 1/9.18 = Rate of death is about 16 times higher in dropouts compared to non-dropouts

46 FR and Naive Survival Proportion surviving Naïve estimate Adjusted estimate Days since initiation of ART

47 Relative Efficiency Trade-off between sampling fractions What is efficiency of sampling ρ dropouts compared to all dropouts Can be consistently estimated Based on Mbarara data

48 Relative Efficiency Relative Efficiency Compared to Complete Cohort % 30% 50% 70% 90% Percentage of Dropouts Sampled

49 Future Directions Apply insights from survey statistics formulae and approximations post-stratification, calibration auxiliary variables => more efficiency Look at using non-sample dropout data

50 Acknowledgement Jeff Martin Elvin Geng Constantin Yiannoutsos Mbarara ISS clinic East Africa Region of IeDEA Chuck McCulloch

Statistics 262: Intermediate Biostatistics Non-parametric Survival Analysis

Statistics 262: Intermediate Biostatistics Non-parametric Survival Analysis Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics p.1/?? Overview of today s class Kaplan-Meier Curve