Quasi-likelihood Scan Statistics for Detection of


Sophia Clarke

1 Quasi-likelihood Scan Statistics
Division of Biostatistics and Bioinformatics, National Health Research Institutes & Department of Mathematics, National Chung Cheng University
17 December

2 Outline

3 Motivation
A study of enterovirus patterns in Taiwan.
Data were collected by the CDC of Taiwan from 1999 to [...].
In 2003, cases were reported in 89 administrative regions of North Taiwan.
Would temperature or humidity affect the disease rate?
How can geographic clusters be identified while accounting for spatial correlation and environmental covariates?

4 Enterovirus Data
Figure: (a) Grey-scale map of the observed enterovirus disease rates; a region with a slash mark has a missing value. (b) Locations of the clusters estimated by the QL scan statistic.

5 Literature Review
Two main methodologies to find geographic clusters:
Likelihood ratio tests for localized clusters (e.g. Chan, 2009, Annals of Statistics).
Disease mapping based on a hierarchical Bayesian model (e.g. Gangnon and Clayton, 2003, Statistics in Medicine).
However, current methods can neither directly incorporate ecological covariates nor account for spatial correlation (Zhang and Lin, 2009, Biometrics).
We explore a scan method for localized clusters based on quasi-likelihood functions.

6 Basic Concept
Moving windows across space or time.
Each distinct window induces a statistical model: a difference in risk inside vs. outside the window.
Many potential choices for windows in space: circles, ellipses, arbitrary shapes.
We use circular windows for power considerations.

7 Potential Clusters
We use the centroid $s_k$ to denote region $k$.
For each $s_k$, we sort the distances from $s_k$ to the other centroids, say $r_{k,(1)} < r_{k,(2)} < \cdots < r_{k,(m_k)}$.
Then $B\{s_k, r_{k,(j)}\}$ is a potential cluster, $j = 1, \ldots, m_k$.
Each potential cluster contains at most 35% of the total population.
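The enumeration of potential clusters can be sketched as follows. This is only an illustration under stated assumptions: the function name `candidate_windows`, the tuple layout of the output, and the handling of tied distances (each tied region is added one at a time) are hypothetical and do not come from the slides.

```python
import math

def candidate_windows(centroids, population, max_share=0.35):
    """Enumerate circular windows B{s_k, r_k,(j)} as (center, radius, member regions)."""
    total = sum(population)
    windows = []
    for k, center in enumerate(centroids):
        # Other regions, ordered by distance from centroid k.
        others = sorted(
            (i for i in range(len(centroids)) if i != k),
            key=lambda i: math.dist(center, centroids[i]),
        )
        members = [k]
        pop = population[k]
        for i in others:
            pop += population[i]
            if pop > max_share * total:
                break  # larger windows around s_k would exceed the population cap
            members.append(i)
            windows.append((k, math.dist(center, centroids[i]), tuple(members)))
    return windows
```

Each returned tuple holds the index of the center region, the window radius, and the regions whose centroids fall inside the window.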

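8 Latent Process
$Y_i$: responses correlated through a latent Gaussian process $\epsilon$.
$Y_i \mid \epsilon_i \sim \text{Poisson}(\theta_i)$, where
\[ \theta_i = E_i \exp\Big\{ x_i^{\top}\beta + \sum_{k=1}^{K} \xi_k \delta_{B_k}(s_i) + \epsilon_i \Big\}. \tag{1} \]
$B_k$ and $\xi_k$ denote the $k$th cluster and its associated coefficient, and $x_i$ and $\beta$ correspond to the covariates.
\[ (\epsilon_i, \epsilon_j)^{\top} \sim N\left\{ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2 & \sigma^2\rho_{i,j} \\ \sigma^2\rho_{i,j} & \sigma^2 \end{pmatrix} \right\}. \]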

9 Linear Mixed Model
By moment generating functions, (1) gives the marginal mean
\[ \theta_i = E(Y_i) = E_i \exp\Big\{ 0.5\sigma^2 + x_i^{\top}\beta + \sum_{k=1}^{K} \xi_k \delta_{B_k}(s_i) \Big\}, \]
with a covariance structure given by
\[ \operatorname{var}(Y_i) = \theta_i + \theta_i^2 \{\exp(\sigma^2) - 1\}, \tag{2} \]
\[ \operatorname{cov}(Y_i, Y_j) = \theta_i \theta_j \{\exp(\sigma^2 \rho_{i,j}) - 1\}. \tag{3} \]
In the mean model, each $\xi_k$ may be positive or negative, so we can detect both hot-spot and cool-spot clusters.
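A short derivation of the marginal moments may help here. It is a sketch that uses only the lognormal moment generating function and the laws of total variance and covariance. The shorthand $\eta_i$ for the linear predictor and $m_i$ for the conditional mean is introduced for this note and is not notation from the slides.

```latex
% Shorthand (assumption of this note): \eta_i = x_i^\top\beta + \sum_k \xi_k \delta_{B_k}(s_i),
% and m_i = E_i e^{\eta_i + \epsilon_i} is the conditional Poisson mean in (1).
\begin{align*}
E(e^{\epsilon_i}) &= e^{\sigma^2/2}, \qquad
E(e^{2\epsilon_i}) = e^{2\sigma^2}, \qquad
E(e^{\epsilon_i+\epsilon_j}) = e^{\sigma^2(1+\rho_{i,j})}, \\
\theta_i = E(m_i) &= E_i e^{\eta_i} e^{\sigma^2/2}, \\
\operatorname{var}(Y_i) &= E(m_i) + \operatorname{var}(m_i)
  = \theta_i + E_i^2 e^{2\eta_i}\bigl(e^{2\sigma^2} - e^{\sigma^2}\bigr)
  = \theta_i + \theta_i^2\bigl(e^{\sigma^2} - 1\bigr), \\
\operatorname{cov}(Y_i, Y_j) &= \operatorname{cov}(m_i, m_j)
  = E_i E_j e^{\eta_i+\eta_j}\bigl(e^{\sigma^2(1+\rho_{i,j})} - e^{\sigma^2}\bigr)
  = \theta_i\theta_j\bigl(e^{\sigma^2\rho_{i,j}} - 1\bigr).
\end{align*}
```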

10 Two-stage Estimation
To estimate the parameters, the first step is to find a working correlation model:
1. Obtain an initial estimate $\{\hat\beta^{(0)}, \hat\xi^{(0)}\}$ by the traditional GLM method.
2. Let $\hat\epsilon_i^{(0)} = \log(y_i) - \log(\hat\theta_i^{(0)})$, and obtain $\hat\sigma^{(1)}$ and $\hat\rho_{i,j}^{(1)}$ from the variogram.
3. Plug $\hat\sigma^{(1)}$ and $\hat\rho_{i,j}^{(1)}$ into (3) to get a working covariance matrix.
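Step 2 can be illustrated with a method-of-moments empirical semivariogram of the log residuals. This is a sketch under assumptions: the slides do not say which variogram estimator or binning is used, all counts are assumed positive so the logarithm is defined, and the names `log_residuals`, `empirical_semivariogram`, and `bin_edges` are hypothetical.

```python
import numpy as np

def log_residuals(y, theta0):
    """Stage-one residuals log(y_i) - log(theta_i); assumes every y_i > 0."""
    return np.log(y) - np.log(theta0)

def empirical_semivariogram(coords, eps, bin_edges):
    """Half the mean squared residual difference within each distance bin."""
    coords = np.asarray(coords, dtype=float)
    eps = np.asarray(eps, dtype=float)
    i, j = np.triu_indices(len(eps), k=1)          # all unordered pairs of regions
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sqdiff = (eps[i] - eps[j]) ** 2
    which = np.digitize(dist, bin_edges) - 1       # bin index of each pair
    gamma = np.full(len(bin_edges) - 1, np.nan)    # NaN marks an empty bin
    for b in range(len(gamma)):
        mask = which == b
        if mask.any():
            gamma[b] = 0.5 * sqdiff[mask].mean()
    return gamma
```

For a stationary process the semivariogram satisfies $\gamma(h) = \sigma^2\{1 - \rho(h)\}$, so a parametric curve fitted to these binned values yields estimates of $\sigma^2$ and of the correlation function.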

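11 Quasi-likelihood Estimating Equation
The second estimation step is to use a QL estimating equation
\[ Q(\hat\beta, \hat\xi; \hat\sigma, \hat\rho_{i,j}) = D^{\top}(\hat\beta, \hat\xi; \hat\sigma)\, V^{-1}(\hat\beta, \hat\xi; \hat\sigma, \hat\rho_{i,j})\, \{Y - \theta(\hat\beta, \hat\xi; \hat\sigma)\} = 0. \]
We derive a new estimate by a Newton-Raphson iteration
\[ \{\hat\beta^{(k+1)}, \hat\xi^{(k+1)}\} = \{\hat\beta^{(k)}, \hat\xi^{(k)}\} + \Big[ (D^{\top} V^{-1} D)^{-1} D^{\top} V^{-1} \{Y - \theta(\beta, \xi; \hat\sigma)\} \Big]_{\{\hat\beta^{(k)},\, \hat\xi^{(k)},\, \hat\sigma,\, \hat\rho_{i,j}\}}. \]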
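The iteration can be sketched in NumPy as below. This is a minimal illustration rather than the authors' implementation. It assumes the log-link mean from slide 9, stacks the covariates and cluster indicators into one design matrix `Z` with coefficient vector `gamma = (beta, xi)`, and rebuilds the working covariance from (2) and (3) at every step. The names `working_cov`, `fit_ql`, `Z`, `gamma`, and `R` are hypothetical.

```python
import numpy as np

def working_cov(theta, sigma2, R):
    """Working covariance from (2)-(3); R is a correlation matrix with unit diagonal."""
    V = np.outer(theta, theta) * (np.exp(sigma2 * R) - 1.0)
    V[np.diag_indices_from(V)] += theta            # Poisson term theta_i in (2)
    return V

def fit_ql(y, Z, E, sigma2, R, n_iter=50, tol=1e-8):
    """Iterate gamma <- gamma + (D'V^-1 D)^-1 D'V^-1 (y - theta) until the step is small."""
    gamma = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        theta = E * np.exp(0.5 * sigma2 + Z @ gamma)   # marginal mean
        D = theta[:, None] * Z                          # derivative of theta w.r.t. gamma
        V = working_cov(theta, sigma2, R)
        VinvD = np.linalg.solve(V, D)                   # V^-1 D (V is symmetric)
        step = np.linalg.solve(D.T @ VinvD, VinvD.T @ (y - theta))
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    return gamma
```

Solving linear systems with `np.linalg.solve` avoids forming $V^{-1}$ explicitly, which is usually the more stable choice.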

12 Asymptotic Properties
Assumptions:
(a) The number of regions misspecified to a cluster is $o(n)$.
(b) The working correlation model satisfies $\rho_{i,j} = o(h^{d})$.
Theorem: Under these assumptions, the QL estimates $(\hat\beta, \hat\xi)$ are consistent. Furthermore, if $n^{1/2} \lVert \theta^{*} - \theta \rVert_{1} \to 0$ uniformly, then
\[ n^{1/2}\{(\hat\beta, \hat\xi) - (\beta, \xi)\} \to N\{0,\; I_1^{-1}(\beta^{*})\, I_0(\theta^{*})\, I_1^{-1}(\theta^{*})\}. \]

13 Multiple Testing Problem
Let H denote the null hypothesis of no geographic cluster.
Because the QL scan statistic is evaluated over many potential clusters, we face a multiple testing problem.
We use a parametric bootstrap approach to find a corrected p-value threshold $p_0$.

14 Parametric Bootstrapping
(a) We compute $E_i = n_i \hat p$, where $\hat p = \sum_i Y_i / \sum_i n_i$, and fit a null marginal model $\theta_i = E_i \exp(\beta_0 + \beta_1 x_i)$. Let $(\tilde\beta_0, \tilde\beta_1)$ and $(\tilde\sigma, \tilde\rho_{i,j})$ be the two-stage estimates.
(b) We simulate $Y^{(j)}$ from models (2) and (3) based on $(\tilde\beta_0, \tilde\beta_1)$ and $(\tilde\sigma, \tilde\rho_{i,j})$, $j = 1, \ldots, n_0$.
(c) For the $j$th simulated data set, we use the QL scan statistic to get the p-values of all potential clusters.

15 Estimated Clusters
(d) Let $\hat p^{(j)}$ be the smallest p-value for the $j$th simulated data set, $j = 1, \ldots, n_0$.
(e) We sort $P = \{\hat p^{(1)}, \ldots, \hat p^{(n_0)}\}$ and set $p_0$ to the cutoff for a 95% level, that is, the lower 5% quantile of $P$ (95% of the simulated minimum p-values exceed it).
(f) For the real data, we compare $\hat p_{i,(j)}$ and $p_0$. A potential cluster $B\{s_i, r_{i,(j)}\}$ enters the final screening if $\hat p_{i,(j)}$ is less than $p_0$.
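Steps (a) to (f) can be sketched as a small bootstrap loop. Several parts are assumptions made for illustration. Counts are simulated through the latent Gaussian model of slide 8, rescaled so the marginal mean equals `mu`. The scan itself is passed in as a hypothetical callback `min_pvalue` that returns the smallest p-value over all potential clusters. The cutoff is taken as the lower 5% point of the simulated minimum p-values, which is one reading of the 95% level in step (e).

```python
import numpy as np

def simulate_counts(mu, sigma2, R, rng):
    """One null data set with marginal mean mu and covariance as in (2)-(3)."""
    chol = np.linalg.cholesky(sigma2 * R)             # R must be positive definite
    eps = chol @ rng.standard_normal(len(mu))         # latent Gaussian effects
    return rng.poisson(mu * np.exp(eps - 0.5 * sigma2))

def bootstrap_cutoff(mu, sigma2, R, min_pvalue, n_boot=200, level=0.95, seed=0):
    """Cutoff p_0: the lower (1 - level) quantile of the simulated minimum p-values."""
    rng = np.random.default_rng(seed)
    min_p = np.array(
        [min_pvalue(simulate_counts(mu, sigma2, R, rng)) for _ in range(n_boot)]
    )
    return np.quantile(min_p, 1.0 - level)
```

A potential cluster from the real data would then be kept for the final screening when its p-value is below the returned cutoff.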

16 Quasi-deviance Criteria
Let $B = \{\hat B_1, \ldots, \hat B_m\}$ be a set of estimated clusters from bootstrapping.
Some of them may have overlapping regions.
For the overlapping clusters, we screen them with a quasi-deviance (Lin, 2010)
\[ D(\hat\xi_i, \hat\xi_j) = 0.5\,\{\theta(\hat\xi_i; \cdot) - \theta(\hat\xi_j; \cdot)\}^{\top} \Big[ V^{-1}(\hat\xi_i; \cdot)\{Y - \theta(\hat\xi_i)\} + V^{-1}(\hat\xi_j; \cdot)\{Y - \theta(\hat\xi_j)\} \Big]. \]

17 Cluster Reformation
Applying the quasi-deviance criterion to overlapping clusters, we
discard $\hat B_j$ if $D(\hat\xi_i, \hat\xi_j) > \chi^2_{1,0.95}$, or
combine $\hat B_i$ and $\hat B_j$ if $D(\hat\xi_i, \hat\xi_j) \le \chi^2_{1,0.95}$,
and sequentially screen the clusters from the one with the smallest p-value to the one with the largest p-value.
For the final $K$ disjoint clusters $\hat B_1, \ldots, \hat B_K$, the two-stage estimation procedure gives
\[ \hat\theta_i = E_i \exp\Big\{ 0.5\hat\sigma^2 + x_i^{\top}\hat\beta + \sum_{k=1}^{K} \hat\xi_k \delta_{\hat B_k}(s_i) \Big\}. \]
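The screening rule for one pair of overlapping clusters can be sketched as follows. The fitted means and working covariance matrices of the two competing models are taken as given inputs. The function names and the string labels returned by `screen_pair` are hypothetical, and the chi-square quantile is hard-coded so that only NumPy is needed.

```python
import numpy as np

CHI2_1_095 = 3.841458820694124   # 0.95 quantile of the chi-square distribution with 1 df

def quasi_deviance(y, theta_i, V_i, theta_j, V_j):
    """D = 0.5 (theta_i - theta_j)' [V_i^-1 (y - theta_i) + V_j^-1 (y - theta_j)]."""
    score_i = np.linalg.solve(V_i, y - theta_i)
    score_j = np.linalg.solve(V_j, y - theta_j)
    return 0.5 * (theta_i - theta_j) @ (score_i + score_j)

def screen_pair(y, theta_i, V_i, theta_j, V_j):
    """Discard cluster j when the quasi-deviance exceeds the cutoff, else combine."""
    d = quasi_deviance(y, theta_i, V_i, theta_j, V_j)
    return "discard_j" if d > CHI2_1_095 else "combine"
```

In a full procedure this check would be applied sequentially, starting from the cluster with the smallest p-value.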

18 Scenario
$\epsilon$: GRF with mean 0, $\sigma^2 = 0.25$ and $\rho = 0.7$.
$Y_i \mid \epsilon_i$: Poisson with intensity $\theta_i = \exp\{\beta_0 + \xi_1 \delta_{B_1}(s_i) + \epsilon_i\}$.
Parameters: $\beta_0 = 0.6$; $\xi_1 = 0$, 0.5, 0.75, 1.0, or 1.5.
Cluster: $B_1$ is a cluster centered at Zhong-li city with 15 adjacent regions.

19 Kulldorff's LR Test
When there are no covariates, Kulldorff (1997) suggested taking the maximum of the likelihood ratio over all possible clusters. That is,
\[ LR_{\max} = \max_{k,j} LR_{k, r_{k,(j)}}, \quad \text{where} \quad LR_{k, r_{k,(j)}} = \frac{\sup L(\Theta_1 \mid y)}{\sup L(\Theta_0 \mid y)}. \]
We compare the QL scan statistic with LR.
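For reference, the window-level likelihood ratio has a closed form under the Poisson model commonly used with Kulldorff's scan statistic. The slide gives only the generic ratio, so the closed form below, the one-sided (elevated-risk) convention, and the function names are assumptions of this sketch. Here `c` is the observed count in the window, `e` the expected count under the null (scaled so that expected counts sum to the total `C`), and `C` the total observed count.

```python
import math

def kulldorff_log_lr(c, e, C):
    """Log likelihood ratio for one window; 0 when the window shows no excess risk."""
    if c <= e:
        return 0.0                       # only elevated-risk windows are scored
    out = c * math.log(c / e)
    if C > c:
        out += (C - c) * math.log((C - c) / (C - e))
    return out

def max_log_lr(windows, C):
    """Maximum log likelihood ratio over a list of (observed, expected) pairs."""
    return max(kulldorff_log_lr(c, e, C) for c, e in windows)
```

Working on the log scale keeps the statistic numerically stable when counts are large.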

20 Comparison of Power Functions
Figure: A comparison of power based on 1000 simulations. (a) The pre-specified cluster of 15 regions (axes: longitude, latitude; legend: % Pop). (b) The estimated power functions of QL and LR (axes: coefficient, power).

21 Table: Conditional sample means and standard errors for QL and GLM scan statistics.

                      QL                                  GLM
            β̂1      β̂2      ξ̂1      ξ̂2       β̂1      β̂2      ξ̂1      ξ̂2
(β1, β2, ξ1, ξ2) = ( 0.475, 0.5, 0.5, 0.5)
Estimate
Asym SE     (0.189)  (0.266)  (0.365)  (0.360)   (0.132)  (0.284)  (0.441)  (0.432)
Sample SE   (0.158)  (0.254)  (0.330)  (0.328)   (0.115)  (0.253)  (0.416)  (0.377)
(β1, β2, ξ1, ξ2) = ( 0.475, 0.5, 1.0, 1.0)
Estimate
Asym SE     (0.195)  (0.248)  (0.348)  (0.336)   (0.156)  (0.291)  (0.379)  (0.348)
Sample SE   (0.156)  (0.231)  (0.332)  (0.321)   (0.110)  (0.262)  (0.295)  (0.318)
(β1, β2, ξ1, ξ2) = ( 0.475, 0.5, 1.5, 1.5)
Estimate
Asym SE     (0.206)  (0.232)  (0.357)  (0.332)   (0.306)  (0.337)  (0.323)  (0.385)
Sample SE   (0.196)  (0.245)  (0.322)  (0.310)   (0.293)  (0.298)  (0.272)  (0.390)

22 Data
2003 North Taiwan data: 21 regions have no observations.
The environmental covariates, temperature and humidity, were standardized to lie in (-1, 1).
We use the QL scan statistic for the 2003 data to get

Parameter    Estimate    Std. Err    p-value
Intercept
Temp
Humidity
Cluster 1
Cluster 2

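23 Data
The estimated model for the disease rate of the $i$th region is
\[ \hat\theta_i = 0.0032\, n_i \exp\{14.24\,(\text{temp})_i \;[\ldots]\; (\text{humidity})_i \;[\ldots]\; 0.75\,\delta_{B_1}(s_i) \;[\ldots]\; \delta_{B_2}(s_i)\}, \]
with a correlation estimate $\hat\rho_{i,j} = 0.94 \exp(-\lVert s_i - s_j \rVert / 3.32)$ and $\hat\sigma^2 = [\ldots]$.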

24 Result
Figure: (a) Grey-scale map of the observed enterovirus disease rates; a region with a slash mark has a missing value. (b) Locations of the clusters estimated by the QL scan statistic.

25 Future Work
The variogram may not work for some potential cluster models.
Non-stationary correlation models for latent processes.
Add temporal random effects for a Bayesian dynamic linear model.
