False Discovery Control in Spatial Multiple Testing

1 False Discovery Control in Spatial Multiple Testing. W. Sun (1), B. Reich (2), T. Cai (3), M. Guindani (4), and A. Schwartzman (2). WNAR, June. (1) University of Southern California, (2) North Carolina State University, (3) University of Pennsylvania, (4) MD Anderson

2 Spatial modeling of time trends in tropospheric ozone. The EPA uses a monitoring network to regulate ozone. Our objective is to identify areas with changing ozone. Other examples of spatial multiple testing: climate change, disease monitoring, neuroimaging, etc. [Figure: maps by longitude and latitude of (a) first-stage estimates, $\hat{\beta}(s)$, and (b) first-stage z-scores, $z(s) = \hat{\beta}(s)/w(s)$.]

3 New issues in spatial multiple testing. One only observes data at a discrete subset of the locations but needs to make inference everywhere in the spatial domain. Inference in a continuous spatial domain requires a finite approximation strategy; otherwise an uncountable number of tests would have to be conducted, which is impossible in practice. It is desirable to aggregate information from nearby locations to make cluster-wise inference, and to incorporate important spatial variables in the decision-making process.

4 Gaussian random field model. Let $X = \{X(s) : s \in S\}$ be a random field on a spatial domain $S$: $X(s) = \mu(s) + \epsilon(s)$, (1) where $\mu(s)$ is the unobserved random process and $\epsilon(s)$ is the noise process. An important special case is the Gaussian random field model, where the signals and errors are Gaussian processes with means $\bar{\mu}$ and $0$, and covariance functions $\rho_1$ and $\rho_2$, respectively. Let $\Theta$ denote the collection of all parameters in model (1).

5 Notation.
Hypotheses: $H_0(s): \mu(s) \in A$ versus $H_1(s): \mu(s) \in A^c$, where $A$ is the indifference region, e.g. $A = \{\mu : \mu \le \mu_0\}$.
True states: $\theta(s) = 0$ if $H_0(s)$ holds and $\theta(s) = 1$ if $H_1(s)$ holds.
Null area: $S_0 = \{s \in S : \theta(s) = 0\}$.
Non-null area: $S_1 = \{s \in S : \theta(s) = 1\}$.
Decisions: $\delta(s) = 1$ if reject and $\delta(s) = 0$ otherwise.
Rejection region: $R = \{s \in S : \delta(s) = 1\}$.
Error regions:
False positive area: $S_{FP} = \{s \in S : \theta(s) = 0, \delta(s) = 1\}$.
False negative area: $S_{FN} = \{s \in S : \theta(s) = 1, \delta(s) = 0\}$.

6 False discovery measures. The key quantity to control is the false discovery proportion, $\mathrm{FDP} = \frac{\nu(S_{FP})}{\nu(R)} I\{\nu(R) > s_0\}$, where $\nu(\cdot)$ measures the size (area) of a region and $s_0$ is a small positive value. FDR: Typically, the false discovery rate is controlled so that $\mathrm{FDR} = E(\mathrm{FDP}) < \alpha$. FDX: We might also want to be reasonably confident that the FDP is less than some value, say $\tau \in (0, 1)$. Therefore, we also consider the false discovery exceedance, $\mathrm{FDX}_\tau = P(\mathrm{FDP} > \tau) < \alpha$. MDR: The power of a multiple testing procedure is summarized by the missed discovery rate, $\mathrm{MDR} = E\{\nu(S_{FN})\}$.
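The measures above can be sketched on a pixel grid as follows. This is a minimal illustration, not the authors' code: the function names `fdp`, `missed_area`, and `summarize`, and the toy pixel areas, are assumptions made for the example. FDR, FDX, and MDR are expectations, so `summarize` simply averages over a list of replicated FDP values.

```python
def fdp(theta, delta, area, s0=0.0):
    """False discovery proportion on a pixel grid.

    theta[i] = 1 if pixel i is truly non-null, delta[i] = 1 if it is rejected,
    area[i] = nu(S_i). Returns nu(S_FP) / nu(R) when nu(R) > s0, else 0.
    """
    nu_r = sum(a for d, a in zip(delta, area) if d == 1)
    nu_fp = sum(a for t, d, a in zip(theta, delta, area) if t == 0 and d == 1)
    return nu_fp / nu_r if nu_r > s0 else 0.0


def missed_area(theta, delta, area):
    """nu(S_FN): area of truly non-null pixels that were not rejected."""
    return sum(a for t, d, a in zip(theta, delta, area) if t == 1 and d == 0)


def summarize(fdps, tau):
    """Monte Carlo FDR (mean FDP) and FDX (share of replications with FDP > tau)."""
    n = len(fdps)
    return sum(fdps) / n, sum(1 for f in fdps if f > tau) / n


# Toy example: four pixels of unequal area (values are illustrative only).
theta = [0, 1, 1, 0]
delta = [1, 1, 0, 0]
area = [0.1, 0.3, 0.4, 0.2]
print(fdp(theta, delta, area))          # false positive area over rejected area
print(missed_area(theta, delta, area))  # area of the one missed non-null pixel
```

Weighting by pixel area rather than counting pixels is what distinguishes these spatial measures from their usual point-wise counterparts.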

7 Compound decision theory for spatial multiple testing. Under mild conditions, the multiple testing problem is equivalent to a weighted classification problem with loss $L(\theta, \delta) = \lambda \nu(S_{FP}) + \nu(S_{FN})$. There is a one-to-one relationship between $\lambda$ and $\alpha$. The optimal rule is to reject if the posterior probability of the null is smaller than a threshold $t$, i.e., $\delta(s) = I[T_{OR}(s) < t]$, where $T_{OR}(s) = P_\Theta\{\theta(s) = 0 \mid X\}$. The threshold $t$ is taken to be as large as possible (to increase power) while still maintaining $\mathrm{FDR}(t) \le \alpha$. $\mathrm{FDR}(t)$ is unknown and must be approximated.
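One way to see why the optimal rule thresholds the posterior null probability is to minimise the posterior expected loss pointwise. The short derivation below is a sketch that the slides do not spell out; it assumes expectation and integration over $S$ can be exchanged, and it gives the threshold in terms of $\lambda$ as $t = 1/(1+\lambda)$.

```latex
\begin{align*}
E\{L(\theta,\delta) \mid X\}
  &= \int_S \big[\lambda\,\delta(s)\,T_{OR}(s)
     + \{1-\delta(s)\}\{1-T_{OR}(s)\}\big]\,d\nu(s), \\
\delta(s) = 1 \text{ is preferred}
  &\iff \lambda\,T_{OR}(s) < 1 - T_{OR}(s)
  \iff T_{OR}(s) < \frac{1}{1+\lambda}.
\end{align*}
```

A larger penalty $\lambda$ on false positives lowers the threshold, which is consistent with a smaller target level $\alpha$.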

8 Discrete approximation. This problem boils down to estimating $\mathrm{FDR}(t)$, which is difficult because it is an integral over potentially infinitely many tests (spatial locations). Let $\bigcup_{i=1}^{m} S_i$ be a fine partition of $S$. Take a point $s_i$ from each $S_i$ and compute the posterior probability of the null, $T_{OR}(s_i)$. Let $\{T_{OR}^{(i)} : i = 1, \ldots, m\}$ be the ordered oracle statistics and $S_{(i)}$ the region corresponding to $T_{OR}^{(i)}$.

9 FDR control. Procedure: Define $R_j = \bigcup_{i=1}^{j} S_{(i)}$ and $r = \max\big\{ j : \nu(R_j)^{-1} \sum_{i=1}^{j} T_{OR}^{(i)} \nu(S_{(i)}) \le \alpha \big\}$. The rejection area is given by $R = \bigcup_{i=1}^{r} S_{(i)}$. In other words, we assume the decision is constant within pixels and approximate the FDR by the area-weighted average of the posterior probabilities of the null over the pixels in the rejection region.
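The procedure above can be sketched in a few lines. This is a minimal illustration under the assumption that the posterior null probabilities and pixel areas are already available as plain lists; the function name `fdr_procedure` and the toy inputs are hypothetical.

```python
def fdr_procedure(t_or, area, alpha):
    """Indices of rejected pixels under the area-weighted step-up rule.

    t_or[i] is the posterior null probability at pixel i, area[i] = nu(S_i) > 0.
    Pixels are ordered by t_or; r is the largest j for which the area-weighted
    average of the j smallest t_or values is at most alpha.
    """
    order = sorted(range(len(t_or)), key=lambda i: t_or[i])
    cum_area = 0.0   # nu(R_j)
    cum_null = 0.0   # sum of T_OR^(i) * nu(S_(i)) over the first j pixels
    r = 0
    for j, i in enumerate(order, start=1):
        cum_area += area[i]
        cum_null += t_or[i] * area[i]
        if cum_null / cum_area <= alpha:
            r = j
    return order[:r]


# Toy example: five equal-area pixels (values are illustrative only).
t_or = [0.02, 0.60, 0.05, 0.30, 0.01]
area = [0.2] * 5
rejected = fdr_procedure(t_or, area, alpha=0.1)
print(sorted(rejected))
```

In the toy example the pixel with $T_{OR} = 0.30$ is still rejected because the running average over the rejection region, not the individual value, is compared with $\alpha$.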

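10 FDX control. Similar to FDR control, we control FDX by the following procedure. Procedure: Define $\mathrm{FDX}^m_{\tau,j} = P_\Theta\Big( \frac{I\{\nu(R_j) > 0\}}{\nu(R_j)} \sum_{s_i \in R_j} \{1 - \theta(s_i)\}\, \nu(S_i) > \tau - \epsilon_0 \,\Big|\, X \Big)$ and $r = \max\{j : \mathrm{FDX}^m_{\tau,j} \le \alpha\}$. Then the rejection region is given by $R = \bigcup_{i=1}^{r} S_{(i)}$.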
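A hedged sketch of the FDX procedure follows. It assumes the posterior probability in the definition is approximated by the share of posterior draws of $\theta$ whose discretized FDP exceeds $\tau - \epsilon_0$, and that pixels are ordered by the estimated posterior null probability as in the FDR procedure. The function name `fdx_procedure`, the default `eps0=0.0`, and the toy draws are assumptions for illustration.

```python
def fdx_procedure(theta_samples, area, tau, alpha, eps0=0.0):
    """Indices of rejected pixels under the FDX rule, using posterior draws.

    theta_samples[b][i] is the b-th posterior draw of theta(s_i) (0 or 1),
    area[i] = nu(S_i) > 0.
    """
    n_draws = len(theta_samples)
    m = len(area)
    # Estimated posterior null probability per pixel, used only for ordering.
    t_or = [sum(1 - th[i] for th in theta_samples) / n_draws for i in range(m)]
    order = sorted(range(m), key=lambda i: t_or[i])

    fp_area = [0.0] * n_draws  # false positive area in R_j for each draw
    cum_area = 0.0             # nu(R_j)
    r = 0
    for j, i in enumerate(order, start=1):
        cum_area += area[i]
        for b, th in enumerate(theta_samples):
            fp_area[b] += (1 - th[i]) * area[i]
        # Share of draws whose FDP over R_j exceeds tau - eps0.
        fdx_j = sum(1 for fp in fp_area if fp / cum_area > tau - eps0) / n_draws
        if fdx_j <= alpha:
            r = j
    return order[:r]


# Toy example: four posterior draws over three unit-area pixels (illustrative only).
theta_samples = [[1, 1, 0], [1, 1, 0], [1, 0, 0], [1, 1, 1]]
area = [1.0, 1.0, 1.0]
print(fdx_procedure(theta_samples, area, tau=0.3, alpha=0.25))
```

Because the estimated exceedance probability need not be monotone in $j$, the loop keeps the largest $j$ that satisfies the bound rather than stopping at the first violation.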

11 The decision process on a continuous spatial domain can be described, within a small margin of error, by a finite number of decisions on a grid of pixels. Theorem: Under conditions on the random field and the partition, (a) the FDR level of the FDR procedure satisfies $\mathrm{FDR} \le \alpha + o(1)$ as $m \to \infty$; (b) the FDX level of the FDX procedure at tolerance level $\tau$ satisfies $\mathrm{FDX}_\tau \le \alpha + o(1)$ as $m \to \infty$.

12 Computational Algorithms. The numerical methods for model fitting and parameter estimation in spatial models have been extensively studied in a Bayesian computational framework. Suppose the MCMC samples are $\{\mu^b : b = 1, \ldots, B\}$, where $\mu^b = [\mu^b(s_1), \ldots, \mu^b(s_m)]$ is the $b$-th sample. Let $\theta^b(s_i) = I[\mu^b(s_i) \in A^c]$. Then $T_{OR}(s_i)$ can be estimated by $\hat{T}_{OR}(s_i) = \frac{1}{B} \sum_{b=1}^{B} [1 - \theta^b(s_i)]$.
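The estimator above amounts to counting, for each pixel, the share of posterior draws that fall in the indifference region. The sketch below assumes the one-sided region $A = \{\mu : \mu \le \mu_0\}$ used as an example in the notation slide; the function name `estimate_t_or` and the toy draws are hypothetical.

```python
def estimate_t_or(mu_samples, mu0=0.0):
    """Monte Carlo estimate of P(theta(s_i) = 0 | X) for each pixel.

    mu_samples[b][i] is the b-th MCMC draw of mu(s_i). With the assumed
    indifference region A = {mu : mu <= mu0}, theta^b(s_i) = 1 when
    mu^b(s_i) > mu0, so 1 - theta^b(s_i) indicates a draw inside A.
    """
    n_draws = len(mu_samples)
    m = len(mu_samples[0])
    return [
        sum(1 for mu_b in mu_samples if mu_b[i] <= mu0) / n_draws
        for i in range(m)
    ]


# Toy example: four draws at two locations (values are illustrative only).
mu_samples = [[-0.5, 1.2], [0.3, 0.8], [-1.0, 1.5], [-0.2, -0.1]]
print(estimate_t_or(mu_samples))
```

The resulting list can be passed, together with the pixel areas, to an FDR or FDX selection rule.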

13 Simulation setting. Generate data from the model $x(s) = \mu(s) + \epsilon(s)$. Both the signals and errors are generated as Gaussian processes. The signal process $\mu$ has mean $\bar{\mu}$ and exponential covariance $\mathrm{Cov}[\mu(s), \mu(s')] = \sigma_\mu^2 \exp[-\|s - s'\| / \rho_\mu]$. The error process $\epsilon$ has mean zero and covariance $\mathrm{Cov}[\epsilon(s), \epsilon(s')] = (1 - r) I(s = s') + r \exp[-\|s - s'\| / \rho_\epsilon]$. Choose $n = 1000$, $r = 0.9$, $\bar{\mu} = -1$, and $\sigma_\mu = 2$. The expected proportion of positive signals is then $P\{\mu(s) > 0\} = \Phi(-1/2) \approx 0.31$.
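The simulation design can be sketched with NumPy as below. Several choices are assumptions that the slides do not state: locations are drawn uniformly on the unit square, the ranges `rho_mu` and `rho_eps` default to 0.1, and the signal mean is set to -1, the value for which $P\{\mu(s) > 0\} = \Phi(-1/2) \approx 0.31$ matches the stated proportion. The function names are hypothetical.

```python
import math

import numpy as np


def exp_cov(coords, scale, rho):
    """Exponential covariance scale * exp(-||s - s'|| / rho) for all point pairs."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return scale * np.exp(-dist / rho)


def expected_positive(mu_bar, sigma_mu):
    """P(mu(s) > 0) when mu(s) ~ N(mu_bar, sigma_mu^2)."""
    return 0.5 * (1.0 + math.erf(mu_bar / (sigma_mu * math.sqrt(2.0))))


def simulate(n=1000, r=0.9, mu_bar=-1.0, sigma_mu=2.0,
             rho_mu=0.1, rho_eps=0.1, seed=0):
    """Draw locations, the signal mu, and the observations x = mu + eps."""
    rng = np.random.default_rng(seed)
    coords = rng.uniform(0.0, 1.0, size=(n, 2))  # assumed: uniform on unit square
    cov_mu = exp_cov(coords, sigma_mu ** 2, rho_mu)
    # Nugget (1 - r) on the diagonal plus a spatially correlated part of weight r.
    cov_eps = (1.0 - r) * np.eye(n) + exp_cov(coords, r, rho_eps)
    mu = rng.multivariate_normal(np.full(n, mu_bar), cov_mu)
    eps = rng.multivariate_normal(np.zeros(n), cov_eps)
    return coords, mu, mu + eps


coords, mu, x = simulate(n=200)
print(expected_positive(-1.0, 2.0))  # theoretical share of positive signals
print(float(np.mean(mu > 0)))        # realized share in one simulated field
```

With strong spatial correlation the realized share of positive signals in a single field can differ noticeably from the theoretical value, which is one reason results are averaged over replications.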

14 Models. We compare seven methods:
Non-spatial approach of Benjamini and Hochberg (BH): FDR
Non-spatial approach of Genovese and Wasserman (GW)
Spatial approach of Pacifico et al (PGVW): FDR and FDX
Oracle (our approach with hyperparameters fixed at true values): FDR and FDX
MC (our approach with hyperparameters estimated using MCMC): FDR and FDX
Uninformative priors: $\bar{\mu} \sim N(0,\ )$, $\sigma_\mu^2 \sim \mathrm{Gamma}(0.1, 0.1)$, and $r, \rho_\mu, \rho_\epsilon \sim \mathrm{Uniform}(0, 1)$.
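For reference, the non-spatial BH baseline in the list above is the standard Benjamini and Hochberg step-up rule applied to point-wise p-values. The sketch below is a generic implementation, not the code used for the comparison; the function name `bh_reject` and the toy p-values are hypothetical.

```python
def bh_reject(pvalues, alpha):
    """Benjamini-Hochberg step-up rule: indices of rejected hypotheses.

    Finds the largest k with p_(k) <= k * alpha / m and rejects the k
    hypotheses with the smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k = rank
    return order[:k]


# Toy example (values are illustrative only).
pvalues = [0.001, 0.04, 0.03, 0.5, 0.012]
print(sorted(bh_reject(pvalues, alpha=0.1)))
```

Unlike the area-weighted procedures, this rule treats every location as one test of equal weight and ignores spatial dependence.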

15 False discovery rate (target is $\alpha = 0.1$) by $\rho_\mu$. [Figure: mean FDP versus spatial range for BH, GW, Oracle FDR, Oracle FDX, MC FDR, MC FDX, PGVW FDR, and PGVW FDX.]

16 Distribution of FDP by $\rho_\mu$. [Figure: distribution of FDP versus spatial range for Oracle FDR, Oracle FDX, MC FDR, and MC FDX.]

17 Missed discovery rate by $\rho_\mu$. [Figure: MDR versus spatial range.]

18 Summary of simulation. The oracle FDR procedure controls the FDR nearly perfectly. The MC FDR procedure with uninformative priors has good FDR control. FDX methods are more conservative than the FDR methods. The BH, GW, and PGVW methods are very conservative. The MDR levels of the oracle and MC methods are much lower than those of the BH, GW, and PGVW methods.

19 Ozone data analysis. [Figure: maps of (a) the posterior mean of $\mu(s)$ and (b) the posterior probability that $\mu(s) < 0.1$.]

20 Rejection region (black) using the FDX rule

21 Summary. The conventional view (e.g., Benjamini and Yekutieli (2001) and Sarkar (2002)) is that it is safe to apply standard methods as if the tests were independent. While standard methods control FDR, incorporating the underlying dependency structure can dramatically improve the power. A continuous decision process can be described, within a small margin of error, by a finite number of decisions on a grid of pixels. FDR and FDX control problems can be solved in a unified theoretical and computational framework. We have also extended this to deal with spatial clusters.

False discovery control in large-scale spatial multiple testing

J. R. Statist. Soc. B (2015) 77, Part 1, pp. 59-83. False discovery control in large-scale spatial multiple testing. Wenguang Sun, University of Southern California, Los Angeles, USA. Brian J. Reich, North Carolina