Correlation, z-values, and the Accuracy of Large-Scale Estimators. Bradley Efron Stanford University



2 Correlation and Accuracy
Modern scientific studies: $N$ cases (genes, SNPs, pixels, ...), each with its own summary statistic $z_i$, $i = 1, 2, \dots, N$, with $N \approx 10{,}000$
Estimate of interest: $\hat\theta = s(z)$ [e.g., $\hat\theta = \#\{z_i > 3\}/N$]
Question: How accurate is $\hat\theta$?
Easy answer if the $z_i$'s are independent (but usually they are not!)
Troubles for the bootstrap
Correlation, z-values, Accuracy 1
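The "easy answer" under independence can be sketched as follows. This is a minimal illustration, assuming simulated independent $N(0,1)$ z-values; the seed, the sample, and the variable names are hypothetical, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)      # assumed seed, for reproducibility only
N = 10_000
z = rng.standard_normal(N)          # hypothetical independent null z-values

# theta_hat = #{z_i > 3} / N
theta_hat = np.mean(z > 3)

# Binomial standard error; valid only if the z_i are independent
sd_indep = np.sqrt(theta_hat * (1.0 - theta_hat) / N)
print(theta_hat, sd_indep)
```

With correlated z-values this binomial formula would generally understate the true variability, which is the point the later slides address.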

3 Leukemia Microarray Study (Golub et al., 1999)
72 leukemia patients: $n_1 = 47$ ALL, $n_2 = 25$ AML
$N = 7128$ genes
Data matrix $X$: $X$ has independent columns but correlated rows
RMS correlation $\hat\alpha = .11$
$t_i$ = two-sample t-statistic, AML vs. ALL, for gene $i$
$z_i = \Phi^{-1}(F_{70}(t_i))$ [$\Phi$, $F_{70}$: cdfs of $N(0, 1)$ and $t_{70}$]
$H_0: z_i \sim N(0, 1)$ (theoretical null)

4 [Figure: Leukemia data, histogram of N = 7128 z-values comparing 47 ALL vs. 25 AML patients; RMS correlation = .11; central standard deviation sighat0 = 1.68. Curve: fhat(z), Poisson GLM spline, df = 5. x-axis: z values.]

5 [Figure: Leukemia z-value histogram and the average of 100 bootstrap z-value histograms (two-sample nonparametric bootstrap: resample columns of X). Curves: Poisson spline fit, bootstrap average. x-axis: z values; y-axis: Frequency.]

6 Bootstrap Dilation
$x_i$ = $i$th row of $X$ ($n = 72 = 47 + 25$)
$x_i \to z_i$, $x_i^* \to z_i^*$, with $z_i^* \sim z_i + N(0, \sigma_i^2)$
Bootstrap histogram has an extra component of variance:
$$E\left\{ \sum_{i=1}^{N} z_i^{*2} / N \right\} = \sum_{i=1}^{N} z_i^2 / N + \sum_{i=1}^{N} \sigma_i^2 / N$$
Next: bootstrap stdev estimates for $\hat F(x) = \#\{z_i \ge x\}/N$
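The dilation identity can be checked numerically. This is a sketch under stated assumptions: the observed z-values and the per-case bootstrap standard deviations `sigma` are simulated stand-ins, and the bootstrap is replaced by directly adding $N(0, \sigma_i^2)$ noise as in the slide's approximation.

```python
import numpy as np

rng = np.random.default_rng(1)              # assumed seed
N = 5000
z = rng.normal(0.0, 1.5, size=N)            # hypothetical observed z-values
sigma = rng.uniform(0.5, 1.0, size=N)       # hypothetical per-case sd's sigma_i

B = 200                                     # number of simulated "bootstrap" draws
ms_boot = np.empty(B)
for b in range(B):
    # z*_i = z_i + N(0, sigma_i^2)
    z_star = z + sigma * rng.standard_normal(N)
    ms_boot[b] = np.mean(z_star ** 2)

lhs = ms_boot.mean()                        # Monte Carlo estimate of E{sum z*_i^2 / N}
rhs = np.mean(z ** 2) + np.mean(sigma ** 2) # sum z_i^2 / N + sum sigma_i^2 / N
print(lhs, rhs)
```

The two printed values should agree up to Monte Carlo error, and both exceed the mean square of the original z-values, which is the "dilation".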

7 [Figure: Bootstrap standard deviation for the empirical cdf of the leukemia z-values, compared with Formula X. Curves: Formula X, Bootstrap. x-axis: x value; y-axis: Sd estimates.]

8 [Figure: Permutation and jackknife estimates of sd{empirical cdf}, compared with Formula X. Curves: jackknife, perm, Formula X. x-axis: x value; y-axis: Sd estimates.]

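9 Formula X
$$\mathrm{Var}\{\hat F(x)\} \approx \left\{ \frac{\hat F(x)\,(1 - \hat F(x))}{N} \right\} + \left\{ \frac{\hat\sigma_0^2\, \hat\alpha\, \hat f^{(1)}(x)}{\sqrt{2}} \right\}^2$$
First term: independence. Second term: correlation penalty.
$\hat\sigma_0 = 1.68$, from the empirical null
$\hat\alpha = .11$, the estimated RMS correlation
$\hat f^{(1)}(x)$ = first derivative of the estimate $\hat f(x)$
Depends on normality: $z_i \sim N(\mu_i, \sigma_i^2)$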
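Formula X, read as an independence term plus a squared correlation penalty, can be sketched as a small function. The function name and the example inputs are hypothetical; the example uses a standard normal null with a right-sided cdf, so $\hat\sigma_0 = 1$ here rather than the leukemia value.

```python
import math
import numpy as np

def formula_x_sd(F_hat, f1_hat, N, sigma0, alpha):
    """Standard deviation of F_hat(x) from the two-term variance formula.

    F_hat  : estimated cdf value at x
    f1_hat : estimated first derivative of the density at x
    """
    indep = F_hat * (1.0 - F_hat) / N                             # independence term
    penalty = (sigma0 ** 2 * alpha * f1_hat / np.sqrt(2.0)) ** 2  # correlation penalty
    return np.sqrt(indep + penalty)

# Hypothetical example: N(0,1) density, right-sided cdf, evaluated at x = 1
x = 1.0
F = 0.5 * math.erfc(x / math.sqrt(2.0))                        # Pr{z >= x}
f1 = -x * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)    # derivative of the density
sd_corr = formula_x_sd(F, f1, N=6000, sigma0=1.0, alpha=0.1)
sd_ind = formula_x_sd(F, f1, N=6000, sigma0=1.0, alpha=0.0)
print(sd_ind, sd_corr)
```

Setting `alpha=0.0` recovers the usual binomial standard deviation, so the gap between the two printed values is the contribution of the penalty term.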

10 Formula X for Leukemia Data
[Table: columns $x$, $\hat F(x)$, and two $\widehat{\mathrm{sd}}$ columns; numerical entries not recovered]

11 [Figure: Simulation of sd{Fhat(x)} from Formula X; N = 6000, n = 20 + 20, alpha = .10. Solid curve and bars: mean and standard deviation of the sdhat values over 100 simulations. Dashed curve: actual sd. y-axis: standard deviation estimates.]

12 Digression: The Non-Null Distribution of z-values
A z-value is a test statistic that is $N(0, 1)$ under $H_0$
Theorem: Under reasonable conditions the non-null distribution of $z$ is
$$z \sim N(\mu, \sigma^2) + O_p(1/n), \quad \text{where } \sigma^2 = 1 + O\!\left(1/n^{1/2}\right)$$
Normality degrades more slowly than unit standard deviation
Helps justify the model $z_i \sim N(\mu_i, \sigma_i^2)$

13 Student-t z-values
$t \sim t_\nu(\delta)$ [noncentral t, noncentrality $\delta$, df = $\nu$]
$H_0: \delta = 0$
$z = \Phi^{-1}(F_\nu(t))$ [$F_\nu$ = central t cdf, df = $\nu$], so under $H_0$, $z \sim N(0, 1)$
What if $\delta \ne 0$?

14 [Figure: Densities of z = phiinv(Fnu(t)), t ~ t(del, nu = 20), for del = 0, 1, 2, 3, 4, 5. Dotted-dashed lines: matching N(M, SD) densities. x-axis: z value.]

15 The Count Vector $y$
Partition the range $\mathcal{Z}$ of $z$ into $K$ bins: $\mathcal{Z} = \bigcup_{k=1}^{K} \mathcal{Z}_k$
Each bin of width $\Delta$
Bin centers $x_k$, $k = 1, 2, \dots, K$
(Leukemia histogram: $\mathcal{Z} = [-7.9, 7.9]$, $\Delta = .2$, $K = 79$)
Counts $y_k = \#\{z_i \in \mathcal{Z}_k\}$, $y = (y_1, y_2, \dots, y_K)'$
The count vector $y$ is a discretized order statistic of $z$ (most statistics of interest are of the form $\hat\theta = m(y)$)
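The binning described above can be sketched with `numpy.histogram`, using the leukemia histogram's range, bin width, and bin count. The z-values here are simulated placeholders, not the leukemia data.

```python
import numpy as np

rng = np.random.default_rng(2)          # assumed seed
z = rng.standard_normal(7128)           # hypothetical z-values

lo, hi, K = -7.9, 7.9, 79               # range and number of bins from the slide
edges = np.linspace(lo, hi, K + 1)      # K + 1 bin edges
delta = (hi - lo) / K                   # bin width, 0.2
x = edges[:-1] + delta / 2.0            # bin centers x_k

# y_k = #{z_i in bin k}
y, _ = np.histogram(z, bins=edges)
print(K, delta, y.sum())
```

Any statistic that depends on `z` only through `y`, such as a binned tail proportion, is then a function `m(y)` of the count vector.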

16 Multi-Class Normal Model
Suppose the $z_i$'s are in classes $\mathcal{C}_1, \mathcal{C}_2, \dots, \mathcal{C}_C$, with $z_i \sim N(\mu_c, \sigma_c^2)$ for $z_i \in \mathcal{C}_c$
$N_c = \#\{\mathcal{C}_c\}$, $p_c = N_c / N$ [so $\sum_c N_c = N$, $\sum_c p_c = 1$]
Correlation distribution: $g(\rho)$ = empirical density of all $\binom{N}{2}$ true correlations

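17 Mehler's Identity (Lancaster, 1958)
$\varphi_\rho(u, v)$ = standard normal bivariate density with correlation $\rho$
Mehler:
$$\lambda_\rho(u, v) = \frac{\varphi_\rho(u, v)}{\varphi(u)\varphi(v)} - 1 = \sum_{j \ge 1} \frac{\rho^j}{j!}\, h_j(u) h_j(v)$$
where $h_j$ is the $j$th Hermite polynomial
Crucial quantity:
$$\Lambda(u, v) = \int_{-1}^{1} \lambda_\rho(u, v)\, g(\rho)\, d\rho = \sum_{j \ge 1} \frac{\alpha_j}{j!}\, h_j(u) h_j(v), \quad \text{where } \alpha_j = \int_{-1}^{1} \rho^j g(\rho)\, d\rho$$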
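Mehler's identity can be verified numerically. The sketch below assumes the probabilists' Hermite polynomials (NumPy's `hermite_e` family) for $h_j$ and truncates the series at 40 terms; the sample of correlations standing in for $g(\rho)$ is hypothetical.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def hermite(j, u):
    """Probabilists' Hermite polynomial He_j(u)."""
    c = np.zeros(j + 1)
    c[j] = 1.0
    return hermeval(u, c)

def lambda_exact(rho, u, v):
    """phi_rho(u, v) / (phi(u) phi(v)) - 1, in closed form."""
    rho = np.asarray(rho, dtype=float)
    expo = (2.0 * rho * u * v - rho ** 2 * (u * u + v * v)) / (2.0 * (1.0 - rho ** 2))
    return np.exp(expo) / np.sqrt(1.0 - rho ** 2) - 1.0

def lambda_series(rho, u, v, J=40):
    """Truncated Mehler series sum_{j=1..J} rho^j / j! * h_j(u) h_j(v)."""
    return sum(rho ** j / math.factorial(j) * hermite(j, u) * hermite(j, v)
               for j in range(1, J + 1))

u, v = 1.0, -0.5
print(float(lambda_exact(0.3, u, v)), lambda_series(0.3, u, v))

# Lambda(u, v): average lambda_rho over a hypothetical sample of correlations
rng = np.random.default_rng(3)
rhos = np.clip(rng.normal(0.0, 0.1, size=1000), -0.5, 0.5)
Lambda_direct = lambda_exact(rhos, u, v).mean()
alpha_j = [np.mean(rhos ** j) for j in range(1, 41)]      # moments alpha_j of g
Lambda_series = sum(a / math.factorial(j) * hermite(j, u) * hermite(j, v)
                    for j, a in enumerate(alpha_j, start=1))
print(Lambda_direct, Lambda_series)
```

The second pair of values shows why only the moments $\alpha_j$ of the correlation distribution matter: the whole density $g(\rho)$ enters $\Lambda$ only through them.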

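18 Exact Covariance of $y$
$z_i \sim N(\mu_c, \sigma_c^2)$ for $z_i \in \mathcal{C}_c$; $N_c = \#\mathcal{C}_c$, $p_c = N_c / N$
Theorem: $\mathrm{cov}(y) = \mathrm{cov}_0 + \mathrm{cov}_1$, where
$$\mathrm{cov}_0 = N \sum_c p_c \left\{ \mathrm{diag}(\pi_c) - \pi_c \pi_c' \right\} \qquad \text{[independence]}$$
with $\pi_{ck} = \Pr_c\{z_i \in \text{bin } k\}$, $\pi_c = (\dots \pi_{ck} \dots)'$,
$$\mathrm{cov}_1 = N^2 \sum_c \sum_d p_c p_d B_{cd} - N \sum_c p_c B_{cc} \qquad \text{[corr penalty]}$$
and
$$B_{cd}(k, l) = \pi_{ck}\, \pi_{dl}\, \Lambda\!\left( \frac{x_k - \mu_c}{\sigma_c}, \frac{x_l - \mu_d}{\sigma_d} \right).$$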

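19 Four Simplifications of $\mathrm{cov}_1$
Drop the $N$ term
Microarray standardization methods make $\alpha_1 \approx 0$
Mehler expansion: $\alpha_2 = \int_{-1}^{1} \rho^2 g(\rho)\, d\rho$ is the lead term
Higher terms ignorable if $\alpha_2$ small
Simplified formula (almost Formula X): letting $\alpha = \alpha_2^{1/2}$ and $\phi^{(2)}_k = \sum_c p_c\, \varphi^{(2)}\!\left( \frac{x_k - \mu_c}{\sigma_c} \right) / \sigma_c$, with $\Delta$ the bin width,
$$\mathrm{cov}_1 \approx (N \Delta \alpha)^2\, \phi^{(2)} \phi^{(2)\prime} / 2 \qquad \text{[rms approximation]}$$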

20 Numerical Comparison
$N = 6000$, $\alpha = .1$
Two classes: $(p_c, \mu_c, \sigma_c) = (.95, 0, 1)$ and $(.05, 2.5, 1)$
Next figure compares standard deviations (square roots of the diagonal elements) of the exact $\mathrm{cov}(y)$ and the rms approximation
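A sketch of this comparison's ingredients, covering the independence covariance and the rms approximation to the correlation penalty only (the exact penalty is not computed here). Several things are assumptions: the binning reuses the leukemia histogram's range and bin count, the bin-width factor $\Delta$ is written explicitly on the basis of $\pi_{ck} \approx \Delta\, \varphi((x_k - \mu_c)/\sigma_c)/\sigma_c$, and all variable names are illustrative.

```python
import math
import numpy as np

N, alpha = 6000, 0.1
classes = [(0.95, 0.0, 1.0), (0.05, 2.5, 1.0)]     # (p_c, mu_c, sigma_c)

K, lo, hi = 79, -7.9, 7.9                          # assumed binning
edges = np.linspace(lo, hi, K + 1)
delta = (hi - lo) / K
x = edges[:-1] + delta / 2.0                       # bin centers

# Standard normal cdf, vectorized over arrays
Phi = np.vectorize(lambda t: 0.5 * math.erfc(-t / math.sqrt(2.0)))

def phi2(u):
    """Second derivative of the standard normal density: (u^2 - 1) phi(u)."""
    return (u ** 2 - 1.0) * np.exp(-0.5 * u ** 2) / math.sqrt(2.0 * math.pi)

cov0 = np.zeros((K, K))
phibar2 = np.zeros(K)
for p, mu, sig in classes:
    # pi_ck = Pr_c{z in bin k}; clipped to guard against rounding below zero
    pi_c = np.clip(Phi((edges[1:] - mu) / sig) - Phi((edges[:-1] - mu) / sig), 0.0, None)
    cov0 += N * p * (np.diag(pi_c) - np.outer(pi_c, pi_c))   # independence part
    phibar2 += p * phi2((x - mu) / sig) / sig

# rms approximation to the correlation penalty (rank one)
cov1_rms = (N * delta * alpha) ** 2 * np.outer(phibar2, phibar2) / 2.0

sd_indep = np.sqrt(np.diag(cov0))
sd_rms = np.sqrt(np.diag(cov0 + cov1_rms))
print(sd_indep.max(), sd_rms.max())
```

Because the approximate penalty is a rank-one positive semidefinite matrix, it can only increase the standard deviations relative to the independence case.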

21 [Figure: sd{y[k]} from the exact formula (solid) compared with the rms approximation (dashed); N = 6000, alpha = .1, (p0, mu0, sig0) = (.95, 0, 1) and (.05, 2.5, 1). Curves: exact, rms approximation, without corr penalty. Dashes show bin centers x[k]. x-axis: z value; y-axis: standard deviation sd{y[k]}.]

22 [Figure: Same numerical example, now sd{Fhat[k]}, where Fhat[k] = sum(y[l] for l >= k)/N. Curves: exact, rms approx, without corr penalty. x-axis: z value; y-axis: sd{Fhat}.]

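23 Estimation of RMS Correlation $\alpha$
$\hat\rho_{ii'}$ = empirical correlation between rows $i$ and $i'$ of $X$, the $N \times n$ expression matrix
$\{\hat\rho_{ii'}\}$ has mean and variance $(m, v)$ [leukemia: $(.00, .19^2)$]
$$\hat\alpha^2 = \frac{n}{n-1} \left( v - \frac{1}{n-1} \right)$$
[Table: $\hat\alpha$ for ALL, AML, and Both; numerical entries not recovered]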
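The estimator can be sketched on simulated data. The matrices below are hypothetical stand-ins for an expression matrix: one with independent rows and one where rows share a common factor with random signs, so that true correlations average about zero but have a nonzero root mean square. The function name is illustrative.

```python
import numpy as np

def rms_correlation_sq(X):
    """alpha_hat^2 = n/(n-1) * (v - 1/(n-1)), v = variance of row correlations."""
    N, n = X.shape
    R = np.corrcoef(X)                      # N x N matrix of row correlations
    rho = R[np.triu_indices(N, k=1)]        # all N(N-1)/2 distinct pairs
    v = rho.var()                           # empirical variance of the rho_hat's
    return n / (n - 1) * (v - 1.0 / (n - 1))

rng = np.random.default_rng(4)              # assumed seed
N, n = 200, 20
X_indep = rng.standard_normal((N, n))       # independent rows

# Correlated rows: a shared column factor entering with random sign per row
factor = rng.standard_normal(n)
signs = rng.choice([-1.0, 1.0], size=N)
X_corr = X_indep + signs[:, None] * factor[None, :]

a2_indep = rms_correlation_sq(X_indep)
a2_corr = rms_correlation_sq(X_corr)
print(a2_indep, a2_corr)
```

The subtraction of $1/(n-1)$ removes the sampling variance that empirical correlations would show even with independent rows, so the first value should be near zero. The estimate can come out slightly negative in that case, which would need truncating at zero before taking a square root.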

24 More General Accuracy Estimates
$Q$ = $q$-dimensional statistic of interest: $Q = Q(y)$
Influence function $\hat D$: $dQ = \hat D\, dy$ [$\hat D_{jk} = \partial Q_j / \partial y_k$]
$$\widehat{\mathrm{cov}}(Q) = \hat D\, \mathrm{cov}(y)\, \hat D'$$
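The delta-method calculation can be sketched with a numerically differentiated influence matrix. This is an illustration under assumptions: the counts are made up, `cov_y` is the independence (multinomial) covariance only, and the finite-difference Jacobian is a stand-in for an analytic $\hat D$.

```python
import numpy as np

def numerical_jacobian(Q, y, eps=1e-6):
    """Central-difference estimate of D[j, k] = dQ_j / dy_k."""
    y = np.asarray(y, dtype=float)
    q0 = np.atleast_1d(Q(y))
    D = np.empty((q0.size, y.size))
    for k in range(y.size):
        step = np.zeros_like(y)
        step[k] = eps
        D[:, k] = (np.atleast_1d(Q(y + step)) - np.atleast_1d(Q(y - step))) / (2.0 * eps)
    return D

def delta_method_cov(Q, y, cov_y):
    """cov(Q) approximated by D cov(y) D'."""
    D = numerical_jacobian(Q, y)
    return D @ cov_y @ D.T

# Hypothetical example: right-sided cdf from a small count vector
y = np.array([5.0, 20.0, 50.0, 20.0, 5.0])
N = 100.0
pi = y / N
cov_y = N * (np.diag(pi) - np.outer(pi, pi))        # independence covariance
Q = lambda counts: np.cumsum(counts[::-1])[::-1] / N  # Fhat[k] = sum_{l >= k} y_l / N

cov_Q = delta_method_cov(Q, y, cov_y)
print(np.sqrt(np.clip(np.diag(cov_Q), 0.0, None)))
```

For a linear statistic such as this cdf the result is exact; for nonlinear $Q$ it is a first-order approximation. A covariance for $y$ that includes the correlation penalty could be passed in place of `cov_y`.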

25 Example: Accuracy of $\log(\hat f)$
$z \to y \to \hat f$ by Poisson GLM of counts $y_k$ on polynomial($x_k$)
$Q = \log(\hat f) = (\dots \log \hat f(x_k) \dots)'$
$$\hat D = M \left[ M' \mathrm{diag}(\hat f)\, M \right]^{-1} M' / N$$
with $M$ the GLM structure matrix

26 Local False Discovery Rate
$p_0$ = prior Pr{null}, $z \sim f_0(z)$
$p_1$ = prior Pr{non-null}, $z \sim f_1(z)$
Mixture: $f(z) = p_0 f_0(z) + p_1 f_1(z)$
Estimated local false discovery rate:
$$\widehat{\mathrm{fdr}}(z) = \widehat{\Pr}\{\text{null} \mid z\} = p_0 f_0(z) / \hat f(z)$$
$$\mathrm{cov}\{\log \widehat{\mathrm{fdr}}\} \approx \mathrm{cov}\{\log \hat f\}$$
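The local false discovery rate can be sketched for the two-class example used elsewhere in the talk, $(p, \mu, \sigma) = (.95, 0, 1)$ and $(.05, 2.5, 1)$. Note the assumption: the true mixture density is plugged in where an estimate $\hat f$ would be used in practice, and the function names are illustrative.

```python
import numpy as np

def normal_pdf(z, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at z."""
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

p0, p1 = 0.95, 0.05

def local_fdr(z):
    """fdr(z) = p0 f0(z) / f(z), with f = p0 f0 + p1 f1."""
    f0 = normal_pdf(z)                 # null density N(0, 1)
    f1 = normal_pdf(z, 2.5, 1.0)       # non-null density N(2.5, 1)
    f = p0 * f0 + p1 * f1              # mixture density
    return p0 * f0 / f

z_grid = np.linspace(-4.0, 6.0, 101)
fdr_grid = local_fdr(z_grid)
print(local_fdr(0.0), local_fdr(4.0))
```

Since $\log \mathrm{fdr} = \log(p_0 f_0) - \log f$ and the numerator is treated as known, variability in the estimated fdr comes from the estimate of $f$, which is the reason the covariance of $\log \hat f$ carries over.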

27 [Figure: sd{log fdrhat(z)}; N = 6000, alpha = 0, .1, and .2; (p0, mu, sig) = (.95, 0, 1) and (.05, 2.5, 1). Curves: alpha = 0, alpha = .1, alpha = .2. Stars are sd's for N = 1500, alpha = .1; numbers are fdrhat[z]. x-axis: z value; y-axis: sd.]

28 [Figure: Comparison of sd's for log{fdrhat} and log{Fdrhat}, alpha = .1. Curves: sdlogfdrnon, sdlogFdr, sdlogfdr. Numbers are Fdr[z]. x-axis: z value; y-axis: sd.]

29 References
Efron, B. (2007a). Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 102.
Efron, B. (2007b). Size, power and false discovery rates. Ann. Statist. 35.
Efron, B. (2010). Correlated z-values and the accuracy of large-scale statistical estimates. J. Amer. Statist. Assoc. To appear ( brad/papers).
Golub, T., Slonim, D. and Tamayo, P. et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286.
Lancaster, H. (1958). The structure of bivariate distributions. Ann. Math. Statist. 29.
Owen, A. B. (2005). Variance of the number of false discoveries. J. Roy. Statist. Soc. Ser. B 67.
