Correlation, z-values, and the Accuracy of Large-Scale Estimators
Bradley Efron
Stanford University

Correlation and Accuracy
Modern scientific studies: N cases (genes, SNPs, pixels, ...), each with its own summary statistic $z_i$, $i = 1, 2, \dots, N$; $N \approx 10{,}000$
Estimate of interest: $\hat\theta = s(\mathbf{z})$ [e.g., $\hat\theta = \#\{z_i > 3\}/N$]
Question: How accurate is $\hat\theta$?
Easy answer if the $z_i$'s are independent (but usually they are not!)
Troubles for the bootstrap
Correlation, z-values, Accuracy 1

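A minimal sketch of the "easy answer" under independence: if the $z_i$ were independent, $N\hat\theta$ would be binomial and the standard error of $\hat\theta$ would be $\sqrt{\hat\theta(1-\hat\theta)/N}$. The function names and the simulated z-values are illustrative assumptions, not part of the talk.

```python
import math
import random

def tail_proportion(z, cutoff=3.0):
    """Return theta_hat = #{z_i > cutoff} / N."""
    return sum(1 for zi in z if zi > cutoff) / len(z)

def independence_se(theta_hat, n_cases):
    """Binomial standard error; valid only if the z_i are independent."""
    return math.sqrt(theta_hat * (1.0 - theta_hat) / n_cases)

# Hypothetical data: 10,000 independent N(0, 1) z-values.
rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

theta_hat = tail_proportion(z)
se = independence_se(theta_hat, len(z))
print(theta_hat, se)
```

With correlated z-values this standard error can be far too small, which is the point of the slides that follow.
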
Leukemia Microarray Study (Golub et al., 1999)
72 leukemia patients: $n_1 = 47$ ALL, $n_2 = 25$ AML
$N = 7128$ genes
Data matrix $X$ is $7128 \times 72$
$X$ has independent columns but correlated rows; rms correlation $\hat\alpha = .11$
$t_i$ = two-sample t-statistic, AML vs. ALL, for gene $i$
$z_i = \Phi^{-1}(F_{70}(t_i))$ [$\Phi$ and $F_{70}$ are the cdfs of $N(0, 1)$ and $t_{70}$]
$H_0: z_i \sim N(0, 1)$ (theoretical null)

[Figure: Leukemia data, N = 7128 z-values comparing 47 ALL vs. 25 AML patients; RMS correlation = .11; central standard deviation sighat0 = 1.68. Histogram of the z values (x-axis: z values; y-axis: counts) with fitted curve fhat(z) from a Poisson GLM spline, df = 5.]

[Figure: Leukemia z-value histogram and the average of 100 bootstrap z-value histograms (two-sample nonparametric bootstrap: resample columns of X). x-axis: z values; y-axis: Frequency. Curves: Poisson spline fit, bootstrap average.]

Bootstrap Dilation
$x_i$ = $i$th row of $X$ ($n = 72 = 47 + 25$)
$x_i \to z_i$ and $x_i^* \to z_i^*$, with $z_i^* \sim z_i + N(0, \sigma_i^2)$
Bootstrap histogram has an extra component of variance:
$$E_*\Big\{\sum_{i=1}^{N} z_i^{*2} / N\Big\} = \sum_{i=1}^{N} z_i^2 / N + \sum_{i=1}^{N} \sigma_i^2 / N$$
Next: bootstrap stdev estimates for the right-sided empirical cdf $\hat F(x) = \#\{z_i \ge x\}/N$

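The dilation identity can be checked by a small Monte Carlo sketch. The observed z-values, the per-case standard deviations, and the use of independent normal noise are all hypothetical choices made for illustration; the identity itself only needs each $z_i^*$ to have mean $z_i$ and variance $\sigma_i^2$.

```python
import random

rng = random.Random(1)
N = 2000
z = [rng.gauss(0.0, 1.5) for _ in range(N)]        # hypothetical observed z-values
sigma = [rng.uniform(0.5, 1.0) for _ in range(N)]  # hypothetical bootstrap sd per case

def mean_square(values):
    """Return sum(v^2) / len(values)."""
    return sum(v * v for v in values) / len(values)

# Right-hand side of the identity: sum z_i^2 / N + sum sigma_i^2 / N.
predicted = mean_square(z) + mean_square(sigma)

# Monte Carlo estimate of the left-hand side E*{ sum z*_i^2 / N }.
B = 200
total = 0.0
for _ in range(B):
    z_star = [zi + si * rng.gauss(0.0, 1.0) for zi, si in zip(z, sigma)]
    total += mean_square(z_star)
simulated = total / B

print(predicted, simulated)
```

The bootstrap histogram is therefore wider than the original histogram by the average of the $\sigma_i^2$.
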
[Figure: Bootstrap stdev for the empirical cdf of the leukemia z-values, compared with Formula X. x-axis: x value (-10 to 10); y-axis: Sd estimates (0.000 to 0.025). Curves: Formula X, Bootstrap.]

[Figure: Permutation and jackknife estimates of sd{empirical cdf}, compared with Formula X. x-axis: x value (-10 to 10); y-axis: Sd estimates (0.000 to 0.025). Curves: jackknife, perm, Formula X.]

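Formula X
$$\mathrm{Var}\{\hat F(x)\} \doteq \left\{\frac{\hat F(x)\,(1 - \hat F(x))}{N}\right\} + \left\{\frac{\hat\sigma_0^2\,\hat\alpha\,\hat f^{(1)}(x)}{\sqrt{2}}\right\}^2$$
First term: independence. Second term: correlation penalty.
$\hat\sigma_0 = 1.68$ from the empirical null
$\hat\alpha = .11$, the estimated RMS correlation
$\hat f^{(1)}(x)$ = first derivative of the estimate $\hat f(x)$
Depends on normality: $z_i \sim N(\mu_i, \sigma_i^2)$
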
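Formula X translates directly into a few lines of code. The sketch below assumes the reading of the formula in which the penalty term is $\{\hat\sigma_0^2 \hat\alpha \hat f^{(1)}(x)/\sqrt{2}\}^2$; the function name and the derivative value `f1_hat = -0.1` are made up for illustration and are not taken from the talk.

```python
import math

def formula_x_sd(F_hat, N, sigma0_hat, alpha_hat, f1_hat):
    """Standard deviation of F_hat(x): independence term plus correlation penalty."""
    independence = F_hat * (1.0 - F_hat) / N
    penalty = (sigma0_hat ** 2 * alpha_hat * f1_hat / math.sqrt(2.0)) ** 2
    return math.sqrt(independence + penalty)

# Leukemia-style inputs; f1_hat is a hypothetical value of the density derivative.
sd_with_penalty = formula_x_sd(0.13, 7128, 1.68, 0.11, f1_hat=-0.1)
sd_independent = formula_x_sd(0.13, 7128, 1.68, 0.0, f1_hat=-0.1)
print(sd_with_penalty, sd_independent)
```

Setting `alpha_hat = 0` recovers the usual binomial standard deviation, so the second term is the entire price of correlation.
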
Formula X for Leukemia Data
x:              1      2      3      4      5
$\hat F(x)$:    .29    .13    .057   .025   .010
$\hat{sd}$:     .017   .022   .010   .004   .002
$\hat{sd}_0$:   .005   .004   .003   .002   .001

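The last row of the table appears to be the independence term alone, $\sqrt{\hat F(x)(1-\hat F(x))/N}$ with $N = 7128$. This reading is an inference from the numbers rather than something the slide states; the sketch below recomputes the row from the tabulated $\hat F(x)$ values under that assumption.

```python
import math

N = 7128
F_hat = [0.29, 0.13, 0.057, 0.025, 0.010]   # tabulated values at x = 1, ..., 5

# Independence-only standard deviation, rounded to the table's three decimals.
sd0 = [round(math.sqrt(F * (1.0 - F) / N), 3) for F in F_hat]
print(sd0)
```
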
[Figure: Simulation of sd{Fhat(x)} from Formula X; N = 6000, n = 20 + 20, alpha = .10. Solid curve and bars are the mean and stdev of the sdhat values over 100 simulations; dashed curve is the actual sd. x-axis: -4 to 4; y-axis: standard deviation estimates (0.000 to 0.020).]

Digression: The Non-Null Distribution of z-values
A z-value is a test statistic that is $N(0, 1)$ under $H_0$
Theorem. Under reasonable conditions the non-null distribution of $z$ is
$$z \sim N(\mu, \sigma^2) + O_p(1/n), \quad \text{where} \quad \sigma^2 = 1 + O\big(1/n^{1/2}\big)$$
Normality degrades more slowly than unit standard deviation
Helps justify the model $z_i \sim N(\mu_i, \sigma_i^2)$

Student-t z-values
$t \sim t_\nu(\delta)$ [noncentral t, noncentrality $\delta$, df = $\nu$]
$H_0: \delta = 0$
$z = \Phi^{-1} F_\nu(t)$ [$F_\nu$ is the central t cdf, df = $\nu$]
so under $H_0$, $z \sim N(0, 1)$
What if $\delta \ne 0$?

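The transformation $z = \Phi^{-1}(F_\nu(t))$ can be sketched with the standard library alone. The central t cdf below is computed by Simpson's rule, which is a convenience chosen for this illustration and not a method from the talk; the function names and the clamping constant are likewise assumptions.

```python
import math
from statistics import NormalDist

def t_pdf(t, nu):
    """Density of the central Student-t distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(t * t / nu))

def t_cdf(t, nu, steps=2000):
    """Central t cdf via Simpson's rule on [0, |t|]; steps must be even."""
    if t == 0:
        return 0.5
    a = abs(t)
    h = a / steps
    s = t_pdf(0.0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    half = s * h / 3.0
    return 0.5 + half if t > 0 else 0.5 - half

def t_to_z(t, nu):
    """z = Phi^{-1}(F_nu(t)), clamped away from 0 and 1 for numerical safety."""
    p = min(max(t_cdf(t, nu), 1e-15), 1.0 - 1e-15)
    return NormalDist().inv_cdf(p)

print(t_to_z(2.0, 70))
```

For moderate degrees of freedom the z-value is slightly closer to zero than the t-value, because the t distribution has heavier tails than the normal.
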
[Figure: Densities of z = phiinv(fnu(t)), t ~ t(del, nu = 20), for del = 0, 1, 2, 3, 4, 5; the dotted/dashed lines are the matching N(M, SD) densities. x-axis: z value (-4 to 6); y-axis: density (0.0 to 0.6).]

The Count Vector y
Partition the range $\mathcal{Z}$ of $z$ into $K$ bins: $\mathcal{Z} = \bigcup_{k=1}^{K} \mathcal{Z}_k$
Each bin has width $\Delta$
Bin centers $x_k$, $k = 1, 2, \dots, K$
(Leukemia histogram: $\mathcal{Z} = [-7.9, 7.9]$, $\Delta = .2$, $K = 79$)
Counts $y_k = \#\{z_i \in \mathcal{Z}_k\}$, $\mathbf{y} = (y_1, y_2, \dots, y_K)'$
The count vector $\mathbf{y}$ is a discretized order statistic of $\mathbf{z}$
(most statistics of interest are of the form $\hat\theta = m(\mathbf{y})$)

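A short sketch of the binning step, using the leukemia histogram settings quoted on the slide (range $[-7.9, 7.9]$, width .2, 79 bins). The function name and the choice to drop values outside the range are assumptions of this illustration.

```python
import math

LO, HI, DELTA = -7.9, 7.9, 0.2
K = round((HI - LO) / DELTA)                           # number of bins
centers = [LO + (k + 0.5) * DELTA for k in range(K)]   # bin centers x_k

def count_vector(z):
    """Return y with y[k] = number of z_i in bin k (0-based index)."""
    y = [0] * K
    for zi in z:
        k = math.floor((zi - LO) / DELTA)
        if 0 <= k < K:          # values outside [LO, HI) are dropped in this sketch
            y[k] += 1
    return y

y = count_vector([0.0, 0.05, -7.85, 100.0])
print(K, sum(y))
```

Any statistic that depends on the z-values only through the histogram, such as a tail proportion at a bin edge, can then be written as a function of `y`.
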
Multi-Class Normal Model
Suppose the $z_i$'s are in classes $C_1, C_2, \dots, C_C$, with
$$z_i \sim N(\mu_c, \sigma_c^2) \quad \text{for } z_i \in C_c$$
$N_c = \#\{C_c\}$, $p_c = N_c / N$ [so $\sum_c N_c = N$, $\sum_c p_c = 1$]
Correlation distribution: $g(\rho)$ = empirical density of all $\binom{N}{2}$ true correlations

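Mehler's Identity (Lancaster, 1958)
$\varphi_\rho(u, v)$ = standard normal bivariate density
Mehler:
$$\lambda_\rho(u, v) = \frac{\varphi_\rho(u, v)}{\varphi(u)\varphi(v)} - 1 = \sum_{j \ge 1} \frac{\rho^j}{j!}\, h_j(u)\, h_j(v)$$
where $h_j$ is the $j$th Hermite polynomial
Crucial quantity:
$$\Lambda(u, v) = \int_{-1}^{1} \lambda_\rho(u, v)\, g(\rho)\, d\rho = \sum_{j \ge 1} \frac{\alpha_j}{j!}\, h_j(u)\, h_j(v), \quad \text{where} \quad \alpha_j = \int_{-1}^{1} \rho^j g(\rho)\, d\rho$$
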
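Mehler's identity is easy to check numerically. The sketch below compares the closed-form ratio with a truncated Hermite series, using the probabilists' Hermite polynomials ($h_0 = 1$, $h_1 = x$, $h_{j+1} = x h_j - j h_{j-1}$). The function names, the truncation at 40 terms, and the test point are choices made for illustration.

```python
import math

def hermite(j_max, x):
    """Probabilists' Hermite polynomials h_0(x), ..., h_{j_max}(x)."""
    h = [1.0, x]
    for j in range(1, j_max):
        h.append(x * h[j] - j * h[j - 1])
    return h[: j_max + 1]

def lambda_closed(rho, u, v):
    """phi_rho(u, v) / (phi(u) phi(v)) - 1 from the bivariate normal density."""
    q = (u * u - 2.0 * rho * u * v + v * v) / (2.0 * (1.0 - rho * rho))
    return math.exp(-q + (u * u + v * v) / 2.0) / math.sqrt(1.0 - rho * rho) - 1.0

def lambda_series(rho, u, v, terms=40):
    """Truncated Mehler series: sum over j >= 1 of rho^j / j! * h_j(u) * h_j(v)."""
    hu, hv = hermite(terms, u), hermite(terms, v)
    return sum(rho ** j / math.factorial(j) * hu[j] * hv[j]
               for j in range(1, terms + 1))

closed = lambda_closed(0.3, 1.0, -0.5)
series = lambda_series(0.3, 1.0, -0.5)
print(closed, series)
```

Averaging the series over $g(\rho)$ term by term replaces $\rho^j$ with the moment $\alpha_j$, which gives the expansion of $\Lambda(u, v)$.
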
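Exact Covariance of y
$z_i \sim N(\mu_c, \sigma_c^2)$ for $z_i \in C_c$; $N_c = \#C_c$, $p_c = N_c / N$
Theorem. $\mathrm{cov}(\mathbf{y}) = \mathrm{cov}_0 + \mathrm{cov}_1$, where
$$\mathrm{cov}_0 = N \sum_c p_c \left\{\mathrm{diag}(\pi_c) - \pi_c \pi_c'\right\} \quad \text{[independence]}$$
with $\pi_{ck} = \Pr_c\{z_i \in \text{bin } k\}$ and $\pi_c = (\dots \pi_{ck} \dots)'$, and
$$\mathrm{cov}_1 = N^2 \sum_c \sum_d p_c\, p_d\, B_{cd} - N \sum_c p_c\, B_{cc} \quad \text{[correlation penalty]}$$
with
$$B_{cd}(k, l) = \pi_{ck}\, \pi_{dl}\, \Lambda\!\left(\frac{x_k - \mu_c}{\sigma_c}, \frac{x_l - \mu_d}{\sigma_d}\right)$$
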
Four Simplifications of $\mathrm{cov}_1$
Drop the $N$ term
Microarray standardization methods make $\alpha_1 \doteq 0$
Mehler expansion: $\alpha_2 = \int_{-1}^{1} \rho^2 g(\rho)\, d\rho$ is the lead term
Higher terms are ignorable if $\alpha_2$ is small
Simplified formula (almost Formula X): letting $\alpha = \alpha_2^{1/2}$ and
$$\phi^{(2)}_k = \sum_c p_c\, \varphi^{(2)}\!\left(\frac{x_k - \mu_c}{\sigma_c}\right) \Big/ \sigma_c,$$
$$\mathrm{cov}_1 \doteq (N \Delta \alpha)^2\, \phi^{(2)} \phi^{(2)\prime} / 2 \quad \text{[rms approximation]}$$

Numerical Comparison
$N = 6000$, $\alpha = .1$
Two classes: $(p_c, \mu_c, \sigma_c) = (.95, 0, 1)$ and $(.05, 2.5, 1)$
The next figure compares the standard deviations (square roots of the diagonal elements) of the exact $\mathrm{cov}(\mathbf{y})$ with the rms approximation

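The rms-approximation side of this comparison can be sketched as follows, for the two-class example with $N = 6000$ and $\alpha = .1$. The bin width `DELTA = 0.2` is an assumption borrowed from the leukemia histogram, since the slide does not state the bin width used here, and the function names are illustrative. The sketch computes the diagonal of $\mathrm{cov}_0$ and of the rms approximation to $\mathrm{cov}_1$, not the exact $\mathrm{cov}_1$.

```python
import math
from statistics import NormalDist

N, ALPHA, DELTA = 6000, 0.1, 0.2                  # DELTA is an assumed bin width
CLASSES = [(0.95, 0.0, 1.0), (0.05, 2.5, 1.0)]    # (p_c, mu_c, sigma_c)
std = NormalDist()

def phi2(u):
    """Second derivative of the standard normal density: (u^2 - 1) * phi(u)."""
    return (u * u - 1.0) * std.pdf(u)

def sd_independence(x):
    """sqrt of the cov_0 diagonal at bin center x: N * sum_c p_c pi_ck (1 - pi_ck)."""
    total = 0.0
    for p, mu, sig in CLASSES:
        pi = (std.cdf((x + DELTA / 2 - mu) / sig)
              - std.cdf((x - DELTA / 2 - mu) / sig))
        total += p * pi * (1.0 - pi)
    return math.sqrt(N * total)

def sd_penalty_rms(x):
    """sqrt of the rms-approximation diagonal: N * DELTA * alpha * |phi2_k| / sqrt(2)."""
    phi2_k = sum(p * phi2((x - mu) / sig) / sig for p, mu, sig in CLASSES)
    return N * DELTA * ALPHA * abs(phi2_k) / math.sqrt(2.0)

def sd_total_rms(x):
    """sd{y_k} when the rms penalty is added to the independence variance."""
    return math.hypot(sd_independence(x), sd_penalty_rms(x))

for x in (-2.0, 0.0, 2.0):
    print(x, sd_independence(x), sd_total_rms(x))
```

Near the center of the histogram the penalty term dominates the independence term, while near $|x| = 1$, where $\varphi^{(2)}$ changes sign, it nearly vanishes.
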
[Figure: sd{y[k]} from the exact formula (solid) compared with the rms approximation (dashed); N = 6000, alpha = .1, (p0, mu0, sig0) = (.95, 0, 1) and (.05, 2.5, 1). x-axis: z value (-4 to 4), with dashes showing the bin centers x[k]; y-axis: standard deviation (0 to 40). Curves: rms approximation; sd{y[k]}, exact; without corr penalty.]

[Figure: Same numerical example, now sd{Fhat[k]}, where Fhat[k] = sum(y[l] for l >= k)/n. x-axis: z value (-4 to 4); y-axis: sd{Fhat} (0 to 100). Curves: rms approx; without corr penalty; exact.]

Estimation of RMS Correlation $\alpha$
$\hat\rho_{ii'}$ = empirical correlation between rows $i$ and $i'$ of $X$, the $N \times n$ expression matrix
$\{\hat\rho_{ii'}\}$ has mean and variance $(m, v)$ [leukemia: $(.00, .19^2)$]
$$\hat\alpha^2 = \frac{n}{n-1}\left(v - \frac{1}{n-1}\right)$$
                 ALL    AML    Both
$\hat\alpha$:    .121   .109   .114

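A sketch of the estimator, assuming the formula reads $\hat\alpha^2 = \frac{n}{n-1}\big(v - \frac{1}{n-1}\big)$ with $v$ the variance of all pairwise row correlations. The simulated matrix with independent rows is hypothetical; for such data the estimate should be close to zero, since the sampling variance $1/(n-1)$ of a sample correlation is subtracted off.

```python
import math
import random

def rms_correlation_sq(X):
    """alpha_hat^2 = n/(n-1) * (v - 1/(n-1)), v = variance of the row correlations."""
    n = len(X[0])
    rows = []
    for row in X:
        m = sum(row) / n
        centered = [x - m for x in row]
        norm = math.sqrt(sum(c * c for c in centered))
        rows.append([c / norm for c in centered])   # unit-length centered row
    # Sample correlation of two rows = dot product of their standardized versions.
    corrs = [sum(a * b for a, b in zip(rows[i], rows[j]))
             for i in range(len(rows)) for j in range(i + 1, len(rows))]
    mean = sum(corrs) / len(corrs)
    v = sum((r - mean) ** 2 for r in corrs) / len(corrs)
    return n / (n - 1) * (v - 1.0 / (n - 1))

# Hypothetical data: 150 independent rows, 20 columns.
rng = random.Random(2)
X = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(150)]
alpha_sq = rms_correlation_sq(X)
print(alpha_sq)
```

The estimate of $\hat\alpha^2$ can come out slightly negative by chance, so in practice one would clip it at zero before taking the square root.
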
More General Accuracy Estimates
$Q$ = $q$-dimensional statistic of interest: $Q = Q(\mathbf{y})$
Influence function $\hat D$: $dQ = \hat D\, d\mathbf{y}$ [$\hat D_{jk} = \partial Q_j / \partial y_k$]
$$\widehat{\mathrm{cov}}(Q) = \hat D\, \widehat{\mathrm{cov}}(\mathbf{y})\, \hat D'$$

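A small sketch of the delta-method step $\widehat{\mathrm{cov}}(Q) = \hat D\,\widehat{\mathrm{cov}}(\mathbf{y})\,\hat D'$. For illustration it takes $Q$ to be the right-sided cdf at each bin, $Q_j = \sum_{l \ge j} y_l / N$, which is linear in $\mathbf{y}$ so that $\hat D$ is exact, and it uses only the independence (multinomial) covariance for $\mathbf{y}$. The three-bin probabilities and helper names are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def delta_method_cov(D, cov_y):
    """Return D cov(y) D'."""
    return matmul(matmul(D, cov_y), transpose(D))

N = 1000
pi = [0.2, 0.5, 0.3]                      # hypothetical bin probabilities
K = len(pi)

# Multinomial covariance N * (diag(pi) - pi pi'): independence term only.
cov_y = [[N * ((pi[k] if k == l else 0.0) - pi[k] * pi[l]) for l in range(K)]
         for k in range(K)]

# Influence matrix for Q_j = sum_{l >= j} y_l / N.
D = [[1.0 / N if k >= j else 0.0 for k in range(K)] for j in range(K)]

cov_Q = delta_method_cov(D, cov_y)
print([cov_Q[j][j] for j in range(K)])
```

The first diagonal entry is zero because $Q_1$ is the total count divided by $N$, which is fixed; adding a correlation penalty to `cov_y` would change the other entries without changing the recipe.
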
Example: Accuracy of $\log(\hat f)$
$\mathbf{z} \to \mathbf{y} \to \hat f$ by Poisson GLM of the counts $y_k$ on polynomial$(x_k)$
$Q = \log(\hat f) = (\dots \log \hat f(x_k) \dots)'$
$$\hat D = M \left[ M'\, \mathrm{diag}(\hat f)\, M \right]^{-1} M' / N$$
with $M$ the GLM structure matrix

Local False Discovery Rate
$p_0$ = prior Pr{null}, $z \sim f_0(z)$
$p_1$ = prior Pr{non-null}, $z \sim f_1(z)$
Mixture: $f(z) = p_0 f_0(z) + p_1 f_1(z)$
Estimated local false discovery rate:
$$\widehat{\mathrm{fdr}}(z) = \widehat{\Pr}\{\text{null} \mid z\} = p_0 f_0(z) / \hat f(z)$$
$$\mathrm{cov}\{\log \widehat{\mathrm{fdr}}\} \doteq \mathrm{cov}\{\log \hat f\}$$

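A sketch of the local false discovery rate for the two-class example used earlier, $(p_0, \mu_0, \sigma_0) = (.95, 0, 1)$ and $(p_1, \mu_1, \sigma_1) = (.05, 2.5, 1)$. It plugs in the true mixture density in place of the estimate $\hat f$, which is a simplification made for this illustration.

```python
from statistics import NormalDist

P0, P1 = 0.95, 0.05
f0 = NormalDist(0.0, 1.0)   # null density
f1 = NormalDist(2.5, 1.0)   # non-null density

def mixture_density(z):
    """f(z) = p0 * f0(z) + p1 * f1(z)."""
    return P0 * f0.pdf(z) + P1 * f1.pdf(z)

def local_fdr(z):
    """fdr(z) = p0 * f0(z) / f(z)."""
    return P0 * f0.pdf(z) / mixture_density(z)

for z in (2.0, 2.5, 3.0, 3.5):
    print(z, round(local_fdr(z), 3))
```

Because $\log \mathrm{fdr}(z) = \log\{p_0 f_0(z)\} - \log f(z)$ and the numerator is treated as known, the variability of the log fdr estimate comes from $\log \hat f$.
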
[Figure: sd{log fdrhat(z)}; N = 6000, alpha = 0, .1, and .2; (p0, mu, sig) = (.95, 0, 1) and (.05, 2.5, 1). x-axis: z value (2.0 to 3.5); y-axis: sd (0.00 to 0.25). Curves: alpha = .2, alpha = .1, alpha = 0. Stars are the sd's for N = 1500, alpha = .1; the numbers along the axis are fdrhat[z]: 0.69, 0.58, 0.44, 0.25, 0.09, 0.03.]

[Figure: Comparison of the sd's for log{fdrhat} and log{Fdrhat}, alpha = .1. x-axis: z value (2.0 to 3.5); y-axis: sd (0.00 to 0.25). Curve labels as printed: sdlogfdrnon, sdlogfdr, sdlogfdr. The numbers along the axis are Fdr[z]: 0.34, 0.26, 0.18, 0.1, 0.04, 0.01.]

References
Efron, B. (2007a). Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 102: 93-103.
Efron, B. (2007b). Size, power and false discovery rates. Ann. Statist. 35: 1351-1377.
Efron, B. (2010). Correlated z-values and the accuracy of large-scale statistical estimates. J. Amer. Statist. Assoc. To appear (http://stat.stanford.edu/~brad/papers).
Golub, T., Slonim, D. and Tamayo, P. et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286: 531-537.
Lancaster, H. (1958). The structure of bivariate distributions. Ann. Math. Statist. 29: 719-736.
Owen, A. B. (2005). Variance of the number of false discoveries. J. Roy. Statist. Soc. Ser. B 67: 411-426.