Correlation, z-values, and the Accuracy of Large-Scale Estimators. Bradley Efron Stanford University

Correlation and Accuracy
Modern scientific studies: $N$ cases (genes, SNPs, pixels, ...), each with its own summary statistic $z_i$, $i = 1, 2, \ldots, N$, with $N$ on the order of 10,000.
Estimate of interest: $\hat\theta = s(z)$ [e.g., $\hat\theta = \#\{z_i > 3\}/N$].
Question: How accurate is $\hat\theta$?
Easy answer if the $z_i$'s are independent (but usually they are not!).
Troubles for the bootstrap.
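The "easy answer" under independence can be sketched as follows. This is a minimal illustration, not code from the talk: the function names and the small list of z-values are hypothetical, and the binomial standard error is valid only when the $z_i$ are independent.

```python
import math

def tail_fraction(z, threshold=3.0):
    """theta-hat = #{z_i > threshold} / N."""
    return sum(1 for zi in z if zi > threshold) / len(z)

def independence_sd(theta_hat, n_cases):
    """Binomial standard error; assumes the z_i are independent."""
    return math.sqrt(theta_hat * (1.0 - theta_hat) / n_cases)

# Hypothetical z-values, for illustration only.
z_example = [0.2, 3.4, -1.1, 4.0, 2.9, 0.0, -0.5, 1.7]
theta_hat = tail_fraction(z_example)
sd_indep = independence_sd(theta_hat, len(z_example))
```

With correlated z-values this standard error can be too small, which is the problem the rest of the talk addresses.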

Leukemia Microarray Study (Golub et al., 1999)
72 leukemia patients: $n_1 = 47$ ALL, $n_2 = 25$ AML.
$N = 7128$ genes; data matrix $X$ is $7128 \times 72$.
$X$ has independent columns but correlated rows; rms correlation $\hat\alpha = .11$.
$t_i$ = two-sample t-statistic, AML vs. ALL, for gene $i$.
$z_i = \Phi^{-1}(F_{70}(t_i))$ [$\Phi$ and $F_{70}$ are the cdfs of $N(0, 1)$ and $t_{70}$].
$H_0: z_i \sim N(0, 1)$ (theoretical null).

[Figure] Leukemia data: N = 7128 z-values comparing 47 ALL vs. 25 AML patients; RMS correlation = .11; central standard deviation sighat0 = 1.68. Histogram of the z-values with fitted curve fhat(z) [Poisson GLM spline, df = 5]. Horizontal axis: z values.

[Figure] Leukemia z-value histogram and the average of 100 bootstrap z-value histograms [two-sample nonparametric bootstrap: resample columns of X]. Curves: Poisson spline fit; bootstrap average. Axes: z values (horizontal), frequency (vertical).

Bootstrap Dilation
$x_i$ = $i$th row of $X$ ($n = 72 = 47 + 25$).
$x_i \to z_i$ and $x_i^* \to z_i^*$, with $z_i^* \sim z_i + N(0, \sigma_i^2)$.
The bootstrap histogram has an extra component of variance:
$E\left\{\sum_{i=1}^{N} z_i^{*2} / N\right\} = \sum_{i=1}^{N} z_i^2 / N + \sum_{i=1}^{N} \sigma_i^2 / N$
Next: bootstrap stdev estimates for $\hat F(x) = \#\{z_i \ge x\}/N$.
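The dilation identity can be checked with a small simulation. This is only a sketch of the approximation $z_i^* \sim z_i + N(0, \sigma_i^2)$, not a real bootstrap of a data matrix: the sizes, the common noise level `sigma`, and the stand-in z-values are all hypothetical.

```python
import random

def mean_square(values):
    """Average of squared values, sum(v^2) / N."""
    return sum(v * v for v in values) / len(values)

rng = random.Random(0)
N, B, sigma = 2000, 50, 0.5                    # hypothetical sizes and noise level
z = [rng.gauss(0.0, 1.0) for _ in range(N)]    # stand-in "observed" z-values

# Each replication perturbs every z_i by independent N(0, sigma^2) noise.
boot_mean_squares = []
for _ in range(B):
    z_star = [zi + rng.gauss(0.0, sigma) for zi in z]
    boot_mean_squares.append(mean_square(z_star))

observed = mean_square(z)
boot_average = sum(boot_mean_squares) / B
predicted = observed + sigma ** 2              # right-hand side of the identity
```

The average of the replicated mean squares lands close to `predicted`, so the replicated histogram is wider than the observed one.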

[Figure] Bootstrap stdev for the empirical cdf of the leukemia z-values, compared with Formula X. Curves: Formula X; bootstrap. Axes: x value (horizontal), sd estimates (vertical).

[Figure] Permutation and jackknife estimates of sd{empirical cdf}, compared with Formula X. Curves: jackknife; permutation; Formula X. Axes: x value (horizontal), sd estimates (vertical).

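Formula X
$\mathrm{Var}\{\hat F(x)\} \doteq \left\{\hat F(x)\,(1 - \hat F(x)) / N\right\} + \left\{\hat\sigma_0^2\, \hat\alpha\, \hat f^{(1)}(x) / \sqrt{2}\right\}^2$
The first term is the independence variance; the second is the correlation penalty.
$\hat\sigma_0 = 1.68$, from the empirical null.
$\hat\alpha = .11$, the estimated RMS correlation.
$\hat f^{(1)}(x)$, the first derivative of the estimate $\hat f(x)$.
Depends on normality: $z_i \sim N(\mu_i, \sigma_i^2)$.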

Formula X for Leukemia Data

x:                 1      2      3      4      5
$\hat F(x)$:       .29    .13    .057   .025   .010
$\widehat{sd}$:    .017   .022   .010   .004   .002
$\widehat{sd}_0$:  .005   .004   .003   .002   .001
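Formula X is simple enough to evaluate directly. The sketch below takes the variance as the binomial term plus the squared penalty $\hat\sigma_0^2 \hat\alpha \hat f^{(1)}(x)/\sqrt{2}$; the function name is hypothetical, and the derivative value passed in the second call is a made-up number, since the slides do not tabulate $\hat f^{(1)}(x)$.

```python
import math

def formula_x_sd(F_hat, N, sigma0_hat, alpha_hat, f1_hat):
    """sd of Fhat(x): independence term plus correlation penalty."""
    independence = F_hat * (1.0 - F_hat) / N
    penalty = (sigma0_hat ** 2 * alpha_hat * f1_hat / math.sqrt(2.0)) ** 2
    return math.sqrt(independence + penalty)

# Leukemia values quoted on the slides: Fhat(1) = .29, N = 7128, sigma0 = 1.68.
# With alpha = 0 only the independence term remains.
sd_indep_only = formula_x_sd(0.29, 7128, 1.68, 0.0, 0.0)

# f1_hat = -0.08 is a hypothetical derivative value, for illustration only.
sd_with_penalty = formula_x_sd(0.29, 7128, 1.68, 0.11, -0.08)
```

The independence-only value rounds to .005, which agrees with the last row of the table at x = 1.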

[Figure] Simulation: sd{Fhat(x)} from Formula X; N = 6000, n = 20 + 20, alpha = .10. The solid curve and bars are the mean and stdev of the sdhat values over 100 simulations; the dashed curve is the actual sd. Vertical axis: standard deviation estimates.

Digression: The Non-Null Distribution of z-values
A z-value is a test statistic that is $N(0, 1)$ under $H_0$.
Theorem. Under reasonable conditions the non-null distribution of $z$ is
$z \sim N(\mu, \sigma^2) + O_p(1/n)$, where $\sigma^2 = 1 + O(n^{-1/2})$.
Normality degrades more slowly than unit standard deviation.
This helps justify the model $z_i \sim N(\mu_i, \sigma_i^2)$.

Student-t z-values
$t \sim t_\nu(\delta)$ [noncentral t, noncentrality $\delta$, df = $\nu$]
$H_0: \delta = 0$
$z = \Phi^{-1}(F_\nu(t))$ [$F_\nu$ is the central t cdf, df = $\nu$]
so under $H_0$, $z \sim N(0, 1)$.
What if $\delta \ne 0$?
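The transformation $z = \Phi^{-1}(F_\nu(t))$ can be sketched with the standard library alone. The central t cdf is computed here by Simpson's rule, which is a numerical stand-in chosen for self-containment rather than anything the talk prescribes; the function names are hypothetical, and for very large $|t|$ the computed cdf may round to 0 or 1, in which case `inv_cdf` raises an error.

```python
import math
from statistics import NormalDist

def t_density(t, nu):
    """Density of the central t distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(t * t / nu))

def t_cdf(t, nu, steps=2000):
    """Central t cdf via Simpson's rule on [0, |t|]; steps must be even."""
    a = abs(t)
    h = a / steps
    total = t_density(0.0, nu) + t_density(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, nu)
    half = total * h / 3.0
    return 0.5 + half if t >= 0 else 0.5 - half

def t_to_z(t, nu):
    """z = Phi^{-1}(F_nu(t)); z is N(0, 1) when t is central t_nu."""
    return NormalDist().inv_cdf(t_cdf(t, nu))

z_from_t = t_to_z(2.0, 70)   # df = 70, as in the leukemia study
```

For df = 70 the map pulls a t-value of 2 slightly toward zero, reflecting the heavier tails of the t distribution.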

[Figure] Densities for z = phiinv(Fnu(t)), t ~ t(del, nu = 20), for del = 0, 1, 2, 3, 4, 5; dotted/dashed lines are the matching N(M, SD) densities. Axes: z value (horizontal), density (vertical).

The Count Vector y
Partition the range $\mathcal{Z}$ of $z$ into $K$ bins: $\mathcal{Z} = \bigcup_{k=1}^{K} \mathcal{Z}_k$.
Each bin has width $\Delta$; bin centers $x_k$, $k = 1, 2, \ldots, K$.
(Leukemia histogram: $\mathcal{Z} = [-7.9, 7.9]$, $\Delta = .2$, $K = 79$.)
Counts $y_k = \#\{z_i \in \mathcal{Z}_k\}$; $y = (y_1, y_2, \ldots, y_K)'$.
The count vector $y$ is a discretized order statistic of $z$ (most statistics of interest are of the form $\hat\theta = m(y)$).
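Building the count vector is a plain binning step. The sketch below uses the leukemia grid from the slide (range [-7.9, 7.9], width .2, K = 79) as defaults; the function names are hypothetical, and silently dropping values outside the range is an assumption made here for simplicity.

```python
def count_vector(z, lower=-7.9, width=0.2, n_bins=79):
    """Counts y_k = #{z_i in bin k}; values outside the range are dropped."""
    counts = [0] * n_bins
    for zi in z:
        k = int((zi - lower) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

def bin_centers(lower=-7.9, width=0.2, n_bins=79):
    """Bin centers x_k for the same grid."""
    return [lower + (k + 0.5) * width for k in range(n_bins)]

# Hypothetical z-values; -8.5 and 7.95 fall outside the range.
y = count_vector([0.0, 0.05, 3.0, -8.5, 7.95])
x = bin_centers()
```

Any statistic of the form $\hat\theta = m(y)$ can then be computed from `y` and `x` without going back to the individual z-values.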

Multi-Class Normal Model
Suppose the $z_i$'s are in classes $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_C$, with
$z_i \sim N(\mu_c, \sigma_c^2)$ for $z_i \in \mathcal{C}_c$.
$N_c = \#\{\mathcal{C}_c\}$, $p_c = N_c / N$ [so $\sum_c N_c = N$ and $\sum_c p_c = 1$].
Correlation distribution: $g(\rho)$ = empirical density of all $\binom{N}{2}$ true correlations.

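Mehler's Identity (Lancaster, 1958)
$\varphi_\rho(u, v)$ = standard normal bivariate density with correlation $\rho$.
Mehler: $\lambda_\rho(u, v) = \dfrac{\varphi_\rho(u, v)}{\varphi(u)\varphi(v)} - 1 = \sum_{j \ge 1} \dfrac{\rho^j}{j!}\, h_j(u) h_j(v)$, where $h_j$ is the $j$th Hermite polynomial.
Crucial quantity: $\Lambda(u, v) = \int_{-1}^{1} \lambda_\rho(u, v)\, g(\rho)\, d\rho = \sum_{j \ge 1} \dfrac{\alpha_j}{j!}\, h_j(u) h_j(v)$, where $\alpha_j = \int_{-1}^{1} \rho^j g(\rho)\, d\rho$.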
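Mehler's identity can be verified numerically at a single point. The sketch assumes the $h_j$ are the probabilists' Hermite polynomials ($h_0 = 1$, $h_1 = x$, $h_{j+1} = x h_j - j h_{j-1}$), which is the version that pairs with the normal density; the function names and the truncation at 40 terms are choices made here.

```python
import math

def hermite(j_max, x):
    """Probabilists' Hermite polynomials h_0..h_j_max evaluated at x."""
    h = [1.0, x]
    for j in range(1, j_max):
        h.append(x * h[j] - j * h[j - 1])
    return h[: j_max + 1]

def lambda_direct(rho, u, v):
    """phi_rho(u, v) / (phi(u) phi(v)) - 1, from the closed-form densities."""
    q = (2 * rho * u * v - rho ** 2 * (u * u + v * v)) / (2 * (1 - rho ** 2))
    return math.exp(q) / math.sqrt(1 - rho ** 2) - 1.0

def lambda_series(rho, u, v, j_max=40):
    """Truncated Mehler series: sum over j >= 1 of rho^j / j! * h_j(u) h_j(v)."""
    hu, hv = hermite(j_max, u), hermite(j_max, v)
    return sum(rho ** j / math.factorial(j) * hu[j] * hv[j]
               for j in range(1, j_max + 1))

gap = abs(lambda_direct(0.3, 1.0, -0.5) - lambda_series(0.3, 1.0, -0.5))
```

For moderate $\rho$ the series converges quickly, which is why a few leading terms suffice in the simplifications that follow.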

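Exact Covariance of y
$z_i \sim N(\mu_c, \sigma_c^2)$ for $z_i \in \mathcal{C}_c$; $N_c = \#\mathcal{C}_c$, $p_c = N_c / N$.
Theorem. $\mathrm{cov}(y) = \mathrm{cov}_0 + \mathrm{cov}_1$, where
$\mathrm{cov}_0 = N \sum_c p_c \left\{\mathrm{diag}(\pi_c) - \pi_c \pi_c'\right\}$ [independence],
with $\pi_{ck} = \Pr_c\{z_i \in \text{bin } k\}$ and $\pi_c = (\ldots \pi_{ck} \ldots)'$, and
$\mathrm{cov}_1 = N^2 \sum_c \sum_d p_c p_d B_{cd} - N \sum_c p_c B_{cc}$ [correlation penalty],
with $B_{cd}(k, l) = \pi_{ck}\, \pi_{dl}\, \Lambda\!\left(\dfrac{x_k - \mu_c}{\sigma_c}, \dfrac{x_l - \mu_d}{\sigma_d}\right)$.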

Four Simplifications of $\mathrm{cov}_1$
Drop the $N$ term.
Microarray standardization methods make $\alpha_1 \doteq 0$.
Mehler expansion: $\alpha_2 = \int_{-1}^{1} \rho^2 g(\rho)\, d\rho$ is the lead term.
Higher terms are ignorable if $\alpha_2$ is small.
Simplified formula (almost Formula X): letting $\alpha^2 = \alpha_2$ and $\bar\phi^{(2)}_k = \Delta \sum_c p_c\, \varphi^{(2)}\!\left(\dfrac{x_k - \mu_c}{\sigma_c}\right) \Big/ \sigma_c$,
$\mathrm{cov}_1 \doteq (N\alpha)^2\, \bar\phi^{(2)} \bar\phi^{(2)\prime} / 2$ [rms approximation]

Numerical Comparison
$N = 6000$, $\alpha = .1$
Two classes: $(p_c, \mu_c, \sigma_c) = (.95, 0, 1)$ and $(.05, 2.5, 1)$.
The next figure compares the standard deviations (square roots of the diagonal elements) of the exact $\mathrm{cov}(y)$ and the rms approximation.
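The diagonal of the rms approximation for this two-class example can be sketched as below. Two things are assumptions rather than statements from the slides: the bin width of .2 (borrowed from the leukemia histogram, since none is given for this example), and the form $\bar\phi^{(2)}_k = \Delta \sum_c p_c \varphi^{(2)}((x_k - \mu_c)/\sigma_c)/\sigma_c$ with $\mathrm{cov}_1 \doteq (N\alpha)^2 \bar\phi^{(2)} \bar\phi^{(2)\prime}/2$, where the factor $\Delta$ comes from approximating $\pi_{ck}$ by $\Delta\,\varphi(\cdot)/\sigma_c$. The exact formula is not implemented here.

```python
import math
from statistics import NormalDist

N, ALPHA = 6000, 0.1
CLASSES = [(0.95, 0.0, 1.0), (0.05, 2.5, 1.0)]   # (p_c, mu_c, sigma_c) from the slide
WIDTH = 0.2                                       # assumed bin width

def phi2(u):
    """Second derivative of the standard normal density: (u^2 - 1) * phi(u)."""
    return (u * u - 1.0) * NormalDist().pdf(u)

def sd_count(x_k):
    """(sd without penalty, sd with rms-approximate penalty) for the bin at x_k."""
    var0, phibar = 0.0, 0.0
    for p, mu, sigma in CLASSES:
        dist = NormalDist(mu, sigma)
        pi_ck = dist.cdf(x_k + WIDTH / 2) - dist.cdf(x_k - WIDTH / 2)
        var0 += N * p * pi_ck * (1.0 - pi_ck)          # diagonal of cov_0
        phibar += WIDTH * p * phi2((x_k - mu) / sigma) / sigma
    var1 = (N * ALPHA) ** 2 * phibar ** 2 / 2.0        # diagonal of approximate cov_1
    return math.sqrt(var0), math.sqrt(var0 + var1)

sd0_at_zero, sd_at_zero = sd_count(0.0)
```

Under these assumptions the penalty nearly doubles the standard deviation of the central bin count, even though the rms correlation is only .1.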

[Figure] sd{y[k]} from the exact formula (solid) compared with the rms approximation (dashed), and sd{y[k]} without the correlation penalty; N = 6000, alpha = .1, (p0, mu0, sig0) = (.95, 0, 1) and (.05, 2.5, 1). Axes: z value (horizontal), standard deviation (vertical); dashes show the bin centers x[k].

[Figure] Same numerical example, now sd{Fhat[k]}, where Fhat[k] = sum(y[l] for l >= k)/n. Curves: exact; rms approximation; without correlation penalty. Axes: z value (horizontal), sd{Fhat} (vertical).

Estimation of RMS Correlation $\alpha$
$\hat\rho_{ii'}$ = empirical correlation between rows $i$ and $i'$ of $X$, the $N \times n$ expression matrix.
$\{\hat\rho_{ii'}\}$ has mean and variance $(m, v)$ [leukemia: $(.00, .19^2)$].
$\hat\alpha^2 = \dfrac{n}{n - 1}\left(v - \dfrac{1}{n - 1}\right)$

               ALL    AML    Both
$\hat\alpha$:  .121   .109   .114
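The estimator on this slide, read as $\hat\alpha^2 = \frac{n}{n-1}\left(v - \frac{1}{n-1}\right)$, subtracts the sampling variance that the empirical correlations would show even if all true correlations were zero. A minimal sketch follows; the function name is hypothetical, and truncating a negative estimate at zero is an assumption added here, not something the slide specifies.

```python
import math

def rms_correlation_estimate(v, n):
    """alpha-hat from the variance v of the empirical row correlations, n columns."""
    alpha_sq = n / (n - 1) * (v - 1.0 / (n - 1))
    return math.sqrt(max(alpha_sq, 0.0))   # truncation at zero is an added assumption

# If v equals the pure-noise level 1/(n-1), the estimate is zero.
alpha_noise_only = rms_correlation_estimate(1 / 19, 20)

# Hypothetical v constructed so that the true answer is alpha = 0.1 with n = 72.
v_example = 0.01 * 71 / 72 + 1 / 71
alpha_example = rms_correlation_estimate(v_example, 72)
```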

More General Accuracy Estimates
$Q$ = $q$-dimensional statistic of interest, $Q = Q(y)$.
Influence function $\hat D$: $dQ = \hat D\, dy$ [$\hat D_{jk} = \partial Q_j / \partial y_k$].
$\widehat{\mathrm{cov}}(Q) = \hat D\, \mathrm{cov}(y)\, \hat D'$
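The delta-method step $\widehat{\mathrm{cov}}(Q) = \hat D\,\mathrm{cov}(y)\,\hat D'$ is a pair of matrix products. The sketch below applies it to the right-sided cdf $\hat F_k = \sum_{l \ge k} y_l / N$, whose influence matrix has entries $1/N$ for $l \ge k$ and 0 otherwise; the tiny covariance matrix and the helper names are hypothetical.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]

def delta_method_cov(D, cov_y):
    """cov-hat(Q) = D cov(y) D'."""
    return mat_mul(mat_mul(D, cov_y), transpose(D))

N, K = 10, 3
# Q_k = Fhat_k = sum_{l >= k} y_l / N, so dQ_k / dy_l = 1/N for l >= k.
D = [[(1.0 / N if l >= k else 0.0) for l in range(K)] for k in range(K)]
cov_y = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 4.0]]   # hypothetical cov(y)
cov_Q = delta_method_cov(D, cov_y)
```

Any other smooth $Q(y)$ is handled the same way once its derivative matrix is available.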

Example: Accuracy of $\log(\hat f)$
$z \to y \to \hat f$ by Poisson GLM of the counts $y_k$ on a polynomial in $x_k$.
$Q = \log(\hat f) = (\ldots \log \hat f(x_k) \ldots)'$
$\hat D = M \left[M'\, \mathrm{diag}(\hat f)\, M\right]^{-1} M' / N$, with $M$ the GLM structure matrix.

Local False Discovery Rate
$p_0$ = prior Pr{null}, with $z \sim f_0(z)$; $p_1$ = prior Pr{non-null}, with $z \sim f_1(z)$.
Mixture: $f(z) = p_0 f_0(z) + p_1 f_1(z)$.
Estimated local false discovery rate: $\widehat{\mathrm{fdr}}(z) = \widehat{\Pr}\{\text{null} \mid z\} = p_0 f_0(z) / \hat f(z)$.
$\mathrm{cov}\{\log \widehat{\mathrm{fdr}}\} \doteq \mathrm{cov}\{\log \hat f\}$
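The local false discovery rate is easy to evaluate when the mixture is known. The sketch below uses the two-class example from the numerical comparison, (.95, 0, 1) and (.05, 2.5, 1), and plugs in the true mixture density $f$ rather than an estimate $\hat f$, so it shows the target quantity, not the estimator; the function name is hypothetical.

```python
from statistics import NormalDist

def local_fdr(z, p0=0.95, mu1=2.5, sigma1=1.0):
    """fdr(z) = p0 f0(z) / f(z) for a N(0,1) null and a N(mu1, sigma1^2) non-null."""
    f0 = NormalDist(0.0, 1.0).pdf(z)
    f1 = NormalDist(mu1, sigma1).pdf(z)
    f = p0 * f0 + (1.0 - p0) * f1
    return p0 * f0 / f

fdr_values = [local_fdr(z) for z in (2.0, 2.5, 3.0, 3.5)]
```

Because $p_0 f_0(z)$ is treated as known, all of the variability in $\log \widehat{\mathrm{fdr}}$ comes from $\log \hat f$, which is why the two covariances on the slide agree.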

[Figure] sd{log fdrhat(z)}; N = 6000, alpha = 0, .1, and .2, (p0, mu, sig) = (.95, 0, 1) and (.05, 2.5, 1). Curves: alpha = .2; alpha = .1; alpha = 0. Stars are the sd's for N = 1500, alpha = .1; the numbers along the axis (0.69, 0.58, 0.44, 0.25, 0.09, 0.03) are fdrhat[z]. Axes: z value, 2.0 to 3.5 (horizontal), sd (vertical).

[Figure] Comparison of the sd's for log{fdrhat} and log{Fdrhat}, alpha = .1. Curve labels as printed: sdlogfdrnon, sdlogfdr, sdlogfdr. The numbers along the axis (0.34, 0.26, 0.18, 0.1, 0.04, 0.01) are Fdr[z]. Axes: z value, 2.0 to 3.5 (horizontal), sd (vertical).

References
Efron, B. (2007a). Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 102: 93-103.
Efron, B. (2007b). Size, power and false discovery rates. Ann. Statist. 35: 1351-1377.
Efron, B. (2010). Correlated z-values and the accuracy of large-scale statistical estimates. J. Amer. Statist. Assoc. To appear (http://stat.stanford.edu/~brad/papers).
Golub, T., Slonim, D., Tamayo, P. et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286: 531-537.
Lancaster, H. (1958). The structure of bivariate distributions. Ann. Math. Statist. 29: 719-736.
Owen, A. B. (2005). Variance of the number of false discoveries. J. Roy. Statist. Soc. Ser. B 67: 411-426.