Tweedie's Formula and Selection Bias. Bradley Efron, Stanford University


1. Tweedie's Formula and Selection Bias. Bradley Efron, Stanford University.

2. Selection Bias
Observe z_i ~ N(µ_i, 1) for i = 1, 2, ..., N.
Select the m biggest ones: z_(1) > z_(2) > z_(3) > ... > z_(m).
Question: What can we say about their corresponding µ values?
Selection bias: the µ's will usually be smaller than the selected z's.
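A quick toy illustration of the effect (a minimal R sketch with an assumed normal prior and seed, not the Gamma example of the next slides):

    set.seed(1)
    N  <- 5000
    mu <- rnorm(N)                        # any prior g() would do
    z  <- rnorm(N, mean = mu, sd = 1)
    top <- order(z, decreasing = TRUE)[1:100]
    mean(z[top] - mu[top])                # clearly positive, although E{z - mu} = 0 overall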

3. [Figure] Gamma example: z[i] ~ N(mu[i], 1), i = 1, 2, ..., N = 5000; histogram of the z values, with dashes showing the 100 largest z[i] (x-axis: z values; y-axis: frequency; the largest value z(1) is marked).

4. [Figure] Gamma example: z[i] ~ N(mu[i], 1), i = 1, 2, ...; green curve shows the true mu values; blue dashes at right are the 100 largest z's; red stars indicate the corresponding mu values (x-axis: index; y-axis: z and mu values; the largest z, 9.02, is marked).

5. Tweedie's Formula (Robbins, 1956)
Bayes model: µ ~ g(·) and z | µ ~ N(µ, 1).
Marginal density: f(z) = ∫ φ(z − µ) g(µ) dµ, where φ(z) = e^{−z²/2} / √(2π).
Tweedie's formula (normal version): E{µ | z} = z + l'(z), with l(z) = log f(z); z is the MLE and l'(z) is the Bayes correction.
Advantage: only f is needed, not g.
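A one-line derivation (standard, though not spelled out on the slide) shows where the "MLE plus Bayes correction" form comes from. Since φ'(x) = −x φ(x),

    f'(z) = ∫ φ'(z − µ) g(µ) dµ = ∫ (µ − z) φ(z − µ) g(µ) dµ = [ E{µ | z} − z ] f(z),

so E{µ | z} = z + f'(z)/f(z) = z + l'(z).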

6. Higher Moments
Cumulant generating function of µ given z: the posterior cumulants of µ are the successive derivatives of λ(z) = ½ z² + l(z), where l(z) = log f(z),
so var{µ | z} = 1 + l''(z), skew{µ | z} = l'''(z) / [1 + l''(z)]^{3/2}, etc.
In (mean, variance) notation: µ | z ~ ( z + l'(z), 1 + l''(z) ).
Bayes risk (C.-H. Zhang, 1997): with µ* = E{µ | z}, E{(µ − µ*)²} = 1 − E{l'(z)²}.
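The Zhang (1997) identity is the posterior-variance formula plus one integration by parts (a sketch, assuming f' vanishes at ±∞):

    E{(µ − µ*)²} = E{var(µ | z)} = E{1 + l''(z)},
    E{l''(z)} = ∫ [ f''/f − (f'/f)² ] f dz = ∫ f''(z) dz − E{l'(z)²} = −E{l'(z)²},

hence E{(µ − µ*)²} = 1 − E{l'(z)²}.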

7. Empirical Bayes Estimates
Use z = (z_1, z_2, ..., z_N) to get an estimate fhat(z), with lhat(z) = log fhat(z); take
µ_i | z_i ~ ( z_i + lhat'(z_i), 1 + lhat''(z_i) ),
i.e. µhat_i = z_i + lhat'(z_i) with estimated posterior variance var_i = 1 + lhat''(z_i).
Idea: Bayes estimates are immune to selection bias; maybe empirical Bayes estimates µhat_i are, too!

8. Maximum Likelihood Estimation of f(z)
Parametric model: suppose l(z) = log f(z) = Σ_{j=0}^{J} β_j z^j, with MLE lhat(z) = Σ_{j=0}^{J} βhat_j z^j.
Lindsey's method: form K histogram bins Z_k, k = 1, ..., K, with counts y_k = #{z_i in Z_k}; obtain lhat from glm(y ~ poly(x, J), family = poisson), where x is the vector of bin centers.
[Slide 2: K = 63 bins of width 0.2]
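A minimal R sketch of Lindsey's method followed by the Tweedie plug-in (the Gamma shape, seed, bin grid, and J = 5 are illustrative assumptions, not Efron's settings; lhat'(z) is taken by numerical differentiation of the fitted log density):

    set.seed(1)
    N  <- 5000
    mu <- rgamma(N, shape = 3)                       # assumed Gamma prior
    z  <- rnorm(N, mean = mu, sd = 1)

    J      <- 5
    breaks <- seq(floor(min(z)), ceiling(max(z)), by = 0.2)   # bins of width 0.2
    y      <- hist(z, breaks = breaks, plot = FALSE)$counts   # bin counts
    x      <- breaks[-1] - 0.1                                # bin centers

    fit <- glm(y ~ poly(x, J), family = poisson)     # Lindsey's method
    lhat <- function(zz) predict(fit, newdata = data.frame(x = zz))   # log fhat, up to a constant
    eps    <- 1e-3
    lprime <- function(zz) (lhat(zz + eps) - lhat(zz - eps)) / (2 * eps)
    muhat  <- z + lprime(z)                          # Tweedie estimates z + lhat'(z)

The additive constant in lhat (it absorbs N and the bin width) drops out when differentiating, which is why only the shape of the fitted log density matters.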

9. The James-Stein Estimator
If µ_i ~ N(0, A) and z_i | µ_i ~ N(µ_i, 1), then marginally z_i ~ N(0, V), V = A + 1.
The log marginal density is quadratic (J = 2): l(z_i) = −z_i² / (2V) + constant, so E{µ_i | z_i} = z_i − z_i / V.
James-Stein: µhat_i = z_i − z_i / Vhat, where 1/Vhat = (N − 2) / Σ z_i².
Tweedie estimates are a generalization of James-Stein.
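A sketch of this quadratic special case in R (the values of A and N are assumed toy choices):

    set.seed(2)
    N <- 1000; A <- 4
    mu <- rnorm(N, 0, sqrt(A))
    z  <- rnorm(N, mu, 1)
    muhatJS <- z * (1 - (N - 2) / sum(z^2))   # James-Stein: z - z/Vhat, 1/Vhat = (N-2)/sum(z^2)
    muBayes <- z * (1 - 1 / (A + 1))          # true Bayes rule z - z/V with V = A + 1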

10. [Figure] Fitted curve lhat(z) for the Gamma example, using Lindsey's method with a natural-spline model, J = 5 degrees of freedom; points are log histogram counts (x-axis: z value; y-axis: log(fhat(z))).
NOTE: l(z) concave implies posterior variance < 1.

11. [Figure] Estimated posterior expectation Ehat{mu | z} = z + lhat'(z); red stars are muhat[i] for the 100 largest z[i] (the largest point is at (9.02, 7.64)); small green dots show the true Bayes rule (x-axis: z value; y-axis: Ehat{mu | z}).

12. Does Tweedie Cure Selection Bias?
Simulations: µ_i ~ Gamma and z_i ~ N(µ_i, 1) for i = 1, 2, ..., N = 1000.
100 simulations: select the top 20 and bottom 20 z_i's each time.
The next histograms show µhat_i − µ_i.
Tweedie's formula works well even for N = 200.

13. [Figure] 100 simulations of the Gamma model, N = 1000 z[i]'s per simulation. Solid histogram: muhat[i] − mu[i] for the top 20 per simulation; line histogram: z[i] − mu[i] (x-axis: muhat[i] − mu[i]; y-axis: frequency).

14. [Figure] Now the same comparison for the bottom 20 per simulation (x-axis: muhat[i] − mu[i]; y-axis: frequency).

15. Bayes Regret (Muralidharan, 2009; Zhang, 1997)
Regret: Reg(z_0) ≡ E[ µ − µhat_z(z_0) ]² − E[ µ − µ*(z_0) ]², with z = (z_1, ..., z_N), µ | z_0 random, and z_0 fixed.
Reg(z_0) depends on the accuracy of lhat'_z(z_0) as an estimate of l'(z_0) = d/dz log f(z) at z_0:
Reg(z_0) = E[ lhat'_z(z_0) − l'(z_0) ]².
Asymptotically, Reg(z_0) ≈ var{ lhat'_z(z_0) } ≈ c(z_0)/N.

16. [Table] RMS regret for the Gamma example: rows indexed by percentile and z-value, columns for several sample sizes N and the limiting value c(z_0)/N (numerical entries not preserved in the transcription).

17. Empirical Bayes Information
As N increases, µhat_i goes from the MLE z_i (N = 1) to the Bayes estimate µ*_i (N = ∞).
Posterior variability: E[ µ_i − µhat_i ]² = var_i + Reg(z_i).
µhat_i is undependable for the most extreme few z_i's (var_i ≈ 1).
Empirical Bayes information (per other observation):
1 / I(z_0) = lim_{N→∞} N · Reg(z_0) = c(z_0), so Reg(z_0) ≈ 1 / [ N I(z_0) ].

18. [Figure] Empirical Bayes information per observation, Ireg(z0), for the Gamma model (dashed curve: James-Stein) (x-axis: z0; y-axis: Ireg(z0) value).

19. CNV Data (Efron and Zhang, 2010)
Copy number variation: 150 healthy controls measured at 5000 positions.
khat_i = estimated number of CNV subjects at position i, i = 1, 2, ..., N = 5000.
Bootstrapping subjects gives khat_i ~ N(k_i, σ_i²), with σ_i² increasing in k_i.
Selected position i = 1755, with khat_i = 35.3 (σ_i ≈ 6.5).
Empirical Bayes inference for k_i? (We don't need independence!)

20. [Figure] k values vs. position, CNV data (x-axis: position; y-axis: k value).

21. [Figure] The 5000 k values for the CNV example (first three frequencies: 1908, 560, 357); k[1755] = 35.3 is marked, near a second mode (x-axis: k values; y-axis: frequency).

22. [Figure, two panels] Top: log{fhat(k)} = lhat(k) for the CNV data, natural spline with df = 7; stars are log bin proportions; the second mode is marked (x-axis: k value; y-axis: log fit). Bottom: Tweedie's formula estimate Ehat{k | khat}; the second mode is marked (x-axis: k value).

23. Which Other Cases Are Relevant?
The EB estimate of k_1755 depends on which other khat_i's are thought relevant.
Relevant cases considered: 1:5000, 1:…, …:2000, each giving a different estimate µhat_1755.
Large khat's in positions 3000:5000 pull up the estimate µhat_1755.
Bayes problem: the relevant prior g(k) may depend on position.

24. [Figure] Gamma5000 example: solid red points are maximum z values in successive groups of 50 positions, with squares showing the corresponding estimates Ehat{mu | z} from the combined empirical Bayes analysis. Upwardly biased! (x-axis: position; y-axis: z value and EB estimate; actual mu values also shown.)

25. Relevance
Let ρ_0(i) be the relevance of case i to target case i_0.
Examples: (1) ρ_0(i) = 1 if i is in i_0 ± 500, 0 otherwise; (2) ρ_0(i) = exp{ −|i − i_0| / 500 }.
Extended Tweedie formula:
µhat_{i_0} = z_{i_0} + lhat'(z_{i_0}) + d/dz log Rhat_0(z) evaluated at z_{i_0},
where R_0(z) = E{ ρ_0(i) | z }. [Regress ρ_0(i) on z_i to get Rhat_0(z).]
Using (1) or (2) cures the bias in the previous panel.
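A sketch of the relevance correction in R, continuing the Lindsey sketch above (it reuses z and lprime from that block; the target index i0, the cubic log-linear regression for R_0(z), and the use of position in this simulated example are illustrative choices only):

    i0   <- 2500                                    # hypothetical target case
    rho0 <- exp(-abs(seq_along(z) - i0) / 500)      # relevance weights, example (2)
    d    <- data.frame(z = z, rho0 = rho0)
    Rfit <- glm(rho0 ~ poly(z, 3), family = quasipoisson, data = d)   # log-linear fit of R0(z)
    logR <- function(zz) predict(Rfit, newdata = data.frame(z = zz))  # log Rhat0(z)
    eps  <- 1e-3
    dlogR <- function(zz) (logR(zz + eps) - logR(zz - eps)) / (2 * eps)
    muhat_i0 <- z[i0] + lprime(z[i0]) + dlogR(z[i0])   # extended Tweedie estimate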

26. Non-Constant Variance
If z ~ N(µ, σ_0²), then E{µ | z} = z + σ_0² l'(z) (we took σ_0 = 6.5 for the CNV data).
Now suppose z ~ N(µ, σ_µ²), with the variance depending on µ.
Theorem: g(µ | z_0) = c_0 λ_µ e^{ −½ (λ_µ² − 1) µ̃² } g_0(µ | z_0),
where σ_0 = σ_µ at µ = z_0, g_0(µ | z_0) is the posterior assuming constant σ = σ_0, λ_µ = σ_0 / σ_µ, and µ̃ = (µ − z_0) / σ_0.
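A sketch of why the theorem holds (the prior g(µ) cancels in the ratio of the two posteriors, leaving only the two normal likelihoods):

    g(µ | z_0) / g_0(µ | z_0) ∝ [ σ_µ^{-1} e^{ −(z_0 − µ)² / 2σ_µ² } ] / [ σ_0^{-1} e^{ −(z_0 − µ)² / 2σ_0² } ] = λ_µ e^{ −½ (λ_µ² − 1) µ̃² },

with c_0 the normalizing constant.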

27. [Figure] Estimated posterior distribution for k[1755], CNV example: assuming sig0 = 6.5 and normality (dashed), or adjusted for varying sigma (solid) (x-axis: k[1755]; y-axis: posterior density).

28. False Discovery Rates
µ ~ g(·) = π_0 δ_0(·) + π_1 g_1(·) and z | µ ~ N(µ, 1), with π_0 = prior Pr{null}.
Then fdr(z) = Pr{null | z} = π_0 φ(z) / f(z) (the "local false discovery rate"), and
−d/dz log fdr(z) = z + l'(z) = E{µ | z}.
Tail-area Fdr: Fdr(z_0) = Pr{null | z ≥ z_0} = π_0 [1 − Φ(z_0)] / [1 − F(z_0)],
where F is the cdf of the mixture density f.
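The fdr identity on this slide is immediate from the definitions (a sketch):

    log fdr(z) = log π_0 − ½ z² − ½ log(2π) − l(z),

so differentiating and changing sign gives −d/dz log fdr(z) = z + l'(z) = E{µ | z}: a steeply declining local fdr goes hand in hand with a large posterior mean.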

29. [Figure, two panels] Top: prostate data z-values; N = 6033 genes measured for each of 102 subjects; z-values from two-sample t statistics comparing 52 prostate cancer patients with 50 healthy controls (x-axis: z value; y-axis: frequency). Bottom: empirical Bayes estimate Ehat{mu | z} (x-axis: z value; y-axis: Ehat{mu | z}).

30. p-Value Selection Bias
Observe z_1, z_2, ..., z_N and select the smallest (right-sided) p-values p_i = 1 − Φ(z_i). Significance?
Benjamini-Hochberg: reject for small values of Fdr_i = πhat_0 p_i / Fhat(z_i), or equivalently for large values of
−log(Fdr_i) = −log(p_i) + log( Fhat(z_i) / πhat_0 ),
that is, EB estimate = frequentist estimate + EB correction (usually πhat_0 = 1).
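A sketch of this Fdr recipe in R, with πhat_0 = 1 as on the slide (the z-values below are simulated null data purely so the snippet runs; substitute real z-values, e.g. the prostate data, for the slide's comparison):

    set.seed(3)
    z    <- rnorm(6033)                  # placeholder z-values
    p    <- 1 - pnorm(z)                 # right-sided p-values
    Fbar <- rank(-z) / length(z)         # empirical proportion of z_j >= z_i
    Fdr  <- pmin(1, p / Fbar)            # Fdr_i = pi0hat * p_i / Fbar(z_i)
    head(cbind(p = sort(p), Fdr = Fdr[order(p)]))   # tiny p-values, yet Fdr near 1 under the null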

31. [Figure] −log(p-value) (dashed) and −log(Fdr) (solid) for the prostate data, right-sided (x-axis: z value).

32. References
Efron, B. and Zhang, N. (2010). False discovery rates and copy number variation.
Muralidharan, O. (2009). High-dimensional exponential family estimation via empirical Bayes. Under review.
Robbins, H. (1956). An empirical Bayes approach to statistics. In Proc. 3rd Berkeley Symposium. Berkeley: UC Press.
Zhang, C.-H. (1997). Empirical Bayes and compound estimation of normal means. Statist. Sinica 7.
