Empirical Bayes Moderation of Asymptotically Linear Parameters

Nima Hejazi
Division of Biostatistics, University of California, Berkeley
stat.berkeley.edu/~nhejazi
nimahejazi.org
github/nhejazi
slides: goo.gl/6ou8yr

Preview

1. Linear models are the standard approach for analyzing microarray and next-generation sequencing data (e.g., the R package limma).
2. Moderated statistics help reduce false positives by using an empirical Bayes method to perform standard deviation shrinkage for test statistics.
3. Beyond linear models: we can assess evidence using parameters that are more scientifically interesting (e.g., the ATE) by way of TMLE.
4. The approach of moderated statistics easily extends to the case of asymptotically linear parameters.

Motivation: Let's meet the data

Observational study of the impact of occupational exposure to benzene, with data collected on 125 subjects and roughly 22,000 biomarkers.
Biomarkers of interest are in the form of miRNA, assessed using the Illumina Human Ref-8 BeadChips platform.
Occupational exposure to benzene is reported as discrete values of interest to epidemiologists: none, < 1 ppm, > 5 ppm.
Background (phenotype-level) information is available on each subject, including age, sex, and smoking status.

Data analysis? Linear models!

For each biomarker ($b = 1, \ldots, B$), fit a linear model: $E[y_b] = X\beta_b$.
Generally, we are interested in one particular model coefficient (e.g., the effect of benzene on biomarker expression).
Controlling for baseline covariates, batch effects, and potential confounders happens by adding terms to the linear model.
Test the coefficient of interest using a standard t-test:

$$t_b = \frac{\hat{\beta}_b - \beta_{b,H_0}}{s_b}$$
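The standard t-test above can be sketched for the simplest case of one exposure variable plus an intercept. This is a minimal illustration in plain Python, not the limma implementation; the function name `slope_t_statistic` and the toy data are assumptions made for the example, and a real analysis would include the additional covariate terms described on the slide.

```python
import math

def slope_t_statistic(x, y, beta_null=0.0):
    """Ordinary t-statistic for the slope of y = b0 + b1 * x fit by least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    intercept = y_bar - beta_hat * x_bar
    # Residual variance on n - 2 degrees of freedom (two fitted coefficients).
    rss = sum((yi - intercept - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)
    se = math.sqrt(sigma2 / sxx)  # plays the role of s_b in the slide's formula
    return beta_hat, se, (beta_hat - beta_null) / se

# Hypothetical data: exposure indicator and expression values for two biomarkers.
exposure = [0, 0, 0, 1, 1, 1]
expression = {
    "biomarker_1": [1.0, 2.0, 3.0, 3.0, 4.0, 5.0],
    "biomarker_2": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
}

# One model per biomarker, as in b = 1, ..., B.
results = {b: slope_t_statistic(exposure, y) for b, y in expression.items()}
for b, (beta_hat, se, t) in results.items():
    print(b, round(beta_hat, 3), round(se, 3), round(t, 3))
```

Fitting the same small model once per biomarker is what makes the variance estimates noisy when the number of subjects is small.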

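LIMMA: Linear Models for Microarray Data

When the sample size is small, $s_b^2$ may be so small that small differences ($\hat{\beta}_b - \beta_{b,H_0}$) lead to large $t_b$.
Uncertainty in the variance is an acute problem when the sample size is small. This results in false positives.
Smyth proposes we get around this by an empirical Bayes shrinkage of the $s_b^2$.
Test the coefficient of interest with a moderated t-test:

$$\tilde{t}_b = \frac{\hat{\beta}_b - \beta_{b,H_0}}{\tilde{s}_b}, \qquad \tilde{s}_b^2 = \frac{s_b^2 d_b + s_0^2 d_0}{d_b + d_0}$$

Here $d_b$ is the residual degrees of freedom for biomarker $b$, and $s_0^2$ and $d_0$ are the prior variance and prior degrees of freedom estimated from all biomarkers.
This eliminates large t-statistics that arise merely from very small $s_b$.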
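A small numeric sketch of the shrinkage formula may help. It assumes the prior values `s2_0` and `d_0` are already known (limma estimates them from the full set of biomarkers, which is not shown here), and the function names are made up for illustration.

```python
import math

def moderated_variance(s2_b, d_b, s2_0, d_0):
    """Degrees-of-freedom-weighted average of the observed and prior variances."""
    return (s2_b * d_b + s2_0 * d_0) / (d_b + d_0)

def moderated_t(beta_hat, s2_b, d_b, s2_0, d_0, beta_null=0.0):
    """Moderated t-statistic using the shrunken variance in the denominator."""
    return (beta_hat - beta_null) / math.sqrt(moderated_variance(s2_b, d_b, s2_0, d_0))

# Hypothetical biomarker: small effect, very small observed variance.
beta_hat, s2_b, d_b = 0.2, 0.0025, 4
# Assumed prior values (in practice estimated from all biomarkers).
s2_0, d_0 = 0.04, 6

ordinary_t = beta_hat / math.sqrt(s2_b)
shrunken_t = moderated_t(beta_hat, s2_b, d_b, s2_0, d_0)
print(ordinary_t, shrunken_t)
```

With these numbers the moderated variance is (0.0025 * 4 + 0.04 * 6) / 10 = 0.025, ten times the observed variance, so the statistic drops from 4 to roughly 1.26.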

Beyond linear models

It's not always desirable to specify a functional form: perhaps we can do better than linear models?
Such models are a matter of convenience and not honest scientific practice: does $\hat{\beta}_b$ really answer our questions?
We can do better by using parameters motivated by causal models (n.b., these will reduce to variable importance measures in our case).
As long as the parameters we seek to estimate have asymptotically linear estimators, we can readily apply the approach of moderated statistics.

Target parameters for complex questions

Rather than being satisfied with $\hat{\beta}_b$ as an answer to our questions, let's consider a simple target parameter, the average treatment effect (ATE):

$$\Psi_b(P_0) = E_{W,0}\left[E_0[Y_b \mid A = a_{high}, W] - E_0[Y_b \mid A = a_{low}, W]\right]$$

No need to specify a functional form or assume that we know the true data-generating distribution $P_0$.
Parameters like this can be estimated using targeted minimum loss-based estimation (TMLE).
Asymptotic linearity:

$$\Psi_b(P_n) - \Psi_b(P_0) = \frac{1}{n} \sum_{i=1}^{n} IC(O_i) + o_P\left(\frac{1}{\sqrt{n}}\right)$$
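One consequence of asymptotic linearity is worth writing out, since it is what justifies t-type statistics later on. The sketch below assumes the influence curve has mean zero and finite variance under $P_0$; the symbol $\sigma_b^2$ is notation introduced here and does not appear on the slides.

```latex
% Multiply the asymptotic linearity expansion by \sqrt{n}
% and apply the central limit theorem to the i.i.d. sum.
\begin{align*}
\sqrt{n}\,\bigl(\Psi_b(P_n) - \Psi_b(P_0)\bigr)
  &= \frac{1}{\sqrt{n}} \sum_{i=1}^{n} IC(O_i) + o_P(1) \\
  &\xrightarrow{\;d\;} N\bigl(0, \sigma_b^2\bigr),
  \qquad \sigma_b^2 = \operatorname{Var}_{P_0}\bigl(IC(O)\bigr).
\end{align*}
```

So the estimator behaves, in large samples, like a sample mean of influence curve values, and its standard error can be estimated from their sample variance.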

Targeted Minimum Loss-Based Estimation

TMLE produces a well-defined, asymptotically unbiased, efficient substitution estimator of target parameters of a data-generating distribution.
It is an iterative procedure (though a one-step version now exists) that updates an initial estimate of the relevant part ($Q_0$) of the data-generating distribution ($P_0$).
Like the corresponding A-IPTW estimators, it removes the asymptotic residual bias of the initial estimator for the target parameter if it uses a consistent estimator of the nuisance parameter $g_0$. It is doubly robust: it remains consistent if either $Q_0$ or $g_0$ is estimated consistently.
We can estimate the target parameter:

$$\Psi_b(P_n) = \frac{1}{n} \sum_{i=1}^{n} \left[Q_n^{(b,1)}(A_i = a_h, W_i) - Q_n^{(b,1)}(A_i = a_l, W_i)\right]$$
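The substitution formula above can be sketched as follows. To keep the example self-contained, it assumes a single discrete covariate and uses within-cell outcome means as the estimate of Q, with no targeting (updating) step, so it illustrates only the final plug-in average and not TMLE itself. The function names and data are hypothetical, and every (exposure, covariate) cell is assumed to be non-empty.

```python
from collections import defaultdict

def fit_cell_means(A, W, Y):
    """Estimate Q(a, w) as the mean outcome within each (exposure, covariate) cell."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, w, y in zip(A, W, Y):
        sums[(a, w)] += y
        counts[(a, w)] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}

def plug_in_ate(A, W, Y, a_high, a_low):
    """Average over units of Q(a_high, W_i) - Q(a_low, W_i)."""
    Q = fit_cell_means(A, W, Y)
    diffs = [Q[(a_high, w)] - Q[(a_low, w)] for w in W]
    return sum(diffs) / len(diffs)

# Hypothetical data for one biomarker: exposure A, smoking status W, expression Y.
A = [1, 1, 0, 0, 1, 0, 1, 0]
W = ["smoker", "smoker", "smoker", "smoker", "non", "non", "non", "non"]
Y = [3.0, 5.0, 1.0, 1.0, 2.0, 1.0, 4.0, 1.0]

psi_hat = plug_in_ate(A, W, Y, a_high=1, a_low=0)
print(psi_hat)
```

Here the within-stratum differences are 3 for smokers and 2 for non-smokers, and averaging over the eight units gives 2.5.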

Inference with influence curves

The influence curve for the estimator is:

$$IC_{b,n}(O_i) = \left(\frac{\mathbb{1}(A_i = a_h)}{g_n(a_h \mid W_i)} - \frac{\mathbb{1}(A_i = a_l)}{g_n(a_l \mid W_i)}\right)\left(Y_{b,i} - Q_n^{(b,1)}(A_i, W_i)\right) + Q_n^{(b,1)}(a_h, W_i) - Q_n^{(b,1)}(a_l, W_i) - \Psi_b(P_n) \tag{1}$$

Sample variance of the influence curve: $s^2(IC_n) = \frac{1}{n} \sum_{i=1}^{n} \left(IC_n(O_i)\right)^2$
Use the sample variance to estimate the standard error: $se_n = \sqrt{s^2(IC_n) / n}$
Use this for inference, that is, to derive uncertainty measures (i.e., p-values, confidence intervals).
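Equation (1) and the standard error can be computed directly once fitted values are available. The sketch below assumes the fitted outcome regressions (`Q_A`, `Q_h`, `Q_l`) and exposure probabilities (`g_h`, `g_l`) have already been obtained by some estimator; those argument names, the function names, and the numbers are all hypothetical.

```python
import math

def ate_influence_curve(A, Y, Q_A, Q_h, Q_l, g_h, g_l, a_high, a_low):
    """Return the plug-in estimate and the estimated influence curve value per unit."""
    n = len(Y)
    psi = sum(qh - ql for qh, ql in zip(Q_h, Q_l)) / n
    ic = []
    for i in range(n):
        # Inverse-probability weight: positive for a_high, negative for a_low, else 0.
        weight = (A[i] == a_high) / g_h[i] - (A[i] == a_low) / g_l[i]
        ic.append(weight * (Y[i] - Q_A[i]) + Q_h[i] - Q_l[i] - psi)
    return psi, ic

def ic_standard_error(ic):
    """Standard error from the mean of squared influence curve values."""
    n = len(ic)
    s2 = sum(v ** 2 for v in ic) / n
    return math.sqrt(s2 / n)

# Hypothetical fitted values for four units of one biomarker.
A = [1, 1, 0, 0]
Y = [3.0, 5.0, 1.0, 2.0]
Q_h = [4.0, 4.0, 4.0, 4.0]   # fitted outcome under high exposure
Q_l = [1.5, 1.5, 1.5, 1.5]   # fitted outcome under low exposure
Q_A = [4.0, 4.0, 1.5, 1.5]   # fitted outcome at the observed exposure
g_h = [0.5, 0.5, 0.5, 0.5]   # estimated P(A = a_high | W)
g_l = [0.5, 0.5, 0.5, 0.5]   # estimated P(A = a_low | W)

psi, ic = ate_influence_curve(A, Y, Q_A, Q_h, Q_l, g_h, g_l, a_high=1, a_low=0)
se = ic_standard_error(ic)
# Wald-type 95% interval based on the normal approximation.
print(psi, se, (psi - 1.96 * se, psi + 1.96 * se))
```

In this toy case the influence curve values are -2, 2, 1, and -1, which average to zero as expected.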

Moderated statistics for target parameters

One can define a standard t-test statistic for an estimator of an asymptotically linear parameter (over $b = 1, \ldots, B$) as:

$$t_b = \frac{\sqrt{n}\,(\Psi_b(P_n) - \Psi_0)}{s_b(IC_{b,n})}$$

This naturally extends to the moderated t-statistic of Smyth:

$$\tilde{t}_b = \frac{\sqrt{n}\,(\Psi_b(P_n) - \Psi_0)}{\tilde{s}_b}$$

where the posterior estimate of the variance of the influence curve is

$$\tilde{s}_b^2 = \frac{s_b^2(IC_{b,n})\, d_b + s_0^2 d_0}{d_b + d_0}$$

An influence curve transform

We need the estimate for each biomarker ($b$) and the IC for every observation for that biomarker, repeating for all $b = 1, \ldots, B$.
Essentially, transform the original data matrix such that the new entries are:

$$\tilde{Y}_{b,i} = IC_{b,n}(O_i; P_n) + \Psi_b(P_n)$$

Since $E[IC_{b,n}] = 0$ across the columns (units) for each $b$, the average will be the original estimate $\Psi_b(P_n)$.
For simplicity, let's assume the null value is $\Psi_0 = 0$ for all $b$. Then, applying the moderated t-test to $\tilde{Y}_{b,i}$ will generate corrected, conservative test statistics $\tilde{t}_b$.
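The transform can be sketched as building a biomarker-by-unit matrix whose row means recover the estimates. The example below uses hypothetical estimates and influence curve values, and it computes only an unmoderated one-sample t-statistic per row; in R, the moderated version would typically come from an intercept-only fit with limma's `lmFit` followed by `eBayes`, which is an assumption about workflow rather than a description of the biotmle internals. Note that `statistics.variance` divides by n - 1, whereas the slides use 1/n.

```python
import math
import statistics

def ic_transform(psi_hat, ic_values):
    """Return the transformed row: the estimate plus each unit's influence curve value."""
    return [psi_hat + ic for ic in ic_values]

# Hypothetical estimates and influence curve values: two biomarkers, four units each.
estimates = {"biomarker_1": 2.5, "biomarker_2": 0.1}
influence = {
    "biomarker_1": [-2.0, 2.0, 1.0, -1.0],
    "biomarker_2": [0.3, -0.3, 0.2, -0.2],
}

transformed = {b: ic_transform(estimates[b], influence[b]) for b in estimates}

for b, row in transformed.items():
    n = len(row)
    row_mean = statistics.mean(row)                    # recovers the original estimate
    row_se = math.sqrt(statistics.variance(row) / n)   # unmoderated standard error
    print(b, row_mean, row_mean / row_se)              # test of the null value 0
```

Because each row is just the estimate shifted by mean-zero noise, any tool that tests a row mean against zero can be reused unchanged on the transformed matrix.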

Why moderated statistics in this context?

Often, such data analyses are based on relatively small samples.
When data-adaptive estimates are obtained with standard implementations of these estimators, the standard errors can be non-robust.
Practically, significant estimates of variable importance measures may be driven by poorly estimated, and in particular underestimated, $s_b^2(IC_{b,n})$.
Moderated statistics shrink these $s_b^2(IC_{b,n})$ toward a common value (making the smallest ones bigger), thus taking biomarkers with small parameter estimates but very small $s_b^2(IC_{b,n})$ out of statistical significance.

Software implementation: R/biotmle

An R package that facilitates biomarker discovery by generalizing the moderated t-statistic of Smyth for use with asymptotically linear parameters.
Check it out on GitHub: nhejazi/biotmle

Data analysis with R/biotmle

Observational study of the impact of occupational exposure to benzene, with data collected on 125 subjects and roughly 22,000 biomarkers.
Baseline covariates W: age, sex, smoking status; all were discretized.
Treatment A is the degree of benzene exposure: none, < 1 ppm, and > 5 ppm.
Outcome Y is miRNA expression, median normalized.
Estimate the parameter:

$$\Psi_b(P_n) = E\left[E[Y_b \mid A = \max(A), W] - E[Y_b \mid A = \min(A), W]\right]$$

Apply the moderated t-test as previously discussed.

Analysis results I: Uncorrected tests

[Figure: histogram of raw p-values (applying limma shrinkage to TMLE results); x-axis: magnitude of raw p-values; y-axis: count]

Analysis results II: Corrected tests

[Figure: histogram of FDR-corrected p-values (BH) (applying limma shrinkage to TMLE results); x-axis: magnitude of BH-corrected p-values; y-axis: count]
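The BH (Benjamini-Hochberg) correction named in the figure title can be sketched in a few lines. This is a plain-Python illustration of the standard step-up adjustment with made-up p-values, not the code used to produce the figure.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four biomarkers.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Each raw p-value is scaled by m divided by its rank, and the running minimum keeps a smaller raw p-value from receiving a larger adjusted value than the one ranked just above it.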

Analysis results III: Volcano plot

[Figure: volcano plot]

Analysis results IV: Heatmap of IC estimates

[Figure: heatmap of IC estimates]

Review

1. Linear models are the standard approach for analyzing microarray and next-generation sequencing data (e.g., the R package limma).
2. Moderated statistics help reduce false positives by using an empirical Bayes method to perform standard deviation shrinkage for test statistics.
3. Beyond linear models: we can assess evidence using parameters that are more scientifically interesting (e.g., the ATE) by way of TMLE.
4. The approach of moderated statistics easily extends to the case of asymptotically linear parameters.

References

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological).
Smyth, G. K. (2005). Limma: linear models for microarray data. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer.
Tuglus, C. and van der Laan, M. J. (2011). Targeted methods for biomarker discovery. In Targeted Learning. Springer.
van der Laan, M. J. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Science & Business Media.

Acknowledgments

Alan Hubbard, University of California, Berkeley
Mark van der Laan, University of California, Berkeley

Slides: goo.gl/6ou8yr
stat.berkeley.edu/~nhejazi
nimahejazi.org
github/nhejazi
