Empirical Bayes Moderation of Asymptotically Linear Parameters

Nima Hejazi
Division of Biostatistics, University of California, Berkeley
stat.berkeley.edu/~nhejazi
nimahejazi.org
twitter/@nshejazi
github/nhejazi

These are slides for a talk given most recently at the Division of Biostatistics seminar series at the University of California, Berkeley, on 20 March 2017.
Source: https://github.com/nhejazi/talk_biotmle
Slides: https://goo.gl/r3zsu6
With notes: https://goo.gl/6ou8yr

Preview
1. Linear models are the standard approach for analyzing microarray and next-generation sequencing data (e.g., the R package limma).
2. Moderated statistics help reduce false positives by using an empirical Bayes method to perform standard deviation shrinkage for test statistics.
3. Beyond linear models: we can assess evidence using parameters that are more scientifically interesting (e.g., the ATE) by way of TMLE.
4. The approach of moderated statistics easily extends to the case of asymptotically linear parameters.

We'll go over this summary again at the end of the talk. Hopefully, it will all make more sense then.

Motivation: Let's meet the data
Observational study of the impact of occupational exposure (to benzene), with data collected on 125 subjects and roughly 22,000 biomarkers.
Biomarkers of interest are in the form of miRNA, assessed using the Illumina Human Ref-8 BeadChips platform.
Occupational exposure to benzene is reported as discrete values of interest (to epidemiologists): none, < 1 ppm, > 5 ppm.
Background (phenotype-level) information is available on each subject, including age, sex, and smoking status.

This is not an atypical data set by modern standards in epidemiology, though it is certainly not the standard for molecular biology: sample sizes are usually much smaller in experiments examining biological processes.

Data analysis? Linear models!
For each biomarker ($b = 1, \ldots, B$), fit a linear model: $E[y_b] = X\beta_b$.
Generally, we are interested in one particular model coefficient (e.g., the effect of benzene on biomarker expression).
Controlling for baseline covariates, batch effects, and potential confounders happens by adding terms to the linear model.
Test the coefficient of interest using a standard t-test:

$$t_b = \frac{\hat{\beta}_b - \beta_{b,H_0}}{s_b}$$

There's nothing particularly wrong with this approach. It's exactly what we would come up with after a first-year statistics course. In practice, however, there are several issues: (1) we are forced to specify a functional form, the linear model; (2) we end up with unstable variance estimates that sharply increase the number of false positives detected, even after multiple testing corrections.
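As a rough illustration of the per-biomarker t-test above, here is a minimal sketch in plain Python. It assumes the simplest possible design (an intercept plus a single exposure variable), whereas the analysis described in the talk would add covariates to the model. The function name `slope_t_statistic` and the toy numbers are made up for illustration.

```python
import math

def slope_t_statistic(x, y, beta_null=0.0):
    """Fit y = alpha + beta * x by least squares and return
    (beta_hat, standard_error, t_statistic) for the slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    # Residual variance with n - 2 degrees of freedom (two fitted coefficients).
    rss = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)
    se = math.sqrt(sigma2_hat / sxx)
    return beta_hat, se, (beta_hat - beta_null) / se

# Toy data: exposure indicator and expression of a single biomarker.
exposure = [0, 0, 1, 1]
expression = [1.0, 1.2, 2.0, 2.4]
beta_hat, se, t_stat = slope_t_statistic(exposure, expression)
print(beta_hat, se, t_stat)
```

In a real analysis this would be repeated for each of the B biomarkers, which is where a handful of unusually small standard errors can produce spuriously large t-statistics.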

LIMMA: Linear Models for Microarray Data
When the sample size is small, $s_b^2$ may be so small that small differences ($\hat{\beta}_b - \beta_{b,H_0}$) lead to large $t_b$.
Uncertainty in the variance is an acute problem when the sample size is small. This results in false positives.
Smyth proposes we get around this by an empirical Bayes shrinkage of the $s_b^2$.
Test the coefficient of interest with a moderated t-test:

$$\tilde{t}_b = \frac{\hat{\beta}_b - \beta_{b,H_0}}{\tilde{s}_b}, \qquad \tilde{s}_b^2 = \frac{s_b^2 d_b + s_0^2 d_0}{d_b + d_0}$$

Here $d_b$ is the residual degrees of freedom for biomarker $b$, and $s_0^2$ and $d_0$ are the prior variance and prior degrees of freedom estimated from all biomarkers.
This eliminates large t-statistics that arise merely from very small $s_b$.

The substantive contribution here is the use of an empirical Bayes method to shrink the standard deviation estimates across all of the biomarkers toward a common value. Very small $s_b^2$ estimates become larger (but still accurate), which reduces the number of test statistics that are marked as significant because of a low $s_b^2$ estimate alone. Note that this is not the exact formulation of the moderated t-statistic as given by Smyth (his derivation assumes a hierarchical model; see the original paper if interested). This formulation does a good enough job to help us see the bigger picture.
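The shrinkage formula above can be sketched in a few lines of Python. This follows the simplified formulation on the slide rather than the full limma implementation. The prior values `s2_prior` and `d_prior` are supplied by hand here purely for illustration; in limma they are estimated from the variances of all biomarkers.

```python
import math

def moderated_variance(s2, d, s2_prior, d_prior):
    """Degrees-of-freedom-weighted average of a biomarker's own variance
    estimate (s2, d) and the prior (s2_prior, d_prior)."""
    return (s2 * d + s2_prior * d_prior) / (d + d_prior)

def moderated_t(beta_hat, beta_null, s2, d, s2_prior, d_prior):
    """Moderated t-statistic in the simplified form used on the slide."""
    return (beta_hat - beta_null) / math.sqrt(
        moderated_variance(s2, d, s2_prior, d_prior)
    )

# A biomarker with a small effect but a tiny variance estimate.
beta_hat, s2, d = 0.5, 0.01, 4
# Hypothetical prior values, chosen by hand for this sketch.
s2_prior, d_prior = 1.0, 4

ordinary_t = beta_hat / math.sqrt(s2)
shrunk_t = moderated_t(beta_hat, 0.0, s2, d, s2_prior, d_prior)
print(ordinary_t, shrunk_t)
```

With these toy numbers the ordinary statistic is large only because `s2` is tiny, while the moderated statistic is much closer to zero.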

Beyond linear models
It's not always desirable to specify a functional form: perhaps we can do better than linear models?
Such models are a matter of convenience and not honest scientific practice: does $\hat{\beta}_b$ really answer our questions?
We can do better by using parameters motivated by causal models (n.b., these will reduce to variable importance measures in our case).
As long as the parameters we seek to estimate have asymptotically linear estimators, we can readily apply the approach of moderated statistics.

Linear models are convenient for communicating results; that is, all scientists are trained to understand them. This means they provide a basic way of communicating between statisticians and collaborators. That said, doesn't it seem a bit odd to use such elementary models to analyze complex biological sequencing data? We're using old statistical technology to analyze classes of data that have only recently become available.

Target parameters for complex questions
Rather than being satisfied with $\hat{\beta}_b$ as an answer to our questions, let's consider a simple target parameter, the average treatment effect (ATE):

$$\Psi_b(P_0) = E_{W,0}\left[E_0[Y_b \mid A = a_{high}, W] - E_0[Y_b \mid A = a_{low}, W]\right]$$

No need to specify a functional form or assume that we know the true data-generating distribution $P_0$.
Parameters like this can be estimated using targeted minimum loss-based estimation (TMLE).
Asymptotic linearity:

$$\Psi_b(P_n^*) - \Psi_b(P_0) = \frac{1}{n}\sum_{i=1}^{n} IC(O_i) + o_P\left(\frac{1}{\sqrt{n}}\right)$$

By allowing scientific questions to inform the parameters that we choose to estimate, we can do a better job of actually answering the questions of interest to our collaborators. Further, we abandon the need to specify the functional relationship between our outcome and covariates; moreover, we can now make use of advances in machine learning.

Targeted Minimum Loss-Based Estimation
TMLE produces a well-defined, unbiased, efficient substitution estimator of target parameters of a data-generating distribution.
It is an iterative procedure (though a one-step version now exists) that updates an initial estimate of the relevant part ($Q_0$) of the data-generating distribution ($P_0$).
Like corresponding A-IPTW estimators, it removes the asymptotic residual bias of the initial estimator for the target parameter. If it uses a consistent estimator of $g_0$ (nuisance parameter), it is doubly robust.
We can estimate the target parameter:

$$\Psi_b(P_n^*) = \frac{1}{n}\sum_{i=1}^{n}\left[Q_n^{(b,1)}(A_i = a_h, W_i) - Q_n^{(b,1)}(A_i = a_l, W_i)\right]$$

This allows natural use of machine learning methods for the estimation of both $Q_0$ and $g_0$. TMLE focuses effort on achieving minimal bias and the asymptotic semiparametric efficiency bound for the variance, while still providing inference (under some assumptions).
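The final substitution (plug-in) step in the formula above can be sketched as follows. This is only the averaging step: it assumes an already fitted (and, in TMLE, already updated) outcome regression is available as a function, and it does not implement the targeting update itself. The function `toy_q` is a made-up stand-in for such a fit.

```python
def substitution_ate(q_fit, covariates, a_high, a_low):
    """Average the predicted outcome difference between the two exposure
    levels over the observed covariate values."""
    diffs = [q_fit(a_high, w) - q_fit(a_low, w) for w in covariates]
    return sum(diffs) / len(diffs)

def toy_q(a, w):
    """Hypothetical fitted outcome regression Q(a, w); in practice this
    would come from a regression or machine learning fit."""
    return 1.0 + 0.5 * a + 0.2 * w + 0.3 * a * w

# Toy binary covariate observed on five subjects.
w_observed = [0, 1, 1, 0, 1]
psi_hat = substitution_ate(toy_q, w_observed, a_high=1, a_low=0)
print(psi_hat)
```

Because the estimate is obtained by plugging a fitted regression into the parameter mapping, it respects the natural bounds of the parameter, which is one reason substitution estimators are attractive.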

Inference with influence curves
The influence curve for the estimator is:

$$IC_{b,n}(O_i) = \left(\frac{\mathbb{1}(A_i = a_h)}{g_n(a_h \mid W_i)} - \frac{\mathbb{1}(A_i = a_l)}{g_n(a_l \mid W_i)}\right)\left(Y_{b,i} - Q_n^{(b,1)}(A_i, W_i)\right) + Q_n^{(b,1)}(a_h, W_i) - Q_n^{(b,1)}(a_l, W_i) - \Psi_b(P_n^*)$$

Sample variance of the influence curve:

$$s^2(IC_n) = \frac{1}{n}\sum_{i=1}^{n}\left(IC_n(O_i)\right)^2$$

Use the sample variance to estimate the standard error:

$$se_n = \sqrt{\frac{s^2(IC_n)}{n}}$$

Use this for inference, that is, to derive uncertainty measures (i.e., p-values, confidence intervals).

Using the influence curve representation, we can obtain all of the standard objects of statistical interest, but for more interesting parameters.
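To make the influence-curve formulas above concrete, here is a minimal sketch that evaluates them for a single biomarker. It assumes the outcome regression predictions and the exposure probabilities have already been estimated and are passed in as plain lists; all argument names and toy values are illustrative, and no targeting step is performed.

```python
import math

def influence_curve_values(a, y, q_obs, q_high, q_low, g_high, g_low,
                           a_high, a_low):
    """Return (psi_hat, ic), where ic[i] is the estimated influence curve
    evaluated at observation i.

    q_obs[i]  : predicted outcome at the observed exposure a[i]
    q_high[i] : predicted outcome with exposure set to a_high
    q_low[i]  : predicted outcome with exposure set to a_low
    g_high[i] : estimated P(A = a_high | W_i); g_low[i] likewise for a_low
    """
    n = len(y)
    psi_hat = sum(qh - ql for qh, ql in zip(q_high, q_low)) / n
    ic = []
    for i in range(n):
        weight = (float(a[i] == a_high) / g_high[i]
                  - float(a[i] == a_low) / g_low[i])
        ic.append(weight * (y[i] - q_obs[i]) + q_high[i] - q_low[i] - psi_hat)
    return psi_hat, ic

def ic_standard_error(ic):
    """Standard error from the mean squared influence curve, divided by n."""
    n = len(ic)
    s2 = sum(v * v for v in ic) / n
    return math.sqrt(s2 / n)

# Toy inputs for four subjects (all values hypothetical).
a = [1, 0, 1, 0]
y = [2.0, 1.0, 3.0, 1.5]
q_high = [2.2, 2.0, 2.8, 2.4]
q_low = [1.2, 1.0, 1.4, 1.4]
q_obs = [2.2, 1.0, 2.8, 1.4]
g_high = [0.5, 0.5, 0.5, 0.5]
g_low = [0.5, 0.5, 0.5, 0.5]

psi_hat, ic = influence_curve_values(a, y, q_obs, q_high, q_low,
                                     g_high, g_low, a_high=1, a_low=0)
se = ic_standard_error(ic)
# Wald-type 95% confidence interval based on the normal approximation.
ci = (psi_hat - 1.96 * se, psi_hat + 1.96 * se)
print(psi_hat, se, ci)
```

The same standard error feeds directly into a test statistic of the form (estimate minus null value) divided by `se`.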

Moderated statistics for target parameters
One can define a standard t-test statistic for an estimator of an asymptotically linear parameter (over $b = 1, \ldots, B$) as:

$$t_b = \frac{\sqrt{n}\left(\Psi_b(P_n^*) - \Psi_0\right)}{s_b(IC_{b,n})}$$

This naturally extends to the moderated t-statistic of Smyth:

$$\tilde{t}_b = \frac{\sqrt{n}\left(\Psi_b(P_n^*) - \Psi_0\right)}{\tilde{s}_b}$$

where the posterior estimate of the variance of the influence curve is

$$\tilde{s}_b^2 = \frac{s_b^2(IC_{b,n})\, d_b + s_0^2 d_0}{d_b + d_0}$$

Consider repeating this for $b = 1, \ldots, B$ different biomarkers, so that one has, for each $b$, $\Psi_b(Q_{b,n})$ and $S_b^2(IC_{b,n})$: an estimate of variable importance and of its standard error. We propose applying an existing joint-inferential procedure that can add some finite-sample robustness to an estimator that can be highly variable.

An influence curve transform
We need the estimate for each biomarker ($b$) and the IC for every observation for that biomarker, repeating for all $b = 1, \ldots, B$.
Essentially, transform the original data matrix such that the new entries are:

$$\tilde{Y}_{b,i} = IC_{b,n}(O_i; P_n) + \Psi_b(P_n^*)$$

Since $E[IC_{b,n}] = 0$ across the columns (units) for each $b$, the average of $\tilde{Y}_{b,i}$ over $i$ will be the original estimate $\Psi_b(P_n^*)$.
For simplicity, let's assume the null value is $\Psi_0 = 0$ for all $b$. Then, applying the moderated t-test to $\tilde{Y}_{b,i}$ will generate corrected, conservative test statistics $\tilde{t}_b$.

This is just like the one-sample problem: estimating a parameter with an associated standard error derived from the influence curve.
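The transform above can be checked numerically with a small sketch. It assumes centered influence curve values for one biomarker, uses the n - 1 sample variance that a one-sample t-test would use, and takes hand-picked prior values (`s2_prior`, `d_prior`) in place of the ones an empirical Bayes fit across all biomarkers would estimate.

```python
import math
import statistics

def ic_transform(psi_hat, ic):
    """Shift the influence curve values by the estimate, giving one row of
    the transformed data matrix."""
    return [v + psi_hat for v in ic]

def one_sample_moderated_t(row, s2_prior, d_prior, null_value=0.0):
    """One-sample moderated t-statistic for a transformed row."""
    n = len(row)
    d = n - 1                       # residual degrees of freedom
    s2 = statistics.variance(row)   # sample variance with n - 1 denominator
    s2_tilde = (s2 * d + s2_prior * d_prior) / (d + d_prior)
    return (statistics.mean(row) - null_value) / math.sqrt(s2_tilde / n)

# Centered influence curve values and estimate for one biomarker (toy values).
ic = [-0.5, 0.1, 0.7, -0.3]
psi_hat = 1.1

row = ic_transform(psi_hat, ic)
# The row mean recovers the original estimate because the IC values sum to zero.
print(statistics.mean(row))

ordinary_t = statistics.mean(row) / math.sqrt(statistics.variance(row) / len(row))
shrunk_t = one_sample_moderated_t(row, s2_prior=1.0, d_prior=3)
print(ordinary_t, shrunk_t)
```

Stacking one such row per biomarker gives a B-by-n matrix to which existing moderated t-test software can be applied with an intercept-only design.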

Why moderated statistics in this context?
Oftentimes, such data analyses are based on relatively small samples.
With data-adaptive estimation and standard implementations of these estimators, standard errors can be non-robust.
Practically, significant estimates of variable importance measures may be driven by poorly estimated and underestimated $s_b^2(IC_{b,n})$.
Moderated statistics shrink these $s_b^2(IC_{b,n})$ toward a common value (making the very small ones bigger), thus taking biomarkers with small parameter estimates but very small $s_b^2(IC_{b,n})$ out of statistical significance.

Essentially, we have the same concerns about using variable importance measures that we did about using the standard t-test; that is, non-robust estimates of the standard error of the estimator of the target parameter can cause erroneous identification of biomarkers (false positives). To reduce this, we can apply the same machinery that we did in the case of the standard t-test for our naive linear modeling approach.

Software implementation: R/biotmle
An R package that facilitates biomarker discovery by generalizing the moderated t-statistic of Smyth for use with asymptotically linear parameters.
Check it out on GitHub: nhejazi/biotmle

Use it. File an issue. Help make it better!

Data analysis with R/biotmle
Observational study of the impact of occupational exposure (to benzene), with data collected on 125 subjects and roughly 22,000 biomarkers.
Baseline covariates $W$: age, sex, smoking status; all were discretized.
Treatment $A$ is the degree of benzene exposure: none, < 1 ppm, and > 5 ppm.
Outcome $Y$ is miRNA expression, median normalized.
Estimate the parameter:

$$\Psi_b(P_n^*) = E\left[E[Y_b \mid A = \max(A), W] - E[Y_b \mid A = \min(A), W]\right]$$

Apply the moderated t-test as previously discussed.

We are really just walking through the mechanistic procedure we outlined, applying it to the data set that served as our motivating example.

Analysis results I: Uncorrected tests
[Figure: Histogram of raw p-values (applying limma shrinkage to TMLE results); x-axis: magnitude of raw p-values, y-axis: count.]

This is promising: we're not seeing too many biomarkers identified as significant. But we do have to correct for those 22,000 tests that we just performed.

Analysis results II: Corrected tests
[Figure: Histogram of FDR-corrected p-values (BH) (applying limma shrinkage to TMLE results); x-axis: magnitude of BH-corrected p-values, y-axis: count.]

These are the results after application of the Benjamini-Hochberg procedure for controlling the False Discovery Rate (FDR).
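For readers unfamiliar with the Benjamini-Hochberg adjustment mentioned above, the following is a minimal standard-library sketch of how BH-adjusted p-values are computed. The function name and the toy p-values are illustrative and are not taken from the analysis in the talk.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = benjamini_hochberg(raw_p)
print(adjusted_p)

# Biomarkers declared significant at an FDR level of 0.05.
significant = [i for i, p in enumerate(adjusted_p) if p <= 0.05]
print(significant)
```

Declaring significance wherever the adjusted value is at or below the chosen level is equivalent to the usual step-up formulation of the procedure.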

Analysis results III: Volcano plot

Taking a look at a standard volcano plot adapted to the ATE quickly reveals that we are not erroneously identifying any biomarkers with low fold change in the ATE as significant.

Analysis results IV: Heatmap of IC estimates

We can use our influence curve transform to identify biomarkers that are top contributors to the target parameter of interest, the ATE in this case.

Review
1. Linear models are the standard approach for analyzing microarray and next-generation sequencing data (e.g., the R package limma).
2. Moderated statistics help reduce false positives by using an empirical Bayes method to perform standard deviation shrinkage for test statistics.
3. Beyond linear models: we can assess evidence using parameters that are more scientifically interesting (e.g., the ATE) by way of TMLE.
4. The approach of moderated statistics easily extends to the case of asymptotically linear parameters.

It's always good to include a summary.

References
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), pages 289-300.
Gruber, S. and van der Laan, M. J. (2010). An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. The International Journal of Biostatistics, 6(1):1-31.
Hejazi, N. S., Cai, W., and Hubbard, A. E. (2017). biotmle: Targeted learning for biomarker discovery.
Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3(1):1-25.
Smyth, G. K. (2005). Limma: linear models for microarray data. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor, pages 397-420. Springer.
Tuglus, C. and van der Laan, M. J. (2011). Targeted methods for biomarker discovery. In Targeted Learning, pages 367-382. Springer.
van der Laan, M. J. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Science & Business Media.

Acknowledgments
Alan Hubbard, University of California, Berkeley
Mark van der Laan, University of California, Berkeley

This was all made possible by generous advising and collaboration.

Slides: goo.gl/6ou8yr
stat.berkeley.edu/~nhejazi
nimahejazi.org
twitter/@nshejazi
github/nhejazi

Here's where you can find me, as well as the slides for this talk.