Unlocking RNA-seq tools for zero inflation and single cell applications using observation weights

Size: px

Start display at page:

Download "Unlocking RNA-seq tools for zero inflation and single cell applications using observation weights"

Anthony Harrington
5 years ago
Views:

1 Unlocking RNA-seq tools for zero inflation and single cell applications using observation weights Koen Van den Berge, Ghent University Statistical Genomics,

2 The team Koen Van den Berge Fanny Perraudeau Jean-Philippe Vert Charlotte Soneson Mark Robinson Sandrine Dudoit Davide Risso Michael Love Lieven Clement 2

3 single-cell RNA-sequencing (scrna-seq) is noisier than bulk RNA-seq 3

Single-cell RNA-seq protocols I Full-length protocols (e.g., SMART-Seq2) I Cells must be isolated (manually, FACS,...). I Library prep is typically plate-based; one well contains one cell.

4 Single-cell RNA-seq protocols I Full-length protocols (e.g., SMART-Seq2) I Cells must be isolated (manually, FACS,...). I Library prep is typically plate-based; one well contains one cell. I Droplet-based protocols (e.g., 10X, drop-seq) I Cells do not need to be isolated! I Cell-containing medium is mixed with bead-containing oil droplets. 4

5 single-cell RNA-sequencing (scrna-seq) is noisier than bulk RNA-seq data from [Pickrell et al. 2010, Fletcher et al. 2017, Zheng et al. 2017] 5

6 Bulk RNA-seq di erential expression (DE) analysis Popular methods (edger, DESeq2) adopt negative binomial (NB) models y gi NB(µ gi, g ) log(µ gi ) = gi gi = X i g + log(o i ) with y gi the expression count of gene g in sample i. 6

7 Bulk RNA-seq DE not worse than bespoke scrna-seq tools Jaakkoola et al. (2016), Bioinformatics: Our evaluations did not reveal systematic benefits of the currently available single-cell-specific methods. Soneson & Robinson (2018), Nat. Meth.: We found that bulk RNA-seq analysis methods do not generally perform worse than those developed specifically for scrna-seq. 7

8 Bulk RNA-seq methods still break down due to ZI Simulated (ZI-)bulk RNA-seq data using [Zhou et al. 2014] framework 8

9 Bulk RNA-seq methods still break down due to ZI Simulated (ZI-)bulk RNA-seq data using [Zhou et al. 2014] framework 8

10 Observation weights unlock bulk RNA-seq tools towards zero inflation Excess zeros observed! zero inflation We propose to model counts with a zero inflated negative binomial (ZINB) distribution f ZINB (y gi ; µ gi, g, gi )= gi +(1 gi )f NB (y gi ; µ gi, g ). (1) 9

11 Observation weights unlock bulk RNA-seq tools towards zero inflation Excess zeros observed! zero inflation We propose to model counts with a zero inflated negative binomial (ZINB) distribution f ZINB (y gi ; µ gi, g, gi )= gi +(1 gi )f NB (y gi ; µ gi, g ). (1) A ZINB model corresponds to a weighted NB where observation weights are posterior probabilities w gi = (1 gi)f NB (y gi ; µ gi, g ) f ZINB (y gi ; µ gi, g, gi ) (2) 9

12 Observation weights unlock bulk RNA-seq tools towards zero inflation Excess zeros observed! zero inflation We propose to model counts with a zero inflated negative binomial (ZINB) distribution f ZINB (y gi ; µ gi, g, gi )= gi +(1 gi )f NB (y gi ; µ gi, g ). (1) A ZINB model corresponds to a weighted NB where observation weights are posterior probabilities w gi = (1 gi)f NB (y gi ; µ gi, g ) f ZINB (y gi ; µ gi, g, gi ) (2) Weights are used to unlock RNA-seq NB models (edger, DESeq2) for zero inflation [Van den Berge,Perraudeau et al., 2018]. 9

13 10 zinbwave can be used to fit ZINB models in scrna-seq Estimation of the ZINB parameters using penalized likelihood implemented in the ZINB-WaVE model [Risso et al. 2018] Bioconductor: J genes log μ n samples } Known sample-level covariates Known gene-level covariates M J L J M β L VT + γ Τ = X + Unknown sample-level covariates K J α K W J genes n n n n samples logit π } Observed random variable } Unknown parameter } Unknown parameter } Observed random variable } Unobserved random variable } Unknown parameter X intercept acts as a gene-specific scaling factor V intercept acts as a sample-specific scaling factor

14 10 zinbwave can be used to fit ZINB models in scrna-seq Estimation of the ZINB parameters using penalized likelihood implemented in the ZINB-WaVE model [Risso et al. 2018] Bioconductor: J genes log μ n samples } Known sample-level covariates Known gene-level covariates M J L J M β L VT + γ Τ = X + Unknown sample-level covariates K J α K W J genes n n n n samples logit π } Observed random variable } Unknown parameter } Unknown parameter } Observed random variable } Unobserved random variable } Unknown parameter X intercept acts as a gene-specific scaling factor V intercept acts as a sample-specific scaling factor Alternatively: EM-algorithm (see last couple of slides)

15 Downweighting excess zeros recovers mean-variance trend, resulting in high power Simulated (ZI-)bulk RNA-seq data using [Zhou et al. 2014] framework 11

16 Downweighting excess zeros recovers mean-variance trend, resulting in high power Simulated (ZI-)bulk RNA-seq data using [Zhou et al. 2014] framework 11

17 12 High power, good FDR control in scrna-seq simulations Full-length protocols DESeq2 MAST ZINB-WaVE_DESeq2_common ZINB-WaVE_edgeR_common edger

18 High power, good FDR control in scrna-seq simulations Droplet-based protocols, e.g. 10X Genomics, Drop-seq 13

19 14 Mock comparisons on real data show good FPR control Non-UMI dataset on 622 neuronal cells from [Usoskin et al. 2015]. 45 vs. 45 mock comparisons.

20 15 Downweighting leads to biologically meaningful results 10X Genomics PBMC dataset, preprocessed using tutorial from Seurat.

16 Method is implemented in zinbwave Bioc

21 16 Method is implemented in zinbwave Bioc package I computeobservationalweights for weights calculation I edger:glmweightedf for ZI-adjusted inference I DESeq2:nbinomWaldTest and nbinomlrt for ZI-adjusted inference I Tutorial available in zinbwave vignette

22 What follows are some slides on the EM algorithm used in the zinger method. 17

23 18 Azero-inflatednegativebinomialmodel Distribution of counts y for gene g over samples i y gi i +(1 i )f NB (µ gi, g ) i.e. mixture distribution between point-mass at zero and negative binomial Log-likelihood l(y gi )= X i log { i +(1 i )f NB (µ gi, g )} does not factorize! very di cult to maximize!

24 19 Fitting a mixture distribution with EM Estimate mixture using EM-algorithm: introduce latent variable Z gi B( gi ) to assign zeros to the zero-inflation or count component. The joint density becomes f (y gi, z gi )=f (y gi z gi )f (z gi ) =[ i ] z gi [(1 i )f NB (µ gi, g )] (1 z gi )

25 19 Fitting a mixture distribution with EM Estimate mixture using EM-algorithm: introduce latent variable Z gi B( gi ) to assign zeros to the zero-inflation or count component. The joint density becomes f (y gi, z gi )=f(y gi z gi )f (z gi ) =[ i ] z gi [(1 i )f NB (µ gi, g )] (1 z gi ) Maximization of expected log-likelihood given the data: Q = E(l(y gi, z gi ) y gi ) = E(z gi y gi )log i + E(z gi y gi )log +[1 E(z gi y gi )]log(1 i )+ [1 E(z gi y gi )]log[f NB (µ gi, g )] 1. E-step: Calculate expected likelihood 2. M-step: Maximize expected likelihood

26 20 E-step EM-algorithm I Calculate posterior probability that a zero belongs to zero-inflation component ˆ i I (y gi = 0) E(z gi y gi ) = ˆ i I (y gi = 0) + (1 ˆ i )f NB (y gi ;ˆµ gi, ˆg )

27 E-step EM-algorithm I Calculate posterior probability that a zero belongs to zero-inflation component M-step ˆ i I (y gi = 0) E(z gi y gi ) = ˆ i I (y gi = 0) + (1 ˆ i )f NB (y gi ;ˆµ gi, ˆg ) I Estimate NB component parameters µ gi and g using edger I Incorporate observation-level weights wgi =1 E y (z gi )for counts y gi I Because maximizing ZINB likelihood for NB model parameters is equivalent to maximizing a weighted NB likelihood. I Estimate i using logistic regression model i log = N i 1 i with N i log library size of sample i 20

28 Why we use logistic regression with library size in the EM 21

29 22 Case study: Islam et al. (2014) I Single-cell RNA-seq (scrna-seq) allowed the study of sparse cell populations. I One of the first datasets we worked with was from Islam et al. (2014). It demonstrates scrna-seq for 85 cells consisting of two cell populations in mouse: embryonic stem cells and fibroblasts. I This paper was one of the first scrna-seq studies and motivated our method development. I Link to paper:

scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017

scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017 Olga (NBIS) scrna-seq de October 2017 1 / 34 Outline Introduction: what