Normalization and differential analysis of RNA-seq data

Nathalie Villa-Vialaneix
INRA, Toulouse, MIAT (Mathématiques et Informatique Appliquées de Toulouse)
nathalie.villa@toulouse.inra.fr
http://www.nathalievilla.org
SPS Summer School 2016: From gene expression to genomic network
Nathalie Villa-Vialaneix (INRA, MIAT) Biostatistics - RNAseq 1 / 37

Outline
1. Normalization
2. Differential expression analysis
   - Hypothesis testing and correction for multiple tests
   - Differential expression analysis for RNAseq data

[Figure: a typical transcriptomic experiment]

Some features of RNAseq data
What must be taken into account?
- discrete, non-negative data (total number of aligned reads)
- skewed data
- overdispersion (variance ≫ mean)
[Figure: variance versus mean; the black line is variance = mean]

[Figure: steps in RNAseq data analysis]

Part I: Normalization

Purpose of normalization
- identify and correct technical biases (due to the sequencing process) to make counts comparable
- types of normalization: within sample normalization and between sample normalization

Sources of variation in RNA-seq experiments
1. at the top layer: biological variations (i.e., individual differences due to, e.g., environmental or genetic factors)
2. at the middle layer: technical variations (library effect)
3. at the bottom layer: technical variations (lane and flow cell effects)

lane effect < flow cell effect < library effect ≪ biological effect

Within sample normalization
Example (read counts):

          sample 1   sample 2   sample 3
gene A         752        615       1203
gene B        1507       1225       2455

Counts for gene B are twice as large as counts for gene A because either:
- gene B is expressed with twice as many transcripts as gene A, or
- both genes are expressed with the same number of transcripts but gene B is twice as long as gene A.

Purpose of within sample normalization: enabling comparisons of genes within the same sample.
Sources of variability: gene length, sequence composition (GC content).
These differences do not need to be corrected for a differential analysis and are not really relevant for data interpretation.

Between sample normalization
Example (read counts):

          sample 1   sample 2   sample 3
gene A         752        615       1203
gene B        1507       1225       2455

Counts in sample 3 are much larger than counts in sample 2 because either:
- gene A is more expressed in sample 3 than in sample 2, or
- gene A is expressed similarly in the two samples but sequencing depth is larger in sample 3 than in sample 2 (i.e., differences in library sizes).

Purpose of between sample normalization: enabling comparisons of a gene across different samples.
Sources of variability: library size, ...
These differences must be corrected for a differential analysis and for data interpretation.

Principles for sequencing depth normalization
Basics:
1. choose an appropriate baseline for each sample
2. for a given gene, compare counts relative to the baseline rather than raw counts

In practice, raw counts correspond to different sequencing depths (columns are the control samples followed by the treated samples):

Gene 1        5      1      0      0      4      0      0
Gene 2        0      2      1      2      1      0      0
Gene 3       92    161     76     70    140     88     70
...
Gene G       15     25      9      5     20     14     17

A multiplicative correction factor C_j is calculated for every sample:

C_j         1.1    1.6    0.6    0.7    1.4    0.7    0.8

Every count is multiplied by the correction factor corresponding to its sample:

Gene 1      5.5    1.6      0      0    5.6      0      0
Gene 2        0    3.2    0.6    1.4    1.4      0      0
Gene 3    101.2  257.6   45.6     49    196   61.6     56
...
Gene G     16.5     40    5.4    3.5     28    9.8   13.6

Consequence: library sizes for normalized counts are roughly equal.

Lib. size (x 10^5)   13.1   13.0   13.2   13.1   13.2   13.0   13.1

Definition: if $K_{gj}$ is the raw count for gene g in sample j, then the normalized count is defined as
  $\tilde{K}_{gj} = \frac{K_{gj}}{s_j}$
in which $s_j = C_j^{-1}$ is the scaling factor for sample j.

Three types of methods:
- distribution adjustment
- method taking length into account
- the effective library size concept
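A minimal Python sketch of the multiplication step described above, using the Gene 3 row and the correction factors C_j from the example. The function name `normalize_counts` is hypothetical and only serves the illustration.

```python
def normalize_counts(raw_counts, correction_factors):
    """Multiply each raw count by the correction factor C_j of its sample."""
    return [k * c for k, c in zip(raw_counts, correction_factors)]

# Gene 3 row and C_j values as given in the example table.
gene3_raw = [92, 161, 76, 70, 140, 88, 70]
c_j = [1.1, 1.6, 0.6, 0.7, 1.4, 0.7, 0.8]

# Round to one decimal, as in the table of normalized counts.
gene3_norm = [round(v, 1) for v in normalize_counts(gene3_raw, c_j)]
print(gene3_norm)
```

Dividing by the scaling factor s_j = 1 / C_j gives the same values, which is the form used in the definition of normalized counts.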

Distribution adjustment

Total read count adjustment [Mortazavi et al., 2008]:
  $s_j = \frac{D_j}{\frac{1}{N} \sum_{l=1}^{N} D_l}$
in which N is the number of samples and $D_j = \sum_g K_{gj}$.
edgeR: cpm(..., normalized.lib.sizes = TRUE)
[Figure: log2(count + 1) versus rank(mean) gene expression for Sample 1 and Sample 2, raw counts and normalized counts]

(Upper) quartile normalization [Bullard et al., 2010]:
  $s_j = \frac{Q_j^{(p)}}{\frac{1}{N} \sum_{l=1}^{N} Q_l^{(p)}}$
in which $Q_j^{(p)}$ is a given quantile (generally the 3rd quartile) of the count distribution in sample j.
edgeR: calcNormFactors(..., method = "upperquartile", p = 0.75)
[Figure: log2(count + 1) versus rank(mean) gene expression for Sample 1 and Sample 2, raw counts and normalized counts]
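The two scaling factors above can be sketched in plain Python as follows. This follows the formulas only and is not the edgeR implementation, which differs in details. The function names and the layout `counts[j]` (one list of raw counts per sample) are assumptions made for the example; the toy values are the first three samples of the count table shown earlier.

```python
import statistics

def total_count_factors(counts):
    """counts[j] holds the raw counts of sample j; returns s_j = D_j / mean(D)."""
    lib_sizes = [sum(sample) for sample in counts]
    mean_size = statistics.mean(lib_sizes)
    return [d / mean_size for d in lib_sizes]

def upper_quartile_factors(counts):
    """Returns s_j = Q_j / mean(Q), with Q_j the 3rd quartile of sample j."""
    # statistics.quantiles(..., n=4) returns the three quartiles; index 2 is the 3rd.
    q3 = [statistics.quantiles(sample, n=4)[2] for sample in counts]
    mean_q3 = statistics.mean(q3)
    return [q / mean_q3 for q in q3]

def scale(counts, factors):
    """Normalized counts: each raw count divided by the factor of its sample."""
    return [[k / s for k in sample] for sample, s in zip(counts, factors)]

# Toy data: Gene 1, Gene 2, Gene 3 and Gene G for three samples.
toy = [[5, 0, 92, 15], [1, 2, 161, 25], [0, 1, 76, 9]]
tc = total_count_factors(toy)
uq = upper_quartile_factors(toy)
tc_normalized = scale(toy, tc)
print([round(sum(sample), 2) for sample in tc_normalized])
```

After total count scaling, every sample has the same library size (the average of the raw library sizes), which is the stated purpose of the adjustment.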

Method using gene lengths (intra & inter sample normalization)
RPKM: Reads Per Kilobase per Million mapped Reads
Assumptions: read counts are proportional to expression level, transcript length and sequencing depth
  $s_j = \frac{D_j L_g}{10^3 \times 10^6}$
in which $L_g$ is the gene length (bp).
edgeR: rpkm(..., gene.length = ...)
This gives an unbiased estimation of the number of reads but affects variability [Oshlack and Wakefield, 2009].
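As a small worked example of the RPKM formula, the sketch below divides a raw count by the library size (in millions of reads) and the gene length (in kilobases). The function name `rpkm` mirrors the edgeR function only by name; the library size and gene lengths used here are hypothetical values chosen for the illustration.

```python
def rpkm(count, library_size, gene_length_bp):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (library_size * gene_length_bp)

# Hypothetical setting: 10 million mapped reads, a gene of 2,000 bp with 500 reads.
print(rpkm(500, 10_000_000, 2_000))

# A gene twice as long with twice the count gets the same RPKM.
short_gene = rpkm(750, 10_000_000, 1_000)
long_gene = rpkm(1_500, 10_000_000, 2_000)
```

This is the within sample effect discussed earlier: once length is accounted for, a gene that is twice as long with twice as many reads is no longer reported as more expressed.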

Relative Log Expression (RLE) [Anders and Huber, 2010], edgeR, DESeq, DESeq2
Method:
1. compute a pseudo-reference sample: the geometric mean across samples
   $R_g = \left( \prod_{j=1}^{N} K_{gj} \right)^{1/N}$
   (the geometric mean is less sensitive to extreme values than the standard mean)
2. center samples compared to the reference:
   $\tilde{K}_{gj} = \frac{K_{gj}}{R_g}$
3. calculate the normalization factor: median of centered counts over the genes
   $\tilde{s}_j = \mathrm{median}_g \left\{ \tilde{K}_{gj} \right\}$
   rescaled so that the factors multiply to 1:
   $s_j = \frac{\tilde{s}_j}{\exp\left( \frac{1}{N} \sum_{l=1}^{N} \log(\tilde{s}_l) \right)}$
[Figures: log2(count + 1) and count / geometric mean versus rank(mean) gene expression for Sample 1 and Sample 2]
With edgeR: calcNormFactors(..., method = "RLE")
With DESeq: estimateSizeFactors(...)
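The three RLE steps above can be sketched in plain Python as follows. This is an illustration of the formulas, not the edgeR or DESeq code. The function name `rle_size_factors` and the layout `counts[g][j]` are assumptions, and skipping genes that have a zero count in some sample (their geometric mean would be zero) is a choice made for this sketch.

```python
import math
import statistics

def rle_size_factors(counts):
    """counts[g][j]: raw count of gene g in sample j. Returns one factor per sample."""
    n_samples = len(counts[0])
    ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # assumption: skip genes with a zero count in any sample
        # Step 1: log of the geometric mean R_g across samples.
        log_ref = sum(math.log(k) for k in row) / n_samples
        # Step 2: center each count on the pseudo-reference.
        for j, k in enumerate(row):
            ratios[j].append(k / math.exp(log_ref))
    # Step 3: median of the centered counts over the genes, per sample.
    raw = [statistics.median(r) for r in ratios]
    # Rescale so that the factors multiply to 1.
    geo_mean = math.exp(sum(math.log(s) for s in raw) / n_samples)
    return [s / geo_mean for s in raw]

# Toy data: sample 2 is sequenced twice as deeply as sample 1; the last gene is skipped.
toy = [[10, 20], [20, 40], [30, 60], [0, 5]]
factors = rle_size_factors(toy)
print(factors)
```

On this toy input the second factor is twice the first, reflecting the doubled sequencing depth, and the product of the factors is 1.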

Trimmed Mean of M-values (TMM) [Robinson and Oshlack, 2010], edgeR
Assumptions behind the method:
- the total read count strongly depends on a few highly expressed genes
- most genes are not differentially expressed
Method:
1. select as a reference sample the sample r whose upper quartile is closest to the average upper quartile
2. remove extreme data for fold changes (M) and average intensity (A):
   $M_g(j, r) = \log_2\left(\frac{K_{gj}}{D_j}\right) - \log_2\left(\frac{K_{gr}}{D_r}\right)$
   $A_g(j, r) = \frac{1}{2}\left[ \log_2\left(\frac{K_{gj}}{D_j}\right) + \log_2\left(\frac{K_{gr}}{D_r}\right) \right]$
   trim 30% on M-values and 5% on A-values
   [Figure: M(j,r) versus A(j,r), showing the untrimmed values, the 30% trimmed on M_g and the 5% trimmed on A_g]
3. on the remaining data, calculate the weighted mean of M-values:
   $\mathrm{TMM}(j, r) = \frac{\sum_{g:\,\text{not trimmed}} w_g(j, r)\, M_g(j, r)}{\sum_{g:\,\text{not trimmed}} w_g(j, r)}$
   with $w_g(j, r) = \frac{D_j - K_{gj}}{D_j K_{gj}} + \frac{D_r - K_{gr}}{D_r K_{gr}}$
4. correction factors: $\tilde{s}_j = 2^{\mathrm{TMM}(j, r)}$, rescaled so that the factors multiply to 1:
   $s_j = \frac{\tilde{s}_j}{\exp\left( \frac{1}{N} \sum_{l=1}^{N} \log(\tilde{s}_l) \right)}$
edgeR: calcNormFactors(..., method = "TMM")
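The trimming idea can be sketched in plain Python as below. This is a simplified, unweighted variant: it computes the M- and A-values as defined above, trims the extremes, and takes the plain mean of the remaining M-values instead of the weighted mean. The function name `tmm_factor`, the choice to trim the given fraction from each tail, and the exclusion of genes with zero counts are all assumptions of this sketch, not a description of calcNormFactors.

```python
import math

def tmm_factor(sample, ref, trim_m=0.30, trim_a=0.05):
    """Unweighted trimmed mean of M-values of `sample` against `ref`, returned as 2**TMM."""
    d_j, d_r = sum(sample), sum(ref)
    genes = []
    for k_j, k_r in zip(sample, ref):
        if k_j > 0 and k_r > 0:  # log ratios are undefined for zero counts
            l_j, l_r = math.log2(k_j / d_j), math.log2(k_r / d_r)
            genes.append((l_j - l_r, 0.5 * (l_j + l_r)))  # (M_g, A_g)
    n = len(genes)
    by_m = sorted(range(n), key=lambda i: genes[i][0])
    by_a = sorted(range(n), key=lambda i: genes[i][1])
    cut_m, cut_a = int(n * trim_m), int(n * trim_a)
    # Keep genes that survive both the M trimming and the A trimming.
    keep = set(by_m[cut_m:n - cut_m]) & set(by_a[cut_a:n - cut_a])
    tmm = sum(genes[i][0] for i in keep) / len(keep)
    return 2 ** tmm

# Toy data: identical expression except one gene that is 10 times higher in the sample.
ref = [100] * 10
sample = [100] * 9 + [1000]
factor = tmm_factor(sample, ref)
print(factor)
```

In this toy case the single highly expressed gene inflates the total read count of the sample, and the trimmed mean recovers the ratio 1000/1900 that applies to the nine unchanged genes, which matches the first assumption behind the method.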

Comparison of the different approaches [Dillies et al., 2013] (6 simulated datasets)
Purpose of the comparison:
- finding the best method for all cases is not a realistic goal
- find an approach which is robust enough to provide relevant results in all cases
Method: comparison based on several criteria, to select a method which is valid for multiple objectives.
Effect on count distribution: RPKM and TC (total count) give distributions very similar to raw data.
Effect on differential analysis (DESeq v. 1.6): inflated FPR (false positive rate) for all methods except TMM and DESeq (RLE).
Conclusion:
- differences appear depending on data characteristics
- TMM and DESeq (RLE) perform well in a differential analysis context

Practical session
- import and understand data;
- run different types of normalization;
- compare the results...

Part II: Differential expression analysis

Different steps in hypothesis testing
1. formulate a null hypothesis H0:
   H0: the average count for gene g in the control samples is the same as the average count in the treated samples
   which is tested against an alternative
   H1: the average count for gene g in the control samples is different from the average count in the treated samples
2. from observations, calculate a test statistic (e.g., based on the means in the two groups of samples)
3. find the theoretical distribution of the test statistic under H0
4. deduce the probability, under H0, of obtaining a test statistic at least as extreme as the observed one: this is called the p-value
5. conclude: if the p-value is low (by convention, usually below α = 5%), the observations are unlikely under H0: we say that H0 is rejected. We have that α = P_{H0}(H0 is rejected).
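To make the five steps concrete, here is a small Python sketch that uses an exact permutation test on the difference in means. The permutation test is swapped in only because it is easy to write with the standard library; it is not the test used later in these slides, which rely on count models. The function name and the toy counts are hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(control, treated):
    """Two-sided exact permutation p-value for the absolute difference in means."""
    pooled = control + treated
    # Step 2: test statistic computed on the observed grouping.
    observed = abs(mean(control) - mean(treated))
    n_control, extreme, total = len(control), 0, 0
    # Step 3: under H0 the group labels are exchangeable, so enumerate every relabelling.
    for idx in combinations(range(len(pooled)), n_control):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Step 4: count relabellings at least as extreme as the observation.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

p = permutation_p_value([1, 2, 3], [10, 11, 12])
# Step 5: compare with the chosen level alpha.
print(p, "reject H0" if p < 0.05 else "do not reject H0")
```

With only three replicates per group there are 20 possible relabellings, so the smallest attainable two-sided p-value is 2/20 = 0.1: H0 cannot be rejected at the 5% level even for this clearly separated toy example.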

Summary of the possible decisions: do not reject H0, or reject H0.

Types of errors in tests

Decision \ Reality     H0 is true                         H0 is false
Do not reject H0       Correct decision (True Negative)   Type II error (False Negative)
Reject H0              Type I error (False Positive)      Correct decision (True Positive)

P(Type I error) = α (risk)
P(Type II error) = 1 − β (β: power)

Why might performing a large number of tests be a problem?
Framework: suppose you are performing G tests at level α. Then
  $P(\text{at least one FP if } H_0 \text{ is always true}) = 1 - (1 - \alpha)^G$
Example: for α = 5% and G = 20, P(at least one FP if H0 is always true) ≈ 64%!
[Figure: probability of having at least one false positive versus the number of tests performed, when H0 is true for all G tests]
For more than 75 tests, and if H0 is always true, the probability of having at least one false positive is very close to 100%!
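The formula above is easy to evaluate numerically. The following Python sketch, with a hypothetical function name, computes the probability for a few values of G, assuming the tests are independent and H0 is true for all of them.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """1 - (1 - alpha)^G for G independent tests performed at level alpha."""
    return 1 - (1 - alpha) ** n_tests

for g in (1, 20, 75, 100):
    print(g, round(prob_at_least_one_false_positive(0.05, g), 3))
```

The probability grows quickly with G, which is why the level α of each individual test is not a meaningful guarantee when thousands of genes are tested at once.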

Notation for multiple tests
Number of decisions for G independent tests:

               True null hypotheses   False null hypotheses   Total
Rejected       U                      V                       R
Not rejected   G_0 − U                G_1 − V                 G − R
Total          G_0                    G_1                     G

Instead of the risk α, control:
- the familywise error rate (FWER): FWER = P(U > 0) (i.e., the probability of having at least one false positive decision)
- the false discovery rate (FDR): FDR = E(Q), with Q = U/R if R > 0 and Q = 0 otherwise

Adjusted p-values
Settings: p-values $p_1, \ldots, p_G$ (e.g., corresponding to G tests on G different genes)
Adjusted p-values: the adjusted p-values are $\tilde{p}_1, \ldots, \tilde{p}_G$, such that rejecting the tests for which $\tilde{p}_g < \alpha$ ensures $P(U > 0) \leq \alpha$ or $E(Q) \leq \alpha$.
Calculating adjusted p-values:
1. order the p-values: $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(G)}$
2. calculate $\tilde{p}_{(g)} = a_g\, p_{(g)}$
   - with the Bonferroni method: $a_g = G$ (FWER)
   - with the Benjamini & Hochberg method: $a_g = G/g$ (FDR)
3. if adjusted p-values $\tilde{p}_{(g)}$ are larger than 1, correct: $\tilde{p}_{(g)} \leftarrow \min\{\tilde{p}_{(g)}, 1\}$
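The procedure above can be sketched in plain Python as follows. The function names are hypothetical. One step is added that the list above does not state: for Benjamini & Hochberg, the adjusted values are made monotone by taking a running minimum from the largest p-value downwards, which is the usual convention in software and is treated here as an assumption.

```python
def bonferroni(p_values):
    """Adjusted p-values with a_g = G, capped at 1."""
    g = len(p_values)
    return [min(p * g, 1.0) for p in p_values]

def benjamini_hochberg(p_values):
    """Adjusted p-values with a_g = G / rank, capped at 1 and made monotone."""
    g = len(p_values)
    order = sorted(range(g), key=lambda i: p_values[i])  # indices by increasing p-value
    adjusted = [0.0] * g
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(g, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * g / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.005]
bonf = [round(p, 6) for p in bonferroni(p_values)]
bh = [round(p, 6) for p in benjamini_hochberg(p_values)]
print(bonf)
print(bh)
```

On this toy input, all four tests are rejected at α = 5% with the Benjamini & Hochberg adjustment, but only two with Bonferroni, which illustrates that controlling the FDR is less stringent than controlling the FWER.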

Fisher's exact test for contingency tables
After normalization, one may build a contingency table like this one:

               treated          control          Total
gene g         n_gA             n_gB             n_g
other genes    N_A − n_gA       N_B − n_gB       N − n_g
Total          N_A              N_B              N

Question: is the number of reads of gene g in the treated sample significantly different from that in the control sample?
Method: direct calculation of the probability of obtaining such a contingency table (or a more extreme one) with:
- independence between the two columns of the contingency table;
- the same marginals ("Total").
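The direct calculation described above can be sketched with the hypergeometric distribution. In this Python sketch, "more extreme" is taken to mean "at most as probable as the observed table", which is one common convention and an assumption here; the function name and the toy table are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(n_ga, n_gb, n_a, n_b):
    """n_ga, n_gb: reads of gene g in each condition; n_a, n_b: column totals."""
    n_g, n = n_ga + n_gb, n_a + n_b

    def prob(x):
        # Hypergeometric probability of x reads of gene g in the first column,
        # with all marginals fixed.
        return comb(n_a, x) * comb(n_b, n_g - x) / comb(n, n_g)

    observed = prob(n_ga)
    lowest, highest = max(0, n_g - n_b), min(n_g, n_a)
    # Sum the probabilities of all tables at most as likely as the observed one.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))

# Toy table: 3 reads of gene g out of 4 in one condition, 1 out of 4 in the other.
p_value = fisher_exact_two_sided(3, 1, 4, 4)
print(round(p_value, 4))
```

With real library sizes in the millions this brute-force enumeration is slow; it is only meant to show what quantity the test computes.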

Example of results obtained with the Fisher test
[Figure: log2 fold change versus mean log2 expression; genes declared significantly differentially expressed are in pink]
Main remark: the test is more conservative for genes with a low expression.
Limitation of the Fisher test: highly expressed genes have a very large variance! As the Fisher test does not estimate variance, it tends to detect false positives among highly expressed genes: do not use it!

Basic principles of tests for count data: 2 conditions and replicates

Notations: for gene $g$, $K^1_{g1}, \dots, K^1_{gn_1}$ (condition 1) and $K^2_{g1}, \dots, K^2_{gn_2}$ (condition 2).

- Choose an appropriate distribution to model count data (discrete data, overdispersion):
  $K^k_{gj} \sim \mathrm{NB}(s^k_j \lambda_{gk}, \phi_g)$
  in which:
    - $s^k_j$ is the library size of sample $j$ in condition $k$;
    - $\lambda_{gk}$ is the proportion of counts for gene $g$ in condition $k$;
    - $\phi_g$ is the dispersion of gene $g$ (assumed to be identical for all samples).
- Estimate its parameters for both conditions: $\lambda_{g1}$, $\lambda_{g2}$, $\phi_g$.
- Conclude by calculating a p-value for the test of $H_0: \{\lambda_{g1} = \lambda_{g2}\}$.
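To make the model concrete, the sketch below assumes the usual mean/dispersion parameterization, in which a count with mean $\mu = s \lambda$ and dispersion $\phi$ has variance $\mu + \phi \mu^2$ (this variance formula is an assumption, as the slide does not spell it out). In base R this corresponds to `dnbinom(..., mu = mu, size = 1 / phi)`. All numerical values are hypothetical.

```r
s      <- 1e6     # hypothetical library size
lambda <- 2e-5    # hypothetical proportion of counts for gene g
phi    <- 0.2     # hypothetical dispersion
mu     <- s * lambda            # expected count

# Negative binomial probabilities on a grid wide enough to hold
# essentially all of the probability mass
x   <- 0:5000
pmf <- dnbinom(x, mu = mu, size = 1 / phi)

# Mean and variance computed numerically from the probabilities
mean_num <- sum(x * pmf)
var_num  <- sum((x - mean_num)^2 * pmf)

c(mean = mean_num,
  variance = var_num,
  mu_plus_phi_mu2 = mu + phi * mu^2,   # NB variance under the assumed model
  poisson_variance = mu)               # what a Poisson model would give
```

The gap between the last two values is the overdispersion that the dispersion parameter is meant to capture.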

First method: Exact Negative Binomial test [Robinson and Smyth, 2008]

Normalization is performed to get libraries of equal size $s$, so that
$K^1_{g1} + \dots + K^1_{gn_1} \sim \mathrm{NB}(n_1 s \lambda_{g1}, \phi_g / n_1)$
(and similarly for the second condition).

1. $\lambda_{g1}$ and $\lambda_{g2}$ are estimated (mean of the distributions).
2. $\phi_g$ is estimated independently of $\lambda_{g1}$ and $\lambda_{g2}$, using different approaches to account for small sample size.
3. The test is performed similarly to the Fisher test (exact probability calculation based on the estimated parameters).
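The three steps can be sketched in base R for a single gene. This is a simplified illustration, not the edgeR implementation: the dispersion `phi` is taken as known rather than estimated, library sizes are assumed already equal, and `exact_nb_test` is a hypothetical helper name. It relies on the fact that a sum of $n$ independent NB counts with mean $\mu$ and dispersion $\phi$ is NB with mean $n\mu$ and dispersion $\phi / n$.

```r
# Two-sided exact NB test for one gene.
# k1, k2: counts in conditions 1 and 2; phi: dispersion (assumed known here).
exact_nb_test <- function(k1, k2, phi) {
  n1 <- length(k1); n2 <- length(k2)
  z1 <- sum(k1);    z2 <- sum(k2)
  z  <- z1 + z2
  if (z == 0) return(1)
  mu <- z / (n1 + n2)        # common mean per sample under H0
  a  <- 0:z                  # possible values of the condition-1 sum given z
  # Log-probability of each split (a, z - a) of the total between conditions
  lp <- dnbinom(a,     mu = n1 * mu, size = n1 / phi, log = TRUE) +
        dnbinom(z - a, mu = n2 * mu, size = n2 / phi, log = TRUE)
  p  <- exp(lp - max(lp))
  p  <- p / sum(p)           # condition on the observed total z
  # Sum the probabilities of all splits at most as likely as the observed one
  min(1, sum(p[p <= p[z1 + 1] * (1 + 1e-7)]))
}

# Hypothetical counts, three replicates per condition
p_diff <- exact_nb_test(c(10, 12, 9), c(30, 35, 28), phi = 0.1)
p_same <- exact_nb_test(c(10, 12, 9), c(9, 12, 10),  phi = 0.1)
c(different = p_diff, similar = p_same)
```

Conditioning on the total plays the same role as fixing the marginals in the Fisher test, but the probabilities now come from the negative binomial model and therefore account for the dispersion.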

Estimating the dispersion parameter $\phi_g$

Some methods:

- DESeq, DESeq2: $\phi_g$ is a smooth function of $\lambda_g = \lambda_{g1} = \lambda_{g2}$

      dge <- estimateDispersions(dge)

  [Figure: dispersion versus mean of normalized counts (log scales), showing the gene-wise estimates ("gene est"), the fitted trend ("fitted") and the final values ("final").]

- edgeR: estimate a common dispersion parameter for all genes and use it as a prior in a Bayesian approach to estimate a gene-specific dispersion parameter

      dge <- estimateCommonDisp(dge)
      dge <- estimateTagwiseDisp(dge)
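To see why sharing information across genes matters, the sketch below uses a naive per-gene moment estimator, $\hat\phi_g = (\text{variance} - \text{mean}) / \text{mean}^2$, which follows from assuming a variance of $\mu + \phi \mu^2$. This is neither the DESeq/DESeq2 nor the edgeR estimator; `moment_dispersion` is a hypothetical helper and the simulated data are purely illustrative.

```r
# Naive per-gene moment estimator of the dispersion (truncated at 0).
# counts: matrix with genes in rows and replicates in columns.
moment_dispersion <- function(counts) {
  m <- rowMeans(counts)
  v <- apply(counts, 1, var)
  pmax((v - m) / m^2, 0)
}

set.seed(42)
true_phi <- 0.3
n_genes  <- 500

# Simulate counts with the same mean and dispersion for every gene
simulate <- function(n_rep) {
  matrix(rnbinom(n_genes * n_rep, mu = 100, size = 1 / true_phi),
         nrow = n_genes)
}

est_3  <- moment_dispersion(simulate(3))    # typical RNA-seq design
est_50 <- moment_dispersion(simulate(50))   # unrealistically many replicates

rbind(three_replicates = quantile(est_3,  c(0.1, 0.5, 0.9)),
      fifty_replicates = quantile(est_50, c(0.1, 0.5, 0.9)))
```

With three replicates the per-gene estimates scatter widely around the true value, which is the motivation for fitting a trend or a common prior across genes.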

Perform the test

Some methods:

- DESeq, DESeq2: exact test (DESeq) or approximate tests (Wald and likelihood ratio (LR) tests in DESeq2)

      res <- nbinomWaldTest(dge)
      results(res)

      res <- nbinomLRT(dge, reduced = ~ 1)  # a reduced design must be supplied; ~ 1 is one example
      results(res)

- edgeR: exact tests

      res <- exactTest(dge)
      topTags(res)

(comparison between methods in [Zhang et al., 2014])

More complex experiments: GLM

Framework:
$K_{gj} \sim \mathrm{NB}(\mu_{gj}, \phi_g)$ with $\log(\mu_{gj}) = \log(s_j) + \log(\lambda_{gj})$
in which:
- $s_j$ is the library size for sample $j$;
- $\log(\lambda_{gj})$ is estimated (for instance) by a Generalized Linear Model (GLM): $\log(\lambda_{gj}) = \lambda_0 + x_j^\top \beta_g$, in which $x_j$ is a vector of covariates.

A GLM makes it possible to decompose the effects on the mean of
- different factors;
- their interactions.
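For a single gene, this framework can be sketched with `glm` and the `negative.binomial` family from the MASS package, using the library size as an offset. This is an illustration under assumed settings (simulated data, dispersion treated as known), not the edgeR or DESeq2 fitting procedure.

```r
library(MASS)  # provides negative.binomial(); shipped with standard R installs

set.seed(7)
# Hypothetical design: 2 conditions x 4 replicates, one gene
cond   <- factor(rep(c("control", "treated"), each = 4))
s      <- c(0.8, 1.0, 1.2, 0.9, 1.1, 1.0, 0.7, 1.3) * 1e6  # library sizes
phi    <- 0.1                                              # dispersion, assumed known
lambda <- ifelse(cond == "treated", 4e-5, 2e-5)            # true proportions
k      <- rnbinom(length(s), mu = s * lambda, size = 1 / phi)

fam  <- negative.binomial(theta = 1 / phi)
# Full model: log(mu_j) = log(s_j) + b0 + b1 * [sample j is treated]
fit1 <- glm(k ~ cond + offset(log(s)), family = fam)
# Null model: no condition effect
fit0 <- glm(k ~ 1 + offset(log(s)), family = fam)

# Likelihood ratio statistic (dispersion fixed) and its chi-square p-value
lr   <- deviance(fit0) - deviance(fit1)
pval <- pchisq(lr, df = 1, lower.tail = FALSE)

c(log_fold_change = unname(coef(fit1)["condtreated"]), p_value = pval)
```

Further factors or interactions would enter the same way, as extra terms on the right-hand side of the model formula.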

More complex experiments: GLM in practice

edgeR:

    dge <- estimateDisp(dge, design)
    fit <- glmFit(dge, design)
    res <- glmLRT(fit, ...)
    topTags(res)

DESeq, DESeq2 (the calls below are those of the original DESeq package):

    dge  <- newCountDataSet(counts, design)
    dge  <- estimateSizeFactors(dge)
    dge  <- estimateDispersions(dge)
    fit  <- fitNbinomGLMs(dge, count ~ ...)
    fit0 <- fitNbinomGLMs(dge, count ~ 1)
    res  <- nbinomGLMTest(fit, fit0)
    p.adjust(res, method = "BH")
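The last call, `p.adjust(res, method = "BH")`, applies the Benjamini-Hochberg correction to the per-gene p-values. A minimal sketch of what it computes, on hypothetical p-values:

```r
p <- c(0.001, 0.008, 0.039, 0.041, 0.20, 0.74)   # hypothetical raw p-values

p_bh <- p.adjust(p, method = "BH")

# Same adjustment by hand: multiply the i-th smallest p-value by m / i,
# then enforce monotonicity starting from the largest p-value.
m <- length(p)
o <- order(p)
p_manual <- numeric(m)
p_manual[o] <- pmin(1, rev(cummin(rev(p[o] * m / seq_len(m)))))

rbind(raw = p, BH = p_bh, manual = p_manual)
sum(p_bh < 0.05)   # number of genes declared significant at a 5% FDR
```

Here four raw p-values are below 0.05, but only two remain below 0.05 after adjustment.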

Alternative approach: linear model for count data [Law et al., 2014], limma

Basic idea:

1. Data are transformed so that they are approximately normally distributed:

       tcount <- voom(counts, design)

2. A linear (Gaussian) model is fitted (with a Bayesian approach to improve FDR [McCarthy and Smyth, 2009]):
   $\tilde{K}_{gj} \sim \mathcal{N}(\mu_{gj}, \sigma^2_g)$ with $E(\tilde{K}_{gj}) = \beta_0 + x_j^\top \beta_g$,
   where $\tilde{K}_{gj}$ denotes the transformed count.

       fit <- lmFit(tcount, design)
       fit <- eBayes(fit)
       topTable(fit, ...)
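The two steps can be sketched in base R with a log2 counts-per-million transformation followed by an ordinary linear model per gene. This is a deliberate simplification: it omits the precision weights computed by `voom` and the variance moderation done by `eBayes`. Counts and library sizes are hypothetical.

```r
cond <- factor(rep(c("control", "treated"), each = 3))

# Hypothetical counts: 3 genes x 6 samples (3 control, then 3 treated)
counts <- rbind(gene1 = c( 20,  25,  22,  48,  52,  45),
                gene2 = c(100, 110,  95, 105,  98, 102),
                gene3 = c(500, 480, 510, 490, 505, 495))
lib_size <- c(1.00, 1.10, 0.90, 1.00, 1.05, 0.95) * 1e6   # hypothetical

# Step 1: log2 counts per million, with small offsets to avoid log(0)
logcpm <- t(log2((t(counts) + 0.5) / (lib_size + 1) * 1e6))

# Step 2: ordinary least squares per gene; the slope is a log2 fold change
fit_gene <- function(y) {
  fit <- lm(y ~ cond)
  c(log2FC  = unname(coef(fit)["condtreated"]),
    p_value = summary(fit)$coefficients["condtreated", "Pr(>|t|)"])
}
res <- t(apply(logcpm, 1, fit_gene))
res
```

Working on the log scale turns the multiplicative effect of a condition into an additive one, which is what allows standard linear model tools to be reused.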

Practical session

- use the same data as before;
- run the analysis with different approaches (using exact test or GLM or voom + LM);
- compare the results...

References

Note: Some images in these slides are used courtesy of Ignacio Gonzàles.

Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11:R106.

Bullard, J., Purdom, E., Hansen, K., and Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11(1):94.

Dillies, M., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, G., Castel, D., Estelle, J., Guernec, G., Jagla, B., Jouneau, L., Laloë, D., Le Gall, C., Schaëffer, B., Le Crom, S., Guedj, M., and Jaffrézic, F., on behalf of The French StatOmique Consortium (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics, 14(6):671-683.

Law, C., Chen, Y., Shi, W., and Smyth, G. (2014). Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15:R29.

McCarthy, D. and Smyth, G. (2009). Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics, 25:765-771.

Mortazavi, A., Williams, B., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5:621-628.

Oshlack, A. and Wakefield, M. (2009). Transcript length bias in RNA-seq data confounds systems biology. Biology Direct, 4:14.

Robinson, M. and Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11:R25.

Robinson, M. and Smyth, G. (2008). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9(2):321-332.

Zhang, Z., Jhaveri, D., Marshall, V., Bauer, D., Edson, J., Narayanan, R., Robinson, G., Lundberg, A., Bartlett, P., Wray, N., and Zhao, Q. (2014). A comparative study of techniques for differential expression analysis on RNA-seq data. PLoS ONE, 9(8):e103207.