Genome-scale technologies 2 / Algorithmic and statistical aspects of DNA sequencing


Genome-scale technologies 2 / Algorithmic and statistical aspects of DNA sequencing. Expression analysis for RNA-seq data. Ewa Szczurek, Instytut Informatyki, Uniwersytet Warszawski

The problem. We need to assess the relative levels of transcript abundance in multiple samples. This requires: sample collection (gene arrays, tiling arrays or RNA-seq); signal normalization (bringing the measured signals to comparable values); and assessment of the significance of differential expression.

Microarrays (prehistory). Microarrays designed to look at gene expression use a few probes for each known or predicted gene.

Tiling arrays. A subtype of microarray chips that differ in the nature of the probes: short fragments designed to cover the entire genome or contiguous regions of the genome. Depending on probe lengths and spacing, different degrees of resolution can be achieved.

Tiling arrays. Affymetrix tiling arrays contain 25-nt probes tiled every 35 bp of DNA sequence. The whole-genome arrays (2.0R) comprise 7 chips covering the entire human or mouse genome, with repeat sequences masked.


Read count matrix. An $n \times m$ count matrix $N$, where $N_{gs}$ is the number of reads assigned to gene $g$ in sample $s$. It is produced from alignment data (e.g. using HTSeq or Picard). It is not a direct measure of gene expression! Rather, $N_{gs} \propto l_g \mu_{gs}$, where $l_g$ is the length of gene $g$ and $\mu_{gs}$ is the expected expression.
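The proportionality between counts, gene length and expression can be illustrated with a small sketch. The gene names, lengths, expression levels and the sample-specific constant `depth` are hypothetical values chosen for illustration; they are not taken from the slides.

```python
# Toy illustration of N_gs being proportional to l_g * mu_gs.
# All numbers below are hypothetical.
gene_length_kb = {"geneA": 1.0, "geneB": 4.0}   # l_g
expression = {"geneA": 10.0, "geneB": 10.0}     # mu_gs, identical for both genes
depth = 2.0                                     # assumed sample-specific constant


def expected_count(gene):
    """Expected read count for a gene under the proportionality model."""
    return depth * gene_length_kb[gene] * expression[gene]


counts = {gene: expected_count(gene) for gene in gene_length_kb}
# geneB receives four times the reads of geneA although both are
# expressed at the same level: the difference is due to length alone.
```

This is why raw counts cannot be compared between genes without accounting for gene length.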

Normalization. Definition (Normalization): normalization is a process designed to identify and correct technical biases while removing as little biological signal as possible. This step is technology- and platform-dependent.

Nomenclature
sample: material with a specific source, e.g. culture or tissue.
replicates: several independent samples with the same material type and origin.
condition: the environment in which samples are prepared (e.g. added chemicals). There can be several samples per condition.
flow cell: a glass slide where the sequencing takes place.
lane: one of the eight independent sequencing areas that make up a flow cell.
library: contains cDNAs representative of the RNA molecules that are extracted from a given sample, pre-processed and deposited on lanes in order to be sequenced.
library size: the number of mapped short reads obtained from sequencing of the library.
Here, each sample corresponds to one lane and one library, and belongs to one condition.
Dillies et al. A comprehensive evaluation of normalization.... Brief. in Bioinformatics 2012.

Two issues calling for normalization: 1. Bias in sample (library) size. 2. Bias due to over-represented genes, i.e. genes whose counts dominate the sample size.

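Normalization by scaling factor. Below, $N_{gs}$ is the raw count and $\tilde N_{gs}$ the normalized count of gene $g$ in sample $s$; $i$ runs over genes and $j$ over samples.
Total count (TC):
$$\tilde N_{gs} = N_{gs} \cdot \frac{\operatorname{mean}_j \left( \sum_i N_{ij} \right)}{\sum_i N_{is}} \qquad (1)$$
Upper quartile (UQ), computed over non-zero counts:
$$\tilde N_{gs} = N_{gs} \cdot \frac{\operatorname{mean}_j \, \operatorname{UQ}_i \{ N_{ij} : N_{ij} \neq 0 \}}{\operatorname{UQ}_i \{ N_{is} : N_{is} \neq 0 \}} \qquad (2)$$
Median (Med), computed over non-zero counts:
$$\tilde N_{gs} = N_{gs} \cdot \frac{\operatorname{mean}_j \, \operatorname{med}_i \{ N_{ij} : N_{ij} \neq 0 \}}{\operatorname{med}_i \{ N_{is} : N_{is} \neq 0 \}} \qquad (3)$$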
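The three scaling-factor normalizations share one pattern: compute a summary statistic per sample, then rescale each sample so that its statistic equals the mean of that statistic across samples. A minimal sketch in plain Python follows. The function names and the toy matrix are hypothetical, and the upper quartile uses the default interpolation of `statistics.quantiles`, which is an assumption (other tools may interpolate differently).

```python
from statistics import mean, median, quantiles


def total_count(col):
    """Library size: sum of all counts in a sample."""
    return sum(col)


def upper_quartile_nonzero(col):
    """Upper quartile of the non-zero counts in a sample."""
    return quantiles([x for x in col if x != 0], n=4)[2]


def median_nonzero(col):
    """Median of the non-zero counts in a sample."""
    return median([x for x in col if x != 0])


def scale_counts(counts, stat):
    """Rescale every sample so that stat(sample) equals its across-sample mean.

    counts is a list of rows (genes); each row lists counts per sample.
    """
    m = len(counts[0])
    per_sample = [stat([row[s] for row in counts]) for s in range(m)]
    target = mean(per_sample)
    return [[row[s] * target / per_sample[s] for s in range(m)] for row in counts]


# Hypothetical toy data: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [[10, 20], [0, 0], [30, 60], [60, 120]]
tc = scale_counts(toy_counts, total_count)
uq = scale_counts(toy_counts, upper_quartile_nonzero)
med = scale_counts(toy_counts, median_nonzero)
```

On this toy matrix the two samples differ only in depth, so all three methods bring the columns to identical values.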

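Normalization by scaling factor (cont.). Hypothesis: most genes are not differentially expressed (DE) and should have similar read counts across samples.
DESeq: each count is divided by a per-sample size factor, the median over genes $i$ of the ratio of the count to its geometric mean across samples $j$:
$$\tilde N_{gs} = N_{gs} \Big/ \operatorname{med}_i \left( \frac{N_{is}}{\operatorname{geometric\ mean}_j (N_{ij})} \right) \qquad (4)$$
Trimmed Mean of M-values (TMM): similar to DESeq, but uses means and removes outliers.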
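The median-of-ratios idea behind the DESeq size factor can be sketched as follows. This is an illustration, not the DESeq implementation: the function name and toy matrix are hypothetical, and genes with a zero count in any sample are skipped here because their geometric mean is zero (an assumption of this sketch).

```python
from math import exp, log
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factor for each sample (column) of counts."""
    m = len(counts[0])
    # Assumption: only genes with positive counts in all samples are used.
    usable = [row for row in counts if all(x > 0 for x in row)]
    # Geometric mean of each usable gene across samples, via logs.
    geo_means = [exp(sum(log(x) for x in row) / m) for row in usable]
    return [
        median(row[s] / gm for row, gm in zip(usable, geo_means))
        for s in range(m)
    ]


# Hypothetical toy data: sample 2 is twice as deep as sample 1,
# except for one gene with a zero count that is skipped.
toy_counts = [[10, 20], [0, 5], [30, 60], [60, 120]]
factors = size_factors(toy_counts)
normalized = [[row[s] / factors[s] for s in range(2)] for row in toy_counts]
```

Because the median is taken over genes, a handful of strongly DE or dominant genes has little influence on the factor, which matches the hypothesis stated above.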


RPKM/FPKM normalization: Reads/Fragments Per Kilobase of transcript sequence per Million mapped reads.
+ correction for gene/transcript length
+ correction for sequencing depth
- no correction for differences in expression distribution between samples
- the relation between read number and variation is lost
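As a worked example of the definition, RPKM divides the read count by the transcript length in kilobases and by the number of mapped reads in millions. The function name and the numbers below are hypothetical.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    length_kb = length_bp / 1_000
    mapped_millions = total_mapped_reads / 1_000_000
    return read_count / (length_kb * mapped_millions)


# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 10 million reads.
value = rpkm(500, 2_000, 10_000_000)  # 500 / (2 kb * 10 M)
```

The same formula gives FPKM when fragments (read pairs) are counted instead of individual reads.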

Quantile normalization (Q). A technique for making two distributions identical in statistical properties.
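A common way to carry out quantile normalization is to replace the k-th smallest value of every sample by the mean of the k-th smallest values across samples. The sketch below assumes this variant and ignores special handling of ties; the function name and toy matrix are hypothetical.

```python
def quantile_normalize(counts):
    """Give every sample (column) the same empirical distribution.

    counts is a list of rows (genes); each row lists values per sample.
    Ties are broken by gene order, which is a simplification.
    """
    n, m = len(counts), len(counts[0])
    columns = [[row[s] for row in counts] for s in range(m)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(sc[k] for sc in sorted_cols) / m for k in range(n)]
    normalized = [[0.0] * m for _ in range(n)]
    for s, col in enumerate(columns):
        order = sorted(range(n), key=lambda g: col[g])  # gene indices by rank
        for rank, g in enumerate(order):
            normalized[g][s] = reference[rank]
    return normalized


toy = [[5, 4], [2, 1], [3, 6]]
result = quantile_normalize(toy)
```

After normalization every column contains exactly the reference values, only in a different gene order.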

Comparison of normalization methods on real data: normalized data distribution. When library sizes differ greatly, TC and RPKM do not improve over the raw counts.

Comparison of normalization methods on real data: within-condition variability. Example: Mus musculus, condition D dataset. M.-A. Dillies et al. A comprehensive evaluation of normalization.... Brief. in Bioinformatics 2012.

Comparison of normalization methods by number of DE genes. M.-A. Dillies et al. A comprehensive evaluation of normalization.... Brief. in Bioinformatics 2012.

Consensus dendrogram

Comparison of normalization methods on simulated data: error rate and power

So the winner is...? In most cases, the methods give similar results; the differences appear depending on the data characteristics.

Interpretation
RawCount: often fewer differentially expressed genes (e.g. A. fumigatus: no DE gene).
TC, RPKM: sensitive to the presence of predominant genes; less effective stabilization of distributions; ineffective (similar to RawCount).
Q: can increase between-group variance; is based on a (too) strong assumption (similar distributions).
Med: high variability of housekeeping genes.
TC, RPKM, Q, Med, UQ: adjustment of distributions implies a similarity between the RNA repertoires expressed.

Conclusions on normalization. RNA-seq data are affected by technical biases (total number of mapped reads per lane, gene length, composition bias). Normalization is needed and has a great impact on which genes are called DE. Detection of differential expression in RNA-seq data is inherently biased (more power to detect DE of longer genes). Do not normalize by gene length in the context of differential analysis. TMM and DESeq are effective and robust methods for DE analysis at the gene level.

Differential analysis. Aim: to detect differentially expressed genes between two conditions. Characteristics of the data: discrete quantitative data; few replicates; overdispersion problem.


Poisson distribution. For a Poisson-distributed $X$,
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
It expresses the probability of a given number of events occurring in a fixed interval of time or space. It assumes these events occur with a known average rate and independently of the time since the last event. The variance is equal to the mean ($\lambda$).
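The equality of mean and variance can be checked numerically from the probability mass function. The sketch below sums over a truncated support (0 to 99), which is an approximation that is negligible for the chosen, hypothetical rate of 3.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with rate lam."""
    return lam ** k * exp(-lam) / factorial(k)


lam = 3.0
support = range(100)  # truncated support; the remaining tail mass is negligible
mean = sum(k * poisson_pmf(k, lam) for k in support)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in support)
# Both mean and variance come out numerically equal to lam.
```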

Overdispersion. The Poisson distribution was proposed to model read count data, e.g. Wang et al (2010), Bloom et al (2009), Kasowski et al (2010), ... This is convenient because there is no need to estimate the variance. However, when the models are fit, the observed variance is higher than the variance of the theoretical model (overdispersion), which leads to type-I errors (false DE discoveries).
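One simple diagnostic, sketched here under the assumption that the replicates have already been brought to a comparable scale, is the ratio of sample variance to sample mean for a gene: under a Poisson model it should be close to 1. The replicate counts below are hypothetical toy numbers, not real data.

```python
from statistics import mean, variance

# Hypothetical counts of one gene in five replicates of the same condition.
replicate_counts = [12, 30, 7, 45, 21]

sample_mean = mean(replicate_counts)
sample_variance = variance(replicate_counts)  # unbiased estimator (n - 1)
dispersion_index = sample_variance / sample_mean
# A value well above 1 indicates more variability than a Poisson model allows.
```

With so few replicates the ratio itself is noisy, which is one motivation for sharing information across genes in the methods that follow.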

Negative binomial distribution. Suppose a sequence of independent Bernoulli trials. The probability of success is $p$ and of failure is $(1 - p)$. We observe this sequence until $r$ failures occur. Then for the random number $X$ of successes we have seen,
$$P(X = k) = \binom{k + r - 1}{k} (1 - p)^r p^k,$$
with mean $\mu = \frac{pr}{1 - p}$ and variance $\sigma^2 = \frac{pr}{(1 - p)^2}$.
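The stated mean and variance can be verified numerically from the probability mass function. The sketch assumes an integer number of failures `r` (so that `math.comb` applies) and sums over a truncated support; the values of `r` and `p` are hypothetical.

```python
from math import comb


def nb_pmf(k, r, p):
    """P(X = k): k successes observed before the r-th failure."""
    return comb(k + r - 1, k) * (1 - p) ** r * p ** k


r, p = 5, 0.4
support = range(400)  # truncated support; the remaining tail mass is negligible
mean = sum(k * nb_pmf(k, r, p) for k in support)
variance = sum((k - mean) ** 2 * nb_pmf(k, r, p) for k in support)

expected_mean = p * r / (1 - p)
expected_variance = p * r / (1 - p) ** 2
```

Unlike the Poisson case, the variance here exceeds the mean, which is what makes the distribution suitable for overdispersed counts.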

Negative binomial distribution re-parametrized. Let $\alpha = \frac{1}{r}$, and the mean as before, $\mu = \frac{pr}{1 - p}$. Then the mean of the NB distribution is $\mu$ and the variance is $\mu + \alpha \mu^2$.
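The variance formula follows from the earlier expressions for the mean and variance in a few steps, written out here as a short derivation using only the symbols defined above:

```latex
\begin{align*}
\sigma^2 &= \frac{pr}{(1-p)^2} = \frac{\mu}{1-p}, \\
\frac{1}{1-p} &= 1 + \frac{p}{1-p} = 1 + \frac{\mu}{r}, \\
\sigma^2 &= \mu \left( 1 + \frac{\mu}{r} \right) = \mu + \frac{1}{r}\,\mu^2 = \mu + \alpha \mu^2 .
\end{align*}
```

As $\alpha \to 0$ (that is, $r \to \infty$ with $\mu$ fixed), the variance approaches $\mu$, the Poisson case.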

edgeR. Model count data with NB distributions. The number of replicates in read count data is often too small to reliably estimate both the mean $\mu$ and the variance $\sigma^2$. Assume that mean and variance are related by $\sigma^2 = \mu + \alpha \mu^2$, with a dispersion constant $\alpha$ (a single common value shared across genes in the basic model, or moderated per-gene estimates). Robinson and Smyth, 2010

DESeq. Three assumptions, with $N_{gs} \sim NB(\mu_{gs}, \sigma^2_{gs})$:
1. $\mu_{gs} = q_{g,\rho(s)} f_s$, where $\rho(s)$ denotes the experimental condition of sample $s$, $q_{g,\rho(s)}$ is proportional to the expectation value of the true (but unknown) concentration of reads from gene $g$ under condition $\rho(s)$, and $f_s$ is the DESeq normalization factor.
2. $\sigma^2_{gs} = \mu_{gs} + f_s^2 v_{g,\rho(s)}$. Here $\mu_{gs}$ is the technical, Poisson-distributed variance (shot noise), and $f_s^2 v_{g,\rho(s)}$ refers to the raw variance.
3. $v_{g,\rho}$ is a smooth function of $q_{g,\rho}$: $v_{g,\rho(s)} = v_\rho(q_{g,\rho(s)})$. This allows pooling the data from genes with similar expression strength for the purpose of variance estimation.
Anders and Huber, 2010
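Assumptions 1 and 2 can be read as a recipe for the mean and variance of a single count. The sketch below only evaluates those two formulas; the function name and the input values are hypothetical.

```python
def deseq_mean_and_variance(q, f, v):
    """Mean and variance of N_gs under assumptions 1 and 2.

    q: expression strength q_{g,rho(s)}
    f: size factor f_s
    v: raw variance v_{g,rho(s)}
    """
    mu = q * f                  # assumption 1
    shot_noise = mu             # Poisson part of the variance
    scaled_raw = f ** 2 * v     # raw variance scaled by the size factor
    return mu, shot_noise + scaled_raw


mu, sigma2 = deseq_mean_and_variance(q=20.0, f=1.5, v=4.0)
# With v = 0 the variance equals the mean, as in the Poisson model.
mu0, sigma2_0 = deseq_mean_and_variance(q=20.0, f=1.5, v=0.0)
```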

DESeq model fitting. Assume $n$ genes, $m$ samples, and $k$ experimental conditions. The following parameters are estimated:
1. $m$ size factors $f_s$ (the expected values of all counts from sample $s$ are proportional to $f_s$).
2. $kn$ expression strength parameters $q_{g,\rho}$, one for each condition $\rho$ and gene $g$ (the expected values of counts for gene $g$ in condition $\rho$ are proportional to $q_{g,\rho}$):
$$\hat q_{g,\rho} = \frac{1}{m_\rho} \sum_{s:\rho(s)=\rho} \frac{N_{g,s}}{f_s},$$
i.e., the average of the normalized counts from the samples of condition $\rho$, with $m_\rho$ the number of samples for condition $\rho$.
3. $k$ smooth functions $v_\rho : \mathbb{R}^+ \to \mathbb{R}^+$. For each condition $\rho$, $v_\rho$ models the dependence of the raw variance $v_{g,\rho}$ on the expected mean $q_{g,\rho}$.
Anders and Huber, 2010
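The expression strength estimate in item 2 is simply the mean of the size-factor-normalized counts within one condition. A minimal sketch, with a hypothetical function name and toy inputs:

```python
def estimate_q(counts_g, size_factors):
    """Average of normalized counts N_gs / f_s over the samples of one condition."""
    normalized = [n / f for n, f in zip(counts_g, size_factors)]
    return sum(normalized) / len(normalized)


# Hypothetical gene with three replicates of the same condition.
q_hat = estimate_q([10, 22, 27], [0.5, 1.0, 1.5])  # normalized: 20, 22, 18
```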

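DESeq model fitting. For estimation of the raw variance $v_{g,\rho}$:
1. Calculate the normalized condition variance estimates
$$w_{g,\rho} = \frac{1}{m_\rho - 1} \sum_{s:\rho(s)=\rho} \left( \frac{N_{g,s}}{f_s} - \hat q_{g,\rho} \right)^2 .$$
2. Define
$$z_{g,\rho} = \frac{\hat q_{g,\rho}}{m_\rho} \sum_{s:\rho(s)=\rho} \frac{1}{f_s} .$$
3. Theorem: $w_{g,\rho} - z_{g,\rho}$ is an unbiased estimator of the raw variance.
4. For small $m_\rho$ this estimator is not useful. Instead, regress on $(\hat q_{g,\rho}, w_{g,\rho})$ to obtain a smooth function $w_\rho(q)$ and estimate the raw variance with $\hat v_\rho(\hat q_{g,\rho}) = w_\rho(\hat q_{g,\rho}) - z_{g,\rho}$.
Anders and Huber, 2010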
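Steps 1 to 3 can be sketched for a single gene and condition as follows. The smoothing regression of step 4 is not shown. The function name and toy inputs are hypothetical.

```python
def raw_variance_estimate(counts_g, size_factors):
    """Per-gene estimates q_hat, w, z and the unbiased raw variance w - z."""
    m = len(counts_g)
    normalized = [n / f for n, f in zip(counts_g, size_factors)]
    q_hat = sum(normalized) / m
    # Step 1: sample variance of the normalized counts.
    w = sum((x - q_hat) ** 2 for x in normalized) / (m - 1)
    # Step 2: correction term accounting for shot noise.
    z = q_hat / m * sum(1 / f for f in size_factors)
    # Step 3: unbiased estimate of the raw variance.
    return q_hat, w, z, w - z


# Hypothetical gene with three replicates; normalized counts are 20, 40, 10.
q_hat, w, z, v_raw = raw_variance_estimate([10, 40, 15], [0.5, 1.0, 1.5])
```

With only a few replicates the difference `w - z` is very noisy and can even be negative, which is why step 4 replaces the per-gene `w` with a smooth fit over genes of similar expression strength.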

DESeq model fitting (figure). y-axis: condition variance estimator $w_{g,\rho}$; x-axis: mean estimator $\hat q_{g,\rho}$. Orange line: the regression $w_\rho$. Dashed orange line: edgeR variance estimator. Violet line: Poisson variance (= mean). Anders and Huber, 2010

DESeq DE calling. For gene $g$ and two conditions, 1 and 2, we want to evaluate whether gene $g$ is differentially expressed between 1 and 2. Hypothesis testing: $H_0$: the means are equal, $q_{g,1} = q_{g,2}$.