Analyses biostatistiques de données RNA-seq

Size: px
Start display at page:

Download "Analyses biostatistiques de données RNA-seq"

Transcription

1 Analyses biostatistiques de données RNA-seq Ignacio Gonzàlez, Annick Moisan & Nathalie Villa-Vialaneix Toulouse, 18/19 mai 2017 IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

2 Outline 1 Exploratory analysis Introduction Experimental design Data exploration and quality assessment 2 Normalization Raw data filtering Interpreting read counts 3 Differential Expression analysis Hypothesis testing and correction for multiple tests Differential expression analysis for RNAseq data Interpreting and improving the analysis IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

3 Outline 1 Exploratory analysis Introduction Experimental design Data exploration and quality assessment 2 Normalization Raw data filtering Interpreting read counts 3 Differential Expression analysis Hypothesis testing and correction for multiple tests Differential expression analysis for RNAseq data Interpreting and improving the analysis IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

4 A typical transcriptomic experiment IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

5 A typical transcriptomic experiment IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

6 A typical RNA-seq experiment 1 collect samples 2 generate libraries of cdna fragments 3 affect material to one lane in one flow cell IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

7 Steps in RNAseq data analysis IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

8 Part I: Experimental design IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

9 Confounded effects: a simple example Basic experiment: find differences between control/treated plants control group plant treated group plant IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

10 Confounded effects: a simple example Basic experiment: find differences between control/treated plants control group plant treated group plant A bad experimental design: grow all control group plants in one field and grow all treated group plants in another field Field 1 Field 2 because you can not differentiate between differences due to the field and differences due to the treatment confounded effects IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

11 Confounded effects: a simple example Basic experiment: find differences between control/treated plants control group plant treated group plant A good experimental design: grow half control group plants (chosen at random) and half treated group plants in one field (and the rest in the other field) Field 1 Field 2 differences due to the field and differences due to the treatment can be estimated separately IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

12 Confounded effects: a simple example Basic experiment: find differences between control/treated plants control group plant treated group plant In summary, what is a good experimental design? Experimental design are usually not as simple as this example: they can include multiple experimental factors (day of experiment, flow cell,... ) and multiple covariates (sex, parents,... ). The experimental design must be carefully thought before starting the experiment and confounded effects must be searched for in a systematic manner. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

13 Effect & Variation 2 conditions, 2 genes whose expression distribution is: first gene: different median levels between the two groups but large variance: differences may be non significant second gene: different median levels between the two groups but very small variance: differences may be significant IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

14 Source of variation in RNA-seq experiments 1 at the top layer: biological variations (i.e., individual differences due to e.g., environmental or genetic factors) 2 at the middle layer: technical variations (library preparation effect) 3 at the bottom layer: technical variations (lane and cell flow effects) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

15 Source of variation in RNA-seq experiments 1 at the top layer: biological variations (i.e., individual differences due to e.g., environmental or genetic factors) 2 at the middle layer: technical variations (library preparation effect) 3 at the bottom layer: technical variations (lane and cell flow effects) lane effect < cell flow effect < library preparation effect biological effect IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

16 Advised policy for biological/technical replicates biological replicates: different biological samples, processed separately (advised: 3 per condition) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

17 Advised policy for biological/technical replicates biological replicates: different biological samples, processed separately (advised: 3 per condition) technical replicates: same biological material but independent replications of the technical steps (from library preparation; advised: 2 per biological replicate) but 6 biological replicates are better than 3 biological replicates 2 technical replicates! See [Liu et al., 2014] for further discussions. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

18 In summary: a simple two-condition experimental design for RNA-seq Typical RNA-seq design: independent samples are sequenced into different lanes (lane effect cannot be estimated but within sample comparison is preserved) RNA extraction Library preparation Lane 1 Lane 2 IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

19 In summary: a simple two-condition experimental design for RNA-seq Typical RNA-seq design: independent samples are sequenced into different lanes (lane effect cannot be estimated but within sample comparison is preserved) RNA extraction Library preparation Lane 1 Lane 2 Example with 2 biological replicates per condition and 4 technical replicates per biological replicate: Flow cell 1 Flow cell 2 Flow cell 3 Flow cell C 11 C 21 T 11 T 21 C 12 C 22 T 12 T 22 C 13 C 23 T 13 T 23 C 14 C 24 T 14 T 24 IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

20 In summary: a simple two-condition experimental design for RNA-seq Multiplex RNA-seq design: DNA fragments are barcoded so that multiple samples can be included in the same lane Barcode adapters RNA extraction Library preparation and barcoding Pooling Lane 1 Lane 2 Barcoded adapters IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

21 In summary: a simple two-condition experimental design for RNA-seq Multiplex RNA-seq design: DNA fragments are barcoded so that multiple samples can be included in the same lane Barcode adapters RNA extraction Library preparation and barcoding Pooling Lane 1 Lane 2 Barcoded adapters Example with 2 biological replicates per condition and 4 technical replicates per biological replicate, splitted on 4 flow cells: Flow cell 1 Flow cell 2 Flow cell 3 Flow cell C 111 C 121 C 131 C 141 C 112 C 122 C 132 C 142 C 113 C 123 C 133 C 143 C 114 C 124 C 134 C 144 C 211 C 221 C 231 C 241 C 212 C 222 C 232 C 242 C 213 C 223 C 233 C 243 C 214 C 224 C 234 C 244 T 111 T 121 T 131 T 141 T 112 T 122 T 132 T 142 T 113 T 123 T 133 T 143 T 114 T 124 T 134 T 144 T 211 T 221 T 231 T 241 T 212 T 222 T 232 T 242 T 213 T 223 T 233 T 243 T 214 T 224 T 234 T 244 IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

22 Part II: Exploratory analysis IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

23 Some features of RNAseq data What must be taken into account? discrete, non-negative data (total number of aligned reads) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

24 Some features of RNAseq data What must be taken into account? discrete, non-negative data (total number of aligned reads) skewed data IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

25 Some features of RNAseq data What must be taken into account? discrete, non-negative data (total number of aligned reads) skewed data overdispersion (variance mean) black line is variance = mean IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

26 Dataset used in the examples Dataset pasilla The dataset pasilla comes from the R package pasilla available on Bioconductor It can be installed using source(" bioconductor. org/ bioclite.r") bioclite(" pasilla") directly use the function data in R (the library DESeq is required to extract the count data): library( pasilla); library( DESeq) data(" pasillagenes") head( counts( pasillagenes)) or import the text file included in the package which contains the counts datafile=system.file("extdata/pasilla_gene_counts.tsv", package =" pasilla") rawcounttable =read. table( dfile, header =TRUE, row. names =1) head( rawcount) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

27 Dataset used in the examples Description of the data: experiment on Drosophila melanogaster [Brooks et al., 2010] count data obtained from FASTQ files produced by second generation sequencing techniques (Illumina) details on the way the count table was obtained are given in the package s vignette IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

28 Dataset used in the examples Description of the data: experiment on Drosophila melanogaster [Brooks et al., 2010] count data obtained from FASTQ files produced by second generation sequencing techniques (Illumina) details on the way the count table was obtained are given in the package s vignette Before any further processing, the data must be explored by visualization methods and statistical methods that provide simplified summary of the data. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

29 Count distribution The count distribution (i.e., the number of times a given count is obtained in the data) can be visualized with histograms count control1 IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

30 Count distribution The count distribution (i.e., the number of times a given count is obtained in the data) can be visualized with histograms count 5000 log 2(count + 1) control control1 This distribution is highly skewed and it is better visualized using a log 2 transformation before it is displayed. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

31 Count distribution The count distribution (i.e., the number of times a given count is obtained in the data) can be visualized with histograms count 5000 log 2(count + 1) control control1 This distribution is highly skewed and it is better visualized using a log 2 transformation before it is displayed. library size The library size is the sum of all counts in a given sample. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

32 Count distribution within samples The count distributions within each sample can be compared using parallel boxplots (on the log 2 transformed data): 15 log 2 (count + 1) 10 5 Condition control treated 0 control1 control2 control3 control4 treated1 treated2 treated3 It is expected that, within a given condition (i.e., control or treated ), the count distributions between the different samples are similar. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

33 Check reproducibility between samples MA plots can be used to visualize reproducibility between samples of an experiment (and thus check if normalization is needed). They plot the log-fold change (M-values) against the log-average (A-values): A-values: average log counts between two samples: M-values: log of ratio between counts between two samples: Mg = log2 (Kgj ) log2 (Kgj 0 ) Ag = log2 (Kgj ) + log2 (Kgj 0 ) 2 where Kgj stands for the counts for gene g in sample j. IG, AM, NV2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

34 Check reproducibility between samples MA plots can be used to visualize reproducibility between samples of an experiment (and thus check if normalization is needed). They plot the log-fold change (M-values) against the log-average (A-values): control1 vs control2 control1 vs control3 control1 vs control4 control2 vs control3 control2 vs control4 control3 vs control M A IG, AM, NV2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

35 Check similarity between samples Similarities between samples can be visualized with a AHC and a heatmap: perform ascending hierarchical classification (AHC) using Euclidean distance between samples: δ(j, j ) = ( g log2 (K gj ) log 2 (K gj ) ) 2 visualize the strength of the similarity with heatmap. Color key control3 : paired end control4 : paired end treated2 : paired end treated3 : paired end control1 : single end control2 : single end treated1 : single end : single end : single end : single end : paired end : paired end IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64 : paired end : paired end

36 Search for the main structures in the data: PCA PCA (on log 2 counts) can be used to project data into a small dimensional space and search for unexpected experimental effects in the data. In DESeq, the function plotpca performs PCA on the top genes with the highest variance. control : paired end control : single end treated : paired end treated : single end 10 PC PC1 IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

37 Search for the main structures in the data: MDS MDS (on log 2 counts) can also be used to project data into a small dimensional space. Unlike PCA (which aims at providing the projection with the largest variance), its aim is to preserve the distances during the projection. In edger, the function plotmds performs MDS on the top genes with the highest variance. reated : paired end treated : paired end treated : single e Dimension trol : paired end control : paired end control : single end control : single end when using the Euclidean distance between counts for MDS, it is equivalent to PCA Dimension 1 IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

38 Outline 1 Exploratory analysis Introduction Experimental design Data exploration and quality assessment 2 Normalization Raw data filtering Interpreting read counts 3 Differential Expression analysis Hypothesis testing and correction for multiple tests Differential expression analysis for RNAseq data Interpreting and improving the analysis IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

39 Raw data filtering Filtering consists in removing genes with low expression. Different strategies can be used: [Sultan et al., 2008]: filter out genes with a total read count smaller than a given threshold; [Bottomly et al., 2011]: filter out genes with zero count in an experimental condition; [Robinson and Oshlack, 2010]: filter out genes such that the number of samples with a CPM value (for this gene) smaller than a given threshold is larger than the smallest number of samples in a condition. With CPM: Count Per Million (i.e., raw count divived by library size, this strategy takes into account differences in library sizes). IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

40 Raw data filtering Filtering consists in removing genes with low expression. Different strategies can be used: [Sultan et al., 2008]: filter out genes with a total read count smaller than a given threshold; [Bottomly et al., 2011]: filter out genes with zero count in an experimental condition; [Robinson and Oshlack, 2010]: filter out genes such that the number of samples with a CPM value (for this gene) smaller than a given threshold is larger than the smallest number of samples in a condition. With CPM: Count Per Million (i.e., raw count divived by library size, this strategy takes into account differences in library sizes). More sophisticated filtering To account for the fact that lowly expressed genes are almost never found differentially expressed, a more sophisticated filtering can be performed. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

41 Part III: Normalization IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

42 Purpose of normalization identify and correct technical biases (due to sequencing process) to make counts comparable types of normalization: within sample normalization and between sample normalization IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

43 Within sample normalization Example: (read counts) sample 1 sample 2 sample 3 gene A gene B counts for gene B are twice larger than counts for gene A because: IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

44 Within sample normalization Example: (read counts) sample 1 sample 2 sample 3 gene A gene B counts for gene B are twice larger than counts for gene A because: gene B is expressed with a number of transcripts twice larger than gene A gene A gene B IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

45 Within sample normalization Example: (read counts) sample 1 sample 2 sample 3 gene A gene B counts for gene B are twice larger than counts for gene A because: both genes are expressed with the same number of transcripts but gene B is twice longer than gene A gene A gene B IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

46 Within sample normalization Purpose of within sample comparison: enabling comparisons of genes from a same sample Sources of variability: gene length, sequence composition (GC content) These differences need not to be corrected for a differential analysis and are not really relevant for data interpretation. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

47 Between sample normalization Example: (read counts) sample 1 sample 2 sample 3 gene A gene B counts in sample 3 are much larger than counts in sample 2 because: IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

48 Between sample normalization Example: (read counts) sample 1 sample 2 sample 3 gene A gene B counts in sample 3 are much larger than counts in sample 2 because: gene A is more expressed in sample 3 than in sample 2 gene A in sample 2 gene A in sample 3 IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

49 Between sample normalization Example: (read counts) sample 1 sample 2 sample 3 gene A gene B counts in sample 3 are much larger than counts in sample 2 because: gene A is expressed similarly in the two samples but sequencing depth is larger in sample 3 than in sample 2 (i.e., differences in library sizes) gene A in sample 2 gene A in sample 3 IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

50 Between sample normalization Purpose of between sample comparison: enabling comparisons of a gene in different samples Sources of variability: library size,... These differences must be corrected for a differential analysis and for data interpretation. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

51 Principles for sequencing depth normalization Basics 1 choose an appropriate baseline for each sample 2 for a given gene, compare counts relative to the baseline rather than raw counts IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

52 Principles for sequencing depth normalization Basics 1 choose an appropriate baseline for each sample 2 for a given gene, compare counts relative to the baseline rather than raw counts In practice: Raw counts correspond to different sequencing depths control treated Gene Gene Gene : : : : : : : : : Gene G IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

53 Principles for sequencing depth normalization Basics 1 choose an appropriate baseline for each sample 2 for a given gene, compare counts relative to the baseline rather than raw counts In practice: A correction multiplicative factor is calculated for every sample control treated Gene Gene Gene : : : : : : : : : Gene G C j IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

54 Principles for sequencing depth normalization Basics 1 choose an appropriate baseline for each sample 2 for a given gene, compare counts relative to the baseline rather than raw counts In practice: Every counts is multiplied by the correction factor corresponding to its sample Gene C j Gene x IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

55 Principles for sequencing depth normalization Basics 1 choose an appropriate baseline for each sample 2 for a given gene, compare counts relative to the baseline rather than raw counts Consequences: Library sizes for normalized counts are roughly equal. control treated Gene Gene Gene : : : : : : : : : Gene G Lib. size x 10 5 IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

56 Principles for sequencing depth normalization Definition If K gj is the raw count for gene g in sample j then, the normalized counts is defined as: K gj = K gj s j in which s j = C 1 j is the scaling factor for sample j. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

57 Principles for sequencing depth normalization Definition If K gj is the raw count for gene g in sample j then, the normalized counts is defined as: K gj = K gj s j in which s j = C 1 j is the scaling factor for sample j. Three types of methods: distribution adjustment method taking length into account the effective library size concept IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

58 Distribution adjustment Total read count adjustment [Mortazavi et al., 2008] s j = 1 N N l=1 D l in which N is the number of samples and D j = g K gj. D j raw counts normalized counts log 2(count + 1) Samples Sample 1 Sample 2 edger: cpm(..., normalized. lib. sizes=true) rank(mean) gene expression IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

59 Distribution adjustment Total read count adjustment [Mortazavi et al., 2008] (Upper) Quartile normalization [Bullard et al., 2010] s j = Q (p) j 1 N l=1 Q(p) l in which Q (p) is a given quantile (generally 3rd quartile) of the count j distribution in sample j. raw counts normalized counts raw counts normalized counts log 2(count + 1) 10 Samples Sample 1 Sample 2 log 2(count + 1) 10 Samples Sample 1 Sample rank(mean) gene expression rank(mean) gene expression edger: calcnormfactors(..., method = " upperquartile", p = 0.75) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

60 Method using gene lengths (intra & inter sample normalization) RPKM: Reads Per Kilobase per Million mapped Reads Assumptions: read counts are proportional to expression level, transcript length and sequencing depth s j = D jl g in which L g is gene length (bp). edger: rpkm(..., gene.length =...) Unbiaised estimation of number of reads but affect variability [Oshlack and Wakefield, 2009]. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

61 Relative Log Expression (RLE) [Anders and Huber, 2010], edger - DESeq - DESeq2 Method: 1 compute a pseudo-reference sample: geometric mean across samples N R g = K gj (geometric mean is less sensitive to extreme values than standard mean) j=1 1/N 15 log 2(count + 1) 10 5 Samples Sample 1 Sample rank(mean) gene expression IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

62 Relative Log Expression (RLE) [Anders and Huber, 2010], edger - DESeq - DESeq2 Method: 1 compute a pseudo-reference sample 2 center samples compared to the reference K gj = K gj R g with R g = N K gj j=1 1/N 4 count / geo. mean 3 2 Samples Sample 1 Sample rank(mean) gene expression IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

63 Relative Log Expression (RLE) [Anders and Huber, 2010], edger - DESeq - DESeq2 Method: 1 compute a pseudo-reference sample 2 center samples compared to the reference 3 calculate normalization factor: median of centered counts over the genes s j = median } g { Kgj factors multiply to 1: s j = ( s j 1 exp N N l=1 log( s l) ) count / geo. mean Samples Sample 1 Sample rank(mean) gene expression with and K gj = K gj R g N R g = K gj j=1 1/N IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

64 Relative Log Expression (RLE) [Anders and Huber, 2010], edger - DESeq - DESeq2 Method: 1 compute a pseudo-reference sample 2 center samples compared to the reference 3 calculate normalization factor: median of centered counts over the genes log 2(count + 1) Samples Sample 1 Sample 2 ## with edger calcnormfactors(..., method =" RLE") ## with DESeq estimatesizefactors (...) rank(mean) gene expression IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

65 Trimmed Mean of M-values (TMM) [Robinson and Oshlack, 2010], edger Assumptions behind the method the total read count strongly depends on a few highly expressed genes most genes are not differentially expressed IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

66 Trimmed Mean of M-values (TMM) [Robinson and Oshlack, 2010], edger Assumptions behind the method the total read count strongly depends on a few highly expressed genes most genes are not differentially expressed remove extreme data for fold-changed (M) and average intensity (A) ( ) ( ) Kgj Kgr M g (j, r) = log 2 log D 2 A g (j, r) = 1 [ ( ) ( )] Kgj Kgr log j D r log D 2 j D r select as a reference sample, the sample r with the upper quartile closest to the average upper quartile M(j,r) Trimmed values None M- vs A-values IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64 A(j,r)

67 Trimmed Mean of M-values (TMM) [Robinson and Oshlack, 2010], edger Assumptions behind the method the total read count strongly depends on a few highly expressed genes most genes are not differentially expressed remove extreme data for fold-changed (M) and average intensity (A) ( ) ( ) Kgj Kgr M g (j, r) = log 2 log D 2 A g (j, r) = 1 [ ( ) ( )] Kgj Kgr log j D r log D 2 j D r 2 1 Trim 30% on M-values M(j,r) 0 1 Trimmed values None 30% of Mg IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64 A(j,r)

68 Trimmed Mean of M-values (TMM) [Robinson and Oshlack, 2010], edger Assumptions behind the method the total read count strongly depends on a few highly expressed genes most genes are not differentially expressed remove extreme data for fold-changed (M) and average intensity (A) ( ) ( ) Kgj Kgr M g (j, r) = log 2 log D 2 A g (j, r) = 1 [ ( ) ( )] Kgj Kgr log j D r log D 2 j D r 2 1 Trim 5% on A-values M(j,r) 0 1 Trimmed values None 30% of Mg 5% of Ag IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64 A(j,r)

69 Trimmed Mean of M-values (TMM) [Robinson and Oshlack, 2010], edger Assumptions behind the method the total read count strongly depends on a few highly expressed genes most genes are not differentially expressed M(j,r) A(j,r) Trimmed values None 30% of Mg 5% of Ag On remaining data, calculate the weighted mean of M-values: w g (j, r)m g (j, r) TMM(j, r) = with w g (j, r) = g:not trimmed g:not trimmed ( Dj K gj D j K gj w g (j, r) + D r K gr D r K gr ). IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

70 Trimmed Mean of M-values (TMM) [Robinson and Oshlack, 2010], edger Assumptions behind the method the total read count strongly depends on a few highly expressed genes most genes are not differentially expressed Correction factors: s j = 2 TMM(j,r) factors multiply to 1: s j = ( s j 1 exp N N l=1 log( s l) ) calcnormfactors(..., method =" TMM") IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

71 Comparison of the different approaches [Dillies et al., 2013], (6 simulated datasets) Purpose of the comparison: finding the best method for all cases is not a realistic purpose find an approach which is robust enough to provide relevant results in all cases Method: comparison based on several criteria to select a method which is valid for multiple objectives IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

72 Comparison of the different approaches [Dillies et al., 2013], (6 simulated datasets) Effect on count distribution: RPKM and TC are very similar to raw data. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

73 Comparison of the different approaches [Dillies et al., 2013], (6 simulated datasets) Effect on differential analysis (DESeq v. 1.6): Inflated FPR for all methods except for TMM and DESeq (RLE). IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

74 Comparison of the different approaches [Dillies et al., 2013], (6 simulated datasets) Conclusion: Differences appear based on data characteristics TMM and DESeq (RLE) are performant in a differential analysis context. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

75 Outline 1 Exploratory analysis Introduction Experimental design Data exploration and quality assessment 2 Normalization Raw data filtering Interpreting read counts 3 Differential Expression analysis Hypothesis testing and correction for multiple tests Differential expression analysis for RNAseq data Interpreting and improving the analysis IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

76 Part IV: Differential expression analysis IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

77 Different steps in hypothesis testing 1 formulate an hypothesis H 0 : H 0 : the average count for gene g in the control samples is the same that the average count in the treated samples which is tested against an alternative H 1 : the average count for gene g in the control samples is different from the average count in the treated samples IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

78 Different steps in hypothesis testing 1 formulate an hypothesis H 0 : H 0 : the average count for gene g in the control samples is the same that the average count in the treated samples 2 from observations, calculate a test statistics (e.g., the mean in the two samples) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

79 Different steps in hypothesis testing 1 formulate an hypothesis H 0 : H 0 : the average count for gene g in the control samples is the same that the average count in the treated samples 2 from observations, calculate a test statistics (e.g., the mean in the two samples) 3 find the theoretical distribution of the test statistics under H 0 IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

80 Different steps in hypothesis testing 1 formulate an hypothesis H 0 : H 0 : the average count for gene g in the control samples is the same that the average count in the treated samples 2 from observations, calculate a test statistics (e.g., the mean in the two samples) 3 find the theoretical distribution of the test statistics under H 0 4 deduce the probability that the observations occur under H 0 : this is called the p-value IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

81 Different steps in hypothesis testing 1 formulate an hypothesis H 0 : H 0 : the average count for gene g in the control samples is the same that the average count in the treated samples 2 from observations, calculate a test statistics (e.g., the mean in the two samples) 3 find the theoretical distribution of the test statistics under H 0 4 deduce the probability that the observations occur under H 0 : this is called the p-value 5 conclude: if the p-value is low (usually below α = 5% as a convention), H 0 is unlikely: we say that H 0 is rejected. We have that: α = P H0 (H 0 is rejected). IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

82 Summary of the possible decisions Not reject H 0 Reject H 0 IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

83 Types of errors in tests Reality H 0 is true H 0 is false Decision Do not reject H 0 Correct decision Type II error (True Negative) (False Negative) Reject H 0 Type I error Correct decision (False Positive) (True Positive) P(Type I error) = α (risk) P(Type II error) = 1 β ( β: power) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

84 Why performing a large number of tests might be a problem? Framework: Suppose you are performing G tests at level α. P(at least one FP if H 0 is always true) = 1 (1 α) G Ex: for α = 5% and G = 20, P(at least one FP if H 0 is always true) 64%!!! IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

85 Why performing a large number of tests might be a problem? Framework: Suppose you are performing G tests at level α. P(at least one FP if H 0 is always true) = 1 (1 α) G Ex: for α = 5% and G = 20, P(at least one FP if H 0 is always true) 64%!!! Probability to have at least one false positive versus the number of tests performed when H 0 is true for all G tests 1.00 Probability For more than 75 tests and if H 0 is always true, the probability to have at least one false positive is very close to 100%! IG, AM, NV 2 (INRA) Number of tests Biostatistique RNA-seq Toulouse, 18/19 mai / 64

86 Notation for multiple tests Number of decisions for G independent tests: True null False null Total hypotheses hypotheses Rejected U V R Not rejected G 0 U G 1 V G R Total G 0 G 1 G IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

87 Notation for multiple tests Number of decisions for G independent tests: True null False null Total hypotheses hypotheses Rejected U V R Not rejected G 0 U G 1 V G R Total G 0 G 1 G Instead of the risk α, control: familywise error rate (FWER): FWER = P(U > 0) (i.e., probability to have at least one false positive decision) false discovery rate (FDR): FDR = E(Q) with U/R if R > 0 Q = 0 otherwise IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

88 Adjusted p-values Settings: p-values p 1,..., p G (e.g., corresponding to G tests on G different genes) Adjusted p-values adjusted p-values are p 1,..., p G such that Rejecting tests such that p g < α P(U > 0) α or E(Q) α IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

89 Adjusted p-values Settings: p-values p 1,..., p G (e.g., corresponding to G tests on G different genes) Adjusted p-values adjusted p-values are p 1,..., p G such that Rejecting tests such that p g < α P(U > 0) α or E(Q) α Calculating p-values 1 order the p-values p (1) p (2)... p (G) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

90 Adjusted p-values Settings: p-values p 1,..., p G (e.g., corresponding to G tests on G different genes) Adjusted p-values adjusted p-values are p 1,..., p G such that Rejecting tests such that p g < α P(U > 0) α or E(Q) α Calculating p-values 1 order the p-values p (1) p (2)... p (G) 2 calculate p (g) = a g p (g) with Bonferroni method: a g = G (FWER) with Benjamini & Hochberg method: a g = G/g (FDR) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

91 Adjusted p-values Settings: p-values p 1,..., p G (e.g., corresponding to G tests on G different genes) Adjusted p-values adjusted p-values are p 1,..., p G such that Rejecting tests such that p g < α P(U > 0) α or E(Q) α Calculating p-values 1 order the p-values p (1) p (2)... p (G) 2 calculate p (g) = a g p (g) with Bonferroni method: a g = G (FWER) with Benjamini & Hochberg method: a g = G/g (FDR) 3 if adjusted p-values p (g) are larger than 1, correct p (g) min{ p (g), 1} IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

92 Fisher s exact test for contingency tables After normalization, one may build a contingency table like this one: treated control Total gene g n ga n gb n g other genes N A n ga N B n gb N n g Total N A N B N Question: is the number of reads of gene g in the treated sample significatively different than in the control sample? IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

93 Fisher s exact test for contingency tables After normalization, one may build a contingency table like this one: treated control Total gene g n ga n gb n g other genes N A n ga N B n gb N n g Total N A N B N Question: is the number of reads of gene g in the treated sample significatively different than in the control sample? Method Direct calculation of the probability to obtain such a contingency table (or a more extreme contingency table) with: independency between the two columns of the contingency tables; the same marginals ( Total ). IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

94 Example of results obtained with the Fisher test Genes declared significantly differentially expressed are in pink: 6 3 log 2 fold change 0 3 Main remark: more conservative for genes with a low expression mean log 2 expression IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

95 Example of results obtained with the Fisher test Genes declared significantly differentially expressed are in pink: 6 3 log 2 fold change 0 3 Main remark: more conservative for genes with a low expression mean log 2 expression Limitation of Fisher test Highly expressed genes have a very large variance! As Fisher test does not estimate variance, it tends to detect false positives among highly expressed genes do not use it! IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

96 Basic principles of tests for count data: 2 conditions and replicates Notations: for gene g, K 1 g1,..., K 1 gn 1 (condition 1) and K 2 g1,..., K 2 gn 2 (condition 2) choose an appropriate distribution to model count data (discrete data, overdispersion) estimate its parameters for both conditions conclude by calculating p-value IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

97 Basic principles of tests for count data: 2 conditions and replicates Notations: for gene g, K 1 g1,..., K 1 gn 1 (condition 1) and K 2 g1,..., K 2 gn 2 (condition 2) choose an appropriate distribution to model count data (discrete data, overdispersion) K k gj NB(s k j λ gk, φ g ) in which: s k is library size of sample j in condition k j λ gk is the proportion of counts for gene g in condition k φ g is the dispersion of gene g (supposed to be identical for all samples) estimate its parameters for both conditions conclude by calculating p-value IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

98 Basic principles of tests for count data: 2 conditions and replicates Notations: for gene g, K 1 g1,..., K 1 gn 1 (condition 1) and K 2 g1,..., K 2 gn 2 (condition 2) choose an appropriate distribution to model count data (discrete data, overdispersion) K k gj NB(s k j λ gk, φ g ) in which: s k is library size of sample j in condition k j λ gk is the proportion of counts for gene g in condition k φ g is the dispersion of gene g (supposed to be identical for all samples) estimate its parameters for both conditions λ g1 λ g2 φ g conclude by calculating p-value IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

99 Basic principles of tests for count data: 2 conditions and replicates Notations: for gene g, K 1 g1,..., K 1 gn 1 (condition 1) and K 2 g1,..., K 2 gn 2 (condition 2) choose an appropriate distribution to model count data (discrete data, overdispersion) K k gj NB(s k j λ gk, φ g ) in which: s k is library size of sample j in condition k j λ gk is the proportion of counts for gene g in condition k φ g is the dispersion of gene g (supposed to be identical for all samples) estimate its parameters for both conditions λ g1 λ g2 φ g conclude by calculating p-value Test H0 : {λ g1 = λ g2 } IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

100 First method: Exact Negative Binomial test [Robinson and Smyth, 2008] Normalization is performed to get equal size librairies s K 1 g K 1 gn 1 NB(sλ g1, φ g /n 1 ) (and similarly for the second condition) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

101 First method: Exact Negative Binomial test [Robinson and Smyth, 2008] Normalization is performed to get equal size librairies s K 1 g K 1 gn 1 NB(sλ g1, φ g /n 1 ) (and similarly for the second condition) 1 λ g1 and λ g2 are estimated (mean of the distributions) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

102 First method: Exact Negative Binomial test [Robinson and Smyth, 2008] Normalization is performed to get equal size librairies s K 1 g K 1 gn 1 NB(sλ g1, φ g /n 1 ) (and similarly for the second condition) 1 λ g1 and λ g2 are estimated (mean of the distributions) 2 φ g is estimated independently of λ g1 and λ g2, using different approaches to account for small sample size IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

103 First method: Exact Negative Binomial test [Robinson and Smyth, 2008] Normalization is performed to get equal size librairies s K 1 g K 1 gn 1 NB(sλ g1, φ g /n 1 ) (and similarly for the second condition) 1 λ g1 and λ g2 are estimated (mean of the distributions) 2 φ g is estimated independently of λ g1 and λ g2, using different approaches to account for small sample size 3 The test is performed similarly as for Fisher test (exact probability calculation according to estimated paramters) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

104 Estimating the dispersion parameter φ g Some methods: DESeq, DESeq2: φ g is a smooth function of λ g = λ g1 = λ g2 dge <- estimatedispersion( dge) dispersion gene est fitted final 5e 01 5e+00 5e+01 5e+02 mean of normalized counts edger: estimate a common dispersion parameter for all genes and use it as a prior in a Bayesian approach to estimate a gene specific dispersion parameter dge <- estimatecommondisp( dge) dge <- estimatetagwisedisp( dge) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

105 Perform the test Some methods: DESeq, DESeq2: exact (DESeq) or approximate (Wald and LR in DESeq2) tests res <- nbinomwaldtest( dge) results( res) res <- nbinomlr( dge) results( res) edger: exact tests res <- exacttest( dge) toptags( res) (comparison between methods in [Zhang et al., 2014]) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

106 More complex experiments: GLM Framework: K gj NB(µ gj, φ g ) with log(µ gj ) = log(s j ) + log(λ gj ) in which: s j is the library size for sample j; IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

107 More complex experiments: GLM Framework: in which: K gj NB(µ gj, φ g ) with log(µ gj ) = log(s j ) + log(λ gj ) s j is the library size for sample j; log(λ gj ) is estimated (for instance) by a Generalized Linear Model (GLM): log(λ gj ) = λ 0 + x j β g in which x j is a vector of covariates. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

108 More complex experiments: GLM Framework: in which: K gj NB(µ gj, φ g ) with log(µ gj ) = log(s j ) + log(λ gj ) s j is the library size for sample j; log(λ gj ) is estimated (for instance) by a Generalized Linear Model (GLM): log(λ gj ) = λ 0 + x j β g in which x j is a vector of covariates. GLM allows to decompose the effects on the mean of different factors their interactions IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

109 More complex experiments: GLM in practice edger dge <- estimatedisp(dge, design) fit <- glmfit(dge, design) res <- glmrt(fit,...) toptags( res) DESeq, DESeq2 dge <- newcountdataset( counts, design) dge <- estimatesizefactors( dge) dge <- estimatedispersions( dge) fit <- fitnbinomglms(dge, count ~...) fit0 <- fitnbinomglms(dge, count ~ 1) res <- nbinomglmtest(fit, fit0) p. adjust(res, method = " BH") IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

110 Alternative approach: linear model for count data [Law et al., 2014], limma Basic idea: 1 data are transformed so that they are approximately normally distributed tcount <- voom( counts, design) 2 a linear (Gaussian) model is fitted (with a Bayesian approach to improve FDR [McCarthy and Smyth, 2009]): K gj N(µ gj, σ 2 g) with E( Kgj ) = β 0 + x j β g fit <- lmfit( tcount, design) fit <- ebayes( fit) toptables(fit,...) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

111 Part V: Interpreting differential analysis results IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

112 Overview of the results: MA-plot DESeq, DESeq2 plotma(..., main="deseq", ylim=c(-4,4)) plotma(..., main="deseq2", ylim=c(-4,4)) DESeq DESeq2 log 2 fold change log fold change e 01 5e+00 5e+01 5e+02 5e+03 mean of normalized counts 5e 01 5e+00 5e+01 5e+02 5e+03 mean expression (the last one includes a prior on log 2 fold change which results in more moderated estimates for low count genes) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

113 Overview of the results: MA-plot edger plotsmear(..., de.tags =...) logfc Average logcpm IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

114 Fold change and p-value: the Volcano plot p-value versus fold change (both log scaled) scatterplot. Significant genes are in red: 2 fold + 2 fold log 10 pvalue pval = log 2 fold change IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

115 Gene clustering Prior clustering: transform data to obtain counts with similar variance DESeq, DESeq2 variancestabilizingtransformation (...) DESeq2 rlog(...) edger cpm(..., prior. count =2, log=true) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

116 Gene clustering On transformed data, use e.g., heatmap: Color key B_3 B_2 B_1 A_3 Samples A_2 A_1 gene495 gene816 gene723 gene888 gene202 gene843 gene828 gene837 gene841 gene295 gene970 gene883 gene423 gene222 gene844 gene174 gene704 gene253 gene275 gene751 gene433 gene66 gene78 gene793 gene533 gene596 gene935 gene721 gene978 gene201 gene76 gene20 gene437 gene672 gene39 gene215 gene957 gene242 gene532 gene852 gene737 gene711 gene535 gene305 gene718 gene774 gene322 gene538 gene520 gene283 gene886 gene61 gene166 gene147 gene303 gene160 gene584 gene442 gene947 gene203 gene397 gene355 gene229 gene586 gene728 gene990 gene59 gene878 gene833 gene785 gene337 gene759 gene367 gene782 gene566 gene470 gene68 gene929 gene967 gene545 Genes which is useful to visualize which genes are over/under-expressed in one condition. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

117 Standard property of usual DE analyses DESeq log 2 fold change e 01 5e+00 5e+01 5e+02 5e+03 mean of normalized counts Remark: low read counts have a too large variance to be found differentially expressed. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

118 Standard property of usual DE analyses DESeq log 2 fold change e 01 5e+00 5e+01 5e+02 5e+03 mean of normalized counts Remark: low read counts have a too large variance to be found differentially expressed. Consequence: filtering out these genes before the DE analysis because it improves the power of the test because of multiple test correction. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

119 Example Filtering out the 40% genes that have the lowest overall counts does not affect much low p-values: frequency pass do not pass 0 1 but leads to find new DE genes that were previously discarded by multiple test correction. IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

120 Filtering in practice HTSFilter cdsfilt <- HTSFilter(..., plot=false)$ filtereddata res <- nbinomtest( cdsfilt) log 2 fold change e 01 5e+00 5e+01 5e+02 5e+03 mean of normalized counts IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

121 In summary... (with edger) preparation of the design of the experiment sequencing count data exploratory analysis (hist, boxplot...) creating an R object with count data (DGEList) a DGEList object normalization (calcnormfactors) a DGEList object with normalization factors fitting the model (estimatedisp) a DGEList object with dispersion estimates filtering low count genes (HTSFilter) a DGEList object without filtering test (exacttest or glmfit/glmlrt) a DGEExact or DGELRT object exploratory analysis (toptags, plotsmear...) IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse, 18/19 mai / 64

Normalization and differential analysis of RNA-seq data

Normalization and differential analysis of RNA-seq data Normalization and differential analysis of RNA-seq data Nathalie Villa-Vialaneix INRA, Toulouse, MIAT (Mathématiques et Informatique Appliquées de Toulouse) nathalie.villa@toulouse.inra.fr http://www.nathalievilla.org

More information

Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA

Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA Expression analysis for RNA-seq data Ewa Szczurek Instytut Informatyki Uniwersytet Warszawski 1/35 The problem

More information

RNASeq Differential Expression

RNASeq Differential Expression 12/06/2014 RNASeq Differential Expression Le Corguillé v1.01 1 Introduction RNASeq No previous genomic sequence information is needed In RNA-seq the expression signal of a transcript is limited by the

More information

Lecture 3: Mixture Models for Microbiome data. Lecture 3: Mixture Models for Microbiome data

Lecture 3: Mixture Models for Microbiome data. Lecture 3: Mixture Models for Microbiome data Lecture 3: Mixture Models for Microbiome data 1 Lecture 3: Mixture Models for Microbiome data Outline: - Mixture Models (Negative Binomial) - DESeq2 / Don t Rarefy. Ever. 2 Hypothesis Tests - reminder

More information

Lecture: Mixture Models for Microbiome data

Lecture: Mixture Models for Microbiome data Lecture: Mixture Models for Microbiome data Lecture 3: Mixture Models for Microbiome data Outline: - - Sequencing thought experiment Mixture Models (tangent) - (esp. Negative Binomial) - Differential abundance

More information

Comparative analysis of RNA- Seq data with DESeq2

Comparative analysis of RNA- Seq data with DESeq2 Comparative analysis of RNA- Seq data with DESeq2 Simon Anders EMBL Heidelberg Two applications of RNA- Seq Discovery Eind new transcripts Eind transcript boundaries Eind splice junctions Comparison Given

More information

RNA-seq. Differential analysis

RNA-seq. Differential analysis RNA-seq Differential analysis DESeq2 DESeq2 http://bioconductor.org/packages/release/bioc/vignettes/deseq 2/inst/doc/DESeq2.html Input data Why un-normalized counts? As input, the DESeq2 package expects

More information

ABSSeq: a new RNA-Seq analysis method based on modelling absolute expression differences

ABSSeq: a new RNA-Seq analysis method based on modelling absolute expression differences ABSSeq: a new RNA-Seq analysis method based on modelling absolute expression differences Wentao Yang October 30, 2018 1 Introduction This vignette is intended to give a brief introduction of the ABSSeq

More information

High-Throughput Sequencing Course

High-Throughput Sequencing Course High-Throughput Sequencing Course DESeq Model for RNA-Seq Biostatistics and Bioinformatics Summer 2017 Outline Review: Standard linear regression model (e.g., to model gene expression as function of an

More information

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data DEGseq: an R package for identifying differentially expressed genes from RNA-seq data Likun Wang Zhixing Feng i Wang iaowo Wang * and uegong Zhang * MOE Key Laboratory of Bioinformatics and Bioinformatics

More information

Statistical tests for differential expression in count data (1)

Statistical tests for differential expression in count data (1) Statistical tests for differential expression in count data (1) NBIC Advanced RNA-seq course 25-26 August 2011 Academic Medical Center, Amsterdam The analysis of a microarray experiment Pre-process image

More information

Statistics for Differential Expression in Sequencing Studies. Naomi Altman

Statistics for Differential Expression in Sequencing Studies. Naomi Altman Statistics for Differential Expression in Sequencing Studies Naomi Altman naomi@stat.psu.edu Outline Preliminaries what you need to do before the DE analysis Stat Background what you need to know to understand

More information

Introduc)on to RNA- Seq Data Analysis. Dr. Benilton S Carvalho Department of Medical Gene)cs Faculty of Medical Sciences State University of Campinas

Introduc)on to RNA- Seq Data Analysis. Dr. Benilton S Carvalho Department of Medical Gene)cs Faculty of Medical Sciences State University of Campinas Introduc)on to RNA- Seq Data Analysis Dr. Benilton S Carvalho Department of Medical Gene)cs Faculty of Medical Sciences State University of Campinas Material: hep://)ny.cc/rnaseq Slides: hep://)ny.cc/slidesrnaseq

More information

g A n(a, g) n(a, ḡ) = n(a) n(a, g) n(a) B n(b, g) n(a, ḡ) = n(b) n(b, g) n(b) g A,B A, B 2 RNA-seq (D) RNA mrna [3] RNA 2. 2 NGS 2 A, B NGS n(

g A n(a, g) n(a, ḡ) = n(a) n(a, g) n(a) B n(b, g) n(a, ḡ) = n(b) n(b, g) n(b) g A,B A, B 2 RNA-seq (D) RNA mrna [3] RNA 2. 2 NGS 2 A, B NGS n( ,a) RNA-seq RNA-seq Cuffdiff, edger, DESeq Sese Jun,a) Abstract: Frequently used biological experiment technique for observing comprehensive gene expression has been changed from microarray using cdna

More information

Dispersion modeling for RNAseq differential analysis

Dispersion modeling for RNAseq differential analysis Dispersion modeling for RNAseq differential analysis E. Bonafede 1, F. Picard 2, S. Robin 3, C. Viroli 1 ( 1 ) univ. Bologna, ( 3 ) CNRS/univ. Lyon I, ( 3 ) INRA/AgroParisTech, Paris IBC, Victoria, July

More information

DEXSeq paper discussion

DEXSeq paper discussion DEXSeq paper discussion L Collado-Torres December 10th, 2012 1 / 23 1 Background 2 DEXSeq paper 3 Results 2 / 23 Gene Expression 1 Background 1 Source: http://www.ncbi.nlm.nih.gov/projects/genome/probe/doc/applexpression.shtml

More information

Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data

Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data Cinzia Viroli 1 joint with E. Bonafede 1, S. Robin 2 & F. Picard 3 1 Department of Statistical Sciences, University

More information

Differential expression analysis for sequencing count data. Simon Anders

Differential expression analysis for sequencing count data. Simon Anders Differential expression analysis for sequencing count data Simon Anders RNA-Seq Count data in HTS RNA-Seq Tag-Seq Gene 13CDNA73 A2BP1 A2M A4GALT AAAS AACS AADACL1 [...] ChIP-Seq Bar-Seq... GliNS1 4 19

More information

Advanced Statistical Methods: Beyond Linear Regression

Advanced Statistical Methods: Beyond Linear Regression Advanced Statistical Methods: Beyond Linear Regression John R. Stevens Utah State University Notes 3. Statistical Methods II Mathematics Educators Worshop 28 March 2009 1 http://www.stat.usu.edu/~jrstevens/pcmi

More information

Multiple hypothesis testing and RNA-seq differential expression analysis accounting for dependence and relevant covariates

Multiple hypothesis testing and RNA-seq differential expression analysis accounting for dependence and relevant covariates Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 218 Multiple hypothesis testing and RNA-seq differential expression analysis accounting for dependence and relevant

More information

Lesson 11. Functional Genomics I: Microarray Analysis

Lesson 11. Functional Genomics I: Microarray Analysis Lesson 11 Functional Genomics I: Microarray Analysis Transcription of DNA and translation of RNA vary with biological conditions 3 kinds of microarray platforms Spotted Array - 2 color - Pat Brown (Stanford)

More information

Looking at the Other Side of Bonferroni

Looking at the Other Side of Bonferroni Department of Biostatistics University of Washington 24 May 2012 Multiple Testing: Control the Type I Error Rate When analyzing genetic data, one will commonly perform over 1 million (and growing) hypothesis

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

Non-specific filtering and control of false positives

Non-specific filtering and control of false positives Non-specific filtering and control of false positives Richard Bourgon 16 June 2009 bourgon@ebi.ac.uk EBI is an outstation of the European Molecular Biology Laboratory Outline Multiple testing I: overview

More information

SPH 247 Statistical Analysis of Laboratory Data. April 28, 2015 SPH 247 Statistics for Laboratory Data 1

SPH 247 Statistical Analysis of Laboratory Data. April 28, 2015 SPH 247 Statistics for Laboratory Data 1 SPH 247 Statistical Analysis of Laboratory Data April 28, 2015 SPH 247 Statistics for Laboratory Data 1 Outline RNA-Seq for differential expression analysis Statistical methods for RNA-Seq: Structure and

More information

Statistical methods for estimation, testing, and clustering with gene expression data

Statistical methods for estimation, testing, and clustering with gene expression data Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2017 Statistical methods for estimation, testing, and clustering with gene expression data Andrew Lithio Iowa

More information

Testing High-Dimensional Count (RNA-Seq) Data for Differential Expression

Testing High-Dimensional Count (RNA-Seq) Data for Differential Expression Testing High-Dimensional Count (RNA-Seq) Data for Differential Expression Utah State University Fall 2017 Statistical Bioinformatics (Biomedical Big Data) Notes 6 1 References Anders & Huber (2010), Differential

More information

David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis

David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis March 18, 2016 UVA Seminar RNA Seq 1 RNA Seq Gene expression is the transcription of the

More information

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018 High-Throughput Sequencing Course Multiple Testing Biostatistics and Bioinformatics Summer 2018 Introduction You have previously considered the significance of a single gene Introduction You have previously

More information

Genetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig

Genetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig Genetic Networks Korbinian Strimmer IMISE, Universität Leipzig Seminar: Statistical Analysis of RNA-Seq Data 19 June 2012 Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 1 Paper G. I. Allen and Z. Liu.

More information

Statistical challenges in RNA-Seq data analysis

Statistical challenges in RNA-Seq data analysis Statistical challenges in RNA-Seq data analysis Julie Aubert UMR 518 AgroParisTech-INRA Mathématiques et Informatique Appliquées Ecole de bioinformatique, Station biologique de Roscoff, 2013 Nov. 18 J.

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

Linear Models and Empirical Bayes Methods for. Assessing Differential Expression in Microarray Experiments

Linear Models and Empirical Bayes Methods for. Assessing Differential Expression in Microarray Experiments Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments by Gordon K. Smyth (as interpreted by Aaron J. Baraff) STAT 572 Intro Talk April 10, 2014 Microarray

More information

*Equal contribution Contact: (TT) 1 Department of Biomedical Engineering, the Engineering Faculty, Tel Aviv

*Equal contribution Contact: (TT) 1 Department of Biomedical Engineering, the Engineering Faculty, Tel Aviv Supplementary of Complementary Post Transcriptional Regulatory Information is Detected by PUNCH-P and Ribosome Profiling Hadas Zur*,1, Ranen Aviner*,2, Tamir Tuller 1,3 1 Department of Biomedical Engineering,

More information

Normalization of metagenomic data A comprehensive evaluation of existing methods

Normalization of metagenomic data A comprehensive evaluation of existing methods MASTER S THESIS Normalization of metagenomic data A comprehensive evaluation of existing methods MIKAEL WALLROTH Department of Mathematical Sciences CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG

More information

BTRY 7210: Topics in Quantitative Genomics and Genetics

BTRY 7210: Topics in Quantitative Genomics and Genetics BTRY 7210: Topics in Quantitative Genomics and Genetics Jason Mezey Biological Statistics and Computational Biology (BSCB) Department of Genetic Medicine jgm45@cornell.edu February 12, 2015 Lecture 3:

More information

Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. T=number of type 2 errors

Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. T=number of type 2 errors The Multiple Testing Problem Multiple Testing Methods for the Analysis of Microarray Data 3/9/2009 Copyright 2009 Dan Nettleton Suppose one test of interest has been conducted for each of m genes in a

More information

Exam: high-dimensional data analysis January 20, 2014

Exam: high-dimensional data analysis January 20, 2014 Exam: high-dimensional data analysis January 20, 204 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question not the subquestions on a separate piece of paper. - Finish

More information

Normalization, testing, and false discovery rate estimation for RNA-sequencing data

Normalization, testing, and false discovery rate estimation for RNA-sequencing data Biostatistics Advance Access published October 14, 2011 Biostatistics (2011), 0, 0, pp. 1 16 doi:10.1093/biostatistics/kxr031 Normalization, testing, and false discovery rate estimation for RNA-sequencing

More information

Gene Expression an Overview of Problems & Solutions: 3&4. Utah State University Bioinformatics: Problems and Solutions Summer 2006

Gene Expression an Overview of Problems & Solutions: 3&4. Utah State University Bioinformatics: Problems and Solutions Summer 2006 Gene Expression an Overview of Problems & Solutions: 3&4 Utah State University Bioinformatics: Problems and Solutions Summer 006 Review Considering several problems & solutions with gene expression data

More information

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University Multiple Testing Hoang Tran Department of Statistics, Florida State University Large-Scale Testing Examples: Microarray data: testing differences in gene expression between two traits/conditions Microbiome

More information

EBSeq: An R package for differential expression analysis using RNA-seq data

EBSeq: An R package for differential expression analysis using RNA-seq data EBSeq: An R package for differential expression analysis using RNA-seq data Ning Leng, John Dawson, and Christina Kendziorski October 14, 2013 Contents 1 Introduction 2 2 Citing this software 2 3 The Model

More information

Unit-free and robust detection of differential expression from RNA-Seq data

Unit-free and robust detection of differential expression from RNA-Seq data Unit-free and robust detection of differential expression from RNA-Seq data arxiv:405.4538v [stat.me] 8 May 204 Hui Jiang,2,* Department of Biostatistics, University of Michigan 2 Center for Computational

More information

Statistical testing. Samantha Kleinberg. October 20, 2009

Statistical testing. Samantha Kleinberg. October 20, 2009 October 20, 2009 Intro to significance testing Significance testing and bioinformatics Gene expression: Frequently have microarray data for some group of subjects with/without the disease. Want to find

More information

Unlocking RNA-seq tools for zero inflation and single cell applications using observation weights

Unlocking RNA-seq tools for zero inflation and single cell applications using observation weights Unlocking RNA-seq tools for zero inflation and single cell applications using observation weights Koen Van den Berge, Ghent University Statistical Genomics, 2018-2019 1 The team Koen Van den Berge Fanny

More information

arxiv: v1 [stat.me] 1 Dec 2015

arxiv: v1 [stat.me] 1 Dec 2015 Bayesian Estimation of Negative Binomial Parameters with Applications to RNA-Seq Data arxiv:1512.00475v1 [stat.me] 1 Dec 2015 Luis León-Novelo Claudio Fuentes Sarah Emerson UT Health Science Center Oregon

More information

cdna Microarray Analysis

cdna Microarray Analysis cdna Microarray Analysis with BioConductor packages Nolwenn Le Meur Copyright 2007 Outline Data acquisition Pre-processing Quality assessment Pre-processing background correction normalization summarization

More information

Tools and topics for microarray analysis

Tools and topics for microarray analysis Tools and topics for microarray analysis USSES Conference, Blowing Rock, North Carolina, June, 2005 Jason A. Osborne, osborne@stat.ncsu.edu Department of Statistics, North Carolina State University 1 Outline

More information

Practical Statistics for the Analytical Scientist Table of Contents

Practical Statistics for the Analytical Scientist Table of Contents Practical Statistics for the Analytical Scientist Table of Contents Chapter 1 Introduction - Choosing the Correct Statistics 1.1 Introduction 1.2 Choosing the Right Statistical Procedures 1.2.1 Planning

More information

using Bayesian hierarchical model

using Bayesian hierarchical model Biomarker detection and categorization in RNA-seq meta-analysis using Bayesian hierarchical model Tianzhou Ma Department of Biostatistics University of Pittsburgh, Pittsburgh, PA 15261 email: tim28@pitt.edu

More information

High-throughput Testing

High-throughput Testing High-throughput Testing Noah Simon and Richard Simon July 2016 1 / 29 Testing vs Prediction On each of n patients measure y i - single binary outcome (eg. progression after a year, PCR) x i - p-vector

More information

ChIP-seq analysis M. Defrance, C. Herrmann, S. Le Gras, D. Puthier, M. Thomas.Chollier

ChIP-seq analysis M. Defrance, C. Herrmann, S. Le Gras, D. Puthier, M. Thomas.Chollier ChIP-seq analysis M. Defrance, C. Herrmann, S. Le Gras, D. Puthier, M. Thomas.Chollier Data visualization, quality control, normalization & peak calling Peak annotation Presentation () Practical session

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Multiple testing: Intro & FWER 1

Multiple testing: Intro & FWER 1 Multiple testing: Intro & FWER 1 Mark van de Wiel mark.vdwiel@vumc.nl Dep of Epidemiology & Biostatistics,VUmc, Amsterdam Dep of Mathematics, VU 1 Some slides courtesy of Jelle Goeman 1 Practical notes

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 5, Issue 1 2006 Article 28 A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments Hongmei Jiang Rebecca

More information

Some General Types of Tests

Some General Types of Tests Some General Types of Tests We may not be able to find a UMP or UMPU test in a given situation. In that case, we may use test of some general class of tests that often have good asymptotic properties.

More information

Sociology 6Z03 Review II

Sociology 6Z03 Review II Sociology 6Z03 Review II John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review II Fall 2016 1 / 35 Outline: Review II Probability Part I Sampling Distributions Probability

More information

NBLDA: negative binomial linear discriminant analysis for RNA-Seq data

NBLDA: negative binomial linear discriminant analysis for RNA-Seq data Dong et al. BMC Bioinformatics (2016) 17:369 DOI 10.1186/s12859-016-1208-1 RESEARCH ARTICLE Open Access NBLDA: negative binomial linear discriminant analysis for RNA-Seq data Kai Dong 1,HongyuZhao 2,TiejunTong

More information

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or

More information

Generalized Linear Models in R

Generalized Linear Models in R Generalized Linear Models in R NO ORDER Kenneth K. Lopiano, Garvesh Raskutti, Dan Yang last modified 28 4 2013 1 Outline 1. Background and preliminaries 2. Data manipulation and exercises 3. Data structures

More information

A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data

A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data Faming Liang, Chuanhai Liu, and Naisyin Wang Texas A&M University Multiple Hypothesis Testing Introduction

More information

scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017

scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017 scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017 Olga (NBIS) scrna-seq de October 2017 1 / 34 Outline Introduction: what

More information

Confounder Adjustment in Multiple Hypothesis Testing

Confounder Adjustment in Multiple Hypothesis Testing in Multiple Hypothesis Testing Department of Statistics, Stanford University January 28, 2016 Slides are available at http://web.stanford.edu/~qyzhao/. Collaborators Jingshu Wang Trevor Hastie Art Owen

More information

ChIP-seq analysis M. Defrance, C. Herrmann, S. Le Gras, D. Puthier, M. Thomas.Chollier

ChIP-seq analysis M. Defrance, C. Herrmann, S. Le Gras, D. Puthier, M. Thomas.Chollier ChIP-seq analysis M. Defrance, C. Herrmann, S. Le Gras, D. Puthier, M. Thomas.Chollier Visualization, quality, normalization & peak-calling Presentation (Carl Herrmann) Practical session Peak annotation

More information

Low-Level Analysis of High- Density Oligonucleotide Microarray Data

Low-Level Analysis of High- Density Oligonucleotide Microarray Data Low-Level Analysis of High- Density Oligonucleotide Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley UC Berkeley Feb 23, 2004 Outline

More information

The Difference in Proportions Test

The Difference in Proportions Test Overview The Difference in Proportions Test Dr Tom Ilvento Department of Food and Resource Economics A Difference of Proportions test is based on large sample only Same strategy as for the mean We calculate

More information

Bias in RNA sequencing and what to do about it

Bias in RNA sequencing and what to do about it Bias in RNA sequencing and what to do about it Walter L. (Larry) Ruzzo Computer Science and Engineering Genome Sciences University of Washington Fred Hutchinson Cancer Research Center Seattle, WA, USA

More information

Biochip informatics-(i)

Biochip informatics-(i) Biochip informatics-(i) : biochip normalization & differential expression Ju Han Kim, M.D., Ph.D. SNUBI: SNUBiomedical Informatics http://www.snubi snubi.org/ Biochip Informatics - (I) Biochip basics Preprocessing

More information

Package aspi. R topics documented: September 20, 2016

Package aspi. R topics documented: September 20, 2016 Type Package Title Analysis of Symmetry of Parasitic Infections Version 0.2.0 Date 2016-09-18 Author Matt Wayland Maintainer Matt Wayland Package aspi September 20, 2016 Tools for the

More information

Descriptive Data Summarization

Descriptive Data Summarization Descriptive Data Summarization Descriptive data summarization gives the general characteristics of the data and identify the presence of noise or outliers, which is useful for successful data cleaning

More information

False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data

False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data Ståle Nygård Trial Lecture Dec 19, 2008 1 / 35 Lecture outline Motivation for not using

More information

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Statistics Journal Club, 36-825 Beau Dabbs and Philipp Burckhardt 9-19-2014 1 Paper

More information

Genome 541! Unit 4, lecture 2! Transcription factor binding using functional genomics

Genome 541! Unit 4, lecture 2! Transcription factor binding using functional genomics Genome 541 Unit 4, lecture 2 Transcription factor binding using functional genomics Slides vs chalk talk: I m not sure why you chose a chalk talk over ppt. I prefer the latter no issues with readability

More information

Rigorous Evaluation R.I.T. Analysis and Reporting. Structure is from A Practical Guide to Usability Testing by J. Dumas, J. Redish

Rigorous Evaluation R.I.T. Analysis and Reporting. Structure is from A Practical Guide to Usability Testing by J. Dumas, J. Redish Rigorous Evaluation Analysis and Reporting Structure is from A Practical Guide to Usability Testing by J. Dumas, J. Redish S. Ludi/R. Kuehl p. 1 Summarize and Analyze Test Data Qualitative data - comments,

More information

Differential Expression with RNA-seq: Technical Details

Differential Expression with RNA-seq: Technical Details Differential Expression with RNA-seq: Technical Details Lieven Clement Ghent University, Belgium Statistical Genomics: Master of Science in Bioinformatics TWIST, Krijgslaan 281 (S9), Gent, Belgium e-mail:

More information

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new

More information

Prentice Hall Stats: Modeling the World 2004 (Bock) Correlated to: National Advanced Placement (AP) Statistics Course Outline (Grades 9-12)

Prentice Hall Stats: Modeling the World 2004 (Bock) Correlated to: National Advanced Placement (AP) Statistics Course Outline (Grades 9-12) National Advanced Placement (AP) Statistics Course Outline (Grades 9-12) Following is an outline of the major topics covered by the AP Statistics Examination. The ordering here is intended to define the

More information

Chapter 1 - Lecture 3 Measures of Location

Chapter 1 - Lecture 3 Measures of Location Chapter 1 - Lecture 3 of Location August 31st, 2009 Chapter 1 - Lecture 3 of Location General Types of measures Median Skewness Chapter 1 - Lecture 3 of Location Outline General Types of measures What

More information

The entire data set consists of n = 32 widgets, 8 of which were made from each of q = 4 different materials.

The entire data set consists of n = 32 widgets, 8 of which were made from each of q = 4 different materials. One-Way ANOVA Summary The One-Way ANOVA procedure is designed to construct a statistical model describing the impact of a single categorical factor X on a dependent variable Y. Tests are run to determine

More information

Introductory Econometrics. Review of statistics (Part II: Inference)

Introductory Econometrics. Review of statistics (Part II: Inference) Introductory Econometrics Review of statistics (Part II: Inference) Jun Ma School of Economics Renmin University of China October 1, 2018 1/16 Null and alternative hypotheses Usually, we have two competing

More information

Supplementary Figure 1 The number of differentially expressed genes for uniparental males (green), uniparental females (yellow), biparental males

Supplementary Figure 1 The number of differentially expressed genes for uniparental males (green), uniparental females (yellow), biparental males Supplementary Figure 1 The number of differentially expressed genes for males (green), females (yellow), males (red), and females (blue) in caring vs. control comparisons in the caring gene set and the

More information

Exam: high-dimensional data analysis February 28, 2014

Exam: high-dimensional data analysis February 28, 2014 Exam: high-dimensional data analysis February 28, 2014 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question (not the subquestions) on a separate piece of paper.

More information

REPRODUCIBLE ANALYSIS OF HIGH-THROUGHPUT EXPERIMENTS

REPRODUCIBLE ANALYSIS OF HIGH-THROUGHPUT EXPERIMENTS REPRODUCIBLE ANALYSIS OF HIGH-THROUGHPUT EXPERIMENTS Ying Liu Department of Biostatistics, Columbia University Summer Intern at Research and CMC Biostats, Sanofi, Boston August 26, 2015 OUTLINE 1 Introduction

More information

ChIP seq peak calling. Statistical integration between ChIP seq and RNA seq

ChIP seq peak calling. Statistical integration between ChIP seq and RNA seq Institute for Computational Biomedicine ChIP seq peak calling Statistical integration between ChIP seq and RNA seq Olivier Elemento, PhD ChIP-seq to map where transcription factors bind DNA Transcription

More information

Our typical RNA quantification pipeline

Our typical RNA quantification pipeline RNA-Seq primer Our typical RNA quantification pipeline Upload your sequence data (fastq) Align to the ribosome (Bow>e) Align remaining reads to genome (TopHat) or transcriptome (RSEM) Make report of quality

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Parametric Empirical Bayes Methods for Microarrays

Parametric Empirical Bayes Methods for Microarrays Parametric Empirical Bayes Methods for Microarrays Ming Yuan, Deepayan Sarkar, Michael Newton and Christina Kendziorski April 30, 2018 Contents 1 Introduction 1 2 General Model Structure: Two Conditions

More information

Concepts and Applications of Kriging. Eric Krause Konstantin Krivoruchko

Concepts and Applications of Kriging. Eric Krause Konstantin Krivoruchko Concepts and Applications of Kriging Eric Krause Konstantin Krivoruchko Outline Introduction to interpolation Exploratory spatial data analysis (ESDA) Using the Geostatistical Wizard Validating interpolation

More information

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Hypothesis testing Machine Learning CSE546 Kevin Jamieson University of Washington October 30, 2018 2018 Kevin Jamieson 2 Anomaly detection You are

More information

Generalized Linear Models (1/29/13)

Generalized Linear Models (1/29/13) STA613/CBB540: Statistical methods in computational biology Generalized Linear Models (1/29/13) Lecturer: Barbara Engelhardt Scribe: Yangxiaolu Cao When processing discrete data, two commonly used probability

More information

Statistics Applied to Bioinformatics. Tests of homogeneity

Statistics Applied to Bioinformatics. Tests of homogeneity Statistics Applied to Bioinformatics Tests of homogeneity Two-tailed test of homogeneity Two-tailed test H 0 :m = m Principle of the test Estimate the difference between m and m Compare this estimation

More information

Hunting for significance with multiple testing

Hunting for significance with multiple testing Hunting for significance with multiple testing Etienne Roquain 1 1 Laboratory LPMA, Université Pierre et Marie Curie (Paris 6), France Séminaire MODAL X, 19 mai 216 Etienne Roquain Hunting for significance

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key Statistical Methods III Statistics 212 Problem Set 2 - Answer Key 1. (Analysis to be turned in and discussed on Tuesday, April 24th) The data for this problem are taken from long-term followup of 1423

More information

Alpha-Investing. Sequential Control of Expected False Discoveries

Alpha-Investing. Sequential Control of Expected False Discoveries Alpha-Investing Sequential Control of Expected False Discoveries Dean Foster Bob Stine Department of Statistics Wharton School of the University of Pennsylvania www-stat.wharton.upenn.edu/ stine Joint

More information

Dover- Sherborn High School Mathematics Curriculum Probability and Statistics

Dover- Sherborn High School Mathematics Curriculum Probability and Statistics Mathematics Curriculum A. DESCRIPTION This is a full year courses designed to introduce students to the basic elements of statistics and probability. Emphasis is placed on understanding terminology and

More information

Probabilistic Inference for Multiple Testing

Probabilistic Inference for Multiple Testing This is the title page! This is the title page! Probabilistic Inference for Multiple Testing Chuanhai Liu and Jun Xie Department of Statistics, Purdue University, West Lafayette, IN 47907. E-mail: chuanhai,

More information

Concepts and Applications of Kriging. Eric Krause

Concepts and Applications of Kriging. Eric Krause Concepts and Applications of Kriging Eric Krause Sessions of note Tuesday ArcGIS Geostatistical Analyst - An Introduction 8:30-9:45 Room 14 A Concepts and Applications of Kriging 10:15-11:30 Room 15 A

More information

Exploratory statistical analysis of multi-species time course gene expression

Exploratory statistical analysis of multi-species time course gene expression Exploratory statistical analysis of multi-species time course gene expression data Eng, Kevin H. University of Wisconsin, Department of Statistics 1300 University Avenue, Madison, WI 53706, USA. E-mail:

More information