ALDEx: ANOVA-Like Differential Gene Expression Analysis of Single-Organism and Meta-RNA-Seq

Similar documents
DEGseq: an R package for identifying differentially expressed genes from RNA-seq data

Statistics for Differential Expression in Sequencing Studies. Naomi Altman

ABSSeq: a new RNA-Seq analysis method based on modelling absolute expression differences

RNA-seq. Differential analysis

A Brief Introduction To. GRTensor. On MAPLE Platform. A write-up for the presentation delivered on the same topic as a part of the course PHYS 601

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data

Phylogenetic analyses. Kirsi Kostamo

Using the EartH2Observe data portal to analyse drought indicators. Lesson 4: Using Python Notebook to access and process data

Reaxys Pipeline Pilot Components Installation and User Guide

Unlocking RNA-seq tools for zero inflation and single cell applications using observation weights

Overview - MS Proteomics in One Slide. MS masses of peptides. MS/MS fragments of a peptide. Results! Match to sequence database

ncounter PlexSet Data Analysis Guidelines

M E R C E R W I N WA L K T H R O U G H

PathNet: A tool for finding pathway enrichment and pathway cross-talk using topological information and gene expression data

Introduction to analyzing NanoString ncounter data using the NanoStringNormCNV package

Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA

scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017

MACAU 2.0 User Manual

Information-Based Similarity Index

Manual: R package HTSmix

ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University

Esri UC2013. Technical Workshop.

Introduction to Statistics and R

Bloomsburg University Weather Viewer Quick Start Guide. Software Version 1.2 Date 4/7/2014

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week:

6.867 Machine Learning

Lecture 3: Mixture Models for Microbiome data. Lecture 3: Mixture Models for Microbiome data

FRANKLIN UNIVERSITY PROFICIENCY EXAM (FUPE) STUDY GUIDE

From Practical Data Analysis with JMP, Second Edition. Full book available for purchase here. About This Book... xiii About The Author...

Dover- Sherborn High School Mathematics Curriculum Probability and Statistics

High-Throughput Sequencing Course

Parametric Empirical Bayes Methods for Microarrays

Package clonotyper. October 11, 2018

Genome Assembly. Sequencing Output. High Throughput Sequencing

SuperCELL Data Programmer and ACTiSys IR Programmer User s Guide

Differential expression analysis for sequencing count data. Simon Anders

Date: Summer Stem Section:

ISSP User Guide CY3207ISSP. Revision C

ST-Links. SpatialKit. Version 3.0.x. For ArcMap. ArcMap Extension for Directly Connecting to Spatial Databases. ST-Links Corporation.

Lecture: Mixture Models for Microbiome data

Introduction to Occupancy Models. Jan 8, 2016 AEC 501 Nathan J. Hostetter

1 Introduction to Minitab

Statistical analysis of genomic binding sites using high-throughput ChIP-seq data

Spatial Analysis using Vector GIS THE GOAL: PREPARATION:

Data Aggregation with InfraWorks and ArcGIS for Visualization, Analysis, and Planning

R: A Quick Reference

Climate Change Indices By: I Putu Santikayasa

FireFamilyPlus Version 5.0

Introduction to R, Part I

Why Bayesian approaches? The average height of a rare plant

Spectral energy distribution fitting with MagPhys package

A GUI FOR EVOLVE ZAMS

let s examine pupation rates. With the conclusion of that data collection, we will go on to explore the rate at which new adults appear, a process

AMS 132: Discussion Section 2

Supplementary File 3: Tutorial for ASReml-R. Tutorial 1 (ASReml-R) - Estimating the heritability of birth weight

Package NanoStringDiff

Perinatal Mental Health Profile User Guide. 1. Using Fingertips Software

WeatherHawk Weather Station Protocol

DUG User Guide. Version 2.1. Aneta J Florczyk Luca Maffenini Martino Pesaresi Thomas Kemper

GS Analysis of Microarray Data

DEXSeq paper discussion

Size Determination of Gold Nanoparticles using Mie Theory and Extinction Spectra

GS Analysis of Microarray Data

Generalized Linear Models (1/29/13)

Chart Group indicator

GS Analysis of Microarray Data

Photometry of Supernovae with Makali i

Differentially Private Linear Regression

Low-Level Analysis of High- Density Oligonucleotide Microarray Data

Creative Data Mining

Dispersion modeling for RNAseq differential analysis

Fundamentals of Measurement and Error Analysis

Naive Bayes classification

Instructions for Mapping 2011 Census Data

FleXScan User Guide. for version 3.1. Kunihiko Takahashi Tetsuji Yokoyama Toshiro Tango. National Institute of Public Health

Variations of the Turing Machine

Trail Flow: Analysis of Drainage Patterns Affecting a Mountain Bike Trail

Statistical Inferences for Isoform Expression in RNA-Seq

Statistical Clustering of Vesicle Patterns Practical Aspects of the Analysis of Large Datasets with R

GET STARTED ON THE WEB WITH HYPATIA AND OPloT

HOW TO USE MIKANA. 1. Decompress the zip file MATLAB.zip. This will create the directory MIKANA.

Import Digital Spatial Data (Shapefiles) into OneStop

CSC2515 Assignment #2

6.867 Machine Learning

Section 7: Discriminant Analysis.

1 Overview of Simulink. 2 State-space equations

Athena Visual Software, Inc. 1

Simple linear regression: estimation, diagnostics, prediction

Prac%cal Bioinforma%cs for Life Scien%sts. Week 14, Lecture 28. István Albert Bioinforma%cs Consul%ng Center Penn State

(S1) (S2) = =8.16

Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1.0 User Guide

Title ghost-tree: creating hybrid-gene phylogenetic trees for diversity analyses

YYT-C3002 Application Programming in Engineering GIS I. Anas Altartouri Otaniemi

Predici 11 Quick Overview

OECD QSAR Toolbox v.3.3. Step-by-step example of how to categorize an inventory by mechanistic behaviour of the chemicals which it consists

SPATIAL-TEMPORAL TECHNIQUES FOR PREDICTION AND COMPRESSION OF SOIL FERTILITY DATA

Didacticiel - Études de cas

WinLTA USER S GUIDE for Data Augmentation

OECD QSAR Toolbox v.3.0

Transcription:

ALDEx: ANOVA-Like Differential Gene Expression Analysis of Single-Organism and Meta-RNA-Seq Andrew Fernandes, Gregory Gloor, Jean Macklaim July 18, 212 1 Introduction This guide provides an overview of the R package ALDEx for differential ession analysis of proportional data. The package was developed specifically for multiple-organism RNA-Seq data generated by high-throughput sequencing platforms (meta-rna-seq), but testing showed that it performed very well with traditional RNA-Seq datasets. In principle, the analysis method should be applicable to any type of count data that is generated by high-throughput sequencing such as ChIP-Seq or sequencing of metagenomes, although ALDEx has not been tested on these types of problems as of yet. The ALDEx package is suitable for comparison of samples between two conditions where there are at least two biological replicates per condition. One unique feature of ALDEx is the explicit estimation of the within-sample biological variation using the Dirichlet distribution. This distribution maintains the proportional nature of the data, which is not the case for Poisson or negativebinomial based methods. The second unique feature of ALDEx is that it introduces a mathematical transform to normalize the data by removing the uninformative subspace introduced by logarithmic transformation. 2 How to get help Questions about how ALDEx works are explained in the accompanying paper, under consideration at PLoS ONE. The purpose of these instructions is an explanation of how to use ALDEx. There are only three functions in ALDEx, aldex, summary.aldex, and plot.aldex. How to use these functions is given by typing?aldex or help(aldex) at the R command prompt. All three functions, along with some basic R pitfalls in setting up the required input tables are included in the quick start and vignette sections of this document. The authors expect that there is a current working installation of the R statistical programming language that is properly installed, and that the user has a basic understanding of it. The authors appreciate bug reports or suggestions for improvement to the package, although suggestions should be focussed on improving the performance of the package. We do not expect to introduce a large number of new features in the near term. 2.1 A brief discussion of memory It is important to note that ALDEx is somewhat memory intensive. It can require a large amount of memory if there are a large number of genes, a large number of replicates, or if there are a 1

large number of Dirichlet samples generated. It is best not to try and run ALDEx on a laptop with limited memory. The ical limiting factor is the number of values in the distribution used to calculate Q A statistic given in Fernandes et al. This number of values in the statistic scales with the number of replicate samples in each group. As one example, the meta-rna-seq dataset used in the Liver vs. Kidney vignette requires a minimum of 1GB of RAM, and takes approximately 1 hour to run on a server class machine. Some rules of thumb will be helpful: for a 2x2 design, mc.samples=248 would be more than adequate, for a 3x3 design mc.samples=124 and for a 4x4 design mc.samples=12 would be adequate. Larger designs can be scaled accordingly, with the minimum number of samples being approximately 12. 3 Quick start A simple example of how to use ALDEx is given below. We assume that there are two conditions, N and T and that there are two replicates per condition. In the code that follows: # comments that help understand the code start with a hash symbol while the actual commands typed into the R console are identified as a teletype-like font as below. Note that a single command input line may wrap onto more than one line if it is very long being in a sans-seriff font # Install ALDEx from the downloaded package install.packages( ALDEx_1..2.tar.gz ) # load ALDEx library for use by R library(aldex) We assume that the counts are stored in a plain text file, with the gene identifier in the first column. An example of the input data format expected for this example is given below. The example assumes a tab or white-space delimited file format. refseq N1 T2 N T6 id1 1 1 26 id2 6 4 32 1... # get the data in the right format as a dataframe with rownames aldex.in <- read.table( reads.txt, header=t, stringsasfactors=t, row.names=1) # examine the input with head(aldex.in). Note that we will ignore the P9 column in the analysis N1 T2 N T6 P9 id1 12826 882 11322 8332 1774 id2 17262 11981 1622 1182 12883... # run the aldex function and place the output into a variable, x # aldex takes inputs: 2

# 1) the dataframe containing the counts. aldex.in # 2) a list that outlines how to order the conditions based on the columns in the dataframe c( N, T, N, T ) # 3) the number of Dirichlet samples. We use a minimum of 248 for production purposes, and 26 for testing mc.samples # 4) the likelihood that the effect size is zero or worse t.significant # ) the median relative difference between conditions t.meaningful. A value of 1. ensures that the between condition difference is always greater than the within-condition difference. A value of 2. is conservative and ensures that the between condition difference is always more than twice the within condition difference. # run aldex. NOTE: this can take a while depending on your machine and the memory available x <- aldex(aldex.in, c( N, T, N, T ), mc.samples=248, t.significant=.1, t.meaningful=1.) # examine the output file. # The built-in plot function can be used to rapidly display traditional MA style plots of the output as well as the MW plots given in [?]. Additional options can be found in the function plot.aldex(x, type= MA ) plot.aldex(x, type= MW ) # The built-in summary function generates a table of all quantile summary values derived from the analysis. The table is suitable for export externally. y <- summary.aldex(x) # This returns only the medians of the summary values y <- summary.aldex(x, median.only=t, sort=t 4 Modelling the data as proportions rather than counts Individual sequence reads are assigned to genes or genomic features in an RNA-seq experiment. Reads assigned to these features have several sources of variation: technical variation (sometimes called shot noise), biological variation within a condition, biological variation between conditions and unexplained variation. While these sources of variation are being recognized by other widelyused RNA-seq analysis tools, the consensus view in the literature is that the underlying variation is best explained by the negative binomial distribution. We take a different approach. First, we assume no underlying distribution. Second, we model the reads as proportions of the data available rather than as counts. The proportional nature of the data stems from a two major causes: 1) the cell has finite resources, and resources devoted to one function cannot be devoted to another, and 2) there are a large but finite number of reads available 3

from a next-generation sequencing run. See the submitted manuscript for a discussion of the advantages of modelling the data in this way. In brief, we we use Baysian techniques to infer the underlying technical variation in the read count proportions in a way that preserves the proportional nature of the data. This is done by sampling from a Dirichlet distribution, and results in a transformation of the observed read counts into a distribution of read counts. These distributions are isometric log-transformated following the advice of Egozcue for dealing with proporitonal data. This has two effects. First, the data for each sample is centred on its mean ession level and second, the values become largely independent and can be dealt with as statistically independent components. Case study: The Liver vs. Kidney dataset from Marioni et al[1].1 Introduction This section contains a detailed analysis of the Liver vs. Kidney technical replicate dataset of Marioni et al[1] that is often used to benchmark RNA-Seq methodologies. This dataset is a very good test of ALDEx because most genes have very few reads mapping to them..2 Reading the data and making the input dataframe d <- read.table( tech_rep_read_counts.txt, header=t, stringsasfactors=t) This dataset contains many different sample replicates. There were two libraries made # make a dataframe with the replicates we want id <- as.character(d[,1]) K11 <- d$r1l1kidney L12 <- d$r1l2liver L21 <- d$r2l1liver K22 <- d$r2l2kidney aldex.in <- data.frame(k11,l12,l21,k22, row.names=id) library(aldex).3 Generating and examining the dataset # run it and put the values in x x <- aldex(aldex.in, c( K, L, L, K ), mc.samples=12,t.significant=.1, t.meaningful=1.) # This extracts out a summary of the data useful for plotting or saving: y <- summary.aldex(x, median.only=t) 4

write.table(y, file= aldex_12.txt, sep= \t, quote=f) This is the dataset that you want to explore. ALDEx scales all ession values by log2(ession) - log2(mean-ession) (see the paper for details), so values less than are essed at levels lower than the mean and vice-versa. The fields give all the summary information output by the program at each step, the field names are : (blank) - this holds the gene identifier information ession.among - median value of the mixture distribution of all samples ession.within.k - median value of the mixture distribution samples in K ession.within.l - median value of the mixture distribution samples in L ession.sample.k11 - median ession value of individual sample K11 ession.sample.l12 - etc. ession.sample.l21 - etc. ession.sample.k22 - etc. difference.between - median ession difference between K and L difference.within - maximum median difference within K or L effect - effect size, how distinct the between difference is as a function of the within difference eria.significance - the likelihood that effect is less than eria.significant - T if eria.significance < t.significant eria.meaning - absolute value of effect eria.meaningful - T if eria.significance > t.meaningful Both eria must be TRUE for a gene to be called as differentially essed. (id).amg.wthn.k.wthn.l.k11.l12.l21.k22 diff.between diff.wthn effect.signance 9349-3.19-3.6-2.77-3.64-2.73-2.82-3.46.71 4.16.17.42 F.173 F 93 2.18.99 3.74.72 4.29 2.88 1.19 2.67 1.49 1.84.1 T 1.84 T 931-3.17-3.3-2.98-3.32-3.8-2.89-3.37.4 4..12.44 F.12 F 932-1. -2. -.79-3.34.27-2.82-1.2 1.18 4.4.28.3 F.286 F 2679.41.41.9.43 6.61 4.9.4.48 1.66.3.48 F.38 F 2678 12.27 12.27 12.23 12.28 12.32 12.1 12.27 -.4.17 -.29.481 F.294 F.sig.mning.mningful

# This makes a plot of the data, either as MA or MW plots: y <- plot.aldex(x, type= MA ) 1 Median Log2 Combined Expression Level 1 1 Median Log2 Between Condition Difference 1 1 Median Log2 Between Condition Difference The MA and MW plots are illustrated. Blue are genes not called significantly different, yellow are significant but not meaningful, white are meaningful but not significant, and red passes both filters. 1 2 3 4 6 Median Log2 Within Condition Difference References [1] Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y (28) Rna-seq: an assessment of technical reproducibility and comparison with gene ession arrays. Genome Res 18: 1917. 6