Expression arrays, normalization, and error models

There are a number of different array technologies available for measuring mRNA transcript levels in cell populations, from spotted cDNA arrays to in situ synthesized oligonucleotide arrays, and other variants. Our goal here is to illustrate the basic computational methods and ideas involved in teasing out the relevant signal from array measurements. For simplicity we will focus exclusively on spotted cDNA arrays.

Spotted cDNA arrays are geared towards measuring relative changes in mRNA levels across two populations of cells, e.g., cells under normal conditions and cells undergoing a specific treatment (nutrient starvation, chemical exposure, temperature change, gene deletion, and so on). The mRNA extracted from the cells in each population is reverse transcribed into cDNA and labeled with a fluorescent dye (Cy3 or Cy5) specific to the population. The resulting populations of differently labeled cDNAs are then jointly hybridized to the matrix of immobilized probes, which are complements of the cDNA targets we expect to measure. Each array location or spot contains a number of probes specific to the corresponding target to ensure efficient hybridization. We won't consider here the question of how the probes are or should be chosen, for example to minimize potential cross-hybridization (a target hybridizing to a probe other than the intended one). By exciting the fluorescent dyes of the hybridized targets on the array, we can read off the amount of each cDNA target (hybridized to a specific location on the array) corresponding to each population of interest. By jointly hybridizing the two populations we can more directly gauge changes in mRNA levels across the two populations without necessarily capturing the actual transcript levels in each. This type of internal control helps determine whether a gene is up- or down-regulated relative to the control.

Array measurements are limited by the fact that we have to use a large number of cells (10,000 or more) to get a reasonable signal. When the cell population of interest is relatively uniform this typically does not matter. However, when there are two or more distinct cell types in the population, we might draw false inferences from the aggregate measurements. Suppose, for example, that gene A is active and gene B is inactive in cell type 1, and that the converse holds for cell type 2. We would see both genes active in the array measurement, but this conclusion matches neither of the two underlying cell types. We will return to this issue later on in the course.

Normalization

To use the arrays we first have to normalize the signal so as to make two different array measurements, or the two channels (specified by the fluorescent dyes) within a single array, mutually comparable. By normalizing the signal we aim to remove any systematic experimental variation. For simplicity we will consider here only normalization to remove biases arising from differences between the two dyes. The effect can be seen easily by swapping the dyes between the control and treatment populations, or by labeling two identical populations with different dyes. The biases we observe may arise during sample preparation, due to differences in how effectively the dyes are incorporated, or later, due to the different heat and light characteristics of the dyes.

The simplest way to normalize the signal from the two channels would be to equalize the total measured intensity across the spots. The assumption here is, of course, that the overall amount of mRNA transcript is the same in the two cell populations (control and treated). A slightly better approach would be to normalize based on the total intensity across genes unlikely to change due to the treatment (e.g., housekeeping genes); defining this set may be difficult, however.

We will consider here a slightly different approach. Suppose first that, for genes that remain unchanged due to the treatment, the measured intensities in the absence of noise are proportional to each other: $R = kG$, where $R$ and $G$ are the signals from the red and green channels, respectively, and $k$ is an unknown constant we have to estimate. We do not assume that we know which genes remain unchanged and which ones (or how many) have widely different expression levels due to the treatment. For this reason we cannot simply equalize the total signal from the two channels. A robust alternative is to set $k$ so that $\log R/(kG)$, the log ratio of the corrected signals, has median zero. While this value can be computed directly, we formulate the problem in terms of robust estimation for later utility: we would like to find $c = \log k$ that minimizes

$$\sum_{j=1}^{n} \bigl|\log R_j - c - \log G_j\bigr|$$

where the summation is over the $n$ probes/spots on the array and $|\cdot|$ denotes the absolute value.

To see that we get the same answer (the median), we can take the derivative of this objective with respect to $c$ and set it to zero. Recall that the derivative of the absolute value is $\pm 1$ depending on the sign of its argument (we omit the special case where the argument is exactly zero), and that the argument decreases with $c$:

$$\frac{d}{dc}\sum_{j=1}^{n}\bigl|\log R_j - c - \log G_j\bigr| \;=\; \sum_{j:\ \log R_j/G_j > c} (+1)(-1) \;+\; \sum_{j:\ \log R_j/G_j < c} (-1)(-1) \;=\; 0$$

This condition ensures that, at the optimum, we balance the number of probes for which $\log R_j/G_j$ is greater than $c$ against the number for which it is less than $c$. The optimal $c$ is therefore the median value of $\log R_j/G_j$. Geometrically, setting $c$ corresponds to moving a 45-degree line in the $\log R$ versus $\log G$ scatter plot from the origin (no bias) to the center of the cloud of points. When there is a multiplicative difference in the dye bias (something we could also hope to remove to a degree by overall intensity normalization), the figure looks like this:

[Figure: scatter plot of $\log R$ versus $\log G$; the points $(\log R_k, \log G_k)$ lie along the line $\log R = c + \log G$, offset from the 45-degree line through the origin.]

We can think of removing the effect by moving the coordinate system so that we are back in the earlier situation. How can we do this more formally? The dye biases may depend on the intensity as well. For example, we might suspect that $R = kG^p$ for some $k$ and $p$. As before, the intensities for identical transcript levels would then relate as $\log R = c + p\log G$ with $c = \log k$, and we could estimate the parameters $p$ and $c$ by minimizing

$$\sum_{j=1}^{n} \bigl|\log R_j - c - p\,\log G_j\bigr|$$

This is, however, not quite correct. We should measure the deviations orthogonally to the line $\log R = c + p\log G$, not vertically, since both measurements, $\log R$ and $\log G$, involve errors (see figure below).
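As an illustration of the constant dye-bias correction above, here is a minimal Python sketch (the intensity values are hypothetical) that estimates $c$ as the median log ratio and rescales the green channel accordingly:

```python
import numpy as np

def median_normalize(R, G):
    """Estimate the constant dye bias c = log k as the median of log(R/G)
    and return the green-channel intensities rescaled by k = exp(c)."""
    R = np.asarray(R, dtype=float)
    G = np.asarray(G, dtype=float)
    log_ratio = np.log(R) - np.log(G)
    c = np.median(log_ratio)          # minimizes sum_j |log R_j - c - log G_j|
    k = np.exp(c)
    return k * G, c                   # corrected G so that median log(R/(kG)) = 0

# Hypothetical intensities for a handful of spots
R = [1200.0, 850.0, 4300.0, 95.0, 2100.0]
G = [1000.0, 700.0, 3600.0, 80.0, 1700.0]
G_corrected, c_hat = median_normalize(R, G)
print(c_hat, np.median(np.log(R) - np.log(G_corrected)))  # second value is ~0
```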

We need to find the line corresponding to the majority of genes, those whose expression in the two channels differs only by noise; differentially expressed genes would appear as outliers. Let $e_j$ denote the signed orthogonal distance from the regression line to the point $(\log R_j, \log G_j)$.

[Figure: scatter plot of $\log R$ versus $\log G$ with the regression line $\log R = c + p\log G$; $e_j$ is the signed orthogonal distance from the line to the point.]

The estimation criterion is to find the regression line that minimizes some measure (loss) of the orthogonal distances. For any given slope $p$ the orthogonal distance is proportional to the vertical distance, and so we can easily correct the objective above:

$$\sum_{j=1}^{n} \underbrace{\frac{1}{\sqrt{p^2+1}}\bigl|\log R_j - c - p\,\log G_j\bigr|}_{|e_j|}$$

By defining $c' = c/\sqrt{p^2+1}$ and $p' = p/\sqrt{p^2+1}$ (so that $1/\sqrt{p^2+1} = \sqrt{1-p'^2}$), the optimization problem we have to solve retains the same absolute-deviation form,

$$\sum_{j=1}^{n} \Bigl|\sqrt{1-p'^2}\,\log R_j - c' - p'\,\log G_j\Bigr|$$

with the constraint that $p' \in [-1, 1]$. After solving for $c'$ and $p'$ we can reconstruct $c$ and $p$ to perform the normalization.
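A minimal sketch of this robust fit, assuming we simply hand the orthogonal absolute-deviation objective to a generic optimizer (scipy's Nelder-Mead here; the notes do not prescribe a particular solver, and the data below are made up):

```python
import numpy as np
from scipy.optimize import minimize

def fit_intensity_dependent_bias(R, G):
    """Fit log R = c + p log G by minimizing the sum of absolute
    orthogonal distances; returns (c, p)."""
    lR, lG = np.log(np.asarray(R, float)), np.log(np.asarray(G, float))

    def objective(params):
        c, p = params
        return np.sum(np.abs(lR - c - p * lG)) / np.sqrt(p**2 + 1.0)

    # Initialize at the constant-bias solution (p = 1, c = median log ratio)
    res = minimize(objective, x0=[np.median(lR - lG), 1.0], method="Nelder-Mead")
    c_hat, p_hat = res.x
    return c_hat, p_hat

# Hypothetical data: true bias log R = 0.3 + 0.9 log G plus noise
rng = np.random.default_rng(0)
lG = rng.uniform(4, 10, size=500)
lR = 0.3 + 0.9 * lG + rng.normal(scale=0.1, size=500)
print(fit_intensity_dependent_bias(np.exp(lR), np.exp(lG)))  # roughly (0.3, 0.9)
```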

Differential expression

Suppose now that we have removed all the systematic variations, e.g., those due to the dye biases. The first type of inference we would like to make on the basis of the array measurement is to determine which genes are differentially expressed (up- or down-regulated) due to the treatment. Here we consider making these decisions on the basis of a single array; the decisions can obviously be strengthened by carrying out the experiment in duplicate or triplicate, as is typically done in practice. The methodology we discuss here can be directly extended to include multiple replicates.

A simple approach would be to assume that, for genes that are not differentially expressed, $\log R/G \sim N(0, \sigma^2)$. In other words, the null hypothesis for each gene is that the log ratios follow a normal distribution with mean zero and variance $\sigma^2$, where the variance does not depend on the spot. This normal distribution summarizes the experimental variation we expect to see on spots that should return identical intensity values from the two channels. We could estimate the variance from the measured log ratios corresponding to housekeeping genes that are assumed not to change their expression due to the treatment. For other genes we could use the normal distribution to evaluate a p-value for differential expression: the probability mass in the tail of the normal distribution beyond the observed value of the log ratio. The problem with this approach is that it only pays attention to the log ratio $\log R/G$. Thus decisions are made as confidently when the intensity measurements $R$ and $G$ are very low (at noise level) as when they are high (clear signal).

We follow here a somewhat more sophisticated hierarchical Bayesian approach. The basic idea is very simple. We wish to gauge, for each gene, whether we can explain the observed intensities from the two channels by assuming a common biological signal, or whether we have to accept that the intensity differences are too large to be explained by experimental noise. But we have to define the experimental noise and the type of biological signal to expect so that the two can be compared.

Let's start by specifying the experimental variation for a single channel (say red). We expect the intensity measurements for a given biological signal to be distributed according to a Gamma distribution: $R_j \sim \mathrm{Gamma}(a, \theta_j)$, where $a$ is called the shape parameter and $\theta_j$ is the (inverse) scale parameter. The shape parameter is common to all genes, while the scale parameter characterizes the underlying biological signal and varies from gene to gene. The mode (peak) of this distribution occurs at intensity $(a-1)/\theta_j$; the mean is $a/\theta_j$ and the variance is $a/\theta_j^2$. The distribution has heavy tails, meaning that larger errors are also permitted. We do not expect to know these parameter values in advance; they will be estimated on the basis of the array measurements.

We expect the biological signals (scale parameters) $\theta_j$ to be independent across genes (spots) and to also follow a Gamma distribution, $\theta_j \sim \mathrm{Gamma}(a_0, v)$, with a common shape parameter $a_0$ and scale $v$. So, according to this model, we could generate biological signals for each spot on the array by drawing independent samples from this Gamma distribution. The actual intensity values that we would expect to see on the spots would subsequently be sampled from $R_j \sim \mathrm{Gamma}(a, \theta_j)$, separately for each spot, as discussed above. The choice of Gamma distributions is largely a matter of mathematical convenience, although they have some appropriate qualitative features (e.g., a normal-looking peak and heavy tails). As a first approximation we have also chosen to consider each spot independently of the others; in other words, we decide whether a gene is differentially expressed largely by looking only at the intensity values from the two channels for that gene (save for the parameters of the Gamma distributions, which are estimated on the basis of all the spots).
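A minimal sketch of this generative model under the assumptions above (the hyperparameter values are made up purely for illustration), drawing a biological signal $\theta_j$ per spot and then an observed intensity:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hyperparameters: shape a0 and scale v for the signal prior,
# shape a for the measurement model (all would be estimated from the array).
a0, v, a = 2.0, 0.01, 20.0
n_spots = 1000

# theta_j ~ Gamma(a0, v): underlying biological signal for each spot
theta = rng.gamma(shape=a0, scale=v, size=n_spots)

# R_j ~ Gamma(a, theta_j), with theta_j acting as an inverse scale (rate),
# so numpy's scale argument is 1/theta_j; mean a/theta_j, variance a/theta_j^2
R = rng.gamma(shape=a, scale=1.0 / theta)

print(R.mean(), R.std())
```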

We are now ready to define what we expect in terms of intensity measurements from the two channels when a gene is or is not differentially expressed:

$H_0$ (same biological signal at spot $j$): $\theta_j \sim \mathrm{Gamma}(a_0, v)$; $R_j \sim \mathrm{Gamma}(a, \theta_j)$, $G_j \sim \mathrm{Gamma}(a, \theta_j)$

$H_1$ (differential expression at spot $j$): $\theta_j^R \sim \mathrm{Gamma}(a_0, v)$, $R_j \sim \mathrm{Gamma}(a, \theta_j^R)$; $\theta_j^G \sim \mathrm{Gamma}(a_0, v)$, $G_j \sim \mathrm{Gamma}(a, \theta_j^G)$

In other words, we sample different biological signals for the two channels if the gene is differentially expressed; otherwise we sample a common signal. These sampling schemes give rise to two different (continuous) mixture distributions over the observed intensities:

$$P(R_j, G_j \mid a_0, v, a, H_0) = \int_{\theta_j} \mathrm{Gamma}(\theta_j; a_0, v)\,\mathrm{Gamma}(R_j; a, \theta_j)\,\mathrm{Gamma}(G_j; a, \theta_j)\, d\theta_j$$

where, for example, $\mathrm{Gamma}(R_j; a, \theta_j)$ is the pdf of a Gamma distribution with parameters $a$ and $\theta_j$. The integration over $\theta_j$ reflects the fact that we need to account for different levels of underlying biological signal, in proportion to our expectations. The measurements $R_j$ and $G_j$ are not independent under $H_0$ since they share a common biological signal. Similarly,

$$P(R_j, G_j \mid a_0, v, a, H_1) = P(R_j \mid a_0, v, a, H_1)\, P(G_j \mid a_0, v, a, H_1)$$

where, for example,

$$P(R_j \mid a_0, v, a, H_1) = \int_{\theta_j^R} \mathrm{Gamma}(\theta_j^R; a_0, v)\,\mathrm{Gamma}(R_j; a, \theta_j^R)\, d\theta_j^R$$

Note that in this case $R_j$ and $G_j$ are independent since they share no common covariate.

We now have two competing explanations for the measurements from the two channels: one that assumes a common biological signal, and one that does not (differential expression). For any particular gene we do not know a priori whether it is differentially expressed.
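As a sketch of how these marginal likelihoods could be evaluated, the integrals over $\theta$ are computed numerically below with scipy (the notes leave the integration implicit; closed forms exist for this Gamma-Gamma model but are not derived here, and the hyperparameter values are hypothetical):

```python
import numpy as np
from scipy import integrate
from scipy.stats import gamma

def gamma_pdf(x, shape, rate):
    """Gamma pdf with a rate (inverse scale) parameter, as used in the notes."""
    return gamma.pdf(x, shape, scale=1.0 / rate)

def marginal_h0(R, G, a0, v, a):
    """P(R, G | a0, v, a, H0): the shared signal theta is integrated out."""
    integrand = lambda t: gamma.pdf(t, a0, scale=v) * gamma_pdf(R, a, t) * gamma_pdf(G, a, t)
    value, _ = integrate.quad(integrand, 0, np.inf)
    return value

def marginal_h1(R, G, a0, v, a):
    """P(R, G | a0, v, a, H1): independent signals for the two channels."""
    def single(x):
        integrand = lambda t: gamma.pdf(t, a0, scale=v) * gamma_pdf(x, a, t)
        value, _ = integrate.quad(integrand, 0, np.inf)
        return value
    return single(R) * single(G)

# Hypothetical hyperparameters and one spot's intensities
a0, v, a = 2.0, 0.01, 20.0
print(marginal_h0(900.0, 950.0, a0, v, a), marginal_h1(900.0, 950.0, a0, v, a))
```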

We will therefore have to entertain both possibilities and infer which explanation is more likely. Assuming a prior probability $p$ that any specific gene is differentially expressed, our model for the intensity measurements $(R_j, G_j)$ is

$$P(R_j, G_j \mid a_0, v, a, p) = p\, P(R_j, G_j \mid a_0, v, a, H_1) + (1-p)\, P(R_j, G_j \mid a_0, v, a, H_0)$$

This is again a mixture model, a mixture of two distributions that are themselves mixtures. What is left for us to do is to estimate the four parameters in this model: $a_0$, $v$, $a$, and $p$. We find the setting of these parameters that maximizes the probability of reproducing the observations across the spots on the array (maximum likelihood fitting):

$$\prod_{j=1}^{n} P(R_j, G_j \mid a_0, v, a, p)$$

This task may look a little daunting but can be done analogously to the EM algorithm we used previously for estimating mixture models (motifs). Here, in the E-step we evaluate for each gene the posterior probability that it is differentially expressed and what the biological signal might be. Once we have these posteriors, we can, in the M-step, essentially just estimate the Gamma distributions from weighted observations. We omit the details of the algorithm.

We are now ready to make decisions concerning differential expression. Given the estimated parameters $\hat a_0$, $\hat v$, $\hat a$, and $\hat p$ (where the hat denotes the fact that they were estimated), we find genes that are differentially expressed on the basis of the posterior probabilities:

$$P(j \text{ is differentially expressed} \mid R_j, G_j, \hat a_0, \hat v, \hat a, \hat p) = \frac{\hat p\, P(R_j, G_j \mid \hat a_0, \hat v, \hat a, H_1)}{P(R_j, G_j \mid \hat a_0, \hat v, \hat a, \hat p)}$$

These posterior probabilities appropriately discount high log ratios corresponding to low intensity measurements, as such measurements are easily explained by experimental noise.

References

Yee Hwa Yang et al., Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation, Nucleic Acids Research, Vol. 30, No. 4, 2002.

Newton et al., On Differential Variability of Expression Ratios: Improving Statistical Inference about Gene Expression Changes from Microarray Data, Journal of Computational Biology, Vol. 8, No. 1, 2001.
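To make the decision rule above concrete, here is a minimal self-contained sketch of the posterior computation for one spot; the marginal likelihood values and the fitted prior $\hat p$ are hypothetical inputs (in practice they would come from the Gamma-Gamma marginals and the EM fit described above):

```python
def posterior_differential(like_h1, like_h0, p_hat):
    """Posterior probability of differential expression for one spot:
    p * P(R,G|H1) / (p * P(R,G|H1) + (1 - p) * P(R,G|H0))."""
    return p_hat * like_h1 / (p_hat * like_h1 + (1.0 - p_hat) * like_h0)

# Hypothetical marginal likelihoods for one spot and a fitted prior of 5%
print(posterior_differential(like_h1=2.3e-6, like_h0=1.6e-7, p_hat=0.05))
```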