Design of microarray experiments

Similar documents
Design of microarray experiments

Design of Microarray Experiments. Xiangqin Cui

Chapter 3: Statistical methods for estimation and testing. Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001).

LECTURE 5. Introduction to Econometrics. Hypothesis testing

Two-Color Microarray Experimental Design Notation. Simple Examples of Analysis for a Single Gene. Microarray Experimental Design Notation

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur

Hypothesis testing (cont d)

Tests about a population mean

Biochip informatics-(i)

STAT 515 fa 2016 Lec Statistical inference - hypothesis testing

Introductory Econometrics. Review of statistics (Part II: Inference)

Hypothesis testing. Data to decisions

Mathematical statistics

Expression arrays, normalization, and error models

A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data

Section 10.1 (Part 2 of 2) Significance Tests: Power of a Test

Introduction to Statistical Hypothesis Testing

Linear Models and Empirical Bayes Methods for. Assessing Differential Expression in Microarray Experiments

Bio 183 Statistics in Research. B. Cleaning up your data: getting rid of problems

The regression model with one fixed regressor cont d

Parametric Empirical Bayes Methods for Microarrays

Resampling and the Bootstrap

6.4 Type I and Type II Errors

The University of Hong Kong Department of Statistics and Actuarial Science STAT2802 Statistical Models Tutorial Solutions Solutions to Problems 71-80

Experimental Design. Experimental design. Outline. Choice of platform Array design. Target samples

Statistical Tests. Matthieu de Lapparent

Sample Size and Power Calculation in Microarray Studies Using the sizepower package.

Bayesian methods in economics and finance

Statistical Methods for Particle Physics Lecture 1: parameter estimation, statistical tests

Interpreting Regression Results

Economics 520. Lecture Note 19: Hypothesis Testing via the Neyman-Pearson Lemma CB 8.1,

UCLA STAT 251. Statistical Methods for the Life and Health Sciences. Hypothesis Testing. Instructor: Ivo Dinov,

Computer exercises on Experimental Design, ANOVA, and the BOOTSTRAP

Optimal design of microarray experiments

Alpha-Investing. Sequential Control of Expected False Discoveries

Partitioning the Parameter Space. Topic 18 Composite Hypotheses

Mathematical Statistics

Introductory Econometrics

Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2

Parameter Estimation and Fitting to Data

Statistical Modeling and Analysis of Scientific Inquiry: The Basics of Hypothesis Testing

Lesson 11. Functional Genomics I: Microarray Analysis

Example 1: Two-Treatment CRD

Low-Level Analysis of High- Density Oligonucleotide Microarray Data

f (1 0.5)/n Z =

An inferential procedure to use sample data to understand a population Procedures

How is the Statistical Power of Hypothesis Tests affected by Dose Uncertainty?

Statistical Inference

Analysis of Variance

Sample Size Estimation for Studies of High-Dimensional Data

Class 19. Daniel B. Rowe, Ph.D. Department of Mathematics, Statistics, and Computer Science. Marquette University MATH 1700

Ch. 5 Hypothesis Testing

Statistical Data Analysis Stat 3: p-values, parameter estimation

STAT 450: Final Examination Version 1. Richard Lockhart 16 December 2002

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process

8.1-4 Test of Hypotheses Based on a Single Sample

Statistical Methods in Particle Physics Lecture 1: Bayesian methods

Statistical inference

Mathematical statistics

Section 5.4: Hypothesis testing for μ

Bayesian Regression (1/31/13)

ORF 245 Fundamentals of Statistics Chapter 9 Hypothesis Testing

Mixtures and Hidden Markov Models for analyzing genomic data

Topics on statistical design and analysis. of cdna microarray experiment

CH.9 Tests of Hypotheses for a Single Sample

PASS Sample Size Software. Poisson Regression

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

Lecture on Null Hypothesis Testing & Temporal Correlation

Political Science 236 Hypothesis Testing: Review and Bootstrapping

Power and Sample Size + Principles of Simulation. Benjamin Neale March 4 th, 2010 International Twin Workshop, Boulder, CO

Probabilistic Inference for Multiple Testing

Statistical analysis of microarray data: a Bayesian approach

Exam: high-dimensional data analysis February 28, 2014

Sizing Mixture (RSM) Designs for Adequate Precision via Fraction of Design Space (FDS)

Design and Analysis of Gene Expression Experiments

SMAM 314 Exam 3 Name. F A. A null hypothesis that is rejected at α =.05 will always be rejected at α =.01.

Optimal normalization of DNA-microarray data

Single gene analysis of differential expression. Giorgio Valentini

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.

Data analysis and Geostatistics - lecture VI

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006

Inferential Statistical Analysis of Microarray Experiments 2007 Arizona Microarray Workshop

Non-specific filtering and control of false positives

BIOS 312: Precision of Statistical Inference

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

Power and sample size calculations

1 Descriptive statistics. 2 Scores and probability distributions. 3 Hypothesis testing and one-sample t-test. 4 More on t-tests

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

Seminar Microarray-Datenanalyse

Theory. Pattern and Process

CHAPTER 8. Test Procedures is a rule, based on sample data, for deciding whether to reject H 0 and contains:

Setting Precision Requirements for Estimating Proportions

Bios 6649: Clinical Trials - Statistical Design and Monitoring

DIRECT VERSUS INDIRECT DESIGNS FOR edna MICROARRAY EXPERIMENTS

Simple Linear Regression for the Climate Data

Introduction 1. STA442/2101 Fall See last slide for copyright information. 1 / 33

Practice Problems Section Problems

A variance-stabilizing transformation for gene-expression microarray data

Transcription:

Design of microarray experiments Ulrich Mansmann mansmann@imbi.uni-heidelberg.de Practical microarray analysis March 23 Heidelberg Heidelberg, March 23

Experiments Scientists deal mostly with experiments of the following form: A number of alternative conditions / treatments one of which is applied to each experimental unit an observation (or several observations) then being made on each unit. The objective is: Separate out differences between the conditions / treatments from the uncontrolled variation that is assumed to be present. Take steps towards understanding the phenomena under investigation. Heidelberg, March 23 2

Statistical thinking Uncertain knowledge Knowledge of the extent of uncertainty in it + = Useable knowledge Measurement model m = µ + e m measurement with error, µ - true but unknown value What is the mean of e? What is the variance of e? Is there dependence between e and µ? What is the distribution of e (and µ)? Typically but not always: e ~ N(,σ²) Gaussian / Normal measurement model Decisions on the experimental design influence the measurement model. Heidelberg, March 23 3

Main requirements for experiments Once the conditions / treatments, experimental units, and the nature of the observations have been fixed, the main requirements are: Experimental units receiving different treatments should differ in no systematic way from one another Assumptions that certain sources of variation are absent or negligible should, as far as practical, be avoided; Random errors of estimation should be suitably small, and this should be achieved with as few experimental units as possible; The conclusions of the experiment should have a wide range of validity; The experiment should be simple in design and analysis; A proper statistical analysis of the results should be possible without making artificial assumptions. Taken from Cox DR (958) Planning of experiments, Wiley & Sons, New York (page 3) Heidelberg, March 23 4

The most simple measurement model in microarray experiments Situation: m arrays (Affimetrix) from control population n arrays (Affimetrix) from population with special condition /treatment Observation of interest: Mean difference of log-transformed gene expression ( logfc) logfc obs = logfc true + e e ~ N(, σ² [/n+/m]) In an experiment with 5 arrays per population and the same variance for the expression of a gene of interest, the above formula implies that the variance of the logfc is only 4% (/5+/5 = 2/5 =.4) of the variability of a single measurement taming of uncertainty. Heidelberg, March 23 5

Separate out differences between the conditions / treatments from the uncontrolled variation that is assumed to be present. Is logfc true? How to decide? Special Decision rules: Statistical Tests When the probability model for the mechanism generating the observed data is known, hypotheses about the model can be tested. This involves the question: Could the presented data reasonable have come from the model if the hypothesis is correct? Usually a decision must be made on the basis of the available data, and some degree of uncertainty is tolerated about the correctness of that decision. These four components: data, model, hypothesis, and decision are basic to the statistical problem of hypothesis testing. Heidelberg, March 23 6

Quality of decision True state of gene Decision Gene is diff. expr. Gene is not diff. expr. Gene is diff. expr. OK false positive decision happens with probability α Gene is not diff. expr. OK false negative decision happens with probability β Two sources of error: Power of a test: False positive rate α False negative rate β Ability to detect a difference if there is a true difference Power true positive rate or Power = - β Heidelberg, March 23 7

Question of interest (Alterative): The Statistical test Is the gene G differentially expressed between two cell populations? Answer the question via a proof by contradiction: Show that there is no evidence to support the logical contrary of the alternative. The logical contrary of the alternative is called null hypothesis. Null hypothesis: The gene G is not differentially expressed between two cell populations of interest. A test statistic T is introduced which measures the fit of the observed data to the null hypothesis. The test statistics T implies a prob. distribution P to quantify its variability when the null hypothesis is true. It will be checked if the test statistic evaluated at the observed data t obs behaves typically (not extreme) with respect to the test distribution. The p-value is the probability under the null hypothesis of an observation which is more extreme as the observation given by the data: P( T t obs ) = p. A criteria is needed to asses extreme behaviour of the test statistic via the p value which is called the level of the test: a. The observed data does not fit to the null hypothesis if p < a or t obs > t* where t* is the -α or -α/2 quantile of the prob. distribution P. t* is also called the critical value. The conditions p < a and t obs > t* are equivalent. If p < a or t obs > t* the null hypothesis will be rejected. If p a or t obs t* the null hypothesis can not be rejected this does not mean that it is true Absence of evidence for a difference is no evidence for an absence of difference. Heidelberg, March 23 8

Controlling the power sample size calculations The test should produce a significant result (level α) with a power of -β if logfc true = δ null hypothesis: logfc true = alternative: logfc true = δ δ z -α/2 σ n,m z -β σ n,m σ 2 n,m = σ 2 (/n+/m) The above requirement is fulfilled if: δ = (z -α/2 + z -β ) σ n,m or 2 n m (z α / 2 + z β) = n + m δ 2 σ 2 Heidelberg, March 23 9

Controlling the power sample size calculations 2 n m (z α / 2 + z β) = n + m δ 2 σ 2 n = N γ and m = N (-γ) with M total size of experiment and γ ],[ 2 (z α / 2 + z β ) N = γ ( γ) 2 δ 2 σ The size of the experiment is minimal if g = ½. 5 5 2..2.4.6.8. Gamma Heidelberg, March 23

Sample size calculation for a microarray experiment I Truth Test result diff. expr. (H ) not diff. expr. (H ) diff. expr. D D D not diff. expr. U U U Number of genes on array G G G α = E[D ]/G β = E[U ]/G FDR=E[D /D] E: expectation / mean number family type I error probability: α F = P[D >] family type II error probability: β F = P[U >] Heidelberg, March 23

Sample size calculation for a microarray experiment II Independent genes Dependent Genes P[D =] = (-α ) Go = -α F D ~ inomial(g, α ) E[D ] = G α Poissonapprox.: E[D ] ~ -ln(-α F ) P[U =] = (-β ) G = -β F E[U ] = G (-β ) onferroni: α = α F / G No direct link between the probability for D and α F. -β F max{,- G β } No direct link between the probability for U and β F. Heidelberg, March 23 2

Sample size calculation for a microarray experiment III for an array with 33 independent genes What are useful α and β? α F =.8 E[D ] = - ln(-.8) =.6 = λ P(exactly k false pos.) = exp(-λ) λ k / (k!) false pos. 2 3 4 5 Prob..2.322.259.39.56.8 P(at least six false positives) =.62 325 unexpressed genes: α =.6/325 =.495 5 expressed genes, set E[D ] = 45 -β = 45/5 =.9 β =. -β F = (-β ) G < -23 E[FDR] =.35 95% quantile of FDR:.89 (calculated by simulation) Heidelberg, March 23 3

Sample size calculation for a microarray experiment IV In order to complete the sample size calculation for a microarray experiment, information on σ 2 is needed. The size of the experiment, N, needed to detect a logfc true of δ on a significance level α and with power -β is: 2 2 (z α / 2 + z β ) σ N = 4 2 δ In a similar set of experiments σ 2 for a set of 2 VSN transformed arrays was between.55 and.85. One may choose the value σ 2 = 2. δ log(.5) log(2) log(3) log(5) log() N (σ 2 = 2) 388 476 9 88 44 N (σ 2 = ) 694 238 96 44 22 Sample size with α =.495, β =. Heidelberg, March 23 4

Sample size formula for a one group test The test should produce a significant result (level α) with a power of -β if T = δ null hypothesis: T = alternative: T = δ δ z -α/2 σ n z -β σ n σ 2 n = σ 2 /n The above requirement is fulfilled if: δ = (z -α/2 + z -β ) σ n or 2 2 (z α / 2 + z β ) σ n = 2 δ Heidelberg, March 23 5

Measurement model for cdna arrays Gene expression under condition A intensity of red colour, Gene expression under condition intensity of green colour Measurement: m A/ = I red,a 2 I green, Log = γ A/ + δ + e γ A/ log-transformed true fold change of gene of condition A with respect to condition δ - dye effect, e measurement error with E[e] = and Var(e) = σ 2 Measurement m A/ is used to estimate unknown γ A/ Vertices Edges mrna samples hybridization Direction Dye assignment Green Red A Heidelberg, March 23 6

Estimation of log fold change g A/ Reference Design Dye swap design A A R R A/ Estimate of γ A/ DS g = m A/R m /R g A/ = (m A/ m /A )/2 Variability of estimate g ) = 2 σ 2 Var( g DS A/ ) =.5 σ 2 Sample Size increases proportional to the variance of the measurement! Var( R A/ Heidelberg, March 23 7

2x2 factorial experiments I treatment / condition Wild type Mutation before treatment β β+µ after treatment β+τ β+τ+µ+ψ β - baseline effect; τ - effect of treatment; µ - effect of mutation ψ - differential effect on treatment between WT and MUT treatment effect on gene expr. in WT cells: WT = (β + τ) - β = τ treatment effect on gene expr. in MUT cells: MUT = (β + τ + µ + ψ) (β + µ)= τ + ψ differential treatment effect: MUT WT or ψ How many cdna arrays are needed to show ψ with significance α and power -β if ψ > ln(5)? Heidelberg, March 23 8

2x2 factorial experiments II Study the joint effect of two conditions / treatment, A and, on the gene expression of a cell population of interest. There are four possible condition / treatment combinations: A: treatment applied to MUT cells A: treatment applied to WT cells : no treatment applied to MUT cells : no treatment applied to WT cells A A Design with 2 slides Heidelberg, March 23 9

2x2 factorial experiments III Array m A/ m /A m / m / m A/ m /A m A/A m A/A m A/ m /A m A/ m /A Measurement γ A/ + δ + e = τ + δ + e -γ A/ + δ + e = -τ + δ + e γ / + δ + e = µ + δ + e -γ / + δ + e = -µ + δ + e γ A/ + δ + e = µ + τ + ψ + δ + e -γ A/ + δ + e = - (µ + τ + ψ) + δ + e γ A/A + δ + e = µ + ψ + δ + e -γ A/A + δ + e = - (µ + ψ) + δ + e γ A/ + δ + e = µ + ψ + δ + e -γ A/ + δ + e = - (µ + ψ) + δ + e γ A/ + δ + e = τ - µ + δ + e -γ A/ + δ + e = - (τ - µ) + δ + e Each measurement has variance σ 2 Parameter β is confounded with the dye effect Heidelberg, March 23 2

Heidelberg, March 23 2 Regression analysis ψ µ τ δ = M M M M M M M M M M M M E A / A / A / A / A A / A A / A / A / / / A / A / A A For parameter θ = (δ, τ, µ, ψ) define the design matrix X such that E(M) = Xθ. For each gene, compute least square estimate θ* = (X X) - X M (LUE) Obtain measures of precision of estimated effects. Use all possibilities of the theory of linear models. Design problem: Each measurement M is made with variability σ 2. How precise can we estimate the components or contrasts of θ? Answer: Look at (X X) -

A 2 x 2 factorial designs IV A total.2.by.2.design.mat delta alpha beta psi > precision.2.by.2.rfc(x.mat) A/ $inv.mat /A - / tau mu psi / - tau.25.25 -.25 A/ mu.25.25 -.25 /A - - - psi -.25 -.25.5 A/A A/A - - $effects A/ tau mu psi tau-mu /A - -.25.25.5.25 /A - A/ - Var(A-) = Var(A) + Var() 2 Cov(A,) Heidelberg, March 23 22

Sample size for differential treatment effect (DTE) in a 2 x 2 factorial designs I Array has 2. genes: 95 without DTE, 5 with DTE α F =.9, using onferroni adjustment: α =.9/2. =.462 Mean number of correct positives is set to 45: -β =.9 σ 2 =.7, taken from similar experiments A total dye swap design (2 arrays) estimates ψ with precision σ 2 /2 =.35 N = [4.74 +.282] 2.35 / ln(5) 2 = 3.876 The experiment would need in total 4 x 2 = 48 arrays Is there a chance to get the same result cheaper? Heidelberg, March 23 23

2 x 2 factorial designs V Design I Common ref. Design II Common ref. Design III Connected Design IV Connected A A A A A A A A A Design V All-pairs A Scaled variances of estimated effects D.I D.II D.III D.IV D.V D.tot tau 2.75..5.25 mu 2.75.75.5.25 psi 3 3. 2...5 # chips 3 3 4 4 6 2 Heidelberg, March 23 24

Sample size for differential treatment effect (DTE) in a 2 x 2 factorial designs II Is there a chance to get the same result cheaper? Using total dye swap design, the experiment would need in total 4 x 2 = 48 arrays Using Design III, the effect of interest is estimated with doubled variance (4 8) but by using a design which need only 4 arrays (2 4). This reduces the number of arrays needed from 48 to 32. Heidelberg, March 23 25

Designs for time course experiments Experimental Design - Conclusions In addition to experimental constraints, design decisions should be guided by knowledge of which effects are of greater interest to the investigator. The unrealistic planning based on independent genes may be put into a more realistic framework by using simulation studies speak to your bio statistician/informatician How to collect and present experience from performed microarray experiments on which to base assumptions for planing (σ 2 )? Further reading: Kerr MK, Churchill GA (2) Experimental design for gene expression microarrays, iostatistics, 2:83-2 Lee MLT, Whitmore GA (22), Power and sample size for DNA microarray studies, Stat. in Med., 2:3543-357 Heidelberg, March 23 26