Chapter 3: Statistical methods for estimation and testing. Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001).

3.1 Motivation:

1. Identify genes which show evidence of differential expression (DE). In general, a study may involve one or a small number of genes, or many thousands of genes as in microarray experiments. In microarray experiments, a primary goal is to rank genes according to evidence of DE.

2. Matching DNA sequences.

For microarrays, we need to:
(i) select one or more statistics to rank genes in order of evidence of DE, from strongest to weakest; and
(ii) choose a critical value for the ranking statistic, above which any value is considered to be statistically significant, and therefore DE.

There are practical constraints: in a typical study, only a limited number of genes can be followed up for further study.

Note that in Chapter 3, we assume the data have been appropriately normalized using the methods we studied in Chapter 2.

3.2 The simplest comparison:

Two mRNA samples A, B are hybridised to a single array (= slide), and the array is replicated n times. Assume each gene is spotted once on each slide. Replication is very important.

(i) A common approach to analysis is to calculate the average log ratio $\bar{M}_j$ for each gene, and sort the genes according to the absolute value of $\bar{M}_j$. But this is a poor choice because it ignores the variability in expression levels for each gene. The variability of the M values for a gene over replicates varies from gene to gene, and genes with larger variance have a good chance of giving a large $\bar{M}_j$ statistic even if they are not DE.

(ii) It is better to use the single sample t statistic. For each gene j, calculate:

$$t_j = \frac{\bar{M}_j}{s_j / \sqrt{n}}$$

where $s_j$ is the standard deviation of the M values over the replicates for gene j; j = 1, ..., g genes. This is in fact a paired t statistic when applied to microarray data. Why?

The null hypothesis is $H_0: \mu_j = 0$ vs $H_A: \mu_j \neq 0$ for each gene j = 1, ..., g.
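As a rough illustration of why the t statistic ranks genes differently from the mean log ratio alone, the sketch below computes the one-sample t statistic for two genes with the same average log ratio but very different replicate variability. The function name `one_sample_t` and the log ratio values are hypothetical, chosen only for this example.

```python
import math
import statistics


def one_sample_t(log_ratios):
    """One-sample t statistic for one gene's replicate log ratios M."""
    n = len(log_ratios)
    if n < 2:
        raise ValueError("need at least two replicate arrays")
    mean_m = statistics.mean(log_ratios)
    s = statistics.stdev(log_ratios)  # sample SD with n - 1 denominator
    return mean_m / (s / math.sqrt(n))


# Hypothetical log ratios over n = 4 replicate slides (made-up values).
steady_gene = [0.9, 1.1, 1.0, 1.0]    # average 1.0, small spread
noisy_gene = [3.0, -1.0, 4.0, -2.0]   # average 1.0, large spread

t_steady = one_sample_t(steady_gene)
t_noisy = one_sample_t(noisy_gene)
print(t_steady, t_noisy)
```

Both genes would tie when sorted by the absolute average log ratio, but the t statistic places the gene with consistent replicates far ahead of the noisy one.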

(iii) Penalised t statistic: this is a compromise between using $\bar{M}_j$ and the $t_j$ statistic. The aim is to avoid spuriously large $t_j$ resulting from unrealistically small $s_j$. The penalty a is applied to the estimated standard deviation $s_j$:

$$t_j = \frac{\bar{M}_j}{(a + s_j) / \sqrt{n}}$$

or to the estimated variance $s_j^2$:

$$t_j = \frac{\bar{M}_j}{\sqrt{(a + s_j^2) / n}}$$
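The two penalised forms can be sketched as follows. This is a minimal illustration, assuming the variance form takes a square root over the whole penalised term; the function name `penalised_t`, its `penalise` argument, and the data are hypothetical.

```python
import math
import statistics


def penalised_t(log_ratios, a, penalise="sd"):
    """Penalised one-sample t statistic with penalty a >= 0.

    penalise="sd":       mean / ((a + s) / sqrt(n))
    penalise="variance": mean / sqrt((a + s^2) / n)
    """
    n = len(log_ratios)
    mean_m = statistics.mean(log_ratios)
    s = statistics.stdev(log_ratios)
    if penalise == "sd":
        denominator = (a + s) / math.sqrt(n)
    elif penalise == "variance":
        denominator = math.sqrt((a + s * s) / n)
    else:
        raise ValueError("penalise must be 'sd' or 'variance'")
    return mean_m / denominator


# Hypothetical gene with an unrealistically small replicate SD.
gene = [0.9, 1.1, 1.0, 1.0]
print(penalised_t(gene, a=0.0))   # a = 0 recovers the ordinary t statistic
print(penalised_t(gene, a=0.5))   # a > 0 shrinks the statistic
```

With a = 0 both forms reduce to the ordinary t statistic, and as a grows the ranking moves towards ranking by the average log ratio, which is the compromise described above.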

The penalties can be estimated in different ways, e.g., a may be the 90th percentile of the $s_j$ values. The choice is driven by empirical rather than by theoretical considerations. Intensity-dependent penalties are also applied in practice. Always analyse the unadjusted t statistics. The wisdom of penalising is open to debate.
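One way to obtain a percentile-based penalty is sketched below. Percentiles can be defined in several ways; this example assumes the linear-interpolation rule used by `statistics.quantiles` with `method="inclusive"`, and the helper name `penalty_from_percentile` and the standard deviations are hypothetical.

```python
import statistics


def penalty_from_percentile(gene_sds, percentile=90):
    """Penalty a taken as a percentile of the per-gene standard deviations."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # n=100 yields 99 cut points; index p - 1 is the p-th percentile.
    cut_points = statistics.quantiles(gene_sds, n=100, method="inclusive")
    return cut_points[percentile - 1]


# Hypothetical per-gene standard deviations for 11 genes: 0.1, 0.2, ..., 1.1.
gene_sds = [i / 10 for i in range(1, 12)]
a = penalty_from_percentile(gene_sds)
print(a)
```

In a real experiment `gene_sds` would hold one standard deviation per gene across the whole array, so the penalty reflects the typical scale of variability rather than any single gene.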

[Figure: (i) Standard error of M versus average gene intensity (y-axis: standard deviation of log ratios; x-axis: average gene intensity). (ii) Normal qq-plot of penalised t statistic (y-axis: sample quantiles; x-axis: theoretical quantiles).]

Assessing differential expression

3.3 Ranking genes:

Suppose we calculate the t statistic for every gene in the array (this could be any number up to 20,000) and rank the absolute values of the t statistics. This will give us a ranked list of genes in which the largest values of |t| provide the strongest evidence of differential expression. However we have, in effect, just performed 20,000 t-tests! The next step is to choose a cut-off value above which genes will be flagged as statistically significant. How should we do this?
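The ranking step, and the reason a naive cut-off is problematic, can be sketched as below. The gene names and t values are hypothetical, and the false-positive count assumes, purely for illustration, that no gene is DE and that each test is carried out at the 5% level.

```python
def rank_genes(t_stats):
    """Sort (gene, t) pairs by |t|, strongest evidence of DE first."""
    return sorted(t_stats.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical t statistics for four genes.
t_stats = {"gene1": 2.1, "gene2": -7.5, "gene3": 0.3, "gene4": 4.0}
ranked = rank_genes(t_stats)
print(ranked)

# Illustrative assumption: all 20,000 genes are truly not DE and each is
# tested at level 0.05. The expected number of false positives is then
# the number of tests times the level.
n_tests = 20000
alpha = 0.05
expected_false_positives = n_tests * alpha
print(expected_false_positives)
```

Under these assumptions roughly a thousand genes would be flagged by chance alone, which is why the per-test cut-off cannot simply be the usual single-test critical value.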

The aim in attempting to determine which genes are truly DE is to control for the large amount of multiple testing inherent in the need to conduct a test for each gene. See Chapter 4 on Multiple Comparisons.

An informal, graphical method that can be used to assess significance is to display the sorted t statistics in a normal quantile-quantile plot, or a t-distribution qq-plot. The idea is to look for points which deviate markedly from the line. The example on the next slide shows a qq-plot for t statistics with 4 degrees of freedom; the experiment compared two mutant cell lines in leukaemic mice on each slide. The method implicitly assumes M is roughly normal and that the genes are behaving independently (which may not be true).
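The coordinates of a normal qq-plot can be computed without any plotting library, as in the sketch below. It uses normal quantiles only, because the Python standard library has no t-distribution quantile function; a t qq-plot would substitute t quantiles with the appropriate degrees of freedom. The helper `ppoints` mirrors the plotting-position rule of the R function of the same name, and the t values are hypothetical.

```python
from statistics import NormalDist


def ppoints(n):
    """Plotting positions (i - a) / (n + 1 - 2a), a = 3/8 if n <= 10 else 1/2."""
    a = 3.0 / 8.0 if n <= 10 else 0.5
    return [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]


def normal_qq_pairs(t_stats):
    """Return (theoretical quantile, observed statistic) pairs, smallest first."""
    ordered = sorted(t_stats)
    standard_normal = NormalDist()
    return [
        (standard_normal.inv_cdf(p), t)
        for p, t in zip(ppoints(len(ordered)), ordered)
    ]


# Hypothetical t statistics for five genes.
pairs = normal_qq_pairs([3.2, -0.4, 0.1, -2.5, 1.0])
for theoretical, observed in pairs:
    print(round(theoretical, 3), observed)
```

Plotting the observed values against the theoretical quantiles gives the qq-plot; genes whose points fall well away from the reference line are the candidates for differential expression.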

[Figure: t qq-plot of t.statistics1a[, 1] against qt(ppoints(t.statistics1a[, 1]), df = 4).]

3.4 More complex experiments:

One of the most commonly used designs in biological experiments is the reference design. The simplest such design compares two mRNA samples A and B through a reference sample, Ref. That is, A is compared with Ref, and B is compared with Ref.

In terms of log ratios M, for each gene j we now have

$$M_{Aj} = \log(A_j / \mathrm{Ref}), \qquad M_{Bj} = \log(B_j / \mathrm{Ref})$$

where A, B are labelled red and Ref is labelled green. The difference of interest is $M_{Aj} - M_{Bj}$.
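A short derivation may help show why the difference of the two log ratios is the quantity of interest. It assumes, as an idealisation, that the reference contributes the same intensity Ref for gene j on both arrays:

```latex
\begin{aligned}
M_{Aj} - M_{Bj}
  &= \log\!\left(\frac{A_j}{\mathrm{Ref}}\right) - \log\!\left(\frac{B_j}{\mathrm{Ref}}\right) \\
  &= \log A_j - \log \mathrm{Ref} - \log B_j + \log \mathrm{Ref} \\
  &= \log\!\left(\frac{A_j}{B_j}\right)
\end{aligned}
```

Under that assumption the reference cancels, so the difference estimates the log ratio of A to B even though the two samples are never hybridised to the same slide.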

For ease of notation, we will drop the subscript j. In microarray experiments, there will be $n_1$ replicate arrays comparing (i.e. hybridising) A with Ref, and $n_2$ replicate arrays comparing B with Ref. Then the test statistic will be based on $\bar{M}_A - \bar{M}_B$. We know that the optimal normal theory statistic is the two-sample t statistic:

$$t = \frac{\bar{M}_A - \bar{M}_B}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

where $s_p$ is the pooled sample standard deviation.
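The two-sample statistic can be sketched as follows for a single gene. The pooled standard deviation is computed with the usual degrees-of-freedom weighting, which the slides do not spell out; the function name `pooled_t` and the log ratios are hypothetical.

```python
import math
import statistics


def pooled_t(m_a, m_b):
    """Equal-variance two-sample t statistic for one gene.

    m_a: log ratios from the n1 arrays comparing A with Ref
    m_b: log ratios from the n2 arrays comparing B with Ref
    """
    n1, n2 = len(m_a), len(m_b)
    if n1 < 2 or n2 < 2:
        raise ValueError("need at least two replicate arrays per group")
    var_a = statistics.variance(m_a)
    var_b = statistics.variance(m_b)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * var_a + (n2 - 1) * var_b) / (n1 + n2 - 2)
    s_p = math.sqrt(pooled_var)
    difference = statistics.mean(m_a) - statistics.mean(m_b)
    return difference / (s_p * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical log ratios for one gene, three arrays per group.
t = pooled_t([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
print(t)
```

Applying the function to every gene in turn gives the list of statistics that is then ranked or displayed in a qq-plot.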

The null hypothesis for each gene is that the expression levels in the two cell types A and B are the same, i.e., $H_0: \mu_A = \mu_B$ versus $H_a: \mu_A \neq \mu_B$.

$s_p$ is sometimes replaced by the penalised pooled sample standard deviation, $\tilde{s}_p = \sqrt{a + s_p^2}$.

The example on the next page is from Dudoit et al. (2002) and shows a histogram of the observed two-sample t statistics, and the normal qq-plot for two-sample t statistics from a study comparing lipid levels in treated (A) and control mice (B). There were 16 slides in the experiment, 8 for treated and 8 for control mice, each hybridised to a common reference pool of mouse DNA (Ref). The points lying off the line are candidates for differential expression.

[Figure: histogram and normal qq-plot of the two-sample t statistics, ApoA1 experiment.]

Remarks on t statistics

The t statistic has the advantage of extending to more complex situations, such as factorial designs and multiple regression. The above approach to analysis can be generalised to more than two samples using F statistics, and so on. However, the two-sample t statistic assumes the random variables $M_A$ and $M_B$ are normally distributed and have equal variances, which may not be justified.

We can relax the equal variance assumption by using the approximate unequal variance form of the two-sample t statistic:

$$t = \frac{\bar{M}_A - \bar{M}_B}{\sqrt{\frac{s_A^2}{n_1} + \frac{s_B^2}{n_2}}}$$

But there are better, alternative approaches.
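A sketch of the unequal-variance statistic is given below. The slides give only the statistic; the approximate degrees of freedom returned alongside it use the standard Welch-Satterthwaite formula, which is an addition for illustration. The function name `welch_t` and the data are hypothetical.

```python
import math
import statistics


def welch_t(m_a, m_b):
    """Unequal-variance two-sample t statistic and approximate df."""
    n1, n2 = len(m_a), len(m_b)
    if n1 < 2 or n2 < 2:
        raise ValueError("need at least two replicate arrays per group")
    # Squared standard errors of the two group means.
    se2_a = statistics.variance(m_a) / n1
    se2_b = statistics.variance(m_b) / n2
    if se2_a + se2_b == 0:
        raise ValueError("both groups have zero variance")
    t = (statistics.mean(m_a) - statistics.mean(m_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation (not stated in the slides).
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n1 - 1) + se2_b ** 2 / (n2 - 1))
    return t, df


# Hypothetical log ratios for one gene, three arrays per group.
t, df = welch_t([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
print(t, df)
```

When the two groups have equal sizes and equal sample variances, as in this toy example, the statistic coincides with the pooled version; the two differ when the group variances or sizes are unbalanced.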

The rest of this Chapter...

Nonparametric or distribution-free alternatives to the two-sample t statistic are popular and we consider two of these in 3.5:
(i) Mann-Whitney test
(ii) Permutation test.

Computer-intensive testing and estimation procedures are also popular, and in 3.6 we will study Bootstrap techniques.

In 3.7, we will study Bayesian estimation and testing procedures.