Chapter 3: Statistical methods for estimation and testing. Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001).

3.1 Motivation: Identify genes which show evidence of differential expression (DE). In general, a study may involve one or a small number of genes, or many thousands of genes as in microarray experiments.

For microarrays, we need to: (i) select one or more statistics to rank genes in order of evidence of DE, from strongest to weakest; and (ii) choose a critical value for the ranking statistic, above which any value is considered to be statistically significant, and therefore DE. There are practical constraints: in a typical study, only a limited number of genes can be followed up for further study.

3.2 The simplest comparison: Two mRNA samples A, B are hybridised to a single array (= slide), and the array is replicated n times. Assume each gene is spotted once on each slide. Replication is very important. (i) A common approach to analysis is to calculate the average log ratio $\bar{M}_j$ for each gene, and sort the genes according to the absolute value of $\bar{M}_j$. But this is a poor choice because it ignores the variability in expression levels for each gene.

(ii) It is better to use the single-sample t statistic. For each gene j, calculate:

$$t_j = \frac{\bar{M}_j}{s_j / \sqrt{n}}$$

where $s_j$ is the standard deviation of the $M_j$-values across the replicates for a gene; j = 1, ..., g genes. This is in fact a paired t statistic when applied to microarray data. Why?
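The single-sample t statistic can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `one_sample_t` and the log ratios are hypothetical, chosen so that both genes have the same average log ratio but very different variability.

```python
import math
import statistics


def one_sample_t(log_ratios):
    """Return (mean, sd, t) for the replicate log ratios of one gene."""
    n = len(log_ratios)
    if n < 2:
        raise ValueError("need at least two replicate arrays")
    mean = statistics.fmean(log_ratios)   # average log ratio
    sd = statistics.stdev(log_ratios)     # sample SD, divisor n - 1
    if sd == 0:
        raise ValueError("zero standard deviation: t is undefined")
    t = mean / (sd / math.sqrt(n))
    return mean, sd, t


# Hypothetical data: three replicate log ratios per gene.
genes = {
    "gene_a": [1.0, 2.0, 3.0],   # average 2, large spread
    "gene_b": [2.0, 2.1, 1.9],   # average 2, small spread
}

# Rank genes by |t|, strongest evidence of DE first.
ranked = sorted(genes, key=lambda g: abs(one_sample_t(genes[g])[2]), reverse=True)
print(ranked)
```

Ranking by the average log ratio alone cannot separate these two genes, whereas the t statistic places the less variable gene first.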

(iii) Penalised t statistic: this is a compromise between using $\bar{M}_j$ and the $t_j$ statistic. The aim is to avoid spuriously large $t_j$ resulting from unrealistically small $s_j$. The penalty a is applied to the estimated standard deviation $s_j$:

$$t_j = \frac{\bar{M}_j}{(a + s_j) / \sqrt{n}}$$

The penalties can be estimated in different ways, e.g., a may be the 90th percentile of the $s_j$ values. The choice is driven by empirical rather than by theoretical considerations. Intensity-dependent penalties are also applied in practice.
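As a rough sketch of the penalised t statistic with a taken as the 90th percentile of the $s_j$ values, the following Python uses only the standard library. The function name `penalised_t`, the choice of the "inclusive" percentile method, and the data are assumptions made for illustration; other percentile definitions would give a slightly different a.

```python
import math
import statistics


def penalised_t(replicates_by_gene):
    """Return (a, {gene: penalised t}) with a = 90th percentile of the s_j.

    Needs at least two genes, and at least one gene with non-zero SD.
    """
    summaries = {}
    for gene, values in replicates_by_gene.items():
        summaries[gene] = (
            statistics.fmean(values),   # average log ratio
            statistics.stdev(values),   # sample SD s_j, divisor n - 1
            len(values),                # number of replicate arrays n
        )
    sds = [sd for _, sd, _ in summaries.values()]
    # quantiles(..., n=10) returns the nine deciles; the last is the 90th percentile.
    a = statistics.quantiles(sds, n=10, method="inclusive")[-1]
    t_pen = {
        gene: mean / ((a + sd) / math.sqrt(n))
        for gene, (mean, sd, n) in summaries.items()
    }
    return a, t_pen


# Hypothetical data: g1 has a small average log ratio but a tiny SD.
example = {
    "g1": [0.10, 0.11, 0.09],
    "g2": [2.0, 1.0, 3.0],
    "g3": [0.5, -0.5, 0.0],
}
a, t_pen = penalised_t(example)
print(a, t_pen)
```

In this toy data the ordinary t statistic would rank g1 far above g2 purely because its standard deviation is tiny; with the penalty, g2 ranks first.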

[Figure, two panels. (i) Standard error of M versus average gene intensity (x-axis: average gene intensity; y-axis: standard deviation of log ratios). (ii) Normal qq-plot of penalised t statistic (x-axis: theoretical quantiles; y-axis: sample quantiles).]

Assessing differential expression. 3.3 Ranking genes: Suppose we calculate the t statistic for every gene on the array (this could be any number up to 20,000) and rank the absolute values of the t statistics. This gives a ranked list of genes in which the largest values of |t| provide the strongest evidence of differential expression. However, we have, in effect, just performed 20,000 t-tests!

In determining which genes are truly DE, the aim is to control for the large amount of multiple testing that arises from conducting a test for each gene. See Chapter 4 on Multiple Comparisons.

An informal, graphical method that can be used to assess significance is to display the sorted t statistics in a normal quantile-quantile plot, or a t-distribution qq-plot. The idea is to look for points which deviate markedly from the line. The example on the next slide shows a qq-plot for t statistics with 4 degrees of freedom; the experiment compared two mutant cell lines in leukaemic mice on each slide.
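The coordinates of such a qq-plot can be computed as follows. This is a sketch under stated assumptions: it uses a normal reference distribution, because the Python standard library provides normal but not t quantiles (with only 4 degrees of freedom a t reference would be the more natural choice), and the helper `ppoints` mirrors the plotting-position convention of R's `ppoints`. The t statistics are made up for illustration.

```python
from statistics import NormalDist


def ppoints(n):
    """Plotting positions (i - a) / (n + 1 - 2a), following R's ppoints."""
    a = 3 / 8 if n <= 10 else 1 / 2
    return [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]


def normal_qq(t_stats):
    """Return (theoretical quantile, sorted t statistic) pairs."""
    sample = sorted(t_stats)
    std_normal = NormalDist()  # mean 0, SD 1
    theoretical = [std_normal.inv_cdf(p) for p in ppoints(len(sample))]
    return list(zip(theoretical, sample))


# Hypothetical t statistics; the last one is an obvious outlier.
pairs = normal_qq([0.3, -1.1, 0.8, -0.2, 9.5])
for q, t in pairs:
    print(f"{q:7.3f} {t:7.3f}")
```

Genes whose points lie far from the straight line through the bulk of the pairs are the candidates for differential expression.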

[Figure: t qq-plot. x-axis: qt(ppoints(t.statistics1a[, 1]), df = 4); y-axis: t.statistics1a[, 1].]

3.4 More complex experiments: One of the most commonly used designs in biological experiments is the reference design. The simplest such design compares two mRNA samples A and B through a reference sample, Ref. That is, A is compared with Ref, and B is compared with Ref. In terms of log ratios M, for each gene j we now have $M_{Aj} = \log(A_j/\mathrm{Ref})$ and $M_{Bj} = \log(B_j/\mathrm{Ref})$, where A, B are labelled red and Ref is labelled green.

For ease of notation, we will drop the subscript j. In microarray experiments, there will be $n_1$ replicate arrays comparing (i.e. hybridising) A with Ref, and $n_2$ replicate arrays comparing B with Ref. Then the test statistic will be based on $\bar{M}_A - \bar{M}_B$. We know that the optimal normal theory statistic is the two-sample t statistic:

$$t = \frac{\bar{M}_A - \bar{M}_B}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

where $s_p$ is the pooled sample standard deviation.

The null hypothesis for each gene is that the expression levels in the two cell types A and B are the same, i.e., $H_0: \mu_A = \mu_B$ versus $H_a: \mu_A \neq \mu_B$. $s_p$ is sometimes replaced by the penalised pooled sample standard deviation, $\tilde{s}_p = \sqrt{a + s_p^2}$.
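A minimal Python sketch of the pooled two-sample t statistic for one gene follows. The function name `pooled_t` and the data are hypothetical. The optional `penalty` argument assumes the penalised form replaces the pooled standard deviation by the square root of (a + pooled variance); with the default of 0 the ordinary statistic is returned.

```python
import math
import statistics


def pooled_t(m_a, m_b, penalty=0.0):
    """Two-sample t statistic with pooled SD for one gene.

    m_a, m_b: log ratios against Ref from the n1 and n2 replicate arrays.
    penalty: assumed penalty a, added to the pooled variance before the
             square root is taken (0.0 gives the ordinary statistic).
    """
    n1, n2 = len(m_a), len(m_b)
    if n1 < 2 or n2 < 2:
        raise ValueError("need at least two replicate arrays per sample")
    var_a = statistics.variance(m_a)  # sample variance, divisor n1 - 1
    var_b = statistics.variance(m_b)  # sample variance, divisor n2 - 1
    pooled_var = ((n1 - 1) * var_a + (n2 - 1) * var_b) / (n1 + n2 - 2)
    s = math.sqrt(penalty + pooled_var)
    diff = statistics.fmean(m_a) - statistics.fmean(m_b)
    return diff / (s * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical log ratios for one gene.
t_plain = pooled_t([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
t_pen = pooled_t([1.0, 2.0, 3.0], [0.0, 1.0, 2.0], penalty=3.0)
print(t_plain, t_pen)
```

Under the null hypothesis and the normal, equal-variance assumptions, the unpenalised statistic is referred to a t distribution with $n_1 + n_2 - 2$ degrees of freedom.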

The example on the next page is from Dudoit et al. (2002) and shows a histogram of the observed two-sample t statistics, and the normal qq-plot for two-sample t statistics, from a study comparing lipid levels in treated (A) and control (B) mice. There were 16 slides in the experiment, 8 for treated and 8 for control mice, each hybridised to a common reference pool of mouse DNA (Ref).

[Figure: Histogram and normal qq-plot of two-sample t statistics, ApoA1 experiment.]

Remarks on t statistics: The t statistic has the advantage of extending to more complex situations, such as factorial designs and multiple regression. The above approach to analysis can be generalised to more than two samples using F statistics, and so on. However, the two-sample t statistic assumes the random variables $M_A$ and $M_B$ are normally distributed and have equal variances, which may not be justified.

We can relax the equal variance assumption by using the approximate unequal variance form of the two-sample t statistic:

$$t = \frac{\bar{M}_A - \bar{M}_B}{\sqrt{\dfrac{s_A^2}{n_1} + \dfrac{s_B^2}{n_2}}}$$

But there are better, alternative approaches.
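The unequal variance form can be sketched in the same way. The function name `welch_t` and the data are hypothetical. The degrees of freedom returned alongside the statistic use the standard Welch-Satterthwaite approximation, which is not stated in the text above and is included only as a common companion to this statistic.

```python
import math
import statistics


def welch_t(m_a, m_b):
    """Unequal-variance two-sample t statistic and approximate df."""
    n1, n2 = len(m_a), len(m_b)
    if n1 < 2 or n2 < 2:
        raise ValueError("need at least two replicate arrays per sample")
    v1 = statistics.variance(m_a) / n1  # s_A^2 / n1
    v2 = statistics.variance(m_b) / n2  # s_B^2 / n2
    t = (statistics.fmean(m_a) - statistics.fmean(m_b)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation (an addition, not from the text).
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical log ratios with unequal spread and unequal sample sizes.
t, df = welch_t([1.0, 2.0, 3.0], [-1.0, 0.0, 1.0, 2.0, 3.0])
print(t, df)
```

Note that the approximate degrees of freedom are generally not an integer and fall below $n_1 + n_2 - 2$ when the variances differ.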

The rest of this chapter: Nonparametric or distribution-free alternatives to the two-sample t statistic are popular, and we consider two of these in 3.5: the Mann-Whitney test and the permutation test. Computer-intensive testing and estimation procedures are also popular, and in 3.6 we will study bootstrap techniques.