Statistics Applied to Bioinformatics. Tests of homogeneity


Two-tailed test of homogeneity
Two-tailed test: $H_0: m_1 = m_2$
Principle of the test:
Estimate the difference between $m_1$ and $m_2$.
Compare this estimate with the theoretical distribution.
Usually the variance is not known a priori and has to be estimated.
Warning: the variance of a difference is the sum of the variances.
The formula for estimating the variance depends on whether or not the populations are assumed to have similar variances.
The theoretical distribution is thus the Student ($t$) distribution with $k = n_1 + n_2 - 2$ degrees of freedom.
$\alpha$ is shared between the two tails: use the value for $t_{1-\alpha/2}$ in Student's table.
$t_{obs} = \frac{\bar{x}_1 - \bar{x}_2}{\hat{\sigma}_{\bar{X}_1 - \bar{X}_2}}$
Reject $H_0$ if $|t_{obs}| \geq t_{1-\alpha/2}$.

Homogeneity of a difference
The test of homogeneity can be thought of as a test of conformity on the difference between two means.
$H_0: m_1 = m_2 \iff H_0: d = m_1 - m_2 = 0$
This requires an estimate of the variance of the difference between the two means.
The variance of a difference between two independent distributions is the sum of the variances.
The variance of the sampling distribution of the mean (the square of the standard error) is the variance of the corresponding population divided by the sample size.
When the variances of the two populations are known a priori, the two formulae can be combined to obtain the variance of the difference.
Variance of a difference:
$\sigma^2_{\bar{X}_1 - \bar{X}_2} = \sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$
Standard deviation of a difference:
$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2}} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
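As a numerical illustration of the rule above, the following Python sketch computes the variance and the standard deviation of the difference between two sample means when the population variances are known. The function name and the numbers are made up for the example; they do not come from the course.

```python
import math

def variance_of_mean_difference(var1, n1, var2, n2):
    """Variance of (mean1 - mean2) for two independent samples
    drawn from populations with known variances var1 and var2."""
    return var1 / n1 + var2 / n2

# Hypothetical values, for illustration only.
var_diff = variance_of_mean_difference(var1=4.0, n1=16, var2=9.0, n2=36)
sd_diff = math.sqrt(var_diff)
print(var_diff, sd_diff)
```

Each population contributes its variance divided by its own sample size, so a small sample from a highly variable population dominates the uncertainty on the difference.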

Estimating the variance of the difference between sample means
Generally, the variances of the populations ($\sigma_1^2$ and $\sigma_2^2$) are not known a priori. They thus have to be estimated from the two samples.
The variance of the sample is a biased estimate of the variance of the population (see chapter on estimation). Each variance estimate thus needs to be corrected by a factor $n/(n-1)$.
Estimating the variance introduces an additional error, which has to be taken into account when calculating the significance. This is done differently depending on two considerations:
Can we assume that the two populations have the same variance?
Do the two samples have the same size?
$\sigma^2_{\bar{X}_1 - \bar{X}_2} = \sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$
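The correction factor n/(n-1) can be checked numerically. The sketch below (Python standard library, with made-up measurements) computes the biased sample variance, applies the correction, and compares the result with the estimator that divides by n-1 directly.

```python
import statistics

# Hypothetical measurements, for illustration only.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(sample)

s2_biased = statistics.pvariance(sample)    # sum of squares divided by n
s2_corrected = s2_biased * n / (n - 1)      # apply the n/(n-1) correction
s2_unbiased = statistics.variance(sample)   # sum of squares divided by n-1

print(s2_biased, s2_corrected, s2_unbiased)
```

The corrected value and the n-1 estimator coincide, and the correction matters most for small samples, where n/(n-1) is noticeably larger than 1.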

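Populations with the same variance
When one can assume that the two populations have the same variance, the variance of the difference is estimated as follows.
$\hat{\sigma}_1^2 = \hat{\sigma}_2^2 = \hat{\sigma}^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}$
$\hat{\sigma}^2_{\bar{X}_1 - \bar{X}_2} = \frac{\hat{\sigma}^2}{n_1} + \frac{\hat{\sigma}^2}{n_2} = \hat{\sigma}^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)$
If the two samples have the same size ($n_1 = n_2 = n$), this formula can be simplified.
$\hat{\sigma}^2_{\bar{X}_1 - \bar{X}_2} = \frac{n s_1^2 + n s_2^2}{n + n - 2} \left( \frac{1}{n} + \frac{1}{n} \right) = \frac{s_1^2 + s_2^2}{n - 1}$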
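A minimal Python sketch of the pooled estimate, assuming that s² denotes the uncorrected sample variance (sum of squares divided by n), as in the rest of these slides. The function name and the two toy samples are hypothetical.

```python
import statistics

def pooled_variance_of_difference(x1, x2):
    """Estimated variance of (mean1 - mean2), assuming equal population variances."""
    n1, n2 = len(x1), len(x2)
    s1_sq = statistics.pvariance(x1)  # uncorrected sample variance (divisor n1)
    s2_sq = statistics.pvariance(x2)  # uncorrected sample variance (divisor n2)
    pooled = (n1 * s1_sq + n2 * s2_sq) / (n1 + n2 - 2)
    return pooled * (1 / n1 + 1 / n2)

# Hypothetical samples of equal size, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 8.0]
general = pooled_variance_of_difference(x1, x2)

# Simplified formula for equal sample sizes: (s1^2 + s2^2) / (n - 1).
n = len(x1)
simplified = (statistics.pvariance(x1) + statistics.pvariance(x2)) / (n - 1)
print(general, simplified)
```

With samples of equal size the general and the simplified formulas give the same value, which is a quick way to check an implementation.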

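Populations with different variances
When one cannot assume that the two populations have the same variance, the variance of the difference is estimated as follows.
$\hat{\sigma}^2_{\bar{X}_1 - \bar{X}_2} = \frac{n_1 s_1^2}{n_1 (n_1 - 1)} + \frac{n_2 s_2^2}{n_2 (n_2 - 1)} = \frac{s_1^2}{n_1 - 1} + \frac{s_2^2}{n_2 - 1}$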

Unequal variances, large samples
When the two samples are large ($n_1 > 30$ and $n_2 > 30$), the Student distribution converges towards a normal distribution. The significance of the difference between two means can be assessed with the normal distribution.
$u_{obs} = \frac{\bar{x}_1 - \bar{x}_2}{\hat{\sigma}_{\bar{X}_1 - \bar{X}_2}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1 - 1} + \frac{s_2^2}{n_2 - 1}}}$
$u_{obs}$ represents the difference between sample means, relative to the estimated standard deviation of this difference.
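The large-sample test can be sketched with the Python standard library alone, since only the normal distribution is needed. The function name `u_test` and the synthetic samples are assumptions made for the illustration; s² is taken as the uncorrected sample variance (divisor n), so each term of the standard error is s²/(n-1).

```python
import math
from statistics import NormalDist, fmean, pvariance

def u_test(x1, x2):
    """Two-tailed large-sample test of equality of two means (unequal variances)."""
    n1, n2 = len(x1), len(x2)
    se = math.sqrt(pvariance(x1) / (n1 - 1) + pvariance(x2) / (n2 - 1))
    u = (fmean(x1) - fmean(x2)) / se
    # The alpha risk is shared between the two tails of the normal distribution.
    p = 2 * (1 - NormalDist().cdf(abs(u)))
    return u, p

# Synthetic samples of size 40 (> 30), for illustration only.
x1 = [10.0 + (i % 5) for i in range(40)]
x2 = [11.0 + (i % 5) for i in range(40)]
u_obs, p_value = u_test(x1, x2)
print(u_obs, p_value)
```

The sign of `u_obs` indicates which sample has the larger mean, while the two-tailed P-value only depends on its absolute value.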

Equal variances: the Student t-test
When the two samples are not sufficiently large ($n_1 < 30$ or $n_2 < 30$), the same statistic is calculated, but it has to be compared to the Student distribution.
$t_{obs} = \frac{\bar{x}_1 - \bar{x}_2}{\hat{\sigma}_{\bar{X}_1 - \bar{X}_2}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1 - 1} + \frac{s_2^2}{n_2 - 1}}}$
This test is called the Student t-test.
The shape of the Student distribution depends on one parameter: the degrees of freedom ($k$).
Since we assume equal variance, we estimate two parameters for this test: the mean of the difference and the pooled variance. The degrees of freedom are thus $k = n_1 + n_2 - 2$.

Unequal variances: the Welch test
If one cannot assume equality of the variances, the same statistic ($t_{obs}$) can be used, but the number of degrees of freedom $k$ is calculated with the formula below. This test is called the Welch t-test.
$t_{obs} = \frac{\bar{x}_1 - \bar{x}_2}{\hat{\sigma}_{\bar{X}_1 - \bar{X}_2}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1 - 1} + \frac{s_2^2}{n_2 - 1}}}$
$k = \frac{\left( \frac{s_1^2}{n_1 - 1} + \frac{s_2^2}{n_2 - 1} \right)^2}{\frac{1}{n_1 - 1} \left( \frac{s_1^2}{n_1 - 1} \right)^2 + \frac{1}{n_2 - 1} \left( \frac{s_2^2}{n_2 - 1} \right)^2}$
Note: the formula used to compute $k$ in a Welch t-test returns positive real numbers. The number of degrees of freedom no longer needs to be a natural number.
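The degrees of freedom of the Welch test can be sketched as follows, using the standard Welch-Satterthwaite approximation written in terms of the uncorrected sample variance s² (divisor n), so that each term is s²/(n-1). The function name and the toy samples are hypothetical.

```python
from statistics import pvariance

def welch_degrees_of_freedom(x1, x2):
    """Welch-Satterthwaite approximation of the degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    v1 = pvariance(x1) / (n1 - 1)  # estimated variance of the first sample mean
    v2 = pvariance(x2) / (n2 - 1)  # estimated variance of the second sample mean
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Hypothetical samples, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 8.0]
k = welch_degrees_of_freedom(x1, x2)
k_equal = welch_degrees_of_freedom(x1, x1)  # same variance, same size
print(k, k_equal)
```

The result is a real number lying between the smaller of n1-1 and n2-1 and the Student value n1+n2-2. The upper bound is reached when both samples have the same size and the same variance.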

Student or Welch?
When searching for differentially expressed genes, should we apply the Student or the Welch test?
In transcriptome analysis, we generally assume that the variance is roughly proportional to the expression level. We thus cannot assume equality of variance between a condition where a gene is weakly expressed and one where it is strongly expressed.
The Welch test is thus a priori more appropriate to detect differentially expressed genes in transcriptome array profiles.

Selection of differentially expressed genes
The test of homogeneity can be applied to select genes differentially expressed between two experimental conditions, cell types, ...
Example: Golub data
Oligonucleotide arrays were used to measure the level of expression of > 7000 genes in
27 patients suffering from acute lymphoblastic leukemia (ALL)
11 patients suffering from acute myeloblastic leukemia (AML)
An ad hoc preliminary filtering was done by the authors, leaving 3051 genes considered as reliable measurements.
Question: which genes show a significantly different level of expression between the two patient types (ALL and AML, respectively)?
Approach: apply the test of homogeneity to each gene $g$ separately, to test the null hypothesis
$H_0: m_{g,ALL} = m_{g,AML}$
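The gene-by-gene approach can be sketched as a loop that computes one Welch statistic per gene and ranks the genes by absolute value. The gene names and expression values below are invented toy data, not Golub measurements, and the sketch stops at the statistic: converting it to a P-value would require the Student distribution, which is not in the Python standard library.

```python
import math
from statistics import fmean, pvariance

def welch_t(x1, x2):
    """Welch statistic, with s^2 taken as the uncorrected sample variance."""
    se = math.sqrt(pvariance(x1) / (len(x1) - 1) + pvariance(x2) / (len(x2) - 1))
    return (fmean(x1) - fmean(x2)) / se

# Hypothetical toy data: gene -> (values in ALL samples, values in AML samples).
expression = {
    "geneA": ([5.0, 5.2, 4.9, 5.1], [7.0, 7.3, 6.8, 7.1]),
    "geneB": ([3.0, 3.4, 2.9, 3.1], [3.1, 3.0, 3.3, 3.2]),
    "geneC": ([8.0, 7.5, 8.2, 7.9], [6.0, 6.4, 5.9, 6.1]),
}

# One test per gene, each with H0: m_g,ALL = m_g,AML.
t_by_gene = {gene: welch_t(all_, aml) for gene, (all_, aml) in expression.items()}
ranked = sorted(t_by_gene, key=lambda gene: abs(t_by_gene[gene]), reverse=True)
print(ranked)
```

A negative statistic means higher expression in the AML group, a positive one higher expression in the ALL group, given the order of the arguments used here.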

Multiple Student test
Goal: select genes differentially expressed between distinct patient types.
Method:
2 patient types: t-test. The assumption of equal variance is generally not valid, so use the Welch test instead of the Student test.
> 2 patient types: ANOVA (not shown here).
Attention: problem of multi-testing. We are testing several thousands of probes in parallel:
[...],578 for the response to X (data from Thierry Lequerre)
3050 for the cancer type (Golub 1999)
If we accept the conventional alpha risk of 0.01, we expect
[...]6 false positives for the response to X
30 false positives for the cancer type
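The expected number of false positives is simply the number of tests multiplied by the alpha risk. The sketch below reproduces the figure for the cancer type data, assuming alpha = 0.01, which is the value implied by about 30 false positives among 3050 tests. The function name is hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests declared significant when H0 is true for all of them."""
    return n_tests * alpha

# 3050 probes tested for the cancer type, alpha assumed to be 0.01.
expected_cancer_type = expected_false_positives(n_tests=3050, alpha=0.01)
print(expected_cancer_type)  # about 30
```

The expectation grows linearly with the number of tests, which is why a threshold that is reasonable for a single test becomes too permissive for thousands of probes.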

Multi-testing corrections
Bonferroni rule
Reduce the alpha risk according to the number of tests ($T$).
P-value $< 1/T$ (more generally, P-value $< \alpha / T$ keeps the family-wise risk below $\alpha$).
E-value (equivalent to the Bonferroni rule)
Estimate of the expected number of false positives.
E-value $= T \cdot$ P-value
Family-Wise Error Rate (FWER)
Probability of observing at least one false positive in $T$ tests.
FWER $= 1 - (1 - \text{P-value})^T$
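The three corrections are one-line formulas, sketched here in Python. The function names, the nominal P-value and the alpha value are chosen for illustration; the number of tests is the 3050 probes quoted for the cancer type data.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value threshold that keeps the family-wise risk below alpha."""
    return alpha / n_tests

def e_value(p_value, n_tests):
    """Expected number of false positives at this P-value threshold."""
    return p_value * n_tests

def fwer(p_value, n_tests):
    """Probability of at least one false positive among n_tests independent tests."""
    return 1 - (1 - p_value) ** n_tests

n_tests = 3050   # number of probes tested in parallel
p = 1e-4         # hypothetical nominal P-value of one gene
threshold = bonferroni_threshold(alpha=0.01, n_tests=n_tests)
e = e_value(p, n_tests)
f = fwer(p, n_tests)
print(threshold, e, f)
```

The FWER never exceeds the E-value, and the two are close when the E-value is much smaller than 1, which is why both lead to similar selections at stringent thresholds.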

Choice of the estimators for the t-test
A priori, I would recommend using robust estimators not only for standardization, but also for the t-test.
Central tendency: median instead of mean.
Dispersion: IQR instead of standard deviation.
The most significant genes are the same irrespective of this choice, but for some other genes it makes a difference.
Classical estimators: 43 significant genes
Robust estimators: 367 significant genes
Genes selected in both cases: 97.
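The difference between classical and robust estimators shows up as soon as a sample contains an outlier. The sketch below uses made-up values and the Python standard library; how the robust estimates are then plugged into the t statistic is not specified in the slides, so only the estimators themselves are illustrated.

```python
from statistics import mean, median, pstdev, quantiles

def iqr(values):
    """Inter-quartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical expression values; the last one is an outlier.
values = [4.8, 5.0, 5.1, 5.2, 4.9, 5.0, 25.0]

classical = (mean(values), pstdev(values))  # both pulled up by the outlier
robust = (median(values), iqr(values))      # barely affected by the outlier
print(classical, robust)
```

A single aberrant measurement shifts the mean and inflates the standard deviation, whereas the median and the IQR stay close to what the other six values suggest.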

Golub 1999 - Profiles of selected genes
The 367 genes selected by the t-test apparently have different profiles.
Some genes seem greener for the ALL patients (27 leftmost samples).
Some genes seem greener for the AML patients (11 rightmost samples).
However, the relationships between the different genes are not visible on this map.
In the next courses, we will use clustering methods to regroup genes with similar profiles, among those selected as differentially expressed.
[Figure: heat map of the selected genes; columns grouped as ALL, then AML.]