Introduction to Statistical Analysis. Cancer Research UK, 12th of February 2018. D.-L. Couturier / M. Eldridge / M. Fernandes [Bioinformatics core]


Timeline

9:30 Morning
- 45mn Lecture: data types, summary statistics and graphical displays
- 15mn Quiz
10:30 15mn Coffee & Tea break
- 60mn Lecture: some statistical distributions + CLT
- 15mn Exercises & discussion
12:00 Lunch break
13:00 Afternoon
- 45mn Lecture: Parametric tests for the mean
- 30mn Exercises with shiny apps & discussion
- 30mn Lecture: Non-parametric tests for the mean
- 15mn Exercises & discussion
15:00 15mn Coffee & Tea break
- 15mn Lecture: Tests for categorical variables
- 15mn Exercises & discussion
16:00 Group based exercises
- 60mn

Grand Picture of Statistics

- Population: parameters $\mu$, $\sigma^2$, $\pi$.
- Sample: data $(x_1, x_2, \ldots, x_n)$, statistics $\hat\mu$, $\hat\sigma^2$, $\hat\pi$.

Data Types

                               $x_1$    $x_2$    $x_3$    $x_n$
Cancer status                  $C$      $\bar{C}$  $\bar{C}$  $C$
Nucleic acid sequence          C        T        T        A
5-level pain score             3        1        5        4
# of daily admissions at A&E   16       23       12       17
Gene expression intensity      882.1    379.5    528.3    120.9

Summary statistics and plots for qualitative data

5-level answers of 21 patients to the question "How much did pain due to your ureteric stones interfere with your day to day activities?":

3, 1, 5, 3, 1, 1, 1, 5, 1, 3, 4, 1, 1, 4, 5, 5, 5, 5, 5, 4, 4,

where
- 1 = "Not at all",
- 2 = "A little bit",
- 3 = "Somewhat",
- 4 = "Quite a bit",
- 5 = "Very much".
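The usual summary for such ordinal answers is a frequency table. A minimal sketch in Python (the course itself appears to use R; the variable names here are illustrative only):

```python
from collections import Counter
from statistics import median

# Pain-interference answers of the 21 patients, as listed on the slide.
answers = [3, 1, 5, 3, 1, 1, 1, 5, 1, 3, 4, 1, 1, 4, 5, 5, 5, 5, 5, 4, 4]
labels = {1: "Not at all", 2: "A little bit", 3: "Somewhat",
          4: "Quite a bit", 5: "Very much"}

counts = Counter(answers)
n = len(answers)

# Frequency table: absolute and relative frequency of every level,
# including levels that no patient selected.
for level, label in labels.items():
    freq = counts[level]
    print(f"{level} {label:<13} {freq:>2} {freq / n:6.1%}")

# For ordinal data the median level is a natural location summary.
median_level = median(answers)
print("median level:", median_level)
```

Note that level 2 is never chosen, so a table built only from the observed values would silently drop it; iterating over the full set of labels avoids that.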

Summary statistics and plots for quantitative data

Gene expression values of gene CCND3 Cyclin D3 from 27 patients diagnosed with acute lymphoblastic leukaemia, in increasing order:

$x_{(1)}$ to $x_{(9)}$:    0.46  1.11  1.28  1.33  1.37  1.52  1.78  1.81  1.82
$x_{(10)}$ to $x_{(18)}$:  1.83  1.83  1.85  1.90  1.93  1.96  1.99  2.00  2.07
$x_{(19)}$ to $x_{(27)}$:  2.11  2.18  2.18  2.31  2.34  2.37  2.45  2.59  2.77

[Figures: three successive plots of these 27 values on a horizontal axis from 0.0 to 3.5; the second has a vertical axis from 0.0 to 1.0.]
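As an illustration, the main summary statistics of these 27 values can be computed with Python's standard library. This is only a sketch: the values are the rounded ones printed on the slide, and the quartile definition is one of several in common use.

```python
import statistics

# CCND3 Cyclin D3 expression values of the 27 ALL patients (rounded, from the slide).
ccnd3 = [0.46, 1.11, 1.28, 1.33, 1.37, 1.52, 1.78, 1.81, 1.82,
         1.83, 1.83, 1.85, 1.90, 1.93, 1.96, 1.99, 2.00, 2.07,
         2.11, 2.18, 2.18, 2.31, 2.34, 2.37, 2.45, 2.59, 2.77]

n = len(ccnd3)
mean = statistics.mean(ccnd3)
med = statistics.median(ccnd3)          # x_(14) because n = 27 is odd
sd = statistics.stdev(ccnd3)            # sample standard deviation (divisor n - 1)
q1, q2, q3 = statistics.quantiles(ccnd3, n=4)  # default "exclusive" method

print(f"n = {n}, mean = {mean:.3f}, median = {med:.2f}, sd = {sd:.3f}")
print(f"quartiles: {q1:.3f}, {q2:.3f}, {q3:.3f}, IQR = {q3 - q1:.3f}")
```

The mean is pulled below the median by the single low value 0.46, which is the kind of feature the plots on this slide are meant to reveal.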

Two-sample case: independent versus paired samples

Independent samples: permeability constants of a placental membrane at term (X) and between 12 to 26 weeks gestational age (Y).

      1     2     3     4     5     6     7     8     9     10
X   0.80  0.83  1.89  1.04  1.45  1.38  1.91  1.64  0.73  1.46
Y   1.15  0.88  0.90  0.74  1.21

Paired samples: Hamilton depression scale factor measurements in 9 patients with mixed anxiety and depression, taken at the first (X) and second (Y) visit after initiation of a therapy (administration of a tranquilizer).

         1      2      3      4      5      6      7      8      9
X      1.83   0.50   1.62   2.48   1.68   1.88   1.55   3.06   1.30
Y      0.88   0.65   0.60   2.05   1.06   1.29   1.06   3.14   1.29
Y - X  -0.95  0.15   -1.02  -0.43  -0.62  -0.59  -0.49  0.08   -0.01

Quiz Time: Sections 1 to 4
https://docs.google.com/forms/d/e/1faipqlscblq_-ISfSCGp_EIVPPI_mnrJHttaKxln8vVoyjJFvS8BL1w/viewform

Some parametric distributions: Bernoulli distribution

                 $x_1$    $x_2$    $x_3$    $x_n$
Cancer status    $C$      $\bar{C}$  $\bar{C}$  $C$
Coded value      1        0        0        1

If
- n independent experiments,
- the outcome of each experiment is dichotomous (success/failure),
- the probability of success $\pi$ is the same for all experiments,
then each dichotomous experiment, $X_i$, follows a Bernoulli distribution with parameter $\pi$:

$X_i \sim \text{Bernoulli}(\pi)$, with $P(X_i = 1) = \pi$ and $P(X_i = 0) = 1 - \pi$.

Some parametric distributions: Binomial distribution

If
- n independent experiments,
- the outcome of each experiment is dichotomous (success/failure),
- the probability of success $\pi$ is the same for all experiments,
then,
- the number of successes out of n trials (experiments), $Y = \sum_{i=1}^{n} X_i$, follows a binomial distribution with parameters n and $\pi$: $Y \sim \text{Bin}(n, \pi)$,
- the probability of observing exactly y successes out of n experiments is given by

$P(Y = y \mid n, \pi) = \frac{n!}{(n-y)!\,y!}\, \pi^{y} (1-\pi)^{n-y}$.

[Figure: probability (0.00 to 0.25) against the number of successes out of 50 experiments.]
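The probability mass function above translates directly into code. A minimal sketch, in which n = 50 matches the figure but the success probability 0.1 is a hypothetical value chosen only for illustration:

```python
from math import comb

def binomial_pmf(y, n, pi):
    """P(Y = y | n, pi) = n! / ((n - y)! y!) * pi^y * (1 - pi)^(n - y)."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)

# n = 50 as in the figure; pi = 0.1 is an assumed value, not taken from the slide.
n, pi = 50, 0.1
pmf = [binomial_pmf(y, n, pi) for y in range(n + 1)]

# The most likely number of successes and the total probability mass.
mode = max(range(n + 1), key=pmf.__getitem__)
print("mode:", mode, "P(Y = mode):", round(pmf[mode], 4))
print("sum of probabilities:", round(sum(pmf), 12))
```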

Some parametric distributions: Poisson distribution

If, during a time interval or in a given area,
- events occur independently,
- at the same rate,
- and the probability of an event occurring in a small interval (area) is proportional to the length of the interval (size of the area),
then,
- the number of events occurring in a fixed time interval or in a given area, X, may be modelled by means of a Poisson distribution with parameter $\lambda$: $X \sim \text{Poisson}(\lambda)$,
- the probability of observing x events during a fixed time interval or in a given area is given by

$P(X = x \mid \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!}$.

[Figure: probability (0.00 to 0.35) against the number of chronic conditions per patient, 0 to 11 (US National Medical Expenditure Survey).]
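A short sketch of the Poisson probabilities. The rate 1.55 is the sample mean reported for the chronic-conditions data later in the deck; using it as $\lambda$ is an assumption made here for illustration, not a fit reported by the authors.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x | lam) = lam^x * exp(-lam) / x!"""
    return lam**x * exp(-lam) / factorial(x)

# Assumed rate: the sample mean (1.55) quoted for the survey data.
lam = 1.55
probs = [poisson_pmf(x, lam) for x in range(12)]

for x, p in enumerate(probs):
    print(f"P(X = {x:2d}) = {p:.4f}")
```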

Some parametric distributions: Continuous distributions

[Figure: density curves (0.0 to 0.3) of the ex Gaussian, skew t, Gamma, Inverse Gamma and Gaussian distributions, on a horizontal axis from -10 to 20.]

Some parametric distributions: Normal distribution

$X \sim N(\mu, \sigma^2)$, $f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, $E[X] = \mu$, $Var[X] = \sigma^2$,

$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$, $f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}$.

Probability density function, $f_Z(z)$, of a standard normal:

[Figure: standard normal density (0.0 to 0.4) for z between -4 and 4.]

The intervals $\mu \pm \sigma$, $\mu \pm 2\sigma$ and $\mu \pm 3\sigma$ contain 68.27%, 95.45% and 99.73% of the probability mass, respectively.

(i) Suitable modelling for a lot of variables, for example IQ:

[Figures: the CCND3 expression values shown earlier; density of IQ with ticks at 55, 70, 85, 100, 115, 130 and 145 (that is, $\mu = 100$ and $\sigma = 15$), a maximum density of 0.0266, and the 68.27%, 95.45% and 99.73% intervals.]
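The three coverage percentages, and the peak density of the IQ curve, can be checked numerically. A small sketch using only the error function from the standard library:

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / sqrt(2 * pi * sigma**2)

def central_coverage(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normal variable."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sd: {100 * central_coverage(k):.2f}%")

# IQ is read off the figure as N(100, 15^2); the density peaks at the mean.
peak_iq = normal_pdf(100, mu=100, sigma=15)
print(f"peak IQ density: {peak_iq:.4f}")
```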

Some parametric distributions: Normal distribution

(ii) Central limit theorem (Lindeberg-Lévy CLT). Let $(X_1, \ldots, X_n)$ be n independent and identically distributed (iid) random variables drawn from a distribution with expected value $\mu$ and finite variance $\sigma^2$. Then

$\hat\mu = \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \xrightarrow{d} N\left(\mu, \frac{\sigma^2}{n}\right)$.

If $X_i \sim N(\mu, \sigma^2)$, this result is true for all sample sizes.
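A small simulation makes the theorem concrete. The sketch below draws from an exponential distribution with mean 1 and variance 1 (a deliberately skewed choice made here, not one from the slides) and checks that the sample means have mean close to $\mu$ and variance close to $\sigma^2/n$.

```python
import random
import statistics

def sample_means(n, n_rep, rng):
    """Means of n_rep independent samples of size n from Exponential(rate = 1)."""
    return [statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(n_rep)]

rng = random.Random(2018)      # fixed seed so that the run is reproducible
n, n_rep = 30, 5000
means = sample_means(n, n_rep, rng)

# Exponential(1) has mu = 1 and sigma^2 = 1, so the CLT predicts
# a mean of about 1 and a variance of about 1 / n for the sample means.
mean_of_means = statistics.fmean(means)
var_of_means = statistics.variance(means)
print(f"mean of sample means:     {mean_of_means:.4f} (theory: 1)")
print(f"variance of sample means: {var_of_means:.4f} (theory: {1 / n:.4f})")
```

A histogram of `means` would look close to a normal curve even though the individual observations are strongly skewed, which is what the shiny app for this section lets you explore interactively.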

Central limit theorem shiny app: Distribution of the mean
http://bioinformatics.cruk.cam.ac.uk/apps/stats/central-limit-theorem/

95% Confidence interval for $\mu$, the population mean, when $X_i \sim N(\mu, \sigma^2)$

- if $X_i \sim N(\mu, \sigma^2)$, then $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$,
- if $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, then $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$,
- if $\sigma$ is unknown, then $T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim St_{n-1}$.

With $\alpha = 0.05$:

$P\left(\bar{X} - t_{1-\alpha/2,\,n-1} \frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{1-\alpha/2,\,n-1} \frac{s}{\sqrt{n}}\right) = 0.95$.

Density of the Student distribution with $n - 1$ degrees of freedom:

$f_T(t \mid n-1) = \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right) \left[(n-1)\pi\right]^{1/2}} \left(1 + \frac{t^2}{n-1}\right)^{-\frac{n}{2}}$.

[Figure: densities of $X \sim N(0, 1)$, $T \sim St_{30}$, $T \sim St_{10}$ and $T \sim St_{2}$ between -5 and 5. The central 95% intervals are $\pm 1.96$ for $N(0, 1)$, $\pm 2.042$ for $St_{30}$, $\pm 2.228$ for $St_{10}$ and $\pm 4.303$ for $St_{2}$.]

[Figure: the CCND3 expression values shown earlier.]
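As a worked illustration, the interval can be computed for the 27 CCND3 values listed earlier. This is a sketch: the Student quantile for 26 degrees of freedom is hard-coded from standard tables (about 2.056) because the Python standard library has no t distribution, and the data are the rounded values from the slide.

```python
import math
import statistics

# CCND3 Cyclin D3 expression values of the 27 ALL patients (rounded, from the slide).
ccnd3 = [0.46, 1.11, 1.28, 1.33, 1.37, 1.52, 1.78, 1.81, 1.82,
         1.83, 1.83, 1.85, 1.90, 1.93, 1.96, 1.99, 2.00, 2.07,
         2.11, 2.18, 2.18, 2.31, 2.34, 2.37, 2.45, 2.59, 2.77]

# 97.5% quantile of the Student distribution with n - 1 = 26 degrees of
# freedom, taken from standard tables (assumed value, not computed here).
T_975_DF26 = 2.056

n = len(ccnd3)
mean = statistics.mean(ccnd3)
s = statistics.stdev(ccnd3)                 # sample sd, divisor n - 1
half_width = T_975_DF26 * s / math.sqrt(n)
lower, upper = mean - half_width, mean + half_width

print(f"95% CI for mu: ({lower:.3f}, {upper:.3f})")
```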

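95% Confidence interval for $\mu$, the population mean, when $X_i \sim iid(\mu, \sigma^2)$

- CLT: $\bar{X} \xrightarrow{d} N\left(\mu, \frac{\sigma^2}{n}\right)$,
- if $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, then $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$,
- if $\sigma$ is unknown, then $T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim St_{n-1}$.

With $\alpha = 0.05$, the interval holds approximately for large n:

$P\left(\bar{X} - t_{1-\alpha/2,\,n-1} \frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{1-\alpha/2,\,n-1} \frac{s}{\sqrt{n}}\right) = 0.95$.

[Figure: probability (0.00 to 0.35) against the number of chronic conditions per patient, 0 to 11 (US National Medical Expenditure Survey: n = 4406, $\hat\mu = 1.55$).]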

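95% Confidence interval for $\mu_Y - \mu_X$, the difference between population means

If we have
- $X_i \sim iid(\mu_X, \sigma_X^2)$, $i = 1, \ldots, n_X$,
- $Y_i \sim iid(\mu_Y, \sigma_Y^2)$, $i = 1, \ldots, n_Y$,
then
- if $\sigma_X^2 = \sigma_Y^2$ [Student's t-test equation],

$CI(\mu_Y - \mu_X, 0.95) = (\bar{Y} - \bar{X}) \pm t_{1-\alpha/2,\,n_X+n_Y-2}\; s_p \sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}$, where $s_p = \sqrt{\frac{(n_X - 1)s_X^2 + (n_Y - 1)s_Y^2}{n_X + n_Y - 2}}$,

- if $\sigma_X^2 \neq \sigma_Y^2$ [Welch-Satterthwaite's t-test equation],

$CI(\mu_Y - \mu_X, 0.95) = (\bar{Y} - \bar{X}) \pm t_{1-\alpha/2,\,df} \sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}$, where $df = \frac{\left(\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}\right)^2}{\frac{\left(s_X^2/n_X\right)^2}{n_X - 1} + \frac{\left(s_Y^2/n_Y\right)^2}{n_Y - 1}}$.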
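The two ingredients that differ between the Student and Welch intervals, the pooled standard deviation and the Welch-Satterthwaite degrees of freedom, are easy to compute from summary statistics. A sketch with hypothetical helper names and made-up variances (no figures from the course data are used here):

```python
import math

def pooled_sd(s2_x, n_x, s2_y, n_y):
    """Pooled standard deviation s_p used by Student's t-test."""
    return math.sqrt(((n_x - 1) * s2_x + (n_y - 1) * s2_y) / (n_x + n_y - 2))

def welch_df(s2_x, n_x, s2_y, n_y):
    """Welch-Satterthwaite approximation of the degrees of freedom."""
    v_x, v_y = s2_x / n_x, s2_y / n_y
    return (v_x + v_y) ** 2 / (v_x**2 / (n_x - 1) + v_y**2 / (n_y - 1))

def standard_errors(s2_x, n_x, s2_y, n_y):
    """Standard error of (Ybar - Xbar) under the Student and Welch assumptions."""
    student = pooled_sd(s2_x, n_x, s2_y, n_y) * math.sqrt(1 / n_x + 1 / n_y)
    welch = math.sqrt(s2_x / n_x + s2_y / n_y)
    return student, welch

# Hypothetical variances for two groups of unequal size.
print("pooled sd:", round(pooled_sd(1.0, 27, 3.0, 11), 4))
print("Welch df: ", round(welch_df(1.0, 27, 3.0, 11), 2))
print("SEs:      ", standard_errors(1.0, 27, 3.0, 11))
```

With equal variances and equal sample sizes the Welch degrees of freedom reduce to $n_X + n_Y - 2$; otherwise they are smaller, which widens the interval.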

Central limit theorem shiny app: Coverage of Student's asymptotic confidence intervals
http://bioinformatics.cruk.cam.ac.uk/apps/stats/central-limit-theorem/

Quiz Time: Practical 1
http://bioinformatics-core-shared-training.github.io/IntroductionToStats/practical.html

PART II: Parametric tests. Cancer Research UK, 24th of April 2017. D.-L. Couturier / M. Dunning / M. Eldridge [Bioinformatics core]

Grand Picture of Statistics

- Population: parameters $\mu$, $\sigma^2$, $\pi$.
- Sample: data $(x_1, x_2, \ldots, x_n)$, statistics $\hat\mu$, $\hat\sigma^2$, $\hat\pi$.

Statistical hypothesis testing

A hypothesis test describes a phenomenon by means of two non-overlapping idealised models/descriptions:
- the null hypothesis H0, generally assumed to be true until evidence indicates otherwise,
- the alternative hypothesis H1.

The aim of the test is to reject the null hypothesis in favour of the alternative hypothesis, and conclude, with a probability $\alpha$ of being wrong, that the idealised model/description of H1 is true.

Theory 1: Dieters lose more fat than the exercisers
Theory 2: There is no majority for Brexit now
Theory 3: Serum vitamin C is reduced in patients

Statistical hypothesis testing

Several-step process:
- Define H0 and H1 according to a theory,
- Set $\alpha$, the probability of rejecting H0 when it is true (type I error),
- Define n, the sample size, allowing you to reject H0 when H1 is true with a probability $1 - \beta$ (Power),
- Determine the test statistic to be used,
- Collect the data,
- Perform the statistical test, define the p-value, and reject (or not) the null hypothesis.

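Statistical hypothesis testing
Example: One-sample two-sided t-test

We test: $H_0: \mu_{IQ} = 100$ against $H_1: \mu_{IQ} \neq 100$.

We have $X_i \sim N(\mu, \sigma^2)$, $i = 1, \ldots, n$.

We know
- $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$,
- $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$.

Thus, if H0 is true, we have:
- $Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$.

Define the p-value:
- p-value $= P(|T| > |T_{obs}|)$.

[Figures: density of IQ with ticks at 55, 70, 85, 100, 115, 130 and 145, a maximum density of 0.0266 and the 68.27%, 95.45% and 99.73% intervals; standard normal density (0.0 to 0.4) between -4 and 4.]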

Statistical hypothesis testing: 4 possible outcomes

Conclude:
- if p-value $> \alpha$: do not reject H0,
- if p-value $< \alpha$: reject H0 in favour of H1.

                     Test outcome
Unknown truth        H0 not rejected    H1 accepted
H0 true              $1 - \alpha$       $\alpha$
H1 true              $\beta$            $1 - \beta$

where
- $\alpha$ is the type I error,
- $\beta$ is the type II error.

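Statistical hypothesis testing
Example: One-sided binomial exact test

We test: $H_0: \pi = 5\%$ against $H_1: \pi > 5\%$.

We have $X_i \sim \text{Bernoulli}(\pi)$, $i = 1, \ldots, n$.

We know
- $Y = \sum_{i=1}^{n} X_i \sim \text{Binomial}(\pi, n)$.

Thus, if H0 is true, we have:
- $Y = \sum_{i=1}^{n} X_i \sim \text{Binomial}(5\%, n)$.

Define the p-value:
- p-value $= P(Y \geq Y_{obs})$, the probability under H0 of a result at least as extreme as the one observed.

[Figure: probabilities (0.0 to 0.4) of 0 to 14 successes under H0, for n = 25 and n = 100.]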
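The exact p-value is a tail sum of binomial probabilities. A minimal sketch, where n = 25 matches one of the curves in the figure but the observed count of 4 successes is a hypothetical value used only for illustration:

```python
from math import comb

def binom_test_greater(y_obs, n, pi0):
    """One-sided exact p-value P(Y >= y_obs) when Y ~ Binomial(pi0, n)."""
    return sum(comb(n, y) * pi0**y * (1 - pi0) ** (n - y)
               for y in range(y_obs, n + 1))

# H0: pi = 5%; y_obs = 4 out of n = 25 is a made-up observation.
p_value = binom_test_greater(4, 25, 0.05)
print(f"p-value = {p_value:.4f}")
```

Because the distribution is discrete, only a limited set of p-values can occur for a given n, so the attained type I error of the exact test is usually below the nominal $\alpha$.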

Two-sample two-sided Student's & Welch's t-tests

[Figure: intensity expression of gene 'CCND3 Cyclin D3' (-0.5 to 2.5) for acute myeloid leukemia (AML, n = 11) and acute lymphoblastic leukemia (ALL, n = 27).]

We test $H_0: \mu_Y - \mu_X = 0$ against $H_1: \mu_Y - \mu_X \neq 0$.

We know:
- Student's t-test [assume $\sigma_X^2 = \sigma_Y^2$]:

$\frac{(\bar{Y} - \bar{X}) - (\mu_Y - \mu_X)}{s_p \sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}} \sim t_{n_X + n_Y - 2}$,

- Welch's t-test [assume $\sigma_X^2 \neq \sigma_Y^2$]:

$\frac{(\bar{Y} - \bar{X}) - (\mu_Y - \mu_X)}{\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}} \sim t_{df}$.

Output of Student's t-test:

    Two Sample t-test
    data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]
    t = 6.7983, df = 36, p-value = 6.046e-08
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval: 0.8829143 1.6336690
    sample estimates: mean of x 1.8938826, mean of y 0.6355909

Output of Welch's t-test:

    Welch Two Sample t-test
    data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]
    t = 6.3186, df = 16.118, p-value = 9.871e-06
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval: 0.8363826 1.6802008
    sample estimates: mean of x 1.8938826, mean of y 0.6355909

F-test of equality of variances

[Figure: intensity expression of gene 'CCND3 Cyclin D3' (-0.5 to 2.5) for acute myeloid leukemia (AML, n = 11) and acute lymphoblastic leukemia (ALL, n = 27).]

We test $H_0: \sigma_Y^2 = \sigma_X^2$ against $H_1: \sigma_Y^2 \neq \sigma_X^2$.

We know:
- F-test [assume $X_i \sim N(\mu_X, \sigma_X^2)$ and $Y_i \sim N(\mu_Y, \sigma_Y^2)$]:

$\frac{s_Y^2}{s_X^2} \sim F_{n_Y - 1,\, n_X - 1}$.

    F test to compare two variances
    data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]
    F = 0.71164, num df = 26, denom df = 10, p-value = 0.4652
    alternative hypothesis: true ratio of variances is not equal to 1
    95 percent confidence interval: 0.2127735 1.8428387
    sample estimates: ratio of variances 0.7116441

Multiplicity correction

For each test, the probability of rejecting H0 (and accepting H1) when H0 is true equals $\alpha$.

For k tests, the probability of rejecting H0 (and accepting H1) at least once when H0 is true, $\alpha_k$, is given by

$\alpha_k = 1 - (1 - \alpha)^k$.

Thus, for $\alpha = 0.05$,
- if k = 1, $\alpha_1 = 1 - (1 - \alpha)^1 = 0.05$,
- if k = 2, $\alpha_2 = 1 - (1 - \alpha)^2 = 0.0975$,
- if k = 10, $\alpha_{10} = 1 - (1 - \alpha)^{10} = 0.4013$.

Idea: change the level of each test so that $\alpha_k = 0.05$:
- Bonferroni correction: $\alpha = \frac{\alpha_k}{k}$,
- Dunn-Sidak correction: $\alpha = 1 - (1 - \alpha_k)^{1/k}$.
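These formulas are simple enough to verify directly. A short sketch (the function names are illustrative; the formula assumes independent tests):

```python
def familywise_error(alpha, k):
    """Probability of at least one false rejection among k independent tests."""
    return 1 - (1 - alpha) ** k

def bonferroni(alpha_k, k):
    """Per-test level given by the Bonferroni correction."""
    return alpha_k / k

def dunn_sidak(alpha_k, k):
    """Per-test level given by the Dunn-Sidak correction."""
    return 1 - (1 - alpha_k) ** (1 / k)

for k in (1, 2, 10):
    print(f"k = {k:2d}: alpha_k = {familywise_error(0.05, k):.4f}")

# Per-test levels that keep the overall level at 0.05 for k = 10 tests.
print("Bonferroni:", bonferroni(0.05, 10))
print("Dunn-Sidak:", round(dunn_sidak(0.05, 10), 6))
```

Dunn-Sidak restores the overall level of 0.05 exactly under independence, while Bonferroni is slightly more conservative.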

Introduction to Shiny Apps and Exercises

PART III: Non-parametric tests. Cancer Research UK, 24th of April 2017. D.-L. Couturier / M. Dunning / M. Eldridge [Bioinformatics core]

Parametric or non-parametric?

[Diagram: suitability of the t-test according to whether the outcome(s) are normally distributed (Yes / Mildly / No) and according to the sample size (Small / Medium / Large).]

Situations which may suggest the use of non-parametric statistics:
- When there is a small sample size or very unequal groups,
- When the data have notable outliers,
- When one outcome has a distribution other than normal,
- When the data are ordered with many ties or are rank ordered.

Sign test

A location model is assumed for $X_i$, $i = 1, \ldots, n$:

$X_i = \theta + e_i$, where $e_i \sim iid(\mu_e = 0, \sigma_e^2)$.

Interest for $H_0: \theta = \theta_0$ against $H_1: \theta < \theta_0$ or $\theta \neq \theta_0$ or $\theta > \theta_0$.

Test statistic: $S = \sum_{i=1}^{n} \mathbb{1}(X_i - \theta_0 > 0)$.

Distribution of S under H0: $S \sim \text{Binomial}(0.5, n)$.

    Exact binomial test
    data: 21 and 27
    number of successes = 21, number of trials = 27, p-value = 0.005925
    alternative hypothesis: true probability of success is not equal to 0.5
    95 percent confidence interval: 0.5774169 0.9137831
    sample estimates: probability of success 0.7777778

[Figures: the CCND3 expression values shown earlier; probability (0.0 to 0.2) against the number of successes out of 27 experiments.]
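The sign test can be reproduced by hand on the 27 CCND3 values listed earlier. The sketch below assumes $\theta_0 = 1.75$, the null value quoted in the Wilcoxon output of the next slide; with that value, 21 of the 27 observations lie above $\theta_0$, which agrees with the "21 and 27" in the output above.

```python
from math import comb

# CCND3 Cyclin D3 expression values of the 27 ALL patients (rounded, from the slide).
ccnd3 = [0.46, 1.11, 1.28, 1.33, 1.37, 1.52, 1.78, 1.81, 1.82,
         1.83, 1.83, 1.85, 1.90, 1.93, 1.96, 1.99, 2.00, 2.07,
         2.11, 2.18, 2.18, 2.31, 2.34, 2.37, 2.45, 2.59, 2.77]
theta0 = 1.75   # assumed null value, taken from the Wilcoxon output

def sign_test_two_sided(s, n):
    """Two-sided exact p-value for S ~ Binomial(0.5, n)."""
    upper = sum(comb(n, k) for k in range(s, n + 1)) / 2**n
    lower = sum(comb(n, k) for k in range(0, s + 1)) / 2**n
    return min(1.0, 2 * min(lower, upper))

n = sum(x != theta0 for x in ccnd3)   # observations equal to theta0 are dropped
s = sum(x > theta0 for x in ccnd3)    # number of positive signs
p_value = sign_test_two_sided(s, n)
print(f"S = {s} out of n = {n}, p-value = {p_value:.6f}")
```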

Wilcoxon signed-rank test

A location model is assumed for $X_i$, $i = 1, \ldots, n$:

$X_i = \theta + e_i$, where $e_i \sim iid(\mu_e = 0, \sigma_e^2)$.

[Figure: the CCND3 expression values shown earlier.]

Interest for $H_0: \theta = \theta_0$ against $H_1: \theta < \theta_0$ or $\theta \neq \theta_0$ or $\theta > \theta_0$.

Test statistic: $W^+ = \sum_{i=1}^{n} \mathbb{1}(X_i - \theta_0 > 0)\, \text{Rank}(|X_i - \theta_0|)$.

Distribution of $W^+$ under H0: $W^+$ has no closed-form distribution.

    Wilcoxon signed rank test
    data: golub[1042, gol.fac == "ALL"]
    V = 268, p-value = 0.05847
    alternative hypothesis: true location is not equal to 1.75
    95 percent confidence interval: 1.73868 2.09106
    sample estimates: (pseudo)median 1.926475
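The statistic $W^+$ can be computed by hand from the 27 rounded CCND3 values with $\theta_0 = 1.75$. The sketch below uses average ranks for ties, which is an assumption about how ties are handled; on these data it gives the same value as the V reported above.

```python
# CCND3 Cyclin D3 expression values of the 27 ALL patients (rounded, from the slide).
ccnd3 = [0.46, 1.11, 1.28, 1.33, 1.37, 1.52, 1.78, 1.81, 1.82,
         1.83, 1.83, 1.85, 1.90, 1.93, 1.96, 1.99, 2.00, 2.07,
         2.11, 2.18, 2.18, 2.31, 2.34, 2.37, 2.45, 2.59, 2.77]
theta0 = 1.75

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1      # average of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

diffs = [x - theta0 for x in ccnd3 if x != theta0]
ranks = average_ranks([abs(d) for d in diffs])

# W+ is the sum of the ranks of the positive differences.
w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
print("W+ =", w_plus)
```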

Mann-Whitney-Wilcoxon test: Shift in location

Let
- $X_i \sim iid(\mu_X, \sigma^2)$, $i = 1, \ldots, n_X$,
- $Y_i \sim iid(\mu_X + \Delta, \sigma^2)$, $i = 1, \ldots, n_Y$.

[Figure: intensity expression of gene 'CCND3 Cyclin D3' (-0.5 to 2.5) for acute myeloid leukemia (AML, n = 11) and acute lymphoblastic leukemia (ALL, n = 27).]

Interest for $H_0: \Delta = 0$ against $H_1: \Delta < 0$ or $\Delta \neq 0$ or $\Delta > 0$.

Standardised test statistic:

$z = \frac{\sum_{i=1}^{n_Y} R(Y_i) - \left[n_Y (n_X + n_Y + 1)/2\right]}{\sqrt{n_X n_Y (n_X + n_Y + 1)/12}}$,

where $R(Y_i)$ denotes the rank of $Y_i$ amongst the combined samples, i.e., amongst $(X_1, \ldots, X_{n_X}, Y_1, \ldots, Y_{n_Y})$.

Distribution of Z under H0: $Z \sim N(0, 1)$.

    Implementation 1:
    statistic = -4.361334, p-value = 1.292716e-05

    Implementation 2:
    W = 284, p-value = 6.15e-07
    alternative hypothesis: true location shift is not equal to 0
    95 percent confidence interval: 0.89647 1.57023
    sample estimates: difference in location 1.21951

[Figure: standard normal density (0.0 to 0.4) between -4 and 4.]
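The two implementations can be reconciled by hand. The sketch below assumes that the W of Implementation 2 is the rank sum of the ALL sample minus its minimum possible value $n(n+1)/2$ (the convention used by R's `wilcox.test`); under that assumption the standardised statistic and the normal p-value agree with Implementation 1 up to the sign, which depends on which group's ranks are summed.

```python
from math import erfc, sqrt

n_all, n_aml = 27, 11   # sample sizes from the slide
w = 284                 # W reported by Implementation 2 (assumed: U statistic of ALL)

# Recover the rank sum of the ALL sample from W.
rank_sum_all = w + n_all * (n_all + 1) / 2

# Mean and standard deviation of that rank sum under H0.
expected = n_all * (n_all + n_aml + 1) / 2
sd = sqrt(n_all * n_aml * (n_all + n_aml + 1) / 12)

z = (rank_sum_all - expected) / sd
p_value = erfc(abs(z) / sqrt(2))     # two-sided p-value, 2 * (1 - Phi(|z|))

print(f"z = {z:.6f}, p-value = {p_value:.6e}")
```

The smaller p-value reported by Implementation 2 is consistent with an exact rather than asymptotic calculation, although the slide does not say so.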

Non-parametric is not assumption free
Shift in location tests when H0 is true

Simulate 2500 samples with
- $X_i \sim \text{Uniform}(1.5, 2.5)$, $i = 1, \ldots, n_X$,
- $Y_i \sim \text{Uniform}(0, 4)$, $i = 1, \ldots, n_Y$,
so that $E[X_i] = E[Y_i] = 2$ (i.e., same mean, same median).

Assume
- $X_i \sim iid(\mu_X, \sigma^2)$, $i = 1, \ldots, n_X$,
- $Y_i \sim iid(\mu_X + \Delta, \sigma^2)$, $i = 1, \ldots, n_Y$.

Test $H_0: \Delta = 0$ against $H_1: \Delta \neq 0$, at the 5% level, by means of
- Mann-Whitney-Wilcoxon test (MWW),
- T-test,
- Welch-test.

Empirical level $\hat\alpha$ of each test:

Sample size              MWW      Student's t-test    Welch's test
$n_X = 200$, $n_Y = 70$    0.145    0.202               0.055
$n_X = 20$, $n_Y = 7$      0.148    0.240               0.062
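The MWW column of the table can be approximated with a short simulation. This is a sketch under stated assumptions: it uses the normal approximation of the standardised statistic without continuity correction and an arbitrary seed, so the rejection rate will be close to, but not exactly, the value in the table.

```python
import math
import random

def mww_z(x, y):
    """Standardised Mann-Whitney-Wilcoxon statistic based on the rank sum of y."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_y = sum(rank for rank, (_, group) in enumerate(combined, start=1)
                     if group == 1)
    n_x, n_y = len(x), len(y)
    expected = n_y * (n_x + n_y + 1) / 2
    sd = math.sqrt(n_x * n_y * (n_x + n_y + 1) / 12)
    return (rank_sum_y - expected) / sd

def rejection_rate(n_x, n_y, n_sim=2500, seed=1):
    """Share of simulated data sets for which H0 is rejected at the 5% level."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        x = [rng.uniform(1.5, 2.5) for _ in range(n_x)]
        y = [rng.uniform(0.0, 4.0) for _ in range(n_y)]
        if abs(mww_z(x, y)) > 1.959964:   # 97.5% quantile of N(0, 1)
            rejections += 1
    return rejections / n_sim

rate = rejection_rate(200, 70)
print(f"empirical level of MWW: {rate:.3f}")
```

Both samples have the same mean and median, yet the test rejects far more often than 5% of the time because the equal-variance part of the shift model is violated.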

Exercises

PART IV: Tests for categorical variables. Cancer Research UK, 24th of April 2017. D.-L. Couturier / M. Dunning / M. Eldridge [Bioinformatics core]

$\chi^2$ goodness-of-fit test

A trial to assess the effectiveness of a new treatment versus a placebo in reducing tumour size in patients with ovarian cancer:

Observed frequencies

                 Binary outcome
Group            Tumour did not shrink    Tumour did shrink    Total
Treatment        44                       40                   (84)
Placebo          24                       16                   (40)
Total            (68)                     (56)                 (124)

- H0: No association between treatment group and tumour shrinkage,
- H1: Some association.

Expected frequencies under H0

                 Binary outcome
Group            Tumour did not shrink                    Tumour did shrink                        Total
Treatment        $\frac{84 \times 68}{124} = 46.06$       $\frac{84 \times 56}{124} = 37.94$       (84)
Placebo          $\frac{40 \times 68}{124} = 21.94$       $\frac{40 \times 56}{124} = 18.06$       (40)
Total            (68)                                     (56)                                     (124)

We have 2 categorical variables with a total of J = 4 cells (categories).
- $H_0: \pi_j = \pi_{j0}$, $j = 1, \ldots, J$,
- $H_1: \pi_j \neq \pi_{j0}$ for at least one j.

$\chi^2$-test: $\sum_{j=1}^{J} \frac{(O_j - E_j)^2}{E_j} \sim \chi^2(J - 1)$.

When the expected frequencies are estimated from the margins, as in this 2 x 2 test of association, the degrees of freedom are (2 - 1)(2 - 1) = 1, which is the value in the output below.

    Pearson's Chi-squared test with Yates' continuity correction
    data: M
    X-squared = 0.36474, df = 1, p-value = 0.5459
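The expected frequencies and the test statistic can be recomputed from the observed table. A sketch that applies the Yates continuity correction named in the output above; the p-value uses the closed form of the $\chi^2$ survival function for 1 degree of freedom, so it does not generalise to larger tables.

```python
from math import erfc, sqrt

# Observed 2 x 2 table: rows = Treatment, Placebo; columns = did not shrink, did shrink.
observed = [[44, 40], [24, 16]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected frequencies under H0: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

def chi_square(observed, expected, yates=False):
    """Pearson statistic, optionally with Yates' continuity correction."""
    stat = 0.0
    for o_row, e_row in zip(observed, expected):
        for o, e in zip(o_row, e_row):
            diff = abs(o - e)
            if yates:
                diff = max(0.0, diff - 0.5)
            stat += diff**2 / e
    return stat

stat_yates = chi_square(observed, expected, yates=True)
p_value = erfc(sqrt(stat_yates / 2))   # P(chi-square with 1 df > stat)

print("expected:", [[round(e, 2) for e in row] for row in expected])
print(f"X-squared = {stat_yates:.5f}, p-value = {p_value:.4f}")
```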

Fisher's exact test of independence

The $\chi^2$ goodness-of-fit test is not suitable when
- n is small,
- $E_j < 5$ for at least one cell.

Observed frequencies

                 Binary outcome
Group            Tumour did not shrink    Tumour did shrink    Total
Treatment        44                       40                   (84)
Placebo          24                       16                   (40)
Total            (68)                     (56)                 (124)

Fisher showed that, under H0 (independence),

$P(\text{observed table} \mid H_0) = P(X = a)$ and $X \sim \text{Hypergeometric}(n, a + c, a + b)$,

where a, b, c and d denote the four cell counts read row by row.

To compute Fisher's test:
- Define $P(X = a)$ for all possible tables having the observed marginal counts,
- Calculate the p-value by summing the probabilities of all these tables whose probability is equal to or smaller than that of the observed table.

    Fisher's Exact Test for Count Data
    data: M
    p-value = 0.4471
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval: 0.3160593 1.6790135
    sample estimates: odds ratio 0.7351707
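The two steps above can be sketched with exact integer arithmetic. The function name is hypothetical, and the small relative tolerance used when comparing probabilities is an implementation choice to guard against rounding, not part of the definition.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    denominator = comb(total, row1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # given the observed row and column totals.
        return comb(col1, x) * comb(total - col1, row1 - x) / denominator

    lowest = max(0, row1 - (total - col1))
    highest = min(row1, col1)
    p_observed = prob(a)

    # Sum the probabilities of all tables that are no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-7))

p_value = fisher_exact_two_sided(44, 40, 24, 16)
print(f"p-value = {p_value:.4f}")
```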