Analysis of Variance (ANOVA) Cancer Research UK 10 th of May 2018 D.-L. Couturier / R. Nicholls / M. Fernandes

Similar documents
CHI SQUARE ANALYSIS 8/18/2011 HYPOTHESIS TESTS SO FAR PARAMETRIC VS. NON-PARAMETRIC

Analysis of Variance

Introduction to Statistical Analysis. Cancer Research UK 12 th of February 2018 D.-L. Couturier / M. Eldridge / M. Fernandes [Bioinformatics core]

More about Single Factor Experiments

PSY 307 Statistics for the Behavioral Sciences. Chapter 20 Tests for Ranked Data, Choosing Statistical Tests

Statistics for EES Factorial analysis of variance

Lecture 7: Hypothesis Testing and ANOVA

Comparing Several Means

Chapter 12. Analysis of variance

Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) Comparing the means of more than two groups

22s:152 Applied Linear Regression. Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA)

Lec 1: An Introduction to ANOVA

Independent Samples t tests. Background for Independent Samples t test

Math 141. Lecture 16: More than one group. Albyn Jones 1. jones/courses/ Library 304. Albyn Jones Math 141

Selection should be based on the desired biological interpretation!

Booklet of Code and Output for STAC32 Final Exam

SEVERAL μs AND MEDIANS: MORE ISSUES. Business Statistics

22s:152 Applied Linear Regression. Take random samples from each of m populations.

STAT 135 Lab 9 Multiple Testing, One-Way ANOVA and Kruskal-Wallis

Introduction to Statistical Inference Lecture 10: ANOVA, Kruskal-Wallis Test

I i=1 1 I(J 1) j=1 (Y ij Ȳi ) 2. j=1 (Y j Ȳ )2 ] = 2n( is the two-sample t-test statistic.

4/6/16. Non-parametric Test. Overview. Stephen Opiyo. Distinguish Parametric and Nonparametric Test Procedures

Statistics - Lecture 05

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

Biostatistics 270 Kruskal-Wallis Test 1. Kruskal-Wallis Test

Turning a research question into a statistical question.

MAT3378 (Winter 2016)

1-Way ANOVA MATH 143. Spring Department of Mathematics and Statistics Calvin College

Nonparametric Location Tests: k-sample

Statistiek II. John Nerbonne using reworkings by Hartmut Fitz and Wilbert Heeringa. February 13, Dept of Information Science

On Assumptions. On Assumptions

An Analysis of College Algebra Exam Scores December 14, James D Jones Math Section 01

Exam details. Final Review Session. Things to Review

5 Inferences about a Mean Vector

Section 4.6 Simple Linear Regression

22s:152 Applied Linear Regression. There are a couple commonly used models for a one-way ANOVA with m groups. Chapter 8: ANOVA

Degrees of freedom df=1. Limitations OR in SPSS LIM: Knowing σ and µ is unlikely in large

Lecture 5: ANOVA and Correlation

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami

HYPOTHESIS TESTING II TESTS ON MEANS. Sorana D. Bolboacă

Rank-Based Methods. Lukas Meier

Central Limit Theorem ( 5.3)

22s:152 Applied Linear Regression. 1-way ANOVA visual:

One-Way Analysis of Variance: ANOVA

Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p.

Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2

Nonparametric Statistics. Leah Wright, Tyler Ross, Taylor Brown

ANOVA Situation The F Statistic Multiple Comparisons. 1-Way ANOVA MATH 143. Department of Mathematics and Statistics Calvin College

Non-parametric (Distribution-free) approaches p188 CN

Lecture 11 Analysis of variance

Contents. Acknowledgments. xix

610 - R1A "Make friends" with your data Psychology 610, University of Wisconsin-Madison

Chapter 15: Nonparametric Statistics Section 15.1: An Overview of Nonparametric Statistics

Statistical Tests for Computational Intelligence Research and Human Subjective Tests

ANOVA: Analysis of Variance

Basic Business Statistics, 10/e

Other hypotheses of interest (cont d)

One-way ANOVA Model Assumptions

Epidemiology Principles of Biostatistics Chapter 10 - Inferences about two populations. John Koval

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /8/2016 1/38

4.1. Introduction: Comparing Means

Chapter 7: Statistical Inference (Two Samples)

Statistics for Managers Using Microsoft Excel Chapter 10 ANOVA and Other C-Sample Tests With Numerical Data

Chapter Seven: Multi-Sample Methods 1/52

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

10 One-way analysis of variance (ANOVA)

An introduction to biostatistics: part 1

Multiple Sample Numerical Data

Analysis of variance and regression. April 17, Contents Comparison of several groups One-way ANOVA. Two-way ANOVA Interaction Model checking

MATH Notebook 3 Spring 2018

Hypothesis testing, part 2. With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal

Week 7.1--IES 612-STA STA doc

Comparison of two samples

3 Joint Distributions 71

13: Additional ANOVA Topics

Analysis of Variance. Read Chapter 14 and Sections to review one-way ANOVA.

Lec 3: Model Adequacy Checking

Linear Combinations of Group Means

unadjusted model for baseline cholesterol 22:31 Monday, April 19,

Week 14 Comparing k(> 2) Populations

STAT 263/363: Experimental Design Winter 2016/17. Lecture 1 January 9. Why perform Design of Experiments (DOE)? There are at least two reasons:

Introduction to Analysis of Variance (ANOVA) Part 2

ANOVA: Comparing More Than Two Means

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL - MAY 2005 EXAMINATIONS STA 248 H1S. Duration - 3 hours. Aids Allowed: Calculator

Statistics 210 Statistical Methods. Hal S. Stern Department of Statistics University of California, Irvine

ANOVA: Analysis of Variation

(Where does Ch. 7 on comparing 2 means or 2 proportions fit into this?)

Inferential Statistics

Non-parametric tests, part A:

= 1 i. normal approximation to χ 2 df > df

Nonparametric Statistics Notes

Elementary Statistics for the Biological and Life Sciences

1 One-way Analysis of Variance

sphericity, 5-29, 5-32 residuals, 7-1 spread and level, 2-17 t test, 1-13 transformations, 2-15 violations, 1-19

2 Hand-out 2. Dr. M. P. M. M. M c Loughlin Revised 2018

Data analysis and Geostatistics - lecture VII

Outline. Topic 19 - Inference. The Cell Means Model. Estimates. Inference for Means Differences in cell means Contrasts. STAT Fall 2013

Review for Final. Chapter 1 Type of studies: anecdotal, observational, experimental Random sampling

Analysis of variance. April 16, Contents Comparison of several groups

Transcription:

Analysis of Variance (ANOVA) Cancer Research UK 10 th of May 2018 D.-L. Couturier / R. Nicholls / M. Fernandes

2 Quick review: Normal distribution Y N(µ, σ 2 ), f Y (y) = 1 2πσ 2 (y µ)2 e 2σ 2 E[Y ] = µ, Var[Y ] = σ 2, Z = Y µ N(0, 1), f Z (z) = 1 e z2 2. σ 2π Probability density function of a normal distribution: 0.4 0.3 0.2 0.1 0.0 µ 3σ µ 2σ µ σ µ µ + σ µ + 2σ µ + 3σ 68.27% 95.45% 99.73%

2 Quick review: Normal distribution Y N(µ, σ 2 ), f Y (y) = Z = Y µ σ 1 2πσ 2 E[Y ] = µ, Var[Y ] = σ 2, (y µ)2 e 2σ 2 N(0, 1), f Z (z) = 1 2π e z2 2. Suitable modelling for a lot of phenomena: IQ N(100, 15 2 ). 0.0266 0.0000 55 70 85 100 115 130 145 68.27% 95.45% 99.73%

3 Grand Picture of Statistics Statistical Hypotheses H0: µ T amoxifen = µ Control H1: µ T amoxifen < µ Control Sample Idea: Tamoxifen represses the progression of ER+ Breast cancer Data: Tumour size at day 42 (x T,1 ; x T,2 ;...; x T,nT ) (x C,1 ; x C,2 ;...; x C,nT ) Inference: Under H0 T obs = µ T amoxifen µ Control s 1 p n + 1 n T C St nt +n C 2 Point estimation µ T amoxifen µ Control

3 Grand Picture of Statistics Statistical Hypotheses H0: µ T amoxifen = µ Control H1: µ T amoxifen < µ Control Sample Idea: Tamoxifen represses the progression of ER+ Breast cancer Data: Tumour size at day 42 (x T,1 ; x T,2 ;...; x T,nT ) (x C,1 ; x C,2 ;...; x C,nT ) Inference: Under H0 T obs = µ T amoxifen µ Control s 1 p n + 1 n T C St nt +n C 2 Point estimation µ T amoxifen µ Control St nt +n C 2 p value = P (T < T obs ) -4-3 -2-1 0 1 2 3 4 T

One-sample Student s t-test Weight loss Assumed model Y i = µ + ɛ i, where i = 1,..., n and ɛ i N(0, σ 2 ). Hypotheses H0: µ = 0, H1: µ > 0. Test statistic s distribution under H0 T = Y µ0 s Student(n 1). One Sample t-test 4 data: dietb t = 6.6301, df = 24, p-value = 3.697e-07 alternative hypothesis: true mean is greater than 0 95 percent confidence interval: 2.424694 Inf sample estimates: mean of x 3.268

Weight loss Two-sample location tests: t-tests and Mann-Whitney-Wilcoxon s test

Two independent sample Student s t-test Weight loss Assumed model Y i(g) = µ g + ɛ i(g), = µ + δ g + ɛ i(g), where g = A, B, i = 1,..., n g, ɛ i(g) N(0, σ 2 ) and n gδ g = 0. Hypotheses H0: µ A = µ B, H1: µ A µ B. Test statistic s distribution under H0 T = (Y A Y B ) (µ A µ B ) Student(n A + n B 2). s p n 1 A + n 1 B Two Sample t-test 6 data: dieta and dietb t = 0.0475, df = 47, p-value = 0.9623 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.323275 1.387275 sample estimates: mean of x mean of y 3.300 3.268

Two independent sample Welch s t-test Weight loss Assumed model Y i(g) = µ g + ɛ i(g), = µ + δ g + ɛ i(g), where g = A, B, i = 1,..., n g, ɛ i(g) N(0, σ 2 g) and n gδ g = 0. Hypotheses H0: µ A = µ B, H1: µ A µ B. Test statistic s distribution under H0 T = (Y A Y B ) (µ A µ B ) s 2 X /n X + s 2 Y /n Y Student(df). Welch Two Sample t-test 7 data: dieta and dietb t = 0.047594, df = 46.865, p-value = 0.9622 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.320692 1.384692 sample estimates: mean of x mean of y 3.300 3.268

Two independent sample Mann-Whitney-Wilcoxon test Weight loss Assumed model Y i(g) = θ g + ɛ i(g), = θ + δ g + ɛ i(g), where g = A, B, i = 1,..., n g, ɛ i(g) iid(0, σ 2 ) and n gδ g = 0. Hypotheses H0: θ A = θ B, H1: θ A θ B. Test statistic s distribution under H0 nb z = i=1 R i(g) [n B (n A + n B + 1)/2], na n B (n A + n B + 1)/12 where R i(g) denotes the global rank of the ith observation of group g. Wilcoxon rank sum test with continuity correction 8 data: dieta and dietb W = 277, p-value = 0.6526 alternative hypothesis: true location shift is not equal to 0

Weight loss Two or more sample location tests: one-way ANOVA & multiple comparisons

More than two sample case: Fisher s one-way ANOVA Weight loss Assumed model Y i(g) = µ g + ɛ i(g), = µ + δ g + ɛ i(g), where g = 1,..., G, i = 1,..., n g, ɛ i(g) N(0, σ 2 ) and n gδ g = 0. Hypotheses H0: µ 1 = µ 2 =... = µ G, H1: µ k µ l for at least one pair (k, l). Test statistic s distribution under H0 F = Ns2 Y s 2 p F isher(g 1, N G), 10 where s 2 Y = G ( ) 1 ng 2, G 1 N Y g Y g=1 s 2 p = G 1 N G (n g 1)s 2 g, g=1 N = n g, Y = 1 N G n gy g. g=1 Df Sum Sq Mean Sq F value Pr(>F) diet.type 2 60.5 30.264 5.383 0.0066 ** Residuals 73 410.4 5.622 --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0

More than two sample case: Welch s one-way ANOVA Weight loss 11 Assumed model Y i(g) = µ g + ɛ i(g), = µ + δ g + ɛ i(g), where g = 1,..., G, i = 1,..., n g, ɛ i(g) N(0, σ 2 g) and n gδ g = 0. Hypotheses H0: µ 1 = µ 2 =... = µ G, H1: µ k µ l for at least one pair (k, l). Test statistic s distribution under H0 F = s 2 Y 1 + 2(G 2) 3 where s 2 = G 1 Y G 1 w g (Y g Y ) 2, g=1 [ ( ) G ] 3 = G 2 1 1 wg 1 ng wg 1, g=1 w g = ng s 2, Y = G wg Y g. g wg g=1 F isher(g 1, ), One-way analysis of means (not assuming equal variances) data: weight.diff and diet.type F = 5.2693, num df = 2.00, denom df = 48.48, p-value = 0.008497

More than two sample case: Kruskal-Wallis test Weight loss Assumed model Y i(g) = θ g + ɛ i(g), = θ + δ g + ɛ i(g), where g = 1,..., G, i = 1,..., n g, ɛ i(g) iid(0, σ 2 ) and n gδ g = 0. Hypotheses H0: θ 1 = θ 2 =... = θ G, H1: θ k θ l for at least one pair (k, l). Test statistic s distribution under H0 H = 12 G Rg N(N+1) g=1 3(N 1) ng V χ(g 1), 1 v=1 t3 v tv N 3 N where R g = ng 1 ng i=1 R i(g) and R i(g) denotes the global rank of the ith observation of group g, V is the number of different values/levels in y and t v denotes the number of times a given value/level occurred in y. Kruskal-Wallis rank sum test 12 data: weight.loss by diet.type Kruskal-Wallis chi-squared = 9.4159, df = 2, p-value = 0.009023

13 Model check: Residual analysis Y i(g) = θ g + ɛ i(g) ɛ i(g) = Y i(g) θ g, Residual boxplot per group where ɛ i(g) N(0, σ 2 ) for Fisher s ANOVA ɛ i(g) N(0, σ 2 g) for Welch s ANOVA ɛ i(g) iid(0, σ 2 ) for Kruskal-Wallis ANOVA Normal Q-Q Plot Residuals -4-2 0 2 4 6 Sample Quantiles -4-2 0 2 4 6 A B C Diet type -2-1 0 1 2 Theoretical Quantiles Shapiro-Wilk normality test data: diet$resid.mean W = 0.99175, p-value = 0.9088 Bartlett test of homogeneity of variances data: diet$resid.mean by as.numeric(diet$diet.type) Bartlett s K-squared = 0.21811, df = 2, p-value = 0.8967

Finding different pairs: Multiple comparisons All-pairwise comparison problem: Interested in finding which pair(s) are different by testing H0 1: µ 1 = µ 2, H0 2: µ 1 = µ 3,... H0 K: µ G 1 = µ G, leading to a total of K = G(G 1)/2 pairwise comparisons. Family-wise type I error for K tests, α K For each test, the probability of rejecting H0 when H0 is true equals α. For K independent tests, the probability of rejecting H0 at least 1 time when H0 is true, α K, is given by α K = 1 (1 α) K. α 1 = 0.05, α 2 = 0.0975, α 10 = 0.4013. Multiplicity correction Principle: change the level of each test so that α K = 0.05, for example: Bonferroni s correction (indep. tests): α = α K /K, Dunn-Sidak s correction (indep. tests): α = 1 (1 α K ) 1/K, Tukey s correction (dependent tests). 95% family wise confidence level B A C A 14 C B 1 0 1 2 3

Weight loss Male Female Two or more sample location tests: two-way ANOVA

16 More than one factor: Fisher s two-way ANOVA Weight loss Assumed model Y i(g) = µ gk + ɛ i(gk), = µ + δ g + δ k + δ gk + ɛ i(gk), g = 1,..., G, k = 1,..., K, i = 1,..., n g, ɛ i(gk) N(0, σ 2 ) n gδ g = n k δ k = n gk δ gk = 0. Hypotheses Male Female H0 1: δ g = 0 g, H1 1: H0 1 is false. H0 2: δ k = 0 k, H1 2: H0 2 is false. H0 3: δ gk = 0 g, k, H1 3: H0 3 is false.

More than one factor: Fisher s two-way ANOVA Weight loss Assumed model Y i(g) = µ gk + ɛ i(gk), = µ + δ g + δ k + δ gk + ɛ i(gk), g = 1,..., G, k = 1,..., K, i = 1,..., n g, ɛ i(gk) N(0, σ 2 ) n gδ g = n k δ k = n gk δ gk = 0. Hypotheses Male Female H0 1: δ g = 0 g, H1 1: H0 1 is false. H0 2: δ k = 0 k, H1 2: H0 2 is false. H0 3: δ gk = 0 g, k, H1 3: H0 3 is false. Scenario 1 Scenario 2 Scenario 3 H11 Scenario 4 H12 Scenario 5 H11 & H12 Scenario 6 H11 & H12 & H13 3.0 Mean weight loss 2.5 2.0 1.5 1.0 µ 0.5 16 0.0 A B C A B C Male A B C Female A B C A B C A B C

16 More than one factor: Fisher s two-way ANOVA Weight loss Assumed model Y i(g) = µ gk + ɛ i(gk), = µ + δ g + δ k + δ gk + ɛ i(gk), g = 1,..., G, k = 1,..., K, i = 1,..., n g, ɛ i(gk) N(0, σ 2 ) n gδ g = n k δ k = n gk δ gk = 0. Hypotheses Male Female H0 1: δ g = 0 g, H1 1: H0 1 is false. H0 2: δ k = 0 k, H1 2: H0 2 is false. H0 3: δ gk = 0 g, k, H1 3: H0 3 is false. Df Sum Sq Mean Sq F value Pr(>F) diet.type 2 60.5 30.264 5.629 0.00541 ** gender 1 0.2 0.169 0.031 0.85991 diet.type:gender 2 33.9 16.952 3.153 0.04884 * Residuals 70 376.3 5.376 --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1

Summary 17