Specific Differences Lukas Meier, Seminar für Statistik

Problem with Global F-test. Problem: the global F-test (a.k.a. omnibus F-test) is very unspecific. Typically we want a more precise answer (or have a more specific question) about how the group means differ. Examples: compare new treatments with a control treatment (reference treatment); do pairwise comparisons between all treatments. A specific question can typically be formulated as an appropriate contrast.

Contrasts: Simple Example. Want to compare group 2 with group 1 (don't care about the remaining groups for the moment). $H_0: \mu_1 = \mu_2$ vs. $H_A: \mu_1 \neq \mu_2$. Equivalently: $H_0: \mu_1 - \mu_2 = 0$ vs. $H_A: \mu_1 - \mu_2 \neq 0$. The corresponding contrast would be $c = (1, -1, 0, 0, \dots, 0)$. A contrast $c \in \mathbb{R}^g$ is a vector that encodes the null hypothesis in the sense that $H_0: \sum_{i=1}^g c_i \mu_i = 0$. A contrast is nothing else than an encoding of your research question.

Contrasts: Formal Definition. Formally, a contrast is nothing else than a vector $c = (c_1, c_2, \dots, c_g) \in \mathbb{R}^g$ with the constraint that $\sum_{i=1}^g c_i = 0$. The constraint reads: contrast coefficients add to zero. This side constraint ensures that the contrast is about differences between group means and not about the overall level of our response. Mathematically speaking, $c$ is orthogonal to $(1, 1, \dots, 1)$ or $(1/g, 1/g, \dots, 1/g)$, which is the overall mean. Meaning: contrasts don't care about the overall mean.

More Examples using Meat Storage Data. Treatments were: 1) commercial plastic wrap (ambient air), 2) vacuum package, 3) 1% CO, 40% O2, 59% N, 4) 100% CO2. Treatments 1 and 2 are current techniques (control groups); treatments 3 and 4 are new techniques. Possible questions and their corresponding contrasts $c \in \mathbb{R}^4$:
New vs. Old: $(-1/2, -1/2, 1/2, 1/2)$
New vs. Vacuum: $(0, -1, 1/2, 1/2)$
CO2 vs. Mixed: $(0, 0, -1, 1)$
Mixed vs. Commercial: $(-1, 0, 1, 0)$

Global F-Test vs. Contrasts. As explained in Oehlert (2000): ANOVA is like background lighting that dimly illuminates the data, but does not give enough light to see details. A contrast is like using a spotlight; it enables us to focus in on a specific, narrow feature of the data [...] but it does not give the overall picture. Intuitively: by using several contrasts we can move our focus around and see more features of the data.

Inference for Contrasts. We estimate the value $\sum_{i=1}^g c_i \mu_i$ with $\sum_{i=1}^g c_i \bar{y}_{i\cdot}$, i.e. we simply replace $\mu_i$ by its estimate $\bar{y}_{i\cdot}$. The corresponding standard error can easily be derived. This information allows us to construct tests and confidence intervals. See blackboard for details.
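Since the derivation is deferred to the blackboard, here is a brief sketch of the standard formulas (under the usual equal-variance ANOVA model, with group sizes $n_i$ and error mean square $MS_E$):

$$\widehat{\sum_{i=1}^g c_i \mu_i} = \sum_{i=1}^g c_i \bar{y}_{i\cdot}, \qquad \mathrm{se} = \sqrt{MS_E \sum_{i=1}^g \frac{c_i^2}{n_i}}, \qquad t = \frac{\sum_{i=1}^g c_i \bar{y}_{i\cdot}}{\mathrm{se}} \sim t_{N-g} \ \text{under } H_0,$$

and a $(1-\alpha)$ confidence interval is $\sum_{i=1}^g c_i \bar{y}_{i\cdot} \pm t_{N-g,\,1-\alpha/2} \cdot \mathrm{se}$.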

Sum of Squares of a Contrast. We can also compute an associated sum of squares $SS_c = \frac{\left(\sum_{i=1}^g c_i \bar{y}_{i\cdot}\right)^2}{\sum_{i=1}^g c_i^2 / n_i}$, having one degree of freedom, hence $MS_c = SS_c$. This looks unintuitive at first sight, but it is nothing else than the square of the t-statistic of our null hypothesis $H_0: \sum_{i=1}^g c_i \mu_i = 0$ (without the $MS_E$ factor). Hence, $MS_c / MS_E \sim F_{1,\,N-g}$ under $H_0$. Again: nothing else than a squared version of the t-test.
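To make the "squared t-test" remark explicit (a standard identity, using the quantities defined above):

$$t^2 = \frac{\left(\sum_{i=1}^g c_i \bar{y}_{i\cdot}\right)^2}{MS_E \sum_{i=1}^g c_i^2 / n_i} = \frac{SS_c}{MS_E} = \frac{MS_c}{MS_E},$$

and the square of a $t_{N-g}$ random variable has an $F_{1,\,N-g}$ distribution, which gives the stated null distribution.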

Contrasts in R. Multiple options: directly in R; package multcomp (will also be very useful later); many more. See the corresponding R-script for details.
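As a minimal sketch of the multcomp route (not the course's R-script): assume a data frame meat with a numeric response y and a factor treatment whose levels are ordered commercial, vacuum, mixed, CO2; the data frame and variable names are placeholders.

```r
library(multcomp)

## Fit the one-way ANOVA model (hypothetical data frame 'meat').
fit <- aov(y ~ treatment, data = meat)

## One row per contrast; columns follow the order of the factor levels.
K <- rbind("New vs. Old"          = c(-1/2, -1/2, 1/2, 1/2),
           "New vs. Vacuum"       = c(   0,   -1, 1/2, 1/2),
           "CO2 vs. Mixed"        = c(   0,    0,  -1,   1),
           "Mixed vs. Commercial" = c(  -1,    0,   1,   0))

## Estimates, standard errors, and tests for all four contrasts at once.
summary(glht(fit, linfct = mcp(treatment = K)))
```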

Orthogonal contrasts. Two contrasts $c$ and $c^*$ are called orthogonal if $\sum_{i=1}^g c_i c_i^* / n_i = 0$. Orthogonal contrasts contain independent information. If there are $g$ groups, one can find $g - 1$ different orthogonal contrasts (one dimension is already used by the global mean $(1, \dots, 1)$). However, there are infinitely many possible sets of orthogonal contrasts.
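For equal group sizes the factor $1/n_i$ is the same in every term, so the orthogonality condition reduces to an ordinary dot product. A quick check for two of the meat-storage contrasts (a sketch, assuming all $n_i$ equal):

```r
## Orthogonality check for equal group sizes (the common 1/n_i factor drops out).
new.vs.old   <- c(-1/2, -1/2, 1/2, 1/2)
co2.vs.mixed <- c(0, 0, -1, 1)
sum(new.vs.old * co2.vs.mixed)   # 0 => the two contrasts are orthogonal
```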

Decomposition of Sum of Squares. A set of orthogonal contrasts partitions the treatment sum of squares. This means: the sum of the contrast sums of squares is $SS_{Trt}$, i.e. for orthogonal contrasts $c^{(1)}, c^{(2)}, \dots, c^{(g-1)}$ it holds that $SS_{c^{(1)}} + SS_{c^{(2)}} + \dots + SS_{c^{(g-1)}} = SS_{Trt}$. Intuition: we get all the information about the treatment by pointing the spotlight in all directions. It's your research hypotheses that define the contrasts, not the orthogonality criterion.

Multiple Testing

Multiple Comparisons. The more tests we perform, the more likely we are to make at least one type I error (i.e., falsely rejecting $H_0$). More formally: perform $m$ tests $H_{0,j}$, $j = 1, \dots, m$. If all $H_{0,j}$ are true and if all tests are independent, the probability of making at least one false rejection is given by $1 - (1 - \alpha)^m$, where $\alpha$ is the (individual) significance level. For $\alpha = 0.05$ and $m = 50$ this is 0.92 (!)
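The number quoted on the slide is easy to reproduce (a one-line check, not tied to any data set):

```r
## Probability of at least one false rejection among m independent tests at level alpha.
alpha <- 0.05
m <- 50
1 - (1 - alpha)^m   # approximately 0.92
```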

Multiple Comparisons. The more tests we perform, the more likely we are to get some significant result. If we test many null hypotheses, we expect to reject some of them, even if they are all true. If we start data-fishing (i.e., screening data for special patterns), we (implicitly) perform a lot of tests.

Different Error Rates. Consider testing m hypotheses, whereof m_0 are true. These are the potential outcomes:

                  H_0 true    H_0 false    Total
Significant          V            S          R
Not significant      U            T        m - R
Total               m_0        m - m_0       m

Here R counts the discoveries (rejections), V the type I errors, and T the type II errors. The comparisonwise error rate is the type I error rate of an individual test. The family-wise error rate (FWER, also called experimentwise error rate) is the probability of rejecting at least one of the true H_0's: FWER = P(V > 0).

Different Error Rates. A procedure is said to control the FWER at level $\alpha$ in the strong sense if $\mathrm{FWER} \le \alpha$ for any configuration of true and non-true null hypotheses. The false discovery rate (FDR) is the expected fraction of false discoveries, i.e. $\mathrm{FDR} = E[V/R]$ (the false discovery fraction, set to 0 if $R = 0$).

Confidence Intervals. Typically, each $H_0$ corresponds to a parameter. We can construct confidence intervals for each of them. We call these confidence intervals simultaneous at level $1 - \alpha$ if the probability that all intervals cover the corresponding true parameter is $1 - \alpha$. Intuition: we can look at all confidence intervals and get the correct big picture with probability $1 - \alpha$. Remember: for 20 individual 95% confidence intervals, on average one doesn't cover the true value.

Overview of Multiple Testing Procedures.
Control of Family-Wise Error Rate:
- Bonferroni (conservative)
- Bonferroni-Holm (better version of Bonferroni)
- Scheffé (for search over all possible contrasts, conservative)
- Tukey-HSD (for pairwise comparisons)
- Multiple Comparison with a Control
False Discovery Rate (see book):
- Benjamini-Hochberg
- Benjamini-Yekutieli
- Others

Bonferroni. Use the more restrictive significance level $\alpha^* = \alpha / m$. That's it! This controls the family-wise error rate. No assumption regarding independence is required (see blackboard). Equivalently: multiply all p-values by $m$ and keep using the original $\alpha$. Can get quite conservative if $m$ is large. The corresponding confidence intervals (based on the adjusted significance level) are simultaneous.

Bonferroni-Holm. Less conservative and hence (uniformly) more powerful than Bonferroni. Sort the p-values from small to large: $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$. For $j = 1, 2, \dots$: reject the corresponding null hypothesis if $p_{(j)} \le \alpha / (m - j + 1)$. Stop when you reach the first non-significant p-value. Only the smallest p-value gets the traditional Bonferroni correction, hence the procedure is more powerful. R: p.adjust etc. This is a so-called step-down procedure.
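A small sketch of p.adjust with made-up p-values (the vector p is purely illustrative):

```r
## Adjust a set of raw p-values; reject where the adjusted value is <= alpha.
p <- c(0.001, 0.012, 0.018, 0.2, 0.4)
p.adjust(p, method = "bonferroni")  # plain Bonferroni: multiply by m, cap at 1
p.adjust(p, method = "holm")        # Bonferroni-Holm step-down (the default method)
p.adjust(p, method = "BH")          # Benjamini-Hochberg, controls the FDR instead
```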

Scheffé. Controls for a search over any possible contrast. This means: you are even allowed to perform data-fishing and test the most extreme contrast you'll find (really!). These p-values are honest (really!). Sounds too good to be true! Theory: $SS_c \le (g-1)\,MS_{Trt}$ for any contrast $c$ (because $SS_{Trt} = SS_c + \dots$). Hence, $\frac{SS_c}{MS_E} \le (g-1)\,\frac{MS_{Trt}}{MS_E}$. Therefore, $\frac{SS_c/(g-1)}{MS_E} \le \frac{MS_{Trt}}{MS_E}$ for any contrast $c$ (in particular for the maximal one), and $\frac{MS_{Trt}}{MS_E} \sim F_{g-1,\,N-g}$ under $H_0: \mu_1 = \dots = \mu_g$.

Scheffé. The price for these nice properties is low power (meaning: the test will not reject often when $H_0$ is not true). If the F-test is not significant, you don't even have to start searching! R: calculate the F-ratio ($MS_c / MS_E$) as for an ordinary contrast; use $(g-1)\,F_{g-1,\,N-g,\,1-\alpha}$ as the critical value (instead of $F_{1,\,N-g,\,1-\alpha}$).
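A minimal sketch of this cutoff in R; g, N, and the contrast F-ratio are assumed to come from your own fit, and the numbers below are hypothetical:

```r
## Scheffé critical value: (g - 1) times the usual F quantile.
g <- 4; N <- 20; alpha <- 0.05
F_contrast   <- 6.2                                   # hypothetical MSc / MSE
crit_scheffe <- (g - 1) * qf(1 - alpha, g - 1, N - g)
F_contrast > crit_scheffe   # TRUE => significant, even for a contrast found by data-fishing
```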

Pairwise Comparisons. A pairwise comparison is nothing else than comparing two specific treatments (e.g., vacuum vs. CO2). This is a multiple testing problem because there are $g(g-1)/2$ possible comparisons (basically a lot of two-sample t-tests). Hence, we need a method that adjusts for this multiple testing problem in order to control the family-wise error rate. Simplest solution: apply a Bonferroni correction. Better (more powerful): Tukey Honest Significant Difference.

Tukey Honest Significant Difference (HSD). Start with the statistic of the t-test (here for the balanced case): $\frac{\bar{y}_{i\cdot} - \bar{y}_{j\cdot}}{\sqrt{MS_E \left(\frac{1}{n} + \frac{1}{n}\right)}}$. Use the distribution of $\max_i \frac{\bar{y}_{i\cdot}}{\sqrt{MS_E / n}} - \min_j \frac{\bar{y}_{j\cdot}}{\sqrt{MS_E / n}}$ (the so-called studentized range) for critical values. Meaning: how does the maximal difference between groups behave? If all the means are equal ($H_0$), this follows the studentized range distribution. (R: ptukey)

Tukey Honest Significant Difference (HSD). Tukey's honest significant difference uses this studentized range distribution to construct simultaneous confidence intervals for the differences between all pairs, and calculates p-values such that the family-wise error rate is controlled. R: TukeyHSD or package multcomp (see R-file for demo). Tukey HSD is better (more powerful) than Bonferroni if all pairwise comparisons are of interest. If only a subset is of interest: re-consider Bonferroni.
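A minimal sketch of Tukey HSD in base R; the data frame meat with columns y and treatment is again a hypothetical stand-in for the course data, not the course's R-file:

```r
## All pairwise comparisons with family-wise error control via the studentized range.
fit <- aov(y ~ treatment, data = meat)
TukeyHSD(fit)         # simultaneous CIs and adjusted p-values for all pairs
plot(TukeyHSD(fit))   # the usual plot of all pairwise difference intervals
```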

Interpreting and Displaying the Results. A non-significant difference does not imply equality. Reason: absence of evidence is not evidence of absence. Results can be displayed using the same letters/numbers for treatments with non-significant differences, or a matrix (upper or lower triangle) with p-values.

Multiple Comparison with a Control (MCC). Often we want to compare all treatments with a (specific) control treatment. Hence, we do $g - 1$ (pairwise) comparisons with the control group. The Dunnett procedure constructs simultaneous confidence intervals for $\mu_i - \mu_g$, $i = 1, \dots, g-1$ (assuming group $g$ is the control group). R: use package multcomp.
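A minimal sketch of the Dunnett procedure with multcomp; the data frame dat with response y and factor group is hypothetical, and multcomp's "Dunnett" contrasts compare every level with the first factor level, so the control should be the first level:

```r
library(multcomp)

## Fit the one-way model and compare each treatment with the control (first level).
fit     <- aov(y ~ group, data = dat)
dunnett <- glht(fit, linfct = mcp(group = "Dunnett"))
summary(dunnett)   # adjusted p-values for each treatment vs. control
confint(dunnett)   # simultaneous confidence intervals
```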

What about F-test? Can I only do pairwise comparisons etc. if the omnibus F-test is significant? No, although many textbooks recommend this. The presented procedures have a multiple-testing correction built-in. Conditioning on a significant F-test makes them overconservative. Moreover, the conditional error or coverage rates can be (very) bad.

Statistical Significance vs. Practical Relevance. An effect that is statistically significant is not necessarily of practical relevance. Instead of simply reporting p-values one should always consider the corresponding confidence intervals. Background knowledge should be used to judge when an effect is potentially relevant.

Recommendations. Planned contrasts: Bonferroni (or no correction). All pairwise comparisons: Tukey HSD. Comparison with a control: Dunnett. Unplanned contrasts: Scheffé.