Multiple samples: Modeling and ANOVA


Multiple samples: Modeling and ANOVA
Patrick Breheny
April 29
Introduction to Biostatistics (171:161)

Multiple group studies

In the latter half of this course, we have discussed the analysis of data coming from various kinds of studies: we started out with one-sample studies and then moved on to two-sample studies. This week, we will discuss the analysis of data in which three or more groups/samples are present. For example, in the tailgating study, we compared illegal drug users with non-illegal drug users; however, there were really four groups: individuals who use marijuana, individuals who use MDMA (ecstasy), individuals who drink alcohol, and drug-free individuals.

Comparing multiple groups

One possible approach would be simply to make all 6 pairwise comparisons separately. As we have seen, however, making multiple comparisons increases the Type I error rate unless we make some sort of correction. In the special case of multiple groups, we can take another approach: simultaneously test the equality of all the groups, i.e., use the entire collection of data to carry out a single test. To do this, however, we will need a different approach than the ones we have used so far in this course: we will need to build a statistical model.

The philosophy of statistical models

There are unexplained phenomena that occur all around us, every day: Why do some die while others live? Why does one treatment work better on some people, and a different treatment on others? Why do some drivers tailgate the car in front of them while others follow at safer distances? Try as hard as we may, we will never understand any of these things in their entirety; nature is far too complicated to ever understand perfectly. There will always be variability that we cannot explain. The best we can hope to do is to develop an oversimplified version of how the world works that explains some of that variability.

The philosophy of statistical models (cont'd)

This oversimplified version of how the world works is called a model. The point of a model is not to represent exactly what is going on in nature; that would be impossible. The point is to develop a model that will help us to understand, to predict, and to make decisions in the presence of this uncertainty, and some models are better at this than others. The philosophy of statistical modeling is summarized in a famous quote by the statistician George Box: "All models are wrong, but some are useful."

Residuals

What makes one model better than another is the amount of variability it is capable of explaining. Let's return to our tailgating study: the simplest model is that there is one mean tailgating distance for everyone, and that everything else is inexplicable variability. Using this model, we would calculate the mean tailgating distance for our sample. Each observation y_i will deviate from this mean by some amount r_i:

r_i = y_i − ȳ

The values r_i are called the residuals of the model.

Residual sum of squares

We can summarize the size of the residuals by calculating the residual sum of squares:

RSS = Σ_i r_i²

The residual sum of squares is a measure of the unexplained variability that a model leaves behind. For example, the residual sum of squares for the simple model of the tailgating data is (−23.1)² + (−2.1)² + ... = 230,116.1. Note that the residual sum of squares doesn't mean much by itself, because it depends on the sample size and the scale of the outcome, but it has meaning when compared to other models applied to the same data.
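To make the arithmetic concrete, here is a minimal Python sketch of the one-mean model's residuals and RSS. The six following distances are invented for illustration; they are not the tailgating data.

```python
# One-mean model: fit a single mean, then compute residuals and RSS.
# The distances below are made up for illustration.
y = [20.1, 44.2, 31.5, 28.9, 39.0, 25.3]

ybar = sum(y) / len(y)                 # the model's single fitted mean
residuals = [yi - ybar for yi in y]    # r_i = y_i - ybar
rss = sum(r ** 2 for r in residuals)   # residual sum of squares
```

The residuals always sum to zero around the fitted mean; squaring them before summing is what makes RSS a measure of overall spread.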

A more complex model

A more complex model for the tailgating data would be that each group has its own unique mean. Using this model, we would calculate separate means for each group, and then compare each observation to the mean of its own group to calculate the residuals. The residual sum of squares for this more complex model is (−18.9)² + (2.1)² + ... = 225,126.8.

Explained variability

We can quantify how good our model is at explaining the variability we see with a quantity known as the explained variance or coefficient of multiple determination. Letting RSS_0 and RSS_1 denote the residual sums of squares from the null model and the more complex model, the proportion of variance explained by the model is:

R² = (RSS_0 − RSS_1)/RSS_0 = (230,116.1 − 225,126.8)/230,116.1 = 0.022

In words, our model explains 2.2% of the variability in tailgating distance.
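The same calculation extends to the group-means model; here is a sketch with invented data and group labels (not the real study's values):

```python
# Compare the null (one-mean) model with the group-means model and
# report R^2. Data and group labels are invented for illustration.
from collections import defaultdict

groups = ["A", "A", "B", "B", "C", "C"]
y      = [20.0, 24.0, 30.0, 34.0, 40.0, 44.0]

grand_mean = sum(y) / len(y)
rss0 = sum((yi - grand_mean) ** 2 for yi in y)   # null model RSS

# fit a separate mean for each group
by_group = defaultdict(list)
for g, yi in zip(groups, y):
    by_group[g].append(yi)
group_mean = {g: sum(v) / len(v) for g, v in by_group.items()}
rss1 = sum((yi - group_mean[g]) ** 2 for g, yi in zip(groups, y))

r2 = (rss0 - rss1) / rss0   # proportion of variability explained
```

Because each observation is compared to its own group's mean, RSS_1 can never exceed RSS_0, so R² always lands between 0 and 1.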

Complex models always fit better

The more complex model has a lower residual sum of squares; it must be a better model then, right? Not necessarily: the more complex model will always have a lower residual sum of squares. The reason is that, even if the population means are exactly the same for the four groups, the sample means will be slightly different. Thus, a more complex model that allows the modeled means in each group to differ will always fit the observed data better. But that doesn't mean it would explain the variability of future observations any better (this concept is called overfitting).

The real question is whether this reduction in the residual sum of squares is larger than what you would expect by chance alone. This type of model, one where we have several different groups and are interested in whether the groups have different means, is called an analysis of variance model, or ANOVA for short. The meaning of the name is historical, as this was the first type of model to hit on the idea of looking at explained variability (variance) to test hypotheses. Today, however, many different types of models use this same idea to conduct hypothesis tests.

Parameters

To answer the question of whether the reduction in RSS is significant, we need to keep track of the number of parameters in a model. For example, the null model had one parameter: the common mean. In contrast, the more complex model had four parameters: the separate means of the four groups. Let's let d denote the number of parameters, so that d_0 = 1 and d_1 = 4.

Variation explained per parameter

In the tailgating example, the four-mean model explained an additional 4,989.3 m² of variation. However, since adding parameters will always lower the RSS, a better way of stating the difference between the models is that the more complicated model explained an additional 1,663.1 m² of variation per parameter added. This still isn't quite satisfactory: if we measured following distance in feet instead of meters, we'd get a different number. To finish standardizing, we need to take the standard deviation into account.

Estimating variability

As in Student's t test, if we're willing to assume that the variability is the same for all observations, then we can estimate the standard deviation (or the variance, which is the standard deviation squared) by pooling the residuals from the more complex model:

σ̂² = Σ_i r_i² / (n − d_1)

Again, as in Student's t test, it is worth keeping in mind that this is only an estimate, and thus has some uncertainty to it. In this case, the pooled standard deviation is based on n − d_1 pieces of information (degrees of freedom). For the tailgating data, σ̂² = 225,126.8/(119 − 4) = 1957.6.

Thus, we finally arrive at the test statistic:

F = [(RSS_0 − RSS_1)/(d_1 − d_0)] / σ̂²

The test statistic is usually denoted F after Ronald Fisher, who developed the idea of ANOVA and worked out the distribution that F would have under the null hypothesis that the smaller model is correct. For the tailgating data, F = 1663.1/1957.6 = 0.85. To determine significance, we would have to compare this number to a new curve called the F distribution.
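Plugging the slide's reported quantities into this formula can be sketched in Python; the p-value comes from SciPy's F distribution (assuming scipy is available):

```python
# F test of the one-mean (null) model against the four-mean model,
# using the RSS values, parameter counts, and n given in the lecture.
from scipy.stats import f

rss0, rss1 = 230116.1, 225126.8   # null and group-means RSS
d0, d1, n = 1, 4, 119             # parameter counts and sample size

sigma2_hat = rss1 / (n - d1)                   # pooled variance estimate
F = ((rss0 - rss1) / (d1 - d0)) / sigma2_hat   # F statistic
p = f.sf(F, d1 - d0, n - d1)                   # upper tail of F(3, 115)
```

The numerator degrees of freedom (d_1 − d_0 = 3) count the extra parameters; the denominator degrees of freedom (n − d_1 = 115) count the information behind σ̂².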

[Figure: The F distribution with df = (3, 115); the area to the right of the observed F statistic gives p = 0.47.]

Outliers

Recall, however, that this data had large outliers. If we rank the data and then perform an ANOVA, we get a different picture of how strong the relationship is between drug use and tailgating behavior: RSS_0 = 140,420 and RSS_1 = 126,182. Now, our model explains 10.1% of the variability in following distance: (140,420 − 126,182)/140,420 = 0.101. Furthermore, our F statistic is 4.3.
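The rank transformation itself is simple; here is a sketch with invented numbers (not the study data) showing how each observation is replaced by its rank before the same RSS/R²/F machinery is applied:

```python
# Replace each observation by its rank (1 = smallest); a single huge
# outlier then counts no more than any other extreme observation.
# Values are invented; tied values would need average ranks in practice.
y = [5.0, 7.0, 300.0, 6.0, 8.0, 9.0]
ranks = [sorted(y).index(v) + 1 for v in y]
# the ANOVA is then carried out on `ranks` instead of `y`
```

Note that 300.0 simply becomes rank 6, so its leverage over the sums of squares is the same as any other largest observation.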

[Figure: F test for the ranked data; the area to the right of the observed F statistic gives p = 0.006.]

[Figure: Group means and confidence intervals for the original data (following distance, in meters) for the ALC, MDMA, NODRUG, and THC groups.]

[Figure: Group means and confidence intervals for the ranked data (average quantile) for the ALC, MDMA, NODRUG, and THC groups.]

ANOVA for two groups?

We have seen that ANOVA models can be used to test whether or not three or more groups have the same mean. Could we have used ANOVA models to carry out two-group comparisons? Of course; however, comparing the amount of variability explained by a two-mean vs. a one-mean model produces exactly the same test as Student's t-test.
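This equivalence is easy to check numerically. The sketch below, with two invented groups, shows that the ANOVA F statistic equals the square of the pooled two-sample t statistic:

```python
# With two groups, one-way ANOVA reproduces the pooled t-test: F = t^2.
# The two samples below are made up for illustration.
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]
n = len(a) + len(b)

ma, mb = sum(a) / len(a), sum(b) / len(b)
ssa = sum((x - ma) ** 2 for x in a)       # within-group SS for a
ssb = sum((x - mb) ** 2 for x in b)       # within-group SS for b

# pooled two-sample t statistic
sp2 = (ssa + ssb) / (n - 2)
t = (ma - mb) / math.sqrt(sp2 * (1 / len(a) + 1 / len(b)))

# ANOVA F statistic: one-mean (null) model vs. two-mean model
grand = sum(a + b) / n
rss0 = sum((x - grand) ** 2 for x in a + b)
rss1 = ssa + ssb
F = ((rss0 - rss1) / 1) / (rss1 / (n - 2))
```

Both statistics use the same pooled variance; squaring t folds the two tails of the t distribution into the single upper tail of the F distribution.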

Other uses for statistical models

Statistical models have uses far beyond comparing multiple groups, such as adjusting for the effects of confounding variables, predicting future outcomes, and studying the relationships between multiple variables. Statistical modeling is a huge topic, and we are just skimming the surface today. Models are the focus of the next course in this sequence, 171:162 (BIOS 512).

Summary

Models attempt to explain variation; the ability of a model to do so is measured by the coefficient of multiple determination (R²). More complex models always explain more variation in the data than simpler models do, even if the simpler model is correct. To test whether this increase in explained variation is significant, we can use an F test. This idea can be used to carry out a single test of whether three or more groups have the same mean, avoiding the need for multiple comparison adjustment; this test is known as an ANOVA.