Topic 22 Analysis of Variance

Similar documents
Review: General Approach to Hypothesis Testing. 1. Define the research question and formulate the appropriate null and alternative hypotheses.

Chapter 7 Comparison of two independent samples

Ch 2: Simple Linear Regression

Confidence Intervals, Testing and ANOVA Summary

The One-Way Repeated-Measures ANOVA. (For Within-Subjects Designs)

Formal Statement of Simple Linear Regression Model

Topic 19 Extensions on the Likelihood Ratio

Topic 21 Goodness of Fit

STAT763: Applied Regression Analysis. Multiple linear regression. 4.4 Hypothesis testing

Summary of Chapter 7 (Sections ) and Chapter 8 (Section 8.1)

Solutions to Final STAT 421, Fall 2008

Multiple comparisons - subsequent inferences for two-way ANOVA

Notes for Week 13 Analysis of Variance (ANOVA) continued WEEK 13 page 1

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B

Cuckoo Birds. Analysis of Variance. Display of Cuckoo Bird Egg Lengths

Analysis of Variance. Read Chapter 14 and Sections to review one-way ANOVA.

i=1 X i/n i=1 (X i X) 2 /(n 1). Find the constant c so that the statistic c(x X n+1 )/S has a t-distribution. If n = 8, determine k such that

Data Analysis and Statistical Methods Statistics 651

Sociology 6Z03 Review II

Lecture 18: Analysis of variance: ANOVA

Inference for the Regression Coefficient

Math 423/533: The Main Theoretical Topics

Simple Linear Regression

Mathematical Notation Math Introduction to Applied Statistics

Linear Models and Estimation by Least Squares

Week 14 Comparing k(> 2) Populations

Example: Four levels of herbicide strength in an experiment on dry weight of treated plants.

STAT 135 Lab 9 Multiple Testing, One-Way ANOVA and Kruskal-Wallis

A Likelihood Ratio Test

Probability and Statistics Notes

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

LECTURE 6. Introduction to Econometrics. Hypothesis testing & Goodness of fit

Multiple Linear Regression

Review 6. n 1 = 85 n 2 = 75 x 1 = x 2 = s 1 = 38.7 s 2 = 39.2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7

2 and F Distributions. Barrow, Statistics for Economics, Accounting and Business Studies, 4 th edition Pearson Education Limited 2006

Chapter 12. Analysis of variance

Coefficient of Determination

Simple and Multiple Linear Regression

Chapter 10: Inferences based on two samples

Disadvantages of using many pooled t procedures. The sampling distribution of the sample means. The variability between the sample means

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS

Analysis of Variance Bios 662

Design of Engineering Experiments Part 2 Basic Statistical Concepts Simple comparative experiments

Analysis of Variance

Elementary Statistics for the Biological and Life Sciences

Math 180B Problem Set 3

I i=1 1 I(J 1) j=1 (Y ij Ȳi ) 2. j=1 (Y j Ȳ )2 ] = 2n( is the two-sample t-test statistic.

Preliminary Statistics Lecture 5: Hypothesis Testing (Outline)

1 Introduction to One-way ANOVA

STAT 135 Lab 6 Duality of Hypothesis Testing and Confidence Intervals, GLRT, Pearson χ 2 Tests and Q-Q plots. March 8, 2015

Multiple Sample Numerical Data

Concordia University (5+5)Q 1.

The t-test: A z-score for a sample mean tells us where in the distribution the particular mean lies

MATH Notebook 3 Spring 2018

AMS 315/576 Lecture Notes. Chapter 11. Simple Linear Regression

Section 9.4. Notation. Requirements. Definition. Inferences About Two Means (Matched Pairs) Examples

Ch 3: Multiple Linear Regression

STATISTICS 141 Final Review

CBA4 is live in practice mode this week exam mode from Saturday!

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima

The One-Way Independent-Samples ANOVA. (For Between-Subjects Designs)

UNIVERSITY OF TORONTO Faculty of Arts and Science

PSYC 331 STATISTICS FOR PSYCHOLOGISTS

1-Way ANOVA MATH 143. Spring Department of Mathematics and Statistics Calvin College

Quantitative Introduction ro Risk and Uncertainty in Business Module 5: Hypothesis Testing

One-Way Analysis of Variance (ANOVA)

Much of the material we will be covering for a while has to do with designing an experimental study that concerns some phenomenon of interest.

Applied Econometrics (QEM)

Data Skills 08: General Linear Model

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES

GROUPED DATA E.G. FOR SAMPLE OF RAW DATA (E.G. 4, 12, 7, 5, MEAN G x / n STANDARD DEVIATION MEDIAN AND QUARTILES STANDARD DEVIATION

Chapter 5 Confidence Intervals

AMS7: WEEK 7. CLASS 1. More on Hypothesis Testing Monday May 11th, 2015

10/31/2012. One-Way ANOVA F-test

Randomized Complete Block Designs

STAT 135 Lab 7 Distributions derived from the normal distribution, and comparing independent samples.

An inferential procedure to use sample data to understand a population Procedures

22s:152 Applied Linear Regression. Take random samples from each of m populations.

16.400/453J Human Factors Engineering. Design of Experiments II

exp{ (x i) 2 i=1 n i=1 (x i a) 2 (x i ) 2 = exp{ i=1 n i=1 n 2ax i a 2 i=1

Inference for Regression

The Chi-Square Distributions

Lecture 6 Multiple Linear Regression, cont.

Chapter 23: Inferences About Means

Inferences for Regression

Lecture 9 Two-Sample Test. Fall 2013 Prof. Yao Xie, H. Milton Stewart School of Industrial Systems & Engineering Georgia Tech

1 Statistical inference for a population mean

Tentative solutions TMA4255 Applied Statistics 16 May, 2015

EC2001 Econometrics 1 Dr. Jose Olmo Room D309

Sampling distribution of t. 2. Sampling distribution of t. 3. Example: Gas mileage investigation. II. Inferential Statistics (8) t =

STK4900/ Lecture 3. Program

20.1. Balanced One-Way Classification Cell means parametrization: ε 1. ε I. + ˆɛ 2 ij =

18.S096 Problem Set 3 Fall 2013 Regression Analysis Due Date: 10/8/2013

ANOVA Situation The F Statistic Multiple Comparisons. 1-Way ANOVA MATH 143. Department of Mathematics and Statistics Calvin College

Analysis of Variance

Political Science 236 Hypothesis Testing: Review and Bootstrapping

TUTORIAL 8 SOLUTIONS #

Visual interpretation with normal approximation

Review for Final. Chapter 1 Type of studies: anecdotal, observational, experimental Random sampling

Transcription:

Topic 22 Analysis of Variance Comparing Multiple Populations 1 / 14

Outline Overview One Way Analysis of Variance Sample Means Sums of Squares The F Statistic Confidence Intervals 2 / 14

Overview Two-sample t procedures are designed to compare the means of two populations. Our next step is to compare the means of several populations. To explain the methodology, we consider the data set gathered from the forests in Borneo. Figure: Danum Valley Conservation Area 3 / 14

Overview Example. The data on 30 forest plots in Borneo are the number of trees per plot. never logged logged 1 year ago logged 8 years ago n j 12 12 9 ȳ j 23.750 14.083 15.778 s i 5.065 4.981 5.761 We compute these statistics from the data y 11,... y 1n1, y 21,... y 2n2 and y 31,... y 2n3, ȳ j = 1 n j n j i=1 y ij and s 2 j = 1 n j 1 n j (y ij ȳ i ) 2. i=1 4 / 14

Overview The basic question is: Are these means the same (the null hypothesis) or not (the alternative hypothesis)? The basic idea of the test is to examine the ratio of sbetween 2, the variance between groups (indicated by the variation in the center lines of the boxes) and sresidual 2, a statistic that measures the variances within groups. If the resulting ratio test statistic is sufficiently large, then we say, based on the data, that the means of these groups are distinct and we reject H 0. 5 10 15 20 25 30 35 never logged logged 1 year ago logged 8 years ago Figure: Side-by-side boxplots of the number of trees per plot. 5 / 14

One Way Analysis of Variance The hypothesis for one way analysis of variance is H 0 : µ j = µ k for all j, k and H 1 : µ j µ k for some j, k. The data {y ij, 1 i n j, 1 j q} represents that we have n j observation for the j-th group and that we have q groups. The total number of observations is denoted by n = n 1 + + n q. The model is y ij = µ j + ɛ ij where ɛ ij are independent N(0, σ) random variables with σ 2 unknown. This allows us to define the likelihood and to use that to determine the analysis of variance F test as a likelihood ratio test. NB The model for analysis requires a common value σ for all of the observations. 6 / 14

Sample Means To set up the F statistic, we introduce two types of sample means: The within group means is the sample mean inside each of the groups, n j y j = 1 y ij, j = 1...., q. n j The mean of the data taken as a whole, known as the grand mean, i=1 y = 1 n n q j y ij = 1 n j=1 i=1 q n j ȳ j, j=1 the weighted average of the ȳ j with weights n j, the sample size in each group. Exercise. For the Borneo rain forest example, show that the grand mean is 18.06055. 7 / 14

Sums of Squares Analysis of variance uses the total sums of squares SS total = n q j (y ij y) 2, j=1 i=1 the total square variation of individual observations from their grand mean. The test statistic is determined by decomposing SS total. We first rewrite the interior sum as n j (y ij y) 2 = (y ij y j ) 2 + n j (y j y) 2 = (n j 1)sj 2 + n j (y j y) 2. i=1 n j i=1 Here, s 2 j is the unbiased sample variance based on the observations in the j-th group. Exercise. Show the first equality above. (Hint: Begin with the difference in the two sums.) 8 / 14

Sums of Squares Summing the expression above over the groups j to yield the decomposition of the variation SS total = SS residual + SS between with SS residual = n q j (y ij y j ) 2 = j=1 i=1 q q (n j 1)sj 2 and SSbetween 2 = n j (y j y) 2. j=1 For the rain forest example, we find that and SS residual = (12 1) 5.065 2 + (12 1) 4.981 2 + (9 1) 5.761 2 = 820.6234 SS between = 12 (23.750 y) 2 + 12 (14.083 y) 2 + 9 (15.778 y) 2 ) = 625.1793 j=1 9 / 14

Sums of Squares source of degrees of sums of mean variation freedom squares square between groups q 1 SS between sbetween 2 = SS between/(q 1) residuals n q SS residual s 2 residual = SS residual /(n q) total n 1 SS total The q 1 degrees of freedom between groups is derived from the q groups minus 1 degree of freedom used to compute y. The n q degrees of freedom within the groups is derived from the n j 1 degree of freedom used to compute the variances s 2 j. 10 / 14

Sums of Squares The analysis of variance information for the Borneo rain forest data is summarized in the table below. source of degrees of sums of mean variation freedom squares square between groups 2 625.2 312.6 residuals 30 820.6 27.4 total 32 1445.8 11 / 14

The F Statistic The test statistic is F = s2 between s 2 residual = SS between/(q 1) SS residual /(n q). Under the null hypothesis, F is a constant multiple of the ratio of two independent χ 2 random variables, namely SS between and SS residual. This ratio is called an F random variable with q 1 numerator degrees of freedom and n q denominator degrees of freedom and written F q 1,n q 12 / 14

. For the rain forest data, F = s2 between s 2 residual = 312.6 27.4 = 11.43. The critical value for an 0.01 level test is 5.390. So, we reject H 0 stating mean number of trees does not depend on logging history. > 1-pf(11.43,2,30) [1] 0.0002041322 > qf(0.99,2,30) [1] 5.390346 F Statistic df(x, 2, 30) 0.0 0.2 0.4 0.6 0.8 1.0 0 1 2 3 4 5 6 7 x Figure: Upper tail critical values. The density for an F 2,30 random variable. The indicated values 3.316, 4.470, and 5.390 are critical values for significance levels α = 0.05, 0.02, and 0.01, respectively. Exercise. Use R to determine these critical values. 13 / 14

Confidence Intervals Confidence intervals are determined using the data from all of the groups as an unbiased estimate s 2 residuals = SS residuals/(n q) for the variance, σ 2. This allows us to increase the degrees of freedom in the t distribution and reduce the margin of error. Thus, the γ-level confidence interval for µ j is ȳ j ± t (1 γ)/2,n q s residual / n j. The interval for the difference in µ j µ k is similar to that for a pooled two-sample t confidence interval, 1 ȳ j ȳ k ± t (1 γ)/2,n q s residual + 1. n j n k The 95% confidence interval for mean number of trees on a lot logged 1 year ago 27.4 14.083 ± 2.042 = 14.083 ± 4.714 = (9.369, 18.979). 12 Exercise. Give the 95% confidence interval for the difference in trees between plots never logged plots versus logged 8 years ago. 14 / 14