Lecture 5: ANOVA and Correlation


Ani Manichaikul
amanicha@jhsph.edu
23 April 2007

Comparing Multiple Groups
Continuous data: comparing means - analysis of variance
Binary data: comparing proportions - Pearson's chi-square tests for r × 2 tables (independence, goodness of fit, homogeneity)
Categorical data: r × c tables - Pearson chi-square tests, odds ratio and relative risk

ANOVA: Definition
ANOVA = ANalysis Of VAriance
A statistical technique for comparing means across multiple populations
Partitions the total variation in a data set into components defined by specific sources

ANOVA: Concepts
Estimate group means
Assess the magnitude of variation attributable to specific sources
Extension of the 2-sample t-test to multiple groups
Population model; sample model: estimates and standard errors
Partition of variability

Types of ANOVA
One-way ANOVA: one factor (e.g. smoking status)
Two-way ANOVA: two factors (e.g. gender and smoking status)
Three-way ANOVA: three factors (e.g. gender, smoking, and beer consumption)

Emphasis
One-way ANOVA is an extension of the t-test to 3 or more samples; the analysis focuses on group differences
Two-way ANOVA (and higher) focuses on the interaction of factors: does the effect due to one factor change as the level of another factor changes?

ANOVA Rationale I
Variation in all observations = Variation between each observation and its group mean + Variation between each group mean and the overall mean
In other words:
Total sum of squares = Within-group sum of squares + Between-groups sum of squares

ANOVA Rationale II
In shorthand: SST = SSW + SSB
If the group means are not very different, the variation between them and the overall mean (SSB) will not be much more than the variation among the observations within a group (SSW)
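The identity SST = SSW + SSB can be checked numerically. A minimal Python sketch, using three made-up groups (illustrative data only, not from any study):

```python
# Three hypothetical groups of observations (illustrative data only)
groups = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 2.0, 3.0]]

all_obs = [x for g in groups for x in g]
grand_mean = sum(all_obs) / len(all_obs)
group_means = [sum(g) / len(g) for g in groups]

# SST: variation of every observation around the overall (grand) mean
sst = sum((x - grand_mean) ** 2 for x in all_obs)

# SSW: variation of each observation around its own group mean
ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

# SSB: variation of each group mean around the grand mean, weighted by group size
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))

print(sst, ssw + ssb)  # the partition makes these two quantities equal
```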

ANOVA: One-Way
(slide content not transcribed)

MSW
We can pool the estimates of σ² across groups and use an overall estimate for the population variance:
Variation within groups = σ̂²_W = SSW / (N − k) = MSW
MSW is called the within-groups mean square

MSB
We can also look at systematic variation among groups:
Variation between groups = σ̂²_B = SSB / (k − 1) = MSB
MSB is called the between-groups mean square

An ANOVA table
Suppose there are k groups (e.g. if smoking status has categories current, former, or never, then k = 3)
We calculate our test statistic from the sums of squares as follows:

Source           SS    df     MS                 F
Between groups   SSB   k − 1  MSB = SSB/(k − 1)  MSB/MSW
Within groups    SSW   N − k  MSW = SSW/(N − k)
Total            SST   N − 1
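These computations can be sketched in code. A minimal Python example with k = 3 made-up groups (not data from the lecture):

```python
# Hypothetical data: k = 3 groups (illustrative only)
groups = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 2.0, 3.0]]
k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = sum(x for g in groups for x in g) / n_total

# Between- and within-group sums of squares
ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

msb = ssb / (k - 1)        # between-groups mean square, df = k - 1
msw = ssw / (n_total - k)  # within-groups mean square, df = N - k
f_stat = msb / msw         # large F suggests real differences between group means

print(msb, msw, f_stat)
```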

Hypothesis testing with ANOVA
In performing ANOVA, we may want to ask: is there truly a difference in means across groups?
Formally, we can specify the hypotheses:
H_0: µ_1 = µ_2 = ... = µ_k
H_a: at least one of the µ_i is different
The null hypothesis specifies a global relationship
If the result of the test is significant, then perform individual comparisons

Goal of the comparisons
Compare the two variability estimates, MSW and MSB
If F_obs = MSB / MSW = σ̂²_B / σ̂²_W is small, then variability between groups is negligible compared to variation within groups
The grouping does not explain much variation in the data

The F-statistic
For our observations, we assume X ~ N(µ_gp, σ²), where
µ_gp = E(X | gp) = β_0 + β_1 I(group = 2) + β_2 I(group = 3)
and I(group = i) is an indicator denoting whether or not each individual is in the i-th group
Note: we have assumed the same variance σ² for all groups; it is important to check this assumption
Under these assumptions, we know the null distribution of the statistic F = MSB / MSW
This distribution is called an F-distribution

The F-distribution
Remember that a χ² distribution is always specified by its degrees of freedom
An F-distribution is any distribution obtained by taking the quotient of two independent χ² random variables, each divided by its degrees of freedom
When we specify an F-distribution, we must state two parameters, corresponding to the degrees of freedom of the two χ² distributions
If X_1 ~ χ²_{df1} and X_2 ~ χ²_{df2} are independent, we write:
(X_1 / df_1) / (X_2 / df_2) ~ F_{df1, df2}

Back to the hypothesis test...
Knowing the null distribution of MSB / MSW, we can define a decision rule to test the hypothesis for ANOVA:
Reject H_0 if F ≥ F_{α; k−1, N−k}
Fail to reject H_0 if F < F_{α; k−1, N−k}

ANOVA: F-tests I
(slide content not transcribed)

ANOVA: F-tests II
(slide content not transcribed)

Example: ANOVA for HDL
Study design: randomized controlled trial
132 men randomized to one of: diet + exercise, diet, control
Follow-up one year later: 119 men remaining in the study
Outcome: mean change in plasma levels of HDL cholesterol from baseline to one-year follow-up in the three groups

Model for HDL outcomes
We model the means for each group as follows:
µ_c = E(HDL | gp = c) = mean change in the control group
µ_d = E(HDL | gp = d) = mean change in the diet group
µ_de = E(HDL | gp = de) = mean change in the diet + exercise group
We could also write the model as E(HDL | gp) = β_0 + β_1 I(gp = d) + β_2 I(gp = de)
Recall that I(gp = d) and I(gp = de) are 0/1 group indicators

HDL ANOVA Table
We obtain the following results from the HDL experiment:
(table not transcribed; the resulting F statistic and degrees of freedom appear on the next slide)

HDL ANOVA results
F-test:
H_0: µ_c = µ_d = µ_de (or H_0: β_1 = β_2 = 0)
H_a: at least one mean is different from the others
Test statistic: F_obs = 13.0
df_1 = k − 1 = 3 − 1 = 2
df_2 = N − k = 119 − 3 = 116

HDL ANOVA Conclusions
Rejection region: F > F_{0.05; 2, 116} = 3.07
Since F_obs = 13.0 > 3.07, we reject H_0
We conclude that at least one of the group means is different from the others

Which groups are different?
We might proceed to make individual comparisons
Conduct two-sample t-tests for each pair of groups:
t = (θ̂ − θ_0) / SE(θ̂) = (X̄_i − X̄_j − 0) / √(s²_p/n_i + s²_p/n_j)
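A small Python sketch of this pairwise t statistic; the means, pooled variance, and sample sizes below are hypothetical placeholders, not the HDL values:

```python
import math

def pooled_t(mean_i: float, mean_j: float, s2_pooled: float,
             n_i: int, n_j: int) -> float:
    """Two-sample t statistic with a pooled variance estimate:
    t = (Xbar_i - Xbar_j - 0) / sqrt(s2_p/n_i + s2_p/n_j)."""
    se = math.sqrt(s2_pooled / n_i + s2_pooled / n_j)
    return (mean_i - mean_j) / se

# Hypothetical numbers for illustration
t = pooled_t(1.0, 0.4, 0.9, 30, 30)
print(t)
```

Here s²_p plays the role of the pooled variance estimate; in the ANOVA setting it is commonly taken to be MSW, with N − k degrees of freedom.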

Multiple Comparisons
Performing individual comparisons requires multiple hypothesis tests
If α = 0.05 for each comparison, each comparison has a 5% chance of falsely being called significant
Overall, the probability of Type I error is elevated above 5%
Question: how can we address this multiple comparisons issue?

Bonferroni adjustment
A possible correction for multiple comparisons
Test each hypothesis at level α* = α/3 = 0.05/3 ≈ 0.0167
The adjustment ensures the overall Type I error rate does not exceed α = 0.05
However, this adjustment may be too conservative

Multiple comparisons α

Hypothesis                          α* = α/3
H_0: µ_c = µ_d  (or β_1 = 0)        0.0167
H_0: µ_c = µ_de (or β_2 = 0)        0.0167
H_0: µ_d = µ_de (or β_1 − β_2 = 0)  0.0167

Overall α = 0.05

HDL: Pairwise comparisons I
Control and Diet groups
H_0: µ_c = µ_d (or β_1 = 0)
t = (0.05 − 0.02) / √(0.028/40 + 0.028/40) = 1.87
p-value = 0.06

HDL: Pairwise comparisons II
Control and Diet + exercise groups
H_0: µ_c = µ_de (or β_2 = 0)
t = (0.05 − 0.14) / √(0.028/40 + 0.028/39) = −5.05
p-value = 4.4 × 10^-7

HDL: Pairwise comparisons III
Diet and Diet + exercise groups
H_0: µ_d = µ_de (or β_1 − β_2 = 0)
t = (0.02 − 0.14) / √(0.028/40 + 0.028/39) = −3.19
p-value = 0.0014

Bonferroni corrected p-values

Hypothesis          p-value      adjusted p-value
H_0: µ_c = µ_d      0.06         0.18
H_0: µ_c = µ_de     4.4 × 10^-7  1.3 × 10^-6
H_0: µ_d = µ_de     0.0014       0.0042

Overall α = 0.05
Conclusion: significant difference in HDL change for the diet + exercise group compared to the other groups
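The adjusted p-values above follow the standard Bonferroni rule: multiply each raw p-value by the number of comparisons, capping at 1. A minimal Python sketch (the hypothesis labels are just shorthand for the three pairwise tests):

```python
# Raw p-values from the three pairwise HDL comparisons
raw_p = {"c vs d": 0.06, "c vs de": 4.4e-7, "d vs de": 0.0014}
m = len(raw_p)  # number of comparisons

# Bonferroni: adjusted p = min(m * p, 1); compare against the overall alpha
adjusted = {h: min(m * p, 1.0) for h, p in raw_p.items()}
alpha = 0.05
significant = {h: p_adj < alpha for h, p_adj in adjusted.items()}

print(adjusted)     # adjusted p-values, matching the table above
print(significant)  # only the comparisons involving diet + exercise are significant
```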

Two-way ANOVA
Uses the same idea as one-way ANOVA: partitioning variability
Allows us to look at the interaction of factors: does the effect due to one factor change as the level of another factor changes?

Example: Public health students' medical expenditures
Study design: in an observational study, total medical expenditures and various demographic characteristics were recorded for 200 public health students
Goal: determine how gender and smoking status affect total medical expenditures in this population

Example: Set-up
Y = total medical expenditures
F = indicator of female = 1 if gender = female, 0 otherwise
S = indicator of smoking = 1 if smoked 100 cigarettes or more, 0 otherwise

Interaction model
We assume the model Y ~ N(µ, σ²) where
µ = E(Y) = β_0 + β_1 F + β_2 S + β_3 F·S
What are the interpretations of β_0, β_1, β_2, and β_3?

Two-way ANOVA: Interactions
Mean model: µ = E(Y) = β_0 + β_1 F + β_2 S + β_3 F·S

                     Smoker
Gender        No             Yes
Male          β_0            β_0 + β_2
Female        β_0 + β_1      β_0 + β_1 + β_2 + β_3

Mean Model
E(Expenditure | male, non-smoker)   = β_0 + β_1·0 + β_2·0 + β_3·0 = β_0
E(Expenditure | female, non-smoker) = β_0 + β_1·1 + β_2·0 + β_3·0 = β_0 + β_1
E(Expenditure | male, smoker)       = β_0 + β_1·0 + β_2·1 + β_3·0 = β_0 + β_2
E(Expenditure | female, smoker)     = β_0 + β_1·1 + β_2·1 + β_3·1 = β_0 + β_1 + β_2 + β_3

Medical Expenditures: ANOVA table

Source of Variation       Sum of Squares   df    Mean Square   F       p-value
Model (between groups)    1.7 × 10^9       3     5.6 × 10^8    28.11   < 0.001
Error (within groups)     3.9 × 10^9       196   2.0 × 10^7
Total                     5.6 × 10^9       199

Medical Expenditures: Results
Overall model F-test:
H_0: β_1 = β_2 = β_3 = 0
H_a: at least one group is different
Test statistic: F_obs = 28.11
df_1 = k − 1 = 4 − 1 = 3
df_2 = N − k = 200 − 4 = 196
p-value < 0.001

Medical Expenditures: Overall Conclusions
The medical expenditures are different in at least one of the groups
Now we can figure out which ones...

Medical Expenditures: Two-way ANOVA I
Table of coefficient estimates

Coefficient             Estimate   Standard Error
β_0 (baseline)          5049       597
β_1 (female effect)     1784       765
β_2 (smoker effect)     907        1062
β_3 (female × smoker)   6239       1422
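Plugging these estimates into the interaction model reproduces the cell means from the earlier mean-model slide. A short Python sketch (the function name is just for illustration):

```python
# Coefficient estimates from the two-way ANOVA table above
b0, b1, b2, b3 = 5049.0, 1784.0, 907.0, 6239.0

def expected_expenditure(female: int, smoker: int) -> float:
    """E(Y | F, S) = b0 + b1*F + b2*S + b3*F*S under the interaction model."""
    return b0 + b1 * female + b2 * smoker + b3 * female * smoker

# All four gender-by-smoking cells
cells = {(f, s): expected_expenditure(f, s) for f in (0, 1) for s in (0, 1)}

# Female smokers vs male non-smokers: the difference is b1 + b2 + b3
diff = cells[(1, 1)] - cells[(0, 0)]
print(cells)
print(diff)  # 8930.0, i.e. about $8900
```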

Medical Expenditures: Two-way ANOVA II
Test statistics and confidence intervals

Coefficient   t      P > |t|   95% Confidence interval
β_0           8.45   0.000     (3870, 6228)
β_1           2.33   0.021     (276, 3292)
β_2           0.85   0.394     (−1187, 3001)
β_3           4.39   0.000     (3434, 9043)

Medical Expenditures: Group-wise Conclusions
In this population, an average male non-smoker spends about $5000 on medical costs per year
Males who smoked were estimated to have spent about $900 more than non-smokers, but this difference was not statistically significant
Female non-smokers spent about $1700 more than their non-smoking male counterparts
Female smokers spent about $8900 (= β_1 + β_2 + β_3) more than non-smoking males

Association and Correlation I
Association expresses the relationship between two variables
It can be measured in different ways, depending on the nature of the variables
For now, we'll focus on continuous variables (e.g. height, weight)
Important note: association does not imply causation

Association and Correlation II
Describing the relationship between two continuous variables:
Correlation analysis: measures the strength of the relationship between two variables and specifies its direction
Regression analysis: concerns prediction or estimation of an outcome variable, based on the value of another variable (or variables)

Correlation analysis
Plot the data (or have a computer do so)
Visually inspect the relationship between the two continuous variables:
Is there a linear relationship (correlation)?
Are there outliers?
Are the distributions skewed?

Correlation Coefficient I
Measures the strength and direction of the linear relationship between two variables X and Y
Population correlation coefficient:
ρ = cov(X, Y) / √(var(X) var(Y)) = E[(X − µ_X)(Y − µ_Y)] / √(E[(X − µ_X)²] E[(Y − µ_Y)²])

Correlation Coefficient II
The correlation coefficient, ρ, takes values between −1 and +1
−1: perfect negative linear relationship
0: no linear relationship
+1: perfect positive linear relationship

Correlation Coefficient III
Sample correlation coefficient: obtained by plugging sample estimates into the population correlation coefficient
r = sample cov(X, Y) / √(s²_X s²_Y)
  = [Σ (X_i − X̄)(Y_i − Ȳ) / (n − 1)] / √([Σ (X_i − X̄)² / (n − 1)] [Σ (Y_i − Ȳ)² / (n − 1)])
  = Σ (X_i − X̄)(Y_i − Ȳ) / √(Σ (X_i − X̄)² Σ (Y_i − Ȳ)²)
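A minimal Python sketch computing r from this formula; the x and y values below are made-up illustrative data:

```python
import math

# Illustrative paired observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# The common 1/(n - 1) factors cancel, leaving sums of cross-products and squares
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)
print(r)  # a moderately strong positive linear relationship
```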

Correlation Coefficient IV
Plot standardized Y versus standardized X
Observe an ellipse (elongated circle)
The correlation is the slope of the major axis

Correlation Notes
Other names for r: Pearson correlation coefficient, product-moment correlation
Characteristics of r:
Measures *linear* association
The value of r is independent of the units used to measure the variables
The value of r is sensitive to outliers
r² tells us what proportion of the variation in Y is explained by the linear relationship with X

Several levels of correlation
(figure: scatterplots showing several levels of correlation)

Examples of the Correlation Coefficient I
Perfect positive correlation, r = +1

Examples of the Correlation Coefficient II
Perfect negative correlation, r = −1

Examples of the Correlation Coefficient III
Imperfect positive correlation, 0 < r < 1

Examples of the Correlation Coefficient IV
Imperfect negative correlation, −1 < r < 0

Examples of the Correlation Coefficient V
No relation, r ≈ 0

Examples of the Correlation Coefficient VI
Some relation but little *linear* relationship, r ≈ 0

Association and Causality
In general, association between two variables means there is some form of relationship between them
The relationship is not necessarily causal
Association does not imply causation, no matter how much we would like it to
Example: hot days, ice cream sales, drownings

Sir Bradford Hill's Criteria for Causality I
Strength: magnitude of the association
Consistency: repeated observation of the association in different situations
Specificity: uniqueness of the association
Temporality: cause precedes effect

Sir Bradford Hill's Criteria for Causality II
Biologic gradient: dose-response relationship
Biologic plausibility: known mechanisms
Coherence: makes sense based on other known facts
Experimental evidence: from designed (randomized) experiments
Analogy: with other known associations