Lecture 5: ANOVA and Correlation Ani Manichaikul amanicha@jhsph.edu 23 April 2007 1 / 62
Comparing Multiple Groups
- Continuous data: comparing means - Analysis of variance
- Binary data: comparing proportions - Pearson's chi-square tests for r × 2 tables: Independence, Goodness of Fit, Homogeneity
- Categorical data: r × c tables - Pearson chi-square tests; Odds ratio and relative risk
2 / 62
ANOVA: Definition Statistical technique for comparing means for multiple populations Partitioning the total variation in a data set into components defined by specific sources ANOVA = ANalysis Of VAriance 3 / 62
ANOVA: Concepts Estimate group means Assess magnitude of variation attributable to specific sources Extension of 2-sample t-test to multiple groups Population model Sample model: estimates, standard errors Partition of variability 4 / 62
Types of ANOVA One-way ANOVA One factor e.g. smoking status Two-way ANOVA Two factors e.g. gender and smoking status Three-way ANOVA Three factors e.g. gender, smoking and beer 5 / 62
Emphasis One-way ANOVA is an extension of the t-test to 3 or more samples; the analysis focuses on group differences. Two-way ANOVA (and higher) focuses on the interaction of factors: Does the effect due to one factor change as the level of another factor changes? 6 / 62
ANOVA Rationale I
Variation in all observations = Variation between each observation and its group mean + Variation between each group mean and the overall mean
In other words:
Total sum of squares = Within-group sum of squares + Between-groups sum of squares
7 / 62
ANOVA Rationale II In shorthand: SST = SSW + SSB If the group means are not very different, the variation between them and the overall mean (SSB) will not be much more than the variation between the observations within a group (SSW) 8 / 62
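The partition SST = SSW + SSB can be checked numerically. A minimal sketch in pure Python, with made-up group data (the values are illustrative, not from the lecture):

```python
# Illustrative data: three groups of observations (hypothetical values)
groups = [
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [5.0, 6.0, 7.0],
]

all_obs = [x for g in groups for x in g]
grand_mean = sum(all_obs) / len(all_obs)

# Within-group SS: each observation vs. its own group mean
ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

# Between-groups SS: each group mean vs. the grand mean, weighted by group size
ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Total SS: each observation vs. the grand mean
sst = sum((x - grand_mean) ** 2 for x in all_obs)

print(sst, ssw + ssb)  # the two totals agree: SST = SSW + SSB
```

For these data SSW = 6, SSB = 14, and SST = 20, so the identity holds exactly.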
ANOVA: One-Way 9 / 62
MSW We can pool the estimates of σ² across groups and use an overall estimate for the population variance: Variation within groups = σ̂²_W = SSW / (N − k) = MSW. MSW is called the within-groups mean square 10 / 62
MSB We can also look at systematic variation among groups: Variation between groups = σ̂²_B = SSB / (k − 1) = MSB 11 / 62
An ANOVA table Suppose there are k groups (e.g. if smoking status has categories current, former or never, then k = 3). We calculate our test statistic from the sums of squares as follows: 12 / 62
Hypothesis testing with ANOVA In performing ANOVA, we may want to ask: is there truly a difference in means across groups? Formally, we can specify the hypotheses:
H0: µ1 = µ2 = … = µk
Ha: at least one of the µi is different
The null hypothesis specifies a global relationship. If the result of the test is significant, then perform individual comparisons 13 / 62
Goal of the comparisons Compare the two variability estimates, MSW and MSB. If F_obs = MSB / MSW = σ̂²_B / σ̂²_W is small, then variability between groups is negligible compared to variation within groups: the grouping does not explain much variation in the data 14 / 62
The F-statistic For our observations, we assume X ~ N(µ_gp, σ²), where µ_gp = E(X | gp) = β0 + β1 I(group=2) + β2 I(group=3), and I(group=i) is an indicator of whether or not each individual is in the ith group. Note: we have assumed the same variance σ² for all groups; it is important to check this assumption. Under these assumptions, we know the null distribution of the statistic F = MSB/MSW. This distribution is called an F-distribution 15 / 62
The F-distribution Remember that a χ² distribution is always specified by its degrees of freedom. An F-distribution is the distribution obtained by taking the quotient of two χ² variables, each divided by its degrees of freedom. When we specify an F-distribution, we must state two parameters, which correspond to the degrees of freedom of the two χ² distributions. If X1 ~ χ²_df1 and X2 ~ χ²_df2, we write: (X1/df1) / (X2/df2) ~ F_df1,df2 16 / 62
Back to the hypothesis test... Knowing the null distribution of MSB/MSW, we can define a decision rule to test the hypothesis for ANOVA: Reject H0 if F ≥ F_α; k−1, N−k. Fail to reject H0 if F < F_α; k−1, N−k 17 / 62
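The whole procedure can be sketched in a few lines of pure Python: compute SSW and SSB, form the mean squares, and take their ratio. The data here are hypothetical; in practice the resulting F would be compared against F_α; k−1, N−k from a table or software:

```python
def one_way_anova_f(groups):
    """Return (F, df1, df2) for a one-way ANOVA on a list of groups."""
    k = len(groups)
    all_obs = [x for g in groups for x in g]
    n = len(all_obs)
    grand_mean = sum(all_obs) / n

    # Sums of squares: within groups and between groups
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

    msw = ssw / (n - k)   # within-groups mean square
    msb = ssb / (k - 1)   # between-groups mean square
    return msb / msw, k - 1, n - k

# Hypothetical data for three groups of three observations each
f_stat, df1, df2 = one_way_anova_f([[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [5.0, 6.0, 7.0]])
print(f_stat, df1, df2)
```

For these data MSB = 7 and MSW = 1, giving F = 7 on (2, 6) degrees of freedom.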
ANOVA: F-tests I 18 / 62
ANOVA: F-tests II 19 / 62
Example: ANOVA for HDL Study design: Randomized controlled trial. 132 men randomized to one of: Diet + exercise; Diet; Control. Follow-up one year later: 119 men remaining in study. Outcome: mean change in plasma levels of HDL cholesterol from baseline to one-year follow-up in the three groups 20 / 62
Model for HDL outcomes We model the means for each group as follows:
µc = E(HDL | gp = c) = mean change in control group
µd = E(HDL | gp = d) = mean change in diet group
µde = E(HDL | gp = de) = mean change in diet and exercise group
We could also write the model as E(HDL | gp) = β0 + β1 I(gp = d) + β2 I(gp = de). Recall that I(gp = d), I(gp = de) are 0/1 group indicators 21 / 62
HDL ANOVA Table We obtain the following results from the HDL experiment: 22 / 62
HDL ANOVA results F-test H0: µc = µd = µde (or H0: β1 = β2 = 0). Ha: at least one mean is different from the others. Test statistic: F_obs = 13.0, df1 = k − 1 = 3 − 1 = 2, df2 = N − k = 119 − 3 = 116 23 / 62
HDL ANOVA Conclusions Rejection region: F > F_0.05; 2,116 = 3.07. Since F_obs = 13.0 > 3.07, we reject H0. We conclude that at least one of the group means is different from the others 24 / 62
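The critical value F_0.05; 2,116 = 3.07 and the p-value for the observed statistic can be reproduced with scipy, assuming scipy is available:

```python
from scipy.stats import f

# Upper 5% critical value of the F-distribution with (2, 116) degrees of freedom
crit = f.ppf(0.95, dfn=2, dfd=116)
print(round(crit, 2))  # approximately 3.07, matching the slide

# The observed statistic F = 13.0 lies far beyond the critical value
p_value = f.sf(13.0, dfn=2, dfd=116)
print(p_value)  # well below 0.05, so reject H0
```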
Which groups are different? We might proceed to make individual comparisons. Conduct two-sample t-tests for each pair of groups:
t = (θ̂ − θ0) / SE(θ̂) = (X̄i − X̄j − 0) / √(s²p/ni + s²p/nj)
25 / 62
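A minimal sketch of this pooled two-sample t statistic, with hypothetical data (in a multi-group setting, s²p would typically be taken as the ANOVA MSW rather than pooled from just the two groups):

```python
def pooled_t(x, y):
    """Two-sample t statistic with a pooled variance estimate."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    # Pooled variance: combined sums of squares over combined degrees of freedom
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    sp2 = (ssx + ssy) / (nx + ny - 2)
    se = (sp2 / nx + sp2 / ny) ** 0.5
    return (mx - my) / se

# Hypothetical samples from two groups
t = pooled_t([4.0, 5.0, 6.0], [7.0, 8.0, 9.0])
print(t)
```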
Multiple Comparisons Performing individual comparisons requires multiple hypothesis tests. If α = 0.05 for each comparison, each comparison has a 5% chance of falsely being called significant. Overall, the probability of Type I error is elevated above 5%. Question: How can we address this multiple comparisons issue? 26 / 62
Bonferroni adjustment A possible correction for multiple comparisons. Test each hypothesis at level α* = α/3 ≈ 0.0167. The adjustment ensures the overall Type I error rate does not exceed α = 0.05. However, this adjustment may be too conservative 27 / 62
Multiple comparisons
Hypothesis | α* = α/3
H0: µc = µd (or β1 = 0) | 0.0167
H0: µc = µde (or β2 = 0) | 0.0167
H0: µd = µde (or β1 − β2 = 0) | 0.0167
Overall α = 0.05
28 / 62
HDL: Pairwise comparisons I Control and Diet groups
H0: µc = µd (or β1 = 0)
t = (0.05 − 0.02) / √(0.028/40 + 0.028/40) = 1.87
p-value = 0.06
29 / 62
HDL: Pairwise comparisons II Control and Diet + exercise groups
H0: µc = µde (or β2 = 0)
t = (0.05 − 0.14) / √(0.028/40 + 0.028/39) = −5.05
p-value = 4.4 × 10⁻⁷
30 / 62
HDL: Pairwise comparisons III Diet and Diet + exercise groups
H0: µd = µde (or β1 − β2 = 0)
t = (0.02 − 0.14) / √(0.028/40 + 0.028/39) = −3.19
p-value = 0.0014
31 / 62
Bonferroni corrected p-values
Hypothesis | p-value | adjusted p-value
H0: µc = µd | 0.06 | 0.18
H0: µc = µde | 4.4 × 10⁻⁷ | 1.3 × 10⁻⁶
H0: µd = µde | 0.0014 | 0.0042
Overall α = 0.05
Conclusion: Significant difference in HDL change for the DE group compared to the other groups
32 / 62
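The adjustment behind this table is simple: multiply each raw p-value by the number of comparisons, capping the result at 1. The p-values here are the ones reported in the slides:

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.06, 4.4e-7, 0.0014]   # raw pairwise p-values from the slides
adjusted = bonferroni(raw)     # approximately [0.18, 1.32e-06, 0.0042]
print(adjusted)
```

Each adjusted p-value is then compared directly against the overall α = 0.05.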
Two-way ANOVA Uses the same idea as one-way ANOVA by partitioning variability Allows us to look at interaction of factors Does the effect due to one factor change as the level of another factor changes? 33 / 62
Example: Public health students medical expenditures Study design: In an observational study, total medical expenditures and various demographic characteristics were recorded for 200 public health students. Goal: determine how gender and smoking status affect total medical expenditures in this population 34 / 62
Example: Set-up Y = Total medical expenditures F = Indicator of Female = 1 if Gender=Female, 0 otherwise S = Indicator of Smoking = 1 if smoked 100 cigarettes or more, 0 otherwise 35 / 62
Interaction model We assume the model Y ~ N(µ, σ²), where µ = E(Y) = β0 + β1 F + β2 S + β3 F·S. What are the interpretations of β0, β1, β2, and β3? 36 / 62
Two-way ANOVA: Interactions Mean Model: µ = E(Y) = β0 + β1 F + β2 S + β3 F·S
Gender | Smoker: No | Smoker: Yes
Male | β0 | β0 + β2
Female | β0 + β1 | β0 + β1 + β2 + β3
37 / 62
Mean Model
E(Expenditure | Male, non-smoker) = β0 + β1·0 + β2·0 + β3·0 = β0
E(Expenditure | Female, non-smoker) = β0 + β1·1 + β2·0 + β3·0 = β0 + β1
E(Expenditure | Male, smoker) = β0 + β1·0 + β2·1 + β3·0 = β0 + β2
E(Expenditure | Female, smoker) = β0 + β1·1 + β2·1 + β3·1 = β0 + β1 + β2 + β3
38 / 62
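Plugging in the coefficient estimates reported later in the lecture (β0 = 5049, β1 = 1784, β2 = 907, β3 = 6239), the four cell means follow directly from the interaction model:

```python
# Coefficient estimates from the two-way ANOVA (dollars)
b0, b1, b2, b3 = 5049, 1784, 907, 6239

def expected_expenditure(female, smoker):
    """E(Y | F, S) = b0 + b1*F + b2*S + b3*F*S under the interaction model."""
    return b0 + b1 * female + b2 * smoker + b3 * female * smoker

print(expected_expenditure(0, 0))  # male non-smoker: 5049
print(expected_expenditure(1, 0))  # female non-smoker: 6833
print(expected_expenditure(0, 1))  # male smoker: 5956
print(expected_expenditure(1, 1))  # female smoker: 13979
```

Note how β3 enters only the female-smoker cell: that is what makes it an interaction term.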
Medical Expenditures: ANOVA table
Source of Variation | Sum of Squares | df | Mean Square | F | p-value
Model (between groups) | 1.7 × 10⁹ | 3 | 5.6 × 10⁸ | 28.11 | < 0.001
Error (within groups) | 3.9 × 10⁹ | 196 | 2.0 × 10⁷ | |
Total | 5.6 × 10⁹ | 199 | | |
39 / 62
Medical Expenditures: Results Overall model F-test: H 0 : β 1 = β 2 = β 3 = 0 H a : At least one group is different Test statistic: F obs = 28.11 df 1 = k 1 = 3 df 2 = N k = 196 p-value < 0.001 40 / 62
Medical Expenditures: Overall Conclusions The medical expenditures are different in at least one of the groups Now we can figure out which ones... 41 / 62
Medical Expenditures: Two-way ANOVA I Table of coefficient estimates
Coefficient | Estimate | Standard Error
β0 (baseline) | 5049 | 597
β1 (female effect) | 1784 | 765
β2 (smoker effect) | 907 | 1062
β3 (female*smoker) | 6239 | 1422
42 / 62
Medical Expenditures: Two-way ANOVA II Test statistics and confidence intervals
Coefficient | t | P > |t| | 95% Confidence interval
β0 | 8.45 | 0.000 | (3870, 6228)
β1 | 2.33 | 0.021 | (276, 3292)
β2 | 0.85 | 0.394 | (−1187, 3001)
β3 | 4.39 | 0.000 | (3434, 9043)
43 / 62
Medical Expenditures: Group-wise Conclusions In this population, an average male non-smoker spends about $5000 on medical costs per year. Males who smoked were estimated as having spent about $900 more than non-smokers, but this difference was not found to be statistically significant. Female non-smokers spent about $1700 more than their non-smoking male counterparts. Female smokers spent about $8900 (= β1 + β2 + β3) more than non-smoking males 44 / 62
Association and Correlation I Association expresses the relationship between two variables. It can be measured in different ways, depending on the nature of the variables. For now, we'll focus on continuous variables (e.g. height, weight). Important note: association does not imply causation 45 / 62
Association and Correlation II Describing the relationship between two continuous variables Correlation analysis Measures strength of relationship between two variables Specifies direction of relationship Regression analysis Concerns prediction or estimation of outcome variable, based on value of another variable (or variables) 46 / 62
Correlation analysis Plot the data (or have a computer do so). Visually inspect the relationship between two continuous variables: Is there a linear relationship (correlation)? Are there outliers? Are the distributions skewed? 47 / 62
Correlation Coefficient I Measures the strength and direction of the linear relationship between two variables X and Y. Population correlation coefficient:
ρ = cov(X, Y) / √(var(X) var(Y)) = E[(X − µX)(Y − µY)] / √(E[(X − µX)²] E[(Y − µY)²])
48 / 62
Correlation Coefficient II The correlation coefficient, ρ, takes values between −1 and +1. −1: Perfect negative linear relationship. 0: No linear relationship. +1: Perfect positive linear relationship 49 / 62
Correlation Coefficient III Sample correlation coefficient: obtained by plugging sample estimates into the population correlation coefficient:
r = sample cov(X, Y) / √(s²X s²Y) = [Σ (Xi − X̄)(Yi − Ȳ) / (n − 1)] / √{[Σ (Xi − X̄)² / (n − 1)] [Σ (Yi − Ȳ)² / (n − 1)]}
50 / 62
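A minimal sketch of the sample correlation coefficient, using the simplified form in which the (n − 1) factors cancel:

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Numerator: sum of cross-products of deviations from the means
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the two sums of squares
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
    return num / den

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfect positive: 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # perfect negative: -1.0
```

Exactly linear data give r = ±1; r only measures linear association, as the next slides emphasize.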
Correlation Coefficient IV Plot standardized Y versus standardized X Observe an ellipse (elongated circle) Correlation is the slope of the major axis 51 / 62
Correlation Notes Other names for r: Pearson correlation coefficient; product-moment correlation. Characteristics of r: Measures *linear* association. The value of r is independent of the units used to measure the variables. The value of r is sensitive to outliers. r² tells us what proportion of the variation in Y is explained by the linear relationship with X 52 / 62
Several levels of correlation 53 / 62
Examples of the Correlation Coefficient I Perfect positive correlation, r 1 54 / 62
Examples of the Correlation Coefficient II Perfect negative correlation, r -1 55 / 62
Examples of the Correlation Coefficient III Imperfect positive correlation, 0< r <1 56 / 62
Examples of the Correlation Coefficient IV Imperfect negative correlation, -1<r <0 57 / 62
Examples of the Correlation Coefficient V No relation, r 0 58 / 62
Examples of the Correlation Coefficient VI Some relation but little *linear* relationship, r 0 59 / 62
Association and Causality In general, association between two variables means there is some form of relationship between them. The relationship is not necessarily causal. Association does not imply causation, no matter how much we would like it to. Example: Hot days, ice cream, drowning 60 / 62
Sir Bradford Hill s Criteria for Causality I Strength: magnitude of association Consistency of association: repeated observation of the association in different situations Specificity: uniqueness of the association Temporality: cause precedes effect 61 / 62
Sir Bradford Hill s Criteria for Causality II Biologic gradient: dose-response relationship Biologic plausibility: known mechanisms Coherence: makes sense based on other known facts Experimental evidence: from designed (randomized) experiments Analogy: with other known associations 62 / 62