11: Comparing Group Variances. Review of Variance

Size: px

Start display at page:

Download "11: Comparing Group Variances. Review of Variance"

Clinton Adams
5 years ago
Views:

1 11: Comparing Group Variances Review of Variance Parametric measures of variability are often based on sum of squares (SS) around e mean: (1) For e data set {3, 4, 5, 8}, = 5 and SS = (3 5) + (4 5) + (5 5) + (8 5) = = 14. The variance is e mean sum or squares ( mean square ). The symbol denotes e population variance (parameter) and s denotes e sample variance. The population variance is seldom known, so we calculate e sample variance: () For {3, 4, 5, 8}, s = 14 / 3 = This is an unbiased estimate of. The standard deviation is e square root of e variance ( root mean square ): (3) For {3, 4, 5, 8}, s = (4.667) =.16. This is an unbiased estimate of. Interpretation of e standard deviation is tricky. One ing to keep in mind is at big standard deviations indicate big spreads and small standard deviations indicate small spreads. For example, if one group has a standard deviation of 15 and an oer has a standard deviation of, ere is much greater variability in e first group. There are also rules for interpreting standard deviations. For Normally distributed data, 68% of values lie in ±, 95% lie in ±, and nearly all values lie in ± 3. Most data are not Normally distributed, in which case we can use Chebychev s rule which states at at least 75% of e values lie wiin ±. Illustrative example (data set = agebycen.sav for center 1). Ages (years) of patients at a given center are: 60, 66, 65, 55, 6, 70, 51, 7, 58, 61, 71, 41, 70, 57, 55, 63, 64, 76, 74, 54, 58, 73 The mean = The sum of squares SS = ( ) + ( ) + +( ) = The sample variance s = / ( 1) = 75.1 The standard deviation s = From Chebychev s rule we predict at at least 75% of e ages lies in e interval 6.5 ± ()(8.7) = 6.5 ± 17.4, or between 45.1 and Page 11.1 (C:\data\StatPrimer\variance.wpd Print date: 8/14/06 )

2 Graphical Representations of Spread The data in e prior illustrative example (n = ) are displayed as a stemplot: (years) We can see from is plot at data spread from 41 to 76 around a center of about 60-someing. There is a low outlier, and e distribution have a negative skew. You can also display e data wi a dot plot or mean ± standard deviation plot as follows: Be aware at error bars in statistical charts may represent standard deviations, standard errors, or margins of error, which are quite different ings. A boxplot is also nice:the plot below shows e hinge-spread (smaller vertical arrow) and a whiskers-spread (larger vertical arrow) of e data. Chapter 3 provides instruction on construction and interpretation of boxplots. Page 11. (C:\data\StatPrimer\variance.wpd Print date: 8/14/06 )

3 Confidence Limits for a Variance 95% confidence interval for e population variance. The sample variance s² is an unbiased estimator of ². The 95% confidence intervals for (for Normal populations) is given by: (4) where SS represents e sum of square (formula 1), ² n-1,.975 is e 97.5 percentile on a chi-square distribution wi n 1 degrees of freedom, and ² is e.5 percentile on a chi-square distribution wi n 1 degrees of freedom. n-1,.05 Going from e variance to e sum of squares. If e sums of squares is not available, but e variance (or standard deviation) and sample size is, e sum of squares is (5) For e illustrative data, s = and n = 4. Therefore, SS = (4 1)(4.667) = Illustrative example. Consider e small data set on page 1 of is chapter in which SS = 14 and n = 4. For 95% confidence use ² 3,.975 = , ² 3,.05 = Thus, a 95% confidence interval for is = (1.50, 64.87). 95% confidence interval for e population standard deviation. The 95% confidence limits for population standard deviation is derived by taking e square root of e confidence limits for e variance. For e illustrative example, e 95% confidence interval for = ( 1.50, 64.87) = (1., 8.05). Page 11.3 (C:\data\StatPrimer\variance.wpd Print date: 8/14/06 )

4 Testing Variances for a Significant Difference When we have two independent samples, we might ask if e variances of e two populations differ. Consider e following fictitious data: Sample 1: {3, 4, 5, 8} Sample : {4, 6, 7, 9} 1 The first sample has s = and e second has s = Is it possible e observed difference reflects random variation and e variances in e population are e same? We test null hypoesis H : ² = ². 0 1 We ask What is e probability of taking samples from two populations wi identical variances while observing * sample variances as different as s 1 and s? If is probability is low (say, less an.06 ), we will reject H 0 and conclude e two samples came from populations wi unequal variances. If is probability is not too low, we will say ere is insufficient evidence to reject e null hypoesis. The most common procedure for e test is e F ratio test, which has is test statistic: (6) Notice at e larger variance is placed in e numerator of e statistic, and e smaller variance is placed in e denominator. The statistic is associated wi numerator and denominator degrees of freedom. The numerator degrees of freedom is df 1 = n1 1, where group 1 is e group wi e larger variance. The denominator degrees of freedom is df = n - 1. It is important to keep ese degrees of freedom in e correct numerator-denominator order. 1 Illustrative example (fictitious data). Group 1 is {3, 4, 5, 8} and group is {4, 6, 7, 9}; s = and s = We calculate F = s / s = / = 1.08 wi df = 4 1 = 3 and df = 4 1 = 3. stat stat We ask wheer e F statistic sufficiently far from 1.0 to reject H? To answer is question, we convert e F to a p value rough use of e appropriate F distribution. The test is one-tailed focusing on e upper extent of e F df1, df distribution. * Not a typo. Surely god loves = 0.06 as much as = Page 11.4 (C:\data\StatPrimer\variance.wpd Print date: 8/14/06 )

5 The F Distribution The F distribution is a family of distributions initially described by Fisher and Snedecor. Each member of e F distribution is identified by two parameters: df 1 (numerator degrees of freedom) and df (denominator degrees of freedom). F distributions are positively skewed, wi e extent of skewness determined by e distributions degrees of freedoms. df1,df,q 1 Notation: Let F denote e q percentile an F distribution wi df and df degrees of freedom. (The q percentile is greater an or equal to q% of e distribution.) As always, e area under e curve represents probability, and e total area under e curve sums to 1. The area under e curve to e right of e point F df1,df,q is p = 1 q: Our F table in is organized wi numerator degrees of freedom (df 1) shown in e first row of e table and denominator degrees of freedom (df ) shown in e first column. Only 95 percentile points are provided. Here s e first ten lines in e table: Numerator Degrees of Freedom * * D e n o m i n a t Page 11.5 (C:\data\StatPrimer\variance.wpd Print date: 8/14/06 )

6 Example: The 95 percentile on F 1,9, is 5.1 (italicize). Suppose we calculate F stat = 6.01 wi df 1 = 1 and df = 9. Since is F stat is situated to e right of e 95 percentile point, we know e P-value is less an More precise P-values can be derived wi e proper statistical utility (e.g., StaTable, WinPepi). For example, we can use StaTable to determine at an F of 6.01 wi df = 1 and df = 9 is equivalent to P = stat 1 Illustrative example. Recall at wi our illustrative data calculated F stat = 1.08 wi df 1 = 4 1 = 3 and df = 4 1 = 3. Using e F table we find at e 95 percentile on F 3,3 is 9.8. Therefore, P > Using a software utility determine P = Thus, H 0 is retained. [Keep in mind at a non-significant test does not provide evidence of equal variances, especially when samples are small. It merely says e evidence is insufficient to reject H 0: equal variances] Page 11.6 (C:\data\StatPrimer\variance.wpd Print date: 8/14/06 )

7 Pooling Variances It is common in statistical practice to pool (average) group variances to come up wi a more reliable estimate of variability. This should be done only if data suggest (e.g., wi a negative F ratio test). The pooled estimate of variance (s² p) is: 1 (7) where df = df 1 + df. Illustrative example. Group 1 is {3, 4, 5, 8} and group is {4, 6, 7, 9}. We ve already calculated s 1 = and s = Note df 1 = 4-1 = 3, df = 4-1 = 3, and df = = 6. Thus:. The pooled estimate of e variance can be used to derive e pooled standard error of e mean difference: (8) Use is standard error to test H 0: 1 = when and calculate a confidence interval for 1 when population variances are equal. The equal variance test statistic is: (9) The equal variance 95% confidence interval for 1 is: (10) Illustrative example. For e illustrative data = 1.5, and t stat = 1.00 wi df = = 6. A t stat wi 6 degrees of freedom (p =.36). The 95% confidence 1 = ( ) ± (.48)(1.5) = 1.5 ± 3.7 = ( 5. to.). Page 11.7 (C:\data\StatPrimer\variance.wpd Print date: 8/14/06 )

8 When Variances Should Not be Pooled If EDA and F testing suggest variances differ, sample variance should not be pooled and e equal variance t test should be avoided. Inference can still be pursed while assuming ² 1 ² using e Behrens-Fisher procedures (also called e unequal variance t procedure). The t statistic and confidence interval formulas for is procedure are e same as at for e equal variance procedures. What differs, however, is e standard error, which is: (11) There are two ways to calculate e df for is statistic. SPSS uses is formula: When working by hand, use e smaller of df1 or df. A 95% confidence interval for 1 wiout assuming equal variance is. Illustrative example (Familial blood glucose levels hypoetical data). Blood glucose levels are determined for twenty-five (n = 5) 5-year olds whose faers have type II diabetes ( cases ). That cases have mean fasting blood 1 glucose levels ( ) of mg/dl wi a standard deviation (s 1) of 9.6 mg/dl. A group of control group of 0 children (n = 0) from e same census track whose faers have no history of type II diabetes demonstrate = 99.7 mg/dl and s = 5. mg/dl. We want to test e means for inequality but suspect population variances differ significantly. In testing, H 0: ² 1 = ², we calculate F stat = 9.6² / 5.² = 3.41 wi df 1 = 4 and df = 19 (p =.0084), concluding variances differ significantly. An unequal t procedure will be pursued. In applying an unequal variance t test, =.45, = The df" is a tedious to calculate. Wiout software, we use e smaller of n1 1 or n 1, which in is case is 19. (The actual df'' = 38). This will provide conservative inference for e two populations. Using a t distribution wi df = 19, we derive.00 < p <.01. The 95% confidence interval for e data= ( ) ± (.09)(.45) = 7.6 ± 4.7 = (.9, 1.3). Page 11.8 (C:\data\StatPrimer\variance.wpd Print date: 8/14/06 )

13: Additional ANOVA Topics. Post hoc Comparisons

13: Additional ANOVA Topics. Post hoc Comparisons 13: Additional ANOVA Topics Post hoc Comparisons ANOVA Assumptions Assessing Group Variances When Distributional Assumptions are Severely Violated Post hoc Comparisons In the prior chapter we used ANOVA