Group comparison test for independent samples

Group comparison test for independent samples

The purpose of the Analysis of Variance (ANOVA) is to test for significant differences between means, supposing that the samples come from normal populations with possibly different means but a common variance.
Two independent samples: z or t test on the difference between means (or ANOVA).
Three or more independent samples: ANalysis Of VAriance (ANOVA).

ANOVA notation

y_ij = j-th observation of group i (independently of the role of rows and columns in the data table)
G = number of groups
n = number of observations in each group
In a balanced design each group contains the same number n of observations.
H_0: μ_1 = μ_2 = ... = μ_G
H_a: at least one inequality
In addition, all populations are supposed to have a common variance.

Testing equality of each pair

With 4 groups, C(4,2) = 6 separate t tests would be required to test the null hypothesis under consideration. Besides being tedious, 6 separate t tests on the same data would have an overall α level much higher than the α used in each t test:
α = 0.05 in each test; total α ≤ 0.05 × 6 = 0.30 (too high a probability of type I error).
One F test (α = 0.05): comparing the sample variance among groups with the sample variance within groups.
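The single F test can be run in one shot with SciPy. A minimal sketch, with simulated data; the group sizes, means, scale and seed are illustrative assumptions, not values from the slides:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Four groups of 8 observations; only the last two have shifted means.
groups = [rng.normal(loc=m, scale=10.0, size=8) for m in (50, 50, 55, 60)]

# One F test at alpha = 0.05 instead of six pairwise t tests,
# so the overall type I error rate stays at the nominal level.
F, p = stats.f_oneway(*groups)
print(f"F = {F:.3f}, p = {p:.4f}")
```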

Sources of variability

Variance (or sum of squares) due to treatments: between groups.
Variance (or sum of squares) due to error: within groups.
The decomposition equation can be written as:
SS_T = SS_TR + SS_E
where SS_T is the total sum of squares.
With n observations in total, SS_T has n − 1 d.f.; with k factor levels, SS_TR has k − 1 d.f.; with n_j observations per group, SS_E has Σ_{j=1..k} (n_j − 1) = n − k d.f.
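A short numerical sketch of this decomposition, checking that SS_T = SS_TR + SS_E; the helper name and the small data set are hypothetical:

```python
import numpy as np

def anova_ss(groups):
    """Return (SS_T, SS_TR, SS_E) for a one-way layout."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    y = np.concatenate(groups)
    grand = y.mean()
    ss_t = ((y - grand) ** 2).sum()                                # df = n - 1
    ss_tr = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)  # df = k - 1
    ss_e = sum(((g - g.mean()) ** 2).sum() for g in groups)        # df = n - k
    assert np.isclose(ss_t, ss_tr + ss_e)  # the decomposition identity
    return ss_t, ss_tr, ss_e

print(anova_ss([[5, 7, 6, 8], [9, 11, 10, 12], [4, 6, 5, 7]]))
```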

Conditional independence

At least one continuous variable: X qualitative, Y continuous. The categories of X define the groups; within each group we compute the conditional mean of Y. Conditional independence of Y on X: the conditional means of Y are invariant with respect to the modalities of X.

Example (X = AREA, Y = INCOME in euro):
AREA     2000-3000  3000-4000  Total
NORTH        2          6        8
CENTRE       2          4        6
SOUTH        6          0        6
Total       10         10       20

The partitioning of variance (i.e. of sums of squares)

The total variance of Y is the sum of two components:
Within variance = (weighted) mean of the group variances
Between variance = variance of the group means (with respect to the mean of Y)
If G = number of groups, μ_i = mean of the i-th group, n_i = size of the i-th group (i = 1, ..., G), then:
σ²_Y = (1/n) Σ_{i=1..G} n_i σ²_i  +  (1/n) Σ_{i=1..G} n_i (μ_i − μ)²
       (within variance)             (between variance)
i.e.: σ²_Y = σ²_WIT + σ²_BET, also written σ²_T = σ²_INT + σ²_EXT.

Why partition the variance?

Constant mean and variance across groups: the between variance is 0 and the within variance equals the total variance. The two groups have the same behaviour: sales are the same for the two brands A and B.
Different means, constant variance: the between variance is greater than 0. The two groups have different behaviours: sales differ by brand.

Example

Sales (Y) by Sector (X): n = 50 firms, classified into 4 groups by sector. Grouped frequency distribution of sales (class marks 150, 250, 350, 450, 1256):

Sector          100-200  200-300  300-400  400-500  500-2012  Total
Food               11       1        5        1        3        21
Drinks              1       1        0        1        0         3
Health Care         6       1        1        2        2        12
Ice Packaging       7       2        1        1        3        14
Total              25       5        7        5        8        50

Raw data (one record per firm): Ice Packaging 101; Food 109; Food 199; Health Care 354; Ice Packaging 145; Drinks 467; ...

1. Mean of Y (unconditional):
ȳ = (1/n) Σ_h ŷ_h n_h = (150·25 + 250·5 + 350·7 + 450·5 + 1256·8)/50 = 19748/50 = 394.96

2. Conditional means of Y:
ȳ_{Y|Food} = (150·11 + 250·1 + 350·5 + 450·1 + 1256·3)/21 = 7868/21 = 374.67
ȳ_{Y|Drinks} = (150·1 + 250·1 + 450·1)/3 = 850/3 = 283.33
ȳ_{Y|Health Care} = (150·6 + 250·1 + 350·1 + 450·2 + 1256·2)/12 = 4912/12 = 409.33
ȳ_{Y|Ice Packaging} = (150·7 + 250·2 + 350·1 + 450·1 + 1256·3)/14 = 6118/14 = 437.00

Remark: the conditional means differ from each other and from the unconditional mean of Y, so there is some degree of conditional dependence between X and Y. Question: is this dependence significant?

The F test

H_0: μ_1 = μ_2 = ... = μ_G (no effect of X on Y)
H_a: at least one inequality (effect of X on Y)
Y = effect of X + error:  σ²_Y = σ²_BET + σ²_WIT
If the means are equal, the between-groups variance is 0: σ²_BET = 0.
The more the means differ, the larger σ²_BET becomes relative to σ²_WIT.

The decision is based on the sample ratio σ̂²_BET / σ̂²_WIT:
the lower the ratio, the more plausible the null hypothesis;
the higher the ratio, the less plausible the null hypothesis.
Significance of the decision: under H_0,
(σ̂²_BET / (G − 1)) / (σ̂²_WIT / (n − G)) ~ F_{G−1; n−G}

ANOVA output

Variability                   Sum of squares  DoF   Mean of squares           F (observed)        Significance
Between groups (external) B   SS_EXT          k−1   MS_EXT = SS_EXT/(k−1)     F = MS_EXT/MS_INT   p-value
Within groups (internal)  W   SS_INT          n−k   MS_INT = SS_INT/(n−k)
Total                         SS_TOT          n−1   MS_TOT = SS_TOT/(n−1)

Sales data results

One-way ANOVA. Null hypothesis: mean sales are equal among sectors. Variables: Sales (Y) by Sector (X).

Variability  SS          Df  MS         F      p-value
Between      130575.4     3  43525.14   0.326  0.807
Within       6141114.89  46  133502.50
Total        6271690.29  49

Decision: the F value is low (low σ̂²_EXT: the group means are close) and the p-value is very high. We can accept the hypothesis that mean sales are equal among sectors, as confirmed by the observed sample.

Glossary

Analysis of variance (ANOVA): statistical technique for deciding whether G independent samples come from the same normal population.
Experimental (or classification) factor: variable responsible for the heterogeneity of means.
Treatment: modality (categorical data) or level (ordinal data) of a factor.
Random block: set of observations as homogeneous as possible. Each block includes as many observations as treatments; each observation is randomly assigned to one treatment.
Sample observation: statistical unit that receives a treatment or a combination of treatments.
Experimental design: set of rules for assigning sample observations to treatments, once the factors are fixed.

Hypotheses of ANOVA

1. Additivity: the treatment effect is added to the error effect, without interaction between error and treatment. The treatment effect is also independent of the intrinsic effect due to the statistical units.
2. Normality: the error is a Normal random variable, with zero mean and constant variance across treatments. The G populations are normally distributed.
3. Homoscedasticity: the error variance is constant across treatments and observations.
4. Independence of observations and of samples: values in different samples are not related.

Remarks

When G = 2 the ANOVA test is equivalent to the t test for independent samples, since F_{1,m} = t²_m.
In experimental science it is possible to select balanced (equal-size) samples by a random experimental design; this is not always possible in the social and economic sciences.
Advantages of a balanced design:
The test is less sensitive to small deviations from homoscedasticity (this is not true when samples have different sizes).
Test power is maximized by equal-size groups.
There are no serious consequences on the results if the group variances differ from the population variance.

ANOVA hypothesis violations

Normality can be checked on the residuals e_ij = y_ij − ȳ_i, using a histogram or a normal probability plot.
If effects are not additive (e.g. effects are multiplicative, or an interaction effect exists but is not included in the model), a logarithmic transformation can be used.
The independence assumption can be assured by randomly assigning statistical units to treatments, e.g. using random numbers.

To test homoscedasticity we can use Hartley's test, for equal-size groups, or Bartlett's test. For both, the null hypothesis is:
H_0: σ²_1 = ... = σ²_k
H_1: at least one variance is different.
If H_0 is rejected, we should not proceed with ANOVA. In some cases the data can be transformed in order to stabilize the variance. When the causes are not identified, the experiment should be repeated.
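Both checks are easy to script: Bartlett's test is available directly in SciPy, while Hartley's F_max statistic can be computed by hand and compared against its tabulated critical value. A sketch with simulated, purely illustrative groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Four illustrative groups of equal size (balanced design).
groups = [rng.normal(loc=50, scale=s, size=10) for s in (8, 10, 9, 12)]

# Bartlett's test: H0 is sigma^2_1 = ... = sigma^2_k.
stat, p = stats.bartlett(*groups)
print(f"Bartlett: stat = {stat:.3f}, p = {p:.4f}")

# Hartley's F_max for equal group sizes: largest over smallest sample
# variance, to be compared with the tabulated F_max critical value.
variances = [np.var(g, ddof=1) for g in groups]
print("F_max =", max(variances) / min(variances))
```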

If the null hypothesis is rejected

Conclusion: there is at least one inequality among the means of the treatment groups. Further research:
Which pairs of treatments are different? (test the hypotheses H_0: μ_i = μ_j for all i, j)
Test some more complex hypotheses: how does one treatment effect compare with the average of some other treatment effects? (contrast analysis)
Estimate some parameters of the experiment (confidence intervals).

Multiple-comparison procedures:
1. Fisher's least significant difference
2. Duncan's new multiple-range test
3. Student-Newman-Keuls procedure
4. Tukey's honestly significant difference
5. Scheffé's method
They differ mainly in test power and type I error rate; all assume equal sample sizes for the treatment groups.

Example: precision in operating hand-held chainsaws

Experiment: measure the kickback that occurs when a saw is used to cut a fiber board. Response variable: angle (in degrees) to which the saw is deflected when it begins to cut the board. 4 types of saws (A, B, C, D), equal group sizes n = 5:

                 A       B       C       D     Total
                42      57      61      29
                17      54      50      40
                24      48      44      34
                39      45      32      30
                43      41      28      22
Σ_j y_ij       165     245     215     155      780
Σ_j y²_ij     5999   12175    9965    4981    33120
(Σ_j y_ij)²  27225   60025   46225   24025   157500

H_0: μ_A = μ_B = μ_C = μ_D
H_1: at least one (μ_i − μ_j) ≠ 0

Results

          SS    DoF   MS       F      F_{0.05;3;16}
Among    1080     3   360      3.556  3.239
Within   1620    16   101.25
Total    2700    19

Decision: 3.556 > 3.239, so the null hypothesis is rejected.
Conclusion: there is a significant difference among the average kickbacks of the four types of saws.
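The table can be reproduced with a few lines of SciPy. The individual kickback angles below are the reconstruction shown above (digits restored so that the group totals and sums of squares match the slide), so treat them as illustrative:

```python
from scipy import stats

saws = {
    "A": [42, 17, 24, 39, 43],
    "B": [57, 54, 48, 45, 41],
    "C": [61, 50, 44, 32, 28],
    "D": [29, 40, 34, 30, 22],
}
# One-way ANOVA: F = MS_among / MS_within = 360 / 101.25 = 3.556
F, p = stats.f_oneway(*saws.values())
print(f"F = {F:.3f}, p = {p:.4f}")   # F = 3.556 > 3.239 = F(0.05; 3, 16)
```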

Pairwise comparison procedures (post-hoc analysis)

1. Fisher's least significant difference
It is based on the t test for the difference between two means:
H_0: μ_X = μ_Y vs H_1: μ_X ≠ μ_Y
t = (x̄ − ȳ) / (s_p · sqrt(1/n_X + 1/n_Y)) ~ t_{n_X + n_Y − 2}
where s²_p is the pooled sample variance:
s²_p = [(n_X − 1)s²_X + (n_Y − 1)s²_Y] / (n_X + n_Y − 2)
If the treatment groups all have equal size n and a common variance estimate s² = MS_WIT, only pairs whose difference in means exceeds the cut-off
|ȳ_k − ȳ_h| > t_{α/2; G(n−1)} · sqrt(2 · MS_WIT / n)
are declared significantly different by the test statistic.

Chainsaw example

ȳ_A = 165/5 = 33, ȳ_B = 245/5 = 49, ȳ_C = 215/5 = 43, ȳ_D = 155/5 = 31
C(4,2) = 6 pairwise comparisons:
H_0: μ_A = μ_B; H_0: μ_A = μ_C; H_0: μ_A = μ_D; H_0: μ_B = μ_C; H_0: μ_B = μ_D; H_0: μ_C = μ_D
Least significant difference:
t_{α/2; G(n−1)} · sqrt(2 · MS_WIT / n) = t_{0.025;16} · sqrt(2 · 101.25 / 5) = 13.5
Means in increasing order: ȳ_D = 31, ȳ_A = 33, ȳ_C = 43, ȳ_B = 49
Differences (smallest to largest):
        A (33)  C (43)  B (49)
D (31)    2       12      18
A (33)            10      16
C (43)                     6
Decision: the only pairs of means that differ significantly are (μ_A, μ_B) and (μ_B, μ_D).
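A sketch of the LSD computation from the summary statistics alone (group means, MS_WIT, n), which reproduces the 13.5 cut-off and flags the same two pairs:

```python
from itertools import combinations
import numpy as np
from scipy import stats

means = {"A": 33.0, "B": 49.0, "C": 43.0, "D": 31.0}
G, n, ms_wit = 4, 5, 101.25
df_err = G * (n - 1)                            # 16
t_crit = stats.t.ppf(1 - 0.05 / 2, df_err)      # 2.120
lsd = t_crit * np.sqrt(2 * ms_wit / n)          # ~13.5

for a, b in combinations(means, 2):
    diff = abs(means[a] - means[b])
    print(f"{a}-{b}: |diff| = {diff:4.1f}  significant: {diff > lsd}")
# Only A-B (16) and B-D (18) exceed the LSD.
```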

2. Duncan's new multiple-range test
Critical values depend on the span r of the two ranked averages being compared. Cut-off:
|ȳ_i − ȳ_j| > d_{α; r; G(n−1)} · sqrt(MS_WIT / n)
The d values are tabulated: they come from the sampling distribution of the shortest (standardized) differences between a set of means originating from the same population.
In the example (means in increasing order: ȳ_D = 31, ȳ_A = 33, ȳ_C = 43, ȳ_B = 49):
difference between 4 ranks: d_{0.05;4;16} · sqrt(MS_WIT/n) = 14.6
difference between 3 ranks: d_{0.05;3;16} · sqrt(MS_WIT/n) = 14.1
difference between 2 (adjacent) ranks: d_{0.05;2;16} · sqrt(MS_WIT/n) = 13.5
Slightly more conservative than Fisher's test: it will sometimes find fewer significant differences (about 95% agreement between the two).

3. Student-Newman-Keuls multiple-range test
Critical values again depend on the span r of the two ranked averages being compared. Cut-off:
|ȳ_i − ȳ_j| > q_{α; r; G(n−1)} · sqrt(MS_WIT / n)
The q values come from a sampling distribution derived by Gosset, called the Studentized Range or Student's q: it is similar to a t distribution and corresponds to the sampling distribution of the largest differences between a set of means originating from the same population.
In the example:
difference between 4 ranks: q_{0.05;4;16} · sqrt(MS_WIT/n) = 18.2
difference between 3 ranks: q_{0.05;3;16} · sqrt(MS_WIT/n) = 16.4
difference between 2 (adjacent) ranks: q_{0.05;2;16} · sqrt(MS_WIT/n) = 13.5
No difference is significant, whereas the F test in the ANOVA indicated that a difference exists. Still more conservative than Duncan's test.
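The studentized-range quantiles are available in recent SciPy versions as scipy.stats.studentized_range, so the three SNK cut-offs can be checked directly; a sketch, assuming SciPy 1.7 or later:

```python
import numpy as np
from scipy.stats import studentized_range

ms_wit, n, df_err = 101.25, 5, 16
se = np.sqrt(ms_wit / n)                        # 4.5
for r in (2, 3, 4):                             # span of the two ranked means
    q = studentized_range.ppf(0.95, r, df_err)
    print(f"span {r}: cut-off = {q * se:.1f}")  # ~13.5, 16.4, 18.2
```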

4. Tukey's Honestly Significant Difference (HSD test)
Uses a single critical difference, namely the largest critical difference in the Student-Newman-Keuls procedure:
|ȳ_i − ȳ_j| > q_{α; G; G(n−1)} · sqrt(MS_WIT / n)
In the example: q_{0.05;4;16} · sqrt(MS_WIT/n) = 18.2
No difference is significant, whereas the F test in the ANOVA indicated that a difference exists. Still more conservative than the Student-Newman-Keuls test.
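Recent SciPy versions also ship Tukey's procedure directly as scipy.stats.tukey_hsd; a sketch on the reconstructed chainsaw data:

```python
from scipy.stats import tukey_hsd

res = tukey_hsd([42, 17, 24, 39, 43],   # A
                [57, 54, 48, 45, 41],   # B
                [61, 50, 44, 32, 28],   # C
                [29, 40, 34, 30, 22])   # D
print(res)  # no pairwise p-value drops below 0.05, matching the 18.2 cut-off
```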

5. Scheffé's method
Can be used to compare means and also to make other types of contrasts, such as
H_0: μ_1 = (μ_2 + μ_3)/2
i.e. that treatment 1 is the same as the average of treatments 2 and 3. For a pairwise comparison the cut-off is:
|ȳ_i − ȳ_j| > sqrt((G − 1) · F_{α; G−1; G(n−1)}) · sqrt(2 · MS_WIT / n)
In the example:
sqrt(3 · F_{0.05;3;16}) · sqrt(2 · 101.25 / 5) = 19.8
No difference is significant, whereas the F test in the ANOVA indicated that a difference exists. It is the most conservative test.

Scheffé's approach is used more often for other contrasts, e.g.:
H_0: μ_A = (μ_B + μ_C)/2, equivalent to H_0: μ_A − μ_B/2 − μ_C/2 = 0
Cut-off: sqrt((G − 1) · F_{α; G−1; G(n−1)} · C · MS_E / n)
The coefficient C is the sum of the squares of the coefficients in the linear combination of the μ's:
C = 1² + (−1/2)² + (−1/2)² = 3/2
sqrt(3 · F_{0.05;3;16} · (3/2) · 101.25 / 5) = 17.18
to be compared with the sample statistic:
ȳ_A − (ȳ_B + ȳ_C)/2 = 33 − (49 + 43)/2 = −13
Decision: |−13| < 17.18, so the difference is not significant.
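A sketch of the Scheffé contrast computation from summary statistics only:

```python
import numpy as np
from scipy import stats

G, n, ms_e = 4, 5, 101.25
df_err = G * (n - 1)                               # 16
coeffs = np.array([1.0, -0.5, -0.5])               # mu_A - (mu_B + mu_C)/2
means = np.array([33.0, 49.0, 43.0])               # A, B, C
C = (coeffs ** 2).sum()                            # 3/2
F_crit = stats.f.ppf(0.95, G - 1, df_err)          # 3.239
cutoff = np.sqrt((G - 1) * F_crit * C * ms_e / n)  # ~17.18
contrast = coeffs @ means                          # -13
print(abs(contrast) > cutoff)                      # False: not significant
```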

Which procedure should be used?

It depends on which type of error is more serious:
a less conservative test has a lower probability of type II error (more power);
a more conservative test has a lower probability of type I error.
In the chainsaw example, assume the prices are approximately the same. Then a type I error is not serious: it would imply deciding that one model has less kickback than another when in fact the two models have the same amount of kickback. A type II error would imply that a difference in kickback actually exists but we fail to detect it, a more serious error. Thus, in this experiment we want maximum power, and we would probably use Fisher's least significant difference. The experimenter should decide before the experiment which method will be used to compare the means.

Overall α level for all hypotheses

With m independent t tests, each with α = 0.05, the probability that at least one will show significance by chance is:
P(diff) = 1 − (1 − α)^m = 1 − 0.95^m
m = 1: P(diff) = α = 0.05
m = 2: P(diff) = 1 − 0.95² = 1 − 0.9025 = 0.0975
m = 6: P(diff) = 1 − 0.95⁶ = 1 − 0.735 = 0.265

Bonferroni procedure

Based on t tests: we change the critical value of t used for the statistical inference. In the example: G = 4, n = 5, m = C(4,2) = 6 possible comparisons, overall α = 0.05:
H_0: μ_A = μ_B; μ_A = μ_C; μ_A = μ_D; μ_B = μ_C; μ_B = μ_D; μ_C = μ_D
The critical t value for each two-sided test will be the one with G(n − 1) = 16 degrees of freedom and
α_i = α/6 = 0.05/6 = 0.0083
Tables of the t distribution for such a value of α do not exist (we must compute the inverse of the t distribution function for that value).

The only thing that changes is the critical value used: t_{0.0083;16}. Equivalently, we can use the P value of each of the 6 t tests and check whether it is less than or equal to 0.0083:
D vs A: t = 0.3134, P = 0.7574
D vs C: t = 1.8856, P = 0.0776
D vs B: t = 2.8284, P = 0.0121
A vs C: t = 1.5713, P = 0.1357
A vs B: t = 2.5142, P = 0.0230
C vs B: t = 0.9428, P = 0.3598
None of the P values is equal to or smaller than 0.0083: none of the differences between model averages can be considered statistically significant.
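The six t statistics and P values can be reproduced from the summary statistics; a sketch:

```python
from itertools import combinations
import numpy as np
from scipy import stats

means = {"A": 33.0, "B": 49.0, "C": 43.0, "D": 31.0}
ms_wit, n, m, df_err = 101.25, 5, 6, 16
alpha_i = 0.05 / m                           # 0.0083
se = np.sqrt(2 * ms_wit / n)

for a, b in combinations(means, 2):
    t = abs(means[a] - means[b]) / se
    p = 2 * stats.t.sf(t, df_err)            # two-sided P value
    print(f"{a}-{b}: t = {t:.4f}, P = {p:.4f}, reject: {p <= alpha_i}")
# No P value is <= 0.0083: no significant difference after Bonferroni.
```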

One-degree-of-freedom comparisons

The multiple-comparison procedures above are a posteriori tests, that is, they are run after the fact. Such tests are not as powerful as those for planned orthogonal contrasts, and it seems reasonable that experiments which are well designed and which test specific hypotheses will have the greatest statistical power.
A priori approach: contrasts are planned before the experiment. The experimenter believes prior to the investigation that certain factors may be related to differences among the treatment groups. A significant F test is not a prerequisite for these one-degree-of-freedom tests.

Contrast analysis

To determine which of the models differ with respect to kickback, a follow-up procedure is needed. The experimenter believes prior to the investigation that certain factors may be related to differences among the treatment groups. For example, he might want to know whether the kickback from the home type (A and D) is the same as the kickback from the industrial type (B and C). In addition, he might also be interested in any differences in kickback within types.

Comparison                                H_0
1 Home vs. industrial                     (μ_A + μ_D)/2 − (μ_B + μ_C)/2 = 0
2 Home model A vs. home model D           μ_A − μ_D = 0
3 Industrial model B vs. industrial C     μ_B − μ_C = 0

Two-way ANOVA

Effects of two factors A and B. Data are organized as a two-way table with A1, ..., Ak the levels of factor A (columns) and B1, ..., Br the levels of factor B (rows); cell (i, j) contains the observation y_ij, with row totals y_i. and column totals y_.j. Each y_ij is a Normal random variable: Y_ij ~ N(μ_ij; σ²).

The population mean

μ = Σ_i Σ_j μ_ij / (rk),  j = 1, ..., k;  i = 1, ..., r
α_j = μ_.j − μ : effect of level j of factor A
β_i = μ_i. − μ : effect of level i of factor B
The additive model for the heterogeneity of the population means μ_.j and μ_i.:
μ_ij = μ + α_j + β_i
Y_ij = μ_ij + ε_ij = μ + α_j + β_i + ε_ij
The effects of level j of factor A and of block i of factor B are supposed to be additive, i.e. there is no joint effect between α_j and β_i.
ε_ij ~ N(0; σ²), with Σ_{j=1..k} α_j = 0 and Σ_{i=1..r} β_i = 0.

SS partition

Y_ij − Y_.. = (Y_.j − Y_..) + (Y_i. − Y_..) + (Y_ij − Y_.j − Y_i. + Y_..)
Squaring and summing over all i, j: Y = effect of A + effect of B + error, i.e.
σ²_Y = σ²_BET(col) + σ²_BET(row) + σ²_WIT
SS*_T = SS*_K + SS*_R + SS*_E

The output of two-way ANOVA

Source of variation   Sum of squares   DoF             Mean of squares
Among columns         SS*_K            k − 1           MS*_K = SS*_K / (k − 1)
Among rows            SS*_R            r − 1           MS*_R = SS*_R / (r − 1)
Error (= within)      SS*_E            (k − 1)(r − 1)  MS*_E = SS*_E / [(k − 1)(r − 1)]
Total                 SS*_T            rk − 1

The tests

1) Test on the treatments of factor A:
H_0: μ_.i = μ_.j for all i, j = 1, ..., k; H_1: at least one difference
F = MS*_K / MS*_E = [SS*_K / (k − 1)] / [SS*_E / ((k − 1)(r − 1))] ~ F_{(k−1); (k−1)(r−1)}
2) Test on the treatments of factor B:
H_0: μ_i. = μ_j. for all i, j = 1, ..., r; H_1: at least one difference
F = MS*_R / MS*_E = [SS*_R / (r − 1)] / [SS*_E / ((k − 1)(r − 1))] ~ F_{(r−1); (k−1)(r−1)}
Under the corresponding H_0, each F is a ratio of independent chi-square variables divided by their degrees of freedom.

Example

The industrial vehicles manager of IMS wants to know which combination of diesel fuel and carburetor performs better. He plans an experiment with 5 carburetors and 4 types of diesel; the same amount of each diesel is tested in each of the 5 carburetors. The performances are:

            Carburetor
Diesel     1     2     3     4     5    y_i.   ȳ_i.
1         10    13     9    14    11     57    11.4
2          5    10     5    10     6     36     7.2
3          6    12     5    10     6     39     7.8
4          4     8     4    11     5     32     6.4
y_.j      25    43    23    45    28    164
ȳ_.j    6.25 10.75  5.75 11.25     7

Results

C = y²_.. / (rk) = 164² / ((4)(5)) = 1344.8
SS_T = Σ_i Σ_j y²_ij − C = (10² + 13² + ... + 5²) − 1344.8 = 191.2
SS_K = (1/r) Σ_j y²_.j − C = (25² + 43² + 23² + 45² + 28²)/4 − 1344.8 = 108.2
SS_R = (1/k) Σ_i y²_i. − C = (57² + 36² + 39² + 32²)/5 − 1344.8 = 73.2
SS_E = SS_T − (SS_K + SS_R) = 191.2 − (108.2 + 73.2) = 9.8

Source of variation   Sum of squares  DoF  Mean of squares   F
Among carburetors     108.2            4   27.05             33.11
Among diesel types     73.2            3   24.40             29.86
Error                   9.8           12    0.817
Total                 191.2           19

Decisions

1) Test on the treatments of factor A (carburetors): H_0: α_j = 0 for all j = 1, ..., k; H_1: at least one α_j ≠ 0.
Since 33.11 > F_{0.01; 4; 12} = 5.41, we reject H_0 at the 1% level: the 5 carburetors perform differently across diesel types.
2) Test on the treatments of factor B (diesel types): H_0: β_i = 0 for all i = 1, ..., r; H_1: at least one β_i ≠ 0.
Since 29.86 > F_{0.01; 3; 12} = 5.95, we reject H_0 at the 1% level: the 4 diesel types perform differently across carburetors.
We can choose the best combination from the data table.
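The whole table can be reproduced from the data matrix with the correction-term formulas above; a sketch (the matrix is the restored table shown earlier, so its digits should be treated as a reconstruction):

```python
import numpy as np

# Rows: diesel types 1-4; columns: carburetors 1-5.
y = np.array([[10, 13,  9, 14, 11],
              [ 5, 10,  5, 10,  6],
              [ 6, 12,  5, 10,  6],
              [ 4,  8,  4, 11,  5]], dtype=float)
r, k = y.shape
C = y.sum() ** 2 / (r * k)                   # 164^2 / 20 = 1344.8
ss_t = (y ** 2).sum() - C                    # 191.2
ss_k = (y.sum(axis=0) ** 2).sum() / r - C    # among carburetors: 108.2
ss_r = (y.sum(axis=1) ** 2).sum() / k - C    # among diesel types: 73.2
ss_e = ss_t - ss_k - ss_r                    # 9.8
ms_k = ss_k / (k - 1)
ms_r = ss_r / (r - 1)
ms_e = ss_e / ((k - 1) * (r - 1))
print(ms_k / ms_e, ms_r / ms_e)              # F ~ 33.1 and ~29.9
```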

About the k levels of a factor

Fixed-effects model (FEM, or Model I): the experimenter, usually in the later stages of experimentation, narrows the possible treatments down to those of special interest. All levels (of interest) of a treatment are included in the experiment. The inference is restricted to the treatments used in the experiment (conclusions cannot be generalized to unobserved treatments).
Random-effects model (REM, or Model II): treatments are a random sample of all possible treatments of interest. The model does not look for differences among the group means of the treatments being tested, but rather asks whether there is significant variability among all possible treatment groups. Results can be generalized to all treatments in the population, even those not observed. If the experiment were repeated, treatments chosen at random would be used.
In both models we assume that the experimental units are chosen at random from the population and assigned at random to the treatments.

Dependence analysis scheme

                        Independent variable X
Dependent variable Y    Numerical               Categorical
Numerical               Regression analysis     ANOVA
Categorical             Discriminant analysis   --

Y numerical, X categorical:
Y = f(X): one-way ANOVA
Y = f(X1, X2): two-way ANOVA
Y = f(X1, ..., Xp): multi-way ANOVA
Y1, ..., Yq = f(X1, ..., Xp): multivariate ANOVA (MANOVA)

Tests for comparing means

                    Numerical dependent variable Y
Factor X            Univariate    Multivariate
2 groups            z or t test   Hotelling T² test
3 or more groups    ANOVA         MANOVA

Multivariate ANOVA (MANOVA)

More than one dependent variable at a time. The computations become increasingly complex, but the logic and nature of the computations do not change. Topics: between-groups designs, repeated measures designs, sum scores versus MANOVA.

Between-groups designs

Example: we want to try two different textbooks, and we are interested in the students' improvements in math and physics. We have two dependent variables: improvement in math and improvement in physics. Hypothesis: both together are affected by the difference in textbooks. We use multivariate analysis of variance (MANOVA) to test this hypothesis.
Interpreting results: if the overall multivariate test is significant, we conclude that the textbook effect is significant. However, the next question is whether only math skills improved, only physics skills improved, or both. In fact, after obtaining a significant multivariate test for a particular main effect or interaction, we customarily examine the univariate F tests for each variable to interpret the respective effect. In other words, we identify the specific dependent variables that contributed to the significant overall effect.

Repeated measures designs

Example: if we measure math and physics skills at the beginning and at the end of the semester, we have a repeated measures MANOVA. Again, the logic of significance testing in such designs is simply an extension of the univariate case.

Sum scores versus MANOVA

Even experienced users of ANOVA and MANOVA techniques are often puzzled by the differences in results that sometimes occur when performing a MANOVA on, for example, three variables, as compared to a univariate ANOVA on the sum of the three variables. The logic underlying the summing of variables is that each variable contains some "true" value of the variable in question, as well as some random measurement error. Therefore, by summing variables, the measurement error will sum to approximately 0 across all measurements, and the sum score will become more and more reliable (increasingly equal to the sum of true scores). Under these circumstances, ANOVA on sums is appropriate and represents a very sensitive (powerful) method. However, if the dependent variable is truly multi-dimensional in nature, then summing is inappropriate. For example, suppose that the dependent measure consists of four indicators of success in society, and each indicator represents a completely independent way in which a person could "make it" in life (e.g., successful professional, successful entrepreneur, successful homemaker). Summing the scores on those variables would be like adding apples to oranges, and the resulting sum score would not be a reliable indicator of a single underlying dimension. Such data should instead be treated as multivariate indicators of success in a MANOVA.

Multivariate test statistics

Wilks' Lambda (Λ) and/or its transformation, Bartlett's V: their combined use allows testing differences in means with any number of Y's and X's.
Hotelling's T²: two groups (a particular case of Wilks' Λ).
Pillai-Bartlett trace: the most robust, to be used when assumptions are violated.
Roy's GCR (Greatest Characteristic Root): not as robust.
Wilks' Λ is also called the U statistic and varies between 0 (group means differ maximally) and 1 (group means are all equal):
high values of Λ: the group means are similar to each other;
low values of Λ: there are differences among the group means.
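A minimal numerical sketch of Wilks' Λ for a one-way MANOVA layout, computed as det(W) / det(W + B) from the within- and between-groups SSCP matrices; the function name and the simulated data are hypothetical:

```python
import numpy as np

def wilks_lambda(groups):
    """Lambda = det(W) / det(W + B); groups is a list of (n_i x p) arrays."""
    X = np.vstack(groups)
    grand = X.mean(axis=0)
    p = X.shape[1]
    W = np.zeros((p, p))     # within-groups SSCP
    B = np.zeros((p, p))     # between-groups SSCP
    for g in groups:
        g = np.asarray(g, dtype=float)
        d = g - g.mean(axis=0)
        W += d.T @ d
        m = g.mean(axis=0) - grand
        B += len(g) * np.outer(m, m)
    return np.linalg.det(W) / np.linalg.det(W + B)

rng = np.random.default_rng(1)
groups = [rng.normal(loc=mu, size=(10, 2)) for mu in ([0, 0], [0, 1], [1, 1])]
print(wilks_lambda(groups))  # closer to 0 the more the group means differ
```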