Statistics for exp. medical researchers Comparison of groups, T-tests and ANOVA


Faculty of Health Sciences

Statistics for exp. medical researchers
Comparison of groups, T-tests and ANOVA
Lene Theil Skovgaard, Sept. 14, 2015

Outline
- Paired comparisons: tests and confidence intervals
- Two-sample T-tests
- Power and sample size
- One-way analysis of variance (ANOVA)
- as above, for related samples (two-way ANOVA)

Home page: http://staff.pubhealth.ku.dk/~jufo/basicstatisticsx2015
E-mail: ltsk@sund.ku.dk
A marked page may be skipped without serious consequences.

Comparison of treatments/groups

An important distinction:
- Treatments applied to distinct groups of units: unpaired comparison, two-sample T-test, one-way ANOVA
- Treatments applied to the same units: paired comparison, paired T-test, one-way matched-samples ANOVA (= two-way ANOVA)

Example: Gene expression, 3 treatments vs. Control

Five identical cell lines are kept apart and stored for some weeks. Each is divided into 4 parts, which are treated in 4 different ways, called 4 conditions: Control and treatments A, B and C. Gene expression is measured after 8 and 12 hours.

Research questions: Are the treatments any good? Which one is the best?

Reference: Mattias Salling Dahllöf, personal communication.

Data at the 12-hour measurement

  Cell line   Control      A      B      C
  1               3.9    6.6    4.7    5.2
  2               2.8   12.1    3.4    3.7
  3               3.3   20.2    4.7    4.2
  4               7.1   16.5    4.2   11.5
  5               9.9   31.1   18.9    5.1
  Average         5.4   17.3   7.18   5.94
  Median          3.9   16.5    4.7    5.1
  SD             3.02   9.23   6.57   3.17
  SEM            1.35   4.13   2.94   1.42

Are these summary statistics informative? Somewhat, but they do not use the pairing.

Traditional illustration

A bar chart: the boxes show the average expression for each condition, and the red bars show ± 1 standard error (for the mean).

Problems with this traditional illustration
- No indication of cell line (the pairing is not seen)
- The standard error reflects variation between cell lines (and n=5...)
- The boxes convey no information beyond their height
- We get no impression of the distribution

Alternative plots, I

Get rid of the boxes and show confidence limits instead. But still: no indication of cell line (the pairing is not seen), and no impression of the distribution. Note that the confidence interval extends below 0 for condition B, and imagine what the reference regions would look like.

Alternative plots, II

Box plots: here we get an impression of the distribution, but still no indication of cell line (the pairing is not seen).

Best choice: spaghetti plots

Connect the cell lines. Note: the pairing can now be seen, and we may judge the distribution...? But we have only 5 observations for each treatment.

Start simple: two conditions, A vs. Ctrl

An informative plot: treatment A appears to be effective, since all the lines have negative slope (lowest values for Ctrl, highest for A).

Paired comparisons

Some cell lines (no. 5) show larger values than others: were they analysed on the same day? By the same researcher? These differences are seen for both the A treatment and the Control treatment: we have paired observations.

Other examples of paired measurements:
- Two measurement techniques applied to the same individual/animal/blood sample
- Measurements on the same individual before and after treatment
- Case-control studies / matched pairs

Analysis of paired data

Purpose: investigate whether there is an actual difference between the A and Control treatments, and quantify this difference.
- The cell line is its own control (matched samples); this gives larger power to detect possible differences.
- Look at individual differences, but on which scale? (to be discussed)
- Investigate whether the differences have mean zero: paired T-test
- Quantify the difference with a confidence interval, and possibly also reference regions.

Choice of scale

Are the sizes of the differences independent of the level of the measurements? In that case the difference between treatments can be evaluated and quantified in the units of measurement. Or do we see relative (percentage) differences? In that case differences should be evaluated and quantified on a logarithmic scale (and transformed back), because

  log(A / Ctrl) = log(A) - log(Ctrl)

Look at a Bland-Altman plot.

Bland-Altman plot

A scatter plot showing, for each cell line,
- Y: differences (treatment A minus Control)
- X: averages of treatment A and Control

Not a lot of information from 5 measurements... but still a tendency for larger differences at larger values!

Traditional model for paired data

- X_i: measurement for the Control treatment of the i-th cell line
- Y_i: measurement for treatment A of the i-th cell line

The differences D_i = Y_i - X_i (i = 1, ..., n = 5) are assumed to be independent: the 5 cell lines do not share any properties (apart from consisting of identical cells). A violation of this assumption might be that e.g. 2 cell lines were analyzed on one day (by one researcher), and the remaining 3 on another day (by another researcher).

Traditional model for paired data, II

The differences D_i = Y_i - X_i (i = 1, ..., n = 5) are furthermore assumed
- to have the same mean value δ and the same standard deviation (variance): evaluated by the aforementioned Bland-Altman plot of differences vs. averages
- and to follow a Normal distribution: evaluated graphically by histograms or quantile plots (testing is not generally recommended, see p. 46)

The Normal distribution

Imagine lots of observations, so that a histogram looks almost smooth; this is a density, with
- mean, often denoted µ or α
- standard deviation, often denoted σ

The N(µ, σ²) distribution is theoretically justified by the Central Limit Theorem.

The Normality assumption

Note: there is no assumption regarding the distribution of the actual measurements! Only assumptions regarding their differences, because we are dealing with a paired design. But we only have 5 differences:

  D_1 = 2.7, D_2 = 9.3, D_3 = 16.9, D_4 = 9.4, D_5 = 21.2

which is nowhere near enough to evaluate the appropriateness of a Normality assumption. We have to either
- trust the assumption (based on previous experience), or
- do something else (nonparametric tests...?)

Histogram of the 5 differences

Do we see a Normal distribution? We cannot answer that because of the small sample size (n=5).

Quantile plot for the 5 differences

Do we see a straight line? We cannot really answer that because of the small sample size (n=5).

Inference = statistical analysis

Based on the collected data we need to say something about the total population (in this case this particular type of cell, and its reaction to treatment):
- Estimation: when we see these 5 differences, what can we say about the two unknown parameters δ and σ (mean and SD)?
- Test: can we detect a systematic difference between treatment A and Control?
- Prediction: how large can we expect the differences between treatment A and Control to get when measuring on other cell lines of the same kind?

Estimation of the mean difference

i.e. the mean δ of the differences D_i. We call this a one-sample problem: 5 independent observations of the same (normally distributed) variable D:

  D_i ~ N(δ, σ²)

The maximum likelihood principle tells us that the parameters δ and σ are estimated by the average of the differences and the empirical standard deviation s (or variance s²), respectively:

  s² = 1/(n-1) Σ (D_i - D̄)²

Maximum likelihood estimates are denoted by a hat, e.g. δ̂.

Estimate of δ

δ̂ = D̄, the average difference:

  proc means data=cells;
    var a_minus_control;

  Analysis Variable : a_minus_control
  N   Mean         Std Dev     Minimum     Maximum
  5   11.9000000   7.2308367   2.7000000   21.2000000

So we get δ̂ = 11.9 (and s = SD = 7.23). Estimates must be stated with uncertainty estimates, e.g. by quoting a confidence interval!
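The proc means computation above is easy to reproduce in a few lines of Python. This is purely an illustrative cross-check of the slide's numbers, not part of the course material:

```python
from statistics import mean, stdev

# The 5 paired differences (treatment A minus Control) from the slides
d = [2.7, 9.3, 16.9, 9.4, 21.2]

delta_hat = mean(d)   # maximum likelihood estimate of the mean
s = stdev(d)          # empirical SD, with divisor n - 1

print(round(delta_hat, 1))   # 11.9
print(round(s, 4))           # 7.2308
```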

Uncertainty of an estimate

Here we are looking for the uncertainty of an average, as an estimate of the unknown true mean difference between treatment A and Control (or the difference between the two unknown means).

What does this mean? We may imagine repetitions of the investigation:
- Each experiment yields a new estimate
- We may study the distribution of these estimates
- The standard deviation of this distribution quantifies the uncertainty of the estimate
- It is called the standard error (SE) of the estimate
- If we (as here) are concerned with a mean, it is called the standard error of the mean (SEM)

To sum up

The uncertainty of an average is called the standard error (of the mean), SE(M), and we have the relation

  SEM = SD / sqrt(n)

- SEM gets smaller when the sample size n gets larger
- It is used for the construction of confidence intervals, which we will discuss now
- It cannot be used to describe the variation of the measurements

Confidence interval

An interval that catches the unknown parameter (here the mean value δ) with a high (typically 95%) probability (95% coverage). What do we trust the unknown value of δ to be? The 95% confidence interval for the mean δ is

  δ̂ ± t-quantile × SEM

Here, the t-quantile refers to the 97.5% quantile of the T-distribution with n-1 (here 4) degrees of freedom (typically abbreviated DF in output).

Technical note, I

The t-distribution (Student distribution) has a parameter called df, the degrees of freedom (illustrated here for df = 2, 10, 100).
- Many degrees of freedom: the distribution looks like a Normal distribution
- Few degrees of freedom: heavier tails

Technical note, II

The 97.5% quantile of a T-distribution (the near-to-2 value) depends strongly on the number of observations when this is small. Here shown for degrees of freedom ranging from 2 to 30 (first rows below):

  degrees of freedom   97.5% quantile
   2                   4.302653
   3                   3.182446
   4                   2.776445
   5                   2.570582
   6                   2.446912
   7                   2.364624
   8                   2.306004
   9                   2.262157
  10                   2.228139
  11                   2.200985
  12                   2.178813
  13                   2.160369
  14                   2.144787
  15                   2.131450

Confidence interval for δ

Here we have SEM = 7.23 / sqrt(5) = 3.23, and therefore a 95% confidence interval

  11.9 ± 2.776 × 3.23 = (2.93, 20.88)

This confidence interval may be obtained directly (and more precisely) from SAS:

  proc means N mean stderr clm data=cells;
    var a_minus_control;

  Analysis Variable : a_minus_control
                               Lower 95%     Upper 95%
  N   Mean         Std Error   CL for Mean   CL for Mean
  5   11.9000000   3.2337285   2.9217303     20.8782697
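The hand calculation above can be mirrored with scipy (an illustrative sketch; scipy is not used in the course, but its t-quantile matches the table above):

```python
from math import sqrt
from statistics import mean, stdev
from scipy import stats

d = [2.7, 9.3, 16.9, 9.4, 21.2]   # paired differences from the slides
n = len(d)
m = mean(d)
sem = stdev(d) / sqrt(n)           # standard error of the mean

t975 = stats.t.ppf(0.975, df=n - 1)   # 97.5% quantile of t(4), ~ 2.776
lower, upper = m - t975 * sem, m + t975 * sem
print(round(sem, 4), round(lower, 2), round(upper, 2))  # 3.2337 2.92 20.88
```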

Test statistic Test of no difference for our paired data A quantity that measures the discrepancy between observations and hypothesis General principle: The decrease in likelihood, from model to hypothesis Large discrepancy: Reject the hypothesis, because it does not fit to the data. But what does large mean here? Larger (more extreme) than expected by chance if the hypothesis is actually true, i.e. more extreme than usually seen in a t-distribution with 4 degrees of freedom. H 0 : δ = 0, Test statistic t = ˆδ 0 SEM = 11.9 0 3.23 = 3.68 t(4) Does the value 3.68 fit well to a t-distribution with 4 degrees of freedom? No, this value lies far out in the tail of the distribution, and we will therefore reject the hypothesis. The P-value is 0.02, to be discussed in a moment. Small (numeric) t: Good fit, accept hypothesis (P > 0.05) Large (numeric) t: Bad fit, reject hypothesis (P < 0.05) 33 / 96 34 / 96 Technical note on the T test statistic The P-value Where did the T-distribution come from? The average (ˆδ) is Normally distributed (based on our assumptions) but we divide with an estimated standard deviation Had the standard deviation been known, the test statistic would also have been Normally distributed Instead we estimate the standard deviation, and the penalty for the uncertainty in this estimation is the (somewhat heavier tailed) t-distribution (see p. 28) Reminder: The P-value is a measure of the agreement between the value of the test statistic and the theoretical distribution under H 0. More precisely, it is the tail probability, i.e. the probability of observing this or worse, provided that the hypothesis is true, worse meaning something speaking more against the hypothesis than what we actually observed 35 / 96 36 / 96

The P-value, II Philosophy of P-value If the probability of observing something worse is very small, it must be pretty bad, what we did observe. But what is very small? If P is below the significance value, the test is significant, and the hypothesis is rejected The significance value α is usually chosen to be 5% (or 0.05), but this is arbitrary (more on this later on...) Here, we have t = 3.68 t(4), and P = 0.02, indicating mild discrepancy between data and hypothesis 37 / 96 38 / 96 Tests vs. confidence intervals Paired T-test in SAS: A significant test statistic : Confidence interval not including 0 A non-significant test statistic : Confidence interval including 0 Tests and confidence intervals are equivalent Here, it is exact, sometimes it is only approximate 39 / 96 Three different ways to do it: 1. Using the simple procedure proc means: proc means N mean stddev stderr t probt data=differences; var A_minus_control; 2. Using the paired statement in proc ttest: proc ttest plots=all data=cells; paired control*treata; 3. Using a T-test on the precalculated differences, again with proc ttest: proc ttest data=cells; var A_minus_control; 40 / 96

Output from method 1

  Analysis Variable : A_minus_control
  N   Mean         Std Dev     Std Error   t Value   Pr > |t|
  5   11.9000000   7.2308367   3.2337285   3.68      0.0212

Output from methods 2 and 3

  The TTEST Procedure
  Difference: control - treata

  N   Mean       Std Dev   Std Err   Minimum    Maximum
  5   -11.9000   7.2308    3.2337    -21.2000   -2.7000

  Mean       95% CL Mean           Std Dev   95% CL Std Dev
  -11.9000   -20.8783   -2.9217    7.2308    4.3322   20.7782

  DF   t Value   Pr > |t|
  4    -3.68     0.0212

Reference region: limits of agreement

How large are the typical differences between the two conditions for a single cell line? "Limits of agreement" is a special name for the reference region (interval) for differences, i.e.

  D̄ ± approx. 2 × SD = 11.9 ± 2.776 × 7.23 = (-8.17, 31.97)

Based on this we would say that the difference may well be negative, so that the Control value is bigger than the value for the A condition. The Normality assumption is important here, even if you have a lot of observations!

When normality cannot be assumed

- P-values may be too small or too big
- Confidence limits may be too narrow or too wide
Then what? A non-parametric procedure: the Wilcoxon signed-rank test.
- No assumption of Normality
- Still an assumption of independence, and of a symmetric distribution of the differences

Wilcoxon signed-rank test in SAS

It can only be performed on the precalculated differences:

  proc univariate ciquantdf normal data=differences;
    var A_minus_control;

with output

  Tests for Location: Mu0=0
  Test          -Statistic-    -----p Value------
  Student's t   t   3.679963   Pr > |t|    0.0212
  Sign          M   2.5        Pr >= |M|   0.0625
  Signed Rank   S   7.5        Pr >= |S|   0.0625

Note that the significance has disappeared. It always does when the sample size is below 6! Confidence limits for the median can be calculated (option ciquantdf).

Output, cont'd

Tests of normality (option normal on p. 45):

  Tests for Normality
  Test                 --Statistic---    -----p Value------
  Shapiro-Wilk         W      0.954679   Pr < W      0.7705
  Kolmogorov-Smirnov   D      0.235231   Pr > D     >0.1500
  Cramer-von Mises     W-Sq   0.041573   Pr > W-Sq  >0.2500
  Anderson-Darling     A-Sq   0.237774   Pr > A-Sq  >0.2500

Important note: tests for normality are very weak for n = 5. In general they do not convey much information and are considered useless (more on p. 78).

New example: selenium in cabbage

Book p. 39: compare two methods for determining the concentration of selenium in vegetables. The concentration is measured in 16 portions of a head of cabbage (units are mg per 100 g):
- 8 portions with Method 1
- 8 portions with Method 2
Each portion can only be used once: there is no pairing.

  Obs   method   selenium
    1      1       0.20
    2      1       0.19
    3      1       0.14
    4      1       0.19
  ...    ...        ...
   13      2       0.18
   14      2       0.15
   15      2       0.14
   16      2       0.12
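The signed-rank result can be cross-checked in Python (illustrative only). With n = 5 and all differences positive, the smallest achievable exact two-sided P-value is 2 × (1/2)^5 = 0.0625, which is exactly what we see:

```python
from scipy.stats import wilcoxon

d = [2.7, 9.3, 16.9, 9.4, 21.2]   # the 5 paired differences

# Exact two-sided signed-rank test (scipy uses the exact distribution
# for small samples without ties)
stat, p = wilcoxon(d)
print(p)   # 0.0625
```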

What do we want to do?

1. State a reasonable model for the data
2. Estimate the difference in mean values, with a confidence interval
3. Test equality of the means
4. Anything else? What is the purpose of the study?

Model for unpaired comparison

Two groups, each assumed to have normally distributed observations:

  Method 1: Y_1i, i = 1, ..., 8, ~ N(µ_1, σ²)
  Method 2: Y_2i, i = 1, ..., 8, ~ N(µ_2, σ²)

The groups need not be of identical size! Assumptions:
- All observations are independent
- The standard deviations are equal in the two groups (should be checked, if possible)
- The observations are Normally distributed in each group, with possibly different mean values

Summary statistics

  Variable: selenium
  method   N   Mean     Std Dev   Std Err   Minimum   Maximum
  1        8   0.1988   0.0275    0.00972   0.1400    0.2300
  2        8   0.1513   0.0210    0.00743   0.1200    0.1800

Do the means appear to be equal? No, based on the averages above and the figure on p. 48. Do the standard deviations appear to be equal? Yes, based on the estimates above and the figure on p. 48. What about normality?

Normality assumption

Hypothetical situation with two groups (dashed curves). Note: taken together (solid curve) we do not have a Normal distribution!!

Estimated difference of the means

Estimate:

  µ̂_1 - µ̂_2 = Ȳ_1 - Ȳ_2 = 0.1988 - 0.1513 = 0.0475 mg/(100 g)

What is the uncertainty of this estimate?

  SE(Ȳ_1 - Ȳ_2) = s × sqrt(1/n_1 + 1/n_2) = 0.0122

where s = SD is a pooled estimate of the two standard deviations (here just an average, because the groups are of equal size).

Confidence interval for the difference in means

Based on the estimated SE(Ȳ_1 - Ȳ_2) and the degrees of freedom df = (n_1 - 1) + (n_2 - 1) = (8-1) + (8-1) = 14, the 95% confidence interval becomes

  0.0475 ± 2.145 × 0.0122 = (0.0213, 0.0737)

The number 2.145 in this formula is the 97.5% quantile of a T-distribution with 14 degrees of freedom (see p. 29).

Unpaired T-test for equality of means

Hypothesis: H0: µ_1 = µ_2.

  T = (Ȳ_1 - Ȳ_2) / SE(Ȳ_1 - Ȳ_2)
    = (Ȳ_1 - Ȳ_2) / (SD × sqrt(1/n_1 + 1/n_2))
    = 0.0475 / 0.0122 = 3.89

which in a T-distribution with 14 degrees of freedom gives the P-value P = 0.0017, as seen in the output below.

Unpaired T-test in SAS

  proc ttest data=book_p39;
    class method;
    var selenium;

  The TTEST Procedure
  Variable: selenium
  method       N   Mean     Std Dev   Std Err   Minimum   Maximum
  1            8   0.1988   0.0275    0.00972   0.1400    0.2300
  2            8   0.1513   0.0210    0.00743   0.1200    0.1800
  Diff (1-2)       0.0475   0.0245    0.0122

  method       Method          Mean     95% CL Mean       Std Dev
  1                            0.1988   0.1758   0.2217   0.0275
  2                            0.1513   0.1337   0.1688   0.0210
  Diff (1-2)   Pooled          0.0475   0.0213   0.0737   0.0245
  Diff (1-2)   Satterthwaite   0.0475   0.0211   0.0739
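The pooled two-sample test can also be reconstructed from the summary statistics alone, following the formulas above. A sketch in Python (illustrative; small rounding differences from SAS are expected because we start from the rounded means and SDs):

```python
from math import sqrt
from scipy import stats

# Summary statistics from the selenium example (n = 8 per method)
n1 = n2 = 8
m1, s1 = 0.1988, 0.0275   # Method 1: mean, SD
m2, s2 = 0.1513, 0.0210   # Method 2: mean, SD

# Pooled SD and the standard error of the difference
sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
se = sp * sqrt(1 / n1 + 1 / n2)

df = n1 + n2 - 2
t = (m1 - m2) / se
p = 2 * stats.t.sf(abs(t), df)        # two-sided P-value
q = stats.t.ppf(0.975, df)            # 97.5% quantile of t(14)
ci = (m1 - m2 - q * se, m1 - m2 + q * se)
print(round(t, 2), [round(x, 4) for x in ci])  # 3.88 [0.0213, 0.0737]
```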

Output, continued

  Method          Variances   DF       t Value   Pr > |t|
  Pooled          Equal       14       3.88      0.0017
  Satterthwaite   Unequal     13.096   3.88      0.0019

  Equality of Variances
  Method     Num DF   Den DF   F Value   Pr > F
  Folded F   7        7        1.71      0.4947

Note the 2 different versions of the T-test, depending on whether the standard deviations (variances) are equal in the two groups. Here we find a significant difference no matter what, i.e. we can reject the hypothesis of equal means.

Reminder: the distribution of a test statistic

Imagine many identical investigations, each with 16 observations made by a single method, and put:
1. 8 randomly chosen in one group, the rest in the other -> t_1
2. 8 randomly chosen in one group, the rest in the other -> t_2
etc.
The distribution of the t's (assuming H0 to be true) is a t-distribution (Student distribution) with 14 degrees of freedom. Our observed T should be compared to this distribution. Does it fit in nicely? No, it is too big, and therefore we can conclude that the two methods differ.

Technicalities, I

The assumption of equal variances can be checked by looking at the ratio of the variances:

  F = s_1² / s_2² = 0.0275² / 0.0210² = 1.71 ~ F(7, 7),   P = 0.49

(see the output above). There is no indication of a violation of the assumption of equal variances. But the test is weak due to the small sample size, since any variance ratio below 5 will be insignificant...

Technicalities, II

If the variances differ, or if we do not want to make the equality assumption, we may instead calculate

  t = (Ȳ_1 - Ȳ_2) / se(Ȳ_1 - Ȳ_2)
    = (Ȳ_1 - Ȳ_2) / sqrt(s_1²/n_1 + s_2²/n_2) = 3.88 ~ t(13.096)

(see the output above). Here we get the identical T-value (because the sample sizes are equal) but a fractional number of degrees of freedom (due to approximations). Note also that the confidence interval becomes somewhat wider: (0.0211, 0.0739) in contrast to (0.0213, 0.0737).
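Both technicalities can be sketched numerically from the summary statistics (illustrative only; starting from the rounded SDs, so the last digits can differ slightly from the SAS output):

```python
from scipy import stats

n1 = n2 = 8
s1, s2 = 0.0275, 0.0210   # selenium SDs per method

# Folded-F test of equal variances: larger variance on top, F ~ F(7, 7)
F = s1**2 / s2**2                          # ~ 1.71
p_F = 2 * stats.f.sf(F, n1 - 1, n2 - 1)    # two-sided, ~ 0.49
print(round(F, 2), round(p_F, 2))

# Satterthwaite (Welch) degrees of freedom, ~ 13.1
v1, v2 = s1**2 / n1, s2**2 / n2
df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
print(round(df, 1))
```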

Sample size considerations, I

Truth vs. action:

              accept               reject
  H0 true     1 - α                α (error of type I)
  H0 false    β (error of type II) 1 - β

The significance level α (usually 0.05) denotes the risk that we are willing to take of rejecting a true hypothesis, also denoted an error of type I.

Sample size considerations, II

β denotes the probability of accepting a false hypothesis, i.e. overlooking an effect: an error of type II. 1 - β is denoted the power; it describes the probability of rejecting a false hypothesis. But what does "H0 false" mean? How false is H0?

The power is a function of the true difference: if the difference is xx, what is our probability of detecting it at the 5% level? Power
- is calculated in order to determine the size of an investigation
- once the observations have been gathered, we present confidence intervals instead

Paired or unpaired can make a huge difference

If we had treated the comparison of treatment A and Control (see p. 41-42) as an unpaired comparison, we would have had

  The TTEST Procedure
  Variable: Expression
  Condition    N   Mean      Std Dev   Std Err   Minimum   Maximum
  A            5   17.3000   9.2334    4.1293    6.6000    31.1000
  Ctrl         5   5.4000    3.0232    1.3520    2.8000    9.9000
  Diff (1-2)       11.9000   6.8700    4.3450

  Condition    Method          Mean      95% CL Mean         Std Dev
  A                            17.3000   5.8353    28.7647   9.2334
  Ctrl                         5.4000    1.6461    9.1539    3.0232
  Diff (1-2)   Pooled          11.9000   1.8804    21.9196   6.8700
  Diff (1-2)   Satterthwaite   11.9000   0.6246    23.1754

  Method          Variances   DF       t Value   Pr > |t|
  Pooled          Equal       8        2.74      0.0255
  Satterthwaite   Unequal     4.8479   2.74      0.0422

  Equality of Variances
  Method     Num DF   Den DF   F Value   Pr > F
  Folded F   4        4        9.33      0.0526
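The paired vs. unpaired contrast is easy to reproduce from the 12-hour data (an illustrative sketch; both calls mirror the SAS analyses above):

```python
from scipy import stats

ctrl   = [3.9, 2.8, 3.3, 7.1, 9.9]      # gene expression, Control
treata = [6.6, 12.1, 20.2, 16.5, 31.1]  # gene expression, treatment A

t_pair, p_pair = stats.ttest_rel(treata, ctrl)   # uses the pairing
t_unp, p_unp = stats.ttest_ind(treata, ctrl)     # pooled, ignores pairing

print(round(p_pair, 4))   # 0.0212 -- paired
print(round(p_unp, 4))    # 0.0255 -- unpaired
```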

Paired or unpaired, II

Comparison of results for the gene expression example:

  Method for comparison   t Value   Degrees of freedom   P-value   Conf. int.
  Paired                  3.68      4                    0.02      (2.92, 20.88)
  Unpaired                2.74      8                    0.026     (1.88, 21.92)

Here the difference is not very big... because the correlation within cell line is not that big.

Example: study of dissolution testing

Question 10 in the book (p. 224): the mass of the active ingredient is measured in 24 identical tablets, 6 at each of 4 laboratories. The design is unpaired.

Comparison of more than two groups

The extension of the unpaired T-test is called one-way ANOVA (Analysis Of Variance). Model:

  Laboratory 1: Y_1i, i = 1, ..., n_1, ~ N(µ_1, σ²)
  Laboratory 2: Y_2i, i = 1, ..., n_2, ~ N(µ_2, σ²)
  Laboratory 3: Y_3i, i = 1, ..., n_3, ~ N(µ_3, σ²)
  Laboratory 4: Y_4i, i = 1, ..., n_4, ~ N(µ_4, σ²)

or, equivalently,

  Y_gi = µ_g + ε_gi,   ε_gi ~ N(0, σ²)

Assumptions: just the same as for a two-group comparison (p. 50).

Purpose of the study

Research questions:
- Quantify and test the difference between laboratories
- The omnibus test of equality of means, H0: µ_1 = µ_2 = µ_3 = µ_4, gives rise to an F-test (see output on p. 70)
- Find estimates of reproducibility and repeatability (see pp. 90-94)
- Estimate the uncertainty of everyday measurements (see p. 94)

One-way ANOVA in SAS

Data have to be organized in 2 columns, one with the observations of the outcome (mass) and one with the classification variable (laboratory):

  laboratory   mass
  A            6.77
  A            6.79
  A            6.84
  -            -
  D            6.66
  D            5.79

  proc glm data=q10;
    class laboratory;
    model mass=laboratory / solution clparm;

Output from the 1-way ANOVA

  The GLM Procedure
  Dependent Variable: mass

                                Sum of
  Source            DF         Squares   Mean Square   F Value   Pr > F
  Model              3      2.11768333    0.70589444      7.61   0.0014
  Error             20      1.85570000    0.09278500
  Corrected Total   23      3.97338333

  R-Square   Coeff Var   Root MSE   mass Mean
  0.532967   4.697694    0.304606   6.484167

  Source       DF   Type III SS   Mean Square   F Value   Pr > F
  laboratory    3    2.11768333    0.70589444      7.61   0.0014

We see a significant difference between the 4 laboratories (P=0.0014 for the test of identical mean values). Is this interesting?

Output, cont'd

                               Standard
  Parameter      Estimate         Error      t Value   Pr > |t|
  Intercept      6.101666667 B   0.12435500    49.07   <.0001
  laboratory A   0.615000000 B   0.17586453     3.50   0.0023
  laboratory B   0.191666667 B   0.17586453     1.09   0.2887
  laboratory C   0.723333333 B   0.17586453     4.11   0.0005
  laboratory D   0.000000000 B            .        .        .

  Parameter      95% Confidence Limits
  Intercept       5.842266677   6.361066657
  laboratory A    0.248153016   0.981846984
  laboratory B   -0.175180317   0.558513650
  laboratory C    0.356486350   1.090180317
  laboratory D              .             .

The above comparisons are made to the reference laboratory, which is D (the last in alphabetical order). Is this interesting?

Model checks (see p. 50)

These are performed by looking at the residuals:

  r_gi = Y_gi - µ̂_g = Y_gi - Ȳ_g

Graphics:
- Variance homogeneity: residuals vs. group (laboratory), residuals vs. predicted
- Normality: quantile plot of the residuals, histogram of the residuals
These may sometimes be supplemented by tests...
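The F-test in the ANOVA table is just the ratio of the two mean squares, referred to an F(3, 20) distribution. A minimal numerical check (illustrative, using the mean squares from the GLM output above):

```python
from scipy import stats

# ANOVA table quantities from the GLM output
ms_model, df_model = 0.70589444, 3
ms_error, df_error = 0.09278500, 20

F = ms_model / ms_error                  # F statistic
p = stats.f.sf(F, df_model, df_error)    # upper tail of F(3, 20)
print(round(F, 2), round(p, 4))          # 7.61 0.0014
```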

Model assumption 1: Independence

This is something you should know from the design of the study:
- no twins, siblings etc.
- no duplicates, triplicates etc.; only a single observation for each
  unit
- no unrecognized explanatory variable, such as
  - the researcher performing the measurement
  - the temperature in the lab on the specific day

73 / 96

Model assumption 2: Identical standard deviations

most often called variance homogeneity, and checked from
- Box plots or scatter plots, see p. 66
- A test of the hypothesis of equal variances (most often Levene's test,
  see next page)
- Residuals plotted against predicted values (ybar_g.): this should show
  no structure

74 / 96

Levene's test for identical variances

Add a means statement, with the option hovtest:

  proc glm data=q10;
    class laboratory;
    model mass=laboratory / solution clparm;
    means laboratory / hovtest;
  run;

  Levene's Test for Homogeneity of mass Variance
  ANOVA of Squared Deviations from Group Means

                        Sum of        Mean
  Source       DF      Squares      Square    F Value    Pr > F
  laboratory    3       0.0262     0.00873       1.42    0.2675
  Error        20       0.1232     0.00616

When comparing the k = 4 variance estimates, we get a test statistic
F=1.42, which has an F(3,20)-distribution, corresponding to P=0.27, and
therefore no significance. The test is of limited value for small
samples, see p. 78.

75 / 96

Model assumption 3: Normality

The observations are assumed to be Normally distributed, with different
means according to group. Check this by drawing histograms or quantile
plots for each group separately (in case you have a lot of
observations). Otherwise, draw histograms or quantile plots for the
residuals

  r_gi = Y_gi - mu_hat_g = Y_gi - Ybar_g

76 / 96
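Levene's test on p. 75 can be mimicked in Python; a sketch on hypothetical data (made-up values, and note that scipy's version uses absolute deviations from the group center, whereas the SAS output above is the squared-deviations version, so the numbers would differ slightly even on the real data):

```python
from scipy import stats

# Hypothetical masses for 4 laboratories (illustration only)
lab_a = [6.77, 6.79, 6.84, 6.70, 6.62, 6.58]
lab_b = [6.30, 6.25, 6.40, 6.19, 6.29, 6.33]
lab_c = [6.83, 6.91, 6.75, 6.80, 6.88, 6.78]
lab_d = [6.10, 5.79, 6.20, 6.15, 6.05, 6.32]

# center='mean' gives the classical Levene test; the default
# center='median' gives the more robust Brown-Forsythe variant.
stat, p = stats.levene(lab_a, lab_b, lab_c, lab_d, center='mean')
```

A large p-value here is consistent with variance homogeneity, mirroring the P=0.27 conclusion in the SAS output.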

Tests for Normality

  proc glm data=q10;
    class laboratory;
    model mass=laboratory / solution clparm;
    output out=ny P=yhat R=residual;
  run;

  proc univariate normal data=ny;
    var residual;
  run;

with output

  The UNIVARIATE Procedure
  Variable: residual

  Tests for Normality
  Test                  --Statistic---    -----p Value------
  Shapiro-Wilk          W     0.950454    Pr < W      0.2771
  Kolmogorov-Smirnov    D     0.135233    Pr > D     >0.1500
  Cramer-von Mises      W-Sq  0.049836    Pr > W-Sq  >0.2500
  Anderson-Darling      A-Sq  0.375306    Pr > A-Sq  >0.2500

77 / 96

Tests for model assumptions

are often of little value:
- If you have many observations, the tests are almost always rejected,
  because even small differences can be detected, even though they are
  of little practical relevance (T-tests and ANOVA are quite robust).
- If you have few observations, the tests are almost always accepted,
  even if there may be important discrepancies, simply because we do not
  have enough power.

Recall: Accepting a statistical test does not mean that you have proven
it to be true!

78 / 96

Model checks: All in one

  ods graphics on;
  proc glm plots=diagnosticspanel data=q10;
    class laboratory;
    model mass=laboratory / solution clparm;
  run;
  ods graphics off;

produces a Diagnostics Panel, containing (among others), in the first
column (C1, see next page):
- Figure (R1,C1): residuals vs. predicted values. We have seen that
  previously (p. 74); it looks OK.
- Figures (R2-R3,C1): quantile plot and histogram for the residuals. We
  have seen these previously (p. 76); they look quite OK, maybe with a
  touch of skewness (more tail to the right, i.e. large values) and a
  hammock-shaped quantile plot.

79 / 96

Diagnostics Panel

[Diagnostics Panel figure produced by proc glm]

80 / 96
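The Shapiro-Wilk test from p. 77 is also available in Python; a sketch on simulated residuals (hypothetical, since the actual residuals are not listed on the slide; the standard deviation 0.3046 is the Root MSE from p. 70):

```python
import numpy as np
from scipy import stats

# Simulate 24 hypothetical residuals, normal with mean 0 and SD equal to
# the Root MSE from the ANOVA output. The real analysis would use the
# residuals written to the dataset 'ny' by the output statement above.
rng = np.random.default_rng(1)
residuals = rng.normal(loc=0.0, scale=0.3046, size=24)

W, p = stats.shapiro(residuals)   # small p would indicate non-normality
```

As the slide on p. 78 warns, with only 24 observations this test has little power, so a non-significant result is weak evidence of normality.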

Comparisons between groups

Automatically, we get comparisons between the reference group and all
others, i.e. 3 comparisons.

If the omnibus test of equality is rejected, we may want to perform all
pairwise comparisons between the groups, i.e. a total of
K = 4*3/2 = 6 comparisons. This may well create problems in connection
with
- type I error (which will increase dramatically)
- confidence intervals (which will be misleadingly narrow)

81 / 96

Mass significance

Suppose all laboratories perform truly identically:
- Every pairwise comparison has a type I error rate of alpha (0.05),
  i.e. a probability of 1 - alpha of a correct decision.
- K such independent comparisons yield a probability of (1 - alpha)^K of
  getting them all correct,
- i.e. a probability of 1 - (1 - alpha)^K of getting at least one false
  significance. This is the type I error rate for the total procedure.

82 / 96

Risk of Type I error in multiple comparisons

[Figure: risk of at least one false significance as a function of the
number of groups; x = all comparisons, o = comparisons to reference
group only]

83 / 96

Correction for inflated type I error

Control the experimentwise error rate by lowering the level of
significance to an adjusted level alpha_K for K group comparisons
(here K = 6), e.g.

  Bonferroni:  alpha_K = alpha / K
  Sidak:       alpha_K = 1 - (1 - alpha)^(1/K)
  Others:      may exist in specific contexts, here Tukey (or
               Tukey-Kramer)

84 / 96
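The inflation and the two corrections above are simple arithmetic; a sketch of the quantities for K = 6 pairwise comparisons among 4 groups at alpha = 0.05:

```python
# Family-wise type I error rate and adjusted significance levels for
# K = 6 comparisons, assuming (as on the slide) independent tests.
alpha, K = 0.05, 6

fwer = 1 - (1 - alpha) ** K               # risk of >= 1 false significance
alpha_bonferroni = alpha / K              # Bonferroni-adjusted level
alpha_sidak = 1 - (1 - alpha) ** (1 / K)  # Sidak-adjusted level
```

With K = 6, the unadjusted family-wise error rate is already about 0.26 instead of 0.05, which is why the per-comparison level must be lowered.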

Adjusted P-values

For K group comparisons, increase the P-value to P_K, e.g.

  Bonferroni:  multiply by K, i.e. P_K = K * P
               (may become greater than 1!)
  Sidak:       P_K = 1 - (1 - P)^K
               (never larger than 1)
  Others:      may exist in specific contexts

Consequence: a reduction in power (an increase of the type II error
rate); differences become harder to detect.

85 / 96

Confidence limits

If each single confidence interval is constructed to have an
(approximate) coverage of 95%, then the simultaneous coverage (for all
parameters) will be less than 95%, and it will be appreciably less if
many intervals are involved.

Adjustment: use alternative quantiles for the construction, e.g. the
(1 - alpha_K/2) quantile, where alpha_K is the adjusted significance
level (according to some rule).

86 / 96

But...

Pairwise tests are not independent! If two groups look alike, they both
resemble a third group equally much. Therefore, Bonferroni and Sidak are
conservative:
- The significance level alpha_K is unnecessarily low
- We get unnecessarily low power, i.e. differences become too hard to
  detect

87 / 96

Tukey correction in SAS

  proc glm data=q10;
    class laboratory;
    model mass=laboratory / solution clparm;
    lsmeans laboratory / adjust=tukey pdiff cl;
  run;

which gives the output

  Least Squares Means
  Adjustment for Multiple Comparisons: Tukey

  Least Squares Means for effect laboratory
  Pr > |t| for H0: LSMean(i)=LSMean(j)
  Dependent Variable: mass

  i/j         1         2         3         4
  1                0.1079    0.9258    0.0112
  2      0.1079              0.0314    0.6996
  3      0.9258    0.0314              0.0028
  4      0.0112    0.6996    0.0028

88 / 96
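The two p-value adjustments on p. 85 can be written as small functions; a sketch, where the raw p-value 0.0023 is the unadjusted A-vs-D comparison from the GLM output (p. 71):

```python
# Bonferroni and Sidak p-value adjustments for K comparisons.
def bonferroni(p, K):
    # multiply by K, but cap at 1 (K * p may exceed 1)
    return min(1.0, K * p)

def sidak(p, K):
    # 1 - (1 - p)^K; by construction never larger than 1
    return 1 - (1 - p) ** K

p_raw, K = 0.0023, 6
p_bonf = bonferroni(p_raw, K)    # 0.0138
p_sidak = sidak(p_raw, K)        # slightly smaller, about 0.0137
```

As the slide notes, Sidak is always slightly less conservative than Bonferroni, and both are conservative here because the 6 pairwise tests are not independent.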

Output, cont'd

  laboratory    mass LSMEAN    95% Confidence Limits
  A               6.716667      6.457267    6.976067
  B               6.293333      6.033933    6.552733
  C               6.825000      6.565600    7.084400
  D               6.101667      5.842267    6.361067

  Least Squares Means for Effect laboratory

                Difference         Simultaneous 95%
                   Between      Confidence Limits for
  i    j             Means      LSMean(i)-LSMean(j)
  1    2          0.423333     -0.068895    0.915561
  1    3         -0.108333     -0.600561    0.383895
  1    4          0.615000      0.122772    1.107228
  2    3         -0.531667     -1.023895   -0.039439
  2    4          0.191667     -0.300561    0.683895
  3    4          0.723333      0.231105    1.215561

Is this what we wanted to know?

89 / 96

Repeatability and reproducibility

will be elaborated upon in the days to come...

Repeatability: How close is the result when we perform repeated
observations which in principle should be identical, i.e. from the same
laboratory: the variation of y_gi1 - y_gi2.

Reproducibility: How closely can we reproduce the result in another
setting which in principle should be the same, i.e. at another
laboratory: the variation of y_g1i1 - y_g2i2.

Uncertainty in measurement: How far from the true value can our
measurements be expected to be?

90 / 96

Repeatability

The ability to repeat the measurements is reflected by the standard
deviation sigma (the variation within laboratory), or the variance
sigma^2, which we have estimated to be (see p. 70)

  s^2 = 0.09279,  s = 0.3046

We can now calculate the standard deviation of the difference
y_gi1 - y_gi2 as

  sqrt(2 s^2) = sqrt(0.18558) = 0.43079

We will therefore expect these differences to be in the interval

  +/- 2 * 0.43079 = (-0.862, 0.862)

91 / 96

Reproducibility

The ability to reproduce the measurements in another laboratory will
depend on the variation between laboratories, as well as the variation
within laboratory. We cannot evaluate this as long as we have a model
with a specific mean value for each laboratory. We need a model with two
variance components:

  omega^2: the variation between (infinitely precise) measurements from
           different laboratories
  sigma^2: the variation between measurements from the same laboratory

92 / 96

Repeatability and reproducibility

  proc mixed data=q10;
    class laboratory;
    model mass= / s cl;
    random intercept / subject=laboratory;
  run;

with output

  Covariance Parameter Estimates
  Cov Parm     Subject       Estimate
  Intercept    laboratory      0.1022
  Residual                    0.09279

  Solution for Fixed Effects
                           Standard
  Effect      Estimate        Error    DF    t Value    Pr > |t|    Alpha
  Intercept     6.4842       0.1715     3      37.81      <.0001     0.05

  Effect        Lower      Upper
  Intercept    5.9384     7.0300

93 / 96

Calculations from output

We got the estimates

  omega^2_hat = 0.1022,  sigma^2_hat = 0.09279

From this, we calculate (see p. 91):
- Repeatability: sigma_hat = 0.3046; typical (reference) region for
  y_gi1 - y_gi2: +/- 0.86
- Reproducibility: sqrt(omega^2_hat + sigma^2_hat) = 0.4416; typical
  (reference) region for y_g1i1 - y_g2i2: +/- 1.25
- Uncertainty in measurement: typical departure from truth: +/- 0.88

94 / 96
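The calculations on pp. 91-94 follow directly from the two variance components; a short Python check using the estimates from the proc mixed output above:

```python
from math import sqrt

# Variance components estimated by proc mixed (p. 93)
omega2 = 0.1022     # between-laboratory variance
sigma2 = 0.09279    # within-laboratory variance

sd_repeat = sqrt(2 * sigma2)                # SD of y_gi1 - y_gi2, same lab
sd_reproduce = sqrt(2 * (omega2 + sigma2))  # SD of difference, two labs
sd_uncertainty = sqrt(omega2 + sigma2)      # typical departure from truth

# Approximate 95% reference regions are +/- 2 SD:
repeat_limit = 2 * sd_repeat         # about 0.86
reproduce_limit = 2 * sd_reproduce   # about 1.25
uncertainty_limit = 2 * sd_uncertainty  # about 0.88
```

The factor 2 inside the square roots comes from taking the variance of a difference of two independent measurements, and the factor 2 outside is the usual approximate 95% normal quantile.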

Repeatability and reproducibility Calculations from output proc mixed data=q10; class laboratory; model mass= / s cl; random intercept / subject=laboratory; with output Covariance Parameter Estimates Cov Parm Subject Estimate Intercept laboratory 0.1022 Residual 0.09279 Solution for Fixed Effects Standard Effect Estimate Error DF t Value Pr > t Alpha Intercept 6.4842 0.1715 3 37.81 <.0001 0.05 We got the estimates ˆω 2 = 0.1022, ˆσ 2 = 0.09279 From this, we calculate (see p. 91) Repeatability: ˆσ 2 = 0.3046 Typical (reference) region for y gi1 y gi2 : ±0.86 Reproducibility: ˆω 2 + ˆσ 2 = 0.4416 Typical (reference) region for y g1 i 1 y g2 i 2 : ±1.25 Uncertainty in measurement: Typical departure from truth: ±0.88 Effect Lower Upper Intercept 5.9384 7.0300 93 / 96 94 / 96 The Gene Expression example The Gene Expression example Remember the paired situation Comparison of the four conditions is not a 1-way ANOVA since we have related samples We may use two different approaches: 1. A two-way ANOVA, with two factors: Condition Cell line 2. A mixed model (variance components model), with an mean value for each Condition a random effect of cell line (between cell-line variation) a within cell-line variation Things to avoid, when analyzing your data: Normalization by dividing or subtracting Control measurements from the rest 95 / 96 96 / 96