ANALYSIS OF VARIANCE AND MODEL FITTING FOR R. C. Patrick Doncaster.




CONTENTS

Lecture: One-Way Analysis of Variance
  Comparison of parametric and non-parametric methods of analysing variance
  What is parametric one-way Analysis of Variance (ANOVA)?
  How to do a parametric one-way Analysis of Variance
  What are degrees of freedom?
  Assumptions of parametric Analysis of Variance
  Summary of parameters for estimating the population mean
Practical: Calculating One-Way Analysis of Variance
Lecture: Two-Way Analysis of Variance
  Example of two-way Analysis of Variance: cross-factored design
  Using a statistical model to define the test hypothesis
  Degrees of freedom
  How to do a two-way Analysis of Variance
  Using interaction plots
Lecture: Regression
  Comparison of Analysis of Variance and regression models
  Degrees of freedom for regression
  Calculation of the slope and intercept of the regression line
Practical: Two-Way Analysis of Variance in R
Lecture: Correlation and Transformations
  The difference between correlation and regression, and testing for correlation
  Transforming data to meet the assumptions of parametric Analysis of Variance
Lecture: Fitting Statistical Models to Data
  The three principal types of data and statistical models
    1. One sample, one variable: G-test of goodness-of-fit
    2. One sample, two variables: (a) categorical variables: G-test of contingency table; (b) continuous variables: regression or correlation
    3. One-way classification of two or more samples: Analysis of Variance
  Supplementary information: Selecting and fitting models
    1. One-way classification with two continuous variables: multiple regression
    2. Two-way classification of samples: two-factor ANOVA or General Linear Model
Practical: Calculating Regression and Correlation
Appendix 1: Terminology of Analysis of Variance
Appendix 2: Self-test questions (1)
Appendix 3: Sources of worked examples - ANOVA
Appendix 4: Procedural steps for Analysis of Variance
Appendix 5: Self-test questions (2)
Appendix 6: Sources of worked examples - Regression
Appendix 7: Table of critical values of the F-distribution


LECTURE: ONE-WAY ANALYSIS OF VARIANCE

This booklet covers five lectures and three practicals. It is designed to help you:
1. Understand the principles and practice of Analysis of Variance, regression and correlation;
2. Appreciate their underlying assumptions, and how to meet them;
3. Learn the basics of using statistical models for quantitative solutions.
In meeting these objectives you will also become more familiar with the terminology of parametric statistics, which should help you use statistical packages, interpret their output, and better understand published analyses.

Comparison of parametric and non-parametric methods

You have already been introduced to non-parametric tests earlier in this course. These are useful because they tend to be robust: they give a rough but reliable estimate and work well on data with an unknown underlying distribution. But often we can be confident about underlying distributions, and then parametric statistics begin to show their strengths.

Some limitations of non-parametric statistics:
1. They test hypotheses, but do not always give estimates for parameters of interest;
2. They cannot test two-way interactions, or categorical combined with continuous effects;
3. They each work in different ways, with their own quirks and foibles and no grand scheme;
4. In situations of even moderate complexity, such as you may encounter when doing research projects, there may be no non-parametric statistic readily available.

Some advantages of parametric statistics:
1. They can be more powerful, because they make use of the actual data rather than ranks;
2. Parametric tests are very flexible, coping well with incomplete data and correlated effects;
3. They can test two-way interactions, and also categorical combined with continuous effects;
4. They are all built around a single theme, Analysis of Variance, so there is a grand scheme: a single framework for understanding and using them.

What is Analysis of Variance (ANOVA)?

Analysis of Variance is an extension of the Student's t-test that you will already be familiar with. A t-test can look for differences between the mean scores in two samples (e.g. body weights of males and females). A one-way Analysis of Variance can look for an overall difference between the mean scores in 2 or more samples of a factor (e.g. crop yield under three different treatments of fertiliser). Later we will see how a two-way Analysis of Variance can further partition the variance among two factors (e.g. crop yield under different combinations of pesticide as well as fertiliser).

What does Analysis of Variance do? It analyses samples to test for evidence of a difference between means in the sampled population. It does this by measuring the variation in a continuous response variable (e.g. weight, yield) in terms of its sum of squared deviations from the sample means. It then partitions this variation into explained and unexplained (residual) components. Finally it compares these partitions to ask how many times more variation is explained by differences between samples than by differences within samples.

Most ways of measuring variation would not allow partitioning, because the variation in the components would not add up to the variation in the whole. We use sums of squares because they do have this property. We get the explained component of variation from the sum of squared deviations of sample means from the global mean.

Then we get the unexplained component of variation from the sum of squared deviations of variates from their sample means. These two components together account for the total variation, which can be obtained from the sum of squared deviations of variates from the global mean.

Let's see how it works in practice. Say we have sampled a woodland population of wood mice, and found the average weight of adult males is 25 g, and the average of adult females (not gestating) is 17 g. But both sexes vary quite widely around these means, and some males are lighter than some females. We want to know whether our samples just reflect random variation within an undifferentiated population, or whether they illustrate a real difference in weight by sex.

The problem is illustrated with an interval plot produced by R, showing male and female means and their 95% confidence intervals. This is a common way of summarising averages of a continuous variable. The vertical lines cover the range of possible values for each population mean, with 95% confidence. You will see how they are derived in the practical, but we use them here to illustrate the extent of variation within each sample. The confidence intervals overlap, reflecting the fact that some females were heavier than some males. We do an Analysis of Variance to test whether the sexes are really likely to differ from each other on average in the population, despite this overlap in the samples. This involves comparing the two sources of variation in weight: (i) the average variation between the means for each sex (this is the variation explained by the factor Sex), and (ii) the average variation around each sample mean (this is the residual, unexplained variation). Together they add up to the total variation, when variation is measured as squared deviations from means.

Box 1. Partitioning the sums of squares (supplementary information)

Why do explained and unexplained sources of variation add up to the total variation, when variation is measured as squared deviations from means? For any one score, Y - Ḡ is its deviation from the grand mean Ḡ. If we measure variation as squared deviations, then the total variation in our two samples is the sum of squares Σ(Y - Ḡ)². However, each Y - Ḡ comprises two components: Y - Ȳ is the deviation of the score from the mean Ȳ for its sample, and therefore the component not explained by the factor sex, while Ȳ - Ḡ is the deviation of the sample mean from the grand mean, and therefore the explained component.

For example, a score of 28 g for a particular male is 3 g away from his sample mean Ȳ = 25 g, which compares with the deviation of 4 g by which the sample mean differs from the global mean Ḡ = 21 g (i.e. the mean of the means for each sex: (25 + 17)/2).

We can use a vector to describe the deviation of each score in terms of the two independent sources of variation (explained and unexplained). We plot the deviation of any one score from its sample mean Ȳ on an axis perpendicular to the one describing the deviation of the sample mean Ȳ from the global mean Ḡ, because these two deviations are independent by definition: one component is explained by the factor sex, and the other is unexplained, residual deviation. The total deviation is then the resultant vector formed by combining these two independent sources of variation. [The original figure shows the explained deviation Ȳ - Ḡ on the horizontal (response) axis, the unexplained deviation Y - Ȳ on the vertical (error) axis, and the total deviation as the resultant arrow.]

The squared length of the resultant vector equals the sum of the squares of the other two sides (Pythagoras's theorem). Attaching such vectors to all of our scores and summing over the whole sample, the cross-products of the two components cancel out, so if we represent variation as squared deviations, the total variation partitions into the two independent sources, the explained Σ(Ȳ - Ḡ)² and the unexplained Σ(Y - Ȳ)²:

Σ(Y - Ḡ)² = Σ(Ȳ - Ḡ)² + Σ(Y - Ȳ)².

If the average squared deviation of Ȳ from Ḡ is big compared to the average squared deviation of Y from Ȳ, then we can conclude that most of the total variation is explained by differences between the sample means. This is exactly the procedure adopted by Analysis of Variance.

How to do a one-way Analysis of Variance

Let's do this very simple Analysis of Variance on the two samples of adult wood mice. We want to know if there is any difference between the body weights of males and females that cannot be attributed to sampling error.

Design: Firstly, it is very important to have designed a method of data collection that allows a sample to represent the population that we are interested in. Whatever the method, it must allow subjects to be picked at random from the population. So if our male sample is going to comprise 5 individuals, they should not all be brothers, or all taken from the same patch of wood. [In the practical you will look at an experimental analysis of the effect of different pesticides on hoverflies; you will then have experimental plots in place of individuals, and the important design consideration will be to allocate the different treatments (of pesticide) at random to the experimental plots.]

Analysis: Having collected our samples, we weigh all the males and all the females, calculate mean weights for each sample, and a grand (i.e. total or pooled) mean weight. These data have been put into a spreadsheet (Fig. 2, not reproduced here). They allow us to test the null hypothesis, H0: there is no difference between the sample means.

Fig. 2. Data on body weights of male and female wood mice, as they look in an Excel spreadsheet.

Each score can now be tagged with the following information:
1. Its sample mean (column D);
2. The grand mean (column E);
3. The squared deviation of the sample mean from the grand mean (column F), which equals the component of variation for this score that is explained by the independent variable sex;
4. The squared deviation of the score from the sample mean (column G), which equals the component of unexplained variation for this score;
5. The squared deviation of the score from the grand mean (column H), which equals the component of total variation.

Columns F, G and H are then summed to find their sums of squares, which define the variation from explained and unexplained sources, and the total variation:

SS_explained = Σ(Ȳ - Ḡ)²,  SS_error = Σ(Y - Ȳ)²,  SS_total = Σ(Y - Ḡ)².

We are interested in comparing the average explained variation with the average unexplained (error) variation, and we get these averages from the mean squares:

MS_explained = SS_explained / (a - 1),  MS_error = SS_error / (n - a),

where a is the number of samples and n the total number of scores. These mean squares measure the explained and unexplained variances in terms of variability per degree of freedom. Finally, the F-statistic is obtained from the ratio of these two mean squares:

F = MS_explained / MS_error.
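To see this arithmetic end to end, here is a minimal R sketch. The ten weights are invented for illustration only (chosen so that the sample means come out at 25 g and 17 g, as in the lecture example); they are not the course data.

# Invented weights (g) for 5 males and 5 females - illustrative values only
weight <- c(28, 22, 26, 24, 25, 15, 19, 16, 18, 17)
sex    <- factor(rep(c("male", "female"), each = 5))

G <- mean(weight)       # grand mean
M <- ave(weight, sex)   # each score's own sample mean

ss.explained <- sum((M - G)^2)        # squared deviations of sample means from grand mean
ss.error     <- sum((weight - M)^2)   # squared deviations of scores from sample means
ss.total     <- sum((weight - G)^2)   # squared deviations of scores from grand mean
c(ss.explained + ss.error, ss.total)  # the two components add up to the total

F.ratio <- (ss.explained / 1) / (ss.error / 8)   # ratio of mean squares, with 1 and 8 d.f.
F.ratio
summary(aov(weight ~ sex))            # the built-in ANOVA gives the same F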

Interpretation: The F statistic is the ratio of average explained variation to average unexplained variation, and a large ratio indicates that differences between the sample means account for much of the variation of scores from the grand mean score. We can look up a level of significance in tables of the F-statistic. In this example, for 1 and 8 degrees of freedom, the critical 5% value is 5.32. Since our calculated value exceeds this, we can draw the following conclusion: body weights differ between males and females in the sampled population (F1,8 = 7.27, p < 0.05). This is the standard way to present results of Analysis of Variance. Whenever presenting statistical results, always give the degrees of freedom that were available for the test, so the reader can know how big your samples were. For any Analysis of Variance this means giving two sets of degrees of freedom.

What are degrees of freedom?

General rule: The F-ratio in an Analysis of Variance is always presented with two sets of degrees of freedom. In a one-way test, the first corresponds to one less than the a samples or levels of the explanatory variable (a - 1), and the second to the remaining error degrees of freedom (n - a). For both sets, the degrees of freedom equal the number of pieces of information that we have, minus the number that we need in order to calculate variation.

Think of degrees of freedom (d.f.) as the number of pieces of information about the noise from which an investigator wishes to extract the signal. If you want to draw a straight line to represent a scatter of n points, you need two pieces of information, slope and intercept, in order to define the line (using up 2 of the n); the scatter about the line (are all the points on it, or are they scattered or curved away from it?) can then be measured with the remaining n - 2 degrees of freedom. This is why the significance of a regression is tested with a Student's t with n - 2 d.f. Likewise, when looking for a difference between two samples, a Student's t is tested with n - 2 d.f. because one d.f. is required to fix each of the two sample means.

In Analysis of Variance, the first set of degrees of freedom refers to the explained component of variation. This takes size a - 1, because we have a sample means and we need 1 grand mean to calculate variation between these means. The second set of degrees of freedom refers to the unexplained (error) variation. This takes size n - a, because we have n data points and we need a sample means to calculate variation within samples. Thus we calculate the average variance of sample means around the grand mean from the sum of squared deviations of Ȳ from Ḡ, divided by one less than the a samples (= 1 for the wood mice). Then we can deduce the average error variance from the sum of squared deviations of Y from Ȳ, divided by the remaining n - a degrees of freedom (= 8 in the wood mouse example).

Degrees of freedom are very important because they tell us how powerful our test is going to be. Look at the table of critical values of the F-distribution (Appendix 7). With few error d.f. (the rows of the table), the error variation needs to be many times smaller than the variation between groups before the ratio of MS_explained to MS_error is big enough that we can be confident of a difference between groups in the population from which we took samples for analysis. This is particularly true when comparing between few samples.

For example, if we want to compare two samples each of 3 subjects, then the two sample means take 2 pieces of information from the 6 subjects, leaving us with 4 error d.f. A significant difference at P < 0.05 then requires that the average variation between samples is more than 7.71 times greater than the average residual variation within each sample (as opposed to > 5.32 for the 2 samples of wood mice, each with 5 subjects: Appendix 7).
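Rather than reading critical values from Appendix 7, you can ask R for them directly with qf(); a short sketch (the pf() line shows how an observed F converts to a p-value):

qf(0.95, df1 = 1, df2 = 8)      # critical 5% value for 1 and 8 d.f.: 5.32 (the wood mice)
qf(0.95, df1 = 1, df2 = 4)      # two samples of 3 subjects: 7.71
qf(0.95, df1 = 2, df2 = 12)     # three samples of 5, as in the pesticide practical
1 - pf(7.27, df1 = 1, df2 = 8)  # p-value for the observed wood-mouse F of 7.27 (about 0.03)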

Assumptions of Analysis of Variance: The Analysis of Variance is run on samples taken from a population of interest, which means it must assume: random sampling, independent residuals, normally distributed residuals, and homogeneous variances. We examine these 4 assumptions with a real example in the practical.

1. Random sampling is a design consideration for all parametric and non-parametric analyses. If we had some a priori reason for wanting male mice to be heavier on average than females, perhaps to bolster a favoured theory, then we might be tempted to choose larger males as representatives of the male population. Clearly this is cheating, and only bolsters a circular argument. Random sampling avoids this problem.

2. Independence is the assumption that the residuals (or errors: the deviations of scores from their sample means) should be independently distributed around the sample means. In other words, knowing how much one score deviates from its sample mean should not reveal anything about how others do. Statistics only work by the accumulation of pieces of evidence about the population, no one of which is convincing in itself. In combining these increments it is obviously important to know that they are independent, and that you are not repeatedly drawing on the same information in different guises. This is true for both parametric and non-parametric tests, and it is one of the biggest problems in statistical analysis for biologists. If the wood mouse data came from sampling a wild population, some individuals may be caught several times (if they get released back into the population after weighing). But clearly 5 measures repeated on the same individual do not provide the same amount of information as one measure on each of 5 different individuals. This problem is called pseudo-replication and leads to the degrees of freedom being unjustly inflated. Analysis of Variance can be conducted on repeated measures, but it requires declaring Individual as a second factor, and this adds extra complications and assumptions - avoid it if at all possible! Equally, if most males came from one locality and most females from another, then we may be seeing habitat differences rather than sex differences (i.e. the weights within each sample are not independent, but depend on habitat). This problem is referred to as the confounding of two factors, because their effects cannot be separated.

3. Homogeneity of variances is the assumption that all samples have the same variation about their means, so the analysis can pertain just to finding differences between means. Violation of this assumption is likely to obscure true differences. It can often be met by transforming the data (see the section on statistical modelling). See the practical exercise for the R command to perform a Bartlett's test of homogeneity of variances.

4. Normality is the assumption that the residuals are normally distributed about their sample means. We have seen how Analysis of Variance only makes use of two parameters to describe each sample: the mean and the average squared deviation (the variance). A normal distribution is a symmetrical distribution of frequencies defined by just these two parameters, so if the scores are normally distributed around their sample means, then the data will be adequately represented in the Analysis of Variance test. But if the distribution of scores is skewed, or bounded within fixed limits (e.g. body weights can extend upwards any amount but cannot fall below zero), then the mean may not represent the true central tendency in the data, and the squared deviations may be an unreliable indicator of variance. In such cases it is often necessary to transform the data first (see the lecture on correlation and transformations). See the practical exercise for the R command to perform a Shapiro-Wilk normality test on the residuals.

When using any statistic (parametric or non-parametric), you should do visual diagnostic tests to check its assumptions. This applies also to Analysis of Variance, and in R you can do it with a command of the sort: plot(aov(y ~ x)).
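As a preview of the practical, the checks on the residuals reduce to a few R commands on a fitted model; here y and x are placeholders for your own response variable and factor:

fit <- aov(y ~ x)                 # y = response, x = factor (placeholder names)
bartlett.test(y ~ x)              # homogeneity of variances
shapiro.test(resid(fit))          # normality of the residuals
par(mfrow = c(2, 2)); plot(fit)   # visual diagnostics: residuals vs fitted, Q-Q plot, etc.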

Summary of parameters for estimating the population mean

Whenever you collect a sample of measurements, you will want to summarise its defining characteristics. If the data are approximately normally distributed around some central tendency, and many types of biological data are, then three parametric statistics can provide much of the essential information. The sample mean, Ȳ, tells you the average measurement in your sample; the standard deviation (SD) tells you how much variation there is in the data around the sample mean; and the standard error (SE) indicates the uncertainty associated with viewing the sample mean as an estimate of the mean of the whole population, μ. The worked example running through the entries below is the weight of seeds of the Princess Bean Phaseolus vulgaris (in: Samuels, M.L., Statistics for the Life Sciences, Macmillan).

1. Variable. A property that varies in a measurable way between subjects in a sample. Example: weight (mg) of Princess Bean seeds.

2. Sample. A collection of individual observations selected by a specified procedure. In most cases the sample size is given by the number of subjects (i.e. each is measured once only). Example: a sample of 25 Princess Bean seeds, selected at random from the total production of an arable field, with weights (mg): 343, 755, 431, 480, 516, 469, 694, 659, 441, 562, 597, 502, 612, 549, 348, 469, 545, 728, 416, 536, 581, 433, 583, 570, 334.

3. Sample mean, Ȳ. The sum of all observations in the sample, divided by the size of the sample, n: Ȳ = Σ Yi / n. The sample mean is an estimate of the population mean, μ ("mu"), which is one of two parameters defining the normal distribution (the other is σ, see below). Example: Ȳ = 526.1 mg. This comes from a population, the total production of the field, which follows a normal distribution and has a mean μ = 500 mg.

4. Sum of squares, SS. The squared distance between each data point (Yi) and the sample mean, summed for all n data points: SS = Σ(Yi - Ȳ)².

5. Variance, s². The variance in a normally distributed population is described by the average of n squared deviations from the mean. Variance usually refers to a sample, however, in which case it is calculated as the sum of squares divided by n - 1 rather than n: s² = SS / (n - 1).

6. Sample standard deviation, SD or s. Describes the dispersion of data about the mean. It is equal to the square root of the variance. For a large sample size, Ȳ approaches μ, and the standard deviation of the sample approaches the population standard deviation, σ ("sigma"). It is then a property of the normal distribution that 95% of observations lie within 1.96 standard deviations of the mean, and 99% within 2.58. Example: the sample standard deviation s = √(variance) ≈ 113.7 mg; the standard deviation of the population from which the sample was drawn is σ = 120 mg.

7. Normal distribution. A bell-shaped frequency distribution of a continuous variable. The formula for the normal distribution contains two parameters: the mean μ, giving its location, and the standard deviation σ, giving the shape of the symmetrical bell. This distribution arises commonly in nature when myriad independent forces, themselves subject to variation, combine additively to produce a central tendency. Many parametric statistics are based on the normal distribution because of this, and also because of its property of describing both the location (mean) and dispersion (standard deviation) of the data. Since dispersion is measured in squared deviations from the mean, it can be partitioned between sources, permitting the testing of statistical models. Example: the weights of Princess Bean seeds in the population follow a normal distribution (shown as a frequency curve in the original figure). Some 95% of the seeds lie within 1.96 standard deviations of the mean, which is 1.96 × 120 = 235 mg either side of μ.

8. Standard error of the mean, SE. Describes the uncertainty, due to sampling error, in the mean of the data. It is calculated by dividing the standard deviation by the square root of the sample size (SE = SD/√n), and so it gets smaller as the sample size gets bigger. In other words, with a very large n, the sample mean approaches the population mean. If random samples of n measurements were taken from any population (not necessarily normal) with mean μ and standard deviation σ, the mean of the sampling distribution of Ȳ would equal the population mean μ. Moreover, the standard deviation of sample means around the population mean would be given by σ/√n. Example: the standard error of the mean SE = SD/√n ≈ 113.7/√25 ≈ 22.7 mg.

9. Confidence interval for μ. Regardless of the underlying distribution of the data, the sample means from repeated random samples of size n would have a distribution that approached normal for large n, with 95% of sample means lying within 1.96 standard errors of μ. With only one sample mean Ȳ and standard error SE, these can nevertheless be taken as best estimates of the parametric mean and the standard deviation of sample means. It is then possible to compute 95% confidence limits for μ at Ȳ ± 1.96 SE (for large sample sizes). For small sample sizes, the 95% confidence limits for μ are computed at Ȳ ± t[0.05](n-1) SE. Example: the 95% confidence intervals for μ from the sample of 25 Princess Bean seeds are at Ȳ ± t[0.05]24 SE, i.e. roughly between 479 and 573 mg. The sample is thus representative of the population mean, which we happen to know is 500 mg. If we did not know this, the sample would nevertheless lead us to accept a null hypothesis that the population mean lies anywhere between roughly 479 and 573 mg.
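All of these parameters can be computed in one short R sketch from the 25 seed weights listed above (the confidence limits use Student's t with n - 1 d.f., as in entry 9):

seeds <- c(343, 755, 431, 480, 516, 469, 694, 659, 441, 562, 597, 502, 612,
           549, 348, 469, 545, 728, 416, 536, 581, 433, 583, 570, 334)
n  <- length(seeds)
Y  <- mean(seeds)          # sample mean
s  <- sd(seeds)            # sample standard deviation (divides the SS by n - 1)
SE <- s / sqrt(n)          # standard error of the mean
Y + c(-1, 1) * qt(0.975, df = n - 1) * SE   # 95% confidence interval for the population mean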

PRACTICAL: CALCULATING ONE-WAY ANALYSIS OF VARIANCE

Rationale

Analysis of Variance is one of the most commonly used tests in biology, because biologists often want to look for differences in mean responses between groups. Do male and female shrews differ in body weight? Does crop yield differ with different concentrations of a fertiliser? Does crop yield vary with rainfall? To find out whether shrews from a population of interest differ in size between the sexes you could perform a t-test on samples from the population. This is a simplified type of Analysis of Variance suitable for just two samples (males and females), and it gives exactly the same statistical prediction. The Analysis of Variance comes into its own when you are seeking differences between more than two samples. You would use Analysis of Variance to find out if crop yield differs with three or more different concentrations of fertiliser. You would also use the same method of Analysis of Variance to test the effect on crop yield of a continuous variable such as rainfall, in which case you are testing whether rainfall has a linear effect on yield (from a single sample rather than comparing between two or more samples).

In this practical you will perform an Analysis of Variance by hand, in order to see how it works. This practical is designed to help you to interpret the output from statistical packages such as R, which does most of the number crunching for you.

Here is the scenario

You have just graduated from University and found employment with the Mambotox consultancy. Mambotox is funded by outside contracts to evaluate the environmental impact of agricultural chemicals. Its speciality is testing the effects of pesticides on non-target insects, spiders and mites that are the natural enemies of crop pests (and hence useful to farmers as biological control agents). Your first job with this company is to perform an experiment to compare the effects on hoverflies of three new brands of pesticide designed to target aphids. Aphids are a major pest of crops, but hoverflies are useful because their larvae are voracious predators of aphids. So an efficient pesticide that also kills hoverflies may be no better in practice than a less efficient one that does not. To do the test you randomly allocate the three pesticides to plots of wheat which have all been seeded with the same number of hoverfly larvae. After applying the treatments, you sample the plots for surviving hoverfly larvae. You want to know whether the pesticide treatments influence the survival of hoverfly larvae. This problem calls for an Analysis of Variance.

The hypothesis

Take a look at your data set, shown at the top of the worksheet below. It shows that each of the three treatments (Zap, GoFly and Noxious) was applied to five replicate plots; the scores are the numbers of hoverfly larvae counted in each replicate after treatment. The null hypothesis, H0, is that the mean scores do not differ between treatments, i.e. that mean(Zap) = mean(GoFly) = mean(Noxious) in the sampled population. The alternative hypothesis is that the population means are not all equal. Analysis of Variance will allow you to test H0 and to decide whether it should be rejected in favour of the alternative hypothesis.

Start to fill out the cells of the table beneath the data, by summing the scores for each treatment and dividing each sum by its sample size to obtain the group means. That is what is meant by the expression:

Ȳ_j = Σ Y_ij / n_j    i.e. group mean = sum of scores in group j / number of scores in group j

You can read the formula as follows: the mean (denoted Ȳ_j) for each treatment j is equal to the sum (Σ) of the scores for that treatment (Y_ij), for i = 1 to n_j, divided by n_j, which is the sample size (and for each of these treatments it equals 5 plots).

One of the means is rather larger than the others. How do we know if the differences between the means are due to the pesticide treatments or to random variation? It might be that random differences between the 15 plots are enough to explain the higher mean value under one treatment. This is precisely the null hypothesis that is tested by Analysis of Variance.

Analysing variance from the sums of squares

Analysis of Variance finds out what causes the individual scores to vary from the grand mean of all the n = 15 plots. If you calculate this grand mean you should get a value of 9260/15 = 617.33. None of the scores actually equals this grand mean, and their deviations from it are explained by two possible sources of variation. The first source of variation is the pesticide treatment (Zap, GoFly or Noxious). If Zap kills fewer hoverfly larvae, then we would expect plots treated with Zap to have higher scores in general than plots treated with the other pesticides. The second source of variation is due to differences among plots, which can be seen within each treatment.

The way we measure total variation for an Analysis of Variance is by summing up all the squared differences from the grand mean. This is called the total sum of squares, or SS_total:

SS_total = Σ_j Σ_i (Y_ij - Ȳ_total)²   summed over all i = 1 to n_j scores in each treatment and all j = 1 to a treatments.

The above expression means: SS_total is obtained by subtracting the grand mean (denoted Ȳ_total) from each score (Y_ij denoting the i-th score in the j-th treatment) and squaring this difference, then summing these squares for all scores in each treatment and all a treatments. Do this, and keep a note of the value you get.

The reason for squaring each difference is that we can then separate this total variation into its two sources: one due to differences between treatments (called the sum of squares between groups, or SS_group), and one due to the normal variation between plots (the error sum of squares, or SS_error). Then it is a very useful property of squared differences that:

SS_total = SS_group + SS_error.

Note that the word error here does not mean mistake, but is a term describing the variation in scores that we cannot attribute to a specific variable; you may also see it referred to as residual. Calculate these sums of squares and put the values in the right-hand column of the table below. Do this by first calculating the between-group sums of squares for each treatment in turn:

SS_group(j) = Σ_i (Ȳ_j - Ȳ_total)² = n_j (Ȳ_j - Ȳ_total)²

In other words, for each treatment j, square the difference between the group mean and the grand mean and multiply by the sample size. Then add the three results together to get the overall variation between group means, SS_group, and put this value in the right-hand column. Now calculate the error sums of squares for each treatment in turn:

SS_error(j) = Σ_i (Y_ij - Ȳ_j)²

In other words, square the difference between each score and its group mean, and sum these squares. Then add the three group sums to get the overall variation within groups, SS_error, and put this in the right-hand column. Finally, add SS_group to SS_error to get SS_total, and put it in the right-hand column. Does this total equal the value that you obtained from the sum of all squared deviations from the grand mean? It should, showing how the total variance can be partitioned into its sources.

The F-value

It is intuitively reasonable to think that if we get a large variation between the group means compared to the variation within the groups, then the means could be considered to differ between groups because of real differences between the pesticides (rather than because of residual variation). This is the comparison that the F-value makes for us. It takes the average sum of squares due to group differences (called the group mean square, MS_group) and divides it by the average sum of squares due to subject differences (the error mean square, MS_error):

F = MS_group / MS_error,  where MS_group = SS_group / (a - 1) and MS_error = SS_error / (n - a),

with a = number of groups, and n = total of 15 plots. Calculate these mean squares, and add them into the right-hand column. Finally, calculate F. This ratio will be large if the variation between the groups is large compared to the variation within the groups. But the value of F will be close to unity for a true null hypothesis of no variation due to groups. Just how far above F = 1.00 is too much to be attributable to chance is a rather complicated function of the number of groups and the number of plots in each group. Tables of the F statistic will give us this probability based on the degrees of freedom for the between-group variation (a - 1 for a groups or treatments) and the degrees of freedom for the within-group variation (n - a), or it will be provided automatically by statistical packages.

Use the published table provided for you in Appendix 7 to find the critical value for the upper 5% point of the F-distribution with the appropriate degrees of freedom (denoted v1 and v2 in the table). The columns of the table give a range of possible degrees of freedom for the group mean square, which is equal to a - 1. The rows of the table give a range of possible degrees of freedom for the error mean square, which is equal to n - a. Is your calculated value of F greater than this critical value? If so, you can reject the null hypothesis with < 5% chance of making a mistake in so doing. In the report of your analysis you would say: pesticide treatments do differ in their effects on hoverfly numbers (F_v1,v2 = #.##, p < 0.05), substituting in the values of v1 and v2 and the calculated F to 2 decimal places. Put this conclusion in the final row of your analysis.

Using a statistical package

Let's compare the calculations you have been doing laboriously by hand with the output from a statistical package. Read the same dataset into R, using the format shown in the 'Analysis of Variance in R' section below, and run an Analysis of Variance in RStudio with the suite of commands given there. You should get the same result as you got from the calculation by hand. Make sure you understand this output in terms of the calculations you have been doing. When you use statistical packages such as R, you will need to comprehend what the output is telling you, so that you can be sure it has done what you wanted. For example, it is always a good idea to check that the output shows the correct numbers of degrees of freedom.

If it is not showing the degrees of freedom that you think it should, then the package has probably tried to analyse your data in a different way from that intended, so you would need to go back and check your input commands.

Having done the analysis in RStudio, you can now plot means and their confidence intervals with two additional lines of R code, which call a script of plotting instructions and then run it:

source(file = "...")   # the web address of the course's plotting script (not reproduced here)
plot_means(aovdata$Trtmnt, aovdata$Score, "Treatment", "Score", "CI")

The 95% confidence intervals around the j-th mean are at Ȳ_j ± 1.96 s_j / √n_j, where s_j is the sample standard deviation:

s_j = √[ Σ_i (Y_ij - Ȳ_j)² / (n_j - 1) ]

The reason for this is that 95% of a normal distribution lies within 1.96 standard deviations of its mean, and the standard error of the j-th sample mean is given by the term s_j / √n_j.

Which of the pesticides can you recommend to farmers? The correct answer is: none yet, until you have checked the assumptions of the analysis.

Underlying assumptions of Analysis of Variance

Any conclusions that you draw from this analysis are based on four assumptions. What are they? Refer back to the lecture notes if necessary.

1. The first assumption is that the plots are assigned treatments at random, which was indeed a design consideration when you carried out the experiment.

2. The second assumption is that the residuals should be independently distributed, so that they succeed each other in a random sequence and knowing the value of one does not allow you to predict the value of another (i.e. they truly represent unexplained variation). This is the assumption of independence, which is a matter of declaring all known sources of variation. In this case, any variation not due to treatment contributes to the MS_error, and we assume it contains no systematic variation (e.g. due to using different fields for different treatments).

The other assumptions concern the distribution of the error terms (residuals). Use R to test for these with the commands given in the 'Analysis of Variance in R' section.

3. The residuals should be identically distributed for each treatment, so that all the groups have similar variances. This is because the error mean square used to calculate F is obtained from the pooled errors around each group mean. Since the analysis is only seeking differences between means, it assumes all else is equal. This is the assumption of homogeneity of variances, which is visualised with the graph of residuals versus fitted values (funnel-shaped if heterogeneous), and also by the slope of a scale-location graph (non-zero if heterogeneous).

4. Finally, the residuals should be normally distributed about the group means, because the sums of squares that we use to calculate variance will only provide a true estimate of variance if these residuals are normally distributed. This is the assumption of normality, which is visualised by the normal Q-Q plot. The plot should follow an approximately straight diagonal; bowing indicates skew (to the right if convex), and an S-shape indicates a flatter-than-normal distribution.

There are various statistical methods of putting probability limits on the likelihood of your residuals meeting each of these assumptions. We will not go into them here, but they are described in any statistics textbook. Having visually checked the assumptions, which of the pesticides can you recommend to farmers?

The data: the scores for the five replicate plots of each pesticide (Zap, GoFly and Noxious) are in the data file referred to in the next section; they are not reproduced in these notes.

The Analysis of Variance worksheet:

Treatment group j:                Zap      GoFly    Noxious    Total
Sample size, n_j:                 ___      ___      ___        ___
Sum of scores, Σ_i Y_ij:          ___      ___      ___        ___
Group mean, Ȳ_j = Σ_i Y_ij / n_j: ___      ___      ___

SS_group = Σ_j n_j (Ȳ_j - Ȳ_total)²   =  ___ + ___ + ___  =  ___      d.f. = ___
SS_error = Σ_j Σ_i (Y_ij - Ȳ_j)²      =  ___ + ___ + ___  =  ___      d.f. = ___
SS_total = SS_group + SS_error         =  ___

MS_group = SS_group / (a - 1)  =  ___
MS_error = SS_error / (n - a)  =  ___

F = MS_group / MS_error  =  ___          F_crit[v1, v2]  =  ___

Conclusion:
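Once you have filled in the worksheet, you can check the arithmetic in R. The sketch below assumes the data have been read into a data frame called aovdata with columns Trtmnt and Score, as in the next section; it simply repeats the hand calculation.

grand <- mean(aovdata$Score)                              # grand mean of all 15 scores
means <- tapply(aovdata$Score, aovdata$Trtmnt, mean)      # the three group means
sizes <- tapply(aovdata$Score, aovdata$Trtmnt, length)    # the three sample sizes

SS.group <- sum(sizes * (means - grand)^2)                                  # between groups
SS.error <- sum((aovdata$Score - means[as.character(aovdata$Trtmnt)])^2)    # within groups
SS.total <- sum((aovdata$Score - grand)^2)
c(SS.group + SS.error, SS.total)                          # the partition: these should match

F.ratio <- (SS.group / 2) / (SS.error / 12)               # MS.group / MS.error, d.f. 2 and 12
F.ratio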

Analysis of Variance in R

For this part, refer to the Using RStudio Help Guide on Blackboard. Type the data into a new text file called Score-by-pesticide.txt, separating each score from its treatment level by a tab. Then read this file into a data frame in R and perform the analysis in RStudio with the following suite of commands:

# 1. Prepare the data frame 'aovdata'
aovdata <- read.table("Score-by-pesticide.txt", header = T)
attach(aovdata)                           # Access the data frame
Trtmnt <- factor(Trtmnt)                  # Set Trtmnt as a factor

# 2. Commands for the factorial analysis
summary(aov(Score ~ Trtmnt))              # Run the ANOVA
bartlett.test(Score ~ Trtmnt)             # Test for homogeneous variances
shapiro.test(resid(aov(Score ~ Trtmnt)))  # Test for normality

# 3. Plot data and residuals
par(cex = 1.3, las = 1)                                   # Enlarge, orient plot labels
plot(Trtmnt, Score, xlab = "Pesticide", ylab = "Score")   # Box plot
par(mfrow = c(2, 2)); plot(aov(Score ~ Trtmnt))           # 4 residual plots
par(mfrow = c(1, 1)); detach(aovdata)                     # Reset plot window; detach data frame

The summary command gives an ANOVA table of the following form (the sums of squares and mean squares are left for you to fill in from your own run):

            Df    Sum Sq    Mean Sq    F value    Pr(>F)
Trtmnt       2                           16.78     <0.001 ***
Residuals   12
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, you conclude that the treatment types differ in their effects on survival of hoverfly larvae (F2,12 = 16.78, P < 0.001). The ANOVA tells you nothing more than this. You then interpret where the difference lies from the box plot (showing the median, first and third quartiles, and whiskers extending to the most extreme values within 1.5 times the interquartile range; any outliers would be plotted individually).

The first two of the four residual plots are the most useful here. Residuals versus fitted (mean) response visualises any heterogeneity of variances. Residuals versus theoretical (normal) quantiles visualises any systematic deviation from the normal expectation given by the diagonal line. For these data the plots show no detectable increase in heterogeneity with the mean (Bartlett's K² = 2.63 on 2 d.f., P = 0.27), and no systematic deviation from normality (Shapiro-Wilk W = 0.96, P = 0.75).
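The web address of the plot_means script used earlier is not reproduced in these notes. If you cannot source it, the following base-R stand-in computes and plots the group means with the same approximate 95% confidence intervals (mean ± 1.96 SE):

means <- tapply(aovdata$Score, aovdata$Trtmnt, mean)
ns    <- tapply(aovdata$Score, aovdata$Trtmnt, length)
ses   <- tapply(aovdata$Score, aovdata$Trtmnt, sd) / sqrt(ns)
ci    <- 1.96 * ses                                  # the normal approximation used in the notes
rbind(lower = means - ci, mean = means, upper = means + ci)

plot(seq_along(means), means, ylim = range(means - ci, means + ci),
     xaxt = "n", xlab = "Pesticide", ylab = "Score", pch = 16)
axis(1, at = seq_along(means), labels = names(means))
arrows(seq_along(means), means - ci, seq_along(means), means + ci,
       angle = 90, code = 3, length = 0.05)          # vertical bars for the intervals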

LECTURE: TWO-WAY ANALYSIS OF VARIANCE

We have used one-way Analysis of Variance to test whether different treatments of a single factor have an effect on a response variable (finding a treatment effect: F2,12 = 16.78, P < 0.001). With two-way Analysis of Variance, we divide the samples in each treatment into sub-samples, each representing a different level of a second factor. A hypothetical example illustrates what the analysis can reveal about the response variable.

Example of two-way Analysis of Variance: factorial design

In the following experiment, we wish to test the efficacy of different systems of speed reading, and to know whether males and females respond differently to these systems. We randomly assign 30 subjects (S1 to S30) to three treatment groups, T1, T2 and T3, with 10 subjects per treatment, of which 5 are male and 5 female. The three groups are each tutored in a different system of speed reading. A reading test is then given and the number of words per minute is recorded for each subject. The data are presented in a design matrix like this:

Table 1. Design matrix for factorial Analysis of Variance.

                 SYSTEM
SEX        T1             T2              T3
Male       Y1, ... Y5     Y11, ... Y15    Y21, ... Y25
Female     Y6, ... Y10    Y16, ... Y20    Y26, ... Y30

The table thus has 6 data cells, each containing the responses of 5 independent subjects (here coded Y1, ... Y5, etc.). This is a factorial design because these six cells represent all treatment combinations of the two factors SEX and SYSTEM. Because each cell contains the same number of responses, we call this a balanced design, and because each level of one factor is measured against each level of the other, it is also an orthogonal design. [See the later section on cross-factored Analysis of Variance with unbalanced data.]

A two-way Analysis of Variance will give us three very useful pieces of information about the effects of the two factors:
1. Whether mean reading speeds differ between the three techniques when responses of males and females are pooled, indicated by a significant F for the SYSTEM main effect;
2. Whether males and females have different reading speeds when responses for the three systems are pooled, indicated by a significant F for the SEX main effect;
3. Whether males and females respond differently to the techniques, indicated by a significant F for the SEX:SYSTEM interaction effect.

We get these three values of F from five sources of variation: the n scores themselves, the a cell means Ȳ, the r row means R̄, the c column means C̄, and the single global mean Ḡ.
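Before looking at the analysis, it may help to see what the data file presumably looks like in long format: one row per subject, with a column for each factor and one for the response. The values below are invented purely to show the layout (the real file, system-by-sex.csv, is read in the next section).

aovdata <- data.frame(
  sex    = rep(c("male", "female"), each = 5, times = 3),   # 5 of each sex per system
  system = rep(c("T1", "T2", "T3"), each = 10),
  speed  = round(rnorm(30, mean = 300, sd = 30))             # invented reading speeds
)
head(aovdata)
table(aovdata$sex, aovdata$system)   # 5 subjects per cell: balanced and orthogonal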

Table 2. Component means for the factorial design.

                 SYSTEM
SEX             T1      T2      T3      Row means
Male            Ȳ11     Ȳ12     Ȳ13     R̄1
Female          Ȳ21     Ȳ22     Ȳ23     R̄2
Column means    C̄1      C̄2      C̄3      Ḡ

The R analysis of real data is shown below, producing the interaction plot discussed in the following pages. The output contains the three values of the F-statistic and their significance. The rest of this section is devoted to explaining just how the means in the table above can lead us to the inferences in the analysis below: that sex and system both have additive effects on reading speed, with no interaction between them.

# Prepare data frame aovdata
aovdata <- read.table("system-by-sex.csv", sep = ",", header = T)
attach(aovdata)
# Classify factors and covariates:
sex <- as.factor(sex); system <- as.factor(system)
# Specify the model structure:
summary(aov(speed ~ sex*system))

The ANOVA table takes this form (the sums of squares are omitted here; the F-ratios and their interpretation are discussed below):

             Df    F value    Pr(>F)
sex           1       5.72     0.025   *
system        2      56.62     <0.001  ***   (of the order of 1e-10)
sex:system    2       0.32     >0.7
Residuals    24

# Interaction plot:
interaction.plot(sex, system, speed, xlab = "Sex", ylab = "Speed",
    trace.label = "System", las = 1, xtick = TRUE, cex.lab = 1.3)

# Test for homogeneity of variances
bartlett.test(speed ~ interaction(sex, system))
# (output: Bartlett's K-squared on 5 d.f.; the values are not reproduced in these notes)

# Test for normality of residuals
shapiro.test(resid(aov(speed ~ sex*system)))
# (output: Shapiro-Wilk W and its p-value; not reproduced in these notes)

detach(aovdata)

Using a statistical model to define the test hypothesis

In defining the remit of our analysis, we want to make a statement about the hypothesised relationship of the effects to the response variable, and this can be done most concisely by specifying a model. In the one-way Analysis of Variance that you conducted in the practical, you tested the model:

HOVERFLIES = PESTICIDE + ε

The '=' does not signify a literal equality, but a statistical dependency. So the statistical analysis tested the hypothesis that variation in the response variable on the left of the equals sign (numbers of hoverflies) is explained or predicted by the factor on the right (pesticide treatments), in addition to a component of random variation (the error term, ε, 'epsilon'). This error term describes the residual variation between the plots within each treatment. We could have written it out in full as PLOTS'(PESTICIDE), meaning the variation between the random plots nested within the different types of pesticide ('nested' because each treatment has its own set of plots). The Analysis of Variance tested whether much more of the variation in hoverfly numbers falls between the categories of Zap, GoFly and Noxious, and so is explained by the independent variable PESTICIDE, than lies within each category as unexplained residual variation, ε = PLOTS'(PESTICIDE). This was accomplished by calculating the ratio:

Pesticide effect:   F = MS_group / MS_error = MS[PESTICIDE] / MS[PLOTS'(PESTICIDE)]

For our two-way experimental design, we can also partition the sources of variance. This time the sources partition into two main effects plus an interaction, and the residual variation within each sex-and-system combination. The full model statement looks like this:

SPEED = SEX + SYSTEM + SEX:SYSTEM + SUBJECTS'(SEX:SYSTEM)

The four terms on the right of the equals sign describe all the sources of variance in the response term on the left. The last term describes the error variation, ε. It is often not included in a model description because it represents residual variation unexplained by the main effects and their interaction. But it is always present in the model structure, as the source of random variation against which to calibrate the variation explained by the main effects and interaction. With this model, we can calculate three different F-ratios:

Sex effect:                  F = MS[SEX] / MS[SUBJECTS'(SEX:SYSTEM)]
System effect:               F = MS[SYSTEM] / MS[SUBJECTS'(SEX:SYSTEM)]
Sex:System interaction:      F = MS[SEX:SYSTEM] / MS[SUBJECTS'(SEX:SYSTEM)]

Degrees of freedom

Before attempting the analysis, we should check how many degrees of freedom there are for each of the main effects and the interaction, and how many error degrees of freedom. Remember that degrees of freedom are given by the number of pieces of information that we have on a response, minus the number needed to calculate its variation. The SEX main effect is tested with 1 degree of freedom (one less than its two levels: male and female), and the SYSTEM main effect with 2 degrees of freedom (one less than its three levels);

the SEX:SYSTEM interaction effect is tested with the product of these two sets of degrees of freedom (i.e. 1 × 2 = 2 degrees of freedom). The error degrees of freedom for both effects and the interaction comprise one less than what remains of the total sample of N = 30, which is 30 - (1 + 2 + 2) - 1 = 24. You can also think of error degrees of freedom as being N - a, which is the number of observations minus the a = 6 sample means needed to calculate their variation. Thus the significance of the SEX effect is tested with a critical F1,24, SYSTEM with F2,24 and the SEX:SYSTEM interaction with F2,24.

General rule: In general, for an Analysis of Variance on N subjects (Y) measured against two independent factors X1 (the row factor in a design matrix such as Table 1) and X2 (the column factor), with r and c levels (samples) respectively, the model has the following degrees of freedom:

Model:   Y  =  X1     +  X2     +  X1:X2            +  Y'(X1:X2)
d.f.:          r - 1     c - 1     (r - 1)(c - 1)       N - r·c

The reason why the error degrees of freedom are r·c less than N is simply that r·c is equal to one more than the sum of all the main-effect and interaction degrees of freedom. Thus the four sets of degrees of freedom add up to a total of N - 1 degrees of freedom. In practice, when you design an experiment or fieldwork protocol that will require Analysis of Variance, you can use this knowledge to work out in advance how many subjects you need. You will need r·c degrees of freedom (e.g. 2 levels of sex times 3 of system = 6) just to define the group dimensions, and then at least the same again to give you enough error degrees of freedom for a reasonably powerful test.

How to do a two-way Analysis of Variance

A two-way analysis comprises a test of the model as a whole, and a test of the individual terms in the model. Its degrees of freedom and sums of squares follow the same principles as the one-way Analysis of Variance. In the tables below, n defines the number of replicates in each of the r × c samples; the sums of squares relate to each other in that the model SS (row 1) partitions into rows + columns + interaction (rows 4 to 6), and the total SS (rows 3 and 8) equals the model SS plus the error SS.

Table 3a. Degrees of freedom and sums of squares for the two-factor model as a whole.

Source of variation           d.f.              SS
1 Among cells (model)         r·c - 1           n Σ(Ȳ_cell - Ḡ)²
2 Within cells (error)        r·c(n - 1)        Σ(Y - Ȳ_cell)²
3 Total                       r·c·n - 1         Σ(Y - Ḡ)²

Table 3b. Degrees of freedom and sums of squares for the individual terms in the model.

Source of variation            d.f.               SS
4 Between rows (Sex)           r - 1              c·n Σ(R̄ - Ḡ)²
5 Between columns (System)     c - 1              r·n Σ(C̄ - Ḡ)²
6 Interaction (Sex:System)     (r - 1)(c - 1)     SS_model - SS_rows - SS_columns
7 Within cells (error)         r·c(n - 1)         Σ(Y - Ȳ_cell)²
8 Total                        r·c·n - 1          Σ(Y - Ḡ)²

These sums of squares allow us to calculate mean squares, MS, for components 1 to 2 and 4 to 7, by dividing each SS by its degrees of freedom. Finally, we get one F-statistic for each of components 4, 5 and 6, by dividing that component's MS by the MS_error (from component 7). These are the mean squares and F-statistics shown in the R output earlier. You do not need to learn the formulae in the tables above, but you should be able to gain from them an appreciation of how the total sums of squares are partitioned into the different sources.

Interpreting the results

When we did one-way Analysis of Variance we obtained a single F-statistic on which to base our conclusions about the hypothesised relationship. The two-way analysis, however, gives three different values of F, each telling us about a different aspect of the hypothesised relationship. A significant SEX:SYSTEM interaction would allow us to conclude that the techniques have different effects on males and females. In the example analysed above, the interaction term is not significant (F2,24 = 0.32, p > 0.7), meaning that the effect of reading technique on speed is not modulated by (does not depend on) sex. In other words, reading technique influences speed in the same way for males and females. That would be the conclusion from the R analysis shown above. A significant SEX effect (F1,24 = 5.72, p = 0.025) means that males and females have different mean speeds, irrespective of technique. A significant SYSTEM effect (F2,24 = 56.62, p < 0.001) means that reading technique does influence mean speeds, irrespective of sex.

How do we interpret the analysis if one or other of the main effects is not significant? If the interaction effect is significant, but the SYSTEM effect is not, what does this tell us about the different reading techniques? In general, if an interaction term is significant, then both of the component factors influence the response, because each one modulates the effect of the other on the response variable, even if one or both main effects are not significant on their own. We should therefore always report a significant interaction first, before considering the main effects. Some graphical illustrations will help to explain why this is.

Using interaction plots to help interpret two-way Analysis of Variance

Take a look at the set of eight graphs described below (Fig. 2). These are called interaction plots and they illustrate all eight possible ways in which a response variable can depend on two factors. The idea is to plot the response variable against one of the independent effects (it does not matter which one) and then plot on the graph the sample means for each level of the other independent effect. For the sake of clarity, means are plotted without error bars, and we can assume that each would have only a small residual variation above and below it. For each type of SYSTEM (T1, T2 and T3), the mean response is plotted for each SEX (male or female), and joined by a line. Thus the mid-point of each of these lines reveals the mean reading speed for systems T1, T2 and T3, irrespective of any sex effects. You can guess roughly where the mean reading speed is for each sex from the average height of the three points at each sex.
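Before working through the eight possible outcomes, here is a hedged sketch that simulates a balanced 2 × 3 design with invented speeds, just to show where the degrees of freedom of 1, 2, 2 and 24 appear in the aov() table and how interaction.plot() draws this kind of figure:

# Simulated balanced 2 x 3 design (30 subjects); the speeds are random numbers,
# used only to illustrate the layout of the output
set.seed(1)
sex    <- factor(rep(c("M", "F"), each = 15))
system <- factor(rep(rep(c("T1", "T2", "T3"), each = 5), times = 2))
speed  <- rnorm(30, mean = 300, sd = 30)

summary(aov(speed ~ sex * system))   # Df column: sex 1, system 2, sex:system 2, Residuals 24
interaction.plot(sex, system, speed, xlab = "Sex", ylab = "Speed", trace.label = "System")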

Fig. 2. Interaction plots for two independent effects, illustrating the eight possible outcomes of a two-way Analysis of Variance. In each panel, mean SPEED is plotted against SEX (M, F), with one line per SYSTEM (T1, T2, T3). The eight panels show:

1. Significant SEX effect; no significant SYSTEM effect; no significant interaction.
2. No significant SEX effect; significant SYSTEM effect; no significant interaction.
3. Significant SEX effect; significant SYSTEM effect; no significant interaction.
4. Significant SEX effect; significant SYSTEM effect; significant interaction.
5. No significant SEX effect; significant SYSTEM effect; significant interaction.
6. Significant SEX effect; no significant SYSTEM effect; significant interaction.
7. No significant SEX effect; no significant SYSTEM effect; significant interaction.
8. No significant SEX effect; no significant SYSTEM effect; no significant interaction.

Graph 1 in Fig. 2 shows three systems that do not differ in their effects on reading speed, but females out-perform males on average. Graph 2 shows males and females doing equally well, but subjects learning system T1 outperforming those learning system T2, who do better than those learning system T3. Graph 3 shows the same differences between systems, but with females also doing better on average than males under any of the systems. This is the result we actually obtained. Graph 4 shows what a significant interaction effect looks like. The effects of system depend on sex, with differences between the systems having a more pronounced effect on female reading speeds than on those of males. In other words, the system effect is modulated by sex (or equally, the sex effect is modulated by system). Graph 5 shows males and females with the same average reading speeds (as in graph 2), but the system effect depends very much on sex, with T3 being best for males and T1 for females. In graph 6, females do better than males on average. The mid-points of the lines all coincide at the same score for the response variable, and so no differences are apparent between the systems if we pool males and females. But the type of reading system clearly does have an important influence on males, and an equally important - but different - influence on females. Thus the significant interaction indicates a real effect of system, even though it was not significant as a main effect. In graph 7, neither sex nor system is significant as a main effect, but their combined effect is. The effects of system are apparent only when the sexes are considered separately. In graph 8, speed is not influenced by sex or system, either independently or interactively. Only under this outcome would the null hypothesis be accepted, that neither factor has an influence on reading speed.

Other types of two-way Analysis of Variance

So far we have only considered factorial designs, which have replicates in all combinations of levels of both factors. If a two-factored design has no replication within each cell, then it will not be possible to look for interaction effects, and they must be assumed to be negligible. The Latin square is an example of this (read more about it in Sokal & Rohlf). It is used in situations where a single main effect is being tested (say 4 types of fertiliser on crop yield), but in the presence of a second, nuisance effect (e.g. a gradient of moisture on the slope of a hill). The best way to deal with this situation is to lay out the plots in a structured pattern (rather than by random allocation):

Hill top       A  B  C  D
               B  C  D  A
               C  D  A  B
Hill bottom    D  A  B  C

Thus each of the 4 levels of height receives each of the 4 types of fertiliser (A-D), so it is a fully orthogonal design. The test model is:

Response = Factor + Block

meaning that the response (yield) is to be tested against a main factor (fertiliser) and a blocking variable (moisture), with an error mean square being provided by the unexplained Factor:Block interaction. Many other designs are possible. You might read about nested analyses, or three-way or higher-order factorials, but when designing your own data collection, try to avoid the need for these, because greater sophistication always requires more stringent conditions.
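A hedged sketch of the Latin square above in R, with invented yields, shows how the layout translates into the test model Response = Factor + Block:

layout <- matrix(c("A","B","C","D",
                   "B","C","D","A",
                   "C","D","A","B",
                   "D","A","B","C"), nrow = 4, byrow = TRUE)   # rows = hill top to bottom

plots <- data.frame(
  height     = factor(rep(1:4, each = 4)),     # blocking variable: the moisture gradient
  fertiliser = factor(as.vector(t(layout))),   # treatments read off the layout row by row
  yield      = rnorm(16, mean = 10)            # invented yields
)
summary(aov(yield ~ fertiliser + height, data = plots))   # error MS comes from the
                                                          # unexplained Factor:Block variation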


LECTURE: REGRESSION

We have seen how Analysis of Variance gives us the capacity to test for differences between category means. For example, are males heavier on average than females in the sampled population? Here the response variable is weight and the categories are the two sexes. Sometimes, however, we want to measure the response variable against a continuous, instead of a categorical, variable. If we want to know whether Weight varies with Age, we could divide the observations into age categories (e.g. 'juvenile' and 'adult') and do an ANOVA, or we could measure Weight on a continuous scale with Age. In the latter case we are asking whether Weight regresses with Age. Specifically, we hypothesise that Weight shows a linear relationship to Age (we will treat non-linear relationships later). The statistical model is the same in both cases, and it is tested with Analysis of Variance in both cases. Only the degrees of freedom are different:

Model for Analysis of Variance by categories (n data points and a categories):
    Weight = Age + ε        d.f. for Age: a - 1;  d.f. for ε: n - a

Model for Analysis of Variance by regression (n data points):
    Weight = Age + ε        d.f. for Age: 1;  d.f. for ε: n - 2

Both models could be analysed with the aov command in R, though the first one would require identifying Age as a factor (with the command: Age <- as.factor(Age)). Whether you do the regression analysis with the aov command or the lm command in R, the same Analysis of Variance will be done for you, giving an F-statistic with 1 and n-2 degrees of freedom.

Where do these regression degrees of freedom come from? The value of F is calculated from MS[Age] divided by MS[ε]. For MS[Age] we have 1 d.f. because we have two pieces of information with which to construct our regression line - the intercept and slope - and we need one piece of information - the overall mean weight - in order to calculate whether the regression varies from horizontal. For MS[ε] we have n-2 degrees of freedom because we have n pieces of information - the data points - and we need two pieces - the intercept and slope - in order to calculate the residual variation, given by the squared deviation of each observation from the line.

Let's see how this works with an actual example. The following page shows a data set on newborn badger cubs. Body weights in grams at different ages in days have been typed into a text file and the response Weight regressed against the predictor Age. The lm command in R has done an Analysis of Variance on the 12 data points, giving 1 and 10 d.f. This Analysis of Variance tests the compatibility of the data with a regression slope of zero (i.e. a horizontal regression) in the population of interest. The result of F1,10 = 3.90, with P > 0.05, tells us that we have too high a probability of a false positive to reject the null hypothesis of zero slope, and therefore that weight does not co-vary detectably with age. The plot shows data points with homogeneous variance across the range of Age, no obvious deviations from normally distributed residuals around the regression line, and a linear relationship. The 95% confidence intervals in the plot show that the regression slope could swivel to horizontal without passing outside them, confirming our lack of confidence in the sampled population having a relationship of Weight to Age. How does the analysis arrive at this result? Look now at page 25, which shows an Excel file into which the data have been typed. Here we see how the F-value was calculated.
As with the Analysis of Variance for a class predictor variable, the Analysis of Variance for a continuous predictor variable partitions the squared deviations of the response variable into two independent parts. These are the explained (or 'regression') and the unexplained (or 'residual error') sums of squares, which together add up to the total squared deviations of the response variable from its mean value. The Table on page 26 summarises the operations.

# Linear regression in R on response of Weight to Age

# 1. Prepare the data frame aovdata
aovdata <- read.table("weight-by-age.txt", header = T)
attach(aovdata)                        # Access the data frame
Age <- as.numeric(Age)                 # Set Age as numeric

# 2. Commands for regression analysis
model.1.1i <- lm(Weight ~ Age)         # Analyse and store
summary(model.1.1i)                    # Print the results

Call:
lm(formula = Weight ~ Age)
[The summary output lists the residual quartiles, the coefficient table for the intercept and the slope of Age, the residual standard error on 10 degrees of freedom, the multiple and adjusted R-squared, and the F-statistic on 1 and 10 d.f.: F1,10 = 3.90 with P > 0.05. The fitted intercept is about 420 g and the slope about -18 g per day (see the text that follows).]

# 3. Plot the data
plot(Age, Weight, cex = 1.5, las = 1, xlab = "Age (days)", ylab = "Weight (g)")

# 4. Add regression line and 95% confidence intervals
abline(coef(model.1.1i))                          # add regression line
confint <- predict(model.1.1i, interval = "confidence")
lines(Age, confint[,2], lty = 2)                  # add lower c.i.
lines(Age, confint[,3], lty = 2)                  # add upper c.i.
coef(model.1.1i)                                  # print intercept and slope

# 5. Test assumptions
shapiro.test(resid(lm(Weight ~ Age)))             # Normality of residuals
library(car); ncvTest(lm(Weight ~ Age))           # Homogeneity of variance

[Excel worksheet for the Weight-by-Age regression (page 25): the data in columns B and F, the intermediate columns of deviations, squares and products, and the scatter plot with the fitted regression line and 95% confidence intervals.]

This is how the terms are calculated in the Excel sheet on the preceding page:

1. SSx = Σ(x - x̄)². The sum of squared deviations of x from its mean, where x is Age (column B) and x̄ is the mean age (cell B18).
2. SS(Total) [or SSy] = Σ(y - ȳ)². The sum of squared deviations of y from its mean, where y is Weight (column F) and ȳ is the mean weight (cell F18).
3. SPxy = Σ(x - x̄)(y - ȳ). The sum of products of the deviations of x with y. Dividing this by (n - 1) gives the covariance.
4. Slope: b = SPxy / SSx. Gradient of the regression line. A horizontal line has b = 0. A positive gradient has b > 0, while a negative gradient has b < 0.
5. Intercept: a = ȳ - b·x̄. Calculated knowing that the regression line passes through (x̄, ȳ).
6. SS(Explained) = Σ(ŷ - ȳ)². Explained sum of squared deviations, where ŷ = a + b·x. This is the magnitude of the predicted deviation from ȳ.
7. d.f.(Explained) = 2 - 1 = 1. We have two pieces of information (a and b) and we need one piece (ȳ) to calculate the explained variation.
8. MS(Explained) = SS(Explained) / d.f.(Explained). Mean square explained variation: the variance measured as variability per degree of freedom.
9. SS(Error) = Σ(y - ŷ)². Unexplained sum of squared deviations, where ŷ = a + b·x. This is the magnitude of deviation from the predicted ŷ.
10. d.f.(Error) = n - 2. We have n pieces of information (the values of y) and we need two pieces (a and b) to calculate the error variation.
11. MS(Error) = SS(Error) / d.f.(Error). Mean square unexplained (residual error) variation: the variance measured as variability per degree of freedom.
12. F = MS(Explained) / MS(Error). The ratio of explained to unexplained variances, to be compared against tables of the F-distribution with 1 and n-2 degrees of freedom.
13. R² = SS(Explained) / SS(Total). Coefficient of determination (often written r²): the proportion of explained variation. If R² = 1, all y lie on a regression line for which b ≠ 0; if R² = 0 then b = 0.
14. R = SPxy / √(SSx·SSy). Pearson product-moment correlation coefficient, r. Equal in magnitude to the square root of the coefficient of determination. Negative R means y tends to decrease with increasing x.

Other terms in the R output: t-values are Student's-t tests for departures of the intercept from zero, and of the slope of Weight with Age from zero. Note that the value of the Student's-t test of the slope is equal to the square root of the value of F from the Analysis of Variance, and both significances are identical. This is because both these tests are accomplishing exactly the same task. Residual standard error is the square root of the variance term given by MS(Error). Multiple R-squared is the coefficient of determination. Adjusted R-squared is an adjusted coefficient of determination that is uninfluenced by the number of d.f.
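As a minimal sketch (not part of the original worksheet), the same quantities can be computed step by step in R, assuming the Age and Weight variables from the listing on page 24 are in the workspace:

x <- Age;  y <- Weight                         # predictor and response
SSx  <- sum((x - mean(x))^2)                   # 1. sum of squared deviations in x
SSy  <- sum((y - mean(y))^2)                   # 2. SS(Total)
SPxy <- sum((x - mean(x)) * (y - mean(y)))     # 3. sum of products
b <- SPxy / SSx                                # 4. slope
a <- mean(y) - b * mean(x)                     # 5. intercept
yhat  <- a + b * x                             # predicted values
SSexp <- sum((yhat - mean(y))^2)               # 6. SS(Explained)
SSerr <- sum((y - yhat)^2)                     # 9. SS(Error)
Fval  <- (SSexp / 1) / (SSerr / (length(y) - 2))   # 12. F with 1 and n-2 d.f.
R2 <- SSexp / SSy                              # 13. coefficient of determination
r  <- SPxy / sqrt(SSx * SSy)                   # 14. correlation coefficient

These values should agree with the lm output and with the Excel worksheet.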

The regression analysis on pages 24 and 25 works by partitioning the total variation in the response variable into explained and unexplained parts. The total variation is obtained from summing all the squared deviations of each weight value from the mean weight. The long arrow on the graph on page 25 illustrates the portion of total variation contributed by just one observation. The analysis will partition the total variation into its two components, illustrated by the shorter arrows on the graph. One component is predicted by the regression line, SS(Explained), while the other is the unexplained variation around the line, SS(Error). The analysis will then calculate the average squared deviations of these two components, in order finally to get from their ratio, MS(Explained) / MS(Error), the F-value with which to test the significance of the regression.

The analysis proceeds in steps. First we find the regression line that will estimate values of y for each of our values of x. With these predicted values, ŷ, we will then be able to sum their squared deviations from ȳ in order to get the explained sum of squares, SS(Explained).

Steps 1 to 5 of the table on page 26. To find the regression line we must find values for two new parameters: the slope of the line, b, and its intercept with the y-axis, a. The slope b is calculated from the sum of products, SPxy = Σ(x - x̄)(y - ȳ), divided by the sum of squared deviations in x, SSx = Σ(x - x̄)². The sum of products on the numerator tells us about the covariance of y with x. It gives the slope a positive value if the coordinates for each data point, (x, y), tend to be either both larger than their respective means, (x̄, ȳ), or both smaller. The slope will have a negative value if in general x < x̄ when y > ȳ, and vice versa. This formula for b also means that the gradient of the slope will have a magnitude of one if, on average, each deviation y - ȳ has the same magnitude as each corresponding deviation x - x̄. If the deviations in y are relatively greater than those in x, then the slope will be steeper than 1. The Excel sheet on page 25 shows that the regression line on the graph has a negative gradient of roughly -18, signifying that y is predicted to decrease as x increases and that each decrease in y is predicted to be some 18 times the corresponding increase in x.

The intercept a is calculated from a = ȳ - b·x̄. This is simply a rearrangement of the equation for a straight line: y = a + b·x. In this case we have known values for the two variables y and x, in their respective sample means ȳ and x̄, and since we have just calculated b, we can now find the unknown a. With values for a and b, we have all the information we need to draw the regression line on the graph. Excel can do this for us if we request Add Trendline... from the Chart menu. The result is shown on the graph on page 25, and it accords with the equation: the line appears to intercept the y-axis somewhere around 400 g, and the calculated a tells us it is at approximately 420 g.

Steps 6 to 8. With the two parameters b and a we can predict Weight, ŷ, for any given value of Age, x. For each observed x we now calculate (ŷ - ȳ)² (column L of the Excel sheet) and sum them to get the explained sum of squares, SS(Explained).

Steps 9 to 12. Finally, we need the unexplained sums of squares, which we get from the squared deviation of each y from its predicted ŷ. The sum of all these (y - ŷ)² (in column N of the Excel sheet) is then the SS(Error).
Now we calculate the mean squares, MS(Explained) and MS(Error), and the F-statistic, in just the same way as for any other Analysis of Variance.

Steps 13 to 14. There remains one final parameter to calculate: the proportion of explained variance, which is simply SS(Explained) / SS(Total). We call this fraction the coefficient of determination, r². Its square root is called the Pearson product-moment correlation coefficient, r. Step 14 of the table on page 26 shows how r is calculated directly, which results in it having a positive or negative value according to whether the regression is positive or negative.


PRACTICAL: TWO-WAY ANALYSIS OF VARIANCE IN R

Do this analysis in RStudio (refer to the Using RStudio Help Guide on Blackboard). Prepare a short report to the pharmaceutical company that makes the drug Ritalin, evaluating the utility of their product (1 side of A4). Divide your report into sections: an Introduction to explain the interest in doing the test; Experimental Design and Analysis, outlined briefly; Results, including the ANOVA table showing Sums of Squares and Mean Squares etc., with an interpretation of the analysis in the form: 'the effect of the drug depended / did not depend on the condition of the subject (F = #.##; d.f. = ##, ##; P = #.##)... the main effect of treatment...' etc. Interpret the main effects after the interaction. Include a fully annotated interaction plot. Finish with a short paragraph of Conclusions about appropriate use of the drug.

Two-way Analysis of Variance

In the previous class practical you conducted a one-way Analysis of Variance. 'One-way' meant that you were looking for differences between mean treatment effects for a single independent factor (pesticide). Sometimes we are interested in responses to more than one independent factor, and then it is possible to conduct an Analysis of Variance with two or more main effects. The example below takes you through a two-way Analysis of Variance that you can perform for yourself in R. It illustrates how analysis of two independent variables can yield informative inferences. You may find that the output you get is easier to interpret after reading the accompanying lecture notes on two-way Analysis of Variance.

Rationale

The drug Ritalin was designed to calm hyperactive children, but hyperactivity is a difficult condition to diagnose, so it is important to know what effect Ritalin has on non-hyperactive children. The following medical trial tested two groups of children, one non-hyperactive and the other hyperactive. Each group was randomly divided, with one half receiving Ritalin in tablet form and the other half a placebo (a salt tablet with no physiological effect). The following activity responses were recorded on the four samples, each of 4 children:

                        TREATMENT
CONDITION               Placebo            Ritalin
Non-hyperactive         50, 45, 55, 52     67, 60, 58, 65
Hyperactive             70, 72, 68, 75     51, 57, 48, 55

In this experimental design, the two independent variables are CONDITION (non-hyperactive or hyperactive) and TREATMENT (placebo or Ritalin). Each CONDITION is tested with each level of TREATMENT on replicate subjects. A design of this sort is called a factorial design, and it allows us to test for a possible interaction between the two factors in their effects on the response variable. Here the interaction we are seeking is whether the effect of Ritalin on activity depends on the condition of the child. This could be a good thing, if for example the drug only influences hyperactive children, or it could provide cautionary information, if the drug is found to have a more pronounced effect on non-hyperactive than hyperactive children.

Analysis with R

Enter these data into a data frame from a .csv file (command line shown in the two-way ANOVA lecture) or a .txt file (command line shown in the regression lecture). The data frame should have 16 rows, one for each score labelled with its combination of treatment-by-condition:

Treatment   Condition   Activity
Placebo     Nonhyp      50
Placebo     Nonhyp      45
Placebo     Nonhyp      55
Placebo     Nonhyp      52
Ritalin     Nonhyp      67
Ritalin     Nonhyp      60
:           :           :

Then use the same R commands as for the speed-reading analysis on page 16 to run the analysis and produce an interaction plot. This requires that you specify the response variable and explanatory factors in an ANOVA model of the form: response ~ factor_1*factor_2, meaning: variation in the response is explained by the additive effects of factors 1 and 2 and by their interaction. You could equally spell out the model without using the * shorthand: response ~ factor_1 + factor_2 + factor_1:factor_2. Both expressions give identical results. In this case, the model you are going to test with Analysis of Variance is that activity is influenced by treatment and by the child's condition, and by the interaction of treatment with condition. The model tests these explained sources of variation in the response against unmeasured residual variation. Save the interaction plot and copy it into your report on the analysis.

Now check the residuals, by nesting the aov( ) command within a plot( ) command (see example on page 14). The first two graphs suffice to show homogeneous variances, which is the most important consideration, though with a rather flat distribution of residuals. As with everything in R, if you are not sure how to do something, try it and see: you can't break the package! Save your commands in a script file, so that you can use them again in the future, and refer to them to see how you did things in the past. Do search the web for help, as usually someone will have posted an answer to someone else's similar problem. For example, if you want to know more about interpreting the Normal Q-Q plot of residuals, try Googling 'normal Q-Q plots in R showing skew'.

Peruse the results of the ANOVA, noting that a separate F-value and associated p-value have been produced for each of the main effects Treatment and Condition, and for the Treatment-by-Condition interaction. Which effects are significant? How do we interpret these results? Refer to the lecture notes on two-way ANOVA to be sure which d.f. apply to each F-value.

Interpretation

The analysis reveals something very interesting from a medical point of view, though it needs the interaction plot to understand it. This plot illustrates qualitatively what the ANOVA described statistically, and it unmasks the full effect of the drug: hyperactive children are less active on average with the drug than with the placebo. That is to be expected, but non-hyperactive children are more active on average with the drug than with the placebo. This is the significant interaction effect that you will have obtained in the ANOVA. For each Treatment level, the point midway between the two condition-level means indicates that Treatment-level mean after pooling levels of Condition. These midway points are similar for the two treatments (an Activity score of about 61 for Placebo and 58 for Ritalin), which explains the non-significant main effect of Treatment. Does a non-significant main effect of Treatment indicate that the drug is ineffectual? No! The significant interaction means that the full effects of the drug become apparent only when the condition of the children is taken into account. Ritalin does affect activity, but although it subdues hyperactive children it raises the activity of non-hyperactive children.
This is one reason why it is a controversial drug that must be prescribed only to hyperactive children. The take-home message for interpreting two-way ANOVA is to read the ANOVA table from the bottom up, because the main effects only make sense in the light of the interaction.
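For reference, a minimal sketch of the commands this practical calls for, assuming the data have been saved in a text file called ritalin.txt with the three columns laid out above (file and object names are illustrative):

ritalin <- read.table("ritalin.txt", header = TRUE)         # 16 rows: Treatment, Condition, Activity
attach(ritalin)
model.rit <- aov(Activity ~ Treatment * Condition)          # main effects plus interaction
summary(model.rit)                                          # ANOVA table for the report
interaction.plot(Condition, Treatment, Activity)            # annotate and copy into the report
plot(model.rit)                                             # residual diagnostics

The model formula Activity ~ Treatment * Condition is the * shorthand described above; interaction.plot() draws one line per Treatment level across the levels of Condition.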

LECTURE: CORRELATION AND TRANSFORMATIONS

Review of ANOVA procedures in regression

We have seen how the significance of a simple regression line is calculated by one-way Analysis of Variance. Our example used the statistical model: Weight = Age + ε. We evaluated how good a predictor Age is in this model by partitioning the total observed variation in weight (measured as the sum of squared deviations from the sample mean: Σ[y - ȳ]²) into a portion explained by the line of best fit for Weight against Age (SS[Age] = Σ[ŷ - ȳ]²), and an unexplained portion (SS[ε] = Σ[y - ŷ]²). We could then work out our F-statistic from the ratio of average explained variation to average unexplained variation: F1,n-2 = MS[Age] / MS[ε].

Just as you can expand ANOVA from a one-way to a two-way analysis by introducing a second factor (as we did in Lecture 2 and Practical 2 in this series), so you can expand regression from simple to multiple regression, by introducing a second factor. This second factor may be categorical, in which case you can plot the response variable against the continuous factor, and calculate one regression line for each level of the categorical factor. If the regression lines are not horizontal then you may have a significant continuous factor, and if the lines do not coincide then you may have a significant categorical factor. If the regression lines have different slopes, then you may have a significant interaction effect. The interaction plots shown on p. 20 of this booklet illustrate some of the range of outcomes you could get - just think of the x-axis as representing some continuous variable instead of the categorical factor 'Sex' (for example 'Age'), and the lines joining sample means then become regression lines for each level of the categorical factor (in this case, 'System').

If the second factor is continuous rather than categorical, then you will need to illustrate these data in a 3-dimensional graph, with the response on the vertical axis, and the two continuous factors on orthogonal (i.e. at right-angles) horizontal axes. The best-fit model will then be a plane through the data, as opposed to lines through the data.

With these more complicated models, the Analysis of Variance should be done with a balanced design, so that the same number of observations are recorded at each combination of factor levels. The design can become unbalanced by missing data, or by using explanatory factors that are correlated with each other and therefore non-orthogonal. For example, if variation in body height is modelled against right-leg length and against left-leg length, the second-entered explanatory variable will appear to have no power to explain height, while the first-entered explanatory variable may appear highly significant. The problem is that the two variables are correlated with each other, so the design is unbalanced by having missing data on short-left with long-right legs and on short-right with long-left legs. In effect, the variables are not orthogonal to each other. Having accounted for the variation explained by the first-entered factor, there is then necessarily little variation left over for explanation by the second-entered factor. The true relationship would be better analysed with a one-factor regression on a single composite explanatory variable of leg length that uses the average of left and right lengths. For more on this topic see Doncaster & Davey (2007, Analysis of Variance and Covariance).
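A minimal sketch of the categorical-plus-continuous case described above, using the booklet's Weight and Age variables with a hypothetical Sex factor added (Sex is not part of the badger data; it is included only to illustrate the model structure):

model.cov <- lm(Weight ~ Age * Sex)   # one regression line of Weight on Age per level of Sex
anova(model.cov)                      # tests Age, Sex, and the Age:Sex interaction

Non-coinciding lines would show up as a significant Sex effect, and different slopes as a significant Age:Sex interaction, mirroring the interpretation of the interaction plots on p. 20.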

Correlation

For some types of investigation of covariance between continuous variables we may wish to seek correlation without making predictions about how one variable is influenced by the other. For example, if we have measures of body Volume for each Weight, we may not have an a priori reason for knowing whether Volume determines Weight, or Weight determines Volume. For the analysis of Weight and Age, in contrast, Age was clearly an explanatory (predictor, x) variable and Weight the response (y) variable. The analysis of those two factors was predictive because Age was hypothesised to influence Weight, but Weight could not under any circumstances influence Age. Wherever we have employed Analysis of Variance up to now, it has been used to explain variation in a response variable in terms of a predicted effect. For the analysis of Weight and Volume we may not have a priori reasons for classifying one variable as effect and the other as response. We then restrict ourselves to seeking an interdependency, or an association, between the two continuous variables.

We can test for association with the correlation coefficient r, because its value does not depend on which variable is on which axis. The strength of correlation can still be tested with the Student's-t or the Analysis of Variance, as on page 24, because both these tests remain unchanged regardless of which variable is x and which y. The equation of the regression line does change, however, if we swap the axes. We can see what happens to it by manipulating the regression we did of Weight with Age (pp. 24-26, and practical 3 - you can try this with the Excel sheet that you create for the practical). The equation for the regression on page 25 was of the form Weight = a + b·Age, with an intercept a of about 420 g and a slope b of about -18 g per day. If the axes are swapped, a new regression equation is yielded, Age = a' + b'·Weight, which can be rearranged in terms of weight to give a second equation for Weight against Age. These two equations give entirely different predictions for weight change with age, and only the first one is correct. The second equation illustrates the kind of error that you might get if you used regression without respecting the requirement always to put the response variable on the vertical axis and the predictor variable on the horizontal axis. The first equation predicts correctly that cubs have an average weight at birth of 420 g (when Age = 0) and an average loss rate of 18 g per day, whereas the second equation erroneously predicts an average birth weight 1.7 times greater, and an average rate of weight loss 3.5 times greater, than these figures.

If you are in doubt about whether one of your variables is a true predictor, then do not put a line of best fit through the plot. Just stick to the simple correlation coefficient r for evaluating the association between the two variables. Use r instead of r² because the sign of r provides valuable information about whether the variables are positively or negatively correlated with each other. Remember, however, that the correlation coefficient does assume the two variables have a linear relation to each other. A perfect linear relation will return a value of r = 1.0, but a perfect curved relation will return a value of r < 1.0. If your variables are not related to each other in some direct proportion, then you may need to transform one or other axis in order to linearize the relation (see p. 35).
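A minimal sketch of how this association could be tested in R, assuming Weight and Volume are numeric vectors of paired measurements (names illustrative):

cor(Weight, Volume)                             # Pearson's r: no predictor or response implied
cor.test(Weight, Volume)                        # Student's-t test of r against zero
cor.test(Weight, Volume, method = "spearman")   # rank-based alternative if assumptions fail

cor.test() reports the same t and P that a regression of either variable on the other would give for its slope, which is why the test of association does not depend on which variable is x and which is y.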

The graphs below illustrate some types of correlation (from Fowler et al., Practical Statistics for Field Biology, Wiley). Note that the last graph, of perfect rank correlation, would give a Spearman's rank correlation coefficient of rs = 1.0, which is clearly an over-estimate of the true level of correlation. The non-parametric Spearman's coefficient is simply Pearson's coefficient calculated on the ranks. Use the parametric Pearson's in preference to Spearman's wherever you can meet its assumptions.

Transforming data to meet the assumptions of parametric Analysis of Variance

Analysis of Variance has proved to be a powerful and versatile technique for analysing any kind of response variable showing some variation around a mean value. We can use ANOVA to explain this variation in terms of two or more levels of a factor (one-way ANOVA), or in terms of the interacting levels of two or more factors (two-way or multi-way ANOVA), or in terms of one or more continuous factors (simple or multiple regression). We can also use ANOVA to test the evidence for a correlation between two continuous variables. Wherever you have observations of a continuous variable that you wish to explain in terms of one or more factors, consider using Analysis of Variance before you think of using non-parametric statistics. Parametric tests are more powerful because they use the actual data rather than ranks, and for many types of data there simply is no appropriate non-parametric test (e.g. regression, two-way analyses with categorical and continuous factors, interactions etc.).

Having decided to use parametric Analysis of Variance, you must be aware of its underlying assumptions (introduced on p. 6 of this booklet). If you also know the ways in which these are likely to be violated, then you can pre-empt many potential difficulties by applying appropriate transformations to the data. These are the assumptions:

1. Random sampling, so that your observations are a true reflection of the population from which you took them. Is it a problem? This is a basic assumption of all statistical analyses, parametric or non-parametric. Whether or not it is met depends on sampling strategy. Solution: if your data do not meet it, then you will have to resample your data.

2. Independent observations, so that the value of one data point cannot be predicted from the value of another. Is it a problem? This is a basic assumption of all statistical analyses, parametric or non-parametric, and it depends on sampling strategy. Solution: if your data do not meet it, then either resample your data or factor out the non-independence by adding a new explanatory factor (e.g. add the categorical factor Subject if you have repeated measures on each subject).

3. Homogeneity of variance around a regression line (for a covariate), or of variances around sample means (for a factor), because the ANOVA uses pooled error variances to seek differences between means, and it does not seek differences between variances. Is it a problem? Depends on the type of observations. Often violated by observations that cannot take negative values, such as weight, length, volume, counts etc., because these are likely to have a variance that increases with the mean. Solution: log-transformation of the response (which for regression and correlation may then require log-transformation of x also, to reinstate linearity).

4. Normal distribution of residual variation around a regression or around sample means, because this distribution is described by just two parameters: the mean and variance, which are the two employed by ANOVA (a skewed distribution needs to be described with a third parameter, not accounted for in ANOVA). Is it a problem? Generally less than heterogeneity, and depends on the type of observations.
May be violated by observations in the form of proportions or percentages, because they are constrained to lie between zero and 1 or 100, whereas the normal distribution has tails out to plus and minus infinity. Also violated by observations in the form of counts, which follow a Poisson rather than a normal distribution. Solution: arcsine-root transformation of proportions, or logistic regression on proportions (which assumes binomial rather than normal errors).

Square-root transformation of counts, or use a Generalised Linear Model (the glm command in R) which can assume Poisson errors.

5. For regression and correlation: linear relations between continuous variables, because the explained and residual components of variation are measured against a predicted line defined by just two parameters, the intercept a and slope b. A non-linear relation would need describing with additional parameters, not accounted for in the regression analysis. Is it a problem? Depends on the type of observations. Most likely to be violated by relationships with an inherently non-linear biology. Solution: reinstate linearity with an appropriate transformation to one or both axes - see the four examples below. Consider fitting a polynomial only if it makes sense biologically to model the response with additive powers of the predictor.

If any of assumptions 3-5 are not met, we should not immediately abandon the use of parametric statistics. The command glm will run a Generalised Linear Model that can accommodate Analysis of Variance on data with inherently non-normal distributions, such as proportions (which have a binomial distribution), or frequencies of rare events (with a Poisson distribution and variance increasing with the mean response). Commands of the sort aov(y ~ A) or lm(y ~ A), which we have been using up to now, have an equivalent in glm: anova(glm(y ~ A, family = gaussian(link = "identity")), test = "F"). You can replace gaussian (for a normal distribution) with poisson or binomial, as dictated by the type of data. A worked example of this command (model 5.9) is given on the website accompanying Doncaster & Davey (2007).

An alternative route to meeting the assumptions is by transformation of the response (commonly with an arcsine-root transformation for proportions, or a square-root transformation for counts, or a generic Box-Cox transformation). This is less desirable than modelling the error structure with glm, because the transformation changes the nature of the test question. For regression analyses in particular, you may have a priori reasons for suspecting a non-linear relationship of response to predictor. An understanding of the underlying biology will often suggest an appropriate linearizing transformation. Transformations are not cheating, because they are planned in advance, and the same conversion is applied to all observations. The idea is to reduce complexity by converting a non-linear relation to a linear one. Here are some examples:

1. The response may be inherently exponential, for example in population growth over time of freely self-replicating organisms. A linear regression of ln(population) against time will give a slope that equals the intrinsic rate of natural increase per capita.

2. Response and predictor may have different dimensions, for example in a weight response to length (see p. 39), suggesting a power function. Logging both axes will linearize power-function relationships, and simultaneously deal with associated issues of the variance increasing with the mean response and skewed residuals.

3. The response may saturate, for example in the response of weight increase to body weight, or the response of food consumption to food abundance. Linearization is achieved by understanding the underlying biology: try inverse body weight, and try inverse consumption and abundance.

4. The response may be cyclic, for example in a circadian rhythm.
Transformation of the predictor with a circular function (e.g. sin(x) or cos(x)) may linearize the relationship.

If you resort to non-parametric methods, be aware that they all make assumptions 1 and 2 above. Also, statistics on ranks (e.g. Spearman's correlation) require that the ranks meet assumptions 3-5. Finally, some data may not suit any statistics because they have too little variation (e.g. when skewed by numerous zero values) or insufficient replication (e.g. data with too many missing values). In such cases, change your test question to allow sub-sampling from the dataset.
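Returning to the glm alternative mentioned above, a minimal sketch for count data, assuming a response of counts y and a factor A (names as in the generic commands above):

model.counts <- glm(y ~ A, family = poisson)    # Poisson errors instead of square-root transforming
anova(model.counts, test = "Chisq")             # analysis of deviance in place of the F-table

For proportions, family = binomial plays the same role, with the response supplied as numbers of successes and failures.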


LECTURE: FITTING STATISTICAL MODELS TO DATA

Statistical packages like R all work by fitting models to data. They require you to use an appropriate model for the samples and variables under investigation, before they will estimate parameter values that best fit the data. These pages will help you fit appropriate models to data. In the first example (A1) below, the model formula is a mathematical relationship - the Poisson distribution, P(x) = e^(-x̄)·x̄^x / x! - describing the probability of obtaining exactly 0, 1, 2, ... species of insects per leaf. But the other examples all use a standard convention for presenting statistical models, which takes the form: response variable(s) = explanatory variable(s). Here the = sign is simply a statement of the hypothesised relationship between the variables rather than a logical equality. The chosen statistic will quantify the relationship of the response variable (continuous except in A2a) to the explanatory variables (which can be continuous: A2b & B1, or divided into samples: A3 & B2).

A. The three principal types of data and statistical models

1. One sample, one variable

For data of this kind, look for a goodness-of-fit of frequencies.

E.g. The sample is 50 leaves of Sycamore picked at random; the variable is the number of species of insect parasites per leaf. This is predicted to follow a random distribution, so the appropriate model for calculating expected frequencies is the Poisson distribution.

[Table of observed and expected frequencies of leaves carrying 0, 1, 2, ... species per leaf, with the (O - E)²/E contribution from each class, and a bar chart comparing observed with expected frequencies of leaves against number of species per leaf.]

H0: Observed distribution is no different from the expected Poisson (i.e. no interaction between species).
Test statistic: chi-squared or G-test of goodness of fit.
Model formula: Poisson distribution with mean x̄ = 1.78 species per leaf.
Outcome: χ²₃ = 7.73, p < 0.05.
Conclusion: observed numbers of species differ from random expectation. Since the observed distribution is narrower than expected, the species are more regularly spaced than random, with one per leaf predominating (indicating mutual repulsion in competition between the species).
Assumptions: data are nominal (not continuous), frequencies are independent (i.e. 50 independent leaves), no cell with expected value < 5. For continuous data, use the Kolmogorov-Smirnov test.
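A minimal sketch of this goodness-of-fit test in R. The observed counts below are invented purely for illustration (the booklet's own frequencies are not reproduced); only the sample size of 50 leaves and the mean of 1.78 species per leaf come from the example:

obs  <- c(8, 20, 14, 6, 2)                 # hypothetical counts of leaves with 0, 1, 2, 3, 4+ species
xbar <- 1.78                                # mean species per leaf, as given in the example
p    <- dpois(0:3, xbar)                    # Poisson probabilities for 0 to 3 species
p    <- c(p, 1 - sum(p))                    # lump 4 or more species into the final class
expd <- 50 * p                              # expected frequencies for 50 leaves
X2   <- sum((obs - expd)^2 / expd)          # chi-squared goodness-of-fit statistic
pchisq(X2, df = length(obs) - 2, lower.tail = FALSE)   # d.f. = classes - 1 - 1 estimated parameter

With five classes and one estimated parameter (the mean), the test has 3 d.f., matching the χ²₃ reported above.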

2. One sample, two variables

For data of this kind, look for a dependent relationship (an association) between the variables.

(a) Categorical variables

Use a contingency table of frequencies to look for an interaction between the variables.

E.g. Sample is 2-year-old infants; the variables are eye colour and behavioural dominance.

[Contingency table of frequencies of Dominant and Submissive behaviour by Blue and Other eye colour, with row and column totals, and a bar chart of the frequencies by eye colour and behaviour.]

H0: Column categories are independent of row categories.
Test statistic: chi-squared or G-test of independence.
Model formula: the colour:behaviour interaction, tested on the table of frequencies.
Outcome: χ²₁ = 1.942, p = 0.16.
Conclusion: there is no detectable interaction of colour with behaviour; behavioural dominance is not associated with blue eyes.
Assumptions: data are truly categorical (the frequency in each cell conforms to a Poisson distribution), frequencies are independent (71 independent subjects, e.g. no siblings), no cell with expected value < 5, correction for continuity. For cells with expected values < 5, use Fisher's exact test.
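A minimal sketch of the corresponding test in R. The cell counts below are invented for illustration only (they sum to the 71 subjects mentioned in the assumptions, but they are not the booklet's data):

eyes <- matrix(c(20, 15, 18, 18), nrow = 2,
               dimnames = list(Behaviour = c("Dominant", "Submissive"),
                               Eye.colour = c("Blue", "Other")))
chisq.test(eyes, correct = TRUE)   # chi-squared test of independence with continuity correction
fisher.test(eyes)                  # exact alternative when expected counts fall below 5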

(b) Continuous variables

Plot the response variable on the y-axis against the explanatory variable on the x-axis.

E.g. Sample is polar bears; the response variable is body weight and the explanatory variable is radius length.

[Data listing of Subject, Body weight (kg) and Radius length (cm).]

H0: Variation in body weight is independent of radius length.
Test statistic: linear regression on transformed weight and radius length (ln[Weight] labelled as a new variable 'ln.weight'; ln[Length] labelled 'ln.length').
Model formula: ln.weight ~ ln.length.
Outcome: F1,141 = 944.6, p < 0.0001.
Conclusion: the regression slope differs from zero; radius length is a precise predictor of body weight, explaining 87% of the variance in body weight with the chosen model.
Assumptions: (i) random sampling, (ii) independent errors, (iii) homogeneity of variances, (iv) normal distribution of errors, (v) linearity. For continuous variables with no clear functional relationship, use correlation to calculate r.
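A minimal sketch of this analysis in R, assuming a data frame bears with columns Weight and Radius (the names are illustrative):

bears$ln.weight <- log(bears$Weight)                 # natural-log transform of the response
bears$ln.length <- log(bears$Radius)                 # and of the predictor
model.bears <- lm(ln.weight ~ ln.length, data = bears)
anova(model.bears)                                   # F-test of the slope
summary(model.bears)$r.squared                       # proportion of variance explained (87% in the example above)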

3. One-way classification of two (or more) samples

For data of this kind, look for a difference between sample means.

E.g. Samples are two levels of a feeding regime for shrews: a diet of blow-fly pupae, and a diet of dung-fly pupae. The response variable is weight (g).

[Table of body weights (g) under the blow-fly and dung-fly feeding regimes, with the number of subjects, mean and standard error for each sample, and a bar chart of mean body weight by diet.]

H0: Feeding regime has no effect on weight (the two samples come from the same population).
Test statistic: Analysis of Variance (or t-test when there are just two groups).
Model formula: weight ~ regime.
Outcome: F1,23 = 8.60, p < 0.01.
Conclusion: shrew body weights depend on the type of feeding regime.
Assumptions: (i) random sampling, (ii) independent errors, (iii) homogeneity of variances, (iv) normal distribution of errors. For data with repeated measures on subjects (assumption (ii)), use repeated-measures ANOVA; for data that violate assumptions (iii)-(iv), use prior transformations, or use the non-parametric Kruskal-Wallis test (or Mann-Whitney if there are just two samples).
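A minimal sketch of the test in R, assuming a data frame shrews with columns weight and regime (illustrative names):

shrews$regime <- as.factor(shrews$regime)                    # two levels: blow-fly, dung-fly
summary(aov(weight ~ regime, data = shrews))                 # one-way ANOVA of the regime effect
t.test(weight ~ regime, data = shrews, var.equal = TRUE)     # equivalent pooled-variance t-test

With only two samples, the pooled-variance t equals the square root of the F from the ANOVA.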

B. Selecting and fitting models to data

R offers many alternative commands for Analysis of Variance. The command aov will suit most straightforward analyses with normally distributed residuals. The command glm will run a Generalised Linear Model that can accommodate Analysis of Variance on data with inherently non-normal distributions, such as proportions (which have a binomial distribution), or frequencies of rare events (with a Poisson distribution).

1. One-way classification of two (or more) samples, two continuous variables

For data of this kind, look for differences between regression slopes.

E.g. Samples are male (circles and continuous line) and female (triangles and broken line) polar bears; the response variable is body weight and the explanatory variable is radius length.

[Data listing of Subject, Body weight (kg), Radius length (cm) and Sex, with a scatter plot showing a separate regression line for each sex.]

H0: Variation in body weight with radius length is independent of sex.
Test statistic: Analysis of Variance on ln.weight with covariate ln.length (or Generalised Linear Model for non-normal error structures).
Model formula: ln.weight ~ ln.length + Sex + ln.length:Sex.
Outcome:
  ln.length effect (adjusted for Sex): F1,139, highly significant;
  Sex effect (adjusted for ln.length): F1,139 = 3.57, p = 0.06;
  Sex-by-ln.length interaction: F1,139 = 7.24, p < 0.01.
Conclusion: the two regression lines have different slopes, so the effect of radius length on weight differs by sex.
Assumptions: (i) random sampling, (ii) independent errors, (iii) homogeneity of variances, (iv) normal distribution of errors, (v) linearity.
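A minimal sketch in R, extending the bears data frame assumed in A2b with a Sex column (names illustrative):

model.bsex <- lm(ln.weight ~ ln.length * Sex, data = bears)   # separate slopes allowed for each sex
anova(model.bsex)        # sequential tests of ln.length, Sex, and the ln.length:Sex interaction

A significant ln.length:Sex term corresponds to the different slopes reported above; with unbalanced data the order in which terms enter the model matters, as discussed in the correlation lecture.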

2. Two-way classification of samples

For data of this kind, look for two-way differences between means.

E.g. Shrew samples are classified by feeding regime and by sex; the response variable is body weight, as in the one-way Analysis of Variance above (A3).

[Table of body weights (g) for females and males under the blow-fly and dung-fly feeding regimes, with a bar chart of mean body weight by diet and sex.]

H0: The effect of regime on weight is not affected by sex.
Test statistic: Analysis of Variance (or Generalised Linear Model for non-normal error structures).
Model formula: weight ~ regime + sex + regime:sex.
Outcome:
  sex effect (adjusted for regime): F1,21 = 0.01, not significant;
  regime effect (adjusted for sex): F1,21 = 9.68, p < 0.01;
  regime:sex interaction effect: F1,21 = 6.68, p < 0.05.
Conclusion: the effect of regime on weight depends on sex, with females doing better on dung-flies and males on blow-flies.
Assumptions: (i) random sampling, (ii) independent errors, (iii) homogeneity of variances, (iv) normal distribution of errors.
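A minimal sketch in R, assuming a data frame shrews2 with columns weight, regime and sex (illustrative names):

model.shrew2 <- aov(weight ~ regime * sex, data = shrews2)   # shorthand for regime + sex + regime:sex
summary(model.shrew2)                                        # main effects and the interaction
with(shrews2, interaction.plot(regime, sex, weight))         # one line per sex across the two diets

As with the Ritalin practical, interpret the table from the bottom up: the significant interaction is what gives the main effects their meaning.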

PRACTICAL: CALCULATING REGRESSION AND CORRELATION

In this practical you will do by hand the linear regression shown on pages 24-26 of this booklet. To save tedious calculations, however, you will put Excel to work by asking it to do all of the arithmetic for you. This still means that you will need to understand how the regression analysis works, so refer to pages 24-26 as you follow the steps through on the computer. Look back through the notes for lectures 3 and 4 to appreciate the underlying logic of the analysis.

First run the practical in R, using the commands on page 24 of the booklet. Then open up Excel. On a fresh spreadsheet, type in the data shown in rows 4 to 15 of columns B and F in the Excel worksheet illustrated on page 25 of this booklet. Don't type in any more data than just these two columns. Excel will do the rest! But you have to tell it what to do...

Your task is now to use Excel formulae to obtain all the figures as they appear in the other cells and columns. Your objective is to replicate the entire sheet shown on page 25 without typing in any more numbers. When you have done this, save the result, as you may wish to use it again. In order to use Excel formulae, you must type an = sign in a cell where you wish to calculate a number from data in other cells. For example, to obtain a value in cell B19 for the mean age, type in cell B19:

=AVERAGE(B4:B15)

Likewise, to obtain a value in cell F19 for the mean weight, type in cell F19:

=AVERAGE(F4:F15)

Now to obtain a value in cell H4 for the squared deviation of the first Weight value (in cell F4) from its sample mean (which you have just calculated in F19), type in cell H4:

=(F4-$F$19)^2

Having entered this command, you can repeat it down through the whole of column H from H4 to H15 by clicking on the bottom right corner of the cell and dragging down to H15. Look at the formulae you have created to check that they are giving you squared deviations of each weight value from the sample mean. You should now see in column H the full set of squared deviations of Weights from their sample mean. Now get the sum of squared deviations, SS(Total), in cell H17 by typing:

=SUM(H4:H15)

Likewise, to obtain a value in cell J4 for the product of the first Weight deviation with its corresponding Age deviation, type in J4:

=(B4-$B$19)*(F4-$F$19)

Then drag that formula down to J15 in order to get all the products. Finally, get the sum of products in cell J17 by typing =SUM(J4:J15). Do a similar operation for column D, then calculate the parameters for the slope and intercept of the line. Use these parameter constants to obtain for each x a predicted ŷ = a + b·x, in order then to calculate the values in columns L and N. Finally calculate the explained and error SS and MS, and the F-value. Check that your sheet matches the one on page 25.

You can then play with the data to see what difference it makes to the significance of the relationship if you change just one of the values. For example, change the Weight value in cell F12 from 431 g to 231 g. Is the relationship now significant? Has the magnitude of the correlation coefficient r got closer to unity? Playing with test data in this way will help you to understand how the statistics work. But don't try this with real data! If you had actually observed a Weight of 431 g, then you would have to work with that.
If the outcome is a non-significant relationship, then your best explanation is 'no detectable relationship' (failure to reject H0), given the assumptions of the analysis.


APPENDIX 1: TERMINOLOGY OF ANALYSIS OF VARIANCE

Once you have familiarised yourself with the terminology of Analysis of Variance you will find it easier to grasp many of the parametric techniques that you read about in statistics books. Some of the terms described below may be referred to by one of many names, as indicated at the start of each entry. They are illustrated here with a simple example of statistical analysis, in which a biologist wishes to explain variation in the body weights of a sample of people according to different variables such as their height, sex and nationality. More detailed descriptions of the terms shown below, as well as many others that go beyond your immediate needs, can be found in the Lexicon of Statistical Modelling.

1. Variable. A property that varies in a measurable way between subjects in a sample.

2. Response variable, Dependent variable, Y. The variable of interest, usually measured on a continuous scale (e.g. weight: what causes variation in weight?). If these measurements are free to vary in response to the explanatory variable(s), statistical analysis will reveal the explanatory power of the hypothesised source(s) of variation.

3. Explanatory variable, Independent variable, Predictor variable, Factor, Effect, X. The non-random measurements or observations (e.g. treatments of a drug factor, fixed by experimental design), which are hypothesised in a statistical model to have predictive power over the response variable. This hypothesis is tested by calculating sums of squares and looking for a variation in Y between levels of X that exceeds the variation within levels. An explanatory variable can be categorical (e.g. sex, with 2 levels of male and female), or continuous (e.g. height, with a continuum of possibilities). The explanatory variable is assumed to be independent in the sense of being independent of the response variable: i.e. weight can vary with height, but height is independent of weight. The values of X are assumed to be measured precisely, without error, permitting an accurate estimate of their influence on Y.

4. Variates, Replicates, Observations, Scores, Data points. The replicate observations of the response variable (Yi) measured at each level of the explanatory variable. These are the data points, each usually obtained from a different subject to ensure that the sample size reflects n independent replicates (i.e. it is not inflated by non-independent data: 'pseudoreplication').

5. Sample, Treatment. The collection of observations measured at a level of X (e.g. body weights from one sample of males and another of females to test the effect of Sex on Weight; or crop Yield tested with two Pesticide treatments). If X is continuous, the sample comprises all measures of Y on X (e.g. Weight on Height).

6. Sum of squares. The squared distance between each data point, Yi, and the sample mean, Ȳ, summed for all n data points. The squared deviations measure variation in a form which can be partitioned into different components that sum to give the total variation (e.g. the component of variation between samples and the component of variation within samples).

7. Variance. The variance in a normally distributed population is described by the average of n squared deviations from the mean. Variance usually refers to a sample, however, in which case it is calculated as the sum of squares divided by n-1 rather than n.
Its positive root is then the standard deviation, SD, which describes the dispersion of normally distributed variates (e.g. 95% lying within 1.96 standard deviations of the mean when n is large).

8. Statistical model, Y = X + ε. A statement of the hypothesised relationship in the sampled population between the response variable and the predictor variable. A simple model would be: Weight = Sex + ε. The '=' does not signify a literal equality, but a statistical dependency. So the statistical analysis is going to test the hypothesis that variation in the response variable on the left of the equals sign (Weight) is explained or predicted by the factor on the right (Sex), in addition to a component of random variation (the error term, ε, 'epsilon'). An Analysis of Variance will test whether significantly more of the variation in Weight falls between the categories of male and female, and so is explained by the independent variable Sex, than lies within each category (the 'random variation'). The error term is often dropped from the model description, though it is always present in the model structure, as the random variation against which to calibrate the variation between levels of X in the F-ratio.

9. Null hypothesis, H0. While a statistical model proposes a hypothesis, e.g. that Y depends on X, the statistical analysis can only seek to reject a null hypothesis: that Y does not vary with X in the population of interest. This is because it is always easier to find out how different things are than to know how much they are the same, so the statistician's easiest objective is to establish the probability of a deviation away from random expectation rather than towards any particular alternative. Thus does science in general proceed cautiously by a process of refutation. If the analysis reveals a sufficiently small probability that the null hypothesis is true, then we can reject it and state that Y evidently depends on X in some way.

10. One-way ANOVA, Y = X. An Analysis of Variance (ANOVA) to test the model hypothesis that variation in the response variable Y can be partitioned into the different levels of a single explanatory variable X (e.g. Weight = Sex). If X is a continuous variable, then the analysis is equivalent to a linear regression, which tests for evidence of a slope in the best-fit line describing change of Y with X (e.g. Weight with Height).

11. Two-way ANOVA, Y = X1 + X2 + X1X2. Test of the hypothesis that variation in Y can be explained by one or both variables X1 and X2. If X1 and X2 are categorical and Y has been measured only once in each combination of levels of X1 and X2, then the interaction effect X1X2 cannot be estimated. Otherwise a significant interaction term means that the effect of X1 is modulated by X2 (e.g. the effect of Sex, X1, on Weight, Y, depends on Nationality, X2). If one of the explanatory variables is continuous, then the analysis is equivalent to a linear regression with one line for each level of the categorical variable (e.g. graph of Weight by Height, with one line for males and one for females): different intercepts may signify a significant effect of the categorical variable, different slopes may signify a significant interaction effect with the continuous variable.

12. Error, Residual. The amount by which an observed variate differs from the value predicted by the model. Errors or residuals are the segments of scores not accounted for by the analysis. In Analysis of Variance, the errors are assumed to be independent of each other, and normally distributed about the sample means.
They are also assumed to be identically distributed for each sample (since the analysis is testing only for a difference between means in the sampled population), which is known as the assumption of homogeneity of variances.

13. Normal distribution. A bell-shaped frequency distribution of a continuous variable. The formula for the normal distribution contains two parameters: the mean, giving its location, and the standard deviation, giving the shape of the symmetrical bell. This distribution arises commonly in nature when myriad independent forces, themselves subject to variation, combine additively to produce a central tendency. The technique of Analysis of Variance is constructed on the assumption that the component of random variation takes a normal distribution. This is because the sums of squares that are used to describe

variance in an ANOVA accurately reflect the true variation between and within samples only if the residuals are normally distributed about sample means.

14. Degrees of freedom, d.f. The number of pieces of information that we have on a response, minus the number needed to calculate its variation. The F-ratio in an Analysis of Variance is always presented with two sets of degrees of freedom, the first corresponding to one less than the a samples or levels of the explanatory variable (a - 1), and the second to the remaining error degrees of freedom (n - a). For example, a one-way ANOVA may find an effect of nationality on body weight (F3,23 = 3.10, p < 0.05) in a test of four nations (giving the 3 test degrees of freedom) sampled with 27 subjects (giving the 23 error degrees of freedom). A continuous factor has one degree of freedom, so the linear regression ANOVA has 1 and n-2 degrees of freedom (e.g. a height effect on body weight: F1,25 = 4.27, p < 0.05, from 27 subjects).

15. F-statistic, F-ratio. The statistic calculated by Analysis of Variance, which reveals the significance of the hypothesis that Y depends on X. It comprises the ratio of two mean-squares: MS[X] / MS[ε]. The mean-square, MS, is the average sum of squares, in other words the sum of squared deviations from the mean (for X or ε, as defined above) divided by the appropriate degrees of freedom. This is why the F-ratio is always presented with two degrees of freedom, one used to create the numerator, MS[X], and one the denominator, MS[ε]. The F-ratio tells us precisely how much more of the total variation in Y is explained by X (MS[X]) than is due to random, unexplained, variation (MS[ε]). A large ratio indicates a significant effect of X. In fact, the observed F-ratio is connected by a very complicated equation to the exact probability of a true null hypothesis, i.e. that the ratio equals unity, but you can use standard tables to find out whether the observed F-ratio indicates <5% probability of making a mistake in rejecting a true null hypothesis.

16. Significance, p. This is the probability of mistakenly rejecting a null hypothesis that is actually true. In the biological sciences a critical value of α = 0.05 is generally taken as marking an acceptable boundary of significance. A large F-ratio signifies a small probability that the null hypothesis is true. Thus detection of a nationality effect (F3,23 = 3.10, p < 0.05) means that the variation in weight between the samples from four nations is 3.10 times greater than the variation within samples, making these data incompatible with a null hypothesis of nationality having no effect on weight. The height effect detected in the linear regression (F1,25 = 4.27, p < 0.05) means that the distribution of data is incompatible with height having no influence on weight in the sampled population. This regression line takes the form y = a + b·x, and 95% confidence intervals for the estimated slope are obtained from b and its standard error; if the slope is significant, then these intervals will not encompass zero.


APPENDIX 2: SELF-TEST QUESTIONS ON ANALYSIS OF VARIANCE

1. Write down the formula for calculating the variance of a sample of scores (use Yi to denote a score for each of n subjects). Explain in words what is meant by this expression.

2. Write down the formula for the standard error of the mean. Explain in words what is meant by this expression. Why does it get smaller as n increases?

3. A sample of 8 male blackbirds is tested for response times to an alarm signal, and this is compared to responses of a sample of 9 females. The Analysis of Variance gives a value of F = 4.56. Use tables of critical values of F to decide whether mean responses differ between males and females. The problem could also have been answered with a t-test, in which case the test would have produced a value of t = 2.135, which is the square root of 4.56. For both tests, critical values are looked up in tables using the same error degrees of freedom. Look up the critical value of t at α = 0.05 and then square it. Check that this corresponds with the equivalent critical value of F. This shows you that an ANOVA on two samples is equivalent to a t-test (see the R sketch at the end of this appendix).

4. State the model for the above Analysis of Variance. If we increased the sample sizes to 12 of each sex and added a third sample of 12 neutered males, what would be the degrees of freedom for the Analysis of Variance?

5. If we divided each of the samples into three groups, of 4 chicks, 4 juveniles, and 4 adults, we could then test the alarm response against two independent effects: SEX and AGE. Write out the full model and give the degrees of freedom for each term.

6. If SEX and AGE main effects were significant, but the SEX:AGE interaction was not, sketch out how the interaction plot might look. Sketch another plot showing how it might look if the interaction effect was also significant.

7. How would you interpret the outcome of the experiment if the interaction effect was significant?

8. As part of your research project, you want to find out how root growth of lawn grasses is influenced by frequency of mowing under different conditions of watering. You decide to use urban gardens as sources of independent grass plots, purloining the services of willing householders to provide different mowing and watering regimes. Describe how you would design your methods so that the data could be analysed with a two-way Analysis of Variance. (Hint: think how you want the data to look in a design matrix of the sort we have been using in previous examples - this requires thinking through carefully!)

9. Interpret the following output from a statistics package:

The regression equation is
Log(survival) = ... Temperature

Predictor    Coef    StDev    T    P
Constant     ...     ...      ...  ...
Temperat     ...     ...      ...  ...

S = ...    R-Sq = 89.5%    R-Sq(adj) = 88.9%

Analysis of Variance
Source            DF    SS    MS    F    P
Regression        ...
Residual Error    ...
Total             ...
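For question 3, the equivalence of a two-sample ANOVA and a t-test can be checked directly in R. The sketch below uses invented response times (so its F and t values will differ from those in the question), but the relation F = t² and the match between critical values hold regardless:

# Invented data: 8 males and 9 females, so 15 error d.f.
set.seed(2)
response <- c(rnorm(8, mean = 10), rnorm(9, mean = 12))
sex <- factor(rep(c("male", "female"), times = c(8, 9)))
summary(aov(response ~ sex))               # F on 1 and 15 d.f.
t.test(response ~ sex, var.equal = TRUE)   # pooled t on 15 d.f.; t squared equals the F above
qt(0.975, df = 15)^2                       # squared critical t at α = 0.05 ...
qf(0.95, df1 = 1, df2 = 15)                # ... equals the critical F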


APPENDIX 3: SOURCES OF WORKED EXAMPLES IN ANALYSIS OF VARIANCE

1. One-way Analysis of Variance

Fowler, J. & Cohen, L. Practical Statistics for Field Biology. John Wiley. Chapter 17.
- Section 17.3 (p. 181)

Samuels, M.L. Statistics for the Life Sciences. Maxwell Macmillan. Chapter 12.
- Example (p. ...)
- Exercises (with answers at back of book)

Sokal, R.R. & Rohlf, F.J. Biometry, 3rd Edition. Freeman. Chapters 8 and 9.
- Table 8.1 (p. 181) and Table 8.5
- Table 8.3 (p. 192) and Table 8.6
- Box 9.1 (p. 210) - unequal sample sizes
- Box 9.4 (p. 218) - equal sample sizes

Zar, J.H. Biostatistical Analysis, 2nd Edition. Prentice-Hall. Chapter 11.
- Example 11.1 (p. 164)

2. Two-way Analysis of Variance

Fowler, J. & Cohen, L. Practical Statistics for Field Biology. John Wiley. Chapter 17.
- Section 17.6 (p. 190)

Sokal, R.R. & Rohlf, F.J. Biometry, 3rd Edition. Freeman. Chapter 11.
- Box 11.1 (p. 324) - cross-factored analysis
- Table 11.1 (p. 327) - meaning of interaction: equivalent to Fig. 1.7 in your ANOVA notes
- Box 11.2 (p. 332)

Zar, J.H. Biostatistical Analysis, 2nd Edition. Prentice-Hall. Chapter 13.
- Example 13.1 (p. 207)


APPENDIX 4: SUMMARY OF PROCEDURAL STEPS FOR ANALYSIS OF VARIANCE

1. OBSERVATIONS - PLOT the data.
2. DIAGNOSTICS - check the assumptions: random, independent, normal, homogeneous, linear.
3. Assumptions met? If NO, TRANSFORM the data and return to the plots and diagnostics; if YES, proceed to the analysis.
4. ANALYSIS OF VARIANCE - report in the form F#,# = ##.##, P < 0.0#.
5. INTERPRETATION - higher-order interactions first; equation and r² for regression; Pearson's r for correlation.
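In R, these steps might look something like the sketch below, here applied to a single factor with simulated data (the values, and the choice of a log transformation, are purely illustrative):

set.seed(3)
dat <- data.frame(A = gl(3, 10, labels = c("low", "mid", "high")),
                  y = rexp(30, rate = 0.2))       # deliberately skewed response
plot(y ~ A, data = dat)                           # PLOT the observations
fit <- aov(y ~ A, data = dat)
plot(fit, which = 1:2)                            # DIAGNOSTICS: residuals vs fitted, normal Q-Q
fit2 <- aov(log(y) ~ A, data = dat)               # TRANSFORM if assumptions are not met, then re-check
summary(fit2)                                     # ANALYSIS OF VARIANCE: F, d.f. and P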


APPENDIX 5: SELF-TEST QUESTIONS ON ANALYSIS OF VARIANCE (2)

1. A colleague tells you he has data on the activity of three daphnia at each of six levels of pH, and he needs advice on analysis.
a) What extra information do you need to know before you can advise on doing any statistical tests at all?
b) If you are satisfied that statistical analysis is appropriate, are these data suitable for Analysis of Variance, and/or regression, and/or correlation? Should it be parametric or non-parametric?
c) Significance would be tested with how many degrees of freedom?

2. You have three samples of wheat grains, one of which comes from genetically modified parent plants, one from organic farming, and the third from conventional farming. You want to find out if these different practices make a difference to the weight of seeds. What are your options for analysis?
a) Regression.
b) Chi-squared test on the frequencies in different weight categories.
c) Kruskal-Wallis test on the three samples.
d) Analysis of Variance on the three samples.
e) Student's t-tests on each combination of pairs to find out how their averages differ from each other.

3. You have a packet of wild-type tomato seeds and a packet of genetically modified tomato seeds, and you want to know whether they give different crop yields under a conventional growing regime and under an organic regime. How do you find out?

4. What, if anything, is wrong with each of these reports?
a) The data is plotted in graph 2, and it shows a significant change with temperature (F1 = ..., P = 0.000).
b) Figure 2 shows that temperature has a strong positive influence on activity across this range (r² = 0.78, F1,10 = 23.72, P < 0.001).
c) There is a strong negative correlation but the results are not significant (r = -0.64, P = 0.06).
d) No correlation could be established from the nine observations (Pearson's coefficient r = -0.64, d.f. = 7, P > 0.05).

5. Interpret the following command and output from an analysis in R:

> summary(aov(y ~ A*B))
            Df   Sum Sq   Mean Sq   F value   Pr(>F)
A
B
A:B                                           ...e-07 ***
Residuals
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
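To see output of the form shown in question 5, you can run the same command on a balanced, cross-factored data set of your own. The sketch below fabricates such a data set (the factor names A and B match the question; the response values are invented, with an interaction deliberately built in):

set.seed(4)
A <- gl(2, 12, labels = c("a1", "a2"))                         # 2 levels of factor A
B <- gl(3, 4, length = 24, labels = c("b1", "b2", "b3"))       # 3 levels of B, crossed with A
y <- rnorm(24, mean = 5) + ifelse(A == "a2" & B == "b3", 3, 0) # build in an interaction
summary(aov(y ~ A*B))   # rows for A, B, A:B and Residuals, each with Df, Sum Sq, Mean Sq, F and Pr(>F)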


APPENDIX 6: SOURCES OF WORKED EXAMPLES ON REGRESSION AND CORRELATION

Doncaster, C.P. & Davey, A.J.H. Analysis of Variance and Covariance: How to Choose and Construct Models for the Life Sciences. Cambridge University Press.
- See the book's web pages for worked examples of all Analysis of Variance models, and commands for analysing them in R.

Fowler, J. et al. Practical Statistics for Field Biology. John Wiley.
- Section 14.5 (p. 135)
- Section 15.6 (p. 147)
- Sections ... (p. 156)

Samuels, M.L. Statistics for the Life Sciences. Maxwell Macmillan.
- Numerous examples throughout the chapter, and exercises (pp. 449, 463, 474, 484 and 493, with answers at back of book)

Sokal, R.R. & Rohlf, F.J. Biometry, 3rd Edition. Freeman.
- Table 14.1 (p. 459)
- Box 14.1 (p. 465)

Zar, J.H. Biostatistical Analysis, 2nd Edition. Prentice-Hall. Chapters 17 and 19.
- Examples 17.1 (p. 262) and 17.9 (p. 286)
- Example 19.1 (p. 308)

Further reference information on statistical modelling with ANOVA and regression can be found in the Lexicon of Statistical Modelling.


APPENDIX 7: CRITICAL VALUES OF THE F-DISTRIBUTION

v1 is the degrees of freedom of the numerator mean squares; v2 is the degrees of freedom of the denominator mean squares.

Note that the power of Analysis of Variance to detect differences can be increased if the total number of variates is divided into more samples. For example:
(i) 2 samples with 9 variates in each, so n = 18, has critical F1,16 = 4.49
(ii) 3 samples with 6 variates in each, so n = 18, has critical F2,15 = 3.68
(iii) 6 samples with 3 variates in each, so n = 18, has critical F5,12 = 3.11
All three tests require collecting the same amount of data. The first one can only detect a difference in the sampled population if the variance between samples is more than four times greater than the variance within samples. The third one, in contrast, can detect a difference from a between-sample variance little more than three times greater than the within-sample variance.
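If you do not have tables to hand, the same critical values can be obtained in R with the qf function (here for α = 0.05, so the 0.95 quantile of the F-distribution):

qf(0.95, df1 = 1, df2 = 16)   # 2 samples of 9: critical F = 4.49
qf(0.95, df1 = 2, df2 = 15)   # 3 samples of 6: critical F = 3.68
qf(0.95, df1 = 5, df2 = 12)   # 6 samples of 3: critical F = 3.11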
