Chapter 12 - Multiple Regression and the General Linear Model


The simple linear model can be extended from one response variable (Y) and one explanatory variable (x) to one response variable and p-1 explanatory variables x1, ..., x_{p-1}. The regression of Y on x1, ..., x_{p-1} is an equation that predicts the expected value of Y for particular values of the explanatory variables.

A Case Study: Seddigh, M. and Jolliff, G.D. 1994, Light intensity effects on meadowfoam growth and flowering, Crop Sci., 34. Meadowfoam is a small plant that grows in moist meadows of the Pacific NW; the seed oil is valuable. Seddigh and Jolliff were interested in determining optimal growing conditions for commercial production. They carried out a controlled growth chamber experiment with two factors: light intensity (6 levels) and timing of the onset of the light treatment (either at floral induction, or 24 days prior to floral induction). Ten seedlings were randomly assigned to each of the 12 = 6 × 2 treatments, and the average number of flowers per plant was calculated at the end of the experiment.

Factors, levels, and average number of flowers per plant:

Table 1: Average number of flowers per plant, by light intensity (µmol/m²/sec) and timing of onset (at FI or prior to FI).

A Multiple Regression Model

Let Y denote the average number of flowers per plant, let x1 denote the level of light intensity, and define x2 as follows:

x2 = 0, if onset of light treatment was at FI
x2 = 1, if onset of light treatment was prior to FI

The model is

Y = β0 + β1x1 + β2x2 + ε.

This model says that the number of flowers per plant (Y) is a linear function of light intensity (x1) and timing (x2), plus a random error term (ε). For inferential purposes, it is assumed that the errors are independent and identically distributed normal random variables with mean 0 and constant variance, i.e., ε ~ N(0, σ²). This assumption must be supported by the data. The assumption of errors with mean 0 implies that the average number of flowers is a

linear function of the two explanatory variables:

E(Y) = β0 + β1x1 + β2x2.

Note that if Y1 was observed when the onset of light treatment was at FI, then the model for Y1 is

Y1 = β0 + β1x1 + β2·0 + ε = β0 + β1x1 + ε.

If Y2 was observed when the onset of light treatment was prior to FI, then the model for Y2 is

Y2 = β0 + β1x1 + β2·1 + ε = (β0 + β2) + β1x1 + ε = β0′ + β1x1 + ε,

where β0′ = β0 + β2. The two models have the same slope but different intercepts. If β2 is 0, then β0′ = β0, and the timing of onset of light treatment has no effect on average numbers of flowers.

Table 2: Summary of the fitted model showing parameter estimates and tests of significance. The model is E(Y | x) = β0 + β1·intensity + β2·x2. In addition, s_ε = 6.441 and df = 21.

Parameter   Estimate   Std. Error   t   Pr(T > |t|)
β0
β1
β2
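The model with an intercept-shifting indicator variable can be fit by ordinary least squares. A minimal sketch follows; the responses are simulated for illustration only, since the actual meadowfoam measurements are not reproduced in these notes, and the coefficient values used to simulate are made up.

```python
import numpy as np

# Sketch: fitting E(Y) = b0 + b1*intensity + b2*timing by least squares.
# The y values below are SIMULATED (illustrative coefficients only).
rng = np.random.default_rng(0)
intensity = np.tile([150.0, 300.0, 450.0, 600.0, 750.0, 900.0], 2)
timing = np.repeat([0.0, 1.0], 6)            # 0 = at FI, 1 = prior to FI
y = 70 - 0.04 * intensity + 12 * timing + rng.normal(0, 6, size=12)

X = np.column_stack([np.ones(12), intensity, timing])   # design matrix
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)        # (b0, b1, b2)
```

The indicator column shifts the intercept for the prior-to-FI observations, exactly as in the two-intercept derivation above.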

Figure 1: Data and fitted linear model for the meadowfoam data set (flowers per plant vs. light intensity, with separate lines for the two timing levels).

The figure shows that the model fits the data quite well and that there is no evidence of violations of the distributional assumptions. The regression summary shows that there is strong evidence of a difference in numbers of flowers per plant attributable to timing of onset, and that the number of flowers decreases as light levels increase. Beginning the light treatment 24 days before the onset of FI is estimated to increase the expected number of flowers per plant by 12.2.

We will test H0: β2 = 0 versus Ha: β2 ≠ 0 using T = β̂2 / σ̂_{β2}.

The General Linear Model

The general linear model can be expressed as

Y = β0 + β1x1 + β2x2 + ··· + β_{p-1}x_{p-1} + ε,

where x1, ..., x_{p-1} are explanatory variables and ε is a random error with mean 0. Note: this is a general model that describes a random process that generates observations. A more specific model that describes the data that have been observed often uses subscripts, for example:

Yi = β0 + β1x_{i,1} + β2x_{i,2} + ··· + β_{p-1}x_{i,p-1} + εi.

There are a variety of ways to obtain the explanatory variables.

Polynomials in x

All p-1 variables can be computed from a single explanatory variable x according to

x1 = x
x2 = x²
...
x_{p-1} = x^{p-1}

This general linear model may be expressed as

Yi = β0 + β1xi + β2xi² + ··· + β_{p-1}xi^{p-1} + εi.

Polynomial models are useful for approximating complicated nonlinear relationships between Y and x.
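A polynomial model is still a linear model, because it is linear in the coefficients; only the derived variables x, x², ... are nonlinear in x. The sketch below fits a quadratic to made-up points lying exactly on a parabola (not Galileo's punti data), so least squares recovers the curve.

```python
import numpy as np

# Sketch: a polynomial in x is a *linear* model in the derived
# variables x and x^2.  The x-y pairs are a made-up exact quadratic.
x = np.array([100.0, 200.0, 300.0, 450.0, 600.0, 800.0, 1000.0])
y = 200 + 0.9 * x - 0.0004 * x ** 2

X = np.column_stack([np.ones_like(x), x, x ** 2])   # columns 1, x, x^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The fitted curve passes through the points, and the negative coefficient on x² produces the downward-bending shape seen in trajectory-type data.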

Example (from Ramsey and Schafer): Drake, S. and J. MacLachlan, Galileo's discovery of the parabolic trajectory, Scientific American, 232. Galileo conducted an experiment to determine whether the horizontal velocity of a moving object is constant. He rolled a ball off an incline and measured the horizontal distance travelled. The height of the incline varied between 50 and 1000 punti. The data, and a fitted second-order polynomial model

Ŷ = β̂0 + β̂1x + β̂2x²,

are shown in the figure below:

Figure 2: Data and fitted polynomial model from Galileo's experiment on horizontal distance (horizontal distance in punti vs. initial height in punti).

Table 3: Analysis of variance table for Galileo's experiment on horizontal distance.

Source   df   Sum of Squares   Mean Square   f   Pr(F > f)
x
x²
error

Multiple Explanatory Variables

Another general linear model involves p-1 different variables x1, ..., x_{p-1}, none of which are exponentiated. Such a model can be expressed as

E(Y) = β0 + β1x1 + β2x2 + ··· + β_{p-1}x_{p-1}.

The effects of the variables on Y are said to be additive; β1x1 is said to be the effect of x1 on E(Y) (and also on Y). In contrast, a multiplicative model is

E(Y) = β0 · x1^{β1} · x2^{β2}.

Chemical reaction rates are often multiplicative functions of the concentrations of catalysts. A linearized version is

log[E(Y)] = α0 + β1 log(x1) + β2 log(x2),

where α0 = log(β0).

An additive model implies that the effect on E(Y) of a 1-unit increase in x1, when all other variables are held fixed, is

ΔE(Y) = β0 + β1(x1 + 1) + β2x2 + ··· + β_{p-1}x_{p-1} − (β0 + β1x1 + β2x2 + ··· + β_{p-1}x_{p-1}) = β1.

The effect of a 1-unit increase in x1 does not depend on the values of the other variables. Thus, to determine the difference in E(Y) for two sets of conditions, we only have to add the effect of a difference in one variable to the effect of a difference in the other variable.
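The log-linearization of the multiplicative model can be checked numerically. A tiny sketch, with made-up coefficient values:

```python
import math

# Sketch: taking logs turns the multiplicative model
#   E(Y) = b0 * x1**b1 * x2**b2
# into a model linear in log(x1) and log(x2).  Coefficients are
# made up for illustration.
b0, b1, b2 = 2.0, 0.5, -1.5

def mult_mean(x1, x2):
    return b0 * x1 ** b1 * x2 ** b2

x1, x2 = 4.0, 3.0
lhs = math.log(mult_mean(x1, x2))
rhs = math.log(b0) + b1 * math.log(x1) + b2 * math.log(x2)
```

The two sides agree, which is why multiplicative models are routinely fit by regressing log(Y) on the logs of the explanatory variables.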

For example, the change in E(Y) attributable to a 1-unit change in x1 and a 1-unit change in x2 is

ΔE(Y) = β0 + β1(x1 + 1) + β2(x2 + 1) − (β0 + β1x1 + β2x2) = β1 + β2.

Qualitative (or categorical) variables

It is possible to analyze the relationship between a quantitative response variable and one or more qualitative variables using multiple regression. A common approach is to create a set of indicator variables that identify the levels of a qualitative variable.

Example: Seddigh, M. and Jolliff, G.D. 1994, Light intensity effects on meadowfoam growth and flowering. Crop Sci., 34. The timing of the onset of the light (at FI or 24 days prior) is a qualitative variable because there is no reason to believe that time has a linear effect on the average number of flowers per plant. An indicator (or dummy) variable identifies at which level of timing the observation was recorded. The indicator variable is defined by:

x2 = 0, if onset of light treatment was at FI
x2 = 1, if onset of light treatment was prior to FI,

and the model is

Y = β0 + β1x1 + β2x2 + ε,

where x1 is light intensity. The estimate of β2 (12.2, from the fitted model above) is also the estimated mean difference in average numbers of flowers per plant attributable to timing of the onset of light treatment.

If a factor has more than 2 levels (say k levels), then we create k-1 indicator variables to account for the factor. For example, if a factor has 3 levels A, B, and C, then we may set up x1 to be the indicator of level A and x2 to be the indicator of level B. Suppose that for the ith observation x_{i,1} = x_{i,2} = 0. Then we are certain that the ith observation was observed when the factor level was C. Conversely, if the factor level for the ith observation is C, then we are certain that x_{i,1} = x_{i,2} = 0. It is redundant to create and use a third indicator variable identifying the level as C. As a general rule, only k-1 indicator variables are used. Further, a number of computational difficulties may arise if k indicator variables are used in model fitting and analysis.

The level without an indicator variable to identify it is often called the aliased level. If there is a single factor with k levels, then a model of the ith observation is

E(Yi) = β0 + β1x_{i,1} + β2x_{i,2} + ··· + β_{k-1}x_{i,k-1}.

The expected response when the level is k is therefore E(Yi) = β0. Note that the meaning of the intercept is different than in simple linear regression.
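The redundancy of a kth indicator can be seen directly: the three indicator columns sum to the intercept column, so the design matrix loses full rank. A sketch with hypothetical level labels:

```python
import numpy as np

# Sketch: coding a factor with k = 3 levels (A, B, C) using k - 1 = 2
# indicator variables, with C as the aliased level.  Labels are
# hypothetical.
levels = np.array(["A", "B", "C", "C", "B", "A"])
x1 = (levels == "A").astype(float)        # indicator of level A
x2 = (levels == "B").astype(float)        # indicator of level B
X = np.column_stack([np.ones(len(levels)), x1, x2])

# A third indicator for C satisfies x3 = 1 - x1 - x2, so adding it
# makes the columns linearly dependent (the computational difficulty
# mentioned above).
x3 = (levels == "C").astype(float)
X_redundant = np.column_stack([np.ones(len(levels)), x1, x2, x3])
```

X has full column rank, while X_redundant does not: its rank stays at 3 even though it has 4 columns, so X′X is singular and the normal equations have no unique solution.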

Note: there are a variety of ways to define the indicator variables of treatment level. Care must be taken to understand how the variables are defined and how to determine the expected response at each level.

Example. If we do not want to assume a linear relationship between light intensity and the average number of flowers, then we can use indicator variables to determine the estimated mean number of flowers. The following model may be used:

Y = β0 + β1x1 + β2x2 + ··· + β7x7 + ε,

where

x2 = 1 if light intensity = 150, and 0 otherwise,
x3 = 1 if light intensity = 300, and 0 otherwise,

and so on, with

x7 = 1 if light intensity = 750, and 0 otherwise.

For the different levels of timing and light intensity, the models are

Y = β0 + ε, if the onset of light was at day 0 and intensity = 900,
Y = β0 + β1 + ε, if the onset of light was at day 24 and intensity = 900,
Y = β0 + β2 + ε, if the onset of light was at day 0 and intensity = 150,
Y = β0 + β1 + β2 + ε, if the onset of light was at day 24 and intensity = 150,

and so on. The fitted model and data are shown in Figure 3. The model will be referred to as the unconstrained model, as we have dropped the assumption that the relationship is linear and make no assumptions (and impose no constraints) about its form.

Figure 3: Data and unconstrained fitted model for the meadowfoam data set (flowers per plant vs. light intensity, by timing level).

Homework for Wednesday, March 7

Analyze the brain weight data. Determine a model that describes the data as completely as possible, but using only those variables that are clearly associated with brain weight after accounting for all other important variables. Support your model with appropriate statistical analyses and discussion. Defend all tests that you carry out by residual analyses, and identify those assumptions that are not supported by the data.

Use transformations whenever possible to reduce violations of the model assumptions. Create an indicator variable that identifies primates and treat it as an additional explanatory variable (I believe that animals 6 through 21 (row number) are primates).

You should provide the following in your report:

1. Pair-wise scatterplots and correlation coefficients for all variables (after transformation) used in your final model besides the primate variable
2. For the primate variable, construct side-by-side box plots for the other explanatory variables
3. Note any explanatory variables that are moderately or strongly correlated
4. Residual plots (normal probability plots and plots of residuals versus fitted values). Identify the largest few outliers
5. An analysis of interaction variables. Use the following strategy. Find a best-fitting model using the explanatory variables (or transformations). Create all pair-wise interaction variables (e.g., from x1 and x2, create x3 = x1x2). Add each interaction variable to the best model, examine the significance test (for x3), and retain any interaction variable that is significant. Report on those that were significant and note the change in R² between the models with and without the interaction variables.

Interaction between explanatory variables

When two variables interact, the effect on E(Y) of a 1-unit increase in x1 depends on the value of the interacting variable. The figure below shows an example of two explanatory variables that interact.

One explanatory variable is light intensity, and the second explanatory variable is the timing of onset. If there were no interaction, then the lines would be parallel. The key to understanding interaction is the realization that the effect of intensity depends on timing (and vice versa).

Figure 4: Simulated data and fitted interaction model for the meadowfoam data set (flowers per plant vs. light intensity, by timing level).

A model that describes the interaction of two explanatory variables (x1 and x2) is

Y = β0 + β1x1 + β2x2 + β3x1x2 + ε.

The term β3x1x2 is called an interaction term, as it accounts for synergistic interaction between x1 and x2.

The effect on E(Y) of a 1-unit increase in x1, when all other variables are held fixed, is

ΔE(Y) = β0 + β1(x1 + 1) + β2x2 + β3(x1 + 1)x2 − (β0 + β1x1 + β2x2 + β3x1x2) = β1 + β3x2.

The effect of a 1-unit increase in x1 on E(Y) depends on the value of x2 and the parameters β1 and β3. If β3 > 0, then the effect of a 1-unit increase in x1 increases with x2; if β3 < 0, then the effect decreases with x2.

Table 4: Summary of the fitted interaction model showing parameter estimates and tests of significance. The model is E(Y | x) = β0 + β1·intensity + β2·x2 + β3·intensity·x2. In addition, s_ε = 6.6 and df = 20.

Parameter   Estimate   Std. Error   t   Pr(T > |t|)
β0
β1
β2
β3

The effect of a 1-unit increase in light intensity (x1) is estimated to be β̂1 + β̂3x2: β̂1 if the onset of the light treatment was at FI (x2 = 0), and β̂1 + β̂3 = 0.010 if the onset was 24 days prior to FI (x2 = 1).
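The derivation above can be checked numerically: in an interaction model, the change in E(Y) per unit of x1 really is β1 + β3x2. The coefficient values below are made up for illustration (they are not the fitted meadowfoam estimates).

```python
# Sketch: in the interaction model
#   E(Y) = b0 + b1*x1 + b2*x2 + b3*x1*x2,
# the change in E(Y) for a 1-unit increase in x1 is b1 + b3*x2.
# Coefficient values are made up for illustration.
b0, b1, b2, b3 = 50.0, -0.04, 12.0, 0.05

def mean_y(x1, x2):
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

effect_at_fi = mean_y(301.0, 0.0) - mean_y(300.0, 0.0)   # equals b1
effect_prior = mean_y(301.0, 1.0) - mean_y(300.0, 1.0)   # equals b1 + b3
```

With these made-up values the two slopes even have opposite signs, which is exactly the non-parallel-lines picture in the figure.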

The model can be expressed as

Ŷ = β̂0 + β̂1x1, if the onset of the light treatment was at FI,
Ŷ = (β̂0 + β̂2) + (β̂1 + β̂3)x1, if the onset was 24 days prior to FI.

The interaction model Y = β0 + β1x1 + β2x2 + β3x1x2 + ε can be expressed as a general linear model if we define x3 = x1x2 and set

Y = β0 + β1x1 + β2x2 + β3x3 + ε.

In some instances, there may be reason to believe that two variables interact with regard to their relationship with Y, and so a test of significance may be carried out. For our example, we would test H0: β3 = 0 versus Ha: β3 ≠ 0. There is abundant evidence that the variables interact because the p-value is approximately 0.0001.

12.3 Estimating Multiple Regression Coefficients

The data can be thought of as a set of n p-tuples

{(y1, x_{1,1}, x_{1,2}, ..., x_{1,p-1}), (y2, x_{2,1}, x_{2,2}, ..., x_{2,p-1}), ..., (yn, x_{n,1}, x_{n,2}, ..., x_{n,p-1})},

where the first subscript on the x's denotes the observation number and the second identifies the variable. The ith data tuple is (yi, x_{i,1}, x_{i,2}, ..., x_{i,p-1}).

The general linear model for the ith observation is

Yi = β0 + β1x_{i,1} + β2x_{i,2} + ··· + β_{p-1}x_{i,p-1} + εi.

The least squares estimates of the parameters β0, β1, ..., β_{p-1} are obtained by minimizing the sum of the squared prediction errors

SSE = Σ_{i=1}^{n} (yi − ŷi)² = Σ_{i=1}^{n} (yi − [β0 + β1x_{i,1} + ··· + β_{p-1}x_{i,p-1}])².

The estimates are computed by solving a set of simultaneous linear equations known as the normal equations (Ott, p. 575). The solution β̂0, ..., β̂_{p-1} minimizes SSE, and it is unique. All statistics packages are very efficient and fast at solving these equations. Our interest is in the proper construction and application of multiple regression models, not in computation of the estimators.

12.4 Inferences in Multiple Regression

The coefficient of determination R² is the proportion of the total variation in the response variable that is explained by the regression model. R² can be computed by either of two formulas:

R² = SSR/SST, (1)

where SSR = Σ(ŷi − ȳ)² is the regression (or model) sum of squares and SST = Σ(yi − ȳ)²
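On simulated data, solving the normal equations (X′X)b = X′y directly gives the same estimates as a library least-squares routine, which is all a statistics package is doing under the hood. A minimal sketch:

```python
import numpy as np

# Sketch: the least-squares estimates solve the normal equations
# (X'X) b = X'y.  Simulated data, true coefficients (2, -1, 0.5).
rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=n)

beta_ne = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # library solver
```

In practice the library solver is preferred: it handles poorly conditioned design matrices more gracefully than forming X′X explicitly.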

is the (corrected) total sum of squares, or

R² = (SST − SSE)/SST, (2)

where SSE = Σ(yi − ŷi)² is the error (or residual) sum of squares.

R² is not a well-determined function of the individual correlation coefficients between y and each of the explanatory variables x1, x2, ..., x_{p-1}. In particular, R² is not the sum of the individual squared correlation coefficients except in the rare instance that the explanatory variables are orthogonal.

R² tends to exaggerate how well the fitted model will perform at predicting new observations (i.e., observations that are not in the data set). R² is sensitive to over-fitting; that is, the model fits the observed data better than new observations because the model was constructed from the observed data. For large n, over-fitting is not much of a problem. Often, statistics packages compute an adjusted R² to (partially) correct for over-fitting (the adjusted R² is also affected by over-fitting, but to a lesser extent than R²). A popular adjusted R² is computed according to the formula

R²_a = [SST/(n-1) − SSE/(n-p)] / [SST/(n-1)] = 1 − s²_ε/σ̂² = (σ̂² − s²_ε)/σ̂²,

where s²_ε = SSE/(n-p) and σ̂² = Σ(yi − ȳ)²/(n-1).

If p (the number of β's) is large relative to n (the number of observations), then SSE/(n-p) will be large and R²_a will be reduced; R²_a is penalized if p is relatively large compared to n.

Inferences about a single parameter in the general linear model
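The two R² formulas and the adjusted version are easy to compute from a fitted model. A sketch on simulated data:

```python
import numpy as np

# Sketch: R^2 = SSR/SST = 1 - SSE/SST, and adjusted
# R^2_a = 1 - [SSE/(n-p)] / [SST/(n-1)].  Simulated data.
rng = np.random.default_rng(2)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 1.0, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
sse = np.sum((y - y_hat) ** 2)         # error sum of squares
sst = np.sum((y - y.mean()) ** 2)      # (corrected) total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares

r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - p)) / (sst / (n - 1))
```

Because the model contains an intercept, SSR + SSE = SST, so formulas (1) and (2) agree; and since (n-1)/(n-p) ≥ 1, the adjusted R² can never exceed R².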

The inferential methods for a single parameter, say βj, are almost the same as for simple linear regression, aside from hidden computational details. A 100(1-α)% confidence interval for βj is

β̂j ± t_{α/2} σ̂_{βj},

where t_{α/2} is the 1 − α/2 percentile of the t distribution with df = n-p, and σ̂_{βj} is the estimated standard error of β̂j.

Example: the results from the previous example were summarized as

Table 5: Summary of the fitted interaction model showing parameter estimates and tests of significance. The model is E(Y | x) = β0 + β1·intensity + β2·x2 + β3·intensity·x2. In addition, s_ε = 6.6 and df = 20.

Parameter   Estimate   Std. Error   t   Pr(T > |t|)
β0
β1
β2
β3

A 99% CI for β3 is obtained by setting α = 0.01. From Table 2, p. 1093, t_{α/2} = t_{.005} = 2.845 (df = 20). From the computer output, β̂3 = 0.0512 and σ̂_{β3} = 0.0105. Hence, a 99% CI for β3 is

β̂3 ± t_{α/2} σ̂_{β3} = 0.0512 ± 2.845 × 0.0105 = [0.0213, 0.0811].

Because this interval excludes 0, we conclude that β3 is not 0 and that there is evidence of interaction between light intensity and timing. We can say that we are 99% confident that the true β3 lies in this interval.
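The interval computation above can be reproduced with a t quantile from software rather than a printed table. A sketch using the interaction-model values quoted in these notes (estimate 0.0512, standard error 0.0105, df = 20):

```python
from scipy import stats

# Sketch: a 100(1 - alpha)% CI for a coefficient is
#   estimate +/- t_{alpha/2, n-p} * (std. error).
# Values are the interaction-model numbers quoted in the notes.
beta3_hat, se, df = 0.0512, 0.0105, 20
alpha = 0.01
t_crit = stats.t.ppf(1 - alpha / 2, df)   # t_{0.005, 20}
ci = (beta3_hat - t_crit * se, beta3_hat + t_crit * se)
```

The computed critical value matches the tabled t_{.005} = 2.845, and the lower endpoint is positive, so the interval excludes 0.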

We cannot say that there is a 0.99 probability that the true β3 lies in this interval.

Hypothesis tests regarding βj

In this discussion, we assume that the general linear model

Y = β0 + β1x1 + β2x2 + ··· + β_{p-1}x_{p-1} + ε

has been fit. We suppose that the explanatory variables x1, ..., x_{j-1}, x_{j+1}, ..., x_{p-1} are useful for explaining the variation in Y, and we want to test H0: βj = 0 versus Ha: βj ≠ 0 after accounting for these other variables. Said another way, we want to test whether there is a linear relationship between Y and xj in the presence of the other explanatory variables. This can be accomplished by examining the t-statistic associated with β̂j computed from the model with all explanatory variables.

The advantages of this approach over testing H0: βj = 0 without consideration of the other variables are:

1. We utilize our knowledge regarding the response variable if we account for the other variables. In practical terms, the estimate of βj is more accurate than without the variables.
2. If the other variables are part of the true linear model of E(Y), then ignoring them means that we have tested H0: βj = 0 under a model that is incorrect, and the test results are not trustworthy.

The test statistic for H0: βj = 0 versus

1. Ha: βj ≠ 0,
2. Ha: βj > 0,
3. Ha: βj < 0

is

T = β̂j / σ̂_{βj}.

If H0: βj = 0 is true, then T ~ t_{n-p}. For an α-level test, the rejection regions and p-values are:

1. R = {t : |t| > t_{α/2}}, where t_{α/2} is a critical value from Table 2 (df = n-p). Also, p-value = 2P(T > |t|), where T ~ t_{n-p}.
2. R = {t : t > t_α}, where t_α is a critical value from Table 2 (df = n-p). Also, p-value = P(T > t).
3. R = {t : t < −t_α}. Also, p-value = P(T < t).

This test should be carried out with all other variables in the model that are considered to be important for modeling E(Y). It is better to have additional, unimportant variables in the model when testing H0: βj = 0 than to neglect to include important variables. Usually, including unimportant variables will not affect the test much.

Example: A test of whether the interaction variable x3 explains variation in the response variable, given x1 and x2, is obtained from examining the previous table.

The β3 line in the table is derived from the following calculations:

β̂3 = 0.0512
σ̂_{β3} = 0.0105
t = β̂3 / σ̂_{β3} = 0.0512/0.0105 = 4.87
p-value = 2P(T > 4.87) ≈ 0.0001

Based on these data, we can conclude that there is very strong evidence that the average number of flowers per plant varies with light intensity and the time at which light treatment was initiated, and that the relationship with intensity differs according to the time at which light treatment was initiated.

By testing H0: βj = 0 in the presence of the other variables, we allow the other variables to account for variation in y. Omitting these variables may produce a test statistic that is incorrect because xj may be similar to one of these other variables, and the importance of these other variables may then be ascribed to xj.

Interaction revisited

The test for significance of βj just described is useful for testing for significance of an interaction term (e.g., x3 = x1x2). Suppose that the test leads to the conclusion that the interaction (or x3) is significant. Then what should be said about x1 and x2 and their relationship to the response variable?

1. there is (statistical) evidence of an effect of x1 on the response variable, though the
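The t statistic and two-sided p-value in the calculation above can be reproduced directly from the quoted estimate, standard error, and degrees of freedom:

```python
from scipy import stats

# Sketch: two-sided p-value for H0: beta3 = 0, using the estimates
# quoted in the notes (0.0512, std. error 0.0105, df = 20).
t_stat = 0.0512 / 0.0105
p_value = 2 * stats.t.sf(t_stat, df=20)   # 2 * P(T > |t|)
```

A t statistic near 4.87 on 20 degrees of freedom is far out in the tail, so the p-value is on the order of 0.0001.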

effect depends on the level of x2
2. there is (statistical) evidence of an effect of x2 on the response variable, though the effect depends on the level of x1

Is there any point in testing for the significance of β1, the model coefficient that multiplies x1? No, because rejecting the null hypothesis H0: β1 = 0 implies that x1 is related to the response variable, and we have already concluded that this is the case. Worse, it is entirely possible that there is insufficient evidence to reject H0: β1 = 0, and if we conclude that β1 = 0 then we must also conclude that x1 is unrelated to the response variable, contradicting the previous conclusion that x1 is related to the response variable and that the effect depends on the level of x2. Consequently, we do not test for the significance of the main effects x1 and x2 when interaction is significant.

12.5 Testing a Subset of Regression Coefficients

Sometimes we want to test whether a set of parameters are all 0 versus the alternative that at least one is different from 0. For example, suppose that factor A is qualitative with k > 2 levels, and that the model may contain some other explanatory variables. To account for the k levels of factor A, we need k-1 indicator variables, say,

x1 = 1 if the level of A is 2, and 0 otherwise,
x2 = 1 if the level of A is 3, and 0 otherwise,

and so on until

x_{k-1} = 1 if the level of A is k, and 0 otherwise.

Note that level 1 has been aliased. The model is then

Y = β0 + β1x1 + β2x2 + ··· + β_{k-1}x_{k-1} + β_k x_k + ··· + β_{p-1}x_{p-1} + ε.

A test of whether factor A (a factor with k levels) explains variation in Y is accomplished by testing

H0: β1 = 0, β2 = 0, ..., β_{k-1} = 0, versus Ha: βi ≠ 0 for at least one i, where 1 ≤ i ≤ k-1.

A set of k-1 indicator variables is set up to account for factor A, and the model is fit twice, once with the indicator variables and again after removing them. A test of H0 compares the error sum of squares of the model with all indicator variables to the error sum of squares of the model without the indicator variables.

A large difference in error sums of squares is evidence that factor A explains variation in the response. Specifically, the test compares the fit of

Y = β0 + β1x1 + β2x2 + ··· + β_{k-1}x_{k-1} + β_k x_k + ··· + β_{p-1}x_{p-1} + ε

to the fit of

Y = β0 + β_k x_k + ··· + β_{p-1}x_{p-1} + ε.

If there is a big difference in the estimated error of the fitted models, then we conclude that there is evidence that factor A explains variation in Y. Formally, let

1. M1 denote the model with the indicator variables for A, SSE1 the error sum of squares associated with M1, and df1 the degrees of freedom associated with M1;
2. M2 denote the model without the indicator variables for A, SSE2 the error sum of squares associated with M2, and df2 the degrees of freedom associated with M2.

The test statistic is

F = [(SSE2 − SSE1)/(df2 − df1)] / [SSE1/df1].

Under H0, F has an F-distribution with n1 = df2 − df1 = k-1 numerator and n2 = df1 denominator degrees of freedom.

We reject H0 at the α-level if F > f_α, where α = P(F_{n1,n2} > f_α) is the probability that an F random variable with n1 numerator and n2 denominator degrees of freedom takes a value larger than f_α. A p-value for the test is p-value = P(F_{n1,n2} > F).

Ott and Longnecker's discussion (p. 658) differs slightly from this set-up, primarily because they use the difference in regression sums of squares between the complete model (the one containing the factor of interest) and the reduced model (the one without the factor). The difference between the regression sums of squares and the difference between the error sums of squares are equal, so either numerator (theirs or mine) is correct.

Example. In the Harris bank lawsuit analysis, a test of the association between gender and monthly salary increase can be obtained by testing for the joint significance of gender and age × gender:

1. the residual sum of squares from the model containing both gender and age × gender (and seniority and age) was SSE1 = 225, with df1 = 88;
2. the residual sum of squares from the model without gender and age × gender (but containing seniority and age) was SSE2 = 246, with df2 = 90.

The test statistic is

F = [(246 − 225)/(90 − 88)] / (225/88) = 4.11.

The p-value is equal to P(F_{2,88} > 4.11) ≈ 0.0197. Hence there is convincing evidence of an association between gender and monthly salary increase. In addition, the effect of gender depends on age.
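The Harris bank F statistic and its p-value can be recomputed from the sums of squares quoted above:

```python
from scipy import stats

# Sketch: nested-model F test from the Harris bank example.
# Full model: SSE1 = 225 on df1 = 88; reduced: SSE2 = 246 on df2 = 90
# (sums of squares taken from the notes).
sse1, df1 = 225.0, 88
sse2, df2 = 246.0, 90
F = ((sse2 - sse1) / (df2 - df1)) / (sse1 / df1)
p_value = stats.f.sf(F, df2 - df1, df1)   # P(F_{2,88} > F)
```

The arithmetic reproduces F ≈ 4.11 with a p-value of about 0.02, matching the conclusion of convincing evidence of an association.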

Example from Fouts, R.S. 1973. Acquisition and testing of gestural signs in four young chimpanzees, Science, 180. Fouts taught 4 chimpanzees 10 signs of American Sign Language with the intent of determining whether some signs are easier to learn, and whether some chimps tend to learn more quickly than others.

Table 6: Data (time in minutes to learn a word), by chimpanzee (Booee, Cindy, Bruno, Thelma) and word (Listen, Drink, Shoe, Key, More, Food, Fruit, Hat, Look, String).

Multiple regression analysis can be used to investigate whether there are differences between chimps and words with respect to learning times. Two factors are identified: chimp (with 4 levels) and word (with 10 levels). Indicator variables are set up to identify each of the levels. For example, differences among chimps (in learning ability) are accounted for by three indicator variables.

I am not particularly interested in investigating whether there is interaction between words and chimps. In any case, there is not enough data to investigate interaction. An investigation of interaction would require 3 × 9 = 27 indicator variables. Then there would be 1 + 3 + 9 + 27 = 40 parameters in the model. Since there are only n = 40 observations, the model would fit perfectly with SSE = 0 and R² = 1, and none of the F-tests could be evaluated because they use SSE in the denominator.

I consider only a main effects model. The indicator variables for the chimp factor are set according to

x1 = 1 if the chimp is Cindy, and 0 otherwise,
x2 = 1 if the chimp is Bruno, and 0 otherwise,
x3 = 1 if the chimp is Thelma, and 0 otherwise.

The word factor requires k-1 = 10-1 = 9 indicator variables. The fitted model including both factors is shown in Table 7.

Table 7: Summary of the fitted main effects model showing parameter estimates and tests of significance. s_ε = 0.809 and df = 27.

Variable   Estimate   Std. Error   t   Pr(T > |t|)
Constant
chimp (3 indicator variables)
word (9 indicator variables)

The residual plots indicate that the residuals are approximately normal in distribution, though there is substantial evidence that the assumption of constant variance does not hold.

Figure 5: Standardized residuals plotted against fitted values (left panel), and a normal distribution quantile plot of the residuals (right panel), derived from the regression on time (minutes).

Some fitted values are negative, which suggests that the model does not fit well, at least in a logical sense. The response variable is therefore replaced by its natural logarithm, and the model is re-fit using this new response variable. The non-constant variance problem is resolved, though now the assumption of normality is somewhat more questionable.

To test whether there are differences between chimps with respect to learning times, we compare the residual sums of squares of the model with chimp and word and the model with word alone.

Figure 6: Standardized residuals plotted against fitted values (left panel), and a normal distribution quantile plot of the residuals (right panel), derived from the regression on the natural logarithm of time. The plots indicate that the non-constant variance problem is resolved, though the normality assumption is somewhat more questionable.

Formally, the hypotheses are H0: β1 = β2 = β3 = 0 versus Ha: β1 ≠ 0 or β2 ≠ 0 or β3 ≠ 0.

SSE1 = 17.65, MSE1 = 0.654, df1 = 27 (both chimp and word are in the model)
SSE2 = 22.99, df2 = 30 (only word is in the model)

The F-statistic testing H0 versus Ha is

F = [(SSE2 − SSE1)/(df2 − df1)] / (SSE1/df1) = [(22.99 − 17.65)/(30 − 27)] / (17.65/27) = (5.34/3)/0.654 = 2.72

p-value = Pr(F_{3,27} > 2.72)
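The chimp-effect F test can be recomputed from the sums of squares above (with the full chimp-and-word model on 40 − 13 = 27 error degrees of freedom and the word-only model on 30):

```python
from scipy import stats

# Sketch: F test for the chimp effect on log learning time, using the
# sums of squares quoted in the notes: full model (chimp + word)
# SSE1 = 17.65 on 27 df; word-only model SSE2 = 22.99 on 30 df.
sse_full, df_full = 17.65, 27
sse_red, df_red = 22.99, 30
F = ((sse_red - sse_full) / (df_red - df_full)) / (sse_full / df_full)
p_value = stats.f.sf(F, df_red - df_full, df_full)   # P(F_{3,27} > F)
```

The resulting F of about 2.72 sits between the 10% and 5% critical values of F with (3, 27) degrees of freedom, which is why the conclusion below speaks of "some" rather than "strong" evidence of chimp differences.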

The test of significance for differences between words (comparing the SSEs of the model with chimp and word to the model with chimp alone) is carried out the same way, and is highly significant. The results are collected in an analysis of variance (ANOVA) table:

Table 8: Analysis of variance table. The response variable is log time.

Source      Degrees of freedom   Sums of squares   Mean squares   f   Pr(F > f)
Chimp
Word
Residuals

Figure 7: Fitted model, by animal (learning time for each word, for Booee, Cindy, Bruno, and Thelma).

Conclusions: these data provide some statistical evidence that there are differences among chimps with respect to learning times, and strong evidence of differences among words with respect to learning times.

Caveat: the population of interest can only be the four chimps, because we cannot assume that they are a sample from a larger population of chimps. This forces us to identify the population of interest as the four chimps, and hence statistical inference is limited to those four. On the other hand, I am comfortable in informally concluding that these data suggest that there are differences among chimps, in general, with respect to their ability to learn word signs.

12.6 Forecasting Using Multiple Regression

The objectives are to estimate the expected response E(Y), given a set of values for the explanatory variables, and to predict Y, given a set of values for the explanatory variables. The estimate is

Ê(Y) = β̂0 + β̂1x1 + β̂2x2 + ··· + β̂_{p-1}x_{p-1},

and the prediction is the same:

Ŷ = β̂0 + β̂1x1 + β̂2x2 + ··· + β̂_{p-1}x_{p-1}.

Confidence and prediction intervals for E(Y) and Y are quite difficult to compute by hand, so we use computers. The general structure and the interpretation are the same as in simple linear regression. A 100(1-α)% confidence interval for E(Y) is

Ê(Y) ± t_{α/2} σ̂_{Ê(Y)},

32 where σê(y ) is the estimated standard error of the estimate Ê (Y ), and α/2 = P (t n p > t α/2 ) is the probability that a t random variable with n p degrees of freedom will take on a value larger than t α/2 σê(y ) should be computed by computer A 100 (1 α)% prediction interval for Y is Ŷ ± t α/2 σŷ where σŷ is the estimated standard error of the prediction Ŷ, and α/2 = P (t n p > t α/2 ) is the probability that a t random variable with n p degrees of freedom will take on a value larger than t α/2 Remarks on the Preparation of Statistical Reports Conciseness is most important. Present only the information that is necessary to answer the question(s). For maximum efficiency, write the results first, then the methods, and conclusion, and finally, the summary. The order of the report is opposite, though. Here are the main sections and what belongs in each. The length depends on the complexity of the question, so the guidelines below are fairly loose 1. Summary - A brief paragraph sketching the research question, the data, how it was analyzed, and your conclusions. Leave out details. 2. Research Objectives - state the research questions(s). Then, identify specific objectives that will be pursued towards the goal of answering the questions. For example, my research question might be: was there gender bias in wages paid to full-time 32

33 U.S. workers during the 1980 s?. Objectives: 1) Find a model that explains as much of the variation in wages as possible (but ignores gender). 2) Determine if there is residual variation about this model that is attributable to gender. 3. Methods - Explain what methods (e.g., tests, confidence intervals, transformations) were used, and for what. (It is not necessary to explain the method, though, unless it is unusual or new). For example, I may write...scatterplots were used to visually assess the relationship between Cesium concentration in mushrooms and soils. Plots of residuals versus fitted values were used to assess whether the normal distribution assumption was valid. T -statistics were used to test the hypothesis of a linear association between Cesium concentration in mushrooms and soils. Specifically, I tested H 0 : β 1 = 0 versus H a : β 1 > 0 using the test statistic T = β 1 / σ β Results - First, present some information describing the data (e.g., scatterplots) relevant to the question(s). Note any outliers, influential points, skewness, etc. Present whatever estimates, tests (name the statistic, its value, and a p-value, if possible), and/or confidence intervals that support your conclusions. The presentation of results should follow the objectives. Do not include computer output that is not referred to elsewhere. 5. Conclusions - state your conclusion concisely and to the point. E.g., There is strong statistical evidence of a linear association between Cesium concentration in mushrooms and soils (p-value= 0.023). Discuss any problems (e.g., the normality assumption is suspect; there are points of high leverage, etc.) Tables. All tables require a legend explaining the contents and any abbreviations 33

34 used within. The legend belongs immediately above or below the table. Figures. All figure require a legend explaining the contents and any abbreviations used within. The legend belongs immediately above or below the figure. Logistic Regression (Section 12.8) This is a very important topic that deserves substantially more attention than the treatment given by Ott and Longnecker. We will delay the study of this subject until linear regression has been finished Multiple Regression Theory (Section 12.9) Also a very important topic that deserves substantially more attention than the treatment given by Ott and Longnecker. We will not study of this subject in
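The confidence- and prediction-interval formulas earlier in this section reduce to simple arithmetic once software supplies the standard errors. A sketch with hypothetical numbers (the coefficient estimates, standard errors, and t quantile below are illustrative assumptions, not output from the meadowfoam fit):

```python
# 95% intervals for E(Y) and for a new Y at x_1 = 500 (light intensity),
# x_2 = 1 (onset prior to FI), using the model Y = b0 + b1*x1 + b2*x2 + error.
b = [71.3, -0.0405, 11.2]   # hypothetical estimates: intercept, light, timing
x = [1.0, 500.0, 1.0]       # 1 for the intercept, then x_1, x_2
t_crit = 2.056              # t_{alpha/2} with n - p = 26 df, from a t table
se_mean = 2.9               # hypothetical SE of the estimate of E(Y)
se_pred = 6.8               # hypothetical SE of the prediction (always larger)

est = sum(bi * xi for bi, xi in zip(b, x))              # E(Y)-hat = Y-hat
ci = (est - t_crit * se_mean, est + t_crit * se_mean)   # for the mean response
pi = (est - t_crit * se_pred, est + t_crit * se_pred)   # for a new observation
```

The point estimate is the same for both targets; only the standard error differs, which is why the prediction interval is always the wider of the two.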


More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there

More information

Simple Linear Regression

Simple Linear Regression 9-1 l Chapter 9 l Simple Linear Regression 9.1 Simple Linear Regression 9.2 Scatter Diagram 9.3 Graphical Method for Determining Regression 9.4 Least Square Method 9.5 Correlation Coefficient and Coefficient

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

9 Correlation and Regression

9 Correlation and Regression 9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

More information

In ANOVA the response variable is numerical and the explanatory variables are categorical.

In ANOVA the response variable is numerical and the explanatory variables are categorical. 1 ANOVA ANOVA means ANalysis Of VAriance. The ANOVA is a tool for studying the influence of one or more qualitative variables on the mean of a numerical variable in a population. In ANOVA the response

More information

The General Linear Model

The General Linear Model The General Linear Model Thus far, we have discussed measures of uncertainty for the estimated parameters ( β 0, β 1 ) and responses ( y h ) from the simple linear regression model We would like to extend

More information

Simple Linear Regression: One Qualitative IV

Simple Linear Regression: One Qualitative IV Simple Linear Regression: One Qualitative IV 1. Purpose As noted before regression is used both to explain and predict variation in DVs, and adding to the equation categorical variables extends regression

More information

Diagnostics and Remedial Measures

Diagnostics and Remedial Measures Diagnostics and Remedial Measures Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Diagnostics and Remedial Measures 1 / 72 Remedial Measures How do we know that the regression

More information