Chapter 12 - Multiple Regression and the General Linear Model


The simple linear model can be extended from one response variable (Y) and one explanatory variable (x) to one response variable and p-1 explanatory variables x1, ..., x_{p-1}. The regression of Y on x1, ..., x_{p-1} is an equation that predicts the expected value of Y for particular values of the explanatory variables.

A Case Study: Seddigh, M. and Jolliff, G.D. 1994, Light intensity effects on meadowfoam growth and flowering, Crop Sci., 34. Meadowfoam is a small plant that grows in moist meadows of the Pacific NW; the seed oil is valuable. Seddigh and Jolliff were interested in determining optimal growing conditions for commercial production. They carried out a controlled growth chamber experiment with two factors: light intensity (6 levels) and timing of the onset of the light treatment (either at floral induction, or 24 days prior to floral induction). Ten seedlings were randomly assigned to each of the 12 = 6 × 2 treatments, and the average number of flowers per plant was calculated at the end of the experiment.

Factors, levels, and average number of flowers per plant:

Table 1: Average number of flowers per plant, by light intensity (µmol/m²/sec) and timing of onset (at FI or prior to FI).

A Multiple Regression Model

Let Y denote the average number of flowers per plant, let x1 denote the level of light intensity, and define x2 as follows:

x2 = 0, if onset of light treatment was at FI
x2 = 1, if onset of light treatment was prior to FI

The model is

Y = β0 + β1x1 + β2x2 + ε.

This model says that the number of flowers per plant (Y) is a linear function of light intensity (x1) and timing (x2), plus a random error term (ε). For inferential purposes, it is assumed that the errors are independent and identically distributed normal random variables with mean 0 and constant variance, i.e., ε ~ N(0, σ²). This assumption must be supported by the data. The assumption of errors with mean 0 implies that the average number of flowers is a

linear function of the two explanatory variables:

E(Y) = β0 + β1x1 + β2x2.

Note that if Y1 was observed when the onset of light treatment was at FI, then the model for Y1 is

Y1 = β0 + β1x1 + β2·0 + ε = β0 + β1x1 + ε.

If Y2 was observed when the onset of light treatment was prior to FI, then the model for Y2 is

Y2 = β0 + β1x1 + β2·1 + ε = (β0 + β2) + β1x1 + ε = β0′ + β1x1 + ε,

where β0′ = β0 + β2. The two models have the same slope but different intercepts. If β2 is 0, then β0′ = β0, and the timing of onset of light treatment has no effect on average numbers of flowers.

Table 2: Summary of the fitted model showing parameter estimates and tests of significance. The model is E(Y | x) = β0 + β1·intensity + β2·x2. In addition, s_ε = 6.441 and df = 21.

Parameter   Estimate   Std. Error   t   Pr(T > |t|)
β0
β1
β2
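The model with an intercept-shifting indicator variable can be fit by ordinary least squares. A minimal sketch follows; the responses are simulated for illustration only, since the actual meadowfoam measurements are not reproduced in these notes, and the coefficient values used to simulate are made up.

```python
import numpy as np

# Sketch: fitting E(Y) = b0 + b1*intensity + b2*timing by least squares.
# The y values below are SIMULATED (illustrative coefficients only).
rng = np.random.default_rng(0)
intensity = np.tile([150.0, 300.0, 450.0, 600.0, 750.0, 900.0], 2)
timing = np.repeat([0.0, 1.0], 6)            # 0 = at FI, 1 = prior to FI
y = 70 - 0.04 * intensity + 12 * timing + rng.normal(0, 6, size=12)

X = np.column_stack([np.ones(12), intensity, timing])   # design matrix
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)        # (b0, b1, b2)
```

The indicator column shifts the intercept for the prior-to-FI observations, exactly as in the two-intercept derivation above.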

Figure 1: Data and fitted linear model for the meadowfoam data set (flowers per plant vs. light intensity, with separate lines for the two timing levels).

The figure shows that the model fits the data quite well and that there is no evidence of violations of the distributional assumptions. The regression summary shows that there is strong evidence of a difference in numbers of flowers per plant attributable to timing of onset, and that the number of flowers decreases as light levels increase. Beginning the light treatment 24 days before the onset of FI is estimated to increase the expected number of flowers per plant by 12.2.

We will test H0: β2 = 0 versus Ha: β2 ≠ 0 using T = β̂2 / σ̂_{β2}.

The General Linear Model

The general linear model can be expressed as

Y = β0 + β1x1 + β2x2 + ··· + β_{p-1}x_{p-1} + ε,

where x1, ..., x_{p-1} are explanatory variables and ε is a random error with mean 0. Note: this is a general model that describes a random process that generates observations. A more specific model that describes the data that have been observed often uses subscripts, for example:

Yi = β0 + β1x_{i,1} + β2x_{i,2} + ··· + β_{p-1}x_{i,p-1} + εi.

There are a variety of ways to obtain the explanatory variables.

Polynomials in x

All p-1 variables can be computed from a single explanatory variable x according to

x1 = x
x2 = x²
...
x_{p-1} = x^{p-1}

This general linear model may be expressed as

Yi = β0 + β1xi + β2xi² + ··· + β_{p-1}xi^{p-1} + εi.

Polynomial models are useful for approximating complicated nonlinear relationships between Y and x.
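A polynomial model is still a linear model, because it is linear in the coefficients; only the derived variables x, x², ... are nonlinear in x. The sketch below fits a quadratic to made-up points lying exactly on a parabola (not Galileo's punti data), so least squares recovers the curve.

```python
import numpy as np

# Sketch: a polynomial in x is a *linear* model in the derived
# variables x and x^2.  The x-y pairs are a made-up exact quadratic.
x = np.array([100.0, 200.0, 300.0, 450.0, 600.0, 800.0, 1000.0])
y = 200 + 0.9 * x - 0.0004 * x ** 2

X = np.column_stack([np.ones_like(x), x, x ** 2])   # columns 1, x, x^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The fitted curve passes through the points, and the negative coefficient on x² produces the downward-bending shape seen in trajectory-type data.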

Example (from Ramsey and Schafer): Drake, S. and J. MacLachlan, Galileo's discovery of the parabolic trajectory, Scientific American, 232. Galileo conducted an experiment to determine whether the horizontal velocity of a moving object is constant. He rolled a ball off an incline and measured the horizontal distance travelled. The height of the incline varied between 50 and 1000 punti. The data, and a fitted second-order polynomial model

Ŷ = β̂0 + β̂1x + β̂2x²,

are shown in the figure below:

Figure 2: Data and fitted polynomial model from Galileo's experiment on horizontal distance (horizontal distance in punti vs. initial height in punti).

Table 3: Analysis of variance table for Galileo's experiment on horizontal distance.

Source   df   Sum of Squares   Mean Square   f   Pr(F > f)
x
x²
error

Multiple Explanatory Variables

Another general linear model involves p-1 different variables x1, ..., x_{p-1}, none of which are exponentiated. Such a model can be expressed as

E(Y) = β0 + β1x1 + β2x2 + ··· + β_{p-1}x_{p-1}.

The effects of the variables on Y are said to be additive; β1x1 is said to be the effect of x1 on E(Y) (and also on Y). In contrast, a multiplicative model is

E(Y) = β0 · x1^{β1} · x2^{β2}.

Chemical reaction rates are often multiplicative functions of the concentrations of catalysts. A linearized version is

log[E(Y)] = α0 + β1 log(x1) + β2 log(x2),

where α0 = log(β0).

An additive model implies that the effect on E(Y) of a 1-unit increase in x1, when all other variables are held fixed, is

ΔE(Y) = β0 + β1(x1 + 1) + β2x2 + ··· + β_{p-1}x_{p-1} − (β0 + β1x1 + β2x2 + ··· + β_{p-1}x_{p-1}) = β1.

The effect of a 1-unit increase in x1 does not depend on the values of the other variables. Thus, to determine the difference in E(Y) for two sets of conditions, we only have to add the effect of a difference in one variable to the effect of a difference in the other variable.
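The log-linearization of the multiplicative model can be checked numerically. A tiny sketch, with made-up coefficient values:

```python
import math

# Sketch: taking logs turns the multiplicative model
#   E(Y) = b0 * x1**b1 * x2**b2
# into a model linear in log(x1) and log(x2).  Coefficients are
# made up for illustration.
b0, b1, b2 = 2.0, 0.5, -1.5

def mult_mean(x1, x2):
    return b0 * x1 ** b1 * x2 ** b2

x1, x2 = 4.0, 3.0
lhs = math.log(mult_mean(x1, x2))
rhs = math.log(b0) + b1 * math.log(x1) + b2 * math.log(x2)
```

The two sides agree, which is why multiplicative models are routinely fit by regressing log(Y) on the logs of the explanatory variables.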

For example, the change in E(Y) attributable to a 1-unit change in x1 and a 1-unit change in x2 is

ΔE(Y) = β0 + β1(x1 + 1) + β2(x2 + 1) − (β0 + β1x1 + β2x2) = β1 + β2.

Qualitative (or categorical) variables

It is possible to analyze the relationship between a quantitative response variable and one or more qualitative variables using multiple regression. A common approach is to create a set of indicator variables that identify the levels of a qualitative variable.

Example: Seddigh, M. and Jolliff, G.D. 1994, Light intensity effects on meadowfoam growth and flowering. Crop Sci., 34. The timing of the onset of the light (at FI or 24 days prior) is a qualitative variable because there is no reason to believe that time has a linear effect on the average number of flowers per plant. An indicator (or dummy) variable identifies at which level of timing the observation was recorded. The indicator variable is defined by:

x2 = 0, if onset of light treatment was at FI
x2 = 1, if onset of light treatment was prior to FI,

and the model is

Y = β0 + β1x1 + β2x2 + ε,

where x1 is light intensity. The estimate of β2 (12.2, from the fitted model above) is also the estimated mean difference in average numbers of flowers per plant attributable to timing of the onset of light treatment.

If a factor has more than 2 levels (say k levels), then we create k-1 indicator variables to account for the factor. For example, if a factor has 3 levels A, B, and C, then we may set up x1 to be the indicator of level A and x2 to be the indicator of level B. Suppose that for the ith observation x_{i,1} = x_{i,2} = 0. Then we are certain that the ith observation was observed when the factor level was C. Conversely, if the factor level for the ith observation is C, then we are certain that x_{i,1} = x_{i,2} = 0. It is redundant to create and use a third indicator variable identifying the level as C. As a general rule, only k-1 indicator variables are used. Further, a number of computational difficulties may arise if k indicator variables are used in model fitting and analysis.

The level without an indicator variable to identify it is often called the aliased level. If there is a single factor with k levels, then a model of the ith observation is

E(Yi) = β0 + β1x_{i,1} + β2x_{i,2} + ··· + β_{k-1}x_{i,k-1}.

The expected response when the level is k is therefore E(Yi) = β0. Note that the meaning of the intercept is different than in simple linear regression.
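The redundancy of a kth indicator can be seen directly: the three indicator columns sum to the intercept column, so the design matrix loses full rank. A sketch with hypothetical level labels:

```python
import numpy as np

# Sketch: coding a factor with k = 3 levels (A, B, C) using k - 1 = 2
# indicator variables, with C as the aliased level.  Labels are
# hypothetical.
levels = np.array(["A", "B", "C", "C", "B", "A"])
x1 = (levels == "A").astype(float)        # indicator of level A
x2 = (levels == "B").astype(float)        # indicator of level B
X = np.column_stack([np.ones(len(levels)), x1, x2])

# A third indicator for C satisfies x3 = 1 - x1 - x2, so adding it
# makes the columns linearly dependent (the computational difficulty
# mentioned above).
x3 = (levels == "C").astype(float)
X_redundant = np.column_stack([np.ones(len(levels)), x1, x2, x3])
```

X has full column rank, while X_redundant does not: its rank stays at 3 even though it has 4 columns, so X′X is singular and the normal equations have no unique solution.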

Note: there are a variety of ways to define the indicator variables of treatment level. Care must be taken to understand how the variables are defined and how to determine the expected response at each level.

Example. If we do not want to assume a linear relationship between light intensity and the average number of flowers, then we can use indicator variables to determine the estimated mean number of flowers. The following model may be used:

Y = β0 + β1x1 + β2x2 + ··· + β7x7 + ε,

where

x2 = 1 if light intensity = 150, and 0 otherwise,
x3 = 1 if light intensity = 300, and 0 otherwise,

and so on, with

x7 = 1 if light intensity = 750, and 0 otherwise.

For the different levels of timing and light intensity, the models are

Y = β0 + ε, if the onset of light was at day 0 and intensity = 900,
Y = β0 + β1 + ε, if the onset of light was at day 24 and intensity = 900,
Y = β0 + β2 + ε, if the onset of light was at day 0 and intensity = 150,
Y = β0 + β1 + β2 + ε, if the onset of light was at day 24 and intensity = 150,

and so on. The fitted model and data are shown in Figure 3. The model will be referred to as the unconstrained model, as we have dropped the assumption that the relationship is linear and make no assumptions (and impose no constraints) about its form.

Figure 3: Data and unconstrained fitted model for the meadowfoam data set (flowers per plant vs. light intensity, by timing level).

Homework for Wednesday, March 7

Analyze the brain weight data. Determine a model that describes the data as completely as possible, but using only those variables that are clearly associated with brain weight after accounting for all other important variables. Support your model with appropriate statistical analyses and discussion. Defend all tests that you carry out by residual analyses, and identify those assumptions that are not supported by the data.

Use transformations whenever possible to reduce violations of the model assumptions. Create an indicator variable that identifies primates and treat it as an additional explanatory variable (I believe that animals 6 through 21 (row number) are primates).

You should provide the following in your report:

1. Pair-wise scatterplots and correlation coefficients for all variables (after transformation) used in your final model besides the primate variable
2. For the primate variable, construct side-by-side box plots for the other explanatory variables
3. Note any explanatory variables that are moderately or strongly correlated
4. Residual plots (normal probability plots and plots of residuals versus fitted values). Identify the largest few outliers
5. An analysis of interaction variables. Use the following strategy. Find a best-fitting model using the explanatory variables (or transformations). Create all pair-wise interaction variables (e.g., from x1 and x2, create x3 = x1x2). Add each interaction variable to the best model, examine the significance test (for x3), and retain any interaction variable that is significant. Report on those that were significant and note the change in R² between the models with and without the interaction variables.

Interaction between explanatory variables

When two variables interact, the effect on E(Y) of a 1-unit increase in x1 depends on the value of the interacting variable. The figure below shows an example of two explanatory variables that interact.

One explanatory variable is light intensity, and the second explanatory variable is the timing of onset. If there were no interaction, then the lines would be parallel. The key to understanding interaction is the realization that the effect of intensity depends on timing (and vice versa).

Figure 4: Simulated data and fitted interaction model for the meadowfoam data set (flowers per plant vs. light intensity, by timing level).

A model that describes the interaction of two explanatory variables (x1 and x2) is

Y = β0 + β1x1 + β2x2 + β3x1x2 + ε.

The term β3x1x2 is called an interaction term, as it accounts for synergistic interaction between x1 and x2.

The effect on E(Y) of a 1-unit increase in x1, when all other variables are held fixed, is

ΔE(Y) = β0 + β1(x1 + 1) + β2x2 + β3(x1 + 1)x2 − (β0 + β1x1 + β2x2 + β3x1x2) = β1 + β3x2.

The effect of a 1-unit increase in x1 on E(Y) depends on the value of x2 and the parameters β1 and β3. If β3 > 0, then the effect of a 1-unit increase in x1 increases with x2; if β3 < 0, then the effect decreases with x2.

Table 4: Summary of the fitted interaction model showing parameter estimates and tests of significance. The model is E(Y | x) = β0 + β1·intensity + β2·x2 + β3·intensity·x2. In addition, s_ε = 6.6 and df = 20.

Parameter   Estimate   Std. Error   t   Pr(T > |t|)
β0
β1
β2
β3

The effect of a 1-unit increase in light intensity (x1) is estimated to be β̂1 + β̂3x2: β̂1 if the onset of the light treatment was at FI (x2 = 0), and β̂1 + β̂3 = 0.010 if the onset was 24 days prior to FI (x2 = 1).
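The derivation above can be checked numerically: in an interaction model, the change in E(Y) per unit of x1 really is β1 + β3x2. The coefficient values below are made up for illustration (they are not the fitted meadowfoam estimates).

```python
# Sketch: in the interaction model
#   E(Y) = b0 + b1*x1 + b2*x2 + b3*x1*x2,
# the change in E(Y) for a 1-unit increase in x1 is b1 + b3*x2.
# Coefficient values are made up for illustration.
b0, b1, b2, b3 = 50.0, -0.04, 12.0, 0.05

def mean_y(x1, x2):
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

effect_at_fi = mean_y(301.0, 0.0) - mean_y(300.0, 0.0)   # equals b1
effect_prior = mean_y(301.0, 1.0) - mean_y(300.0, 1.0)   # equals b1 + b3
```

With these made-up values the two slopes even have opposite signs, which is exactly the non-parallel-lines picture in the figure.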

The model can be expressed as

Ŷ = β̂0 + β̂1x1, if the onset of the light treatment was at FI,
Ŷ = (β̂0 + β̂2) + (β̂1 + β̂3)x1, if the onset was 24 days prior to FI.

The interaction model Y = β0 + β1x1 + β2x2 + β3x1x2 + ε can be expressed as a general linear model if we define x3 = x1x2 and set

Y = β0 + β1x1 + β2x2 + β3x3 + ε.

In some instances, there may be reason to believe that two variables interact with regard to their relationship with Y, and so a test of significance may be carried out. For our example, we would test H0: β3 = 0 versus Ha: β3 ≠ 0. There is abundant evidence that the variables interact because the p-value is approximately 0.0001.

12.3 Estimating Multiple Regression Coefficients

The data can be thought of as a set of n p-tuples

{(y1, x_{1,1}, x_{1,2}, ..., x_{1,p-1}), (y2, x_{2,1}, x_{2,2}, ..., x_{2,p-1}), ..., (yn, x_{n,1}, x_{n,2}, ..., x_{n,p-1})},

where the first subscript on the x's denotes the observation number and the second identifies the variable. The ith data tuple is (yi, x_{i,1}, x_{i,2}, ..., x_{i,p-1}).

The general linear model for the ith observation is

Yi = β0 + β1x_{i,1} + β2x_{i,2} + ··· + β_{p-1}x_{i,p-1} + εi.

The least squares estimates of the parameters β0, β1, ..., β_{p-1} are obtained by minimizing the sum of the squared prediction errors

SSE = Σ_{i=1}^{n} (yi − ŷi)² = Σ_{i=1}^{n} (yi − [β0 + β1x_{i,1} + ··· + β_{p-1}x_{i,p-1}])².

The estimates are computed by solving a set of simultaneous linear equations known as the normal equations (Ott, p. 575). The solution β̂0, ..., β̂_{p-1} minimizes SSE, and it is unique. All statistics packages are very efficient and fast at solving these equations. Our interest is in the proper construction and application of multiple regression models, not in computation of the estimators.

12.4 Inferences in Multiple Regression

The coefficient of determination R² is the proportion of the total variation in the response variable that is explained by the regression model. R² can be computed by either of two formulas:

R² = SSR/SST, (1)

where SSR = Σ(ŷi − ȳ)² is the regression (or model) sum of squares and SST = Σ(yi − ȳ)²
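On simulated data, solving the normal equations (X′X)b = X′y directly gives the same estimates as a library least-squares routine, which is all a statistics package is doing under the hood. A minimal sketch:

```python
import numpy as np

# Sketch: the least-squares estimates solve the normal equations
# (X'X) b = X'y.  Simulated data, true coefficients (2, -1, 0.5).
rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=n)

beta_ne = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # library solver
```

In practice the library solver is preferred: it handles poorly conditioned design matrices more gracefully than forming X′X explicitly.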

is the (corrected) total sum of squares, or

R² = (SST − SSE)/SST, (2)

where SSE = Σ(yi − ŷi)² is the error (or residual) sum of squares.

R² is not a well-determined function of the individual correlation coefficients between y and each of the explanatory variables x1, x2, ..., x_{p-1}. In particular, R² is not the sum of the individual squared correlation coefficients except in the rare instance that the explanatory variables are orthogonal.

R² tends to exaggerate how well the fitted model will perform at predicting new observations (i.e., observations that are not in the data set). R² is sensitive to over-fitting; that is, the model fits the observed data better than new observations because the model was constructed from the observed data. For large n, over-fitting is not much of a problem. Often, statistics packages compute an adjusted R² to (partially) correct for over-fitting (the adjusted R² is also affected by over-fitting, but to a lesser extent than R²). A popular adjusted R² is computed according to the formula

R²_a = [SST/(n-1) − SSE/(n-p)] / [SST/(n-1)] = 1 − s²_ε/σ̂² = (σ̂² − s²_ε)/σ̂²,

where s²_ε = SSE/(n-p) and σ̂² = Σ(yi − ȳ)²/(n-1).

If p (the number of β's) is large relative to n (the number of observations), then SSE/(n-p) will be large and R²_a will be reduced; R²_a is penalized if p is relatively large compared to n.

Inferences about a single parameter in the general linear model
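The two R² formulas and the adjusted version are easy to compute from a fitted model. A sketch on simulated data:

```python
import numpy as np

# Sketch: R^2 = SSR/SST = 1 - SSE/SST, and adjusted
# R^2_a = 1 - [SSE/(n-p)] / [SST/(n-1)].  Simulated data.
rng = np.random.default_rng(2)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 1.0, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
sse = np.sum((y - y_hat) ** 2)         # error sum of squares
sst = np.sum((y - y.mean()) ** 2)      # (corrected) total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares

r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - p)) / (sst / (n - 1))
```

Because the model contains an intercept, SSR + SSE = SST, so formulas (1) and (2) agree; and since (n-1)/(n-p) ≥ 1, the adjusted R² can never exceed R².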

The inferential methods for a single parameter, say βj, are almost the same as for simple linear regression, aside from hidden computational details. A 100(1-α)% confidence interval for βj is

β̂j ± t_{α/2} σ̂_{βj},

where t_{α/2} is the 1 − α/2 percentile of the t distribution with df = n-p, and σ̂_{βj} is the estimated standard error of β̂j.

Example: the results from the previous example were summarized as

Table 5: Summary of the fitted interaction model showing parameter estimates and tests of significance. The model is E(Y | x) = β0 + β1·intensity + β2·x2 + β3·intensity·x2. In addition, s_ε = 6.6 and df = 20.

Parameter   Estimate   Std. Error   t   Pr(T > |t|)
β0
β1
β2
β3

A 99% CI for β3 is obtained by setting α = 0.01. From Table 2, p. 1093, t_{α/2} = t_{.005} = 2.845 (df = 20). From the computer output, β̂3 = 0.0512 and σ̂_{β3} = 0.0105. Hence, a 99% CI for β3 is

β̂3 ± t_{α/2} σ̂_{β3} = 0.0512 ± 2.845 × 0.0105 = [0.0213, 0.0811].

Because this interval excludes 0, we conclude that β3 is not 0 and that there is evidence of interaction between light intensity and timing. We can say that we are 99% confident that the true β3 lies in this interval.
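The interval computation above can be reproduced with a t quantile from software rather than a printed table. A sketch using the interaction-model values quoted in these notes (estimate 0.0512, standard error 0.0105, df = 20):

```python
from scipy import stats

# Sketch: a 100(1 - alpha)% CI for a coefficient is
#   estimate +/- t_{alpha/2, n-p} * (std. error).
# Values are the interaction-model numbers quoted in the notes.
beta3_hat, se, df = 0.0512, 0.0105, 20
alpha = 0.01
t_crit = stats.t.ppf(1 - alpha / 2, df)   # t_{0.005, 20}
ci = (beta3_hat - t_crit * se, beta3_hat + t_crit * se)
```

The computed critical value matches the tabled t_{.005} = 2.845, and the lower endpoint is positive, so the interval excludes 0.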

We cannot say that there is a 0.99 probability that the true β3 lies in this interval.

Hypothesis tests regarding βj

In this discussion, we assume that the general linear model

Y = β0 + β1x1 + β2x2 + ··· + β_{p-1}x_{p-1} + ε

has been fit. We suppose that the explanatory variables x1, ..., x_{j-1}, x_{j+1}, ..., x_{p-1} are useful for explaining the variation in Y, and we want to test H0: βj = 0 versus Ha: βj ≠ 0 after accounting for these other variables. Said another way, we want to test whether there is a linear relationship between Y and xj in the presence of the other explanatory variables. This can be accomplished by examining the t-statistic associated with β̂j computed from the model with all explanatory variables.

The advantages of this approach over testing H0: βj = 0 without consideration of the other variables are:

1. We utilize our knowledge regarding the response variable if we account for the other variables. In practical terms, the estimate of βj is more accurate than without the variables.
2. If the other variables are part of the true linear model of E(Y), then ignoring them means that we have tested H0: βj = 0 under a model that is incorrect, and the test results are not trustworthy.

The test statistic for H0: βj = 0 versus

1. Ha: βj ≠ 0,
2. Ha: βj > 0,
3. Ha: βj < 0

is

T = β̂j / σ̂_{βj}.

If H0: βj = 0 is true, then T ~ t_{n-p}. For an α-level test, the rejection regions and p-values are:

1. R = {t : |t| > t_{α/2}}, where t_{α/2} is a critical value from Table 2 (df = n-p). Also, p-value = 2P(T > |t|), where T ~ t_{n-p}.
2. R = {t : t > t_α}, where t_α is a critical value from Table 2 (df = n-p). Also, p-value = P(T > t).
3. R = {t : t < −t_α}. Also, p-value = P(T < t).

This test should be carried out with all other variables in the model that are considered to be important for modeling E(Y). It is better to have additional, unimportant variables in the model when testing H0: βj = 0 than to neglect to include important variables. Usually, including unimportant variables will not affect the test much.

Example: A test of whether the interaction variable x3 explains variation in the response variable, given x1 and x2, is obtained from examining the previous table.

The β3 line in the table is derived from the following calculations:

β̂3 = 0.0512
σ̂_{β3} = 0.0105
t = β̂3 / σ̂_{β3} = 0.0512/0.0105 = 4.87
p-value = 2P(T > 4.87) ≈ 0.0001

Based on these data, we can conclude that there is very strong evidence that the average number of flowers per plant varies with light intensity and the time at which light treatment was initiated, and that the relationship with intensity differs according to the time at which light treatment was initiated.

By testing H0: βj = 0 in the presence of the other variables, we allow the other variables to account for variation in y. Omitting these variables may produce a test statistic that is incorrect because xj may be similar to one of these other variables, and the importance of these other variables may then be ascribed to xj.

Interaction revisited

The test for significance of βj just described is useful for testing for significance of an interaction term (e.g., x3 = x1x2). Suppose that the test leads to the conclusion that the interaction (or x3) is significant. Then what should be said about x1 and x2 and their relationship to the response variable?

1. there is (statistical) evidence of an effect of x1 on the response variable, though the
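The t statistic and two-sided p-value in the calculation above can be reproduced directly from the quoted estimate, standard error, and degrees of freedom:

```python
from scipy import stats

# Sketch: two-sided p-value for H0: beta3 = 0, using the estimates
# quoted in the notes (0.0512, std. error 0.0105, df = 20).
t_stat = 0.0512 / 0.0105
p_value = 2 * stats.t.sf(t_stat, df=20)   # 2 * P(T > |t|)
```

A t statistic near 4.87 on 20 degrees of freedom is far out in the tail, so the p-value is on the order of 0.0001.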

effect depends on the level of x2
2. there is (statistical) evidence of an effect of x2 on the response variable, though the effect depends on the level of x1

Is there any point in testing for the significance of β1, the model coefficient that multiplies x1? No, because rejecting the null hypothesis H0: β1 = 0 implies that x1 is related to the response variable, and we have already concluded that this is the case. Worse, it is entirely possible that there is insufficient evidence to reject H0: β1 = 0, and if we conclude that β1 = 0 then we must also conclude that x1 is unrelated to the response variable, contradicting the previous conclusion that x1 is related to the response variable and that the effect depends on the level of x2. Consequently, we do not test for the significance of the main effects x1 and x2 when interaction is significant.

12.5 Testing a Subset of Regression Coefficients

Sometimes we want to test whether a set of parameters are all 0 versus the alternative that at least one is different from 0. For example, suppose that factor A is qualitative with k > 2 levels, and that the model may contain some other explanatory variables. To account for the k levels of factor A, we need k-1 indicator variables, say,

x1 = 1 if the level of A is 2, and 0 otherwise,
x2 = 1 if the level of A is 3, and 0 otherwise,

and so on until

x_{k-1} = 1 if the level of A is k, and 0 otherwise.

Note that level 1 has been aliased. The model is then

Y = β0 + β1x1 + β2x2 + ··· + β_{k-1}x_{k-1} + β_k x_k + ··· + β_{p-1}x_{p-1} + ε.

A test of whether factor A (a factor with k levels) explains variation in Y is accomplished by testing

H0: β1 = 0, β2 = 0, ..., β_{k-1} = 0, versus Ha: βi ≠ 0 for at least one i, where 1 ≤ i ≤ k-1.

A set of k-1 indicator variables is set up to account for factor A, and the model is fit twice, once with the indicator variables and again after removing them. A test of H0 compares the error sum of squares of the model with all indicator variables to the error sum of squares of the model without the indicator variables.

A large difference in error sums of squares is evidence that factor A explains variation in the response. Specifically, the test compares the fit of

Y = β0 + β1x1 + β2x2 + ··· + β_{k-1}x_{k-1} + β_k x_k + ··· + β_{p-1}x_{p-1} + ε

to the fit of

Y = β0 + β_k x_k + ··· + β_{p-1}x_{p-1} + ε.

If there is a big difference in the estimated error of the fitted models, then we conclude that there is evidence that factor A explains variation in Y. Formally, let

1. M1 denote the model with the indicator variables for A, SSE1 the error sum of squares associated with M1, and df1 the degrees of freedom associated with M1;
2. M2 denote the model without the indicator variables for A, SSE2 the error sum of squares associated with M2, and df2 the degrees of freedom associated with M2.

The test statistic is

F = [(SSE2 − SSE1)/(df2 − df1)] / [SSE1/df1].

Under H0, F has an F-distribution with n1 = df2 − df1 = k-1 numerator and n2 = df1 denominator degrees of freedom.

We reject H0 at the α-level if F > f_α, where α = P(F_{n1,n2} > f_α) is the probability that an F random variable with n1 numerator and n2 denominator degrees of freedom takes a value larger than f_α. A p-value for the test is p-value = P(F_{n1,n2} > F).

Ott and Longnecker's discussion (p. 658) differs slightly from this set-up, primarily because they use the difference in regression sums of squares between the complete model (the one containing the factor of interest) and the reduced model (the one without the factor). The difference between the regression sums of squares and the difference between the error sums of squares are equal, so either numerator (theirs or mine) is correct.

Example. In the Harris bank lawsuit analysis, a test of the association between gender and monthly salary increase can be obtained by testing for the joint significance of gender and age × gender:

1. the residual sum of squares from the model containing both gender and age × gender (and seniority and age) was SSE1 = 225, with df1 = 88;
2. the residual sum of squares from the model without gender and age × gender (but containing seniority and age) was SSE2 = 246, with df2 = 90.

The test statistic is

F = [(246 − 225)/(90 − 88)] / (225/88) = 4.11.

The p-value is equal to P(F_{2,88} > 4.11) ≈ 0.0197. Hence there is convincing evidence of an association between gender and monthly salary increase. In addition, the effect of gender depends on age.
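The Harris bank F statistic and its p-value can be recomputed from the sums of squares quoted above:

```python
from scipy import stats

# Sketch: nested-model F test from the Harris bank example.
# Full model: SSE1 = 225 on df1 = 88; reduced: SSE2 = 246 on df2 = 90
# (sums of squares taken from the notes).
sse1, df1 = 225.0, 88
sse2, df2 = 246.0, 90
F = ((sse2 - sse1) / (df2 - df1)) / (sse1 / df1)
p_value = stats.f.sf(F, df2 - df1, df1)   # P(F_{2,88} > F)
```

The arithmetic reproduces F ≈ 4.11 with a p-value of about 0.02, matching the conclusion of convincing evidence of an association.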

Example from Fouts, R.S. 1973. Acquisition and testing of gestural signs in four young chimpanzees, Science, 180. Fouts taught 4 chimpanzees 10 signs of American Sign Language with the intent of determining whether some signs are easier to learn, and whether some chimps tend to learn more quickly than others.

Table 6: Data (time in minutes to learn a word), by chimpanzee (Booee, Cindy, Bruno, Thelma) and word (Listen, Drink, Shoe, Key, More, Food, Fruit, Hat, Look, String).

Multiple regression analysis can be used to investigate whether there are differences between chimps and words with respect to learning times. Two factors are identified: chimp (with 4 levels) and word (with 10 levels). Indicator variables are set up to identify each of the levels. For example, differences among chimps (in learning ability) are accounted for by three indicator variables.

I am not particularly interested in investigating whether there is interaction between words and chimps. In any case, there is not enough data to investigate interaction. An investigation of interaction would require 3 × 9 = 27 indicator variables. Then there would be 1 + 3 + 9 + 27 = 40 parameters in the model. Since there are only n = 40 observations, the model would fit perfectly with SSE = 0 and R² = 1, and none of the F-tests could be evaluated because they use SSE in the denominator.

I consider only a main effects model. The indicator variables for the chimp factor are set according to

x1 = 1 if the chimp is Cindy, and 0 otherwise,
x2 = 1 if the chimp is Bruno, and 0 otherwise,
x3 = 1 if the chimp is Thelma, and 0 otherwise.

The word factor requires k-1 = 10-1 = 9 indicator variables. The fitted model including both factors is shown in Table 7.

Table 7: Summary of the fitted main effects model showing parameter estimates and tests of significance. s_ε = 0.809 and df = 27.

Variable   Estimate   Std. Error   t   Pr(T > |t|)
Constant
chimp (3 indicator variables)
word (9 indicator variables)

The residual plots indicate that the residuals are approximately normal in distribution, though there is substantial evidence that the assumption of constant variance does not hold.

Figure 5: Standardized residuals plotted against fitted values (left panel), and a normal distribution quantile plot of the residuals (right panel), derived from the regression on time (minutes).

Some fitted values are negative, which suggests that the model does not fit well, at least in a logical sense. The response variable is therefore replaced by its natural logarithm, and the model is re-fit using this new response variable. The non-constant variance problem is resolved, though now the assumption of normality is somewhat more questionable.

To test whether there are differences between chimps with respect to learning times, we compare the residual sums of squares of the model with chimp and word and the model with word alone.

Figure 6: Standardized residuals plotted against fitted values (left panel), and a normal distribution quantile plot of the residuals (right panel), derived from the regression on the natural logarithm of time. The plots indicate that the non-constant variance problem is resolved, though the normality assumption is somewhat more questionable.

Formally, the hypotheses are H0: β1 = β2 = β3 = 0 versus Ha: β1 ≠ 0 or β2 ≠ 0 or β3 ≠ 0.

SSE1 = 17.65, MSE1 = 0.654, df1 = 27 (both chimp and word are in the model)
SSE2 = 22.99, df2 = 30 (only word is in the model)

The F-statistic testing H0 versus Ha is

F = [(SSE2 − SSE1)/(df2 − df1)] / (SSE1/df1) = [(22.99 − 17.65)/(30 − 27)] / (17.65/27) = (5.34/3)/0.654 = 2.72

p-value = Pr(F_{3,27} > 2.72)
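The chimp-effect F test can be recomputed from the sums of squares above (with the full chimp-and-word model on 40 − 13 = 27 error degrees of freedom and the word-only model on 30):

```python
from scipy import stats

# Sketch: F test for the chimp effect on log learning time, using the
# sums of squares quoted in the notes: full model (chimp + word)
# SSE1 = 17.65 on 27 df; word-only model SSE2 = 22.99 on 30 df.
sse_full, df_full = 17.65, 27
sse_red, df_red = 22.99, 30
F = ((sse_red - sse_full) / (df_red - df_full)) / (sse_full / df_full)
p_value = stats.f.sf(F, df_red - df_full, df_full)   # P(F_{3,27} > F)
```

The resulting F of about 2.72 sits between the 10% and 5% critical values of F with (3, 27) degrees of freedom, which is why the conclusion below speaks of "some" rather than "strong" evidence of chimp differences.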

The test of significance for differences between words (comparing the SSEs of the model with chimp and word to the model with chimp alone) is carried out the same way, and is highly significant. The results are collected in an analysis of variance (ANOVA) table:

Table 8: Analysis of variance table. The response variable is log time.

Source      Degrees of freedom   Sums of squares   Mean squares   f   Pr(F > f)
Chimp
Word
Residuals

Figure 7: Fitted model, by animal (learning time for each word, for Booee, Cindy, Bruno, and Thelma).

Conclusions: these data provide some statistical evidence that there are differences among chimps with respect to learning times, and strong evidence of differences among words with respect to learning times.

Caveat: the population of interest can only be the four chimps, because we cannot assume that they are a sample from a larger population of chimps. This forces us to identify the population of interest as the four chimps, and hence statistical inference is limited to those four. On the other hand, I am comfortable in informally concluding that these data suggest that there are differences among chimps, in general, with respect to their ability to learn word signs.

12.6 Forecasting Using Multiple Regression

The objectives are to estimate the expected response E(Y), given a set of values for the explanatory variables, and to predict Y, given a set of values for the explanatory variables. The estimate is

Ê(Y) = β̂0 + β̂1x1 + β̂2x2 + ··· + β̂_{p-1}x_{p-1},

and the prediction is the same:

Ŷ = β̂0 + β̂1x1 + β̂2x2 + ··· + β̂_{p-1}x_{p-1}.

Confidence and prediction intervals for E(Y) and Y are quite difficult to compute by hand, so we use computers. The general structure and the interpretation are the same as in simple linear regression. A 100(1-α)% confidence interval for E(Y) is

Ê(Y) ± t_{α/2} σ̂_{Ê(Y)},

32 where σê(y ) is the estimated standard error of the estimate Ê (Y ), and α/2 = P (t n p > t α/2 ) is the probability that a t random variable with n p degrees of freedom will take on a value larger than t α/2 σê(y ) should be computed by computer A 100 (1 α)% prediction interval for Y is Ŷ ± t α/2 σŷ where σŷ is the estimated standard error of the prediction Ŷ, and α/2 = P (t n p > t α/2 ) is the probability that a t random variable with n p degrees of freedom will take on a value larger than t α/2 Remarks on the Preparation of Statistical Reports Conciseness is most important. Present only the information that is necessary to answer the question(s). For maximum efficiency, write the results first, then the methods, and conclusion, and finally, the summary. The order of the report is opposite, though. Here are the main sections and what belongs in each. The length depends on the complexity of the question, so the guidelines below are fairly loose 1. Summary - A brief paragraph sketching the research question, the data, how it was analyzed, and your conclusions. Leave out details. 2. Research Objectives - state the research questions(s). Then, identify specific objectives that will be pursued towards the goal of answering the questions. For example, my research question might be: was there gender bias in wages paid to full-time 32

33 U.S. workers during the 1980 s?. Objectives: 1) Find a model that explains as much of the variation in wages as possible (but ignores gender). 2) Determine if there is residual variation about this model that is attributable to gender. 3. Methods - Explain what methods (e.g., tests, confidence intervals, transformations) were used, and for what. (It is not necessary to explain the method, though, unless it is unusual or new). For example, I may write...scatterplots were used to visually assess the relationship between Cesium concentration in mushrooms and soils. Plots of residuals versus fitted values were used to assess whether the normal distribution assumption was valid. T -statistics were used to test the hypothesis of a linear association between Cesium concentration in mushrooms and soils. Specifically, I tested H 0 : β 1 = 0 versus H a : β 1 > 0 using the test statistic T = β 1 / σ β Results - First, present some information describing the data (e.g., scatterplots) relevant to the question(s). Note any outliers, influential points, skewness, etc. Present whatever estimates, tests (name the statistic, its value, and a p-value, if possible), and/or confidence intervals that support your conclusions. The presentation of results should follow the objectives. Do not include computer output that is not referred to elsewhere. 5. Conclusions - state your conclusion concisely and to the point. E.g., There is strong statistical evidence of a linear association between Cesium concentration in mushrooms and soils (p-value= 0.023). Discuss any problems (e.g., the normality assumption is suspect; there are points of high leverage, etc.) Tables. All tables require a legend explaining the contents and any abbreviations 33

34 used within. The legend belongs immediately above or below the table. Figures. All figure require a legend explaining the contents and any abbreviations used within. The legend belongs immediately above or below the figure. Logistic Regression (Section 12.8) This is a very important topic that deserves substantially more attention than the treatment given by Ott and Longnecker. We will delay the study of this subject until linear regression has been finished Multiple Regression Theory (Section 12.9) Also a very important topic that deserves substantially more attention than the treatment given by Ott and Longnecker. We will not study of this subject in
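The confidence- and prediction-interval formulas earlier in this section reduce to simple arithmetic once software supplies the standard errors. A sketch with hypothetical numbers (the coefficient estimates, standard errors, and t quantile below are illustrative assumptions, not output from the meadowfoam fit):

```python
# 95% intervals for E(Y) and for a new Y at x_1 = 500 (light intensity),
# x_2 = 1 (onset prior to FI), using the model Y = b0 + b1*x1 + b2*x2 + error.
b = [71.3, -0.0405, 11.2]   # hypothetical estimates: intercept, light, timing
x = [1.0, 500.0, 1.0]       # 1 for the intercept, then x_1, x_2
t_crit = 2.056              # t_{alpha/2} with n - p = 26 df, from a t table
se_mean = 2.9               # hypothetical SE of the estimate of E(Y)
se_pred = 6.8               # hypothetical SE of the prediction (always larger)

est = sum(bi * xi for bi, xi in zip(b, x))              # E(Y)-hat = Y-hat
ci = (est - t_crit * se_mean, est + t_crit * se_mean)   # for the mean response
pi = (est - t_crit * se_pred, est + t_crit * se_pred)   # for a new observation
```

The point estimate is the same for both targets; only the standard error differs, which is why the prediction interval is always the wider of the two.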


More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there

More information

Simple Linear Regression

Simple Linear Regression 9-1 l Chapter 9 l Simple Linear Regression 9.1 Simple Linear Regression 9.2 Scatter Diagram 9.3 Graphical Method for Determining Regression 9.4 Least Square Method 9.5 Correlation Coefficient and Coefficient

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

9 Correlation and Regression

9 Correlation and Regression 9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

More information

In ANOVA the response variable is numerical and the explanatory variables are categorical.

In ANOVA the response variable is numerical and the explanatory variables are categorical. 1 ANOVA ANOVA means ANalysis Of VAriance. The ANOVA is a tool for studying the influence of one or more qualitative variables on the mean of a numerical variable in a population. In ANOVA the response

More information

The General Linear Model

The General Linear Model The General Linear Model Thus far, we have discussed measures of uncertainty for the estimated parameters ( β 0, β 1 ) and responses ( y h ) from the simple linear regression model We would like to extend

More information

Simple Linear Regression: One Qualitative IV

Simple Linear Regression: One Qualitative IV Simple Linear Regression: One Qualitative IV 1. Purpose As noted before regression is used both to explain and predict variation in DVs, and adding to the equation categorical variables extends regression

More information

Diagnostics and Remedial Measures

Diagnostics and Remedial Measures Diagnostics and Remedial Measures Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Diagnostics and Remedial Measures 1 / 72 Remedial Measures How do we know that the regression

More information