STK 2100 Oblig 1. Zhou Siyu. February 15, 2017


Question 1

a) Make a scatter plot of the data set.

Answer: Here is the code I used to produce the scatter plot matrix in R:

    library(MASS)
    pairs(Boston)

Figure 1: Scatter plot matrix of the Boston data set.

Here we get a general idea of the relationships between the response and each of the predictor variables, as well as of the relationships between any two predictor variables.

b) Divide the data set into training and test data sets. Discuss the advantages and shortcomings of doing so.

Answer: The R code is as the question suggests:

    # b)
    set.seed(345)
    ind <- sample(1:nrow(Boston), 250, replace = FALSE)
    Boston.train <- Boston[ind, ]
    Boston.test <- Boston[-ind, ]

In splitting the data set we end up with fewer observations to train the model, which is a shortcoming in terms of model accuracy and efficiency. On the other hand, we also gain information about the model's predictive performance by testing it on the held-out test data: it is with the test data that we can see how far the predictions deviate from the true values, and thereby evaluate the model's accuracy.

c) Explain the important assumptions of the linear model. Use crim as the response variable, fit the model to the training data and discuss the result.

Answer: For linear regression models, some important conditions should be satisfied. First and foremost, the error terms ε_i should be independent of each other; this is arguably the most important assumption, since otherwise we would see correlation in the errors of the response. Another condition is that the errors have mean zero and constant variance, E[ε_i] = 0 and Var[ε_i] = σ², which can also be a strong condition. Last but not least, we usually require the ε_i to be normally distributed.
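These assumptions can be examined once the model below has been fitted, using R's standard residual diagnostics. A minimal sketch (assuming the fitted object fit.lim from the next code block):

    # Sketch: standard lm diagnostics for checking the error assumptions
    par(mfrow = c(2, 2))
    plot(fit.lim)   # residuals vs fitted, normal Q-Q, scale-location, leverage
    par(mfrow = c(1, 1))

The residuals-versus-fitted and scale-location panels speak to the zero-mean and constant-variance assumptions, and the normal Q-Q plot to the normality assumption.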

To fit the model to the training data, we run the following code in R:

    # c)
    > fit.lim = lm(crim ~ ., data = Boston.train)
    > summary(fit.lim)

In the summary, most coefficients are not significant; only dis (**), rad (***, p-value on the order of 1e-06) and medv (***) stand out. The residual degrees of freedom are 236, the multiple R-squared is 0.45, and the F-statistic is 14.92 on 13 and 236 DF with p-value < 2.2e-16.

Here we see that a linear regression model has been fitted to the training data. The problem with this model is that many of the coefficients have very large p-values, which suggests they may not belong in the model. At the same time, R² and the adjusted R² are too small to convince us that this is a reliable model.

d) Remove the predictor variable with the largest p-value and fit a new model. Explain why this is a reasonable procedure. Explain the new p-values in terms of correlation between the predictor variables.

Answer: We can sort the coefficient table with the following R code and see which variable has the largest p-value:

    > newmodel = summary(fit.lim)$coefficients
    > newmodel = newmodel[order(-newmodel[, "Pr(>|t|)"]), ]
    > newmodel

The output lists the coefficients from largest to smallest p-value: chas, age, black, tax, indus, lstat, ptratio, rm, (Intercept), nox, zn, dis, medv, rad. I decide to use the update function to remove the predictor variable with the largest p-value, namely chas, and then fit a new model. This is a reasonable procedure because the term with the largest p-value is the one for which the data provide the least evidence of an effect once the other predictors are accounted for. The code is:

    > fit.lim = update(fit.lim, ~ . - chas)
    > summary(fit.lim)

    Call:
    lm(formula = crim ~ zn + indus + nox + rm + age + dis + rad +
        tax + ptratio + black + lstat + medv, data = Boston.train)

Again only dis (**), rad (***, p-value on the order of 1e-06) and medv (***) are clearly significant. The residual standard error is 7.52 on 237 degrees of freedom, the multiple R-squared is 0.45, and the F-statistic is 16.23 on 12 and 237 DF with p-value < 2.2e-16. Sorting the coefficients by p-value as before,

    > newmodel = summary(fit.lim)$coefficients
    > newmodel = newmodel[order(-newmodel[, "Pr(>|t|)"]), ]
    > newmodel

now gives the order age, black, tax, indus, lstat, ptratio, rm, (Intercept), zn, nox, dis, medv, rad.

Comparing the two results, we see that by removing chas we have actually obtained a slightly better model: the adjusted R² has increased slightly while R² remains essentially unchanged. This means that the new model explains the data at least as well as the previous one, with one predictor fewer. At the same time, the p-values and standard errors of the remaining predictor variables change very little after chas is removed, which suggests that chas is nearly uncorrelated with the other predictors. If some predictor variable were strongly collinear with chas, we would expect its estimate and standard error to change markedly (the standard error would typically shrink) once chas was removed.
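To check this more directly, one could look at how strongly chas is correlated with the other predictors; a minimal sketch (my own check, using the training split from part b):

    # Sketch: correlations between chas and the remaining predictors
    round(cor(Boston.train$chas,
              Boston.train[, setdiff(names(Boston.train), c("crim", "chas"))]), 2)

Small values here are consistent with the observation that removing chas hardly changes the other estimates.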

e) Keep improving the model in this way until you get a reasonable model. Make different plots to show that this selection is reasonable.

Answer: We could, of course, remove each variable by hand, but the following R code makes it easier: it repeatedly refits the model, at each pass dropping the term with the largest p-value, until every remaining p-value is below the chosen threshold alpha.

    > y.name <- "crim"
    > alpha <-
    > fit.lim = lm(crim ~ ., data = Boston.train)
    > beta <- max(max(summary(fit.lim)$coefficients[, 4]), alpha)
    > tokeep <- which(summary(fit.lim)$coefficients[, 4] < beta)
    > print(length(tokeep))
    [1] 13
    > while (beta > alpha)
    + {
    +   if (length(tokeep) == 0)
    +   {
    +     warning("Nothing is significant")
    +     break
    +   }
    +   if (names(tokeep)[1] == "(Intercept)")
    +   {
    +     names(tokeep)[1] <- "1"
    +   } else
    +   {
    +     names(tokeep)[1] <- "-1"
    +   }
    +   form <- as.formula(paste(y.name, "~", paste(names(tokeep), collapse = "+")))
    +   fit.lim = lm(formula = form, data = Boston.train)
    +   beta <- max(max(summary(fit.lim)$coefficients[, 4]), alpha)
    +   newmodel = summary(fit.lim)$coefficients
    +   tokeep <- which(summary(fit.lim)$coefficients[, 4] < beta)
    +   if (length(tokeep) == 1)
    +   {
    +     names(tokeep) <- row.names(summary(fit.lim)$coefficients)[1]
    +   }
    +   print(names(tokeep))
    +   print(length(tokeep))
    + }

The loop prints the variables kept at each pass. After chas is excluded at the start, it successively removes age, black, tax, indus, lstat, rm, ptratio and nox, and stops with the terms (Intercept), zn, dis, rad and medv. The summary of the final fit, summary(fit.lim), shows that every remaining term is now significant: (Intercept) (**), zn (*), dis (*), rad (***, p-value on the order of 1e-15) and medv (***, p-value on the order of 1e-05), with 245 residual degrees of freedom and an overall F-statistic on 4 and 245 DF with p-value < 2.2e-16.
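Essentially the same backward elimination can be carried out with R's built-in tools; a minimal sketch for comparison (not the method used above): drop1() reports a partial F-test for dropping each term, and step() performs backward selection by AIC rather than by p-values.

    # Sketch: built-in alternatives to the manual p-value loop
    drop1(lm(crim ~ ., data = Boston.train), test = "F")
    fit.step <- step(lm(crim ~ ., data = Boston.train),
                     direction = "backward", trace = 0)
    summary(fit.step)

Note that AIC-based selection need not stop at the same model as the p-value loop.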

We can also run backward model selection with the following code in R:

    > library(leaps)
    > fit.backward = regsubsets(crim ~ ., data = Boston.train, nvmax = 13, method = "backward")
    > summary.backward = summary(fit.backward)
    > summary.backward

The selection table shows which variables are included at each model size: backward selection keeps rad alone in the one-variable model, then adds medv, dis, zn, nox, ptratio, rm, lstat, indus, tax, black, age and finally chas as the model size grows from 1 to 13.

    > summary.backward$adjr2
    > summary.backward$cp

From the adjusted R² and Mallows' Cp values we see that the adjusted R² increases only slightly once the model contains more than four predictor variables, while Cp is only slightly bigger than m + 1 with four variables. Therefore we decide to keep four variables:

    > coef(fit.backward, 4)
    (Intercept)          zn         dis         rad        medv

And if we fit the model with the selected predictor variables:

    > fit.lim2 = lm(crim ~ zn + dis + rad + medv, data = Boston.train)
    > summary(fit.lim2)

    Call:
    lm(formula = crim ~ zn + dis + rad + medv, data = Boston.train)

The summary is the same as the one produced by the p-value loop above: (Intercept) (**), zn (*), dis (*), rad (***) and medv (***) are all significant, with 245 residual degrees of freedom and an F-statistic on 4 and 245 DF with p-value < 2.2e-16.

Figure 2: Scatter plots for the model with four predictor variables.

I have also plotted the standardised residuals against each predictor variable in the model. As can be seen in the figures, there seems to be a pattern: as a predictor variable gets larger, the standardised residuals tend to shrink. Based on these plots, as well as the low adjusted R², I tend to think we need to consider interactions between the predictor variables, or models with polynomial terms.

Figure 3: Standardised residuals plotted against each predictor variable.
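As a follow-up to that point, a minimal sketch of what such an extension could look like (the squared and interaction terms here are hypothetical choices for illustration, not part of the selected model):

    # Sketch: adding a polynomial term and an interaction to the selected model
    fit.poly <- lm(crim ~ zn + dis + rad + medv + I(medv^2) + dis:rad,
                   data = Boston.train)
    summary(fit.poly)
    anova(fit.lim2, fit.poly)  # partial F-test against the four-variable model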

f) Use the averaged squared error to see how good the model is.

Answer: To calculate the required error term on the test data, we run the following code in R:

    > fit.lim3 = lm(crim ~ zn + dis + rad + medv, data = Boston.train)
    > sum((Boston.test$crim - predict(fit.lim3, data.frame(Boston.test)))^2) / n
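Here n is the number of test observations, nrow(Boston.test); note that it is only defined in the next code block, so it must be set before this line is run. A minimal equivalent one-liner that avoids the separate n:

    # Sketch: test-set averaged squared error in one step
    mean((Boston.test$crim - predict(fit.lim3, Boston.test))^2)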

In the meantime, if we run the following R code, we can see how the error term changes with the number of predictor variables used in the model:

    > n = nrow(Boston.test)
    > k = ncol(Boston.test) - 1
    > mat = model.matrix(crim ~ ., data = Boston.test)
    > fit.lm = regsubsets(crim ~ ., data = Boston.train, nvmax = k, method = "backward")
    > cv = rep(0, k)
    > for (m in 1:k)
    + {
    +   coef.m = coef(fit.lm, m)
    +   for (i in 1:n)
    +   {
    +     pred = sum(mat[i, names(coef.m)] * coef.m)
    +     diff = (Boston.test$crim[i] - pred)^2
    +     cv[m] = cv[m] + diff
    +   }
    + }
    > cv / n
    > cv[4] / n

We see that the model with four predictor variables has an averaged squared error only slightly larger than the model using a single variable. This is consistent with the previous results and supports the model selection.

g) Repeat the model selection procedure, using the whole data set. Discuss the difference.

Answer: With a slight change of code — the same backward-elimination loop as in part e), now applied to fit.lim = lm(crim ~ ., data = Boston) — we see that the selected model changes with the data set. The loop successively removes chas, age, tax, rm, indus, lstat and ptratio, and stops with the terms (Intercept), zn, nox, dis, rad, black and medv.

The summary of the final fit, summary(fit.lim), now shows (Intercept) (***, p-value on the order of 1e-05), zn (**), nox (*), dis (***), rad (***, p < 2e-16), black (*) and medv (***, p-value on the order of 1e-07), with 499 residual degrees of freedom, multiple R-squared 0.444, and an F-statistic on 6 and 499 DF with p-value < 2.2e-16.

So the model now keeps six predictor variables, compared with four before. The averaged squared error becomes:

    > n = nrow(Boston)
    > fit.lim4 = lm(crim ~ zn + dis + rad + medv + black + nox, data = Boston)
    > sum((Boston$crim - predict(fit.lim4, data.frame(Boston)))^2) / n

We see that the error seems to get bigger. At the same time, the biggest problem is that we no longer have any unused data with which to test the accuracy of the predictions.

    > n = nrow(Boston)
    > k = ncol(Boston) - 1
    > mat = model.matrix(crim ~ ., data = Boston)
    > fit.lm = regsubsets(crim ~ ., data = Boston, nvmax = k, method = "backward")
    > cv = rep(0, k)
    > for (m in 1:k)
    + {
    +   coef.m = coef(fit.lm, m)
    +   for (i in 1:n)
    +   {
    +     pred = sum(mat[i, names(coef.m)] * coef.m)
    +     diff = (Boston$crim[i] - pred)^2
    +     cv[m] = cv[m] + diff
    +   }
    + }
    > cv / n
    > cv[6] / n

Another problem with this solution is that, because there is no fresh data to predict on, the error terms will keep decreasing as the number of predictor variables increases. (The error term obtained in either way is the same.)
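A way around this, sketched below with boot::cv.glm (my own suggestion, not required by the exercise), is to estimate the prediction error of the six-variable model by 10-fold cross-validation on the full data, so that every observation is predicted from a fit that did not use it:

    # Sketch: 10-fold cross-validation estimate of the prediction error
    library(boot)
    fit.glm <- glm(crim ~ zn + dis + rad + medv + black + nox, data = Boston)
    set.seed(345)
    cv.glm(Boston, fit.glm, K = 10)$delta[1]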

Question 2

a) Show that the two models are equivalent.

Answer: To show the equivalence, note that each observation belongs to exactly one category, so that Σ_{k=1}^{K} x_{ik} = 1 for every i. Then

    Y_i = β_0 + β_2 x_{i2} + ... + β_K x_{iK} + ε_i
        = β_0 Σ_{k=1}^{K} x_{ik} + β_2 x_{i2} + ... + β_K x_{iK} + ε_i
        = β_0 x_{i1} + (β_0 + β_2) x_{i2} + ... + (β_0 + β_K) x_{iK} + ε_i,

from which we can read off the correspondence between α and β:

    α_1 = β_0,    α_j = β_0 + β_j   for j = 2, 3, ..., K.

Each α_j represents the mean of the population in category j, and it corresponds to the β's according to the rule above. In the first model β_0 serves as the baseline (the mean of category 1), while the differences between the categories are captured by the β_j: for j = 2, 3, ..., K, β_j is the difference between the mean of category j and the mean of category 1.
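To see the correspondence concretely, here is a minimal sketch of the design matrices R builds for the two parameterisations (a toy three-level factor of my own, not the assignment data):

    # Sketch: design matrices for the two parameterisations of a factor
    f <- factor(c("a", "a", "b", "c"))
    model.matrix(~ f)      # model 1: intercept plus dummies for levels 2..K
    model.matrix(~ f + 0)  # model 2: one indicator column per level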

b) Show that the matrix X^T X has certain properties, and derive the related results.

Answer: Let X be the n × K design matrix of model 2, whose (i, k) entry is the indicator x_{ik}. The (j, k) element of X^T X is

    (X^T X)_{jk} = Σ_{i=1}^{n} x_{ij} x_{ik}.

Since each observation falls into exactly one category, x_{ij} x_{ik} = 0 whenever j ≠ k, so the resulting matrix is diagonal, with Σ_{i=1}^{n} x_{ij}² as the j-th element on the main diagonal. Moreover, it is easy to see that Σ_{i=1}^{n} x_{ij}² = n_j, since the other terms in the sum are 0; this is the number of observations that fall into category j. Hence

    X^T X = diag(n_1, n_2, ..., n_K).

For X^T y, the j-th element of the resulting vector is

    (X^T y)_j = Σ_{i=1}^{n} x_{ij} y_i = Σ_{i: c_i = j} y_i,

namely the sum of the responses that fall into category j.

To find the least squares estimator we use matrix differentiation:

    RSS = (y − Xα)^T (y − Xα) = y^T y − 2 α^T X^T y + α^T X^T X α.

Differentiating RSS with respect to α and setting the derivative equal to 0,

    ∂RSS/∂α = −2 X^T y + 2 X^T X α = 0,

we obtain the estimator

    ˆα = (X^T X)^{-1} X^T y = ( (1/n_1) Σ_{i: c_i = 1} y_i, ..., (1/n_K) Σ_{i: c_i = K} y_i )^T,

where the last equality follows from the results above. In this way we see that ˆα_j is, reasonably, the mean of the responses that fall into category j.
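These properties are easy to verify numerically; a minimal sketch with toy data (my own illustration):

    # Sketch: X'X is diagonal and the LS estimates are the group means
    y <- c(1, 2, 4, 8)
    f <- factor(c("a", "a", "b", "c"))
    X <- model.matrix(~ f + 0)            # cell-means coding (model 2)
    crossprod(X)                          # diag(n_1, ..., n_K)
    solve(crossprod(X), crossprod(X, y))  # least squares estimate of alpha
    tapply(y, f, mean)                    # the same group means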

c) Construct a least squares estimator ˆβ from ˆα, and show that it is indeed a least squares estimator.

Answer: According to the one-to-one correspondence established above, we can construct ˆβ from ˆα as

    ˆβ_0 = ˆα_1,    ˆβ_j = ˆα_j − ˆα_1   for j = 2, 3, ..., K.

I prove that ˆβ is a least squares estimator by contradiction. Assume for the moment that the ˆβ obtained in this way is not a least squares estimator, so that there exists a least squares estimator ˆβ_L with strictly smaller residual sum of squares. From ˆβ_L we can construct an ˆα_L using the rule above, and then

    ||y − X_1 ˆβ_L||² = ||y − X ˆα_L||²,

where X_1 and X are the design matrices of models 1 and 2; the equality holds because the one-to-one correspondence between ˆα_L and ˆβ_L means the two parameterisations give identical fitted values. Then ˆα_L would have a strictly smaller residual sum of squares than ˆα, which contradicts the definition of ˆα as the least squares estimator. Hence ˆβ is indeed a least squares estimator.

d) Show that model 3 is equivalent to models 1 and 2, and give an interpretation of the coefficients.

Answer: First we show the equivalence between model 3 and model 2. Using Σ_{k=1}^{K} x_{ik} = 1 again,

    Y_i = γ_0 + γ_1 x_{i1} + ... + γ_K x_{iK} + ε_i
        = γ_0 Σ_{k=1}^{K} x_{ik} + γ_1 x_{i1} + ... + γ_K x_{iK} + ε_i
        = (γ_0 + γ_1) x_{i1} + (γ_0 + γ_2) x_{i2} + ... + (γ_0 + γ_K) x_{iK} + ε_i.

Together with the constraint Σ_{j=1}^{K} γ_j = 0, this gives the correspondence between γ and α:

    γ_0 = (1/K) Σ_{k=1}^{K} α_k,
    γ_j = α_j − (1/K) Σ_{k=1}^{K} α_k   for j = 1, 2, ..., K.

Here γ_0 represents the overall mean (the unweighted average of the category means), regardless of category, while γ_j for j = 1, 2, ..., K represents the effect of category j on the response, i.e. how far the mean of category j deviates from that overall mean.
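The equivalence of the three parameterisations can also be checked numerically, since they produce identical fitted values; a minimal sketch with toy data (my own illustration):

    # Sketch: the three parameterisations give the same fitted values
    y <- c(1, 2, 4, 8)
    f <- factor(c("a", "a", "b", "c"))
    fit1 <- lm(y ~ f)                                     # model 1 (treatment contrasts)
    fit2 <- lm(y ~ f + 0)                                 # model 2 (cell means)
    fit3 <- lm(y ~ f, contrasts = list(f = "contr.sum"))  # model 3 (sum-to-zero)
    all.equal(fitted(fit1), fitted(fit2))
    all.equal(fitted(fit1), fitted(fit3))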

e) Try the following code in R and discuss its meaning.

Answer: First we see how the code runs in R, and whether anything goes amiss:

    > # e)
    > Fe = read.table("http://www.uio.no/studier/emner/matnat/math/STK2100/v17/fe.txt", header = T, sep = ",")
    > fit = lm(Fe ~ form + 0, data = Fe)
    > summary(fit)

In the summary, form is fitted with a single coefficient (highly significant, p < 2e-16), a residual standard error on 39 degrees of freedom, and an F-statistic of 33.6 on 1 and 39 DF with p-value < 2.2e-16. The problem here is that the different forms are not being treated as different categories, but as a single quantitative predictor. In addition, since the command corresponds to model 2, there is no intercept in the model either. But if we add one extra command, as the question suggests, the model works as intended:

    > Fe$form = as.factor(Fe$form)
    > fit = lm(Fe ~ form + 0, data = Fe)
    > summary(fit)

Now the output contains one coefficient for each of form1, form2, form3 and form4, all highly significant (p < 2e-16), with a residual standard error on 36 degrees of freedom and an F-statistic on 4 and 36 DF with p-value < 2.2e-16. This model clearly corresponds to model 2, with the intercept restricted to 0, and each form coefficient in the table corresponds to one categorical coefficient α_j in model 2.

f) Try the following code in R, determine which model it corresponds to, and list all the coefficients.

Answer: As can be seen in the following output, the first code snippet corresponds to model 1, namely the parameterisation with β_1 = 0, while the second corresponds to model 3, where Σ_{j=1}^{K} γ_j = 0 is the restriction. The first parameterisation is R's default (treatment contrasts), while the restriction for model 3 is imposed through the command options(contrasts = c("contr.sum", "contr.sum")).

    > options()$contrasts
    [1] "contr.treatment" "contr.poly"
    > fit2 = lm(Fe ~ form, data = Fe)
    > summary(fit2)

Here the coefficients are (Intercept) (***, p < 2e-16), form2, form3 (*) and form4 (***, p-value on the order of 1e-05), with a residual standard error on 36 degrees of freedom and an F-statistic of 10.85 on 3 and 36 DF, p-value 3.99e-05.

    > options(contrasts = c("contr.sum", "contr.sum"))
    > options()$contrasts
    [1] "contr.sum" "contr.sum"
    > fit3 = lm(Fe ~ form, data = Fe)
    > summary(fit3)

Now the coefficients are (Intercept) (***, p < 2e-16), form1 (*), form2 (***) and form3, again with a residual standard error on 36 degrees of freedom and the same F-statistic of 10.85 on 3 and 36 DF, p-value 3.99e-05.

The three sets of regression coefficients correspond, respectively, to ˆα, ˆβ and ˆγ. The models are equivalent to each other, but the individual coefficients differ; the rules connecting them were established in the earlier questions.

g) Do an analysis of variance on the iron types. Make use of the results from the previous questions.

Answer: To tell whether there is any difference between the four types of iron, we ask whether the β_j for j = 2, 3, 4 (equivalently, the γ_j for j = 1, 2, 3, 4) are all equal to 0 at the same time. In other words, we test the hypothesis

    H_0: β_2 = β_3 = β_4 = 0

against the alternative that at least one of them is not 0:

    H_a: at least one β_j ≠ 0.

We can run the F-test

    F = [(TSS − RSS) / p] / [RSS / (n − p − 1)].

Fortunately, R has already run this test and printed the result in the last line of the summary table, where the p-value of the F-test is 3.99e-05. Therefore we can safely reject H_0. We can also run an analysis of variance for the model:

    > anova(fit2)
    > anova(fit3)

The analysis-of-variance table for anova(fit2) attributes 3 degrees of freedom to form and 36 to the residuals, with Pr(>F) = 3.99e-05 (***).

Running anova(fit3) produces exactly the same table, so the conclusion does not depend on the parameterisation.

h) Suggest a reasonable way to simplify the model further.

Answer: We can use an F-test to see whether we can leave out q of the coefficients,

    H_0: β_2 = β_3 = ... = β_q = 0,

with the test statistic

    F = [(RSS_0 − RSS) / q] / [RSS / (n − p − 1)],

where RSS_0 is the residual sum of squares under H_0 and RSS is that of the full model. In this way we can hopefully find categories whose means are not very different from each other, and then combine those categories. We can also run Tukey's procedure on this question and look at the pairwise mean differences among the four types:

    aov.fit = aov(Fe ~ factor(form), data = Fe)
    summary(aov.fit)
    tukey.fit = TukeyHSD(aov.fit, ordered = T)
    plot(tukey.fit)

Figure 4: Tukey plot for the iron types.

We see here that types 1 and 2 are very close to each other, while types 3 and 4 form another pair. Therefore there is a good chance of simplifying the model further.
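Following this up, a minimal sketch of the suggested simplification (my own illustration; it assumes Fe has been read in and form converted to a factor as above): merge forms 1 and 2, and forms 3 and 4, and compare the reduced model with the full one by a partial F-test.

    # Sketch: merge forms 1&2 and 3&4, then test the reduction
    Fe$form2 <- factor(ifelse(Fe$form %in% c("1", "2"), "1-2", "3-4"))
    fit.full <- lm(Fe ~ form, data = Fe)
    fit.red <- lm(Fe ~ form2, data = Fe)
    anova(fit.red, fit.full)  # H0: the merged categories have equal means

A large p-value here would support collapsing the four types into two groups.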
