Multiple Linear Regression


Andrew Lonardelli
December 20, 2013

Table of Contents
Introduction
Multiple Linear Regression Model
Least Squares Estimation of the Parameters
The Matrix Approach to Linear Regression
Estimating σ²
Properties of the Least Squares Estimators
Test for Significance of Regression
R² and Adjusted R²
Tests on Individual Regression Coefficients and Subsets of Coefficients
Hypothesis for General Regression Test
Confidence Intervals on Individual Regression Coefficients
Confidence Interval on the Mean Response
Prediction of New Observations
Residual Analysis
Influential Observations
Polynomial Regression Models
Categorical Regressors and Indicator Variables
Selection of Variables in Model Building
Stepwise Regression
Forward Selection
Backward Elimination
Multicollinearity
Data/Analysis

Introduction:
In class, we went over simple linear regression, where there is one predictor/regressor variable. This regressor variable comes with the slope of a best-fit line, which tries to extract most of the information in the given data. Learning how to build multiple linear regression models can give ideas and insights into the relationships between different variables and different responses. Engineers and scientists usually use multiple linear regression when working with experiments that have many different variables affecting the outcome of the experiment.

Multiple Linear Regression Model
There are many situations with more than one regressor variable, and the model that handles them is called the multiple regression model. With k regressors we get

(1) Y = β0 + β1x1 + β2x2 + ... + βkxk + ε

This is a multiple linear regression model with k regressors, and we assume the error term ε has mean zero. We say linear because equation (1) is a linear function of the unknown parameters β0, β1, β2, ..., βk. A multiple linear regression model/equation describes a surface, where β0 is the intercept of the hyperplane, while the coefficients of the regressors are known as the partial regression coefficients. β1 measures the expected change in Y per unit change in x1 while x2, ..., xk are all held constant. The same can be said for the other partial regression coefficients. The dependent variable is Y, while the independent variables are the different x's. Multiple linear regression models are used to approximate and predict the response Y from the x variables.

Least Squares Estimation of the Parameters

The least squares method is used to estimate the regression coefficients in the multiple regression equation. Suppose that n > k observations are available, and let xij denote the ith observation of variable xj. The observations are:

Data for Multiple Linear Regression
y     x1    x2    ...   xk
y1    x11   x12   ...   x1k
y2    x21   x22   ...   x2k
...   ...   ...   ...   ...
yn    xn1   xn2   ...   xnk

(This table is laid out the same way as the NHL data table used below.)

The model in terms of the observations is

yi = β0 + β1xi1 + β2xi2 + ... + βkxik + εi,   i = 1, 2, ..., n

The least squares function is

L = Σ εi² = Σ (yi − β0 − Σj βj xij)²   (sum over i = 1, ..., n; inner sum over j = 1, ..., k)

We want to minimize the least squares function with respect to β0, β1, ..., βk. The least squares estimates of β0, β1, ..., βk must satisfy

∂L/∂β0 = −2 Σ (yi − β̂0 − Σj β̂j xij) = 0

and

∂L/∂βj = −2 Σ (yi − β̂0 − Σj β̂j xij) xij = 0,   j = 1, 2, ..., k

Simplifying these equations gives the scalar least squares normal equations: p = k + 1 linear equations in the p unknown estimates β̂0, β̂1, ..., β̂k.

Given data, solutions for all the regression coefficients can be obtained with standard linear algebra techniques.

The Matrix Approach to Linear Regression
When fitting a multiple linear regression model, it is a lot simpler to express the operations in matrix notation. If there are k regressor variables and n observations, (xi1, xi2, ..., xik, yi), i = 1, 2, ..., n, the model relating the regressors to the response is

yi = β0 + β1xi1 + β2xi2 + ... + βkxik + εi,   i = 1, 2, ..., n

This model can be expressed in matrix notation as

y = Xβ + ε

where y is the (n x 1) vector of observations, X is the (n x p) matrix whose first column is all ones (for the intercept) and whose remaining columns contain the regressor values xij, β = (β0, β1, ..., βk)' is the (p x 1) vector of coefficients, and ε is the (n x 1) vector of errors. The X matrix is called the model matrix. We want the least squares estimator β̂ that minimizes

L = ε'ε = (y − Xβ)'(y − Xβ)

and β̂ is the solution for β in the partial derivative equations

∂L/∂β = 0

These equations can be shown to be equivalent to the following normal equations:

X'X β̂ = X'y

This is the least squares equation in matrix form, and it is identical to the scalar least squares equations given before.

The Least Squares Estimate of β

β̂ = (X'X)⁻¹ X'y

This is the same equation as before; we have just isolated β̂. It is the matrix form of the normal equations, and, as you can see, it closely resembles the scalar normal equations. With this, the fitted regression model is

ŷi = β̂0 + β̂1xi1 + β̂2xi2 + ... + β̂kxik,   i = 1, 2, ..., n

and in matrix notation it is

ŷ = Xβ̂

The residual is the difference between the observed yi and the fitted value ŷi; later on I will calculate the residuals for my data. In matrix form,

e = y − ŷ

This is an (n x 1) vector of residuals.
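As a rough illustration of these matrix formulas (separate from the paper's Minitab analysis), the following sketch fits a model with two regressors in NumPy; the small data set is invented purely for this example and is not the NHL data used later.

import numpy as np

# Made-up illustrative data: response y and two regressors x1, x2.
y = np.array([12.0, 15.0, 11.0, 18.0, 14.0, 20.0])
x1 = np.array([2.0, 3.0, 1.5, 4.0, 2.5, 5.0])
x2 = np.array([7.0, 6.0, 8.0, 5.0, 7.5, 4.0])

# Model matrix X: a column of ones for the intercept, then one column per regressor.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Normal equations (X'X) beta_hat = X'y, solved directly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat      # fitted values, y_hat = X beta_hat
e = y - y_hat             # (n x 1) vector of residuals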

Estimating σ²
Estimating the variance of the error term, σ², in multiple linear regression is similar to estimating σ² in a simple linear regression model. In simple linear regression, we divide the sum of the squared residuals by n − 2 because there are only 2 parameters. In a multiple linear regression model there are p parameters, so we divide the sum of the squared residuals by n − p:

σ̂² = SS_E / (n − p)

(SS_E is the sum of the squared residuals.) For my hockey data that you will see later on, there are 15 parameters in total (14 regressor variables + 1 intercept). The formula for SS_E is

SS_E = Σ (yi − ŷi)² = e'e

Substituting e = y − Xβ̂ into the equation above and simplifying, we obtain

SS_E = y'y − β̂'X'y

Properties of the Least Squares Estimators
The properties of the least squares estimators follow from certain assumptions on the error terms. We assume that the errors εi are statistically independent with mean zero and variance σ². Under these assumptions, the least squares estimators are unbiased estimators of the regression coefficients. This property is shown like this:

E(β̂) = E[(X'X)⁻¹X'y] = E[(X'X)⁻¹X'(Xβ + ε)] = β

Notice that we assumed E(ε) = 0 and used (X'X)⁻¹X'X = I (the identity matrix). Then β̂ is an unbiased estimator of β. The variances of the β̂'s are expressed in terms of the inverse of the X'X matrix.

Multiplying the inverse (X'X)⁻¹ by σ² gives the covariance matrix of the regression coefficients. If there are 2 regressors, the matrix C = (X'X)⁻¹ looks like this:

C = (X'X)⁻¹ =
[ C00  C01  C02 ]
[ C10  C11  C12 ]
[ C20  C21  C22 ]

Then C10 = C01, C20 = C02, and C12 = C21, because (X'X)⁻¹ is symmetric. Hence we have

V(β̂j) = σ² Cjj   and   cov(β̂i, β̂j) = σ² Cij,   i ≠ j

In general the covariance matrix is a (p x p) symmetric matrix whose jjth element is the variance of β̂j and whose ijth element is the covariance between β̂i and β̂j:

cov(β̂) = σ² (X'X)⁻¹

To obtain estimates of the variances of these regression coefficients, we replace σ² with its estimate σ̂². The square root of the estimated variance of the jth regression coefficient is known as the estimated standard error of β̂j, or

se(β̂j) = sqrt(σ̂² Cjj)

These standard errors measure the precision of estimation of the regression coefficients; a small standard error means good precision.

Test for Significance of Regression
The test for significance of regression is a test to check whether there is a linear relationship between the response variable y and the regressor variables x1, x2, ..., xk. The hypotheses used are

H0: β1 = β2 = ... = βk = 0
H1: βj ≠ 0 for at least one j

By rejecting the null hypothesis, we can conclude that at least one regressor variable contributes significantly to the model. Just as in simple linear regression, an analysis-of-variance F test is used, applied here in a more general form. First the total sum of squares SS_T is separated/partitioned into a sum of squares due to the model and a sum of squares due to the error:

SS_T = SS_R + SS_E

Now if the null hypothesis is true, SS_R/σ² is a chi-squared random variable with k degrees of freedom, where k is the number of regressors. We can also show that SS_E/σ² is a chi-squared random variable with n − p degrees of freedom (observations minus parameters). The test statistic for H0: β1 = β2 = ... = βk = 0 is

F0 = (SS_R / k) / (SS_E / (n − p)) = MS_R / MS_E

and it follows the F distribution. We reject H0 if the computed f0 is greater than f_{α, k, n−p}. The procedure is usually summarized in an analysis of variance table like this one.

Analysis of Variance for Testing Significance of Regression in Multiple Regression
Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square    F0
Regression             SS_R              k                     MS_R           MS_R/MS_E
Error or residual      SS_E              n − p                 MS_E
Total                  SS_T              n − 1

Since the total sum of squares is

SS_T = Σ (yi − ȳ)² = y'y − (Σ yi)²/n

we can write SS_E as

SS_E = SS_T − SS_R   or   SS_E = y'y − β̂'X'y

Therefore the regression sum of squares SS_R is

SS_R = β̂'X'y − (Σ yi)²/n
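Continuing the illustrative sketch above (SciPy is assumed to be available), the analysis-of-variance quantities, the F statistic, and the estimated standard errors of the coefficients can be computed as follows.

import numpy as np
from scipy import stats

# Significance-of-regression F test for the small illustrative model fit earlier.
n, p = X.shape
k = p - 1                                  # number of regressors

SSE = e @ e                                # error sum of squares, equals y'y - beta_hat'X'y
SST = np.sum((y - y.mean()) ** 2)          # total sum of squares
SSR = SST - SSE                            # regression sum of squares

MSR, MSE = SSR / k, SSE / (n - p)
F0 = MSR / MSE                             # test statistic for H0: beta_1 = ... = beta_k = 0
p_value = stats.f.sf(F0, k, n - p)         # reject H0 if F0 > f_{alpha, k, n-p}

sigma2_hat = MSE                           # estimate of sigma^2, SSE / (n - p)
C = np.linalg.inv(X.T @ X)                 # (X'X)^-1
se_beta = np.sqrt(sigma2_hat * np.diag(C)) # estimated standard errors se(beta_hat_j)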

R² and Adjusted R²
We can use the same coefficient of determination R² from the simple linear regression model in the general multiple linear regression model:

R² = SS_R / SS_T = 1 − SS_E / SS_T

The R² statistic is used to evaluate the fit of the model. When working with multiple linear regression, many people prefer the adjusted R²,

R²adj = 1 − [SS_E / (n − p)] / [SS_T / (n − 1)]

because SS_E/(n − p) is the error (residual) mean square and SS_T/(n − 1) is a constant. R² will only increase when a variable is added to a model, and so we consider R²adj instead. The adjusted R² statistic penalizes the analyst for adding terms to the model. This helps guard against overfitting, which is including regressors that aren't useful. R²adj will be used when we look at variable selection. If we add a regressor variable to the model, the regression sum of squares always increases while the error sum of squares decreases, so R² always increases. Adding an unimportant variable will therefore still cause R² to increase, which is why we look at R²adj as the better measure of fit: R²adj only increases if the variable added to the model reduces the error mean square.

Tests on Individual Regression Coefficients and Subsets of Coefficients
We can test hypotheses on the individual regression coefficients, and these tests determine the potential value of each regressor variable in the regression model. This helps make the model more effective by letting us delete some variables and add others. The hypotheses for testing whether an individual regression coefficient βj equals a value βj0 are

H0: βj = βj0
H1: βj ≠ βj0

and the test statistic for this hypothesis is

T0 = (β̂j − βj0) / sqrt(σ̂² Cjj)

where Cjj is the diagonal element of (X'X)⁻¹ corresponding to β̂j. The denominator of the test statistic is the standard error of the regression coefficient β̂j. The null hypothesis H0: βj = βj0 is rejected if

|t0| > t_{α/2, n−p}

This is known as the partial or marginal test because the regression coefficient β̂j depends on all the other regressor variables xi (i ≠ j). In the special case where H0: βj = 0 is not rejected, the regressor xj can be deleted from the model.

Partial F Test
The hypotheses are

H0: β1 = 0
H1: β1 ≠ 0

where 0 means a vector of zeroes and β1 is a subset of r of the regression coefficients. With the regressors split into two groups, the model can be written as

y = Xβ + ε = X1β1 + X2β2 + ε

X1 represents the columns of X associated with β1, and X2 represents the columns associated with β2. For the full model with both β1 and β2, we know that β̂ = (X'X)⁻¹X'y. The regression sum of squares for all variables (with the intercept) is

SS_R(β) = β̂'X'y   (p degrees of freedom)

and

MS_E = (y'y − β̂'X'y) / (n − p)

The regression sum of squares of β1 when β2 is in the model is

SS_R(β1 | β2) = SS_R(β) − SS_R(β2)

The sum of squares shown above has r degrees of freedom and is called the extra sum of squares due to β1. SS_R(β1 | β2) is the increase in the regression sum of squares obtained by including the variables x1, x2, ..., xr in the model. The null hypothesis is β1 = 0 and the test statistic is

F0 = [SS_R(β1 | β2) / r] / MS_E

This is called the partial F test, and if f0 > f_{α, r, n−p} we reject H0 and conclude that at least one of the parameters in β1 is not zero. This means that at least one of the variables x1, x2, ..., xr in X1 contributes significantly to the model. The partial F test can measure the contribution of each individual regressor in the model as if it were the last variable added:

SS_R(βj | β0, β1, ..., βj−1, βj+1, ..., βk)

This is the increase in the regression sum of squares caused by adding xj to a model that already includes x1, ..., xj−1, xj+1, ..., xk. The F test can also measure the effect of sets of variables.

Confidence Intervals on Individual Regression Coefficients
A 100(1 − α)% confidence interval on the regression coefficient βj, j = 0, 1, ..., k, in a multiple linear regression model is

β̂j − t_{α/2, n−p} sqrt(σ̂² Cjj) ≤ βj ≤ β̂j + t_{α/2, n−p} sqrt(σ̂² Cjj)

We can also write it this way:

β̂j − t_{α/2, n−p} se(β̂j) ≤ βj ≤ β̂j + t_{α/2, n−p} se(β̂j)

because sqrt(σ̂² Cjj) is the standard error of the regression coefficient β̂j. We use the t distribution in the confidence interval because the observations Yi are independently and normally distributed with mean β0 + Σj βj xij and variance σ². Since the least squares estimator β̂ is a linear combination of the observations, it follows that β̂ is normally distributed with mean vector β and covariance matrix σ²(X'X)⁻¹. Cjj is the jjth element of the (X'X)⁻¹ matrix, and σ̂² is the estimate of the error variance.

Confidence Interval on the Mean Response
We can also get a confidence interval on the mean response at a particular point (x01, x02, ..., x0k). We need to define the vector

x0 = [1, x01, x02, ..., x0k]'

The mean response at this point is E(Y | x0) = μ_{Y|x0} = x0'β, estimated by

μ̂_{Y|x0} = x0'β̂

The variance of this estimate is

V(μ̂_{Y|x0}) = σ² x0'(X'X)⁻¹x0

And the 100(1 − α)% confidence interval is constructed from the following variable, which is t-distributed:

T = (μ̂_{Y|x0} − μ_{Y|x0}) / sqrt(σ̂² x0'(X'X)⁻¹x0)

The 100(1 − α)% confidence interval on the mean response at the point (x01, x02, ..., x0k) is

μ̂_{Y|x0} − t_{α/2, n−p} sqrt(σ̂² x0'(X'X)⁻¹x0) ≤ μ_{Y|x0} ≤ μ̂_{Y|x0} + t_{α/2, n−p} sqrt(σ̂² x0'(X'X)⁻¹x0)

Prediction of New Observations
Given a point x01, x02, ..., x0k, we can predict a future observation on the response variable Y. If x0 = [1, x01, x02, ..., x0k]', a point estimate of the future observation Y0 at the point x01, x02, ..., x0k is

ŷ0 = x0'β̂

The 100(1 − α)% prediction interval for the future observation is

ŷ0 − t_{α/2, n−p} sqrt(σ̂² (1 + x0'(X'X)⁻¹x0)) ≤ Y0 ≤ ŷ0 + t_{α/2, n−p} sqrt(σ̂² (1 + x0'(X'X)⁻¹x0))

This is a general prediction interval, and it is always wider than the confidence interval on the mean response because of the additional 1 inside the square root: there is a larger error in predicting a single new observation than in estimating the mean response.
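The following sketch, continuing the illustrative example from the earlier snippets, computes a confidence interval for one coefficient, a confidence interval on the mean response, and a prediction interval; the new point x0 is a made-up value chosen only for demonstration.

import numpy as np
from scipy import stats

# Reuses X, y, beta_hat, sigma2_hat and C from the earlier illustrative snippets.
alpha = 0.05
n, p = X.shape
t_crit = stats.t.ppf(1 - alpha / 2, n - p)

# 95% confidence interval on an individual coefficient, here beta_1.
j = 1
se_j = np.sqrt(sigma2_hat * C[j, j])
ci_beta1 = (beta_hat[j] - t_crit * se_j, beta_hat[j] + t_crit * se_j)

# Confidence interval on the mean response and prediction interval at a new point x0.
x0 = np.array([1.0, 3.5, 6.0])                 # [1, x01, x02]; values are assumptions
y0_hat = x0 @ beta_hat
q = x0 @ C @ x0                                # x0'(X'X)^-1 x0
ci_mean = (y0_hat - t_crit * np.sqrt(sigma2_hat * q),
           y0_hat + t_crit * np.sqrt(sigma2_hat * q))
pi_new = (y0_hat - t_crit * np.sqrt(sigma2_hat * (1 + q)),
          y0_hat + t_crit * np.sqrt(sigma2_hat * (1 + q)))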

Residual Analysis
The residuals, defined by ei = yi − ŷi, help judge the model's accuracy. Plotting the residuals versus variables that are excluded from the model but might be factors (possible candidate regressors) can show whether the model would be improved by adding the candidate variable.

The standardized residual can be useful when assessing the magnitude of a residual. The standardized residuals,

di = ei / sqrt(σ̂²)

are scaled so that their standard deviation is approximately unity. Then there is the studentized residual,

ri = ei / sqrt(σ̂² (1 − hii))

where hii is the ith diagonal element of the matrix

H = X(X'X)⁻¹X'

The H matrix is called the "hat" matrix, since

ŷ = Xβ̂ = X(X'X)⁻¹X'y = Hy

Thus H transforms the observed values of y into a vector of fitted values. Since each row of the matrix X corresponds to a vector xi' = [1, xi1, xi2, ..., xik], another way to write the diagonal elements of the hat matrix is

hii = xi'(X'X)⁻¹xi

and σ²hii is the variance of the fitted value ŷi. Under the assumption that the model errors are independently distributed with mean zero and variance σ², the variance of the ith residual ei is

V(ei) = σ² (1 − hii)

This means that the hii elements must fall in the interval 0 < hii ≤ 1. It also implies that the standardized residuals understate the true residual magnitude; thus, the studentized residuals are better suited to examining potential outliers.
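Continuing the same illustrative sketch, the hat diagonal, standardized and studentized residuals, and Cook's distance (defined in the next section) can be computed in a few lines.

import numpy as np

# Residual diagnostics for the small illustrative model fit earlier.
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix, y_hat = H y
h = np.diag(H)                              # leverages h_ii, each in (0, 1]

d = e / np.sqrt(sigma2_hat)                 # standardized residuals
r = e / np.sqrt(sigma2_hat * (1.0 - h))     # studentized residuals

p = X.shape[1]
D = r**2 * h / (p * (1.0 - h))              # Cook's distance; D_i > 1 flags an influential point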

Influential Observations
There may be points or observations that are different and remote from the rest of the data. These points can be influential in determining R², the estimates of the regression coefficients, and the magnitude of the error mean square. By measuring a distance we can detect whether such points are influential. We measure the squared distance between the least squares estimate of β based on all n observations, β̂, and the estimate obtained when the ith point is removed, say β̂(i). We use Cook's distance,

Di = (β̂(i) − β̂)' X'X (β̂(i) − β̂) / (p σ̂²),   i = 1, 2, ..., n

If the ith point is influential, its removal will result in β̂(i) changing considerably from the value β̂. A large value of Di means that the ith point is influential. The statistic Di is actually computed using

Di = ri² hii / [p (1 − hii)]

In Cook's distance formula, Di consists of the squared studentized residual, which shows how well the model fits the ith observation yi, and a component that measures how far the point is from the rest of the data, hii/(1 − hii). A value of Di > 1 indicates that the point is influential; either component of Di (or both) may contribute to a large value.

Polynomial Regression Models
The second-degree polynomial in one variable is

Y = β0 + β1x + β2x² + ε

and the second-degree polynomial in two variables is

Y = β0 + β1x1 + β2x2 + β11x1² + β22x2² + β12x1x2 + ε

They are both linear regression models, because they are linear in the unknown β's. Polynomial regression models are used when the response is curvilinear, and the general principles of multiple linear regression still apply.

Categorical Regressors and Indicator Variables
Categorical regressors arise when we take into account qualitative variables instead of quantitative variables. To define the different levels of a qualitative variable, we use numerical indicator variables.

For example, if the colors red, blue, and green were levels of some qualitative variable, we could indicate 0 for red, 1 for blue, and 2 for green. A qualitative variable with r levels can also be represented with r − 1 indicator variables, each assigned the value of either zero or one.

Selection of Variables in Model Building
All of the candidate models include an intercept β0, so with K candidate regressors there are K + 1 terms. The problem is figuring out which variables are the right ones to include in the model. Preferably we would like a model that uses only a few regressor variables, but we do not want to remove any important regression variables; this helps us with predictions. One criterion used to evaluate and compare regression models is the pair R² and R²adj. The analyst adds variables until the increase in R² or R²adj becomes small. Often R²adj will stabilize and then decrease as more variables are added to the model. The model that maximizes R²adj is a good candidate for the best regression equation, and the value that maximizes R²adj also minimizes the mean squared error.

Another criterion is the Cp statistic, which measures the total mean square error of the regression model. The total standardized mean square error is

Γp = (1/σ²) Σ E[ŷi − E(Yi)]²

We use the mean square error from the full K + 1 term model as an estimate of σ²; that is, σ̂² = MS_E (full model). The estimator of Γp is the Cp statistic:

Cp = SS_E(p)/σ̂² − n + 2p

If the p-term model has negligible bias, then E(Cp) is approximately p; if there is bias in the p-term model, then E(Cp) exceeds p.

The values of Cp for each regression model under consideration should be compared to p. The regression equations that have negligible bias will have values of Cp that are close to p, while those with significant bias will have values of Cp that are significantly greater than p. We then choose as the best regression equation either a model with minimum Cp or a model with a slightly larger Cp.

The Prediction Error Sum of Squares (PRESS) statistic is another way to evaluate competing regression models. It is defined as the sum of the squares of the differences between each observation yi and the corresponding predicted value ŷ(i) based on a model fit to the remaining n − 1 points. PRESS gives a measure of how well the model is likely to perform when predicting new data, or data that were not used to fit the regression model. The formula for PRESS is

PRESS = Σ (yi − ŷ(i))² = Σ [ei / (1 − hii)]²

Models with small values of PRESS are preferred.
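Continuing the earlier illustrative sketch, the PRESS statistic can be computed from the ordinary residuals e and the hat-matrix diagonal h found above, without refitting the model n times.

import numpy as np

# PRESS for the small illustrative model: leave-one-out residuals e_i / (1 - h_ii).
press_residuals = e / (1.0 - h)
PRESS = np.sum(press_residuals ** 2)       # smaller PRESS is preferred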

Stepwise Regression
This procedure constructs a regression model by adding or deleting variables at each step. The criterion for adding and removing variables is the partial F test. Let f_in be the value of the F random variable required for adding a variable to the model, and let f_out be the value of the F random variable required for deleting a variable from the model. We must have f_in ≥ f_out, and usually f_in = f_out. Stepwise regression starts by building a one-variable model using the regressor variable that has the highest correlation with the response Y; this regressor will also produce the largest F statistic. If the calculated value f1 < f_out, the variable x1 is removed; if not, we keep the variable and do the next test with a new variable together with each variable that has been kept. At each step the set of remaining candidate regressors is examined, and the regressor with the largest partial F statistic is entered if its observed value of f exceeds f_in. Then the partial F statistic for each regressor already in the model is calculated, and the regressor with the smallest observed value of F is deleted if that observed f < f_out. The procedure continues until no other regressors can be added to or removed from the model.

Forward Selection
This procedure is a variation of stepwise regression: we just add regressors to the model one at a time until there are no remaining candidate regressors that produce a significant increase in the regression sum of squares. (Variables are added one at a time as long as their partial F value exceeds f_in.) Forward selection is a simplification of stepwise regression that doesn't use the partial F test for removing variables that were added at previous steps. This is a potential weakness of forward selection, because the previously added variables are never re-checked.

Backward Elimination
This procedure begins with all K candidate regressors in the model. Then the regressor with the smallest partial F statistic is deleted if this F statistic is insignificant, that is, if f < f_out. Next, the model with K − 1 regressors is fit, and the next regressor for potential elimination is found. The algorithm terminates when no further regressor can be deleted. (This technique will be used later on with my data.)

Multicollinearity
Normally, we expect to find dependencies between the response variable Y and the regressors xj. But we can also find dependencies among the regressor variables xj themselves. In situations where these dependencies are strong, we say that multicollinearity exists. The effects of multicollinearity can be evaluated. The diagonal elements of the matrix C = (X'X)⁻¹ can be written as

Cjj = 1 / (1 − R²j),   j = 1, 2, ..., k

where R²j is the coefficient of multiple determination resulting from regressing xj on the other k − 1 regressor variables. We can think of R²j as a measure of the correlation between xj and the other regressors. The stronger the linear dependency of xj on the remaining regressor variables, and hence the stronger the multicollinearity, the larger the value of R²j will be. Recall that V(β̂j) = σ² Cjj. Therefore, we say that the variance of β̂j is inflated by the quantity 1/(1 − R²j). Consequently, we define the variance inflation factor for βj as

VIF(βj) = 1 / (1 − R²j),   j = 1, 2, ..., k

If the columns of the model matrix X are orthogonal, then the regressors are completely uncorrelated and the variance inflation factors are all unity, so a VIF greater than 1 indicates some level of multicollinearity.

If a VIF exceeds 10, then multicollinearity is a problem. Another sign that multicollinearity is present is when the F test for significance of regression is significant but the tests on the individual regression coefficients are not. Collecting more observations and possibly deleting some variables can reduce the level of multicollinearity.
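As a rough, self-contained illustration (not tied to the NHL data), the following sketch computes variance inflation factors directly from the definition VIF_j = 1/(1 − R²j); the simulated regressors x1 and x2 are deliberately made nearly collinear.

import numpy as np

# Simulated regressors: x2 is almost a copy of x1, so their VIFs should be large.
rng = np.random.default_rng(1)
x1 = rng.normal(size=30)
x2 = x1 + 0.1 * rng.normal(size=30)
x3 = rng.normal(size=30)
Xr = np.column_stack([x1, x2, x3])        # regressor columns only, no intercept here

def vif(Xr, j):
    """VIF of column j: regress x_j on the remaining columns plus an intercept."""
    xj = Xr[:, j]
    others = np.delete(Xr, j, axis=1)
    A = np.column_stack([np.ones(len(xj)), others])
    coef, *_ = np.linalg.lstsq(A, xj, rcond=None)
    resid = xj - A @ coef
    r2_j = 1.0 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
    return 1.0 / (1.0 - r2_j)

print([round(vif(Xr, j), 1) for j in range(Xr.shape[1])])   # x1 and x2 show inflated values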

Data/Analysis:
Now that we have finished summarizing multiple linear regression, we are going to look over the data that we will use.

NHL Stats of 30 Teams (columns: W, GF, GA, ADV, PPGF, PCTG, PEN, BMI, AVG, SHT, PPGA, PKPCT, SHGF, SHGA, FG)
(Statistics gathered on NHL.com)

Before going through the calculations, we should first understand what each category means.
W (Y) = Wins
GF (x1) = Goals For
GA (x2) = Goals Against
ADV (x3) = Total Advantages: power-play opportunities
PPGF (x4) = Power-play Goals For
PCTG (x5) = Power-play Percentage: power-play goals for divided by total advantages
PEN (x6) = Total Penalty Minutes, Including Bench Minors
BMI (x7) = Total Bench Minor Minutes
AVG (x8) = Average Penalty Minutes Per Game
SHT (x9) = Total Times Short-handed: measures opponent opportunities
PPGA (x10) = Power-play Goals Against
PKPCT (x11) = Penalty-Killing Percentage: measures a team's ability to prevent goals while its opponent is on a power play (opponent opportunities minus power-play goals against, divided by opponents' opportunities)
SHGF (x12) = Short-handed Goals For
SHGA (x13) = Short-handed Goals Against
FG (x14) = Games Scored First

With this data, I will investigate a multiple linear regression model with the response variable Y being wins and the other variables as my regressor variables. To build a good model, I will use backward elimination: first placing all my regressor variables in Minitab, then removing the variables whose individual tests for significance show p-values greater than 0.05. The variables with the highest p-values are removed one at a time until there are no more p-values greater than 0.05. The highlighted variables are the ones being removed in the next trial.

(Trial 1) Regression Analysis: W versus Gf, GA, ...
The regression equation is
W = Gf GA ADV PPGF PCTG PEN BMI AVG SHT PPGA PKPCT SHGF SHGA FG

Predictor Coef SE Coef T P

Constant Gf GA ADV PPGF PCTG PEN BMI AVG SHT PPGA PKPCT SHGF SHGA FG
S =   R-Sq = 93.7%   R-Sq(adj) = 87.8%

Analysis of Variance
Source DF SS MS F P
Regression
Residual Error
Total

Source DF Seq SS
Gf GA ADV PPGF PCTG PEN BMI AVG SHT PPGA PKPCT SHGF SHGA FG

Unusual Observations
Obs Gf W Fit SE Fit Residual St Resid
R denotes an observation with a large standardized residual.

Predicted Values for New Observations
New Obs Fit SE Fit 95% CI 95% PI
XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations

New Obs Gf GA ADV PPGF PCTG PEN BMI AVG SHT PPGA PKPCT SHGF SHGA FG

(Trial 2) Regression Analysis: W versus Gf, GA, ...
The regression equation is
W = Gf GA ADV PPGF PCTG PEN BMI AVG SHT PPGA PKPCT SHGA FG

Predictor Coef SE Coef T P
Constant Gf GA ADV PPGF PCTG PEN BMI AVG SHT PPGA PKPCT SHGA FG
S =   R-Sq = 93.7%   R-Sq(adj) = 88.6%

R²adj increased, which means the model now explains 88.6% of the variability in wins, and the error mean square decreased.

Analysis of Variance
Source DF SS MS F P
Regression
Residual Error
Total

Source DF Seq SS
Gf GA ADV PPGF PCTG PEN BMI AVG SHT PPGA PKPCT SHGA FG

Unusual Observations

Obs Gf W Fit SE Fit Residual St Resid
R denotes an observation with a large standardized residual.

Predicted Values for New Observations
New Obs Fit SE Fit 95% CI 95% PI
XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations
New Obs Gf GA ADV PPGF PCTG PEN BMI AVG SHT PPGA PKPCT SHGA FG

(Trial 3) Regression Analysis: W versus Gf, GA, ...
The regression equation is
W = Gf GA ADV PCTG PEN BMI AVG SHT PPGA PKPCT SHGA FG

Predictor Coef SE Coef T P
Constant Gf GA ADV PCTG PEN BMI AVG SHT PPGA PKPCT SHGA FG
S =   R-Sq = 93.7%   R-Sq(adj) = 89.2%

R²adj increased: the model now explains 89.2% of the variability in wins, and the error mean square was reduced.

Analysis of Variance
Source DF SS MS F P
Regression
Residual Error
Total

Source DF Seq SS

Gf GA ADV PCTG PEN BMI AVG SHT PPGA PKPCT SHGA FG

Unusual Observations
Obs Gf W Fit SE Fit Residual St Resid
R denotes an observation with a large standardized residual.

Predicted Values for New Observations
New Obs Fit SE Fit 95% CI 95% PI
XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations
New Obs Gf GA ADV PCTG PEN BMI AVG SHT PPGA PKPCT SHGA FG

(Trial 4) Regression Analysis: W versus Gf, GA, ...
The regression equation is
W = Gf GA ADV PCTG PEN AVG SHT PPGA PKPCT SHGA FG

Predictor Coef SE Coef T P
Constant Gf GA ADV PCTG PEN AVG SHT PPGA PKPCT SHGA FG
S =   R-Sq = 93.5%   R-Sq(adj) = 89.5%

R²adj increased: the model now explains 89.5% of the variability in wins, and the error mean square was reduced.

Analysis of Variance
Source DF SS MS F P
Regression
Residual Error
Total

Source DF Seq SS
Gf GA ADV PCTG PEN AVG SHT PPGA PKPCT SHGA FG

Unusual Observations
Obs Gf W Fit SE Fit Residual St Resid
R denotes an observation with a large standardized residual.

Predicted Values for New Observations
New Obs Fit SE Fit 95% CI 95% PI
XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations
New Obs Gf GA ADV PCTG PEN AVG SHT PPGA PKPCT SHGA FG

(Trial 5) Regression Analysis: W versus Gf, GA, ...
The regression equation is
W = Gf GA ADV PEN AVG SHT PPGA PKPCT SHGA FG

Predictor Coef SE Coef T P

Constant Gf GA ADV PEN AVG SHT PPGA PKPCT SHGA FG
S =   R-Sq = 93.1%   R-Sq(adj) = 89.5%

R²adj stayed the same: the model still explains 89.5% of the variability in wins, and the error mean square did not change.

Analysis of Variance
Source DF SS MS F P
Regression
Residual Error
Total

Source DF Seq SS
Gf GA ADV PEN AVG SHT PPGA PKPCT SHGA FG

Unusual Observations
Obs Gf W Fit SE Fit Residual St Resid
R denotes an observation with a large standardized residual.

Predicted Values for New Observations
New Obs Fit SE Fit 95% CI 95% PI
XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations
New Obs Gf GA ADV PEN AVG SHT PPGA PKPCT SHGA FG

(Trial 6) Regression Analysis: W versus Gf, GA, ...
The regression equation is
W = Gf GA ADV PEN SHT PPGA PKPCT SHGA FG

Predictor Coef SE Coef T P

Constant Gf GA ADV PEN SHT PPGA PKPCT SHGA FG
S =   R-Sq = 92.1%   R-Sq(adj) = 88.6%

R²adj decreased a little, which is alright: the model now explains 88.6% of the variability in wins, and the error mean square increased, but there are fewer insignificant regressors.

Analysis of Variance
Source DF SS MS F P
Regression
Residual Error
Total

Source DF Seq SS
Gf GA ADV PEN SHT PPGA PKPCT SHGA FG

Unusual Observations
Obs Gf W Fit SE Fit Residual St Resid
R denotes an observation with a large standardized residual.

Predicted Values for New Observations
New Obs Fit SE Fit 95% CI 95% PI
(73.819, ) (73.770, )
XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations
New Obs Gf GA ADV PEN SHT PPGA PKPCT SHGA FG

(Trial 7) Regression Analysis: W versus Gf, GA, ADV, SHT, PPGA, PKPCT, SHGA, FG

The regression equation is
W = Gf GA ADV SHT PPGA PKPCT SHGA FG

Predictor Coef SE Coef T P
Constant Gf GA ADV SHT PPGA PKPCT SHGA FG
S =   R-Sq = 91.7%   R-Sq(adj) = 88.5%

R²adj decreased a little, which is alright, and so did R² (R² will always decrease if you remove a regressor): the model now explains 88.5% of the variability in wins, and the error mean square increased, but there are fewer insignificant regressors.

Analysis of Variance
Source DF SS MS F P
Regression
Residual Error
Total

Source DF Seq SS
Gf GA ADV SHT PPGA PKPCT SHGA FG

Unusual Observations
Obs Gf W Fit SE Fit Residual St Resid
R denotes an observation with a large standardized residual.

Predicted Values for New Observations
New Obs Fit SE Fit 95% CI 95% PI
(55.800, ) (55.746, )
XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations
New Obs Gf GA ADV SHT PPGA PKPCT SHGA FG

(Trial 8) Regression Analysis: W versus Gf, GA, ADV, SHT, PPGA, PKPCT, FG
The regression equation is
W = Gf GA ADV SHT PPGA PKPCT FG

Predictor Coef SE Coef T P
Constant Gf GA ADV SHT PPGA PKPCT FG
S =   R-Sq = 91.0%   R-Sq(adj) = 88.1%

R²adj decreased a little, which is alright: the model now explains 88.1% of the variability in wins, and the error mean square increased, but there are fewer insignificant regressors. R² also decreased, but that is normal.

Analysis of Variance
Source DF SS MS F P
Regression
Residual Error
Total

Source DF Seq SS
Gf GA ADV SHT PPGA PKPCT FG

Unusual Observations
Obs Gf W Fit SE Fit Residual St Resid
R denotes an observation with a large standardized residual.

Predicted Values for New Observations
New Obs Fit SE Fit 95% CI 95% PI
(33.499, ) (33.439, )
XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations
New Obs Gf GA ADV SHT PPGA PKPCT FG

(Trial 9) Regression Analysis: W versus Gf, GA, ADV, SHT, PPGA, FG
The regression equation is
W = Gf GA ADV SHT PPGA FG

Predictor Coef SE Coef T P
Constant Gf GA ADV SHT PPGA FG
S =   R-Sq = 89.9%   R-Sq(adj) = 87.2%

R²adj decreased a little, which is alright: the model now explains 87.2% of the variability in wins, and the error mean square increased, but there are fewer insignificant regressors. R² also decreased, but that is normal.

Analysis of Variance
Source DF SS MS F P
Regression
Residual Error
Total

Source DF Seq SS
Gf GA ADV SHT PPGA FG

Unusual Observations
Obs Gf W Fit SE Fit Residual St Resid
R denotes an observation with a large standardized residual.

Predicted Values for New Observations
New Obs Fit SE Fit 95% CI 95% PI
(31.564, ) (31.315, )
XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations
New Obs Gf GA ADV SHT PPGA FG

(Trial 10) Regression Analysis: W versus Gf, GA, ADV, PPGA, FG
The regression equation is
W = Gf GA ADV PPGA FG

Predictor Coef SE Coef T P
Constant Gf GA ADV PPGA FG
S =   R-Sq = 88.5%   R-Sq(adj) = 86.1%

R²adj decreased a little, which is alright: the model now explains 86.1% of the variability in wins, and the error mean square increased, but there are fewer insignificant regressors. R² also decreased, but that is normal.

Analysis of Variance
Source DF SS MS F P
Regression
Residual Error
Total

Source DF Seq SS
Gf GA ADV PPGA FG

Unusual Observations
Obs Gf W Fit SE Fit Residual St Resid
R denotes an observation with a large standardized residual.

Predicted Values for New Observations
New Obs Fit SE Fit 95% CI 95% PI
(21.457, ) (21.134, )
XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations
New Obs Gf GA ADV PPGA FG

*(Trial 11) Regression Analysis: W versus Gf, GA, ADV, FG*
The regression equation is
W = Gf GA ADV FG

Predictor Coef SE Coef T P
Constant Gf GA ADV FG
S =   R-Sq = 87.2%   R-Sq(adj) = 85.2%

R²adj decreased a little, which is alright: the model now explains 85.2% of the variability in wins, and the error mean square increased, but there are fewer insignificant regressors. R² also decreased, but that is normal.

Analysis of Variance
Source DF SS MS F P
Regression
Residual Error
Total

Source DF Seq SS
Gf GA ADV FG

Unusual Observations
Obs Gf W Fit SE Fit Residual St Resid
R denotes an observation with a large standardized residual.

Predicted Values for New Observations
New Obs Fit SE Fit 95% CI 95% PI
(25.188, ) (22.070, )

Values of Predictors for New Observations
New Obs Gf GA ADV FG
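The eleven trials above carry out backward elimination by hand in Minitab. Purely as a sketch of the same rule in code (this is not the author's workflow), the loop below drops the least significant regressor until every remaining p-value is at most 0.05; the file name nhl.csv and its column layout are assumptions.

import pandas as pd
import statsmodels.api as sm

# Hypothetical data file with a "W" column for wins and one column per regressor.
data = pd.read_csv("nhl.csv")
y = data["W"]
X = data.drop(columns=["W"])

alpha = 0.05
while X.shape[1] > 0:
    model = sm.OLS(y, sm.add_constant(X)).fit()
    pvals = model.pvalues.drop("const")    # p-values of the regressors, not the intercept
    worst = pvals.idxmax()
    if pvals[worst] <= alpha:              # stop when every regressor is significant
        break
    X = X.drop(columns=[worst])            # remove the least significant regressor

print(model.summary())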

*The highlighting represents the variables that were removed before the next trial.*

Now we will look at the NHL data and the least squares method using my last trial. Calculating the fit by hand with n = 30 and k = 4 is too large a job to do manually, so instead I used the program Minitab and obtained the fitted linear regression model

W = β̂0 + β̂1·Gf + β̂2·GA + β̂3·ADV + β̂4·FG

Some aspects of the model make sense while others don't. Remember, the 4 regressor variables create a linear function with the response Y (wins).

β̂0 is the intercept, and it is 20.9. In practical terms the intercept doesn't really make sense, because it says that if a team shows up to its games and does absolutely nothing, it will still finish the season with about 21 wins.

The coefficient β̂1 is the expected change in Y (wins) per unit change in x1 (GF), with the other variables held constant. β̂1 is 0.172, and it makes sense: goals for (GF) counts the goals your team scores, and each goal gives the team a better chance to win. The more goals you score, the better the chance of getting a win.

β̂2 is negative, which also makes sense. Goals against (GA) is the number of goals the other team scores on you, which reduces the chances of winning a game. β̂2 is the expected change in wins (Y) per unit change in goals against (x2).

β̂3 is negative, and this regressor doesn't make sense. ADV is the number of power plays your team has, which is an advantage and can help you win hockey games. This should be a positive coefficient, because the more power plays a team has, the greater its chance of winning a game. Just like the other regressors, this coefficient is the expected change in wins per unit change in x3.

Finally, β̂4 is 0.365, and this makes sense. FG counts the games in which your team scores first; normally, when a team scores first it takes an early lead and is one step closer to winning the game. β̂4 is the expected change in wins per unit change in the number of games in which a team scores first.

This is a fitted regression model which can be used in practice to predict wins for an NHL team given values of the regressor variables.

β̂0 = 20.9, with p-value of
β̂1 = 0.172, with p-value of
β̂2 = , with p-value of
β̂3 = , with p-value of

β̂4 = 0.365, with p-value of

The p-values are obtained from the individual coefficient tests

H0: βj = 0
H1: βj ≠ 0

The R² is 87.2% while the R²adj is 85.2%. R² shouldn't be the deciding criterion, because with the addition of any variable R² never decreases, even if the errors rise. Many people rely on R²adj because it gives a better measure of model fit. This is not the largest R²adj seen: in trials 4 and 5 the R²adj was 89.5%, meaning the model fit 89.5% of the variability in the data and the model was significant, but some regressor variables had p-values larger than 0.05, which made those regressors not significant. Removing them led to the final R²adj of 85.2%. An R²adj of 85.2% is still good, and the model fits about 85% of the variability in the data. R²adj is better because it guards against overfitting, while R² encourages overfitting when a not-so-useful variable is added. We can say that R²adj penalizes the analyst for adding terms to the model.

My estimated variance (σ̂²) is

The 95% prediction interval (α = 0.05) for a new observation at (Gf = 130, GA = 114, ADV = 155, FG = 23) is

 < Y0 <   with 95% confidence

The 95% confidence interval for the mean response at (Gf = 130, GA = 114, ADV = 155, FG = 23) is

 < μ_{Y|x0} <   with 95% confidence

For this point the residual is , the standardized residual is , and the observed value is Yi = 18.

Work Cited


More information

Regression Analysis and Forecasting Prof. Shalabh Department of Mathematics and Statistics Indian Institute of Technology-Kanpur

Regression Analysis and Forecasting Prof. Shalabh Department of Mathematics and Statistics Indian Institute of Technology-Kanpur Regression Analysis and Forecasting Prof. Shalabh Department of Mathematics and Statistics Indian Institute of Technology-Kanpur Lecture 10 Software Implementation in Simple Linear Regression Model using

More information

Prediction of Bike Rental using Model Reuse Strategy

Prediction of Bike Rental using Model Reuse Strategy Prediction of Bike Rental using Model Reuse Strategy Arun Bala Subramaniyan and Rong Pan School of Computing, Informatics, Decision Systems Engineering, Arizona State University, Tempe, USA. {bsarun, rong.pan}@asu.edu

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Conditions for Regression Inference:

Conditions for Regression Inference: AP Statistics Chapter Notes. Inference for Linear Regression We can fit a least-squares line to any data relating two quantitative variables, but the results are useful only if the scatterplot shows a

More information

(1) The explanatory or predictor variables may be qualitative. (We ll focus on examples where this is the case.)

(1) The explanatory or predictor variables may be qualitative. (We ll focus on examples where this is the case.) Introduction to Analysis of Variance Analysis of variance models are similar to regression models, in that we re interested in learning about the relationship between a dependent variable (a response)

More information

2.4.3 Estimatingσ Coefficient of Determination 2.4. ASSESSING THE MODEL 23

2.4.3 Estimatingσ Coefficient of Determination 2.4. ASSESSING THE MODEL 23 2.4. ASSESSING THE MODEL 23 2.4.3 Estimatingσ 2 Note that the sums of squares are functions of the conditional random variables Y i = (Y X = x i ). Hence, the sums of squares are random variables as well.

More information

Swarthmore Honors Exam 2012: Statistics

Swarthmore Honors Exam 2012: Statistics Swarthmore Honors Exam 2012: Statistics 1 Swarthmore Honors Exam 2012: Statistics John W. Emerson, Yale University NAME: Instructions: This is a closed-book three-hour exam having six questions. You may

More information

Multiple Linear Regression. Chapter 12

Multiple Linear Regression. Chapter 12 13 Multiple Linear Regression Chapter 12 Multiple Regression Analysis Definition The multiple regression model equation is Y = b 0 + b 1 x 1 + b 2 x 2 +... + b p x p + ε where E(ε) = 0 and Var(ε) = s 2.

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

Correlation and Simple Linear Regression

Correlation and Simple Linear Regression Correlation and Simple Linear Regression Sasivimol Rattanasiri, Ph.D Section for Clinical Epidemiology and Biostatistics Ramathibodi Hospital, Mahidol University E-mail: sasivimol.rat@mahidol.ac.th 1 Outline

More information

SMAM 314 Exam 42 Name

SMAM 314 Exam 42 Name SMAM 314 Exam 42 Name Mark the following statements True (T) or False (F) (10 points) 1. F A. The line that best fits points whose X and Y values are negatively correlated should have a positive slope.

More information

Chapter 5 Introduction to Factorial Designs Solutions

Chapter 5 Introduction to Factorial Designs Solutions Solutions from Montgomery, D. C. (1) Design and Analysis of Experiments, Wiley, NY Chapter 5 Introduction to Factorial Designs Solutions 5.1. The following output was obtained from a computer program that

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 15: Examples of hypothesis tests (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 32 The recipe 2 / 32 The hypothesis testing recipe In this lecture we repeatedly apply the

More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

Stat 529 (Winter 2011) A simple linear regression (SLR) case study. Mammals brain weights and body weights

Stat 529 (Winter 2011) A simple linear regression (SLR) case study. Mammals brain weights and body weights Stat 529 (Winter 2011) A simple linear regression (SLR) case study Reading: Sections 8.1 8.4, 8.6, 8.7 Mammals brain weights and body weights Questions of interest Scatterplots of the data Log transforming

More information

Correlation and Regression

Correlation and Regression Correlation and Regression Dr. Bob Gee Dean Scott Bonney Professor William G. Journigan American Meridian University 1 Learning Objectives Upon successful completion of this module, the student should

More information

Chapter 7 Student Lecture Notes 7-1

Chapter 7 Student Lecture Notes 7-1 Chapter 7 Student Lecture Notes 7- Chapter Goals QM353: Business Statistics Chapter 7 Multiple Regression Analysis and Model Building After completing this chapter, you should be able to: Explain model

More information

LECTURE 5 HYPOTHESIS TESTING

LECTURE 5 HYPOTHESIS TESTING October 25, 2016 LECTURE 5 HYPOTHESIS TESTING Basic concepts In this lecture we continue to discuss the normal classical linear regression defined by Assumptions A1-A5. Let θ Θ R d be a parameter of interest.

More information

ECO220Y Simple Regression: Testing the Slope

ECO220Y Simple Regression: Testing the Slope ECO220Y Simple Regression: Testing the Slope Readings: Chapter 18 (Sections 18.3-18.5) Winter 2012 Lecture 19 (Winter 2012) Simple Regression Lecture 19 1 / 32 Simple Regression Model y i = β 0 + β 1 x

More information

How the mean changes depends on the other variable. Plots can show what s happening...

How the mean changes depends on the other variable. Plots can show what s happening... Chapter 8 (continued) Section 8.2: Interaction models An interaction model includes one or several cross-product terms. Example: two predictors Y i = β 0 + β 1 x i1 + β 2 x i2 + β 12 x i1 x i2 + ɛ i. How

More information

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018 Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate

More information

STAT Chapter 11: Regression

STAT Chapter 11: Regression STAT 515 -- Chapter 11: Regression Mostly we have studied the behavior of a single random variable. Often, however, we gather data on two random variables. We wish to determine: Is there a relationship

More information

Question Possible Points Score Total 100

Question Possible Points Score Total 100 Midterm I NAME: Instructions: 1. For hypothesis testing, the significant level is set at α = 0.05. 2. This exam is open book. You may use textbooks, notebooks, and a calculator. 3. Do all your work in

More information

Regression Model Building

Regression Model Building Regression Model Building Setting: Possibly a large set of predictor variables (including interactions). Goal: Fit a parsimonious model that explains variation in Y with a small set of predictors Automated

More information

Inference for Regression Inference about the Regression Model and Using the Regression Line

Inference for Regression Inference about the Regression Model and Using the Regression Line Inference for Regression Inference about the Regression Model and Using the Regression Line PBS Chapter 10.1 and 10.2 2009 W.H. Freeman and Company Objectives (PBS Chapter 10.1 and 10.2) Inference about

More information

Psychology Seminar Psych 406 Dr. Jeffrey Leitzel

Psychology Seminar Psych 406 Dr. Jeffrey Leitzel Psychology Seminar Psych 406 Dr. Jeffrey Leitzel Structural Equation Modeling Topic 1: Correlation / Linear Regression Outline/Overview Correlations (r, pr, sr) Linear regression Multiple regression interpreting

More information

Inference with Simple Regression

Inference with Simple Regression 1 Introduction Inference with Simple Regression Alan B. Gelder 06E:071, The University of Iowa 1 Moving to infinite means: In this course we have seen one-mean problems, twomean problems, and problems

More information

This gives us an upper and lower bound that capture our population mean.

This gives us an upper and lower bound that capture our population mean. Confidence Intervals Critical Values Practice Problems 1 Estimation 1.1 Confidence Intervals Definition 1.1 Margin of error. The margin of error of a distribution is the amount of error we predict when

More information

Introduction to Regression

Introduction to Regression Introduction to Regression Using Mult Lin Regression Derived variables Many alternative models Which model to choose? Model Criticism Modelling Objective Model Details Data and Residuals Assumptions 1

More information