Lecture 18
MA 2612 - Applied Statistics II, D 2004

Today
1. Examples of multiple linear regression
2. The modeling process (PNC 8.4)
3. The graphical exploration of multivariable data (PNC 8.5)
4. Fitting the multiple linear regression model (PNC 8.6)
5. ANOVA for the MLR model (PNC 8.11)

Examples of multiple linear regression

In Lecture 17 we saw the mathematical formulation for the multiple linear regression (MLR) model, but did not give any examples of real-life situations where the MLR model can be employed.

1. Modeling tree volume as a function of tree diameter and tree height

Data measured on 31 black cherry trees in the Allegheny National Forest in Pennsylvania are used to build a model for predicting tree volume using only measurements of the height and diameter of a tree.

Y   = tree volume in cubic feet
Z_1 = tree height in feet
Z_2 = tree diameter measured 4.5 feet above ground level in inches

We might consider the following models.

Y = β_0 + β_1 Z_1 + β_2 Z_2 + ε
Y = β_0 + β_1 Z_1 + β_2 Z_2 + β_3 Z_1 Z_2 + ε
Y = β_0 + β_1 Z_1 Z_2^2 + ε
Y = β_0 + β_1 Z_1 Z_2^2 + β_2 Z_1 + ε
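Model specification amounts to choosing which columns go into the design matrix. As a minimal numpy sketch (the height and diameter values below are made up for illustration, not the actual cherry-tree measurements), here is how the regressor columns for the four candidate models above would be constructed:

```python
import numpy as np

# Hypothetical height (Z_1, feet) and diameter (Z_2, inches) values,
# for illustration only -- not the actual Allegheny cherry-tree data.
Z1 = np.array([70.0, 65.0, 63.0, 72.0, 81.0])
Z2 = np.array([8.3, 8.6, 8.8, 10.5, 11.0])
ones = np.ones_like(Z1)  # intercept column

# Design matrices for the four candidate models:
X_a = np.column_stack([ones, Z1, Z2])           # b0 + b1*Z1 + b2*Z2
X_b = np.column_stack([ones, Z1, Z2, Z1 * Z2])  # ... + b3*Z1*Z2 (interaction)
X_c = np.column_stack([ones, Z1 * Z2**2])       # b0 + b1*Z1*Z2^2
X_d = np.column_stack([ones, Z1 * Z2**2, Z1])   # ... + b2*Z1

print(X_a.shape, X_b.shape, X_c.shape, X_d.shape)
```

Note that each model is still linear in the parameters β_0, ..., β_q, even when the regressors themselves are products or powers of the original variables; that is what makes all four of these "linear" regression models.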
2. Modeling college GPA as a function of high school grades

A university interested in predicting whether or not applicants will be successful college students modeled cumulative GPA after three semesters as a function of some of the information provided by the students on their application for admission.

Y   = cumulative GPA after three semesters on a 4.0 scale
Z_1 = high school grades in mathematics courses coded on a 10.0 scale
Z_2 = high school grades in science courses coded on a 10.0 scale
Z_3 = high school grades in English courses coded on a 10.0 scale
Z_4 = score on quantitative section of the SAT out of 800 points
Z_5 = score on verbal section of the SAT out of 800 points

3. Modeling battery failure

Satellite applications motivated the development of a silver-zinc battery. Time until battery failure was modeled as a function of a number of design and storage parameters in order to choose parameter values that would yield batteries with maximum longevity.

Y   = cycles to failure
Z_1 = charge rate
Z_2 = discharge rate
Z_3 = depth of discharge
Z_4 = temperature
Z_5 = end of charge voltage

Aside: Note that there is a difference between multiple linear regression and multivariate linear regression.

Multiple linear regression: one response and multiple predictors
Multivariate linear regression: multiple responses and one or more predictors
The modeling process (PNC 8.4)

Having seen these examples of where the MLR model has been employed, one can see the value of the MLR model.

Recall: The purpose of the MLR model is to...
1. describe the relation between the response and the predictors
2. make predictions about the response for given values of the predictors

But given a response variable and a set of predictor variables, how does one go about building an appropriate model? Model building is an iterative process.

1. Model specification - specifying the number and form of the regressor variables.
2. Model fitting - estimating the model parameters β_0, β_1, ..., β_q and σ^2.
3. Model assessment - using the residuals and other means to evaluate how well the model describes the relationship between our predictor variables and the response. If we determine that the model does not describe the data well, we return to step 1 and specify a new model.
4. Model validation - testing the model on another set of data.

In the remainder of the course, we will study methods associated with each of these steps.
The graphical exploration of multivariable data (PNC 8.5)

In Lab 5, using SAS/INSIGHT, you will employ the following techniques to study the relationship between predictor variables and a response variable, as well as relationships among the predictor variables themselves.

1. Scatterplot arrays allow one to study the relationships between many pairs of variables simultaneously.
2. Rotating 3-D plots allow one to visualize the relationship between three variables in three dimensions.

Graphical exploration of the relationships between predictors and the response is an important preliminary step in the model building process. Before we can specify a model, we need to understand how the variables relate to one another. What we learn about the variables through this exploration will help us to specify appropriate regression models.
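The lab uses SAS/INSIGHT for the plots themselves, but a numeric companion to a scatterplot array is the matrix of pairwise correlations. A minimal sketch in Python/numpy (numpy is an assumption here, not part of the course software; the data are simulated for illustration):

```python
import numpy as np

# Simulated data: 30 observations of a response Y and two predictors,
# purely illustrative -- in practice you would load your own data.
rng = np.random.default_rng(0)
Z1 = rng.normal(70, 5, size=30)   # e.g. heights
Z2 = rng.normal(10, 1, size=30)   # e.g. diameters
Y = 0.002 * Z1 * Z2**2 + rng.normal(0, 0.5, size=30)

# np.corrcoef treats each row as a variable and returns the matrix of
# pairwise correlations -- the numeric analogue of scanning a
# scatterplot array for linear relationships.
R = np.corrcoef(np.vstack([Y, Z1, Z2]))
print(np.round(R, 2))
```

A correlation matrix only summarizes linear association, which is why the graphical tools remain essential: curvature or interaction visible in a scatterplot array can be invisible in the correlations.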
Fitting the multiple linear regression model (PNC 8.6)

We estimate the parameters of the model, β_0, β_1, ..., β_q, using the least squares method discussed in Chapters 7 and 9. That is to say, we choose the estimates, b_0, b_1, ..., b_q, in such a way that we minimize the sum of squared error terms.

SSE(b_0, b_1, ..., b_q) = Σ_{i=1}^n [Y_i − (b_0 + b_1 X_i1 + b_2 X_i2 + ... + b_q X_iq)]^2

As in the case of the SLR model...

We denote the estimates for β_0, β_1, ..., β_q by β̂_0, β̂_1, ..., β̂_q and write the fitted regression equation

Ŷ = β̂_0 + β̂_1 X_1 + β̂_2 X_2 + ... + β̂_q X_q

The residuals are the observed error terms, the difference between the observed and fitted values for each observation.

e_i = Y_i − Ŷ_i = Y_i − (β̂_0 + β̂_1 X_i1 + β̂_2 X_i2 + ... + β̂_q X_iq)

We estimate the error term variance σ^2 by

σ̂^2 = (1/(n − q − 1)) Σ_{i=1}^n (e_i − ē)^2 = (1/(n − q − 1)) Σ_{i=1}^n e_i^2 = SSE/(n − q − 1) = SSE/df_E = MSE

(The first two sums are equal because ē = 0 whenever the model includes an intercept.)

Unlike the SLR model...

There is no simple expression for the least squares estimates β̂_0, β̂_1, ..., β̂_q in terms of the observed data (without using matrix notation). However, SAS will still compute the parameter estimates.
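Although the course uses SAS for the computation, the least squares fit is easy to sketch in Python/numpy (an assumption for illustration, with simulated data from a known model so the estimates can be sanity-checked):

```python
import numpy as np

# Simulate data from a known model Y = 2 + 0.5*X1 - 1.5*X2 + error,
# so we can check that least squares roughly recovers the coefficients.
rng = np.random.default_rng(1)
n, q = 50, 2
X1 = rng.uniform(0, 10, n)
X2 = rng.uniform(0, 10, n)
Y = 2.0 + 0.5 * X1 - 1.5 * X2 + rng.normal(0, 1.0, n)

# Design matrix with an intercept column; lstsq minimizes SSE(b0, ..., bq).
X = np.column_stack([np.ones(n), X1, X2])
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = X @ beta_hat        # fitted values
e = Y - Y_hat               # residuals
SSE = np.sum(e**2)
MSE = SSE / (n - q - 1)     # estimate of sigma^2, with df_E = n - q - 1

print(np.round(beta_hat, 2), round(MSE, 2))
```

Because the model includes an intercept, the residuals sum to zero, which is why Σ(e_i − ē)^2 and Σ e_i^2 agree in the variance estimate above.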
ANOVA for the MLR model (PNC 8.11)

Source      df         SS                                MS
Regression  q          SSR  = Σ_{i=1}^n (Ŷ_i − Ȳ)^2      MSR = SSR/df_R
Error       n − q − 1  SSE  = Σ_{i=1}^n (Y_i − Ŷ_i)^2    MSE = SSE/df_E
Total       n − 1      SSTO = Σ_{i=1}^n (Y_i − Ȳ)^2

Note that, as was the case in Chapters 7 and 9,

df_T = df_R + df_E
SSTO = SSR + SSE
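The ANOVA decomposition can be verified numerically. A short Python/numpy sketch (an assumption for illustration; the course uses SAS) that computes the table entries for simulated data and checks the two identities:

```python
import numpy as np

# Simulated data with q = 2 predictors and an intercept.
rng = np.random.default_rng(2)
n, q = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ beta_hat

SSR = np.sum((Y_hat - Y.mean())**2)   # Regression, df_R = q
SSE = np.sum((Y - Y_hat)**2)          # Error, df_E = n - q - 1
SSTO = np.sum((Y - Y.mean())**2)      # Total, df_T = n - 1

MSR, MSE = SSR / q, SSE / (n - q - 1)
print(round(SSR + SSE, 6), round(SSTO, 6))  # the two should match
```

The identity SSTO = SSR + SSE holds exactly (up to floating-point rounding) whenever the fitted model includes an intercept, and the degrees of freedom add up the same way: q + (n − q − 1) = n − 1.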