Multiple regression is arguably the single most important method in all of statistics.

Size: px

Start display at page:

Download "Multiple regression is arguably the single most important method in all of statistics."

Scot Williamson
6 years ago
Views:

1 Multiple Regressio: How Much Is Your Car Worth? 3 Essetially, all models are wrog; some are useful. George E. P. Box 1 Multiple regressio is arguably the sigle most importat method i all of statistics. Regressio models are widely used i may disciplies. I additio, a good uderstadig of regressio is all but essetial for uderstadig may other, more sophisticated statistical methods. This chapter cosists of a set of activities that will eable you to build a multivariate regressio model. The model will be used to describe the relatioship betwee the retail price of 005 used GM cars ad various car characteristics, such as mileage, make, model, presece or absece of cruise cotrol, ad egie size. The set of activities i this chapter allows you to work through the etire process of model buildig ad assessmet, icludig Applyig variable selectio techiques Usig residual plots to check for violatios of model assumptios, such as heteroskedasticity, outliers, autocorrelatio, ad oormality distributed errors Trasformig data to better fit model assumptios Uderstadig the impact of correlated explaatory variables Icorporatig categorical explaatory variables ito a regressio model Applyig F-tests i multiple regressio 67

2 68 Chapter 3 Multiple Regressio: How Much Is Your Car Worth? 3.1 Ivestigatio: How Ca We Build a Model to Estimate Used Car Prices? Have you ever browsed through a car dealership ad observed the sticker prices o the vehicles? If you have ever seriously cosidered purchasig a vehicle, you ca probably relate to the difficulty of determiig whether that vehicle is a good deal or ot. Most dealerships are willig to egotiate o the sale price, so how ca you kow how much to egotiate? For ovices (like this author), it is very helpful to refer to a outside pricig source, such as the Kelley Blue Book, before agreeig o a purchase price. For over 80 years, Kelley Blue Book has bee a resource for accurate vehicle pricig. The compay s Website, provides a free olie resource where ayoe ca iput several car characteristics (such as age, mileage, make, model, ad coditio) ad quickly receive a good estimate of the retail price. I this chapter, you will use a relatively small subset of the Kelley Blue Book database to describe the associatio of several explaatory variables (car characteristics) with the retail value of a car. Before developig a complex multiple regressio model with several variables, let s start with a quick review of the simple liear regressio model by askig a questio: Are cars with lower mileage worth more? It seems reasoable to expect to see a relatioship betwee mileage (umber of miles the car has bee drive) ad retail value. The data set Cars cotais the make, model, equipmet, mileage, ad Kelley Blue Book suggested retail price of several used 005 GM cars. A Simple Liear Regressio Model 1. Produce a scatterplot from the Cars data set to display the relatioship betwee mileage (Mileage) ad suggested retail price (Price). Does the scatterplot show a strog relatioship betwee Mileage ad Price?. Calculate the least squares regressio lie, Price = b 0 + b 1 (Mileage). Report the regressio model, the R value, the correlatio coefficiet, the t-statistics, ad p-values for the estimated model coefficiets (the itercept ad slope). Based o these statistics, ca you coclude that Mileage is a strog idicator of Price? Explai your reasoig i a few seteces. 3. The first car i this data set is a Buick Cetury with 81 miles. Calculate the residual value for this car (the observed retail price mius the expected price calculated from the regressio lie). Mathematical Note For ay regressio equatio of the form y i = b 0 + b 1 x i + e i, the hypothesis test for the slope of the regressio equatio (b 1 ) is similar to other t-tests discussed i itroductory textbooks. (Mathematical details for this hypothesis test are described i Chapter.) To test the ull hypothesis that the slope coefficiet is zero (H 0 : b 1 = 0 versus H a : b 1 0), calculate the followig test statistic: t = b 1 - b 1 s b1 (3.1) where b 1 is the estimated slope calculated from the data ad s b1 is a estimate of the stadard deviatio of b 1. Probability theory ca be used to prove that if the regressio model assumptios are true, the t-statistic i Equatio (3.1) follows a t-distributio with - degrees of freedom. If the sample statistic, b 1, is far away from b 1 = 0 relative to the estimated stadard deviatio, the t-statistic will be large ad the correspodig p-value will be small. The t-statistic for the slope coefficiet idicates that Mileage is a importat variable. However, the R value (the percetage of variatio explaied by the regressio lie) idicates that the regressio lie is ot very useful i predictig retail price. (A review of the R value is give i the exteded activities.) As is always the case with statistics, we eed to visualize the data rather tha focus solely o a p-value. Figure 3.1 shows that the expected price decreases as mileage icreases, but the observed poits do ot appear to be close to the regressio lie. Thus, it seems reasoable that icludig additioal explaatory variables i the regressio model might help to better explai the variatio i retail price.

3 3. Goals of Multiple Regressio 69 Figure 3.1 Scatterplot ad least squares regressio model: Price = 4, (Mileage). The regressio lie shows that for each additioal mile a car is drive, the expected price of the car decreases by about 17 cets. However, may poits are ot close to the regressio lie, idicatig that the expected price is ot a accurate estimate of the actual observed price. I this chapter, you will build a liear combiatio of explaatory variables that explais the respose variable, retail price. As you work through the chapter, you will fid that there is ot oe techique, or recipe, that will give the best model. I fact, you will come to see that there is t just oe best model for these data. Ulike i most mathematics classes, where every studet is expected to submit the oe right aswer to a assigmet, here it is expected that the fial regressio models submitted by various studets will be at least slightly differet. While a sigle best model may ot exist for these data, there are certaily may bad models that should be avoided. This chapter focuses o uderstadig the process of developig a statistical model. It does t matter if you are developig a regressio model i ecoomics, psychology, sociology, or egieerig there are commo key questios ad processes that should be evaluated before a fial model is submitted. 3. Goals of Multiple Regressio It is importat to ote that multiple regressio aalysis ca be used to serve differet goals. The goals will ifluece the type of aalysis that is coducted. The most commo goals of multiple regressio are to describe, predict, or cofirm. Describe: A model may be developed to describe the relatioship betwee multiple explaatory variables ad the respose variable. Predict: A regressio model may be used to geeralize to observatios outside the sample. Just as i simple liear regressio, explaatory variables should be withi the rage of the sample data to predict future resposes. Cofirm: Theories are ofte developed about which variables or combiatio of variables should be icluded i a model. For example, is mileage useful i predictig retail price? Iferetial techiques ca be used to test if the associatio betwee the explaatory variables ad the respose could just be due to chace.

4 70 Chapter 3 Multiple Regressio: How Much Is Your Car Worth? Theory may also predict the type of relatioship that exists, such as cars with lower mileage are worth more. More specific theories ca also be tested, such as retail price decreases liearly with mileage. Whe the goal of developig a multiple regressio model is descriptio or predictio, the primary issue is ofte determiig which variables to iclude i the model (ad which to leave out). All potetial explaatory variables ca be icluded i a regressio model, but that ofte results i a cumbersome model that is difficult to uderstad. O the other had, a model that icludes oly oe or two of the explaatory variables, such as the model i Figure 3.1, may be much less accurate tha a more complex model. This tesio betwee fidig a simple model ad fidig the model that best explais the respose is what makes it difficult to fid a best model. The process of fidig the most reasoable mix, which provides a relatively simple liear combiatio of explaatory variables, ofte resembles a exploratory artistic process much more tha a formulaic recipe. Icludig redudat or uecessary variables ot oly creates a uwieldy model but also ca lead to test statistics (ad coclusios from correspodig hypothesis tests) that are less reliable. If explaatory variables are highly correlated, the their effects i the model will be estimated with more imprecisio. This imprecisio leads to larger stadard errors ad ca lead to isigificat test results for idividual variables that ca be importat i the model. Failig to iclude a relevat variable ca result i biased estimates of the regressio coefficiets ad ivalid t-statistics, especially whe the excluded variable is highly sigificat or whe the excluded variable is correlated with other variables also i the model. 3.3 Variable Selectio Techiques to Describe or Predict a Respose If your objective is to describe a relatioship or predict ew respose variables, variable selectio techiques are useful for determiig which explaatory variables should be i the model. For this ivestigatio, we will cosider the respose to be the suggested retail price from Kelley Blue Book (the Price variable i the data). We may iitially believe the followig are relevat potetial explaatory variables: Make (Buick, Cadillac, Chevrolet, Potiac, SAAB, Satur) Model (specific car for each previously listed Make) Trim (specific type of Model) Type (Seda, Coupe, Hatchback, Covertible, or Wago) Cyl (umber of cyliders: 4, 6, or 8) Liter (a measure of egie size) Doors (umber of doors:, 4) Cruise (1 = cruise cotrol, 0 = o cruise cotrol) Soud (1 = upgraded speakers, 0 = stadard speakers) Leather (1 = leather seats, 0 = ot leather seats) Mileage (umber of miles the car has bee drive) Stepwise Regressio Whe a large umber of variables are available, stepwise regressio is a iterative techique that has historically bee used to idetify key variables to iclude i a regressio model. For example, forward stepwise regressio begis by fittig several sigle-predictor regressio models for the respose; oe regressio model is developed for each idividual explaatory variable. The sigle explaatory variable (call it X 1 ) that best explais the respose (has the highest R value) is selected to be i the model. * I the ext step, all possible regressio models usig X 1 ad exactly oe other explaatory variable are calculated. From amog all these two-variable models, the regressio model that best explais the respose is used to idetify X. After the first ad secod explaatory variables, X 1 ad X, have bee selected, the * A F-test is coducted o each of the models. The size of the F-statistic (ad correspodig p-value) is used to evaluate the fit of each model. Whe models with the same umber of predictors are compared, the model with the largest F-statistic will also have the largest R value.

5 3.3 Variable Selectio Techiques to Describe or Predict a Respose 71 process is repeated to fid X 3. This cotiues util icludig additioal variables i the model o loger greatly improves the model s ability to describe the respose. * Backward stepwise regressio is similar to forward stepwise regressio except that it starts with all potetial explaatory variables i the model. Oe by oe, this techique removes variables that make the smallest cotributio to the model fit util a best model is foud. While sequetial techiques are easy to implemet, there are usually better approaches to fidig a regressio model. Sequetial techiques have a tedecy to iclude too may variables ad at the same time sometimes elimiate importat variables. 3 With improvemets i techology, most statisticias prefer to use more global techiques (such as best subset methods), which compare all possible subsets of the explaatory variables. Note Later sectios will show that whe explaatory variables are highly correlated, sequetial procedures ofte leave out variables that explai (are highly correlated with) the respose. I additio, sequetial procedures ivolve umerous iteratios ad each iteratio ivolves hypothesis tests about the sigificace of coefficiets. Remember that with ay multiple compariso problem, a a@level of 0.05 meas there is a 5% chace that each irrelevat variable will be foud sigificat ad may iappropriately be determied importat for the model. Selectig the Best Subset of Predictors A researcher must balace icreasig R agaist keepig the model simple. Whe models with the same umber of parameters are compared, the model with the highest R value should typically be selected. A larger R value idicates that more of the variatio i the respose variable is explaied by the model. However, R ever decreases whe aother explaatory variable is added. Thus, other techiques are suggested for comparig models with differet umbers of explaatory variables. Statistics such as the adjusted R, Mallows C p, ad Akaike s ad Bayes iformatio criteria are used to determie a best model. Each of these statistics icludes a pealty for icludig too may terms. I other words, whe two models have equal R values, each of these statistics will select the model with fewer terms. Best subsets techiques use several statistics to simultaeously compare several regressio models with the same umber of predictors. Mathematical Note R, called the coefficiet of determiatio, is the percetage of variatio i the respose variable that is explaied by the regressio lie. a (y i - y) a (y i - y i ) R = = 1 - a (y i - y) (3.) a (y i - y) Whe the sum of the squared residuals (y i - y i ) are small compared to the total spread of the resposes ( a (y i - y) ), R is close to oe. R = 1 idicates that the regressio model perfectly fits the data. Calculatios for adjusted R ad Mallows C p are * A a@level is ofte used to determie if ay of the explaatory variables ot curretly i the model should be added to the model. If the p-value of all additioal explaatory variables is greater tha the a@level, o more variables will be etered ito the model. Larger a@levels (such as a = 0.) will iclude more terms while smaller a@levels (such as a = 0.05) will iclude fewer terms. R adj = a (y i - y i ) - p (3.3) a (y i - y)

6 7 Chapter 3 Multiple Regressio: How Much Is Your Car Worth? C p = ( - p) s + (p - ) (3.4) s Full where is the sample size, p is the umber of coefficiets i the model (icludig the costat term), s estimates the variace of the residuals i the model, ad s Full estimates the variace of the residuals i the full model (i.e., the model with all possible terms). R adj is similar to R, except that R adj icludes a pealty whe uecessary terms are icluded i the model. Specifically, whe p is larger, ( - 1)/( - p) is larger, ad thus R adj is smaller. C p also adjusts for the umber of terms i the model. Notice that whe the curret model explais the data as well as the full model, s /s Full = 1. The C p = ( - p)(1) + p - = p. Thus, the objective is to fid the smallest C p value that is close to the umber of coefficiets i the model. While ay of these are appropriate for some models, adjusted R still teds to select models with too may terms. Mathematical details for these calculatios are provided i the exteded activities. Comparig Variable Selectio Techiques 4. Use the Cars data to coduct a stepwise regressio aalysis. a. Calculate seve regressio models, each with oe of the followig explaatory variables: Cy1, Liter, Doors, Cruise, Soud, Leather, ad Mileage. Idetify the explaatory variable that correspods to the model with the largest R value. Call this variable X 1. b. Calculate six regressio models. Each model should have two explaatory variables, X 1 ad oe of the other six explaatory variables. Fid the two-variable model that has the highest R value. How much did R improve whe this secod variable was icluded? c. Istead of cotiuig this process to idetify more variables, use the software istructios provided to coduct a stepwise regressio aalysis. List each of the explaatory variables i the model suggested by the stepwise regressio procedure. 5. Use the software istructios provided to develop a model usig best subsets techiques. Notice that stepwise regressio simply states which model to use, while best subsets provides much more iformatio ad requires the user to choose how may variables to iclude i the model. I geeral, statisticias select models that have a relatively low C p, a large R, ad a relatively small umber of explaatory variables. (It is rare for these statistics to all suggest the same model. Thus, the researcher much choose a model based o his or her goals. The exteded activities provide additioal details about each of these statistics.) Based o the output from best subsets, which explaatory variables should be icluded i a regressio model? 6. Compare the regressio models i Questios 4 ad 5. a. Are differet explaatory variables cosidered importat? b. Did the stepwise regressio i Questio 4 provide ay idicatio that Liter could be useful i predictig Price? Did the best subsets output i Questio 5 provide ay idicatio that Liter might be useful i predictig Price? Explai why best subsets techiques ca be more iformative tha sequetial techiques. Neither sequetial or best subsets techiques guaratee a best model. Arbitrarily usig slightly differet criteria will produce differet models. Best subset methods allow us to compare models with a specific umber of predictors, but models with more predictors do ot always iclude the same terms as smaller models. Thus, it is ofte difficult to iterpret the importace of ay coefficiets i the model. Variable selectio techiques are useful i providig a high R value while limitig the umber of variables. Whe our goal is to develop a model to describe or predict a respose, we are cocered ot about the sigificace of each explaatory variable, but about how well the overall model fits. If our goal ivolves cofirmig a theory, iterative techiques are ot recommeded. Cofirmig a theory is similar to hypothesis testig. Iterative variable selectio techiques test each variable or combiatio of variables several times, ad thus the p-values are ot reliable. The stated sigificace level for a t-statistic is valid oly if the data are used for a sigle test. If multiple tests are coducted to fid the best equatio, the actual sigificace level for each test for a idividual compoet is ivalid.

7 3.4 Checkig Model Assumptios 73 Key Cocept If variables are selected by iterative techiques, hypothesis tests should ot be used to determie the sigificace of these same terms. 3.4 Checkig Model Assumptios The simple liear regressio model discussed i itroductory statistics courses typically has the followig form: y i = b 0 + b 1 x i + e i for i = 1,, c, where e i N(0, s ) (3.5) For this liear regressio model, the mea respose (b 0 + b 1 x i ) is a liear fuctio of the explaatory variable, x. The multiple liear regressio model has a very similar form. The key differece is that ow more terms are icluded i the model. y i = b 0 + b 1 x 1,i + c + b p - x p -,i + b p -1 x p -1,i + e i for i = 1,, c, where e i N(0, s ) (3.6) I this chapter, p represets the umber of parameters i the regressio model (b 0, b 1, b, c, b p -1 ) ad is the total umber of observatios i the data. I this chapter, we make the followig assumptios about the regressio model: The model parameters b 0, b 1, c, b p -1 ad s are costat. Each term i the model is additive. The error terms i the regressio model are idepedet ad have bee sampled from a sigle populatio (idetically distributed). This is ofte abbreviated as iid. The error terms follow a ormal probability distributio cetered at zero with a fixed variace, s. This assumptio is deoted as e i N(0, s ) for i = 1, c,. Regressio assumptios about the error terms are geerally checked by lookig at the residuals from the data: y i - y i. Here, y i are the observed resposes ad y i are the estimated resposes calculated by the regressio model. Istead of formal hypothesis tests, plots will be used to visually assess whether the assumptios hold. The theory ad methods are simplest whe ay scatterplot of residuals resembles a sigle, horizotal, oval balloo, but real data may ot cooperate by coformig to the ideal patter. A orery plot may show a wedge, a curve, or multiple clusters. Figure 3. shows examples of each of these types of residual plots. Suppose you held a cup full of cois ad dropped them all at oce. We hope to fid residual plots that resemble the radom patter that would likely result from dropped cois (like the oval-shaped plot). The other three plots show patters that would be very ulikely to occur by radom chace. Ay plot patters that are ot ice oval shapes suggest that the error terms are violatig at least oe model assumptio, ad thus it is likely that we have ureliable estimates of our model coefficiets. The followig sectio illustrates strategies for dealig with oe of these uwated shapes: a wedge-shaped patter. Note that i sigle-variable regressio models, residual plots show the same iformatio as the iitial fitted lie plot. However, the residual plots ofte emphasize violatios of model assumptios better tha the fitted lie plot. I additio, multivariate regressio lies are very difficult to visualize. Thus, residual plots are essetial whe multiple explaatory variables are used. Heteroskedasticity Heteroskedasticity is a term used to describe the situatio where the variace of the error term is ot costat for all levels of the explaatory variables. For example, i the regressio equatio Price = 4, (Mileage), the spread of the suggested retail price values aroud the regressio lie should be about the same whether mileage is 0 or mileage is 50,000. If heteroskedasticity exists i the model, the most commo remedy is to trasform either the explaatory variable, the respose variable, or both i the hope that the trasformed relatioship will exhibit homoskedasticity (equal variaces aroud the regressio lie) i the error terms.

8 74 Chapter 3 Multiple Regressio: How Much Is Your Car Worth? Figure 3. Commo shapes of residual plots. Ideally, residual plots should look like a radomly scattered set of dropped cois, as see i the oval-shaped plot. If a patter exists, it is usually best to try other regressio models. 7. Usig the regressio equatio calculated i Questio 5, create plots of the residuals versus each explaatory variable i the model. Also create a plot of the residuals versus the predicted retail price (ofte called a residual versus fit plot). a. Does the size of the residuals ted to chage as mileage chages? b. Does the size of the residuals ted to chage as the predicted retail price chages? You should see patters idicatig heteroskedasticity (ocostat variace). c. Aother patter that may ot be immediately obvious from these residual plots is the right skewess see i the residual versus mileage plot. Ofte ecoomic data, such as price, are right skewed. To see the patter, look at just oe vertical slice of this plot. With a pecil, draw a vertical lie correspodig to mileage equal to Are the poits i the residual plots balaced aroud the lie Y = 0? d. Describe ay patters see i the other residual plots. 8. Trasform the suggested retail price to log (Price) ad Price. Trasformig data usig roots, logarithms, or reciprocals ca ofte reduce heteroskedasticity ad right skewess. (Trasformatios are discussed i Chapter.) Create regressio models ad residual plots for these trasformed respose variables usig the explaatory variables selected i Questio 5. a. Which trasformatio did the best job of reducig the heteroskedasticity ad skewess i the residual plots? Give the R values of both ew models. b. Do the best residual plots correspod to the best R values? Explai. While other trasformatios could be tried, throughout this ivestigatio we will refer to the log-trasformed respose variable as TPrice. Figure 3.3 shows residual plots that were created to aswer Questios 7 ad 8. Notice that whe the respose variable is Price, the residual versus fit plot has a clear wedge-shaped patter. The residuals have

9 3.4 Checkig Model Assumptios 75 much more spread whe the fitted value is large (i.e., expected retail price is close to $40,000) tha whe the fitted value is ear $10,000. Usig TPrice as a respose did improve the residual versus fit plot. Although there is still a fait wedge shape, the variability of the residuals is much more cosistet as the fitted value chages. Figure 3.3 reveals aother patter i the residuals. The followig sectio will address why poits i both plots appear i clusters. Figure 3.3 Residual versus fit plots usig Price ad TPrice (the log 10 trasformatio), as resposes. The residual plot with Price as the respose has a much stroger wedge-shaped patter tha the oe with TPrice. Examiig Residual Plots Across Time/Order Autocorrelatio exists whe cosecutive error terms are related. If autocorrelatio exists, the assumptio about the idepedece of the error terms is violated. To idetify autocorrelatio, we plot the residuals versus the order of the data etries. If the ordered plot shows a patter, the we coclude that autocorrelatio exists. Whe autocorrelatio exists, the variable resposible for creatig the patter i the ordered residual plot should be icluded i the model. Note Time (order i which the data were collected) is perhaps the most commo source of autocorrelatio, but other forms, such as spatial autocorrelatios, ca also be preset. If time is ideed a variable that should be icluded i the model, a specific type of regressio model, called a time series model, should be used. 9. Create a residual versus order plot from the TPrice versus Mileage regressio lie. Describe ay patter you see i the ordered residual plot. Apparetly somethig i our data is affectig the residuals based o the order of the data. Clearly, time is ot the ifluetial factor i our data set (all of the data are from 005). Ca you suggest a variable i the Cars data set that may be causig this patter? 10. Create a secod residual versus order plot usig TPrice as the respose ad usig the explaatory variables selected i Questio 5. Describe ay patters that you see i these plots. While ordered plots make sese i model checkig oly whe there is a meaigful order to the data, the residual versus order plots could demostrate the eed to iclude additioal explaatory variables i the regressio model. Figure 3.4 shows the two residual plots created i Questios 9 ad 10. Both plots show that the data poits are clearly clustered by order. However, there is less clusterig whe the six explaatory variables (Mileage, Cyl, Doors, Cruise, Soud, ad Leather) are i the model. Also otice that the residuals ted

10 76 Chapter 3 Multiple Regressio: How Much Is Your Car Worth? Figure 3.4 Residual versus order plots usig TPrice as the respose. The first graph uses Mileage as the explaatory variable, ad the secod graph uses Mileage, Cyl, Doors, Cruise, Soud, ad Leather as explaatory variables. to be closer to zero i the secod graph. Thus, the secod graph (with six explaatory variables) teds to have estimates that are closer to the actual observed values. We do ot have a time variable i this data set, so reorderig the data would ot chage the meaig of the data. Reorderig the data could elimiate the patter; however, the clear patter see i the residual versus order plots should ot be igored because it idicates that we could create a model with a much higher R value if we could accout for this patter i our model. This type of autocorrelatio is called taxoomic autocorrelatio, meaig that the relatioship see i this residual plot is due to how the items i the data set are classified. Suggestios o how to address this issue are give i later sectios. Outliers ad Ifluetial Observatios 11. Calculate a regressio equatio usig the explaatory variables suggested i Questio 5 ad Price as the respose. Idetify ay residuals (or cluster of residuals) that do t seem to fit the overall patter i the residual versus fit ad residual versus mileage plots. Ay data values that do t seem to fit the geeral patter of the data set are called outliers. a. Idetify the specific rows of data that represet these poits. Are there ay cosistecies that you ca fid? b. Is this cluster of outliers helpful i idetifyig the patters that were foud i the ordered residual plots? Why or why ot? 1. Ru the aalysis with ad without the largest cluster of potetial outliers (the cluster of outliers correspods to the Cadillac covertibles). Use Price as the respose. Does the cluster of outliers ifluece the coefficiets i the regressio lie? If the coefficiets chage dramatically betwee the regressio models, these poits are cosidered ifluetial. If ay observatios are ifluetial, great care should be take to verify their accuracy. I additio to reducig heterskedasticity, trasformatios ca ofte reduce the effect of outliers. Figure 3.5 shows the residual versus fit plots usig Price ad TPrice, respectively. The cluster of outliers correspodig to the Cadillac covertibles is much more visible i the plot with the utrasformed (Price) respose variable. Eve though there is still clusterig i the trasformed data, the residuals correspodig to the Cadillac covertibles are o loger uusually large.

11 3.4 Checkig Model Assumptios 77 Figure 3.5 Residual versus fit plots usig Price ad TPrice as the respose. The circled observatios i the plot usig Price are o loger clear outliers i the plot usig TPrice. I some situatios, clearly uderstadig outliers ca be more time cosumig (ad possibly more iterestig) tha workig with the rest of the data. It ca be quite difficult to determie if a outlier was accurately recorded or whether the outliers should be icluded i the aalysis. If the outliers were accurately recorded ad trasformatios are ot useful i elimiatig them, it ca be difficult to kow what to do with them. The simplest approach is to ru the aalysis twice: oce with the outliers icluded ad oce without. If the results are similar, the it does t matter if the outliers are icluded or ot. If the results do chage, it is much more difficult to kow what to do. Outliers should ever automatically be removed because they do t fit the overall patter of the data. Most statisticias ted to err o the side of keepig the outliers i the sample data set uless there is clear evidece that they were mistakely recorded. Whatever fial model is selected, it is importat to clearly state if you are aware that your results are sesitive to outliers. Normally Distributed Residuals Eve though the calculatios of the regressio model ad R do ot deped o the ormality assumptio, idetifyig patters i residual plots ca ofte lead to aother model that better explais the respose variable. To determie if the residuals are ormally distributed, two graphs are ofte created: a histogram of the residuals ad a ormal probability plot. Normal probability plots are created by sortig the data (the residuals i this case) from smallest to largest. The the sorted residuals are plotted agaist a theoretical ormal distributio. If the plot forms a straight lie, the actual data ad the theoretical data have the same shape (i.e., the same distributio). (Normal probability plots are discussed i more detail i Chapter.) 13. Create a regressio lie to predict TPrice from Mileage. Create a histogram ad a ormal probability plot of the residuals. a. Do the residuals appear to follow the ormal distributio? b. Are the te outliers visible o the ormal probability plot ad the histogram? Figure 3.6 shows the ormal probability plot usig six explaatory variables to estimate TPrice. While the outliers are ot visible, both plots still show evidece of lack of ormality.

12 78 Chapter 3 Multiple Regressio: How Much Is Your Car Worth? Figure 3.6 Normal probability plot ad histogram of residuals from the model usig TPrice as the respose ad Mileage, Cyl, Doors, Cruise, Soud, ad Leather as the explaatory variables. At this time, it should be clear that simply pluggig data ito a software package ad usig a iterative variable selectio techique will ot reliably create a best model. Key Cocept Before a fial model is selected, the residuals should be plotted agaist fitted (estimated) values, observatio order, the theoretical ormal distributio, ad each explaatory variable i the model. Table 3.1 shows how each residual plot is used to check model assumptios. If a patter exists i ay of the residual plots, the R value is likely to improve if differet explaatory variables or trasformatios are icluded i the model. Table 3.1 Plots that ca be used to evaluate model assumptios about the residuals. Assumptio Normality Zero mea Equal variaces Idepedece Idetically distributed Plot Histogram or ormal probability plot No plot (errors will always sum to zero uder these models) Plot of the residuals versus fits ad each explaatory variable Residuals versus order No plot (esure each subject was sampled from the same populatio withi each group) Correlatio Betwee Explaatory Variables Multicolliearity exists whe two or more explaatory variables i a multiple regressio model are highly correlated with each other. If two explaatory variables X 1 ad X are highly correlated, it ca be very difficult to idetify whether X 1, X, or both variables are actually resposible for ifluecig the respose variable, Y. 14. Create three regressio models usig Price as the respose variable. I all three cases, provide the regressio model, R value, t-statistic for the slope coefficiets, ad correspodig p-values.

13 3.4 Checkig Model Assumptios 79 a. I the first model, use oly Mileage ad Liter as the explaatory variables. Is Liter a importat explaatory variable i this model? b. I the secod model, use oly Mileage ad umber of cyliders (Cyl) as the explaatory variables. Is Cyl a importat explaatory variable i this model? c. I the third model, use Mileage, Liter, ad umber of cyliders (Cyl) as the explaatory variables. How did the test statistics ad p-values chage whe all three explaatory variables were icluded i the model? 15. Note that the R values are essetially the same i all three models i Questio 14. The coefficiets for Mileage also stay roughly the same for all three models the iclusio of Liter or Cyl i the model does ot appear to ifluece the Mileage coefficiet. Depedig o which model is used, we state that for each additioal mile o the car, Price is reduced somewhere betwee $0.15 to $0.16. Describe how the coefficiet for Liter depeds o whether Cyl is i the model. 16. Plot Cyl versus Liter ad calculate the correlatio betwee these two variables. Is there a strog correlatio betwee these two variables? Explai. Recall that Questio 4 suggested deletig Liter from the model. The goal i stepwise regressio is to fid the best model based o the R value. If two explaatory variables both impact the respose i the same way, stepwise regressio will rather arbitrarily pick oe variable ad igore the other. Key Cocept Stepwise regressio ca ofte completely miss importat explaatory variables whe there is multicolliearity. Mathematical Note Most software packages ca create a matrix plot of the correlatios ad correspodig scatterplots of all explaatory variables. This matrix plot is helpful i idetifyig ay patters of iterdepedece amog the explaatory variables. A easy-to-apply guidelie for determiig if multicolliearity eeds to be dealt with is to use the variace iflatio factor (VIF). VIF coducts a regressio of each explaatory variable (X i ) o the remaiig explaatory variables, calculates the correspodig R value (R i ), ad the calculates the followig fuctio for each variable X i : 1/(1 - R i ). If the R i value is zero, VIF is oe, ad X i is ucorrelated with all other explaatory variables. Motgomery, Peck, ad Viig state, Practical experiece idicates that if ay of the VIFs exceed 5 or 10, it is a idicatio that the associated regressio coefficiets are poorly estimated because of multicolliearity. 4 Figure 3.7 shows that Liter ad Cyl are highly correlated. Withi the cotext of this problem, it does t make physical sese to cosider holdig oe variable costat while the other variable icreases. I geeral, it may ot be possible to fix a multicolliearity problem. If the goal is simply to describe or predict retail prices, multicolliearity is ot a critical issue. Redudat variables should be elimiated from the model, but highly correlated variables that both cotribute to the model are acceptable if you are ot iterpretig the coefficiets. However, if your goal is to cofirm whether a explaatory variable is associated with a respose (test a theory), the it is essetial to idetify the presece of multicolliearity ad to recogize that the coefficiets are ureliable whe it exists. Key Cocept If your goal is to create a model to describe or predict, the multicolliearity really is ot a problem. Note that multicolliearity has very little impact o the R value. However, if your goal is to uderstad how a specific explaatory variable iflueces the respose, as is ofte doe whe cofirmig a theory, the multicolliearity ca cause coefficiets (ad their correspodig p-values whe testig their sigificace) to be ureliable.

14 80 Chapter 3 Multiple Regressio: How Much Is Your Car Worth? Figure 3.7 Scatterplot showig a clear associatio betwee Liter ad Cyl. The followig approaches are commoly used to address multicolliearity: Get more iformatio: If it is possible, expadig the data collectio may lead to samples where the variables are ot so correlated. Cosider whether a greater rage of data could be collected or whether data could be measured differetly so that the variables are ot correlated. For example, the data here are oly for GM cars. Perhaps the relatioship betwee egie size i liters ad umber of cyliders is ot so strog for data from a wider variety of maufacturers. Re-evaluate the model: Whe two explaatory variables are highly correlated, deletig oe variable will ot sigificatly impact the R value. However, if there are theoretical reasos to iclude both variables i the model, keep both terms. I our example, Liter ad umber of cyliders (Cyl) are measurig essetially the same quatity. Liter represets the volume displaced durig oe complete egie cycle; umber of cyliders (Cyl) also is a measure of the volume that ca be displaced. Combie the variables: Usig other statistical techiques such as pricipal compoets, it is possible to combie the correlated variables optimally ito a sigle variable that ca be used i the model. There may be theoretical reasos to combie variables i a certai way. For example, the volume (size) ad weight of a car are likely highly positively correlated. Perhaps a ew variable defied as desity = weight/volume could be used i a model predictig price rather tha either of these idividual variables. I this ivestigatio, we are simply attemptig to develop a model that ca be used to estimate price, so multicolliearity will ot have much impact o our results. If we did re-evaluate the model i light of the fact that Liter ad umber of cyliders (Cyl) both measure displacemet (egie size), we might ote that Liter is a more specific variable, takig o several values, while Cyl has oly three values i the data set. Thus, we might choose to keep Liter ad remove Cyl i the model. 3.5 Iterpretig Model Coefficiets While multiple regressio is very useful i uderstadig the impacts of various explaatory variables o a respose, there are importat limitatios. Whe predictors i the model are highly correlated, the size ad meaig of the coefficiets ca be difficult to iterpret. I Questio 14, the followig three models were developed: Price = (Mileage) (Liter) Price = (Mileage) + 408(Cyl) Price = (Mileage) (Liter) + 848(Cyl)

15 3.6 Categorical Explaatory Variables 81 The iterpretatio of model coefficiets is more complex i multiple liear regressio tha i simple liear regressio. It ca be misleadig to try to iterpret a coefficiet without cosiderig other terms i the model. For example, whe Mileage ad Liter are the two predictors i a regressio model, the Liter coefficiet might seem to idicate that a icrease of oe i Liter will icrease the expected price by $4968. However, whe Cyl is also icluded i the model, the Liter coefficiet seems to idicate that a icrease of oe i Liter will icrease the expected price by $1545. The size of a regressio coefficiet ad eve the directio ca chage depedig o which other terms are i the model. I this ivestigatio, we have show that Liter ad Cyl are highly correlated. Thus, it is ureasoable to believe that Liter would chage by oe uit but Cyl would stay costat. The multiple liear regressio coefficiets caot be cosidered i isolatio. Istead, the Liter coefficiet shows how the expected price will chage whe Liter icreases by oe uit, after accoutig for correspodig chages i all the other explaatory variables i the model. Key Cocept I multiple liear regressio, the coefficiets tell us how much the expected respose will chage whe the explaatory variable icreases by oe uit, after accoutig for correspodig chages i all other terms i the model. 3.6 Categorical Explaatory Variables As we saw i Questio 9, there is a clear patter i the residual versus order plot for the Kelley Blue Book car pricig data. It is likely that oe of the categorical variables (Make, Model, Trim, or Type) could explai this patter. If ay of these categorical variables are related to the respose variable, the we wat to add these variables to our regressio model. A commo procedure used to icorporate categorical explaatory variables ito a regressio model is to defie idicator variables, also called dummy variables. Creatig idicator variables is a process of mappig the oe colum (variable) of categorical data ito several colums (idicator variables) of 0 ad 1 data. Let s take the variable Make as a example. The six possible values (Buick, Cadillac, Chevrolet, Potiac, SAAB, Satur) ca be recoded usig six idicator variables, oe for each of the six makes of car. For example, the idicator variable for Buick will have the value 1 for every car that is a Buick ad 0 for each car that is ot a Buick. Most statistical software packages have a commad for creatig the idicator variables automatically. Creatig Idicator Variables 17. Create boxplots or idividual value plots of the respose variable TPrice versus the categorical variables Make, Model, Trim, ad Type. Describe ay patters you see. 18. Create idicator variables for Make. Name the colums, i order, Buick, Cadillac, Chevrolet, Potiac, SAAB, ad Satur. Look at the ew data colums ad describe how the idicator variables are defied. For example, list all possible outcomes for the Cadillac idicator variable ad explai what each outcome represets. Ay of the idicator variables i Questio 18 ca be icorporated ito a regressio model. However, if you wat to iclude Make i its etirety i the model, do ot iclude all six idicator variables. Five will suffice because there is complete redudacy i the sixth idicator variable. If the first five idicator variables are all 0 for a particular car, we automatically kow that this car belogs to the sixth category. Below, we will leave the Buick idicator variable out of our model. The coefficiet for a idicator variable is a estimate of the average amout by which the respose variable will chage. For example, the estimated coefficiet for the Satur variable is a estimate of the average differece i TPrice whe the car is a Satur rather tha a Buick (after adjustig for correspodig chages i all other terms i the model).

16 8 Chapter 3 Multiple Regressio: How Much Is Your Car Worth? Mathematical Note For ay categorical explaatory variable with g groups, oly g - 1 terms should be icluded i the regressio model. Most software packages use matrix algebra to develop multiple regressio models. If all g terms are i the model, explaatory variables will be 100% correlated (you ca exactly predict the value of oe variable if you kow the other variables) ad the eeded matrix iversio caot be doe. If the researcher chooses to leave all g terms i the model, most software packages will arbitrarily remove oe term so that the eeded matrix calculatios ca be completed. It may be temptig to simplify the model by icludig oly a few of the most sigificat idicator variables. For example, istead of icludig five idicator variables for Make, you might cosider oly usig Cadillac ad SAAB. Most statisticias would recommed agaist this. By limitig the model to oly idicator variables that are sigificat i the sample data set, we ca overfit the model. Models are overfit whe researchers overwork a data set i order to icrease the R value. For example, a researcher could sped a sigificat amout of time pickig a few idicator variables from Make, Model, Trim, ad Type i order to fid the best R value. While the model would likely estimate the mea respose well, it ulikely to accurately predict ew values of the respose variable. This is a fairly uaced poit. The purpose of variable selectio techiques is to select the variables that best explai the respose variable. However, overfittig may occur if we break up categorical variables ito smaller uits ad the pick ad choose amog the best parts of those variables. (The Model Validatio sectio i the exteded activities discusses this topic i more detail.) Buildig Regressio Models with Idicator Variables 19. Build a ew regressio model usig TPrice as the respose ad Mileage, Liter, Satur, Cadillac, Chevrolet, Potiac, ad SAAB as the explaatory variables. Explai why you expect the R value to icrease whe you add terms for Make. 0. Create idicator variables for Type. Iclude the Make ad Type idicator variables, plus the variables Liter, Doors, Cruise, Soud, Leather, ad Mileage, i a model to predict TPrice. Remember to leave at least oe category out of the model for the Make ad Type idicator variables (e.g., leave Buick ad Hatchback out of the model). Compare this regressio model to the other models that you have fit to the data i this ivestigatio. Does the ormal probability plot suggest that the residuals could follow a ormal distributio? Describe whether the residual versus fit, the residual versus order, ad the residual versus each explaatory variable plots look more radom tha they did i earlier problems. The additioal categorical variables are importat i improvig the regressio model. Figure 3.8 shows that whe Make ad Type are ot i the model, the residuals are clustered. Whe Make ad Type are icluded i the model, the residuals appear to be more radomly distributed. By icorporatig Make ad Type, we have bee able to explai some of variability that was causig the clusterig. I additio, the sizes of the residuals are much smaller. Smaller residuals idicate a better fittig model ad a higher R value. Eve though Make ad Type improved the residual plots, there is still clusterig that might be improved by icorporatig Model ito the regressio equatio. However, if the goal is simply to develop a model that accurately estimates retail prices, the R value i Questio 0 is already fairly high. Are there a few more terms that ca be added to the model that would dramatically improve the R value? To determie a fial model, you should attempt to maximize the R value while simultaeously keepig the model relatively simple. For your fial model, you should commet o the residual versus fit, residual versus order, ad ay other residual plots that previously showed patters. If ay patter exists i the residual plots, it may be worth attemptig a ew regressio model that will accout for these patters. If the regressio model ca be modified to address the patters i the residuals, the R value will improve. However, ote that the R value is already fairly high. It may ot be worth makig the model more complex for oly a slight icrease i the R value. 1. Create a regressio model that is simple (i.e., does ot have too may terms) ad still accurately predicts retail price. Validate the model assumptios. Look at residual plots ad check for heteroskedasticity, multicolliearity, autocorrelatio, ad outliers. Your fial model should ot have sigificat clusters, skewess, outliers, or heteroskedasticity appearig i the residual plots. Submit your suggested least squares regressio formula alog with a limited umber of appropriate graphs that provide justificatio for your model. Describe why you believe this model is best.

17 3.8 F-Tests for Multiple Regressio 83 Figure 3.8 Residual versus order plots show that icorporatig the idicator variables ito the regressio model improves the radom behavior ad reduces the sizes of the residuals. 3.7 What Ca We Coclude from the 005 GM Car Study? The data are from a observatioal study, ot a experimet. Therefore, eve though the R value reveals a strog relatioship betwee our explaatory variables ad the respose, a sigificat correlatio (ad thus a sigificat coefficiet) does ot imply a causal lik betwee the explaatory variable ad the respose. There may be theoretical or practical reasos to believe that mileage (or ay of the other explaatory variables) causes lower prices, but the fial model ca be used oly to show that there is a associatio. Best subsets ad residual graphs were used to develop a model that is useful for describig or predictig the retail price based o a fuctio of the explaatory variables. However, sice iterative techiques were used, the p-values correspodig to the sigificace of each idividual coefficiet are ot reliable. For this data set, cars were radomly selected withi each make, model, ad type of 005 GM car produced, ad the suggested retail prices were determied from Kelley Blue Book. While this is ot a simple radom sample of all 005 GM cars actually o the road, there is still reaso to believe that your fial model will provide a accurate descriptio or predictio of retail price for used 005 GM cars. Of course, as time goes by, the value of these cars will be reduced ad updated models will eed to be developed. Multiple Regressio 3.8 F-Tests for Multiple Regressio Decompositio of Sum of Squares May of the calculatios ivolved i multiple regressio are very closely related to those for the simple liear regressio model. Simple liear regressio models have the advatage of beig easily visualized with scatterplots. Thus, we start with a simple liear regressio model to derive several key equatios used i multiple regressio. Figure 3.9 shows a scatterplot ad regressio lie for a subset of the used 005 Kelly Blue Book Cars data. The data set is restricted to just Chevrolet Cavalier Sedas. I this scatterplot, oe specific observatio is highlighted: the poit where Mileage is 11,488 ad the observed Price is y i = 14, I Figure 3.9, we see that for ay idividual observatio the total deviatio (y i - y) is decomposed ito two parts: y i - y = ( y i - y i ) + ( y i - y) (3.7)

II. Descriptive Statistics D. Linear Correlation and Regression. 1. Linear Correlation

II. Descriptive Statistics D. Liear Correlatio ad Regressio I this sectio Liear Correlatio Cause ad Effect Liear Regressio 1. Liear Correlatio Quatifyig Liear Correlatio The Pearso product-momet correlatio