Chapter 10. What is Regression Analysis? Simple Linear Regression Analysis. Examples

Chapter 10: Simple Linear Regression Analysis

What Is Regression Analysis?
- A statistical technique that describes the relationship between a dependent variable and one or more independent variables.

Examples
- Consider the relationship between construction permits (x) and carpet sales (y) for a company.
- Or the relationship between advertising expenditures and sales.
- There probably is a relationship: as the number of permits increases, sales should increase; set the advertising expenditure and we can predict sales.
- But how would we measure and quantify this relationship?

Simple Linear Regression Model (SLR)
- Assume the relationship is linear: $Y = a + bx + \varepsilon$
- where Y = dependent variable, x = independent variable, a = y-intercept, b = slope, and $\varepsilon$ = random error.

Random Error Component ($\varepsilon$)
- Makes this a probabilistic model.
- Represents uncertainty: random variation not explained by x.
- Deterministic model = exact relationship. Examples: temperature, $^{\circ}\mathrm{F} = \tfrac{9}{5}\,^{\circ}\mathrm{C} + 32$; accounting, Assets = Liabilities + Equity.
- Probabilistic model = deterministic model + error.

[Figure: the SLR line drawn through a scatter of (X, Y) points, labeled as the line of means.]

Model Parameters
- a and b are estimated from the data.
- Data are collected as pairs (x, y).

Process of Developing an SLR Model
- Hypothesize the model: $E(Y) = a + bx$.
- Estimate the coefficients: $\hat{y} = \hat{a} + \hat{b}x$.
- Specify the distribution of the error term.
- Ask how adequate the model is.
- When the model is appropriate, use it for estimation and prediction.

Fitting the Straight-Line Model: Ordinary Least Squares (OLS)
- Once it is assumed that the model is $Y = a + bx + \varepsilon$, the next step is to collect the data.
- Before estimating the parameters, we must check that the data follow a linear trend.
- Use a scatterplot (also called a scattergram or scatter diagram).

[Figure: "Carpet City Problem", a scatter plot of monthly carpet sales (y) against monthly construction permits (x).]

Assessing Fit (Deviations)
[Figure: the Carpet City scatter plot with a fitted line and the vertical deviations of the points from it.]
- Deviations are also known as errors or residuals (r, e).
- A residual is the difference between the observed value of y and the predicted value of y: $e = r = y - \hat{y}$.
- We want the residuals to be small.

Assessing Fit (Cont.)
- NOTE: the sum of the residuals is 0: $\sum e = \sum (y - \hat{y}) = 0$.

Least Squares Line
- Find the line that minimizes $\sum (y - \hat{y})^2$ with respect to the parameters.
- We can fit many different lines; which one is best?
- The line that best fits the data is the one that minimizes the sum of squares of the errors (SSE). This is the least squares line.
- Recall that $\hat{y} = \hat{a} + \hat{b}x$, so we minimize $SSE = \sum (y - \hat{a} - \hat{b}x)^2$.

Least Squares Line (Cont.)
- The estimated parameters yield the smallest SSE.
- The estimated coefficients are given by $\hat{b} = SS_{xy} / SS_{xx}$ and $\hat{a} = \bar{y} - \hat{b}\bar{x}$,
  where $SS_{xy} = \sum xy - (\sum x)(\sum y)/n$ and $SS_{xx} = \sum x^2 - (\sum x)^2/n$.

Example 1
The Central Company manufactures a certain specialty item once a month in a batch production run. The number of items produced in each run varies from month to month as demand fluctuates. The company is interested in the relationship between the size of the production run (x) and the number of man-hours of labor (y) required for the run. The company has collected the following data for the 10 most recent runs:

Example 1 (Cont.)

Run   Number of items   Labor (man-hours)
 1          40                 83
 2          30                 60
 3          70                138
 4          90                180
 5          50                 97
 6          60                118
 7          70                140
 8          40                 75
 9          80                159
10          70                144

Estimated Regression Equation
$\hat{y} = -1.835 + 2.021x$

Interpretation of Regression Equation
- What does this mean? In $\hat{y} = -1.835 + 2.021x$, x = number of items produced and y = number of man-hours of labor.
- State conclusions in terms of the problem.
- Intercept: when no items are produced, the estimated number of hours is -1.835. Does this make sense? No!

Interpretation (Cont.)
- When using regression to predict a response, the value of the independent variable must fall within the range of the original data.
- A prediction made outside the range of the data is called EXTRAPOLATION and may have little or no validity.
- In our example, the independent variable ranges from 30 to 90, and predictions should be made within this range.
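The Example 1 fit can be reproduced with a few lines of code. The following is a minimal sketch, not the course's own procedure: the helper names `least_squares` and `predict` are hypothetical, and the data assume that run 2 reads 30 items and 60 man-hours (the run number is missing in the extracted table). The range check in `predict` is one possible way to guard against extrapolation.

```python
# Example 1 data: ten production runs (run 2 assumed to be 30 items, 60 man-hours).
items = [40, 30, 70, 90, 50, 60, 70, 40, 80, 70]
hours = [83, 60, 138, 180, 97, 118, 140, 75, 159, 144]


def least_squares(x, y):
    """Return (a_hat, b_hat) for y = a + b*x using b = SSxy / SSxx."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = ss_xy / ss_xx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


def predict(x_new, a_hat, b_hat, x_min, x_max):
    """Predict y at x_new, refusing to extrapolate outside the observed x range."""
    if not x_min <= x_new <= x_max:
        raise ValueError("x_new is outside the range of the data (extrapolation)")
    return a_hat + b_hat * x_new


a_hat, b_hat = least_squares(items, hours)
residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(items, hours)]
sse = sum(e ** 2 for e in residuals)

print(f"a_hat = {a_hat:.4f}, b_hat = {b_hat:.4f}")
print(f"sum of residuals = {sum(residuals):.2e}, SSE = {sse:.4f}")
print(f"hours for 65 items = {predict(65, a_hat, b_hat, min(items), max(items)):.2f}")
```

Calling `predict(0, ...)` raises an error here, which mirrors the point that the intercept has no sensible interpretation when x = 0 lies outside the data.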

Interpretation (Cont.)
- Slope: for every unit change in x, the average value of y changes by the slope.
- In the example, 2.021 implies that for every additional item produced, the average number of man-hours is expected to increase by 2.021.

Example 2
The Tri-City Office Equipment Corporation sells an imported desk calculator on a franchise basis and performs preventive maintenance and repair service on this calculator. Data have been collected from 18 recent calls on users to perform routine preventive maintenance service; for each call, x is the number of machines serviced and y is the total number of minutes spent by the service person.

Example 2 (Cont.)
Obtain the estimated regression equation:
$\hat{b} = SS_{xy} / SS_{xx} = 1{,}098 / 74.5 = 14.7383$
$\hat{a} = \bar{y} - \hat{b}\bar{x} = 64 - 14.7383(4.5) = -2.3224$
$\hat{y} = -2.3224 + 14.7383x$

Example 2 (Cont.): Interpretations
- Intercept: when no machines are serviced, the repairman spends an average of -2.3224 minutes. Note that x = 0 is probably not in the range of the data, so the intercept has no meaningful interpretation.
- Slope: for each additional machine serviced, we would expect approximately 14.74 more minutes of service time.

Model Assumptions
- $E(\varepsilon) = 0$
- $\mathrm{Var}(\varepsilon) = \sigma^2$
- $\varepsilon$ is normally distributed
- The $\varepsilon_i$ are independent
- Before performing regression analysis, these assumptions should be validated.

[Figure: "The Nature of a Statistical Relationship", a regression curve with the probability distributions for Y at different levels of X.]

[Figure: "Assumptions for Regression", the unknown relationship $Y = \beta_0 + \beta_1 X$.]

Descriptive Measures of Association
- Coefficient of Determination ($R^2$)

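[Figure: "Error Decomposition". For one data point, the plot marks the actual value $Y$, the estimated value $\hat{Y}$ on the line $\hat{y} = \hat{a} + \hat{b}x$, and the mean $\bar{Y}$, so that the total deviation $Y - \bar{Y}$ splits into $Y - \hat{Y}$ and $\hat{Y} - \bar{Y}$.]

Coefficient of Determination (Cont.)
- $R^2 = 1 - SSE / SS_{yy}$, with $0 \le R^2 \le 1$.
- The larger $R^2$, the more variability is explained by the regression model.

[Figures: two scatter plots of Y against X illustrating the coefficient of determination.]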
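The decomposition pictured on the error decomposition slide can be written out as a short derivation. This is a standard result rather than something stated on the slides; the symbol $SSR$ (regression sum of squares) is introduced here only for illustration.

```latex
% Each deviation from the mean splits into an explained and an unexplained part:
y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)

% For the least squares line the cross-product term sums to zero, so
\underbrace{\sum (y_i - \bar{y})^2}_{SS_{yy}}
  = \underbrace{\sum (\hat{y}_i - \bar{y})^2}_{SSR}
  + \underbrace{\sum (y_i - \hat{y}_i)^2}_{SSE}

% Dividing by SS_{yy} gives the two equivalent forms of the coefficient of determination:
R^2 = \frac{SSR}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}
```

Because $SSE$ can never exceed $SS_{yy}$ for a least squares fit with an intercept, $R^2$ stays between 0 and 1.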

Correlation Coefficient (r)
- The square root of $R^2$, taking the sign of the slope $\hat{b}$
- Also known as the Pearson product-moment correlation coefficient
- Unitless
- $-1 \le r \le 1$
- Describes the strength of the linear relationship between x and y

Correlation Coefficient (r): Computational Formula
$r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$
- r near -1 implies a strong negative relationship
- r near 0 implies no linear relationship
- r near +1 implies a strong positive relationship

Measures of Association (Cont.)
- High correlation does not imply causation. What does this mean?

Estimation and Prediction
Once satisfied with the model, we can perform:
- Estimation of the mean value of y for a given value of x
- Prediction of a new observation for a given value of x
- Where do we expect to have the most success?
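As an illustration of the computational formula for r, the sketch below applies it to the Central Company data of Example 1 (assuming run 2 is 30 items and 60 man-hours). The variable names are chosen for this example only.

```python
from math import sqrt

# Example 1 data (Central Company); run 2 assumed to be (30, 60).
items = [40, 30, 70, 90, 50, 60, 70, 40, 80, 70]
hours = [83, 60, 138, 180, 97, 118, 140, 75, 159, 144]

n = len(items)
x_bar, y_bar = sum(items) / n, sum(hours) / n

# Corrected sums of squares and cross-products.
ss_xx = sum((x - x_bar) ** 2 for x in items)
ss_yy = sum((y - y_bar) ** 2 for y in hours)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(items, hours))

r = ss_xy / sqrt(ss_xx * ss_yy)   # Pearson correlation coefficient
r_squared = r ** 2                # coefficient of determination for SLR

print(f"SSxx = {ss_xx:.1f}, SSyy = {ss_yy:.1f}, SSxy = {ss_xy:.1f}")
print(f"r = {r:.6f}, R^2 = {r_squared:.6f}")
```

The sign of r comes from $SS_{xy}$, which is also the numerator of the slope, so r and $\hat{b}$ always share a sign.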

Estimation & Prediction (Cont.)
- The fitted SLR model is $\hat{y} = \hat{a} + \hat{b}x$.
- Estimating the mean of y at a given value of x, say $x_p$, yields the same value as predicting y at $x_p$.
- The difference is in the precision of the estimate, that is, in the sampling errors.

[Figures: "Scatter Plot of Correct Model", "Scatter Plot of Curvilinear Model" and "Scatter Plot of Outlier Model". Each panel shows the same fitted line, Y = 3.0 + 0.5X, with $R^2$ = 0.67.]
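The difference in precision between estimating a mean and predicting a new observation can be made explicit with the standard SLR formulas. These are textbook results, not taken from the slides; $s$ denotes the standard error of the regression and $t_{\alpha/2,\,n-2}$ the usual t critical value.

```latex
% Standard error of the regression (n - 2 degrees of freedom in SLR):
s = \sqrt{\frac{SSE}{n - 2}}

% Confidence interval for the MEAN of y at x = x_p:
\hat{y} \pm t_{\alpha/2,\,n-2}\; s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}

% Prediction interval for a NEW observation of y at x = x_p:
\hat{y} \pm t_{\alpha/2,\,n-2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}
```

The extra 1 under the square root is why a prediction interval is always wider than the corresponding interval for the mean, and the $(x_p - \bar{x})^2$ term shows that both intervals are narrowest when $x_p$ is close to $\bar{x}$.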

[Figure: "Scatter Plot of Influential Model", again with fitted line Y = 3.0 + 0.5X and $R^2$ = 0.67.]

Verifying Assumptions
- Examining residual plots

Regression and Excel
- Excel also has a built-in tool for performing regression that is easier to use and provides a lot more information about the problem.
- To install the Regression tool: Tools > Add-Ins > Analysis ToolPak.
- Then, to perform the analysis: Data > Data Analysis > Regression.

The TREND( ) Function
TREND(Y-range, X-range, X-value for prediction)
where:
- Y-range is the spreadsheet range containing the dependent Y variable,
- X-range is the spreadsheet range containing the independent X variable(s),
- X-value for prediction is a cell (or cells) containing the values of the independent X variable(s) for which we want an estimated value of Y.

Entering the Central Company Data (see Example 1)
Note: The TREND( ) function is dynamically updated whenever any inputs to the function change. However, it does not provide the statistical information provided by the regression tool. It is best to use these two approaches to regression in conjunction with one another.

Important Software Note
When using more than one independent variable, all variables for the X-range must be in one contiguous block of cells (that is, in adjacent columns).

Regression Output

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9977
R Square             0.9955
Adjusted R Square    0.9949
Standard Error       2.8053
Observations         10

ANOVA
             df   SS            MS            F           Significance F
Regression    1   13881.4412    13881.4412    1763.8755   1.13834E-10
Residual      8      62.9588        7.8699
Total         9   13944.4000

                  Coefficients   Standard Error   t Stat    P-value       Lower 95%   Upper 95%
Intercept           -1.8353         3.0199        -0.6077   0.560          -8.7992      5.1286
Number of Items      2.0206         0.0481        41.9985   1.13834E-10     1.9096      2.1315
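To see where the numbers in the regression output come from, the sketch below recomputes a few of them for the Central Company data and includes a small stand-in for a single-variable TREND( ) call. The function names `fit_line` and `trend` are hypothetical and only mimic the behaviour described above; they are not Excel's implementation. Run 2 is again assumed to be 30 items and 60 man-hours.

```python
from math import sqrt

items = [40, 30, 70, 90, 50, 60, 70, 40, 80, 70]
hours = [83, 60, 138, 180, 97, 118, 140, 75, 159, 144]


def fit_line(x, y):
    """Least squares intercept, slope and SSxx for y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = ss_xy / ss_xx
    return y_bar - b * x_bar, b, ss_xx


def trend(known_y, known_x, new_x):
    """Point estimate of y at new_x from a straight-line fit (one x variable)."""
    a, b, _ = fit_line(known_x, known_y)
    return a + b * new_x


a, b, ss_xx = fit_line(items, hours)
n = len(items)
y_bar = sum(hours) / n

ss_yy = sum((y - y_bar) ** 2 for y in hours)                   # Total SS
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(items, hours))  # Residual SS
ssr = ss_yy - sse                                              # Regression SS

s = sqrt(sse / (n - 2))        # "Standard Error" in the summary block
f_stat = ssr / (sse / (n - 2)) # F = MS regression / MS residual (1 regression df)
se_b = s / sqrt(ss_xx)         # standard error of the slope
t_b = b / se_b                 # t Stat for "Number of Items"

print(f"Total SS = {ss_yy:.4f}, Regression SS = {ssr:.4f}, Residual SS = {sse:.4f}")
print(f"s = {s:.4f}, F = {f_stat:.4f}, se(b) = {se_b:.4f}, t = {t_b:.4f}")
print(f"trend at 60 items = {trend(hours, items, 60):.2f}")
```

In simple linear regression the F statistic equals the square of the slope's t statistic, which is why both carry the same p-value in the output.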

Regression Plot
[Figure: "Central Company", man-hours of labor plotted against number of items, with the fitted line y = 2.0206x - 1.8353 and $R^2$ = 0.9955.]

Multiple Regression & Model Building
- Most regression problems involve more than one independent variable.
- If each independent variable varies in a linear manner with y, the estimated regression function in this case is:
  $\hat{y} = \hat{a} + \hat{b}_1 x_1 + \hat{b}_2 x_2 + \cdots + \hat{b}_k x_k$
- The optimal values for the $\hat{b}_i$ can again be found by minimizing the ESS (the error sum of squares, called SSE in simple regression).
- The resulting function fits a hyperplane to our sample data.

[Figure: "Example Regression Surface for Two Independent Variables", a plane fitted through points plotted against $X_1$, $X_2$ and Y.]

Example: Admissions Data
- In SLR, we had $x_1$ = entrance test score and y = end-of-year GPA.
- Suppose other factors are involved: $x_2$ = HS GPA, $x_3$ = SAT score.
- The model becomes $y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \varepsilon$.

Independent Variables
- May represent higher-order terms: $x_1$ = age, $x_2$ = age$^2$.
- May be dummy/indicator variables: $x_3$ = 0 if female, 1 if male.
- May be functions of independent variables: $x_4$ = price, $x_5$ = industry average price, $x_6$ = price difference = $x_5 - x_4$.

Steps to Developing a Multiple Regression Model
1. Hypothesize the model: $y = a + b_1 x_1 + \cdots + b_k x_k + \varepsilon$
2. Estimate the coefficients
3. Specify the distribution of $\varepsilon$ and estimate $\sigma^2$
4. Validate the model assumptions
5. Evaluate model adequacy
6. Use the model for estimation and prediction

Fitting the Model
- Each $b_i$ represents the change in y for each unit change in $x_i$ when ALL other x's are held constant.
- The method of fitting is the same as in SLR: estimate the b's to minimize SSE.
- Computationally intensive, so use MS Excel.

Multiple Regression: Salsberry Realty
Salsberry Realty sells homes along the east coast of the United States. One of the questions frequently asked by prospective buyers is: If we purchase this home, how much can we expect to pay to heat it during the winter? The research department at Salsberry has been asked to develop some guidelines regarding heating costs for single-family homes. Three variables are thought to relate to the heating costs: (1) the mean daily outside temperature, (2) the number of inches of insulation in the attic, and (3) the age of the furnace. To investigate, Salsberry's research department selected a random sample of 20 recently sold homes. They determined the cost to heat each home last January, as well as the mean outside temperature during January in the region, the number of inches of insulation in the attic, and the age of the furnace. The sample information is given below.

Multiple Regression: Salsberry Realty (data)

Home   Heating Cost ($)   Mean Outside Temperature (°F)   Attic Insulation (inches)   Age of Furnace (years)
  1         250                     35                               3                          6
  2         360                     29                               4                         10
  3         165                     36                               7                          3
  4          43                     60                               6                          9
  5          92                     65                               5                          6
  6         200                     30                               5                          5
  7         355                     10                               6                          7
  8         290                      7                              10                         10
  9         230                     21                               9                         11
 10         120                     55                               2                          5
 11          73                     54                              12                          4
 12         205                     48                               5                          1
 13         400                     20                               5                         15
 14         320                     39                               4                          7
 15          72                     60                               8                          6
 16         272                     20                               5                          8
 17          94                     58                               7                          3
 18         190                     40                               8                         11
 19         235                     27                               9                          8
 20         139                     30                               7                          5

Multiple Regression: Salsberry Realty (questions)
- Determine the multiple regression equation. Which variables are the independent variables? Which variable is the dependent variable?
- Use MS Excel to develop a regression equation.
- Discuss the regression coefficients. What does it indicate that some are positive and some are negative?
- What is the intercept value?
- What is the estimated heating cost for a home where the mean outside temperature is 30 degrees, there are 5 inches of insulation in the attic, and the furnace is 10 years old?

Multiple Regression: Salsberry Realty (model)
The hypothesized model is $y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \varepsilon$, where
- y = heating cost
- $x_1$ = mean outside temperature
- $x_2$ = attic insulation
- $x_3$ = age of furnace
- $\varepsilon$ = random error

[Figures: Scatterplot 1, mean outside temperature vs. heating cost; Scatterplot 2, attic insulation vs. heating cost; Scatterplot 3, age of furnace vs. heating cost.]

Multiple Regression: Salsberry Realty (Excel output)

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.8968
R Square             0.8042
Adjusted R Square    0.7675
Standard Error       51.0486
Observations         20

ANOVA
             df   SS             MS            F         Significance F
Regression    3   171220.4728    57073.4909    21.9012   6.56178E-06
Residual     16    41695.2772     2605.9548
Total        19   212915.7500

                               Coefficients   Standard Error   t Stat    P-value       Lower 95%   Upper 95%
Intercept                        427.1938        59.6014        7.1675   approx. 2E-06  300.8444    553.5432
Mean Outside Temperature (°F)     -4.5827         0.7723       -5.9336   2.10035E-05     -6.2199     -2.9454
Attic Insulation (inches)        -14.8309         4.7544       -3.1194   0.0066         -24.9098     -4.7520
Age of Furnace (years)             6.1010         4.0121        1.5207   0.1479          -2.4043     14.6063
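The Excel output for Salsberry Realty can be cross-checked by solving the least squares normal equations directly. The sketch below is one possible way to do this in plain Python; the helper names `solve_linear_system` and `fit_multiple` are hypothetical, and the data are the 20 homes as recovered from the table (digits lost in extraction were restored so that the totals agree with the ANOVA block). It also evaluates the fitted equation for the home described in the questions (30 degrees, 5 inches, 10 years).

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_multiple(rows, y):
    """Least squares coefficients [a, b1, ..., bk] from the normal equations."""
    design = [[1.0] + [float(v) for v in r] for r in rows]  # leading 1 = intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# (heating cost, mean outside temp, attic insulation, furnace age) for homes 1-20
homes = [
    (250, 35, 3, 6), (360, 29, 4, 10), (165, 36, 7, 3), (43, 60, 6, 9),
    (92, 65, 5, 6), (200, 30, 5, 5), (355, 10, 6, 7), (290, 7, 10, 10),
    (230, 21, 9, 11), (120, 55, 2, 5), (73, 54, 12, 4), (205, 48, 5, 1),
    (400, 20, 5, 15), (320, 39, 4, 7), (72, 60, 8, 6), (272, 20, 5, 8),
    (94, 58, 7, 3), (190, 40, 8, 11), (235, 27, 9, 8), (139, 30, 7, 5),
]
cost = [h[0] for h in homes]
predictors = [h[1:] for h in homes]

coef = fit_multiple(predictors, cost)


def estimate(temp, insulation, age):
    """Point estimate of heating cost from the fitted equation."""
    return coef[0] + coef[1] * temp + coef[2] * insulation + coef[3] * age


mean_cost = sum(cost) / len(cost)
sst = sum((c - mean_cost) ** 2 for c in cost)
sse = sum((c - estimate(*p)) ** 2 for c, p in zip(cost, predictors))
r_squared = 1 - sse / sst

print("coefficients:", [round(c, 4) for c in coef])
print(f"R^2 = {r_squared:.4f}")
print(f"estimated cost at 30 F, 5 in, 10 yr = {estimate(30, 5, 10):.2f}")
```

Solving the normal equations by hand-rolled elimination is adequate for a small, well-scaled problem like this one; for larger or nearly collinear data a library routine based on a QR or SVD factorization would be the safer choice.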

Multiple Regression: Salsberry Realty (results)
- Estimated regression equation: $\hat{y} = 427.19 - 4.58x_1 - 14.83x_2 + 6.10x_3$
- Discussion: give meaningful interpretations of the coefficients, and check the range of each independent variable.
- Estimate the heating cost when the mean outside temperature is 30 °F, there are 5 inches of insulation, and the furnace is 10 years old.

Estimation and Prediction
- Model assumptions: same as SLR, $\varepsilon \sim N(0, \sigma^2)$.
- Estimation of the variance $\sigma^2$: $s^2 = MSE = \dfrac{SSE}{n - k - 1}$.

Using Dummy/Indicator Variables
- Qualitative variables can also be used in the regression model.
- Dummy/indicator or binary (0, 1) variables denote the presence or absence of the characteristic of interest.

Using Dummy/Indicator Variables (Cont.)
- A qualitative variable with c classes is represented by (c - 1) dummy/indicator variables in the model, each taking on the values 0 and 1.
- Example: suppose we have an independent variable that represents type of diet: Weight Watchers, Atkins, Body for Life, and Protein.
- Note that we have 4 classes (c = 4).
- We will need (c - 1) = 3 variables in the model.

Using Dummy/Indicator Variables (Cont.)
Types of diet could be modeled as:
- $x_1$ = 1 if Weight Watchers (WW), 0 otherwise
- $x_2$ = 1 if Atkins, 0 otherwise
- $x_3$ = 1 if Body for Life (BFL), 0 otherwise
The Protein diet is then the baseline class, identified by $x_1 = x_2 = x_3 = 0$.

Motion Picture Industry Example
A motion picture industry analyst wants to estimate the gross earnings generated by a movie. The estimate will be based on different variables involved in the film's production. The independent variables considered are $X_1$ = production cost of the movie and $X_2$ = total cost of all promotional activities. A third variable ($X_3$) that the analyst wants to consider is whether or not the movie is based on a book published before the release of the movie. The analyst obtains information on a random sample of 20 Hollywood movies made within the last five years. The data are given in the following table. The model could resemble: $y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \varepsilon$

Motion Picture Industry Example (data)

Movie   Gross Earnings (millions $)   Production Cost (millions $)   Promotion Cost (millions $)   Book
  1              28                            4.2                            1                    No
  2              35                            6.0                            3                    Yes
  3              50                            5.5                            6                    Yes
  4              20                            3.3                            1                    No
  5              75                           12.5                           11                    Yes
  6              60                            9.6                            8                    Yes
  7              15                            2.5                            0.5                  No
  8              45                           10.8                            5                    No
  9              50                            8.4                            3                    Yes
 10              34                            6.6                            2                    No
 11              48                           10.7                            1                    Yes
 12              82                           11.0                           15                    Yes
 13              24                            3.5                            4                    No
 14              50                            6.9                           10                    No
 15              58                            7.8                            9                    Yes
 16              63                           10.1                           10                    No
 17              30                            5.0                            1                    Yes
 18              37                            7.5                            5                    No
 19              45                            6.4                            8                    Yes
 20              72                           10.0                           12                    Yes

Motion Picture Industry Example (questions)
- Prepare scatter plots of gross earnings versus production cost and versus promotion cost. Does there appear to be a linear relationship between gross earnings and either production cost or promotion cost?
- If the analyst were to use a simple linear regression model to predict gross earnings, which variable should be used? Explain.
- Determine the parameter estimates for the model $\hat{Y} = \hat{a} + \hat{b}_1 X_1 + \hat{b}_2 X_2$. Analyze the results.
- Determine the parameter estimates for the model $\hat{Y} = \hat{a} + \hat{b}_1 X_1 + \hat{b}_2 X_2 + \hat{b}_3 X_3$. Does $X_3$ help explain the gross earnings when $X_1$ and $X_2$ are also in the model? Explain.

[Figures: "Gross Earnings v. Production Cost" and "Gross Earnings v. Promotion Cost", scatter plots of gross earnings against each cost variable.]

Motion Picture Industry Example (simple regressions)
With simplicity in mind, suppose we fit three simple linear regression functions:
$\hat{y} = \hat{a} + \hat{b}_1 x_1$
$\hat{y} = \hat{a} + \hat{b}_2 x_2$
$\hat{y} = \hat{a} + \hat{b}_3 x_3$

Key regression results are:

Variable in the Model   R^2     Adjusted R^2   S_e      Parameter Estimates
X1                      0.751   0.738           9.506   a = 5.071,  b1 = 5.527
X2                      0.779   0.766           8.970   a = 24.332, b2 = 3.761
X3                      0.299   0.260          15.960   a = 35.111, b3 = 19.889

The model using $X_2$ accounts for 77.9% of the variation in y, leaving approximately 22% unaccounted for.

Motion Picture Industry Example (two-variable model)

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9666
R Square             0.9344
Adjusted R Square    0.9267
Standard Error       5.0253
Observations         20

ANOVA
             df   SS           MS           F          Significance F
Regression    2   6113.6418    3056.8209    121.0458   8.79959E-11
Residual     17    429.3082      25.2534
Total        19   6542.9500

                  Coefficients   Standard Error   t Stat   P-value        Lower 95%   Upper 95%
Intercept            8.1517         3.1763        2.5664   0.0200          1.4503     14.8532
Production Cost      3.2675         0.5143        6.3527   approx. 7E-06   2.1807      4.3542
Promotion Cost       2.3674         0.3438        6.8856   2.64016E-06     1.6420      3.0928

Motion Picture Industry Example (two-variable model, results)
Modeling earnings using production and promotion costs yields:
$\hat{y} = 8.15 + 3.27x_1 + 2.37x_2$
- $R^2$ = 0.9344, which implies that 93.44% of the variation in earnings can be explained by production and promotion costs.
- s = 5.025, which is substantially smaller than in any of the SLR models.

Motion Picture Industry Example (full model)

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9832
R Square             0.9667
Adjusted R Square    0.9605
Standard Error       3.6895
Observations         20

ANOVA
             df   SS           MS           F          Significance F
Regression    3   6325.1518    2108.3839    154.8868   4.95768E-12
Residual     16    217.7982      13.6124
Total        19   6542.9500

                  Coefficients   Standard Error   t Stat   P-value       Lower 95%   Upper 95%
Intercept            7.8362         2.3333        3.3583   0.0040         2.8896     12.7827
Production Cost      2.8477         0.3923        7.2583   1.91353E-06    2.0160      3.6794
Promotion Cost       2.2782         0.2534        8.9894   1.18387E-07    1.7410      2.8155
Book                 7.1661         1.8180        3.9418   0.0012         3.3122     11.0200

Motion Picture Industry Example (full model, results)
Using the full model:
$\hat{y} = 7.84 + 2.85x_1 + 2.28x_2 + 7.17x_3$
- $R^2$ increases to 96.67% and the standard error is reduced to 3.6895.

Motion Picture Industry Example: Indicator Variables Revisited
$\hat{y} = 7.84 + 2.85x_1 + 2.28x_2 + 7.17x_3$
- Note that $x_3$ takes on the values 0 and 1, so the fitted surface for movies based on a book ($x_3 = 1$) sits 7.17 (million dollars) above the one for movies that are not ($x_3 = 0$), with the same slopes for $x_1$ and $x_2$.
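The two motion picture models can be refitted from the data table as a check on the Excel output, and as an illustration of coding the Book column as a 0/1 indicator. This is a sketch under stated assumptions: the helper names are hypothetical, and the table values are those recovered from the slide (for instance 12.5 for the production cost of movie 5 and 12 for the promotion cost of movie 20).

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_multiple(rows, y):
    """Least squares coefficients [a, b1, ..., bk] from the normal equations."""
    design = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# (gross earnings, production cost, promotion cost, based on a book?)
movies = [
    (28, 4.2, 1, "No"), (35, 6.0, 3, "Yes"), (50, 5.5, 6, "Yes"), (20, 3.3, 1, "No"),
    (75, 12.5, 11, "Yes"), (60, 9.6, 8, "Yes"), (15, 2.5, 0.5, "No"), (45, 10.8, 5, "No"),
    (50, 8.4, 3, "Yes"), (34, 6.6, 2, "No"), (48, 10.7, 1, "Yes"), (82, 11.0, 15, "Yes"),
    (24, 3.5, 4, "No"), (50, 6.9, 10, "No"), (58, 7.8, 9, "Yes"), (63, 10.1, 10, "No"),
    (30, 5.0, 1, "Yes"), (37, 7.5, 5, "No"), (45, 6.4, 8, "Yes"), (72, 10.0, 12, "Yes"),
]
gross = [m[0] for m in movies]

# Indicator coding: x3 = 1 if the movie is based on a book, 0 otherwise.
two_var_rows = [(m[1], m[2]) for m in movies]
full_rows = [(m[1], m[2], 1 if m[3] == "Yes" else 0) for m in movies]

two_var = fit_multiple(two_var_rows, gross)
full = fit_multiple(full_rows, gross)


def r_squared(coef, rows, y):
    """Coefficient of determination, 1 - SSE / SSyy."""
    y_bar = sum(y) / len(y)
    fitted = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / ss_yy


print("two-variable model:", [round(c, 4) for c in two_var])
print("full model:        ", [round(c, 4) for c in full])
print(f"R^2: {r_squared(two_var, two_var_rows, gross):.4f} -> "
      f"{r_squared(full, full_rows, gross):.4f}")
```

With this coding, the coefficient on the indicator is the estimated difference in gross earnings between book-based and other movies at the same production and promotion costs.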

Selecting the Model
- We want to identify the simplest model that adequately accounts for the systematic variation in the dependent variable, y.
- Arbitrarily using all of the independent variables may result in overfitting.

Adjusted $R^2$ Statistic
As additional independent variables are added to a model:
- The $R^2$ statistic can only increase (it never decreases).
- The adjusted $R^2$ statistic can increase or decrease.
$R_a^2 = 1 - \dfrac{SSE}{SS_{yy}} \cdot \dfrac{n - 1}{n - k - 1}$, and adjusted $R^2 \le R^2$.
- The $R^2$ statistic can be artificially inflated by adding any independent variable to the model.
- We can compare adjusted $R^2$ values as a heuristic to tell whether adding an additional independent variable really helps to improve a regression model.

Motion Picture Industry Example
Key regression results are:

Variables in the Model   R^2     Adjusted R^2   S_e     Parameter Estimates
x1                       0.751   0.738          9.506   a = 5.071, b1 = 5.527
x1 & x2                  0.934   0.927          5.025   a = 8.152, b1 = 3.267, b2 = 2.367
x1, x2 & x3              0.967   0.960          3.689   a = 7.836, b1 = 2.848, b2 = 2.278, b3 = 7.166

The model using $x_1$, $x_2$, and $x_3$ appears to be best:
- Highest adjusted $R^2$ and highest $R^2$
- Lowest s (most precise prediction intervals)

Estimation and Prediction
- Same as in SLR.
- As in SLR, the difference lies in the standard errors of estimation and of prediction.
- In multiple regression, these standard errors are complex and beyond the scope of this class.
- We will rely on MS Excel output.
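The adjusted $R^2$ formula is simple enough to check by hand or in code. A minimal sketch follows, using the $R^2$ values reported in the Excel outputs for the motion picture models (n = 20); the function name `adjusted_r2` is chosen for this illustration only.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (SSE / SSyy) * (n - 1) / (n - k - 1), with SSE / SSyy = 1 - R^2."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


n = 20  # movies in the sample

# R^2 values from the regression outputs: (label, R^2, number of predictors k)
models = [
    ("x1", 0.751, 1),
    ("x1 & x2", 0.9344, 2),
    ("x1, x2 & x3", 0.9667, 3),
]

for label, r2, k in models:
    print(f"{label:12s} R^2 = {r2:.4f}  adjusted R^2 = {adjusted_r2(r2, n, k):.4f}")
```

Because the penalty factor $(n - 1)/(n - k - 1)$ grows with k, an added variable raises the adjusted value only if it reduces SSE by more than the penalty costs.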

Concerns
- Parameter estimability
  - Inability of the model to estimate parameters because the data are concentrated in one area.
  - The data must include at least one more level of x than the highest order of the x-variable that is included in the model.
- Multicollinearity
  - A relationship between two or more independent variables: variables contributing the same information.
  - If two or more variables are highly correlated, then we only need one of them in the model.
- Extrapolation (already discussed in SLR)
- Correlated errors
  - Measurements on the dependent variable are correlated, as in time series analysis.

Polynomial Regression
Sometimes the relationship between a dependent and an independent variable is not linear.

[Figure: scatter plot of selling price (vertical axis, $50 to $175) against square footage (horizontal axis, 0.900 to 2.400).]

This graph suggests a quadratic relationship between square footage (X) and selling price (Y).

Polynomial Regression (Cont.)
An appropriate regression function in this case might be
$\hat{y} = \hat{a} + \hat{b}_1 x_1 + \hat{b}_2 x_1^2$
or equivalently
$\hat{y} = \hat{a} + \hat{b}_1 x_1 + \hat{b}_2 x_2$, where $x_2 = x_1^2$.

[Figure: "Graph of Estimated Quadratic Regression Function", selling price against square footage with the fitted quadratic curve.]

Fitting a Third Order Polynomial Model
We could also fit a third order polynomial model,
$\hat{Y} = \hat{a} + \hat{b}_1 X_1 + \hat{b}_2 X_1^2 + \hat{b}_3 X_1^3$
or equivalently
$\hat{Y} = \hat{a} + \hat{b}_1 X_1 + \hat{b}_2 X_2 + \hat{b}_3 X_3$, where $X_2 = X_1^2$ and $X_3 = X_1^3$.

[Figure: "Graph of Estimated Third Order Polynomial Regression Function", selling price against square footage with the fitted cubic curve.]

Polynomial Regression: Overfitting
- When fitting polynomial models, care must be taken to avoid overfitting.
- The adjusted $R^2$ statistic can also be used for building/fitting polynomial regression models.
- We can gauge the amount of overfitting by validating the fit, that is, by using a training sample to build the model and a validation sample to examine its estimation or prediction accuracy.
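The "or equivalently" step, treating powers of $X_1$ as extra columns in an ordinary multiple regression, can be sketched as follows. The selling-price data are not given in the slides, so the numbers below are synthetic and noise-free, generated from a known quadratic purely to show that the fit recovers it; they are not the housing data. Helper names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_polynomial(x, y, degree):
    """Fit y = a + b1*x + ... + b_d*x^d as a multiple regression on powers of x."""
    design = [[xi ** p for p in range(degree + 1)] for xi in x]  # columns 1, x, x^2, ...
    k = degree + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# SYNTHETIC illustration data: y = 1 + 2x + 3x^2 exactly, no noise.
x = [0.9, 1.2, 1.5, 1.8, 2.1, 2.4]
y = [1 + 2 * xi + 3 * xi ** 2 for xi in x]

quadratic = fit_polynomial(x, y, degree=2)
cubic = fit_polynomial(x, y, degree=3)

print("quadratic fit:", [round(c, 6) for c in quadratic])
print("cubic fit:    ", [round(c, 6) for c in cubic])
```

On these noise-free numbers the cubic term comes out essentially zero. With real, noisy data a higher-order term will usually pick up some of the noise, which is the overfitting risk that the adjusted $R^2$ and a validation sample are meant to detect.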