Introduction to Regression
Dr. Tom Ilvento
Department of Food and Resource Economics

Overview
  The last part of the course will focus on Regression Analysis
  This is one of the more powerful statistical techniques
    Provides estimates from a model
    Allows for inference and testing hypotheses
    Extends our abilities from ANOVA
    Enables us to test theories
  We will start with simple, bivariate models: Y is a function of a single X variable
  And then move toward the complex, multivariate models: Y is a function of a set of X variables

Regression
  We are looking at the relationship between two or more variables
  One is called the Dependent Variable (Y), which is to be modeled or predicted
  The others are called Independent Variables (X or a set of Xs), which are used to explain, estimate, or predict Y
  In a bivariate (two variable) case, one way to express the relationship is in terms of covariance and correlation:
    Expressed as linear measures of association
    Symmetric measures
  Regression is an extension of correlation/covariance
    Still linear
    No longer symmetric
  Covariance is the basic building block of regression

Regression Models
  Regression models represent an assortment of models with assumptions about the dependent variable (continuous or discrete) and the form of the relationship with independent variables (linear or nonlinear)
  We will focus on the following
    Regression Models
      Bivariate: Linear, Nonlinear, Discrete
      Multivariate: Linear, Nonlinear, Discrete

Fahrenheit versus Celsius
  When I go to Europe I have to deal with temperatures in Celsius
  How do I convert from C to F?
  A friend once told me a quick rule of thumb was to double C and add 30
  I used my calculator to make a small data set of values
  And I used it in a regression

     F        C
    10   -12.22
    15    -9.44
    20    -6.67
    25    -3.89
    30    -1.11
    32     0.00
    35     1.67
    45     7.22
    50    10.00
    55    12.78
    60    15.56
    65    18.33
    70    21.11
    75    23.89
    80    26.67
    85    29.44
    90    32.22
    95    35.00

Fitting a Line to the data
  The relationship between F and C is perfect, r = 1
  It is a deterministic function
  I will run a regression of F on C and see what equation I get
  Regression will generate a best fitting line to the data
  In Excel I will use Tools, Data Analysis, Regression
  [Figure: Fahrenheit vs. Celsius Temperature. Scatter plot of F (vertical axis) against C (horizontal axis) with the fitted line y = 1.8x + 32]

Regression of F on C
  This is the regression result from Excel
  The estimated equation is F = 32 + 1.8 C

  SUMMARY OUTPUT
  Regression Statistics
    Multiple R           1
    R Square             1
    Adjusted R Square    1
    Standard Error       7.0966E-05
    Observations         18

  ANOVA
                  df         SS         MS                   F   Sig F
    Regression     1   12372.94   12372.94    2456783461497.83    0.00
    Residual      16       0.00       0.00
    Total         17   12372.94

                Coefficients   Std Error      t Stat   P-value   Lower 95%   Upper 95%
    Intercept           32.0         0.0   1519489.6       0.0        32.0        32.0
    C                    1.8         0.0   1567413.0       0.0         1.8         1.8

Requirements of Regression
  We specify one variable as Dependent
    Usually represented as Y
    It must be measured as a continuous variable, not a dichotomy or ordinal
  The dependent variable is thought to be a function of one or more Independent variables
    Usually represented as X
    Can be continuous, dichotomies, or ordinal
  Regression is limited to relationships that are linear in the parameters, in the form of:
    $Y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
  We will have k independent variables

Nonlinear relationships that can be represented by regression
  It is possible to represent a nonlinear relationship with a linear approach, such as a Polynomial or Log function
  Log function: take the log of both sides
    $Y = a x^{b}$
    $\ln(Y) = \ln(a) + b \ln(x)$
  Polynomial of the kth order
    $Y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \dots + b_k x^k$
  It is not terribly restrictive to be limited to linear relationships

The equation of a Line
  I suspect you have seen the equation of a line written as Y = mx + b
    Where m is the slope and b is the intercept
  We specify a dependent variable Y, and independent variable X
  We will use the form $Y = b_0 + b_1 x_1$
  Note: in multiple regression there may be more than one X: $Y = b_0 + b_1 x_1 + b_2 x_2$
  When referring to the population I will use Greek terms:
    $Y = \beta_0 + \beta_1 X_1$

Equation of a Line
  Y = 5 + .5X
    X=0 then Y=5 (the intercept)
    X=10 then Y=10
    X=20 then Y=15
    X=30 then Y=20
  The slope shows how much Y changes for a unit change in X: Y changes .5 for each 1 unit change in X
  This is a deterministic model: there is an exact relationship between the two variables

In reality, we often have a random component
  A Probabilistic Model has a deterministic component and a random error component, denoted as e or $\varepsilon$
    $Y = \beta_0 + \beta_1 X_1 + \varepsilon$
  Our Expectation of Y is the deterministic component
    $E(Y) = \beta_0 + \beta_1 X_1$
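
To make the difference between the deterministic and the probabilistic model concrete, here is a minimal sketch built on the slide's example line Y = 5 + .5X. The normally distributed error, its standard deviation of 2, the seed, and all function names are assumptions made for illustration; the slides do not specify them.

```python
import random

BETA_0 = 5.0  # intercept from the slide's example, Y = 5 + .5X
BETA_1 = 0.5  # slope from the slide's example


def expected_y(x):
    """Deterministic component: E(Y) = beta_0 + beta_1 * X."""
    return BETA_0 + BETA_1 * x


def simulate_y(x, rng, sigma=2.0):
    """Probabilistic model: deterministic component plus a random error.

    The normal error with standard deviation `sigma` is an assumption
    made for this illustration only.
    """
    return expected_y(x) + rng.gauss(0.0, sigma)


rng = random.Random(42)  # fixed seed so the run is repeatable
for x in (0, 10, 20, 30):
    y = simulate_y(x, rng)
    error = y - expected_y(x)
    print(f"X = {x:>2}: E(Y) = {expected_y(x):4.1f}, "
          f"observed Y = {y:6.2f}, error = {error:5.2f}")
```

Each simulated Y scatters around the line, while E(Y) reproduces the exact values from the deterministic slide (5, 10, 15, 20).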

The error term in our model
  The error component is very important
  The difference between what we predict and what we observe
    $Y = \beta_0 + \beta_1 X_1 + \varepsilon$   Observed in population/sample
    $\hat{Y} = \beta_0 + \beta_1 X_1$   Predicted from model
    $\varepsilon = Y - \hat{Y}$

Have we seen the error term before?
  Consider the following model using the mean
    $Y = \mu + \varepsilon$   A simple model based on the mean
    $\varepsilon = Y - \mu$   Deviations about the mean
    $\sum \varepsilon^2 = \sum (Y - \mu)^2$   Sum of Squared Deviations
    $\sum \varepsilon^2 / n = \sum (Y - \mu)^2 / n$   Mean Squared Deviations
    $\sum \varepsilon^2 / n = \sigma^2$   Population Variance

The error term in Regression is important!
  The error term in regression is a measure of the:
    Variance of the Model
    Standard Deviation of the Model
    And ultimately contributes to the estimate of the Standard Error for our coefficients
  We will assume equal variances for Y (dependent variable) across each level of X (independent variable)
  In essence we will pool the measure of the variance in regression
  This is called Homoscedasticity

How do we fit a line to our data?
  We will use the property of Least Squares
  We will find estimates for $\beta_0$ and $\beta_1$ that will minimize the squared deviations about the fitted line
  First an example, and then the details
  A catalog sales company which sells electronic equipment wants to improve its marketing campaign
  They collect data on a random sample of 1,000 customers
  The main variable of interest is the amount of sales (in dollars) in the previous year

Catalog Sales data
  There is an Excel file (Catalogs.xls) and a JMP file (Catalogs.jmp)
  Y is the Dependent Variable: SALES
  X is the Independent Variable: SALARY
  The correlation between SALES and SALARY is .700
  Look at a Scatter Plot
  Excel will add a trendline, an equation, and an R-square, all based on regression

Estimated Regression of Sales on Salary
  SALES = -15.332 + .022(SALARY)
  If SALARY = 0
    SALES = -15.332 + .022(0)
    SALES = -15.332
  A unit change in SALARY ($1) results in a .022 change in SALES
  This is better expressed as: a $1,000 change in SALARY results in a $22.00 change in SALES
  Our prediction of SALES for a household with a SALARY of $50,000 is:
    SALES = -15.332 + .022($50,000)
    SALES = $1,084.67
  I will refer to this as solving the equation for a person with a salary of $50,000

How to do this in Excel
  Organize data in columns
    One column contains Y (dependent)
    Remaining columns contain contiguous Xs (independent)
  TOOLS, Data Analysis, Regression
    Specify Y variable
    Specify X variables: they need to be in contiguous columns (for more Xs in the model, columns must be next to each other)
    Remember to specify if the first row has labels
    Specify Output
  I modify the output
    How many decimal places are showing (3 to 4)
    Change Headings to make them fit
    Bold Headers

Excel output
  SUMMARY OUTPUT of SALES Regressed on SALARY

  Regression Statistics (the correlation and R-square)
    Multiple R             0.700
    R Square               0.489
    Adjusted R Square      0.489
    Standard Error       687.068
    Observations            1000

  ANOVA (the ANOVA Table)
                   df             SS             MS        F   Sig F
    Regression      1   451624335.68   451624335.68   956.71   0.000
    Residual      998   471117860.07      472061.98
    Total         999   922742195.74

  The estimated coefficients
                     Coef.   Std Error   t Stat   P-value   Lower 95%   Upper 95%
    Intercept      -15.332      45.374   -0.338     0.736    -104.373      73.708
    SALARY        0.021961    0.000710   30.931     0.000       0.021       0.023
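
The pieces of the Excel output above are tied together by simple arithmetic, and the sketch below recomputes them from the printed ANOVA and coefficient values. All numbers are copied from the output; the variable names and the helper `predict_sales` are hypothetical, introduced only for this illustration.

```python
import math

# Values copied from the Excel SUMMARY OUTPUT above.
ss_regression = 451624335.68
ss_residual = 471117860.07
ss_total = 922742195.74
df_regression, df_residual = 1, 998
b0, b1 = -15.332, 0.021961
se_b1 = 0.000710

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

r_square = ss_regression / ss_total   # share of variation explained
f_stat = ms_regression / ms_residual  # F in the ANOVA table
std_error = math.sqrt(ms_residual)    # "Standard Error" of the model
t_salary = b1 / se_b1                 # t Stat for SALARY

print(f"R Square       = {r_square:.3f}")
print(f"F              = {f_stat:.2f}")
print(f"Standard Error = {std_error:.3f}")
print(f"t Stat, SALARY = {t_salary:.3f}")


def predict_sales(salary, slope=0.022):
    """Solve the estimated equation; the default slope is the rounded
    .022 used on the slide rather than the full 0.021961."""
    return b0 + slope * salary


print(f"Predicted SALES at $50,000 = ${predict_sales(50_000):,.2f}")
```

With the unrounded slope, `predict_sales(50_000, slope=b1)` gives about $1,082.72 rather than $1,084.67, so the rounding of the coefficient matters a little at this salary level.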

Output from JMP
  Response SALES
  Whole Model

  Summary of Fit
    RSquare                        0.489434
    RSquare Adj                    0.488923
    Root Mean Square Error         687.0649
    Mean of Response                1216.77
    Observations (or Sum Wgts)         1000

  Analysis of Variance
    Source       DF   Sum of Squares   Mean Square    F Ratio   Prob > F
    Model         1        451615197     451615197   956.6940    <.0001*
    Error       998        471114029     472058.14
    C. Total    999        922729225

  Lack Of Fit
    Source         DF   Sum of Squares   Mean Square   F Ratio   Prob > F
    Lack Of Fit   634        345072568        544278    1.5718    <.0001*
    Pure Error    364        126041460        346268
    Total Error   998        471114029
    Max RSq 0.8634

  Parameter Estimates
    Term          Estimate   Std Error   t Ratio   Prob>|t|
    Intercept    -15.31783    45.37416     -0.34     0.7357
    SALARY       0.0219608     0.00071     30.93    <.0001*

A few points about our model
  It is possible to predict outside the range of the data
    When Salary = 0: SALES = -15.332 + .022($0) = -$15.33
    When Salary = 1,000,000: SALES = -15.332 + .022($1,000,000) = $21,985
  The model parameters should be interpreted only within the sampled range of the independent variables
  The prediction part of our model is deterministic, but we know we will have some error: our prediction won't match the data exactly
  We are fitting a model to the data
    "All models are wrong, some models are useful" (George Box)
  We will have the ability to test coefficients and construct confidence intervals: there is a known sampling distribution for regression coefficients

How to generate a Best Fitting Line
  We will use the property of Least Squares
  We will find estimates for $\beta_0$ (intercept) and $\beta_1$ (slope) that will minimize the squared deviations about the fitted line
  Best Fit means the differences between actual Y values and predicted Y values are a minimum
  Least Squares generates a set of coefficients that minimizes the Sum of the Squared Errors (SSE)
    $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \text{minimum}$

Bivariate Regression
  Formulas for estimates of $\beta_0$ and $\beta_1$
  I will tend to use b0 and b1 for the estimated values
  The slope coefficient is based on the covariance of Y and X, adjusted for the variability in X
  The Intercept is based on the estimate of b1 and the means of the other variables
    $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1$
    $\hat{\beta}_1 = \dfrac{SS_{XY}}{SS_X}$
    where
    $SS_{XY} = \sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \dfrac{(\sum X)(\sum Y)}{n}$
    $SS_X = \sum (X - \bar{X})^2 = \sum X^2 - \dfrac{(\sum X)^2}{n}$
    $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
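
The formulas above translate almost line for line into code. The following is a minimal sketch that computes b0 and b1 from the computational forms of SS_XY and SS_X, then checks that nudging either coefficient increases the SSE. The five data points are made up for illustration and do not come from the course files; the function names are likewise hypothetical.

```python
def least_squares(xs, ys):
    """Return (b0, b1) using the sums-of-squares formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # SS_XY = sum(XY) - (sum X)(sum Y) / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    # SS_X = sum(X^2) - (sum X)^2 / n
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    b1 = ss_xy / ss_x
    b0 = sum_y / n - b1 * sum_x / n  # b0 = Ybar - b1 * Xbar
    return b0, b1


def sse(xs, ys, b0, b1):
    """Sum of squared errors about the line Y-hat = b0 + b1 * X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data for illustration only; not from the course files.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares(xs, ys)
best = sse(xs, ys, b0, b1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {best:.3f}")

# Nudging either coefficient away from the estimates should raise the SSE.
shifts = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]
for db0, db1 in shifts:
    nudged = sse(xs, ys, b0 + db0, b1 + db1)
    print(f"b0 {db0:+.2f}, b1 {db1:+.2f}: SSE = {nudged:.3f}")
```

For this toy data SS_XY = 6 and SS_X = 10, so b1 = 0.6 and b0 = 4 - 0.6(3) = 2.2, and every nudged line has a larger SSE than the fitted one.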

Summary
  Regression is a strategy to model the relationship between a set of independent variables (Xs) and a dependent variable (Y).
  We say we regress Y on X, or on a set of Xs.
  Regression estimates a best fitting line to the data by minimizing squared deviations about that line.
  It is a natural extension of much of what we have covered before, especially ANOVA.
  We will cover the regression output, the ANOVA table, understanding the regression coefficients, inference in regression, and multiple regression.