Lecture 1: Introduction to Regression


An Example: Explaining State Homicide Rates
What kinds of variables might we use to explain/predict state homicide rates?
Let's consider just one predictor for now: poverty (ignore omitted variables and measurement error).
How might this be related to homicide rates?

Poverty and Homicide
These data are located here: http://www.public.asu.edu/~gasweete/crj64/data/hom_pov.dta
There appears to be some relationship between poverty and homicide rates, but it's not perfect.
There's a lot of noise, which we will attribute to unobserved factors and random error.

Poverty and Homicide, cont.
There is some nonzero value of expected homicides in the absence of poverty ($\beta_0$).
We expect homicide rates to increase as poverty rates increase ($\beta_1$).
Thus, $E(Y \mid X) = \beta_0 + \beta_1 X$.
This is the Population Regression Function.

Poverty and Homicide, Sample Regression Function
$y_i = \hat\beta_0 + \hat\beta_1 x_i + \hat u_i$
$y_i$ is the dependent variable, homicide rate, which we are trying to explain.
$\hat\beta_0$ represents our estimate of what the homicide rate would be in the absence of poverty*.
$\hat\beta_1$ is our estimate of the "effect" of a higher poverty rate on homicide.
$\hat u_i$ is a noise term reflecting other things that influence homicide rates.
*This is extrapolation outside the range of data. Not recommended.

Poverty and Homicide, cont.
$y_i = \hat\beta_0 + \hat\beta_1 x_i + \hat u_i$
Only $y_i$ and $x_i$ are directly observable in the equation above. The task of a regression analysis is to provide estimates of the slope and intercept terms.
The relationship is assumed to be linear. An increase in $x$ is associated with an increase in $y$.
Same expected change in homicide going from 6 to 7% poverty as from 5 to 6%.

[Figure: scatterplot of homicide rate against poverty rate with the fitted line $\hat y = -.973 + .475x$]

Ordinary Least Squares
$y_i = -.973 + .475 x_i + \hat u_i$
Substantively, what do these estimates mean?
How did we arrive at this estimate? Minimize the sum of the squared error, aka Ordinary Least Squares (OLS) estimation:
$\min \sum_{i=1}^{n} (Y_i - \hat Y_i)^2$
Why squared error?
Why vertical error? Not perpendicular.

Ordinary Least Squares Estimates
$\min \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$
Solving for the minimum requires calculus (set the derivative with respect to each $\beta$ to 0 and solve).
The book shows how we can go from some basic assumptions to estimates for $\beta_0$ and $\beta_1$ without using calculus.
I will go through two different ways to obtain these estimates: Wooldridge's and Khan's (khanacademy.org).

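Ordinary Least Squares: Estimating the intercept (Wooldridge's method)
$E(u) = 0$
$y = \beta_0 + \beta_1 x + u$, so $u = y - \beta_0 - \beta_1 x$
$E(y - \beta_0 - \beta_1 x) = 0$
$\bar y - \hat\beta_0 - \hat\beta_1 \bar x = 0$
$\hat\beta_0 = \bar y - \hat\beta_1 \bar x$
Assuming that the average value of the error term is zero, it is a trivial matter to calculate $\beta_0$ once we know $\beta_1$.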

Ordinary Least Squares: Estimating the intercept (Wooldridge)
Incidentally, these last sets of equations also imply that the regression line passes through the point that corresponds to the mean of $x$ and the mean of $y$: $(\bar x, \bar y)$.

Ordinary Least Squares: Estimating the slope (Wooldridge)
First, we use the fact that the expected value of the error term is zero to generate a new equation equal to zero. We saw this before, but here I use the exact formula used in the book.
$E(u) = 0$
$u_i = y_i - \beta_0 - \beta_1 x_i$
$n^{-1} \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$

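Ordinary Least Squares: Estimating the slope (Wooldridge)
We can multiply this last equation by $x_i$, since the covariance between $x$ and $u$ is assumed to be zero and the terms in the parentheses are equal to $u$. Next, we plug in our formula for the intercept and simplify.
$\mathrm{Cov}(x, u) = E(xu) = 0$
$n^{-1} \sum_{i=1}^{n} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$
$\sum_{i=1}^{n} x_i \left( y_i - (\bar y - \hat\beta_1 \bar x) - \hat\beta_1 x_i \right) = 0$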

Ordinary Least Squares: Estimating the slope (Wooldridge)
Re-arranging...
$\sum_{i=1}^{n} x_i (y_i - \bar y) = \hat\beta_1 \sum_{i=1}^{n} x_i (x_i - \bar x)$

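Ordinary Least Squares: Estimating the slope (Wooldridge)
Re-arranging...
Because $\sum x_i (x_i - \bar x) = \sum (x_i - \bar x)^2$ and $\sum x_i (y_i - \bar y) = \sum (x_i - \bar x)(y_i - \bar y)$:
$\hat\beta_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2} = \dfrac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$
Interestingly, the final result leads us to the relationship between the covariance of $x$ and $y$ and the variance of $x$.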
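The covariance-over-variance result can be checked numerically. The following is a minimal Python sketch, not the lecture's own code: the function name `ols_cov_var` and the five data points are hypothetical, chosen only for illustration. It also confirms the two conditions the derivation starts from, namely that the residuals sum to zero and are uncorrelated with x.

```python
def ols_cov_var(x, y):
    """Return (intercept, slope) using slope = cov(x, y) / var(x)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # The 1/(n-1) factors in cov and var cancel, so raw sums suffice.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data, not taken from the lecture.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

b0, b1 = ols_cov_var(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sum_u = sum(residuals)                                   # sample analog of E(u) = 0
sum_xu = sum(xi * ui for xi, ui in zip(x, residuals))    # sample analog of E(xu) = 0
print(b0, b1, sum_u, sum_xu)
```

For these points the slope is 1 and the intercept is 0.2, and both sums are zero up to floating-point rounding.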

Ordinary Least Squares: Estimates (Khan's method)
Khan starts with the actual points and elaborates how these points are related to the squared error: the square of the distance between each point $(x_n, y_n)$ and the line $y = mx + b = \beta_0 + \beta_1 x$.

Ordinary Least Squares: Estimates (Khan's method)
The vertical distance between a point $(x_n, y_n)$ and the regression line $y = \beta_0 + \beta_1 x$ is simply $y_n - (\beta_0 + \beta_1 x_n)$.
$\text{Total Error} = \sum_{n} \left( y_n - (\beta_0 + \beta_1 x_n) \right)$
It would be trivial to minimize the total error. We could set $\beta_1$ (the slope) equal to zero and $\beta_0$ equal to the mean of $y$, and then the total error would be zero.
Another approach is to minimize the absolute difference, but this actually creates thornier math problems than squaring the differences, and results in situations where there is not a unique solution.
In short, what we want is the sum of the squared error (SE), which means we have to square every term in that equation.
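To see why the raw total error is a poor criterion, here is a small Python sketch. The five points and the function names are assumptions made for illustration. It shows that a flat line through the mean of y and a steep line through the point of means both have zero total error, while their squared errors differ sharply.

```python
# Five hypothetical points, used only for illustration.
x = [4, 7, 0, 6, 2]
y = [2, 6, 1, 3, 4]
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

def total_error(b0, b1):
    """Sum of signed vertical distances from the line b0 + b1*x."""
    return sum(yi - (b0 + b1 * xi) for xi, yi in zip(x, y))

def squared_error(b0, b1):
    """Sum of squared vertical distances from the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Flat line: slope 0, intercept equal to the mean of y.
flat_total = total_error(y_bar, 0.0)
flat_se = squared_error(y_bar, 0.0)

# A deliberately bad line with slope 2 that still passes through (x_bar, y_bar).
steep_b0 = y_bar - 2.0 * x_bar
steep_total = total_error(steep_b0, 2.0)
steep_se = squared_error(steep_b0, 2.0)

print(flat_total, flat_se, steep_total, steep_se)
```

Positive and negative distances cancel for any line through the point of means, so the total error cannot tell these lines apart. Squaring removes the cancellation.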

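Ordinary Least Squares: Estimates (Khan's method)
We need to find the $\beta_0$ and $\beta_1$ that minimize the SE. Let's expand this out. To be clear, the subscripts for the $\beta$ estimates just refer to our two regression line estimates, whereas the subscripts for our $x$'s and $y$'s refer to the first observation, second observation and so on.
$SE = (y_1 - (\beta_0 + \beta_1 x_1))^2 + (y_2 - (\beta_0 + \beta_1 x_2))^2 + \dots + (y_n - (\beta_0 + \beta_1 x_n))^2$
$SE = y_1^2 - 2\beta_0 y_1 - 2\beta_1 x_1 y_1 + \beta_0^2 + 2\beta_0\beta_1 x_1 + \beta_1^2 x_1^2$
$\quad + \, y_2^2 - 2\beta_0 y_2 - 2\beta_1 x_2 y_2 + \beta_0^2 + 2\beta_0\beta_1 x_2 + \beta_1^2 x_2^2$
$\quad + \dots + \, y_n^2 - 2\beta_0 y_n - 2\beta_1 x_n y_n + \beta_0^2 + 2\beta_0\beta_1 x_n + \beta_1^2 x_n^2$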

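Ordinary Least Squares: Estimates (Khan's method)
Summing these columns...
$SE = n\,\mathrm{mean}(y^2) - 2\beta_0 n\,\mathrm{mean}(y) - 2\beta_1 n\,\mathrm{mean}(xy) + n\beta_0^2 + 2\beta_0\beta_1 n\,\mathrm{mean}(x) + \beta_1^2 n\,\mathrm{mean}(x^2)$
Everything but the regression line coefficients is a known entity here. This equation represents a 3D surface, where different values of $\beta_0$ and $\beta_1$ correspond to different values of the squared error. We just need to pick the values of $\beta_0$ and $\beta_1$ that minimize the SE.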

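Ordinary Least Squares: Estimates (Khan's method)
Those familiar with calculus will know that the minimum of the squared error surface occurs where the partial derivative (slope) with respect to $\beta_0$ is equal to zero and the partial derivative with respect to $\beta_1$ is equal to zero.
$\dfrac{\partial SE}{\partial \beta_0} = -2n\,\mathrm{mean}(y) + 2n\beta_0 + 2\beta_1 n\,\mathrm{mean}(x) = 0$
$\beta_0 = \mathrm{mean}(y) - \beta_1\,\mathrm{mean}(x)$
We've seen that before. How about the other derivative?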

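Ordinary Least Squares: Estimates (Khan's method)
Differentiating the summed columns with respect to $\beta_1$...
$\dfrac{\partial SE}{\partial \beta_1} = -2n\,\mathrm{mean}(xy) + 2\beta_0 n\,\mathrm{mean}(x) + 2\beta_1 n\,\mathrm{mean}(x^2) = 0$
Replacing $\beta_0$...
$-\mathrm{mean}(xy) + \left( \mathrm{mean}(y) - \beta_1\,\mathrm{mean}(x) \right) \mathrm{mean}(x) + \beta_1\,\mathrm{mean}(x^2) = 0$
$\beta_1 = \dfrac{\mathrm{mean}(xy) - \mathrm{mean}(x)\,\mathrm{mean}(y)}{\mathrm{mean}(x^2) - \mathrm{mean}(x)^2} = \dfrac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$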
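The claim that the closed-form estimates sit at the bottom of the squared error surface can be checked directly. This is a rough Python sketch under stated assumptions: the five points are hypothetical and the helper names are mine. It evaluates both partial derivatives at the closed-form solution and compares the SE there with the SE at eight nearby coefficient pairs.

```python
# Five hypothetical points, used only for illustration.
x = [4, 7, 0, 6, 2]
y = [2, 6, 1, 3, 4]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
mean_xy = sum(xi * yi for xi, yi in zip(x, y)) / n
mean_x2 = sum(xi ** 2 for xi in x) / n

# Closed-form solution in terms of means.
b1 = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
b0 = mean_y - b1 * mean_x

def squared_error(c0, c1):
    """Sum of squared vertical distances from the line c0 + c1*x."""
    return sum((yi - (c0 + c1 * xi)) ** 2 for xi, yi in zip(x, y))

def partials(c0, c1):
    """Partial derivatives of SE with respect to the intercept and slope."""
    d_b0 = -2 * n * mean_y + 2 * n * c0 + 2 * c1 * n * mean_x
    d_b1 = -2 * n * mean_xy + 2 * c0 * n * mean_x + 2 * c1 * n * mean_x2
    return d_b0, d_b1

d_b0, d_b1 = partials(b0, b1)
at_min = squared_error(b0, b1)
# SE at a small grid of coefficient pairs around the solution.
neighbours = [
    squared_error(b0 + s0, b1 + s1)
    for s0 in (-0.1, 0.0, 0.1)
    for s1 in (-0.1, 0.0, 0.1)
    if (s0, s1) != (0.0, 0.0)
]
print(d_b0, d_b1, at_min, min(neighbours))
```

Both derivatives vanish up to rounding, and every neighbouring pair gives a larger SE, which is what a minimum of a bowl-shaped surface looks like.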

Ordinary Least Squares Estimates
Hopefully it is reassuring to know that we can obtain the same answers from two very different methods.
These formulas allow us, in a bivariate regression, to calculate the regression line by hand without using fancy statistical packages.
All we need to do is find the mean of $x$, the mean of $y$, the mean of the products of $x$ and $y$, and the mean of the squares of $x$, and then we can plug these into the formulas and crank out our solutions.

OLS by hand, example
Let's look at a set of 5 points and see how to calculate a regression line by hand.
Here are our five points: (4,2), (7,6), (0,1), (6,3), (2,4)

OLS by hand, example
We can generally guess that the slope will be positive, but we can find the slope exactly if we calculate four things: the mean of $x$, the mean of $y$, the mean of the products of $x$ and $y$, and the mean of the squares of $x$.
The x's are 4, 7, 0, 6, and 2. Their mean is 19/5 = 3.8
The y's are 2, 6, 1, 3, and 4. Their mean is 16/5 = 3.2
The products are 8, 42, 0, 18, and 8. Their mean is 76/5 = 15.2
The squared x's are 16, 49, 0, 36, and 4. Their mean is 105/5 = 21

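OLS by hand, example
Recall the formula for the slope:
$\hat\beta_1 = \dfrac{\mathrm{mean}(xy) - \mathrm{mean}(x)\,\mathrm{mean}(y)}{\mathrm{mean}(x^2) - \mathrm{mean}(x)^2} = \dfrac{15.2 - 3.2 \times 3.8}{21 - 3.8 \times 3.8} = \dfrac{3.04}{6.56} = .463$
Once we have the slope, the intercept is trivial:
$\hat\beta_0 = 3.2 - .463 \times 3.8 = 1.44$
And our regression line that minimizes the sum of squared differences:
$y = 1.44 + .463x + u$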
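The by-hand arithmetic can be reproduced in a few lines of Python. This is only a sketch: it takes the five points to be (4,2), (7,6), (0,1), (6,3), (2,4), which is the reading consistent with the means quoted in the example, and the variable names are mine.

```python
# The five points of the by-hand example, as assumed above.
points = [(4, 2), (7, 6), (0, 1), (6, 3), (2, 4)]
n = len(points)

mean_x = sum(px for px, _ in points) / n          # 19/5
mean_y = sum(py for _, py in points) / n          # 16/5
mean_xy = sum(px * py for px, py in points) / n   # 76/5
mean_x2 = sum(px ** 2 for px, _ in points) / n    # 105/5

# Slope and intercept from the four means.
slope = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
intercept = mean_y - slope * mean_x

print(round(slope, 3), round(intercept, 2))
```

Carrying full precision, the slope is about .4634 and the intercept about 1.439, which round to the .463 and 1.44 used on the slide.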

OLS by hand, example
Checking our work...

Analysis of Variance
Once we have our regression line, we can define a fitted value as follows:
$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$
This is our estimated value for $y_i$ given our slope and intercept estimates and the value of $x_i$. It is also sometimes called a predicted value.
All of the y-hats fall on the regression line.
For purposes of evaluating our regression, it makes sense to compare the y-hats to the actual values of $y$.

Analysis of Variance
The total variation in Y is assessed relative to its mean. We want to partition this variation into two components:
$y_i - \bar y = (y_i - \hat y_i) + (\hat y_i - \bar y) = \hat u_i + (\hat y_i - \bar y)$
We first add and subtract the fitted value of $y$ for each observation, then combine terms to get the residual term, which is the deviation unexplained by the model, and the difference between the fitted value of $y$ and the mean of $y$, which is the portion of the variance explained by the model.

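Analysis of Variance, cont.
Of course, in order to assess variance, we square all of these terms:
$\sum_{i=1}^{n} (y_i - \bar y)^2 = \sum_{i=1}^{n} \hat u_i^2 + \sum_{i=1}^{n} (\hat y_i - \bar y)^2$
$SST = SSR + SSE$
Where SST is the total sum of squares, SSE is the explained sum of squares, and SSR is the residual sum of squares.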

$R^2$ (R-squared)
$R^2$ represents the portion of the variance that is explained by the model.
$R^2 = \dfrac{SSE}{SST}$
Typically, in social science applications, our standards for $R^2$ are pretty low. Individual-level regressions rarely exceed .3
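The decomposition and the resulting R-squared can be verified numerically. Below is a minimal Python sketch, assuming five illustrative points (4,2), (7,6), (0,1), (6,3), (2,4). The variable names follow the lecture's convention that SSE is the explained sum of squares and SSR the residual sum of squares.

```python
# Five illustrative points (an assumption for this sketch).
x = [4, 7, 0, 6, 2]
y = [2, 6, 1, 3, 4]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS slope and intercept.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                  # total
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))    # residual (unexplained)
sse = sum((fi - y_bar) ** 2 for fi in fitted)             # explained

r_squared = sse / sst
print(sst, ssr, sse, r_squared)
```

With these points, SST is 14.8, of which roughly 7.04 is explained and 7.76 is residual, giving an R-squared of about .476.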

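Ordinary Least Squares Estimates by hand
See Excel file.
state        hom    poverty   x - x-bar   y - y-bar
Alabama      8.3    16.7       4.61        3.53
Alaska       5.4    10.0      -2.09        0.63
Arizona      7.5    15.2       3.11        2.73
Arkansas     7.3    13.8       1.71        2.53
California   6.8    13.2       1.11        2.03
Here x is the poverty rate and y is the homicide rate. The remaining spreadsheet columns hold the products (x - x-bar)*(y - y-bar) and the squares (x - x-bar)^2 for each state.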

Ordinary Least Squares Estimates by hand, cont.
We can also get $\beta_1$ from the covariance.
. corr hom pov, c
(the covariance matrix in Stata), which shows that the covariance of homicide and poverty is 4.304 and the variance of poverty is 9.06.
$\beta_1$ = 4.304/9.06 = .475
The mean of homicide rates is 4.77, and the mean of poverty rates is 12.09.
$\beta_0$ = 4.77 - 12.09*.475 = -.973
Or, in Stata:
. reg hom pov

Stata output
$\beta_1$ = 4.304/9.06 = .475
$\beta_0$ = 4.77 - 12.09*.475 = -.973

. reg hom pov

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =   21.36
       Model |  100.175656     1  100.175656           Prob > F      =  0.0000
    Residual |  225.109343    48  4.68977798           R-squared     =  0.3080
-------------+------------------------------           Adj R-squared =  0.2935
       Total |  325.284999    49  6.63846936           Root MSE      =  2.1656

------------------------------------------------------------------------------
     homrate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |       .475       .103     4.62   0.000          .27         .68
       _cons |      -.973      1.280    -0.76   0.450        -3.55        1.60
------------------------------------------------------------------------------
(coefficient rows rounded)

Assumptions of the Classical Linear Regression Model
1. X & Y are linearly related in the population.
2. We have a random sample of size n from the population.
3. The values of $x_1$ through $x_n$ are not all the same.
4. The error has an expected value of zero for all values of x: $E(u \mid x) = 0$ (zero conditional mean)
5. The error term has a constant variance for all values of x: $\mathrm{Var}(u \mid x) = \sigma^2$ (homoscedasticity)

1. Linearity
If X and Y are not linearly related, the estimates will be incorrect.
Look at your data!
Example, how do these data compare?

. summ

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          x1 |        11           9    3.316625          4         14
          x2 |        11           9    3.316625          4         14
          x3 |        11           9    3.316625          4         14
          x4 |        11           9    3.316625          8         19
          y1 |        11    7.500909    2.031568       4.26      10.84
-------------+--------------------------------------------------------
          y2 |        11    7.500909    2.031657        3.1       9.26
          y3 |        11         7.5    2.030424       5.39      12.74
          y4 |        11    7.500909    2.030579       5.25       12.5

Linearity, cont.
How do these models compare?
$\beta_0 = 3$, $\beta_1 = .5$ in every one of them.
Let's look at each of them separately.
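The summary statistics on the previous slide match Anscombe's quartet, so the sketch below assumes that is the dataset in use and hard-codes its published values. That is an assumption on my part, since the lecture's data file is not reproduced here. It fits the same bivariate regression to each of the four pairs.

```python
# Anscombe's quartet (assumed to be the data summarized in the lecture).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def ols(x, y):
    """Return (intercept, slope) for a bivariate OLS fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

fits = [ols(x123, y1), ols(x123, y2), ols(x123, y3), ols(x4, y4)]
for number, (intercept, slope) in enumerate(fits, start=1):
    print(number, round(intercept, 2), round(slope, 2))
```

All four fits round to an intercept of 3 and a slope of .5, even though only one of the four scatterplots shows a roughly linear cloud. The estimates alone cannot reveal a violation of linearity, which is why plotting the data matters.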

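Linearity, cont., Regression
[Figure: scatterplot with fitted regression line]

Linearity, cont., Regression 3
[Figure: scatterplot with fitted regression line]

Linearity, cont., Regression 4
[Figure: scatterplot with fitted regression line]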

3. Sample variation
If there is no variation in the values of x, it is not possible to estimate a regression line. The line of best fit would point straight up and pass through every point.
Minimal variation is sometimes problematic as well, as it makes regression estimates very unstable.
This assumption is easy to check by looking at summary statistics.
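Both halves of this point can be illustrated with a short Python sketch. The data and the function name `slope_or_none` are hypothetical. With no variation in x the slope formula divides by zero, and with almost no variation a small change in a single y value flips the sign of the slope.

```python
def slope_or_none(x, y):
    """Return the OLS slope, or None when x has no variation."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        # var(x) = 0: the slope cov(x, y) / var(x) is undefined.
        return None
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

# No variation in x at all (hypothetical values).
no_variation = slope_or_none([12.0, 12.0, 12.0, 12.0], [4.0, 5.0, 6.0, 7.0])

# Minimal variation: only the last x differs, and only slightly.
x_tight = [10.0, 10.0, 10.0, 10.1]
slope_a = slope_or_none(x_tight, [5.0, 5.2, 4.8, 5.1])
slope_b = slope_or_none(x_tight, [5.0, 5.2, 4.8, 4.9])  # last y lowered by 0.2

print(no_variation, slope_a, slope_b)
```

Lowering one y value by 0.2 moves the estimated slope from about +1 to about -1, which is the kind of instability the slide warns about.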

4. Zero conditional mean: $E(u \mid x) = 0$
In practical terms, this means that the sum of the unobserved variables is not related to x.
Also, it means that variation in our estimates of the intercept and slope is all due to variations in the error terms.
Should this assumption hold true, our estimates of the slope and intercept are unbiased, meaning that on average we're going to get the right answer.

5. $\mathrm{Var}(u \mid x) = \sigma^2$ (homoscedasticity)
In practical terms, this means that the variance of the error term is unrelated to the independent variables.

Root Mean Squared Error (RMSE)
Root mean squared error gives us an indication of how well the regression line fits the data.
$RMSE = \sqrt{\dfrac{SSR}{n - k}}$
This is the square root of the residual sum of squares divided by the sample size minus the number of parameters being estimated (k = 2 in simple bivariate regression).
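As a quick illustration of the formula, here is a Python sketch that computes the RMSE for five illustrative points, taken here to be (4,2), (7,6), (0,1), (6,3), (2,4). The data and names are assumptions for this example, and k = 2 because an intercept and a slope are estimated.

```python
import math

# Five illustrative points (an assumption for this sketch).
x = [4, 7, 0, 6, 2]
y = [2, 6, 1, 3, 4]
n = len(x)
k = 2  # parameters estimated: intercept and slope

x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residual sum of squares, then RMSE with the n - k degrees-of-freedom correction.
ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
rmse = math.sqrt(ssr / (n - k))

print(round(ssr, 4), round(rmse, 4))
```

The RMSE is in the units of y, so here a typical point sits about 1.6 units above or below the fitted line.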

Root Mean Squared Error, cont.
Provided the error term is distributed normally, the RMSE tells us:
68.3% of the observations fall within the band that is ±1*RMSE of the regression line
95.4% of the observations fall within the band that is ±2*RMSE of the regression line
99.7% of the observations fall within the band that is ±3*RMSE of the regression line
RMSE is also an element in calculating the standard errors of $\beta_0$ and $\beta_1$

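Regression estimates, standard errors
$SE(\hat\beta_0) = RMSE \cdot \sqrt{\dfrac{\sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar x)^2}}$
$SE(\hat\beta_1) = \dfrac{RMSE}{\sqrt{\sum_{i=1}^{n} (x_i - \bar x)^2}}$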

Regression estimates, standard errors, cont.
While these two standard error formulas may not appear very intuitive, we can glean some important information from them:
1. As uncertainty about the regression line increases (RMSE increases), the standard errors of both $\beta_0$ and $\beta_1$ increase.
2. As the variability of x increases, the standard errors of both $\beta_0$ and $\beta_1$ decrease.
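Both points can be seen in a small Python sketch. It assumes the standard bivariate OLS standard error formulas, SE(b1) = RMSE / sqrt(sum of squared x deviations) and SE(b0) = RMSE * sqrt(sum of x squared / (n * sum of squared x deviations)), along with five illustrative data points. The function names are mine.

```python
import math

# Five illustrative points (an assumption for this sketch).
x = [4, 7, 0, 6, 2]
y = [2, 6, 1, 3, 4]

def sum_sq_dev(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def se_slope(rmse, xs):
    return rmse / math.sqrt(sum_sq_dev(xs))

def se_intercept(rmse, xs):
    n = len(xs)
    return rmse * math.sqrt(sum(v ** 2 for v in xs) / (n * sum_sq_dev(xs)))

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum_sq_dev(x)
b0 = y_bar - b1 * x_bar
ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
rmse = math.sqrt(ssr / (n - 2))

se_b1 = se_slope(rmse, x)
se_b0 = se_intercept(rmse, x)

# Point 1: doubling RMSE doubles the slope's standard error.
se_b1_noisy = se_slope(2 * rmse, x)
# Point 2: spreading x out twice as far halves it, holding RMSE fixed.
se_b1_spread = se_slope(rmse, [2 * xi for xi in x])

print(round(se_b0, 4), round(se_b1, 4), round(se_b1_noisy, 4), round(se_b1_spread, 4))
```

Holding RMSE fixed while rescaling x is an artificial comparison, but it isolates the role of the denominator in the formula.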

Formal test of model fit, F-test
$F(k - 1,\; N - k) = \dfrac{SSE / (k - 1)}{SSR / (N - k)}$
Where k = the number of parameters in the model, and N is the sample size.
This is a general test of model fit. If the F-test is statistically significant, it means that the model explains some of the variance in Y.
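A minimal Python sketch of the F statistic follows, using five illustrative points (an assumption for this example) and the lecture's convention that SSE is the explained and SSR the residual sum of squares. It also checks a known property of bivariate regression: F equals the square of the slope's t statistic.

```python
import math

# Five illustrative points (an assumption for this sketch).
x = [4, 7, 0, 6, 2]
y = [2, 6, 1, 3, 4]
n = len(x)
k = 2  # parameters in the model: intercept and slope

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
sse = sum((fi - y_bar) ** 2 for fi in fitted)             # explained
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))    # residual

# F statistic with (k - 1, n - k) degrees of freedom.
f_stat = (sse / (k - 1)) / (ssr / (n - k))

# t statistic for the slope, for comparison.
rmse = math.sqrt(ssr / (n - k))
t_slope = b1 / (rmse / math.sqrt(sxx))

print(round(f_stat, 4), round(t_slope ** 2, 4))
```

Judging significance would require comparing the statistic with an F distribution on (k - 1, N - k) degrees of freedom, which this sketch leaves out.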

Next time:
Homework: Problems .4, .4, C.4, C.4
Read: Wooldridge Chapters 9 & Appendix C.6, and Bushway, Sweeten & Wilson 6 article