Regression
What is a Model?
1. Often Describes a Relationship between Variables
2. Types
- Deterministic Models (no randomness)
- Probabilistic Models (with randomness)
EPI 809/Spring 2008
Deterministic Models
1. Hypothesize Exact Relationships
2. Suitable When Prediction Error is Negligible
3. Example: Body mass index (BMI) is a measure of body fat based on weight and height:
$\mathrm{BMI} = \dfrac{\text{Weight in Kilograms}}{(\text{Height in Meters})^2}$
Probabilistic Models
1. Hypothesize 2 Components
- Deterministic
- Random Error
2. Example: Systolic blood pressure (SBP) of newborns is 6 times the age in days plus random error
$\mathrm{SBP} = 6 \times \text{age(d)} + \varepsilon$
Random error may be due to factors other than age in days (e.g., birthweight).
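The contrast between the two model types can be sketched in a few lines. This is only an illustration: the function names and the error standard deviation (`error_sd=5.0`) are assumptions, not values from the slides.

```python
import random

def deterministic_sbp(age_days):
    """Deterministic component from the slide: SBP = 6 * age in days."""
    return 6 * age_days

def simulate_sbp(age_days, error_sd=5.0, rng=None):
    """Probabilistic model: deterministic component plus a random error.

    error_sd is a hypothetical noise level chosen for illustration only.
    """
    rng = rng or random.Random()
    return deterministic_sbp(age_days) + rng.gauss(0.0, error_sd)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
age = 5
draws = [simulate_sbp(age, rng=rng) for _ in range(1000)]
mean_error = sum(d - deterministic_sbp(age) for d in draws) / len(draws)

print(deterministic_sbp(age))  # the deterministic part is always 30
print(round(mean_error, 2))    # the random errors average out near 0
```

Individual simulated newborns differ from 6 × age, but the average of many draws stays close to the deterministic component.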
Simple Regression
Simple regression analysis is a statistical tool that gives us the ability to estimate the mathematical relationship between a dependent variable (usually called y) and an independent variable (usually called x). The dependent variable is the variable for which we want to make a prediction. While various non-linear forms may be used, simple linear regression models are the most common.
Introduction
The primary goal of quantitative analysis is to use current information about a phenomenon to predict its future behavior. Current information is usually in the form of a set of data. In a simple case, when the data form a set of pairs of numbers, we may interpret them as representing the observed values of an independent (or predictor or explanatory) variable X and a dependent (or response or outcome) variable Y.
Lot size   Man-hours
30         73
20         50
60         128
80         170
40         87
50         108
60         135
30         69
70         148
60         132
Introduction
The goal of the analyst who studies the data is to find a functional relation between the response variable y and the predictor variable x: $y = f(x)$.
[Figure: Statistical relation between Lot size (horizontal axis) and Man-Hours (vertical axis)]
Pictorial Presentation of Linear Regression Model
Linear Regression Model
Assumptions
Linear regression assumes that
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same (homogeneity of variances)
4. The observations are independent
Linear Equations
$Y = mX + b$
m = Slope = (Change in Y) / (Change in X)
b = Y-intercept
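Linear Regression Model
1. Relationship Between Variables Is a Linear Function
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
- $\beta_0$: Population Y-Intercept
- $\beta_1$: Population Slope
- $\varepsilon_i$: Random Error
- $Y_i$: Dependent (Response) Variable (e.g., CD+ c.)
- $X_i$: Independent (Explanatory) Variable (e.g., Years s. serocon.)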
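Meaning of Regression Coefficients
General regression model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
1. $\beta_0$ and $\beta_1$ are parameters
2. X is a known constant
3. Deviations $\varepsilon_i$ are independent $N(0, \sigma^2)$
The values of the regression parameters $\beta_0$ and $\beta_1$ are not known. We estimate them from data. $\beta_1$ indicates the change in the mean response per unit increase in X.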
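Population Linear Regression Model
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\varepsilon_i$ = random error
$E(Y) = \beta_0 + \beta_1 X_i$
[Figure: observed values of Y plotted against X, scattered around the population regression line]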
Estimating Parameters: Least Squares Method
Scatter plot
1. Plot of All $(X_i, Y_i)$ Pairs
2. Suggests How Well Model Will Fit
[Figure: scatter plot of Y against X]
Thinking Challenge
How would you draw a line through the points? How do you determine which line fits best?
[Figures: the same scatter plot of Y against X with candidate lines: (a) intercept unchanged, slope changed; (b) intercept changed, slope unchanged; (c) intercept changed, slope changed]
What is the best fitting line?
Prediction Error
Least Squares
1. Best Fit Means Differences Between Actual Y Values and Predicted Y Values Are a Minimum. But Positive Differences Off-Set Negative. So square errors!
$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2$
2. LS Minimizes the Sum of the Squared Differences (errors) (SSE)
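One way to make the "which line fits best?" question concrete is to compute the SSE for a few candidate lines. The sketch below uses the lot size and man-hours data from the Introduction slide; the three candidate lines and the helper name `sse` are illustrative choices, not part of the slides.

```python
# Lot size (x) and man-hours (y) from the Introduction slide.
lot_size = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
man_hours = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]

def sse(intercept, slope, x, y):
    """Sum of squared differences between observed and predicted y."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Three hypothetical candidate lines (intercept, slope), picked by eye.
candidates = [(10, 2.0), (0, 2.2), (20, 1.8)]
for b0, b1 in candidates:
    print(b0, b1, round(sse(b0, b1, lot_size, man_hours), 2))

# The candidate with the smallest SSE is the best fit among those tried.
best = min(candidates, key=lambda c: sse(c[0], c[1], lot_size, man_hours))
print(best)
```

Comparing candidates by hand only ranks the lines that were tried; the least squares method finds the minimizing line directly.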
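Least Squares Graphically
LS minimizes $\sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \hat{\varepsilon}_4^2$
$Y_2 = \hat{\beta}_0 + \hat{\beta}_1 X_2 + \hat{\varepsilon}_2$
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
[Figure: four observations with their vertical distances $\hat{\varepsilon}_1, \dots, \hat{\varepsilon}_4$ from the fitted line]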
How to estimate parameters
Estimating the intercept and slope: least squares estimation
A little calculus.
What are we trying to estimate? β, the slope, from $y = \beta x + \alpha$.
What is the constraint? We are trying to minimize the squared distance (hence the "least squares") between the observations themselves and the predicted values (these distances are also called the residuals, or left-over unexplained variability).
$\text{Difference}_i = y_i - (\beta x_i + \alpha)$
$\text{Difference}_i^2 = (y_i - (\beta x_i + \alpha))^2$
Find the β that gives the minimum sum of the squared differences. How do you minimize a function? Take the derivative; set it equal to zero; and solve. Typical max/min problem from calculus.
$\dfrac{d}{d\beta} \sum_{i=1}^{n} (y_i - (\beta x_i + \alpha))^2 = 2 \sum_{i=1}^{n} (y_i - \beta x_i - \alpha)(-x_i) = 2 \sum_{i=1}^{n} (-y_i x_i + \beta x_i^2 + \alpha x_i) = 0$
From here it takes a little math trickery to solve for β.
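For readers who want the remaining algebra, one standard route is sketched below. It assumes the model $y = \beta x + \alpha$ and also differentiates with respect to $\alpha$, a step the slide does not show.

```latex
\begin{align*}
\frac{\partial}{\partial \alpha} \sum_{i=1}^{n} (y_i - \beta x_i - \alpha)^2
  &= -2 \sum_{i=1}^{n} (y_i - \beta x_i - \alpha) = 0
  \quad\Longrightarrow\quad \alpha = \bar{y} - \beta \bar{x} \\[4pt]
\sum_{i=1}^{n} x_i \,(y_i - \beta x_i - \alpha) = 0
  &\quad\Longrightarrow\quad
  \sum_{i=1}^{n} x_i \,\bigl[(y_i - \bar{y}) - \beta (x_i - \bar{x})\bigr] = 0 \\[4pt]
\beta
  &= \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}
   = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```

The last equality holds because $\sum_i \bar{x}(y_i - \bar{y}) = 0$ and $\sum_i \bar{x}(x_i - \bar{x}) = 0$, so subtracting $\bar{x}$ inside each sum changes nothing.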
The standard error of Y given X is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.
[Figure: regression line with the spread $S_{y/x}$ marked at several values of X]
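Regression Picture
[Figure: for one observation, A is the distance from $y_i$ to the mean $\bar{y}$, B is the distance from the fitted value $\hat{y}_i$ to $\bar{y}$, and C is the distance from $y_i$ to $\hat{y}_i$]
$\hat{y}_i = \beta x_i + \alpha$
$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$
$A^2 = B^2 + C^2$
- $SS_{total}$ ($A^2$): Total squared distance of observations from naïve mean of y. Total variation.
- $SS_{reg}$ ($B^2$): Distance from regression line to naïve mean of y. Variability due to x (regression).
- $SS_{residual}$ ($C^2$): Variance around the regression line. Additional variability not explained by x; this is what the least squares method aims to minimize.
Least squares estimation gave us the line (β) that minimized $C^2$.
$R^2 = SS_{reg} / SS_{total}$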
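The decomposition can be checked numerically. The sketch below fits the least squares line to the lot size and man-hours data from the Introduction slide and computes the three sums of squares; the variable names are illustrative.

```python
# Lot size (x) and man-hours (y) from the Introduction slide.
lot_size = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
man_hours = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]

n = len(lot_size)
x_bar = sum(lot_size) / n
y_bar = sum(man_hours) / n

# Least squares slope and intercept.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(lot_size, man_hours))
sxx = sum((x - x_bar) ** 2 for x in lot_size)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in lot_size]

# A^2, B^2 and C^2 from the regression picture.
ss_total = sum((y - y_bar) ** 2 for y in man_hours)
ss_reg = sum((f - y_bar) ** 2 for f in fitted)
ss_residual = sum((y - f) ** 2 for y, f in zip(man_hours, fitted))
r_squared = ss_reg / ss_total

print(ss_total, ss_reg, ss_residual)  # total = regression + residual
print(round(r_squared, 4))
```

For these data almost all of the variation in man-hours is attributed to lot size, so $R^2$ is close to 1.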
Regression Line
If the scatter plot of our sample data suggests a linear relationship between two variables, i.e.
$y = \beta_0 + \beta_1 x$
we can summarize the relationship by drawing a straight line on the plot. The least squares method gives us the "best" estimated line for our set of sample data.
Regression Line
We will write an estimated regression line based on sample data as
$\hat{y} = b_0 + b_1 x$
The method of least squares chooses the values for $b_0$ and $b_1$ to minimize the sum of squared errors
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$
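Regression Line
Using calculus, we obtain estimating formulas:
$b_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
or
$b_1 = r \dfrac{S_y}{S_x}$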
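A minimal sketch of these formulas, applied to the lot size and man-hours data from the Introduction slide. The function names are hypothetical; both the deviation form and the computational form of $b_1$ are implemented so they can be compared.

```python
def least_squares(x, y):
    """Return (b0, b1) using the deviation form of the slope formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def slope_computational(x, y):
    """Slope from raw sums: (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Lot size (x) and man-hours (y) from the Introduction slide.
lot_size = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
man_hours = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]

b0, b1 = least_squares(lot_size, man_hours)
print(b0, b1)  # estimated line: man-hours = 10 + 2 * lot size
print(slope_computational(lot_size, man_hours))  # same slope from raw sums
```

The two slope formulas are algebraically identical; the computational form is convenient when only the column totals are available.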
Estimation of Mean Response
The fitted regression line can be used to estimate the mean value of y for a given value of x.
Example: The weekly advertising expenditure (x) and weekly sales (y) are presented in the following table.
y      x
1250   41
1380   54
1425   63
1425   54
1450   48
1300   46
1400   62
1510   61
1575   64
1650   71
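Point Estimation of Mean Response
From the previous table we have:
$n = 10, \quad \sum x = 564, \quad \sum x^2 = 32604, \quad \sum y = 14365, \quad \sum xy = 818755$
The least squares estimates of the regression coefficients are:
$b_1 = \dfrac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} = \dfrac{10(818755) - (564)(14365)}{10(32604) - (564)^2} = 10.8$
$b_0 = \bar{y} - b_1 \bar{x} = 1436.5 - 10.8(56.4) \approx 828$ (the value 828 is obtained with the unrounded slope, 10.787)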
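The sums and estimates above can be reproduced from the raw table. This is a small check, assuming the advertising table from the previous slide; the variable names are illustrative.

```python
# Weekly advertising expenditure (x) and weekly sales (y) from the table.
expenditure = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
sales = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]

n = len(expenditure)
sum_x = sum(expenditure)
sum_y = sum(sales)
sum_x2 = sum(x * x for x in expenditure)
sum_xy = sum(x * y for x, y in zip(expenditure, sales))

# Computational form of the least squares estimates.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

print(n, sum_x, sum_x2, sum_y, sum_xy)
print(round(b1, 1), round(b0))  # rounds to the slide's 10.8 and 828
```

Carrying the unrounded slope into the intercept calculation is what yields 828; rounding the slope to 10.8 first would give about 827.4.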
Point Estimation of Mean Response
The estimated regression function is:
$\hat{y} = 828 + 10.8x$
Sales = 828 + 10.8 × Expenditure
This means that if the weekly advertising expenditure is increased by $1 we would expect the weekly sales to increase by $10.8.
Point Estimation of Mean Response
Fitted values for the sample data are obtained by substituting the x value into the estimated regression function. For example, if the advertising expenditure is $50, then the estimated sales is:
Sales = 828 + 10.8(50) = 1368
This is called the point estimate (forecast) of the mean response (sales).
Linear correlation and linear regression
Covariance
$\mathrm{cov}(x, y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n - 1}$
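Interpreting Covariance
cov(X,Y) > 0: X and Y are positively correlated
cov(X,Y) < 0: X and Y are inversely correlated
cov(X,Y) = 0: X and Y are uncorrelated (independent variables have zero covariance, but zero covariance alone does not guarantee independence)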
Correlation coefficient
Pearson's Correlation Coefficient is standardized covariance (unitless):
$r = \dfrac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}\,x \; \mathrm{var}\,y}}$
Correlation
- Measures the relative strength of the linear relationship between two variables
- Unit-less
- Ranges between -1 and 1
- The closer to -1, the stronger the negative linear relationship
- The closer to 1, the stronger the positive linear relationship
- The closer to 0, the weaker any linear relationship
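As a sketch of these properties, the function below computes Pearson's r as standardized covariance and applies it to the weekly advertising data from the earlier example. The function name `pearson_r` is an illustrative choice.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation as covariance divided by the product of SDs."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - x_bar) ** 2 for a in x) / (n - 1)
    var_y = sum((b - y_bar) ** 2 for b in y) / (n - 1)
    return cov / sqrt(var_x * var_y)

# Weekly advertising expenditure (x) and weekly sales (y).
expenditure = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
sales = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]

r = pearson_r(expenditure, sales)
r_flipped = pearson_r(expenditure, [-y for y in sales])    # reverses the sign
r_rescaled = pearson_r(expenditure, [y / 1000 for y in sales])  # unit change

print(round(r, 3))          # a fairly strong positive linear relationship
print(round(r_flipped, 3))  # same strength, negative direction
print(round(r_rescaled, 3)) # unchanged: r is unit-less
```

Rescaling sales from dollars to thousands of dollars leaves r unchanged, which is what "unit-less" means in practice.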
Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots of Y against X with r = -1, r = -.6, r = 0, r = +1, r = +.3, and r = 0]
Slide from: Statistics for Managers Using Microsoft Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
[Figures: scatter plots of Y against X contrasting linear relationships with curvilinear relationships, strong relationships with weak relationships, and showing no relationship]
Slide from: Statistics for Managers Using Microsoft Excel 4th Edition, 2004 Prentice-Hall
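Calculating by hand
$\hat{r} = \dfrac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}\,x \; \mathrm{var}\,y}} = \dfrac{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}}{\sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \cdot \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}}$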
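Simpler calculation formula
The $n - 1$ terms cancel, leaving the numerator of the covariance over the numerators of the variances:
$\hat{r} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \dfrac{SS_{xy}}{\sqrt{SS_x SS_y}}$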
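Least Square estimation
Slope (beta coefficient): $\hat{\beta} = \dfrac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$
Intercept: $\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$
The regression line always goes through the point $(\bar{x}, \bar{y})$.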
Relationship with correlation
$\hat{r} = \hat{\beta} \dfrac{SD_x}{SD_y}$
In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (=predictor) variable (X) and the other the dependent (=outcome) variable (Y).
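These identities can be verified numerically. The sketch below uses the lot size and man-hours data from the Introduction slide and sample (n - 1) variances throughout; the variable names are illustrative.

```python
from math import sqrt

# Lot size (x) and man-hours (y) from the Introduction slide.
lot_size = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
man_hours = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]

n = len(lot_size)
x_bar = sum(lot_size) / n
y_bar = sum(man_hours) / n

cov_xy = sum((x - x_bar) * (y - y_bar)
             for x, y in zip(lot_size, man_hours)) / (n - 1)
var_x = sum((x - x_bar) ** 2 for x in lot_size) / (n - 1)
var_y = sum((y - y_bar) ** 2 for y in man_hours) / (n - 1)
sd_x, sd_y = sqrt(var_x), sqrt(var_y)

beta_hat = cov_xy / var_x               # slope = Cov(x, y) / Var(x)
alpha_hat = y_bar - beta_hat * x_bar    # intercept
r_direct = cov_xy / (sd_x * sd_y)       # correlation from its definition
r_from_slope = beta_hat * sd_x / sd_y   # correlation recovered from the slope

# The fitted line evaluated at x_bar returns y_bar.
at_mean = alpha_hat + beta_hat * x_bar

print(beta_hat, alpha_hat)
print(round(r_direct, 4), round(r_from_slope, 4))
print(at_mean, y_bar)
```

The slope carries the units of y per unit of x, while r rescales it by the two standard deviations to remove the units.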
Residual Analysis: check assumptions
$e_i = Y_i - \hat{Y}_i$
The residual for observation i, $e_i$, is the difference between its observed and predicted value. Residuals are highly useful for studying whether a given regression model is appropriate for the data at hand.
Check the assumptions of regression by examining the residuals:
- Examine for linearity assumption
- Examine for constant variance for all levels of X (homoscedasticity)
- Evaluate normal distribution assumption
- Evaluate independence assumption
Graphical Analysis of Residuals
Residual = observed - predicted
At X = 95 nmol/l: $y_i = 48$, $\hat{y}_i = 34$, so $y_i - \hat{y}_i = 14$
Residual Analysis for Linearity
[Figure: Y and residuals plotted against X for two cases: Not Linear and Linear]
Slide from: Statistics for Managers Using Microsoft Excel 4th Edition, 2004 Prentice-Hall
Residual Analysis for Homoscedasticity
[Figure: Y and residuals plotted against X for two cases: Non-constant variance and Constant variance]
Slide from: Statistics for Managers Using Microsoft Excel 4th Edition, 2004 Prentice-Hall
Residual Analysis for Independence
[Figure: residuals plotted against X for two Not Independent cases and one Independent case]
Slide from: Statistics for Managers Using Microsoft Excel 4th Edition, 2004 Prentice-Hall
Example: weekly advertising expenditure
y      x    y-hat    Residual (e)
1250   41   1270.8   -20.8
1380   54   1411.2   -31.2
1425   63   1508.4   -83.4
1425   54   1411.2   13.8
1450   48   1346.4   103.6
1300   46   1324.8   -24.8
1400   62   1497.6   -97.6
1510   61   1486.8   23.2
1575   64   1519.2   55.8
1650   71   1594.8   55.2
Estimation of the variance of the error terms, $\sigma^2$
The variance $\sigma^2$ of the error terms $\varepsilon_i$ in the regression model needs to be estimated for a variety of purposes. It gives an indication of the variability of the probability distributions of y. It is needed for making inference concerning the regression function and the prediction of y.
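Regression Standard Error
To estimate $\sigma$ we work with the variance and take the square root to obtain the standard deviation. For simple linear regression the estimate of $\sigma^2$ is the average squared residual:
$s_{y.x}^2 = \dfrac{1}{n - 2} \sum e_i^2 = \dfrac{1}{n - 2} \sum (y_i - \hat{y}_i)^2$
To estimate $\sigma$, use
$s_{y.x} = \sqrt{s_{y.x}^2}$
$s_{y.x}$ estimates the standard deviation $\sigma$ of the error term $\varepsilon$ in the statistical model for simple linear regression.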
Regression Standard Error
y-hat = 828 + 10.8X
y      x    y-hat    Residual (e)   square(e)
1250   41   1270.8   -20.8          432.64
1380   54   1411.2   -31.2          973.44
1425   63   1508.4   -83.4          6955.56
1425   54   1411.2   13.8           190.44
1450   48   1346.4   103.6          10732.96
1300   46   1324.8   -24.8          615.04
1400   62   1497.6   -97.6          9525.76
1510   61   1486.8   23.2           538.24
1575   64   1519.2   55.8           3113.64
1650   71   1594.8   55.2           3047.04
total                               36124.76
$S_{y.x} = 67.19818$
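The table's total and the standard error can be reproduced in a few lines. The sketch uses the rounded coefficients from the slide (828 and 10.8) and the divisor n - 2; the variable names are illustrative.

```python
from math import sqrt

# Weekly advertising expenditure (x) and weekly sales (y).
expenditure = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
sales = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]

# Fitted values from the slide's estimated line, y-hat = 828 + 10.8x.
fitted = [828 + 10.8 * x for x in expenditure]
residuals = [y - f for y, f in zip(sales, fitted)]

sse = sum(e ** 2 for e in residuals)  # total of the square(e) column
n = len(sales)
s_yx = sqrt(sse / (n - 2))            # regression standard error

print(round(sse, 2))   # 36124.76
print(round(s_yx, 5))  # 67.19818
```

The divisor is n - 2 rather than n because two parameters, the intercept and the slope, were estimated from the same data.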
Residual plots
The points in this residual plot have a curved pattern, so a straight line fits poorly.
Residual plots
The points in this plot show more spread for larger values of the explanatory variable x, so prediction will be less accurate when x is large.
Variable transformations
If the residual plot suggests that the variance is not constant, a transformation can be used to stabilize the variance. If the residual plot suggests a non-linear relationship between x and y, a transformation may reduce it to one that is approximately linear.
Common linearizing transformations are: $1/x$, $\log(x)$
Variance stabilizing transformations are: $1/y$, $\log(y)$, $\sqrt{y}$, $y^2$
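As a small illustration of how a transformation can straighten a curved relationship, the sketch below fits a line to y and to log(y) for hypothetical, noise-free data generated from $y = 2e^{0.5x}$. The data and the helper functions are assumptions made for the example, not part of the slides.

```python
from math import exp, log

def least_squares(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def sse(x, y, intercept, slope):
    """Sum of squared residuals around a given line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Hypothetical data following y = 2 * exp(0.5 * x), with no noise.
x = [0, 1, 2, 3, 4, 5]
y = [2 * exp(0.5 * v) for v in x]
log_y = [log(v) for v in y]

b0_raw, b1_raw = least_squares(x, y)
b0_log, b1_log = least_squares(x, log_y)

print(round(sse(x, y, b0_raw, b1_raw), 3))      # clearly above 0: curved data
print(round(sse(x, log_y, b0_log, b1_log), 3))  # essentially 0: now linear
print(round(b1_log, 3), round(b0_log, 3))       # slope 0.5, intercept log(2)
```

On the log scale the relationship is exactly linear, so the fitted slope and intercept recover 0.5 and log(2); with real data the fit would only be approximately linear.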
2 predictors: age and vit D
Different 3D view
Fit a plane rather than a line
On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.
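A minimal sketch of fitting a plane with two predictors, using the closed-form solution of the least squares normal equations. All numbers below are hypothetical and noise-free, chosen only so the recovered coefficients are easy to check; they are not the data behind the slides, and `fit_plane` is an illustrative name.

```python
def fit_plane(x1, x2, y):
    """Least squares fit of y = b0 + b1*x1 + b2*x2 (two predictors)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # zero if the predictors are collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

# Hypothetical values: outcome = 30 - 0.2 * age + 0.1 * vitamin D.
age = [20, 30, 40, 50, 60, 70]
vit_d = [50, 80, 40, 90, 60, 70]
outcome = [30 - 0.2 * a + 0.1 * d for a, d in zip(age, vit_d)]

b0, b_age, b_vit_d = fit_plane(age, vit_d, outcome)

def predict(a, d):
    return b0 + b_age * a + b_vit_d * d

# Effect of raising vitamin D by 10 units, at two different ages.
effect_at_30 = predict(30, 60) - predict(30, 50)
effect_at_60 = predict(60, 60) - predict(60, 50)

print(round(b0, 3), round(b_age, 3), round(b_vit_d, 3))
print(round(effect_at_30, 3), round(effect_at_60, 3))  # identical effects
```

Because the fitted surface is a plane, the change in the prediction for a given change in vitamin D does not depend on age, which is the "held constant" interpretation of the slope.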