University of Belgrade. Faculty of Mathematics. Master thesis Regression and Correlation



University of Belgrade
Faculty of Mathematics
Virtual Library of Faculty of Mathematics - University of Belgrade

Master thesis
Regression and Correlation

Candidate: Karima Ibrahim Soufya
Supervisor: Vesna Jevremović

June 2014

Contents

Introduction

Chapter one: Regression analysis
  Correlation
    Pearson's correlation coefficient
    Scatter diagram
    Spearman's correlation coefficient
  Regression
    Regression equation
    Curve of regression
    Types of regression
    Linear regression equation of Y on X
    Assumptions necessary about the regression model
    Estimating the population slope and intercept

Chapter two: The method of least squares
  2.1 Residuals
  2.2 Properties of estimator of the slope
  2.3 Properties of estimator of the intercept
  2.4 Testing the validity of the model
  2.5 Prediction intervals for the actual value of Y
  2.6 Coefficient of determination

Chapter three: Nonlinear models
  3.1 Polynomial regression
  3.2 Exponential regression
  3.3 Other curvilinear models

Chapter four: Diagnostics for simple linear regression
  4.1 Valid and Invalid Regression Models: Anscombe's Four Data Sets
  4.2 Plots of residuals
  Regression Diagnostics: Tools for Checking the Validity of a Model
  Leverage Points
  Standardized Residuals
  Assessing the Influence of Certain Cases
  Normality of the Errors

Chapter five: Conclusion

References

Introduction

Regression is a statistical tool for the investigation of relationships between variables. Usually, the investigator seeks to ascertain the causal effect of one variable upon another. Regression methods are meant to determine the best functional relationship between a dependent variable Y and one or more independent variables X.

The earliest form of regression was the method of least squares, which was published by Legendre in 1805 and by Gauss in 1809. Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the Sun. Gauss published a further development of the theory of least squares in 1821. The term regression was coined by Sir Francis Galton, while studying the linear relationship between heights of sons and heights of their fathers.

This thesis focuses on tools and techniques for building regression models using real data and assessing their validity. A key theme throughout the thesis is that it makes sense to base inferences or conclusions only on valid models. We will show that plots are an important tool for building regression models and assessing their validity through appropriate diagnostic procedures. The regression output and plots that appear in the thesis have been generated using the statistical software R. In addition, real data sets that have appeared in the literature are also considered.

The thesis is divided into five chapters. In the first chapter we introduce Pearson's correlation coefficient, discuss the importance of scatter plots and define regression. The second chapter focuses on the method of least squares and on statistical properties of regression coefficients; we also show how to test the validity of the model and we define prediction intervals. In the third chapter we give examples of nonlinear regression models. The fourth chapter discusses diagnostic procedures for simple linear regression. In the fifth chapter we present our conclusions.

1- Regression Analysis

First we will introduce Pearson's and Spearman's correlation coefficients as two different measures of correlation or association between two variables.

1-1 Correlation

We denote the correlation between two variables in the population by the Greek letter ρ, and Pearson's correlation coefficient, as an estimate of ρ, by the letter r. Pearson's correlation coefficient can assume any value in the interval [-1, 1]. The absolute value of r (i.e., |r|) indicates the strength of the relationship between the two variables. As the absolute value of r approaches 1, the degree of linear relationship between the variables becomes stronger, achieving the maximum when |r| = 1 (i.e., when r equals +1 or -1). The closer the absolute value of r is to 0, the weaker the linear relationship is between the two variables. Pearson's correlation coefficient determines the degree to which a linear relationship exists between two variables.

The sign of r indicates the nature or direction of the linear relationship between the two variables: the positive sign indicates a direct linear relationship, and the negative sign indicates an indirect linear relationship. A direct linear relationship is one in which a change in one variable is associated with a change in the other variable in the same direction (i.e., an increase in one variable is associated with an increase in the other variable, and a decrease in one variable is associated with a decrease in the other variable). An indirect relationship is one in which a change in one variable is associated with a change in the other variable in the opposite direction (i.e., an increase in one variable is associated with a decrease in the other variable, and a decrease in one variable is associated with an increase in the other variable).

The use of Pearson's correlation coefficient assumes that a linear function best describes the relationship between the two variables, and that these two variables have normal distributions. If, however, the relationship between the variables is better described by a curvilinear function, Pearson's correlation coefficient is not an appropriate measure of correlation.

Calculation of Pearson's correlation coefficient r

Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be paired observations; then Pearson's correlation coefficient is equal to

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

or simply

$$r = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}}{s_x s_y},$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ are the sample means, and $s_x$, $s_y$, defined by $s_x^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ and $s_y^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$, are the sample standard deviations. If we use $x_i' = x_i - \bar{x}$ and $y_i' = y_i - \bar{y}$, then we get a simplified form for Pearson's correlation coefficient:

$$r = \frac{\sum x_i' y_i'}{\sqrt{\sum x_i'^2 \, \sum y_i'^2}}.$$

Scatter Diagram

Let us have n pairs of values $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$. In a scatter diagram the variable X is shown along the x-axis and the variable Y is shown along the y-axis, and all the pairs of values of X and Y are shown by points (or dots) on the graph. The scatter diagram of these points reveals the nature and strength of correlation between the variables X and Y. Degrees of correlation between two variables are shown in Figure 1. As we can see, when there is no correlation, points on the scatter plot are distributed randomly. Also, we observe the following:

If the points lie on a straight line rising from lower left to upper right, then there is a perfect positive correlation between the variables X and Y.

If all the points do not lie on a straight line, but their tendency is to rise from lower left to upper right,

then there is a positive correlation between the variables X and Y. In these cases the two variables X and Y move in the same direction and the association between the variables is direct.

If the movements of the variables X and Y are in opposite directions and the scatter diagram is a straight line, the correlation is said to be negative, and the association between the variables is said to be indirect.

A scatter plot of the data like that given in Figure 1 should always be drawn to obtain an idea of the sort of relationship (if any) that exists between two variables (e.g., linear, quadratic, exponential, etc.).

Figure 1. The degrees of correlation

Example 1: For the following data draw a scatter diagram and calculate Pearson's correlation coefficient.

X Y

Figure 2. Scatter plot for given data

First we calculate the sample means $\bar{x}$, $\bar{y}$ and the sample standard deviations $s_X$, $s_Y$, together with $\frac{1}{7}\sum_{i=1}^{7} x_i y_i$. Pearson's correlation coefficient is then equal to

$$r = \frac{\frac{1}{7}\sum_{i=1}^{7} x_i y_i - \bar{x}\,\bar{y}}{s_X s_Y}.$$

We can see that the correlation between variables X and Y is very strong. By looking at the scatter plot it seems that Y is a linear function of X.

It is important to note that strong correlation does not mean that there is a cause-effect relationship between two variables. By examining the value of the correlation coefficient, we may conclude that two variables are related, but we cannot draw the conclusion that one variable causes the other. A variable (or variables) which has not been considered can be responsible for the observed correlation between the two variables.
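The calculation of Pearson's coefficient described above can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_r` and the data `x_demo`, `y_demo` are assumptions made for the sketch, and the numbers are hypothetical rather than the values of Example 1.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r for paired observations, computed from centered sums."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Centered sums of products and squares; the 1/n factors cancel in r.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data, NOT the values of Example 1.
x_demo = [1, 2, 3, 4, 5, 6, 7]
y_demo = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9]
r_demo = pearson_r(x_demo, y_demo)
print(f"r = {r_demo:.4f}")
```

A value of r close to +1 here reflects only the near-linear pattern of the made-up data; as noted above, it says nothing about cause and effect.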

1..3- Spearman's correlation coefficient

Spearman's correlation coefficient measures the degree of monotonic relationship between two random variables. A monotonic relationship can be increasing (positive correlation: an increase in the values of variable X is followed by an increase in the values of variable Y) or decreasing (negative correlation: an increase in the values of variable X is followed by a decrease in the values of variable Y). Monotonically increasing, monotonically decreasing and non-monotonic relationships between variables are shown in Figure 3.

Figure 3. Monotonic and non-monotonic relationships between variables

Spearman's population correlation coefficient is denoted by $\rho_S$ and Spearman's sample correlation coefficient by $r_S$. When this coefficient is equal to -1 or +1, a perfect monotonic relationship exists between random variables X and Y.

We will describe how to calculate Spearman's correlation coefficient in three steps. We have n pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ of sample values of random variables X and Y.

1. Rank the values of variable X and the values of variable Y, separately. For each element of the sample we will have a pair of ranks $R_X$, $R_Y$.

2. Form the differences of these ranks, $d = R_X - R_Y$.

3. Calculate Spearman's correlation coefficient by the formula

$$r_S = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}.$$

1.2- Regression

When we are interested in what value of random variable Y is expected when X = x, we come to the conditional mathematical expectation $E(Y \mid X = x)$, the expected value of Y when X takes the specific value x. If (X, Y) is a two-dimensional continuous random variable with density function $f(x, y)$, and if $f_X(x)$ is the density function of random variable X, then

$$E(Y \mid X = x) = \int y \, \frac{f(x, y)}{f_X(x)} \, dy.$$

The set of all values $E(Y \mid X = x)$ is the set of values of the random variable $E(Y \mid X)$, so to calculate $E(Y \mid X)$, we first need to calculate $E(Y \mid X = x)$ for all x.

The random variable $E(Y \mid X) = R(X)$ is called regression. For example, if variable X represents the day of the week and variable Y the sales at a given company, then the regression of Y on X represents the mean (or average) sales on a given day.

If (X, Y) is a two-dimensional normally distributed random variable then

$$E(Y \mid X) = \beta_0 + \beta_1 X.$$

Regression Equation

The functional relationship of a dependent variable with one or more independent variables is called a regression equation. It is also called a prediction equation (or estimating equation).

Curve of Regression

The graph of the regression equation is called the curve of regression. If the curve is a straight line, then it is called the line of regression.

Types of Regression

If there are only two variables under consideration, then the regression is called simple regression. If the relationship between X and Y is non-linear, then the regression is curvilinear. If there are more than two variables under consideration then the regression is called multiple regression. For example, multiple regression can be used to model the relationship between blood sugar and the weight, age and blood pressure of diabetes patients.

Linear Regression Equation of Y on X

Data are collected in pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, where $x_1$ denotes the first value of the X-variable and $y_1$ denotes the first value of the Y-variable. The X-variable is called the predictor variable, while the Y-variable is called the dependent variable. The regression of Y on X is linear if

$$E(Y \mid X = x) = \beta_0 + \beta_1 x,$$

where the unknown parameters $\beta_0$ and $\beta_1$ determine the intercept and the slope of a specific straight line, respectively. Suppose that $Y_1, Y_2, \dots, Y_n$ are independent realizations of the random variable Y that are observed at the values $x_1, x_2, \dots, x_n$ of a random variable X. If the regression of Y on X is linear, then for $i = 1, 2, \dots, n$,

$$Y_i = E(Y \mid X = x_i) + e_i = \beta_0 + \beta_1 x_i + e_i,$$

where $e_i$ is the random error in $Y_i$, with zero expectation.

Assumptions necessary about the regression model

In what follows we shall make the following assumptions:

1. Y is related to x by the simple linear regression model $Y_i = \beta_0 + \beta_1 x_i + e_i$.

2. The errors $e_1, e_2, \dots, e_n$ are independent of each other.

3. The errors $e_1, e_2, \dots, e_n$ have a common variance $\sigma^2$.

4. The errors are normally distributed with a mean of 0 and variance $\sigma^2$, that is, $e \mid X \sim N(0, \sigma^2)$.

Because the errors $e_i$ are normally distributed, we also have that $Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$ for $i = 1, \dots, n$, and then

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \sim N\!\left(\beta_0 + \beta_1 \bar{x}, \frac{\sigma^2}{n}\right).$$

Methods for checking these four assumptions will be considered in the fourth chapter.

Estimating the population slope and intercept

Suppose for example that variable X represents the height and variable Y the weight of a person. For a simple regression model the mean weight of individuals of a given height would be a linear function of that height. In practice, we usually have a sample of data instead of the whole population. The slope $\beta_1$ and intercept $\beta_0$ are unknown, since these are the values for the whole population. Thus, we wish to use the given data to estimate the slope and the intercept. This can be achieved by finding the equation of the line which best fits our data, that is, by choosing $b_0$ and $b_1$ such that $\hat{y}_i = b_0 + b_1 x_i$ is as close as possible to $y_i$. We shall refer to $\hat{y}_i$ as the i-th predicted value or the fitted value of $y_i$. For estimating these unknown parameters we will use the method of least squares.

2- The Method of Least Squares

This method of curve fitting was suggested early in the nineteenth century by the French mathematician Adrien-Marie Legendre. The method of least squares assumes that the best fitting line is the one for which the sum of the squares of the vertical distances of the points $(x_i, y_i)$ from the line is minimal.

For the simple linear regression we can use the least squares method to find the estimators $b_0$ and $b_1$ such that the sum of the squared distances between the value $y_i$ and the predicted value $\hat{y}_i$,

$$S = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2,$$

reaches the minimum among all possible choices of regression coefficients $\beta_0$ and $\beta_1$. Vertical distances of points from the regression line are shown in Figure 4.

Figure 4. Vertical distances of points from regression line

Mathematically, the least squares estimates $b_0$ and $b_1$ of the simple linear regression could be obtained by solving the following system:

$$\frac{\partial S}{\partial \beta_0} = 0, \qquad \frac{\partial S}{\partial \beta_1} = 0.$$

It is more convenient to solve this system using the linear model written in centered form:

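$$\hat{y}_i = \beta_0^* + \beta_1 (x_i - \bar{x}), \qquad \text{where } \beta_0^* = \beta_0 + \beta_1 \bar{x}.$$

Now the sum of squared distances equals

$$S = \sum_{i=1}^{n} \left[ y_i - \left( \beta_0^* + \beta_1 (x_i - \bar{x}) \right) \right]^2$$

and we need to solve the following system:

$$\frac{\partial S}{\partial \beta_0^*} = 0, \qquad \frac{\partial S}{\partial \beta_1} = 0.$$

Taking the partial derivatives with respect to $\beta_0^*$ and $\beta_1$ we get a system of equations (called normal equations):

$$\sum_{i=1}^{n} \left[ y_i - \left( \beta_0^* + \beta_1 (x_i - \bar{x}) \right) \right] = 0,$$

$$\sum_{i=1}^{n} \left[ y_i - \left( \beta_0^* + \beta_1 (x_i - \bar{x}) \right) \right] (x_i - \bar{x}) = 0.$$

Note that, from the first equation,

$$\frac{1}{n}\sum_{i=1}^{n} y_i = \beta_0^* + \beta_1 \cdot \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x}) = \beta_0^*.$$

Therefore, we have

$$b_0^* = \hat{\beta}_0^* = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}.$$

Substituting $\bar{y}$ for $\beta_0^*$ in the second equation we obtain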

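$$\sum_{i=1}^{n} \left[ y_i - \left( \bar{y} + \beta_1 (x_i - \bar{x}) \right) \right] (x_i - \bar{x}) = 0.$$

Now it is easy to see that

$$b_1 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{XY}}{S_{XX}}$$

and

$$b_0 = b_0^* - b_1 \bar{x} = \bar{y} - b_1 \bar{x}.$$

The fitted value of the simple regression is defined as $\hat{y}_i = b_0 + b_1 x_i$.

We have $y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$ for $i = 1, \dots, n$, and then

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \sim N\!\left(\beta_0 + \beta_1 \bar{x}, \frac{\sigma^2}{n}\right), \qquad b_1 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

The estimator $b_1$ is normally distributed, as a linear combination of the normally distributed random variables $Y_i$. Then it also follows that the estimator $b_0$ has a normal distribution.

2.1- Residuals

The difference between an observed $y_i$ and the fitted value $\hat{y}_i$, $\hat{e}_i = y_i - \hat{y}_i$, is referred to as the i-th regression residual. Its magnitude reflects the failure of the least squares line to model that particular point.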

We can use these residuals to estimate the unknown variance $\sigma^2$ of the random errors ($\sigma^2 = \mathrm{var}(e)$) with the estimator

$$\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n} \hat{e}_i^2.$$

Example 2: A regression model for the timing of production runs

We shall consider the following data: variable Y represents the time taken (in minutes) for a production run (run time) and variable X the number of items (run size) produced for 20 randomly selected orders. We wish to develop an equation to model the relationship between variables Y and X. The data are given in Table 1 and the corresponding scatter plot in Figure 5.

Table 1. The production data

Case | Run time | Run size | Case | Run time | Run size

Figure 5. A scatter plot of the production data

The scatter plot of the production data shows us that there might be a linear dependence between run time and run size. We will consider the linear regression model $Y = \beta_0 + \beta_1 x + e$. Now we will calculate the least squares estimators of the unknown parameters $\beta_0$ and $\beta_1$. We have that $\bar{x} = 201.75$ and $\bar{y} = 202.05$, and the estimators are

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x},$$

which give the fitted regression line $\hat{y} = b_0 + b_1 x$.

Now we shall give some properties of the estimators in the simple linear regression model, always having in mind the assumptions about the regression model given above.

2.2- Properties of estimator of the slope

Theorem 1: The least squares estimator $b_1$ is an unbiased estimator of $\beta_1$.

Proof

Here we take $x_i$, $i = 1, 2, \dots, n$, as constants, while $Y_i$ is a random variable.

$$E(b_1) = E\!\left(\frac{S_{XY}}{S_{XX}}\right) = \frac{1}{S_{XX}} \, E \sum_{i=1}^{n} (Y_i - \bar{Y})(x_i - \bar{x}).$$

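Using the fact that $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$, we get:

$$E(b_1) = \frac{1}{S_{XX}} \sum_{i=1}^{n} (x_i - \bar{x}) \, E Y_i = \frac{1}{S_{XX}} \sum_{i=1}^{n} (x_i - \bar{x})(\beta_0 + \beta_1 x_i) = \frac{1}{S_{XX}} \sum_{i=1}^{n} (x_i - \bar{x}) \, \beta_1 x_i = \frac{1}{S_{XX}} \sum_{i=1}^{n} (x_i - \bar{x}) \, \beta_1 (x_i - \bar{x}) = \frac{S_{XX}}{S_{XX}} \beta_1 = \beta_1.$$

Theorem 2: The variance of the estimator of the slope is

$$\mathrm{Var}(b_1) = \frac{\sigma^2}{S_{XX}}.$$

Proof

$$\mathrm{Var}(b_1) = \mathrm{Var}\!\left(\frac{S_{XY}}{S_{XX}}\right) = \frac{1}{S_{XX}^2} \mathrm{Var}\!\left(\sum_{i=1}^{n} (Y_i - \bar{y})(x_i - \bar{x})\right) = \frac{1}{S_{XX}^2} \mathrm{Var}\!\left(\sum_{i=1}^{n} Y_i (x_i - \bar{x})\right) = \frac{1}{S_{XX}^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 \, \mathrm{Var}(Y_i) = \frac{1}{S_{XX}^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 \sigma^2 = \frac{\sigma^2}{S_{XX}}.$$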
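Theorems 1 and 2 can also be checked numerically. The following Monte Carlo sketch is an illustration under assumed values: the design points `xs`, the parameters `beta0`, `beta1`, `sigma`, the number of replications and the helper names are all hypothetical choices, not taken from the thesis. It repeatedly simulates $Y_i = \beta_0 + \beta_1 x_i + e_i$ with normal errors and compares the empirical mean and variance of $b_1$ with $\beta_1$ and $\sigma^2 / S_{XX}$.

```python
import random


def slope_estimate(xs, ys):
    """Least squares slope b1 = S_XY / S_XX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


def simulate_slopes(xs, beta0, beta1, sigma, n_rep, seed=0):
    """Simulate n_rep samples from the linear model and return the slopes."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    slopes = []
    for _ in range(n_rep):
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
        slopes.append(slope_estimate(xs, ys))
    return slopes


# Hypothetical design points and parameters (assumptions of this sketch).
xs = [float(i) for i in range(1, 11)]
beta0, beta1, sigma = 2.0, 0.5, 1.0
slopes = simulate_slopes(xs, beta0, beta1, sigma, n_rep=20000)

x_bar = sum(xs) / len(xs)
s_xx = sum((x - x_bar) ** 2 for x in xs)
mean_b1 = sum(slopes) / len(slopes)
var_b1 = sum((b - mean_b1) ** 2 for b in slopes) / len(slopes)
theory_var = sigma ** 2 / s_xx

print(f"mean of b1: {mean_b1:.4f} (beta1 = {beta1})")
print(f"variance of b1: {var_b1:.5f} (sigma^2 / S_XX = {theory_var:.5f})")
```

A simulation of this kind does not replace the proofs; it only shows the two statements in action for one particular design.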

Theorem 3: The least squares estimator $b_1$ and $\bar{y}$ are uncorrelated. Then, under the normality assumption on $y_i$ for $i = 1, 2, \dots, n$, $b_1$ and $\bar{y}$ are normally distributed and independent.

Proof

As we mentioned before, $b_1$ and $\bar{y}$ are normally distributed. Let us calculate the covariance between $b_1$ and $\bar{y}$:

$$\mathrm{Cov}(b_1, \bar{y}) = \mathrm{Cov}\!\left(\frac{S_{XY}}{S_{XX}}, \bar{y}\right) = \frac{1}{S_{XX}} \mathrm{Cov}(S_{XY}, \bar{y}) = \frac{1}{n S_{XX}} \mathrm{Cov}\!\left(\sum_{i=1}^{n} (x_i - \bar{x})\, y_i, \; \sum_{j=1}^{n} y_j\right),$$

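that is,

$$\mathrm{Cov}(b_1, \bar{y}) = \frac{1}{n S_{XX}} \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - \bar{x}) \, \mathrm{Cov}(y_i, y_j).$$

Note that $E e_i = 0$ and, since the $y_i$'s are independent, we can write

$$\mathrm{Cov}(y_i, y_j) = E\left[(y_i - E y_i)(y_j - E y_j)\right] = E(e_i e_j) = \begin{cases} \sigma^2 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases}$$

Thus, we conclude that

$$\mathrm{Cov}(b_1, \bar{y}) = \frac{1}{n S_{XX}} \sum_{i=1}^{n} (x_i - \bar{x}) \, \sigma^2 = 0.$$

Recall that zero correlation is equivalent to independence between two jointly normal variables. Thus, we conclude that $b_1$ and $\bar{y}$ are independent.

2.3- Properties of estimator of the intercept

Theorem 4: The least squares estimator $b_0$ is an unbiased estimator of $\beta_0$.

Proof

Here also we take $x_i$, $i = 1, 2, \dots, n$, as constants, while $Y_i$ is a random variable.

$$E b_0 = E(\bar{Y} - b_1 \bar{x}) = \frac{1}{n} \sum_{i=1}^{n} E Y_i - \bar{x} \, E b_1 = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i) - \beta_1 \bar{x} = \beta_0 + \beta_1 \bar{x} - \beta_1 \bar{x} = \beta_0.$$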

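Theorem 5: The variance of the estimator of the intercept is

$$\mathrm{Var}(b_0) = \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{XX}} \right) \sigma^2.$$

Proof

Since $b_1$ and $\bar{y}$ are uncorrelated,

$$\mathrm{Var}(b_0) = \mathrm{Var}(\bar{y} - b_1 \bar{x}) = \mathrm{Var}(\bar{y}) + \bar{x}^2 \, \mathrm{Var}(b_1) = \frac{\sigma^2}{n} + \frac{\bar{x}^2 \sigma^2}{S_{XX}} = \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{XX}} \right) \sigma^2.$$

2.4- Testing the validity of the model

If the slope of the regression line $\beta_1$ is equal to zero, then the linear regression model reduces to $Y = \beta_0 + e$ and we cannot predict values of variable Y based on the known values of variable X. To test the null hypothesis $H_0: \beta_1 = 0$ against the alternative $H_1: \beta_1 \neq 0$, we can use the test statistic

$$T = \frac{b_1 - \beta_1}{\hat{\sigma} / \sqrt{S_{XX}}},$$

which under the null hypothesis (that is, with $\beta_1 = 0$) has a Student t-distribution with $n - 2$ degrees of freedom. We used the fact that the estimator $b_1 \sim N\!\left(\beta_1, \sigma^2 / S_{XX}\right)$, where $S_{XX} = \sum_{i=1}^{n} (x_i - \bar{x})^2$.

The test is of the form: reject $H_0$ at level $\alpha$ if $|T| \geq c$, where $c = t_{n-2,\, 1-\alpha/2}$ is obtained from the table of the Student t-distribution. If $H_0$ is rejected, we conclude that the linear regression model could be a valid model for our data. Otherwise, we do not reject the null hypothesis.

2.5- Prediction intervals for the actual value of Y

In this section we consider the problem of finding a prediction interval for the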

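actual value of Y at $x_p$, a given value of X. We denote by $Y_p$ the value of variable Y at $X = x_p$ and by $\hat{Y}_p$ its predicted value. We have that $Y_p = \beta_1 x_p + \beta_0 + e_p$ and $\hat{Y}_p = b_1 x_p + b_0$.

First we will derive the expectation and variance of $Y_p - \hat{Y}_p$:

$$E(Y_p - \hat{Y}_p) = E(\beta_1 x_p + \beta_0 + e_p - b_1 x_p - b_0) = -E(b_1 - \beta_1)\, x_p - E(b_0 - \beta_0) + E e_p = 0,$$

$$\mathrm{Var}(Y_p - \hat{Y}_p) = \mathrm{Var}(\beta_1 x_p + \beta_0 + e_p - b_1 x_p - b_0) = \mathrm{Var}(e_p - b_1 x_p - b_0)$$

$$= \mathrm{Var}(e_p) + x_p^2 \, \mathrm{Var}(b_1) + \mathrm{Var}(b_0) - 2\,\mathrm{Cov}(e_p, b_1 x_p) - 2\,\mathrm{Cov}(e_p, b_0) + 2\,\mathrm{Cov}(b_1 x_p, b_0).$$

As we have that $\mathrm{Cov}(b_1, e_p) = 0$, $\mathrm{Cov}(b_0, e_p) = 0$ and $\mathrm{Cov}(b_1, b_0) = -\bar{x}\,\sigma^2 / S_{XX}$, we get

$$\mathrm{Var}(Y_p - \hat{Y}_p) = \sigma^2 + \frac{x_p^2 \sigma^2}{S_{XX}} + \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{XX}} \right) \sigma^2 - \frac{2 x_p \bar{x} \sigma^2}{S_{XX}} = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{S_{XX}} \right).$$

So, we get that

$$Y_p - \hat{Y}_p \sim N\!\left(0, \; \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{S_{XX}} \right)\right), \qquad \text{where } S_{XX} = \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

Standardizing and replacing $\sigma$ by $\hat{\sigma}$ gives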

$$T = \frac{Y_p - \hat{Y}_p}{\hat{\sigma} \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{S_{XX}}}} \sim t_{n-2}.$$

A $100(1 - \alpha)\%$ prediction interval for $Y_p$, the value of Y at $X = x_p$, is given by

$$I_{Y_p} = \left( \hat{Y}_p - c\,\hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{S_{XX}}}, \;\; \hat{Y}_p + c\,\hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{S_{XX}}} \right),$$

where $c = t_{n-2,\, 1-\alpha/2}$.

2.6- Coefficient of determination

The square of the correlation between variables Y and X is called the coefficient of determination and is denoted by $R^2$. It represents the proportion of the total sample variability in the Y's that is explained by the regression model. A higher value of $R^2$ is preferable because it means that the regression model describes well the relationship that exists between the observed variables.

3- Nonlinear Models

Obviously, not all x-y relationships can be described by straight lines. Curvilinear relationships of all sorts can be found in every field of application. Most of these nonlinear models can be fitted using the least squares method, provided the data have been initially linearized by a suitable transformation.

3.1- Polynomial regression

Polynomial regression is used when the dependence between variables Y and X is of the type

$$Y = a_0 + a_1 X + a_2 X^2 + \dots + a_m X^m + e.$$

We will consider the case when $m = 2$:

$$Y = a + bX + cX^2 + e.$$

When a set of points exhibits a parabolic trend then we can use the quadratic function $y = a + bx + cx^2$. We can find the unknown coefficients a, b and c using the method of least squares. We get the system of normal equations

$$a\,n + b \sum x_i + c \sum x_i^2 = \sum y_i,$$

$$a \sum x_i + b \sum x_i^2 + c \sum x_i^3 = \sum x_i y_i,$$

$$a \sum x_i^2 + b \sum x_i^3 + c \sum x_i^4 = \sum x_i^2 y_i.$$

By solving this system of equations, we find estimates of a, b and c.

Example 3: Find a formula for the curve of the form $Y = a + bx + cx^2 + e$ to fit the following data:

x
y

Solution: Substituting the values of $n$, $\sum x_i$, $\sum x_i^2$, $\sum x_i^3$, $\sum x_i^4$, $\sum y_i$, $\sum x_i y_i$ and $\sum x_i^2 y_i$ in the above normal equations we get

10a 4.5b.85c a.85b.05c .85a.05b c 8.89

Solving these equations, we obtain aˆ 3.16, b ˆ and cˆ 0.91. The required equation of quadratic regression is

yˆ x 0.91x

The scatter plot with the fitted regression function is given below.

Figure 6. Scatter plot and fitted quadratic function for given data

3.2- Exponential Regression

Suppose the relationship between two variables is best described by an exponential function of the form

$$y = a e^{bx}.$$

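In this case it is possible to obtain approximate solutions using numerical methods or the idea of linearization. Transforming the previous equation by taking logarithms on both sides we get

$$\ln y = \ln a + b x,$$

which implies that $\ln y$ and $x$ have a linear relationship. When we apply the least squares method to $x$ and $\ln y$, this yields

$$b = \frac{n \sum_{i=1}^{n} x_i \ln y_i - \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} \ln y_i \right)}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} \qquad \text{and} \qquad \ln a = \frac{1}{n} \left( \sum_{i=1}^{n} \ln y_i - b \sum_{i=1}^{n} x_i \right).$$

Example 4: Find a least squares curve of the form $y = a e^{bx}$ $(a > 0)$ to the data given below.

x
y

Solution: Consider $y = a e^{bx}$. Applying the logarithm (with base e) on both sides, we get

$$\ln y = \ln a + b x \qquad (1)$$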

Taking $Y = \ln y$, equation (1) can be written as

$$Y = \tilde{a} + b x, \qquad (2)$$

where $\tilde{a} = \ln a$. Equation (2) is a linear equation of Y on x. The normal equations are

$$\tilde{a}\, n + b \sum_{i=1}^{11} x_i = \sum_{i=1}^{11} Y_i, \qquad \tilde{a} \sum_{i=1}^{11} x_i + b \sum_{i=1}^{11} x_i^2 = \sum_{i=1}^{11} x_i Y_i.$$

After calculations, we get

11a ~ +b=6.59 a~ 48.4b

After solving these equations, we obtain a ˆ~ b ˆ 0.96, a= a= à=0,0001 l. So we get $\hat{a} = e^{\hat{\tilde{a}}}$. The required curve is

yˆ 1.64 e 0.96x

The scatter plot with the fitted regression line is given below.

Figure 7. Scatter plot and fitted exponential function for given data

3.3- Other curvilinear models

We will consider two other nonlinear regression models.

1. Let $Y = a x^b$. Taking logarithms of both sides of the equation we get $\ln Y = \ln a + b \ln x$. Using the notation $U = \ln Y$, $\alpha = \ln a$ and $v = \ln x$ we get the linear regression model $U = \alpha + b v$. We find estimates of $\alpha$ and $b$ by solving the following system of equations:

$$\alpha\, n + b \sum_{i=1}^{n} v_i = \sum_{i=1}^{n} U_i, \qquad \alpha \sum_{i=1}^{n} v_i + b \sum_{i=1}^{n} v_i^2 = \sum_{i=1}^{n} U_i v_i.$$

We find the estimate of a as $\hat{a} = e^{\hat{\alpha}}$.

2. Let $Y = \dfrac{1}{a x + b}$. We use $U = \dfrac{1}{Y}$ and we get the linear regression model $U = a x + b$. We find estimates of a and b by solving the following system of equations:

$$a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i U_i, \qquad a \sum_{i=1}^{n} x_i + b\, n = \sum_{i=1}^{n} U_i.$$

4- Diagnostics for simple linear regression

In Chapter 2 we studied the simple linear regression model. Throughout Chapter 2, we assumed that the simple linear regression model was a valid model for the data, that is, the conditional mean of Y given X is a linear function of X and the conditional variance of Y given X is constant. In other words,

$$E(Y \mid X) = \beta_0 + \beta_1 X, \qquad \mathrm{var}(Y \mid X) = \sigma^2.$$

In Section 4.1, we start by examining the important issue of deciding whether the model under consideration is indeed valid. In Section 4.2, we will see that when we use a regression model we implicitly make a series of assumptions. We then consider a series of tools known as regression diagnostics to check each assumption.

4.1- Valid and Invalid Regression Models: Anscombe's Four Data Sets

Throughout this section we shall consider four data sets constructed by Anscombe. This example illustrates the point that looking only at the numerical regression output may lead to very misleading conclusions about the data, and lead to accepting the wrong model. The data are given in the table below (Table 4.1) and are plotted in Figure 8. Values of the random variable Y differ in each of the four data sets, while the x-values of variable X are the same in the first three sets.

Table 4.1 Anscombe's four data sets

Case | x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4

Figure 8. Plots of Anscombe's four data sets

When a regression model is fitted to data sets 1, 2, 3 and 4, in each case the fitted regression model is $\hat{y} = 3.0 + 0.5x$. The regression output for data sets 1 to 4 is given below. The regression output for the four data sets is identical (to two decimal places) in every respect. In all cases, the results of the t-test for validity of the model (slope of the regression line is different from zero) are significant (the p-value is smaller than 0.05).

Looking at Figure 8 it is obvious that a straight-line regression model is appropriate only for Data Set 1. On the other hand, the data in Data Set 2 seem to have a nonlinear rather than a straight-line relationship. The third data set has an extreme outlier that should be investigated. For the fourth data set, the slope of the regression line is solely determined by a single point, namely, the point with the largest x value.
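The near-identical fits can be reproduced with a short Python sketch. Two caveats apply: the data values below are those commonly published for Anscombe's quartet and should be checked against Table 4.1, and the helper `fit_line` is a name introduced here for illustration (the thesis itself used R).

```python
def fit_line(xs, ys):
    """Return (b0, b1, r2) for the least squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    r2 = s_xy ** 2 / (s_xx * s_yy)  # squared correlation between X and Y
    return b0, b1, r2


# Anscombe's quartet as commonly published (assumed to match Table 4.1).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

anscombe = {1: (x123, y1), 2: (x123, y2), 3: (x123, y3), 4: (x4, y4)}
fits = {k: fit_line(xs, ys) for k, (xs, ys) in anscombe.items()}

for k, (b0, b1, r2) in fits.items():
    print(f"data set {k}: intercept {b0:.2f}, slope {b1:.2f}, R^2 {r2:.2f}")
```

The numerical summaries agree to two decimals, which is exactly why the plots in Figure 8 are needed to tell the four situations apart.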

Regression output from R

First model

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) *
x **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 9 DF, p-value:

Second model

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)
x **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
F-statistic: on 1 and 9 DF, p-value:

Third model

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) *
x **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: , Adjusted R-squared: 0.6292
F-statistic: on 1 and 9 DF, p-value:

Fourth model

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) *
x **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: , Adjusted R-squared: 0.6297

F-statistic: 18 on 1 and 9 DF, p-value:

This example demonstrates that the numerical regression output should always go together with an analysis to ensure that an appropriate model has been fitted to the data. In this case it is enough to look at the scatter plots in Figure 8 to determine whether an appropriate model has been fit. However, in some situations we shall need some additional tools in order to check the validity of the fitted model.

4.2- Plots of residuals

One tool we will use to validate a regression model is one or more plots of residuals (or standardized residuals, which will be defined later in this chapter). These plots will enable us to assess visually whether an appropriate model has been fit to the data. Figure 9 provides plots of the residuals against X for each of Anscombe's four data sets. No pattern can be recognized in the plot of the residuals for Data Set 1 against $x_1$.

Figure 9. Residual plots for Anscombe's data sets

We shall see next that this indicates that an appropriate model has been fit to the data. We shall see that a plot of residuals against X that produces a random pattern indicates that an appropriate model has been fit to the data. Additionally, we shall see that a plot of residuals against X that produces a nonrandom pattern indicates that an incorrect model has been fit to the data.

Using Plots of Residuals to Determine Whether the Proposed Regression Model Is a Valid Model

One way of checking whether a valid simple linear regression model has been fit is to plot residuals versus x and look for patterns. If no pattern is found then this indicates that the model provides an adequate summary of the data, i.e., is a valid model. If a pattern is found then the shape of the pattern provides information on the function of x that is missing from the model.

For example, suppose that the true model is a straight line

$$Y_i = \beta_0 + \beta_1 x_i + e_i$$

and we fit a straight line

$$\hat{y}_i = b_0 + b_1 x_i.$$

Then, assuming that the least squares estimates $b_0$, $b_1$ are close to the unknown population parameters $\beta_0$, $\beta_1$, we find that

$$\hat{e}_i = Y_i - \hat{y}_i = (\beta_0 - b_0) + (\beta_1 - b_1) x_i + e_i \approx e_i,$$

that is, the residuals should resemble random errors. If the residuals vary with x then this indicates that an incorrect model has been fit. For example, suppose that the true model is a quadratic

$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i$$

and that we fit a straight line

$$\hat{y}_i = b_0 + b_1 x_i.$$

Then, somewhat simplistically assuming that the least squares estimates $b_0$, $b_1$ are close to the unknown population parameters $\beta_0$ and $\beta_1$, we find that

$$\hat{e}_i = Y_i - \hat{y}_i = (\beta_0 - b_0) + (\beta_1 - b_1) x_i + \beta_2 x_i^2 + e_i \approx \beta_2 x_i^2 + e_i.$$

Example of a Quadratic Model

Suppose that Y is a quadratic function of X without any random error. Then, the residuals from the straight-line fit of Y and X will have a quadratic pattern. Hence, we can conclude that there is need for a quadratic term to be added to the original straight-line regression model. Anscombe's data set 2 is an example of such a situation. Figure 10 contains scatter plots of the data and the residuals from a straight-line model for data set 2. As expected, a clear quadratic pattern is evident in the residuals in Figure 10.
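The quadratic pattern in the residuals can be illustrated with a small Python sketch. The data are hypothetical (an exact quadratic with no random error, on assumed x-values), and `fit_line` is a helper name introduced only for this illustration.

```python
def fit_line(xs, ys):
    """Return the least squares intercept b0 and slope b1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


# Hypothetical data: Y is an exact quadratic function of x (no random error).
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x + 0.5 * x ** 2 for x in xs]

# Fit a straight line even though the true relationship is quadratic.
b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    print(f"x = {x:5.1f}   residual = {e:6.2f}")
```

For these symmetric x-values the residuals are positive at both ends and negative in the middle, tracing a parabola: the missing quadratic term shows up in the residual plot.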

Figure 10. Scatter plots and the residuals from a straight-line model for data set 2

Regression Diagnostics: Tools for Checking the Validity of a Model

We next look at tools (called regression diagnostics) which are used to check the validity of all aspects of regression models. When fitting a regression model we will discover that it is important to:

1. Determine whether the proposed regression model is a valid model (i.e., determine whether it provides an adequate fit to the data). The main tools we will use to validate regression assumptions are plots of standardized residuals. The plots enable us to assess visually whether the assumptions are being violated.

2. Determine which (if any) of the data points have x-values that have an unusually large effect on the estimated regression model (such points are called leverage points).

3. Determine which (if any) of the data points are outliers, that is, points which do not follow the pattern set by the rest of the data, when one takes into account the given model.

4. If leverage points exist, determine whether each is a bad leverage point. If a bad leverage point exists we shall assess its influence on the fitted model.

5. Examine whether the assumption of constant variance of the errors is reasonable.

We begin by looking at the second item of the above list, leverage points, as these will be needed in the explanation of standardized residuals.

Leverage Points

Data points which have considerable influence on the fitted model are called leverage points. To make things as simple as possible, we shall begin by describing leverage points as either good or bad.

Example of a good and a bad leverage point

Twenty points are randomly generated from a known straight-line regression model. We get a plot like that shown in Figure 11. One of the 20 points has an x-value which makes it distant from the other points on the x-axis. We shall see that this point, which is marked on the plot, is a good leverage point. The true population regression line (namely, $Y = \beta_0 + \beta_1 x$) and the least squares regression line (namely, $\hat{y} = b_0 + b_1 x$) are marked on the plot.

Figure 11. Good leverage point

Next we move one of the points away from the true population regression line. In particular, we focus on the point with the largest x-value. Moving this point vertically down (so that its x-value stays the same) produces the results shown in Figure 12. Notice how the least squares regression line has changed dramatically in response to changing the Y-value of just a single point. The least squares regression line has been levered down by a single point. Hence we

call this point a leverage point. It is a bad leverage point since its Y-value does not follow the pattern set by the other 19 points.

In summary, a leverage point is a point whose x-value is distant from the other x-values. A point is a bad leverage point if its Y-value does not follow the pattern set by the other data points. In other words, a bad leverage point is a leverage point which is also an outlier.

Figure 12. Bad leverage point

Returning to Figure 11, the point marked on the plot is said to be a good leverage point since its Y-value closely follows the upward trend pattern set by the other 19 points. In other words, a good leverage point is a leverage point which is NOT also an outlier.

Next we investigate what happens when we change the Y-value of a point in Figure 11 which has a central x-value. We move one of these points away from the true population regression line. In particular, we focus on the point with the 11th largest x-value. Moving this point vertically up (so that its x-value stays the same) produces the results shown in Figure 13. Notice how the least squares regression line has changed relatively little in response to changing the Y-value of a centrally located x. This point is said to be an outlier that is not a leverage point.
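The contrast between moving a point with an extreme x-value and moving a point with a central x-value can be sketched as follows. The 20 points below are hypothetical (they lie exactly on an assumed line and are not the randomly generated points behind Figures 11 to 13), and `slope` is a helper name used only here.

```python
def slope(xs, ys):
    """Least squares slope b1 = S_XY / S_XX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / s_xx


# Hypothetical data: 19 points at x = 1..19 plus one distant point at x = 40,
# all lying exactly on the assumed line y = 1 + 0.5 x.
xs = [float(i) for i in range(1, 20)] + [40.0]
ys = [1.0 + 0.5 * x for x in xs]
base_slope = slope(xs, ys)

# Move the point with the largest x-value vertically down by 10 units.
ys_bad = list(ys)
ys_bad[-1] -= 10.0
change_leverage = slope(xs, ys_bad) - base_slope

# Move a point with a central x-value (x = 11) vertically up by 10 units.
ys_outlier = list(ys)
ys_outlier[10] += 10.0
change_central = slope(xs, ys_outlier) - base_slope

print(f"slope change, extreme x: {change_leverage:.4f}")
print(f"slope change, central x: {change_central:.4f}")
```

The same vertical shift moves the slope far more when it is applied at the distant x-value, which is the behaviour described above for the bad leverage point and for the outlier that is not a leverage point.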

Figure 13. A plot of Y against x showing an outlier that is not a leverage point

A numerical rule that will identify $x_i$ as a leverage point (i.e., a point of high leverage) is based on:

- The distance $x_i$ is away from the bulk of the x's.
- The extent to which the fitted regression line is influenced by the given point.

The second bullet point above deals with the extent to which $\hat{y}_i$ (the predicted value of Y at $x = x_i$) depends on $y_i$ (the actual value of Y at $x = x_i$). We have that

$$\hat{y}_i = b_0 + b_1 x_i, \qquad b_0 = \bar{y} - b_1 \bar{x}, \qquad b_1 = \sum_{j=1}^{n} c_j y_j, \qquad c_j = \frac{x_j - \bar{x}}{S_{xx}}, \qquad S_{xx} = \sum_{j=1}^{n} (x_j - \bar{x})^2,$$

so that

$$\hat{y}_i = \bar{y} - b_1 \bar{x} + b_1 x_i = \bar{y} + b_1 (x_i - \bar{x}).$$

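Substituting $\bar{y} = \frac{1}{n}\sum_{j=1}^{n} y_j$ and $b_1 = \sum_{j=1}^{n} \frac{x_j - \bar{x}}{S_{xx}}\, y_j$ into $\hat{y}_i = \bar{y} + b_1 (x_i - \bar{x})$ gives

$$\hat{y}_i = \frac{1}{n}\sum_{j=1}^{n} y_j + \sum_{j=1}^{n} \frac{x_j - \bar{x}}{S_{xx}}\, y_j\, (x_i - \bar{x}) = \sum_{j=1}^{n} \left[ \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{S_{xx}} \right] y_j = \sum_{j=1}^{n} h_{ij}\, y_j,$$

where

$$h_{ij} = \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{S_{xx}}, \qquad S_{xx} = \sum_{j=1}^{n} (x_j - \bar{x})^2.$$

Notice that

$$\sum_{j=1}^{n} h_{ij} = \sum_{j=1}^{n} \left[ \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{S_{xx}} \right] = \frac{n}{n} + \frac{x_i - \bar{x}}{S_{xx}} \sum_{j=1}^{n} (x_j - \bar{x}) = 1,$$

since $\sum_{j=1}^{n} (x_j - \bar{x}) = 0$.

We can express the predicted value $\hat{y}_i$ as

$$\hat{y}_i = h_{ii}\, y_i + \sum_{j \neq i} h_{ij}\, y_j \qquad (*)$$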

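where

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}.$$

The term $h_{ii}$ is commonly called the leverage of the $i$-th data point. Notice that $h_{ii}$ shows how $y_i$ affects $\hat{y}_i$. For example, if $h_{ii} \approx 1$ then the other $h_{ij}$ terms are close to zero, since $\sum_{j=1}^{n} h_{ij} = 1$, and

$$\hat{y}_i = 1 \times y_i + \text{other terms} \approx y_i.$$

In this situation, the predicted value $\hat{y}_i$ will be close to the actual value $y_i$ no matter what values the rest of the data take. Notice also that $h_{ii}$ depends only on the x's. Thus a point of high leverage (or a leverage point) can be found by looking at just the values of the x's and not at the values of the y's.

We have

$$\sum_{i=1}^{n} h_{ii} = \sum_{i=1}^{n} \left[ \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2} \right] = \frac{n}{n} + \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2} = 1 + 1 = 2,$$

so the average leverage is $2/n$.

Rule for identifying leverage points

A popular rule, which we shall adopt, is to classify $x_i$ as a point of high leverage (i.e., a leverage point) in a simple linear regression model if

$$h_{ii} > 2 \times \operatorname{average}(h_{ii}) = 2 \times \frac{2}{n} = \frac{4}{n}.$$

Strategies for dealing with bad leverage points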

1. Remove invalid data points

We ask a question about the validity of the data points corresponding to bad leverage points: are these data points unusual or different in some way from the rest of the data? If so, we remove these points and refit the model without them.

2. Fit a different regression model

We ask a question about the validity of the regression model that has been fitted: has an incorrect model been fitted to the data? If so, we should consider a different model.

Good leverage points

While good leverage points do not have an adverse effect on the estimated regression coefficients, they do decrease their estimated standard errors as well as increase the value of $R^2$. Hence, it is important to check extreme leverage points for validity, even when they are so-called good.

Standardized Residuals

In this section we will consider standardized residuals and their importance in linear regression diagnostics. First we will show that the $i$-th least squares residual has variance given by

$$\operatorname{Var}(\hat{e}_i) = \sigma^2 [1 - h_{ii}].$$

Proof

Recall from formula (*) that

$$\hat{y}_i = h_{ii}\, y_i + \sum_{j \neq i} h_{ij}\, y_j.$$

Thus,

$$\hat{e}_i = y_i - \hat{y}_i = y_i - h_{ii}\, y_i - \sum_{j \neq i} h_{ij}\, y_j = (1 - h_{ii})\, y_i - \sum_{j \neq i} h_{ij}\, y_j.$$

Since the $y_j$ are independent with common variance $\sigma^2$,

$$\operatorname{Var}(\hat{e}_i) = \operatorname{Var}\left[ (1 - h_{ii})\, y_i - \sum_{j \neq i} h_{ij}\, y_j \right] = (1 - h_{ii})^2 \sigma^2 + \sum_{j \neq i} h_{ij}^2\, \sigma^2.$$

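Expanding the square gives

$$\operatorname{Var}(\hat{e}_i) = \sigma^2 \left[ 1 - 2h_{ii} + h_{ii}^2 + \sum_{j \neq i} h_{ij}^2 \right] = \sigma^2 \left[ 1 - 2h_{ii} + \sum_{j=1}^{n} h_{ij}^2 \right].$$

Next, notice that

$$\sum_{j=1}^{n} h_{ij}^2 = \sum_{j=1}^{n} \left[ \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{S_{xx}} \right]^2 = \frac{1}{n} + \frac{2 (x_i - \bar{x})}{n S_{xx}} \sum_{j=1}^{n} (x_j - \bar{x}) + \frac{(x_i - \bar{x})^2}{(S_{xx})^2} \sum_{j=1}^{n} (x_j - \bar{x})^2 = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}} = h_{ii},$$

where $S_{xx} = \sum_{j=1}^{n} (x_j - \bar{x})^2$ and the middle term vanishes because $\sum_{j=1}^{n} (x_j - \bar{x}) = 0$. So that

$$\operatorname{Var}(\hat{e}_i) = \sigma^2 [1 - 2h_{ii} + h_{ii}] = \sigma^2 [1 - h_{ii}].$$

Thus, if $h_{ii} \approx 1$, so that the $i$-th point is a leverage point, then the corresponding residual $\hat{e}_i$ has small variance (since $1 - h_{ii} \approx 0$). This seems reasonable when one considers that if $h_{ii} \approx 1$ then $\hat{y}_i \approx y_i$, so that $\hat{e}_i$ will always be small (and so it does not vary much).

We have also shown that

$$\operatorname{Var}(\hat{y}_i) = \operatorname{Var}\left( \sum_{j=1}^{n} h_{ij}\, y_j \right) = \sum_{j=1}^{n} h_{ij}^2 \operatorname{Var}(y_j) = \sigma^2 \sum_{j=1}^{n} h_{ij}^2 = \sigma^2 h_{ii}.$$

This again seems reasonable when we consider the fact that when $h_{ii} \approx 1$ then $\hat{y}_i \approx y_i$. In this case, $\operatorname{Var}(\hat{y}_i) = \sigma^2 h_{ii} \approx \sigma^2 = \operatorname{Var}(y_i)$.

The problem of the residuals having different variances can be overcome by standardizing each residual by dividing it by an estimate of its standard deviation. Thus, the $i$-th standardized residual $r_i$ is given by

$$r_i = \frac{\hat{e}_i}{s \sqrt{1 - h_{ii}}}, \qquad \text{where} \quad s = \sqrt{\frac{1}{n-2} \sum_{j=1}^{n} \hat{e}_j^2}$$

is the estimate of $\sigma$ obtained from the model.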
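The identities used in this derivation, and the resulting standardized residuals, can be checked numerically. The sketch below is written in Python rather than the R used in the thesis, and it runs on hypothetical data; the function names `hat_matrix` and `standardized_residuals`, the x-values and the artificial outlier at index 9 are assumptions made for illustration only.

```python
import math

def hat_matrix(x):
    """h_ij = 1/n + (x_i - x_bar)(x_j - x_bar) / S_xx for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xj - x_bar) ** 2 for xj in x)
    return [[1 / n + (xi - x_bar) * (xj - x_bar) / sxx for xj in x] for xi in x]

def standardized_residuals(x, y):
    """r_i = e_i / (s * sqrt(1 - h_ii)), using fitted values y_hat_i = sum_j h_ij y_j."""
    n = len(x)
    h = hat_matrix(x)
    fitted = [sum(hij * yj for hij, yj in zip(row, y)) for row in h]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
    return [e / (s * math.sqrt(1 - h[i][i])) for i, e in enumerate(resid)]

# Hypothetical x-values: 19 evenly spaced values and one distant value.
x = [1 + 0.5 * k for k in range(19)] + [25.0]
h = hat_matrix(x)

row_sums = [sum(row) for row in h]                         # each should equal 1
sum_squares = [sum(hij ** 2 for hij in row) for row in h]  # should equal h_ii
trace = sum(h[i][i] for i in range(len(x)))                # should equal 2

# Hypothetical responses near y = 2 + 0.5 x, with one central point moved up by 10.
noise = [0.3 if k % 2 == 0 else -0.3 for k in range(20)]
y = [2 + 0.5 * xi + e for xi, e in zip(x, noise)]
y[9] += 10

r = standardized_residuals(x, y)
outliers = [i for i, ri in enumerate(r) if abs(ri) > 2]
print(trace, outliers)
```

Under these assumptions the moved central point is the only one whose standardized residual exceeds 2 in absolute value, while the distant x-value contributes a large leverage but a small residual.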

A crucial assumption in any regression analysis is that the errors have constant variance. In the literature it is recommended that an effective plot to diagnose nonconstant error variance is a plot of $|\text{Residuals}|^{0.5}$ against x or a plot of $|\text{Standardized Residuals}|^{0.5}$ against x (or against the fitted values).

When points of high leverage exist, it is informative to look at plots of standardized residuals. The advantage of standardized residuals is that they immediately tell us how many estimated standard deviations any point is away from the fitted regression model. For example, suppose that the 6th point has a standardized residual of 4.3; this means that the 6th point is an estimated 4.3 standard deviations away from the fitted regression line. If the errors are normally distributed, then observing a point 4.3 standard deviations away from the fitted regression line is highly unusual. Such a point would commonly be referred to as an outlier and as such it should be investigated. We shall follow the common practice of labeling points as outliers in small- to moderate-size data sets if the standardized residual for the point falls outside the interval from -2 to 2. In very large data sets, we shall change this rule to -4 to 4. Identification and examination of any outliers is a key part of regression analysis.

Recall that a bad leverage point is a leverage point which is also an outlier. Thus, a bad leverage point is a leverage point whose standardized residual falls outside the interval from -2 to 2 for small- to moderate-size data sets. On the other hand, a good leverage point is a leverage point whose standardized residual falls inside the interval from -2 to 2.

Assessing the Influence of Certain Cases

One or more cases can strongly control or influence the least squares fit of a regression model. In this section we look at summary statistics that measure the influence of a single case on the least squares fit of a regression model. Cook proposed a measure of the influence of individual cases which is called Cook's distance and is given by:

$$D_i = \frac{r_i^2}{2} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

where $r_i$ is the $i$-th standardized residual and $h_{ii}$ is the $i$-th leverage value. In the literature it is recommended to use $4/(n-2)$ as a rough cut-off for noteworthy values of $D_i$ for simple linear regression. (This means that if Cook's distance of a point is bigger than $4/(n-2)$, the point has significant influence on the fitted regression model.)

Recommendations for Handling Outliers and Leverage Points

We conclude this section with some general advice about how to deal with outliers and leverage points.

Points should not be routinely deleted from an analysis just because they do not fit the model. Outliers and bad leverage points are signals, flagging potential problems with the model.

Outliers often point out an important feature of the problem not considered before. They may point to an alternative model in which the points are not outliers. In this case it is then worth considering fitting an alternative model.

Normality of the Errors

The assumption of normal errors is needed in small samples for the validity of t-distribution based hypothesis tests and confidence intervals, and for all sample sizes for prediction intervals. This assumption is generally checked by looking at the distribution of the residuals or standardized residuals.

A common way to assess normality of the errors is to look at what is commonly referred to as a normal probability plot or a normal Q-Q plot of the standardized residuals. A normal probability plot of the standardized residuals is obtained by plotting the ordered standardized residuals on the vertical axis against the expected order statistics from a standard normal distribution on the horizontal axis. If the resulting plot produces points close to a straight line then the data are said to be consistent with that from a normal distribution. On the other hand, departures from linearity provide evidence of non-normality.
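The construction of the normal Q-Q plot described above can be sketched as follows. This is an illustrative Python example, not the R code used for the thesis; the residual values are hypothetical, the function name `normal_qq_points` is made up for this sketch, and the plotting positions $(i - 0.5)/n$ are one common approximation to the expected normal order statistics, assumed here for simplicity.

```python
from statistics import NormalDist

def normal_qq_points(values):
    """Pair each ordered value with an approximate expected normal order statistic."""
    n = len(values)
    ordered = sorted(values)
    # Assumption: approximate the expected order statistics by standard normal
    # quantiles at the plotting positions (i - 0.5) / n.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    # Horizontal coordinate: theoretical quantile; vertical coordinate: ordered value.
    return list(zip(theoretical, ordered))

# Hypothetical standardized residuals.
r = [-1.4, 0.3, 0.9, -0.2, 1.6, -0.8, 0.1, -0.5, 0.6, 2.1]
points = normal_qq_points(r)
for q, value in points:
    print(round(q, 3), value)
```

Plotting these pairs and judging how closely they follow a straight line is the visual check the text refers to.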

Example 6: Recall the example from Chapter 2 on the timing of production runs, for which we fit a straight-line regression model to run time from run size. The data are given in the table below (along with values of leverages, residuals and standardized residuals) and are plotted in Figure 14.

Table. Regression diagnostics for the model in Figure 14

Case | Run time | Run size | Leverage | Residuals | Std. residuals

Figure 14. A plot of the production data with the least squares line included

We begin by considering the simple regression model where Y = run size and x = run time. Regression output from R is given below.

Residuals:
Min 1Q Median 3Q Max

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) ***
run.time e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 18 degrees of freedom
Multiple R-squared: 0.730, Adjusted R-squared:
F-statistic: 48.7 on 1 and 18 DF, p-value: 1.615e-06
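As a consistency check on the output above, recall that in simple linear regression the F-statistic can be written in terms of the coefficient of determination and the residual degrees of freedom. The worked calculation below uses only the rounded values printed in the output ($R^2 = 0.730$ and 18 degrees of freedom), so it reproduces the reported F-statistic only up to rounding.

```latex
F \;=\; (n-2)\,\frac{R^2}{1-R^2}
  \;=\; 18 \times \frac{0.730}{1-0.730}
  \;=\; 18 \times \frac{0.730}{0.270}
  \;\approx\; 48.7
```

This agrees with the F-statistic of 48.7 on 1 and 18 degrees of freedom shown in the output.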

Looking at the given R output, we could conclude that the simple regression model is valid for the given data: the slope coefficient of the regression line is significantly different from zero (based on the results of the t-test) and 73.0% of the variance of the variable run size is explained by the regression model (the value of the coefficient of determination). But before we make such a conclusion we should look at the diagnostic plots. Also, we should check the leverage values: as we said before, a point is classified as a high leverage point if its leverage value is bigger than 4/20 = 0.2. Only the 17th point has a leverage value bigger than this cut-off point. On the scatter plot, we can see that the y-value of this point is not distant from the other points, meaning that this is a good leverage point (also, its standardized residual falls inside the interval from -2 to 2).

Figure 15 provides diagnostic plots produced by R: on the top left is the plot for randomness of residuals, on the top right is the normal Q-Q plot, on the bottom left is the plot for variance of residuals and on the bottom right is the plot for Cook's distance. The variance of the error term appears to be constant. These plots suggest that the linear model is appropriate for our production data: the residuals seem to be distributed randomly, having an approximately normal distribution with constant variance. None of the points has a Cook's distance greater than the cut-off point 4/18 = 0.22.
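The two cut-offs applied in this example, 4/n for leverage and 4/(n-2) for Cook's distance, can be computed with a few lines of code. The sketch below is in Python rather than R and uses hypothetical data with n = 20, not the production data, whose values are not reproduced here; the function name `influence_measures` and the artificial bad leverage point at index 19 are assumptions for illustration.

```python
import math

def influence_measures(x, y):
    """Return (leverages, standardized residuals, Cook's distances) for a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
    r = [e / (s * math.sqrt(1 - hi)) for e, hi in zip(resid, h)]
    d = [ri ** 2 / 2 * hi / (1 - hi) for ri, hi in zip(r, h)]
    return h, r, d

# Hypothetical data with n = 20: one distant x-value whose response is moved
# down by 10 units, so that it does not follow the pattern of the other points.
x = [1 + 0.5 * k for k in range(19)] + [25.0]
noise = [0.3 if k % 2 == 0 else -0.3 for k in range(20)]
y = [2 + 0.5 * xi + e for xi, e in zip(x, noise)]
y[19] -= 10

h, r, d = influence_measures(x, y)
n = len(x)
leverage_cutoff = 4 / n       # 4/20 = 0.2
cook_cutoff = 4 / (n - 2)     # 4/18, about 0.22

high_leverage = [i for i, hi in enumerate(h) if hi > leverage_cutoff]
influential = [i for i, di in enumerate(d) if di > cook_cutoff]
most_influential = max(range(n), key=lambda i: d[i])
print(high_leverage, influential, most_influential)
```

In contrast to the production data, where no point exceeds the Cook's distance cut-off, the artificial point in this sketch is flagged by both rules, which is what one would expect of a bad leverage point.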

Figure 15. Diagnostic plots

Conclusion

This master thesis deals with an important area of statistics, namely correlation and regression. These topics could also be very useful in teaching statistics at the university. We have chosen and explained some aspects of this topic.

First we discussed correlation as a measure of association between variables. We introduced Pearson's correlation coefficient as a measure of linear relationship and Spearman's correlation coefficient as a measure of monotonic relationship. Afterwards, we emphasized the importance of constructing the scatter plot of the observed variables to reveal the nature and strength of the correlation.

We discussed diagnostic plots as methods for checking the assumptions of simple linear regression. We presented an example based on Anscombe's four data sets which illustrates the point that looking only at the numerical regression output may lead to very misleading conclusions about the data and to accepting the wrong model.

We introduced regression models to describe the causal relationship between two variables. Further, we dealt especially with simple linear regression. For estimating the unknown parameters of linear regression we used the method of least squares. We proved the properties of unbiasedness and consistency of these estimators, as well as some other properties concerning the distribution and independence of the statistics we have used.

We talked about testing the validity of the model using Student's t-test and about the significance of the coefficient of determination. Regression models are primarily used for prediction, so we considered prediction intervals for values of the dependent variable.

Some of the examples given here are only illustrations of definitions and procedures; others, with larger sets of data, were solved using the statistical package R.

Apart from the references given here, I have used teaching materials from my Arabic courses at the University of Libya (Al Mergeb, Zliten).

For future research, it would be interesting to consider other types of regression, for example when the dependent variable is binary (logistic regression) or when we have more than one dependent variable (multivariate regression). Also, we could explore how regression is used to solve problems in biology, medicine and other areas.

References

Cohen, Y. and Cohen, J.Y. (2008). Statistics and Data with R: An Applied Approach Through Examples. Wiley, New York.

Jevremović, V. and Mališić, J. (2002). Statističke metode u meteorologiji i inženjerstvu. Savezni hidrometeorološki zavod, Beograd.

Larson, R.J. and Marx, M.L. (2005). An Introduction to Mathematical Statistics and Its Applications. Prentice Hall, New Jersey.

Lindley, D.V. (1987). Regression and correlation analysis. The New Palgrave: A Dictionary of Economics.

Milton, J.S., Corbet, J.J. and McTeer, M.X. (1997). An Introduction to Statistics. McGraw-Hill Higher Education, Burr Ridge.

Shanker, R.G. (2006). Numerical Analysis. New Age International Ltd., Publishers, New Delhi.

Sheather, S.J. (2009). A Modern Approach to Regression with R. Springer, New York.
