SIMPLE LINEAR REGRESSION and CORRELATION

Experimental Design and Statistical Methods Workshop
Jesús Piedrafita Arilla
jesus.piedrafita@uab.cat
Departament de Ciència Animal i dels Aliments

Items
- Correlation: degree of association
- Regression: prediction
- The model
- Assumptions
- Matrix notation
- Protocol of analysis
- Plotting data
- ANOVA in regression
- Confidence intervals
- Analysis of residuals
- Influential observations

Basic commands: cor.test, lm, anova, seq, influence.measures
Libraries: car (scatterplot)

Analysis of several variables

Two main interests:
1. Estimating the degree of association between two variables: CORRELATION analysis.
2. Predicting the values of one variable given that we know the realised value of another variable(s): REGRESSION analysis. This analysis can also be used to understand the relationship among variables.
   a) A response variable and an independent variable: simple (linear) regression.
   b) A response variable and two or more independent variables: multiple (linear) regression.
   c) When the relationship among variables is not linear: nonlinear regression.
   d) If the response variable is dichotomous (binary): logistic regression.

Data example

Suppose we have recorded the age (years) and blood pressure (mm Hg) of 20 people, obtaining the data presented in the table.

 Age   Blood pressure
  20        120
  43        128
  63        141
  26        126
  53        134
  31        128
  58        136
  46        132
  58        140
  70        144
  46        128
  53        136
  70        146
  20        124
  63        143
  43        130
  26        124
  19        121
  31        126
  23        123
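If the file bloodpress.csv is not at hand, the same data can be typed in directly. The following is a minimal sketch that assumes the values transcribed from the data table are correct; the object and column names (BLOODP, AGE, BLPRESS) follow the R sessions in these slides.

```r
# Blood pressure example entered by hand.
# Assumption: the values below are a faithful transcription of the data table.
BLOODP <- data.frame(
  AGE     = c(20, 43, 63, 26, 53, 31, 58, 46, 58, 70,
              46, 53, 70, 20, 63, 43, 26, 19, 31, 23),
  BLPRESS = c(120, 128, 141, 126, 134, 128, 136, 132, 140, 144,
              128, 136, 146, 124, 143, 130, 124, 121, 126, 123)
)

str(BLOODP)        # structure: 20 observations of 2 numeric variables
colMeans(BLOODP)   # column means, to compare with summary(BLOODP)

# Optional: save to disk so that read.csv("bloodpress.csv") works later
# write.csv(BLOODP, "bloodpress.csv", row.names = FALSE)
```

Typing the data into a data frame keeps every later example reproducible without depending on an external file.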

Simple statistics

> ## Importing data
> BLOODP <- read.csv("bloodpress.csv", header=TRUE)
> attach(BLOODP)
> options(na.action=na.exclude)
> summary(BLOODP)
      AGE           BLPRESS
 Min.   :19.0   Min.   :120.0
 1st Qu.:26.0   1st Qu.:125.5
 Median :44.5   Median :129.0
 Mean   :43.1   Mean   :131.5
 3rd Qu.:58.0   3rd Qu.:137.0
 Max.   :70.0   Max.   :146.0

To avoid problems in prediction when missing values are present, we must use options(na.action=na.exclude). With the current data set it would be unnecessary.

Plot of raw data

A plot for a pair of variables gives us a first impression about their relationship. It is also useful for detecting extreme values.

> plot(AGE, BLPRESS)
> library(car)
> scatterplot(AGE, BLPRESS)

[Figure: scatterplot of BLPRESS (vertical axis) against AGE (horizontal axis), with marginal boxplots]

The data are not obviously nonlinear and there is no evidence of non-normality (the boxplots are not asymmetrical). There is no evidence of extreme values.

Correlation (Pearson)

The correlation is a measure of the degree of association between two variables. It is calculated as

$$ r = \frac{\mathrm{cov}(x, y)}{s_x s_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} $$

r is an estimator of $\rho$, the population parameter.

The denominator is the geometric mean of the sample variance estimates. This makes r range from -1 to 1. The closer an estimate is to -1 or 1, the stronger the correlation.

> cor.test(BLPRESS, AGE)
        Pearson's product-moment correlation
data:  BLPRESS and AGE
t = 16.026, df = 18, p-value = 4.239e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9160 0.9870
sample estimates:
   cor
0.9667

$H_0: \rho = 0$ is rejected.
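The formula for r can be checked by hand. The sketch below assumes the blood pressure data are re-entered from the data table; the names r_sums, r_cov and t_stat are illustrative only. It computes r from sums of squares and cross-products, then from the covariance and standard deviations, and derives the t statistic that cor.test reports.

```r
# Assumption: data transcribed from the data table of the example.
AGE     <- c(20, 43, 63, 26, 53, 31, 58, 46, 58, 70,
             46, 53, 70, 20, 63, 43, 26, 19, 31, 23)
BLPRESS <- c(120, 128, 141, 126, 134, 128, 136, 132, 140, 144,
             128, 136, 146, 124, 143, 130, 124, 121, 126, 123)

dx <- AGE - mean(AGE)          # deviations of x from its mean
dy <- BLPRESS - mean(BLPRESS)  # deviations of y from its mean

# r from sums of cross-products and sums of squares
r_sums <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# r from the covariance and the two standard deviations
r_cov <- cov(AGE, BLPRESS) / (sd(AGE) * sd(BLPRESS))

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
n      <- length(AGE)
t_stat <- r_sums * sqrt((n - 2) / (1 - r_sums^2))
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

c(r_sums = r_sums, r_cov = r_cov, t = t_stat, p = p_val)
```

Both routes give the same r as cor(AGE, BLPRESS), because the (n - 1) divisors of the covariance and of the variances cancel.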

Correlation - sample size -

The sample size required to declare a particular correlation statistically different from 0 depends on the correlation coefficient itself. Fisher's classic z-transformation normalizes the distribution of the Pearson correlation coefficient:

$$ z' = 0.5 \ln\left(\frac{1 + r}{1 - r}\right) $$

$$ n = \left(\frac{z_{\alpha/2} + z_{\beta}}{z'_r - z'_{r_0}}\right)^2 + 3 $$

where $r_0 = 0$ and r is the magnitude of the coefficient we want to estimate.

Sample size for a power of
   r     80%     90%
  0.1    782    1044
  0.2    194     258
  0.3     85     113
  0.4     47      62
  0.5     29      38
  0.6     20      25
  0.7     14      17
  0.8     10      12
  0.9      7       8
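The sample size formula is easy to turn into a small function. This is only a sketch: the function name n_for_correlation and its arguments are hypothetical, a two-sided test is assumed, and the normal quantiles come from qnorm, so the results can differ by a unit or two from tabulated values that use rounded quantiles or a different rounding rule.

```r
# Hypothetical helper: approximate n needed to detect a correlation r
# (against r0) with a two-sided test, using Fisher's z-transformation.
n_for_correlation <- function(r, power = 0.80, alpha = 0.05, r0 = 0) {
  fisher_z <- function(rho) 0.5 * log((1 + rho) / (1 - rho))
  z_alpha  <- qnorm(1 - alpha / 2)   # two-sided critical value
  z_beta   <- qnorm(power)           # quantile for the desired power
  ((z_alpha + z_beta) / (fisher_z(r) - fisher_z(r0)))^2 + 3
}

# Unrounded sample sizes for a range of correlations
r_values <- seq(0.1, 0.9, by = 0.1)
data.frame(r         = r_values,
           power_80  = n_for_correlation(r_values, power = 0.80),
           power_90  = n_for_correlation(r_values, power = 0.90))
```

In practice the result would be rounded up to the next integer with ceiling().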

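Simple linear regression - the model -

$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$

- $y_i$: dependent variable
- $\beta_0$: intercept
- $\beta_1$: regression coefficient (slope), $\beta_1 = \tan\alpha$; it is the increase of the dependent variable when the independent variable increases by 1 unit
- $x_i$: independent variable
- $\varepsilon_i$: random error

To estimate $\beta_0$ and $\beta_1$ we resort to the Least Squares methodology, i.e., we minimize the sum of the squares of the deviations (red arrows in the figure) between the actual values (blue diamonds) and the predicted values (on the regression line).

$$ \hat{\beta}_1 = \frac{\mathrm{cov}(x, y)}{s_x^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$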
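The least squares estimators can be computed directly from the sample covariance, variance and means. The following sketch assumes the blood pressure data re-entered from the data table; b0 and b1 are illustrative names for the estimates of the intercept and slope.

```r
# Assumption: data transcribed from the data table of the example.
AGE     <- c(20, 43, 63, 26, 53, 31, 58, 46, 58, 70,
             46, 53, 70, 20, 63, 43, 26, 19, 31, 23)
BLPRESS <- c(120, 128, 141, 126, 134, 128, 136, 132, 140, 144,
             128, 136, 146, 124, 143, 130, 124, 121, 126, 123)

# Slope: covariance of x and y divided by the variance of x
b1 <- cov(AGE, BLPRESS) / var(AGE)

# Intercept: the fitted line passes through the point of means
b0 <- mean(BLPRESS) - b1 * mean(AGE)

# Sum of squared deviations minimized by least squares
sse <- sum((BLPRESS - (b0 + b1 * AGE))^2)

c(intercept = b0, slope = b1, SSE = sse)
```

Any other pair of values for the intercept and slope gives a larger sum of squared deviations, which is what "least squares" means.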

Assumptions in regression analysis

1. The variables x and y are linearly related (definition of the model).
2. Both variables are measured for each of n observations.
3. Variable x is measured without error (fixed).
4. Variable y is a set of random observations measured with error.
5. The errors are independent and normally distributed with homogeneous variance: $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$

Some of the above conditions can be seen in the figure. For each (fixed) value of x, there is a normal distribution of y (random), with its mean on the regression line.

[Figure: normal distributions of y centred on the regression line at $x_1, x_2, \ldots, x_n$]

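Matrix notation

$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} =
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} +
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{bmatrix} $$

As in ANOVA, we can minimize the sum of the squared errors and then we have the normal equations:

$$ \mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y} \quad\Rightarrow\quad \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} $$

Now X'X is not singular and the system can be solved without the need of a generalised inverse (or restrictions). Note that X is not a matrix of 0s and 1s, but contains the values of the independent variable x.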
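The normal equations can be solved numerically in a few lines. The sketch below assumes the blood pressure data re-entered from the data table; X and beta_hat are illustrative names. It builds the design matrix with a column of ones and the ages, and solves X'X b = X'y without forming the inverse explicitly.

```r
# Assumption: data transcribed from the data table of the example.
AGE     <- c(20, 43, 63, 26, 53, 31, 58, 46, 58, 70,
             46, 53, 70, 20, 63, 43, 26, 19, 31, 23)
BLPRESS <- c(120, 128, 141, 126, 134, 128, 136, 132, 140, 144,
             128, 136, 146, 124, 143, 130, 124, 121, 126, 123)

# Design matrix: a column of ones (intercept) and the values of x
X <- cbind(Intercept = 1, AGE = AGE)

XtX <- crossprod(X)           # X'X, a 2 x 2 non-singular matrix
Xty <- crossprod(X, BLPRESS)  # X'y

# Solve the normal equations X'X b = X'y
beta_hat <- solve(XtX, Xty)
beta_hat
```

Using solve(A, b) rather than solve(A) %*% b is the usual numerical practice; lm itself relies on a QR decomposition, but for this small, well-conditioned example the estimates agree.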

Simple linear regression - Protocol -

1. Decide which variable is to be y and which is to be x.
2. Plot the data, with y on the vertical axis.
3. Check the evenness of the x and y variables with a box-plot.
4. Transform x and/or y if not even.
5. Compute the regression; save residuals, fitted values and influence statistics. Calculate the Durbin-Watson statistic if the data are in a logical order.
6. Plot studentized or standardized residuals against fitted values (or the x variable). Examine the residual plots for outliers, consider rejecting outliers with studentized or standardized residuals > 3, and return to step 5.
7. Compare the influence statistics with their critical values:
   Leverage > $2p/n$
   |Dffits| > $2\sqrt{p/n}$
   Cook's D > $4/n$
   |Dfbetas| > $2/\sqrt{n}$
   where p = number of parameters in the model (number of $\beta$) and n = number of data points in the regression. If two or more influence statistics (among the first three) are greater than the critical values, consider rejecting points and return to step 5.
8. If outliers or leverage points are a problem, consider using a robust regression method.

(Adapted from Fry, 1993)

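Simple linear regression - Results (1) -

> BLOODP.REG <- lm(BLPRESS ~ AGE); summary(BLOODP.REG)
Residuals:
    Min      1Q  Median      3Q     Max
-4.7908 -1.2777  0.1688  1.8725  2.7816
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 112.31666    1.28744   87.24  < 2e-16 ***
AGE           0.44509    0.02777   16.03 4.24e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The rows of the coefficient table give $\hat{\beta}_0$ (Intercept) and $\hat{\beta}_1$ (AGE).

$H_0: \beta_1 = 0$ is rejected. For each increment of 1 year, blood pressure increases 0.445 mm Hg:

$$ \hat{y} = 112.317 + 0.445\,x $$

$$ t(\hat{\beta}_1) = \frac{\hat{\beta}_1}{s.e.(\hat{\beta}_1)} = \frac{0.44509}{0.02777} = 16.03 $$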

Simple linear regression - Results (2) -

Residual standard error: 2.12 on 18 degrees of freedom
Multiple R-squared: 0.9345, Adjusted R-squared: 0.9309
F-statistic: 256.8 on 1 and 18 DF, p-value: 4.239e-12

> anova(BLOODP.REG)
Analysis of Variance Table
Response: BLPRESS
          Df  Sum Sq Mean Sq F value    Pr(>F)
AGE        1 1154.12 1154.12  256.84 4.239e-12 ***
Residuals 18   80.88    4.49
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

$F = t^2$; $H_0$ ($\beta_1 = 0$) is rejected.

R-Squared is the square of the correlation coefficient. It represents the fraction of the total variation in blood pressure that is explained by the linear relationship with age. Adj R-Sq includes a correction to compensate for the increase in R-Squared with the number of regressors (k).

$$ R^2 = \frac{SS_{AGE}}{SS_{TOTAL}}, \qquad R^2_{adj} = 1 - \frac{(1 - R^2)(N - 1)}{N - k - 1} $$
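The quantities in this output can be recomputed from their definitions. The sketch below assumes the blood pressure data re-entered from the data table; fit, r2, r2_adj and the other names are illustrative. It checks R-squared, the adjusted R-squared with k = 1 regressor, and the identity F = t² for the slope.

```r
# Assumption: data transcribed from the data table of the example.
AGE     <- c(20, 43, 63, 26, 53, 31, 58, 46, 58, 70,
             46, 53, 70, 20, 63, 43, 26, 19, 31, 23)
BLPRESS <- c(120, 128, 141, 126, 134, 128, 136, 132, 140, 144,
             128, 136, 146, 124, 143, 130, 124, 121, 126, 123)

fit <- lm(BLPRESS ~ AGE)

ss_total <- sum((BLPRESS - mean(BLPRESS))^2)  # total sum of squares
ss_resid <- sum(resid(fit)^2)                 # residual sum of squares
ss_age   <- ss_total - ss_resid               # sum of squares due to AGE

N <- length(BLPRESS)  # number of observations
k <- 1                # number of regressors

r2     <- ss_age / ss_total
r2_adj <- 1 - (1 - r2) * (N - 1) / (N - k - 1)

# With a single regressor, the F statistic equals the squared t of the slope
t_slope <- coef(summary(fit))["AGE", "t value"]
f_stat  <- unname(summary(fit)$fstatistic["value"])

c(R2 = r2, R2_adj = r2_adj, t_squared = t_slope^2, F = f_stat)
```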

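ANOVA in regression

$$ (y_i - \bar{y}) = \underbrace{(y_i - \hat{y}_i)}_{\text{deviation to regression}} + \underbrace{(\hat{y}_i - \bar{y})}_{\text{due to regression}} $$

Squaring and summing on both sides of the equation we arrive at the following ANOVA table:

$$
\begin{array}{llllll}
\text{Source} & \text{d.f.} & \text{S.S.} & \text{M.S.} & E(\text{M.S.}) & F \\
\text{Due to regression} & 1 & \hat{\beta}_1 SP_{xy} & \hat{\beta}_1 SP_{xy} & \sigma_e^2 + \beta_1^2 SS_x & MS_{Reg} / MS_{Error} \\
\text{Deviations to regression} & n - 2 & SS_y - \hat{\beta}_1 SP_{xy} & (SS_y - \hat{\beta}_1 SP_{xy}) / (n - 2) & \sigma_e^2 &
\end{array}
$$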
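The entries of the ANOVA table can be built from three basic quantities: the sum of products of x and y and the two sums of squares. The sketch below assumes the blood pressure data re-entered from the data table; sp_xy, ss_x, ss_y and the remaining names are illustrative.

```r
# Assumption: data transcribed from the data table of the example.
AGE     <- c(20, 43, 63, 26, 53, 31, 58, 46, 58, 70,
             46, 53, 70, 20, 63, 43, 26, 19, 31, 23)
BLPRESS <- c(120, 128, 141, 126, 134, 128, 136, 132, 140, 144,
             128, 136, 146, 124, 143, 130, 124, 121, 126, 123)

n     <- length(AGE)
sp_xy <- sum((AGE - mean(AGE)) * (BLPRESS - mean(BLPRESS)))  # sum of products
ss_x  <- sum((AGE - mean(AGE))^2)                            # sum of squares of x
ss_y  <- sum((BLPRESS - mean(BLPRESS))^2)                    # sum of squares of y

b1 <- sp_xy / ss_x        # slope estimate

ss_reg <- b1 * sp_xy      # S.S. due to regression, 1 d.f.
ss_dev <- ss_y - ss_reg   # S.S. of deviations to regression, n - 2 d.f.
ms_reg <- ss_reg / 1
ms_dev <- ss_dev / (n - 2)
f_val  <- ms_reg / ms_dev

data.frame(Source = c("Due to regression", "Deviations to regression"),
           df     = c(1, n - 2),
           SS     = c(ss_reg, ss_dev),
           MS     = c(ms_reg, ms_dev),
           F      = c(f_val, NA))
```

The result should agree with anova(lm(BLPRESS ~ AGE)) up to rounding.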

Simple linear regression - Results (3) -

Confidence interval for $\beta_1$ (for $\beta_0$ it is similar):

$$ \left( \hat{\beta}_1 - t_{\alpha/2}\, s.e.(\hat{\beta}_1),\ \hat{\beta}_1 + t_{\alpha/2}\, s.e.(\hat{\beta}_1) \right) $$

This can be done easily with R (both for $\beta_0$ and $\beta_1$):

> confint(BLOODP.REG, level=0.95)
                2.5 %   97.5 %
(Intercept) 109.6119 115.0215
AGE           0.3867   0.5034
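The interval formula can be applied by hand to both coefficients. The sketch below assumes the blood pressure data re-entered from the data table; est, se, t_crit and ci are illustrative names. The critical value comes from the t distribution with the residual degrees of freedom.

```r
# Assumption: data transcribed from the data table of the example.
AGE     <- c(20, 43, 63, 26, 53, 31, 58, 46, 58, 70,
             46, 53, 70, 20, 63, 43, 26, 19, 31, 23)
BLPRESS <- c(120, 128, 141, 126, 134, 128, 136, 132, 140, 144,
             128, 136, 146, 124, 143, 130, 124, 121, 126, 123)

fit <- lm(BLPRESS ~ AGE)

coefs <- coef(summary(fit))      # matrix of estimates and standard errors
est   <- coefs[, "Estimate"]
se    <- coefs[, "Std. Error"]

alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df = df.residual(fit))  # t_{alpha/2} with n - 2 d.f.

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
ci
```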

Simple linear regression - Results (4) -

> data.frame(BLOODP, Predicted=fitted(BLOODP.REG), Residuals=resid(BLOODP.REG),
+   RIstudent=rstandard(BLOODP.REG), REstudent=rstudent(BLOODP.REG))
   AGE BLPRESS Predicted Residuals RIstudent REstudent
1   20     120  121.2184   -1.2184   -0.6204   -0.6095
2   43     128  131.4555   -3.4555   -1.6725   -1.7685
3   63     141  140.3573    0.6427    0.3228    0.3147
4   26     126  123.8890    2.1110    1.0498    1.0530
5   53     134  135.9064   -1.9064   -0.9310   -0.9273
6   31     128  126.1144    1.8856    0.9249    0.9210
7   58     136  138.1318   -2.1318   -1.0531   -1.0565
8   46     132  132.7908   -0.7908   -0.3830   -0.3738
9   58     140  138.1318    1.8682    0.9229    0.9189
10  70     144  143.4729    0.5271    0.2736    0.2665
11  46     128  132.7908   -4.7908   -2.3205   -2.6938
12  53     136  135.9064    0.0936    0.0457    0.0444
13  70     146  143.4729    2.5271    1.3119    1.3406
14  20     124  121.2184    2.7816    1.4163    1.4601
15  63     143  140.3573    2.6427    1.3274    1.3582
16  43     130  131.4555   -1.4555   -0.7045   -0.6942
17  26     124  123.8890    0.1110    0.0552    0.0537
18  19     121  120.7734    0.2266    0.1159    0.1127
19  31     126  126.1144   -0.1144   -0.0561   -0.0546
20  23     123  122.5537    0.4463    0.2243    0.2183

For the first observation, using the unrounded coefficients:

$$ \hat{y}_1 = \hat{\beta}_0 + \hat{\beta}_1 x_1 = 112.317 + 0.445 \times 20 \approx 121.218; \qquad r_1 = y_1 - \hat{y}_1 = 120 - 121.218 = -1.218 $$

RIstudent is an internally studentized residual, i.e., the residual divided by its own standard error (which is not uniform across observations). REstudent is an externally studentized residual. Only observation 11 is a weak outlier.

Some statistics useful for regression analysis

Internally studentized residual

$$ rs_i = \frac{r_i}{\sqrt{MSE\,(1 - h_{ii})}} \sim t_{N-k-1} $$

Weak outlier: $|rs_i| > 2$ (approximately 95% confidence). Strong outlier: $|rs_i| > 3$ (approximately 99% confidence).

Externally studentized residual, $rs_{(-i)}$
Calculated as the previous one, but removing the i-th observation to calculate s. Under $H_0$, it follows a t distribution with N-k-2 df.

Leverage ($h_{ii}$)
Standardized value of how much an observation deviates from the centre of the space of x values. Observations with high leverage can indicate an outlier in x and are potentially influential. Computed as the diagonal elements of $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$.

DFFITS

$$ DFFITS_i = \frac{\hat{y}_i - \hat{y}_{i,(-i)}}{s_{(-i)}\sqrt{h_{ii}}} = Rstudent_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}} $$

DFBETAS

$$ DFBETAS_{j,i} = \frac{\hat{\beta}_j - \hat{\beta}_{j,(-i)}}{s_{(-i)}\sqrt{c_{jj}}} $$

where $c_{jj}$ are the diagonal elements of $(\mathbf{X}'\mathbf{X})^{-1}$. Analyse only the DFBETAS corresponding to high values of DFFITS.

Cook's D
Essentially a DFFITS statistic scaled and squared to make extreme values stand out more clearly.
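These statistics can be computed from their formulas and compared with R's built-in functions. The sketch below assumes the blood pressure data re-entered from the data table; all object names are illustrative. Here p is the number of model coefficients including the intercept, so p = k + 1 in the notation of the slide. The Cook's distance line uses the usual closed form in terms of the internally studentized residual, which is not spelled out on the slide.

```r
# Assumption: data transcribed from the data table of the example.
AGE     <- c(20, 43, 63, 26, 53, 31, 58, 46, 58, 70,
             46, 53, 70, 20, 63, 43, 26, 19, 31, 23)
BLPRESS <- c(120, 128, 141, 126, 134, 128, 136, 132, 140, 144,
             128, 136, 146, 124, 143, 130, 124, 121, 126, 123)

fit <- lm(BLPRESS ~ AGE)
X   <- model.matrix(fit)     # design matrix (column of ones and AGE)
n   <- nrow(X)
p   <- ncol(X)               # coefficients incl. intercept (p = k + 1)
e   <- resid(fit)
mse <- sum(e^2) / (n - p)

# Leverage: diagonal of the hat matrix X (X'X)^-1 X'
h <- diag(X %*% solve(crossprod(X)) %*% t(X))

# Internally studentized residuals
rs_int <- e / sqrt(mse * (1 - h))

# Externally studentized residuals, via the leave-one-out identity
rs_ext <- rs_int * sqrt((n - p - 1) / (n - p - rs_int^2))

# DFFITS and Cook's D
dffits_manual <- rs_ext * sqrt(h / (1 - h))
cooks_manual  <- rs_int^2 * h / (p * (1 - h))

round(cbind(h, rs_int, rs_ext, dffits_manual, cooks_manual), 4)
```

The externally studentized residuals are obtained here without refitting the model n times, which is also how rstudent() works internally.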

Simple linear regression - Results (5) -

> influence.measures(BLOODP.REG)
   dfb.1_ dfb.AGE   dffit cov.r   cook.d    hat inf
1  -0.239   0.199 -0.2475 1.251 3.17e-02 0.1416
2  -0.152   0.002 -0.4057 0.842 7.36e-02 0.0500
3  -0.054   0.087  0.1151 1.256 6.97e-03 0.1180
4   0.323  -0.249  0.3514 1.098 6.14e-02 0.1002
5   0.037  -0.125 -0.2482 1.088 3.10e-02 0.0668
6   0.220  -0.152  0.2625 1.100 3.47e-02 0.0751
7   0.110  -0.216 -0.3284 1.083 5.36e-02 0.0881
8  -0.018  -0.015 -0.0870 1.163 3.98e-03 0.0514
9  -0.095   0.188  0.2856 1.116 4.11e-02 0.0881
10 -0.072   0.103  0.1224 1.346 7.90e-03 0.1742   *
11 -0.130  -0.105 -0.6273 0.581 1.46e-01 0.0514   *
12 -0.002   0.006  0.0119 1.201 7.48e-05 0.0668
13 -0.362   0.520  0.6157 1.110 1.82e-01 0.1742
14  0.573  -0.477  0.5930 1.031 1.65e-01 0.1416
15 -0.232   0.377  0.4967 1.034 1.18e-01 0.1180
16 -0.060   0.001 -0.1593 1.116 1.31e-02 0.0500
17  0.016  -0.013  0.0179 1.246 1.70e-04 0.1002
18  0.046  -0.039  0.0473 1.317 1.18e-03 0.1497
19 -0.013   0.009 -0.0155 1.212 1.28e-04 0.0751
20  0.076  -0.061  0.0804 1.266 3.41e-03 0.1193

All values of DFFIT are below the critical value 0.63 ($= 2\sqrt{2/20}$), i.e., there are no influential observations on the predicted values. DFBETAS (dfb.) test the influence on the parameter estimates, and do not need to be examined because the DFFIT values are low. Cook's D values are below the critical value 0.2 (= 4/20). Leverage is presented in hat. All values are lower than 0.2 (= 2p/n), the critical value.

Criteria to flag an observation as influential in R

In the analysis protocol we presented some critical points to decide whether an observation can be influential or not. These critical points are not statistical tests but rules of thumb. Furthermore, there is no agreement among statisticians on the values. In fact, R puts a flag (a star) on an observation when:
- any of its absolute dfbetas values is greater than 1, or
- its absolute dffits value is greater than $3\sqrt{p/(n-p)}$, or
- abs(1 - covratio) is greater than $3p/(n-p)$, or
- its Cook's distance is greater than the 50% percentile of an F-distribution with p and n-p degrees of freedom, or
- its hat value is greater than $3p/n$

where p denotes the number of model coefficients, including the intercept.
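The flagging rules listed above can be applied by hand and compared with what influence.measures reports. The sketch below assumes the blood pressure data re-entered from the data table; the flag_* names are illustrative. Each rule produces a logical vector, and an observation is flagged if any rule is TRUE.

```r
# Assumption: data transcribed from the data table of the example.
AGE     <- c(20, 43, 63, 26, 53, 31, 58, 46, 58, 70,
             46, 53, 70, 20, 63, 43, 26, 19, 31, 23)
BLPRESS <- c(120, 128, 141, 126, 134, 128, 136, 132, 140, 144,
             128, 136, 146, 124, 143, 130, 124, 121, 126, 123)

fit <- lm(BLPRESS ~ AGE)
n   <- length(AGE)
p   <- length(coef(fit))   # model coefficients, including the intercept

# One logical vector per rule
flag_dfbetas <- apply(abs(dfbetas(fit)) > 1, 1, any)
flag_dffits  <- abs(dffits(fit)) > 3 * sqrt(p / (n - p))
flag_covr    <- abs(1 - covratio(fit)) > 3 * p / (n - p)
flag_cook    <- pf(cooks.distance(fit), p, n - p) > 0.5
flag_hat     <- hatvalues(fit) > 3 * p / n

flagged_manual <- which(flag_dfbetas | flag_dffits | flag_covr |
                        flag_cook | flag_hat)

# Observations starred by R itself
flagged_r <- which(apply(influence.measures(fit)$is.inf, 1, any))

unname(flagged_manual)
unname(flagged_r)
```

In this example the only rule that triggers is the one on the covariance ratio, which is why observations can be starred even though their DFFITS, Cook's D and leverage are all below the rule-of-thumb values of the protocol.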

Some graphics about influential observations

[Figure: four scatterplots of y against x with fitted lines, one per case: high leverage, influential; high leverage, not influential; low leverage, influential; low leverage, not influential]

Simple linear regression - diagnostics -

> layout(matrix(c(1,2,3,4),2,2)) # optional: 4 graphs/page
> plot(BLOODP.REG)

[Figure: the four diagnostic plots produced by plot(BLOODP.REG): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours)]

Residuals are distributed approximately at random: homogeneity of variance is met. There are no important deviations in the Q-Q plot: the normality assumption is acceptable. None of the points approach the high Cook's distance contour(s): none of the observations are influential.

Some plots of residuals

[Figure: three plots of residuals e against fitted values ŷ: ideal residual plot (random distribution around 0); curved pattern (the model should involve curvature); funnel-shaped pattern (heterogeneous variance)]

Simple linear regression - Regression line and CL -

[Figure: observations of blood pressure (mm Hg) against age (years), with the regression line Blood pressure = 0.44*Age + 112.3 ($R^2 = 0.935$) and its 95% upper and lower confidence limits]

$$ s_{\hat{y}_i} = \sqrt{MS_{RES}\left(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_{xx}}\right)} $$

Note that the farther x is from its mean, the greater the standard error of the predicted values, and thus the wider the confidence limits (CL). This is hardly distinguishable when the prediction is made within the interval of the observed x values.
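The standard error formula can be evaluated at a few ages to see how it grows away from the mean age. The sketch below assumes the blood pressure data re-entered from the data table; x_new and the other names are illustrative, and the three ages are chosen only as examples (the youngest region, the mean age, and the oldest age).

```r
# Assumption: data transcribed from the data table of the example.
AGE     <- c(20, 43, 63, 26, 53, 31, 58, 46, 58, 70,
             46, 53, 70, 20, 63, 43, 26, 19, 31, 23)
BLPRESS <- c(120, 128, 141, 126, 134, 128, 136, 132, 140, 144,
             128, 136, 146, 124, 143, 130, 124, 121, 126, 123)

fit <- lm(BLPRESS ~ AGE)

n      <- length(AGE)
ms_res <- sum(resid(fit)^2) / df.residual(fit)  # residual mean square
ss_xx  <- sum((AGE - mean(AGE))^2)              # sum of squares of x

# Example ages: low, at the mean, and high (illustrative choice)
x_new <- c(20, mean(AGE), 70)

# Standard error of the predicted mean at each age, from the formula
se_manual <- sqrt(ms_res * (1 / n + (x_new - mean(AGE))^2 / ss_xx))

# The same quantity, and the 95% confidence limits, from predict()
pred <- predict(fit, data.frame(AGE = x_new), se.fit = TRUE,
                interval = "confidence", level = 0.95)

cbind(AGE = x_new, se_manual, se_predict = pred$se.fit, pred$fit)
```

The standard error is smallest at the mean age and increases symmetrically on both sides of it.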

Simple linear regression - program of the graphic -

> ## Summary scatterplot
> # create a plot with solid dots (pch=16) and no axes or labels
> plot(BLPRESS~AGE, pch=16, axes=FALSE, xlab="", ylab="")
> # put the x-axis (axis 1) with smaller label font size
> axis(1, cex.axis=.8)
> # put the x-axis label 3 lines down from the axis
> mtext(text="Age (years)", side=1, line=3)
> # put the y-axis (axis 2) with horizontal tick labels
> axis(2, las=1)
> # put the y-axis label 3 lines to the left of the axis
> mtext(text="Blood pressure (mm Hg)", side=2, line=3)
> # add the regression line from the fitted model
> abline(BLOODP.REG)
> # add the regression formula
> text(50, 145, "Blood pressure = 0.44*Age+112.3", pos=2)
> # add the r squared value
> text(50, 143, expression(paste(R^2==0.935)), pos=2)
> # create a sequence of 1000 numbers spanning the range of ages
> x <- seq(min(AGE), max(AGE), length.out=1000)
> # for each value of x, calculate the upper and lower 95% confidence limits
> y <- predict(BLOODP.REG, data.frame(AGE=x), interval="confidence")
> # plot the fitted line and the upper and lower 95% confidence limits
> matlines(x, y, lty=2, col=1)
> # put an L-shaped box to complete the axes
> box(bty="l")

References

Fry J.C. 1993. Biological Data Analysis. IRL Press, Oxford.