Biostatistics. Chapter 11 Simple Linear Correlation and Regression. Jing Li

Biostatistics Chapter 11 Simple Linear Correlation and Regression
Jing Li jng.l@sjtu.edu.cn http://cbb.sjtu.edu.cn/~jngl/courses/2018fall/b372/
Dept of Bioinformatics & Biostatistics, SJTU

Recall eat chocolate

Cell 175, 347–359, October 4, 2018

Cell 175, 347–359, October 4, 2018
Comparison of the frequency of the nonreference allele between the NIPT estimations (CMDB) and the Han Chinese estimations in the 1000 Genomes Project (CHN)

Covariance
Variance: $\sigma^2 = \mathrm{Var}(x) = E(x - \mu)^2$
Covariance: $\mathrm{cov}(x, y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n - 1}$
Covariance is a measure of how much two random variables change together.
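As a minimal sketch of the sample covariance on this slide, the Python below computes cov(x, y) and Var(x) with the n - 1 denominator. The function name `sample_covariance` is illustrative only, and the data are the six (x, y) points from this chapter's least squares example.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equal-length samples (n - 1 denominator)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products of deviations from the means, divided by n - 1
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Six data points taken from the least squares example in this chapter
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]

cov_xy = sample_covariance(x, y)
var_x = sample_covariance(x, x)  # the variance is the covariance of x with itself
print(round(cov_xy, 3), round(var_x, 3))  # prints 7.4 3.5
```

Computing the variance as `sample_covariance(x, x)` is a design choice that makes the link between the two definitions explicit.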

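Interpreting Covariance
cov(x,y) > 0: X and Y are positively correlated
cov(x,y) < 0: X and Y are inversely correlated
cov(x,y) = 0: X and Y are uncorrelated (independent variables have zero covariance, but zero covariance alone does not imply independence)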

Correlation coefficient
Pearson's correlation coefficient is standardized covariance (unitless):
$r = \dfrac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}$
Karl Pearson, 1857–1936

Correlation
Measures the relative strength of the linear relationship between two variables
Ranges between -1 and 1
The closer to -1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker any linear relationship

Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots of Y against X, with panels r = -1, r = -.6, r = 0, r = +1, r = +.3, r = 0]
Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall

Linear Correlation
[Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships]
Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall

Linear Correlation
[Figure: scatter plots of Y against X contrasting strong relationships with weak relationships]

Linear Correlation
[Figure: scatter plots of Y against X showing no relationship]
Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall

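Calculating by hand
$\hat{r} = \dfrac{\mathrm{covariance}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} = \dfrac{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}}{\sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \cdot \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}}$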

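Simpler calculation formula
The n - 1 factors cancel, leaving only the numerator of the covariance ($SS_{xy}$) and the numerators of the variances ($SS_x$, $SS_y$):
$\hat{r} = \dfrac{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}}{\sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \cdot \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \dfrac{SS_{xy}}{\sqrt{SS_x\,SS_y}}$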
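The sums-of-squares form of the correlation coefficient can be sketched in a few lines of Python. This is an illustration rather than a reference implementation: the name `pearson_r` is made up for the example, and the data are the six (x, y) points from this chapter's least squares example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation computed as SS_xy / sqrt(SS_x * SS_y)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return ss_xy / math.sqrt(ss_x * ss_y)


x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
r = pearson_r(x, y)
print(round(r, 2))  # prints 0.7
```

Here SS_xy = 37, SS_x = 17.5 and SS_y is about 159.33, so r is about 0.70.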

Correlation Analysis
-1 ≤ r ≤ 1
If the correlation coefficient is close to +1, you have a strong positive linear relationship.
If the correlation coefficient is close to -1, you have a strong negative linear relationship.
If the correlation coefficient is close to 0, you have little or no linear correlation.
We can test the hypothesis H0: ρ = 0 (the population correlation is zero).

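Distribution of the correlation coefficient
The standard error of the sample correlation coefficient is
$SE(\hat{r}) = \sqrt{\dfrac{1 - r^2}{n - 2}}$
Under the null hypothesis of no correlation, the sample correlation coefficient divided by its standard error follows a t distribution with n - 2 degrees of freedom (since you have to estimate the standard error):
$t = \dfrac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$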
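A rough sketch of this test in Python, using only the standard library, is shown below. The helper names are hypothetical, the data are the six points from this chapter's least squares example, and the critical value 2.776 (two-sided, alpha = 0.05, 4 degrees of freedom) is taken from a standard t table rather than computed.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation via sums of squares."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_x * ss_y)


def correlation_t_statistic(r, n):
    """Return (t, standard error, degrees of freedom) for H0: no correlation."""
    if n <= 2:
        raise ValueError("need n > 2 so that n - 2 degrees of freedom is positive")
    if not -1 < r < 1:
        raise ValueError("t is undefined when |r| = 1")
    se = math.sqrt((1 - r ** 2) / (n - 2))
    return r / se, se, n - 2


x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
r = pearson_r(x, y)
t_stat, se, df = correlation_t_statistic(r, len(x))

T_CRIT = 2.776  # assumption: table value for df = 4, two-sided alpha = 0.05
reject_h0 = abs(t_stat) > T_CRIT
print(round(t_stat, 2), df, reject_h0)  # prints 1.96 4 False
```

With only six points, a correlation of about 0.70 does not reach significance at the 5% level, which shows how strongly the test depends on n.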

History
Galton's Sweet Pea Data: in Natural Inheritance, Galton (1894) provided a table containing a list of frequencies of daughter seeds of various sizes, organized in rows according to the size of their parent seeds.
In 1896, Pearson published his first rigorous treatment of correlation and regression.
A simpler proof than Pearson's for the product-moment method was proposed by Ghiselli (1981).

Linear Regression
Can we predict Nobel laureates per 10 million population using chocolate consumption?
Chocolate ~ Nobel laureates
Simple Linear Regression

Linear Regression
Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
Dependent variable: denoted Y
Independent variables: denoted X_1, X_2, ..., X_k
If we only have ONE independent variable, the model is
$y = \beta_0 + \beta_1 x + \varepsilon$
which is referred to as simple linear regression. We would be interested in estimating β0 and β1 from the data we collect.

Linear Regression
Variables:
X = Independent Variable (we provide this)
Y = Dependent Variable (we observe this)
Parameters:
β0 = Y-Intercept
β1 = Slope
ε ~ Normal Random Variable (μ_ε = 0, σ_ε = ???) [Noise]

The Intercept, β0

The Slope, β1

Building the Model
Collect Data
Test 2 Grade = β0 + β1 * (Test 1 Grade)
From Data: Estimate β0, Estimate β1, Estimate σ_ε

Student  Test 1  Test 2
1        50      32
2        51      33
3        52      34
4        53      35
5        54      36
6        55      37
7        56      39
8        57      40
9        58      41
10       59      42
11       60      43
12       61      44
13       62      46
14       63      47
15       64      48
16       65      49
17       66      50
18       67      51
19       68      53
20       69      54
21       70      55
22       71      56
23       72      57
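To make the three estimates concrete, here is a hedged sketch that fits the Test 1 / Test 2 table by ordinary least squares. The slide does not say how the estimates are obtained, so the method is an assumption, as is estimating σ_ε by sqrt(SSE / (n - 2)); the function name `fit_simple_linear` is illustrative.

```python
import math

# Data from the table: Test 1 grades 50..72 and the matching Test 2 grades
test1 = list(range(50, 73))
test2 = [32, 33, 34, 35, 36, 37, 39, 40, 41, 42, 43, 44,
         46, 47, 48, 49, 50, 51, 53, 54, 55, 56, 57]


def fit_simple_linear(xs, ys):
    """Least squares estimates (b0, b1) and the residual standard deviation."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need equal-length samples with n >= 3")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_x              # estimated slope
    b0 = y_bar - b1 * x_bar        # estimated intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma_hat = math.sqrt(sse / (n - 2))  # assumed estimator of sigma_epsilon
    return b0, b1, sigma_hat


b0, b1, sigma_hat = fit_simple_linear(test1, test2)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, sigma_hat = {sigma_hat:.3f}")
```

For this table the slope comes out near 1.16 and the intercept near -26.33, so each extra point on Test 1 goes with a little more than one extra point on Test 2.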

Linear Regression Analysis
[Figure: three "Plot of Fitted Model" panels, one of Test 2 against Test 1 and two of Test B2 against Test B1]

Which line has the best fit to the data?

Estimating the Coefficients
In much the same way we base estimates of μ on x̄, we estimate β0 with b0 and β1 with b1, the y-intercept and slope (respectively) of the least squares or regression line given by:
$\hat{y} = b_0 + b_1 x$
(This is an application of the least squares method, and it produces a straight line that minimizes the sum of the squared differences between the points and the line.)

Least Squares Line
The differences between the points and the line are called residuals or errors.
This line minimizes the sum of the squared differences between the points and the line. But where did the line equation come from? How did we get 0.933 for the y-intercept and 2.114 for the slope?

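Least Squares Line [sure glad we have computers now!]
The coefficients b1 and b0 for the least squares line are calculated as:
$b_1 = \dfrac{s_{xy}}{s_x^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$
They minimize the sum of squared errors:
$SSE = \sum (Y - \hat{Y})^2 = \sum (Y - b_0 - b_1 X)^2$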
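The step from "minimize SSE" to the coefficient formulas is not shown on the slide. One standard way to fill it in, offered here as a sketch rather than as the derivation used in the lecture, is to set both partial derivatives of SSE to zero:

```latex
\begin{aligned}
SSE(b_0, b_1) &= \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\
\frac{\partial SSE}{\partial b_0} &= -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0
  \;\Longrightarrow\; b_0 = \bar{y} - b_1 \bar{x} \\
\frac{\partial SSE}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0 \\
\text{substituting } b_0: \quad
  & \sum_{i=1}^{n} x_i \bigl( (y_i - \bar{y}) - b_1 (x_i - \bar{x}) \bigr) = 0 \\
b_1 &= \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}
     = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
     = \frac{SS_{xy}}{SS_x}
\end{aligned}
```

The last equality for b1 uses the fact that deviations from a mean sum to zero, so replacing x_i by (x_i - x̄) changes neither sum. Dividing numerator and denominator by n - 1 gives the equivalent form of sample covariance over sample variance of x.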

Least Squares Line
See if you can estimate the y-intercept and slope from this data.
Data points:
x  y
1  6
2  1
3  9
4  5
5  17
6  12
Fitted line: ŷ = 0.933 + 2.114x

Least Squares Line
See if you can estimate the y-intercept and slope from this data.

X    Y    X-Xbar   Y-Ybar   (X-Xbar)*(Y-Ybar)   (X-Xbar)^2
1    6    -2.500   -2.333    5.833              6.250
2    1    -1.500   -7.333   11.000              2.250
3    9    -0.500    0.667   -0.333              0.250
4    5     0.500   -3.333   -1.667              0.250
5    17    1.500    8.667   13.000              2.250
6    12    2.500    3.667    9.167              6.250
Sum  21 50 0.000    0.000   37.000             17.500

Xbar = 3.500, Ybar = 8.333
s_xy = 7.400 = 37.00/(6-1)
s_x^2 = 3.500 = 17.5/(6-1)
b1 = 2.114 = 7.4/3.5
b0 = 0.933 = 8.33 - 2.114*3.50
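The hand calculation in the table can be checked with a short Python sketch. The variable names mirror the table's quantities (s_xy, s_x^2, b1, b0); the code is only an illustration of the same arithmetic.

```python
# Data points from the slide
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]

n = len(x)
x_bar = sum(x) / n   # 3.5
y_bar = sum(y) / n   # 8.333...

# Column sums from the table
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 37.0
ss_x = sum((xi - x_bar) ** 2 for xi in x)                         # 17.5

s_xy = ss_xy / (n - 1)   # sample covariance
s_x2 = ss_x / (n - 1)    # sample variance of x

b1 = s_xy / s_x2         # slope
b0 = y_bar - b1 * x_bar  # y-intercept

print(f"b1 = {b1:.3f}, b0 = {b0:.3f}")  # prints b1 = 2.114, b0 = 0.933
```

Carrying full precision through the calculation gives b0 = 0.933; rounding ȳ and b1 before the last step is what produces slightly different third decimals by hand.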

Example: Arm Circumference and Height

Arm Circumference and Height: t-test, ANOVA

Visualizing the Arm Circumference and Height Relationship

Scatterplot with regression line

Example: Arm Circumference and Height
Estimated mean arm circumference for children 60 cm in height

Example: Arm Circumference and Height
Estimated mean arm circumference for children 60 cm in height
Notice that most points don't fall directly on the line: we are estimating the mean arm circumference of children 60 cm tall, and the observed points vary about the estimated mean.

Linear regression assumes that:
The relationship between X and Y is linear
Y is distributed normally at each value of X
The variance of Y at every value of X is the same (homogeneity of variances)
The observations are independent
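These assumptions are usually examined through the residuals of the fitted line. The sketch below computes residuals for the six-point example from this chapter. It is an illustration only: six points are far too few to judge normality or equal variance, and the function name `residuals_of_fit` is hypothetical.

```python
def residuals_of_fit(xs, ys):
    """Fit y = b0 + b1*x by least squares and return (b0, b1, residuals)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    # Residual = observed y minus the value predicted by the fitted line
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return b0, b1, residuals


x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
b0, b1, residuals = residuals_of_fit(x, y)

for xi, ri in zip(x, residuals):
    print(f"x = {xi}: residual = {ri:+.3f}")

# By construction, least squares residuals sum to zero (up to rounding error)
print(abs(sum(residuals)) < 1e-9)  # prints True
```

In practice one would plot the residuals against X: a curved pattern suggests a nonlinear relationship, and a fan shape suggests unequal variances.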