Linear correlation and linear regression

Chad Peters
5 years ago
1 Linear correlation and linear regression

2 Continuous outcome (means)
Outcome variable: continuous (e.g. pain scale, cognitive function)
Are the observations independent or correlated?
Independent:
  T-test: compares means between two independent groups
  ANOVA: compares means between more than two independent groups
  Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
  Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes
Correlated:
  Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
  Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
  Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time
Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
  Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
  Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
  Kruskal-Wallis test: non-parametric alternative to ANOVA
  Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient

3 Recall: Covariance $\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n - 1}$
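As a rough illustration of the covariance formula, here is a minimal Python sketch. The function name `sample_covariance` and the five data points are invented for this example and are not from the slides.

```python
def sample_covariance(xs, ys):
    """Sample covariance using the n - 1 denominator from the formula above."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products of deviations from the two means.
    cross_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross_products / (n - 1)


# Hypothetical toy data (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
cov_xy = sample_covariance(xs, ys)
print(cov_xy)  # positive, so x and y tend to move together
```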

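4 Interpreting Covariance
cov(x,y) > 0: X and Y are positively correlated
cov(x,y) < 0: X and Y are inversely correlated
cov(x,y) = 0: X and Y are uncorrelated (independent variables have zero covariance, but zero covariance alone does not guarantee independence)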

5 Correlation coefficient Pearson's Correlation Coefficient is standardized covariance (unitless): $r = \frac{\mathrm{covariance}(x, y)}{\sqrt{\mathrm{var}\,x}\,\sqrt{\mathrm{var}\,y}}$

6 Correlation
Measures the relative strength of the linear relationship between two variables
Unit-less
Ranges between -1 and 1
The closer to -1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship

7 Scatter Plots of Data with Various Correlation Coefficients [Figure: six scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = +1, r = +.3, and r = 0.] Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004 Prentice-Hall

8 Linear Correlation [Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships.] Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004 Prentice-Hall

9 Linear Correlation [Figure: scatter plots of Y against X contrasting strong relationships with weak relationships.] Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004 Prentice-Hall

10 Linear Correlation [Figure: scatter plots of Y against X showing no relationship.] Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004 Prentice-Hall

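11 Calculating by hand $\hat{r} = \frac{\mathrm{covariance}(x, y)}{\sqrt{\mathrm{var}\,x}\,\sqrt{\mathrm{var}\,y}} = \frac{\frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}}{\sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}\,\sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}}}$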

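12 Simpler calculation formula
The $n-1$ terms cancel:
$\hat{r} = \frac{\frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}}{\sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}\,\sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{SS_{xy}}{\sqrt{SS_x\,SS_y}}$
$SS_{xy}$ is the numerator of the covariance; $SS_x$ and $SS_y$ are the numerators of the variances.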
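The shortcut formula can be sketched in a few lines of Python. The helper names `sums_of_squares` and `pearson_r` and the toy data are assumptions made for illustration only.

```python
import math


def sums_of_squares(xs, ys):
    """Return (SS_xy, SS_x, SS_y): the numerators of the covariance and variances."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy, ss_x, ss_y


def pearson_r(xs, ys):
    """Pearson correlation as SS_xy / sqrt(SS_x * SS_y); the n - 1 terms cancel."""
    ss_xy, ss_x, ss_y = sums_of_squares(xs, ys)
    return ss_xy / math.sqrt(ss_x * ss_y)


# Hypothetical toy data (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(xs, ys)
print(round(r, 3))
```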

13 Distribution of the correlation coefficient: $SE(\hat{r}) = \sqrt{\frac{1 - r^2}{n - 2}}$ The sample correlation coefficient follows a t-distribution with n-2 degrees of freedom (since you have to estimate the standard error). *Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself; substitute in the estimated r.
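A small sketch of how the standard error turns into a test statistic, assuming the usual t statistic $\hat{r} / SE(\hat{r})$ for the null hypothesis of zero correlation. The inputs r = 0.49 and n = 100 are borrowed from the deck's dataset 4 example; the function names are hypothetical.

```python
import math


def correlation_standard_error(r, n):
    """SE(r) = sqrt((1 - r^2) / (n - 2)), plugging in the estimated r."""
    return math.sqrt((1 - r ** 2) / (n - 2))


def correlation_t_statistic(r, n):
    """t statistic with n - 2 degrees of freedom for H0: correlation = 0."""
    return r / correlation_standard_error(r, n)


r_hat, n = 0.49, 100  # values taken from the dataset 4 example in this deck
se = correlation_standard_error(r_hat, n)
t = correlation_t_statistic(r_hat, n)
print(round(se, 3), round(t, 2))
```

Turning the statistic into a p-value needs the t-distribution's CDF, which the Python standard library does not provide, so that step is left out of the sketch.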

15 Linear regression In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (= predictor) variable (X) and the other the dependent (= outcome) variable Y.

16 What is "Linear"? Remember this: Y = mX + B? (m is the slope; B is the intercept.)

17 What's Slope? A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

18 Prediction If you know something about X, this knowledge helps you predict something about Y. (Sound familiar? Sound like conditional probabilities?)

19 Regression equation Expected value of y at a given level of x: $E(y_i \mid x_i) = \alpha + \beta x_i$

20 Predicted value for an individual: $y_i = \alpha + \beta x_i + \text{random error}_i$. The term $\alpha + \beta x_i$ is fixed (exactly on the line); the random error follows a normal distribution.

21 Assumptions (or the fine print) Linear regression assumes that:
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same (homogeneity of variances)
4. The observations are independent

22 The standard error of Y given X ($S_{y/x}$) is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.

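23 Regression Picture [Figure: an observation $y_i$, the fitted line $\hat{y}_i = \beta x_i + \alpha$, and the horizontal line at $\bar{y}$. A is the distance from $y_i$ to $\bar{y}$, B the distance from $\hat{y}_i$ to $\bar{y}$, and C the distance from $y_i$ to $\hat{y}_i$.]
$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(\hat{y}_i - y_i)^2$, that is, $A^2 = B^2 + C^2$
$SS_{total}$ ($A^2$): total squared distance of observations from the naïve mean of y; total variation
$SS_{reg}$ ($B^2$): distance from the regression line to the naïve mean of y; variability due to x (regression)
$SS_{residual}$ ($C^2$): variance around the regression line; additional variability not explained by x; what the least squares method aims to minimize
*Least squares estimation gave us the line (β) that minimized $C^2$
$R^2 = SS_{reg}/SS_{total}$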
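The decomposition can be checked numerically. The sketch below fits a least-squares line to a small invented dataset (using the slope = SS_xy / SS_x formula that the deck derives a few slides later) and computes the three sums of squares. The function name and data are hypothetical.

```python
def sums_of_squares_decomposition(xs, ys):
    """Fit a least-squares line and return (SS_total, SS_reg, SS_residual)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
    alpha = y_bar - beta * x_bar
    fitted = [alpha + beta * x for x in xs]
    ss_total = sum((y - y_bar) ** 2 for y in ys)                  # A^2
    ss_reg = sum((f - y_bar) ** 2 for f in fitted)                # B^2
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # C^2
    return ss_total, ss_reg, ss_residual


# Hypothetical toy data (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
ss_total, ss_reg, ss_residual = sums_of_squares_decomposition(xs, ys)
r_squared = ss_reg / ss_total
print(ss_total, ss_reg, ss_residual, r_squared)
```

For a single predictor, the resulting R^2 equals the square of the Pearson correlation coefficient.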

24 Recall example: cognitive function and vitamin D Hypothetical data loosely based on [1]; cross-sectional study of 100 middle-aged and older European men. Cognitive function is measured by the Digit Symbol Substitution Test (DSST). 1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry Jul;80(7):722-9.

25 Distribution of vitamin D Mean = 63 nmol/L; standard deviation = 33 nmol/L

26 Distribution of DSST Normally distributed; mean = 28 points; standard deviation = 10 points

27 Four hypothetical datasets I generated four hypothetical datasets, with increasing TRUE slopes (between vit D and DSST): points per 10 nmol/L; 1.0 points per 10 nmol/L; 1.5 points per 10 nmol/L

28 Dataset 1: no relationship

29 Dataset 2: weak relationship

30 Dataset 3: weak to moderate relationship

31 Dataset 4: moderate relationship

32 The "Best fit" line Regression equation: E(Y_i) = *vit D_i (in 10 nmol/L)

33 The "Best fit" line Note how the line is a little deceptive; it draws your eye, making the relationship appear stronger than it really is! Regression equation: E(Y_i) = *vit D_i (in 10 nmol/L)

34 The "Best fit" line Regression equation: E(Y_i) = *vit D_i (in 10 nmol/L)

35 The "Best fit" line Regression equation: E(Y_i) = *vit D_i (in 10 nmol/L) Note: all the lines go through the point (63, 28)!

36 Estimating the intercept and slope: least squares estimation
** Least Squares Estimation: a little calculus.
What are we trying to estimate? β, the slope, from the line $\hat{y}_i = \beta x_i + \alpha$.
What's the constraint? We are trying to minimize the squared distance (hence the "least squares") between the observations themselves and the predicted values (these differences are also called the "residuals", or left-over unexplained variability).
$\text{Difference}_i = y_i - (\beta x_i + \alpha)$
$\text{Difference}_i^2 = (y_i - (\beta x_i + \alpha))^2$
Find the β that gives the minimum sum of the squared differences. How do you maximize (or minimize) a function? Take the derivative; set it equal to zero; and solve. Typical max/min problem from calculus.
$\frac{d}{d\beta}\sum_{i=1}^{n}(y_i - (\beta x_i + \alpha))^2 = 2\left(\sum_{i=1}^{n}(y_i - \beta x_i - \alpha)(-x_i)\right) = 2\left(\sum_{i=1}^{n}(-y_i x_i + \beta x_i^2 + \alpha x_i)\right) = 0 \ldots$
From here it takes a little math trickery to solve for β.
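One standard way to finish the derivation is sketched below. It is not necessarily the route the lecturer had in mind: it assumes the intercept is found the same way, by also setting the derivative with respect to α to zero.

```latex
\begin{align*}
% Derivative with respect to alpha gives the intercept in terms of beta.
\frac{\partial}{\partial\alpha}\sum_{i=1}^{n}(y_i-\beta x_i-\alpha)^2
  &= -2\sum_{i=1}^{n}(y_i-\beta x_i-\alpha) = 0
  \;\Rightarrow\; \alpha = \bar{y}-\beta\bar{x} \\
% Substitute alpha into the beta condition from the slide.
\sum_{i=1}^{n} x_i\,(y_i-\beta x_i-\alpha) = 0
  &\;\Rightarrow\; \sum_{i=1}^{n} x_i\bigl[(y_i-\bar{y})-\beta(x_i-\bar{x})\bigr] = 0 \\
% Deviations from the mean sum to zero, so x_i may be replaced by (x_i - x-bar).
\sum_{i=1}^{n}(x_i-\bar{x})\bigl[(y_i-\bar{y})-\beta(x_i-\bar{x})\bigr] = 0
  &\;\Rightarrow\; \hat{\beta}
  = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}
  = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}
\end{align*}
```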

37 Resulting formulas
Slope (beta coefficient): $\hat{\beta} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$
Intercept: calculate $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$
The regression line always goes through the point $(\bar{x}, \bar{y})$.
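These two formulas translate directly into code. The following is a minimal sketch; `fit_line` and the toy data are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope: beta = Cov(x, y) / Var(x), alpha = y_bar - beta * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # The (n - 1) denominators of the covariance and variance cancel in the ratio.
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    beta = ss_xy / ss_x
    alpha = y_bar - beta * x_bar
    return alpha, beta


# Hypothetical toy data (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
alpha, beta = fit_line(xs, ys)

# The fitted line passes through the point of means (x_bar, y_bar).
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
value_at_mean = alpha + beta * x_bar
print(alpha, beta, value_at_mean)
```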

38 Relationship with correlation $\hat{r} = \hat{\beta}\,\frac{SD_x}{SD_y}$ In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (= predictor) variable (X) and the other the dependent (= outcome) variable Y.

39 Example: dataset 4 SDx = 33 nmol/L; SDy = 10 points; Cov(X,Y) = 163 points*nmol/L. Beta = 163/33^2 = 0.15 points per nmol/L = 1.5 points per 10 nmol/L. r = 163/(10*33) = 0.49, or r = 0.15 * (33/10) = 0.49
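The arithmetic in this example can be reproduced from the three summary statistics alone. The variable names below are made up for the sketch; the numbers are the ones quoted on the slide.

```python
# Summary statistics quoted for dataset 4.
sd_x = 33.0     # vitamin D, nmol/L
sd_y = 10.0     # DSST, points
cov_xy = 163.0  # points * nmol/L

beta = cov_xy / sd_x ** 2              # slope = Cov(x, y) / Var(x)
beta_per_10 = beta * 10                # rescaled to points per 10 nmol/L
r_from_cov = cov_xy / (sd_x * sd_y)    # standardized covariance
r_from_beta = beta * sd_x / sd_y       # r = beta * SD_x / SD_y

print(round(beta, 2), round(beta_per_10, 1), round(r_from_cov, 2), round(r_from_beta, 2))
```

Both routes to r agree exactly, since beta * SD_x / SD_y simplifies algebraically to Cov(x, y) / (SD_x * SD_y).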

40 Significance testing
Slope: distribution of slope ~ $T_{n-2}(\beta, \mathrm{s.e.}(\hat{\beta}))$
$H_0: \beta_1 = 0$ (no linear relationship)
$H_1: \beta_1 \neq 0$ (linear relationship does exist)
$T_{n-2} = \frac{\hat{\beta} - 0}{\mathrm{s.e.}(\hat{\beta})}$

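41 Formula for the standard error of beta (you will not have to calculate by hand!):
$s_{\hat{\beta}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / (n-2)}{SS_x}} = \sqrt{\frac{s_{y/x}^2}{SS_x}}$
where $SS_x = \sum_{i=1}^{n}(x_i - \bar{x})^2$ and $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$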
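Although nobody computes this by hand, the formula is short in code. A minimal sketch follows, with a hypothetical function name and invented data; it returns the slope, its standard error, and the t statistic for H0: β = 0.

```python
import math


def slope_with_standard_error(xs, ys):
    """Return (beta_hat, se_beta, t) for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
    alpha = y_bar - beta * x_bar
    # Residual variance s^2_{y/x}: squared residuals divided by n - 2.
    ss_residual = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))
    s2_yx = ss_residual / (n - 2)
    se_beta = math.sqrt(s2_yx / ss_x)
    return beta, se_beta, beta / se_beta


# Hypothetical toy data (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
beta, se_beta, t_stat = slope_with_standard_error(xs, ys)
print(round(beta, 3), round(se_beta, 3), round(t_stat, 3))
```

The t statistic would be compared against a t-distribution with n - 2 degrees of freedom (3 here).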

42 Example: dataset 4 Standard error (beta) = 0.03; T_98 = 0.15/0.03 = 5, p< % Confidence interval = 0.09 to 0.21

43 Residual Analysis: check assumptions
$e_i = Y_i - \hat{Y}_i$
The residual for observation i, $e_i$, is the difference between its observed and predicted value
Check the assumptions of regression by examining the residuals:
  Examine for linearity assumption
  Examine for constant variance for all levels of X (homoscedasticity)
  Evaluate normal distribution assumption
  Evaluate independence assumption
Graphical Analysis of Residuals: can plot residuals vs. X
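Computing the residuals is the first step of any of these checks. Below is a small sketch with invented data and a hypothetical `residuals` helper. It returns (x, e) pairs that could be passed to any plotting tool for a residuals-vs-X plot.

```python
def residuals(xs, ys):
    """Fit a least-squares line and return a list of (x_i, e_i) with e_i = y_i - y_hat_i."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
    alpha = y_bar - beta * x_bar
    return [(x, y - (alpha + beta * x)) for x, y in zip(xs, ys)]


# Hypothetical toy data (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
pairs = residuals(xs, ys)
for x, e in pairs:
    print(x, round(e, 2))

# With an intercept in the model, least-squares residuals sum to zero.
residual_sum = sum(e for _, e in pairs)
```

A curved pattern in these pairs would suggest non-linearity, and a fan shape would suggest non-constant variance.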

44 Predicted values $\hat{y}_i$ = x_i For Vitamin D = 95 nmol/L (or 9.5 in 10 nmol/L): $\hat{y}_i$ = (9.5) = 34

45 Residual = observed - predicted X = 95 nmol/L; $\hat{y}_i = 34$; $y_i - \hat{y}_i = 14$

46 Residual Analysis for Linearity [Figure: Y vs. x scatter plots with matching residuals vs. x plots, one pair labeled "Not Linear" and one labeled "Linear".] Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004 Prentice-Hall

47 Residual Analysis for Homoscedasticity [Figure: Y vs. x scatter plots with matching residuals vs. x plots, one pair labeled "Non-constant variance" and one labeled "Constant variance".] Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004 Prentice-Hall

48 Residual Analysis for Independence [Figure: residuals vs. X plots labeled "Not Independent" and "Independent".] Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004 Prentice-Hall

49 Residual plot, dataset 4

50 Multiple linear regression What if age is a confounder here? Older men have lower vitamin D. Older men have poorer cognition. Adjust for age by putting age in the model: DSST score = intercept + slope1 × vitamin D + slope2 × age

51 2 predictors: age and vit D

52 Different 3D view

53 Fit a plane rather than a line On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.

54 Equation of the "Best fit" plane DSST score = × vitamin D (in 10 nmol/L) × age (in years) P-value for vitamin D >> .05; P-value for age < .0001. Thus, the relationship with vitamin D was due to confounding by age!

55 Multiple Linear Regression More than one predictor: $E(y) = \alpha + \beta_1 X + \beta_2 W + \beta_3 Z$ Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.
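To make "held constant" concrete, here is a dependency-free sketch that fits a two-predictor model by solving the normal equations. It is only an illustration: the data are invented so that the outcome depends on age alone, and the helper names are hypothetical. In practice a statistics package would do this.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (perfectly collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (m[r][n] - tail) / m[r][r]
    return solution


def fit_multiple_regression(rows, ys):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Invented data: the outcome depends only on age, but vitamin D tracks age.
age = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
vit_d = [6.0, 4.0, 5.0, 3.0, 1.0, 2.0]
score = [10.0 - 2.0 * a for a in age]

unadjusted = fit_multiple_regression([[v] for v in vit_d], score)
adjusted = fit_multiple_regression([[v, a] for v, a in zip(vit_d, age)], score)
print("vitamin D slope, unadjusted:", round(unadjusted[1], 3))
print("vitamin D slope, adjusted for age:", round(adjusted[1], 3))
```

In this constructed example the unadjusted vitamin D slope is clearly positive, yet it vanishes once age enters the model, mirroring the confounding story on the preceding slides.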

56 Functions of multivariate analysis: control for confounders; test for interactions between predictors (effect modification); improve predictions

57 A t-test is linear regression! Divide vitamin D into two groups: insufficient vitamin D (<50 nmol/L) and sufficient vitamin D (>=50 nmol/L), the reference group. We can evaluate these data with a t-test or a linear regression. = 7.5 T = = 3.46; p =

58 As a linear regression Intercept represents the mean value in the sufficient group. Slope represents the difference in means between the groups. Difference is significant. [Table: parameter estimates with columns Variable, Parameter Estimate, Standard Error, t Value, Pr > |t|; rows Intercept (Pr > |t| <.0001) and insuff.]
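The equivalence is easy to verify: regress the outcome on a 0/1 indicator and the intercept comes out as the reference-group mean while the slope comes out as the difference in means. The scores below are invented for illustration and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope) for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - beta * x_bar, beta


# Invented DSST scores for the two vitamin D groups.
sufficient = [30.0, 32.0, 28.0, 34.0]    # reference group, coded 0
insufficient = [24.0, 22.0, 26.0]        # coded 1

indicator = [0.0] * len(sufficient) + [1.0] * len(insufficient)
scores = sufficient + insufficient
intercept, slope = fit_line(indicator, scores)

mean_sufficient = sum(sufficient) / len(sufficient)
mean_insufficient = sum(insufficient) / len(insufficient)
print(intercept, slope)
print(mean_sufficient, mean_insufficient - mean_sufficient)
```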

59 ANOVA is linear regression! Divide vitamin D into three groups: deficient (<25 nmol/L); insufficient (>=25 and <50 nmol/L); sufficient (>=50 nmol/L), the reference group. DSST = α (= value for sufficient) + β_insufficient*(1 if insufficient) + β_2*(1 if deficient). This is called "dummy coding", where multiple binary variables are created to represent being in each category (or not) of a categorical variable.
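A sketch of dummy coding with the slide's cut-offs follows. The function names and the (vitamin D, DSST) pairs are invented. Rather than running a full regression, the sketch relies on the fact that in a model containing only these dummies, the least-squares intercept is the reference-group mean and each coefficient is that group's mean minus the reference mean.

```python
LEVELS = ("deficient", "insufficient", "sufficient")
REFERENCE = "sufficient"


def vitamin_d_category(level_nmol_l):
    """Classify a vitamin D level using the cut-offs on the slide."""
    if level_nmol_l < 25:
        return "deficient"
    if level_nmol_l < 50:
        return "insufficient"
    return "sufficient"


def dummy_code(category):
    """One 0/1 variable per non-reference category."""
    return {level: int(category == level) for level in LEVELS if level != REFERENCE}


# Invented (vitamin D in nmol/L, DSST score) pairs.
data = [(20, 18), (15, 22), (30, 24), (45, 26), (60, 30), (80, 34), (95, 32)]

groups = {level: [] for level in LEVELS}
for vit_d, dsst in data:
    groups[vitamin_d_category(vit_d)].append(dsst)
means = {level: sum(values) / len(values) for level, values in groups.items()}

alpha = means[REFERENCE]  # intercept = mean of the reference group
betas = {level: means[level] - alpha for level in LEVELS if level != REFERENCE}
print(dummy_code(vitamin_d_category(30)))
print(alpha, betas)
```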

60 The picture Sufficient vs. Insufficient; Sufficient vs. Deficient

61 Results [Table: parameter estimates with columns Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|; rows Intercept (Pr > |t| <.0001), deficient, and insufficient.] Interpretation: The deficient group has a mean DSST 9.87 points lower than the reference (sufficient) group. The insufficient group has a mean DSST 6.87 points lower than the reference (sufficient) group.

62 Other types of multivariate regression Multiple linear regression is for normally distributed outcomes. Logistic regression is for binary outcomes. Cox proportional hazards regression is used when time-to-event is the outcome.

63 Common multivariate regression models.
Outcome (dependent variable): continuous
  Example outcome variable: blood pressure
  Appropriate multivariate regression model: linear regression
  Example equation: blood pressure (mmHg) = α + βsalt*salt consumption (tsp/day) + βage*age (years) + βsmoker*ever smoker (yes=1/no=0)
  What do the coefficients give you? Slopes: tell you how much the outcome variable increases for every 1-unit increase in each predictor.
Outcome (dependent variable): binary
  Example outcome variable: high blood pressure (yes/no)
  Appropriate multivariate regression model: logistic regression
  Example equation: ln (odds of high blood pressure) = α + βsalt*salt consumption (tsp/day) + βage*age (years) + βsmoker*ever smoker (yes=1/no=0)
  What do the coefficients give you? Odds ratios: tell you how much the odds of the outcome increase for every 1-unit increase in each predictor.
Outcome (dependent variable): time-to-event
  Example outcome variable: time-to-death
  Appropriate multivariate regression model: Cox regression
  Example equation: ln (rate of death) = α + βsalt*salt consumption (tsp/day) + βage*age (years) + βsmoker*ever smoker (yes=1/no=0)
  What do the coefficients give you? Hazard ratios: tell you how much the rate of the outcome increases for every 1-unit increase in each predictor.

64 Multivariate regression pitfalls: multi-collinearity; residual confounding; overfitting

65 Multicollinearity Multicollinearity arises when two variables that measure the same thing or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out and generally destroy your model. Model building and diagnostics are tricky business!

66 Residual confounding You cannot completely wipe out confounding simply by adjusting for variables in multiple regression unless variables are measured with zero error (which is usually impossible). Example: meat eating and mortality

67 Men who eat a lot of meat are unhealthier for many reasons! Sinha R, Cross AJ, Graubard BI, Leitzmann MF, Schatzkin A. Meat intake and mortality: a prospective study of over half a million people. Arch Intern Med 2009;169:562-71

68 Mortality risks Sinha R, Cross AJ, Graubard BI, Leitzmann MF, Schatzkin A. Meat intake and mortality: a prospective study of over half a million people. Arch Intern Med 2009;169:562-71

69 Overfitting In multivariate modeling, you can get highly significant but meaningless results if you put too many predictors in the model. The model is fit perfectly to the quirks of your particular sample, but has no predictive ability in a new sample.

70 Overfitting: class data example I asked SAS to automatically find predictors of optimism in our class dataset. Here's the resulting linear regression model: [Table: columns Variable, Parameter Estimate, Standard Error, Type II SS, F Value, Pr > F; rows Intercept, exercise, sleep, obama (Pr > F <.0001), Clinton, and mathlove.] Exercise, sleep, and high ratings for Clinton are negatively related to optimism (highly significant!) and high ratings for Obama and high love of math are positively related to optimism (highly significant!).

71 If something seems too good to be true... [Tables: univariate models for Clinton, sleep, and exercise, each with columns Variable, Label, DF, Parameter Estimate, Standard Error, t Value, Pr > |t| and rows for the Intercept and the predictor.]

72 More univariate models [Tables: univariate models for Obama and for love of math, each with columns Variable, Label, DF, Parameter Estimate, Standard Error, t Value, Pr > |t| and rows for the Intercept and the predictor.] Obama: compare with multivariate result; p<.0001. Love of math: compare with multivariate result; p=.0011.

73 Overfitting Rule of thumb: You need at least 10 subjects for each additional predictor variable in the multivariate regression model. Pure noise variables still produce good R^2 values if the model is overfitted. [Figure: the distribution of R^2 values from a series of simulated regression models containing only noise variables.] (Figure 1 from: Babyak, MA. What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models. Psychosomatic Medicine 66: (2004).)
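The effect can be reproduced with a small simulation. It is a sketch under arbitrary assumptions (20 subjects, up to 15 pure-noise predictors, a fixed random seed), not a re-creation of the cited figure. All names are hypothetical, and the solver is a plain normal-equations fit written out to avoid external dependencies.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (m[r][n] - tail) / m[r][r]
    return solution


def r_squared(columns, ys):
    """R^2 of a least-squares fit of ys on the predictor columns plus an intercept."""
    n = len(ys)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    coefs = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, row)) for row in design]
    y_bar = sum(ys) / n
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return 1 - ss_residual / ss_total


random.seed(0)  # fixed seed so the run is repeatable
n_subjects = 20
outcome = [random.gauss(0, 1) for _ in range(n_subjects)]
noise = [[random.gauss(0, 1) for _ in range(n_subjects)] for _ in range(15)]

# Every predictor is pure noise, yet R^2 can only go up as more are added.
r2_by_count = {k: r_squared(noise[:k], outcome) for k in (1, 5, 10, 15)}
for k, r2 in r2_by_count.items():
    print(k, "noise predictors: R^2 =", round(r2, 2))
```

With 15 predictors for 20 subjects the model is far past the 10-subjects-per-predictor rule of thumb, which is why its apparent fit is untrustworthy.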

74 Review of statistical tests The following table gives the appropriate choice of a statistical test or measure of association for various types of data (outcome variables and predictor variables) by study design. e.g., blood pressure (continuous outcome) = pounds + age (continuous predictors) + treatment (1/0) (binary predictor)

75 Types of variables to be analyzed
Predictor variable/s | Outcome variable | Statistical procedure or measure of association
Cross-sectional/case-control studies:
  Binary (two groups) | Continuous | T-test
  Binary | Ranks/ordinal | Wilcoxon rank-sum test
  Categorical (>2 groups) | Continuous | ANOVA
  Continuous | Continuous | Simple linear regression
  Multivariate (categorical and continuous) | Continuous | Multiple linear regression
  Categorical | Categorical | Chi-square test (or Fisher's exact)
  Binary | Binary | Odds ratio, risk ratio
  Multivariate | Binary | Logistic regression
Cohort studies/clinical trials:
  Binary | Binary | Risk ratio
  Categorical | Time-to-event | Kaplan-Meier/log-rank test
  Multivariate | Time-to-event | Cox proportional hazards regression, hazard ratio
  Categorical | Continuous | Repeated measures ANOVA
  Multivariate | Continuous | Mixed models; GEE modeling

76 Alternative summary: statistics for various types of outcome data
Are the observations independent or correlated?
Outcome variable: continuous (e.g. pain scale, cognitive function)
  Independent: t-test; ANOVA; linear correlation; linear regression
  Correlated: paired t-test; repeated-measures ANOVA; mixed models/GEE modeling
  Assumptions: outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship
Outcome variable: binary or categorical (e.g. fracture yes/no)
  Independent: difference in proportions; relative risks; chi-square test; logistic regression
  Correlated: McNemar's test; conditional logistic regression; GEE modeling
  Assumptions: chi-square test assumes sufficient numbers in each cell (>=5)
Outcome variable: time-to-event (e.g. time to fracture)
  Independent: Kaplan-Meier statistics; Cox regression
  Correlated: n/a
  Assumptions: Cox regression assumes proportional hazards between groups

77 Continuous outcome (means); HRP 259/HRP 262
Outcome variable: continuous (e.g. pain scale, cognitive function)
Are the observations independent or correlated?
Independent:
  T-test: compares means between two independent groups
  ANOVA: compares means between more than two independent groups
  Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
  Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes
Correlated:
  Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
  Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
  Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time
Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
  Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
  Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
  Kruskal-Wallis test: non-parametric alternative to ANOVA
  Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient

78 Binary or categorical outcomes (proportions); HRP 259/HRP 261
Outcome variable: binary or categorical (e.g. fracture, yes/no)
Are the observations correlated?
Independent:
  Chi-square test: compares proportions between two or more groups
  Relative risks: odds ratios or risk ratios
  Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios
Correlated:
  McNemar's chi-square test: compares binary outcome between correlated groups (e.g., before and after)
  Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
  GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)
Alternatives to the chi-square test if sparse cells:
  Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells <5)
  McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells <5)

79 Time-to-event outcome (survival data); HRP 262
Outcome variable: time-to-event (e.g., time to fracture)
Are the observation groups independent or correlated?
Independent:
  Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with log-rank test
  Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios
Correlated: n/a (already over time)
Modifications to Cox regression if proportional hazards is violated: time-dependent predictors or time-dependent hazard ratios (tricky!)
