Linear correlation and linear regression


Continuous outcome (means)

Outcome variable: continuous (e.g., pain scale, cognitive function). Are the observations independent or correlated?

Independent:
- T-test: compares means between two independent groups
- ANOVA: compares means between more than two independent groups
- Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
- Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated:
- Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
- Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
- Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size), non-parametric statistics:
- Wilcoxon sign-rank test: non-parametric alternative to the paired t-test
- Wilcoxon sum-rank test (= Mann-Whitney U test): non-parametric alternative to the t-test
- Kruskal-Wallis test: non-parametric alternative to ANOVA
- Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient

Recall: covariance

$$\mathrm{cov}(x,y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

Interpreting covariance

- cov(x,y) > 0: X and Y are positively correlated
- cov(x,y) < 0: X and Y are inversely correlated
- cov(x,y) = 0: X and Y are uncorrelated (no linear relationship; note that zero covariance does not by itself guarantee independence)

Correlation coefficient

Pearson's correlation coefficient is standardized covariance (unitless):

$$r = \frac{\mathrm{covariance}(x,y)}{\sqrt{\mathrm{var}\,x \cdot \mathrm{var}\,y}}$$

Correlation

- Measures the relative strength of the linear relationship between two variables
- Unitless
- Ranges between -1 and 1
- The closer to -1, the stronger the negative linear relationship
- The closer to +1, the stronger the positive linear relationship
- The closer to 0, the weaker any linear relationship

Scatter plots of data with various correlation coefficients [figure: six scatter plots with r = -1, -0.6, 0, +0.3, +1, and a curvilinear pattern with r = 0]. Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall.

Linear correlation [figure: linear relationships vs. curvilinear relationships]. Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall.

Linear correlation [figure: strong relationships vs. weak relationships]. Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall.

Linear correlation [figure: no relationship]. Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall.

Calculating by hand

$$\hat{r} = \frac{\mathrm{covariance}(x,y)}{\sqrt{\mathrm{var}\,x \cdot \mathrm{var}\,y}} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

Simpler calculation formula (the 1/(n-1) terms cancel):

$$\hat{r} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}} = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$$

where $SS_{xy}$ is the numerator of the covariance, and $SS_{xx}$ and $SS_{yy}$ are the numerators of the variances.
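The SS-based formula above can be coded directly as a sanity check; the small dataset below is made up for illustration and is not from the lecture.

```python
# Hand calculation of Pearson's r via the SS formula: r = SS_xy / sqrt(SS_xx * SS_yy).
# The example data are made up for illustration.
def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # numerator of covariance
    ss_xx = sum((xi - xbar) ** 2 for xi in x)                       # numerator of var(x)
    ss_yy = sum((yi - ybar) ** 2 for yi in y)                       # numerator of var(y)
    return ss_xy / (ss_xx * ss_yy) ** 0.5

r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])  # about 0.77
```

Because the 1/(n-1) factors cancel, this gives the same answer as dividing the covariance by the product of the standard deviations.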

Distribution of the correlation coefficient:

$$SE(\hat{r}) = \sqrt{\frac{1-r^2}{n-2}}$$

The sample correlation coefficient follows a T distribution with n-2 degrees of freedom (since you have to estimate the standard error). Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself, so substitute in the estimated r.
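The significance test this implies is a simple ratio, sketched below with the lecture's dataset-4 values (r = 0.49, n = 100) plugged in.

```python
import math

# Sketch of the test implied by the slide: t = r / SE(r),
# with SE(r) = sqrt((1 - r^2) / (n - 2)), compared against T with n-2 df.
def corr_t_stat(r, n):
    se = math.sqrt((1 - r ** 2) / (n - 2))
    return r / se

# With r = 0.49 and n = 100, t is roughly 5.6 on 98 degrees of freedom.
t = corr_t_stat(0.49, 100)
```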


Linear regression

In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (= predictor) variable (X) and the other the dependent (= outcome) variable (Y).

What is "linear"? Remember this: Y = mX + B? Here m is the slope and B is the intercept.

What is slope? A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

Prediction: if you know something about X, this knowledge helps you predict something about Y. (Sound familiar? Sounds like conditional probabilities?)

Regression equation

Expected value of y at a given level of x:

$$E(y \mid x) = \alpha + \beta x$$

Predicted value for an individual:

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

The fixed part, $\alpha + \beta x_i$, sits exactly on the line; the random error $\varepsilon_i$ follows a normal distribution.

Assumptions (or the fine print)

Linear regression assumes that:
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same (homogeneity of variances)
4. The observations are independent

The standard error of Y given X, written $s_{y/x}$, is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X. [figure: the same spread $s_{y/x}$ shown at several values of X]

Regression picture

[figure: scatter of points around the fitted line $\hat{y} = \beta x + \alpha$, with distances A, B, and C marked from each observation to the naïve mean of y, from the regression line to that mean, and from the observation to the regression line]

$$\underbrace{\sum_{i=1}^{n}(y_i-\bar{y})^2}_{SS_{total}} = \underbrace{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}_{SS_{reg}} + \underbrace{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}_{SS_{residual}}$$

- SS_total: total squared distance of observations from the naïve mean of y (total variation)
- SS_reg: distance from the regression line to the naïve mean of y; variability due to x (regression)
- SS_residual: variance around the regression line; additional variability not explained by x, which is what the least squares method aims to minimize

Least squares estimation gave us the line ($\beta$) that minimized SS_residual, and $R^2 = SS_{reg}/SS_{total}$.
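The decomposition can be verified numerically; the tiny dataset below is made up for illustration, not taken from the lecture.

```python
# Numerical check that SS_total = SS_reg + SS_residual and R^2 = SS_reg / SS_total.
# The data are made up for illustration.
def ols_fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    beta = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))
    return ybar - beta * xbar, beta  # (alpha, beta)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
alpha, beta = ols_fit(x, y)
ybar = sum(y) / len(y)
yhat = [alpha + beta * xi for xi in x]
ss_total = sum((yi - ybar) ** 2 for yi in y)             # total variation
ss_reg = sum((yh - ybar) ** 2 for yh in yhat)            # variability due to x
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained variability
r2 = ss_reg / ss_total
```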

Recall example: cognitive function and vitamin D

Hypothetical data loosely based on [1]; cross-sectional study of 100 middle-aged and older European men. Cognitive function is measured by the Digit Symbol Substitution Test (DSST).

1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.

Distribution of vitamin D: mean = 63 nmol/L; standard deviation = 33 nmol/L.

Distribution of DSST: normally distributed; mean = 28 points; standard deviation = 10 points.

Four hypothetical datasets

I generated four hypothetical datasets, with increasing TRUE slopes (between vit D and DSST):
- 0
- 0.5 points per 10 nmol/L
- 1.0 points per 10 nmol/L
- 1.5 points per 10 nmol/L

Dataset 1: no relationship

Dataset 2: weak relationship

Dataset 3: weak to moderate relationship

Dataset 4: moderate relationship

The best-fit line. Regression equation: E(Y) = 28 + 0*vit D (in 10 nmol/L)

The best-fit line. Note how the line is a little deceptive; it draws your eye, making the relationship appear stronger than it really is! Regression equation: E(Y) = 26 + 0.5*vit D (in 10 nmol/L)

The best-fit line. Regression equation: E(Y) = 22 + 1.0*vit D (in 10 nmol/L)

The best-fit line. Regression equation: E(Y) = 20 + 1.5*vit D (in 10 nmol/L). Note: all the lines go through the point (63, 28)!

Estimating the intercept and slope: least squares estimation

A little calculus. What are we trying to estimate? β, the slope. What is the constraint? We are trying to minimize the squared distance (hence "least squares") between the observations themselves and the predicted values, i.e. the residuals, or left-over unexplained variability:

Difference = $y_i - (\beta x_i + \alpha)$; Difference² = $(y_i - (\beta x_i + \alpha))^2$

Find the β that gives the minimum sum of the squared differences. How do you minimize a function? Take the derivative, set it equal to zero, and solve: a typical max/min problem from calculus.

$$\frac{d}{d\beta}\sum_{i=1}^{n}\bigl(y_i - (\beta x_i + \alpha)\bigr)^2 = -2\sum_{i=1}^{n} x_i\bigl(y_i - (\beta x_i + \alpha)\bigr) = 0$$

From here it takes a little math trickery to solve for β.

Resulting formulas

Slope (beta coefficient): $\hat{\beta} = \dfrac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}$

Intercept: $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$

The regression line always goes through the point $(\bar{x}, \bar{y})$.

Relationship with correlation

$$\hat{r} = \hat{\beta}\,\frac{SD_x}{SD_y}$$

In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (= predictor) variable (X) and the other the dependent (= outcome) variable (Y).

Example: dataset 4

SDx = 33 nmol/L; SDy = 10 points; Cov(X,Y) = 163 points*nmol/L (equivalently, $\hat{\beta} = SS_{xy}/SS_{xx}$)

Beta = 163/33² = 0.15 points per nmol/L = 1.5 points per 10 nmol/L

r = 163/(10*33) = 0.49, or equivalently r = 0.15 * (33/10) = 0.49
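The dataset-4 arithmetic above is easy to reproduce directly from the slide's three summary numbers:

```python
# Reproducing the dataset-4 arithmetic from the slide:
# SDx = 33 nmol/L, SDy = 10 points, Cov(X,Y) = 163 points*nmol/L.
sd_x, sd_y, cov_xy = 33.0, 10.0, 163.0

beta = cov_xy / sd_x ** 2     # slope = Cov(x,y) / Var(x)   ->  about 0.15 points per nmol/L
r = cov_xy / (sd_x * sd_y)    # r = Cov(x,y) / (SDx * SDy)  ->  about 0.49
r_alt = beta * (sd_x / sd_y)  # equivalent route via the slope
```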

Significance testing

Slope: distribution of $\hat{\beta} \sim T_{n-2}\bigl(\beta,\ s.e.(\hat{\beta})\bigr)$

H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)

$$T_{n-2} = \frac{\hat{\beta} - 0}{s.e.(\hat{\beta})}$$

Formula for the standard error of beta (you will not have to calculate by hand!):

$$s.e.(\hat{\beta}) = \sqrt{\frac{s_{y/x}^2}{SS_x}}, \quad \text{where } s_{y/x}^2 = \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-2},\quad SS_x = \sum_{i=1}^{n}(x_i-\bar{x})^2,\quad \hat{y}_i = \hat{\alpha} + \hat{\beta}x_i$$

Example: dataset 4

Standard error (beta) = 0.03
T98 = 0.15/0.03 = 5, p < .0001
95% confidence interval = 0.09 to 0.21
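The confidence interval above follows from the usual beta ± t*SE recipe; in this sketch the critical value t for 98 df at 97.5% is hard-coded as roughly 1.98 (an assumption, rather than looked up from a table or library).

```python
# Sketch of the 95% CI for the slope using the slide's numbers.
# t_crit ~= 1.98 for 98 df is an assumed constant here, not a table lookup.
beta, se, t_crit = 0.15, 0.03, 1.98

t_stat = beta / se                               # 5.0, as on the slide
lo, hi = beta - t_crit * se, beta + t_crit * se  # about (0.09, 0.21)
```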

Residual analysis: check assumptions

The residual for observation i, $e_i = Y_i - \hat{Y}_i$, is the difference between its observed and predicted value. Check the assumptions of regression by examining the residuals:
- Examine for the linearity assumption
- Examine for constant variance for all levels of X (homoscedasticity)
- Evaluate the normal distribution assumption
- Evaluate the independence assumption

Graphical analysis of residuals: can plot residuals vs. X.

Predicted values

$$\hat{y} = 20 + 1.5x$$

For vitamin D = 95 nmol/L (or 9.5 in 10 nmol/L): $\hat{y} = 20 + 1.5(9.5) = 34.25 \approx 34$

Residual = observed - predicted

At X = 95 nmol/L: y = 48 and $\hat{y}$ = 34, so the residual is $y - \hat{y} = 48 - 34 = 14$.
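The prediction-and-residual step is worth seeing end to end; this mirrors the slide's numbers (fitted line 20 + 1.5x, observed DSST of 48 at vitamin D 95 nmol/L).

```python
# The slide's prediction and residual, step by step:
# fitted line yhat = 20 + 1.5*x, with x in units of 10 nmol/L.
alpha, beta = 20.0, 1.5
x = 9.5        # vitamin D = 95 nmol/L
y_obs = 48.0   # observed DSST score

y_hat = alpha + beta * x         # 34.25, shown rounded to 34 on the slide
residual = y_obs - round(y_hat)  # 48 - 34 = 14
```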

Residual analysis for linearity [figure: residual plots; a curved pattern in the residuals indicates non-linearity, a patternless band indicates linearity]. Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall.

Residual analysis for homoscedasticity [figure: residual plots; a fan shape indicates non-constant variance, an even band indicates constant variance]. Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall.

Residual analysis for independence [figure: residual plots; systematic patterns indicate dependence, a random scatter indicates independence]. Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall.

Residual plot, dataset 4

Multiple linear regression

What if age is a confounder here?
- Older men have lower vitamin D
- Older men have poorer cognition

Adjust for age by putting age in the model:

DSST score = intercept + slope1 x vitamin D + slope2 x age

2 predictors: age and vit D

Different 3D view

Fit a plane rather than a line. On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.

Equation of the best-fit plane

DSST score = 53 + 0.0039 x vitamin D (in 10 nmol/L) - 0.46 x age (in years)

P-value for vitamin D >> .05; p-value for age < .0001. Thus, the relationship with vitamin D was due to confounding by age!

Multiple linear regression

More than one predictor:

$$E(y) = \alpha + \beta_1 X + \beta_2 W + \beta_3 Z$$

Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.
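A model of this form can be fit by ordinary least squares; the sketch below uses synthetic data, and the true coefficients (1.0, 2.0, -0.5) are assumptions chosen only for illustration.

```python
import numpy as np

# Hedged sketch of fitting E(y) = alpha + b1*X + b2*W by ordinary least squares
# on synthetic data; the true coefficients are assumptions for illustration.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)
W = rng.normal(size=n)
y = 1.0 + 2.0 * X - 0.5 * W + rng.normal(scale=0.1, size=n)

design = np.column_stack([np.ones(n), X, W])       # intercept column + predictors
coef, *_ = np.linalg.lstsq(design, y, rcond=None)  # [alpha_hat, b1_hat, b2_hat]
```

With modest noise and n = 200, the recovered coefficients land close to the values used to generate the data.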

Functions of multivariate analysis:
- Control for confounders
- Test for interactions between predictors (effect modification)
- Improve predictions

A t-test is linear regression!

Divide vitamin D into two groups:
- Insufficient vitamin D (< 50 nmol/L)
- Sufficient vitamin D (>= 50 nmol/L), reference group

We can evaluate these data with a t-test or a linear regression:

$$T_{98} = \frac{40 - 32.5}{\sqrt{\frac{10.8^2}{54} + \frac{10.8^2}{46}}} = \frac{7.5}{2.17} = 3.46; \quad p = .0008$$
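The two-sample t statistic above is a one-liner to reproduce from the slide's summary numbers (means 40 and 32.5, common SD 10.8, group sizes 54 and 46):

```python
import math

# Reproducing the slide's two-sample t statistic by hand:
# means 40 and 32.5, common SD 10.8, group sizes 54 and 46.
t = (40 - 32.5) / math.sqrt(10.8 ** 2 / 54 + 10.8 ** 2 / 46)  # about 3.46
```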

As a linear regression

The intercept represents the mean value in the sufficient group. The slope represents the difference in means between the groups. The difference is significant.

Variable     Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    40.07407              1.47511           27.17      <.0001
insuff       -7.53060              2.17493           -3.46      0.0008

ANOVA is linear regression!

Divide vitamin D into three groups:
- Deficient (< 25 nmol/L)
- Insufficient (>= 25 and < 50 nmol/L)
- Sufficient (>= 50 nmol/L), reference group

DSST = α (= value for sufficient) + β_insufficient * (1 if insufficient) + β_deficient * (1 if deficient)

This is called "dummy coding," where multiple binary variables are created to represent being in each category (or not) of a categorical variable.
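The dummy coding above can be sketched as a small function; the cutoffs are the ones given on the slide, with "sufficient" as the reference group coded (0, 0).

```python
# Sketch of the slide's dummy coding, with "sufficient" (>= 50 nmol/L) as the
# reference group: one 0/1 indicator each for "deficient" and "insufficient".
def dummy_code(vitd):
    deficient = 1 if vitd < 25 else 0
    insufficient = 1 if 25 <= vitd < 50 else 0
    return deficient, insufficient  # reference group codes as (0, 0)

codes = [dummy_code(v) for v in (10, 30, 80)]  # [(1, 0), (0, 1), (0, 0)]
```

Regressing DSST on these two indicators reproduces the one-way ANOVA comparison of the three group means.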

The picture: sufficient vs. insufficient; sufficient vs. deficient

Results

Parameter estimates:

Variable       DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept      1     40.07407              1.47817           27.11      <.0001
deficient      1     -9.87407              3.73950           -2.64      0.0096
insufficient   1     -6.87963              2.33719           -2.94      0.0041

Interpretation: the deficient group has a mean DSST 9.87 points lower than the reference (sufficient) group; the insufficient group has a mean DSST 6.87 points lower than the reference (sufficient) group.

Other types of multivariate regression

- Multiple linear regression is for normally distributed outcomes
- Logistic regression is for binary outcomes
- Cox proportional hazards regression is used when time-to-event is the outcome

Common multivariate regression models

Continuous outcome (example: blood pressure)
- Appropriate model: linear regression
- Example equation: blood pressure (mmHg) = α + β_salt*salt consumption (tsp/day) + β_age*age (years) + β_smoker*ever smoker (yes=1/no=0)
- What the coefficients give you: slopes; tells you how much the outcome variable increases for every 1-unit increase in each predictor

Binary outcome (example: high blood pressure, yes/no)
- Appropriate model: logistic regression
- Example equation: ln (odds of high blood pressure) = α + β_salt*salt consumption (tsp/day) + β_age*age (years) + β_smoker*ever smoker (yes=1/no=0)
- What the coefficients give you: odds ratios; tells you how much the odds of the outcome increase for every 1-unit increase in each predictor

Time-to-event outcome (example: time to death)
- Appropriate model: Cox regression
- Example equation: ln (rate of death) = α + β_salt*salt consumption (tsp/day) + β_age*age (years) + β_smoker*ever smoker (yes=1/no=0)
- What the coefficients give you: hazard ratios; tells you how much the rate of the outcome increases for every 1-unit increase in each predictor

Multivariate regression pitfalls
- Multicollinearity
- Residual confounding
- Overfitting

Multicollinearity

Multicollinearity arises when two variables that measure the same thing or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out and generally destroy your model. Model building and diagnostics are tricky business!

Residual confounding

You cannot completely wipe out confounding simply by adjusting for variables in multiple regression unless variables are measured with zero error (which is usually impossible). Example: meat eating and mortality.

Men who eat a lot of meat are unhealthier for many reasons! Sinha R, Cross AJ, Graubard BI, Leitzmann MF, Schatzkin A. Meat intake and mortality: a prospective study of over half a million people. Arch Intern Med 2009;169:562-71.

Mortality risks. Sinha R, Cross AJ, Graubard BI, Leitzmann MF, Schatzkin A. Meat intake and mortality: a prospective study of over half a million people. Arch Intern Med 2009;169:562-71.

Overfitting

In multivariate modeling, you can get highly significant but meaningless results if you put too many predictors in the model. The model is fit perfectly to the quirks of your particular sample, but has no predictive ability in a new sample.

Overfitting: class data example

I asked SAS to automatically find predictors of optimism in our class dataset. Here is the resulting linear regression model:

Variable    Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept   11.80175              2.98341           11.96067      15.65      0.0019
exercise    -0.29106              0.09798           6.74569       8.83       0.0117
sleep       -1.91592              0.39494           17.98818      23.53      0.0004
obama       1.73993               0.24352           39.01944      51.05      <.0001
Clinton     -0.83128              0.17066           18.13489      23.73      0.0004
mathlove    0.45653               0.10668           13.99925      18.32      0.0011

Exercise, sleep, and high ratings for Clinton are negatively related to optimism (highly significant!), and high ratings for Obama and high love of math are positively related to optimism (highly significant!).

If something seems too good to be true...

Clinton, univariate:
Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept   1     5.43688               2.13476           2.55       0.0188
Clinton     1     0.24973               0.27111           0.92       0.3675

Sleep, univariate:
Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept   1     8.30817               4.36984           1.90       0.0711
sleep       1     -0.14484              0.65451           -0.22      0.8270

Exercise, univariate:
Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept   1     6.65189               0.89153           7.46       <.0001
exercise    1     0.19161               0.20709           0.93       0.3658

More univariate models

Obama, univariate:
Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept   1     0.82107               2.43137           0.34       0.7389
obama       1     0.87276               0.31973           2.73       0.0126
Compare with the multivariate result: p < .0001.

Love of math, univariate:
Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept   1     3.70270               1.25302           2.96       0.0076
mathlove    1     0.59459               0.19225           3.09       0.0055
Compare with the multivariate result: p = .0011.

Overfitting

Rule of thumb: you need at least 10 subjects for each additional predictor variable in the multivariate regression model. Pure noise variables still produce good R² values if the model is overfitted. [figure: the distribution of R² values from a series of simulated regression models containing only noise variables. Figure 1 from: Babyak MA. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic Medicine 66:411-421 (2004).]
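A small simulation in the spirit of that figure makes the point concrete. The settings here (n = 30 subjects, k = 15 pure-noise predictors, well below 10 subjects per predictor) are assumptions chosen for illustration, not the paper's exact setup.

```python
import numpy as np

# Regress pure noise on many pure-noise predictors: R^2 still looks "good".
# n = 30 and k = 15 are illustrative assumptions (far below 10 subjects/predictor).
def noise_r2(seed, n=30, k=15):
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + noise
    y = rng.normal(size=n)                                      # pure-noise outcome
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

mean_r2 = float(np.mean([noise_r2(s) for s in range(50)]))  # typically around 0.5
```

Even though every predictor is noise, the average in-sample R² across simulations is sizable, which is exactly the overfitting trap the slide warns about.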

Review of statistical tests

The following table gives the appropriate choice of a statistical test or measure of association for various types of data (outcome variables and predictor variables) by study design. E.g., blood pressure = pounds + age + treatment (1/0): a continuous outcome with continuous predictors and a binary predictor.

Types of variables to be analyzed: predictor variable(s), outcome variable, and the statistical procedure or measure of association.

Cross-sectional/case-control studies:
- Binary (two groups) predictor, continuous outcome: t-test
- Binary predictor, ranks/ordinal outcome: Wilcoxon rank-sum test
- Categorical (> 2 groups) predictor, continuous outcome: ANOVA
- Continuous predictor, continuous outcome: simple linear regression
- Multivariate (categorical and continuous) predictors, continuous outcome: multiple linear regression
- Categorical predictor, categorical outcome: chi-square test (or Fisher's exact)
- Binary predictor, binary outcome: odds ratio, risk ratio
- Multivariate predictors, binary outcome: logistic regression

Cohort studies/clinical trials:
- Binary predictor, binary outcome: risk ratio
- Categorical predictor, time-to-event outcome: Kaplan-Meier / log-rank test
- Multivariate predictors, time-to-event outcome: Cox proportional hazards regression, hazard ratio
- Categorical predictor, continuous outcome: repeated-measures ANOVA
- Multivariate predictors, continuous outcome: mixed models; GEE modeling

Alternative summary: statistics for various types of outcome data

Continuous outcome (e.g., pain scale, cognitive function):
- Independent observations: t-test, ANOVA, linear correlation, linear regression
- Correlated observations: paired t-test, repeated-measures ANOVA, mixed models/GEE modeling
- Assumptions: outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship

Binary or categorical outcome (e.g., fracture yes/no):
- Independent observations: difference in proportions, relative risks, chi-square test, logistic regression
- Correlated observations: McNemar's test, conditional logistic regression, GEE modeling
- Assumptions: chi-square test assumes sufficient numbers in each cell (>= 5)

Time-to-event outcome (e.g., time to fracture):
- Independent observations: Kaplan-Meier statistics, Cox regression
- Correlated observations: n/a
- Assumptions: Cox regression assumes proportional hazards between groups

Continuous outcome (means); HRP 259/HRP 262

Outcome variable: continuous (e.g., pain scale, cognitive function). Are the observations independent or correlated?

Independent:
- T-test: compares means between two independent groups
- ANOVA: compares means between more than two independent groups
- Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
- Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated:
- Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
- Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
- Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size), non-parametric statistics:
- Wilcoxon sign-rank test: non-parametric alternative to the paired t-test
- Wilcoxon sum-rank test (= Mann-Whitney U test): non-parametric alternative to the t-test
- Kruskal-Wallis test: non-parametric alternative to ANOVA
- Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient

Binary or categorical outcomes (proportions); HRP 259/HRP 261

Outcome variable: binary or categorical (e.g., fracture, yes/no). Are the observations correlated?

Independent:
- Chi-square test: compares proportions between two or more groups
- Relative risks: odds ratios or risk ratios
- Logistic regression: multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios

Correlated:
- McNemar's chi-square test: compares a binary outcome between correlated groups (e.g., before and after)
- Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
- GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if sparse cells:
- Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells < 5)
- McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells < 5)

Time-to-event outcome (survival data); HRP 262

Outcome variable: time-to-event (e.g., time to fracture). Are the observation groups independent or correlated?

Independent:
- Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with the log-rank test
- Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios

Correlated: n/a (already over time)

Modifications to Cox regression if proportional hazards is violated: time-dependent predictors or time-dependent hazard ratios (tricky!)