REGRESSION METHODS. Logistic regression

Similar documents
REGRESSION MODELS ANOVA

STA6938-Logistic Regression Model

1 Models for Matched Pairs

1 Inferential Methods for Correlation and Regression Analysis

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics

Sample Size Estimation in the Proportional Hazards Model for K-sample or Regression Settings Scott S. Emerson, M.D., Ph.D.

Correlation. Two variables: Which test? Relationship Between Two Numerical Variables. Two variables: Which test? Contingency table Grouped bar graph

Read through these prior to coming to the test and follow them when you take your test.

Lecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise)

Describing the Relation between Two Variables

Response Variable denoted by y it is the variable that is to be predicted measure of the outcome of an experiment also called the dependent variable

REGRESSION AND ANALYSIS OF VARIANCE. Motivation. Module structure

Comparing Two Populations. Topic 15 - Two Sample Inference I. Comparing Two Means. Comparing Two Pop Means. Background Reading

Stat 139 Homework 7 Solutions, Fall 2015

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.

Biostatistics for Med Students. Lecture 2

STAC51: Categorical data Analysis

Worksheet 23 ( ) Introduction to Simple Linear Regression (continued)

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting

Chapters 5 and 13: REGRESSION AND CORRELATION. Univariate data: x, Bivariate data (x,y).

SIMPLE LINEAR REGRESSION AND CORRELATION ANALYSIS

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting

General IxJ Contingency Tables

Properties and Hypothesis Testing

ST 305: Exam 3 ( ) = P(A)P(B A) ( ) = P(A) + P(B) ( ) = 1 P( A) ( ) = P(A) P(B) σ X 2 = σ a+bx. σ ˆp. σ X +Y. σ X Y. σ X. σ Y. σ n.

Maximum Likelihood Estimation

Lecture 7: Non-parametric Comparison of Location. GENOME 560, Spring 2016 Doug Fowler, GS

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY

Statistics 203 Introduction to Regression and Analysis of Variance Assignment #1 Solutions January 20, 2005

[ ] ( ) ( ) [ ] ( ) 1 [ ] [ ] Sums of Random Variables Y = a 1 X 1 + a 2 X 2 + +a n X n The expected value of Y is:

Math 140 Introductory Statistics

MOST PEOPLE WOULD RATHER LIVE WITH A PROBLEM THEY CAN'T SOLVE, THAN ACCEPT A SOLUTION THEY CAN'T UNDERSTAND.

Statistics Lecture 27. Final review. Administrative Notes. Outline. Experiments. Sampling and Surveys. Administrative Notes

Continuous Data that can take on any real number (time/length) based on sample data. Categorical data can only be named or categorised

Lecture 7: Properties of Random Samples

Correlation Regression

Assessment and Modeling of Forests. FR 4218 Spring Assignment 1 Solutions

S Y Y = ΣY 2 n. Using the above expressions, the correlation coefficient is. r = SXX S Y Y

Statistical Hypothesis Testing. STAT 536: Genetic Statistics. Statistical Hypothesis Testing - Terminology. Hardy-Weinberg Disequilibrium

Statistics 20: Final Exam Solutions Summer Session 2007

Chapter 6 Sampling Distributions

Hypothesis Testing. Evaluation of Performance of Learned h. Issues. Trade-off Between Bias and Variance

STA Learning Objectives. Population Proportions. Module 10 Comparing Two Proportions. Upon completing this module, you should be able to:

Stat 200 -Testing Summary Page 1

3/3/2014. CDS M Phil Econometrics. Types of Relationships. Types of Relationships. Types of Relationships. Vijayamohanan Pillai N.

Lecture 11 Simple Linear Regression

Circle the single best answer for each multiple choice question. Your choice should be made clearly.

Simple Linear Regression

¹Y 1 ¹ Y 2 p s. 2 1 =n 1 + s 2 2=n 2. ¹X X n i. X i u i. i=1 ( ^Y i ¹ Y i ) 2 + P n

Regression. Correlation vs. regression. The parameters of linear regression. Regression assumes... Random sample. Y = α + β X.

Final Review. Fall 2013 Prof. Yao Xie, H. Milton Stewart School of Industrial Systems & Engineering Georgia Tech

Final Examination Solutions 17/6/2010

Goodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen)

2 1. The r.s., of size n2, from population 2 will be. 2 and 2. 2) The two populations are independent. This implies that all of the n1 n2

y ij = µ + α i + ɛ ij,

Formulas and Tables for Gerstman

TMA4245 Statistics. Corrected 30 May and 4 June Norwegian University of Science and Technology Department of Mathematical Sciences.

BIOS 4110: Introduction to Biostatistics. Breheny. Lab #9

University of California, Los Angeles Department of Statistics. Practice problems - simple regression 2 - solutions

Homework for 4/9 Due 4/16

t distribution [34] : used to test a mean against an hypothesized value (H 0 : µ = µ 0 ) or the difference

Ismor Fischer, 1/11/

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random

Simple Random Sampling!

6 Sample Size Calculations

Lecture 8: Non-parametric Comparison of Location. GENOME 560, Spring 2016 Doug Fowler, GS

Chapter 1 (Definitions)

MidtermII Review. Sta Fall Office Hours Wednesday 12:30-2:30pm Watch linear regression videos before lab on Thursday

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression

Regression, Inference, and Model Building

Lecture 5: Parametric Hypothesis Testing: Comparing Means. GENOME 560, Spring 2016 Doug Fowler, GS

DS 100: Principles and Techniques of Data Science Date: April 13, Discussion #10

Linear Regression Models

Recall the study where we estimated the difference between mean systolic blood pressure levels of users of oral contraceptives and non-users, x - y.

October 25, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 1

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4

Logit regression Logit regression

Random Variables, Sampling and Estimation

Because it tests for differences between multiple pairs of means in one test, it is called an omnibus test.

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering

Question 1: Exercise 8.2

Chapter 22. Comparing Two Proportions. Copyright 2010 Pearson Education, Inc.

Sample Size Determination (Two or More Samples)

Estimation for Complete Data

Measurement uncertainty of the sound absorption

Dr. Maddah ENMG 617 EM Statistics 11/26/12. Multiple Regression (2) (Chapter 15, Hines)

Agreement of CI and HT. Lecture 13 - Tests of Proportions. Example - Waiting Times

Chapter 4 - Summarizing Numerical Data

1036: Probability & Statistics

UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL/MAY 2009 EXAMINATIONS ECO220Y1Y PART 1 OF 2 SOLUTIONS

Exam II Covers. STA 291 Lecture 19. Exam II Next Tuesday 5-7pm Memorial Hall (Same place as exam I) Makeup Exam 7:15pm 9:15pm Location CB 234

Important Formulas. Expectation: E (X) = Σ [X P(X)] = n p q σ = n p q. P(X) = n! X1! X 2! X 3! X k! p X. Chapter 6 The Normal Distribution.

Stat 421-SP2012 Interval Estimation Section

GG313 GEOLOGICAL DATA ANALYSIS

Statistics Independent (X) you can choose and manipulate. Usually on x-axis

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1

First, note that the LS residuals are orthogonal to the regressors. X Xb X y = 0 ( normal equations ; (k 1) ) So,

Transcription:

REGRESSION METHODS Logistic regressio 233

RECAP: Biary Outcome? NO Cotiuous Outcome? YES Liear Regressio/ANOVA NO Other Methods YES Odds ratio as measure of associatio? Relative risk as measure of associatio? Risk differece as measure of associatio? Logistic regressio GLM w/ log lik GLM w/ idetity lik 234

Logistic Regressio: Motivatio May scietific questios of iterest ivolve a biary outcome (e.g. disease/o disease) Let s ivestigate if geetic factors are associated with presece/absece of coroary heart disease (CHD) 235

Logistic Regressio: Motivatio Scietific questios of iterest: Assess the effect of rs4775401 o CHD Assess the effect of cholesterol o CHD Assess the effect of rs4775401 o CHD after accoutig for cholesterol 236

Logistic Regressio: Motivatio Scietific questio: Assess the effect of rs4775401 o risk of CHD rs4775401 - Coded as the umber of mior alleles 0 = C/C, 1 = C/T, 2 = T/T. 237

Motivatio: rs4755401 ad CHD Here is a cotigecy table for the SNP ad CHD: > table(rs4775401,chd) chd rs4775401 0 1 0 154 48 1 104 66 2 15 13 Prevalece of CHD i C/C: 48/(48+154) = 0.238 Prevalece of CHD i C/T: 66/(66+104) = 0.388 Prevalece of CHD i T/T: 13/(13+15) = 0.464 Does the prevalece of CHD differ across the groups? Without usig regressio, what tool could we use to look for a associatio betwee rs4755401 ad CHD? 238

Motivatio: rs4755401 ad CHD Here is a cotigecy table for the SNP ad CHD: > table(rs4775401,chd) 0 1 0 154 48 1 104 66 2 15 13 Without usig regressio, what tool could we use to look for a associatio? > chisq.test(rs4775401,chd) Pearso's Chi-squared test data: rs4775401 ad chd X-squared = 12.657, df = 2, p-value = 0.001785 I additio to hypothesis testig, we eed to summarize the stregth of associatio betwee the two variables 239

Measures of associatio for biary outcomes Outcome No Yes Exposure Yes a b No c d Risk differece (RD) = P(outcome exposed) - P(outcome ot exposed) = (b/(a+b)) - (d/(c+d)) > table(rs4775401,chd) 0 1 0 154 48 1 104 66 2 15 13 RD(T/T vs C/C) = 13/(13+15) 48/(48+154) = 0.464 0.238 =0.226 240

Measures of associatio for biary outcomes Outcome No Yes Exposure Yes a b No c d Risk differece iterpretatio Additive differece i probability (risk) betwee exposed ad uexposed Also called excess risk -1 < RD < 1 RD = 0 o associatio; risk of outcome same for exposed ad uexposed 241

Measures of associatio for biary outcomes Outcome No Yes Exposure Yes a b No c d Relative risk (RR) = P(outcome exposed)/p(outcome ot exposed) = (b/(a+b))/(d/(c+d)) > table(rs4775401,chd) 0 1 0 154 48 1 104 66 2 15 13 RR(T/T vs C/C) = (13/(13+15)) / (48/(48+154)) = 0.464 / 0.238 =1.95 242

Measures of associatio for biary outcomes Outcome No Yes Exposure Yes a b No c d Relative risk iterpretatio Multiplicative differece i probability (risk) of outcome amog exposed compared to uexposed 0 < RR < RR = 1 o associatio; risk of outcome same for exposed ad uexposed 243

Measures of associatio for biary outcomes The odds is the ratio of the risk of havig a outcome to the risk of ot havig the outcome If p is the risk of a outcome, the the odds of the outcome are p/(1-p) The odds ratio (OR) is the ratio of the odds of the outcome i the exposed to the odds of the outcome i the uexposed : OR = [p 1 /(1- p 1 )]/ [p 0 /(1- p 0 )] = odds ratio where p 1 =risk i exposed ad p 0 =risk i uexposed Like the relative risk, the odds ratio provides a measure of associatio i a ratio (rather tha a differece) The odds ratio is the ratio of two ratios (i.e. the ratio of odds) The OR approximates RR for rare evets The OR is more complicated to iterpret tha the RR (except for rare evets), but there are some study desigs (amely, case-cotrol studies) where it is ot possible to directly estimate the risk ratio, but oe ca always estimate the odds ratio 244

Measures of associatio for biary outcomes Say the chace of disease (D) if you re exposed (E) = 0.25 The the odds of gettig D (for those who are exposed) are 0.25/0.75 = 1/3 or 1:3 Say the chace of disease if you re ot exposed =0.1 The the odds of gettig D (for those who are ot exposed) are 0.1/0.9 = 1/9 or 1:9 The the disease odds ratio (ratio of the odds of disease i the exposed to the odds of disease i the uexposed) is (1/3)/(1/9) = 3 Q: What is the risk ratio here? 2.5 245

Measures of associatio for biary outcomes Outcome No Yes Exposure Yes a b No c d Odds = P/(1-P) Odds ratio (OR) = Odds(outcome exposed)/odds(outcome ot exposed) = ((b/(a+b))/(a/(a+b)))/((d/(c+d))/(c/(c+d))) = (b/a)/(d/c) = (bc)/(ad) > table(rs4775401,chd) 0 1 0 154 48 1 104 66 2 15 13 OR(T/T vs C/C) = (13/15) / (48/154) = 2.78 246

Measures of associatio for biary outcomes Outcome No Yes Exposure Yes a b No c d Odds ratio iterpretatio Multiplicative differece i odds of outcome betwee exposed ad uexposed 0 < OR < OR = 1 o associatio; odds of outcome same for exposed ad uexposed 247

Pros ad cos of measures of associatio RD is appealig because it directly commuicates absolute icrease i risk Ofte more policy relevat tha relative measures RR more directly iterpretable tha OR (most people do t have a ituitive uderstadig of odds) OR estimable i case-cotrol studies where RR ad RD are ot For rare outcomes, OR RR 248

Logistic Regressio: Motivatio The chi-squared test is adequate for ivestigatig the associatio betwee two categorical predictors But what if we wat to ivestigate the associatio betwee a cotiuous predictor like cholesterol ad a biary outcome like CHD? Or what if we wat to adjust for potetial cofouders? Logistic regressio will provide us with a tool for this 249

Biary outcome ad cotiuous exposure Objective: Estimate associatio betwee biary outcome ad cotiuous exposure Y = biary respose (0=o, 1=yes) X = cotiuous exposure p = E(Y X) = P(Y = 1 X ) Oe solutio fit a liear model This is just a stadard liear model except our outcome is biary Iterpretatio of b 1? Problems with this approach? 250

Motivatig example: CHD ad cholesterol > lm.mod1 <- lm(chd ~ chol, data = cholesterol) > summary(lm.mod1) Call: lm(formula = chd ~ chol, data = cholesterol) Residuals: Mi 1Q Media 3Q Max -0.7067-0.3301-0.1289 0.3975 1.0227 What is the iterpretatio of the cholesterol parameter estimate? Coefficiets: Estimate Std. Error t value Pr(> t ) (Itercept) -1.4245087 0.1747852-8.15 4.77e-15 *** chol 0.0094718 0.0009436 10.04 < 2e-16 *** --- Sigif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 Residual stadard error: 0.4169 o 398 degrees of freedom Multiple R-squared: 0.202, Adjusted R-squared: 0.2 F-statistic: 100.8 o 1 ad 398 DF, p-value: < 2.2e-16 251

Biary outcome ad cotiuous exposure w w Alterative: use a trasformatio that maps P(Y = 1 X) to the real lie Let logit(p) = log(p / (1 - p))) w p (0, 1) w p /(1 - p) (0, ) w log(p /(1 - p)) (-, ) logit(p) 6 4 2 0 2 4 6 0.0 0.2 0.4 0.6 0.8 1.0 p 252

Logistic regressio logit(p) = log(p / (1 - p))) this esures that p lies betwee 0 ad 1 Regress logit(p) o X logit[e(y X)] = log[p(y=1 X)/(1 P(Y=1 X))] = β 0 + β 1 X It turs out that the slope coefficiets i logistic regressio are readily iterpretable: they are just log odds ratios! 253

Iterpretatio of logistic regressio parameters O the log-odds scale log[odds(y=1 X = (c+1))] = β 0 + β 1 (c+1) log[odds(y=1 X = c)] = β 0 + β 1 c log[odds(y=1 X = (c+1))] - log[odds(y=1 X = c)] = β 1 log[odds(y=1 X = (c+1))/odds(y=1 X = c)] = β 1 log[or] = β 1 Odds Ratio (OR) That is, for two observatios that differ by oe uit i X there is a differece of β 1 i their log odds of Y = 1 Or, equivaletly, the log of the ratio of the odds of Y = 1 (i.e. the log OR) for two uits that differ i X by oe uit is β 1 254

Iterpretatio of logistic regressio parameters By expoetiatig we arrive at a simpler iterpretatio exp(log(or)) = exp(β 1 ) OR = exp(β 1 ) So for two observatios that differ i X by oe uit there is a multiplicative differece i their odds of Y = 1 of exp(β 1 ) Or, equivaletly, the ratio of the odds of Y = 1 (i.e., the odds ratio) for two observatios that differ i X by oe uit is exp(β 1 ) 255

Motivatig example: CHD ad cholesterol > glm.mod1 <- glm(chd ~ chol, family = "biomial") > summary(glm.mod1) Call: glm(formula = chd ~ chol, family = "biomial", data = cholesterol) Deviace Residuals: Mi 1Q Media 3Q Max -1.7437-0.8219-0.4852 0.9096 2.4536 Coefficiets: Estimate Std. Error z value Pr(> z ) (Itercept) -11.09600 1.29881-8.543 < 2e-16 *** chol 0.05498 0.00678 8.109 5.12e-16 *** --- Sigif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 (Dispersio parameter for biomial family take to be 1) Null deviace: 499.98 o 399 degrees of freedom Residual deviace: 409.71 o 398 degrees of freedom AIC: 413.71 Number of Fisher Scorig iteratios: 4 w What do these results tell us about the relatioship betwee cholesterol ad CHD? 256

Motivatig example: CHD ad cholesterol > glm.mod1 <- glm(chd ~ chol, family = "biomial") > summary(glm.mod1) Call: glm(formula = chd ~ chol, family = "biomial", data = cholesterol) Deviace Residuals: Mi 1Q Media 3Q Max -1.7437-0.8219-0.4852 0.9096 2.4536 Coefficiets: Estimate Std. Error z value Pr(> z ) (Itercept) -11.09600 1.29881-8.543 < 2e-16 *** chol 0.05498 0.00678 8.109 5.12e-16 *** --- Sigif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 (Dispersio parameter for biomial family take to be 1) Null deviace: 499.98 o 399 degrees of freedom Residual deviace: 409.71 o 398 degrees of freedom AIC: 413.71 Number of Fisher Scorig iteratios: 4 w Comparig two people who differ i cholesterol by 1 mg/dl, the log odds of CHD are higher by 0.055 for the idividual with higher cholesterol 257

Motivatig example: CHD ad cholesterol w w Differeces i log odds are pretty spectacularly difficult to iterpret! It would be much better to expoetiate the coefficiets ad report odds ratios > exp(glm.mod1$coef) (Itercept) chol 1.517293e-05 1.056515e+00 > exp(cofit(glm.mod1)) Waitig for profilig to be doe... 2.5 % 97.5 % (Itercept) 1.061838e-06 0.0001744859 chol 1.043101e+00 1.0712556915 w Comparig two people who differ i cholesterol by 1 mg/dl, the odds of CHD are higher by a factor of 1.06 (95% CI: 1.04, 1.07) for the idividual with higher cholesterol 258

Motivatig example: CHD ad cholesterol w w A 1 mg/dl differece is very small, so we might be iterested i estimatig the OR associated with a larger differece such as 10 mg/dl I this case, just as i liear regressio we just eed to multiply our coefficiet by the appropriate factor > exp(10*glm.mod1$coef) (Itercept) chol 6.466861e-49 1.732831e+00 w Comparig two people whose cholesterol levels differ by 10 mg/dl, the perso with the higher cholesterol has 1.73 times higher odds of CHD compared to the perso with lower cholesterol. 259

Multivariable logistic regressio w w Ofte we are iterested i examiig associatios betwee multiple predictors simultaeously ad a biary outcome Multiple logistic regressio follows same patter as liear regressio logit[e(y X)] = β 0 + β 1 X 1 + β 2 X 2 +. + β p X p w exp(b j ) iterpreted as the OR associated with a oe uit chage i the j th predictor, amog idividuals with other predictors at same levels (or holdig other predictors costat/cotrollig for/adjustig for etc.) 260

Motivatig example > glm.mod2 <- glm(chd ~ chol+factor(rs4775401), family = "biomial", data = cholesterol) > summary(glm.mod2) Call: glm(formula = chd ~ chol + factor(rs4775401), family = "biomial", data = cholesterol) Deviace Residuals: Mi 1Q Media 3Q Max -1.5528-0.7810-0.4585 0.8037 2.6275 Coefficiets: Estimate Std. Error z value Pr(> z ) (Itercept) -11.625209 1.335335-8.706 < 2e-16 *** chol 0.055443 0.006872 8.069 7.11e-16 *** factor(rs4775401)1 0.794212 0.259257 3.063 0.00219 ** factor(rs4775401)2 1.138308 0.464317 2.452 0.01422 * --- Sigif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 (Dispersio parameter for biomial family take to be 1) Null deviace: 499.98 o 399 degrees of freedom Residual deviace: 397.27 o 396 degrees of freedom AIC: 405.27 Number of Fisher Scorig iteratios: 4 261

Motivatig example As we have see before, expoetiatig the coefficiets gives us odds ratios > exp(glm.mod2$coef) (Itercept) chol factor(rs4775401)1 factor(rs4775401)2 8.937908e-06 1.057009e+00 2.212697e+00 3.121483e+00 A oe mg/dl icrease i cholesterol is associated with 1.06 times higher odds of CHD after adjustig for geotype We ca also obtai cofidece itervals for the odds ratios > exp(cofit(glm.mod2)) 2.5 % 97.5 % (Itercept) 5.776075e-07 0.0001096301 chol 1.043422e+00 1.0719733312 factor(rs4775401)1 1.336145e+00 3.6998174205 factor(rs4775401)2 1.250542e+00 7.8187264825 262

Hypothesis testig for logistic regressio Maximum likelihood is the stadard method of estimatig parameters from logistic models ad is based o fidig the estimates which maximize the joit probability for the observed data uder the chose model. The Wald test uses maximum likelihood estimates (MLE) ad their stadard errors to coduct hypothesis tests Test: H 0 : b j = 0 (o associatio) vs. H A : b j 0 Costruct a z-score: z = ˆβ j SE( ˆβ j ) N(0, 1) Wald Test 263

Motivatig example > glm.mod2 <- glm(chd ~ chol+factor(rs4775401), family = "biomial", data = cholesterol) > summary(glm.mod2) Call: glm(formula = chd ~ chol + factor(rs4775401), family = "biomial", data = cholesterol) Deviace Residuals: Mi 1Q Media 3Q Max -1.5528-0.7810-0.4585 0.8037 2.6275 Coefficiets: Estimate Std. Error z value Pr(> z ) (Itercept) -11.625209 1.335335-8.706 < 2e-16 *** chol 0.055443 0.006872 8.069 7.11e-16 *** factor(rs4775401)1 0.794212 0.259257 3.063 0.00219 ** factor(rs4775401)2 1.138308 0.464317 2.452 0.01422 * --- Sigif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 (Dispersio parameter for biomial family take to be 1) Null deviace: 499.98 o 399 degrees of freedom Residual deviace: 397.27 o 396 degrees of freedom AIC: 405.27 Number of Fisher Scorig iteratios: 4 Wald statistics ad p-values for each parameter 264

Likelihood ratio test The likelihood ratio statistic is useful i comparig ested models. (LRT = likelihood ratio test) This allows us to test hypotheses about multiple parameters simultaeously such as H 0 : b 1 = b 2 = 0 vs H A : at least oe parameter ot equal to 0 I order to use the LRT we must fit a ested hierarchy of models For example: Model 1: logit p i = b 0 + b 1 chol i Model 2: logit p i = b 0 + b 1 chol i + b 2 SNP 1i + b 3 SNP 2i 265

Likelihood ratio test The LRT allows us to test the sigificace of the additioal parameters i the larger model. Example: Compare model 2 to model 3 H 0 : b 2 = b 3 = 0 LRT = -2 [L 1 L 2 ] c 2 2 df = # parameters beig tested 266

Example: Likelihood ratio test > lrtest(glm.mod1,glm.mod2) Likelihood ratio test Model 1: chd ~ chol Model 2: chd ~ chol + factor(rs4775401) #Df LogLik Df Chisq Pr(>Chisq) 1 2-204.85 2 4-198.63 2 12.44 0.001989 ** --- Sigif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 After accoutig for cholesterol, there is a statistically sigificat associatio betwee rs4775401 ad CHD 267

Logistic Regressio: Assumptios 1. Logit(E[Y x]) is related liearly to x 2. Y s are idepedet of each other 268

Summary We have cosidered: Measures of associatio for biary outcomes Logistic regressio Iterpretatio Estimatio Hypothesis testig 269

REGRESSION METHODS Geeralized liear models 270

Geeralized liear models So far we have cosidered : Cotiuous outcomes liear regressio/anova Biary outcomes logistic regressio Geeralized liear models (GLMs) provide a way to model Cotiuous ad biary outcomes Additioal types of outcome variables (e.g. couts) Additioal fuctioal forms for the relatioship betwee outcomes ad predictors 271

Geeralized Liear Models GLMs allow us to estimate regressio models for outcomes arisig from expoetial family distributios. This family icludes may familiar distributios icludig Normal, Biomial ad Poisso. A GLM is specified based o three compoets: Outcome distributio Liear predictor Lik fuctio We will see that liear ad logistic regressio are both GLMs with specific choice of outcome ad lik fuctio! 272

Outcome distributio The first step i fittig a GLM is to choose a appropriate distributio for your outcome Examples Cotiuous outcome Normal Biary outcome Biomial Cout outcome Poisso 273

Liear predictor After specifyig a distributio for the outcome, we specify the liear predictor, g[e(y)] = β 0 + β 1 x 1 + + β p x p This is just the systematic piece of our regressio model As i other regressio models we have see, we eed to idetify the set of covariates to be icluded 274

Lik fuctio Fially, we specify a lik fuctio, g[e(y)]: g[e(y)] = β 0 + β 1 x 1 + + β p x p This describes the fuctioal form of the relatioship betwee E(Y) ad the liear predictor I liear regressio, we use the idetity lik fuctio g[e(y)] = E(Y) I logistic regressio, we use the logit lik fuctio g[(e(y)] = log[e(y)/(1-e(y))] 275

Geeralized liear models A few example GLMS: Distributio Lik fuctio Model Normal Idetity g[e(y)]=e(y) Liear regressio Biomial Logit g[e(y)]= log[e(y)/(1-e(y))] Logistic regressio Poisso Log g[e(y)]=log[e(y)] Poisso GLM Gamma Log g[e(y)]=log[e(y)] Gamma GLM 276

Alteratives to logistic regressio Odds ratio is limited by difficulty of iterpretatio Relative risk is more iterpretable To estimate a relative risk usig regressio we ca use the log liear model: log[e(y x)] = β 0 + β 1 x This is sometimes referred to as relative risk regressio exp(β 1 ) is the relative risk associated with a oeuit icrease i x 277

Modified Poisso regressio To estimate the relative risk, we could use a biomial GLM with log lik. It turs out that estimatio for this model is very challegig ad results are sesitive to outliers i X A alterative approach that performs better i practice is modified Poisso regressio This method uses a Poisso GLM with log lik Usig a Poisso model for biary data will give icorrect stadard errors because the variace for biary outcomes differs from the variace for Poisso outcomes We ca combie the Poisso GLM with a robust variace estimator to accout for this violatio of the model s assumptios 278

Modified Poisso regressio > glm.rr <- gee(chd ~ chol+factor(rs4775401), family = "poisso", id = seq(1,row(cholesterol)), data = cholesterol) > summary(glm.rr) GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA gee S-fuctio, versio 4.13 modified 98/01/27 (1998) Model: Lik: Logarithm Variace to Mea Relatio: Poisso Correlatio Structure: Idepedet Coefficiets: Estimate Naive S.E. Naive z Robust S.E. Robust z (Itercept) -7.06494205 0.673034850-10.497141 0.586040831-12.055375 chol 0.02963414 0.003372133 8.787949 0.002752366 10.766786 factor(rs4775401)1 0.41510943 0.155971511 2.661444 0.144444876 2.873826 factor(rs4775401)2 0.63841622 0.256170049 2.492158 0.200023351 3.191708 Estimated Scale Parameter: 0.671285 Number of Iteratios: 1 279

Modified Poisso regressio w Relative risk of CHD associated with 1 mg/dl icrease i cholesterol is 1.03. > exp(glm.rr$coef) (Itercept) chol factor(rs4775401)1 factor(rs4775401)2 0.0008545444 1.0300775972 1.5145364657 1.8934796543 w Compare this to the odds ratio we obtaied earlier usig logistic regressio > exp(glm.mod2$coef) (Itercept) chol factor(rs4775401)1 factor(rs4775401)2 8.937908e-06 1.057009e+00 2.212697e+00 3.121483e+00 280

Relative risk regressio: Assumptios 1. log(e[y x]) = log(p(y=1 x) is related liearly to x Warig: this ca lead to predicted probabilities > 1 2. Y s are idepedet of each other 281

Risk differece regressio w w w Recall, we also cosidered fittig a liear model to biary outcome data This allows us to estimate differeces i risk associated with a 1 uit differece i the predictor By usig robust stadard errors, we ca accout for violatio of the assumptios of ormality ad equal variace > glm.rd <- gee(chd ~ chol+factor(rs4775401), id = seq(1,row(cholesterol)), data = cholesterol) > summary(glm.rd) Coefficiets: Estimate Naive S.E. Naive z Robust S.E. Robust z (Itercept) -1.485416572 0.1729186937-8.590260 0.1372414095-10.823385 chol 0.009392399 0.0009293498 10.106420 0.0007615605 12.333097 factor(rs4775401)1 0.142743141 0.0427305918 3.340537 0.0423172293 3.373168 factor(rs4775401)2 0.212108382 0.0827885757 2.562049 0.0822370553 2.579231 282

Risk differece regressio A 1 mg/dl differece is very small, so we might be iterested i estimatig the RD associated with a larger differece such as 10 mg/dl Comparig two people with the same rs4775401 geotype whose cholesterol levels differ by 10 mg/dl, the risk of CHD for the perso with the higher cholesterol is 9.4% higher (i absolute terms) compared to the perso with lower cholesterol Comparig two people with the same cholesterol level, a perso with rs4775401 C/T is estimated to have risk of CHD 14.3% higher (i absolute terms) tha a perso with rs4775401 C/C Comparig two people with the same cholesterol level, a perso with rs4775401 T/T is estimated to have risk of CHD 21.2% higher (i absolute terms) tha a perso with rs4775401 C/C 283

Risk differece regressio: Assumptios 1. E[Y x] = P(Y=1 x) is related liearly to x Warig: this ca lead to predicted probabilities > 1 or < 0 2. Y s are idepedet of each other 284

Summary We have cosidered: Logistic regressio Iterpretatio Estimatio Geeralized liear models Relative risk regressio Risk differece regressio 285

Module summary I this module we have covered a variety of regressio methods that ca be used to aalyze cotiuous ad biary outcomes: Cotiuous outcomes Simple liear regressio Multiple liear regressio ANOVA Biary outcomes Logistic regressio Relative risk regressio Risk differece regressio These methods are foudatioal for may statistical aalyses, ad we hope you will be able to apply them to your future research! 286

Everythig is regressio! (Professor Scott Emerso) 287