Statistical Modelling with Stata: Binary Outcomes


Statistical Modelling with Stata: Binary Outcomes Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 21/11/2017

Cross-tabulation

             Exposed   Unexposed   Total
  Cases         a          b       a + b
  Controls      c          d       c + d
  Total       a + c      b + d     a + b + c + d

Simple random sample: fix a + b + c + d
Exposure-based sampling: fix a + c and b + d
Outcome-based sampling: fix a + b and c + d

The χ² Test
Compares observed to expected numbers in each cell
Expected numbers are calculated under the null hypothesis of no association
Works for any of the sampling schemes

Measures of Association

Relative Risk = (a/(a + c)) / (b/(b + d)) = a(b + d) / (b(a + c))
Risk Difference = a/(a + c) - b/(b + d)
Odds Ratio = (a/c) / (b/d) = ad/bc

All obtained with cs disease exposure[, or]
Only the odds ratio is valid with outcome-based sampling

Crosstabulation in Stata

. cs back_p sex, or

                 |          sex           |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |       637         445  |       1082
        Noncases |      1694        1739  |       3433
-----------------+------------------------+------------
           Total |      2331        2184  |       4515
                 |                        |
            Risk |  .2732733    .2037546  |   .2396456
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |         .0695187       |    .044767    .0942704
      Risk ratio |         1.341188       |   1.206183    1.491304
 Attr. frac. ex. |         .2543926       |   .1709386     .329446
 Attr. frac. pop |         .1497672       |
      Odds ratio |         1.469486       |    1.27969     1.68743  (Cornfield)
                 +-------------------------------------------------
                              chi2(1) =    29.91  Pr>chi2 = 0.0000
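As a quick arithmetic check (in Python rather than Stata), the measures of association can be recomputed directly from the four cell counts of this table:

```python
# Cell counts from the cs back_p sex output above
a, b = 637, 445      # cases:    exposed, unexposed
c, d = 1694, 1739    # noncases: exposed, unexposed

risk_exposed   = a / (a + c)                       # .2732733
risk_unexposed = b / (b + d)                       # .2037546

risk_ratio      = risk_exposed / risk_unexposed    # 1.341188
risk_difference = risk_exposed - risk_unexposed    # .0695187
odds_ratio      = (a * d) / (b * c)                # 1.469486
```

Each value agrees with the corresponding line of the Stata output.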

Limitations of Tabulation
No continuous predictors
Only a limited number of categorical predictors

Linear Regression and Binary Outcomes
Can't use linear regression with binary outcomes:
the distribution is not normal, and only a limited range of predicted values is sensible
Changing the parameter estimation to allow for a non-normal distribution is straightforward
We also need to limit the range of predicted values

Example: CHD and Age
[Scatter plot of chd (0 or 1) against age, 20 to 70 years]

Example: CHD by Age Group
[Plot of the proportion of subjects with CHD against the mean age of each age band]

Example: CHD by Age - Linear Fit
[Plot of the proportion of subjects with CHD against age, with the linear fitted values]

Generalized Linear Models
Linear model: Y = β0 + β1x1 + ... + βpxp + ε, where ε is normally distributed
Generalized linear model: g(Y) = β0 + β1x1 + ... + βpxp + ε, where ε has a known distribution

Probabilities and Odds

  Probability p    Odds Ω = p/(1 - p)
  0.1 = 1/10       0.1/0.9 = 1:9 ≈ 0.111
  0.5 = 1/2        0.5/0.5 = 1:1 = 1
  0.9 = 9/10       0.9/0.1 = 9:1 = 9
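The probability-odds conversions in this table are easy to check numerically; a small sketch in Python (rather than Stata):

```python
import math

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

def prob(omega):
    """Probability corresponding to odds omega."""
    return omega / (1 + omega)

odds(0.9)            # 9:1, i.e. approximately 9
prob(9)              # back to 0.9
math.log(odds(0.5))  # even odds: log odds of 0
```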

Probabilities and Odds
[Plot of proportion (0 to 1) against log odds (-5 to 5): the logistic curve]

Advantage of the Odds Scale
Just a different scale for measuring probabilities
Any odds from 0 to ∞ corresponds to a probability
Any log odds from -∞ to ∞ corresponds to a probability
The shape of the curve commonly fits data

The Binomial Distribution
Outcome can be either 0 or 1
Has one parameter: the probability that the outcome is 1
Assumes observations are independent

The Logistic Regression Equation

  log(π̂/(1 - π̂)) = β0 + β1x1 + ... + βpxp
  Y ~ Binomial(π)

Y has a binomial distribution with parameter π
π̂ is the predicted probability that Y = 1

Parameter Interpretation
When xi increases by 1, log(π̂/(1 - π̂)) increases by βi
Therefore π̂/(1 - π̂) increases by a factor of e^βi
For a dichotomous predictor, this is exactly the odds ratio we met earlier
For a continuous predictor, the odds increase by a factor of e^βi for each unit increase in the predictor
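This interpretation can be verified numerically. The sketch below (Python, with made-up coefficients β0 = -4 and β1 = 0.1, not the fitted CHD model) shows that a one-unit increase in the predictor multiplies the odds by exactly e^β1, whatever the starting value of x:

```python
import math

def invlogit(xb):
    """Predicted probability from the linear predictor."""
    return 1 / (1 + math.exp(-xb))

def odds(p):
    return p / (1 - p)

b0, b1 = -4.0, 0.1   # hypothetical coefficients
for x in (20, 50, 70):
    or_ = odds(invlogit(b0 + b1 * (x + 1))) / odds(invlogit(b0 + b1 * x))
    # or_ equals math.exp(b1) at every x
```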

Odds Ratios and Relative Risks
[Plot of odds (0 to 5) and proportion against proportion (0 to 1): the two agree only when the proportion is small]

Logistic Regression in Stata

. logistic chd age

Logistic regression                               Number of obs   =        100
                                                  LR chi2(1)      =      29.31
                                                  Prob > chi2     =     0.0000
Log likelihood = -53.676546                       Pseudo R2       =     0.2145

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.117307   .0268822     4.61   0.000     1.065842    1.171257
------------------------------------------------------------------------------

Predict
Lots of options for the predict command
p gives the predicted probability for each subject
xb gives the linear predictor (i.e. the log odds) for each subject

Plot of probability against age
[Plot of Pr(chd) against age (20 to 70), with the observed proportion of subjects in each age band with CHD]

Plot of log-odds against age
[Plot of the linear prediction (log odds, -3 to 2) against age (20 to 70)]

Other Models for Binary Outcomes
Can use any function that maps (-∞, ∞) to (0, 1)
Probit model
Complementary log-log model
Their parameters lack a simple interpretation
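Each of these models is just a different inverse function mapping the linear predictor onto (0, 1). A sketch of the three inverse link functions in Python (standard formulas, not tied to any particular dataset):

```python
import math

def invlogit(x):
    """Inverse logit link: the logistic curve."""
    return 1 / (1 + math.exp(-x))

def invprobit(x):
    """Inverse probit link: the standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def invcloglog(x):
    """Inverse complementary log-log link."""
    return 1 - math.exp(-math.exp(x))

# All three map any real linear predictor to a value strictly between 0 and 1,
# but only the logit link gives coefficients interpretable as log odds ratios.
```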

The Log-Binomial Model
Models log(π) rather than log(π/(1 - π))
Gives a relative risk rather than an odds ratio
Can produce predicted values greater than 1
May not fit the data as well
Stata command: glm varlist, family(binomial) link(log)
If the association between log(π) and the predictor is non-linear, the simple interpretation is lost
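The relative-risk interpretation works the same way as the odds-ratio interpretation in logistic regression, but on the log(π) scale. A numeric sketch in Python with made-up coefficients (not a fitted model):

```python
import math

b0, b1 = -3.0, 0.05   # hypothetical log-binomial coefficients

def predicted_risk(age):
    """log(pi) = b0 + b1 * age, so pi = exp(b0 + b1 * age)."""
    return math.exp(b0 + b1 * age)

# One extra year multiplies the risk by exp(b1): a relative risk
rr = predicted_risk(41) / predicted_risk(40)

# Unlike logistic regression, nothing keeps the prediction below 1
predicted_risk(70)   # exp(0.5), about 1.65
```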

Log-binomial model example
[Plot of the logistic and log-binomial predictions against age (20 to 70), with the observed proportion of subjects with CHD; the log-binomial predictions rise above 1 at the oldest ages]

Logistic Regression Diagnostics
Goodness of fit
Influential observations
Poorly fitted observations

Problems with R²
Multiple definitions
Lack of interpretability
Low values
Can predict P(Y = 1) perfectly, yet not predict Y well at all if P(Y = 1) ≈ 0.5

Hosmer-Lemeshow test
Very like the χ² test
Divide subjects into groups
Compare observed and expected numbers in each group
Want to see a non-significant result
Command used is estat gof

Hosmer-Lemeshow test example

. estat gof, group(5) table

Logistic model for chd, goodness-of-fit test

(Table collapsed on quantiles of estimated probabilities)
+--------------------------------------------------------+
| Group |   Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
|-------+--------+-------+-------+-------+-------+-------|
|     1 | 0.1690 |     2 |   2.1 |    18 |  17.9 |    20 |
|     2 | 0.3183 |     5 |   4.9 |    16 |  16.1 |    21 |
|     3 | 0.5037 |     9 |   8.7 |    12 |  12.3 |    21 |
|     4 | 0.7336 |    15 |  15.1 |     8 |   7.9 |    23 |
|     5 | 0.9125 |    12 |  12.2 |     3 |   2.8 |    15 |
+--------------------------------------------------------+

     number of observations =       100
           number of groups =         5
    Hosmer-Lemeshow chi2(3) =      0.05
                Prob > chi2 =    0.9973
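The reported statistic is just Σ(O - E)²/E summed over all ten cells of the table. Recomputing it in Python from the (rounded) table entries reproduces Stata's value to the precision available:

```python
# Observed and expected counts from the estat gof table above
obs_1 = [2, 5, 9, 15, 12]
exp_1 = [2.1, 4.9, 8.7, 15.1, 12.2]
obs_0 = [18, 16, 12, 8, 3]
exp_0 = [17.9, 16.1, 12.3, 7.9, 2.8]

chi2 = sum((o - e) ** 2 / e
           for o, e in zip(obs_1 + obs_0, exp_1 + exp_0))
round(chi2, 2)   # agrees with Stata's 0.05
```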

Sensitivity and Specificity

             Test +ve   Test -ve   Total
  Cases         a          b       a + b
  Controls      c          d       c + d
  Total       a + c      b + d     a + b + c + d

Sensitivity: the probability that a case is classified as positive = a/(a + b)
Specificity: the probability that a non-case is classified as negative = d/(c + d)
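With hypothetical counts (not from the lecture's data), the two definitions look like this in Python:

```python
# Hypothetical 2x2 classification table
a, b = 80, 20    # cases:    test +ve, test -ve
c, d = 30, 70    # controls: test +ve, test -ve

sensitivity = a / (a + b)   # P(classified positive | case)
specificity = d / (c + d)   # P(classified negative | control)
```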

Sensitivity and Specificity in Logistic Regression
Sensitivity and specificity can only be used with a single dichotomous classification
Logistic regression gives a probability, not a classification
Can define your own threshold for use with logistic regression
Commonly choose a 50% probability of being a case
Can choose any probability: sensitivity and specificity will vary
Why not try every possible threshold and compare the results: the ROC curve

ROC Curves
Shows how sensitivity varies with changing specificity
Larger area under the curve = better
Maximum = 1
Tossing a coin would give 0.5
Command used is lroc
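The idea of sweeping every possible threshold can be sketched in a few lines of Python, here with made-up predicted probabilities (not the CHD model's):

```python
# Made-up predicted probabilities for four cases and four controls
cases    = [0.9, 0.8, 0.7, 0.4]
controls = [0.6, 0.3, 0.2, 0.1]

def roc_points(cases, controls):
    """(1 - specificity, sensitivity) at every possible threshold."""
    # sentinel 1.1 ensures the (0, 0) point (nobody classified positive)
    thresholds = sorted(set(cases + controls + [0.0, 1.1]))
    pts = []
    for t in thresholds:
        sens = sum(p >= t for p in cases) / len(cases)
        spec = sum(p < t for p in controls) / len(controls)
        pts.append((1 - spec, sens))
    return sorted(pts)

def auc(pts):
    """Area under the ROC curve by the trapezium rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

lroc performs the same threshold sweep on the fitted model's predicted probabilities; for these toy values the area works out to 0.9375.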

ROC Example
[ROC curve: sensitivity against 1 - specificity]
Area under ROC curve = 0.7999

Influential Observations
Residuals are less useful in logistic regression than in linear regression:
they can only take the values 1 - π̂ or -π̂
Leverage does not translate to the logistic regression model
Δβ̂i measures the effect of the ith observation on the parameters
Obtained from the dbeta option to the predict command
Plot against π̂ to reveal influential observations

Plot of Δβ̂ against π̂
[Plot of Pregibon's dbeta (0 to 0.25) against Pr(chd)]

Effect of removing influential observation

. logistic chd age if dbeta < 0.2

Logistic regression                               Number of obs   =         98
                                                  LR chi2(1)      =      32.12
                                                  Prob > chi2     =     0.0000
Log likelihood = -50.863658                       Pseudo R2       =     0.2400

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.130329   .0293066     4.73   0.000     1.074324    1.189254
------------------------------------------------------------------------------

Poorly fitted observations
Can be identified by residuals
Deviance residuals: predict varname, ddeviance
χ² residuals: predict varname, dx2
Not influential: omitting them will not change the conclusions
May need to explain why the fit is poor in a particular area
Plot residuals against the predicted probability and look for outliers

Separation
Need at least one case and one control in each subgroup
If you have lots of subgroups, this may not be true
In which case, the log(OR) for that group is ±∞
Stata will drop all subjects from that group (unless you use the option asis)
Not a problem with continuous predictors
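The problem is easy to see numerically. In a subgroup with, say, controls but no cases (hypothetical counts below), the odds are 0 and the log odds are unbounded, so no finite coefficient can fit (Python sketch):

```python
import math

# Hypothetical subgroup containing controls but no cases
cases, controls = 0, 25

odds = cases / controls    # 0.0
try:
    log_odds = math.log(odds)
except ValueError:         # log(0) is undefined: the log odds head to -infinity
    log_odds = -math.inf

# A maximum-likelihood fit would push this group's coefficient towards -infinity,
# which is why Stata drops the group instead (unless asis is specified).
```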