Logistic regression analysis. Birthe Lykke Thomsen H. Lundbeck A/S


Response with only two categories
  Example
  Odds ratio and risk ratio
  Quantitative explanatory variable
  More than one variable
Logistic regression
Case-control designs

Example: Use of contraception at first intercourse
Data collected in a study of risk factors for cervical neoplasia (Susanne Krüger Kjær). Cohort of 11088 women aged 20 to 29, inhabitants of Copenhagen in 1993. Among other things, the women were asked which type of contraception, if any, they used at their first sexual intercourse.
A sub-study (Edith Svare et al., 2002) investigated predictors for no use of contraception at the first intercourse among 10839 women (244 virgins and 5 non-responders excluded).

Contraception at first intercourse and smoking
Does smoking status at the age at first intercourse matter?
              Use of contraception
  Smoker      No      Yes     Total
  Ever        1199    2609     3808
  Never       1505    5526     7031
  Total       2704    8135    10839
Chi-square test: $\chi^2(1) = 134$, $p < .0001$
Quantification of the effect (risk and odds of no contraception, ever versus never smokers):
  Risk ratio: $\frac{1199/3808}{1505/7031} = \frac{0.315}{0.214} = 1.47$
  Odds ratio: $\frac{1199/2609}{1505/5526} = \frac{1199 \times 5526}{2609 \times 1505} = 1.69$
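The three numbers on this slide can be reproduced directly from the four cell counts. The following is a minimal sketch in Python (the slides themselves use SAS); the function name `two_by_two` is an illustrative choice, and the chi-square is the uncorrected Pearson statistic, which is assumed to be what the slide reports.

```python
# 2x2 table from the slide: rows = smoking status, columns = contraception use.
no_ever, yes_ever = 1199, 2609      # ever smokers: no contraception / contraception
no_never, yes_never = 1505, 5526    # never smokers: no contraception / contraception

def two_by_two(a, b, c, d):
    """Return (risk ratio, odds ratio, Pearson chi-square) for the table
    [[a, b], [c, d]], where a and c count the event of interest."""
    n = a + b + c + d
    rr = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return rr, odds_ratio, chi2

rr, odds_ratio, chi2 = two_by_two(no_ever, yes_ever, no_never, yes_never)
print(f"RR = {rr:.2f}, OR = {odds_ratio:.2f}, chi-square(1) = {chi2:.0f}")
# prints: RR = 1.47, OR = 1.69, chi-square(1) = 134
```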

Odds ratio (OR)
Odds are the ratio between the number of events of the two types, e.g., sick/healthy.
In probabilities: $\text{odds} = \frac{P(\text{sick})}{P(\text{healthy})} = \frac{P(\text{sick})}{1 - P(\text{sick})}$
OR = 2 implies
  a doubling of the number of sick individuals for very-low-risk populations (for rare diseases OR ≈ RR), and
  a halving of the number of healthy individuals for very-high-risk populations [RR doesn't work well for high-risk populations].
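The doubling/halving reading of OR = 2 can be checked numerically. This is a small sketch with two hypothetical baseline risks (1% and 99%, chosen only for illustration); the helper names are not from the slides.

```python
def odds(p):
    """Odds corresponding to probability p."""
    return p / (1 - p)

def prob_from_odds(o):
    """Probability corresponding to odds o."""
    return o / (1 + o)

def apply_or(p, odds_ratio):
    """Probability obtained by multiplying the odds of p by odds_ratio."""
    return prob_from_odds(odds(p) * odds_ratio)

low, high = 0.01, 0.99            # hypothetical baseline risks
p_low = apply_or(low, 2)
p_high = apply_or(high, 2)

sick_factor = p_low / low                    # close to 2: sick roughly doubled
healthy_factor = (1 - p_high) / (1 - high)   # close to 0.5: healthy roughly halved
print(round(sick_factor, 3), round(healthy_factor, 3))
```

For intermediate risks neither approximation holds, which is why the OR needs the odds scale to be interpreted exactly.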

Odds ratio (OR) and risk ratio (RR)
RR is limited by the fact that the risk (probability) cannot exceed 1. This may lead to problems if both high-risk and low-risk populations exist.
OR may vary freely among all positive numbers, no matter the proportion of diseased in the reference group.
RR is always between 1 and OR.
OR is symmetric. If the opposite event is in focus, OR is simply inverted: $\mathrm{OR}(\text{without contraception}) = \frac{1}{\mathrm{OR}(\text{with contraception})}$
RR is asymmetric, so which of the two outcomes can be modelled well must be decided based on substantive arguments (impossible for the present example).
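The symmetry of the OR and the asymmetry of the RR can be seen on the smoking table from the contraception example. The sketch below swaps the outcome (no contraception versus contraception) and compares the two measures; the function names are illustrative.

```python
def risk_ratio(a, b, c, d):
    """RR of the event counted by a and c in the table [[a, b], [c, d]]."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR of the event counted by a and c in the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

# Ever smokers (no, yes) and never smokers (no, yes) use of contraception
no_e, yes_e, no_n, yes_n = 1199, 2609, 1505, 5526

or_no = odds_ratio(no_e, yes_e, no_n, yes_n)     # outcome: no contraception
or_yes = odds_ratio(yes_e, no_e, yes_n, no_n)    # outcome: contraception
rr_no = risk_ratio(no_e, yes_e, no_n, yes_n)
rr_yes = risk_ratio(yes_e, no_e, yes_n, no_n)

print(f"OR: {or_no:.3f} and {or_yes:.3f}, product {or_no * or_yes:.3f}")
print(f"RR: {rr_no:.3f} and {rr_yes:.3f}, product {rr_no * rr_yes:.3f}")
```

The two ORs multiply to 1, the two RRs do not, and in both directions the RR lies between 1 and the corresponding OR.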

Odds ratio (OR): Interpretation
Beyond the doubling/halving aspect, well... Deeper insight into the exact value of the OR probably requires familiarity with odds; perhaps British gamblers find it trivial...
BUT: every statistical analysis intends to describe reality in the simplest possible way. The approximation to reality will never be perfect, but the measure of association should preferably apply acceptably to as many people as possible.
OR has the advantage that the same effect may apply to both high-risk and low-risk populations. Thus it is more likely that a single number can be used to estimate the association for all the relevant individuals.

Use of contraception at the first intercourse
Association with year of birth:
  Birth         Contraception
  year      No      Yes     Total
  1961       36       81      117
  1962      292      594      886
  1963      348      902     1250
  1964      342      861     1203
  1965      313      871     1184
  1966      317      889     1206
  1967      313      864     1177
  1968      246      875     1121
  1969      200      849     1049
  1970      145      694      839
  1971      110      484      594
  1972       42      171      213
  Total    2704     8135    10839
$\chi^2(11)$ test statistic = 117.435, $p < 0.0001$.
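As a rough check of this table, the sketch below recomputes the column totals, the odds of no contraception for each birth year (and their logarithms, which is the scale used for the straight-line pattern), and the Pearson chi-square statistic for homogeneity. It is plain Python written for illustration; the variable names are not from the slides.

```python
import math

# (birth year, no contraception, contraception) from the table above
table = [
    (1961, 36, 81), (1962, 292, 594), (1963, 348, 902), (1964, 342, 861),
    (1965, 313, 871), (1966, 317, 889), (1967, 313, 864), (1968, 246, 875),
    (1969, 200, 849), (1970, 145, 694), (1971, 110, 484), (1972, 42, 171),
]

total_no = sum(no for _, no, _ in table)
total_yes = sum(yes for _, _, yes in table)
n = total_no + total_yes

chi2 = 0.0
for year, no, yes in table:
    row = no + yes
    # Pearson contributions: (observed - expected)^2 / expected for both cells
    for observed, col_total in ((no, total_no), (yes, total_yes)):
        expected = row * col_total / n
        chi2 += (observed - expected) ** 2 / expected
    print(year, f"odds = {no / yes:.3f}", f"ln(odds) = {math.log(no / yes):.3f}")

print(f"chi-square({len(table) - 1}) = {chi2:.3f}")
```

The log odds decrease roughly linearly with birth year, which motivates modelling ln(odds) as a straight line.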

Odds of unprotected intercourse
Pattern: Straight line with logarithmic vertical axis.
Mathematically:
  logarithmic axis: ln(odds)
  straight line: $\ln(\text{odds for birth year } x) = a + bx$
$$P(\text{first intercourse unprotected} \mid \text{birth year } x) = \frac{\exp(a + bx)}{1 + \exp(a + bx)}$$
$$\ln(\mathrm{OR}\ \text{year } x+1 \text{ relative to year } x) = \ln\!\left(\frac{\text{odds}_{x+1}}{\text{odds}_{x}}\right) = \ln(\text{odds}_{x+1}) - \ln(\text{odds}_{x}) = a + b(x + 1) - (a + bx) = b,$$
so OR = exp(b).
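The derivation says that the OR between consecutive birth years does not depend on the year. A quick numerical sketch, using for a and b the intercept and birth-year estimates from the GENMOD output reported in these slides (165.1749 and -0.0847, which apply to never smokers in that model):

```python
import math

a, b = 165.1749, -0.0847   # intercept and slope taken from the reported GENMOD output

def prob(x):
    """P(first intercourse unprotected | birth year x) under the logistic model."""
    eta = a + b * x
    return math.exp(eta) / (1 + math.exp(eta))

def odds(x):
    p = prob(x)
    return p / (1 - p)

for x in (1961, 1966, 1971):
    # The ratio of consecutive odds is the same for every year
    print(x, round(prob(x), 3), round(odds(x + 1) / odds(x), 4))

print("exp(b) =", round(math.exp(b), 4))
```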

Combining the variables
Pattern: Parallel course of two linear curves with logarithmic vertical axis.
Mathematically:
  linear: $\ln(\text{odds for a never smoker born year } x) = a_{\text{never smoker}} + bx$
  parallel: $\ln(\text{odds for an ever smoker born year } x) = \ln(\text{odds for a never smoker born year } x) + c$
  combined: $\ln(\text{odds}) = a_{\text{never smoker}} + bx + c \cdot 1\{\text{smoker}\}$
An ever smoker compared to a never smoker born in the same year:
$$\ln(\mathrm{OR}) = \ln(\text{odds for the ever smoker}) - \ln(\text{odds for the never smoker}) = c,$$
so OR = exp(c).

Logistic regression model
The response $Y$ is always either 0 or 1. We model the mean, which equals the probability $P(Y = 1)$.
$X_1, \ldots, X_k$: explanatory variables, the exposures.
The usual linear approach is not appropriate because probabilities are between 0 and 1, and straight lines will cross these limits (unless they are horizontal, i.e., no association).
One solution to that problem is to use a linear model after transforming the probability using the logit transformation:
$$\operatorname{logit}(p) = \ln\!\left(\frac{p}{1 - p}\right)$$

Logistic regression model cont.
Response and exposure are linked using logit:
$$\operatorname{logit}\big(P(Y = 1 \mid X_1 = x_1, \ldots, X_k = x_k)\big) = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_k x_k$$
Estimation using SAS:
    * Add a trials variable: each woman contributes one binary trial ;
    DATA praevent;
      SET praevent;
      _1 = 1;
    RUN;
    * Binomial model with logit link, response given as events/trials ;
    PROC GENMOD DATA=praevent;
      CLASS smoker;
      MODEL nocontra/_1 = smoker byear / DIST=BIN LINK=LOGIT TYPE1 TYPE3;
    RUN;
[SAS manual: http://support.sas.com/onlinedoc/913/docmainpage.jsp]
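To give an idea of what such a fit involves, here is a minimal Newton-Raphson sketch in plain Python. It is not the GENMOD implementation, and it fits a simpler model than the one above: birth year only, on the grouped counts from the birth-year table, because the slides do not give the joint smoking-by-year counts. Centring the year at 1966 and the function name `fit_logistic` are choices made for this illustration.

```python
import math

# (birth year, events = no contraception, trials) from the birth-year table
data = [(1961, 36, 117), (1962, 292, 886), (1963, 348, 1250), (1964, 342, 1203),
        (1965, 313, 1184), (1966, 317, 1206), (1967, 313, 1177), (1968, 246, 1121),
        (1969, 200, 1049), (1970, 145, 839), (1971, 110, 594), (1972, 42, 213)]

CENTER = 1966  # centring the year keeps the Newton steps numerically stable

def fit_logistic(data, iterations=25):
    """Maximise the binomial log-likelihood of logit(p) = b0 + b1 * (year - CENTER)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for year, events, trials in data:
            x = year - CENTER
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = events - trials * p       # contribution to the score
            w = trials * p * (1.0 - p)        # contribution to the information
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve (information matrix) * step = score for 2 parameters
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)  # from the information at the (converged) last step
    return b0, b1, se_b1

b0, b1, se_b1 = fit_logistic(data)
print(f"slope per birth year = {b1:.4f} (SE {se_b1:.4f}), OR per year = {math.exp(b1):.3f}")
```

The slope from this unadjusted fit need not coincide with the smoking-adjusted estimate that GENMOD reports for the full model.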

Output from PROC GENMOD
  The GENMOD Procedure
  Model Information
    Data Set                      WORK.PRAEVENT
    Distribution                  Binomial
    Link Function                 Logit
    Response Variable (Events)    nocontra
    Response Variable (Trials)    _1
    Observations Used             10839
    Number Of Events              2704
    Number Of Trials              10839
  Class Level Information
    Class     Levels    Values
    smoker    2         Yes ~No
  Criteria For Assessing Goodness Of Fit
    Criterion         DF      Value         Value/DF
    Deviance          11E3    11936.3513    1.1015
    :                 :       :             :
    Log Likelihood            -5968.1756
  Algorithm converged.

Output from PROC GENMOD
  Analysis Of Parameter Estimates
                               Standard    Wald 95%                 Chi-
    Parameter      DF Estimate Error       Confidence Limits        Square   Pr > ChiSq
    Intercept       1 165.1749 16.0237     133.7691    196.5807     106.26   <.0001
    smoker  Yes     1   0.5390  0.0457       0.4494      0.6286     139.06   <.0001
    smoker  ~No     0   0.0000  0.0000       0.0000      0.0000       .      .
    byear           1  -0.0847  0.0082      -0.1007     -0.0687     107.92   <.0001
    Scale           0   1.0000  0.0000       1.0000      1.0000
  NOTE: The scale parameter was held fixed.
  LR Statistics For Type 1 Analysis
    Source       Deviance      DF    Chi-Square    Pr > ChiSq
    Intercept    12177.6509
    smoker       12046.3475     1    131.30        <.0001
    byear        11936.3513     1    110.00        <.0001
  LR Statistics For Type 3 Analysis
    Source       DF    Chi-Square    Pr > ChiSq
    smoker        1    137.70        <.0001
    byear         1    110.00        <.0001

Translation of parameter estimates
Association with ever smoking:
  OR for ever smoker versus never smoker = exp(0.5390) = 1.71,
  95% confidence limits: exp(0.4494) = 1.57 to exp(0.6286) = 1.87.
Association with birth year:
  OR per year = exp(-0.0847) = 0.919,
  95% confidence limits: exp(-0.1007) = 0.904 to exp(-0.0687) = 0.934.
  OR per 5 years = exp(5 × (-0.0847)) = exp(-0.4235) = 0.65,
  95% confidence limits: exp(5 × (-0.1007)) = exp(-0.5035) = 0.60 to exp(5 × (-0.0687)) = exp(-0.3435) = 0.71.
Note: First multiply by 5, then take the exponential.
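The translation from estimates on the logit scale to odds ratios is mechanical and can be sketched in a few lines of Python. The helper `to_odds_ratio` is an illustrative name; the inputs are the estimates and Wald confidence limits from the GENMOD output.

```python
import math

def to_odds_ratio(estimate, lower, upper, units=1):
    """Scale a log-odds estimate and its confidence limits by `units`,
    then exponentiate (multiply first, exponentiate second)."""
    return tuple(math.exp(units * value) for value in (estimate, lower, upper))

smoker = to_odds_ratio(0.5390, 0.4494, 0.6286)
per_year = to_odds_ratio(-0.0847, -0.1007, -0.0687)
per_5_years = to_odds_ratio(-0.0847, -0.1007, -0.0687, units=5)

for label, (est, lo, hi) in (("ever vs never smoker", smoker),
                             ("per birth year", per_year),
                             ("per 5 birth years", per_5_years)):
    print(f"{label}: OR = {est:.3f} (95% CI {lo:.3f} to {hi:.3f})")

# Equivalent route: raise the one-year OR to the 5th power.
# Multiplying the one-year OR by 5 would be wrong.
same_as_power = per_year[0] ** 5
```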

Case-control design
$$\mathrm{OR} = \frac{\#\,\text{diseased exposed}}{\#\,\text{diseased unexposed}} \Big/ \frac{\#\,\text{healthy exposed}}{\#\,\text{healthy unexposed}} = \frac{\#\,\text{diseased exposed} \times \#\,\text{healthy unexposed}}{\#\,\text{healthy exposed} \times \#\,\text{diseased unexposed}} = \frac{\#\,\text{diseased exposed}}{\#\,\text{healthy exposed}} \Big/ \frac{\#\,\text{diseased unexposed}}{\#\,\text{healthy unexposed}}$$
is symmetric in exposure and outcome.
This is the basis for the case-control studies examining exposure among cases and suitable controls.
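A practical consequence of this symmetry is that sampling only a fraction of the healthy individuals leaves the OR unchanged, while the RR computed from the sample becomes meaningless. The sketch below illustrates this with the smoking table, treating "no contraception" as the disease and "ever smoker" as the exposure, and using idealised expected counts for a hypothetical 10% control sampling fraction.

```python
def odds_ratio(dis_exp, dis_unexp, healthy_exp, healthy_unexp):
    """Exposure odds among diseased divided by exposure odds among healthy."""
    return (dis_exp / dis_unexp) / (healthy_exp / healthy_unexp)

def risk_ratio(dis_exp, dis_unexp, healthy_exp, healthy_unexp):
    """Apparent risk among exposed divided by apparent risk among unexposed."""
    return (dis_exp / (dis_exp + healthy_exp)) / (dis_unexp / (dis_unexp + healthy_unexp))

# Full cohort counts from the smoking table
cohort = dict(dis_exp=1199, dis_unexp=1505, healthy_exp=2609, healthy_unexp=5526)
# Idealised case-control sample: all cases, expected counts for 10% of the controls
sample = dict(cohort, healthy_exp=260.9, healthy_unexp=552.6)

or_cohort, or_sample = odds_ratio(**cohort), odds_ratio(**sample)
rr_cohort, rr_sample = risk_ratio(**cohort), risk_ratio(**sample)
print(f"OR: cohort {or_cohort:.3f}, case-control sample {or_sample:.3f}")
print(f"RR: cohort {rr_cohort:.3f}, case-control sample {rr_sample:.3f}")
```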

Matched case-control designs
In a frequency-matched case-control design, the matching variable must be included in the model in the same way as the matching was done (e.g., frequency matching in two-year intervals implies a class variable grouping in two-year intervals).
Individually matched case-control sampling designs must be analyzed using conditional logistic regression. These designs cannot be analyzed using PROC GENMOD; the analyses require programs suitable for survival analysis.

Individually matched case-control designs
    DATA matched;
      SET rawdata;
      IF case=1 THEN dummy_time=1;  * cases ;
      IF case=0 THEN dummy_time=2;  * controls ;
    RUN;
    * One stratum per matched pair/group; controls are treated as censored ;
    PROC PHREG DATA=matched NOSUMMARY;
      MODEL dummy_time*case(0)=exposure;
      STRATA matchgrp;
    RUN;
Here, the variable dummy_time is set to 1 for cases and 2 for controls to ensure that the event time for the controls is later than the event time for the case.
NOSUMMARY is not necessary, but is included to avoid a print-out of the number of cases (Events) and the number of controls (Censored) for each single matched pair/group.
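For the simplest case, 1:1 matched pairs with a single binary exposure, the conditional likelihood that such a stratified analysis maximises can be written down by hand: only pairs discordant for exposure contribute, and the maximum is at the log of the ratio of the two discordant counts. The sketch below checks this numerically. The pair counts are hypothetical and chosen only for illustration; they are not from the study, and the grid search is a stand-in for a proper optimiser.

```python
import math

# Hypothetical discordant-pair counts (illustration only, not from the study)
case_only_exposed = 40      # pairs in which only the case is exposed
control_only_exposed = 20   # pairs in which only the control is exposed

def conditional_loglik(beta):
    """Conditional log-likelihood for 1:1 matched pairs with a binary exposure.
    Pairs concordant for exposure contribute a constant and are omitted."""
    log_denominator = math.log(1 + math.exp(beta))
    return (case_only_exposed * (beta - log_denominator)
            + control_only_exposed * (-log_denominator))

# Crude grid search over beta = ln(OR) in steps of 0.001
grid = [i / 1000 for i in range(-3000, 3001)]
beta_hat = max(grid, key=conditional_loglik)
closed_form = math.log(case_only_exposed / control_only_exposed)

print(f"grid maximum: {beta_hat:.3f}, ln(40/20): {closed_form:.3f}")
print(f"matched OR estimate: {math.exp(closed_form):.2f}")
```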