Stat 642, Lecture notes for 04/12/05 96

Similar documents
Multinomial Logistic Regression Models

Basic Medical Statistics Course

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 18.1 Logistic Regression (Dose - Response)

ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam

ssh tap sas913, sas

ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam

7. Assumes that there is little or no multicollinearity (however, SPSS will not assess this in the [binary] Logistic Regression procedure).

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links

Chapter 5: Logistic Regression-I

8 Nominal and Ordinal Logistic Regression

Statistics in medicine

CHAPTER 1: BINARY LOGIT MODEL

Logistic Regression. Continued Psy 524 Ainsworth

More Statistics tutorial at Logistic Regression and the new:

Section IX. Introduction to Logistic Regression for binary outcomes. Poisson regression

Longitudinal Modeling with Logistic Regression

Correlation and regression

Treatment Variables INTUB duration of endotracheal intubation (hrs) VENTL duration of assisted ventilation (hrs) LOWO2 hours of exposure to 22 49% lev

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013

Logistic Regression Models for Multinomial and Ordinal Outcomes

Lecture 5: Poisson and logistic regression

ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES. Cox s regression analysis Time dependent explanatory variables

Lecture 2: Poisson and logistic regression

Faculty of Health Sciences. Regression models. Counts, Poisson regression, Lene Theil Skovgaard. Dept. of Biostatistics

Chapter 14 Logistic and Poisson Regressions

Review of Multinomial Distribution If n trials are performed: in each trial there are J > 2 possible outcomes (categories) Multicategory Logit Models

Prediction of ordinal outcomes when the association between predictors and outcome diers between outcome levels

Categorical data analysis Chapter 5

Simple logistic regression

7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis

Sections 3.4, 3.5. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Lecture 7 Time-dependent Covariates in Cox Regression

Lecture 12: Effect modification, and confounding in logistic regression

A stationarity test on Markov chain models based on marginal distribution

STA6938-Logistic Regression Model

Statistics 262: Intermediate Biostatistics Model selection

22s:152 Applied Linear Regression. Example: Study on lead levels in children. Ch. 14 (sec. 1) and Ch. 15 (sec. 1 & 4): Logistic Regression

REGRESSION ANALYSIS FOR TIME-TO-EVENT DATA THE PROPORTIONAL HAZARDS (COX) MODEL ST520

A class of latent marginal models for capture-recapture data with continuous covariates

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

9 Generalized Linear Models

Logistic Regression: Regression with a Binary Dependent Variable

STAT 525 Fall Final exam. Tuesday December 14, 2010

PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

Introduction to Statistical Analysis

STAT 7030: Categorical Data Analysis

Generalized Linear Modeling - Logistic Regression

Analysing categorical data using logit models

Unit 9: Inferences for Proportions and Count Data

LOGISTIC REGRESSION. Lalmohan Bhar Indian Agricultural Statistics Research Institute, New Delhi

Introduction to mtm: An R Package for Marginalized Transition Models

Introduction to the Logistic Regression Model

Mixed Models for Longitudinal Ordinal and Nominal Outcomes

Comparing IRT with Other Models

ZERO INFLATED POISSON REGRESSION

Consider Table 1 (Note connection to start-stop process).

Unit 9: Inferences for Proportions and Count Data

Chapter 1. Modeling Basics

Estimating the Marginal Odds Ratio in Observational Studies

Lecture 14: Introduction to Poisson Regression

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Generalized logit models for nominal multinomial responses. Local odds ratios

Logistic Regression Models to Integrate Actuarial and Psychological Risk Factors For predicting 5- and 10-Year Sexual and Violent Recidivism Rates

REVISED PAGE PROOFS. Logistic Regression. Basic Ideas. Fundamental Data Analysis. bsa350

Poisson regression: Further topics

Hypothesis Testing, Power, Sample Size and Confidence Intervals (Part 2)

Chapter 19: Logistic regression

Bayes methods for categorical data. April 25, 2017

Stat 5101 Lecture Notes

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

Generalized linear models

You can specify the response in the form of a single variable or in the form of a ratio of two variables denoted events/trials.

2/26/2017. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

The Sum of Standardized Residuals: Goodness of Fit Test for Binary Response Model

Generalized Linear Models for Non-Normal Data

Logistic regression analysis. Birthe Lykke Thomsen H. Lundbeck A/S

Calculating Effect-Sizes. David B. Wilson, PhD George Mason University

Model Estimation Example

One-Way Tables and Goodness of Fit

STAT 526 Spring Final Exam. Thursday May 5, 2011

Generalized linear models for binary data. A better graphical exploratory data analysis. The simple linear logistic regression model

Survival Regression Models

Models for Ordinal Response Data

Today. HW 1: due February 4, pm. Aspects of Design CD Chapter 2. Continue with Chapter 2 of ELM. In the News:

Exact McNemar s Test and Matching Confidence Intervals Michael P. Fay April 25,

Applied Epidemiologic Analysis

LOGISTICS REGRESSION FOR SAMPLE SURVEYS

Generalized Models: Part 1

Chapter 1 Statistical Inference

Beyond GLM and likelihood

Time-Invariant Predictors in Longitudinal Models

Using Estimating Equations for Spatially Correlated A

Lecture 5: LDA and Logistic Regression

Logistic regression modeling the probability of success

Transcription:

Stat 642, Lecture notes for 04/12/05 96 Hosmer-Lemeshow Statistic The Hosmer-Lemeshow Statistic is another measure of lack of fit. Hosmer and Lemeshow recommend partitioning the observations into 10 equal sized groups according to their predicted probabilities. Then where G 2 HL = 10 j=1 O j E j 2 E j 1 E j /n j χ2 8 n j = Number of observations in the j th group O j = i E j = i y ij = Observed number of cases in the j th group ˆp ij = Expected number of cases in the j th group Example: Using the THS data described earlier 4/7/2005, I added the lackfit option to the model statement in PROC LOGISTIC as follows: model cens28dy = trt age sbpcat aist hctbase / lackfit; The output corresponding to the Hosmer-Lemeshow statistic is: Partition for the Hosmer and Lemeshow Test CENS28DY = 0 CENS28DY = 1 Group Total Observed Expected Observed Expected 1 9 0 0.14 9 8.86 2 9 0 1.60 9 7.40 3 9 4 3.68 5 5.32 4 9 9 5.28 0 3.72 5 9 7 6.52 2 2.48 6 9 5 7.47 4 1.53 7 9 9 8.03 0 0.97 8 9 7 8.50 2 0.50 9 9 9 8.82 0 0.18 10 12 12 11.96 0 0.04 Hosmer and Lemeshow Goodness-of-Fit Test Chi-Square DF Pr > ChiSq 19.5799 8 0.0120

Stat 642, Lecture notes for 04/12/05 97 This output shows the observed and expected numbers of cases and controls within each group, and the final test statistic. The chi-square statistic suggests that there may be lack of fit, so I refit the model including a quadratic term for age: model cens28dy = trt age age sbpcat aist hctbase / lackfit; Partition for the Hosmer and Lemeshow Test CENS28DY = 0 CENS28DY = 1 Group Total Observed Expected Observed Expected 1 9 0 0.13 9 8.87 2 9 0 1.36 9 7.64 3 9 3 3.38 6 5.62 4 9 8 5.24 1 3.76 5 9 8 6.58 1 2.42 6 9 7 7.74 2 1.26 7 9 7 8.22 2 0.78 8 9 8 8.54 1 0.46 9 9 9 8.85 0 0.15 10 12 12 11.97 0 0.03 Hosmer and Lemeshow Goodness-of-Fit Test Chi-Square DF Pr > ChiSq 9.8690 8 0.2743 This model shows no evidence of lack of fit based on the H-L statistic, so apparently any lack of fit has been corrected. Ideally, incorrect model specification such as non-linearity in the predictors or missing predictors should be detectable by this statistic. However, depending on the model, G 2 LH may not be particularly sensitive to departures from the true model. The H-L test does not provide any information regarding the nature of the lack of fit observed. Other procedures may be more useful. Other goodness-of-fit procedures Tsiatis A note on the goodness-of-fit test for the istic regression model, Biometrika, 67:259-251, 1980 proposes a test for goodness-of-fit obtained by grouping observations into k groups according to covariate values. The goodness-of-fit test uses the score test of H 0 that the fitted model is correct, versus the model which adds to the fitted model a categorical variable corresponding to the groupings. For example, for the model above, one could group observations by age categories possibly equally spaced, or of equal size - the groups are somewhat arbitrary, and perform the score test for the addition of these groupings to the model similar to what one would do in a forward selection procedure.

Stat 642, Lecture notes for 04/12/05 98 Lin, Wei and Ying Model checking techniques based on cumulative residuals, Biometrics, 58:1-12, 2002 propose assessing goodness-of-fit using moving sums of residuals of the form: W x = i Ix b < X ij < x + by i Ey i where I is one if the argument is true, and zero otherwise, x is an arbitrary value of covariate j and X ij is the value of covariate j for subject i. W x is simply the total observed - expected for all observations with the value covariate j within b units of x. By plotting W x against x for values of x over the range of the covariate, patterns may be apparent which suggest lack of fit. LWY propose a formal test using simulation to assess whether the observed patterns are likely by chance alone. The score test of Tsiatis may also be used to provide a formal test. Polychotomous Responses note that in the interest of time, I skipped over some of the material in this section during class Now we consider the case in which we have more than two levels of outcome variable. Examples Type of Disease Site of Onset e.g. tumor Disease Severity ordered Cause of Death Suppose that we have k + 1 response categories, 0, 1,..., k. Under a multinomial model there are k + 1 event probabilities which sum to one. Hence there are k independent its to model. We consider three approaches to this modeling, all of which produce different results. Log-linear model Continuation Ratios Cumulative Logit Log-linear Model Poisson Sampling model Suppose we have a covariate with levels 0, 1, 2,.... We may construct the following table:

Stat 642, Lecture notes for 04/12/05 99 Response 0... 1....... k... 0 1 2... Covariate Values For the i, j cell, we have the -linear model λ ij = α i + β j + γ ij where we impose the constraint that γ 0j = γ i0 = 0 i, j. We could, for example, choose a baseline category for the response, say i = 0. Then λij λ 0j = pij p 0j = α i α 0 + γ ij Then γ ij is the odds ratio for response level i versus level 0 and exposure level j versus level 0. This may be called the baseline it model. Similarly, if the responses are naturally ordered, then it may be reasonable to consider adjacent categories: λi+1,j λ i,j = pi+1,j p i,j = α i+1 α i + γ i+1,j γ i,j = α i + γ i,j Then γij is the odds ratio for response level i + 1 versus level i and exposure level j versus level 0. More generally, we might have In which case pij p 0j λ ij = x T j β i We may compute the probabilities themselves via = x T j β i β 0 = x T j β i. p ij = e x jβ i 1 + l 1 ex jβ l For example, if k = 2, we have: p 0j = 1 1 + e x jβ1 + e x jβ2, p 1j = e x jβ 1 1 + e x jβ 1 + e x jβ 2 p 2j = e x jβ 2 1 + e x jβ 1 + e x jβ 2 Hypothetical example: Suppose we consider death from Heart Disease or Cancer. There is no particular order to these outcomes, so this kind of model may make some sense. In the accompanying

Stat 642, Lecture notes for 04/12/05 100 plot, the probabilities of death from heart disease and cancer change as a function of age and this model allows them to change quite independently. Hypothetical Example Death from Heart Disease vs. Cancer Heart Disease Cancer Survival or Other Probability Age Note that if we consider a plot of its for each of heart disease and cancer versus survival we have straight lines: Hypothetical Example - it scale Heart Disease Cancer p i /p 0 Age

Stat 642, Lecture notes for 04/12/05 101 Continuation Ratios Take k = 2 for a moment and consider two parallel schemes for constructing the 2 independent its: In General: Scheme A p0j p 1j + p 2j p1j p 2j pij l>i p lj Scheme B p2j p 0j + p 1j p1j p 0j pij l<i p lj If the categories are 0, 1, 2,..., k, then under Scheme A, we first consider the lowest category i = 0 versus all remaining categories. For each category i, i < k, we consider category i versus all higher categories. Under Scheme B, we consider for each category i, i < 0, we consider category i versus all lower categories. For example, if the categories are Alive, Heart Disease Death and Cancer Death, we might consider its for Alive versus Dead and among the deaths Heart Disease Death versus Cancer Death. If we set the i th it equal to x T j β i, we can express the multinomial probabilities of being in a particular category as follows: take k = 3, Scheme A. First we have that next, Similarly, Finally, p 0j = Pr{y j = 0} = ext β 0 1 + e xt β 0 p 1j = Pr{y j = 1} = Pr{y j = 1 y j 1} Pr{y j 1} = e xt β 1 1 + e xt β 1 1 1 + e xt β 0. p 2j = Pr{y j = 2} = Pr{y j = 2 y j 2} Pr{y j 2 y j 1} Pr{y j 1} = e xt β 2 1 1 1 + e xt β 2 1 + e xt β 1 1 + e xt β 0 p 3j = Pr{y j = 3} = Pr{y j = 3 y j 2} Pr{y j 2 y j 1} Pr{y j 1} = 1 1 1 1 + e xt β 2 1 + e xt β 1 1 + e xt β 0

Stat 642, Lecture notes for 04/12/05 102 In all cases, the p ij are products of conditional probabilities each of which depends one only on one of the β i. Since the likelihood is composed of products of the above p ij, it will also factor into a product of conditional likelihoods, each of which can be independently maximized. Hence, this model can be fit as a series of binary istic regression models. Each binary model is fit by restricting the data set to only those observations with responses in categories i through k under Scheme A or 0 through i under Scheme B and considering response level i versus the remaining categories. Continuation Ratio models give different fitted values than -linear models and Scheme A gives different fitted values than Scheme B. Cumulative Logits. Cumulative Logit models are typically used when we have ordinal responses. That is, we may consider the outcomes as a series of progressively worse or more extreme outcomes. In this case the relevant questions usually involve probabilities of subjects progressing from one category to the next, or probabilities of a subject being above of a given category versus at or lower than a given category. Example: Disease Severity Percent of Subjects 0 20 40 60 80 100 None Mild Moderate Severe - + Exposure In the above plot, the size of the none category decreases slightly with exposure. This suggests that exposure increases the risk of disease without regard for severity. On the other hand, the mild category also decreases with exposure, yet this fact by itself is not very meaningful since we don t know whether or not subjects who otherwise would have been in the mild category were moved to the none category or the moderate or worse categories. The conclusion might be quite different depending on which of these happened. More meaningful is the observation that

Stat 642, Lecture notes for 04/12/05 103 the combined none and mild categories shrunk, suggesting that the rate of moderate or worse disease increased. This leads us to the cumulative it model: Pr{yj i} = α i + x T j Pr{y j > i} β i. If β i = β i, this is the Proportional Odds model. We also require that α 0 α 1... although this will hold automatically in a proportional odds model. eα i+x T j β i We have Pr{y j i} =. 1 + e α i+x T j β i Note that if β i β i for some i i, then α i + β i x = α i + β i x for some x and the lines will cross. If α i + x T j β i > α i + x T j β i for some i < i then Pr{y j i} > Pr{y j i } which is nonsense. Hence if a cumulative it model is NOT a proportional odds model, there may be particular values of x for which the predicted values are inconsistent. Under the proportional odds model, the shapes of all curves are the same, they are shifted progressively to the left for increasing i. The horizontal distance between adjacent curves is α i α i 1 /β. Pr{y i} 0.0 0.2 0.4 0.6 0.8 1.0 i = 0 i = 1 i = 2 Pr{y = 3 x} Pr{y = 2 x} Pr{y = 1 x} Pr{y = 0 x} Odds Ratios: Let the odds of {y j i} be ψ ij = Pr{y j i} Pr{y j > i} = eα i+x T j β x so and ψ ij ψ i j ψ ij ψ ij = e α i α i is independent of j proportional odds = e x j x j T β is independent of i With more than two levels of the outcome variable, SAS fits a proportional odds model and uses the score test for to test for proportionality.