Lecture 11. Interval-Censored and Discrete-Time Data. Statistics 255: Survival Analysis. Presented March 3, 2016


Statistics 255 - Survival Analysis. Presented March 3, 2016. Dan Gillen, Department of Statistics, University of California, Irvine

First question: Are the data truly discrete? Examples:
- Number of attempts at a puzzle before it is solved
- Number of grades completed before dropping out of school
- Number of doses required for a given effect to be observed
- Number of inseminations of cows required to achieve pregnancy
- Years measured by the number of rings on a tree (time is the number of seasonal cycles)
- Number of screening visits to detect recurrence of disease

If failures are really instantaneous and the time measure is really continuous, then the data are not truly discrete, but are interval censored. Examples:
1. Alzheimer's disease cohort study
   - Subjects enter the cohort with normal cognition
   - Annual neuropsychological testing to assess conversion to a demented state
2. Breast cancer screening
   - Time from birth until the development of breast cancer
   - Subjects screened every 5 years until age 40, then each year after that
3. Maintenance therapy studies
   - Cohort of cancer patients in remission
   - Time to disease recurrence
   - Patients regularly screened

Types of interval censoring:
- Fixed interval censoring: the researcher (or some external process) selects a fixed screening interval, and every subject is screened according to the defined intervals.
- Random interval censoring: screening intervals vary from screening to screening and from person to person.
- Independent interval censoring: the occurrence of screenings and the lengths of the intervals are independent of the failure times (conditional on covariates).

Example of where independence does not hold: relapse prevention for ulcers, comparing two treatments. There are regular screening intervals for endoscopy (every 6 months), but additional endoscopies occur at other visits due to symptoms or other health problems. The effective screening intervals are therefore not independent of relapse.

We will consider fixed interval censoring...

Notation: Suppose patients are followed up or screened at times $t_1, t_2, \ldots, t_j, \ldots, t_J$.

The "complete" right-censored data would be given by
$$(Y_1, \delta_1, x_1), \ldots, (Y_n, \delta_n, x_n).$$

The "observed" interval-censored data are given by
$$(Y_1^*, \delta_1, x_1), \ldots, (Y_n^*, \delta_n, x_n), \quad \text{where } Y_i^* = j \text{ if } Y_i \in (t_{j-1}, t_j].$$
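As a small illustration of this notation, the map from an exact time $Y_i$ to its interval index $j$ with $Y_i \in (t_{j-1}, t_j]$ can be sketched in Python (shown in Python only as a self-contained illustration; the cut points below are hypothetical, not from the course data):

```python
import bisect

# Hypothetical screening times t_1 < ... < t_J (with t_0 = 0 implicit)
cutpoints = [1, 2, 4, 6, 10]

def interval_index(y, cuts):
    """Return j (1-based) with y in the right-closed interval (t_{j-1}, t_j].

    bisect_left counts the cut points strictly below y, so a time landing
    exactly on t_j is assigned to interval j, matching right-closed intervals.
    """
    return bisect.bisect_left(cuts, y) + 1

# interval_index(1, cutpoints) is 1 (the interval (0, 1]);
# interval_index(1.5, cutpoints) is 2 (the interval (1, 2]).
```

This is essentially the convention R's cut() with right-closed intervals uses later in the lecture.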

Analysis Strategies:
- Univariate analysis: cohort life table analysis (assumes censoring and deaths are uniformly distributed within each interval)
- Regression analysis: fixed-interval proportional hazards model

Consider the proportional hazards model for right-censored data,
$$\lambda_i(t \mid x_i) = \lambda_0(t)\, e^{\beta^T x_i},$$
so that
$$S_i(t \mid x_i) = [S_0(t)]^{\exp(\beta^T x_i)}.$$
Now consider this model in the interval-censored setting, so that
$$S_i(t_j \mid x_i) = [S_0(t_j)]^{\exp(\beta^T x_i)} = \Pr[T > t_j \mid x_i] = \Pr[\text{surviving the } j\text{th interval} \mid x_i].$$

Now consider the conditional probability of failing in an interval given survival up to the start of the interval:
$$\Pr[\text{failing in } j\text{th interval} \mid \text{survived } (j{-}1)\text{th interval},\, x_i]
= \frac{S_i(t_{j-1} \mid x_i) - S_i(t_j \mid x_i)}{S_i(t_{j-1} \mid x_i)}
= 1 - \frac{S_i(t_j \mid x_i)}{S_i(t_{j-1} \mid x_i)}
= 1 - \left( \frac{S_0(t_j)}{S_0(t_{j-1})} \right)^{\exp(\beta^T x_i)}.$$

Now, let
$$\pi_j = \Pr[\text{failing in } j\text{th interval} \mid \text{survived } (j{-}1)\text{th interval},\, x_i],$$
so that
$$\pi_j = 1 - \left( \frac{S_0(t_j)}{S_0(t_{j-1})} \right)^{\exp(\beta^T x_i)},
\qquad
1 - \pi_j = \left( \frac{S_0(t_j)}{S_0(t_{j-1})} \right)^{\exp(\beta^T x_i)},$$
and hence
$$\log(1 - \pi_j) = \exp(\beta^T x_i)\, \log\!\left( \frac{S_0(t_j)}{S_0(t_{j-1})} \right)
= -\exp(\beta^T x_i)\,[\Lambda_0(t_j) - \Lambda_0(t_{j-1})].$$

Thus we have
$$\log[-\log(1 - \pi_j)] = \log[\Lambda_0(t_j) - \Lambda_0(t_{j-1})] + \beta^T x_i \equiv \gamma_j + \beta^T x_i.$$
This is just a binary regression model (i.e. a GLM) with a complementary log-log (cloglog) link and interval-specific intercepts. Provided that $J$ (the number of intervals) is not too large, this model can be fit with ordinary software.
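The algebra above can be checked numerically. A minimal Python sketch (the baseline survival values and the linear predictor are made-up numbers, not from any dataset):

```python
import math

def cloglog_lhs(pi_j):
    """log(-log(1 - pi_j)): the cloglog link applied to the conditional failure probability."""
    return math.log(-math.log(1.0 - pi_j))

def interval_failure_prob(s0_prev, s0_cur, beta_x):
    """pi_j = 1 - (S0(t_j)/S0(t_{j-1}))^exp(beta'x) under the PH model."""
    return 1.0 - (s0_cur / s0_prev) ** math.exp(beta_x)

# Hypothetical baseline survival S0(t_0), ..., S0(t_3) and linear predictor beta'x
S0 = [1.0, 0.9, 0.75, 0.6]
beta_x = 0.3

for j in range(1, len(S0)):
    pi_j = interval_failure_prob(S0[j - 1], S0[j], beta_x)
    # gamma_j = log of the baseline cumulative-hazard increment Lambda0(t_j) - Lambda0(t_{j-1}),
    # where Lambda0(t) = -log S0(t)
    gamma_j = math.log(math.log(S0[j - 1]) - math.log(S0[j]))
    assert abs(cloglog_lhs(pi_j) - (gamma_j + beta_x)) < 1e-12
```

Each interval's conditional failure probability, once pushed through the cloglog link, separates exactly into an interval intercept plus the linear predictor, which is why an ordinary binomial GLM applies.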

Example (Section 1.14, K & M)

Outcome: time to cessation of breast feeding (weeks).
Covariate of interest: smoking, adjusting for race (race = 1 (White), race = 2 (Black), race = 3 (other)).

This is illustrative, so let's do the Cox model using the Efron approximation for ties (for comparison). I cannot compute the exact partial likelihood in R (on my computer).

Cox model with the Efron approximation:

> library( survival )
> bfeed <- read.table( "http://www.ics.uci.edu/~dgillen/STAT255//bfeed.txt" )
> names( bfeed ) <- c( "duration", "icompbf", "racemom", "poverty",
+                      "momsmoke", "momdrink", "momage",
+                      "yob", "momeduc", "precare" )
> bfeed <- bfeed[ order(bfeed$duration), ]
> bfeed$id <- 1:dim(bfeed)[1]
>
> ##
> ##### Fit Cox model with Efron adjustment for ties
> ##
> fit <- coxph( Surv(duration,icompbf) ~ factor(racemom) + momsmoke,
+               data=bfeed, method="efron" )
> summary(fit)

                 exp(coef) exp(-coef) lower.95 upper.95
factor(racemom)2      1.17      0.858    0.952     1.43
factor(racemom)3      1.38      0.727    1.144     1.66
momsmoke              1.32      0.756    1.140     1.53

Now recode the data into intervals:

> bfeed$int.durat <- cut( bfeed$duration, c(0,1,2,4,6,10,16,24,36,52,192),
+                         include.lowest=TRUE, ordered_result=TRUE )
> xtabs( ~ int.durat + icompbf, data=bfeed )
          icompbf
int.durat    0   1
  [0,1]      2  77
  (1,2]      3  71
  (2,4]      6 119
  (4,6]      9  75
  (6,10]     7 109
  (10,16]    5 148
  (16,24]    3 107
  (24,36]    0  74
  (36,52]    0  85
  (52,192]   0  27

Now, expand the data to set up the fixed-interval censored proportional hazards model:

> ##
> ##### Expand dataset to consider interval censoring
> ##
> u.evtimes <- as.ordered( unique( bfeed$int.durat[ bfeed$icompbf==1 ] ) )
> num.event <- length( u.evtimes )
> bfeed.texpand <- bfeed[ , c("id", "int.durat", "icompbf", "racemom", "momsmoke") ]
> bfeed.texpand <- bfeed.texpand[ rep( bfeed.texpand$id, each=num.event ), ]
> bfeed.texpand$interval <- rep( u.evtimes, sum( !duplicated(bfeed.texpand$id) ) )
> bfeed.texpand <- bfeed.texpand[ bfeed.texpand$int.durat >= bfeed.texpand$interval, ]
> bfeed.texpand <- bfeed.texpand[ dim(bfeed.texpand)[1]:1, ]
> bfeed.texpand$icompbf <- ifelse( !duplicated(bfeed.texpand$id), bfeed.texpand$icompbf, 0 )
> bfeed.texpand <- bfeed.texpand[ dim(bfeed.texpand)[1]:1, ]
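The expansion the R code above performs (one row per subject per interval entered, with the event indicator set only on the final row for subjects who fail) can be sketched in plain Python; the function and the toy subjects are ours, for illustration only:

```python
def expand_person_period(subjects):
    """subjects: list of (subject_id, last_interval, event) tuples, intervals
    numbered from 1; returns (id, interval, y) person-period rows."""
    rows = []
    for sid, last, event in subjects:
        for j in range(1, last + 1):
            # y = 1 only in the interval where the failure occurred
            rows.append((sid, j, int(event == 1 and j == last)))
    return rows

# Subject 1 fails in interval 3; subject 2 is censored during interval 2:
rows = expand_person_period([(1, 3, 1), (2, 2, 0)])
# rows == [(1, 1, 0), (1, 2, 0), (1, 3, 1), (2, 1, 0), (2, 2, 0)]
```

Each row then contributes one Bernoulli observation to the binary regression that follows.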

Let's have a look at the expanded data...

> bfeed.texpand[ c(1:5,300:305), ]
       id int.durat icompbf racemom momsmoke interval
2       1     [0,1]       1       1        1    [0,1]
28      2     [0,1]       1       1        1    [0,1]
35      3     [0,1]       1       1        0    [0,1]
86      4     [0,1]       1       1        1    [0,1]
88      5     [0,1]       1       1        1    [0,1]
564   178     (2,4]       0       1        1    [0,1]
564.1 178     (2,4]       0       1        1    (1,2]
564.2 178     (2,4]       1       1        1    (2,4]
620   179     (2,4]       0       3        1    [0,1]
620.1 179     (2,4]       0       3        1    (1,2]
620.2 179     (2,4]       1       3        1    (2,4]

Now, fit the fixed-interval survival model:

> fit.intph <- glm( icompbf ~ factor(as.numeric(interval)) + factor(racemom) + momsmoke,
+                   data=bfeed.texpand, family=binomial(link="cloglog") )
> summary(fit.intph)

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)
(Intercept)                    -2.61481    0.12085  -21.64  < 2e-16 ***
factor(as.numeric(interval))2   0.00843    0.16467    0.05  0.95916
factor(as.numeric(interval))3   0.66003    0.14633    4.51  6.5e-06 ***
factor(as.numeric(interval))4   0.35893    0.16241    2.21  0.02710 *
factor(as.numeric(interval))5   0.91960    0.14908    6.17  6.9e-10 ***
factor(as.numeric(interval))6   1.55781    0.14094   11.05  < 2e-16 ***
factor(as.numeric(interval))7   1.68622    0.15050   11.20  < 2e-16 ***
factor(as.numeric(interval))8   1.81313    0.16433   11.03  < 2e-16 ***
factor(as.numeric(interval))9   2.84750    0.16520   17.24  < 2e-16 ***
factor(as.numeric(interval))10  5.33164   20.19268    0.26  0.79175
factor(racemom)2                0.14344    0.10511    1.36  0.17235
factor(racemom)3                0.33313    0.09762    3.41  0.00064 ***
momsmoke                        0.29272    0.07717    3.79  0.00015 ***

Use glmci() (on the course webpage) to exponentiate results and produce CIs:

> signif( glmci( fit.intph ), 3 )
                               exp( Est )  ci95.lo  ci95.hi  z value Pr(>|z|)
(Intercept)                        0.0732   0.0577 9.27e-02 -21.6000   0.0000
factor(as.numeric(interval))2      1.0100   0.7300 1.39e+00   0.0512   0.9590
factor(as.numeric(interval))3      1.9300   1.4500 2.58e+00   4.5100   0.0000
factor(as.numeric(interval))4      1.4300   1.0400 1.97e+00   2.2100   0.0271
factor(as.numeric(interval))5      2.5100   1.8700 3.36e+00   6.1700   0.0000
factor(as.numeric(interval))6      4.7500   3.6000 6.26e+00  11.1000   0.0000
factor(as.numeric(interval))7      5.4000   4.0200 7.25e+00  11.2000   0.0000
factor(as.numeric(interval))8      6.1300   4.4400 8.46e+00  11.0000   0.0000
factor(as.numeric(interval))9     17.2000  12.5000 2.38e+01  17.2000   0.0000
factor(as.numeric(interval))10   207.0000   0.0000 3.19e+19   0.2640   0.7920
factor(racemom)2                   1.1500   0.9390 1.42e+00   1.3600   0.1720
factor(racemom)3                   1.4000   1.1500 1.69e+00   3.4100   0.0006
momsmoke                           1.3400   1.1500 1.56e+00   3.7900   0.0001

Interpretation: The estimated relative hazard of cessation of breast feeding for smoking versus non-smoking mothers is exp(0.29) = 1.34 (95% CI: 1.15-1.56; p-value = .0001). This compares to an estimate of 1.32 when the complete data were used with the Efron approximation.

What about situations where time truly is discrete?
- Number of attempts a child takes to solve a puzzle before it is solved
- Number of grades completed before dropping out of school
- Number of doses required for a given effect to be observed

It is natural to think of these settings as continuation trials:
- A child will only attempt to solve a puzzle a second time if they failed to solve it the first time
- A student can only fail the 10th grade if they passed the 9th grade
- A patient is only given dose j + 1 if they did not show benefit at dose j

Discrete-time hazard function. Here we may wish to focus on conditional probabilities: What is the probability of success on attempt j + 1 given failure on attempt j? Notice how this is similar to a hazard: What is the probability of failure at time t, given survival up to time t?

Discrete-time hazard function. The survival function for discrete-time data is defined as
$$S(t_j) = \Pr[T > t_j]$$
(the same as the survival function for continuous-time data). The hazard function for discrete-time data is
$$\lambda(t_j) = \frac{\Pr[T = t_j]}{\Pr[T \geq t_j]} = \frac{\Pr[T = t_j]}{\Pr[T > t_{j-1}]} = \frac{S(t_{j-1}) - S(t_j)}{S(t_{j-1})}$$
(different from continuous-time data, but a natural extension).
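As a numeric illustration (the survival probabilities below are hypothetical), discrete hazards defined this way recover the survival function through the product-limit identity $S(t_j) = \prod_{k \le j} [1 - \lambda(t_k)]$:

```python
# Hypothetical discrete-time survival function S(t_0)=1, S(t_1), S(t_2), S(t_3)
S = [1.0, 0.8, 0.5, 0.4]

# lambda(t_j) = [S(t_{j-1}) - S(t_j)] / S(t_{j-1})
hazards = [(S[j - 1] - S[j]) / S[j - 1] for j in range(1, len(S))]

# Check the product-limit identity S(t_j) = prod_{k<=j} (1 - lambda(t_k))
surv = 1.0
for j, lam in enumerate(hazards, start=1):
    surv *= 1.0 - lam
    assert abs(surv - S[j]) < 1e-12
```

This is the discrete-time analogue of the Kaplan-Meier construction mentioned on the next slide.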

Analysis Strategies:
- Univariate analysis: ordinary Kaplan-Meier / Nelson-Aalen estimators
- Regression analysis: discrete-time proportional hazards model (aka the continuation ratio model)

Let $\lambda(t_j)$ denote the discrete hazard at time $t_j$. Therefore $\lambda(t_j)$ is the probability of failure at $t_j$, given survival up to $t_j$ (i.e. past $t_{j-1}$). The odds of failure at $t_j$, given survival up to $t_j$, are then given by
$$\frac{\lambda(t_j)}{1 - \lambda(t_j)}.$$
Now suppose that
$$\frac{\lambda_i(t_j \mid x_i)}{1 - \lambda_i(t_j \mid x_i)} = \frac{\lambda_0(t_j)}{1 - \lambda_0(t_j)}\, \exp(\beta^T x_i).$$

This can be thought of as a proportional odds model on the conditional probability of failure at time $t_j$, given survival up to $t_j$. It is referred to as a continuation ratio model. We can also rewrite it so that
$$\log\!\left( \frac{\lambda_i(t_j \mid x_i)}{1 - \lambda_i(t_j \mid x_i)} \right)
= \log\!\left( \frac{\lambda_0(t_j)}{1 - \lambda_0(t_j)} \right) + \beta^T x_i
= \gamma_j + \beta^T x_i.$$
This has the form of a logistic regression model with separate intercepts for each follow-up time...
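The equivalence of the two displays can be confirmed numerically; in this Python sketch the baseline hazards and the linear predictor are arbitrary made-up values:

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))

baseline = [0.05, 0.10, 0.20]   # hypothetical lambda_0(t_j) for j = 1, 2, 3
beta_x = 0.7                    # hypothetical linear predictor beta'x

for lam0 in baseline:
    # Subject-specific hazard implied by the proportional-odds assumption:
    odds = lam0 / (1.0 - lam0) * math.exp(beta_x)
    lam_i = odds / (1.0 + odds)
    # logit(lambda_i) = gamma_j + beta'x with gamma_j = logit(lambda_0)
    assert abs(logit(lam_i) - (logit(lam0) + beta_x)) < 1e-12
```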

How do we fit the model?
- If the number of unique failure (follow-up) times $t_1, \ldots, t_J$ is reasonably small with many ties, we can use ordinary logistic regression with standard software.
- If the number of unique failure (follow-up) times is large with few ties, we can use the Cox PH model with the "exact" ties option.
- What if we have both many ties and a large number of unique failure times? We are usually best off going with the logistic model, which is available in most packages.

Survey data on factors that affect high school graduation, available for N = 1,691 9th-grade students enrolled in a single school district.
Covariates available:
1. Race (White, Black, Hispanic, other)
2. Gender
3. Family income (low, medium, high; lowest and highest 20%)
4. Parents' education (no HS grad, HS grad, some college, college grad)
Goal: estimate the effect of these covariates on the cumulative probability of HS dropout (we'll focus on mom's education as an example).

A brief look at the data...

> ##
> ##### Read in HS graduation data
> ##
> hsgrad <- read.table( "http://www.ics.uci.edu/~dgillen/stat255//hsgrad_comp.txt", header=TRUE )
> nsubjects <- nrow( hsgrad )
> nsubjects
[1] 1691
> hsgrad[1:5,]
   id race male mom.ed dad.ed inc graduate maxgrade
1 101    1    0      2      1   1        1       12
3 103    1    0      2      2   2        1       12
5 105    1    1      4      4   1        1       12
6 106    1    0      4      2   3        1       12
7 107    1    0      2      2   2        3       10
> table( hsgrad$maxgrade )
   8    9   10   11   12
   8   46   55  160 1422

Set the data up for a CRM fit:

> ###
> ### Construct CRM data...
> ###
> #
> ##### STEP (1): construct the pairs (Y,H)
> #
> hsgrad$maxgrade <- hsgrad$maxgrade - 7
> ncuts <- max(hsgrad$maxgrade) - 1
> print( paste("ncuts =", ncuts) )
[1] "ncuts = 4"
> y.crm <- NULL
> h.crm <- NULL
> id <- NULL
> for( j in 1:nsubjects ){
+   yj <- rep( 0, ncuts )
+   if( hsgrad$maxgrade[j] <= ncuts ) yj[ hsgrad$maxgrade[j] ] <- 1
+   hj <- 1 - c( 0, cumsum(yj)[1:(ncuts-1)] )
+   y.crm <- c( y.crm, yj )
+   h.crm <- c( h.crm, hj )
+   id <- c( id, rep(j,ncuts) )
+ }

> #
> ##### STEP (2): construct the intercepts
> #
> level <- factor( rep(1:ncuts, nsubjects), levels=c(1:ncuts),
+                  labels=paste(": ", (1:ncuts)+7) )
> int.mat <- NULL
> for( j in 1:ncuts ){
+   intj <- rep( 0, ncuts )
+   intj[ j ] <- 1
+   int.mat <- cbind( int.mat, rep( intj, nsubjects ) )
+ }
> dimnames(int.mat) <- list( NULL, paste("Int", c(1:ncuts), sep="") )

> #
> ##### STEP (3): expand the X's
> #
> race <- rep( hsgrad$race, rep(ncuts,nsubjects) )
> male <- rep( hsgrad$male, rep(ncuts,nsubjects) )
> mom.ed <- rep( hsgrad$mom.ed, rep(ncuts,nsubjects) )
> dad.ed <- rep( hsgrad$dad.ed, rep(ncuts,nsubjects) )
> inc <- rep( hsgrad$inc, rep(ncuts,nsubjects) )
> #
> print( cbind( id, y.crm, h.crm, level, int.mat, mom.ed )[1:22,] )
      id y.crm h.crm level Int1 Int2 Int3 Int4 mom.ed
 [1,]  1     0     1     1    1    0    0    0      2
 [2,]  1     0     1     2    0    1    0    0      2
 [3,]  1     0     1     3    0    0    1    0      2
 [4,]  1     0     1     4    0    0    0    1      2
 [5,]  2     0     1     1    1    0    0    0      2
 [6,]  2     0     1     2    0    1    0    0      2
 [7,]  2     0     1     3    0    0    1    0      2
 [8,]  2     0     1     4    0    0    0    1      2
 [9,]  3     0     1     1    1    0    0    0      4
[10,]  3     0     1     2    0    1    0    0      4
[11,]  3     0     1     3    0    0    1    0      4
...

> #
> ##### STEP (4): drop the H=0 rows and build the dataframe
> #
> keep <- h.crm==1
> hsgrad.crm.data <- data.frame(
+   id = id[keep],
+   y = y.crm[keep],
+   level = level[keep],
+   race = race[keep],
+   male = male[keep],
+   mom.ed = mom.ed[keep],
+   dad.ed = dad.ed[keep],
+   inc = inc[keep] )
> print( hsgrad.crm.data[1:15,] )
  id y level race male mom.ed dad.ed inc
1  1 0  :  8    1    0      2      1   1
2  1 0  :  9    1    0      2      1   1
3  1 0  : 10    1    0      2      1   1
4  1 0  : 11    1    0      2      1   1
5  2 0  :  8    1    0      2      2   2
6  2 0  :  9    1    0      2      2   2
7  2 0  : 10    1    0      2      2   2
...

First consider fitting separate logistic regressions by subsetting on individuals at each grade level. That is, in 4 separate models, consider the odds of dropping out of school before completing grade j + 1, given that one has passed grade j.

> #
> #####
> ##### Fit separate logistic regressions for cumulative
> ##### probability of dropout
> #####
> #
> table( hsgrad.crm.data$level )
 :  8  :  9  : 10  : 11
 1691  1683  1637  1582

> fit1 <- glm( y ~ factor( mom.ed ), family=binomial,
+              subset=(as.integer(level)==1), data = hsgrad.crm.data )
> summary( fit1 )

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)      -4.9836     1.0034  -4.967 6.81e-07 ***
factor(mom.ed)2  -0.1561     1.0837  -0.144    0.885
factor(mom.ed)3 -15.5825  1438.1233  -0.011    0.991
factor(mom.ed)4  -0.9053     1.4176  -0.639    0.523

> glmci( fit1 )
                exp( Est ) ci95.lo ci95.hi z value Pr(>|z|)
(Intercept)         0.0068  0.0010  0.0490 -4.9666   0.0000
factor(mom.ed)2     0.8555  0.1023  7.1562 -0.1440   0.8855
factor(mom.ed)3     0.0000  0.0000     Inf -0.0108   0.9914
factor(mom.ed)4     0.4044  0.0251  6.5091 -0.6386   0.5231

> fit2 <- glm( y ~ factor( mom.ed ), family=binomial,
+              subset=(as.integer(level)==2), data = hsgrad.crm.data )
> glmci( fit2 )
                exp( Est ) ci95.lo ci95.hi z value Pr(>|z|)
(Intercept)         0.0429  0.0189  0.0970 -7.5554   0.0000
factor(mom.ed)2     0.8257  0.3412  1.9986 -0.4245   0.6712
factor(mom.ed)3     0.3111  0.0618  1.5670 -1.4154   0.1569
factor(mom.ed)4     0.1955  0.0482  0.7926 -2.2855   0.0223

> fit3 <- glm( y ~ factor( mom.ed ), family=binomial,
+              subset=(as.integer(level)==3), data = hsgrad.crm.data )
> glmci( fit3 )
                exp( Est ) ci95.lo ci95.hi z value Pr(>|z|)
(Intercept)         0.1111  0.0640  0.1930 -7.7994   0.0000
factor(mom.ed)2     0.3009  0.1563  0.5793 -3.5937   0.0003
factor(mom.ed)3     0.2466  0.0791  0.7683 -2.4146   0.0158
factor(mom.ed)4     0.1275  0.0450  0.3611 -3.8775   0.0001

> fit4 <- glm( y ~ factor( mom.ed ), family=binomial,
+              subset=(as.integer(level)==4), data = hsgrad.crm.data )
> glmci( fit4 )
                exp( Est ) ci95.lo ci95.hi z value Pr(>|z|)
(Intercept)         0.1250  0.0717  0.2179 -7.3356   0.0000
factor(mom.ed)2     0.9754  0.5398  1.7626 -0.0826   0.9342
factor(mom.ed)3     1.1250  0.5351  2.3651  0.3107   0.7560
factor(mom.ed)4     0.5836  0.2918  1.1671 -1.5229   0.1278

Now let's consider fitting the continuation ratio model, which simultaneously models all grades:

> #
> #####
> ##### Fit continuation ratio (logit) model
> #####
> #
> fit5 <- glm( y ~ level + factor(mom.ed), family=binomial,
+              data = hsgrad.crm.data )
> glmci( fit5 )
                exp( Est ) ci95.lo ci95.hi  z value Pr(>|z|)
(Intercept)         0.0078  0.0036  0.0167 -12.4296   0.0000
level:  9           5.9254  2.7874 12.5961   4.6242   0.0000
level: 10           7.3585  3.4930 15.5015   5.2501   0.0000
level: 11          24.0970 11.8006 49.2065   8.7357   0.0000
factor(mom.ed)2     0.6614  0.4508  0.9703  -2.1142   0.0345
factor(mom.ed)3     0.5866  0.3405  1.0103  -1.9230   0.0545
factor(mom.ed)4     0.3249  0.1981  0.5329  -4.4523   0.0000

Interpretation of coefficients from the CRM model: for example, exp(beta) for factor(mom.ed)4 is 0.32, so the odds of dropping out before completing any given grade, given completion of the previous grade, are estimated to be roughly one third as large for students whose mothers are college graduates as for students whose mothers did not finish high school; by the model's assumption, this odds ratio is common across grade levels.

We can also test the assumption of a common (conditional) odds ratio across grade levels using an LRT. That is, we wish to test whether the effect of mother's education varies by grade level. This is called a test of the "proportional odds" assumption. As with any diagnostic test, be careful of underpowered tests...

> #
> #####
> ##### Test the null hypothesis of a single odds ratio across levels
> #####
> #
> fit6 <- glm( y ~ level * factor( mom.ed ), family=binomial,
+              data = hsgrad.crm.data )
> anova( fit5, fit6 )
Analysis of Deviance Table

Model 1: y ~ level + factor(mom.ed)
Model 2: y ~ level * factor(mom.ed)
  Resid. Df Resid. Dev Df Deviance
1      6586    2018.30
2      6577    2003.96  9    14.34

Might be better to use my lrtest() function, since it automatically computes the p-value...

> #
> #####
> ##### Test the null hypothesis of a single odds ratio across levels
> #####
> lrtest( fit5, fit6 )

Assumption: Model 1 nested within Model 2
  Resid. Df Resid. Dev Df Deviance pvalue
1      6586   2018.301
2      6577   2003.958  9   14.343 0.1106
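The p-value printed above is just the upper tail of a chi-square distribution with 9 degrees of freedom evaluated at the deviance difference 14.343 (lrtest() itself is a course helper function, not base R). A stdlib-only Python sketch of that tail probability, via the series expansion of the regularized incomplete gamma function:

```python
import math

def chi2_sf(x, df, terms=200):
    """Upper-tail probability Pr(X > x) for X ~ chi-square(df).

    Uses the series P(a, s) = s^a e^{-s} * sum_{n>=0} s^n / Gamma(a + n + 1)
    for the regularized lower incomplete gamma, with a = df/2 and s = x/2.
    """
    a, s = df / 2.0, x / 2.0
    term = 1.0 / math.gamma(a + 1.0)   # n = 0 term of the sum
    total = term
    for n in range(1, terms):
        term *= s / (a + n)            # recurrence between successive terms
        total += term
    return 1.0 - total * math.exp(-s) * s ** a

p = chi2_sf(14.343, 9)
# p is approximately 0.11, matching the lrtest() output above
```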

Conclusion from the above test: with p = 0.1106, we fail to reject the null hypothesis of a common odds ratio across grade levels, so the proportional odds assumption of the CRM appears reasonable here; bear in mind, though, the earlier caution about underpowered diagnostic tests.