Cox regression: Estimation

Patrick Breheny
October 27
Survival Data Analysis (BIOS 7210)

Introduction

In our last lecture, we introduced the Cox partial likelihood; today, we will go over how to solve for $\hat{\beta}$, the maximum (partial) likelihood estimator.

As in previous models, this will require working out the score vector and Hessian matrix and applying an iterative Newton-Raphson procedure.

On a superficial level this procedure is similar to our other regression models, but the details are quite different: although the observations are independent, we can no longer treat the partial likelihood contributions from each observation in isolation.

Partial likelihood; at-risk indicator

Recall the Cox partial likelihood (PL):

$$L(\beta) = \prod_j \frac{\exp(x_j^T \beta)}{\sum_{k \in R(t_j)} \exp(x_k^T \beta)},$$

where $j$ indexes the observed failure times and $R(t)$ is the set of observations at risk at time $t$.

The denominator in the expression above is also sometimes written as

$$\sum_{k=1}^{n} Y_k(t_j) \exp(x_k^T \beta),$$

where $Y_i(t)$ is an at-risk indicator, equal to 1 if subject $i$ is at risk at time $t$, and 0 otherwise.

Cox PL in terms of individual weights

As an alternative, it is often convenient to express the likelihood as a product of terms for each individual, as opposed to each failure time.

To simplify the expression, let $w_j = \exp(x_j^T \beta)$; the Cox partial likelihood can now be written as

$$L(\beta) = \prod_i \left\{ \frac{w_i}{\sum_{j \in R_i} w_j} \right\}^{d_i},$$

where $R_i = R(t_i)$ is the risk set at the time on study of subject $i$.

Expressing the partial likelihood in this way emphasizes the fact that the model assigns weights $w_i$ to the relative likelihood that individual $i$ will fail compared to the other subjects at risk.

Comments

Note that the $d_i$ exponent ensures that only the observations at which a failure is observed contribute to the likelihood.

However, because each subject affects the total hazard $\sum_{j \in R_i} w_j$ over all the failure times at which they are in the risk set, the contribution that subject $i$ makes to the likelihood is not limited to the $i$th term in the product.

Because this sum will appear many times in our derivations today, I will denote it $W_i$:

$$W_i = \sum_{j \in R(t_i)} w_j,$$

where $W_i$ represents the total hazard for all subjects at risk for the time at which subject $i$ fails.

Failure probabilities

The relative probability of failure for subject $i$ is given by $w_i$; let us denote the absolute probability of failure for subject $i$ at time $t_j$ as $\pi_{ij}$:

$$\pi_{ij} = \frac{Y_i(t_j)\, w_i}{W_j}$$

Note, of course, that this probability is absolute only in the conditional sense, given that a failure occurs at time $t_j$.
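As a rough illustration of the definition of $\pi_{ij}$, the sketch below builds the matrix of failure probabilities for a small made-up data set. The data values, the coefficient value `b`, and the variable names are assumptions chosen for illustration; the rows are assumed to be sorted by time on study with no ties.

```r
# Toy data (hypothetical), sorted by time on study, no ties
X <- cbind(c(0.5, -1.0, 0.3, 1.2, -0.7))  # n x 1 covariate matrix
d <- c(1, 0, 1, 1, 0)                     # failure indicators
b <- 0.4                                  # an arbitrary coefficient value

w <- as.numeric(exp(X %*% b))  # w_i = exp(x_i' beta)
W <- rev(cumsum(rev(w)))       # W_j = sum of w_k over subjects at risk at t_j
P <- outer(w, W, "/")          # entry (i, j) is w_i / W_j
P[upper.tri(P)] <- 0           # Y_i(t_j) = 0: subject i left before time t_j

# Each column j is a probability distribution over the risk set at t_j
col_totals <- colSums(P)
```

Because the data are sorted, subject $i$ is at risk at $t_j$ exactly when $i \ge j$, which is why zeroing the upper triangle plays the role of the at-risk indicator here.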

Log-likelihood

The (partial) log-likelihood is therefore

$$\ell = \sum_i d_i \log w_i - \sum_i d_i \log W_i = \sum_i d_i \eta_i - \sum_i d_i \log W_i$$

As we begin to take derivatives, keep in mind that the $W_i$ term contains many $\eta$ terms in addition to $\eta_i$.
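The log-likelihood above can be sketched as a short R function. The function name `cox_loglik` and the toy data are assumptions for illustration, and the code assumes sorted data with no ties.

```r
# Partial log-likelihood: sum_i d_i * (eta_i - log W_i)
# Assumes rows of X (and d) are sorted by time on study, with no ties
cox_loglik <- function(b, X, d) {
  eta <- as.numeric(X %*% b)        # linear predictors
  W <- rev(cumsum(rev(exp(eta))))   # W_i: total hazard of the risk set at t_i
  sum(d * (eta - log(W)))
}

# Toy data (hypothetical)
X <- cbind(c(0.5, -1.0, 0.3, 1.2, -0.7))
d <- c(1, 0, 1, 1, 0)

# At beta = 0 every w_i is 1, so W = (5, 4, 3, 2, 1) and the failures at
# positions 1, 3, 4 give -(log 5 + log 3 + log 2) = -log(30)
l0 <- cox_loglik(0, X, d)
```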

Score (with respect to η)

Solving for $\hat{\beta}$ involves deriving the score equations and setting them equal to zero.

Let us begin by evaluating the partial derivative of the log-likelihood with respect to the $k$th linear predictor:

$$\frac{\partial \ell}{\partial \eta_k} = d_k - \sum_i \pi_{ki} d_i$$

Thus, we can write the score with respect to the vector of linear predictors as

$$u(\eta) = d - Pd,$$

where $P$ is an $n \times n$ matrix with elements $\pi_{ij}$.

Score (with respect to β)

As we have seen before, by the chain rule the score with respect to $\beta$ is therefore

$$u(\beta) = X^T (d - Pd)$$

Alternatively, we can express the score equations as

$$\sum_j (x_j - E_j x) = 0,$$

where the sum is over the observed failure times and $E_j x = \sum_i x_i \pi_{ij}$ can be thought of as the expected value of the covariate vector at the $j$th failure time given the probability distribution implied by the model.
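The two expressions for the score can be compared numerically. The following is a minimal sketch; the function names, toy data, and evaluation point `b0` are all assumptions for illustration, and the data are assumed sorted with no ties. A central-difference gradient of the log-likelihood is included as an independent check.

```r
cox_loglik <- function(b, X, d) {
  eta <- as.numeric(X %*% b)
  sum(d * (eta - log(rev(cumsum(rev(exp(eta)))))))
}

# Matrix of pi_ij for sorted data without ties
cox_P <- function(b, X) {
  w <- as.numeric(exp(X %*% b))
  P <- outer(w, rev(cumsum(rev(w))), "/")
  P[upper.tri(P)] <- 0
  P
}

# Matrix form: X'(d - Pd)
cox_score <- function(b, X, d) {
  P <- cox_P(b, X)
  as.numeric(t(X) %*% (d - P %*% d))
}

# Sum form: sum over failures j of (x_j - E_j x)
cox_score_alt <- function(b, X, d) {
  Ex <- t(cox_P(b, X)) %*% X   # row j holds E_j x = sum_i pi_ij x_i
  as.numeric(colSums((X - Ex)[d == 1, , drop = FALSE]))
}

# Toy data (hypothetical)
X <- cbind(c(0.5, -1.0, 0.3, 1.2, -0.7), c(1, 0, 0, 1, 1))
d <- c(1, 0, 1, 1, 0)
b0 <- c(0.2, -0.3)

u1 <- cox_score(b0, X, d)
u2 <- cox_score_alt(b0, X, d)

# Central-difference gradient of the log-likelihood
h <- 1e-5
u_num <- sapply(seq_along(b0), function(k) {
  e <- replace(numeric(length(b0)), k, h)
  (cox_loglik(b0 + e, X, d) - cox_loglik(b0 - e, X, d)) / (2 * h)
})
```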

Hessian (with respect to η)

The score, of course, is nonlinear in $\beta$, meaning that we will have to apply a Taylor series expansion in order to solve it.

This, in turn, involves finding second derivatives: i.e., the Hessian matrix.

Let us start with the diagonal elements (with respect to the linear predictors):

$$\frac{\partial^2 \ell}{\partial \eta_k^2} = -\sum_i d_i \pi_{ki} (1 - \pi_{ki})$$

Similarly, for $j \neq k$,

$$\frac{\partial^2 \ell}{\partial \eta_k \, \partial \eta_j} = \sum_i d_i \pi_{ki} \pi_{ji}$$

Hessian (with respect to β)

Again, applying the chain rule we obtain the Hessian with respect to $\beta$:

$$H(\beta) = -X^T W X,$$

where $W$ denotes the (non-diagonal) matrix whose terms are given on the previous slide, with signs reversed (note that $W$ is unrelated to $W_i$; my apologies if the notation is confusing).

Alternatively, one can express the Hessian as

$$H(\beta) = -\sum_j \sum_k \pi_{kj} (x_k - E_j x)(x_k - E_j x)^T,$$

where $j$ here indexes the observed failure times.
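As a sanity check on the two Hessian expressions, the sketch below computes $X^T W X$ (the negative Hessian) both in matrix form and as the sum of weighted covariance-type terms over failure times. Function names, toy data, and the evaluation point are assumptions for illustration; data are assumed sorted with no ties.

```r
cox_P <- function(b, X) {
  w <- as.numeric(exp(X %*% b))
  P <- outer(w, rev(cumsum(rev(w))), "/")
  P[upper.tri(P)] <- 0
  P
}

# Matrix form of the negative Hessian: X' W X
cox_info <- function(b, X, d) {
  P <- cox_P(b, X)
  W <- -P %*% diag(d) %*% t(P)                  # off-diagonal terms
  diag(W) <- diag(P %*% diag(d) %*% t(1 - P))   # diagonal terms
  t(X) %*% W %*% X
}

# Sum form: sum over failures j of sum_k pi_kj (x_k - E_j x)(x_k - E_j x)'
cox_info_sum <- function(b, X, d) {
  P <- cox_P(b, X)
  out <- matrix(0, ncol(X), ncol(X))
  for (j in which(d == 1)) {
    Ex <- colSums(P[, j] * X)   # E_j x
    Xc <- sweep(X, 2, Ex)       # rows are x_k - E_j x
    out <- out + t(Xc) %*% (P[, j] * Xc)
  }
  out
}

# Toy data (hypothetical)
X <- cbind(c(0.5, -1.0, 0.3, 1.2, -0.7), c(1, 0, 0, 1, 1))
d <- c(1, 0, 1, 1, 0)
b0 <- c(0.2, -0.3)

I1 <- cox_info(b0, X, d)
I2 <- cox_info_sum(b0, X, d)
```

The sum form makes it apparent that $X^T W X$ is a sum of covariance matrices and hence positive semidefinite, so the partial log-likelihood is concave.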

Newton-Raphson algorithm

As we have seen previously with the exponential and Weibull regression models, the Newton-Raphson algorithm is an effective, efficient iterative procedure that converges to the MLE (usually).

For Cox regression, the Newton-Raphson update is given by

$$\beta^{(m+1)} = (X^T W X)^{-1} X^T (d - Pd) + \beta^{(m)},$$

where $W$ and $P$ are evaluated at $\beta^{(m)}$, the current value of the regression coefficients.

Crude R code

    for (i in 1:20) {
      eta <- X %*% b
      haz <- as.numeric(exp(eta))     # w[i]
      rsk <- rev(cumsum(rev(haz)))    # W[i]
      P <- outer(haz, rsk, '/')       # w[i] / W[j]
      P[upper.tri(P)] <- 0            # zero out subjects no longer at risk
      W <- -P %*% diag(d) %*% t(P)
      diag(W) <- diag(P %*% diag(d) %*% t(1-P))
      b <- solve(t(X) %*% W %*% X) %*% t(X) %*% (d - P %*% d) + b
    }

The above code assumes that the data has been sorted by time on study, and assumes no ties are present. It also assumes that X, d, and a starting value b already exist.
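To show how the crude loop might be run end to end, here is a sketch that wraps it in a function and applies it to simulated data. The wrapper name `crude_cox`, the starting value of zero, and the simulation settings are all assumptions made for illustration; they are not part of the original slides.

```r
# Crude Newton-Raphson for Cox regression, wrapped as a function.
# Assumes rows are sorted by time on study and there are no ties.
crude_cox <- function(X, d, n_iter = 20) {
  b <- rep(0, ncol(X))   # starting value (an assumption)
  for (i in seq_len(n_iter)) {
    eta <- X %*% b
    haz <- as.numeric(exp(eta))
    rsk <- rev(cumsum(rev(haz)))
    P <- outer(haz, rsk, "/")
    P[upper.tri(P)] <- 0
    W <- -P %*% diag(d) %*% t(P)
    diag(W) <- diag(P %*% diag(d) %*% t(1 - P))
    b <- solve(t(X) %*% W %*% X) %*% t(X) %*% (d - P %*% d) + b
  }
  as.numeric(b)
}

# Simulated data (hypothetical): exponential failure and censoring times
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 2), n, 2)
fail <- rexp(n, rate = as.numeric(exp(X %*% c(0.5, -0.5))))
cens <- rexp(n, rate = 0.3)
time <- pmin(fail, cens)
d <- as.numeric(fail <= cens)

# Sort by time on study, as the loop requires
ord <- order(time)
X <- X[ord, , drop = FALSE]
d <- d[ord]

b_hat <- crude_cox(X, d)
```

Continuous simulated times avoid ties, which the crude code does not handle.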

Comments

The code on the previous slide is crude for several reasons:
- It could be faster/more efficient
- It doesn't check for convergence
- It can occasionally fail to converge, because it doesn't implement step-halving when needed

You are tasked with addressing the last two shortcomings on your next homework assignment.

Examples: pbc data

Some examples for how well Newton-Raphson works on the pbc data:
- Model contains trt, stage, and hepato: Converges in 4 iterations
- Model contains trt, stage, hepato, and bili: Fails to converge
- Model contains trt, stage, hepato, and bili, but we employ step-halving: Converges in 20 iterations

Conditional step-halving

The survival package, however, can fit the Cox model with trt, stage, hepato, and bili in just 6 iterations... how does it do that?

The fundamental tradeoff here is between stability and speed: step-halving slows down convergence (intentionally!), but provides stability.

It would be desirable to use Newton-Raphson as a default, but have some sort of check in place that uses step-halving when problems arise.

Likelihood checking

It turns out that this is fairly straightforward to accomplish.

Let $\tilde{\beta}$ denote the Newton-Raphson update, and consider the following procedure:
(1) Calculate $\ell(\beta^{(m)})$
(2) Calculate $\ell(\tilde{\beta})$
(3) If $\ell(\tilde{\beta}) > \ell(\beta^{(m)})$, then $\beta^{(m+1)} \leftarrow \tilde{\beta}$; otherwise, $\beta^{(m+1)} \leftarrow \frac{1}{2}\tilde{\beta} + \frac{1}{2}\beta^{(m)}$

Using this procedure, we can solve for $\hat{\beta}$ in 6 iterations, using step-halving only once, on the initial update.
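The three-step procedure above could be coded along the following lines. This is only a sketch under stated assumptions: the function names, the zero starting value, the convergence rule (largest coefficient change below `tol`), and the simulated data are illustrative choices, not the author's implementation. Data are assumed sorted with no ties.

```r
cox_loglik <- function(b, X, d) {
  eta <- as.numeric(X %*% b)
  sum(d * (eta - log(rev(cumsum(rev(exp(eta)))))))
}

# One plain Newton-Raphson update from the current value b
newton_update <- function(b, X, d) {
  w <- as.numeric(exp(X %*% b))
  P <- outer(w, rev(cumsum(rev(w))), "/")
  P[upper.tri(P)] <- 0
  W <- -P %*% diag(d) %*% t(P)
  diag(W) <- diag(P %*% diag(d) %*% t(1 - P))
  as.numeric(b + solve(t(X) %*% W %*% X, t(X) %*% (d - P %*% d)))
}

fit_cox_checked <- function(X, d, tol = 1e-8, max_iter = 50) {
  b <- rep(0, ncol(X))
  converged <- FALSE
  for (m in seq_len(max_iter)) {
    b_tilde <- newton_update(b, X, d)
    # Steps (1)-(3): accept the update only if the likelihood increases;
    # isTRUE() treats a non-finite comparison as "did not increase"
    if (isTRUE(cox_loglik(b_tilde, X, d) > cox_loglik(b, X, d))) {
      b_next <- b_tilde
    } else {
      b_next <- 0.5 * b_tilde + 0.5 * b
    }
    converged <- max(abs(b_next - b)) < tol
    b <- b_next
    if (converged) break
  }
  list(coef = b, iterations = m, converged = converged)
}

# Simulated data (hypothetical), sorted by time on study
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 2), n, 2)
fail <- rexp(n, rate = as.numeric(exp(X %*% c(0.5, -0.5))))
cens <- rexp(n, rate = 0.3)
d <- as.numeric(fail <= cens)
ord <- order(pmin(fail, cens))
X <- X[ord, , drop = FALSE]
d <- d[ord]

fit <- fit_cox_checked(X, d)
```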

Guaranteed convergence?

The procedure on the previous page almost always works, but is still not guaranteed to converge.

The reason is that step-halving might not be enough: it is possible that $\ell(\frac{1}{2}\tilde{\beta} + \frac{1}{2}\beta^{(m)})$ is still smaller than $\ell(\beta^{(m)})$.

To guarantee convergence, we need to iteratively reapply the step-halving: consider $\frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \ldots$ until we reach a step size small enough that the likelihood does, in fact, increase.

Typically, this is not necessary, but this kind of check is necessary to ensure that the likelihood goes up with every iteration, even in pathological cases.
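Repeated step-halving can be sketched as a small helper that keeps shrinking the step until the objective increases. The helper name `backtrack`, the cap of 30 halvings, and the toy quadratic used in place of the Cox partial log-likelihood are all assumptions made to keep the example short.

```r
# Shrink the step from b_cur toward b_new by factors of 1/2 until the
# objective increases; fall back to b_cur if no tried step size helps
backtrack <- function(loglik, b_cur, b_new, max_halvings = 30) {
  l_cur <- loglik(b_cur)
  step <- 1
  for (h in 0:max_halvings) {
    b_try <- b_cur + step * (b_new - b_cur)
    l_try <- loglik(b_try)
    if (is.finite(l_try) && l_try > l_cur) return(b_try)
    step <- step / 2   # 1/2, 1/4, 1/8, ...
  }
  b_cur
}

# Toy stand-in for a log-likelihood (hypothetical), maximized at b = 1
toy_loglik <- function(b) -(b - 1)^2

# A proposed update of 5 overshoots badly from b = 0:
#   step 1   -> b = 5,    objective -16     (worse than -1)
#   step 1/2 -> b = 2.5,  objective -2.25   (still worse)
#   step 1/4 -> b = 1.25, objective -0.0625 (accepted)
b_next <- backtrack(toy_loglik, b_cur = 0, b_new = 5)
```

In a Cox fit, `loglik` would be the partial log-likelihood as a function of the coefficients and `b_new` the Newton-Raphson update.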

Ties

As a final comment, note that we are ignoring the issue of tied observations, even though there are in fact a few ties in the pbc data.

However, unless there are a large number of ties, this is typically a very minor issue:

              trt       stage     hepato    bili
    Crude     -0.15530  0.62157   0.34860   0.13358
    survival  -0.15473  0.62138   0.34854   0.13353

We will, however, discuss ties more carefully in a future lecture.