WALD III SOFTWARE FOR THE MASSES (AND AN EXAMPLE) Leo Breiman UCB Statistics

Similar documents
HOW TO USE SURVIVAL FORESTS (SFPDV1)

WALD LECTURE II LOOKING INSIDE THE BLACK BOX. Leo Breiman UCB Statistics

Semiparametric Regression

Lecture 7 Time-dependent Covariates in Cox Regression

Part IV Extensions: Competing Risks Endpoints and Non-Parametric AUC(t) Estimation

Extensions of Cox Model for Non-Proportional Hazards Purpose

Random Forests: Finding Quasars

3003 Cure. F. P. Treasure

Multistate models and recurrent event models

REGRESSION ANALYSIS FOR TIME-TO-EVENT DATA THE PROPORTIONAL HAZARDS (COX) MODEL ST520

Multistate models and recurrent event models

Cox s proportional hazards/regression model - model assessment

PhD course in Advanced survival analysis. One-sample tests. Properties. Idea: (ABGK, sect. V.1.1) Counting process N(t)

Introduction to Statistical Analysis

Survival Analysis. Stat 526. April 13, 2018

Extensions of Cox Model for Non-Proportional Hazards Purpose

You know I m not goin diss you on the internet Cause my mama taught me better than that I m a survivor (What?) I m not goin give up (What?

Lecture 22 Survival Analysis: An Introduction

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Lecture 11. Interval Censored and. Discrete-Time Data. Statistics Survival Analysis. Presented March 3, 2016

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds

Multinomial Logistic Regression Models

Survival analysis in R

ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES. Cox s regression analysis Time dependent explanatory variables

Survival Analysis. 732G34 Statistisk analys av komplexa data. Krzysztof Bartoszek

LOGISTIC REGRESSION Joseph M. Hilbe

Tied survival times; estimation of survival probabilities

Right-truncated data. STAT474/STAT574 February 7, / 44

MAS3301 / MAS8311 Biostatistics Part II: Survival

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Time-dependent covariates

Statistical Distribution Assumptions of General Linear Models

Cox s proportional hazards model and Cox s partial likelihood

Today. Statistical Learning. Coin Flip. Coin Flip. Experiment 1: Heads. Experiment 1: Heads. Which coin will I use? Which coin will I use?

Power and Sample Size Calculations with the Additive Hazards Model

Generalized Linear Models for Non-Normal Data

ST745: Survival Analysis: Cox-PH!

Time-Invariant Predictors in Longitudinal Models

Probability and Statistics

In contrast, parametric techniques (fitting exponential or Weibull, for example) are more focussed, can handle general covariates, but require

Dependent Variable Q83: Attended meetings of your town or city council (0=no, 1=yes)

Reliability Engineering I

Constrained estimation for binary and survival data

Introduction to Reliability Theory (part 2)

STAT 6350 Analysis of Lifetime Data. Failure-time Regression Analysis

Lecture 7. Proportional Hazards Model - Handling Ties and Survival Estimation Statistics Survival Analysis. Presented February 4, 2016

The influence of categorising survival time on parameter estimates in a Cox model

Analysis of Time-to-Event Data: Chapter 6 - Regression diagnostics

Time-Invariant Predictors in Longitudinal Models

Survival analysis in R

Mixture modelling of recurrent event times with long-term survivors: Analysis of Hutterite birth intervals. John W. Mac McDonald & Alessandro Rosina

Chapter 17. Failure-Time Regression Analysis. William Q. Meeker and Luis A. Escobar Iowa State University and Louisiana State University

Package threg. August 10, 2015

Survival Analysis for Case-Cohort Studies

TMA 4275 Lifetime Analysis June 2004 Solution

Package SimSCRPiecewise

Lecture 3. Truncation, length-bias and prevalence sampling

Survival Analysis I (CHL5209H)

Lecture 21: October 19

ONE MORE TIME ABOUT R 2 MEASURES OF FIT IN LOGISTIC REGRESSION

Lecture 12. Multivariate Survival Data Statistics Survival Analysis. Presented March 8, 2016

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Generalized Boosted Models: A guide to the gbm package

A Re-Introduction to General Linear Models (GLM)

Predicting flight on-time performance

Approximation of Survival Function by Taylor Series for General Partly Interval Censored Data

Subgroup analysis using regression modeling multiple regression. Aeilko H Zwinderman

Part [1.0] Measures of Classification Accuracy for the Prediction of Survival Times

Basic Medical Statistics Course

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Notes 6: Multivariate regression ECO 231W - Undergraduate Econometrics

Building a Prognostic Biomarker

Individualized Treatment Effects with Censored Data via Nonparametric Accelerated Failure Time Models

11 Survival Analysis and Empirical Likelihood

Survival Analysis. Lu Tian and Richard Olshen Stanford University

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2

Survival Analysis Math 434 Fall 2011

Multimodal Deep Learning for Predicting Survival from Breast Cancer

The coxvc_1-1-1 package

β j = coefficient of x j in the model; β = ( β1, β2,

Part III Measures of Classification Accuracy for the Prediction of Survival Times

7. Assumes that there is little or no multicollinearity (however, SPSS will not assess this in the [binary] Logistic Regression procedure).

Lecture 5 Models and methods for recurrent event data

Double Bootstrap Confidence Interval Estimates with Censored and Truncated Data

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview

Simple techniques for comparing survival functions with interval-censored data

STAT331. Cox s Proportional Hazards Model

Differential Equations. Lectures INF2320 p. 1/64

Philosophy and Features of the mstate package

MODELING MISSING COVARIATE DATA AND TEMPORAL FEATURES OF TIME-DEPENDENT COVARIATES IN TREE-STRUCTURED SURVIVAL ANALYSIS

Multi-state Models: An Overview

Bayesian regression tree models for causal inference: regularization, confounding and heterogeneity

1 Terminology and setup

Statistical Learning Theory and the C-Loss cost function

Model Selection. Frank Wood. December 10, 2009

Rapid detection of spatiotemporal clusters

Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities ; CU- CS

Variance Reduction and Ensemble Methods

FULL LIKELIHOOD INFERENCES IN THE COX MODEL

Machine Learning (CSE 446): Probabilistic Machine Learning

Transcription:

1 WALD III SOFTWARE FOR THE MASSES (AND AN EXAMPLE) Leo Breiman UCB Statistics leo@stat.berkeley.edu

2 IS THERE AN OBLIGATION? Tens of thousands of statisticians around the world are using statistical packages to arrive at conclusions-sas, SPSS, STATA, etc. These packages are filled with a lot of obsolete crank-it-out procedures, and very few up-to-date methods (except Matlab which is expensive and not in common use). The statisticians using these packages are the very large majority in the statistical profession. WHAT OBLIGATION DOES ACADEMIC STATISTICS HAVE TO PROVIDE THE FIELD WITH USEFUL, MODERN SOFTWARE? (PREFERABLY FREE--KUDOS TO R)

3 A PERSONAL VIEW My answer is personal--i think we have to close the great gap in statistics today between theory and practice. We need to develop good software tools that can help the practice of statistics out in the field: the technology transfer from academic statistics to the field needs to be greatly increased. An excellent example of the merging of theory and practice is wavelets, but there are few such examples. The pivotal tool is developing software is the ability to do it. Therefore, I have a radical proposal: That every Ph.D. graduate in statistics be required to have a good working knowledge of art least one higher level language-- C,,JAVA or Fortran 9. My plan for myself is to put more working software on my web site for free. The following discusses my next effort.

4 A LESSON FROM THE PAST Three decades ago many statisticians and quantitative social scientists were enamored of multilinear regression and its theory of hypothesis testing on the coefficients. Every statistical package had a regression program variable selection program based on F-to delete and F to enter. It was almost impossible to get a paper published unless you showed that a certain coefficient was signifigant at the 5% level. This was regardless of how well the linear model fit the data and little effort was made to find out. Many conclusions were undoubtedly wrong, and I don't think statisticians now-a-days dispute the error of these ways.

5 ENTER THE COX MODEL In the last decade or two, the Cox model for that analysis of survival data has come to occupy the place in the medical field that multilinear regression once had in the social science. A method for fitting the Cox model appears in every statistical package under the sun. My friend and biostatistician Richard Olshen tells me that use of the Cox model is the requirement for publication in some medical journals. When I voiced my concerns to well-known biostatisticians over the last few years, the response was "well, the data is too weak to do anything else". I considered this a challenge to create a better method.

6 SURVIVAL ANALYSIS Data from a survival experiment are of the form: (c n,t n,x n ), n = 1,...,N} Here, c n =1 if the nth case died during the duration of the experiment and if it was censored by dropping out during the course of the experiment or lasting until the end. The t n is the time of death or censoring, and x n is a vector of covariates. The goal of survival analysis is to trace the effects of the covariates on the times of death.

7 HAZARD AND SURVIVAL Given a random death time T(x) depending on the covariate vector, define the hazard function by: h(t,x) = P(T(x) (t + dt,t) T(x) t)/dt and the survival function: S(t,x) = P(T(x) t) = exp( h(τ,x)dτ) t The Cox model makes the assumption that h(t,x) = r(t)exp(β x) and then makes a clever partial likelihood jump that allows the r(t) to be canceled out in the estimation of β. One result is that all survival curves have the same shape. Can this possibly represent the complex action of nature even as a first approximation?

8 THREE SIMULATED DATA SETS Its useful when testing new methods to generate simulated data where truth is known. Sim1. This is 3 case data with five uniformly distributed covariates, 19% censoring, generated from a Cox model where the hazard function is: h(t,x) = exp(x 1 2x 4 + 2x 5 ) Sim2 This 3 case data with three uniformly distributed covariates, 15% censoring, has hazard function:: if x 1.5, h(t,x) = if.5 t 2.5, else exp(x 2 ) if x 1 >.5, h(t,x) = if 2.5 t 4.5, else exp(x 3 ) Sim3 This 3 case data set with six uniformly distributed covariates, 29% censoring, has hazard function: h(t,x) = (1 + z 2 t)exp*(z 1 + z 2 t) where z 1 =.5x 1, z 2 =x 4 +x 5

9 SIM1 SURVIVAL CURVES 1.5 1 2 3 SIM2 SURVIVAL CURVES 1.5 1 2 3 4 5 6 7 SIM3 SURVIVAL CURVES 1.5.5 1

INTRODUCING SURVIVAL FORESTS (SF) 1 I am developing a method (survival forests (SF)) for estimating the survival functions S(t,x) At present it is still in the experimental stage These functions can also be estimated using the Cox model by first estimating coefficients and then doing an ml estimate of r(t). With simulated data the true survival functions S*(t,x) are known. Define the error in the estimates as: error = av k,n S(t k,x n ) S *(t k,x n ) where the {t k } are the uncensored death times Errors data set Cox SF Sim1.54.66 Sim2.135.1 Sim3.572.68

FOUR REAL DATA SETS 11 Simulated data may not reflect the vagaries of real data. I have been working with four real data sets. Bcan: this is a breast cancer data set sent to me from England. It has 272 cases, 6 covariates and 17% censoring. Vcan: a veterans cancer data set with 136 cases, 6 covariates and 7% censoring. UIS: a return to drugs data set with 575 cases, 8 covariates and 19% censoring. (see the book "Applied Survival Analysis" by Hosmer and Lemeshow) Gcan: the German cancer study with 686 cases, 7 covariates, and 56% censoring.

ARE SIMULATED AND REAL COMPARABLE 12 Do the real data sets have as much information in them as the simulated? In SF there is a test set method for estimating the times of death--estimating td(n) by ˆtd(n). Measure the error in this estimate by: error = av n ( td(n) ˆtd(n) / ˆtd(n)) and the strength of the data by 1(1-error). If the error is high the strength is low. Data Set Strength Sim1 19.5 Sim2 13.5 Sim3 32. Bcan 27.4 Vcan 18.9 UIS 35.5 Gcan 45.2 The real data sets have, on average, more strength than the simulated. Strength is an indicator of how well the survival functions are estimated. Note that the lowest strength simulated data set Sim2 also had the highest error in estimating these functions.

13 CORRELATIONS OVER TIME There are three methods (so far) for using SF to look at the time process in the data. The first method is time-varying correlations. We work with L(t,x) = log(s(t,x)) Take the time points {t k } to be 1 order statistics from the uncensored death times. At each t k compute the correlation between L(t k, x n ) and each of the covariates. If a Cox model fits the data, these correlations should be constant in time. Graphs of the corelations follow.

14 SIM 1 CORRELATIONS 1.5 -.5 X1 X2 X3 X4 X5-1 1 2 3 SIM 2 CORRELATIONS 1.5 X1 X2 X3 -.5 1 2 3 4 5 6 7 SIM 3 CORRELATIONS 1.5 X1 X2 X3 X4 X5 X6.25.5.75 1

15 BCAN CORRELATIONS 1.6.2 X1 X2 X3 X4 X5 X6 -.2 1 2 3 VCAN CORRELATIONS 1.5 -.5 X1 X2 X3 X4 X5 X6-1 25 5 75 1

16 UIS CORRELATIONS 1.5 X1 X2 X3 X4 X5 X6 X7 X8 -.5 1 2 3 4 5 6 7 GCAN CORRELATIONS 1.5 X1 X2 X3 X4 X5 X6 X7 -.5 5 1 15 2 25

TIME FITTING COX MODELS 17 Another tool that is useful is fitting a Cox model at each time t k to the L(t k, x n ). At each t k let y n = L(t k,x n ) and f (n,t(k),β(k)) = t(k)exp(β(k) x n ) The right hand side is the negative log of the Cox expression with the t(k),β(k) parameters to be determined by minimizing (y n f (n,t(k),β(k))) 2 n If the Cox model is a good fit, the β(k) will ce constant in time and the t(k)} are an estimate of the integral of the baseline risk. Again, we show some graphs:

18 SIM 1 COEFFICIENT TRACES AV RSQ=.84 2.5 1.5.5 -.5-1.5 X1 X2 X3 X4 X5-2.5 1 2 3 SIM 2 COEFFICIENT TRACES AV RSQ=.33 2 1.5 1.5 X1 X2 X3 -.5-1 1 2 3 4 5 6 7 SIM 3 COEFFICIENT TRACES AV RSQ=.62.5.3.1 X1 X2 X3 X4 X5 X6 -.1.2.4.6.8

19 BCAN COEFFICIENT TRACES AV RSQ=.7 3.5 2.5 1.5 X1 X2 X3 X4 X5 X6.5 -.5 5 1 15 2 25 3 VCAN COEFFICIENT TRACES AV RSQ=.78 1-1 -2 X1 X2 X3 X4 X5 X6-3 2 4 6 8 1

2 UIS COEFFICIENT TRACES AV RSQ=.48.6.3 -.3 X1 X2 X3 X4 X5 X6 X7 X8 -.6 1 2 3 4 5 6 GCAN COEFFICIENT TRACES AV RSQ=.44 2-2 -4-6 X1 X2 X3 X4 X5 X6 X7-8 -1 5 1 15 2

21 END NOTE ON APPLICATIONS When there are changes in time over the duration of the experiment, the Cox model gives dubious results. For instance, in the UIS data the Cox model singles out the significant variables (in order) as X1>X4>X7 and no others significant at the.5 level. Examination of the correlations in time and the coefficient traces for the UIS data shows that there are three important variables XI,X2,X5. The Cox model may be dangerous to the medical experimental profession.

22 HOW SV WORKS Building a Survival Tree With fixed covariate effects and random censoring, the log likelihood can be expressed in terms of the hazard function h(t,x)as LL = c n log[h(t n,x n )] h(τ,x n ) dτ t n This expression is our starting point. The tree will be grown to maximize it. Divide the time-covariate space into a union of L disjoint rectangles indexed by l. Call these nodes. Use the notation l=i l R l where I l is a time interval and R l is a rectangle in covariate space. The essential step is to express h(t,x) as h(t, x) = exp[ I( (t, x) l)α(l)] l where I is the,1, indicator function.

23 CONTINUED Define (1) N D (l) = c n I((t n,x n ) l) n Then N D (l)is the number of deaths in the node l. Define also (2) T(l) = I(x n R l ) (,t n )I I l n where denotes length of the enclosed interval. Substituting the expression for h(t,x), (1) and (2) into the expression for the LL, gives LL = α(l) N D (l) T(l)exp( α(l)) l l Maximizing this over the α(l) gives and α(l) = log(n D (l)/t(l)) LL = N D (l) log(n D (l)/t(l)) N D (l) l l

24 SPLITTING NODES The second term above is constant. At each step the split of l is found which most increases N D (l)log(n D (l)/t(l)) Univariate splits on a covariate are the ordinary splits as used in CART but using the above expression to maximize. Splits on time are tricky. t(n) t2 cut t1 A case with covariate x n keeps traveling through nodes until it comes to a node such that t 1 <t n t 2. In a time split of a node, all cases in the original node are in each child node.

25 CONTINUED To work, there have to be many splits on time. With probability.75 the split on each node is on the time variable only. With probability,25, the split is the best split on any of the covariates. The tree is grown until each terminal node has exactly one uncensored death in it. But terminal nodes can contain many other cases.

26 GROWING THE FOREST Each tree in the forest is grown on a bootstrap sample from the original training set. Again, about one-third of the cases are oob. Let x be the covariate vector of an oob case. Put it down the corresponding tree and estmate S(t,x) by: log(s(t,x)) = [N D (l)/t(l)] I(x R l ) (,t)i I l l where the sum is over all terminal nodes. This estimate is accumulated and averaged every time x is not in the tree growing sample. The averaged survival function, estimated for cases in the data not used in the corresponding tree, is automatically a test set estimate.

27 TYING UP A LOOSE END WITH SCALING As in RF, proximities can be defined by of often two cases occupy the same terminal node, although there is a different definition for terminal nodes in SF. They can also be projected down into two dimensions. Recall, we have not yet solved the problem of how to recognize the action of the switch variable in Sim2. Here is a graph of the 2nd scaling coordinate versus the first for sim 2. SCALING COORDINATE PLOT COORDINATE 2.1 -.1 -.2 -.25 -.15 -.5.5.15 COORDINATE 1

28 AHA! Noting that it looked symmetric above and below the zero of 2nd coordinate, I constructed the histograms of x1 above and below this zero point. 25 HISTOGRAM OF X1--COOR. 2 > 2 Count 15 1 5.1.2.3.4.5 X1 values HISTOGRAM OF X1 --COOR 2< 2 15 Count 1 5.1.2.3.4.5 XI VALUES AHA! A CLUE

29 FINAL REMARKS RF has been extensively tested in the field. SV is still being born and needs more testing, working with, and extending. But it may prove the point that algorithmic models can provide more (and more reliable) information than stochastic models. As soon as it is in decent shape, SV will show up on my web site as free software for use by the masses. THANK YOU