Support Vector Hazard Regression (SVHR) for Predicting Survival Outcomes. Donglin Zeng, Department of Biostatistics, University of North Carolina


Outline: Introduction; Method; Theoretical Results; Simulation Studies; Application; Conclusions

Introduction

Introduction For survival data, one important goal is to use a subject's baseline information to predict the exact timing of an event. A variety of regression models focus on estimating the survival function and evaluating covariate effects, but not on predicting event times. We consider supervised learning algorithms for prediction: they aim directly at prediction; they are nonparametric; and they are powerful and flexible in handling a large number of predictors.

Introduction Many learning methods exist for predicting non-censored outcomes, of which support vector machines (SVMs) are commonly used for binary outcomes. They have a simple geometric interpretation; training is a convex quadratic programming problem; non-linearity is incorporated through kernels; and they are adapted to regression with a continuous response via the ɛ-insensitive loss.

Introduction For survival outcomes, censoring imposes new challenges for SVMs. Most existing methods modify support vector regression. Shivaswamy et al. (2007) and Khan and Zubek (2008) generalized the ɛ-insensitive loss function.

Introduction Van Belle et al. (2009, 2011a) adopted the concept of the concordance index to handle censored data. They considered all comparable pairs: for observed times y_i < y_j, ranking constraints are added so that misranking of the predicted values is penalized, f(x_j) - f(x_i) ≥ 1 - ζ_ij for i < j. Van Belle et al. (2011b) combined regression (a modified loss function) with the ranking constraints. Limitations: the prediction rule is not clear; there is no theoretical justification; the observed information may not be fully used; and censoring is assumed to be completely random.

Introduction Goldberg and Kosorok (2013) used inverse probability of censoring weighting (IPCW) to adapt standard support vector methods. Their method may suffer from severe bias when the censoring distribution is misspecified; the learning uses only uncensored observations; and large weights (under heavy censoring) often make the algorithm numerically unstable or even infeasible.
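
To make the instability point concrete, here is a rough sketch (my own illustration, not code from any of these papers) of Kaplan-Meier based IPCW weights w_i = Δ_i / Ŝ_C(Y_i-): censored subjects receive weight zero, and under heavy censoring Ŝ_C becomes small, so the remaining uncensored observations carry very large weights.

```python
import numpy as np

def ipcw_weights(obs_times, events):
    """Inverse probability of censoring weights w_i = Delta_i / S_C(Y_i-),
    with S_C the Kaplan-Meier estimate of the censoring survival function.
    Assumes no tied censoring times; written O(n^2) for clarity, not speed."""
    obs_times = np.asarray(obs_times, dtype=float)
    events = np.asarray(events, dtype=bool)
    weights = np.zeros(len(obs_times))
    for i in np.where(events)[0]:
        s_c = 1.0
        # product over censoring times strictly before this subject's event time
        for c in obs_times[(~events) & (obs_times < obs_times[i])]:
            s_c *= 1.0 - 1.0 / np.sum(obs_times >= c)
        weights[i] = 1.0 / s_c  # blows up when staying uncensored is unlikely
    return weights
```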

Method

Method We represent the survival times in the counting-process framework. At each event time, we use a support vector machine to separate the event from the non-events among the subjects at risk.

Method A risk score f(t, X) = α(t) + g(X) is used to classify the binary outcomes (events vs. non-events) in a maximal-separation sense.
α(t): depends on t; stratifies the event times to define risk sets.
g(X): a function of the covariates that does the actual prediction, e.g. g(X) = X^T β for the linear kernel.
At each t the classes are imbalanced (1 event vs. n non-events), which the weights below account for.

Method Notation: counting process N_i(t) = I(T_i ≤ t, T_i ≤ C_i); at-risk process Y_i(t) = I(T_i ∧ C_i ≥ t); total number of events d = Σ_{i=1}^n I(T_i ≤ C_i).
Primal form:
  min_{α,g} (1/2)‖g‖² + C_n Σ_{i=1}^n Σ_{j=1}^d Y_i(t_j) w_i(t_j) ζ_i(t_j)
  s.t. Y_i(t_j) δN_i(t_j){α(t_j) + g(X_i)} ≥ Y_i(t_j){1 - ζ_i(t_j)},  Y_i(t_j) ζ_i(t_j) ≥ 0,
where δN_i(t_j) = 2{N_i(t_j) - N_i(t_j-)} - 1, and
  w_i(t_j) = I{δN_i(t_j) = 1} {1 - 1/Σ_{k=1}^n Y_k(t_j)} + I{δN_i(t_j) = -1} {1/Σ_{k=1}^n Y_k(t_j)}.
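
To make the construction concrete, here is a minimal sketch (my own illustration; the function name and the input arrays times, events, X are hypothetical) of the counting-process expansion: each distinct event time t_j contributes one weighted binary sample per at-risk subject, labeled +1 for the subject who fails at t_j and -1 otherwise, with the weights w_i(t_j) defined above.

```python
import numpy as np

def counting_process_expansion(times, events, X):
    """Expand censored data into weighted binary samples, one per
    (at-risk subject i, event time t_j) pair, following the SVHR primal form.
    Returns covariates, labels deltaN in {+1, -1}, weights w_i(t_j),
    and the event-time block index j."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    X = np.asarray(X, dtype=float)

    event_times = np.unique(times[events])          # distinct observed event times t_1 < ... < t_d
    rows, labels, weights, blocks = [], [], [], []
    for j, t in enumerate(event_times):
        at_risk = times >= t                        # Y_i(t_j) = 1
        n_risk = at_risk.sum()
        for i in np.where(at_risk)[0]:
            is_event = bool(events[i]) and times[i] == t   # deltaN_i(t_j) = +1 for the failure at t_j
            rows.append(X[i])
            labels.append(1.0 if is_event else -1.0)
            # weights correct the 1-vs-(n_risk - 1) imbalance within each risk set
            weights.append(1.0 - 1.0 / n_risk if is_event else 1.0 / n_risk)
            blocks.append(j)
    return np.array(rows), np.array(labels), np.array(weights), np.array(blocks)
```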

Method Dual form:
  L_D = Σ_{i=1}^n Σ_{j=1}^d γ_ij Y_i(t_j) - (1/2) Σ_{i,i'=1}^n Σ_{j,j'=1}^d γ_ij γ_i'j' Y_i(t_j) Y_i'(t_j') δN_i(t_j) δN_i'(t_j') K(X_i, X_i')
  s.t. 0 ≤ γ_ij ≤ w_i(t_j) C_n,  Σ_{i=1}^n γ_ij Y_i(t_j) δN_i(t_j) = 0 for each j.
Here K(X_i, X_i') is the kernel function, the inner product of the feature maps underlying g. The predictive score is ĝ(x) = Σ_{i=1}^n Σ_{j=1}^d γ̂_ij Y_i(t_j) δN_i(t_j) K(X_i, x).
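
Purely as an illustration (not the authors' software), the dual can be handed to an off-the-shelf convex solver. The sketch below assumes the expanded samples from the previous snippet, uses a Gaussian kernel as an example, and relies on cvxpy; all function names are my own.

```python
import numpy as np
import cvxpy as cp

def gaussian_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_svhr_dual(Xs, labels, weights, blocks, Cn=1.0, sigma=1.0):
    """Solve the dual QP over the expanded samples and return the gamma multipliers.
    Xs, labels (deltaN in {+1,-1}), weights (w_i(t_j)), and blocks (event-time index)
    come from the counting-process expansion sketched earlier."""
    K = gaussian_kernel(Xs, Xs, sigma)
    Q = (labels[:, None] * labels[None, :]) * K              # label-signed Gram matrix
    L = np.linalg.cholesky(Q + 1e-6 * np.eye(len(labels)))   # small jitter keeps the factorization valid
    gamma = cp.Variable(len(labels), nonneg=True)
    objective = cp.Maximize(cp.sum(gamma) - 0.5 * cp.sum_squares(L.T @ gamma))
    constraints = [gamma <= weights * Cn]
    for j in np.unique(blocks):
        idx = np.where(blocks == j)[0]
        # one equality constraint per event time, dual to the intercept alpha(t_j)
        constraints.append(cp.sum(cp.multiply(labels[idx], gamma[idx])) == 0)
    cp.Problem(objective, constraints).solve()
    return gamma.value

def predict_score(x_new, Xs, labels, gamma, sigma=1.0):
    """Predictive score g_hat(x) = sum_ij gamma_ij * deltaN_i(t_j) * K(X_i, x)."""
    k = gaussian_kernel(np.atleast_2d(np.asarray(x_new, float)), Xs, sigma)[0]
    return float(np.sum(gamma * labels * k))
```

The per-block equality constraints are the dual counterparts of the time-specific intercepts α(t_j), which is why there is one such constraint for each event time.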

Method Prediction: use the predictive scores ĝ(x) and the distinct event times in the training data to predict the survival outcome of a future subject. In the training data, pair the predictive scores with the ordered distinct event times, e.g. ĝ(x_3) with T_1, ĝ(x_2) with T_2, ĝ(x_1) with T_3, and so on. For a future subject with score ĝ(x_new): if ĝ(x_new) is closest to ĝ(x_2), we predict the survival outcome of this subject to be T_2.
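
A minimal sketch of this nearest-score rule (assuming paired arrays train_scores and train_event_times as above; the names are mine):

```python
import numpy as np

def predict_event_time(score_new, train_scores, train_event_times):
    """Predict the event time of a new subject as the event time paired with the
    training predictive score closest to score_new."""
    train_scores = np.asarray(train_scores, dtype=float)
    nearest = int(np.argmin(np.abs(train_scores - score_new)))
    return train_event_times[nearest]
```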

Theoretical Results

Empirical risk Regularization form: min_f R_n(f) + λ_n ‖g‖², where
  R_n(f) = n^{-1} Σ_{i=1}^n Σ_{j=1}^d w_i(t_j) Y_i(t_j) [1 - {α(t_j) + g(X_i)} δN_i(t_j)]_+.
After substituting the minimizing α̂(t_j) into R_n(f), we obtain a profile empirical risk PR_n(g) for g(·):
  PR_n(g) = (1/n) Σ_{i=1}^n Δ_i { Σ_{k=1}^n I(Y_k ≥ Y_i) [2 - g(X_i) + g(X_k)]_+ / Σ_{k=1}^n I(Y_k ≥ Y_i) } - (2/n) Σ_{i=1}^n Δ_i / Σ_{k=1}^n I(Y_k ≥ Y_i),
where Y_i = T_i ∧ C_i and Δ_i = I(T_i ≤ C_i).
ĝ(x) minimizes PR_n(g) + λ_n ‖g‖², and R_n(f̂) = PR_n(ĝ). PR_n(g) takes a form similar to the partial likelihood function in survival analysis, under a different loss function.
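
For concreteness, a small sketch (my own illustration; arrays obs_times, events, and fitted scores g are assumed) of the profiled empirical risk PR_n(g) above:

```python
import numpy as np

def profile_empirical_risk(g, obs_times, events):
    """Compute PR_n(g): for each event i, average the hinge terms
    [2 - g(X_i) + g(X_k)]_+ over the risk set {k: Y_k >= Y_i},
    subtract 2 / (risk-set size), sum over events, and divide by n."""
    g = np.asarray(g, dtype=float)
    obs_times = np.asarray(obs_times, dtype=float)
    events = np.asarray(events, dtype=bool)
    n = len(g)
    total = 0.0
    for i in np.where(events)[0]:
        at_risk = obs_times >= obs_times[i]
        hinge = np.maximum(2.0 - g[i] + g[at_risk], 0.0)
        total += hinge.mean() - 2.0 / at_risk.sum()
    return total / n
```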

Risk Function and Optimal Decision Rule We consider another empirical risk R_0n(f), the analogue of the 0-1 loss in standard SVMs,
  R_0n(f) = n^{-1} Σ_{i=1}^n Σ_{j=1}^d w_i(t_j) Y_i(t_j) I{δN_i(t_j) f(t_j, X_i) < 0}.
Denoting the asymptotic limits of R_n(f) and R_0n(f) by R(f) and R_0(f), we find the optimal decision rule f*(t, x) that minimizes both R(f) and R_0(f), and the minimal risk R_0(f*).

Risk Function and Optimal Decision Rule Theorem. Let h(t, x) denote the conditional hazard rate of T at t given X = x, and let h̄(t) = E[dN(t)/dt]/E[Y(t)] = E[h(t, X) | Y(t) = 1] be the average hazard rate at time t. Then f*(t, x) = sign{h(t, x) - h̄(t)} minimizes R(f). Furthermore, f*(t, x) also minimizes R_0(f), and
  R_0(f*) = P(T ≤ C) - (1/2) E[ ∫ E{Y(t) | X = x} |h(t, x) - h̄(t)| dt ].
In addition, for any f(t, x) ∈ [-1, 1],
  R_0(f) - R_0(f*) ≤ c {R(f) - R(f*)}
for some constant c.

Asymptotic Properties We consider the profile risk PR(g) and the reproducing kernel Hilbert space H_n generated by a Gaussian kernel with bandwidth σ_n, and derive the asymptotic learning rate.
Theorem. Assume that the support of X is compact and that E[Y(τ) | X] is bounded away from zero, where τ is the study duration. Furthermore, assume λ_n and σ_n satisfy λ_n → 0, σ_n → 0, and nλ_n σ_n^{(2/p - 1/2)d} → ∞ for some p ∈ (0, 2). Then
  λ_n ‖ĝ‖²_{H_n} + PR(ĝ) ≤ inf_g PR(g) + O_p( λ_n + σ_n^{d/2} + 1/(√n λ_n^{1/2} σ_n^{(1/p - 1/4)d}) ).

Simulation Studies

Simulation Setup Our method is compared with Van Belle et al. (2011) (Modified SVR) and Goldberg and Kosorok (2013) (IPCW). Five baseline covariates X, marginally normal with mean 0 and variance 0.25. Survival times generated from a Cox model with a Weibull baseline and β = (2, 1.6, 1.2, 0.8, 0.4). Censoring times depend on the covariates X, with censoring rates of 40% and 60%: Case 1 uses AFT models for censoring; Case 2 uses Cox models for censoring.

Simulation Setup Linear kernel; sample sizes 100 and 200; 500 replicates. Tuning parameter C_n selected by 5-fold cross-validation over the grid 2^{-16}, 2^{-15}, ..., 2^{16}. 10,000 observations in the testing data set. Prediction performance evaluated by correlation and RMSE.
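
As a rough illustration of this setup (the slide does not give the Weibull baseline or censoring parameters, so the values and the simple exponential censoring below are placeholders, not the paper's Case 1/Case 2 mechanisms), survival times from a Cox model with a Weibull baseline can be drawn by inverse-transform sampling:

```python
import numpy as np

def simulate_cox_weibull(n, beta, shape=2.0, scale=1.0, censor_scale=5.0, seed=None):
    """Cox model with Weibull baseline: hazard h(t | X) = scale*shape*t^(shape-1)*exp(X beta),
    so the cumulative baseline is Lambda0(t) = scale*t^shape. Covariates are N(0, 0.25).
    Censoring here is plain exponential, a placeholder only."""
    rng = np.random.default_rng(seed)
    p = len(beta)
    X = rng.normal(0.0, 0.5, size=(n, p))                 # sd 0.5 -> variance 0.25
    lin = X @ np.asarray(beta, dtype=float)
    u = rng.uniform(size=n)
    T = (-np.log(u) / (scale * np.exp(lin))) ** (1.0 / shape)   # inverse-transform sampling
    C = rng.exponential(censor_scale, size=n)
    return X, np.minimum(T, C), T <= C                    # covariates, observed times, event indicators

# example: five covariates with beta = (2, 1.6, 1.2, 0.8, 0.4)
X, y, d = simulate_cox_weibull(200, beta=[2, 1.6, 1.2, 0.8, 0.4], seed=0)
```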

Simulation Results
Table: Case 1, censoring times following the AFT model; 40% censoring. Entries are Corr., RMSE (SD in parentheses), and Ratio (RMSE relative to SVHR).

# noise  Method        | n = 100: Corr.  RMSE         Ratio | n = 200: Corr.  RMSE         Ratio
0        Modified SVR  |          0.59   5.59 (0.60)  1.19  |          0.62   5.58 (0.58)  1.24
0        IPCW-KM       |          0.40   5.60 (0.52)  1.20  |          0.45   5.45 (0.41)  1.21
0        IPCW-Cox      |          0.43   5.80 (0.64)  1.24  |          0.50   5.62 (0.57)  1.25
0        SVHR          |          0.61   4.68 (0.27)  1.00  |          0.64   4.49 (0.17)  1.00
5        Modified SVR  |          0.55   5.64 (0.60)  1.15  |          0.61   5.63 (0.57)  1.22
5        IPCW-KM       |          0.32   5.93 (0.47)  1.21  |          0.42   5.63 (0.44)  1.22
5        IPCW-Cox      |          0.33   6.17 (0.54)  1.26  |          0.44   5.87 (0.57)  1.27
5        SVHR          |          0.58   4.90 (0.35)  1.00  |          0.63   4.62 (0.20)  1.00
15       Modified SVR  |          0.46   5.73 (0.47)  1.10  |          0.54   5.55 (0.50)  1.15
15       IPCW-KM       |          0.21   6.12 (0.32)  1.18  |          0.31   5.86 (0.34)  1.22
15       IPCW-Cox      |          0.20   6.47 (0.46)  1.24  |          0.32   6.09 (0.47)  1.26
15       SVHR          |          0.48   5.20 (0.36)  1.00  |          0.57   4.82 (0.23)  1.00
95       Modified SVR  |          0.21   6.65 (0.89)  1.10  |          0.30   6.29 (0.47)  1.09
95       IPCW-KM       |          0.06   6.33 (0.21)  1.05  |          0.10   6.28 (0.14)  1.09
95       IPCW-Cox      |          0.08   6.59 (0.23)  1.09  |          0.11   6.61 (0.39)  1.15
95       SVHR          |          0.22   6.04 (0.32)  1.00  |          0.32   5.76 (0.25)  1.00

Simulation Results
Table: Case 1, censoring times following the AFT model; 60% censoring. Entries are Corr., RMSE (SD in parentheses), and Ratio (RMSE relative to SVHR).

# noise  Method        | n = 100: Corr.  RMSE         Ratio | n = 200: Corr.  RMSE         Ratio
0        Modified SVR  |          0.55   6.00 (0.54)  1.16  |          0.60   6.07 (0.42)  1.24
0        IPCW-KM       |          0.15   6.45 (0.41)  1.25  |          0.18   6.42 (0.37)  1.32
0        IPCW-Cox      |          0.21   6.56 (0.47)  1.27  |          0.26   6.47 (0.48)  1.33
0        SVHR          |          0.57   5.18 (0.43)  1.00  |          0.61   4.88 (0.33)  1.00
5        Modified SVR  |          0.50   6.06 (0.53)  1.12  |          0.57   6.07 (0.50)  1.21
5        IPCW-KM       |          0.11   6.61 (0.34)  1.22  |          0.15   6.56 (0.32)  1.31
5        IPCW-Cox      |          0.15   6.77 (0.39)  1.25  |          0.21   6.66 (0.39)  1.33
5        SVHR          |          0.51   5.40 (0.48)  1.00  |          0.58   5.02 (0.33)  1.00
15       Modified SVR  |          0.39   6.14 (0.45)  1.10  |          0.49   5.97 (0.45)  1.15
15       IPCW-KM       |          0.07   6.56 (0.30)  1.17  |          0.10   6.54 (0.24)  1.26
15       IPCW-Cox      |          0.10   6.82 (0.30)  1.22  |          0.13   6.70 (0.27)  1.29
15       SVHR          |          0.40   5.60 (0.44)  1.00  |          0.51   5.20 (0.36)  1.00
95       Modified SVR  |          0.17   6.90 (1.08)  1.11  |          0.25   7.20 (1.52)  1.21
95       IPCW-KM       |          0.01   6.53 (0.26)  1.05  |          0.03   6.54 (0.20)  1.10
95       IPCW-Cox      |          0.02   6.87 (0.20)  1.10  |          0.04   6.86 (0.21)  1.15
95       SVHR          |          0.17   6.22 (0.24)  1.00  |          0.26   5.94 (0.25)  1.00

Simulation Results
Table: Case 2, censoring times following the Cox model; 40% censoring. Entries are Corr., RMSE (SD in parentheses), and Ratio (RMSE relative to SVHR).

# noise  Method        | n = 100: Corr.  RMSE         Ratio | n = 200: Corr.  RMSE         Ratio
0        Modified SVR  |          0.59   5.15 (0.59)  1.11  |          0.62   5.09 (0.54)  1.12
0        IPCW-KM       |          0.53   5.16 (0.42)  1.11  |          0.55   5.08 (0.31)  1.12
0        IPCW-Cox      |          0.52   5.31 (0.57)  1.14  |          0.56   5.09 (0.46)  1.12
0        SVHR          |          0.61   4.66 (0.25)  1.00  |          0.63   4.53 (0.16)  1.00
5        Modified SVR  |          0.56   5.28 (0.51)  1.08  |          0.61   5.09 (0.50)  1.12
5        IPCW-KM       |          0.46   5.58 (0.42)  1.14  |          0.52   5.27 (0.34)  1.13
5        IPCW-Cox      |          0.44   5.73 (0.52)  1.17  |          0.51   5.41 (0.51)  1.16
5        SVHR          |          0.58   4.89 (0.29)  1.00  |          0.62   4.65 (0.18)  1.00
15       Modified SVR  |          0.47   5.43 (0.40)  1.04  |          0.55   5.14 (0.38)  1.06
15       IPCW-KM       |          0.36   5.79 (0.34)  1.11  |          0.44   5.49 (0.30)  1.13
15       IPCW-Cox      |          0.34   6.00 (0.40)  1.15  |          0.42   5.70 (0.43)  1.18
15       SVHR          |          0.49   5.21 (0.33)  1.00  |          0.57   4.84 (0.20)  1.00
95       Modified SVR  |          0.21   6.43 (0.92)  1.04  |          0.33   6.03 (0.54)  1.04
95       IPCW-KM       |          0.17   6.16 (0.21)  1.00  |          0.24   6.06 (0.18)  1.05
95       IPCW-Cox      |          0.16   6.32 (0.23)  1.02  |          0.22   6.21 (0.22)  1.07
95       SVHR          |          0.23   6.18 (0.40)  1.00  |          0.34   5.78 (0.24)  1.00

Simulation Results
Table: Case 2, censoring times following the Cox model; 60% censoring. Entries are Corr., RMSE (SD in parentheses), and Ratio (RMSE relative to SVHR).

# noise  Method        | n = 100: Corr.  RMSE         Ratio | n = 200: Corr.  RMSE         Ratio
0        Modified SVR  |          0.56   5.43 (0.56)  1.08  |          0.59   5.43 (0.47)  1.12
0        IPCW-KM       |          0.44   5.68 (0.43)  1.13  |          0.46   5.62 (0.33)  1.16
0        IPCW-Cox      |          0.42   5.83 (0.56)  1.16  |          0.47   5.67 (0.48)  1.17
0        SVHR          |          0.57   5.01 (0.37)  1.00  |          0.60   4.85 (0.25)  1.00
5        Modified SVR  |          0.50   5.61 (0.48)  1.07  |          0.57   5.40 (0.46)  1.09
5        IPCW-KM       |          0.36   6.02 (0.38)  1.15  |          0.43   5.79 (0.35)  1.17
5        IPCW-Cox      |          0.34   6.25 (0.44)  1.20  |          0.41   5.96 (0.47)  1.20
5        SVHR          |          0.53   5.23 (0.37)  1.00  |          0.59   4.96 (0.27)  1.00
15       Modified SVR  |          0.40   5.77 (0.42)  1.05  |          0.50   5.44 (0.38)  1.06
15       IPCW-KM       |          0.27   6.07 (0.31)  1.10  |          0.35   5.94 (0.26)  1.16
15       IPCW-Cox      |          0.25   6.39 (0.40)  1.16  |          0.32   6.16 (0.33)  1.20
15       SVHR          |          0.42   5.51 (0.40)  1.00  |          0.52   5.13 (0.29)  1.00
95       Modified SVR  |          0.18   6.47 (0.87)  1.05  |          0.27   6.31 (0.80)  1.05
95       IPCW-KM       |          0.12   6.22 (0.29)  1.01  |          0.18   6.19 (0.21)  1.03
95       IPCW-Cox      |          0.12   6.54 (0.26)  1.07  |          0.16   6.50 (0.23)  1.08
95       SVHR          |          0.20   6.14 (0.38)  1.00  |          0.28   6.00 (0.35)  1.00

Application

Huntington's Disease Study Data The study aims to identify and combine clinical and biological markers to detect early indicators of disease progression. The outcome is age, observed at the event or at censoring. There are 705 subjects, of whom 126 are non-censored. We study the prediction capability of 15 covariates for the age-at-onset of Huntington's disease. Three-fold cross-validation is used to choose the tuning parameter C_n. We consider both the linear kernel and the Gaussian kernel. To evaluate prediction capability, we assess the usefulness of the combined score for risk stratification.
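
As one hedged illustration of this kind of evaluation (my own sketch, not the authors' code), a censored-data concordance index over comparable pairs can be computed as follows; tied observation times are handled only crudely here.

```python
import numpy as np

def concordance_index(obs_times, events, risk_scores):
    """C-index for censored data: among comparable pairs (the earlier observed
    time is an event), the fraction in which the earlier-failing subject has
    the higher risk score; tied scores count 1/2."""
    obs_times = np.asarray(obs_times, dtype=float)
    events = np.asarray(events, dtype=bool)
    risk_scores = np.asarray(risk_scores, dtype=float)
    concordant, comparable = 0.0, 0
    for i in np.where(events)[0]:
        later = obs_times > obs_times[i]          # subjects known to survive past Y_i
        comparable += later.sum()
        concordant += (risk_scores[i] > risk_scores[later]).sum()
        concordant += 0.5 * (risk_scores[i] == risk_scores[later]).sum()
    return concordant / comparable
```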

Huntington's Disease Study Data
Table: Normalized coefficient estimates using the linear kernel

Marker                Normalized β   Cox model (a)
Total Motor Score      0.680          0.354 *
CAP                    0.440          0.334 *
Stroop Color          -0.235         -0.247
Stroop Word           -0.208         -0.107
SDMT                  -0.151         -0.076
Stroop Interference    0.034          0.271
FRSBE Total            0.246          0.242
UHDRS Psychiatric      0.197          0.270
SCL90 Depression      -0.062         -0.306
SCL90 GSI             -0.004          0.114
SCL90 PST             -0.081         -0.217
SCL90 PSDI             0.096          0.061
TFC                   -0.054         -0.047
Education             -0.025         -0.092
Male Gender           -0.315         -0.392 *

(a) Estimates from the Cox model with significant p-values (p-value < 0.05) are marked with *.

Huntington's Disease Study Data
Table: Comparison of prediction capability for different methods

Kernel    Method        C-index | 25th pct: Logrank χ² (a)  HR (b) | 50th pct: Logrank χ²  HR | 75th pct: Logrank χ²  HR
Linear    Modified SVR  0.71    |           42.50           3.08   |           27.53       2.98 |          13.41       3.82
Linear    IPCW-KM       0.44    |            0.06           1.06   |            0.23       1.09 |           1.39       1.26
Linear    IPCW-Cox      0.54    |            5.47           1.57   |            5.66       1.53 |           0.90       1.22
Linear    SVHR          0.75    |           81.79           4.68   |           35.12       4.74 |          16.53       7.76
Gaussian  Modified SVR  0.72    |           46.53           3.27   |           30.24       3.34 |          14.33       3.67
Gaussian  IPCW-KM       0.44    |            0.32           1.16   |            0.86       1.19 |           1.50       1.27
Gaussian  IPCW-Cox      0.53    |            5.42           1.57   |            3.89       1.42 |           1.97       1.34
Gaussian  SVHR          0.75    |           78.66           4.63   |           36.78       4.68 |          17.46       8.11

(a) Logrank χ²: chi-square statistics from logrank tests for the two groups separated at the 25th, 50th, and 75th percentiles of the predicted values.
(b) HR: hazard ratios comparing the two groups separated at the 25th, 50th, and 75th percentiles of the predicted values.

Huntington s Disease Study Data 10 Linear Kernel 8 Hazard ratio 6 4 2 0 5 15 25 35 45 55 65 75 Percentile as the cut point separating binary groups Figure: Hazard ratios comparing two groups separated using percentiles of predicted values as cut points. Hazard ratio 10 8 6 4 Gaussian Kernel Dotted curve: Modified SVR; Dashed curve: IPCW-KM; Dashed-dotted curve: IPCW-Cox; Black solid curve: SVHR. 2 0 5 15 25 35 45 55 65 75 Percentile as the cut point separating binary groups

Atherosclerosis Risk in Communities Study Data A prospective epidemiologic study investigating the etiology of atherosclerosis and cardiovascular risk factors. The baseline examination enrolled 15,792 participants aged 45-64 from four U.S. communities. In this example, we apply our method to part of the baseline data: African-American males with hypertension living in Jackson, Mississippi. There are 624 participants, of whom 133 are non-censored. We assess the prediction capability of some common cardiovascular risk factors for incident heart failure through 2005. We analyze the data following the same procedure as in the previous example.

Atherosclerosis Risk in Communities Study Data
Table: Normalized coefficient estimates using the linear kernel

Covariate                      Normalized β   Cox model (a)
Age (in years)                  0.363          0.328 *
Diabetes                        0.288          0.221 *
BMI (kg/m^2)                    0.150          0.136
SBP (mm Hg)                     0.172          0.178
Fasting glucose (mg/dL)         0.173         -0.093
Serum albumin (g/dL)           -0.363         -0.273 *
Serum creatinine (mg/dL)        0.007          0.029
Heart rate (beats/minute)       0.124          0.125
Left ventricular hypertrophy    0.250          0.158 *
Bundle branch block             0.341          0.242 *
Prevalent CHD                   0.330          0.216 *
Valvular heart disease          0.200          0.169 *
HDL (mg/dL)                    -0.287         -0.436 *
LDL (mg/dL)                     0.016          0.051
Pack-years of smoking           0.289          0.230 *
Current smoking status          0.210          0.022
Former smoking status          -0.133         -0.232 *

(a) Estimates from the Cox model with significant p-values (p-value < 0.05) are marked with *.

Atherosclerosis Risk in Communities Study Data
Table: Comparison of prediction capability for different methods

Kernel    Method        C-index | 25th pct: Logrank χ² (a)  HR (b) | 50th pct: Logrank χ²  HR | 75th pct: Logrank χ²  HR
Linear    Modified SVR  0.74    |           90.52           4.63   |           59.11       4.16 |          31.85       5.01
Linear    IPCW-KM       0.69    |           54.90           3.48   |           29.53       2.64 |          22.92       3.45
Linear    IPCW-Cox      0.71    |           48.34           3.24   |           39.70       3.12 |          27.63       4.32
Linear    SVHR          0.76    |           95.09           4.78   |           67.06       4.63 |          34.93       5.36
Gaussian  Modified SVR  0.76    |          105.10           5.12   |           70.41       4.87 |          37.66       6.39
Gaussian  IPCW-KM       0.70    |           58.15           3.61   |           33.49       2.81 |          19.61       3.00
Gaussian  IPCW-Cox      0.72    |           52.77           3.39   |           47.10       3.50 |          27.99       4.37
Gaussian  SVHR          0.77    |          111.10           5.31   |           64.79       4.53 |          35.60       5.76

(a) Logrank χ²: chi-square statistics from logrank tests for the two groups separated at the 25th, 50th, and 75th percentiles of the predicted values.
(b) HR: hazard ratios comparing the two groups separated at the 25th, 50th, and 75th percentiles of the predicted values.

Atherosclerosis Risk in Communities Study Data 10 Linear Kernel 8 Hazard ratio 6 4 2 0 5 15 25 35 45 55 65 75 Percentile as the cut point separating binary groups Figure: Hazard ratios comparing two groups separated using percentiles of predicted values as cut points. Hazard ratio 10 8 6 4 Gaussian Kernel Dotted curve: Modified SVR; Dashed curve: IPCW-KM; Dashed-dotted curve: IPCW-Cox; Black solid curve: SVHR. 2 0 5 15 25 35 45 55 65 75 Percentile as the cut point separating binary groups

Conclusions

Concluding Remarks We adapted the SVM learning algorithm to predict event times within the counting-process framework. The proposed method is optimal in discriminating the covariate-specific hazard from the population average hazard. Censored data are handled appropriately without specifying the censoring distribution. Numerical studies showed the superiority of SVHR, especially in the presence of high censoring rates and noise variables. One potential challenge is that the dimension of the quadratic programming problem grows quickly with the sample size.