Survival SVM: a Practical Scalable Algorithm

V. Van Belle, K. Pelckmans, J.A.K. Suykens and S. Van Huffel
Katholieke Universiteit Leuven - Dept. of Electrical Engineering (ESAT), SCD
Kasteelpark Arenberg - B-3001 Leuven - Belgium
{vvanbell, kpelckma, johan.suykens, vanhuffe}@esat.kuleuven.be

Abstract. This work advances the Support Vector Machine (SVM) based approach for predictive modelling of failure time data proposed in [1]. The main results concern a drastic reduction in computation time, an improved criterion for model selection, and the use of additive models for improved interpretability in this context. Particular attention is given to the influence of right-censoring on the methods. The approach is illustrated on a case study in prostate cancer.

1 Introduction

Survival analysis models the time until the event under study occurs. Censored data are a particular problem in such studies. Censoring occurs when the exact response is unknown. There are three types of censoring: for right-censored data it is only known that the event did not occur before a certain time, for left-censored data the event occurred before a certain time, and interval-censored data have both a right and a left censoring time. In this work we concentrate on right-censored data.

Traditional statistical survival techniques such as the Cox proportional hazards model or the accelerated failure time model focus on explicitly modelling the underlying probabilistic mechanism of the phenomenon under study [2]. The main focus of machine learning techniques, by contrast, is to learn a predictive rule that generalizes well to unseen data [3]. Amongst other applications, SVMs have been used for prognostic purposes by reformulating the survival problem as a classification problem, dividing the time axis into predefined intervals or classes [4], and as a rank regression problem in our previous work [1]. The algorithm proposed in the latter optimizes the concordance index between observed event times and estimated ranks of event occurrence. The relation between the concordance index and the area under the ROC curve permits the translation of advances in machine learning in the context of ordinal regression, ranking and information retrieval [5, 6].

In the present work we propose a practical alternative to the computationally demanding algorithm of [1]. Optimization of the concordance index implies the comparison of all data pairs in the response and time domain. The number of comparisons is reduced by selecting appropriate pairs, resulting in a significant decrease of calculation time without notable loss of performance. Special attention is drawn to the influence of the censoring mechanism on the tuning algorithm and on performance.

This paper is organized as follows. Section 2 describes the modification of support vector machines for survival data and illustrates the reduction of the computational load. Section 3 summarizes results on artificial and cancer data.

Acknowledgements: We kindly acknowledge the support and constructive remarks of E. Biganzoli and P. Boracchi. Research supported by GOA-AMBioRICS, CoE EF/05/006, FWO G.0407.02 and G.0302.07, IWT, IUAP P6/04, BIOPATTERN (FP6-2002-IST 508803), eTUMOUR (FP6-2002-LIFESCIHEALTH 503094).

2 Support Vector Machines for Survival Data

Throughout the paper the following notation is used for vectors and matrices: a lower case symbol denotes a vector or a scalar (clear from the context), an upper case symbol denotes a matrix. The covariates $x_i$ of each subject $i = 1, \dots, N$ are organized in a matrix $X \in \mathbb{R}^{d \times N}$ with $X_i = x_i$. The failure times $\{t_i\}_{i=1}^{N}$ are organized in a vector $t \in \mathbb{R}^{N}$. $I_N$ denotes the identity matrix of size $N$.

Concordance Index

The concordance index (c-index) [7] is a measure of association between the predicted and observed failures in the case of right-censored data. The c-index equals the ratio of concordant to comparable pairs of data points. Two samples $i$ and $j$ are comparable if $t_i < t_j$ and $\delta_i = 1$, with $\delta$ a censoring variable equal to $1$ for an observed event and $0$ in case of censoring. A pair of samples $i$ and $j$ is concordant if they are comparable and $u(x_i) < u(x_j)$, with $u(x)$ the predicted value corresponding to the sample $x$. Formally, the sample-based c-index of a model generating predictions $u(x_i)$ for samples $x_i$ from a dataset $D = \{(x_i, t_i, \delta_i)\}_{i=1}^{N}$ can be expressed as

$$CI_N(u) = \frac{1}{N(N-1)} \sum_{i \neq j} I\big[(u(x_i) - u(x_j))(t(x_i) - t(x_j))\big], \qquad (1)$$

where $I[z] = 1$ if $z > 0$, and zero otherwise.
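
As an illustration of the concordance index and the pair definitions above, the following is a minimal Python sketch (not part of the original paper) that computes the ratio of concordant to comparable pairs for right-censored data; the function name and the NumPy dependency are our own choices.

```python
import numpy as np

def concordance_index(u, t, delta):
    """Ratio of concordant to comparable pairs for right-censored data.

    u     : predicted health values u(x_i) (larger = predicted later event)
    t     : observed times t_i
    delta : censoring indicators (1 = event observed, 0 = censored)
    """
    u, t, delta = map(np.asarray, (u, t, delta))
    comparable = 0
    concordant = 0
    n = len(t)
    for i in range(n):
        if delta[i] != 1:          # a comparable pair (i, j) needs an observed event at t_i
            continue
        for j in range(n):
            if t[i] < t[j]:        # comparable: t_i < t_j and delta_i = 1
                comparable += 1
                if u[i] < u[j]:    # concordant: predicted order matches observed order
                    concordant += 1
    return concordant / comparable if comparable > 0 else np.nan

# tiny usage example with made-up numbers
# print(concordance_index(u=[0.2, 0.5, 0.9], t=[1.0, 2.0, 3.0], delta=[1, 1, 0]))
```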

Pairwise Maximal Margin Machine

A learning strategy would try to find a mapping $u: \mathbb{R}^d \to \mathbb{R}$ which reconstructs the order of the observed failure times as measured by the corresponding $CI_N$. In order to overcome the combinatorial hardness which would result from directly optimizing $CI_N$, [1] proposed to optimize a convex relaxation; minimization of this upper bound results in an optimized empirical c-index. The optimal health function $u(x_i) = w^T \varphi(x_i) : \mathbb{R}^d \to \mathbb{R}$ can then be found as

$$(\hat{w}, \hat{\xi}) = \arg\min_{w, \xi} \; \frac{1}{2} w^T w + C \sum_{i<j,\, \delta_i = 1} v_{ij}\, \xi_{ij} \quad \text{s.t.} \quad \begin{cases} w^T \varphi(x_j) - w^T \varphi(x_i) \ge 1 - \xi_{ij} \\ \xi_{ij} \ge 0, \quad i, j = 1, \dots, N \end{cases} \qquad (2)$$

with $C$ a positive real constant and $v_{ij} = 1$ if the pair $(x_i, x_j)$ is comparable, $0$ otherwise; $\varphi: \mathbb{R}^d \to \mathbb{R}^{d_\varphi}$ is a feature map such that $K(x, x') = \varphi(x)^T \varphi(x')$ is a positive definite kernel function $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. Taking the Lagrangian

$$\mathcal{L}(w, \xi; \alpha, \beta) = \frac{1}{2} w^T w + C \sum_{i<j} v_{ij} \xi_{ij} - \sum_{i<j} \beta_{ij} \xi_{ij} - \sum_{i<j} \alpha_{ij} \big(w^T \varphi(x_j) - w^T \varphi(x_i) - 1 + \xi_{ij}\big) \qquad (3)$$

with non-negative multipliers $\alpha_{ij}, \beta_{ij} \ge 0$ leads to the optimality conditions

$$\begin{cases} w = \sum_{i<j} \alpha_{ij} \big(\varphi(x_j) - \varphi(x_i)\big) \\ C v_{ij} = \alpha_{ij} + \beta_{ij}, \quad i < j. \end{cases} \qquad (4)$$

The dual problem is obtained as

$$\min_{\alpha} \; \frac{1}{2} \sum_{i<j} \sum_{k<l} \alpha_{ij} \alpha_{kl} \big(\varphi(x_j) - \varphi(x_i)\big)^T \big(\varphi(x_l) - \varphi(x_k)\big) - \sum_{i<j} \alpha_{ij} \quad \text{s.t.} \quad 0 \le \alpha_{ij} \le C v_{ij}, \; i < j. \qquad (5)$$

The estimated health function can be evaluated at a new point $x_*$ as $\hat{u}(x_*) = \sum_{i<j} \alpha_{ij} \big(K(x_j, x_*) - K(x_i, x_*)\big)$.

A major drawback of this algorithm is its large computational cost, which makes the method impractical for larger datasets. In the next section an approach to reduce this prohibitive computational cost is proposed.

A Scalable Nearest Neighbor Algorithm

The above method can be adapted to reduce the computational load without considerable loss of performance. To find an optimal health function $u(x)$, $O(qN^2/2)$ comparisons between time and health values are required, with $q$ the fraction of non-censored samples and $N$ the total number of samples in the training dataset. The calculation time can be largely reduced by selecting, for each sample $i$, a set $\mathcal{C}_i$ of $k$ samples with a survival time nearest to the survival time of sample $i$ (a k-nearest-neighbor (k-NN) selection):

$$\mathcal{C}_i = \{(i, j) : t_j \text{ is among the } k \text{ nearest to } t_i\}, \quad i = 1, \dots, N, \qquad (6)$$

decreasing the number of comparisons to $O(qkN)$. The effect of the number of neighbors in the training algorithm is illustrated in Figure 1. For a linear function (Figure 1(a)) the value of $k$ in the k-NN selection has no influence on the performance, so a very small number of neighbors suffices. For a non-linear function the value of $k$ governs a trade-off between computational load and performance; however, performance no longer increases above a certain value of $k$.
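
To make the derivation concrete, here is a small Python sketch (ours, not from the paper) of the nearest-neighbor pair selection (6) and the dual problem (5) solved as a box-constrained quadratic program. The Gaussian RBF kernel, the cvxopt solver, the array-based data layout and all function names are assumptions made for illustration; the paper does not prescribe an implementation.

```python
import numpy as np
from cvxopt import matrix, solvers  # generic QP solver -- an assumption, not prescribed by the paper

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def select_pairs(t, delta, k):
    """Comparable pairs (i, j): t_i < t_j, delta_i = 1, and t_j among the k
    survival times nearest to t_i (one reading of the k-NN reduction in (6))."""
    t, delta = np.asarray(t, float), np.asarray(delta)
    pairs = []
    for i in np.where(delta == 1)[0]:
        later = np.where(t > t[i])[0]                      # candidates j with t_i < t_j
        nearest = later[np.argsort(t[later] - t[i])[:k]]   # keep the k closest in time
        pairs += [(i, j) for j in nearest]
    return pairs

def train_survival_svm(X, t, delta, C=1.0, k=5, sigma=1.0):
    """Solve the dual (5), restricted to the selected pairs, with a generic QP solver."""
    pairs = select_pairs(t, delta, k)
    K = rbf_kernel(X, X, sigma)
    n_pairs = len(pairs)
    # D[p, q] = (phi(x_j) - phi(x_i))^T (phi(x_l) - phi(x_m)), expressed via the kernel
    D = np.zeros((n_pairs, n_pairs))
    for p, (i, j) in enumerate(pairs):
        for q, (m, l) in enumerate(pairs):
            D[p, q] = K[j, l] - K[j, m] - K[i, l] + K[i, m]
    D += 1e-8 * np.eye(n_pairs)                            # tiny ridge for numerical stability
    # QP: min_alpha 1/2 alpha^T D alpha - 1^T alpha   s.t.   0 <= alpha <= C
    G = np.vstack([np.eye(n_pairs), -np.eye(n_pairs)])
    h = np.hstack([C * np.ones(n_pairs), np.zeros(n_pairs)])
    sol = solvers.qp(matrix(D), matrix(-np.ones(n_pairs)), matrix(G), matrix(h))
    alpha = np.array(sol['x']).ravel()
    return pairs, alpha

def predict(X_train, pairs, alpha, X_new, sigma=1.0):
    """Evaluate u_hat(x*) = sum_p alpha_p (K(x_j, x*) - K(x_i, x*))."""
    K_new = rbf_kernel(X_train, X_new, sigma)              # shape (N_train, N_new)
    i_idx = np.array([i for i, _ in pairs])
    j_idx = np.array([j for _, j in pairs])
    return (alpha[:, None] * (K_new[j_idx] - K_new[i_idx])).sum(axis=0)
```

The number of dual variables equals the number of retained pairs, so the nearest-neighbor selection in select_pairs is precisely what keeps the quadratic program tractable for larger N.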

[Figure 1 (plots omitted): Concordance index on test sets depending on k and the censoring percentage. Panels: (a) y1 = x + white Gaussian noise and (b) y2 = sinc(x) + white Gaussian noise, each shown for 30% and 60% censoring with 1, 3, 5, 10, 15 and 20 nearest neighbors. For a non-linear function the performance saturates for k = 10.]

Model Selection Schemes

High censoring rates, as are common in applications, call for special care in the model selection stage, as the implied increase in variance can deteriorate performance. We briefly discuss three model selection schemes in the context of censored data. The first scheme uses a single validation set of size N/2, with N the number of training samples. The second scheme randomizes this split a number of times, such that each iteration uses a disjoint training and validation set. The classical 10-fold cross-validation (CV) criterion, the third scheme, imposes that the validation sets of size N/10 are disjoint over the folds.

The use of these schemes is illustrated on a small example. An artificial dataset with y = sinc(x) + noise (normally distributed) was created. Figure 2 shows that for a dataset with 30% censoring, CV has a much wider spread in concordance; the second method performs best. When 90% of the data are censored, the c-index on the validation sets ranges from 0 to 1 when tuning via CV. This is explained by the very low number of events in a set of only N/10 samples; as a consequence, there is no guarantee on the generalization performance of this model. Since the first method depends on the particular partition into training and validation samples, the results on the test set are disappointing. The second tuning scheme forms a compromise, resulting in better generalization ability. However, performance on highly censored data remains uncertain.

[Figure 2 (plots omitted): Concordance depending on the tuning algorithm and the censoring percentage q, using k-NN with k = 10. Panels: (a) validation sets, (b) test sets; left: single validation set; middle: repeated validation sets; right: 10-fold cross-validation; top: q = 30%; bottom: q = 90%. Repeated permutation has better generalization characteristics.]
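
As a concrete illustration of the repeated-permutation scheme favoured above, the sketch below (again ours, with hypothetical names) tunes the regularization constant C by averaging the validation c-index over repeated random 50/50 splits, reusing the concordance_index, train_survival_svm and predict helpers sketched earlier.

```python
import numpy as np

def tune_C(X, t, delta, C_grid, n_splits=10, k=5, sigma=1.0, seed=0):
    """Repeated random 50/50 splits: return the C with the best mean validation c-index."""
    rng = np.random.default_rng(seed)
    N = len(t)
    scores = {C: [] for C in C_grid}
    for _ in range(n_splits):
        perm = rng.permutation(N)
        tr, va = perm[: N // 2], perm[N // 2:]
        for C in C_grid:
            pairs, alpha = train_survival_svm(X[tr], t[tr], delta[tr], C=C, k=k, sigma=sigma)
            u_va = predict(X[tr], pairs, alpha, X[va], sigma=sigma)
            # with heavy censoring a split may contain very few events, hence the nan-aware mean below
            scores[C].append(concordance_index(u_va, t[va], delta[va]))
    return max(C_grid, key=lambda C: np.nanmean(scores[C]))
```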

3 Case Study in Prostate Cancer

The performance of the k-NN based variant of the survival SVM model was compared with the Cox proportional hazards model on the prostate cancer dataset of Byar and Green [8] (http://lib.stat.cmu.edu/s/harrell/data/descriptions/prostate.html). The variables age, weight index, performance rating, history of cardiovascular disease, serum haemoglobin and Gleason stage/grade category were included in the analysis. Of the 483 patients with complete observations, 125 died of prostatic cancer. A Cox proportional hazards model trained on two thirds of the data resulted in a concordance index of 0.7635 on the test set (the remaining data). A nearest neighbor survival SVM model with a linear kernel and five neighbors results in a concordance index of 0.764. The same performance is obtained using a polynomial kernel with only three neighbors. Given that the time needed for the training phase is much higher for the latter (77.46 versus 6.8 seconds), the linear kernel with more comparisons is preferred. The Cox model trains much faster: 0.24 seconds. Using all possible pairs our method obtains a concordance index of 0.7728, but training required 778 seconds.

Additive models were used to improve interpretability. The relation between the response and covariate $x_d$ can be modelled as

$$u(x) = u_d(x_d) + u_{\bar{d}}(x_{\bar{d}}) = w_d^T \varphi_d(x_d) + w_{\bar{d}}^T \varphi_{\bar{d}}(x_{\bar{d}}), \qquad (7)$$

where $u_d(x_d)$ is the contribution of the $d$-th covariate to the model outcome, modelled by a non-linear function, and $u_{\bar{d}}(x_{\bar{d}})$ is the contribution of all other covariates, modelled in a linear way. Figure 3 illustrates this principle. The model predicts a difference in survival behavior for stage/grade index values below or above a threshold (Figure 3(a)); Kaplan-Meier curves, showing a significantly different observed survival in these strata, confirm this finding. The model suggests no relation between the weight index and survival, and the Kaplan-Meier curves stratified into three weight groups show no significant difference in observed survival. An increasing tumor size results in decreasing estimated and observed survival.

[Figure 3 (plots omitted): Illustration of the use of additive models for three covariates: (a) stage/grade index, (b) weight index, (c) tumor size. Top: estimated relation between covariate and survival. Bottom: Kaplan-Meier curves. Conclusions drawn from the model are confirmed by the observations.]
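
The additive decomposition (7) can be realized with an additive kernel: a non-linear kernel on the covariate of interest plus a linear kernel on the remaining covariates, after which the per-covariate effect u_d can be read off from the corresponding kernel term. The sketch below is our illustration of this idea (the RBF choice and the function names are assumptions); additive_kernel could replace rbf_kernel in the training sketch above.

```python
import numpy as np

def additive_kernel(A, B, d, sigma=1.0):
    """K(x, x') = K_d(x_d, x'_d) + <x_rest, x'_rest>: RBF on covariate d, linear on the others."""
    a_d, b_d = A[:, d:d + 1], B[:, d:d + 1]
    rest = [c for c in range(A.shape[1]) if c != d]
    k_nonlin = np.exp(-((a_d - b_d.T) ** 2) / (2.0 * sigma ** 2))   # non-linear part, covariate d only
    k_lin = A[:, rest] @ B[:, rest].T                               # linear part, all other covariates
    return k_nonlin + k_lin

def covariate_effect(X_train, pairs, alpha, grid, d, sigma=1.0):
    """Estimated contribution u_d over a grid of values of covariate d;
    the linear kernel term does not depend on covariate d and is left out."""
    a_d = X_train[:, d:d + 1]
    K_d = np.exp(-((a_d - np.asarray(grid, float)[None, :]) ** 2) / (2.0 * sigma ** 2))
    i_idx = np.array([i for i, _ in pairs])
    j_idx = np.array([j for _, j in pairs])
    return (alpha[:, None] * (K_d[j_idx] - K_d[i_idx])).sum(axis=0)
```

Curves like those in the top row of Figure 3 would then be obtained by plotting covariate_effect over a grid of values for the covariate of interest.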

4 Conclusions

This paper discussed the SVM for survival data proposed in [1] and addressed issues of computational tractability, model selection and interpretability. As such, the approach is made practical for real-world datasets, as demonstrated on a retrospective prostate cancer dataset.

References

[1] V. Van Belle, K. Pelckmans, J.A.K. Suykens, and S. Van Huffel. Support Vector Machines for Survival Analysis. In Proceedings of the Third International Conference on Computational Intelligence in Medicine and Healthcare (CIMED 2007), 2007.
[2] J.D. Kalbfleisch and R.L. Prentice. The Statistical Analysis of Failure Time Data. Wiley Series in Probability and Statistics. Wiley, 2002.
[3] V. Vapnik. Statistical Learning Theory. Wiley and Sons, 1998.
[4] E. Zafiropoulos, I. Maglogiannis, and I. Anagnostopoulos. A Support Vector Machine Approach to Breast Cancer Diagnosis and Prognosis. In Artificial Intelligence Applications and Innovations, volume 204 of IFIP International Federation for Information Processing. Springer, Boston, 2006.
[5] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115-132, 2000.
[6] A. Rakotomamonjy. Optimizing AUC with SVMs. In Proceedings of the European Conference on Artificial Intelligence, Workshop on ROC Analysis in AI, Valencia, Spain, August 2004.
[7] F. Harrell Jr., K.L. Lee, R. Califf, D. Pryor, and R. Rosati. Regression modeling strategies for improved prognostic prediction. Statistics in Medicine, 1984.
[8] D. Byar and S. Green. Prognostic variables for survival in a randomized comparison of treatments for prostatic cancer. Bulletin du Cancer, 67:477-490, 1980.