SURVIVAL ANALYSIS WITH SUPPORT VECTOR MACHINES


1 SURVIVAL ANALYSIS WITH SUPPORT VECTOR MACHINES Wolfgang HÄRDLE, Ruslan MORO, Center for Applied Statistics and Economics (CASE), Humboldt-Universität zu Berlin

Motivation 2 Applications in Medicine: estimation of survival chances; classification of patients with respect to their sensitivity to treatment; reproduction of test results without using invasive methods. Other Applications: company rating based on survival probability; insurance.

Motivation 3 General Approach: estimate the probability of death in period t given that the patient has survived up to period t-1. What statistical methods are suitable?

Motivation 4 Standard Methodology. Cox proportional hazard regression (1972): a semi-parametric model based on a generalised linear model, $\ln h_i(t) = a(t) + b_1 x_{i1} + b_2 x_{i2} + \dots + b_d x_{id}$, or explicitly for the hazard $h_i(t)$: $h_i(t) = h_0(t)\exp(b_1 x_{i1} + b_2 x_{i2} + \dots + b_d x_{id})$. The hazard ratio for any two observations is independent of time t: $\frac{h_i(t)}{h_j(t)} = \frac{h_0(t)e^{\eta_i}}{h_0(t)e^{\eta_j}} = \frac{e^{\eta_i}}{e^{\eta_j}}$, where $\eta_i = b_1 x_{i1} + b_2 x_{i2} + \dots + b_d x_{id}$.
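For reference, this baseline can be fitted with the lifelines package; a minimal sketch, where the DataFrame and all its column names are hypothetical stand-ins, not data from this talk:

```python
# Sketch: fitting the Cox proportional hazards baseline with lifelines.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "time":        [23.7, 36.9, 52.7, 82.1, 90.0, 12.3],  # months observed
    "death":       [1, 1, 1, 1, 0, 0],                    # 1 = died, 0 = censored
    "tumour_size": [2.1, 3.5, 1.2, 0.8, 1.0, 4.2],
    "lymph_nodes": [4, 8, 1, 0, 0, 11],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="death")
cph.print_summary()   # estimated coefficients b_k with confidence intervals

# Proportional hazards: h_i(t) / h_j(t) = exp(eta_i - eta_j) for all t.
```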

Motivation 5 Proposed Methodology. At time t, break all surviving patients into two groups: 1. those who will die in period t+1; 2. the rest, who will survive through period t+1. Train a classification machine on these two groups, and repeat the procedure for all t in {0, 1, ..., T-1}. Altogether we get T differently trained classification machines. What classification method to apply?
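A minimal sketch of this per-period scheme with scikit-learn; the data layout (a per-patient period-of-death index) is our assumption, not taken from the talk:

```python
# Sketch: one classification machine per period, as proposed above.
import numpy as np
from sklearn.svm import SVC

def train_period_svms(X, period_of_death, T, C=1.0):
    """X: (n, d) patient covariates. period_of_death[i]: the period in
    which patient i dies; use a value > T for patients never observed
    to die. Returns T differently trained SVMs."""
    machines = []
    for t in range(T):                         # t = 0, 1, ..., T-1
        alive = period_of_death > t            # survivors up to period t
        # -1: dies in period t+1; +1: survives period t+1
        y = np.where(period_of_death[alive] == t + 1, -1, +1)
        machines.append(SVC(kernel="rbf", C=C).fit(X[alive], y))
    return machines
```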

Motivation 6 Multivariate Discriminant Analysis, Fisher (1931). The score: $S_i = a_1 x_{i1} + a_2 x_{i2} + \dots + a_d x_{id} = a^\top x_i$, where $x_i$ are the screening and test results for the i-th patient. Survival: $S_i \ge s$; death: $S_i < s$.

Motivation 7 Linear Discriminant Analysis. [Figure: the Death and Survival classes in the (X1, X2) plane.]

Motivation 8 Linear Discriminant Analysis. [Figure: distribution densities of the score for the Death and Survival groups, with threshold s.]

Motivation 9 Other Models. Logit: $E[y_i \mid x_i] = \frac{\exp(a_0 + a_1 x_{i1} + \dots + a_d x_{id})}{1 + \exp(a_0 + a_1 x_{i1} + \dots + a_d x_{id})}$, where $y_i \in \{0, 1\}$ denotes the class, e.g. surviving or dead. Probit: $E[y_i \mid x_i] = \Phi(a_0 + a_1 x_{i1} + a_2 x_{i2} + \dots + a_d x_{id})$. CART. Neural networks.

Motivation 10 Linearly Non-separable Classification Problem. [Figure: Death and Survival classes in the (X1, X2) plane that no single line can separate.]

Outline of the Talk 11 Outline of the Talk. 1. Motivation 2. Support Vector Machines and Their Properties 3. Expected Risk vs. Empirical Risk Minimisation 4. Realisation of an SVM 5. Non-linear Case 6. Survival Estimation with SVMs

Support Vector Machines and their Properties 12 Support Vector Machines (SVMs). SVMs are a group of methods for classification (and regression) that make use of classifiers providing a high margin. SVMs possess a flexible structure which is not chosen a priori. The properties of SVMs can be derived from statistical learning theory. SVMs do not rely on asymptotic properties; they are especially useful when d/n is high, i.e. in most practically significant cases. SVMs give a unique solution.

Support Vector Machines and their Properties 13 Classification Problem. Training set: $\{(x_i, y_i)\}_{i=1}^n$ with distribution $P(x, y)$. Find the class y of a new object x using the classifier $f: X \to \{+1, -1\}$ such that the expected risk R(f) is minimal. $x_i$ is the vector of the i-th object's characteristics; $y_i \in \{-1, +1\}$ or $\{0, 1\}$ is the class of the i-th object. Regression Problem: the setup is as for the classification problem, but $y \in \mathbb{R}$.

Expected Risk vs. Empirical Risk Minimisation 14 Expected Risk Minimisation. The expected risk $R(f) = \int \frac{1}{2}|f(x) - y|\,dP(x, y) = E_{P(x,y)}[L]$ can be minimised directly with respect to f: $f_{opt} = \arg\min_{f \in F} R(f)$. The loss $L = \frac{1}{2}|f(x) - y|$ is 0 if the classification is correct and 1 if it is wrong. F is a set of (non)linear classifier functions.

Expected Risk vs. Empirical Risk Minimisation 15 Empirical Risk Minimisation. In practice $P(x, y)$ is usually unknown: use Empirical Risk Minimisation (ERM) over the training set $\{(x_i, y_i)\}_{i=1}^n$, with $\hat{R}(f) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}|f(x_i) - y_i|$ and $\hat{f}_n = \arg\min_{f \in F} \hat{R}(f)$.
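For labels in {-1, +1} the empirical risk is just the misclassification rate on the training set; a small numpy illustration (the classifier and data here are toy values of ours):

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_hat(f) = (1/n) * sum_i 0.5 * |f(x_i) - y_i| for y in {-1, +1}."""
    return 0.5 * np.mean(np.abs(f(X) - y))

# Example: a fixed linear classifier f(x) = sign(w.x + b)
w, b = np.array([1.0, -1.0]), 0.0
f = lambda X: np.sign(X @ w + b)
X = np.array([[2.0, 1.0], [0.0, 1.0], [1.0, 2.0]])
y = np.array([+1, -1, +1])
print(empirical_risk(f, X, y))   # 1/3: the third point is misclassified
```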

Expected Risk vs. Empirical Risk Minimisation 16 Empirical Risk vs. Expected Risk. [Figure: the curves $\hat{R}(f)$ and $R(f)$ over the function class, with minimisers $\hat{f}_n$ and $f_{opt}$.]

Expected Risk vs. Empirical Risk Minimisation 17 Convergence. From the law of large numbers, $\lim_{n\to\infty} \hat{R}(f) = R(f)$. In addition, ERM satisfies $\lim_{n\to\infty} \min_{f \in F} \hat{R}(f) = \min_{f \in F} R(f)$ if F is not too big.

Expected Risk vs. Empirical Risk Minimisation 18 Vapnik-Chervonenkis (VC) Bound. A basic result of Statistical Learning Theory (for linear classifier functions): $R(f) \le \hat{R}(f) + \phi\!\left(\frac{h}{n}, \frac{\ln \eta}{n}\right)$, where the bound holds with probability $1 - \eta$ and $\phi\!\left(\frac{h}{n}, \frac{\ln \eta}{n}\right) = \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}$. Structural Risk Minimisation: search for the optimal model structure described by $S_h \subset F$ such that the VC bound is minimised; $f \in S_h$ (h is the VC dimension).

Expected Risk vs. Empirical Risk Minimisation 19 Vapnik-Chervonenkis (VC) Dimension. Definition: h is the VC dimension of a set of functions if there exists a set of points $\{x_i\}_{i=1}^h$ such that these points can be separated in all $2^h$ possible configurations, and no set $\{x_i\}_{i=1}^q$ with $q > h$ satisfies this property. Example 1: the set of functions $A\sin(\theta x)$ has an infinite VC dimension. Example 2: three points on a plane can be shattered by the set of linear indicator functions in $2^h = 2^3 = 8$ ways (whereas 4 points cannot be shattered in $2^q = 2^4 = 16$ ways); the VC dimension equals h = 3.
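Example 2 can be verified by brute force: enumerate all labelings of a point set and test each for linear separability. A sketch using a near-hard-margin linear SVC as the separability test (our implementation choice, not from the talk):

```python
# Brute-force shattering check: can linear indicator functions realise
# all 2^h labelings of a point set?
import numpy as np
from itertools import product
from sklearn.svm import SVC

def shatterable(X):
    for labels in product([-1, 1], repeat=len(X)):
        if len(set(labels)) == 1:
            continue                   # constant labelings are trivially realised
        y = np.array(labels)
        clf = SVC(kernel="linear", C=1e6).fit(X, y)   # ~hard margin
        if clf.score(X, y) < 1.0:
            return False
    return True

three = np.array([[0, 0], [1, 0], [0, 1]])            # not collinear
four  = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])    # XOR labeling fails
print(shatterable(three))   # True  -> all 2^3 = 8 labelings realised
print(shatterable(four))    # False -> VC dimension of lines in R^2 is 3
```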

Expected Risk vs. Empirical Risk Minimisation 20 VC Dimension. Example.

Expected Risk vs. Empirical Risk Minimisation 21 Regularised LS Estimation and the VC Bound. Problem solved: $\min_{f \in F} \sum_{i=1}^n \{f(x_i) - y_i\}^2 + \lambda\Omega(f)$. The regularised functional is a specific type of the VC bound with a quadratic empirical loss function. The classifier function class of an SVM: $F_\Lambda = \{f : \mathbb{R}^n \to \mathbb{R} \mid f(x) = w^\top x + b, \|w\| \le \Lambda\}$.

Realisation of an SVM 22 Linearly Separable Case. The training set: $\{(x_i, y_i)\}_{i=1}^n$, $y_i \in \{\pm 1\}$, $x_i \in \mathbb{R}^d$. Find the classifier with the highest margin, the gap between the parallel hyperplanes separating the two classes inside which the vectors of neither class can lie. Maximisation of the margin minimises the VC dimension.

Realisatin f an SVM 23 Let w + b = 0 be a separating hyperplane. Then d + (d ) will be the shrtest distance t the clsest bjects f the classes +1 ( 1). i w + b +1 fr y i = +1 i w + b 1 fr y i = 1 cmbine them int ne cnstraint y i ( i w + b) 1 0 i = 1, 2,..., n (1) The cannical hyperplanes i w + b = ±1 are parallel and the distance between each f them and the separating hyperplane is d ± = 1/ w.

Realisation of an SVM 24 Linear SVMs. Separable Case. The margin is $d_+ + d_- = 2/\|w\|$. To maximise it, minimise the Euclidean norm $\|w\|$ subject to the constraint (1).
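On a toy separable data set this relation can be checked numerically; a sketch with scikit-learn, where a large C approximates the hard-margin SVM:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated point clouds
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.3, (20, 2)), rng.normal(+2, 0.3, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

print("margin width:", 2.0 / np.linalg.norm(w))   # d+ + d- = 2 / ||w||
# Support vectors lie on the canonical hyperplanes x.w + b = +/-1:
print((X[clf.support_] @ w + b).round(2))         # values close to +/-1
```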

Realisation of an SVM 25 The Lagrangian Formulation. The Lagrangian for the primal problem: $L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i\{y_i(x_i^\top w + b) - 1\}$. The Karush-Kuhn-Tucker (KKT) conditions: $\frac{\partial L_P}{\partial w_k} = 0 \Rightarrow \sum_{i=1}^n \alpha_i y_i x_{ik} = w_k$, k = 1, ..., d; $\frac{\partial L_P}{\partial b} = 0 \Rightarrow \sum_{i=1}^n \alpha_i y_i = 0$; $y_i(x_i^\top w + b) - 1 \ge 0$; $\alpha_i \ge 0$, i = 1, ..., n; $\alpha_i\{y_i(x_i^\top w + b) - 1\} = 0$.

Realisation of an SVM 26. Substitute the KKT conditions into $L_P$ and obtain the Lagrangian for the dual problem: $L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j x_i^\top x_j$. The primal and dual problems are $\min_{w_k, b}\max_{\alpha_i} L_P$ s.t. $\alpha_i \ge 0$, and $\max_{\alpha_i} L_D$ s.t. $\alpha_i \ge 0$, $\sum_{i=1}^n \alpha_i y_i = 0$. Since the optimisation problem is convex, the dual and primal formulations give the same solution.
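On small problems the dual can be solved directly with a general-purpose optimiser; a sketch using scipy's SLSQP on made-up separable data (our implementation choice, not the talk's):

```python
# Sketch: solve the hard-margin dual  max_a sum(a) - 0.5 a'Qa,
# s.t. a >= 0, sum(a_i y_i) = 0, with Q_ij = y_i y_j x_i.x_j.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1], dtype=float)
Q = (y[:, None] * X) @ (y[:, None] * X).T

res = minimize(
    lambda a: 0.5 * a @ Q @ a - a.sum(),       # minimise -L_D
    x0=np.zeros(len(y)),
    jac=lambda a: Q @ a - 1.0,
    bounds=[(0.0, None)] * len(y),
    constraints={"type": "eq", "fun": lambda a: a @ y},
    method="SLSQP",
)
alpha = res.x
w = ((alpha * y)[:, None] * X).sum(axis=0)     # w = sum_i a_i y_i x_i
sv = np.argmax(alpha)                          # any support vector
b = y[sv] - X[sv] @ w                          # from y_sv (x_sv.w + b) = 1
print(alpha.round(3), w.round(3), round(b, 3))
```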

Realisation of an SVM 27 The Classification Stage. The classification rule is $g(x) = \mathrm{sign}(x^\top w + b)$, where $w = \sum_{i=1}^n \alpha_i y_i x_i$ and $b = -\frac{1}{2}(x_+ + x_-)^\top w$; $x_+$ and $x_-$ are any support vectors from each class, and $\alpha_i = \arg\max_{\alpha_i} L_D$ subject to the constraint $y_i(x_i^\top w + b) - 1 \ge 0$, i = 1, 2, ..., n.

Realisation of an SVM 28 Adaptation of an SVM to Hazard Estimation. The score values $f = x^\top w + b$ estimated by an SVM correspond to hazard. Suggestion: select a window $f \pm \Delta f$ of the score; count the number of deaths and survivals in the window; if the data are representative of the whole population, $\widehat{\text{hazard}} = \#\text{deaths}/\#\text{survivals}$; estimate the mapping $f \mapsto \widehat{\text{hazard}}$ for several windows $f \pm \Delta f$.
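A minimal sketch of this score-to-hazard mapping; the window centres, the width $\Delta f$, and all data values are our own illustrative choices:

```python
import numpy as np

def hazard_curve(scores, died, centers, delta):
    """For each score window f +/- delta, hazard ~ #deaths / #survivals."""
    est = []
    for f in centers:
        in_win = np.abs(scores - f) <= delta
        deaths = np.sum(died[in_win])
        survived = np.sum(in_win) - deaths
        est.append(deaths / survived if survived > 0 else np.nan)
    return np.array(est)

# scores = X @ w + b from a trained SVM; died: 1 if dead in t+1, else 0
scores = np.array([-1.8, -1.2, -0.5, 0.1, 0.4, 0.9, 1.5, 2.0])
died   = np.array([  1,    1,    0,   0,   1,   0,   0,   0])
print(hazard_curve(scores, died, centers=[-1.5, 0.0, 1.5], delta=1.0))
# hazard falls as the score grows: [2.0, 0.333, 0.0]
```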

Realisation of an SVM 29 Linear SVMs. Non-separable Case. In the non-separable case it is impossible to separate the data points with hyperplanes without error.

Realisation of an SVM 30. The problem can be solved by introducing the positive variables $\{\xi_i\}_{i=1}^n$ into the constraints: $x_i^\top w + b \ge 1 - \xi_i$ for $y_i = +1$; $x_i^\top w + b \le -1 + \xi_i$ for $y_i = -1$; $\xi_i \ge 0$ for all i. If $\xi_i > 1$, an error occurs. The objective function in this case is $\frac{1}{2}\|w\|^2 + C\left(\sum_{i=1}^n \xi_i\right)^\nu$, where $\nu$ is a positive integer controlling sensitivity to outliers and C ("capacity") controls the tolerance to errors on the training set. Under such a formulation the problem is convex.
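Most SVM libraries do not return the slacks, but they can be recovered from the decision function as $\xi_i = \max(0, 1 - y_i f(x_i))$; a sketch with scikit-learn on made-up overlapping data:

```python
# Slack recovery: xi_i = max(0, 1 - y_i f(x_i)); xi_i > 1 marks an error.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.0, (30, 2)), rng.normal(+1, 1.0, (30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # C controls error tolerance
xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
print("margin violations (xi > 0):", int(np.sum(xi > 0)))
print("training errors   (xi > 1):", int(np.sum(xi > 1)))
```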

Realisation of an SVM 31 The Lagrangian Formulation. The Lagrangian for the primal problem for $\nu = 1$: $L_P = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i\{y_i(x_i^\top w + b) - 1 + \xi_i\} - \sum_{i=1}^n \mu_i\xi_i$. The primal problem: $\min_{w_k, b, \xi_i}\max_{\alpha_i, \mu_i} L_P$.

Realisation of an SVM 32 The KKT Conditions. $\frac{\partial L_P}{\partial w_k} = 0 \Rightarrow w_k = \sum_{i=1}^n \alpha_i y_i x_{ik}$, k = 1, ..., d; $\frac{\partial L_P}{\partial b} = 0 \Rightarrow \sum_{i=1}^n \alpha_i y_i = 0$; $\frac{\partial L_P}{\partial \xi_i} = 0 \Rightarrow C - \alpha_i - \mu_i = 0$; $y_i(x_i^\top w + b) - 1 + \xi_i \ge 0$; $\xi_i \ge 0$; $\alpha_i \ge 0$; $\mu_i \ge 0$; $\alpha_i\{y_i(x_i^\top w + b) - 1 + \xi_i\} = 0$; $\mu_i\xi_i = 0$.

Realisation of an SVM 33. For $\nu = 1$ the dual Lagrangian will not contain $\xi_i$ or their Lagrange multipliers: $L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j x_i^\top x_j$ (2). The dual problem is $\max_{\alpha_i} L_D$ subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^n \alpha_i y_i = 0$.

Realisation of an SVM 34 Linear SVM. Non-separable Case.

Non-linear Case 35 Non-linear SVMs. Map the data into a Hilbert space H and perform the classification there: $\Psi: \mathbb{R}^d \to H$. Note that in the Lagrangian formulation (2) the training data appear only in the form of dot products $x_i^\top x_j$, which can be mapped to $\Psi(x_i)^\top\Psi(x_j)$. If a kernel function K exists such that $K(x_i, x_j) = \Psi(x_i)^\top\Psi(x_j)$, then we can use K without knowing $\Psi$ explicitly.

Non-linear Case 36 Mapping into the Feature Space. Example: $\mathbb{R}^2 \to \mathbb{R}^3$, $\Psi(x_1, x_2) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)$, $K(x_i, x_j) = (x_i^\top x_j)^2$. [Figure: the data space and the feature space.]
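The kernel identity $K(x_i, x_j) = \Psi(x_i)^\top\Psi(x_j)$ for this map is easy to verify numerically:

```python
import numpy as np

def psi(x):   # explicit feature map R^2 -> R^3
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(psi(xi) @ psi(xj))   # 1.0: dot product in the feature space
print((xi @ xj) ** 2)      # 1.0: kernel evaluation, no mapping needed
```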

Non-linear Case 37 Mercer's Condition (1909). A necessary and sufficient condition for a symmetric function $K(x_i, x_j)$ to be a kernel is that it be positive definite, i.e. for any data set $x_1, \dots, x_n$ and any real numbers $\lambda_1, \dots, \lambda_n$ the function K must satisfy $\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_j K(x_i, x_j) \ge 0$. Some examples of kernel functions: $K(x_i, x_j) = e^{-(x_i - x_j)^\top\Sigma^{-1}(x_i - x_j)/2}$ (Gaussian kernel); $K(x_i, x_j) = (x_i^\top x_j + 1)^p$ (polynomial kernel); $K(x_i, x_j) = \tanh(k x_i^\top x_j - \delta)$ (hyperbolic tangent kernel).
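Mercer's condition can be spot-checked on any sample by testing that the Gram matrix is positive semi-definite; a sketch for the Gaussian kernel with $\Sigma = I$ on random data of ours:

```python
# Spot check of Mercer's condition: the Gram matrix of a valid kernel
# must be positive semi-definite on any data set.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))

d = X[:, None, :] - X[None, :, :]              # pairwise differences
K = np.exp(-np.sum(d**2, axis=-1) / 2.0)       # Gaussian kernel, Sigma = I
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True: eigenvalues >= 0
# The tanh kernel, by contrast, is not PSD for all parameter choices.
```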

Non-linear Case 38 Classes of Kernels. A stationary kernel is a kernel which is translation invariant: $K(x_i, x_j) = K_S(x_i - x_j)$. An isotropic (homogeneous) kernel is one which depends only on the norm of the lag vector (distance) between two data points: $K(x_i, x_j) = K_I(\|x_i - x_j\|)$. A locally stationary kernel is a kernel of the form $K(x_i, x_j) = K_1\!\left(\frac{x_i + x_j}{2}\right)K_2(x_i - x_j)$, where $K_1$ is a non-negative function and $K_2$ is a stationary kernel.

Non-linear Case 39 Matérn Kernel. $\frac{K_I(\|x_i - x_j\|)}{K_I(0)} = \frac{1}{2^{\nu-1}\Gamma(\nu)}\left(\frac{2\sqrt{\nu}\,\|x_i - x_j\|}{\theta}\right)^{\nu} H_\nu\!\left(\frac{2\sqrt{\nu}\,\|x_i - x_j\|}{\theta}\right)$, where $\Gamma$ is the gamma function and $H_\nu$ is the modified Bessel function of the second kind of order $\nu$. The parameter $\nu$ allows one to control the smoothness. The Matérn kernel reduces to the Gaussian kernel for $\nu \to \infty$.
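A direct implementation of the slide's formula from scipy's special functions; a sketch, with the r = 0 case handled by the normalisation $K_I(0)/K_I(0) = 1$:

```python
# Sketch of the slide's Matern kernel using scipy's special functions.
import numpy as np
from scipy.special import gamma, kv   # kv: modified Bessel fn, 2nd kind

def matern(r, nu=1.5, theta=1.0):
    """K_I(r) / K_I(0) with smoothness nu and length scale theta."""
    r = np.asarray(r, dtype=float)
    out = np.ones_like(r)                  # at r = 0 the ratio is 1
    nz = r > 0
    z = 2.0 * np.sqrt(nu) * r[nz] / theta
    out[nz] = z**nu * kv(nu, z) / (2.0 ** (nu - 1.0) * gamma(nu))
    return out

r = np.linspace(0.0, 2.0, 5)
for nu in (0.5, 1.5, 10.0):
    print(nu, matern(r, nu=nu).round(3))   # larger nu -> smoother, more Gaussian-like
```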

Survival Estimation with SVMs 40 Estimation of Survival Chances for Breast Cancer Patients. Data source: the Breast cancer survival.sav file supplied with SPSS and the database used in Lee et al. (2001); 325 cases were selected and merged into one database (112 deaths, 213 censored cases). Predictors: 2 variables that are contained in both databases, the pathology size and the number of metastasised lymph nodes. An SVM with an anisotropic Gaussian kernel with radial basis $3\Sigma^{1/2}$ and capacity C = 1 was applied (here $\Sigma$ = the covariance matrix).
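One plausible way to reproduce such an anisotropic Gaussian kernel is to pass a precomputed Gram matrix to a standard SVM. A sketch with stand-in data, reading the radial basis $3\Sigma^{1/2}$ as the bandwidth matrix (our interpretation, not confirmed by the talk):

```python
# Sketch: anisotropic Gaussian kernel via a precomputed Gram matrix.
# Reading of the slide: bandwidth R = 3 * Sigma^(1/2), i.e.
# K(xi, xj) = exp(-0.5 * (xi-xj)' (R R')^-1 (xi-xj)), with R R' = 9 Sigma.
import numpy as np
from sklearn.svm import SVC

def aniso_gauss_gram(A, B, Sinv):
    d = A[:, None, :] - B[None, :, :]
    return np.exp(-0.5 * np.einsum("ijk,kl,ijl->ij", d, Sinv, d))

# Stand-in data: (tumour size, number of metastasised lymph nodes)
X = np.array([[1.0, 0], [2.5, 4], [4.0, 10], [0.8, 0], [3.2, 7]], float)
y = np.array([+1, +1, -1, +1, -1])        # +1 = survives period t+1

Sigma = np.cov(X, rowvar=False)
Sinv = np.linalg.inv(9.0 * Sigma)         # since Sigma^(1/2) is symmetric

clf = SVC(kernel="precomputed", C=1.0).fit(aniso_gauss_gram(X, X, Sinv), y)
print(clf.predict(aniso_gauss_gram(X, X, Sinv)))   # class predictions
```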

Survival Estimation with SVMs 41 Methodology. The cases were sorted in ascending order by survival time or time to censoring. 5 groups (t = 1, ..., 5) were selected; all 112 death cases are in groups t = 1, ..., 4; all 213 censored cases are in group t = 5. An SVM was trained at each time t (t = 0, ..., 3); the patients who would die in period t+1 were given the label $y_i = -1$, those who would survive: $y_i = +1$.

Survival Estimation with SVMs 42 The Timeline.
t=0 (0 months, obtaining test results) to t=1 (23.7 months): 28 deaths, 1.18 deaths a month
t=1 (23.7 months) to t=2 (36.9 months): 28 deaths, 2.12 deaths a month
t=2 (36.9 months) to t=3 (52.7 months): 28 deaths, 1.75 deaths a month
t=3 (52.7 months) to t=4 (82.1 months): 28 deaths, 0.96 deaths a month
t=5: 213 censored cases

Survival Estimation with SVMs 43 Survival Estimation.

Survival Estimation with SVMs 44 Survival Chances (t=0). [Figure: number of metastasised lymph nodes (0-25) vs. tumour size (0-10 cm).]

Survival Estimation with SVMs 45 Survival Chances (t=1). [Figure: same axes.]

Survival Estimation with SVMs 46 Survival Chances (t=2). [Figure: same axes.]

Survival Estimation with SVMs 47 Survival Chances (t=3). [Figure: same axes.]

Survival Estimation with SVMs 48 References.
Cox, D. R. (1972). Regression Models and Life Tables, Journal of the Royal Statistical Society B 34: 187-220.
Lee, Y.-J., Mangasarian, O. L., Wolberg, W. H. (2001). Survival Time Classification of Breast Cancer Patients (technical report): ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-03.ps.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory, Springer, New York, NY.