SURVIVAL ANALYSIS WITH SUPPORT VECTOR MACHINES
Wolfgang HÄRDLE, Ruslan MORO
Center for Applied Statistics and Economics (CASE), Humboldt-Universität zu Berlin
Motivation 2
Applications in Medicine
- estimation of survival chances
- classification of patients with respect to their sensitivity to treatment
- reproduction of test results without using invasive methods
Other Applications
- company rating based on survival probability
- insurance
Motivation 3
General Approach
Estimate the probability of death in period $t$ given that the patient has survived up to period $t-1$.
What statistical methods are suitable?
Motivation 4
Standard Methodology
Cox proportional hazard regression (1972), a semi-parametric model based on a generalised linear model:
$\ln h_i(t) = a(t) + b_1 x_{i1} + b_2 x_{i2} + \dots + b_d x_{id}$
or explicitly for the hazard $h_i(t)$:
$h_i(t) = h_0(t) \exp(b_1 x_{i1} + b_2 x_{i2} + \dots + b_d x_{id})$
The hazard ratio for any two observations is independent of time $t$:
$\frac{h_i(t)}{h_j(t)} = \frac{h_0(t) e^{\eta_i}}{h_0(t) e^{\eta_j}} = \frac{e^{\eta_i}}{e^{\eta_j}}$
where $\eta_i = b_1 x_{i1} + b_2 x_{i2} + \dots + b_d x_{id}$.
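The time-independence of the hazard ratio can be checked numerically; a minimal sketch in which the baseline hazard, coefficients, and covariates are all illustrative values:

```python
import numpy as np

# Hypothetical baseline hazard (any positive function of t works).
def h0(t):
    return 0.01 * np.exp(0.05 * t)

b = np.array([0.8, -0.3])        # assumed coefficients
x_i = np.array([1.2, 0.5])       # covariates of two hypothetical patients
x_j = np.array([0.4, 1.1])

def hazard(x, t):
    return h0(t) * np.exp(b @ x)

t_grid = np.array([1.0, 5.0, 10.0])
ratios = hazard(x_i, t_grid) / hazard(x_j, t_grid)
# The ratio equals exp(eta_i - eta_j) at every t:
print(np.allclose(ratios, np.exp(b @ x_i - b @ x_j)))  # True
```

The baseline $h_0(t)$ cancels, which is exactly why the ratio stays constant over the whole grid.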
Motivation 5
Proposed Methodology
- at time $t$ break all surviving patients into two groups:
  1. those who will die in period $t+1$
  2. the rest of the patients, who will survive period $t+1$
- train a classification machine on these two groups
- repeat the procedure for all $t \in \{0, 1, \dots, T-1\}$
Altogether we will get $T$ differently trained classification machines.
What classification method to apply?
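The loop above can be sketched with an off-the-shelf SVM classifier; the data, the period labels, and the variable names are invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical data: two covariates per patient and a death period in 1..T;
# period T+1 marks patients who survive the whole observation window.
T = 4
n = 200
X = rng.normal(size=(n, 2))
death_period = np.resize(np.arange(1, T + 2), n)   # deterministic toy labels

machines = []
for t in range(T):                                 # t = 0, 1, ..., T-1
    alive = death_period > t                       # patients surviving up to t
    # -1 = dies in period t+1, +1 = survives period t+1
    y_t = np.where(death_period[alive] == t + 1, -1, 1)
    machines.append(SVC(kernel="linear", C=1.0).fit(X[alive], y_t))

print(len(machines))  # T differently trained classification machines
```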
Motivation 6
Multivariate Discriminant Analysis, Fisher (1931)
The score: $S_i = a_1 x_{i1} + a_2 x_{i2} + \dots + a_d x_{id} = a^\top x_i$
$x_i$ are the screening and test results for the $i$-th patient
survival: $S_i \geq s$; death: $S_i < s$
Motivation 7
Linear Discriminant Analysis
[figure: the Death and Survival classes in the $(X_1, X_2)$ plane, separated by a linear discriminant]
Motivation 8
Linear Discriminant Analysis
[figure: distribution densities of the score for the Death and Survival groups, with the threshold $s$ on the score axis]
Motivation 9
Other Models
Logit: $E[y_i \mid x_i] = \frac{\exp(a_0 + a_1 x_{i1} + \dots + a_d x_{id})}{1 + \exp(a_0 + a_1 x_{i1} + \dots + a_d x_{id})}$
$y_i \in \{0, 1\}$ denotes the class, e.g. surviving or dead
Probit: $E[y_i \mid x_i] = \Phi(a_0 + a_1 x_{i1} + a_2 x_{i2} + \dots + a_d x_{id})$
CART
Neural networks
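As an illustration, the logit model can be fitted with scikit-learn; the data below are simulated, and the coefficient values are arbitrary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical data: two test results per patient; y = 1 for death, 0 for survival.
X = rng.normal(size=(100, 2))
true_a = np.array([1.5, -2.0])                      # assumed coefficients
p = 1.0 / (1.0 + np.exp(-(X @ true_a)))             # the logit link from the slide
y = (rng.uniform(size=100) < p).astype(int)

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]                # estimated E[y_i | x_i]
print(probs.min() >= 0 and probs.max() <= 1)        # True
```

Unlike the discriminant score, the fitted value is directly interpretable as a probability of the event.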
Motivation 10
Linearly Non-separable Classification Problem
[figure: the Death and Survival classes in the $(X_1, X_2)$ plane in a configuration that no single hyperplane can separate]
Outline of the Talk 11
1. Motivation
2. Support Vector Machines and Their Properties
3. Expected Risk vs. Empirical Risk Minimisation
4. Realisation of an SVM
5. Non-linear Case
6. Survival Estimation with SVMs
Support Vector Machines and their Properties 12
Support Vector Machines (SVMs)
SVMs are a group of methods for classification (and regression) that make use of classifiers providing a high margin.
- SVMs possess a flexible structure which is not chosen a priori
- the properties of SVMs can be derived from statistical learning theory
- SVMs do not rely on asymptotic properties; they are especially useful when $d/n$ is high, i.e. in most practically significant cases
- SVMs give a unique solution
Support Vector Machines and their Properties 13
Classification Problem
Training set: $\{(x_i, y_i)\}_{i=1}^n$ with the distribution $P(x, y)$. Find the class $y$ of a new object $x$ using the classifier $f : X \to \{+1, -1\}$, such that the expected risk $R(f)$ is minimal.
$x_i$ is the vector of the $i$-th object's characteristics; $y_i \in \{-1, +1\}$ or $\{0, 1\}$ is the class of the $i$-th object.
Regression Problem
Setup as for the classification problem, but $y \in \mathbb{R}$.
Expected Risk vs. Empirical Risk Minimisation 14
Expected Risk Minimisation
The expected risk
$R(f) = \int \frac{1}{2} |f(x) - y| \, dP(x, y) = E_{P(x,y)}[L]$
can be minimised directly with respect to $f$:
$f_{opt} = \arg\min_{f \in F} R(f)$
The loss $L = \frac{1}{2} |f(x) - y|$ equals 0 if the classification is correct and 1 if it is wrong; $F$ is a set of (non)linear classifier functions.
Expected Risk vs. Empirical Risk Minimisation 15
Empirical Risk Minimisation
In practice $P(x, y)$ is usually unknown: use Empirical Risk Minimisation (ERM) over the training set $\{(x_i, y_i)\}_{i=1}^n$:
$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} |f(x_i) - y_i|$
$\hat{f}_n = \arg\min_{f \in F} \hat{R}(f)$
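A minimal numeric illustration of $\hat{R}(f)$ for a fixed linear classifier; the data points and weights are toy values:

```python
import numpy as np

# Empirical risk of f(x) = sign(w.x + b) on a four-point training set.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-0.5, 1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

f = np.sign(X @ w + b)
R_hat = np.mean(0.5 * np.abs(f - y))   # = fraction of misclassified points
print(R_hat)                           # 0.25 (one of four points is wrong)
```

The 0/1 loss written as $\frac{1}{2}|f(x_i) - y_i|$ reduces to the familiar misclassification rate.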
Expected Risk vs. Empirical Risk Minimisation 16
Empirical Risk vs. Expected Risk
[figure: the risks $R(f)$ and $\hat{R}(f)$ plotted over the function class, with the minimisers $f_{opt}$ and $\hat{f}_n$]
Expected Risk vs. Empirical Risk Minimisation 17
Convergence
From the law of large numbers:
$\lim_{n \to \infty} \hat{R}(f) = R(f)$
In addition, ERM satisfies
$\lim_{n \to \infty} \min_{f \in F} \hat{R}(f) = \min_{f \in F} R(f)$
if $F$ is not too big.
Expected Risk vs. Empirical Risk Minimisation 18
Vapnik-Chervonenkis (VC) Bound
A basic result of Statistical Learning Theory (for linear classifier functions):
$R(f) \leq \hat{R}(f) + \phi\left(\frac{h}{n}, \frac{\ln(\eta)}{n}\right)$
where the bound holds with probability $1 - \eta$ and
$\phi\left(\frac{h}{n}, \frac{\ln(\eta)}{n}\right) = \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}$
Structural Risk Minimisation: search for the optimal model structure described by $S_h \subset F$ such that the VC bound is minimised; $f \in S_h$ ($h$ is the VC dimension).
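The confidence term $\phi$ can be evaluated directly; a sketch with purely illustrative values of the VC dimension $h$, sample size $n$, and confidence level $1 - \eta$:

```python
import numpy as np

# The confidence term of the VC bound.
def phi(h, n, eta):
    return np.sqrt((h * (np.log(2.0 * n / h) + 1.0) - np.log(eta / 4.0)) / n)

# The term shrinks as n grows and grows with model capacity h:
print(phi(h=3, n=100, eta=0.05) > phi(h=3, n=10_000, eta=0.05))   # True
print(phi(h=30, n=1000, eta=0.05) > phi(h=3, n=1000, eta=0.05))   # True
```

This is the trade-off Structural Risk Minimisation exploits: a richer class (larger $h$) lowers $\hat{R}(f)$ but inflates $\phi$.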
Expected Risk vs. Empirical Risk Minimisation 19
Vapnik-Chervonenkis (VC) Dimension
Definition. $h$ is the VC dimension of a set of functions if there exists a set of points $\{x_i\}_{i=1}^h$ such that these points can be separated in all $2^h$ possible configurations, and no set $\{x_i\}_{i=1}^q$ with $q > h$ satisfies this property.
Example 1. The function $A \sin(\theta x)$ has an infinite VC dimension.
Example 2. Three points on a plane can be shattered by a set of linear indicator functions in $2^h = 2^3 = 8$ ways (whereas 4 points cannot be shattered in $2^q = 2^4 = 16$ ways). The VC dimension equals $h = 3$.
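Example 2 can be verified computationally: whether a labelling is linearly separable is a linear feasibility problem, which scipy's `linprog` can decide. The point sets below are illustrative:

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Check whether some w, b satisfy y_i (w.x_i + b) >= 1 for all i."""
    n, d = X.shape
    # Variables (w_1..w_d, b); constraints -y_i (x_i.w + b) <= -1.
    A = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0           # 0 = feasible, 2 = infeasible

# Three non-collinear points: all 2^3 = 8 labellings are separable.
three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(all(separable(three, np.array(s)) for s in product([-1, 1], repeat=3)))  # True

# Four points cannot be shattered: the XOR labelling is not separable.
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(separable(four, np.array([1, 1, -1, -1])))  # False
```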
Expected Risk vs. Empirical Risk Minimisation 20
VC Dimension. Example
[figure: three points in the plane shattered by linear indicator functions]
Expected Risk vs. Empirical Risk Minimisation 21
Regularised LS Estimation and VC Bound
Problem solved:
$\min_{f \in F} \sum_{i=1}^n \{f(x_i) - y_i\}^2 + \lambda \Omega(f)$
The regularised functional is a specific type of the VC bound with a quadratic empirical loss function.
The Classifier Function Class of an SVM
$F_\Lambda = \{f : \mathbb{R}^n \to \mathbb{R} \mid f(x) = w^\top x + b, \; \|w\| \leq \Lambda\}$
Realisation of an SVM 22
Linearly Separable Case
The training set: $\{(x_i, y_i)\}_{i=1}^n$, $y_i \in \{\pm 1\}$, $x_i \in \mathbb{R}^d$.
Find the classifier with the highest margin: the gap between the parallel hyperplanes separating the two classes where the vectors of neither class can lie. Maximisation of the margin minimises the VC dimension.
Realisation of an SVM 23
Let $x^\top w + b = 0$ be a separating hyperplane. Then $d_+$ ($d_-$) will be the shortest distance to the closest objects of the class $+1$ ($-1$).
$x_i^\top w + b \geq +1$ for $y_i = +1$
$x_i^\top w + b \leq -1$ for $y_i = -1$
Combine them into one constraint:
$y_i(x_i^\top w + b) - 1 \geq 0, \quad i = 1, 2, \dots, n \quad (1)$
The canonical hyperplanes $x^\top w + b = \pm 1$ are parallel and the distance between each of them and the separating hyperplane is $d_\pm = 1/\|w\|$.
Realisation of an SVM 24
Linear SVMs. Separable Case
The margin is $d_+ + d_- = 2/\|w\|$. To maximise it, minimise the Euclidean norm $\|w\|$ subject to the constraint (1).
Realisation of an SVM 25
The Lagrangian Formulation
The Lagrangian for the primal problem:
$L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i \{y_i(x_i^\top w + b) - 1\}$
The Karush-Kuhn-Tucker (KKT) Conditions
$\frac{\partial L_P}{\partial w_k} = 0 \;\Rightarrow\; w_k = \sum_{i=1}^n \alpha_i y_i x_{ik}, \quad k = 1, \dots, d$
$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^n \alpha_i y_i = 0$
$y_i(x_i^\top w + b) - 1 \geq 0, \quad \alpha_i \geq 0, \quad i = 1, \dots, n$
$\alpha_i \{y_i(x_i^\top w + b) - 1\} = 0$
Realisation of an SVM 26
Substitute the KKT conditions into $L_P$ and obtain the Lagrangian for the dual problem:
$L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j$
The primal and dual problems are
$\min_{w_k, b} \max_{\alpha_i} L_P$ s.t. $\alpha_i \geq 0$
$\max_{\alpha_i} L_D$ s.t. $\alpha_i \geq 0$, $\sum_{i=1}^n \alpha_i y_i = 0$
Since the optimisation problem is convex, the dual and primal formulations give the same solution.
Realisation of an SVM 27
The Classification Stage
The classification rule is:
$g(x) = \mathrm{sign}(x^\top w + b)$
where $w = \sum_{i=1}^n \alpha_i y_i x_i$ and $b = -\frac{1}{2}(x_+ + x_-)^\top w$;
$x_+$ and $x_-$ are any support vectors from each class;
$\alpha_i = \arg\max_{\alpha_i} L_D$ subject to the constraint $y_i(x_i^\top w + b) - 1 \geq 0$, $i = 1, 2, \dots, n$.
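The dual problem and the classification rule can be sketched end to end with a generic solver (scipy's SLSQP); the toy data are illustrative and this is not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

# A tiny separable toy set.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)
Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j x_i.x_j

neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()   # minimise -L_D
res = minimize(neg_dual, np.zeros(n), jac=lambda a: Q @ a - 1.0,
               bounds=[(0.0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

w = (alpha * y) @ X                              # w = sum_i alpha_i y_i x_i
sv_pos = X[np.argmax(alpha * (y > 0))]           # a support vector from each class
sv_neg = X[np.argmax(alpha * (y < 0))]
b = -0.5 * (sv_pos + sv_neg) @ w

g = np.sign(X @ w + b)
print(np.array_equal(g, y))  # the rule reclassifies the training set correctly
```

A dedicated QP or SMO solver would be used in practice; SLSQP merely makes the small convex problem transparent.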
Realisation of an SVM 28
Adaptation of an SVM to Hazard Estimation
The score values $f = x^\top w + b$ estimated by an SVM correspond to hazard.
Suggestion:
- select an area $f \pm \Delta f$ of hazard
- count the number of deaths and survivals in the area
- if the data are representative of the whole population, $\widehat{hazard} = \#deaths / \#survivals$
- estimate the mapping $f \mapsto \widehat{hazard}$ for several $f \pm \Delta f$
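The suggestion reduces to a simple counting function; the score band, the data, and the `hazard_at` name below are all illustrative:

```python
import numpy as np

def hazard_at(scores, died, f, df):
    """#deaths / #survivals among cases whose score lies in [f - df, f + df]."""
    in_band = np.abs(scores - f) <= df
    deaths = int(np.sum(died[in_band]))
    survivals = int(np.sum(~died[in_band]))
    return deaths / survivals if survivals > 0 else float("inf")

scores = np.array([-1.2, -0.8, -0.1, 0.2, 0.9, 1.4])  # SVM scores f = x.w + b
died = np.array([True, True, True, False, False, False])
print(hazard_at(scores, died, f=0.0, df=0.5))  # 1 death, 1 survival -> 1.0
```

Evaluating `hazard_at` on a grid of band centres gives the suggested mapping from score to estimated hazard.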
Realisation of an SVM 29
Linear SVMs. Non-separable Case
In the non-separable case it is impossible to separate the data points with hyperplanes without an error.
Realisation of an SVM 30
The problem can be solved by introducing the positive slack variables $\{\xi_i\}_{i=1}^n$ into the constraints:
$x_i^\top w + b \geq 1 - \xi_i$ for $y_i = +1$
$x_i^\top w + b \leq -1 + \xi_i$ for $y_i = -1$
$\xi_i \geq 0 \;\; \forall i$
If $\xi_i > 1$, an error occurs. The objective function in this case is
$\frac{1}{2}\|w\|^2 + C \left( \sum_{i=1}^n \xi_i \right)^\nu$
where $\nu$ is a positive integer controlling sensitivity to outliers and $C$ ("capacity") controls the tolerance to errors on the training set. Under such a formulation the problem is convex.
Realisation of an SVM 31
The Lagrangian Formulation
The Lagrangian for the primal problem for $\nu = 1$:
$L_P = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \{y_i(x_i^\top w + b) - 1 + \xi_i\} - \sum_{i=1}^n \xi_i \mu_i$
The primal problem: $\min_{w_k, b, \xi_i} \max_{\alpha_i, \mu_i} L_P$
Realisation of an SVM 32
The KKT Conditions
$\frac{\partial L_P}{\partial w_k} = 0 \;\Rightarrow\; w_k = \sum_{i=1}^n \alpha_i y_i x_{ik}, \quad k = 1, \dots, d$
$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^n \alpha_i y_i = 0$
$\frac{\partial L_P}{\partial \xi_i} = 0 \;\Rightarrow\; C - \alpha_i - \mu_i = 0$
$y_i(x_i^\top w + b) - 1 + \xi_i \geq 0, \quad \xi_i \geq 0, \quad \alpha_i \geq 0, \quad \mu_i \geq 0$
$\alpha_i \{y_i(x_i^\top w + b) - 1 + \xi_i\} = 0, \quad \mu_i \xi_i = 0$
Realisation of an SVM 33
For $\nu = 1$ the dual Lagrangian contains neither $\xi_i$ nor their Lagrange multipliers:
$L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad (2)$
The dual problem is
$\max_{\alpha_i} L_D$ subject to $0 \leq \alpha_i \leq C$, $\sum_{i=1}^n \alpha_i y_i = 0$
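This boxed dual is what standard soft-margin SVM implementations solve; a sketch with scikit-learn on overlapping toy data, checking the box constraint $0 \leq \alpha_i \leq C$ on the fitted multipliers:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Overlapping classes, so a perfect separation is impossible (toy data).
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    alpha = np.abs(clf.dual_coef_).ravel()   # dual_coef_ holds y_i * alpha_i
    assert alpha.min() >= 0 and alpha.max() <= C + 1e-8
    print(C, clf.n_support_.sum())           # support vector count varies with C
```

Small $C$ tolerates many margin violations; large $C$ penalises them heavily and approaches the hard-margin solution.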
Realisation of an SVM 34
Linear SVM. Non-separable Case
[figure: a soft-margin hyperplane with some points lying inside the margin or on the wrong side]
Non-linear Case 35
Non-linear SVMs
Map the data into the Hilbert space $H$ and perform classification there:
$\Psi : \mathbb{R}^d \to H$
Note that in the Lagrangian formulation (2) the training data appear only in the form of dot products $x_i^\top x_j$, which can be mapped to $\Psi(x_i)^\top \Psi(x_j)$. If a kernel function $K$ exists such that $K(x_i, x_j) = \Psi(x_i)^\top \Psi(x_j)$, then we can use $K$ without knowing $\Psi$ explicitly.
Non-linear Case 36
Mapping into the Feature Space. Example
$\mathbb{R}^2 \to \mathbb{R}^3, \quad \Psi(x_1, x_2) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2), \quad K(x_i, x_j) = (x_i^\top x_j)^2$
[figure: the data space and the feature space]
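A quick numeric check of the identity $K(x_i, x_j) = \Psi(x_i)^\top \Psi(x_j)$ for this map; the test vectors are arbitrary:

```python
import numpy as np

# The quadratic feature map on R^2 from the slide.
def Psi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

# The corresponding kernel.
def K(x, z):
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.isclose(Psi(x) @ Psi(z), K(x, z)))  # True
```

The kernel evaluates the dot product in $\mathbb{R}^3$ at the cost of one dot product in $\mathbb{R}^2$, which is the point of the kernel trick.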
Non-linear Case 37
Mercer's Condition (1909)
A necessary and sufficient condition for a symmetric function $K(x_i, x_j)$ to be a kernel is that it must be positive definite, i.e. for any data set $x_1, \dots, x_n$ and any real numbers $\lambda_1, \dots, \lambda_n$ the function $K$ must satisfy
$\sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j K(x_i, x_j) \geq 0$
Some examples of kernel functions:
$K(x_i, x_j) = e^{-(x_i - x_j)^\top \Sigma^{-1} (x_i - x_j)/2}$ (Gaussian kernel)
$K(x_i, x_j) = (x_i^\top x_j + 1)^p$ (polynomial kernel)
$K(x_i, x_j) = \tanh(k x_i^\top x_j - \delta)$ (hyperbolic tangent kernel)
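Mercer's condition can be probed empirically: for a valid kernel the Gram matrix is positive semi-definite, so the quadratic form above is non-negative for any choice of $\lambda$. A sketch for the Gaussian kernel with $\Sigma = I$ on simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 2))

# Gram matrix of the Gaussian kernel with Sigma = I.
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_gauss = np.exp(-sq / 2.0)

eigvals = np.linalg.eigvalsh(K_gauss)
print(eigvals.min() >= -1e-10)         # all eigenvalues are (numerically) >= 0

lam = rng.normal(size=20)
print(lam @ K_gauss @ lam >= -1e-10)   # the Mercer quadratic form is non-negative
```

The hyperbolic tangent kernel, by contrast, is known to violate this condition for some parameter choices.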
Non-linear Case 38
Classes of Kernels
A stationary kernel is a kernel that is translation invariant:
$K(x_i, x_j) = K_S(x_i - x_j)$
An isotropic (homogeneous) kernel is one which depends only on the norm of the lag vector (distance) between two data points:
$K(x_i, x_j) = K_I(\|x_i - x_j\|)$
A locally stationary kernel is a kernel of the form
$K(x_i, x_j) = K_1\left(\frac{x_i + x_j}{2}\right) K_2(x_i - x_j)$
where $K_1$ is a non-negative function and $K_2$ is a stationary kernel.
Non-linear Case 39
Matérn kernel
$\frac{K_I(\|x_i - x_j\|)}{K_I(0)} = \frac{1}{2^{\nu-1}\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,\|x_i - x_j\|}{\theta}\right)^{\nu} H_\nu\left(\frac{\sqrt{2\nu}\,\|x_i - x_j\|}{\theta}\right)$
where $\Gamma$ is the gamma function and $H_\nu$ is the modified Bessel function of the second kind of order $\nu$. The parameter $\nu$ allows one to control the smoothness. The Matérn kernel reduces to the Gaussian kernel for $\nu \to \infty$.
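A direct implementation of this kernel, using scipy's `kv` for the modified Bessel function $H_\nu$; as a sanity check, for $\nu = 1/2$ the Matérn form reduces to the exponential kernel $e^{-r/\theta}$ (the evaluation grid is illustrative):

```python
import numpy as np
from scipy.special import gamma, kv

def matern(r, nu, theta):
    """Matern kernel K_I(r) / K_I(0), as defined on the slide."""
    r = np.asarray(r, dtype=float)
    out = np.ones_like(r)                      # the kernel equals 1 at r = 0
    nz = r > 0
    z = np.sqrt(2.0 * nu) * r[nz] / theta
    out[nz] = 2.0 ** (1.0 - nu) / gamma(nu) * z ** nu * kv(nu, z)
    return out

r = np.linspace(0.0, 3.0, 7)
# For nu = 1/2 the kernel is exp(-r/theta); for nu -> infinity it tends to the Gaussian.
print(np.allclose(matern(r, nu=0.5, theta=1.0), np.exp(-r)))  # True
```

The $r = 0$ case must be handled separately because `kv` diverges at zero while the limit of the whole expression is 1.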
Survival Estimation with SVMs 40
Estimation of Survival Chances for Breast Cancer Patients
- Data source: the Breast cancer survival.sav file supplied with SPSS and the database used in Lee et al. (2001)
- 325 cases were selected and merged into one database (112 deaths, 213 censored cases)
- Predictors: 2 variables that are contained in both databases, the pathology size and the number of metastasised lymph nodes
- an SVM with an anisotropic Gaussian kernel with the radial basis $3\Sigma^{1/2}$ and capacity $C = 1$ was applied (here $\Sigma$ is the covariance matrix)
Survival Estimation with SVMs 41
Methodology
- the cases were sorted in ascending order by survival time or time to censoring
- 5 groups ($t = 1, \dots, 5$) were selected; all 112 death cases are in groups $t = 1, \dots, 4$; all 213 censored cases are in group $t = 5$
- an SVM was trained at each time $t$ ($t = 0, \dots, 3$); the patients who would die in period $t+1$ were given the label $y_i = -1$, those who would survive: $y_i = +1$
Survival Estimation with SVMs 42
The Timeline
t = 0 (0 months): obtaining test results
t = 1 (by 23.7 months): 28 deaths, 1.18 deaths a month
t = 2 (by 36.9 months): 28 deaths, 2.12 deaths a month
t = 3 (by 52.7 months): 28 deaths, 1.75 deaths a month
t = 4 (by 82.1 months): 28 deaths, 0.96 deaths a month
t = 5: 213 censored cases
Survival Estimation with SVMs 43
Survival Estimation
[figure]
Survival Estimation with SVMs 44
Survival Chances (t=0)
[figure: estimated survival chances; x-axis: tumour size, cm (0-10); y-axis: number of metastasised lymph nodes (0-25)]
Survival Estimation with SVMs 45
Survival Chances (t=1)
[figure: as above, for t=1]
Survival Estimation with SVMs 46
Survival Chances (t=2)
[figure: as above, for t=2]
Survival Estimation with SVMs 47
Survival Chances (t=3)
[figure: as above, for t=3]
Survival Estimation with SVMs 48
References
Cox, D. R. (1972). Regression Models and Life-Tables, Journal of the Royal Statistical Society B 34: 187-220.
Lee, Y.-J., Mangasarian, O. L. and Wolberg, W. H. (2001). Survival-Time Classification of Breast Cancer Patients (technical report): ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-03.ps.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory, Springer, New York, NY.