Support Vector Hazard Regression (SVHR) for Predicting Survival Outcomes
Outline: Introduction · Method · Theoretical Results · Simulation Studies · Application · Conclusions
Introduction
For survival data, one important goal is to use a subject's baseline information to predict the timing of an event. A variety of regression models focus on estimating the survival function and evaluating covariate effects, but not on predicting event times. We consider supervised learning algorithms for prediction: they aim directly at prediction; they are nonparametric; and they are powerful and flexible in handling large numbers of predictors.
Many learning methods exist for predicting non-censored outcomes, among which support vector machines (SVMs) are commonly used for binary outcomes. SVMs offer: a simple geometric interpretation; a convex quadratic programming problem; inclusion of non-linearity through kernels; and adaptation to regression with a continuous response via the ε-insensitive loss.
For survival outcomes, censoring imposes new challenges for SVMs. Most existing methods focus on modifying support vector regression. Shivaswamy et al. (2007) and Khan and Zubek (2008) generalized the ε-insensitive loss function.
Van Belle et al. (2009, 2011a) adopted the concordance index to handle censored data. They considered all comparable pairs: for observed times y_i < y_j, ranking constraints were added on the predicted values to penalize misranking, f(x_j) − f(x_i) ≥ 1 − ζ_ij for i < j. Van Belle et al. (2011b) included both regression (loss-function modification) and ranking constraints. Limitations: the prediction rule is not clear; there is no theoretical justification; the observed information may not be fully used; and censoring is assumed to be completely random.
Goldberg and Kosorok (2013) used inverse probability of censoring weighting (IPCW) to adapt standard support vector methods. Their method may suffer from severe bias when the censoring distribution is misspecified; the learning uses only the uncensored observations; and large weights (under heavy censoring) often make the algorithms numerically unstable or even infeasible.
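The IPCW idea above can be sketched as follows. This is a minimal illustration, not the authors' implementation: each uncensored subject is weighted by the inverse of the estimated censoring survival function Ĝ(t−), estimated here by Kaplan-Meier with the roles of event and censoring swapped; the function names are ours.

```python
import numpy as np

def km_censoring_survival(times, events):
    """Kaplan-Meier estimate of the censoring survival function G(t-),
    treating censoring (event == 0) as the 'event' of interest.
    Ties are ignored for simplicity."""
    order = np.argsort(times)
    t, d = times[order], 1 - events[order]  # d = 1 means censored
    n = len(t)
    surv = 1.0
    G = np.empty(n)
    for k in range(n):
        at_risk = n - k
        surv *= 1.0 - d[k] / at_risk
        G[k] = surv
    # map back to original order; G evaluated just before each subject's own time
    out = np.empty(n)
    out[order] = np.concatenate(([1.0], G[:-1]))
    return out

def ipcw_weights(times, events):
    """IPCW weights: 1/G(T_i-) for uncensored subjects, 0 for censored ones."""
    G = km_censoring_survival(times, events)
    return np.where(events == 1, 1.0 / np.clip(G, 1e-8, None), 0.0)

times = np.array([2.0, 3.0, 5.0, 7.0, 11.0])
events = np.array([1, 0, 1, 0, 1])
w = ipcw_weights(times, events)
```

Note how the censored subjects receive weight zero and late events receive increasingly large weights, which illustrates both criticisms: only uncensored observations enter the learning, and heavy censoring inflates the weights.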
Method
We represent the survival times in the counting-process framework. At each event time, we use a support vector machine to separate the event from the non-events.
A risk score f(t, X) = α(t) + g(X) is used to classify the binary outcomes (event vs. non-event) in a maximal-separation sense. α(t): depends on t; stratifies times for the risk sets. g(X): a function of covariates that carries the actual prediction, e.g. g(X) = X^T β for the linear kernel. Note the imbalance between events and non-events at each t (1 vs. n).
Notation: counting process N_i(t) = I(T_i \wedge C_i \le t, T_i \le C_i); at-risk process Y_i(t) = I(T_i \wedge C_i \ge t); total number of events d = \sum_{i=1}^n I(T_i \le C_i).

Primal form:
\[
\min_{\alpha, g}\ \frac{1}{2}\|g\|^2 + C_n \sum_{i=1}^n \sum_{j=1}^d Y_i(t_j)\, w_i(t_j)\, \zeta_i(t_j)
\]
subject to
\[
Y_i(t_j)\, \delta n_i(t_j)\{\alpha(t_j) + g(X_i)\} \ge Y_i(t_j)\{1 - \zeta_i(t_j)\}, \qquad Y_i(t_j)\, \zeta_i(t_j) \ge 0,
\]
where \(\delta n_i(t_j) = 2\{N_i(t_j) - N_i(t_j-)\} - 1\) and
\[
w_i(t_j) = I\{\delta n_i(t_j) = 1\}\left\{1 - \frac{1}{\sum_{k=1}^n Y_k(t_j)}\right\} + I\{\delta n_i(t_j) = -1\}\,\frac{1}{\sum_{k=1}^n Y_k(t_j)}.
\]
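The counting-process quantities Y_i(t_j), δn_i(t_j), and the balancing weights w_i(t_j) can be built directly from the observed data. A minimal sketch (our own helper, assuming no tied event times):

```python
import numpy as np

def counting_process_data(times, events):
    """Build at-risk indicators Y_i(t_j), event labels dn_i(t_j) in {-1, +1},
    and balancing weights w_i(t_j) at each distinct event time t_j.
    times: observed times T_i ^ C_i; events: 1 if T_i <= C_i."""
    event_times = np.sort(times[events == 1])            # t_1 < ... < t_d
    Y = (times[:, None] >= event_times[None, :]).astype(float)   # n x d at-risk
    dn = np.where((times[:, None] == event_times[None, :]) & (events[:, None] == 1),
                  1.0, -1.0)                             # +1 at a subject's own event
    risk_size = Y.sum(axis=0)                            # sum_k Y_k(t_j)
    w = np.where(dn == 1.0, 1.0 - 1.0 / risk_size, 1.0 / risk_size)
    return event_times, Y, dn, w

times = np.array([2.0, 3.0, 5.0, 7.0])
events = np.array([1, 0, 1, 1])
t, Y, dn, w = counting_process_data(times, events)
```

At each event time the single event gets weight 1 − 1/n_j and each of the n_j − 1 at-risk non-events gets weight 1/n_j, so the two classes carry equal total weight, which is exactly how w_i(t_j) addresses the 1-vs-n imbalance.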
Dual form:
\[
L_D = \sum_{i=1}^n \sum_{j=1}^d \gamma_{ij} Y_i(t_j) - \frac{1}{2} \sum_{i=1}^n \sum_{i'=1}^n \sum_{j=1}^d \sum_{j'=1}^d \gamma_{ij}\,\gamma_{i'j'}\, Y_i(t_j)\, Y_{i'}(t_{j'})\, \delta n_i(t_j)\, \delta n_{i'}(t_{j'})\, K(X_i, X_{i'})
\]
subject to
\[
0 \le \gamma_{ij} \le w_i(t_j)\, C_n, \qquad \sum_{i=1}^n \gamma_{ij}\, Y_i(t_j)\, \delta n_i(t_j) = 0.
\]
The kernel function K(X_i, X_{i'}) is the inner product of the feature vectors underlying g, e.g. K(X_i, X_{i'}) = X_i^T X_{i'} for the linear kernel. The predictive score is
\[
\hat g(x) = \sum_{i=1}^n \sum_{j=1}^d \hat\gamma_{ij}\, Y_i(t_j)\, \delta n_i(t_j)\, K(X_i, x).
\]
Prediction: use the predictive scores ĝ(x) and the distinct event times in the training data set to predict the survival outcome of a future subject. Training subjects with event times T_1 < T_2 < T_3 < ... carry predictive scores ĝ(x_3), ĝ(x_2), ĝ(x_1), .... For a future subject with score ĝ(x_new), if the value of ĝ(x_new) is closest to ĝ(x_2), we predict the survival outcome of this subject to be T_2.
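The nearest-score prediction rule above can be sketched in a few lines (a minimal illustration; the function name and toy values are ours):

```python
import numpy as np

def predict_event_time(g_new, g_train, event_times_train):
    """Predict a new subject's survival outcome as the event time of the
    training subject whose predictive score is closest to g_new.
    g_train[i] is the score of the training subject with event_times_train[i]."""
    idx = np.argmin(np.abs(np.asarray(g_train) - g_new))
    return event_times_train[idx]

# training subjects with observed events: scores paired with event times
g_train = np.array([1.8, 0.9, 0.2])    # predictive scores
T_train = np.array([3.0, 7.0, 12.0])   # corresponding event times
pred = predict_event_time(1.0, g_train, T_train)  # closest score is 0.9 -> 7.0
```

Higher scores correspond to higher hazard and hence earlier predicted events, which is why the score ordering runs opposite to the event-time ordering.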
Theoretical Results
Empirical risk

Regularization form: \(\min_f R_n(f) + \lambda_n \|g\|\), where
\[
R_n(f) = n^{-1} \sum_{i=1}^n \sum_{j=1}^d w_i(t_j)\, Y_i(t_j) \bigl[1 - \{\alpha(t_j) + g(X_i)\}\, \delta n_i(t_j)\bigr]_+ .
\]
After substituting \(\hat\alpha(t_j)\) into R_n(f), we obtain a profile empirical risk PR_n(g) for g(·):
\[
PR_n(g) = \frac{1}{n} \sum_{i=1}^n \Delta_i\, \frac{\sum_{k=1}^n I(Y_k \ge Y_i)\,[2 - g(X_i) + g(X_k)]_+}{\sum_{k=1}^n I(Y_k \ge Y_i)} \;-\; \frac{2}{n} \sum_{i=1}^n \frac{\Delta_i}{\sum_{k=1}^n I(Y_k \ge Y_i)},
\]
where \(\Delta_i\) is the event indicator and \(Y_i\) here denotes the observed time. ĝ(x) minimizes PR_n(g) + λ_n‖g‖, and R_n(f̂) = PR_n(ĝ). PR_n(g) takes a form similar to the partial likelihood in survival analysis, under a different loss function.
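The profile empirical risk is straightforward to evaluate for a candidate score vector g. A minimal numeric sketch (our own helper; the subtracted term removes the self-comparison k = i, which always contributes a hinge value of 2):

```python
import numpy as np

def profile_empirical_risk(g, obs_times, events):
    """Profile empirical risk PR_n(g): for each event subject i, a hinge
    ranking loss [2 - g(x_i) + g(x_k)]_+ averaged over i's risk set,
    with the self-comparison term (always equal to 2) removed."""
    g = np.asarray(g, dtype=float)
    n = len(g)
    total = 0.0
    for i in range(n):
        if events[i] != 1:
            continue                              # only event subjects contribute
        at_risk = obs_times >= obs_times[i]       # risk set at subject i's time
        n_i = at_risk.sum()
        hinge = np.maximum(2.0 - g[i] + g[at_risk], 0.0).sum()
        total += hinge / n_i - 2.0 / n_i          # drop the k = i self term
    return total / n

g = np.array([2.0, 1.0, 0.5, -1.0])
times = np.array([1.0, 2.0, 3.0, 4.0])
events = np.array([1, 1, 0, 1])
risk = profile_empirical_risk(g, times, events)
```

Scores that rank earlier-event subjects above their risk-set companions by a margin of 2 incur zero loss, making the connection to the pairwise ranking view of the method explicit.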
Risk Function and Optimal Decision Rule

We consider another empirical risk R_{0n}(f), comparable to the 0-1 loss in standard SVMs:
\[
R_{0n}(f) = n^{-1} \sum_{i=1}^n \sum_{j=1}^d w_i(t_j)\, Y_i(t_j)\, I\bigl(\delta n_i(t_j)\, f(t_j, X_i) < 0\bigr).
\]
Denoting the asymptotic limits of R_n(f) and R_{0n}(f) by R(f) and R_0(f), we find the optimal decision rule f*(t, x) that minimizes both R(f) and R_0(f), and the minimal risk R_0(f*).
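As a small numerical sketch of this weighted 0-1 risk (array names and toy values are ours, not from the study):

```python
import numpy as np

def zero_one_risk(f_vals, Y, dn, w):
    """Empirical 0-1 risk R_0n(f): weighted misclassification of event vs.
    non-event status over all (subject, event-time) pairs in the risk sets.
    f_vals[i, j] = f(t_j, X_i); Y, dn, w are n x d arrays as in the primal."""
    n = Y.shape[0]
    return (w * Y * (dn * f_vals < 0)).sum() / n

# two subjects at risk at a single event time: subject 0 is the event
Y  = np.array([[1.0], [1.0]])
dn = np.array([[1.0], [-1.0]])
w  = np.array([[0.5], [0.5]])     # balanced weights for a risk set of size 2
f_good = np.array([[1.0], [-1.0]])   # classifies both correctly
f_bad  = np.array([[-1.0], [-1.0]])  # misclassifies the event subject
```

Here `zero_one_risk(f_good, ...)` is 0, while `f_bad` pays the event subject's weight: 0.5/2 = 0.25.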
Theorem. Let h(t, x) denote the conditional hazard rate of T at t given X = x, and let \(\bar h(t) = E[dN(t)/dt]/E[Y(t)] = E[h(t, X) \mid Y(t) = 1]\) be the average hazard rate at time t. Then f*(t, x) = sign(h(t, x) − \(\bar h\)(t)) minimizes R(f). Furthermore, f*(t, x) also minimizes R_0(f), and
\[
R_0(f^*) = P(T \le C) - \frac{1}{2}\, E\!\left[\int E\{Y(t) \mid X = x\}\, \bigl|h(t, x) - \bar h(t)\bigr|\, dt\right].
\]
In addition, for any f(t, x) ∈ [−1, 1],
\[
R_0(f) - R_0(f^*) \le c\,\{R(f) - R(f^*)\}
\]
for some constant c.
Asymptotic Properties

We consider the limiting profile risk PR(g) and the reproducing kernel Hilbert space H_n generated by a Gaussian kernel with bandwidth σ_n, and derive the asymptotic learning rate.

Theorem. Assume that the support of X is compact and E[Y(τ) | X] is bounded away from zero, where τ is the study duration. Furthermore, assume λ_n and σ_n satisfy λ_n → 0, σ_n → 0, and \(n \lambda_n \sigma_n^{(2/p - 1/2)d} \to \infty\) for some p ∈ (0, 2). Then
\[
\lambda_n \|\hat g\|_{H_n}^2 + PR(\hat g) - \inf_g PR(g) = O_p\!\left( \lambda_n + \sigma_n^{d/2} + \frac{\lambda_n^{-1/2}\, \sigma_n^{-(1/p - 1/4)d}}{\sqrt{n}} \right).
\]
Simulation Studies
Simulation Setup
We compare our method with Modified SVR (Van Belle et al., 2011) and IPCW (Goldberg and Kosorok, 2013). Five baseline covariates X, marginally normal with mean 0 and variance 0.25. Survival times are generated from a Cox model with a Weibull baseline distribution and β = (2, 1.6, 1.2, 0.8, 0.4). Censoring times depend on the covariates X, with censoring ratios of 40% and 60%: Case 1 uses AFT models for censoring; Case 2 uses Cox models for censoring.
Linear kernel; sample sizes 100 and 200; 500 replicates. The tuning parameter C_n is selected by 5-fold cross-validation over the grid 2^{-16}, 2^{-15}, ..., 2^{16}. The testing data set contains 10,000 observations. Prediction performance is evaluated by correlation and RMSE.
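The grid search for C_n can be sketched generically. This is our own illustration: `fit_predict` and `risk` are hypothetical placeholders for the learner and its loss (here a ridge regression stands in, since the SVHR solver itself is not shown).

```python
import numpy as np

def select_C(X, y, fit_predict, risk, grid=None, n_folds=5, seed=0):
    """Select the tuning parameter C_n by k-fold cross-validation over the
    grid 2^-16, ..., 2^16. fit_predict(X_tr, y_tr, X_te, C) returns
    predictions; risk(y_te, pred) returns the held-out loss."""
    if grid is None:
        grid = 2.0 ** np.arange(-16, 17)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds      # balanced random folds
    cv_risk = []
    for C in grid:
        r = 0.0
        for k in range(n_folds):
            tr, te = folds != k, folds == k
            pred = fit_predict(X[tr], y[tr], X[te], C)
            r += risk(y[te], pred)
        cv_risk.append(r / n_folds)
    return grid[int(np.argmin(cv_risk))]

# toy check with ridge regression standing in for the SVHR solver
def ridge_fit_predict(X_tr, y_tr, X_te, C):
    p = X_tr.shape[1]
    beta = np.linalg.solve(X_tr.T @ X_tr + np.eye(p) / C, X_tr.T @ y_tr)
    return X_te @ beta

def mse(y, pred):
    return float(np.mean((y - pred) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([2, 1.6, 1.2, 0.8, 0.4]) + rng.normal(size=100)
C_best = select_C(X, y, ridge_fit_predict, mse)
```

Larger C corresponds to weaker regularization, mirroring the role of C_n in the primal objective.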
Simulation Results

Table: Case 1, censoring times following the AFT model (censoring ratio 40%)

                              --------- n = 100 ---------   --------- n = 200 ---------
# Noises  Method              Corr.   RMSE (SD)     Ratio   Corr.   RMSE (SD)     Ratio
0         Modified SVR        0.59    5.59 (0.60)   1.19    0.62    5.58 (0.58)   1.24
          IPCW-KM             0.40    5.60 (0.52)   1.20    0.45    5.45 (0.41)   1.21
          IPCW-Cox            0.43    5.80 (0.64)   1.24    0.50    5.62 (0.57)   1.25
          SVHR                0.61    4.68 (0.27)   1.00    0.64    4.49 (0.17)   1.00
5         Modified SVR        0.55    5.64 (0.60)   1.15    0.61    5.63 (0.57)   1.22
          IPCW-KM             0.32    5.93 (0.47)   1.21    0.42    5.63 (0.44)   1.22
          IPCW-Cox            0.33    6.17 (0.54)   1.26    0.44    5.87 (0.57)   1.27
          SVHR                0.58    4.90 (0.35)   1.00    0.63    4.62 (0.20)   1.00
15        Modified SVR        0.46    5.73 (0.47)   1.10    0.54    5.55 (0.50)   1.15
          IPCW-KM             0.21    6.12 (0.32)   1.18    0.31    5.86 (0.34)   1.22
          IPCW-Cox            0.20    6.47 (0.46)   1.24    0.32    6.09 (0.47)   1.26
          SVHR                0.48    5.20 (0.36)   1.00    0.57    4.82 (0.23)   1.00
95        Modified SVR        0.21    6.65 (0.89)   1.10    0.30    6.29 (0.47)   1.09
          IPCW-KM             0.06    6.33 (0.21)   1.05    0.10    6.28 (0.14)   1.09
          IPCW-Cox            0.08    6.59 (0.23)   1.09    0.11    6.61 (0.39)   1.15
          SVHR                0.22    6.04 (0.32)   1.00    0.32    5.76 (0.25)   1.00
Table: Case 1, censoring times following the AFT model (censoring ratio 60%)

                              --------- n = 100 ---------   --------- n = 200 ---------
# Noises  Method              Corr.   RMSE (SD)     Ratio   Corr.   RMSE (SD)     Ratio
0         Modified SVR        0.55    6.00 (0.54)   1.16    0.60    6.07 (0.42)   1.24
          IPCW-KM             0.15    6.45 (0.41)   1.25    0.18    6.42 (0.37)   1.32
          IPCW-Cox            0.21    6.56 (0.47)   1.27    0.26    6.47 (0.48)   1.33
          SVHR                0.57    5.18 (0.43)   1.00    0.61    4.88 (0.33)   1.00
5         Modified SVR        0.50    6.06 (0.53)   1.12    0.57    6.07 (0.50)   1.21
          IPCW-KM             0.11    6.61 (0.34)   1.22    0.15    6.56 (0.32)   1.31
          IPCW-Cox            0.15    6.77 (0.39)   1.25    0.21    6.66 (0.39)   1.33
          SVHR                0.51    5.40 (0.48)   1.00    0.58    5.02 (0.33)   1.00
15        Modified SVR        0.39    6.14 (0.45)   1.10    0.49    5.97 (0.45)   1.15
          IPCW-KM             0.07    6.56 (0.30)   1.17    0.10    6.54 (0.24)   1.26
          IPCW-Cox            0.10    6.82 (0.30)   1.22    0.13    6.70 (0.27)   1.29
          SVHR                0.40    5.60 (0.44)   1.00    0.51    5.20 (0.36)   1.00
95        Modified SVR        0.17    6.90 (1.08)   1.11    0.25    7.20 (1.52)   1.21
          IPCW-KM             0.01    6.53 (0.26)   1.05    0.03    6.54 (0.20)   1.10
          IPCW-Cox            0.02    6.87 (0.20)   1.10    0.04    6.86 (0.21)   1.15
          SVHR                0.17    6.22 (0.24)   1.00    0.26    5.94 (0.25)   1.00
Table: Case 2, censoring times following the Cox model (censoring ratio 40%)

                              --------- n = 100 ---------   --------- n = 200 ---------
# Noises  Method              Corr.   RMSE (SD)     Ratio   Corr.   RMSE (SD)     Ratio
0         Modified SVR        0.59    5.15 (0.59)   1.11    0.62    5.09 (0.54)   1.12
          IPCW-KM             0.53    5.16 (0.42)   1.11    0.55    5.08 (0.31)   1.12
          IPCW-Cox            0.52    5.31 (0.57)   1.14    0.56    5.09 (0.46)   1.12
          SVHR                0.61    4.66 (0.25)   1.00    0.63    4.53 (0.16)   1.00
5         Modified SVR        0.56    5.28 (0.51)   1.08    0.61    5.09 (0.50)   1.12
          IPCW-KM             0.46    5.58 (0.42)   1.14    0.52    5.27 (0.34)   1.13
          IPCW-Cox            0.44    5.73 (0.52)   1.17    0.51    5.41 (0.51)   1.16
          SVHR                0.58    4.89 (0.29)   1.00    0.62    4.65 (0.18)   1.00
15        Modified SVR        0.47    5.43 (0.40)   1.04    0.55    5.14 (0.38)   1.06
          IPCW-KM             0.36    5.79 (0.34)   1.11    0.44    5.49 (0.30)   1.13
          IPCW-Cox            0.34    6.00 (0.40)   1.15    0.42    5.70 (0.43)   1.18
          SVHR                0.49    5.21 (0.33)   1.00    0.57    4.84 (0.20)   1.00
95        Modified SVR        0.21    6.43 (0.92)   1.04    0.33    6.03 (0.54)   1.04
          IPCW-KM             0.17    6.16 (0.21)   1.00    0.24    6.06 (0.18)   1.05
          IPCW-Cox            0.16    6.32 (0.23)   1.02    0.22    6.21 (0.22)   1.07
          SVHR                0.23    6.18 (0.40)   1.00    0.34    5.78 (0.24)   1.00
Table: Case 2, censoring times following the Cox model (censoring ratio 60%)

                              --------- n = 100 ---------   --------- n = 200 ---------
# Noises  Method              Corr.   RMSE (SD)     Ratio   Corr.   RMSE (SD)     Ratio
0         Modified SVR        0.56    5.43 (0.56)   1.08    0.59    5.43 (0.47)   1.12
          IPCW-KM             0.44    5.68 (0.43)   1.13    0.46    5.62 (0.33)   1.16
          IPCW-Cox            0.42    5.83 (0.56)   1.16    0.47    5.67 (0.48)   1.17
          SVHR                0.57    5.01 (0.37)   1.00    0.60    4.85 (0.25)   1.00
5         Modified SVR        0.50    5.61 (0.48)   1.07    0.57    5.40 (0.46)   1.09
          IPCW-KM             0.36    6.02 (0.38)   1.15    0.43    5.79 (0.35)   1.17
          IPCW-Cox            0.34    6.25 (0.44)   1.20    0.41    5.96 (0.47)   1.20
          SVHR                0.53    5.23 (0.37)   1.00    0.59    4.96 (0.27)   1.00
15        Modified SVR        0.40    5.77 (0.42)   1.05    0.50    5.44 (0.38)   1.06
          IPCW-KM             0.27    6.07 (0.31)   1.10    0.35    5.94 (0.26)   1.16
          IPCW-Cox            0.25    6.39 (0.40)   1.16    0.32    6.16 (0.33)   1.20
          SVHR                0.42    5.51 (0.40)   1.00    0.52    5.13 (0.29)   1.00
95        Modified SVR        0.18    6.47 (0.87)   1.05    0.27    6.31 (0.80)   1.05
          IPCW-KM             0.12    6.22 (0.29)   1.01    0.18    6.19 (0.21)   1.03
          IPCW-Cox            0.12    6.54 (0.26)   1.07    0.16    6.50 (0.23)   1.08
          SVHR                0.20    6.14 (0.38)   1.00    0.28    6.00 (0.35)   1.00
Application
Huntington's Disease Study Data
The study aims to identify and combine clinical and biological markers to detect early indicators of disease progression. The outcome is age at event or at censoring. There are 705 subjects, of whom 126 are non-censored. We study the prediction capability of 15 covariates for the age at onset of Huntington's disease. Three-fold cross-validation is used to choose the tuning parameter C_n. We consider both linear and Gaussian kernels. To evaluate prediction capability, we assess the usefulness of the combined score for risk stratification.
Table: Normalized coefficient estimates using linear kernel

Marker                 Normalized β   Cox model a
Total Motor Score          0.680        0.354 *
CAP                        0.440        0.334 *
Stroop Color              -0.235       -0.247
Stroop Word               -0.208       -0.107
SDMT                      -0.151       -0.076
Stroop Interference        0.034        0.271
FRSBE Total                0.246        0.242
UHDRS Psychiatric          0.197        0.270
SCL90 Depression          -0.062       -0.306
SCL90 GSI                 -0.004        0.114
SCL90 PST                 -0.081       -0.217
SCL90 PSDI                 0.096        0.061
TFC                       -0.054       -0.047
Education                 -0.025       -0.092
Male Gender               -0.315       -0.392 *

a Estimates from the Cox model with significant p-values (p < 0.05) are marked with *.
Table: Comparison of prediction capability for different methods

                                    -- 25th pctile --   -- 50th pctile --   -- 75th pctile --
Kernel    Method        C-index    Logrank χ² a  HR b   Logrank χ²   HR     Logrank χ²   HR
Linear    Modified SVR    0.71       42.50      3.08      27.53     2.98      13.41     3.82
          IPCW-KM         0.44        0.06      1.06       0.23     1.09       1.39     1.26
          IPCW-Cox        0.54        5.47      1.57       5.66     1.53       0.90     1.22
          SVHR            0.75       81.79      4.68      35.12     4.74      16.53     7.76
Gaussian  Modified SVR    0.72       46.53      3.27      30.24     3.34      14.33     3.67
          IPCW-KM         0.44        0.32      1.16       0.86     1.19       1.50     1.27
          IPCW-Cox        0.53        5.42      1.57       3.89     1.42       1.97     1.34
          SVHR            0.75       78.66      4.63      36.78     4.68      17.46     8.11

a Logrank χ²: chi-square statistics from log-rank tests for the two groups separated at the 25th, 50th, or 75th percentile of predicted values.
b HR: hazard ratio comparing the two groups separated at the 25th, 50th, or 75th percentile of predicted values.
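The C-index reported above measures concordance between predicted risk and observed survival for censored data: among comparable pairs (the earlier observed time is an event), it is the fraction in which the higher-risk subject fails first. A minimal sketch, counting risk ties as one half (a common convention; the toy data are ours):

```python
import numpy as np

def c_index(risk, times, events):
    """Concordance index for censored data: over pairs (i, j) with
    times[i] < times[j] and events[i] == 1, the fraction in which
    the earlier-failing subject i has the higher risk score."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue                       # i must be an observed event
        for j in range(n):
            if times[i] < times[j]:        # pair is comparable
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5      # ties count one half
    return concordant / comparable

risk = np.array([3.0, 2.0, 1.0, 0.5])
times = np.array([1.0, 2.0, 3.0, 4.0])
events = np.array([1, 1, 0, 1])
ci = c_index(risk, times, events)   # risk ordering matches event order
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect concordance, which is the scale on which the methods in the table are compared.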
[Figure: Hazard ratios comparing two groups separated using percentiles (5th to 75th) of predicted values as cut points. Left panel: linear kernel; right panel: Gaussian kernel. Dotted curve: Modified SVR; dashed curve: IPCW-KM; dashed-dotted curve: IPCW-Cox; solid curve: SVHR.]
Atherosclerosis Risk in Communities Study Data
A prospective epidemiologic study investigating the etiology of atherosclerosis and its cardiovascular risk factors. The baseline examination enrolled 15,792 participants aged 45-64 from four U.S. communities. In this example, we apply our method to part of the baseline data: African-American males with hypertension living in Jackson, Mississippi. There are 624 participants, of whom 133 are non-censored. We assess the prediction capability of common cardiovascular risk factors for incident heart failure through 2005. We analyze the data following the same procedure as before.
Table: Normalized coefficient estimates using linear kernel

Covariate                      Normalized β   Cox model a
Age (in years)                     0.363        0.328 *
Diabetes                           0.288        0.221 *
BMI (kg/m²)                        0.150        0.136
SBP (mm Hg)                        0.172        0.178
Fasting glucose (mg/dl)            0.173       -0.093
Serum albumin (g/dl)              -0.363       -0.273 *
Serum creatinine (mg/dl)           0.007        0.029
Heart rate (beats/minute)          0.124        0.125
Left ventricular hypertrophy       0.250        0.158 *
Bundle branch block                0.341        0.242 *
Prevalent CHD                      0.330        0.216 *
Valvular heart disease             0.200        0.169 *
HDL (mg/dl)                       -0.287       -0.436 *
LDL (mg/dl)                        0.016        0.051
Pack years of smoking              0.289        0.230 *
Current smoking status             0.210        0.022
Former smoking status             -0.133       -0.232 *

a Estimates from the Cox model with significant p-values (p < 0.05) are marked with *.
Table: Comparison of prediction capability for different methods

                                    -- 25th pctile --   -- 50th pctile --   -- 75th pctile --
Kernel    Method        C-index    Logrank χ² a  HR b   Logrank χ²   HR     Logrank χ²   HR
Linear    Modified SVR    0.74       90.52      4.63      59.11     4.16      31.85     5.01
          IPCW-KM         0.69       54.90      3.48      29.53     2.64      22.92     3.45
          IPCW-Cox        0.71       48.34      3.24      39.70     3.12      27.63     4.32
          SVHR            0.76       95.09      4.78      67.06     4.63      34.93     5.36
Gaussian  Modified SVR    0.76      105.10      5.12      70.41     4.87      37.66     6.39
          IPCW-KM         0.70       58.15      3.61      33.49     2.81      19.61     3.00
          IPCW-Cox        0.72       52.77      3.39      47.10     3.50      27.99     4.37
          SVHR            0.77      111.10      5.31      64.79     4.53      35.60     5.76

a Logrank χ²: chi-square statistics from log-rank tests for the two groups separated at the 25th, 50th, or 75th percentile of predicted values.
b HR: hazard ratio comparing the two groups separated at the 25th, 50th, or 75th percentile of predicted values.
[Figure: Hazard ratios comparing two groups separated using percentiles (5th to 75th) of predicted values as cut points. Left panel: linear kernel; right panel: Gaussian kernel. Dotted curve: Modified SVR; dashed curve: IPCW-KM; dashed-dotted curve: IPCW-Cox; solid curve: SVHR.]
Conclusions
Concluding Remarks
We adapted the SVM learning algorithm to predict event times in the counting-process framework. The proposed method is optimal in discriminating the covariate-specific hazard from the population average hazard. It handles censored data appropriately without specifying the censoring distribution. Numerical studies showed the superiority of SVHR, especially in the presence of high censoring ratios and noise variables. One potential challenge is the fast-growing dimension of the quadratic programming problem as the sample size increases.