Building a Prognostic Biomarker
Noah Simon and Richard Simon
July 2016
Prognostic Biomarker for a Continuous Measure

On each of n patients measure:
- $y_i$ - single continuous outcome (e.g. blood pressure, tumor growth)
- $x_i$ - p-vector of features (e.g. SNPs, gene expression values)

Want to model $y_i$ by $x_i$ to:
- Predict high-risk patients (give them additional care)
- Learn the underlying biology
An Oldie But a Goodie

Linear Regression: We assume that
$$y_i = \beta_0 + x_i^\top \beta + \epsilon_i$$
Generally fit by solving:
$$\min_{\beta_0, \beta} \; \frac{1}{2} \sum_i \left( y_i - \beta_0 - x_i^\top \beta \right)^2$$
An Oldie But a Goodie

Diabetes data example:
- n = 442, p = 10
- $y_i$ is a quantitative measure of disease progression (one year after baseline)
- $x_i$ includes age, sex, BMI, avg BP, and six serum measurements
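As a concrete sketch (assuming scikit-learn, which ships a version of this diabetes dataset), the fit takes a few lines of Python:

```python
# Minimal sketch: ordinary least squares on the diabetes data (n = 442, p = 10).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)   # X: (442, 10), y: disease progression

model = LinearRegression().fit(X, y)
eta = model.predict(X)                  # eta_i = beta0_hat + x_i' beta_hat
print(model.intercept_, model.coef_)
```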
An Oldie But a Goodie

Can use $\hat\eta_i = \hat\beta_0 + x_i^\top \hat\beta$ to predict risk. Or can stratify:
- $\hat\eta_i \geq$ cutoff: high risk
- $\hat\eta_i <$ cutoff: low risk

Choosing the cutoff can be tricky.
An Oldie But a Goodie

Two issues:
- When $p \approx n$ (or $p > n$), the estimate is highly variable
- When the true model is far from linear, the estimate is inflexible
Working in High Dimensions

For $p \gg n$ we need an often-reasonable assumption:
- Most of the features are [conditionally] unrelated to the response.

More formally: In the model $y_i = \beta_0 + x_i^\top \beta + \epsilon_i$, most of the $\beta_j$ are 0 (or very nearly 0).
Bet on Sparsity

Is this assumption reasonable? Often.
If not, statistical trickery generally will not help. Either need more samples, or more benchwork.
Taking Advantage of Sparsity

How do we fit a sparse model? The most obvious approach is best-subset selection:
$$\underset{\beta_0, \beta}{\text{minimize}} \; \sum_i \left( y_i - \beta_0 - x_i^\top \beta \right)^2 \quad \text{subject to} \quad \#\{j : \beta_j \neq 0\} \leq d$$
Unfortunately, this is computationally intractable.
Taking Advantage of Sparsity

How do we fit a sparse model? Instead, we use the Lasso:
$$\underset{\beta_0, \beta}{\text{minimize}} \; \sum_i \left( y_i - \beta_0 - x_i^\top \beta \right)^2 \quad \text{subject to} \quad \sum_j |\beta_j| \leq c$$
This can be solved as fast as (or faster than) least squares.
Taking Advantage of Sparsity

How does this give us sparsity? Decreasing c increases sparsity.
Taking Advantage of Sparsity

Equivalent penalized form:
$$\underset{\beta_0, \beta}{\text{minimize}} \; \sum_i \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \sum_j |\beta_j|$$
c in the constrained form $\leftrightarrow$ $\lambda$ in the penalized form.
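A minimal sketch of the penalized form with scikit-learn's Lasso. One caveat: sklearn scales the squared-error loss by 1/(2n), so its `alpha` corresponds to $\lambda$ only up to that scaling.

```python
# Sparsity in action: larger alpha (larger lambda, smaller c) -> sparser model.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

for alpha in [0.1, 1.0, 10.0]:
    fit = Lasso(alpha=alpha).fit(X, y)
    print(alpha, np.sum(fit.coef_ != 0), "nonzero coefficients")
```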
Choosing λ

In practice everyone uses either training/test validation or cross-validation.
Training/Test Validation

1. Choose candidate λ-values: $\lambda_1, \ldots, \lambda_M$
2. Split observations into two groups: training and test
3. For each candidate $m \leq M$:
   3.1 Training data: build the model with $\lambda_m$
   3.2 Test data: apply the model to get predictions
4. Evaluate the predictions for each model; choose the best.
Training/Test Validation

Evaluating each model: for each $\lambda_m$ we have $\hat y_i^{(m)}$, $i = 1, \ldots, n_{\text{test}}$. Simplest evaluation is via mean-square-error:
$$\text{performance}_m = \sum_{i \in \text{test data}} \left( y_i - \hat y_i^{(m)} \right)^2$$
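A sketch of steps 1-4 in Python; the 70/30 split and log-spaced λ grid are illustrative choices, not prescribed by the procedure.

```python
# Train/test validation: pick the lambda with the best test-set MSE.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lambdas = np.logspace(-2, 1, 20)               # candidate lambda_1..lambda_M
perf = []
for lam in lambdas:
    yhat = Lasso(alpha=lam).fit(X_tr, y_tr).predict(X_te)
    perf.append(np.mean((y_te - yhat) ** 2))   # mean-square-error on test data
print("selected lambda:", lambdas[int(np.argmin(perf))])
```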
Cross-Validation

1. Choose candidate λ-values: $\lambda_1, \ldots, \lambda_M$
2. Split observations into K folds
3. For each candidate $m \leq M$, and each fold $k \leq K$:
   3.1 Data (minus fold k): build the model with $\lambda_m$
   3.2 Data (fold k): apply the model to get predictions
4. Evaluate the models (now on all the data)
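With scikit-learn, LassoCV runs this whole loop in one call; a minimal sketch:

```python
# K-fold CV over a lambda path (LassoCV chooses the candidate grid itself).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV

X, y = load_diabetes(return_X_y=True)
cv_fit = LassoCV(cv=10).fit(X, y)      # K = 10 folds
print("CV-selected lambda (sklearn's alpha):", cv_fit.alpha_)
```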
Making Linear Regression Less Linear

What if the relationship isn't linear?
- $y = 3\sin(x) + \epsilon$
- $y = 2e^x + \epsilon$
- $y = 3x^2 + 2x + 1 + \epsilon$

If we know the functional form we can still use linear regression.
Making Linear Regression Less Linear

- $y = 3\sin(x) + \epsilon$: use $x \to \sin(x)$
- $y = 3x^2 + 2x + 1 + \epsilon$: use $x \to (x, x^2)$
Making Linear Regression Less Linear

What if we don't know the right functional form? Use a flexible basis expansion:
- polynomial basis: $x \to (x, x^2, \ldots, x^k)$
- hockey-stick (/spline) basis: $x \to (x, (x - t_1)_+, \ldots, (x - t_k)_+)$
Making Linear Regression Less Linear

For high-dimensional problems, expand each variable:
$$x_1, x_2, \ldots, x_p \;\to\; (x_1, \ldots, x_1^k), (x_2, \ldots, x_2^k), \ldots, (x_p, \ldots, x_p^k)$$
and use the Lasso on this expanded problem (see the sketch below).
- k must be small ($\leq$ 5ish)
- A spline basis generally outperforms a polynomial one
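A sketch of this recipe using scikit-learn's SplineTransformer (available in scikit-learn >= 1.0) followed by the Lasso; the knot count and degree are illustrative choices.

```python
# Expand each feature in a spline basis, then run the Lasso on the
# expanded design matrix.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

X, y = load_diabetes(return_X_y=True)

model = make_pipeline(
    SplineTransformer(n_knots=4, degree=3),  # per-feature spline basis (small k)
    LassoCV(cv=10),                          # sparsity over all basis columns
).fit(X, y)
print("expanded p:", model[:-1].transform(X).shape[1])
```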
Prognostic Biomarker for a Binary Measure

On each of n patients measure:
- $y_i$ - single binary outcome (e.g. progression after a year, pCR)
- $x_i$ - p-vector of features (e.g. SNPs, gene expression values)

More common than a continuous response.
Logistic Regression

Relate it back to continuous methods. For a continuous response we solve:
$$\underset{\beta_0, \beta}{\text{minimize}} \; \sum_i \left( y_i - \beta_0 - x_i^\top \beta \right)^2$$
One interpretation: if $y_i \mid x_i \sim \mathcal{N}\left(\beta_0 + x_i^\top \beta,\, \sigma^2\right)$, then maximizing the likelihood is equivalent to least squares.
Logistic Regression

For a binary response, consider
$$y_i \mid x_i \sim \text{Ber}\left( p_i = \frac{e^{\beta_0 + x_i^\top \beta}}{1 + e^{\beta_0 + x_i^\top \beta}} \right)$$
Maximizing the likelihood is equivalent to minimizing
$$\ell(\beta, \beta_0) = -\sum_i \left[ y_i \left( \beta_0 + x_i^\top \beta \right) - \log\left( 1 + e^{\beta_0 + x_i^\top \beta} \right) \right]$$
This is just logistic regression.
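A quick numerical illustration of this formula on made-up data ($\beta_0$ and $\beta$ here are arbitrary, not fitted):

```python
# Evaluate l(beta, beta0) directly from the formula above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
beta0, beta = 0.1, rng.normal(size=5)

eta = beta0 + X @ beta
nll = -np.sum(y * eta - np.logaddexp(0.0, eta))  # log(1 + e^eta), overflow-safe
print(nll)
```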
Diabetes Example

Additionally: 166/221 test-positive vs. 55/221 test-negative patients with above-median progression.
Penalized Logistic Regression

As in least squares, we can induce sparsity:
$$\underset{\beta_0, \beta}{\text{minimize}} \; \ell(\beta, \beta_0) + \lambda \sum_j |\beta_j|$$
Choosing λ is a bit tricky: we need a measure of the performance of our model.
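A minimal sketch with scikit-learn; note sklearn parameterizes the penalty by C, roughly 1/λ, so smaller C gives a sparser model. The data here are synthetic placeholders.

```python
# L1-penalized logistic regression in a p >> n setting.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))                  # p >> n, as in the examples
y = rng.integers(0, 2, size=60)

fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print(np.sum(fit.coef_ != 0), "nonzero coefficients")
```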
Choosing λ

Using CV, for each patient we get a CV score
$$\hat\eta_i = \hat\beta_0^{\text{train}} + x_i^\top \hat\beta^{\text{train}}$$
Now plug in
$$\hat p_i = \frac{e^{\hat\eta_i}}{1 + e^{\hat\eta_i}}$$
and use the cross-validated predictive likelihood as our measure:
$$\prod_i \hat p_i^{y_i} \left(1 - \hat p_i\right)^{1 - y_i}$$
(Some software equivalently uses the negative log-likelihood.)
Choosing λ

Can also classify patients using $\hat p_i$:
$$\widehat{\text{class}}_i = \begin{cases} 1 & \hat p_i \geq 0.5 \\ 0 & \hat p_i < 0.5 \end{cases}$$
and use the cross-validated misclassification rate as our measure:
$$\text{proportion}\left( y_i \neq \widehat{\text{class}}_i \right)$$
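A sketch computing both CV measures from cross-validated probabilities, again on synthetic placeholder data:

```python
# CV predictive log-likelihood and CV misclassification rate from p_hat_i.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
p_hat = cross_val_predict(clf, X, y, cv=10, method="predict_proba")[:, 1]
p_hat = np.clip(p_hat, 1e-12, 1 - 1e-12)        # guard the logs

cv_loglik = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
cv_misclass = np.mean((p_hat >= 0.5).astype(int) != y)
print(cv_loglik, cv_misclass)
```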
Example

Prognostic biomarker for pCR of HER2+ breast cancer patients on Herceptin + CT:
- n = 60 patients, with 28 pCR
- Expression from p = 5349 genes with non-negligible variance
- GEO: GSE50948
Some Other Classifiers

Many other classification choices. Noteworthy high-dimensional options:
- Diagonal Linear Discriminant Analysis (DLDA)
- Nearest Shrunken Centroid (PAM)
- Support Vector Machine (SVM)

Each finds a score based on the features $x_i$ indicating the likelihood of each class.
DLDA

Assumes Gaussian features within class, with pooled diagonal covariance:
$$(x_i \mid y_i = 0) \sim \mathcal{N}(\mu_0, D), \qquad (x_i \mid y_i = 1) \sim \mathcal{N}(\mu_1, D)$$
Estimate $\mu_0$, $\mu_1$, $D$ by maximum likelihood. Calculate the probability of each class using Bayes' rule and plug in.
Score is: $\eta_i = (\hat\mu_1 - \hat\mu_0)^\top \hat D^{-1} x_i$
PAM

Assumes Gaussian features within class, with pooled diagonal covariance:
$$(x_i \mid y_i = 0) \sim \mathcal{N}(\mu_0, D), \qquad (x_i \mid y_i = 1) \sim \mathcal{N}(\mu_1, D)$$
Estimate $\mu_0$, $\mu_1$ with shrinkage! So $\tilde\mu_{0j} = \tilde\mu_{1j}$ for most j.
Score is: $\eta_i = (\tilde\mu_1 - \tilde\mu_0)^\top \hat D^{-1} x_i$
SVM

No statistical model: just a discriminant method. Finds the maximum-margin separating hyperplane.
Score is the (signed) distance from the hyperplane:
$$\eta_i = x_i^\top \beta + \beta_0$$
Can be adapted for incomplete separation (a "soft margin").
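Hedged scikit-learn sketches of two of these: NearestCentroid with shrink_threshold gives a nearest-shrunken-centroid (PAM-style) rule, and LinearSVC exposes the SVM score via decision_function. The data are synthetic placeholders.

```python
# PAM-style shrunken-centroid classifier and a linear SVM score.
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = (X[:, 0] > 0).astype(int)

pam = NearestCentroid(shrink_threshold=0.2).fit(X, y)   # shrinks centroids
svm = LinearSVC(C=1.0).fit(X, y)                        # soft-margin SVM

eta = svm.decision_function(X)     # eta_i = x_i' beta + beta0 (signed score)
print(pam.predict(X[:5]), eta[:5])
```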
Prognostic Biomarker for Survival Data

On each of n patients measure:
- $(t_i, s_i)$ - time and censoring status (e.g. disease-free survival)
- $x_i$ - p-vector of features (e.g. SNPs, gene expression values)
Prognostic Biomarker for Survival Data

The hazard is the probability density of failure at time t given survival up to t:
$$\lambda(t) = \lim_{\delta \to 0} \frac{P(t \leq T < t + \delta \mid T \geq t)}{\delta}$$
We want to model the hazard as a function of the covariates.
Prognostic Biomarker for Survival Data

We will use the Cox Proportional Hazards Model, which assumes the hazard for patient i is
$$\lambda_i(t) = h(t)\, e^{x_i^\top \beta}$$
where
- $h(t)$ is a covariate-independent baseline hazard
- $e^{x_i^\top \beta}$ is a covariate-based tilt
Prognostic Biomarker for Survival Data

Considering the partial likelihood (the likelihood at only the event times), $h(t)$ falls out:
$$L(\beta) = \prod_{i \in D} \frac{e^{x_i^\top \beta}}{\sum_{j \in R_i} e^{x_j^\top \beta}}$$
where D is the set of patients with observed events and $R_i$ is the risk set at time $t_i$. Maximizing this is equivalent to minimizing
$$\ell(\beta) = -\sum_{i \in D} \left[ x_i^\top \beta - \log \sum_{j \in R_i} e^{x_j^\top \beta} \right]$$
Bells and Whistles

Similar additions as for continuous/binary responses:
$$\underset{\beta}{\text{minimize}} \; \ell(\beta) + \lambda \sum_j |\beta_j|$$
Choosing λ is trickiest here.
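A sketch of a lasso-penalized Cox fit using the lifelines package (an assumption: the source names no software; lifelines' `penalizer` plays the role of λ, and `l1_ratio=1.0` gives a pure L1 penalty):

```python
# Lasso-penalized Cox proportional hazards fit on a toy data frame.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "t":  [5.0, 8.0, 12.0, 3.0, 9.0],      # observed times
    "s":  [1, 0, 1, 1, 0],                  # event indicators (1 = event)
    "x1": [0.2, -1.0, 0.5, 1.3, -0.7],
    "x2": [1.1, 0.3, -0.2, 0.8, -1.5],
})

cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)
cph.fit(df, duration_col="t", event_col="s")
print(cph.params_)                           # fitted beta (lifelines >= 0.23)
```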
Straightforward CV Approach

Find the CV estimate for each patient:
$$\hat\eta_i = x_i^\top \hat\beta^{\text{train}}$$
Calculate the CV predictive partial likelihood:
$$\prod_{i \in D} \frac{e^{\hat\eta_i}}{\sum_{j \in R_i} e^{\hat\eta_j}}$$
Not necessarily a great measure.
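A direct numpy computation of this measure from per-patient CV scores, assuming no tied event times (the scores below are placeholders):

```python
# Log of the CV predictive partial likelihood from CV scores eta_hat.
import numpy as np

def cv_partial_likelihood(t, s, eta_hat):
    """t: times, s: event indicators (1 = event), eta_hat: CV scores."""
    logpl = 0.0
    for i in np.where(s == 1)[0]:
        risk_set = t >= t[i]                  # R_i: still at risk at t_i
        logpl += eta_hat[i] - np.log(np.sum(np.exp(eta_hat[risk_set])))
    return logpl                              # log of the product over D

t = np.array([5.0, 8.0, 12.0, 3.0, 9.0])
s = np.array([1, 0, 1, 1, 0])
eta_hat = np.array([0.4, -0.1, 0.3, 1.2, -0.5])
print(cv_partial_likelihood(t, s, eta_hat))
```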