Building a Prognostic Biomarker


Building a Prognostic Biomarker Noah Simon and Richard Simon July 2016 1 / 44

Prognostic Biomarker for a Continuous Measure
On each of n patients measure:
y_i - a single continuous outcome (e.g. blood pressure, tumor growth)
x_i - a p-vector of features (e.g. SNPs, gene expression values)
We want to model y_i by x_i to predict high-risk patients (give them additional care) and to learn the underlying biology.
2 / 44

An Oldie But a Goodie
Linear Regression: we assume that
y_i = \beta_0 + x_i^\top \beta + \epsilon_i
Generally fit by solving:
\min_{\beta_0, \beta} \tfrac{1}{2} \sum_i \left( y_i - \beta_0 - x_i^\top \beta \right)^2
3 / 44

An Oldie But a Goodie
Diabetes data example: n = 442, p = 10
y_i is a quantitative measure of disease progression (one year after baseline)
x_i includes age, sex, BMI, average BP, and six serum measurements
4 / 44
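To make this concrete, here is a minimal sketch (ours, not the authors' code) of fitting ordinary least squares to this diabetes dataset, using the copy bundled with scikit-learn, and turning the fit into a risk score as on the next slide; the median cutoff is an arbitrary illustrative choice.

```python
# Minimal sketch: OLS on the diabetes data (n = 442, p = 10) and a risk score from the fit.
# Uses the copy of the dataset bundled with scikit-learn; all names here are illustrative.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)        # X: (442, 10), y: disease progression
model = LinearRegression().fit(X, y)

eta = model.intercept_ + X @ model.coef_     # risk score eta_i = beta0_hat + x_i' beta_hat
cutoff = np.median(eta)                      # one (ad hoc) cutoff choice
high_risk = eta >= cutoff                    # stratify into high / low risk
print("nonzero coefficients:", np.sum(model.coef_ != 0), " high-risk fraction:", high_risk.mean())
```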

An Oldie But a Goodie
Can use \hat{\eta}_i = \hat{\beta}_0 + x_i^\top \hat{\beta} to predict risk, or can stratify: \hat{\eta}_i \geq cutoff → high risk; \hat{\eta}_i < cutoff → low risk.
Choosing the cutoff can be tricky.
5 / 44

An Oldie But a Goodie 6 / 44

An Oldie But a Goodie
Two issues:
When p \approx n (or p > n), the estimate is highly variable.
When the true model is far from linear, the estimate is inflexible.
7 / 44

Working in High Dimensions
For p \gg n we need an often reasonable assumption: most of the features are [conditionally] unrelated to the response.
More formally, in the model
y_i = \beta_0 + x_i^\top \beta + \epsilon_i
most of the \beta_j are 0 (or very nearly 0).
8 / 44

Bet on Sparsity
Is this assumption reasonable? Often.
If not, statistical trickery generally will not help: we either need more samples or more benchwork.
9 / 44

Taking Advantage of Sparsity
How do we fit a sparse model? The most obvious approach is best-subset selection:
\min_{\beta_0, \beta} \sum_i \left( y_i - \beta_0 - x_i^\top \beta \right)^2 \quad \text{subject to} \quad \#\{j : \beta_j \neq 0\} \leq d
Unfortunately, this is computationally intractable.
10 / 44

Taking Advantage of Sparsity
How do we fit a sparse model? Instead we use the Lasso:
\min_{\beta_0, \beta} \sum_i \left( y_i - \beta_0 - x_i^\top \beta \right)^2 \quad \text{subject to} \quad \sum_j |\beta_j| \leq c
This can be solved as fast as (or faster than) least squares.
11 / 44

Taking Advantage of Sparsity How does this give us sparsity? Decreasing c increases sparsity. 12 / 44

Taking Advantage of Sparsity
Equivalent penalized form:
\min_{\beta_0, \beta} \sum_i \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \sum_j |\beta_j|
c in the constrained form corresponds to \lambda in the penalized form.
13 / 44
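A sketch (ours) of the penalized form with scikit-learn's Lasso; note that sklearn's `alpha` plays the role of λ, applied to a squared-error loss scaled by 1/(2n), so its values are not directly comparable to an unscaled λ.

```python
# Sketch: the penalized Lasso via scikit-learn; larger alpha (lambda) gives a sparser fit.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
for alpha in [0.01, 0.1, 1.0, 10.0]:
    fit = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    print(f"alpha={alpha:5.2f}  nonzero coefficients: {int(np.sum(fit.coef_ != 0))}")
```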

Choosing λ In practice everyone uses Training/test validation OR Cross-validation 14 / 44

Training/Test Validation
1. Choose candidate λ-values: λ_1, ..., λ_M
2. Split observations into two groups: training and test
3. For each candidate m ≤ M:
   3.1 Training data: build the model with λ_m
   3.2 Test data: apply the model to get predictions
4. Evaluate the predictions for each model and choose the best.
15 / 44

Training/Test Validation
Evaluating each model: for each λ_m we have predictions \hat{y}_i^{(m)}, i = 1, ..., n_{test}. The simplest evaluation is via mean squared error:
\text{performance}_m = \sum_{i \in \text{test data}} \left( y_i - \hat{y}_i^{(m)} \right)^2
16 / 44
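A sketch (ours) of the whole training/test procedure for the Lasso over a grid of candidate λ values; the grid and the split fraction are arbitrary illustrative choices.

```python
# Sketch: training/test validation over a grid of candidate lambdas.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lambdas = np.logspace(-2, 1, 10)                 # candidate lambda_1, ..., lambda_M
test_error = []
for lam in lambdas:
    fit = Lasso(alpha=lam, max_iter=10000).fit(X_tr, y_tr)      # build model on training data
    test_error.append(np.sum((y_te - fit.predict(X_te)) ** 2))  # evaluate on test data
print("best lambda:", lambdas[int(np.argmin(test_error))])
```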

Training/Test Validation 17 / 44

Cross-Validation 18 / 44

Cross-Validation
1. Choose candidate λ-values: λ_1, ..., λ_M
2. Split observations into K folds
3. For each candidate m ≤ M and each fold k ≤ K:
   3.1 Data minus fold k: build the model with λ_m
   3.2 Fold k: apply the model to get predictions
4. Evaluate the models (now on all data)
19 / 44
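The same idea with K folds, as a sketch (ours); scikit-learn's LassoCV wraps essentially this loop.

```python
# Sketch: K-fold cross-validation over candidate lambdas, pooling errors across folds.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True)
lambdas = np.logspace(-2, 1, 10)
cv_error = np.zeros(len(lambdas))

for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    for m, lam in enumerate(lambdas):
        fit = Lasso(alpha=lam, max_iter=10000).fit(X[train_idx], y[train_idx])
        cv_error[m] += np.sum((y[test_idx] - fit.predict(X[test_idx])) ** 2)

print("best lambda:", lambdas[int(np.argmin(cv_error))])
```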

Making Linear Regression Less Linear
What if the relationship isn't linear?
y = 3 sin(x) + ε
y = 2e^x + ε
y = 3x^2 + 2x + 1 + ε
If we know the functional form we can still use linear regression.
20 / 44

Making Linear Regression Less Linear
For y = 3 sin(x) + ε: regress on the feature sin(x) instead of x.
For y = 3x^2 + 2x + 1 + ε: regress on the features x and x^2.
21 / 44

Making Linear Regression Less Linear
What if we don't know the right functional form? Use a flexible basis expansion:
polynomial basis: x → x, x^2, ..., x^k
hockey-stick (spline) basis: x → x, (x − t_1)_+, ..., (x − t_k)_+
22 / 44

Making Linear Regression Less Linear 23 / 44

Making Linear Regression Less Linear
For high-dimensional problems, expand each variable,
x_1, x_2, ..., x_p → (x_1, ..., x_1^k), (x_2, ..., x_2^k), ..., (x_p, ..., x_p^k),
and use the Lasso on this expanded problem.
k must be small (≈ 5-ish); a spline basis generally outperforms a polynomial one.
24 / 44
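A sketch (ours) of the hockey-stick expansion followed by the Lasso on the expanded design matrix; the knot placement at quantiles and the number of knots are illustrative choices.

```python
# Sketch: expand each variable with a truncated-linear ("hockey-stick") basis, then run the Lasso.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

def hockey_stick_basis(x, n_knots=4):
    """Expand one variable into [x, (x - t_1)_+, ..., (x - t_k)_+] with quantile knots."""
    knots = np.quantile(x, np.linspace(0.2, 0.8, n_knots))
    return np.column_stack([x] + [np.maximum(x - t, 0.0) for t in knots])

X, y = load_diabetes(return_X_y=True)
X_exp = np.column_stack([hockey_stick_basis(X[:, j]) for j in range(X.shape[1])])
fit = Lasso(alpha=0.1, max_iter=10000).fit(X_exp, y)
print("expanded p:", X_exp.shape[1], " nonzero coefficients:", int(np.sum(fit.coef_ != 0)))
```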

Prognostic Biomarker for a Binary Measure
On each of n patients measure:
y_i - a single binary outcome (e.g. progression after a year, pCR)
x_i - a p-vector of features (e.g. SNPs, gene expression values)
More common than a continuous response.
25 / 44

Logistic Regression
Relate it back to continuous methods: for a continuous response we solve
\min_{\beta_0, \beta} \sum_i \left( y_i - \beta_0 - x_i^\top \beta \right)^2
One interpretation: if y_i | x_i \sim N(x_i^\top \beta, \sigma^2), then maximizing the likelihood is equivalent to least squares.
26 / 44

Logistic Regression
For a binary response, consider
y_i | x_i \sim \text{Bernoulli}(p_i), \quad p_i = \frac{e^{\beta_0 + x_i^\top \beta}}{1 + e^{\beta_0 + x_i^\top \beta}}
Maximizing the likelihood is equivalent to minimizing
l(\beta, \beta_0) = -\sum_i \left[ y_i \left( \beta_0 + x_i^\top \beta \right) - \log\left( 1 + e^{\beta_0 + x_i^\top \beta} \right) \right]
This is just logistic regression.
27 / 44
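A small numpy sketch (ours) writing out this negative log-likelihood, just to make the formula concrete; the data here are random placeholders.

```python
# Sketch: the logistic-regression negative log-likelihood l(beta, beta0) written out in numpy.
import numpy as np

def neg_log_lik(beta0, beta, X, y):
    """l = -sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ], with eta_i = beta0 + x_i' beta."""
    eta = beta0 + X @ beta
    return -np.sum(y * eta - np.logaddexp(0.0, eta))   # logaddexp(0, eta) = log(1 + e^eta), stable

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3)); y = rng.integers(0, 2, size=20)   # placeholder data
print(neg_log_lik(0.0, np.zeros(3), X, y))   # equals 20 * log(2) when all coefficients are 0
```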

Diabetes Example
Additionally: 166/221 test-positive vs. 55/221 test-negative patients had above-median progression.
28 / 44

Penalized Logistic Regression
As in least squares, we can induce sparsity:
\min_{\beta_0, \beta} \; l(\beta, \beta_0) + \lambda \sum_j |\beta_j|
Choosing \lambda is a bit tricky: we need a measure of the performance of our model.
29 / 44
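A sketch (ours) of the L1-penalized fit with scikit-learn; note that sklearn parameterizes the penalty as C ≈ 1/λ, so smaller C means a sparser model. Dichotomizing the diabetes outcome at its median is our illustrative stand-in for a binary endpoint.

```python
# Sketch: L1-penalized logistic regression via scikit-learn (C = 1/lambda).
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression

X, y_cont = load_diabetes(return_X_y=True)
y = (y_cont > np.median(y_cont)).astype(int)        # illustrative binary outcome

for C in [0.01, 0.1, 1.0]:
    fit = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    print(f"C={C:5.2f}  nonzero coefficients: {int(np.sum(fit.coef_ != 0))}")
```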

Choosing λ
Using CV, we get \hat{\eta}_i = \hat{\beta}_0^{\text{train}} + x_i^\top \hat{\beta}^{\text{train}}
30 / 44

Choosing λ
For each patient we have a CV score \hat{\eta}_i = \hat{\beta}_0^{\text{train}} + x_i^\top \hat{\beta}^{\text{train}}. Now plug in
\hat{p}_i = \frac{e^{\hat{\eta}_i}}{1 + e^{\hat{\eta}_i}}
and use the cross-validated predictive likelihood as our measure:
\prod_i \hat{p}_i^{y_i} (1 - \hat{p}_i)^{1 - y_i}
(Some software equivalently uses the negative log-likelihood.)
31 / 44

Choosing λ
Can also classify patients using \hat{p}_i:
\widehat{\text{class}}_i = 1 \text{ if } \hat{p}_i \geq 0.5, \quad 0 \text{ if } \hat{p}_i < 0.5
and use the cross-validated misclassification rate as our measure:
\text{proportion}\left( y_i \neq \widehat{\text{class}}_i \right)
32 / 44
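A sketch (ours) computing both measures from held-out cross-validated probabilities; `cross_val_predict` gives each patient a prediction from folds that did not contain them, and the probabilities are clipped only to avoid log(0).

```python
# Sketch: cross-validated predictive log-likelihood and misclassification rate.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y_cont = load_diabetes(return_X_y=True)
y = (y_cont > np.median(y_cont)).astype(int)        # illustrative binary outcome

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
p_hat = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]   # held-out p_i
p_hat = np.clip(p_hat, 1e-10, 1 - 1e-10)

cv_log_lik = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))  # log of prod p^y (1-p)^(1-y)
cv_misclass = np.mean((p_hat >= 0.5).astype(int) != y)
print("CV predictive log-likelihood:", cv_log_lik, " CV misclassification rate:", cv_misclass)
```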

Example
Prognostic biomarker for pCR of HER2+ breast cancer patients on Herceptin + CT
n = 60 patients, with 28 pCR
Expression from p = 5349 genes with non-negligible variance
GEO: GSE50948
33 / 44

Example 34 / 44

Some Other Classifiers
Many other classification choices. Noteworthy high-dimensional options:
Diagonal Linear Discriminant Analysis (DLDA)
Nearest Shrunken Centroid (PAM)
Support Vector Machine (SVM)
Each finds a score based on the features, x_i → x_i^\top \beta, indicating the likelihood of each class.
35 / 44

DLDA
Assumes Gaussian features within class, with pooled diagonal covariance:
(x_i | y_i = 0) \sim N(\mu_0, D), \quad (x_i | y_i = 1) \sim N(\mu_1, D)
Estimate \mu_0, \mu_1, D by maximum likelihood. Calculate the probability of each class using Bayes' rule and plug in.
Score is: \eta_i = \left[ \hat{D}^{-1} (\hat{\mu}_1 - \hat{\mu}_0) \right]^\top x_i
36 / 44

PAM
Assumes the same model: Gaussian features within class, with pooled diagonal covariance:
(x_i | y_i = 0) \sim N(\mu_0, D), \quad (x_i | y_i = 1) \sim N(\mu_1, D)
Estimate \mu_0, \mu_1 with shrinkage, so that \tilde{\mu}_{0j} = \tilde{\mu}_{1j} for most j.
Score is: \eta_i = \left[ \hat{D}^{-1} (\tilde{\mu}_1 - \tilde{\mu}_0) \right]^\top x_i
37 / 44
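A numpy sketch (ours) of the DLDA score on simulated two-class data; the PAM variant would additionally soft-threshold each class mean toward the overall mean before forming the same score.

```python
# Sketch: DLDA score eta_i = [D^{-1}(mu1_hat - mu0_hat)]' x_i on simulated two-class data.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 500
y = rng.integers(0, 2, size=n)
mu = np.zeros(p); mu[:10] = 1.0                       # only 10 features truly differ between classes
X = rng.normal(size=(n, p)) + np.outer(y, mu)

mu0_hat, mu1_hat = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
resid = np.vstack([X[y == 0] - mu0_hat, X[y == 1] - mu1_hat])
d_hat = resid.var(axis=0)                             # pooled per-feature (diagonal) variance

eta = X @ ((mu1_hat - mu0_hat) / d_hat)               # DLDA score for each patient
print("mean score, class 1 vs class 0:", eta[y == 1].mean(), eta[y == 0].mean())
```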

SVM
No statistical model: just a discriminant method. Finds the maximum-margin separating hyperplane.
Score is the (signed) distance from the hyperplane: \eta_i = \beta^\top x_i + \beta_0
Can be adapted for incomplete separation.
38 / 44
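A sketch (ours) with scikit-learn's linear SVM on simulated data; `decision_function` returns the signed score β^⊤ x_i + β_0, and the C parameter governs how incomplete separation is handled.

```python
# Sketch: linear SVM score via scikit-learn's decision_function on simulated data.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=100) > 0).astype(int)

svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)   # C controls the penalty for margin violations
eta = svm.decision_function(X)                     # signed score beta' x_i + beta0
print("agreement of sign(eta) with y:", np.mean((eta >= 0) == (y == 1)))
```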

Prognostic Biomarker for Survival Data
On each of n patients measure:
(t_i, s_i) - time and censoring status (e.g. disease-free survival)
x_i - a p-vector of features (e.g. SNPs, gene expression values)
39 / 44

Prognostic Biomarker for Survival Data
The hazard is the probability density of failure at time t given survival up to t. We want to model the hazard as a function of covariates.
40 / 44

Prognostic Biomarker for Survival Data
We will use the Cox proportional hazards model. It assumes the hazard for patient i is
\lambda_i(t) = h(t) \, e^{x_i^\top \beta}
where h(t) is a covariate-independent baseline hazard and e^{x_i^\top \beta} is a covariate-based tilt.
41 / 44

Prognostic Biomarker for Survival Data
Consider the partial likelihood (the likelihood at only the event times); h(t) falls out:
L(\beta) = \prod_{i \in D} \frac{e^{x_i^\top \beta}}{\sum_{j \in R_i} e^{x_j^\top \beta}}
where D is the set of observed events and R_i is the risk set at time t_i. Maximizing this is equivalent to minimizing
l(\beta) = -\sum_{i \in D} \left[ x_i^\top \beta - \log \sum_{j \in R_i} e^{x_j^\top \beta} \right]
42 / 44
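A numpy sketch (ours) of this negative log partial likelihood, assuming no tied event times and defining the risk set R_i as everyone still under observation at t_i; the data are random placeholders.

```python
# Sketch: Cox negative log partial likelihood (no ties) written out in numpy.
import numpy as np

def cox_neg_log_partial_lik(beta, X, time, event):
    """-sum over events i of [ x_i' beta - log sum_{j in R_i} exp(x_j' beta) ], R_i = {j : t_j >= t_i}."""
    eta = X @ beta
    order = np.argsort(time)                              # process patients in time order
    eta, evt = eta[order], event[order]
    log_risk = np.logaddexp.accumulate(eta[::-1])[::-1]   # log sum of exp(eta_j) over the risk set
    return -np.sum(evt * (eta - log_risk))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
time = rng.exponential(size=50)
event = rng.integers(0, 2, size=50)                       # 1 = observed event, 0 = censored
print(cox_neg_log_partial_lik(np.zeros(5), X, time, event))
```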

Bells and Whistles
Similar additions as for the continuous/binary response:
\min_{\beta} \; l(\beta) + \lambda \sum_j |\beta_j|
Choosing \lambda is trickiest here yet.
43 / 44

Straightforward CV approach
Find the CV estimate for each patient: \hat{\eta}_i = x_i^\top \hat{\beta}^{\text{train}}
Calculate the CV predictive partial likelihood:
\prod_{i \in D} \frac{e^{\hat{\eta}_i}}{\sum_{j \in R_i} e^{\hat{\eta}_j}}
Not necessarily a great measure.
44 / 44
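A numpy sketch (ours) of the CV predictive partial likelihood (on the log scale), given held-out scores η̂_i obtained however one likes, e.g. from a penalized Cox fit on the folds not containing patient i; the η̂ values below are random placeholders.

```python
# Sketch: cross-validated predictive log partial likelihood from held-out scores eta_hat.
# eta_hat[i] is assumed to come from a model fit WITHOUT patient i's fold.
import numpy as np

def cv_pred_log_partial_lik(eta_hat, time, event):
    """log of prod over events i of [ exp(eta_i) / sum_{j in R_i} exp(eta_j) ], R_i = {j : t_j >= t_i}."""
    order = np.argsort(time)
    eta, evt = eta_hat[order], event[order]
    log_risk = np.logaddexp.accumulate(eta[::-1])[::-1]
    return np.sum(evt * (eta - log_risk))

rng = np.random.default_rng(0)
n = 50
eta_hat = rng.normal(size=n)           # placeholder held-out CV scores
time = rng.exponential(size=n)
event = rng.integers(0, 2, size=n)
print(cv_pred_log_partial_lik(eta_hat, time, event))
```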