Multimodal Deep Learning for Predicting Survival from Breast Cancer


Heather Couture
Deep Learning Journal Club, Nov. 16, 2016

Outline
- Background on tumor histology & genetic data
- Background on survival analysis
- Deep survival models
  - Katzman et al., Deep Survival: A Deep Cox Proportional Hazards Network, 2016
  - Yousefi et al., Learning Genomic Representations to Predict Clinical Outcomes in Cancer, ICLR, 2016
- Multimodal deep learning
  - Wang et al., On Deep Multi-View Representation Learning, ICML, 2015
- My work: predicting survival from tumor histology & genetics

Tissue Microarray
[Figure: tissue microarray workflow. The tumor area is identified on a slide and assembled into a microarray; gene expression yields the genetic subtype, immunohistochemistry yields receptor status, and H&E histology yields the histologic subtype and grade. Sauter, Tissue microarrays in drug discovery, 2003]

Applications of Tumor Analysis
Prognosis. A more favorable outcome is associated with:
- few mitoses
- less nuclear pleomorphism (irregularity of nuclear size and shape)
- well differentiated cells (cell specialization)
Personalized treatment: target tumors based on molecular analysis; can image analysis complement molecular analysis in choosing among treatments?
[Figure: abnormal cells; images from gistsupport.org, imaginis.com, breastpathology.info]

Motivation
Improve predictions by using automated image analysis:
- faster
- more repeatable
- captures properties that humans cannot
- captures spatial properties in a way that genetics cannot
Complement genetic analysis by integrating image and genetic data into a single model. Predict survival to identify high- and low-risk patients.

Approach
A deep survival model with multimodal data: image features + gene expression → risk score.
[Figure: network diagram combining image features and gene expression into a single risk score]

Survival Data
Each patient has an event time T and an event indicator E:
- E = 1 (e.g., death occurred): T is the time to death
- E = 0 (e.g., last contact with the patient): T is the time of last follow-up (right-censored)
Predicting survival:
- standard regression methods must discard right-censored data
- binary discriminative methods (e.g., predicting death by time t) must discard the time to death
Solution: the proportional hazards model.

Survival Analysis
Survival function: $S(t) = \Pr(T > t)$
Hazard function: $\lambda(t) = \lim_{\delta \to 0} \Pr(t \le T < t + \delta \mid T \ge t) / \delta$
Proportional hazards model: $\lambda(t \mid x) = \lambda_0(t)\, e^{h(x)}$, where $\lambda_0(t)$ is the baseline hazard function, $h(x)$ is the risk function, and $x$ are the covariates.
Cox proportional hazards model: $h_\beta(x) = \beta^T x$
Cox partial likelihood: maximize
$L_c(\beta) = \prod_{i \in \{i : E_i = 1\}} \frac{e^{h_\beta(x_i)}}{\sum_{j \in \{j : T_j \ge T_i\}} e^{h_\beta(x_j)}}$
(in practice, its logarithm, the partial log likelihood, is maximized)
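For orientation, a linear Cox model of this form can be fit in a few lines. The sketch below uses the lifelines Python package (not part of the talk); the CSV path and column names are hypothetical:

```python
import pandas as pd
from lifelines import CoxPHFitter

# hypothetical file: covariate columns plus time 'T' and event indicator 'E'
df = pd.read_csv("patients.csv")

cph = CoxPHFitter()
cph.fit(df, duration_col="T", event_col="E")  # maximizes the partial likelihood
cph.print_summary()                           # fitted coefficients beta
print(cph.concordance_index_)                 # in-sample concordance index
```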

Model Performance
Concordance index: the fraction of orderable pairs on which the risk predictions agree with the observed survival times,
$CI(\beta, X) = \frac{1}{|P|} \sum_{(i,j) \in P} I(i, j)$
$I(i, j) = \begin{cases} 1 & \text{if } h_\beta(x_i) > h_\beta(x_j) \text{ and } T_j > T_i \\ 0 & \text{otherwise} \end{cases}$
where $P$ is the set of orderable pairs $(x_i, x_j)$, i.e., those with $E_i = 1$ and either $E_j = 1$ or ($E_j = 0$ and $T_j > T_i$).
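As a concrete reference for the definition above, here is a minimal NumPy sketch (not from the talk); for simplicity, ties in risk are counted as disagreements:

```python
import numpy as np

def concordance_index(risk, time, event):
    """C-index over orderable pairs: i must be an observed event (E_i = 1)
    and j must survive longer (T_j > T_i)."""
    num, den = 0, 0
    for i in range(len(time)):
        if event[i] != 1:
            continue
        for j in range(len(time)):
            if time[j] > time[i]:              # (i, j) is an orderable pair
                den += 1
                num += int(risk[i] > risk[j])  # higher risk should fail earlier
    return num / den
```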

Deep Survival Network
Katzman et al., Deep Survival: A Deep Cox Proportional Hazards Network, 2016
Yousefi et al., Learning Genomic Representations to Predict Clinical Outcomes in Cancer, 2016
Approach: replace $h(x)$ with a DNN that maps input features $x_i$ through network weights $\theta$ to a risk score $h_\theta(x_i)$.
Cost function: the Cox partial log likelihood,
$L(\theta) = \sum_{i \in \{i : E_i = 1\}} \left( h_\theta(x_i) - \log \sum_{j \in \{j : T_j \ge T_i\}} e^{h_\theta(x_j)} \right)$
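A minimal PyTorch sketch of this cost, negated so it can be minimized by gradient descent (an illustration, not the authors' code); sorting by time makes each risk set a prefix of the batch:

```python
import torch

def neg_cox_log_likelihood(risk, time, event):
    """Negative Cox partial log likelihood.
    risk: (n,) network outputs h_theta(x_i)
    time: (n,) times T_i; event: (n,) indicators E_i (1 = event observed)."""
    order = torch.argsort(time, descending=True)   # largest T first
    risk, event = risk[order], event[order].float()
    # after sorting, the risk set {j : T_j >= T_i} is the prefix ending at i
    log_risk_set = torch.logcumsumexp(risk, dim=0)
    return -torch.sum((risk - log_risk_set) * event) / event.sum()
```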

Experiments
Katzman et al., Deep Survival: A Deep Cox Proportional Hazards Network, 2016
Worcester Heart Attack Study: 1,638 observations; 5 features (age, sex, BMI, left heart failure complications, order of MI)
- Linear Cox regression C-index: 0.669 (95% CI: 0.666-0.671)
- DeepSurv C-index: 0.779 (95% CI: 0.777-0.782)
Molecular Taxonomy of Breast Cancer (METABRIC): 1,981 patients; expression levels for 14 manually selected genes; clinical features: age, number of positive nodes, tumor size, receptor status, treatment
- Linear Cox regression C-index: 0.688 (95% CI: 0.686-0.690)
- DeepSurv C-index: 0.695 (95% CI: 0.693-0.697)

Experiments
Yousefi et al., Learning Genomic Representations to Predict Clinical Outcomes in Cancer, ICLR, 2016
TCGA brain tumors: 628 samples, 183 genomic features
- 10 random splits: 70% training, 30% testing
- 2 fully connected layers of 250 hidden units each

Multimodal Deep Learning
Wang et al., On Deep Multi-View Representation Learning, ICML, 2015
Setting: multiple unlabeled views of the data are available for representation learning, but only one view is available at test time.
Examples:
- audio + video
- images + text
- parallel text in two languages
- words + their context
Two approaches:
- Canonical correlation analysis (CCA): learn features in the two views that are maximally correlated.
- Autoencoder: learn a representation that best reconstructs the inputs.

Multimodal Data
Given pairs $(x_i, y_i)$, $i = 1, \ldots, N$, we wish to learn $f(x_i)$ and $g(y_i)$ such that:
- $f(x_i)$ and $g(y_i)$ are highly correlated, and/or
- $y_i$ can be reconstructed from $x_i$ through $f(x_i)$, and vice versa.

Canonical Correlation Analysis (CCA)
Find projections $u$ and $v$ such that the projected data are maximally correlated:
$(u, v) = \arg\max_{u,v} \operatorname{corr}(u^T X, v^T Y) = \arg\max_{u,v} \frac{u^T \Sigma_{xy} v}{\sqrt{u^T \Sigma_{xx} u}\, \sqrt{v^T \Sigma_{yy} v}}$
Constraining the projections to have unit variance gives the equivalent problem
maximize: $u^T \Sigma_{xy} v$  subject to: $u^T \Sigma_{xx} u = v^T \Sigma_{yy} v = 1$
To find multiple pairs $(u_i, v_i)$, require $u_i^T \Sigma_{xx} u_j = v_i^T \Sigma_{yy} v_j = 0$ for $i < j$; stacking $U = [u_1, \ldots, u_k]$ and $V = [v_1, \ldots, v_k]$ gives
maximize: $\operatorname{tr}(U^T \Sigma_{xy} V)$  subject to: $U^T \Sigma_{xx} U = V^T \Sigma_{yy} V = I$
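For reference, linear CCA has a closed-form solution: whiten each view, then take the SVD of the whitened cross-covariance. A minimal NumPy sketch (a small ridge term is assumed for numerical stability):

```python
import numpy as np

def cca(X, Y, k, reg=1e-6):
    """Linear CCA. X: (N, dx), Y: (N, dy), rows are paired samples.
    Returns projection matrices U, V and the top-k canonical correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    N = X.shape[0]
    Sxx = X.T @ X / N + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / N + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / N
    # whitening transforms: Wx^T Sxx Wx = I (via Cholesky factors)
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Syy)).T
    # SVD of the whitened cross-covariance yields the canonical pairs
    A, corrs, Bt = np.linalg.svd(Wx.T @ Sxy @ Wy)
    return Wx @ A[:, :k], Wy @ Bt.T[:, :k], corrs[:k]
```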

Deep Canonical Correlation Analysis (DCCA)
Andrew et al., Deep Canonical Correlation Analysis, ICML, 2013
Replace the linear projections with deep networks $f$ and $g$, one per view:
maximize: $\frac{1}{N} \operatorname{tr}(U^T f(X) g(Y)^T V)$
subject to:
$U^T \left( \frac{1}{N} f(X) f(X)^T + r_x I \right) U = I$
$V^T \left( \frac{1}{N} g(Y) g(Y)^T + r_y I \right) V = I$
$u_i^T f(X) g(Y)^T v_j = 0$ for $i \ne j$
The first two constraints force feature dimensions within each modality to be uncorrelated; $r_x$, $r_y$ are regularization parameters.
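The total correlation being maximized equals the sum of singular values of $T = \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2}$ computed on the network outputs, which gives a differentiable loss. A PyTorch sketch under that formulation (an illustration, not the paper's code):

```python
import torch

def dcca_loss(Fx, Gy, rx=1e-3, ry=1e-3):
    """Negative total canonical correlation of network outputs
    Fx, Gy of shape (N, d); rx, ry are the regularization parameters."""
    N = Fx.shape[0]
    Fx = Fx - Fx.mean(dim=0)
    Gy = Gy - Gy.mean(dim=0)
    Sxx = Fx.T @ Fx / N + rx * torch.eye(Fx.shape[1])
    Syy = Gy.T @ Gy / N + ry * torch.eye(Gy.shape[1])
    Sxy = Fx.T @ Gy / N
    # inverse square roots via eigendecompositions
    ex, Vx = torch.linalg.eigh(Sxx)
    ey, Vy = torch.linalg.eigh(Syy)
    Sxx_ih = Vx @ torch.diag(ex.clamp_min(1e-12).rsqrt()) @ Vx.T
    Syy_ih = Vy @ torch.diag(ey.clamp_min(1e-12).rsqrt()) @ Vy.T
    T = Sxx_ih @ Sxy @ Syy_ih
    return -torch.linalg.svdvals(T).sum()  # sum of canonical correlations
```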

Split Autoencoder (SplitAE)
Encode view $x$ once and decode both views from that shared representation:
minimize: $\frac{1}{N} \sum_{i=1}^N \left( \| x_i - p(f(x_i)) \|^2 + \| y_i - q(f(x_i)) \|^2 \right)$
where $f$ is the encoder and $p$, $q$ are the decoders for the two views.
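A minimal PyTorch sketch of SplitAE with single-hidden-layer encoder and decoders (dimensions and depths assumed):

```python
import torch.nn as nn

class SplitAE(nn.Module):
    """One encoder f for view x; decoders p, q reconstruct x and y from f(x)."""
    def __init__(self, dx, dy, dh):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dx, dh), nn.ReLU(), nn.Linear(dh, dh))
        self.p = nn.Sequential(nn.Linear(dh, dh), nn.ReLU(), nn.Linear(dh, dx))
        self.q = nn.Sequential(nn.Linear(dh, dh), nn.ReLU(), nn.Linear(dh, dy))

    def forward(self, x):
        h = self.f(x)
        return self.p(h), self.q(h)

def splitae_loss(model, x, y):
    x_hat, y_hat = model(x)
    return ((x - x_hat) ** 2).sum(1).mean() + ((y - y_hat) ** 2).sum(1).mean()
```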

Deep Canonically Correlated Autoencoder (DCCAE)
Wang et al., On Deep Multi-View Representation Learning, ICML, 2015
Combine the DCCA objective with autoencoder regularization:
minimize: $-\frac{1}{N} \operatorname{tr}(U^T f(X) g(Y)^T V) + \frac{\lambda}{N} \sum_{i=1}^N \left( \| x_i - p(f(x_i)) \|^2 + \| y_i - q(g(y_i)) \|^2 \right)$
subject to the DCCA constraints:
$U^T \left( \frac{1}{N} f(X) f(X)^T + r_x I \right) U = I$
$V^T \left( \frac{1}{N} g(Y) g(Y)^T + r_y I \right) V = I$
$u_i^T f(X) g(Y)^T v_j = 0$ for $i \ne j$
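Reusing the dcca_loss sketch above, the DCCAE objective is the correlation term plus weighted per-view reconstruction (again a sketch; $f$, $g$ are the encoders and $p$, $q$ the decoders):

```python
def dccae_loss(f, g, p, q, x, y, lam=0.1):
    """DCCA correlation term plus autoencoder regularization."""
    Fx, Gy = f(x), g(y)
    recon = ((x - p(Fx)) ** 2).sum(1).mean() + ((y - q(Gy)) ** 2).sum(1).mean()
    return dcca_loss(Fx, Gy) + lam * recon
```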

Correlated Autoencoder (CorrAE)
Wang et al., On Deep Multi-View Representation Learning, ICML, 2015
minimize: $-\frac{1}{N} \operatorname{tr}(U^T f(X) g(Y)^T V) + \frac{\lambda}{N} \sum_{i=1}^N \left( \| x_i - p(f(x_i)) \|^2 + \| y_i - q(g(y_i)) \|^2 \right)$
subject to: $u_i^T f(X) f(X)^T u_i = v_i^T g(Y) g(Y)^T v_i = N$ for $1 \le i \le L$
A relaxation of DCCAE: feature dimensions within each view are not constrained to be uncorrelated with each other.

Experiments: Speech Recognition
Wang et al., On Deep Multi-View Representation Learning, ICML, 2015
- Recorded speech & articulatory measurements from 47 American English speakers
- 39 acoustic & 16 articulatory features from each of 7 frames
- Roughly 50K frames/speaker, 1.43M frames in total
- Apply representation learning to the frames, then use the original & learned features in a standard HMM-based recognizer
- PER = phone error rate

Experiments: Multilingual Word Embeddings
Wang et al., On Deep Multi-View Representation Learning, ICML, 2015
- Learn representations of English words from pairs of English-German word embeddings
- 640-d monolingual word vectors trained via LSA; 36K English-German word pairs; evaluated on 180K English word embeddings
- Evaluation: add the projections of the two words in each bigram, compute cosine similarity between bigram pairs, order the pairs by similarity, and measure Spearman's correlation between the model's and humans' rankings
- AN: adjective-noun; VN: verb-object

Experiments: Diagnosis of Schizophrenia
Qi and Tejedor, Deep Multi-view Representation Learning for Multi-modal Features of Schizophrenia and Schizo-affective Disorder, ICASSP, 2016
- Features from MRI: source-based morphometric loadings (32), functional network connectivity (378)
- 86 labeled, 119,748 unlabeled samples
- Train an SVM on the learned features

My Work: Predicting Survival from Breast Cancer
SPECS breast tumor tissue microarray:
- 145 patients, 2 cores per patient
- 512 image features from VGG16
- 14,570 gene expression levels

Multimodal Deep Survival Network
The network maps input image features $x_i^1$ and input genetic features $x_i^2$, through network weights $\theta$, to a single risk score $h_\theta(x_i)$.
$L_{\text{Cox}}(\theta) = \sum_{i \in \{i : E_i = 1\}} \left( h_\theta(x_i) - \log \sum_{j \in \{j : T_j \ge T_i\}} e^{h_\theta(x_j)} \right)$
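A minimal PyTorch sketch of a two-branch architecture of this kind (layer sizes are assumed, and the concatenation-based fusion is illustrative, not necessarily the one used in this work):

```python
import torch
import torch.nn as nn

class MultimodalSurvivalNet(nn.Module):
    """One branch per modality, concatenated into a single risk score."""
    def __init__(self, d_img=512, d_gene=14570, dh=256):
        super().__init__()
        self.img_branch = nn.Sequential(nn.Linear(d_img, dh), nn.ReLU())
        self.gene_branch = nn.Sequential(nn.Linear(d_gene, dh), nn.ReLU())
        self.head = nn.Linear(2 * dh, 1)  # scalar risk score h_theta(x)

    def forward(self, x_img, x_gene):
        h = torch.cat([self.img_branch(x_img), self.gene_branch(x_gene)], dim=1)
        return self.head(h).squeeze(1)
```

Training then minimizes neg_cox_log_likelihood (sketched earlier) over the network's risk scores.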

Multiple Outputs for Regularization
Alongside the risk score, the network predicts subtype, grade, receptor status, etc. from the same shared representation; the auxiliary tasks regularize the survival head.
$L(\theta, X, E, T, Y_{\text{subtype}}, Y_{\text{grade}}) = \alpha\, L_{\text{Cox}}(\theta, X, E, T) + L_{\text{cross-entropy}}(\theta, X, Y_{\text{subtype}}) + L_{\text{cross-entropy}}(\theta, X, Y_{\text{grade}})$
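A sketch of the combined objective, assuming the network also exposes logits for the auxiliary heads (the names here are illustrative):

```python
import torch.nn.functional as F

def multitask_loss(risk, time, event, subtype_logits, grade_logits,
                   y_subtype, y_grade, alpha=1.0):
    """alpha-weighted Cox loss plus auxiliary cross-entropy terms."""
    return (alpha * neg_cox_log_likelihood(risk, time, event)
            + F.cross_entropy(subtype_logits, y_subtype)
            + F.cross_entropy(grade_logits, y_grade))
```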

Implementation Details/Tricks
- Batch normalization
- Dropout
- L2 regularization
(see the sketch below for how these appear in a hidden layer)
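A PyTorch sketch of how the three tricks typically appear in one hidden block (sizes and hyperparameters assumed):

```python
import torch.nn as nn
import torch.optim as optim

block = nn.Sequential(
    nn.Linear(256, 256),
    nn.BatchNorm1d(256),   # batch normalization
    nn.ReLU(),
    nn.Dropout(p=0.5),     # dropout
)
# L2 regularization via weight decay in the optimizer
optimizer = optim.Adam(block.parameters(), lr=1e-3, weight_decay=1e-4)
```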

Results
5-fold cross-validation, repeated 4 times
[Figure: results plots not captured in the transcription]

Questions?