Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction

Size: px

Start display at page:

Download "Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction"

Samuel Robbins
5 years ago
Views:

1 Faculty of Science Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction Niels Richard Hansen & Martin Vincent Department of Mathematical Sciences March 14, 2014 Slide 1/23

2 Overview Present qpcr data set on mirna expression from primary cancers and liver biopsies. A brief detour around the multinomial group lasso predictor. Present a computational method for dealing with heterogeneous tissue composition in biopsy samples. Present a general modeling framework for class prediction based on heterogeneous tissue and some preliminary methods and results. Slide 2/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

3 Prediction of primary site Class description Resections Liver core (primaries) biopsies Breast cancer 17 7 (5/2) Colorectal cancer (8/4) Gastric/Cardia cancer (8/4) Pancreatic cancer (5/5) Squamous cell cancers (of different origins) (6/6) Hepatocellular carcinoma 17 3 Cholangiocarcinoma 20 4 Subtotal Cirrhotic liver 17 8 Normal liver 20 7 Total Objective: Predict site of primary tumor from liver biopsy. Slide 3/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

4 Misclassification for biopsies from metastases Principal Number ANOVA+PAM Multinomial training data of core biopsies group lasso Number of mirnas (0) 81% a 77% a 77% 74% Primaries 2 (10) 74% 71% 59% 54% 4 (20) 64% 64% 48% 45% 0 (0) 60% 57% 45% 43% Artificial 2 (10) - b - b 39% 41% 4 (20) - b - b 34% 39% a Constructed as in Ferracin et al. J. Pathol., 255, 4353, b Sample weights not directly supported by ANOVA+PAM. Slide 4/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

5 Multinomial regression Class variable Y {1,..., K}, X R p ( ) P(Y = y X ) exp X i β iy. i Ordinary lasso objective: l(β) }{{} neg. log-like + λ iy β iy. Sparse group lasso objective: l(β) + λ (1 α) i β i 2 + α iy β iy. Slide 5/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

6 Multinomial regression - test example 0.5 c Err ˆ 10k 30k 50k ˆ =0 =0.25 =0.5 =0.75 lasso Classification of Amazon reviewers. Group lasso clearly outperforms lasso. Sparse group lasso implementation in R package msgl. Slide 6/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

7 The heterogeneity model The standard model of molecular signatures from heterogenous tissue: α primary tumor signature + (1 α) normal liver signature Our model, conditionally on class Y = y, allows for a non-linear transformation due to qpcr: Z y = f ( αf 1 (X y ) + (1 α)f 1 (X 0 ) ) Model assumption: α X X 0 Y Slide 7/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

8 Artificial training data Based on Z y = f ( αf 1 (X y ) + (1 α)f 1 (X 0 ) ) and sampling of X y with replacement from primary signatures for class y sampling of X 0 with replacement from liver signatures and sampling of α from the Beta(2,2)-distribution we artificially sampled Z y used to train the multinomial predictor. In the paper we considered two choices of f : the identity or f i (x i ) = 1.7 log x i corresponding to a PCR amplification efficiency of 80%. Slide 8/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

9 Comparison of real and artificial data Second principal component First principal component Samples Liver core biopsies Resections Slide 9/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

10 Misclassification for biopsies from metastases Principal Number ANOVA+PAM Multinomial training data of core biopsies group lasso Number of mirnas (0) 81% a 77% a 77% 74% Primaries 2 (10) 74% 71% 59% 54% 4 (20) 64% 64% 48% 45% 0 (0) 60% 57% 45% 43% Artificial 2 (10) - b - b 39% 41% 4 (20) - b - b 34% 39% a Constructed as in Ferracin et al. J. Pathol., 255, 4353, b Sample weights not directly supported by ANOVA+PAM. Slide 10/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

11 A general modeling approach Consider a triple of variables (X, Z, Y ) with X, Z R p and Y {1,..., K} the class label. Observations of (X, Y ) are available for construction of a predictor, but observations of Z are available for prediction. With π 0 the joint distribution of (Z, Y ), then if Y Z X π 0 (z, y) = p(z x)π(x, y)dx π 0 (y z) = π(y x)q(x z)dx. Slide 11/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

12 Our previous solution π 0 (z, y) = p(z x)π(x, y)dx Effectively, we computed estimates ˆπ(x, y) (the empirical distribution), and ˆp(z x) to make a forward simulation from ˆπ 0 (z, y). The forward simulated data were used to fit a model of ˆπ 0 (y z). Slide 12/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

13 An alternative solution π 0 (y z) = π(y x)q(x z)dx Alternatively, we can compute the estimate ˆπ(y x) and use a Monte Carlo method to compute ˆπ 0 (y z) = 1 B B ˆπ(y x i ) i=1 with x i from a Markov Chain with invariant distribution q(x z). This is a backward simulation solution. Slide 13/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

14 Latent Gaussian model If Z = [X X 1 ]α + ε with X = [X X 1 ] being a p k matrix, and α, ε, X are independent Gaussian, then X Z, α N (, ) and α Z, X N (, ). This is what is needed to implement the Gibbs sampler. Slide 14/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

15 Parameters used α ([ ] [, ]) [ X ] 0 ˆξ liver, ˆξΣ classˆξ + ˆσ ˆσ p 0 0 ˆσ ˆσ p ε (0, 0.2I p ) Slide 15/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

16 Projections of primary samples PC1 PC2 class Breast CCA Cirrhosis CRC EG HCC Liver Pancreas Squamous Slide 16/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

17 Projections of primary and biopsy samples PC1 PC2 class Breast CCA Cirrhosis CRC EG HCC Liver Pancreas Squamous Slide 17/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

18 Projections of primary and biopsy posterior means PC1 PC2 class Breast CCA Cirrhosis CRC EG HCC Liver Pancreas Squamous Slide 18/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

19 Top 10 mirnas for class prediction Primaries mir.122 mir.192 mir.885.5p mir.196b mir.17 mir.148a mir.214 mir.221 let.7c mir.92a Squamous Pancreas Liver classes HCC EG CRC Cirrhosis CCA Breast value Biopsies mir.122 mir.192 mir.885.5p mir.196b mir.17 mir.148a mir.214 mir.221 let.7c mir.92a Squamous Pancreas Liver classes HCC EG CRC Cirrhosis CCA Breast value Slide 19/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

20 Top 10 mirnas for class prediction Primaries mir.122 mir.192 mir.885.5p mir.196b mir.17 mir.148a mir.214 mir.221 let.7c mir.92a Squamous Pancreas Liver classes HCC EG CRC Cirrhosis CCA Breast value Biopsies, posterior means mir.122 mir.192 mir.885.5p mir.196b mir.17 mir.148a mir.214 mir.221 let.7c mir.92a Squamous Pancreas Liver classes HCC EG CRC Cirrhosis CCA Breast value Slide 20/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

21 Preliminary results Principal Number Backward Forward training data of core biopsies multinomial multinomial Method Number of mirnas ˆπ 0 (y z) ˆπ(y ˆx) (0) 48% 50% 77% 74% Primaries 2 (10) % 54% 4 (20) % 45% 0 (0) % 43% Artificial 2 (10) % 41% 4 (20) % 39% Slide 21/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

22 Conclusions Tissue heterogeneity can be a big problem for prediction based on molecular signatures. A forward or backward simulation can decrease but not solve the problem. The forward solution was first understood in the Machine Learning lingo as domain adaptation. Backward simulation is closely related to deconvolution of the molecular signature. M. Vincent, N. R. Hansen. Sparse group lasso and high dimensional multinomial classification, Comp. Stat. Data Anal M. Vincent, K. Perell, F. C. Nielsen, G. Daugaard and N. R. Hansen Modeling tissue contamination to improve molecular identification of the primary tumor site of metastases, Bioinformatics, Slide 22/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

23 Proportion posterior means CM classes.cb Breast CCA Cirrhosis CRC EG HCC Liver Pancreas Squamous LM Slide 23/23 Niels Richard Hansen Modeling Tissue Heterogeneity of Test Samples to Improve Class Prediction March 14, 2014

Pattern Recognition and Machine Learning

Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability