Decision Theoretic Classification of Copy-Number-Variation in Cancer Genomes


1 Decision Theoretic Classification of Copy-Number-Variation in Cancer Genomes Christopher Holmes (joint work with Chris Yau) Department of Statistics, & Wellcome Trust Centre for Human Genetics, University of Oxford, & MRC Harwell Bordeaux, October 2011

2 Overview
- Motivating Problem: Copy-number-variation in the human and cancer genomes
- Classification models for sequences
  - state-space hidden Markov models
  - how to make predictions?
  - optimal predictions under computational constraints
- Application: Copy-Number-Aberrations in cancer genomes
- Conclusions

3 Copy-Number-Variation CNVs are quite common in normal germline (sperm / egg) genomes, and very common in cancers, where they are known as CNAs. Around 7% to 10% of the human genome is under copy number variation. Somatic copy number variations induced during mitosis (cell division), termed Copy Number Aberrations (CNAs), are key drivers of cancer.

4 Typical CNV data file Probe-based technologies allow one to (noisily) measure the amount of genomic content at pre-defined positions across the genome. We are interested in statistical approaches to identify these regions of CNVs across 100,000s of loci.

5 Classification of linear sequence
This can be thought of as a generic task in classification on ordered data.
That is, suppose you have general predictors ordered in some sequence $\{x^{(1)}, x^{(2)}, \ldots, x^{(T)}\}$, say by time, genomic location, a covariate, etc., and you wish to build a classifier for each observation, with $S_i \in \{1, \ldots, M\}$ classes,
$$\Pr(S_i = j \mid x^{(i)})$$
But where there is persistence (dependence) with index in the classifier,
$$\Pr(S_i = j, S_{i-1} = k \mid x^{(i)}, x^{(i-1)}) \neq \Pr(S_i = j \mid x^{(i)}) \, \Pr(S_{i-1} = k \mid x^{(i-1)})$$

6 Classification of linear sequence
Note, this is not a conventional classification task, as we don't have access to labelled training data, $\{S_i, x^{(i)}\}_{i=1}^{T}$.
It's not a conventional clustering task, as the number, $M$, and labels of classes are well defined: {deletion, normal, duplication, etc.}
It is part classification (the goal) and part clustering, and we know that the class labels persist on the index,
$$\Pr(S_i = j \mid S_{i-1} = j) > \Pr(S_i = j)$$

7 Classification
Two approaches are typically adopted in the literature:
- Partition or change-point models, which divide the region up into $K$ (unknown) contiguous blocks and then post-classify (assign) each region to a class,
$$f(x) = \prod_{j=1}^{K} f(x_j \mid \theta_j)$$
- Hidden state-space models, which jointly find the regions AND classify each region directly according to the (hidden) class state, $S_j$,
$$f(x) = \prod_{j=1}^{K} f(x_j \mid \theta_{S_j})$$

8 State-space model
To capture dependence in copy-number state across loci we can use a Hidden Markov Model (HMM) with hidden state index $S_i \in \{1, \ldots, M\}$,
$$f(x \mid S) = \prod_{i=1}^{T} f(x_i \mid \theta_{S_i})$$
$$\Pr(S_{i+1} = j \mid S_i = k) = (\Pi)_{jk}$$
where $(\Pi)$ is the $M \times M$ transition matrix and $\theta_{S_j}$ are the parameters in the likelihood associated with state $S_j$.
So the transition matrix $(\Pi)$ allows us to capture (extensive) prior beliefs regarding the length scales of CNV events.
Note, $(\Pi)_{jj}$ governs the expected holding time (span) in state $S_i = j$.
Change-points in classification occur when $S_i \neq S_{i-1}$.
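The link between $(\Pi)_{jj}$ and the expected span can be made concrete. Under a homogeneous first-order chain the run length in state $j$ is geometric, so its mean is $1 / (1 - (\Pi)_{jj})$. The following is a minimal Python sketch of that relationship; the function names and the self-transition probability of 0.95 are illustrative assumptions, not values from the talk.

```python
import random


def expected_holding_time(p_stay):
    """Mean run length (in loci) for a state with self-transition probability p_stay."""
    return 1.0 / (1.0 - p_stay)


def simulate_run_lengths(p_stay, n_runs, seed=0):
    """Simulate run lengths: after entering a state, stay with probability p_stay."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_runs):
        length = 1
        while rng.random() < p_stay:
            length += 1
        lengths.append(length)
    return lengths


# Hypothetical value: a state that persists with probability 0.95 per locus.
runs = simulate_run_lengths(0.95, 20000)
empirical_mean = sum(runs) / len(runs)
theoretical_mean = expected_holding_time(0.95)  # 1 / 0.05 = 20 loci
```

Pushing the diagonal entry towards 1 encodes a prior belief in longer events, which is the handle on length scale that the slide describes.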

9 Inference
For the purposes of this talk, we assume that you have adopted your favourite Bayesian model, $\Pr(x, \theta, S)$, and computational algorithm to obtain an approximation to the marginal posterior density
$$\Pr(S \mid x) = \int \Pr(S, \theta \mid x) \, d\theta$$
The question is: now what are we going to do with it?
Note, $\Pr(S \mid x)$ is typically of dimension $> 30^{20{,}000}$ states.

10 Predictions Most scientists wish to explore predictions under the model to highlight probable CNVs, and maybe rank them or look for association with disease traits. But how to choose a prediction? How can we grade the quality (or utility) of a classification prediction on a sequence?

11 The nature of errors
In conventional classification models, or polychotomous regression, a good way to assess the performance is to count errors via some statistic,
$$\epsilon = \sum_{i=1}^{T} l(\hat{S}_i, S_i)$$
where $\hat{S}_i$ is the prediction and $S_i$ is the true state that nature obtains.
But when classifying on a sequence of states this is less straightforward.
For illustration, consider the following binary classification where there is a notion of a null state, $S_i = 0$, and an alternative, $S_i = 1$.

12 Nature of errors Both (a) and (b); and (c) and (d), have the same number of pointwise errors

13 Judging predictions
It's clear that judging the performance of the model is not as simple as for regular classification models.
Moreover, it is clear that the performance is explicitly linked to what you're going to do with the prediction (action-orientated).
So two problems exist:
(a) We need a way to judge performance: we need a loss function.
(b) Given a loss function, we need a way to find $\hat{S}$, the state sequence (prediction(s)) of minimum expected loss with respect to $\Pr(S \mid x)$, among the region of $30^{20{,}000}$ state sequences for the problems we shall consider.

14 Standard summaries for Hidden Markov models
There are two (ubiquitous) approaches to reporting states from HMMs:
1. The most probable or MAP sequence, using the Viterbi algorithm,
$$\hat{S} = \arg\max_{Z} \Pr(S = Z \mid x)$$
2. The sequence of states that maximize the posterior marginals, using the forward-backward algorithm, $\hat{S} = \{\hat{S}_1, \ldots, \hat{S}_T\}$ with
$$\hat{S}_i = \arg\max_{Z_i} \Pr(S_i = Z_i \mid x), \qquad \Pr(S_i \mid x) = \sum_{S_{-i}} \Pr(S \mid x)$$
where $S_{-i}$ denotes all states other than the $i$th.
Under both schemes, $\hat{S}$ can be found in order $O(T M^2)$ computation. But do these relate to sensible choices?
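The two standard summaries can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the parameters are known and strictly positive, that `lik[i][s]` holds the likelihood $f(x_i \mid \theta_s)$, and that `trans[k][j]` stores $\Pr(S_{i+1} = j \mid S_i = k)$ (an indexing convention chosen here for convenience). The function names and the toy numbers are hypothetical.

```python
import math


def viterbi(init, trans, lik):
    """MAP state sequence in O(T M^2), computed in log space."""
    M = len(init)
    score = [math.log(init[s]) + math.log(lik[0][s]) for s in range(M)]
    back = []
    for row in lik[1:]:
        ptr, new_score = [], []
        for j in range(M):
            k_best = max(range(M), key=lambda k: score[k] + math.log(trans[k][j]))
            ptr.append(k_best)
            new_score.append(score[k_best] + math.log(trans[k_best][j]) + math.log(row[j]))
        score = new_score
        back.append(ptr)
    path = [max(range(M), key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def posterior_marginals(init, trans, lik):
    """Pr(S_i = s | x) for every i, via scaled forward-backward in O(T M^2)."""
    T, M = len(lik), len(init)
    alpha = [[init[s] * lik[0][s] for s in range(M)]]
    for i in range(1, T):
        cur = [lik[i][j] * sum(alpha[-1][k] * trans[k][j] for k in range(M)) for j in range(M)]
        z = sum(cur)
        alpha.append([v / z for v in cur])
    beta = [[1.0] * M for _ in range(T)]
    for i in range(T - 2, -1, -1):
        b = [sum(trans[k][j] * lik[i + 1][j] * beta[i + 1][j] for j in range(M)) for k in range(M)]
        z = sum(b)
        beta[i] = [v / z for v in b]
    post = []
    for a, b in zip(alpha, beta):
        g = [x * y for x, y in zip(a, b)]
        z = sum(g)
        post.append([v / z for v in g])
    return post


# Toy two-state example (all numbers are illustrative).
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
lik = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
map_path = viterbi(init, trans, lik)
marginals = posterior_marginals(init, trans, lik)
marginal_path = [max(range(2), key=lambda s: row[s]) for row in marginals]
```

On this clean toy input the two summaries agree; they can disagree on noisier data, because they answer different questions.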

15 Analysis on data

16 Standard summaries for Hidden Markov models
1. The most probable or MAP sequence looks good but it's inflexible. If I wish to consider more calls, or fewer calls, I have no handle on this.
2. For the sequence of states that maximize the posterior marginals you do have a simple handle on making calls,
$$\hat{S}_i = \begin{cases} 0, & \text{if } \Pr(S_i = 1 \mid x) < \alpha \\ 1, & \text{otherwise} \end{cases}$$
where $\alpha$ controls the FDR (false positives / false negatives).
However, does this give sensible predictions as you vary $\alpha$?
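The thresholding rule is easy to state in code, which also makes its weakness visible: calls are made locus by locus, so a single dip in the marginal probability splits one event into two. The sketch below is a hedged illustration in Python; the function names and the marginal probabilities are made up for the example.

```python
def call_events(prob_event, alpha):
    """Pointwise calls: 0 where Pr(S_i = 1 | x) < alpha, otherwise 1."""
    return [0 if p < alpha else 1 for p in prob_event]


def count_events(calls):
    """Number of contiguous runs of 1s, i.e. separately called events."""
    return sum(
        1 for i, c in enumerate(calls) if c == 1 and (i == 0 or calls[i - 1] == 0)
    )


# Hypothetical posterior marginals Pr(S_i = 1 | x) along five loci.
prob_event = [0.1, 0.6, 0.4, 0.7, 0.2]

strict = call_events(prob_event, 0.5)    # the dip at the third locus breaks the call
lenient = call_events(prob_event, 0.3)   # a lower threshold merges it into one event
n_strict = count_events(strict)
n_lenient = count_events(lenient)
```

Varying $\alpha$ changes how many loci are called, but nothing in the rule asks for calls that form contiguous events.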

17 Analysis on data

18 Analysis on data

19 Analysis on data

20 Analysis on data

21 Analysis on data

22 Analysis on data

23 Calling using posterior marginals
So, on the positive side for the posterior marginals, we can find $\hat{S}$ exactly under $\Pr(S \mid x)$, using the forward-backward algorithm, and the predictions are flexible to some FDR parameter, monotone in $\alpha$.
On the negative side, the predictions don't look so good, as we're really after the CNV events rather than point predictions.
To see why this is so we can take a closer look at the implicit cost functions that the MAP, via Viterbi, and the posterior marginals, via forward-backward, are solving.

24 Standard summaries for Hidden Markov models
1. The most probable or MAP sequence, $\hat{S} = \arg\max_{Z} \Pr(S = Z \mid x)$, relates to a 0-1 loss over the whole sequence,
$$l(Z, S) = \begin{cases} 0, & Z = S \\ 1, & Z \neq S \end{cases}$$
where $Z$ is the prediction and nature obtains $S$.
2. The sequence of states that maximize the marginals, $\hat{S}_i = \arg\max_{Z_i} \Pr(S_i = Z_i \mid x)$, relates to a pointwise loss invariant to permutations,
$$l(Z, S) = \sum_{i} l(Z_i, S_i), \qquad l(Z_i, S_i) = \begin{cases} 0, & Z_i = S_i \\ 1, & Z_i \neq S_i \end{cases}$$
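The correspondence between each summary and its loss follows from taking expectations under $\Pr(S \mid x)$. A short worked derivation, writing $\mathbb{1}(\cdot)$ for the indicator function (notation introduced here for convenience):

```latex
\begin{align*}
\mathbb{E}_{\Pr(S \mid x)}\big[\mathbb{1}(Z \neq S)\big]
  &= 1 - \Pr(S = Z \mid x), \\
\mathbb{E}_{\Pr(S \mid x)}\Big[\sum_{i=1}^{T} \mathbb{1}(Z_i \neq S_i)\Big]
  &= \sum_{i=1}^{T} \big(1 - \Pr(S_i = Z_i \mid x)\big).
\end{align*}
```

The first expression is minimised by the single most probable sequence. The second separates over $i$, so it is minimised by maximising each marginal on its own, which is why the resulting calls ignore the ordering of the loci.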

25 These loss functions are not really suited to the way people wish to look at CNV predictions. However, to a large extent, the MAP and the set of marginals seem to be the only methods considered in around 30 years of HMMs.

26 Loss Functions under computational constraints
One can ponder on a proper loss function appropriate for CNV classification, $l(Z, S)$... but then...
$$\hat{S} = \arg\min_{Z} E_{\Pr(S \mid x)}[l(Z, S)]$$
is not computable (for the ones I could think of).

27 Optimal predictions under computational constraints Solution: think of a loss function which you can compute $\hat{S}$ with and which solves an approximating optimal prediction problem. That is, solve a different but related decision problem precisely.

28 Markov Loss Functions
We have developed the use of $k$th order Markov loss functions specifically for calling CNVs,
$$l(Z, S) = \sum_{i=1}^{T-k} l(Z_i, \ldots, Z_{i+k}, S_i, \ldots, S_{i+k})$$
and in particular we shall consider the simplest case of first-order loss,
$$l(Z, S) = \sum_{i=1}^{T-1} l(Z_i, Z_{i+1}, S_i, S_{i+1})$$
Assuming the loss is homogeneous (does not alter over the sequence) this leads to an $M^{k+1} \times M^{k+1}$ loss matrix with $M^{2k+2}$ entries which, for $M = 2$, $k = 1$, we can write as follows,

29 Table: Loss matrix $l(Z_i, Z_{i+1}, S_i, S_{i+1})$ for $M = 2$ binary state transitions.

Call                Nature $(S_i, S_{i+1})$
$(Z_i, Z_{i+1})$    (0, 0)          (0, 1)          (1, 0)          (1, 1)
(0, 0)              Perfect         FNH             GoodCorrection  FN
(0, 1)              FPT             Perfect         XX              GoodCorrection
(1, 0)              GoodCorrection  XX              Perfect         FNT
(1, 1)              FP              GoodCorrection  FPH             Perfect

FNH = FalseNegativeHold, FNT = FalseNegativeTransition, FPH = FalsePositiveHold, FPT = FalsePositiveTransition, FN = FalseNegative, FP = FalsePositive, XX = Oh no

30 Adapting your loss
So by adjusting {Perfect, GoodCorrection, FNH, FPH, FNT, FPT, XX, FN, FP} we can adapt calls to suit the application.
Note, if {Perfect, GoodCorrection} = 0 and {FNH, ..., FP} = $\gamma$, then we get back to the (zeroth order) marginal loss.
So we can see that the use of posterior marginal calls is equivalent to statements of symmetry on the types of losses incurred.
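The reduction to the marginal loss can be checked mechanically. The Python sketch below builds the $M = 2$ first-order loss as a lookup table; the placement of each named cell is an assumption inferred from the error names (Perfect when both calls are right, GoodCorrection when only the first call is wrong), and `build_loss` is a hypothetical helper rather than anything from the talk.

```python
def build_loss(perfect, good, fnh, fph, fnt, fpt, xx, fn, fp):
    """Return loss[(call_pair, nature_pair)] for binary states.

    Cell placement is an assumption inferred from the names of the error types.
    """
    natures = [(0, 0), (0, 1), (1, 0), (1, 1)]
    rows = {
        (0, 0): [perfect, fnh, good, fn],
        (0, 1): [fpt, perfect, xx, good],
        (1, 0): [good, xx, perfect, fnt],
        (1, 1): [fp, good, fph, perfect],
    }
    return {
        (call, nature): rows[call][col]
        for call in rows
        for col, nature in enumerate(natures)
    }


gamma = 2.0
# Perfect and GoodCorrection cost 0; every other cell costs gamma.
symmetric = build_loss(0.0, 0.0, *([gamma] * 7))

# With symmetric penalties the loss only depends on whether the second call is wrong.
is_pointwise = all(
    value == gamma * (call[1] != nature[1])
    for (call, nature), value in symmetric.items()
)
```

Under this placement every penalised cell has $Z_{i+1} \neq S_{i+1}$, so setting them all to $\gamma$ leaves $\gamma \, \mathbb{1}(Z_{i+1} \neq S_{i+1})$, a pointwise loss. Unequal penalties are what let the loss distinguish holds, transitions and outright misses.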

31 Calculating $\hat{S}$, the Sequence of Minimum Loss
The expected first-order loss is given by,
$$E_{\Pr(S \mid x)}[l(Z, S)] = \sum_{S} \left\{ \sum_{i=1}^{T-1} l(Z_i, Z_{i+1}, S_i, S_{i+1}) \right\} \Pr(S \mid x)$$
where, by exchanging the order of summation,
$$E_{\Pr(S \mid x)}[l(Z, S)] = \sum_{i=1}^{T-1} \left\{ \sum_{S} l(Z_i, Z_{i+1}, S_i, S_{i+1}) \Pr(S \mid x) \right\} = \sum_{i=1}^{T-1} \sum_{\{S_i, S_{i+1}\}} l(Z_i, Z_{i+1}, S_i, S_{i+1}) \Pr(S_i, S_{i+1} \mid x).$$

32 Dynamic Programming
The expected loss is additive, so $\hat{S}$ can be found exactly using dynamic programming recursions (similar to the Viterbi algorithm).
For first-order loss the computational cost is $O(M^4 T)$, where $M$ is the number of hidden states and $T$ is the length of the sequence; more generally it is $O(M^{2k+2} T)$ for a $k$th order Markov loss function.
Note, the approach is applicable to any discrete state sequence model $\Pr(S \mid x)$, e.g. change-point models.
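A sketch of the recursion for the first-order case is given below, in plain Python. It assumes known HMM parameters, obtains the pairwise posteriors $\Pr(S_i, S_{i+1} \mid x)$ by forward-backward, and then runs a Viterbi-style minimisation over calls. The function names, the toy data, the value $\gamma = 3$ and the placement of the loss cells are all illustrative assumptions; a brute-force search is included only to check the result on a short sequence.

```python
import itertools


def pairwise_marginals(init, trans, lik):
    """xi[i][a][b] = Pr(S_i = a, S_{i+1} = b | x), with trans[k][j] = Pr(j | k)."""
    T, M = len(lik), len(init)
    alpha = [[init[s] * lik[0][s] for s in range(M)]]
    for i in range(1, T):
        cur = [lik[i][j] * sum(alpha[-1][k] * trans[k][j] for k in range(M)) for j in range(M)]
        z = sum(cur)
        alpha.append([v / z for v in cur])
    beta = [[1.0] * M for _ in range(T)]
    for i in range(T - 2, -1, -1):
        b = [sum(trans[k][j] * lik[i + 1][j] * beta[i + 1][j] for j in range(M)) for k in range(M)]
        z = sum(b)
        beta[i] = [v / z for v in b]
    xi = []
    for i in range(T - 1):
        m = [[alpha[i][a] * trans[a][b] * lik[i + 1][b] * beta[i + 1][b] for b in range(M)]
             for a in range(M)]
        z = sum(map(sum, m))
        xi.append([[v / z for v in row] for row in m])
    return xi


def expected_loss(calls, xi, loss):
    """Expected first-order loss of a call sequence; loss[z0][z1][s0][s1]."""
    M = len(xi[0])
    return sum(loss[calls[i]][calls[i + 1]][a][b] * xi[i][a][b]
               for i in range(len(xi)) for a in range(M) for b in range(M))


def min_expected_loss_path(xi, loss):
    """Call sequence of minimum expected first-order loss, in O(M^4 T)."""
    M = len(xi[0])
    # cost[i][z0][z1]: expected loss of calling (z0, z1) at positions (i, i + 1)
    cost = [[[sum(loss[z0][z1][a][b] * x[a][b] for a in range(M) for b in range(M))
              for z1 in range(M)] for z0 in range(M)] for x in xi]
    best, back = [0.0] * M, []
    for c in cost:
        ptr = [min(range(M), key=lambda z0: best[z0] + c[z0][z1]) for z1 in range(M)]
        best = [best[ptr[z1]] + c[ptr[z1]][z1] for z1 in range(M)]
        back.append(ptr)
    last = min(range(M), key=lambda z: best[z])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1], best[last]


# Binary loss with assumed cell placement; columns are nature (0,0), (0,1), (1,0), (1,1).
g, XX = 3.0, 500.0
flat = {(0, 0): [0, 1, 0, 1], (0, 1): [g, 0, XX, 0],
        (1, 0): [0, XX, 0, g], (1, 1): [g, 0, 1, 0]}
loss = [[[flat[(z0, z1)][0:2], flat[(z0, z1)][2:4]] for z1 in range(2)] for z0 in range(2)]

# Toy data (illustrative only).
init, trans = [0.9, 0.1], [[0.95, 0.05], [0.1, 0.9]]
lik = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8], [0.6, 0.4], [0.9, 0.1]]
xi = pairwise_marginals(init, trans, lik)
path, total = min_expected_loss_path(xi, loss)
brute = min(expected_loss(c, xi, loss) for c in itertools.product(range(2), repeat=len(lik)))
```

Precomputing the expected cost of each call pair takes $M^4$ operations per locus, which is where the stated first-order complexity comes from; the minimisation itself then costs only $M^2$ per locus.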

33 Example
For illustration we consider the previous data example with
{Perfect, GoodCorrection} = 0, {FN, FNH, FPH} = 1, {FP, FNT, FPT} = $\gamma$, XX = 500.
With $\gamma > 1$, this penalises FalsePositives and also jumps (calls) when there should not have been any.
That is, once you have made a prediction of an event, $S_i = 1$, you are committed to it.

34 Analysis on data

35 Analysis on data

36 Analysis on data

37 Analysis on data

38 Analysis on data

39 Analysis on data

40 Analysis on data And barring a XX, this case is identical to the marginal loss

41 Behaviour
So by altering the penalties on certain error types we can align the predictions to the context of the analysis.
It's important to note that this loss specification is separated from the actual modelling task:
1. Build the best possible model of nature, $\pi(x) = \int \Pr(x, \theta) \, d\theta$.
2. Define a loss specific to the task at hand and compute $\hat{S}$.
These two tasks are separate.

42 Motivating Ongoing Studies
The methodology we present has largely been shaped by two major on-going collaborations:
- Clinical Diagnostics for Leukemia. Anna Schuh, Jenny Taylor. Longitudinal analysis (diagnosis/relapse) of Chronic Lymphocytic Leukemia (CLL) blood samples. 400 samples. 1.2m GBP Health Innovation Challenge Fund award (NHS-Wellcome Trust).
- Ludwig Colon Cancer Initiative. Oliver Seiber (Melbourne LCI). Genome-wide -omics study of 1,000+ paired normal-tumour colon cancer samples.

43 Copy-Number-Aberrations (CNAs) in Cancer
It is well established that CNAs are key drivers of oncogenesis through
- deletion of tumour suppression genes
- duplication of oncogenes
- loss of heterozygosity (LOH)
However, real data is much more complicated than the previous illustrative examples:
- non-standard noise distributions (NP or SP likelihoods)
- tumour heterogeneity (mixture models)
- stromal contamination (de-convolution)
- longitudinal design, population samples

44 Copy-Number-Variation One major challenge in the analysis of CNAs is that the sample contains mixtures of population sub-types

45 Tumor Heterogeneity So the sample represents a population of cells from multiple subtypes

46 SNP genotyping arrays
Simply measuring total DNA content, such as in array-CGH, will not help, as neither copy-neutral LOH nor heterogeneity levels are identifiable from the data.
Ideally you would like to do single cell sequencing, but we're still a few years off this.
However, technologies originally designed to genotype single-nucleotide polymorphisms (SNPs) turn out to be useful.
SNP: a single base in the genome where two (or more) of {A, G, C, T} are found in the population.

47 Genotyping So the genotypes form three clusters in the probe-output space. But what happens to the signal when you have a CNV at the SNP?

48 CNV

49 Polar Coordinates It's much simpler to work in polar coordinates. So at SNPs the data looks like, And for CNAs...

50 CNAs In theory it should look like this, but in practice it looks like this...

51 Generalized Genotyping: actual (messy) data Noise in the real data means we need to see the signal span multiple adjacent loci in order to be confident of a call

52 Genome-wide data: colon cancer
SNP data is composed of two types of signal:
- Log R Ratio ($r$), whose magnitude is related to the total number of alleles ($r \approx 0$ corresponds to copy number 2)
- B Allele Frequency ($b$), which measures the relative numbers of the B alleles in the genotype at each location, e.g. a genotype AB has $b = 1/2$.

53 Tumor Heterogeneity: spike-in experiment Columns show data under 0%, 21% and 50% contamination respectively. Top row shows log-r and second row the corresponding B-allele freq Note, the SNP array measures the aggregated contribution of all cell populations in the sample

54 Tumor Heterogeneity: deletion-normal

55 The Model
We can model this process using a mixture of hidden Markov models.
So we now have two (hidden) states: a normal state $S_i^{(N)} \in \{AA, AB, BB\}$ and a tumour aberration state $S_i^{(T)} \in \{A, B, AAA, AAB, \ldots\}$.
And a mixing proportion, $w_i \in (0, 1)$, which can vary across the genome depending on the % of tumour cells in the sample that have an aberration genotype.

56 The model is written as,
$$\Pr(x_i) = w_i f(x_i \mid S_i^{(N)}) + (1 - w_i) f(x_i \mid S_i^{(T)})$$
$$\Pr(S_i^{(N)}) = \mathrm{Dir}(\alpha_1, \alpha_2)$$
$$\Pr(S_i^{(T)} \mid S_i^{(N)}, S_{i-1}^{(T)}) = \sum_{j} v_j \, \mathrm{Dir}[\alpha_j(S_i^{(N)}, S_{i-1}^{(T)})]$$
$$w_i \sim \mathrm{Be}(a, b)$$
where $x_i$ = {log R, B.freq}.
Note, the allowable states of $S_i^{(T)}$ are restricted by the normal genotype $S_i^{(N)}$. E.g. if the normal is homozygous, $S_i^{(N)} = \{AA\}$, then the tumour copy-number-genotype must be in $S_i^{(T)} \in \{A, AAA, AAAA, \ldots\}$.

57 Likelihood We need a likelihood (sampling distribution) for $f(x_i \mid S_i)$

58 Likelihood
Normal densities are non-robust and lead to appalling predictions.
We have investigated mixture models for the within-state sampling distributions (NAR 2007; Genome Biol. 2010; JRSSB 2011):
1. Semi-parametric mixture of Student
$$f(x_i \mid S_i) = \sum_{j=1}^{K} v_j \, \mathrm{St}_{\nu}(\mu_{j S_i}, V_{S_i}), \qquad v_j = \mathrm{Dir}(\epsilon, \epsilon, \ldots)$$
with $\epsilon$ set small to encourage sparsity.
2. Non-parametric mixtures using Dirichlet Process (MDP)
$$f(x_i \mid S_i) = \int N(x_i \mid \theta) \, dP(\theta), \qquad P \sim D(\alpha, P_0)$$
which is formally an infinite mixture model with a Dirichlet Process with concentration parameter $\alpha$ and baseline measure $P_0$.
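The robustness argument for the Student mixture can be illustrated numerically. The Python sketch below is a deliberately simplified, univariate version (the talk's $x_i$ is bivariate with covariance $V_{S_i}$); the function names, weights, locations and degrees of freedom are all hypothetical choices made for the illustration.

```python
import math


def student_t_pdf(x, mu, scale, nu):
    """Density of a location-scale Student-t with nu degrees of freedom."""
    z = (x - mu) / scale
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi) - math.log(scale))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(z * z / nu))


def student_mixture_pdf(x, weights, mus, scale, nu):
    """Finite mixture of Student-t components sharing one scale."""
    return sum(w * student_t_pdf(x, m, scale, nu) for w, m in zip(weights, mus))


def normal_pdf(x, mu, sd):
    """Gaussian density, for comparison."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


# Hypothetical within-state model: two components, unit scale, nu = 4.
weights, mus = [0.7, 0.3], [0.0, 0.5]
outlier = 6.0
t_density = student_mixture_pdf(outlier, weights, mus, 1.0, 4)
gaussian_density = normal_pdf(outlier, 0.0, 1.0)
ratio = t_density / gaussian_density  # how much less surprising the outlier is
```

Because the heavy tails assign an outlying probe a far larger density than the Gaussian does, a single aberrant measurement is less able to force a change of hidden state.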

59 The MDP model requires careful computational strategies to be applicable to genomic scale data (JRSSB, 2011).
When dealing with 100s of samples, we tend to adopt the Student mixture and make use of the EM algorithm from dispersed starting points.
Having learnt a model we then wish to explore predictions under $\Pr(S^{(T)} \mid x^{(N)}, x^{(B)})$,
$$\Pr(S^{(T)} \mid x^{(N)}, x^{(B)}) = \sum_{S^{(N)}} \int \Pr(S^{(T)}, S^{(N)}, \theta \mid x^{(N)}, x^{(B)}) \, d\theta$$
where $\theta$ contains the mixing weights, transition parameters of the HMM and parameters in the likelihood.
$S^{(T)}$ has around 21 states, hence we use a sparse loss matrix $l(Z_i, Z_{i+1}, S_i, S_{i+1})$ that penalises calls from the normal state and false transitions.

60 Colon cancer: Ch11

61 Colon cancer: Ch12

62 Colon cancer: Ch7

63 Conclusions
- Subjective Bayesian statistics provides a prescriptive formal framework on how to think about data analysis
- Build your best model of nature; don't worry (too much) about how the model is to be used
- Define your loss and suggest actions accordingly
- Often $\hat{a}$, the best action, cannot be computed with certainty
- Solution: solve a different problem precisely
- Optimal predictions under computational constraints
- These approaches provide state-of-the-art predictions for structural variation in cancer genomics

64 Thank you
Relevant papers:
- Identification of novel CNVs (CNV discovery): NAR, 2007; JRSSB, 2011; Gen. Biol., 2010
- Classification (calling) of CNVs within multiple samples at known loci (CNV classification): Gen. Epi., 2011
- Association between CNVs and traits of interest, response variables, such as disease status (CNV association): Nature, 2010; Leukemia, 2011 (under revision)

65 Extensions: Sequential (longitudinal) model
Problem
- Motivated by CLL study.
- Multiple samples from the same patient at different times: Normal, Diagnosis, Treatment, Relapse.
- Track changes in copy number/LOH profile over time.
- Borrow information from across samples to improve calling.
Model
- data at each locus $\{x_i^{(N)}, x_i^{(D)}, x_i^{(Tr)}, x_i^{(R)}\}$
- states $\{S_i^{(N)}, S_i^{(T)}\}$
- tumour state composition %s $\{w_i^{(N)} = 0, w_i^{(D)}, w_i^{(Tr)}, w_i^{(R)}\}$
Interest is in identifying regions where $w_i^{(R)}$ relative to $w_i^{(Tr)}$ is large, and in trends.

66 Sequential model The duplication at diagnosis is not detected using single-sample analysis.

67 Sequential model

68 Population model
Problem
- Multiple tumour samples from different patients.
- Cluster tumours into distinct groups based on copy number/LOH profile.
- Borrow structural information across samples to improve inference.
Model
- Maintain a pool of tumour states $\{S_1^{(T)}, \ldots, S_K^{(T)}\}$
- Define a nonparametric prior on these states so that $K$ is adaptive
- Allow each tumour to sample from this population pool with some probability
Interest is in the pool of tumour states and how they associate with clinical traits.

69 Population model
Pool of tumour types: a table of Tumor Type (1, ..., K) against SNP (copy number).
Samples: genotypes at five SNPs; each individual also has a % Normal row.

Individual  Tumor Type  Sample  SNP genotypes
1           1           Tumor   AB  AB  ABB  AAB  AAA
                        Normal  AB  AB  AB   AB   AA
2           2           Tumor   A   A   AB   AB   AA
                        Normal  A   A   AB   AB   AA
3           1           Tumor   AA  AB  ABB  ABB  BB
                        Normal  AA  AB  AB   AB   BB
4           3           Tumor   B   B   ABB  ABB  BB
                        Normal  AB  AB  AB   AB   BB
5           3           Tumor   A   A   AAB  ABB  AAA
                        Normal  AB  AA  AB   AB   AA

where $K$ is unknown, $K \in \{1, 2, \ldots\}$

70 Population model: colon cancer, 500 samples, chrm 8 And we can investigate association between the clusters and clinical traits
