Mixture models for analysing transcriptome and ChIP-chip data


Marie-Laure Martin-Magniette
French National Institute for Agricultural Research (INRA)
Unit of Applied Mathematics and Informatics at AgroParisTech, Paris
Unit of Plant Genomics Research (URGV), Evry
M.L Martin-Magniette (INRA), Mixture models and genomic data, 7-11 July

Presentation outline
1. Introduction
2. Mixture model definition
3. Genomic examples
4. Conclusions

Introduction
- Observations are described by 2 variables.
- The distribution of the observations seems easy to model with a single Gaussian.


Introduction
- Observations are described by 2 variables.
- Data are scattered and subpopulations are observed.
- According to the experimental design, there is no external information about them.
- This is an underlying structure observed through the data.

Introduction: definition of a mixture model
- It is a probabilistic model for representing the presence of subpopulations within an overall population.
- A latent variable Z is introduced to indicate the subpopulation each observation comes from.
- [Figure: three panels, "what we observe" (Z unknown), "the model", and "the expected results" (Z = 1, 2 or 3).]
- It is an unsupervised classification method.


Functional annotation is the new challenge
- It is now relatively easy to sequence an organism and to localize its genes.
- But between 20% and 40% of the genes have an unknown function.
- For Arabidopsis thaliana, 16% of the genes are orphan genes, i.e. without any information on their function.
- With high-throughput technologies, it is now possible to improve the functional annotation.


First genomic example: co-expression analysis
- Co-expressed genes are good candidates for involvement in the same biological process (Eisen et al., 1998).
- Pearson correlation values are often used to measure co-expression, but they give only a local (pairwise) point of view.
- Co-expression analysis can be recast as a search for an underlying structure in a whole dataset.
Table: Examples of co-expression clusters of genes observed on 45 independent transcriptome experiments. Clusters are identified with a mixture.

Second example: ChIP-chip analysis
- These experiments aim at identifying interactions between a protein and DNA.
- Most methods look for peaks of log(IP/Input) along the genome.
- There is an underlying structure between the two samples.


Key ingredients of a mixture model
Let $y = (y_1, \dots, y_n)$ denote $n$ observations with $y_i \in \mathbb{R}^Q$, and let $Z = (Z_1, \dots, Z_n)$ be the latent vector.
1) Distribution of Z: the $\{Z_i\}$ are assumed to be independent, with
$$P(Z_i = k) = \pi_k, \qquad \sum_{k=1}^{K} \pi_k = 1, \qquad Z \sim \mathcal{M}(n; \pi_1, \dots, \pi_K),$$
where $K$ is the number of components of the mixture.
2) Distribution of $(y_i \mid Z_i = k)$: a parametric distribution $f(\cdot\,; \gamma_k)$.
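The two ingredients above describe a generative process: draw the label, then draw the observation from the matching component. The following is a minimal sketch of that process, assuming univariate Gaussian components; the function name `sample_mixture` and all numeric values are illustrative and do not come from the slides.

```python
import random


def sample_mixture(n, weights, means, sds, seed=0):
    """Draw n observations from a 1-D Gaussian mixture (illustrative values).

    Returns (y, z): the observations and their latent labels in {0, ..., K-1}.
    """
    rng = random.Random(seed)
    labels = list(range(len(weights)))
    y, z = [], []
    for _ in range(n):
        # Ingredient 1: draw the latent label Z_i with P(Z_i = k) = pi_k.
        k = rng.choices(labels, weights=weights)[0]
        # Ingredient 2: draw y_i from the component distribution f(.; gamma_k).
        y.append(rng.gauss(means[k], sds[k]))
        z.append(k)
    return y, z


# Hypothetical three-component example.
y, z = sample_mixture(1000, weights=[0.5, 0.3, 0.2],
                      means=[-4.0, 0.0, 5.0], sds=[1.0, 1.0, 1.0])
```

In a real analysis only `y` is observed; `z` is exactly what the mixture model tries to recover.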

Some properties
- The $\{Z_i\}$ are independent.
- The $\{Y_i\}$ are independent conditionally on the $\{Z_i\}$.
- The couples $\{(Y_i, Z_i)\}$ are i.i.d.
- The model is invariant under any permutation of the labels $\{1, \dots, K\}$: the mixture model has $K!$ equivalent definitions.
Distribution of Y:
$$P(Y \mid K, \theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(Y_i, Z_i = k) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(Z_i = k)\, P(Y_i \mid Z_i = k) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k f(Y_i; \gamma_k)$$
It is a weighted sum of parametric distributions, known up to the parameter vector $\theta = (\pi_1, \dots, \pi_{K-1}, \gamma_1, \dots, \gamma_K)$.

Statistical inference of incomplete data models
Maximum likelihood estimate:
$$\hat{\theta} = \arg\max_{\theta} \log P(Y \mid K, \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log\left[ \sum_{k=1}^{K} \pi_k f(Y_i; \gamma_k) \right]$$
Direct maximisation is not always possible, since the likelihood, once expanded, is a sum of $K^n$ terms.
Expectation-Maximization (EM) algorithm: an iterative algorithm based on the expectation of the complete-data log-likelihood, conditionally on the observations and the current value $\theta^{(l)}$:
$$\theta^{(l+1)} = \arg\max_{\theta} E\left\{ \log P(Y, Z \mid K, \theta) \mid Y, \theta^{(l)} \right\}$$
According to the theory, $\log P(Y \mid K, \theta^{(l)})$ then tends toward a local maximum.
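For a fixed value of θ, the observed-data log-likelihood itself is cheap to evaluate: it needs only n × K density evaluations. The sketch below assumes univariate Gaussian components and uses the log-sum-exp trick for numerical stability; the helper names are illustrative, not taken from any package cited in the slides.

```python
import math


def normal_logpdf(y, mean, sd):
    """Log-density of N(mean, sd^2) at y."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((y - mean) / sd) ** 2


def mixture_loglik(ys, weights, means, sds):
    """log P(Y | K, theta) = sum_i log sum_k pi_k f(y_i; gamma_k)."""
    total = 0.0
    for y in ys:
        terms = [math.log(w) + normal_logpdf(y, m, s)
                 for w, m, s in zip(weights, means, sds)]
        top = max(terms)
        # log-sum-exp: subtract the largest term before exponentiating.
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total
```

This function can serve as the quantity monitored across EM iterations, or as the log-likelihood plugged into a model selection criterion.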

EM algorithm details
Initialisation of $\theta^{(0)}$.
While the convergence criterion is not reached, iterate:
- E-step: calculation of the conditional probabilities
$$\tau_{ik}^{(l)} = P(Z_i = k \mid y_i, \theta^{(l)}) = \frac{\pi_k^{(l)} f(y_i; \gamma_k^{(l)})}{\sum_{k'=1}^{K} \pi_{k'}^{(l)} f(y_i; \gamma_{k'}^{(l)})}$$
- M-step: calculation of $\theta^{(l+1)}$ by maximising the complete likelihood, where Z is replaced with the conditional probabilities
$$\theta^{(l+1)} = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}^{(l)} \left[ \log \pi_k + \log f(y_i; \gamma_k) \right]$$
The resulting estimates are weighted versions of the usual maximum likelihood estimates (MLE).
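The E-step and M-step can be sketched for the simplest case, a univariate Gaussian mixture, where the M-step reduces to weighted proportions, means and variances. This is an illustrative implementation under that assumption, not the code of mclust or Rmixmod; the stopping rule, the simulated data and the starting values are hypothetical choices.

```python
import math
import random


def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def em_gaussian_mixture(ys, weights, means, sds, max_iter=200, tol=1e-8):
    """EM for a 1-D Gaussian mixture, started from the given initial values."""
    weights, means, sds = list(weights), list(means), list(sds)
    n, K = len(ys), len(weights)
    previous = -math.inf
    for _ in range(max_iter):
        # E-step: tau[i][k] = pi_k f(y_i; gamma_k) / sum_k' pi_k' f(y_i; gamma_k')
        tau, loglik = [], 0.0
        for y in ys:
            joint = [weights[k] * normal_pdf(y, means[k], sds[k]) for k in range(K)]
            g = sum(joint)
            tau.append([j / g for j in joint])
            loglik += math.log(g)
        # Stop once the observed log-likelihood no longer increases noticeably.
        if loglik - previous < tol:
            break
        previous = loglik
        # M-step: weighted versions of the usual maximum likelihood estimates.
        for k in range(K):
            nk = sum(row[k] for row in tau)
            weights[k] = nk / n
            means[k] = sum(row[k] * y for row, y in zip(tau, ys)) / nk
            var = sum(row[k] * (y - means[k]) ** 2 for row, y in zip(tau, ys)) / nk
            sds[k] = math.sqrt(max(var, 1e-12))  # guard against a collapsing component
    return weights, means, sds, tau, loglik


# Simulated data: two well-separated groups (hypothetical values).
rng = random.Random(1)
ys = [rng.gauss(-3.0, 1.0) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]
weights, means, sds, tau, loglik = em_gaussian_mixture(
    ys, weights=[0.5, 0.5], means=[-1.0, 1.0], sds=[1.0, 1.0])
```

Running the function from several starting values and keeping the fit with the highest log-likelihood is one common way to limit the risk of stopping at a poor local maximum.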

EM algorithm properties
- Convergence is always reached, but not always toward a global maximum.
- The EM algorithm is sensitive to the initialisation step.
- The EM algorithm is available in all good statistical software.
- In R, it is available in the mclust and Rmixmod packages. Rmixmod proposes the best initialisation strategy.

Outputs of the model
Distribution:
$$g(y_i) = \pi_1 f(y_i; \gamma_1) + \pi_2 f(y_i; \gamma_2) + \pi_3 f(y_i; \gamma_3)$$
Conditional probabilities:
$$\tau_{ik} = P(Z_i = k \mid y_i) = \frac{\pi_k f(y_i; \gamma_k)}{g(y_i)}$$
[Table: $\tau_{ik}$ (%) for observations $i = 1, 2, 3$ and each component $k$.]
These probabilities enable the classification of the observations into the subpopulations.

Outputs of the model (continued)
Maximum A Posteriori (MAP) rule: each observation is classified into the component for which its conditional probability $\tau_{ik}$ is the highest.
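The MAP rule is a row-wise argmax over the matrix of conditional probabilities. A minimal sketch follows; the probabilities below are made up for illustration and are not the values of the slide's table, and labels are zero-based.

```python
def map_classify(tau):
    """Return, for each observation, the index of its most probable component."""
    return [max(range(len(row)), key=row.__getitem__) for row in tau]


# Hypothetical conditional probabilities for 3 observations and 3 components.
tau_example = [
    [0.90, 0.08, 0.02],
    [0.10, 0.30, 0.60],
    [0.40, 0.45, 0.15],
]
labels = map_classify(tau_example)  # [0, 2, 1]
```

The third observation shows what the rule hides: it is assigned to the second component although its probability is only 0.45, so it can be useful to report the maximum probability alongside the label.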

Model selection
- The number of components of the mixture is often unknown.
- A collection of models is considered, where $K$ varies between 2 and $K_{max}$.
- The best model is the one maximising a criterion.
Bayesian Information Criterion (BIC):
- a proxy of the integrated likelihood $P(Y \mid K) = \int P(Y \mid K, \theta)\, \pi(\theta \mid K)\, d\theta$
- aims at finding a good number of components for a global fit of the data distribution
$$\mathrm{BIC}(K) = \log P(Y \mid K, \hat{\theta}) - \frac{\nu_K}{2} \log(n)$$
where $\nu_K$ is the number of free parameters of the model and $P(Y \mid K, \hat{\theta})$ is the maximum likelihood under this model.
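As a worked illustration of the BIC formula, the sketch below scores a small collection of models and keeps the maximiser. It assumes univariate Gaussian components with free variances, so that ν_K = (K − 1) + 2K; the maximised log-likelihoods and the sample size are hypothetical numbers, not results from the slides.

```python
import math


def n_free_params_gaussian_1d(K):
    """(K - 1) proportions + K means + K variances."""
    return (K - 1) + 2 * K


def bic(loglik, n_params, n):
    """BIC(K) = log P(Y | K, theta_hat) - (nu_K / 2) log(n); larger is better."""
    return loglik - 0.5 * n_params * math.log(n)


def select_k(logliks, n):
    """logliks maps K to the maximised log-likelihood of the K-component model."""
    scores = {K: bic(ll, n_free_params_gaussian_1d(K), n) for K, ll in logliks.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical maximised log-likelihoods for K = 2, 3, 4 with n = 500 observations.
best_k, scores = select_k({2: -1250.0, 3: -1180.0, 4: -1176.0}, n=500)
```

With these numbers, moving from K = 3 to K = 4 gains only 4 units of log-likelihood, which does not offset the extra penalty, so K = 3 is retained.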

Model selection (continued)
Integrated Completed Likelihood (ICL):
- a proxy of the integrated complete likelihood $P(Y, Z \mid K)$
- dedicated to classification, since it strongly penalizes models for which the classification is uncertain
$$\mathrm{ICL}(K) = \mathrm{BIC}(K) + \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik} \log \tau_{ik}$$
As for BIC, $\nu_K$ is the number of free parameters of the model and $P(Y \mid K, \hat{\theta})$ is the maximum likelihood under this model.
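The extra term in ICL is the negative entropy of the classification: it is zero when every observation is assigned with certainty and becomes more negative as the conditional probabilities spread out. A minimal sketch, with illustrative function names and the convention 0 log 0 = 0:

```python
import math


def entropy_term(tau):
    """sum_i sum_k tau_ik log tau_ik, always <= 0 (with 0 log 0 taken as 0)."""
    return sum(t * math.log(t) for row in tau for t in row if t > 0)


def icl(bic_value, tau):
    """ICL(K) = BIC(K) + sum_i sum_k tau_ik log tau_ik."""
    return bic_value + entropy_term(tau)
```

For example, a single observation split 50/50 between two components lowers the criterion by log 2, whereas a certain assignment leaves ICL equal to BIC.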

Conclusions on the model selection
- BIC aims at finding a good number of components for a global fit of the data distribution. It tends to overestimate the number of components.
- ICL is dedicated to a classification purpose. It strongly penalizes models for which the classification is uncertain.
- Whatever the criterion, it must be a convex function of the number of components. A non-convex function may indicate a modeling issue.
[Figure: two criterion curves, labelled "Bad behavior" and "Correct behavior".]

Presentation outline
1. Introduction
2. Mixture model definition
3. Genomic examples
   - Mixtures for co-expression analysis
   - Mixtures for analysing ChIP-chip data
4. Conclusions


GEM2Net: From gene expression modeling to -omics network
- Goal: explore the orphan gene space to identify new genes involved in defense and adaptation processes.
- Method: predict co-expression networks using mixture models.
- Data: an original resource generated by the transcriptomic platform of URGV
  - Homogeneous data generated with the CATMA microarray
  - 5,095 genes not present on the Affymetrix chip
  - High diversity of biological samples relative to stress conditions

Workflow overview
- Extraction from CATdb of 387 stress comparisons
- 17,264 genes are differentially expressed in at least one of these comparisons (FWER controlled at 5% over all the tests)
- Analyses performed with Gaussian mixture models
- According to the BIC curve, naive clustering on the whole dataset is not relevant
- Gene co-expression depends on the stress categories


Results of the co-expression analysis
- 18 categories (9 biotic and 9 abiotic), identification of 681 clusters
- Large overlap between biotic and abiotic clusters
- 98% of clusters have a functional bias for a Gene Ontology term
- 80% are associated with a stress term
- 39% have a preferential sub-cellular localization in the plastid
- 18% are enriched in transcription factors (TF); for stifenia, no cluster is enriched in TF

Focus on nematode stress
- Genes described by 10 expression differences
- 29 co-expression clusters identified
- 1519 genes with a conditional probability close to 1
Example of cluster:
- 49 genes repressed from 14 days after infection
- 13 genes known to be involved in stress response
- 10 orphan genes
- Endoplasmic reticulum bias

GEM2Net database
- Integration of various resources: Gene Ontology, genes involved in stress responses, gene families (transcription factors and hormones) and protein-protein interactions (experimental and predicted).
- Original representation and interactive visualization, using pie charts to summarize the functional biases at first glance.

ChIP-chip experiments
- The log-ratio is not tractable, while the couple (IP, Input) is.
- Development of a mixture of 2 linear regressions.

MultiChIPmix: Mixture of two linear regressions
Let $Z_i$ be the status of probe $i$, with $P(Z_i = 1) = \pi$.
The linear relation between IP and Input depends on the probe status:
$$IP_{ir} = \begin{cases} a_{0r} + b_{0r}\, Input_{ir} + E_{ir} & \text{if } Z_i = 0 \text{ (normal)} \\ a_{1r} + b_{1r}\, Input_{ir} + E_{ir} & \text{if } Z_i = 1 \text{ (enriched)} \end{cases} \qquad V(IP_{ir}) = \sigma_r^2$$
where $r$ indexes the replicate.
Martin-Magniette et al. (2008), Bioinformatics
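The mixture of two regressions can be fitted with the same EM scheme as any other mixture: the E-step computes the probability that each probe is enriched, and the M-step fits one weighted least-squares line per status. The sketch below is a simplified illustration for a single replicate with a shared residual variance; it is not the MultiChIPmix implementation, and the simulated data, the initialisation and all names are assumptions.

```python
import math
import random


def weighted_line(x, y, w):
    """Weighted least-squares fit of y = a + b * x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def em_two_regressions(x, y, n_iter=100):
    """EM for IP = a_z + b_z * Input + E, z in {0, 1}, with a shared variance."""
    n = len(x)
    # Hypothetical initialisation: one overall line, plus a copy shifted upwards.
    a, b = weighted_line(x, y, [1.0] * n)
    sigma = math.sqrt(sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / n)
    line0, line1, pi = (a, b), (a + 2 * sigma, b), 0.5
    for _ in range(n_iter):
        # E-step: tau_i = P(Z_i = 1 | data). The Gaussian normalising constants
        # cancel because both statuses share the same variance.
        tau = []
        for xi, yi in zip(x, y):
            r0 = yi - line0[0] - line0[1] * xi
            r1 = yi - line1[0] - line1[1] * xi
            f0 = (1 - pi) * math.exp(-0.5 * (r0 / sigma) ** 2)
            f1 = pi * math.exp(-0.5 * (r1 / sigma) ** 2)
            tau.append(f1 / (f0 + f1))
        # M-step: proportion, one weighted regression per status, pooled variance.
        pi = sum(tau) / n
        line0 = weighted_line(x, y, [1 - t for t in tau])
        line1 = weighted_line(x, y, tau)
        ss = sum((1 - t) * (yi - line0[0] - line0[1] * xi) ** 2
                 + t * (yi - line1[0] - line1[1] * xi) ** 2
                 for t, xi, yi in zip(tau, x, y))
        sigma = math.sqrt(ss / n)
    return pi, line0, line1, sigma, tau


# Simulated probes (hypothetical values): about 20% enriched.
rng = random.Random(0)
x, y, truth = [], [], []
for _ in range(500):
    xi = rng.uniform(8.0, 14.0)
    enriched = rng.random() < 0.2
    a, b = (2.0, 0.95) if enriched else (0.5, 0.9)
    x.append(xi)
    y.append(a + b * xi + rng.gauss(0.0, 0.3))
    truth.append(enriched)

pi, line0, line1, sigma, tau = em_two_regressions(x, y)
accuracy = sum((t > 0.5) == z for t, z in zip(tau, truth)) / len(truth)
```

Starting the enriched line above the normal one is what keeps label 1 attached to the enriched status here; with a different initialisation the two labels could be swapped.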


Used to:
- create the first epigenomic map of Arabidopsis thaliana: Roudier et al. (2011), EMBO Journal
- study the additive inheritance of histone modifications in Arabidopsis thaliana intra-specific hybrids: Moghaddam et al. (2011), Plant Journal


MultiChIPmixHMM for taking the spatial information into account
- When probes are (almost) equally spaced along the genome, hybridisation signals tend to be clustered.
- Assuming that the probe statuses are (Markov-)dependent brings this information into the model:
$$\{Z_i\} \sim MC(\pi, \nu), \qquad \pi_{kl} = \Pr\{Z_i = k \mid Z_{i-1} = l\}$$
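With Markov-dependent statuses, the conditional probabilities of each probe depend on its neighbours and are usually computed with the forward-backward recursions. The sketch below is a generic scaled forward-backward pass, not the MultiChIPmixHMM code; it takes the emission densities as given, and note that it stores transitions as `trans[l][k] = P(Z_i = k | Z_{i-1} = l)`, i.e. with the indices in the opposite order to the slide's $\pi_{kl}$.

```python
def forward_backward(dens, init, trans):
    """Posterior P(Z_i = k | all observations) for a hidden Markov chain.

    dens[i][k]  : emission density of observation i under status k
    init[k]     : P(Z_1 = k)
    trans[l][k] : P(Z_i = k | Z_{i-1} = l)
    """
    n, K = len(dens), len(init)
    alpha = [[0.0] * K for _ in range(n)]
    scale = [0.0] * n
    # Forward pass, rescaled at each position to avoid numerical underflow.
    for i in range(n):
        for k in range(K):
            if i == 0:
                prior = init[k]
            else:
                prior = sum(alpha[i - 1][l] * trans[l][k] for l in range(K))
            alpha[i][k] = prior * dens[i][k]
        scale[i] = sum(alpha[i])
        alpha[i] = [a / scale[i] for a in alpha[i]]
    # Backward pass, using the same scaling constants.
    beta = [[1.0] * K for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for l in range(K):
            beta[i][l] = sum(trans[l][k] * dens[i + 1][k] * beta[i + 1][k]
                             for k in range(K)) / scale[i + 1]
    # Combine and renormalise to absorb rounding error.
    post = []
    for i in range(n):
        row = [alpha[i][k] * beta[i][k] for k in range(K)]
        total = sum(row)
        post.append([r / total for r in row])
    return post


# Hypothetical example: an ambiguous probe between two clearly enriched ones.
dens = [[0.1, 0.9], [0.5, 0.5], [0.1, 0.9]]
post = forward_backward(dens, init=[0.5, 0.5], trans=[[0.9, 0.1], [0.1, 0.9]])
```

In this example the middle probe carries no information on its own, yet its posterior probability of being enriched is high because both neighbours are; an independent mixture would leave it at 0.5.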

Table: Example of one known H3K27me3 target gene identified only with MultiChIPmixHMM.
- MultiChIPmix and MultiChIPmixHMM are alternative methods to peak detection.
- Analysis of several replicates simultaneously + modelling of the spatial dependency = more accurate conditional probabilities.
- MultiChIPmixHMM is available as an R package: Bérard et al. (2013), BMC Bioinformatics


Conclusions
- Mixtures reveal underlying structures.
- Key ingredients are $P(Z)$ and $P(Y \mid Z)$.
- For genomic data, component distribution modeling is sometimes tricky, especially for RNA-Seq data.
- Applications on genomic data sometimes raise new methodological questions about parameter inference and classification rules.
- Examples of R packages using mixtures: mclust, Rmixmod, MultiChIPmixHMM, HTSDiff, HTSCluster, poisson.glm.mix

Acknowledgements
Statistics, Bioinformatics, Biology: S. Robin, V. Brunaud, J-P. Renou, T. Mary-Huard, J-P. Tamby, E. Delannoy, C. Bérard, R. Zaag, S. Balzergue, G. Celeux, Z. Tariq, C. Maugis-Rabusseau, V. Colot, G. Rigaill, F. Roudier, A. Rau, P. Papastamoulis, M. Seifert
Thank you for your attention!
