Mixture models for analysing transcriptome and ChIP-chip data Marie-Laure Martin-Magniette French National Institute for agricultural research (INRA) Unit of Applied Mathematics and Informatics at AgroParisTech, Paris Unit of Plant Genomics Research (URGV), Evry M.L Martin-Magniette (INRA) Mixture models and genomic data 7-11 July 2014 1 / 30
Presentation outline
1 Introduction
2 Mixture model definition
3 Genomic examples
4 Conclusions
Introduction
Observations described by 2 variables
- The distribution of the observations seems easy to model with a single Gaussian
Introduction
Observations described by 2 variables
- Data are scattered and subpopulations are observed
- The experimental design provides no external information about them
- This is an underlying structure observed through the data
Introduction
Definition of a mixture model
- It is a probabilistic model for representing the presence of subpopulations within an overall population
- A latent variable Z is introduced to indicate the subpopulation each observation comes from
[Figure panels: what we observe (Z = ?), the model, the expected results (Z: 1, 2, 3)]
- It is an unsupervised classification method
Functional annotation is the new challenge
- It is now relatively easy to sequence an organism and to locate its genes
- But between 20% and 40% of the genes have an unknown function
- For Arabidopsis thaliana, 16% of the genes are orphan genes, i.e. without any information on their function
- With high-throughput technologies, it is now possible to improve the functional annotation
First genomic example: co-expression analysis
- Co-expressed genes are good candidates for involvement in the same biological process (Eisen et al., 1998)
- Pearson correlation values are often used to measure co-expression, but they only give a local (pairwise) point of view
- Co-expression analysis can be recast as the search for an underlying structure in a whole dataset
Table: Examples of co-expression clusters of genes observed on 45 independent transcriptome experiments. Clusters are identified with a mixture.
Second example: ChIP-chip analysis
- These experiments aim at identifying interactions between a protein and DNA
- Most methods look for peaks of log(IP/Input) along the genome
- There is an underlying structure between the two samples
Key ingredients of a mixture model
[Figure panels: what we observe (Z = ?), the model, the expected results (Z: 1, 2, 3)]
Let $y = (y_1, \dots, y_n)$ denote $n$ observations with $y_i \in \mathbb{R}^Q$, and let $Z = (Z_1, \dots, Z_n)$ be the latent vector.
1) Distribution of $Z$: the $\{Z_i\}$ are assumed to be independent, with
$$P(Z_i = k) = \pi_k, \qquad \sum_{k=1}^{K} \pi_k = 1, \qquad Z \sim \mathcal{M}(n; \pi_1, \dots, \pi_K),$$
where $K$ is the number of components of the mixture.
2) Distribution of $(y_i \mid Z_i = k)$: a parametric distribution $f(\cdot\,; \gamma_k)$.
Some properties
- The $\{Z_i\}$ are independent
- The $\{Y_i\}$ are independent conditionally on the $\{Z_i\}$
- The couples $\{(Y_i, Z_i)\}$ are i.i.d.
- The model is invariant under any permutation of the labels $\{1, \dots, K\}$: the mixture model has $K!$ equivalent definitions
Distribution of $Y$:
$$P(Y \mid K, \theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(Y_i, Z_i = k) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(Z_i = k)\, P(Y_i \mid Z_i = k) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k f(Y_i; \gamma_k)$$
The density of each observation is a weighted sum of parametric distributions, known up to the parameter vector $\theta = (\pi_1, \dots, \pi_{K-1}, \gamma_1, \dots, \gamma_K)$.
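As an illustration of the distribution of Y, here is a minimal Python sketch of the mixture density and log-likelihood. It assumes univariate Gaussian components, so that $\gamma_k$ is a (mean, standard deviation) pair; the function names, parameter values and observations are made up for the example and are not taken from the slides.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a univariate Gaussian N(mean, sd^2) at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(y, weights, means, sds):
    """g(y) = sum_k pi_k f(y; gamma_k), with gamma_k = (mean_k, sd_k)."""
    return sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))

def log_likelihood(ys, weights, means, sds):
    """log P(Y | K, theta) = sum_i log sum_k pi_k f(y_i; gamma_k)."""
    return sum(math.log(mixture_density(y, weights, means, sds)) for y in ys)

# Illustrative (made-up) parameters for K = 2 components.
weights, means, sds = [0.3, 0.7], [0.0, 4.0], [1.0, 1.0]
ys = [-0.2, 0.5, 3.8, 4.4]
ll = log_likelihood(ys, weights, means, sds)

# Label invariance: swapping the two components leaves the likelihood unchanged.
ll_swapped = log_likelihood(ys, weights[::-1], means[::-1], sds[::-1])
```

Comparing `ll` and `ll_swapped` shows the label-permutation invariance on a small case: both parameterisations describe the same mixture.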
Statistical inference of incomplete data models
Maximum likelihood estimate:
$$\hat{\theta} = \arg\max_{\theta} \log P(Y \mid K, \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log\left[\sum_{k=1}^{K} \pi_k f(Y_i; \gamma_k)\right]$$
Direct maximisation is not always possible, since the likelihood, once expanded, is a sum of $K^n$ terms.
Expectation-Maximization algorithm: an iterative algorithm based on the expectation of the complete-data log-likelihood, conditionally on the observations and on $\theta^{(l)}$:
$$\theta^{(l+1)} = \arg\max_{\theta} E\left\{\log P(Y, Z \mid K, \theta) \mid Y, \theta^{(l)}\right\}$$
According to the theory, this implies that $\log P(Y \mid K, \theta^{(l)})$ tends toward a local maximum.
EM algorithm details
Initialisation of $\theta^{(0)}$
While the convergence criterion is not reached, iterate:
- E-step: calculation of the conditional probabilities
$$\tau_{ik}^{(l)} = P(Z_i = k \mid y_i, \theta^{(l)}) = \frac{\pi_k^{(l)} f(y_i; \gamma_k^{(l)})}{\sum_{k'=1}^{K} \pi_{k'}^{(l)} f(y_i; \gamma_{k'}^{(l)})}$$
- M-step: calculation of $\theta^{(l+1)}$ by maximising the complete likelihood, where Z is replaced with the conditional probabilities
$$\theta^{(l+1)} = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}^{(l)} \left[\log \pi_k + \log f(y_i; \gamma_k)\right]$$
This gives a weighted version of the usual maximum likelihood estimates (MLE).
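The two steps can be sketched in a few lines of Python. This is a minimal illustration that assumes univariate Gaussian components with free means and variances; it is not the implementation used in mclust or Rmixmod, and the function name, toy data, starting values and stopping rule are all made up for the example.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a univariate Gaussian N(mean, sd^2) at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def em_gaussian_mixture(ys, weights, means, sds, n_iter=200, tol=1e-8):
    """EM for a univariate Gaussian mixture, started from the given theta^(0)."""
    weights, means, sds = list(weights), list(means), list(sds)
    K, n = len(weights), len(ys)
    prev_ll, history = -math.inf, []
    for _ in range(n_iter):
        # E-step: tau[i][k] = pi_k f(y_i; gamma_k) / sum_k' pi_k' f(y_i; gamma_k')
        tau, ll = [], 0.0
        for y in ys:
            joint = [weights[k] * normal_pdf(y, means[k], sds[k]) for k in range(K)]
            g = sum(joint)
            ll += math.log(g)
            tau.append([j / g for j in joint])
        history.append(ll)  # observed log-likelihood at the current theta
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # M-step: weighted versions of the usual Gaussian MLEs
        for k in range(K):
            nk = sum(t[k] for t in tau)
            weights[k] = nk / n
            means[k] = sum(t[k] * y for t, y in zip(tau, ys)) / nk
            var = sum(t[k] * (y - means[k]) ** 2 for t, y in zip(tau, ys)) / nk
            sds[k] = math.sqrt(max(var, 1e-12))  # crude guard against a zero variance
    return weights, means, sds, tau, history

# Toy data with two visible groups (made-up values).
ys = [-1.2, -0.4, 0.1, 0.6, 1.1, 3.9, 4.6, 5.0, 5.3, 6.2]
w, m, s, tau, hist = em_gaussian_mixture(ys, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
```

The list `hist` stores the observed log-likelihood at each iteration, which makes it easy to check that it does not decrease. Running the function from several starting values is a simple way to see the sensitivity to initialisation.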
EM algorithm properties
- Convergence is always reached, but not always toward a global maximum
- The EM algorithm is sensitive to the initialisation step
- The EM algorithm is available in all good statistical software
- In R, it is available in the mclust and Rmixmod packages
- Rmixmod proposes the best initialisation strategy
Outputs of the model
Distribution:
$$g(y_i) = \pi_1 f(y_i; \gamma_1) + \pi_2 f(y_i; \gamma_2) + \pi_3 f(y_i; \gamma_3)$$
Conditional probabilities:
$$\tau_{ik} = P(Z_i = k \mid y_i) = \frac{\pi_k f(y_i; \gamma_k)}{g(y_i)}$$
tau_ik (%)   i = 1   i = 2   i = 3
k = 1         65.8     0.7     0.0
k = 2         34.2    47.8     0.0
k = 3          0.0    51.5   100.0
These probabilities enable the classification of the observations into the subpopulations.
Maximum A Posteriori rule: each observation is classified in the component for which its conditional probability is the highest.
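The Maximum A Posteriori rule is a one-line argmax. The sketch below applies it to the conditional probabilities of observations i = 1 and i = 2 from the table, converted from percentages to proportions; the function name `map_classify` is made up for the example.

```python
def map_classify(tau_i):
    """Return the 1-based label k maximising tau_ik (Maximum A Posteriori rule)."""
    best = max(range(len(tau_i)), key=lambda k: tau_i[k])
    return best + 1

# Conditional probabilities (k = 1, 2, 3) of observations i = 1 and i = 2.
tau = {
    1: [0.658, 0.342, 0.000],
    2: [0.007, 0.478, 0.515],
}
labels = {i: map_classify(t) for i, t in tau.items()}
```

Observation 1 goes to component 1 and observation 2 to component 3, although the latter is assigned with a probability of only 51.5%: the MAP rule always returns a label, even when the classification is uncertain.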
Model selection
- The number of components of the mixture is often unknown
- A collection of models is considered, where K varies between 2 and $K_{max}$
- The best model is the one maximising a criterion
Bayesian Information Criterion (BIC)
- A proxy of the integrated likelihood $P(Y \mid K) = \int P(Y \mid K, \theta)\, \pi(\theta \mid K)\, d\theta$
- It aims at finding a good number of components for a global fit of the data distribution
$$BIC(K) = \log P(Y \mid K, \hat{\theta}) - \frac{\nu_K}{2} \log(n)$$
Integrated Completed Likelihood (ICL)
- A proxy of the integrated complete likelihood $P(Y, Z \mid K)$
- It is dedicated to classification, since it strongly penalizes models for which the classification is uncertain
$$ICL(K) = BIC(K) + \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik} \log \tau_{ik}$$
In both criteria, $\nu_K$ is the number of free parameters of the model and $P(Y \mid K, \hat{\theta})$ is the maximised likelihood under this model.
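Both criteria are cheap to compute once a model has been fitted. The sketch below is a minimal Python illustration under the convention that larger values are better; the function names, the log-likelihood value and the conditional probabilities are hypothetical, and $\nu_K = 5$ corresponds to a univariate Gaussian mixture with K = 2 (one proportion, two means, two variances).

```python
import math

def bic(loglik, n_free_params, n_obs):
    """BIC(K) = log P(Y | K, theta_hat) - (nu_K / 2) * log(n)."""
    return loglik - 0.5 * n_free_params * math.log(n_obs)

def icl(loglik, n_free_params, tau):
    """ICL(K) = BIC(K) + sum_i sum_k tau_ik * log(tau_ik), with 0 * log(0) taken as 0."""
    n_obs = len(tau)
    entropy_term = sum(t * math.log(t) for row in tau for t in row if t > 0.0)
    return bic(loglik, n_free_params, n_obs) + entropy_term

# Hypothetical fitted model: K = 2, n = 4 observations.
loglik, nu_K = -20.0, 5
tau = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]
bic_value = bic(loglik, nu_K, len(tau))
icl_value = icl(loglik, nu_K, tau)

# With a perfectly certain classification, the extra term vanishes.
icl_certain = icl(loglik, nu_K, [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
```

Since every term $\tau_{ik} \log \tau_{ik}$ is negative or zero, ICL never exceeds BIC, and the gap grows with the uncertainty of the classification.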
Conclusions on the model selection
- BIC aims at finding a good number of components for a global fit of the data distribution. It tends to overestimate the number of components
- ICL is dedicated to a classification purpose. It strongly penalizes models for which the classification is uncertain
- Whatever the criterion, it must be a convex function of the number of components; a non-convex function may indicate a modeling issue
[Figure: two criterion curves, labelled "Bad behavior" and "Correct behavior"]
Presentation outline
1 Introduction
2 Mixture model definition
3 Genomic examples
  Mixtures for co-expression analysis
  Mixtures for analysing ChIP-chip data
4 Conclusions
GEM2Net: From gene expression modeling to -omics network
Goal: Explore the orphan gene space to identify new genes involved in defense and adaptation processes
Method: Predict co-expression networks using mixture models
Data: An original resource generated by the transcriptomic platform of URGV
- Homogeneous data generated with the CATMA microarray
- 5,095 genes not present in the Affymetrix chip
- High diversity of biological samples relative to stress conditions
Workflow overview
- Extraction from CATdb of 387 stress comparisons
- 17,264 genes are differentially expressed in at least one of these comparisons (FWER controlled at 5% over all the tests)
- Analyses performed with Gaussian Mixture Models
- According to the BIC curve, the naive clustering on the whole dataset is not relevant
- Gene co-expression depends on the stress categories
Results of the co-expression analysis
- 18 categories (9 biotic and 9 abiotic), identification of 681 clusters
- Large overlap between biotic and abiotic clusters
- 98% of clusters have a functional bias for a Gene Ontology term
- 80% are associated with a stress term
- 39% have a preferential sub-cellular localization in the plastid
- 18% are enriched in transcription factors (TF); for stifenia, no cluster is enriched in TF
Focus on nematode stress
- 7467 genes described by 10 expression differences
- 29 clusters of co-expression identified
- 1519 genes with a conditional probability close to 1
Example of Cluster 14
- 49 genes repressed from 14 days after infection
- 13 genes known to be involved in stress response
- 10 orphan genes
- Endoplasmic reticulum bias
GEM2Net database
http://urgv.evry.inra.fr/gem2net
- Integration of various resources: Gene Ontology, genes involved in stress responses, gene families (transcription factors and hormones) and protein-protein interactions (experimental and predicted)
- Original representation and interactive visualization, using pie charts to summarize the functional biases at first glance
ChIP-chip experiments
- The log-ratio is not tractable, whereas the couple (IP, Input) is
- Development of a mixture of two linear regressions
MultiChIPmix: Mixture of two linear regressions
Let $Z_i$ be the status of probe $i$: $P(Z_i = 1) = \pi$
The linear relation between IP and Input depends on the probe status:
$$IP_{ir} = \begin{cases} a_{0r} + b_{0r}\, Input_{ir} + E_{ir} & \text{if } Z_i = 0 \text{ (normal)} \\ a_{1r} + b_{1r}\, Input_{ir} + E_{ir} & \text{if } Z_i = 1 \text{ (enriched)} \end{cases} \qquad V(IP_{ir}) = \sigma_r^2$$
Martin-Magniette et al. (2008), Bioinformatics
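To see how this model classifies a probe, the following Python sketch computes the conditional probability of the enriched status for a single replicate. It assumes Gaussian errors $E_{ir}$, which the slide does not state explicitly, and all parameter values are made up; it is an illustration of the mixture of two regressions, not the MultiChIPmix code.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a univariate Gaussian N(mean, sd^2) at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def enrichment_probability(ip, inp, pi, a0, b0, a1, b1, sigma):
    """tau_i = P(Z_i = 1 | IP_i, Input_i) for one replicate, assuming Gaussian errors."""
    f0 = normal_pdf(ip, a0 + b0 * inp, sigma)  # normal regression line
    f1 = normal_pdf(ip, a1 + b1 * inp, sigma)  # enriched regression line
    return pi * f1 / ((1.0 - pi) * f0 + pi * f1)

# Hypothetical parameters: 20% of enriched probes, two distinct regression lines.
params = dict(pi=0.2, a0=0.0, b0=1.0, a1=1.5, b1=1.1, sigma=0.5)

tau_high = enrichment_probability(ip=12.4, inp=10.0, **params)  # close to the enriched line
tau_low = enrichment_probability(ip=10.1, inp=10.0, **params)   # close to the normal line
```

A probe whose IP signal lies near the enriched line gets a conditional probability close to 1, and a probe near the normal line gets one close to 0; a probe is then declared enriched or not from this probability.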
MultiChIPmix has been used to:
- create the first epigenomic map of Arabidopsis thaliana: Roudier et al. (2011), EMBO Journal
- study the additive inheritance of histone modifications in Arabidopsis thaliana intra-specific hybrids: Moghaddam et al. (2011), Plant Journal
MultiChIPmixHMM for taking the spatial information into account
- When probes are (almost) equally spaced along the genome, hybridisation signals tend to be clustered
- Assuming that the probe statuses are (Markov-)dependent brings this information into the model:
$$\{Z_i\} \sim MC(\pi, \nu), \qquad \pi_{kl} = \Pr\{Z_i = k \mid Z_{i-1} = l\}$$
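The effect of the Markov dependence on the hidden statuses can be illustrated with a short simulation. The Python sketch below draws a two-state chain and computes the expected length of a run of enriched probes; the transition and initial probabilities are hypothetical, and the code stores the matrix as `trans[l][k]` (previous state first), i.e. the quantity written $\pi_{kl}$ on the slide.

```python
import random

def simulate_status(n, trans, init, rng):
    """Simulate Z_1..Z_n from a two-state Markov chain.

    trans[l][k] = Pr(Z_i = k | Z_{i-1} = l); init[k] = Pr(Z_1 = k).
    """
    z = [1 if rng.random() < init[1] else 0]
    for _ in range(n - 1):
        prev = z[-1]
        z.append(1 if rng.random() < trans[prev][1] else 0)
    return z

def mean_run_length(stay_prob):
    """Expected length of a run in a state with self-transition probability stay_prob."""
    return 1.0 / (1.0 - stay_prob)

# Hypothetical values: enriched probes (state 1) are rare but persistent.
trans = [[0.95, 0.05],
         [0.20, 0.80]]
init = [0.8, 0.2]
z = simulate_status(200, trans, init, random.Random(0))
expected_enriched_run = mean_run_length(trans[1][1])  # about 5 consecutive probes
```

With independent statuses and the same proportion of enriched probes (20%), a run of enriched probes would last 1.25 probes on average, against about 5 here: the chain produces the clustered signals described above.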
Table: Example of one known H3K27me3 target gene identified only with MultiChIPmixHMM.
- MultiChIPmix and MultiChIPmixHMM are alternatives to peak-detection methods
- Analysing several replicates simultaneously + modelling the spatial dependency = more accurate conditional probabilities
- MultiChIPmixHMM is available as an R package: Bérard et al. (2013), BMC Bioinformatics
Conclusions
- Mixtures reveal underlying structures
- Key ingredients are $P(Z)$ and $P(Y \mid Z)$
- For genomic data, component distribution modeling is sometimes tricky, especially for RNA-Seq data
- Applications on genomic data sometimes raise new methodological questions about the parameter inference and classification rules
- Examples of R packages using mixtures: mclust, Rmixmod, MultiChIPmixHMM, HTSDiff, HTSCluster, poisson.glm.mix
Acknowledgements
Statistics: S. Robin, T. Mary-Huard, C. Bérard, G. Celeux, C. Maugis-Rabusseau, G. Rigaill, A. Rau, P. Papastamoulis, M. Seifert
Bioinformatics: V. Brunaud, J-P. Tamby, R. Zaag, Z. Tariq
Biology: J-P. Renou, E. Delannoy, S. Balzergue, V. Colot, F. Roudier
Thank you for your attention!