Mixture models for analysing transcriptome and ChIP-chip data


Mixture models for analysing transcriptome and ChIP-chip data. Marie-Laure Martin-Magniette, French National Institute for Agricultural Research (INRA); Unit of Applied Mathematics and Informatics at AgroParisTech, Paris; Unit of Plant Genomics Research (URGV), Evry. M.L Martin-Magniette (INRA), Mixture models and genomic data, 7-11 July 2014, 1 / 30

Presentation outline
1. Introduction
2. Mixture model definition
3. Genomic examples
4. Conclusions

Introduction. Observations described by 2 variables; their distribution seems easy to model with a single Gaussian.

Introduction. Observations described by 2 variables. The data are scattered and subpopulations are observed. Given the experimental design, there exists no external information about them: it is an underlying structure observed through the data.

Introduction. Definition of a mixture model: a probabilistic model for representing the presence of subpopulations within an overall population. A latent variable Z is introduced, indicating which subpopulation each observation comes from. [Figure: what we observe, the model, the expected results (Z = ?; Z: 1, 2, 3).] It is an unsupervised classification method.

Functional annotation is the new challenge. It is now relatively easy to sequence an organism and to localize its genes, but between 20% and 40% of the genes have an unknown function. For Arabidopsis thaliana, 16% of the genes are orphan genes, i.e. without any information on their function. With high-throughput technologies, it is now possible to improve the functional annotation.

First genomic example: co-expression analysis. Co-expressed genes are good candidates to be involved in the same biological process (Eisen et al., 1998). Pearson correlation values are often used to measure co-expression, but this is a local point of view. Co-expression analysis can be recast as the search for an underlying structure in a whole dataset. Table: examples of co-expression clusters of genes observed on 45 independent transcriptome experiments; clusters are identified with a mixture.

Second example: ChIP-chip analysis. These experiments aim at identifying interactions between a protein and DNA. Most methods look for peaks of log(IP/Input) along the genome. There exists an underlying structure between the two samples.

Presentation outline
1. Introduction
2. Mixture model definition
3. Genomic examples
4. Conclusions

Key ingredients of a mixture model. [Figure: what we observe, the model, the expected results.] Let y = (y_1, ..., y_n) denote n observations with y_i ∈ R^Q, and let Z = (Z_1, ..., Z_n) be the latent vector.
1) Distribution of Z: the {Z_i} are assumed to be independent, with P(Z_i = k) = π_k and Σ_{k=1}^K π_k = 1, so that Z ~ M(n; π_1, ..., π_K), where K is the number of components of the mixture.
2) Distribution of (y_i | Z_i = k): a parametric distribution f(·; γ_k).
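The two ingredients above translate directly into a simulation scheme: draw Z from the multinomial, then draw y conditionally on Z. A minimal sketch with illustrative Gaussian components (the proportions, means and standard deviations below are assumptions for the example, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions, not values from the talk):
# K = 3 components with proportions pi_k and Gaussian components f(.; gamma_k).
pi = np.array([0.5, 0.3, 0.2])       # P(Z_i = k), sums to 1
means = np.array([-2.0, 0.0, 3.0])   # gamma_k = (mu_k, sigma_k)
sds = np.array([0.5, 1.0, 0.7])

n = 1000
# 1) Draw the latent labels: Z ~ M(n; pi_1, ..., pi_K).
z = rng.choice(3, size=n, p=pi)
# 2) Draw y_i | Z_i = k from the component distribution f(.; gamma_k).
y = rng.normal(means[z], sds[z])

# The empirical label frequencies recover pi up to sampling noise.
print(np.bincount(z, minlength=3) / n)
```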

Some properties: the {Z_i} are independent; the {Y_i} are independent conditionally on {Z_i}; the couples {(Y_i, Z_i)} are i.i.d. The model is invariant under any permutation of the labels {1, ..., K}, so the mixture model has K! equivalent definitions. Distribution of Y:
P(Y | K, θ) = Π_{i=1}^n Σ_{k=1}^K P(Y_i, Z_i = k) = Π_{i=1}^n Σ_{k=1}^K P(Z_i = k) P(Y_i | Z_i = k) = Π_{i=1}^n Σ_{k=1}^K π_k f(Y_i; γ_k).
It is a weighted sum of parametric distributions, known up to the parameter vector θ = (π_1, ..., π_{K-1}, γ_1, ..., γ_K).

Statistical inference of incomplete-data models. Maximum likelihood estimate:
θ̂ = argmax_θ log P(Y | K, θ) = argmax_θ Σ_{i=1}^n log [Σ_{k=1}^K π_k f(Y_i; γ_k)].
Direct maximisation is not always possible, since summing over the latent configurations involves K^n terms. Expectation-Maximization algorithm: an iterative algorithm based on the expectation of the complete-data log-likelihood conditionally on θ^(l):
θ^(l+1) = argmax_θ E[log P(Y, Z | K, θ) | Y, θ^(l)].
The theory guarantees that log P(Y | K, θ) tends toward a local maximum.

EM algorithm details. Initialisation of θ^(0). While the convergence criterion is not reached, iterate:
E-step: calculation of the conditional probabilities
τ_ik^(l) = P(Z_i = k | y_i, θ^(l)) = π_k^(l) f(y_i; γ_k^(l)) / Σ_{k'=1}^K π_{k'}^(l) f(y_i; γ_{k'}^(l)).
M-step: calculation of θ by maximising the complete likelihood, where Z is replaced with the conditional probabilities:
θ^(l+1) = argmax_θ Σ_{i=1}^n Σ_{k=1}^K τ_ik^(l) [log π_k + log f(y_i; γ_k)].
This yields weighted versions of the usual maximum likelihood estimates (MLE).
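The E- and M-steps above can be sketched for a one-dimensional Gaussian mixture as follows. This is a didactic implementation under simplifying assumptions (quantile-based initialisation, fixed iteration count instead of a convergence criterion), not the strategy used by Rmixmod:

```python
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(y, K, n_iter=200):
    """EM for a 1-D Gaussian mixture, following the slides' E- and M-steps."""
    n = len(y)
    pi = np.full(K, 1.0 / K)
    # Initialisation theta^(0): spread the initial means over the data quantiles.
    mu = np.quantile(y, (np.arange(K) + 0.5) / K)
    sigma = np.full(K, y.std())
    for _ in range(n_iter):
        # E-step: tau_ik = pi_k f(y_i; gamma_k) / sum_k' pi_k' f(y_i; gamma_k')
        dens = pi * norm.pdf(y[:, None], mu, sigma)   # shape (n, K)
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted versions of the usual MLEs.
        nk = tau.sum(axis=0)
        pi = nk / n
        mu = (tau * y[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((tau * (y[:, None] - mu) ** 2).sum(axis=0) / nk)
    loglik = np.log((pi * norm.pdf(y[:, None], mu, sigma)).sum(axis=1)).sum()
    return pi, mu, sigma, tau, loglik
```

At each iteration the observed-data log-likelihood (returned at the end) increases, in line with the convergence property stated above.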

EM algorithm properties. Convergence is always reached, but not always toward a global maximum. The EM algorithm is sensitive to the initialisation step. The EM algorithm is implemented in all good statistical software; in R, it is available in the MCLUST and RMIXMOD packages. RMIXMOD proposes the best initialisation strategy.

Outputs of the model. Distribution: g(y_i) = π_1 f(y_i; γ_1) + π_2 f(y_i; γ_2) + π_3 f(y_i; γ_3). Conditional probabilities: τ_ik = P(Z_i = k | y_i) = π_k f(y_i; γ_k) / g(y_i).

τ_ik (%)   i = 1   i = 2   i = 3
k = 1      65.8    0.7     0.0
k = 2      34.2    47.8    0.0
k = 3      0.0     51.5    1.0

These probabilities enable the classification of the observations into the subpopulations. Maximum A Posteriori rule: each observation is classified in the component for which the conditional probability is the highest.
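The MAP rule is a one-liner once the conditional probabilities are available; using the τ_ik values from the table above (with the i = 3 column as printed on the slide):

```python
import numpy as np

# Conditional probabilities tau_ik from the slide's table, in % (rows: i, cols: k).
tau = np.array([
    [65.8, 34.2,  0.0],   # observation i = 1
    [ 0.7, 47.8, 51.5],   # observation i = 2
    [ 0.0,  0.0,  1.0],   # observation i = 3 (as printed on the slide)
])

# Maximum A Posteriori rule: assign each observation to the component
# with the highest conditional probability.
map_labels = tau.argmax(axis=1) + 1   # components numbered 1..K
print(map_labels)  # observation 1 -> component 1, 2 -> 3, 3 -> 3
```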

Model selection. The number of components of the mixture is often unknown. One fits a collection of models where K varies between 2 and K_max; the best model is the one maximising a criterion. Bayesian Information Criterion (BIC): a proxy of the integrated likelihood P(Y | K) = ∫ P(Y | K, θ) π(θ | K) dθ; it aims at finding a good number of components for a global fit of the data distribution:
BIC(K) = log P(Y | K, θ̂) − (ν_K / 2) log(n),
where ν_K is the number of free parameters of the model and P(Y | K, θ̂) is the maximum likelihood under this model.

Model selection (continued). Integrated Completed Likelihood criterion (ICL): a proxy of the integrated complete likelihood P(Y, Z | K); it is dedicated to classification, since it strongly penalizes models for which the classification is uncertain:
ICL(K) = BIC(K) + Σ_{i=1}^n Σ_{k=1}^K τ̂_ik log τ̂_ik,
where ν_K is the number of free parameters and P(Y | K, θ̂) is the maximum likelihood under this model.
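Both criteria can be computed from a fitted model. A sketch for a one-dimensional Gaussian mixture, using the slides' sign convention (higher is better; note that some software, e.g. scikit-learn, defines BIC with the opposite sign):

```python
import numpy as np
from scipy.stats import norm

def bic_icl(y, pi, mu, sigma):
    """BIC(K) = log P(Y | K, theta_hat) - (nu_K / 2) log n, and
    ICL(K) = BIC(K) + sum_ik tau_ik log tau_ik (an entropy term <= 0,
    so ICL penalizes uncertain classifications). 1-D Gaussian sketch."""
    n, K = len(y), len(pi)
    dens = pi * norm.pdf(y[:, None], mu, sigma)   # (n, K)
    loglik = np.log(dens.sum(axis=1)).sum()
    nu = (K - 1) + K + K                          # proportions + means + sds
    bic = loglik - nu / 2 * np.log(n)
    tau = dens / dens.sum(axis=1, keepdims=True)
    entropy = (tau * np.log(np.clip(tau, 1e-300, None))).sum()
    return bic, bic + entropy
```

Selecting K then amounts to computing these criteria for K = 2, ..., K_max and keeping the maximiser.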

Conclusions on model selection. BIC aims at finding a good number of components for a global fit of the data distribution; it tends to overestimate the number of components. ICL is dedicated to a classification purpose; it strongly penalizes models for which the classification is uncertain. Whatever the criterion, it must be a convex function of the number of components. [Figure: bad behavior vs. correct behavior.] A non-convex curve may indicate a modeling issue.

Presentation outline
1. Introduction
2. Mixture model definition
3. Genomic examples
   - Mixtures for co-expression analysis
   - Mixtures for analysing ChIP-chip data
4. Conclusions

GEM2Net: from gene expression modeling to -omics networks. Goal: explore the orphan gene space to identify new genes involved in defense and adaptation processes. Method: predict co-expression networks using mixture models. Data: an original resource generated by the transcriptomic platform of URGV; homogeneous data generated with the CATMA microarray (5,095 genes not present on the Affymetrix chip); high diversity of biological samples relative to stress conditions.

Workflow overview
- Extraction from CATdb of 387 stress comparisons
- 17,264 genes are differentially expressed in at least one of these comparisons (FWER controlled at 5% over all the tests)
- Analyses performed with Gaussian mixture models
- According to the BIC curve, the naive clustering on the whole dataset is not relevant
- Gene co-expression depends on the stress categories

Results of the co-expression analysis
- 18 categories (9 biotic and 9 abiotic); identification of 681 clusters
- Large overlap between biotic and abiotic clusters
- 98% of the clusters have a functional bias in a Gene Ontology term
- 80% are associated with a stress term
- 39% have a preferential sub-cellular localization in the plastid
- 18% are enriched in transcription factors; for stifenia, no cluster is enriched in TFs

Focus on nematode stress. Example of Cluster 14:
- 7,467 genes described by 10 expression differences
- 29 co-expression clusters identified
- 1,519 genes with a conditional probability close to 1
- 49 genes repressed from 14 days after infection
- 13 genes known to be involved in the stress response
- 10 orphan genes
- Endoplasmic reticulum bias

GEM2Net database: http://urgv.evry.inra.fr/gem2net. Integration of various resources: gene ontology, genes involved in stress responses, gene families (transcription factors and hormones) and protein-protein interactions (experimental and predicted). Original representation and interactive visualization, using pie charts to summarize the functional biases at a glance.

ChIP-chip experiments. The log-ratio is not tractable, while the couple (IP, Input) is. This motivates the development of a mixture of two linear regressions.

MultiChIPmix: mixture of two linear regressions. Let Z_i denote the status of probe i, with P(Z_i = 1) = π. The linear relation between IP and Input depends on the probe status:
IP_ir = a_0r + b_0r Input_ir + E_ir if Z_i = 0 (normal),
IP_ir = a_1r + b_1r Input_ir + E_ir if Z_i = 1 (enriched),
with V(IP_ir) = σ_r². Martin-Magniette et al. (2008), Bioinformatics.
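An EM for this two-component mixture of regressions can be sketched as follows. This is a didactic single-replicate version with a common variance and a crude residual-based initialisation, not the published MultiChIPmix implementation:

```python
import numpy as np
from scipy.stats import norm

def chipmix_em(ip, inp, n_iter=100):
    """EM for a two-component mixture of linear regressions:
    IP_i = a_z + b_z * Input_i + E_i, z in {0 (normal), 1 (enriched)}.
    Didactic single-replicate sketch with a common variance."""
    n = len(ip)
    X = np.column_stack([np.ones(n), inp])
    # Crude initialisation: one global fit, second component shifted upward.
    b, a = np.polyfit(inp, ip, 1)
    resid = ip - (a + b * inp)
    coef = [np.array([a, b]), np.array([a + resid.std(), b])]
    sigma = resid.std()
    pi = 0.5
    for _ in range(n_iter):
        # E-step: posterior probability that probe i is enriched.
        d0 = (1 - pi) * norm.pdf(ip, X @ coef[0], sigma)
        d1 = pi * norm.pdf(ip, X @ coef[1], sigma)
        tau = d1 / (d0 + d1)
        # M-step: weighted least squares per component, then pi and sigma.
        for z, w in ((0, 1 - tau), (1, tau)):
            coef[z] = np.linalg.solve(X.T @ (X * w[:, None]), (ip * w) @ X)
        pi = tau.mean()
        r0 = ip - X @ coef[0]
        r1 = ip - X @ coef[1]
        sigma = np.sqrt(((1 - tau) * r0**2 + tau * r1**2).sum() / n)
    return pi, coef, tau
```

The returned τ_i play the same role as the conditional probabilities of the generic mixture: thresholding or MAP-classifying them declares probes enriched or normal.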

Used to create the first epigenomic map of Arabidopsis thaliana (Roudier et al. (2011), EMBO Journal) and to study the additive inheritance of histone modifications in Arabidopsis thaliana intra-specific hybrids (Moghaddam et al. (2011), Plant Journal).

MultiChIPmixHMM: taking the spatial information into account. When probes are (almost) equally spaced along the genome, hybridisation signals tend to be clustered. Assuming that the probe statuses are (Markov-)dependent incorporates this information into the model: {Z_i} ~ MC(π, ν), with π_kl = Pr(Z_i = k | Z_i−1 = l).
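The effect of the Markov assumption can be illustrated by simulating probe statuses from a hypothetical "sticky" transition matrix: enriched probes then come in runs, mirroring the clustering of hybridisation signals along the genome:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transition matrix pi_kl = Pr(Z_i = k | Z_i-1 = l):
# sticky states, so identical statuses cluster along the genome.
P = np.array([[0.95, 0.05],   # from state 0 (normal)
              [0.10, 0.90]])  # from state 1 (enriched)

n = 2000
z = np.zeros(n, dtype=int)
for i in range(1, n):
    z[i] = int(rng.random() < P[z[i - 1], 1])   # move to state 1 w.p. P[prev, 1]

# Mean run length: Markov dependence yields much longer runs of identical
# statuses than independent draws with the same marginal frequency would.
n_switches = int((np.diff(z) != 0).sum())
mean_run = n / (n_switches + 1)
print(round(mean_run, 1))
```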

Table: example of a known H3K27me3 target gene identified only with MultiChIPmixHMM. MultiChIPmix and MultiChIPmixHMM are alternatives to peak-detection methods. Analysing several replicates simultaneously + modelling the spatial dependency = more accurate conditional probabilities. MultiChIPmixHMM is available as an R package: Bérard et al. (2013), BMC Bioinformatics.

Presentation outline
1. Introduction
2. Mixture model definition
3. Genomic examples
4. Conclusions

Conclusions. Mixtures reveal underlying structures. The key ingredients are P(Z) and P(Y | Z). For genomic data, modeling the component distribution is sometimes tricky, especially for RNA-seq data. Applications to genomic data sometimes raise new methodological questions about parameter inference and classification rules. Examples of R packages using mixtures: Mclust, Rmixmod, MultiChIPmixHMM, HTSDiff, HTSCluster, poisson.glm.mix.

Acknowledgements (Statistics, Bioinformatics, Biology): S. Robin, V. Brunaud, J-P. Renou, T. Mary-Huard, J-P Tamby, E. Delannoy, C. Bérard, R. Zaag, S. Balzergue, G. Celeux, Z. Tariq, C. Maugis-Rabusseau, V. Colot, G. Rigaill, F. Roudier, A. Rau, P. Papastamoulis, M. Seifert. Thank you for your attention!