Mixture models for analysing transcriptome and ChIP-chip data


1 Mixture models for analysing transcriptome and ChIP-chip data Marie-Laure Martin-Magniette French National Institute for agricultural research (INRA) Unit of Applied Mathematics and Informatics at AgroParisTech, Paris Unit of Plant Genomics Research (URGV), Evry M.L Martin-Magniette (INRA) Mixture models and genomic data 7-11 July / 30

2 Presentation outline
1 Introduction
2 Mixture model definition
3 Genomic examples
4 Conclusions

3 Introduction
- Observations are described by 2 variables
- The distribution of the observations seems easy to model with one Gaussian

4 Introduction
- Observations are described by 2 variables
- Data are scattered and subpopulations are observed
- According to the experimental design, there is no external information about them
- This is an underlying structure observed through the data

5 Introduction
Definition of a mixture model
- It is a probabilistic model for representing the presence of subpopulations within an overall population
- A latent variable Z is introduced to indicate the subpopulation from which each observation comes
[Figure: panels "what we observe", "the model" and "the expected results"; legend: Z = ?, Z = 1, 2, 3]
- It is an unsupervised classification method

6 Functional annotation is the new challenge
- It is now relatively easy to sequence an organism and to locate its genes
- But between 20% and 40% of the genes have an unknown function
- For Arabidopsis thaliana, 16% of the genes are orphan genes, i.e. genes without any information on their function
- With high-throughput technologies, it is now possible to improve the functional annotation

7 First genomic example: co-expression analysis
- Co-expressed genes are good candidates to be involved in the same biological process (Eisen et al., 1998)
- Pearson correlation values are often used to measure co-expression, but this is a local point of view
- Co-expression analysis can be recast as a search for an underlying structure in a whole dataset
Table: Examples of co-expression clusters of genes observed on 45 independent transcriptome experiments. Clusters are identified with a mixture.

8 Second example: ChIP-chip analysis
- These experiments aim at identifying interactions between a protein and DNA
- Most methods look for peaks of log(IP/Input) along the genome
- There is an underlying structure between the two samples

9 Presentation outline
1 Introduction
2 Mixture model definition
3 Genomic examples
4 Conclusions

10 Key ingredients of a mixture model
[Figure: panels "what we observe", "the model" and "the expected results"; legend: Z = ?, Z = 1, 2, 3]
Let $y = (y_1, \dots, y_n)$ denote n observations with $y_i \in \mathbb{R}^Q$ and let $Z = (Z_1, \dots, Z_n)$ be the latent vector.
1) Distribution of Z: the $\{Z_i\}$ are assumed to be independent and
$$P(Z_i = k) = \pi_k \quad \text{with} \quad \sum_{k=1}^{K} \pi_k = 1, \qquad Z \sim \mathcal{M}(n; \pi_1, \dots, \pi_K),$$
where K is the number of components of the mixture.
2) Distribution of $(y_i \mid Z_i = k)$: a parametric distribution $f(\cdot\,; \gamma_k)$.
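The two ingredients above can be read as a recipe for simulating data. The following is a minimal sketch, assuming univariate Gaussian components (Q = 1); the function name `sample_mixture` and the parameter values are illustrative and do not come from the slides.

```python
import random

def sample_mixture(n, weights, means, sds, seed=0):
    """Draw n observations from a univariate Gaussian mixture.

    Returns (y, z): the observations and their latent labels (0-based).
    """
    rng = random.Random(seed)
    components = list(range(len(weights)))
    y, z = [], []
    for _ in range(n):
        # 1) draw the latent label Z_i with P(Z_i = k) = pi_k
        k = rng.choices(components, weights=weights)[0]
        # 2) draw y_i from f(.; gamma_k), here assumed to be N(mean_k, sd_k^2)
        y.append(rng.gauss(means[k], sds[k]))
        z.append(k)
    return y, z

# Illustrative parameters (hypothetical): K = 3 components
y, z = sample_mixture(1000, [0.5, 0.3, 0.2], [-4.0, 0.0, 5.0], [1.0, 1.0, 1.0])
```

In a real analysis only `y` is observed; `z` is exactly what the mixture model tries to recover.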

11 Some properties
- The $\{Z_i\}$ are independent
- The $\{Y_i\}$ are independent conditionally on the $\{Z_i\}$
- The couples $\{(Y_i, Z_i)\}$ are i.i.d.
- The model is invariant under any permutation of the labels $\{1, \dots, K\}$: the mixture model has $K!$ equivalent definitions
Distribution of Y:
$$P(Y \mid K, \theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(Y_i, Z_i = k) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(Z_i = k)\, P(Y_i \mid Z_i = k) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k f(Y_i; \gamma_k)$$
It is a weighted sum of parametric distributions, known up to the parameter vector $\theta = (\pi_1, \dots, \pi_{K-1}, \gamma_1, \dots, \gamma_K)$.
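As a small illustration of the distribution of Y, the sketch below evaluates the mixture density and the log-likelihood, and checks numerically that permuting the component labels leaves the likelihood unchanged. Gaussian components and all numerical values are assumptions made for the example.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) at y."""
    u = (y - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(y, weights, means, sds):
    """sum_k pi_k f(y; gamma_k) for Gaussian components."""
    return sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))

def log_likelihood(ys, weights, means, sds):
    """log P(Y | K, theta) = sum_i log sum_k pi_k f(y_i; gamma_k)."""
    return sum(math.log(mixture_density(y, weights, means, sds)) for y in ys)

ys = [-1.0, 0.5, 2.0]  # hypothetical observations
ll_original = log_likelihood(ys, [0.3, 0.7], [0.0, 2.0], [1.0, 0.5])
# Same model with the two labels swapped: one of the K! equivalent definitions
ll_permuted = log_likelihood(ys, [0.7, 0.3], [2.0, 0.0], [0.5, 1.0])
```

The two values agree up to floating-point rounding, which is why component labels carry no meaning by themselves.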

12 Statistical inference of incomplete data models
Maximum likelihood estimate:
$$\hat{\theta} = \arg\max_{\theta} \log P(Y \mid K, \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log \left[ \sum_{k=1}^{K} \pi_k f(Y_i; \gamma_k) \right]$$
Direct maximisation is not always possible, since the likelihood, once expanded, involves $K^n$ terms.
Expectation-Maximization algorithm: an iterative algorithm based on the expectation of the log-likelihood of the completed data, conditionally on the observations and $\theta^{(l)}$:
$$\theta^{(l+1)} = \arg\max_{\theta} E\left\{ \log P(Y, Z \mid K, \theta) \mid Y, \theta^{(l)} \right\}$$
According to the theory, $\log P(Y \mid K, \theta^{(l)})$ increases at each iteration and tends toward a local maximum.

13 EM algorithm details
Initialisation of $\theta^{(0)}$
While the convergence criterion is not reached, iterate:
- E-step: calculation of the conditional probabilities
$$\tau_{ik}^{(l)} = P(Z_i = k \mid y_i, \theta^{(l)}) = \frac{\pi_k^{(l)} f(y_i; \gamma_k^{(l)})}{\sum_{k'=1}^{K} \pi_{k'}^{(l)} f(y_i; \gamma_{k'}^{(l)})}$$
- M-step: calculation of $\theta$ by maximising the complete likelihood, where Z is replaced with the conditional probabilities
$$\theta^{(l+1)} = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}^{(l)} \left[ \log \pi_k + \log f(y_i; \gamma_k) \right]$$
This gives a weighted version of the usual maximum likelihood estimates (MLE).
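The E-step and M-step can be sketched in a few lines for the simplest case. This is a minimal illustration, assuming univariate Gaussian components and a quantile-based initialisation; both choices are assumptions for the example, not the strategies implemented in mclust or Rmixmod, and the function name `em_gaussian_mixture` is hypothetical.

```python
import math
import random

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) at y."""
    u = (y - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))

def em_gaussian_mixture(ys, K, max_iter=500, tol=1e-8):
    """EM for a univariate Gaussian mixture with K components."""
    n = len(ys)
    ordered = sorted(ys)
    # Initialisation of theta^(0): evenly spaced quantiles as means (assumption)
    means = [ordered[(2 * k + 1) * n // (2 * K)] for k in range(K)]
    centre = sum(ys) / n
    overall_sd = math.sqrt(sum((y - centre) ** 2 for y in ys) / n)
    sds = [overall_sd / K] * K
    weights = [1.0 / K] * K
    previous = -math.inf
    for _ in range(max_iter):
        # E-step: tau_ik = pi_k f(y_i; gamma_k) / sum_k' pi_k' f(y_i; gamma_k')
        tau, loglik = [], 0.0
        for y in ys:
            joint = [w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds)]
            g = sum(joint)
            loglik += math.log(g)
            tau.append([j / g for j in joint])
        # Convergence criterion: increase of the log-likelihood below tol
        if loglik - previous < tol:
            break
        previous = loglik
        # M-step: weighted versions of the usual MLE
        for k in range(K):
            n_k = sum(row[k] for row in tau)
            weights[k] = n_k / n
            means[k] = sum(row[k] * y for row, y in zip(tau, ys)) / n_k
            var_k = sum(row[k] * (y - means[k]) ** 2 for row, y in zip(tau, ys)) / n_k
            sds[k] = math.sqrt(max(var_k, 1e-12))  # guard against a degenerate variance
    # tau and loglik come from the last E-step that was run
    return weights, means, sds, tau, loglik

# Simulated data (hypothetical): two well-separated subpopulations
rng = random.Random(1)
ys = [rng.gauss(-3.0, 1.0) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]
weights, means, sds, tau, loglik = em_gaussian_mixture(ys, K=2)
```

On such well-separated data the estimated means should land close to the simulated values; with overlapping components or a poor starting point, the result depends much more strongly on the initialisation.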

14 EM algorithm properties
- Convergence is always reached, but not always toward a global maximum
- The EM algorithm is sensitive to the initialisation step
- The EM algorithm is available in all good statistical software
- In R, it is available in the mclust and Rmixmod packages. Rmixmod proposes the best initialisation strategy

15 Outputs of the model
Distribution:
$$g(y_i) = \pi_1 f(y_i; \gamma_1) + \pi_2 f(y_i; \gamma_2) + \pi_3 f(y_i; \gamma_3)$$
Conditional probabilities:
$$\tau_{ik} = P(Z_i = k \mid y_i) = \frac{\pi_k f(y_i; \gamma_k)}{g(y_i)}$$
[Table: $\tau_{ik}$ (%) for observations i = 1, 2, 3 and each component k; values not preserved]
These probabilities enable the classification of the observations into the subpopulations.

16 Outputs of the model
Maximum A Posteriori (MAP) rule: each observation is classified into the component for which its conditional probability $\tau_{ik} = P(Z_i = k \mid y_i)$ is the highest, i.e. $\hat{z}_i = \arg\max_k \tau_{ik}$.
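The MAP rule amounts to one line of code once the conditional probabilities are available. In this sketch the matrix `tau` holds made-up values for three observations and three components, since the values of the original table are not available.

```python
def map_classification(tau):
    """Assign each observation to the component with the highest tau_ik (0-based)."""
    return [max(range(len(row)), key=row.__getitem__) for row in tau]

# Hypothetical conditional probabilities: rows are observations, columns are components
tau = [
    [0.90, 0.08, 0.02],
    [0.10, 0.25, 0.65],
    [0.40, 0.45, 0.15],
]
labels = map_classification(tau)  # [0, 2, 1]
```

The third observation shows the limit of the rule: it is assigned to component 1 although its classification is uncertain (0.45 against 0.40), so it can be useful to keep the probabilities alongside the labels.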

17 Model selection
- The number of components of the mixture is often unknown
- A collection of models is considered, where K varies between 2 and $K_{\max}$
- The best model is the one maximising a criterion
Bayesian Information Criterion (BIC)
- It is a proxy of the integrated likelihood
$$P(Y \mid K) = \int P(Y \mid K, \theta)\, \pi(\theta \mid K)\, d\theta$$
- It aims at finding a good number of components for a global fit of the data distribution
$$\mathrm{BIC}(K) = \log P(Y \mid K, \hat{\theta}) - \frac{\nu_K}{2} \log(n)$$
where $\nu_K$ is the number of free parameters of the model and $P(Y \mid K, \hat{\theta})$ is the maximum likelihood under this model.
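A minimal sketch of BIC-based selection, using the convention of this slide (the criterion is maximised). The parameter count assumes a univariate Gaussian mixture, and the maximised log-likelihoods are hypothetical numbers chosen for illustration.

```python
import math

def n_free_parameters(K):
    """nu_K for a univariate Gaussian mixture: K-1 proportions, K means, K variances."""
    return (K - 1) + 2 * K

def bic(loglik, K, n):
    """BIC(K) = log P(Y | K, theta_hat) - nu_K / 2 * log(n), to be maximised."""
    return loglik - 0.5 * n_free_parameters(K) * math.log(n)

# Hypothetical maximised log-likelihoods for K = 1, 2, 3 on n = 400 observations
logliks = {1: -1200.0, 2: -1000.0, 3: -998.0}
n = 400
bic_values = {K: bic(ll, K, n) for K, ll in logliks.items()}
best_K = max(bic_values, key=bic_values.get)  # 2
```

Going from K = 2 to K = 3 gains only 2 units of log-likelihood, which does not offset the penalty for three extra parameters, so K = 2 is retained.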

18 Model selection
- The number of components of the mixture is often unknown
- A collection of models is considered, where K varies between 2 and $K_{\max}$
- The best model is the one maximising a criterion
Integrated Completed Likelihood (ICL)
- It is a proxy of the integrated complete likelihood $P(Y, Z \mid K)$
- It is dedicated to classification, since it strongly penalizes models for which the classification is uncertain
$$\mathrm{ICL}(K) = \mathrm{BIC}(K) + \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik} \log \tau_{ik}$$
where $\nu_K$ is the number of free parameters of the model and $P(Y \mid K, \hat{\theta})$ is the maximum likelihood under this model.
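The extra term of ICL is the sum of $\tau_{ik} \log \tau_{ik}$, which is zero when every observation is classified with certainty and negative otherwise. The sketch below computes it from a BIC value and a matrix of conditional probabilities; all inputs are hypothetical.

```python
import math

def icl(bic_value, tau):
    """ICL(K) = BIC(K) + sum_i sum_k tau_ik log(tau_ik), with 0 log 0 taken as 0."""
    entropy_term = sum(t * math.log(t) for row in tau for t in row if t > 0.0)
    return bic_value + entropy_term

# Hypothetical BIC value and conditional probabilities
certain = [[1.0, 0.0], [0.0, 1.0]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
icl_certain = icl(-10.0, certain)      # no penalty: equals the BIC value
icl_uncertain = icl(-10.0, uncertain)  # penalised by 2 * log(2)
```

For the same BIC value, the model with an ambiguous classification gets the lower ICL, which is the behaviour described on the slide.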

19 Conclusions on the model selection
- BIC aims at finding a good number of components for a global fit of the data distribution. It tends to overestimate the number of components
- ICL is dedicated to a classification purpose. It strongly penalizes models for which the classification is uncertain
- Whatever the criterion, it must be a convex function of the number of components
[Figure: two criterion curves, labelled "Bad behavior" and "Correct behavior"]
- A non-convex function may indicate a modelling issue

20 Presentation outline
1 Introduction
2 Mixture model definition
3 Genomic examples
- Mixtures for co-expression analysis
- Mixtures for analysing ChIP-chip data
4 Conclusions

21 GEM2Net: From gene expression modeling to -omics network
Goal: explore the orphan gene space to identify new genes involved in defense and adaptation processes
Method: predict co-expression networks using mixture models
Data: an original resource generated by the transcriptomic platform of URGV
- Homogeneous data generated with the CATMA microarray
- 5,095 genes not present in the Affymetrix chip
- High diversity of biological samples relative to stress conditions

22 Workflow overview
- Extraction from CATdb of 387 stress comparisons
- 17,264 genes are differentially expressed in at least one of these comparisons (FWER controlled at 5% over all the tests)
- Analyses performed with Gaussian mixture models
- According to the BIC curve, the naive clustering on the whole dataset is not relevant
- Gene co-expression depends on the stress categories

23 Results of the co-expression analysis
- 18 categories (9 biotic and 9 abiotic), identification of 681 clusters
- Large overlap between biotic and abiotic clusters
- 98% of clusters have a functional bias for a Gene Ontology term
- 80% are associated with a stress term
- 39% have a preferential sub-cellular localization in the plastid
- 18% are enriched in transcription factors (TF); for stifenia, no cluster is enriched in TF

24 Focus on nematode stress
- Genes described by 10 expression differences
- 29 clusters of co-expression identified
- 1519 genes with a conditional probability close to 1
Example of cluster
- 49 genes repressed from 14 days after infection
- 13 genes known to be involved in stress response
- 10 orphan genes
- Endoplasmic reticulum bias

25 GEM2Net database
- Integration of various resources: gene ontology, genes involved in stress responses, gene families (transcription factors and hormones) and protein-protein interactions (experimental and predicted)
- Original representation and interactive visualization, using pie charts to summarize the functional biases at first glance

26 ChIP-chip experiments
- The log-ratio is not tractable, whereas the pair (IP, Input) is
- Development of a mixture of two linear regressions

27 MultiChIPmix: Mixture of two linear regressions
Let $Z_i$ be the status of probe i: $P(Z_i = 1) = \pi$
The linear relation between IP and Input depends on the probe status:
$$IP_{ir} = \begin{cases} a_{0r} + b_{0r}\, Input_{ir} + E_{ir} & \text{if } Z_i = 0 \text{ (normal)} \\ a_{1r} + b_{1r}\, Input_{ir} + E_{ir} & \text{if } Z_i = 1 \text{ (enriched)} \end{cases} \qquad V(IP_{ir}) = \sigma_r^2$$
where r indexes the replicate.
Martin-Magniette et al. (2008), Bioinformatics
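To make the role of the two regression lines concrete, the sketch below computes the conditional probability that a probe is enriched, given its (IP, Input) pair and a set of parameters. It is a simplified illustration, assuming a single replicate, Gaussian errors and known parameters; it is not the MultiChIPmix implementation, and all names and values are hypothetical.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) at y."""
    u = (y - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))

def enrichment_posterior(ip, inp, pi, a0, b0, a1, b1, sigma):
    """P(Z_i = 1 | IP_i, Input_i) for a mixture of two linear regressions."""
    normal = (1.0 - pi) * normal_pdf(ip, a0 + b0 * inp, sigma)  # Z_i = 0 (normal)
    enriched = pi * normal_pdf(ip, a1 + b1 * inp, sigma)        # Z_i = 1 (enriched)
    return enriched / (normal + enriched)

# Hypothetical parameters: the enriched line lies 2 units above the normal line
params = dict(pi=0.1, a0=0.0, b0=1.0, a1=2.0, b1=1.0, sigma=0.5)
p_on_enriched_line = enrichment_posterior(ip=10.0, inp=8.0, **params)
p_on_normal_line = enrichment_posterior(ip=8.0, inp=8.0, **params)
```

A probe lying on the enriched line gets a probability close to 1 even though only 10% of probes are assumed enriched, while a probe on the normal line gets a probability close to 0.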


29 MultiChIPmix has been used to
- create the first epigenomic map of Arabidopsis thaliana: Roudier et al. (2011), EMBO Journal
- study the additive inheritance of histone modifications in Arabidopsis thaliana intra-specific hybrids: Moghaddam et al. (2011), Plant Journal

30 MultiChIPmixHMM for taking the spatial information into account
- When probes are (almost) equally spaced along the genome, hybridisation signals tend to be clustered
- Assuming that the probe statuses are (Markov-)dependent allows this information to be included in the model:
$$\{Z_i\} \sim \mathrm{MC}(\pi, \nu), \qquad \pi_{kl} = \Pr\{Z_i = k \mid Z_{i-1} = l\}$$
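The effect of the Markov assumption can be seen by simulating probe statuses. This sketch compares a "sticky" chain with independent statuses having the same proportion of enriched probes; the transition values are hypothetical, and the code stores the matrix by rows, so that `trans[l][k]` is $\Pr\{Z_i = k \mid Z_{i-1} = l\}$.

```python
import random

def simulate_statuses(n, trans, init, seed=0):
    """Simulate n probe statuses (0 = normal, 1 = enriched) from a Markov chain.

    trans[l][k] = P(Z_i = k | Z_{i-1} = l); init is the distribution of Z_1.
    """
    rng = random.Random(seed)
    z = [rng.choices([0, 1], weights=init)[0]]
    for _ in range(n - 1):
        z.append(rng.choices([0, 1], weights=trans[z[-1]])[0])
    return z

def mean_run_length(z):
    """Average length of the stretches of consecutive identical statuses."""
    runs = 1 + sum(1 for a, b in zip(z, z[1:]) if a != b)
    return len(z) / runs

# Hypothetical transition matrices; both give 20% of enriched probes in the long run
sticky = [[0.95, 0.05], [0.20, 0.80]]
independent = [[0.80, 0.20], [0.80, 0.20]]
z_sticky = simulate_statuses(5000, sticky, init=[0.8, 0.2])
z_indep = simulate_statuses(5000, independent, init=[0.8, 0.2])
run_sticky = mean_run_length(z_sticky)
run_indep = mean_run_length(z_indep)
```

The sticky chain produces much longer stretches of identical statuses, which is the clustering of signals along the genome that the hidden Markov version of the model is meant to capture.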

31 Table: Example of one known H3K27me3 target gene identified only with MultiChIPmixHMM.
- MultiChIPmix and MultiChIPmixHMM are alternative methods to peak detection
- Analysing several replicates simultaneously and modelling the spatial dependency gives more accurate conditional probabilities
- MultiChIPmixHMM is available as an R package: Bérard et al. (2013), BMC Bioinformatics

32 Presentation outline
1 Introduction
2 Mixture model definition
3 Genomic examples
4 Conclusions

33 Conclusions
- Mixtures reveal underlying structures
- Key ingredients are $P(Z)$ and $P(Y \mid Z)$
- For genomic data, component distribution modeling is sometimes tricky, especially for RNA-Seq data
- Applications on genomic data sometimes raise new methodological questions about the parameter inference and classification rules
- Examples of R packages using mixtures: Mclust, Rmixmod, MultiChIPmixHMM, HTSDiff, HTSCluster, poisson.glm.mix

34 Acknowledgements
Statistics Bioinformatics Biology S. Robin V. Brunaud J-P. Renou T. Mary-Huard J-P Tamby E. Delannoy C. Bérard R. Zaag S. Balzergue G. Celeux Z. Tariq C. Maugis-Rabusseau V. Colot G. Rigaill F. Roudier A. Rau P. Papastamoulis M. Seifert
Thank you for your attention!


More information

A segmentation-clustering problem for the analysis of array CGH data

A segmentation-clustering problem for the analysis of array CGH data A segmentation-clustering problem for the analysis of array CGH data F. Picard, S. Robin, E. Lebarbier, J-J. Daudin UMR INA P-G / ENGREF / INRA MIA 518 APPLIED STOCHASTIC MODELS AND DATA ANALYSIS Brest

More information

Bayesian Methods: Naïve Bayes

Bayesian Methods: Naïve Bayes Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 6 Jan-Willem van de Meent (credit: Yijun Zhao, Chris Bishop, Andrew Moore, Hastie et al.) Project Project Deadlines 3 Feb: Form teams of

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Multidimensional data analysis in biomedicine and epidemiology

Multidimensional data analysis in biomedicine and epidemiology in biomedicine and epidemiology Katja Ickstadt and Leo N. Geppert Faculty of Statistics, TU Dortmund, Germany Stakeholder Workshop 12 13 December 2017, PTB Berlin Supported by Deutsche Forschungsgemeinschaft

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Statistical analysis of biological networks.

Statistical analysis of biological networks. Statistical analysis of biological networks. Assessing the exceptionality of network motifs S. Schbath Jouy-en-Josas/Evry/Paris, France http://genome.jouy.inra.fr/ssb/ Colloquium interactions math/info,

More information

scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017

scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017 scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017 Olga (NBIS) scrna-seq de October 2017 1 / 34 Outline Introduction: what

More information

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Bioinformatics. Dept. of Computational Biology & Bioinformatics Bioinformatics Dept. of Computational Biology & Bioinformatics 3 Bioinformatics - play with sequences & structures Dept. of Computational Biology & Bioinformatics 4 ORGANIZATION OF LIFE ROLE OF BIOINFORMATICS

More information

Improving Gene Functional Analysis in Ethylene-induced Leaf Abscission using GO and ProteInOn

Improving Gene Functional Analysis in Ethylene-induced Leaf Abscission using GO and ProteInOn Improving Gene Functional Analysis in Ethylene-induced Leaf Abscission using GO and ProteInOn Sara Domingos 1, Cátia Pesquita 2, Francisco M. Couto 2, Luis F. Goulao 3, Cristina Oliveira 1 1 Instituto

More information

Statistical learning. Chapter 20, Sections 1 4 1

Statistical learning. Chapter 20, Sections 1 4 1 Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

More information

Bioinformatics 2 - Lecture 4

Bioinformatics 2 - Lecture 4 Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what

More information

PACKAGE LMest FOR LATENT MARKOV ANALYSIS

PACKAGE LMest FOR LATENT MARKOV ANALYSIS PACKAGE LMest FOR LATENT MARKOV ANALYSIS OF LONGITUDINAL CATEGORICAL DATA Francesco Bartolucci 1, Silvia Pandofi 1, and Fulvia Pennoni 2 1 Department of Economics, University of Perugia (e-mail: francesco.bartolucci@unipg.it,

More information

Gibbs Sampling Methods for Multiple Sequence Alignment

Gibbs Sampling Methods for Multiple Sequence Alignment Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical

More information

The combined use of Arabidopsis thaliana and Lepidium sativum to find conserved mechanisms of seed germination within the Brassicaceae family

The combined use of Arabidopsis thaliana and Lepidium sativum to find conserved mechanisms of seed germination within the Brassicaceae family www.seedbiology.de The combined use of Arabidopsis thaliana and Lepidium sativum to find conserved mechanisms of seed germination within the Brassicaceae family Linkies, A., Müller, K., Morris, K., Gräber,

More information

Genome 541 Gene regulation and epigenomics Lecture 2 Transcription factor binding using functional genomics

Genome 541 Gene regulation and epigenomics Lecture 2 Transcription factor binding using functional genomics Genome 541 Gene regulation and epigenomics Lecture 2 Transcription factor binding using functional genomics I believe it is helpful to number your slides for easy reference. It's been a while since I took

More information

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26 Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar

More information

Computational Genomics. Reconstructing dynamic regulatory networks in multiple species

Computational Genomics. Reconstructing dynamic regulatory networks in multiple species 02-710 Computational Genomics Reconstructing dynamic regulatory networks in multiple species Methods for reconstructing networks in cells CRH1 SLT2 SLR3 YPS3 YPS1 Amit et al Science 2009 Pe er et al Recomb

More information

Design of Microarray Experiments. Xiangqin Cui

Design of Microarray Experiments. Xiangqin Cui Design of Microarray Experiments Xiangqin Cui Experimental design Experimental design: is a term used about efficient methods for planning the collection of data, in order to obtain the maximum amount

More information

Pattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM

Pattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM Pattern Recognition and Machine Learning Chapter 9: Mixture Models and EM Thomas Mensink Jakob Verbeek October 11, 27 Le Menu 9.1 K-means clustering Getting the idea with a simple example 9.2 Mixtures

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Clustering high-throughput sequencing data with Poisson mixture models

Clustering high-throughput sequencing data with Poisson mixture models Clustering high-throughput sequencing data with Poisson mixture models Andrea Rau, Gilles Celeux, Marie-Laure Martin-Magniette, Cathy Maugis-Rabusseau To cite this version: Andrea Rau, Gilles Celeux, Marie-Laure

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Inferring Protein-Signaling Networks

Inferring Protein-Signaling Networks Inferring Protein-Signaling Networks Lectures 14 Nov 14, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 1

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 218 Outlines Overview Introduction Linear Algebra Probability Linear Regression 1

More information

Bayesian Learning (II)

Bayesian Learning (II) Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Penalized Loss functions for Bayesian Model Choice

Penalized Loss functions for Bayesian Model Choice Penalized Loss functions for Bayesian Model Choice Martyn International Agency for Research on Cancer Lyon, France 13 November 2009 The pure approach For a Bayesian purist, all uncertainty is represented

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

Determining the number of components in mixture models for hierarchical data

Determining the number of components in mixture models for hierarchical data Determining the number of components in mixture models for hierarchical data Olga Lukočienė 1 and Jeroen K. Vermunt 2 1 Department of Methodology and Statistics, Tilburg University, P.O. Box 90153, 5000

More information

Maximum Likelihood Estimation. only training data is available to design a classifier

Maximum Likelihood Estimation. only training data is available to design a classifier Introduction to Pattern Recognition [ Part 5 ] Mahdi Vasighi Introduction Bayesian Decision Theory shows that we could design an optimal classifier if we knew: P( i ) : priors p(x i ) : class-conditional

More information

Modelling Transcriptional Regulation with Gaussian Processes

Modelling Transcriptional Regulation with Gaussian Processes Modelling Transcriptional Regulation with Gaussian Processes Neil Lawrence School of Computer Science University of Manchester Joint work with Magnus Rattray and Guido Sanguinetti 8th March 7 Outline Application

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Gaussian Mixture Models

Gaussian Mixture Models Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some

More information

Statistical Inferences for Isoform Expression in RNA-Seq

Statistical Inferences for Isoform Expression in RNA-Seq Statistical Inferences for Isoform Expression in RNA-Seq Hui Jiang and Wing Hung Wong February 25, 2009 Abstract The development of RNA sequencing (RNA-Seq) makes it possible for us to measure transcription

More information

Statistical Clustering of Vesicle Patterns Practical Aspects of the Analysis of Large Datasets with R

Statistical Clustering of Vesicle Patterns Practical Aspects of the Analysis of Large Datasets with R Statistical Clustering of Vesicle Patterns Mirko Birbaumer Rmetrics Workshop 3th July 2008 1 / 23 Statistical Clustering of Vesicle Patterns Practical Aspects of the Analysis of Large Datasets with R Mirko

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering

More information

Computational methods for predicting protein-protein interactions

Computational methods for predicting protein-protein interactions Computational methods for predicting protein-protein interactions Tomi Peltola T-61.6070 Special course in bioinformatics I 3.4.2008 Outline Biological background Protein-protein interactions Computational

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information