Discovering molecular pathways from protein interaction and ge

Similar documents
Learning in Bayesian Networks

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Analysis and visualization of protein-protein interactions. Olga Vitek Assistant Professor Statistics and Computer Science

Bayesian Hierarchical Classification. Seminar on Predicting Structured Data Jukka Kohonen

10-810: Advanced Algorithms and Models for Computational Biology. Optimal leaf ordering and classification

Predicting Protein Functions and Domain Interactions from Protein Interactions

Non-Parametric Bayes

Mixture models for analysing transcriptome and ChIP-chip data

Clustering & microarray technology

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

STA 4273H: Statistical Machine Learning

Cluster Analysis of Gene Expression Microarray Data. BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002

CS6220: DATA MINING TECHNIQUES

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

CSE446: Clustering and EM Spring 2017

Inferring Transcriptional Regulatory Networks from Gene Expression Data II

Dynamic Approaches: The Hidden Markov Model

Introduction to Bioinformatics

Clustering using Mixture Models

CS6220: DATA MINING TECHNIQUES

Intelligent Systems (AI-2)

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Data Preprocessing. Cluster Similarity

Intelligent Systems (AI-2)

Protein Complex Identification by Supervised Graph Clustering

3 Undirected Graphical Models

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Inferring Protein-Signaling Networks II

Mathematical Formulation of Our Example

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

BMD645. Integration of Omics

Introduction to Bioinformatics

Introduction to clustering methods for gene expression data analysis

Naïve Bayes classification

Chapter 17: Undirected Graphical Models

O 3 O 4 O 5. q 3. q 4. Transition

Naive Bayes classification

Probability Theory for Machine Learning. Chris Cremer September 2015

Decision-making, inference, and learning theory. ECE 830 & CS 761, Spring 2016

Discovering modules in expression profiles using a network

CSC 412 (Lecture 4): Undirected Graphical Models

Learning Bayesian network : Given structure and completely observed data

Generative Clustering, Topic Modeling, & Bayesian Inference

Gibbs Field & Markov Random Field

STATS 306B: Unsupervised Learning Spring Lecture 3 April 7th

Prediction of double gene knockout measurements

Probabilistic Methods in Bioinformatics. Pabitra Mitra

Computational Genomics. Reconstructing dynamic regulatory networks in multiple species

Comparative Network Analysis

ECE521 Lecture 19 HMM cont. Inference in HMM

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

An Introduction to Bayesian Machine Learning

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

Bayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan

Data Mining in Bioinformatics HMM

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

CPSC 540: Machine Learning

Supplementary online material

13: Variational inference II

Dynamic Data Modeling, Recognition, and Synthesis. Rui Zhao Thesis Defense Advisor: Professor Qiang Ji

Mobile Robot Localization

Introduction to clustering methods for gene expression data analysis

CS145: INTRODUCTION TO DATA MINING

{ p if x = 1 1 p if x = 0

Mobile Robot Localization

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008

Causal Discovery by Computer

GLOBEX Bioinformatics (Summer 2015) Genetic networks and gene expression data

Computational Genomics

STA 4273H: Statistical Machine Learning

A Complex-based Reconstruction of the Saccharomyces cerevisiae Interactome* S

COMP90051 Statistical Machine Learning

Inferring Transcriptional Regulatory Networks from High-throughput Data

Microarray Data Analysis: Discovery

Identifying Signaling Pathways

Probabilistic sparse matrix factorization with an application to discovering gene functions in mouse mrna expression data

Conditional Random Field

p(d θ ) l(θ ) 1.2 x x x

Machine Learning, Midterm Exam: Spring 2009 SOLUTION

STA 4273H: Statistical Machine Learning

Latent Variable models for GWAs

Example: multivariate Gaussian Distribution

Functional Characterization and Topological Modularity of Molecular Interaction Networks

Introduction to Probabilistic Machine Learning

The Monte Carlo Method: Bayesian Networks

COS513 LECTURE 8 STATISTICAL CONCEPTS

Learning Gaussian Graphical Models with Unknown Group Sparsity

EFFICIENT AND ROBUST PREDICTION ALGORITHMS FOR PROTEIN COMPLEXES USING GOMORY-HU TREES

Bayesian Methods for Machine Learning

CSCE 478/878 Lecture 6: Bayesian Learning

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

STA 414/2104: Machine Learning

Markov Random Field Models of Transient Interactions Between Protein Complexes in Yeast

The Bayes classifier

Intelligent Systems:

Expectation maximization

Computer Vision Group Prof. Daniel Cremers. 3. Regression

Genetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig

Written Exam 15 December Course name: Introduction to Systems Biology Course no

Transcription:

Discovering molecular pathways from protein interaction and gene expression data 9-4-2008

Aim To have a mechanism for inferring pathways from gene expression and protein interaction data.

Motivation Why search for pathways Pathway Set of genes that coordinate to achieve a specific task.

Motivation Why search for pathways Pathway Set of genes that coordinate to achieve a specific task. What do we gain from understanding pathways 1. A coherent global picture of (condition-specific) cellular activity. 2. Application to disease mechanisms.

Motivation Why use two kinds of data 2 properties of (many) pathways

Motivation Why use two kinds of data 2 properties of (many) pathways (A) Genes in the same pathway are activated together exhibit similar expression profiles.

Motivation Why use two kinds of data 2 properties of (many) pathways (A) Genes in the same pathway are activated together exhibit similar expression profiles. (B) When genes coordinate to achieve a particular task, their protein products often interact.

Motivation Why use two kinds of data 2 properties of (many) pathways (A) Genes in the same pathway are activated together exhibit similar expression profiles. (B) When genes coordinate to achieve a particular task, their protein products often interact. Each data type alone is a weaker indicator of pathway activity.

Intuitive Idea Introduction Expression profiles Protein interaction Markov random field Unified model Detect group of genes that are co-expressed, and whose products interact in the protein data.

Expression profiles Protein interaction Markov random field Unified model Intuitive Idea Detect group of genes that are co-expressed, and whose products interact in the protein data. Create a model for gene expression data. Create a model for protein interaction data.

Expression profiles Protein interaction Markov random field Unified model Intuitive Idea Detect group of genes that are co-expressed, and whose products interact in the protein data. Create a model for gene expression data. Create a model for protein interaction data. Join them.

Basics Introduction Expression profiles Protein interaction Markov random field Unified model Gene Set of genes G = {1,..., n}.

Basics Introduction Expression profiles Protein interaction Markov random field Unified model Gene Set of genes G = {1,..., n}. Each gene g has two attributes:

Basics Introduction Expression profiles Protein interaction Markov random field Unified model Gene Set of genes G = {1,..., n}. Each gene g has two attributes: Class (pathway), denoted by g.c (discrete value).

Basics Introduction Expression profiles Protein interaction Markov random field Unified model Gene Set of genes G = {1,..., n}. Each gene g has two attributes: Class (pathway), denoted by g.c (discrete value). Expression in microarray i, denoted by g.ei.

Basics Introduction Expression profiles Protein interaction Markov random field Unified model Gene Set of genes G = {1,..., n}. Each gene g has two attributes: Class (pathway), denoted by g.c (discrete value). Expression in microarray i, denoted by g.ei. If there are m microarrays g.e = {g.e 1,..., g.em}.

Expression profiles Protein interaction Markov random field Unified model Model for expression profiles Naive Bayes Naive Bayes given the class label g.c, g.e i and G.E j are independent.

Expression profiles Protein interaction Markov random field Unified model Model for expression profiles Naive Bayes Class probability g.c follows a multinomial probability distribution p(g.c = k) = θ k K i=1 θ i = 1

Expression profiles Protein interaction Markov random field Unified model Model for expression profiles Naive Bayes Expression profiles g.e i g.c = k N(µ ki, σ 2 ki ) A pathway i specifies the average expression level for each microarray and also the variance.

Expression profiles Protein interaction Markov random field Unified model Model for expression profiles Naive Bayes Example: 1 pathway, 10 genes, 3 microarrays Pathway specifies the averages µ = (15, 60, 50) What is the most likely expression matrix?

Expression profiles Protein interaction Markov random field Unified model Model for expression profiles Naive Bayes Example: 1 pathway, 10 genes, 3 microarrays Pathway specifies the averages µ = (15, 60, 50) What is the most likely expression matrix? (The matrix on the left)

Expression profiles Protein interaction Markov random field Unified model Model for protein interaction Markov random field Undirected graph, V = {g 1.C,..., g n.c}, E =set of protein interactions.

Expression profiles Protein interaction Markov random field Unified model Model for protein interaction Markov random field Undirected graph, V = {g 1.C,..., g n.c}, E =set of protein interactions. Assumption Interacting proteins are more likely to be in the same pathway. Intuitive idea If a pair of nodes share the same class likelihood is higher

Markov random field Formalism Expression profiles Protein interaction Markov random field Unified model Each g i.c is associated with a potential φ i (g i.c).

Expression profiles Protein interaction Markov random field Unified model Markov random field Formalism Each g i.c is associated with a potential φ i (g i.c). Each edge g i.c g j.c is associated with a compatibility potential φ i,j (g i.c, g j.c).

Expression profiles Protein interaction Markov random field Unified model Markov random field Formalism Each g i.c is associated with a potential φ i (g i.c). Each edge g i.c g j.c is associated with a compatibility potential φ i,j (g i.c, g j.c). Joint distribution is P(g 1.C,..., g n.c) = 1 Z n φ i (g i.c) i=1 Z is a normalization constant. {g i.c g j.c} E φ i,j (g i.c, g j.c) (1)

Markov random field Formalism Expression profiles Protein interaction Markov random field Unified model φ i,j (g i.c = p, g j.c = q) = { α p = q 1 otherwise

Markov random field Formalism Expression profiles Protein interaction Markov random field Unified model (α 1). φ i,j (g i.c = p, g j.c = q) = { α p = q 1 otherwise

Expression profiles Protein interaction Markov random field Unified model Unified Model What we already have Model for expression data (Naive Bayes) Model for class probability (Markov random field)

Expression profiles Protein interaction Markov random field Unified model Unified Model What we already have Model for expression data (Naive Bayes) Model for class probability (Markov random field) What we want Probability distribution P(G.C, G.E), using expression and protein data.

Expression profiles Protein interaction Markov random field Unified model Unified Model What we already have Model for expression data (Naive Bayes) Model for class probability (Markov random field) What we want Probability distribution P(G.C, G.E), using expression and protein data. What we are missing Naive Bayes provides that prob. distribution, but does not use protein data.

Expression profiles Protein interaction Markov random field Unified model Unified Model What we already have Model for expression data (Naive Bayes) Model for class probability (Markov random field) What we want Probability distribution P(G.C, G.E), using expression and protein data. What we are missing Naive Bayes provides that prob. distribution, but does not use protein data. We haven t specified the potentials φ i (g i.c).

Unified Model Introduction Expression profiles Protein interaction Markov random field Unified model Solution Use Markov random field as P(G.C).

Expression profiles Protein interaction Markov random field Unified model Unified Model Solution Use Markov random field as P(G.C). Use multinomial dist. P(g i.c) from Naive Bayes as potential φ i (g i.c). Call it P (g i.c).

Expression profiles Protein interaction Markov random field Unified model Unified Model P(G.C, G.E) = 1 Z n n P (g i.c) i=1 i=1 j=1 m P(g i.e j g i.c) P(G.C) Markov random field. P(G.E) Gaussian distributions. {g i.c g j.c} E φ i,j (g i.c, g j.c)

EM algorithm Parameters to be estimated Multinomial distribution (θ 1,..., θ K ). Mean and variance for gaussian distributions

Datasets Gene Expression 173 arrays (Gasch et al. 03) 77 arrays (Spellman et al. 98) Protein Interaction 10705 interactions (Xenarios et al. 05) After preprocessing 3589 genes.

Running the algorithm EM for optimizing parameters Number of pathways fixed as 60 Starting point for parameters use hierarchical clustering How to set the α parameter?

Setting α Recall: α is the compatibility potential when two proteins interact and belong to the same pathway.

Setting α Recall: α is the compatibility potential when two proteins interact and belong to the same pathway.

Comparisons with other methods Methods that use only one type of data Markov Cluster (Enright et al. 02) Hierarchical clustering (Eisen et al. 98)

Tests Prediction of held-out interactions. Functional enrichment in Gene Ontology. Coverage of protein complexes. Assigning new roles to unknown proteins.

Prediction of held-out interactions Cross-validation divide protein data into 5 disjoint sets (4 for training, 1 for testing)

Prediction of held-out interactions Cross-validation divide protein data into 5 disjoint sets (4 for training, 1 for testing) Get average number of held-out interactions between genes in the same pathway

Prediction of held-out interactions Cross-validation divide protein data into 5 disjoint sets (4 for training, 1 for testing) Get average number of held-out interactions between genes in the same pathway Result: 222.4 ± 13.2 (MCL) 383.2 ± 29.1

Biological coherence of the inferred pathways General result More functionally coherent than when using standard clustering or MCL

Biological coherence of the inferred pathways General result More functionally coherent than when using standard clustering or MCL Example Pathways related to translation, protein degradation, transcription, and DNA replication Genes in these pathways interact with many genes from other categories. They are also co-expressed.

Biological coherence of the inferred pathways General result More functionally coherent than when using standard clustering or MCL Example Pathways related to translation, protein degradation, transcription, and DNA replication Genes in these pathways interact with many genes from other categories. They are also co-expressed. MCL cannot isolate them.

Protein Complexes Motivation The components of many pathways are protein complexes. Thus, a good pathway model should assign the member genes of many of these complexes to the same pathway.

Protein Complexes Motivation The components of many pathways are protein complexes. Thus, a good pathway model should assign the member genes of many of these complexes to the same pathway. Procedure Use experimental assays (Gavin et al. 02) and (Ho et al. 02) Associate each gene to the complexes to which it belongs. Measure enrichment in pathways.

Protein Complexes In general, better than clustering: 374 complexes significantly enriched (higher than in clustering). Stress data 124 complexes in which more than 50% of members appear in the same pathway. Clustering only 46 complexes that verify that condition.

Assigning New Roles to Unknown Proteins Largest connected component of pathway 1 (cytoplasmic exosome): YHR081W is uncharacterized

Assigning New Roles to Unknown Proteins Largest connected component of pathway 1 (cytoplasmic exosome): YHR081W is uncharacterized Clustering Only 4 genes in pathway

Assigning New Roles to Unknown Proteins Largest connected component of pathway 1 (cytoplasmic exosome): YHR081W is uncharacterized Clustering Only 4 genes in pathway MCL Includes 114 additional genes in connected component

Summary Probabilistic model for integrating gene expression and protein interaction data

Summary Probabilistic model for integrating gene expression and protein interaction data Method aims at finding co-expressed and connected genes (pathways)

Summary Probabilistic model for integrating gene expression and protein interaction data Method aims at finding co-expressed and connected genes (pathways) Comparison with single-source methods

Summary Probabilistic model for integrating gene expression and protein interaction data Method aims at finding co-expressed and connected genes (pathways) Comparison with single-source methods Some pathways are only obtainable by combining both types of data

Limitations

Limitations Model for co-expression is too restrictive

Limitations Model for co-expression is too restrictive Assignment of each gene to a single pathway

Limitations Model for co-expression is too restrictive Assignment of each gene to a single pathway Pathways should be condition-specific (same goes for protein interaction)

Questions (1) On which two assumptions about pathways is the model based? (2) Map each of the previous assumptions into a property of the model (3) Why must the α parameter in the markov random field be greater than one? (4) What happens when (a) α = 1 or when (b) α is close to infinity?