
02-710 Computational Genomics. Systems Biology. Putting it together: Data integration using graphical models

High-throughput data
So far in this class we have discussed several different types of high-throughput datasets, each providing an important, but limited, view of cellular activity. These include:
- Coding sequences: genes, exons, miRNAs
- Non-coding sequences: enhancers, DNA motifs
- Gene and microRNA expression: microarrays, RNA-Seq
- Protein-DNA interactions: ChIP-chip, ChIP-Seq, PBM
- Epigenetic data
- Etc.

Systems Biology: Motivation
High-level goal: integrate different types of high-throughput data to discover patterns of combinatorial regulation and to understand how the activity of genes involved in related biological processes is coordinated and interconnected.

Graphical models

Independence
Independence allows for simpler models, learning, and inference (for example, when using a Naïve Bayes classifier). For example, with 3 binary variables we only need 3 parameters rather than 7. The saving is even greater when we have many more variables. In many cases it would be useful to assume independence, even if it does not actually hold. Is there any middle ground?
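To make the saving concrete, here is a small illustrative calculation (only the n = 3 case comes from the slide; the other values are added for comparison). An unrestricted joint distribution over n binary variables has 2^n - 1 free parameters, while full independence needs only n:

```python
# Parameter counts for n binary variables: unrestricted joint vs. full independence.
def num_params(n):
    full_joint = 2 ** n - 1   # one free parameter per joint configuration, minus normalization
    independent = n           # one Bernoulli parameter per variable
    return full_joint, independent

for n in (3, 10, 20):
    joint, indep = num_params(n)
    print(f"n={n}: full joint needs {joint} parameters, independence needs {indep}")
# n=3: 7 vs. 3, as on the slide; the gap grows exponentially with n.
```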

Bayesian networks
Bayesian networks are directed graphs with nodes representing random variables and edges representing dependency assumptions. Let's use our movie example: we would like to determine the joint probability for length, liked, and slept in a movie.
[Figure: example Bayesian networks over genes such as G1, G2, G3 and P53, E2F, MYC.]

Bayesian networks: Notation
Bayesian networks are directed acyclic graphs over random variables, annotated with conditional probability tables (CPTs). Edges encode conditional dependencies.
[Figure: a three-node DAG with root G1 and children G2 and G3. CPTs: P(G1) = 0.5; P(G2|G1) = 0.6, P(G2|¬G1) = 0.2; P(G3|G1) = 0.4, P(G3|¬G1) = 0.7.]

Bayesian networks: Notation
The Bayesian network below represents the following joint probability distribution:

p(G1, G2, G3) = P(G1) P(G2|G1) P(G3|G1)

More generally, a Bayesian network represents the joint probability distribution

p(x_1, ..., x_n) = ∏_i p(x_i | Pa(x_i)),

where Pa(x_i) is the set of parents of x_i in the graph.
[Figure: the three-node DAG with CPTs P(G1) = 0.5; P(G2|G1) = 0.4, P(G2|¬G1) = 0.7; P(G3|G1) = 0.6, P(G3|¬G1) = 0.2.]
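As a minimal sketch of this factorization (using the CPT values from the slide; variable and function names are illustrative), the joint probability of any complete assignment is just the product of one CPT entry per node:

```python
# p(G1, G2, G3) = P(G1) * P(G2|G1) * P(G3|G1), with the slide's CPT values.
P_G1 = 0.5
P_G2_given_G1 = {True: 0.4, False: 0.7}   # P(G2 = 1 | G1 = g1)
P_G3_given_G1 = {True: 0.6, False: 0.2}   # P(G3 = 1 | G1 = g1)

def joint(g1, g2, g3):
    """Joint probability of one complete assignment under the three-node network."""
    p1 = P_G1 if g1 else 1 - P_G1
    p2 = P_G2_given_G1[g1] if g2 else 1 - P_G2_given_G1[g1]
    p3 = P_G3_given_G1[g1] if g3 else 1 - P_G3_given_G1[g1]
    return p1 * p2 * p3

print(joint(True, True, True))   # 0.5 * 0.4 * 0.6 = 0.12
```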

Bayesian network: Inference
Once the network is constructed, we can use algorithms for inferring the values of unobserved variables. For example, assume we observed only G2 and G3. Can we determine the value of G1?
[Figure: the three-node network with CPTs P(G1) = 0.5; P(G2|G1) = 0.4, P(G2|¬G1) = 0.7; P(G3|G1) = 0.6, P(G3|¬G1) = 0.2.]
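A minimal sketch of this inference, assuming the CPT values above: for a network this small, the posterior over G1 can be computed exactly by enumeration (Bayes' rule), without any specialized inference algorithm:

```python
# P(G1 | G2 = g2, G3 = g3) by enumeration over the two values of G1.
P_G1 = 0.5
P_G2 = {True: 0.4, False: 0.7}   # P(G2 = 1 | G1 = g1)
P_G3 = {True: 0.6, False: 0.2}   # P(G3 = 1 | G1 = g1)

def joint(g1, g2, g3):
    p1 = P_G1 if g1 else 1 - P_G1
    p2 = P_G2[g1] if g2 else 1 - P_G2[g1]
    p3 = P_G3[g1] if g3 else 1 - P_G3[g1]
    return p1 * p2 * p3

def posterior_g1(g2, g3):
    """Bayes' rule: P(G1=1 | g2, g3) = joint(1, g2, g3) / sum over both values of G1."""
    evidence = joint(True, g2, g3) + joint(False, g2, g3)
    return joint(True, g2, g3) / evidence

print(posterior_g1(True, True))   # ~0.63: observing G2 = G3 = 1 raises belief in G1 = 1
```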

Methods for grouping genes into clusters and networks
Clustering of expression data: groups together genes with similar expression patterns, but does not reveal structural relations between genes.
Boolean networks: deterministic models of the logical interactions between genes; deterministic and static.
Linear models: deterministic, fully connected linear models; under-constrained, and assume linearity of interactions.

So, Why Bayesian Networks?
- Flexible representation of the (in)dependency structure of multivariate distributions and interactions.
- Natural for modeling global processes with local interactions => good for biology.
- Clear probabilistic semantics: natural for statistical confidence analysis of results and for answering queries.
- Stochastic in nature: models stochastic processes and deals with ("sums out") noise in measurements.

Learning Bayesian Networks
The goal: given a set of independent samples (assignments to the random variables), find the best (most likely) Bayesian network, i.e., both the DAG and the CPDs.
[Figure: example samples such as (B,E,A,C,R) = (T,F,F,T,F), (T,F,T,T,F), ..., (F,T,T,T,F), together with a five-node network over B, E, A, C, R and a learned CPT P(A|E,B).]

Learning Bayesian Networks
Learning the best CPTs given the DAG is easy: collect statistics of the values of each node given each specific assignment to its parents. But the structure (G) learning problem is NP-hard, so a heuristic search for the best model must be applied, generally yielding a locally optimal network. It turns out that richer structures give higher likelihood P(D|G) to the data (adding an edge is always preferable), because...

Learning Bayesian Networks
If we add B to Pa(C), we have more parameters to fit => more freedom => we can always optimize the CPD of C so that the likelihood under P(C|A,B) is at least as high as under P(C|A). But we prefer simpler (more explanatory) networks (Occam's razor!). Therefore, practical scores for Bayesian networks offset the likelihood improvement with a penalty on complex networks.
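One standard score of this penalized form is BIC; the slide does not commit to a particular score, so this is only an example of the trade-off. The log-likelihood term always favors the richer network, but the penalty term, which grows with the number of free CPT parameters, can reverse the ranking:

```python
import math

# BIC-style penalized score: log P(D | G, theta_hat) - (log N / 2) * dim(G),
# where dim(G) is the number of free CPT parameters and N is the sample size.
def bic_score(log_likelihood, num_params, num_samples):
    return log_likelihood - 0.5 * math.log(num_samples) * num_params

# Hypothetical numbers: the richer network fits slightly better but pays a larger penalty.
print(bic_score(-1000.0, 12, 100))   # simpler network:  ~ -1027.6
print(bic_score(-998.0, 20, 100))    # richer network:   ~ -1044.1 (worse overall)
```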

Modeling Biological Regulation
Variables of interest:
- Expression levels of genes
- Concentration levels of proteins
- Exogenous variables: nutrient levels, metabolite levels, temperature
- Phenotype information
Bayesian network structure: captures the dependencies among these variables.

Possible Biological Interpretation
Random variables represent the measured expression level of each gene. Interactions are represented by a graph: each gene is represented by a node, and edges between nodes represent direct probabilistic dependencies.

More Local Structures
Dependencies can be mediated through other nodes:
- Common cause: C regulates both A and B (A ← C → B)
- Intermediate gene: A affects B through C (A → C → B)
- Common/combinatorial effects: A and B jointly affect C (A → C ← B)

The Approach of Friedman et al.
Expression data → Bayesian network learning algorithm → use the learned network to make predictions about the structure of the interactions between genes. No prior biological knowledge is used!
Friedman et al., Using Bayesian networks to analyze expression data, RECOMB 2000.

The Discretization Problem
The expression measurements are real numbers. We can either discretize the values in order to learn general CPTs (=> lose information), or, if we don't, we must assume some specific parametric type of CPT (such as regression-based linear Gaussian models) (=> lose generality).
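A minimal discretization sketch (the three-level scheme is common in this literature, but the +/-0.5 threshold here is an arbitrary illustration, not a value fixed by the slides):

```python
import numpy as np

# Map continuous log-expression ratios to three levels:
# -1 = under-expressed, 0 = baseline, 1 = over-expressed.
def discretize(log_ratios, threshold=0.5):
    x = np.asarray(log_ratios)
    levels = np.zeros(x.shape, dtype=int)   # default: baseline
    levels[x > threshold] = 1
    levels[x < -threshold] = -1
    return levels

print(discretize([-1.2, 0.1, 0.8]))   # [-1  0  1]
```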

Problem of Sparse Data
There are many more genes than experiments, so many different networks fit the data well. Solutions: shrink the network search space, e.g., using the observation that in biological systems each gene is regulated directly by only a few regulators; and don't take the resulting network for granted, but instead extract from it pieces of reliable information.

Learning With Many Variables
The Sparse Candidate algorithm is an efficient heuristic search that relies on the sparseness of regulatory networks:
- For each gene, choose a promising set of candidate parents for direct influence.
- Find a (locally) optimal BN constrained so that each gene's parents come from its candidate set.
- Iteratively improve the candidate sets.
A sketch of the candidate-selection step is shown below.
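Here is a rough sketch of the candidate-selection step only, under the assumption that candidates are ranked by empirical mutual information with the target gene (the actual algorithm refines this measure, and the constrained structure search and iteration are omitted; all names are illustrative):

```python
import numpy as np
from itertools import product

def mutual_information(x, y):
    """Empirical mutual information between two discrete-valued vectors."""
    mi = 0.0
    for a, b in product(set(x), set(y)):
        p_ab = np.mean((x == a) & (y == b))
        p_a, p_b = np.mean(x == a), np.mean(y == b)
        if p_ab > 0:
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def candidate_parents(data, k=3):
    """data: (num_samples, num_genes) discretized matrix -> top-k candidate parents per gene."""
    n_genes = data.shape[1]
    candidates = {}
    for i in range(n_genes):
        scores = [(mutual_information(data[:, i], data[:, j]), j)
                  for j in range(n_genes) if j != i]
        candidates[i] = [j for _, j in sorted(scores, reverse=True)[:k]]
    return candidates
```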

Experiment
Data from Spellman et al. (Mol. Bio. of the Cell, 1998), containing 76 samples covering the whole yeast genome: different methods for synchronizing the cell cycle in yeast, with time series at intervals of a few minutes (5-20 min). Spellman et al. identified 800 cell-cycle-regulated genes.

Network Learned

Challenge: Statistical Significance
Sparse data (a small number of samples) gives a flat posterior: many networks fit the data. Solution: estimate confidence in network features. E.g., two types of features:
- Markov neighbors: X directly interacts with Y (through a mutual edge or a mutual child)
- Order relations: X is an ancestor of Y

Confidence Estimates
Bootstrap approach [FGW, UAI99]: resample the data D (with replacement) to obtain datasets D_1, ..., D_m; learn a network G_i from each D_i; then estimate the confidence level of a feature f as

C(f) = (1/m) Σ_{i=1}^{m} 1{f ∈ G_i}.
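A minimal sketch of this estimate, where `learn_network` is an assumed stand-in for any structure-learning routine that returns a learned network as a set of features (here, directed edges):

```python
import numpy as np

def bootstrap_confidence(data, learn_network, m=100, rng=None):
    """C(f) = (1/m) * sum_i 1{f in G_i}, over m bootstrap resamples of the rows of data."""
    rng = rng or np.random.default_rng(0)
    n = data.shape[0]
    counts = {}
    for _ in range(m):
        resampled = data[rng.integers(0, n, size=n)]   # n rows drawn with replacement
        for feature in learn_network(resampled):
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: c / m for feature, c in counts.items()}
```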

In summary
Expression data → Preprocess (normalization, discretization) → Learn model (Bayesian network learning algorithm, mutation modeling + bootstrap) → Feature extraction (Markov edges, separators, ancestors).
Result: a list of features with high confidence. They can be biologically interpreted.

Resulting Features: Markov Relations
Question: do X and Y directly interact?
- Parent-child (one gene regulating the other): SST2 → STE6, confidence 0.91. SST2 is a regulator in the mating pathway; STE6 is an exporter of mating factor.
- Hidden parent (two genes co-regulated by a hidden factor): ARG5 and ARG3, confidence 0.84, co-regulated by the transcription factor GCN4; both ARG5 and ARG3 are in arginine biosynthesis.

Resulting Features: Separators
Question: given that X and Y are indirectly dependent, who mediates this dependence? Separator relation: X affects Z, who in turn affects Y; or Z regulates both X and Y.
Example: KAR4, a mating transcriptional regulator of nuclear fusion, separates AGA1 and FUS1, both involved in cell fusion.

Separators: Intra-cluster Context
Example: SLT2, the MAPK of the cell wall integrity pathway, separates CRH1, YPS3, and YPS1 (cell wall proteins) and SLR3 (a protein of unknown function). All pairs have high correlation and are clustered together.

Dependencies and causality
Since P(x_1)P(x_2|x_1) = P(x_2)P(x_1|x_2) = P(x_1, x_2), we cannot immediately attach any causal interpretation to the probabilistic dependencies (e.g., whether factor x_1 regulates x_2).

Causality
We can use interventions (external manipulations) to disambiguate between possible causal interpretations. For example, if we intervene to set the value of x_2 to a specific value (e.g., a knock-out), then: if x_1 causes x_2, the distribution of x_1 is unaffected by the intervention, whereas if x_2 causes x_1, the distribution of x_1 changes.
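A small simulation of this idea, under the assumed causal model x1 -> x2 (all probabilities here are illustrative): forcing x2 to a value breaks the edge into x2, so the distribution of x1 is unchanged, while forcing x1 still propagates to x2. Observational data alone cannot reveal this asymmetry:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, do_x1=None, do_x2=None):
    """Sample from x1 -> x2; do_x1/do_x2 force a variable's value (an intervention)."""
    x1 = rng.random(n) < 0.5 if do_x1 is None else np.full(n, do_x1)
    p_x2 = np.where(x1, 0.9, 0.1)   # x2 depends on x1
    x2 = rng.random(n) < p_x2 if do_x2 is None else np.full(n, do_x2)
    return x1, x2

x1, _ = sample(100_000)
print(x1.mean())                             # ~0.5 observationally
x1_forced, _ = sample(100_000, do_x2=False)
print(x1_forced.mean())                      # still ~0.5: knocking out x2 leaves x1 alone
_, x2_forced = sample(100_000, do_x1=True)
print(x2_forced.mean())                      # ~0.9: forcing x1 changes x2
```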

Extensions: Bayesian networks and regression
Another way to deal with the continuous data is to use a different probabilistic model, for example Gaussian linear regression, where each variable is normally distributed around a linear function of its parents. For the three-gene network:

p(x_1, x_2, x_3):  x_1 ~ N(μ_1, σ_1²),  x_2 | x_1 ~ N(a_2 x_1 + b_2, σ_2²),  x_3 | x_1 ~ N(a_3 x_1 + b_3, σ_3²).
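A short sketch of sampling from such a linear Gaussian network, with the G1 → G2, G1 → G3 structure used earlier (all coefficients are illustrative, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_linear_gaussian(n):
    x1 = rng.normal(0.0, 1.0, n)       # x1 ~ N(mu1, sigma1^2)
    x2 = rng.normal(0.8 * x1, 0.5)     # x2 | x1 ~ N(a2 * x1, sigma2^2)
    x3 = rng.normal(-0.6 * x1, 0.5)    # x3 | x1 ~ N(a3 * x1, sigma3^2)
    return np.column_stack([x1, x2, x3])

data = sample_linear_gaussian(1000)
print(np.corrcoef(data, rowvar=False).round(2))   # correlations induced by the shared parent x1
```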

Rich probabilistic gene expression networks
In many cases we have additional types of information that can be used for network learning. In addition, the expression levels of multiple genes are commonly affected by their shared regulators and function. Graphical models based on the idea of related modules can be used to capture these notions; the specific model is termed a Probabilistic Relational Model (PRM). Data sources include:
- Functional assignments for genes (from MIPS)
- Binding site information for known TFs
Gene classes are latent variables; array classes are known (a different class for each array). Segal et al., Nature Genetics 2003.

Probability model
A decision tree is used for each of the expression levels. Decisions can be based on the expression levels of other genes or on discrete values from the other data sources. The variables tested at the nodes of the tree determine the parents of a given node. Issues: keeping the graph acyclic, and learning the tree for each gene.

Determining significance of results
Use permuted data to determine whether the observed structure reflects real signal: apply the same algorithm to a randomized version of the data, and use the likelihood of the generated model to test the relevance of the learned structure.
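A minimal permutation-test sketch of this procedure, where `learn_and_score` is an assumed stand-in that learns a model from a data matrix and returns its (penalized) log-likelihood:

```python
import numpy as np

def permutation_pvalue(data, learn_and_score, n_perms=100, rng=None):
    """Shuffle each gene's column independently to destroy real dependencies,
    relearn, and report the fraction of permuted datasets scoring at least as
    well as the real data."""
    rng = rng or np.random.default_rng(0)
    observed = learn_and_score(data)
    null_scores = [learn_and_score(np.column_stack([rng.permutation(col) for col in data.T]))
                   for _ in range(n_perms)]
    return np.mean(np.array(null_scores) >= observed)
```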

Testing the clusters
Test the variance of the expression in each cluster. Remove functional annotations after the initial step to allow new annotations for unknown genes.

From modules to networks

Determine combinatorial control

Resulting module (Segal et al., Nature Genetics 2003)

More combinatorial regulation