Inferring Transcriptional Regulatory Networks from High-throughput Data

Lecture 9, Oct 26, 2011
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee; TA: Christopher Miles
Monday & Wednesday 12:00-1:20, Johnson Hall (JHN) 022

Outline
- Microarray gene expression data
  - Measuring the RNA level of genes
  - Clustering approaches
  - Beyond clustering
- Algorithms for learning regulatory networks
  - Application of probabilistic models
  - Structure learning of Bayesian networks
  - Module networks
  - Evaluation of the method

Spot Your Genes
[Figure: two-color microarray assay. RNA is isolated from a normal cell and from a cancer cell, the two samples are labeled with Cy5 and Cy3 dyes, and hybridized to known gene sequences spotted on a glass slide (chip).]

Matrix of expression
[Figure: the scanned chip yields an expression matrix: genes Gene 1 ... Gene N in rows, experiments Exp 1 ... Exp 3 in columns, with measured levels E1, E2, E3 for each gene.]

Microarray gene expression data
[Figure: expression matrix heat map; rows i are experiments (samples), columns j are genes; entries range from downregulated/repressed to upregulated/induced.]
E_ij = RNA level of gene j in experiment i.

Analyzing microarray data
- Supervised learning: given genes-by-samples data labeled with a clinical trait (cancer/normal), learn the mapping from expression to trait. Gene signatures can provide a valuable diagnostic tool for clinical purposes, and can also help the molecular characterization of the cellular processes underlying disease states.
- Unsupervised learning: learn the underlying model. Gene clustering can reveal cellular processes and their response to different conditions; sample clustering can reveal phenotypically distinct populations, with clinical implications.

* Bhattacharjee et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. PNAS (2001).
* van't Veer et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature (2002).
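To make the E_ij matrix concrete, here is a minimal sketch in Python with synthetic values (not data from any real chip), plus the per-gene standardization commonly applied before clustering; numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experiments, n_genes = 5, 4
# E[i, j]: (log-ratio) RNA level of gene j in experiment i; synthetic values.
E = rng.normal(size=(n_experiments, n_genes))
print(np.round(E, 2))   # positive ~ induced/upregulated, negative ~ repressed

# Common preprocessing: standardize each gene's profile so that clustering
# compares expression patterns rather than absolute magnitudes.
E_std = (E - E.mean(axis=0)) / E.std(axis=0)
```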

Why care about clustering?
- Discover functional relations: similar expression suggests genes are functionally related.
- Assign function to unknown genes by the company they keep.
- Find which gene controls which other genes.

Hierarchical clustering
- Compute all pairwise distances between expression profiles, then repeatedly merge the closest pair. Common distance measures:
  1. Euclidean distance
  2. (Pearson's) correlation coefficient
  3. etc.
- Easy to run, but the result depends on where the grouping starts, and the resulting tree structure can be troublesome to interpret (see the sketch below).
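As a concrete illustration of the merge-closest-pair procedure, here is a minimal sketch using SciPy with 1 - Pearson correlation as the distance (option 2 in the list above); the profiles are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
profiles = rng.normal(size=(20, 6))            # 20 genes x 6 experiments

dists = pdist(profiles, metric="correlation")  # 1 - Pearson correlation
tree = linkage(dists, method="average")        # repeatedly merge closest pair

# Interpreting the full dendrogram is the hard part noted above; cutting it
# into a fixed number of flat clusters is one common workaround.
labels = fcluster(tree, t=4, criterion="maxclust")
print(labels)
```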

K-means clustering
[Figure: gene profiles plotted in experiment space (E1 vs E2), partitioned around k centroids.]
- Performs an overall optimization: iteratively assign each gene to its nearest centroid, then recompute the centroids.
- Open issues: how many clusters (k)? How to initialize? The algorithm can get stuck in local minima.
- Generally, heuristic methods have no established means to determine the correct number of clusters or to choose the best algorithm.
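The sketch below illustrates the two practical issues just raised: it tries several values of k and uses multiple random restarts to reduce the risk of a bad local minimum. scikit-learn is assumed, and the silhouette score is just one heuristic among many for comparing values of k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# Three synthetic clusters of 2-D "profiles" for illustration.
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in (-2, 0, 2)])

for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=20, random_state=0).fit(X)  # 20 restarts
    print(k, round(silhouette_score(X, km.labels_), 3))
```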

Clustering expression profiles
- Co-regulated genes cluster together, so clusters can be used to infer gene function.
- Limitation: clustering offers no explanation of what caused the expression of each gene (no regulatory mechanism).

Beyond clustering
- Cluster: a set of genes with similar expression profiles.
- Regulatory module: a set of genes with a shared regulatory mechanism.
- Goal: an automatic method for identifying candidate modules and their regulatory mechanism.

Inferring regulatory networks
- Expression data: measurements of the mRNA levels of all genes (~2x10^4 for human) under Q experimental conditions, giving expression profiles e_1, ..., e_Q.
- Goal: infer the regulatory network that controls gene expression, i.e. causal relationships among genes, such as "A and B regulate the expression of C" (A and B are regulators of C).
- Approach: Bayesian networks.

Regulatory network as a Bayesian network
- X_i: expression level of gene i, with Val(X_i) continuous.
- Interpretation: conditional independences and (putative) causal relationships.
- The joint distribution factorizes over the graph: P(X_1, ..., X_N) = Π_i P(X_i | Pa(X_i)), where Pa(X_i) are gene i's regulators (parents) and each factor is a conditional probability distribution (CPD).
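Here is a minimal sketch of this factorization for the A, B, C example above, with expression discretized to high/low. All probability values are illustrative assumptions (they match the table CPD on the next slide).

```python
# P(A, B, C) = P(A) * P(B) * P(C | A, B), since A and B are C's parents.
P_A = {"high": 0.4, "low": 0.6}
P_B = {"high": 0.7, "low": 0.3}
P_C_high = {("high", "high"): 0.3, ("high", "low"): 0.95,
            ("low", "high"): 0.1, ("low", "low"): 0.2}  # P(C=high | A, B)

def joint(a, b, c):
    """P(A=a, B=b, C=c) via the Bayesian-network factorization."""
    p = P_C_high[(a, b)]
    return P_A[a] * P_B[b] * (p if c == "high" else 1.0 - p)

# Probabilities over all configurations sum to 1, as they must.
states = ["high", "low"]
print(sum(joint(a, b, c) for a in states for b in states for c in states))
```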

CPD for discrete expression
After discretizing expression levels to high/low, the parameters are the probability values in every entry of a table CPD. For example, with an activator and a repressor as the parents of a target gene:

  Activator  Repressor | P(Target=high)  P(Target=low)
  high       high      | 0.30            0.70
  high       low       | 0.95            0.05
  low        high      | 0.10            0.90
  low        low       | 0.20            0.80

How about continuous-valued expression levels? Two options: tree CPDs and linear Gaussian CPDs.
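Since each table entry is a conditional probability, its maximum-likelihood estimate from fully observed discretized data is simply a conditional frequency. A minimal sketch, generating synthetic data from the table above and recovering its entries by counting:

```python
import random
from collections import Counter

random.seed(3)
TRUE = {("high", "high"): 0.3, ("high", "low"): 0.95,
        ("low", "high"): 0.1, ("low", "low"): 0.2}  # P(target=high | act, rep)

data = []
for _ in range(5000):
    act, rep = random.choice(["high", "low"]), random.choice(["high", "low"])
    tgt = "high" if random.random() < TRUE[(act, rep)] else "low"
    data.append((act, rep, tgt))

totals = Counter((a, r) for a, r, _ in data)
highs = Counter((a, r) for a, r, t in data if t == "high")
for cfg in TRUE:  # estimated entry vs. the value it was generated from
    print(cfg, round(highs[cfg] / totals[cfg], 2), "vs", TRUE[cfg])
```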

Context specificity of gene expression
[Figure: the upstream region of the target gene carries an activator binding site and a repressor binding site.]
- Context A: basal expression level.
- Context B: the activator induces expression.
- Context C: activator + repressor together decrease expression.
[Figure: the target gene's RNA level across arrays, grouped by context: near 0 in context A, near +3 in context B, near -3 in context C.]

Continuous-valued expression I: tree CPDs
- Parameters: the mean (μ) and variance (σ^2) of a normal distribution in each context: (μ_A, σ_A), (μ_B, σ_B), (μ_C, σ_C), ...
- The tree represents combinatorial and context-specific regulation: tests on the regulators' expression select a leaf (context), and that leaf's Gaussian models the target's expression.
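A minimal sketch of such a tree CPD: tests on the activator's and repressor's levels select a context, and each context carries its own (μ, σ). The thresholds and parameter values are illustrative assumptions, chosen to match the 0 / +3 / -3 contexts above.

```python
import math

# Per-context Gaussians: (mu, sigma) at each leaf of the tree.
LEAVES = {"A": (0.0, 0.5), "B": (3.0, 0.7), "C": (-3.0, 0.7)}

def context(activator, repressor, thresh=0.5):
    """Walk the tree: regulator levels select the context (leaf)."""
    if activator < thresh:
        return "A"                              # basal expression
    return "C" if repressor >= thresh else "B"  # repressed vs. induced

def log_lik(target, activator, repressor):
    """Log density of the target's level under its context's Gaussian."""
    mu, sigma = LEAVES[context(activator, repressor)]
    return (-0.5 * ((target - mu) / sigma) ** 2
            - math.log(sigma * math.sqrt(2.0 * math.pi)))

print(context(0.9, 0.1), round(log_lik(2.8, 0.9, 0.1), 3))
```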

Continuous-valued expression II: linear Gaussian CPDs
- Parameters: weights w_1, ..., w_N associated with the parents (regulators) X_1, ..., X_N.
- P(X | Pa(X); w) = N(Σ_i w_i x_i, ε^2): the target's expression is normally distributed around a weighted sum of its regulators' expression levels (see the sketch after the next slide).

Learning
Structure learning [Koller & Friedman] comes in three flavors:
- Constraint-based approaches
- Score-based approaches
- Bayesian model averaging
Given the set of all possible network structures and a scoring function that measures how well a model fits the observed data, we try to select the highest-scoring network structure. Scoring functions: the likelihood score and the Bayesian score.
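Returning to the linear Gaussian CPD above: because the target is Gaussian around Σ_i w_i x_i, the MLE of the weights is an ordinary least-squares fit. A minimal sketch with synthetic regulators:

```python
import numpy as np

rng = np.random.default_rng(4)
N, n_arrays = 5, 200                       # 5 parents, 200 experiments
w_true = np.array([1.5, -2.0, 0.0, 0.0, 0.5])
X = rng.normal(size=(n_arrays, N))         # parents' expression levels
# Target ~ N(sum_i w_i * x_i, eps^2) with eps = 0.3, per the CPD above.
target = X @ w_true + rng.normal(scale=0.3, size=n_arrays)

w_hat, *_ = np.linalg.lstsq(X, target, rcond=None)   # MLE of the weights
print(np.round(w_hat, 2))                  # close to w_true
```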

Scoring functions
Let S be a structure, Θ_S the parameters for S, and D the data. The likelihood score, max_Θ log P(D | S, Θ), overfits; remedies include:
- The Bayesian score, P(Structure | Data), which integrates out the parameters.
- Reducing the complexity of the model via regularization.
- Simplifying the structure itself: module networks.

Feature selection via regularization
Assume a linear Gaussian CPD whose parents are drawn from a fixed set of candidate regulators x_1, ..., x_N (features; ~350 genes in yeast, ~700 in mouse), so that P(E_target | x; w) = N(Σ_i w_i x_i, ε^2). The MLE solves

  maximize_w  -Σ_arrays (Σ_i w_i x_i - E_target)^2

Problem: this objective learns too many regulators (nearly all weights come out nonzero).

L1 regularization
Selecting a subset of regulators by combinatorial search is infeasible. An effective feature selection algorithm is L1 regularization (LASSO) [Tibshirani, J. Royal Statist. Soc. B, 1996]:

  minimize_w  Σ_arrays (Σ_i w_i x_i - E_target)^2 + C Σ_i |w_i|

This is a convex optimization problem, and the penalty induces sparsity in the solution w (many w_i are set exactly to zero); see the sketch below.

Modularity of regulatory networks
- Genes tend to be co-regulated with others by the same factors; genes in the same module share a CPD.
- Biologically more relevant, and a more compact representation: fewer parameters and a reduced search space for structure learning.
- Candidate regulators: a fixed set of genes that can serve as parents of the modules.
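Here is a minimal sketch of the L1-regularized objective using scikit-learn's Lasso (its alpha parameter plays the role of the penalty weight C above); the data are synthetic, with only 2 of 50 candidate regulators truly active.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 50))             # 50 candidate regulators
w_true = np.zeros(50)
w_true[[3, 17]] = [2.0, -1.5]              # only two true regulators
E_target = X @ w_true + rng.normal(scale=0.3, size=100)

lasso = Lasso(alpha=0.1).fit(X, E_target)  # alpha ~ the penalty weight C
print("selected regulators:", np.flatnonzero(lasso.coef_))
```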

The module networks concept
[Figure: each module's genes share one regulation program. A linear CPD combines regulator levels, e.g. target expression = -3 x (repressor expression) + 0.5 x (activator expression); a tree CPD tests the activator's and repressor's expression to predict whether the module's target genes are induced or repressed. Modules 1-3 shown.]

Goal
- Identify the modules (their member genes).
- Discover each module's regulation program.
[Figure: a genes-by-experiments expression matrix partitioned into Modules 1-3; a regulation program, e.g. a tree asking "Heat shock?" and testing regulators such as CMK1 and HAP4, assigns each experiment to context A, B, or C, each with its own (μ, σ).]

Structure learning: Bayesian score & tree CPDs
Find the structure S that maximizes P(S | D):

  P(S | D) ∝ P(D | S) P(S)
  maximize_S  log P(D | S) + log P(S)
  P(D | S) = ∫ P(D | S, Θ_S) P(Θ_S | S) dΘ_S

where P(S) is a prior distribution on structures. Contrast with the ML score, max_Θ log P(D | S, Θ), which is more prone to overfitting; the Bayesian score integrates the parameters out.
[Figure: data D is the genes-by-experiments (arrays) expression matrix; structure S assigns genes to Modules 1-3, each with its own regulation program.]

Decomposability
For a given structure S, the likelihood factorizes over modules, so the score decomposes into independent per-module terms:

  log P(D | S) = log ∫ P(D | S, Θ_S) P(Θ_S | S) dΘ_S
              = Σ_m log ∫ P(D_m | Pa_m, Θ_m) P(Θ_m) dΘ_m
              = (module 1 score) + (module 2 score) + (module 3 score) + ...

where D_m is the expression data for module m's genes and Pa_m are the module's regulators. Each module can therefore be scored, and its regulation program improved, independently; see the sketch below.
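Computing these integrals generally requires conjugate priors; the sketch below instead uses BIC, a standard large-sample approximation to log P(D | S) that decomposes over modules in the same way. Each module is modeled by a single Gaussian, a deliberate simplification of a full regulation program, and the data are synthetic.

```python
import numpy as np

def module_score(values):
    """BIC of a single-Gaussian model for one module's expression values."""
    n = values.size
    var = values.var()                            # MLE variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * var) + 1)
    return log_lik - 0.5 * 2 * np.log(n)          # 2 parameters: mu, var

rng = np.random.default_rng(6)
modules = [rng.normal(m, 1.0, size=100) for m in (-3.0, 0.0, 3.0)]

# The total structure score is a sum of independent module scores, so each
# module can be rescored locally when its genes or its program change.
print(round(sum(module_score(d) for d in modules), 1))
```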

Learning
Structure learning: find the structure that maximizes the Bayesian score log P(S | D) (or fit via regularization), using an iterative Expectation-Maximization (EM) style algorithm:
- M-step: given a partition of the genes into modules, learn the best regulation program (tree CPD) for each module.
- E-step: given the inferred regulation programs, reassign genes to modules such that the associated regulation program best predicts each gene's behavior.

Learning regulatory networks: iterative procedure
- Cluster genes into modules (E-step).
- Learn a regulatory program for each module (tree model), choosing the maximum increase in the (Bayesian) structural score (M-step).
[Figure: a learned network with modules M1-M22, candidate regulators, and member genes, including KEM1, MKT1, MSK1, DHH1, ECM18, UTH1, GPA1, HAP4, TEC1, ASG7, MFA1, MEC3, HAP1, SGS1, SEC59, RIM15, SAS5, PHO5, PHO2, PHO3, PHM6, SPL2, PHO84, PHO4, GIT1, VTC3.]
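A minimal sketch of this iterative procedure, with a deliberately simplified module model (one Gaussian per module instead of a regulation-program tree over candidate regulators): the M-step fits each module's parameters, and the E-step reassigns each gene to the module that predicts it best.

```python
import numpy as np

rng = np.random.default_rng(7)
# 60 synthetic "genes", summarized here by a single expression value each.
genes = np.concatenate([rng.normal(m, 0.5, size=20) for m in (-2.0, 0.0, 2.0)])
K = 3
assign = rng.integers(K, size=genes.size)        # random initial modules

for _ in range(20):
    # M-step: fit each module's (simplified) regulation program parameters.
    mus, sigmas = [], []
    for k in range(K):
        members = genes[assign == k]
        if members.size == 0:                    # keep empty modules alive
            members = genes
        mus.append(members.mean())
        sigmas.append(members.std() + 1e-6)
    mus, sigmas = np.array(mus), np.array(sigmas)
    # E-step: reassign each gene to the module that predicts it best.
    log_lik = -0.5 * ((genes[:, None] - mus) / sigmas) ** 2 - np.log(sigmas)
    assign = log_lik.argmax(axis=1)

print("module sizes:", [int((assign == k).sum()) for k in range(K)])
```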