De novo identification of motifs in one species. Modified from Serafim Batzoglou's lecture notes


Finding Regulatory Motifs: given a collection of genes that may be regulated by the same transcription factor, find the TF-binding motif they have in common.

Characteristics of Regulatory Motifs
Example instances:
TCCGGAAGC TCCGGATGC TCCGGATCT CATGGATGC CCAGGAAGT GGTGGATGC ACCGGATGC
- Tiny
- Degenerate
- ~Constant size (because a constant-size transcription factor binds)
- Occur everywhere

Enormous non-coding sequences
Given upstream regions of coregulated genes:
- Increasing the length makes motif finding harder: random motifs clutter the true ones
- Decreasing the length makes motif finding harder: the true motif may be missing in some sequences

Problem Definition
Given a collection of promoter sequences s_1, ..., s_N of genes with common expression.
Combinatorial formulation:
- Motif M: m_1 ... m_W, where some of the m_i may be blank
- Find M that occurs in all s_i with up to k differences
Probabilistic formulation:
- Motif M_ij, 1 <= i <= W, 1 <= j <= 4, where M_ij = Prob[ letter j at motif position i ]
- Find the best M and positions p_1, ..., p_N in the sequences
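For concreteness, here is a minimal Python sketch (not part of the original notes) of the probabilistic motif representation: the W x 4 matrix M is estimated from a set of aligned motif instances. The function name, pseudocount value, and the reuse of the instances from the "Characteristics" slide are illustrative choices.

```python
from collections import Counter

ALPHABET = "ACGT"

def estimate_pwm(instances, pseudocount=0.5):
    """Estimate M[i][j] = Prob[letter j at motif position i] from aligned instances."""
    W = len(instances[0])
    pwm = []
    for i in range(W):
        counts = Counter(seq[i] for seq in instances)
        total = len(instances) + 4 * pseudocount
        pwm.append([(counts[a] + pseudocount) / total for a in ALPHABET])
    return pwm

# Instances from the "Characteristics of Regulatory Motifs" slide above
instances = ["TCCGGAAGC", "TCCGGATGC", "TCCGGATCT", "CATGGATGC",
             "CCAGGAAGT", "GGTGGATGC", "ACCGGATGC"]
M = estimate_pwm(instances)
print(M[3])  # distribution over A, C, G, T at motif position 4 (all instances have G there)
```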

Essentially a Multiple Local Alignment
Find the best multiple local alignment; the alignment score is defined differently in the probabilistic and combinatorial cases.

Algorithms
Probabilistic:
1. Expectation Maximization: MEME
2. Gibbs Sampling: AlignACE, BioProspector
Combinatorial:
- CONSENSUS

Discrete Approaches to Motif Finding

Discrete Formulations
Given sequences S = {x_1, ..., x_n}.
A motif W is a consensus string w_1 ... w_K.
Find the motif W* with the best match to x_1, ..., x_n.
Definition of "best":
d(W, x_i) = minimum Hamming distance between W and any K-long word in x_i
d(W, S) = Σ_i d(W, x_i)
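A small sketch of these two distance functions, assuming sequences are plain Python strings over {A, C, G, T}; the helper names are illustrative.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def d_word(W, x):
    """d(W, x): minimum Hamming distance between W and any |W|-long word in x."""
    K = len(W)
    return min(hamming(W, x[p:p + K]) for p in range(len(x) - K + 1))

def d_total(W, S):
    """d(W, S): sum of d(W, x_i) over all sequences x_i in S."""
    return sum(d_word(W, x) for x in S)
```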

Approaches
- Exhaustive searches
- CONSENSUS
- MULTIPROFILER, TEIRESIAS, SP-STAR, WINNOWER

Exhaustive Searches
Pattern-driven algorithm:
For W = AA...A to TT...T (4^K possibilities):
  Find d(W, S)
Report W* = argmin_W d(W, S)
Running time: O(K N 4^K), where N = Σ_i |x_i|

Exhaustive Searches (2)
Sample-driven algorithm:
For each K-bp word W occurring in some x_i:
  Find d(W, S)
Report W* = argmin_W d(W, S), or report a local improvement of W*
Running time: O(K N^2)
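A sketch of both enumeration strategies, reusing d_total from the distance sketch above; it is practical only for small K, since the pattern-driven loop visits all 4^K candidates. Function names are illustrative.

```python
from itertools import product

# Assumes d_total(W, S) from the distance sketch above is in scope.

def pattern_driven(S, K):
    """Enumerate all 4^K candidate consensus strings; return the best-scoring one."""
    best = min(("".join(w) for w in product("ACGT", repeat=K)),
               key=lambda W: d_total(W, S))
    return best, d_total(best, S)

def sample_driven(S, K):
    """Only consider the K-bp words that actually occur in some x_i."""
    candidates = {x[p:p + K] for x in S for p in range(len(x) - K + 1)}
    best = min(candidates, key=lambda W: d_total(W, S))
    return best, d_total(best, S)
```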

Exhaustive Searches (3)
Problem with the sample-driven approach:
If the true motif does not occur exactly in the data, and the true motif is weak,
then random strings may score better than any instance of the true motif.

CONSENSUS (1)
Algorithm, Cycle 1:
For each word W in s_1:
  For each word W' in s_2:
    Create a gap-free alignment of W, W'
Keep the C_1 best alignments, A_1, ..., A_{C_1}
Example words:
ACGGTTG, CGAACTT, GGGCTCT
ACGCCTG, AGAACTA, GGGGTGT

CONSENSUS (2)
Algorithm (cont'd), Cycle 2 (and each subsequent cycle):
For each word W in the next sequence:
  For each alignment A_j from the previous cycle:
    Create a gap-free alignment of W, A_j
Keep the C_2 best alignments A_1, ..., A_{C_2}

CONSENSUS (3)
C_1, ..., C_n are user-defined heuristic constants.
Running time:
O(N^2) + O(N C_1) + O(N C_2) + ... + O(N C_n) = O(N^2 + N C_total)
where C_total = Σ_i C_i, typically O(nC), where C is a large constant.
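A simplified sketch of the greedy CONSENSUS strategy just described: it keeps the C best partial alignments per cycle, but scores an alignment by mismatches to its column consensus rather than the information-content score the real CONSENSUS program uses. The beam width C and the function names are illustrative.

```python
from collections import Counter

def words(x, K):
    return [x[p:p + K] for p in range(len(x) - K + 1)]

def alignment_cost(aln):
    """Mismatches to the per-column consensus (stand-in for CONSENSUS's information score)."""
    cost = 0
    for col in zip(*aln):
        cost += len(col) - Counter(col).most_common(1)[0][1]
    return cost

def consensus_greedy(S, K, C=50):
    """Cycle 1 pairs words from s_1 and s_2; each later cycle extends the C best alignments."""
    beam = [[w1, w2] for w1 in words(S[0], K) for w2 in words(S[1], K)]
    beam = sorted(beam, key=alignment_cost)[:C]
    for x in S[2:]:
        beam = sorted((aln + [w] for aln in beam for w in words(x, K)),
                      key=alignment_cost)[:C]
    return beam[0]  # best alignment: one K-bp occurrence per sequence
```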

MULTIPROFILER
Extended sample-driven approach.
Given a K-bp word W, define:
N_a(W) = { words W' in S such that d(W, W') <= a }
Idea: assume W is an occurrence of the true motif W*; use N_a(W) to correct errors in W.

MULTIPROFILER (2)
Assume W differs from the true motif W* in at most L positions.
Define: a wordlet G of W is an L-bp pattern with blanks, differing from W.
Example: K = 7, L = 3
W = ACGTTGA
G = --A--CG

MULTIPROFILER (3): Algorithm
For each W in S:
  For L = 1 to L_max:
    Find all strong L-long wordlets G in N_a(W)
    Modify W by the wordlet G -> W'
    Compute d(W', S)
Report W* = argmin d(W', S)

Expectation Maximization in Motif Finding

Expectation Maximization (1)
The MM algorithm, part of the MEME package, uses Expectation Maximization.
Algorithm (sketch):
Given genomic sequences, find all K-bp words
1. Assume each word is either motif or background
2. Find the likeliest motif and background models, and the classification of the words

Expectation Maximization (2)
Given sequences x_1, ..., x_N, find all K-bp words X_1, ..., X_n
Define the motif model:
M = (M_1, ..., M_K), with M_i = (M_i1, ..., M_i4) (assuming the alphabet {A, C, G, T})
where M_ij = Prob[ motif position i is letter j ]
Define the background model:
B = (B_1, ..., B_4)
where B_j = Prob[ letter j in background sequence ]

Expectation Maximization (3)
Define:
Z_i1 = 1 if X_i is motif, 0 otherwise
Z_i2 = 0 if X_i is motif, 1 otherwise
Let λ = Prob[ a word is motif ] be the mixing parameter.
Given a word X_i = a[1] ... a[K]:
P[ X_i, Z_i1 = 1 ] = λ · M_{1,a[1]} · ... · M_{K,a[K]}
P[ X_i, Z_i2 = 1 ] = (1 − λ) · B_{a[1]} · ... · B_{a[K]}
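A small sketch of these two joint probabilities for one K-bp word; lam stands for the mixing parameter λ, M is the K x 4 motif matrix, and B is the 4-entry background distribution defined above (representations and function names are illustrative).

```python
IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def p_joint_motif(word, M, lam):
    """P[X_i, Z_i1 = 1] = lambda * M[1][a(1)] * ... * M[K][a(K)]"""
    p = lam
    for k, a in enumerate(word):
        p *= M[k][IDX[a]]
    return p

def p_joint_background(word, B, lam):
    """P[X_i, Z_i2 = 1] = (1 - lambda) * B[a(1)] * ... * B[a(K)]"""
    p = 1.0 - lam
    for a in word:
        p *= B[IDX[a]]
    return p
```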

Expectation Maximization (4)
Define the parameter space θ = (M, B).
Objective: maximize the log likelihood of the model:

log P(X_1, ..., X_n, Z | θ, λ) = Σ_{i=1..n} Σ_{j=1,2} Z_ij log( λ_j P(X_i | θ_j) )
                               = Σ_{i=1..n} Σ_{j=1,2} Z_ij log P(X_i | θ_j) + Σ_{i=1..n} Σ_{j=1,2} Z_ij log λ_j

where λ_1 = λ and λ_2 = 1 − λ.

Expectation Maximization (5)
Maximize the expected likelihood by iterating two steps:
Expectation: find the expected value of the log likelihood, E[ log P(X_1, ..., X_n, Z | θ, λ) ]
Maximization: maximize that expected value over θ and λ

Expectation Maximization (6): E-step
Expectation: find the expected value of the log likelihood:

E[ log P(X_1, ..., X_n, Z | θ, λ) ] = Σ_{i=1..n} Σ_{j=1,2} E[Z_ij] log P(X_i | θ_j) + Σ_{i=1..n} Σ_{j=1,2} E[Z_ij] log λ_j

where the expected values of Z are computed as:

E[Z_ij] = λ_j P(X_i | θ_j) / Σ_{k=1,2} λ_k P(X_i | θ_k)

Expectation Maximization (7): M-step
Maximization: maximize the expected value over θ and λ independently.
For λ, this is easy:

λ_j^NEW = argmax_λ Σ_{i=1..n} E[Z_ij] log λ_j = ( Σ_{i=1..n} E[Z_ij] ) / n

Expectation Maximization (8): M-step
For θ = (M, B), define
c_jk = E[ # times letter k appears in motif position j ]
c_0k = E[ # times letter k appears in the background ]
It easily follows:

M_jk^NEW = c_jk / Σ_{k'=1..4} c_jk'
B_k^NEW = c_0k / Σ_{k'=1..4} c_0k'

To not allow any zeros, add pseudocounts.

Overview of EM Algorithm
1. Initialize parameters θ = (M, B) and λ: try different values of λ from N^(-1/2) up to 1/(2K)
2. Repeat:
   a. Expectation
   b. Maximization
3. Until the change in θ = (M, B), λ falls below ε
4. Report results for several good λ
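Putting the pieces together, here is a compact sketch of one EM run over the n extracted K-bp words, following the E-step and M-step formulas above. The random initialization, pseudocount, and convergence test on λ are illustrative choices, not MEME's actual ones.

```python
import random

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def em_motif(words, K, lam, iters=200, eps=1e-6, pseudo=0.5):
    """Two-component mixture EM (MM/MEME-style sketch) over a list of K-bp words."""
    # 1. Initialize: random motif model M (K x 4) and uniform background B
    M = [normalize([random.random() + 0.1 for _ in range(4)]) for _ in range(K)]
    B = [0.25] * 4
    prev_lam = None
    for _ in range(iters):
        # E-step: E[Z_i1] = lam*P(X_i|M) / (lam*P(X_i|M) + (1-lam)*P(X_i|B))
        z = []
        for w in words:
            pm, pb = lam, 1.0 - lam
            for j, a in enumerate(w):
                pm *= M[j][IDX[a]]
                pb *= B[IDX[a]]
            z.append(pm / (pm + pb))
        # M-step: expected letter counts c_jk (motif) and c_0k (background), then renormalize
        c = [[pseudo] * 4 for _ in range(K)]
        c0 = [pseudo] * 4
        for w, zi in zip(words, z):
            for j, a in enumerate(w):
                c[j][IDX[a]] += zi
                c0[IDX[a]] += 1.0 - zi
        M = [normalize(row) for row in c]
        B = normalize(c0)
        lam = sum(z) / len(z)
        # Stop when lambda (used here as a convergence summary) stops changing
        if prev_lam is not None and abs(lam - prev_lam) < eps:
            break
        prev_lam = lam
    return M, B, lam
```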

Conclusion
One-iteration running time: O(NK)
Usually need < N iterations for convergence, and < N starting points
Overall complexity: unclear; typically O(N^2 K) to O(N^3 K)
EM is a local optimization method: initial parameters matter
MEME: Bailey and Elkan, ISMB 1994

Gibbs Sampling in Motif Finding

Overview of Gibbs Sampler
Iteratively sample one variable from its conditional distribution while the other variables are held fixed.
To draw (X_1, X_2, ..., X_n) ~ π(X): iterate over i, drawing X_i ~ π(X_i | X_[-i]).

Gibbs Sampling (1)
Given: x_1, ..., x_N, motif length K, background B
Find:
- Model M
- Locations a_1, ..., a_N in x_1, ..., x_N
maximizing the log-odds likelihood ratio:

Σ_{i=1..N} Σ_{k=1..K} log [ M(k, x_i[a_i + k]) / B(x_i[a_i + k]) ]

Gibbs Sampling (2)
AlignACE: first statistical motif finder
BioProspector: improved version of AlignACE
Algorithm (sketch):
1. Initialization:
   a. Select random locations in sequences x_1, ..., x_N
   b. Compute an initial model M from these locations
2. Sampling iterations:
   a. Remove one sequence x_i
   b. Recalculate the model
   c. Pick a new motif location in x_i according to the probability that the location is a motif occurrence

Gibbs Sampling (3)
Initialization:
Select random locations a_1, ..., a_N in x_1, ..., x_N
For these locations, compute M:

M_kj = (1/N) Σ_{i=1..N} 1( x_i[a_i + k] = j )

That is, M_kj is the number of occurrences of letter j at motif position k, divided by the total N.

Gibbs Sampling (4)
Predictive update:
Select a sequence x = x_i
Remove x_i and recompute the model:

M_kj = ( β_j + Σ_{s ≠ i} 1( x_s[a_s + k] = j ) ) / (N − 1 + B)

where the β_j are pseudocounts to avoid zeros, and B = Σ_j β_j.

Gibbs Sampling (5)
Sampling:
For every K-long word x_j ... x_{j+K-1} in x:
  Q_j = Prob[ word | motif ] = M(1, x_j) · ... · M(K, x_{j+K-1})
  P_j = Prob[ word | background ] = B(x_j) · ... · B(x_{j+K-1})
Let A_j = (Q_j / P_j) / Σ_{j'=1..|x|-K+1} (Q_j' / P_j')
Sample a random new position a_i according to the probabilities A_1, ..., A_{|x|-K+1}.

Gibbs Sampling (6)
Running Gibbs sampling:
1. Initialize
2. Run until convergence
3. Repeat steps 1-2 several times; report common motifs
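A sketch of the core sampling loop described on the preceding slides (initialization, predictive update, sampling), resampling one sequence per iteration. The 0th-order background, pseudocount β, and iteration count are illustrative, and real AlignACE/BioProspector add many refinements on top of this loop.

```python
import random

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def gibbs_motif(seqs, K, B=None, beta=0.5, iters=1000):
    """Gibbs-sampling motif finder sketch (AlignACE/BioProspector-style core loop)."""
    if B is None:
        B = [0.25] * 4
    N = len(seqs)
    # 1. Initialization: random start positions a_i
    a = [random.randrange(len(x) - K + 1) for x in seqs]
    for _ in range(iters):
        i = random.randrange(N)               # sequence to resample
        # Predictive update: model M from all sequences except x_i, with pseudocounts
        M = [[beta] * 4 for _ in range(K)]
        for s in range(N):
            if s == i:
                continue
            for k in range(K):
                M[k][IDX[seqs[s][a[s] + k]]] += 1.0
        M = [[v / sum(row) for v in row] for row in M]
        # Sampling: weight each position j by Q_j / P_j and draw a new a_i
        x = seqs[i]
        weights = []
        for j in range(len(x) - K + 1):
            q, p = 1.0, 1.0
            for k in range(K):
                q *= M[k][IDX[x[j + k]]]
                p *= B[IDX[x[j + k]]]
            weights.append(q / p)
        a[i] = random.choices(range(len(weights)), weights=weights)[0]
    return a  # motif start positions; instances are seqs[i][a[i]:a[i]+K]
```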

Advantages / Disadvantages
Very similar to EM.
Advantages:
- Easier to implement
- Less dependent on initial parameters
- More versatile, easier to enhance with heuristics
Disadvantages:
- More dependent on all sequences exhibiting the motif
- Less systematic search of the initial parameter space

Repeats, and a Better Background Model
Repeat DNA can be confused with motifs, especially low-complexity repeats (CACACA..., AAAAA..., etc.)
Solution: a more elaborate background model
- 0th order: B = { p_A, p_C, p_G, p_T }
- 1st order: B = { P(A|A), P(A|C), ..., P(T|T) }
- Kth order: B = { P(X | b_1...b_K); X, b_i ∈ {A, C, G, T} }
Higher-order models have been applied to EM and Gibbs (up to 3rd order).
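A minimal sketch of a higher-order background model: a k-th order Markov chain whose conditional probabilities P(X | b_1 ... b_k) are estimated from (k+1)-mer counts in the input sequences, with pseudocounts so unseen contexts still score. Function names and the pseudocount are illustrative.

```python
from collections import defaultdict

def train_background(seqs, order, pseudo=1.0):
    """Estimate a k-th order background: P(letter | previous `order` letters)."""
    counts = defaultdict(lambda: defaultdict(float))
    for x in seqs:
        for p in range(order, len(x)):
            counts[x[p - order:p]][x[p]] += 1.0
    model = {}
    for ctx, c in counts.items():
        total = sum(c.values()) + 4 * pseudo
        model[ctx] = {a: (c.get(a, 0.0) + pseudo) / total for a in "ACGT"}
    return model

def background_prob(word, model, order):
    """Probability of a word under the k-th order background (first `order` letters skipped)."""
    p = 1.0
    for i in range(order, len(word)):
        ctx = word[i - order:i]
        p *= model.get(ctx, {a: 0.25 for a in "ACGT"})[word[i]]
    return p
```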

Applications

Application 1: Motifs in Yeast
Group: Tavazoie et al. 1999, G. Church's lab, Harvard
Data: microarrays on 6,220 mRNAs from yeast
- Affymetrix chips (Cho et al.)
- 15 time points across two cell cycles

Processing of Data
1. Selection of 3,000 genes: the genes with the most variable expression were selected
2. Clustering according to common expression: K-means clustering; 30 clusters, 50-190 genes/cluster; clusters correlate well with known function
3. AlignACE motif finding: 600 bp upstream regions; 50 regions/trial

Motifs in Periodic Clusters

Motifs in Non-periodic Clusters

Application 2: Discovery of a Heat Shock Motif in C. elegans
Group: GuhaThakurta et al. 2002, C.D. Link's lab & colleagues
Data: microarrays on 11,917 genes from C. elegans
Isolated genes upregulated in heat shock

Processing of Data, and Results
Isolated 28 genes upregulated in heat shock across 5 separate experiments
Motif finding with CONSENSUS and ANN-Spec on 500 bp upstream regions
2 motifs found:
- TTCTAGAA: known heat shock factor (HSF) binding site
- GGGTGTC: previously unreported
Conserved in comparison with C. briggsae
Validated by in vitro mutagenesis of a GFP reporter