DNA Binding Proteins CSE 527 Autumn 2007

Lectures 8-9 (& part of 10): Motifs: Representation & Discovery

A variety of DNA binding proteins ("transcription factors"; a significant fraction, perhaps 5-10%, of all human proteins) modulate transcription of protein-coding genes.

The Double Helix
[figure: Los Alamos Science]

In the groove
Different patterns of potential H bonds at the edges of different base pairs, accessible especially in the major groove.
[figure: Alberts, et al.]

Helix-Turn-Helix DNA Binding Motif
[figure: Alberts, et al.]

H-T-H Dimers
Bind 2 DNA patches, ~1 turn apart; increases both specificity and affinity.
[figure: Alberts, et al.]

Zinc Finger Motif
[figure: Alberts, et al.]

Leucine Zipper Motif
Homo-/hetero-dimers and combinatorial control.
[figure: Alberts, et al.]

Some Protein/DNA Interactions Well-Understood
But the overall DNA binding "code" still defies prediction.

Bacterial Met Repressor
A beta-sheet DNA binding domain. Negative feedback loop: high Met levels repress the Met synthesis genes (via SAM, a Met derivative).
[figure: Alberts, et al.]

DNA Binding Site Summary
Complex code
Short patches (6-8 bp)
Often near each other (1 turn = 10 bp)
Often reverse-complements
Not perfect matches

Sequence Motifs
The last few slides described structural motifs in proteins. Equally interesting are the DNA sequence motifs to which these proteins bind; e.g., one leucine zipper dimer might bind (with varying affinities) to dozens or hundreds of similar sequences.

E. coli Promoters
"TATA Box" ~10 bp upstream of transcription start. How to define it?
Consensus is TATAAT, BUT all of these differ from it:
  TACGAT
  TAAAAT
  TATACT
  GATAAT
  TATGAT
  TATGTT
Allow k mismatches? Equally weighted? Wildcards like R, Y? ({A,G}, {C,T}, resp.)

E. coli Promoters, cont.
TATA Box, consensus TATAAT, ~10 bp upstream of transcription start. Not exact: of 168 promoters studied (mid 80's),
  nearly all had 2/3 of TAxyzT
  80-90% had all 3
  50% agreed in each of x, y, z
  no perfect match
Other common features at -35, etc.

TATA Box Frequencies
base \ pos   1    2    3    4    5    6
A            2   95   26   59   51    1
C            9    2   14   13   20    3
G           10    1   16   15   13    0
T           79    3   44   13   17   96

Weight Matrices: Statistics
Assume:
  f_{b,i} = frequency of base b in position i in TATA
  f_b = frequency of base b in all sequences
Log likelihood ratio, given S = B_1 B_2 ... B_6:

  log [ P(S | promoter) / P(S | nonpromoter) ]
    = log ∏_{i=1}^{6} ( f_{B_i,i} / f_{B_i} )
    = ∑_{i=1}^{6} log ( f_{B_i,i} / f_{B_i} )

Assumes independence between positions.

TATA Scores
base \ pos    1     2     3     4     5     6
A           -36    19     1    12    10   -46
C           -15   -36    -8    -9    -3   -31
G           -13   -46    -6    -7    -9   -46(?)
T            17   -31     8    -9    -6    19
Stormo, Ann. Rev. Biophys. Biophys. Chem., 17, 1988, 241-263

Scanning for TATA
[figure: slide the score matrix along the sequence, adding the per-position scores at each offset]

Score Distribution (Simulated)
[figure: histogram of scores of random sequences, roughly -150 to +90]
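To make the scanning step concrete, here is a minimal Python sketch (not from the slides): it builds a log2-odds matrix from the TATA count table above, using a small pseudocount so the zero count (G in position 6) stays finite, and assumes a uniform 25% background. The slide's integer TATA Scores look like roughly 10× such log2 ratios, computed without a pseudocount.

```python
import math

# TATA box base counts per position, copied from the frequency table above.
counts = {
    'A': [2, 95, 26, 59, 51, 1],
    'C': [9, 2, 14, 13, 20, 3],
    'G': [10, 1, 16, 15, 13, 0],
    'T': [79, 3, 44, 13, 17, 96],
}
MOTIF_LEN = 6
BACKGROUND = 0.25   # assumed uniform background
PSEUDO = 0.5        # pseudocount so the zero count (G, position 6) stays finite

col_totals = [sum(counts[b][i] for b in 'ACGT') for i in range(MOTIF_LEN)]

# Log-odds weight matrix: log2(f_{b,i} / f_b), per base and position.
llr = {
    b: [math.log2(((counts[b][i] + PSEUDO) / (col_totals[i] + 4 * PSEUDO)) / BACKGROUND)
        for i in range(MOTIF_LEN)]
    for b in 'ACGT'
}

def score(window):
    """Sum of per-position log-odds scores for one length-6 window."""
    return sum(llr[base][i] for i, base in enumerate(window))

def scan(seq):
    """Score every length-6 window of seq; return (best_score, start_position)."""
    return max((score(seq[j:j + MOTIF_LEN]), j)
               for j in range(len(seq) - MOTIF_LEN + 1))

print(round(score("TATAAT"), 1))          # the consensus scores highest
print(scan("GCGTTGACATATAATGCA"))         # hypothetical sequence to scan
```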

Neyman-Pearson
Given a sample x_1, x_2, ..., x_n from a distribution f(· | θ) with parameter θ, want to test the hypothesis θ = θ_1 vs θ = θ_2. Might as well look at the likelihood ratio
  f(x_1, x_2, ..., x_n | θ_1) / f(x_1, x_2, ..., x_n | θ_2) > some threshold τ
(or, equivalently, the log likelihood ratio).

Score Distribution (Simulated)
[figure: histogram of WMM scores, roughly -150 to +90]

What's the best WMM?
Given 20 sequences s_1, s_2, ..., s_k of length 8, assumed to be generated at random according to a WMM defined by 8 × (4-1) parameters θ, what's the best θ? E.g., what's the MLE for θ given data s_1, s_2, ..., s_k? Answer: count frequencies per position (see the sketch below).

Weight Matrices: Chemistry
Experiments show ~80% correlation of log likelihood weight matrix scores to the measured binding energy of RNA polymerase to variations on the TATAAT consensus [Stormo & Fields].
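A small numerical illustration of the MLE claim (my own example code, using the six promoter examples listed earlier): under the independent-columns model, per-position frequency counts give a data log-likelihood at least as high as any perturbed set of column distributions.

```python
import math

# The six promoter examples listed earlier (all length 6).
seqs = ["TACGAT", "TAAAAT", "TATACT", "GATAAT", "TATGAT", "TATGTT"]

def column_freqs(seqs):
    """MLE for the independent-columns (WMM) model: per-position letter frequencies."""
    n, L = len(seqs), len(seqs[0])
    return [{b: sum(s[i] == b for s in seqs) / n for b in 'ACGT'} for i in range(L)]

def log_likelihood(seqs, cols):
    """Sum of log P(s) under the independent-columns model (zero prob -> -inf)."""
    ll = 0.0
    for s in seqs:
        for i, b in enumerate(s):
            ll += math.log(cols[i][b]) if cols[i][b] > 0 else float('-inf')
    return ll

mle = column_freqs(seqs)

# Perturb column 1: move some probability mass from T to C.
alt = [dict(c) for c in mle]
alt[0]['T'] -= 0.1
alt[0]['C'] += 0.1

print(log_likelihood(seqs, mle), ">", log_likelihood(seqs, alt))
```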

Another WMM Example
8 sequences:
  ATG ATG ATG ATG ATG GTG GTG TTG
Log-likelihood ratio: log2( f_{x_i,i} / f_{x_i} ), with f_{x_i} = 1/4.

Freq.   Col 1   Col 2   Col 3
A       .625    0       0
C       0       0       0
G       .250    0       1
T       .125    1       0

LLR     Col 1   Col 2   Col 3
A       1.32    -∞      -∞
C       -∞      -∞      -∞
G       0       -∞      2.00
T       -1.00   2.00    -∞

Non-uniform Background
E. coli DNA: approximately 25% each A, C, G, T. M. jannaschii: 68% A+T, 32% G+C.
LLR from the previous example, assuming f_A = f_T = 3/8, f_C = f_G = 1/8:

LLR     Col 1   Col 2   Col 3
A       .74     -∞      -∞
C       -∞      -∞      -∞
G       1.00    -∞      3.00
T       -1.58   1.42    -∞

e.g., G in col 3 is 8× more likely via the WMM than via the background, so the (log2) score is 3 (bits).

Relative Entropy
AKA Kullback-Leibler Distance/Divergence, AKA Information Content.
Given distributions P, Q:
  H(P||Q) = ∑_{x∈Ω} P(x) log ( P(x) / Q(x) )
Notes:
  Undefined if 0 = Q(x) < P(x)
  Let P(x) log(P(x)/Q(x)) = 0 if P(x) = 0   [since lim_{y→0} y log y = 0]
  H(P||Q) ≥ 0

WMM: How Informative?
Mean score of site vs. background? For any fixed-length sequence x, let
  P(x) = probability of x according to the WMM
  Q(x) = probability of x according to the background
Relative entropy:
  H(P||Q) = ∑_{x∈Ω} P(x) log2 ( P(x) / Q(x) )
H(P||Q) is the expected log likelihood score of a sequence randomly chosen from the WMM; -H(Q||P) is the expected score of a sequence drawn from the background.
[figure: score axis with -H(Q||P) and H(P||Q) marked]
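The two LLR tables above can be recomputed directly from the eight example sequences. This sketch assumes only the backgrounds stated on the slide (uniform 1/4, and f_A = f_T = 3/8, f_C = f_G = 1/8) and reproduces the values up to rounding.

```python
import math

seqs = ["ATG", "ATG", "ATG", "ATG", "ATG", "GTG", "GTG", "TTG"]
L = len(seqs[0])

# Per-column frequencies f_{b,i}; no pseudocounts, so zeros give -inf LLR entries.
freq = [{b: sum(s[i] == b for s in seqs) / len(seqs) for b in 'ACGT'}
        for i in range(L)]

def llr_table(background):
    """log2(f_{b,i} / f_b); -inf where the motif frequency is zero."""
    return [{b: (math.log2(freq[i][b] / background[b]) if freq[i][b] > 0
                 else float('-inf'))
             for b in 'ACGT'}
            for i in range(L)]

uniform = {b: 0.25 for b in 'ACGT'}
at_rich = {'A': 3/8, 'T': 3/8, 'C': 1/8, 'G': 1/8}   # the slide's AT-rich background

for name, bg in [("uniform", uniform), ("non-uniform", at_rich)]:
    table = llr_table(bg)
    print(name)
    for b in 'ACGT':
        print("  ", b, [round(table[i][b], 2) for i in range(L)])
```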

WMM Scores vs. Relative Entropy
[figure: score histogram with -H(Q||P) = -6.8 and H(P||Q) = 5.0 marked]
For a WMM you can show (based on the assumption of independence between columns) that
  H(P||Q) = ∑_i H(P_i||Q_i)
where P_i and Q_i are the WMM and background distributions for column i.

WMM Example, cont.
Freq.   Col 1   Col 2   Col 3
A       .625    0       0
C       0       0       0
G       .250    0       1
T       .125    1       0

Uniform background:
LLR     Col 1   Col 2   Col 3
A       1.32    -∞      -∞
C       -∞      -∞      -∞
G       0       -∞      2.00
T       -1.00   2.00    -∞
RelEnt  .70     2.00    2.00    (total 4.70)

Non-uniform background:
LLR     Col 1   Col 2   Col 3
A       .74     -∞      -∞
C       -∞      -∞      -∞
G       1.00    -∞      3.00
T       -1.58   1.42    -∞
RelEnt  .51     1.42    3.00    (total 4.93)

Pseudocounts
Are the -∞'s a problem? If you are certain that a given residue never occurs in a given position, then -∞ is just right; otherwise it may be a small-sample artifact. Typical fix: add a pseudocount (a small constant, e.g., 0.5 or 1) to each observed count. Sounds ad hoc, but there is a Bayesian justification.
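Continuing the same toy example, a sketch that computes the per-column relative entropies and their sum, reproducing the 4.70-bit (uniform background) and 4.93-bit (AT-rich background) totals above; a pseudocount variant is included since the slide's fix is simply to add a small constant to each count.

```python
import math

def rel_entropy(p, q):
    """H(P||Q) = sum_x P(x) log2(P(x)/Q(x)); terms with P(x) = 0 contribute 0."""
    return sum(px * math.log2(px / q[b]) for b, px in p.items() if px > 0)

def wmm_info(freq, background):
    """Information content of the WMM = sum over columns of H(P_i||Q_i)."""
    return sum(rel_entropy(col, background) for col in freq)

def freqs_with_pseudocounts(seqs, pseudo=0.5):
    """Column frequencies from counts plus a pseudocount (removes the -inf entries)."""
    n, L = len(seqs), len(seqs[0])
    return [{b: (sum(s[i] == b for s in seqs) + pseudo) / (n + 4 * pseudo)
             for b in 'ACGT'}
            for i in range(L)]

seqs = ["ATG"] * 5 + ["GTG"] * 2 + ["TTG"]
freq = [{b: sum(s[i] == b for s in seqs) / len(seqs) for b in 'ACGT'}
        for i in range(len(seqs[0]))]

uniform = {b: 0.25 for b in 'ACGT'}
at_rich = {'A': 3/8, 'T': 3/8, 'C': 1/8, 'G': 1/8}

print(round(wmm_info(freq, uniform), 2))                            # ~4.70
print(round(wmm_info(freq, at_rich), 2))                            # ~4.93
print(round(wmm_info(freqs_with_pseudocounts(seqs), uniform), 2))   # a bit lower
```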

WMM Summary
Weight Matrix Model (aka Position Specific Scoring Matrix, PSSM, "possum", 0th-order Markov model): a simple statistical model assuming independence between adjacent positions.
To build: count (+ pseudocount) letter frequencies per position, take the log likelihood ratio to the background.
To scan: add the LLRs per position, compare to a threshold.
Generalizations to higher-order models (i.e., letter frequency per position, conditional on the neighbor) are also possible, with enough training data.

How-to Questions
Given aligned motif instances, build the model? Frequency counts (as above, maybe with pseudocounts).
Given a model, find (probable) instances? Scanning, as above.
Given unaligned strings thought to contain a motif, find it? (e.g., upstream regions of co-expressed genes from a microarray experiment) Hard... the rest of the lecture.

Motif Discovery
Unfortunately, finding a site of maximum relative entropy in a set of unaligned sequences is NP-hard [Akutsu].

Motif Discovery: Four Example Approaches
Brute force
Greedy search
Expectation Maximization
Gibbs sampler

Brute Force
Input: sequences s_1, s_2, ..., s_k (each of length ~n, say), each with one instance of an unknown length-l motif.
Algorithm:
  create a singleton set for each length-l subsequence of each of s_1, s_2, ..., s_k (~nk sets)
  for each set, add each possible length-l subsequence not already present (~n^2 k(k-1) sets)
  repeat until all sets have k sequences (~n^k k! sets)
  compute the relative entropy of each; pick the best
Problem: astronomically slow.

Greedy Best-First Approach [Hertz & Stormo]
Input: sequences s_1, s_2, ..., s_k; motif length l; "breadth" d.
Algorithm (a sketch follows below):
  create a singleton set for each length-l subsequence of each of s_1, s_2, ..., s_k
  for each set, add each possible length-l subsequence not already present
  compute the relative entropy of each
  discard all but the d best
  repeat until all sets have k sequences
Usual greedy-search problems (can commit early to suboptimal choices).
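A rough Python sketch of the greedy beam search described above. Parameter names are mine, and it simplifies the slide's version by seeding candidate sets only from the first sequence; relative entropy is measured against an assumed uniform background, with pseudocounts.

```python
import math

def windows(seq, l):
    return [seq[j:j + l] for j in range(len(seq) - l + 1)]

def rel_entropy_of_set(motifs, bg=0.25, pseudo=0.5):
    """Sum over columns of H(P_i || uniform background), with pseudocounts."""
    n, l = len(motifs), len(motifs[0])
    total = 0.0
    for i in range(l):
        for b in 'ACGT':
            p = (sum(m[i] == b for m in motifs) + pseudo) / (n + 4 * pseudo)
            total += p * math.log2(p / bg)
    return total

def greedy_motif_search(seqs, l, breadth):
    """Beam search in the spirit of Hertz & Stormo: grow candidate motif sets
    one sequence at a time, keeping only the `breadth` best sets by relative
    entropy.  Simplified: seeds come only from the first sequence."""
    beam = [[w] for w in windows(seqs[0], l)]        # singleton candidate sets
    for s in seqs[1:]:
        expanded = [mset + [w] for mset in beam for w in windows(s, l)]
        expanded.sort(key=rel_entropy_of_set, reverse=True)
        beam = expanded[:breadth]                    # discard all but the d best
    return beam[0]                                   # best set of aligned l-mers

print(greedy_motif_search(
    ["GCGTATAATGC", "TTATACTCGAA", "CCCGATAATAT"], l=6, breadth=20))
```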

Expectation Step (where are the motif instances?) Ŷ i,j = E(Y i,j s i, θ t ) = P (Y i,j = 1 s i, θ t ) = P (s i Y i,j = 1, θ t ) P (Y i,j=1 θ t ) P (s i θ t ) = cp (s i Y i,j = 1, θ t ) = c l k=1 P (s i,j+k 1 θ t ) where c is chosen so that j Ŷi,j = 1. E = 0 P (0) + 1 P (1) Bayes Ŷ i,j 1 3 5 7 9 11... Sequence i }"=1 Maximization Step (what is the motif?) Find " maximizing expected value: Q(θ θ t ) = E Y θ t[log P (s, Y θ)] = E Y θ t[log k i=1 P (s i, Y i θ)] = E Y θ t[ k i=1 log P (s i, Y i θ)] si l+1 i=1 si l+1 i=1 si l+1 i=1 = E Y θ t[ k = E Y θ t[ k = k = k i=1 j=1 Y i,j log P (s i, Y i,j = 1 θ)] j=1 Y i,j log(p (s i Y i,j = 1, θ)p (Y i,j = 1 θ))] j=1 E Y θ t[y i,j ] log P (s i Y i,j = 1, θ) + C si l+1 j=1 Ŷ i,j log P (s i Y i,j = 1, θ) + C 41 42 M-Step (cont.) Initialization Q(θ θ t ) = k i=1 Exercise: Show this is maximized by counting letter frequencies over all possible motif instances, with counts weighted by Ŷ i,j, again the obvious thing. si l+1 j=1 Ŷ i,j log P (s i Y i,j = 1, θ) + C s 1 : ACGGATT...... s k : GC... TCGGAC Ŷ 1,1 Ŷ 1,2 Ŷ 1,3. Ŷ k,l 1 Ŷ k,l ACGG CGGA GGAT. CGGA GGAC 1. Try every motif-length substring, and use as initial " a WMM with, say 80% of weight on that sequence, rest uniform 2. Run a few iterations of each 3. Run best few to convergence (Having a supercomputer helps) 44

Another Motif Discovery Approach The Gibbs Sampler Lawrence, et al. Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Sequence Alignment, Science 1993 Some History Geman & Geman, IEEE PAMI 1984 Hastings, Biometrika, 1970 Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, Equations of State Calculations by Fast Computing Machines, J. Chem. Phys. 1953 Josiah Williard Gibbs, 1839-1903, American physicist, a pioneer of thermodynamics

How to Average An old problem: n random variables: Joint distribution (p.d.f.): Some function: Want Expected Value: x 1, x 2,..., x k P (x 1, x 2,..., x k ) f(x 1, x 2,..., x k ) E(f(x 1, x 2,..., x k )) How to Average E(f(x 1, x 2,..., x k )) = f(x 1, x 2,..., x k ) P (x 1, x 2,..., x k )dx 1 dx 2... dx k x 2 x k x 1 Approach 1: direct integration (rarely solvable analytically, esp. in high dim) Approach 2: numerical integration (often difficult, e.g., unstable, esp. in high dim) Approach 3: Monte Carlo integration sample x (1), x (2),... x (n) P ( x) and average: E(f( x)) 1 n n i=1 f( x(i) ) Markov Chain Monte Carlo (MCMC) Independent sampling also often hard, but not required for expectation MCMC X t+1 P ( X t+1 X t ) w/ stationary dist = P Simplest & most common: Gibbs Sampling P (x i x 1, x 2,..., x i 1, x i+1,..., x k ) Algorithm for t = 1 to # t+1 t for i = 1 to k do : x t+1,i P (x t+1,i x t+1,1, x t+1,2,..., x t+1,i 1, x t,i+1,..., x t,k ) 1 3 5 7 9 11... Sequence i Ŷ i,j

Input: again assume sequences s 1, s 2,..., s k with one length w motif per sequence Motif model: WMM Parameters: Where are the motifs? for 1! i! k, have 1! x i! s i -w+1 Full conditional : to calc P (x i = j x 1, x 2,..., x i 1, x i+1,..., x k ) build WMM from motifs in all sequences except i, then calc prob that motif in i th seq occurs at j by usual scanning alg. Similar to MEME, but it would average over, rather than sample from Overall Gibbs Alg Randomly initialize x i s for t = 1 to # for i = 1 to k discard motif instance from s i ; recalc WMM from rest for j = 1... s i -w+1 calculate prob that i th motif is at j: P (x i = j x 1, x 2,..., x i 1, x i+1,..., x k ) pick new x i according to that distribution 54 Issues Burnin - how long must we run the chain to reach stationarity? Mixing - how long a post-burnin sample must we take to get a good sample of the stationary distribution? (Recall that individual samples are not independent, and may not move freely through the sample space. Also, many isolated modes.) Variants & Extensions Phase Shift - may settle on suboptimal solution that overlaps part of motif. Periodically try moving all motif instances a few spaces left or right. Algorithmic adjustment of pattern width: Periodically add/remove flanking positions to maximize (roughly) average relative entropy per position Multiple patterns per string

Methodology 13 tools Real motifs (Transfac) 56 data sets (human, mouse, fly, yeast) Real, generic, Markov Expert users, top prediction only 60

$ Greed * Gibbs ^ EM * * $ * ^ ^ ^ * * Lessons Evaluation is hard (esp. when truth is unknown) Accuracy low partly reflects defects in evaluation methodology (e.g. <= 1 prediction per data set; results better in synth data) partly reflects difficult task, limited knowledge (e.g. yeast > others) No clear winner re methods or models Motif Discovery Summary Important problem: a key to understanding gene regulation Hard problem: short, degenerate signals amidst much noise Many variants have been tried, for representation, search, and discovery. We looked at only a few: Weight matrix models for representation & search greedy, MEME and Gibbs for discovery Still much room for improvement. Comparative genomics, i.e. cross-species comparison is very promising 63 64