
CSE 527 Autumn 2009: Motifs: Representation & Discovery

Outline
Previously: learning from data
- MLE: Maximum Likelihood Estimators
- EM: Expectation Maximization (MLE with hidden data)
These slides:
- Bio: expression & regulation. Expression: creation of gene products. Regulation: when/where/how much of each gene product; complex and critical.
- Comp: using MLE/EM to find regulatory motifs in biological sequence data.

Gene Expression & Regulation

Gene Expression
Recall that a gene is a DNA sequence for a protein. To say a gene is expressed means that it is
- transcribed from DNA to RNA,
- the mRNA is processed in various ways,
- is exported from the nucleus (eukaryotes),
- is translated into protein.
A key point: not all genes are expressed all the time, in all cells, or at equal levels.

RNA Transcription
Some genes are heavily transcribed (many are not). [Figure: Alberts, et al.]

Regulation
In most cells, prokaryote or eukaryote, there is easily a 10,000-fold difference between the least- and most-highly expressed genes. Regulation happens at all steps. E.g., some genes are highly transcribed, some are not transcribed at all, some transcripts can be sequestered then released, or rapidly degraded, some are weakly translated, some are very actively translated, ... Below, we focus on the first step only: transcriptional regulation.

[Figure: E. coli growth on glucose + lactose; http://en.wikipedia.org/wiki/lac_operon]

1965 Nobel Prize in Physiology or Medicine: François Jacob, Jacques Monod, André Lwoff

Sea Urchin - Endo16 [figure]

DNA Binding Proteins
A variety of DNA binding proteins (so-called "transcription factors"; a significant fraction, perhaps 5-10%, of all human proteins) modulate transcription of protein-coding genes.

The Double Helix [figure: Los Alamos Science]

In the groove
Different patterns of potential H bonds at the edges of different base pairs, accessible especially in the major groove.

Helix-Turn-Helix DNA Binding Motif [figure]

H-T-H Dimers
Bind 2 DNA patches, ~1 turn apart; increases both specificity and affinity.

Zinc Finger Motif [figure]

Leucine Zipper Motif
Homo-/hetero-dimers and combinatorial control. [Figure: Alberts, et al.]

MyoD [figure: http://www.rcsb.org/pdb/explore/jmol.do?structureid=1mdy&bionumber=1]

Some protein/DNA interactions are well understood...

...but the overall DNA binding code still defies prediction. [Figure: CAP]

Summary
- Proteins can bind DNA to regulate gene expression (i.e., production of other proteins & themselves)
- This is widespread
- Complex combinatorial control is possible
- But it's not the only way to do this...

Sequence Motifs
Motif: a recurring salient thematic element. The last few slides described structural motifs in proteins. Equally interesting are the DNA sequence motifs to which these proteins bind - e.g., one leucine zipper dimer might bind (with varying affinities) to dozens or hundreds of similar sequences.

DNA binding site summary
- Complex code
- Short patches (4-8 bp)
- Often near each other (1 turn = 10 bp)
- Often reverse-complements
- Not perfect matches

E. coli Promoters
TATA Box: ~10 bp upstream of the transcription start. How to define it?
- Consensus is TATAAT, BUT all of these differ from it:
    TACGAT  TAAAAT  TATACT  GATAAT  TATGAT  TATGTT
- Allow k mismatches? Equally weighted?
- Wildcards like R, Y? ({A,G}, {C,T}, resp.)

E. coli Promoters
TATA Box: consensus TATAAT, ~10 bp upstream of the transcription start. Not exact: of 168 studied (mid-80's),
- nearly all had 2/3 of TAxyzT
- 80-90% had all 3
- 50% agreed in each of x, y, z
- no perfect match
Other common features at -35, etc.

TATA Box Frequencies (percent)
pos     1    2    3    4    5    6
A       2   95   26   59   51    1
C       9    2   14   13   20    3
G      10    1   16   15   13    0
T      79    3   44   13   17   96

TATA Scores: a Weight Matrix Model, or WMM (log2 likelihood ratios vs. a uniform background, on a 10x scale; see below)
pos     1    2    3    4    5    6
A     -36   19    1   12   10  -46
C     -15  -36   -8   -9   -3  -31
G     -13  -46   -6   -7   -9  -46(?)
T      17  -31    8   -9   -6   19
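To make the arithmetic concrete, here is a minimal Python sketch (the names freq and wmm_scores are mine, not from the slides). It assumes the frequency table holds percentages, uses a uniform 0.25 background, and scores each cell as round(10 x log2(frequency/background)); zero percentages are floored at 1% (an assumption) so the score stays finite, which happens to reproduce the slide's "-46(?)" entry for G in column 6.

```python
import math

# TATA box base frequencies (percent) by position, from the table above
freq = {
    'A': [2, 95, 26, 59, 51, 1],
    'C': [9, 2, 14, 13, 20, 3],
    'G': [10, 1, 16, 15, 13, 0],
    'T': [79, 3, 44, 13, 17, 96],
}

def wmm_scores(freq, background=0.25, scale=10):
    """Log-odds weight matrix: round(scale * log2(f / background)).
    Zero percentages are floored at 1% so the score stays finite."""
    return {base: [round(scale * math.log2(max(p, 1) / 100.0 / background))
                   for p in percents]
            for base, percents in freq.items()}

print(wmm_scores(freq))
# {'A': [-36, 19, 1, 12, 10, -46], 'C': [-15, -36, -8, -9, -3, -31],
#  'G': [-13, -46, -6, -7, -9, -46], 'T': [17, -31, 8, -9, -6, 19]}
```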

Scanning for TATA
Slide the weight matrix along the sequence, summing the per-position scores for each window.
[Figure: the WMM scored at successive offsets of a sequence containing ...ACTATAATCG; the window aligned with TATAAT scores 85, while poorly matching windows score around -90 and -91; a plot of score vs. position peaks at the TATAAT-like site. Stormo, Ann. Rev. Biophys. Biophys. Chem., 17, 1988, 241-263.]

TATA Scan at 2 genes
[Figure: WMM score profiles across the LacI and LacZ genes, roughly -400 to +400 around the AUG.]

Score Distribution (Simulated)
[Figure: histogram of WMM scores on simulated background sequence; scores range from about -150 to 90.]
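A companion sketch of the scanning step, reusing wmm_scores and freq from the previous snippet (the scan name and the short test sequence are illustrative): slide the 6-wide matrix along a sequence and sum the per-position scores for each window.

```python
def scan(seq, scores):
    """Score every window of seq against the weight matrix; returns (position, score) pairs."""
    width = len(scores['A'])
    return [(j, sum(scores[base][i] for i, base in enumerate(seq[j:j + width])))
            for j in range(len(seq) - width + 1)]

wmm = wmm_scores(freq)                     # from the previous sketch
for pos, score in scan("ACTATAATCG", wmm):
    print(pos, score)                      # the TATAAT window (position 2) scores 85
```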

Weight Matrices: Statistics
Assume:
- f_{b,i} = frequency of base b in position i in the TATA box
- f_b = frequency of base b in all sequences
Log likelihood ratio, given S = B_1 B_2 ... B_6:

  log [ P(S | promoter) / P(S | nonpromoter) ]
      = log [ (Π_{i=1..6} f_{B_i,i}) / (Π_{i=1..6} f_{B_i}) ]
      = Σ_{i=1..6} log ( f_{B_i,i} / f_{B_i} )

(Assumes independence between positions.)

Neyman-Pearson
Given a sample x_1, x_2, ..., x_n from a distribution f(... | θ) with parameter θ, we want to test the hypothesis θ = θ_1 vs θ = θ_2. Might as well look at the likelihood ratio

  f(x_1, x_2, ..., x_n | θ_1) / f(x_1, x_2, ..., x_n | θ_2) > some threshold

(or the log likelihood ratio).

What's the best WMM?
Given, say, 168 sequences s_1, s_2, ..., s_k of length 6, assumed to be generated at random according to a WMM defined by 6 x (4-1) parameters θ, what's the best θ? E.g., what's the MLE for θ given data s_1, s_2, ..., s_k? Answer: as with coin flips or dice rolls, count frequencies per position (see HW).

Weight Matrices: Chemistry
Experiments show ~80% correlation of log likelihood weight matrix scores to the measured binding energy of RNA polymerase to variations on the TATAAT consensus [Stormo & Fields].

Another WMM example
8 sequences: ATG ATG ATG ATG ATG GTG GTG TTG
Log-likelihood ratio: log2( f_{x_i,i} / f_{x_i} ), with background f_{x_i} = 1/4.

Freq.   Col 1   Col 2   Col 3
A       0.625   0       0
C       0       0       0
G       0.250   0       1
T       0.125   1       0

LLR     Col 1   Col 2   Col 3
A       1.32    -inf    -inf
C       -inf    -inf    -inf
G       0       -inf    2.00
T       -1.00   2.00    -inf

Non-uniform Background
E. coli DNA: approximately 25% each A, C, G, T.
M. jannaschii: 68% A+T, 32% G+C.
LLR from the previous example, assuming f_A = f_T = 3/8, f_C = f_G = 1/8:

LLR     Col 1   Col 2   Col 3
A       0.74    -inf    -inf
C       -inf    -inf    -inf
G       1.00    -inf    3.00
T       -1.58   1.42    -inf

E.g., G in col 3 is 8x more likely via the WMM than via the background, so its (log2) score = 3 bits.

Relative Entropy
AKA Kullback-Leibler distance/divergence, AKA information content. Given distributions P, Q:

  H(P || Q) = Σ_{x ∈ Ω} P(x) log ( P(x) / Q(x) )

Notes:
- Let P(x) log(P(x)/Q(x)) = 0 if P(x) = 0 [since lim_{y→0} y log y = 0]
- Undefined if 0 = Q(x) < P(x)
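The 8-sequence example above is small enough to recompute directly. A minimal sketch (names are mine): build per-column frequencies from the aligned instances, then take log2 likelihood ratios against a uniform and an AT-rich background; it reproduces the Freq and LLR tables, with -inf wherever a base was never observed.

```python
import math

seqs = ["ATG", "ATG", "ATG", "ATG", "ATG", "GTG", "GTG", "TTG"]

def column_freqs(seqs):
    """Per-column base frequencies from aligned motif instances."""
    return [{b: sum(s[i] == b for s in seqs) / len(seqs) for b in "ACGT"}
            for i in range(len(seqs[0]))]

def llr(freqs, background):
    """Per-column log2 likelihood ratios vs. a background distribution."""
    return [{b: (math.log2(col[b] / background[b]) if col[b] > 0 else float("-inf"))
             for b in "ACGT"} for col in freqs]

uniform = {b: 0.25 for b in "ACGT"}
at_rich = {"A": 3/8, "T": 3/8, "C": 1/8, "G": 1/8}   # M. jannaschii-like background

freqs = column_freqs(seqs)
print(llr(freqs, uniform))   # col 1: A 1.32, G 0.00, T -1.00; col 2: T 2.00; col 3: G 2.00
print(llr(freqs, at_rich))   # col 1: A 0.74, G 1.00, T -1.58; col 2: T 1.42; col 3: G 3.00
```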

WMM: How Informative?
Mean score of site vs. background? For any fixed-length sequence x, let
- P(x) = probability of x according to the WMM
- Q(x) = probability of x according to the background
Relative entropy:

  H(P || Q) = Σ_{x ∈ Ω} P(x) log2 ( P(x) / Q(x) )

H(P || Q) is the expected log likelihood score of a sequence randomly chosen from the WMM; -H(Q || P) is the expected score of a sequence drawn from the background. Expected score difference: H(P || Q) + H(Q || P).

WMM Scores vs Relative Entropy
[Figure: simulated score histogram with -H(Q || P) = -6.8 and H(P || Q) = 5.0 marked.]
On average, the foreground model scores exceed the background by 11.8 bits (a score difference of 118 on the 10x scale used in the examples above).

For a WMM: H(P || Q) = Σ_i H(P_i || Q_i), where P_i and Q_i are the WMM/background distributions for column i.
Proof: exercise. Hint: use the assumption of independence between WMM columns.

WMM Example, cont.
Freq.   Col 1   Col 2   Col 3
A       0.625   0       0
C       0       0       0
G       0.250   0       1
T       0.125   1       0

Uniform background:
LLR     Col 1   Col 2   Col 3
A       1.32    -inf    -inf
C       -inf    -inf    -inf
G       0       -inf    2.00
T       -1.00   2.00    -inf
RelEnt  0.70    2.00    2.00    (total 4.70)

Non-uniform background:
LLR     Col 1   Col 2   Col 3
A       0.74    -inf    -inf
C       -inf    -inf    -inf
G       1.00    -inf    3.00
T       -1.58   1.42    -inf
RelEnt  0.51    1.42    3.00    (total 4.93)
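The column decomposition H(P||Q) = Σ_i H(P_i||Q_i) makes the information content easy to compute. A small sketch reusing column_freqs, seqs, uniform, and at_rich from the previous snippet; it reproduces the RelEnt rows above.

```python
import math

def relative_entropy(p, q):
    """H(P || Q) = sum_x P(x) log2(P(x)/Q(x)), with the convention 0 log 0 = 0."""
    return sum(p[b] * math.log2(p[b] / q[b]) for b in p if p[b] > 0)

for label, background in [("uniform", uniform), ("non-uniform", at_rich)]:
    per_col = [relative_entropy(col, background) for col in column_freqs(seqs)]
    print(label, [round(h, 2) for h in per_col], "total", round(sum(per_col), 2))
# uniform     [0.7, 2.0, 2.0]   total 4.7
# non-uniform [0.51, 1.42, 3.0] total 4.93
```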

Pseudocounts
Are the -inf's a problem?
- If you are certain that a given residue never occurs in a given position, then -inf is just right.
- Otherwise, it may be a small-sample artifact.
- Typical fix: add a pseudocount to each observed count - a small constant (e.g., 0.5 or 1); sketched below.
- Sounds ad hoc, but there is a Bayesian justification.

WMM Summary
Weight Matrix Model (aka Position Weight Matrix, PWM; Position-Specific Scoring Matrix, PSSM, "possum"; 0th-order Markov model)
- Simple statistical model assuming independence between adjacent positions
- To build: count (+ pseudocount) letter frequency per position, take the log likelihood ratio to the background
- To scan: add LLRs per position, compare to a threshold
- Generalizations to higher-order models (i.e., letter frequency per position, conditional on neighbor) are also possible, given enough training data

How-to Questions
- Given aligned motif instances, build a model? Frequency counts (above, maybe with pseudocounts)
- Given a model, find (probable) instances? Scanning, as above
- Given unaligned strings thought to contain a motif, find it? (E.g., upstream regions of coexpressed genes.) Hard... rest of lecture.

Motif Discovery
Unfortunately, finding a site of maximum relative entropy in a set of unaligned sequences is NP-hard [Akutsu].
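Before turning to motif discovery algorithms, a small sketch of the pseudocount fix from the top of this slide group, reusing llr, seqs, and uniform from the earlier snippets (the function name is mine): adding 0.5 to every count before normalizing removes the -inf entries from the 8-sequence example.

```python
def column_freqs_pseudo(seqs, pseudocount=0.5):
    """Per-column frequencies with a pseudocount added to every base count."""
    n = len(seqs) + 4 * pseudocount
    return [{b: (sum(s[i] == b for s in seqs) + pseudocount) / n for b in "ACGT"}
            for i in range(len(seqs[0]))]

print(llr(column_freqs_pseudo(seqs), uniform))
# every entry is now finite; e.g. C in column 1 scores log2((0.5/10)/0.25) = -2.32
```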

Motif Discovery: 4 example approaches
- Brute force
- Greedy search
- Expectation Maximization
- Gibbs sampler

Brute Force
Input: motif length L, plus sequences s_1, s_2, ..., s_k (all of length n+L-1, say), each with one instance of an unknown motif.
Algorithm:
- Build all k-tuples of length-L subsequences, one from each of s_1, s_2, ..., s_k (n^k such tuples)
- Compute the relative entropy of each
- Pick the best (sketched below)

Brute Force, II
Input: motif length L, plus sequences s_1, s_2, ..., s_k (all of length n+L-1, say), each with one instance of an unknown motif.
Algorithm in more detail:
- Build singletons: each length-L subsequence of each of s_1, s_2, ..., s_k (nk sets)
- Extend to pairs: length-L subsequences from each pair of sequences (n^2 (k choose 2) sets)
- Then triples: length-L subsequences from each triple of sequences (n^3 (k choose 3) sets)
- Repeat until each set draws from all k sequences (n^k (k choose k) sets)
- Compute the relative entropy of each; pick the best
Problem: astronomically sloooow.

Example
Three sequences (A, B, C), each with two possible motif positions (0, 1):
Singletons: A0, A1, B0, B1, C0, C1
Pairs: A0,B0  A0,B1  A0,C0  A0,C1  A1,B0  A1,B1  A1,C0  A1,C1  B0,C0  B0,C1  B1,C0  B1,C1
Triples: A0,B0,C0  A0,B0,C1  A0,B1,C0  A0,B1,C1  A1,B0,C0  A1,B0,C1  A1,B1,C0  A1,B1,C1
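A minimal sketch of the brute-force search above, reusing column_freqs, relative_entropy, and uniform from the earlier snippets (the function name and the tiny seqs3 example are mine): enumerate every way of choosing one length-L window per sequence, score each choice by the total relative entropy of the resulting alignment, and keep the best. This is only feasible for very small inputs, which is exactly the point of the "astronomically slow" remark.

```python
from itertools import product

def brute_force_motif(seqs, L, background):
    """Try every combination of one length-L window per sequence (n^k tuples);
    return the alignment with the highest total relative entropy."""
    windows = [[s[j:j + L] for j in range(len(s) - L + 1)] for s in seqs]
    best, best_score = None, float("-inf")
    for choice in product(*windows):
        cols = column_freqs(list(choice))
        score = sum(relative_entropy(col, background) for col in cols)
        if score > best_score:
            best, best_score = choice, score
    return best, best_score

seqs3 = ["CCTATAATGG", "GGTATAATCC", "ACGTATAATA"]   # one planted TATAAT per sequence
print(brute_force_motif(seqs3, 6, uniform))
# (('TATAAT', 'TATAAT', 'TATAAT'), 12.0) - the planted motif, at 2 bits per column
```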

Greedy Best-First [Hertz, Hartzell & Stormo, 1989, 1990]
Input: sequences s_1, s_2, ..., s_k; motif length L; "breadth" d, say d = 1000.
Algorithm: as in brute force, but discard all but the best d relative entropies at each stage.
[Figure: greedy best-first search with d = 2; pruned candidates marked X.]
Usual greedy problems.

Expectation Maximization [MEME, Bailey & Elkan, 1995]
Input (as above): sequences s_1, s_2, ..., s_k; motif length l; background model; again assume one motif instance per sequence (variants possible).
Algorithm: EM
- Visible data: the sequences
- Hidden data: where's the motif?
    Y_{i,j} = 1 if the motif in sequence i begins at position j, 0 otherwise
- Parameters θ: the WMM

MEME Outline
Typical EM algorithm:
- Parameters θ^t at the t-th iteration are used to estimate where the motif instances are (the hidden variables)
- Use those estimates to re-estimate the parameters θ to maximize the likelihood of the observed data, giving θ^{t+1}
- Repeat
Key: given a few good matches to the best motif, expect to pick up more.

Expectation Step (where are the motif instances?)

  Ŷ_{i,j} = E(Y_{i,j} | s_i, θ^t)                                              [E = 0·P(0) + 1·P(1)]
          = P(Y_{i,j} = 1 | s_i, θ^t)
          = P(s_i | Y_{i,j} = 1, θ^t) · P(Y_{i,j} = 1 | θ^t) / P(s_i | θ^t)    [Bayes]
          = c · P(s_i | Y_{i,j} = 1, θ^t)
          = c · Π_{k=1..l} P(s_{i,j+k-1} | θ^t)

where c is chosen so that Σ_j Ŷ_{i,j} = 1.
[Figure: Ŷ_{i,j} as a distribution over positions 1, 3, 5, 7, 9, 11, ... of sequence i, summing to 1.]

Maximization Step (what is the motif?)
Find θ maximizing the expected value:

  Q(θ | θ^t) = E_{Y|θ^t}[ log P(s, Y | θ) ]
             = E_{Y|θ^t}[ log Π_{i=1..k} P(s_i, Y_i | θ) ]
             = E_{Y|θ^t}[ Σ_{i=1..k} log P(s_i, Y_i | θ) ]
             = E_{Y|θ^t}[ Σ_{i=1..k} Σ_{j=1..|s_i|-l+1} Y_{i,j} log P(s_i, Y_{i,j} = 1 | θ) ]
             = E_{Y|θ^t}[ Σ_{i=1..k} Σ_{j=1..|s_i|-l+1} Y_{i,j} log ( P(s_i | Y_{i,j} = 1, θ) P(Y_{i,j} = 1 | θ) ) ]
             = Σ_{i=1..k} Σ_{j=1..|s_i|-l+1} E_{Y|θ^t}[Y_{i,j}] log P(s_i | Y_{i,j} = 1, θ) + C
             = Σ_{i=1..k} Σ_{j=1..|s_i|-l+1} Ŷ_{i,j} log P(s_i | Y_{i,j} = 1, θ) + C

M-Step (cont.)
Exercise: show this is maximized by counting letter frequencies over all possible motif instances, with counts weighted by Ŷ_{i,j} - again, the obvious thing.

  s_1: ACGGATT...        windows ACGG, CGGA, GGAT, ... weighted by Ŷ_{1,1}, Ŷ_{1,2}, Ŷ_{1,3}, ...
  ...
  s_k: GC...TCGGAC       windows ..., CGGA, GGAC weighted by ..., Ŷ_{k,l-1}, Ŷ_{k,l}

Initialization
1. Try every motif-length substring, and use as the initial θ a WMM with, say, 80% of the weight on that sequence, the rest uniform
2. Run a few iterations of each
3. Run the best few to convergence
(Having a supercomputer helps): http://meme.sdsc.edu/

Another Motif Discovery Approach: The Gibbs Sampler
Lawrence, et al., "Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Sequence Alignment," Science, 1993
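Before moving on to the Gibbs sampler, a compact sketch of the EM loop from the last few slides, simplified to one motif per sequence with a uniform background (the function names and the reuse of seqs3 from the brute-force sketch are mine; math.prod needs Python 3.8+). The E-step computes Ŷ_{i,j} proportional to Π_k P(s_{i,j+k-1} | θ); the M-step re-estimates per-column frequencies with those weights plus a pseudocount; initialization follows the "80% weight on one substring" idea.

```python
import math

def e_step(seqs, theta, L):
    """Yhat[i][j] = posterior probability that the motif in sequence i starts at j."""
    Yhat = []
    for s in seqs:
        w = [math.prod(theta[k][s[j + k]] for k in range(L))
             for j in range(len(s) - L + 1)]
        total = sum(w)
        Yhat.append([v / total for v in w])
    return Yhat

def m_step(seqs, Yhat, L, pseudocount=0.5):
    """Re-estimate column frequencies, weighting each window by Yhat[i][j]."""
    counts = [{b: pseudocount for b in "ACGT"} for _ in range(L)]
    for s, ws in zip(seqs, Yhat):
        for j, wt in enumerate(ws):
            for k in range(L):
                counts[k][s[j + k]] += wt
    return [{b: c[b] / sum(c.values()) for b in "ACGT"} for c in counts]

def init_from_substring(sub, weight=0.8):
    """MEME-style start: ~80% of the weight on one substring, the rest uniform."""
    other = (1 - weight) / 3
    return [{b: (weight if b == c else other) for b in "ACGT"} for c in sub]

theta = init_from_substring("TATAAT")
for _ in range(10):                      # a few EM iterations
    theta = m_step(seqs3, e_step(seqs3, theta, 6), 6)
# theta's columns end up dominated by the planted TATAAT in seqs3
```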

Some History
- Geman & Geman, IEEE PAMI, 1984
- Hastings, Biometrika, 1970
- Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, "Equations of State Calculations by Fast Computing Machines," J. Chem. Phys., 1953
- Josiah Willard Gibbs, 1839-1903, American physicist, a pioneer of thermodynamics

How to Average
An old problem:
- k random variables: x_1, x_2, ..., x_k
- Joint distribution (p.d.f.): P(x_1, x_2, ..., x_k)
- Some function: f(x_1, x_2, ..., x_k)
- Want the expected value: E(f(x_1, x_2, ..., x_k))

How to Average

  E(f(x_1, x_2, ..., x_k)) = ∫_{x_1} ∫_{x_2} ... ∫_{x_k} f(x_1, x_2, ..., x_k) P(x_1, x_2, ..., x_k) dx_1 dx_2 ... dx_k

Approach 1: direct integration (rarely solvable analytically, especially in high dimensions).
Approach 2: numerical integration (often difficult, e.g., unstable, especially in high dimensions).
Approach 3: Monte Carlo integration (sketched below): sample x⁽¹⁾, x⁽²⁾, ..., x⁽ⁿ⁾ ~ P(x) and average:

  E(f(x)) ≈ (1/n) Σ_{i=1..n} f(x⁽ⁱ⁾)

Markov Chain Monte Carlo (MCMC)
Independent sampling is also often hard, but it is not required for estimating an expectation. MCMC: draw

  X_{t+1} ~ P(X_{t+1} | X_t)

with stationary distribution = P. Simplest & most common: Gibbs sampling, which uses the full conditionals

  P(x_i | x_1, x_2, ..., x_{i-1}, x_{i+1}, ..., x_k)

Algorithm:
  for t = 1, 2, ...
    for i = 1 to k do
      x_{t+1,i} ~ P(x_{t+1,i} | x_{t+1,1}, x_{t+1,2}, ..., x_{t+1,i-1}, x_{t,i+1}, ..., x_{t,k})

Gibbs Sampling for Motif Discovery
Input: again assume sequences s_1, s_2, ..., s_k, with one length-w motif per sequence.
Motif model: WMM.
Parameters: where are the motifs? For 1 ≤ i ≤ k, have 1 ≤ x_i ≤ |s_i| - w + 1.
"Full conditional": to calculate P(x_i = j | x_1, x_2, ..., x_{i-1}, x_{i+1}, ..., x_k), build a WMM from the motifs in all sequences except i, then calculate the probability that the motif in the i-th sequence occurs at j by the usual scanning algorithm.
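Going back to Approach 3 above: a tiny Monte Carlo sketch, reusing wmm_scores and freq from the first snippet (the function name and sample size are arbitrary). It estimates the expected WMM score of a window drawn from the uniform background, i.e., roughly -H(Q||P) on the 10x scale used earlier.

```python
import random

def expected_background_score(scores, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[score(window)] for windows drawn i.i.d.
    from a uniform background."""
    rng = random.Random(seed)
    width = len(scores['A'])
    total = 0.0
    for _ in range(n_samples):
        total += sum(scores[rng.choice("ACGT")][i] for i in range(width))
    return total / n_samples

print(expected_background_score(wmm_scores(freq)))
# roughly -68 on the 10x scale, i.e. about -6.8 bits, as in the earlier figure
```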

Similar to MEME, but MEME would average over, rather than sample from, the positions.

Overall Gibbs Algorithm
  Randomly initialize the x_i's
  for t = 1, 2, ...
    for i = 1 to k
      discard the motif instance from s_i; recalculate the WMM from the rest
      for j = 1 ... |s_i| - w + 1
        calculate the probability that the i-th motif is at j:
          P(x_i = j | x_1, x_2, ..., x_{i-1}, x_{i+1}, ..., x_k)
      pick a new x_i according to that distribution
(Code sketch below.)

Issues
- Burn-in: how long must we run the chain to reach stationarity?
- Mixing: how long a post-burn-in sample must we take to get a good sample of the stationary distribution? In particular, samples are not independent, and the chain may not move freely through the sample space (many isolated modes).

Variants & Extensions
- Phase shift: the sampler may settle on a suboptimal solution that overlaps part of the motif. Periodically try moving all motif instances a few spaces left or right.
- Algorithmic adjustment of pattern width: periodically add/remove flanking positions to maximize (roughly) the average relative entropy per position.
- Multiple patterns per string.
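A minimal sketch of the overall Gibbs algorithm above, assuming one motif per sequence and pseudocounted counts for the leave-one-out WMM (all names are mine; seqs3 is the toy example from the brute-force sketch). Each sweep drops one sequence, rebuilds the WMM from the remaining motif instances, scores every window of the held-out sequence, and resamples that sequence's motif position from the resulting distribution.

```python
import math
import random

def gibbs_motif_sampler(seqs, w, iters=200, pseudocount=0.5, seed=0):
    """Resample one motif start position per sequence per sweep."""
    rng = random.Random(seed)
    x = [rng.randrange(len(s) - w + 1) for s in seqs]      # random initial positions
    for _ in range(iters):
        for i, s in enumerate(seqs):
            # Leave sequence i out; build a WMM from everyone else's current motif.
            rest = [seqs[m][x[m]:x[m] + w] for m in range(len(seqs)) if m != i]
            n = len(rest) + 4 * pseudocount
            theta = [{b: (sum(r[k] == b for r in rest) + pseudocount) / n
                      for b in "ACGT"} for k in range(w)]
            # Full conditional: relative probability that the motif in s starts at j.
            weights = [math.prod(theta[k][s[j + k]] / 0.25 for k in range(w))
                       for j in range(len(s) - w + 1)]
            # Sample a new position for sequence i from that distribution.
            x[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return x

print(gibbs_motif_sampler(seqs3, 6))     # sampled motif start positions, one per sequence
```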

Methodology
- 13 tools
- Real motifs (TRANSFAC)
- 56 data sets (human, mouse, fly, yeast)
- Real, generic, Markov
- Expert users, top prediction only
- Blind, sort of

[Figure: tool-by-tool accuracy comparison; legend marks greedy ($), Gibbs (^), and EM (*) based tools.]

Lessons
- Evaluation is hard (especially when the truth is unknown)
- Accuracy is low; this partly reflects limitations in the evaluation methodology (e.g., 1 prediction per data set; results are better on synthetic data) and partly reflects a difficult task and limited knowledge (e.g., yeast > others)
- No clear winner regarding methods or models

Motif Discovery Summary
- Important problem: a key to understanding gene regulation
- Hard problem: short, degenerate signals amidst much noise
- Many variants have been tried, for representation, search, and discovery. We looked at only a few: weight matrix models for representation & search; greedy, MEME, and Gibbs for discovery
- Still much room for improvement. Comparative genomics, i.e., cross-species comparison, is very promising