6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Size: px

Start display at page:

Download "6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008"

Allison Phelps
5 years ago
Views:

1 MIT OpenCourseWare / Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit:

2 Computational Biology: Genomes, Networks, Evolution Motif Discovery Lecture 9 October 2, 2008

3 Regulatory Motifs Find promoter motifs associated with co-regulated or functionally related genes

4 Motifs Are Degenerate Protein-DNA interactions A DNA Sugar phosphate backbone B Proteins read DNA by feeling the chemical properties of the bases Without opening DNA (not by base complementarity) Ser Arg 3 Asn Sequence specificity 1 Topology of 3D contact dictates sequence specificity of binding Arg Some positions are fully constrained; other positions are degenerate C Base pair COOH D Ambiguous / degenerate positions are loosely contacted by the transcription factor NH 2 Figure by MIT OpenCourseWare.

5 Other Motifs Splicing Signals Splice junctions Exonic Splicing Enhancers (ESE) Exonic Splicing Surpressors (ESS) Protein Domains Glycosylation sites Kinase targets Targetting signals Protein Epitopes MHC binding specificities

6 Essential Tasks Modeling Motifs How to computationally represent motifs Visualizing Motifs Motif Information Predicting Motif Instances Using the model to classify new sequences Learning Motif Structure Finding new motifs, assessing their quality

7 Modeling Motifs

8 Consensus Sequences Useful for publication IUPAC symbols for degenerate sites Not very amenable to computation HEM13 HEM13 HEM13 ANB1 ANB1 ANB1 ANB1 ROX1 CCCATTGTTCTC TTTCTGGTTCTC TCAATTGTTTAG CTCATTGTTGTC TCCATTGTTCTC CCTATTGTTCTC TCCATTGTTCGT CCAATTGTTTTG YCHATTGTTCTC Figure by MIT OpenCourseWare. Nature Biotechnology 24, (2006)

9 Probabilistic Model HEM13 1 K Count frequencies Add pseudocounts CCCATT HEM13 HEM13 ANB1 ANB1 TTTCTG TCAATT CTCATT TCCATT A C G T M M K.7 ANB1 CCTATT P k (S M) ANB1 TCCATT ROX1 CCAATT Figure by MIT OpenCourseWare. Position Frequency Matrix (PFM)

10 Scoring A Sequence To score a sequence, we compare to a null model Score N N PS ( PFM) Pi( Si PFM ) Pi( Si PFM ) = log = log = log PS ( B) PS ( B) PS ( B) i= 1 i i= 1 i A C G T PFM Background DNA (B) A: 05 T: 05 G: 05 C: 05 A C G T Position Weight Matrix (PWM)

11 Scoring a Sequence Courtesy of Kenzie MacIsaac and Ernest Fraenkel. Used with permission. MacIsaac, Kenzie, and Ernest Fraenkel. "Practical Strategies for Discovering Regulatory DNA Sequence Motifs." PLoS Computational Biology 2, no. 4 (2006): e36. Common threshold = 60% of maximum score MacIsaac & Fraenkel (2006) PLoS Comp Bio

12 Visualizing Motifs Motif Logos Represent both base frequency and conservation at each position Height of letter proportional to frequency of base at that position Height of stack proportional to conservation at that position

13 Motif Information The height of a stack is often called the motif information at that position measured in bits Information Motif Position Information = 2 pblog b= { A, T, G, C} p b Why is this a measure of information?

14 Uncertainty and probability Uncertainty is related to our surprise at an event The sun will rise tomorrow The sun will not rise tomorrow Not surprising (p~1) Very surprising (p<<1) Uncertainty is inversely related to probability of event

15 Average Uncertainty Two possible outcomes for sun rising A The sun will rise tomorrow B The sun will not rise tomorrow P(A)=p 1 P(B)=p 2 What is our average uncertainty about the sun rising = PA ( )Uncertainty(A) + PB ( )Uncertainty(B) = p log p p log p = p i log p i = Entropy

16 Entropy Entropy measures average uncertainty Entropy measures randomness H ( X) = pilog 2 p i i If log is base 2, then the units are called bits

17 Entropy versus randomness Entropy is maximum at maximum randomness Entropy Example: Coin Toss P(heads)=0 Not very random H(X)=0.47 bits P(heads)=0.5 Completely random H(X)=1 bits P(heads)

18 Entropy Examples P(x) A1 T2 G3 C4 H( X ) = [05log(05) + 05log(05) + 05log(05) + 05log(05)] = 2 bits P(x) A1 T2 G3 C4 H( X ) = [0log(0) + 0log(0) + 0log(0) log(0.75)] = 0.63 bits

19 Information Content Information is a decrease in uncertainty Once I tell you the sun will rise, your uncertainty about the event decreases Information = H before (X) - H after (X) Information is difference in entropy after receiving information

20 Motif Information Motif Position Information = 2 - b= { A, T, G, C} p b log p b H background (X) Prior uncertainty about nucleotide H motif_i (X) Uncertainty after learning it is position i in a motif P(x) A T G C H(X)=2 bits P(x) A T G C H(X)=0.63 bits Uncertainty at this position has been reduced by 0.37 bits

21 Motif Logo Conserved Residue Reduction of uncertainty of 2 bits Little Conservation Minimal reduction of uncertainty

22 Background DNA Frequency The definition of information assumes a uniform background DNA nucleotide frequency What if the background frequency is not uniform? (e.g. Plasmodium) P(x) H background (X) A T G C H(X)=1.7 bits P(x) H motif_i (X) A T G C H(X)=1.9 bits Motif Position Information = b= { A, T, G, C} p b log p b = -0 bits Some motifs could have negative information!

23 A Different Measure Relative entropy or Kullback-Leibler (KL) divergence Divergence between a true distribution and another D ( P P ) = P ( i)log motif KL motif background motif i= { A, T, G, C} Pbackground P () i () i True Distribution Other Distribution D KL is larger the more different P motif is from P background Same as Information if P background is uniform Properties D D KL KL 0 = 0 if and only if P =P motif D ( P Q) D ( Q P) KL KL background

24 Comparing Both Methods Information assuming uniform background DNA KL Distance assuming 20% GC content (e.g. Plasmodium)

25 Online Logo Generation

26 Finding New Motifs Learning Motif Models

27 A Promoter Model Length K Motif Background DNA A C G T M M K.3.4 A: 05 T: 05 G: 05 C: 05 P k (S M) P(S B) The same motif model in all promoters

28 Probability of a Sequence Given a sequence(s), motif model and motif location A T A T G C 59 6 = = i Pk ( k + 63 ) PSeq ( Mstart 10, Model) PS ( B) S M P( S B) 100 i= 1 k = 1 i= 66 i S i = nucleotide at position i in the sequence A C G T M M K.3.4

29 Parameterizing the Motif Model Given multiple sequences and motif locations but no motif model AATGCG ATATGG ATATCG GATGCA Count Frequencies Add pseudocounts A C G T M 1 M 6 3/4 ETC 3/4

30 Finding Known Motifs Given multiple sequences and motif model but no motif locations P(Seq window Motif) window Calculate P(Seq window Motif) for every starting location

31 Motif Position Distribution Z ij Z ij the element of the matrix represents the probability that the motif starts in position j in sequence I Z Z = seq seq seq seq Some examples: Z 1 Z 2 Z 3 no clear winner two candidates one big winner Z 4 uniform

32 Calculating the Z Vector PZ ( = 1 SM, ) = ij PS ( Zij= 1, M) PZij ( = 1) PS ( ) (Bayes rule) PZ ( = 1 SM, ) = ij L K+ 1 k = 1 PS ( Zij= 1, M) PZij ( = 1) PS ( Zij= 1, M) PZij ( = 1) PZ ( = 1 SM, ) = ij L K+ 1 k = 1 PS ( Zij= 1, M) PS ( Zij= 1, M) Assume uniform priors (motif equally likely to start at any position)

33 Calculating the Z Vector - Example Z i Z i X i p = = G C T G T A G A C G T = = then normalize so that L W + 1 Z ij j= 1 = 1

34 Discovering Motifs Given a set of co-regulated genes, we need to discover with only sequences We have neither a motif model nor motif locations Need to discover both How can we approach this problem? (Hint: start with a random motif model)

35 Expectation Maximization (EM) Remember the basic idea! 1.Use model to estimate distribution of missing data 2.Use estimate to update model 3.Repeat until convergence Model is the motif model Missing data are the motif locations

36 EM for Motif Discovery 1. Start with random motif model 2. E Step: estimate probability of motif positions for each sequence A C G T M Step: use estimate to update motif model 4. Iterate (to convergence) A C G T ETC.3

37 The M-Step Calculating the Motif Matrix M ck is the probability of character c at position k With specific motif positions, we can estimate M ck : Counts of c at pos k In each motif position M ck, = b n ck, ck, n + d + d bk, bk, Pseudocounts But with probabilities of positions, Z ij, we average: n ck, = sequences S i { js = c} i Z ij

38 MEME MEME - implements EM for motif discovery in DNA and proteins MAST search sequences for motifs given a model

39 P(Seq Model) Landscape EM searches for parameters to increase P(seqs parameters) Useful to think of P(seqs parameters) as a function of parameters EM starts at an initial set of parameters And then climbs uphill until it reaches a local maximum P(Sequences params1,params2) Parameter1 Parameter2 Where EM starts can make a big difference

40 Search from Many Different Starts To minimize the effects of local maxima, you should search multiple times from different starting points MEME uses this idea Start at many points Run for one iteration Choose starting point that got the highest and continue P(Sequences params1,params2) Parameter1 Parameter2

41 The ZOOPS Model The approach as we ve outlined it, assumes that each sequence has exactly one motif occurrence per sequence; this is the OOPS model The ZOOPS model assumes zero or one occurrences per sequence

42 E-step in the ZOOPS Model We need to consider another alternative: the ith sequence doesn t contain the motif We add another parameter (and its relative) λ γ = ( L W +1)λ prior prob that any position in a sequence is the start of a motif prior prob of a sequence containing a motif

43 E-step in the ZOOPS Model PZ ( = 1) = ij Pr( S Z = 1, M) λ i L W+ 1 Pr( S Q = 0, M)(1 γ ) + Pr( S Z = 1, M) λ i i i ik k = 1 ij here Q i is a random variable that takes on 0 to indicate that the sequence doesn t contain a motif occurrence L W + 1 Q i = Z i j= 1, j

44 M-step in the ZOOPS Model update p same as before λ, γ update as follows λ γ 1 ( t+ 1) n m ( t+ 1) () t = = Zi, j ( L W + 1) n( L W + 1) sequences i= 1 positions j= 1 Z i, j ( t) average of across all sequences, positions

45 The TCM Model The TCM (two-component mixture model) assumes zero or more motif occurrences per sequence

46 Likelihood in the TCM Model the TCM model treats each length W subsequence independently to determine the likelihood of such a subsequence: j+ W 1 Pr( S Z = 1, M) = M ij ij ck, k j+ 1 k= j assuming a motif starts there j+ W 1 Pr( Sij Zij = 0, p) = P( ck B) k= j assuming a motif doesn t start there

47 E-step in the TCM Model Z ij = Pr( S Z = 1, M) λ i, j Pr( S Z = 0, B)(1 λ) + Pr( S Z = 1, M) λ i, j ij i, j ij ij subsequence isn t a motif subsequence is a motif M-step same as before

48 Gibbs Sampling A stochastic version of EM that differs from deterministic EM in two key ways 1. At each iteration, we only update the motif position of a single sequence 2. We may update a motif position to a suboptimal new position

49 Gibbs Sampling Best Location New Location 1. Start with random motif locations and calculate a motif model 2. Randomly select a sequence, remove its motif and recalculate tempory model A C G T With temporary model, calculate probability of motif at each position on sequence 4. Select new position based on this distribution A C G T Update model and Iterate ETC

50 Gibbs Sampling and Climbing Because gibbs sampling does not always choose the best new location it can move to another place not directly uphill P(Sequences params1,params2) Parameter1 Parameter2 In theory, Gibbs Sampling less likely to get stuck a local maxima

51 AlignACE Implements Gibbs sampling for motif discovery Several enhancements ScanAce look for motifs in a sequence given a model CompareAce calculate similarity between two motifs (i.e. for clustering motifs)

52 Antigen Epitope Prediction

53 Antigens and Epitopes Antigens are molecules that induce immune system to produce antibodies Antibodies recognize parts of molecules called epitopes

54 Genome to Immunome Pathogen genome sequences provide define all proteins that could illicit an immune response Looking for a needle Only a small number of epitopes are typically antigenic in a very big haystack Vaccinia virus (258 ORFs): 175,716 potential epitopes (8-, 9-, and 10-mers) M. tuberculosis (~4K genes): 433,206 potential epitopes A. nidulans (~9K genes): 1,579,000 potential epitopes Can computational approaches predict all antigenic epitopes from a genome?

55 Modeling MHC Epitopes Have a set of peptides that have been associate with a particular MHC allele Want to discover motif within the peptide bound by MHC allele Use motif to predict other potential epitopes

56 Motifs Bound by MHCs MHC 1 Closed ends of grove Peptides 8-10 AAs in length Motif is the peptide MHC 2 Grove has open ends Peptides have broad length distribution: AAs Need to find binding motif within peptides

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 009 Mark Craven craven@biostat.wisc.edu Sequence Motifs what is a sequence