Probabilistic models of biological sequence motifs


1 Probabilistic models of biological sequence motifs. Description of Known Motifs. AGB - Master in Bioinformatics UPF. Eduardo Eyras, Computational Genomics, Pompeu Fabra University - ICREA, Barcelona, Spain

2 What we will see. How to build simple probabilistic models to describe sequence motifs (using a training set). How to study motif properties in terms of heterogeneity and dependencies between positions. How to model dependencies between positions.

3 Genome. Added complexity: RNA processing (e.g. splicing). [Figure: gene structure from transcription start site to termination site; the pre-mRNA contains exons and introns; splicing removes the introns to produce the mRNA (5' UTR, protein-coding sequence, 3' UTR), which is translated into protein.]

4 Splice-site signals. [Figure: pre-mRNA with alternating exons and introns; the donor site (consensus CAGGURAGU), branch site (BS, consensus CURAY) and acceptor site (consensus YYNCAGG) mark the exon-intron boundaries.]


7 Description of signals (motifs)
Exact word (one example): CAGGTAAGT
Consensus from multiple examples:
CAGGTAAGT
TAGGTGAGC
GTAGTAAGA
CAAGTAATA
ATGGTAATG
CAGGTGATC
AAGGTGAGC
Consensus motif: NWRGTRAKN

8 The simplest probabilistic model: the Position Weight Matrix (PWM), also known as a position-specific scoring matrix (PSSM).

9 Weight Matrices. [Figure: pre-mRNA with donor (CAGGURAGU), branch site (CURAY) and acceptor (YYNCAGG) signals at the exon-intron boundaries.]
Observations (real donor splice sites, spanning the exon-intron boundary):
caggtaccc
gaggtgaga
ctggtgagg
taggtgagt
caggtctgt
ctggtgagc
caggtaagt
From these observations we estimate, at each position, the probability of each nucleotide (a table with one row per nucleotide A, C, G, T and one column per position). E.g. at position 1, P(C) = frequency of C = 5/7 = 0.71.
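A minimal sketch (not the course's code) of how this frequency table can be estimated in Python, using the seven example sites above; the function name build_pwm is ours:

```python
# Estimate a PWM (per-position nucleotide frequencies) from aligned sites.
from collections import Counter

sites = ["caggtaccc", "gaggtgaga", "ctggtgagg",
         "taggtgagt", "caggtctgt", "ctggtgagc", "caggtaagt"]

def build_pwm(seqs):
    """Return one dict per position, mapping nucleotide -> frequency."""
    pwm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts.values())
        pwm.append({b: counts.get(b, 0) / total for b in "acgt"})
    return pwm

pwm = build_pwm(sites)
print(round(pwm[0]["c"], 2))  # position 1: P(C) = 5/7 = 0.71
```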

10 Testing for a new functional site. What is the probability that a sequence S = s_1 s_2 s_3 ... s_N contains a functional site described by this model? We can calculate the probability that S is generated by the model obtained from the observations:
P(S) = P(s_1 s_2 ... s_N) = P(s_1, pos=1) P(s_2, pos=2) ... P(s_N, pos=N)
Implicitly, we assume independence between the positions.

11 Graphical Representation: Sequence Logos. [Figure: sequence logos of the donor-site matrix for human, Drosophila and yeast.]

12 Pseudocounts. In any observed data set there is the possibility, especially with low-probability events and/or small data sets, of a possible event not occurring. Its observed frequency is then 0, implying a probability of 0. In our donor-site example:
caggtaccc
gaggtgaga
ctggtgagg
taggtgagt
caggtctgt
ctggtgagc
caggtaagt
Estimated probability P(A, pos=1) = 0. We might wrongly infer that the lack of A is characteristic of splice sites (overfitting).
Simplest solution: modify the counts, n_i -> n_i + m p_i, i = 1,2,3,4, where p_i is a prior probability and m is the total pseudocount weight:
P(A) = (n_A + m p_A) / (n_A + n_C + n_G + n_T + m)
Laplace rule: pseudocount m = 4, p_i = 1/4 (i.e. add 1 to each count).
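A sketch of the same estimate with Laplace pseudocounts (assuming the build_pwm setup above):

```python
# PWM estimation with pseudocounts: add m*p to each count, m to the total.
from collections import Counter

def build_pwm_pseudo(seqs, m=4.0, prior=0.25):
    """Laplace rule by default: m = 4, p = 1/4, i.e. add 1 to every count."""
    pwm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts.values()) + m
        pwm.append({b: (counts.get(b, 0) + m * prior) / total for b in "acgt"})
    return pwm
```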

13 Hypothesis testing. Problem: choosing between two models M, R to represent a data set. Each model defines a probability distribution over the sample space S. We need a statistical test that can distinguish between the two models. To do so, we consider the likelihood ratio between the two models, where P(s|M) is P(s) under M and P(s|R) is P(s) under R:
LR = P(S|M) / P(S|R) = [P(s_1, pos=1 | M) / P(s_1, pos=1 | R)] ... [P(s_N, pos=N | M) / P(s_N, pos=N | R)]

14 Likelihood ratio. In general, we want to compare the model of real sites M with an alternative (false site) model R. Examples of alternative models: random sequences (a uniform background, P(a) = 0.25 for a = A, C, G, T), or false sites (sequences that contain GT but are not real donors).
LR = P(S|M) / P(S|R) = [P(s_1, pos=1 | M) / P(s_1, pos=1 | R)] ... [P(s_N, pos=N | M) / P(s_N, pos=N | R)]

16 Position Weight Matrices (PWMs). Probabilities are small, and multiplying many of them yields numbers too small to be handled accurately by a computer. Solution: use logarithms and the log-likelihood ratio:
log LR = log [ P(s_1|M) P(s_2|M) ... P(s_n|M) / (P(s_1|R) P(s_2|R) ... P(s_n|R)) ] = Σ_{i=1}^{n} log [ P(s_i|M) / P(s_i|R) ]
This defines the weight matrix M_{a,i} = log [ P(s_i = a | M) / P(s_i = a | R) ]:
pos    1      2      3      4      5      6      7      8      9
A    -999   1.02  -999   -999   -999  -0.22   1.02  -999   -0.91
C     1.02  -999  -0.22  -999   -999  -0.91  -0.91  -0.91  -0.22
G    -0.91  -999   1.02   1.38  -999   0.69  -999    1.16  -0.91
T    -0.91  -0.22 -999   -999   1.38  -999   -0.91  -999    0.47
Log 0 is generally set to a large negative number (here -999); the alternative is to use pseudocounts.

17 Position Weight Matrices (PWMs). Consider the example CTGGTAAGC:
log LR = log [ P(CTGGTAAGC|M) / P(CTGGTAAGC|R) ]
= log [ P_1(C|M) P_2(T|M) P_3(G|M) ... P_8(G|M) P_9(C|M) / ( P_1(C|R) P_2(T|R) P_3(G|R) ... P_8(G|R) P_9(C|R) ) ]
= log [P_1(C|M)/P_1(C|R)] + log [P_2(T|M)/P_2(T|R)] + log [P_3(G|M)/P_3(G|R)] + ... + log [P_8(G|M)/P_8(G|R)] + log [P_9(C|M)/P_9(C|R)]
= M_{C,1} + M_{T,2} + M_{G,3} + ... + M_{G,8} + M_{C,9}
= 1.02 - 0.22 + 1.02 + 1.38 + 1.38 - 0.22 + 1.02 + 1.16 - 0.22 = 6.32
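The same calculation as a sketch in Python, with the matrix transcribed from the slide (natural-log ratios against a uniform background; -999 encodes log 0):

```python
# Score a candidate 9-mer with the donor-site log-likelihood matrix.
M = {
    "A": [-999, 1.02, -999, -999, -999, -0.22, 1.02, -999, -0.91],
    "C": [1.02, -999, -0.22, -999, -999, -0.91, -0.91, -0.91, -0.22],
    "G": [-0.91, -999, 1.02, 1.38, -999, 0.69, -999, 1.16, -0.91],
    "T": [-0.91, -0.22, -999, -999, 1.38, -999, -0.91, -999, 0.47],
}

def score(seq):
    """Sum of per-position log-likelihood ratios M[a][i]."""
    return sum(M[base][i] for i, base in enumerate(seq))

print(round(score("CTGGTAAGC"), 2))  # 6.32, as on the slide
```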

18 Position Weight Matrices (PWMs). To search an unknown sequence for this motif (i.e. for a possible donor splice site), we slide a window of the same size as the motif along the sequence, and at each position we score the similarity to the motif with the score given by the matrix:
..cgtgagtcggggtgagagcatgctggtaagcccggctggtgaactgccggtagtc..
We assign a score to each 9-base window and use a score cutoff to predict potential 5' splice sites:
log LR = log [P(S|M)/P(S|R)] = log [P(s_1, pos=1 | M)/P(s_1, pos=1 | R)] + ... + log [P(s_N, pos=N | M)/P(s_N, pos=N | R)] > a
log LR > a => more likely to correspond to a case of model M
log LR < a => more likely to correspond to a case of model R
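A sliding-window sketch, reusing score() and M from the previous example:

```python
# Scan a sequence with the 9-bp matrix and report windows above a cutoff.
def scan(seq, cutoff=0.0, width=9):
    hits = []
    for k in range(len(seq) - width + 1):
        window = seq[k:k + width].upper()
        s = score(window)  # score() and M as defined above
        if s > cutoff:
            hits.append((k, window, round(s, 2)))
    return hits

seq = "CGTGAGTCGGGGTGAGAGCATGCTGGTAAGCCCGGCTGGTGAACTGCCGGTAGTC"
for pos, window, s in scan(seq):
    print(pos, window, s)
```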

19 Position Weight Matrices (PWMs). Since GT is invariant, we can skip the contribution of positions 4 and 5 and consider only query windows with GT at these positions. Example:
..CGTGAGTCGGGGTGAGAGCATGCTGGTAAGCCCGGCTGGTGAACTGCCGGTAGTC..
We would score only the windows that carry GT at positions 4-5.

20 Searching for motifs in novel sequences (the sliding window approach). Training: on the training set we calculate the parameters of the PWM. Testing: we use test data to evaluate the accuracy of our model and to establish the score cutoffs. [Figure: distributions of log-likelihood scores of real vs. pseudo splice sites, for the 3' splice site (acceptor) model and the 5' splice site (donor) model.]

21 Searching for motifs in novel sequences (the sliding window approach). Compare the log-likelihood scores of real vs. pseudo splice sites. [Figure: acceptor and donor score distributions, as on the previous slide.]
Reminder:
Sensitivity: fraction of real sites with score above the cutoff.
PPV: fraction of sites with score above the cutoff that are true sites.
FPR: fraction of negative cases (pseudo splice sites) that score above the cutoff.
Etc.

22 Determining the relevant positions

23 Splice-site signals. [Figure: pre-mRNA with donor (CAGGURAGU), branch site (CURAY) and acceptor (YYNCAGG) signals; U1 snRNP base-pairs with the donor site (GUCCAUUCA opposite CAGGUAAGU), U2 snRNP with the branch site (UACUAC opposite AUGAUG, with the branch-point A bulged out), and U2AF65/U2AF35 bind the polypyrimidine tract (PPT) and the acceptor CAGG.]

24 Splice-site signals. [Figure: detail of the spliceosomal components recognizing the signals, as above.]

25 Splice-site signals. How many positions are relevant to model? [Figure: as above.]

26 Information Content. The information content is the change in entropy between the expected and the observed distribution:
I_c(X) = H_before - H_after = log_2 N + Σ_{i=1}^{N} P(x_i) log_2 P(x_i)
(comparing to a uniform background over N symbols)
[Figure: (A) information content, (B) nucleotide frequencies and (C) mutual information along the splice-site signals, with the spliceosomal components (U1 snRNP, U2 snRNP, U2AF65/U2AF35) indicated. Corvelo et al. 2010]
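A sketch of the per-position information content of the PWM columns (build_pwm_pseudo and sites as in the earlier sketches):

```python
# Information content per motif position: log2(N) minus column entropy.
import math

def information_content(column, n_symbols=4):
    """column: dict mapping nucleotide -> probability at one position."""
    h_after = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(n_symbols) - h_after

pwm = build_pwm_pseudo(sites)
for i, col in enumerate(pwm, start=1):
    print(i, round(information_content(col), 2))
```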

27 Information Content. [Figure: donor-site sequence logos for H. sapiens and S. cerevisiae.] The height of each position is proportional to its information content; the relative sizes of the letters are proportional to their frequencies.

28 Kullback-Leibler divergence of two distributions. Also called the relative entropy, it is the expected value of the log-ratio of two distributions:
log_2 L = log_2 [P(x) / Q(x)]
D(P||Q) = E(log_2 L) = Σ_x P(x) log_2 [P(x) / Q(x)] = Σ_{i=1}^{n} P(x_i) log_2 [P(x_i) / Q(x_i)]
The relative entropy is defined for two probability distributions that take values over the same alphabet (same symbols).
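A direct sketch of this definition in Python (pwm from the earlier sketches):

```python
# Relative entropy D(P||Q) in bits, for distributions over the same alphabet.
import math

def kl_divergence(p, q):
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

uniform = {b: 0.25 for b in "acgt"}
print(round(kl_divergence(pwm[0], uniform), 2))  # D at position 1 vs. uniform
```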

29 Kullback-Leibler divergence of two distributions.
D(P||Q) = Σ_x P(x) log_2 [P(x) / Q(x)]
The relative entropy is not a distance, but it measures how different two distributions are.
It is not symmetric: D(P||Q) ≠ D(Q||P).
Its value is never negative, and it is zero only when the two distributions are identical: D(P||Q) >= 0, with equality for P = Q.
The relative entropy provides a measure of the information content gained with the distribution P with respect to the distribution Q. Its applications are similar to those of the information content. It is better to apply D rather than I_c when the background is not uniform.

30 Exercise (exam 2013). Consider two discrete probability distributions P and Q, such that Σ_i P(x_i) = 1 and Σ_i Q(x_i) = 1. Show that the relative entropy D(P||Q) is equivalent to the information content of P when the distribution Q is uniform.
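A sketch of the solution: with Q uniform over N symbols, Q(x_i) = 1/N, and using Σ_i P(x_i) = 1,

```latex
D(P\|Q) = \sum_{i=1}^{N} P(x_i)\log_2\frac{P(x_i)}{1/N}
        = \log_2 N \sum_{i=1}^{N} P(x_i) + \sum_{i=1}^{N} P(x_i)\log_2 P(x_i)
        = \log_2 N + \sum_{i=1}^{N} P(x_i)\log_2 P(x_i) = I_c(X)
```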

31 Total Relative Entropy. To quantify the variability of an entire motif we can calculate the total relative entropy, adding up the value for all positions.
P: distribution of the observed sequences corresponding to the motif.
Q: distribution of a background model (e.g. random sequences).
Relative entropy at one position of the motif:
D(P||Q) = Σ_{a=1}^{4} P(x_a) log_2 [P(x_a) / Q(x_a)]
Total relative entropy:
D_total(P||Q) = Σ_{i=1}^{N} Σ_{a=1}^{4} P(x_{a,i}) log_2 [P(x_{a,i}) / Q(x_{a,i})]
The total relative entropy is calculated from the probability distribution of nucleotides a = 1,2,3,4 at each position i = 1,...,N of the real signal, P(x_{a,i}), relative to the distribution in a randomized set, Q(x_{a,i}).
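As a sketch, the total relative entropy is just the sum of the per-position kl_divergence() defined above:

```python
# Total relative entropy of a motif: sum D(P||Q) over all PWM columns.
def total_relative_entropy(pwm, background):
    return sum(kl_divergence(col, background) for col in pwm)

print(round(total_relative_entropy(pwm, uniform), 2))
```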

32 Total Relative Entropy.
D_total(P||Q) = Σ_{i=1}^{N} Σ_{a=1}^{4} P(x_{a,i}) log_2 [P(x_{a,i}) / Q(x_{a,i})]
[Figure: total relative entropy of splice-site signals. Plass et al. 2008]


34 Modeling dependencies between positions

35 Modeling dependencies between positions. [Figure: log-likelihood score distributions of real vs. pseudo sites for the 3' (acceptor) and 5' (donor) splice-site models; the distributions overlap.] What does this result tell us?
A) The splicing machinery also uses other information besides the 5'ss/3'ss motifs to identify splice sites.
B) The PWM model does not accurately capture some aspects of the 5'ss/3'ss that are used in recognition.
C) Or both.

36 Modeling dependencies between positions

37 Mutual information. The mutual information of two random variables X and Y measures the dependencies between the two variables, that is, the information in X that is shared with Y:
MI(X,Y) = H(X) + H(Y) - H(X,Y) = Σ_x Σ_y P(x,y) log_2 [P(x,y) / (P(x)P(y))]
E.g. X and Y take as values the nucleotides at two different positions, and the sum is carried out over the alphabet of nucleotides.
Independent positions: MI(X,Y) = 0. Dependent positions: MI(X,Y) > 0.
Example (positions 1-5):
CTGAG
GTAGA
TTGAC
ATAGT
GTGAG
CTAAA
TTGAC
ATAAT
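A sketch of the mutual information between two columns of an alignment, using the toy alignment above; columns 3 and 5 of that alignment share 1 bit of information:

```python
# Mutual information (bits) between alignment columns i and j (0-based).
import math
from collections import Counter

def mutual_information(seqs, i, j):
    n = len(seqs)
    p_ij = Counter((s[i], s[j]) for s in seqs)
    p_i = Counter(s[i] for s in seqs)
    p_j = Counter(s[j] for s in seqs)
    mi = 0.0
    for (a, b), c in p_ij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

aln = ["CTGAG", "GTAGA", "TTGAC", "ATAGT", "GTGAG", "CTAAA", "TTGAC", "ATAAT"]
print(mutual_information(aln, 2, 4))  # positions 3 and 5 -> 1.0
```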

38 Mutual information.
MI(X,Y) = Σ_{i=1}^{n} Σ_{j=1}^{n} P(x_i, y_j) log_2 [P(x_i, y_j) / (P(x_i) P(y_j))]
[Figure: mutual information around donor (a) and acceptor (b) splice sites; positions in the donor-site and acceptor-site windows are shown along both axes. The diagonal line present in both panels represents the mutual information between neighbouring pairs of bases. Levine & Durbin 2001]


40 Score calculations (mutual information between positions i and j; Durbin et al., 1998). Based on the mutual information results, a model was built which divided the region around the donor splice site into blocks. This model was scored using log-likelihood scoring, considering the conditional probabilities of the blocks (the dependencies between blocks), and using genomic dinucleotide frequencies for the null model; the score of a sequence X is thus expressed in bits. Frequency values for each possible base combination of each block given the dependencies in the model were calculated by adding pseudocounts, based on genomic dinucleotide frequencies, to the observed counts. Thus, for example:
f(abc|z) = [ C(x_{-4}=z, x_{-3}=a, x_{-2}=b, x_{-1}=c) + 4^3 q(a|z) q(b|a) q(c|b) ] / ...
The dependencies are used to build the model (see more later). Levine & Durbin 2001

41 Mutual information. [Figure: comparison between independent, first-order dependence, and block dependence models for donor splice site identification: fraction of true sites included (TPR) vs. number of false positives per 10 kb (FPR). The block dependence model predicted fewer false positives at most sensitivity levels than did the other two models.]
Using a model with dependencies improves the overall accuracy. The block dependence model shows that splice-site signal recognition can be improved by considering higher-order and long-range interactions; however, the improvement is quite modest when compared to the first-order dependence model. Other splice-site identification models, including maximal dependence decomposition (the model used by the GENSCAN gene prediction program; Burge and Karlin), also show modest improvements over a first-order dependence model. Levine & Durbin 2001

42 Markov models

43 Mutual information can help us find out about dependencies between positions (between variables). How do we incorporate that into the model?

44 Markov models. The probability to observe a sequence S = s_1 s_2 s_3 ... s_N according to the model described by P is
P(S) = P(s_1 s_2 s_3 ... s_N)
The joint probability can be rewritten as a factorization of conditional probabilities (the chain rule for probabilities):
P(S) = P(s_1 s_2 s_3 ... s_N) = P(s_N | s_1 ... s_{N-1}) P(s_{N-1} | s_1 ... s_{N-2}) ... P(s_2 | s_1) P(s_1)
For three elements, apply the definition of conditional probability twice:
P(s_3 | s_1 s_2) = P(s_1 s_2 s_3) / P(s_1 s_2) and P(s_2 | s_1) = P(s_1 s_2) / P(s_1)
=> P(s_1 s_2 s_3) = P(s_3 | s_1 s_2) P(s_1 s_2) = P(s_3 | s_1 s_2) P(s_2 | s_1) P(s_1)

45 Markov models. The chain rule for probabilities:
P(S) = P(s_1 s_2 s_3 ... s_N) = P(s_N | s_1 ... s_{N-1}) P(s_{N-1} | s_1 ... s_{N-2}) ... P(s_2 | s_1) P(s_1)
We define the order of the Markov chain as the number of dependencies:
ORDER 0: P(s_i | s_1 ... s_{i-1}) = P(s_i)
ORDER 1: P(s_i | s_1 ... s_{i-1}) = P(s_i | s_{i-1})
ORDER 2: P(s_i | s_1 ... s_{i-1}) = P(s_i | s_{i-2} s_{i-1})
...
ORDER n: P(s_i | s_1 ... s_{i-1}) = P(s_i | s_{i-n} ... s_{i-1})
E.g. Markov model of order 1 (Markov chain):
P(S) = P(s_1 s_2 s_3 ... s_N) = P(s_N | s_{N-1}) P(s_{N-1} | s_{N-2}) ... P(s_2 | s_1) P(s_1)

46 Markov models. E.g. Markov model of order 1:
P(S) = P(s_1 s_2 s_3 ... s_N) = P(s_N | s_{N-1}) P(s_{N-1} | s_{N-2}) ... P(s_2 | s_1) P(s_1)
1) Probabilities are estimated regardless of the position (recall the Naive Bayes model for book classification). E.g. for order 1:
P(s_i = G | s_{i-1} = T) = n(s_{i-1} = T, s_i = G) / Σ_a n(s_{i-1} = T, s_i = a)
2) We always need an initial set of probabilities. E.g. for order 1:
P(A) = n(A) / Σ_b n(b)
These are estimated from the initial positions of the training set.
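A sketch of this estimation by transition counting (the function name is ours; the three training sequences are those of the next slides):

```python
# Train a first-order Markov chain: transition and initial probabilities.
from collections import Counter

def train_markov1(seqs):
    trans = Counter()
    for s in seqs:
        for a, b in zip(s, s[1:]):  # count each adjacent pair a -> b
            trans[(a, b)] += 1
    bases = "ACGT"
    P = {}
    for a in bases:
        total = sum(trans[(a, b)] for b in bases)
        P[a] = {b: trans[(a, b)] / total if total else 0.0 for b in bases}
    first = Counter(s[0] for s in seqs)  # initial probabilities
    init = {a: first[a] / len(seqs) for a in bases}
    return init, P

init, P = train_markov1(["GCCGCGCTTG", "GCTTGGTGGC", "TGGCCGTTGC"])
print(round(P["G"]["C"], 2))  # 7/12 = 0.58
```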

47 Markov models. Example: consider the following sequences:
GCCGCGCTTG
GCTTGGTGGC
TGGCCGTTGC

48 Markov models. Example: consider the following sequences:
GCCGCGCTTG
GCTTGGTGGC
TGGCCGTTGC
For the 1st-order parameter we count the number of times that C follows a G in the sequences, and divide by the number of times any nucleotide follows a G:
P(C|G) = n(s_{i-1} = G, s_i = C) / Σ_{a in {A,C,G,T}} n(s_{i-1} = G, s_i = a) = (G->C transitions) / (12 G positions) = 7/12

49 Markov models. Example: consider the following sequences:
GCCGCGCTTG
GCTTGGTGGC
TGGCCGTTGC
Transition probabilities: P(C|G) = 7/12, P(A|G) = 0/12, P(G|G) = 3/12, P(T|G) = 2/12
Initial probabilities (from the first positions): P(C) = 0, P(A) = 0, P(G) = 2/3, P(T) = 1/3
We can use pseudocounts with the sequences, as before (e.g. to avoid the zero estimates).

50 Markov chains. Markov models of order 1 are generally called Markov chains:
P(s_i | s_1 ... s_{i-1}) = P(s_i | s_{i-1})
P(S) = P(s_1 s_2 s_3 ... s_N) = P(s_N | s_{N-1}) P(s_{N-1} | s_{N-2}) ... P(s_2 | s_1) P(s_1)
E.g. a Markov chain for nucleotides is a set of probabilities of the form P(a|b), where a, b in {A,C,G,T}. We can view these as transitions.

51 Markov chains. A Markov chain can be represented as a set of states (one per nucleotide) with connections between them: the transition probabilities. The start of the sequence string is modeled with an initial fictitious state S_0, whose initial probabilities satisfy
Σ_{s_1} P(s_1) = 1
The transition probabilities out of each state also sum to one:
Σ_{s_i} P(s_i | s_{i-1}) = 1, e.g. P(A|C) + P(C|C) + P(G|C) + P(T|C) = 1

52 Markov models. Markov model of order k: the next base depends on the previous k bases. For order 2:
P(S) = P(s_1 s_2 s_3 ... s_N) = P(s_N | s_{N-2} s_{N-1}) P(s_{N-1} | s_{N-3} s_{N-2}) ... P(s_3 | s_1 s_2) P(s_1 s_2)
E.g. P(ACA) = P(A | AC) P(AC)

53 Markov models. For order 2:
1) We estimate the probabilities regardless of the position:
P(s_i = G | s_{i-2} = A, s_{i-1} = T) = n(s_{i-2} = A, s_{i-1} = T, s_i = G) / Σ_a n(s_{i-2} = A, s_{i-1} = T, s_i = a)
2) We need an initial set of probabilities P(s_1 s_2):
P(AC) = n(AC) / Σ_{a,b} n(ab)
These are estimated from the initial positions of the training set.

54 Markov models. How to select the order? The number of parameters (probabilities) to estimate grows exponentially with the order: ~4^(k+1). A higher order may be more accurate (it captures the dependencies better), but it is less reliable: there is less data to estimate each parameter (the estimates become more dependent on pseudocounts).

55 Example: CpG Islands

56 Example: CpG Islands. Wherever there is a CG dinucleotide (CpG) in the genome, the C tends to be chemically modified by methylation, and a methylated C is more likely to mutate into a T during replication. Thus, CpG dinucleotides are less frequent than expected.

57 Example: CpG Islands. This transformation is often suppressed in specific regions, like the promoters of some genes, giving rise to a high content of CpGs: these regions are called CpG islands. CpG islands are of variable length, between hundreds and thousands of base pairs. We would like to answer the following questions: 1) Given a DNA sequence, is it part of a CpG island? 2) Given a large DNA region, can we find CpG islands in it?

58 Example: CpG Islands. We use two Markov chains:
a^+_{st}: transition probabilities between two adjacent positions inside CpG islands
a^-_{st}: transition probabilities between two adjacent positions outside CpG islands
Given a sequence S, the log-likelihood ratio is
σ = log LR = log [P(S|+) / P(S|-)] = Σ_{i=1}^{N} log (a^+_{i-1,i} / a^-_{i-1,i})
(up to the contribution from the initial probabilities). The larger the value of σ, the more likely S is to be a CpG island.

59 Example: CpG Islands. Approach 1: given a large stretch of DNA, of any length, we calculate
σ = log LR = log [P(S|+) / P(S|-)] = Σ_{i=1}^{N} log (a^+_{i-1,i} / a^-_{i-1,i})
Sequences with σ(S) > 0 are the possible CpG islands.
Disadvantage: CpG islands may be much shorter than the whole sequence. We could therefore underscore a real CpG island by including too much false sequence, and as a result we would miss many positive cases.

60 Example: CpG Islands. Approach 2: given a large stretch of DNA of length L, we extract windows of l nucleotides (l << L):
S^(k) = (s_{k+1}, ..., s_{k+l}), 1 <= k <= L - l
For each window we calculate σ(S^(k)), summing the log-ratios over the positions of the window:
σ(S^(k)) = log [P(S^(k)|+) / P(S^(k)|-)] = Σ_{i=2}^{l} log (a^+_{i-1,i} / a^-_{i-1,i})
Windows with σ(S^(k)) > 0 are the possible CpG islands.
Disadvantage: we assume that CpG islands have at least l nucleotides, and l must be fixed ad hoc; these Markov models do not provide a way of modeling the lengths.
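A sketch of this window scan; a_plus and a_minus are hypothetical transition tables (nested dicts, as produced e.g. by train_markov1 above) trained inside and outside CpG islands:

```python
# Score windows of length l with the +/- Markov chains; report sigma > 0.
import math

def sigma(window, a_plus, a_minus):
    """Log-likelihood ratio of one window under the two chains."""
    return sum(math.log(a_plus[x][y] / a_minus[x][y])
               for x, y in zip(window, window[1:]))

def find_cpg_windows(seq, a_plus, a_minus, l=100):
    for k in range(len(seq) - l + 1):
        s = sigma(seq[k:k + l], a_plus, a_minus)
        if s > 0:
            yield k, round(s, 2)
```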

61 Example: CpG Islands. For each window we calculate σ(S^(k)). [Figure: log LR score of the windows along the genome; a CpG island appears as a region of elevated score.]

62 Exercise (from exam AGB 2014). Consider the following sequence for a C-island: TCCCTCCCTCCC. Estimate a Markov model of order 1 from this sequence. Make a graphical representation of the model, and calculate whether the sequence TCC belongs to the model. Assume that the background model is given by sequences with no frequency or positional preferences for T or C. Help: you can use log_2(3) ≈ 1.6.

63 Position-dependent Markov Models (Weight Array Matrices) (inhomogeneous Markov models)

64 Position-dependent Markov Models. We can model dependencies using conditional probabilities (Markov):
GGGGTGAGAGCATGCTGGTAAGCCCGGCTGGTG
e.g. P(s_9 | s_8), a conditional probability. In a homogeneous Markov model the probability distribution is the same at every position.

65 Position-dependent Markov Models. E.g. for a Markov model of order n = 2:
P(S) = P(s_1 s_2 s_3 ... s_N) = P(s_N | s_{N-2} s_{N-1}) P(s_{N-1} | s_{N-3} s_{N-2}) ... P(s_3 | s_1 s_2) P(s_1 s_2)
The probabilities may instead correspond to a different distribution at every position, estimated from (n+1)-mer frequencies (3-mers in this case):
P(S) = P(s_1 s_2 s_3 ... s_N) = P_N(s_N | s_{N-2} s_{N-1}) P_{N-1}(s_{N-1} | s_{N-3} s_{N-2}) ... P_3(s_3 | s_1 s_2) P_2(s_1 s_2)
A different probability distribution at every position.

66 Position-dependent Markov Models.
GGGGTGAGAGCATGCTGGTAAGCCCGGCTGGTG
Position-dependent model: each position has a different probability distribution. These models are called Weight Array Matrices (WAMs), or inhomogeneous Markov chains.

67 Position-dependent Markov Models.
GGGGTGAGAGCATGCTGGTAAGCCCGGCTGGTG
We can provide a different Markov model (order 1 in this case) at every position of the motif. Thus a motif of size 9 is described by 9 Markov models, P_1, P_2, ..., P_9, one for each position, e.g. P_6(s_6 | s_5), P_7(s_7 | s_6), P_8(s_8 | s_7).

68 Position-dependent Markov Models.
GGGGTGAGAGCATGCTGGTAAGCCCGGCTGGTG
P(S) = P_1(s_1) P_2(s_2|s_1) P_3(s_3|s_2) P_4(s_4|s_3) P_5(s_5|s_4) P_6(s_6|s_5) P_7(s_7|s_6) P_8(s_8|s_7) P_9(s_9|s_8)
P_1 is a Markov model of order 0 (the nucleotide frequency at the first position).

69 Position-dependent Markov Models.
GGGGTGAGAGCATGCTGGTAAGCCCGGCTGGTG
P(S) = P_1(s_1) P_2(s_2|s_1) P_3(s_3|s_2) P_4(s_4|s_3) P_5(s_5|s_4) P_6(s_6|s_5) P_7(s_7|s_6) P_8(s_8|s_7) P_9(s_9|s_8)
P_2, ..., P_9 are 8 Markov models of order 1 (transition matrices). [Table: one 4x4 transition matrix, with rows and columns A, C, G, T, for each of positions 2, 3, ...]
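A sketch of scoring with such a WAM; wam is a hypothetical list where wam[0] holds P_1 (order 0) and wam[i] for i >= 1 holds the transition matrix of P_{i+1} as nested dicts, estimated from dinucleotide counts at each position of the training sites:

```python
# Log-probability of a site under a Weight Array Matrix (WAM).
import math

def wam_log_prob(seq, wam):
    logp = math.log(wam[0][seq[0]])                   # P_1(s_1)
    for i in range(1, len(seq)):
        # wam[i][prev][cur] = P_{i+1}(s_{i+1} | s_i) in 1-based slide notation
        logp += math.log(wam[i][seq[i - 1]][seq[i]])
    return logp
```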

70 Position-dependent Markov Models We have used dependencies between adjacent positions. But we can also model dependencies between any positions

71 Summary
Markov models allow us to model dependencies in sequence data.
Markov models are described by transition probabilities between states.
Parameters are estimated from the observations by counting transitions.
Order of the Markov model: a higher order needs more data for training; generally we will use 1st-order dependencies.
Homogeneous (position-independent) Markov models have no positional dependence; they describe signal content, e.g. CpG islands.
Inhomogeneous (position-dependent) Markov models have positional dependence; they describe dependencies at specific positions, e.g. splice sites.

72 References
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Cambridge University Press, 1999.
Problems and Solutions in Biological Sequence Analysis. Mark Borodovsky and Svetlana Ekisheva. Cambridge University Press, 2006.
Bioinformatics and Molecular Evolution. Paul G. Higgs and Teresa Attwood. Blackwell Publishing, 2005.
