Probabilistic models of biological sequence motifs


1 Probabilistic models of biological sequence motifs. Description of Known Motifs. AGB - Master in Bioinformatics UPF. Eduardo Eyras, Computational Genomics, Pompeu Fabra University - ICREA, Barcelona, Spain

2 What we will see. How to build simple probabilistic models to describe sequence motifs (using a training set). How to study motif properties in terms of heterogeneity and dependencies between positions. How to model dependencies between positions.

3 Genome. Added complexity: RNA processing (e.g. splicing). [Figure: gene structure from transcription start site to termination site; the pre-mRNA contains exons and introns; splicing removes the introns to produce the mRNA (5' UTR, protein-coding sequence, 3' UTR), which is translated into protein.]

4 Splice-site signals. [Figure: pre-mRNA with alternating exons and introns; the donor site (consensus CAGGURAGU), branch site (BS, consensus CURAY) and acceptor site (consensus YYNCAGG) mark the exon-intron boundaries.]


7 Description of signals (motifs)
Exact word (one example): CAGGTAAGT
Consensus from multiple examples:
CAGGTAAGT
TAGGTGAGC
GTAGTAAGA
CAAGTAATA
ATGGTAATG
CAGGTGATC
AAGGTGAGC
Consensus motif: NWRGTRAKN

8 The simplest probabilistic model: the Position Weight Matrix (PWM), also known as a position-specific scoring matrix (PSSM).

9 Weight Matrices. [Figure: pre-mRNA with donor (CAGGURAGU), branch site (CURAY) and acceptor (YYNCAGG) signals at the exon-intron boundaries.]
Observations (real donor splice sites, spanning the exon-intron boundary):
caggtaccc
gaggtgaga
ctggtgagg
taggtgagt
caggtctgt
ctggtgagc
caggtaagt
From these observations we estimate, at each position, the probability of each nucleotide (a table with one row per nucleotide A, C, G, T and one column per position). E.g. at position 1, P(C) = frequency of C = 5/7 = 0.71.
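A minimal sketch (not the course's code) of how this frequency table can be estimated in Python, using the seven example sites above; the function name build_pwm is ours:

```python
# Estimate a PWM (per-position nucleotide frequencies) from aligned sites.
from collections import Counter

sites = ["caggtaccc", "gaggtgaga", "ctggtgagg",
         "taggtgagt", "caggtctgt", "ctggtgagc", "caggtaagt"]

def build_pwm(seqs):
    """Return one dict per position, mapping nucleotide -> frequency."""
    pwm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts.values())
        pwm.append({b: counts.get(b, 0) / total for b in "acgt"})
    return pwm

pwm = build_pwm(sites)
print(round(pwm[0]["c"], 2))  # position 1: P(C) = 5/7 = 0.71
```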

10 Testing for a new functional site. What is the probability that a sequence S = s_1 s_2 s_3 ... s_N contains a functional site described by this model? We can calculate the probability that S is generated by the model obtained from the observations:
P(S) = P(s_1 s_2 ... s_N) = P(s_1, pos=1) P(s_2, pos=2) ... P(s_N, pos=N)
Implicitly, we assume independence between the positions.

11 Graphical Representation: Sequence Logos. [Figure: sequence logos of the donor-site matrix for human, Drosophila and yeast.]

12 Pseudocounts. In any observed data set there is the possibility, especially with low-probability events and/or small data sets, of a possible event not occurring. Its observed frequency is then 0, implying a probability of 0. In our donor-site example:
caggtaccc
gaggtgaga
ctggtgagg
taggtgagt
caggtctgt
ctggtgagc
caggtaagt
Estimated probability P(A, pos=1) = 0. We might wrongly infer that the lack of A is characteristic of splice sites (overfitting).
Simplest solution: modify the counts, n_i -> n_i + m p_i, i = 1,2,3,4, where p_i is a prior probability and m is the total pseudocount weight:
P(A) = (n_A + m p_A) / (n_A + n_C + n_G + n_T + m)
Laplace rule: pseudocount m = 4, p_i = 1/4 (i.e. add 1 to each count).
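A sketch of the same estimate with Laplace pseudocounts (assuming the build_pwm setup above):

```python
# PWM estimation with pseudocounts: add m*p to each count, m to the total.
from collections import Counter

def build_pwm_pseudo(seqs, m=4.0, prior=0.25):
    """Laplace rule by default: m = 4, p = 1/4, i.e. add 1 to every count."""
    pwm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts.values()) + m
        pwm.append({b: (counts.get(b, 0) + m * prior) / total for b in "acgt"})
    return pwm
```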

13 Hypothesis testing. Problem: choosing between two models M, R to represent a data set. Each model defines a probability distribution over the sample space S. We need a statistical test that can distinguish between the two models. To do so, we consider the likelihood ratio between the two models, where P(s|M) is P(s) under M and P(s|R) is P(s) under R:
LR = P(S|M) / P(S|R) = [P(s_1, pos=1 | M) / P(s_1, pos=1 | R)] ... [P(s_N, pos=N | M) / P(s_N, pos=N | R)]

14 Likelihood ratio. In general, we want to compare the model of real sites M with an alternative (false site) model R. Examples of alternative models: random sequences (a uniform background, P(a) = 0.25 for a = A, C, G, T), or false sites (sequences that contain GT but are not real donors).
LR = P(S|M) / P(S|R) = [P(s_1, pos=1 | M) / P(s_1, pos=1 | R)] ... [P(s_N, pos=N | M) / P(s_N, pos=N | R)]

16 Position Weight Matrices (PWMs). Probabilities are small, and multiplying many of them yields numbers too small to be handled accurately by a computer. Solution: use logarithms and the log-likelihood ratio:
log LR = log [ P(s_1|M) P(s_2|M) ... P(s_n|M) / (P(s_1|R) P(s_2|R) ... P(s_n|R)) ] = Σ_{i=1}^{n} log [ P(s_i|M) / P(s_i|R) ]
This defines the weight matrix M_{a,i} = log [ P(s_i = a | M) / P(s_i = a | R) ]:
pos    1      2      3      4      5      6      7      8      9
A    -999   1.02  -999   -999   -999  -0.22   1.02  -999   -0.91
C     1.02  -999  -0.22  -999   -999  -0.91  -0.91  -0.91  -0.22
G    -0.91  -999   1.02   1.38  -999   0.69  -999    1.16  -0.91
T    -0.91  -0.22 -999   -999   1.38  -999   -0.91  -999    0.47
Log 0 is generally set to a large negative number (here -999); the alternative is to use pseudocounts.

17 Position Weight Matrices (PWMs). Consider the example CTGGTAAGC:
log LR = log [ P(CTGGTAAGC|M) / P(CTGGTAAGC|R) ]
= log [ P_1(C|M) P_2(T|M) P_3(G|M) ... P_8(G|M) P_9(C|M) / ( P_1(C|R) P_2(T|R) P_3(G|R) ... P_8(G|R) P_9(C|R) ) ]
= log [P_1(C|M)/P_1(C|R)] + log [P_2(T|M)/P_2(T|R)] + log [P_3(G|M)/P_3(G|R)] + ... + log [P_8(G|M)/P_8(G|R)] + log [P_9(C|M)/P_9(C|R)]
= M_{C,1} + M_{T,2} + M_{G,3} + ... + M_{G,8} + M_{C,9}
= 1.02 - 0.22 + 1.02 + 1.38 + 1.38 - 0.22 + 1.02 + 1.16 - 0.22 = 6.32
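The same calculation as a sketch in Python, with the matrix transcribed from the slide (natural-log ratios against a uniform background; -999 encodes log 0):

```python
# Score a candidate 9-mer with the donor-site log-likelihood matrix.
M = {
    "A": [-999, 1.02, -999, -999, -999, -0.22, 1.02, -999, -0.91],
    "C": [1.02, -999, -0.22, -999, -999, -0.91, -0.91, -0.91, -0.22],
    "G": [-0.91, -999, 1.02, 1.38, -999, 0.69, -999, 1.16, -0.91],
    "T": [-0.91, -0.22, -999, -999, 1.38, -999, -0.91, -999, 0.47],
}

def score(seq):
    """Sum of per-position log-likelihood ratios M[a][i]."""
    return sum(M[base][i] for i, base in enumerate(seq))

print(round(score("CTGGTAAGC"), 2))  # 6.32, as on the slide
```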

18 Position Weight Matrices (PWMs). To search an unknown sequence for this motif (i.e. for a possible donor splice site), we slide a window of the same size as the motif along the sequence, and at each position we score the similarity to the motif with the score given by the matrix:
..cgtgagtcggggtgagagcatgctggtaagcccggctggtgaactgccggtagtc..
We assign a score to each 9-base window and use a score cutoff to predict potential 5' splice sites:
log LR = log [P(S|M)/P(S|R)] = log [P(s_1, pos=1 | M)/P(s_1, pos=1 | R)] + ... + log [P(s_N, pos=N | M)/P(s_N, pos=N | R)] > a
log LR > a => more likely to correspond to a case of model M
log LR < a => more likely to correspond to a case of model R
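A sliding-window sketch, reusing score() and M from the previous example:

```python
# Scan a sequence with the 9-bp matrix and report windows above a cutoff.
def scan(seq, cutoff=0.0, width=9):
    hits = []
    for k in range(len(seq) - width + 1):
        window = seq[k:k + width].upper()
        s = score(window)  # score() and M as defined above
        if s > cutoff:
            hits.append((k, window, round(s, 2)))
    return hits

seq = "CGTGAGTCGGGGTGAGAGCATGCTGGTAAGCCCGGCTGGTGAACTGCCGGTAGTC"
for pos, window, s in scan(seq):
    print(pos, window, s)
```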

19 Position Weight Matrices (PWMs). Since GT is invariant, we can skip the contribution of positions 4 and 5 and consider only query windows with GT at these positions. Example:
..CGTGAGTCGGGGTGAGAGCATGCTGGTAAGCCCGGCTGGTGAACTGCCGGTAGTC..
We would score only the windows that carry GT at positions 4-5.

20 Searching for motifs in novel sequences (the sliding window approach). Training: on the training set we calculate the parameters of the PWM. Testing: we use test data to evaluate the accuracy of our model and to establish the score cutoffs. [Figure: distributions of log-likelihood scores of real vs. pseudo splice sites, for the 3' splice site (acceptor) model and the 5' splice site (donor) model.]

21 Searching for motifs in novel sequences (the sliding window approach). Compare the log-likelihood scores of real vs. pseudo splice sites. [Figure: acceptor and donor score distributions, as on the previous slide.]
Reminder:
Sensitivity: fraction of real sites with score above the cutoff.
PPV: fraction of sites with score above the cutoff that are true sites.
FPR: fraction of negative cases (pseudo splice sites) that score above the cutoff.
Etc.

22 Determining the relevant positions

23 Splice-site signals. [Figure: pre-mRNA with donor (CAGGURAGU), branch site (CURAY) and acceptor (YYNCAGG) signals; U1 snRNP base-pairs with the donor site (GUCCAUUCA opposite CAGGUAAGU), U2 snRNP with the branch site (UACUAC opposite AUGAUG, with the branch-point A bulged out), and U2AF65/U2AF35 bind the polypyrimidine tract (PPT) and the acceptor CAGG.]

24 Splice-site signals. [Figure: detail of the spliceosomal components recognizing the signals, as above.]

25 Splice-site signals. How many positions are relevant to model? [Figure: as above.]

26 Information Content. The information content is the change in entropy between the expected and the observed distribution:
I_c(X) = H_before - H_after = log_2 N + Σ_{i=1}^{N} P(x_i) log_2 P(x_i)
(comparing to a uniform background over N symbols)
[Figure: (A) information content, (B) nucleotide frequencies and (C) mutual information along the splice-site signals, with the spliceosomal components (U1 snRNP, U2 snRNP, U2AF65/U2AF35) indicated. Corvelo et al. 2010]
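A sketch of the per-position information content of the PWM columns (build_pwm_pseudo and sites as in the earlier sketches):

```python
# Information content per motif position: log2(N) minus column entropy.
import math

def information_content(column, n_symbols=4):
    """column: dict mapping nucleotide -> probability at one position."""
    h_after = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(n_symbols) - h_after

pwm = build_pwm_pseudo(sites)
for i, col in enumerate(pwm, start=1):
    print(i, round(information_content(col), 2))
```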

27 Information Content. [Figure: donor-site sequence logos for H. sapiens and S. cerevisiae.] The height of each position is proportional to its information content; the relative sizes of the letters are proportional to their frequencies.

28 Kullback-Leibler divergence of two distributions. Also called the relative entropy, it is the expected value of the log-ratio of two distributions:
log_2 L = log_2 [P(x) / Q(x)]
D(P||Q) = E(log_2 L) = Σ_x P(x) log_2 [P(x) / Q(x)] = Σ_{i=1}^{n} P(x_i) log_2 [P(x_i) / Q(x_i)]
The relative entropy is defined for two probability distributions that take values over the same alphabet (same symbols).
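A direct sketch of this definition in Python (pwm from the earlier sketches):

```python
# Relative entropy D(P||Q) in bits, for distributions over the same alphabet.
import math

def kl_divergence(p, q):
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

uniform = {b: 0.25 for b in "acgt"}
print(round(kl_divergence(pwm[0], uniform), 2))  # D at position 1 vs. uniform
```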

29 Kullback-Leibler divergence of two distributions.
D(P||Q) = Σ_x P(x) log_2 [P(x) / Q(x)]
The relative entropy is not a distance, but it measures how different two distributions are.
It is not symmetric: D(P||Q) ≠ D(Q||P).
Its value is never negative, and it is zero only when the two distributions are identical: D(P||Q) >= 0, with equality for P = Q.
The relative entropy provides a measure of the information content gained with the distribution P with respect to the distribution Q. Its applications are similar to those of the information content. It is better to apply D rather than I_c when the background is not uniform.

30 Exercise (exam 2013). Consider two discrete probability distributions P and Q, such that Σ_i P(x_i) = 1 and Σ_i Q(x_i) = 1. Show that the relative entropy D(P||Q) is equivalent to the information content of P when the distribution Q is uniform.
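A sketch of the solution: with Q uniform over N symbols, Q(x_i) = 1/N, and using Σ_i P(x_i) = 1,

```latex
D(P\|Q) = \sum_{i=1}^{N} P(x_i)\log_2\frac{P(x_i)}{1/N}
        = \log_2 N \sum_{i=1}^{N} P(x_i) + \sum_{i=1}^{N} P(x_i)\log_2 P(x_i)
        = \log_2 N + \sum_{i=1}^{N} P(x_i)\log_2 P(x_i) = I_c(X)
```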

31 Total Relative Entropy. To quantify the variability of an entire motif we can calculate the total relative entropy, adding up the value for all positions.
P: distribution of the observed sequences corresponding to the motif.
Q: distribution of a background model (e.g. random sequences).
Relative entropy at one position of the motif:
D(P||Q) = Σ_{a=1}^{4} P(x_a) log_2 [P(x_a) / Q(x_a)]
Total relative entropy:
D_total(P||Q) = Σ_{i=1}^{N} Σ_{a=1}^{4} P(x_{a,i}) log_2 [P(x_{a,i}) / Q(x_{a,i})]
The total relative entropy is calculated from the probability distribution of nucleotides a = 1,2,3,4 at each position i = 1,...,N of the real signal, P(x_{a,i}), relative to the distribution in a randomized set, Q(x_{a,i}).
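As a sketch, the total relative entropy is just the sum of the per-position kl_divergence() defined above:

```python
# Total relative entropy of a motif: sum D(P||Q) over all PWM columns.
def total_relative_entropy(pwm, background):
    return sum(kl_divergence(col, background) for col in pwm)

print(round(total_relative_entropy(pwm, uniform), 2))
```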

32 Total Relative Entropy.
D_total(P||Q) = Σ_{i=1}^{N} Σ_{a=1}^{4} P(x_{a,i}) log_2 [P(x_{a,i}) / Q(x_{a,i})]
[Figure: total relative entropy of splice-site signals. Plass et al. 2008]


34 Modeling dependencies between positions

35 Modeling dependencies between positions. [Figure: log-likelihood score distributions of real vs. pseudo sites for the 3' (acceptor) and 5' (donor) splice-site models; the distributions overlap.] What does this result tell us?
A) The splicing machinery also uses other information besides the 5'ss/3'ss motifs to identify splice sites.
B) The PWM model does not accurately capture some aspects of the 5'ss/3'ss that are used in recognition.
C) Or both.

36 Modeling dependencies between positions

37 Mutual information. The mutual information of two random variables X and Y measures the dependencies between the two variables, that is, the information in X that is shared with Y:
MI(X,Y) = H(X) + H(Y) - H(X,Y) = Σ_x Σ_y P(x,y) log_2 [P(x,y) / (P(x)P(y))]
E.g. X and Y take as values the nucleotides at two different positions, and the sum is carried out over the alphabet of nucleotides.
Independent positions: MI(X,Y) = 0. Dependent positions: MI(X,Y) > 0.
Example (positions 1-5):
CTGAG
GTAGA
TTGAC
ATAGT
GTGAG
CTAAA
TTGAC
ATAAT
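A sketch of the mutual information between two columns of an alignment, using the toy alignment above; columns 3 and 5 of that alignment share 1 bit of information:

```python
# Mutual information (bits) between alignment columns i and j (0-based).
import math
from collections import Counter

def mutual_information(seqs, i, j):
    n = len(seqs)
    p_ij = Counter((s[i], s[j]) for s in seqs)
    p_i = Counter(s[i] for s in seqs)
    p_j = Counter(s[j] for s in seqs)
    mi = 0.0
    for (a, b), c in p_ij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

aln = ["CTGAG", "GTAGA", "TTGAC", "ATAGT", "GTGAG", "CTAAA", "TTGAC", "ATAAT"]
print(mutual_information(aln, 2, 4))  # positions 3 and 5 -> 1.0
```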

38 Mutual information.
MI(X,Y) = Σ_{i=1}^{n} Σ_{j=1}^{n} P(x_i, y_j) log_2 [P(x_i, y_j) / (P(x_i) P(y_j))]
[Figure: mutual information around donor (a) and acceptor (b) splice sites; positions in the donor-site and acceptor-site windows are shown along both axes. The diagonal line present in both panels represents the mutual information between neighbouring pairs of bases. Levine & Durbin 2001]


40 Score calculations (mutual information between positions i and j; Durbin et al., 1998). Based on the mutual information results, a model was built which divided the region around the donor splice site into blocks. This model was scored using log-likelihood scoring, considering the conditional probabilities of the blocks (the dependencies between blocks), and using genomic dinucleotide frequencies for the null model; the score of a sequence X is thus expressed in bits. Frequency values for each possible base combination of each block given the dependencies in the model were calculated by adding pseudocounts, based on genomic dinucleotide frequencies, to the observed counts. Thus, for example:
f(abc|z) = [ C(x_{-4}=z, x_{-3}=a, x_{-2}=b, x_{-1}=c) + 4^3 q(a|z) q(b|a) q(c|b) ] / ...
The dependencies are used to build the model (see more later). Levine & Durbin 2001

41 Mutual information. [Figure: comparison between independent, first-order dependence, and block dependence models for donor splice site identification: fraction of true sites included (TPR) vs. number of false positives per 10 kb (FPR). The block dependence model predicted fewer false positives at most sensitivity levels than did the other two models.]
Using a model with dependencies improves the overall accuracy. The block dependence model shows that splice-site signal recognition can be improved by considering higher-order and long-range interactions; however, the improvement is quite modest when compared to the first-order dependence model. Other splice-site identification models, including maximal dependence decomposition (the model used by the GENSCAN gene prediction program; Burge and Karlin), also show modest improvements over a first-order dependence model. Levine & Durbin 2001

42 Markov models

43 Mutual information can help us find out about dependencies between positions (between variables). How do we incorporate that into the model?

44 Markov models. The probability to observe a sequence S = s_1 s_2 s_3 ... s_N according to the model described by P is
P(S) = P(s_1 s_2 s_3 ... s_N)
The joint probability can be rewritten as a factorization of conditional probabilities (the chain rule for probabilities):
P(S) = P(s_1 s_2 s_3 ... s_N) = P(s_N | s_1 ... s_{N-1}) P(s_{N-1} | s_1 ... s_{N-2}) ... P(s_2 | s_1) P(s_1)
For three elements, apply the definition of conditional probability twice:
P(s_3 | s_1 s_2) = P(s_1 s_2 s_3) / P(s_1 s_2) and P(s_2 | s_1) = P(s_1 s_2) / P(s_1)
=> P(s_1 s_2 s_3) = P(s_3 | s_1 s_2) P(s_1 s_2) = P(s_3 | s_1 s_2) P(s_2 | s_1) P(s_1)

45 Markov models. The chain rule for probabilities:
P(S) = P(s_1 s_2 s_3 ... s_N) = P(s_N | s_1 ... s_{N-1}) P(s_{N-1} | s_1 ... s_{N-2}) ... P(s_2 | s_1) P(s_1)
We define the order of the Markov chain as the number of dependencies:
ORDER 0: P(s_i | s_1 ... s_{i-1}) = P(s_i)
ORDER 1: P(s_i | s_1 ... s_{i-1}) = P(s_i | s_{i-1})
ORDER 2: P(s_i | s_1 ... s_{i-1}) = P(s_i | s_{i-2} s_{i-1})
...
ORDER n: P(s_i | s_1 ... s_{i-1}) = P(s_i | s_{i-n} ... s_{i-1})
E.g. Markov model of order 1 (Markov chain):
P(S) = P(s_1 s_2 s_3 ... s_N) = P(s_N | s_{N-1}) P(s_{N-1} | s_{N-2}) ... P(s_2 | s_1) P(s_1)

46 Markov models. E.g. Markov model of order 1:
P(S) = P(s_1 s_2 s_3 ... s_N) = P(s_N | s_{N-1}) P(s_{N-1} | s_{N-2}) ... P(s_2 | s_1) P(s_1)
1) Probabilities are estimated regardless of the position (recall the Naive Bayes model for book classification). E.g. for order 1:
P(s_i = G | s_{i-1} = T) = n(s_{i-1} = T, s_i = G) / Σ_a n(s_{i-1} = T, s_i = a)
2) We always need an initial set of probabilities. E.g. for order 1:
P(A) = n(A) / Σ_b n(b)
These are estimated from the initial positions of the training set.
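A sketch of this estimation by transition counting (the function name is ours; the three training sequences are those of the next slides):

```python
# Train a first-order Markov chain: transition and initial probabilities.
from collections import Counter

def train_markov1(seqs):
    trans = Counter()
    for s in seqs:
        for a, b in zip(s, s[1:]):  # count each adjacent pair a -> b
            trans[(a, b)] += 1
    bases = "ACGT"
    P = {}
    for a in bases:
        total = sum(trans[(a, b)] for b in bases)
        P[a] = {b: trans[(a, b)] / total if total else 0.0 for b in bases}
    first = Counter(s[0] for s in seqs)  # initial probabilities
    init = {a: first[a] / len(seqs) for a in bases}
    return init, P

init, P = train_markov1(["GCCGCGCTTG", "GCTTGGTGGC", "TGGCCGTTGC"])
print(round(P["G"]["C"], 2))  # 7/12 = 0.58
```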

47 Markov models. Example: consider the following sequences:
GCCGCGCTTG
GCTTGGTGGC
TGGCCGTTGC

48 Markov models. Example: consider the following sequences:
GCCGCGCTTG
GCTTGGTGGC
TGGCCGTTGC
For the 1st-order parameter we count the number of times that C follows a G in the sequences, and divide by the number of times any nucleotide follows a G:
P(C|G) = n(s_{i-1} = G, s_i = C) / Σ_{a in {A,C,G,T}} n(s_{i-1} = G, s_i = a) = (G->C transitions) / (12 G positions) = 7/12

49 Markov models. Example: consider the following sequences:
GCCGCGCTTG
GCTTGGTGGC
TGGCCGTTGC
Transition probabilities: P(C|G) = 7/12, P(A|G) = 0/12, P(G|G) = 3/12, P(T|G) = 2/12
Initial probabilities (from the first positions): P(C) = 0, P(A) = 0, P(G) = 2/3, P(T) = 1/3
We can use pseudocounts with the sequences, as before (e.g. to avoid the zero estimates).

50 Markov chains. Markov models of order 1 are generally called Markov chains:
P(s_i | s_1 ... s_{i-1}) = P(s_i | s_{i-1})
P(S) = P(s_1 s_2 s_3 ... s_N) = P(s_N | s_{N-1}) P(s_{N-1} | s_{N-2}) ... P(s_2 | s_1) P(s_1)
E.g. a Markov chain for nucleotides is a set of probabilities of the form P(a|b), where a, b in {A,C,G,T}. We can view these as transitions.

51 Markov chains. A Markov chain can be represented as a set of states (one per nucleotide) with connections between them: the transition probabilities. The start of the sequence string is modeled with an initial fictitious state S_0, whose initial probabilities satisfy
Σ_{s_1} P(s_1) = 1
The transition probabilities out of each state also sum to one:
Σ_{s_i} P(s_i | s_{i-1}) = 1, e.g. P(A|C) + P(C|C) + P(G|C) + P(T|C) = 1

52 Markov models. Markov model of order k: the next base depends on the previous k bases. For order 2:
P(S) = P(s_1 s_2 s_3 ... s_N) = P(s_N | s_{N-2} s_{N-1}) P(s_{N-1} | s_{N-3} s_{N-2}) ... P(s_3 | s_1 s_2) P(s_1 s_2)
E.g. P(ACA) = P(A | AC) P(AC)

53 Markov models. For order 2:
1) We estimate the probabilities regardless of the position:
P(s_i = G | s_{i-2} = A, s_{i-1} = T) = n(s_{i-2} = A, s_{i-1} = T, s_i = G) / Σ_a n(s_{i-2} = A, s_{i-1} = T, s_i = a)
2) We need an initial set of probabilities P(s_1 s_2):
P(AC) = n(AC) / Σ_{a,b} n(ab)
These are estimated from the initial positions of the training set.

54 Markov models. How to select the order? The number of parameters (probabilities) to estimate grows exponentially with the order: ~4^(k+1). A higher order may be more accurate (it captures the dependencies better), but it is less reliable: there is less data to estimate each parameter (the estimates become more dependent on pseudocounts).

55 Example: CpG Islands

56 Example: CpG Islands. Wherever there is a CG dinucleotide (CpG) in the genome, the C tends to be chemically modified by methylation, and a methylated C is more likely to mutate into a T during replication. Thus, CpG dinucleotides are less frequent than expected.

57 Example: CpG Islands. This transformation is often suppressed in specific regions, like the promoters of some genes, giving rise to a high content of CpGs: these regions are called CpG islands. CpG islands are of variable length, between hundreds and thousands of base pairs. We would like to answer the following questions: 1) Given a DNA sequence, is it part of a CpG island? 2) Given a large DNA region, can we find CpG islands in it?

58 Example: CpG Islands. We use two Markov chains:
a^+_{st}: transition probabilities between two adjacent positions inside CpG islands
a^-_{st}: transition probabilities between two adjacent positions outside CpG islands
Given a sequence S, the log-likelihood ratio is
σ = log LR = log [P(S|+) / P(S|-)] = Σ_{i=1}^{N} log (a^+_{i-1,i} / a^-_{i-1,i})
(up to the contribution from the initial probabilities). The larger the value of σ, the more likely S is to be a CpG island.

59 Example: CpG Islands. Approach 1: given a large stretch of DNA, of any length, we calculate
σ = log LR = log [P(S|+) / P(S|-)] = Σ_{i=1}^{N} log (a^+_{i-1,i} / a^-_{i-1,i})
Sequences with σ(S) > 0 are the possible CpG islands.
Disadvantage: CpG islands may be much shorter than the whole sequence. We could therefore underscore a real CpG island by including too much false sequence, and as a result we would miss many positive cases.

60 Example: CpG Islands. Approach 2: given a large stretch of DNA of length L, we extract windows of l nucleotides (l << L):
S^(k) = (s_{k+1}, ..., s_{k+l}), 1 <= k <= L - l
For each window we calculate σ(S^(k)), summing the log-ratios over the positions of the window:
σ(S^(k)) = log [P(S^(k)|+) / P(S^(k)|-)] = Σ_{i=2}^{l} log (a^+_{i-1,i} / a^-_{i-1,i})
Windows with σ(S^(k)) > 0 are the possible CpG islands.
Disadvantage: we assume that CpG islands have at least l nucleotides, and l must be fixed ad hoc; these Markov models do not provide a way of modeling the lengths.
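A sketch of this window scan; a_plus and a_minus are hypothetical transition tables (nested dicts, as produced e.g. by train_markov1 above) trained inside and outside CpG islands:

```python
# Score windows of length l with the +/- Markov chains; report sigma > 0.
import math

def sigma(window, a_plus, a_minus):
    """Log-likelihood ratio of one window under the two chains."""
    return sum(math.log(a_plus[x][y] / a_minus[x][y])
               for x, y in zip(window, window[1:]))

def find_cpg_windows(seq, a_plus, a_minus, l=100):
    for k in range(len(seq) - l + 1):
        s = sigma(seq[k:k + l], a_plus, a_minus)
        if s > 0:
            yield k, round(s, 2)
```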

61 Example: CpG Islands. For each window we calculate σ(S^(k)). [Figure: log LR score of the windows along the genome; a CpG island appears as a region of elevated score.]

62 Exercise (from exam AGB 2014). Consider the following sequence for a C-island: TCCCTCCCTCCC. Estimate a Markov model of order 1 from this sequence. Make a graphical representation of the model, and calculate whether the sequence TCC belongs to the model. Assume that the background model is given by sequences with no frequency or positional preferences for T or C. Help: you can use log_2(3) ≈ 1.6.

63 Position-dependent Markov Models (Weight Array Matrices) (inhomogeneous Markov models)

64 Position-dependent Markov Models. We can model dependencies using conditional probabilities (Markov):
GGGGTGAGAGCATGCTGGTAAGCCCGGCTGGTG
e.g. P(s_9 | s_8), a conditional probability. In a homogeneous Markov model the probability distribution is the same at every position.

65 Position-dependent Markov Models. E.g. for a Markov model of order n = 2:
P(S) = P(s_1 s_2 s_3 ... s_N) = P(s_N | s_{N-2} s_{N-1}) P(s_{N-1} | s_{N-3} s_{N-2}) ... P(s_3 | s_1 s_2) P(s_1 s_2)
The probabilities may instead correspond to a different distribution at every position, estimated from (n+1)-mer frequencies (3-mers in this case):
P(S) = P(s_1 s_2 s_3 ... s_N) = P_N(s_N | s_{N-2} s_{N-1}) P_{N-1}(s_{N-1} | s_{N-3} s_{N-2}) ... P_3(s_3 | s_1 s_2) P_2(s_1 s_2)
A different probability distribution at every position.

66 Position-dependent Markov Models.
GGGGTGAGAGCATGCTGGTAAGCCCGGCTGGTG
Position-dependent model: each position has a different probability distribution. These models are called Weight Array Matrices (WAMs), or inhomogeneous Markov chains.

67 Position-dependent Markov Models.
GGGGTGAGAGCATGCTGGTAAGCCCGGCTGGTG
We can provide a different Markov model (order 1 in this case) at every position of the motif. Thus a motif of size 9 is described by 9 Markov models, P_1, P_2, ..., P_9, one for each position, e.g. P_6(s_6 | s_5), P_7(s_7 | s_6), P_8(s_8 | s_7).

68 Position-dependent Markov Models.
GGGGTGAGAGCATGCTGGTAAGCCCGGCTGGTG
P(S) = P_1(s_1) P_2(s_2|s_1) P_3(s_3|s_2) P_4(s_4|s_3) P_5(s_5|s_4) P_6(s_6|s_5) P_7(s_7|s_6) P_8(s_8|s_7) P_9(s_9|s_8)
P_1 is a Markov model of order 0 (the nucleotide frequency at the first position).

69 Position-dependent Markov Models.
GGGGTGAGAGCATGCTGGTAAGCCCGGCTGGTG
P(S) = P_1(s_1) P_2(s_2|s_1) P_3(s_3|s_2) P_4(s_4|s_3) P_5(s_5|s_4) P_6(s_6|s_5) P_7(s_7|s_6) P_8(s_8|s_7) P_9(s_9|s_8)
P_2, ..., P_9 are 8 Markov models of order 1 (transition matrices). [Table: one 4x4 transition matrix, with rows and columns A, C, G, T, for each of positions 2, 3, ...]
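A sketch of scoring with such a WAM; wam is a hypothetical list where wam[0] holds P_1 (order 0) and wam[i] for i >= 1 holds the transition matrix of P_{i+1} as nested dicts, estimated from dinucleotide counts at each position of the training sites:

```python
# Log-probability of a site under a Weight Array Matrix (WAM).
import math

def wam_log_prob(seq, wam):
    logp = math.log(wam[0][seq[0]])                   # P_1(s_1)
    for i in range(1, len(seq)):
        # wam[i][prev][cur] = P_{i+1}(s_{i+1} | s_i) in 1-based slide notation
        logp += math.log(wam[i][seq[i - 1]][seq[i]])
    return logp
```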

70 Position-dependent Markov Models We have used dependencies between adjacent positions. But we can also model dependencies between any positions

71 Summary
Markov models allow us to model dependencies in sequence data.
Markov models are described by transition probabilities between states.
Parameters are estimated from the observations by counting transitions.
Order of the Markov model: a higher order needs more data for training; generally we will use 1st-order dependencies.
Homogeneous (position-independent) Markov models have no positional dependence; they describe signal content, e.g. CpG islands.
Inhomogeneous (position-dependent) Markov models have positional dependence; they describe dependencies at specific positions, e.g. splice sites.

72 References
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Cambridge University Press, 1999.
Problems and Solutions in Biological Sequence Analysis. Mark Borodovsky and Svetlana Ekisheva. Cambridge University Press, 2006.
Bioinformatics and Molecular Evolution. Paul G. Higgs and Teresa Attwood. Blackwell Publishing, 2005.
