I529: Machine Learning in Bioinformatics (Spring 2017) Markov Models

Yuzhen Ye
School of Informatics and Computing
Indiana University, Bloomington

Outline
Simple model (frequency & profile) review
Markov chain
CpG island question 1
Model comparison by log likelihood ratio test
Markov chain variants
  kth order
  Inhomogeneous Markov chains
  Interpolated Markov models (IMM)
Applications
  Gene finding (Genemark & Glimmer)
  Taxonomic assignment in metagenomics (Phymm)

A DNA profile (matrix)
Aligned training sequences: TATAAA, TATAAT, TATAAA, TATAAA, TATAAA, TATTAA, TTAAAA, TAGAAA

Counts by position:
     1  2  3  4  5  6
T    8  1  6  1  0  1
C    0  0  0  0  0  0
A    0  7  1  7  8  7
G    0  0  1  0  0  0

Sparse data → pseudo-counts (add 1 to every cell):
     1  2  3  4  5  6
T    9  2  7  2  1  2
C    1  1  1  1  1  1
A    1  8  2  8  9  8
G    1  1  2  1  1  1
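The two count tables can be reproduced with a few lines of Python. This is a minimal sketch: the function names `profile` and `column_probabilities` are illustrative, and the final normalization to per-column probabilities is an added step that the slide does not show.

```python
from collections import Counter

# The eight aligned training sequences from the slide.
seqs = ["TATAAA", "TATAAT", "TATAAA", "TATAAA",
        "TATAAA", "TATTAA", "TTAAAA", "TAGAAA"]
ALPHABET = "TCAG"  # row order used on the slide

def profile(seqs, pseudocount=0):
    """Return {base: [count in each column]}, adding `pseudocount` to every cell."""
    length = len(seqs[0])
    columns = [Counter(s[j] for s in seqs) for j in range(length)]
    return {b: [col[b] + pseudocount for col in columns] for b in ALPHABET}

def column_probabilities(counts):
    """Normalize each column of a count matrix so that it sums to 1."""
    length = len(next(iter(counts.values())))
    totals = [sum(counts[b][j] for b in counts) for j in range(length)]
    return {b: [counts[b][j] / totals[j] for j in range(length)] for b in counts}

raw = profile(seqs)                      # plain counts
smoothed = profile(seqs, pseudocount=1)  # counts with a pseudo-count of 1
probs = column_probabilities(smoothed)

for b in ALPHABET:
    print(b, raw[b], smoothed[b])
```

With pseudo-counts, no base has probability zero at any position, so a sequence containing a base never seen in training at that position still gets a nonzero score.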

Frequency & profile model
Frequency model: the order of nucleotides in the training sequences is ignored.
Profile model: the training sequences are aligned → the order of nucleotides in the training sequences is fully preserved.
Markov chain model: the order is partially incorporated.

Markov chain model
Sometimes we need to model dependencies between adjacent positions in the sequence.
Certain regions of the genome contain characteristic patterns, such as TATA within the regulatory area upstream of a gene.
The pattern CG is less common than expected under random sampling.
Such dependencies can be modeled by Markov chains.

Markov chains
A Markov chain is a sequence of random variables with the Markov property, i.e., given the present state, the future and the past are independent.
A famous example of a Markov chain is the "drunkard's walk": at each step, the position changes by +1 or -1 with equal probability.
Pr(5 → 4) = Pr(5 → 6) = 0.5, and all other transition probabilities from 5 are 0. These probabilities are independent of whether the system was previously in position 4 or 6.

1st order Markov chain
An integer-time stochastic process, consisting of a set of m > 1 states {s_1, ..., s_m} and
1. An m-dimensional initial distribution vector (p(s_1), ..., p(s_m))
2. An m × m transition probability matrix M = (a_{s_i s_j})
For example, for a DNA sequence the states are {A, C, T, G} (m = 4); p(A) is the probability that A is the 1st letter, and a_AG is the probability that G follows A in a sequence.

1st order Markov chain
X_1 → X_2 → ... → X_{n-1} → X_n
For each integer n, a Markov chain assigns probability to sequences (x_1 ... x_n) as follows:
$$p((x_1, x_2, \ldots, x_n)) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i \mid X_{i-1} = x_{i-1}) = p(x_1) \prod_{i=2}^{n} a_{x_{i-1} x_i}$$
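As a small illustration of this product formula, the sketch below scores a sequence under a first-order chain. The two-letter alphabet and all probabilities are hypothetical values chosen only for the example; they do not come from the slides.

```python
from itertools import product

# Hypothetical two-state chain; these numbers are illustrative only.
initial = {"A": 0.6, "G": 0.4}
transition = {"A": {"A": 0.7, "G": 0.3},
              "G": {"A": 0.5, "G": 0.5}}

def sequence_probability(x, initial, transition):
    """p(x_1..x_n) = p(x_1) * product over i >= 2 of a_{x_{i-1} x_i}."""
    p = initial[x[0]]
    for prev, cur in zip(x, x[1:]):
        p *= transition[prev][cur]
    return p

p_aag = sequence_probability("AAG", initial, transition)  # 0.6 * 0.7 * 0.3

# Sanity check: the probabilities of all sequences of a fixed length sum to 1.
total = sum(sequence_probability("".join(s), initial, transition)
            for s in product("AG", repeat=3))
print(p_aag, total)
```

For long sequences the product underflows quickly, so in practice one would sum log probabilities instead of multiplying.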

Matrix representation
      A      B      C      D
A   0.95   0      0.05   0
B   0.2    0.5    0      0.3
C   0      0.2    0      0.8
D   0      0      1      0
The transition probability matrix M = (a_st). M is a stochastic matrix: $\sum_t a_{st} = 1$ for every state s.
The initial distribution vector (u_1 ... u_m) defines the distribution of X_1: p(X_1 = s_i) = u_i.

Digraph (directed graph) representation
[Figure: the transition matrix drawn as a directed graph on states A, B, C, D, with each edge labeled by its transition probability.]
Each directed edge A → B is associated with the positive transition probability from A to B.

Classification of Markov chain states
States of Markov chains are classified using the digraph representation (omitting the actual probability values).
[Figure: digraph on states A, B, C, D]
A, C and D are recurrent states: they are in strongly connected components which are sinks in the graph.
B is not recurrent; it is a transient state.
Alternative definition: a state s is recurrent if it can be reached from any state reachable from s; otherwise it is transient.

Another example of recurrent and transient states
[Figure: digraph on states A, B, C, D]
A and B are transient states; C and D are recurrent states. Once the process moves from B to D, it will never come back.
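The reachability definition of recurrence translates directly into code. The sketch below is one possible implementation. The edge set is an assumption: the figure is not recoverable from the text, so the graph was chosen only to agree with the description (A and B transient, C and D recurrent, B leading to D).

```python
def reachable(graph, start):
    """States reachable from `start` by following one or more edges."""
    seen, stack = set(), list(graph[start])
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(graph[s])
    return seen

def classify(graph):
    """A state s is recurrent if s is reachable from every state reachable from s."""
    reach = {s: reachable(graph, s) for s in graph}
    return {s: "recurrent" if all(s in reach[t] for t in reach[s]) else "transient"
            for s in graph}

# Hypothetical edges (adjacency lists); probabilities are omitted because only
# the presence of an edge matters for this classification.
graph = {"A": ["B"], "B": ["A", "D"], "C": ["D"], "D": ["C"]}
labels = classify(graph)
print(labels)
```

Every state of a Markov chain has at least one outgoing edge because its row sums to 1, so an absorbing state appears here as a state with a self-loop.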

A 3-state Markov model of the weather
Assume the weather can be: rain or snow (state 1), cloudy (state 2), or sunny (state 3).
Assume the weather on any day t is characterized by one of the three states.
The transition probabilities between the three states:
$$A = \{a_{ij}\} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = \begin{pmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}$$
Questions
Given that the first day is sunny, what is the probability that the weather for the following 7 days will be sun-sun-rain-rain-sun-cloudy-sun?
What is the probability of the weather staying in a state for d days?
Rabiner (1989)
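Both questions can be checked numerically. The sketch below uses the transition matrix from the slide. The second function assumes that "staying in a state for d days" means exactly d consecutive days followed by a change, which gives a geometric distribution. The function names are illustrative.

```python
# Transition matrix from the slide. States are 1-indexed there, so here
# index 0 = rain/snow, 1 = cloudy, 2 = sunny.
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]
RAIN, CLOUDY, SUNNY = 0, 1, 2

def path_probability(states, A):
    """Probability of a state path, given that its first state is already observed."""
    p = 1.0
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

def duration_probability(state, d, A):
    """Probability of staying in `state` exactly d consecutive days, then leaving."""
    stay = A[state][state]
    return stay ** (d - 1) * (1 - stay)

# Day 1 is sunny (given), followed by sun-sun-rain-rain-sun-cloudy-sun.
path = [SUNNY, SUNNY, SUNNY, RAIN, RAIN, SUNNY, CLOUDY, SUNNY]
p_path = path_probability(path, A)  # a33 * a33 * a31 * a11 * a13 * a32 * a23
print(p_path)
print(duration_probability(SUNNY, 3, A))
```

Under this reading, the expected number of consecutive days in state i is 1 / (1 - a_ii), which is 5 days for sunny weather with a_33 = 0.8.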

CpG island modeling
In mammalian genomes, the dinucleotide CG is often transformed to (methyl-C)G, which often subsequently mutates to TG.
Hence CG appears less often than expected from the independent frequencies of C and G alone.
For biological reasons, this process is sometimes suppressed in short stretches of the genome, such as the upstream regions of many genes. These areas are called CpG islands.

Questions about CpG islands
We consider two questions (and some variants):
Question 1: Given a short stretch of genomic data, does it come from a CpG island?
Question 2: Given a long piece of genomic data, does it contain CpG islands, where, and how long?
We solve the first question by modeling sequences with and without CpG islands as Markov chains over the same states {A, C, G, T} but with different transition probabilities.

Markov models for (non) CpG islands
The "+" model: use transition matrix $A^{+} = (a^{+}_{st})$, where $a^{+}_{st}$ = (the probability that t follows s in a CpG island) → positive samples.
The "-" model: use transition matrix $A^{-} = (a^{-}_{st})$, where $a^{-}_{st}$ = (the probability that t follows s in a non-CpG island sequence) → negative samples.
With these two models, to solve Question 1 we need to decide whether a given short sequence is more likely to come from the "+" model or from the "-" model. This is done by using the definition of a Markov chain, in which the parameters are determined from training data.

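Matrices of the transition probabilities
Rows are the previous base X_{i-1}, columns are the current base X_i; each entry is p(x_i | x_{i-1}) (rows sum to 1).

A+ (CpG islands): p+(x_i | x_{i-1})
        A      C      G      T
A     0.180  0.274  0.426  0.120
C     0.171  0.368  0.274  0.188
G     0.161  0.339  0.375  0.125
T     0.079  0.355  0.384  0.182

A- (non-CpG islands): p-(x_i | x_{i-1})
        A      C      G      T
A     0.300  0.205  0.285  0.210
C     0.322  0.298  0.078  0.302
G     0.248  0.246  0.298  0.208
T     0.177  0.239  0.292  0.292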

Model comparison
Given a sequence x = (x_1 ... x_L), compute the likelihood ratio
$$\mathrm{RATIO} = \frac{p(x \mid +\text{ model})}{p(x \mid -\text{ model})} = \frac{\prod_{i=0}^{L-1} p_{+}(x_{i+1} \mid x_i)}{\prod_{i=0}^{L-1} p_{-}(x_{i+1} \mid x_i)}$$
If RATIO > 1, a CpG island is more likely. In practice, the log of this ratio is computed.
Note: $p_{+}(x_1 \mid x_0)$ is defined for convenience as $p_{+}(x_1)$, and $p_{-}(x_1 \mid x_0)$ is defined for convenience as $p_{-}(x_1)$.

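Log likelihood ratio test
Taking the logarithm yields
$$\log Q = \log \frac{p(x_1 \ldots x_L \mid +)}{p(x_1 \ldots x_L \mid -)} = \sum_{i=0}^{L-1} \log \frac{p_{+}(x_{i+1} \mid x_i)}{p_{-}(x_{i+1} \mid x_i)}$$
If log Q > 0, then + is more likely (CpG island).
If log Q < 0, then - is more likely (non-CpG island).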

A toy example
Sequence: CGCG
P(CGCG | +) = ?
P(CGCG | -) = ?
Log likelihood ratio?
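One way to work the toy example is sketched below, using the + and - transition tables from the earlier slide. Two things are assumptions rather than part of the slides: the first base is given probability 0.25 under both models (so it cancels in the ratio), and the natural logarithm is used.

```python
import math

# Transition probabilities p(next | previous); outer key = previous base.
PLUS = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}
MINUS = {
    "A": {"A": 0.300, "C": 0.205, "G": 0.285, "T": 0.210},
    "C": {"A": 0.322, "C": 0.298, "G": 0.078, "T": 0.302},
    "G": {"A": 0.248, "C": 0.246, "G": 0.298, "T": 0.208},
    "T": {"A": 0.177, "C": 0.239, "G": 0.292, "T": 0.292},
}

def likelihood(seq, table, initial=0.25):
    """P(seq | model), with an assumed uniform probability for the first base."""
    p = initial
    for prev, cur in zip(seq, seq[1:]):
        p *= table[prev][cur]
    return p

def log_likelihood_ratio(seq):
    """log Q (natural log); the equal initial probabilities cancel out."""
    return sum(math.log(PLUS[prev][cur] / MINUS[prev][cur])
               for prev, cur in zip(seq, seq[1:]))

p_plus = likelihood("CGCG", PLUS)    # 0.25 * a+_CG * a+_GC * a+_CG
p_minus = likelihood("CGCG", MINUS)  # 0.25 * a-_CG * a-_GC * a-_CG
log_q = log_likelihood_ratio("CGCG")
print(p_plus, p_minus, log_q)
```

The log ratio is positive, so under these assumptions CGCG is more likely to come from the + model.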

Where do the parameters (transition probabilities) come from?
Learning from training data.
Source: a collection of sequences from CpG islands, and a collection of sequences from non-CpG islands.
Input: tuples of the form (x_1, ..., x_L, h), where h is + or -.
Output: maximum likelihood estimates (MLE) of the parameters.
Count all pairs (X_i = a, X_{i-1} = b) with label +, and with label -; call these counts N_{ba,+} and N_{ba,-}.
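The counting step leads to the usual maximum likelihood estimate, in which each pair count N_ba is divided by the total count of pairs starting with b. The sketch below shows this for one label at a time. The training sequences are made up for illustration, and the uniform fallback for bases that never occur as a predecessor is an added assumption.

```python
def estimate_transitions(sequences, alphabet="ACGT"):
    """MLE of a_{ba} = N_ba / sum over a' of N_ba', from adjacent pairs (b, a)."""
    counts = {b: {a: 0 for a in alphabet} for b in alphabet}
    for seq in sequences:
        for b, a in zip(seq, seq[1:]):
            counts[b][a] += 1
    probs = {}
    for b in alphabet:
        total = sum(counts[b].values())
        # Assumption for this sketch: an unseen predecessor gets a uniform row.
        probs[b] = {a: (counts[b][a] / total if total else 1 / len(alphabet))
                    for a in alphabet}
    return probs

# Hypothetical "+" training sequences (not real CpG island data).
plus_training = ["ACGCG", "CGCGA"]
plus_model = estimate_transitions(plus_training)
print(plus_model["G"])
```

The same function would be run separately on the sequences labeled - to obtain the second matrix. With real data, adding pseudo-counts before normalizing avoids zero probabilities.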

CpG island: question 2
Question 2: Given a long piece of genomic data, does it contain CpG islands, and where?
For this, we need to decide which parts of a given long sequence of letters are more likely to come from the "+" model, and which parts are more likely to come from the "-" model.
We will define a Markov chain over 8 states: A+, C+, G+, T+, A-, C-, G-, T-.
The problem is that we don't know the sequence of states that is traversed (hidden), only the sequence of letters (observation). Hidden Markov Model!

Markov model variations
kth order Markov chains (Markov chains with memory)
Inhomogeneous Markov chains (vs. homogeneous Markov chains)
Interpolated Markov chains

kth order Markov chain (a Markov chain with memory k)
A kth order Markov chain assigns probability to sequences (x_1 ... x_n) as follows:
$$p(x_1 \ldots x_n) = \underbrace{p(X_1 = x_1, \ldots, X_k = x_k)}_{\text{initial distribution}} \; \underbrace{\prod_{i=k+1}^{n} p(X_i = x_i \mid X_{i-1} = x_{i-1}, \ldots, X_{i-k} = x_{i-k})}_{\text{transition probabilities}}$$

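Inhomogeneous Markov chain for gene finding
X_1 → X_2 → X_3 → X_4 → X_5 → X_6 → X_7
Transition matrices used: a (X_1 → X_2), b (X_2 → X_3), c (X_3 → X_4), a (X_4 → X_5), b (X_5 → X_6), c (X_6 → X_7)
Again, the parameters (the transition probability matrices a, b, and c) need to be learned from training samples.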

Inhomogeneous Markov chain: prediction
Transition matrices applied to X_1 → X_2, X_2 → X_3, ..., X_6 → X_7 under each reading frame:
Reading frame 1:  a b c a b c
Reading frame 2:  c a b c a b
Reading frame 3:  b c a b c a

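Gene finding using inhomogeneous Markov chain
Consider a sequence x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 ..., where x_i is a nucleotide. Let
$$p_1 = a_{x_1 x_2}\, b_{x_2 x_3}\, c_{x_3 x_4}\, a_{x_4 x_5}\, b_{x_5 x_6}\, c_{x_6 x_7} \cdots$$
$$p_2 = c_{x_1 x_2}\, a_{x_2 x_3}\, b_{x_3 x_4}\, c_{x_4 x_5}\, a_{x_5 x_6}\, b_{x_6 x_7} \cdots$$
$$p_3 = b_{x_1 x_2}\, c_{x_2 x_3}\, a_{x_3 x_4}\, b_{x_4 x_5}\, c_{x_5 x_6}\, a_{x_6 x_7} \cdots$$
Then the probability that the ith reading frame is the coding frame is
$$P_i = \frac{p_i}{p_1 + p_2 + p_3}$$
Genemark (gene finder for bacterial genomes)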
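The frame scores p_1, p_2, p_3 and the normalized probabilities P_i can be sketched as follows. The matrices a, b, c here are hypothetical toy matrices built by the illustrative helper `make_matrix`; real matrices would be learned from training genes. This shows only the formula and is not Genemark's implementation.

```python
BASES = "ACGT"

def make_matrix(favored):
    """Hypothetical transition matrix: each row gives 0.4 to `favored`, 0.2 to the rest."""
    return {prev: {cur: (0.4 if cur == favored else 0.2) for cur in BASES}
            for prev in BASES}

a, b, c = make_matrix("A"), make_matrix("C"), make_matrix("G")

# Order in which a, b, c are applied to x1x2, x2x3, ... for each reading frame.
FRAME_CYCLES = {1: (a, b, c), 2: (c, a, b), 3: (b, c, a)}

def frame_score(x, cycle):
    """p_f: product of transition probabilities, cycling through three matrices."""
    p = 1.0
    for i, (prev, cur) in enumerate(zip(x, x[1:])):
        p *= cycle[i % 3][prev][cur]
    return p

def frame_probabilities(x):
    """P_i = p_i / (p_1 + p_2 + p_3)."""
    scores = {f: frame_score(x, cycle) for f, cycle in FRAME_CYCLES.items()}
    total = sum(scores.values())
    return {f: s / total for f, s in scores.items()}

posterior = frame_probabilities("TACGACG")
print(posterior)
```

In this toy setup the bases after the first follow the pattern A, C, G, which matches the a, b, c cycle of frame 1, so frame 1 receives most of the probability. For long sequences the products should be accumulated in log space to avoid underflow.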

Selecting the order of a Markov chain
For Markov models, what order should we choose?
Higher order means more memory (higher predictive value), but also more parameters to learn.
The higher the order, the less reliable the parameter estimates.
E.g., suppose we have a DNA sequence of 100 kbp:
2nd order Markov chain: 4^3 = 64 parameters, each (history, next base) combination seen about 1562 times on average
5th order: 4^6 = 4096 parameters, about 24 times on average
8th order: 4^9 = 262144 parameters, about 0.4 times on average
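The trade-off can be recomputed directly. This sketch assumes a training sequence of 100,000 bases (the length that reproduces the figure of about 1562 observations for the 2nd order chain) and ignores edge effects at the start of the sequence. The function name is illustrative.

```python
def parameters_and_coverage(order, seq_length, alphabet_size=4):
    """Return (number of transition parameters, average observations per parameter)."""
    n_params = alphabet_size ** (order + 1)  # one per (history, next base) pair
    return n_params, seq_length / n_params

results = {k: parameters_and_coverage(k, 100_000) for k in (2, 5, 8)}
for k, (n_params, avg) in results.items():
    print(f"order {k}: {n_params} parameters, {avg:.2f} observations each on average")
```

Each increase of one in the order multiplies the number of parameters by 4, so the average number of observations per parameter falls by the same factor.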

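Interpolated Markov models (IMMs)
IMMs are also called variable-order Markov models.
An IMM uses a variable number of preceding states to compute the probability of the next state.
Simple linear interpolation:
$$P(x_i \mid x_{i-n}, \ldots, x_{i-1}) = \lambda_0 P(x_i) + \lambda_1 P(x_i \mid x_{i-1}) + \cdots + \lambda_n P(x_i \mid x_{i-n}, \ldots, x_{i-1})$$
General linear interpolation (the weights depend on the context):
$$P(x_i \mid x_{i-n}, \ldots, x_{i-1}) = \lambda_0 P(x_i) + \lambda_1(x_{i-1}) P(x_i \mid x_{i-1}) + \cdots + \lambda_n(x_{i-n}, \ldots, x_{i-1}) P(x_i \mid x_{i-n}, \ldots, x_{i-1})$$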
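Simple linear interpolation is a weighted average of the estimates from each order. The sketch below assumes that the weights are non-negative and sum to 1, so that the result is still a probability. The per-order estimates and weights are made-up numbers.

```python
def interpolate(order_probs, weights):
    """Simple linear interpolation: sum over j of lambda_j * P_j.

    order_probs[j] is the order-j estimate of P(x_i | previous j bases);
    weights[j] is the corresponding lambda_j.
    """
    if len(order_probs) != len(weights):
        raise ValueError("need one weight per order")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * p for w, p in zip(weights, order_probs))

# Hypothetical estimates for orders 0, 1, 2 and hypothetical weights.
p_next = interpolate([0.25, 0.30, 0.50], [0.2, 0.3, 0.5])
print(p_next)
```

In the general form, the weights would be looked up per context, for example from a dictionary keyed by the preceding bases, instead of being fixed per order.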

GLIMMER
Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses.
Eukaryotic version of Glimmer: GlimmerHMM.
Glimmer (Gene Locator and Interpolated Markov ModelER) uses IMMs to identify the coding regions.
Glimmer version 3.02 is the current version of the system (http://www.cbcb.umd.edu/software/glimmer/).
Glimmer3 makes several algorithmic changes to reduce the number of false positive predictions and to improve the accuracy of start-site predictions.

IMM in GLIMMER
A linear combination of 8 different Markov chains, from 1st through 8th order, weighting each model according to its predictive power.
Glimmer uses 3-periodic nonhomogeneous Markov models in its IMMs.
The score of a sequence is the product of the interpolated probabilities of the bases in the sequence.
IMM training
A longer context is always better; the only reason not to use it is undersampling in the training data.
If a sequence occurs frequently enough in the training data, use it, i.e., λ = 1.
Otherwise, use frequency and χ² significance to set λ.
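The training rule can be illustrated with a small recursive sketch in which each longer context either replaces the shorter-context estimate (λ = 1) or is blended with it. The threshold value, the `confidence` input standing in for the χ² significance, and the down-weighting formula are all placeholders for illustration. They should not be read as Glimmer's actual settings or code.

```python
def imm_probability(context_estimates, threshold=400):
    """Blend estimates from the shortest context to the longest.

    context_estimates: list of (prob, count, confidence) tuples, one per order,
    ordered from shortest to longest context.
      prob       - estimated probability of the next base in this context
      count      - how often the context occurred in the training data
      confidence - value in [0, 1] standing in for the chi-square significance
    """
    result = None
    for prob, count, confidence in context_estimates:
        if result is None:
            result = prob  # shortest context: nothing shorter to fall back on
            continue
        if count >= threshold:
            lam = 1.0      # context seen often enough: use it as is
        else:
            lam = confidence * count / threshold  # hypothetical down-weighting
        result = lam * prob + (1.0 - lam) * result
    return result

well_sampled = imm_probability([(0.25, 1000, 1.0), (0.40, 500, 1.0)])
undersampled = imm_probability([(0.25, 1000, 1.0), (0.40, 100, 0.8)])
print(well_sampled, undersampled)
```

With enough observations the longer context determines the result. With few observations the estimate stays close to the shorter-context value.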

Clustering metagenomic sequences with IMMs
IMMs are used to classify metagenomic sequences based on patterns of DNA distinct to a clade (a species, genus, or higher-level phylogenetic group).
During training, the IMM algorithm constructs probability distributions representing observed patterns of nucleotides that characterize each species.
Nat Methods 2009, 6(9):673-676