Markov Chains and Markov Models


Markov Chains and Markov Models
Huizhen Yu
janey.yu@cs.helsinki.fi
Dept. Computer Science, Univ. of Helsinki
Probabilistic Models, Spring 2010
Huizhen Yu (U.H.) Markov Chains and Markov Models Jan. 2, 32 slides

Outline
Markov Chains: definitions and examples; some properties of Markov chains; Markov models.

Markov Chains
Some facts:
- Named after the Russian mathematician A. A. Markov
- The earliest and simplest model for dependent variables
- Has many fascinating aspects and a wide range of applications
Markov chains are often used in studying temporal and sequence data, for modeling short-range dependences (e.g., in biological sequence analysis), as well as for analyzing the long-term behavior of systems (e.g., in queueing systems). Markov chains are also the basis of sampling-based computation methods called Markov Chain Monte Carlo.
We introduce Markov chains and study a small part of their properties, most of which relate to modeling short-range dependences.

A Motivating Example
Are DNA sequences purely random? A sequence segment of bases A, C, G, T from the human preproglucagon gene:

    GTATTAAATCCGTAGTCTGAACTAACTA

Words such as CTGAC are suspected to serve some biological function if they seem to occur often in a segment of the sequence. So it is of interest to measure the "ofteness" of the occurrences of a word. A popular way is to model the sequence as the background.
Observed frequencies/proportions of pairs of consecutive bases from a sequence of 57 bases:

    1st \ 2nd     A      C      G      T
    A           0.359  0.143  0.167  0.331
    C           0.384  0.156  0.023  0.437
    G           0.305  0.199  0.150  0.345
    T           0.284  0.182  0.177  0.357
    overall     0.328  0.167  0.144  0.360

Recall of Independence
Recall: for discrete random variables X, Y, Z with joint distribution P, by definition,
- X, Y, Z are mutually independent if P(X = x, Y = y, Z = z) = P(X = x) P(Y = y) P(Z = z) for all x, y, z;
- X, Y are conditionally independent given Z, i.e., X ⊥ Y | Z, if P(X = x | Y = y, Z = z) = P(X = x | Z = z) for all x, y, z.
(Our convention: the equalities hold for all x, y, z such that the quantities involved are well-defined under P.)
Let X_1, X_2, ... be an indexed sequence of discrete random variables with joint probability distribution P. If X_1, X_2, ... are mutually independent, then by definition, for all n,
P(X_1, X_2, ..., X_n) = P(X_1) P(X_2) ... P(X_n),  and  P(X_{n+1} | X_1, ..., X_n) = P(X_{n+1}).

Definition of a Markov Chain
The sequence {X_n} is called a (discrete-time) Markov chain if it satisfies the Markov property: for all n and (x_1, ..., x_{n+1}),
P(X_{n+1} = x_{n+1} | X_1 = x_1, ..., X_n = x_n) = P(X_{n+1} = x_{n+1} | X_n = x_n),   (1)
i.e., X_{n+1} ⊥ X_1, X_2, ..., X_{n-1} | X_n.
Recall: the joint PMF of X_1, X_2, ..., X_n can be expressed as
p(x_1, x_2, ..., x_n) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ... p(x_n | x_1, x_2, ..., x_{n-1}).
The Markov property p(x_i | x_1, ..., x_{i-1}) = p(x_i | x_{i-1}), for all i, then implies that p factorizes as
p(x_1, x_2, ..., x_n) = p(x_1) p(x_2 | x_1) ... p(x_n | x_{n-1}),  for all n.   (2)
In turn, this is equivalent to: for all m > n,
p(x_{n+1}, x_{n+2}, ..., x_m | x_1, x_2, ..., x_n) = p(x_{n+1}, x_{n+2}, ..., x_m | x_n).
Informally, the future is independent of the past given the present.
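The equivalence between the Markov property (1) and the factorization (2) can be checked numerically on a toy chain. In the sketch below, the two-state initial distribution and transition probabilities are hypothetical, chosen only for illustration: a joint PMF over (X_1, X_2, X_3) is built from the factorization (2), and the conditional p(x_3 | x_1, x_2) is verified to depend on x_2 only.

```python
from itertools import product

# Hypothetical two-state chain (states 0 and 1); all numbers are illustrative.
init = {0: 0.6, 1: 0.4}                         # distribution of X_1
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # transition probabilities p_ij

# Joint PMF of (X_1, X_2, X_3) built from the factorization (2)
joint = {(a, b, c): init[a] * P[a][b] * P[b][c]
         for a, b, c in product((0, 1), repeat=3)}

def cond(x3, given):
    """p(X_3 = x3 | conditions), where `given` maps a 0-based index to a value."""
    num = sum(p for x, p in joint.items()
              if x[2] == x3 and all(x[i] == v for i, v in given.items()))
    den = sum(p for x, p in joint.items()
              if all(x[i] == v for i, v in given.items()))
    return num / den

# Markov property (1): conditioning on (x1, x2) equals conditioning on x2 alone
for a, b in product((0, 1), repeat=2):
    assert abs(cond(1, {0: a, 1: b}) - cond(1, {1: b})) < 1e-12
print(round(cond(1, {1: 0}), 6))  # equals p_01 = 0.1
```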

Terminology
- X_n: the state of the chain at time n
- S = {possible values of x_n, for all n}: the state space (we assume S is finite)
- P(X_{n+1} = j | X_n = i), i, j in S: the transition probabilities
The chain is said to be homogeneous if for all n and i, j in S, P(X_{n+1} = j | X_n = i) = p_ij independently of n, and inhomogeneous otherwise.
For a homogeneous chain, the |S| x |S| matrix

    [ p_11  p_12  ...  p_1m ]
    [ p_21  p_22  ...  p_2m ]      where m = |S|,
    [  ...   ...  ...   ... ]
    [ p_m1  p_m2  ...  p_mm ]

is called the transition probability matrix of the Markov chain (or transition matrix for short).

Graphical Model of a Markov Chain
The joint PMF of X_1, X_2, ..., X_n factorizes as p(x_1, x_2, ..., x_n) = p(x_1) p(x_2 | x_1) ... p(x_n | x_{n-1}). A pictorial representation of this factorization form:
- a directed graph G = (V, E) with vertex set V = {1, ..., n} and edge set E = {(i, i+1), 1 <= i < n};
- vertex i is associated with X_i:

    1 -> 2 -> 3 -> ... -> n-1 -> n

This is a graphical model for Markov chains of length n. Each vertex i with its incoming edge represents a term p(x_i | x_{i-1}) in the factorized expression of the PMF.

Parameters of a Markov Chain: Transition Probabilities
Transition probabilities determine the behavior of the chain entirely:
p(x_1, x_2, ..., x_n) = p(x_1) p(x_2 | x_1) ... p(x_n | x_{n-1}) = p(x_1) prod_{j=1}^{n-1} p_{x_j x_{j+1}}   (for a homogeneous chain).
For a homogeneous chain, P is determined by {p_ij, i, j in S} and the initial distribution of X_1. (Thus the number of free parameters is of order |S|^2.)
Transition probability graph: another pictorial representation of a homogeneous Markov chain; it shows the structure of the chain in the state space. Nodes represent possible states and arcs possible transitions, labeled with their probabilities. (Not to be confused with the graphical model; the slide shows a small example with states s_1, ..., s_4.) Using this graph one may classify states as being recurrent, absorbing, or transient, notions important for understanding the long-term behavior of the chain.
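As a sketch of how these two parameter sets (the initial distribution and the transition matrix) determine everything about a homogeneous chain, the following assumes a hypothetical three-state chain and computes a path probability and a sample path from them; all names and numbers below are illustrative.

```python
import random

# Hypothetical three-state homogeneous chain; the initial distribution and the
# transition matrix are its only parameters.
states = ["s1", "s2", "s3"]
p_init = {"s1": 0.5, "s2": 0.3, "s3": 0.2}
p_trans = {"s1": {"s1": 0.1, "s2": 0.6, "s3": 0.3},
           "s2": {"s1": 0.4, "s2": 0.4, "s3": 0.2},
           "s3": {"s1": 0.5, "s2": 0.25, "s3": 0.25}}

def path_prob(path):
    """p(x_1, ..., x_n) = p(x_1) * product of p_{x_j x_{j+1}} along the path."""
    prob = p_init[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= p_trans[a][b]
    return prob

def simulate(n, seed=0):
    """Draw a length-n sample path from the same two ingredients."""
    rng = random.Random(seed)
    path = [rng.choices(states, weights=[p_init[s] for s in states])[0]]
    for _ in range(n - 1):
        cur = path[-1]
        path.append(rng.choices(states, weights=[p_trans[cur][s] for s in states])[0])
    return path

print(round(path_prob(["s1", "s2", "s2"]), 6))  # 0.5 * 0.6 * 0.4 = 0.12
```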

Example of DNA Sequence Modeling
A Markov chain model for the DNA sequence shown earlier:
- State space S = {A, C, G, T}
- Transition probabilities (taken to be the observed frequencies):

            A      C      G      T
    A     0.359  0.143  0.167  0.331
    C     0.384  0.156  0.023  0.437
    G     0.305  0.199  0.150  0.345
    T     0.284  0.182  0.177  0.357

The probability of CTGAC given the first base being C:
P(CTGAC | X_1 = C) = p_CT p_TG p_GA p_AC = 0.437 x 0.177 x 0.305 x 0.143 ≈ 0.00337.
(The probabilities soon become too small to handle as the sequence length grows. In practice we work with ln P and the log transition probabilities instead:
ln P(CTGAC | X_1 = C) = ln p_CT + ln p_TG + ln p_GA + ln p_AC.)

Example of DNA Sequence Modeling (continued)
Second-order Markov chain: joint PMF and graphical model
p(x_1, x_2, ..., x_n) = p(x_1, x_2) p(x_3 | x_1, x_2) ... p(x_n | x_{n-1}, x_{n-2}),
with a graphical model 1, 2, 3, 4, ..., n in which each vertex i >= 3 has incoming edges from both i-1 and i-2.
Correspondingly, {Y_n} is a Markov chain, where Y_n = (Y_{n,1}, Y_{n,2}) = (X_{n-1}, X_n) and P(Y_{n+1,1} = Y_{n,2} | Y_n) = 1.
Second-order model for the DNA sequence example:
- State space S = {A, C, G, T} x {A, C, G, T}
- Size of the transition probability matrix: 16 x 16
The number of parameters grows exponentially with the order. (The slide shows the transition probability graph partially, with states AA, CA, AC, CC, CT, ....) Higher-order Markov chains are defined analogously.

Example of Language Modeling
A trivial example: the transition probability graph of a Markov chain that can generate sentences such as "a cat slept here", with a start state B and an end state E; the word nodes include a/the, cat/dog, slept/ate, here/there.
Sequences of English generated by two Markov models:
- Third-order letter model: THE GENERATED JOB PROVIDUAL BETTER TRAND THE DISPLAYED CODE, ABOVERY UPONDULTS WELL THE CODERST IN...
- First-order word model: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHODS FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD...
Applications in the biology and language modeling contexts include:
- analyzing potentially important words
- sequence classification
- coding/compression
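The log-probability computation for words such as CTGAC can be sketched directly from the pair-frequency table above; the dictionary P and the helper name log_prob_given_first below are illustrative.

```python
import math

# Transition probabilities from the table above (row: current base, column: next base)
P = {"A": {"A": 0.359, "C": 0.143, "G": 0.167, "T": 0.331},
     "C": {"A": 0.384, "C": 0.156, "G": 0.023, "T": 0.437},
     "G": {"A": 0.305, "C": 0.199, "G": 0.150, "T": 0.345},
     "T": {"A": 0.284, "C": 0.182, "G": 0.177, "T": 0.357}}

def log_prob_given_first(word):
    """ln P(word | first base) as a sum of log transition probabilities."""
    return sum(math.log(P[a][b]) for a, b in zip(word, word[1:]))

lp = log_prob_given_first("CTGAC")
print(round(math.exp(lp), 5))  # ~0.00337, matching the slide's direct product
```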

Two Extreme Cases
- If X_1, X_2, ... are mutually independent, {X_n} is certainly a Markov chain (of zeroth order).
- If {X_n} is an arbitrary random sequence, then {Y_n}, defined by Y_n = (X_1, X_2, ..., X_n), is a Markov chain, because Y_{n+1} = (Y_n, X_{n+1}) and in Y_n the entire past is kept.
Notes: Consider a Markov chain {X_n}. At the current state x_n = s, the past is not necessarily forgotten. It may even happen that for some m < n, there is only one possible path (s_m, s_{m+1}, ..., s_{n-1}) for X_m, X_{m+1}, ..., X_{n-1} such that X_n = s. But each time the chain visits s, its behavior starting from s is probabilistically the same, regardless of the path it took to reach s in the past.

Geometrically Distributed Duration of Stay
The geometric distribution with parameter q in [0, 1] has the PMF
p(m) = (1 - q)^{m-1} q,  m = 1, 2, ...
For independent trials each with success probability q, this is the distribution of the number M of trials until (and including) the first success. The mean and variance of M are E[M] = 1/q, var(M) = (1 - q)/q^2.
Consider a homogeneous Markov chain. Let s be some state with self-transition probability p_ss > 0. Let L_s be the duration of stay at s after entering s at some arbitrary time n + 1:
L_s = m if X_n ≠ s, X_{n+1} = ... = X_{n+m} = s, X_{n+m+1} ≠ s.
Then
P(L_s = m | X_n ≠ s, X_{n+1} = s) = P(L_s = m | X_{n+1} = s) = p_ss^{m-1} (1 - p_ss).
So the duration has the geometric distribution with mean 1/(1 - p_ss).
The geometric distribution has a memoryless property:
P(time of the first success = r + m | failed the first r trials) = p(m).
(Figure: geometric PMFs p(m) with q = 1 - p_ss, plotted for p_ss = 0.1, 0.3, 0.5, 0.9 and m = 1, ..., 20.)
The shape of such distributions may not match that found in data. When the duration distribution reflects an important aspect of the nature of the data, a general approach is to model the duration distribution explicitly, by including as part of the state variable the time already spent at s after entering s.

Subsequences of a Markov Chain
Let {n_k} be an increasing sequence of integers, n_1 < n_2 < ..., and let Y_k = X_{n_k}, k >= 1. Is {Y_k} a Markov chain?
We check the form of the joint PMF p(y_1, ..., y_{k+1}). We have
p(y_1, ..., y_{k+1}) = sum over {x_i : i < n_{k+1}, i not in {n_1, ..., n_{k+1}}} of p(x_1, x_2, ..., x_{n_{k+1}}),
and by the Markov property of {X_n},
p(x_1, x_2, ..., x_{n_{k+1}}) = p(x_1, x_2, ..., x_{n_k}) p(x_{n_k + 1}, x_{n_k + 2}, ..., x_{n_{k+1}} | x_{n_k}).
So p(y_1, ..., y_{k+1}) equals
( sum over {x_i : i < n_k, i not in {n_1, ..., n_k}} of p(x_1, x_2, ..., x_{n_k}) ) x ( sum over {x_i : n_k < i < n_{k+1}} of p(x_{n_k + 1}, x_{n_k + 2}, ..., x_{n_{k+1}} | x_{n_k}) )
= p(x_{n_1}, x_{n_2}, ..., x_{n_k}) p(x_{n_{k+1}} | x_{n_k}) = p(y_1, y_2, ..., y_k) p(y_{k+1} | y_k).
This shows Y_{k+1} ⊥ Y_1, Y_2, ..., Y_{k-1} | Y_k, so {Y_k} is a Markov chain.
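The geometric duration of stay is easy to confirm by simulation; stay_lengths below is an illustrative helper that repeatedly mimics the self-transition step (stay with probability p_ss, leave otherwise) and compares the empirical mean with 1/(1 - p_ss).

```python
import random

def stay_lengths(p_ss, n_visits, seed=1):
    """Simulate n_visits entries into state s; at each step the chain stays with
    probability p_ss, so the recorded lengths should be geometric."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_visits):
        m = 1                      # the entering step itself counts
        while rng.random() < p_ss: # stay at s with probability p_ss
            m += 1
        lengths.append(m)
    return lengths

p_ss = 0.5
L = stay_lengths(p_ss, 100_000)
print(round(sum(L) / len(L), 2))   # close to the theoretical mean 1/(1 - p_ss) = 2
```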

Some Proerties of Some Proerties of Subsequences of a Markov Chain The indeendence relation nk+ n, n2,..., nk nk read off from the grahical model: can actually be m-ste Transition Probabilities of a Markov chain The m-ste transition robabilities (m) satisfy the recursive formula 2 n n k- k nk+ nk searates nk+ from n, n2,..., nk in the grah. (The exact meaning of this will be exlained together with more general results in the future.) A secial choice of the index sequence {n k } is n k = (k ) m +, for some fixed integer m >. The corresonding Markov chain is, m+, 2m+,.... The transition robabilities (m) of this chain (homogeneous case) are the m-ste transition robabilities of the original chain: (m) = P( m+ = j = i), i, j S. n r m : (m) = P( m+ = j = i) = l S P( r+ = l = i) P( m+ = j r+ = l) = l S (r) il (m r) lj, i, j S. (3) This is called the Chaman-Kolmogorov equation. (Once it was conjectured to be an equivalent definition for Markov chains, but this turned out to be false.) In matrix form, Eq. (3) can be exressed as bp (m) = b P m = b P r bp m r, r m. where P b denotes the transition robability matrix of { n}, and P b(m) denotes the m-ste transition robability matrix with P b(m) = (m), i, j S. We examine next the relation between (m) and, i, j S. Huizhen Yu (U.H.) and Markov Models Jan. 2 2 / 32 Huizhen Yu (U.H.) and Markov Models Jan. 2 22 / 32 Some Proerties of Some Proerties of Let, 2,..., n be a Markov chain. Reversing the Chain Let Y k = n+ k, k n. Is {Y k } a Markov chain? 2 j, k: j+k = n + j j+ n- Y Y n Yn- Y k k- Y2 Y The joint PMF (x j, x j+,..., x n) can be exressed as so (x j, x j+,..., x n) = (x j, x j+) (x j+2,..., x n x j+) n = (x j x j+) (x j+) (x j+2,..., x n x j+) = (x j x j+) (x j+, x j+2,..., x n), (x j x j+,..., x n) = (x j x j+). Thihows the reversed sequence {Y k } is a Markov chain. Another Conditional Indeendence Relation Let, 2,..., n be a Markov chain. Denote j = (,..., j, j+,..., n), all variables but j. 
Then, for all j n, P( j = x j j = x j) = P( j = x j j = x j, j+ = x j+). (4) Visualize the ositions of the variables in the grah: 2 j- j j+ n Derivation of Eq. (4): the joint PMF (x j, x j ) can be written as (x j, x j ) = (x,..., x j ) (x j x j )(x j+ x j ) (x j+2,..., x n x j+ ), so it has the form (x j, x j ) = h(x,..., x j ) g(x j, x j, x j+ ) ω(x j+,..., x n) for some functions h, g, ω. Thihows (one of the exercises) that j,..., j 2, j+2,..., n j, j+. Huizhen Yu (U.H.) and Markov Models Jan. 2 23 / 32 Huizhen Yu (U.H.) and Markov Models Jan. 2 24 / 32
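The matrix form of the Chapman-Kolmogorov equation can be verified numerically; the 2-state transition matrix below is hypothetical, and plain nested lists are used so the sketch needs no external library.

```python
# Verify P^m = P^r P^{m-r} for a small hypothetical transition matrix.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(A, m):
    R = A
    for _ in range(m - 1):
        R = matmul(R, A)
    return R

P = [[0.9, 0.1],
     [0.2, 0.8]]

m, r = 5, 2
lhs = matpow(P, m)                            # m-step transition matrix P^m
rhs = matmul(matpow(P, r), matpow(P, m - r))  # P^r P^{m-r}
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-12 for i in range(2) for j in range(2))
print(round(lhs[0][1], 6))  # 5-step probability of moving from the first state to the second
```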

Hidden Markov Models (HMM)
Hidden Markov Models refer loosely to a broad class of models in which a subset of the random variables, denoted by X = {X_1, X_2, ...}, is modeled as a Markov chain, and their values are not observed in practice; the rest of the random variables, denoted by Y, are observable in practice, and P(Y | X) usually has a simple form.
A common case: Y = {Y_1, Y_2, ...}, and conditional on X, the Y_i are independent with P(Y_i | X) = P(Y_i | X_{i-1}, X_i). (Treat X_0 as a dummy variable.) The joint PMF of (X, Y) then factorizes as
p(x, y) = prod_i p(x_i | x_{i-1}) p(y_i | x_{i-1}, x_i).
p(y_i | x_{i-1}, x_i): often called observation (or emission) probabilities.
A graphical model indicating the factorization form of p: a chain X_1 -> X_2 -> ... -> X_n, with each Y_i receiving edges from X_{i-1} and X_i.
The sequence of (X_i, Y_i) jointly is a Markov chain.

HMM and Data-Generating Processes
In some applications, the above structure of an HMM corresponds intuitively to the data-generating process: for instance,
- Speech recognition: X speech, Y acoustic signals
- Tracking a moving target: X positions/velocities of the target, Y signals detected
- Robot navigation: X positions of a robot, Y observations of landmarks
In other applications, the model can have nothing to do with the underlying data-generating process, and is introduced purely for the questions at hand. Examples include sequence alignment and sequence labeling applications.

HMM for Parts-of-Speech Tagging
Parts-of-speech tags for a sentence:

    pron  v      adv   final punct.
    I     drove  home  .

Possible tags: I: n, pron; drove: v, n; home: n, adj, adv, v.
Represent a sentence as a sequence of words W = (W_1, W_2, ..., W_n), and let the associated tags be T = (T_1, T_2, ..., T_n). A Markov chain model for (W_i, T_i), 1 <= i <= n, used in practice is:
p(w, t) = prod_{i=1}^{n} p(w_i | t_i) p(t_i | t_{i-1}, t_{i-2}).
Question: why is W treated as generated by T, and not the other way around?
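P(Y = y) for an HMM can be computed by the forward recursion, one instance of the efficient inference algorithms mentioned later in this lecture. The sketch below uses hypothetical parameters and the common simplification p(y_i | x_i) in place of p(y_i | x_{i-1}, x_i); the names states, p_init, p_trans, and p_obs are illustrative.

```python
# Forward algorithm: P(Y = y) by dynamic programming over the hidden states.
states = ["+", "-"]
p_init = {"+": 0.5, "-": 0.5}
p_trans = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
p_obs = {"+": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
         "-": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def forward(y):
    # alpha[s] = P(Y_1..Y_i = y_1..y_i, X_i = s), updated left to right
    alpha = {s: p_init[s] * p_obs[s][y[0]] for s in states}
    for obs in y[1:]:
        alpha = {s: sum(alpha[t] * p_trans[t][s] for t in states) * p_obs[s][obs]
                 for s in states}
    return sum(alpha.values())   # P(Y = y), summing out the last hidden state

print(forward("ACGT"))
```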

Another Artificial Example of Sequence Labeling
Suppose two different types of regions, type + and type -, can occur in the same DNA sequence, and each type has its own characteristic pattern of transitions among the bases A, C, G, T. Given a sequence un-annotated with region types, we want to find where changes between the two types may occur and whether they occur sharply.
An HMM for this problem: Let Y = {Y_i} correspond to a DNA sequence. Introduce variables Z_i with z_i in {+, -} to indicate which type is in force at position i. Let X_i = (Y_i, Z_i), and model X = {X_i} as a Markov chain on S = {A, C, G, T} x {+, -}. Then, for an un-annotated sequence y, we solve
arg max_x P(X = x | Y = y),  equivalently,  arg max_z P(Z = z | Y = y).
Any z in the latter set gives one of the most probable configurations of boundaries between type + and type - regions for the sequence y.

Inference and Learning with HMM
Quantities of interest in applications usually include:
- P(Y = y), and P(X = x | Y = y) for a single x
- arg max_x P(X = x | Y = y), the most probable path given Y = y
- P(X_i | Y = y), the marginal distribution of X_i given Y = y, and arg max_{x_i} P(X_i = x_i | Y = y)
Efficient inference algorithms are available: utilizing the structure/factorization form of p, the computation can be streamlined.
To specify an HMM, we need to specify its topology (the space of X_i, the relation between {X_i} and {Y_i}), and then its parameters. Parameters of the HMMs in the earlier examples: transition probabilities and observation probabilities.
Parameter estimation:
- Complete data case: sequences of states {x_i} are given for estimation. This case reduces to the case of Markov chains.
- Incomplete data case: sequences of states {x_i} are unknown. In this case estimation is based on the observed sequences {y_i}, and is typically done with the so-called expectation-maximization (EM) algorithm.
We will study these topics in some future lectures.

HMM vs. High Order Markov Models
In the application contexts shown earlier, the main interest is in the hidden/latent variables {X_i}. Consider now the case where our interest is solely in {Y_i} (for instance, to predict Y_{n+1} given Y_1, ..., Y_n). We compare two choices of models for {Y_i}:
- a Markov model for {Y_i}, possibly of high order;
- an HMM for {Y_i}, in which we introduce auxiliary, latent random variables {X_i}.
In an HMM for {Y_i},
p(y_1, y_2, ..., y_n) = sum_{x_1, ..., x_n} prod_{i=1}^{n} p(x_i | x_{i-1}) p(y_i | x_{i-1}, x_i),
so generally, Y_1, ..., Y_n are fully dependent as modeled. In a Markov or high-order Markov model, certain conditional independence among {Y_i} is assumed. This shows that with an HMM we approximate the true distribution of {Y_i} by relatively simple distributions of another kind than those in a Markov or high-order Markov model.

Further Readings
On Markov chains: A. C. Davison, Statistical Models, Cambridge Univ. Press, 2003, Chap. 6.1. (You may skip the materials on pp. 229-232 about classification of states if not interested.)
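For the region-labeling example above, the most probable path arg max_x P(X = x | Y = y) can be computed with the Viterbi recursion (dynamic programming in log space). The sketch below uses hypothetical two-type parameters and, for brevity, emissions of the form p(y_i | z_i); all numbers are illustrative, not estimates from data.

```python
import math

states = ["+", "-"]
p_init = {"+": 0.5, "-": 0.5}
p_trans = {"+": {"+": 0.95, "-": 0.05}, "-": {"+": 0.05, "-": 0.95}}
p_obs = {"+": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
         "-": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def viterbi(y):
    # delta[s]: log-probability of the best state path ending in s at the current position
    delta = {s: math.log(p_init[s]) + math.log(p_obs[s][y[0]]) for s in states}
    back = []                                  # backpointers, one dict per step
    for obs in y[1:]:
        prev = {s: max(states, key=lambda t: delta[t] + math.log(p_trans[t][s]))
                for s in states}
        delta = {s: delta[prev[s]] + math.log(p_trans[prev[s]][s]) + math.log(p_obs[s][obs])
                 for s in states}
        back.append(prev)
    # backtrack from the best final state
    s = max(states, key=delta.get)
    path = [s]
    for prev in reversed(back):
        s = prev[s]
        path.append(s)
    return "".join(reversed(path))

print(viterbi("AATACGCG"))  # '++++----': one sharp type change under these parameters
```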