11 Hidden Markov Models

Kellie Carter
6 years ago
Hidden Markov Models are a popular machine learning approach in bioinformatics. Machine learning algorithms are presented with training data, which are used to derive important insights about the (often hidden) parameters. Once an algorithm has been suitably trained, it can apply these insights to the analysis of a test sample. As the amount of training data increases, the accuracy of the machine learning algorithm typically increases as well. The parameters that are learned during training represent knowledge; application of the algorithm with those parameters to new data (not used in the training phase) represents the algorithm's use of that knowledge. The Hidden Markov Model (HMM) approach, considered in this chapter, learns some unknown probabilistic parameters from training samples and uses these parameters in the framework of dynamic programming (and other algorithmic techniques) to find the best explanation for the experimental data.

11.1 CG-Islands and the Fair Bet Casino

The least frequent dinucleotide in many genomes is CG. The reason for this is that the C within CG is easily methylated [1], and the resulting methyl-C has a tendency to mutate into T. However, the methylation is often suppressed around genes in areas called CG-islands, in which CG appears relatively frequently. An important problem is to define and locate CG-islands in a long genomic text.

[1] Cells often biochemically modify DNA and proteins. Methylation is the most common DNA modification and results in the addition of a methyl (CH3) group to a nucleotide position in DNA.

Finding CG-islands can be modeled after the following toy gambling problem. The Fair Bet Casino has a game in which a dealer flips a coin and the player bets on the outcome (heads or tails). The dealer in this (crooked) casino uses either a fair coin (heads or tails are equally likely) or a biased coin that will give heads with a probability of 3/4. For security reasons, the dealer does not like to change coins, so this happens relatively rarely, with a probability of 0.1. Given a sequence of coin tosses, the problem is to find out when the dealer used the biased coin and when he used the fair coin, since this will help you, the player, learn the dealer's psychology and enable you to win money. Obviously, if you observe a long line of heads, it is likely that the dealer used the biased coin, whereas if you see an even distribution of heads and tails, he likely used the fair one. Though you can never be certain that a long string of heads is not just a fluke, you are primarily interested in the most probable explanation of the data. Based on this sensible intuition, we might formulate the problem as follows:

Fair Bet Casino Problem:
Given a sequence of coin tosses, determine when the dealer used a fair coin and when he used a biased coin.
Input: A sequence $x = x_1 x_2 x_3 \ldots x_n$ of coin tosses (either H or T) made by two possible coins (F or B).
Output: A sequence $\pi = \pi_1 \pi_2 \pi_3 \ldots \pi_n$, with each $\pi_i$ being either F or B, indicating that $x_i$ is the result of tossing the fair or biased coin, respectively.

Unfortunately, this problem formulation simply makes no sense. The ambiguity is that any sequence of coins could possibly have generated the observed outcomes, so technically π = FFF...FF is a valid answer to this problem for every observed sequence of coin flips, as is π = BBB...BB. We need to incorporate a way to grade different coin sequences as being better answers than others. Below we explain how to turn this ill-defined problem into the Decoding problem based on the HMM paradigm.

First, we consider the problem under the assumption that the dealer never changes coins.
In this case, letting 0 denote tails and 1 denote heads, the question is which of the two coins he used, fair ($p^+(0) = p^+(1) = 1/2$) or biased ($p^-(0) = 1/4$, $p^-(1) = 3/4$). If the resulting sequence of tosses is $x = x_1 \ldots x_n$, then the probability that x was generated by a fair coin is [2]

$$P(x \mid \text{fair coin}) = \prod_{i=1}^{n} p^+(x_i) = \frac{1}{2^n}.$$

On the other hand, the probability that x was generated by a biased coin is

$$P(x \mid \text{biased coin}) = \prod_{i=1}^{n} p^-(x_i) = \left(\frac{1}{4}\right)^{n-k} \left(\frac{3}{4}\right)^{k} = \frac{3^k}{4^n}.$$

Here k is the number of heads in x. If $P(x \mid \text{fair coin}) > P(x \mid \text{biased coin})$, then the dealer most likely used a fair coin; if $P(x \mid \text{fair coin}) < P(x \mid \text{biased coin})$, then the dealer most likely used a biased coin. The probabilities $P(x \mid \text{fair coin}) = 1/2^n$ and $P(x \mid \text{biased coin}) = 3^k/4^n$ become equal at $k = n / \log_2 3$. As a result, when $k < n / \log_2 3$, the dealer most likely used a fair coin, and when $k > n / \log_2 3$, he most likely used a biased coin. We can define the log-odds ratio as follows:

$$\log_2 \frac{P(x \mid \text{fair coin})}{P(x \mid \text{biased coin})} = \sum_{i=1}^{n} \log_2 \frac{p^+(x_i)}{p^-(x_i)} = n - k \log_2 3.$$

However, we know that the dealer does change coins, albeit rarely. One approach to making an educated guess as to which coin the dealer used at each point would be to slide a window of some width along the sequence of coin flips and calculate the log-odds ratio of the sequence under each window. In effect, this is considering the log-odds ratio of short regions of the sequence. If the log-odds ratio of the short sequence falls below 0, then the dealer most likely used a biased coin while generating this window of sequence; otherwise the dealer most likely used a fair coin.

Similarly, a naive approach to finding CG-islands in long DNA sequences is to calculate log-odds ratios for a sliding window of some particular length, and to declare windows that receive positive scores to be potential CG-islands. Of course, the disadvantage of this approach is that we do not know the length of CG-islands in advance and that some overlapping windows may classify the same nucleotide differently. HMMs represent a different probabilistic approach to this problem.

[2] The notation $P(x \mid y)$ is shorthand for the probability of x occurring under the assumption that (some condition) y is true. The notation $\prod_{i=1}^{n} a_i$ means $a_1 \cdot a_2 \cdot a_3 \cdots a_n$.
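The sliding-window idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tosses are encoded as 0 (tails) and 1 (heads), the fair coin gives heads with probability 1/2 and the biased coin with probability 3/4, and the function names, the example toss sequence, and the window width of 4 are all hypothetical choices, not part of the text.

```python
import math


def log_odds(window):
    """Return log2(P(window | fair) / P(window | biased)).

    With p+(0) = p+(1) = 1/2 and p-(0) = 1/4, p-(1) = 3/4 this equals
    n - k * log2(3), where n is the window length and k the number of heads.
    """
    n = len(window)
    k = sum(window)  # tosses are 0 (tails) or 1 (heads)
    return n - k * math.log2(3)


def sliding_log_odds(tosses, width):
    """Log-odds score of every window of the given width, left to right."""
    return [
        log_odds(tosses[i:i + width])
        for i in range(len(tosses) - width + 1)
    ]


# Hypothetical toss sequence and window width, chosen only for illustration.
tosses = [0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1]
scores = sliding_log_odds(tosses, width=4)

# A score below 0 suggests the biased coin; otherwise the fair coin.
labels = ["B" if s < 0 else "F" for s in scores]
print(list(zip(scores, labels)))
```

Each label describes a whole window rather than a single toss, so neighboring windows can disagree about the same toss. That is the weakness of the window approach noted above.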

11.2 The Fair Bet Casino and Hidden Markov Models

An HMM can be viewed as an abstract machine that has an ability to produce some output using coin tossing. The operation of the machine proceeds in discrete steps: at the beginning of each step, the machine is in a hidden state, of which there are k. During the step, the HMM makes two decisions: (1) What state will I move to next? and (2) What symbol from an alphabet Σ will I emit? The HMM decides on the former by choosing randomly among the k states; it decides on the latter by choosing randomly among the |Σ| symbols. The choices that the HMM makes are typically biased, and may follow arbitrary probabilities. Moreover, the probability distributions [3] that govern which state to move to and which symbols to emit change from state to state. In essence, if there are k states, then there are k different next-state distributions and k different symbol-emission distributions. An important feature of HMMs is that an observer can see the emitted symbols but has no ability to see what state the HMM is in at any step, hence the name Hidden Markov Models. The goal of the observer is to infer the most likely states of the HMM by analyzing the sequences of emitted symbols. Since an HMM effectively uses dice to emit symbols, the sequence of symbols it produces does not form any readily recognizable pattern.

Formally, an HMM M is defined by an alphabet of emitted symbols Σ, a set of (hidden) states Q, a matrix of state transition probabilities A, and a matrix of emission probabilities E, where

- Σ is an alphabet of symbols;
- Q is a set of states, each of which will emit symbols from the alphabet Σ;
- $A = (a_{kl})$ is a $|Q| \times |Q|$ matrix describing the probability of changing to state l after the HMM is in state k; and
- $E = (e_k(b))$ is a $|Q| \times |\Sigma|$ matrix describing the probability of emitting the symbol b during a step in which the HMM is in state k.

Each row of the matrix A describes a state die [4] with |Q| sides, while each row of the matrix E describes a symbol die with |Σ| sides. The Fair Bet Casino process corresponds to the following HMM M(Σ, Q, A, E), shown in figure 11.1:

- Σ = {0, 1}, corresponding to tails (0) or heads (1)
- Q = {F, B}, corresponding to a fair (F) or biased (B) coin
- $a_{FF} = a_{BB} = 0.9$, $a_{FB} = a_{BF} = 0.1$
- $e_F(0) = 1/2$, $e_F(1) = 1/2$, $e_B(0) = 1/4$, $e_B(1) = 3/4$

Figure 11.1: The HMM designed for the Fair Bet Casino problem. There are two states: F (fair) and B (biased). From each state, the HMM can emit either heads (H) or tails (T), with the probabilities shown. The HMM will switch between F and B with probability 1/10.

[3] A probability distribution is simply an assignment of probabilities to outcomes; in this case, the outcomes are either symbols to emit or states to move to. We have seen probability distributions, in a disguised form, in the context of motif finding. Every column of a profile, when each element is divided by the number of sequences in the sample, forms a probability distribution.

[4] Singular of dice.

A path $\pi = \pi_1 \ldots \pi_n$ in the HMM M is a sequence of states. For example, if a dealer used the fair coin for the first three and the last three tosses and the biased coin for five tosses in between, the corresponding path π would be π = FFFBBBBBFFF. If the resulting sequence of tosses is 01011101001, then the following shows the matching of x to π and the probability of $x_i$ being generated by $\pi_i$ at each flip:

x                 0    1    0    1    1    1    0    1    0    0    1
π                 F    F    F    B    B    B    B    B    F    F    F
P(x_i | π_i)      1/2  1/2  1/2  3/4  3/4  3/4  1/4  3/4  1/2  1/2  1/2

We write $P(x_i \mid \pi_i)$ to denote the probability that symbol $x_i$ was emitted from state $\pi_i$; these values are given by the matrix E. We write $P(\pi_i \to \pi_{i+1})$ to denote the probability of the transition from state $\pi_i$ to $\pi_{i+1}$; these values are given by the matrix A. The path π = FFFBBBBBFFF includes only two switches of coins, first from F to B (after the third step), and second from B to F (after the eighth step). The probability of these two switches, $\pi_3 \to \pi_4$ and $\pi_8 \to \pi_9$, is 1/10, while the probability of all other transitions, $\pi_{i-1} \to \pi_i$, is 9/10, as shown below: [5]

x                 0    1    0    1    1    1    0    1    0    0    1
π                 F    F    F    B    B    B    B    B    F    F    F
P(x_i | π_i)      1/2  1/2  1/2  3/4  3/4  3/4  1/4  3/4  1/2  1/2  1/2
P(π_{i-1} → π_i)  1/2  9/10 9/10 1/10 9/10 9/10 9/10 9/10 1/10 9/10 9/10

The probability of generating x through the path π (assuming for simplicity that in the first moment the dealer is equally likely to have a fair or a biased coin) is roughly $2.66 \times 10^{-6}$ and is computed as:

$$\left(\tfrac{1}{2} \cdot \tfrac{1}{2}\right) \left(\tfrac{1}{2} \cdot \tfrac{9}{10}\right) \left(\tfrac{1}{2} \cdot \tfrac{9}{10}\right) \left(\tfrac{3}{4} \cdot \tfrac{1}{10}\right) \left(\tfrac{3}{4} \cdot \tfrac{9}{10}\right) \left(\tfrac{3}{4} \cdot \tfrac{9}{10}\right) \left(\tfrac{1}{4} \cdot \tfrac{9}{10}\right) \left(\tfrac{3}{4} \cdot \tfrac{9}{10}\right) \left(\tfrac{1}{2} \cdot \tfrac{1}{10}\right) \left(\tfrac{1}{2} \cdot \tfrac{9}{10}\right) \left(\tfrac{1}{2} \cdot \tfrac{9}{10}\right)$$

In the above example, we assumed that we knew π and observed x. However, in reality we do not have access to π. If you only observe that x = 01011101001, then you might ask yourself whether or not π = FFFBBBBBFFF is the best explanation for x. Furthermore, if it is not the best explanation, is it possible to reconstruct the best one? It turns out that FFFBBBBBFFF is not the most probable path for x = 01011101001: FFFBBBFFFFF is slightly better, with probability roughly $3.5 \times 10^{-6}$.

x                 0    1    0    1    1    1    0    1    0    0    1
π                 F    F    F    B    B    B    F    F    F    F    F
P(x_i | π_i)      1/2  1/2  1/2  3/4  3/4  3/4  1/2  1/2  1/2  1/2  1/2
P(π_{i-1} → π_i)  1/2  9/10 9/10 1/10 9/10 9/10 1/10 9/10 9/10 9/10 9/10

The probability that sequence x was generated by the path π, given the model M, is

$$P(x \mid \pi) = P(\pi_0 \to \pi_1) \cdot \prod_{i=1}^{n} P(x_i \mid \pi_i) \, P(\pi_i \to \pi_{i+1}) = a_{\pi_0, \pi_1} \cdot \prod_{i=1}^{n} e_{\pi_i}(x_i) \cdot a_{\pi_i, \pi_{i+1}}.$$

[5] We have added a fictitious term, $P(\pi_0 \to \pi_1) = 1/2$, to model the initial condition: the dealer is equally likely to have either a fair or a biased coin before the first flip.
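The path-probability computation for the Fair Bet Casino HMM can be checked with a short Python sketch. It assumes the toss sequence of the worked example is x = 01011101001, uses the transition probabilities 9/10 (stay) and 1/10 (switch), the emission probabilities 1/2, 1/2 (fair) and 1/4, 3/4 (biased), and the fictitious initial term 1/2. The names `A`, `E`, `INITIAL`, and `path_probability` are illustrative, and the transition into the fictitious end state is taken to be 1, as in the worked product.

```python
from fractions import Fraction

# Transition probabilities a_kl: stay with the same coin 9/10, switch 1/10.
A = {
    ("F", "F"): Fraction(9, 10), ("F", "B"): Fraction(1, 10),
    ("B", "B"): Fraction(9, 10), ("B", "F"): Fraction(1, 10),
}

# Emission probabilities e_k(b) for tails ("0") and heads ("1").
E = {
    "F": {"0": Fraction(1, 2), "1": Fraction(1, 2)},
    "B": {"0": Fraction(1, 4), "1": Fraction(3, 4)},
}

# Fictitious initial transition: either coin is equally likely at the start.
INITIAL = Fraction(1, 2)


def path_probability(x, path):
    """Probability of emitting x along the given path of states.

    Assumption: the transition into the fictitious end state contributes 1.
    """
    if len(x) != len(path):
        raise ValueError("x and path must have the same length")
    prob = INITIAL * E[path[0]][x[0]]
    for i in range(1, len(x)):
        prob *= A[(path[i - 1], path[i])] * E[path[i]][x[i]]
    return prob


x = "01011101001"
p_first = path_probability(x, "FFFBBBBBFFF")
p_second = path_probability(x, "FFFBBBFFFFF")
print(float(p_first), float(p_second), p_second / p_first)
```

Exact fractions are used so that the two paths can be compared without rounding error. The two paths differ only in the states assigned to the seventh and eighth tosses, so the ratio of their probabilities is (1/2 · 1/2) / (1/4 · 3/4) = 4/3.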

For convenience, we have introduced $\pi_0$ and $\pi_{n+1}$ as the fictitious initial and terminal states begin and end. This model defines the probability $P(x \mid \pi)$ for a given sequence x and a given path π. Since only the dealer knows the real sequence of states π that emitted x, we say that π is hidden and attempt to solve the following Decoding problem:

Decoding Problem:
Find an optimal hidden path of states given observations.
Input: Sequence of observations $x = x_1 \ldots x_n$ generated by an HMM M(Σ, Q, A, E).
Output: A path that maximizes $P(x \mid \pi)$ over all possible paths π.

The Decoding problem is an improved formulation of the ill-defined Fair Bet Casino problem.

11.3 Decoding Algorithm

In 1967 Andrew Viterbi used an HMM-inspired analog of the Manhattan grid for the Decoding problem, and described an efficient dynamic programming algorithm for its solution. Viterbi's Manhattan is shown in figure 11.2, with every choice of $\pi_1, \ldots, \pi_n$ corresponding to a path in this graph. One can set the edge weights in this graph so that the product of the edge weights for path $\pi = \pi_1 \ldots \pi_n$ equals $P(x \mid \pi)$. There are $|Q|^2 (n-1)$ edges in this graph, with the weight of an edge from $(k, i)$ to $(l, i+1)$ given by $e_l(x_{i+1}) \cdot a_{kl}$. Unlike the alignment approaches covered in chapter 6, where the set of valid directions was restricted to south, east, and southeast edges, the Manhattan built to solve the decoding problem only forces the tourists to move in any eastward direction (e.g., northeast, east, southeast, etc.), and places no additional restrictions (fig. 11.3). To see why the length of the edge between the vertices $(k, i)$ and $(l, i+1)$ in the corresponding graph is given by $e_l(x_{i+1}) \cdot a_{kl}$, one should compare $p_{k,i}$ [the probability of a path ending in vertex $(k, i)$] with

An Introduction to Bioinformatics Algorithms Hidden Markov Models
