Grundlagen der Bioinformatik, SS 09, D. Huson, June 16, 2009


7 Markov chains and Hidden Markov Models

We will discuss: Markov chains, Hidden Markov Models (HMMs) and profile HMMs.

This chapter is based on: R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge, 1998.

7.1 CpG-islands

Example: finding CpG-islands in the human genome. In double-stranded DNA, a C that is directly followed by a G on the same strand is called a CpG-pair (the p stands for the phosphate between the two nucleotides).

The C in a CpG-pair is often modified by methylation (that is, an H-atom is replaced by a CH3-group). There is a relatively high chance that the methyl-C will mutate to a T. Hence, CpG-pairs are under-represented in the human genome.

Upstream of a gene, the methylation process is suppressed in short regions of the genome. These areas are called CpG-islands and they are characterized by the fact that we see more CpG-pairs in them than elsewhere. CpG-islands are useful marks for genes in organisms whose genomes contain 5-methyl-cytosine.

Definition (classical definition of CpG-islands): a DNA sequence of length 200 with a C+G content of 50% and a ratio of observed-to-expected number of CpGs that is above 0.6. (Gardiner-Garden & Frommer, 1987)

According to a recent study, human chromosomes 21 and 22 contain about 1100 CpG-islands and about 750 genes. (Comprehensive analysis of CpG islands in human chromosomes 21 and 22, D. Takai & P.A. Jones, PNAS, March 19, 2002)

We will address the following two main questions concerning CpG-islands:

1. Given a short segment of genomic sequence, how do we decide whether this segment comes from a CpG-island or not?
2. Given a long segment of genomic sequence, how do we find all CpG-islands contained in it?
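As a small illustration of the classical definition, the two quantities involved can be computed by simple counting. The following Python sketch is our own illustration and not part of the original notes; the example sequence is made up, and a real test would additionally require a window length of at least 200:

    # Sketch: GC content and observed/expected CpG ratio of a DNA string,
    # the two quantities used in the classical definition above.
    def cpg_island_stats(seq):
        seq = seq.upper()
        n = len(seq)
        c, g = seq.count("C"), seq.count("G")
        obs_cpg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
        exp_cpg = c * g / n        # expected CpGs if C and G were independent
        gc_content = (c + g) / n
        ratio = obs_cpg / exp_cpg if exp_cpg > 0 else 0.0
        return gc_content, ratio

    # Classical criterion: length >= 200, GC content >= 0.5, obs/exp >= 0.6.
    gc, ratio = cpg_island_stats("CGCGTACGACGCGGCGCGAT")   # made-up sequence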

7.2 Markov chains

Our goal is to set up a probabilistic model for CpG-islands. Because pairs of consecutive nucleotides are important in this context, we need a model in which the probability of a symbol depends on the preceding symbol. This leads us to a Markov chain.

Example: a Markov chain over the four states A, C, G and T. In the usual drawing, circles = states and arrows = possible transitions, each labeled with a transition probability a_{st} = P(x_i = t | x_{i-1} = s).

Definition (Markov chain): a (time-homogeneous) Markov chain (of order 1) is a system (S, A) consisting of a finite set of states S = {s_1, s_2, ..., s_n} and a transition matrix A = {a_{st}}, with sum_{t in S} a_{st} = 1 for all s in S, that determines the probability of the transition s -> t as follows:

    P(x_{i+1} = t | x_i = s) = a_{st}.

(At any time i the chain is in a specific state x_i, and at the tick of a clock the chain changes to state x_{i+1} according to the given transition probabilities.)

Example: weather in Tübingen, daily at midday. Possible states are rain (R), sun (S), clouds (C) or tornado, with a transition matrix over R, S and C (numerical entries omitted). An observed sequence of weather states might look like this:

    ...rrrrrrccsssssscscscccrrcrcssss...

Probability of a sequence of states

Given a sequence of states x_1, x_2, x_3, ..., x_L, what is the probability that a Markov chain will step through precisely this sequence of states?

    P(x) = P(x_L, x_{L-1}, ..., x_1)
         = P(x_L | x_{L-1}, ..., x_1) P(x_{L-1} | x_{L-2}, ..., x_1) ... P(x_1)
           (by repeated application of P(X, Y) = P(X | Y) P(Y))
         = P(x_L | x_{L-1}) P(x_{L-1} | x_{L-2}) ... P(x_2 | x_1) P(x_1)
         = P(x_1) prod_{i=2}^{L} a_{x_{i-1} x_i},

because P(x_i | x_{i-1}, ..., x_1) = P(x_i | x_{i-1}) = a_{x_{i-1} x_i}, the Markov chain property.
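In code, this product is a one-line loop. The following sketch is our own illustration; the initial and transition probabilities of the weather chain are made up, since the numerical matrix is not reproduced above:

    # P(x) = P(x_1) * prod_{i=2..L} a_{x_{i-1} x_i} for the weather chain.
    P1 = {"R": 0.3, "S": 0.4, "C": 0.3}          # illustrative P(x_1)
    A = {"R": {"R": 0.5, "S": 0.2, "C": 0.3},    # illustrative a_st
         "S": {"R": 0.1, "S": 0.6, "C": 0.3},
         "C": {"R": 0.3, "S": 0.3, "C": 0.4}}

    def chain_probability(states):
        p = P1[states[0]]
        for s, t in zip(states, states[1:]):
            p *= A[s][t]
        return p

    print(chain_probability("SSSSSSSS"))   # sun today and on the next 7 days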

Modeling the begin and end states

In the previous discussion we overlooked the fact that a Markov chain starts in some state x_1, with initial probability P(x_1). We add a begin state to the model that is labeled b, and we will always assume that x_0 = b holds. Then:

    P(x_1 = s) = a_{bs} = P(s),

where P(s) denotes the background probability of symbol s. Similarly, we explicitly model the end of the sequence of states using an end state e. Thus, the probability that the chain ends after state x_L is a_{x_L e}.

Extension of the model: the chain over A, C, G and T, extended by the begin state b and the end state e.

    # Markov chain that generates CpG-islands
    # (Source: DEKM98, p 50)
    # Number of states: 6
    # State labels: A C G T * +
    # Transition matrix: (numerical entries omitted)

Determining the transition matrix

The transition matrix A+ is obtained empirically ("trained") by counting the transitions that occur in a training set of known CpG-islands. This is done as follows:

    a+_{st} = c+_{st} / sum_{t'} c+_{st'},

where c+_{st} is the number of positions in a training set of CpG-islands at which state s is followed by state t. We obtain A- empirically in a similar way, using a training set of known non-CpG-islands.

Two examples of Markov chains

    # Markov chain for CpG-islands      # Markov chain for non-CpG-islands
    # (Source: DEKM98, p 50)            # (Source: DEKM98, p 50)
    # Number of states: 6               # Number of states: 6
    # State labels: A C G T * +         # State labels: A C G T * +
    # Transition matrix:                # Transition matrix:
    # (numerical entries omitted)       # (numerical entries omitted)

Answering question 1

Suppose we are given a short sequence x = (x_1, x_2, ..., x_L). Does it come from a CpG-island (Model+)? We have:

    P(x | Model+) = prod_{i=0}^{L} a+_{x_i x_{i+1}},

with x_0 = b and x_{L+1} = e. We use the following score:

    S(x) = log ( P(x | Model+) / P(x | Model-) )
         = sum_{i=1}^{L} log ( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} ).

The higher this score is, the more probable it is that x comes from a CpG-island.
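The following Python sketch (our own, with made-up toy training sequences) combines the two steps above: it trains A+ and A- by counting and then evaluates the score S(x). Begin/end transitions are ignored for simplicity, and a pseudocount is used to avoid zero probabilities (pseudocounts are discussed further below):

    from math import log

    ALPHABET = "ACGT"

    def train(sequences, r=1.0):
        # a_st = c_st / sum_t' c_st', with pseudocount r added to every count
        c = {s: {t: r for t in ALPHABET} for s in ALPHABET}
        for x in sequences:
            for s, t in zip(x, x[1:]):
                c[s][t] += 1
        return {s: {t: c[s][t] / sum(c[s].values()) for t in ALPHABET}
                for s in ALPHABET}

    def score(x, a_plus, a_minus):
        # S(x) = sum_i log( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} )
        return sum(log(a_plus[s][t] / a_minus[s][t]) for s, t in zip(x, x[1:]))

    a_plus = train(["CGCGCGGC", "GCGCCGCG"])    # toy "CpG-island" training set
    a_minus = train(["ATTACATG", "TGATCATA"])   # toy "non-island" training set
    print(score("GCGCGC", a_plus, a_minus))     # positive: looks like an island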

Types of questions that a Markov chain can answer

Example: weather in Tübingen, daily at midday. Possible states are rain, sun or clouds, with transition probabilities as above (numerical entries omitted). Types of questions that the model can answer: if it is sunny today, what is the probability that the sun will shine for the next seven days?

7.3 Hidden Markov Models (HMMs)

Motivation: question 2, how to detect CpG-islands inside a long sequence? One possible approach is a window technique: a window of width w is moved along the sequence and the score is plotted. Problem: it is hard to determine the boundaries of CpG-islands, and which window size w should one choose? ...

We will consider an alternative approach: merge the two Markov chains Model+ and Model- to obtain a so-called Hidden Markov Model.

Definition (HMM): an HMM is a system M = (S, Q, A, e) consisting of an alphabet S, a set of states Q, a matrix A = {a_{kl}} of transition probabilities a_{kl} for k, l in Q, and an emission probability e_k(b) for every k in Q and b in S.

Example

The topology of an HMM for CpG-islands: the eight states A+, C+, G+, T+ and A-, C-, G-, T-. (Additionally, we have all transitions between states in either of the two sets that carry over from the two Markov chains Model+ and Model-.)

An HMM for CpG-islands:

    # Number of states: 9
    # Names of states (begin/end, A+, C+, G+, T+, A-, C-, G- and T-):
    #   0 A C G T a c g t
    # Number of symbols: 4
    # Names of symbols: a c g t
    # Transition matrix, including the probability to change from the
    #   +island states to the -island states and vice versa:
    #   (numerical entries omitted)
    # Emission probabilities: (numerical entries omitted)

From now on we use 0 for the begin and end state.

Example: fair/loaded dice

A casino uses two dice, a fair one and a loaded one:

    Fair:   1: 1/6   2: 1/6   3: 1/6   4: 1/6   5: 1/6   6: 1/6
    Loaded: 1: 1/10  2: 1/10  3: 1/10  4: 1/10  5: 1/10  6: 1/2

The casino guest only observes the numbers rolled; which die was used remains hidden:

    F F F F F F F F F F F F U U U U U F F F F F F F F F F ...
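Written out as plain Python dictionaries, the fair/loaded model looks as follows. This is our own sketch: state 0 is the silent begin/end state, and since the numerical transition matrix is not reproduced above, the switch and stop probabilities are made up for illustration:

    STATES = ["F", "U"]      # F = fair die, U = unfair (loaded) die
    SYMBOLS = "123456"

    TRANS = {                # a_kl; rows sum to 1, values illustrative
        "0": {"F": 0.50, "U": 0.50, "0": 0.00},
        "F": {"F": 0.94, "U": 0.05, "0": 0.01},
        "U": {"F": 0.09, "U": 0.90, "0": 0.01},
    }
    EMIT = {                 # e_k(b), exactly as in the table above
        "F": {s: 1 / 6 for s in SYMBOLS},
        "U": {"1": 1 / 10, "2": 1 / 10, "3": 1 / 10,
              "4": 1 / 10, "5": 1 / 10, "6": 1 / 2},
    }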

Generation of simulated data

We can use HMMs to generate data:

Algorithm (Simulator)
    Start in state 0.
    While we have not reentered state 0:
        Choose a new state using the transition probabilities.
        Choose a symbol using the emission probabilities and report it.

We use the fair/loaded HMM to generate a sequence of states and symbols, for example (the rolled symbols are omitted here):

    States: FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUUUFFFFFFFFFFUUUUUUUUUUUUUFFFF
    States: FFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUFFUUUUUUUUUUUUUUFFFFFFFFF
    States: FFUUUUUUUU
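A direct Python sketch of the simulator (our own; it uses the TRANS and EMIT dictionaries from the fair/loaded sketch above and caps the walk at max_len steps so that it always terminates):

    import random

    def simulate(trans, emit, max_len=100):
        # Start in state 0 and walk until state 0 is re-entered,
        # emitting one symbol per visited state.
        state, symbols, states = "0", [], []
        for _ in range(max_len):
            nxt = random.choices(list(trans[state]),
                                 weights=list(trans[state].values()))[0]
            if nxt == "0":
                break
            state = nxt
            states.append(state)
            symbols.append(random.choices(list(emit[state]),
                                          weights=list(emit[state].values()))[0])
        return "".join(symbols), "".join(states)

    rolls, dice = simulate(TRANS, EMIT)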

How probable is a given sequence of data? And if we can observe only the symbols, can we reconstruct the corresponding states?

Determining the probability, given the states and symbols

Definition (Path): a path π = (π_1, π_2, ..., π_L) is a sequence of states in the model M.

Suppose we are given a sequence of symbols x = (x_1, ..., x_L) and a path π = (π_1, ..., π_L) through M. The joint probability is:

    P(x, π) = a_{0 π_1} prod_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}},

with π_{L+1} = 0. Unfortunately, we usually do not know the path through the model.

Decoding a sequence of symbols

Problem: we have observed a sequence x of symbols and would like to decode it. Example: the sequence of symbols CGCG has a number of explanations within the CpG-model, e.g.: (C+, G+, C+, G+), (C-, G-, C-, G-) and (C-, G+, C-, G+). A path through the HMM determines which parts of the sequence x are classified as CpG-islands; such a classification of the observed symbols is called a decoding.

The most probable path

To solve the decoding problem, we want to determine the path π* that maximizes the probability of having generated the sequence x of symbols, that is:

    π* = argmax_π P(x, π).

This most probable path π* can be computed recursively.

Definition (Viterbi variable): given a prefix (x_1, x_2, ..., x_i), the Viterbi variable v_k(i) denotes the probability that the most probable path is in state k when it generates symbol x_i at position i. Then:

    v_l(i+1) = e_l(x_{i+1}) max_{k in Q} ( v_k(i) a_{kl} ),

with v_0(0) = 1 initially. (Exercise: we have argmax_π P(x, π) = argmax_π P(π | x).)

Dynamic programming matrix: one row per state (A+, C+, G+, T+, A-, C-, G-, T-), one column per position x_0, x_1, x_2, ..., x_L.

The Viterbi algorithm

Algorithm (Viterbi algorithm)
    Input: HMM M = (S, Q, A, e) and symbol sequence x
    Output: most probable path π*
    Initialization (i = 0): v_0(0) = 1, v_k(0) = 0 for k != 0.
    For all i = 1 ... L, l in Q:
        v_l(i) = e_l(x_i) max_{k in Q} ( v_k(i-1) a_{kl} )
        ptr_i(l) = argmax_{k in Q} ( v_k(i-1) a_{kl} )
    Termination:
        P(x, π*) = max_{k in Q} ( v_k(L) a_{k0} )
        π*_L = argmax_{k in Q} ( v_k(L) a_{k0} )
    Traceback:
        For all i = L ... 1: π*_{i-1} = ptr_i(π*_i)

Implementation hint: instead of multiplying many small values, add their logarithms! (Exercise: run-time complexity.)
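A log-space Python sketch of the Viterbi algorithm above (our own; it follows the implementation hint and adds logarithms instead of multiplying probabilities):

    from math import log, inf

    def viterbi(x, states, trans, emit):
        def lg(p):                       # log with log(0) = -infinity
            return log(p) if p > 0 else -inf
        # initialization: v_k(1) = a_0k * e_k(x_1), in log space
        v = [{k: lg(trans["0"][k]) + lg(emit[k][x[0]]) for k in states}]
        ptr = []
        for c in x[1:]:
            row, back = {}, {}
            for l in states:
                best = max(states, key=lambda k: v[-1][k] + lg(trans[k][l]))
                row[l] = v[-1][best] + lg(trans[best][l]) + lg(emit[l][c])
                back[l] = best
            v.append(row)
            ptr.append(back)
        # termination: most probable last state, including the a_k0 transition
        last = max(states, key=lambda k: v[-1][k] + lg(trans[k]["0"]))
        path = [last]
        for back in reversed(ptr):       # traceback
            path.append(back[path[-1]])
        return "".join(reversed(path))

    # e.g. viterbi(rolls, STATES, TRANS, EMIT) decodes the simulated casino rolls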

Example for Viterbi

Suppose we are given the sequence CGCG and the HMM for CpG-islands. The Viterbi variables v_k(i) then form a table with one row per state and one column per position of CGCG (numerical entries omitted).

Viterbi-decoding of the casino example

We used the fair/loaded HMM to first generate a sequence of symbols and then used the Viterbi algorithm to decode the sequence; result (the rolled symbols are omitted here):

    States : FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUUUFFFFFFFFFFUUUUUUUUUUUUUFFFF
    Viterbi: FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUFFFFFFFFFFFFUUUUUUUUUUUUUFFFF

    States : FFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUFFUUUUUUUUUUUUUUFFFFFFFFF
    Viterbi: FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

    States : FFUUUUUUUU
    Viterbi: FFFFFFUUUU

Three main problems for HMMs

Let M be an HMM and x a sequence of symbols.

(Q1) For x, determine the most probable sequence of states through M: Viterbi algorithm.
(Q2) Determine the probability that M generated x, P(x) = P(x | M): forward algorithm.
(Q3) Given x and perhaps some additional sequences of symbols, how do we train the parameters of M? Baum-Welch algorithm.

Computing P(x | M)

Suppose we are given an HMM M and a sequence of symbols x. The probability that x was generated by M is given by:

    P(x | M) = sum_π P(x, π | M),

summing over all possible state sequences π through M. (Exercise: how fast does the number of paths increase as a function of the length?)

Forward algorithm

The value of P(x | M) can be computed efficiently using the forward algorithm, which is obtained from the Viterbi algorithm by replacing the max by a sum. More precisely, we define the forward variable:

    f_k(i) = P(x_1 ... x_i, π_i = k),

which equals the probability that the model reports the prefix sequence (x_1, ..., x_i) and is in state π_i = k at position i. We obtain the recursion:

    f_l(i+1) = e_l(x_{i+1}) sum_{k in Q} f_k(i) a_{kl}.

(Figure: f_l(i+1) is computed from the values f_k(i) of all predecessor states k.)

Algorithm (Forward algorithm)
    Input: HMM M = (S, Q, A, e) and sequence of symbols x
    Output: probability P(x | M)
    Initialization (i = 0): f_0(0) = 1, f_k(0) = 0 for k != 0.
    For all i = 1 ... L, l in Q: f_l(i) = e_l(x_i) sum_{k in Q} f_k(i-1) a_{kl}
    Result: P(x | M) = sum_{k in Q} f_k(L) a_{k0}

Implementation hint: logarithms cannot be employed here easily, but there are so-called scaling methods. This solves main problem (Q2)!

Backward algorithm

The backward variable contains the probability to start in state π_i = k and then to generate the suffix sequence (x_{i+1}, ..., x_L):

    b_k(i) = P(x_{i+1} ... x_L | π_i = k).

Algorithm (Backward algorithm)
    Input: HMM M = (S, Q, A, e) and sequence of symbols x
    Output: probability P(x | M)
    Initialization (i = L): b_k(L) = a_{k0} for all k.
    For all i = L-1 ... 1, k in Q: b_k(i) = sum_{l in Q} a_{kl} e_l(x_{i+1}) b_l(i+1)
    Result: P(x | M) = sum_{l in Q} a_{0l} e_l(x_1) b_l(1)
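Both recursions translate directly into Python. This sketch (our own) works with raw probabilities and no scaling, so it is only suitable for short sequences:

    def forward(x, states, trans, emit):
        # f_k(1) = a_0k * e_k(x_1); then the recursion from above
        f = [{k: trans["0"][k] * emit[k][x[0]] for k in states}]
        for c in x[1:]:
            f.append({l: emit[l][c] * sum(f[-1][k] * trans[k][l] for k in states)
                      for l in states})
        px = sum(f[-1][k] * trans[k]["0"] for k in states)   # P(x | M)
        return f, px

    def backward(x, states, trans, emit):
        b = [None] * len(x)
        b[-1] = {k: trans[k]["0"] for k in states}           # b_k(L) = a_k0
        for i in range(len(x) - 2, -1, -1):
            b[i] = {k: sum(trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l]
                           for l in states) for k in states}
        px = sum(trans["0"][l] * emit[l][x[0]] * b[0][l] for l in states)
        return b, px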

(Figure: b_k(i) is computed from the values b_l(i+1) of all successor states l.)

Summary of the three variables

Viterbi  v_k(i): probability with which the most probable state path generates the sequence of symbols (x_1, x_2, ..., x_i) and the system is in state k at time i.
Forward  f_k(i): probability that the prefix sequence of symbols x_1, ..., x_i is generated and the system is in state k at time i.
Backward b_k(i): probability that the system starts in state k at time i and then generates the sequence of symbols x_{i+1}, ..., x_L.

Posterior probabilities

Suppose we are given an HMM M and a sequence of symbols x. Let P(π_i = k | x) be the probability that symbol x_i was reported in state π_i = k. We call this the posterior probability, as it is computed after observing the sequence x. We have:

    P(π_i = k | x) = P(π_i = k, x) / P(x) = f_k(i) b_k(i) / P(x),

as P(g, h) = P(g | h) P(h) and by the definition of the forward and backward variables.

Decoding with posterior probabilities

There are alternatives to the Viterbi-decoding that are useful, e.g., when many other paths exist that have a probability similar to that of π*. We define a sequence of states π̂ thus:

    π̂_i = argmax_{k in Q} P(π_i = k | x);

in other words, at every position we choose the most probable state for that position. This decoding is useful if we are interested in the state at a specific position i and not in the whole sequence of states. Warning: if the transition matrix forbids some transitions (i.e., a_{kl} = 0), then this decoding may produce a sequence that is not a valid path, because its probability is 0!
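Posterior decoding then only combines the two tables. A sketch (our own, reusing the forward and backward functions from the sketch above):

    def posterior_decode(x, states, trans, emit):
        # pi-hat_i = argmax_k f_k(i) * b_k(i) / P(x); P(x) is the same for
        # every state, so it can be dropped from the argmax
        f, _ = forward(x, states, trans, emit)
        b, _ = backward(x, states, trans, emit)
        return "".join(max(states, key=lambda k: f[i][k] * b[i][k])
                       for i in range(len(x)))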

Training the parameters

How does one generate an HMM?

First step: determine its topology, i.e. the number of states and how they are connected via transitions of non-zero probability. The topology is usually designed by hand.

Second step: set the parameters, i.e. the transition probabilities a_{kl} and the emission probabilities e_k(b).

We will now discuss the second step. Given a set of example sequences, our goal is to train the parameters of the HMM using the example sequences, i.e. to set the parameters in such a way that the probability with which the HMM generates the given example sequences is maximized.

Training when the states are known

Let M = (S, Q, A, e) be an HMM. Suppose we are given a list of sequences of symbols x^1, x^2, ..., x^n and a list of corresponding paths π^1, π^2, ..., π^n. (E.g., DNA sequences with annotated CpG-islands.) We want to choose the parameters (A, e) of the HMM M optimally, such that:

    P(x^1, ..., x^n, π^1, ..., π^n | M = (S, Q, A, e))
        = max_{(A', e')} P(x^1, ..., x^n, π^1, ..., π^n | M' = (S, Q, A', e')).

In other words, we want to determine the so-called Maximum Likelihood Estimator (ML-estimator) for (A, e).

ML-estimation for (A, e)

(Recall: if we consider P(D | M) as a function of D, then we call this a probability; as a function of M, we use the word likelihood.) ML-estimation:

    (A, e)_ML = argmax_{(A', e')} P(x^1, ..., x^n, π^1, ..., π^n | M = (S, Q, A', e')).

To compute A and e from labeled training data, we first determine the following numbers:

    â_{kl}:  number of observed transitions from state k to l,
    ê_k(b):  number of observed emissions of b in state k.

We then set A and e as follows:

    a_{kl} = â_{kl} / sum_{q in Q} â_{kq}   and   e_k(b) = ê_k(b) / sum_{s in S} ê_k(s).   (*)

Training the fair/loaded HMM

Suppose we are given example data x and π (the rolled symbols x are omitted here):

    States π: F F F F F F F U U U U F F F
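A Python sketch of this counting estimator (our own; the argument r anticipates the pseudocounts introduced below — with r = 0 it is the plain ML-estimator, which fails for states that never occur in the training data):

    def estimate(pairs, states, symbols, r=0.0):
        # pairs: list of (symbol sequence x, state path pi) of equal length
        all_states = ["0"] + states
        A = {k: {l: r for l in all_states} for k in all_states}
        E = {k: {c: r for c in symbols} for k in states}
        for x, pi in pairs:
            path = ["0"] + list(pi) + ["0"]       # begin/end state 0
            for k, l in zip(path, path[1:]):
                A[k][l] += 1                      # count transitions
            for c, k in zip(x, pi):
                E[k][c] += 1                      # count emissions
        trans = {k: {l: A[k][l] / sum(A[k].values()) for l in all_states}
                 for k in all_states}
        emit = {k: {c: E[k][c] / sum(E[k].values()) for c in symbols}
                for k in states}
        return trans, emit

    # e.g. estimate([("26634415", "FFUUFFFF")], ["F", "U"], "123456", r=1.0)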

Counting the state transitions â_{kl} and the emissions ê_k(b) in this example data and normalizing as in (*) yields the estimated transition probabilities a_{kl} and emission probabilities e_k(b) (tables omitted).

Pseudocounts

One problem in training is overfitting. For example, if some possible transition k -> l is never seen in the example data, then we will set a_{kl} = 0 and the transition is then forbidden. Also, if a given state k is never seen in the example data, then a_{kl} is undefined for all l. To solve this problem, we introduce pseudocounts r_{kl} and r_k(b), and define:

    â_{kl} = (number of transitions from k to l in the example data) + r_{kl},
    ê_k(b) = (number of emissions of b in k in the example data) + r_k(b).

Small pseudocounts reflect little pre-knowledge, large ones reflect more pre-knowledge.

Parameter training when the states are unknown

In practice, one usually has access only to the sequences of symbols and not to the state paths. Suppose we are given sequences of symbols x^1, x^2, ..., x^n for which we do NOT know the corresponding state paths π^1, ..., π^n. The problem of choosing the parameters (A, e) of an HMM M optimally, so that

    P(x^1, ..., x^n | M = (S, Q, A, e)) = max_{(A', e')} P(x^1, ..., x^n | M' = (S, Q, A', e'))

holds, is known to be NP-hard.

Definition (Log-likelihood score): we define the log-likelihood score of the model M as:

    l(x^1, ..., x^n) = log P(x^1, ..., x^n | (A, e)) = sum_{j=1}^{n} log P(x^j | (A, e)).

(Here we assume that the sequences of symbols are independent and therefore P(x^1, ..., x^n) = P(x^1) ... P(x^n) holds.) The goal is to determine parameters (A, e) that maximize this score.

Baum-Welch algorithm

(In the lecture we didn't actually do this, but rather looked at Viterbi training.) Let M = (S, Q, A, e) be an HMM and assume we are given training sequences x^1, x^2, ..., x^n. The parameters (A, e) are iteratively improved as follows:

- Based on x^1, ..., x^n and the current value of (A, e), we estimate expectation values ā_{kl} and ē_l(b) for â_{kl} and ê_l(b).
- We then compute (A, e) from ā and ē using equation (*).
- This is repeated until the log-likelihood score cannot be improved.

(This is a special case of the so-called expectation maximization (EM) technique.)

Algorithm (Baum-Welch algorithm)
    Input: HMM M = (S, Q, A, e), training data x^1, x^2, ..., x^n
    Output: HMM M' = (S, Q, A', e') with an improved score
    Initialization: randomly assign A and e.
    repeat
        for each sequence x^j do
            Compute f_k(i) for x^j with the forward algorithm.
            Compute b_k(i) for x^j with the backward algorithm.
        for each pair of states k, l do
            Compute ā_{kl} = sum_j (1 / P(x^j)) sum_i f^j_k(i) a_{kl} e_l(x^j_{i+1}) b^j_l(i+1)
        for each state k and symbol b do
            Compute ē_k(b) = sum_j (1 / P(x^j)) sum_{i : x^j_i = b} f^j_k(i) b^j_k(i)
        Set the new model parameters (A, e) from ā and ē using (*).
        Compute the new log-likelihood l(x^1, ..., x^n | (A, e)).
    until the log-likelihood does not improve or a maximum number of iterations is reached.

Why do we use the above expression to compute the expectation for â_{kl}?

    ā_{kl} = sum_{j=1}^{n} (1 / P(x^j)) sum_{i=1}^{L_j} f^j_k(i) a_{kl} e_l(x^j_{i+1}) b^j_l(i+1)

For a single sequence x and a single position i, the expected number of transitions from π_i = k to π_{i+1} = l is given by:

    P(π_i = k, π_{i+1} = l | x, (A, e)) = f_k(i) a_{kl} e_l(x_{i+1}) b_l(i+1) / P(x).

This follows from P(π_i = k | x) = P(π_i = k, x) / P(x) = f_k(i) b_k(i) / P(x).

Convergence: one can prove that the log-likelihood score converges to a local maximum under the Baum-Welch algorithm. However, this does not imply that the parameters converge! Local maxima can be avoided by considering many different starting points. Additionally, standard optimization approaches can also be applied to solve the optimization problem.
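The heart of the algorithm is the expectation step. A Python sketch of one such step (our own; it reuses the forward and backward sketches from above and, lacking scaling, is only suitable for short training sequences):

    def expected_counts(xs, states, trans, emit):
        # expectation values a-bar_kl and e-bar_k(b) over all sequences
        A = {(k, l): 0.0 for k in states for l in states}
        E = {(k, c): 0.0 for k in states for c in set("".join(xs))}
        for x in xs:
            f, px = forward(x, states, trans, emit)
            b, _ = backward(x, states, trans, emit)
            for i in range(len(x) - 1):
                for k in states:
                    for l in states:   # expected transitions k -> l at i
                        A[k, l] += (f[i][k] * trans[k][l]
                                    * emit[l][x[i + 1]] * b[i + 1][l]) / px
            for i, c in enumerate(x):
                for k in states:       # expected emissions of c in state k
                    E[k, c] += f[i][k] * b[i][k] / px
        return A, E

    # Normalizing A and E as in (*) gives the re-estimated (trans, emit);
    # iterating until sum_j log P(x^j) stops improving is Baum-Welch.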

7.4 Protein families

Suppose we are given the following related sequences; how can this family be characterized?

(Figure: alignment of seven globin sequences — GLB1_GLYDI, HBB_HUMAN, HBA_HUMAN, MYG_PHYCA, GLB5_PETMA, GLB3_CHITP and GLB2_LUPLU — annotated with the positions of the α-helices.)

Some ideas for characterizing a family:

- An exemplary sequence.
- A consensus sequence.
- A regular expression (Prosite), e.g. ...[FI]-[N]-x(1,2)-N-[I]-[P]-[K]-[H]..., derived from segments such as ...FN--NIPKH... (GLB2_LUPLU) and ...IN... (GLB1_GLYDI).
- An HMM?

Simple HMM

(Figure: a short excerpt of the alignment — e.g. ...FN--NIPKH... in GLB2_LUPLU and ...IN... in GLB1_GLYDI — with the match columns marked by *.)

We first consider a simple HMM that is equivalent to a PSSM (Position-Specific Score Matrix): a linear chain of match states, one per match column, in which the listed amino acids (F/I, K, S, H, N, I, P, K, H and S in this example) have a higher emission probability.

Insert-states

We introduce so-called insert states, which emit symbols according to their background probabilities. (Figure: the match-state chain from above, extended by insert states between Begin and End.) This allows us to model segments of sequence that lie outside of conserved domains.

Delete-states

We introduce so-called delete states, which are silent and do not emit any symbols. (Figure: the match-state chain from above, extended by delete states.) This allows us to model the absence of individual domains.

Topology of a profile-HMM

The result is a so-called profile HMM. (Figure: Begin -> ... -> End, built from match states, insert states and delete states.)

Design of a profile-HMM

Suppose we are given a multiple alignment of a family of sequences. First we must decide which positions are to be modeled as match states and which as insert states. Rule of thumb: columns with more than 50% gaps should be modeled as insert states. We then determine the transition and emission probabilities simply by counting the observed transitions A_{kl} and emissions E_k(b):

    a_{kl} = A_{kl} / sum_{l'} A_{kl'}   and   e_k(b) = E_k(b) / sum_{b'} E_k(b').

Obviously, it may happen that certain transitions or emissions do not appear in the training data; we therefore use the Laplace rule and add 1 to each count.
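A Python sketch of this design step (our own; the toy alignment is made up): match columns are chosen by the 50%-gap rule of thumb, and the match-state emission probabilities are estimated with the Laplace rule:

    def match_columns(alignment):
        # keep columns in which at most half of the rows carry a gap
        ncols = len(alignment[0])
        return [j for j in range(ncols)
                if sum(row[j] == "-" for row in alignment) <= len(alignment) / 2]

    def match_emissions(alignment, columns, alphabet):
        emissions = []
        for j in columns:
            counts = {a: 1 for a in alphabet}      # Laplace rule: add 1
            for row in alignment:
                if row[j] != "-":
                    counts[row[j]] += 1
            total = sum(counts.values())
            emissions.append({a: counts[a] / total for a in alphabet})
        return emissions

    aln = ["VG-AH", "VN-AH", "V--GH"]              # toy alignment
    cols = match_columns(aln)                      # -> [0, 1, 3, 4]
    probs = match_emissions(aln, cols, "AGHNV")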
