
Grundlagen der Bioinformatik, SS 08, D. Huson, June 16, 2008

8 Markov chains and Hidden Markov Models

We will discuss:
- Markov chains
- Hidden Markov Models (HMMs)
- Profile HMMs

This chapter is based on: R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge, 1998.

8.1 CpG-islands

Example: finding CpG-islands in the human genome.

Double-stranded DNA:

[Figure: a stretch of double-stranded DNA written in ...pCpGpApTp... notation, highlighting the CpG dinucleotides; the exact sequence is not recoverable here.]

The C in a CpG-pair is often modified by methylation (that is, an H-atom is replaced by a CH3-group). There is a relatively high chance that the methyl-C will mutate to a T. Hence, CpG-pairs are under-represented in the human genome.

Upstream of a gene, the methylation process is suppressed in short regions of the genome of length 100-5000. These areas are called CpG-islands and they are characterized by the fact that we see more CpG-pairs in them than elsewhere.

CpG-islands are useful marks for genes in organisms whose genomes contain 5-methyl-cytosine.

Definition 8.1.1 (classical definition of CpG-islands) A DNA sequence of length 200 with a C+G content of 50% and a ratio of observed-to-expected number of CpG's that is above 0.6. (Gardiner-Garden & Frommer, 1987)

According to a recent study, human chromosomes 21 and 22 contain about 1100 CpG-islands and about 750 genes. (Comprehensive analysis of CpG islands in human chromosomes 21 and 22, D. Takai & P.A. Jones, PNAS, March 19, 2002)

We will address the following two main questions concerning CpG-islands:

Main questions:
1. Given a short segment of genomic sequence, how do we decide whether this segment comes from a CpG-island or not?
2. Given a long segment of genomic sequence, how do we find all contained CpG-islands?
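As a small illustration of Definition 8.1.1, the following Python sketch checks a window of DNA against the classical criterion. The observed-to-expected ratio is computed with the usual Gardiner-Garden & Frommer style formula count(CpG) * L / (count(C) * count(G)); that formula is not spelled out above, so treat it as an assumption of the sketch.

def is_cpg_island(window):
    """Check a DNA string against the classical CpG-island criterion:
    length >= 200, C+G content >= 50%, observed/expected CpG ratio > 0.6."""
    L = len(window)
    if L < 200:
        return False
    w = window.upper()
    c, g = w.count("C"), w.count("G")
    cpg = w.count("CG")                                # observed CpG dinucleotides
    gc_content = (c + g) / L
    obs_exp = cpg * L / (c * g) if c * g > 0 else 0.0  # assumed Gardiner-Garden style ratio
    return gc_content >= 0.5 and obs_exp > 0.6

print(is_cpg_island("CG" * 100))   # a window of 200 alternating C,G clearly qualifies: True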

8.2 Markov chains

Our goal is to set up a probabilistic model for CpG-islands. Because pairs of consecutive nucleotides are important in this context, we need a model in which the probability of one symbol depends on the probability of its predecessor. This leads us to a Markov chain.

Example:

[Figure: a Markov chain over the four nucleotides. Circles = states, with names A, C, G and T. Arrows = possible transitions, each labeled with a transition probability a_{st} = P(x_i = t | x_{i-1} = s).]

Definition 8.2.1 (Markov chain) A (time-homogeneous) Markov chain (of order 1) is a system (S, A) consisting of a finite set of states S = {s_1, s_2, ..., s_n} and a transition matrix A = {a_{st}} with \sum_{t \in S} a_{st} = 1 for all s \in S, that determines the probability of the transition s -> t as follows:

    P(x_{i+1} = t | x_i = s) = a_{st}.

(At any time i the chain is in a specific state x_i, and at the tick of a clock the chain changes to state x_{i+1} according to the given transition probabilities.)

Example: Weather in Tübingen, daily at midday. Possible states are rain, sun, clouds or tornado. Transition probabilities:

        R    S    C    T
    R  .5   .1   .4    0
    S  .2   .6   .2    0
    C  .3   .3   .4    0
    T  .5   .0   .1   .4

Weather: ...rrrrrrccsssssscscscccrrcrcssss...

8.2.1 Probability of a sequence of states

Given a sequence of states x_1, x_2, x_3, ..., x_L, what is the probability that a Markov chain will step through precisely this sequence of states?

    P(x) = P(x_L, x_{L-1}, ..., x_1)
         = P(x_L | x_{L-1}, ..., x_1) P(x_{L-1} | x_{L-2}, ..., x_1) ... P(x_1)    (by repeated application of P(X, Y) = P(X | Y) P(Y))
         = P(x_L | x_{L-1}) P(x_{L-1} | x_{L-2}) ... P(x_2 | x_1) P(x_1)
         = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i},

because P(x_i | x_{i-1}, ..., x_1) = P(x_i | x_{i-1}) = a_{x_{i-1} x_i}, the Markov chain property!
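A minimal Python sketch of the computation in Section 8.2.1, using the Tübingen weather chain from above; the uniform initial distribution P(x_1) is an assumption of the sketch, since none is given in the example.

A_weather = {   # transition matrix of the weather example (R = rain, S = sun, C = clouds, T = tornado)
    "R": {"R": 0.5, "S": 0.1, "C": 0.4, "T": 0.0},
    "S": {"R": 0.2, "S": 0.6, "C": 0.2, "T": 0.0},
    "C": {"R": 0.3, "S": 0.3, "C": 0.4, "T": 0.0},
    "T": {"R": 0.5, "S": 0.0, "C": 0.1, "T": 0.4},
}

def chain_probability(states, A, initial):
    """P(x) = P(x_1) * prod_{i=2..L} a_{x_{i-1} x_i} (the Markov chain property)."""
    p = initial[states[0]]
    for s, t in zip(states, states[1:]):
        p *= A[s][t]
    return p

uniform = {s: 0.25 for s in "RSCT"}                      # assumed initial distribution
print(chain_probability("SSSRRC", A_weather, uniform))   # probability of sun,sun,sun,rain,rain,clouds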

8.2.2 Modeling the begin and end states

In the previous discussion we overlooked the fact that a Markov chain starts in some state x_1, with initial probability P(x_1).

We add a begin state to the model that is labeled b. We will always assume that x_0 = b holds. Then:

    P(x_1 = s) = a_{bs} = P(s),

where P(s) denotes the background probability of symbol s.

Similarly, we explicitly model the end of the sequence of states using an end state e. Thus, the probability that the chain ends after state x_L = t is a_{te}.

8.2.3 Extension of the model

Example:

[Figure: the Markov chain over A, C, G, T extended by a begin state b and an end state e.]

# Markov chain that generates CpG islands
# (Source: DEKM98, p 50)
# Number of states: 6
# State labels: A C G T * +
# Transition matrix:
0.1795 0.2735 0.4255 0.1195 0 0.002
0.1705 0.3665 0.2735 0.1875 0 0.002
0.1605 0.3385 0.3745 0.1245 0 0.002
0.0785 0.3545 0.3835 0.1815 0 0.002
0.2495 0.2495 0.2495 0.2495 0 0.002
0.0000 0.0000 0.0000 0.0000 0 1.000

8.2.4 Determining the transition matrix

The transition matrix A+ is obtained empirically ("trained") by counting transitions that occur in a training set of known CpG-islands. This is done as follows:

    a+_{st} = c+_{st} / \sum_{t'} c+_{st'},

where c+_{st} is the number of positions in a training set of CpG-islands at which state s is followed by state t. We obtain A- empirically in a similar way, using a training set of known non-CpG-islands.

8.2.5 Two examples of Markov chains

# Markov chain for CpG islands
# (Source: DEKM98, p 50)
# Number of states: 6
# State labels: A C G T * +
# Transition matrix:
.1795 .2735 .4255 .1195 0 .002
.1705 .3665 .2735 .1875 0 .002
.1605 .3385 .3745 .1245 0 .002
.0785 .3545 .3835 .1815 0 .002
.2495 .2495 .2495 .2495 0 .002
.0000 .0000 .0000 .0000 0 1.000

# Markov chain for non-CpG islands
# (Source: DEKM98, p 50)
# Number of states: 6
# State labels: A C G T * +
# Transition matrix:
.2995 .2045 .2845 .2095 0 .002
.3215 .2975 .0775 .3015 0 .002
.2475 .2455 .2975 .2075 0 .002
.1765 .2385 .2915 .2915 0 .002
.2495 .2495 .2495 .2495 0 .002
.0000 .0000 .0000 .0000 0 1.00

8.2.6 Answering question 1

Given a short sequence x = (x_1, x_2, ..., x_L). Does it come from a CpG-island (Model+)?

    P(x | Model+) = \prod_{i=0}^{L} a+_{x_i x_{i+1}}, with x_0 = b and x_{L+1} = e.

We use the following score:

    S(x) = log [ P(x | Model+) / P(x | Model-) ] = \sum_{i=0}^{L} log ( a+_{x_i x_{i+1}} / a-_{x_i x_{i+1}} ).

The higher this score is, the higher the probability that x comes from a CpG-island.

8.2.7 Types of questions that a Markov chain can answer

Example: weather in Tübingen, daily at midday. Possible states are rain, sun or clouds. Transition probabilities:

        R    S    C
    R  .5   .1   .4
    S  .2   .6   .2
    C  .3   .3   .4

Types of questions that the model can answer:
- If it is sunny today, what is the probability that the sun will shine for the next seven days?
- How large is the probability that it will rain for a month?

8.3 Hidden Markov Models (HMM)

Motivation: Question 2, how to detect CpG-islands inside a long sequence?

One possible approach is a window technique: a window of width w is moved along the sequence and the score is plotted. Problem: it is hard to determine the boundaries of CpG-islands, and which window size w should one choose? ...

We will consider an alternative approach: merge the two Markov chains Model+ and Model- to obtain a so-called Hidden Markov Model.
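The counting of Section 8.2.4 and the score of Section 8.2.6 fit into a few lines of Python. A rough sketch, assuming hypothetical toy training sequences (real training sets would be annotated genomic regions), ignoring the begin/end transitions, and using a pseudocount so that unseen transitions do not get probability 0 (cf. Section 8.3.22):

from math import log

ALPHABET = "ACGT"

def train_chain(sequences, pseudocount=1):
    """Estimate a_st = c_st / sum_t' c_st' from dinucleotide counts (Section 8.2.4);
    the pseudocount keeps unseen transitions away from probability 0."""
    counts = {s: {t: pseudocount for t in ALPHABET} for s in ALPHABET}
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    return {s: {t: counts[s][t] / sum(counts[s].values()) for t in ALPHABET} for s in ALPHABET}

def log_odds_score(x, A_plus, A_minus):
    """S(x) = sum_i log( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} ); positive values favour a CpG-island."""
    return sum(log(A_plus[s][t] / A_minus[s][t]) for s, t in zip(x, x[1:]))

# Hypothetical toy training data:
island     = ["CGCGGCGCGCCGCGTACGCGCG", "GCGCGCGCAGCGCGCG"]
non_island = ["ATTATAATGCATTTAAATATTA", "TTTAGCATATTAATTA"]
A_plus, A_minus = train_chain(island), train_chain(non_island)
print(log_odds_score("GCGCGC", A_plus, A_minus))   # clearly positive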

Definition 8.3.1 (HMM) An HMM is a system M = (S, Q, A, e) consisting of
- an alphabet S,
- a set of states Q,
- a matrix A = {a_{kl}} of transition probabilities a_{kl} for k, l in Q, and
- an emission probability e_k(b) for every k in Q and b in S.

8.3.1 Example

The topology of an HMM for CpG-islands:

[Figure: the eight states A+, C+, G+, T+ and A-, C-, G-, T-, with transitions between the two sets.]

(Additionally, we have all transitions between states in either of the two sets that carry over from the two Markov chains Model+ and Model-.)

8.3.2 HMM for CpG-islands

# Number of states: 9
# Names of states (begin/end, A+, C+, G+, T+, A-, C-, G- and T-):
0 A C G T a c g t
# Number of symbols: 4
# Names of symbols:
a c g t
# Transition matrix, probability to change from +island to -island (and vice versa) is 10^-4:
0.0000000000 0.0725193101 0.1637630296 0.1788242720 0.0754545682 0.1322050994 0.1267006624 0.1226380452 0.1278950131
0.0010000000 0.1762237762 0.2682517483 0.4170629371 0.1174825175 0.0035964036 0.0054745255 0.0085104895 0.0023976024
0.0010000000 0.1672435130 0.3599201597 0.2679840319 0.1838722555 0.0034131737 0.0073453094 0.0054690619 0.0037524950
0.0010000000 0.1576223776 0.3318881119 0.3671328671 0.1223776224 0.0032167832 0.0067732268 0.0074915085 0.0024975025
0.0010000000 0.0773426573 0.3475514486 0.3759440559 0.1781818182 0.0015784216 0.0070929071 0.0076723277 0.0036363636
0.0010000000 0.0002997003 0.0002047952 0.0002837163 0.0002097902 0.2994005994 0.2045904096 0.2844305694 0.2095804196
0.0010000000 0.0003216783 0.0002977023 0.0000769231 0.0003016983 0.3213566434 0.2974045954 0.0778441558 0.3013966034
0.0010000000 0.0002477522 0.0002457542 0.0002977023 0.0002077922 0.2475044955 0.2455084915 0.2974035964 0.2075844156
0.0010000000 0.0001768232 0.0002387612 0.0002917083 0.0002917083 0.1766463536 0.2385224775 0.2914165834 0.2914155844
# Emission probabilities:
0 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1

From now on we use 0 for the begin and end state.

8.3.3 Example: fair/loaded dice

The casino uses two dice, fair and loaded:

    Fair:    1: 1/6   2: 1/6   3: 1/6   4: 1/6   5: 1/6   6: 1/6
    Loaded:  1: 1/10  2: 1/10  3: 1/10  4: 1/10  5: 1/10  6: 1/2

    Fair -> Fair: 0.95      Fair -> Loaded: 0.05
    Loaded -> Loaded: 0.9   Loaded -> Fair: 0.1

The casino guest only observes the numbers rolled:

    6 4 3 2 3 4 6 5 1 2 3 4 5 6 6 6 3 2 1 2 6 3 4 2 1 6 6 ...

Which die was used remains hidden:

    F F F F F F F F F F F F U U U U U F F F F F F F F F F ...

8.3.4 Generation of simulated data

We can use HMMs to generate data:

Algorithm 8.3.2 (Simulator)
Start in state 0.
While we have not reentered state 0:
    Choose a new state using the transition probabilities.
    Choose a symbol using the emission probabilities and report it.

We use the fair/loaded HMM to generate a sequence of states and symbols:

Symbols: 24335642611341666666526562426612134635535566462666636664253
States : FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUUUFFFFFFFFFFUUUUUUUUUUUUUFFFF

Symbols: 35246363252521655615445653663666511145445656621261532516435
States : FFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUFFUUUUUUUUUUUUUUFFFFFFFFF

Symbols: 5146526666
States : FFUUUUUUUU

How probable is a given sequence of data? If we can observe only the symbols, can we reconstruct the corresponding states?

8.3.5 Determining the probability, given the states and symbols

Definition 8.3.3 (Path) A path π = (π_1, π_2, ..., π_L) is a sequence of states in the model M.

Given a sequence of symbols x = (x_1, ..., x_L) and a path π = (π_1, ..., π_L) through M, the joint probability is:

    P(x, π) = a_{0 π_1} \prod_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}},

with π_{L+1} = 0. Unfortunately, we usually do not know the path through the model.

8.3.6 Decoding a sequence of symbols

Problem: We have observed a sequence x of symbols and would like to decode the sequence.

Example: The sequence of symbols C G C G has a number of explanations within the CpG-model, e.g.: (C+, G+, C+, G+), (C-, G-, C-, G-) and (C-, G+, C-, G+).

A path through the HMM determines which parts of the sequence x are classified as CpG-islands; such a classification of the observed symbols is called a decoding.
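The simulator of Algorithm 8.3.2 and the joint probability of Definition 8.3.3 for the fair/loaded casino HMM, as a Python sketch. The start probabilities out of state 0 are an assumption (they are not given above), and the transition back into state 0 is not modelled, so the simulator stops after a fixed number of symbols instead.

import random

STATES = ["F", "U"]                            # fair and unfair (loaded) die
A = {                                          # transition probabilities; "0" is the begin/end state
    "0": {"F": 0.5, "U": 0.5},                 # assumed start probabilities
    "F": {"F": 0.95, "U": 0.05},
    "U": {"F": 0.1,  "U": 0.9},
}
E = {                                          # emission probabilities
    "F": {s: 1 / 6 for s in "123456"},
    "U": {**{s: 1 / 10 for s in "12345"}, "6": 1 / 2},
}

def simulate(length):
    """Algorithm 8.3.2 (Simulator); stopped after a fixed number of symbols because
    this sketch has no transition back into state 0."""
    symbols, path, k = [], [], "0"
    for _ in range(length):
        k = random.choices(STATES, weights=[A[k][l] for l in STATES])[0]
        path.append(k)
        symbols.append(random.choices("123456", weights=[E[k][b] for b in "123456"])[0])
    return "".join(symbols), "".join(path)

def joint_probability(x, pi):
    """P(x, pi) = a_{0 pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i pi_{i+1}};
    the final transition a_{pi_L 0} is omitted here."""
    p = A["0"][pi[0]]
    for i in range(len(x)):
        p *= E[pi[i]][x[i]]
        if i + 1 < len(x):
            p *= A[pi[i]][pi[i + 1]]
    return p

x, pi = simulate(20)
print(x, pi, joint_probability(x, pi))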

8.3.7 The most probable path

To solve the decoding problem, we want to determine the path π* that maximizes the probability of having generated the sequence x of symbols, that is:

    π* = arg max_π P(x, π).

This most probable path π* can be computed recursively.

Definition 8.3.4 (Viterbi variable) Given a prefix (x_1, x_2, ..., x_i), the Viterbi variable v_k(i) denotes the probability that the most probable path is in state k when it generates symbol x_i at position i.

Then:

    v_l(i+1) = e_l(x_{i+1}) max_{k in Q} (v_k(i) a_{kl}),

with v_0(0) = 1, initially.

(Exercise: We have arg max_π P(x, π) = arg max_π P(π | x).)

Dynamic programming matrix:

[Figure: the dynamic programming matrix for the CpG HMM. Columns correspond to the positions x_0, x_1, x_2, ..., x_{i+1}, rows to the states 0, A+, C+, G+, T+, A-, C-, G-, T-; each cell v_l(i+1) is computed from the cells of the previous column.]

8.3.8 The Viterbi algorithm

Algorithm 8.3.5 (Viterbi algorithm)
Input: HMM M = (S, Q, A, e) and symbol sequence x
Output: Most probable path π*.

Initialization (i = 0): v_0(0) = 1, v_k(0) = 0 for k != 0.
For all i = 1 ... L, l in Q:
    v_l(i) = e_l(x_i) max_{k in Q} (v_k(i-1) a_{kl})
    ptr_i(l) = arg max_{k in Q} (v_k(i-1) a_{kl})
Termination:
    P(x, π*) = max_{k in Q} (v_k(L) a_{k0})
    π*_L = arg max_{k in Q} (v_k(L) a_{k0})
Traceback:
    For all i = L ... 1: π*_{i-1} = ptr_i(π*_i)

Implementation hint: instead of multiplying many small values, add their logarithms!

(Exercise: Run-time complexity)
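A Python sketch of Algorithm 8.3.5, following the implementation hint and working with log-probabilities. It expects an HMM in the dictionary form of the casino sketch above (A including the begin state "0", and E); the end transition a_{k0} is treated as equal for all states k, since that model has no explicit end transitions.

from math import log, inf

def viterbi(x, states, A, E, begin="0"):
    """Most probable path by dynamic programming (Algorithm 8.3.5), in log space.
    Termination uses max_k v_k(L), i.e. the end transition a_{k0} is dropped."""
    def lg(p):
        return log(p) if p > 0 else -inf

    L = len(x)
    v = [{k: lg(A[begin][k]) + lg(E[k][x[0]]) for k in states}]   # i = 1
    ptr = []
    for i in range(1, L):
        v.append({})
        ptr.append({})
        for l in states:
            best = max(states, key=lambda k: v[i - 1][k] + lg(A[k][l]))
            ptr[i - 1][l] = best
            v[i][l] = lg(E[l][x[i]]) + v[i - 1][best] + lg(A[best][l])
    path = [max(states, key=lambda k: v[L - 1][k])]               # termination
    for i in range(L - 2, -1, -1):                                # traceback
        path.append(ptr[i][path[-1]])
    path.reverse()
    return path, v[L - 1][path[-1]]

# For example, with A and E from the casino sketch:
#   path, logp = viterbi("5146526666", ["F", "U"], A, E)
#   print("".join(path))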

8.3.9 Example for Viterbi

Given the sequence C G C G and the HMM for CpG-islands. Here is a table of possible values for v:

    State \ prefix   ε        C        G        C        G
    0                1        0        0        0        0
    A+               0        0        0        0        0
    C+               0        0.13     0        0.012    0
    G+               0        0        0.034    0        0.0032
    T+               0        0        0        0        0
    A-               0        0        0        0        0
    C-               0        0.13     0        0.0026   0
    G-               0        0        0.010    0        0.00021
    T-               0        0        0        0        0

8.3.10 Viterbi decoding of the casino example

We used the fair/loaded HMM to first generate a sequence of symbols and then used the Viterbi algorithm to decode the sequence. Result:

Symbols: 24335642611341666666526562426612134635535566462666636664253
States : FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUUUFFFFFFFFFFUUUUUUUUUUUUUFFFF
Viterbi: FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUFFFFFFFFFFFFUUUUUUUUUUUUUFFFF

Symbols: 35246363252521655615445653663666511145445656621261532516435
States : FFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUFFUUUUUUUUUUUUUUFFFFFFFFF
Viterbi: FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Symbols: 5146526666
States : FFUUUUUUUU
Viterbi: FFFFFFUUUU

8.3.11 Three main problems for HMMs

Let M be an HMM and x a sequence of symbols.

(Q1) For x, determine the most probable sequence of states through M: Viterbi algorithm.
(Q2) Determine the probability that M generated x, P(x) = P(x | M): forward algorithm.
(Q3) Given x and perhaps some additional sequences of symbols, how do we train the parameters of M? E.g., Viterbi training or the Baum-Welch algorithm.

8.3.12 Computing P(x | M)

Given an HMM M and a sequence of symbols x. The probability that x was generated by M is given by:

    P(x | M) = \sum_π P(x, π | M),

summing over all possible state sequences π through M.

(Exercise: how fast does the number of paths increase as a function of the length?)
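For very short sequences, the sum over all paths in Section 8.3.12 can be evaluated directly. The sketch below does exactly that (reusing casino-style A and E dictionaries, end transition again omitted) and makes it obvious why a dynamic-programming solution is needed.

from itertools import product

def brute_force_probability(x, states, A, E, begin="0"):
    """P(x | M) = sum over all |Q|^L state paths pi of P(x, pi);
    only feasible for tiny L, which motivates the forward algorithm."""
    total = 0.0
    for pi in product(states, repeat=len(x)):
        p = A[begin][pi[0]]
        for i, k in enumerate(pi):
            p *= E[k][x[i]]
            if i + 1 < len(pi):
                p *= A[k][pi[i + 1]]
        total += p
    return total

# e.g. brute_force_probability("6266", ["F", "U"], A, E) sums over 2^4 = 16 paths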

8.3.13 Forward algorithm

The value of P(x | M) can be efficiently computed using the forward algorithm. This algorithm is obtained from the Viterbi algorithm by replacing the max by a sum. More precisely, we define the forward variable:

    f_k(i) = P(x_1 ... x_i, π_i = k),

which equals the probability that the model reports the prefix sequence (x_1, ..., x_i) and is in state π_i = k. We obtain the recursion:

    f_l(i+1) = e_l(x_{i+1}) \sum_{k in Q} f_k(i) a_{kl}.

[Figure: the forward variable f_l(i+1) is obtained from f_p(i), f_q(i), f_r(i), f_s(i), f_t(i) via the transition probabilities a_{kl}.]

Algorithm 8.3.6 (Forward algorithm)
Input: HMM M = (S, Q, A, e) and sequence of symbols x
Output: probability P(x | M)

Initialization (i = 0): f_0(0) = 1, f_k(0) = 0 for k != 0.
For all i = 1 ... L, l in Q:
    f_l(i) = e_l(x_i) \sum_{k in Q} (f_k(i-1) a_{kl})
Result: P(x | M) = \sum_{k in Q} (f_k(L) a_{k0})

Implementation hint: Logarithms cannot be employed here as easily, but there are scaling methods...

This solves main problem Q2!

8.3.14 Backward algorithm

The backward variable contains the probability to start in state π_i = k and then to generate the suffix sequence (x_{i+1}, ..., x_L):

    b_k(i) = P(x_{i+1} ... x_L | π_i = k).

Algorithm 8.3.7 (Backward algorithm)
Input: HMM M = (S, Q, A, e) and sequence of symbols x
Output: probability P(x | M)

Initialization (i = L): b_k(L) = a_{k0} for all k.
For all i = L-1 ... 1, k in Q:
    b_k(i) = \sum_{l in Q} a_{kl} e_l(x_{i+1}) b_l(i+1)
Result: P(x | M) = \sum_{l in Q} (a_{0l} e_l(x_1) b_l(1))

[Figure: the backward variable b_k(i) is obtained from b_p(i+1), b_q(i+1), b_r(i+1), b_s(i+1), b_t(i+1) via the transition probabilities a_{kl}.]
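A Python sketch of Algorithms 8.3.6 and 8.3.7, together with the posterior probabilities P(π_i = k | x) = f_k(i) b_k(i) / P(x) discussed in the next section. No scaling is applied, so it is only suitable for short sequences; as in the earlier sketches the end state is not modelled, i.e. b_k(L) = 1 is used instead of a_{k0}.

def forward(x, states, A, E, begin="0"):
    """Algorithm 8.3.6: f_l(i) = e_l(x_i) * sum_k f_k(i-1) a_kl; returns (f, P(x | M))."""
    f = [{l: A[begin][l] * E[l][x[0]] for l in states}]
    for i in range(1, len(x)):
        f.append({l: E[l][x[i]] * sum(f[i - 1][k] * A[k][l] for k in states) for l in states})
    return f, sum(f[-1][k] for k in states)

def backward(x, states, A, E):
    """Algorithm 8.3.7: b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1), with b_k(L) = 1 here."""
    L = len(x)
    b = [{k: 1.0 for k in states} for _ in range(L)]
    for i in range(L - 2, -1, -1):
        for k in states:
            b[i][k] = sum(A[k][l] * E[l][x[i + 1]] * b[i + 1][l] for l in states)
    return b

def posterior(x, states, A, E, begin="0"):
    """P(pi_i = k | x) = f_k(i) b_k(i) / P(x)  (Section 8.3.16)."""
    f, px = forward(x, states, A, E, begin)
    b = backward(x, states, A, E)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]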

8.3.15 Summary of the three variables

Viterbi   v_k(i)   probability with which the most probable state path generates the sequence of symbols (x_1, x_2, ..., x_i) and the system is in state k at time i.
Forward   f_k(i)   probability that the prefix sequence of symbols x_1, ..., x_i is generated and the system is in state k at time i.
Backward  b_k(i)   probability that the system starts in state k at time i and then generates the sequence of symbols x_{i+1}, ..., x_L.

8.3.16 Posterior probabilities

Given an HMM M and a sequence of symbols x. Let P(π_i = k | x) be the probability that symbol x_i was reported in state π_i = k. We call this the posterior probability, as it is computed after observing the sequence x. We have:

    P(π_i = k | x) = P(π_i = k, x) / P(x) = f_k(i) b_k(i) / P(x),

as P(g, h) = P(g | h) P(h) and by the definition of the forward and backward variables.

8.3.17 Decoding with posterior probabilities

There are alternatives to the Viterbi decoding that are useful e.g. when many other paths exist that have a probability similar to that of π*. We define a sequence of states π̂ thus:

    π̂_i = arg max_{k in Q} P(π_i = k | x),

in other words, at every position we choose the most probable state for that position.

This decoding may be useful if we are interested in the state at a specific position i and not in the whole sequence of states.

Warning: if the transition matrix forbids some transitions (i.e., a_{kl} = 0), then this decoding may produce a sequence that is not a valid path, because its probability is 0!

8.3.18 Training the parameters

How does one generate an HMM?

First step: Determine its topology, i.e. the number of states and how they are connected via transitions of non-zero probability.

Second step: Set the parameters, i.e. the transition probabilities a_{kl} and the emission probabilities e_k(b).

We consider the second step. Given a set of example sequences, our goal is to train the parameters of the HMM using the example sequences, i.e. to set the parameters in such a way that the probability with which the HMM generates the given example sequences is maximized.

8.3.19 Training when the states are known

Let M = (S, Q, A, e) be an HMM. Suppose we are given a list of sequences of symbols x^1, x^2, ..., x^n and a list of corresponding paths π^1, π^2, ..., π^n. (E.g., DNA sequences with annotated CpG-islands.)

We want to choose the parameters (A, e) of the HMM M optimally, such that:

    P(x^1, ..., x^n, π^1, ..., π^n | M = (S, Q, A, e)) = max_{(A', e')} P(x^1, ..., x^n, π^1, ..., π^n | M = (S, Q, A', e')).

In other words, we want to determine the Maximum Likelihood Estimator (ML-estimator) for (A, e).

8.3.20 ML estimation for (A, e)

(Recall: If we consider P(D | M) as a function of the data D, then we call this a probability; as a function of the model M, we use the word likelihood.)

ML-estimation:

    (A, e)^ML = arg max_{(A', e')} P(x^1, ..., x^n, π^1, ..., π^n | M = (S, Q, A', e')).

Computation:

    A_kl:   number of transitions from state k to l
    E_k(b): number of emissions of b in state k

We set the parameters for M:

    ā_kl = A_kl / \sum_{q in Q} A_kq    and    ē_k(b) = E_k(b) / \sum_{s in S} E_k(s).

8.3.21 Training the fair/loaded HMM

Given example data x and π:

    Symbols x:  1 2 5 3 4 6 1 2 6 6 3 2 1 5
    States  π:  F F F F F F F U U U U F F F

State transitions: count A_kl for k, l in {0, F, U}. Emissions: count E_k(b) for k in {0, F, U} and b in {1, ..., 6}. From these counts, compute ā_kl and ē_k(b) as above. (Exercise: fill in the tables.)
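A Python sketch of the ML estimation in Sections 8.3.19-8.3.21, trained from (symbol sequence, state path) pairs such as the fair/loaded example just given. The begin/end state 0 is left out for brevity, and the optional pseudocount anticipates the next section.

from collections import Counter

def train_supervised(pairs, states, alphabet, pseudocount=0):
    """Estimate a_kl = A_kl / sum_q A_kq and e_k(b) = E_k(b) / sum_s E_k(s)
    by counting transitions and emissions in labelled training data."""
    A = {k: Counter({l: pseudocount for l in states}) for k in states}
    E = {k: Counter({b: pseudocount for b in alphabet}) for k in states}
    for x, pi in pairs:
        for k, l in zip(pi, pi[1:]):
            A[k][l] += 1
        for k, b in zip(pi, x):
            E[k][b] += 1
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e

# The training data from Section 8.3.21:
pairs = [("12534612663215", "FFFFFFFUUUUFFF")]
a, e = train_supervised(pairs, "FU", "123456", pseudocount=1)
print(a["F"]["U"], e["U"]["6"])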

8.3.22 Pseudocounts

One problem in training is overfitting. For example, if some possible transition k -> l is never seen in the example data, then we will set ā_kl = 0 and the transition is then forbidden. If a given state k is never seen in the example data, then ā_kl is undefined for all l.

To solve this problem, we introduce pseudocounts r_kl and r_k(b) and define:

    A_kl   = number of transitions from k to l in the example data + r_kl
    E_k(b) = number of emissions of b in k in the example data + r_k(b).

Small pseudocounts reflect little pre-knowledge, large ones reflect more pre-knowledge.

8.3.23 Parameter training when the states are unknown

In practice, one usually has access only to the sequences of symbols and not to the state paths.

Given sequences of symbols x^1, x^2, ..., x^n, for which we do NOT know the corresponding state paths π^1, ..., π^n. The problem of choosing the parameters (A, e) of the HMM M optimally, so that

    P(x^1, ..., x^n | M = (S, Q, A, e)) = max_{(A', e')} P(x^1, ..., x^n | M = (S, Q, A', e'))

holds, is NP-hard.

8.3.24 Log-likelihood

Given sequences of symbols x^1, x^2, ..., x^n. Let M = (S, Q, A, e) be an HMM. We define the score of the model M as:

    l(x^1, ..., x^n) = log P(x^1, ..., x^n | (A, e)) = \sum_{j=1}^{n} log P(x^j | (A, e)).

(Here we assume that the sequences of symbols are independent and therefore P(x^1, ..., x^n) = P(x^1) ... P(x^n) holds.)

The goal is to choose the parameters (A, e) so that we maximize this score, called the log-likelihood.

8.3.25 Baum-Welch algorithm

Let M = (S, Q, A, e) be an HMM and assume we are given training sequences x^1, x^2, ..., x^n. The parameters (A, e) are to be iteratively improved as follows:

- Based on x^1, ..., x^n and the current value of (A, e), we estimate expected values Ā_kl and ē_l(b) for A_kl and e_l(b).
- We then set (A, e) <- (Ā, ē).
- This is repeated until some halting criterion is met.

This is a special case of the so-called EM-technique (EM = expectation maximization).

Algorithm 8.3.8 (Baum-Welch algorithm)
Input: HMM M = (S, Q, A, e), training data x^1, x^2, ..., x^n, pseudocounts r_kl and r_k(b), if desired
Output: HMM M = (S, Q, A, e) with an improved score

Initialization: Set A and e randomly.
Recursion:
    For every sequence x^j:
        Compute f_k(i) for x^j with the forward algorithm.
        Compute b_k(i) for x^j with the backward algorithm.
        Add both values to the sums:
            A_kl   = \sum_j (1 / P(x^j)) \sum_i f^j_k(i) a_kl e_l(x^j_{i+1}) b^j_l(i+1)
            E_k(b) = \sum_j (1 / P(x^j)) \sum_{i : x^j_i = b} f^j_k(i) b^j_k(i)
End:
    Determine the new model parameters (A, e) <- (Ā, ē).
    Determine the new log-likelihood l(x^1, ..., x^n | (A, e)).
    Stop when the log-likelihood did not improve or a maximum number of iterations was reached.

8.3.26 Remarks

Given x. For the expected number of transitions from π_i = k to π_{i+1} = l we have:

    P(π_i = k, π_{i+1} = l | x, (A, e)) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x).

Hence:

    A_kl = \sum_{j=1}^{n} (1 / P(x^j)) \sum_{i=1}^{L_j} f^j_k(i) a_kl e_l(x^j_{i+1}) b^j_l(i+1).

(Recall: P(π_i = k | x) = P(π_i = k, x) / P(x) = f_k(i) b_k(i) / P(x).)

8.3.27 Convergence

Remark: One can prove that the log-likelihood score converges to a local maximum using the Baum-Welch algorithm. However, this does not imply that the parameters converge!

Local maxima can be avoided by considering many different starting points. Additionally, standard optimization approaches can also be applied to solve the optimization problem.
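One iteration of the Baum-Welch recursion as a Python sketch, reusing the forward and backward functions sketched after Section 8.3.14. Pseudocounts and the stopping criterion are left out, the begin-state row of A is kept fixed, and as before no explicit end state is modelled.

def baum_welch_step(sequences, states, alphabet, A, E, begin="0"):
    """One EM step of Algorithm 8.3.8: accumulate expected transition counts A_kl and
    expected emission counts E_k(b) from forward/backward values, then re-normalize."""
    A_exp = {k: {l: 0.0 for l in states} for k in states}
    E_exp = {k: {s: 0.0 for s in alphabet} for k in states}
    for x in sequences:
        f, px = forward(x, states, A, E, begin)
        bwd = backward(x, states, A, E)
        for i in range(len(x)):
            for k in states:
                E_exp[k][x[i]] += f[i][k] * bwd[i][k] / px
                if i + 1 < len(x):
                    for l in states:
                        A_exp[k][l] += f[i][k] * A[k][l] * E[l][x[i + 1]] * bwd[i + 1][l] / px
    new_A = {k: {l: A_exp[k][l] / sum(A_exp[k].values()) for l in states} for k in states}
    new_A[begin] = dict(A[begin])          # start probabilities left unchanged in this sketch
    new_E = {k: {s: E_exp[k][s] / sum(E_exp[k].values()) for s in alphabet} for k in states}
    return new_A, new_E

Iterating baum_welch_step on the observed symbol sequences and monitoring the log-likelihood \sum_j log P(x^j) reproduces the loop of Algorithm 8.3.8.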

8.4 Protein Families

Suppose we are given the following related sequences. How can this family be characterized?

[Figure: alignment of seven globin sequences GLB1_GLYDI, HBB_HUMAN, HBA_HUMAN, MYG_PHYCA, GLB5_PETMA, GLB3_CHITP and LGB2_LUPLU, annotated with the α-helices A-H; the full alignment is not reproduced here.]

Some ideas for characterizing a family:

- Exemplary sequence
- Consensus sequence
- Regular expression (Prosite), e.g. for the region around LGB2_LUPLU ...FNA--NIPKH... and GLB1_GLYDI ...IAGADNGAGV...: [FI]-[N]-x(1,2)-N-[I]-[P]-[K]-[H]
- HMM?

8.4.1 Simple HMM

How do we represent this?

    HBA_HUMAN   ...VGA--HAGEY...
    HBB_HUMAN   ...V----NVDEV...
    MYG_PHYCA   ...VEA--DVAGH...
    GLB3_CHITP  ...VKG------D...
    GLB5_PETMA  ...VYS--TYETS...
    LGB2_LUPLU  ...FNA--NIPKH...
    GLB1_GLYDI  ...IAGADNGAGV...
    "Matches":     ***  *****

We first consider a simple HMM that is equivalent to a PSSM (position-specific score matrix):

[Figure: a chain of match states, one per match column; at each state the listed amino acids have a higher emission probability.]
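A Python sketch of the PSSM-like simple HMM: per match column, emission probabilities are estimated from the small alignment block above. Laplace (+1) smoothing over the 20 amino acids is assumed here, in the spirit of Section 8.4.5.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

block = [            # the ten alignment columns shown above; '-' marks a gap
    "VGA--HAGEY",    # HBA_HUMAN
    "V----NVDEV",    # HBB_HUMAN
    "VEA--DVAGH",    # MYG_PHYCA
    "VKG------D",    # GLB3_CHITP
    "VYS--TYETS",    # GLB5_PETMA
    "FNA--NIPKH",    # LGB2_LUPLU
    "IAGADNGAGV",    # GLB1_GLYDI
]

def column_emissions(column, pseudocount=1):
    """Emission probabilities e(b) for one match column, with Laplace smoothing."""
    counts = {b: pseudocount for b in AMINO_ACIDS}
    for residue in column:
        if residue != "-":
            counts[residue] += 1
    total = sum(counts.values())
    return {b: counts[b] / total for b in AMINO_ACIDS}

match_columns = [0, 1, 2, 5, 6, 7, 8, 9]   # the columns marked '*' above (at most 50% gaps)
pssm = [column_emissions("".join(seq[j] for seq in block)) for j in match_columns]
print(pssm[0]["V"])                        # V has the highest probability in the first match column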

8.4.2 Insert states

We introduce so-called insert states that emit symbols based on their background probabilities.

[Figure: the match-state chain from above, with an insert state added between consecutive match states, and Begin and End states.]

This allows us to model segments of sequence that lie outside of conserved domains.

8.4.3 Delete states

We introduce so-called delete states that are silent and do not emit any symbols.

[Figure: the match-state chain from above, with silent delete states that allow individual match states to be skipped, and Begin and End states.]

This allows us to model the absence of individual domains.

8.4.4 Topology of a profile HMM

The result is a so-called profile HMM:

[Figure: the complete profile HMM topology with Begin and End states, match states, insert states and delete states.]

8.4.5 Design of a profile HMM

Suppose we are given a multiple alignment of a family of sequences. First we must decide which positions are to be modeled as match states and which positions are to be modeled as insert states. Rule of thumb: columns with more than 50% gaps should be modeled as insert states.

We determine the transition and emission probabilities simply by counting the observed transitions A_kl and emissions E_k(b):

    a_kl = A_kl / \sum_{l'} A_{kl'}    and    e_k(b) = E_k(b) / \sum_{b'} E_k(b').

Obviously, it may happen that certain transitions or emissions do not appear in the training data, and thus we use the Laplace rule and add 1 to each count.
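A rough sketch of the construction in Section 8.4.5: choose match columns by the 50%-gap rule, translate every aligned sequence into a path through match (M), insert (I) and delete (D) states, and estimate transition probabilities from the resulting counts. It reuses the alignment block from the previous sketch; match-state emissions would be estimated as shown there, and the pseudocount handling is simplified.

def build_profile_hmm(alignment, pseudocount=1):
    """Decide match columns by the 50%-gap rule and estimate transition
    probabilities between match, insert and delete states by counting."""
    n_seqs, n_cols = len(alignment), len(alignment[0])
    match_cols = [j for j in range(n_cols)
                  if sum(seq[j] == "-" for seq in alignment) <= n_seqs / 2]

    def state_path(seq):
        """States M_j / D_j at match columns, I_j for residues in insert columns
        (j = index of the preceding match state)."""
        path, j = ["begin"], 0
        for col in range(n_cols):
            if col in match_cols:
                j += 1
                path.append(f"M{j}" if seq[col] != "-" else f"D{j}")
            elif seq[col] != "-":
                path.append(f"I{j}")
        path.append("end")
        return path

    transitions = {}
    for seq in alignment:
        p = state_path(seq)
        for k, l in zip(p, p[1:]):
            transitions.setdefault(k, {}).setdefault(l, 0)
            transitions[k][l] += 1
    # Normalize; the pseudocount is spread only over transitions actually observed,
    # a simplification of the Laplace rule described above.
    a = {k: {l: (c + pseudocount) / (sum(row.values()) + pseudocount * len(row))
             for l, c in row.items()}
         for k, row in transitions.items()}
    return match_cols, a

match_cols, a = build_profile_hmm(block)
print(match_cols)        # [0, 1, 2, 5, 6, 7, 8, 9]
print(a["M3"])           # transitions out of the third match state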