
Grundlagen der Bioinformatik, SS 08, D. Huson, June 16, 2008

8 Markov chains and Hidden Markov Models

We will discuss:
- Markov chains
- Hidden Markov Models (HMMs)
- Profile HMMs

This chapter is based on: R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge, 1998.

8.1 CpG-islands

Example: finding CpG-islands in the human genome.

Double-stranded DNA:

[Figure: a stretch of double-stranded DNA written in ...pCpGpApTp... notation, highlighting the CpG dinucleotides; the exact sequence is not recoverable here.]

The C in a CpG-pair is often modified by methylation (that is, an H-atom is replaced by a CH3-group). There is a relatively high chance that the methyl-C will mutate to a T. Hence, CpG-pairs are under-represented in the human genome.

Upstream of a gene, the methylation process is suppressed in short regions of the genome of length 100-5000. These areas are called CpG-islands and they are characterized by the fact that we see more CpG-pairs in them than elsewhere.

CpG-islands are useful marks for genes in organisms whose genomes contain 5-methyl-cytosine.

Definition 8.1.1 (classical definition of CpG-islands) A DNA sequence of length 200 with a C+G content of 50% and a ratio of observed-to-expected number of CpG's that is above 0.6. (Gardiner-Garden & Frommer, 1987)

According to a recent study, human chromosomes 21 and 22 contain about 1100 CpG-islands and about 750 genes. (Comprehensive analysis of CpG islands in human chromosomes 21 and 22, D. Takai & P.A. Jones, PNAS, March 19, 2002)

We will address the following two main questions concerning CpG-islands:

Main questions:
1. Given a short segment of genomic sequence, how do we decide whether this segment comes from a CpG-island or not?
2. Given a long segment of genomic sequence, how do we find all contained CpG-islands?
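As a small illustration of Definition 8.1.1, the following Python sketch checks a window of DNA against the classical criterion. The observed-to-expected ratio is computed with the usual Gardiner-Garden & Frommer style formula count(CpG) * L / (count(C) * count(G)); that formula is not spelled out above, so treat it as an assumption of the sketch.

def is_cpg_island(window):
    """Check a DNA string against the classical CpG-island criterion:
    length >= 200, C+G content >= 50%, observed/expected CpG ratio > 0.6."""
    L = len(window)
    if L < 200:
        return False
    w = window.upper()
    c, g = w.count("C"), w.count("G")
    cpg = w.count("CG")                                # observed CpG dinucleotides
    gc_content = (c + g) / L
    obs_exp = cpg * L / (c * g) if c * g > 0 else 0.0  # assumed Gardiner-Garden style ratio
    return gc_content >= 0.5 and obs_exp > 0.6

print(is_cpg_island("CG" * 100))   # a window of 200 alternating C,G clearly qualifies: True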

8.2 Markov chains

Our goal is to set up a probabilistic model for CpG-islands. Because pairs of consecutive nucleotides are important in this context, we need a model in which the probability of one symbol depends on the probability of its predecessor. This leads us to a Markov chain.

Example:

[Figure: a Markov chain over the four nucleotides. Circles = states, with names A, C, G and T. Arrows = possible transitions, each labeled with a transition probability a_{st} = P(x_i = t | x_{i-1} = s).]

Definition 8.2.1 (Markov chain) A (time-homogeneous) Markov chain (of order 1) is a system (S, A) consisting of a finite set of states S = {s_1, s_2, ..., s_n} and a transition matrix A = {a_{st}} with \sum_{t \in S} a_{st} = 1 for all s \in S, that determines the probability of the transition s -> t as follows:

    P(x_{i+1} = t | x_i = s) = a_{st}.

(At any time i the chain is in a specific state x_i, and at the tick of a clock the chain changes to state x_{i+1} according to the given transition probabilities.)

Example: Weather in Tübingen, daily at midday. Possible states are rain, sun, clouds or tornado. Transition probabilities:

        R    S    C    T
    R  .5   .1   .4    0
    S  .2   .6   .2    0
    C  .3   .3   .4    0
    T  .5   .0   .1   .4

Weather: ...rrrrrrccsssssscscscccrrcrcssss...

8.2.1 Probability of a sequence of states

Given a sequence of states x_1, x_2, x_3, ..., x_L, what is the probability that a Markov chain will step through precisely this sequence of states?

    P(x) = P(x_L, x_{L-1}, ..., x_1)
         = P(x_L | x_{L-1}, ..., x_1) P(x_{L-1} | x_{L-2}, ..., x_1) ... P(x_1)    (by repeated application of P(X, Y) = P(X | Y) P(Y))
         = P(x_L | x_{L-1}) P(x_{L-1} | x_{L-2}) ... P(x_2 | x_1) P(x_1)
         = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i},

because P(x_i | x_{i-1}, ..., x_1) = P(x_i | x_{i-1}) = a_{x_{i-1} x_i}, the Markov chain property!
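A minimal Python sketch of the computation in Section 8.2.1, using the Tübingen weather chain from above; the uniform initial distribution P(x_1) is an assumption of the sketch, since none is given in the example.

A_weather = {   # transition matrix of the weather example (R = rain, S = sun, C = clouds, T = tornado)
    "R": {"R": 0.5, "S": 0.1, "C": 0.4, "T": 0.0},
    "S": {"R": 0.2, "S": 0.6, "C": 0.2, "T": 0.0},
    "C": {"R": 0.3, "S": 0.3, "C": 0.4, "T": 0.0},
    "T": {"R": 0.5, "S": 0.0, "C": 0.1, "T": 0.4},
}

def chain_probability(states, A, initial):
    """P(x) = P(x_1) * prod_{i=2..L} a_{x_{i-1} x_i} (the Markov chain property)."""
    p = initial[states[0]]
    for s, t in zip(states, states[1:]):
        p *= A[s][t]
    return p

uniform = {s: 0.25 for s in "RSCT"}                      # assumed initial distribution
print(chain_probability("SSSRRC", A_weather, uniform))   # probability of sun,sun,sun,rain,rain,clouds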

8.2.2 Modeling the begin and end states

In the previous discussion we overlooked the fact that a Markov chain starts in some state x_1, with initial probability P(x_1).

We add a begin state to the model that is labeled b. We will always assume that x_0 = b holds. Then:

    P(x_1 = s) = a_{bs} = P(s),

where P(s) denotes the background probability of symbol s.

Similarly, we explicitly model the end of the sequence of states using an end state e. Thus, the probability that the chain ends after state x_L = t is a_{te}.

8.2.3 Extension of the model

Example:

[Figure: the Markov chain over A, C, G, T extended by a begin state b and an end state e.]

# Markov chain that generates CpG islands
# (Source: DEKM98, p 50)
# Number of states: 6
# State labels: A C G T * +
# Transition matrix:
0.1795 0.2735 0.4255 0.1195 0 0.002
0.1705 0.3665 0.2735 0.1875 0 0.002
0.1605 0.3385 0.3745 0.1245 0 0.002
0.0785 0.3545 0.3835 0.1815 0 0.002
0.2495 0.2495 0.2495 0.2495 0 0.002
0.0000 0.0000 0.0000 0.0000 0 1.000

8.2.4 Determining the transition matrix

The transition matrix A+ is obtained empirically ("trained") by counting transitions that occur in a training set of known CpG-islands. This is done as follows:

    a+_{st} = c+_{st} / \sum_{t'} c+_{st'},

where c+_{st} is the number of positions in a training set of CpG-islands at which state s is followed by state t. We obtain A- empirically in a similar way, using a training set of known non-CpG-islands.

8.2.5 Two examples of Markov chains

# Markov chain for CpG islands
# (Source: DEKM98, p 50)
# Number of states: 6
# State labels: A C G T * +
# Transition matrix:
.1795 .2735 .4255 .1195 0 .002
.1705 .3665 .2735 .1875 0 .002
.1605 .3385 .3745 .1245 0 .002
.0785 .3545 .3835 .1815 0 .002
.2495 .2495 .2495 .2495 0 .002
.0000 .0000 .0000 .0000 0 1.000

# Markov chain for non-CpG islands
# (Source: DEKM98, p 50)
# Number of states: 6
# State labels: A C G T * +
# Transition matrix:
.2995 .2045 .2845 .2095 0 .002
.3215 .2975 .0775 .3015 0 .002
.2475 .2455 .2975 .2075 0 .002
.1765 .2385 .2915 .2915 0 .002
.2495 .2495 .2495 .2495 0 .002
.0000 .0000 .0000 .0000 0 1.00

8.2.6 Answering question 1

Given a short sequence x = (x_1, x_2, ..., x_L). Does it come from a CpG-island (Model+)?

    P(x | Model+) = \prod_{i=0}^{L} a+_{x_i x_{i+1}}, with x_0 = b and x_{L+1} = e.

We use the following score:

    S(x) = log [ P(x | Model+) / P(x | Model-) ] = \sum_{i=0}^{L} log ( a+_{x_i x_{i+1}} / a-_{x_i x_{i+1}} ).

The higher this score is, the higher the probability that x comes from a CpG-island.

8.2.7 Types of questions that a Markov chain can answer

Example: weather in Tübingen, daily at midday. Possible states are rain, sun or clouds. Transition probabilities:

        R    S    C
    R  .5   .1   .4
    S  .2   .6   .2
    C  .3   .3   .4

Types of questions that the model can answer:
- If it is sunny today, what is the probability that the sun will shine for the next seven days?
- How large is the probability that it will rain for a month?

8.3 Hidden Markov Models (HMM)

Motivation: Question 2, how to detect CpG-islands inside a long sequence?

One possible approach is a window technique: a window of width w is moved along the sequence and the score is plotted. Problem: it is hard to determine the boundaries of CpG-islands, and which window size w should one choose? ...

We will consider an alternative approach: merge the two Markov chains Model+ and Model- to obtain a so-called Hidden Markov Model.
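The counting of Section 8.2.4 and the score of Section 8.2.6 fit into a few lines of Python. A rough sketch, assuming hypothetical toy training sequences (real training sets would be annotated genomic regions), ignoring the begin/end transitions, and using a pseudocount so that unseen transitions do not get probability 0 (cf. Section 8.3.22):

from math import log

ALPHABET = "ACGT"

def train_chain(sequences, pseudocount=1):
    """Estimate a_st = c_st / sum_t' c_st' from dinucleotide counts (Section 8.2.4);
    the pseudocount keeps unseen transitions away from probability 0."""
    counts = {s: {t: pseudocount for t in ALPHABET} for s in ALPHABET}
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    return {s: {t: counts[s][t] / sum(counts[s].values()) for t in ALPHABET} for s in ALPHABET}

def log_odds_score(x, A_plus, A_minus):
    """S(x) = sum_i log( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} ); positive values favour a CpG-island."""
    return sum(log(A_plus[s][t] / A_minus[s][t]) for s, t in zip(x, x[1:]))

# Hypothetical toy training data:
island     = ["CGCGGCGCGCCGCGTACGCGCG", "GCGCGCGCAGCGCGCG"]
non_island = ["ATTATAATGCATTTAAATATTA", "TTTAGCATATTAATTA"]
A_plus, A_minus = train_chain(island), train_chain(non_island)
print(log_odds_score("GCGCGC", A_plus, A_minus))   # clearly positive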

Definition 8.3.1 (HMM) An HMM is a system M = (S, Q, A, e) consisting of
- an alphabet S,
- a set of states Q,
- a matrix A = {a_{kl}} of transition probabilities a_{kl} for k, l in Q, and
- an emission probability e_k(b) for every k in Q and b in S.

8.3.1 Example

The topology of an HMM for CpG-islands:

[Figure: the eight states A+, C+, G+, T+ and A-, C-, G-, T-, with transitions between the two sets.]

(Additionally, we have all transitions between states in either of the two sets that carry over from the two Markov chains Model+ and Model-.)

8.3.2 HMM for CpG-islands

# Number of states: 9
# Names of states (begin/end, A+, C+, G+, T+, A-, C-, G- and T-):
0 A C G T a c g t
# Number of symbols: 4
# Names of symbols:
a c g t
# Transition matrix, probability to change from +island to -island (and vice versa) is 10^-4:
0.0000000000 0.0725193101 0.1637630296 0.1788242720 0.0754545682 0.1322050994 0.1267006624 0.1226380452 0.1278950131
0.0010000000 0.1762237762 0.2682517483 0.4170629371 0.1174825175 0.0035964036 0.0054745255 0.0085104895 0.0023976024
0.0010000000 0.1672435130 0.3599201597 0.2679840319 0.1838722555 0.0034131737 0.0073453094 0.0054690619 0.0037524950
0.0010000000 0.1576223776 0.3318881119 0.3671328671 0.1223776224 0.0032167832 0.0067732268 0.0074915085 0.0024975025
0.0010000000 0.0773426573 0.3475514486 0.3759440559 0.1781818182 0.0015784216 0.0070929071 0.0076723277 0.0036363636
0.0010000000 0.0002997003 0.0002047952 0.0002837163 0.0002097902 0.2994005994 0.2045904096 0.2844305694 0.2095804196
0.0010000000 0.0003216783 0.0002977023 0.0000769231 0.0003016983 0.3213566434 0.2974045954 0.0778441558 0.3013966034
0.0010000000 0.0002477522 0.0002457542 0.0002977023 0.0002077922 0.2475044955 0.2455084915 0.2974035964 0.2075844156
0.0010000000 0.0001768232 0.0002387612 0.0002917083 0.0002917083 0.1766463536 0.2385224775 0.2914165834 0.2914155844
# Emission probabilities:
0 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1

From now on we use 0 for the begin and end state.

8.3.3 Example: fair/loaded dice

The casino uses two dice, fair and loaded:

    Fair:    1: 1/6   2: 1/6   3: 1/6   4: 1/6   5: 1/6   6: 1/6
    Loaded:  1: 1/10  2: 1/10  3: 1/10  4: 1/10  5: 1/10  6: 1/2

    Fair -> Fair: 0.95      Fair -> Loaded: 0.05
    Loaded -> Loaded: 0.9   Loaded -> Fair: 0.1

The casino guest only observes the numbers rolled:

    6 4 3 2 3 4 6 5 1 2 3 4 5 6 6 6 3 2 1 2 6 3 4 2 1 6 6 ...

Which die was used remains hidden:

    F F F F F F F F F F F F U U U U U F F F F F F F F F F ...

8.3.4 Generation of simulated data

We can use HMMs to generate data:

Algorithm 8.3.2 (Simulator)
Start in state 0.
While we have not reentered state 0:
    Choose a new state using the transition probabilities.
    Choose a symbol using the emission probabilities and report it.

We use the fair/loaded HMM to generate a sequence of states and symbols:

Symbols: 24335642611341666666526562426612134635535566462666636664253
States : FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUUUFFFFFFFFFFUUUUUUUUUUUUUFFFF

Symbols: 35246363252521655615445653663666511145445656621261532516435
States : FFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUFFUUUUUUUUUUUUUUFFFFFFFFF

Symbols: 5146526666
States : FFUUUUUUUU

How probable is a given sequence of data? If we can observe only the symbols, can we reconstruct the corresponding states?

8.3.5 Determining the probability, given the states and symbols

Definition 8.3.3 (Path) A path π = (π_1, π_2, ..., π_L) is a sequence of states in the model M.

Given a sequence of symbols x = (x_1, ..., x_L) and a path π = (π_1, ..., π_L) through M, the joint probability is:

    P(x, π) = a_{0 π_1} \prod_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}},

with π_{L+1} = 0. Unfortunately, we usually do not know the path through the model.

8.3.6 Decoding a sequence of symbols

Problem: We have observed a sequence x of symbols and would like to decode the sequence.

Example: The sequence of symbols C G C G has a number of explanations within the CpG-model, e.g.: (C+, G+, C+, G+), (C-, G-, C-, G-) and (C-, G+, C-, G+).

A path through the HMM determines which parts of the sequence x are classified as CpG-islands; such a classification of the observed symbols is called a decoding.
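The simulator of Algorithm 8.3.2 and the joint probability of Definition 8.3.3 for the fair/loaded casino HMM, as a Python sketch. The start probabilities out of state 0 are an assumption (they are not given above), and the transition back into state 0 is not modelled, so the simulator stops after a fixed number of symbols instead.

import random

STATES = ["F", "U"]                            # fair and unfair (loaded) die
A = {                                          # transition probabilities; "0" is the begin/end state
    "0": {"F": 0.5, "U": 0.5},                 # assumed start probabilities
    "F": {"F": 0.95, "U": 0.05},
    "U": {"F": 0.1,  "U": 0.9},
}
E = {                                          # emission probabilities
    "F": {s: 1 / 6 for s in "123456"},
    "U": {**{s: 1 / 10 for s in "12345"}, "6": 1 / 2},
}

def simulate(length):
    """Algorithm 8.3.2 (Simulator); stopped after a fixed number of symbols because
    this sketch has no transition back into state 0."""
    symbols, path, k = [], [], "0"
    for _ in range(length):
        k = random.choices(STATES, weights=[A[k][l] for l in STATES])[0]
        path.append(k)
        symbols.append(random.choices("123456", weights=[E[k][b] for b in "123456"])[0])
    return "".join(symbols), "".join(path)

def joint_probability(x, pi):
    """P(x, pi) = a_{0 pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i pi_{i+1}};
    the final transition a_{pi_L 0} is omitted here."""
    p = A["0"][pi[0]]
    for i in range(len(x)):
        p *= E[pi[i]][x[i]]
        if i + 1 < len(x):
            p *= A[pi[i]][pi[i + 1]]
    return p

x, pi = simulate(20)
print(x, pi, joint_probability(x, pi))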

8.3.7 The most probable path

To solve the decoding problem, we want to determine the path π* that maximizes the probability of having generated the sequence x of symbols, that is:

    π* = arg max_π P(x, π).

This most probable path π* can be computed recursively.

Definition 8.3.4 (Viterbi variable) Given a prefix (x_1, x_2, ..., x_i), the Viterbi variable v_k(i) denotes the probability that the most probable path is in state k when it generates symbol x_i at position i.

Then:

    v_l(i+1) = e_l(x_{i+1}) max_{k in Q} (v_k(i) a_{kl}),

with v_0(0) = 1, initially.

(Exercise: We have arg max_π P(x, π) = arg max_π P(π | x).)

Dynamic programming matrix:

[Figure: the dynamic programming matrix for the CpG HMM. Columns correspond to the positions x_0, x_1, x_2, ..., x_{i+1}, rows to the states 0, A+, C+, G+, T+, A-, C-, G-, T-; each cell v_l(i+1) is computed from the cells of the previous column.]

8.3.8 The Viterbi algorithm

Algorithm 8.3.5 (Viterbi algorithm)
Input: HMM M = (S, Q, A, e) and symbol sequence x
Output: Most probable path π*.

Initialization (i = 0): v_0(0) = 1, v_k(0) = 0 for k != 0.
For all i = 1 ... L, l in Q:
    v_l(i) = e_l(x_i) max_{k in Q} (v_k(i-1) a_{kl})
    ptr_i(l) = arg max_{k in Q} (v_k(i-1) a_{kl})
Termination:
    P(x, π*) = max_{k in Q} (v_k(L) a_{k0})
    π*_L = arg max_{k in Q} (v_k(L) a_{k0})
Traceback:
    For all i = L ... 1: π*_{i-1} = ptr_i(π*_i)

Implementation hint: instead of multiplying many small values, add their logarithms!

(Exercise: Run-time complexity)
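A Python sketch of Algorithm 8.3.5, following the implementation hint and working with log-probabilities. It expects an HMM in the dictionary form of the casino sketch above (A including the begin state "0", and E); the end transition a_{k0} is treated as equal for all states k, since that model has no explicit end transitions.

from math import log, inf

def viterbi(x, states, A, E, begin="0"):
    """Most probable path by dynamic programming (Algorithm 8.3.5), in log space.
    Termination uses max_k v_k(L), i.e. the end transition a_{k0} is dropped."""
    def lg(p):
        return log(p) if p > 0 else -inf

    L = len(x)
    v = [{k: lg(A[begin][k]) + lg(E[k][x[0]]) for k in states}]   # i = 1
    ptr = []
    for i in range(1, L):
        v.append({})
        ptr.append({})
        for l in states:
            best = max(states, key=lambda k: v[i - 1][k] + lg(A[k][l]))
            ptr[i - 1][l] = best
            v[i][l] = lg(E[l][x[i]]) + v[i - 1][best] + lg(A[best][l])
    path = [max(states, key=lambda k: v[L - 1][k])]               # termination
    for i in range(L - 2, -1, -1):                                # traceback
        path.append(ptr[i][path[-1]])
    path.reverse()
    return path, v[L - 1][path[-1]]

# For example, with A and E from the casino sketch:
#   path, logp = viterbi("5146526666", ["F", "U"], A, E)
#   print("".join(path))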

8.3.9 Example for Viterbi

Given the sequence C G C G and the HMM for CpG-islands. Here is a table of possible values for v:

    State \ prefix   ε        C        G        C        G
    0                1        0        0        0        0
    A+               0        0        0        0        0
    C+               0        0.13     0        0.012    0
    G+               0        0        0.034    0        0.0032
    T+               0        0        0        0        0
    A-               0        0        0        0        0
    C-               0        0.13     0        0.0026   0
    G-               0        0        0.010    0        0.00021
    T-               0        0        0        0        0

8.3.10 Viterbi decoding of the casino example

We used the fair/loaded HMM to first generate a sequence of symbols and then used the Viterbi algorithm to decode the sequence. Result:

Symbols: 24335642611341666666526562426612134635535566462666636664253
States : FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUUUFFFFFFFFFFUUUUUUUUUUUUUFFFF
Viterbi: FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUFFFFFFFFFFFFUUUUUUUUUUUUUFFFF

Symbols: 35246363252521655615445653663666511145445656621261532516435
States : FFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUFFUUUUUUUUUUUUUUFFFFFFFFF
Viterbi: FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Symbols: 5146526666
States : FFUUUUUUUU
Viterbi: FFFFFFUUUU

8.3.11 Three main problems for HMMs

Let M be an HMM and x a sequence of symbols.

(Q1) For x, determine the most probable sequence of states through M: Viterbi algorithm.
(Q2) Determine the probability that M generated x, P(x) = P(x | M): forward algorithm.
(Q3) Given x and perhaps some additional sequences of symbols, how do we train the parameters of M? E.g., Viterbi training or the Baum-Welch algorithm.

8.3.12 Computing P(x | M)

Given an HMM M and a sequence of symbols x. The probability that x was generated by M is given by:

    P(x | M) = \sum_π P(x, π | M),

summing over all possible state sequences π through M.

(Exercise: how fast does the number of paths increase as a function of the length?)
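For very short sequences, the sum over all paths in Section 8.3.12 can be evaluated directly. The sketch below does exactly that (reusing casino-style A and E dictionaries, end transition again omitted) and makes it obvious why a dynamic-programming solution is needed.

from itertools import product

def brute_force_probability(x, states, A, E, begin="0"):
    """P(x | M) = sum over all |Q|^L state paths pi of P(x, pi);
    only feasible for tiny L, which motivates the forward algorithm."""
    total = 0.0
    for pi in product(states, repeat=len(x)):
        p = A[begin][pi[0]]
        for i, k in enumerate(pi):
            p *= E[k][x[i]]
            if i + 1 < len(pi):
                p *= A[k][pi[i + 1]]
        total += p
    return total

# e.g. brute_force_probability("6266", ["F", "U"], A, E) sums over 2^4 = 16 paths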

8.3.13 Forward algorithm

The value of P(x | M) can be efficiently computed using the forward algorithm. This algorithm is obtained from the Viterbi algorithm by replacing the max by a sum. More precisely, we define the forward variable:

    f_k(i) = P(x_1 ... x_i, π_i = k),

which equals the probability that the model reports the prefix sequence (x_1, ..., x_i) and is in state π_i = k. We obtain the recursion:

    f_l(i+1) = e_l(x_{i+1}) \sum_{k in Q} f_k(i) a_{kl}.

[Figure: the forward variable f_l(i+1) is obtained from f_p(i), f_q(i), f_r(i), f_s(i), f_t(i) via the transition probabilities a_{kl}.]

Algorithm 8.3.6 (Forward algorithm)
Input: HMM M = (S, Q, A, e) and sequence of symbols x
Output: probability P(x | M)

Initialization (i = 0): f_0(0) = 1, f_k(0) = 0 for k != 0.
For all i = 1 ... L, l in Q:
    f_l(i) = e_l(x_i) \sum_{k in Q} (f_k(i-1) a_{kl})
Result: P(x | M) = \sum_{k in Q} (f_k(L) a_{k0})

Implementation hint: Logarithms cannot be employed here as easily, but there are scaling methods...

This solves main problem Q2!

8.3.14 Backward algorithm

The backward variable contains the probability to start in state π_i = k and then to generate the suffix sequence (x_{i+1}, ..., x_L):

    b_k(i) = P(x_{i+1} ... x_L | π_i = k).

Algorithm 8.3.7 (Backward algorithm)
Input: HMM M = (S, Q, A, e) and sequence of symbols x
Output: probability P(x | M)

Initialization (i = L): b_k(L) = a_{k0} for all k.
For all i = L-1 ... 1, k in Q:
    b_k(i) = \sum_{l in Q} a_{kl} e_l(x_{i+1}) b_l(i+1)
Result: P(x | M) = \sum_{l in Q} (a_{0l} e_l(x_1) b_l(1))

[Figure: the backward variable b_k(i) is obtained from b_p(i+1), b_q(i+1), b_r(i+1), b_s(i+1), b_t(i+1) via the transition probabilities a_{kl}.]
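A Python sketch of Algorithms 8.3.6 and 8.3.7, together with the posterior probabilities P(π_i = k | x) = f_k(i) b_k(i) / P(x) discussed in the next section. No scaling is applied, so it is only suitable for short sequences; as in the earlier sketches the end state is not modelled, i.e. b_k(L) = 1 is used instead of a_{k0}.

def forward(x, states, A, E, begin="0"):
    """Algorithm 8.3.6: f_l(i) = e_l(x_i) * sum_k f_k(i-1) a_kl; returns (f, P(x | M))."""
    f = [{l: A[begin][l] * E[l][x[0]] for l in states}]
    for i in range(1, len(x)):
        f.append({l: E[l][x[i]] * sum(f[i - 1][k] * A[k][l] for k in states) for l in states})
    return f, sum(f[-1][k] for k in states)

def backward(x, states, A, E):
    """Algorithm 8.3.7: b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1), with b_k(L) = 1 here."""
    L = len(x)
    b = [{k: 1.0 for k in states} for _ in range(L)]
    for i in range(L - 2, -1, -1):
        for k in states:
            b[i][k] = sum(A[k][l] * E[l][x[i + 1]] * b[i + 1][l] for l in states)
    return b

def posterior(x, states, A, E, begin="0"):
    """P(pi_i = k | x) = f_k(i) b_k(i) / P(x)  (Section 8.3.16)."""
    f, px = forward(x, states, A, E, begin)
    b = backward(x, states, A, E)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]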

8.3.15 Summary of the three variables

Viterbi   v_k(i)   probability with which the most probable state path generates the sequence of symbols (x_1, x_2, ..., x_i) and the system is in state k at time i.
Forward   f_k(i)   probability that the prefix sequence of symbols x_1, ..., x_i is generated and the system is in state k at time i.
Backward  b_k(i)   probability that the system starts in state k at time i and then generates the sequence of symbols x_{i+1}, ..., x_L.

8.3.16 Posterior probabilities

Given an HMM M and a sequence of symbols x. Let P(π_i = k | x) be the probability that symbol x_i was reported in state π_i = k. We call this the posterior probability, as it is computed after observing the sequence x. We have:

    P(π_i = k | x) = P(π_i = k, x) / P(x) = f_k(i) b_k(i) / P(x),

as P(g, h) = P(g | h) P(h) and by the definition of the forward and backward variables.

8.3.17 Decoding with posterior probabilities

There are alternatives to the Viterbi decoding that are useful e.g. when many other paths exist that have a probability similar to that of π*. We define a sequence of states π̂ thus:

    π̂_i = arg max_{k in Q} P(π_i = k | x),

in other words, at every position we choose the most probable state for that position.

This decoding may be useful if we are interested in the state at a specific position i and not in the whole sequence of states.

Warning: if the transition matrix forbids some transitions (i.e., a_{kl} = 0), then this decoding may produce a sequence that is not a valid path, because its probability is 0!

8.3.18 Training the parameters

How does one generate an HMM?

First step: Determine its topology, i.e. the number of states and how they are connected via transitions of non-zero probability.

Second step: Set the parameters, i.e. the transition probabilities a_{kl} and the emission probabilities e_k(b).

We consider the second step. Given a set of example sequences, our goal is to train the parameters of the HMM using the example sequences, i.e. to set the parameters in such a way that the probability with which the HMM generates the given example sequences is maximized.

8.3.19 Training when the states are known

Let M = (S, Q, A, e) be an HMM. Suppose we are given a list of sequences of symbols x^1, x^2, ..., x^n and a list of corresponding paths π^1, π^2, ..., π^n. (E.g., DNA sequences with annotated CpG-islands.)

We want to choose the parameters (A, e) of the HMM M optimally, such that:

    P(x^1, ..., x^n, π^1, ..., π^n | M = (S, Q, A, e)) = max_{(A', e')} P(x^1, ..., x^n, π^1, ..., π^n | M = (S, Q, A', e')).

In other words, we want to determine the Maximum Likelihood Estimator (ML-estimator) for (A, e).

8.3.20 ML estimation for (A, e)

(Recall: If we consider P(D | M) as a function of the data D, then we call this a probability; as a function of the model M, we use the word likelihood.)

ML-estimation:

    (A, e)^ML = arg max_{(A', e')} P(x^1, ..., x^n, π^1, ..., π^n | M = (S, Q, A', e')).

Computation:

    A_kl:   number of transitions from state k to l
    E_k(b): number of emissions of b in state k

We set the parameters for M:

    ā_kl = A_kl / \sum_{q in Q} A_kq    and    ē_k(b) = E_k(b) / \sum_{s in S} E_k(s).

8.3.21 Training the fair/loaded HMM

Given example data x and π:

    Symbols x:  1 2 5 3 4 6 1 2 6 6 3 2 1 5
    States  π:  F F F F F F F U U U U F F F

State transitions: count A_kl for k, l in {0, F, U}. Emissions: count E_k(b) for k in {0, F, U} and b in {1, ..., 6}. From these counts, compute ā_kl and ē_k(b) as above. (Exercise: fill in the tables.)
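A Python sketch of the ML estimation in Sections 8.3.19-8.3.21, trained from (symbol sequence, state path) pairs such as the fair/loaded example just given. The begin/end state 0 is left out for brevity, and the optional pseudocount anticipates the next section.

from collections import Counter

def train_supervised(pairs, states, alphabet, pseudocount=0):
    """Estimate a_kl = A_kl / sum_q A_kq and e_k(b) = E_k(b) / sum_s E_k(s)
    by counting transitions and emissions in labelled training data."""
    A = {k: Counter({l: pseudocount for l in states}) for k in states}
    E = {k: Counter({b: pseudocount for b in alphabet}) for k in states}
    for x, pi in pairs:
        for k, l in zip(pi, pi[1:]):
            A[k][l] += 1
        for k, b in zip(pi, x):
            E[k][b] += 1
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e

# The training data from Section 8.3.21:
pairs = [("12534612663215", "FFFFFFFUUUUFFF")]
a, e = train_supervised(pairs, "FU", "123456", pseudocount=1)
print(a["F"]["U"], e["U"]["6"])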

8.3.22 Pseudocounts

One problem in training is overfitting. For example, if some possible transition k -> l is never seen in the example data, then we will set ā_kl = 0 and the transition is then forbidden. If a given state k is never seen in the example data, then ā_kl is undefined for all l.

To solve this problem, we introduce pseudocounts r_kl and r_k(b) and define:

    A_kl   = number of transitions from k to l in the example data + r_kl
    E_k(b) = number of emissions of b in k in the example data + r_k(b).

Small pseudocounts reflect little pre-knowledge, large ones reflect more pre-knowledge.

8.3.23 Parameter training when the states are unknown

In practice, one usually has access only to the sequences of symbols and not to the state paths.

Given sequences of symbols x^1, x^2, ..., x^n, for which we do NOT know the corresponding state paths π^1, ..., π^n. The problem of choosing the parameters (A, e) of the HMM M optimally, so that

    P(x^1, ..., x^n | M = (S, Q, A, e)) = max_{(A', e')} P(x^1, ..., x^n | M = (S, Q, A', e'))

holds, is NP-hard.

8.3.24 Log-likelihood

Given sequences of symbols x^1, x^2, ..., x^n. Let M = (S, Q, A, e) be an HMM. We define the score of the model M as:

    l(x^1, ..., x^n) = log P(x^1, ..., x^n | (A, e)) = \sum_{j=1}^{n} log P(x^j | (A, e)).

(Here we assume that the sequences of symbols are independent and therefore P(x^1, ..., x^n) = P(x^1) ... P(x^n) holds.)

The goal is to choose the parameters (A, e) so that we maximize this score, called the log-likelihood.

8.3.25 Baum-Welch algorithm

Let M = (S, Q, A, e) be an HMM and assume we are given training sequences x^1, x^2, ..., x^n. The parameters (A, e) are to be iteratively improved as follows:

- Based on x^1, ..., x^n and the current value of (A, e), we estimate expected values Ā_kl and ē_l(b) for A_kl and e_l(b).
- We then set (A, e) <- (Ā, ē).
- This is repeated until some halting criterion is met.

This is a special case of the so-called EM-technique (EM = expectation maximization).

Algorithm 8.3.8 (Baum-Welch algorithm)
Input: HMM M = (S, Q, A, e), training data x^1, x^2, ..., x^n, pseudocounts r_kl and r_k(b), if desired
Output: HMM M = (S, Q, A, e) with an improved score

Initialization: Set A and e randomly.
Recursion:
    For every sequence x^j:
        Compute f_k(i) for x^j with the forward algorithm.
        Compute b_k(i) for x^j with the backward algorithm.
        Add both values to the sums:
            A_kl   = \sum_j (1 / P(x^j)) \sum_i f^j_k(i) a_kl e_l(x^j_{i+1}) b^j_l(i+1)
            E_k(b) = \sum_j (1 / P(x^j)) \sum_{i : x^j_i = b} f^j_k(i) b^j_k(i)
End:
    Determine the new model parameters (A, e) <- (Ā, ē).
    Determine the new log-likelihood l(x^1, ..., x^n | (A, e)).
    Stop when the log-likelihood did not improve or a maximum number of iterations was reached.

8.3.26 Remarks

Given x. For the expected number of transitions from π_i = k to π_{i+1} = l we have:

    P(π_i = k, π_{i+1} = l | x, (A, e)) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x).

Hence:

    A_kl = \sum_{j=1}^{n} (1 / P(x^j)) \sum_{i=1}^{L_j} f^j_k(i) a_kl e_l(x^j_{i+1}) b^j_l(i+1).

(Recall: P(π_i = k | x) = P(π_i = k, x) / P(x) = f_k(i) b_k(i) / P(x).)

8.3.27 Convergence

Remark: One can prove that the log-likelihood score converges to a local maximum using the Baum-Welch algorithm. However, this does not imply that the parameters converge!

Local maxima can be avoided by considering many different starting points. Additionally, standard optimization approaches can also be applied to solve the optimization problem.
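One iteration of the Baum-Welch recursion as a Python sketch, reusing the forward and backward functions sketched after Section 8.3.14. Pseudocounts and the stopping criterion are left out, the begin-state row of A is kept fixed, and as before no explicit end state is modelled.

def baum_welch_step(sequences, states, alphabet, A, E, begin="0"):
    """One EM step of Algorithm 8.3.8: accumulate expected transition counts A_kl and
    expected emission counts E_k(b) from forward/backward values, then re-normalize."""
    A_exp = {k: {l: 0.0 for l in states} for k in states}
    E_exp = {k: {s: 0.0 for s in alphabet} for k in states}
    for x in sequences:
        f, px = forward(x, states, A, E, begin)
        bwd = backward(x, states, A, E)
        for i in range(len(x)):
            for k in states:
                E_exp[k][x[i]] += f[i][k] * bwd[i][k] / px
                if i + 1 < len(x):
                    for l in states:
                        A_exp[k][l] += f[i][k] * A[k][l] * E[l][x[i + 1]] * bwd[i + 1][l] / px
    new_A = {k: {l: A_exp[k][l] / sum(A_exp[k].values()) for l in states} for k in states}
    new_A[begin] = dict(A[begin])          # start probabilities left unchanged in this sketch
    new_E = {k: {s: E_exp[k][s] / sum(E_exp[k].values()) for s in alphabet} for k in states}
    return new_A, new_E

Iterating baum_welch_step on the observed symbol sequences and monitoring the log-likelihood \sum_j log P(x^j) reproduces the loop of Algorithm 8.3.8.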

8.4 Protein Families

Suppose we are given the following related sequences. How can this family be characterized?

[Figure: alignment of seven globin sequences GLB1_GLYDI, HBB_HUMAN, HBA_HUMAN, MYG_PHYCA, GLB5_PETMA, GLB3_CHITP and LGB2_LUPLU, annotated with the α-helices A-H; the full alignment is not reproduced here.]

Some ideas for characterizing a family:

- Exemplary sequence
- Consensus sequence
- Regular expression (Prosite), e.g. for the region around LGB2_LUPLU ...FNA--NIPKH... and GLB1_GLYDI ...IAGADNGAGV...: [FI]-[N]-x(1,2)-N-[I]-[P]-[K]-[H]
- HMM?

8.4.1 Simple HMM

How do we represent this?

    HBA_HUMAN   ...VGA--HAGEY...
    HBB_HUMAN   ...V----NVDEV...
    MYG_PHYCA   ...VEA--DVAGH...
    GLB3_CHITP  ...VKG------D...
    GLB5_PETMA  ...VYS--TYETS...
    LGB2_LUPLU  ...FNA--NIPKH...
    GLB1_GLYDI  ...IAGADNGAGV...
    "Matches":     ***  *****

We first consider a simple HMM that is equivalent to a PSSM (position-specific score matrix):

[Figure: a chain of match states, one per match column; at each state the listed amino acids have a higher emission probability.]
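A Python sketch of the PSSM-like simple HMM: per match column, emission probabilities are estimated from the small alignment block above. Laplace (+1) smoothing over the 20 amino acids is assumed here, in the spirit of Section 8.4.5.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

block = [            # the ten alignment columns shown above; '-' marks a gap
    "VGA--HAGEY",    # HBA_HUMAN
    "V----NVDEV",    # HBB_HUMAN
    "VEA--DVAGH",    # MYG_PHYCA
    "VKG------D",    # GLB3_CHITP
    "VYS--TYETS",    # GLB5_PETMA
    "FNA--NIPKH",    # LGB2_LUPLU
    "IAGADNGAGV",    # GLB1_GLYDI
]

def column_emissions(column, pseudocount=1):
    """Emission probabilities e(b) for one match column, with Laplace smoothing."""
    counts = {b: pseudocount for b in AMINO_ACIDS}
    for residue in column:
        if residue != "-":
            counts[residue] += 1
    total = sum(counts.values())
    return {b: counts[b] / total for b in AMINO_ACIDS}

match_columns = [0, 1, 2, 5, 6, 7, 8, 9]   # the columns marked '*' above (at most 50% gaps)
pssm = [column_emissions("".join(seq[j] for seq in block)) for j in match_columns]
print(pssm[0]["V"])                        # V has the highest probability in the first match column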

8.4.2 Insert states

We introduce so-called insert states that emit symbols based on their background probabilities.

[Figure: the match-state chain from above, with an insert state added between consecutive match states, and Begin and End states.]

This allows us to model segments of sequence that lie outside of conserved domains.

8.4.3 Delete states

We introduce so-called delete states that are silent and do not emit any symbols.

[Figure: the match-state chain from above, with silent delete states that allow individual match states to be skipped, and Begin and End states.]

This allows us to model the absence of individual domains.

8.4.4 Topology of a profile HMM

The result is a so-called profile HMM:

[Figure: the complete profile HMM topology with Begin and End states, match states, insert states and delete states.]

8.4.5 Design of a profile HMM

Suppose we are given a multiple alignment of a family of sequences. First we must decide which positions are to be modeled as match states and which positions are to be modeled as insert states. Rule of thumb: columns with more than 50% gaps should be modeled as insert states.

We determine the transition and emission probabilities simply by counting the observed transitions A_kl and emissions E_k(b):

    a_kl = A_kl / \sum_{l'} A_{kl'}    and    e_k(b) = E_k(b) / \sum_{b'} E_k(b').

Obviously, it may happen that certain transitions or emissions do not appear in the training data, and thus we use the Laplace rule and add 1 to each count.
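A rough sketch of the construction in Section 8.4.5: choose match columns by the 50%-gap rule, translate every aligned sequence into a path through match (M), insert (I) and delete (D) states, and estimate transition probabilities from the resulting counts. It reuses the alignment block from the previous sketch; match-state emissions would be estimated as shown there, and the pseudocount handling is simplified.

def build_profile_hmm(alignment, pseudocount=1):
    """Decide match columns by the 50%-gap rule and estimate transition
    probabilities between match, insert and delete states by counting."""
    n_seqs, n_cols = len(alignment), len(alignment[0])
    match_cols = [j for j in range(n_cols)
                  if sum(seq[j] == "-" for seq in alignment) <= n_seqs / 2]

    def state_path(seq):
        """States M_j / D_j at match columns, I_j for residues in insert columns
        (j = index of the preceding match state)."""
        path, j = ["begin"], 0
        for col in range(n_cols):
            if col in match_cols:
                j += 1
                path.append(f"M{j}" if seq[col] != "-" else f"D{j}")
            elif seq[col] != "-":
                path.append(f"I{j}")
        path.append("end")
        return path

    transitions = {}
    for seq in alignment:
        p = state_path(seq)
        for k, l in zip(p, p[1:]):
            transitions.setdefault(k, {}).setdefault(l, 0)
            transitions[k][l] += 1
    # Normalize; the pseudocount is spread only over transitions actually observed,
    # a simplification of the Laplace rule described above.
    a = {k: {l: (c + pseudocount) / (sum(row.values()) + pseudocount * len(row))
             for l, c in row.items()}
         for k, row in transitions.items()}
    return match_cols, a

match_cols, a = build_profile_hmm(block)
print(match_cols)        # [0, 1, 2, 5, 6, 7, 8, 9]
print(a["M3"])           # transitions out of the third match state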