Grundlagen der Bioinformatik, SS 09, D. Huson, June 16, 2009

7 Markov chains and Hidden Markov Models

We will discuss:
- Markov chains
- Hidden Markov Models (HMMs)
- Profile HMMs

This chapter is based on: R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge, 1998.

7.1 CpG-islands

Example: finding CpG-islands in the human genome.

(Figure: a fragment of double-stranded DNA written in ...NpNpNp... notation, with the CpG dinucleotides highlighted.)

The C in a CpG-pair is often modified by methylation (that is, an H-atom is replaced by a CH3-group). There is a relatively high chance that the methyl-C will mutate to a T. Hence, CpG-pairs are under-represented in the human genome.

Upstream of a gene, the methylation process is suppressed in short regions of the genome of length 100-5000. These areas are called CpG-islands and they are characterized by the fact that we see more CpG-pairs in them than elsewhere. CpG-islands are useful marks for genes in organisms whose genomes contain 5-methyl-cytosine.

Definition 7.1.1 (classical definition of CpG-islands) A DNA sequence of length 200 with a C+G content of 50% and a ratio of observed-to-expected number of CpG's that is above 0.6. (Gardiner-Garden & Frommer, 1987)

According to a recent study, human chromosomes 21 and 22 contain about 1100 CpG-islands and about 750 genes. (Comprehensive analysis of CpG islands in human chromosomes 21 and 22, D. Takai & P.A. Jones, PNAS, March 19, 2002)

We will address the following two main questions concerning CpG-islands:

Main questions:
1. Given a short segment of genomic sequence, how do we decide whether this segment comes from a CpG-island or not?
2. Given a long segment of genomic sequence, how do we find all contained CpG-islands?

7.2 Markov chains

Our goal is to set up a probabilistic model for CpG-islands. Because pairs of consecutive nucleotides are important in this context, we need a model in which the probability of one symbol depends on its predecessor. This leads us to a Markov chain.

Example: a Markov chain over the DNA alphabet. Circles = states, e.g. with names A, C, G and T. Arrows = possible transitions, each labeled with a transition probability a_st = P(x_i = t | x_{i-1} = s).

Definition 7.2.1 (Markov chain) A (time-homogeneous) Markov chain (of order 1) is a system (S, A) consisting of a finite set of states S = {s_1, s_2, ..., s_n} and a transition matrix A = {a_st}, with sum_{t in S} a_st = 1 for all s in S, that determines the probability of the transition s -> t as follows:

P(x_{i+1} = t | x_i = s) = a_st.

(At any time i the chain is in a specific state x_i, and at the tick of a clock the chain changes to state x_{i+1} according to the given transition probabilities.)

Example: Weather in Tübingen, daily at midday. Possible states are rain, sun, clouds or tornado. Transition probabilities:

      R    S    C    T
R    .5   .1   .4    0
S    .2   .6   .2    0
C    .3   .3   .4    0
T    .5   .0   .1   .4

Weather: ...rrrrrrccsssssscscscccrrcrcssss...

7.2.1 Probability of a sequence of states

Given a sequence of states x_1, x_2, x_3, ..., x_L, what is the probability that a Markov chain will step through precisely this sequence of states?

P(x) = P(x_L, x_{L-1}, ..., x_1)
     = P(x_L | x_{L-1}, ..., x_1) P(x_{L-1} | x_{L-2}, ..., x_1) ... P(x_1)
       (by repeated application of P(X, Y) = P(X | Y) P(Y))
     = P(x_L | x_{L-1}) P(x_{L-1} | x_{L-2}) ... P(x_2 | x_1) P(x_1)
     = P(x_1) * prod_{i=2}^{L} a_{x_{i-1} x_i},

because P(x_i | x_{i-1}, ..., x_1) = P(x_i | x_{i-1}) = a_{x_{i-1} x_i}, the Markov chain property!
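The chain rule above translates directly into code. Here is a minimal Python sketch (an added illustration, not part of the lecture notes) that evaluates the probability of a state sequence under the Tübingen weather chain; the initial distribution INIT is not given in the example above and is assumed uniform here purely for illustration.

# Tübingen weather chain from the example above.
TRANS = {
    'R': {'R': 0.5, 'S': 0.1, 'C': 0.4, 'T': 0.0},
    'S': {'R': 0.2, 'S': 0.6, 'C': 0.2, 'T': 0.0},
    'C': {'R': 0.3, 'S': 0.3, 'C': 0.4, 'T': 0.0},
    'T': {'R': 0.5, 'S': 0.0, 'C': 0.1, 'T': 0.4},
}
INIT = {'R': 0.25, 'S': 0.25, 'C': 0.25, 'T': 0.25}   # assumed uniform P(x_1)

def chain_probability(states):
    """P(x) = P(x_1) * prod_{i=2..L} a_{x_{i-1} x_i}."""
    p = INIT[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= TRANS[prev][cur]
    return p

print(chain_probability("SSCRR"))
# Conditional version of the question asked later in Section 7.2.7
# (sun today, then sun for the next seven days):
print(TRANS['S']['S'] ** 7)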

7.2.2 Modeling the begin and end states

In the previous discussion we overlooked the fact that a Markov chain starts in some state x_1, with initial probability P(x_1).

We add a begin state to the model that is labeled b. We will always assume that x_0 = b holds. Then:

P(x_1 = s) = a_bs = P(s),

where P(s) denotes the background probability of symbol s.

Similarly, we explicitly model the end of the sequence of states using an end state e. Thus, the probability that the chain ends after state x_L is a_{x_L e}.

7.2.3 Extension of the model

Example: the DNA chain with states A, C, G and T, extended by a begin state b and an end state e.

# Markov chain that generates CpG islands
# (Source: DEKM98, p 50)
# Number of states:
6
# State labels:
A C G T * +
# Transition matrix:
0.1795 0.2735 0.4255 0.1195 0 0.002
0.1705 0.3665 0.2735 0.1875 0 0.002
0.1605 0.3385 0.3745 0.1245 0 0.002
0.0785 0.3545 0.3835 0.1815 0 0.002
0.2495 0.2495 0.2495 0.2495 0 0.002
0.0000 0.0000 0.0000 0.0000 0 1.000

7.2.4 Determining the transition matrix

The transition matrix A+ (the "+" indicates the CpG-island model) is obtained empirically ("trained") by counting the transitions that occur in a training set of known CpG-islands. This is done as follows:

a+_st = c+_st / sum_{t'} c+_{st'},

where c+_st is the number of positions in a training set of CpG-islands at which state s is followed by state t.

We obtain A- empirically in a similar way, using a training set of known non-CpG-islands.
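As an illustration of Section 7.2.4, here is a small Python sketch (an addition, not from the lecture notes) that estimates a transition matrix from training sequences by counting dinucleotide transitions and normalizing each row; the function name and the tiny toy training set are assumptions used for illustration only.

from collections import defaultdict

ALPHABET = "ACGT"

def train_transition_matrix(training_seqs):
    """Estimate a_st = c_st / sum_t' c_st' from a list of DNA training sequences."""
    counts = {s: defaultdict(float) for s in ALPHABET}
    for seq in training_seqs:
        for s, t in zip(seq, seq[1:]):      # count how often s is followed by t
            counts[s][t] += 1.0
    matrix = {}
    for s in ALPHABET:
        row_sum = sum(counts[s][t] for t in ALPHABET)
        matrix[s] = {t: (counts[s][t] / row_sum if row_sum > 0 else 0.0)
                     for t in ALPHABET}
    return matrix

# toy example: made-up sequences standing in for annotated CpG-islands
plus_model = train_transition_matrix(["ACGCGCGT", "CGCGGCGC"])
print(plus_model["C"]["G"])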

7.2.5 Two examples of Markov chains

# Markov chain for CpG islands
# (Source: DEKM98, p 50)
# Number of states:
6
# State labels:
A C G T * +
# Transition matrix:
.1795 .2735 .4255 .1195 0 .002
.1705 .3665 .2735 .1875 0 .002
.1605 .3385 .3745 .1245 0 .002
.0785 .3545 .3835 .1815 0 .002
.2495 .2495 .2495 .2495 0 .002
.0000 .0000 .0000 .0000 0 1.000

# Markov chain for non-CpG islands
# (Source: DEKM98, p 50)
# Number of states:
6
# State labels:
A C G T * +
# Transition matrix:
.2995 .2045 .2845 .2095 0 .002
.3215 .2975 .0775 .3015 0 .002
.2475 .2455 .2975 .2075 0 .002
.1765 .2385 .2915 .2915 0 .002
.2495 .2495 .2495 .2495 0 .002
.0000 .0000 .0000 .0000 0 1.00

7.2.6 Answering question 1

Suppose we are given a short sequence x = (x_1, x_2, ..., x_L). Does it come from a CpG-island (Model+)?

P(x | Model+) = prod_{i=0}^{L} a+_{x_i x_{i+1}}, with x_0 = b and x_{L+1} = e.

We use the following score:

S(x) = log ( P(x | Model+) / P(x | Model-) ) = sum_{i=0}^{L} log ( a+_{x_i x_{i+1}} / a-_{x_i x_{i+1}} ).

The higher this score is, the higher the probability that x comes from a CpG-island.

7.2.7 Types of questions that a Markov chain can answer

Example: weather in Tübingen, daily at midday. Possible states are rain, sun or clouds. Transition probabilities:

      R    S    C
R    .5   .1   .4
S    .2   .6   .2
C    .3   .3   .4

Types of questions that the model can answer: If it is sunny today, what is the probability that the sun will shine for the next seven days?
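To make the log-odds score S(x) of Section 7.2.6 concrete, here is a small Python sketch (added for illustration, not part of the notes). For brevity it uses only the 4x4 nucleotide part of the two matrices above and scores interior transitions, i.e. it omits the begin/end terms; the function and variable names are illustrative assumptions.

import math

ALPHA = "ACGT"
PLUS = [[.1795, .2735, .4255, .1195],
        [.1705, .3665, .2735, .1875],
        [.1605, .3385, .3745, .1245],
        [.0785, .3545, .3835, .1815]]
MINUS = [[.2995, .2045, .2845, .2095],
         [.3215, .2975, .0775, .3015],
         [.2475, .2455, .2975, .2075],
         [.1765, .2385, .2915, .2915]]

def log_odds_score(x):
    """S(x) = sum_i log( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} ), interior transitions only."""
    s = 0.0
    for u, v in zip(x, x[1:]):
        i, j = ALPHA.index(u), ALPHA.index(v)
        s += math.log(PLUS[i][j] / MINUS[i][j])
    return s

print(log_odds_score("CGCGCGCG"))   # clearly positive: looks like a CpG-island
print(log_odds_score("ATATATAT"))   # negative: looks like background sequence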

7.3 Hidden Markov Models (HMM)

Motivation: Question 2, how to detect CpG-islands inside a long sequence? One possible approach is a window technique: a window of width w is moved along the sequence and the score is plotted. Problem: it is hard to determine the boundaries of CpG-islands, and which window size w should one choose? ...

We will consider an alternative approach: merge the two Markov chains Model+ and Model- to obtain a so-called Hidden Markov Model.

Definition 7.3.1 (HMM) An HMM is a system M = (S, Q, A, e) consisting of an alphabet S, a set of states Q, a matrix A = {a_kl} of transition probabilities a_kl for k, l in Q, and an emission probability e_k(b) for every k in Q and b in S.

7.3.1 Example

The topology of an HMM for CpG-islands: the states A+, C+, G+, T+ form the "+" chain and the states A-, C-, G-, T- form the "-" chain. (Additionally, we have all transitions between states in either of the two sets that carry over from the two Markov chains Model+ and Model-.)

7.3.2 HMM for CpG-islands

# Number of states:
9
# Names of states (begin/end, A+, C+, G+, T+, A-, C-, G- and T-):
0 A C G T a c g t
# Number of symbols:
4
# Names of symbols:
a c g t
# Transition matrix, probability to change from +island to -island (and vice versa) is 10^-4
0.000 0.0725193101 0.1637630296 0.1788242720 0.0754545682 0.1322050994 0.1267006624 0.1226380452 0.1278950131
0.001 0.1762237762 0.2682517483 0.4170629371 0.1174825175 0.0035964036 0.0054745255 0.0085104895 0.0023976024
0.001 0.1672435130 0.3599201597 0.2679840319 0.1838722555 0.0034131737 0.0073453094 0.0054690619 0.0037524950
0.001 0.1576223776 0.3318881119 0.3671328671 0.1223776224 0.0032167832 0.0067732268 0.0074915085 0.0024975025
0.001 0.0773426573 0.3475514486 0.3759440559 0.1781818182 0.0015784216 0.0070929071 0.0076723277 0.0036363636
0.001 0.0002997003 0.0002047952 0.0002837163 0.0002097902 0.2994005994 0.2045904096 0.2844305694 0.2095804196
0.001 0.0003216783 0.0002977023 0.0000769231 0.0003016983 0.3213566434 0.2974045954 0.0778441558 0.3013966034
0.001 0.0002477522 0.0002457542 0.0002977023 0.0002077922 0.2475044955 0.2455084915 0.2974035964 0.2075844156
0.001 0.0001768232 0.0002387612 0.0002917083 0.0002917083 0.1766463536 0.2385224775 0.2914165834 0.2914155844
# Emission probabilities:
0 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1

From now on we use 0 for the begin and end state.

7.3.3 Example: fair/loaded dice

A casino uses two dice, a fair one and a loaded one:

Fair die:   1: 1/6,  2: 1/6,  3: 1/6,  4: 1/6,  5: 1/6,  6: 1/6
Loaded die: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2

Transition probabilities: Fair -> Fair 0.95, Fair -> Unfair 0.05, Unfair -> Unfair 0.9, Unfair -> Fair 0.1.

The casino guest only observes the number rolled:
6 4 3 2 3 4 6 5 1 2 3 4 5 6 6 6 3 2 1 2 6 3 4 2 1 6 6 ...
Which die was used remains hidden:
F F F F F F F F F F F F U U U U U F F F F F F F F F F ...
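For the later algorithm sketches it is convenient to have the fair/loaded casino HMM of Section 7.3.3 written down as data. The following Python snippet (an addition, not part of the lecture notes) encodes its transition and emission probabilities; the entry probabilities from the begin state 0 are not specified above and are assumed here to be 0.5 each, purely for illustration.

# Fair/loaded casino HMM from Section 7.3.3 as plain Python data.
STATES = ['F', 'U']
START = {'F': 0.5, 'U': 0.5}                      # assumed entry probabilities from state 0
TRANS = {'F': {'F': 0.95, 'U': 0.05},
         'U': {'F': 0.10, 'U': 0.90}}
EMIT = {'F': {str(d): 1/6 for d in range(1, 7)},
        'U': {'1': 0.1, '2': 0.1, '3': 0.1, '4': 0.1, '5': 0.1, '6': 0.5}}

# quick consistency check: every row of TRANS and EMIT sums to 1
assert all(abs(sum(TRANS[s].values()) - 1.0) < 1e-9 for s in STATES)
assert all(abs(sum(EMIT[s].values()) - 1.0) < 1e-9 for s in STATES)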

7.3.4 Generation of simulated data

We can use HMMs to generate data:

Algorithm 7.3.2 (Simulator)
Start in state 0. While we have not reentered state 0:
- Choose a new state using the transition probabilities.
- Choose a symbol using the emission probabilities and report it.

We use the fair/loaded HMM to generate a sequence of states and symbols:

Symbols: 24335642611341666666526562426612134635535566462666636664253
States : FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUUUFFFFFFFFFFUUUUUUUUUUUUUFFFF
Symbols: 35246363252521655615445653663666511145445656621261532516435
States : FFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUFFUUUUUUUUUUUUUUFFFFFFFFF
Symbols: 5146526666
States : FFUUUUUUUU

How probable is a given sequence of data? If we can observe only the symbols, can we reconstruct the corresponding states?

7.3.5 Determining the probability, given the states and symbols

Definition 7.3.3 (Path) A path π = (π_1, π_2, ..., π_L) is a sequence of states in the model M.

Suppose we are given a sequence of symbols x = (x_1, ..., x_L) and a path π = (π_1, ..., π_L) through M. The joint probability is:

P(x, π) = a_{0 π_1} * prod_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}}, with π_{L+1} = 0.

Unfortunately, we usually do not know the path through the model.
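The simulator of Algorithm 7.3.2 and the joint probability P(x, π) of Section 7.3.5 are easy to express in code. Below is a small self-contained Python sketch for the casino HMM (an added illustration, not from the notes); it stops after a fixed number of steps instead of modeling an explicit end state, and the start distribution is assumed, as before.

import random

START = {'F': 0.5, 'U': 0.5}                                   # assumed
TRANS = {'F': {'F': 0.95, 'U': 0.05}, 'U': {'F': 0.10, 'U': 0.90}}
EMIT = {'F': {str(d): 1/6 for d in range(1, 7)},
        'U': {'1': .1, '2': .1, '3': .1, '4': .1, '5': .1, '6': .5}}

def draw(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def simulate(length):
    """Algorithm 7.3.2, simplified: stop after 'length' symbols instead of re-entering state 0."""
    state, states, symbols = draw(START), [], []
    for _ in range(length):
        symbols.append(draw(EMIT[state]))
        states.append(state)
        state = draw(TRANS[state])
    return ''.join(symbols), ''.join(states)

def joint_probability(symbols, path):
    """P(x, pi) = a_{0 pi_1} * prod_i e_{pi_i}(x_i) a_{pi_i pi_{i+1}} (end transition omitted)."""
    p = START[path[0]]
    for i, (sym, st) in enumerate(zip(symbols, path)):
        p *= EMIT[st][sym]
        if i + 1 < len(path):
            p *= TRANS[st][path[i + 1]]
    return p

x, pi = simulate(60)
print(x, pi, joint_probability(x, pi))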

7.3.6 Decoding a sequence of symbols

Problem: We have observed a sequence x of symbols and would like to decode the sequence.

Example: The sequence of symbols C G C G has a number of explanations within the CpG-model, e.g.: (C+, G+, C+, G+), (C-, G-, C-, G-) and (C-, G+, C-, G+).

A path through the HMM determines which parts of the sequence x are classified as CpG-islands; such a classification of the observed symbols is called a decoding.

7.3.7 The most probable path

To solve the decoding problem, we want to determine the path π* that maximizes the probability of having generated the sequence x of symbols, that is:

π* = arg max_π P(x, π).

This most probable path π* can be computed recursively.

Definition 7.3.4 (Viterbi variable) Given a prefix (x_1, x_2, ..., x_i), the Viterbi variable v_k(i) denotes the probability that the most probable path is in state k when it generates symbol x_i at position i. Then:

v_l(i+1) = e_l(x_{i+1}) * max_{k in Q} ( v_k(i) a_kl ),

with v_0(0) = 1 initially.

(Exercise: We have arg max_π P(x, π) = arg max_π P(π | x).)

Dynamic programming matrix: one column for each position x_0, x_1, x_2, x_3, ..., x_{i-1}, x_i, x_{i+1}, ... and one row for each state A+, C+, G+, T+, A-, C-, G-, T- and 0; the entry in row k and column i is v_k(i).

7.3.8 The Viterbi algorithm

Algorithm 7.3.5 (Viterbi algorithm)
Input: HMM M = (S, Q, A, e) and symbol sequence x
Output: most probable path π*.
Initialization (i = 0): v_0(0) = 1, v_k(0) = 0 for k != 0.
For all i = 1 ... L, l in Q:
  v_l(i) = e_l(x_i) * max_{k in Q} ( v_k(i-1) a_kl )
  ptr_i(l) = arg max_{k in Q} ( v_k(i-1) a_kl )
Termination:
  P(x, π*) = max_{k in Q} ( v_k(L) a_k0 )
  π*_L = arg max_{k in Q} ( v_k(L) a_k0 )
Traceback: for i = L, L-1, ..., 2: π*_{i-1} = ptr_i(π*_i).

Implementation hint: instead of multiplying many small values, add their logarithms!

(Exercise: run-time complexity)
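Here is a compact Python sketch of the Viterbi algorithm for the casino HMM, working in log space as the implementation hint suggests (added for illustration; the begin/end handling is simplified to an assumed start distribution and no end state).

import math

START = {'F': 0.5, 'U': 0.5}                                   # assumed
TRANS = {'F': {'F': 0.95, 'U': 0.05}, 'U': {'F': 0.10, 'U': 0.90}}
EMIT = {'F': {str(d): 1/6 for d in range(1, 7)},
        'U': {'1': .1, '2': .1, '3': .1, '4': .1, '5': .1, '6': .5}}
STATES = ['F', 'U']

def viterbi(x):
    """Return the most probable state path for the observed symbols x."""
    # v[i][l] = log of the Viterbi variable; ptr[i][l] = best predecessor of state l at position i
    v = [{l: math.log(START[l]) + math.log(EMIT[l][x[0]]) for l in STATES}]
    ptr = [{}]
    for i in range(1, len(x)):
        v.append({}); ptr.append({})
        for l in STATES:
            best_k = max(STATES, key=lambda k: v[i-1][k] + math.log(TRANS[k][l]))
            ptr[i][l] = best_k
            v[i][l] = math.log(EMIT[l][x[i]]) + v[i-1][best_k] + math.log(TRANS[best_k][l])
    # termination and traceback (no explicit end state in this simplified model)
    last = max(STATES, key=lambda k: v[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return ''.join(reversed(path))

print(viterbi("245366666663564"))   # long runs of sixes tend to be decoded as 'U'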

7.3.9 Example for Viterbi

Suppose we are given the sequence C G C G and the HMM for CpG-islands. Here is a table of possible values for v:

               sequence
            -      C       G       C       G
  State
    0       1      0       0       0       0
    A+      0      0       0       0       0
    C+      0      0.16    0       0.015   0
    G+      0      0       0.044   0       0.0039
    T+      0      0       0       0       0
    A-      0      0       0       0       0
    C-      0      0.13    0       0.0026  0
    G-      0      0       0.010   0       0.00021
    T-      0      0       0       0       0

7.3.10 Viterbi-decoding of the casino example

We used the fair/loaded HMM to first generate a sequence of symbols and then used the Viterbi algorithm to decode the sequence. Result:

Symbols: 24335642611341666666526562426612134635535566462666636664253
States : FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUUUFFFFFFFFFFUUUUUUUUUUUUUFFFF
Viterbi: FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUFFFFFFFFFFFFUUUUUUUUUUUUUFFFF
Symbols: 35246363252521655615445653663666511145445656621261532516435
States : FFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUFFUUUUUUUUUUUUUUFFFFFFFFF
Viterbi: FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Symbols: 5146526666
States : FFUUUUUUUU
Viterbi: FFFFFFUUUU

7.3.11 Three main problems for HMMs

Let M be an HMM and x a sequence of symbols.

(Q1) For x, determine the most probable sequence of states through M: Viterbi algorithm.
(Q2) Determine the probability that M generated x, P(x) = P(x | M): forward algorithm.
(Q3) Given x and perhaps some additional sequences of symbols, how do we train the parameters of M? Baum-Welch algorithm.

7.3.12 Computing P(x | M)

Suppose we are given an HMM M and a sequence of symbols x. The probability that x was generated by M is given by:

P(x | M) = sum_π P(x, π | M),

summing over all possible state sequences π through M.

(Exercise: how fast does the number of paths increase as a function of the sequence length?)
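As a sanity check for Section 7.3.12, P(x | M) can be computed naively by enumerating all |Q|^L paths; this only works for very short sequences, which is exactly why the forward algorithm of the next section is needed. A small Python sketch (added here; the simplified start handling is an assumption):

from itertools import product

START = {'F': 0.5, 'U': 0.5}                                   # assumed
TRANS = {'F': {'F': 0.95, 'U': 0.05}, 'U': {'F': 0.10, 'U': 0.90}}
EMIT = {'F': {str(d): 1/6 for d in range(1, 7)},
        'U': {'1': .1, '2': .1, '3': .1, '4': .1, '5': .1, '6': .5}}

def brute_force_probability(x):
    """P(x | M) = sum over all 2^L state paths of P(x, pi); exponential, only for tiny L."""
    total = 0.0
    for path in product('FU', repeat=len(x)):
        p = START[path[0]]
        for i, (sym, st) in enumerate(zip(x, path)):
            p *= EMIT[st][sym]
            if i + 1 < len(path):
                p *= TRANS[st][path[i + 1]]
        total += p
    return total

print(brute_force_probability("6266"))   # 2^4 = 16 paths summed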

7.3.13 Forward algorithm

The value of P(x | M) can be efficiently computed using the forward algorithm. This algorithm is obtained from the Viterbi algorithm by replacing the max by a sum. More precisely, we define the forward-variable:

f_k(i) = P(x_1 ... x_i, π_i = k),

which equals the probability that the model reports the prefix sequence (x_1, ..., x_i) and is in state π_i = k at position i. We obtain the recursion:

f_l(i+1) = e_l(x_{i+1}) * sum_{k in Q} f_k(i) a_kl.

(The accompanying figure illustrates the recursion: f_l(i+1) collects the contributions f_p(i), f_q(i), ..., f_t(i) of all states at position i, each weighted by its transition probability into l.)

Algorithm 7.3.6 (Forward algorithm)
Input: HMM M = (S, Q, A, e) and sequence of symbols x
Output: probability P(x | M)
Initialization (i = 0): f_0(0) = 1, f_k(0) = 0 for k != 0.
For all i = 1 ... L, l in Q:
  f_l(i) = e_l(x_i) * sum_{k in Q} ( f_k(i-1) a_kl )
Result: P(x | M) = sum_{k in Q} ( f_k(L) a_k0 )

Implementation hint: logarithms cannot be employed here as easily as for Viterbi, but there are so-called scaling methods.

This solves Main problem Q2!

7.3.14 Backward algorithm

The backward-variable contains the probability to start in state π_i = k and then to generate the suffix sequence (x_{i+1}, ..., x_L):

b_k(i) = P(x_{i+1} ... x_L | π_i = k).

Algorithm 7.3.7 (Backward algorithm)
Input: HMM M = (S, Q, A, e) and sequence of symbols x
Output: probability P(x | M)
Initialization (i = L): b_k(L) = a_k0 for all k.
For all i = L-1 ... 1, k in Q:
  b_k(i) = sum_{l in Q} a_kl e_l(x_{i+1}) b_l(i+1)
Result: P(x | M) = sum_{l in Q} ( a_0l e_l(x_1) b_l(1) )

(The corresponding figure shows the backward recursion: b_k(i) collects the contributions b_p(i+1), ..., b_t(i+1) of all successor states l, each weighted by a_kl and e_l(x_{i+1}).)
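A small Python sketch of the forward and backward algorithms for the casino HMM (added for illustration; the explicit end state is dropped and a start distribution is assumed, so the initialization differs slightly from Algorithms 7.3.6/7.3.7):

START = {'F': 0.5, 'U': 0.5}                                   # assumed
TRANS = {'F': {'F': 0.95, 'U': 0.05}, 'U': {'F': 0.10, 'U': 0.90}}
EMIT = {'F': {str(d): 1/6 for d in range(1, 7)},
        'U': {'1': .1, '2': .1, '3': .1, '4': .1, '5': .1, '6': .5}}
STATES = ['F', 'U']

def forward(x):
    """0-based: f[i][k] corresponds to f_k(i+1) in the notes; returns the table and P(x | M)."""
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for i in range(1, len(x)):
        f.append({l: EMIT[l][x[i]] * sum(f[i-1][k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    return f, sum(f[-1][k] for k in STATES)        # no end-state term in this simplified model

def backward(x):
    """0-based: b[i][k] corresponds to b_k(i+1) in the notes; returns the table and P(x | M)."""
    L = len(x)
    b = [dict() for _ in range(L)]
    b[L-1] = {k: 1.0 for k in STATES}              # plays the role of a_k0 = 1 (end state dropped)
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(TRANS[k][l] * EMIT[l][x[i+1]] * b[i+1][l] for l in STATES)
                for k in STATES}
    return b, sum(START[l] * EMIT[l][x[0]] * b[0][l] for l in STATES)

x = "6246662613461666"
_, p_fwd = forward(x)
_, p_bwd = backward(x)
print(p_fwd, p_bwd)                                # the two values agree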

7.3.15 Summary of the three variables

Viterbi  v_k(i): probability with which the most probable state path generates the sequence of symbols (x_1, x_2, ..., x_i) and the system is in state k at time i.
Forward  f_k(i): probability that the prefix sequence of symbols x_1, ..., x_i is generated and the system is in state k at time i.
Backward b_k(i): probability that the system starts in state k at time i and then generates the sequence of symbols x_{i+1}, ..., x_L.

7.3.16 Posterior probabilities

Suppose we are given an HMM M and a sequence of symbols x. Let P(π_i = k | x) be the probability that symbol x_i was reported in state π_i = k. We call this the posterior probability, as it is computed after observing the sequence x. We have:

P(π_i = k | x) = P(π_i = k, x) / P(x) = f_k(i) b_k(i) / P(x),

as P(g, h) = P(g | h) P(h) and by definition of the forward- and backward-variables.

7.3.17 Decoding with posterior probabilities

There are alternatives to the Viterbi-decoding that are useful, e.g., when many other paths exist that have a probability similar to that of π*. We define a sequence of states π^ thus:

π^_i = arg max_{k in Q} P(π_i = k | x),

in other words, at every position we choose the most probable state for that position. This decoding is useful if we are interested in the state at a specific position i and not in the whole sequence of states.

Warning: if the transition matrix forbids some transitions (i.e., a_kl = 0), then this decoding may produce a sequence that is not a valid path, because its probability is 0!
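Posterior decoding (Section 7.3.17) combines the forward and backward tables. Here is a self-contained Python sketch for the casino HMM (an added illustration; end state dropped and start distribution assumed, as in the earlier snippets):

START = {'F': 0.5, 'U': 0.5}                                   # assumed
TRANS = {'F': {'F': 0.95, 'U': 0.05}, 'U': {'F': 0.10, 'U': 0.90}}
EMIT = {'F': {str(d): 1/6 for d in range(1, 7)},
        'U': {'1': .1, '2': .1, '3': .1, '4': .1, '5': .1, '6': .5}}
STATES = ['F', 'U']

def posterior_decode(x):
    """pi^_i = argmax_k f_k(i) b_k(i) / P(x); the P(x) factor does not change the argmax."""
    L = len(x)
    # forward table
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for i in range(1, L):
        f.append({l: EMIT[l][x[i]] * sum(f[i-1][k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    # backward table
    b = [dict() for _ in range(L)]
    b[L-1] = {k: 1.0 for k in STATES}
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(TRANS[k][l] * EMIT[l][x[i+1]] * b[i+1][l] for l in STATES)
                for k in STATES}
    px = sum(f[L-1][k] for k in STATES)
    posterior = [{k: f[i][k] * b[i][k] / px for k in STATES} for i in range(L)]
    decoding = ''.join(max(STATES, key=lambda k: posterior[i][k]) for i in range(L))
    return decoding, posterior

decoding, post = posterior_decode("624666661345666")
print(decoding)
print([round(p['U'], 2) for p in post])   # posterior probability of the loaded die per position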

7.3.18 Training the parameters

How does one generate an HMM?

First step: determine its topology, i.e. the number of states and how they are connected via transitions of non-zero probability. The topology is usually designed by hand.

Second step: set the parameters, i.e. the transition probabilities a_kl and the emission probabilities e_k(b).

We will now discuss the second step. Given a set of example sequences, our goal is to train the parameters of the HMM using the example sequences, i.e. to set the parameters in such a way that the probability with which the HMM generates the given example sequences is maximized.

7.3.19 Training when the states are known

Let M = (S, Q, A, e) be an HMM. Suppose we are given a list of sequences of symbols x^1, x^2, ..., x^n and a list of corresponding paths π^1, π^2, ..., π^n. (E.g., DNA sequences with annotated CpG-islands.) We want to choose the parameters (A, e) of the HMM M optimally, such that:

P(x^1, ..., x^n, π^1, ..., π^n | M = (S, Q, A, e)) = max_{(A', e')} P(x^1, ..., x^n, π^1, ..., π^n | M = (S, Q, A', e')).

In other words, we want to determine the so-called Maximum Likelihood estimator (ML-estimator) for (A, e).

7.3.20 ML-estimation for (A, e)

(Recall: if we consider P(D | M) as a function of the data D, then we call this a probability; as a function of the model M, we use the word likelihood.)

ML-estimation:

(A, e)_ML = arg max_{(A', e')} P(x^1, ..., x^n, π^1, ..., π^n | M = (S, Q, A', e')).

To compute A and e from labeled training data, we first determine the following numbers:

â_kl:   number of observed transitions from state k to l,
ê_k(b): number of observed emissions of b in state k.

We then set A and e as follows:

a_kl = â_kl / sum_{q in Q} â_kq   and   e_k(b) = ê_k(b) / sum_{s in S} ê_k(s).   (*)
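A minimal Python sketch of the ML-estimation (*) from labeled training data (added here for illustration; the function name is an assumption, and the toy data is the fair/loaded example of Section 7.3.21 below):

from collections import defaultdict

def ml_estimate(sequences, paths):
    """Count transitions and emissions from labeled data and normalize as in equation (*)."""
    a_hat = defaultdict(lambda: defaultdict(float))
    e_hat = defaultdict(lambda: defaultdict(float))
    for x, pi in zip(sequences, paths):
        for i, (sym, k) in enumerate(zip(x, pi)):
            e_hat[k][sym] += 1.0
            if i + 1 < len(pi):
                a_hat[k][pi[i + 1]] += 1.0
    A = {k: {l: c / sum(row.values()) for l, c in row.items()} for k, row in a_hat.items()}
    E = {k: {b: c / sum(row.values()) for b, c in row.items()} for k, row in e_hat.items()}
    return A, E

# example data from Section 7.3.21 (14 rolls, states F and U):
A, E = ml_estimate(["12534612663215"], ["FFFFFFFUUUUFFF"])
print(A)   # e.g. estimated a_FU = 1/9: one F->U transition out of nine transitions leaving F
print(E)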

7.3.21 Training the fair/loaded HMM

Suppose we are given example data x and π:

Symbols x:  1 2 5 3 4 6 1 2 6 6 3 2 1 5
States  π:  F F F F F F F U U U U F F F

State transitions: count tables â_kl and the resulting a_kl, each with rows and columns 0, F, U.

Emissions: count tables ê_k(b) and the resulting e_k(b), for the symbols b = 1, ..., 6 and the states 0, F, U.

7.3.22 Pseudocounts

One problem in training is overfitting. For example, if some possible transition k -> l is never seen in the example data, then we will set a_kl = 0 and the transition is then forbidden. Also, if a given state k is never seen in the example data, then a_kl is undefined for all l.

To solve this problem, we introduce pseudocounts r_kl and r_k(b), and define:

â_kl   = (number of transitions from k to l in the example data) + r_kl,
ê_k(b) = (number of emissions of b in k in the example data) + r_k(b).

Small pseudocounts reflect little pre-knowledge, large ones reflect more pre-knowledge.
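A pseudocount version of the estimator is a one-line change to the counting sketch shown after Section 7.3.20. A short Python illustration (added here; the uniform pseudocount value r = 1 is an arbitrary assumption):

def smoothed_transitions(counts, states, r=1.0):
    """a_kl = (c_kl + r) / sum_q (c_kq + r); 'counts' maps k -> {l: c_kl}."""
    return {k: {l: (counts.get(k, {}).get(l, 0.0) + r) /
                   sum(counts.get(k, {}).get(q, 0.0) + r for q in states)
                for l in states}
            for k in states}

# state 'U' was never observed: its row becomes uniform instead of undefined
print(smoothed_transitions({'F': {'F': 8, 'U': 1}}, ['F', 'U']))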

7.3.23 Parameter training when the states are unknown

In practice, one usually has access only to the sequences of symbols and not to the state paths. Suppose we are given sequences of symbols x^1, x^2, ..., x^n for which we do NOT know the corresponding state paths π^1, ..., π^n. The problem of choosing the parameters (A, e) of the HMM M optimally, so that

P(x^1, ..., x^n | M = (S, Q, A, e)) = max_{(A', e')} P(x^1, ..., x^n | M = (S, Q, A', e'))

holds, is known to be NP-hard.

Definition 7.3.8 (Log-likelihood score) We define the log-likelihood score of the model M as:

l(x^1, ..., x^n) = log P(x^1, ..., x^n | (A, e)) = sum_{j=1}^{n} log P(x^j | (A, e)).

(Here we assume that the sequences of symbols are independent and therefore P(x^1, ..., x^n) = P(x^1) ... P(x^n) holds.)

The goal is to determine parameters (A, e) that maximize this score.

7.3.24 Baum-Welch algorithm

(In the lecture we did not actually do this but rather looked at Viterbi training.)

Let M = (S, Q, A, e) be an HMM and assume we are given training sequences x^1, x^2, ..., x^n. The parameters (A, e) are to be iteratively improved as follows:

- Based on x^1, ..., x^n and the current value of (A, e), we estimate expectation values ā_kl and ē_k(b) for â_kl and ê_k(b).
- We then compute (A, e) from ā and ē using equation (*).
- This is repeated until the log-likelihood score cannot be improved.

(This is a special case of the so-called expectation maximization (EM) technique.)

Algorithm 7.3.9 (Baum-Welch algorithm)
Input: HMM M = (S, Q, A, e), training data x^1, x^2, ..., x^n
Output: HMM M = (S, Q, A, e) with an improved score.
Initialization: randomly assign A and e.
repeat
  for each sequence x^j do
    for each position i do
      for each state k do
        Compute f_k(i) for x^j with the forward algorithm.
        Compute b_k(i) for x^j with the backward algorithm.
  for each state k do
    for each state l do
      Compute ā_kl = sum_j (1 / P(x^j)) * sum_i f^j_k(i) a_kl e_l(x^j_{i+1}) b^j_l(i+1)
    for each symbol b do
      Compute ē_k(b) = sum_j (1 / P(x^j)) * sum_{i : x^j_i = b} f^j_k(i) b^j_k(i)
  Set new model parameters (A, e) from ā and ē using (*).
  Compute the new log-likelihood l(x^1, ..., x^n | (A, e)).
until the log-likelihood does not improve or a maximum number of iterations is reached.

Why do we use the following expression to compute the expectation for â_kl in the algorithm?

ā_kl = sum_{j=1}^{n} (1 / P(x^j)) * sum_{i=1}^{L_j} f^j_k(i) a_kl e_l(x^j_{i+1}) b^j_l(i+1)

For a single sequence x and a single position i, the expected number of transitions from π_i = k to π_{i+1} = l is given by:

P(π_i = k, π_{i+1} = l | x, (A, e)) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x).

This follows from: P(π_i = k | x) = P(π_i = k, x) / P(x) = f_k(i) b_k(i) / P(x).

Convergence: One can prove that the log-likelihood score converges to a local maximum when using the Baum-Welch algorithm. However, this does not imply that the parameters converge! Local maxima can be avoided by considering many different starting points. Additionally, any standard optimization approach can also be applied to solve the optimization problem.
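The following is a compact Python sketch of Baum-Welch for the two-state casino topology (an added illustration, not the lecture's reference implementation; it drops the explicit end state, keeps the start distribution fixed, uses a small pseudocount in the re-estimation and runs a fixed number of iterations instead of testing convergence — all simplifying assumptions).

import math

STATES = ['F', 'U']
SYMBOLS = [str(d) for d in range(1, 7)]
START = {'F': 0.5, 'U': 0.5}                       # kept fixed (assumption)

def forward(x, A, E):
    f = [{k: START[k] * E[k][x[0]] for k in STATES}]
    for i in range(1, len(x)):
        f.append({l: E[l][x[i]] * sum(f[i-1][k] * A[k][l] for k in STATES) for l in STATES})
    return f, sum(f[-1][k] for k in STATES)

def backward(x, A, E):
    L = len(x)
    b = [dict() for _ in range(L)]
    b[L-1] = {k: 1.0 for k in STATES}
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(A[k][l] * E[l][x[i+1]] * b[i+1][l] for l in STATES) for k in STATES}
    return b

def baum_welch(sequences, iterations=50):
    # starting parameters: slightly asymmetric so that the two states can differentiate
    A = {'F': {'F': 0.8, 'U': 0.2}, 'U': {'F': 0.2, 'U': 0.8}}
    E = {'F': {s: 1/6 for s in SYMBOLS},
         'U': {'1': .12, '2': .12, '3': .12, '4': .12, '5': .12, '6': .4}}
    ll = 0.0
    for _ in range(iterations):
        a_bar = {k: {l: 0.0 for l in STATES} for k in STATES}
        e_bar = {k: {s: 0.0 for s in SYMBOLS} for k in STATES}
        ll = 0.0
        for x in sequences:
            f, px = forward(x, A, E)
            b = backward(x, A, E)
            ll += math.log(px)
            for i in range(len(x)):
                for k in STATES:
                    e_bar[k][x[i]] += f[i][k] * b[i][k] / px
                    if i + 1 < len(x):
                        for l in STATES:
                            a_bar[k][l] += f[i][k] * A[k][l] * E[l][x[i+1]] * b[i+1][l] / px
        # re-estimate with equation (*), plus a tiny pseudocount to avoid zero rows
        A = {k: {l: (a_bar[k][l] + 1e-6) / sum(a_bar[k][q] + 1e-6 for q in STATES)
                 for l in STATES} for k in STATES}
        E = {k: {s: (e_bar[k][s] + 1e-6) / sum(e_bar[k][t] + 1e-6 for t in SYMBOLS)
                 for s in SYMBOLS} for k in STATES}
    return A, E, ll

A, E, ll = baum_welch(["62466666156616366662624366"])
print(E['U'], ll)   # with enough data, one state's emissions tend to drift toward favouring '6'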

7.4 Protein families

Suppose we are given the following related sequences; how can we characterize this family?

(Figure: alignment of seven globin sequences - GLB1_GLYDI, HBB_HUMAN, HBA_HUMAN, MYG_PHYCA, GLB5_PETMA, GLB3_CHITP and LGB2_LUPLU - with the α-helix regions indicated above the alignment columns.)

How can this family be characterized? Some ideas for characterizing a family:

- Exemplary sequence
- Consensus sequence
- Regular expression (Prosite), e.g. for the region

  LGB2_LUPLU  ...FNA--NIPKH...
  GLB1_GLYDI  ...IAGADNGAGV...

  ...[FI]-[N]-[]-x(1,2)-N-[I]-[P]-[K]-[H]...

- HMM?

7.4.1 Simple HMM

How to represent this?

HBA_HUMAN   ...VGA--HAGEY...
HBB_HUMAN   ...V----NVDEV...
MYG_PHYCA   ...VEA--DVAGH...
GLB3_CHITP  ...VKG------D...
GLB5_PETMA  ...VYS--TYETS...
LGB2_LUPLU  ...FNA--NIPKH...
GLB1_GLYDI  ...IAGADNGAGV...
"Matches":     ***  *****

We first consider a simple HMM that is equivalent to a PSSM (Position Specific Score Matrix): a chain of states, one per match column. (The amino acids listed at each state of the figure have a higher emission probability.)

7.4.2 Insert-states

We introduce so-called insert-states that emit symbols based on their background probabilities.

(Figure: the chain of match states from the simple HMM, framed by Begin and End, with insert-states attached between consecutive match states.)

This allows us to model segments of sequence that lie outside of conserved domains.

7.4.3 Delete-states

We introduce so-called delete-states that are silent and do not emit any symbols.

(Figure: the same chain of match states, framed by Begin and End, with silent delete-states that allow individual match states to be skipped.)

This allows us to model the absence of individual domains.

7.4.4 Topology of a profile-HMM

The result is a so-called profile HMM:

(Figure: Begin and End connected by a chain of match-states, with one insert-state and one delete-state associated with each position; legend: match-state, insert-state, delete-state.)

7.4.5 Design of a profile-HMM

Suppose we are given a multiple alignment of a family of sequences. First we must decide which positions are to be modeled as match-states and which as insert-states. Rule-of-thumb: columns with more than 50% gaps should be modeled as insert-states.

We determine the transition and emission probabilities simply by counting the observed transitions A_kl and emissions E_k(b):

a_kl = A_kl / sum_{l'} A_{kl'}   and   e_k(b) = E_k(b) / sum_{b'} E_k(b').

Obviously, it may happen that certain transitions or emissions do not appear in the training data, and thus we use the Laplace rule and add 1 to each count.
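To illustrate Section 7.4.5, here is a small Python sketch (an added illustration, not the lecture's implementation) that applies the 50%-gap rule-of-thumb to pick match columns from a multiple alignment and estimates the match-state emission probabilities with the Laplace +1 rule; transition probabilities are omitted for brevity, and the toy alignment is the excerpt from Section 7.4.1.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

ALIGNMENT = [              # toy globin excerpt from Section 7.4.1
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]

def match_columns(alignment):
    """Columns with at most 50% gaps are modeled as match-states (rule-of-thumb)."""
    n = len(alignment)
    return [j for j in range(len(alignment[0]))
            if sum(row[j] == '-' for row in alignment) <= n / 2]

def match_emissions(alignment, columns):
    """Laplace rule: e_k(b) = (count of b in column k + 1) / (non-gap residues in column k + 20)."""
    emissions = []
    for j in columns:
        residues = [row[j] for row in alignment if row[j] != '-']
        total = len(residues) + len(AMINO_ACIDS)
        emissions.append({b: (residues.count(b) + 1) / total for b in AMINO_ACIDS})
    return emissions

cols = match_columns(ALIGNMENT)
print(cols)                                        # the eight starred columns: [0, 1, 2, 5, 6, 7, 8, 9]
print(match_emissions(ALIGNMENT, cols)[0]['V'])    # 'V' dominates the first match column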