Hidden Markov Models: The Dishonest Casino, Evaluation, Decoding, and Learning

Example: The Dishonest Casino (Hidden Markov Models; Durbin and Eddy, chapter 3)

Game:
1. You bet $1.
2. You roll.
3. The casino player rolls.
4. The highest number wins $2.

The casino has two dice:
Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2
The casino player switches between the fair and the loaded die (not too often, and not for too long).

The dishonest casino model:
States: FAIR, LOADED.
Transitions: FAIR to FAIR 0.95, FAIR to LOADED 0.05, LOADED to LOADED 0.9, LOADED to FAIR 0.1.
Emissions: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6; P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2.

Question # 1: Evaluation
GIVEN: A sequence of rolls by the casino player: 4556464646363666664666366636663665565546356344
QUESTION: How likely is this sequence, given our model of how the casino works? (Prob = 1.3 × 10^-35)
This is the EVALUATION problem in HMMs.

Question # 2: Decoding
GIVEN: The same sequence of rolls by the casino player.
QUESTION: What portion of the sequence was generated with the fair die, and what portion with the loaded die? (FAIR LOADED FAIR)
This is the DECODING question in HMMs.

Question # 3: Learning
GIVEN: A sequence of rolls by the casino player.
QUESTION: How does the casino player work: how loaded is the loaded die? How fair is the fair die? How often does the casino player change from fair to loaded, and back?
This is the LEARNING question in HMMs.
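To make the casino model above concrete, here is a minimal sketch of its parameters in Python with NumPy. The variable names (STATES, SYMBOLS, trans, emit, init) are illustrative choices, not part of the lecture; the numbers are the ones quoted above.

```python
import numpy as np

# States of the dishonest-casino HMM (names chosen for illustration).
STATES = ["FAIR", "LOADED"]          # K = 2 hidden states
SYMBOLS = [1, 2, 3, 4, 5, 6]         # die faces (the alphabet)

# Transition probabilities a_kr: rows = current state, columns = next state.
trans = np.array([[0.95, 0.05],      # FAIR   -> FAIR, LOADED
                  [0.10, 0.90]])     # LOADED -> FAIR, LOADED

# Emission probabilities e_k(b): rows = state, columns = faces 1..6.
emit = np.array([[1/6] * 6,                               # fair die
                 [1/10, 1/10, 1/10, 1/10, 1/10, 1/2]])    # loaded die

# Initial probabilities a_0k (the worked example later uses 1/2 each).
init = np.array([0.5, 0.5])
```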

The dishonest casino model (as above), stated formally. Definition of a hidden Markov model:
Alphabet Σ = {b_1, b_2, ..., b_M}.
Set of states Q = {1, ..., K} (K = |Q|).
Transition probabilities between any two states: a_ij = transition probability from state i to state j, with a_i1 + ... + a_iK = 1 for all states i.
Initial probabilities a_0i, with a_01 + ... + a_0K = 1.
Emission probabilities within each state: e_k(b) = P(x_i = b | π_i = k), with e_k(b_1) + ... + e_k(b_M) = 1.

Hidden states and observed sequence:
At time step t, π_t denotes the (hidden) state in the Markov chain and x_t denotes the symbol emitted in state π_t.
A path of length N is π_1, π_2, ..., π_N. An observed sequence of length N is x_1, x_2, ..., x_N.

An HMM is memory-less:
At time step t, the only thing that affects the next state is the current state π_t:
P(π_{t+1} = k | whatever happened so far) = P(π_{t+1} = k | π_1, ..., π_t, x_1, ..., x_t) = P(π_{t+1} = k | π_t)

A parse of a sequence:
Given a sequence x = x_1 ... x_N, a parse of x is a sequence of states π = π_1, ..., π_N.

Likelihood of a parse:
Given a sequence x = x_1 ... x_N and a parse π = π_1, ..., π_N, how likely is the parse (given our HMM)?
P(x, π) = P(x_1, ..., x_N, π_1, ..., π_N)
= P(x_N, π_N | x_1 ... x_{N-1}, π_1, ..., π_{N-1}) P(x_1 ... x_{N-1}, π_1, ..., π_{N-1})
= P(x_N, π_N | π_{N-1}) P(x_1 ... x_{N-1}, π_1, ..., π_{N-1})
= ...
= P(x_N, π_N | π_{N-1}) P(x_{N-1}, π_{N-1} | π_{N-2}) ... P(x_2, π_2 | π_1) P(x_1, π_1)
= P(x_N | π_N) P(π_N | π_{N-1}) ... P(x_2 | π_2) P(π_2 | π_1) P(x_1 | π_1) P(π_1)
= a_{0,π_1} a_{π_1,π_2} ... a_{π_{N-1},π_N} e_{π_1}(x_1) ... e_{π_N}(x_N)
= ∏_{i=1..N} a_{π_{i-1},π_i} e_{π_i}(x_i)   (where π_0 = 0 denotes the start)
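The product formula above translates directly into code. Below is a small sketch (reusing the trans, emit, init arrays from the earlier snippet; the function name joint_likelihood is illustrative) that computes the joint likelihood P(x, π) of a sequence and a parse.

```python
def joint_likelihood(x, path, init, trans, emit):
    """P(x, pi) = a_{0,pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}.

    x    : list of die faces 1..6
    path : list of state indices (0 = FAIR, 1 = LOADED), same length as x
    """
    p = init[path[0]]                      # a_{0, pi_1}
    for i, (sym, k) in enumerate(zip(x, path)):
        p *= emit[k, sym - 1]              # e_{pi_i}(x_i)
        if i + 1 < len(path):
            p *= trans[k, path[i + 1]]     # a_{pi_i, pi_{i+1}}
    return p
```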

Example: the dishonest casino.
What is the probability of the sequence of rolls x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4 and the parse π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair? (Say the initial probabilities are a_{0,Fair} = 1/2, a_{0,Loaded} = 1/2.)
1/2 × P(1|Fair) P(Fair|Fair) P(2|Fair) P(Fair|Fair) ... P(4|Fair) = 1/2 × (1/6)^10 × (0.95)^9 ≈ 5.2 × 10^-9

So the likelihood that the die is fair in all this run is about 5.2 × 10^-9. What about π = Loaded, Loaded, ..., Loaded (ten times)?
1/2 × P(1|Loaded) P(Loaded|Loaded) ... P(4|Loaded) = 1/2 × (1/10)^8 × (1/2)^2 × (0.9)^9 ≈ 4.8 × 10^-10
Therefore, it is more likely that the die is fair all the way than loaded all the way.

Now let the sequence of rolls be x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6, and consider π = F, F, ..., F:
P(x, π) = 1/2 × (1/6)^10 × (0.95)^9 ≈ 5.2 × 10^-9 (same as before)
And for π = L, L, ..., L:
P(x, π) = 1/2 × (1/10)^4 × (1/2)^6 × (0.9)^9 ≈ 3.0 × 10^-7
So this observed sequence is ~100 times more likely if a loaded die is used.

Clarification of notation:
P[x | M] is the probability that sequence x was generated by the model. The model is: architecture (number of states, etc.) plus parameters θ = (a_ij, e_i(·)). So P[x | M] is the same as P[x | θ], and as P[x], when the architecture and the parameters, respectively, are implied. Similarly, P[x, π | M], P[x, π | θ] and P[x, π] are the same when the architecture and the parameters are implied. In the LEARNING problem we write P[x | θ] to emphasize that we are seeking the θ* that maximizes P[x | θ].

What we know: given a sequence x = x_1, ..., x_N and a parse π = π_1, ..., π_N, we know how to compute how likely the parse is: P(x, π).

What we would like to know:
1. Evaluation: GIVEN an HMM M and a sequence x, FIND Prob[x | M].
2. Decoding: GIVEN an HMM M and a sequence x, FIND the sequence π of states that maximizes P[x, π | M].
3. Learning: GIVEN an HMM M with unspecified transition/emission probabilities and a sequence x, FIND parameters θ = (e_i(·), a_ij) that maximize P[x | θ].
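As a quick check of the worked example above, the joint_likelihood sketch reproduces the two numbers for the all-fair and all-loaded parses (using the first roll sequence as reconstructed here):

```python
x = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
all_fair   = [0] * 10
all_loaded = [1] * 10

print(joint_likelihood(x, all_fair,   init, trans, emit))   # ~5.2e-09
print(joint_likelihood(x, all_loaded, init, trans, emit))   # ~4.8e-10
```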

Problem: Decoding.
GIVEN x = x_1, x_2, ..., x_N, we want to find π = π_1, ..., π_N such that P[x, π] is maximized:
π* = argmax_π P[x, π]
i.e. maximize a_{0,π_1} e_{π_1}(x_1) a_{π_1,π_2} ... a_{π_{N-1},π_N} e_{π_N}(x_N).
This is the problem of finding the best parse of a sequence, and we can use dynamic programming. Let
V_k(i) = max over {π_1, ..., π_{i-1}} of P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k]
= the probability of the maximum-probability path ending at state π_i = k.

Decoding, main idea:
Inductive assumption: given V_k(i) = max_{π_1..π_{i-1}} P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k], what is V_r(i+1)?
V_r(i+1) = max_{π_1..π_i} P[x_1 ... x_i, π_1, ..., π_i, x_{i+1}, π_{i+1} = r]
= max_{π_1..π_i} P(x_{i+1}, π_{i+1} = r | x_1 ... x_i, π_1, ..., π_i) P[x_1 ... x_i, π_1, ..., π_i]
= max_{π_1..π_i} P(x_{i+1}, π_{i+1} = r | π_i) P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i]
= max_k [ P(x_{i+1}, π_{i+1} = r | π_i = k) max_{π_1..π_{i-1}} P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k] ]
= max_k e_r(x_{i+1}) a_kr V_k(i)
= e_r(x_{i+1}) max_k a_kr V_k(i)

The Viterbi Algorithm. Input: x = x_1, ..., x_N.
Initialization: V_0(0) = 1 (0 is the imaginary first position); V_k(0) = 0, for all k > 0.
Iteration: V_j(i) = e_j(x_i) max_k a_kj V_k(i-1); Ptr_j(i) = argmax_k a_kj V_k(i-1).
Termination: P(x, π*) = max_k V_k(N), or P(x, π*) = max_k a_k0 V_k(N) with an end state.
Traceback: π*_N = argmax_k V_k(N); π*_{i-1} = Ptr_{π_i}(i).
Filling in the matrix V_j(i) is similar to aligning a set of states to a sequence.
Time: O(K²N); Space: O(KN).
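Here is a sketch of the Viterbi recursion in log space (anticipating the underflow fix discussed next). It assumes the trans, emit, init arrays defined earlier; the function name viterbi and its return convention are illustrative, not from the lecture.

```python
import numpy as np

def viterbi(x, init, trans, emit):
    """Most probable state path for observations x (die faces 1..6), in log space."""
    K, N = trans.shape[0], len(x)
    logV = np.full((K, N), -np.inf)          # logV[k, i] = log V_k(i+1)
    ptr = np.zeros((K, N), dtype=int)

    logV[:, 0] = np.log(init) + np.log(emit[:, x[0] - 1])
    for i in range(1, N):
        for j in range(K):
            scores = logV[:, i - 1] + np.log(trans[:, j])   # V_k(i-1) * a_kj, in logs
            ptr[j, i] = np.argmax(scores)
            logV[j, i] = np.log(emit[j, x[i] - 1]) + scores[ptr[j, i]]

    # Traceback of the best path.
    path = [int(np.argmax(logV[:, -1]))]
    for i in range(N - 1, 0, -1):
        path.append(int(ptr[path[-1], i]))
    return path[::-1], float(np.max(logV[:, -1]))   # state path, log P(x, pi*)
```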

The edit graph for the decoding problem:
The decoding problem is essentially finding a longest path in a directed acyclic graph (DAG).

Viterbi Algorithm, a practical detail:
Underflows are a significant problem:
P[x_1, ..., x_i, π_1, ..., π_i] = a_{0,π_1} a_{π_1,π_2} ... a_{π_{i-1},π_i} e_{π_1}(x_1) ... e_{π_i}(x_i)
These numbers become extremely small: underflow. Solution: take the logs of all values:
V_r(i) = log e_r(x_i) + max_k [ V_k(i-1) + log a_kr ]

Example:
Let x be a sequence with a portion that has 1/6 6's, followed by a portion with 1/2 6's:
x = 34563456345 66636465666364656
Then it is not hard to show that the optimal parse is: FFF...F LLL...L

Example:
Observed sequence: x =,,,6,6. P(x) = 0.00095337.
Best 8 paths:
LLLLL 0.00008
FFFFF 0.000054
FFFLL 0.000048
FFLLL 0.000049
FLLLL 0.0000089
FFFFL 0.0000083
LLLLF 0.000008
LFFFF 0.000007
Conditional probabilities given x:
Position      F           L
1         0.5029927   0.4970073
2         0.4752192   0.5247808
3         0.4116375   0.5883625
4         0.2945388   0.7054612
5         0.2665376   0.7334624

Problem: Evaluation. Finding the probability that a sequence is generated by the model.

Generating a sequence by the model:
Given an HMM, we can generate a sequence of length n as follows:
1. Start at state π_1 according to probability a_{0,π_1}.
2. Emit letter x_1 according to probability e_{π_1}(x_1).
3. Go to state π_2 according to probability a_{π_1,π_2}.
4. ... until emitting x_n.
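The generation procedure just described is easy to sketch in code. This sample function (again reusing the earlier arrays; the name generate is illustrative) draws a state path and an emitted sequence of length n.

```python
import numpy as np

def generate(n, init, trans, emit, rng=None):
    """Sample (state path, emitted sequence) of length n from the HMM."""
    rng = rng or np.random.default_rng()
    K, M = emit.shape
    path, seq = [], []
    state = rng.choice(K, p=init)                          # 1. start state ~ a_0k
    for _ in range(n):
        path.append(state)
        seq.append(rng.choice(M, p=emit[state]) + 1)       # 2. emit symbol ~ e_k(b)
        state = rng.choice(K, p=trans[state])              # 3. next state ~ a_kr
    return path, seq
```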

The Forward Algorithm:
We want to calculate P(x), the probability of x given the HMM (x = x_1, ..., x_N), by summing over all possible ways of generating x:
P(x) = Σ_{all paths π} P(x, π)
To avoid summing over an exponential number of paths π, define the forward probability
f_k(i) = P(x_1, ..., x_i, π_i = k)

The Forward Algorithm, derivation:
f_k(i) = P(x_1 ... x_i, π_i = k)
= Σ_{π_1 ... π_{i-1}} P(x_1 ... x_{i-1}, π_1, ..., π_{i-1}, π_i = k) e_k(x_i)
= Σ_r Σ_{π_1 ... π_{i-2}} P(x_1 ... x_{i-1}, π_1, ..., π_{i-2}, π_{i-1} = r) a_rk e_k(x_i)
= Σ_r P(x_1 ... x_{i-1}, π_{i-1} = r) a_rk e_k(x_i)
= e_k(x_i) Σ_r f_r(i-1) a_rk

The Forward Algorithm, a dynamic programming algorithm:
Initialization: f_0(0) = 1; f_k(0) = 0, for all k > 0.
Iteration: f_k(i) = e_k(x_i) Σ_r f_r(i-1) a_rk
Termination: P(x) = Σ_k f_k(N).
If our model has an end state, the termination becomes P(x) = Σ_k f_k(N) a_k0, where a_k0 is the probability that the terminating state is k (usually a_k0 = a_0k).

Relation between Forward and Viterbi:
VITERBI: V_0(0) = 1; V_k(0) = 0 for all k > 0; V_j(i) = e_j(x_i) max_k V_k(i-1) a_kj; P(x, π*) = max_k V_k(N).
FORWARD: f_0(0) = 1; f_k(0) = 0 for all k > 0; f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl; P(x) = Σ_k f_k(N).

The most probable state:
Given a sequence x, what is the most likely state that emitted x_i? In other words, we want to compute P(π_i = k | x).
Example: the dishonest casino. Say x = 34636663646634634. The most likely path is π = FFF...F; however, the marked letters are more likely to be L.
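Returning to the forward recursion above: a direct sketch in probability space (no rescaling yet; scaling is discussed below), assuming the arrays defined earlier.

```python
import numpy as np

def forward(x, init, trans, emit):
    """Forward probabilities f[k, i] = P(x_1..x_{i+1}, pi_{i+1} = k), and P(x)."""
    K, N = trans.shape[0], len(x)
    f = np.zeros((K, N))
    f[:, 0] = init * emit[:, x[0] - 1]                        # f_k(1) = a_0k e_k(x_1)
    for i in range(1, N):
        f[:, i] = emit[:, x[i] - 1] * (f[:, i - 1] @ trans)   # e_k(x_i) * sum_r f_r(i-1) a_rk
    return f, f[:, -1].sum()                                  # P(x) = sum_k f_k(N)
```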

Motivation for the Backward Algorithm:
We want to compute P(π_i = k | x). We start by computing
P(π_i = k, x) = P(x_1 ... x_i, π_i = k, x_{i+1} ... x_N)
= P(x_1 ... x_i, π_i = k) P(x_{i+1} ... x_N | x_1 ... x_i, π_i = k)
= P(x_1 ... x_i, π_i = k) P(x_{i+1} ... x_N | π_i = k)
Then P(π_i = k | x) = P(π_i = k, x) / P(x) = f_k(i) b_k(i) / P(x).

The Backward Algorithm, derivation. Define the backward probability:
b_k(i) = P(x_{i+1} ... x_N | π_i = k)
= Σ_{π_{i+1} ... π_N} P(x_{i+1}, x_{i+2}, ..., x_N, π_{i+1}, ..., π_N | π_i = k)
= Σ_r Σ_{π_{i+2} ... π_N} P(x_{i+1}, x_{i+2}, ..., x_N, π_{i+1} = r, π_{i+2}, ..., π_N | π_i = k)
= Σ_r e_r(x_{i+1}) a_kr Σ_{π_{i+2} ... π_N} P(x_{i+2}, ..., x_N, π_{i+2}, ..., π_N | π_{i+1} = r)
= Σ_r e_r(x_{i+1}) a_kr b_r(i+1)

The Backward Algorithm, a dynamic programming algorithm for b_k(i):
Initialization: b_k(N) = 1, for all k (or b_k(N) = a_k0, for all k, in case of an end state).
Iteration: b_k(i) = Σ_r e_r(x_{i+1}) a_kr b_r(i+1)
Termination: P(x) = Σ_r a_0r e_r(x_1) b_r(1)

Computational complexity:
The running time and space required for the Forward and Backward algorithms are Time: O(K²N) and Space: O(KN).

Scaling:
A useful implementation technique to avoid underflows is to rescale at each position by multiplying by a constant. Define scaled forward variables f'_k(i) = f_k(i) / (s_1 s_2 ... s_i), where s_i is the scale factor for position i. The recursion becomes f'_k(i) = (1/s_i) e_k(x_i) Σ_r f'_r(i-1) a_rk. Choosing the scaling factors s_i such that Σ_k f'_k(i) = 1 at each position leads to P(x) = ∏_i s_i (so log P(x) = Σ_i log s_i). Scaling for the backward probabilities uses the same factors: b'_k(i) = b_k(i) / (s_{i+1} ... s_N).
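A matching sketch of the backward recursion (unscaled, no end state), reusing the same setup as the forward sketch:

```python
import numpy as np

def backward(x, init, trans, emit):
    """Backward probabilities b[k, i] = P(x_{i+2}..x_N | pi_{i+1} = k)."""
    K, N = trans.shape[0], len(x)
    b = np.zeros((K, N))
    b[:, -1] = 1.0                                                 # b_k(N) = 1
    for i in range(N - 2, -1, -1):
        b[:, i] = trans @ (emit[:, x[i + 1] - 1] * b[:, i + 1])    # sum_r a_kr e_r(x_{i+1}) b_r(i+1)
    return b
```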

Posterior Decoding:
We can now calculate
P(π_i = k | x) = f_k(i) b_k(i) / P(x)
Then we can ask: what is the most likely state at position i of sequence x? Using posterior decoding we can now define, for each position i,
π̂_i = argmax_k P(π_i = k | x)
For each state k, posterior decoding gives us a curve of the probability of state k at each position, given the sequence x. That is sometimes more informative than the Viterbi path π*. Posterior decoding may give an invalid sequence of states. Why? (Two states that are adjacent in π̂ may have transition probability 0 between them.)

Viterbi vs. posterior decoding:
A class takes a multiple-choice test. How does the lazy professor construct the answer key?
Viterbi approach: use the answers of the best student.
Posterior decoding: majority vote.

Viterbi, Forward, Backward:
VITERBI: V_0(0) = 1; V_k(0) = 0, for all k > 0; V_r(i) = e_r(x_i) max_k V_k(i-1) a_kr; P(x, π*) = max_k V_k(N).
FORWARD: f_0(0) = 1; f_k(0) = 0, for all k > 0; f_r(i) = e_r(x_i) Σ_k f_k(i-1) a_kr; P(x) = Σ_k f_k(N) a_k0.
BACKWARD: b_k(N) = a_k0, for all k; b_r(i) = Σ_k e_k(x_{i+1}) a_rk b_k(i+1); P(x) = Σ_k a_0k e_k(x_1) b_k(1).

A modeling example: CpG islands in DNA sequences.

Methylation & silencing:
Methylation is the addition of CH_3 to C nucleotides. It silences genes in the region. CG (denoted CpG) often mutates to TG when methylated. Methylation is inherited during cell division.
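Tying the forward and backward sketches together gives posterior decoding as defined above; the function name posterior_decode is an illustrative choice, not from the lecture.

```python
def posterior_decode(x, init, trans, emit):
    """pi_hat_i = argmax_k P(pi_i = k | x), plus the full posterior matrix."""
    f, px = forward(x, init, trans, emit)
    b = backward(x, init, trans, emit)
    posterior = f * b / px                   # P(pi_i = k | x) = f_k(i) b_k(i) / P(x)
    return posterior.argmax(axis=0), posterior
```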

CpG islands:
CpG dinucleotides in the genome are frequently methylated: C to methyl-C to T. Methylation is often suppressed around genes and promoters: CpG islands. In CpG islands, CG is more frequent and other dinucleotides (AA, AG, AT, ...) have different frequencies. Problem: detect CpG islands.

A model of CpG islands, architecture:
[Figure: the model has a '+' state and a '-' state for each nucleotide (A+, C+, G+, T+ inside CpG islands; A-, C-, G-, T- outside).]

A model of CpG islands, transitions:
How do we estimate the parameters of the model? Emission probabilities are 1/0 (each state emits its own nucleotide with probability 1).

1. Transition probabilities within CpG islands (established from known CpG islands):

  +     A      C      G      T
  A   .180   .274   .426   .120
  C   .171   .368   .274   .188
  G   .161   .339   .375   .125
  T   .079   .355   .384   .182

2. Transition probabilities within non-CpG islands (established from non-CpG islands):

  -     A      C      G      T
  A   .300   .205   .285   .210
  C   .322   .298   .078   .302
  G   .248   .246   .298   .208
  T   .177   .239   .292   .292

What about transitions between (+) and (-) states? Their probabilities affect the average length of a CpG island and the average separation between two CpG islands.
[Figure: a two-state chain with self-loop probabilities p and q and exit probabilities 1-p and 1-q.]
Length distribution of region X:
P[L_X = 1] = 1-p
P[L_X = 2] = p(1-p)
...
P[L_X = k] = p^(k-1) (1-p)
E[L_X] = 1/(1-p): a geometric distribution with mean 1/(1-p).

A model of CpG islands, transitions (continued):
There is no reason to favor exiting or entering the (+) and (-) regions at a particular nucleotide. Estimate the average length L_CpG of a CpG island: L_CpG = 1/(1-p), so p = 1 - 1/L_CpG.
For each pair of (+) states k, r: a_kr = p × a_kr^(+).
For each (+) state k and (-) state r: a_kr = (1-p) × a_0r^(-). Do the same for the (-) states.
A problem with this model: CpG islands don't have a geometric length distribution. This is a defect of HMMs, a price we pay for ease of analysis and efficient computation.
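As a quick numeric illustration of the length calculation above, with a hypothetical average island length (the value 300 is an assumption for illustration, not from the lecture):

```python
L_cpg = 300.0                 # assumed average CpG-island length (illustrative)
p = 1 - 1 / L_cpg             # self-transition probability among (+) states

def p_len(k):
    """Geometric length distribution implied by the model: P[L_X = k]."""
    return p ** (k - 1) * (1 - p)

mean_len = 1 / (1 - p)        # recovers L_cpg
```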

Using the model:
Given a DNA sequence x, the Viterbi algorithm predicts locations of CpG islands: given a nucleotide x_i (say x_i = A), the Viterbi parse tells whether x_i is in a CpG island in the most likely parse. Using the Forward/Backward algorithms we can calculate
P(x_i is in a CpG island) = P(π_i = A+ | x)
Posterior decoding can assign locally optimal predictions of CpG islands: π̂_i = argmax_k P(π_i = k | x).

Posterior decoding:
Results of applying posterior decoding to a part of a human chromosome. [Figures: posterior decoding; Viterbi decoding; sliding windows (sizes = 100 and 200).]

[Figures: sliding windows (sizes = 300, 400, 600, 1000).]

Modeling CpG islands with silent states. [Figure.]

What if a new genome comes?
Suppose we just sequenced the porcupine genome. We know CpG islands play the same role in this genome; however, we have no known CpG islands for porcupines, and we suspect the frequency and characteristics of CpG islands are quite different in porcupines. How do we adjust the parameters in our model? LEARNING.

Two learning scenarios:
1. Estimation when the right answer is known. Examples: GIVEN a genomic region x = x_1 ... x_N where we have good (experimental) annotations of the CpG islands; GIVEN the casino player allows us to observe him one evening as he changes the dice and produces 10,000 rolls.
2. Estimation when the right answer is unknown. Examples: GIVEN the porcupine genome, where we don't know how frequent the CpG islands are, nor do we know their composition; GIVEN 10,000 rolls of the casino player, but we don't see when he changes dice.
GOAL: update the parameters θ of the model to maximize P(x | θ).

When the right answer is known:
Given x = x_1 ... x_N for which π = π_1 ... π_N is known, define:
A_kr = # of times the k to r transition occurs in π
E_k(b) = # of times state k in π emits b in x
The maximum likelihood estimates of the parameters are:
a_kr = A_kr / Σ_i A_ki
e_k(b) = E_k(b) / Σ_c E_k(c)

Intuition: when we know the underlying states, the best estimate is the average frequency of transitions and emissions that occur in the training data.
Drawback: given little data, there may be overfitting: P(x | θ) is maximized, but θ is unreasonable; 0 probabilities are VERY BAD.
Example: suppose we observe
x = 5 6 3 6 3 5 3 4 6 3 6
π = F F F F F F F F F F F F F F F L L L L L
Then: a_FF = 14/15, a_FL = 1/15, a_LL = 1, a_LF = 0; e_F(4) = 0; e_L(4) = 0.

Pseudocounts:
The solution for small training sets is to add pseudocounts:
A_kr = # of times the k to r state transition occurs, plus t_kr
E_k(b) = # of times state k emits symbol b in x, plus t_k(b)
t_kr and t_k(b) are pseudocounts representing our prior belief. Larger pseudocounts mean a strong prior belief; small pseudocounts (ε < 1) are used just to avoid 0 probabilities.
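Here is a sketch of the counting-based estimation described above, with the pseudocounts folded in as initial counts (the closed-form formulas with pseudocounts appear next); the function name estimate_params and the default pseudocount values are illustrative assumptions.

```python
import numpy as np

def estimate_params(x, path, K, M, t_trans=1.0, t_emit=1.0):
    """ML estimates of a_kr and e_k(b) from a labeled sequence, with pseudocounts."""
    A = np.full((K, K), t_trans)             # A_kr, initialized to the pseudocount t_kr
    E = np.full((K, M), t_emit)              # E_k(b), initialized to the pseudocount t_k(b)
    for i in range(len(x) - 1):
        A[path[i], path[i + 1]] += 1         # count k -> r transitions in pi
    for sym, k in zip(x, path):
        E[k, sym - 1] += 1                   # count emissions of b from state k
    return A / A.sum(axis=1, keepdims=True), E / E.sum(axis=1, keepdims=True)
```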

When the right answer is unknown We don t know the actual A kr, E k (b) Idea: Initialize the model Compute A kr, E k (b) Update the parameters of the model, based on A kr, E k (b) Repeat until convergence Two algorithms: Baum-Welch, Viterbi training When the right answer is unknown Given x = x x N for which the true π = π π N is unknown, EXPECTATION MAXIMIZATION (EM). Pick model parameters θ. Estimate A kr, E k (b) (Expectation) 3. Update θ according to A kr, E k (b) (Maximization) 4. Repeat & 3, until convergence To estimate A kr : Estimating new parameters At each position of sequence x, find probability that transition k r is used: P(π i = k, π i+ = r, x x N ) Q P(π i = k, π i+ = r x) = = P(x) P(x) Q = P(x x i, π i = k, π i+ = r, x i+ x N ) = P(π i+ = r, x i+ x N π i = k) P(x x i, π i = k) = P(π i+ = r, x i+ x i+ x N π i = k) f k (i) = P(x i+ x N π i+ = r) P(x i+ π i+ = r) P(π i+ = r π i = k) f k (i) = b r (i+) e r (x i+ ) a kr f k (i) f k (i) a kr e r (x i+ ) b r (i+) So: P(π i = k, π i+ = r x, θ) = P(x θ) Estimating new parameters f k (i) a kr e r (x i+ ) b r (i+) P(π i = k, π i+ = r x, θ) = P(x θ) f k (i) x x i- k x i a kr e r (x i+ ) Summing over all positions gives the expected number of times the transition is used: f k (i) a kr e r (x i+ ) b r (i+) A kr = Σ i P(π i = k, π i+ = r x, θ) = Σ i P(x θ) r x i+ b r (i+) x i+ x N Emission probabilities: Estimating new parameters Σ {i xi = b} f k(i) b k (i) E k (b) = P(x θ) When you have multiple sequences: sum over them. The Baum-Welch Algorithm Pick the best-guess for model parameters (or random). Forward. Backward 3. Calculate A kr, E k (b) + pseudocounts 4. Calculate new model parameters a kr, e k (b) 5. Calculate new log-likelihood P(x θ) Likelihood guaranteed to increase (EM) Repeat until P(x θ) does not change much 3

The Baum-Welch Algorithm, comments:
Time complexity: (# iterations) × O(K²N).
Baum-Welch is guaranteed to increase the log-likelihood P(x | θ), but it is not guaranteed to find the globally best parameters: it converges to a local optimum. Too many parameters / too large a model leads to overfitting.

Alternative: Viterbi Training.
Initialization: same.
Iteration:
1. Perform Viterbi to find π*.
2. Calculate A_kr, E_k(b) according to π* (+ pseudocounts).
3. Calculate the new parameters a_kr, e_k(b).
Until convergence.
Comments: convergence is guaranteed (why?). It does not maximize P(x | θ); it maximizes P(x | θ, π*) instead. In general, it gives worse performance than Baum-Welch. (A code sketch of this loop follows at the end of this section.)

HMM variants: higher-order HMMs.
How do we model memory of more than one time step?
First-order HMM: P(π_{i+1} = r | π_i = k) (a_kr)
Second-order HMM: P(π_{i+1} = r | π_i = k, π_{i-1} = j) (a_jkr)
A second-order HMM with K states is equivalent to a first-order HMM with K² states.
[Figure: a two-state H/T second-order model with transitions a_HT(prev = H), a_HT(prev = T), a_TH(prev = H), a_TH(prev = T), redrawn as a first-order model over states HH, HT, TH, TT with transitions a_HHT, a_HTH, a_THH, a_THT, a_TTH, a_HTT. Note that not all transitions are allowed!]

Modeling the duration of states:
The length distribution of a region X is P[L_X = k] = p^(k-1)(1-p) with E[L_X] = 1/(1-p), a geometric distribution with mean 1/(1-p). This is a significant disadvantage of HMMs. Several solutions exist for modeling different length distributions.

Solution 1: Chain several states.
Disadvantage: still very inflexible; L_X = C + geometric with mean 1/(1-p).
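A compact sketch of the Viterbi-training loop referenced above, reusing the viterbi and estimate_params sketches (the fixed iteration count is an illustrative stopping rule; in practice one iterates until π* stops changing):

```python
def viterbi_training(x, init, trans, emit, n_iter=20):
    """Iteratively re-estimate the parameters from the current best path pi*."""
    K, M = emit.shape
    for _ in range(n_iter):
        path, _ = viterbi(x, init, trans, emit)               # 1. find pi*
        trans, emit = estimate_params(x, path, K, M)          # 2.-3. counts from pi* (+ pseudocounts)
    return trans, emit
```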

Solution 2: Negative binomial distribution.
Use n copies of state X. Suppose the duration in X is m, where: during the first m-1 turns, exactly n-1 arrows to the next state are followed; in the m-th turn, an arrow to the next state is followed. Then
P(L_X = m) = C(m-1, n-1) (1-p)^n p^(m-n)
where C(m-1, n-1) is the binomial coefficient (a negative binomial distribution).

Example: genes in prokaryotes.
EasyGene, a prokaryotic gene-finder (Larsen TS, Krogh A): codons are modeled with 3 looped triplets of states (negative binomial with n = 3).

Solution 3: Duration modeling.
Upon entering a state:
1. Choose a duration d, according to a probability distribution.
2. Generate d letters according to the emission probabilities.
3. Take a transition to the next state according to the transition probabilities.
Disadvantage: increase in complexity. Time: O(D²); Space: O(D), where D = maximum duration of a state.
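For reference, the duration distribution produced by chaining n copies of a state (the negative binomial formula above) as a one-line sketch; the function name is illustrative.

```python
from math import comb

def neg_binom_duration(m, n, p):
    """P(L_X = m) for n chained copies of a state with self-loop probability p."""
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)
```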