Plan for today! Part 1: (Hidden) Markov models! Part 2: String matching and read mapping! 2.1 Exact algorithms! 2.2 Heuristic methods for approximate search
(Hidden) Markov models
Why consider probabilistic models of sequences?! Classification! Machine learning! Data mining: which category (family, ...) does the sequence belong to?! Estimating the likelihood of an observation! Simulating temporal processes! Average-case analysis of algorithms!
Knowledge assumptions! Basics of probability theory! Conditional probability! Bayes formula: P(A|B) = P(A & B) / P(B) = P(B|A) P(A) / P(B)
Probabilistic model of sequences! Simplest model: all letters are i.i.d. Bernoulli distributed random variables e.g. P(A)=P(C)=P(G)=P(T)=0.25 or P(A)=P(T)=0.2 and P(C)=P(G)=0.3
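A minimal sketch of how such an i.i.d. model is used, assuming the second set of probabilities above (the input sequence is just a made-up illustration):

```python
import math

# i.i.d. model with P(A) = P(T) = 0.2 and P(C) = P(G) = 0.3 (second example above)
p = {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}

def iid_log_probability(seq):
    """log P(seq) = sum of log p(letter), letters being independent and identically distributed."""
    return sum(math.log(p[ch]) for ch in seq)

print(iid_log_probability("ACGTACGT"))   # hypothetical toy sequence
```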
Markov chains (models)! Markov chain of order k: P(x_i) depends on x_{i-1}, x_{i-2}, ..., x_{i-k} (А.А. Марков / A. A. Markov, 1856-1922)
Markov chains: example! Ex: assume three letters {Rainy, Cloudy, Sunny} and a Markov chain of order 1 (first order) with transition probabilities (row = today, column = tomorrow): Rainy: 0.4 to Rainy, 0.3 to Cloudy, 0.3 to Sunny! Cloudy: 0.2 to Rainy, 0.6 to Cloudy, 0.2 to Sunny! Sunny: 0.1 to Rainy, 0.1 to Cloudy, 0.8 to Sunny
Markov chains (cont)! If the weather today is Sunny, then the probability of observing S-S-R-R-S-C-S on the following days is P[SSSRRSCS | Model] = P[S] · P[S|S]^2 · P[R|S] · P[R|R] · P[S|R] · P[C|S] · P[S|C] = 1 · (0.8)^2 · (0.1)(0.4)(0.3)(0.1)(0.2) ≈ 1.536 · 10^-4! Example: given that today is Cloudy, what is the probability that it will be Rainy the day after tomorrow?
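A small sketch of both computations, assuming the transition matrix reconstructed on the previous slide (numpy is used only for the matrix product):

```python
import numpy as np

# Weather Markov chain: rows/columns ordered Rainy, Cloudy, Sunny
states = ["Rainy", "Cloudy", "Sunny"]
A = np.array([[0.4, 0.3, 0.3],    # from Rainy
              [0.2, 0.6, 0.2],    # from Cloudy
              [0.1, 0.1, 0.8]])   # from Sunny
idx = {s: i for i, s in enumerate(states)}

def chain_probability(seq):
    """Probability of a state sequence, starting deterministically from seq[0]."""
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        p *= A[idx[prev], idx[cur]]
    return p

# P[SSSRRSCS | model] with P[S] = 1 (today is known to be Sunny)
seq = ["Sunny", "Sunny", "Sunny", "Rainy", "Rainy", "Sunny", "Cloudy", "Sunny"]
print(chain_probability(seq))                    # ~1.536e-04

# Today is Cloudy: P(Rainy the day after tomorrow) is the (Cloudy, Rainy) entry of A^2
print((A @ A)[idx["Cloudy"], idx["Rainy"]])
```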
Markov chains (cont)! Given that the model is in a known state i, what is the probability it stays in that state for exactly d days?! The answer is p_i(d) = (a_ii)^{d-1} (1 - a_ii)! Thus the expected number of consecutive days in the same state is Σ_d d · p_i(d) = 1 / (1 - a_ii)! So the expected number of consecutive Sunny days, according to the model, is 1 / (1 - 0.8) = 5.
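The closed formula is easy to check numerically; a tiny sketch for the Sunny state (a_SS = 0.8):

```python
a_ss = 0.8   # self-transition probability of the Sunny state

# P(stay exactly d days) = a_ss**(d-1) * (1 - a_ss); expected duration = 1 / (1 - a_ss)
expected = sum(d * a_ss**(d - 1) * (1 - a_ss) for d in range(1, 10_000))
print(expected, 1 / (1 - a_ss))    # both ~5.0
```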
Hidden Markov models! at each moment the model is in one of a finite number of hidden states! each hidden state holds a (Bernoulli) distribution for emitting letters (emission probabilities)! switching between hidden states is defined by transition probabilities! Example: you don't know the weather (S, C, R), but you observe whether the person you see carries an umbrella, and P(umbrella|S) = 0.05, P(umbrella|C) = 0.2, P(umbrella|R) = 0.9
CpG-Islands
Why CpG-Islands? By CFCF - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=30029083
CpG Islands and the Fair Bet Casino! The CpG islands problem can be modeled after a problem named The Fair Bet Casino! The game is to flip coins, which results in only two possible outcomes: Head or Tail! The Fair coin gives Heads and Tails with the same probability ½: P(H|F) = P(T|F) = ½! The Biased coin gives Heads with probability ¾: P(H|B) = ¾, P(T|B) = ¼! The dealer changes between the Fair and Biased coins with probability 0.1
The Fair Bet Casino Problem! Input: sequence x = x_1 x_2 x_3 ... x_n of coin tosses made by two possible coins (F or B)! Output: sequence π = π_1 π_2 π_3 ... π_n, with each π_i being either F or B, indicating that x_i is the result of tossing the Fair or Biased coin respectively
Decoding problem! Any observed outcome of coin tosses could have been generated by any sequence of states! Goal: compute the most likely sequence π producing x, i.e. the π maximizing P(π|x)! This problem is called the decoding problem
Warm-up: what if the coin stays the same?! Assume that the dealer never changes the coin! P(x|F): probability of the outcome x provided that the dealer uses the F coin all along! P(x|B): same if the dealer uses the B coin! P(x|F) = P(x_1 ... x_n | F) = Π_{i=1..n} P(x_i|F) = (1/2)^n! P(x|B) = P(x_1 ... x_n | B) = (3/4)^k (1/4)^{n-k} = 3^k / 4^n, where k is the number of Heads in x
What if the coin stays the same? (cont)! P(x|F) = P(x|B) ⟺ (1/2)^n = 3^k / 4^n ⟺ k = n / log_2 3 (k ≈ 0.63 n)! We can compute the log-odds ratio to measure the discrimination of F vs B: log_2( P(x|F) / P(x|B) ) = n - k log_2 3
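A minimal sketch of the log-odds computation (the toss string is a made-up illustration):

```python
import math

def log_odds_fair_vs_biased(x):
    """log2( P(x|F) / P(x|B) ) = n - k * log2(3), where k is the number of Heads in x."""
    n = len(x)
    k = x.count("H")
    return n - k * math.log2(3)

tosses = "HTTHHTHTTH"                      # hypothetical outcome: 5 Heads out of 10 tosses
print(log_odds_fair_vs_biased(tosses))     # > 0 favours the Fair coin, < 0 the Biased one
```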
Hidden Markov Model (HMM)! Can be viewed as an abstract machine with k hidden states that emits symbols (observations) from an alphabet Σ! Each state has its own probability distribution of moving to another state (transition probabilities). Altogether, they define a Markov chain on the states! Each state has a probability distribution of emitting symbols of Σ (emission probabilities)! While in a certain state, the machine randomly decides:! what is the next state! what symbol is emitted
HMM Parameters! Σ: alphabet of symbols emitted by the model! Q: set of hidden states, each emitting symbols from Σ
HMM Parameters (cont'd)! A = (a_kl): transition probabilities, a_kl = P(π_i = l | π_{i-1} = k)! E = (e_k(b)): emission probabilities, e_k(b) = P(x_i = b | π_i = k)
Summary: HMM for Fair Bet Casino! Hidden states: Fair (F) and Biased (B); emitted symbols: Tails (0) and Heads (1)! Transition probabilities: a_FF = a_BB = 0.9, a_FB = a_BF = 0.1! Emission probabilities: e_F(0) = e_F(1) = 1/2, e_B(0) = 1/4, e_B(1) = 3/4
HMM for Fair Bet Casino (cont): state diagram of the two hidden states F and B emitting H/T
Hidden Paths! A path π = π_1 ... π_n in the HMM is defined as a sequence of states.! Consider path π = FFFBBBBBFFF and sequence x = THTHHHTHTTH
P(x|π) Calculation! P(x|π): probability that sequence x was generated by the path π: P(x|π) = Π_{i=1..n} P(x_i | π_i) · P(π_{i-1} → π_i), assuming that P(π_0 → π_1) is the probability P(π_1) of π_1 being the starting state
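A sketch of this product for the path and sequence of the previous slide, assuming the Fair Bet Casino parameters summarized above and a uniform starting distribution (an assumption, since the slides do not fix the start probabilities explicitly):

```python
# Fair Bet Casino HMM: states F, B; symbols H, T
start = {"F": 0.5, "B": 0.5}                      # assumed uniform starting distribution
trans = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "F"): 0.1, ("B", "B"): 0.9}
emit = {("F", "H"): 0.5, ("F", "T"): 0.5, ("B", "H"): 0.75, ("B", "T"): 0.25}

def path_probability(x, path):
    """P(x|path) as defined above: start(path_1) * prod_i emit(path_i, x_i) * trans(path_{i-1}, path_i)."""
    p = start[path[0]] * emit[(path[0], x[0])]
    for i in range(1, len(x)):
        p *= trans[(path[i - 1], path[i])] * emit[(path[i], x[i])]
    return p

print(path_probability("THTHHHTHTTH", "FFFBBBBBFFF"))
```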
Decoding Problem! Goal: Find an "optimal" (most likely) hidden path of states given observations.! Input: Sequence of observations x = x_1 ... x_n generated by an HMM M(Σ, Q, A, E)! Output: A path π = π_1 ... π_n that maximizes P(x|π) over all possible paths
Viterbi algorithm (1967)! Consider the prefix x_1 ... x_i! For each hidden state π_i = l, let s_{l,i} be the maximum probability (over the i-1 previous states) to observe x_1 ... x_i and arrive at state l! Why compute s_{l,i}? Let π* be the sequence of states that realizes max{ s_{l,n} : l ∈ Q }! Observe that max_π P(π|x) = max_π P(π, x) / P(x), and P(x) does not depend on π! Hence π* is the most likely decoding! How to compute s_{l,i}? By dynamic programming: s_{l,i} = max{ s_{k,i-1} · a_{kl} · e_l(x_i) : k ∈ Q }
DP implementation! Consider the graph with one node per (hidden state, position in x) pair
DP implementation! Every choice of π = π_1 ... π_n corresponds to a path in the graph.! This graph has |Q|^2 · n edges! Initialization: s_{l,0} = probability for the model to start from state l! DP recurrence: s_{l,i} = max{ s_{k,i-1} · a_{kl} · e_l(x_i) : k ∈ Q }! The resulting path π* is retrieved by "backtracing" (traceback) starting from the node argmax{ s_{l,n} : l ∈ Q }! Time complexity: O(|Q|^2 · n)
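A compact Viterbi sketch in log space (log-probabilities avoid the numerical underflow discussed a couple of slides below), applied to the Fair Bet Casino example; the uniform start distribution is an assumption:

```python
import math

def viterbi(x, states, start, trans, emit):
    """Most likely hidden path for observations x, computed with log-probabilities."""
    # s[l] = best log-probability of a path over the current prefix ending in state l
    s = {l: math.log(start[l]) + math.log(emit[(l, x[0])]) for l in states}
    back = []                                    # back[i][l] = best predecessor of state l
    for c in x[1:]:
        ptr, new_s = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: s[k] + math.log(trans[(k, l)]))
            ptr[l] = best_k
            new_s[l] = s[best_k] + math.log(trans[(best_k, l)]) + math.log(emit[(l, c)])
        back.append(ptr)
        s = new_s
    last = max(states, key=lambda l: s[l])       # argmax over s_{l,n}
    path = [last]
    for ptr in reversed(back):                   # backtracing
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

states = ["F", "B"]
start = {"F": 0.5, "B": 0.5}
trans = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "F"): 0.1, ("B", "B"): 0.9}
emit = {("F", "H"): 0.5, ("F", "T"): 0.5, ("B", "H"): 0.75, ("B", "T"): 0.25}
print(viterbi("THTHHHTHTTH", states, start, trans, emit))
```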
Decoding Problem vs. Alignment Problem
Decoding Problem as Finding a Heaviest Path in a DAG! The Decoding Problem can be reduced to finding a heaviest path in a directed acyclic graph (DAG)! Note: the weight of a path is defined here as the product of its edge weights, not the sum
Computer arithmetic problems! Multiplying many probabilities smaller than 1 quickly underflows floating-point arithmetic! Usual remedy: work with log-probabilities, turning products into sums (this also turns the heaviest-path weight back into an ordinary sum of edge weights)
Example! Two hidden states: raining, not-raining! The probability to stay in the same state is 0.7, to change state 0.3! Emission probabilities modelling the person's behaviour (whether they carry an umbrella) are given! The initial probability of raining is 0.5! Question: what is the most likely sequence of hidden states for the observations (umbrella, umbrella, no umbrella)? (a worked sketch follows below)
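A worked sketch of this question; since the emission table is not reproduced above, the values P(umbrella | raining) = 0.9 and P(umbrella | not raining) = 0.2 are purely illustrative assumptions. With only three observations, all 2^3 candidate paths can be checked directly (Viterbi would return the same answer):

```python
from itertools import product

states = ["R", "N"]                               # raining / not raining
start = {"R": 0.5, "N": 0.5}
trans = {("R", "R"): 0.7, ("R", "N"): 0.3, ("N", "R"): 0.3, ("N", "N"): 0.7}
# Hypothetical emission values (the slide's actual table is not shown here)
emit = {("R", "u"): 0.9, ("R", "-"): 0.1, ("N", "u"): 0.2, ("N", "-"): 0.8}

obs = "uu-"                                       # umbrella, umbrella, no umbrella

def joint(path):
    """P(obs, path) under the two-state model."""
    p = start[path[0]] * emit[(path[0], obs[0])]
    for i in range(1, len(obs)):
        p *= trans[(path[i - 1], path[i])] * emit[(path[i], obs[i])]
    return p

best = max(product(states, repeat=len(obs)), key=joint)
print("".join(best), joint(best))
```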
Many applications! speech recognition! handwriting recognition! computational finance!! bioinformatics! gene prediction! protein classification! protein secondary structure and protein folding! DNA motif discovery (binding sites)!.
HMM in speech recognition from [Gales&Young, Foundations and Trends in Signal Processing, 2007]
Main problems for HMMs! Decoding: find the most likely hidden path for given observations (Viterbi algorithm)! Evaluation: compute the probability P(x) of an observed sequence! Posterior decoding: compute P(π_i = k | x) for each position (forward-backward algorithm)
Computing the probability of x (exercise)
Forward-Backward Problem Given: a sequence of coin tosses x = x_1 ... x_n generated by an HMM. Goal: compute the probability that the dealer was using a biased coin at a particular time. In general: Given: a sequence x = x_1 ... x_n Goal: find the probability P(π_i = k | x)
Plan of the computation: P(π_i = k | x) = P(x, π_i = k) / P(x) = f_k(i) · b_k(i) / P(x), and Σ_k P(π_i = k | x) = 1
Forward algorithm! Forward probability: f_k(i) = P(x_1 ... x_i, π_i = k)! Dynamic programming again! The recurrence for the forward algorithm: f_k(i) = e_k(x_i) · Σ_{l ∈ Q} f_l(i-1) · a_{lk}! Base case: f_k(1) = p_0(k) · e_k(x_1)
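A sketch of the forward recurrence on the Fair Bet Casino parameters (uniform start assumed); summing the last column gives P(x), which also answers the exercise a few slides above:

```python
def forward(x, states, start, trans, emit):
    """f[i][k] = P(x[:i+1], state k at position i), positions 0-based, filled left to right."""
    f = [{k: start[k] * emit[(k, x[0])] for k in states}]
    for c in x[1:]:
        prev = f[-1]
        f.append({k: emit[(k, c)] * sum(prev[l] * trans[(l, k)] for l in states)
                  for k in states})
    return f

states = ["F", "B"]
start = {"F": 0.5, "B": 0.5}
trans = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "F"): 0.1, ("B", "B"): 0.9}
emit = {("F", "H"): 0.5, ("F", "T"): 0.5, ("B", "H"): 0.75, ("B", "T"): 0.25}
f = forward("THTHHHTHTTH", states, start, trans, emit)
print(sum(f[-1].values()))    # P(x) = sum over k of f_k(n)
```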
Backward algorithm! However, the forward probability is not the only factor affecting P(π_i = k | x)! The sequence of transitions and emissions that the HMM undergoes between π_{i+1} and π_n also affects P(π_i = k | x)
Backward algorithm (cont)! Backward probability: b_k(i) = P(x_{i+1} ... x_n | π_i = k)! Dynamic programming (of course)! The recurrence for the backward algorithm: b_k(i) = Σ_{l ∈ Q} e_l(x_{i+1}) · b_l(i+1) · a_{kl}! Base case: b_k(n-1) = Σ_{l ∈ Q} a_{kl} · e_l(x_n) (equivalently, b_k(n) = 1 for all k)
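A matching backward sketch on the same parameters; as a sanity check, P(x) can also be read off the backward table via P(x) = Σ_k p_0(k) · e_k(x_1) · b_k(1):

```python
def backward(x, states, trans, emit):
    """b[i][k] = P(x[i+1:] | state k at position i), positions 0-based, filled right to left."""
    n = len(x)
    b = [None] * n
    b[n - 1] = {k: 1.0 for k in states}           # b_k(n) = 1
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(emit[(l, x[i + 1])] * b[i + 1][l] * trans[(k, l)] for l in states)
                for k in states}
    return b

states = ["F", "B"]
start = {"F": 0.5, "B": 0.5}
trans = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "F"): 0.1, ("B", "B"): 0.9}
emit = {("F", "H"): 0.5, ("F", "T"): 0.5, ("B", "H"): 0.75, ("B", "T"): 0.25}
x = "THTHHHTHTTH"
b = backward(x, states, trans, emit)
print(sum(start[k] * emit[(k, x[0])] * b[0][k] for k in states))   # equals P(x)
```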
Forward-Backward algorithm! The probability that the dealer used a biased coin at moment i: P(π_i = k | x) = P(x, π_i = k) / P(x) = f_k(i) · b_k(i) / P(x)! P(x) can be recovered from the fact that Σ_k P(π_i = k | x) = 1! Remark: the FB algorithm cannot replace the Viterbi algorithm: it gives the most likely state at each position taken separately, which may not assemble into the most likely (or even a valid) path
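Putting the pieces together for the posterior P(π_i = B | x); this snippet assumes the forward and backward sketches above (and the same Fair Bet Casino parameters) are already in scope:

```python
# Posterior decoding for the Fair Bet Casino, reusing forward() and backward() from above
f = forward(x, states, start, trans, emit)
b = backward(x, states, trans, emit)
p_x = sum(f[-1].values())                            # P(x)

for i in range(len(x)):
    posterior_biased = f[i]["B"] * b[i]["B"] / p_x   # P(pi_i = Biased | x)
    print(i + 1, x[i], round(posterior_biased, 3))
```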
Example (cont)! In the raining / not-raining example, what is the probability that it was not raining on day 2 if the observations are (umbrella, umbrella, no umbrella)?