Data Mining in Bioinformatics: HMM, Microarray

Major Objective
- Discover a comprehensive theory of life's organization at the molecular level
Data Mining in Bioinformatics
- Many genomes are being sequenced
- A resulting problem in computational biology: finding genes in DNA sequences

Gene Finding
- Gene finding refers to identifying stretches of nucleotide sequence in genomic DNA that are biologically functional
- Computational gene finding deals with algorithmically identifying protein-coding genes
Gene Finding
- Gene finding is not an easy task; gene structure can be very complex
- The coding part of a gene is discontinuous:
  - Exons: regions that encode a sequence of amino acids
  - Introns: non-coding polynucleotide sequences that interrupt the coding sequences (the exons) of a gene
- In gene finding there are some important biological rules:
  - Translation starts with a start codon (ATG)
  - Translation ends with a stop codon (TAG, TGA, TAA)
  - An exon can never follow another exon without an intron in between
  - Complete genes can never end with an intron
Gene Finder
- HMMs can be applied efficiently to well-known biological problems, such as:
  - Protein secondary structure recognition
  - Multiple sequence alignment
  - Gene finding
Hidden Markov Models in Bioinformatics
- An HMM is a statistical model for sequences of discrete symbols
- HMMs are well suited to the gene-finding task: categorizing nucleotides within a genomic sequence can be interpreted as a classification problem over a set of ordered observations that possess hidden structure, which makes it a suitable problem for the application of hidden Markov models

Example of a Markov Model
Markov Chains
(State diagram: Sunny, Cloudy, Rainy)
- States: three states - sunny, cloudy, rainy
- State transition matrix: the probability of today's weather given the previous day's weather
- Initial distribution: the probability of the system being in each of the states at time 0
- For each Markov chain, a transition matrix stores the transition probability for each combination of states
- As the order of the Markov chain increases, the size of the matrix grows exponentially and the matrix becomes extremely sparse
- Processing time also increases proportionally
Example of a Hidden Markov Model
Markov Chain: an Example
Weather model:
- 3 states: {rainy, cloudy, sunny}
Problem:
- Forecast the weather state, based on the current weather state

Markov Chain Model Definition
- N states {S_1, S_2, ..., S_N}
- Sequence of states Q = {q_1, q_2, ...}
- Initial probabilities π = {π_1, π_2, ..., π_N}, where π_i = P(q_1 = S_i)
- Transition matrix A (N x N), where a_ij = P(q_{t+1} = S_j | q_t = S_i)
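To make the definition concrete, here is a minimal Markov-chain sketch in Python. The states match the weather example, but the transition values and initial distribution are invented for illustration, not taken from the slides.

```python
import numpy as np

states = ["sunny", "cloudy", "rainy"]
A = np.array([[0.7, 0.2, 0.1],    # transitions from sunny
              [0.3, 0.4, 0.3],    # transitions from cloudy
              [0.2, 0.3, 0.5]])   # transitions from rainy
pi = np.array([0.5, 0.3, 0.2])    # initial distribution, pi_i = P(q_1 = S_i)

def sequence_probability(seq):
    """P(q_1, ..., q_T) = pi[q_1] * prod_t a[q_{t-1}, q_t]."""
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev, cur]
    return p

# Probability of the sequence sunny -> sunny -> rainy
print(sequence_probability([0, 0, 2]))  # 0.5 * 0.7 * 0.1 = 0.035
```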
Mixture Models: an Example
Weather model:
- 3 hidden states: {rainy, cloudy, sunny}
- Measure weather-related variables (e.g. temperature, humidity, barometric pressure)
Problem:
- Given the values of the weather variables, what is the state?

Gaussian Mixture Model Definition
- N states observed through an observation x
- Model parameters θ = {p_1...p_N, μ_1...μ_N, Σ_1...Σ_N}
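A small sketch of the mixture-model view: each hidden state emits a Gaussian over two weather variables, and Bayes' rule gives the posterior over states for an observed measurement. The mixing weights, means, and covariances below are illustrative assumptions, not values from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

states = ["rainy", "cloudy", "sunny"]
weights = np.array([0.3, 0.3, 0.4])              # p_1 ... p_N
means = [np.array([10.0, 0.9]),                  # mu_i over (temperature, humidity)
         np.array([15.0, 0.7]),
         np.array([25.0, 0.4])]
covs = [np.diag([4.0, 0.01]),                    # Sigma_i
        np.diag([4.0, 0.01]),
        np.diag([9.0, 0.01])]

def state_posterior(x):
    """P(state | x) by Bayes' rule over the mixture components."""
    likes = np.array([w * multivariate_normal.pdf(x, m, c)
                      for w, m, c in zip(weights, means, covs)])
    return likes / likes.sum()

x = np.array([24.0, 0.45])          # an observed (temperature, humidity) pair
for s, p in zip(states, state_posterior(x)):
    print(f"{s}: {p:.3f}")
```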
HMM: an Example
Weather model:
- 3 hidden states: {rainy, cloudy, sunny}
- Measure weather-related variables (e.g. temperature, humidity, barometric pressure)
Problem:
- Forecast the weather state, given the current weather variables

Hidden Markov Model Definition (1/2)
- N hidden states {S_1, S_2, ..., S_N}
- Sequence of states Q = {q_1, q_2, ...}
- Sequence of observations O = {O_1, O_2, ...}
Hidden Markov Model Definition (2/2)
- λ = (A, B, π): Hidden Markov Model
- A = {a_ij}: state transition probabilities (similar to a Markov chain), a_ij = P(q_{t+1} = S_j | q_t = S_i)
- B = {b_i(v)}: observation probability distribution (similar to a mixture model), b_i(v) = P(O_t = v | q_t = S_i)
- π = {π_i}: initial state distribution, π_i = P(q_1 = S_i)

HMM Graph
(Figure: HMM graph, combining the Markov-chain transitions with the mixture-model observations)
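A compact encoding of λ = (A, B, π) as arrays, together with sampling a state and observation sequence from the model. The discrete observation alphabet (0 = dry reading, 1 = humid reading) and all probability values are assumptions made for this sketch.

```python
import numpy as np

A  = np.array([[0.6, 0.3, 0.1],     # a_ij = P(q_{t+1} = S_j | q_t = S_i)
               [0.3, 0.4, 0.3],
               [0.1, 0.3, 0.6]])
B  = np.array([[0.9, 0.1],          # b_i(v) = P(O_t = v | q_t = S_i)
               [0.6, 0.4],
               [0.2, 0.8]])
pi = np.array([0.5, 0.3, 0.2])      # pi_i = P(q_1 = S_i)

rng = np.random.default_rng(0)

def sample(T):
    """Generate a hidden state sequence Q and observation sequence O of length T."""
    q = rng.choice(3, p=pi)
    states, obs = [q], [rng.choice(2, p=B[q])]
    for _ in range(T - 1):
        q = rng.choice(3, p=A[q])
        states.append(q)
        obs.append(rng.choice(2, p=B[q]))
    return states, obs

print(sample(5))
```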
The Three Basic Problems
- Evaluation: given O and λ, compute P(O | λ)
- Uncover the hidden part: given O and λ, find the Q that maximizes P(Q | O, λ)
- Learning: given {O}, find the λ that maximizes P(O | λ)

Evaluation
- Given O and λ, compute P(O | λ)
- Solved by using the forward-backward procedure
- Applications:
  - Evaluation of a sequence of observations
  - Finding the most suitable HMM
  - Used in the other two problems
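A minimal sketch of the evaluation problem using the forward part of the forward-backward procedure: given an observation sequence O and a model λ, it returns P(O | λ). The small weather HMM below is an illustrative assumption.

```python
import numpy as np

A  = np.array([[0.6, 0.3, 0.1],
               [0.3, 0.4, 0.3],
               [0.1, 0.3, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.6, 0.4],
               [0.2, 0.8]])
pi = np.array([0.5, 0.3, 0.2])

def forward(obs):
    """alpha_t(i) = P(O_1..O_t, q_t = S_i | lambda); returns P(O | lambda)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # alpha_{t+1}(j) = sum_i alpha_t(i) a_ij * b_j(o)
    return alpha.sum()

print(forward([0, 0, 1, 1]))
```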
Uncover the Hidden Part
- Given O and λ, find the Q that maximizes P(Q | O, λ)
- Solved by the Viterbi algorithm
- Applications:
  - Find the real states
  - Learn about the structure of the model
  - Estimate statistics of the states
  - Used in the learning problem

Learning
- Given {O}, find the λ that maximizes P(O | λ)
- No analytic solution
- Usually solved by Baum-Welch (an EM variant)
- Applications:
  - Unsupervised learning (single HMM)
  - Supervised learning (multiple HMMs)
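A sketch of the "uncover the hidden part" (decoding) problem with the Viterbi algorithm, run in log space on the same assumed toy weather HMM as above.

```python
import numpy as np

A  = np.array([[0.6, 0.3, 0.1],
               [0.3, 0.4, 0.3],
               [0.1, 0.3, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.6, 0.4],
               [0.2, 0.8]])
pi = np.array([0.5, 0.3, 0.2])

def viterbi(obs):
    """Return the most likely hidden state sequence for the observations."""
    delta = np.log(pi) + np.log(B[:, obs[0]])
    back = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(A)      # scores[i, j] = delta(i) + log a_ij
        back.append(scores.argmax(axis=0))       # best predecessor of each state j
        delta = scores.max(axis=0) + np.log(B[:, o])
    path = [int(delta.argmax())]
    for ptr in reversed(back):                   # backtrack through the pointers
        path.append(int(ptr[path[-1]]))
    return path[::-1]

print(viterbi([0, 0, 1, 1]))
```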
Hidden Markov Models
- Set of states: {s_1, s_2, ..., s_N}
- The process moves from one state to another, generating a sequence of states s_{i1}, s_{i2}, ..., s_{ik}, ...
- Markov chain property: the probability of each subsequent state depends only on what the previous state was:
  P(s_{ik} | s_{i1}, s_{i2}, ..., s_{ik-1}) = P(s_{ik} | s_{ik-1})
- States are not visible, but each state randomly generates one of M observations (or visible states) {v_1, v_2, ..., v_M}

Hidden Markov Models
- To define a hidden Markov model, the following probabilities have to be specified:
  - Matrix of transition probabilities A = (a_ij), a_ij = P(s_i | s_j)
  - Matrix of observation probabilities B = (b_i(v_m)), b_i(v_m) = P(v_m | s_i)
  - Vector of initial probabilities p = (p_i), p_i = P(s_i)
- The model is represented by M = (A, B, p).
Word Recognition
- Typed word recognition; assume all characters are separated.
- A character recognizer outputs the probability of an image being a particular character, P(image | character).
  (Figure: recognizer scores for one observed image, e.g. 0.5 for 'a', 0.03 for 'b', 0.005 for 'c', ..., 0.31 for 'z'; characters are hidden states, images are observations)

Word Recognition
- Hidden states of the HMM = characters.
- Observations = typed images of characters segmented from the image, v_α. Note that there is an infinite number of possible observations.
- Observation probabilities = character recognizer scores: B = (b_i(v_α)) = (P(v_α | s_i))
- Transition probabilities will be defined differently in the two subsequent models.
Word Recognition
- If a lexicon is given, we can construct a separate HMM model for each lexicon word.
  (Figure: left-to-right HMMs for "Amherst" (a-m-h-e-r-s-t) and "Buffalo" (b-u-f-f-a-l-o) with recognizer scores)
- Here recognition of a word image is equivalent to the problem of evaluating a few HMM models. This is an application of the Evaluation problem.
- Alternatively, we can construct a single HMM for all words:
  - Hidden states = all characters in the alphabet.
  - Transition probabilities and initial probabilities are calculated from a language model.
  - Observations and observation probabilities are as before.
  (Figure: fully connected HMM over the characters of the alphabet)
- Here we have to determine the best sequence of hidden states, the one that most likely produced the word image. This is an application of the Decoding problem.
Character Recognition with HMM
- The structure of the hidden states is chosen.
- Observations are feature vectors extracted from vertical slices.
- Probabilistic mapping from hidden state to feature vectors:
  1. use mixtures of Gaussian models, or
  2. quantize the feature vector space.
- The structure of hidden states: s1 -> s2 -> s3
- Observation = number of islands in the vertical slice.

HMM for character 'A':
  Transition probabilities {a_ij} =
    | .8  .2   0 |
    |  0  .8  .2 |
    |  0   0   1 |
  Observation probabilities {b_jk} =
    | .9  .1   0 |
    | .1  .8  .1 |
    | .9  .1   0 |

HMM for character 'B':
  Transition probabilities {a_ij} =
    | .8  .2   0 |
    |  0  .8  .2 |
    |  0   0   1 |
  Observation probabilities {b_jk} =
    | .9  .1   0 |
    |  0  .2  .8 |
    | .6  .4   0 |
Suppose that after character image segmentation the following sequence of island numbers in 4 slices was observed: {1, 3, 2, 1}.
Which HMM is more likely to generate this observation sequence, the HMM for 'A' or the HMM for 'B'?
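One way to answer this is to run the evaluation (forward) procedure for both character models. The transition and observation matrices below are the ones from the previous slide; the assumption that every path starts in s1 (a pure left-to-right model) is mine, since no initial distribution is given.

```python
import numpy as np

A_trans = np.array([[.8, .2, 0], [0, .8, .2], [0, 0, 1]])   # shared by both models
pi = np.array([1.0, 0.0, 0.0])                              # assumed start in s1

B_obs = {                            # columns: 1, 2, 3 islands in the slice
    "A": np.array([[.9, .1, 0], [.1, .8, .1], [.9, .1, 0]]),
    "B": np.array([[.9, .1, 0], [0, .2, .8], [.6, .4, 0]]),
}

def likelihood(obs_counts, B):
    """Forward procedure: P(observation sequence | character model)."""
    alpha = pi * B[:, obs_counts[0] - 1]
    for c in obs_counts[1:]:
        alpha = (alpha @ A_trans) * B[:, c - 1]
    return alpha.sum()

obs = [1, 3, 2, 1]
for char, Bmat in B_obs.items():
    print(char, likelihood(obs, Bmat))
```

Whichever model prints the larger likelihood is the more likely source of the observed island sequence.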
Law of Total Probability
- The algorithm makes use of the principle of dynamic programming
- It efficiently computes the values required to obtain the posterior marginal distributions, in two passes
- The first pass goes forward in time while the second goes backward in time; hence the name forward-backward algorithm
- In the first pass, the forward-backward algorithm computes a set of forward probabilities, which provide the probability of ending up in any particular state given the first k observations
- In the second pass, the algorithm computes a set of backward probabilities, which provide the probability of observing the remaining observations given any starting point
- These two sets of probability distributions can then be combined to obtain the distribution over states at any specific point in time, given the entire observation sequence
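A sketch of the two passes combined: forward probabilities cover the prefix, backward probabilities cover the suffix, and their product, normalized per time step, gives the posterior over states given the whole sequence. The toy weather HMM values are assumptions.

```python
import numpy as np

A  = np.array([[0.6, 0.3, 0.1],
               [0.3, 0.4, 0.3],
               [0.1, 0.3, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.6, 0.4],
               [0.2, 0.8]])
pi = np.array([0.5, 0.3, 0.2])

def posteriors(obs):
    """gamma_t(i) = P(q_t = S_i | O, lambda), one row per time step."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                       # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):              # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

print(posteriors([0, 1, 1, 0]))
```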
Expectation Maximization
- Expectation (E) step: creates a function for the expectation of the log-likelihood using the current estimate of the parameters
- Maximization (M) step: computes parameters maximizing the expected log-likelihood found in the E step
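A compact Baum-Welch sketch showing the two steps for a discrete HMM: the E step computes state and transition posteriors with forward-backward, and the M step re-estimates (A, B, π) from those expected counts. The random initialization, toy observation sequence, and iteration count are assumptions; a practical implementation would also add scaling or log-space arithmetic to avoid underflow on long sequences.

```python
import numpy as np

def baum_welch(obs, N, M, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.dirichlet(np.ones(N), size=N)          # random row-stochastic initialization
    B = rng.dirichlet(np.ones(M), size=N)
    pi = rng.dirichlet(np.ones(N))
    T = len(obs)
    for _ in range(iters):
        # E step: forward and backward passes, then state/transition posteriors
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi = alpha[:-1, :, None] * A[None] * (B[:, obs[1:]].T * beta[1:])[:, None, :]
        xi /= xi.sum(axis=(1, 2), keepdims=True)
        # M step: re-estimate parameters from the expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.array([[gamma[obs == v, i].sum() for v in range(M)]
                      for i in range(N)]) / gamma.sum(axis=0)[:, None]
    return A, B, pi

obs = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 1])     # illustrative observation sequence
print(baum_welch(obs, N=2, M=2))
```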
Finding Genes in a DNA Sequence
- This is one of the most challenging and interesting problems in computational biology at the moment
- With so many genomes being sequenced so rapidly, it remains important to begin by identifying genes computationally
Example HMM
(Figure: HMM with transition probabilities and output probabilities)
Scoring a Sequence with an HMM:
- The probability of ACCY along this path is .4 * .3 * .46 * .6 * .97 * .5 * .015 * .73 * .01 * 1 = 1.76 x 10^-6
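The same path score can be reproduced by multiplying the listed transition and output factors:

```python
# Probability of ACCY along the path shown on the slide: the product of each
# transition probability with the matching output probability.
probs = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]
p = 1.0
for factor in probs:
    p *= factor
print(f"P(ACCY along this path) = {p:.2e}")   # ~1.76e-06
```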
- When using HMMs, we first have to specify a model
- When choosing the model, we have to take its complexity into consideration, such as the number of states and the allowed transitions