Bioinformatics 1 -- Lectures 15, 16: Markov chains, hidden Markov models, profile HMMs

Overview (flow diagram): a target sequence is the input to a database search; the search results (a sequence family) are aligned to become a multiple sequence alignment (MSA); redundancy is removed by sequence weights (pairwise distances and a tree can be used to derive the weights); the columns are condensed to a profile, with pseudocounts (background-weighted or position-specific) to account for unobserved data; the profile columns become match states, and adding delete and insert states gives the profile HMM.

Profile hidden Markov models. M = match state, I = insertion state, D = deletion state. [Architecture: begin --> (D, I, M) repeated for each model position --> ... --> end.] The probability of a gap or insertion might be position specific; profile HMMs can model this.

Markov processes. A Markov process is any process in which the next item in the list depends only on the current item. The dimension can be time, sequence position, etc.

Modeling proteins using Markov chains. A Markov chain is a network of states connected by transitions; it is a stochastic model that emits symbol data whose probability depends only on the last symbol emitted. [Diagram: three connected states, H = helix, E = extended (strand), L = loop.]

What is a stochastic model? A model is a simplified version of reality; the simpler, the better. A stochastic model has the form: random numbers --> model --> synthetic data.

Markov states. A Markov state emits a symbol each time you visit it, and connects to other states (and possibly itself) with transition probabilities attached (in the diagram, e.g. .6, .3, and .1 on the arrows to E, L, and H). The sum of all transition probabilities leaving a state = 1. Note: Markov chains emit discrete, 1-dimensional data.

Setting the parameters of a Markov model from data. Secondary structure data:

  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLEEEEELLLLEEEEELLLLLLLLLLLEEEEEEEEELLLLLEEEEEEEEELL
  LLLLLLEEEEEELLLLLEEEEEELLLLLLLLLLLLLLLEEEEEELLLLLEEEEELLLLLLLLLLEEEELLLLEEEELLLLEEE
  EEEEELLLLLLEEEEEEEEELLLLLLEELLLLLLLLLLLLLLLLLLLLLLLLEEEEELLLEEEEEELLLLLLLLLLEEEEEEL
  LLLLEEELLLLLLLLLLLLLEEEEEEEEELLLEEEEEELLLLLLLLLLLLLLLLLLLHHHHHHHHHHHHHLLLLLLLEEL
  HHHHHHHHHHLLLLLLHHHHHHHHHHHLLLLLLLELHHHHHHHHHHHHLLLLLHHHHHHHHHHHHHLLLLLEEEL
  HHHHHHHHHHLLLLLLHHHHHHHHHHEELLLLLLHHHHHHHHHHHLLLLLLLHHHHHHHHHHHHHHHHHHHHH
  HHHHHHLLLLLLHHHHHHHHHHHHHHHHHHLLLHHHHHHHHHHHHHHLLLLEEEELLLLLLLLLLLLLLLLEEEEL
  LLLHHHHHHHHHHHHHHHLLLLLLLLEELLLLLHHHHHHHHHHHHHHLLLLLLEEEEELLLLLLLLLLHHHHHHHH
  HHHHHHHHHHHHHHHLLLLLHHHHHHHHHLLLLHHHHHHHLLHHHHHHHHHHHHHHHHHHHH

Count the pairs to get the transition probability, e.g. for the E --> L transition:

  P(L|E) = P(EL)/P(E) = counts(EL)/counts(E)
  counts(E) = counts(EE) + counts(EL) + counts(EH)
  Therefore: P(E|E) + P(L|E) + P(H|E) = 1.
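As an illustration of this counting procedure, here is a minimal Python sketch; the short secondary-structure string and the H/E/L state set are assumptions for the example, not part of the lecture data:

    from collections import defaultdict

    def estimate_transitions(ss, states="HEL"):
        """Estimate P(next | current) by counting adjacent pairs in a
        secondary-structure string and normalizing each row."""
        counts = defaultdict(lambda: defaultdict(int))
        for prev, curr in zip(ss, ss[1:]):
            counts[prev][curr] += 1
        probs = {}
        for s in states:
            total = sum(counts[s][t] for t in states)
            probs[s] = {t: (counts[s][t] / total if total else 0.0) for t in states}
        return probs

    # hypothetical short secondary-structure string, for illustration only
    ss = "LLLLHHHHHHHHLLLEEEEELLLLEEEEELLLLHHHHHHHLLL"
    for s, row in estimate_transitions(ss).items():
        print(s, {t: round(p, 2) for t, p in row.items()})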

Bayes' notation and Rabiner's notation:

  a_yx = P(x|y) = P(y,x)/P(y) = F(y,x)/F(y) ... the conditional probability of x given y (F = counts).
  π_x = P(x) = F(x)/N ... the probability of x (unconditional).

A transition matrix P(q_t | q_{t-1}):

        H     E     L
  H   .93   .01   .06
  E   .01   .80   .19
  L   .04   .06   .90

Rows give the current state q_{t-1} and columns the next state q_t, so for example the entry in row H, column E is P(E|H). This is a first-order MM: transition probabilities depend only on the previous state.

What is P(S|λ), the probability of a sequence, given the model?

  λ     H     E     L
  H   .93   .01   .06
  E   .01   .80   .19
  L   .04   .06   .90

  P(HHEELL|λ) = P(H) P(H|H) P(E|H) P(E|E) P(L|E) P(L|L) = (.33)(.93)(.01)(.80)(.19)(.90) = 4.2E-4
  P(HHHHHH|λ) = 0.69    <- common protein secondary structure
  P(HEHEHE|λ) = 1E-6    <- not protein secondary structure

Probability discriminates between realistic and unrealistic sequences.
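As a small illustration, here is a minimal Python sketch of this chain-probability calculation; the transition matrix is the one above, and the uniform 1/3 start probabilities are an assumption (the slide uses P(H) = .33):

    trans = {  # P(next | current), from the table above
        "H": {"H": 0.93, "E": 0.01, "L": 0.06},
        "E": {"H": 0.01, "E": 0.80, "L": 0.19},
        "L": {"H": 0.04, "E": 0.06, "L": 0.90},
    }
    pi = {"H": 1 / 3, "E": 1 / 3, "L": 1 / 3}  # assumed uniform start probabilities

    def chain_probability(seq):
        """P(S | lambda) for a first-order Markov chain over H/E/L."""
        p = pi[seq[0]]
        for prev, curr in zip(seq, seq[1:]):
            p *= trans[prev][curr]
        return p

    print(chain_probability("HHEELL"))   # ~4.2e-4
    print(chain_probability("HEHEHE"))   # much smaller: an unrealistic sequence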

What is the maximum likelihood model given a dataset of sequences?

Dataset (six copies of HHEELL):
  HHEELL  HHEELL  HHEELL  HHEELL  HHEELL  HHEELL

Count the state pairs, then normalize each row:

  Counts:             Maximum likelihood model:
       H  E  L             H    E    L
  H    1  1  0        H   0.5  0.5  0
  E    0  1  1        E   0    0.5  0.5
  L    0  0  1        L   0    0    1.0

Is this model too simple? [Figure: frequency histograms of helix length (1-10 residues), comparing synthetic helix-length data generated from this model with real helix-length data; L. Pal et al., J. Mol. Biol. (2003) 326, 273-291.] "A model should be as simple as possible, but not simpler." --Einstein

A Markov chain for proteins where helices are always exactly 4 residues long. [Diagram: four H states in series, connected to the L and E states.]

A Markov chain for proteins where helices are always at least 4 residues long. [Diagram: a chain of H states connected to the L and E states, allowing helices of length 4 or more.] Can you draw a Markov chain where helices are always a multiple of 4 long?

Exercise: generate a Markov model (MM) based on the data: "how much wood would a wood chuck chuck if a wood chuck would chuck wood?"

Markov chain for DNA sequence (states A, T, C, G):

  P(ATCGCGTA...) = π_A · a_AT · a_TC · a_CG · a_GC · a_CG · a_GT · a_TA · ...

CpG islands. [Diagram: a stretch of DNA with unmethylated CpG-island regions marked "+" and methylated regions marked "-".] DNA is methylated on C to protect against endonucleases. Using mass spectroscopy we can find regions of DNA that are methylated and regions that are not. Regions that are protected from methylation may be functionally important, e.g. transcription factor binding sites. Over the course of evolution, methylated CpGs get mutated to TpGs (NNNCGNNN --> NNNTGNNN).

Using Markov chains for discrimination: CpG islands in human chromosome sequences. [Diagram: a chromosome sequence partitioned into alternating CpG-rich ("+") and CpG-poor ("-") regions.]

CpG rich = "+" model:
  +     A      C      G      T
  A   0.180  0.274  0.426  0.120
  C   0.171  0.368  0.274  0.188
  G   0.161  0.339  0.385  0.125
  T   0.079  0.355  0.384  0.182

CpG poor = "-" model:
  -     A      C      G      T
  A   0.300  0.205  0.285  0.210
  C   0.322  0.298  0.078  0.302
  G   0.248  0.246  0.298  0.208
  T   0.177  0.239  0.292  0.292

  P(CGCG|+) = π_C (0.274)(0.339)(0.274) = π_C · 0.0255
  P(CGCG|-) = π_C (0.078)(0.246)(0.078) = π_C · 0.0015

From Durbin, Eddy, Krogh and Mitchison, Biological Sequence Analysis (1998), p. 50.

Comparing two MMs: the log likelihood ratio (LLR).

  LLR(x) = log[ P(x|+) / P(x|-) ] = Σ_{i=1..L} log( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} ) = Σ_{i=1..L} β_{x_{i-1} x_i}

Log-likelihood ratios (log base 2) for the transitions:

  β      A       C       G       T
  A   -0.740   0.419   0.580  -0.803
  C   -0.913   0.302   1.812  -0.685
  G   -0.624   0.461   0.331  -0.730
  T   -1.169   0.573   0.393  -0.679

Sum the LLRs. If the result is positive, it's a CpG island; otherwise not.

  LLR(CGCG) = 1.812 + 0.461 + 1.812 = 4.085  -->  yes
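A minimal sketch of this scoring in Python, using the β table above (keyed by the previous and current nucleotide):

    BETA = {  # log2-odds score for each transition, from the table above
        "A": {"A": -0.740, "C": 0.419, "G": 0.580, "T": -0.803},
        "C": {"A": -0.913, "C": 0.302, "G": 1.812, "T": -0.685},
        "G": {"A": -0.624, "C": 0.461, "G": 0.331, "T": -0.730},
        "T": {"A": -1.169, "C": 0.573, "G": 0.393, "T": -0.679},
    }

    def llr(seq):
        """Sum the transition log-odds scores over adjacent nucleotide pairs."""
        return sum(BETA[a][b] for a, b in zip(seq, seq[1:]))

    score = llr("CGCG")                      # 1.812 + 0.461 + 1.812 = 4.085
    print(score, "CpG island" if score > 0 else "not a CpG island")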

In class exercise: what's the LLR? What is the LLR that this sequence is a CpG island?

  ATGTCTTAGCGCGATCAGCGAAAGCCACG

  LLR = Σ_{i=1..L} β_{x_{i-1} x_i} = ?

  β      A       C       G       T
  A   -0.740   0.419   0.580  -0.803
  C   -0.913   0.302   1.812  -0.685
  G   -0.624   0.461   0.331  -0.730
  T   -1.169   0.573   0.393  -0.679

Combining two Markov chains to make a hidden Markov model. A hidden Markov model can have multiple paths for a sequence. [Diagram: the "+" model (states A, C, G, T) and the "-" model (states A, C, G, T), with transitions allowed between the + and - models.] In a hidden Markov model (HMM), there is no longer a one-to-one correspondence between the state and the emitted symbol.

Probability of a sequence using a HMM. Different state sequences can produce the same emitted sequence.

Nucleotide sequence: C G C G

  State sequences (paths):      P(sequence, path):
  C+ G+ C+ G+                   π_C+ a_C+G+ a_G+C+ a_C+G+
  C- G- C- G-                   π_C- a_C-G- a_G-C- a_C-G-
  C+ G+ C- G-                   π_C+ a_C+G+ a_G+C- a_C-G-
  C+ G- C- G+                   π_C+ a_C+G- a_G-C- a_C-G+
  etc...                        etc...

Sum these: P(CGCG|λ) = Σ over all paths Q of P(Q). Each state sequence has a probability; the sum over all state sequences that emit CGCG is P(CGCG).
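To make the sum over all paths concrete, here is a minimal brute-force sketch. It uses a simplified two-state (+/-) HMM with invented emission and transition numbers, since the lecture does not give parameters for the combined model; only the idea of enumerating every path matters here:

    from itertools import product

    # hypothetical two-state HMM; all numbers are illustrative only
    states = ["+", "-"]
    pi = {"+": 0.5, "-": 0.5}
    trans = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
    emit = {"+": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
            "-": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

    def total_probability(seq):
        """P(seq | model): sum the joint probability over every possible state path."""
        total = 0.0
        for path in product(states, repeat=len(seq)):
            p = pi[path[0]] * emit[path[0]][seq[0]]
            for t in range(1, len(seq)):
                p *= trans[path[t - 1]][path[t]] * emit[path[t]][seq[t]]
            total += p
        return total

    print(total_probability("CGCG"))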

The problem is finding the states given the sequence. Typically, when using an HMM, the task is to determine the optimal state pathway given the sequence. The state pathway provides some predictive feature, such as secondary structure, splice site/not splice site, CpG island/not CpG island, etc. In principle, we could do this by trying all state pathways Q and choosing the optimal one. In practice, this is usually impossible, because the number of pathways increases as the number of states to the power of the length, i.e. O(n^m) for n states and length m. How do we do it, then?

HMMs that use profiles. [Diagram: H, E, L states, each with an emission ("marble bag") distribution b_H(i), b_E(i), b_L(i), drawn as stacked odds running from 0 to 1.] The marble bag represents a probability distribution over amino acids, b (a "profile"). Each state emits one amino acid from its marble bag on each visit. A probability distribution is a set of probabilities (0 ≤ p ≤ 1) that sum to 1.

States emit amino acids (aa); the state labels are the secondary structure (ss). [Diagram: an amino acid sequence emitted by a path through the H, E, L states; the state sequence is the secondary structure.] Given an amino acid sequence, what is the most probable state sequence (secondary structure)?

Maximize the joint probability of a sequence and pathway.

  Q = {q_1, q_2, q_3, ..., q_T} = sequence of Markov states, or pathway
  S = {s_1, s_2, s_3, ..., s_T} = sequence of amino acids or nucleotides
  T = length of S and Q

Joint probability of a pathway and sequence, given an HMM λ. For example, for the path Q = HHEEEL through the trellis of H/E/L states:

  S = A G P L V D
  Q = H H E E E L
  P = π_H b_H(A) · a_HH b_H(G) · a_HE b_E(P) · a_EE b_E(L) · a_EE b_E(V) · a_EL b_L(D)

Joint probability: general expression. For a pathway Q through HMM λ:

  P(S, Q | λ) = π_{q_1} · Π_{t=1..T} b_{q_t}(s_t) · a_{q_t q_{t+1}}  **

  ** when t = T there is no q_{T+1}; use a = 1.

[Diagram: the sequence A G P L V D above the trellis of H, E, L states.]
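A minimal sketch of this joint probability in Python, assuming `pi`, `a`, and `b` are dictionaries of start, transition, and emission probabilities (as in the earlier sketches; all names are illustrative):

    def joint_probability(S, Q, pi, a, b):
        """P(S, Q | lambda) = pi[q1] * product over t of b[q_t][s_t] * a[q_t][q_{t+1}],
        with no transition factor after the final position."""
        p = pi[Q[0]]
        for t in range(len(S)):
            p *= b[Q[t]][S[t]]
            if t + 1 < len(S):
                p *= a[Q[t]][Q[t + 1]]
        return p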

The Three HMM Algorithms 1. The Viterbi algorithm: get the optimal state pathway. Maximum joint prob. 2. The Forward/Backward algorithm: get the probability of each state at each position. Sum over all joint probs. 3. Expectation/Maximization: refine the parameters of the model using the data

The Viterbi algorithm: the maximum probability path.

  v_k(t) = MAX over l of [ v_l(t-1) a_lk b_k(s_t) ]
  Trc_k(t) = ARGMAX over l of [ v_l(t-1) a_lk b_k(s_t) ]

Recursive. We save the value v and also a traceback arrow Trc as we go along. Plot state versus position: each v is a MAX over the whole previous column of v's. [Diagram: a trellis of Markov states l (rows) against sequence positions t = 1, 2, 3, ..., T-1, T (columns).] When t = T, the last position, the traceback arrows from the MAX give the optimal state sequence.
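A minimal Viterbi sketch in Python under the same assumed `pi`/`a`/`b` dictionary conventions; it is one possible answer to the exercise that follows (note the exercise initializes v_k(1) with b_k(s_1) alone, while this sketch also multiplies in π_k, the standard form):

    def viterbi(S, states, pi, a, b):
        """Return the maximum-probability state path for sequence S and its probability."""
        v = [{k: pi[k] * b[k][S[0]] for k in states}]   # v_k(1)
        trc = [{}]
        for t in range(1, len(S)):
            v.append({})
            trc.append({})
            for k in states:
                best = max(states, key=lambda l: v[t - 1][l] * a[l][k])
                v[t][k] = v[t - 1][best] * a[best][k] * b[k][S[t]]
                trc[t][k] = best                        # traceback arrow
        last = max(states, key=lambda k: v[-1][k])
        path = [last]
        for t in range(len(S) - 1, 0, -1):              # follow the traceback arrows
            path.append(trc[t][path[-1]])
        return list(reversed(path)), v[-1][last]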

Exercise: Write the Viterbi algorithm. [Diagram: a small six-state example HMM.]

  v_k(t) = MAX over l of [ v_l(t-1) a_lk b_k(s_t) ]
  Trc_k(t) = ARGMAX over l of [ v_l(t-1) a_lk b_k(s_t) ]

  states 1..L, positions 1..T

Exercise: Write the Viterbi algorithm. [Diagram: the same six-state example HMM.]

  v_k(t) = MAX over l of [ v_l(t-1) a_lk b_k(s_t) ]
  Trc_k(t) = ARGMAX over l of [ v_l(t-1) a_lk b_k(s_t) ]

  initialize v_k(1) = b_k(s_1)
  for t = 2,T {
      for k = 1,L {
          ...
      }
  }
  (states 1..L, positions 1..T)

The Forward algorithm: all paths to a state.

  α_k(t) = Σ over l of [ α_l(t-1) a_lk ] · b_k(s_t)

α is the forward probability; a is the transition (the "arrow" between states). "Forward" stands for forward recursion. After the first column, each α depends on the whole previous column of α's. [Diagram: trellis of Markov states l against sequence positions 1, 2, 3, ..., t-1, t.] The sum of P over all paths up to state k at position t = α_k(t). At the end of the sequence, when t = T, the sum of α_k(T) over all states k equals the total probability of the sequence given the model, P(S|λ).

The Backward algorithm: all paths from a state.

  β_k(t) = Σ over l of [ a_kl b_l(s_{t+1}) β_l(t+1) ]

"Backward" stands for backward recursion: the algorithm starts at t = T, the end of the sequence. (The transitions are still forward.) Each β depends on the whole next column of β's. [Diagram: trellis of Markov states l against sequence positions t, t+1, ..., T-2, T-1, T.] The sum over all paths leaving state k after position t = β_k(t). At the beginning of the sequence, when t = 1, summing π_k b_k(s_1) β_k(1) over all states k again gives the total probability of the sequence given the model, P(S|λ).

Exercise: Write the Forward algorithm. [Diagram: the same six-state example HMM.]

  α_k(t) = Σ over l of [ α_l(t-1) a_lk ] · b_k(s_t)

  initialize α_k(1) = π_k b_k(s_1)
  for t = 2,T {
      for k = 1,L {
          ...
      }
  }
  (states 1..L, positions 1..T)

The Forward/Backward algorithm: all paths through a state.

  γ_k(t) = α_k(t) · β_k(t)

γ_k(t) is the total probability of all paths passing through state k at position t, given the model λ; dividing by P(S|λ) gives the posterior probability of state k at position t given the sequence S. [Diagram: trellis with α summing all paths into state k at t and β summing all paths out of it; state k at t is the bottleneck through which all these paths must travel.]
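A minimal forward/backward sketch in Python under the same assumed `pi`/`a`/`b` conventions (no scaling, so it is suitable only for short sequences); it is also one possible answer to the Forward exercise above:

    def forward(S, states, pi, a, b):
        """alpha[t][k] = P(s_1..s_t, q_t = k)."""
        alpha = [{k: pi[k] * b[k][S[0]] for k in states}]
        for t in range(1, len(S)):
            alpha.append({k: sum(alpha[t - 1][l] * a[l][k] for l in states) * b[k][S[t]]
                          for k in states})
        return alpha

    def backward(S, states, a, b):
        """beta[t][k] = P(s_{t+1}..s_T | q_t = k)."""
        beta = [dict() for _ in S]
        beta[-1] = {k: 1.0 for k in states}
        for t in range(len(S) - 2, -1, -1):
            beta[t] = {k: sum(a[k][l] * b[l][S[t + 1]] * beta[t + 1][l] for l in states)
                       for k in states}
        return beta

    def posteriors(S, states, pi, a, b):
        """gamma[t][k] = P(q_t = k | S): alpha * beta, normalized by P(S | lambda)."""
        alpha, beta = forward(S, states, pi, a, b), backward(S, states, a, b)
        p_S = sum(alpha[-1][k] for k in states)          # total probability of S
        return [{k: alpha[t][k] * beta[t][k] / p_S for k in states}
                for t in range(len(S))]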

Expectation/Maximization: refining the model. Example: refining b_k(G) (i.e. the amount of Gly in the k-th marble bag). Step 1) Count how many glycines are found in state k. Step 2) Normalize it, and reset b_k(G) in the new model to that value. Step 3) Do steps 1-2 for all states k in λ and all 20 amino acids. Repeat steps 1-3 using the new model; iterate to convergence. Expectation/Maximization is often abbreviated EM.

Expectation/Maximization: refining the model. Example: refining b_k(G). To count the glycines, we calculate the Forward/Backward value for state k at every glycine in the database, then sum them:

  P(q_t = k | S, λ) = Σ over all paths through k at t = γ_k(t) = α_k(t) · β_k(t)

[Diagram: example sequences (SDKPHSGLKVSDE, SDKPHSSIKGSDE, KSGINCLKVHRSDE, ...) with the γ value at each G; these are summed over all G in all sequences S to give the new b_k(G).] The result is normalized to sum to 1 over all 20 amino acids.

Expectation/Maximization: refining the model. Example: refining a_jk, the probability of a transition from state j to state k. Step 1) Get the probability of ending in state j at position t --> α_j(t). Step 2) Get the probability of starting in state k at position t+1 --> β_k(t+1). Step 3) Multiply these by the current a_jk. Step 4) Do Steps 1-3 for all positions t and all sequences S; sum --> a'. Then normalize, and reset a_jk in the new model to a'. Do 1-4 using the new model. Repeat until convergence.

Expectation/Maximization: refining the model. Example: refining a_jk, the probability of a transition from state j to state k.

  a'_jk = Σ over all sequences S, Σ over all positions t of [ α_j(t) · a_jk · β_k(t+1) ]

After summing, the a' values are normalized (so the transitions out of each state sum to 1) and become the new a_jk.
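Continuing the sketches above, one possible (unscaled) version of this transition re-estimation for a single training sequence is shown below. It reuses the forward/backward helpers defined earlier; with that convention for β, the emission term b_k(s_{t+1}) must also be multiplied in. All names are illustrative:

    def reestimate_transitions(S, states, pi, a, b):
        """One EM (Baum-Welch) update of the transition probabilities a_jk from S."""
        alpha, beta = forward(S, states, pi, a, b), backward(S, states, a, b)
        p_S = sum(alpha[-1][k] for k in states)
        # expected number of j -> k transitions, summed over all positions t
        expect = {j: {k: 0.0 for k in states} for j in states}
        for t in range(len(S) - 1):
            for j in states:
                for k in states:
                    expect[j][k] += (alpha[t][j] * a[j][k] * b[k][S[t + 1]]
                                     * beta[t + 1][k]) / p_S
        # normalize so the transitions out of each state j sum to 1
        new_a = {}
        for j in states:
            total = sum(expect[j].values())
            new_a[j] = {k: (expect[j][k] / total if total else a[j][k]) for k in states}
        return new_a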

Profile HMMs. State emissions: M = match state, emits one character from a position-specific profile; I = insert state, emits one character from the background profile; D = delete state, non-emitting (a connector). [Architecture: begin --> (D, I, M) for each position --> ... --> end.] Begin = non-emitting source state; End = non-emitting sink state. All π(q) = 0, except π(begin) = 1. To score a sequence against a profile HMM, we use the F/B algorithm to get P(End); this is the measure of how well the sequence fits the model. Then we can test several models.

Generating a profile HMM from a multiple sequence alignment. Base the model on, say, this one (the first sequence):

  VGA--H
  V----N
  VEA--D
  VKG---
  VYS--T
  FNA--N
  IAGADN

Make four match states:  begin --> M --> M --> M --> M --> end

Generating a profile HMM from a multiple sequence alignment.

  VGA--H
  V----N
  VEA--D
  VKG---
  VYS--T
  FNA--N
  IAGADN

  begin --> M --> M --> M --> [I] --> M --> end

Add insertion states where there are insertions (shown in red on the slide: the columns that are gaps in the parent).

Generating a profile HMM from a multiple sequence alignment.

  VGA--H
  V----N
  VEA--D
  VKG---
  VYS--T
  FNA--N
  IAGADN

  begin --> (M/D) --> (M/D) --> [I] --> (M/D) --> M --> end

Add deletion states where there are deletions (the red dashes): each such match state gets a parallel, non-emitting D state. ...Now optimize using expectation maximization.

Getting profiles for every Match state. Each sequence i in the MSA has a weight w_i (w_1 ... w_7):

  VGA--H
  V----N
  VEA--D
  VKG---
  VYS--T
  FNA--N
  IAGADN

Count the frequency of each amino acid in the column, scaled by the sequence weights w:

  P(V) = Σ_{i : s_i = V} w_i / Σ_{all i} w_i

For the first match state (first column, where sequences 1-5 have V):

  b_M1(V) = (w_1 + w_2 + w_3 + w_4 + w_5) / (w_1 + w_2 + w_3 + w_4 + w_5 + w_6 + w_7)
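A minimal sketch of this weighted counting for one alignment column; the alignment is the example above, and the uniform sequence weights are an assumption for illustration:

    from collections import defaultdict

    msa = ["VGA--H", "V----N", "VEA--D", "VKG---", "VYS--T", "FNA--N", "IAGADN"]
    weights = [1.0] * len(msa)  # assumed uniform sequence weights, for illustration

    def column_profile(msa, weights, col):
        """Weighted amino-acid frequencies for one alignment column (gaps ignored)."""
        freq, total = defaultdict(float), 0.0
        for seq, w in zip(msa, weights):
            if seq[col] != "-":
                freq[seq[col]] += w
                total += w
        return {aa: f / total for aa, f in freq.items()}

    print(column_profile(msa, weights, 0))   # b_M1: V = 5/7, F = 1/7, I = 1/7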

Calculating the probability of a sequence given the model, P(s|λ). [Diagram: the profile HMM, begin ... end.] Sum forward (forward algorithm) using the sequence s:
  For each Match state, multiply by the transition probability (a) and the profile value b_M(s_i), and increment i.
  For each Deletion state, multiply by a; do not increment i.
  For each Insertion state, multiply by a, and increment i.

Picking a parent sequence The parent defines the number of Match states A Match state should conserve the chemical nature of the sidechain as much as possible. A Match state implies structural similarity.

Homolog detection using a library of profile HMMs. Get P(S|λ) for each model λ. [Diagram: MYSEQUENCE is scored against models λ_1, λ_2, λ_3, λ_4, giving P(S|λ_1), P(S|λ_2), P(S|λ_3), P(S|λ_4).] Pick the model with the maximum P.

In-class exercise: make a profile HMM.

  AGF---PDG
  AGGYL-PDG
  AG----PNG
  SGFFLIPNG
  SGF--EPNG

Pick the best parent. Draw match states. Draw insertion states for parent positions that are followed by "-" in the parent. Draw deletion states for parent positions that align with "-" in other sequences. For each Match state, write the predominant amino acid.

Make a HMM from Blast data.

Sequences producing significant alignments:                              Score (bits)  E Value
  gi|18977279|ref|NP_578636.1| (NC_003413) hypothetical protein [P...        136       5e-32
  gi|14521217|ref|NP_126692.1| (NC_000868) hypothetical protein [P...         59       8e-09
  gi|14591052|ref|NP_143127.1| (NC_000961) hypothetical protein [P...         56       8e-08
  gi|18313751|ref|NP_560418.1| (NC_003364) translation elongation...          42       9e-04
  gi|729396|sp|P41203|EF1A_DESMO Elongation factor 1-alpha (EF-1-a...         40       0.007
  gi|1361925|pir|S54734 translation elongation factor aef-1 alpha...          39       0.008
  gi|18312680|ref|NP_559347.1| (NC_003364) translation initiation...          37       0.060

  QUERY       3  GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V   59
  18977279    2  GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V   58
  14521217    1  -MLGFFRRKKKEEEEKI---TGKPVGKVKVENILIVGFK-TVI-ICEVLEGMVKVGYK-V   53
  14591052    1  -MFKFFKRKGEDEKD----VTGKPVGKVKVESILKVGFR-DVI-ICEVLEGIVKVGYK-V   52
  18313751  243  --------------------------RMPIQDVFTITGAGTVV-VGRVETGVLKVGDR-V  274
  729396    236  --------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV  268
  1361925   239  --------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV  271
  18312680  487  -----------------------------------IVGV-KVL-AGTIKPGVT----L-V  504

  QUERY      60  --KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT  109
  18977279   59  --KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT  108
  14521217   54  --RKGKKVAGIVS-MEREHKKVEFAIPGDKIGIMLEKNI---G---AEKGDILEVF--  100
  14591052   53  --KKGKKVAGIVS-MEREHKKIEFAIPGDRVGMMLEKNI----N--AEKDDILEVY--   99
  18313751  275  VIVPPAKVGDVRS-IETHHMKLEQAQPGDNIGVNVRG-I---AKEDVKRGDVL-----  322
  729396    269  --FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV------  314
  1361925   272  --FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV------  317
  18312680  505  --KDGREVGRIMQ-IQKTGRAINEAAAGDEVAISIHGDVIVGRQ--IKEGDILYVY--  555

Make a HMM from Blast data. [Diagram: the BLAST alignment above, bracketed by begin and end states, with regions of the alignment labeled as Delete states, Insert states, and Match states.]

Added information. In dynamic programming (DP) alignment, we assumed insertions and deletions were equally probable, and that the probability was independent of position. With profile HMMs we allow insertions and deletions to have different probabilities, and to depend on the position.

Many uses of HMMs: weather prediction, ecosystem modeling, brain activity, language structure, econometrics, etc. HMMs can be applied to any dataset that can be represented as strings. The expert input is the topology, i.e. how the states are connected.

Profile HMM libraries available via the web:
  Pfam (HMMer): pfam.wustl.edu
  SAM: www.cse.ucsc.edu/research/compbio/hmm-apps/