Page 1

Hidden Markov models and multiple sequence alignment
Russ B. Altman, BMI 214 / CS 274
Some slides borrowed from Scott C. Schmidler (BMI graduate student)

References
- Bioinformatics classic: Krogh et al. (1994) Hidden Markov models in computational biology: applications to protein modeling. J Mol Biol 235:1501-1531.
- Book: Durbin, Eddy, Krogh & Mitchison (1998) Biological Sequence Analysis. See course web site.
- Tutorial: Rabiner, L. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77(2):257-286.

Probability review
- Probability of A: P(A), with 0 <= P(A) <= 1.
- Joint probability of A AND B: P(A, B).
- Conditional probability of A given that B is true: P(A | B) = P(A, B) / P(B).
- Marginal probability of A, summed over all possible Bs: P(A) = Σ_B P(A, B).
- Independence of A and B: P(A, B) = P(A) P(B).
- Bayes rule for estimating P(B | A) from the rest: P(B | A) = P(A | B) P(B) / P(A).

Markov chains
- Markov property: P(X_0, X_1, ..., X_t) = P(X_0) P(X_1 | X_0) ... P(X_t | X_t-1).
- Formally: a state space (list of possible values for X), a transition matrix (probability of moving from one X to another), and an initial distribution (distribution of the initial value of X).
[Figure: state diagram with states S0, S1 and labeled transition probabilities]

Markovian sequence
- The states through which the chain passes form a sequence. Example: S_0, S_1, S_2, S_1, S_0, S_1, ...
- By the Markov property, the probability of the sequence factors into the initial probability and the transition probabilities.
[Figure: graphical depiction of the chain and its transitions]

Example
- A Markov chain for generating a DNA sequence (states A, C, G, T).
- Sequence probability: P(Sequence) = P(S_0, S_1, S_2, ...) = π(S_0) P(S_1 | S_0) P(S_2 | S_1) ...
- P(AGACG) = π(A) P(G | A) P(A | G) P(C | A) P(G | C)
- The data would come from dinucleotide frequencies (e.g. base-stacking). (A small code sketch follows this page.)
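Below is a minimal Python sketch of the sequence-probability calculation above for a first-order Markov chain over DNA. The initial distribution and transition values are hypothetical placeholders, not numbers from the lecture; in practice they would be estimated from dinucleotide frequencies.

# Sketch: probability of a DNA sequence under a first-order Markov chain.
# The initial distribution (pi) and transition matrix below are hypothetical
# placeholders; real values would come from dinucleotide frequencies.

pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}           # initial distribution
trans = {                                               # P(next base | current base)
    "A": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "C": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.2, "T": 0.3},
    "T": {"A": 0.2, "C": 0.3, "G": 0.2, "T": 0.3},
}

def sequence_probability(seq):
    """P(seq) = pi(s_0) * product over t of P(s_t | s_{t-1})."""
    p = pi[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        p *= trans[prev][curr]
    return p

print(sequence_probability("AGACG"))   # pi(A) P(G|A) P(A|G) P(C|A) P(G|C)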

Page 2

Hidden Markov chains
- The observed sequence is a probabilistic function of an underlying Markov chain.
- Example: HMM for a (noisy) DNA sequence (see e.g. Churchill 1989).
- The true state sequence is unknown, but the observation sequence gives us a clue.
[Figure: unobserved truth A G A C G, with per-position emission probabilities producing observed noisy sequence data; one possible observation shown]

Example: Hidden Markov chain for a protein sequence, based on secondary structure
- Specify a sequence of helix (H) or loop (L) elements, then generate a sequence by choosing amino acids (AA) from probability distributions associated with H or L:
  H H H L L, with emissions P(AA | H), P(AA | H), P(AA | H), P(AA | L), P(AA | L)
- Certain amino acids prefer helices over loops, and vice versa.
[Figure: two possible amino-acid sequences that could be generated from this state path]
(A generative code sketch follows this page.)

Goals for HMMs + multiple alignment
1. Summarize a family of aligned sequences: Which amino acids can occur at a position, and with what probability? Where can there be insertions and deletions? Generate fake but plausible sequences in the family.
2. Evaluate the probability that a new sequence belongs in a family: Does the model generate the sequence with high probability?
3. Generate the alignment itself: How can I explain the organization of all these sequences, which I believe are in the same family?

An HMM for multiple protein sequences (Krogh et al.)
- HMM globin model.
- m = match state: emits amino acids that can be aligned as equivalent in different sequences.
- i = insert state: emits amino acids that don't align.
- d = delete state: emits no amino acid at that position.
- A path through the states aligns a sequence to the model.
- m states are shown with their probability of emitting each of the 20 amino acids; d states show the position in the alignment for that column of d/i/m states; i states show the average length of an insertion IF it is chosen.
[Figure from (Krogh et al., 1994): globin model with example emitted/aligned sequences]
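Below is a minimal generative sketch of the helix/loop idea above: a two-state hidden Markov chain whose states emit amino acids. All transition and emission probabilities (and the residue preferences) are hypothetical illustrations, not the distributions used by Krogh et al.

import random

# Sketch: generate an amino-acid sequence from a two-state hidden Markov chain
# (H = helix, L = loop). All probabilities are hypothetical illustrations.

random.seed(0)

trans = {"H": {"H": 0.8, "L": 0.2},            # helices tend to continue
         "L": {"H": 0.3, "L": 0.7}}
emit = {"H": {"A": 0.4, "E": 0.3, "K": 0.3},   # helix-favoring residues
        "L": {"G": 0.4, "P": 0.3, "S": 0.3}}   # loop-favoring residues

def sample(dist):
    """Draw one key from a {symbol: probability} dictionary."""
    r, acc = random.random(), 0.0
    for symbol, p in dist.items():
        acc += p
        if r <= acc:
            return symbol
    return symbol                              # guard against rounding error

def generate(length, start="H"):
    state, states, residues = start, [], []
    for _ in range(length):
        states.append(state)
        residues.append(sample(emit[state]))   # emit a residue given the state
        state = sample(trans[state])           # move to the next hidden state
    return "".join(states), "".join(residues)

hidden, observed = generate(5)
print(hidden, observed)   # e.g. a hidden path like HHHLL and the residues it emitted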

Page 3

Example: Alignment of globin sequences
[Figure from (Krogh et al., 1994)]

Example: Aligning a sequence to the model
- Given an HMM for a protein family, align a new sequence to the model by finding the most likely path through the graph (d states are gaps, i states are insertions).

Three computational tasks with HMMs
1. Probability of an observed sequence: given O_1, O_2, ..., O_T, find P(O_1, O_2, ..., O_T).
2. Most likely hidden state sequence: given O_1, ..., O_T, find the Q_1, ..., Q_T that maximize P(Q_1, ..., Q_T | O_1, ..., O_T) (the O's are observed, the Q's are not).
3. Estimation of model parameters: given observed sequences {O_1, ..., O_T} and a topology, find the transition and emission probabilities that maximize the SUM of P(O_1, ..., O_T) over the training sequences.

Computing the likelihood of an observed sequence
- Compute P(O_1, ..., O_T) = Σ over all state sequences Q_1, ..., Q_T of P(O_1, ..., O_T | Q_1, ..., Q_T) P(Q_1, ..., Q_T).
- The true state sequence is unknown, so we must sum over all possible paths that lead to the observables.
- Number of paths ~ O(N^T), with N = number of states and T = sequence length.
- The Markovian structure permits a recursive definition, and hence efficient calculation by dynamic programming.
- Key observation: any path must be in exactly one state at time t.

Key idea for HMM computations
- A matrix (states x positions) holds the probability that state i emits the character seen at position t.
- Fill in the first column based on the probability that each state produces the first character.
- Fill in the second column by considering all possible transitions from the first column, their probabilities, and the probability that each row of the second column would emit the second character.
- Fill the entire matrix; sum the values in the last column to get P(observed sequence). (A code sketch of this forward fill follows this page.)
[Figure: trellis of N possible states against the observables (t-1, t, t+1, ...)]

Example: Searching a protein database with an HMM profile
- For each sequence in the database: does the sequence fit the model?
- Score by P(O_1, ..., O_T) and compute a Z-score adjusted for length.
- Z-score = number of standard deviations from the mean; Z = 2 means 2 SD above the mean, etc.
- Note: the states are the i and m states that produce amino acids; d states are inferred, for example when m(i) jumps to m(i+2).
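Below is a minimal sketch of the column-by-column fill described above (the forward algorithm), using a hypothetical two-state toy model rather than a real profile HMM.

# Sketch of the forward algorithm: fill an (N states) x (T positions) matrix
# column by column, then sum the last column to get P(observed sequence).
# The two-state toy model below is a hypothetical illustration.

states = ["H", "L"]
init = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.8, "L": 0.2}, "L": {"H": 0.3, "L": 0.7}}
emit = {"H": {"A": 0.4, "E": 0.3, "K": 0.2, "G": 0.1},
        "L": {"A": 0.1, "E": 0.2, "K": 0.2, "G": 0.5}}

def forward(obs):
    # alpha[t][j] = P(O_1 .. O_t, state at t = j)
    alpha = [{j: init[j] * emit[j][obs[0]] for j in states}]    # first column
    for t in range(1, len(obs)):
        col = {}
        for j in states:   # consider every transition into state j from column t-1
            col[j] = sum(alpha[t - 1][i] * trans[i][j] for i in states) * emit[j][obs[t]]
        alpha.append(col)
    return sum(alpha[-1].values())    # P(O_1 .. O_T): sum over the last column

print(forward("AEKGG"))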

Page 4

Computing the most likely hidden state sequence
- Why? Because it tells us how to align different sequences (by aligning the characters emitted by corresponding hidden states).
- We want the sequence of states most likely to have produced the observables.
- Very similar to computing the probability, except we look for the maximum-probability path instead of summing over all possible paths.
- Finding the most likely path = the Viterbi algorithm.

Finding the most likely path
- A matrix holds the maximum probability that state i emits the character seen at position t.
- Fill in the first column based on the probability that each state produces the first character.
- Fill in the second column by considering all possible transitions from the first column, their probabilities, and the probability that each row of the second column would emit the second character, choosing the maximum-probability state/transition.
- Fill the entire matrix; the maximum value in the last column gives the best last state. (A code sketch follows this page.)
[Figure: trellis of N possible states against the observables (t-1, t, t+1, ...), taking a MAX at each cell]

Other details
- How many states in the model? How to initialize the parameters? How to avoid local modes?
- See (Krogh et al., 1994) for some suggestions.

Estimate the alignment and the model parameters simultaneously
- Key idea (missing data):
- What if we knew the alignment? Then the parameters are easy to estimate: calculate the (expected) number of transitions and the (expected) frequency of amino acids.
- What if we knew the parameters? Then the alignment is easy to find: align each sequence to the model using the Viterbi algorithm, and align the residues in match states.
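Below is a matching sketch of the Viterbi fill, reusing the same hypothetical toy model as the forward sketch: each cell takes a max over incoming transitions and keeps a back-pointer so the best path can be traced back from the last column.

# Sketch of the Viterbi algorithm: same matrix fill as the forward algorithm,
# but each cell keeps the MAX over incoming transitions (plus a back-pointer)
# instead of the sum. Toy model values are hypothetical.

states = ["H", "L"]
init = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.8, "L": 0.2}, "L": {"H": 0.3, "L": 0.7}}
emit = {"H": {"A": 0.4, "E": 0.3, "K": 0.2, "G": 0.1},
        "L": {"A": 0.1, "E": 0.2, "K": 0.2, "G": 0.5}}

def viterbi(obs):
    v = [{j: init[j] * emit[j][obs[0]] for j in states}]    # first column
    back = [{}]
    for t in range(1, len(obs)):
        col, ptr = {}, {}
        for j in states:
            # best transition into state j from any state i at position t-1
            best_i = max(states, key=lambda i: v[t - 1][i] * trans[i][j])
            col[j] = v[t - 1][best_i] * trans[best_i][j] * emit[j][obs[t]]
            ptr[j] = best_i
        v.append(col)
        back.append(ptr)
    # best last state, then trace back through the pointers
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), v[-1][last]

print(viterbi("AEKGG"))   # most likely hidden state path and its probability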

Page 5

Estimating parameters (the hardest part)
- The algorithm (Baum-Welch) is discussed nicely in the handout from Ewens & Grant.
- Set the topology of the network (number of states and their connectivity) beforehand.
- Estimate: transition probabilities, emission probabilities, and initial probabilities (that hidden state i is responsible for the first character).

Estimating parameters: high-level summary
1. Iteratively introduce each sequence into the model.
2. Use the (initially lousy) estimates to align the sequence to the model.
3. Use the resulting alignment to accumulate statistics about the parameters.
4. Introduce the next sequence and continue accumulating.
5. Use the accumulated statistics to update the parameters.
6. Repeat the entire process of introducing sequences until the parameters converge.
(A schematic sketch of this loop follows this page.)

Multiple protein sequence alignment
Given a set of sequences:
1. Estimate the HMM, using optimization for the parameter search.
2. Align each sequence to the model (Viterbi).
3. The match states of the model provide the columns of the resulting multiple alignment.

Example: Multiple alignment of globin sequences
[Figure from (Krogh et al., 1994)]

Tradeoffs
- Advantages: an explicit probabilistic model for the family; position-specific residue distributions, gap penalties, and insertion frequencies.
- Disadvantages: many parameters, so it requires more data or more care; we have traded one hard optimization problem for another.

Extensions
- Modeling domains; clustering subfamilies.
[Figures from (Krogh et al., 1994)]
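Below is a schematic sketch of the align/accumulate/update loop summarized above (essentially Viterbi-style training). The helpers initial_parameters() and viterbi_align() are hypothetical placeholders standing in for a real profile-HMM implementation, and the convergence test is deliberately crude.

from collections import Counter

# Schematic sketch of the "align, accumulate, update, repeat" loop described
# above. initial_parameters() and viterbi_align() are hypothetical placeholders.
# (Assumes every state on the path emits a residue; delete states are omitted.)

def estimate_hmm(sequences, initial_parameters, viterbi_align, max_iterations=20):
    params = initial_parameters()                       # initially lousy estimates
    for _ in range(max_iterations):
        trans_counts, emit_counts = Counter(), Counter()
        for seq in sequences:                           # introduce each sequence
            path = viterbi_align(seq, params)           # align it to the current model
            for s_prev, s_next in zip(path, path[1:]):
                trans_counts[(s_prev, s_next)] += 1     # accumulate transition statistics
            for state, residue in zip(path, seq):
                emit_counts[(state, residue)] += 1      # accumulate emission statistics
        new_params = normalize(trans_counts, emit_counts)   # update from accumulated counts
        if new_params == params:                        # crude convergence test
            break
        params = new_params
    return params

def normalize(trans_counts, emit_counts):
    """Turn raw counts into probability tables (pseudocounts etc. omitted)."""
    def rows(counts):
        totals = Counter()
        for (a, _), c in counts.items():
            totals[a] += c
        return {(a, b): c / totals[a] for (a, b), c in counts.items()}
    return {"trans": rows(trans_counts), "emit": rows(emit_counts)}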

Page 6

HMM summary
- A powerful tool for modeling protein families.
- A generalization of existing profile methods.
- Data-intensive.
- Widely applicable to problems in bioinformatics.

(Lecture ends here. Extra slides follow.)

Forward-backward algorithm
- Forward pass: define α_t(j) = probability of the subsequence O_1 ... O_t with the chain in state S_j at time t.
- Recursion: α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) P(S_j | S_i) ] P(O_t | S_j)
- Then P(O_1, ..., O_T) = Σ_{j=1..N} α_T(j). (Key observation: any path must be in one of the N states at time T.)
- Define an analogous backward pass so that β_t(j) = Σ_{i=1..N} β_{t+1}(i) P(S_i | S_j) P(O_{t+1} | S_i),
- and then P(O_t came from S_j) = α_t(j) β_t(j) / Σ_{i=1..N} α_t(i) β_t(i).
[Figure: trellis diagrams illustrating the forward (..., t-1, t) and backward (t, t+1, ...) passes]

Baum-Welch algorithm (Expectation-Maximization)
- Maximizes P(O_1, ..., O_n | θ) by setting the parameters to their expected values given the observed sequences:
- State transition probabilities: P(S_j | S_i) = [ Σ_t P(in S_i at t, in S_j at t+1 | O) ] / [ Σ_t P(in S_i at t | O) ]
- Observation probabilities: P(obs | S_i) = [ Σ_t P(in S_i at t | O) · 1(O_t = obs) ] / [ Σ_t P(in S_i at t | O) ]
- Recalculate the expectations with the new probabilities; iterate to convergence.
- Guaranteed strictly increasing, converges to a local mode. (See Rabiner, 1989 for details.)
(A code sketch of the forward-backward passes follows this page.)

HMM-based multiple sequence alignment
Multiple alignment of k sequences is hard, so instead:
1. Estimate a statistical model for the sequences:
   * use a PROFILE alignment as a first guess, or
   * start from scratch with unaligned sequences (harder).
2. Align each sequence to the model.
3. The alignment yields assignments of equivalent sequence elements within the multiple alignment.
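Below is a minimal sketch of the forward and backward passes and the posterior P(in S_j at t | O) defined above, reusing the hypothetical two-state toy model from the earlier sketches.

# Sketch of the forward-backward algorithm: alpha from the left, beta from the
# right, then gamma_t(j) = P(in state j at time t | O). Toy model values are
# the same hypothetical ones used in the forward/Viterbi sketches.

states = ["H", "L"]
init = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.8, "L": 0.2}, "L": {"H": 0.3, "L": 0.7}}
emit = {"H": {"A": 0.4, "E": 0.3, "K": 0.2, "G": 0.1},
        "L": {"A": 0.1, "E": 0.2, "K": 0.2, "G": 0.5}}

def forward_backward(obs):
    T = len(obs)
    # forward pass: alpha_t(j)
    alpha = [{j: init[j] * emit[j][obs[0]] for j in states}]
    for t in range(1, T):
        alpha.append({j: sum(alpha[t - 1][i] * trans[i][j] for i in states) * emit[j][obs[t]]
                      for j in states})
    # backward pass: beta_T(j) = 1, then recurse leftwards
    beta = [dict.fromkeys(states, 1.0) for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = {j: sum(trans[j][i] * emit[i][obs[t + 1]] * beta[t + 1][i] for i in states)
                   for j in states}
    p_obs = sum(alpha[-1].values())                 # P(O_1 .. O_T)
    # posterior: gamma_t(j) = alpha_t(j) beta_t(j) / sum_i alpha_t(i) beta_t(i)
    gamma = [{j: alpha[t][j] * beta[t][j] / p_obs for j in states} for t in range(T)]
    return p_obs, gamma

p, gamma = forward_backward("AEKGG")
print(p)
print(gamma[2])   # P(hidden state at position 3 | observed sequence)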