Bioinformatics Introduction to Hidden Markov Models Hidden Markov Models and Multiple Sequence Alignment

Slides borrowed from Scott C. Schmidler (MIS graduate student).

Outline

- Probability review
- Markov chains
- Hidden Markov chains
- Examples of HMMs for protein sequences
- Algorithm review for HMMs

(C) 200 SNU CSE Artificial Intelligence Lab (SCAI)

Motivation: Composing a Drama by Mimicking Shakespeare

- Assume we want to write a drama in the style of Shakespeare.
- We collect a large set of Shakespeare's works.
- Define a vocabulary V = {X_1, X_2, ..., X_N}.
- Build a model P(X_i | X_j) for i, j = 1, ..., N.
- To compose a drama, generate words from the model P(X_i | X_j).
- Though this is too simplistic to be useful, this naive model can be extended and refined to mimic Shakespeare's writing style.
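The word-pair model P(X_i | X_j) described above can be sketched in a few lines. This is a minimal illustration, not part of the original lecture; the toy corpus and function names are invented for the example:

```python
import random
from collections import defaultdict

def train_bigram(words):
    """Estimate P(next | current) by counting adjacent word pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for cur, nxt in zip(words, words[1:]):
        counts[cur][nxt] += 1
    model = {}
    for cur, nxts in counts.items():
        total = sum(nxts.values())
        model[cur] = {w: c / total for w, c in nxts.items()}
    return model

def generate(model, start, length, seed=0):
    """Generate words by repeatedly sampling from P(next | current)."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        probs = model.get(out[-1])
        if not probs:  # dead end: word never seen with a successor
            break
        words, weights = zip(*probs.items())
        out.append(rng.choices(words, weights=weights)[0])
    return out

corpus = "to be or not to be that is the question".split()
model = train_bigram(corpus)
print(generate(model, "to", 5))
```

Exactly the same machinery works with musical notes or amino acids as the vocabulary, which is the point of the following slides.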

Markov Approximations to English! From Shannon s original paper:. Zero-order approximation: XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD 2. First-order approximation: OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL 3. Second-order approximation: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CITSBE (C) 200 SNU CSE Artificial Intelligence Lab (SCAI) 5 Markov Approximations (cont.) From Shannon s paper 4. Third-order approximation: IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTABIN IS REGOACTIONA OF CRE Markov random field with 000 features, no underlying machine (Della Pietra et. Al, 997): WAS REASER IN THERE TO WILL WAS BY HOMES THING BE RELOVERATED THER WHICH CONISTS AT RORES ANDITING WITH PROVERAL THE CHESTRAING FOR HAVE TO INTRALLY OF QUT DIVERAL THIS OFFECT INATEVER THIFER CONSTRANDED STATER VILL MENTTERING AND OF IN VERATE OF TO (C) 200 SNU CSE Artificial Intelligence Lab (SCAI) 6

Word-Based Approximations

1. First-order approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE T
2. Second-order approximation: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPETED

Shannon's comment: "It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage."

Motivation: Composing a Symphony in the Style of Beethoven

- We want to compose a symphony in the style of Beethoven.
- We collect a large set of Beethoven's works.
- Define a vocabulary V = {X_1, X_2, ..., X_N} of musical notes.
- Build a model P(X_i | X_j) for i, j = 1, ..., N.
- To compose a symphony, generate note symbols from the model P(X_i | X_j).

Modeling Biological Sequences

- Collect a set of sequences of interest.
- Define a vocabulary V = {X_1, X_2, ..., X_N}:
  - for DNA sequences: N = 4 and V = {A, T, G, C};
  - for protein sequences: N = 20 and V = {amino acids}.
- Build (learn) a model P(X_i | X_j) for i, j = 1, ..., N, or more generally P(X | w) with X = X_1, X_2, ..., X_M and model parameter vector w.
- The model can be used:
  - to generate typical sequences from the class of training sequences, e.g. a protein family;
  - to compute the probability that an observed sequence O was generated from the model class;
  - and for other tasks.
- Hidden Markov models (HMMs) are a class of stochastic generative models effective for building such probabilistic models.

Probability Review

- Probability notation:
  - probability
  - joint probability
  - conditional probability
  - marginal probability
  - independence
  - Bayes rule

Markov Chains

- Markov property: P(S_t | S_{t-1}, S_{t-2}, ..., S_0) = P(S_t | S_{t-1})
- Formally defined by:
  - a state space
  - a transition matrix
  - an initial distribution
- CS intuition: a stochastic finite automaton.

Markovian Sequence

- The states through which the chain passes form a sequence, e.g. S_0, S_1, S_1, S_2, S_0, S_1, ...
- By the Markov property:

  P(Sequence) = P(S_0, S_1, S_2, ...) = π(S_0) P(S_1 | S_0) P(S_2 | S_1) ...

Example

- Markov chain for generating a DNA sequence.
- Sequence probability:

  P(AGATCG) = π(A) P(G | A) P(A | G) P(T | A) ...

- The transition probabilities capture dinucleotide frequency (e.g. base-stacking).

Hidden Markov Chains

- The observed sequence is a probabilistic function of an underlying Markov chain.
- Example: an HMM for a (noisy) DNA sequence (see e.g. Churchill 1989).
- The true state sequence (the unobserved truth) is unknown, but the observed noisy sequence data gives us a clue.
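The chain-rule factorization above is easy to sketch in code. The transition probabilities below are invented for illustration (they are not measured dinucleotide frequencies); only the factorization itself comes from the slide:

```python
# Initial distribution π and hypothetical transition matrix P(next | current).
pi = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
trans = {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"}
# Skew one row so the example is not uniform (illustrative numbers only).
trans["A"] = {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2}

def chain_prob(seq, pi, trans):
    """P(seq) = pi(s_0) * product over t of P(s_t | s_{t-1})."""
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p

print(chain_prob("AGATCG", pi, trans))
```

In practice these probabilities would be estimated from counts of adjacent bases in real sequences, exactly as in the bigram word model.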

Example: Hidden Markov Chain for Protein Sequence (figure from Krogh et al. 1994)

- State space is the backbone secondary structure.
  - Used for prediction (Asai et al., Stultz et al.)
- State space is the side-chain environment.
  - Used for fold recognition (Hubbard et al.)

An HMM for Multiple Protein Sequences (Krogh et al.)

- Match states are model (consensus) positions.
- Position-specific deletion penalties.
- Position-specific insertion frequencies.
- A path through the states aligns a sequence to the model.

Example: Multiple Alignment of Globin Sequences (figure from Krogh et al. 1994)

HMM-based Multiple Sequence Alignment

Exact multiple alignment of k sequences is O(n^k), so instead:

1. Estimate a statistical model for the sequences:
   - use a head start (PROFILE alignment), or
   - start from scratch with unaligned sequences (harder).
2. Align each remaining sequence to the model.
3. The alignment yields assignments of equivalent sequence elements within the multiple alignment.

Example: Aligning a Sequence to the Model

Given an HMM model for a protein family, align a new sequence to the model (d states are gaps/deletions, i states are insertions).

Computing with HMMs

Three tasks:

1. Probability of an observed sequence: given O_1, O_2, ..., O_r, find P(O_1, O_2, ..., O_r) (nontrivial since the state sequence is unobserved).
2. Most likely hidden state sequence: given the observed sequence O_1, ..., O_r, compute argmax over S_1, ..., S_r of P(S_1, ..., S_r | O_1, ..., O_r).
3. Most likely model parameters: given O_1, ..., O_r, compute argmax over θ of P(O_1, ..., O_r | θ).

Computing the Likelihood of an Observed Sequence

Compute P(O_1, O_2, ..., O_r):

- The true state sequence is unknown, so we must sum over all possible paths.
- The number of paths is O(N^T).
- The Markovian structure permits a recursive definition, and hence efficient calculation by dynamic programming:

  P(O_1, ..., O_r) = Σ over all S_0, S_1, ..., S_r of P(O_1, O_2, ..., O_r | S_0, S_1, ..., S_r) P(S_0, S_1, ..., S_r)

- Key observation: any path must be in exactly one state at time t.

Key Idea for HMM Computations

(Figure: a trellis with the N possible states/amino acids at each of times t, t+1, ..., T.)
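The brute-force sum over all N^T paths can be written directly for a toy model; this is an illustration of why dynamic programming is needed, with made-up numbers for a hypothetical 2-state HMM (none of this comes from the lecture):

```python
from itertools import product

# Toy 2-state HMM: states 0/1, observations "H"/"T" (illustrative numbers).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [{"H": 0.9, "T": 0.1}, {"H": 0.2, "T": 0.8}]

def brute_force_likelihood(obs):
    """Sum P(obs, path) over all N**T hidden state paths."""
    total = 0.0
    for path in product(range(2), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total

print(brute_force_likelihood("HTH"))
```

This enumerates 2^3 = 8 paths here, but the count grows as N^T; the forward algorithm later in the slides computes the same quantity in O(N^2 T) time.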

Example: Searching a Protein Database with an HMM Profile (figures for globins and protein kinases from Krogh et al. 1994)

- For each sequence in the database, ask: does the sequence fit the model?
- Score by P(O_1, O_2, ..., O_r); compute a Z-score adjusted for length.

Estimating the Alignment and Model Parameters Simultaneously

Key idea, missing data:

- What if we knew the alignment? The parameters would be easy to estimate:
  - calculate the (expected) number of transitions;
  - calculate the (expected) frequency of amino acids.
- What if we knew the parameters? The alignment would be easy to find:
  - align each sequence to the model using the Viterbi algorithm;
  - align residues in match states.

Other Details

- How many states in the model?
- How to initialize the parameters?
- How to avoid local optima?

See Krogh et al. (1994) for some suggestions.

Multiple Protein Sequence Alignment

Given a set of sequences:

- Estimate the HMM, using optimization for the parameter search (Baum-Welch / EM).
- Align each sequence to the model (Viterbi).
- The match states of the model provide the columns of the resulting multiple alignment.

Extensions

- Clustering subfamilies
- Modeling domains

(Figure from Krogh et al. 1994)

Tradeoffs

- Advantages:
  - an explicit probabilistic model for the family;
  - position-specific residue distributions, gap penalties, and insertion frequencies.
- Disadvantages:
  - many parameters, requiring more data or care;
  - we have traded one hard optimization problem for another.

HMM Summary

- A powerful tool for modeling protein families.
- A generalization of existing profile methods.
- Data-intensive.
- Widely applicable to problems in bioinformatics.

References

- Bioinformatics classic: Krogh et al. (1994) "Hidden Markov models in computational biology: applications to protein modeling", J. Mol. Biol. 235:1501-1531.
- Book: Eddy & Durbin (1999). See web site.
- Tutorial: Rabiner, L. (1989) "A tutorial on hidden Markov models and selected applications in speech recognition", Proc. IEEE 77(2):257-286.

Forward-Backward Algorithm

Forward pass. Define

  α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) P(S_j | S_i) ] P(O_t | S_j)

the probability of the subsequence O_1, O_2, ..., O_t when in state S_j at time t. Key observation: any path must be in one of the N states at time t.

Notice that

  P(O_1, O_2, ..., O_r) = Σ_{j=1..N} α_r(j)

Define an analogous backward pass so that

  β_t(j) = Σ_{i=1..N} β_{t+1}(i) P(S_i | S_j) P(O_{t+1} | S_i)

and

  P(O_t came from S_i) = α_t(i) β_t(i) / Σ_{i=1..N} α_t(i) β_t(i)
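The two recursions above can be sketched for a toy 2-state HMM. The model numbers are made up for illustration; the recursions themselves follow the slide's definitions of α and β:

```python
# Toy 2-state HMM with invented probabilities (illustration only).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [{"H": 0.9, "T": 0.1}, {"H": 0.2, "T": 0.8}]

def forward(obs):
    """alpha[t][j] = P(O_1..O_t, state j at time t)."""
    alpha = [[init[j] * emit[j][obs[0]] for j in range(2)]]
    for t in range(1, len(obs)):
        alpha.append([
            sum(alpha[t - 1][i] * trans[i][j] for i in range(2)) * emit[j][obs[t]]
            for j in range(2)
        ])
    return alpha

def backward(obs):
    """beta[t][i] = P(O_{t+1}..O_T | state i at time t)."""
    T = len(obs)
    beta = [[1.0, 1.0] for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(2):
            beta[t][i] = sum(
                trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                for j in range(2)
            )
    return beta

obs = "HTH"
alpha, beta = forward(obs), backward(obs)
likelihood = sum(alpha[-1])  # P(O_1, ..., O_T)
# Posterior P(state i at time t | O) = alpha[t][i] * beta[t][i] / P(O).
posterior = [[alpha[t][i] * beta[t][i] / likelihood for i in range(2)]
             for t in range(len(obs))]
print(likelihood)
```

A useful sanity check is that α_t(i) β_t(i) summed over i gives the same likelihood at every t, so each posterior row sums to 1.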

Finding the Most Likely Path

Forward pass:

- Replace the summation with maximization: α_t(j) becomes the max probability of the subsequence O_1, O_2, ..., O_t when in S_j at time t.
- Again:

  max over paths of P(O_1, O_2, ..., O_r, path) = max_{1 ≤ j ≤ N} α_r(j)

  then trace back.

Baum-Welch Algorithm (Expectation-Maximization)

- Set the parameters to their expected values given the observed sequences:
  - state transition probabilities;
  - observation probabilities.
- Recalculate the expectations with the new probabilities.
- Iterate to convergence.
- Guaranteed: P(O_1, ..., O_n | θ) is strictly increasing and converges to a local mode. (See Rabiner 1989 for details.)
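The max-and-traceback recursion described above (the Viterbi algorithm) can be sketched as follows, again on a hypothetical 2-state coin-style HMM with invented probabilities:

```python
def viterbi(obs, init, trans, emit):
    """Most likely hidden path: the forward sum replaced by max, plus traceback."""
    N, T = len(init), len(obs)
    delta = [[init[j] * emit[j][obs[0]] for j in range(N)]]
    back = []  # back[t-1][j] = best predecessor of state j at time t+1
    for t in range(1, T):
        row, ptr = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * trans[i][j])
            row.append(delta[t - 1][best_i] * trans[best_i][j] * emit[j][obs[t]])
            ptr.append(best_i)
        delta.append(row)
        back.append(ptr)
    # Start from the best final state and follow the back-pointers.
    path = [max(range(N), key=lambda j: delta[-1][j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [{"H": 0.9, "T": 0.1}, {"H": 0.2, "T": 0.8}]
print(viterbi("HHTT", init, trans, emit))  # → [0, 0, 1, 1]
```

In the profile-HMM setting of the earlier slides, this traceback through match, insert, and delete states is exactly what aligns a new sequence to the model.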