Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell

Similar documents
Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Pairwise sequence alignment and pair hidden Markov models

Week 10: Homology Modelling (II) - HHpred

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

PROTEIN FUNCTION PREDICTION WITH AMINO ACID SEQUENCE AND SECONDARY STRUCTURE ALIGNMENT SCORES

Hidden Markov Models (HMMs) and Profiles

EECS730: Introduction to Bioinformatics

Hidden Markov Models. based on chapters from the book Durbin, Eddy, Krogh and Mitchison Biological Sequence Analysis via Shamir s lecture notes

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

STRUCTURAL BIOINFORMATICS I. Fall 2015

CAP 5510 Lecture 3 Protein Structures

Gibbs Sampling Methods for Multiple Sequence Alignment

Markov Chains and Hidden Markov Models. = stochastic, generative models

Assignments for lecture Bioinformatics III WS 03/04. Assignment 5, return until Dec 16, 2003, 11 am. Your name: Matrikelnummer: Fachrichtung:

Hidden Markov Models

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

Sequences and Information

Template Free Protein Structure Modeling Jianlin Cheng, PhD

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Template Free Protein Structure Modeling Jianlin Cheng, PhD

Hidden Markov Models for biological sequence analysis I

Motif Extraction and Protein Classification

Truncated Profile Hidden Markov Models

Protein Structure Prediction and Display

Computational Genomics and Molecular Biology, Fall

Replicated Softmax: an Undirected Topic Model. Stephen Turner

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

Dynamic Approaches: The Hidden Markov Model

Data Mining in Bioinformatics HMM

bioinformatics 1 -- lecture 7

Hidden Markov Models for biological sequence analysis

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

A General Method for Combining Predictors Tested on Protein Secondary Structure Prediction

Stephen Scott.

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

SI Materials and Methods

15-381: Artificial Intelligence. Hidden Markov Models (HMMs)

进化树构建方法的概率方法 第 4 章 : 进化树构建的概率方法 问题介绍. 部分 lid 修改自 i i f l 的 ih l i

Conditional Graphical Models

Bioinformatics 1--lectures 15, 16. Markov chains Hidden Markov models Profile HMMs

Bayesian Networks BY: MOHAMAD ALSABBAGH

Protein Structures. Sequences of amino acid residues 20 different amino acids. Quaternary. Primary. Tertiary. Secondary. 10/8/2002 Lecture 12 1

Protein Secondary Structure Prediction

Bi 8 Midterm Review. TAs: Sarah Cohen, Doo Young Lee, Erin Isaza, and Courtney Chen

Hidden Markov Models. music recognition. deal with variations in - pitch - timing - timbre 2

Computational Molecular Biology (

Template-Based Representations. Sargur Srihari

Learning to Align Sequences: A Maximum-Margin Approach

Bayesian Learning in Undirected Graphical Models

Lecture 7 Sequence analysis. Hidden Markov Models

Protein folding. α-helix. Lecture 21. An α-helix is a simple helix having on average 10 residues (3 turns of the helix)

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics

Numerically Stable Hidden Markov Model Implementation

Hidden Markov Models

Today. Last time. Secondary structure Transmembrane proteins. Domains Hidden Markov Models. Structure prediction. Secondary structure

Directed Graphical Models or Bayesian Networks

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster.

A Graphical Model for Protein Secondary Structure Prediction

Bayesian Segmental Models with Multiple Sequence Alignment Profiles for Protein Secondary Structure and Contact Map Prediction

Intro Secondary structure Transmembrane proteins Function End. Last time. Domains Hidden Markov Models

IMPORTANCE OF SECONDARY STRUCTURE ELEMENTS FOR PREDICTION OF GO ANNOTATIONS

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

CSCI 5832 Natural Language Processing. Today 2/19. Statistical Sequence Classification. Lecture 9

Conditional Random Fields: An Introduction

SOME CHALLENGES IN COMPUTATIONAL BIOLOGY

Learning in Bayesian Networks

Sequence Analysis. BBSI 2006: Lecture #(χ+1) Takis Benos (2006) BBSI MAY P. Benos. Molecular Genetics 101. Cell s internal world

Introduction to Bioinformatics

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

Similarity searching summary (2)

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

On the Sizes of Decision Diagrams Representing the Set of All Parse Trees of a Context-free Grammar

Undirected Graphical Models

Infering the Number of State Clusters in Hidden Markov Model and its Extension

Data-Intensive Computing with MapReduce

Discrete Probability and State Estimation

Probability. CS 3793/5233 Artificial Intelligence Probability 1

Hidden Markov Models

Sequence Alignment Techniques and Their Uses

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"

Large-Scale Genomic Surveys

Building 3D models of proteins

Jessica Wehner. Summer Fellow Bioengineering and Bioinformatics Summer Institute University of Pittsburgh 29 May 2008

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

BIOINFORMATICS LAB AP BIOLOGY

Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98)

Moreover, the circular logic

COMS 4771 Probabilistic Reasoning via Graphical Models. Nakul Verma

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

EM with Features. Nov. 19, Sunday, November 24, 13

Sequence Analysis '17 -- lecture 7

The Monte Carlo Method: Bayesian Networks

Hidden Markov Models. Terminology, Representation and Basic Problems

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

Computational Biology From The Perspective Of A Physical Scientist

Intelligent Systems (AI-2)

Sequence Analysis '17- lecture 8. Multiple sequence alignment

Probabilistic Arithmetic Automata

Transcription:

Hidden Markov Models in computational biology Ron Elber Computer Science Cornell 1

Or: how to fish homolog sequences from a database Many sequences in database RPOBESEQ Partitioned data base 2

An accessible book on statistical models of sequence matching R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological sequence analysis: Probabilitic models of proteins and nucleic acids, Cambridge, Cambridge 1998 3

What is a Markov model and why is it Hidden?? A Markov process is used to describe time evolution. It depends only one step to the past p(i+1 i) transition operator Sometimes we have a path but do not know the rules: the model is hidden. 1 2 3 4 5 6 7 4

HMM is a probabilistic model: A few words on probability Set of objects A i (cities, amino acids): 1 2 3 4 Probability of an object P(A i ) Joint probability P(A i,a k ) Conditional probability P(A i A k ) Bayesian relationship ( ) ( ) ( ) ( ) i, j = i j j = j i ( i) P A A P A A P A P A A P A 5

Markov Models P(1,2,3,4,5,6,7) P(1 2) P(2,3,4,5,6,7) P(1 2) P(2,3,4,5,6,7) P(1 2) P(2 3)... P(6 7) P(7) In a Markov process the probability of item i+1 depends only on item i 5 6 7 1 2 3 4 6

Examples of Processes that are described by Markov models Walk under the influence Molecular diffusion in liquid Stock market Protein sequences MKTTEYVAEILNELHNSAAYISNEEA 7

Is sequence scoring described by a Markov model? Query: RPFAANGQSANYIPALGKVNDSQLGICVLEPDGTMIHAGDWNVSFTMQSISKVISFIAACMSRG Sbjct: AVHYLNPESADLRALAKHLYDSYIKSFPLTKAKARAILTGKTTDKSPFVIYDMNSLMMGEDKIK Score = score( ai, bi) i Scoring sequence alignment is simpler than a Markov model 8

Scoring pair of sequences is independent on the order of the pairs Score = i score( a, b) i i In a Markov model we will expect scoring of the type: Score= score( a, b a, b) i+ 1 i+ 1 i i i 9

Simple Markov model for scoring sequence alignment Query: RPFAANGQSANYIPALGKVNDSQLGICVLEPDGTMIHAGDWNVSFTMQSISKVISFIAACMSRG Sbjct: AVHYLNPESADLRALAKHLYDSYIKSFPLTKAKARAILTGKTTDKSPFVIYDMNSLMMGEDKIK S = snn ( GP) + sgp ( QE) 10

A Markov model of a single sequence aa... a Pa ( a ) P a a... P a 1 2 n 1 2 2 3 ( ) ( ) n What is the Markov model of a random protein? Pa (,..., a ) P a P a P a ( ) ( ) ( ) n 1 n 1 2 n 1 20 11

A Markov model can be tailored to specific protein family...hw.....hw.....hf.. ( ) ( ) P family =... δ PW or F... H 12

Pictorial view of a Markov model 13

Think on a Markov model for protein annotation as a sequence generating machine... Sequences are generated that are compatible with a given family. For example, generate all sequences with histidine at position 64 and 93 14

The universe of sequences can be presented by HMM models MM1 MM2 MM4 MM3 Related work by Golan Yona 15

How to use MM in sequence annotation Design Markov Models for known families Use multiple sequence information Estimate parameters for the Markov model Generate compatible sequences, examine most probable sequence of the sequence generating machine. Most probable sequences can help identify family features For annotation: check if a sequence is compatible with one of the MM models 16

Score against a Markov model T pa a = log ( ) j j 1 The sum is over multiple paths 17

Hidden Markov Models We have at hand a sequence with unknown annotation. We assume that the unknown sequence was generated by one of the Markov models of the known protein families. We wish to identify which of the (hidden) Markov Models is likely to generate the probe sequence. 18

A simple scoring scheme for a HMM model (profile) The protein family is denoted by M A sequence is denoted by S Are we interested in P(S M) or P(M S)? A simple probability model for P(S M) ( ) PS ( M) pi a Yield a site-dependent scoring function T ( a) ( ) p i = log q a Is the general HMM score, can be written as a sum? t i 19

HMM for secondary structure It is also possible to generate HMM for secondary structure prediction (based on correlation between 10-20 amino acids) Psequence ( helix) or Psequence ( beta sheet) 20

The design of a profile model Have multiple sequence alignment A A C H G W N... A V C F G W Q... A A C F G Y Q... A G C H A W D... A A C H G W Q... Estimate probability of an amino acid at column i, p(a,i): problem with sampling 21

Estimating p(a,i) Use multiple sequence alignment to estimate frequencies f(a,i)=n(a,i)/n(i) In the simplest version the frequencies are set equal to probabilities p(a,i)=f(a,i) 22

Estimating p(a,i) Estimating p(a,i) is difficult Typically we will have sequences of about 10 family members For each site we have 20 possibilities (20 amino acids) If the probability is zero the score (-log of probability) is minus infinity A major research in computer science: estimating model parameters from sparse data 23

Estimating parameters of the model Is a problem also for the simple profile model. Clearly a problem also for the more complex HMM models. 24

Pseudo counting is a simple workable procedure Have a reference zero order distribution (e.g. P(a) the probability of finding an amino acid independent of the site) Assume a set of points estimated from the reference distribution in addition to the multiple sequence alignment P(a,i)=n(a,i)/n(i)+n(a)/n The more we have reference points the less specific is the model. A compromise between data and statistical accuracy 25

Summary HMM versus simple alignment HMM is a more complex model that includes information on multiple sequence alignment to parameterize a sequence-generating machine. As such it is expected to be more accurate than single sequence to a single sequence comparison. Depends on the availability of multiple sequences. A lot of work to generate the models... Extracting information from sparse data is a critical issue. 26