EECS730: Introduction to Bioinformatics

Similar documents
Hidden Markov Models

Hidden Markov Models

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Week 10: Homology Modelling (II) - HHpred

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Computational Genomics and Molecular Biology, Fall

Pair Hidden Markov Models

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Hidden Markov Models. Three classic HMM problems

Pairwise sequence alignment and pair hidden Markov models

Hidden Markov Models. x 1 x 2 x 3 x K

EECS730: Introduction to Bioinformatics

Computational Genomics and Molecular Biology, Fall

Hidden Markov Models

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Hidden Markov Models (HMMs) and Profiles

Stephen Scott.

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

CS711008Z Algorithm Design and Analysis

CSCE 478/878 Lecture 9: Hidden. Markov. Models. Stephen Scott. Introduction. Outline. Markov. Chains. Hidden Markov Models. CSCE 478/878 Lecture 9:

Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98)

Chapter 4: Hidden Markov Models

Hidden Markov Models

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Today s Lecture: HMMs

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

HMMs and biological sequence analysis

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Hidden Markov Models (HMMs) November 14, 2017

Stephen Scott.

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Lecture 9. Intro to Hidden Markov Models (finish up)

CSCE 471/871 Lecture 3: Markov Chains and

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Hidden Markov Models. Ron Shamir, CG 08

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

Hidden Markov Models. based on chapters from the book Durbin, Eddy, Krogh and Mitchison Biological Sequence Analysis via Shamir s lecture notes

HIDDEN MARKOV MODELS

11.3 Decoding Algorithm

Hidden Markov Models

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

Hidden Markov Models for biological sequence analysis

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Pairwise & Multiple sequence alignments

Assignments for lecture Bioinformatics III WS 03/04. Assignment 5, return until Dec 16, 2003, 11 am. Your name: Matrikelnummer: Fachrichtung:

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Hidden Markov Models for biological sequence analysis I

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing

Pairwise alignment using HMMs

Hidden Markov Models

COMS 4771 Probabilistic Reasoning via Graphical Models. Nakul Verma

Lecture 7 Sequence analysis. Hidden Markov Models

Biological Sequences and Hidden Markov Models CPBS7711

Plan for today. ! Part 1: (Hidden) Markov models. ! Part 2: String matching and read mapping

Hidden Markov Models. x 1 x 2 x 3 x K

Hidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010

Hidden Markov Models. Terminology, Representation and Basic Problems

Bioinformatics 1--lectures 15, 16. Markov chains Hidden Markov models Profile HMMs

O 3 O 4 O 5. q 3. q 4. Transition

Applications of Hidden Markov Models

Data Mining in Bioinformatics HMM

Introduction to Machine Learning CMU-10701

Evolutionary Models. Evolutionary Models

Exercise 5. Sequence Profiles & BLAST

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010

Computational Biology Lecture #3: Probability and Statistics. Bud Mishra Professor of Computer Science, Mathematics, & Cell Biology Sept

Hidden Markov Models Hamid R. Rabiee

Hidden Markov Models The three basic HMM problems (note: change in notation) Mitch Marcus CSE 391

A profile-based protein sequence alignment algorithm for a domain clustering database

Markov Chains and Hidden Markov Models. = stochastic, generative models

Lecture 3: Markov chains.

Example: The Dishonest Casino. Hidden Markov Models. Question # 1 Evaluation. The dishonest casino model. Question # 3 Learning. Question # 2 Decoding

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from:

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Statistical Methods for NLP

Multiple Sequence Alignment using Profile HMM

Hidden Markov Models and Their Applications in Biological Sequence Analysis

Basic math for biology

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Sequence analysis and Genomics

Basic Local Alignment Search Tool

Hidden Markov Models 1

Sequence Alignment (chapter 6)

Single alignment: Substitution Matrix. 16 march 2017

Algorithms in Bioinformatics

R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological sequence analysis. Cambridge University Press, ISBN (Chapter 3)

Biology 644: Bioinformatics

Advanced Data Science

ROBI POLIKAR. ECE 402/504 Lecture Hidden Markov Models IGNAL PROCESSING & PATTERN RECOGNITION ROWAN UNIVERSITY

Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell

Moreover, the circular logic

BMI/CS 576 Fall 2016 Final Exam

Sequence Analysis and Databases 2: Sequences and Multiple Alignments

Malik Hammoutène - SSC. Sequence Analysis

Transcription:

EECS730: Introduction to Bioinformatics Lecture 07: profile Hidden Markov Model http://bibiserv.techfak.uni-bielefeld.de/sadr2/databasesearch/hmmer/profilehmm.gif Slides adapted from Dr. Shaojie Zhang (University of Central Florida)

Information from multiple sequence alignments Protein/Gene family: homolog, ortholog, paralog, and xenolog Usually homologs are rooted form the same gene, diverged during evolution, and have similar biological functions Multiple alignments of homologous sequences usually reveal important sequence feature of the protein family and indicate its function We have discussed in the previous class how to build multiple sequence alignments from a set of homologous sequences

The revised homolog detection problem Can we use the unbiased centroid as query instead? Querying sequence Protein family False negatives True positives False positives

The revised homolog detection problem Input: a set of homologous sequences from the same protein family, and a unannotated protein sequence Output: the likelihood that the unannotated protein sequence is also from the protein family Naïve solution: perform pairwise alignment between each sequence in the family with the unannotated protein sequence It could be very slow, and it may not reflect true homology

Can we summarize information of a protein family from MSA?

An intuitive way is to summarize column-wise frequency https://en.wikipedia.org/wiki/position_weight_matrix

Using the Position Specific Scoring Matrix Modified matching scores Sum(p i,j * score(j, a)) Keep the original setup for the gap penalty RPS-BLAST https://upload.wikimedia.org/wikipedia/commons/8/ 85/LexA_gram_positive_bacteria_sequence_logo.png The gaps are not handled well, we need more advanced model to account for gaps

Introducing the Markov Model First-order Markov Chain M = (Q, p, a) Q finite set of states, say Q = n a n x n transition probability matrix a(i,j)= Pr[q t+1 =j g t =i] p n-vector, starting probability vector p(i) = Pr[q 0 =i] For any row of a the sum of entries = 1 Sp (i) = 1

Hidden Markov Model (HMM) Hidden Markov Model is a Markov model in which one does not observe a sequence of states but results of a function prescribed on states in our case this is emission of a symbol (amino acid or a nucleotide). States are hidden to the observers.

Emission probabilities Assume that at each state a Markov process emits (with some distribution) a symbol from alphabet S. Rather than observing a sequence of states we observe a sequence of emitted symbols. Example: S ={A,C,T,G}. Generate a sequence where A,C,T,G have frequency p(a) =.33, p(g)=.2, p(c)=.2, p(t) =.27 respectively one state A.33 T.27 C.2 G.2 1.0 emission probabilities

HMM HMM is a Markov process that at each time step generates a symbol from some alphabet, S, according to emission probability that depends on state. M = (Q, S, p, a, e) Q finite set of states, say n states ={q 0,q 1, } a n x n transition probability matrix: a(i,j) = Pr[q t+1 =j g t =i] p n-vector, start probability vector: p (i) = Pr[q 0 =i] S = {s 1,,s k }-alphabet e(i,j) = Pr[o t =s j q t = i]; o t t th element of generated sequences = probability of generating o j in state q i (S=o 0, o T the output sequence)

Occasionally dishonest casino

Summarizing MSA using HMM If we simply consider MSA columns without gaps This is equivalent to PSSM Eddy et al. Biological Sequence Analysis, Cambridge Press

MSA to HMM Considering the Insertions Affine gap penalty Eddy et al. Biological Sequence Analysis, Cambridge Press

MSA to HMM Affine gap penalty Eddy et al. Biological Sequence Analysis, Cambridge Press

MSA to HMM, the complete model Eddy et al. Biological Sequence Analysis, Cambridge Press

How many states should we have The number of matching state is usually determined as the number of columns who have non-gap majority Number of insertion and deletion states determined correspondingly

Computing the parameters Emission probability Transition probability Eddy et al. Biological Sequence Analysis, Cambridge Press

How to align MSA profile to a sequence Eddy et al. Biological Sequence Analysis, Cambridge Press

Time complexity O(MN), where M is the number of states in HMM and N is the length of the observed sequence

Viterbi algorithm for generalized HMM Eddy et al. Biological Sequence Analysis, Cambridge Press

Time complexity O(M 2 N), where M is the number of states in HMM and N is the length of the observed sequence

Limitation of the Viterbi path Eddy et al. Biological Sequence Analysis, Cambridge Press

Forward-backward algorithm Using Viterbi algorithm, we can calculate the most probable parse of the observed sequence given the HMM However, in many cases we want to calculate all probable parses that can give rise to the observed sequence given the HMM This can be very useful when there are many suboptimal paths that are nearly as good as the most probable path We can compute is using the Forward algorithm

Forward algorithm Eddy et al. Biological Sequence Analysis, Cambridge Press

The need for decoding What is the probability that an observed character comes from a given state??? Eddy et al. Biological Sequence Analysis, Cambridge Press

Backward algorithm Eddy et al. Biological Sequence Analysis, Cambridge Press

Decoding Eddy et al. Biological Sequence Analysis, Cambridge Press

PFAM

HMMER3 HMM

HMMER