1 Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department of Electrical Engineering & Computer Science University of Missouri Free for Academic Use. Jianlin Cheng.
2 Application of HMM in Biological Gene prediction Sequence Analysis Protein sequence modeling (learning, profile) Protein sequence alignment (decoding) Protein database search (scoring, e.g. fold recognition) Protein structure prediction
4 Motif and Gene Structure Intergenic region 5 untranslated region Intron Ploy A actcgtcggggcgtacgtacgtaacgtacgtaacttgaggtacaaaattcggagcactgttgagcgacaagtactgctatacataggttacgtacaa a Exon 3 untranslated region Promoter Binding site (or motif) Transcription Start site HMM has been used for modeling binding site and gene structure prediction.
5 Simplified State Transition Diagram of GenScan. GENSCAN (genes.mit.edu/genscan.html)
6 Model Protein Family (Profile HMM) Create a statistical model (HMM) for a group of related protein sequences (e.g protein family) Identify core (conserved) elements of homologous sequences Positional evolutionary information (e.g. insertion and deletion)
7 Example: Hemoglobin Transports Oxygen Heterocyclic Ring Iron atom Conserved Position
8 Why do We Build a Profile (Model)? Understand the conservation (core function and structure elements) and variation Sequence generation Multiple sequence alignments Profile-sequence alignment (more sensitive than sequence-sequence alignment) Family / fold recognition Profile-profile alignment
9 Protein Family seq1 VRRNNMGMPLIESSSYHDALFTLGYAGDRISQMLGMRLLAQGRLSEMAGADALDV seq2 NIYIDSNGIAHIYANNLHDLFLAEGYYEASQRLFEIELFGLAMGNLSSWVGAKALSS seq3 SAETYRDAWGIPHLRADTPHELARAQGTARDRAWQLEVERHRAQGTSASFLGPEALSW seq4 DRLGVVTIDAANQLDAMRALGYAQERYFEMDLMRRAPAGELSELFGAKAVDL seq1 ---VRRNNMGMPLIESSSYHDALFTLGY--AGDRISQMLGMRLLAQGRLSEMAGADALDV seq2 --NIYIDSNGIAHIYANNLHDLFLAEGYYEASQRLFEIELFG-LAMGNLSSWVGAKALSS seq3 SAETYRDAWGIPHLRADTPHELARAQGT--ARDRAWQLEVERHRAQGTSASFLGPEALSW seq DRLGVVTIDAANQLDAMRALGY--AQERYFEMDLMRRAPAGELSELFGAKAVDL Imagine these sequences evolve from a single ancestral sequence and undergo evolutionary mutations. How to use a HMM to model?
10 Key to Build a HMM is to Set Up States Think about the positions of the ancestral sequence is undergoing mutation events to generate new sequences in difference species. A position can be modeled by a dice. Match (match or mutate): the position is kept with or without variations / mutations. Delete: the position is deleted Insert: amino acids are inserted between two positions.
11 Hidden Markov Model Begin M1 M2 M3 M4 End Each match state has an emission distribution of 20 amino acids; one match state for a position. Krogh et al. 1994, Baldi et al. 1994
12 Hidden Markov Model D1 D2 D3 D4 Begin M1 M2 M3 M4 End Each match state has an emission distribution of 20 amino acids. Deletion state is a mute state (emitting a dummy) Krogh et al. 1994, Baldi et al. 1994
13 Hidden Markov Model D1 D2 D3 D4 Begin M1 M2 M3 M4 End I1 I2 I3 I4 I5 Krogh et al. 1994, Baldi et al Each match state has an emission distribution of 20 amino acids. Each insertion state has an emission distribution of 20 amino acids. Variants of architecture exist. (see Eddy, bioinformatics, 1997)
14 Hidden Markov Model D1 D2 D3 D4 Begin M1 M2 M3 M4 End I1 I2 I3 I4 I5 How many states? (M positions: length of model) M (match) + M (deletion) + (M+1) (insertion) + 2 = 3M + 3 Krogh et al. 1994, Baldi et al. 1994
15 Hidden Markov Model D1 D2 D3 D4 Begin M1 M2 M3 M4 End I1 I2 I3 I4 I5 How many transitions? (M positions: length of model) Deletion: 3M 1, Match: 3M 1, Insertion: 3(M+1) 1, B/E: 3 Total = 9M +3. Krogh et al. 1994, Baldi et al. 1994
16 Hidden Markov Model D1 D2 D3 D4 Begin M1 M2 M3 M4 End I1 I2 I3 I4 I5 How many emissions? (M positions: length of model) M * 20 (match) + (M+1)*20 (insertion) = 40M + 20 Krogh et al. 1994, Baldi et al. 1994
17 Initialization of HMM How to decide model length (the number of match states)? How to initialize transition probabilities? How to initialize emission probabilities?
18 How to Decide Model Length? Learn: Use a range of model length (centered at the average sequence length). If transition probability from a match (M i ) state to a delete state (D i+1 ) > 0.5, remove the M i+1. If transition probability from a match (M i ) state to an insertion state (I i+1 ) > 0.5, add a match state. Get from multiple alignment: assign a match state to any column with <50% gaps.
19 How to Initialize Parameters? Uniform initialization of transition probabilities is ok in most cases. Uniform initialization of emission probability of insert state is ok in many cases. Uniform initialization of emission probability of match state is bad. (lead to bad local minima) Using amino acid distribution to initialize the emission probabilities is better. (need regularization / smoothing to avoid zero)
20 Initialize from Multiple Alignments seq1 ---VRRNNMGMPLIESSSYHDALFTLGY--AGDRISQMLGMRLLAQGRLSEMAGADALDV seq2 --NIYIDSNGIAHIYANNLHDLFLAEGYYEASQRLFEIELFG-LAMGNLSSWVGAKALSS seq3 SAETYRDAWGIPHLRADTPHELARAQGT--ARDRAWQLEVERHRAQGTSASFLGPEALSW seq DRLGVVTIDAANQLDAMRALGY--AQERYFEMDLMRRAPAGELSELFGAKAVDL First, assign match / main states, delete states, insert states from MSA Get the path of each sequence Count the amino acid frequencies emitted from match or insert states, which are converted into probabilities for each state (need smoothing/ regularization / pseudo-count). Count the number of state transitions and use them to initialize transition probabilities.
21 Estimate Parameters (Learning) We want to find a set of parameters to maximize the probability of the observed sequences in the family: maximum likelihood: P(sequences model) = P(sequence 1 model) * * P(sequence n model). Baum-Welch s algorithm (or EM algorithm) (see my previous lectures about HMM theory)
22 Demo of HMMER (learning)
23 Demo /Users/jianlincheng/Desktop/work_2018/Te aching_2018/machine_learning_bioinfo_20 18/hmmer-3.0/demo Build a HMM
24 HHSuite Remmert M, Biegert A, Hauser A, Söding J (2011). "HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment.". Nat. Methods. 9 (2):
25 Visualization of Features and Structure in HMM Myoglobin protein family. How to interpret it? Krogh et al. 94
26 Demo of HMMEditor J. Dai and J. Cheng. HMMEditor: a visual editing tool for profile hidden Markov model. BMC Genomics, Support HMM models built by HMMer Paper:
27 Demo /Users/jianlincheng/Desktop/work_2018/Te aching_2018/machine_learning_bioinfo_20 18/HMMVE_1.2 Transition probabilities and emission probabilities
28 Protein Family Profile HMM Databases Pfam database (protein family database) ( Pfam 31.0 contains a total of families
29 What Can We Do With the HMM? Recognition and classification. Widely used for database search: does a new sequence belong to the family? (database search) Idea: The sequences belonging to the family (or generated from HMM) should receive higher probability than the sequence not belong to the family (unrelated sequences).
30 Two Ways to Search Build a HMM for each family in the database. Search a query sequence against the database of HMMs. (Pfam) Build a HMM for a query family, and search HMM against of a database of sequences
31 Compute P(Sequence HMM) Forward algorithm to compute P(sequence model) We work on: -log(p(sequence M): distance from the sequence to the model. (negative log likelihood score) Unfortunately, -log(p(sequence M) is length dependent. So what can we do?
32 Normalize the Score into Z-score Search the profile against a large database such as Swiss-Prot Plot log(p(seuqence model), NULL scores, against sequence length.
33 NLL score is linear to sequence length. NLL scores of the same family is lower than un-related sequences We need normalization. Krogh et al. 94
34 NULL model of unrelated sequences: Length Mean (u) Std (σ) Compute Z-score: s u / σ Z > 4: the sequence is very different from unrelated sequence. (for non-database search, a randomization can work. )
35 Extreme Value Distribution (Karlin and Altschul) Log-odds score = log(p(seq λ) / P(seq null)) P-value E-value K and lamda are statistical parameters. m,n model and sequence length.
37 Insight I: Evaluating a Sequence Against a Profile HMM is Fold Recognition Process or Function Prediction Check if a sequence is in the same family (or superfamily) as the protein family (superfamily) used to build the profile HMM. If they are in the same family, they will share the similar protein structure (fold), possibly protein function. The known structure can be used to model the structure of the proteins without known structure. The known function can be used to predict function.
38 Insight II: Evaluating a Sequence Against a HMM is Sequence-Profile Alignment Align a query sequence against a HMM of the target sequence to get the most likely path (Viterbi algorithm) (or vice versa) Match the path of the query sequence with the path of the target sequence, we get their alignment. Represented work: HMMER.
39 Pairwise Alignment via HMM Seq 1: A T G R K E Path : M 1 I 1 I 1 M 2 D 3 M 4 I 4 Seq 2: V C K E R P Path : M 1 I 1 M 2 M 3 M 4 I 4 Path: M 1 M 2 M 3 M 4 Seq 1: A T G R - K E Seq 2: V C - K E R P
40 HMM for Multiple Sequence Alignment Build a HMM for a group of sequences Align each sequence against HMM using Viterbi algorithm to find the most likely path. (dynamic programming) Match the main/match states of these paths together. Add gaps for delete states For insertion between two positions, use the longest insertion of a sequence as template. Add gaps to other sequence if necessary. (see Krogh s paper)
41 Demo for Multiple Sequence Alignment Using HMMEditor (based on hmmer2 models)
42 How About Evaluating the Similarity between HMMs? Can we evaluate the similarity of two HMMs? Can we align two profile HMMs? (profileprofile alignment). Compare HMM with HMM. J. Soeding, Bioinformatics, 2005
43 HMM-HMM Comparison Profile-Profile Alignment Generalized Sequence (of States) 1 Generalized Sequence (of States) 2 Goal: find a sequence of paired states with Maximum sum of log odds scores J. Soeding, Bioinformatics, 2005
44 Sequence Weighting Henikoff-Henikoff: sum of position-based weight. For each position, each type of amino acid is assigned weight 1. The weight is of each amino acid is 1 / frequency of the amino acid at the position. The weight of a sequence is the sum of positional weights. (gap is not counted. A position with more than 50% gaps may be removed from counting.) An easy, useful algorithm Tree algorithm: construct a phylogenetic tree. Start from root, weight 1 flows down. At any branch, the weight is cut to half.
45 Null Model Background null model (log-odds) Reverse null model (SAM) Sum of log-odds score of local alignment obeys extreme-value distribution (same as PSI-BLAST): good for estimating the significance of sequence-hmm match. Sum of log-odds score of global alignment is length dependent.
46 HMM Software and Code HMMER: SAM: HHSearch: MUSCLE: HHSuite: HHblits Remmert M, Biegert A, Hauser A, Söding J (2011). "HHblits: Lightning-fast iterative protein sequence searching by HMM- HMM alignment.". Nat. Methods. 9 (2):
47 Project 1: Multiple Sequence Alignment Using HMM Dataset: BaliBASE: Generate multiple alignment using Clustalw ( for a family of sequences Design and implement your HMM (you may refer to open source HMM code) Construct a HMM for a family of sequences (initialization, number of states) Estimate the parameters of HMM using the sequences Analyze the emission probability of states (visualization) Analyze the transition probability between states (visualization) Compute the probability of each sequence Generate multiple sequence alignments and compare it with the initial multiple alignment Reference: A. Krogh et al, JMB, 1994 and open source HMM. J. D. Thompson, F. Plewniak and O. Poch. Bioinformatics, 1999
Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in
HMMs and biological sequence analysis Hidden Markov Model A Markov chain is a sequence of random variables X 1, X 2, X 3,... That has the property that the value of the current state depends only on the
Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression
An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,
Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between
Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative
CISC 889 Bioinformatics (Spring 24) Hidden Markov Models (II) a. Likelihood: forward algorithm b. Decoding: Viterbi algorithm c. Model building: Baum-Welch algorithm Viterbi training Hidden Markov models
Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this
Some Problems from Enzyme Families Greg Butler Department of Computer Science Concordia University, Montreal www.cs.concordia.ca/~faculty/gregb firstname.lastname@example.org Abstract I will discuss some problems
GLOBEX Bioinformatics (Summer 2015) Hidden Markov Models (I) a. The model b. The decoding: Viterbi algorithm Hidden Markov models A Markov chain of states At each state, there are a set of possible observables
Hidden Markov Models (HMMs) and Profiles Swiss Institute of Bioinformatics (SIB) 26-30 November 2001 Markov Chain Models A Markov Chain Model is a succession of states S i (i = 0, 1,...) connected by transitions.
Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet email@example.com sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein
Markov Chains and Hidden Markov Models = stochastic, generative models (Drawing heavily from Durbin et al., Biological Sequence Analysis) BCH339N Systems Biology / Bioinformatics Spring 2016 Edward Marcotte,
Pairwise sequence alignment and pair hidden Markov models Martin C. Frith April 13, 2012 ntroduction Pairwise alignment and pair hidden Markov models (phmms) are basic textbook fare . However, there
Biological Sequences and Hidden Markov Models CPBS7711 Sept 27, 2011 Sonia Leach, PhD Assistant Professor Center for Genes, Environment, and Health National Jewish Health firstname.lastname@example.org Slides created
Page Hidden Markov models and multiple sequence alignment Russ B Altman BMI 4 CS 74 Some slides borrowed from Scott C Schmidler (BMI graduate student) References Bioinformatics Classic: Krogh et al (994)
Midterm Results Big improvement over scores from the previous two years. Since this class grade is based on the previous years curve, that means this class will get higher grades than the previous years.
Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics Jianlin Cheng, PhD Department of Computer Science University of Missouri, Columbia
Grundlagen der Bioinformatik, SS 10, D. Huson, April 12, 2010 1 1 Introduction Grundlagen der Bioinformatik Summer semester 2010 Lecturer: Prof. Daniel Huson Office hours: Thursdays 17-18h (Sand 14, C310a)
Computational Genomics and Molecular Biology, Fall 2014 1 HMM Lecture Notes Dannie Durand and Rose Hoberman November 6th Introduction In the last few lectures, we have focused on three problems related
Moreover, the circular logic How do we know what is the right distance without a good alignment? And how do we construct a good alignment without knowing what substitutions were made previously? ATGCGT--GCAAGT
Computational Genomics and Molecular Biology, Fall 2011 1 HMM Lecture Notes Dannie Durand and Rose Hoberman October 11th 1 Hidden Markov Models In the last few lectures, we have focussed on three problems
Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and
Introduction to Hidden Markov Models for Gene Prediction ECE-S690 Outline Markov Models The Hidden Part How can we use this for gene prediction? Learning Models Want to recognize patterns (e.g. sequence
Andrea Passerini email@example.com Statistical relational learning The aim Modeling temporal sequences Model signals which vary over time (e.g. speech) Two alternatives: deterministic models directly
Chapter 4: Hidden Markov Models 4.1 Introduction to HMM Prof. Yechiam Yemini (YY) Computer Science Department Columbia University Overview Markov models of sequence structures Introduction to Hidden Markov
11.3 Decoding Algorithm 393 For convenience, we have introduced π 0 and π n+1 as the fictitious initial and terminal states begin and end. This model defines the probability P(x π) for a given sequence
Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment
Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical
Multiple Sequence Alignment BMI/CS 576 www.biostat.wisc.edu/bmi576.html Colin Dewey firstname.lastname@example.org Multiple Sequence Alignment: Tas Definition Given a set of more than 2 sequences a method for
Template Free Protein Structure Modeling Jianlin Cheng, PhD Professor Department of EECS Informatics Institute University of Missouri, Columbia 2018 Protein Energy Landscape & Free Sampling http://pubs.acs.org/subscribe/archive/mdd/v03/i09/html/willis.html
Multiple Sequence Alignment: HMMs and Other Approaches Background Readings: Durbin et. al. Section 3.1, Ewens and Grant, Ch4. Wing-Kin Sung, Ch 6 Beerenwinkel N, Siebourg J. Statistics, probability, and
From THE CENTER FOR GENOMICS AND BIOINFORMATICS Karolinska Institutet, Stockholm, Sweden HIDDEN MARKOV MODELS FOR REMOTE PROTEIN HOMOLOGY DETECTION Markus Wistrand Stockholm 2005 All previously published
Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline
Hidden Markov Models in computational biology Ron Elber Computer Science Cornell 1 Or: how to fish homolog sequences from a database Many sequences in database RPOBESEQ Partitioned data base 2 An accessible
BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and
I529: Machine Learning in Bioinformatics (Spring 2017) HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington
Introduction to Hidden Markov Models (HMMs) But first, some probability and statistics background Important Topics 1.! Random Variables and Probability 2.! Probability Distributions 3.! Parameter Estimation
Hidden Markov Models 1 1 1 1 2 2 2 2 K K K K x 1 x 2 x 3 x K Viterbi, Forward, Backward VITERBI FORWARD BACKWARD Initialization: V 0 (0) = 1 V k (0) = 0, for all k > 0 Initialization: f 0 (0) = 1 f k (0)
Comparative Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey email@example.com Goals for Lecture the key concepts to understand are the following: using related genomes
BLAST Multiple Sequence Alignments: Clustal Omega What does basic BLAST do (e.g. what is input sequence and how does BLAST look for matches?) Susan Parrish McDaniel College Multiple Sequence Alignments
I529: Machine Learning in Bioinformatics (Spring 2017) Content HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University,
Simultaneous Sequence Alignment and Tree Construction Using Hidden Markov Models R.C. Edgar, K. Sjölander Pacific Symposium on Biocomputing 8:180-191(2003) SIMULTANEOUS SEQUENCE ALIGNMENT AND TREE CONSTRUCTION
Introduction to Evolutionary Concepts and VMD/MultiSeq - Part I Zaida (Zan) Luthey-Schulten Dept. Chemistry, Beckman Institute, Biophysics, Institute of Genomics Biology, & Physics NIH Workshop 2009 VMD/MultiSeq
CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 firstname.lastname@example.org www.cis.fiu.edu/~giri/teach/bioinfs07.html 2/14/07 CAP5510 1 CpG Islands Regions in DNA sequences with increased
Hidden Markov Models based on chapters from the book Durbin, Eddy, Krogh and Mitchison Biological Sequence Analysis via Shamir s lecture notes music recognition deal with variations in - actual sound -
Sequence Analysis, '18 -- lecture 9 Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene. How can I represent thousands of homolog sequences in a compact
MM : Viterbi algorithm - a toy example 0.6 et's consider the following simple MM. This model is composed of 2 states, (high GC content) and (low GC content). We can for example consider that state characterizes
Domain-based computational approaches to understand the molecular basis of diseases Dr. Maricel G. Kann Assistant Professor Dept of Biological Sciences UMBC http://bioinf.umbc.edu Research at Kann s Lab.
I529: Machine Learning in Bioinformatics (Spring 2017) HMM: Parameter Estimation Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2017 Content Review HMM: three problems
Markov Models Lecture : Markov and Hidden Markov Models PSfrag Use past replacements as state. Next output depends on previous output(s): y t = f[y t, y t,...] order is number of previous outputs y t y
Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5 Why Look at More Than One Sequence? 1. Multiple Sequence Alignment shows patterns of conservation 2. What and how many
A profile-based protein sequence alignment algorithm for a domain clustering database Lin Xu,2 Fa Zhang and Zhiyong Liu 3, Key Laboratory of Computer System and architecture, the Institute of Computing
Multiple sequence alignment Multiple sequence alignment: today s goals to define what a multiple sequence alignment is and how it is generated; to describe profile HMMs to introduce databases of multiple
Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,
Consortium for Comparative Genomics University of Colorado School of Medicine Multiple Sequence Alignment (MSA) BIOL 7711 Computational Bioscience Biochemistry and Molecular Genetics Computational Bioscience
Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in
Plan for today! Part 1: (Hidden) Markov models! Part 2: String matching and read mapping! 2.1 Exact algorithms! 2.2 Heuristic methods for approximate search (Hidden) Markov models Why consider probabilistics
USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES HOW CAN BIOINFORMATICS BE USED AS A TOOL TO DETERMINE EVOLUTIONARY RELATIONSHPS AND TO BETTER UNDERSTAND PROTEIN HERITAGE?
Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)