HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

Similar documents
Hidden Markov Models

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Hidden Markov Models

Hidden Markov Models

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

HMMs and biological sequence analysis

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Protein Structure: Data Bases and Classification Ingo Ruczinski

Hidden Markov Models. x 1 x 2 x 3 x K

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98)

Moreover, the circular logic

Computational Genomics and Molecular Biology, Fall

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

Stephen Scott.

CAP 5510 Lecture 3 Protein Structures

Multiple Sequence Alignment using Profile HMM

EECS730: Introduction to Bioinformatics

CSCE 471/871 Lecture 3: Markov Chains and

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Protein structure alignments

Today s Lecture: HMMs

Lecture 7 Sequence analysis. Hidden Markov Models

Hidden Markov Models (I)

Pairwise alignment using HMMs

1. Protein Data Bank (PDB) 1. Protein Data Bank (PDB)

Bioinformatics. Proteins II. - Pattern, Profile, & Structure Database Searching. Robert Latek, Ph.D. Bioinformatics, Biocomputing

Hidden Markov Models and Their Applications in Biological Sequence Analysis

Procheck output. Bond angles (Procheck) Structure verification and validation Bond lengths (Procheck) Introduction to Bioinformatics.

Basics of protein structure

Stephen Scott.

Protein Structure Prediction using String Kernels. Technical Report

Hidden Markov Models. x 1 x 2 x 3 x K

CS612 - Algorithms in Bioinformatics

SCOP. all-β class. all-α class, 3 different folds. T4 endonuclease V. 4-helical cytokines. Globin-like

Week 10: Homology Modelling (II) - HHpred

Protein tertiary structure prediction with new machine learning approaches

Large-Scale Genomic Surveys

Introduction to Comparative Protein Modeling. Chapter 4 Part I

CSCE 478/878 Lecture 9: Hidden. Markov. Models. Stephen Scott. Introduction. Outline. Markov. Chains. Hidden Markov Models. CSCE 478/878 Lecture 9:

Lecture 9. Intro to Hidden Markov Models (finish up)

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics

PROTEIN FUNCTION PREDICTION WITH AMINO ACID SEQUENCE AND SECONDARY STRUCTURE ALIGNMENT SCORES

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"

Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha

Bioinformatics 1--lectures 15, 16. Markov chains Hidden Markov models Profile HMMs

HIDDEN MARKOV MODELS FOR REMOTE PROTEIN HOMOLOGY DETECTION

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Conditional Graphical Models for Protein Structure Prediction

Hidden Markov Models

ALL LECTURES IN SB Introduction

Intro Secondary structure Transmembrane proteins Function End. Last time. Domains Hidden Markov Models

Heteropolymer. Mostly in regular secondary structure

Today. Last time. Secondary structure Transmembrane proteins. Domains Hidden Markov Models. Structure prediction. Secondary structure

Protein Secondary Structure Prediction

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

COMP 598 Advanced Computational Biology Methods & Research. Introduction. Jérôme Waldispühl School of Computer Science McGill University

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

CMPS 3110: Bioinformatics. Tertiary Structure Prediction

Number sequence representation of protein structures based on the second derivative of a folded tetrahedron sequence

Amino Acid Structures from Klug & Cummings. 10/7/2003 CAP/CGS 5991: Lecture 7 1

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Hidden Markov Models (HMMs) and Profiles

Pairwise sequence alignment and pair hidden Markov models

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Tertiary Structure Prediction

December 2, :4 WSPC/INSTRUCTION FILE jbcb-profile-kernel. Profile-based string kernels for remote homology detection and motif extraction

Protein Structure Prediction

Hidden Markov Models

Motifs, Profiles and Domains. Michael Tress Protein Design Group Centro Nacional de Biotecnología, CSIC

Analysis and Prediction of Protein Structure (I)

DATE A DAtabase of TIM Barrel Enzymes

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010

Presentation Outline. Prediction of Protein Secondary Structure using Neural Networks at Better than 70% Accuracy

Some Problems from Enzyme Families

Protein Structures. Sequences of amino acid residues 20 different amino acids. Quaternary. Primary. Tertiary. Secondary. 10/8/2002 Lecture 12 1

Computational Genomics and Molecular Biology, Fall

Example: The Dishonest Casino. Hidden Markov Models. Question # 1 Evaluation. The dishonest casino model. Question # 3 Learning. Question # 2 Decoding

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Sequence analysis and comparison

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Quantifying sequence similarity

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

EBI web resources II: Ensembl and InterPro

O 3 O 4 O 5. q 3. q 4. Transition

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

An Introduction to Sequence Similarity ( Homology ) Searching

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

Alignment Algorithms. Alignment Algorithms

Algorithms in Bioinformatics

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Pair Hidden Markov Models

Brief Introduction of Machine Learning Techniques for Content Analysis

Markov Chains and Hidden Markov Models. = stochastic, generative models

Protein Structure & Motifs

Transcription:

HMM applications Applications of HMMs Gene finding Pairwise alignment (pair HMMs) Characterizing protein families (profile HMMs) Predicting membrane proteins, and membrane protein topology Gene finding with HMMs Using the gene finder Apply the gene finder to both the top strand and bottom strand Training using annotated genes 1

A simple eukaryotic gene finder Pair HMMs and Sequence Alignment Global alignment with affine gaps revisited Suppose we re aligning sequences x,y of lengths m and n. Dynamic Programming: M(i, j): Optimal alignment of x 1 x i to y 1 y j ending in a match I X (i, j): Optimal alignment of x 1 x i to y 1 y j ending in a gap in x Alignment with affine gaps Initialization: M(0,0) = 0; M(i, 0) = M(0, j) = -, for i, j > 0 I X (i,0) = - (d + i * e); I Y (0, j) = - (d + j * e) Iteration: M(i 1, j 1) M(i, j) = s(x i, y j ) + max I X (i 1, j 1) I X (i 1, j 1) I Y (i, j): Optimal alignment of x 1 x i to y 1 y j ending in a gap in y I X (i, j) = max I Y (i, j) = max I X (i 1, j) - e M(i 1, j) - d I Y (i, j 1) - e M(i, j 1) - d Termination: Optimal alignment given by max { M(m, n), I X (m, n), I Y (m, n) } A state model for alignment Scoring transitions x -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- y TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC XMMYMMMMMMMYYMMMMMMYMMMMMMMXXMMMMMXXX x -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- y TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC XMMYMMMMMMMYYMMMMMMYMMMMMMMXXMMMMMXXX 2

Probabilistic interpretation of an alignment An alignment is a hypothesis that the two sequences are related by evolution Goal: Produce the most likely alignment Compute the likelihood that the sequences are indeed related A Pair HMM for alignments ε M δ 1 2δ optional Transition probabilities: δ δ: set so that 1/2δ is avg. length of match region ε: set so that 1/() is avg. length of a gap ε Model generates two sequences simultaneously Emission probabilities: Match/Mismatch state M: Emits pair of symbols with distribution P(x,y) that reflects substitution frequencies between pairs of amino acids Insertion states X,Y: Emit symbol + gap according to P(x), P(y) that reflect amino acid frequencies of x and y Viterbi for pair HMMs A model for unaligned sequences Initialization: Recurrence: ε 1 2δ δ δ R The two sequences are independently generated from one another P(x, y R) = P(x 1 ) P(x m ) P(y 1 ) P(y n ) = Π i P(x i ) Π j P(y j ) Termination: ALIGNMENT vs. RANDOM hypothesis Every pair of letters contributes: M (1 2δ) P(x i, y j ) when matched ε P(x i ) or ε P(y j ) when gapped R P(x i ) P(y j ) in random model ε 1 2δ δ δ ε ALIGNMENT vs. RANDOM hypothesis To compare the models we consider the log odds ratio under the two models, which can be expressed as a sum of terms: δ ε P(x i, y j ) s(x i, y j ) = log + log (1 2δ) P(x i ) P(y j ) δ () P(x i ) d = log (1 2δ) P(x i ) 1 2δ δ = Defn substitution score = Defn gap initiation penalty ε P(x i ) e = log P(x i ) = Defn gap extension penalty 3

Log-odds Viterbi Comments The Viterbi algorithm for Pair HMMs corresponds exactly to global alignment DP with affine gaps V M (i, j) = max { V M (i 1, j 1), V X ( i 1, j 1), V Y ( i 1, j 1) } + s(x i, y j ) V X (i, j) = max { V M (i 1, j) d, V X ( i 1, j) e } V Y (i, j) = max { V M (i 1, j) d, V Y ( i 1, j) e } The pair HMM formulation provides a probabilistic interpretation for the pairwise alignment algorithm, and allows assigning values to parameters from first principles. Viterbi provides the probability that two sequences are related by a particular alignment according to the HMM. Using the forward algorithm we can compute the probability that the two sequences are related by any alignment: Probability that two residues are aligned Notation: means x i and y j are aligned f M and b M are the forward and backward probs. that correspond to a match state Pair HMM for local alignment Describing protein families for prediction of protein structure and function Protein sequence and structure Protein classification Figure 4.3 from Durbin et al. 4

Primary Structure: Sequence The primary structure of a protein is its amino acid sequence Primary Structure: Sequence Primary Structure: Sequence Twenty different amino acids have distinct shapes and properties Secondary Structure: α, β, & loops α helices and β sheets are stabilized by hydrogen bonds between backbone oxygen and hydrogen atoms A useful mnemonic for the hydrophobic amino acids is "FAMILY VW" Tertiary Structure: the Fold of a protein Structure Determines Function The Protein Folding Problem What determines structure? 3d structure is minimum free-energy conformation How can we determine structure? Experimental methods Computational predictions 5

Growth of the Protein Data Bank (PDB) The number of folds in nature New PDB structures Classifying folds Structure Classification Databases Despite the large growth in the number of solved protein structures, the number of new folds identified is small Classifying proteins into folds Generate overview of structure types Detect similarities between proteins Help predict 3D structure of new protein sequences Classification of 25,973 protein structures in PDB SCOP release 1.69, Class # folds All alpha proteins 218 All beta proteins 144 Alpha and beta proteins (a/b) 136 Alpha and beta proteins (a+b) 279 Multi-domain proteins 46 Membrane & cell surface 47 Small proteins 75 Total 945 SCOP Manual classification (A. Murzin) scop.berkeley.edu CATH Semi manual classification (C. Orengo) www.biochem.ucl.ac.uk/bsm/cath FSSP Automatic classification (L. Holm) www.ebi.ac.uk/dali/fssp/fssp.html Major classes in SCOP All α: Hemoglobin (1bab) Classes All α proteins All β proteins α and β proteins (α/β) α and β proteins (α+β) Membrane and cell surface proteins Small proteins Morten Nielsen,CBS, BioCentrum, DTU 6

All β: Immunoglobulin (8fab) α/β: Triosephosphate isomerase (1hti) Morten Nielsen,CBS, BioCentrum, DTU Morten Nielsen,CBS, BioCentrum, DTU α+β: Lysozyme (1jsf) Protein Family Consists of proteins whose evolutionarily relationship is readily recognizable from sequence (>30% sequence identity) Superfamily Fold Family Proteins Morten Nielsen,CBS, BioCentrum, DTU Protein Superfamily Fold Consists of proteins which are (remotely) evolutionarily related Sequence similarity low Related function Similar in structure Relationships between members of a superfamily hard to recognize from sequence alone Superfamily Family Fold Proteins Proteins in the same fold share the same general architecture, and have the same major secondary structures in the same topological arrangement No evolutionary relationship implied Fold Superfamily Family Proteins 7

Protein Classification PSI-BLAST Given a new protein, can we assign it to its position within an existing protein hierarchy? Methods BLAST / PsiBLAST Profile HMMs new protein? Superfamily Family Fold Given a sequence query x, and database D 1. Find all pairwise alignments of x to sequences in D 2. Collect all matches of x to y with some minimum significance 3. Construct a profile M Each sequence y is given a weight so that many similar sequences cannot have much influence on a position (Henikoff & Henikoff 1994) 4. Using M, search D for more matches 5. Iterate 1 4 until convergence Supervised machine learning Proteins Profile M Profiles & Profile HMMs Psi-BLAST builds a profile Profile HMM: a more elaborate version of a profile Intuitively, a profile that models insertions and deletions PWMs as HMMs Insertions I 1 BEGIN M L BEGIN M L A PWM can be described as a very simple HMM where the emissions associated with M i correspond to the i th column of the PWM Insertion: a part of the sequence x that is not modeled by the match states The loop allows for insertions of multiple letters An insert state emits symbols with the background distribution 8

Insertions Deletions I 1 BEGIN M L BEGIN Log odds contribution of an insertion of length k: M L Looks like an affine gap penalty Deletion states: silent states Cost of deletion of a sequence of letters: sum of log transition probabilities We could model deletions by allowing all forward transitions from match states: lots of transitions Profile HMMs Profile HMMs BEGIN I 0 I 1 I L-1 I L BEGIN I 0 I 1 I L-1 I L M L Protein Family F Each M state has a position-specific pre-computed substitution table Each I state has position-specific gap penalties (and in principle can have its own emission distributions) Each D allows to skip its corresponding match state In principle, D-D transitions can also be customized per position M L Protein Family F transition between match states a M(i)M(i+1) transitions between match and insert states a M(i)I(i), a I(i)M(i+1) transition within insert state a I(i)I(i) transition between match and delete states a M(i)D(i+1), a D(i)M(i+1) transition within delete state a D(i)D(i+1) emission of amino acid b at a state S e S (b) Estimating Profile HMMs from MSAs Estimating Profile HMMs from MSAs BEGIN I 0 I 1 I L-1 I L M L Columns that have less than a given number of gaps are used as match state of the HMM (starred columns) Amino acid frequencies of each column used to set emission probabilities for the corresponding match state transition probabilities ~ frequency of a transition in alignment emission probabilities ~ frequency of an emission in alignment pseudocounts are usually introduced Akl akl =! A ' ' l kl E k ( a) ek ( a) =! E k ( a ' ) ' a ' 9

Profile HMM for the globins example What you can do with a profile HMM Classification: given a protein, we can compute its probability under the model, or compare it to a random model (using a log-odds-ratio). Use forward/backward Alignment: Use the Viterbi algorithm to align a sequence to the HMM. Learning: set model parameters using a given MSA, or from scratch using EM. Figure 5.4 from Durbin et al. Alignment of a protein to a profile HMM Log-odds Viterbi To align sequence x 1 x n to a profile HMM: We will find the most likely alignment with the Viterbi DP algorithm Define V jm (i): score of best alignment of x 1 x i to the HMM ending in x i being emitted from M j V ji (i): score of best alignment of x 1 x i to the HMM ending in x i being emitted from I j V jd (i): score of best alignment of x 1 x i to the HMM ending in D j (x i is the last character emitted before D j ) Denote by q a the frequency of amino acid a in a random protein Log-odds Viterbi (termination) MSA using a profile HMM 10

BEGIN I 0 I 1 I m-1 M m I m BEGIN I 0 I 1 I m-1 M m BEGIN I 0 I 1 I m-1 M m I m I m How to build a profile HMM Classification with Profile HMMs Fold Superfamily Family new protein? Classification with Profile HMMs Classifying globins How generative models work Training examples: sequences known to be members of family Tuning parameters with a priori knowledge Model assigns a probability to any given protein sequence Idea: The sequences from the family (hopefully) yields a higher probability than sequences outside the family Log-likelihood ratio as score P(H 1 X) L(X) = log P(H 0 X) Classification using Z score Weighting training sequences The examples of a protein family are not a good sample from the sequences that belong to the family: typically there are sequences that are very closely related. They can given the mistaken impression regarding the pattern of conservation in the family. Solution: assign weights to each sequence. Less weight to related sequences. 11

Profile HMMs for local alignment PFAM: profile HMMs for protein families PFAM method: Use Blast to cluster a protein database into families of related proteins Construct a multiple alignment for each protein family. Construct a profile HMM from each alignment Given a protein of unknown origin: Align the target sequence against each HMM to find the best fit PFAM Generative Models Pfam decribes protein domains Each protein domain family in Pfam has: Seed alignment: manually verified multiple alignment of a representative set of sequences. HMM built from the seed alignment for further database searches. Full alignment generated automatically from the HMM The distinction between seed and full alignments facilitates Pfam updates. Seed alignments are a stable resource (manually inspected by an expert) HMMs can be updated with new protein sequences. Generative Models Generative Models 12

Digression: Discriminative Models k-mer based SVMs for protein classification Rather than model each class (red/ green), construct a decision boundary between the classes. margin Support Vector Machines (SVM) learn large margin decision boundaries For given word size k, and mismatch tolerance m, define K(x, y) = # distinct k-long word occurrences with m mismatches SVM can be learned by supplying this kernel function X A B A C A R D I w Decision Rule: red: w T x > 0 Let k = 3; m = 1 Y A B R A D A B I K(X, Y) = 4 Benchmarks Resources on the web HMMer a free profile HMM software http://hmmer.janelia.org/ SAM another free profile HMM software http://www.cse.ucsc.edu/research/compbio/sam.html PFAM database of alignments and HMMs for protein families and domains http://www.sanger.ac.uk/software/pfam/ SCOP a structural classification of proteins http://scop.berkeley.edu/ Transmembrane proteins Prediction of transmembrane proteins http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=mcb.figgrp.614 13

Transmembrane proteins Sonnhammer et al., ISMB, 6:175-182, 1998: TMHMM Properties of transmembrane proteins: hydrophobic α-helices connected by interfacial loops There are 3 main locations of a residue: TM helix core (in hydrophobic tail of membrane) TM helix cap (in head of membrane) cytoplasmic vs non-cytoplasmic side of the helix core loops cytoplasimc vs non-cytoplasmic (short) vs non-cytoplasmic (long) cyto non-cyto Architecture of TMHMM 14