HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

HMM applications Applications of HMMs Gene finding Pairwise alignment (pair HMMs) Characterizing protein families (profile HMMs) Predicting membrane proteins, and membrane protein topology Gene finding with HMMs Using the gene finder Apply the gene finder to both the top strand and bottom strand Training using annotated genes 1

A simple eukaryotic gene finder Pair HMMs and Sequence Alignment Global alignment with affine gaps revisited Suppose we re aligning sequences x,y of lengths m and n. Dynamic Programming: M(i, j): Optimal alignment of x 1 x i to y 1 y j ending in a match I X (i, j): Optimal alignment of x 1 x i to y 1 y j ending in a gap in x Alignment with affine gaps Initialization: M(0,0) = 0; M(i, 0) = M(0, j) = -, for i, j > 0 I X (i,0) = - (d + i * e); I Y (0, j) = - (d + j * e) Iteration: M(i 1, j 1) M(i, j) = s(x i, y j ) + max I X (i 1, j 1) I X (i 1, j 1) I Y (i, j): Optimal alignment of x 1 x i to y 1 y j ending in a gap in y I X (i, j) = max I Y (i, j) = max I X (i 1, j) - e M(i 1, j) - d I Y (i, j 1) - e M(i, j 1) - d Termination: Optimal alignment given by max { M(m, n), I X (m, n), I Y (m, n) } A state model for alignment Scoring transitions x -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- y TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC XMMYMMMMMMMYYMMMMMMYMMMMMMMXXMMMMMXXX x -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- y TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC XMMYMMMMMMMYYMMMMMMYMMMMMMMXXMMMMMXXX 2

Probabilistic interpretation of an alignment An alignment is a hypothesis that the two sequences are related by evolution Goal: Produce the most likely alignment Compute the likelihood that the sequences are indeed related A Pair HMM for alignments ε M δ 1 2δ optional Transition probabilities: δ δ: set so that 1/2δ is avg. length of match region ε: set so that 1/() is avg. length of a gap ε Model generates two sequences simultaneously Emission probabilities: Match/Mismatch state M: Emits pair of symbols with distribution P(x,y) that reflects substitution frequencies between pairs of amino acids Insertion states X,Y: Emit symbol + gap according to P(x), P(y) that reflect amino acid frequencies of x and y Viterbi for pair HMMs A model for unaligned sequences Initialization: Recurrence: ε 1 2δ δ δ R The two sequences are independently generated from one another P(x, y R) = P(x 1 ) P(x m ) P(y 1 ) P(y n ) = Π i P(x i ) Π j P(y j ) Termination: ALIGNMENT vs. RANDOM hypothesis Every pair of letters contributes: M (1 2δ) P(x i, y j ) when matched ε P(x i ) or ε P(y j ) when gapped R P(x i ) P(y j ) in random model ε 1 2δ δ δ ε ALIGNMENT vs. RANDOM hypothesis To compare the models we consider the log odds ratio under the two models, which can be expressed as a sum of terms: δ ε P(x i, y j ) s(x i, y j ) = log + log (1 2δ) P(x i ) P(y j ) δ () P(x i ) d = log (1 2δ) P(x i ) 1 2δ δ = Defn substitution score = Defn gap initiation penalty ε P(x i ) e = log P(x i ) = Defn gap extension penalty 3

Log-odds Viterbi Comments The Viterbi algorithm for Pair HMMs corresponds exactly to global alignment DP with affine gaps V M (i, j) = max { V M (i 1, j 1), V X ( i 1, j 1), V Y ( i 1, j 1) } + s(x i, y j ) V X (i, j) = max { V M (i 1, j) d, V X ( i 1, j) e } V Y (i, j) = max { V M (i 1, j) d, V Y ( i 1, j) e } The pair HMM formulation provides a probabilistic interpretation for the pairwise alignment algorithm, and allows assigning values to parameters from first principles. Viterbi provides the probability that two sequences are related by a particular alignment according to the HMM. Using the forward algorithm we can compute the probability that the two sequences are related by any alignment: Probability that two residues are aligned Notation: means x i and y j are aligned f M and b M are the forward and backward probs. that correspond to a match state Pair HMM for local alignment Describing protein families for prediction of protein structure and function Protein sequence and structure Protein classification Figure 4.3 from Durbin et al. 4

Primary Structure: Sequence The primary structure of a protein is its amino acid sequence Primary Structure: Sequence Primary Structure: Sequence Twenty different amino acids have distinct shapes and properties Secondary Structure: α, β, & loops α helices and β sheets are stabilized by hydrogen bonds between backbone oxygen and hydrogen atoms A useful mnemonic for the hydrophobic amino acids is "FAMILY VW" Tertiary Structure: the Fold of a protein Structure Determines Function The Protein Folding Problem What determines structure? 3d structure is minimum free-energy conformation How can we determine structure? Experimental methods Computational predictions 5

Growth of the Protein Data Bank (PDB) The number of folds in nature New PDB structures Classifying folds Structure Classification Databases Despite the large growth in the number of solved protein structures, the number of new folds identified is small Classifying proteins into folds Generate overview of structure types Detect similarities between proteins Help predict 3D structure of new protein sequences Classification of 25,973 protein structures in PDB SCOP release 1.69, Class # folds All alpha proteins 218 All beta proteins 144 Alpha and beta proteins (a/b) 136 Alpha and beta proteins (a+b) 279 Multi-domain proteins 46 Membrane & cell surface 47 Small proteins 75 Total 945 SCOP Manual classification (A. Murzin) scop.berkeley.edu CATH Semi manual classification (C. Orengo) www.biochem.ucl.ac.uk/bsm/cath FSSP Automatic classification (L. Holm) www.ebi.ac.uk/dali/fssp/fssp.html Major classes in SCOP All α: Hemoglobin (1bab) Classes All α proteins All β proteins α and β proteins (α/β) α and β proteins (α+β) Membrane and cell surface proteins Small proteins Morten Nielsen,CBS, BioCentrum, DTU 6

All β: Immunoglobulin (8fab) α/β: Triosephosphate isomerase (1hti) Morten Nielsen,CBS, BioCentrum, DTU Morten Nielsen,CBS, BioCentrum, DTU α+β: Lysozyme (1jsf) Protein Family Consists of proteins whose evolutionarily relationship is readily recognizable from sequence (>30% sequence identity) Superfamily Fold Family Proteins Morten Nielsen,CBS, BioCentrum, DTU Protein Superfamily Fold Consists of proteins which are (remotely) evolutionarily related Sequence similarity low Related function Similar in structure Relationships between members of a superfamily hard to recognize from sequence alone Superfamily Family Fold Proteins Proteins in the same fold share the same general architecture, and have the same major secondary structures in the same topological arrangement No evolutionary relationship implied Fold Superfamily Family Proteins 7

Protein Classification PSI-BLAST Given a new protein, can we assign it to its position within an existing protein hierarchy? Methods BLAST / PsiBLAST Profile HMMs new protein? Superfamily Family Fold Given a sequence query x, and database D 1. Find all pairwise alignments of x to sequences in D 2. Collect all matches of x to y with some minimum significance 3. Construct a profile M Each sequence y is given a weight so that many similar sequences cannot have much influence on a position (Henikoff & Henikoff 1994) 4. Using M, search D for more matches 5. Iterate 1 4 until convergence Supervised machine learning Proteins Profile M Profiles & Profile HMMs Psi-BLAST builds a profile Profile HMM: a more elaborate version of a profile Intuitively, a profile that models insertions and deletions PWMs as HMMs Insertions I 1 BEGIN M L BEGIN M L A PWM can be described as a very simple HMM where the emissions associated with M i correspond to the i th column of the PWM Insertion: a part of the sequence x that is not modeled by the match states The loop allows for insertions of multiple letters An insert state emits symbols with the background distribution 8

Insertions Deletions I 1 BEGIN M L BEGIN Log odds contribution of an insertion of length k: M L Looks like an affine gap penalty Deletion states: silent states Cost of deletion of a sequence of letters: sum of log transition probabilities We could model deletions by allowing all forward transitions from match states: lots of transitions Profile HMMs Profile HMMs BEGIN I 0 I 1 I L-1 I L BEGIN I 0 I 1 I L-1 I L M L Protein Family F Each M state has a position-specific pre-computed substitution table Each I state has position-specific gap penalties (and in principle can have its own emission distributions) Each D allows to skip its corresponding match state In principle, D-D transitions can also be customized per position M L Protein Family F transition between match states a M(i)M(i+1) transitions between match and insert states a M(i)I(i), a I(i)M(i+1) transition within insert state a I(i)I(i) transition between match and delete states a M(i)D(i+1), a D(i)M(i+1) transition within delete state a D(i)D(i+1) emission of amino acid b at a state S e S (b) Estimating Profile HMMs from MSAs Estimating Profile HMMs from MSAs BEGIN I 0 I 1 I L-1 I L M L Columns that have less than a given number of gaps are used as match state of the HMM (starred columns) Amino acid frequencies of each column used to set emission probabilities for the corresponding match state transition probabilities ~ frequency of a transition in alignment emission probabilities ~ frequency of an emission in alignment pseudocounts are usually introduced Akl akl =! A ' ' l kl E k ( a) ek ( a) =! E k ( a ' ) ' a ' 9

Profile HMM for the globins example What you can do with a profile HMM Classification: given a protein, we can compute its probability under the model, or compare it to a random model (using a log-odds-ratio). Use forward/backward Alignment: Use the Viterbi algorithm to align a sequence to the HMM. Learning: set model parameters using a given MSA, or from scratch using EM. Figure 5.4 from Durbin et al. Alignment of a protein to a profile HMM Log-odds Viterbi To align sequence x 1 x n to a profile HMM: We will find the most likely alignment with the Viterbi DP algorithm Define V jm (i): score of best alignment of x 1 x i to the HMM ending in x i being emitted from M j V ji (i): score of best alignment of x 1 x i to the HMM ending in x i being emitted from I j V jd (i): score of best alignment of x 1 x i to the HMM ending in D j (x i is the last character emitted before D j ) Denote by q a the frequency of amino acid a in a random protein Log-odds Viterbi (termination) MSA using a profile HMM 10

BEGIN I 0 I 1 I m-1 M m I m BEGIN I 0 I 1 I m-1 M m BEGIN I 0 I 1 I m-1 M m I m I m How to build a profile HMM Classification with Profile HMMs Fold Superfamily Family new protein? Classification with Profile HMMs Classifying globins How generative models work Training examples: sequences known to be members of family Tuning parameters with a priori knowledge Model assigns a probability to any given protein sequence Idea: The sequences from the family (hopefully) yields a higher probability than sequences outside the family Log-likelihood ratio as score P(H 1 X) L(X) = log P(H 0 X) Classification using Z score Weighting training sequences The examples of a protein family are not a good sample from the sequences that belong to the family: typically there are sequences that are very closely related. They can given the mistaken impression regarding the pattern of conservation in the family. Solution: assign weights to each sequence. Less weight to related sequences. 11

Profile HMMs for local alignment PFAM: profile HMMs for protein families PFAM method: Use Blast to cluster a protein database into families of related proteins Construct a multiple alignment for each protein family. Construct a profile HMM from each alignment Given a protein of unknown origin: Align the target sequence against each HMM to find the best fit PFAM Generative Models Pfam decribes protein domains Each protein domain family in Pfam has: Seed alignment: manually verified multiple alignment of a representative set of sequences. HMM built from the seed alignment for further database searches. Full alignment generated automatically from the HMM The distinction between seed and full alignments facilitates Pfam updates. Seed alignments are a stable resource (manually inspected by an expert) HMMs can be updated with new protein sequences. Generative Models Generative Models 12

Digression: Discriminative Models k-mer based SVMs for protein classification Rather than model each class (red/ green), construct a decision boundary between the classes. margin Support Vector Machines (SVM) learn large margin decision boundaries For given word size k, and mismatch tolerance m, define K(x, y) = # distinct k-long word occurrences with m mismatches SVM can be learned by supplying this kernel function X A B A C A R D I w Decision Rule: red: w T x > 0 Let k = 3; m = 1 Y A B R A D A B I K(X, Y) = 4 Benchmarks Resources on the web HMMer a free profile HMM software http://hmmer.janelia.org/ SAM another free profile HMM software http://www.cse.ucsc.edu/research/compbio/sam.html PFAM database of alignments and HMMs for protein families and domains http://www.sanger.ac.uk/software/pfam/ SCOP a structural classification of proteins http://scop.berkeley.edu/ Transmembrane proteins Prediction of transmembrane proteins http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=mcb.figgrp.614 13

Transmembrane proteins Sonnhammer et al., ISMB, 6:175-182, 1998: TMHMM Properties of transmembrane proteins: hydrophobic α-helices connected by interfacial loops There are 3 main locations of a residue: TM helix core (in hydrophobic tail of membrane) TM helix cap (in head of membrane) cytoplasmic vs non-cytoplasmic side of the helix core loops cytoplasimc vs non-cytoplasmic (short) vs non-cytoplasmic (long) cyto non-cyto Architecture of TMHMM 14