sscott@cse.unl.edu
Introduction
- Designed to model (profile) a multiple alignment of a protein family (e.g., Fig. 5.1)
- Gives a probabilistic model of the proteins in the family
- Useful for searching databases for more homologues and for aligning strings to the family
Outline
- Structure of a profile HMM: ungapped regions; insert and delete states; non-global alignments
- Building a model: determining states (match, insert, delete); estimating probabilities; pseudocounts
- Searching and aligning with HMMs: Viterbi; forward
Structure of a Profile HMM: Match States
- Start with a trivial HMM: begin state B, a chain of match states M_1, ..., M_L, and end state E, with every transition taken with probability 1 (not really hidden at this point)
- Each match state has its own set of emission probabilities, so we can compute the probability of a new sequence x being part of this family:
    P(x | M) = prod_{i=1}^{L} e_i(x_i)
- Can, as usual, convert probabilities to a log-odds score
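The ungapped product above is a one-liner in code. A minimal sketch, with a toy two-state profile whose emission tables are assumed values, not from the slides:

```python
import math

# Hypothetical toy profile: one emission table per match state (assumed values).
profile = [
    {"V": 0.6, "F": 0.2, "I": 0.2},   # M_1
    {"G": 0.5, "E": 0.3, "K": 0.2},   # M_2
]

def log_prob(x, profile):
    """log P(x | M) for an ungapped profile: sum of log e_i(x_i)."""
    assert len(x) == len(profile)
    return sum(math.log(e[c]) for e, c in zip(profile, x))

print(round(log_prob("VG", profile), 3))  # -> -1.204  (= log(0.6 * 0.5))
```

Working in log space from the start avoids underflow on realistic profile lengths.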
Structure of a Profile HMM (2): Insertion States
- But this assumes ungapped alignments!
- To handle gaps, consider insertions and deletions
- Insertion: part of x that doesn't match anything in the multiple alignment (handled by insert states I_j, each paired with match state M_j)
Structure of a Profile HMM (3): Deletion States
- Deletion: parts of the multiple alignment not matched by any residue in x (handled by silent delete states D_j, one per match state M_j)
General Profile HMM Structure
- Full model: begin state B, end state E, and for each position j a match state M_j, insert state I_j, and delete state D_j, connected by transitions
Handling Non-global Alignments
- Original profile HMMs model the entire sequence
- Add flanking model states (or free insertion modules) to generate the non-local residues
Building a Model: Determining States
- Given a multiple alignment, how to build an HMM?
- General structure is defined, but how many match states?

    ... V G A - - H A G E Y ...
    ... V - - - - N V D E V ...
    ... V E A - - D V A G H ...
    ... V K G - - - - - - D ...
    ... V Y S - - T Y E T S ...
    ... F N A - - N I P K H ...
    ... I A G A D N G A G V ...
Building a Model (2): Determining States
- Heuristic: if more than half of the characters in a column are non-gaps, include a match state for that column
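The majority-non-gap heuristic is easy to check mechanically. A sketch over the example alignment, using 0-based column indices (my convention, not the slides'):

```python
alignment = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]

def match_columns(alignment, threshold=0.5):
    """Columns where more than `threshold` of the rows are non-gap become match states."""
    n = len(alignment)
    return [j for j in range(len(alignment[0]))
            if sum(row[j] != '-' for row in alignment) > threshold * n]

print(match_columns(alignment))  # -> [0, 1, 2, 5, 6, 7, 8, 9]
```

Columns 3 and 4 (the block of gaps) fail the majority test, so they are handled by an insert state rather than match states.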
Building a Model (3)
- Now, find parameters
- Multiple alignment + HMM structure -> state sequence for each row:
- Non-gap in a match column -> match state
- Gap in a match column -> delete state
- Non-gap in an insert column -> insert state
- Gap in an insert column -> ignore
- Durbin Fig 5.4, p. 109
Building a Model (4): Estimating Probabilities
- Count the number of transitions and emissions and compute:
    a_kl = A_kl / sum_{l'} A_kl'
    e_k(b) = E_k(b) / sum_{b'} E_k(b')
- Still need to beware of some counts = 0
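The count normalization can be sketched directly; the counts below are toy assumptions, not data from the slides:

```python
from collections import Counter

def estimate_probs(transitions, emissions):
    """Maximum-likelihood estimates from counts:
    a_kl = A_kl / sum_l' A_kl'  and  e_k(b) = E_k(b) / sum_b' E_k(b')."""
    a = {k: {l: n / sum(cnt.values()) for l, n in cnt.items()}
         for k, cnt in transitions.items()}
    e = {k: {b: n / sum(cnt.values()) for b, n in cnt.items()}
         for k, cnt in emissions.items()}
    return a, e

# Toy counts (assumed): state M1 -> M2 observed 4 times, M1 -> I1 once.
a, e = estimate_probs({"M1": Counter({"M2": 4, "I1": 1})},
                      {"M1": Counter({"V": 6, "F": 1, "I": 1})})
print(a["M1"]["M2"], e["M1"]["V"])  # -> 0.8 0.75
```

Any transition or residue never observed gets probability 0 here, which is exactly the problem pseudocounts address next.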
Weighted Pseudocounts
- Let c_ja = observed count of residue a in position j of the multiple alignment:
    e_Mj(a) = (c_ja + A q_a) / (sum_{a'} c_ja' + A)
- q_a = background probability of a; A = weight placed on pseudocounts (sometimes use A ≈ 20)
- Background probabilities are also called a prior distribution
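A minimal sketch of the weighted-pseudocount estimate; the 4-letter alphabet, uniform background, and A = 4 are assumptions for illustration:

```python
def emission_with_pseudocounts(counts, q, A=20.0):
    """Weighted pseudocounts: e_Mj(a) = (c_ja + A*q_a) / (sum_a' c_ja' + A).
    q is the background (prior) distribution; A weights the pseudocounts."""
    total = sum(counts.values()) + A
    return {a: (counts.get(a, 0) + A * qa) / total for a, qa in q.items()}

# Toy 4-letter alphabet with a uniform background (assumed values).
q = {"V": 0.25, "F": 0.25, "I": 0.25, "L": 0.25}
e = emission_with_pseudocounts({"V": 5, "F": 1, "I": 1}, q, A=4.0)
print(round(e["L"], 3))  # unseen residue L still gets nonzero probability
```

Because sum_a q_a = 1, the estimates automatically sum to 1, and no residue is ever assigned probability 0.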
Dirichlet Mixtures
- Can be thought of as a mixture of pseudocounts
- The mixture has different components, each representing a different context of a protein sequence
- E.g., in parts of a sequence folded near the protein's surface, more weight (higher q_a) can be given to hydrophilic residues
- But in other regions, we may want to give more weight to hydrophobic residues
- We will find a different mixture for each position of the alignment, based on the distribution of residues in that column
Dirichlet Mixtures (2)
- Each component k consists of a vector of pseudocounts α_k (so α_ka corresponds to A q_a) and a mixture coefficient (m_k, for now) that is the probability that component k is selected
- Pseudocount model k is the correct one with probability m_k
- We'll set the mixture coefficients for each column based on which vectors best fit the residues in that column
- E.g., the first column of the alignment shown earlier is dominated by V, so any vector α_k that favors V will get a higher m_k
Dirichlet Mixtures (3)
- Let c_j be the vector of counts in column j:
    e_Mj(a) = sum_k P(k | c_j) (c_ja + α_ka) / sum_{a'} (c_ja' + α_ka')
- P(k | c_j) are the posterior mixture coefficients, which are easily computed [Sjölander et al. 1996], yielding:
    e_Mj(a) = X_a / sum_{a'} X_a', where
    X_a = sum_k m_k0 exp( ln B(α_k + c_j) - ln B(α_k) ) (c_ja + α_ka) / sum_{a'} (c_ja' + α_ka')
    ln B(x) = sum_i ln Γ(x_i) - ln Γ( sum_i x_i )
Dirichlet Mixtures (4)
- Γ is the gamma function, and ln Γ is computed via lgamma and related functions in C
- m_k0 is the prior probability of component k
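Putting the Dirichlet-mixture formulas together in a sketch (Python's `math.lgamma` plays the role of C's `lgamma`); the 3-letter alphabet and the two component vectors are toy assumptions, not trained parameters:

```python
import math

def ln_B(x):
    """ln B(x) = sum_i ln Gamma(x_i) - ln Gamma(sum_i x_i)."""
    return sum(math.lgamma(v) for v in x) - math.lgamma(sum(x))

def dirichlet_mixture_emissions(c, alphas, m0):
    """Posterior-weighted pseudocount estimate e_Mj(a), after Sjolander et al. 1996.
    c: residue counts in column j; alphas: pseudocount vectors alpha_k;
    m0: prior mixture coefficients m_k0."""
    # Unnormalized posterior mixture coefficients P(k | c_j):
    #   m_k0 * exp( ln B(alpha_k + c_j) - ln B(alpha_k) )
    w = [mk * math.exp(ln_B([a + ci for a, ci in zip(alpha, c)]) - ln_B(alpha))
         for mk, alpha in zip(m0, alphas)]
    post = [wk / sum(w) for wk in w]
    n = sum(c)
    # e_Mj(a) = sum_k P(k | c_j) (c_ja + alpha_ka) / sum_a' (c_ja' + alpha_ka')
    return [sum(pk * (ci + alpha[i]) / (n + sum(alpha))
                for pk, alpha in zip(post, alphas))
            for i, ci in enumerate(c)]

# Toy 3-letter alphabet, two mixture components (assumed values).
e_col = dirichlet_mixture_emissions([5, 1, 0],
                                    [[2.0, 1.0, 1.0], [0.5, 0.5, 0.5]],
                                    [0.5, 0.5])
print([round(v, 3) for v in e_col])
```

Each component's estimate is a valid distribution, so the posterior-weighted combination sums to 1 as well.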
Searching for Homologues
- Score a candidate match x using log-odds:
- P(x, π | M) is the probability that x came from model M via the most likely path π; find using Viterbi
- P(x | M) is the probability that x came from model M, summed over all possible paths; find using the forward algorithm
    score(x) = log( P(x | M) / P(x | φ) )
- φ is a null model, often the distribution of amino acids in the training set, or the AA distribution over each individual column
- If x matches M much better than φ, then the score is large and positive
Viterbi Equations
- V^M_j(i) = log-odds score of the best path matching x_1...x_i to the model, with x_i emitted by M_j (similarly define V^I_j(i) and V^D_j(i))
- B is M_0 with V^M_0(0) = 0; E is M_{L+1} (V^M_{L+1} gives the final score)

    V^M_j(i) = log( e_Mj(x_i) / q_xi ) + max{ V^M_{j-1}(i-1) + log a_{M_{j-1} M_j},
                                              V^I_{j-1}(i-1) + log a_{I_{j-1} M_j},
                                              V^D_{j-1}(i-1) + log a_{D_{j-1} M_j} }

    V^I_j(i) = log( e_Ij(x_i) / q_xi ) + max{ V^M_j(i-1) + log a_{M_j I_j},
                                              V^I_j(i-1) + log a_{I_j I_j},
                                              V^D_j(i-1) + log a_{D_j I_j} }

    V^D_j(i) = max{ V^M_{j-1}(i) + log a_{M_{j-1} D_j},
                    V^I_{j-1}(i) + log a_{I_{j-1} D_j},
                    V^D_{j-1}(i) + log a_{D_{j-1} D_j} }
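A sketch of these recurrences in Python. The encoding is my own: transitions live in a dict keyed by (state type, index, state type, index), B = M_0, E = M_{L+1}, and for brevity there is no insert state I_0 (no insertions before the first match column). The toy model at the bottom is also an assumption:

```python
import math

NEG_INF = float("-inf")

def viterbi_log_odds(x, L, log_a, log_e_M, log_e_I, log_q):
    """Log-odds score of the best path through a length-L profile HMM.
    log_a maps (type, j, type2, j2) -> log transition prob; missing = -inf."""
    n = len(x)
    t = lambda key: log_a.get(key, NEG_INF)
    V = {s: [[NEG_INF] * (n + 1) for _ in range(L + 1)] for s in "MID"}
    V["M"][0][0] = 0.0  # start in B = M_0 before emitting anything
    for j in range(1, L + 1):
        for i in range(n + 1):
            if i >= 1:
                # M_j emits x_i; enter from M/I/D at model position j-1, residue i-1
                V["M"][j][i] = (log_e_M[j].get(x[i-1], NEG_INF) - log_q[x[i-1]]
                                + max(V["M"][j-1][i-1] + t(("M", j-1, "M", j)),
                                      V["I"][j-1][i-1] + t(("I", j-1, "M", j)),
                                      V["D"][j-1][i-1] + t(("D", j-1, "M", j))))
                # I_j emits x_i; enter from the same model position j, residue i-1
                V["I"][j][i] = (log_e_I[j].get(x[i-1], NEG_INF) - log_q[x[i-1]]
                                + max(V["M"][j][i-1] + t(("M", j, "I", j)),
                                      V["I"][j][i-1] + t(("I", j, "I", j)),
                                      V["D"][j][i-1] + t(("D", j, "I", j))))
            # D_j is silent: enter from position j-1 without consuming a residue
            V["D"][j][i] = max(V["M"][j-1][i] + t(("M", j-1, "D", j)),
                               V["I"][j-1][i] + t(("I", j-1, "D", j)),
                               V["D"][j-1][i] + t(("D", j-1, "D", j)))
    # final transition into E = M_{L+1}
    return max(V[s][L][n] + t((s, L, "M", L + 1)) for s in "MID")

# Toy model: two match states emitting V then G with probability 1,
# uniform background over {V, G} (all values assumed).
log_a = {("M", 0, "M", 1): 0.0, ("M", 1, "M", 2): 0.0, ("M", 2, "M", 3): 0.0}
log_e_M = {1: {"V": 0.0}, 2: {"G": 0.0}}
log_e_I = {1: {}, 2: {}}
log_q = {"V": math.log(0.5), "G": math.log(0.5)}
score = viterbi_log_odds("VG", 2, log_a, log_e_M, log_e_I, log_q)
print(round(score, 3))  # -> 1.386  (= 2 log 2)
```

The loop order matters: for fixed j, V^I_j(i) needs V^I_j(i-1) from the same row, while V^M_j(i) and V^D_j(i) only need the j-1 row, so iterating j outer and i inner satisfies all dependencies.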
Forward Equations

    F^M_j(i) = log( e_Mj(x_i) / q_xi ) + log[ a_{M_{j-1} M_j} exp( F^M_{j-1}(i-1) )
                                            + a_{I_{j-1} M_j} exp( F^I_{j-1}(i-1) )
                                            + a_{D_{j-1} M_j} exp( F^D_{j-1}(i-1) ) ]

    F^I_j(i) = log( e_Ij(x_i) / q_xi ) + log[ a_{M_j I_j} exp( F^M_j(i-1) )
                                            + a_{I_j I_j} exp( F^I_j(i-1) )
                                            + a_{D_j I_j} exp( F^D_j(i-1) ) ]

    F^D_j(i) = log[ a_{M_{j-1} D_j} exp( F^M_{j-1}(i) )
                  + a_{I_{j-1} D_j} exp( F^I_{j-1}(i) )
                  + a_{D_{j-1} D_j} exp( F^D_{j-1}(i) ) ]

- exp(·) needed to sum probabilities under the log (can still be fast; see p. 78)
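The only structural change from Viterbi is replacing each max with a log of a sum of exponentials. A stable helper for that step (the trick referenced on p. 78) might look like:

```python
import math

NEG_INF = float("-inf")

def log_sum_exp(vals):
    """Stable log(sum_i exp(v_i)): factor out the max so exp() never overflows."""
    m = max(vals)
    if m == NEG_INF:        # all terms are zero-probability
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in vals))

# Replacing each max(...) in the Viterbi recurrences with log_sum_exp([...])
# turns the V scores into the forward scores F.
print(round(log_sum_exp([math.log(0.3), math.log(0.2)]), 3))  # -> -0.693 (= log 0.5)
```

Subtracting the maximum before exponentiating keeps every argument of exp() at most 0, so nothing overflows even for very negative log scores.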
Aligning a Sequence with a Model (Multiple Alignment)
- Given a string x, use Viterbi to find the most likely path π and use its state sequence as the alignment
- More detail in Durbin, Section 6.5, which also discusses building an initial multiple alignment and HMM simultaneously via Baum-Welch