Hidden Markov models and multiple sequence alignment
Russ B Altman
BMI 214 / CS 274
Some slides borrowed from Scott C Schmidler (BMI graduate student)

References
- Bioinformatics classic: Krogh et al (1994) Hidden Markov models in computational biology: applications to protein modeling. J Mol Biol 235:1501-1531
- Book: Eddy & Durbin, 1999 (see web site)
- Tutorial: Rabiner, L (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77(2):257-286

Probability review
- Probability of A: P(A), with P(A) >= 0 and sum_A P(A) = 1
- Joint probability of A AND B: P(A, B)
- Conditional probability of A given that B is true: P(A | B) = P(A, B) / P(B)
- Marginal probability of A, given all possible Bs: P(A) = sum_B P(A, B)
- Independence of A and B: P(A, B) = P(A) P(B)
- Bayes rule for estimating P(B | A) given the rest: P(B | A) = P(A | B) P(B) / P(A)

Markov chains
- Markov property: P(X_0, X_1, ..., X_t) = P(X_0) P(X_1 | X_0) ... P(X_t | X_{t-1})
- Formally:
  - State space = list of possible values for X
  - Transition matrix = probability of moving from one X to another
  - Initial distribution = initial value of X
[State diagram: states S0, S1, S2 connected by transition probabilities such as P(S1 | S0) and P(S2 | S1)]

Markovian sequence
- States through which the chain passes form a sequence
- Example: S0, S1, S2, S1, S0, S2, ...
- Graphically: [chain diagram of the state sequence]

Example: Markov chain for generating a DNA sequence
- Observed bases A G A C G emitted by states S0, S1, S2, S1, S0, ...
- By the Markov property, the sequence probability is:
  P(Sequence) = P(S0, S1, S2, S1, S0, S2, ...) = π(S0) P(S1 | S0) P(S2 | S1) ...
  P(AGACG) = π(A) P(G | A) P(A | G) P(C | A) ...
- Data would come from dinucleotide frequency (eg base-stacking)
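The sequence-probability calculation above is easy to sketch in code. This is a minimal illustration with invented numbers: `pi` is the initial distribution and `T` a transition matrix over the DNA alphabet (in a real application, `T` would be estimated from dinucleotide frequencies).

```python
# Toy Markov chain over DNA bases; all probabilities are invented.
pi = {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2}   # initial distribution
T = {                                            # transition matrix P(cur | prev)
    "A": {"A": 0.1, "C": 0.3, "G": 0.4, "T": 0.2},
    "C": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "G": {"A": 0.4, "C": 0.2, "G": 0.1, "T": 0.3},
    "T": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

def sequence_probability(seq, pi, T):
    """P(seq) = pi(s_0) * product over t of P(s_t | s_{t-1}), by the Markov property."""
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= T[prev][cur]
    return p

print(sequence_probability("AGACG", pi, T))  # 0.3 * 0.4 * 0.4 * 0.3 * 0.2 ≈ 0.00288
```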
Hidden Markov chains
- Observed sequence is a probabilistic function of an underlying Markov chain
- Example: HMM for a (noisy) DNA sequence (see eg Churchill 1989)
- True state sequence unknown, but the observation sequence gives us a clue
  Unobserved truth: A G A C G
  Observed noisy sequence data: each true base is observed correctly with high probability (entries such as .7 and .8 in the slide's emission table) and misread otherwise -> A G G G (one possible observation)

Example: Hidden Markov chain for protein sequence based on secondary structure
- Specify a sequence of helix (H) or loop (L) elements, then generate a sequence by choosing amino acids based on their preference for H or L
  Hidden states: H H H L L
  Emissions:     P(AA | H) P(AA | H) P(AA | H) P(AA | L) P(AA | L)
- Certain amino acids prefer helices over loops, and vice versa

Example: Hidden Markov chain for protein sequence
- Specify a sequence of helix (H) or loop (L) elements, then generate a sequence by choosing amino acids (AA) from probability distributions associated with H or L
  Hidden states: H H H L L
  Emissions:     P(AA | H) P(AA | H) P(AA | H) P(AA | L) P(AA | L)
- Two possible sequences that could be generated from this: V A W K A / A V ...

Goals for HMMs + Multiple Alignment
1. Summarize a family of aligned sequences
   - What amino acids can occur at a position, and with what probability?
   - Where can there be insertions and deletions?
   - Generate fake but plausible sequences in the family
2. Evaluate the probability that a new sequence belongs in a family
   - Does the model create the sequence with high probability?
3. Generate the alignment itself
   - How can I explain the organization of all these sequences, which I believe are in the same family?
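The generative view above — a fixed hidden path of H/L states emitting one amino acid each — can be sketched directly. The emission probabilities and residue alphabet below are invented for illustration, not taken from any real secondary-structure model:

```python
import random

# Hypothetical emission distributions P(AA | state) for helix (H) and loop (L).
emit = {
    "H": {"A": 0.4, "V": 0.3, "L": 0.2, "K": 0.1},   # helix-favoring residues
    "L": {"G": 0.4, "P": 0.3, "K": 0.2, "A": 0.1},   # loop-favoring residues
}

def generate(hidden_path, emit, rng):
    """Emit one amino acid per hidden state, sampled from P(AA | state)."""
    seq = []
    for state in hidden_path:
        aas, probs = zip(*emit[state].items())
        seq.append(rng.choices(aas, weights=probs, k=1)[0])
    return "".join(seq)

rng = random.Random(0)                 # seeded for reproducibility
print(generate("HHHLL", emit, rng))    # one possible 5-residue sequence
```

Running it repeatedly yields different sequences from the same hidden path, which is exactly the "two possible sequences" point on the slide.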
An HMM for multiple protein sequences (Krogh et al)
HMM globin model:
- m = match state: emits amino acids (that can be aligned as equivalent in different sequences)
- i = insert state: emits amino acids that don't align
- d = delete state: doesn't emit an amino acid at that position
- A path through the states aligns a sequence to the model
Figure from (Krogh et al, 1994)
Could emit: V L S A E E K A / V K A G H P A - / W QAK L C S
- m states are shown with their probability of emitting each of the 20 amino acids
- d states show the position in the alignment for that column of d, i, m
- i states show the average length of an insertion IF it is chosen
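The way a state path simultaneously produces a sequence and an alignment can be sketched as follows. The state names (m1, i2, d4), the path, and the residues are hypothetical, not taken from the Krogh et al globin model; the convention of uppercase for match columns, lowercase for insertions, and '-' for deletions mirrors common profile-alignment output:

```python
# Sketch: turn a profile-HMM state path plus its emitted residues into one
# row of a multiple alignment. Match (m*) and insert (i*) states consume a
# residue; delete (d*) states consume none and leave a gap.
def path_to_alignment(path, emissions):
    """path: list of state names; emissions: one residue per emitting state."""
    aligned, it = [], iter(emissions)
    for state in path:
        if state.startswith("m"):
            aligned.append(next(it).upper())   # match column (aligned residue)
        elif state.startswith("i"):
            aligned.append(next(it).lower())   # insertion (unaligned residue)
        else:
            aligned.append("-")                # delete state -> gap character
    return "".join(aligned)

print(path_to_alignment(["m1", "m2", "i2", "m3", "d4", "m5"], "VLSAE"))
# -> "VLsA-E"
```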
Example: Alignment of globin sequences
Figure from (Krogh et al, 1994)

Example: Aligning a sequence to the model
- Given an HMM model for a protein family: align a new sequence to the model by finding the most likely path through the graph (d states are gaps, i states are insertions)

Three computational tasks with HMMs
1. Probability of an observed sequence
   Given O_1, O_2, ..., O_T, find P(O_1, O_2, ..., O_T)
2. Most likely hidden state sequence
   Given O_1, ..., O_T, compute the maximum of P(Q_1, ..., Q_T | O_1, ..., O_T) (the O's are observed, the Q's are not)
3. Estimation of model parameters
   Given observed sequences {O_1, ..., O_T} and a topology, find transition probabilities and emission probabilities that maximize the SUM of P(O_1, ..., O_T)

Computing likelihood of observed sequence
- Compute P(O_1, ..., O_T) = SUM over all Q_1, ..., Q_T of [ P(O_1, ..., O_T | Q_1, ..., Q_T) P(Q_1, ..., Q_T) ]
- True state sequence unknown: must sum over all possible paths that lead to the observables
- Number of paths ~ O(N^T), N = # states, T = sequence length
- Markovian structure permits a recursive definition and hence efficient calculation by dynamic programming
- Key observation: any path must be in exactly one state at time t

Key idea for HMM computations
0. A matrix holds the probability that state i emits the character seen at position t (N possible states as rows; observables ..., t-1, t, t+1, ... as columns)
1. Fill in the first column based on the probability that each state produces the first character
2. Fill in the second column by considering all possible transitions from the first column, their probabilities, and the probability that the row of the second column would emit the second character
3. Fill the entire matrix; sum the values in the last column to get P(observed sequence)

Example: Searching a protein database with an HMM profile
For each sequence in the database: does the sequence fit the model?
- Score by P(O_1, ..., O_T); compute a Z-score adjusted for length
- Z-score = number of standard deviations from the mean (Z = 2 means 2 SD above the mean, etc)
- Note: the observed states are the i and m states that produce amino acids; d states are inferred, for example if m(i) jumps to m(i+2)
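The column-by-column matrix fill described above is the forward algorithm. A minimal sketch with a toy two-state model (all probabilities invented) — each column of `alpha` is built from the previous one, and the final answer is the sum of the last column:

```python
# Forward algorithm: P(O_1..O_T) in O(N^2 * T) instead of summing O(N^T) paths.
def forward(obs, states, pi, trans, emit):
    """alpha[t][j] = P(O_1..O_t, state at t = j); return sum of the last column."""
    alpha = [{j: pi[j] * emit[j][obs[0]] for j in states}]     # first column
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({                                         # next column
            j: sum(prev[i] * trans[i][j] for i in states) * emit[j][o]
            for j in states
        })
    return sum(alpha[-1].values())                             # sum last column

# Toy two-state model (helix/loop) emitting a two-letter alphabet.
states = ["H", "L"]
pi = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.8, "L": 0.2}, "L": {"H": 0.3, "L": 0.7}}
emit = {"H": {"A": 0.7, "G": 0.3}, "L": {"A": 0.2, "G": 0.8}}
print(forward("AG", states, pi, trans, emit))  # ≈ 0.205
```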
Computing most likely hidden states
- Why? Because that tells us how to align different sequences (based on aligning the characters emitted by corresponding hidden states)
- Want the sequence of states most likely to have produced the observables
- Very similar to computing the probability, except we look for the maximum path (instead of summing all possible paths)

Finding most likely path = Viterbi
0. A matrix holds the maximum probability that state i emits the character seen at position t (N possible states as rows; observables ..., t-1, t, t+1, ... as columns; a MAX is taken at each cell)
1. Fill in the first column based on the probability that each state produces the first character
2. Fill in the second column by considering all possible transitions from the first column, their probabilities, and the probability that the row of the second column would emit the second character, choosing the maximum-probability state/transition
3. Fill the entire matrix; find the maximum value in the last column to get the best last state

Other details
- How many states in the model? How to initialize parameters? How to avoid local modes?
- See (Krogh et al, 1994) for some suggestions

Estimate alignment and model parameters simultaneously
Key idea (missing data):
- What if we knew the alignment? Parameters easy to estimate:
  - Calculate (expected) number of transitions
  - Calculate (expected) frequency of amino acids
- What if we knew the parameters? Alignment easy to find:
  - Align each sequence to the model using the Viterbi algorithm
  - Aligned residues are those in match states
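The Viterbi fill is the same recursion as the forward algorithm with max replacing sum, plus backpointers so the best path can be recovered. A sketch on the same invented two-state model:

```python
# Viterbi: most likely hidden state path, via max-product DP with backpointers.
def viterbi(obs, states, pi, trans, emit):
    """Return the state path maximizing P(path, obs)."""
    V = [{j: (pi[j] * emit[j][obs[0]], None) for j in states}]  # (prob, backptr)
    for o in obs[1:]:
        prev, col = V[-1], {}
        for j in states:
            # Best predecessor for landing in state j at this position.
            p, i = max((prev[i][0] * trans[i][j], i) for i in states)
            col[j] = (p * emit[j][o], i)
        V.append(col)
    # Trace back from the best state in the last column.
    best = max(V[-1], key=lambda j: V[-1][j][0])
    path = [best]
    for col in reversed(V[1:]):
        path.append(col[path[-1]][1])
    return path[::-1]

states = ["H", "L"]
pi = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.8, "L": 0.2}, "L": {"H": 0.3, "L": 0.7}}
emit = {"H": {"A": 0.7, "G": 0.3}, "L": {"A": 0.2, "G": 0.8}}
print(viterbi("AG", states, pi, trans, emit))  # -> ['H', 'H']
```

In the profile-HMM setting, running this per sequence gives the state path whose match states define that sequence's alignment columns.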
Estimating parameters (hardest part)
- Algorithm (Baum-Welch) discussed nicely in the handout from Ewens & Grant
- Set the topology of the network (number of states and their connectivity) beforehand
- Estimate:
  - transition probabilities
  - emission probabilities
  - initial probabilities
  - the probability that hidden state i is responsible for each observed character

Estimating parameters: high level summary
1. Iteratively introduce each sequence into the model
2. Use the (initially lousy) estimates to align the sequence to the model
3. Use the resulting alignment to accumulate statistics about the parameters
4. Introduce the next sequence and continue the accumulation
5. Use the accumulated statistics to update the parameters
6. Repeat the entire process of introducing sequences until the parameters converge

Multiple protein sequence alignment
Given a set of sequences:
1. Estimate the HMM model, using optimization for the parameter search
2. Align each sequence to the model (Viterbi)
3. Match states of the model provide the columns of the resulting multiple alignment
Example: Multiple alignment of globin sequences
Figure from (Krogh et al, 1994)

Tradeoffs
- Advantages:
  - Explicit probabilistic model for the family
  - Position-specific residue distributions, gap penalties, insertion frequencies
- Disadvantages:
  - Many parameters; requires more data or care
  - Traded one hard optimization problem for another

Extensions
- Modeling domains
- Clustering subfamilies
Figures from (Krogh et al, 1994)
HMM Summary
- Powerful tool for modeling protein families
- Generalization of existing profile methods
- Data-intensive
- Widely applicable to problems in bioinformatics

Lecture ends here; extra slides follow

Forward-backward algorithm
Forward pass: define α_t(j) = probability of the subsequence O_1 O_2 ... O_t when in S_j at time t:

  α_t(j) = [ sum_{i=1..N} α_{t-1}(i) P(S_j | S_i) ] P(O_t | S_j)

Forward-backward algorithm (cont'd)
Notice that P(O_1, ..., O_T) = sum_{j=1..N} α_T(j)
(Key observation: any path must be in one of the N states at time t)
Define an analogous backward pass so that:

  β_t(j) = sum_{i=1..N} β_{t+1}(i) P(S_i | S_j) P(O_{t+1} | S_i)

and

  P(O_t came from S_j) = α_t(j) β_t(j) / sum_{i=1..N} α_t(i) β_t(i)

Baum-Welch algorithm (Expectation-Maximization)
Goal: maximize P(O_1, ..., O_n | θ)
Set parameters to expected values given the observed sequences:
- State transition probs:
  P(S_j | S_i) = [ sum_{t=1..T} P(in S_i at t, in S_j at t+1 | O) ] / [ sum_{t=1..T} P(in S_i at t | O) ]
- Observation probs:
  P(obs | S_i) = [ sum_{t=1..T} P(in S_i at t | O) * 1(O_t = obs) ] / [ sum_{t=1..T} P(in S_i at t | O) ]
Recalculate the expectations with the new probabilities; iterate to convergence
Guaranteed strictly increasing likelihood; converges to a local mode
(See Rabiner, 1989 for details)

HMM-based multiple sequence alignment
Multiple alignment of k sequences is hard, so instead:
1. Estimate a statistical model for the sequences
   - Use a PROFILE alignment as a first guess
   - (OR) Start from scratch with unaligned sequences (harder)
2. Align each sequence to the model
3. The alignment yields assignments of equivalent sequence elements within the multiple alignment
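The α/β recursions above combine into the posterior P(O_t came from S_j), the quantity Baum-Welch accumulates. A sketch on the same invented two-state model used earlier; it returns one posterior (γ) column per observation:

```python
# Forward-backward: posterior state occupancies gamma[t][j] = P(state_t = j | O),
# computed as alpha*beta normalized per column.
def forward_backward(obs, states, pi, trans, emit):
    # Forward pass: alpha[t][j] = P(O_1..O_t, state_t = j)
    alpha = [{j: pi[j] * emit[j][obs[0]] for j in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * trans[i][j] for i in states) * emit[j][o]
                      for j in states})
    # Backward pass: beta[t][j] = P(O_{t+1}..O_T | state_t = j); beta at T is 1
    beta = [{j: 1.0 for j in states}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {j: sum(trans[j][i] * emit[i][o] * nxt[i] for i in states)
                        for j in states})
    # Posterior: alpha_t(j) * beta_t(j) / sum_i alpha_t(i) * beta_t(i)
    gamma = []
    for a, b in zip(alpha, beta):
        z = sum(a[i] * b[i] for i in states)
        gamma.append({i: a[i] * b[i] / z for i in states})
    return gamma

states = ["H", "L"]
pi = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.8, "L": 0.2}, "L": {"H": 0.3, "L": 0.7}}
emit = {"H": {"A": 0.7, "G": 0.3}, "L": {"A": 0.2, "G": 0.8}}
g = forward_backward("AG", states, pi, trans, emit)
print(g[1]["H"])  # posterior that the second character was emitted by H
```

A full Baum-Welch step would then re-estimate the transition and emission probabilities from sums of these posteriors, exactly as in the update formulas above.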