Hidden Markov models in population genetics and evolutionary biology Gerton Lunter Wellcome Trust Centre for Human Genetics Oxford, UK April 29, 2013
Topics for today Markov chains Hidden Markov models Examples Sequence features (genes, domains) Sequence evolution (alignment, conserved elements) Population genetics (phasing; demographic inference) Journal club
Markov chains
Markov chains

Suppose a stochastic process of interest is modelled as a discrete-time process $\{X_i\}_{i \ge 1}$. This process is a Markov process if it is characterized by

$X_1 \sim \mu(\cdot)$ (the initial distribution)
$X_k \mid (X_{k-1} = x_{k-1}) \sim f(\cdot \mid x_{k-1})$ (the transition probabilities)

Notation: $x_{i:j} = (x_i, x_{i+1}, \ldots, x_{j-1}, x_j)$

$p(x_{1:n}) = p(x_1) \prod_{k=2}^{n} p(x_k \mid x_{1:k-1}) = \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1})$
Example: a weather model

Modeling the observation that today's weather is likely to be similar to yesterday's.

[Figure: two-state Markov chain over sunny/rainy weather (the weather icons did not survive extraction); transition probabilities 6/7 and 1/7 out of one state, 1/3 and 2/3 out of the other.]

$f(\cdot \mid \cdot) = 1/7$, etc.

The probability of a six-day weather sequence factorizes as $p(x_{1:6}) = \mu(x_1)\, f(x_2 \mid x_1)\, f(x_3 \mid x_2) \cdots f(x_6 \mid x_5)$.
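The factorization above is easy to compute directly. A minimal sketch, assuming state names "sunny"/"rainy" in place of the slide's icons and a uniform initial distribution (both assumptions for illustration):

```python
# Two-state weather Markov chain; state names and the uniform initial
# distribution are illustrative assumptions.
mu = {"sunny": 0.5, "rainy": 0.5}
f = {("sunny", "sunny"): 6/7, ("sunny", "rainy"): 1/7,
     ("rainy", "sunny"): 1/3, ("rainy", "rainy"): 2/3}   # f[(prev, next)]

def chain_prob(x):
    """p(x_1:n) = mu(x_1) * prod_k f(x_k | x_{k-1})."""
    p = mu[x[0]]
    for prev, nxt in zip(x, x[1:]):
        p *= f[(prev, nxt)]
    return p

seq = ["sunny", "sunny", "sunny", "rainy", "rainy", "rainy"]
print(chain_prob(seq))
```

Note that only adjacent pairs enter the product: this is exactly the Markov property.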
Example: CpG frequency in mammalian genomes

In mammals, the C in CpG dinucleotides is often methylated, increasing the rate of the C→T transition and causing CpGs to be about 5 times less frequent than expected. This can be modelled by a Markov chain along the sequence, on the state space {A, C, G, T}.

[Figure: transition diagram over Start, A, C, G, T, End; blue edges indicate lower-probability transitions.]
Example: mutation process

Let $X_i \in \{A, C, G, T\}$ be the nucleotide state at some site at time $i\,\Delta t$.

$f(x_k \mid x_{k-1}) = A_{x_{k-1} x_k}$ with

$A = \begin{pmatrix} 1-3\epsilon & \epsilon & \epsilon & \epsilon \\ \epsilon & 1-3\epsilon & \epsilon & \epsilon \\ \epsilon & \epsilon & 1-3\epsilon & \epsilon \\ \epsilon & \epsilon & \epsilon & 1-3\epsilon \end{pmatrix}$ (rows and columns indexed by A, C, G, T)

$p(x_{2:n} \mid x_1) = \prod_{k=2}^{n} A_{x_{k-1} x_k}$; $\quad p(x_n \mid x_1) = \sum_{x_{2:n-1}} \prod_{k=2}^{n} A_{x_{k-1} x_k} = (A^{n-1})_{x_1 x_n}$

Let $A = I + B\,\Delta t$ ($B$ is the rate matrix), $\Delta t = 1/n$, and let $n \to \infty$; then

$p(x_n \mid x_1) = \left((I + B/n)^{n-1}\right)_{x_1 x_n} \to \exp(B)_{x_1 x_n}$
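The limit $(I + B/n)^{n-1} \to \exp(B)$ can be checked numerically. A sketch assuming a Jukes-Cantor rate matrix with off-diagonal rate $a = 0.1$ (an arbitrary illustrative choice):

```python
import numpy as np

# Numerical check of (I + B/n)^n -> exp(B) for a Jukes-Cantor rate matrix.
# The rate a = 0.1 is an arbitrary illustrative choice.
a = 0.1
B = a * (np.ones((4, 4)) - 4 * np.eye(4))   # off-diagonal a, diagonal -3a; rows sum to 0

def mat_exp(M, terms=30):
    """Matrix exponential via its Taylor series (adequate for this small M)."""
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

n = 1_000_000
approx = np.linalg.matrix_power(np.eye(4) + B / n, n)
exact = mat_exp(B)
print(np.max(np.abs(approx - exact)))   # difference vanishes as n grows
# each row of exp(B) is a proper transition distribution (sums to 1),
# because each row of B sums to 0
```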
Hidden Markov models
Hidden Markov models

Suppose that $\{X_k\}_{k \ge 1}$ is not observed (it is hidden), but that we do observe a related process $\{Y_k\}_{k \ge 1}$. Conditional on $\{X_k\}_{k \ge 1}$ the observations $\{Y_k\}_{k \ge 1}$ are independent, and marginally distributed as

$Y_k \mid (X_k = x_k) \sim g(\cdot \mid x_k)$

This implies that conditional on $\{X_k\}_{k \ge 1}$ we have

$p(y_{1:n} \mid x_{1:n}) = \prod_{k=1}^{n} g(y_k \mid x_k)$
Example: Weather model

Markov chain: move from state to state according to the transition probabilities $f(\cdot \mid \cdot)$; the observations are the states visited (a sequence of sunny/rainy days).

Hidden Markov model: move between hidden states (H, L) according to a Markov chain as before, but in each state emit the observation (sunny or rainy) according to an emission distribution $g(\cdot \mid \cdot)$ instead.

[Figure: two hidden states H and L connected by transition probabilities, each with its own emission distribution over sunny/rainy; the figure's numerical values (among them 6/7, 1/7, 2/3, 9/10, 1/10) did not fully survive extraction.]
Example: Weather model

Markov chain (hidden): $p(HHHLLL) = \mu(H)\, f(H \mid H)\, f(H \mid H)\, f(L \mid H)\, f(L \mid L)\, f(L \mid L)$

Observations: $p(y_{1:6}, HHHLLL) = p(HHHLLL)\; g(y_1 \mid H)\, g(y_2 \mid H)\, g(y_3 \mid H)\, g(y_4 \mid L)\, g(y_5 \mid L)\, g(y_6 \mid L)$, where $y_{1:6}$ is the observed sequence of sunny/rainy days.
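This joint probability is a product of one initial term, five transition terms, and six emission terms. A sketch with assumed emission probabilities (g(sunny|H)=0.9, g(rainy|L)=0.8, and a uniform initial distribution; the exact figure values did not survive extraction):

```python
# Joint probability p(x_1:n, y_1:n) for the weather HMM.
# All numerical values are illustrative assumptions.
mu = {"H": 0.5, "L": 0.5}
f = {("H", "H"): 6/7, ("H", "L"): 1/7,
     ("L", "H"): 1/3, ("L", "L"): 2/3}          # f[(prev, next)]
g = {("sunny", "H"): 0.9, ("rainy", "H"): 0.1,
     ("sunny", "L"): 0.2, ("rainy", "L"): 0.8}  # g[(obs, state)]

def joint_prob(x, y):
    """p(x_1:n, y_1:n) = mu(x_1) g(y_1|x_1) * prod_k f(x_k|x_{k-1}) g(y_k|x_k)."""
    p = mu[x[0]] * g[(y[0], x[0])]
    for k in range(1, len(x)):
        p *= f[(x[k-1], x[k])] * g[(y[k], x[k])]
    return p

x = list("HHHLLL")
y = ["sunny"] * 3 + ["rainy"] * 3
print(joint_prob(x, y))
```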
Hidden Markov models

Questions you may want to ask:

What is the likelihood of the observations: $p(y_{1:6}) = \sum_{x_{1:6}} p(x_{1:6}, y_{1:6})$ (Forward algorithm)

What is the posterior probability of a particular state given the observations: $p(x_k \mid y_{1:n}) = \sum_{x_{1:k-1}} \sum_{x_{k+1:n}} p(x_{1:n}, y_{1:n}) \,/\, \sum_{x_{1:n}} p(x_{1:n}, y_{1:n})$ (Forward + Backward algorithms)

What is the single most likely state sequence given the observations: $\arg\max_{x_{1:n}} p(x_{1:n} \mid y_{1:n})$ (Viterbi algorithm)
Example: posterior probability of a particular state

$p(x_k \mid y_{1:n}) = \dfrac{p(x_k, y_{1:n})}{\sum_{x_k} p(x_k, y_{1:n})}$

$p(x_k, y_{1:n}) = \sum_{x_{1:k-1}} \sum_{x_{k+1:n}} p(x_{1:n}, y_{1:n})$
$= \sum_{x_{1:k-1}} \sum_{x_{k+1:n}} p(x_{1:k}, y_{1:k})\, p(x_{k+1:n}, y_{k+1:n} \mid x_{1:k}, y_{1:k})$
$= \sum_{x_{1:k-1}} \sum_{x_{k+1:n}} p(x_{1:k}, y_{1:k})\, p(x_{k+1:n}, y_{k+1:n} \mid x_k)$
$= p(x_k, y_{1:k})\, p(y_{k+1:n} \mid x_k) =: \alpha_k(x_k)\, \beta_k(x_k)$
Example: posterior probability of a particular state

$\alpha_k(x_k) := p(x_k, y_{1:k})$

$\alpha_1(x_1) = p(x_1, y_1) = \mu(x_1)\, g(y_1 \mid x_1)$

$\alpha_{k+1}(x_{k+1}) = \sum_{x_{1:k}} p(x_{1:k+1}, y_{1:k+1})$
$= \sum_{x_{1:k}} p(x_{1:k}, y_{1:k})\, f(x_{k+1} \mid x_k)\, g(y_{k+1} \mid x_{k+1})$
$= \sum_{x_k} \left( \sum_{x_{1:k-1}} p(x_{1:k}, y_{1:k}) \right) f(x_{k+1} \mid x_k)\, g(y_{k+1} \mid x_{k+1})$
$= \sum_{x_k} \alpha_k(x_k)\, f(x_{k+1} \mid x_k)\, g(y_{k+1} \mid x_{k+1})$
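The forward recursion takes a few lines of numpy. A sketch for a two-state weather-style model; the transition and emission values are illustrative assumptions:

```python
import numpy as np

# Forward algorithm for a 2-state HMM. F[i, j] = f(j | i), G[i, y] = g(y | i);
# the numerical values are illustrative assumptions.
mu = np.array([0.5, 0.5])
F = np.array([[6/7, 1/7],
              [1/3, 2/3]])
G = np.array([[0.9, 0.1],    # state 0 (H): mostly observation 0 ("sunny")
              [0.2, 0.8]])   # state 1 (L): mostly observation 1 ("rainy")

def forward(y):
    """alpha[k, i] = p(X_k = i, y_1:k), computed left to right."""
    alpha = np.zeros((len(y), len(mu)))
    alpha[0] = mu * G[:, y[0]]
    for k in range(1, len(y)):
        alpha[k] = (alpha[k - 1] @ F) * G[:, y[k]]
    return alpha

y = [0, 0, 0, 1, 1, 1]
print(forward(y)[-1].sum())   # likelihood p(y_1:n)
```

The recursion costs $O(n S^2)$ for $S$ states, versus $O(S^n)$ for the naive sum over all paths.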
Example: posterior probability of a particular state

$\beta_k(x_k) := p(y_{k+1:n} \mid x_k)$

$\beta_n(x_n) = p(\emptyset \mid x_n) = 1$

$\beta_{k-1}(x_{k-1}) = p(y_{k:n} \mid x_{k-1}) = \sum_{x_k} p(x_k \mid x_{k-1})\, p(y_k \mid x_k)\, p(y_{k+1:n} \mid x_k) = \sum_{x_k} f(x_k \mid x_{k-1})\, g(y_k \mid x_k)\, \beta_k(x_k)$
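Pairing the backward recursion with a forward pass gives the state posteriors $p(x_k \mid y_{1:n}) = \alpha_k(x_k)\beta_k(x_k)/p(y_{1:n})$. A sketch with the same assumed two-state model as before:

```python
import numpy as np

# Backward algorithm and state posteriors for a 2-state HMM.
# All parameter values are illustrative assumptions.
mu = np.array([0.5, 0.5])
F = np.array([[6/7, 1/7],
              [1/3, 2/3]])
G = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def forward(y):
    alpha = np.zeros((len(y), len(mu)))
    alpha[0] = mu * G[:, y[0]]
    for k in range(1, len(y)):
        alpha[k] = (alpha[k - 1] @ F) * G[:, y[k]]
    return alpha

def backward(y):
    beta = np.zeros((len(y), len(mu)))
    beta[-1] = 1.0                                # beta_n(x) = 1
    for k in range(len(y) - 2, -1, -1):
        beta[k] = F @ (G[:, y[k + 1]] * beta[k + 1])   # sum over x_{k+1}
    return beta

y = [0, 0, 1, 1]
alpha, beta = forward(y), backward(y)
post = alpha * beta / alpha[-1].sum()   # p(x_k | y_1:n)
print(post.sum(axis=1))                 # each row sums to 1
```

A useful consistency check: $\sum_x \alpha_k(x)\beta_k(x)$ equals $p(y_{1:n})$ for every $k$.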
Summary of useful HMM algorithms

Sampling from the prior (trivial)
Forward or Backward for the likelihood: $p(y_{1:n}) = \sum_{x_n} \alpha_n(x_n) = \sum_{x_1} \mu(x_1)\, g(y_1 \mid x_1)\, \beta_1(x_1)$
Forward and Backward for: state posteriors $p(x_k \mid y_{1:n}) = \alpha_k(x_k)\, \beta_k(x_k) / p(y_{1:n})$; sampling state paths from the posterior; Expectation-Maximization (Baum-Welch) to estimate parameters; posterior decoding (MAP paths)
Viterbi for the single most likely state path, $\arg\max_{x_{1:n}} p(x_{1:n} \mid y_{1:n})$
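Of the algorithms listed, Viterbi is the remaining workhorse: replace the forward sums with maxima and keep backpointers. A sketch for the same assumed two-state model:

```python
import numpy as np

# Viterbi algorithm: most likely state path argmax_x p(x_1:n, y_1:n).
# Model parameters are illustrative assumptions (states 0=H, 1=L;
# observations 0=sunny, 1=rainy).
mu = np.array([0.5, 0.5])
F = np.array([[6/7, 1/7],
              [1/3, 2/3]])
G = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def viterbi(y):
    n, S = len(y), len(mu)
    delta = np.zeros((n, S))            # delta[k, i] = max_{x_1:k-1} p(x_1:k-1, i, y_1:k)
    ptr = np.zeros((n, S), dtype=int)   # backpointers
    delta[0] = mu * G[:, y[0]]
    for k in range(1, n):
        scores = delta[k - 1][:, None] * F     # scores[i, j]: come from i, go to j
        ptr[k] = scores.argmax(axis=0)
        delta[k] = scores.max(axis=0) * G[:, y[k]]
    path = [int(delta[-1].argmax())]
    for k in range(n - 1, 0, -1):
        path.append(int(ptr[k][path[-1]]))
    return path[::-1]

print(viterbi([0, 0, 0, 1, 1, 1]))   # most likely hidden path
```

For long sequences, real implementations work with log probabilities to avoid numerical underflow.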
Examples
Example 1: Motif finding hx0a ( 173 )...dyvrsmiadylnklid-igvagfridaskhmw... 1smd ( 173 )...dyvrskiaeymnhlid-igvagfridaskhmw... 1jae ( 161 )...dyvrgvlidymnhmid-lgvagfrvdaakhms... 1g94a ( 150 )...nyvqntiaayindlqa-igvkgfrfdaskhva... 1bag ( 152 )...tqvqsylkrfleraln-dgadgfrfdaakhie... 1smaa ( 303 )...pevkrylldvatywirefdidgwrldvaneid... 1bvza ( 301 )...pevkeylfdvarfwm-eqgidgwrldvanevd... 1uok ( 175 )...ekvrqdvyemmkfwle-kgidgfrmdvinfis... 2aaa ( 181 )...tavrtiwydwvadlvsnysvdglridsvlevq... 7taa ( 181 )...dvvknewydwvgslvsnysidglridtvkhvq... 1cgt ( 205 )...atidkyfkdaiklwld-mgvdgirvdavkhmp... 1ciu ( 206 )...stidsylksaikvwld-mgidgirldavkhmp... 1cyg ( 201 )...pvidrylkdavkmwid-mgidgirmdavkhmp... 1qhpa ( 204 )...gtiaqyltdaavqlva-hgadglridavkhfn... 1hvxa ( 209 )...pevvtelkswgkwyvnttnidgfrldavkhik... 1vjs ( 206 )...pdvaaeikrwgtwyanelqldgfrldavkhik... 1gcya ( 168 )...pqvygmfrdeftnlrsqygaggfrfdfvrgya... 1avaa ( 154 )...lrvqkelvewlnwlkadigfdgwrfdfakgys... 1ehaa ( 227 )...devrkfilenveywikeynvdgfrldavhaii... 1bf2 ( 350 )...tvaqnlivdslaywantmgvdgfrfdlasvlg... 1gjwa ( 360 )...relweylagviphyqkkygidgarldmghalp...
Example 1: Motif finding

A motif modeled as an ungapped weight matrix can be represented as an HMM. We can ask for a local alignment by adding padding states at the beginning and the end:

[Figure: Start and End states, flanking padding states (X), and six match states; the position-specific emission distributions (values such as A: 0.8, A: 1.0, U: 0.8, C: 0.9, C: 1.0, A: 0.8, with minor alternatives at 0.1-0.2) did not survive extraction cleanly.]
Example 1: Motif finding

Not all related motifs have exactly the same length; some may lack certain residues. This is modeled by introducing delete states into the HMM:

[Figure: the weight-matrix HMM above, extended with silent delete states that bypass individual match states.]

The transition probabilities to/from delete states are position-dependent: the probability of deleting a particular nucleotide depends on the location within the motif.
Example 1: Motif finding

Similarly, some motifs may have extra residues, which are modeled with insert states.

[Figure: the HMM above, extended with insert states (X) between the match states.]
Example 1: Motif finding

We can add a loopback transition to allow for multiple consecutive matches (think e.g. zinc-finger proteins):

[Figure: the profile HMM with a loopback transition from the end of the motif back to its start.]
Example 1: Motif finding

This is the profile HMM architecture in the SAM/HMMER packages:

[Figure: full profile HMM with match, insert and delete states.]

In this context, the standard algorithms achieve the following:
Viterbi: alignment of a sequence to the HMM
Forward: likelihood that a sequence contains the motif
Forward-Backward: posterior expected state/transition counts
Baum-Welch uses these expectations to maximise the likelihood of a given training set
Example 2: Gene finding (source unknown)
Example 2: Gene finding Burge and Karlin, JMB 1998
Example 2: Gene finding UCSC genome browser
Example 3: PhyloHMM Siepel et al., Genome Research 2005
Example 4: Alignment

Observation: two sequences, GAATTCGA and GCATCGA.
Required: an alignment, i.e. a sequence of alignment columns:

########
####-###

GAATTCGA
GCAT-CGA
Example 4: Alignment

To fit into the HMM framework:
Allow two sequences to be emitted simultaneously
Allow states with empty emissions

The Markov chain $\{X_i\}_{i \ge 1}$ is the sequence of alignment columns: (#/#), (#/#), (#/#), (#/#), (#/-), (#/#), (#/#), (#/#)

Nucleotides emitted together are correlated (homologous)
$\alpha$, $\beta$ now have two indices; computing them involves traversing a 2-dimensional dynamic programming table.
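The 2-dimensional DP table can be made concrete with a small pair-HMM Viterbi sketch: three states M (emit a column in both sequences), X (emit in the first only) and Y (emit in the second only). All transition and emission values below are illustrative assumptions, not parameters from the lecture:

```python
import numpy as np

# Pair-HMM Viterbi over two sequences; V[state][i, j] is the best log joint
# probability of aligning s1[:i] with s2[:j], ending in that state.
# Transitions (assumed; no direct X<->Y moves, for simplicity); each row sums to 1.
t = {("M", "M"): 0.9, ("M", "X"): 0.05, ("M", "Y"): 0.05,
     ("X", "X"): 0.4, ("X", "M"): 0.6,
     ("Y", "Y"): 0.4, ("Y", "M"): 0.6}

def e_match(a, b):          # match-state pair emission (assumed values; sums to 1)
    return 0.2 if a == b else 0.2 / 12

E_GAP = 0.25                # X/Y single-nucleotide emission (uniform)

def pair_viterbi(s1, s2):
    n, m = len(s1), len(s2)
    NEG = -np.inf
    V = {s: np.full((n + 1, m + 1), NEG) for s in "MXY"}
    V["M"][0, 0] = 0.0      # start in M with probability 1 (assumption)
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:   # M consumes one symbol from each sequence
                best = max(V["M"][i-1, j-1] + np.log(t[("M", "M")]),
                           V["X"][i-1, j-1] + np.log(t[("X", "M")]),
                           V["Y"][i-1, j-1] + np.log(t[("Y", "M")]))
                V["M"][i, j] = best + np.log(e_match(s1[i-1], s2[j-1]))
            if i > 0:             # X consumes from s1 only
                best = max(V["M"][i-1, j] + np.log(t[("M", "X")]),
                           V["X"][i-1, j] + np.log(t[("X", "X")]))
                V["X"][i, j] = best + np.log(E_GAP)
            if j > 0:             # Y consumes from s2 only
                best = max(V["M"][i, j-1] + np.log(t[("M", "Y")]),
                           V["Y"][i, j-1] + np.log(t[("Y", "Y")]))
                V["Y"][i, j] = best + np.log(E_GAP)
    return max(V[s][n, m] for s in "MXY")

print(pair_viterbi("GAATTCGA", "GCATCGA"))
```

A backtrace through the same table recovers the alignment itself; replacing max with a (log-)sum gives the pairwise Forward algorithm, whose $\alpha$ indeed carries two sequence indices.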
Example 5: co-estimating alignment and conservation

M = (#/#); Ins = (-/#); Del = (#/-)
Example 6: Probabilistic progressive alignment

Problem: how to align > 2 sequences?

Naive HMM implementation: properly accounts for uncertainty (in the alignment, not the tree), but the complexity is $O(L^N)$; the DP table has $N$ dimensions.

Progressive alignment: pairwise alignment + inferring the sequence at the root. A practical approach, with complexity $O(N L^2)$, but inferences are biased and overconfident. E.g. PRANK; Loytynoja & Goldman 2005.
Example 6: Probabilistic progressive alignment Solution: Combine progressive and probabilistic approaches. Represent ancestral sequence a of a pair of descendant sequences s 1, s 2 as a partial likelihood, with a as parameter: P(s 1, s 2 a) Prune unlikely alignment columns, and represent remainder of dynamic programming table as a graph Iterate, aligning/pruning graphs progressively up the tree At root, use prior distribution on a to find multiple alignment Algorithm can be formalized in terms of transducers. Details: Westesson/Holmes, arxiv:1103.4347v2; PLoS ONE e34572
Example 7: Lander-Green (phasing in pedigrees)
Example 7: Lander-Green (phasing in pedigrees)

Transmission in a pedigree with n non-founders is determined by 2n bits: 2 bits per non-founder identify the grandparent of origin for the paternal and maternal chromosomes.
The transmission vector is the state of the HMM ($2^{2n}$ states).
State changes (single bit flips) correspond to recombinations.
The state determines which observed genotypes are more or less likely; some require > 1 mutation per site.

Lander and Green, PNAS, 1987
Example 8: Li and Stephens (phasing in populations) Li and Stephens, Genetics 2003
Intermezzo: the Wright-Fisher model
Example 9: CoalHMM and incomplete lineage sorting Hobolth,Dutheil,Hawks,Schierup,Mailund (2011) Genome Research
Example 10: PSMC and demographic inference Li and Durbin, Nature 2011
Journal club
Papers:
Hobolth, Christensen, Mailund, Schierup (2007) Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genet 3(2):e7.
Li and Durbin (2011) Inference of human population history from individual whole-genome sequences. Nature 475:493-496.
Lunter, Rocco, Mimouni, Heger, Caldeira, Hein (2008) Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Res 18(2):298-309.
Scheet and Stephens (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 78(4).
Lander and Green (1987) PNAS 84:2363-2367; and Kruglyak, Daly, Reeve-Daly, Lander (1996) AJHG 58:1347-1363.
Questions - Hobolth et al. Hobolth et al.: Explain the difference between phylogeny and genealogy, and the concept of incomplete lineage sorting. What do the states of the HMM represent? What are informative sites for the model? The model can in principle be applied to any quartet of species. What aspect of the shape of the phylogeny relating the species, and what other parameters (if any) are relevant to assess whether the model might provide useful inferences?
Questions - Li and Durbin What do the states of the HMM represent? At any locus, the density of heterozygous sites determines which state is currently most likely. On average, and in human, how many heterozygous sites occur between state switches? Do you think the data is very informative about the HMM state at any position? What limits the power to infer N e at recent and ancient times?
Questions - Lunter et al.

List some causes of inaccuracies in alignments. Would a more accurate model of sequence evolution improve alignments? Is model misfit the main cause of alignment inaccuracies?
What is the practical limit (in terms of evolutionary distance, in mutations/site) for pairwise alignment of DNA? Would multiple alignment allow DNA from more divergent species to be aligned? How can divergence be assessed by alignment for species that are more divergent?
What is posterior decoding, and how does it work? In what way does it improve alignments compared to Viterbi decoding, and why?