Biological Sequences and Hidden Markov Models CPBS7711, Sept 27, 2011 Sonia Leach, PhD, Assistant Professor, Center for Genes, Environment, and Health, National Jewish Health, sonia.leach@gmail.com Slides created from David Pollock's 2009 slides from 7711 and current reading list from CPBS7711 website Center for Genes, Environment, and Health
Introduction Despite their complex 3-D structure, biological molecules have a primary linear sequence (DNA, RNA, protein) or a linear sequence of features (CpG islands, models of exons, introns, regulatory regions, genes). Hidden Markov Models (HMMs) are probabilistic models for processes which transition through a discrete set of states, each emitting a symbol (a probabilistic finite state machine). HMMs exhibit the Markov property: the conditional probability distribution of future states of the process depends only upon the present state (memory-less). A linear sequence of molecules/features is modelled as a path through states of the HMM which emit the sequence of molecules/features. The actual state is hidden and observed only through the output symbols. [Portrait: Andrey Markov, 1856-1922]
Hidden Markov Model
Finite set of N states X; finite set of M observations O; parameter set λ = (π, A, B)
Initial state distribution: π_i = Pr(X_1 = i)
Transition probability: a_ij = Pr(X_t = j | X_t-1 = i)
Emission probability: b_ik = Pr(O_t = k | X_t = i)
Example: N=3, M=2, π = (0.25, 0.55, 0.2)
A = | 0   0.2 0.8 |    B = | 0.1  0.9  |
    | 0   0.9 0.1 |        | 0.75 0.25 |
    | 1.0 0   0   |        | 0.5  0.5  |
Hidden Markov Model
Finite set of N states X; finite set of M observations O; parameter set λ = (π, A, B)
Graphical model of the HMM: hidden chain X_t-1 → X_t, with each X_t emitting O_t
Initial state distribution: π_i = Pr(X_1 = i)
Transition probability: a_ij = Pr(X_t = j | X_t-1 = i)
Emission probability: b_ik = Pr(O_t = k | X_t = i)
Example: N=3, M=2, π = (0.25, 0.55, 0.2)
A = | 0   0.2 0.8 |    B = | 0.1  0.9  |
    | 0   0.9 0.1 |        | 0.75 0.25 |
    | 1.0 0   0   |        | 0.5  0.5  |
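As a concrete sketch, the example parameter set λ = (π, A, B) can be written as plain arrays. The matrix layout is a reading of the slide's garbled figure; the sanity checks at the end are what make it plausible, since every row must be a probability distribution:

```python
import numpy as np

# Example HMM from the slide: N=3 states, M=2 observation symbols.
pi = np.array([0.25, 0.55, 0.20])        # pi[i] = Pr(X_1 = i)
A = np.array([[0.0, 0.2, 0.8],           # A[i, j] = Pr(X_t = j | X_{t-1} = i)
              [0.0, 0.9, 0.1],
              [1.0, 0.0, 0.0]])
B = np.array([[0.10, 0.90],              # B[i, k] = Pr(O_t = k | X_t = i)
              [0.75, 0.25],
              [0.50, 0.50]])

# Sanity checks: pi and every row of A and B must sum to 1.
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```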
Probabilistic Graphical Models
Markov Process (MP): hidden chain over time, X_t-1 → X_t
Add observability → Hidden Markov Model (HMM): X_t-1 → X_t, each X_t emits O_t
Add utility → Markov Decision Process (MDP): actions A_t, states X_t, utilities U_t
Add observability and utility → Partially Observable Markov Decision Process (POMDP): actions A_t, hidden states X_t, observations O_t, utilities U_t
Three basic problems of HMMs
1. Given the observation sequence O = O_1, O_2, ..., O_n, how do we compute Pr(O | λ)?
2. Given the observation sequence, how do we choose the corresponding state sequence X = X_1, X_2, ..., X_n which is optimal?
3. How do we adjust the model parameters λ to maximize Pr(O | λ)?
Example:
π_i = Pr(X_1 = i), a_ij = Pr(X_t = j | X_t-1 = i), b_ik = Pr(O_t = k | X_t = i)
N=3, M=2, π = (0.25, 0.55, 0.2)
A = | 0   0.2 0.8 |    B = | 0.1  0.9  |
    | 0   0.9 0.1 |        | 0.75 0.25 |
    | 1.0 0   0   |        | 0.5  0.5  |
Observation sequence O? State sequence X? Prob(O, X | λ)?
Example: (same 3-state, 2-symbol model)
Probability of O is a sum over all state sequences:
Pr(O | λ) = Σ_{all X} Pr(O | X, λ) Pr(X | λ)
          = Σ_{all X} π_x1 b_x1,o1 a_x1,x2 b_x2,o2 ... a_xT-1,xT b_xT,oT
What is the computational complexity of this sum?
Example: (same 3-state, 2-symbol model)
Pr(O | λ) = Σ_{all X} Pr(O | X, λ) Pr(X | λ) = Σ_{all X} π_x1 b_x1,o1 a_x1,x2 b_x2,o2 ... a_xT-1,xT b_xT,oT
At each t there are N possible states, so there are N^T possible state sequences and about 2T multiplications per sequence, i.e. O(2T·N^T) operations.
So 3 states and a length-10 sequence = 1,180,980 operations, and length 20 ≈ 1e11!
Example: (same 3-state, 2-symbol model)
Pr(O | λ) = Σ_{all X} Pr(O | X, λ) Pr(X | λ) = Σ_{all X} π_x1 b_x1,o1 a_x1,x2 b_x2,o2 ... a_xT-1,xT b_xT,oT
There is an efficient dynamic programming algorithm to do this: the Forward algorithm (Baum and Welch), O(N²T)
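A minimal sketch of the forward recursion in Python (using the example parameters as read from the figure), checked against the exponential brute-force sum it replaces:

```python
import itertools
import numpy as np

pi = np.array([0.25, 0.55, 0.20])
A = np.array([[0.0, 0.2, 0.8], [0.0, 0.9, 0.1], [1.0, 0.0, 0.0]])
B = np.array([[0.10, 0.90], [0.75, 0.25], [0.50, 0.50]])

def forward(pi, A, B, obs):
    """Pr(O | lambda) by dynamic programming in O(N^2 T)."""
    alpha = pi * B[:, obs[0]]              # alpha_1(i) = pi_i * b_i(O_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
    return alpha.sum()

def brute_force(pi, A, B, obs):
    """Sum over all N^T state sequences -- exponential, for checking only."""
    N, total = len(pi), 0.0
    for path in itertools.product(range(N), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t-1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

obs = [0, 1, 1, 0]                         # a hypothetical observation sequence over symbols {0, 1}
assert np.isclose(forward(pi, A, B, obs), brute_force(pi, A, B, obs))
```

For this length-4 sequence the brute force already enumerates 3^4 = 81 paths; the recursion does the same work in a handful of small matrix-vector products.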
A Simple HMM
CpG islands: in one state, there is a much higher probability of emitting C or G.
CpG state: G .3, C .3, A .2, T .2 (self-transition 0.8, to Non-CpG 0.2)
Non-CpG state: G .1, C .1, A .4, T .4 (self-transition 0.9, to CpG 0.1)
From David Pollock
The Forward Algorithm: the Probability of a Sequence is the Sum of All Paths that Can Produce It
(CpG model as above) Assuming π = (0.5, 0.5), what is Pr(O = G | λ)?
For O = G there are 2 possible state sequences: C (i.e. the CpG state) and N (i.e. the Non-CpG state)
Adapted from David Pollock's
The Forward Algorithm: the Probability of a Sequence is the Sum of All Paths that Can Produce It
Assuming π_X = 0.5: Pr(G | λ) = π_C b_CG + π_N b_NG = .5*.3 + .5*.1
For convenience, drop the 0.5s for now and add them back at the end (so the number to the right of G in each box is the probability of emitting G in that state, i.e. b_XG): α_C = .3, α_N = .1
Adapted from David Pollock's
The Forward Algorithm: the Probability of a Sequence is the Sum of All Paths that Can Produce It
For O = GC there are 4 possible state sequences: CC, NC, CN, NN.
Reach state C at step 2 via CC or NC and emit C; reach state N via CN or NN and emit C.
Adapted from David Pollock's
The Forward Algorithm: the Probability of a Sequence is the Sum of All Paths that Can Produce It
O = GC: α_C(2) = (.3*.8 + .1*.1)*.3 = .075; α_N(2) = (.3*.2 + .1*.9)*.1 = .015
(4 possible state sequences: CC, NC, CN, NN)
Adapted from David Pollock's
The Forward Algorithm: the Probability of a Sequence is the Sum of All Paths that Can Produce It
G: α_C = .3, α_N = .1
C: α_C = (.3*.8 + .1*.1)*.3 = .075, α_N = (.3*.2 + .1*.9)*.1 = .015
For O = GCG there are 8 possible state sequences: CCC, CNC, NCC, NNC, CCN, CNN, NCN, NNN
Adapted from David Pollock's
The Forward Algorithm: the Probability of a Sequence is the Sum of All Paths that Can Produce It
G: α_C = .3, α_N = .1
C: α_C = (.3*.8 + .1*.1)*.3 = .075, α_N = (.3*.2 + .1*.9)*.1 = .015
At step 3, each state can be reached from C or from N and then emits G.
(8 possible state sequences: CCC, CNC, NCC, NNC, CCN, CNN, NCN, NNN)
Adapted from David Pollock's
The Forward Algorithm: the Probability of a Sequence is the Sum of All Paths that Can Produce It
G: α_C = .3, α_N = .1
C: α_C = (.3*.8 + .1*.1)*.3 = .075, α_N = (.3*.2 + .1*.9)*.1 = .015
G: α_C = (.075*.8 + .015*.1)*.3 = .0185, α_N = (.075*.2 + .015*.9)*.1 = .0029
(8 possible state sequences: CCC, CNC, NCC, NNC, CCN, CNN, NCN, NNN)
Adapted from David Pollock's
The Forward Algorithm: the Probability of a Sequence is the Sum of All Paths that Can Produce It
G: α_C = .3, α_N = .1
C: α_C = (.3*.8 + .1*.1)*.3 = .075, α_N = (.3*.2 + .1*.9)*.1 = .015
G: α_C = (.075*.8 + .015*.1)*.3 = .0185, α_N = (.075*.2 + .015*.9)*.1 = .0029
A: α_C = (.0185*.8 + .0029*.1)*.2 = .003, α_N = (.0185*.2 + .0029*.9)*.4 = .0025
A: α_C = (.003*.8 + .0025*.1)*.2 = .0005, α_N = (.003*.2 + .0025*.9)*.4 = .0011
Adapted from David Pollock's
The Forward Algorithm: the Probability of a Sequence is the Sum of All Paths that Can Produce It
G: α_C = .3, α_N = .1
C: α_C = (.3*.8 + .1*.1)*.3 = .075, α_N = (.3*.2 + .1*.9)*.1 = .015
G: α_C = (.075*.8 + .015*.1)*.3 = .0185, α_N = (.075*.2 + .015*.9)*.1 = .0029
A: α_C = (.0185*.8 + .0029*.1)*.2 = .003, α_N = (.0185*.2 + .0029*.9)*.4 = .0025
A: α_C = (.003*.8 + .0025*.1)*.2 = .0005, α_N = (.003*.2 + .0025*.9)*.4 = .0011
Problem 1: Pr(O | λ) = 0.5*.0005 + 0.5*.0011 = 8e-4
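The forward table above can be reproduced numerically (a sketch: state 0 = CpG, state 1 = Non-CpG; symbols indexed G, C, A, T). Carrying the 0.5 initial probabilities through from the start, rather than adding them at the end, gives the same answer:

```python
import numpy as np

pi = np.array([0.5, 0.5])                      # (CpG, Non-CpG)
A = np.array([[0.8, 0.2], [0.1, 0.9]])         # transition probabilities
B = np.array([[0.3, 0.3, 0.2, 0.2],            # CpG emissions:     G, C, A, T
              [0.1, 0.1, 0.4, 0.4]])           # Non-CpG emissions: G, C, A, T
sym = {'G': 0, 'C': 1, 'A': 2, 'T': 3}

alpha = pi * B[:, sym['G']]                    # t = 1
for ch in 'CGAA':                              # remaining symbols of GCGAA
    alpha = (alpha @ A) * B[:, sym[ch]]
p = alpha.sum()
print(round(p, 4))                             # -> 0.0008, matching the slide's 8e-4
```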
The Forward Algorithm: the Probability of a Sequence is the Sum of All Paths that Can Produce It
G: α_C = .3, α_N = .1
C: α_C = (.3*.8 + .1*.1)*.3 = .075, α_N = (.3*.2 + .1*.9)*.1 = .015
G: α_C = (.075*.8 + .015*.1)*.3 = .0185, α_N = (.075*.2 + .015*.9)*.1 = .0029
A: α_C = (.0185*.8 + .0029*.1)*.2 = .003, α_N = (.0185*.2 + .0029*.9)*.4 = .0025
A: α_C = (.003*.8 + .0025*.1)*.2 = .0005, α_N = (.003*.2 + .0025*.9)*.4 = .0011
Problem 2: What is the optimal state sequence?
The Forward Algorithm: the Probability of a Sequence is the Sum of All Paths that Can Produce It
G: α_C = .3, α_N = .1
C: α_C = (.3*.8 + .1*.1)*.3 = .075, α_N = (.3*.2 + .1*.9)*.1 = .015
G: α_C = (.075*.8 + .015*.1)*.3 = .0185, α_N = (.075*.2 + .015*.9)*.1 = .0029
A: α_C = (.0185*.8 + .0029*.1)*.2 = .003, α_N = (.0185*.2 + .0029*.9)*.4 = .0025
A: α_C = (.003*.8 + .0025*.1)*.2 = .0005, α_N = (.003*.2 + .0025*.9)*.4 = .0011
Each column gives the (joint) probability of the observations so far and being in state CpG or Non-CpG at step t
Adapted from David Pollock's
The Viterbi Algorithm: Most Likely Path (use max instead of sum)
G: δ_C = .3, δ_N = .1
C: (.075 and .015 from the forward algorithm) with max becomes:
δ_C = max(.3*.8, .1*.1)*.3 = .072, δ_N = max(.3*.2, .1*.9)*.1 = .009 (with the Viterbi algorithm)
Adapted from David Pollock's (note error in formulas on his)
The Viterbi Algorithm: Most Likely Path (use max instead of sum)
O = GCGAA
G: δ_C = .3, δ_N = .1
C: δ_C = max(.3*.8, .1*.1)*.3 = .072, δ_N = max(.3*.2, .1*.9)*.1 = .009
Adapted from David Pollock's (note error in formulas on his)
The Viterbi Algorithm: Most Likely Path (use max instead of sum)
G: δ_C = .3, δ_N = .1
C: δ_C = max(.3*.8, .1*.1)*.3 = .072, δ_N = max(.3*.2, .1*.9)*.1 = .009
G: δ_C = max(.072*.8, .009*.1)*.3 = .0173, δ_N = max(.072*.2, .009*.9)*.1 = .0014
A: δ_C = max(.0173*.8, .0014*.1)*.2 = .0028, δ_N = max(.0173*.2, .0014*.9)*.4 = .0014
A: δ_C = max(.0028*.8, .0014*.1)*.2 = .00044, δ_N = max(.0028*.2, .0014*.9)*.4 = .0005
Adapted from David Pollock's (note error in formulas on his)
The Viterbi Algorithm: Most Likely Path
G: δ_C = .3, δ_N = .1
C: δ_C = max(.3*.8, .1*.1)*.3 = .072, δ_N = max(.3*.2, .1*.9)*.1 = .009
G: δ_C = max(.072*.8, .009*.1)*.3 = .0173, δ_N = max(.072*.2, .009*.9)*.1 = .0014
A: δ_C = max(.0173*.8, .0014*.1)*.2 = .0028, δ_N = max(.0173*.2, .0014*.9)*.4 = .0014
A: δ_C = max(.0028*.8, .0014*.1)*.2 = .00044, δ_N = max(.0028*.2, .0014*.9)*.4 = .0005
What if we choose the max-probability state at each step? Ans: CCCCN. What is the problem with doing that?
Adapted from David Pollock's (note error in formulas on his)
Hint
(3-state example) Suppose that, in the same way, the most likely state at each step is 3, 2, 1, 1, 2, ... Nothing guarantees that consecutive pointwise-best states are connected by a nonzero transition probability, so the resulting sequence may not even be a valid path through the model.
The Viterbi Algorithm: Most Likely Path, via Backtracking
G: δ_C = .3, δ_N = .1
C: δ_C = max(.3*.8, .1*.1)*.3 = .072, δ_N = max(.3*.2, .1*.9)*.1 = .009
G: δ_C = max(.072*.8, .009*.1)*.3 = .0173, δ_N = max(.072*.2, .009*.9)*.1 = .0014
A: δ_C = max(.0173*.8, .0014*.1)*.2 = .0028, δ_N = max(.0173*.2, .0014*.9)*.4 = .0014
A: δ_C = max(.0028*.8, .0014*.1)*.2 = .00044, δ_N = max(.0028*.2, .0014*.9)*.4 = .0005
Start from the highest-scoring final state (N, .0005) and follow the stored argmax pointers backwards: N at t=5 came from N at t=4, which came from C at t=3, giving the path CCCNN
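A sketch of Viterbi with backtracking on the same GCGAA example (state 0 = CpG, state 1 = Non-CpG). It also answers the greedy question above: choosing the max-probability state at each step gives CCCCN, while backtracking the stored pointers recovers the globally best path CCCNN:

```python
import numpy as np

pi = np.array([0.5, 0.5])                       # (CpG, Non-CpG)
A = np.array([[0.8, 0.2], [0.1, 0.9]])
B = np.array([[0.3, 0.3, 0.2, 0.2],
              [0.1, 0.1, 0.4, 0.4]])            # rows: CpG, Non-CpG; cols: G, C, A, T
sym = {'G': 0, 'C': 1, 'A': 2, 'T': 3}
obs = [sym[c] for c in 'GCGAA']

delta = pi * B[:, obs[0]]          # delta_t(i): probability of the best path ending in state i
back = []                          # backpointers: best predecessor of each state
greedy = [int(delta.argmax())]     # per-step argmax, for comparison (NOT a valid decoding)
for o in obs[1:]:
    scores = delta[:, None] * A    # scores[i, j] = delta_t(i) * a_ij
    back.append(scores.argmax(axis=0))
    delta = scores.max(axis=0) * B[:, o]
    greedy.append(int(delta.argmax()))

state = int(delta.argmax())        # backtrack from the best final state
path = [state]
for ptr in reversed(back):
    state = int(ptr[state])
    path.append(state)
path.reverse()

print(''.join('CN'[s] for s in path))    # Viterbi path: CCCNN
print(''.join('CN'[s] for s in greedy))  # per-step maxima: CCCCN, not the best single path
```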
Forward-backward algorithm
(forward α table for GCGAA as computed above)
Problem 3: How do we learn the model?
The Forward algorithm calculated α_t(i) = Pr(O_1..t, X_t = i | λ)
How do you learn an HMM?
The iterative Baum-Welch algorithm is popular; it is equivalent to Expectation Maximization (EM):
Maximize: if the hidden variables (states) were known, maximize the model parameters with respect to that knowledge
Expectation: if the model parameters were known, find the expected values of the hidden variables (states)
Iterate between the two steps until the parameter estimates converge
Parameter estimation by Baum-Welch (Forward-Backward Algorithm)
Forward variable: α_t(i) = Pr(O_1..t, X_t = i | λ)
Backward variable: β_t(i) = Pr(O_t+1..T | X_t = i, λ)
Rabiner 1989
Parameter Estimation
Define two variables, ξ and γ.
Probability of transitioning at time t from state i to j, no matter the path:
ξ_t(i,j) = Pr(q_t = S_i, q_t+1 = S_j | O, λ) = α_t(i) a_ij b_j,Ot+1 β_t+1(j) / Σ_{i=1..N} Σ_{j=1..N} α_t(i) a_ij b_j,Ot+1 β_t+1(j)
Probability of being in state i at time t, no matter the path:
γ_t(i) = Pr(q_t = S_i | O, λ) = α_t(i) β_t(i) / Σ_{i=1..N} α_t(i) β_t(i)
Then the expected values for the parameters are:
π_i = γ_1(i)
a_ij = Σ_{t=1..T-1} ξ_t(i,j) / Σ_{t=1..T-1} γ_t(i)
b_jk = Σ_{t=1..T s.t. Ot=k} γ_t(j) / Σ_{t=1..T} γ_t(j)
Baum-Welch algorithm (equivalent to EM)
Given an initial assignment to the parameters λ = (π, a, b), compute ξ and γ from α and β
Generate a new estimate λ* = (π*, a*, b*) from:
π*_i = γ_1(i)
a*_ij = Σ_{t=1..T-1} ξ_t(i,j) / Σ_{t=1..T-1} γ_t(i)
b*_jk = Σ_{t=1..T s.t. Ot=k} γ_t(j) / Σ_{t=1..T} γ_t(j)
Set λ = λ* and repeat until convergence
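A compact sketch of one Baum-Welch re-estimation step (unscaled α and β, so only suitable for short sequences; the CpG model and the GCGAA sequence from earlier are reused as a toy input). The EM property to check is that λ* never decreases Pr(O | λ):

```python
import numpy as np

def forward(pi, A, B, obs):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    return beta

def baum_welch_step(pi, A, B, obs):
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    T, N = len(obs), len(pi)
    gamma = alpha * beta                       # gamma_t(i) = Pr(X_t = i | O, lambda)
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((T - 1, N, N))               # xi_t(i,j) = Pr(X_t = i, X_{t+1} = j | O, lambda)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t+1]] * beta[t+1])[None, :]
        xi[t] /= xi[t].sum()
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return pi_new, A_new, B_new

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.1, 0.9]])
B = np.array([[0.3, 0.3, 0.2, 0.2], [0.1, 0.1, 0.4, 0.4]])
obs = [0, 1, 0, 2, 2]                          # GCGAA with G=0, C=1, A=2, T=3
before = forward(pi, A, B, obs)[-1].sum()      # Pr(O | lambda)
pi, A, B = baum_welch_step(pi, A, B, obs)
after = forward(pi, A, B, obs)[-1].sum()       # Pr(O | lambda*)
assert after >= before - 1e-12                 # EM never decreases the likelihood
```

A real implementation would rescale α and β at each step (or work in log space) to avoid underflow on long sequences.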
Where are HMMs used in Computational Biology?
DNA: motif matching, gene matching, multiple sequence alignment
Amino acids: domain matching, fold recognition
Microarrays/whole-genome sequencing: assign copy number
ChIP-chip/seq: distinct chromatin states
Homologous Sequences
What is the consensus sequence? How can we recognize all of them? And how do we distinguish unlikely members?
Krogh 1998
Homologous Sequences (Krogh 1998)
Probability of Sequences
Learning Parameters of Comp-Bio HMMs
Built from pre-aligned (pre-labeled) sequences, so states have meaningful biological labels (like insertion); parameter estimation then just tabulates frequencies, as in the previous example. Note that longer sequences have lower probability, so parameters are often converted to log-odds (see Krogh 1998).
Built from unaligned/unlabelled sequences, where the semantics of states can (sometimes) be interpreted later; must use Baum-Welch or an equivalent for parameter estimation, as in the chromatin-state example shown later.
HMMs encode regular grammars, so they do a poor job on problems with long-range (complementary) correlations (e.g. RNA/protein secondary structure).
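A sketch of the "tabulate frequencies" case for pre-aligned sequences, using a hypothetical toy alignment. The +1 pseudocounts and the uniform 0.25 background are illustrative choices, in the spirit of the log-odds conversion described in Krogh 1998:

```python
import math

# Hypothetical toy alignment: each column corresponds to a match state.
alignment = ["ACGT",
             "ACGA",
             "TCGT"]
alphabet = "ACGT"
n_seqs = len(alignment)

emissions = []                      # per-column emission probabilities (match states)
for col in zip(*alignment):
    counts = {a: col.count(a) + 1 for a in alphabet}     # +1 pseudocount per symbol
    total = n_seqs + len(alphabet)
    emissions.append({a: c / total for a, c in counts.items()})

def log_odds(seq):
    """Log-odds score of a sequence against a uniform 0.25 background."""
    return sum(math.log2(emissions[i][a] / 0.25) for i, a in enumerate(seq))

# A sequence matching the alignment consensus scores higher than a mismatch.
print(log_odds("ACGT"), log_odds("GGGG"))
```

Using log-odds rather than raw probabilities also sidesteps the issue noted above: raw sequence probabilities shrink with length, while the background term normalizes the score.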
Homology HMM
Gene recognition; classify to identify distant homologs
Common ancestral sequence
Parameter set λ = (A, B, π), strict left-right model
Specially defined set of states: start, stop, match, insert, delete
For the initial state distribution π, use the start state
For the transition matrix A, use global transition probabilities
For the emission matrix B:
  Match: site-specific emission probabilities
  Insert (relative to ancestor): global emission probabilities
  Delete: emit nothing
Used for multiple sequence alignments
Adapted from David Pollock's
Homology HMM
start → match → match → ... → end, with self-looping insert states between matches and delete states that bypass matches
Adapted from David Pollock's
Homology HMM Example (emission probabilities for three match states):
    match1  match2  match3
A   .1      .04     .2
C   .05     .1      .01
D   .2      .01     .05
E   .08     .2      .1
F   .01     .02     .06
Profile HMM architectures:
Ungapped blocks, where insert states model the intervening sequence between blocks
Insert/delete states allowed anywhere
Allow multiple domains, sequence fragments
Eddy, 1998
Uses for the Homology HMM
Find homologs to a profile HMM in a database
  Score multiple sequences for a match to one HMM
  Not always Pr(O | λ), since some regions may be highly diverged; sometimes use the highest-scoring subsequence
  Goal is to find homologs in the database
Classify a sequence using a library of profile HMMs
  Compare 1 sequence to >1 alternate models, e.g. the Pfam and PROSITE motif databases
Alignment of additional sequences
Structural alignment, when the alphabet is secondary-structure symbols, enabling fold recognition, etc.
Adapted from David Pollock's
Variable Length and Composition of Protein Domains http://rnajournal.cshlp.org/content/12/12/2080.full
Why Hidden Markov Models for MSA?
Multiple sequence alignment as consensus
May have substitutions; not all amino acids are equal:
FOS_RAT    IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPSTGAYARAGVV 112
FOS_MOUSE  IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQSAGAYARAGMV 112
Could use regular expressions, but how to handle indels?
FOS_RAT    IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPS-TGAYARAGVV 112
FOS_MOUSE  IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQS-AGAYARAGMV 112
FOS_CHICK  VPTVTAISTSPDLQWLVQPTLISSVAPSQNRG-HPYGVPAPAPPAAYSRPAVL 112
What about variable-length members of the family?
FOS_RAT    IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTPS-TGAYARAGVV 112
FOS_MOUSE  IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTQS-AGAYARAGMV 112
FOS_CHICK  VPTVTAISTSPDLQWLVQPTLISSVAPSQ-------NRG-HPYGVPAPAPPAAYSRPAVL 112
FOSB_MOUSE VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTS----YSTPGLS 110
FOSB_HUMAN VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTS----YSTPGMS 110
Why Hidden Markov Models?
Rather than a consensus sequence, which describes only the most common amino acid per position, HMMs allow more than one amino acid to appear at each position.
Rather than profiles as position-specific scoring matrices (PSSMs), which assign a probability to each amino acid in each position of the domain and slide a fixed-length profile along a longer sequence to calculate a score, HMMs model the probability of variable-length sequences.
Rather than regular expressions, which can capture variable-length sequences yet specify only a limited subset of amino acids per position, HMMs quantify the difference among using different amino acids at each position.
Detecting Copy Number in Array CGH Data
A discrete number of copies is found by segmenting array intensities along the chromosome
HMM segmentation vs. naïve smoothing
http://www.cs.cmu.edu/~epxing/class/10810-05/lecture11.pdf
Detecting Copy Number in Whole Genome Sequencing Data
Compute the log ratio of observed coverage to expected coverage
Fit an HMM with states for 0-9 copies
Copy number is assigned to each region with the Viterbi algorithm
ABI Bioscope manual 2010
HMMs for Chromatin States
A specific amino acid of a specific histone protein, modified at a given level, can be tagged and assayed
e.g. H3K27me3 means 3 methyl groups have been added to the lysine at position 27 in histone 3
Rodenhiser & Mann, CMAJ 2006 174(3):341
Combination of Chromatin States
An HMM for the sequence from a single mark has states such as "has H3K27me3" vs. "no H3K27me3" (peak finding)
However, peaks for a single mark could still be distributed all across the genome; which ones are important?
Comparing across multiple signals identifies specific combinations which distinguish the important peaks in an individual signal (combinatorial patterns)
Barski et al, Cell 2007 129:823-837
Combination States
The learned model is optimized to Q=51 labels (a.k.a. states), where semantics are assigned post hoc based on prior biological knowledge, relation to gene models, gene expression data, and sequence conservation
Multivariate HMMs for Chromatin States
Ernst 2010 learned 51 distinct chromatin states, interpreted post hoc as promoter-associated, transcription-associated, active intergenic, large-scale repressed, and repeat-associated states.
http://www.nature.com/nbt/journal/v28/n8/pdf/nbt.1662.pdf
Hot Topic: Better than HMMs for Chromatin States: Dynamic Bayes Nets!
Allows specification of the min/max length of a feature, a way to count that down ("memory"), and a way to enforce or disallow certain transitions
Recall the HMM graphical model: hidden states X_t-1 → X_t, each X_t emitting O_t
Here, the hidden state of the model emits a sequence of observations for each of n chromatin/transcription-factor marks
Segway by Hoffman et al 2011
Hot Topic: Better than HMMs for Chromatin States
Segway by Hoffman et al 2011: specify Q=25 labels (a.k.a. states); the semantics of the learned states are assigned post hoc based on prior biological knowledge
Hot Topic: Better than HMMs for Chromatin States (Segway by Hoffman et al 2011)
Homology HMM Resources
Great tutorial (Krogh 1998) ** http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.7972&rep=rep1&type=pdf
WUSTL/Janelia (Eddy, Bioinformatics 1998 14(9):755) **
  Pfam: database of pre-computed HMM alignments for various proteins
  HMMER: program for building HMMs
UCSC (Haussler)
  SAM: alignment, secondary structure predictions, HMM parameters, etc.
Chromatin States
  Ernst et al, PMCID: PMC2919626
  Segway: http://noble.gs.washington.edu/proj/segway/manuscript/segway.pdf
Other David Pollock Slides 2009
Model Comparison
Based on Pr(D | θ, M)
For maximum likelihood, take Pmax(D | θ, M); usually ln Pmax(D | θ, M) to avoid numeric error
For heuristics, the score is log2 Pr(D | θ_fixed, M)
For Bayesian, calculate Pmax(θ, M | D) ∝ Pr(D | θ, M) · Pr(θ) · Pr(M), which uses prior information on the parameters, Pr(θ)
Adapted from David Pollock's
Parameters, θ
Types of parameters:
Amino acid distributions for positions (match states)
Global amino acid distributions for insert states
Order of match states
Transition probabilities
Phylogenetic tree topology and branch lengths
Hidden states (integrate or augment)
Wander the parameter space (search); maximize, or move according to posterior probability (Bayes)
Adapted from David Pollock's
Expectation Maximization (EM)
Classic algorithm to fit probabilistic model parameters with unobservable states
Two stages:
Maximize: if the hidden variables (states) are known, maximize the model parameters with respect to that knowledge
Expectation: if the model parameters are known, find the expected values of the hidden variables (states)
Works well even with, e.g., Bayesian methods, to find a near-equilibrium space
Adapted from David Pollock's
Homology HMM EM
Start with a heuristic MSA (e.g., ClustalW)
Maximize: match states are residues aligned in most sequences; amino acid frequencies are observed in columns
Expectation: realign all the sequences given the model
Repeat until convergence
Problems: local, not global, optimization; use procedures to check how it worked
Adapted from David Pollock's
Model Comparison
Determining significance depends on comparing two models (family vs. non-family)
Usually a null model, H0, and a test model, H1; the models are nested if H0 is a subset of H1
If not nested: Akaike Information Criterion (AIC) [similar to empirical Bayes] or Bayes Factor (BF) [but be careful]
Generating a null distribution of the statistic: Z-factor, bootstrapping, parametric bootstrapping, posterior predictive
Adapted from David Pollock's
Z Test Method
Database of known negative controls, e.g., non-homologous (NH) sequences
Assume NH scores ~ N(μ, σ), i.e., you are modeling known NH sequence scores as a normal distribution
Set an appropriate significance level for multiple comparisons (more below)
Problems: Is homology certain? Is it the appropriate null model? The normal distribution is often not a good approximation; parameter control is hard (e.g., the length distribution)
Adapted from David Pollock's
Bootstrapping and Parametric Models
Random sequences sampled from the same set of emission probability distributions; same length is easy
Bootstrapping is re-sampling columns
Parametric uses estimated frequencies; may include variance, tree, etc. More flexible, and can have a more complex null
Pseudocounts of global frequencies if data limit it
Insertions are relatively hard to model: what frequencies for insert states? Global?
Adapted from David Pollock's