Biological Sequences and Hidden Markov Models CPBS7711
1 Biological Sequences and Hidden Markov Models, CPBS7711, Sept 27, 2011. Sonia Leach, PhD, Assistant Professor, Center for Genes, Environment, and Health, National Jewish Health. Slides created from David Pollock's 2009 slides for 7711 and the current reading list from the CPBS7711 website.
2 Introduction: Despite their complex 3-D structure, biological molecules have a primary linear sequence (DNA, RNA, protein) or a linear sequence of features (CpG islands; models of exons, introns, regulatory regions, genes). Hidden Markov Models (HMMs) are probabilistic models for processes that transition through a discrete set of states, each emitting a symbol (a probabilistic finite state machine). HMMs exhibit the Markov property: the conditional probability distribution of future states of the process depends only upon the present state (memory-less; named for Andrey Markov). The linear sequence of molecules/features is modelled as a path through the states of the HMM, which emit the sequence of molecules/features. The actual state is hidden and is observed only through the output symbols.
3 Hidden Markov Model: a finite set of N states X, a finite set of M observations O, and a parameter set λ = (A, B, π). Initial state distribution: π_i = Pr(X_1 = i). Transition probability: a_ij = Pr(X_t = j | X_t-1 = i). Emission probability: b_ik = Pr(O_t = k | X_t = i). Example: N=3, M=2, π = (0.25, 0.55, 0.2), with transition matrix A and emission matrix B.
4 Hidden Markov Model (HMM): a finite set of N states X, a finite set of M observations O, and a parameter set λ = (A, B, π). Initial state distribution: π_i = Pr(X_1 = i). Transition probability: a_ij = Pr(X_t = j | X_t-1 = i). Emission probability: b_ik = Pr(O_t = k | X_t = i). As a graphical model, the hidden chain runs X_t-1 → X_t, and each hidden state X_t emits an observation O_t. Example: N=3, M=2, π = (0.25, 0.55, 0.2).
5 Probabilistic Graphical Models: a Markov Process (MP) is a fully observed chain X_t-1 → X_t. Adding partial observability gives the Hidden Markov Model (HMM): hidden states X_t-1 → X_t with observations O_t-1, O_t. Adding utility gives the Markov Decision Process (MDP): actions A_t, states X_t, utilities U_t. Adding both observability and utility gives the Partially Observable Markov Decision Process (POMDP): actions A_t, hidden states X_t, observations O_t, utilities U_t.
6 Three basic problems of HMMs: 1. Given the observation sequence O = O_1, O_2, ..., O_n, how do we compute Pr(O | λ)? 2. Given the observation sequence, how do we choose the corresponding state sequence X = X_1, X_2, ..., X_n that is optimal? 3. How do we adjust the model parameters λ to maximize Pr(O | λ)?
7 Example: π_i = Pr(X_1 = i), a_ij = Pr(X_t = j | X_t-1 = i), b_ik = Pr(O_t = k | X_t = i). N=3, M=2, π = (0.25, 0.55, 0.2), with matrices A and B. What is an observation sequence O? A state sequence X? What is Prob(O, X | λ)?
8 Example (continued): the probability of O is a sum over all state sequences: Pr(O | λ) = Σ_all X Pr(O | X, λ) Pr(X | λ) = Σ_all X π_x1 b_x1,o1 a_x1,x2 b_x2,o2 ... a_xT-1,xT b_xT,oT. What is the computational complexity of this sum?
9 Example (continued): Pr(O | λ) = Σ_all X π_x1 b_x1,o1 a_x1,x2 b_x2,o2 ... a_xT-1,xT b_xT,oT. At each t there are N states to reach, so there are N^T possible state sequences, with about 2T multiplications per sequence, i.e. O(2T·N^T) operations. So for 3 states, a length-10 sequence takes 2·10·3^10 = 1,180,980 operations, and length 20 takes about 1e11!
10 Example (continued): an efficient dynamic programming algorithm exists for this: the Forward algorithm (Baum and Welch), which runs in O(N^2 T).
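To make the complexity claim concrete, the following sketch (not from the slides) compares the brute-force sum over all N^T paths with the O(N^2·T) forward recursion; both must return the same Pr(O | λ). The slide gives N=3, M=2, and π = (0.25, 0.55, 0.2); the transition matrix A and emission matrix B below are assumed values for illustration, since the slide's matrices appear only as figures.

```python
import itertools

# Slide's setup: N=3 states, M=2 symbols, pi = (0.25, 0.55, 0.2).
# A and B are assumed (illustrative) values: the slide's matrices
# are shown only as figures, not reproduced in the text.
pi = [0.25, 0.55, 0.2]
A = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.2, 0.7]]   # A[i][j] = Pr(X_t = j | X_t-1 = i)
B = [[0.7, 0.3],
     [0.4, 0.6],
     [0.1, 0.9]]        # B[i][k] = Pr(O_t = k | X_t = i)

def brute_force(pi, A, B, obs):
    """Sum Pr(O, X | lambda) over all N^T state sequences: O(2T * N^T)."""
    N = len(pi)
    total = 0.0
    for path in itertools.product(range(N), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

def forward(pi, A, B, obs):
    """Forward algorithm: the same sum computed in O(N^2 * T)."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
    return sum(alpha)

obs = [0, 1, 1, 0, 1, 0]   # a length-6 observation sequence
p_slow = brute_force(pi, A, B, obs)
p_fast = forward(pi, A, B, obs)
```

For this length-6 sequence the brute force already enumerates 3^6 = 729 paths; at length 20 it would be about 3.5e9 paths, while the forward recursion stays at N^2·T multiplications per step.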
11 A Simple HMM for CpG Islands: in one state there is a much higher probability of emitting C or G. CpG state emissions: G .3, C .3, A .2, T .2. Non-CpG state emissions: G .1, C .1, A .4, T .4. From David Pollock's slides.
12 The Forward Algorithm: the probability of a sequence is the sum over all paths that can produce it. Model: CpG emits G .3, C .3, A .2, T .2; Non-CpG emits G .1, C .1, A .4, T .4; transitions: CpG→CpG 0.8, CpG→Non-CpG 0.2, Non-CpG→CpG 0.1, Non-CpG→Non-CpG 0.9. Assuming π = (0.5, 0.5), what is Pr(O=G | λ)? For O=G there are 2 possible state sequences: C (the CpG state) and N (the Non-CpG state). Adapted from David Pollock's slides.
13 The Forward Algorithm (continued): assuming π_X = 0.5 for each state, Pr(G | λ) = π_C b_CG + π_N b_NG = .5*.3 + .5*.1. For convenience, drop the 0.5s for now and add them back in later (so the number to the right of G in each box is the probability of emitting G in that state, i.e. b_XG). Adapted from David Pollock's slides.
14 The Forward Algorithm (continued): to extend to O=GC, each state at step 2 sums over its predecessors (paths CC or NC, then emit C; paths CN or NN, then emit C). For O=GC there are 4 possible state sequences: CC, NC, CN, NN. Adapted from David Pollock's slides.
15 The Forward Algorithm (continued): for O=GC, the C entry is (.3*.8 + .1*.1)*.3 = .075 and the N entry is (.3*.2 + .1*.9)*.1 = .015, covering the 4 possible state sequences CC, NC, CN, NN. Adapted from David Pollock's slides.
16 The Forward Algorithm (continued): carrying forward the step-2 values .075 (C) and .015 (N), extend to O=GCG. There are now 8 possible state sequences: CCC, CNC, NCC, NNC, CCN, CNN, NCN, NNN. Adapted from David Pollock's slides.
17 The Forward Algorithm (continued): each step-3 entry comes from C or from N and then emits G. For O=GCG there are 8 possible state sequences: CCC, CNC, NCC, NNC, CCN, CNN, NCN, NNN. Adapted from David Pollock's slides.
18 The Forward Algorithm (continued): for O=GCG, the step-3 entries are (.075*.8 + .015*.1)*.3 = .0185 for C and (.075*.2 + .015*.9)*.1 = .0029 for N. Adapted from David Pollock's slides.
19 The Forward Algorithm (continued): continuing with O=GCGAA, step 4 (A) gives (.0185*.8 + .0029*.1)*.2 = .003 for C and (.0185*.2 + .0029*.9)*.4 = .0025 for N; step 5 (A) gives (.003*.8 + .0025*.1)*.2 = .0005 for C and (.003*.2 + .0025*.9)*.4 = .0011 for N. Adapted from David Pollock's slides.
20 The Forward Algorithm (continued): the full trellis for O=GCGAA is: C row .3, .075, .0185, .003, .0005; N row .1, .015, .0029, .0025, .0011. Problem 1: Pr(O | λ) = 0.5*(.0005 + .0011) = 8e-4 (restoring the 0.5 initial probabilities).
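The worked example above can be reproduced in a few lines. This sketch (an illustration, not code from the slides) runs the forward recursion on O = GCGAA with the CpG model from the slides, keeping the 0.5 initial probabilities throughout, and recovers the answer of about 8e-4:

```python
# CpG-island HMM from the slides: state 0 = CpG, state 1 = Non-CpG
pi = [0.5, 0.5]
A = [[0.8, 0.2],          # CpG -> CpG .8, CpG -> Non-CpG .2
     [0.1, 0.9]]          # Non-CpG -> CpG .1, Non-CpG -> Non-CpG .9
B = [{'G': 0.3, 'C': 0.3, 'A': 0.2, 'T': 0.2},   # CpG emissions
     {'G': 0.1, 'C': 0.1, 'A': 0.4, 'T': 0.4}]   # Non-CpG emissions

obs = "GCGAA"
# alpha[i] = Pr(O_1..t, X_t = i | lambda), with the 0.5 priors kept in
alpha = [pi[i] * B[i][obs[0]] for i in range(2)]
for sym in obs[1:]:
    alpha = [sum(alpha[i] * A[i][j] for i in range(2)) * B[j][sym]
             for j in range(2)]

p = sum(alpha)   # Pr(O | lambda) = 0.00083646, i.e. about 8e-4
```

Because the 0.5 priors are included from the start, every trellis entry is half the slide's value, and the final sum directly equals the slide's answer without a separate correction step.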
21 The Forward Algorithm (continued): with the same trellis (C row .3, .075, .0185, .003, .0005; N row .1, .015, .0029, .0025, .0011), Problem 2: what is the optimal state sequence?
22 The Forward Algorithm (continued): the trellis entries give the probability of being in state CpG or Non-CpG at step i (jointly with the observations so far). Adapted from David Pollock's slides.
23 The Viterbi Algorithm, Most Likely Path (use max instead of sum): the forward entries (.3*.8 + .1*.1)*.3 = .075 and (.3*.2 + .1*.9)*.1 = .015 become, with max, max(.3*.8, .1*.1)*.3 = .072 and max(.3*.2, .1*.9)*.1 = .009 in the Viterbi algorithm. Adapted from David Pollock's slides (note: corrects an error in the formulas on his).
24 The Viterbi Algorithm, Most Likely Path (use max instead of sum): for O=GCGAA, step 2 gives max(.3*.8, .1*.1)*.3 = .072 for C and max(.3*.2, .1*.9)*.1 = .009 for N. Adapted from David Pollock's slides.
25 The Viterbi Algorithm (continued): step 3 gives max(.072*.8, .009*.1)*.3 = .0173 (C) and max(.072*.2, .009*.9)*.1 = .0014 (N); step 4 gives max(.0173*.8, .0014*.1)*.2 = .0028 (C) and max(.0173*.2, .0014*.9)*.4 = .0014 (N); step 5 gives max(.0028*.8, .0014*.1)*.2 ≈ .0004 (C) and max(.0028*.2, .0014*.9)*.4 ≈ .0005 (N). Adapted from David Pollock's slides.
26 The Viterbi Algorithm, Most Likely Path: the full trellis for O=GCGAA is: C row .3, .072, .0173, .0028, .0004; N row .1, .009, .0014, .0014, .0005. What if we chose the max-probability state at each step independently? Answer: CCCCN. What is the problem with doing that? Adapted from David Pollock's slides.
27 Hint: suppose that, chosen in the same way, the most likely state at each step is ...
28 The Viterbi Algorithm, Most Likely Path: Backtracking. Trellis for O=GCGAA: C row .3, .072, .0173, .0028, .0004; N row .1, .009, .0014, .0014, .0005.
29 The Viterbi Algorithm, Backtracking (continued): start from the higher-scoring final state (N, .0005).
30 The Viterbi Algorithm, Backtracking (continued): follow the stored argmax pointers back through the trellis to recover the full most-likely path, CCCNN. Adapted from David Pollock's slides.
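The backtracking procedure can be sketched as follows (an illustration, not the slides' code): run the Viterbi recursion on O = GCGAA with the CpG model, store an argmax pointer at each step, and trace back from the best final state. It also computes the greedy per-step answer for comparison: the traceback yields CCCNN, while independently picking the highest-scoring state at each step yields CCCCN, because the per-step maxima need not form a consistent high-probability path.

```python
# CpG-island HMM from the slides: state 0 = CpG ('C'), state 1 = Non-CpG ('N')
pi = [0.5, 0.5]
A = [[0.8, 0.2], [0.1, 0.9]]
B = [{'G': 0.3, 'C': 0.3, 'A': 0.2, 'T': 0.2},
     {'G': 0.1, 'C': 0.1, 'A': 0.4, 'T': 0.4}]
names = "CN"
obs = "GCGAA"

# Viterbi: like forward, but max over predecessors, keeping argmax pointers
trellis = [[pi[i] * B[i][obs[0]] for i in range(2)]]
ptr = []
for sym in obs[1:]:
    prev, row, back = trellis[-1], [], []
    for j in range(2):
        i_best = max(range(2), key=lambda i: prev[i] * A[i][j])
        back.append(i_best)
        row.append(prev[i_best] * A[i_best][j] * B[j][sym])
    trellis.append(row)
    ptr.append(back)

# Backtrack from the best final state
states = [max(range(2), key=trellis[-1].__getitem__)]
for back in reversed(ptr):
    states.append(back[states[-1]])
states.reverse()
viterbi_path = "".join(names[s] for s in states)       # "CCCNN"

# Greedy alternative: independently take the best state at each step
greedy_path = "".join(names[max(range(2), key=row.__getitem__)]
                      for row in trellis)              # "CCCCN"
```

Note where they disagree: at step 4 the CpG state scores higher on its own, but it is not on the best path into the final Non-CpG state, which is exactly the problem the slide's question is pointing at.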
31 Forward-backward algorithm: Problem 3: how do we learn the model? Recall the forward trellis for O=GCGAA (C row .3, .075, .0185, .003, .0005; N row .1, .015, .0029, .0025, .0011); the forward algorithm calculated α_t(i) = Pr(O_1..t, X_t = i | λ).
32 How do you learn an HMM? The iterative Baum-Welch algorithm is popular; it is equivalent to Expectation Maximization (EM). Maximize: if the hidden variables (states) were known, maximize the model parameters with respect to that knowledge. Expectation: if the model parameters are known, find the expected values of the hidden variables (states). Iterate between the two steps until the parameter estimates converge.
33 Parameter estimation by Baum-Welch (the Forward-Backward algorithm): forward variable α_t(i) = Pr(O_1..t, X_t = i | λ); backward variable β_t(i) = Pr(O_t+1..n | X_t = i, λ). Rabiner 1989.
34 Parameter Estimation: define two variables, ξ and γ. The probability of transitioning at time t from state i to state j, no matter the path: ξ_t(i,j) = Pr(q_t = S_i, q_t+1 = S_j | O, λ) = α_t(i) a_ij b_j,Ot+1 β_t+1(j) / Σ_i=1..N Σ_j=1..N α_t(i) a_ij b_j,Ot+1 β_t+1(j). The probability of being in state i at time t, no matter the path: γ_t(i) = Pr(q_t = S_i | O, λ) = α_t(i) β_t(i) / Σ_i=1..N α_t(i) β_t(i). Then the expected values for the parameters are: π_i = γ_1(i); a_ij = Σ_t=1..T-1 ξ_t(i,j) / Σ_t=1..T-1 γ_t(i); b_jk = Σ_t=1..T s.t. Ot=k γ_t(j) / Σ_t=1..T γ_t(j).
35 Baum-Welch algorithm (equivalent to EM): given an initial assignment to the parameters λ = (π, a, b), compute ξ and γ from α and β. Generate a new estimate λ* = (π*, a*, b*) from π*_i = γ_1(i); a*_ij = Σ_t=1..T-1 ξ_t(i,j) / Σ_t=1..T-1 γ_t(i); b*_jk = Σ_t=1..T s.t. Ot=k γ_t(j) / Σ_t=1..T γ_t(j). Set λ = λ* and repeat until convergence.
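One Baum-Welch iteration can be sketched directly from these re-estimation formulas (an illustrative implementation, not code from the course): compute α and β, form γ and ξ, and re-estimate (π, a, b). A single EM step is guaranteed not to decrease Pr(O | λ). The CpG model and the sequence GCGAA from the earlier example are reused here, with symbols encoded as integers.

```python
def forward_trellis(pi, A, B, obs):
    """alpha[t][i] = Pr(O_1..t, X_t = i | lambda)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N))
                      * B[j][obs[t]] for j in range(N)])
    return alpha

def backward_trellis(A, B, obs, N):
    """beta[t][i] = Pr(O_t+1..T | X_t = i, lambda)."""
    T = len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                       for j in range(N)) for i in range(N)]
    return beta

def baum_welch_step(pi, A, B, obs, M):
    """One EM re-estimation of (pi, a, b) from gamma and xi."""
    N, T = len(pi), len(obs)
    alpha = forward_trellis(pi, A, B, obs)
    beta = backward_trellis(A, B, obs, N)
    p_obs = sum(alpha[T - 1])
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    pi_new = list(gamma[0])
    A_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    B_new = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T)) for k in range(M)]
             for j in range(N)]
    return pi_new, A_new, B_new

# CpG model; symbols G=0, C=1, A=2, T=3; O = GCGAA
pi = [0.5, 0.5]
A = [[0.8, 0.2], [0.1, 0.9]]
B = [[0.3, 0.3, 0.2, 0.2], [0.1, 0.1, 0.4, 0.4]]
obs = [0, 1, 0, 2, 2]
p_before = sum(forward_trellis(pi, A, B, obs)[-1])
pi2, A2, B2 = baum_welch_step(pi, A, B, obs, M=4)
p_after = sum(forward_trellis(pi2, A2, B2, obs)[-1])   # never smaller
```

In practice this single step is looped until the likelihood (or the parameters) stop changing, and real implementations work with scaled or log-space α and β to avoid underflow on long sequences.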
36 Where are HMMs used in Computational Biology? DNA: motif matching, gene matching, multiple sequence alignment. Amino acids: domain matching, fold recognition. Microarrays/whole genome sequencing: assigning copy number. ChIP-chip/ChIP-seq: distinguishing chromatin states.
37 Homologous Sequences: what is the consensus sequence? How can we recognize all of them? And how can we distinguish unlikely members? Krogh 1998.
38 Homologous Sequences Krogh 1998 Center for Genes, Environment, and Health 38
39 Probability of Sequences Center for Genes, Environment, and Health 39
40 Learning Parameters of Compbio HMMs: when built from pre-aligned (pre-labeled) sequences, states have meaningful biological labels (such as an insertion position), and parameter estimation just tabulates frequencies, as in the previous example. Note that longer sequences have lower probability, so scores are often converted to log-odds parameters (see Krogh 1998). When built from unaligned/unlabelled sequences, the semantics of states can (sometimes) be interpreted later, and Baum-Welch or an equivalent must be used for parameter estimation, as in the chromatin state example shown later. HMMs encode regular grammars, so they do a poor job on problems with long-range (complementary) correlations (e.g. RNA/protein secondary structure).
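The log-odds remark points at a practical issue: for long sequences the forward probabilities underflow ordinary floating point, so real tools compute in log space. A minimal sketch (an illustration, not the slides' method), reusing the CpG model from the earlier example and replacing sums of probabilities with the log-sum-exp trick:

```python
import math

# CpG-island model from the earlier example, stored as log-probabilities
log = math.log
LPI = [log(0.5), log(0.5)]
LA = [[log(0.8), log(0.2)], [log(0.1), log(0.9)]]
LB = [{'G': log(0.3), 'C': log(0.3), 'A': log(0.2), 'T': log(0.2)},
      {'G': log(0.1), 'C': log(0.1), 'A': log(0.4), 'T': log(0.4)}]

def logsumexp(xs):
    """log(sum(exp(x) for x in xs)), computed without underflow."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_loglik(obs):
    """Forward algorithm in log space: returns log Pr(O | lambda)."""
    la = [LPI[i] + LB[i][obs[0]] for i in range(2)]
    for sym in obs[1:]:
        la = [logsumexp([la[i] + LA[i][j] for i in range(2)]) + LB[j][sym]
              for j in range(2)]
    return logsumexp(la)

ll_short = forward_loglik("GCGAA")        # exp(ll_short) recovers ~8e-4
ll_long = forward_loglik("GCGA" * 1000)   # fine in log space; the plain
                                          # product would underflow to 0
```

The length-4000 sequence has a probability far below the smallest positive double, yet its log-likelihood is an unremarkable finite number, which is why profile-HMM scores are reported in log (or log-odds) units.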
41 Homology HMM: gene recognition; classify sequences to identify distant homologs of a common ancestral sequence. Parameter set λ = (A, B, π), strict left-right model. Specially defined set of states: start, stop, match, insert, delete. For the initial state distribution π, use the start state. For the transition matrix A, use global transition probabilities. For the emission matrix B: match states use site-specific emission probabilities; insert states (relative to the ancestor) use global emission probabilities; delete states emit nothing. Built from multiple sequence alignments. Adapted from David Pollock's slides.
42 Homology HMM architecture: a start state leads into a linear chain of match states ending at an end state; each position also has a self-looping insert state above it and a silent delete state below it, so sequences can expand or skip positions. Adapted from David Pollock's slides.
43 Homology HMM Example: three match states with site-specific emission probabilities. Match 1: A .1, C .05, D .2, E .08, F .01. Match 2: A .04, C .1, D .01, E .2, F .02. Match 3: A .2, C .01, D .05, E .1, F .06.
44 Profile HMM architectures: ungapped blocks, where insertion states model the intervening sequence between blocks; insert/delete states allowed anywhere; and architectures that allow multiple domains and sequence fragments. Eddy 1998.
45 Uses for a Homology HMM: find homologs to a profile HMM in a database, i.e. score multiple sequences for a match to one HMM. This is not always Pr(O | λ), since some regions may be highly diverged; sometimes the highest-scoring subsequence is used. Classify a sequence using a library of profile HMMs, i.e. compare one sequence to more than one alternative model (e.g. the Pfam and PROSITE motif databases). Align additional sequences. Structural alignment, when the alphabet is secondary-structure symbols, enables fold recognition, etc. Adapted from David Pollock's slides.
46 Variable Length and Composition of Protein Domains Center for Genes, Environment, and Health 46
47 Why Hidden Markov Models for MSA? Multiple sequence alignment as consensus: there may be substitutions, and not all amino acids are equal. FOS_RAT IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPSTGAYARAGVV 112 FOS_MOUSE IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQSAGAYARAGMV 112. We could use regular expressions, but how do we handle indels? FOS_RAT IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPS-TGAYARAGVV 112 FOS_MOUSE IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQS-AGAYARAGMV 112 FOS_CHICK VPTVTAISTSPDLQWLVQPTLISSVAPSQNRG-HPYGVPAPAPPAAYSRPAVL 112. What about variable-length members of the family? FOS_RAT IPTVTAISTSPDLQWLVQPTLVSSVAPSQ TRAPHPYGLPTPS-TGAYARAGVV 112 FOS_MOUSE IPTVTAISTSPDLQWLVQPTLVSSVAPSQ TRAPHPYGLPTQS-AGAYARAGMV 112 FOS_CHICK VPTVTAISTSPDLQWLVQPTLISSVAPSQ NRG-HPYGVPAPAPPAAYSRPAVL 112 FOSB_MOUSE VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTS----YSTPGLS 110 FOSB_HUMAN VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTS----YSTPGMS 110
48 Why Hidden Markov Models? Rather than a consensus sequence, which describes only the most common amino acid per position, HMMs allow more than one amino acid to appear at each position. Rather than profiles as position-specific scoring matrices (PSSMs), which assign a probability to each amino acid in each position of the domain and slide a fixed-length profile along a longer sequence to calculate a score, HMMs model the probability of variable-length sequences. Rather than regular expressions, which can capture variable-length sequences yet specify only a limited subset of amino acids per position, HMMs quantify the difference among using different amino acids at each position.
49 Detecting Copy Number in Array CGH Data: a discrete number of copies is found by segmenting array intensities along the chromosome; HMM segmentation is compared against naïve smoothing.
50 Detecting Copy Number in Whole Genome Sequencing Data (ABI Bioscope manual, 2010): compute the log ratio of observed coverage to expected coverage, fit an HMM with states for 0-9 copies, and assign the copy number of each region with the Viterbi algorithm.
51 HMMs for Chromatin States: a specific amino acid of a specific histone protein, modified at a given level, can be tagged and assayed. Example: H3K27me3 means 3 methyl groups have been added to the lysine at position 27 of histone 3. Rodenhiser & Mann, CMAJ (3):341.
52 Combination of Chromatin States: an HMM for the sequence from a single mark has states such as "has H3K27me3" or "no H3K27me3" (peak finding). However, peaks for a single mark could still be distributed all across the genome; which ones are important? Comparing across multiple signals identifies specific combinations which distinguish the important peaks in an individual signal (combinatorial patterns). Barski et al., Cell.
53 Combination States: the model was learned and optimized to Q=51 labels (a.k.a. states), where semantics were assigned post hoc based on prior biological knowledge, relation to gene models, gene expression data, and sequence conservation.
54 Multivariate HMMs for Chromatin States: Ernst 2010 learned 51 distinct chromatin states, interpreted post hoc as promoter-associated, transcription-associated, active intergenic, large-scale repressed, and repeat-associated states.
55 Hot Topic, Better than HMMs for Chromatin States: Dynamic Bayes Nets! These allow specification of the min/max length of a feature, a way to count that length down ("memory"), and a way to enforce or disallow certain transitions. Recall the HMM structure: hidden states X_t-1 → X_t with emissions O_t-1, O_t; here the hidden state of the model emits a sequence of observations for each of n chromatin/transcription-factor marks. Segway, by Hoffman et al. 2011.
56 Hot Topic, Better than HMMs for Chromatin States: Segway, by Hoffman et al. 2011, specifies Q=25 labels (a.k.a. states); the semantics of the learned states are assigned post hoc based on prior biological knowledge.
57 Hot Topic, Better than HMMs for Chromatin States: Segway, by Hoffman et al. 2011.
58 Resources. Homology HMMs: a great tutorial (Krogh 1998); WUSTL/Janelia (Eddy, Bioinformatics (9):755); Pfam: a database of pre-computed HMM alignments for various proteins; HMMer: a program for building HMMs; UCSC (Haussler) SAM: alignment, secondary structure predictions, HMM parameters, etc. Chromatin states: Ernst et al. (PMCID: PMC); Segway.
59 Center for Genes, Environment, and Health 59
60 Other David Pollock Slides 2009 Center for Genes, Environment, and Health 60
61 Model Comparison: based on P(D | θ, M). For maximum likelihood (ML), take P_max(D | θ, M), usually ln P_max(D | θ, M) to avoid numeric error. For heuristics, the score is log2 P(D | θ_fixed, M). For Bayesian comparison, calculate P(θ, M | D) ∝ P(D | θ, M) · P(θ) · P(M), which uses prior information on the parameters P(θ). Adapted from David Pollock's slides.
62 Parameters θ, types of parameters: amino acid distributions for positions (match states); global amino acid distributions for insert states; order of match states; transition probabilities; phylogenetic tree topology and branch lengths; hidden states (integrate or augment). To fit them, wander the parameter space (search) and either maximize, or move according to the posterior probability (Bayes). Adapted from David Pollock's slides.
63 Expectation Maximization (EM): a classic algorithm to fit probabilistic model parameters with unobservable states. Two stages. Maximize: if the hidden variables (states) were known, maximize the model parameters with respect to that knowledge. Expectation: if the model parameters are known, find the expected values of the hidden variables (states). Works well, even with e.g. Bayesian approaches, to find a near-equilibrium space. Adapted from David Pollock's slides.
64 Homology HMM EM: start with a heuristic MSA (e.g., ClustalW). Maximize: match states are residues aligned in most sequences; amino acid frequencies are those observed in the columns. Expectation: realign all the sequences given the model. Repeat until convergence. Problems: this is local, not global, optimization, so use procedures to check how well it worked. Adapted from David Pollock's slides.
65 Model Comparison: determining significance depends on comparing two models (family vs. non-family), usually a null model H_0 and a test model H_1. The models are nested if H_0 is a subset of H_1. If not nested: the Akaike Information Criterion (AIC) [similar to empirical Bayes] or the Bayes Factor (BF) [but be careful]. Generating a null distribution of the statistic: Z-factor, bootstrapping, parametric bootstrapping, posterior predictive simulation. Adapted from David Pollock's slides.
66 Z Test Method: use a database of known negative controls, e.g., non-homologous (NH) sequences, and assume NH scores ~ N(μ, σ²), i.e., model known NH sequence scores as a normal distribution. Set an appropriate significance level for multiple comparisons (more below). Problems: Is homology certain? Is it the appropriate null model? The normal distribution is often not a good approximation, and parameter control is hard: e.g., the length distribution. Adapted from David Pollock's slides.
67 Bootstrapping and Parametric Models: random sequences are sampled from the same set of emission probability distributions; matching the length is easy. Bootstrapping is re-sampling columns. Parametric models use estimated frequencies and may include variance, a tree, etc.; they are more flexible and can express a more complex null. Use pseudocounts of global frequencies if data are limited. Insertions are relatively hard to model: what frequencies for insert states? Global? Adapted from David Pollock's slides.
68 Center for Genes, Environment, and Health 68
I529: Machine Learning in Bioinformatics (Spring 2017) HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington
More informationMarkov Models & DNA Sequence Evolution
7.91 / 7.36 / BE.490 Lecture #5 Mar. 9, 2004 Markov Models & DNA Sequence Evolution Chris Burge Review of Markov & HMM Models for DNA Markov Models for splice sites Hidden Markov Models - looking under
More informationAn Introduction to Sequence Similarity ( Homology ) Searching
An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,
More informationHidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from:
Hidden Markov Models Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from: www.ioalgorithms.info Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm
More informationHidden Markov Models for biological sequence analysis
Hidden Markov Models for biological sequence analysis Master in Bioinformatics UPF 2017-2018 http://comprna.upf.edu/courses/master_agb/ Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA
More informationLecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010
Hidden Lecture 4: Hidden : An Introduction to Dynamic Decision Making November 11, 2010 Special Meeting 1/26 Markov Model Hidden When a dynamical system is probabilistic it may be determined by the transition
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationAn Introduction to Bioinformatics Algorithms Hidden Markov Models
Hidden Markov Models Hidden Markov Models Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training
More informationHidden Markov Models for biological sequence analysis I
Hidden Markov Models for biological sequence analysis I Master in Bioinformatics UPF 2014-2015 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Example: CpG Islands
More informationRNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"
RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology" Day 1" Many biologically interesting roles for RNA" RNA secondary structure prediction" 3 4 Approaches to Structure
More informationHidden Markov Models Hamid R. Rabiee
Hidden Markov Models Hamid R. Rabiee 1 Hidden Markov Models (HMMs) In the previous slides, we have seen that in many cases the underlying behavior of nature could be modeled as a Markov process. However
More informationorder is number of previous outputs
Markov Models Lecture : Markov and Hidden Markov Models PSfrag Use past replacements as state. Next output depends on previous output(s): y t = f[y t, y t,...] order is number of previous outputs y t y
More informationGiri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748
CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 2/14/07 CAP5510 1 CpG Islands Regions in DNA sequences with increased
More informationINTEGRATING EPIGENETIC PRIORS FOR IMPROVING COMPUTATIONAL IDENTIFICATION OF TRANSCRIPTION FACTOR BINDING SITES AFFAN SHOUKAT
INTEGRATING EPIGENETIC PRIORS FOR IMPROVING COMPUTATIONAL IDENTIFICATION OF TRANSCRIPTION FACTOR BINDING SITES AFFAN SHOUKAT A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILMENT
More informationComputational Genomics. Systems biology. Putting it together: Data integration using graphical models
02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput
More informationBioinformatics 1--lectures 15, 16. Markov chains Hidden Markov models Profile HMMs
Bioinformatics 1--lectures 15, 16 Markov chains Hidden Markov models Profile HMMs target sequence database input to database search results are sequence family pseudocounts or background-weighted pseudocounts
More informationMarkov Chains and Hidden Markov Models. COMP 571 Luay Nakhleh, Rice University
Markov Chains and Hidden Markov Models COMP 571 Luay Nakhleh, Rice University Markov Chains and Hidden Markov Models Modeling the statistical properties of biological sequences and distinguishing regions
More informationBMI/CS 576 Fall 2016 Final Exam
BMI/CS 576 all 2016 inal Exam Prof. Colin Dewey Saturday, December 17th, 2016 10:05am-12:05pm Name: KEY Write your answers on these pages and show your work. You may use the back sides of pages as necessary.
More informationBioinformatics 2 - Lecture 4
Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More informationExample: The Dishonest Casino. Hidden Markov Models. Question # 1 Evaluation. The dishonest casino model. Question # 3 Learning. Question # 2 Decoding
Example: The Dishonest Casino Hidden Markov Models Durbin and Eddy, chapter 3 Game:. You bet $. You roll 3. Casino player rolls 4. Highest number wins $ The casino has two dice: Fair die P() = P() = P(3)
More informationHidden Markov Models. Introduction to. Model Fitting. Hagit Shatkay, Celera. Data. Model. The Many Facets of HMMs... Tübingen, Sept.
Introduction to Hidden Markov Models Hagit Shatkay, Celera Tübingen, Sept. 2002 Model Fitting Data Model 2 The Many Facets of HMMs... @#$% Found no match for your criteria. Speech Recognition DNA/Protein
More informationLecture 12: Algorithms for HMMs
Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 26 February 2018 Recap: tagging POS tagging is a sequence labelling task.
More informationRecall: Modeling Time Series. CSE 586, Spring 2015 Computer Vision II. Hidden Markov Model and Kalman Filter. Modeling Time Series
Recall: Modeling Time Series CSE 586, Spring 2015 Computer Vision II Hidden Markov Model and Kalman Filter State-Space Model: You have a Markov chain of latent (unobserved) states Each state generates
More informationUsing Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics
Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu
More informationHidden Markov Models and Gaussian Mixture Models
Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian
More informationLecture 12: Algorithms for HMMs
Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 17 October 2016 updated 9 September 2017 Recap: tagging POS tagging is a
More informationLecture 7 Sequence analysis. Hidden Markov Models
Lecture 7 Sequence analysis. Hidden Markov Models Nicolas Lartillot may 2012 Nicolas Lartillot (Universite de Montréal) BIN6009 may 2012 1 / 60 1 Motivation 2 Examples of Hidden Markov models 3 Hidden
More informationHidden Markov Models. x 1 x 2 x 3 x K
Hidden Markov Models 1 1 1 1 2 2 2 2 K K K K x 1 x 2 x 3 x K HiSeq X & NextSeq Viterbi, Forward, Backward VITERBI FORWARD BACKWARD Initialization: V 0 (0) = 1 V k (0) = 0, for all k > 0 Initialization:
More informationGibbs Sampling Methods for Multiple Sequence Alignment
Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical
More informationLecture 3: Markov chains.
1 BIOINFORMATIK II PROBABILITY & STATISTICS Summer semester 2008 The University of Zürich and ETH Zürich Lecture 3: Markov chains. Prof. Andrew Barbour Dr. Nicolas Pétrélis Adapted from a course by Dr.
More informationHidden Markov Models (HMMs) November 14, 2017
Hidden Markov Models (HMMs) November 14, 2017 inferring a hidden truth 1) You hear a static-filled radio transmission. how can you determine what did the sender intended to say? 2) You know that genes
More informationRobert Collins CSE586 CSE 586, Spring 2015 Computer Vision II
CSE 586, Spring 2015 Computer Vision II Hidden Markov Model and Kalman Filter Recall: Modeling Time Series State-Space Model: You have a Markov chain of latent (unobserved) states Each state generates
More informationComputational Genomics and Molecular Biology, Fall
Computational Genomics and Molecular Biology, Fall 2011 1 HMM Lecture Notes Dannie Durand and Rose Hoberman October 11th 1 Hidden Markov Models In the last few lectures, we have focussed on three problems
More informationAlgorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment
Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot
More informationHidden Markov Models. Ron Shamir, CG 08
Hidden Markov Models 1 Dr Richard Durbin is a graduate in mathematics from Cambridge University and one of the founder members of the Sanger Institute. He has also held carried out research at the Laboratory
More informationStatistical Methods for NLP
Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured
More informationPlan for today. ! Part 1: (Hidden) Markov models. ! Part 2: String matching and read mapping
Plan for today! Part 1: (Hidden) Markov models! Part 2: String matching and read mapping! 2.1 Exact algorithms! 2.2 Heuristic methods for approximate search (Hidden) Markov models Why consider probabilistics
More informationLearning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling
Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 009 Mark Craven craven@biostat.wisc.edu Sequence Motifs what is a sequence
More informationLecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008
Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008 1 Sequence Motifs what is a sequence motif? a sequence pattern of biological significance typically
More informationSequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them
HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated
More informationHidden Markov Models. x 1 x 2 x 3 x K
Hidden Markov Models 1 1 1 1 2 2 2 2 K K K K x 1 x 2 x 3 x K Viterbi, Forward, Backward VITERBI FORWARD BACKWARD Initialization: V 0 (0) = 1 V k (0) = 0, for all k > 0 Initialization: f 0 (0) = 1 f k (0)
More informationDynamic Approaches: The Hidden Markov Model
Dynamic Approaches: The Hidden Markov Model Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Inference as Message
More informationAdvanced Data Science
Advanced Data Science Dr. Kira Radinsky Slides Adapted from Tom M. Mitchell Agenda Topics Covered: Time series data Markov Models Hidden Markov Models Dynamic Bayes Nets Additional Reading: Bishop: Chapter
More informationOutline of Today s Lecture
University of Washington Department of Electrical Engineering Computer Speech Processing EE516 Winter 2005 Jeff A. Bilmes Lecture 12 Slides Feb 23 rd, 2005 Outline of Today s
More informationHMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder
HMM applications Applications of HMMs Gene finding Pairwise alignment (pair HMMs) Characterizing protein families (profile HMMs) Predicting membrane proteins, and membrane protein topology Gene finding
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationHidden Markov Modelling
Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationAssignments for lecture Bioinformatics III WS 03/04. Assignment 5, return until Dec 16, 2003, 11 am. Your name: Matrikelnummer: Fachrichtung:
Assignments for lecture Bioinformatics III WS 03/04 Assignment 5, return until Dec 16, 2003, 11 am Your name: Matrikelnummer: Fachrichtung: Please direct questions to: Jörg Niggemann, tel. 302-64167, email:
More informationMultiscale Systems Engineering Research Group
Hidden Markov Model Prof. Yan Wang Woodruff School of Mechanical Engineering Georgia Institute of echnology Atlanta, GA 30332, U.S.A. yan.wang@me.gatech.edu Learning Objectives o familiarize the hidden
More informationMultiple Sequence Alignment using Profile HMM
Multiple Sequence Alignment using Profile HMM. based on Chapter 5 and Section 6.5 from Biological Sequence Analysis by R. Durbin et al., 1998 Acknowledgements: M.Sc. students Beatrice Miron, Oana Răţoi,
More informationUniversity of Cambridge. MPhil in Computer Speech Text & Internet Technology. Module: Speech Processing II. Lecture 2: Hidden Markov Models I
University of Cambridge MPhil in Computer Speech Text & Internet Technology Module: Speech Processing II Lecture 2: Hidden Markov Models I o o o o o 1 2 3 4 T 1 b 2 () a 12 2 a 3 a 4 5 34 a 23 b () b ()
More informationROBI POLIKAR. ECE 402/504 Lecture Hidden Markov Models IGNAL PROCESSING & PATTERN RECOGNITION ROWAN UNIVERSITY
BIOINFORMATICS Lecture 11-12 Hidden Markov Models ROBI POLIKAR 2011, All Rights Reserved, Robi Polikar. IGNAL PROCESSING & PATTERN RECOGNITION LABORATORY @ ROWAN UNIVERSITY These lecture notes are prepared
More informationLab 3: Practical Hidden Markov Models (HMM)
Advanced Topics in Bioinformatics Lab 3: Practical Hidden Markov Models () Maoying, Wu Department of Bioinformatics & Biostatistics Shanghai Jiao Tong University November 27, 2014 Hidden Markov Models
More informationSyllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)
Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program) Course Name: Structural Bioinformatics Course Description: Instructor: This course introduces fundamental concepts and methods for structural
More informationHidden Markov Models Part 2: Algorithms
Hidden Markov Models Part 2: Algorithms CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Hidden Markov Model An HMM consists of:
More informationWeek 10: Homology Modelling (II) - HHpred
Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative
More informationHuman Mobility Pattern Prediction Algorithm using Mobile Device Location and Time Data
Human Mobility Pattern Prediction Algorithm using Mobile Device Location and Time Data 0. Notations Myungjun Choi, Yonghyun Ro, Han Lee N = number of states in the model T = length of observation sequence
More informationBiology 644: Bioinformatics
A stochastic (probabilistic) model that assumes the Markov property Markov property is satisfied when the conditional probability distribution of future states of the process (conditional on both past
More information11.3 Decoding Algorithm
11.3 Decoding Algorithm 393 For convenience, we have introduced π 0 and π n+1 as the fictitious initial and terminal states begin and end. This model defines the probability P(x π) for a given sequence
More information