Biological Sequences and Hidden Markov Models CPBS7711


1 Biological Sequences and Hidden Markov Models CPBS7711 Sept 27, 2011 Sonia Leach, PhD Assistant Professor Center for Genes, Environment, and Health National Jewish Health Slides created from David Pollock's 2009 slides for 7711 and the current reading list on the CPBS7711 website Center for Genes, Environment, and Health

2 Introduction Despite complex 3-D structure, biological molecules have a primary linear sequence (DNA, RNA, protein) or a linear sequence of features (CpG islands, models of exons, introns, regulatory regions, genes). Hidden Markov Models (HMMs) are probabilistic models for processes which transition through a discrete set of states, each emitting a symbol (a probabilistic finite state machine). HMMs exhibit the Markov property: the conditional probability distribution of future states of the process depends only upon the present state (memoryless). The linear sequence of molecules/features is modelled as a path through the states of the HMM, which emit the sequence of molecules/features. The actual state is hidden and observed only through the output symbols. Center for Genes, Environment, and Health 2

3 Hidden Markov Model Finite set of N states X, finite set of M observations O, parameter set λ = (π, A, B). Initial state distribution π_i = Pr(X_1 = i). Transition probability a_ij = Pr(X_t = j | X_t-1 = i). Emission probability b_ik = Pr(O_t = k | X_t = i). Example: N=3, M=2, π = (0.25, 0.55, 0.2), with transition matrix A and emission matrix B. Center for Genes, Environment, and Health 3

4 Hidden Markov Model (HMM) Finite set of N states X, finite set of M observations O, parameter set λ = (π, A, B). Initial state distribution π_i = Pr(X_1 = i). Transition probability a_ij = Pr(X_t = j | X_t-1 = i). Emission probability b_ik = Pr(O_t = k | X_t = i). Graphical model: X_t-1 → X_t, with emissions X_t-1 → O_t-1 and X_t → O_t. Example: N=3, M=2, π = (0.25, 0.55, 0.2), with transition matrix A and emission matrix B. Center for Genes, Environment, and Health 4

5 Probabilistic Graphical Models Markov Process (MP): X_t-1 → X_t. Adding observability gives the Hidden Markov Model (HMM): X_t-1 → X_t with observations O_t-1, O_t. Adding utility gives the Markov Decision Process (MDP): actions A_t-1, A_t influence states X_t-1, X_t, which yield utilities U_t-1, U_t. Adding both observability and utility gives the Partially Observable Markov Decision Process (POMDP). Center for Genes, Environment, and Health 5

6 Three basic problems of HMMs 1. Given the observation sequence O = O_1, O_2, ..., O_n, how do we compute Pr(O | λ)? 2. Given the observation sequence, how do we choose the corresponding state sequence X = X_1, X_2, ..., X_n which is optimal? 3. How do we adjust the model parameters λ to maximize Pr(O | λ)? Center for Genes, Environment, and Health 6

7 Example: π_i = Pr(X_1 = i), a_ij = Pr(X_t = j | X_t-1 = i), b_ik = Pr(O_t = k | X_t = i). N=3, M=2, π = (0.25, 0.55, 0.2), with transition matrix A and emission matrix B. Observation sequence O? State sequence X? Pr(O, X | λ)? Center for Genes, Environment, and Health 7

8 Example: π_i = Pr(X_1 = i), a_ij = Pr(X_t = j | X_t-1 = i), b_ik = Pr(O_t = k | X_t = i). N=3, M=2, π = (0.25, 0.55, 0.2). Probability of O is a sum over all state sequences: Pr(O | λ) = Σ_all X Pr(O | X, λ) Pr(X | λ) = Σ_all X π_x1 b_x1,o1 a_x1,x2 b_x2,o2 ... a_xT-1,xT b_xT,oT. What is the computational complexity of this sum? Center for Genes, Environment, and Health 8

9 Example: π_i = Pr(X_1 = i), a_ij = Pr(X_t = j | X_t-1 = i), b_ik = Pr(O_t = k | X_t = i). N=3, M=2, π = (0.25, 0.55, 0.2). Probability of O is a sum over all state sequences: Pr(O | λ) = Σ_all X Pr(O | X, λ) Pr(X | λ) = Σ_all X π_x1 b_x1,o1 a_x1,x2 b_x2,o2 ... a_xT-1,xT b_xT,oT. At each t there are N states to reach, so there are N^T possible state sequences and about 2T multiplications per sequence, i.e. O(2T·N^T) operations. So 3 states and a length-10 sequence = 1,180,980 operations, and length 20 ≈ 1e11! Center for Genes, Environment, and Health 9
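This exponential sum can be written down directly for small models, which makes the O(2T·N^T) cost concrete. A minimal Python sketch (function names and symbol encoding are illustrative, not from the slides):

```python
from itertools import product

def prob_bruteforce(pi, A, B, obs):
    """Pr(O | lambda) by explicitly summing over all N^T state paths."""
    N = len(pi)
    total = 0.0
    for path in product(range(N), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]                 # start state, emit O_1
        for t in range(1, len(obs)):
            p *= A[path[t-1]][path[t]] * B[path[t]][obs[t]]  # transition, then emit
        total += p
    return total
```

For the CpG model used later in these slides (states CpG/Non-CpG, symbols G, C, A, T), this agrees with the forward algorithm's answer, but the path enumeration grows exponentially with sequence length.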

10 Example: π_i = Pr(X_1 = i), a_ij = Pr(X_t = j | X_t-1 = i), b_ik = Pr(O_t = k | X_t = i). N=3, M=2, π = (0.25, 0.55, 0.2). Probability of O is a sum over all state sequences: Pr(O | λ) = Σ_all X Pr(O | X, λ) Pr(X | λ) = Σ_all X π_x1 b_x1,o1 a_x1,x2 b_x2,o2 ... a_xT-1,xT b_xT,oT. An efficient dynamic programming algorithm does this: the Forward algorithm (Baum and Welch, O(N^2 T)). Center for Genes, Environment, and Health 10
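The forward recursion itself is short. A minimal Python sketch (variable and function names are mine, not from the slides):

```python
def forward(pi, A, B, obs):
    """Forward algorithm: Pr(O | lambda) in O(N^2 T) time.

    pi[i]   -- initial probability of state i
    A[i][j] -- transition probability from state i to state j
    B[i][k] -- probability that state i emits symbol k
    obs     -- observation sequence as a list of symbol indices
    """
    N = len(pi)
    # alpha[i] = Pr(O_1..t, X_t = i | lambda), initialized at t = 1
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for o in obs[1:]:
        # sum over predecessor states, then emit the next symbol
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)
```

Applied to the CpG-island example on the following slides (π = (0.5, 0.5), sequence GCGAA), this returns ≈ 8e-4, matching the hand computation.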

11 A Simple HMM CpG Islands: in one state, much higher probability to be C or G. CpG state emissions: G .3, C .3, A .2, T .2. Non-CpG state emissions: G .1, C .1, A .4, T .4. From David Pollock

12 The Forward Algorithm Probability of a Sequence is the Sum of All Paths that Can Produce It. CpG state: G .3, C .3, A .2, T .2 (stay 0.8, leave 0.2). Non-CpG state: G .1, C .1, A .4, T .4 (enter 0.1, stay 0.9). Assuming π = (0.5, 0.5) and given the sequence G, what is Pr(O = G | λ)? For O = G there are 2 possible state sequences: C (i.e. CpG state) and N (i.e. Non-CpG state). Adapted from David Pollock's

13 The Forward Algorithm Probability of a Sequence is the Sum of All Paths that Can Produce It. Assuming π_X = 0.5, Pr(G | λ) = π_C b_CG + π_N b_NG = .5*.3 + .5*.1. For convenience, let's drop the 0.5s for now and add them in later (so the number to the right of G in the box is the probability of emitting G in that state, i.e. b_XG). Adapted from David Pollock's

14 The Forward Algorithm Probability of a Sequence is the Sum of All Paths that Can Produce It. Step G: C = .3, N = .1. For O = GC there are 4 possible state sequences: CC, NC, CN, NN (reach state C from C or N and emit C; reach state N from C or N and emit C). Adapted from David Pollock's

15 The Forward Algorithm Probability of a Sequence is the Sum of All Paths that Can Produce It. Step G: C = .3, N = .1. Step C: C = (.3*.8 + .1*.1)*.3 = .075, N = (.3*.2 + .1*.9)*.1 = .015. For O = GC there are 4 possible state sequences: CC, NC, CN, NN. Adapted from David Pollock's

16 The Forward Algorithm Probability of a Sequence is the Sum of All Paths that Can Produce It. Step G: C = .3, N = .1. Step C: C = (.3*.8 + .1*.1)*.3 = .075, N = (.3*.2 + .1*.9)*.1 = .015. For O = GCG there are 8 possible state sequences: CCC, CNC, NCC, NNC, CCN, CNN, NCN, NNN. Adapted from David Pollock's

17 The Forward Algorithm Probability of a Sequence is the Sum of All Paths that Can Produce It. Step G: C = .3, N = .1. Step C: C = (.3*.8 + .1*.1)*.3 = .075, N = (.3*.2 + .1*.9)*.1 = .015. Step G: each state came from C or from N and emits G. For O = GCG there are 8 possible state sequences: CCC, CNC, NCC, NNC, CCN, CNN, NCN, NNN. Adapted from David Pollock's

18 The Forward Algorithm Probability of a Sequence is the Sum of All Paths that Can Produce It. Step G: C = .3, N = .1. Step C: C = (.3*.8 + .1*.1)*.3 = .075, N = (.3*.2 + .1*.9)*.1 = .015. Step G: C = (.075*.8 + .015*.1)*.3 = .0185, N = (.075*.2 + .015*.9)*.1 = .0029. For O = GCG there are 8 possible state sequences: CCC, CNC, NCC, NNC, CCN, CNN, NCN, NNN. Adapted from David Pollock's

19 The Forward Algorithm Probability of a Sequence is the Sum of All Paths that Can Produce It. Step G: C = .3, N = .1. Step C: C = (.3*.8 + .1*.1)*.3 = .075, N = (.3*.2 + .1*.9)*.1 = .015. Step G: C = (.075*.8 + .015*.1)*.3 = .0185, N = (.075*.2 + .015*.9)*.1 = .0029. Step A: C = (.0185*.8 + .0029*.1)*.2 = .003, N = (.0185*.2 + .0029*.9)*.4 = .0025. Step A: C = (.003*.8 + .0025*.1)*.2 = .0005, N = (.003*.2 + .0025*.9)*.4 = .0011. Adapted from David Pollock's

20 The Forward Algorithm Probability of a Sequence is the Sum of All Paths that Can Produce It. For O = GCGAA: Step G: C = .3, N = .1. Step C: C = .075, N = .015. Step G: C = .0185, N = .0029. Step A: C = .003, N = .0025. Step A: C = .0005, N = .0011. Problem 1: Pr(O | λ) = 0.5*.0005 + 0.5*.0011 ≈ 8e-4.

21 The Forward Algorithm Probability of a Sequence is the Sum of All Paths that Can Produce It. For O = GCGAA: Step G: C = .3, N = .1. Step C: C = .075, N = .015. Step G: C = .0185, N = .0029. Step A: C = .003, N = .0025. Step A: C = .0005, N = .0011. Problem 2: What is the optimal state sequence?

22 The Forward Algorithm Probability of a Sequence is the Sum of All Paths that Can Produce It. Step G: C = .3, N = .1. Step C: C = (.3*.8 + .1*.1)*.3 = .075, N = (.3*.2 + .1*.9)*.1 = .015. Step G: C = .0185, N = .0029. Step A: C = .003, N = .0025. Step A: C = .0005, N = .0011. These values give the probability of being in state CpG or Non-CpG at step i. Adapted from David Pollock's

23 The Viterbi Algorithm Most Likely Path (use max instead of sum). From the forward algorithm, Step C: C = (.3*.8 + .1*.1)*.3 = .075, N = (.3*.2 + .1*.9)*.1 = .015; with max this becomes C = max(.3*.8, .1*.1)*.3 = .072, N = max(.3*.2, .1*.9)*.1 = .009 (the Viterbi algorithm). Adapted from David Pollock's (note error in formulas on his)

24 The Viterbi Algorithm Most Likely Path (use max instead of sum). Step G: C = .3, N = .1. Step C: C = max(.3*.8, .1*.1)*.3 = .072, N = max(.3*.2, .1*.9)*.1 = .009. Sequence: G C G A A. Adapted from David Pollock's (note error in formulas on his)

25 The Viterbi Algorithm Most Likely Path (use max instead of sum). Step G: C = .3, N = .1. Step C: C = max(.3*.8, .1*.1)*.3 = .072, N = max(.3*.2, .1*.9)*.1 = .009. Step G: C = max(.072*.8, .009*.1)*.3 = .0173, N = max(.072*.2, .009*.9)*.1 = .0014. Step A: C = max(.0173*.8, .0014*.1)*.2 = .0028, N = max(.0173*.2, .0014*.9)*.4 = .0014. Step A: C = max(.0028*.8, .0014*.1)*.2 = .0004, N = max(.0028*.2, .0014*.9)*.4 = .0005. Adapted from David Pollock's (note error in formulas on his)

26 The Viterbi Algorithm Most Likely Path. Step G: C = .3, N = .1. Step C: C = max(.3*.8, .1*.1)*.3 = .072, N = .009. Step G: C = .0173, N = .0014. Step A: C = .0028, N = .0014. Step A: C = .0004, N = .0005. What if we choose the max-probability state at each step? Ans: CCCCN. What is the problem with doing that? Adapted from David Pollock's (note error in formulas on his)
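The gap between per-step maxima and the globally best path is what backpointers resolve. A minimal Viterbi sketch with backtracking (Python; names are illustrative):

```python
def viterbi(pi, A, B, obs):
    """Most likely state path and its probability (Viterbi algorithm)."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]  # best score ending in state i
    back = []                                         # argmax backpointers per step
    for t in range(1, len(obs)):
        ptr, new = [], []
        for j in range(N):
            best = max(range(N), key=lambda i: delta[i] * A[i][j])
            ptr.append(best)
            new.append(delta[best] * A[best][j] * B[j][obs[t]])
        delta, back = new, back + [ptr]
    # backtrack from the best final state
    path = [max(range(N), key=lambda i: delta[i])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, max(delta)
```

On GCGAA with the CpG model this returns the path CCCNN rather than the per-step answer CCCCN: the best final state N is most likely reached by having already switched to Non-CpG one step earlier.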

27 Hint Suppose in same way most likely state at each step is Center for Genes, Environment, and Health 27

28 The Viterbi Algorithm Most Likely Path: Backtracking. Step G: C = .3, N = .1. Step C: C = max(.3*.8, .1*.1)*.3 = .072, N = max(.3*.2, .1*.9)*.1 = .009. Step G: C = max(.072*.8, .009*.1)*.3 = .0173, N = max(.072*.2, .009*.9)*.1 = .0014. Step A: C = max(.0173*.8, .0014*.1)*.2 = .0028, N = max(.0173*.2, .0014*.9)*.4 = .0014. Step A: C = max(.0028*.8, .0014*.1)*.2 = .0004, N = max(.0028*.2, .0014*.9)*.4 = .0005. Adapted from David Pollock's

29 The Viterbi Algorithm Most Likely Path: Backtracking. Start from the highest-scoring final state (N at the last step) and follow the stored argmax choices backwards through the table. Adapted from David Pollock's

30 The Viterbi Algorithm Most Likely Path: Backtracking. Backtracking recovers the globally most likely path CCCNN, which differs from the per-step maxima CCCCN. Adapted from David Pollock's

31 Forward-backward algorithm. Forward table for O = GCGAA: Step G: C = .3, N = .1. Step C: C = (.3*.8 + .1*.1)*.3 = .075, N = (.3*.2 + .1*.9)*.1 = .015. Step G: C = .0185, N = .0029. Step A: C = .003, N = .0025. Step A: C = .0005, N = .0011. Problem 3: How to learn the model? The forward algorithm calculated Pr(O_1..t, X_t = i | λ).

32 How do you learn an HMM? The iterative Baum-Welch algorithm is popular; it is equivalent to Expectation Maximization (EM). Maximize: if we know the hidden variables (states), maximize the model parameters with respect to that knowledge. Expectation: if we know the model parameters, find the expected values of the hidden variables (states). Iterate between the two steps until convergence of the parameter estimates. Center for Genes, Environment, and Health 32

33 Parameter estimation by Baum-Welch Forward-Backward Algorithm. Forward variable α_t(i) = Pr(O_1..t, X_t = i | λ). Backward variable β_t(i) = Pr(O_t+1..n | X_t = i, λ). Rabiner 1989

34 Parameter Estimation Define 2 variables, ξ and γ. Probability of transitioning at time t from state i to j, no matter the path: ξ_t(i,j) = Pr(X_t = S_i, X_t+1 = S_j | O, λ) = α_t(i) a_ij b_j,Ot+1 β_t+1(j) / Σ_i=1..N Σ_j=1..N α_t(i) a_ij b_j,Ot+1 β_t+1(j). Probability of being in state i at time t, no matter the path: γ_t(i) = Pr(X_t = S_i | O, λ) = α_t(i) β_t(i) / Σ_i=1..N α_t(i) β_t(i). Then the expected values for the parameters are: π_i = γ_1(i); a_ij = Σ_t=1..T-1 ξ_t(i,j) / Σ_t=1..T-1 γ_t(i); b_jk = Σ_t=1..T s.t. Ot=k γ_t(j) / Σ_t=1..T γ_t(j). Center for Genes, Environment, and Health 34

35 Baum-Welch algorithm (equivalent to EM) Given an initial assignment to the parameters λ = (π, a, b), compute ξ and γ from α and β. Generate a new estimate λ* = (π*, a*, b*) from: π*_i = γ_1(i); a*_ij = Σ_t=1..T-1 ξ_t(i,j) / Σ_t=1..T-1 γ_t(i); b*_jk = Σ_t=1..T s.t. Ot=k γ_t(j) / Σ_t=1..T γ_t(j). Set λ = λ* and repeat until convergence. Center for Genes, Environment, and Health 35
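One full re-estimation step can be written directly from these formulas. A sketch in Python without numerical scaling, so it is only suitable for short sequences (names are mine):

```python
def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch re-estimation step (no scaling: short sequences only)."""
    N, T, M = len(pi), len(obs), len(B[0])
    # forward pass: alpha[t][i] (0-based t)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(N))
                      * B[j][obs[t]] for j in range(N)])
    # backward pass: beta[t][i]
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j]
                             for j in range(N))
    pO = sum(alpha[T-1])  # Pr(O | lambda)
    # posteriors gamma_t(i) and xi_t(i, j)
    gamma = [[alpha[t][i] * beta[t][i] / pO for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t+1]] * beta[t+1][j] / pO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # re-estimated parameters
    pi2 = gamma[0][:]
    A2 = [[sum(xi[t][i][j] for t in range(T - 1)) /
           sum(gamma[t][i] for t in range(T - 1))
           for j in range(N)] for i in range(N)]
    B2 = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
           sum(gamma[t][i] for t in range(T))
           for k in range(M)] for i in range(N)]
    return pi2, A2, B2
```

In practice this step is iterated until the likelihood stops improving; production implementations rescale α and β (or work in log space) to avoid underflow on long sequences.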

36 Where are HMMs used in Computational Biology? DNA: motif matching, gene matching, multiple sequence alignment. Amino acids: domain matching, fold recognition. Microarrays/whole genome sequencing: assigning copy number. ChIP-chip/seq: distinguishing chromatin states. Center for Genes, Environment, and Health 36

37 Homologous Sequences What is the consensus sequence? How can we recognize all of them? And how do we distinguish unlikely members? Center for Genes, Environment, and Health Krogh

38 Homologous Sequences Krogh 1998 Center for Genes, Environment, and Health 38

39 Probability of Sequences Center for Genes, Environment, and Health 39

40 Learning Parameters of Compbio HMMs When built from pre-aligned (pre-labeled) sequences, states have meaningful biological labels (like insertion position), and parameter estimation just tabulates frequencies, as in the previous example. Note that longer sequences have lower probability, so scores are often converted to log-odds parameters (see Krogh 1998). When built from unaligned/unlabelled sequences, the semantics of states can (sometimes) be interpreted later, and one must use Baum-Welch or equivalent for parameter estimation, as in the chromatin state example shown later. HMMs encode regular grammars, so they do a poor job on problems with long-range (complementary) correlations (e.g. RNA/protein secondary structure). Center for Genes, Environment, and Health 40
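The log-odds conversion can be sketched by scoring the forward probability against an i.i.d. background model (a minimal Python sketch; the uniform background is an illustrative assumption, not from the slides):

```python
import math

def forward(pi, A, B, obs):
    """Forward probability Pr(O | lambda)."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)

def log_odds(pi, A, B, obs, bg):
    """log2 [ Pr(O | HMM) / Pr(O | background) ], where the background
    emits symbols i.i.d. with probabilities bg[k]."""
    null = sum(math.log2(bg[o]) for o in obs)
    return math.log2(forward(pi, A, B, obs)) - null
```

Positive scores favor the HMM over the background; the length effect largely cancels because both log-probabilities shrink with sequence length.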

41 Homology HMM Gene recognition; classify to identify distant homologs. Models a common ancestral sequence. Parameter set λ = (A, B, π), strict left-right model. Specially defined set of states: start, stop, match, insert, delete. For the initial state distribution π, use the start state. For the transition matrix A, use global transition probabilities. For the emission matrix B: match states use site-specific emission probabilities; insert states (relative to the ancestor) use global emission probabilities; delete states emit nothing. Used for multiple sequence alignments. Adapted from David Pollock's

42 Homology HMM Architecture: start → match → match → end, with insert states between matches and delete states bypassing them. Adapted from David Pollock's

43 Homology HMM Example Match-state emission probabilities: match 1: A .1, C .05, D .2, E .08, F .01; match 2: A .04, C .1, D .01, E .2, F .02; match 3: A .2, C .01, D .05, E .1, F .06.
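Using the emission numbers on this slide, a match-only (ungapped) log-odds score against a flat background might look like the following sketch (the background frequency 0.05 is an illustrative assumption):

```python
import math

# Match-state emission probabilities from the slide (three match states)
PROFILE = [
    {"A": 0.1, "C": 0.05, "D": 0.2, "E": 0.08, "F": 0.01},
    {"A": 0.04, "C": 0.1, "D": 0.01, "E": 0.2, "F": 0.02},
    {"A": 0.2, "C": 0.01, "D": 0.05, "E": 0.1, "F": 0.06},
]

def ungapped_score(seq, bg=0.05):
    """Log2-odds of emitting seq from the match states vs a flat background."""
    return sum(math.log2(col[aa] / bg) for col, aa in zip(PROFILE, seq))
```

For example, "DEA" hits the most probable residue in every column (each 0.2) and scores log2(0.2/0.05) = 2 bits per position, 6 bits in total; a poor match scores negatively.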

44 Ungapped blocks Ungapped blocks, where insertion states model the intervening sequence between blocks. Insert/delete states allowed anywhere. Allow multiple domains, sequence fragments. Eddy, Center for Genes, Environment, and Health

45 Uses for Homology HMM Find homologs to a profile HMM in a database: score multiple sequences for match to one HMM; not always Pr(O | λ), since some areas may be highly diverged; sometimes use the highest scoring subsequence; the goal is to find homologs in the database. Classify a sequence using a library of profile HMMs: compare 1 sequence to >1 alternate models, e.g. the Pfam and PROSITE motif databases. Alignment of additional sequences. Structural alignment when the alphabet is secondary structure symbols, so one can do fold recognition, etc. Adapted from David Pollock's

46 Variable Length and Composition of Protein Domains Center for Genes, Environment, and Health 46

47 Why Hidden Markov Models for MSA? Multiple sequence alignment as consensus May have substitutions, not all AA are equal FOS_RAT IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPSTGAYARAGVV 112 FOS_MOUSE IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQSAGAYARAGMV 112 Could use regular expressions but how to handle indels? FOS_RAT IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPS-TGAYARAGVV 112 FOS_MOUSE IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQS-AGAYARAGMV 112 FOS_CHICK VPTVTAISTSPDLQWLVQPTLISSVAPSQNRG-HPYGVPAPAPPAAYSRPAVL 112 What about variable-length members of family? FOS_RAT IPTVTAISTSPDLQWLVQPTLVSSVAPSQ TRAPHPYGLPTPS-TGAYARAGVV 112 FOS_MOUSE IPTVTAISTSPDLQWLVQPTLVSSVAPSQ TRAPHPYGLPTQS-AGAYARAGMV 112 FOS_CHICK VPTVTAISTSPDLQWLVQPTLISSVAPSQ NRG-HPYGVPAPAPPAAYSRPAVL 112 FOSB_MOUSE VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTS----YSTPGLS 110 FOSB_HUMAN VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTS----YSTPGMS 110 Center for Genes, Environment, and Health 47

48 Why Hidden Markov Models? Rather than a consensus sequence, which describes the most common amino acid per position, HMMs allow more than one amino acid to appear at each position. Rather than profiles as position-specific scoring matrices (PSSMs), which assign a probability to each amino acid in each position of the domain and slide a fixed-length profile along a longer sequence to calculate a score, HMMs model the probability of variable-length sequences. Rather than regular expressions, which can capture variable-length sequences yet specify only a limited subset of amino acids per position, HMMs quantify the difference among the different amino acids allowed at each position. Center for Genes, Environment, and Health 48

49 Detecting Copy Number in Array CGH Data A discrete number of copies is found by segmenting array intensities along the chromosome: HMM segmentation vs naïve smoothing. Center for Genes, Environment, and Health 49

50 Detecting Copy Number in Whole Genome Sequencing Data Compute the log ratio of observed coverage to expected coverage. Fit to an HMM with states for 0-9 copies. Copy number is assigned to each region with the Viterbi algorithm. ABI Bioscope manual 2010 Center for Genes, Environment, and Health 50
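A toy version of this segmentation, with Gaussian emissions over log2 coverage ratios and sticky self-transitions, can be sketched as follows (all parameter values here are illustrative assumptions, not Bioscope's):

```python
import math

def copy_number_viterbi(log_ratios, n_states=4, stay=0.9, sigma=0.25):
    """Viterbi segmentation of log2(observed/expected coverage).

    State c models copy number c: its expected log2 ratio is log2(c/2)
    (diploid baseline), with c = 0 floored at 0.5 to avoid log(0).
    Sticky self-transitions favor long constant-copy segments.
    """
    means = [math.log2(max(c, 0.5) / 2.0) for c in range(n_states)]
    stay_l = math.log(stay)
    move_l = math.log((1.0 - stay) / (n_states - 1))

    def loglik(x, mu):  # Gaussian log-density; shared constant term omitted
        return -0.5 * ((x - mu) / sigma) ** 2

    delta = [loglik(log_ratios[0], m) for m in means]  # flat prior dropped
    back = []
    for x in log_ratios[1:]:
        ptr, new = [], []
        for j in range(n_states):
            best = max(range(n_states),
                       key=lambda i: delta[i] + (stay_l if i == j else move_l))
            ptr.append(best)
            new.append(delta[best] + (stay_l if best == j else move_l)
                       + loglik(x, means[j]))
        delta, back = new, back + [ptr]
    path = [max(range(n_states), key=lambda i: delta[i])]
    for p in reversed(back):
        path.append(p[path[-1]])
    path.reverse()
    return path
```

Real pipelines add per-position expected-coverage normalization, more copy states, and emission variances estimated from the data, but the Viterbi structure is the same.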

51 HMMs for Chromatin States A specific amino acid of a specific histone protein, modified at a given level, can be tagged and assayed; e.g. H3K27me3 means 3 methyl groups have been added to the lysine at position 27 in histone 3. Rodenhiser & Mann CMAJ (3):341 Center for Genes, Environment, and Health 51

52 Combination of Chromatin States An HMM for the sequence from a single mark has two states, e.g. has H3K27me3 or no H3K27me3 (peak finding). However, peaks for a single mark could still be distributed all across the genome; which ones are important? Comparing across multiple signals identifies specific combinations which distinguish the important peaks in an individual signal (combinatorial patterns). Barski et al Cell : Center for Genes, Environment, and Health 52

53 Combination States Learned Model optimized to Q=51 labels (a.k.a. states), where semantics are assigned post hoc based on prior biological knowledge, relation to gene models, gene expression data, and sequence conservation. Center for Genes, Environment, and Health 53

54 Multivariate HMMs for Chromatin States Ernst 2010 learned 51 distinct chromatin states, interpreted post hoc as promoter-associated, transcription-associated, active intergenic, large-scale repressed and repeat-associated states. Center for Genes, Environment, and Health 54

55 Hot Topic: Better than HMMs for Chromatin States: Dynamic Bayes Nets! These allow specification of the min/max length of a feature and a way to count that down ("memory"), plus a way to enforce or disallow certain transitions. Recall the HMM: X_t-1 → X_t with emissions O_t-1, O_t; here the hidden state of the model emits a sequence of observations for each of n chromatin/transcription-factor marks. Segway by Hoffman et al 2011 Center for Genes, Environment, and Health 55

56 Hot Topic: Better than HMMs for Chromatin States Segway by Hoffman et al 2011: specify Q=25 labels (a.k.a. states); the semantics of the learned states are assigned post hoc based on prior biological knowledge. Center for Genes, Environment, and Health 56

57 Hot Topic: Better than HMMs for Chromatin States Center for Genes, Environment, and Health Segway by Hoffman et al

58 Homology HMM Resources Great tutorial (Krogh 1998). WUSTL/Janelia (Eddy 1998 Bioinformatics 14(9):755). Pfam: database of pre-computed HMM alignments for various proteins. HMMer: program for building HMMs. UCSC (Haussler) SAM: alignment, secondary structure predictions, HMM parameters, etc. Chromatin States: Ernst et al, PMCID: PMC Segway:

59 Center for Genes, Environment, and Health 59

60 Other David Pollock Slides 2009 Center for Genes, Environment, and Health 60

61 Model Comparison Based on P(D | θ, M). For ML, take P_max(D | θ, M); usually ln P_max(D | θ, M) to avoid numeric error. For heuristics, the score is log2 P(D | θ_fixed, M). For Bayesian, calculate P_max(θ, M | D) ∝ P(D | θ, M) * P(θ) * P(M), which uses prior information on the parameters P(θ). Adapted from David Pollock's

62 Types of parameters Amino acid distributions for positions (match states). Global AA distributions for insert states. Order of match states. Transition probabilities. Phylogenetic tree topology and branch lengths. Hidden states (integrate or augment). Wander parameter space (search); maximize, or move according to posterior probability (Bayes). Adapted from David Pollock's

63 Expectation Maximization (EM) Classic algorithm to fit probabilistic model parameters with unobservable states. Two stages. Maximize: if we know the hidden variables (states), maximize the model parameters with respect to that knowledge. Expectation: if we know the model parameters, find the expected values of the hidden variables (states). Works well even with, e.g., Bayesian approaches, to find the near-equilibrium space. Adapted from David Pollock's

64 Homology HMM EM Start with a heuristic MSA (e.g., ClustalW). Maximize: match states are residues aligned in most sequences; amino acid frequencies are observed in columns. Expectation: realign all the sequences given the model. Repeat until convergence. Problems: local, not global optimization; use procedures to check how it worked. Adapted from David Pollock's

65 Model Comparison Determining significance depends on comparing two models (family vs non-family), usually a null model H_0 and a test model H_1. Models are nested if H_0 is a subset of H_1. If not nested: Akaike Information Criterion (AIC) [similar to empirical Bayes] or Bayes Factor (BF) [but be careful]. Generating a null distribution of the statistic: Z-factor, bootstrapping, parametric bootstrapping, posterior predictive. Adapted from David Pollock's

66 Z Test Method Database of known negative controls, e.g., non-homologous (NH) sequences. Assume NH scores ~ N(μ, σ), i.e., you are modeling known NH sequence scores as a normal distribution. Set an appropriate significance level for multiple comparisons (more below). Problems: Is homology certain? Is it the appropriate null model? The normal distribution is often not a good approximation. Parameter control is hard: e.g., the length distribution. Adapted from David Pollock's

67 Bootstrapping and Parametric Models Random sequences sampled from the same set of emission probability distributions; same length is easy. Bootstrapping is re-sampling columns. Parametric uses estimated frequencies and may include variance, tree, etc.; more flexible, can have a more complex null. Pseudocounts of global frequencies if data are limited. Insertions are relatively hard to model: what frequencies for insert states? Global? Adapted from David Pollock's

68 Center for Genes, Environment, and Health 68


More information

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan
