Biological Sequences and Hidden Markov Models CPBS7711


Biological Sequences and Hidden Markov Models, CPBS7711, Sept 27, 2011. Sonia Leach, PhD, Assistant Professor, Center for Genes, Environment, and Health, National Jewish Health, sonia.leach@gmail.com. Slides created from David Pollock's 2009 slides for 7711 and the current reading list on the CPBS7711 website.

Introduction. Despite their complex 3-D structure, biological molecules have a primary linear sequence (DNA, RNA, protein) or a linear sequence of features (CpG islands, models of exons, introns, regulatory regions, genes). Hidden Markov Models (HMMs) are probabilistic models for processes that transition through a discrete set of states, each emitting a symbol (a probabilistic finite state machine). HMMs exhibit the Markov property: the conditional probability distribution of future states of the process depends only upon the present state (memory-less; Andrey Markov, 1856-1922). A linear sequence of molecules/features is modelled as a path through the states of the HMM, which emit the sequence of molecules/features. The actual state is hidden and is observed only through the output symbols.

Hidden Markov Model. Finite set of N states X; finite set of M observations O; parameter set λ = (π, A, B): initial state distribution π_i = Pr(X_1 = i), transition probabilities a_ij = Pr(X_t = j | X_t-1 = i), emission probabilities b_ik = Pr(O_t = k | X_t = i).
Example (3-state diagram): N = 3, M = 2, π = (0.25, 0.55, 0.2),
A = [ 0 0.2 0.8 ; 0 0.9 0.1 ; 1.0 0 0 ],
B = [ 0.1 0.9 ; 0.75 0.25 ; 0.5 0.5 ]

Hidden Markov Model (graphical model: X_t-1 → X_t, with emissions O_t-1 and O_t). Same definitions as above: parameter set λ = (π, A, B), with π_i = Pr(X_1 = i), a_ij = Pr(X_t = j | X_t-1 = i), b_ik = Pr(O_t = k | X_t = i), and the same 3-state, 2-symbol example.
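The parameters above can be written down directly as arrays. A minimal Python sketch, assuming the slide's matrices reconstruct row-wise as shown below (numpy is used only for convenience):

```python
import numpy as np

# Example HMM from the slide: N = 3 states, M = 2 observation symbols.
pi = np.array([0.25, 0.55, 0.20])          # initial state distribution, pi[i] = Pr(X_1 = i)
A = np.array([[0.0, 0.2, 0.8],             # transition matrix, A[i, j] = Pr(X_t = j | X_{t-1} = i)
              [0.0, 0.9, 0.1],
              [1.0, 0.0, 0.0]])
B = np.array([[0.10, 0.90],                # emission matrix, B[i, k] = Pr(O_t = k | X_t = i)
              [0.75, 0.25],
              [0.50, 0.50]])

# Sanity checks: pi and every row of A and B must sum to 1.
assert np.allclose(pi.sum(), 1) and np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```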

Probabilistic Graphical Models. The figure arranges four related models along two axes, observability and utility: Markov Process (MP), a chain X_t-1 → X_t; Hidden Markov Model (HMM), the chain X_t-1 → X_t with observations O_t-1, O_t; Markov Decision Process (MDP), which adds actions A_t and utilities U_t; and Partially Observable Markov Decision Process (POMDP), which adds actions, utilities, and observations.

Three basic problems of HMMs. 1. Given the observation sequence O = O_1, O_2, ..., O_n, how do we compute Pr(O | λ)? 2. Given the observation sequence, how do we choose the corresponding state sequence X = X_1, X_2, ..., X_n which is optimal? 3. How do we adjust the model parameters λ to maximize Pr(O | λ)?

Example: π_i = Pr(X_1 = i), a_ij = Pr(X_t = j | X_t-1 = i), b_ik = Pr(O_t = k | X_t = i). N = 3, M = 2, π = (0.25, 0.55, 0.2), with A and B as above. Questions: what is an observation sequence O? A state sequence X? Prob(O, X | λ)?

Example (continued). The probability of O is a sum over all state sequences:
Pr(O | λ) = Σ_{all X} Pr(O | X, λ) Pr(X | λ) = Σ_{all X} π_x1 b_x1(o_1) a_x1x2 b_x2(o_2) ... a_x(T-1)xT b_xT(o_T)
What is the computational complexity of this sum?

Example (continued). Pr(O | λ) = Σ_{all X} Pr(O | X, λ) Pr(X | λ) = Σ_{all X} π_x1 b_x1(o_1) a_x1x2 b_x2(o_2) ... a_x(T-1)xT b_xT(o_T). At each step t there are N states that can be reached, so there are N^T possible state sequences and about 2T multiplications per sequence, i.e. O(2T · N^T) operations. So 3 states and a length-10 sequence gives 1,180,980 operations, and length 20 gives roughly 1e11!
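To make the exponential blow-up concrete, the sum can be written as a literal enumeration over all N^T state sequences. This is only a sketch for toy examples; the function name and argument layout are illustrative, not from the slides:

```python
import itertools
import numpy as np

def brute_force_likelihood(obs, pi, A, B):
    """Sum Pr(O, X | lambda) over all N**T state sequences X (exponential cost)."""
    N, T = len(pi), len(obs)
    total = 0.0
    for path in itertools.product(range(N), repeat=T):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, T):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total
```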

Example (continued). Pr(O | λ) = Σ_{all X} Pr(O | X, λ) Pr(X | λ). There is an efficient dynamic programming algorithm to compute this: the Forward algorithm (Baum and Welch), which is O(N^2 T).
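A minimal sketch of that forward recursion, using the same array conventions as the earlier snippet; alpha[t, i] accumulates Pr(O_1..t, X_t = i | λ), and the answer is the sum of the last row:

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward algorithm: alpha[t, i] = Pr(O_1..t, X_t = i | lambda), cost O(N^2 T)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                        # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]    # induction step
    return alpha, alpha[-1].sum()                       # Pr(O | lambda) = sum_i alpha[T, i]
```

On short sequences this agrees with the brute-force sum above, but at O(N^2 T) cost instead of O(2T · N^T).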

A Simple HMM: CpG Islands, where one state has a much higher probability of emitting C or G. CpG state emissions: G .3, C .3, A .2, T .2. Non-CpG state emissions: G .1, C .1, A .4, T .4. Transitions: CpG→CpG 0.8, CpG→Non-CpG 0.2, Non-CpG→Non-CpG 0.9, Non-CpG→CpG 0.1. From David Pollock

The Forward Algorithm: the probability of a sequence is the sum over all paths that can produce it. Assuming π = (0.5, 0.5) for the CpG model above and building up the observation sequence G, C, G, A, A one symbol at a time, what is Pr(O | λ)? For the first observation O = G there are 2 possible state sequences: C (i.e. the CpG state) and N (i.e. the Non-CpG state). Adapted from David Pollock's slides

The Forward Algorithm (step 1, O = G). Assuming π_X = 0.5, Pr(G | λ) = π_C b_C(G) + π_N b_N(G) = .5*.3 + .5*.1. For convenience, drop the 0.5s for now and add them back at the end (so the number next to G in each box is the probability of emitting G in that state, i.e. b_X(G)): α_1(C) = .3, α_1(N) = .1. Adapted from David Pollock's slides

The Forward Algorithm (step 2, O = GC). There are 4 possible state sequences: CC, NC, CN, NN. The C entry at step 2 combines "CC or NC, then emit C"; the N entry combines "CN or NN, then emit C". Adapted from David Pollock's slides

The Forward Algorithm (step 2, O = GC). α_2(C) = (.3*.8 + .1*.1)*.3 = .075; α_2(N) = (.3*.2 + .1*.9)*.1 = .015. Adapted from David Pollock's slides

The Forward Algorithm (step 3, O = GCG). There are 8 possible state sequences: CCC, CNC, NCC, NNC, CCN, CNN, NCN, NNN. Adapted from David Pollock's slides

The Forward Algorithm (step 3, O = GCG). Each step-3 entry "came from C or from N and emits G": the previous column is combined through the transition probabilities, then multiplied by the emission probability for G. Adapted from David Pollock's slides

The Forward Algorithm (step 3, O = GCG). α_3(C) = (.075*.8 + .015*.1)*.3 = .0185; α_3(N) = (.075*.2 + .015*.9)*.1 = .0029. Adapted from David Pollock's slides

The Forward Algorithm (steps 4 and 5, O = GCGAA). α_4(C) = (.0185*.8 + .0029*.1)*.2 = .003; α_4(N) = (.0185*.2 + .0029*.9)*.4 = .0025; α_5(C) = (.003*.8 + .0025*.1)*.2 = .0005; α_5(N) = (.003*.2 + .0025*.9)*.4 = .0011. Adapted from David Pollock's slides

The Forward Algorithm (complete trellis for G, C, G, A, A): α_1 = (.3, .1); α_2 = (.075, .015); α_3 = (.0185, .0029); α_4 = (.003, .0025); α_5 = (.0005, .0011). Problem 1: Pr(O | λ) = 0.5*.0005 + 0.5*.0011 ≈ 8e-4.
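The same trellis can be checked numerically. A small self-contained sketch of the CpG example (here the 0.5 initial factors are kept from the start, so the per-state values are half the slide's, but the total matches):

```python
import numpy as np

# CpG-island model: state 0 = CpG, state 1 = Non-CpG; symbols indexed G=0, C=1, A=2, T=3.
pi = np.array([0.5, 0.5])
A  = np.array([[0.8, 0.2],
               [0.1, 0.9]])
B  = np.array([[0.3, 0.3, 0.2, 0.2],   # CpG emissions for G, C, A, T
               [0.1, 0.1, 0.4, 0.4]])  # Non-CpG emissions for G, C, A, T
obs = [0, 1, 0, 2, 2]                  # G, C, G, A, A

alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]      # forward recursion
print(alpha.sum())                     # ~8.4e-4, matching the slide's Pr(O | lambda) ~ 8e-4
```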

The Forward Algorithm (same trellis). Problem 2: what is the optimal state sequence?

The Forward Algorithm (same trellis). The column at step i gives the forward variable: the joint probability of the observations so far and of being in the CpG or Non-CpG state at step i. Adapted from David Pollock's slides

The Viterbi Algorithm: most likely path (use max instead of sum). The forward-algorithm sum at step 2 becomes a max: δ_2(C) = max(.3*.8, .1*.1)*.3 = .072; δ_2(N) = max(.3*.2, .1*.9)*.1 = .009. Adapted from David Pollock's slides (note error in formulas on his)

The Viterbi Algorithm: most likely path (use max instead of sum), for the full sequence G, C, G, A, A. δ_1 = (.3, .1); δ_2(C) = max(.3*.8, .1*.1)*.3 = .072; δ_2(N) = max(.3*.2, .1*.9)*.1 = .009. Adapted from David Pollock's slides (note error in formulas on his)

The Viterbi Algorithm: most likely path. δ_3(C) = max(.072*.8, .009*.1)*.3 = .0173; δ_3(N) = max(.072*.2, .009*.9)*.1 = .0014; δ_4(C) = max(.0173*.8, .0014*.1)*.2 = .0028; δ_4(N) = max(.0173*.2, .0014*.9)*.4 = .0014; δ_5(C) = max(.0028*.8, .0014*.1)*.2 = .00044; δ_5(N) = max(.0028*.2, .0014*.9)*.4 = .0005. Adapted from David Pollock's slides (note error in formulas on his)

The Viterbi Algorithm: most likely path (full trellis for G, C, G, A, A as above). What if we simply chose the maximum-probability state at each step? Answer: C, C, C, C, N. What is the problem with doing that? Adapted from David Pollock's slides (note error in formulas on his)

Hint: in the 3-state example from earlier, suppose that in the same way the individually most likely state at each step is 3, 2, 1, 1, 2. Several of those consecutive transitions (e.g. 3→2 and 2→1) have probability 0 in A, so the concatenation of per-step maxima need not be a valid, let alone most likely, path.

The Viterbi Algorithm: most likely path with backtracking. Same trellis as above (δ values .3, .1; .072, .009; .0173, .0014; .0028, .0014; .00044, .0005), now recording which predecessor achieved each max.

The Viterbi Algorithm: most likely path with backtracking. Start from the higher-scoring final state and follow the stored back-pointers to recover the best path. Adapted from David Pollock's slides

The Viterbi Algorithm: most likely path with backtracking (continued). Here backtracking from the best final state (N) yields the path C, C, C, N, N, which differs from the greedy per-column choice C, C, C, C, N on the previous slide. Adapted from David Pollock's slides
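A compact sketch of Viterbi with back-pointers, in the same array conventions as the forward sketch (names and layout are illustrative):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Viterbi: delta[t, i] = prob of the best path ending in state i; psi stores back-pointers."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta[t-1, i] * a_ij
        psi[t] = scores.argmax(axis=0)                # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Backtracking: start from the best final state and follow the pointers.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()
```

For the CpG model (0 = CpG, 1 = Non-CpG) and obs = [0, 1, 0, 2, 2] (G, C, G, A, A), this returns the path [0, 0, 0, 1, 1], i.e. C, C, C, N, N, illustrating why per-column maxima are not enough.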

Forward-backward algorithm (forward trellis for G, C, G, A, A as above). Problem 3: how do we learn the model? The forward algorithm calculated Pr(O_1..t, X_t = i | λ).

How do you learn an HMM? The Baum-Welch iterative algorithm is popular; it is equivalent to Expectation Maximization (EM). Maximization: if the hidden variables (states) were known, maximize the model parameters with respect to that knowledge. Expectation: if the model parameters were known, find the expected values of the hidden variables (states). Iterate between the two steps until the parameter estimates converge.

Parameter estimation by Baum-Welch: the Forward-Backward Algorithm. Forward variable α_t(i) = Pr(O_1..t, X_t = i | λ). Backward variable β_t(i) = Pr(O_t+1..n | X_t = i, λ). (Rabiner 1989)
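The backward variable satisfies a mirror-image recursion, run right to left. A minimal sketch in the same conventions as the forward sketch (illustrative, not Rabiner's pseudocode):

```python
import numpy as np

def backward(obs, A, B):
    """Backward algorithm: beta[t, i] = Pr(O_{t+1}..O_T | X_t = i, lambda)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                      # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # sum_j a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
    return beta
```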

Parameter Estimation. Define two quantities, ξ and γ. Probability of transitioning at time t from state i to state j, no matter the path:
ξ_t(i,j) = Pr(q_t = S_i, q_t+1 = S_j | O, λ) = α_t(i) a_ij b_j(O_t+1) β_t+1(j) / Σ_{i=1..N} Σ_{j=1..N} α_t(i) a_ij b_j(O_t+1) β_t+1(j)
Probability of being in state i at time t, no matter the path:
γ_t(i) = Pr(q_t = S_i | O, λ) = α_t(i) β_t(i) / Σ_{i=1..N} α_t(i) β_t(i)
Then the expected values (re-estimates) of the parameters are:
π_i = γ_1(i)
a_ij = Σ_{t=1..T-1} ξ_t(i,j) / Σ_{t=1..T-1} γ_t(i)
b_jk = Σ_{t=1..T s.t. O_t=k} γ_t(j) / Σ_{t=1..T} γ_t(j)
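These two quantities fall out directly from the forward and backward tables. A sketch of computing γ and ξ for one observation sequence (array shapes follow the earlier sketches; names are illustrative):

```python
import numpy as np

def expected_counts(obs, A, B, alpha, beta):
    """Compute gamma_t(i) and xi_t(i, j) from the forward/backward variables."""
    T, N = alpha.shape
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)            # gamma_t(i) = Pr(X_t = i | O, lambda)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        x = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
        xi[t] = x / x.sum()                              # xi_t(i, j) = Pr(X_t = i, X_{t+1} = j | O, lambda)
    return gamma, xi
```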

Baum-Welch algorithm (equivalent to EM). Given an initial assignment to the parameters λ = (π, A, B), compute ξ and γ from α and β. Generate the new estimate λ* = (π*, A*, B*) from
π*_i = γ_1(i)
a*_ij = Σ_{t=1..T-1} ξ_t(i,j) / Σ_{t=1..T-1} γ_t(i)
b*_jk = Σ_{t=1..T s.t. O_t=k} γ_t(j) / Σ_{t=1..T} γ_t(j)
Set λ = λ* and repeat until convergence.
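Putting the pieces together, one Baum-Welch/EM update might look like the sketch below, reusing the forward, backward, and expected_counts sketches above; it handles a single observation sequence and omits the usual numerical-scaling tricks:

```python
import numpy as np

def baum_welch_step(obs, pi, A, B):
    """One EM re-estimation of (pi, A, B) from a single observation sequence."""
    alpha, _ = forward(obs, pi, A, B)
    beta = backward(obs, A, B)
    gamma, xi = expected_counts(obs, A, B, alpha, beta)
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        # expected emissions of symbol k in each state / expected visits to each state
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return pi_new, A_new, B_new
```

Iterating baum_welch_step until Pr(O | λ) stops improving implements the "repeat until convergence" loop.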

Where are HMMs used in Computational Biology? DNA: motif matching, gene matching, multiple sequence alignment. Amino acids: domain matching, fold recognition. Microarrays/whole genome sequencing: assigning copy number. ChIP-chip/ChIP-seq: distinguishing chromatin states.

Homologous Sequences. What is the consensus sequence? How can we recognize all of them? But how do we distinguish unlikely members? (Krogh 1998)

Homologous Sequences (alignment figure from Krogh 1998).

Probability of Sequences (figure).

Learning Parameters of Computational Biology HMMs. When the HMM is built from pre-aligned (pre-labeled) sequences, the states have meaningful biological labels (like insertion position), and parameter estimation just tabulates frequencies, as in the previous example; note that longer sequences have lower probability, so scores are often converted to log-odds parameters (see Krogh 1998). When the HMM is built from unaligned/unlabelled sequences, the semantics of the states can (sometimes) be interpreted later, and Baum-Welch or an equivalent must be used for parameter estimation, as in the chromatin-state example shown later. HMMs encode regular grammars, so they do a poor job on problems with long-range (complementary) correlations (e.g. RNA/protein secondary structure).
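As a rough illustration of the log-odds idea (not Krogh's exact formulation), one can score Pr(O | HMM) against an i.i.d. background model of the same length, so that raw length effects largely cancel; the background frequencies and names below are made up for the example:

```python
import numpy as np

def log_odds(p_model, obs, background):
    """log2 odds of the HMM probability vs an i.i.d. background model of the same sequence.
    p_model = Pr(O | HMM), e.g. from the forward() sketch above; background maps symbol -> frequency."""
    p_null = np.prod([background[o] for o in obs])
    return np.log2(p_model) - np.log2(p_null)

# Example: the CpG sequence G, C, G, A, A against uniform 0.25 background frequencies.
print(log_odds(8.4e-4, "GCGAA", {"G": 0.25, "C": 0.25, "A": 0.25, "T": 0.25}))
```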

Homology HMM. Used for gene recognition and for classifying sequences to identify distant homologs of a common ancestral sequence. Parameter set λ = (A, B, π), strict left-right model, with a specially defined set of states: start, stop, match, insert, delete. For the initial state distribution π, use the start state. For the transition matrix A, use global transition probabilities. For the emission matrix B: match states have site-specific emission probabilities, insert states (relative to the ancestor) use global emission probabilities, and delete states emit nothing. Built from multiple sequence alignments. Adapted from David Pollock's slides

Homology HMM architecture: start → a series of columns, each with a match state, an insert state, and a delete state → end. Adapted from David Pollock's slides
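For concreteness, a toy sketch of how such a state set might be enumerated for a given number of match columns; the naming scheme and the transition comment reflect generic profile-HMM conventions rather than any particular package:

```python
def profile_hmm_states(length):
    """List the states of a simple profile HMM with `length` match columns."""
    states = ["start"]
    for i in range(1, length + 1):
        states += [f"M{i}", f"I{i}", f"D{i}"]   # match, insert, delete for column i
    states.append("end")
    return states

# Typical strict left-to-right transitions: M_i -> {M_{i+1}, I_i, D_{i+1}},
# I_i -> {I_i, M_{i+1}}, D_i -> {D_{i+1}, M_{i+1}}; only match and insert states emit symbols.
print(profile_hmm_states(3))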

Homology HMM Example: three match states with site-specific emission probabilities.
Match 1: A .1, C .05, D .2, E .08, F .01
Match 2: A .04, C .1, D .01, E .2, F .02
Match 3: A .2, C .01, D .05, E .1, F .06

Profile HMM architectures (Eddy, 1998): ungapped blocks; ungapped blocks where insertion states model the intervening sequence between blocks; insert/delete states allowed anywhere; and architectures allowing multiple domains and sequence fragments.

Uses for a Homology HMM. Find homologs to the profile HMM in a database: score multiple sequences for match to one HMM; this is not always Pr(O | λ), since some regions may be highly diverged, so sometimes the highest-scoring subsequence is used; the goal is to find homologs in the database. Classify a sequence using a library of profile HMMs: compare one sequence to more than one alternative model, e.g. the Pfam and PROSITE motif databases. Align additional sequences. Structural alignment: when the alphabet is secondary-structure symbols, this supports fold recognition, etc. Adapted from David Pollock's slides

Variable Length and Composition of Protein Domains (figure; http://rnajournal.cshlp.org/content/12/12/2080.full)

Why Hidden Markov Models for MSA? Multiple sequence alignment as consensus: there may be substitutions, and not all amino acids are equal:
FOS_RAT   IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPSTGAYARAGVV 112
FOS_MOUSE IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQSAGAYARAGMV 112
We could use regular expressions, but how do we handle indels?
FOS_RAT   IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPS-TGAYARAGVV 112
FOS_MOUSE IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQS-AGAYARAGMV 112
FOS_CHICK VPTVTAISTSPDLQWLVQPTLISSVAPSQNRG-HPYGVPAPAPPAAYSRPAVL 112
What about variable-length members of the family?
FOS_RAT    IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTPS-TGAYARAGVV 112
FOS_MOUSE  IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTQS-AGAYARAGMV 112
FOS_CHICK  VPTVTAISTSPDLQWLVQPTLISSVAPSQ-------NRG-HPYGVPAPAPPAAYSRPAVL 112
FOSB_MOUSE VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTS----YSTPGLS 110
FOSB_HUMAN VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTS----YSTPGMS 110

Why Hidden Markov Models? Rather than a consensus sequence, which describes only the most common amino acid per position, HMMs allow more than one amino acid to appear at each position. Rather than profiles as position-specific scoring matrices (PSSMs), which assign a probability to each amino acid at each position of the domain and slide a fixed-length profile along a longer sequence to calculate a score, HMMs model the probability of variable-length sequences. Rather than regular expressions, which can capture variable-length sequences yet specify only a limited subset of amino acids per position, HMMs quantify the difference among the amino acids used at each position.

Detecting Copy Number in Array Data. For array CGH data, a discrete number of copies is found by segmenting the array intensities along the chromosome; HMM segmentation versus naive smoothing. (http://www.cs.cmu.edu/~epxing/class/10810-05/lecture11.pdf)

Detecting Copy Number in Whole Genome Sequencing Data (ABI Bioscope manual, 2010). Compute the log ratio of observed coverage to expected coverage, fit an HMM with states for 0-9 copies, and assign a copy number to each region with the Viterbi algorithm.
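As a rough sketch of that setup (illustrative only, not the Bioscope method), the emission model for a 10-state copy-number HMM over per-bin log2 coverage ratios could be one Gaussian per state, after which the Viterbi sketch from earlier would label each bin (with the discrete B[:, obs[t]] lookup replaced by a density evaluation):

```python
import numpy as np

def copy_number_emissions(log_ratios, max_copies=9, sd=0.3, eps=1e-3):
    """Illustrative Gaussian emission densities for a 10-state (0..9 copies) HMM.
    State c has mean log2((c + eps) / 2) for a diploid genome; sd is an assumed noise level."""
    copies = np.arange(max_copies + 1)
    means = np.log2((copies + eps) / 2.0)
    x = np.asarray(log_ratios)[None, :]     # shape (1, bins)
    mu = means[:, None]                     # shape (states, 1)
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
```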

HMMs for Chromatin States. A specific amino acid of a specific histone protein, modified to a given level, can be tagged and assayed; e.g. H3K27me3 means 3 methyl groups have been added to the lysine at position 27 of histone 3. (Rodenhiser & Mann, CMAJ 2006, 174(3):341)

Combinations of Chromatin States. An HMM for the sequence of a single mark has states such as "has H3K27me3" or "no H3K27me3" (peak finding). However, peaks for a single mark can still be distributed all across the genome; which ones are important? Comparing across multiple signals identifies specific combinations that distinguish the important peaks in an individual signal (combinatorial patterns). (Barski et al., Cell 2007, 129:823-837)

Combination States. The learned model was optimized to Q = 51 labels (a.k.a. states), whose semantics were assigned post hoc based on prior biological knowledge, relation to gene models, gene expression data, and sequence conservation.

Multivariate HMMs for Chromatin States (http://www.nature.com/nbt/journal/v28/n8/pdf/nbt.1662.pdf). Ernst 2010 learned 51 distinct chromatin states, interpreted post hoc as promoter-associated, transcription-associated, active intergenic, large-scale repressed, and repeat-associated states.

Hot Topic: better than HMMs for chromatin states, Dynamic Bayesian Networks. A DBN allows specification of minimum/maximum feature lengths, a way to count them down ("memory"), and a way to enforce or disallow certain transitions. Recall the HMM structure: hidden states X_t-1 → X_t with observations O_t-1, O_t. In Segway (Hoffman et al. 2011), the hidden state of the model generates a sequence of observations for each of n chromatin/transcription-factor marks.

Hot Topic: better than HMMs for chromatin states. Segway (Hoffman et al. 2011) specifies Q = 25 labels (a.k.a. states); the semantics of the learned states are assigned post hoc based on prior biological knowledge.

Hot Topic: better than HMMs for chromatin states (figure; Segway by Hoffman et al. 2011).

Homology HMM Resources.
Great tutorial (Krogh 1998) **: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.7972&rep=rep1&type=pdf
WUSTL/Janelia (Eddy, Bioinformatics 1998, 14(9):755) **: Pfam, a database of pre-computed HMM alignments for various proteins; HMMer, a program for building HMMs.
UCSC (Haussler): SAM, for alignment, secondary structure predictions, HMM parameters, etc.
Chromatin states: Ernst et al., PMCID: PMC2919626; Segway: http://noble.gs.washington.edu/proj/segway/manuscript/segway.pdf


Other David Pollock Slides (2009)

Model Comparison. Based on P(D | θ, M). For maximum likelihood (ML), take max_θ P(D | θ, M), usually as ln P_max(D | θ, M) to avoid numeric error. For heuristics, the score is log2 P(D | θ_fixed, M). For Bayesian comparison, calculate P(θ, M | D) ∝ P(D | θ, M) · P(θ) · P(M), which uses prior information on the parameters, P(θ). Adapted from David Pollock's slides

Parameters, θ. Types of parameters: amino acid distributions for positions (match states); global amino acid distributions for insert states; order of match states; transition probabilities; phylogenetic tree topology and branch lengths; hidden states (integrate or augment). Wander the parameter space (search); maximize, or move according to posterior probability (Bayes). Adapted from David Pollock's slides

Expectation Maximization (EM). A classic algorithm for fitting the parameters of a probabilistic model with unobservable states. Two stages. Maximization: if the hidden variables (states) were known, maximize the model parameters with respect to that knowledge. Expectation: if the model parameters were known, find the expected values of the hidden variables (states). Works well, even with e.g. Bayesian approaches, for finding a near-equilibrium region of the space. Adapted from David Pollock's slides

Homology HMM EM. Start with a heuristic MSA (e.g., ClustalW). Maximization: match states are residues aligned in most sequences; use the amino acid frequencies observed in the columns. Expectation: realign all the sequences given the model. Repeat until convergence. Problems: this is local, not global, optimization; use procedures to check how well it worked. Adapted from David Pollock's slides

Model Comparison. Determining significance depends on comparing two models (family vs. non-family), usually a null model H_0 and a test model H_1. The models are nested if H_0 is a subset of H_1. If not nested: Akaike Information Criterion (AIC) [similar to empirical Bayes] or Bayes Factor (BF) [but be careful]. Generating a null distribution of the statistic: Z-factor, bootstrapping, parametric bootstrapping, posterior predictive. Adapted from David Pollock's slides

Z Test Method. Use a database of known negative controls, e.g., non-homologous (NH) sequences. Assume the NH scores are ~ N(μ, σ), i.e., model the known NH sequence scores as a normal distribution. Set an appropriate significance level for multiple comparisons (more below). Problems: is homology certain? Is it the appropriate null model? The normal distribution is often not a good approximation, and parameter control is hard, e.g., the length distribution. Adapted from David Pollock's slides
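A tiny sketch of the Z-test idea, with the normality assumption flagged above baked in (function and variable names are illustrative):

```python
import numpy as np

def z_score(score, null_scores):
    """Compare an HMM score to scores of known non-homologous sequences,
    treating the null scores as approximately normal (a strong assumption; see caveats above)."""
    mu, sd = np.mean(null_scores), np.std(null_scores, ddof=1)
    return (score - mu) / sd
```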

Bootstrapping and Parametric Models. Random sequences are sampled from the same set of emission probability distributions; keeping the same length is easy. Bootstrapping is re-sampling columns. The parametric approach uses estimated frequencies and may include variance, a tree, etc.; it is more flexible and can represent a more complex null. Use pseudocounts of global frequencies if data are limited. Insertions are relatively hard to model: what frequencies should insert states use? Global? Adapted from David Pollock's slides

Center for Genes, Environment, and Health 68