Biology 644: Bioinformatics

Markov Models & DNA Sequence Evolution



A Markov model is a stochastic (probabilistic) model that assumes the Markov property. The Markov property is satisfied when the conditional probability distribution of future states of the process (conditional on both past and present values) depends only upon the present state; that is, given the present, the future does not depend on the past.

A Markov chain (discrete-time Markov chain, or DTMC) undergoes transitions from one state to another in a state space. It is a random process usually characterized as memoryless: the next state depends only on the current state and not on the sequence of events that preceded it.

A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. HMMs were first used in speech and handwriting recognition. In biology, they are frequently used to predict gene tracks, splice sites, chromatin states, CpG islands, protein folding conformations, binding sites, CNVs, etc.

N refers to the amount of sequence memory in the model: how many positions in the sequence you have to look back to predict the next symbol. Mononucleotide content is a 0th Order Markov Model, best visualized by a State Transition Diagram: a Start state enters (with probability 1.0) a single emission state with nucleotide probabilities A = .31, C = .24, G = .27, T = .18; the emission state loops back to itself with probability .999 and transitions to the End state with probability .001.

Random Walk: ATGCGAGATCAAGC...
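The 0th order model above can be sampled directly: each position is drawn independently from the mononucleotide frequencies, and after every emission the chain moves to End with probability .001. A minimal sketch (the frequencies are the ones from the slide; the function name is illustrative):

```python
import random

def sample_0th_order(freqs, p_end=0.001, rng=None):
    """Generate a random walk from a 0th order Markov model.

    freqs: dict mapping nucleotide -> emission probability.
    p_end: probability of transitioning to the End state after each emission.
    """
    rng = rng or random.Random()
    bases, weights = zip(*freqs.items())
    seq = []
    while True:
        seq.append(rng.choices(bases, weights=weights)[0])
        if rng.random() < p_end:      # .001 chance of reaching End
            return "".join(seq)

seq = sample_0th_order({"A": .31, "C": .24, "G": .27, "T": .18},
                       rng=random.Random(0))
print(seq[:20])  # a walk such as ATGCGAGATCAAGC...
```

Because the model has no memory, each nucleotide is independent of the one before it; the higher-order models below drop that assumption.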

A 1st Order Markov Model looks back 1 position in the sequence (CGATCGATCGATAC...) to determine the probability of the next nucleotide, and is best visualized by a State Transition Diagram. Two additional states, Begin and End, model the starting and ending behavior for a sequence.

A 2nd Order Markov Model looks back 2 positions in the sequence (CGATCGATCGATAC...) to determine the probability of the next nucleotide. For the 1st Order Model there were 4 states in the State Transition Diagram (not including the Begin and End States). How many will there be in the 2nd Order Model? How many transitions will there be in the 2nd Order Model? The diagram has 16 dinucleotide states: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT.
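The state count for the 2nd order case can be checked by enumerating dinucleotides (a small sketch; per the formula L^N used below, a 2nd order model over a 4-letter alphabet has 4^2 = 16 states):

```python
from itertools import product

# Each state of a 2nd order model over {A,C,G,T} is a dinucleotide.
alphabet = "ACGT"
states = ["".join(p) for p in product(alphabet, repeat=2)]
print(len(states))   # 16 states
print(states[:4])    # ['AA', 'AC', 'AG', 'AT']
```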

In general, there are L^N states in a State Transition Diagram of an Nth Order Markov Model, where L is the size of the alphabet. A 5th Order Markov Model to model genomic nucleotide biases will have 4^5 states. In general, there are (L^N)^2 transitions (parameters to train) in an Nth Order Markov Model, so a 5th Order Markov Model to model genomic nucleotide biases will have (4^5)^2 transitions to train.

Higher order Markov Models are able to capture longer range dependencies between the positions. If there aren't dependencies longer than N positions apart, then an (N+1)th Order Markov Model won't be any more accurate than an Nth Order Markov Model. Higher order Markov Models become more difficult to train due to the exponential increase in the number of parameters to train, and State Transition Diagrams also become too large to draw out for higher order models.

Generally, in order to computationally model Nth Order Markov Models, we count (N+1)-mers and N-mers in order to build Markov chains of conditional probabilities.

1st Order Markov Model example: look back 1 position in the sequence CGATCGATCGATAC... to determine the probability of the next nucleotide. Count 2-mers and 1-mers and calculate frequencies:

P(CGATCGATCGATAC...) = P(CG) · P(A|G) · P(T|A) · P(C|T) · P(G|C) · P(A|G) · ...
                     = P(CG) · [P(GA)/P(G)] · [P(AT)/P(A)] · [P(TC)/P(T)] · [P(CG)/P(C)] · [P(GA)/P(G)] · ...

2nd Order Markov Model example: look back 2 positions in the sequence CGATCGATCGATAC... to determine the probability of the next nucleotide. Count 3-mers and 2-mers and calculate frequencies:

P(CGATCGATCGATAC...) = P(CGA) · P(T|GA) · P(C|AT) · P(G|TC) · P(A|CG) · P(T|GA) · ...
                     = P(CGA) · [P(GAT)/P(GA)] · [P(ATC)/P(AT)] · [P(TCG)/P(TC)] · [P(CGA)/P(CG)] · [P(GAT)/P(GA)] · ...
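The counting recipe above can be sketched in a few lines: estimate each conditional probability as the ratio of an (N+1)-mer count to an N-mer count, then multiply along the sequence. This is a simplified illustration (it conditions on the first N symbols instead of multiplying in their marginal probability, and real gene finders work in log space and smooth the counts):

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count all overlapping k-mers in seq."""
    return Counter(seq[i:i+k] for i in range(len(seq) - k + 1))

def markov_prob(seq, train, order=1):
    """Probability of seq under an order-N Markov chain trained on `train`.

    P(next | context) is estimated as count(context + next) / count(context),
    i.e. the ratio of (N+1)-mer to N-mer frequencies.
    """
    big = kmer_counts(train, order + 1)
    small = kmer_counts(train[:-1], order)  # contexts that have a successor
    p = 1.0
    for i in range(order, len(seq)):
        context, nxt = seq[i-order:i], seq[i]
        p *= big[context + nxt] / small[context]
    return p

train = "CGATCGATCGATCGAT"
print(markov_prob("CGAT", train, order=1))  # 1.0: CGAT is the only pattern seen
```

On the perfectly repetitive training sequence every observed context has a single successor, so the repeat scores probability 1 and any off-pattern sequence scores 0.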

The 3rd and 5th Order Markov Models are much better at recapitulating the observed 8-mer counts in the Random Pool (e.g. TTTTTTTT and 8-mers with one A and the rest Ts). The T -> A -> G -> C biases are due to melting rates in the PCR reactions and synthesis biases.

The transition probabilities of Markov Models can be conveniently organized into a matrix. For a two-state model:

        S1    S2
  S1   5/6   1/6
  S2   1/2   1/2

Random Walk: ...

For a four-state nucleotide model:

        A     C     G     T
  A    1/6   5/6    0     0
  C    1/8   1/2   1/4   1/8
  G     0    1/3   1/2   1/6
  T     0    1/6   1/2   1/3

Random Walk: ACGGCGTGGCCGGCG...

What kind of DNA sequence might this transition matrix model?
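Sampling from the four-state matrix makes its bias visible: rows are the current state, columns the next-state probabilities (a sketch; the matrix values are the ones from the slide):

```python
import random

# Transition matrix from the slide: P[current][next]
P = {
    "A": {"A": 1/6, "C": 5/6, "G": 0,   "T": 0},
    "C": {"A": 1/8, "C": 1/2, "G": 1/4, "T": 1/8},
    "G": {"A": 0,   "C": 1/3, "G": 1/2, "T": 1/6},
    "T": {"A": 0,   "C": 1/6, "G": 1/2, "T": 1/3},
}

def random_walk(P, start, steps, rng):
    """Follow the chain for `steps` transitions from `start`."""
    state, out = start, [start]
    for _ in range(steps):
        state = rng.choices(list(P[state]),
                            weights=list(P[state].values()))[0]
        out.append(state)
    return "".join(out)

walk = random_walk(P, "A", 20, random.Random(1))
print(walk)
```

Most of the probability mass sits on C and G in every row, so walks come out GC-rich, consistent with the slide's sample walk ACGGCGTGGCCGGCG...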

The transition probabilities of a Markov Model, organized into a matrix P (illustrated on the slides with the two-state matrix with rows 5/6, 1/6 and 1/2, 1/2), let us propagate distributions over time. Let π0 be the vector of initial probabilities; then π0ᵀP represents the probability distribution after 1 time step, and in general π0ᵀPⁿ gives the distribution at time n. So from an initial probability vector, we can calculate the probability distribution at time n in the model by taking n powers of the probability transition matrix P.

A probability distribution π satisfying πᵀ = πᵀP is stationary, because the transition matrix does not change the probabilities of the states of the process. Such a distribution exists and is unique if the Markov chain is ergodic, or irreducible: if it is possible to eventually get from every state to every other state with positive probability. The stationary distribution describes the long-term behavior of the process: for any initial distribution π0, π0ᵀPⁿ tends to πᵀ as n → ∞.
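For the two-state matrix with rows (5/6, 1/6) and (1/2, 1/2), repeated multiplication shows π0ᵀPⁿ converging; for this chain the stationary distribution works out to (3/4, 1/4). A small check in pure Python (the starting distribution is chosen arbitrarily to show that convergence doesn't depend on it):

```python
def step(pi, P):
    """One time step: row vector pi times matrix P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

P = [[5/6, 1/6],
     [1/2, 1/2]]

pi = [0.0, 1.0]          # arbitrary initial distribution
for _ in range(50):      # compute pi * P^50 by repeated multiplication
    pi = step(pi, P)

print(pi)  # approaches the stationary distribution [0.75, 0.25]

# Stationarity check: pi * P == pi (up to floating-point error)
assert all(abs(a - b) < 1e-9 for a, b in zip(step(pi, P), pi))
```

The chain is irreducible (both states reach each other with positive probability), so this limit is the same for every starting π0.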

When at least some of the data labels are missing (hidden) in the training data, then we must infer (label) the missing hidden states. This requires correctly inferring the model topology of the system, and it requires a lot of training data to train the additional parameters; successful training of the parameters is highly dependent on the initial conditions.

Famous example: the Occasionally Dishonest Casino. The slides compare the model used to generate the rolls against models trained with 300 rolls and with 30,000 rolls.
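The dishonest casino can be written down as a tiny HMM: a fair die and a loaded die, with occasional hidden switches between them. A sketch of the generative process (the transition and emission numbers are the commonly used ones from Durbin et al., not stated on this slide):

```python
import random

# Hidden states: fair die "F", loaded die "L" (loaded favors sixes).
TRANS = {"F": {"F": 0.95, "L": 0.05},
         "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": [1/6] * 6,                 # fair: uniform over 1..6
        "L": [1/10] * 5 + [1/2]}        # loaded: P(6) = 1/2

def roll_sequence(n, rng):
    """Generate n rolls plus the hidden state path that produced them."""
    state, states, rolls = "F", [], []
    for _ in range(n):
        states.append(state)
        rolls.append(rng.choices(range(1, 7), weights=EMIT[state])[0])
        state = rng.choices(list(TRANS[state]),
                            weights=list(TRANS[state].values()))[0]
    return rolls, states

rolls, states = roll_sequence(300, random.Random(0))
print(rolls[:10], states[:10])
```

An observer sees only the rolls; recovering the F/L path is the decoding problem, and estimating TRANS and EMIT from 300 versus 30,000 rolls is the training problem the slide compares.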
