Introduction to Hidden Markov Models (HMMs)

But first, some probability and statistics background. Important topics:
1. Random variables and probability
2. Probability distributions
3. Parameter estimation
4. Hypothesis testing
5. Likelihood
6. Conditional probability
7. Stochastic processes
8. Inference for stochastic processes

Probability

The probability of a particular event occurring is the frequency of that event over a very long series of repetitions.
- P(tossing a head) = 0.50
- P(rolling a 6) = 0.167
- P(average age in a population sample is greater than 21) = 0.25

Random Variables

A random variable is a quantity that cannot be measured or predicted with absolute accuracy.
- X = age of an individual
- Y = length of a gene
- p̂ = fraction of nucleotides that are either G or C

Probability Distributions

The distribution of a random variable describes the possible values of the variable and the probabilities of each value. For discrete random variables, the distribution can be enumerated; for continuous ones we describe the distribution with a density function.

Examples of Distributions

Binomial: X ~ Bin(3, 0.5)

x    P(X = x)
0    0.125
1    0.375
2    0.375
3    0.125

Normal: Z ~ N(μ, σ²), with density

f(z; μ, σ²) = (1 / √(2πσ²)) exp(−(z − μ)² / (2σ²)),  and  P(a < Z < b) = ∫_a^b f(z; μ, σ²) dz
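A minimal Python check (not part of the original slides) that reproduces the Bin(3, 0.5) table above and evaluates the normal density; the function name is illustrative.

```python
from math import comb, exp, pi, sqrt

# Binomial pmf: P(X = x) for X ~ Bin(3, 0.5), reproducing the table above
n, p = 3, 0.5
for x in range(n + 1):
    print(x, comb(n, x) * p**x * (1 - p)**(n - x))   # 0.125, 0.375, 0.375, 0.125

# Normal density f(z; mu, sigma^2); P(a < Z < b) is its integral from a to b
def normal_pdf(z, mu, sigma2):
    return exp(-(z - mu)**2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

print(normal_pdf(0.0, 0.0, 1.0))   # ~0.3989, the standard normal density at 0
```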

Parameter Estimation

One of the primary goals of statistical inference is to estimate unknown parameters. For example, using a sample taken from the target population, we might estimate the population mean with several different statistics: the sample mean, the sample median, or the sample mode. Different statistics have different sampling properties.

Hypothesis Testing

A second goal of statistical inference is testing the validity of hypotheses about parameters using sample data:

H0: p = 0.5
HA: p > 0.5

If the observed frequency is much greater than 0.5, we should reject the null hypothesis in favor of the alternative hypothesis. How do we decide what "much greater" is?

Likelihood

For our purposes, it is sufficient to define the likelihood function as

L(θ) = Pr(data; parameter values) = Pr(X; θ)

Analyses based on the likelihood function are well studied and usually have excellent statistical properties.

Maximum Likelihood Estimation

The maximum likelihood estimate of an unknown parameter is defined to be the value of that parameter that maximizes the likelihood function:

θ̂ = argmax_θ L(θ) = argmax_θ Pr(X; θ)

We say that θ̂ is the maximum likelihood estimate of θ.

Example: Binomial Probability

If X ~ Bin(n, p), then

L(p) = [n! / (x!(n − x)!)] p^x (1 − p)^(n−x)

Some simple calculus shows that the MLE of p is p̂ = x/n, the frequency of successes in our sample of size n. If we had been unable to do the calculus, we could still have found the MLE by plotting the likelihood.

[Figure: the likelihood curve p^7 (1 − p)^3 plotted against p from 0 to 1, peaking at p = 0.7.]
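As a rough illustration (assuming the plotted curve corresponds to x = 7 successes in n = 10 trials, which is not stated explicitly on the slide), the MLE can also be found numerically by evaluating the likelihood on a grid:

```python
import numpy as np

# Grid search for the MLE of p when L(p) is proportional to p^7 (1 - p)^3
x, n = 7, 10
p_grid = np.linspace(0.001, 0.999, 999)
likelihood = p_grid**x * (1 - p_grid)**(n - x)   # constant factor omitted
p_hat = p_grid[np.argmax(likelihood)]
print(p_hat)   # ~0.7, i.e. x/n, the frequency of successes
```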

Likelihood Ratio Tests

Consider testing the hypotheses:

H0: θ = θ0
HA: θ > θ0

The likelihood ratio test statistic is

Λ = [max over θ = θ0 of L(θ)] / [max over θ > θ0 of L(θ)] = L0(θ̂0) / LA(θ̂A)

Distribution of the Likelihood Ratio Test Statistic

Under quite general conditions, −2 ln Λ ~ χ² with n − 1 degrees of freedom, where n − 1 is the difference between the number of free parameters in the two hypotheses.
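For example, with two fitted models in hand the asymptotic p-value is just a chi-squared tail probability; the log-likelihood values below are made up purely for illustration:

```python
from scipy.stats import chi2

def lrt_pvalue(loglik_null, loglik_alt, df):
    """-2 ln(Lambda) compared to a chi-squared distribution with df equal to the
    difference in the number of free parameters between the two hypotheses."""
    stat = -2.0 * (loglik_null - loglik_alt)
    return stat, chi2.sf(stat, df)

stat, p = lrt_pvalue(loglik_null=-112.4, loglik_alt=-108.9, df=1)  # illustrative values
print(stat, p)   # stat = 7.0, p ~ 0.008
```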

The Parametric Bootstrap

Why we need it: the conditions necessary for the asymptotic chi-squared distribution are not always satisfied.

What it is: a simulation-based approach for evaluating the p-value of a test statistic (often a likelihood ratio).

Parametric Bootstrap Procedure
1. Compute the LRT using the observed data.
2. Use the parameters estimated under the null hypothesis to simulate a new dataset of the same size as the observed data.
3. Compute the LRT for the simulated dataset.
4. Repeat steps 2 and 3, say, 1000 times.
5. Construct a histogram of the simulated LRTs.
6. The p-value for the test is the frequency of simulated LRTs that exceed the observed LRT.
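A minimal sketch of this procedure in an assumed toy setting (testing H0: p = 0.5 against HA: p > 0.5 for n = 100 coin flips); none of the specifics come from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p0 = 100, 0.5

def lrt(x):
    """Likelihood ratio statistic for H0: p = p0 vs HA: p > p0, given x successes."""
    p_hat = min(max(x / n, p0), 1 - 1e-12)       # MLE restricted to the alternative
    loglik = lambda p: x * np.log(p) + (n - x) * np.log(1 - p)
    return -2.0 * (loglik(p0) - loglik(p_hat))

observed_lrt = lrt(61)                                        # step 1 (61 heads observed)
simulated = [lrt(rng.binomial(n, p0)) for _ in range(1000)]   # steps 2-4
p_value = np.mean(np.array(simulated) >= observed_lrt)        # step 6
print(observed_lrt, p_value)
```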

Conditional Probability

The conditional probability of event A given that event B has happened is

Pr(A | B) = Pr(A and B) / Pr(B)

Example: Pr(2 | even number rolled) = Pr(2 and even) / Pr(even) = (1/6) / (1/2) = 1/3

Stochastic Processes

A stochastic process is a series of random variables measured over time; values in the future typically depend on current values.
- closing value of the stock market
- annual per capita murder rate
- current temperature

A sequence evolving over time:

ACGGTTACGGATTGTCGAA   t = 0
ACaGTTACGGATTGTCGAA   t = 1
ACaGTTACGGATgGTCGAA   t = 2
ACcGTTACGGATgGTCGAA   t = 3

Inference for Stochastic Processes

We often need to make inferences that involve the changes in molecular genetic sequences over time. Given a model for the process of sequence evolution, likelihood analyses can be performed:

Pr(sequence i → sequence j; t, θ)

Introduction to Hidden Markov Models (HMMs)

HMM: Hidden Markov Model
- Does this sequence come from a particular class? (Does the sequence contain a beta sheet?)
- What can we determine about the internal composition of the sequence if it belongs to this class? (Assuming this sequence contains a gene, where are the splice sites?)

Example: A Dishonest Casino

Suppose a casino usually uses fair dice (the probability of any side is 1/6), but occasionally switches briefly to unfair dice (the probability of a 6 is 1/2; all other sides have probability 1/10). We only observe the results of the tosses. Can we identify the tosses made with the biased dice?

The data we actually observe look like the following:

2 6 3 4 4 1 3 6 6 5 6 3 6 6 6 1 3 5 2 6 2 4 5

Which (if any) tosses were made using an unfair die?

Consider two candidate labelings (F = fair, U = unfair):

Tosses:      2 6 3 4 4 1 3 6 6 5 6 3 6 6 6 1 3 5 2 6 2 4 5
Scenario 1:  F F F F F F F F F F F F F F F F F F F F F F F
Scenario 2:  F F F F F F U U U U F F U U U F F F F F F F F

If the tosses were all made with a fair die (scenario 1), the probability of observing the series of tosses is

Pr(data) = (1/6)^23 ≈ 1.266 × 10^-18

If the indicated tosses were made with an unfair die (scenario 2), then the series of tosses has probability

Pr(data) = (1/6)^16 (1/2)^5 (1/10)^2 ≈ 1.108 × 10^-16

The series of tosses is 87.5 times more probable under scenario 2 than under scenario 1.

[Figure: two-state HMM for the dishonest casino. Hidden states Fair and Unfair, each with initial probability 0.5; transition probabilities between and within the states (0.9 and 0.1 in the diagram); emission probabilities: Fair state, each face 1/6; Unfair state, faces 1-5 each 1/10 and face 6 with probability 1/2.]
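A quick Python check of this arithmetic (emission probabilities only, as in the slide; transition probabilities are ignored here):

```python
# Tosses and the two candidate fair/unfair labelings from above
tosses = [2, 6, 3, 4, 4, 1, 3, 6, 6, 5, 6, 3, 6, 6, 6, 1, 3, 5, 2, 6, 2, 4, 5]
scenario1 = "F" * 23
scenario2 = "F" * 6 + "U" * 4 + "F" * 2 + "U" * 3 + "F" * 8

def emission(toss, state):
    if state == "F":
        return 1 / 6                       # fair die: every face 1/6
    return 1 / 2 if toss == 6 else 1 / 10  # unfair die: the 6 is loaded

def scenario_prob(states):
    p = 1.0
    for toss, state in zip(tosses, states):
        p *= emission(toss, state)
    return p

p1, p2 = scenario_prob(scenario1), scenario_prob(scenario2)
print(p1, p2, p2 / p1)   # ~1.27e-18, ~1.11e-16, ratio ~87.5
```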

The Likelihood Function

Pr(X; θ) = Σ over paths π of Pr(π) Pr(X | π; θ)

- Pr(X; θ): the probability of the data in terms of one or more unknown parameters; compute via the forward algorithm.
- Pr(π): the probability of the hidden state path (may depend on one or more unknown parameters).
- Pr(X | π; θ): the probability of the data GIVEN the hidden states, in terms of one or more unknown parameters.

Predicting the Hidden States

1. The most probable state path (compute via the Viterbi algorithm):

π̂ = argmax over paths π of P(X, π; θ) = argmax over paths π of Pr(π) Pr(X | π; θ)

2. Posterior state probabilities (compute via the forward and backward algorithms):

Pr(π_i = k | x; θ) = P(x, π_i = k; θ) / P(x; θ)
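A small Viterbi sketch for the dishonest-casino model above. The emission probabilities are the ones given earlier; the initial and transition probabilities (0.5/0.5 and 0.9/0.1) are assumptions read off the diagram, used here purely for illustration:

```python
import math

states = ["F", "U"]
start = {"F": 0.5, "U": 0.5}                                    # assumed initial probabilities
trans = {"F": {"F": 0.9, "U": 0.1}, "U": {"F": 0.1, "U": 0.9}}  # assumed transition matrix
emit = {"F": {k: 1 / 6 for k in range(1, 7)},
        "U": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}

def viterbi(obs):
    # dynamic programming in log space to avoid underflow
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for x in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: V[-1][r] + math.log(trans[r][s]))
            row[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][x])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # trace back the most probable state path
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

tosses = [2, 6, 3, 4, 4, 1, 3, 6, 6, 5, 6, 3, 6, 6, 6, 1, 3, 5, 2, 6, 2, 4, 5]
print(viterbi(tosses))   # prints a string of 23 F/U labels, one per toss
```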

Simple Gene Structure Model

[Figure: schematic of a gene with 5′ UTR, start codon, exons separated by introns, stop codon, and 3′ UTR.]

HMM Example: Gene Regions

[Figure: HMM with states for the 5′ UTR, start codon, exon, intron, stop codon, and 3′ UTR regions.]

Content sensor: a region of residues with similar properties (introns, exons).
Signal sensor: a specific signal sequence; might be a consensus sequence (start, stop codons).

Basic Gene-finding HMM

[Figure: state diagram connecting the states B, 5′, EI, E, ES, I, D, A, S, T, EF, 3′, F.]

State legend:
- B: begin sequence
- S: start translation
- D: donor splice site
- A: acceptor splice site
- T: stop translation
- F: end sequence
- 5′: 5′ untranslated region
- EI: initial exon
- ES: single exon
- E: exon
- I: intron
- EF: final exon
- 3′: 3′ untranslated region

OK, what do we do with it now?

The HMM must first be trained using a database of known genes:
- Consensus sequences are needed for all signal sensors.
- Compositional rules (i.e., emission probabilities) and length distributions are needed for the content sensors.
- Transition probabilities between all connected states must be estimated.

GENSCAN
Burge, C. and Karlin, S. (1997) Prediction of Complete Gene Structures in Human Genomic DNA. J. Mol. Biol. 268:78-94.
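Returning to the training step described above: when the state path of each training sequence is known (annotated genes), the emission and transition probabilities are simply normalized counts. A minimal sketch, with made-up state labels (E = exon, I = intron) and toy sequences:

```python
from collections import Counter, defaultdict

def train_hmm(labeled_seqs):
    """labeled_seqs: iterable of (sequence, state_path) pairs of equal length."""
    emit_counts, trans_counts = defaultdict(Counter), defaultdict(Counter)
    for seq, path in labeled_seqs:
        for symbol, state in zip(seq, path):
            emit_counts[state][symbol] += 1          # emission counts per state
        for a, b in zip(path, path[1:]):
            trans_counts[a][b] += 1                  # transition counts between states
    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}
    return ({s: normalize(c) for s, c in emit_counts.items()},
            {s: normalize(c) for s, c in trans_counts.items()})

emissions, transitions = train_hmm([("ATGCGC", "EEEIII"), ("ATGAAA", "EEEEII")])
print(emissions["E"])     # relative nucleotide frequencies emitted in the 'exon' state
print(transitions["E"])   # estimated P(E -> E) and P(E -> I)
```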

Basic features of GENSCAN
- HMM description of human genomic sequences, including:
  - transcriptional, translational, and splicing signals
  - length distributions and compositional features of introns, exons, and intergenic regions
  - distinct model parameters for regions with different GC compositions

Accuracy per nucleotide: Sn = 0.93, Sp = 0.93, AC = 0.91, CC = 0.92
Accuracy per exon: Sn = 0.78, Sp = 0.81, Avg = 0.80, ME = 0.09, WE = 0.05

Sn: sensitivity = Prob(true nucleotide or exon is predicted to be in a gene)
Sp: specificity = Prob(predicted nucleotide or exon is in a gene)
AC and CC are overall measures of accuracy, including both positive and negative predictions.
ME: Prob(a true exon is completely missed in the prediction)
WE: Prob(a predicted exon is not in a gene)

Profile HMMs

A multiple alignment of five sequences:

A A C T - - C T A
A T C T C - C G A
A G C T - - T G G
T G T T C T C T A
A A C T C - C G A

PROSITE regular expression: [AT][ATG][CT][T][CT]*[CT][GT][AG]

Consider the sequence T T T T T T T G. It matches the PROSITE expression at every position, but does it really match the profile?

Building a profile HMM from the alignment above (match states M1-M7 from the seven conserved columns, with one insert state between M4 and M5):

Match-state emission probabilities:

     M1    M2    M3    M4    M5    M6    M7
A    0.8   0.4   0     0     0     0     0.8
C    0     0     0.8   0     0.8   0     0
G    0     0.4   0     0     0     0.6   0.2
T    0.2   0.2   0.2   1.0   0.2   0.4   0

Insert-state emission probabilities: A 0, C 0.75, G 0, T 0.25.
Transition probabilities: M4 → M5 = 0.4, M4 → insert = 0.6, insert → insert = 0.25, insert → M5 = 0.75; all other match-to-match transitions are 1.0.

Scoring the query T T T T T T T G along the path M1 M2 M3 M4 (insert) M5 M6 M7, multiplying emission and transition probabilities:

0.2 × 1 × 0.2 × 1 × 0.2 × 1 × 1 × 0.6 × 0.25 × 0.75 × 0.2 × 1 × 0.4 × 1 × 0.2 ≈ 1.4 × 10^-5

So the sequence matches the PROSITE expression but is a poor match to the profile: most of its residues are emitted with low probability at each state.

General Profile HMM Structure

[Figure: the general profile HMM architecture, with match states M_j, insert states I_j, and delete states D_j connected between Begin and End.]
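To tie the profile HMM example together, here is a short sketch (using the alignment and state assignments as reconstructed above; nothing here is the slides' own code) that rebuilds the emission probabilities from the alignment columns and scores the query:

```python
from math import prod

alignment = ["AACT--CTA",
             "ATCTC-CGA",
             "AGCT--TGG",
             "TGTTCTCTA",
             "AACTC-CGA"]
match_cols = [0, 1, 2, 3, 6, 7, 8]      # columns treated as match states M1..M7
insert_cols = [4, 5]                    # insert state between M4 and M5

def column_freqs(cols):
    residues = [row[c] for row in alignment for c in cols if row[c] != "-"]
    return {b: residues.count(b) / len(residues) for b in "ACGT"}

match_emit = [column_freqs([c]) for c in match_cols]
insert_emit = column_freqs(insert_cols)
print(match_emit[0])    # {'A': 0.8, 'C': 0.0, 'G': 0.0, 'T': 0.2}
print(insert_emit)      # {'A': 0.0, 'C': 0.75, 'G': 0.0, 'T': 0.25}

# Score T T T T T T T G along the path M1 M2 M3 M4 -> insert -> M5 M6 M7
query = "TTTTTTTG"
path_emissions = [match_emit[i][query[i]] for i in range(4)]            # M1..M4
path_emissions += [insert_emit[query[4]]]                               # insert
path_emissions += [match_emit[4 + j][query[5 + j]] for j in range(3)]   # M5..M7
path_transitions = [1, 1, 1, 0.6, 0.75, 1, 1]
print(prod(path_emissions) * prod(path_transitions))   # ~1.44e-05, a poor match
```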