Introduction to Hidden Markov Models (HMMs)

But first, some probability and statistics background. Important topics:
1. Random variables and probability
2. Probability distributions
3. Parameter estimation
4. Hypothesis testing
5. Likelihood
6. Conditional probability
7. Stochastic processes
8. Inference for stochastic processes

Probability

The probability of a particular event occurring is the frequency of that event over a very long series of repetitions.
- P(tossing a head) = 0.50
- P(rolling a 6) = 0.167
- P(average age in a population sample is greater than 21) = 0.25

Random Variables

A random variable is a quantity that cannot be measured or predicted with absolute accuracy.
- X = age of an individual
- Y = length of a gene
- p̂ = fraction of nucleotides that are either G or C

Probability Distributions

The distribution of a random variable describes the possible values of the variable and the probabilities of each value. For discrete random variables, the distribution can be enumerated; for continuous ones we describe the distribution with a density function.

Examples of Distributions

Binomial: X ~ Bin(3, 0.5)

x    P(X = x)
0    0.125
1    0.375
2    0.375
3    0.125

Normal: Z ~ N(μ, σ²), with density

f(z; μ, σ²) = (1 / √(2πσ²)) exp(−(z − μ)² / (2σ²)),  and  P(a < Z < b) = ∫_a^b f(z; μ, σ²) dz
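A minimal Python check (not part of the original slides) that reproduces the Bin(3, 0.5) table above and evaluates the normal density; the function name is illustrative.

```python
from math import comb, exp, pi, sqrt

# Binomial pmf: P(X = x) for X ~ Bin(3, 0.5), reproducing the table above
n, p = 3, 0.5
for x in range(n + 1):
    print(x, comb(n, x) * p**x * (1 - p)**(n - x))   # 0.125, 0.375, 0.375, 0.125

# Normal density f(z; mu, sigma^2); P(a < Z < b) is its integral from a to b
def normal_pdf(z, mu, sigma2):
    return exp(-(z - mu)**2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

print(normal_pdf(0.0, 0.0, 1.0))   # ~0.3989, the standard normal density at 0
```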

Parameter Estimation

One of the primary goals of statistical inference is to estimate unknown parameters. For example, using a sample taken from the target population, we might estimate the population mean with several different statistics: the sample mean, the sample median, or the sample mode. Different statistics have different sampling properties.

Hypothesis Testing

A second goal of statistical inference is testing the validity of hypotheses about parameters using sample data:

H0: p = 0.5
HA: p > 0.5

If the observed frequency is much greater than 0.5, we should reject the null hypothesis in favor of the alternative hypothesis. How do we decide what "much greater" is?

Likelihood

For our purposes, it is sufficient to define the likelihood function as

L(θ) = Pr(data; parameter values) = Pr(X; θ)

Analyses based on the likelihood function are well studied and usually have excellent statistical properties.

Maximum Likelihood Estimation

The maximum likelihood estimate of an unknown parameter is defined to be the value of that parameter that maximizes the likelihood function:

θ̂ = argmax_θ L(θ) = argmax_θ Pr(X; θ)

We say that θ̂ is the maximum likelihood estimate of θ.

Example: Binomial Probability

If X ~ Bin(n, p), then

L(p) = [n! / (x!(n − x)!)] p^x (1 − p)^(n−x)

Some simple calculus shows that the MLE of p is p̂ = x/n, the frequency of successes in our sample of size n. If we had been unable to do the calculus, we could still have found the MLE by plotting the likelihood.

[Figure: the likelihood curve p^7 (1 − p)^3 plotted against p from 0 to 1, peaking at p = 0.7.]
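As a rough illustration (assuming the plotted curve corresponds to x = 7 successes in n = 10 trials, which is not stated explicitly on the slide), the MLE can also be found numerically by evaluating the likelihood on a grid:

```python
import numpy as np

# Grid search for the MLE of p when L(p) is proportional to p^7 (1 - p)^3
x, n = 7, 10
p_grid = np.linspace(0.001, 0.999, 999)
likelihood = p_grid**x * (1 - p_grid)**(n - x)   # constant factor omitted
p_hat = p_grid[np.argmax(likelihood)]
print(p_hat)   # ~0.7, i.e. x/n, the frequency of successes
```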

Likelihood Ratio Tests

Consider testing the hypotheses:

H0: θ = θ0
HA: θ > θ0

The likelihood ratio test statistic is

Λ = [max over θ = θ0 of L(θ)] / [max over θ > θ0 of L(θ)] = L0(θ̂0) / LA(θ̂A)

Distribution of the Likelihood Ratio Test Statistic

Under quite general conditions, −2 ln Λ ~ χ² with n − 1 degrees of freedom, where n − 1 is the difference between the number of free parameters in the two hypotheses.
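For example, with two fitted models in hand the asymptotic p-value is just a chi-squared tail probability; the log-likelihood values below are made up purely for illustration:

```python
from scipy.stats import chi2

def lrt_pvalue(loglik_null, loglik_alt, df):
    """-2 ln(Lambda) compared to a chi-squared distribution with df equal to the
    difference in the number of free parameters between the two hypotheses."""
    stat = -2.0 * (loglik_null - loglik_alt)
    return stat, chi2.sf(stat, df)

stat, p = lrt_pvalue(loglik_null=-112.4, loglik_alt=-108.9, df=1)  # illustrative values
print(stat, p)   # stat = 7.0, p ~ 0.008
```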

The Parametric Bootstrap

Why we need it: the conditions necessary for the asymptotic chi-squared distribution are not always satisfied.

What it is: a simulation-based approach for evaluating the p-value of a test statistic (often a likelihood ratio).

Parametric Bootstrap Procedure
1. Compute the LRT using the observed data.
2. Use the parameters estimated under the null hypothesis to simulate a new dataset of the same size as the observed data.
3. Compute the LRT for the simulated dataset.
4. Repeat steps 2 and 3, say, 1000 times.
5. Construct a histogram of the simulated LRTs.
6. The p-value for the test is the frequency of simulated LRTs that exceed the observed LRT.
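A minimal sketch of this procedure in an assumed toy setting (testing H0: p = 0.5 against HA: p > 0.5 for n = 100 coin flips); none of the specifics come from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p0 = 100, 0.5

def lrt(x):
    """Likelihood ratio statistic for H0: p = p0 vs HA: p > p0, given x successes."""
    p_hat = min(max(x / n, p0), 1 - 1e-12)       # MLE restricted to the alternative
    loglik = lambda p: x * np.log(p) + (n - x) * np.log(1 - p)
    return -2.0 * (loglik(p0) - loglik(p_hat))

observed_lrt = lrt(61)                                        # step 1 (61 heads observed)
simulated = [lrt(rng.binomial(n, p0)) for _ in range(1000)]   # steps 2-4
p_value = np.mean(np.array(simulated) >= observed_lrt)        # step 6
print(observed_lrt, p_value)
```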

Conditional Probability

The conditional probability of event A given that event B has happened is

Pr(A | B) = Pr(A and B) / Pr(B)

Example: Pr(2 | even number rolled) = Pr(2 and even) / Pr(even) = (1/6) / (1/2) = 1/3

Stochastic Processes

A stochastic process is a series of random variables measured over time; values in the future typically depend on current values.
- closing value of the stock market
- annual per capita murder rate
- current temperature

A sequence evolving over time:

ACGGTTACGGATTGTCGAA   t = 0
ACaGTTACGGATTGTCGAA   t = 1
ACaGTTACGGATgGTCGAA   t = 2
ACcGTTACGGATgGTCGAA   t = 3

Inference for Stochastic Processes

We often need to make inferences that involve the changes in molecular genetic sequences over time. Given a model for the process of sequence evolution, likelihood analyses can be performed:

Pr(sequence i → sequence j; t, θ)

Introduction to Hidden Markov Models (HMMs)

HMM: Hidden Markov Model
- Does this sequence come from a particular class? (Does the sequence contain a beta sheet?)
- What can we determine about the internal composition of the sequence if it belongs to this class? (Assuming this sequence contains a gene, where are the splice sites?)

Example: A Dishonest Casino

Suppose a casino usually uses fair dice (the probability of any side is 1/6), but occasionally switches briefly to unfair dice (the probability of a 6 is 1/2; all other sides have probability 1/10). We only observe the results of the tosses. Can we identify the tosses made with the biased dice?

The data we actually observe look like the following:

2 6 3 4 4 1 3 6 6 5 6 3 6 6 6 1 3 5 2 6 2 4 5

Which (if any) tosses were made using an unfair die?

Consider two candidate labelings (F = fair, U = unfair):

Tosses:      2 6 3 4 4 1 3 6 6 5 6 3 6 6 6 1 3 5 2 6 2 4 5
Scenario 1:  F F F F F F F F F F F F F F F F F F F F F F F
Scenario 2:  F F F F F F U U U U F F U U U F F F F F F F F

If the tosses were all made with a fair die (scenario 1), the probability of observing the series of tosses is

Pr(data) = (1/6)^23 ≈ 1.266 × 10^-18

If the indicated tosses were made with an unfair die (scenario 2), then the series of tosses has probability

Pr(data) = (1/6)^16 (1/2)^5 (1/10)^2 ≈ 1.108 × 10^-16

The series of tosses is 87.5 times more probable under scenario 2 than under scenario 1.

[Figure: two-state HMM for the dishonest casino. Hidden states Fair and Unfair, each with initial probability 0.5; transition probabilities between and within the states (0.9 and 0.1 in the diagram); emission probabilities: Fair state, each face 1/6; Unfair state, faces 1-5 each 1/10 and face 6 with probability 1/2.]
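A quick Python check of this arithmetic (emission probabilities only, as in the slide; transition probabilities are ignored here):

```python
# Tosses and the two candidate fair/unfair labelings from above
tosses = [2, 6, 3, 4, 4, 1, 3, 6, 6, 5, 6, 3, 6, 6, 6, 1, 3, 5, 2, 6, 2, 4, 5]
scenario1 = "F" * 23
scenario2 = "F" * 6 + "U" * 4 + "F" * 2 + "U" * 3 + "F" * 8

def emission(toss, state):
    if state == "F":
        return 1 / 6                       # fair die: every face 1/6
    return 1 / 2 if toss == 6 else 1 / 10  # unfair die: the 6 is loaded

def scenario_prob(states):
    p = 1.0
    for toss, state in zip(tosses, states):
        p *= emission(toss, state)
    return p

p1, p2 = scenario_prob(scenario1), scenario_prob(scenario2)
print(p1, p2, p2 / p1)   # ~1.27e-18, ~1.11e-16, ratio ~87.5
```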

The Likelihood Function

Pr(X; θ) = Σ over paths π of Pr(π) Pr(X | π; θ)

- Pr(X; θ): the probability of the data in terms of one or more unknown parameters; compute via the forward algorithm.
- Pr(π): the probability of the hidden state path (may depend on one or more unknown parameters).
- Pr(X | π; θ): the probability of the data GIVEN the hidden states, in terms of one or more unknown parameters.

Predicting the Hidden States

1. The most probable state path (compute via the Viterbi algorithm):

π̂ = argmax over paths π of P(X, π; θ) = argmax over paths π of Pr(π) Pr(X | π; θ)

2. Posterior state probabilities (compute via the forward and backward algorithms):

Pr(π_i = k | x; θ) = P(x, π_i = k; θ) / P(x; θ)
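A small Viterbi sketch for the dishonest-casino model above. The emission probabilities are the ones given earlier; the initial and transition probabilities (0.5/0.5 and 0.9/0.1) are assumptions read off the diagram, used here purely for illustration:

```python
import math

states = ["F", "U"]
start = {"F": 0.5, "U": 0.5}                                    # assumed initial probabilities
trans = {"F": {"F": 0.9, "U": 0.1}, "U": {"F": 0.1, "U": 0.9}}  # assumed transition matrix
emit = {"F": {k: 1 / 6 for k in range(1, 7)},
        "U": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}

def viterbi(obs):
    # dynamic programming in log space to avoid underflow
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for x in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: V[-1][r] + math.log(trans[r][s]))
            row[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][x])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # trace back the most probable state path
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

tosses = [2, 6, 3, 4, 4, 1, 3, 6, 6, 5, 6, 3, 6, 6, 6, 1, 3, 5, 2, 6, 2, 4, 5]
print(viterbi(tosses))   # prints a string of 23 F/U labels, one per toss
```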

Simple Gene Structure Model

[Figure: schematic of a gene with 5′ UTR, start codon, exons separated by introns, stop codon, and 3′ UTR.]

HMM Example: Gene Regions

[Figure: HMM with states for the 5′ UTR, start codon, exon, intron, stop codon, and 3′ UTR regions.]

Content sensor: a region of residues with similar properties (introns, exons).
Signal sensor: a specific signal sequence; might be a consensus sequence (start, stop codons).

Basic Gene-finding HMM

[Figure: state diagram connecting the states B, 5′, EI, E, ES, I, D, A, S, T, EF, 3′, F.]

State legend:
- B: begin sequence
- S: start translation
- D: donor splice site
- A: acceptor splice site
- T: stop translation
- F: end sequence
- 5′: 5′ untranslated region
- EI: initial exon
- ES: single exon
- E: exon
- I: intron
- EF: final exon
- 3′: 3′ untranslated region

OK, what do we do with it now?

The HMM must first be trained using a database of known genes:
- Consensus sequences are needed for all signal sensors.
- Compositional rules (i.e., emission probabilities) and length distributions are needed for the content sensors.
- Transition probabilities between all connected states must be estimated.

GENSCAN
Burge, C. and Karlin, S. (1997) Prediction of Complete Gene Structures in Human Genomic DNA. J. Mol. Biol. 268:78-94.
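Returning to the training step described above: when the state path of each training sequence is known (annotated genes), the emission and transition probabilities are simply normalized counts. A minimal sketch, with made-up state labels (E = exon, I = intron) and toy sequences:

```python
from collections import Counter, defaultdict

def train_hmm(labeled_seqs):
    """labeled_seqs: iterable of (sequence, state_path) pairs of equal length."""
    emit_counts, trans_counts = defaultdict(Counter), defaultdict(Counter)
    for seq, path in labeled_seqs:
        for symbol, state in zip(seq, path):
            emit_counts[state][symbol] += 1          # emission counts per state
        for a, b in zip(path, path[1:]):
            trans_counts[a][b] += 1                  # transition counts between states
    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}
    return ({s: normalize(c) for s, c in emit_counts.items()},
            {s: normalize(c) for s, c in trans_counts.items()})

emissions, transitions = train_hmm([("ATGCGC", "EEEIII"), ("ATGAAA", "EEEEII")])
print(emissions["E"])     # relative nucleotide frequencies emitted in the 'exon' state
print(transitions["E"])   # estimated P(E -> E) and P(E -> I)
```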

Basic features of GENSCAN
- HMM description of human genomic sequences, including:
  - transcriptional, translational, and splicing signals
  - length distributions and compositional features of introns, exons, and intergenic regions
  - distinct model parameters for regions with different GC compositions

Accuracy per nucleotide: Sn = 0.93, Sp = 0.93, AC = 0.91, CC = 0.92
Accuracy per exon: Sn = 0.78, Sp = 0.81, Avg = 0.80, ME = 0.09, WE = 0.05

Sn: sensitivity = Prob(true nucleotide or exon is predicted to be in a gene)
Sp: specificity = Prob(predicted nucleotide or exon is in a gene)
AC and CC are overall measures of accuracy, including both positive and negative predictions.
ME: Prob(a true exon is completely missed in the prediction)
WE: Prob(a predicted exon is not in a gene)

Profile HMMs

A multiple alignment of five sequences:

A A C T - - C T A
A T C T C - C G A
A G C T - - T G G
T G T T C T C T A
A A C T C - C G A

PROSITE regular expression: [AT][ATG][CT][T][CT]*[CT][GT][AG]

Consider the sequence T T T T T T T G. It matches the PROSITE expression at every position, but does it really match the profile?

Building a profile HMM from the alignment above (match states M1-M7 from the seven conserved columns, with one insert state between M4 and M5):

Match-state emission probabilities:

     M1    M2    M3    M4    M5    M6    M7
A    0.8   0.4   0     0     0     0     0.8
C    0     0     0.8   0     0.8   0     0
G    0     0.4   0     0     0     0.6   0.2
T    0.2   0.2   0.2   1.0   0.2   0.4   0

Insert-state emission probabilities: A 0, C 0.75, G 0, T 0.25.
Transition probabilities: M4 → M5 = 0.4, M4 → insert = 0.6, insert → insert = 0.25, insert → M5 = 0.75; all other match-to-match transitions are 1.0.

Scoring the query T T T T T T T G along the path M1 M2 M3 M4 (insert) M5 M6 M7, multiplying emission and transition probabilities:

0.2 × 1 × 0.2 × 1 × 0.2 × 1 × 1 × 0.6 × 0.25 × 0.75 × 0.2 × 1 × 0.4 × 1 × 0.2 ≈ 1.4 × 10^-5

So the sequence matches the PROSITE expression but is a poor match to the profile: most of its residues are emitted with low probability at each state.

General Profile HMM Structure

[Figure: the general profile HMM architecture, with match states M_j, insert states I_j, and delete states D_j connected between Begin and End.]
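To tie the profile HMM example together, here is a short sketch (using the alignment and state assignments as reconstructed above; nothing here is the slides' own code) that rebuilds the emission probabilities from the alignment columns and scores the query:

```python
from math import prod

alignment = ["AACT--CTA",
             "ATCTC-CGA",
             "AGCT--TGG",
             "TGTTCTCTA",
             "AACTC-CGA"]
match_cols = [0, 1, 2, 3, 6, 7, 8]      # columns treated as match states M1..M7
insert_cols = [4, 5]                    # insert state between M4 and M5

def column_freqs(cols):
    residues = [row[c] for row in alignment for c in cols if row[c] != "-"]
    return {b: residues.count(b) / len(residues) for b in "ACGT"}

match_emit = [column_freqs([c]) for c in match_cols]
insert_emit = column_freqs(insert_cols)
print(match_emit[0])    # {'A': 0.8, 'C': 0.0, 'G': 0.0, 'T': 0.2}
print(insert_emit)      # {'A': 0.0, 'C': 0.75, 'G': 0.0, 'T': 0.25}

# Score T T T T T T T G along the path M1 M2 M3 M4 -> insert -> M5 M6 M7
query = "TTTTTTTG"
path_emissions = [match_emit[i][query[i]] for i in range(4)]            # M1..M4
path_emissions += [insert_emit[query[4]]]                               # insert
path_emissions += [match_emit[4 + j][query[5 + j]] for j in range(3)]   # M5..M7
path_transitions = [1, 1, 1, 0.6, 0.75, 1, 1]
print(prod(path_emissions) * prod(path_transitions))   # ~1.44e-05, a poor match
```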