Markov Models & DNA Sequence Evolution

Similar documents
Predicting RNA Secondary Structure

Biology 644: Bioinformatics

Today s Lecture: HMMs

Hidden Markov Models (I)

Introduction to Hidden Markov Models for Gene Prediction ECE-S690

1/22/13. Example: CpG Island. Question 2: Finding CpG Islands

O 3 O 4 O 5. q 3. q 4. Transition

Lecture 3: Markov chains.

Markov Chains and Hidden Markov Models. = stochastic, generative models

HMMs and biological sequence analysis

Hidden Markov Models. music recognition. deal with variations in - pitch - timing - timbre 2

Introduction to Hidden Markov Models (HMMs)

Markov Chains and Hidden Markov Models. COMP 571 Luay Nakhleh, Rice University

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

Hidden Markov Models (HMMs) November 14, 2017

6.047/6.878/HST.507 Computational Biology: Genomes, Networks, Evolution. Lecture 05. Hidden Markov Models Part II

Hidden Markov Models for biological sequence analysis I

Interpolated Markov Models for Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Example: The Dishonest Casino. Hidden Markov Models. Question # 1 Evaluation. The dishonest casino model. Question # 3 Learning. Question # 2 Decoding

Computational Biology and Chemistry

ROBI POLIKAR. ECE 402/504 Lecture Hidden Markov Models IGNAL PROCESSING & PATTERN RECOGNITION ROWAN UNIVERSITY

Lecture 7 Sequence analysis. Hidden Markov Models

Hidden Markov Models. based on chapters from the book Durbin, Eddy, Krogh and Mitchison Biological Sequence Analysis via Shamir s lecture notes

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

Sequence Analysis. BBSI 2006: Lecture #(χ+1) Takis Benos (2006) BBSI MAY P. Benos

Bioinformatics 1--lectures 15, 16. Markov chains Hidden Markov models Profile HMMs

Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98)

Hidden Markov Models for biological sequence analysis

Data Mining in Bioinformatics HMM

Lecture 9. Intro to Hidden Markov Models (finish up)

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

Computational Genomics and Molecular Biology, Fall

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Hidden Markov Models in Bioinformatics 14.11

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

Hidden Markov Models. x 1 x 2 x 3 x K

Hidden Markov Models. Ron Shamir, CG 08

Dynamic Approaches: The Hidden Markov Model

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

Bioinformatics 2 - Lecture 4

Sequence Analysis. BBSI 2006: Lecture #(χ+1) Takis Benos (2006) BBSI MAY P. Benos. Molecular Genetics 101. Cell s internal world

Hidden Markov Models

CSCE 478/878 Lecture 9: Hidden. Markov. Models. Stephen Scott. Introduction. Outline. Markov. Chains. Hidden Markov Models. CSCE 478/878 Lecture 9:

Stephen Scott.

COMS 4771 Probabilistic Reasoning via Graphical Models. Nakul Verma

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing

Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from:

Gibbs Sampling Methods for Multiple Sequence Alignment

Gene Prediction in Eukaryotes Using Machine Learning

The Computational Problem. We are given a sequence of DNA and we wish to know which subsequence or concatenation of subsequences constitutes a gene.

Computational Genomics and Molecular Biology, Fall

An Introduction to Bioinformatics Algorithms Hidden Markov Models

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Hidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010

Hidden Markov Models. Terminology, Representation and Basic Problems

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

Introduction to Machine Learning CMU-10701

Stochastic processes and

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

Evolutionary Models. Evolutionary Models

Hidden Markov Models

CS 188: Artificial Intelligence Fall 2011

DNA Feature Sensors. B. Majoros

Hidden Markov Models

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

An Introduction to Bioinformatics Algorithms Hidden Markov Models

HIDDEN MARKOV MODELS

Basic math for biology

TMHMM2.0 User's guide

Approximate Inference

Stephen Scott.

Hidden Markov Models

Hidden Markov Models. x 1 x 2 x 3 x K

Hidden Markov Models. Three classic HMM problems

BMI/CS 776 Lecture 4. Colin Dewey

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

STA 414/2104: Machine Learning

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010

HMM part 1. Dr Philip Jackson

CS 188: Artificial Intelligence Fall Recap: Inference Example

Hidden Markov Methods. Algorithms and Implementation

GEP Annotation Report

Hidden Markov models in population genetics and evolutionary biology

CSCE 471/871 Lecture 3: Markov Chains and

VL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015

7.36/7.91 recitation CB Lecture #4

Motivating the need for optimal sequence alignments...

Hidden Markov Models

Small RNA in rice genome

Lecture 12: Algorithms for HMMs

Transcription:

7.91 / 7.36 / BE.490 Lecture #5 Mar. 9, 2004 Markov Models & DNA Sequence Evolution Chris Burge

Review of Markov & HMM Models for DNA Markov Models for splice sites Hidden Markov Models - looking under the hood The Viterbi Algorithm Real World HMMs Ch. 4 of Mount

CpG Islands 60 50 40 30 %C+G

CpG Island Hidden Markov Model P ig = 0.001 P = 0.99999 gg Pii = 0.999 Hidden Genome P gi = 0.00001 Island A C T C G A G T A Observable C G A T CpG Island: 0.3 0.3 0.2 0.2 Genome: 0.2 0.2 0.3 0.3

CpG Island HMM II Transition probabilities P = 0.99999 P ig = 0.001 Island gg Genome P gi = 0.00001 P ii = 0.999 A C T C G A G T A Emission Probabilities C G A T CpG Island: 0.3 0.3 0.2 0.2 Genome: 0.2 0.2 0.3 0.3

CpG Island HMM III Want to infer A C T C G A G T A Observe But HMM is written in the other direction (observable depends on hidden)

Inferring the Hidden from the Observable (Bayes Rule) P ( H = h 1, h 2,..., h n O = o 1, o 2,..., o n ) = = P ( H = h 1,..., h n, O = o 1,..., o n ) P (O = o 1,..., o n ) Conditional Prob: P(A B) = P(A,B)/P(B) P ( H = h 1,..., h n )P(O = o 1,..., o n H = h 1,..., h n ) P (O = o 1,..., o n ) P (O = o 1,..., o n ) somewhat difficult to calculate But notice: P ( H = h 1,..., h n, O = o 1,..., o n ) > P ( H = h 1,..., h n, O = o 1,..., o n ) implies P (H = h 1,..., h n O = o 1,..., o n ) > P (H = h 1,..., h n O = o 1,..., o n ) so can treat P (O = o 1,..., o n ) as a constant

Finding the Optimal Parse (Viterbi Algorithm) H opt opt opt Want to find sequence of hidden states = h 1, h 3,... opt, h 2 which maximizes joint probability: P( H = h 1,..., h n, O = o 1,..., o n ) Solution: (optimal parse of sequence) ( h) = probability of optimal parse of the Define R i subsequence 1..i ending in state h ( h) Solve recursively, i.e. determine R ( 2h) in terms of R1, etc. A. Viterbi, an MIT BS/MEng student in E.E. - founder of Qualcomm

Trellis Diagram for Viterbi Algorithm Position in Sequence 1 i i+1 i+2 i+3 i+4 L Hidden States T A T C G C A Run time for k-state HMM on sequence of length L?

Viterbi Algorithm Examples What is the optimal parse of the sequence: (ACGT) 10000 A 1000 C 80 T 1000 C 40 A 1000 G 60 T 1000 Powers of 1.5: N = 20 40 60 80 (1.5) N = 3x10 3 1x10 7 3x10 10 1x10 14

What else can you model with HMMs? Bacterial gene: Open Reading Frame ORF Start Stop

HMMs developed Useful HMMs developed

Parameter Estimation for HMMs How many parameters for a k-state HMM over an alphabet of size 4? Initial probabilities: Transition probabilities: Emission probabilities:

Pseudocounts Courtesy of M. Yaffe If the number of sequences in the training set is both large and diverse, then the sequences in the training set represent a good statistical sampling of the motif.if not, then we have a sampling error! Correct for this by adding pseudocounts. How many to add? Too many pseudocounts dominate the frequencies and the resulting matrix won t work! Too few pseudocounts then we ll miss many amino acid variations, and matrix will only find sequences that produced the motif! Add few pseudocounts if sampling is good (robust), and add more pseudocounts if sampling is sparse One reasonable approach is to add N pseudocounts, where N is the number of sequences As N increases, the influence of pseusocounts decreases since N increases faster than N, but doesn t add enough at low N

Dealing With Small Training Sets Position: 1 2 3 4 5 A 8 Training Set C 1 ACCTG G 1 AGCTG T 0 ACCCG ACCTG If the true frequency of T at pos. 1 was 10%, ACCCA what s the probability we wouldn t see any Ts GACTG in a sample of 10 seqs? ACGTA ACCTG P(N=0) = (10!/0!10!)(0.1) 0 (0.9) 10 = ~35% CCCCG ACATC So we should add pseudocounts

Pseudocounts (Ψcounts) Nt Count Ψcount Bayescount ML est. Bayes est. A 8 + 1 9 0.80 0.64 C 1 + 1 2 0.10 0.14 G 1 + 1 2 0.10 0.14 T 0 + 1 1 0.00 0.07 10 14 1.0 1.0 The add 1 to each observed count rule can be derived analytically from the Bayesian posterior distribution under a Dirichlet prior - see Appendix A of statistics primer for details.

Real World HMMs

Please see the following Web site: http://www.cbs.dtu.dk/services/tmhmm/ Reference for TMHMM: Krogh, A, B Larsson, G von Heijne, and EL Sonnhammer. "Predicting Transmembrane Protein Topology with a Hidden Markov Model: Application to Complete Genomes." J Mol Biol. 305, no. 3 (19 January 2001): 567-80.

Architecture of TMHMM Please see figures 1a and 1c of: Krogh, A, B Larsson, G von Heijne, and EL Sonnhammer. "Predicting Transmembrane Protein Topology with a Hidden Markov Model: Application to Complete Genomes." J Mol Biol. 305, no. 3 (19 January 2001): 567-80.

TMHMM Output for Optimal Parse Mouse Chloride Channel CLC6 Posterior Probability Transmembrane inside outside

Genscan Model Incorporates: Transcriptional signals Splicing signals Translational signals Composition of exons Composition of introns Other gene features Burge & Karlin, J Mol Biol 1997

Semi-Markov HMM Model

Genscan predictions in human CD4 gene region Annotated exons Genscan predicted exons Overall: ~75% of exons exactly correct Burge and Karlin J. Mol. Biol. 1997

Genscan, GenomeScan Predictions in Human BRCA1 Region Please see figures 1 of Yeh, RF, LP Lim, and CB Burge. "Computational Inference of Homologous Gene Structures in the Human Genome." Genome Res. 11, no. 5 (May 2001): 803-16.

DNA Sequence Evolution Generation n-1 (grandparent) 5 TGGCATGCACCCTGTAAGTCAATATAAATGGCTACGCCTAGCCCATGCGA 3 3 ACCGTACGTGGGACATTCAGTTATATTTACCGATGCGGATCGGGTACGCT 5 Generation n (parent) 5 TGGCATGCACCCTGTAAGTCAATATAAATGGCTATGCCTAGCCCATGCGA 3 3 ACCGTACGTGGGACATTCAGTTATATTTACCGATACGGATCGGGTACGCT 5 Generation n+1 (child) 5 TGGCATGCACCCTGTAAGTCAATATAAATGGCTATGCCTAGCCCGTGCGA 3 3 ACCGTACGTGGGACATTCAGTTATATTTACCGATACGGATCGGGCACGCT 5

What is a Markov Model (aka Markov Chain)? Classical Definition A discrete stochastic process X 1, X 2, X 3, which has the Markov property: P(X n+1 = j X 1 =x 1, X 2 =x 2, X n =x n ) = P(X n+1 = j X n =x ) n In words: (for all x, all j, all n) i A random process which has the property that the future (next state) is conditionally independent of the past given the present (current state) Markov - a Russian mathematician, ca. 1922

DNA Sequence Evolution is a Markov Process No selection case P AA P AC P AG P AT P CA P CC P CG P CT S n = base at generation n P = P GA P GC P GG P GT Pij = P (S = j S n = i ) P TA P TC P TG P TT n +1 q n =(q,q,q,q A C T ) G = vector of prob s of bases at gen. n Handy relations: q n +1 = q n P q n+k = q n P k

Limit Theorem for Markov Chains S n = base at generation n Pij = P (S n +1 = j S n =i ) If Pij >0 for all i,j (and Pij =1 for all i) j then there is a unique vector r r such that = r P = q and limq P n r (for any prob. vector ) n r is called the stationary or limiting distribution of P See Ch. 4, Taylor & Karlin, An Introduction to Stochastic Modeling, 1984 for details

Stationary Distribution Examples 2-letter alphabet: R = purine, Y = pyrimidine Stationary distributions for: 1 0 0 1 I = Q = 0 1 1 0 1 p p P = p 1 p 0 < p < 1 1 p P = q p 1 q 0 < p < 1, 0 < q < 1

How does entropy change when a Markov transition matrix is applied? If limiting distribution is uniform, then entropy increases (analogous to 2nd Law of Thermodynamics) However, this is not true in general (why not?)

How rapidly is the stationary distribution approached?

Jukes-Cantor Model Courtesy of M. Yaffe α G A α α α α T C α Assume each nucleotide equally likely to change into any other nt, with rate of change=α. Overall rate of substitution = 3α so if G at t=0, at t=1, P G(1) =1-3α and P G(2) =(1-3α)P G(1) +α [1 P G(1) ] Expanding this gives P G(t) =1/4 + (3/4)e -4αt Can show that this gives K = -3/4 ln[1-(4/3)(p)] K = true number of substitutions that have occurred, P = fraction of nt that differ by a simple count. Captures general behaviour

Paper #1: Literature Discussion Tues. 3/16 Kellis, M, N Patterson, M Endrizzi, B Birren, and ES Lander. "Sequencing and Comparison of Yeast Species to Identify Genes and Regulatory Elements." Nature 423, no. 6937 (15 May 2003): 241-54. Part 1 - Finding Genes, etc., pp. 241-247 Part 2 - Regulatory Elements, pp. 247-254 Paper #2: Rivas, E, RJ Klein, TA Jones, and SR Eddy. "Computational Identification of Noncoding RNAs in E. coli by Comparative Genomics." Curr Biol. 11, no. 17 (4 September 2001): 1369-73.