Interpolated Markov Models for Gene Finding
BMI/CS 776, Spring 2015
Colin Dewey (cdewey@biostat.wisc.edu)
www.biostat.wisc.edu/bmi776/

Goals for Lecture
the key concepts to understand are the following:
- the gene-finding task
- the trade-off between potential predictive value and parameter uncertainty in choosing the order of a Markov model
- interpolated Markov models
- back-off models

The Gene Finding Task
Given: an uncharacterized DNA sequence
Do: locate the genes in the sequence, including the coordinates of individual exons and introns

Sources of Evidence for Gene Finding
- signals: the sequence signals (e.g. splice junctions) involved in gene expression
- content: statistical properties that distinguish protein-coding DNA from non-coding DNA
- conservation: signal and content properties that are conserved across related sequences (e.g. orthologous regions of the mouse and human genomes)

Gene Finding: Search by Content
- encoding a protein affects the statistical properties of a DNA sequence
- some amino acids are used more frequently than others (Leu more popular than Trp)
- different numbers of codons for different amino acids (Leu has 6, Trp has 1)
- for a given amino acid, usually one codon is used more frequently than others
  - this is termed codon preference
  - these preferences vary by species

Codon Preference in E. coli
AA    codon   freq per 1000
Gly   GGG      1.89
Gly   GGA      0.44
Gly   GGU     52.99
Gly   GGC     34.55
Glu   GAG     15.68
Glu   GAA     57.20
Asp   GAU     21.63
Asp   GAC     43.26

Reading Frames
a given sequence may encode a protein in any of the six reading frames (three on each strand)
G C T A C G G A G C T T C G G A G C
C G A T G C C T C G A A G C C T C G
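
For concreteness, here is a small sketch (not from the slides) that enumerates the codons in all six reading frames of a sequence; the frame labels are just for illustration.

```python
# Enumerate codons in all six reading frames: three offsets on the forward strand
# and three on the reverse complement.
def reverse_complement(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def six_frame_codons(seq):
    """Yield (frame_label, codons) for each of the six reading frames."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            yield f"{strand}{offset + 1}", codons

for frame, codons in six_frame_codons("GCTACGGAGCTTCGGAGC"):
    print(frame, codons)
```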

Open Reading Frames (ORFs)
an ORF is a sequence that
- starts with a potential start codon
- ends with a potential stop codon, in the same reading frame
- doesn't contain another in-frame stop codon
- and is sufficiently long (say > 100 bases)
G T T A T G G C T T C G T G A T T
an ORF meets the minimal requirements to be a protein-coding gene in an organism without introns
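
A rough sketch of this ORF definition; the start/stop codon sets and the length threshold below are illustrative assumptions (bacterial genomes, for example, also use GTG and TTG starts).

```python
# Scan the three forward reading frames for ORFs: start codon, in-frame stop codon,
# no intervening in-frame stop, and a minimum length.
START_CODONS = {"ATG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_LENGTH = 100          # "sufficiently long"; adjust as needed

def find_orfs(seq):
    """Return (start, end) coordinates of ORFs on the forward strand of seq."""
    orfs = []
    for offset in range(3):
        start = None
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3) - start > MIN_LENGTH:
                    orfs.append((start, i + 3))
                start = None      # reset; an ORF contains no internal in-frame stop
    return orfs
```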

Markov Models & Reading Frames
- consider modeling a given coding sequence
- for each word we evaluate, we'll want to consider its position with respect to the reading frame we're assuming
reading frame: G C T A C G G A G C T T C G G A G C
- G C T A C G: G is in 3rd codon position
- C T A C G G: G is in 1st codon position
- T A C G G A: A is in 2nd codon position
- we can do this using an inhomogeneous model
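
To make the position dependence concrete, here is a minimal sketch of a 3-periodic (inhomogeneous) model; the `models` data structure (one conditional-probability table per codon position) is an assumption for illustration, not GLIMMER's actual layout.

```python
import math

def log_prob_inhomogeneous(seq, models, order, frame=0):
    """Score seq under a 3-periodic Markov model of the given order.

    models[pos][history][base] is assumed to hold P(base | history) for
    codon position pos (0, 1, 2); frame gives the codon position of seq[0].
    """
    total = 0.0
    for i in range(order, len(seq)):
        pos = (frame + i) % 3              # codon position of base i
        history = seq[i - order:i]
        total += math.log(models[pos][history][seq[i]])
    return total
```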

A Fifth-Order Inhomogeneous Markov Chain
[State diagram: after a start state, there is one state per 5-mer (AAAAA ... CTACA, CTACC, CTACG, CTACT ... GCTAC ... TACAA, TACAC, TACAG, TACAT ... TTTTT) in each of three banks, one bank per codon position; states in the position-1 bank make transitions to states in the position-2 bank, position 2 to position 3, and position 3 back to position 1.]

Selecting the Order of a Markov Chain Model
- higher-order models remember more history
- additional history can have predictive value
- example: predict the next word in this sentence fragment
  "you ___" (are, give, idiot, say, see, too, ...?)
- now predict it given more history
  "can you ___"
  "say can you ___"
  "oh say can you ___"

Selecting the Order of a Markov Chain Model
- but the number of parameters we need to estimate grows exponentially with the order
  - for modeling DNA we need O(4^(n+1)) parameters for an nth-order model
- the higher the order, the less reliable we can expect our parameter estimates to be
  - suppose we have 100k bases of sequence with which to estimate the parameters of a model
  - for a 2nd-order homogeneous Markov chain, we'd see each history 6250 times on average
  - for an 8th-order chain, we'd see each history only ~1.5 times on average
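
A quick check of these numbers (assuming a 4-letter alphabet and 100 kb of training sequence, as on the slide):

```python
# Parameter counts and average history occurrences for nth-order chains over {A,C,G,T}.
training_bases = 100_000
for order in (2, 8):
    histories = 4 ** order                 # distinct length-n histories
    params = 4 ** (order + 1)              # O(4^(n+1)) conditional probabilities
    print(order, params, training_bases / histories)   # order 2: 6250 per history; order 8: ~1.5
```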

Interpolated Markov Models
- the IMM idea: manage this trade-off by interpolating among models of various orders
- simple linear interpolation:
  P_IMM(x_i | x_{i-n}, ..., x_{i-1}) = λ_0·P(x_i) + λ_1·P(x_i | x_{i-1}) + ... + λ_n·P(x_i | x_{i-n}, ..., x_{i-1})
  where Σ_j λ_j = 1
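
A minimal sketch of fixed-weight interpolation over 0th-, 1st-, and 2nd-order maximum-likelihood estimates; the weights below are arbitrary placeholders, not values used by any real gene finder.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Counts of length-k substrings of seq (the empty history occurs once per base)."""
    if k == 0:
        return Counter({"": len(seq)})
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def p_ml(seq, x, history):
    """Maximum-likelihood estimate of P(x | history) from the training sequence."""
    denom = kmer_counts(seq, len(history))[history]
    return kmer_counts(seq, len(history) + 1)[history + x] / denom if denom else 0.0

def p_imm_fixed(seq, x, history, lambdas=(0.2, 0.3, 0.5)):
    """Interpolate order-0, order-1, ... estimates with fixed weights summing to 1."""
    return sum(lam * p_ml(seq, x, history[len(history) - k:])
               for k, lam in enumerate(lambdas))

print(p_imm_fixed("TAACGACACG", "C", "GA"))   # 0.2*0.3 + 0.3*0.75 + 0.5*1.0 = 0.785
```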

Interpolated Markov Models
- we can make the weights depend on the history
  - for a given order, we may have significantly more data to estimate some words than others
- general linear interpolation:
  P_IMM(x_i | x_{i-n}, ..., x_{i-1}) = λ_0·P(x_i) + λ_1(x_{i-1})·P(x_i | x_{i-1}) + ... + λ_n(x_{i-n}, ..., x_{i-1})·P(x_i | x_{i-n}, ..., x_{i-1})
  where each λ is a function of the given history

The GLIMMER System
- [Salzberg et al., Nucleic Acids Research, 1998]
- system for identifying genes in bacterial genomes
- uses 8th-order, inhomogeneous, interpolated Markov chain models

IMMs in GLIMMER
- how does GLIMMER determine the λ values?
- first, let's express the IMM probability calculation recursively:
  P_IMM,n(x_i | x_{i-n}, ..., x_{i-1}) = λ_n(x_{i-n}, ..., x_{i-1})·P(x_i | x_{i-n}, ..., x_{i-1}) + [1 − λ_n(x_{i-n}, ..., x_{i-1})]·P_IMM,n-1(x_i | x_{i-n+1}, ..., x_{i-1})
- let c(x_{i-n}, ..., x_{i-1}) be the number of times we see the history x_{i-n}, ..., x_{i-1} in our training set
- λ_n(x_{i-n}, ..., x_{i-1}) = 1 if c(x_{i-n}, ..., x_{i-1}) > 400

IMMs in GLIMMER
- if we haven't seen x_{i-n}, ..., x_{i-1} more than 400 times, then compare the counts for the following:

  nth-order history + base       (n-1)th-order history + base
  x_{i-n}, ..., x_{i-1}, a        x_{i-n+1}, ..., x_{i-1}, a
  x_{i-n}, ..., x_{i-1}, c        x_{i-n+1}, ..., x_{i-1}, c
  x_{i-n}, ..., x_{i-1}, g        x_{i-n+1}, ..., x_{i-1}, g
  x_{i-n}, ..., x_{i-1}, t        x_{i-n+1}, ..., x_{i-1}, t

- use a statistical test (χ²) to get a value d indicating our confidence that the distributions represented by the two sets of counts are different

IMMs in GLIMMER
putting it all together:

  λ_n(x_{i-n}, ..., x_{i-1}) =
    1                                        if c(x_{i-n}, ..., x_{i-1}) > 400
    d · c(x_{i-n}, ..., x_{i-1}) / 400       else if d ≥ 0.5
    0                                        otherwise

  where d ∈ (0, 1)
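
A sketch of how these weights might be computed; treating d as 1 minus the χ² p-value and using SciPy's chi2_contingency as the test are assumptions (the slides just say "a χ² test"), and `counts` is an assumed dict mapping k-mers to training-set counts.

```python
from scipy.stats import chi2_contingency

BASES = "ACGT"

def glimmer_lambda(history, counts, threshold=400):
    """lambda_n(history) following the rules on this slide (assumes non-degenerate counts)."""
    full = [counts.get(history + b, 0) for b in BASES]         # nth-order history + base
    shorter = [counts.get(history[1:] + b, 0) for b in BASES]  # (n-1)th-order history + base
    c = sum(full)
    if c > threshold:
        return 1.0
    # d: confidence that the two count distributions differ
    _, p_value, _, _ = chi2_contingency([full, shorter])
    d = 1.0 - p_value
    return d * c / threshold if d >= 0.5 else 0.0
```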

IMM Example
suppose we have the following counts from our training set:

  ACGA  25     CGA  100     GA  175
  ACGC  40     CGC   90     GC  140
  ACGG  15     CGG   35     GG   65
  ACGT  20     CGT   75     GT  120
  total 100    total 300    total 500

  χ² test (ACG+base vs. CG+base): d = 0.857
  χ² test (CG+base vs. G+base): d = 0.140

  λ_3(ACG) = 0.857 · 100/400 = 0.214
  λ_2(CG) = 0    (d < 0.5, c(CG) < 400)
  λ_1(G) = 1     (c(G) > 400)

IMM Example (Continued)
now suppose we want to calculate P_IMM,3(T | ACG):

  P_IMM,1(T | G) = λ_1(G)·P(T | G) + (1 − λ_1(G))·P_IMM,0(T) = P(T | G)
  P_IMM,2(T | CG) = λ_2(CG)·P(T | CG) + (1 − λ_2(CG))·P_IMM,1(T | G) = P(T | G)
  P_IMM,3(T | ACG) = λ_3(ACG)·P(T | ACG) + (1 − λ_3(ACG))·P_IMM,2(T | CG)
                   = 0.214·P(T | ACG) + (1 − 0.214)·P(T | G)
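
The recursion can be checked numerically with a short sketch; the counts and λ values are those from the two slides above, and the uniform 0th-order fallback is an assumption (it is never reached in this example).

```python
# Evaluate P_IMM,3(T | ACG) using the worked-example counts and lambdas.
counts = {"ACGA": 25, "ACGC": 40, "ACGG": 15, "ACGT": 20,
          "CGA": 100, "CGC": 90, "CGG": 35, "CGT": 75,
          "GA": 175, "GC": 140, "GG": 65, "GT": 120}
lam = {"ACG": 0.214, "CG": 0.0, "G": 1.0}

def p(x, history):
    """Fixed-order estimate P(x | history) from the counts above."""
    return counts[history + x] / sum(counts[history + b] for b in "ACGT")

def p_imm(x, history):
    if not history:
        return 0.25                       # assumed uniform 0th-order fallback
    return lam[history] * p(x, history) + (1 - lam[history]) * p_imm(x, history[1:])

print(p_imm("T", "ACG"))   # 0.214*0.2 + 0.786*0.24 ≈ 0.231
```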

Gene Recognition in GLIMMER
- essentially ORF classification
- for each ORF
  - calculate the probability of the ORF sequence in each of the 6 possible reading frames
  - if the highest-scoring frame corresponds to the reading frame of the ORF, mark the ORF as a gene
- for overlapping ORFs that look like genes
  - score the overlapping region separately
  - predict only one of the ORFs as a gene

GLIMMER Experiment
- 8th-order IMM vs. 5th-order Markov model
- trained on 1168 genes (ORFs, really)
- tested on 1717 annotated (more or less known) genes

GLIMMER Results
[Table of results omitted: counts of true positives, false negatives, and false positives (some of which may be unannotated true genes) for GLIMMER vs. the baseline model]
- GLIMMER has greater sensitivity than the baseline
- it's not clear whether its precision/specificity is better

An Alternative Approach: Back-off Models
- devised for language modeling [Katz, IEEE Transactions on Acoustics, Speech and Signal Processing, 1987]

  P_BACK(x_i | x_{i-n}, ..., x_{i-1}) =
    (1 − δ) · c(x_{i-n}, ..., x_i) / c(x_{i-n}, ..., x_{i-1})    if c(x_{i-n}, ..., x_i) > k
    λ · P_BACK(x_i | x_{i-n+1}, ..., x_{i-1})                    otherwise

- use the nth-order probability if we've seen this sequence (history + current character) more than k times
- otherwise, back off to the lower-order model

An Alternative Approach: Back-off Models

  P_BACK(x_i | x_{i-n}, ..., x_{i-1}) =
    (1 − δ) · c(x_{i-n}, ..., x_i) / c(x_{i-n}, ..., x_{i-1})    if c(x_{i-n}, ..., x_i) > k
    λ · P_BACK(x_i | x_{i-n+1}, ..., x_{i-1})                    otherwise

why do we need δ and λ?
- δ: save some probability mass for sequences we haven't seen
- λ: distribute this saved mass to lower-order sequences (there is a different λ for each history; really λ(x_{i-n+1}, ..., x_{i-1}))
- this is important for natural language, where there are many words that could follow a particular history

Simple Back-off Example

  P_BACK(x_i | x_{i-n}, ..., x_{i-1}) =
    (1 − δ) · c(x_{i-n}, ..., x_i) / c(x_{i-n}, ..., x_{i-1})    if c(x_{i-n}, ..., x_i) > k
    λ · P_BACK(x_i | x_{i-n+1}, ..., x_{i-1})                    otherwise

given training sequence: TAACGACACG
suppose δ = 0.2 and k = 0

  P_BACK(A) = 4/10    P_BACK(C) = 3/10    P_BACK(G) = 2/10    P_BACK(T) = 1/10

  P_BACK(A | A) = (1 − δ) · 1/4 = 0.2
  P_BACK(C | A) = (1 − δ) · 3/4 = 0.6
  P_BACK(G | A) = (δ / (P_BACK(G) + P_BACK(T))) · P_BACK(G) = (0.2/0.3) · 0.2 ≈ 0.133
  P_BACK(T | A) = (δ / (P_BACK(G) + P_BACK(T))) · P_BACK(T) = (0.2/0.3) · 0.1 ≈ 0.067
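
A short sketch reproducing this example (δ = 0.2, k = 0); sharing the reserved mass among unseen characters in proportion to their lower-order probabilities follows the calculation above.

```python
from collections import Counter

train = "TAACGACACG"
delta, k = 0.2, 0

unigram = Counter(train)
bigram = Counter(train[i:i + 2] for i in range(len(train) - 1))

def p0(x):
    """0th-order probability of base x."""
    return unigram[x] / len(train)

def p_back(x, h):
    """Back-off estimate P(x | h) for a single-character history h."""
    if bigram[h + x] > k:
        # seen often enough: discounted bigram estimate
        return (1 - delta) * bigram[h + x] / unigram[h]
    # otherwise back off, sharing the reserved mass delta among the unseen characters
    unseen_mass = sum(p0(b) for b in "ACGT" if bigram[h + b] <= k)
    return delta * p0(x) / unseen_mass

for b in "ACGT":
    print(b, round(p_back(b, "A"), 3))    # 0.2, 0.6, 0.133, 0.067
```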