Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Similar documents
CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

Today s Lecture: HMMs

EECS730: Introduction to Bioinformatics

BMI/CS 576 Fall 2016 Final Exam

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

Hidden Markov Models

HMMs and biological sequence analysis

Hidden Markov Models (I)

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Markov Chains and Hidden Markov Models. COMP 571 Luay Nakhleh, Rice University

Lecture 7 Sequence analysis. Hidden Markov Models

Hidden Markov Models for biological sequence analysis

Hidden Markov Models for biological sequence analysis I

Hidden Markov Models

Pairwise alignment using HMMs

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010

1/22/13. Example: CpG Island. Question 2: Finding CpG Islands

6.047/6.878/HST.507 Computational Biology: Genomes, Networks, Evolution. Lecture 05. Hidden Markov Models Part II

Hidden Markov Models. x 1 x 2 x 3 x K

Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from:

Pair Hidden Markov Models

Hidden Markov Models. Three classic HMM problems

Data Mining in Bioinformatics HMM

Evolutionary Models. Evolutionary Models

Stephen Scott.

Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98)

Introduction to Hidden Markov Models for Gene Prediction ECE-S690

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Hidden Markov Models

O 3 O 4 O 5. q 3. q 4. Transition

Stephen Scott.

Interpolated Markov Models for Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Introduction to Hidden Markov Models (HMMs)

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Hidden Markov Models. Terminology, Representation and Basic Problems

CSCE 478/878 Lecture 9: Hidden. Markov. Models. Stephen Scott. Introduction. Outline. Markov. Chains. Hidden Markov Models. CSCE 478/878 Lecture 9:

Hidden Markov Models

Example: The Dishonest Casino. Hidden Markov Models. Question # 1 Evaluation. The dishonest casino model. Question # 3 Learning. Question # 2 Decoding

Lecture 9. Intro to Hidden Markov Models (finish up)

CSCE 471/871 Lecture 3: Markov Chains and

Hidden Markov Models (HMMs) November 14, 2017

Computational Identification of Evolutionarily Conserved Exons

Markov Models & DNA Sequence Evolution

Computational Genomics and Molecular Biology, Fall

Markov Chains and Hidden Markov Models. = stochastic, generative models

Hidden Markov Models. x 1 x 2 x 3 x K

HIDDEN MARKOV MODELS

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Applications of Hidden Markov Models

Computational Genomics and Molecular Biology, Fall

Alignment Algorithms. Alignment Algorithms

Computational Biology Lecture #3: Probability and Statistics. Bud Mishra Professor of Computer Science, Mathematics, & Cell Biology Sept

Whole Genome Alignments and Synteny Maps

Basic math for biology

Multiple Whole Genome Alignment

Note Set 5: Hidden Markov Models

Genome 373: Hidden Markov Models II. Doug Fowler

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Hidden Markov Models

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven)

Statistical Methods for NLP

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

HMM: Parameter Estimation

Pairwise sequence alignment and pair hidden Markov models

R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological sequence analysis. Cambridge University Press, ISBN (Chapter 3)

Multiple Sequence Alignment using Profile HMM

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Hidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Hidden Markov Models Part 2: Algorithms

Machine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017

STA 414/2104: Machine Learning

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

11.3 Decoding Algorithm

Hidden Markov Methods. Algorithms and Implementation

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008

Hidden Markov Models. Ron Shamir, CG 08

Introduction to Machine Learning CMU-10701

Hidden Markov Models

p(d θ ) l(θ ) 1.2 x x x

Hidden Markov Model. Ying Wu. Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Plan for today. ! Part 1: (Hidden) Markov models. ! Part 2: String matching and read mapping

Gene Prediction in Eukaryotes Using Machine Learning

HMM : Viterbi algorithm - a toy example

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015

STA 4273H: Statistical Machine Learning

We Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named

Hidden Markov Models

Lecture 3: Markov chains.

Hidden Markov Models. Terminology and Basic Algorithms

Hidden Markov Models. based on chapters from the book Durbin, Eddy, Krogh and Mitchison Biological Sequence Analysis via Shamir s lecture notes

Hidden Markov Models and Their Applications in Biological Sequence Analysis

VL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

Sequence analysis and Genomics

Transcription:

Comparative Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu

Goals for Lecture the key concepts to understand are the following: using related genomes as an additional source of evidence for gene finding the TWINSCAN approach: use a pre-computed conservation sequence that is aligned to the given DNA sequence pair HMMs the correspondence between Viterbi in a pair HMM and standard dynamic programming for sequence alignment the SLAM approach: use a pair HMM to simultaneously align and parse sequences

Why use comparative methods? genes are among the most conserved elements in the genome use conservation to help infer locations of genes some signals associated with genes are short and occur frequently use conservation to eliminate from consideration false candidate sites

Conservation as powerful information source Scale chr17: SGSH SGSH 4.9 _ 5 kb hg19 78,185,000 78,190,000 Basic Gene Annotation Set from ENCODE/GENCODE Version 7 SGSH GERP scores for mammalian alignments SLC26A11 SLC26A11 GERP 0 - -9.8 _ Rhesus Mouse Dog Elephant Opossum Chicken X_tropicalis Zebrafish Multiz Alignments of 46 Vertebrates

TWINSCAN Korf et al., Bioinformatics 2001 prediction with TWINSCAN given: a sequence to be parsed, x using BLAST, construct a conservation sequence, c have HMM simultaneously parse (using Viterbi) x and c training with TWINSCAN given: set of training sequences X with known gene structure annotations for each x in X construct a conservation sequence c for x infer emission parameters for both x and c

Conservation Sequences in TWINSCAN before processing a given sequence, TWINSCAN first computes a corresponding conservation sequence ATTTAGCCTACTGAAATGGACCGCTTCAGCATGGTATCC :... : : : : :: matched unaligned mismatched Given: a sequence of length n, a set of aligned BLAST matches c[1...n] = unaligned sort BLAST matches by alignment score for each BLAST match h (from best to worst) for each position i covered by h if c[i] == unaligned c[i] = h[i]

Conservation Sequence Example given sequence significant BLAST matches ordered from best to worst ATTTAGCCTACTGAAATGGACCGCTTCAGCATGGTATCC ATTTA : ATCTA ATGGACCGCTTCAGC : : : ACGCACCGCTTCATC AGCATGGTATCC : : :: AGAAGGGTCACC resulting conservation sequence ATTTAGCCTACTGAAATGGACCGCTTCAGCATGGTATCC :... : : : : ::

Parsing a DNA Sequence The Viterbi path represents a parse of a given sequence, predicting exons, introns, etc. ACCGTTACGTGTCATTCTACGTGATCATCGGATCCTAGAATCATCGATCCGTGCGATCGATCGGATTAGCTAGCTTAGCTAGGAGAGCATCGATCGGATCGAGGAGGAGCCTATATAAATCAA

Modeling Sequences in TWINSCAN each state emits two sequences the given DNA sequence, x the conservation sequence, c it treats them as conditionally independent given the state Pr( x, c q) = i i Pr( d i q) Pr( x i q, d i ) Pr( c i q, d i ) q x c i i ATTTAGCCTACTGAAATGGACCGCTTCAGCATGGTATCC :... : : : : :: d i

Modeling Sequences in TWINSCAN conservation sequence is treated just as a string over a 3-character alphabet (, :,.) conservation sequence emissions modeled by inhomogeneous 2 nd -order chains for splice sites homogeneous 5 th -order Markov chains for other states like GENSCAN, based on hidden semi-markov models algorithms for learning, inference same as GENSCAN

TWINSCAN vs. GENSCAN conservation is neither necessary nor sufficient to predict an exon mouse alignments RefSeq (gold standard) GENSCAN prediction TWINSCAN prediction TWINSCAN correctly omits this exon prediction because conserved region ends within it TWINSCAN correctly predicts both splice sites because they are within the aligned regions

GENSCAN vs. TWINSCAN: Empirical Comparison sensitivity (Sn) specificity (Sp) = = TP TP + FN TP TP + FP note: the definition of specificity here is somewhat nonstandard; it s the same as precision genes exactly correct? exons exactly correct? nucleotides correct? Figure from Flicek et al., Genome Research, 2003

Accuracy of TWINSCAN as a Function of Sequence Coverage very crude mouse genome sequence good mouse genome sequence

SLAM Pachter et al., RECOMB 2001 prediction with SLAM given: a pair of sequences to be parsed, x and y find approximate alignment of x and y run constrained Viterbi to have HMM simultaneously parse and align x and y training with SLAM given: a set of aligned pairs of training sequences X for each x, y in X infer emission/alignment parameters for both x and y

Pair Hidden Markov Models each non-silent state emits one or a pair of characters H: homology (match) state I: insert state D: delete state

PHMM Paths = Alignments sequence 1: AAGCGC sequence 2: ATGTC hidden: observed: B H A A H A T I G I H D C G G T H C C E

Transition Probabilities probabilities of moving between states at each step state i+1 B H I D E B 1-2δ-τ δ δ τ state i H 1-2δ-τ δ δ τ I 1-ε-τ ε τ D 1-ε-τ ε τ E

Emission Probabilities Deletion (D) Insertion (I) e D (x i ) e I (y j ) Homology (H) e H (x i, y j ) A 0.3 A 0.1 C 0.2 C 0.4 G 0.3 G 0.4 T 0.2 T 0.1 A C G T A C G T 0.13 0.03 0.06 0.03 0.03 0.13 0.03 0.06 0.06 0.03 0.13 0.03 0.03 0.06 0.03 0.13 single character single character pairs of characters

PHMM Viterbi probability of most likely sequence of hidden states generating length i prefix of x and length j prefix of y, with the last state being: H I D note that the recurrence relations here allow I D and D I transitions

PHMM Alignment calculate probability of most likely alignment traceback, as in Needleman-Wunsch (NW), to obtain sequence of state states giving highest probability HIDHHDDIIHH...

Correspondence with NW NW values logarithms of PHMM Viterbi values

Posterior Probabilities there are similar recurrences for the Forward and Backward values from the Forward and Backward values, we can calculate the posterior probability of the event that the path passes through a certain state S, after generating length i and j prefixes

Uses for Posterior Probabilities sampling of suboptimal alignments posterior probability of pairs of residues being homologous (aligned to each other) posterior probability of a residue being gapped training model parameters (EM)

Posterior Probabilities plot of posterior probability of each alignment column

Parameter Training supervised training given: sequences and correct alignments do: calculate parameter values that maximize joint likelihood of sequences and alignments unsupervised training given: sequence pairs, but no alignments do: calculate parameter values that maximize marginal likelihood of sequences (sum over all possible alignments)

Generalized Pair HMMs represent a parse π, as a sequence of states and a sequence of associated lengths for each input sequence q = { q, q2,, q 1 n } N P + F + E + init d = { d1, d2,, dn} e = { e1, e2,, en} may be gaps in the sequences

Generalized Pair HMMs representing a parse π, as a sequence of states and associated lengths (durations) q = { q1, q2,, qn} d = d, d,, d } e = e, e,, e } { 1 2 n the joint probability of generating parse π and sequences x and y P(x,y,π ) = a start,1 P(d 1,e 1 q 1 )P(x 1,y 1 q 1,d 1,e 1 ) n k=2 { 1 2 n a k 1,k P(d k,e k q k )P(x k,y k q k,d k,e k )

Generalized Pair HMM Algorithms Generalized HMM Forward Algorithm f l (i) = k D d=1 " # i f k (i d) a kl P(d q l ) P(x i d+1 q l, d) $ % Generalized Pair HMM Algorithm f l (i, j) = k D d=1 D e=1 " # i j f k (i d, j e) a kl P(d, e q l ) P(x i d+1 y j e+1 q l, d,e) $ % Viterbi: replace sum with max

Prediction in SLAM could find alignment and gene predictions by running Viterbi to make it more efficient find an approximate alignment (using a fast anchorbased approach) each base in x constrained to align to a window of size h in y x y h analogous to banded alignment methods

GENSCAN, TWINSCAN, & SLAM

TWINSCAN vs. SLAM both use multiple genomes to predict genes both use generalized HMMs TWINSCAN takes as an input a genomic sequence, and a conservation sequence computed from an informant genome models probability of both sequences; assumes they re conditionally independent given the state predicts genes only in the genomic sequence SLAM takes as input two genomic sequences models joint probability of pairs of aligned sequences can simultaneously predict genes and compute alignments