Hidden Markov Models in Bioinformatics 14.11


Hidden Markov Models in Bioinformatics 14.11 (60 min)

Definition
Three Key Algorithms
  Summing over Unknown States
  Most Probable Unknown States
  Marginalizing Unknown States
Key Bioinformatic Applications
  Pedigree Analysis
  Isochores in Genomes (CG-rich regions)
  Profile HMM Alignment
  Fast/Slowly Evolving States
  Secondary Structure Elements in Proteins
  Gene Finding
  Statistical Alignment

Hidden Markov Models

$(O_1, H_1), (O_2, H_2), \ldots, (O_n, H_n)$ is a sequence of stochastic variables with 2 components: one that is observed ($O_i$) and one that is hidden ($H_i$).

The marginal distribution of the $H_i$'s is described by a homogeneous Markov chain:
Let $p_{i,j} = P(H_{k+1} = j \mid H_k = i)$.
Let $\pi_i = P(H_1 = i)$; often $\pi_i$ is the equilibrium distribution of the Markov chain.

Conditional on the $H_k$ (all $k$), the $O_k$ are independent.
The distribution of $O_k$ depends only on the value of $H_k$ and is called the emit function: $e(i, j) = P(O_k = i \mid H_k = j)$.

[Figure: trellis with observations $O_1, \ldots, O_{10}$ along the horizontal axis and hidden states 1, 2, 3 on the vertical axis.]
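To make the definition concrete, here is a minimal sketch that draws a hidden path and an observed sequence from a two-state HMM. The state names, the transition probabilities and the initial distribution are illustrative assumptions, not values from the slides. Note that the dictionary `e` is indexed state first, `e[j][o]`, whereas the text writes the emit function with the observation first.

```python
import random

# Hypothetical two-state HMM over nucleotides; all numbers are illustrative.
states = ["poor", "rich"]
pi = {"poor": 0.5, "rich": 0.5}                  # pi[i]   = P(H_1 = i)
p = {"poor": {"poor": 0.9, "rich": 0.1},         # p[i][j] = P(H_{k+1} = j | H_k = i)
     "rich": {"poor": 0.1, "rich": 0.9}}
e = {"poor": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},   # e[j][o] = P(O_k = o | H_k = j)
     "rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}


def draw(dist, rng):
    """Draw one key of `dist` with probability equal to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def sample_hmm(n, rng):
    """Return (hidden, observed), two lists of length n."""
    hidden, observed = [], []
    h = draw(pi, rng)                      # H_1 ~ pi
    for _ in range(n):
        hidden.append(h)
        observed.append(draw(e[h], rng))   # O_k depends only on H_k
        h = draw(p[h], rng)                # H_{k+1} depends only on H_k
    return hidden, observed


hidden, observed = sample_hmm(10, random.Random(0))
```

In applications only `observed` is available; the three algorithms that follow recover information about `hidden` from it.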

What is the probability of the data?

The probability of the observed is $P(O) = \sum_{H} P(O \mid H)\,P(H)$, a sum over all hidden paths, which can be hard to calculate. However, these calculations can be considerably accelerated.

Let $P_k^j$ be the joint probability of the observations $(O_1, \ldots, O_k)$ and the hidden state $H_k = j$. The following recursion will be obeyed:

i.   $P_k^j = P(O_k \mid H_k = j) \sum_{i} P_{k-1}^{i}\, p_{i,j}$
ii.  $P_1^j = P(O_1 \mid H_1 = j)\, \pi_j$   (initial condition)
iii. $P(O) = \sum_{j} P_n^j$

Example, for hidden states 1, 2, 3 and observations $O_1, \ldots, O_{10}$:
$P_5^2 = P(O_5 \mid H_5 = 2) \sum_{j} P_4^{j}\, p_{j,2}$
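The recursion can be sketched in a few lines of Python. The model parameters below are illustrative assumptions, and `brute_force` is a hypothetical helper that sums over every hidden path, included only to show that the recursion gives the same number at a fraction of the cost.

```python
from itertools import product

# Illustrative parameters (assumptions, not taken from the slides).
states = ["poor", "rich"]
pi = {"poor": 0.6, "rich": 0.4}
p = {"poor": {"poor": 0.9, "rich": 0.1},
     "rich": {"poor": 0.2, "rich": 0.8}}
e = {"poor": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
     "rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}


def forward(obs):
    """Return (table, P(O)); table[k][j] = P(O_1..O_{k+1}, H_{k+1} = j)."""
    table = [{j: e[j][obs[0]] * pi[j] for j in states}]          # step ii
    for o in obs[1:]:                                            # step i
        prev = table[-1]
        table.append({j: e[j][o] * sum(prev[i] * p[i][j] for i in states)
                      for j in states})
    return table, sum(table[-1].values())                        # step iii


def brute_force(obs):
    """Sum P(O, H) over all hidden paths H: exponential in len(obs)."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        prob = pi[path[0]] * e[path[0]][obs[0]]
        for k in range(1, len(obs)):
            prob *= p[path[k - 1]][path[k]] * e[path[k]][obs[k]]
        total += prob
    return total


obs = "ACGGCT"
table, likelihood = forward(obs)
```

With S hidden states and n observations, the recursion costs on the order of n·S² multiplications, against Sⁿ paths for the direct sum.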

What is the most probable hidden configuration?

Let $H^*$ be the sequence of hidden states that maximizes the joint probability with the observed sequence $O$, i.e. $H^* = \mathrm{ArgMax}_H\,[\,P\{O, H\}\,]$ (equivalently, the path maximizing $P\{H \mid O\}$).

Let $H_k^j$ be the probability of the most probable path up to $k$ ending in hidden state $j$. Again recursions can be found:

i.   $H_k^j = \max_i \{ H_{k-1}^i\, p_{i,j} \}\, e(O_k, j)$
ii.  $H_1^j = \pi_j\, e(O_1, j)$

The actual sequence of hidden states $H_k^*$ can be found recursively by

iii. $H_{k-1}^* = \{\, i \mid H_{k-1}^i\, p_{i,H_k^*}\, e(O_k, H_k^*) = H_k^{H_k^*} \,\}$

Example, for hidden states 1, 2, 3 and observations $O_1, \ldots, O_{10}$:
$H_6^1 = \max_j \{ H_5^j\, p_{j,1}\, e(O_6, 1) \}$
$H_5^* = \{\, i \mid H_5^i\, p_{i,1}\, e(O_6, 1) = H_6^1 \,\}$
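A minimal sketch of the most-probable-path recursion follows. It stores the maximizing predecessor at each step rather than re-solving equation iii during the traceback, which is an equivalent and common implementation choice. All parameters are illustrative assumptions, and `best_by_enumeration` is a hypothetical check that is only feasible for short sequences.

```python
from itertools import product

# Illustrative parameters (assumptions, not taken from the slides).
states = ["poor", "rich"]
pi = {"poor": 0.6, "rich": 0.4}
p = {"poor": {"poor": 0.9, "rich": 0.1},
     "rich": {"poor": 0.2, "rich": 0.8}}
e = {"poor": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
     "rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}


def joint(path, obs):
    """P(O, H) for one fully specified hidden path."""
    prob = pi[path[0]] * e[path[0]][obs[0]]
    for k in range(1, len(obs)):
        prob *= p[path[k - 1]][path[k]] * e[path[k]][obs[k]]
    return prob


def viterbi(obs):
    """Return (most probable hidden path, its joint probability with obs)."""
    score = [{j: pi[j] * e[j][obs[0]] for j in states}]           # step ii
    back = []                                                     # argmax pointers
    for o in obs[1:]:
        prev = score[-1]
        best = {j: max(states, key=lambda i: prev[i] * p[i][j]) for j in states}
        score.append({j: prev[best[j]] * p[best[j]][j] * e[j][o]  # step i
                      for j in states})
        back.append(best)
    last = max(states, key=lambda j: score[-1][j])
    path = [last]
    for best in reversed(back):                                   # traceback (step iii)
        path.append(best[path[-1]])
    path.reverse()
    return path, score[-1][last]


def best_by_enumeration(obs):
    """Maximum of P(O, H) over all paths, by exhaustive search."""
    return max(joint(h, obs) for h in product(states, repeat=len(obs)))


obs = "AATCGGCGTA"
path, prob = viterbi(obs)
```

For long sequences the products underflow, so practical implementations usually work with sums of logarithms instead.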

What is the probability of a specific hidden state?

Let $Q_k^j$ be the probability of the observations from $k+1$ to $n$ given $H_k = j$. These will also obey recursions:

$Q_k^j = \sum_{i} P(O_{k+1} \mid H_{k+1} = i)\, p_{j,i}\, Q_{k+1}^i$, with terminal condition $Q_n^j = 1$.

The probability of the observations and a specific hidden state can be found as:
$P\{O, H_k = j\} = P_k^j\, Q_k^j$

And that of a specific hidden state given the observations can be found as:
$P\{H_k = j \mid O\} = P_k^j\, Q_k^j / P(O)$

Example, for hidden states 1, 2, 3 and observations $O_1, \ldots, O_{10}$:
$P\{H_5 = 2 \mid O\} = P_5^2\, Q_5^2 / P(O)$
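The forward and backward quantities can be combined as sketched below to obtain the posterior probability of each hidden state at each position. The parameters are illustrative assumptions. The terminal condition Q_n = 1 is the standard choice, made explicit in the code.

```python
# Illustrative parameters (assumptions, not taken from the slides).
states = ["poor", "rich"]
pi = {"poor": 0.6, "rich": 0.4}
p = {"poor": {"poor": 0.9, "rich": 0.1},
     "rich": {"poor": 0.2, "rich": 0.8}}
e = {"poor": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
     "rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}


def forward(obs):
    """P[k][j] = P(O_1..O_{k+1}, H_{k+1} = j), zero-based k."""
    P = [{j: e[j][obs[0]] * pi[j] for j in states}]
    for o in obs[1:]:
        prev = P[-1]
        P.append({j: e[j][o] * sum(prev[i] * p[i][j] for i in states)
                  for j in states})
    return P


def backward(obs):
    """Q[k][j] = P(O_{k+2}..O_n | H_{k+1} = j), zero-based k."""
    n = len(obs)
    Q = [None] * n
    Q[n - 1] = {j: 1.0 for j in states}            # terminal condition
    for k in range(n - 2, -1, -1):
        Q[k] = {j: sum(e[i][obs[k + 1]] * p[j][i] * Q[k + 1][i] for i in states)
                for j in states}
    return Q


def posterior(obs):
    """post[k][j] = P(H_{k+1} = j | O)."""
    P, Q = forward(obs), backward(obs)
    total = sum(P[-1].values())                    # P(O)
    return [{j: P[k][j] * Q[k][j] / total for j in states}
            for k in range(len(obs))]


obs = "AATCGGCGTA"
post = posterior(obs)
```

A useful sanity check is that the sum over j of P_k^j Q_k^j equals P(O) at every position k, so each row of `post` sums to one.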

Fast/Slowly Evolving States (Felsenstein & Churchill, 1996)

[Figure: alignment of sequences 1 to k (rows) over positions 1 to n (columns).]

HMM: slow - $r_s$, fast - $r_f$
$\pi_r$ - equilibrium distribution of hidden states (rates) at first position
$p_{i,j}$ - transition probabilities between hidden states
$L^{(j,r)}$ - likelihood for the $j$th column given rate $r$
$L_{(j,r)}$ - likelihood for the first $j$ columns, with the $j$th column having rate $r$

Likelihood Recursions:
$L_{(j,f)} = (L_{(j-1,f)}\, p_{f,f} + L_{(j-1,s)}\, p_{s,f})\, L^{(j,f)}$
$L_{(j,s)} = (L_{(j-1,f)}\, p_{f,s} + L_{(j-1,s)}\, p_{s,s})\, L^{(j,s)}$

Likelihood Initialisations:
$L_{(1,f)} = \pi_f\, L^{(1,f)}$
$L_{(1,s)} = \pi_s\, L^{(1,s)}$
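The two-rate recursion can be sketched as follows, assuming the per-column likelihoods under each rate have already been computed (in practice by a tree likelihood calculation, which is outside this sketch). The numbers in `column_lik`, the transition probabilities and the initial distribution are hypothetical placeholders.

```python
rates = ["s", "f"]                       # slow and fast rate categories
pi = {"s": 0.5, "f": 0.5}                # assumed distribution at the first position
p = {"s": {"s": 0.95, "f": 0.05},        # assumed p[i][j]: rate i at column j-1,
     "f": {"s": 0.05, "f": 0.95}}        # rate j at column j

# column_lik[j][r]: likelihood of alignment column j+1 given rate r.
# Placeholder numbers; real values come from the phylogenetic model.
column_lik = [
    {"s": 0.020, "f": 0.011},
    {"s": 0.018, "f": 0.012},
    {"s": 0.001, "f": 0.006},
    {"s": 0.002, "f": 0.007},
]


def alignment_likelihood(column_lik):
    """Likelihood of all columns, summed over every assignment of rates."""
    # Initialisation: L_(1,r) = pi_r * L^(1,r)
    L = {r: pi[r] * column_lik[0][r] for r in rates}
    # Recursion: L_(j,r) = (sum over q of L_(j-1,q) * p_{q,r}) * L^(j,r)
    for col in column_lik[1:]:
        L = {r: sum(L[q] * p[q][r] for q in rates) * col[r] for r in rates}
    return sum(L.values())


total = alignment_likelihood(column_lik)
```

When every column has the same likelihood under both rates, the hidden rates become irrelevant and the result reduces to the plain product of column likelihoods, which gives a quick check of an implementation.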

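Statistical Alignment (Steel and Hein, 2000 + Holmes and Bruno, 2000)

Emit functions:
$e(\#\#) = \pi(N_1)\, f(N_1, N_2)$
$e(\#-) = \pi(N_1)$, $e(-\#) = \pi(N_2)$
$\pi(N_1)$ - equilibrium prob. of $N_1$
$f(N_1, N_2)$ - prob. that $N_1$ evolves into $N_2$

An HMM Generating Alignments

Transition probabilities from the start state and from the states $-\#$ and $\#\#$ (the same for all three):
  to $-\#$: $\lambda\beta$
  to $\#\#$: $\frac{\lambda}{\mu}(1 - \lambda\beta)\, e^{-\mu}$
  to $\#-$: $\frac{\lambda}{\mu}(1 - \lambda\beta)(1 - e^{-\mu})$
  to E: $(1 - \frac{\lambda}{\mu})(1 - \lambda\beta)$

Transition probabilities from the state $\#-$:
  to $-\#$: $1 - \frac{\mu\beta}{1 - e^{-\mu}}$
  to $\#\#$: $\frac{\lambda\beta e^{-\mu}}{1 - e^{-\mu}}$
  to $\#-$: $\lambda\beta$
  to E: $\frac{(\mu - \lambda)\beta}{1 - e^{-\mu}}$

[Figure: example nucleotide sequence T C C A C.]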

Probability of Data given a pedigree

Elston-Stewart (1971) - Temporal Peeling Algorithm:
[Figure: pedigree with Mother and Father.]
Condition on parental states.
Recombination and mutation are Markovian.

Lander-Green (1987) - Genotype Scanning Algorithm:
[Figure: pedigree with Mother and Father.]
Condition on paternal/maternal inheritance.
Recombination and mutation are Markovian.

Comment: Obvious parallel to the Wiuf-Hein (1999) reformulation of Hudson's 1983 algorithm.

Further Examples I

Isochore: Churchill, 1989, 92

HMM with two hidden states, poor (p) and rich (r), and emission probabilities
$P_p(C) = P_p(G) = 0.1$, $P_p(A) = P_p(T) = 0.4$, $P_r(C) = P_r(G) = 0.4$, $P_r(A) = P_r(T) = 0.1$

Likelihood Recursions:
$L_{(j,p)} = (L_{(j-1,p)}\, p_{p,p} + L_{(j-1,r)}\, p_{r,p})\, P_p(S[j])$
$L_{(j,r)} = (L_{(j-1,p)}\, p_{p,r} + L_{(j-1,r)}\, p_{r,r})\, P_r(S[j])$

Likelihood Initialisations:
$L_{(1,p)} = \pi_p\, P_p(S[1])$, $L_{(1,r)} = \pi_r\, P_r(S[1])$

Gene Finding: Burge and Karlin, 1996
[Figures: Simple Prokaryotic gene model; Simple Eukaryotic gene model.]
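On genome-scale sequences the isochore likelihoods shrink below floating-point range after a few thousand positions. One common remedy, sketched here, is to rescale the two likelihoods at every position and accumulate the logarithms of the scale factors. The emission probabilities are those of the slide, while the transition probabilities and the initial distribution are assumptions made for illustration.

```python
import math

states = ("p", "r")                                   # CG-poor, CG-rich
emit = {"p": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "r": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}
pi = {"p": 0.5, "r": 0.5}                             # assumed
trans = {"p": {"p": 0.99, "r": 0.01},                 # assumed trans[q][s]:
         "r": {"p": 0.01, "r": 0.99}}                 # from state q to state s


def log_likelihood(seq):
    """Natural log of P(seq), computed with per-position rescaling."""
    L = {s: pi[s] * emit[s][seq[0]] for s in states}  # initialisation
    log_total = 0.0
    for j in range(len(seq)):
        if j > 0:                                     # recursion step
            L = {s: sum(L[q] * trans[q][s] for q in states) * emit[s][seq[j]]
                 for s in states}
        scale = sum(L.values())
        log_total += math.log(scale)                  # accumulate the scale factor
        L = {s: v / scale for s, v in L.items()}      # renormalise to sum to 1
    return log_total


long_seq = "AT" * 5000 + "GC" * 5000                  # 20,000 nucleotides
value = log_likelihood(long_seq)
```

After rescaling, `L` holds the conditional distribution of the current hidden state given the sequence so far, so the scale factors multiply to P(seq).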

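Further Examples II

Secondary Structure Elements: Goldman, 1996

HMM for SSEs, with hidden states α (helix), β (strand) and L (loop):
[Figure: sequence of secondary structure states, α L α L β L β.]

        α       β       L
α     .909    .005    .062
β     .0005   .881    .086
L     .091    .184    .852
      .325    .212    .462

Adding Evolution:
SSE Prediction:

Profile HMM Alignment: Krogh et al., 1994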

Summary

[Figure: trellis with observations $O_1, \ldots, O_{10}$ and hidden states 1, 2, 3.]

Definition
Three Key Algorithms
  Summing over Unknown States
  Most Probable Unknown States
  Marginalizing Unknown States
Key Bioinformatic Applications
  Pedigree Analysis
  Isochores in Genomes (CG-rich regions)
  Profile HMM Alignment
  Fast/Slowly Evolving States
  Secondary Structure Elements in Proteins
  Gene Finding
  Statistical Alignment
