Chap. 3 Markov chains and hidden Markov models (2)

Biointelligence Laboratory, School of Computer Sci. & Eng., Seoul National University, Seoul 151-742, Korea
This slide file is available online at http://bi.snu.ac.kr/
Copyright (c) 2002 by SNU CSE Biointelligence Lab

Specifying the Model (HMM)
The design of the structure: what states there are and how they are connected.
The assignment of parameter values: the transition and emission probabilities $a_{kl}$ and $e_k(b)$.

The Framework for Parameter Estimation
We have a set of training sequences that we want the model to fit well.
Given the independent training sequences $x^1, \ldots, x^n$, the log probability of these sequences given the model and its parameters $\theta$ is the log likelihood of the model:
$$\log P(x^1, \ldots, x^n \mid \theta) = \sum_{j=1}^{n} \log P(x^j \mid \theta)$$
The parameter value that maximizes this log likelihood is chosen.

Estimation when the State Sequence is Known
It is easier to estimate the probability parameters when the state paths are known for all the training examples. Examples:
A set of genomic sequences in which the CpG islands were already labeled based on experimental data.
An HMM for the prediction of secondary structure of proteins, with training sequences obtained from the set of proteins with known structures.
An HMM predicting genes from genomic sequences where the transcript structure has been determined by cDNA sequencing.

The Maximum Likelihood Estimator
Count the number of times each transition and each emission is used in the training sequences, $A_{kl}$ and $E_k(b)$. The maximum likelihood estimators are
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \quad \text{and} \quad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$
With insufficient data these estimates overfit, so pseudocounts are added:
$A_{kl}$ = number of transitions $k$ to $l$ in training data + $r_{kl}$
$E_k(b)$ = number of emissions of $b$ from $k$ in training data + $r_k(b)$
The pseudocounts encode prior knowledge from users (cf. Dirichlet priors in Bayesian statistics).
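The counting estimator with pseudocounts can be sketched in a few lines. This is a minimal illustration, not the slides' own code: the function name `estimate_hmm`, the dictionary layout, and the use of a single uniform pseudocount for every $r_{kl}$ and $r_k(b)$ are assumptions made for the example.

```python
def estimate_hmm(labelled, states, alphabet, pseudocount=1.0):
    """Estimate a_kl and e_k(b) from sequences whose state paths are known.

    labelled: list of (symbols, path) pairs of equal length.
    pseudocount: one uniform value used for every r_kl and r_k(b) (assumption).
    """
    # Start every count at its pseudocount value.
    A = {k: {l: pseudocount for l in states} for k in states}
    E = {k: {b: pseudocount for b in alphabet} for k in states}
    for symbols, path in labelled:
        for i, (b, k) in enumerate(zip(symbols, path)):
            E[k][b] += 1                      # emission of b from state k
            if i + 1 < len(path):
                A[k][path[i + 1]] += 1        # transition k -> next state
    # Normalize each row of counts into probabilities.
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e


# Toy data: coin tosses (H/T) labelled with fair (F) or loaded (L) states.
a, e = estimate_hmm([("HHTT", "FFLL")], states="FL", alphabet="HT")
```

With a pseudocount of zero, any transition or emission absent from the training data would get probability exactly zero, which is the overfitting problem the slide refers to.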

Estimation when Paths are Unknown
All the standard algorithms for optimization of continuous functions can be used.
The Baum-Welch algorithm (informal description):
1. Estimate the $A_{kl}$ and $E_k(b)$ by considering probable paths for the training sequences using the current values of $a_{kl}$ and $e_k(b)$.
2. The estimated $A_{kl}$ and $E_k(b)$ are used to update the $a_{kl}$ and $e_k(b)$.
3. The above process is iterated until some stopping criterion is reached.
The overall log likelihood of the model is increased by each iteration, so the process converges to a local maximum.
There exist many local maxima, and the result strongly depends on the starting point of the algorithm.

Estimation of the $A_{kl}$ Value
The probability that $a_{kl}$ is used at position $i$ in sequence $x$:
$$P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}$$

Estimation of the $E_k(b)$ Value
The probability that state $k$ emits the symbol at position $i$ in sequence $x$:
$$P(\pi_i = k \mid x, \theta) = \frac{f_k(i)\, b_k(i)}{P(x)}$$
Only the positions where $x_i = b$ are considered.

Estimation of $A_{kl}$ and $E_k(b)$ Values
The Baum-Welch algorithm calculates $A_{kl}$ and $E_k(b)$ as the expected number of times each transition or emission is used, given the training sequences:
$$A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x^j_{i+1})\, b_l^j(i+1)$$
$$E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{\{i \mid x_i^j = b\}} f_k^j(i)\, b_k^j(i)$$
Having calculated these expectations, the new model parameter values are calculated.
We are converging in a continuous-valued space.
Stopping criterion: the average change in the log likelihood.

The Baum-Welch Algorithm
Initialization: Pick arbitrary model parameters.
Recurrence:
Set all the $A$ and $E$ variables to their pseudocount values $r$ (or to zero).
For each sequence $j = 1 \ldots n$:
  Calculate $f_k(i)$ for sequence $j$ using the forward algorithm.
  Calculate $b_k(i)$ for sequence $j$ using the backward algorithm.
  Add the contribution of sequence $j$ to $A$ and $E$.
Calculate the new model parameters.
Calculate the new log likelihood of the model.
Termination: Stop if the change in log likelihood is less than some predefined threshold or the maximum number of iterations is exceeded.
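The procedure above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not a reference implementation: symbols and states are integer indices, the initial state distribution `init` is held fixed rather than re-estimated, scaled forward and backward variables are used to avoid underflow, and the names `forward_backward`, `baum_welch`, `pseudo`, and `tol` are hypothetical.

```python
import math


def forward_backward(x, a, e, init):
    """Scaled forward (f) and backward (b) variables plus scale factors s."""
    K, L = len(a), len(x)
    f = [[0.0] * K for _ in range(L)]
    b = [[0.0] * K for _ in range(L)]
    s = [0.0] * L
    for i in range(L):
        for l in range(K):
            prev = init[l] if i == 0 else sum(f[i - 1][k] * a[k][l] for k in range(K))
            f[i][l] = prev * e[l][x[i]]
        s[i] = sum(f[i])                       # chosen so that f[i] sums to 1
        f[i] = [v / s[i] for v in f[i]]
    b[L - 1] = [1.0 / s[L - 1]] * K
    for i in range(L - 2, -1, -1):
        for k in range(K):
            b[i][k] = sum(a[k][l] * e[l][x[i + 1]] * b[i + 1][l] for l in range(K)) / s[i]
    return f, b, s


def _normalize(row):
    total = sum(row)
    return [v / total for v in row] if total > 0 else [1.0 / len(row)] * len(row)


def baum_welch(seqs, a, e, init, pseudo=0.0, max_iter=100, tol=1e-6):
    """Return (a, e, ll); ll is the log likelihood measured in the last pass."""
    K, M = len(a), len(e[0])
    old_ll = -math.inf
    ll = old_ll
    for _ in range(max_iter):
        A = [[pseudo] * K for _ in range(K)]   # expected transition counts
        E = [[pseudo] * M for _ in range(K)]   # expected emission counts
        ll = 0.0
        for x in seqs:
            f, b, s = forward_backward(x, a, e, init)
            ll += sum(math.log(v) for v in s)  # log P(x) = sum of log scale factors
            for i in range(len(x)):
                for k in range(K):
                    E[k][x[i]] += f[i][k] * b[i][k] * s[i]
                    if i + 1 < len(x):
                        for l in range(K):
                            A[k][l] += f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l]
        a = [_normalize(row) for row in A]
        e = [_normalize(row) for row in E]
        if ll - old_ll < tol:                  # change in log likelihood is small
            break
        old_ll = ll
    return a, e, ll
```

The log likelihood here is measured during each pass, before that pass's update, so the stopping test lags the slide's formulation by one iteration. Different starting parameters can lead to different local maxima, so in practice the function would be run from several starting points.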

The Baum-Welch Algorithm (Cont'd)
The pseudocount values $r$ cannot be rigorously interpreted in terms of Dirichlet priors.
The Baum-Welch algorithm is a special case of a very powerful general approach to probabilistic parameter estimation called the EM (expectation-maximization) algorithm.

The Viterbi Training
The most probable path for each training sequence is derived by the Viterbi algorithm.
This information is used in the re-estimation process. Once the paths are given, the update is the same as training when the state path is known.
The algorithm converges precisely, because the assignment of paths is a discrete process.
Baum-Welch maximizes the true likelihood $P(x^1, \ldots, x^n \mid \theta)$.
Viterbi training finds the value of $\theta$ that maximizes the contribution to the likelihood $P(x^1, \ldots, x^n \mid \theta, \pi^*(x^1), \ldots, \pi^*(x^n))$ from the most probable paths for all the sequences.
In general, Baum-Welch performs better than Viterbi training.
Viterbi training can still be a reasonable choice when the primary use of the HMM is Viterbi decoding.
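Viterbi training can be sketched as a loop that alternates decoding and counting. The code below is a minimal illustration under assumptions: integer-indexed states and symbols, a fixed initial distribution `init`, a uniform positive pseudocount, and hypothetical function names. It stops when the decoded paths no longer change, which reflects the discrete nature of the path assignment.

```python
import math


def viterbi(x, a, e, init):
    """Most probable state path for x, computed in log space."""
    K = len(a)
    lg = lambda v: math.log(v) if v > 0 else -math.inf
    V = [lg(init[k]) + lg(e[k][x[0]]) for k in range(K)]
    back = []
    for i in range(1, len(x)):
        ptr, new_V = [], []
        for l in range(K):
            k_best = max(range(K), key=lambda k: V[k] + lg(a[k][l]))
            ptr.append(k_best)
            new_V.append(V[k_best] + lg(a[k_best][l]) + lg(e[l][x[i]]))
        V = new_V
        back.append(ptr)
    path = [max(range(K), key=lambda k: V[k])]
    for ptr in reversed(back):                 # trace the pointers backwards
        path.append(ptr[path[-1]])
    return path[::-1]


def viterbi_training(seqs, a, e, init, pseudo=1.0, max_iter=50):
    """Re-estimate a and e from Viterbi paths until the paths stop changing."""
    K, M = len(a), len(e[0])
    prev_paths = None
    paths = []
    for _ in range(max_iter):
        paths = [viterbi(x, a, e, init) for x in seqs]
        if paths == prev_paths:                # discrete assignment has converged
            break
        A = [[pseudo] * K for _ in range(K)]
        E = [[pseudo] * M for _ in range(K)]
        for x, path in zip(seqs, paths):
            for i, k in enumerate(path):
                E[k][x[i]] += 1
                if i + 1 < len(path):
                    A[k][path[i + 1]] += 1
        a = [[v / sum(row) for v in row] for row in A]
        e = [[v / sum(row) for v in row] for row in E]
        prev_paths = paths
    return a, e, paths
```

Once the paths are fixed, the update is exactly the counting estimator used when state sequences are known.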

Example: Casino, Part 5
Training sequences: [figure]

The Model Estimated by the Baum-Welch Algorithm
The estimated probabilities are somewhat different from the true ones, for two reasons:
The problem of local maxima.
The limited amount of data does not permit accurate estimation of the parameter values.

The Model Learned from 30000 Rolls
The log-odds per roll, relative to a fair-die model:
$$\frac{1}{300} \log \frac{P(x \mid \text{loaded die model})}{P(x \mid \text{fair die model})}$$
The correct model: 0.101 bits
Model from 300 rolls: 0.097 bits
Model from 30000 rolls: 0.097 bits

Estimation of the HMM Parameters for the Classification Problem
We have to train the models of one class separately from the model of the other class, and then combine them into a larger HMM.
This separate estimation can be quite tedious when there are more than two classes.
The estimation of the transitions is not a simple counting problem when the transitions between the submodels are ambiguous.

Modeling of Labeled Sequences
The combined model of all the classes: each state is assigned a class label.
The labels on the observations $x = x_1, \ldots, x_L$ are $y = y_1, \ldots, y_L$.
In the Baum-Welch algorithm, only the valid paths (those consistent with the labels) are allowed.

Discriminative Estimation
Unless there are ambiguous transitions between submodels, the above estimation procedure gives the same result as if the submodels were estimated separately by the Baum-Welch algorithm and then combined with appropriate transitions afterwards:
$$\theta^{ML} = \arg\max_\theta P(x, y \mid \theta)$$
Our primary interest is in obtaining good predictions of $y$. The conditional maximum likelihood (CML):
$$\theta^{CML} = \arg\max_\theta P(y \mid x, \theta)$$

The Maximum Mutual Information
$$P(y \mid x, \theta) = \frac{P(x, y \mid \theta)}{P(x \mid \theta)}$$
$P(x, y \mid \theta)$: the probability calculated by the forward algorithm for labeled sequences.
$P(x \mid \theta)$: the probability calculated by the standard forward algorithm, disregarding all the labels.
There is no EM algorithm for optimizing this likelihood.

Choice of Model Topology
Fully-connected models suffer from a severe problem of local maxima and are rarely used in practice.
The less constrained the model is, the more severe the local maximum problem becomes.
There exist methods which attempt to adapt the model topology based on the training data.
Successful HMMs are constructed based on knowledge about the problem under investigation. Example: the model of CpG islands.
To disable the transition from state $k$ to $l$, just set $a_{kl}$ to zero.

Duration Modeling
The nucleotide distribution does not change for a certain length of DNA, as in the CpG islands model or the dishonest casino.
The probability of staying in a state with self-transition probability $p$ for $l$ residues:
$$P(l \text{ residues}) = (1 - p)\, p^{l-1}$$
This geometric distribution (exponential decay) could be inappropriate in some applications.
[figure] A minimum length of 5 residues, with an exponentially decaying distribution over longer sequences.

Duration Modeling (Cont'd)
[figure] Any distribution of lengths between 2 and 10.

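A More Subtle Modeling for Non-Geometric Length Distribution
An array of $n$ states, each with a self-transition of probability $p$ and a transition to the next state of probability $1 - p$.
For any path of length $l$ through the array, the transition probability is $p^{l-n} (1-p)^n$.
The probability of all the possible sequences of length $l$:
$$P(l) = \binom{l-1}{n-1} p^{l-n} (1-p)^n$$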
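The length distribution of an array of $n$ identical states can be checked numerically. The sketch below is illustrative only; the function name `length_prob` is hypothetical, and `math.comb` requires Python 3.8 or later. With $n = 1$ it reduces to the geometric distribution $(1-p)\,p^{l-1}$ of a single state.

```python
from math import comb


def length_prob(l, n, p):
    """P(l) for a chain of n states, each with self-transition probability p."""
    if l < n:
        return 0.0                      # every state must be visited at least once
    # comb(l - 1, n - 1) counts the paths of length l through the n states.
    return comb(l - 1, n - 1) * p ** (l - n) * (1 - p) ** n


# The distribution sums to 1 over all lengths (truncated here at 2000).
total = sum(length_prob(l, 3, 0.9) for l in range(1, 2000))
```

Increasing $n$ moves the mode of the distribution away from the minimum length, which a single geometric state cannot do.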

The Negative Binomial Distribution
[figure] $p = 0.99$, $n \le 5$
For a continuous Markov process: the Erlang distribution and the phase-type distribution.
It is also possible to model the length distribution explicitly.

Silent States
The states that do not emit any symbol: silent or null states (e.g. the begin and end states).
Example in Chapter 5: the Markov model where all states in a chain need to be connected to all states later in the chain.
If the length of the chain is 200, there are about 20000 transitions.
This is too large to be reliably estimated from realistic data sets.

The Forward Connected Model
In order to allow for arbitrary deletions: [figure]
The number of parameters required for the above model consisting of 200 states is:
$$\sum_{i=1}^{199} i = \frac{199 \cdot 200}{2} = 19900 \approx 20000$$

A Parallel Chain of Silent States
A model with silent states: [figure]
The number of parameters of the model corresponding to the above forward connected model is $199 + 198 + 198 + 197 = 792 \approx 800$.
The reduction of the parameters also reduces the representation power.
For example, the model cannot give the transitions $1 \to 5$ and $2 \to 4$ high probabilities while giving $1 \to 4$ and $2 \to 5$ low probabilities.

Extensions to the Forward Algorithm
So long as there are no loops consisting entirely of silent states, it is easy to extend all the HMM algorithms to incorporate them.
For the forward algorithm, $f_l(i+1)$:
1. For all real states $l$, calculate $f_l(i+1)$ as before from $f_k(i)$ for states $k$.
2. For any silent state $l$, set $f_l(i+1)$ to $\sum_k f_k(i+1)\, a_{kl}$ for real states $k$.
3. Starting from the lowest numbered silent state $l$, add $\sum_k f_k(i+1)\, a_{kl}$ to $f_l(i+1)$ for all silent states $k < l$.
If there are loops consisting entirely of silent states: eliminate the silent states from the calculations (the fully connected model).

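High Order Markov Chains
An $n$th order Markov process:
$$P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_1) = P(x_i \mid x_{i-1}, \ldots, x_{i-n})$$
An $n$th order Markov chain over some alphabet $A$ is equivalent to a first order Markov chain over the alphabet $A^n$ of $n$-tuples, because
$$P(x_k \mid x_{k-1}, \ldots, x_{k-n}) = P(x_k, x_{k-1}, \ldots, x_{k-n+1} \mid x_{k-1}, \ldots, x_{k-n})$$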

A Second Order Markov Chain
A second order Markov chain for sequences of A and B: ABBAB becomes AB BB BA AB.
The equivalent first order Markov chain has the states AA, AB, BA, BB.
BB or BA following BA or AA is disallowed.
Sometimes the framework of a high order model is convenient.
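The translation from a sequence to overlapping $n$-tuples, and the constraint on which tuples may follow each other, can be written down directly. The helper names `to_tuples` and `allowed` are hypothetical and used only for this illustration.

```python
def to_tuples(seq, n):
    """Rewrite seq as its overlapping n-tuples (states of the first order chain)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def allowed(prev, nxt):
    """A tuple may follow another only if they overlap in n - 1 symbols."""
    return prev[1:] == nxt[:-1]


pairs = to_tuples("ABBAB", 2)   # the example sequence from the slide
```

For the slide's example, `allowed("BA", "BB")` and `allowed("AA", "BA")` are both false, so those transitions have probability zero in the equivalent first order chain.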

Finding Prokaryotic Genes
Genes of prokaryotes (bacteria) have a very simple one-dimensional structure:
start codon : codons for amino acids : stop codon

Open Reading Frames
Finding good gene candidates from DNA stretches:
Start with one of the possible start codons.
Continue with a number of non-stop codons.
End with one of the possible stop codons.
Such stretches are open reading frames (ORFs).
Overlapping ORFs: the same stop codon with different start codons.
There are many more ORFs than real genes.
The task is to discriminate between a non-coding ORF and a real gene.
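An ORF scan along these lines can be sketched as below. The code is a simplified illustration with several assumptions that the slides do not state: only ATG is treated as a start codon, the stop codons are TAA, TAG and TGA, only the forward strand is scanned, and the name `find_orfs` is hypothetical. Overlapping ORFs that share a stop codon are all reported.

```python
STARTS = {"ATG"}                 # assumption: a single start codon
STOPS = {"TAA", "TAG", "TGA"}    # assumption: the standard stop codons


def find_orfs(dna, min_len=0):
    """Return (start, end) index pairs of ORFs on the forward strand.

    end is exclusive and includes the stop codon.
    """
    orfs = []
    for frame in range(3):
        starts = []              # start codons seen since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon in STARTS:
                starts.append(i)
            elif codon in STOPS:
                # Every pending start shares this stop codon.
                orfs.extend((s, i + 3) for s in starts if i + 3 - s >= min_len)
                starts = []
    return orfs


orfs = find_orfs("ATGAAAATGCCCTAAGG")
```

The two ORFs found in the example share the stop codon TAA, which is the overlapping case mentioned on the slide.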

Example: DNA from E. coli
The training data set: 1100 genes more than 100 nucleotides long, 900 for training the model and 200 for testing the trained model.
A first order model, just as for the CpG islands, was estimated from the training data.
In the test set, 6500 ORFs with a length of more than 100 bases were found.
ORFs that share the stop codon with a known real gene were not included.

Histograms of the Log-Odds per Nucleotide
[figure]
The null model for the log-odds: the simplest model, with the probability for each nucleotide equal to the frequency with which it occurs in all the data.
Average for genes: 0.018; average for NORFs: 0.009.
Discrimination with this model is nearly impossible.

A Markov Chain Consisting of Codons
All the sequences are transformed to sequences of codons.
A 64-state first order Markov chain was estimated.
The null model: a uniform distribution over codons.

Inhomogeneous Markov Chans
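Scoring a sequence with three position-specific transition tables can be sketched as follows. This is an illustration under assumptions, not the GENEMARK implementation: the sequence is assumed to start at codon position 1, the probability of the first symbol is left out, and the name `inhomogeneous_log_prob` and the nested-dictionary layout are hypothetical.

```python
import math


def inhomogeneous_log_prob(seq, a1, a2, a3):
    """Log probability of the transitions in seq under a 3-periodic chain.

    a1, a2, a3 map a symbol to a dict of next-symbol probabilities and are
    used for transitions leaving codon positions 1, 2 and 3 respectively.
    """
    tables = (a1, a2, a3)
    log_p = 0.0
    for i in range(len(seq) - 1):
        # i % 3 cycles through the tables in the order a1, a2, a3, a1, ...
        log_p += math.log(tables[i % 3][seq[i]][seq[i + 1]])
    return log_p


bases = "ACGT"
uniform = {x: {y: 0.25 for y in bases} for x in bases}
score = inhomogeneous_log_prob("ACGTAC", uniform, uniform, uniform)
```

Since the reading frame of a candidate is usually unknown in advance, a practical scorer would evaluate all three frame offsets and keep the best one.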

Numerical Stability of HMM Algorithms
The Viterbi, forward and backward algorithms involve multiplying many probabilities.
For a model of genomic sequences with 100000 bases and a typical emission and transition probability of 0.1, the resulting probability will be $10^{-100000}$.
Most computers could not deal with such a small number.

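The Log Transformation
$$\log_{10} 10^{-100000} = -100000$$
For the Viterbi algorithm, with $\tilde{a}_{kl} = \log a_{kl}$ and $\tilde{e}_l(x) = \log e_l(x)$:
$$V_l(i+1) = \tilde{e}_l(x_{i+1}) + \max_k \left( V_k(i) + \tilde{a}_{kl} \right)$$
For the forward and backward algorithms, the logarithm of a sum of probabilities is needed, so exponentiation is required. With $\tilde{p} = \log p$, $\tilde{q} = \log q$ and $\tilde{r} = \log(p + q)$:
$$\tilde{r} = \log\left(\exp(\tilde{p}) + \exp(\tilde{q})\right) = \tilde{p} + \log\left(1 + \exp(\tilde{q} - \tilde{p})\right)$$
The function $\log(1 + \exp(x))$ can be approximated by interpolation from a table.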
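The log-of-a-sum identity can be sketched directly. The version below is a minimal illustration: it computes the correction term exactly with `math.log1p` instead of interpolating from a table, and the function name `log_add` is hypothetical.

```python
import math


def log_add(p_log, q_log):
    """Return log(p + q) given p_log = log p and q_log = log q."""
    if p_log < q_log:
        p_log, q_log = q_log, p_log        # make p_log the larger of the two
    if q_log == -math.inf:
        return p_log                        # adding a zero probability
    # q_log - p_log <= 0, so the exponential cannot overflow.
    return p_log + math.log1p(math.exp(q_log - p_log))


# Two probabilities of 10^-100000 each, far below the smallest float.
tiny = -100000 * math.log(10)
doubled = log_add(tiny, tiny)
```

Factoring out the larger term keeps the argument of the exponential non-positive, so the sum stays representable even when both probabilities underflow as ordinary floats.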

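Scaling of Probabilities
To rescale the $f$ and $b$ variables, define scaling variables $s_i$ and
$$\tilde{f}_l(i) = \frac{f_l(i)}{\prod_{j=1}^{i} s_j}$$
$$\tilde{f}_l(i+1) = \frac{1}{s_{i+1}}\, e_l(x_{i+1}) \sum_k \tilde{f}_k(i)\, a_{kl}$$
$$\tilde{b}_k(i) = \frac{1}{s_i} \sum_l a_{kl}\, \tilde{b}_l(i+1)\, e_l(x_{i+1})$$
A convenient choice is to pick $s_i$ so that $\sum_l \tilde{f}_l(i) = 1$.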
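A scaled forward pass can be sketched as follows. The code is a minimal illustration under assumptions: states and symbols are integer indices, an explicit initial distribution `init` stands in for the begin state, each scale factor is chosen so that the scaled forward variables sum to one, and the name `scaled_forward` is hypothetical. The log probability of the sequence is then the sum of the logs of the scale factors.

```python
import math


def scaled_forward(x, a, e, init):
    """Return log P(x) using forward variables rescaled at every position.

    x must be non-empty; a[k][l] and e[k][b] are probabilities.
    """
    K = len(a)
    f = [init[k] * e[k][x[0]] for k in range(K)]
    log_px = 0.0
    for i in range(len(x)):
        if i > 0:
            # Forward recurrence on the already scaled values.
            f = [e[l][x[i]] * sum(f[k] * a[k][l] for k in range(K)) for l in range(K)]
        s = sum(f)                         # scale factor for this position
        f = [v / s for v in f]             # scaled values now sum to 1
        log_px += math.log(s)
    return log_px


a = [[0.9, 0.1], [0.2, 0.8]]
e = [[0.5, 0.5], [0.1, 0.9]]
long_score = scaled_forward([0, 1] * 3000, a, e, [0.5, 0.5])
```

For the 6000-symbol example an unscaled forward pass would underflow to zero, while the scaled version returns a finite log probability.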

Further Reading
Basic introduction to HMMs: [Rabiner and Juang 1986; Krogh 1998]
Early applications of HMM-like models to sequence analysis: [Borodovsky et al. 1986a, 1986b, 1986c]
GENEMARK genefinder program: [Borodovsky and McIninch 1993]
EM algorithm for modeling protein binding motifs: [Cardon and Stormo 1992]
Combining neural networks and HMMs: [Stormo and Haussler 1996; Kulp et al. 1996; Reese et al. 1997; Burge and Karlin 1997]
HMMs for modeling compositional differences between DNA from mitochondria and from the human X chromosome and bacteriophage lambda: [Churchill 1989]
A three-state HMM for prediction of protein secondary structure: [Asai, Hayamizu and Handa 1993]
An HMM with ten states in a ring for modeling an oscillatory pattern in nucleosomes: [Baldi et al. 1996]