I529: Machine Learning in Bioinformatics (Spring 2017) Markov Models


I529: Machine Learning in Bioinformatics (Spring 2017), Markov Models. Yuzhen Ye, School of Informatics and Computing, Indiana University, Bloomington. Spring 2017.

Outline: Simple model (frequency & profile) review; Markov chain; CpG island question 1; Model comparison by log likelihood ratio test; Markov chain variants: kth order, inhomogeneous Markov chains, interpolated Markov models (IMM); Applications: gene finding (Genemark & Glimmer), taxonomic assignment in metagenomics (Phymm).

A DNA profile (matrix). Training alignment:
TATAAA
TATAAT
TATAAA
TATAAA
TATAAA
TATTAA
TTAAAA
TAGAAA

Count matrix (positions 1-6):
     1  2  3  4  5  6
T    8  1  6  1  0  1
C    0  0  0  0  0  0
A    0  7  1  7  8  7
G    0  0  1  0  0  0

Sparse data → pseudo-counts (add 1 to every count):
     1  2  3  4  5  6
T    9  2  7  2  1  2
C    1  1  1  1  1  1
A    1  8  2  8  9  8
G    1  1  2  1  1  1
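As a concrete illustration, here is a minimal Python sketch of building such a count matrix with add-one pseudo-counts; the helper name profile_matrix is ours, not from the slides.

```python
from collections import Counter

def profile_matrix(seqs, alphabet="TCAG", pseudocount=1):
    """Column-wise nucleotide counts for aligned sequences, plus a pseudocount."""
    length = len(seqs[0])
    counts = [Counter(col) for col in zip(*seqs)]          # raw counts per position
    return {b: [counts[i][b] + pseudocount for i in range(length)]
            for b in alphabet}

seqs = ["TATAAA", "TATAAT", "TATAAA", "TATAAA",
        "TATAAA", "TATTAA", "TTAAAA", "TAGAAA"]
print(profile_matrix(seqs))   # reproduces the pseudo-count table above
```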

Frequency & profile model. Frequency model: the order of nucleotides in the training sequences is ignored. Profile model: the training sequences are aligned → the order of nucleotides in the training sequences is fully preserved. Markov chain model: order is partially incorporated.

Markov chain model. Sometimes we need to model dependencies between adjacent positions in the sequence. There are certain regions in the genome, like TATA within the regulatory area upstream of a gene. The pattern CG is less common than expected from random sampling. Such dependencies can be modeled by Markov chains.

Markov chains. A Markov chain is a sequence of random variables with the Markov property, i.e., given the present state, the future and the past are independent. A famous example of a Markov chain is the drunkard's walk: at each step, the position may change by +1 or -1 with equal probability. Pr(5→4) = Pr(5→6) = 0.5, and all other transition probabilities from 5 are 0. These probabilities are independent of whether the system was previously in state 4 or 6.

1st order Markov chain. An integer time stochastic process, consisting of a set of m>1 states {s_1, ..., s_m} and 1. an m-dimensional initial distribution vector (p(s_1), ..., p(s_m)); 2. an m×m transition probability matrix M = (a_{s_i s_j}). For example, for a DNA sequence the states are {A, C, T, G} (m=4); p(A) is the probability of A being the 1st letter, and a_AG is the probability that G follows A in a sequence.

1st order Markov chain: X_1 → X_2 → ... → X_{n-1} → X_n. For each integer n, a Markov chain assigns probability to sequences (x_1 ... x_n) as follows:

p(x_1, x_2, ..., x_n) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i | X_{i-1} = x_{i-1}) = p(x_1) \prod_{i=2}^{n} a_{x_{i-1} x_i}
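A minimal sketch of this computation in Python, working in log space to avoid underflow; the uniform toy parameters are placeholders for illustration, not values from the slides.

```python
import math

def markov_log_prob(seq, init, trans):
    """log p(x1..xn) = log p(x1) + sum_i log a_{x_{i-1} x_i} for a 1st-order chain."""
    logp = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp

# toy parameters (assumed, uniform over {A,C,G,T})
init = {b: 0.25 for b in "ACGT"}
trans = {s: {b: 0.25 for b in "ACGT"} for s in "ACGT"}
print(markov_log_prob("TATAAA", init, trans))
```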

Matrix representation. The transition probability matrix M = (a_st):

      A     B     C     D
A   0.95    0   0.05    0
B   0.2   0.5    0     0.3
C    0    0.2    0     0.8
D    0     0     1      0

M is a stochastic matrix: each row sums to 1, i.e. \sum_t a_st = 1. The initial distribution vector (u_1, ..., u_m) defines the distribution of X_1: p(X_1 = s_i) = u_i.

Digraph (directed graph) representation. The same chain drawn as a digraph, with edges A→A (0.95), A→C (0.05), B→A (0.2), B→B (0.5), B→D (0.3), C→B (0.2), C→D (0.8), D→C (1). Each directed edge A→B is associated with the positive transition probability from A to B.

Classification of Markov chain states. States of Markov chains are classified by the digraph representation (omitting the actual probability values). [Digraph over states A, B, C, D not reproduced.] A, C and D are recurrent states: they are in strongly connected components which are sinks in the graph. B is not recurrent; it is a transient state. Alternative definitions: a state s is recurrent if it can be reached from any state reachable from s; otherwise it is transient.

Another example of recurrent and transient states. [Digraph over states A, B, C, D not reproduced.] A and B are transient states, C and D are recurrent states. Once the process moves from B to D, it will never come back.

A 3-state Markov model of the weather. Assume the weather can be: rain or snow (state 1), cloudy (state 2), or sunny (state 3), and that the weather on any day t is characterized by one of the three states. The transition probabilities between the three states:

A = {a_ij} =
  0.4  0.3  0.3
  0.2  0.6  0.2
  0.1  0.1  0.8

Questions: Given that the first day is sunny, what is the probability that the weather for the following 7 days will be sun-sun-rain-rain-sun-cloudy-sun? What is the probability of the weather staying in a state for d days? Rabiner (1989)
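A small sketch of both calculations; the 0-indexed state order (rain/snow, cloudy, sunny) and the helper names are our choices, not from the slides.

```python
import numpy as np

A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])   # rows/cols: rain-snow, cloudy, sunny

def path_prob(states, A):
    """Probability of a state path, conditioned on its first state."""
    return np.prod([A[s, t] for s, t in zip(states, states[1:])])

# sunny first day followed by sun-sun-rain-rain-sun-cloudy-sun
print(path_prob([2, 2, 2, 0, 0, 2, 1, 2], A))   # ~1.536e-4

def stay_prob(i, d, A):
    """Probability of staying in state i for exactly d days: a_ii^(d-1) * (1 - a_ii)."""
    return A[i, i] ** (d - 1) * (1 - A[i, i])
```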

CpG island modeling. In mammalian genomes, the dinucleotide CG often transforms to (methyl-C)G, which often subsequently mutates to TG. Hence CG appears less often than expected from the independent frequencies of C and G alone. For biological reasons, this process is sometimes suppressed in short stretches of the genome, such as in the upstream regions of many genes. These areas are called CpG islands.

Questions about CpG islands. We consider two questions (and some variants). Question 1: given a short stretch of genomic data, does it come from a CpG island? Question 2: given a long piece of genomic data, does it contain CpG islands, and if so, where and how long? We solve the first question by modeling sequences with and without CpG islands as Markov chains over the same states {A, C, G, T} but with different transition probabilities.

Markov models for (non-)CpG islands. The + model: use transition matrix A+ = (a+_st), where a+_st is the probability that t follows s in a CpG island → positive samples. The - model: use transition matrix A- = (a-_st), where a-_st is the probability that t follows s in a non-CpG-island sequence → negative samples. With these two models, to solve Question 1 we need to decide whether a given short sequence is more likely to come from the + model or from the - model. This is done using the definition of a Markov chain, with the parameters determined from training data.

Matrices of the transition probabilities (rows are x_{i-1}, columns are x_i; rows sum to 1).

A+ (CpG islands): p+(x_i | x_{i-1})
        A      C      G      T
A     0.180  0.274  0.426  0.120
C     0.171  0.368  0.274  0.188
G     0.161  0.339  0.375  0.125
T     0.079  0.355  0.384  0.182

A- (non-CpG islands): p-(x_i | x_{i-1})
        A      C      G      T
A     0.300  0.205  0.285  0.210
C     0.322  0.298  0.078  0.302
G     0.248  0.246  0.298  0.208
T     0.177  0.239  0.292  0.292

Model comparison. Given a sequence x = (x_1 ... x_L), compute the likelihood ratio

RATIO = p(x | + model) / p(x | - model) = \prod_{i=1}^{L} p^+(x_i | x_{i-1}) / \prod_{i=1}^{L} p^-(x_i | x_{i-1})

If RATIO > 1, a CpG island is more likely. In practice the log of this ratio is computed. Note: p+(x_1 | x_0) is defined for convenience as p+(x_1), and p-(x_1 | x_0) is defined for convenience as p-(x_1).

Log likelihood ratio test. Taking the logarithm yields

log Q = log [ p(x_1 ... x_L | +) / p(x_1 ... x_L | -) ] = \sum_{i=1}^{L} log [ p^+(x_i | x_{i-1}) / p^-(x_i | x_{i-1}) ]

If log Q > 0, then + is more likely (CpG island). If log Q < 0, then - is more likely (non-CpG island).
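A hedged sketch of this log-odds score using the two transition tables above. Since the slides do not give initial distributions p+(x_1) and p-(x_1), this version scores transitions only and assumes the initial terms cancel.

```python
import math

# transition matrices from the slides (rows: previous base, columns: next base)
A_plus = {
    "A": {"A": .180, "C": .274, "G": .426, "T": .120},
    "C": {"A": .171, "C": .368, "G": .274, "T": .188},
    "G": {"A": .161, "C": .339, "G": .375, "T": .125},
    "T": {"A": .079, "C": .355, "G": .384, "T": .182},
}
A_minus = {
    "A": {"A": .300, "C": .205, "G": .285, "T": .210},
    "C": {"A": .322, "C": .298, "G": .078, "T": .302},
    "G": {"A": .248, "C": .246, "G": .298, "T": .208},
    "T": {"A": .177, "C": .239, "G": .292, "T": .292},
}

def log_odds(seq):
    """log Q = sum_i log( p+(x_i|x_{i-1}) / p-(x_i|x_{i-1}) ), transitions only
    (the initial-base terms are not given on the slides and are omitted here)."""
    return sum(math.log(A_plus[p][c] / A_minus[p][c]) for p, c in zip(seq, seq[1:]))

print(log_odds("CGCG"))   # > 0 suggests a CpG island
```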

A toy example. Sequence: CGCG. P(CGCG | +) = ? P(CGCG | -) = ? Log likelihood ratio?

Where do the parameters (transition probabilities) come from? Learning from training data. Source: a collection of sequences from CpG islands, and a collection of sequences from non-CpG islands. Input: tuples of the form (x_1, ..., x_L, h), where h is + or -. Output: maximum likelihood estimates (MLE) of the parameters. Count all pairs (X_i = a, X_{i-1} = b) with label + and with label -, say the counts are N_{ba,+} and N_{ba,-}; the MLE transition probabilities are the normalized counts, e.g. a+_{ba} = N_{ba,+} / \sum_{a'} N_{ba',+}.
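A minimal sketch of this counting-and-normalizing step; train_transitions is a hypothetical helper, and the pseudocount argument is our addition (not on the slides) to keep unseen transitions away from zero.

```python
from collections import defaultdict

def train_transitions(seqs, alphabet="ACGT", pseudocount=1):
    """MLE transition probabilities a_ba = N_ba / sum_a' N_ba' from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for b, a in zip(seq, seq[1:]):
            counts[b][a] += 1
    return {b: {a: (counts[b][a] + pseudocount) /
                   sum(counts[b][c] + pseudocount for c in alphabet)
                for a in alphabet}
            for b in alphabet}

# usage (hypothetical inputs):
#   A_plus  = train_transitions(cpg_island_sequences)
#   A_minus = train_transitions(background_sequences)
```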

CpG island: question 2. Question 2: given a long piece of genomic data, does it contain CpG islands, and where? For this, we need to decide which parts of a given long sequence of letters are more likely to come from the + model, and which parts are more likely to come from the - model. We define a Markov chain over 8 states: A+, C+, G+, T+ and A-, C-, G-, T-. The problem is that we don't know the sequence of states (hidden) which is traversed, only the sequence of letters (observations). Hidden Markov Model!

Markov model variations: kth order Markov chains (Markov chains with memory); inhomogeneous Markov chains (vs homogeneous Markov chains); interpolated Markov chains.

kth order Markov chain (a Markov chain with memory k). A kth order Markov chain assigns probability to sequences (x_1 ... x_n) as follows:

p(x_1 ... x_n) = p(x_1, ..., x_k) \prod_{i=k+1}^{n} p(X_i = x_i | X_{i-k} = x_{i-k}, ..., X_{i-1} = x_{i-1})

where p(x_1, ..., x_k) is the initial distribution and the conditional terms are the transition probabilities.
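A sketch of scoring a sequence under a kth order chain, assuming the transition probabilities are stored per length-k context string; this data layout is our choice, not specified on the slides.

```python
import math

def kth_order_log_prob(seq, k, init_prob, trans):
    """log p(x_1..x_n) = log p(first k symbols) + sum_{i>k} log p(x_i | previous k symbols)."""
    logp = math.log(init_prob[seq[:k]])
    for i in range(k, len(seq)):
        context = seq[i - k:i]
        logp += math.log(trans[context][seq[i]])
    return logp

# trans maps a length-k context to next-symbol probabilities, e.g. trans["AC"]["G"] for k=2;
# init_prob maps length-k strings to their probabilities.
```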

Inhomogeneous Markov chain for gene finding. X_1 →(a) X_2 →(b) X_3 →(c) X_4 →(a) X_5 →(b) X_6 →(c) X_7. Again, the parameters (the transition matrices a, b and c) need to be learned from training samples.

Inhomogeneous Markov chain: prediction. For positions X_1 X_2 X_3 X_4 X_5 X_6 X_7, the three reading frames apply the transition matrices in different phases. Reading frame 1: a b c a b c. Reading frame 2: c a b c a b. Reading frame 3: b c a b c a.

Gene finding using an inhomogeneous Markov chain. Consider the sequence x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 ..., where x_i is a nucleotide. Let

p_1 = a_{x_1 x_2} b_{x_2 x_3} c_{x_3 x_4} a_{x_4 x_5} b_{x_5 x_6} c_{x_6 x_7} ...
p_2 = c_{x_1 x_2} a_{x_2 x_3} b_{x_3 x_4} c_{x_4 x_5} a_{x_5 x_6} b_{x_6 x_7} ...
p_3 = b_{x_1 x_2} c_{x_2 x_3} a_{x_3 x_4} b_{x_4 x_5} c_{x_5 x_6} a_{x_6 x_7} ...

Then the probability that the ith reading frame is the coding frame is P_i = p_i / (p_1 + p_2 + p_3). Genemark (gene finder for bacterial genomes) is based on this idea.
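A sketch of this frame-probability computation; a real implementation would work in log space to avoid underflow, and the nested-dict matrix layout (matrix[previous_base][next_base]) is our assumption.

```python
def frame_probs(seq, a, b, c):
    """P_i = p_i / (p1 + p2 + p3) under a 3-periodic (inhomogeneous) Markov chain.
    a, b, c are dicts of dicts: matrix[prev_base][next_base] -> probability."""
    frames = [(a, b, c), (c, a, b), (b, c, a)]   # matrix cycle for frames 1, 2, 3
    ps = []
    for mats in frames:
        p = 1.0
        for j in range(len(seq) - 1):
            p *= mats[j % 3][seq[j]][seq[j + 1]]
        ps.append(p)
    total = sum(ps)
    return [p / total for p in ps]
```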

Selecting the order of a Markov chain. For Markov models, what order should we choose? Higher order means more memory (higher predictive value), but also more parameters to learn; the higher the order, the less reliable the parameter estimates. E.g., suppose we have a DNA sequence of 100 kbp: a 2nd order Markov chain has 4^3 = 64 parameters, so each history is seen about 1562 times on average; a 5th order chain has 4^6 = 4096 parameters, about 24 times on average; an 8th order chain has 65536 parameters, about 1.5 times on average.

Interpolated Markov models (IMMs). IMMs are also called variable-order Markov models. An IMM uses a variable number of states (context lengths) to compute the probability of the next state.

Simple linear interpolation:
P(x_i | x_{i-n}, ..., x_{i-1}) = λ_0 P(x_i) + λ_1 P(x_i | x_{i-1}) + ... + λ_n P(x_i | x_{i-n}, ..., x_{i-1})

General linear interpolation:
P(x_i | x_{i-n}, ..., x_{i-1}) = λ_0 P(x_i) + λ_1(x_{i-1}) P(x_i | x_{i-1}) + ... + λ_n(x_{i-n}, ..., x_{i-1}) P(x_i | x_{i-n}, ..., x_{i-1})
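A sketch of the simple linear interpolation above, assuming each order-k component model is stored as a dict keyed by its length-k context; the weights are passed in and must sum to 1.

```python
def interpolated_prob(x, context, models, lambdas):
    """Simple linear interpolation of Markov chains of increasing order:
    P(x | context) = lambda_0 P(x) + lambda_1 P(x | last 1) + ... + lambda_n P(x | last n).
    models[k] maps a length-k context string to {next_symbol: probability}.
    Assumes len(context) >= len(lambdas) - 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-9, "interpolation weights should sum to 1"
    p = 0.0
    for k, lam in enumerate(lambdas):
        ctx = context[-k:] if k else ""
        p += lam * models[k].get(ctx, {}).get(x, 0.0)
    return p
```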

GLIMMER. Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses; GlimmerHMM is the eukaryotic version of Glimmer. Glimmer (Gene Locator and Interpolated Markov ModelER) uses IMMs to identify the coding regions. Glimmer version 3.02 is the current version of the system (http://www.cbcb.umd.edu/software/glimmer/). Glimmer3 makes several algorithmic changes to reduce the number of false positive predictions and to improve the accuracy of start-site predictions.

IMM in GLIMMER. A linear combination of 8 different Markov chains, from 1st through 8th order, weighting each model according to its predictive power. Glimmer uses 3-periodic nonhomogeneous Markov models in its IMMs. The score of a sequence is the product of the interpolated probabilities of the bases in the sequence. IMM training: a longer context is always better; the only reason not to use it is undersampling in the training data. If a context occurs frequently enough in the training data, use it, i.e., set λ = 1; otherwise, use its frequency and a χ² significance test to set λ, as sketched below.
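A heavily hedged sketch of such a weighting rule. The count threshold of 400 and the exact way frequency and χ² confidence are combined are illustrative assumptions here, not Glimmer's published constants.

```python
from scipy.stats import chisquare

def context_lambda(next_counts, shorter_model, threshold=400):
    """Illustrative GLIMMER-style weight for one context.
    next_counts: observed counts of A/C/G/T following the context;
    shorter_model: probabilities of A/C/G/T predicted by the next-shorter context."""
    total = sum(next_counts.values())
    if total == 0:
        return 0.0
    if total >= threshold:                      # enough data: trust the long context fully
        return 1.0
    observed = [next_counts.get(b, 0) for b in "ACGT"]
    expected = [shorter_model[b] * total for b in "ACGT"]
    _, pvalue = chisquare(observed, expected)
    confidence = 1.0 - pvalue                   # how strongly the long context differs
    return 0.0 if confidence < 0.5 else confidence * total / threshold
```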

Clustering metagenomic sequences with IMMs. IMMs are used to classify metagenomic sequences based on patterns of DNA distinct to a clade (a species, genus, or higher-level phylogenetic group). During training, the IMM algorithm constructs probability distributions representing observed patterns of nucleotides that characterize each species. Nat Methods 2009, 6(9):673-676.