Statistical Machine Learning Methods for Bioinformatics I. Hidden Markov Model Theory

Statistical Machine Learning Methods for Bioinformatics I. Hidden Markov Model Theory. Jianlin Cheng, PhD, Informatics Institute, Department of Computer Science, University of Missouri, 2009. Free for Academic Use. Copyright @ Jianlin Cheng.

What is Machine Learning? As a broad subfield of artificial intelligence, machine learning is concerned with the design and development of algorithms and techniques that allow computers to learn. The major focus is to extract information from data automatically, by computational and statistical methods. Closely related to data mining, statistics, and theoretical computer science.

Why Machine Learning? Bill Gates: If you design a machine that can learn, it is worth 10 Microsofts.

Applications of Machine Learning: natural language processing, syntactic pattern recognition, search engines; bioinformatics, cheminformatics, medical diagnosis; detecting credit card fraud, stock market analysis, speech and handwriting recognition, object recognition; game playing and robotics.

Common Algorithm Types: statistical modeling (HMM, Bayesian networks), supervised learning (classification), unsupervised learning (clustering).

Why Bioinformatics? For the first time, we computer/information scientists can make a direct impact on health and medical sciences. Working with biologists, we will revolutionize the research of life science, decode the secrets of life, and cure diseases threatening human beings.

Don't Quit Because It is Hard. Interviewer: Do you view yourself as a billionaire, CEO, or the president of Microsoft? Bill Gates: I view myself as a software engineer. Business is not complicated, but science is complicated. JFK: We want to go to the moon, not because it is easy, but because it is hard.

Statistical Machine Learning Methods for Bioinformatics: 1. Hidden Markov Model and its Application in Bioinformatics (e.g., sequence and profile alignment); 2. Support Vector Machine and its Application in Bioinformatics (e.g., protein fold recognition); 3. EM Algorithm, Gibbs Sampling, and Bayesian Networks and their Applications in Bioinformatics (e.g., gene regulatory networks).

Possible Projects: multiple sequence alignment using HMM; profile-profile alignment using HMM; protein secondary structure prediction using neural networks; protein fold recognition using support vector machines; or your own project upon my approval. Work alone or in a group of up to 3 persons. 40% presentation and 60% report.

Real World Process / Signal Source. Discrete signals: characters, nucleotides, ... Continuous signals: speech samples, temperature measurements, music, ... A signal can be pure (from a single source) or be corrupted by other sources (e.g., noise). Stationary source: its statistical properties do not vary. Nonstationary source: the signal properties vary over time.

Fundamental Problem: Characterizing Real-World Signals in Terms of Signal Models. A signal model can provide the basis for a theoretical description of a signal processing system, which can be used to process signals to improve outputs (e.g., remove noise from speech signals). Signal models let us learn a great deal about the signal source without having the source available (simulate the source and learn via simulation). Signal models work pretty well in practical systems: prediction systems, recognition systems, identification systems.

Types of Signal Models. Deterministic models: exploit known properties of signals (sine wave, sum of exponentials); easy, only need to estimate parameters (amplitude, frequency, ...). Statistical models: try to characterize only the statistical properties of the signal (Gaussian processes, Poisson processes, Markov processes, hidden Markov processes). Assumption of statistical models: signals can be well characterized as a parametric random process, and the parameters of the stochastic process can be estimated in a well-defined manner.

History of HMM. Introduced in the late 1960s and early 1970s (L. E. Baum and others). Widely used in speech recognition in the 1980s (Baker, Jelinek, Rabiner, and others). Widely used in biological sequence analysis from the 1990s till now (Churchill, Haussler, Krogh, Baldi, Durbin, Eddy, Karplus, Karlin, Burges, Hughey, Sonnhammer, Sjolander, Edgar, Soeding, and many others).

Discrete Markov Process. Consider a system described at any time as being in one of a set of N distinct states, S_1, S_2, ..., S_N. [Figure: a five-state Markov chain with states S1-S5 and transition probabilities a_ij.] At regularly spaced discrete times, the system undergoes a change of state according to a set of probabilities.

Finite State Machine. A class of automata. Example: a deterministic finite automaton (state transitions determined by the input) that determines whether the binary input contains an even number of 0s. [Figure: two states S1 and S2 with transitions on inputs 0 and 1.] What's the difference from a Markov process? A Markov process is a stochastic finite automaton.

Definition of Variables. Time instants associated with state changes: t = 1, 2, ..., T. Observation: O = O_1 O_2 ... O_T. Actual state at time t: q_t. A full probabilistic description of the system requires the specification of the current state as well as all the predecessor states. The first-order Markov chain (truncation): P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, ...] = P[q_t = S_j | q_{t-1} = S_i]. Further simplification: state transitions are independent of time. a_ij = P[q_t = S_j | q_{t-1} = S_i], 1 <= i, j <= N, subject to the constraints a_ij >= 0 and Σ_{j=1}^{N} a_ij = 1.

Observable Markov Model. The stochastic process is called an observable model if the output of the process is the set of states at each instant of time, where each state corresponds to a physical (observable) event. Example: Markov model of weather. State 1: rain or snow; State 2: cloudy; State 3: sunny. Matrix A of state transition probabilities (rows indexed by i, columns by j): A = {a_ij} = [0.4 0.3 0.3; 0.2 0.6 0.2; 0.1 0.1 0.8].

Observable Markov Model. Example: Markov model of weather. State 1: rain or snow; State 2: cloudy; State 3: sunny. A = {a_ij} = [0.4 0.3 0.3; 0.2 0.6 0.2; 0.1 0.1 0.8]. Question: Given that the weather on day 1 (t = 1) is sunny (state 3), what is the probability that the weather for the next 7 days will be "sun sun rain rain sun cloudy sun"?

Formalization. Define the observation sequence O = {S3, S3, S3, S1, S1, S3, S2, S3} corresponding to t = 1, 2, ..., 8. P(O | Model) = P[S3, S3, S3, S1, S1, S3, S2, S3 | Model] = P[S3] * P[S3|S3] * P[S3|S3] * P[S1|S3] * P[S1|S1] * P[S3|S1] * P[S2|S3] * P[S3|S2] = π_3 * 0.8 * 0.8 * 0.1 * 0.4 * 0.3 * 0.1 * 0.2 = 1.536 × 10^-4, where we denote π_i = P[q_1 = S_i], 1 <= i <= N (here π_3 = 1).
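
As a quick check of this calculation, here is a minimal Python sketch (assuming NumPy, which is not part of the slides) that multiplies the transition probabilities along the given weather sequence:

```python
import numpy as np

# State 1: rain/snow, state 2: cloudy, state 3: sunny (0-indexed below).
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

# Observed sequence S3 S3 S3 S1 S1 S3 S2 S3 for t = 1..8 (0-indexed states).
O = [2, 2, 2, 0, 0, 2, 1, 2]

p = 1.0  # pi_3 = 1 because day 1 is given to be sunny
for prev, cur in zip(O[:-1], O[1:]):
    p *= A[prev, cur]

print(p)  # ~1.536e-04
```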

Web as a Markov Model. [Figure: a web graph of ten pages (1-10) connected by links.] Which web page is most likely to be visited (most popular)? Which web pages are least likely to be visited?

Web as a Markov Model. [Figure: the same ten-page web graph with three pages labeled A, B, and C.] Which one (A or C) is more likely to be visited? How about A and B?

Random Walk Markov Model. [Figure: the ten-page web graph with pages A, B, and C.] Which one (A or C) is more likely to be visited? How about A and B? What two things are important to the popularity of a page?

Random Walk Markov Model. [Figure: a robot surfer on the ten-page web graph.] Randomly select an out-link to visit the next page. The probability of taking an out-link is equally distributed among all links.

Page Rank Calculation. [Figure: a 10 × 10 web link matrix, where each row lists the in-links of a page and each nonzero entry is 1/(number of out-links of the linking page), multiplied by the rank vector x (initialized to 0.1 for every page) to produce an updated rank vector.] Repeat the multiplication until it converges.
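
The rank computation sketched on this slide can be written as a short power iteration. The following Python sketch uses a small hypothetical 4-page link structure (the slide's 10-page graph is not reproduced here); the idea, equal probability over out-links and repeated multiplication until convergence, is the same:

```python
import numpy as np

# Hypothetical out-links: page i -> pages it links to.
out_links = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}
n = len(out_links)

# Column-stochastic transition matrix: each out-link taken with equal probability.
P = np.zeros((n, n))
for i, targets in out_links.items():
    for j in targets:
        P[j, i] = 1.0 / len(targets)

x = np.full(n, 1.0 / n)          # initial rank vector
for _ in range(1000):
    x_new = P @ x                # one step of the random walk
    if np.allclose(x_new, x, atol=1e-12):
        break
    x = x_new

print(x)                         # stationary distribution = page "popularity"
```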

Hidden Markov Models. The observation is a probabilistic function of the state. Doubly embedded stochastic process: an underlying stochastic process that is not observable (hidden), and can only be observed through another set of stochastic processes that produce the sequence of observations.

Coin Tossing Models. A person is tossing multiple coins in another room. He tells you the result of each coin flip. A sequence of hidden coin tossing experiments is performed, with the observation sequence consisting of a series of heads and tails: O = O_1 O_2 O_3 ... O_T = H H T T T H T T H ... H.

How to Build an HMM to Model and Explain the Observed Sequence? What do the states in the model correspond to? How many states should be in the model?

Observable Markov Model for Coin Tossing. Assume a single biased coin is tossed. States: Head and Tail. The Markov model is observable; the only issue for complete specification of the model is to decide the best value of the bias (the probability of heads). [Figure: two states, Head (1) and Tail (2), with transition probabilities P(H) and 1 − P(H).] O = H H T T H T H H T T H; S = 1 1 2 2 1 2 1 1 2 2 1. (1 parameter.)

One-State HMM. [Figure: a single coin state emitting Head with probability P(H) and Tail with probability 1 − P(H) as the observation; 1 parameter.] O = H H T T H T H H T T H; S = 1 1 1 1 1 1 1 1 1 1 1 (coin state). The unknown parameter is the bias P(H). This is a degenerate HMM.

Two-State HMM. 2 states corresponding to two different biased coins being tossed. Each state is characterized by a probability distribution of heads and tails. Transitions between states are characterized by a state transition matrix. [Figure: two states with transition probabilities a_11, a_12, a_21, a_22; state 1 emits H with probability P_1 and T with probability 1 − P_1; state 2 emits H with probability P_2 and T with probability 1 − P_2.] O = H H T T H T H H T T H; S = a hidden sequence of coin states 1 and 2. 4 parameters.

Three-State HMM. [Figure: three fully connected states with transition probabilities a_11, ..., a_33; emission table: P(H) = P_1, P_2, P_3 and P(T) = 1 − P_1, 1 − P_2, 1 − P_3 for states 1, 2, 3.] O = H H T T H T H H T T H; S = a hidden sequence of states 1-3. 9 parameters.

Model Selection. Which model best matches the observations, i.e., which has the highest P(O | M)? The 1-state HMM and the observable Markov model have one parameter. The 2-state HMM has four parameters. The 3-state HMM has nine parameters. Practical considerations impose strong limitations on model size (validation, prediction performance). Also constrained by the underlying physical process.

N-State M-Symbol HMM (urn-and-ball model). N urns, each containing a number of colored balls; M distinct colors of balls. For urn j (j = 1, ..., N): P(red) = b_j(1), P(blue) = b_j(2), P(green) = b_j(3), ..., P(yellow) = b_j(M). O = {green, green, blue, red, yellow, red, ..., blue}.

Elements of HMM. N, the number of states in the model: S = {S_1, S_2, ..., S_N}, with the state at time t denoted q_t. M, the number of distinct observation symbols per state, i.e., the discrete alphabet size; the observation symbols correspond to the physical output of the system: V = {v_1, v_2, ..., v_M}. A, the state transition probability matrix: A = {a_ij}, a_ij = P[q_{t+1} = S_j | q_t = S_i], 1 <= i, j <= N (in the special case where any state can reach any other state in a single step, a_ij > 0 for all i, j). B, the observation symbol (emission) probability distribution in state j: B = {b_j(k)}, where b_j(k) = P[v_k at t | q_t = S_j], 1 <= j <= N, 1 <= k <= M. The initial state distribution π = {π_i}, where π_i = P[q_1 = S_i], 1 <= i <= N. Complete specification: N, M, A, B, and π; compact notation λ = (A, B, π).
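
As an illustration only, a discrete HMM λ = (A, B, π) can be held in a small container like the following Python sketch (the class name and the example numbers are my own, not from the slides):

```python
import numpy as np

class DiscreteHMM:
    """Container for lambda = (A, B, pi) of an N-state, M-symbol HMM."""
    def __init__(self, A, B, pi):
        self.A = np.asarray(A, dtype=float)    # N x N transitions, rows sum to 1
        self.B = np.asarray(B, dtype=float)    # N x M emissions, rows sum to 1
        self.pi = np.asarray(pi, dtype=float)  # length-N initial distribution
        assert np.allclose(self.A.sum(axis=1), 1.0)
        assert np.allclose(self.B.sum(axis=1), 1.0)
        assert np.isclose(self.pi.sum(), 1.0)

# A two-coin HMM in the spirit of the earlier slide; the numbers are arbitrary.
coin_hmm = DiscreteHMM(A=[[0.7, 0.3], [0.4, 0.6]],
                       B=[[0.9, 0.1], [0.2, 0.8]],  # columns: P(H), P(T)
                       pi=[0.5, 0.5])
```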

Use HMM to Generate Observations. Given appropriate values of N, M, A, B, and π, the HMM can be used as a generator to give an observation sequence O = O_1 O_2 ... O_T, where each O_t is one of the symbols from V and T is the number of observations in the sequence, as follows: 1. Choose an initial state q_1 = S_i according to π. 2. Set t = 1. 3. Choose O_t = v_k according to the symbol probability distribution in state S_i, i.e., b_i(k). 4. Transit to a new state q_{t+1} = S_j according to the state transition probability distribution for state S_i, i.e., a_ij. 5. Set t = t + 1; return to step 3 if t < T; otherwise terminate the procedure.
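
A minimal Python sketch of this generation procedure, assuming the HMM container from the previous sketch (A, B, π stored as NumPy arrays):

```python
import numpy as np

def generate(hmm, T, seed=None):
    """Sample an observation sequence of length T from lambda = (A, B, pi)."""
    rng = np.random.default_rng(seed)
    N, M = hmm.B.shape
    obs, states = [], []
    q = rng.choice(N, p=hmm.pi)                  # step 1: initial state from pi
    for _ in range(T):                           # steps 2-5
        obs.append(rng.choice(M, p=hmm.B[q]))    # emit a symbol from b_q(.)
        states.append(q)
        q = rng.choice(N, p=hmm.A[q])            # transit according to a_q,.
    return obs, states

# Example: obs, states = generate(coin_hmm, T=20, seed=0)
```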

Three Main Problems of HMM. 1. Given the observation O = O_1 O_2 ... O_T and a model λ = (A, B, π), how do we efficiently compute P(O | λ)? 2. Given the observation O = O_1 O_2 ... O_T and the model λ, how do we choose a corresponding state sequence (path) Q = q_1 q_2 ... q_T which is optimal in some meaningful sense (best explains the observations)? 3. How do we adjust the model parameters λ to maximize P(O | λ)?

Insights. Problem 1: Evaluation and scoring (choose the model which best matches the observations). Problem 2: Decoding. Uncover the hidden states; we usually use an optimality criterion to solve this problem as well as possible. Problem 3: Learning. We attempt to optimize the model parameters so as to best describe how a given observation sequence comes about. The observation sequence used to adjust the model parameters is called a training sequence. This is the key to creating good models for real phenomena.

Word Recognition Example. For each word in a word vocabulary, we design an N-state HMM. The speech signal (observations) of a given word is a time sequence of coded spectral vectors (spectral codes, or phonemes), with M unique spectral vectors. Each observation is the index of the spectral vector closest to the original speech signal. Task 1: Build individual word models by using the solution to Problem 3 to estimate the model parameters for each model. Task 2: Understand the physical meaning of the model states by using the solution to Problem 2 to segment each of the word training sequences into states, and then study the properties of the spectral vectors that lead to the observations in each state; refine the model. Task 3: Use the solution to Problem 1 to score each word model based upon the given test observation sequence and select the word whose score is highest.

Solution to Problem 1 (scoring). Problem: calculate the probability of the observation sequence, O = O_1 O_2 ... O_T, given the model λ, i.e., P(O | λ). Brute force: enumerate every possible state sequence of length T. Consider one such fixed state sequence Q = q_1 q_2 ... q_T. Then P(O | Q, λ) = Π_{t=1}^{T} P(O_t | q_t, λ), assuming statistical independence of observations. We get P(O | Q, λ) = b_{q1}(O_1) * b_{q2}(O_2) ... b_{qT}(O_T).

P(Q | λ) = π_{q1} a_{q1 q2} a_{q2 q3} ... a_{q_{T-1} q_T}. Joint probability: P(O, Q | λ) = P(O | Q, λ) P(Q | λ). The probability of O given the model: P(O | λ) = Σ_{all Q} P(O | Q, λ) P(Q | λ) = Σ_{q1, q2, ..., qT} π_{q1} b_{q1}(O_1) a_{q1 q2} b_{q2}(O_2) ... a_{q_{T-1} q_T} b_{qT}(O_T). Time complexity: O(T * N^T).
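
For small N and T, this brute-force sum over all N^T state sequences can be written directly; the following Python sketch (the function name is my own) is only meant to make the O(T * N^T) cost concrete:

```python
import itertools
import numpy as np

def brute_force_prob(A, B, pi, O):
    """P(O | lambda) by summing over every possible state path (O(T * N^T))."""
    N, T = A.shape[0], len(O)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):       # all N^T paths
        p = pi[Q[0]] * B[Q[0], O[0]]
        for t in range(1, T):
            p *= A[Q[t - 1], Q[t]] * B[Q[t], O[t]]
        total += p
    return total
```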

Forward-Backward Algorithm. Definition: forward variable α_t(i) = P(O_1, O_2, ..., O_t, q_t = S_i | λ), i.e., the joint probability of the partial observation O_1 O_2 ... O_t and state S_i at time t. Algorithm (inductive or recursive): 1. Initialization: α_1(i) = π_i b_i(O_1), 1 <= i <= N. 2. Induction: α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(O_{t+1}), 1 <= t <= T-1, 1 <= j <= N. 3. Termination: P(O | λ) = Σ_{i=1}^{N} α_T(i).
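
A minimal Python/NumPy sketch of the forward algorithm above (unscaled; scaling is discussed in a later slide). Here O is a list of 0-indexed symbol indices:

```python
import numpy as np

def forward(A, B, pi, O):
    """Unscaled forward pass; returns alpha (T x N) and P(O | lambda)."""
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                        # 1. initialization
    for t in range(T - 1):                            # 2. induction
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
    return alpha, alpha[-1].sum()                     # 3. termination
```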

Insights. Step 1: initialize the forward probabilities as the joint probability of state S_i and initial observation O_1. Step 2 (induction): [Figure: states S_1, S_2, ..., S_N at time t, each with forward value α_t(i), feed into state S_j at time t+1 through transitions a_1j, a_2j, ..., a_Nj to give α_{t+1}(j).]

Lattice View. [Figure: a lattice (trellis) with states 1, 2, ..., N on the vertical axis and times 1, 2, 3, ..., T on the horizontal axis.]

Matrix View (Implementation). [Figure: an N × T matrix with states i = 1, ..., N as rows and times t = 1, ..., T as columns; the first column holds π_i b_i(O_1), and the next column holds, e.g., Σ_{i=1}^{N} α_1(i) a_i2 b_2(O_2).] α_t(i) is the probability of observing O_1 ... O_t and staying in state i at time t. Fill the matrix column by column. What other bioinformatics algorithm does this algorithm look like?

Time Complexity. The computation is performed for all states j, for a given time t; the computation is then iterated for t = 1, 2, ..., T-1. Finally, the desired result is the sum of the terminal forward variables α_T(i). This is the case since α_T(i) = P(O_1 O_2 ... O_T, q_T = S_i | λ). Time complexity: N^2 T.

Insights. Key: there are only N states at each time slot in the lattice; all the possible state sequences will merge into these N nodes, no matter how long the observation sequence is. Similar to the dynamic programming (DP) trick.

Backward Algorithm. Definition: backward variable β_t(i) = P(O_{t+1} O_{t+2} ... O_T | q_t = S_i, λ). Initialization: β_T(i) = 1, 1 <= i <= N (P(O_{T+1} | q_T = S_i, λ) = 1, with O_{T+1} a dummy observation). Induction: β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j), t = T-1, T-2, ..., 1, 1 <= i <= N. The initialization step arbitrarily defines β_T(i) to be 1 for all i.
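
A matching Python/NumPy sketch of the backward induction above, again unscaled and with 0-indexed symbols:

```python
import numpy as np

def backward(A, B, O):
    """Unscaled backward pass; returns beta (T x N), with beta[T-1] = 1."""
    N, T = A.shape[0], len(O)
    beta = np.ones((T, N))                            # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                    # induction: t = T-1, ..., 1
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta
```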

Backward Algorithm. [Figure: state S_i at time t, with backward value β_t(i), connects to states S_1, S_2, ..., S_N at time t+1 through transitions a_i1, a_i2, ..., a_iN, each weighted by β_{t+1}(j).] Time complexity: N^2 T.

Solution to Problem 2 (decoding). Find the optimal states associated with the given observation sequence. There are different optimization criteria. One optimality criterion is to choose the states q_t which are individually most likely; this criterion maximizes the expected number of correct individual states. Definition: γ_t(i) = P(q_t = S_i | O, λ), i.e., the probability of being in state S_i at time t, given the observation sequence O and the model λ.

Gamma Variable Expressed by Forward and Backward Variables: γ_t(i) = α_t(i) β_t(i) / P(O | λ) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i). The normalization factor P(O | λ) makes γ_t(i) a probability measure so that its sum over all states is 1.

Solve for the Individually Most Likely State q_t. Algorithm: for each column (t) of the gamma matrix, select the element with maximum value: q_t = argmax_{1 <= i <= N} [γ_t(i)], 1 <= t <= T. This maximizes the expected number of correct states by choosing the most likely state for each time. A simple join of individually most likely states won't yield an optimal path. Problem with the resulting state sequence (path): low probability (some a_ij is low) or even not valid (a_ij = 0). Reason: it determines the most likely state at every instant, without regard to the probability of occurrence of sequences of states.
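
A minimal sketch of this per-position (posterior) decoding, assuming the forward and backward sketches defined earlier in this document:

```python
import numpy as np

def posterior_decode(A, B, pi, O):
    """gamma_t(i) = alpha_t(i) beta_t(i) / P(O | lambda); pick argmax per column."""
    alpha, prob_O = forward(A, B, pi, O)     # forward sketch defined earlier
    beta = backward(A, B, O)                 # backward sketch defined earlier
    gamma = alpha * beta / prob_O
    return gamma, gamma.argmax(axis=1)       # individually most likely states
```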

Find the Best State Sequence (Viterbi Algorithm). Dynamic programming. Definition: δ_t(i) = max_{q_1 q_2 ... q_{t-1}} P[q_1 q_2 ... q_{t-1}, q_t = i, O_1 O_2 ... O_t | λ], the best score (highest probability) along a single path at time t which accounts for the first t observations and ends in state S_i. Induction: δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_j(O_{t+1}). To retrieve the state sequence, we need to keep track of the chosen state for each t and j. We use an array (matrix) ψ_t(j) to record them.

Viterbi Algorithm. Initialization: δ_1(i) = π_i b_i(O_1), 1 <= i <= N; ψ_1(i) = 0. Recursion: δ_t(j) = max_{1 <= i <= N} [δ_{t-1}(i) a_ij] b_j(O_t) and ψ_t(j) = argmax_{1 <= i <= N} [δ_{t-1}(i) a_ij], for 2 <= t <= T, 1 <= j <= N. Termination: p* = max_{1 <= i <= N} [δ_T(i)], q_T* = argmax_{1 <= i <= N} [δ_T(i)]. Path (state) backtracking: q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, ..., 1.
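
A minimal Python/NumPy sketch of the Viterbi recursion and backtracking above (δ and ψ stored as T × N arrays):

```python
import numpy as np

def viterbi(A, B, pi, O):
    """Most likely state path q* and its probability p*."""
    N, T = A.shape[0], len(O)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]                       # initialization
    for t in range(1, T):                            # recursion
        scores = delta[t - 1][:, None] * A           # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, O[t]]
    q = np.zeros(T, dtype=int)                       # termination + backtracking
    q[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return q, delta[-1].max()
```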

Induction. [Figure: states S_1, S_2, ..., S_N at time t, each with δ_t(i), connect to state S_j at time t+1 through transitions a_1j, a_2j, ..., a_Nj to give δ_{t+1}(j).] Choose the transition step yielding the maximum probability.

Matrix View (Implementation). [Figure: an N × T matrix with states i = 1, ..., N as rows and times t = 1, ..., T as columns; the first column holds π_i b_i(O_1).] δ_t(i) is the probability of the best path that observes O_1 ... O_t and ends in state i at time t. Fill the matrix column by column. What other bioinformatics algorithm does this algorithm look like? (Sequence alignment.)

Insights. The Viterbi algorithm is similar (except for the backtracking step) in implementation to the forward calculation. The major difference is the maximization over previous states instead of the summing procedure used in the forward algorithm. The same lattice/matrix structure efficiently implements the Viterbi algorithm. The Viterbi algorithm is similar to the global sequence alignment algorithm (both use dynamic programming).

Solution to Problem 3: Learning. Goal: adjust the model parameters λ = (A, B, π) to maximize the probability P(O | λ) of the observation sequence given the model (maximum likelihood). Bad news: there is no known optimal way of estimating the model parameters that finds the global maximum. Good news: we can locally maximize P(O | λ) using iterative procedures such as the Baum-Welch algorithm, the EM algorithm, and gradient techniques.

Global Maxima and Local Maxima. [Figure: P(O | λ) plotted against λ, showing one global maximum and several local maxima.]

Baum-Welch Algorithm. Definition: ξ_t(i,j), the probability of being in state S_i at time t and state S_j at time t+1, given the model and observation sequence, i.e., ξ_t(i,j) = P(q_t = S_i, q_{t+1} = S_j | O, λ). [Figure: state S_i at time t, with α_t(i), connects to state S_j at time t+1, with β_{t+1}(j), via a_ij b_j(O_{t+1}), spanning times t-1 to t+2.]

From the definitions of the forward and backward variables, we can write ξ_t(i,j) in the form: ξ_t(i,j) = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O | λ) = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j). The numerator term is just P(q_t = S_i, q_{t+1} = S_j, O | λ), and the division by P(O | λ) gives the desired probability measure.

Important Quantities. γ_t(i) is the probability of being in state S_i at time t, given the observation and the model; hence we can relate γ_t(i) to ξ_t(i,j) by summing over j, giving γ_t(i) = Σ_{j=1}^{N} ξ_t(i,j). Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions from S_i. Σ_{t=1}^{T-1} ξ_t(i,j) = expected number of transitions from S_i to S_j.

Re-estimation of HMM Parameters. A set of reasonable re-estimation formulas for π, A, and B: π'_i = expected frequency (number of times) in state S_i at time t = 1, which equals γ_1(i). a'_ij = (expected number of transitions from state S_i to state S_j) / (expected number of transitions from state S_i) = Σ_{t=1}^{T-1} ξ_t(i,j) / Σ_{t=1}^{T-1} γ_t(i). b'_j(k) = (expected number of times in state j observing symbol v_k) / (expected number of times in state j) = Σ_{t: O_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j).

Baum-Welch Algorithm. Initialize the model λ = (A, B, π). Repeat: E-step: use the forward/backward algorithm to compute the expected frequencies, given the current model λ and O. M-step: use the expected frequencies to compute the new model λ'. If λ' = λ, stop; otherwise, set λ = λ' and repeat. The final result of this estimation procedure is called a maximum likelihood estimate of the HMM (a local maximum).
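
A minimal sketch of a single Baum-Welch (E-step plus M-step) iteration for one observation sequence, assuming the forward and backward sketches above and no scaling, so it is only suitable for short sequences:

```python
import numpy as np

def baum_welch_step(A, B, pi, O):
    """One E-step + M-step; returns re-estimated (A, B, pi)."""
    N, M = B.shape
    O = np.asarray(O)
    alpha, prob_O = forward(A, B, pi, O)            # E-step: forward/backward
    beta = backward(A, B, O)
    gamma = alpha * beta / prob_O                   # P(q_t = S_i | O, lambda)

    # xi[t, i, j] = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O | lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * B[:, O[1:]].T[:, None, :] * beta[1:, None, :]) / prob_O

    new_pi = gamma[0]                               # M-step: re-estimation formulas
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros((N, M))
    for k in range(M):
        new_B[:, k] = gamma[O == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi
```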

Local Optimum Guaranteed. If we define the current model as λ = (A, B, π) and use it to compute the new model λ' = (A', B', π'), it has been proven by Baum et al. that either (1) the initial model λ defines a critical point, in which case λ' = λ, or (2) model λ' is more likely than λ in the sense that P(O | λ') > P(O | λ), i.e., we have found a new model λ' from which the observation sequence is more likely to have been produced.

Global Maxima and Local Maxima. [Figure: P(O | λ) plotted against λ with global and local maxima; hill climbing.] The likelihood monotonically increases per iteration until it converges to a local maximum.

[Figure: P(O | λ) plotted against the iteration number.] The likelihood monotonically increases per iteration until it converges to a local maximum. It usually converges very fast, within several iterations.

Another Way to Interpret the Re-estimation Formulas. The re-estimation formulas can be derived by maximizing (using standard constrained optimization techniques) Baum's auxiliary function Q(λ, λ') = Σ_Q P(Q | O, λ) log[P(O, Q | λ')]. It has been proven that maximization of the auxiliary function leads to increased likelihood, i.e., max_{λ'} [Q(λ, λ')] implies P(O | λ') >= P(O | λ). The selection of the auxiliary function Q is problem-specific.

Relation to the EM Algorithm. The E (expectation) step is the computation of the auxiliary function Q(λ, λ') (averaging). The M (maximization) step is the maximization over λ'. Thus Baum-Welch re-estimation is essentially identical to the EM steps for this particular problem.

Insights. The stochastic constraints of the HMM parameters, namely Σ_{i=1}^{N} π_i = 1, Σ_{j=1}^{N} a_ij = 1 for 1 <= i <= N, and Σ_{k=1}^{M} b_j(k) = 1 for 1 <= j <= N, are automatically satisfied at each iteration. By looking at the parameter estimation problem as constrained optimization of P(O | λ) subject to the above constraints, the technique of Lagrange multipliers can be used to find the values of π_i, a_ij, and b_j(k) which maximize P (we use P to denote P(O | λ)). The same results as the Baum-Welch formulas will be reached. Comments: since the entire problem can be set up as an optimization problem, standard gradient techniques can also be used to solve for the optimal values.

Types of HMM. An ergodic model has the property that every state can be reached from every other state. Left-right model: the model has the property that as time increases the state index increases (or stays the same), i.e., the states proceed from left to right (it models signals whose properties change over time, e.g., speech or biological sequences). Property 1: a_ij = 0 for i > j. Property 2: π_1 = 1, π_i = 0 for i ≠ 1. Optional constraint: a_ij = 0 for j > i + Δ, to make sure large changes of state indices are not allowed. For the last state in a left-right model, a_NN = 1 and a_Ni = 0 for i < N.

One Example of a Left-Right HMM. [Figure: a six-state left-right HMM with states 1-6.] Imposition of the constraints of the left-right model essentially has no effect on the re-estimation procedure, because any HMM parameter set to zero initially will remain zero.
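
For illustration only, a hypothetical 4-state left-right transition matrix with Δ = 2 (jumps of more than two state indices are disallowed, and the last state is absorbing) might look like this; the numbers are arbitrary:

```python
import numpy as np

# Hypothetical 4-state left-right transition matrix with Delta = 2:
# a_ij = 0 for i > j and for j > i + 2; the last state is absorbing (a_NN = 1).
A_left_right = np.array([
    [0.5, 0.3, 0.2, 0.0],
    [0.0, 0.6, 0.3, 0.1],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 1.0],
])
pi_left_right = np.array([1.0, 0.0, 0.0, 0.0])   # pi_1 = 1, pi_i = 0 otherwise
```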

HMM for Continuous Signals. In order to use a continuous observation density, some restrictions have to be placed on the form of the model probability density function (pdf) to ensure the parameters of the pdf can be re-estimated in a consistent way. A finite mixture of Gaussian distributions is commonly used (see Rabiner's paper for more details).

Implementation Issue 1: Scaling. In computing α_t(i) and β_t(i), the multiplication of a large number of terms (probability values) quickly heads to 0, which exceeds the precision range of any machine. The basic procedure is to multiply them by a scaling coefficient that is independent of i (i.e., it depends only on t). Logarithms cannot be used directly because of the summation, but we can use c_t = 1 / Σ_{i=1}^{N} α_t(i). The coefficients c_t are stored for the time points when the scaling is performed, and the same c_t is used for both α_t(i) and β_t(i). The scaling factors cancel out in the parameter estimation. For the Viterbi algorithm, using logarithms is OK.
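
A minimal sketch of the scaled forward pass described above: each column of α is normalized by c_t = 1 / Σ_i α_t(i), and log P(O | λ) is recovered as -Σ_t log c_t:

```python
import numpy as np

def forward_scaled(A, B, pi, O):
    """Scaled forward pass; returns normalized alpha, the c_t, and log P(O | lambda)."""
    N, T = A.shape[0], len(O)
    alpha_hat = np.zeros((T, N))
    c = np.zeros(T)
    alpha_hat[0] = pi * B[:, O[0]]
    c[0] = 1.0 / alpha_hat[0].sum()                 # c_t = 1 / sum_i alpha_t(i)
    alpha_hat[0] *= c[0]
    for t in range(1, T):
        alpha_hat[t] = (alpha_hat[t - 1] @ A) * B[:, O[t]]
        c[t] = 1.0 / alpha_hat[t].sum()
        alpha_hat[t] *= c[t]
    log_prob = -np.log(c).sum()                     # log P(O | lambda) = -sum_t log c_t
    return alpha_hat, c, log_prob
```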

Implementation Issue 2: Multiple Observation Sequences. Denote a set of K observation sequences as O = [O^(1), O^(2), ..., O^(K)]. Assume the observed sequences are independent. The re-estimation formulas for multiple sequences are modified by adding together the individual frequencies for each sequence.

Adjust the parameters of model λ to maximize P(O | λ) = Π_{k=1}^{K} P(O^(k) | λ) = Π_{k=1}^{K} P_k. The modified re-estimation formulas are a'_ij = [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k - 1} α_t^k(i) a_ij b_j(O_{t+1}^(k)) β_{t+1}^k(j)] / [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k - 1} α_t^k(i) β_t^k(i)] and b'_j(l) = [Σ_{k=1}^{K} (1/P_k) Σ_{t: O_t^(k) = v_l} α_t^k(j) β_t^k(j)] / [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k} α_t^k(j) β_t^k(j)]. π_i is not re-estimated since π_1 = 1 and π_i = 0 for i ≠ 1 in a left-right HMM.

Initial Estimates of HMM Parameters. Random or uniform initialization of π and A is appropriate. For the emission probability matrix B, a good initial estimate is helpful in the discrete HMM and essential in the continuous HMM. Initially segment sequences into states and then compute empirical emission probabilities. Initialization (or using a prior) is important when there is not sufficient data (pseudo-counts, Dirichlet prior).