Statistical Machine Learning Methods for Bioinformatics I. Hidden Markov Model Theory. Jianlin Cheng, PhD. Informatics Institute, Department of Computer Science, University of Missouri. 2009. Free for Academic Use. Copyright @ Jianlin Cheng.
What is Machine Learning? As a broad subfield of artificial intelligence, machine learning is concerned with the design and development of algorithms and techniques that allow computers to learn. The major focus is to extract information from data automatically, by computational and statistical methods. Closely related to data mining, statistics, and theoretical computer science.
Why Machine Learning? Bill Gates: If you design a machine that can learn, it is worth 10 Microsofts.
Applications of Machine Learning: natural language processing, syntactic pattern recognition, search engines; bioinformatics, cheminformatics, medical diagnosis; detecting credit card fraud, stock market analysis, speech and handwriting recognition, object recognition; game playing and robotics.
Common Algorithm Types: statistical modeling (HMMs, Bayesian networks), supervised learning (classification), unsupervised learning (clustering).
Why Bioinformatics? For the first time, we computer / information scientists can make a direct impact on health and medical sciences. Working with biologists, we will revolutionize the research of life science, decode the secrets of life, and cure diseases threatening human beings.
Don't Quit Because It is Hard. Interviewer: Do you view yourself as a billionaire, a CEO, or the president of Microsoft? Bill Gates: I view myself as a software engineer. Business is not complicated, but science is complicated. JFK: We want to go to the moon, not because it is easy, but because it is hard.
Statistical Machine Learning Methods for Bioinformatics: 1. Hidden Markov Models and their application in bioinformatics (e.g. sequence and profile alignment). 2. Support Vector Machines and their application in bioinformatics (e.g. protein fold recognition). 3. The EM algorithm, Gibbs sampling, and Bayesian networks and their applications in bioinformatics (e.g. gene regulatory networks).
Possible Projects: multiple sequence alignment using HMMs; profile-profile alignment using HMMs; protein secondary structure prediction using neural networks; protein fold recognition using support vector machines; or your own project upon my approval. Work alone or in a group of up to 3 persons. 40% presentation and 60% report.
Real World Process and Signal Source. Discrete signals: characters, nucleotides, ... Continuous signals: speech samples, temperature measurements, music, ... A signal can be pure (from a single source) or corrupted by other sources (e.g. noise). Stationary source: its statistical properties do not vary. Nonstationary source: the signal properties vary over time.
Fundamental Problem: Characterizing Real-World Signals in Terms of Signal Models. A signal model can provide the basis for a theoretical description of a signal processing system, which can be used to process the signal to improve its output (e.g. remove noise from speech signals). Signal models let us learn a great deal about the signal source without having the source available (simulate the source and learn via simulation). Signal models work well in practical systems: prediction systems, recognition systems, identification systems.
Types of Signal Models. Deterministic models exploit known properties of signals (sine wave, sum of exponentials); they are easy, requiring only the estimation of parameters (amplitude, frequency, ...). Statistical models try to characterize only the statistical properties of the signal (Gaussian processes, Poisson processes, Markov processes, hidden Markov processes). Assumption of statistical models: signals can be well characterized as a parametric random process, and the parameters of the stochastic process can be estimated in a well-defined manner.
History of HMMs. Introduced in the late 1960s and early 1970s (L.E. Baum and others). Widely used in speech recognition in the 1980s (Baker, Jelinek, Rabiner and others). Widely used in biological sequence analysis from the 1990s till now (Churchill, Haussler, Krogh, Baldi, Durbin, Eddy, Karplus, Karlin, Burges, Hughey, Sonnhammer, Sjolander, Edgar, Soeding, and many others).
Discrete Markov Process. Consider a system described at any time as being in one of a set of N distinct states, S1, S2, ..., SN. [Figure: five states S1..S5 connected by transition probabilities a_ij.] At regularly spaced discrete times, the system undergoes a change of state according to a set of probabilities.
Finite State Machine. A class of automata. Example: a deterministic finite automaton (state transitions according to input) that determines whether the binary input contains an even number of 0s. [Figure: two states S1 and S2, with transitions on inputs 0 and 1.] What is the difference from a Markov process? A Markov process is a stochastic finite automaton.
Definition of Variables. Time instants associated with state changes: t = 1, 2, ..., T. Observation sequence: O = o1 o2 ... oT. Actual state at time t: q_t. A full probabilistic description of the system would require specifying the current state as well as all predecessor states. The first-order Markov chain (truncation): P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, ...] = P[q_t = S_j | q_{t-1} = S_i]. Further simplification: state transitions are independent of time: a_ij = P[q_t = S_j | q_{t-1} = S_i], 1 <= i, j <= N, subject to the constraints a_ij >= 0 and Σ_{j=1}^{N} a_ij = 1.
Observable Markov Model. The stochastic process is called an observable model if the output of the process is the set of states at each instant of time, where each state corresponds to a physical (observable) event. Example: Markov model of weather. State 1: rain or snow. State 2: cloudy. State 3: sunny. Matrix A of state transition probabilities (rows i = current state, columns j = next state):
A = {a_ij} =
  0.4 0.3 0.3
  0.2 0.6 0.2
  0.1 0.1 0.8
Observable Markov Model. Example: Markov model of weather. State 1: rain or snow. State 2: cloudy. State 3: sunny. Matrix A of state transition probabilities:
A = {a_ij} =
  0.4 0.3 0.3
  0.2 0.6 0.2
  0.1 0.1 0.8
Question: given that the weather on day 1 (t = 1) is sunny (state 3), what is the probability that the weather for the next 7 days will be "sun sun rain rain sun cloudy sun"?
Formalization. Define the observation sequence O = {S3, S3, S3, S1, S1, S3, S2, S3} corresponding to t = 1, 2, ..., 8. Denote π_i = P[q_1 = S_i], 1 <= i <= N.
P(O | Model) = P[S3, S3, S3, S1, S1, S3, S2, S3 | Model]
= P[S3] * P[S3|S3] * P[S3|S3] * P[S1|S3] * P[S1|S1] * P[S3|S1] * P[S2|S3] * P[S3|S2]
= π_3 * 0.8 * 0.8 * 0.1 * 0.4 * 0.3 * 0.1 * 0.2 = 1.536 * 10^-4
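The chain-rule computation above is easy to sketch in code. This is a minimal illustration using the transition matrix and state labels from the weather slide (states 0, 1, 2 stand for S1, S2, S3):

```python
import numpy as np

# Transition matrix from the weather model: rows = current state, cols = next state.
# States: 0 = rain/snow, 1 = cloudy, 2 = sunny.
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def sequence_probability(states, A, pi):
    """P(O | model) for a state sequence in an observable Markov model."""
    p = pi[states[0]]                       # P[q_1 = first state]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]                   # multiply by each transition probability
    return p

# O = {S3, S3, S3, S1, S1, S3, S2, S3}; day 1 is known to be sunny, so pi_3 = 1.
pi = np.array([0.0, 0.0, 1.0])
O = [2, 2, 2, 0, 0, 2, 1, 2]
print(sequence_probability(O, A, pi))   # 1.536e-4, matching the slide
```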
The Web as a Markov Model. [Figure: a 10-page web graph with links among pages 1-10.] Which web page is most likely to be visited (most popular)? Which web pages are least likely to be visited?
The Web as a Markov Model. [Figure: the same 10-page graph, with three pages labeled A, B, and C.] Which one (A or C) is more likely to be visited? How about A and B?
Random Walk Markov Model. [Figure: the same labeled graph.] Which one (A or C) is more likely to be visited? How about A and B? What two things are important to the popularity of a page?
Random Walk Markov Model. [Figure: a robot randomly surfing the 10-page graph.] Randomly select an out-link to visit the next page. The probability of taking an out-link is distributed equally among all out-links of the current page.
PageRank Calculation. [Figure: a 10 x 10 web link matrix multiplied by a rank vector initialized to 0.1 for every page. Each row holds the in-links of a page, and each nonzero entry is 1/(number of out-links) of the source page, e.g. 1/2 where a page has two out-links.] Repeat the multiplication until the rank vector converges.
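The repeat-until-convergence step above is a power iteration. The 10-page matrix in the slide is not fully recoverable, so this sketch uses a small hypothetical 3-page web instead; the mechanics (column-stochastic link matrix, repeated multiplication) are the same:

```python
import numpy as np

# Hypothetical 3-page web (not the 10-page graph from the slide).
# M[i, j] = probability that a random surfer on page j follows a link to page i;
# each column sums to 1 because out-links are chosen uniformly at random.
M = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])

x = np.full(3, 1 / 3)        # start from a uniform rank vector
for _ in range(100):         # repeat x <- M x until it converges
    x = M @ x

print(x)                     # stationary rank vector; entries sum to 1
```

The converged vector is the stationary distribution of the random walk: the long-run fraction of time the surfer spends on each page.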
Hidden Markov Models. The observation is a probabilistic function of the state. Doubly embedded stochastic process: an underlying stochastic process that is not observable (hidden), and can only be observed through another set of stochastic processes that produce the sequence of observations.
Coin Tossing Models. A person is tossing multiple coins in another room. He tells you the result of each coin flip. A sequence of hidden coin tossing experiments is performed, with the observation sequence consisting of a series of heads and tails: O = O1 O2 O3 ... OT = H H T T T H T T H ... H.
How to Build an HMM to Explain the Observed Sequence? What do the states in the model correspond to? How many states should be in the model?
Observable Markov Model for Coin Tossing. Assume a single biased coin is tossed. States: Head (1) and Tail (2). The Markov model is observable; the only issue for complete specification of the model is to decide the best value of the bias, i.e., the probability of heads P(H). [Figure: two states, Head and Tail, with transitions P(H) and 1-P(H).]
O = H H T T H T H H T T H
S = 1 1 2 2 1 2 1 1 2 2 1
1 parameter.
One-State HMM. [Figure: a single state emitting Head with probability P(H) and Tail with probability 1-P(H).]
O = H H T T H T H H T T H
S = 1 1 1 1 1 1 1 1 1 1 1 (coin state)
The unknown parameter is the bias P(H): 1 observation parameter. A degenerate HMM.
Two-State HMM. Two states corresponding to two different biased coins being tossed. Each state is characterized by a probability distribution of heads and tails. Transitions between states are characterized by a state transition matrix. [Figure: states 1 and 2 with transition probabilities a11, a12, a21, a22.]
O = H H T T H T H H T T H
S = 2 1 1 2 2 2 1 2 2 1 2 (one possible hidden state sequence)
State 1: P(H) = P1, P(T) = 1-P1. State 2: P(H) = P2, P(T) = 1-P2.
4 parameters.
Three-State HMM. [Figure: states 1, 2, 3 fully connected by transition probabilities a_ij.]
O = H H T T H T H H T T H
S = 3 1 2 3 3 1 2 3 1 3 3 (one possible hidden state sequence)
Emissions: state i emits H with probability P_i and T with probability 1-P_i, for i = 1, 2, 3.
9 parameters.
Model Selection. Which model best matches the observations, i.e., maximizes P(O | M)? The 1-state HMM and the observable Markov model have one parameter; the 2-state HMM has four parameters; the 3-state HMM has nine parameters. Practical considerations impose strong limitations on model size (validation, prediction performance). The choice is also constrained by the underlying physical process.
N-State, M-Symbol HMM. N urns, each containing a number of colored balls; M distinct ball colors. For urn j: P(red) = b_j(1), P(blue) = b_j(2), P(green) = b_j(3), ..., P(yellow) = b_j(M). O = {green, green, blue, red, yellow, red, ..., blue}.
Elements of an HMM. N, the number of states in the model: S = {S1, S2, ..., SN}, with the state at time t denoted q_t. M, the number of distinct observation symbols per state, i.e., the discrete alphabet size; the observation symbols correspond to the physical output of the system: V = {v1, v2, ..., vM}. A, the state transition probability matrix: A = {a_ij}, a_ij = P[q_{t+1} = S_j | q_t = S_i], 1 <= i, j <= N (in the special case where any state can reach any other state in a single step, a_ij > 0 for all i, j). B, the observation symbol (emission) probability distribution in state j: B = {b_j(k)}, where b_j(k) = P[v_k at t | q_t = S_j], 1 <= j <= N, 1 <= k <= M. The initial state distribution π = {π_i}, where π_i = P[q_1 = S_i], 1 <= i <= N. Complete specification: N, M, A, B, and π. Compact notation: λ = (A, B, π).
Using an HMM to Generate Observations. Given appropriate values of N, M, A, B, and π, an HMM can be used as a generator to produce an observation sequence O = O1 O2 ... OT (each O_t is a symbol from V, and T is the number of observations) as follows:
1. Choose an initial state q_1 = S_i according to π.
2. Set t = 1.
3. Choose O_t = v_k according to the symbol probability distribution in state S_i, i.e., b_i(k).
4. Transit to a new state q_{t+1} = S_j according to the state transition probability distribution for state S_i, i.e., a_ij.
5. Set t = t + 1; return to step 3 if t < T; otherwise terminate the procedure.
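The five generation steps above can be sketched directly. The 2-state, 2-symbol parameter values below are hypothetical (a biased-coin-style model, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(A, B, pi, T):
    """Sample a state path and observation sequence of length T from HMM (A, B, pi)."""
    N = len(pi)
    q = rng.choice(N, p=pi)                          # step 1: initial state from pi
    states, obs = [], []
    for _ in range(T):                               # steps 2-5
        obs.append(rng.choice(B.shape[1], p=B[q]))   # step 3: emit symbol from b_q
        states.append(q)
        q = rng.choice(N, p=A[q])                    # step 4: transition via row a_q
    return states, obs

# Hypothetical 2-state, 2-symbol HMM (two biased coins: symbol 0 = H, 1 = T).
A  = np.array([[0.9, 0.1], [0.1, 0.9]])
B  = np.array([[0.8, 0.2], [0.3, 0.7]])
pi = np.array([0.5, 0.5])
states, obs = generate(A, B, pi, 10)
```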
Three Main Problems of HMMs. 1. Given the observation O = O1 O2 ... OT and a model λ = (A, B, π), how do we efficiently compute P(O | λ)? 2. Given the observation O = O1 O2 ... OT and the model λ, how do we choose a corresponding state sequence (path) Q = q1 q2 ... qT that is optimal in some meaningful sense (best explains the observations)? 3. How do we adjust the model parameters λ to maximize P(O | λ)?
Insights. Problem 1: evaluation and scoring (choose the model which best matches the observations). Problem 2: decoding, i.e., uncovering the hidden states; we usually adopt an optimality criterion and solve the problem as well as possible under it. Problem 3: learning, i.e., optimizing the model parameters to best describe how a given observation sequence comes about; the observation sequence used to adjust the model parameters is called a training sequence. Learning is the key to creating good models of real phenomena.
Word Recognition Example. For each word in a vocabulary, we design an N-state HMM. The speech signal (observations) of a given word is a time sequence of coded spectral vectors (spectral codes, or phonemes), with M unique spectral vectors; each observation is the index of the spectral vector closest to the original speech signal. Task 1: build individual word models, using the solution to Problem 3 to estimate the parameters of each model. Task 2: understand the physical meaning of the model states, using the solution to Problem 2 to segment each word training sequence into states and then study the properties of the spectral vectors that lead to the observations in each state; refine the model. Task 3: use the solution to Problem 1 to score each word model on a given test observation sequence and select the word whose model score is highest.
Solution to Problem 1 (Scoring). Problem: calculate the probability of the observation sequence O = O1 O2 ... OT given the model λ, i.e., P(O | λ). Brute force: enumerate every possible state sequence of length T. Consider one such fixed state sequence Q = q1 q2 ... qT. Assuming statistical independence of observations, P(O | Q, λ) = Π_{t=1}^{T} P(O_t | q_t, λ) = b_q1(O1) * b_q2(O2) * ... * b_qT(OT).
P(Q | λ) = π_q1 a_q1q2 a_q2q3 ... a_q(T-1)qT. Joint probability: P(O, Q | λ) = P(O | Q, λ) P(Q | λ). The probability of O given the model:
P(O | λ) = Σ_{all Q} P(O | Q, λ) P(Q | λ) = Σ_{q1, ..., qT} π_q1 b_q1(O1) a_q1q2 b_q2(O2) ... a_q(T-1)qT b_qT(OT)
Time complexity: O(T * N^T).
Forward-Backward Algorithm. Definition: the forward variable α_t(i) = P(O1 O2 ... O_t, q_t = S_i | λ), i.e., the joint probability of the partial observation O1 O2 ... O_t and state S_i at time t. Algorithm (inductive / recursive):
1. Initialization: α_1(i) = π_i b_i(O1), 1 <= i <= N.
2. Induction: α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(O_{t+1}), 1 <= t <= T-1, 1 <= j <= N.
3. Termination: P(O | λ) = Σ_{i=1}^{N} α_T(i).
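The three steps above map directly onto a few lines of NumPy. This is a minimal sketch; the 2-state, 2-symbol model at the bottom is hypothetical, used only to exercise the function:

```python
import numpy as np

def forward(O, A, B, pi):
    """Forward algorithm: returns the alpha matrix and P(O | lambda)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                    # initialization: pi_i * b_i(O_1)
    for t in range(1, T):                         # induction over t = 2 .. T
        alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]  # sum_i alpha_{t-1}(i) a_ij, times b_j(O_t)
    return alpha, alpha[-1].sum()                 # termination: sum_i alpha_T(i)

# Hypothetical 2-state, 2-symbol model for illustration.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
alpha, p = forward([0, 1], A, B, pi)
print(p)   # P(O | lambda) for O = [0, 1]
```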
Insights. Step 1 initializes the forward probabilities as the joint probability of state S_i and the initial observation O1. Step 2 (induction): [Figure: states S1 ... SN at time t, each carrying α_t(i), feeding state S_j at time t+1 through transitions a_1j ... a_Nj to form α_{t+1}(j).]
Lattice View. [Figure: a lattice of N states (rows) by T time steps (columns).]
Matrix View (Implementation). [Figure: an N x T matrix filled column by column; the first column is initialized to π_i b_i(O1), and each later entry α_t(j) is computed from the previous column as Σ_i α_{t-1}(i) a_ij, multiplied by b_j(O_t).] α_t(i) is the probability of observing O1 ... O_t and being in state i at time t. Fill the matrix column by column. What other bioinformatics algorithm does this algorithm look like?
Time Complexity. The computation is performed for all states j at a given time t; the computation is then iterated for t = 1, 2, ..., T-1. Finally, the desired result is the sum of the terminal forward variables α_T(i). This is the case since α_T(i) = P(O1 O2 ... OT, q_T = S_i | λ). Time complexity: O(N^2 T).
Insights. Key: there are only N states at each time slot in the lattice, so all possible state sequences merge into these N nodes, no matter how long the observation sequence. Similar to the dynamic programming (DP) trick.
Backward Algorithm. Definition: the backward variable β_t(i) = P(O_{t+1} O_{t+2} ... O_T | q_t = S_i, λ). Initialization: β_T(i) = 1, 1 <= i <= N. (P(O_{T+1} | q_T = S_i, λ) = 1; O_{T+1} is a dummy observation.) Induction: β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j), for t = T-1, T-2, ..., 1 and 1 <= i <= N. The initialization step arbitrarily defines β_T(i) to be 1 for all i.
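The backward recursion is the mirror image of the forward one. A minimal sketch, again with a small hypothetical 2-state, 2-symbol model; the last line checks the identity P(O | λ) = Σ_i π_i b_i(O1) β_1(i):

```python
import numpy as np

def backward(O, A, B):
    """Backward algorithm: beta_t(i) = P(O_{t+1} ... O_T | q_t = S_i, lambda)."""
    T, N = len(O), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                 # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                 # induction, t = T-1 down to 1
        beta[t] = A @ (B[:, O[t+1]] * beta[t+1])   # sum_j a_ij b_j(O_{t+1}) beta_{t+1}(j)
    return beta

# Hypothetical 2-state, 2-symbol model for illustration.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
O  = [0, 1]
beta = backward(O, A, B)
# Sanity check: P(O | lambda) recovered from the backward variables.
p = (pi * B[:, O[0]] * beta[0]).sum()
```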
Backward Algorithm. [Figure: state S_i at time t with backward variable β_t(i), connected through transitions a_i1 ... a_iN to states S1 ... SN at time t+1, each carrying β_{t+1}(j).] Time complexity: O(N^2 T).
Solution to Problem 2 (Decoding). Find the optimal states associated with the given observation sequence. There are different optimization criteria. One optimality criterion is to choose the states q_t which are individually most likely; this criterion maximizes the expected number of correct individual states. Definition: γ_t(i) = P(q_t = S_i | O, λ), i.e., the probability of being in state S_i at time t, given the observation sequence O and the model λ.
The Gamma Variable Expressed by Forward and Backward Variables.
γ_t(i) = α_t(i) β_t(i) / P(O | λ) = α_t(i) β_t(i) / [Σ_{i=1}^{N} α_t(i) β_t(i)]
The normalization factor P(O | λ) makes γ_t(i) a probability measure, so that its sum over all states is 1.
Solving for the Individually Most Likely State q_t. Algorithm: for each column t of the gamma matrix, select the element with maximum value: q_t = argmax_{1 <= i <= N} γ_t(i), for 1 <= t <= T. This maximizes the expected number of correct states by choosing the most likely state at each time t. However, simply joining the individually most likely states won't necessarily yield an optimal path. Problem with the resulting state sequence (path): it may have low probability (some a_ij is low) or may not even be valid (a_ij = 0). Reason: this criterion determines the most likely state at every instant, without regard to the probability of occurrence of sequences of states.
Find the Best State Sequence (Viterbi Algorithm). Dynamic programming. Definition:
δ_t(i) = max_{q1 q2 ... q_{t-1}} P[q1 q2 ... q_{t-1}, q_t = i, O1 O2 ... O_t | λ]
i.e., the best score (highest probability) along a single path at time t, which accounts for the first t observations and ends in state S_i. Induction:
δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_j(O_{t+1})
To retrieve the state sequence, we need to keep track of the chosen state for each t and j. We use an array (matrix) ψ_t(j) to record them.
Viterbi Algorithm.
Initialization: δ_1(i) = π_i b_i(O1), 1 <= i <= N; ψ_1(i) = 0.
Recursion: δ_t(j) = max_{1 <= i <= N} [δ_{t-1}(i) a_ij] b_j(O_t), ψ_t(j) = argmax_{1 <= i <= N} [δ_{t-1}(i) a_ij], for 2 <= t <= T, 1 <= j <= N.
Termination: P* = max_{1 <= i <= N} δ_T(i); q_T* = argmax_{1 <= i <= N} δ_T(i).
Path (state) backtracking: q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, ..., 1.
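The four stages above (initialization, recursion, termination, backtracking) can be sketched compactly. The small 2-state, 2-symbol model at the bottom is hypothetical, included only to run the function:

```python
import numpy as np

def viterbi(O, A, B, pi):
    """Viterbi algorithm: most probable state path for observation sequence O."""
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]                    # initialization
    for t in range(1, T):                         # recursion
        scores = delta[t-1][:, None] * A          # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)            # best predecessor for each j
        delta[t] = scores.max(axis=0) * B[:, O[t]]
    path = [int(delta[-1].argmax())]              # termination: best final state
    for t in range(T - 1, 0, -1):                 # backtracking via psi
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()

# Hypothetical 2-state, 2-symbol model for illustration.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
path, p_star = viterbi([0, 1], A, B, pi)
```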
Induction. [Figure: states S1 ... SN at time t, each carrying δ_t(i), feeding state S_j at time t+1 through transitions a_1j ... a_Nj to form δ_{t+1}(j).] Choose the transition step yielding the maximum probability.
Matrix View (Implementation). [Figure: an N x T matrix filled column by column, with the first column initialized to π_i b_i(O1).] δ_t(i) is the probability of observing O1 ... O_t along the best path ending in state i at time t. Fill the matrix column by column. What other bioinformatics algorithm does this algorithm look like? (Sequence alignment.)
Insights. The Viterbi algorithm is similar in implementation to the forward calculation (except for the backtracking step). The major difference is the maximization over previous states instead of the summing procedure used in the forward algorithm. The same lattice / matrix structure efficiently implements the Viterbi algorithm. The Viterbi algorithm is similar to the global sequence alignment algorithm (both use dynamic programming).
Solution to Problem 3: Learning. Goal: adjust the model parameters λ = (A, B, π) to maximize the probability P(O | λ) of the observation sequence given the model (maximum likelihood). Bad news: there is no known analytic way of estimating the model parameters that finds the global maximum. Good news: we can locally maximize P(O | λ) using iterative procedures such as the Baum-Welch algorithm (an instance of the EM algorithm) or gradient techniques.
[Figure: the likelihood P(O | λ) as a function of λ, showing a global maximum and two local maxima.]
Baum-Welch Algorithm. Definition: ξ_t(i, j) is the probability of being in state S_i at time t and state S_j at time t+1, given the model and the observation sequence, i.e., ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ). [Figure: α_t(i) at state S_i, the transition a_ij b_j(O_{t+1}) to state S_j, and β_{t+1}(j), spanning times t-1 through t+2.]
From the definitions of the forward and backward variables, we can write ξ_t(i, j) in the form:
ξ_t(i, j) = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O | λ)
          = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / [Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j)]
The numerator term is just P(q_t = S_i, q_{t+1} = S_j, O | λ), and the division by P(O | λ) gives the desired probability measure.
Important Quantities. γ_t(i) is the probability of being in state S_i at time t, given the observation and the model; hence we can relate γ_t(i) to ξ_t(i, j) by summing over j: γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j). Summing over time: Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions from S_i; Σ_{t=1}^{T-1} ξ_t(i, j) = expected number of transitions from S_i to S_j.
Re-estimation of HMM Parameters. A set of reasonable re-estimation formulas for π, A, and B:
π_i = expected frequency (number of times) in state S_i at time t = 1, i.e., γ_1(i).
a_ij = (expected number of transitions from state S_i to state S_j) / (expected number of transitions from state S_i) = [Σ_{t=1}^{T-1} ξ_t(i, j)] / [Σ_{t=1}^{T-1} γ_t(i)].
b_j(k) = (expected number of times in state j observing symbol v_k) / (expected number of times in state j) = [Σ_{t=1, s.t. O_t = v_k}^{T} γ_t(j)] / [Σ_{t=1}^{T} γ_t(j)].
Baum-Welch Algorithm.
Initialize the model λ = (A, B, π).
Repeat:
  E-step: use the forward/backward algorithm to compute the expected frequencies, given the current model λ and O.
  M-step: use the expected frequencies to compute the new model λ'.
  If λ' = λ, stop; otherwise set λ = λ' and repeat.
The final result of this estimation procedure is called a maximum likelihood estimate of the HMM (a local maximum).
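One E-step/M-step iteration can be sketched end to end using the ξ and γ quantities and the re-estimation formulas above. This is a minimal, unscaled sketch (fine for short sequences); the 2-state model and training sequence at the bottom are hypothetical:

```python
import numpy as np

def baum_welch_step(O, A, B, pi):
    """One Baum-Welch (EM) iteration on a single observation sequence.
    Returns re-estimated (A, B, pi) and P(O | lambda) under the OLD model."""
    O = np.asarray(O)
    T, N = len(O), len(pi)
    # E-step: forward and backward variables.
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t+1]] * beta[t+1])
    pO = alpha[-1].sum()
    gamma = alpha * beta / pO                              # gamma_t(i)
    xi = (alpha[:-1, :, None] * A[None, :, :] *            # xi_t(i, j)
          (B[:, O[1:]].T * beta[1:])[:, None, :]) / pO
    # M-step: the re-estimation formulas.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.array([gamma[O == k].sum(axis=0) for k in range(B.shape[1])]).T
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi, pO

# Hypothetical 2-state, 2-symbol model and a short training sequence.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
O  = [0, 1, 0, 0, 1]
A2, B2, pi2, p_old = baum_welch_step(O, A, B, pi)
```

By Baum et al.'s result (next slide), the likelihood of O under (A2, B2, pi2) is never lower than under the old model.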
Local Optimum Guaranteed. If we define the current model as λ = (A, B, π) and use it to compute a new model λ' = (A', B', π'), it has been proven by Baum et al. that either (1) the initial model λ defines a critical point, in which case λ' = λ; or (2) model λ' is more likely than λ in the sense that P(O | λ') > P(O | λ), i.e., we have found a new model λ' from which the observation sequence is more likely to have been produced.
[Figure: hill climbing on the likelihood surface P(O | λ), with a global maximum and two local maxima.] The likelihood monotonically increases per iteration until it converges to a local maximum.
[Figure: P(O | λ) versus iteration number.] The likelihood monotonically increases per iteration until it converges to a local maximum. It usually converges very quickly, within several iterations.
Another Way to Interpret the Re-estimation Formulas. The re-estimation formulas can be derived by maximizing (using standard constrained optimization techniques) Baum's auxiliary function:
Q(λ, λ') = Σ_Q P(Q | O, λ) log P(O, Q | λ')
It has been proven that maximization of the auxiliary function leads to increased likelihood, i.e., max_{λ'} Q(λ, λ') implies P(O | λ') >= P(O | λ). The choice of the auxiliary function Q is problem-specific.
Relation to the EM Algorithm. The E (expectation) step is the computation of the auxiliary function Q(λ, λ') (averaging). The M (maximization) step is the maximization over λ'. Thus Baum-Welch re-estimation is essentially identical to the EM steps for this particular problem.
Insights. The stochastic constraints on the HMM parameters, namely Σ_{i=1}^{N} π_i = 1, Σ_{j=1}^{N} a_ij = 1 (1 <= i <= N), and Σ_{k=1}^{M} b_j(k) = 1 (1 <= j <= N), are automatically satisfied at each iteration. By looking at the parameter estimation problem as constrained optimization of P(O | λ) subject to the above constraints, the technique of Lagrange multipliers can be used to find the values of π_i, a_ij, and b_j(k) which maximize P (using P to denote P(O | λ)). The same results as the Baum-Welch formulas are reached. Comment: since the entire problem can be set up as an optimization problem, standard gradient techniques can also be used to solve for optimal values.
Types of HMMs. An ergodic model has the property that every state can be reached from every other state. A left-right model has the property that as time increases the state index increases (or stays the same), i.e., the states proceed from left to right; it models signals whose properties change over time (speech or biological sequences). Property 1: a_ij = 0 for j < i. Property 2: π_1 = 1 and π_i = 0 for i != 1. Optional constraint: a_ij = 0 for j > i + Δ, to make sure large jumps in the state index are not allowed. For the last state in a left-right model, a_NN = 1 and a_Ni = 0 for i < N.
One Example of a Left-Right HMM. [Figure: a 6-state left-right model over states 1-6.] Imposition of the left-right constraints essentially has no effect on the re-estimation procedure, because any HMM parameter set to zero initially will remain zero.
HMMs for Continuous Signals. In order to use a continuous observation density, some restrictions have to be placed on the form of the model probability density function (pdf) to ensure that the parameters of the pdf can be re-estimated in a consistent way. A finite mixture of Gaussian distributions is commonly used. (See Rabiner's paper for more details.)
Implementation Issue 1: Scaling. In computing α_t(i) and β_t(i), the product of a large number of probability terms tends to 0 quickly, exceeding the precision range of any machine. The basic procedure is to multiply them by a scaling coefficient that is independent of i (i.e., it depends only on t). Taking logarithms cannot be used directly because of the summation, but we can use c_t = 1 / [Σ_{i=1}^{N} α_t(i)]. The coefficients c_t are stored for the time points where scaling is performed, and the same c_t is used for both α_t(i) and β_t(i). The scaling factors cancel out in parameter estimation. For the Viterbi algorithm, using logarithms is fine.
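The scaled forward pass can be sketched as follows: each column of α is renormalized by c_t, and log P(O | λ) is recovered as the sum of log c_t (since the product of the c_t equals P(O | λ)). The 2-state model is hypothetical; the long sequence demonstrates that the unscaled product would underflow while the scaled version stays finite:

```python
import numpy as np

def forward_scaled(O, A, B, pi):
    """Scaled forward pass: returns log P(O | lambda).
    c_t = sum_i alpha_t(i) is the per-column scaling coefficient."""
    alpha = pi * B[:, O[0]]
    log_p = 0.0
    for t in range(len(O)):
        if t > 0:
            alpha = (alpha @ A) * B[:, O[t]]
        c = alpha.sum()               # scaling coefficient depends only on t
        alpha = alpha / c             # keep alpha within machine precision
        log_p += np.log(c)            # log P(O | lambda) = sum_t log c_t
    return log_p

# Hypothetical 2-state, 2-symbol model for illustration.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
# 1000 observations: an unscaled forward pass would underflow to 0 here.
log_p = forward_scaled([0, 1] * 500, A, B, pi)
```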
Implementation Issue 2: Multiple Observation Sequences. Denote a set of K observation sequences as O = [O^(1), O^(2), ..., O^(K)]. Assume the observed sequences are independent. The re-estimation formulas for multiple sequences are modified by adding together the individual frequencies for each sequence.
Adjust the parameters of model λ to maximize:
P(O | λ) = Π_{k=1}^{K} P(O^(k) | λ) = Π_{k=1}^{K} P_k
a_ij = [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k - 1} α_t^k(i) a_ij b_j(O_{t+1}^(k)) β_{t+1}^k(j)] / [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k - 1} α_t^k(i) β_t^k(i)]
b_j(l) = [Σ_{k=1}^{K} (1/P_k) Σ_{t=1, s.t. O_t^(k) = v_l}^{T_k} α_t^k(j) β_t^k(j)] / [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k} α_t^k(j) β_t^k(j)]
π_i is not re-estimated, since π_1 = 1 and π_i = 0 (i != 1) for a left-right HMM.
Initial Estimates of HMM Parameters. Random or uniform initialization of π and A is appropriate. For the emission probability matrix B, a good initial estimate is helpful in the discrete HMM case, and essential in the continuous case. One approach: initially segment the sequences into states and then compute empirical emission probabilities. Initialization (or using priors) is important when there is not sufficient data (pseudocounts, Dirichlet priors).