Bayesian networks are graphical models that characterize how variables are independent of each other.


Alin Tomescu, http://people.csail.mit.edu/~aliush, 6.867 Machine learning

Alin Tomescu, 6.867 Machine learning, Prof. Tommi Jaakkola
Week 12, Tuesday, November 19th, 2013, Lecture 21

Lecture 21: Hidden Markov Models

Final exam: Evening of December 10th, location and time to be announced.
- Hidden Markov models are sure to be on the final exam, because it is so easy to use them as a test of how well you understand generative modelling.

Bayesian networks are graphical models that characterize how variables are independent of each other.
- s is a parent of x
- x is a child of s

$$P(s, x, y) = P(s)\,P(x, y \mid s) = P(s)\,P(x \mid s)\,P(y \mid s)$$

The last equality holds because x and y are conditionally independent given s.

Hidden Markov models

A particular type of Bayesian network. The graph gives us parsimony of description (a compact way of describing it). It also gives us efficiency of computation.

Notation change: The latent variables we don't know about are denoted with the letter s, which stands for state. States are coupled with observations. I know something about each state.

By contrast, a simple mixture model looks like this:

[Figure: graphical model of a simple mixture model]

Example: $x_i$ can be a word and all the observations would constitute a sentence, such as "This course is {terrible | great}" $= x_1, x_2, x_3, x_4$. You would like to give a part-of-speech tag to each of these words, as follows: $s_1$ = det, $s_2$ = noun, $s_3$ = verb, $s_4$ = adjective.

How can we write down the distribution for this graphical model, for this Bayesian network? What independence properties are satisfied?

$$P(x_1, \dots, x_n, s_1, \dots, s_n) = \;?$$

1. $x_1, \dots, x_n$ are conditionally independent given $s_1, \dots, s_n$:

$$P(x_1, \dots, x_n, s_1, \dots, s_n) = P(x_1, \dots, x_n \mid s_1, \dots, s_n)\,P(s_1, \dots, s_n) = \Big[\prod_{i=1}^{n} P(x_i \mid s_1, \dots, s_n)\Big]\,P(s_1, \dots, s_n)$$

2. $s_1, s_2, \dots, s_{i-2}$ and $s_i$ are conditionally independent given $s_{i-1}$, that is, $s_i \perp s_1, \dots, s_{i-2} \mid s_{i-1}$:

$$P(s_i, s_1, \dots, s_{i-2} \mid s_{i-1}) = P(s_1, s_2, \dots, s_{i-2} \mid s_{i-1})\,P(s_i \mid s_{i-1})$$

Applying the chain rule to the state sequence and then this independence property:

$$P(s_1, \dots, s_n) = P(s_n \mid s_{n-1}, \dots, s_1)\,P(s_{n-1}, \dots, s_1) = \dots = P(s_1)\,P(s_2 \mid s_1)\,P(s_3 \mid s_2, s_1) \cdots P(s_n \mid s_{n-1}, \dots, s_1)$$

$$P(s_1, \dots, s_n) = P(s_1)\,P(s_2 \mid s_1)\,P(s_3 \mid s_2) \cdots P(s_n \mid s_{n-1})$$

3. $x_i$ is conditionally independent of all the other $x_j$'s and all the other $s_j$'s given $s_i$:

$$P(x_1, \dots, x_n, s_1, \dots, s_n) = \Big[\prod_{i=1}^{n} P_{x,i}(x_i \mid s_i)\Big]\Big[P_1(s_1) \prod_{i=2}^{n} P_i(s_i \mid s_{i-1})\Big]$$

4. We will make an additional assumption here that is not shown in the graph: the HMM is homogeneous (the probabilities $P(s_i = s' \mid s_{i-1} = s)$ do not depend on the position $i$ along the sequence).

What do we need to specify an HMM? Under the homogeneity assumption the joint distribution is

$$P(x_1, \dots, x_n, s_1, \dots, s_n) = \Big[\prod_{i=1}^{n} P_E(x_i \mid s_i)\Big]\Big[P_1(s_1) \prod_{i=2}^{n} P_T(s_i \mid s_{i-1})\Big]$$
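As a rough illustration of the factorization above, the sketch below evaluates the joint probability of one observation sequence and one state sequence in a small discrete HMM. The two-state model, the word vocabulary, and every number in it are assumptions made up for this example; they do not come from the lecture. States are indexed from 0 here for convenience.

```python
# Hypothetical toy HMM: all names and numbers are illustrative assumptions.
initial = [0.6, 0.4]                      # P_1(s_1)
transition = [[0.7, 0.3],                 # P_T(s' | s): row s, column s'
              [0.2, 0.8]]
emission = [{"good": 0.9, "bad": 0.1},    # P_E(x | s = 0)
            {"good": 0.2, "bad": 0.8}]    # P_E(x | s = 1)


def joint_probability(xs, ss):
    """Return P(x_1..x_n, s_1..s_n) for a homogeneous HMM.

    Implements [prod_i P_E(x_i | s_i)] * [P_1(s_1) * prod_{i>=2} P_T(s_i | s_{i-1})].
    """
    if not xs or len(xs) != len(ss):
        raise ValueError("xs and ss must be non-empty and of equal length")
    # First position: initial state probability times its emission.
    p = initial[ss[0]] * emission[ss[0]][xs[0]]
    # Remaining positions: one transition and one emission per step.
    for i in range(1, len(xs)):
        p *= transition[ss[i - 1]][ss[i]] * emission[ss[i]][xs[i]]
    return p


p = joint_probability(["good", "bad"], [0, 1])  # 0.6 * 0.9 * 0.3 * 0.8
```

Because the model is homogeneous, the same `transition` and `emission` tables are reused at every position, which is why three small tables are enough to describe sequences of any length.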

What are the states? $s \in \{1, \dots, k\}$

What are the outputs? $x \in X$, where $X$ is either $\mathbb{R}^d$ or $W$.

We need to specify the initial state distribution $P_1(s_1)$.

We need to specify the emission (output) probabilities $P_E(x \mid s)$, which can be a table of probabilities, or a Gaussian distribution with a mean that depends on the state, $N(x; \mu_s, \sigma^2 I)$.

We need to model the transition probabilities $P_T(s' \mid s)$.

Example:

P_1(s_1):
  s_1 = 1   s_1 = 2
     1         0

P_T(s_t | s_{t-1}), with entries consistent with the behaviour described below:
                 s_t = 1   s_t = 2
  s_{t-1} = 1       0         1
  s_{t-1} = 2       0         1

$$P_E(x \mid s) = N(x; \mu_s, \sigma^2), \quad \mu_1 > \mu_2$$

What does this model generate? What is a likely sequence of states? $s_1, s_2, s_3, \dots = 1, 2, 2, 2, \dots$ At time 1 I am always in state 1, and at time 2 or greater I am always going to be, and remain, in state 2. In terms of observations, $x_1$ is drawn around the larger mean $\mu_1$ and every later $x_t$ is drawn around $\mu_2$.
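To make the example concrete, here is a minimal sampling sketch. It assumes the transition table implied by the description (state 1 always moves to state 2, and state 2 stays in state 2). The means and variance are made-up values chosen only to satisfy $\mu_1 > \mu_2$, and the helper names `draw` and `sample` are hypothetical.

```python
import random

# Assumed values: the notes only require mu_1 > mu_2.
MU = {1: 5.0, 2: -5.0}
SIGMA = 1.0
INITIAL = {1: 1.0, 2: 0.0}                  # P_1(s_1)
TRANSITION = {1: {1: 0.0, 2: 1.0},          # P_T(s_t | s_{t-1} = 1), assumed
              2: {1: 0.0, 2: 1.0}}          # P_T(s_t | s_{t-1} = 2), assumed


def draw(dist, rng):
    """Draw a key from a {value: probability} table."""
    r = rng.random()
    cumulative = 0.0
    for value, prob in dist.items():
        cumulative += prob
        if r < cumulative:
            return value
    return value  # guard against floating-point rounding


def sample(n, seed=0):
    """Generate n states and n observations from the example HMM."""
    rng = random.Random(seed)
    states, observations = [], []
    s = draw(INITIAL, rng)
    for _ in range(n):
        states.append(s)
        observations.append(rng.gauss(MU[s], SIGMA))  # x_t ~ N(mu_s, sigma^2)
        s = draw(TRANSITION[s], rng)
    return states, observations


states, xs = sample(5)
```

With these tables the state sequence is deterministic, so only the observations vary between runs: the first is centred on `MU[1]` and the rest on `MU[2]`.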

[Figure: transition diagram]

How to use these HMM models? We need to be able to solve a few problems:

- How likely is an observation sequence in this model, after specifying it? We need to evaluate

$$P(x_1, \dots, x_n) = \sum_{\text{all } k^n \text{ possible } s_1, \dots, s_n} P(x_1, \dots, x_n, s_1, \dots, s_n)$$

- We need to be able to estimate $P_1(s_1)$, $P_E(x \mid s)$, $P_T(s' \mid s)$ from data consisting of $T$ observation sequences:

$$\begin{Bmatrix} x_1^{(1)}, \dots, x_{n_1}^{(1)} \\ \vdots \\ x_1^{(T)}, \dots, x_{n_T}^{(T)} \end{Bmatrix}$$

- We need to estimate the prediction

$$(\hat{s}_1, \dots, \hat{s}_n) = \operatorname*{argmax}_{s_1, \dots, s_n} P(x_1, \dots, x_n, s_1, \dots, s_n)$$

for a particular row of $x_i$'s in the above data matrix.

But how can we sum over $k^n$ possible terms? We can perform the summation in time linear in the length of the sequence, due to the independence relations.

The forward-backward algorithm

Gives us $P(x_1, \dots, x_n)$ in linear time.

Forward probabilities: Predictive probabilities. For a particular sequence $x_1, \dots, x_n$, with $s_i \in \{1, \dots, k\}$, we want to compute

$$\alpha_t(i) = P(x_1, \dots, x_t, s_t = i)$$

Then we can predict

$$P(s_t = i \mid x_1, \dots, x_t) = \frac{\alpha_t(i)}{\sum_j \alpha_t(j)}$$

The base case is

$$\alpha_1(s_1) = P_1(s_1)\,P_E(x_1 \mid s_1) = P(x_1, s_1), \qquad \sum_{s_1} \alpha_1(s_1) = P(x_1)$$
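Before turning to the efficient recursion, it may help to see the naive computation spelled out. The sketch below enumerates all $k^n$ state sequences to obtain both the likelihood $P(x_1, \dots, x_n)$ and the most probable state sequence. The toy model, its numbers, and the function names are assumptions for illustration only. This approach is practical only for very short sequences.

```python
from itertools import product

# Hypothetical toy HMM: all names and numbers are illustrative assumptions.
initial = [0.6, 0.4]                      # P_1(s_1)
transition = [[0.7, 0.3],                 # P_T(s' | s): row s, column s'
              [0.2, 0.8]]
emission = [{"good": 0.9, "bad": 0.1},    # P_E(x | s = 0)
            {"good": 0.2, "bad": 0.8}]    # P_E(x | s = 1)


def joint(xs, ss):
    """P(x_1..x_n, s_1..s_n) from the HMM factorization."""
    p = initial[ss[0]] * emission[ss[0]][xs[0]]
    for i in range(1, len(xs)):
        p *= transition[ss[i - 1]][ss[i]] * emission[ss[i]][xs[i]]
    return p


def brute_force(xs):
    """Return (P(x_1..x_n), argmax state sequence) by enumerating k^n paths."""
    k = len(initial)
    total, best_p, best_path = 0.0, -1.0, None
    for ss in product(range(k), repeat=len(xs)):
        p = joint(xs, ss)
        total += p                 # marginalize over the hidden states
        if p > best_p:             # track the most probable path
            best_p, best_path = p, ss
    return total, best_path


likelihood, best = brute_force(["good", "bad"])
```

The loop body runs $k^n$ times, which is exactly the exponential cost that the forward recursion avoids.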

$$\alpha_2(s_2) = P(x_1, x_2, s_2) = \sum_{s_1} P_1(s_1)\,P_E(x_1 \mid s_1)\,P_T(s_2 \mid s_1)\,P_E(x_2 \mid s_2) = \sum_{s_1} \alpha_1(s_1)\,P_T(s_2 \mid s_1)\,P_E(x_2 \mid s_2)$$

$$\alpha_3(s_3) = P(x_1, x_2, x_3, s_3) = \sum_{s_2} \Big(\sum_{s_1} P(x_1, x_2, s_1, s_2)\Big) P_T(s_3 \mid s_2)\,P_E(x_3 \mid s_3) = \sum_{s_2} \alpha_2(s_2)\,P_T(s_3 \mid s_2)\,P_E(x_3 \mid s_3)$$

In general, we get:

$$\alpha_t(s_t) = P(x_1, x_2, \dots, x_t, s_t) = \sum_{s_1, s_2, \dots, s_{t-1}} P(x_1, x_2, \dots, x_t, s_1, s_2, \dots, s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1})\,P_T(s_t \mid s_{t-1})\,P_E(x_t \mid s_t), \quad s_t = 1, \dots, k$$

$$\sum_{s_t} \alpha_t(s_t) = P(x_1, x_2, \dots, x_t)$$

For $\alpha_1(s_1)$, we have $k$ possible values, corresponding to each $s_1 \in \{1, \dots, k\}$.

What is the computational cost of evaluating $P(x_1, x_2, \dots, x_n)$? $O(nk^2)$, because I have $k$ numbers to fill in for $\alpha_t$ and each one involves summing over the $k$ previous $\alpha_{t-1}$ values. Note that $t \in \{1, \dots, n\}$, hence the $O(nk^2)$.

Note: Increasing the number of values $k$ for the hidden states in an HMM has a much greater effect on the $O(nk^2)$ computational cost of the forward-backward algorithm than increasing the length $n$ of the observation sequence.

Backward probabilities: The complement of forward probabilities. Diagnostic probabilities.
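The forward recursion can be sketched in a few lines. The toy two-state model below, including all of its numbers and the names `forward` and `filtered`, is an assumption for illustration and not part of the lecture. States are indexed from 0.

```python
# Hypothetical toy HMM: all names and numbers are illustrative assumptions.
initial = [0.6, 0.4]                      # P_1(s_1)
transition = [[0.7, 0.3],                 # P_T(s' | s): row s, column s'
              [0.2, 0.8]]
emission = [{"good": 0.9, "bad": 0.1},    # P_E(x | s = 0)
            {"good": 0.2, "bad": 0.8}]    # P_E(x | s = 1)


def forward(xs):
    """Return alpha, where alpha[t][s] = P(x_1..x_{t+1}, s_{t+1} = s) (t is 0-based)."""
    k = len(initial)
    # Base case: alpha_1(s) = P_1(s) * P_E(x_1 | s).
    alpha = [[initial[s] * emission[s][xs[0]] for s in range(k)]]
    for x in xs[1:]:
        prev = alpha[-1]
        # Each of the k entries sums over the k previous states: O(k^2) per step.
        alpha.append([
            sum(prev[r] * transition[r][s] for r in range(k)) * emission[s][x]
            for s in range(k)
        ])
    return alpha


def filtered(alpha_t):
    """P(s_t = i | x_1..x_t), obtained by normalizing alpha_t."""
    z = sum(alpha_t)
    return [a / z for a in alpha_t]


alpha = forward(["good", "bad"])
likelihood = sum(alpha[-1])  # P(x_1, ..., x_n)
```

The outer loop runs once per observation and the inner comprehension does $k^2$ multiplications, which matches the $O(nk^2)$ cost stated above.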

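$$\beta_t(i) = P(x_{t+1}, \dots, x_n \mid s_t = i), \qquad \beta_t(s_t) = P(x_{t+1}, \dots, x_n \mid s_t)$$

If I start from that state, then what is the probability of generating all the future observations?

$$\beta_n(s_n) = 1$$

$$\beta_{n-1}(s_{n-1}) = P(x_n \mid s_{n-1}) = \sum_{s_n} P_T(s_n \mid s_{n-1})\,P_E(x_n \mid s_n)$$

$$\beta_{n-2}(s_{n-2}) = P(x_{n-1}, x_n \mid s_{n-2}) = \sum_{s_n, s_{n-1}} P_T(s_{n-1} \mid s_{n-2})\,P_E(x_{n-1} \mid s_{n-1})\,P_T(s_n \mid s_{n-1})\,P_E(x_n \mid s_n)$$

$$= \sum_{s_{n-1}} \Big(\sum_{s_n} P_T(s_n \mid s_{n-1})\,P_E(x_n \mid s_n)\Big) P_T(s_{n-1} \mid s_{n-2})\,P_E(x_{n-1} \mid s_{n-1}) = \sum_{s_{n-1}} \beta_{n-1}(s_{n-1})\,P_T(s_{n-1} \mid s_{n-2})\,P_E(x_{n-1} \mid s_{n-1})$$

In general:

$$\beta_t(s_t) = \sum_{s_{t+1}} P_T(s_{t+1} \mid s_t)\,P_E(x_{t+1} \mid s_{t+1})\,\beta_{t+1}(s_{t+1})$$

How to evaluate the posterior probability of a particular state:

$$P(s_t = s \mid x_1, \dots, x_n) = \frac{P(x_1, \dots, x_n, s_t = s)}{P(x_1, \dots, x_n)} = \frac{P(x_1, \dots, x_t, s_t = s)\,P(x_{t+1}, \dots, x_n \mid s_t = s)}{P(x_1, \dots, x_n)} = \frac{\alpha_t(s)\,\beta_t(s)}{\sum_{s'} \alpha_t(s')\,\beta_t(s')}$$

How to evaluate the probability of the data set:

$$P(x_1, x_2, \dots, x_n) = \sum_{s_n} \alpha_n(s_n)$$

$$P(x_1, x_2, \dots, x_n) = \sum_{s_1} P_1(s_1)\,P_E(x_1 \mid s_1)\,\beta_1(s_1)$$

$$P(x_1, x_2, \dots, x_n) = \sum_{s_t} \alpha_t(s_t)\,\beta_t(s_t) \quad \text{for any } t$$

How to evaluate the posterior probability that the HMM went $s \to s'$ at time $t$:

$$P(s_t = s, s_{t+1} = s' \mid x_1, \dots, x_n) = \frac{\alpha_t(s)\,P_T(s' \mid s)\,P_E(x_{t+1} \mid s')\,\beta_{t+1}(s')}{\sum_{s''} \alpha_t(s'')\,\beta_t(s'')}$$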
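The backward recursion and the two posteriors above can be sketched together as follows. As before, the toy model, its numbers, and the function names are assumptions made for illustration. States and time steps are 0-based in the code.

```python
# Hypothetical toy HMM: all names and numbers are illustrative assumptions.
initial = [0.6, 0.4]                      # P_1(s_1)
transition = [[0.7, 0.3],                 # P_T(s' | s): row s, column s'
              [0.2, 0.8]]
emission = [{"good": 0.9, "bad": 0.1},    # P_E(x | s = 0)
            {"good": 0.2, "bad": 0.8}]    # P_E(x | s = 1)
K = len(initial)


def forward(xs):
    """alpha[t][s] = P(x_1..x_{t+1}, s_{t+1} = s), t 0-based."""
    alpha = [[initial[s] * emission[s][xs[0]] for s in range(K)]]
    for x in xs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[r] * transition[r][s] for r in range(K))
                      * emission[s][x] for s in range(K)])
    return alpha


def backward(xs):
    """beta[t][s] = P(observations after position t | state s at position t)."""
    n = len(xs)
    beta = [[1.0] * K for _ in range(n)]          # base case: beta_n = 1
    for t in range(n - 2, -1, -1):
        for s in range(K):
            beta[t][s] = sum(transition[s][r] * emission[r][xs[t + 1]]
                             * beta[t + 1][r] for r in range(K))
    return beta


def state_posteriors(xs):
    """gamma[t][s] = P(s_t = s | x_1..x_n) = alpha_t(s) beta_t(s) / normalizer."""
    gammas = []
    for a_t, b_t in zip(forward(xs), backward(xs)):
        z = sum(a * b for a, b in zip(a_t, b_t))
        gammas.append([a * b / z for a, b in zip(a_t, b_t)])
    return gammas


def transition_posterior(xs, t):
    """xi[s][r] = P(s_t = s, s_{t+1} = r | x_1..x_n) for 0-based t < n - 1."""
    alpha, beta = forward(xs), backward(xs)
    z = sum(a * b for a, b in zip(alpha[t], beta[t]))
    return [[alpha[t][s] * transition[s][r] * emission[r][xs[t + 1]]
             * beta[t + 1][r] / z for r in range(K)] for s in range(K)]
```

A useful sanity check when experimenting with code like this is that $\sum_s \alpha_t(s)\,\beta_t(s)$ gives the same value for every $t$, and that the rows of the transition posterior sum to the corresponding state posterior.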
