MACHINE LEARNING 2: UGM, HMMS. Lecture 7

Shana Casey
5 years ago
1 Royal Institute of Technology. MACHINE LEARNING 2: UGM, HMMS. Lecture 7

2 THIS LECTURE
DGM semantics
UGM
De-noising
HMMs
Applications (interesting probabilities)
DP for generation probability etc. (later Baum-Welch)

3 EXTENDED STUDENT EXAMPLE
[Figure: graph of the extended student example, annotated with the labels B, H, and L. Legend: B = better, H = higher, L = less]

4 EXTENDED STUDENT EXAMPLE

5 INDEPENDENCE I-MAP
$I(G)$: (conditional) independences implied by G (not yet defined)
$I(P)$: (conditional) independences in the distribution P
G is an I-map for P if $I(G) \subseteq I(P)$
[Tables: two distributions p and q over X and Y. Figure: three graphs over X and Y]

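6 INDEPENDENCE I-MAP
$I(G)$: independences implied by G (not yet defined)
$I(P)$: independences in the distribution P
G is an I-map for P if $I(G) \subseteq I(P)$
[Tables: two distributions p and q over X and Y. Figure: three graphs over X and Y]
p: X and Y are independent, e.g., $p(x=1) = 0.6$, $p(y=1) = 0.8$, and $p(x=1, y=1) = 0.48 = 0.6 \times 0.8$
q: X and Y are dependent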

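7 INDEPENDENCE I-MAP
$I(G)$: independences implied by G (not yet defined)
$I(P)$: independences in the distribution P
G is an I-map for P if $I(G) \subseteq I(P)$
[Tables: two distributions p and q over X and Y. Figure: three graphs G1, G2, and G3 over X and Y]
All three graphs are I-maps for p.
G1 and G2 are I-maps for q, but G3 is not: G3 implies that X and Y are independent, which does not hold in q.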
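The independence claim for p can be checked numerically. The sketch below is only illustrative: the full table for p is filled in from the stated marginals under the independence assumption, and the table for q is a hypothetical dependent distribution, since the slide's tables were not recovered.

```python
from itertools import product

def marginal(joint, axis):
    """Marginal of one variable from a joint table keyed by (x, y)."""
    m = {}
    for assignment, prob in joint.items():
        m[assignment[axis]] = m.get(assignment[axis], 0.0) + prob
    return m

def is_independent(joint, tol=1e-9):
    """True if joint(x, y) = p(x) p(y) for every assignment."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return all(abs(joint[(x, y)] - px[x] * py[y]) <= tol
               for x, y in product(px, py))

# p: built from p(x=1) = 0.6 and p(y=1) = 0.8, assuming independence.
p = {(0, 0): 0.08, (0, 1): 0.32, (1, 0): 0.12, (1, 1): 0.48}
# q: hypothetical dependent distribution (not from the slides).
q = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}

print(is_independent(p), is_independent(q))
```

A graph without an edge between X and Y asserts exactly the independence tested here, so it can be an I-map only for distributions that pass the check.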

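8 D-SEPARATION
A path is d-separated by O if it has
a chain $X \to Y \to Z$ where $Y \in O$,
a fork $X \leftarrow Y \to Z$ where $Y \in O$, or
a v-structure $X \to Y \leftarrow Z$ where $(\{Y\} \cup \mathrm{desc}(Y)) \cap O = \emptyset$
[Figure: a chain, a fork, and a v-structure over X, Y, Z, with the set O and the descendants of Y marked]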

9 D-SEPARATION SETS AND CI OF DAGS
A is d-separated from B given O if every undirected path between A and B is d-separated by O
[Figure: sets A and B in G, separated by O]
Conditional independence relation in DAG G:
$x_A \perp_G x_B \mid x_O \iff$ A is d-separated from B given O

10 FACTORIZATION OVER G
$p(x_1, \dots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_{\mathrm{pa}(x_n)})$
p can be factorized over G if it can be expressed as above

11 SOUNDNESS AND COMPLETENESS
$I(G)$: conditional independence relations implied by d-separation in G
$I(p)$: conditional independence relations satisfied by p
Theorem: a distribution p can be factorised over G iff $I(G) \subseteq I(p)$
Equality, $I(G) = I(p)$, is not always possible to achieve, e.g., G a clique and p an independent distribution

12 UGM
UGMs: undirected graphical models
What is the direction between 2 pixels, 2 proteins?
Probabilistic interpretation?
p factorizes over G if it can be expressed as a normalized product over factors associated with cliques
[Figure: graph over Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy]

13 EXAMPLE CLIQUE

14 EXAMPLE MAXIMAL CLIQUE

15 EXAMPLE MAXIMUM CLIQUE

16 UGM
An undirected graph G with so-called factors associated with its maximal cliques.
A factor is a function from the clique's variables (the scope) to the non-negative real numbers.

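17 PROBABILISTIC INTERPRETATION
Misconception example: factors $\phi_1, \phi_2, \phi_3, \phi_4$ with scopes {A,B}, {B,C}, {C,D}, {D,A}
Probability:
$P(A, B, C, D) = \frac{1}{Z}\, \phi_1(A, B)\, \phi_2(B, C)\, \phi_3(C, D)\, \phi_4(D, A)$
where
$Z = \sum_{a,b,c,d} \phi_1(a, b)\, \phi_2(b, c)\, \phi_3(c, d)\, \phi_4(d, a)$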

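18 A FACTOR PRODUCT
Misconception example. E.g., the assignment A=1, B=1, C=0, D=1 has the unnormalized value
$\phi_1(A=1, B=1)\, \phi_2(B=1, C=0)\, \phi_3(C=0, D=1)\, \phi_4(D=1, A=1)$
which is normalized by
$Z = \sum_{a,b,c,d} \phi_1(a, b)\, \phi_2(b, c)\, \phi_3(c, d)\, \phi_4(d, a)$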
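A minimal sketch of the factor product and its normalization for four binary variables. The factor tables below are hypothetical illustration values, since the numbers on the slide were not recovered; only the structure (four pairwise factors on a cycle, normalized by Z) follows the slide.

```python
from itertools import product

# Hypothetical factor tables over binary variables (values are assumptions).
phi1 = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}     # scope (A, B)
phi2 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}   # scope (B, C)
phi3 = {(0, 0): 1.0, (0, 1): 100.0, (1, 0): 100.0, (1, 1): 1.0}   # scope (C, D)
phi4 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}   # scope (D, A)

def unnormalized(a, b, c, d):
    """Product of the four factors for one full assignment."""
    return phi1[(a, b)] * phi2[(b, c)] * phi3[(c, d)] * phi4[(d, a)]

def partition_function():
    """Z: sum of the factor product over all 2^4 assignments."""
    return sum(unnormalized(*xs) for xs in product((0, 1), repeat=4))

def probability(a, b, c, d):
    """P(A, B, C, D) = factor product / Z."""
    return unnormalized(a, b, c, d) / partition_function()

print(unnormalized(1, 1, 0, 1), probability(1, 1, 0, 1))
```

Enumerating all assignments to compute Z is only feasible for tiny models; the number of terms grows exponentially with the number of variables.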

19 DE-NOISING

20 ISING MODEL: DE-NOISING
Values in $\{-1, 1\}$
Observation model $p(y \mid x)$, e.g., Gaussian
Factors of form [formula] and [formula]

21 ISING MODEL: DE-NOISING
Values in $\{-1, 1\}$
Factors of form [formula] and [formula]
Bipartite graph
Suggests an iterative procedure
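One possible iterative procedure is iterated conditional modes (ICM), sketched below. This is a swapped-in technique rather than necessarily the one intended on the slide. The sketch assumes pairwise factors of the form exp(J x_s x_t) between 4-connected neighbours and a Gaussian observation model y_i ~ N(x_i, sigma^2); the names `icm_denoise`, `J`, `sigma`, and `sweeps` are hypothetical.

```python
def icm_denoise(y, J=1.0, sigma=1.0, sweeps=5):
    """Iterated conditional modes on a 4-connected grid.

    y: list of rows of noisy real-valued pixels.
    Returns a grid of labels in {-1, +1}.
    """
    rows, cols = len(y), len(y[0])
    # Initialise each hidden pixel from the sign of its observation.
    x = [[1 if v > 0 else -1 for v in row] for row in y]
    for _ in range(sweeps):
        for i in range(rows):
            for j in range(cols):
                neighbours = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
                field = sum(x[a][b] for a, b in neighbours
                            if 0 <= a < rows and 0 <= b < cols)
                # Half the log-odds of x_ij = +1 versus -1 given neighbours and y_ij.
                score = J * field + y[i][j] / sigma ** 2
                x[i][j] = 1 if score > 0 else -1
    return x

noisy = [[1.0] * 6 for _ in range(6)]
noisy[2][2] = -0.5          # one corrupted pixel
clean = icm_denoise(noisy)
```

Each update sets a pixel to its most probable value given its current neighbours, so the result is a local optimum that depends on the initialisation and the sweep order.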

22 [Figure: three images] The large image is the noisy image; the upper one is de-noised with the UGM, and the lower one is de-noised with graph cut.

23 LATENT = HIDDEN Can reduce #parameters Can represent common causes

24 MARKOV CHAINS (DISCRETE)
Directed graph with transition probabilities
We observe the sequence of visited vertices
[Figure: directed graph with edge probabilities labelled p1, p2, and q]

25 MARKOV CHAINS (DISCRETE)
Probabilities on outgoing edges sum to one:
$\sum_{i \in [d]} p_i = 1$
[Figure: vertex with outgoing edges labelled p1, p2, ..., pd]
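A small sketch of a discrete Markov chain as a row-stochastic matrix, assuming states are indexed 0..d-1 and `P[i][j]` is the probability of moving from state i to state j. The matrix values and function names are hypothetical.

```python
import random

def is_row_stochastic(P, tol=1e-9):
    """Check that every row is non-negative and sums to one."""
    return all(min(row) >= 0 and abs(sum(row) - 1.0) <= tol for row in P)

def sample_chain(P, start, steps, rng):
    """Sample the sequence of visited vertices, beginning at `start`."""
    path = [start]
    for _ in range(steps):
        current = path[-1]
        path.append(rng.choices(range(len(P)), weights=P[current])[0])
    return path

P = [[0.9, 0.1],
     [0.5, 0.5]]            # hypothetical transition probabilities
assert is_row_stochastic(P)
print(sample_chain(P, start=0, steps=10, rng=random.Random(0)))
```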

26 THE OCCASIONALLY DISHONEST CASINO
[Figure: two states, Fair and Biased/loaded, with transition probabilities labelled p, 1-p, q, and 1-q]
We observe the sequence of dice outcomes at the visited vertices

27 EMISSION DISTRIBUTIONS

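28 WHAT AN HMM DOES
Starts in the state $z_1$
When in state $z_t$:
outputs according to $p(x_t \mid z_t)$
moves according to $p(z_{t+1} \mid z_t)$
Stops after a fixed number of steps or when reaching a stop state
[Figure: casino HMM with states Fair and Biased/loaded and transition probabilities p, 1-p, q, 1-q]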

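29 WHAT AN HMM DOES
Starts in the state $z_1$
When in state $z_t$:
outputs according to $p(x_t \mid z_t)$
moves according to $p(z_{t+1} \mid z_t)$
Stops after a fixed number of steps or when reaching a stop state
[Figure: casino HMM with states Fair and Biased/loaded and transition probabilities p, 1-p, q, 1-q]
The parameters: the emission probabilities $p(x_t \mid z_t)$ and the transition probabilities $p(z_{t+1} \mid z_t)$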
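The generative procedure can be sketched for the occasionally dishonest casino as follows. All numbers (the switching probabilities and the loaded die's distribution) are hypothetical, and the sketch stops after a fixed number of steps rather than using a stop state.

```python
import random

# Hypothetical parameters (not taken from the slides).
TRANSITION = {
    "fair":   {"fair": 0.95, "loaded": 0.05},
    "loaded": {"fair": 0.10, "loaded": 0.90},
}
EMISSION = {
    "fair":   [1 / 6] * 6,          # uniform over faces 1..6
    "loaded": [0.1] * 5 + [0.5],    # six is over-represented
}

def sample_hmm(T, start="fair", rng=None):
    """Run the HMM for T steps; return (hidden states, observed rolls)."""
    rng = rng or random.Random()
    z, states, rolls = start, [], []
    for _ in range(T):
        states.append(z)
        # Output a roll from the emission distribution of the current state.
        rolls.append(rng.choices(range(1, 7), weights=EMISSION[z])[0])
        # Move to the next state according to the transition distribution.
        nxt = TRANSITION[z]
        z = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return states, rolls

states, rolls = sample_hmm(50, rng=random.Random(1))
```

Only `rolls` would be visible to an observer; `states` is the hidden sequence that the inference tasks later in the lecture try to recover.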

30 THE JOINT DISTRIBUTION
Starts in the state $z_1$
When in state $z_t$:
emits according to $p(x_t \mid z_t)$ (categorical or Gaussian)
transits according to $p(z_{t+1} \mid z_t)$
Stops after a fixed number of steps or when reaching a stop state
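Reading the steps above as a product of local conditionals gives the usual form of the joint. The following is a sketch that assumes a fixed length T and an initial distribution $p(z_1)$, which is a point mass when the start state is fixed.

```latex
p(x_{1:T}, z_{1:T})
  = p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \prod_{t=1}^{T} p(x_t \mid z_t)
```

The first product collects the transitions and the second collects the emissions, one factor per step of the generative procedure.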

32 GAUSSIAN EMISSIONS AND HIDDEN STATES

33 LAYERED OR NOT
[Figure: layered HMM with a Begin state, an End state, and layers of states labelled M, I, and D]

34 APPLICATIONS OF HMMS
Automatic speech recognition
Part of speech tagging
Gene finding
Gene family characterization
Secondary structure prediction
[Figure: aligned words bat, rat, cat, gnat, goat; aligned sequences with gaps, such as AG---C; and a layered HMM with Begin, End, and states labelled M, I, and D]

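35 TERMINOLOGY (X above, Z below)
Filtering: $p(z_t \mid x_{1:t})$
Prediction: $p(z_{t+h} \mid x_{1:t})$
Fixed-lag smoothing: $p(z_t \mid x_{1:t+l})$
Smoothing: $p(z_t \mid x_{1:T})$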

36 MORE INFERENCE TYPES
Viterbi (MAP): $\arg\max_{z_{1:T}} p(z_{1:T} \mid x_{1:T})$
Posterior samples: $z_{1:T} \sim p(z_{1:T} \mid x_{1:T})$
Parameters: given D and structure
Structure and parameters: given D
Probability of data: $p(x_{1:T})$
[Figure: aligned words bat, rat, cat, gnat, goat; aligned sequences with gaps; and a layered HMM with Begin, End, and states labelled M, I, and D]

37 INFERENCE IN THE OCCASIONALLY DISHONEST CASINO
[Figure: three panels titled filtered, smoothed, and Viterbi. x-axis: roll number. y-axis: p(loaded) for filtered and smoothed; MAP state (0 = fair, 1 = loaded) for Viterbi]
Grey regions are states corresponding to the biased die
Filtering: $p(z_t \mid x_{1:t})$, online
Smoothing, MAP state: $p(z_t \mid x_{1:T})$, offline
Viterbi, MAP path: $\arg\max_{z_{1:T}} p(z_{1:T} \mid x_{1:T})$

38 DP
What is a subproblem? Examples: pairs of strings, rooted trees
What is a subsolution?
How do we decompose into smaller subproblems?
How do we combine subsolutions into larger ones?
How do we enumerate?
How many and what time?
[Figure: the string pair (abbacd, acbadd) together with the pairs formed from the prefixes abbac and acbad]

39 DP
What is a subproblem? What is a subsolution? Polynomially many
How do we decompose into smaller subproblems? Polynomial time
How do we combine subsolutions into larger ones? Polynomial time
How do we enumerate? Polynomial time
How many and what time? Polynomial time overall
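As one concrete instance of these questions, the sketch below computes the edit distance between two strings. The slides show pairs of strings and their prefixes but do not name the problem, so edit distance is an assumption chosen for illustration. The subproblems are pairs of prefixes, and each subsolution is combined from three smaller ones in constant time.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # d[i][j]: subsolution for the subproblem (a[:i], b[:j]).
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert all of b[:j]
    # Enumerate subproblems so that smaller ones are solved first.
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            substitute = d[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            delete = d[i - 1][j] + 1
            insert = d[i][j - 1] + 1
            d[i][j] = min(substitute, delete, insert)
    return d[len(a)][len(b)]

print(edit_distance("abbacd", "acbadd"))
```

There are (|a|+1)(|b|+1) subproblems and each takes constant time, so the total time is O(|a||b|).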

40 AN HMM CAN BE SEEN AS A DGM
$Z_i$: hidden
$X_i$: observable
The hidden variables are often not observed when training, and never when applying the model

41 SPECIAL CASE: HIDDEN MARKOV MODEL (HMM)
$Z_i$: hidden; combinations of the transition distributions
$X_i$: observable; combinations of the emission distributions
The hidden variables are often not observed when training, and never when applying the model

42 JOINT & FORWARD VARIABLE
Joint is easy to express
The sum has exponentially many terms
The forward variable, $f_t$, can be computed with DP
[Figure: graphical model]

43 $Z_{t-1} = k'$ gives smaller
Knowing also $Z_{t-1}$ breaks the event into smaller events, i.e., the event is the AND of the smaller events
[Figure: graphical model]

44 Applying sum rule
Notice, by the sum rule,
$f_t(k) = p(x_{1:t-1}, Z_t = k) = \sum_{k' \in [K]} p(x_{1:t-1}, Z_{t-1} = k', Z_t = k)$
where $[K]$ is the set of states. Each term in the sum is the probability of an event
[Figure: chain of hidden states ending in $Z_{t-1} = k' \to Z_t = k$, with emissions $x_1, x_2, x_3, \dots, x_{t-1}$]
which, as noted, can be broken into smaller events

45 Forward recursion
$f_t(k) = \sum_{l} \underbrace{f_{t-1}(l)}_{\text{smaller}}\; \underbrace{p(x_{t-1} \mid Z_{t-1} = l)}_{\text{emission}}\; \underbrace{p(Z_t = k \mid Z_{t-1} = l)}_{\text{transition}}$

46 Forward recursion

47 Forward recursion
Given the forward recursion:
Forward Algorithm
  For the start state k*: s(0, k*) := 1
  For all other states k: s(0, k) := 0
  For t = 1 to T
    For k = 1 to K
      compute s(t, k) with the forward recursion

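48 Time
Forward Algorithm
  For the start state k*: s(0, k*) := 1
  For all other states k: s(0, k) := 0
  For t = 1 to T                                   [whole loop: $O(TK^2)$]
    For k = 1 to K                                 [loop over k: $O(K^2)$]
      compute s(t, k) with the forward recursion   [sum of K constant-time terms: $O(K)$]
So in total time $O(TK^2)$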
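The pseudocode can be sketched in Python as below. It assumes 0-based indexing, a categorical emission table `B`, a transition table `A`, and a fixed start state; these names are hypothetical. It follows the lecture's convention that the forward variable for time t covers the observations strictly before t.

```python
def forward(x, A, B, start):
    """Forward table f, with f[t][k] = p(x[0:t], Z_t = k) for t = 0..T.

    A[l][k]: probability of moving from state l to state k.
    B[l][s]: probability that state l emits symbol s.
    start:   index of the start state k* (Z_0 = k* with probability 1).
    """
    K, T = len(A), len(x)
    f = [[0.0] * K for _ in range(T + 1)]
    f[0][start] = 1.0
    for t in range(1, T + 1):                 # T iterations
        for k in range(K):                    # K iterations
            # Sum over the K previous states: smaller * emission * transition.
            f[t][k] = sum(f[t - 1][l] * B[l][x[t - 1]] * A[l][k]
                          for l in range(K))
    return f

def data_probability(x, A, B, start):
    """p(x): the last row of the table covers every observation."""
    return sum(forward(x, A, B, start)[len(x)])

A = [[0.7, 0.3], [0.4, 0.6]]    # hypothetical transitions
B = [[0.5, 0.5], [0.1, 0.9]]    # hypothetical emissions over symbols 0 and 1
print(data_probability([0, 1, 1], A, B, start=0))
```

The three nested loops (over t, over k, and the sum over l) give the O(TK^2) running time; for long sequences the products underflow, so practical implementations rescale each row or work in log space.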

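49 If layered
If layered, total time $O(TK)$
Forward Algorithm
  For the start state k*: s(0, k*) := 1
  For all other states k: s(0, k) := 0
  For t = 1 to T                                   [whole loop: $O(TK)$]
    For k = 1 to K                                 [loop over k: $O(K)$]
      compute s(t, k) with the forward recursion   [constant time]
Replace the sum over all states by a sum over a constant number of states in the previous layer
[Figure: layered HMM with Begin, End, and states labelled M, I, and D]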

50 OBSERVATION PROBABILITY
$f_t(k) = p(x_{1:t-1}, Z_t = k)$
In general (e.g., $t = T$), the final probability is easily obtained from the forward variable
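The slide's formula was not recovered; a derivation consistent with the definition of $f_t(k)$ above would be the following, which uses the sum rule and the fact that, in an HMM, $x_t$ is independent of $x_{1:t-1}$ given $Z_t$.

```latex
p(x_{1:t})
  = \sum_{k \in [K]} p(x_{1:t}, Z_t = k)
  = \sum_{k \in [K]} p(x_{1:t-1}, Z_t = k)\, p(x_t \mid Z_t = k)
  = \sum_{k \in [K]} f_t(k)\, p(x_t \mid Z_t = k)
```

Setting $t = T$ gives the probability of the whole observation sequence.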

51 FILTERING
Filtering: $p(z_t \mid x_{1:t})$, online
[Formula: the filtering distribution, with annotations marking the emission and the data probability]
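Filtering can be sketched as a normalized forward pass: multiply the predicted state distribution by the emission, then divide by the probability of the new observation. The code assumes categorical emissions, a fixed start state, and 0-based indexing; the function and table names are hypothetical.

```python
def filter_states(x, A, B, start):
    """Return a list whose t-th entry is [p(Z_t = k | x[0..t]) for each k]."""
    K = len(A)
    # predicted[k] = p(Z_t = k | x[0..t-1]); the start state is known at t = 0.
    predicted = [1.0 if k == start else 0.0 for k in range(K)]
    filtered = []
    for obs in x:
        joint = [predicted[k] * B[k][obs] for k in range(K)]   # times emission
        norm = sum(joint)                # p(x_t | x[0..t-1]); assumed non-zero
        posterior = [j / norm for j in joint]
        filtered.append(posterior)
        # One transition step gives the prediction for the next time point.
        predicted = [sum(posterior[l] * A[l][k] for l in range(K))
                     for k in range(K)]
    return filtered

A = [[0.7, 0.3], [0.4, 0.6]]    # hypothetical transitions
B = [[0.5, 0.5], [0.1, 0.9]]    # hypothetical emissions over symbols 0 and 1
beliefs = filter_states([0, 1, 1, 1], A, B, start=0)
```

The update uses only the previous belief and the newest observation, which is why filtering can run online.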

53 Backward variable
Defined by [formula]
[Figure: graphical model with $Z_t = k$ and the later observations $x_{t+1}, \dots, x_T$]

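54 Sum rule gives $Z_{t+1}$
Defined by [formula]
Each term in the sum is the probability of an event
[Figure: $Z_t = k \to Z_{t+1} = l$ with the observations $x_{t+1}, \dots, x_T$]
which is an AND of smaller events
[Figure: $Z_{t+1} = l$ with the observations $x_{t+1}, x_{t+2}, \dots, x_T$]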
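The slide's definition was not recovered. The sketch below assumes the standard backward variable, the probability of all later observations given the current state, which matches the figures. Summing over $Z_{t+1} = l$ then gives a recursion that runs from the end of the sequence to the start. Indexing is 0-based and the names are hypothetical.

```python
def backward(x, A, B):
    """Backward table b, with b[t][k] = p(x[t+1:] | Z_t = k) for t = 0..T-1.

    A[k][l]: probability of moving from state k to state l.
    B[l][s]: probability that state l emits symbol s.
    """
    K, T = len(A), len(x)
    b = [[1.0] * K for _ in range(T)]          # nothing left to explain at t = T-1
    for t in range(T - 2, -1, -1):
        for k in range(K):
            # transition to l, emission of x[t+1] from l, then the smaller problem
            b[t][k] = sum(A[k][l] * B[l][x[t + 1]] * b[t + 1][l]
                          for l in range(K))
    return b

def data_probability_backward(x, A, B, start):
    """p(x) for a non-empty x when Z_0 = start emits x[0]."""
    return B[start][x[0]] * backward(x, A, B)[0][start]

A = [[0.7, 0.3], [0.4, 0.6]]    # hypothetical transitions
B = [[0.5, 0.5], [0.1, 0.9]]    # hypothetical emissions over symbols 0 and 1
print(data_probability_backward([0, 1, 1], A, B, start=0))
```

The data probability computed this way agrees with the one obtained from the forward variable, which is a useful check when implementing both.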
