1 CIS 519/419 Applied Machine Learning Dan Roth 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn or CS446 at UIUC), Eric Eaton for CIS519/419 at Penn, or from other authors who have made their ML slides available.

2 Administration Exam: The exam will take place on the originally assigned date, 4/30. Similar to the previous midterm. 75 minutes; closed books. What is covered: The focus is on the material covered after the previous mid-term. However, notice that the ideas in this class are cumulative!! Everything that we present in class and in the homework assignments. Material that is in the slides but is not discussed in class is not part of the material required for the exam. Example 1: We talked about Boosting, but not about boosting the confidence. Example 2: We talked about multiclass classification: OvA, AvA, but not Error Correcting codes and the additional material in the slides. We will give a few practice exams. Homework: missing and regrades.

3 Administration Projects We will have a poster session 6-8pm on May 7 in the active learning room, 3401 Walnut. The hope is that this will be a fun event where all of you have an opportunity to see and discuss the projects people have done. All are invited! Mandatory for CIS519 students. The final project report will be due on 5/8. Logistics: you will send us your posters a day earlier; we will print them and hang them; you will present them. If you haven't done so already: come to my office hours at least once this or next week to discuss the project!!

4 Summary: Basic Probability
Product Rule: P(A,B) = P(A|B)P(B) = P(B|A)P(A)
If A and B are independent: P(A,B) = P(A)P(B); P(A|B) = P(A), P(A|B,C) = P(A|C)
Sum Rule: P(A∨B) = P(A) + P(B) - P(A,B)
Bayes Rule: P(A|B) = P(B|A)P(A)/P(B)
Total Probability: If events A_1, A_2, ..., A_n are mutually exclusive (A_i ∧ A_j = ∅, Σ_i P(A_i) = 1), then P(B) = Σ_i P(B, A_i) = Σ_i P(B|A_i)P(A_i)
Total Conditional Probability: If events A_1, A_2, ..., A_n are mutually exclusive (A_i ∧ A_j = ∅, Σ_i P(A_i) = 1), then P(B|C) = Σ_i P(B, A_i|C) = Σ_i P(B|A_i,C)P(A_i|C)
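A quick numeric sanity check of the rules above; all probabilities here are made-up numbers for illustration only.

```python
# Made-up numbers: a prior, and the likelihood of B under A and under not-A.
P_A = 0.3                    # P(A)
P_B_given_A = 0.8            # P(B|A)
P_B_given_notA = 0.1         # P(B|~A)

# Total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)

# Bayes rule: P(A|B) = P(B|A)P(A) / P(B)
P_A_given_B = P_B_given_A * P_A / P_B

print(f"P(B)   = {P_B:.3f}")          # 0.8*0.3 + 0.1*0.7 = 0.31
print(f"P(A|B) = {P_A_given_B:.3f}")  # 0.24 / 0.31 ≈ 0.774
```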

5 So far: Bayesian Learning What does it mean to be Bayesian? Naïve Bayes Independence assumptions. EM Algorithm Learning with hidden variables. Today: Representing arbitrary probability distributions. Inference Exact inference; approximate inference. Learning representations of probability distributions.

6 Unsupervised Learning We get as input (n+1)-tuples: (X_1, X_2, ..., X_n, X_{n+1}). There is no notion of a class variable or a label. After seeing a few examples, we would like to know something about the domain: correlations between variables, probability of certain events, etc. We want to learn the most likely model that generated the data. Sometimes called density estimation.

7 Simple Distributions In general, the problem is very hard. But, under some assumptions on the distribution we have shown that we can do it. (Exercise: show it is the most likely distribution.)
[Figure: the naïve Bayes structure — a root y with prior P(y) and children x_1, x_2, x_3, ..., x_n, each edge labeled P(x_i|y).]
Assumptions (conditional independence given y): P(x_i|x_j, y) = P(x_i|y) for all i, j.
Can these (strong) assumptions be relaxed? Can we learn more general probability distributions? (These are essential in many applications: language, vision.)

8 Simple Distributions
[Figure: the same naïve Bayes structure — root y with P(y), children x_1, ..., x_n with edge labels P(x_i|y).]
Under the assumption P(x_i|x_j, y) = P(x_i|y) for all i, j, we can compute the joint probability distribution on the n+1 variables:
P(y, x_1, x_2, ..., x_n) = P(y) Π_{i=1..n} P(x_i|y)
Therefore, we can compute the probability of any event:
P(x_1=0, x_2=0, y=1) = Σ_{b_3,...,b_n ∈ {0,1}} P(y=1, x_1=0, x_2=0, x_3=b_3, x_4=b_4, ..., x_n=b_n)
More efficiently (directly from the independence assumption):
P(x_1=0, x_2=0, y=1) = P(x_1=0, x_2=0 | y=1) P(y=1) = P(x_1=0|y=1) P(x_2=0|y=1) P(y=1)
We can compute the probability of any event or conditional event over the n+1 variables.
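A minimal sketch of the two computations above for a toy naïve Bayes model with n = 4 binary features; all CPT numbers below are made up for illustration.

```python
import itertools

# Made-up CPTs: P(y=1) and P(x_i=1 | y) for four binary features.
p_y1 = 0.4
p_xi1 = {0: [0.7, 0.2, 0.5, 0.6],   # p_xi1[y][i] = P(x_i = 1 | y)
         1: [0.3, 0.9, 0.4, 0.8]}

def joint(y, xs):
    """P(y, x_1, ..., x_4) = P(y) * prod_i P(x_i | y)."""
    p = p_y1 if y == 1 else 1 - p_y1
    for i, x in enumerate(xs):
        q = p_xi1[y][i]
        p *= q if x == 1 else 1 - q
    return p

# Brute force: sum the joint over the unobserved variables x_3 and x_4.
brute = sum(joint(1, (0, 0, b3, b4))
            for b3, b4 in itertools.product((0, 1), repeat=2))

# Directly from the independence assumption: P(x_1=0|y=1) P(x_2=0|y=1) P(y=1).
direct = (1 - p_xi1[1][0]) * (1 - p_xi1[1][1]) * p_y1

print(brute, direct)   # the two numbers agree
```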

9 Representing Probability Distributions Goal: To represent all joint probability distributions over a set of random variables X_1, X_2, ..., X_n. There are many ways to represent distributions. A table, listing the probability of each instance in {0,1}^n: we will need 2^n - 1 numbers. What can we do? Make independence assumptions. Multi-linear polynomials: multinomials over variables. Bayesian Networks: directed acyclic graphs. Markov Networks: undirected graphs.

10 Graphical Models of Probability Distributions Bayesian Networks represent the joint probability distribution over a set of variables.
Independence Assumption: for all x_i, x_i is independent of its non-descendants given its parents. (This is a theorem. To prove it, order the nodes from leaves up, and use the product rule.)
The terms are called CPTs (Conditional Probability Tables) and they completely define the probability distribution. With these conventions, the joint probability distribution is given by:
P(y, x_1, x_2, ..., x_n) = P(y) Π_i P(x_i | Parents(x_i))
[Figure: an example DAG with root Y, intermediate nodes Z, Z_1, Z_2, Z_3, and leaves X_1, X_2, ..., X_10; z is a parent of x, and x is a descendant of y.]

11 Bayesian Network Semantics of the DAG: Nodes are random variables. Edges represent causal influences. Each node is associated with a conditional probability distribution. Two equivalent viewpoints: A data structure that represents the joint distribution compactly. A representation for a set of conditional independence assumptions about a distribution.

12 Bayesian Network: Example The burglar alarm in your house rings when there is a burglary or an earthquake. An earthquake will be reported on the radio. If an alarm rings and your neighbors hear it, they will call you. What are the random variables?

13 Bayesian Network: Example
[Figure: the alarm network — Earthquake and Burglary are parents of Alarm; Earthquake is also a parent of Radio; Alarm is a parent of Mary Calls and John Calls.]
If there is an earthquake, you'll probably hear about it on the radio. An alarm can ring because of a burglary or an earthquake. If your neighbors hear an alarm, they will call you.
How many parameters do we have? How many would we have if we had to store the entire joint?

14 Bayesian Network: Example
[Figure: the alarm network with CPTs P(E), P(B), P(R|E), P(A|E,B), P(M|A), P(J|A) attached to Earthquake, Burglary, Radio, Alarm, Mary Calls, and John Calls.]
With these probabilities (and the assumptions encoded in the graph) we can compute the probability of any event over these variables.
P(E, B, A, R, M, J) = P(E) P(B, A, R, M, J | E)
= P(E) P(B) P(A, R, M, J | E, B)
= P(E) P(B) P(R | E, B) P(M, J, A | E, B)
= P(E) P(B) P(R | E) P(M, J | A, E, B) P(A | E, B)
= P(E) P(B) P(R | E) P(M | A) P(J | A) P(A | E, B)
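A minimal sketch of this factored computation. The CPT numbers below are made up (the transcription does not preserve the slide's tables); the code evaluates the joint for one full assignment.

```python
# Made-up CPTs for the alarm network (illustrative numbers only).
P_E = {1: 0.01, 0: 0.99}                       # P(Earthquake)
P_B = {1: 0.02, 0: 0.98}                       # P(Burglary)
P_R_given_E = {1: 0.9, 0: 0.05}                # P(Radio=1 | E)
P_A_given_EB = {(1, 1): 0.95, (1, 0): 0.3,     # P(Alarm=1 | E, B)
                (0, 1): 0.8,  (0, 0): 0.001}
P_M_given_A = {1: 0.7, 0: 0.05}                # P(MaryCalls=1 | A)
P_J_given_A = {1: 0.9, 0: 0.02}                # P(JohnCalls=1 | A)

def bernoulli(p_one, value):
    """Probability of a binary value given the probability of value 1."""
    return p_one if value == 1 else 1 - p_one

def joint(e, b, a, r, m, j):
    """P(E,B,A,R,M,J) = P(E) P(B) P(R|E) P(A|E,B) P(M|A) P(J|A)."""
    return (P_E[e] * P_B[b]
            * bernoulli(P_R_given_E[e], r)
            * bernoulli(P_A_given_EB[(e, b)], a)
            * bernoulli(P_M_given_A[a], m)
            * bernoulli(P_J_given_A[a], j))

# Probability of: no earthquake, burglary, alarm rings, no radio report, both neighbors call.
print(joint(e=0, b=1, a=1, r=0, m=1, j=1))
```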

15 Computational Problems Learning the structure of the Bayes net (What would be the guiding principle?). Learning the parameters: Supervised? Unsupervised?
Inference: Computing the probability of an event [#P-Complete, Roth 93, 96]: Given structure and parameters, and given an observation E, what is the probability of Y? P(Y=y | E=e) (E, Y are sets of instantiated variables).
Most likely explanation (Maximum A Posteriori assignment, MAP, MPE) [NP-hard; Shimony 94]: Given structure and parameters, and given an observation E, what is the most likely assignment to Y? argmax_y P(Y=y | E=e) (E, Y are sets of instantiated variables).

16 Inference Inference in Bayesian Networks is generally intractable in the worst case. Two broad approaches for inference: Exact inference, e.g. Variable Elimination. Approximate inference, e.g. Gibbs sampling.

17 Tree Dependent Distributions Directed acyclic graph; each node has at most one parent. Independence Assumption: x is independent of its non-descendants given its parents (x is independent of other nodes given z; v is independent of w given u).
P(y, x_1, x_2, ..., x_n) = P(y) Π_i P(x_i | Parents(x_i))
Need to know two numbers for each link: P(x|z), and a prior for the root, P(y).
[Figure: a tree rooted at Y with CPTs on the edges, e.g. P(x|z) on the edge Z→X and P(s|y) on the edge Y→S; nodes include Y, Z, X, U, V, W, T, S.]

18 Tree Dependent Distributions This is a generalization of naïve Bayes.
[Figure: the same tree rooted at Y, with joint P(y, x_1, ..., x_n) = P(y) Π_i P(x_i | Parents(x_i)).]
Inference Problem: Given the tree with all the associated probabilities, evaluate the probability of an event p(x).
P(x=1) = P(x=1|z=1)P(z=1) + P(x=1|z=0)P(z=0)
Recursively, go up the tree:
P(z=1) = P(z=1|y=1)P(y=1) + P(z=1|y=0)P(y=0)
P(z=0) = P(z=0|y=1)P(y=1) + P(z=0|y=0)P(y=0)
Now we have everything in terms of the CPTs (conditional probability tables). Linear Time Algorithm.
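A minimal sketch of this bottom-up computation on the chain y → z → x; the CPT numbers are made up for illustration.

```python
# Marginal P(x=1) on the chain y -> z -> x, exactly as on the slide.
P_y1 = 0.6                          # P(y=1)
P_z1_given_y = {1: 0.7, 0: 0.2}     # P(z=1 | y)
P_x1_given_z = {1: 0.9, 0: 0.1}     # P(x=1 | z)

# Go up the tree: first the marginal of z ...
P_z1 = P_z1_given_y[1] * P_y1 + P_z1_given_y[0] * (1 - P_y1)
P_z0 = 1 - P_z1

# ... then the marginal of x, now expressed purely in terms of CPT entries.
P_x1 = P_x1_given_z[1] * P_z1 + P_x1_given_z[0] * P_z0
print(P_x1)
```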

19 Tree Dependent Distributions This is a generalization of naïve Bayes.
[Figure: the same tree rooted at Y.]
Inference Problem: Given the tree with all the associated probabilities, evaluate the probability of an event p(x,y).
P(x=1, y=0) = P(x=1|y=0)P(y=0)
Recursively, go up the tree along the path from x to y:
P(x=1|y=0) = Σ_{z=0,1} P(x=1|y=0, z)P(z|y=0) = Σ_{z=0,1} P(x=1|z)P(z|y=0)
Now we have everything in terms of the CPTs (conditional probability tables).

20 Tree Dependent Distributions This is a generalization of naïve Bayes.
[Figure: the same tree rooted at Y.]
Inference Problem: Given the tree with all the associated probabilities, evaluate the probability of an event p(x,u). (No direct path from x to u.)
P(x=1, u=0) = P(x=1|u=0)P(u=0)
Let y be a parent of x and u (we always have one):
P(x=1|u=0) = Σ_{y=0,1} P(x=1|u=0, y)P(y|u=0) = Σ_{y=0,1} P(x=1|y)P(y|u=0)
Now we have reduced it to cases we have seen.

21 Tree Dependent Distributions Inference Problem: Given the tree with all the associated CPTs, we showed that we can evaluate the probability of all events efficiently. There are more efficient algorithms. The idea was to show that inference in this case is a simple application of Bayes rule and probability theory.
[Figure: the same tree rooted at Y, with joint P(y, x_1, ..., x_n) = P(y) Π_i P(x_i | Parents(x_i)).]
Things are not so simple in the general case, due to cycles; there are multiple ways to get from node A to B, and this has to be accounted for in inference.

22 Graphical Models of Probability Distributions For general Bayesian Networks: The learning problem is hard. The inference problem (given the network, evaluate the probability of a given event) is hard (#P-Complete).
[Figure: a general DAG with root Y and nodes Z, Z_1, Z_2, Z_3, X_1, X_2, ..., X_10, with CPTs such as P(z_3|y) and P(x|z_1, z_2, z_3).]
P(y, x_1, x_2, ..., x_n) = P(y) Π_i P(x_i | Parents(x_i))

23 Variable Elimination Suppose the query is P(X_1). Key Intuition: Move irrelevant terms outside the summation and cache intermediate results.

24 Variable Elimination: Example 1
[Figure: a small network over A, B, C; the sum over A is computed first when answering the query.]
We want to compute P(C). Let's call this f_A(B): A has been (instantiated and) eliminated. What have we saved with this procedure? How many multiplications and additions did we perform?
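A sketch of this elimination, assuming for concreteness the chain A → B → C (the slide's figure is not reproduced in the transcription) and made-up CPTs; the intermediate factor f_A(B) = Σ_A P(A)P(B|A) is computed once and cached.

```python
# Variable elimination on the chain A -> B -> C (made-up CPT numbers).
P_A = {1: 0.3, 0: 0.7}
P_B_given_A = {(1, 1): 0.8, (1, 0): 0.2, (0, 1): 0.4, (0, 0): 0.6}  # key: (a, b)
P_C_given_B = {(1, 1): 0.9, (1, 0): 0.1, (0, 1): 0.3, (0, 0): 0.7}  # key: (b, c)

# Eliminate A: f_A(b) = sum_a P(a) P(b|a).  Computed once, cached as a table over b.
f_A = {b: sum(P_A[a] * P_B_given_A[(a, b)] for a in (0, 1)) for b in (0, 1)}

# Eliminate B: P(c) = sum_b P(c|b) f_A(b).
P_C = {c: sum(P_C_given_B[(b, c)] * f_A[b] for b in (0, 1)) for c in (0, 1)}

print(f_A)   # intermediate factor over B (not itself a probability distribution in general)
print(P_C)   # marginal over C; the two values sum to 1
```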

25 Variable Elimination VE is a sequential procedure. Given an ordering of the variables to eliminate: for each variable v that is not in the query, replace it with a new function f_v; that is, marginalize v out. The actual computation depends on the order. What are the domain and range of f_v? It need not be a probability distribution.

26 Variable Elimination: Example 2
[Figure: the alarm network with CPTs P(E), P(B), P(R|E), P(A|E,B), P(M|A), P(J|A).]
What is P(M, J | B)?

27 Variable Elimination: Example 2 Assumptions (graph; joint representation). It is sufficient to compute the numerator and normalize. Elimination order: R, A, E. To eliminate R:

28 Variable Elimination: Example 2 It is sufficient to compute the numerator and normalize. Elimination order: A, E. To eliminate A:

29 Variable Elimination: Example 2 It is sufficient to compute the numerator and normalize. Finally, eliminate E. Factors.

30 Variable Elimination The order in which variables are eliminated matters. In the previous example, what would happen if we eliminated E first? The size of the factors would be larger. Complexity of Variable Elimination: exponential in the size of the factors. What about the worst case? The worst case is intractable.

31 Inference Exact inference in Bayesian Networks is #P-hard: we can count the number of satisfying assignments for 3-SAT with a Bayesian Network. Approximate inference, e.g. Gibbs sampling. (Skip)

32 Approximate Inference P(x)? Basic idea: if we had access to a set of examples from the joint distribution, we could just count. For inference, we generate instances from the joint and count. How do we generate instances?

33 Generating instances Sampling from the Bayesian Network. Conditional probabilities, that is, P(X|E): only generate instances that are consistent with E. Problems? How many samples? [Law of large numbers] What if the evidence E is a very low-probability event? (Skip)

34 Detour: Markov Chain Review
[Figure: a three-state chain over A, B, C with transition probabilities on the edges (e.g. 0.1).]
Generates a sequence of A, B, C. Defined by initial and transition probabilities: P(X_0) and P(X_{t+1}=i | X_t=j) = P_{ij}, a time-independent transition probability matrix.
Stationary Distributions: A vector q is called a stationary distribution if q_i, the probability of being in state i, satisfies q_i = Σ_j q_j P_{ij}. If we sample from the Markov Chain repeatedly, the distribution over the states converges to the stationary distribution.
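A small sketch with a made-up 3-state transition matrix, showing the state distribution converging to the stationary distribution under repeated transitions.

```python
import numpy as np

# Made-up transition matrix for states A, B, C; T[i, j] = P(X_{t+1}=j | X_t=i),
# so each row sums to 1 (row-stochastic convention).
T = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

q = np.array([1.0, 0.0, 0.0])   # start deterministically in state A
for _ in range(100):            # repeatedly apply the transition matrix
    q = q @ T

print(q)                        # converged distribution over states
print(q @ T)                    # (approximately) unchanged: q is stationary
```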

35 Markov Chain Monte Carlo Our goal: to sample from P(X|e). Overall idea: the next sample is a function of the current sample. The samples can be thought of as coming from a Markov Chain whose stationary distribution is the distribution we want. Can approximate any distribution.

36 Gibbs Sampling The simplest MCMC method to sample from P(X = x_1 x_2 ... x_n | e). Creates a Markov Chain of samples as follows: initialize X randomly; at each time step, fix all random variables except one, and sample that random variable from the corresponding conditional distribution.

37 Gibbs Sampling Algorithm: Initialize X randomly. Iterate: Pick a variable X_i uniformly at random. Sample x_i^(t+1) from P(x_i | x_1^(t), ..., x_{i-1}^(t), x_{i+1}^(t), ..., x_n^(t), e). Set X_k^(t+1) = x_k^(t) for all other k. This is the next sample. X^(1), X^(2), ..., X^(t) forms a Markov Chain. Why is Gibbs Sampling easy for Bayes Nets? P(x_i | x_{-i}^(t), e) is local.

38 Gibbs Sampling: Big picture Given some conditional distribution we wish to compute, collect samples from the Markov Chain. Typically, the chain is allowed to run for some time before collecting samples (the burn-in period), so that the chain settles into the stationary distribution. Using the samples, we approximate the posterior by counting.

39 Gibbs Sampling Example 1
[Figure: a small network over A, B, C.]
We want to compute P(C). Suppose, after burn-in, the Markov Chain is at A=true, B=false, C=false.
1. Pick a variable: B.
2. Draw the new value of B from P(B | A=true, C=false) = P(B | A=true). Suppose B_new = true.
3. Our new sample is A=true, B=true, C=false.
4. Repeat.
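A sketch of this loop, assuming for concreteness a structure in which A is the parent of both B and C (so that, as in step 2 above, P(B | A, C) = P(B | A)); all numbers are made up. The collected samples are used to estimate P(C=true) by counting.

```python
import random

# Made-up CPTs for the assumed structure A -> B, A -> C.
P_A = 0.3                               # P(A=True)
P_B_given_A = {True: 0.8, False: 0.2}   # P(B=True | A)
P_C_given_A = {True: 0.7, False: 0.1}   # P(C=True | A)

def sample_A(b, c):
    # P(A | B, C) ∝ P(A) P(B|A) P(C|A)   (Markov blanket of A)
    w = {}
    for a in (True, False):
        pa = P_A if a else 1 - P_A
        pb = P_B_given_A[a] if b else 1 - P_B_given_A[a]
        pc = P_C_given_A[a] if c else 1 - P_C_given_A[a]
        w[a] = pa * pb * pc
    return random.random() < w[True] / (w[True] + w[False])

a, b, c = True, False, False            # state after burn-in (as on the slide)
count_c, n_samples = 0, 20000
for _ in range(n_samples):
    var = random.choice("ABC")          # pick a variable uniformly at random
    if var == "A":
        a = sample_A(b, c)
    elif var == "B":
        b = random.random() < P_B_given_A[a]   # P(B | A, C) = P(B | A)
    else:
        c = random.random() < P_C_given_A[a]   # P(C | A, B) = P(C | A)
    count_c += c

print(count_c / n_samples)   # ≈ P(C=True) = 0.3*0.7 + 0.7*0.1 = 0.28
```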

40 Gibbs Sampling Example 2
[Figure: the alarm network with CPTs P(E), P(B), P(R|E), P(A|E,B), P(M|A), P(J|A).]
Exercise: P(M, J | B)?

41 Example: Hidden Markov Model
[Figure: the HMM chain Y_1 → Y_2 → ... → Y_6, with each hidden state Y_i emitting an observation X_i; transition probabilities on the Y→Y edges, emission probabilities on the Y→X edges.]
A Bayesian Network with a specific structure. The X's are called the observations and the Y's are the hidden states. Useful for sequence tagging tasks: part of speech, modeling temporal structure, speech recognition, etc.

42 HMM: Computational Problems Probability of an observation given an HMM, P(X | parameters): Dynamic Programming. Finding the best hidden states for a given sequence, P(Y | X, parameters): Dynamic Programming. Learning the parameters from observations: EM.

43 Gibbs Sampling for HMM Goal: Computing P(y|x). Initialize the Y's randomly. Iterate: Pick a random Y_i; draw Y_i from P(Y_i | Y_{i-1}, Y_{i+1}, X_i). Only these variables are needed because they form the Markov blanket of Y_i. Compute the probability using counts after the burn-in period. Gibbs sampling allows us to introduce priors on the emission and transition probabilities.
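A sketch of the Markov-blanket update above for a binary-state HMM with made-up transition and emission tables; the conditional is P(Y_i | Y_{i-1}, Y_{i+1}, X_i) ∝ P(Y_i | Y_{i-1}) P(Y_{i+1} | Y_i) P(X_i | Y_i).

```python
import random

# Made-up binary HMM parameters.
trans = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.6}   # P(Y_{i+1}=y' | Y_i=y)
emit  = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}   # P(X_i=x | Y_i=y)

def gibbs_update(Y, X, i):
    """Resample Y[i] from P(Y_i | Y_{i-1}, Y_{i+1}, X_i), its Markov blanket."""
    w = {}
    for y in (0, 1):
        p = emit[(y, X[i])]
        if i > 0:
            p *= trans[(Y[i - 1], y)]          # incoming transition
        if i < len(Y) - 1:
            p *= trans[(y, Y[i + 1])]          # outgoing transition
        w[y] = p
    Y[i] = int(random.random() < w[1] / (w[0] + w[1]))

X = [1, 1, 0, 0, 1, 0]                         # an observed sequence
Y = [random.randint(0, 1) for _ in X]          # initialize the hidden states randomly
for _ in range(1000):                          # Gibbs sweeps (burn-in handling omitted)
    gibbs_update(Y, X, random.randrange(len(Y)))
print(Y)
```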

44 Bayesian Networks Bayesian Networks: compact representation of probability distributions. Universal: can represent all distributions. Inference: in the worst case, every random variable will be connected to all others, and inference is hard in the worst case. Learning? Exact inference is #P-hard, approximate inference is NP-hard [Roth93,96]. Inference for trees is efficient. General exact inference: Variable Elimination.

45 Tree Dependent Distributions Learning Problem: Given data (n tuples) assumed to be sampled from a tree-dependent distribution (What does that mean? Generative model), find the tree representation of the distribution. (What does that mean?)
[Figure: the same tree rooted at Y, with joint P(y, x_1, ..., x_n) = P(y) Π_i P(x_i | Parents(x_i)).]
Among all trees, find the most likely one, given the data: P(T|D) = P(D|T) P(T)/P(D).

46 Tree Dependent Distributions Learning Problem: Given data (n tuples) assumed to be sampled from a tree-dependent distribution, find the tree representation of the distribution.
[Figure: the same tree rooted at Y.]
Assuming a uniform prior on trees, the Maximum Likelihood approach is to maximize P(D|T):
T_ML = argmax_T P(D|T) = argmax_T Π_{x ∈ D} P_T(x_1, x_2, ..., x_n)
Now we can see why we had to solve the inference problem first; it is required for learning.

47 Tree Dependent Distributions Learning Problem: Given data (n tuples) assumed to be sampled from a tree-dependent distribution, find the tree representation of the distribution.
[Figure: the same tree rooted at Y.]
Assuming a uniform prior on trees, the Maximum Likelihood approach is to maximize P(D|T):
T_ML = argmax_T P(D|T) = argmax_T Π_{x ∈ D} P_T(x_1, x_2, ..., x_n) = argmax_T Π_{x ∈ D} Π_i P_T(x_i | Parents(x_i))
Try this for naïve Bayes.

48 Example: Learning Distributions
Probability Distribution 1: [a full table over x_1, x_2, x_3, x_4; not preserved in the transcription].
Probability Distribution 2: [Figure: a tree with root X_4 (P(x_4)) and children X_1, X_2, X_3 with CPTs P(x_1|x_4), P(x_2|x_4), P(x_3|x_4).]
Probability Distribution 3: [Figure: a tree with root X_4 (P(x_4)), children X_1 and X_2 with CPTs P(x_1|x_4), P(x_2|x_4), and X_3 a child of X_2 with CPT P(x_3|x_2).]
Are these representations of the same distribution? Given a sample, which of these generated it?

49 Example: Learning Distributions
Probability Distribution 1: [the table]. Probability Distributions 2 and 3: [the two trees rooted at X_4, as on the previous slide].
We are given 3 data points: 1011; 1001; 0100. Which one is the target distribution?

50 Example: Learning Distributions Probability Distribution 1: What is the likelihood that this table generated the data?
P(T|D) = P(D|T) P(T)/P(D)
Likelihood(T) ≈ P(D|T) ≈ P(1011|T) P(1001|T) P(0100|T)
P(1011|T) = 0; P(1001|T) = 0.1; P(0100|T) = 0.1
P(Data|Table) = 0
We are given 3 data points: 1011; 1001; 0100. Which one is the target distribution?

51 Example: Learning Distributions Probability Distribution 2: What is the likelihood that the data was sampled from Distribution 2? Need to define it:
P(x_4=1) = 1/2
p(x_1=1|x_4=0) = 1/2, p(x_1=1|x_4=1) = 1/2
p(x_2=1|x_4=0) = 1/3, p(x_2=1|x_4=1) = 1/3
p(x_3=1|x_4=0) = 1/6, p(x_3=1|x_4=1) = 5/6
[Figure: the Distribution 2 tree, root X_4 with children X_1, X_2, X_3.]
Likelihood(T) ≈ P(D|T) ≈ P(1011|T) P(1001|T) P(0100|T)
P(1011|T) = p(x_4=1) p(x_1=1|x_4=1) p(x_2=0|x_4=1) p(x_3=1|x_4=1) = 1/2 · 1/2 · 2/3 · 5/6 = 10/72
P(1001|T) = ... = 1/2 · 1/2 · 2/3 · 5/6 = 10/72
P(0100|T) = ... = 1/2 · 1/2 · 2/3 · 5/6 = 10/72
P(Data|Tree) = 125/(4^3 · 3^6)

52 Example: Learning Distributions Probability Distribution 3: What is the likelihood that the data was sampled from Distribution 3? Need to define it:
P(x_4=1) = 2/3
p(x_1=1|x_4=0) = 1/3, p(x_1=1|x_4=1) = 1
p(x_2=1|x_4=0) = 1, p(x_2=1|x_4=1) = 1/2
p(x_3=1|x_2=0) = 2/3, p(x_3=1|x_2=1) = 1/6
[Figure: the Distribution 3 tree, root X_4 with children X_1 and X_2, and X_3 a child of X_2.]
Likelihood(T) ≈ P(D|T) ≈ P(1011|T) P(1001|T) P(0100|T)
P(1011|T) = p(x_4=1) p(x_1=1|x_4=1) p(x_2=0|x_4=1) p(x_3=1|x_2=0) = 2/3 · 1 · 1/2 · 2/3 = 2/9
P(1001|T) = ... = 1/2 · 1/2 · 2/3 · 1/6 = 1/36
P(0100|T) = ... = 1/2 · 1/2 · 1/3 · 5/6 = 5/72
P(Data|Tree) = 10/23328
Distribution 2 is the most likely distribution to have produced the data.

53 Example: Summary We are now in the same situation we were in when we decided which of two coins, fair (0.5, 0.5) or biased (0.7, 0.3), generated the data. But, this isn't the most interesting case. In general, we will not have a small number of possible distributions to choose from, but rather a parameterized family of distributions. (Analogous to a coin with p ∈ [0,1].) We need a systematic way to search this family of distributions.

54 Example: Summary First, let's make sure we understand what we are after. We have 3 data points that have been generated according to our target distribution: 1011; 1001; 0100. What is the target distribution? We cannot find THE target distribution. What is our goal? As before, we are interested in generalization. Given Data (e.g., the above 3 data points), we would like to know P(1111) or P(11**), P(***0), etc. We could compute it directly from the data, but... Assumptions about the distribution are crucial here.

55 Learning Tree Dependent Distributions Learning Problem:
1. Given data (n tuples) assumed to be sampled from a tree-dependent distribution, find the most probable tree representation of the distribution.
2. Given data (n tuples), find the tree representation that best approximates the distribution (without assuming that the data is sampled from a tree-dependent distribution).
[Figure: the space of all distributions, with the space of all tree distributions inside it; in case 1 the target distribution is a tree, in case 2 we find the tree closest to the target distribution.]

56 Learning Tree Dependent Distributions Learning Problem:
1. Given data (n tuples) assumed to be sampled from a tree-dependent distribution, find the most probable tree representation of the distribution.
2. Given data (n tuples), find the tree representation that best approximates the distribution (without assuming that the data is sampled from a tree-dependent distribution).
The simple-minded algorithm for learning a tree-dependent distribution requires:
(1) for each tree, compute its likelihood L(T) = P(D|T) = Π_{x ∈ D} P_T(x_1, x_2, ..., x_n) = Π_{x ∈ D} Π_i P_T(x_i | Parents(x_i));
(2) find the maximal one.
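A sketch of step (1): scoring one candidate tree by the log-likelihood of the data. The tree structure, CPT numbers, and data below are made up for illustration.

```python
import math

# One candidate tree over binary variables x1..x4 (a star rooted at x4);
# structure, CPT numbers, and data are made up.
parent = {"x4": None, "x1": "x4", "x2": "x4", "x3": "x4"}
prob = {  # prob[v][(value, parent_value)] = P(v = value | parent = parent_value)
    "x4": {(1, None): 0.6, (0, None): 0.4},
    "x1": {(1, 1): 0.7, (0, 1): 0.3, (1, 0): 0.2, (0, 0): 0.8},
    "x2": {(1, 1): 0.5, (0, 1): 0.5, (1, 0): 0.9, (0, 0): 0.1},
    "x3": {(1, 1): 0.4, (0, 1): 0.6, (1, 0): 0.3, (0, 0): 0.7},
}

def log_likelihood(data):
    """log P(D|T) = sum over examples x of sum_i log P_T(x_i | Parents(x_i))."""
    total = 0.0
    for example in data:                       # example: dict variable -> value
        for v, pa in parent.items():
            pa_val = example[pa] if pa is not None else None
            total += math.log(prob[v][(example[v], pa_val)])
    return total

data = [{"x1": 1, "x2": 0, "x3": 1, "x4": 1},
        {"x1": 1, "x2": 0, "x3": 0, "x4": 1},
        {"x1": 0, "x2": 1, "x3": 0, "x4": 0}]
print(log_likelihood(data))   # the naive algorithm would repeat this for every tree
```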

57 1. Distance Measure To measure how well a probability distribution P is approximated by probability distribution T we use here the Kullback-Leibler cross-entropy measure (KL divergence):
D(P, T) = Σ_x P(x) log [P(x)/T(x)]
Non-negative; D(P,T) = 0 iff P and T are identical; non-symmetric. Measures how much P differs from T.
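A tiny sketch of the measure, for two made-up distributions over three outcomes.

```python
import math

def kl(P, T):
    """D(P, T) = sum_x P(x) log(P(x) / T(x)); terms with P(x)=0 contribute 0."""
    return sum(p * math.log(p / t) for p, t in zip(P, T) if p > 0)

P = [0.5, 0.3, 0.2]
T = [0.4, 0.4, 0.2]
print(kl(P, T), kl(T, P))   # non-negative, and not symmetric
print(kl(P, P))             # 0 iff the two distributions are identical
```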

58 2. Ranking Dependencies Intuitively, the important edges to keep in the tree are edges (x---y) for x, y which depend on each other. Given that the distance between distributions is measured using the KL divergence, the corresponding measure of dependence is the mutual information between x and y (measuring the information x gives about y):
I(x, y) = Σ_{x,y} P(x, y) log [P(x, y)/(P(x)P(y))]
which we can estimate with respect to the empirical distribution (that is, the given data).
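A sketch of estimating I(x, y) from the empirical distribution of two binary data columns; the samples are made up.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(x,y) = sum_{x,y} P(x,y) log( P(x,y) / (P(x)P(y)) )."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))            # joint counts
    px, py = Counter(xs), Counter(ys)     # marginal counts
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

xs = [1, 1, 0, 0, 1, 1, 0, 0]
print(mutual_information(xs, xs))                        # perfectly dependent: log 2
print(mutual_information(xs, [1, 0, 1, 0, 1, 0, 1, 0]))  # empirically independent: 0.0
```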

59 Learning Tree Dependent Distributions The algorithm is given m independent measurements from P. For each variable x, estimate P(x) (binary variables: n numbers). For each pair of variables x, y, estimate P(x,y) (O(n^2) numbers). For each pair of variables compute the mutual information. Build a complete undirected graph with all the variables as vertices; let I(x,y) be the weight of the edge (x,y). Build a maximum weighted spanning tree.

60 Spanning Tree Goal: Find a subset of the edges that forms a tree that includes every vertex, where the total weight of all the edges in the tree is maximized. Sort the weights. Start greedily with the largest one. Add the next largest as long as it does not create a loop; in case of a loop, discard this weight and move on to the next weight. This algorithm will create a tree; it is a spanning tree: it touches all the vertices. It is not hard to see that this is the maximum weighted spanning tree. The complexity is O(n^2 log(n)).
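A sketch of this greedy procedure (Kruskal-style, with a union-find structure to detect loops), run on a small made-up weighted graph.

```python
def max_spanning_tree(n_vertices, weighted_edges):
    """Greedy maximum weighted spanning tree: take edges from heaviest to lightest,
    skipping any edge that would create a loop (detected with union-find)."""
    parent = list(range(n_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path compression
            v = parent[v]
        return v

    tree = []
    for w, u, v in sorted(weighted_edges, reverse=True):   # largest weight first
        ru, rv = find(u), find(v)
        if ru != rv:                                       # no loop: keep the edge
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Complete graph on 4 vertices with made-up edge weights (e.g. mutual informations).
edges = [(0.9, 0, 1), (0.2, 0, 2), (0.8, 0, 3), (0.3, 1, 2), (0.7, 1, 3), (0.4, 2, 3)]
print(max_spanning_tree(4, edges))   # 3 edges, total weight maximized
```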

61 Learning Tree Dependent Distributions (1) The algorithm is given m independent measurements from P. For each variable x, estimate P(x) (binary variables: n numbers). For each pair of variables x, y, estimate P(x,y) (O(n^2) numbers). For each pair of variables compute the mutual information. (2) Build a complete undirected graph with all the variables as vertices; let I(x,y) be the weight of the edge (x,y). Build a maximum weighted spanning tree. (3) Transform the resulting undirected tree into a directed tree: choose a root variable and set the direction of all the edges away from it. Place the corresponding conditional probabilities on the edges.

62 Correctness (1) Place the corresponding conditional probabilities on the edges. Given a tree t, defining a probability distribution T by forcing the conditional probabilities along the edges to coincide with those computed from a sample taken from P gives the best tree-dependent approximation to P. Let T be the tree-dependent distribution according to the fixed tree t. Recall:
T(x) = Π_i T(x_i | Parent(x_i)) = Π_i P(x_i | π(x_i))
D(P, T) = Σ_x P(x) log [P(x)/T(x)]

63 Correctness (1) Place the corresponding conditional probabilities on the edges. Given a tree t, defining T by forcing the conditional probabilities along the edges to coincide with those computed from a sample taken from P gives the best t-dependent approximation to P.
D(P, T) = Σ_x P(x) log [P(x)/T(x)] = Σ_x P(x) log P(x) - Σ_x P(x) log T(x)
= -H(x) - Σ_x P(x) Σ_{i=1..n} log T(x_i | π(x_i))
When is the second term maximized? That is, how should we define T(x_i | π(x_i))? (Slight abuse of notation at the root.)

64 Correctness (1)
D(P, T) = -H(x) - Σ_x P(x) Σ_{i=1..n} log T(x_i | π(x_i))
= -H(x) - Σ_{i=1..n} Σ_x P(x) log T(x_i | π(x_i))        (definition of expectation)
= -H(x) - Σ_{i=1..n} E_P[log T(x_i | π(x_i))]
= -H(x) - Σ_{i=1..n} Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log T(x_i | π(x_i))
The term Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log T(x_i | π(x_i)) takes its maximal value when we set: T(x_i | π(x_i)) = P(x_i | π(x_i)).

65 Correctness (2) Let I(x,y) be the weight of the edge (x,y). Maximizing the sum of the information gains minimizes the distributional distance.
We showed that: D(P, T) = -H(x) - Σ_{i=1..n} Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log P(x_i | π(x_i))
However: log P(x_i | π(x_i)) = log [P(x_i, π(x_i)) / (P(x_i) P(π(x_i)))] + log P(x_i)
so Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log P(x_i | π(x_i)) = I(x_i, π(x_i)) + Σ_{x_i} P(x_i) log P(x_i)
This gives: D(P,T) = -H(x) - Σ_{i=1..n} I(x_i, π(x_i)) - Σ_{i=1..n} Σ_{x_i} P(x_i) log P(x_i)
The 1st and 3rd terms do not depend on the tree structure. Since the distance is non-negative, minimizing it is equivalent to maximizing the sum of the edge weights I(x,y).

66 Correctness (2) Let I(x,y) be the weight of the edge (x,y). Maximizing the sum of the information gains minimizes the distributional distance. We showed that T is the best tree approximation of P if it is chosen to maximize the sum of the edge weights:
D(P,T) = -H(x) - Σ_{i=1..n} I(x_i, π(x_i)) - Σ_{i=1..n} Σ_{x_i} P(x_i) log P(x_i)
The minimization problem is solved without the need to exhaustively consider all possible trees. This was achieved since we transformed the problem of finding the best tree to that of finding the heaviest one, with mutual information on the edges.

67 Correctness (3) Transform the resulting undirected tree into a directed tree. (Choose a root variable and direct all the edges away from it.) What does it mean that you get the same distribution regardless of the chosen root? (Exercise) This algorithm learns the best tree-dependent approximation of a distribution D.
L(T) = P(D|T) = Π_{x ∈ D} Π_i P_T(x_i | Parent(x_i))
Given data, this algorithm finds the tree that maximizes the likelihood of the data. The algorithm is called the Chow-Liu Algorithm. Suggested in 1968 in the context of data compression, and adapted by Pearl to Bayesian Networks. Invented a couple more times, and generalized since then.

68 Example: Learning Tree Dependent Distributions We have 3 data points that have been generated according to the target distribution: 1011; 1001; 0100.
We need to estimate some parameters: P(A=1) = 2/3, P(B=1) = 1/3, P(C=1) = 1/3, P(D=1) = 2/3.
For the values 00, 01, 10, 11 respectively, we have that:
P(A,B) = 0; 1/3; 2/3; 0    P(A,B)/P(A)P(B) = 0; 3; 3/2; 0    I(A,B) ~ 9/2
P(A,C) = 1/3; 0; 1/3; 1/3  P(A,C)/P(A)P(C) = 3/2; 0; 3/4; 3/2  I(A,C) ~ 15/4
P(A,D) = 1/3; 0; 0; 2/3    P(A,D)/P(A)P(D) = 3; 0; 0; 3/2    I(A,D) ~ 9/2
P(B,C) = 1/3; 1/3; 1/3; 0  P(B,C)/P(B)P(C) = 3/4; 3/2; 3/2; 0  I(B,C) ~ 15/4
P(B,D) = 0; 2/3; 1/3; 0    P(B,D)/P(B)P(D) = 0; 3; 3/2; 0    I(B,D) ~ 9/2
P(C,D) = 1/3; 1/3; 0; 1/3  P(C,D)/P(C)P(D) = 3/2; 3/4; 0; 3/2  I(C,D) ~ 15/4
I(x, y) = Σ_{x,y} P(x, y) log [P(x, y)/(P(x)P(y))]
Generate the tree; place probabilities. [Figure: the resulting tree over A, B, C, D.]

69 Learning Tree Dependent Distributions The Chow-Liu algorithm finds the tree that maximizes the likelihood. In particular, if D is a tree-dependent distribution, this algorithm learns D. (What does it mean?) Less is known about how many examples are needed in order for it to converge. (What does that mean?) Notice that we are taking statistics to estimate the probabilities of some events in order to generate the tree. Then, we intend to use it to evaluate the probability of other events. One may ask the question: why do we need this structure? Why can't we answer the query directly from the data? (Almost like making a prediction directly from the data in the badges problem.)
