1 CIS 519/419 Applied Machine Learning Dan Roth 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn or CS446 at UIUC), Eric Eaton for CIS519/419 at Penn, or from other authors who have made their ML slides available.

2 Administration Exam: The exam will take place on the originally assigned date, 4/30. Similar to the previous midterm. 75 minutes; closed books. What is covered: The focus is on the material covered after the previous mid-term. However, notice that the ideas in this class are cumulative!! Everything that we present in class and in the homework assignments. Material that is in the slides but is not discussed in class is not part of the material required for the exam. Example 1: We talked about Boosting, but not about boosting the confidence. Example 2: We talked about multiclass classification: OvA, AvA, but not Error Correcting codes and the additional material in the slides. We will give a few practice exams. Homework: missing and regrades.

3 Administration Projects We will have a poster session 6-8pm on May 7 in the active learning room, 3401 Walnut. The hope is that this will be a fun event where all of you have an opportunity to see and discuss the projects people have done. All are invited! Mandatory for CIS519 students. The final project report will be due on 5/8. Logistics: you will send us your posters a day earlier; we will print them and hang them; you will present them. If you haven't done so already: come to my office hours at least once this or next week to discuss the project!!

4 Summary: Basic Probability
Product Rule: P(A,B) = P(A|B)P(B) = P(B|A)P(A)
If A and B are independent: P(A,B) = P(A)P(B); P(A|B) = P(A), P(A|B,C) = P(A|C)
Sum Rule: P(A∨B) = P(A) + P(B) - P(A,B)
Bayes Rule: P(A|B) = P(B|A)P(A)/P(B)
Total Probability: If events A_1, A_2, ..., A_n are mutually exclusive (A_i ∧ A_j = ∅, Σ_i P(A_i) = 1), then P(B) = Σ_i P(B, A_i) = Σ_i P(B|A_i)P(A_i)
Total Conditional Probability: If events A_1, A_2, ..., A_n are mutually exclusive (A_i ∧ A_j = ∅, Σ_i P(A_i) = 1), then P(B|C) = Σ_i P(B, A_i|C) = Σ_i P(B|A_i,C)P(A_i|C)
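A quick numeric sanity check of the rules above; all probabilities here are made-up numbers for illustration only.

```python
# Made-up numbers: a prior, and the likelihood of B under A and under not-A.
P_A = 0.3                    # P(A)
P_B_given_A = 0.8            # P(B|A)
P_B_given_notA = 0.1         # P(B|~A)

# Total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)

# Bayes rule: P(A|B) = P(B|A)P(A) / P(B)
P_A_given_B = P_B_given_A * P_A / P_B

print(f"P(B)   = {P_B:.3f}")          # 0.8*0.3 + 0.1*0.7 = 0.31
print(f"P(A|B) = {P_A_given_B:.3f}")  # 0.24 / 0.31 ≈ 0.774
```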

5 So far: Bayesian Learning What does it mean to be Bayesian? Naïve Bayes Independence assumptions. EM Algorithm Learning with hidden variables. Today: Representing arbitrary probability distributions. Inference Exact inference; approximate inference. Learning representations of probability distributions.

6 Unsupervised Learning We get as input (n+1)-tuples: (X_1, X_2, ..., X_n, X_{n+1}). There is no notion of a class variable or a label. After seeing a few examples, we would like to know something about the domain: correlations between variables, probability of certain events, etc. We want to learn the most likely model that generated the data. Sometimes called density estimation.

7 Simple Distributions In general, the problem is very hard. But, under some assumptions on the distribution we have shown that we can do it. (Exercise: show it is the most likely distribution.)
[Figure: the naïve Bayes structure — a root y with prior P(y) and children x_1, x_2, x_3, ..., x_n, each edge labeled P(x_i|y).]
Assumptions (conditional independence given y): P(x_i|x_j, y) = P(x_i|y) for all i, j.
Can these (strong) assumptions be relaxed? Can we learn more general probability distributions? (These are essential in many applications: language, vision.)

8 Simple Distributions
[Figure: the same naïve Bayes structure — root y with P(y), children x_1, ..., x_n with edge labels P(x_i|y).]
Under the assumption P(x_i|x_j, y) = P(x_i|y) for all i, j, we can compute the joint probability distribution on the n+1 variables:
P(y, x_1, x_2, ..., x_n) = P(y) Π_{i=1..n} P(x_i|y)
Therefore, we can compute the probability of any event:
P(x_1=0, x_2=0, y=1) = Σ_{b_3,...,b_n ∈ {0,1}} P(y=1, x_1=0, x_2=0, x_3=b_3, x_4=b_4, ..., x_n=b_n)
More efficiently (directly from the independence assumption):
P(x_1=0, x_2=0, y=1) = P(x_1=0, x_2=0 | y=1) P(y=1) = P(x_1=0|y=1) P(x_2=0|y=1) P(y=1)
We can compute the probability of any event or conditional event over the n+1 variables.
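A minimal sketch of the two computations above for a toy naïve Bayes model with n = 4 binary features; all CPT numbers below are made up for illustration.

```python
import itertools

# Made-up CPTs: P(y=1) and P(x_i=1 | y) for four binary features.
p_y1 = 0.4
p_xi1 = {0: [0.7, 0.2, 0.5, 0.6],   # p_xi1[y][i] = P(x_i = 1 | y)
         1: [0.3, 0.9, 0.4, 0.8]}

def joint(y, xs):
    """P(y, x_1, ..., x_4) = P(y) * prod_i P(x_i | y)."""
    p = p_y1 if y == 1 else 1 - p_y1
    for i, x in enumerate(xs):
        q = p_xi1[y][i]
        p *= q if x == 1 else 1 - q
    return p

# Brute force: sum the joint over the unobserved variables x_3 and x_4.
brute = sum(joint(1, (0, 0, b3, b4))
            for b3, b4 in itertools.product((0, 1), repeat=2))

# Directly from the independence assumption: P(x_1=0|y=1) P(x_2=0|y=1) P(y=1).
direct = (1 - p_xi1[1][0]) * (1 - p_xi1[1][1]) * p_y1

print(brute, direct)   # the two numbers agree
```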

9 Representing Probability Distributions Goal: To represent all joint probability distributions over a set of random variables X_1, X_2, ..., X_n. There are many ways to represent distributions. A table, listing the probability of each instance in {0,1}^n: we will need 2^n - 1 numbers. What can we do? Make independence assumptions. Multi-linear polynomials: multinomials over variables. Bayesian Networks: directed acyclic graphs. Markov Networks: undirected graphs.

10 Graphical Models of Probability Distributions Bayesian Networks represent the joint probability distribution over a set of variables.
Independence Assumption: for all x_i, x_i is independent of its non-descendants given its parents. (This is a theorem. To prove it, order the nodes from leaves up, and use the product rule.)
The terms are called CPTs (Conditional Probability Tables) and they completely define the probability distribution. With these conventions, the joint probability distribution is given by:
P(y, x_1, x_2, ..., x_n) = P(y) Π_i P(x_i | Parents(x_i))
[Figure: an example DAG with root Y, intermediate nodes Z, Z_1, Z_2, Z_3, and leaves X_1, X_2, ..., X_10; z is a parent of x, and x is a descendant of y.]

11 Bayesian Network Semantics of the DAG: Nodes are random variables. Edges represent causal influences. Each node is associated with a conditional probability distribution. Two equivalent viewpoints: A data structure that represents the joint distribution compactly. A representation for a set of conditional independence assumptions about a distribution.

12 Bayesian Network: Example The burglar alarm in your house rings when there is a burglary or an earthquake. An earthquake will be reported on the radio. If an alarm rings and your neighbors hear it, they will call you. What are the random variables?

13 Bayesian Network: Example
[Figure: the alarm network — Earthquake and Burglary are parents of Alarm; Earthquake is also a parent of Radio; Alarm is a parent of Mary Calls and John Calls.]
If there is an earthquake, you'll probably hear about it on the radio. An alarm can ring because of a burglary or an earthquake. If your neighbors hear an alarm, they will call you.
How many parameters do we have? How many would we have if we had to store the entire joint?

14 Bayesian Network: Example
[Figure: the alarm network with CPTs P(E), P(B), P(R|E), P(A|E,B), P(M|A), P(J|A) attached to Earthquake, Burglary, Radio, Alarm, Mary Calls, and John Calls.]
With these probabilities (and the assumptions encoded in the graph) we can compute the probability of any event over these variables.
P(E, B, A, R, M, J) = P(E) P(B, A, R, M, J | E)
= P(E) P(B) P(A, R, M, J | E, B)
= P(E) P(B) P(R | E, B) P(M, J, A | E, B)
= P(E) P(B) P(R | E) P(M, J | A, E, B) P(A | E, B)
= P(E) P(B) P(R | E) P(M | A) P(J | A) P(A | E, B)
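A minimal sketch of this factored computation. The CPT numbers below are made up (the transcription does not preserve the slide's tables); the code evaluates the joint for one full assignment.

```python
# Made-up CPTs for the alarm network (illustrative numbers only).
P_E = {1: 0.01, 0: 0.99}                       # P(Earthquake)
P_B = {1: 0.02, 0: 0.98}                       # P(Burglary)
P_R_given_E = {1: 0.9, 0: 0.05}                # P(Radio=1 | E)
P_A_given_EB = {(1, 1): 0.95, (1, 0): 0.3,     # P(Alarm=1 | E, B)
                (0, 1): 0.8,  (0, 0): 0.001}
P_M_given_A = {1: 0.7, 0: 0.05}                # P(MaryCalls=1 | A)
P_J_given_A = {1: 0.9, 0: 0.02}                # P(JohnCalls=1 | A)

def bernoulli(p_one, value):
    """Probability of a binary value given the probability of value 1."""
    return p_one if value == 1 else 1 - p_one

def joint(e, b, a, r, m, j):
    """P(E,B,A,R,M,J) = P(E) P(B) P(R|E) P(A|E,B) P(M|A) P(J|A)."""
    return (P_E[e] * P_B[b]
            * bernoulli(P_R_given_E[e], r)
            * bernoulli(P_A_given_EB[(e, b)], a)
            * bernoulli(P_M_given_A[a], m)
            * bernoulli(P_J_given_A[a], j))

# Probability of: no earthquake, burglary, alarm rings, no radio report, both neighbors call.
print(joint(e=0, b=1, a=1, r=0, m=1, j=1))
```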

15 Computational Problems Learning the structure of the Bayes net (What would be the guiding principle?). Learning the parameters: Supervised? Unsupervised?
Inference: Computing the probability of an event [#P-Complete, Roth 93, 96]: Given structure and parameters, and given an observation E, what is the probability of Y? P(Y=y | E=e) (E, Y are sets of instantiated variables).
Most likely explanation (Maximum A Posteriori assignment, MAP, MPE) [NP-hard; Shimony 94]: Given structure and parameters, and given an observation E, what is the most likely assignment to Y? argmax_y P(Y=y | E=e) (E, Y are sets of instantiated variables).

16 Inference Inference in Bayesian Networks is generally intractable in the worst case. Two broad approaches for inference: Exact inference, e.g. Variable Elimination. Approximate inference, e.g. Gibbs sampling.

17 Tree Dependent Distributions Directed acyclic graph; each node has at most one parent. Independence Assumption: x is independent of its non-descendants given its parents (x is independent of other nodes given z; v is independent of w given u).
P(y, x_1, x_2, ..., x_n) = P(y) Π_i P(x_i | Parents(x_i))
Need to know two numbers for each link: P(x|z), and a prior for the root, P(y).
[Figure: a tree rooted at Y with CPTs on the edges, e.g. P(x|z) on the edge Z→X and P(s|y) on the edge Y→S; nodes include Y, Z, X, U, V, W, T, S.]

18 Tree Dependent Distributions This is a generalization of naïve Bayes.
[Figure: the same tree rooted at Y, with joint P(y, x_1, ..., x_n) = P(y) Π_i P(x_i | Parents(x_i)).]
Inference Problem: Given the tree with all the associated probabilities, evaluate the probability of an event p(x).
P(x=1) = P(x=1|z=1)P(z=1) + P(x=1|z=0)P(z=0)
Recursively, go up the tree:
P(z=1) = P(z=1|y=1)P(y=1) + P(z=1|y=0)P(y=0)
P(z=0) = P(z=0|y=1)P(y=1) + P(z=0|y=0)P(y=0)
Now we have everything in terms of the CPTs (conditional probability tables). Linear Time Algorithm.
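A minimal sketch of this bottom-up computation on the chain y → z → x; the CPT numbers are made up for illustration.

```python
# Marginal P(x=1) on the chain y -> z -> x, exactly as on the slide.
P_y1 = 0.6                          # P(y=1)
P_z1_given_y = {1: 0.7, 0: 0.2}     # P(z=1 | y)
P_x1_given_z = {1: 0.9, 0: 0.1}     # P(x=1 | z)

# Go up the tree: first the marginal of z ...
P_z1 = P_z1_given_y[1] * P_y1 + P_z1_given_y[0] * (1 - P_y1)
P_z0 = 1 - P_z1

# ... then the marginal of x, now expressed purely in terms of CPT entries.
P_x1 = P_x1_given_z[1] * P_z1 + P_x1_given_z[0] * P_z0
print(P_x1)
```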

19 Tree Dependent Distributions This is a generalization of naïve Bayes.
[Figure: the same tree rooted at Y.]
Inference Problem: Given the tree with all the associated probabilities, evaluate the probability of an event p(x,y).
P(x=1, y=0) = P(x=1|y=0)P(y=0)
Recursively, go up the tree along the path from x to y:
P(x=1|y=0) = Σ_{z=0,1} P(x=1|y=0, z)P(z|y=0) = Σ_{z=0,1} P(x=1|z)P(z|y=0)
Now we have everything in terms of the CPTs (conditional probability tables).

20 Tree Dependent Distributions This is a generalization of naïve Bayes.
[Figure: the same tree rooted at Y.]
Inference Problem: Given the tree with all the associated probabilities, evaluate the probability of an event p(x,u). (No direct path from x to u.)
P(x=1, u=0) = P(x=1|u=0)P(u=0)
Let y be a parent of x and u (we always have one):
P(x=1|u=0) = Σ_{y=0,1} P(x=1|u=0, y)P(y|u=0) = Σ_{y=0,1} P(x=1|y)P(y|u=0)
Now we have reduced it to cases we have seen.

21 Tree Dependent Distributions Inference Problem: Given the tree with all the associated CPTs, we showed that we can evaluate the probability of all events efficiently. There are more efficient algorithms. The idea was to show that inference in this case is a simple application of Bayes rule and probability theory.
[Figure: the same tree rooted at Y, with joint P(y, x_1, ..., x_n) = P(y) Π_i P(x_i | Parents(x_i)).]
Things are not so simple in the general case, due to cycles; there are multiple ways to get from node A to B, and this has to be accounted for in inference.

22 Graphical Models of Probability Distributions For general Bayesian Networks: The learning problem is hard. The inference problem (given the network, evaluate the probability of a given event) is hard (#P-Complete).
[Figure: a general DAG with root Y and nodes Z, Z_1, Z_2, Z_3, X_1, X_2, ..., X_10, with CPTs such as P(z_3|y) and P(x|z_1, z_2, z_3).]
P(y, x_1, x_2, ..., x_n) = P(y) Π_i P(x_i | Parents(x_i))

23 Variable Elimination Suppose the query is P(X_1). Key Intuition: Move irrelevant terms outside the summation and cache intermediate results.

24 Variable Elimination: Example 1
[Figure: a small network over A, B, C; the sum over A is computed first when answering the query.]
We want to compute P(C). Let's call this f_A(B): A has been (instantiated and) eliminated. What have we saved with this procedure? How many multiplications and additions did we perform?
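A sketch of this elimination, assuming for concreteness the chain A → B → C (the slide's figure is not reproduced in the transcription) and made-up CPTs; the intermediate factor f_A(B) = Σ_A P(A)P(B|A) is computed once and cached.

```python
# Variable elimination on the chain A -> B -> C (made-up CPT numbers).
P_A = {1: 0.3, 0: 0.7}
P_B_given_A = {(1, 1): 0.8, (1, 0): 0.2, (0, 1): 0.4, (0, 0): 0.6}  # key: (a, b)
P_C_given_B = {(1, 1): 0.9, (1, 0): 0.1, (0, 1): 0.3, (0, 0): 0.7}  # key: (b, c)

# Eliminate A: f_A(b) = sum_a P(a) P(b|a).  Computed once, cached as a table over b.
f_A = {b: sum(P_A[a] * P_B_given_A[(a, b)] for a in (0, 1)) for b in (0, 1)}

# Eliminate B: P(c) = sum_b P(c|b) f_A(b).
P_C = {c: sum(P_C_given_B[(b, c)] * f_A[b] for b in (0, 1)) for c in (0, 1)}

print(f_A)   # intermediate factor over B (not itself a probability distribution in general)
print(P_C)   # marginal over C; the two values sum to 1
```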

25 Variable Elimination VE is a sequential procedure. Given an ordering of the variables to eliminate: for each variable v that is not in the query, replace it with a new function f_v; that is, marginalize v out. The actual computation depends on the order. What are the domain and range of f_v? It need not be a probability distribution.

26 Variable Elimination: Example 2
[Figure: the alarm network with CPTs P(E), P(B), P(R|E), P(A|E,B), P(M|A), P(J|A).]
What is P(M, J | B)?

27 Variable Elimination: Example 2 Assumptions (graph; joint representation). It is sufficient to compute the numerator and normalize. Elimination order: R, A, E. To eliminate R:

28 Variable Elimination: Example 2 It is sufficient to compute the numerator and normalize. Elimination order: A, E. To eliminate A:

29 Variable Elimination: Example 2 It is sufficient to compute the numerator and normalize. Finally, eliminate E. Factors.

30 Variable Elimination The order in which variables are eliminated matters. In the previous example, what would happen if we eliminated E first? The size of the factors would be larger. Complexity of Variable Elimination: exponential in the size of the factors. What about the worst case? The worst case is intractable.

31 Inference Exact inference in Bayesian Networks is #P-hard: we can count the number of satisfying assignments for 3-SAT with a Bayesian Network. Approximate inference, e.g. Gibbs sampling. (Skip)

32 Approximate Inference P(x)? Basic idea: if we had access to a set of examples from the joint distribution, we could just count. For inference, we generate instances from the joint and count. How do we generate instances?

33 Generating instances Sampling from the Bayesian Network. Conditional probabilities, that is, P(X|E): only generate instances that are consistent with E. Problems? How many samples? [Law of large numbers] What if the evidence E is a very low-probability event? (Skip)

34 Detour: Markov Chain Review
[Figure: a three-state chain over A, B, C with transition probabilities on the edges (e.g. 0.1).]
Generates a sequence of A, B, C. Defined by initial and transition probabilities: P(X_0) and P(X_{t+1}=i | X_t=j) = P_{ij}, a time-independent transition probability matrix.
Stationary Distributions: A vector q is called a stationary distribution if q_i, the probability of being in state i, satisfies q_i = Σ_j q_j P_{ij}. If we sample from the Markov Chain repeatedly, the distribution over the states converges to the stationary distribution.
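A small sketch with a made-up 3-state transition matrix, showing the state distribution converging to the stationary distribution under repeated transitions.

```python
import numpy as np

# Made-up transition matrix for states A, B, C; T[i, j] = P(X_{t+1}=j | X_t=i),
# so each row sums to 1 (row-stochastic convention).
T = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

q = np.array([1.0, 0.0, 0.0])   # start deterministically in state A
for _ in range(100):            # repeatedly apply the transition matrix
    q = q @ T

print(q)                        # converged distribution over states
print(q @ T)                    # (approximately) unchanged: q is stationary
```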

35 Markov Chain Monte Carlo Our goal: to sample from P(X|e). Overall idea: the next sample is a function of the current sample. The samples can be thought of as coming from a Markov Chain whose stationary distribution is the distribution we want. Can approximate any distribution.

36 Gibbs Sampling The simplest MCMC method to sample from P(X = x_1 x_2 ... x_n | e). Creates a Markov Chain of samples as follows: initialize X randomly; at each time step, fix all random variables except one, and sample that random variable from the corresponding conditional distribution.

37 Gibbs Sampling Algorithm: Initialize X randomly. Iterate: Pick a variable X_i uniformly at random. Sample x_i^(t+1) from P(x_i | x_1^(t), ..., x_{i-1}^(t), x_{i+1}^(t), ..., x_n^(t), e). Set X_k^(t+1) = x_k^(t) for all other k. This is the next sample. X^(1), X^(2), ..., X^(t) forms a Markov Chain. Why is Gibbs Sampling easy for Bayes Nets? P(x_i | x_{-i}^(t), e) is local.

38 Gibbs Sampling: Big picture Given some conditional distribution we wish to compute, collect samples from the Markov Chain. Typically, the chain is allowed to run for some time before collecting samples (the burn-in period), so that the chain settles into the stationary distribution. Using the samples, we approximate the posterior by counting.

39 Gibbs Sampling Example 1
[Figure: a small network over A, B, C.]
We want to compute P(C). Suppose, after burn-in, the Markov Chain is at A=true, B=false, C=false.
1. Pick a variable: B.
2. Draw the new value of B from P(B | A=true, C=false) = P(B | A=true). Suppose B_new = true.
3. Our new sample is A=true, B=true, C=false.
4. Repeat.
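A sketch of this loop, assuming for concreteness a structure in which A is the parent of both B and C (so that, as in step 2 above, P(B | A, C) = P(B | A)); all numbers are made up. The collected samples are used to estimate P(C=true) by counting.

```python
import random

# Made-up CPTs for the assumed structure A -> B, A -> C.
P_A = 0.3                               # P(A=True)
P_B_given_A = {True: 0.8, False: 0.2}   # P(B=True | A)
P_C_given_A = {True: 0.7, False: 0.1}   # P(C=True | A)

def sample_A(b, c):
    # P(A | B, C) ∝ P(A) P(B|A) P(C|A)   (Markov blanket of A)
    w = {}
    for a in (True, False):
        pa = P_A if a else 1 - P_A
        pb = P_B_given_A[a] if b else 1 - P_B_given_A[a]
        pc = P_C_given_A[a] if c else 1 - P_C_given_A[a]
        w[a] = pa * pb * pc
    return random.random() < w[True] / (w[True] + w[False])

a, b, c = True, False, False            # state after burn-in (as on the slide)
count_c, n_samples = 0, 20000
for _ in range(n_samples):
    var = random.choice("ABC")          # pick a variable uniformly at random
    if var == "A":
        a = sample_A(b, c)
    elif var == "B":
        b = random.random() < P_B_given_A[a]   # P(B | A, C) = P(B | A)
    else:
        c = random.random() < P_C_given_A[a]   # P(C | A, B) = P(C | A)
    count_c += c

print(count_c / n_samples)   # ≈ P(C=True) = 0.3*0.7 + 0.7*0.1 = 0.28
```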

40 Gibbs Sampling Example 2
[Figure: the alarm network with CPTs P(E), P(B), P(R|E), P(A|E,B), P(M|A), P(J|A).]
Exercise: P(M, J | B)?

41 Example: Hidden Markov Model
[Figure: the HMM chain Y_1 → Y_2 → ... → Y_6, with each hidden state Y_i emitting an observation X_i; transition probabilities on the Y→Y edges, emission probabilities on the Y→X edges.]
A Bayesian Network with a specific structure. The X's are called the observations and the Y's are the hidden states. Useful for sequence tagging tasks: part of speech, modeling temporal structure, speech recognition, etc.

42 HMM: Computational Problems Probability of an observation given an HMM, P(X | parameters): Dynamic Programming. Finding the best hidden states for a given sequence, P(Y | X, parameters): Dynamic Programming. Learning the parameters from observations: EM.

43 Gibbs Sampling for HMM Goal: Computing P(y|x). Initialize the Y's randomly. Iterate: Pick a random Y_i; draw Y_i from P(Y_i | Y_{i-1}, Y_{i+1}, X_i). Only these variables are needed because they form the Markov blanket of Y_i. Compute the probability using counts after the burn-in period. Gibbs sampling allows us to introduce priors on the emission and transition probabilities.
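A sketch of the Markov-blanket update above for a binary-state HMM with made-up transition and emission tables; the conditional is P(Y_i | Y_{i-1}, Y_{i+1}, X_i) ∝ P(Y_i | Y_{i-1}) P(Y_{i+1} | Y_i) P(X_i | Y_i).

```python
import random

# Made-up binary HMM parameters.
trans = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.6}   # P(Y_{i+1}=y' | Y_i=y)
emit  = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}   # P(X_i=x | Y_i=y)

def gibbs_update(Y, X, i):
    """Resample Y[i] from P(Y_i | Y_{i-1}, Y_{i+1}, X_i), its Markov blanket."""
    w = {}
    for y in (0, 1):
        p = emit[(y, X[i])]
        if i > 0:
            p *= trans[(Y[i - 1], y)]          # incoming transition
        if i < len(Y) - 1:
            p *= trans[(y, Y[i + 1])]          # outgoing transition
        w[y] = p
    Y[i] = int(random.random() < w[1] / (w[0] + w[1]))

X = [1, 1, 0, 0, 1, 0]                         # an observed sequence
Y = [random.randint(0, 1) for _ in X]          # initialize the hidden states randomly
for _ in range(1000):                          # Gibbs sweeps (burn-in handling omitted)
    gibbs_update(Y, X, random.randrange(len(Y)))
print(Y)
```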

44 Bayesian Networks Bayesian Networks: compact representation of probability distributions. Universal: can represent all distributions. Inference: in the worst case, every random variable will be connected to all others, and inference is hard in the worst case. Learning? Exact inference is #P-hard, approximate inference is NP-hard [Roth93,96]. Inference for trees is efficient. General exact inference: Variable Elimination.

45 Tree Dependent Distributions Learning Problem: Given data (n tuples) assumed to be sampled from a tree-dependent distribution (What does that mean? Generative model), find the tree representation of the distribution. (What does that mean?)
[Figure: the same tree rooted at Y, with joint P(y, x_1, ..., x_n) = P(y) Π_i P(x_i | Parents(x_i)).]
Among all trees, find the most likely one, given the data: P(T|D) = P(D|T) P(T)/P(D).

46 Tree Dependent Distributions Learning Problem: Given data (n tuples) assumed to be sampled from a tree-dependent distribution, find the tree representation of the distribution.
[Figure: the same tree rooted at Y.]
Assuming a uniform prior on trees, the Maximum Likelihood approach is to maximize P(D|T):
T_ML = argmax_T P(D|T) = argmax_T Π_{x ∈ D} P_T(x_1, x_2, ..., x_n)
Now we can see why we had to solve the inference problem first; it is required for learning.

47 Tree Dependent Distributions Learning Problem: Given data (n tuples) assumed to be sampled from a tree-dependent distribution, find the tree representation of the distribution.
[Figure: the same tree rooted at Y.]
Assuming a uniform prior on trees, the Maximum Likelihood approach is to maximize P(D|T):
T_ML = argmax_T P(D|T) = argmax_T Π_{x ∈ D} P_T(x_1, x_2, ..., x_n) = argmax_T Π_{x ∈ D} Π_i P_T(x_i | Parents(x_i))
Try this for naïve Bayes.

48 Example: Learning Distributions
Probability Distribution 1: [a full table over x_1, x_2, x_3, x_4; not preserved in the transcription].
Probability Distribution 2: [Figure: a tree with root X_4 (P(x_4)) and children X_1, X_2, X_3 with CPTs P(x_1|x_4), P(x_2|x_4), P(x_3|x_4).]
Probability Distribution 3: [Figure: a tree with root X_4 (P(x_4)), children X_1 and X_2 with CPTs P(x_1|x_4), P(x_2|x_4), and X_3 a child of X_2 with CPT P(x_3|x_2).]
Are these representations of the same distribution? Given a sample, which of these generated it?

49 Example: Learning Distributions
Probability Distribution 1: [the table]. Probability Distributions 2 and 3: [the two trees rooted at X_4, as on the previous slide].
We are given 3 data points: 1011; 1001; 0100. Which one is the target distribution?

50 Example: Learning Distributions Probability Distribution 1: What is the likelihood that this table generated the data?
P(T|D) = P(D|T) P(T)/P(D)
Likelihood(T) ≈ P(D|T) ≈ P(1011|T) P(1001|T) P(0100|T)
P(1011|T) = 0; P(1001|T) = 0.1; P(0100|T) = 0.1
P(Data|Table) = 0
We are given 3 data points: 1011; 1001; 0100. Which one is the target distribution?

51 Example: Learning Distributions Probability Distribution 2: What is the likelihood that the data was sampled from Distribution 2? Need to define it:
P(x_4=1) = 1/2
p(x_1=1|x_4=0) = 1/2, p(x_1=1|x_4=1) = 1/2
p(x_2=1|x_4=0) = 1/3, p(x_2=1|x_4=1) = 1/3
p(x_3=1|x_4=0) = 1/6, p(x_3=1|x_4=1) = 5/6
[Figure: the Distribution 2 tree, root X_4 with children X_1, X_2, X_3.]
Likelihood(T) ≈ P(D|T) ≈ P(1011|T) P(1001|T) P(0100|T)
P(1011|T) = p(x_4=1) p(x_1=1|x_4=1) p(x_2=0|x_4=1) p(x_3=1|x_4=1) = 1/2 · 1/2 · 2/3 · 5/6 = 10/72
P(1001|T) = ... = 1/2 · 1/2 · 2/3 · 5/6 = 10/72
P(0100|T) = ... = 1/2 · 1/2 · 2/3 · 5/6 = 10/72
P(Data|Tree) = 125/(4^3 · 3^6)

52 Example: Learning Distributions Probability Distribution 3: What is the likelihood that the data was sampled from Distribution 3? Need to define it:
P(x_4=1) = 2/3
p(x_1=1|x_4=0) = 1/3, p(x_1=1|x_4=1) = 1
p(x_2=1|x_4=0) = 1, p(x_2=1|x_4=1) = 1/2
p(x_3=1|x_2=0) = 2/3, p(x_3=1|x_2=1) = 1/6
[Figure: the Distribution 3 tree, root X_4 with children X_1 and X_2, and X_3 a child of X_2.]
Likelihood(T) ≈ P(D|T) ≈ P(1011|T) P(1001|T) P(0100|T)
P(1011|T) = p(x_4=1) p(x_1=1|x_4=1) p(x_2=0|x_4=1) p(x_3=1|x_2=0) = 2/3 · 1 · 1/2 · 2/3 = 2/9
P(1001|T) = ... = 1/2 · 1/2 · 2/3 · 1/6 = 1/36
P(0100|T) = ... = 1/2 · 1/2 · 1/3 · 5/6 = 5/72
P(Data|Tree) = 10/23328
Distribution 2 is the most likely distribution to have produced the data.

53 Example: Summary We are now in the same situation we were in when we decided which of two coins, fair (0.5, 0.5) or biased (0.7, 0.3), generated the data. But, this isn't the most interesting case. In general, we will not have a small number of possible distributions to choose from, but rather a parameterized family of distributions. (Analogous to a coin with p ∈ [0,1].) We need a systematic way to search this family of distributions.

54 Example: Summary First, let's make sure we understand what we are after. We have 3 data points that have been generated according to our target distribution: 1011; 1001; 0100. What is the target distribution? We cannot find THE target distribution. What is our goal? As before, we are interested in generalization. Given Data (e.g., the above 3 data points), we would like to know P(1111) or P(11**), P(***0), etc. We could compute it directly from the data, but... Assumptions about the distribution are crucial here.

55 Learning Tree Dependent Distributions Learning Problem:
1. Given data (n tuples) assumed to be sampled from a tree-dependent distribution, find the most probable tree representation of the distribution.
2. Given data (n tuples), find the tree representation that best approximates the distribution (without assuming that the data is sampled from a tree-dependent distribution).
[Figure: the space of all distributions, with the space of all tree distributions inside it; in case 1 the target distribution is a tree, in case 2 we find the tree closest to the target distribution.]

56 Learning Tree Dependent Distributions Learning Problem:
1. Given data (n tuples) assumed to be sampled from a tree-dependent distribution, find the most probable tree representation of the distribution.
2. Given data (n tuples), find the tree representation that best approximates the distribution (without assuming that the data is sampled from a tree-dependent distribution).
The simple-minded algorithm for learning a tree-dependent distribution requires:
(1) for each tree, compute its likelihood L(T) = P(D|T) = Π_{x ∈ D} P_T(x_1, x_2, ..., x_n) = Π_{x ∈ D} Π_i P_T(x_i | Parents(x_i));
(2) find the maximal one.
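A sketch of step (1): scoring one candidate tree by the log-likelihood of the data. The tree structure, CPT numbers, and data below are made up for illustration.

```python
import math

# One candidate tree over binary variables x1..x4 (a star rooted at x4);
# structure, CPT numbers, and data are made up.
parent = {"x4": None, "x1": "x4", "x2": "x4", "x3": "x4"}
prob = {  # prob[v][(value, parent_value)] = P(v = value | parent = parent_value)
    "x4": {(1, None): 0.6, (0, None): 0.4},
    "x1": {(1, 1): 0.7, (0, 1): 0.3, (1, 0): 0.2, (0, 0): 0.8},
    "x2": {(1, 1): 0.5, (0, 1): 0.5, (1, 0): 0.9, (0, 0): 0.1},
    "x3": {(1, 1): 0.4, (0, 1): 0.6, (1, 0): 0.3, (0, 0): 0.7},
}

def log_likelihood(data):
    """log P(D|T) = sum over examples x of sum_i log P_T(x_i | Parents(x_i))."""
    total = 0.0
    for example in data:                       # example: dict variable -> value
        for v, pa in parent.items():
            pa_val = example[pa] if pa is not None else None
            total += math.log(prob[v][(example[v], pa_val)])
    return total

data = [{"x1": 1, "x2": 0, "x3": 1, "x4": 1},
        {"x1": 1, "x2": 0, "x3": 0, "x4": 1},
        {"x1": 0, "x2": 1, "x3": 0, "x4": 0}]
print(log_likelihood(data))   # the naive algorithm would repeat this for every tree
```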

57 1. Distance Measure To measure how well a probability distribution P is approximated by probability distribution T we use here the Kullback-Leibler cross-entropy measure (KL divergence):
D(P, T) = Σ_x P(x) log [P(x)/T(x)]
Non-negative; D(P,T) = 0 iff P and T are identical; non-symmetric. Measures how much P differs from T.
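A tiny sketch of the measure, for two made-up distributions over three outcomes.

```python
import math

def kl(P, T):
    """D(P, T) = sum_x P(x) log(P(x) / T(x)); terms with P(x)=0 contribute 0."""
    return sum(p * math.log(p / t) for p, t in zip(P, T) if p > 0)

P = [0.5, 0.3, 0.2]
T = [0.4, 0.4, 0.2]
print(kl(P, T), kl(T, P))   # non-negative, and not symmetric
print(kl(P, P))             # 0 iff the two distributions are identical
```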

58 2. Ranking Dependencies Intuitively, the important edges to keep in the tree are edges (x---y) for x, y which depend on each other. Given that the distance between distributions is measured using the KL divergence, the corresponding measure of dependence is the mutual information between x and y (measuring the information x gives about y):
I(x, y) = Σ_{x,y} P(x, y) log [P(x, y)/(P(x)P(y))]
which we can estimate with respect to the empirical distribution (that is, the given data).
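A sketch of estimating I(x, y) from the empirical distribution of two binary data columns; the samples are made up.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(x,y) = sum_{x,y} P(x,y) log( P(x,y) / (P(x)P(y)) )."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))            # joint counts
    px, py = Counter(xs), Counter(ys)     # marginal counts
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

xs = [1, 1, 0, 0, 1, 1, 0, 0]
print(mutual_information(xs, xs))                        # perfectly dependent: log 2
print(mutual_information(xs, [1, 0, 1, 0, 1, 0, 1, 0]))  # empirically independent: 0.0
```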

59 Learning Tree Dependent Distributions The algorithm is given m independent measurements from P. For each variable x, estimate P(x) (binary variables: n numbers). For each pair of variables x, y, estimate P(x,y) (O(n^2) numbers). For each pair of variables compute the mutual information. Build a complete undirected graph with all the variables as vertices; let I(x,y) be the weight of the edge (x,y). Build a maximum weighted spanning tree.

60 Spanning Tree Goal: Find a subset of the edges that forms a tree that includes every vertex, where the total weight of all the edges in the tree is maximized. Sort the weights. Start greedily with the largest one. Add the next largest as long as it does not create a loop; in case of a loop, discard this weight and move on to the next weight. This algorithm will create a tree; it is a spanning tree: it touches all the vertices. It is not hard to see that this is the maximum weighted spanning tree. The complexity is O(n^2 log(n)).
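A sketch of this greedy procedure (Kruskal-style, with a union-find structure to detect loops), run on a small made-up weighted graph.

```python
def max_spanning_tree(n_vertices, weighted_edges):
    """Greedy maximum weighted spanning tree: take edges from heaviest to lightest,
    skipping any edge that would create a loop (detected with union-find)."""
    parent = list(range(n_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path compression
            v = parent[v]
        return v

    tree = []
    for w, u, v in sorted(weighted_edges, reverse=True):   # largest weight first
        ru, rv = find(u), find(v)
        if ru != rv:                                       # no loop: keep the edge
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Complete graph on 4 vertices with made-up edge weights (e.g. mutual informations).
edges = [(0.9, 0, 1), (0.2, 0, 2), (0.8, 0, 3), (0.3, 1, 2), (0.7, 1, 3), (0.4, 2, 3)]
print(max_spanning_tree(4, edges))   # 3 edges, total weight maximized
```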

61 Learning Tree Dependent Distributions (1) The algorithm is given m independent measurements from P. For each variable x, estimate P(x) (binary variables: n numbers). For each pair of variables x, y, estimate P(x,y) (O(n^2) numbers). For each pair of variables compute the mutual information. (2) Build a complete undirected graph with all the variables as vertices; let I(x,y) be the weight of the edge (x,y). Build a maximum weighted spanning tree. (3) Transform the resulting undirected tree into a directed tree: choose a root variable and set the direction of all the edges away from it. Place the corresponding conditional probabilities on the edges.

62 Correctness (1) Place the corresponding conditional probabilities on the edges. Given a tree t, defining a probability distribution T by forcing the conditional probabilities along the edges to coincide with those computed from a sample taken from P gives the best tree-dependent approximation to P. Let T be the tree-dependent distribution according to the fixed tree t. Recall:
T(x) = Π_i T(x_i | Parent(x_i)) = Π_i P(x_i | π(x_i))
D(P, T) = Σ_x P(x) log [P(x)/T(x)]

63 Correctness (1) Place the corresponding conditional probabilities on the edges. Given a tree t, defining T by forcing the conditional probabilities along the edges to coincide with those computed from a sample taken from P gives the best t-dependent approximation to P.
D(P, T) = Σ_x P(x) log [P(x)/T(x)] = Σ_x P(x) log P(x) - Σ_x P(x) log T(x)
= -H(x) - Σ_x P(x) Σ_{i=1..n} log T(x_i | π(x_i))
When is the second term maximized? That is, how should we define T(x_i | π(x_i))? (Slight abuse of notation at the root.)

64 Correctness (1)
D(P, T) = -H(x) - Σ_x P(x) Σ_{i=1..n} log T(x_i | π(x_i))
= -H(x) - Σ_{i=1..n} Σ_x P(x) log T(x_i | π(x_i))        (definition of expectation)
= -H(x) - Σ_{i=1..n} E_P[log T(x_i | π(x_i))]
= -H(x) - Σ_{i=1..n} Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log T(x_i | π(x_i))
The term Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log T(x_i | π(x_i)) takes its maximal value when we set: T(x_i | π(x_i)) = P(x_i | π(x_i)).

65 Correctness (2) Let I(x,y) be the weight of the edge (x,y). Maximizing the sum of the information gains minimizes the distributional distance.
We showed that: D(P, T) = -H(x) - Σ_{i=1..n} Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log P(x_i | π(x_i))
However: log P(x_i | π(x_i)) = log [P(x_i, π(x_i)) / (P(x_i) P(π(x_i)))] + log P(x_i)
so Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log P(x_i | π(x_i)) = I(x_i, π(x_i)) + Σ_{x_i} P(x_i) log P(x_i)
This gives: D(P,T) = -H(x) - Σ_{i=1..n} I(x_i, π(x_i)) - Σ_{i=1..n} Σ_{x_i} P(x_i) log P(x_i)
The 1st and 3rd terms do not depend on the tree structure. Since the distance is non-negative, minimizing it is equivalent to maximizing the sum of the edge weights I(x,y).

66 Correctness (2) Let I(x,y) be the weight of the edge (x,y). Maximizing the sum of the information gains minimizes the distributional distance. We showed that T is the best tree approximation of P if it is chosen to maximize the sum of the edge weights:
D(P,T) = -H(x) - Σ_{i=1..n} I(x_i, π(x_i)) - Σ_{i=1..n} Σ_{x_i} P(x_i) log P(x_i)
The minimization problem is solved without the need to exhaustively consider all possible trees. This was achieved since we transformed the problem of finding the best tree to that of finding the heaviest one, with mutual information on the edges.

67 Correctness (3) Transform the resulting undirected tree into a directed tree. (Choose a root variable and direct all the edges away from it.) What does it mean that you get the same distribution regardless of the chosen root? (Exercise) This algorithm learns the best tree-dependent approximation of a distribution D.
L(T) = P(D|T) = Π_{x ∈ D} Π_i P_T(x_i | Parent(x_i))
Given data, this algorithm finds the tree that maximizes the likelihood of the data. The algorithm is called the Chow-Liu Algorithm. Suggested in 1968 in the context of data compression, and adapted by Pearl to Bayesian Networks. Invented a couple more times, and generalized since then.

68 Example: Learning Tree Dependent Distributions We have 3 data points that have been generated according to the target distribution: 1011; 1001; 0100.
We need to estimate some parameters: P(A=1) = 2/3, P(B=1) = 1/3, P(C=1) = 1/3, P(D=1) = 2/3.
For the values 00, 01, 10, 11 respectively, we have that:
P(A,B) = 0; 1/3; 2/3; 0    P(A,B)/P(A)P(B) = 0; 3; 3/2; 0    I(A,B) ~ 9/2
P(A,C) = 1/3; 0; 1/3; 1/3  P(A,C)/P(A)P(C) = 3/2; 0; 3/4; 3/2  I(A,C) ~ 15/4
P(A,D) = 1/3; 0; 0; 2/3    P(A,D)/P(A)P(D) = 3; 0; 0; 3/2    I(A,D) ~ 9/2
P(B,C) = 1/3; 1/3; 1/3; 0  P(B,C)/P(B)P(C) = 3/4; 3/2; 3/2; 0  I(B,C) ~ 15/4
P(B,D) = 0; 2/3; 1/3; 0    P(B,D)/P(B)P(D) = 0; 3; 3/2; 0    I(B,D) ~ 9/2
P(C,D) = 1/3; 1/3; 0; 1/3  P(C,D)/P(C)P(D) = 3/2; 3/4; 0; 3/2  I(C,D) ~ 15/4
I(x, y) = Σ_{x,y} P(x, y) log [P(x, y)/(P(x)P(y))]
Generate the tree; place probabilities. [Figure: the resulting tree over A, B, C, D.]

69 Learning Tree Dependent Distributions The Chow-Liu algorithm finds the tree that maximizes the likelihood. In particular, if D is a tree-dependent distribution, this algorithm learns D. (What does it mean?) Less is known about how many examples are needed in order for it to converge. (What does that mean?) Notice that we are taking statistics to estimate the probabilities of some events in order to generate the tree. Then, we intend to use it to evaluate the probability of other events. One may ask the question: why do we need this structure? Why can't we answer the query directly from the data? (Almost like making a prediction directly from the data in the badges problem.)
