Representing arbitrary probability distributions. Inference: exact inference; approximate inference


1 Bayesian Learning. So far: What does it mean to be Bayesian? Naïve Bayes: independence assumptions. EM algorithm: learning with hidden variables. Today: representing arbitrary probability distributions; inference (exact inference; approximate inference). HW7 (Learning Representations of Probability Distributions) is released already; you cannot do Problem 2 yet. Due on 5/1; no late hours. Final: May 9. Same style as the mid-term; comprehensive, but with focus on material after the mid-term. Projects: reports and 5 min. videos are due on May 11.

2 Unsupervised Learning. We get as input (n+1)-tuples: (X_1, X_2, ..., X_n, X_{n+1}). There is no notion of a class variable or a label. After seeing a few examples, we would like to know something about the domain: correlations between variables, probability of certain events, etc. We want to learn the most likely model that generated the data.

3 Simple Distributions. In general, the problem is very hard. But, under some assumptions on the distribution, we have shown that we can do it. (Exercise: show it is the most likely distribution.) [Figure: naïve Bayes tree with root y and children x_1, x_2, x_3, ..., x_n; edges labeled P(y), P(x_1 | y), P(x_2 | y), ..., P(x_n | y).] Assumptions (conditional independence given y): P(x_i | x_j, y) = P(x_i | y) for all i, j. Can these assumptions be relaxed? Can we learn more general probability distributions? (These are essential in many applications: language, vision.)

4 Simple Distributions. [Figure: naïve Bayes tree with root y and children x_1, ..., x_n.] Under the assumption P(x_i | x_j, y) = P(x_i | y) for all i, j, we can compute the joint probability distribution over the n+1 variables: P(y, x_1, x_2, ..., x_n) = P(y) ∏_{i=1..n} P(x_i | y). Therefore, we can compute the probability of any event: P(x_1=0, x_2=0, y=1) = Σ_{b_i ∈ {0,1}} P(y=1, x_1=0, x_2=0, x_3=b_3, x_4=b_4, ..., x_n=b_n). More efficiently: P(x_1=0, x_2=0, y=1) = P(x_1=0 | y=1) P(x_2=0 | y=1) P(y=1). We can compute the probability of any event or conditional event over the n+1 variables.
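To make the computation above concrete, here is a minimal Python sketch (the prior and CPT values below are made-up illustrative numbers, not the lecture's) that evaluates P(x_1=0, x_2=0, y=1) both by summing the joint over the remaining variable and by the shortcut product:

```python
# Minimal sketch of the naive Bayes joint computation above.
# All probability values are made-up illustrative numbers.
p_y = {1: 0.4, 0: 0.6}                              # prior P(y)
p_x_given_y = {1: [0.7, 0.2, 0.5], 0: [0.1, 0.6, 0.3]}   # P(x_i = 1 | y), three features

def joint(y, xs):
    """P(y, x_1, ..., x_n) = P(y) * prod_i P(x_i | y)."""
    prob = p_y[y]
    for i, x in enumerate(xs):
        p1 = p_x_given_y[y][i]
        prob *= p1 if x == 1 else (1 - p1)
    return prob

# Brute force: sum the joint over the unconstrained variable x_3.
brute = sum(joint(1, (0, 0, b3)) for b3 in (0, 1))

# Shortcut: marginalize analytically, P(x1=0 | y=1) P(x2=0 | y=1) P(y=1).
shortcut = (1 - p_x_given_y[1][0]) * (1 - p_x_given_y[1][1]) * p_y[1]

print(brute, shortcut)   # the two numbers agree
```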

5 Representing Probability Distributions. Goal: to represent all joint probability distributions over a set of random variables X_1, X_2, ..., X_n. There are many ways to represent distributions. A table, listing the probability of each instance in {0,1}^n: we will need 2^n - 1 numbers. What can we do? Make independence assumptions. Multi-linear polynomials: polynomials over the variables (Samdani & Roth '09, Poon & Domingos '11). Bayesian Networks: directed acyclic graphs. Markov Networks: undirected graphs.

6 Unsupervised Learning (recap). We can compute the probability of any event or conditional event over the n+1 variables. In general, the problem is very hard. But, under some assumptions on the distribution, we have shown that we can do it. (Exercise: show it is the most likely distribution.) [Figure: naïve Bayes tree with root y and children x_1, ..., x_n.] Assumptions (conditional independence given y): P(x_i | x_j, y) = P(x_i | y) for all i, j. Can these assumptions be relaxed? Can we learn more general probability distributions? (These are essential in many applications: language, vision.)

7 Graphical Models of Probability Distributions. Bayesian Networks represent the joint probability distribution over a set of variables. Independence assumption: for all x_i, x_i is independent of its non-descendants given its parents. With these conventions, the joint probability distribution is given by P(y, x_1, x_2, ..., x_n) = P(y) ∏_i P(x_i | Parents(x_i)). This is a theorem; to prove it, order the nodes from the leaves up and use the product rule. The terms P(x_i | Parents(x_i)) are called CPTs (Conditional Probability Tables), and they completely define the probability distribution. [Figure: network with nodes Y, Z, Z_1, Z_2, Z_3, X_10, X_i, X_2; z is a parent of x; x is a descendant of y.]

8 Bayesian Network: Semantics of the DAG. Nodes are random variables. Edges represent causal influences. Each node is associated with a conditional probability distribution. Two equivalent viewpoints: a data structure that represents the joint distribution compactly, and a representation of a set of conditional independence assumptions about a distribution.

9 Bayesian Network: Example. The burglar alarm in your house rings when there is a burglary or an earthquake. An earthquake will be reported on the radio. If the alarm rings and your neighbors hear it, they will call you. What are the random variables?

10 Bayesian Network: Example. [Figure: Earthquake and Burglary are parents of Alarm; Earthquake is a parent of Radio; Alarm is a parent of Mary Calls and John Calls.] If there is an earthquake, you'll probably hear about it on the radio. An alarm can ring because of a burglary or an earthquake. If your neighbors hear an alarm, they will call you. How many parameters do we have? How many would we have if we had to store the entire joint?

11 Bayesian Network: Example. With these probabilities (and the assumptions encoded in the graph) we can compute the probability of any event over these variables. [Figure: CPTs P(E), P(B), P(R | E), P(A | E, B), P(M | A), P(J | A).]
P(E, B, A, R, M, J) = P(E) P(B, A, R, M, J | E)
= P(E) P(B) P(A, R, M, J | E, B)
= P(E) P(B) P(R | E, B) P(M, J, A | E, B)
= P(E) P(B) P(R | E) P(M, J | A, E, B) P(A | E, B)
= P(E) P(B) P(R | E) P(A | E, B) P(M | A) P(J | A)
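A small sketch of how the factored joint above can be evaluated in code; the CPT numbers are hypothetical placeholders, not the lecture's values:

```python
# Sketch: evaluating the joint P(E, B, A, R, M, J) from the factorization above.
# CPT numbers are hypothetical, not the ones used in the lecture.
P_E = 0.01
P_B = 0.02
P_R_given_E = {True: 0.9, False: 0.05}                     # P(R=true | E)
P_A_given_EB = {(True, True): 0.97, (True, False): 0.8,
                (False, True): 0.9, (False, False): 0.01}  # P(A=true | E, B)
P_M_given_A = {True: 0.7, False: 0.05}                     # P(M=true | A)
P_J_given_A = {True: 0.9, False: 0.02}                     # P(J=true | A)

def bern(p_true, value):
    """Probability of a boolean value under a Bernoulli(p_true)."""
    return p_true if value else 1.0 - p_true

def joint(e, b, a, r, m, j):
    """P(E) P(B) P(R|E) P(A|E,B) P(M|A) P(J|A)."""
    return (bern(P_E, e) * bern(P_B, b) * bern(P_R_given_E[e], r)
            * bern(P_A_given_EB[(e, b)], a)
            * bern(P_M_given_A[a], m) * bern(P_J_given_A[a], j))

print(joint(e=False, b=True, a=True, r=False, m=True, j=True))
```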

12 Computational Problems. Learning the structure of the Bayes net (what would be the guiding principle?). Learning the parameters: supervised? unsupervised? Inference: computing the probability of an event [#P-complete; Roth '93, '96]: given structure and parameters, and given an observation E, what is the probability of Y? P(Y=y | E=e) (E, Y are sets of instantiated variables). Most likely explanation (Maximum A Posteriori assignment, MAP, MPE) [NP-hard; Shimony '94]: given structure and parameters, and given an observation E, what is the most likely assignment to Y? argmax_y P(Y=y | E=e) (E, Y are sets of instantiated variables).

13 Inference. Inference in Bayesian Networks is generally intractable in the worst case. Two broad approaches for inference: exact inference (e.g., variable elimination) and approximate inference (e.g., Gibbs sampling).

14 Tree Dependent Distributions. Directed acyclic graph; each node has at most one parent. Independence assumption: x is independent of its non-descendants given its parents (x is independent of other nodes given z; v is independent of w given u). P(y, x_1, x_2, ..., x_n) = P(y) ∏_i P(x_i | Parents(x_i)). Need to know two numbers for each link, P(x | z), and a prior for the root, P(y). [Figure: tree rooted at y with nodes u, v, w, x, z, s, t; edges labeled P(x | z), P(s | y).]

15 Tree Dependent Distributions. This is a generalization of naïve Bayes. Inference problem: given the tree with all the associated probabilities, evaluate the probability of an event, e.g. P(x)?
P(x=1) = P(x=1 | z=1) P(z=1) + P(x=1 | z=0) P(z=0)
Recursively, go up the tree:
P(z=1) = P(z=1 | y=1) P(y=1) + P(z=1 | y=0) P(y=0)
P(z=0) = P(z=0 | y=1) P(y=1) + P(z=0 | y=0) P(y=0)
Now we have everything in terms of the CPTs (conditional probability tables). Linear time algorithm.
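A minimal sketch of this bottom-up recursion, assuming a small hypothetical chain y → z → x with made-up CPT values:

```python
# Sketch of the linear-time computation of a marginal P(x=1) in a tree,
# following the recursion above. Structure and numbers are hypothetical.
# parent[node] = (its parent, P(node=1 | parent=1), P(node=1 | parent=0))
parent = {
    'z': ('y', 0.7, 0.2),
    'x': ('z', 0.9, 0.4),
}
P_root = {'y': 0.6}   # prior at the root, P(y=1)

def marginal(node):
    """Return P(node = 1) by recursing toward the root."""
    if node in P_root:
        return P_root[node]
    par, p_given1, p_given0 = parent[node]
    p_par = marginal(par)                       # P(parent = 1)
    return p_given1 * p_par + p_given0 * (1 - p_par)

print(marginal('x'))   # P(x=1) = P(x=1|z=1)P(z=1) + P(x=1|z=0)P(z=0)
```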

16 Tree Dependent Distributions. This is a generalization of naïve Bayes. Inference problem: given the tree with all the associated probabilities, evaluate the probability of an event, e.g. P(x, y)?
P(x=1, y=0) = P(x=1 | y=0) P(y=0)
Recursively, go up the tree along the path from x to y:
P(x=1 | y=0) = Σ_{z=0,1} P(x=1 | y=0, z) P(z | y=0) = Σ_{z=0,1} P(x=1 | z) P(z | y=0)
Now we have everything in terms of the CPTs (conditional probability tables).

17 Tree Dependent Distributions. This is a generalization of naïve Bayes. Inference problem: given the tree with all the associated probabilities, evaluate the probability of an event, e.g. P(x, u)? (There is no direct path from x to u.)
P(x=1, u=0) = P(x=1 | u=0) P(u=0)
Let y be a common ancestor of x and u (we always have one):
P(x=1 | u=0) = Σ_{y=0,1} P(x=1 | u=0, y) P(y | u=0) = Σ_{y=0,1} P(x=1 | y) P(y | u=0)
Now we have reduced it to cases we have seen.

18 Tree Dependent Distributions. Inference problem: given the tree with all the associated CPTs, we showed that we can evaluate the probability of all events efficiently. There are more efficient algorithms. The idea was to show that inference in this case is a simple application of Bayes rule and probability theory. P(y, x_1, x_2, ..., x_n) = P(y) ∏_i P(x_i | Parents(x_i)). [Figure: tree-structured network with CPTs P(y), P(x | z), P(s | y) on the edges.]

19 Graphical Models of Probability Distributions. For general Bayesian Networks, the learning problem is hard, and the inference problem (given the network, evaluate the probability of a given event) is hard (#P-complete). [Figure: network with CPTs P(y), P(z_3 | y), P(x | z_1, z_2, z_3).] P(y, x_1, x_2, ..., x_n) = P(y) ∏_i P(x_i | Parents(x_i)).

20 Variable Elimination. Suppose the query is P(X_1). Key intuition: move irrelevant terms outside the summation and cache intermediate results.

21 Variable Elimination: Example 1. [Figure: chain A → B → C.] We want to compute P(C).
P(C) = Σ_A Σ_B P(A, B, C) = Σ_A Σ_B P(A) P(B | A) P(C | B)
= Σ_B P(C | B) Σ_A P(A) P(B | A)       (let's call the inner sum f_A(B))
= Σ_B P(C | B) f_A(B)
A has been (instantiated and) eliminated. What have we saved with this procedure? How many multiplications and additions did we perform?
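The same elimination, written out as a short Python sketch with made-up CPTs; f_A(B) is computed once and reused:

```python
# Sketch of Example 1: eliminating A to compute P(C) on the chain A -> B -> C.
# CPTs are hypothetical illustrative numbers.
P_A = {0: 0.3, 1: 0.7}
P_B_given_A = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}   # P_B_given_A[a][b]
P_C_given_B = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # P_C_given_B[b][c]

# f_A(B) = sum_A P(A) P(B | A): eliminate A once, cache the result.
f_A = {b: sum(P_A[a] * P_B_given_A[a][b] for a in (0, 1)) for b in (0, 1)}

# P(C) = sum_B P(C | B) f_A(B)
P_C = {c: sum(P_C_given_B[b][c] * f_A[b] for b in (0, 1)) for c in (0, 1)}
print(P_C)   # far fewer multiplications than summing the full joint over A and B
```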

22 Variable Elimination. VE is a sequential procedure. Given an ordering of the variables to eliminate: for each variable v that is not in the query, replace it with a new function f_v, that is, marginalize v out. The actual computation depends on the order. What is the domain and range of f_v? It need not be a probability distribution.

23 Variable Elimination: Example 2. [Figure: alarm network with CPTs P(E), P(B), P(R | E), P(A | E, B), P(M | A), P(J | A).] What is P(M, J | B)?
P(E, B, A, R, M, J) = P(E | B, A, R, M, J) P(B, A, R, M, J) = P(E) P(B) P(R | E) P(A | E, B) P(M | A) P(J | A)

24 Variable Elimination: Example 2.
P(E, B, A, R, M, J) = P(E) P(B) P(R | E) P(A | E, B) P(M | A) P(J | A)    (assumptions: graph; joint representation)
P(M, J | B=true) = P(M, J, B=true) / P(B=true)
It is sufficient to compute the numerator and normalize.
P(M, J, B=true) = Σ_{E,A,R} P(E, B=true, A, R, M, J) = Σ_{E,A,R} P(E) P(B=true) P(R | E) P(A | E, B=true) P(M | A) P(J | A)
Elimination order: R, A, E. To eliminate R: f_R(E) = Σ_R P(R | E).

25 Variable Elimination: Example 2.
P(E, B, A, R, M, J) = P(E) P(B) P(R | E) P(A | E, B) P(M | A) P(J | A)
P(M, J | B=true) = P(M, J, B=true) / P(B=true)
It is sufficient to compute the numerator and normalize.
P(M, J, B=true) = Σ_{E,A,R} P(E, B=true, A, R, M, J) = Σ_{E,A} P(E) P(B=true) P(A | E, B=true) P(M | A) P(J | A) f_R(E)
where f_R(E) = Σ_R P(R | E). Remaining elimination order: A, E. To eliminate A: f_A(E, M, J) = Σ_A P(A | E, B=true) P(M | A) P(J | A).

26 Variable Elimination: Example 2.
P(E, B, A, R, M, J) = P(E) P(B) P(R | E) P(A | E, B) P(M | A) P(J | A)
P(M, J | B=true) = P(M, J, B=true) / P(B=true)
It is sufficient to compute the numerator and normalize.
P(M, J, B=true) = Σ_{E,A,R} P(E, B=true, A, R, M, J) = Σ_E P(E) P(B=true) f_A(E, M, J) f_R(E)
where the factors are f_R(E) = Σ_R P(R | E) and f_A(E, M, J) = Σ_A P(A | E, B=true) P(M | A) P(J | A). Finally, eliminate E.
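Putting the three elimination steps together, here is a sketch of the whole query P(M, J | B=true) with elimination order R, A, E; the CPT numbers are hypothetical, and only the structure of the computation follows the slides:

```python
# Sketch of Example 2: computing P(M, J | B=true) with elimination order R, A, E.
vals = (False, True)
P_E = 0.01; P_B = 0.02
P_R_given_E = {True: 0.9, False: 0.05}
P_A_given_EB = {(True, True): 0.97, (True, False): 0.8,
                (False, True): 0.9, (False, False): 0.01}
P_M_given_A = {True: 0.7, False: 0.05}
P_J_given_A = {True: 0.9, False: 0.02}

def bern(p, v): return p if v else 1.0 - p

# Eliminate R: f_R(E) = sum_R P(R | E)  (equals 1, but kept for the structure)
f_R = {e: sum(bern(P_R_given_E[e], r) for r in vals) for e in vals}

# Eliminate A: f_A(E, M, J) = sum_A P(A | E, B=true) P(M | A) P(J | A)
f_A = {(e, m, j): sum(bern(P_A_given_EB[(e, True)], a)
                      * bern(P_M_given_A[a], m) * bern(P_J_given_A[a], j)
                      for a in vals)
       for e in vals for m in vals for j in vals}

# Eliminate E: numerator P(M, J, B=true) = sum_E P(E) P(B=true) f_A(E,M,J) f_R(E)
numer = {(m, j): sum(bern(P_E, e) * P_B * f_A[(e, m, j)] * f_R[e] for e in vals)
         for m in vals for j in vals}

# Normalize the numerator to get P(M, J | B=true)
Z = sum(numer.values())
posterior = {k: v / Z for k, v in numer.items()}
print(posterior[(True, True)])
```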

27 Variable Elimination. The order in which variables are eliminated matters. In the previous example, what would happen if we eliminated E first? The sizes of the factors would be larger. Complexity of variable elimination: exponential in the size of the factors. What about the worst case? The worst case is intractable.

28 Inference. Exact inference in Bayesian Networks is #P-hard: we can count the number of satisfying assignments of a 3-SAT formula with a Bayesian Network. Approximate inference, e.g. Gibbs sampling.

29 Approximate Inference. P(x)? Basic idea: if we had access to a set of examples from the joint distribution, we could just count: E[f(x)] ≈ (1/N) Σ_{i=1..N} f(x^(i)). For inference, we generate instances from the joint and count. How do we generate instances?

30 Generating Instances. Sampling from the Bayesian Network. Conditional probabilities, that is, P(X | E): only generate instances that are consistent with E. Problems? How many samples? [Law of large numbers.] What if the evidence E is a very low probability event?

31 Detour: Markov Chain Review. [Figure: three-state chain over A, B, C with transition probabilities such as 0.1.] Generates a sequence of A, B, C. Defined by initial and transition probabilities, P(X_0) and P(X_{t+1}=i | X_t=j). P_ij: time-independent transition probability matrix. Stationary distributions: a vector q is called a stationary distribution if q_j = Σ_i q_i P_ij, where q_i is the probability of being in state i. If we sample from the Markov chain repeatedly, the distribution over the states converges to the stationary distribution.
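A quick sketch of the stationary-distribution property, using a made-up 3-state transition matrix and power iteration (repeatedly applying q ← qP):

```python
# Sketch: finding the stationary distribution q of a small Markov chain
# (q_j = sum_i q_i P_ij) by repeated multiplication. Transition numbers are made up.
import numpy as np

# Rows: current state, columns: next state, over states A, B, C.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

q = np.array([1.0, 0.0, 0.0])        # arbitrary initial distribution
for _ in range(1000):                # power iteration: q <- q P
    q = q @ P

print(q)                             # approximately satisfies q = q P
print(q @ P)                         # unchanged by one more step
```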

32 Markov Chain Monte Carlo. Our goal: to sample from P(X | e). Overall idea: the next sample is a function of the current sample. The samples can be thought of as coming from a Markov chain whose stationary distribution is the distribution we want. Can approximate any distribution.

33 Gibbs Sampling. The simplest MCMC method to sample from P(X = x_1 x_2 ... x_n | e). Creates a Markov chain of samples as follows: initialize X randomly; at each time step, fix all random variables except one, and sample that random variable from the corresponding conditional distribution.

34 Gibbs Sampling. Algorithm: initialize X randomly. Iterate: pick a variable X_i uniformly at random; sample x_i^(t+1) from P(x_i | x_1^(t), ..., x_{i-1}^(t), x_{i+1}^(t), ..., x_n^(t), e); set X_k^(t+1) = x_k^(t) for all other k. This is the next sample. X^(1), X^(2), ..., X^(t) forms a Markov chain. Why is Gibbs sampling easy for Bayes nets? P(x_i | x_{-i}^(t), e) is local.
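A sketch of this loop on the small chain A → B → C from Example 1, estimating P(A | C=true); the CPT values are made up, and only the non-evidence variables are resampled:

```python
# Sketch of the Gibbs sampling loop above for the chain A -> B -> C.
import random

P_A = 0.3
P_B_given_A = {True: 0.6, False: 0.2}     # P(B=true | A)
P_C_given_B = {True: 0.5, False: 0.1}     # P(C=true | B)

def bern(p, v): return p if v else 1.0 - p

def sample_conditional(var, state):
    """Sample `var` from P(var | all other variables), which is local:
    proportional to the product of the factors that mention `var`."""
    weights = {}
    for v in (True, False):
        s = dict(state, **{var: v})
        weights[v] = (bern(P_A, s['A']) * bern(P_B_given_A[s['A']], s['B'])
                      * bern(P_C_given_B[s['B']], s['C']))
    z = weights[True] + weights[False]
    return random.random() < weights[True] / z

state = {'A': True, 'B': False, 'C': True}     # evidence: C = true (kept fixed)
count_A, n = 0, 0
for t in range(20000):
    var = random.choice(['A', 'B'])            # resample only non-evidence variables
    state[var] = sample_conditional(var, state)
    if t > 2000:                               # discard the burn-in period
        n += 1
        count_A += state['A']

print(count_A / n)                             # estimate of P(A=true | C=true)
```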

35 Gibbs Sampling: Big Picture. Given some conditional distribution we wish to compute, collect samples from the Markov chain. Typically, the chain is allowed to run for some time before collecting samples (the burn-in period), so that the chain settles into the stationary distribution. Using the samples, we approximate the posterior by counting.

36 Gibbs Sampling Example 1. [Figure: chain A → B → C.] We want to compute P(C). Suppose, after burn-in, the Markov chain is at A=true, B=false, C=false. 1. Pick a variable: B. 2. Draw the new value of B from P(B | A=true, C=false), which is proportional to P(B | A=true) P(C=false | B) (B's Markov blanket in this chain is {A, C}). Suppose B_new = true. 3. Our new sample is A=true, B=true, C=false. 4. Repeat.

37 Gibbs Sampling Example 2. [Figure: alarm network with CPTs P(E), P(B), P(R | E), P(A | E, B), P(M | A), P(J | A).] Exercise: P(M, J | B)?

38 Example: Hidden Markov Model. [Figure: hidden states Y_1, ..., Y_6 form a chain (transition probabilities); each Y_i emits an observation X_i (emission probabilities).] A Bayesian Network with a specific structure. The X's are called the observations and the Y's are the hidden states. Useful for sequence tagging tasks: part-of-speech tagging, modeling temporal structure, speech recognition, etc.

39 HMM: Computational Problems. Probability of an observation given an HMM, P(X | parameters): dynamic programming. Finding the best hidden states for a given sequence, P(Y | X, parameters): dynamic programming. Learning the parameters from observations: EM.

40 Gibbs Sampling for HMM. Goal: computing P(y | x). Initialize the Y's randomly. Iterate: pick a random Y_i; draw Y_i^(t) from P(Y_i | Y_{i-1}, Y_{i+1}, X_i). Only these variables are needed because they form the Markov blanket of Y_i. Compute the probability using counts after the burn-in period. Gibbs sampling allows us to introduce priors on the emission and transition probabilities.
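A sketch of this sampler for a toy HMM; the initial, transition, and emission probabilities and the observed sequence are all made-up illustrative values:

```python
# Sketch of the HMM Gibbs sampler above: resample one hidden state Y_i at a time
# from P(Y_i | Y_{i-1}, Y_{i+1}, X_i), its Markov blanket.
import random

states = [0, 1]
init = [0.5, 0.5]                              # P(Y_1)
trans = [[0.8, 0.2], [0.3, 0.7]]               # trans[prev][cur] = P(Y_t | Y_{t-1})
emit = [[0.9, 0.1], [0.2, 0.8]]                # emit[y][x]   = P(X_t | Y_t)

X = [0, 0, 1, 1, 0, 1]                         # observed sequence
Y = [random.choice(states) for _ in X]         # random initialization

def resample(i):
    """Draw Y[i] with probability proportional to
    P(Y_i | Y_{i-1}) P(Y_{i+1} | Y_i) P(X_i | Y_i)."""
    weights = []
    for y in states:
        w = init[y] if i == 0 else trans[Y[i - 1]][y]
        if i + 1 < len(Y):
            w *= trans[y][Y[i + 1]]
        w *= emit[y][X[i]]
        weights.append(w)
    r = random.random() * sum(weights)
    return 0 if r < weights[0] else 1

counts = [[0, 0] for _ in X]
for t in range(20000):
    i = random.randrange(len(Y))
    Y[i] = resample(i)
    if t > 2000:                               # after burn-in, count states
        for j, y in enumerate(Y):
            counts[j][y] += 1

print([c[1] / sum(c) for c in counts])         # estimates of P(Y_j = 1 | X)
```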

41 Bayesian Networks. Compact representation of probability distributions. Universal: can represent all distributions, but in the worst case every random variable will be connected to all the others. Inference: hard in the worst case; exact inference is #P-hard, approximate inference is NP-hard [Roth '93, '96]. Inference for trees is efficient. General exact inference: variable elimination. Learning?

42 Tree Dependent Distributions. Learning problem: given data (n tuples) assumed to be sampled from a tree-dependent distribution (what does that mean?), find the tree representation of the distribution (what does that mean?). Generative model: P(y, x_1, x_2, ..., x_n) = P(y) ∏_i P(x_i | Parents(x_i)). Among all trees, find the most likely one given the data: P(T | D) = P(D | T) P(T) / P(D). [Figure: tree-structured network with CPTs P(y), P(x | z), P(s | y) on the edges.]

43 Tree Dependent Distributions. Learning problem: given data (n tuples) assumed to be sampled from a tree-dependent distribution, find the tree representation of the distribution. Assuming a uniform prior on trees, the Maximum Likelihood approach is to maximize P(D | T): T_ML = argmax_T P(D | T) = argmax_T ∏_{x ∈ D} P_T(x_1, x_2, ..., x_n). Now we can see why we had to solve the inference problem first; it is required for learning.

44 Tree Dependent Distributions. Learning problem: given data (n tuples) assumed to be sampled from a tree-dependent distribution, find the tree representation of the distribution. Assuming a uniform prior on trees, the Maximum Likelihood approach is to maximize P(D | T): T_ML = argmax_T P(D | T) = argmax_T ∏_{x ∈ D} P_T(x_1, x_2, ..., x_n) = argmax_T ∏_{x ∈ D} ∏_i P_T(x_i | Parents(x_i)). Try this for naïve Bayes.

45 Example: Learning Distributions. Probability Distribution 1: a full table over x_1 x_2 x_3 x_4. Probability Distribution 2: a tree with root X_4 and children X_1, X_2, X_3 (CPTs P(x_4), P(x_1 | x_4), P(x_2 | x_4), P(x_3 | x_4)). Probability Distribution 3: a tree with root X_4, children X_1 and X_2, and X_3 a child of X_2 (CPTs P(x_4), P(x_1 | x_4), P(x_2 | x_4), P(x_3 | x_2)). Are these representations of the same distribution? Given a sample, which of these generated it?

46 Example: Learning Distributions. (Probability Distribution 1: the full table; Probability Distribution 2: the tree with root X_4 and children X_1, X_2, X_3; Probability Distribution 3: the tree X_4 → X_1, X_4 → X_2 → X_3.) We are given 3 data points: 1011; 1001; 0100. Which one is the target distribution?

47 Example: Learning Distributions. Probability Distribution 1: what is the likelihood that this table generated the data? P(T | D) = P(D | T) P(T) / P(D). Likelihood(T) ≈ P(D | T) ≈ P(1011 | T) P(1001 | T) P(0100 | T), with P(1011 | T) = 0, P(1001 | T) = 0.1, P(0100 | T) = 0.1, so P(Data | Table) = 0. We are given 3 data points: 1011; 1001; 0100. Which one is the target distribution?

48 Example: Learning Distributions. Probability Distribution 2: what is the likelihood that the data was sampled from Distribution 2? We need to define it: P(x_4=1) = 1/2; P(x_1=1 | x_4=0) = 1/2, P(x_1=1 | x_4=1) = 1/2; P(x_2=1 | x_4=0) = 1/3, P(x_2=1 | x_4=1) = 1/3; P(x_3=1 | x_4=0) = 1/6, P(x_3=1 | x_4=1) = 5/6.
Likelihood(T) ≈ P(D | T) ≈ P(1011 | T) P(1001 | T) P(0100 | T)
P(1011 | T) = P(x_4=1) P(x_1=1 | x_4=1) P(x_2=0 | x_4=1) P(x_3=1 | x_4=1) = 1/2 · 1/2 · 2/3 · 5/6 = 10/72
P(1001 | T) = 1/2 · 1/2 · 2/3 · 5/6 = 10/72
P(0100 | T) = 1/2 · 1/2 · 2/3 · 5/6 = 10/72
P(Data | Tree) = 125 / (4^3 · 3^6)
[Figure: tree with root X_4 and children X_1, X_2, X_3.]

49 Example: Learning Distributions. Probability Distribution 3: what is the likelihood that the data was sampled from Distribution 3? We need to define it: P(x_4=1) = 2/3; P(x_1=1 | x_4=0) = 1/3, P(x_1=1 | x_4=1) = 1; P(x_2=1 | x_4=0) = 1, P(x_2=1 | x_4=1) = 1/2; P(x_3=1 | x_2=0) = 2/3, P(x_3=1 | x_2=1) = 1/6.
Likelihood(T) ≈ P(D | T) ≈ P(1011 | T) P(1001 | T) P(0100 | T)
P(1011 | T) = P(x_4=1) P(x_1=1 | x_4=1) P(x_2=0 | x_4=1) P(x_3=1 | x_2=0) = 2/3 · 1 · 1/2 · 2/3 = 2/9
P(1001 | T) = 1/2 · 1/2 · 2/3 · 1/6 = 1/36
P(0100 | T) = 1/2 · 1/2 · 1/3 · 5/6 = 5/72
P(Data | Tree) = 10/…
Distribution 2 is the most likely distribution to have produced the data.
[Figure: tree with root X_4, child X_1, and chain X_4 → X_2 → X_3.]

50 Example: Summary. We are now in the same situation we were in when we decided which of two coins, fair (0.5, 0.5) or biased (0.7, 0.3), generated the data. But this isn't the most interesting case. In general, we will not have a small number of possible distributions to choose from, but rather a parameterized family of distributions (analogous to a coin with p ∈ [0,1]). We need a systematic way to search this family of distributions.

51 Example: Summary. First, let's make sure we understand what we are after. We have 3 data points that have been generated according to our target distribution: 1011; 1001; 0100. What is the target distribution? We cannot find THE target distribution. What is our goal? As before, we are interested in generalization. Given data (e.g., the above 3 data points), we would like to know P(1111) or P(11**), P(***0), etc. We could compute it directly from the data, but... Assumptions about the distribution are crucial here.

52 Learning Tree Dependent Distributions. Learning problem: 1. Given data (n tuples) assumed to be sampled from a tree-dependent distribution, find the most probable tree representation of the distribution. 2. Given data (n tuples), find the tree representation that best approximates the distribution (without assuming that the data is sampled from a tree-dependent distribution). [Figure: the space of all tree distributions sits inside the space of all distributions; in case 1 the target distribution is itself a tree distribution, in case 2 we find the tree closest to the target distribution.]

53 Learning Tree Dependent Distributions. Learning problem: 1. Given data (n tuples) assumed to be sampled from a tree-dependent distribution, find the most probable tree representation of the distribution. 2. Given data (n tuples), find the tree representation that best approximates the distribution (without assuming that the data is sampled from a tree-dependent distribution). The simple-minded algorithm for learning a tree-dependent distribution requires (1) for each tree, computing its likelihood L(T) = P(D | T) = ∏_{x ∈ D} P_T(x_1, x_2, ..., x_n) = ∏_{x ∈ D} ∏_i P_T(x_i | Parents(x_i)), and (2) finding the maximal one.
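As an illustration of step (1), here is a sketch that scores one fixed tree by its (log-)likelihood on a dataset; the tree structure and CPT values are hypothetical:

```python
# Sketch: computing the likelihood L(T) = P(D | T) of a dataset under one fixed
# tree T, as the inner step of the simple-minded algorithm above. Log-space is
# used to avoid underflow.
import math

# Tree over binary variables x1..x4: x4 is the root, x1 and x2 hang off x4, x3 off x2.
parent = {'x1': 'x4', 'x2': 'x4', 'x3': 'x2', 'x4': None}
# P(var = 1 | parent value); the root's entry ignores the parent value.
cpt = {'x4': {None: 0.5},
       'x1': {0: 0.4, 1: 0.9},
       'x2': {0: 0.7, 1: 0.3},
       'x3': {0: 0.2, 1: 0.6}}

def log_likelihood(data):
    """Sum over examples of sum_i log P_T(x_i | Parents(x_i))."""
    total = 0.0
    for example in data:                       # example: dict var -> 0/1
        for var, par in parent.items():
            key = None if par is None else example[par]
            p1 = cpt[var][key]
            p = p1 if example[var] == 1 else 1 - p1
            total += math.log(p)
    return total

data = [{'x1': 1, 'x2': 0, 'x3': 1, 'x4': 1},
        {'x1': 1, 'x2': 0, 'x3': 0, 'x4': 1},
        {'x1': 0, 'x2': 1, 'x3': 0, 'x4': 0}]
print(log_likelihood(data))
```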

54 1. Distance Measure. To measure how well a probability distribution P is approximated by a probability distribution T, we use here the Kullback-Leibler cross-entropy measure (KL-divergence): D(P, T) = Σ_x P(x) log [P(x) / T(x)]. It is non-negative, D(P, T) = 0 iff P and T are identical, and it is not symmetric; it measures how much P differs from T.
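A direct sketch of this definition on two small made-up distributions, which also shows the asymmetry:

```python
# Sketch: the KL divergence D(P, T) = sum_x P(x) log(P(x)/T(x)) for two small
# distributions over the same support (made-up numbers).
import math

P = {0: 0.5, 1: 0.3, 2: 0.2}
T = {0: 0.4, 1: 0.4, 2: 0.2}

def kl(P, T):
    return sum(p * math.log(p / T[x]) for x, p in P.items() if p > 0)

print(kl(P, T))        # >= 0, and 0 iff P == T
print(kl(T, P))        # generally different: the measure is not symmetric
```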

55 2. Ranking Dependencies. Intuitively, the important edges to keep in the tree are edges (x---y) for x, y which depend on each other. Given that the distance between distributions is measured using the KL divergence, the corresponding measure of dependence is the mutual information between x and y (measuring the information x gives about y): I(x, y) = Σ_{x,y} P(x, y) log [P(x, y) / (P(x) P(y))], which we can estimate with respect to the empirical distribution (that is, the given data).
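A sketch of estimating I(x, y) from data by plugging the empirical counts into the definition:

```python
# Sketch: estimating the mutual information I(x, y) of two binary variables from
# data, using empirical counts for P(x), P(y), and P(x, y).
import math
from collections import Counter

def mutual_information(pairs):
    """pairs: list of (x, y) observations; returns I(x, y) in nats."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Example with made-up samples of two correlated bits.
samples = [(0, 0), (0, 0), (1, 1), (1, 1), (1, 0), (0, 1), (1, 1), (0, 0)]
print(mutual_information(samples))
```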

56 Learning Tree Dependent Distributions. The algorithm is given m independent measurements from P. For each variable x, estimate P(x) (for binary variables, n numbers). For each pair of variables x, y, estimate P(x, y) (O(n^2) numbers). For each pair of variables, compute the mutual information. Build a complete undirected graph with all the variables as vertices, and let I(x, y) be the weight of the edge (x, y). Build a maximum weighted spanning tree.

57 Spanning Tree. Goal: find a subset of the edges that forms a tree that includes every vertex, such that the total weight of all the edges in the tree is maximized. Sort the weights. Start greedily with the largest one. Add the next largest as long as it does not create a loop; in case of a loop, discard this weight and move on to the next one. This algorithm will create a tree; it is a spanning tree (it touches all the vertices), and it is not hard to see that it is the maximum weighted spanning tree. The complexity is O(n^2 log n).
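A sketch of this greedy procedure (Kruskal-style, with a union-find forest for the loop check); the edge weights below are made-up stand-ins for mutual information values:

```python
# Sketch of the greedy maximum weighted spanning tree step described above
# (sort edges by weight, add an edge unless it would create a loop).
def max_spanning_tree(vertices, weighted_edges):
    """weighted_edges: list of (weight, u, v). Returns the chosen edges."""
    parent = {v: v for v in vertices}          # union-find forest

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]      # path halving
            v = parent[v]
        return v

    tree = []
    for w, u, v in sorted(weighted_edges, reverse=True):   # largest weight first
        ru, rv = find(u), find(v)
        if ru != rv:                           # no loop: keep the edge
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

edges = [(0.9, 'A', 'B'), (0.2, 'A', 'C'), (0.7, 'B', 'C'), (0.5, 'C', 'D')]
print(max_spanning_tree(['A', 'B', 'C', 'D'], edges))
```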

58 Learning Tree Dependent Distributions. The algorithm is given m independent measurements from P. (1) For each variable x, estimate P(x) (for binary variables, n numbers); for each pair of variables x, y, estimate P(x, y) (O(n^2) numbers); for each pair of variables, compute the mutual information. (2) Build a complete undirected graph with all the variables as vertices, let I(x, y) be the weight of the edge (x, y), and build a maximum weighted spanning tree. (3) Transform the resulting undirected tree into a directed tree: choose a root variable and set the direction of all the edges away from it; place the corresponding conditional probabilities on the edges.

59 Correctness (1). Place the corresponding conditional probabilities on the edges: given a tree t, defining the probability distribution T by forcing the conditional probabilities along the edges to coincide with those computed from a sample taken from P gives the best tree-dependent approximation to P. Let T be the tree-dependent distribution according to the fixed tree t. Recall: T(x) = ∏_i T(x_i | Parent(x_i)), and D(P, T) = Σ_x P(x) log [P(x) / T(x)].

60 Correctness (1). Place the corresponding conditional probabilities on the edges: given a tree t, defining T by forcing the conditional probabilities along the edges to coincide with those computed from a sample taken from P gives the best t-dependent approximation to P.
D(P, T) = Σ_x P(x) log [P(x) / T(x)] = Σ_x P(x) log P(x) − Σ_x P(x) log T(x) = −H(x) − Σ_x P(x) Σ_{i=1..n} log T(x_i | π(x_i))
When is Σ_x P(x) log T(x) maximized? That is, how should we define T(x_i | π(x_i))? (Slight abuse of notation at the root.)

61 Correctness (1). Using the definition of expectation:
D(P, T) = −H(x) − Σ_x P(x) Σ_{i=1..n} log T(x_i | π(x_i))
= −H(x) − Σ_{i=1..n} E_P[ log T(x_i | π(x_i)) ]
= −H(x) − Σ_{i=1..n} Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log T(x_i | π(x_i))
Each term Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log T(x_i | π(x_i)) takes its maximal value when we set T(x_i | π(x_i)) = P(x_i | π(x_i)).

62 Correctness (2). Let I(x, y) be the weight of the edge (x, y): maximizing the sum of the information gains minimizes the distributional distance. We showed that D(P, T) = −H(x) − Σ_{i=1..n} Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log P(x_i | π(x_i)). However,
log P(x_i | π(x_i)) = log [P(x_i, π(x_i)) / (P(x_i) P(π(x_i)))] + log P(x_i)
This gives:
D(P, T) = −H(x) − Σ_{i=1..n} I(x_i, π(x_i)) − Σ_{i=1..n} Σ_{x_i} P(x_i) log P(x_i)
The 1st and 3rd terms do not depend on the tree structure. Since the distance is non-negative, minimizing it is equivalent to maximizing the sum of the edge weights I(x_i, π(x_i)).

63 Correctness (2). Let I(x, y) be the weight of the edge (x, y): maximizing the sum of the information gains minimizes the distributional distance. We showed that T is the best tree approximation of P if it is chosen to maximize the sum of the edge weights:
D(P, T) = −H(x) − Σ_{i=1..n} I(x_i, π(x_i)) − Σ_{i=1..n} Σ_{x_i} P(x_i) log P(x_i)
The minimization problem is solved without the need to exhaustively consider all possible trees. This was achieved by transforming the problem of finding the best tree into that of finding the heaviest one, with mutual information on the edges.

64 Correctness (3). Transform the resulting undirected tree into a directed tree (choose a root variable and direct all the edges away from it). What does it mean that you get the same distribution regardless of the chosen root? (Exercise.) This algorithm learns the best tree-dependent approximation of a distribution D: L(T) = P(D | T) = ∏_{x ∈ D} ∏_i P_T(x_i | Parent(x_i)). Given data, this algorithm finds the tree that maximizes the likelihood of the data. It is called the Chow-Liu algorithm. It was suggested in 1968 in the context of data compression, adapted by Pearl to Bayesian Networks, invented a couple more times, and generalized since then.

65 Example: Learning Tree Dependent Distributions. We have 3 data points that have been generated according to the target distribution: 1011; 1001; 0100. We need to estimate some parameters: P(A=1) = 2/3, P(B=1) = 1/3, P(C=1) = 1/3, P(D=1) = 2/3. For the value pairs 00, 01, 10, 11 respectively, we have:
P(A,B) = 0; 1/3; 2/3; 0.  P(A,B)/(P(A)P(B)) = 0; 3; 3/2; 0.  I(A,B) ~ 9/2
P(A,C) = 1/3; 0; 1/3; 1/3.  P(A,C)/(P(A)P(C)) = 3/2; 0; 3/4; 3/2.  I(A,C) ~ 15/4
P(A,D) = 1/3; 0; 0; 2/3.  P(A,D)/(P(A)P(D)) = 3; 0; 0; 3/2.  I(A,D) ~ 9/2
P(B,C) = 1/3; 1/3; 1/3; 0.  P(B,C)/(P(B)P(C)) = 3/4; 3/2; 3/2; 0.  I(B,C) ~ 15/4
P(B,D) = 0; 2/3; 1/3; 0.  P(B,D)/(P(B)P(D)) = 0; 3; 3/2; 0.  I(B,D) ~ 9/2
P(C,D) = 1/3; 1/3; 0; 1/3.  P(C,D)/(P(C)P(D)) = 3/2; 3/4; 0; 3/2.  I(C,D) ~ 15/4
where I(x, y) = Σ_{x,y} P(x, y) log [P(x, y) / (P(x) P(y))]. Generate the tree; place the probabilities. [Figure: nodes A, B, C, D.]

66 Learning Tree Dependent Distributions. The Chow-Liu algorithm finds the tree that maximizes the likelihood. In particular, if D is a tree-dependent distribution, this algorithm learns D. (What does it mean?) Less is known about how many examples are needed for it to converge. (What does that mean?) Notice that we are taking statistics to estimate the probabilities of some events in order to generate the tree; then we intend to use it to evaluate the probability of other events. One may ask: why do we need this structure? Why can't we answer the query directly from the data? (Almost like making predictions directly from the data in the badges problem.)

CIS 519/419 Applied Machine Learning. www.seas.upenn.edu/~cs519. Dan Roth, danroth@seas.upenn.edu, http://www.cs.upenn.edu/~danroth/, 461C, 3401 Walnut. Slides were created by Dan Roth (for CIS519/419 at Penn or
