Bayesian Networks Structure Learning (cont.)

Koller & Friedman chapters (handed out): Chapter 11 (short); Chapter 12: 12.1, 12.2, 12.3 (covered in the beginning of the semester), 12.4 (learning parameters for BNs); Chapter 13: 13.1, 13.3.1, 13.4.1, 13.4.3 (basic structure learning). Learning BN tutorial (class website): ftp://ftp.research.microsoft.com/pub/tr/tr-95-06.pdf. TAN paper (class website): http://www.cs.huji.ac.il/~nir/abstracts/frgg1.html. Bayesian Networks Structure Learning (cont.), Machine Learning 10-701/15-781, Carlos Guestrin, Carnegie Mellon University, April 3rd, 2006

Learning Bayes nets: the cases depend on known vs. unknown structure and fully observable vs. missing data. From data x^(1) … x^(m) we learn the structure and the CPTs P(X_i | Pa_{X_i}) (the parameters).

Learning the CPTs: for each discrete variable X_i, use the data x^(1) … x^(M) to estimate P̂(X_i = x | Pa_{X_i} = u) = Count(x, u) / Count(u). WHY is this count ratio the right estimate? (It is the maximum likelihood estimate; see the information-theoretic interpretation below.)
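
To make the count-ratio estimate concrete, here is a minimal sketch in Python (not from the lecture; the function name and the toy Flu/Allergy/Sinus data are my own) that estimates one CPT by maximum likelihood:

```python
from collections import Counter

def mle_cpt(samples, child, parents):
    """Maximum-likelihood CPT: P_hat(child = x | parents = u) = Count(x, u) / Count(u)."""
    joint = Counter()    # Count(parent assignment, child value)
    parent = Counter()   # Count(parent assignment)
    for s in samples:
        u = tuple(s[p] for p in parents)
        joint[(u, s[child])] += 1
        parent[u] += 1
    return {(u, x): n / parent[u] for (u, x), n in joint.items()}

# Toy (hypothetical) data for estimating P(Sinus | Flu, Allergy)
data = [
    {"Flu": 1, "Allergy": 0, "Sinus": 1},
    {"Flu": 1, "Allergy": 0, "Sinus": 1},
    {"Flu": 0, "Allergy": 0, "Sinus": 0},
    {"Flu": 0, "Allergy": 1, "Sinus": 1},
]
print(mle_cpt(data, "Sinus", ["Flu", "Allergy"]))
```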

Information-theoretic interpretation of maximum likelihood. Given the structure, the log likelihood of the data decomposes over families (each node together with its parents); see the reconstructed derivation after part 3 below. (Example network: Flu, Allergy → Sinus; Sinus → Headache, Nose.)

Maximum likelihood (ML) for learning BN structure: enumerate possible structures, learn parameters for each using ML, and score each structure. (Example network: Flu, Allergy, Sinus, Headache, Nose; data <x_1^(1), …, x_n^(1)>, …, <x_1^(M), …, x_n^(M)>.)

Information-theoretic interpretation of maximum likelihood 2. Given structure, the log likelihood of the data: grouping identical terms, the sum over the M samples becomes M times a sum over the empirical distribution of each family.

Information-theoretic interpretation of maximum likelihood 3. Given structure, the log likelihood of the data equals M times the sum of empirical mutual informations between each variable and its parents, minus M times the sum of the individual empirical entropies (the entropy term does not depend on the structure).
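
Putting the three steps above together — this is a reconstruction of the standard derivation (using the empirical distribution P̂ over the M samples), not a verbatim copy of the slides:

```latex
\begin{align*}
\log \hat{P}(\mathcal{D} \mid \hat{\theta}_G, G)
  &= \sum_{m=1}^{M} \sum_{i=1}^{n} \log \hat{P}\big(x_i^{(m)} \mid \mathbf{pa}_{X_i}^{(m)}\big) \\
  &= M \sum_{i=1}^{n} \sum_{x_i,\, \mathbf{pa}_{X_i}} \hat{P}(x_i, \mathbf{pa}_{X_i}) \log \hat{P}(x_i \mid \mathbf{pa}_{X_i}) \\
  &= M \sum_{i=1}^{n} \hat{I}(X_i; \mathbf{Pa}_{X_i}) \;-\; M \sum_{i=1}^{n} \hat{H}(X_i).
\end{align*}
```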

Mutual information and independence tests. Independence testing is a statistically difficult task! Intuitive approach: mutual information, Î(X_i, X_j) = Σ_{x_i, x_j} P̂(x_i, x_j) log [ P̂(x_i, x_j) / (P̂(x_i) P̂(x_j)) ]. Mutual information and independence: X_i and X_j are independent if and only if I(X_i, X_j) = 0. Conditional mutual information I(X_i, X_j | X_k) is defined analogously using conditional distributions.
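
A minimal sketch (my own code, not from the slides) of estimating the empirical mutual information of two discrete variables from paired samples:

```python
import math
from collections import Counter

def empirical_mi(xs, ys):
    """Estimate I(X; Y) in nats from paired samples of two discrete variables."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Independent variables give MI = 0; identical variables give H(X).
print(empirical_mi([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
print(empirical_mi([0, 0, 1, 1], [0, 0, 1, 1]))  # log 2 ≈ 0.693
```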

Decomposable score: the log data likelihood (and the structure scores built on it) decomposes into a sum of per-family terms, one for each node together with its parents.

Scoring a tree 1: equivalent trees. In a tree every node has a single parent, and mutual information is symmetric, so all trees with the same undirected skeleton (any choice of root and edge directions) receive the same score.

Scoring a tree 2: similar trees. The score of a tree is M Σ_{edges (i,j)} Î(X_i, X_j) − M Σ_i Ĥ(X_i); the entropy term is the same for every structure, so the best tree is the one that maximizes the sum of edge mutual informations.

Chow-Liu tree learning algorithm 1. For each pair of variables X_i, X_j: compute the empirical distribution P̂(x_i, x_j) = Count(x_i, x_j)/M, and compute the mutual information Î(X_i, X_j) = Σ_{x_i, x_j} P̂(x_i, x_j) log [ P̂(x_i, x_j) / (P̂(x_i) P̂(x_j)) ]. Define a graph with nodes X_1, …, X_n, where edge (i, j) gets weight Î(X_i, X_j).

Chow-Liu tree learning algorithm 2. Optimal tree BN: compute the maximum weight spanning tree. Directions in the BN: pick any node as root; breadth-first search defines the directions.
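
A compact sketch of the whole procedure (my own code, not the lecture's): pairwise empirical mutual information as edge weights, a Prim-style maximum-weight spanning tree, and edges directed away from an arbitrary root, which plays the role of the breadth-first orientation above.

```python
import math
from collections import Counter
import numpy as np

def mutual_info(a, b):
    """Empirical mutual information (in nats) of two discrete sample vectors."""
    n = len(a)
    pab, pa, pb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def chow_liu(data):
    """data: (M, n) array of discrete values. Returns directed tree edges (parent, child)."""
    n = data.shape[1]
    weight = {(i, j): mutual_info(data[:, i], data[:, j])
              for i in range(n) for j in range(i + 1, n)}
    in_tree, edges = {0}, []          # root the tree at variable 0
    while len(in_tree) < n:           # Prim-style maximum-weight spanning tree
        i, j = max(((i, j) for (i, j) in weight if (i in in_tree) ^ (j in in_tree)),
                   key=lambda e: weight[e])
        parent, child = (i, j) if i in in_tree else (j, i)
        edges.append((parent, child)) # edges point away from the root
        in_tree.add(child)
    return edges

rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 500)
x1 = np.where(rng.random(500) < 0.9, x0, 1 - x0)   # noisy copy of x0
x2 = rng.integers(0, 2, 500)                       # independent of the others
print(chow_liu(np.column_stack([x0, x1, x2])))     # e.g. [(0, 1), (0, 2)] or similar
```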

Can we extend Chow-Liu? 1: Tree-augmented naïve Bayes (TAN) [Friedman et al. '97]. The naïve Bayes model overcounts evidence because correlation between features is not considered. Same as Chow-Liu, but score edges with the conditional mutual information given the class variable, Î(X_i, X_j | C).

Can we extend Chow-Liu? 2: (Approximately) learning models with tree-width up to k [Narasimhan & Bilmes '04]. But the cost is O(n^{k+1}).

Scoring general graphical models: the model selection problem. What's the best structure? (Example network: Flu, Allergy, Sinus, Headache, Nose; data <x_1^(1), …, x_n^(1)>, …, <x_1^(M), …, x_n^(M)>.) The more edges, the fewer independence assumptions and the higher the likelihood of the data, but it will overfit.

Maximum likelihood overfits! Information never hurts: Î(X_i; Pa ∪ {Z}) ≥ Î(X_i; Pa), so adding a parent never decreases — and in practice always increases — the likelihood score!!!

Bayesian score avoids overfitting. Given a structure, keep a distribution over parameters and score with the marginal likelihood, log P(D | G) = log ∫ P(D | θ_G, G) P(θ_G | G) dθ_G. Difficult integral: use the Bayesian information criterion (BIC) approximation (equivalent as M → ∞): log P(D | G) ≈ log P(D | θ̂_G, G) − (log M / 2)·Dim(G). Note: the complexity penalty regularizes, and the BIC score is equivalent to an MDL score. Finding the best BN under BIC is still NP-hard.
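
A rough sketch of the BIC score in its decomposable form (my own code and simplifications, assuming complete discrete data; the parameter count uses only the parent configurations that appear in the data):

```python
import math
from collections import Counter
import numpy as np

def family_bic(data, child, parents):
    """BIC contribution of one family: log likelihood minus (log M / 2) * #free params."""
    M = data.shape[0]
    pa_counts = Counter(map(tuple, data[:, parents])) if parents else Counter({(): M})
    joint = Counter((tuple(row[parents]), row[child]) for row in data)
    loglik = sum(c * math.log(c / pa_counts[pa]) for (pa, _), c in joint.items())
    n_params = len(pa_counts) * (len(set(data[:, child])) - 1)
    return loglik - 0.5 * math.log(M) * n_params

def bic_score(data, structure):
    """structure: dict child -> list of parent column indices. Decomposable sum over families."""
    return sum(family_bic(data, c, ps) for c, ps in structure.items())

rng = np.random.default_rng(1)
x0 = rng.integers(0, 2, 200)
x1 = np.where(rng.random(200) < 0.9, x0, 1 - x0)   # depends on x0
data = np.column_stack([x0, x1])
print(bic_score(data, {0: [], 1: [0]}))   # structure with the true edge
print(bic_score(data, {0: [], 1: []}))    # empty graph (independence)
```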

How many graphs are there? Super-exponentially many: over n variables there are at least 2^{n(n−1)/2} possible DAGs.

Structure learning for general graphs. In a tree, a node has only one parent. Theorem: the problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d ≥ 2. Most structure learning approaches use heuristics that exploit score decomposition. (Quickly) describe two heuristics that exploit decomposition in different ways.

Learn BN structure using local search: starting from the Chow-Liu tree, local search with possible moves: add edge, delete edge, invert (reverse) edge. Score candidate structures using BIC.

What you need to know about learning BNs Learning BNs Maximum likelihood or MAP learns parameters Decomposable score Best tree (Chow-Liu) Best TAN Other BNs, usually local search with BIC score

Unsupervised learning or Clustering: K-means, Gaussian mixture models. Machine Learning 10-701/15-781, Carlos Guestrin, Carnegie Mellon University, April 3rd, 2006

Some Data 4

K-means 1. Ask user how many clusters they'd like. (e.g. k=5)

K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations

K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. (Thus each Center owns a set of datapoints)

K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns

K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns 5. …and jumps there 6. Repeat until terminated!
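
The six steps above map directly onto a few lines of NumPy; this is a generic sketch of Lloyd's algorithm (my own code, not the lecture's):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: X is (N, d); returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 2: random guess
    for _ in range(n_iters):
        # step 3: each datapoint finds the closest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # steps 4-5: each center jumps to the centroid of the points it owns
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # step 6: stop when stable
            break
        centers = new_centers
    return centers, assign

X = np.vstack([np.random.randn(50, 2) + [0, 0],
               np.random.randn(50, 2) + [5, 5]])
centers, assign = kmeans(X, k=2)
print(np.round(centers, 2))
```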

Unsupervised Learning. You walk into a bar. A stranger approaches and tells you: "I've got data from k classes. Each class produces observations with a normal distribution and variance σ²I. Standard simple multivariate Gaussian assumptions. I can tell you all the P(w_i)'s." So far, looks straightforward. "I need a maximum likelihood estimate of the µ_i's." No problem. "There's just one thing. None of the data are labeled. I have datapoints, but I don't know what class they're from (any of them!)" Uh oh!!

Gaussian Bayes Classifier Reminder: P(y = i | x) = p(x | y = i) P(y = i) / p(x), where p(x | y = i) = 1 / ( (2π)^{m/2} |Σ_i|^{1/2} ) · exp( −½ (x − µ_i)^T Σ_i^{−1} (x − µ_i) ). How do we deal with that?

Predicting wealth from age 3

Predicting wealth from age 33

Learning modelyear, mpg ---> maker: fit a joint Gaussian with a general covariance matrix Σ (all pairwise covariances σ_{ij} and variances σ_i²).

General: O(m²) parameters. Σ is the full m×m covariance matrix: Σ = ( σ_1², σ_{12}, …, σ_{1m} ; σ_{12}, σ_2², …, σ_{2m} ; ⋮ ; σ_{1m}, σ_{2m}, …, σ_m² ).

Aligned: O(m) parameters. Σ is diagonal (an axis-aligned Gaussian): Σ = diag(σ_1², σ_2², σ_3², …, σ_m²).

Spherical: O(1) cov parameters. Σ = σ²I, the same variance in every direction: Σ = diag(σ², σ², …, σ²).

Next back to Density Estimation What if we want to do density estimation with multimodal or clumpy data? 4

The GMM assumption: There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector µ_i. (Figure: component means µ_1, µ_2, µ_3.)

The GMM assumption: There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector µ_i. Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I. Assume that each datapoint is generated according to the following recipe:

The GMM assumption: There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector µ_i. Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I. Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(y_i).

The GMM assumption: There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector µ_i. Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I. Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(y_i). 2. Datapoint ~ N(µ_i, σ²I).

The General GMM assumption: There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector µ_i. Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i. Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(y_i). 2. Datapoint ~ N(µ_i, Σ_i). (Figure: component means µ_1, µ_2, µ_3.)
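
The two-step recipe above is easy to state as code; a small sketch (my own, with made-up parameters) that samples from a general GMM:

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Generate n points: pick component i with prob weights[i], then draw from N(means[i], covs[i])."""
    rng = np.random.default_rng(seed)
    comps = rng.choice(len(weights), size=n, p=weights)          # step 1: pick a component
    X = np.array([rng.multivariate_normal(means[c], covs[c])     # step 2: Gaussian draw
                  for c in comps])
    return X, comps

weights = [0.5, 0.3, 0.2]                                        # hypothetical P(y_i)'s
means = [np.zeros(2), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.array([[1.0, 0.8], [0.8, 1.0]])]
X, y = sample_gmm(500, weights, means, covs)
print(X.shape, np.bincount(y))
```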

Unsupervised Learning: not as hard as it looks. Sometimes easy, sometimes impossible, and sometimes in between. (In case you're wondering what these diagrams are, they show 2-d unlabeled data (x vectors) distributed in 2-d space. The top one has three very clear Gaussian centers.)

Computing likelihoods in the supervised learning case. We have y_1, x_1, y_2, x_2, …, y_N, x_N. Learn P(y_1), P(y_2), …, P(y_k). Learn σ², µ_1, …, µ_k. By MLE: maximize P(y_1, x_1, y_2, x_2, …, y_N, x_N | µ_1, …, µ_k, σ²).

Computing likelihoods in the unsupervised case. We have x_1, x_2, …, x_N. We know P(y_1), P(y_2), …, P(y_k). We know σ². P(x | y_i, µ_1, …, µ_k) = the probability that an observation from class y_i would have value x given class means µ_1 … µ_k. Can we write an expression for that?

Likelihoods in the unsupervised case. We have x_1, x_2, …, x_n. We have P(y_1), …, P(y_k). We have σ². We can define, for any x, P(x | y_i, µ_1, µ_2, …, µ_k). Can we define P(x | µ_1, µ_2, …, µ_k)? Can we define P(x_1, x_2, …, x_n | µ_1, µ_2, …, µ_k)? [YES, IF WE ASSUME THE x_i'S WERE DRAWN INDEPENDENTLY]
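
Answering the slide's question in code: a minimal sketch (my own names; spherical Gaussians with known σ and known priors, as in the lecture's running example) of the mixture density P(x | µ_1 … µ_k) and the independent-draw log likelihood of a dataset:

```python
import numpy as np

def log_mixture_density(x, means, priors, sigma):
    """log P(x | mu_1..mu_k) = log sum_i P(y_i) N(x; mu_i, sigma^2 I)."""
    x, means = np.atleast_1d(x), np.atleast_2d(means)
    d = means.shape[1]
    sq = ((x - means) ** 2).sum(axis=1)
    log_comp = (np.log(priors) - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
                - sq / (2 * sigma ** 2))
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())   # log-sum-exp for numerical stability

def log_likelihood(X, means, priors, sigma):
    """log P(x_1..x_n | mu's), assuming the x_i's were drawn independently."""
    return sum(log_mixture_density(x, means, priors, sigma) for x in X)

means = np.array([[-2.0], [2.0]])            # hypothetical 1-d example
priors = np.array([1/3, 2/3])
X = np.array([[0.608], [-1.590], [0.235], [3.949], [-0.712]])
print(log_likelihood(X, means, priors, sigma=1.0))
```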

Unsupervised Learning: Mediumly Good News. We now have a procedure such that if you give me a guess at µ_1, µ_2, …, µ_k, I can tell you the probability of the unlabeled data given those µ's. Suppose the x's are 1-dimensional (from Duda and Hart). There are two classes, w_1 and w_2; P(y_1) = 1/3, P(y_2) = 2/3, σ = 1. There are 25 unlabeled datapoints: x_1 = 0.608, x_2 = -1.590, x_3 = 0.235, x_4 = 3.949, …, x_25 = -0.712.

Duda & Hart's Example. We can graph the prob. dist. function of data given our µ_1 and µ_2 estimates. We can also graph the true function from which the data was randomly generated. They are close. Good. The 2nd solution tries to put the 2/3 hump where the 1/3 hump should go, and vice versa. In this example, unsupervised is almost as good as supervised. If the x_1 … x_25 are given the class which was used to learn them, then the results are (µ_1 = -2.176, µ_2 = 1.684). Unsupervised got (µ_1 = -2.13, µ_2 = 1.668).

Duda & Hart's Example. Graph of log P(x_1, x_2, …, x_25 | µ_1, µ_2) against µ_1 and µ_2. Max likelihood = (µ_1 = -2.13, µ_2 = 1.668). There is a local optimum, very close in value to the global one, at (µ_1 = 2.085, µ_2 = -1.257)*. (* corresponds to switching y_1 with y_2.)

Finding the max likelihood µ_1, µ_2, …, µ_k. We can compute P(data | µ_1, µ_2, …, µ_k). How do we find the µ_i's which give max likelihood? The normal max likelihood trick: set ∂/∂µ_i log Prob(…) = 0 and solve for the µ_i's. Here you get non-linear, non-analytically-solvable equations. Use gradient descent: slow but doable. Or use a much faster, cuter, and recently very popular method…

Expectation Maximization

The E.M. Algorithm: DETOUR. We'll get back to unsupervised learning soon. But now we'll look at an even simpler case with hidden information. The EM algorithm can do trivial things, such as the contents of the next few slides; it is an excellent way of doing our unsupervised learning problem, as we'll see; and it has many, many other uses, including inference of Hidden Markov Models (future lecture).

Silly Example. Let events be grades in a class: w_1 = Gets an A, P(A) = ½; w_2 = Gets a B, P(B) = µ; w_3 = Gets a C, P(C) = 2µ; w_4 = Gets a D, P(D) = ½ − 3µ. (Note 0 ≤ µ ≤ 1/6.) Assume we want to estimate µ from data. In a given class there were a A's, b B's, c C's, d D's. What's the maximum likelihood estimate of µ given a, b, c, d?

Trivial Statistics. P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ − 3µ. P(a, b, c, d | µ) = K (½)^a (µ)^b (2µ)^c (½ − 3µ)^d. log P(a, b, c, d | µ) = log K + a log ½ + b log µ + c log 2µ + d log(½ − 3µ). For max like µ, set ∂LogP/∂µ = 0: ∂LogP/∂µ = b/µ + 2c/(2µ) − 3d/(½ − 3µ) = 0. Gives max like µ = (b + c) / (6(b + c + d)). So if the class got 14 A's, 6 B's, 9 C's and 10 D's, the max like µ = 1/10. Boring, but true!
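
A tiny check of the closed-form estimate above (my own snippet, not from the slides):

```python
import math

def mle_mu(b, c, d):
    # Closed form from setting d(log P)/d(mu) = 0.
    return (b + c) / (6 * (b + c + d))

def log_p(mu, a, b, c, d):
    # log P(a,b,c,d | mu), dropping the constant log K.
    return (a * math.log(0.5) + b * math.log(mu)
            + c * math.log(2 * mu) + d * math.log(0.5 - 3 * mu))

a, b, c, d = 14, 6, 9, 10
mu_hat = mle_mu(b, c, d)
print(mu_hat)                                            # 0.1
# The closed form beats nearby values of mu:
print(all(log_p(mu_hat, a, b, c, d) >= log_p(m, a, b, c, d)
          for m in [0.05, 0.08, 0.12, 0.15]))            # True
```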

Same Problem with Hidden Information. Someone tells us that: Number of High grades (A's + B's) = h, Number of C's = c, Number of D's = d. REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ − 3µ. What is the max. like estimate of µ now?

Same Problem with Hidden Information. Someone tells us that: Number of High grades (A's + B's) = h, Number of C's = c, Number of D's = d. What is the max. like estimate of µ now? We can answer this question circularly. EXPECTATION: if we knew the value of µ we could compute the expected values of a and b, a = h · (½)/(½ + µ) and b = h · µ/(½ + µ), since the ratio a : b should be the same as the ratio ½ : µ. MAXIMIZATION: if we knew the expected values of a and b we could compute the maximum likelihood value of µ, µ = (b + c)/(6(b + c + d)). REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ − 3µ.

E.M. for our Trivial Problem. We begin with a guess for µ, then iterate between EXPECTATION and MAXIMIZATION to improve our estimates of µ and of a and b. Define µ(t) = the estimate of µ on the t'th iteration, b(t) = the estimate of b on the t'th iteration; µ(0) = initial guess. E-step: b(t) = µ(t) h / (½ + µ(t)) = E[b | µ(t)]. M-step: µ(t+1) = (b(t) + c) / (6(b(t) + c + d)) = max like est of µ given b(t). Continue iterating until converged. Good news: converging to a local optimum is assured. Bad news: I said local optimum. REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ − 3µ.

E.M. Convergence. The convergence proof is based on the fact that Prob(data | µ) must increase or remain the same between each iteration [NOT OBVIOUS], but it can never exceed 1 [OBVIOUS], so it must therefore converge [OBVIOUS]. In our example, suppose we had h = 20, c = 10, d = 10, µ(0) = 0. Convergence is generally linear: error decreases by a constant factor each time step.
t   µ(t)     b(t)
1   0.0833   2.857
2   0.0937   3.158
3   0.0947   3.185
4   0.0948   3.187
5   0.0948   3.187
6   0.0948   3.187
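
The E-step/M-step recursion above is two lines of code; this sketch (my own) reproduces the convergence table for h = 20, c = 10, d = 10, µ(0) = 0:

```python
def em_grades(h, c, d, mu0=0.0, n_iters=6):
    mu = mu0
    b = h * mu / (0.5 + mu)                     # b(0) = E[b | mu(0)]
    for t in range(1, n_iters + 1):
        mu = (b + c) / (6 * (b + c + d))        # M-step: mu(t) from b(t-1)
        b = h * mu / (0.5 + mu)                 # E-step: b(t) = E[b | mu(t)]
        print(f"t={t}  mu={mu:.4f}  b={b:.3f}")

em_grades(h=20, c=10, d=10)
```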

Back to Unsupervised Learning of GMMs. Remember: we have unlabeled data x_1, x_2, …, x_R; we know there are k classes; we know P(y_1), P(y_2), P(y_3), …, P(y_k); we don't know µ_1, µ_2, …, µ_k. We can write P(data | µ_1 … µ_k) = p(x_1 … x_R | µ_1 … µ_k) = ∏_{i=1}^R p(x_i | µ_1 … µ_k) = ∏_{i=1}^R Σ_{j=1}^k p(x_i | w_j, µ_1 … µ_k) P(y_j) = ∏_{i=1}^R Σ_{j=1}^k K exp( −(1/(2σ²)) (x_i − µ_j)² ) P(y_j).

E.M. for GMMs. For max likelihood we know ∂/∂µ_j log Prob(data | µ_1 … µ_k) = 0. Some wild'n'crazy algebra turns this into: µ_j = [ Σ_{i=1}^R P(y_j | x_i, µ_1 … µ_k) x_i ] / [ Σ_{i=1}^R P(y_j | x_i, µ_1 … µ_k) ]. This is n nonlinear equations in the µ_j's. If, for each x_i, we knew the prob that x_i was in class y_j, i.e. P(y_j | x_i, µ_1 … µ_k), then we could easily compute µ_j. If we knew each µ_j then we could easily compute P(y_j | x_i, µ_1 … µ_k) for each y_j and x_i. I feel an EM experience coming on!! See http://www.cs.cmu.edu/~awm/doc/gmm-algebra.pdf

E.M. for GMMs. Iterate. On the t'th iteration let our estimates be λ_t = { µ_1(t), µ_2(t), …, µ_c(t) }. E-step: compute the expected classes of all datapoints for each class, P(y_i | x_k, λ_t) = p(x_k | y_i, µ_i(t), σ²I) p_i(t) / Σ_j p(x_k | y_j, µ_j(t), σ²I) p_j(t) — just evaluate a Gaussian at x_k. M-step: compute the max like µ given our data's class membership distributions, µ_i(t+1) = Σ_k P(y_i | x_k, λ_t) x_k / Σ_k P(y_i | x_k, λ_t).

E.M. Convergence Your lecturer will (unless out of time) give you a nice intuitive explanation of why this rule works. As with all EM procedures, convergence to a local optimum guaranteed. This algorithm is REALLY USED. And in high dimensional state spaces, too. E.G. Vector Quantization for Speech Data 66

E.M. for General GMMs. Iterate. On the t'th iteration let our estimates be λ_t = { µ_1(t), …, µ_c(t), Σ_1(t), …, Σ_c(t), p_1(t), …, p_c(t) }, where p_i(t) is shorthand for the estimate of P(y_i) on the t'th iteration. E-step: compute the expected classes of all datapoints for each class, P(y_i | x_k, λ_t) = p(x_k | y_i, µ_i(t), Σ_i(t)) p_i(t) / Σ_j p(x_k | y_j, µ_j(t), Σ_j(t)) p_j(t) — just evaluate a Gaussian at x_k. M-step: compute the max like parameters given our data's class membership distributions: µ_i(t+1) = Σ_k P(y_i | x_k, λ_t) x_k / Σ_k P(y_i | x_k, λ_t); Σ_i(t+1) = Σ_k P(y_i | x_k, λ_t) [x_k − µ_i(t+1)][x_k − µ_i(t+1)]^T / Σ_k P(y_i | x_k, λ_t); p_i(t+1) = Σ_k P(y_i | x_k, λ_t) / R, where R = #records.
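
A compact sketch of these E/M updates (my own code, not the lecture's; a production implementation would work in log space and guard more carefully against collapsing components):

```python
import numpy as np

def gauss_pdf(X, mean, cov):
    """Evaluate N(x; mean, cov) at each row of X."""
    d = X.shape[1]
    diff = X - mean
    inv, det = np.linalg.inv(cov), np.linalg.det(cov)
    expo = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * det)

def em_gmm(X, k, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    R, d = X.shape
    mu = X[rng.choice(R, k, replace=False)]                  # initial means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    p = np.full(k, 1.0 / k)                                  # p_i(t): estimate of P(y_i)
    for _ in range(n_iters):
        # E-step: responsibilities P(y_i | x_k, lambda_t)
        dens = np.column_stack([p[i] * gauss_pdf(X, mu[i], Sigma[i]) for i in range(k)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update means, covariances, and mixing weights
        Nk = resp.sum(axis=0)
        mu = (resp.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            Sigma[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
        p = Nk / R
    return mu, Sigma, p

rng = np.random.default_rng(1)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 150),
               rng.multivariate_normal([4, 4], [[1, 0.6], [0.6, 1]], 150)])
mu, Sigma, p = em_gmm(X, k=2)
print(np.round(mu, 2), np.round(p, 2))
```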

Gaussian Mixture Example: Start Advance apologies: in Black and White this example will be incomprehensible 68

After first iteration 69

After 2nd iteration

After 3rd iteration 71

After 4th iteration 7

After 5th iteration 73

After 6th iteration 74

After 20th iteration

Some Bio Assay data 76

GMM clustering of the assay data 77

Resulting Density Estimator 78

Three classes of assay (each learned with its own mixture model)

Resulting Bayes Classifier 8

Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness. Yellow means anomalous; cyan means ambiguous.

Final Comments. Remember, E.M. can get stuck in local minima, and empirically it DOES. Our unsupervised learning example assumed the P(y_i)'s were known, and the variances fixed and known. It is easy to relax this. It's also possible to do Bayesian unsupervised learning instead of max. likelihood.

What you should know How to learn maximum likelihood parameters (locally max. like.) in the case of unlabeled data. Be happy with this kind of probabilistic analysis. Understand the two examples of E.M. given in these notes. 83

Acknowledgements. K-means & Gaussian mixture models presentation derived from an excellent tutorial by Andrew Moore: http://www.autonlab.org/tutorials/ K-means applet: http://www.elet.polimi.it/upload/matteucc/clustering/tutorial_html/appletkm.html Gaussian mixture models applet: http://www.neurosci.aist.go.jp/%7eakaho/mixtureem.html