Bayesian Networks
(aka belief networks, probabilistic networks)
- A BN over variables {X1, X2, ..., Xn} consists of:
  - a DAG whose nodes are the variables
  - a set of CPTs, Pr(Xi | Parents(Xi)), one for each Xi
[Figure: a single node A with its CPT listing P(a) and P(~a)]

Bayesian Networks
(aka belief networks, probabilistic networks)
- Key notions:
  - parents of a node: Par(Xi)
  - children of a node
  - descendants of a node
  - ancestors of a node
  - family: the set of nodes consisting of Xi and its parents
- CPTs are defined over families in the BN
[Figure: a network in which A and B are parents of C and C is a parent of D, shown with the CPT for C, i.e. P(c | a,b), P(~c | a,b), ..., P(~c | ~a,~b). In this network: Parents(C) = {A, B}, Children(A) = {C}, Descendants(B) = {C, D}, Ancestors(D) = {A, B, C}, Family(C) = {C, A, B}]

An Example Bayes Net
- A few CPTs are shown
- The explicit joint would require 2^11 - 1 = 2047 parameters
- The BN requires only 27 parameters (the number of entries in each CPT is listed)

Semantics of a Bayes Net
- The structure of the BN means that every Xi is conditionally independent of all of its nondescendants given its parents:
  Pr(Xi | S, Par(Xi)) = Pr(Xi | Par(Xi)) for any subset S ⊆ NonDescendants(Xi)

Semantics of Bayes Nets
- If we ask for P(x1, x2, ..., xn), then, assuming an ordering consistent with the network, the chain rule gives:
  P(x1, x2, ..., xn) = P(xn | xn-1, ..., x1) P(xn-1 | xn-2, ..., x1) ... P(x1)
                     = P(xn | Par(xn)) P(xn-1 | Par(xn-1)) ... P(x1)
- Thus the joint is recoverable from the parameters (CPTs) specified in an arbitrary BN

Bayes net example I
- Naïve Bayes model
- "Naïve" because the words are assumed not to depend on each other
- Joint = Pr(T) Π_i Pr(Wi | T)
[Figure: class node T with CPT Pr(T); children W1, W2, W3, W4, each with CPT Pr(Wi | T)]
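The factorization above maps directly to code. Below is a minimal sketch, not taken from the lecture: it multiplies CPT entries for a complete assignment, in an order consistent with the network. The dictionary encoding of the CPTs and the tiny naïve Bayes net used to exercise it are illustrative assumptions.

```python
# Minimal sketch: P(x1, ..., xn) = prod_i P(xi | Par(xi)), computed from stored CPTs.
def joint_prob(assignment, parents, cpts):
    """assignment: {var: value} (must be complete); parents: {var: [parent vars]};
    cpts: {var: {(tuple of parent values, value): probability}}."""
    p = 1.0
    for var, value in assignment.items():
        parent_vals = tuple(assignment[p_] for p_ in parents[var])
        p *= cpts[var][(parent_vals, value)]
    return p

# Illustrative naïve Bayes net: T -> W1, W2, so Joint = Pr(T) * Pr(W1 | T) * Pr(W2 | T).
parents = {"T": [], "W1": ["T"], "W2": ["T"]}
cpts = {
    "T":  {((), True): 0.4, ((), False): 0.6},
    "W1": {((True,), True): 0.7, ((True,), False): 0.3,
           ((False,), True): 0.1, ((False,), False): 0.9},
    "W2": {((True,), True): 0.2, ((True,), False): 0.8,
           ((False,), True): 0.5, ((False,), False): 0.5},
}
print(joint_prob({"T": True, "W1": True, "W2": False}, parents, cpts))  # 0.4*0.7*0.8 = 0.224
```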
Bayes net example II
- Tree-augmented naïve Bayes classifier
- Allow arcs between the leaves as long as they form a tree
- E.g., augment the naïve Bayes model with a bigram model
[Figure: class node T with CPT Pr(T); words W1, W2, W3, W4 with CPTs Pr(Wi | Wi-1, T)]

Bayes net example III
- Hidden Markov model
- Ti: hidden variable (e.g., tag)
- Wi: observable variable (e.g., word)
[Figure: chain T1 -> T2 -> T3 -> T4 with transition CPTs Pr(Ti | Ti-1); each Ti emits Wi with CPT Pr(Wi | Ti)]

Maximum Likelihood Learning
- ML learning of Bayes net parameters: for θ_{V=true, pa(V)=v} = Pr(V=true | pa(V)=v),
  θ_{V=true, pa(V)=v} = #[V=true, pa(V)=v] / ( #[V=true, pa(V)=v] + #[V=false, pa(V)=v] )
  (a counting sketch of this estimate appears after the heart disease example below)
- Assumes all attributes have values in every record
- What if the values of some attributes are missing?

Incomplete data
- Many real-world problems have hidden variables (a.k.a. latent variables)
- Values of some attributes are missing
- Incomplete data: unsupervised learning
- Examples:
  - Part-of-speech tagging
  - Topic modeling
  - Market segmentation for marketing
  - Medical diagnosis

Naive solutions for incomplete data
- Solution #1: ignore records with missing values
  - But what if all records are missing values? (When a variable is hidden, none of the records have any value for that variable.)
- Solution #2: ignore hidden variables
  - The model may become significantly more complex!

Heart disease example
[Figure: two networks over Smoking, Diet, Exercise and Symptoms 1-3, annotated with the number of CPT parameters per node.
(a) With the hidden HeartDisease variable, Smoking, Diet and Exercise need 2 parameters each, HeartDisease needs 54, and each symptom needs 6: the simpler model (fewer CPT parameters).
(b) With HeartDisease removed, Smoking, Diet and Exercise need 2 parameters each, and the symptoms need 54, 162 and 486: the more complex model (lots of CPT parameters).]
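For complete data, the ML estimate above is just a relative frequency of counts. Here is a minimal counting sketch, not taken from the lecture; the record format and the tiny A -> B example are illustrative assumptions.

```python
# Maximum likelihood CPT estimation from complete data:
#   theta_{V=true, pa(V)=v} = #[V=true, pa=v] / (#[V=true, pa=v] + #[V=false, pa=v])
from collections import Counter

def ml_cpt(records, var, parents):
    """records: list of dicts {variable: True/False}. Returns Pr(var=True | parent config)."""
    counts = Counter()
    for r in records:
        pa = tuple(r[p] for p in parents)     # parent configuration of this record
        counts[(pa, r[var])] += 1             # count #[var=value, pa(var)=pa]
    configs = {pa for (pa, _) in counts}
    return {pa: counts[(pa, True)] / (counts[(pa, True)] + counts[(pa, False)])
            for pa in configs}

# Toy network A -> B with four complete records
records = [{"A": True,  "B": True},
           {"A": True,  "B": False},
           {"A": False, "B": False},
           {"A": True,  "B": True}]
print(ml_cpt(records, "B", ["A"]))   # {(True,): 0.666..., (False,): 0.0}
```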
Direct maximum likelihood
- Solution #3: maximize the likelihood directly
- Let Z be the hidden variables and E the observable ones:
  h_ML = argmax_h P(e | h)
       = argmax_h Σ_Z P(e, Z | h)
       = argmax_h Σ_Z Π_i CPT(Vi)
       = argmax_h log Σ_Z Π_i CPT(Vi)
- Problem: we can't push the log past the sum to linearize the product

Expectation-Maximization
- Solution #4: the EM algorithm
- Intuition: if we knew the missing values, computing h_ML would be trivial
- Guess h_ML, then iterate:
  - Expectation: based on h_ML, compute the expectation of the missing values
  - Maximization: based on the expected missing values, compute a new estimate of h_ML

Approximate maximum likelihood
- More formally, iteratively compute:
  h_{i+1} = argmax_h Σ_Z P(Z | h_i, e) log P(e, Z | h)
  (the expectation over Z under P(Z | h_i, e) is the E step; the argmax over h is the M step)

Derivation
- log P(e | h) = log [ P(e, Z | h) / P(Z | e, h) ]   (true for every value of Z)
              = log P(e, Z | h) - log P(Z | e, h)
              = Σ_Z P(Z | e, h) log P(e, Z | h) - Σ_Z P(Z | e, h) log P(Z | e, h)
              ≥ Σ_Z P(Z | e, h) log P(e, Z | h)
  (taking expectations with respect to P(Z | e, h) leaves the left-hand side unchanged, and the subtracted term is ≤ 0 since it is an expectation of log-probabilities)
- EM finds a local maximum of Σ_Z P(Z | e, h) log P(e, Z | h), which is a lower bound on log P(e | h)

Log inside the sum can linearize the product
- h_{i+1} = argmax_h Σ_Z P(Z | h_i, e) log P(e, Z | h)
          = argmax_h Σ_Z P(Z | h_i, e) log Π_j CPT_j
          = argmax_h Σ_Z P(Z | h_i, e) Σ_j log CPT_j
- Monotonic improvement of the likelihood: P(e | h_{i+1}) ≥ P(e | h_i)

Candy Example
- Suppose you buy two bags of candies of unknown type (e.g., flavour ratios)
- You plan to eat sufficiently many candies from each bag to learn their types
- Ignoring your plan, your roommate mixes both bags
- How can you learn the type of each bag even though they have been mixed?
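Before the candy example is worked through on the following slides, here is a tiny numeric sketch, not from the lecture, that checks the bound in the derivation above for a single two-valued hidden variable; the joint probabilities are made-up numbers.

```python
# Numeric check (illustrative values) of the EM lower bound:
#   sum_Z P(Z|e,h) log P(e,Z|h)  <=  log P(e|h)
from math import log

p_ez = {0: 0.12, 1: 0.28}                      # made-up joint P(e, Z=z | h) for a binary Z
p_e = sum(p_ez.values())                       # P(e | h) = sum_Z P(e, Z | h)
post = {z: p / p_e for z, p in p_ez.items()}   # posterior P(Z=z | e, h)

bound = sum(post[z] * log(p_ez[z]) for z in p_ez)
print(log(p_e), bound)   # approx. -0.916 vs. -1.527: the expected complete-data
                         # log-likelihood is indeed a lower bound on log P(e | h)
```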
Candy Example
- The Bag variable is hidden
[Figure (a): hidden node Bag with prior P(Bag=1) and observed children Flavor, Wrapper, Holes; the CPT P(F=cherry | Bag) lists one entry per bag]

Unsupervised Clustering
- The class variable is hidden
- Naïve Bayes model
[Figure (b): hidden class node with observed attributes X]

Candy Example
- Unknown parameters:
  - θ_i  = P(Bag = i)
  - θ_Fi = P(Flavour = cherry | Bag = i)
  - θ_Wi = P(Wrapper = red | Bag = i)
  - θ_Hi = P(Hole = yes | Bag = i)
- When eating a candy: F, W and H are observable; B is hidden

Candy Example
- Let the true parameters be: θ = 0.5, θ_F1 = θ_W1 = θ_H1 = 0.8, θ_F2 = θ_W2 = θ_H2 = 0.3
- After eating 1000 candies:

                   F=cherry   F=lime
  W=red,   H=1        273        79
  W=red,   H=0         93       100
  W=green, H=1        104        94
  W=green, H=0         90       167

EM algorithm
- Guess h_0: θ = 0.6, θ_F1 = θ_W1 = θ_H1 = 0.6, θ_F2 = θ_W2 = θ_H2 = 0.4
- Alternate:
  - Expectation: expected # of candies in each bag
  - Maximization: new parameter estimates

Candy Example
- Expectation: expected # of candies in each bag
  #[Bag=i] = Σ_j P(B=i | f_j, w_j, h_j)
- Compute P(B=i | f_j, w_j, h_j) by variable elimination (or any other inference algorithm)
- Example: #[Bag=1] = 612, #[Bag=2] = 388
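A small sketch, not the lecture's code, of this expectation step: the network is small enough that the posterior P(Bag=1 | f, w, h) can be written out in closed form instead of running variable elimination. The counts are the 1000-candy table above and the parameters are the initial guess h_0.

```python
# E-step sketch for the candy model: expected number of candies from bag 1.
counts = {(1, 1, 1): 273, (0, 1, 1): 79, (1, 1, 0): 93, (0, 1, 0): 100,
          (1, 0, 1): 104, (0, 0, 1): 94, (1, 0, 0): 90, (0, 0, 0): 167}
# Key = (F=cherry?, W=red?, Holes?); value = number of candies observed.

theta, t1, t2 = 0.6, 0.6, 0.4    # h_0: P(Bag=1) and the shared attribute parameters per bag

def posterior_bag1(f, w, h):
    """P(Bag=1 | F=f, W=w, H=h) under h_0 (all three attributes share t1 / t2)."""
    like = lambda t, x: t if x else 1 - t
    p1 = theta * like(t1, f) * like(t1, w) * like(t1, h)          # P(Bag=1, f, w, h)
    p2 = (1 - theta) * like(t2, f) * like(t2, w) * like(t2, h)    # P(Bag=2, f, w, h)
    return p1 / (p1 + p2)

n_bag1 = sum(n * posterior_bag1(f, w, h) for (f, w, h), n in counts.items())
print(round(n_bag1), round(1000 - n_bag1))   # ~612 and ~388, as on the slide
```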
Candy Example
- Maximization: relative frequency of each bag
  θ_1 = 612/1000 = 0.612
  θ_2 = 388/1000 = 0.388

Candy Example
- Expectation: expected # of cherry candies in each bag
  #[B=i, F=cherry] = Σ_{j: f_j=cherry} P(B=i | f_j=cherry, w_j, h_j)
- Compute P(B=i | f_j=cherry, w_j, h_j) by variable elimination (or any other inference algorithm)
- Maximization:
  θ_F1 = #[B=1, F=cherry] / #[B=1] = 0.668
  θ_F2 = #[B=2, F=cherry] / #[B=2] = 0.389

Candy Example
[Figure: log-likelihood of the 1000-candy data for the learned Bayesian network, plotted against the EM iteration number; the y-axis spans roughly -2025 to -1975, the x-axis runs from 0 to 120 iterations, and the log-likelihood increases monotonically with each iteration]

EM algorithm for general Bayes nets
- Expectation: #[Vi = vij, Pa(Vi) = pa_ik] = expected frequency
- Maximization: θ_{vij, pa_ik} = #[Vi = vij, Pa(Vi) = pa_ik] / #[Pa(Vi) = pa_ik]
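Putting the steps together, here is a compact sketch, not the lecture's code, of the full EM loop for the candy model, written directly in terms of the general expected-count update above. The data are the 1000-candy table and the starting point is h_0; the first iteration reproduces θ ≈ 0.612 and θ_F1 ≈ 0.668, and the printed log-likelihood increases monotonically, as in the plot.

```python
# EM for the candy model (sketch): Bag is hidden; F, W, H are observed binary attributes.
from math import log

counts = {(1, 1, 1): 273, (0, 1, 1): 79, (1, 1, 0): 93, (0, 1, 0): 100,
          (1, 0, 1): 104, (0, 0, 1): 94, (1, 0, 0): 90, (0, 0, 0): 167}

theta = 0.6                                     # P(Bag=1), initial guess h_0
th = {1: {"F": 0.6, "W": 0.6, "H": 0.6},        # P(attribute=true | Bag=1)
      2: {"F": 0.4, "W": 0.4, "H": 0.4}}        # P(attribute=true | Bag=2)

def bern(p, x):          # P(attribute = x) for a Bernoulli(p) attribute
    return p if x else 1.0 - p

for it in range(10):
    # E-step: expected counts #[Bag=b] and #[Bag=b, attribute=true]
    N = {1: 0.0, 2: 0.0}
    Npos = {1: {"F": 0.0, "W": 0.0, "H": 0.0}, 2: {"F": 0.0, "W": 0.0, "H": 0.0}}
    loglik = 0.0
    for (f, w, h), n in counts.items():
        p = {1: theta, 2: 1.0 - theta}
        for b in (1, 2):
            p[b] *= bern(th[b]["F"], f) * bern(th[b]["W"], w) * bern(th[b]["H"], h)
        loglik += n * log(p[1] + p[2])          # log-likelihood of the current parameters
        for b in (1, 2):
            post = p[b] / (p[1] + p[2])         # P(Bag=b | f, w, h)
            N[b] += n * post
            for attr, val in (("F", f), ("W", w), ("H", h)):
                Npos[b][attr] += n * post * val
    # M-step: relative expected frequencies (the general #[V, Pa(V)] / #[Pa(V)] update)
    theta = N[1] / (N[1] + N[2])
    for b in (1, 2):
        for attr in ("F", "W", "H"):
            th[b][attr] = Npos[b][attr] / N[b]
    print(f"iter {it}: loglik = {loglik:.1f}, theta = {theta:.3f}, thF1 = {th[1]['F']:.3f}")
```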