Bayesian Networks aka belief networks, probabilistic networks. Bayesian Networks aka belief networks, probabilistic networks. An Example Bayes Net

Size: px

Start display at page:

Download "Bayesian Networks aka belief networks, probabilistic networks. Bayesian Networks aka belief networks, probabilistic networks. An Example Bayes Net"

Elisabeth Cobb
5 years ago
Views:

1 Bayesian Networks aka belief networks, probabilistic networks A BN over variables {X 1, X 2,, X n } consists of: a DAG whose nodes are the variables a set of PTs (Pr(X i Parents(X i ) ) for each X i P(a) P(~a) A S886 Fall 07 - Lecture 7, Oct. 2, Bayesian Networks aka belief networks, probabilistic networks Key notions parents of a node: Par(X i ) children of node descendents of a node ancestors of a node family: set of nodes consisting of X i and its parents PTs are defined over families in the BN P(b) B P(~b) A B Parents()={A,B} hildren(a)={} P(c a,b) P(~c a,b) Descendents(B)={,D} P(c ~a,b) P(~c ~a,b) Ancestors{D}={A,B,} P(c a,~b) P(~c a,~b) P(c ~a,~b) P(~c ~a,~b) D Family{}={,A,B} S886 Fall 07 - Lecture 7, Oct. 2, An Example Bayes Net A few PTs are shown Explicit joint requires =2047 params Semantics of a Bayes Net The structure of the BN means: every X is conditionally independent of all of its nondescendants given its parents: BN requires only 27 parms (the number of entries for each PT is listed) S886 Fall 07 - Lecture 7, Oct. 2, Pr(X i S Par(X i )) = Pr(X i Par(X i )) for any subset S NonDescendants(X i ) S886 Fall 07 - Lecture 7, Oct. 2, Semantics of Bayes Nets If we ask for P(x 1, x 2,, x n ) we obtain assuming an ordering consistent with network By the chain rule, we have: P(x 1, x 2,, x n ) = P(x n x n-1,,x 1 ) P(x n-1 x n-2,,x 1 ) P(x 1 ) = P(x n Par(x n )) P(x n-1 Par(x n-1 )) P(x 1 ) Bayes net example I Naïve Bayesmodel Naïve: because words do not depend on each other. Joint = Pr(T) Π i Pr(W i T) Pr(T) T Thus, the joint is recoverable using the parameters (PTs) specified in an arbitrary BN S886 Fall 07 - Lecture 7, Oct. 2, Pr(W i T) W 1 W 2 W 3 W 4 S886 Fall 07 - Lecture 7, Oct. 2,

2 Bayes net example II Tree augmented naïve Bayes classifier Allow arcs between the leaves as long as they form a tree E.g., augment naïve Bayes model with bigram model Pr(T) T Bayes net example III Hidden Markov model T i : hidden variable (e.g., tag) W i : observable variable (e.g., word) T 1 T 2 T 3 T 4 Pr(T i T i-1 ) Pr(W i W i-1,t) W 1 W 2 W 3 W 4 Pr(W i T i ) W 1 W 2 W 3 W 4 S886 Fall 07 - Lecture 7, Oct. 2, S886 Fall 07 - Lecture 7, Oct. 2, Maximum Likelihood Learning ML learning of Bayes net parameters: For θ V=true,pa(V)=v = Pr(V=true par(v) = v) θ V=true,pa(V)=v = #[V=true,pa(V)=v] #[V=true,pa(V)=v] + #[V=false,pa(V)=v] Assumes all attributes have values What if values of some attributes are missing? Incomplete data But many real-world problems have hidden variables (a.k.a latent variables) Values of some attributes missing Incomplete data unsupervised learning Examples: Part of speech tagging Topic modeling Market segmentation for marketing Medical diagnosis S886 Fall 07 - Lecture 7, Oct. 2, S886 Fall 07 - Lecture 7, Oct. 2, Naive solutions for incomplete data Solution #1: Ignore records with missing values But what if all records are missing values (i.e., when a variable is hidden, none of the records have any value for that variable) Solution #2: Ignore hidden variables Model may become significantly more complex! Heart disease example Smoking Diet Exercise 54 HeartDisease Symptom 1 Symptom 2 Symptom 3 (a) Smoking Diet Exercise Symptom 1 Symptom 2 Symptom 3 a) simpler (i.e., fewer PT parameters) b) complex (i.e., lots of PT parameters) (b) S886 Fall 07 - Lecture 7, Oct. 2, S886 Fall 07 - Lecture 7, Oct. 2,

3 Direct maximum likelihood Solution 3: maximize likelihood directly Let Z be hidden and E observable h ML = argmax h P(e h) = argmax h Σ Z P(e,Z h) = argmax h Σ Z Π i PT(V i ) = argmax h log Σ Z Π i PT(V i ) Problem: can t push log past sum to linearize product Solution #4: EM algorithm Intuition: if we knew the missing values, computing h ML would be trival Guess h ML Iterate Expectation: based on h ML, compute expectation of the missing values Maximization: based on expected missing values, compute new estimate of h ML S886 Fall 07 - Lecture 7, Oct. 2, S886 Fall 07 - Lecture 7, Oct. 2, More formally: Approximate maximum likelihood Iteratively compute: h i+1 = argmax h Σ Z P(Z h i,e) log P(e,Z h) Expectation Maximization Derivation log P(e h) = log [P(e,Z h) / P(Z e,h)] = log P(e,Z h) log P(Z e,h) = Σ Z P(Z e,h) log P(e,Z h) Σ Z P(Z e,h) log P(Z e,h) Σ Z P(Z e,h) log P(e,Z h) EM finds a local maximum of Σ Z P(Z e,h) log P(e,Z h) which is a lower bound of log P(e h) S886 Fall 07 - Lecture 7, Oct. 2, S886 Fall 07 - Lecture 7, Oct. 2, Log inside sum can linearize product h i+1 = argmax h Σ Z P(Z h i,e) log P(e,Z h) = argmax h Σ Z P(Z h i,e) log Π j PT j = argmax h Σ Z P(Z h i,e) Σ j log PT j Monotonic improvement of likelihood P(e h i+1 ) P(e h i ) andy Example Suppose you buy two bags of candies of unknown type (e.g. flavour ratios) You plan to eat sufficiently many candies of each bag to learn their type Ignoring your plan, your roommate mixes both bags How can you learn the type of each bag despite being mixed? S886 Fall 07 - Lecture 7, Oct. 2, S886 Fall 07 - Lecture 7, Oct. 2,

4 andy Example Bag variable is hidden Unsupervised lustering lass variable is hidden Naïve Bayesmodel P( Bag= 1) Bag P(F=cherry B) Bag 1 F1 2 F2 Flavor Wrapper Holes X (a) (b) S886 Fall 07 - Lecture 7, Oct. 2, S886 Fall 07 - Lecture 7, Oct. 2, andy Example Unknown Parameters: θ i = P(Bag=i) θ Fi = P(Flavour=cherry Bag=i) θ Wi = P(Wrapper=red Bag=i) θ Hi = P(Hole=yes Bag=i) When eating a candy: F, W and H are observable B is hidden andy Example Let true parameters be: θ=0.5, θ F1 =θ W1 =θ H1 =0.8, θ F2 =θ W2 =θ H2 =0.3 After eating 1000 candies: F=cherry F=lime H= W=red H= W=green H= H= S886 Fall 07 - Lecture 7, Oct. 2, S886 Fall 07 - Lecture 7, Oct. 2, EM algorithm andy Example Guess h 0 : θ=0.6, θ F1 =θ W1 =θ H1 =0.6, θ F2 =θ W2 =θ H2 =0.4 Alternate: Expectation: expected # of candies in each bag Maximization: new parameter estimates andy Example Expectation: expected # of candies in each bag #[Bag=i] = Σ j P(B=i f j,w j,h j ) ompute P(B=i f j,w j,h j ) by variable elimination (or any other inference alg.) Example: #[Bag=1] = 612 #[Bag=2] = 388 S886 Fall 07 - Lecture 7, Oct. 2, S886 Fall 07 - Lecture 7, Oct. 2,

5 andy Example Maximization: relative frequency of each bag θ 1 = 612/1000 = θ 2 = 388/1000 = andy Example Expectation: expected # of cherry candies in each bag #[B=i,F=cherry] = Σ j P(B=i f j =cherry,w j,h j ) ompute P(B=i f j =cherry,w j,h j ) by variable elimination (or any other inference alg.) Maximization: θ F1 = #[B=1,F=cherry] / #[B=1] = θ F2 = #[B=2,F=cherry] / #[B=2] = S886 Fall 07 - Lecture 7, Oct. 2, S886 Fall 07 - Lecture 7, Oct. 2, andy Example Bayesian networks Log-likelihood Iteration number S886 Fall 07 - Lecture 7, Oct. 2, EM algorithm for general Bayes nets Expectation: #[V i =v ij,pa(v i )=pa ik ] = expected frequency Maximization: θ vij,pa ik = #[V i =v ij,pa(v i )=pa ik ] / #[Pa(V i )=pa ik ] S886 Fall 07 - Lecture 7, Oct. 2,

Expectation Maximization [KF Chapter 19] Incomplete data

Expectation Maximization [KF Chapter 19] CS 786 University of Waterloo Lecture 17: June 28, 2012 Complete data Incomplete data Values of all attributes are known Learning is relatively easy But many real-world