Bayesian Networks (aka belief networks, probabilistic networks)
CS886 Fall 07, Lecture 7 (Oct. 2, 2007)

Bayesian Networks (aka belief networks, probabilistic networks)
A BN over variables {X_1, X_2, ..., X_n} consists of:
- a DAG whose nodes are the variables
- a set of CPTs, Pr(X_i | Parents(X_i)), one for each X_i

Bayesian Networks: key notions
- parents of a node: Par(X_i)
- children of a node
- descendants of a node
- ancestors of a node
- family: the set of nodes consisting of X_i and its parents
CPTs are defined over families in the BN.
[Figure: a four-node network with arcs A -> C, B -> C, C -> D and CPTs P(A), P(B), P(C | A, B), P(D | C). Here Parents(C) = {A, B}, Children(A) = {C}, Descendants(B) = {C, D}, Ancestors(D) = {A, B, C}, Family(C) = {C, A, B}.]

An Example Bayes Net
[Figure: an 11-variable network with a few of its CPTs shown; the number of entries in each CPT is listed.]
The explicit joint would require 2^11 - 1 = 2047 parameters; the BN requires only 27.

Semantics of a Bayes Net
The structure of the BN means that every X_i is conditionally independent of all of its nondescendants given its parents:
Pr(X_i | S, Par(X_i)) = Pr(X_i | Par(X_i)) for any subset S of NonDescendants(X_i)

Semantics of Bayes Nets
If we ask for P(x_1, x_2, ..., x_n), then, assuming an ordering consistent with the network, the chain rule gives:
P(x_1, x_2, ..., x_n) = P(x_n | x_{n-1}, ..., x_1) P(x_{n-1} | x_{n-2}, ..., x_1) ... P(x_1)
                      = P(x_n | Par(x_n)) P(x_{n-1} | Par(x_{n-1})) ... P(x_1)
Thus the joint is recoverable from the parameters (CPTs) specified in an arbitrary BN.

Bayes net example I: naïve Bayes model
Naïve because the words do not depend on each other given the class:
Joint = Pr(T) Π_i Pr(W_i | T)
[Figure: class node T with CPT Pr(T) and children W_1, W_2, W_3, W_4, each with CPT Pr(W_i | T).]
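To make the factorization concrete, here is a minimal Python sketch of the chain-rule joint for the four-node network A -> C <- B, C -> D above; the CPT numbers are invented for illustration and are not from the lecture.

```python
from itertools import product

# Minimal sketch: P(a, b, c, d) = P(a) P(b) P(c | a, b) P(d | c)
# for the four-node network above. All CPT numbers are made up for illustration.
P_A = {True: 0.3, False: 0.7}
P_B = {True: 0.6, False: 0.4}
P_C_given_AB = {  # P(C=True | A, B); P(C=False | A, B) is the complement
    (True, True): 0.9, (True, False): 0.5,
    (False, True): 0.4, (False, False): 0.1,
}
P_D_given_C = {True: 0.8, False: 0.2}  # P(D=True | C)

def joint(a, b, c, d):
    """Chain-rule joint using an ordering consistent with the DAG."""
    pc = P_C_given_AB[(a, b)]
    pd = P_D_given_C[c]
    return (P_A[a] * P_B[b]
            * (pc if c else 1 - pc)
            * (pd if d else 1 - pd))

# Sanity check: the factored joint sums to 1 over all 2^4 assignments.
total = sum(joint(*bits) for bits in product([True, False], repeat=4))
assert abs(total - 1) < 1e-9
print(joint(True, False, True, True))  # one entry of the joint
```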

Bayes net example II: tree-augmented naïve Bayes classifier
Allow arcs between the leaves as long as they form a tree, e.g., augment the naïve Bayes model with a bigram model.
[Figure: class node T with CPT Pr(T); word nodes W_1, W_2, W_3, W_4 with CPTs Pr(W_i | W_{i-1}, T) and arcs between consecutive words.]

Bayes net example III: hidden Markov model
T_i: hidden variable (e.g., tag); W_i: observable variable (e.g., word).
[Figure: chain T_1 -> T_2 -> T_3 -> T_4 with transition CPTs Pr(T_i | T_{i-1}); each T_i emits W_i with CPT Pr(W_i | T_i).]

Maximum Likelihood Learning
ML learning of Bayes net parameters: for θ_{V=true, pa(V)=v} = Pr(V=true | pa(V)=v),
θ_{V=true, pa(V)=v} = #[V=true, pa(V)=v] / (#[V=true, pa(V)=v] + #[V=false, pa(V)=v])
This assumes all attributes have values. What if the values of some attributes are missing?

Incomplete data
Many real-world problems have hidden variables (a.k.a. latent variables): the values of some attributes are missing. Incomplete data leads to unsupervised learning. Examples: part-of-speech tagging, topic modeling, market segmentation for marketing, medical diagnosis.

Naive solutions for incomplete data
Solution #1: ignore records with missing values. But what if all records have missing values (i.e., when a variable is hidden, none of the records have any value for that variable)?
Solution #2: ignore hidden variables. The model may become significantly more complex!

Heart disease example
[Figure: (a) Smoking, Diet and Exercise (2 CPT entries each) are parents of HeartDisease (54 entries), which is the parent of Symptom 1, Symptom 2 and Symptom 3 (6 entries each); (b) the same network with the hidden HeartDisease variable removed, so the symptoms depend directly on Smoking, Diet and Exercise (54, 162 and 486 entries).]
(a) is simpler (i.e., fewer CPT parameters); (b) is complex (i.e., lots of CPT parameters).
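As a concrete illustration of the counting formula above, the sketch below estimates θ_{V=true, pa(V)=v} by relative frequency from complete data; the records and field names are invented for this example.

```python
from collections import Counter

# Invented complete-data records for a binary node V with parents (A, B).
records = [
    {"A": True,  "B": True,  "V": True},
    {"A": True,  "B": True,  "V": False},
    {"A": True,  "B": False, "V": True},
    {"A": False, "B": True,  "V": False},
    {"A": True,  "B": True,  "V": True},
]

# Count occurrences of each (parent configuration, V value) combination.
counts = Counter()
for r in records:
    counts[(r["A"], r["B"], r["V"])] += 1

def theta_v_true(a, b):
    """ML estimate of Pr(V=true | A=a, B=b) by relative frequency:
    #[V=true, pa(V)=v] / (#[V=true, pa(V)=v] + #[V=false, pa(V)=v])."""
    n_true = counts[(a, b, True)]
    n_false = counts[(a, b, False)]
    return n_true / (n_true + n_false)

print(theta_v_true(True, True))  # 2/3 for the toy records above
```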

Direct maximum likelihood
Solution #3: maximize the likelihood directly. Let Z be hidden and E observable.
h_ML = argmax_h P(e | h)
     = argmax_h Σ_Z P(e, Z | h)
     = argmax_h Σ_Z Π_i CPT(V_i)
     = argmax_h log Σ_Z Π_i CPT(V_i)
Problem: we can't push the log past the sum to linearize the product.

Solution #4: EM algorithm
Intuition: if we knew the missing values, computing h_ML would be trivial.
Guess h_ML, then iterate:
- Expectation: based on h_ML, compute the expectation of the missing values.
- Maximization: based on the expected missing values, compute a new estimate of h_ML.

Approximate maximum likelihood
More formally, iteratively compute:
h^{i+1} = argmax_h Σ_Z P(Z | h^i, e) log P(e, Z | h)

Expectation Maximization: derivation
log P(e | h) = log [P(e, Z | h) / P(Z | e, h)]
             = log P(e, Z | h) - log P(Z | e, h)
             = Σ_Z P(Z | e, h) log P(e, Z | h) - Σ_Z P(Z | e, h) log P(Z | e, h)
             ≥ Σ_Z P(Z | e, h) log P(e, Z | h)
(The third line takes the expectation of both sides with respect to P(Z | e, h); the left-hand side does not depend on Z, so it is unchanged. The inequality holds because the dropped term, -Σ_Z P(Z | e, h) log P(Z | e, h), is the entropy of P(Z | e, h) and is nonnegative.)
EM finds a local maximum of Σ_Z P(Z | e, h) log P(e, Z | h), which is a lower bound of log P(e | h).

Log inside the sum can linearize the product
h^{i+1} = argmax_h Σ_Z P(Z | h^i, e) log P(e, Z | h)
        = argmax_h Σ_Z P(Z | h^i, e) log Π_j CPT_j
        = argmax_h Σ_Z P(Z | h^i, e) Σ_j log CPT_j
Monotonic improvement of the likelihood: P(e | h^{i+1}) ≥ P(e | h^i)

Candy Example
Suppose you buy two bags of candies of unknown type (e.g., flavour ratios). You plan to eat sufficiently many candies from each bag to learn its type. Ignoring your plan, your roommate mixes both bags. How can you learn the type of each bag despite the mixing?
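Before working through the candy example, here is a quick numerical check of the bound derived above, using made-up values of P(e, Z | h) for a three-valued hidden variable: the expected complete-data log-likelihood under the posterior never exceeds log P(e | h).

```python
import math

# Hypothetical values of P(e, Z=z | h) for a fixed observation e; illustrative only.
p_eZ = {"z1": 0.10, "z2": 0.25, "z3": 0.05}

p_e = sum(p_eZ.values())                           # P(e | h) = sum_Z P(e, Z | h)
posterior = {z: p / p_e for z, p in p_eZ.items()}  # P(Z | e, h)

log_likelihood = math.log(p_e)
lower_bound = sum(posterior[z] * math.log(p_eZ[z]) for z in p_eZ)

print(log_likelihood, lower_bound)
assert lower_bound <= log_likelihood  # the dropped entropy term is nonnegative
```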

Candy Example
The Bag variable is hidden.
[Figure (a): node Bag with prior P(Bag=1) and children Flavor, Wrapper, Holes; the CPT P(F=cherry | B) has one row per bag.]

Unsupervised Clustering
The class variable is hidden; this is again a naïve Bayes model.
[Figure (b): a hidden class node with observed children X.]

Candy Example: unknown parameters
θ_i  = P(Bag=i)
θ_Fi = P(Flavour=cherry | Bag=i)
θ_Wi = P(Wrapper=red | Bag=i)
θ_Hi = P(Hole=yes | Bag=i)
When eating a candy, F, W and H are observable; B is hidden.

Candy Example: data
Let the true parameters be θ=0.5, θ_F1=θ_W1=θ_H1=0.8, θ_F2=θ_W2=θ_H2=0.3.
Counts after eating 1000 candies:
                      F=cherry   F=lime
  W=red,   Holes=1       273       79
  W=red,   Holes=0        93      100
  W=green, Holes=1       104       94
  W=green, Holes=0         90      167

EM algorithm: Candy Example
Guess h_0: θ=0.6, θ_F1=θ_W1=θ_H1=0.6, θ_F2=θ_W2=θ_H2=0.4. Then alternate:
- Expectation: expected # of candies in each bag
- Maximization: new parameter estimates

Candy Example: Expectation step
Expected # of candies in each bag: #[Bag=i] = Σ_j P(B=i | f_j, w_j, h_j)
Compute P(B=i | f_j, w_j, h_j) by variable elimination (or any other inference algorithm).
Example: #[Bag=1] = 612, #[Bag=2] = 388
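The following sketch runs one full EM iteration for the candy example. Because Flavor, Wrapper and Holes are conditionally independent given Bag, the posterior P(B=i | f, w, h) is computed in closed form here instead of by variable elimination; with the counts and initial guess h_0 above it reproduces #[Bag=1] ≈ 612 and #[Bag=2] ≈ 388, as well as the parameter updates reported on the next slides.

```python
# Observed counts after eating 1000 candies: (cherry?, red?, holes?) -> count
counts = {
    (True,  True,  True):  273, (True,  True,  False): 93,
    (True,  False, True):  104, (True,  False, False): 90,
    (False, True,  True):   79, (False, True,  False): 100,
    (False, False, True):   94, (False, False, False): 167,
}

# Initial guess h_0: theta = P(Bag=1); per-bag attribute probabilities
theta = 0.6
attr = {1: (0.6, 0.6, 0.6),   # (theta_F1, theta_W1, theta_H1)
        2: (0.4, 0.4, 0.4)}   # (theta_F2, theta_W2, theta_H2)

def lik(bag, cherry, red, holes):
    """P(f, w, h | Bag=bag) under the current parameters."""
    f, w, h = attr[bag]
    return ((f if cherry else 1 - f) *
            (w if red else 1 - w) *
            (h if holes else 1 - h))

# E-step: expected counts #[Bag=i] and #[Bag=i, F=cherry]
exp_bag = {1: 0.0, 2: 0.0}
exp_bag_cherry = {1: 0.0, 2: 0.0}
for (cherry, red, holes), n in counts.items():
    prior = {1: theta, 2: 1 - theta}
    joint = {i: prior[i] * lik(i, cherry, red, holes) for i in (1, 2)}
    z = joint[1] + joint[2]
    for i in (1, 2):
        post = joint[i] / z          # P(Bag=i | f, w, h)
        exp_bag[i] += n * post
        if cherry:
            exp_bag_cherry[i] += n * post

# M-step: relative frequencies
theta_new = exp_bag[1] / sum(counts.values())
theta_F = {i: exp_bag_cherry[i] / exp_bag[i] for i in (1, 2)}

print(round(exp_bag[1]), round(exp_bag[2]))        # ~612, ~388
print(round(theta_new, 3))                         # ~0.612
print(round(theta_F[1], 3), round(theta_F[2], 3))  # ~0.668, ~0.389
```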

Candy Example: Maximization step
Relative frequency of each bag:
θ_1 = 612/1000 = 0.612
θ_2 = 388/1000 = 0.388

Candy Example: updating the flavour parameters
Expectation: expected # of cherry candies in each bag:
#[B=i, F=cherry] = Σ_{j: f_j=cherry} P(B=i | f_j=cherry, w_j, h_j)
Compute P(B=i | f_j=cherry, w_j, h_j) by variable elimination (or any other inference algorithm).
Maximization:
θ_F1 = #[B=1, F=cherry] / #[B=1] = 0.668
θ_F2 = #[B=2, F=cherry] / #[B=2] = 0.389

Candy Example: learning curve
[Figure: plot of log-likelihood (roughly -2025 to -1975) against iteration number (0 to 120).]

EM algorithm for general Bayes nets
Expectation: #[V_i = v_ij, Pa(V_i) = pa_ik] = expected frequency
Maximization: θ_{v_ij, pa_ik} = #[V_i = v_ij, Pa(V_i) = pa_ik] / #[Pa(V_i) = pa_ik]
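A generic version of this update might look like the sketch below; it only organizes the bookkeeping. The `posterior` routine (which would run variable elimination or another inference algorithm over each record under the current parameters) is assumed as an argument rather than implemented, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def em_update(nodes, domains, parents, records, posterior):
    """One EM iteration for a discrete Bayes net: expected family counts, then normalization.

    nodes:    list of variable names
    domains:  dict  var -> list of values
    parents:  dict  var -> tuple of parent variable names
    records:  list of (possibly incomplete) observations
    posterior(var, value, pa_config, record): P(var=value, Pa(var)=pa_config | record),
        computed by an inference routine under the current parameters (assumed, not given here).
    """
    # E-step: expected counts #[V_i = v_ij, Pa(V_i) = pa_ik]
    exp_counts = defaultdict(float)
    for record in records:
        for v in nodes:
            pa_domains = [domains[p] for p in parents[v]]
            for pa_config in product(*pa_domains):
                for value in domains[v]:
                    exp_counts[(v, value, pa_config)] += posterior(v, value, pa_config, record)

    # M-step: theta_{v_ij, pa_ik} = #[V_i=v_ij, Pa=pa_ik] / #[Pa=pa_ik]
    theta = {}
    for (v, value, pa_config), c in exp_counts.items():
        total = sum(exp_counts[(v, val, pa_config)] for val in domains[v])
        theta[(v, value, pa_config)] = c / total if total > 0 else 0.0
    return theta
```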