Bayesian Networks. Introduction to Machine Learning. Matt Gormley, Lecture 24, April 9, 2018

Transcription:

10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks Matt Gormley Lecture 24 April 9, 2018 1

Homework 7: HMMs Reminders Out: Wed, Apr 04 Due: Mon, Apr 16 at 11:59pm Schedule Changes Lecture on Fri, Apr 13 Recitation on Mon, Apr 23 2

Peer Tutoring [Figure: Tutor and Tutee] 3

HIDDEN MARKOV MODELS 4

Derivation of Forward Algorithm Definition: Derivation: 5

Forward-Backward Algorithm 6

Viterbi Algorithm 7

Inference in HMMs What is the computational complexity of inference for HMMs? The naïve (brute force) computations for Evaluation, Decoding, and Marginals take exponential time, O(K^T). The forward-backward algorithm and Viterbi algorithm run in polynomial time, O(T K^2), thanks to dynamic programming! 8
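
To make the O(T K^2) claim concrete, here is a minimal NumPy sketch of the forward algorithm; the array names (pi, A, B) and the interface are my own assumptions, not code from the course.

```python
import numpy as np

def forward(pi, A, B, x):
    """Forward algorithm for an HMM.

    pi : (K,)   initial state probabilities
    A  : (K, K) transitions, A[j, k] = P(y_t = k | y_{t-1} = j)
    B  : (K, V) emissions,   B[k, v] = P(x_t = v | y_t = k)
    x  : length-T sequence of observation indices

    Returns alpha, a (T, K) table with alpha[t, k] = P(x_1..x_t, y_t = k).
    The likelihood P(x_1..x_T) is alpha[-1].sum().
    Runs in O(T * K^2) time instead of the O(K^T) brute-force sum.
    """
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, x[0]]                    # base case
    for t in range(1, T):
        # each of the T-1 steps is a K x K matrix-vector product
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha
```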

Shortcomings of Hidden Markov Models [Figure: chain-structured HMM with states Y_1, ..., Y_n and observations X_1, ..., X_n] HMMs capture dependencies between each state and only its corresponding observation. NLP example: in a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on (nonlocal) features of the whole line such as line length, indentation, amount of white space, etc. There is also a mismatch between the learning objective function and the prediction objective function: the HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task we need the conditional probability P(Y | X). Eric Xing @ CMU, 2005-2015 9

MBR DECODING 10

Inference for HMMs Four Inference Problems for an HMM 1. Evaluation: Compute the probability of a given sequence of observations 2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations 3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations 4. MBR Decoding: Find the lowest-loss sequence of hidden states, given a sequence of observations (Viterbi decoding is a special case) 11

Minimum Bayes Risk Decoding Suppose we are given a loss function ℓ(ŷ, y) and are asked for a single tagging. How should we choose just one from our probability distribution p_θ(y | x)? A minimum Bayes risk (MBR) decoder h*(x) returns the variable assignment with minimum expected loss under the model's distribution: h*(x) = argmin_ŷ E_{y ∼ p_θ(·|x)}[ℓ(ŷ, y)] = argmin_ŷ Σ_y p_θ(y | x) ℓ(ŷ, y) 12

Minimum Bayes Risk Decoding Consider some example loss functions: The 0-1 loss function returns 0 if the two assignments are identical and 1 otherwise: ℓ(ŷ, y) = 1 − I(ŷ, y). The MBR decoder is: h*(x) = argmin_ŷ E_{y ∼ p_θ(·|x)}[ℓ(ŷ, y)] = argmin_ŷ Σ_y p_θ(y | x)(1 − I(ŷ, y)) = argmax_ŷ p_θ(ŷ | x), which is exactly the Viterbi decoding problem! 13

Minimum Bayes Risk Decoding Consider some example loss functions: The Hamming loss corresponds to accuracy and returns the number of incorrect variable assignments: ℓ(ŷ, y) = Σ_{i=1}^{V} (1 − I(ŷ_i, y_i)). The MBR decoder is h*(x) = argmin_ŷ E_{y ∼ p_θ(·|x)}[ℓ(ŷ, y)], whose i-th component is ŷ_i = h*(x)_i = argmax_{ŷ_i} p_θ(ŷ_i | x). This decomposes across variables and requires only the variable marginals. 14
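
Since the Hamming-loss MBR decoder only needs per-position marginals, it is a one-liner once those marginals (e.g., from forward-backward) are in hand. A hedged sketch:

```python
import numpy as np

def mbr_decode_hamming(marginals):
    """MBR decoding under Hamming loss.

    marginals: (T, K) array with marginals[t, k] = p(y_t = k | x),
    e.g. computed by the forward-backward algorithm.
    Returns the length-T sequence of per-position argmaxes.
    """
    return marginals.argmax(axis=1)
```

Unlike Viterbi decoding, the returned sequence maximizes expected per-position accuracy and may even have probability zero under the joint distribution.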

BAYESIAN NETWORKS 15

Bayes Nets Outline Motivation Structured Prediction Background Conditional Independence Chain Rule of Probability Directed Graphical Models Writing Joint Distributions Definition: Bayesian Network Qualitative Specification Quantitative Specification Familiar Models as Bayes Nets Conditional Independence in Bayes Nets Three case studies D-separation Markov blanket Learning Fully Observed Bayes Net (Partially Observed Bayes Net) Inference Background: Marginal Probability Sampling directly from the joint distribution Gibbs Sampling 17

Bayesian Networks DIRECTED GRAPHICAL MODELS 18

Example: Tornado Alarms 1. Imagine that you work at the 911 call center in Dallas 2. You receive six calls informing you that the Emergency Weather Sirens are going off 3. What do you conclude? Figure from https://www.nytimes.com/2017/04/08/us/dallas-emergency-sirens-hacking.html 19

Whiteboard Directed Graphical Models (Bayes Nets) Example: Tornado Alarms Writing Joint Distributions Idea #1: Giant Table Idea #2: Rewrite using chain rule Idea #3: Assume full independence Idea #4: Drop variables from RHS of conditionals Definition: Bayesian Network 21

Bayesian Network [Figure: five-node directed graph over X_1, ..., X_5] p(X_1, X_2, X_3, X_4, X_5) = p(X_5|X_3) p(X_4|X_2, X_3) p(X_3) p(X_2|X_1) p(X_1) 22

Bayesian Network Definition: P(X_1, ..., X_n) = Π_{i=1}^n P(X_i | parents(X_i)) A Bayesian Network is a directed graphical model. It consists of a graph G and the conditional probabilities P. These two parts fully specify the distribution: Qualitative Specification: G Quantitative Specification: P 23

Qualitative Specification Where does the qualitative specification come from? Prior knowledge of causal relationships Prior knowledge of modular relationships Assessment from experts Learning from data (i.e. structure learning) We may simply prefer a certain architecture (e.g. a layered graph) Eric Xing @ CMU, 2006-2011 24

Quantitative Specification Example: Conditional probability tables (CPTs) for discrete random variables. For the network A → C ← B, C → D: P(a, b, c, d) = P(a) P(b) P(c|a, b) P(d|c)
P(a):  a^0 = 0.75, a^1 = 0.25
P(b):  b^0 = 0.33, b^1 = 0.67
P(c|a,b):      a^0 b^0   a^0 b^1   a^1 b^0   a^1 b^1
        c^0    0.45      1         0.9       0.7
        c^1    0.55      0         0.1       0.3
P(d|c):        c^0   c^1
        d^0    0.3   0.5
        d^1    0.7   0.5
Eric Xing @ CMU, 2006-2011 25
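
To make the quantitative specification concrete, here is a small sketch that stores the CPTs above as nested dictionaries and evaluates the joint; the dictionary layout is a hypothetical choice of mine, not the lecture's code.

```python
# CPTs from the slide, keyed by variable values (0 or 1)
P_a = {0: 0.75, 1: 0.25}
P_b = {0: 0.33, 1: 0.67}
P_c_given_ab = {           # P(c | a, b), keyed by (a, b) then c
    (0, 0): {0: 0.45, 1: 0.55},
    (0, 1): {0: 1.0,  1: 0.0},
    (1, 0): {0: 0.9,  1: 0.1},
    (1, 1): {0: 0.7,  1: 0.3},
}
P_d_given_c = {            # P(d | c), keyed by c then d
    0: {0: 0.3, 1: 0.7},
    1: {0: 0.5, 1: 0.5},
}

def joint(a, b, c, d):
    """P(a, b, c, d) = P(a) P(b) P(c | a, b) P(d | c)."""
    return P_a[a] * P_b[b] * P_c_given_ab[(a, b)][c] * P_d_given_c[c][d]

print(joint(1, 0, 0, 1))   # 0.25 * 0.33 * 0.9 * 0.7
```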

Quantitative Specification Example: Conditional probability density functions (CPDs) for continuous random variables. For the same network A → C ← B, C → D: P(a, b, c, d) = P(a) P(b) P(c|a, b) P(d|c), with A ~ N(μ_a, Σ_a), B ~ N(μ_b, Σ_b), C ~ N(A + B, Σ_c), and D ~ N(μ_d + C, Σ_d). Eric Xing @ CMU, 2006-2011 26

Quantitative Specification Example: Combination of CPTs and CPDs for a mix of discrete and continuous variables: P(a, b, c, d) = P(a) P(b) P(c|a, b) P(d|c), with discrete CPTs P(a): a^0 = 0.75, a^1 = 0.25 and P(b): b^0 = 0.33, b^1 = 0.67, and continuous CPDs C ~ N(A + B, Σ_c) and D ~ N(μ_d + C, Σ_d). Eric Xing @ CMU, 2006-2011 27

Whiteboard Directed Graphical Models (Bayes Nets) Observed Variables in Graphical Model Familiar Models as Bayes Nets Bernoulli Naïve Bayes Gaussian Naïve Bayes Gaussian Mixture Model (GMM) Gaussian Discriminant Analysis Logistic Regression Linear Regression 1D Gaussian 28

GRAPHICAL MODELS: DETERMINING CONDITIONAL INDEPENDENCIES Slide from William Cohen

What Independencies does a Bayes Net Model? In order for a Bayesian network to model a probability distribution, the following must be true: each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents. This follows from P(X_1, ..., X_n) = Π_{i=1}^n P(X_i | parents(X_i)) = Π_{i=1}^n P(X_i | X_1, ..., X_{i-1}). But what else does it imply? Slide from William Cohen

What Independencies does a Bayes Net Model? Three cases of interest: Cascade (X → Y → Z), Common Parent (X ← Y → Z), and V-Structure (X → Y ← Z) 31

What Independencies does a Bayes Net Model? Three cases of interest: Cascade (X → Y → Z), Common Parent (X ← Y → Z), and V-Structure (X → Y ← Z). In the cascade and common parent cases, knowing Y decouples X and Z (X ⊥ Z | Y). In the v-structure case, knowing Y couples X and Z. 32

Whiteboard Common Parent: proof of the conditional independence X ⊥ Z | Y for the common parent structure X ← Y → Z. (The other two cases can be shown just as easily.) 33
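
Since the proof itself was done on the whiteboard, here is one way the common-parent case is typically argued (a sketch, not necessarily the lecture's exact derivation):

```latex
% Common parent: X \leftarrow Y \rightarrow Z, so the network factorizes as
% p(X, Y, Z) = p(Y)\, p(X \mid Y)\, p(Z \mid Y).
\begin{align*}
p(X, Z \mid Y)
  &= \frac{p(X, Y, Z)}{p(Y)}
   = \frac{p(Y)\, p(X \mid Y)\, p(Z \mid Y)}{p(Y)} \\
  &= p(X \mid Y)\, p(Z \mid Y),
\end{align*}
% which is exactly the definition of X \perp Z \mid Y.
```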

The Burglar Alarm example Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn't care whether your house is currently being burgled. While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh! [Network: Burglar → Alarm ← Earthquake, Alarm → Phone Call] Quiz: True or False? Burglar ⊥ Earthquake | PhoneCall Slide from William Cohen

Markov Blanket Def: the co-parents of a node are the parents of its children. Def: the Markov Blanket of a node is the set containing the node's parents, children, and co-parents. Theorem: a node is conditionally independent of every other node in the graph given its Markov blanket. Example: in the thirteen-node graph on the slide, the Markov Blanket of X_6 is {X_3, X_4, X_5, X_8, X_9, X_10}, i.e. its parents, children, and co-parents.
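
The Markov blanket can be read off mechanically from parent lists. A small sketch (the dict-of-parent-sets graph encoding is my own, hypothetical representation):

```python
def markov_blanket(node, parents):
    """parents: dict mapping each node to the set of its parents."""
    children = {v for v, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]} - {node}
    return parents[node] | children | co_parents

# Tiny example (hypothetical graph): A -> C <- B, C -> D
parents = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}
print(markov_blanket("C", parents))   # {'A', 'B', 'D'}
```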

D-Separation If variables X and Z are d-separated given a set of variables E, then X and Z are conditionally independent given the set E. Definition #1: Variables X and Z are d-separated given a set of evidence variables E iff every path from X to Z is blocked. A path is blocked whenever: 1. ∃ Y on the path s.t. Y ∈ E and Y is a common parent (X ← Y → Z) 2. ∃ Y on the path s.t. Y ∈ E and Y is in a cascade (X → Y → Z) 3. ∃ Y on the path s.t. ({Y} ∪ descendants(Y)) ∩ E = ∅ and Y is in a v-structure (X → Y ← Z) 39

D-Separation If variables X and Z are d-separated given a set of variables E, then X and Z are conditionally independent given the set E. Definition #2: Variables X and Z are d-separated given a set of evidence variables E iff there does not exist a path between X and Z in the undirected ancestral moral graph with E removed. 1. Ancestral graph: keep only X, Z, E and their ancestors 2. Moral graph: add an undirected edge between all pairs of each node's parents 3. Undirected graph: convert all directed edges to undirected 4. Givens Removed: delete any nodes in E Example Query: A ⊥ B | {D, E}? [Figure: the original, ancestral, moral, undirected, and givens-removed graphs for the example network] A and B remain connected, so they are not d-separated. 40
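
Definition #2 maps almost line-for-line onto code. Below is a minimal sketch (my own function and argument names; it assumes the graph is given as a dict of parent sets, which is not a representation used in the lecture):

```python
from collections import deque

def d_separated(X, Z, E, parents):
    """Check whether X and Z are d-separated given evidence set E.

    parents: dict mapping every node in the graph to the set of its parents.
    Builds the undirected ancestral moral graph, removes E, and tests
    whether X and Z are still connected.
    """
    # 1. Ancestral graph: keep only X, Z, E and their ancestors
    keep, frontier = set(), deque([X, Z, *E])
    while frontier:
        v = frontier.popleft()
        if v not in keep:
            keep.add(v)
            frontier.extend(parents[v])

    # 2 + 3. Moralize and drop edge directions: connect each kept node to
    # its parents, and "marry" all pairs of parents of the same child
    adj = {v: set() for v in keep}
    for v in keep:
        ps = parents[v] & keep
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
        for p in ps:
            for q in ps:
                if p != q:
                    adj[p].add(q)

    # 4. Remove the evidence nodes, then check reachability with BFS
    evidence = set(E)
    reachable, frontier = {X}, deque([X])
    while frontier:
        v = frontier.popleft()
        for u in adj[v] - evidence:
            if u == Z:
                return False       # still connected => not d-separated
            if u not in reachable:
                reachable.add(u)
                frontier.append(u)
    return True                    # no remaining path => d-separated
```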

SUPERVISED LEARNING FOR BAYES NETS 41

Machine Learning The data inspires the structures we want to predict. Inference finds {best structure, marginals, partition function} for a new observation. (Inference is usually called as a subroutine in learning.) Our model defines a score for each structure; it also tells us what to optimize. Learning tunes the parameters of the model. [Figure: diagram relating Domain Knowledge, Mathematical Modeling, Combinatorial Optimization, Optimization, and ML] 42

Machine Learning Data, Model, Objective, Inference, Learning. (Inference is usually called as a subroutine in learning.) [Figure: example data "time flies like an arrow" and a model over variables X_1, ..., X_5] 43

Learning Fully Observed BNs p(X_1, X_2, X_3, X_4, X_5) = p(X_5|X_3) p(X_4|X_2, X_3) p(X_3) p(X_2|X_1) p(X_1) How do we learn these conditional and marginal distributions for a Bayes Net?

Learning Fully Observed BNs Learning this fully observed Bayesian Network is equivalent to learning five (small / simple) independent networks from the same data: p(X_1, X_2, X_3, X_4, X_5) = p(X_5|X_3) p(X_4|X_2, X_3) p(X_3) p(X_2|X_1) p(X_1) [Figure: the five-node network decomposed into one small network per factor] 47

Learning Fully Observed BNs How do we learn these conditional and marginal distributions for a Bayes Net? θ* = argmax_θ log p(X_1, X_2, X_3, X_4, X_5) = argmax_θ [ log p(X_5|X_3, θ_5) + log p(X_4|X_2, X_3, θ_4) + log p(X_3|θ_3) + log p(X_2|X_1, θ_2) + log p(X_1|θ_1) ], so each factor can be learned separately: θ_1* = argmax_{θ_1} log p(X_1|θ_1), θ_2* = argmax_{θ_2} log p(X_2|X_1, θ_2), θ_3* = argmax_{θ_3} log p(X_3|θ_3), θ_4* = argmax_{θ_4} log p(X_4|X_2, X_3, θ_4), θ_5* = argmax_{θ_5} log p(X_5|X_3, θ_5). 48
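
For discrete variables each per-factor maximization is just counting and normalizing. A minimal sketch, assuming the fully observed data arrives as a list of dicts mapping variable names to values (a hypothetical format, not the homework's):

```python
from collections import Counter, defaultdict

def mle_cpt(data, child, parent_names):
    """Maximum likelihood estimate of P(child | parents) from fully
    observed data.

    data: list of dicts, each mapping a variable name to its observed value.
    Returns a dict: (tuple of parent values) -> {child value: probability}.
    """
    counts = defaultdict(Counter)
    for row in data:
        counts[tuple(row[p] for p in parent_names)][row[child]] += 1
    return {key: {v: n / sum(c.values()) for v, n in c.items()}
            for key, c in counts.items()}

# Each factor of the decomposition above is estimated independently, e.g.
#   theta_1 = mle_cpt(data, child="X1", parent_names=[])
#   theta_4 = mle_cpt(data, child="X4", parent_names=["X2", "X3"])
```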

Learning Fully Observed BNs Whiteboard Example: Learning for Tornado Alarms 49

INFERENCE FOR BAYESIAN NETWORKS 52

A Few Problems for Bayes Nets Suppose we already have the parameters of a Bayesian Network. 1. How do we compute the probability of a specific assignment to the variables? P(T=t, H=h, A=a, C=c) 2. How do we draw a sample from the joint distribution? (t, h, a, c) ∼ P(T, H, A, C) 3. How do we compute marginal probabilities? P(A) = Σ_{t,h,c} P(T=t, H=h, A, C=c) 4. How do we draw samples from a conditional distribution? (t, h, a) ∼ P(T, H, A | C=c) 5. How do we compute conditional marginal probabilities? P(H | C=c) = Σ_{t,a} P(T=t, H, A=a | C=c) Can we use samples? 53

Whiteboard Inference for Bayes Nets Background: Marginal Probability Sampling from a joint distribution Gibbs Sampling 54

Sampling from a Joint Distribution We can use these samples to estimate many different probabilities! [Figure: a table of sampled values for the variables T, H, A, C] 55
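
Drawing a sample from the joint of a Bayes net is ancestral sampling: visit variables in a topological order and sample each one from its CPT given its already-sampled parents. A hedged sketch, using a hypothetical representation in which each variable maps to its parent list and a table keyed by parent values:

```python
import random

def ancestral_sample(cpts, order):
    """Draw one sample from the joint distribution of a discrete Bayes net.

    cpts:  dict mapping each variable to (parent_list, table), where
           table[(parent values)] = {value: probability}; roots use the
           empty tuple () as their only key.
    order: a topological ordering of the variables (parents before children).
    """
    sample = {}
    for v in order:
        parent_list, table = cpts[v]
        dist = table[tuple(sample[p] for p in parent_list)]
        values, probs = zip(*dist.items())
        sample[v] = random.choices(values, weights=probs)[0]
    return sample

# Samples from the joint can then estimate marginals, e.g. P(A = 1):
#   samples = [ancestral_sample(cpts, order) for _ in range(10_000)]
#   p_A1 = sum(s["A"] == 1 for s in samples) / len(samples)
```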

Gibbs Sampling [Figure: 2D illustration of Gibbs sampling from a distribution p(x) over x = (x_1, x_2): starting from x^(t), draw x^(t+1) by sampling from p(x_1 | x_2^(t)), then x^(t+2) by sampling from p(x_2 | x_1^(t+1)), and so on, producing the sequence x^(t), x^(t+1), x^(t+2), x^(t+3), x^(t+4), ...]

Gibbs Sampling Question: How do we draw samples from a conditional distribution? y_1, y_2, ..., y_J ∼ p(y_1, y_2, ..., y_J | x_1, x_2, ..., x_J) (Approximate) Solution: Initialize y_1^(0), y_2^(0), ..., y_J^(0) to arbitrary values. For t = 1, 2, ...: y_1^(t+1) ∼ p(y_1 | y_2^(t), ..., y_J^(t), x_1, x_2, ..., x_J); y_2^(t+1) ∼ p(y_2 | y_1^(t+1), y_3^(t), ..., y_J^(t), x_1, x_2, ..., x_J); y_3^(t+1) ∼ p(y_3 | y_1^(t+1), y_2^(t+1), y_4^(t), ..., y_J^(t), x_1, x_2, ..., x_J); ...; y_J^(t+1) ∼ p(y_J | y_1^(t+1), y_2^(t+1), ..., y_{J-1}^(t+1), x_1, x_2, ..., x_J) Properties: This will eventually yield samples from p(y_1, y_2, ..., y_J | x_1, x_2, ..., x_J), but it might take a long time, just like other Markov Chain Monte Carlo methods. 59
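
The update scheme above translates directly into a generic sampler. A minimal sketch, where sample_conditional is an assumed user-supplied function that draws y_j from its full conditional (which, as the next slide notes, only needs to condition on y_j's Markov blanket):

```python
def gibbs_sampler(init, sample_conditional, num_sweeps, x):
    """Generic Gibbs sampler.

    init:  dict of arbitrary starting values for y_1, ..., y_J
    sample_conditional(j, y, x): draws a new value of y_j from
        p(y_j | all other y's, x).
    Yields one full assignment per sweep; early (burn-in) samples are
    typically discarded before using them to estimate probabilities.
    """
    y = dict(init)
    for _ in range(num_sweeps):
        for j in y:                      # sweep over the variables in turn
            y[j] = sample_conditional(j, y, x)
        yield dict(y)
```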

Gibbs Sampling Full conditionals only need to condition on the Markov Blanket. [Figure: the thirteen-node example graph, highlighting the Markov blanket of X_6] Must be easy to sample from conditionals. Many conditionals are log-concave and are amenable to adaptive rejection sampling. 60

Learning Objectives Bayesian Networks You should be able to 1. Identify the conditional independence assumptions given by a generative story or a specification of a joint distribution 2. Draw a Bayesian network given a set of conditional independence assumptions 3. Define the joint distribution specified by a Bayesian network 4. Use domain knowledge to construct a (simple) Bayesian network for a real-world modeling problem 5. Depict familiar models as Bayesian networks 6. Use d-separation to prove the existence of conditional independencies in a Bayesian network 7. Employ a Markov blanket to identify conditional independence assumptions of a graphical model 8. Develop a supervised learning algorithm for a Bayesian network 9. Use samples from a joint distribution to compute marginal probabilities 10. Sample from the joint distribution specified by a generative story 11. Implement a Gibbs sampler for a Bayesian network 61