Data Mining 2018 Bayesian Networks (1)


Data Mining 2018: Bayesian Networks (1)
Ad Feelders, Universiteit Utrecht

Do you like noodles?

                      Answer
Race    Gender    Yes     No
Black   Male       10     40
Black   Female     30     20
White   Male      100    100
White   Female    120     80

Do you like noodles?

Undirected graph: G — A — R (edges Gender—Answer and Race—Answer), which encodes G ⊥ R | A.

Strange: Gender and Race are prior to Answer, but this model says they are independent given Answer!

Do you like noodles?

Marginal table for Gender and Race:

              Race
Gender    Black   White
Male         50     200
Female       50     200

From this table we conclude that Race and Gender are independent in the data.

Do you like noodles?

Directed graph: G → A ← R, which encodes G ⊥ R but not G ⊥ R | A.

Gender and Race are marginally independent (but dependent given Answer).

Do you like noodles?

Table for Gender and Race given Answer = yes:

              Race
Gender    Black   White
Male         10     100
Female       30     120

Table for Gender and Race given Answer = no:

              Race
Gender    Black   White
Male         40     100
Female       20      80

From these tables we conclude that Race and Gender are dependent given Answer.
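These checks are easy to reproduce numerically. Below is a minimal sketch (assuming NumPy; the array layout and helper name are mine) that computes the cross-product (odds) ratio for the marginal and conditional Gender × Race tables; in a 2 × 2 table a ratio of 1 indicates independence.

```python
import numpy as np

# Counts from the survey, indexed [gender][race][answer]:
# gender: 0=Male, 1=Female; race: 0=Black, 1=White; answer: 0=Yes, 1=No
counts = np.array([[[10, 40], [100, 100]],
                   [[30, 20], [120, 80]]], dtype=float)

def odds_ratio(t):
    """Cross-product ratio of a 2x2 table; 1.0 means independence."""
    return (t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0])

# Marginal Gender x Race table (sum out Answer):
marginal = counts.sum(axis=2)
print("marginal odds ratio:", odds_ratio(marginal))

# Gender x Race tables conditional on each value of Answer:
for a, label in enumerate(["yes", "no"]):
    print(f"odds ratio given Answer={label}:", odds_ratio(counts[:, :, a]))
```

The marginal ratio is exactly 1, while the ratios given Answer = yes and Answer = no come out as 0.4 and 1.6, matching the (in)dependence pattern read off the tables above.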

Directed Independence Graphs

G = (K, E), where K is a set of vertices and E is a set of edges, i.e. ordered pairs of vertices. G contains no directed cycles: it is a directed acyclic graph (DAG).

Associated terminology: parent/child, ancestor/descendant, ancestral set.

Because G is a DAG, there exists a complete ordering of the vertices that is respected in the graph (edges point from lower-ordered to higher-ordered nodes).

The basic graph relations (each illustrated on an example DAG in the slides):

Parents of node i: pa(i)
Ancestors of node i: an(i)
Ancestral set of node i: an+(i), i.e. an(i) ∪ {i}
Children of node i: ch(i)
Descendants of node i: de(i)

Construction of DAG

Suppose that prior knowledge tells us the variables can be labeled X_1, X_2, ..., X_k such that X_i is prior to X_{i+1} (for example: a causal or temporal ordering).

Corresponding to this ordering we can use the product rule to factorize the joint distribution of X_1, X_2, ..., X_k as

P(X) = P(X_1) P(X_2 \mid X_1) \cdots P(X_k \mid X_{k-1}, X_{k-2}, \ldots, X_1)

This is an identity of probability theory; no independence assumptions have been made yet!

Constructing a DAG from pairwise independencies

In constructing a DAG, an arrow is drawn from i to j, where i < j, unless P(X_j | X_{j-1}, ..., X_1) does not depend on X_i; in other words, unless

X_j ⊥ X_i | X_C,   with C = {1, ..., j} \ {i, j}.

More loosely: j ⊥ i | prior variables.

Compare this to pairwise independence in undirected independence graphs: j ⊥ i | rest.

Construction of DAG

(figure: complete DAG on nodes 1, 2, 3, 4, following the ordering)

P(X) = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_1, X_2) P(X_4 \mid X_1, X_2, X_3)

Suppose the following independencies are given:
1. X_1 ⊥ X_2
2. X_4 ⊥ X_3 | (X_1, X_2)
3. X_1 ⊥ X_3 | X_2

Construction of DAG

P(X) = P(X_1) \underbrace{P(X_2 \mid X_1)}_{P(X_2)} P(X_3 \mid X_1, X_2) P(X_4 \mid X_1, X_2, X_3)

1. If X_1 ⊥ X_2, then P(X_2 | X_1) = P(X_2). The edge 1 → 2 is removed.

Construction of DAG

P(X) = P(X_1) P(X_2) P(X_3 \mid X_1, X_2) P(X_4 \mid X_1, X_2, X_3)

Construction of DAG

P(X) = P(X_1) P(X_2) P(X_3 \mid X_1, X_2) \underbrace{P(X_4 \mid X_1, X_2, X_3)}_{P(X_4 \mid X_1, X_2)}

2. If X_4 ⊥ X_3 | (X_1, X_2), then P(X_4 | X_1, X_2, X_3) = P(X_4 | X_1, X_2). The edge 3 → 4 is removed.

Construction of DAG

P(X) = P(X_1) P(X_2) P(X_3 \mid X_1, X_2) P(X_4 \mid X_1, X_2)

Construction of DAG

P(X) = P(X_1) P(X_2) \underbrace{P(X_3 \mid X_1, X_2)}_{P(X_3 \mid X_2)} P(X_4 \mid X_1, X_2)

3. If X_1 ⊥ X_3 | X_2, then P(X_3 | X_1, X_2) = P(X_3 | X_2). The edge 1 → 3 is removed.

Construction of DAG

P(X) = P(X_1) P(X_2) P(X_3 \mid X_2) P(X_4 \mid X_1, X_2)

Joint density of a Bayesian Network

We can write the joint density more elegantly as

P(X_1, \ldots, X_k) = \prod_{i=1}^{k} P(X_i \mid X_{pa(i)})
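As a small illustration of this factorisation, here is a sketch that evaluates the joint probability of the four-variable network constructed above, P(X) = P(X_1) P(X_2) P(X_3 | X_2) P(X_4 | X_1, X_2). Only the factorisation structure comes from the slides; the CPT numbers below are made up for illustration.

```python
# Parent sets of the constructed DAG: 3 depends on 2; 4 depends on 1 and 2.
parents = {1: (), 2: (), 3: (2,), 4: (1, 2)}

# cpt[i] maps (tuple of parent values) -> {value: probability}
cpt = {
    1: {(): {1: 0.5, 2: 0.5}},
    2: {(): {1: 0.6, 2: 0.4}},
    3: {(1,): {1: 0.7, 2: 0.3}, (2,): {1: 0.2, 2: 0.8}},
    4: {(1, 1): {1: 0.9, 2: 0.1}, (1, 2): {1: 0.4, 2: 0.6},
        (2, 1): {1: 0.3, 2: 0.7}, (2, 2): {1: 0.5, 2: 0.5}},
}

def joint(x):
    """P(X1=x[1], ..., X4=x[4]) as a product of local conditionals."""
    p = 1.0
    for i, pa in parents.items():
        pa_vals = tuple(x[j] for j in pa)
        p *= cpt[i][pa_vals][x[i]]
    return p

print(joint({1: 1, 2: 1, 3: 1, 4: 2}))  # 0.5 * 0.6 * 0.7 * 0.1 = 0.021
```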

Independence Properties of DAGs: d-Separation and Moral Graphs

Can we infer other/stronger independence statements from the directed graph, like we did using separation in undirected graphical models? Yes; the relevant concept is called d-separation. There are two ways to establish it:
1. establishing d-separation directly (Pearl);
2. establishing d-separation via the moral graph and ordinary (undirected) separation.
We discuss each in turn.

Independence Properties of DAGs: d-Separation

A path p is blocked by a set of nodes Z if and only if:
1. p contains a chain A → B → C or a fork A ← B → C such that the middle node B is in Z; or
2. p contains a collider A → B ← C such that the collision node B is not in Z, and no descendant of B is in Z either.

If Z blocks every path between two nodes X and Y, then X and Y are d-separated by Z, and thus X and Y are independent given Z.
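To see these rules at work on the noodles example: in the DAG G → A ← R, the only path between G and R is G → A ← R, which contains the collider A. For Z = ∅ the collision node A is not in Z and has no descendants in Z, so the path is blocked: G and R are d-separated, hence G ⊥ R. For Z = {A} the collision node is in Z, so the path is unblocked, and G and R are in general dependent given A, exactly the pattern found in the contingency tables earlier.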

Independence Properties of DAGs: Moral Graph

Given a DAG G = (K, E), we construct the moral graph G^m by "marrying parents" and deleting directions, that is:
1. For each i ∈ K, we connect all vertices in pa(i) with undirected edges.
2. We replace all directed edges in E with undirected ones.

Independence Properties of DAGs: Moral Graph

The directed independence graph G possesses the conditional independence properties of its associated moral graph G^m. Why? We have the factorisation

P(X) = \prod_{i=1}^{k} P(X_i \mid X_{pa(i)}) = \prod_{i=1}^{k} g_i(X_i, X_{pa(i)})

by setting g_i(X_i, X_{pa(i)}) = P(X_i | X_{pa(i)}).

Independence Properties of DAGs: Moral Graph

We have the factorisation

P(X) = \prod_{i=1}^{k} g_i(X_{a_i})    (1)

with a_i = {i} ∪ pa(i). We thus have a factorisation of the joint probability distribution in terms of functions g_i(X_{a_i}). By application of the factorisation criterion, the sets a_i become cliques in the undirected independence graph. Such cliques are formed by moralization.

Moralisation: Example

(figure: example DAG on X_1, ..., X_5)

Moralisation: Example

(figure: the moral graph of the DAG above)

{i} ∪ pa(i) becomes a complete subgraph in the moral graph (by marrying all unmarried parents).

Moralisation Continued

Warning: the complete moral graph can obscure independencies! To verify i ⊥ j | S, construct the moral graph on

A = an+({i, j} ∪ S),

that is, on i, j, S and all their ancestors.

Moralisation Continued

Since for l ∈ A we have pa(l) ⊆ A, the joint distribution of X_A is given by

P(X_A) = \prod_{l ∈ A} P(X_l \mid X_{pa(l)})

which corresponds to the subgraph G_A of G.

1. This is a product of factors P(X_l | X_{pa(l)}), involving the variables X_{{l} ∪ pa(l)} only.
2. So it factorizes according to G_A^m, and thus the independence properties for undirected graphs apply.
3. So, if S separates i from j in G_A^m, then i ⊥ j | S.
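These three steps translate directly into code. Below is a minimal sketch, assuming networkx (the function name and example are mine): build the ancestral set, moralize the induced subgraph, delete S, and test ordinary separation.

```python
import networkx as nx

def independent_via_moral_graph(dag, i, j, S):
    """Test i _|_ j | S by ordinary separation in the moral graph
    of the ancestral set an+({i, j} u S)."""
    S = set(S)
    # Step 1: ancestral set A = an+({i, j} u S)
    A = {i, j} | S
    for v in {i, j} | S:
        A |= nx.ancestors(dag, v)
    # Step 2: moralize the induced subgraph G_A
    sub = dag.subgraph(A)
    moral = nx.Graph(sub.to_undirected())   # drop directions (editable copy)
    for v in sub.nodes:                     # marry all parents of each node
        pa = list(sub.predecessors(v))
        for a in range(len(pa)):
            for b in range(a + 1, len(pa)):
                moral.add_edge(pa[a], pa[b])
    # Step 3: does S separate i and j in the moral graph?
    moral.remove_nodes_from(S)
    return not nx.has_path(moral, i, j)

# The noodles collider G -> A <- R:
dag = nx.DiGraph([("G", "A"), ("R", "A")])
print(independent_via_moral_graph(dag, "G", "R", set()))  # True:  G _|_ R
print(independent_via_moral_graph(dag, "G", "R", {"A"}))  # False: dependent given A
```

Note how the first query works: for S = ∅ the collider A is not in the ancestral set of {G, R}, so the marriage of G and R that would obscure G ⊥ R in the full moral graph never takes place.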

Full moral graph may obscure independencies: example

G → A ← R, with P(G, R, A) = P(G) P(R) P(A | G, R).

Does G ⊥ R hold? Summing out A we obtain:

P(G, R) = \sum_a P(G, R, A = a)                  (sum rule)
        = \sum_a P(G) P(R) P(A = a | G, R)       (BN factorisation)
        = P(G) P(R) \sum_a P(A = a | G, R)
        = P(G) P(R)                              (since \sum_a P(A = a | G, R) = 1)

So G ⊥ R does hold, even though G and R are joined in the full moral graph.

Moralisation Continued: example

(figure: the example DAG on X_1, ..., X_5)

Are X_3 and X_4 independent? Are X_1 and X_3 independent? Are X_3 and X_4 independent given X_5?

Equivalence

When no marrying of parents is required (there are no "immoralities" or "v-structures"), the independence properties of the directed graph are identical to those of its undirected version. These three graphs express the same independence properties:

1 → 2 → 3        1 ← 2 ← 3        1 ← 2 → 3

Learning Bayesian Networks

1. Parameter learning: the structure is known/given; we only need to estimate the conditional probabilities from the data.
2. Structure learning: the structure is unknown; we need to learn the network's structure as well as the corresponding conditional probabilities from the data.

Maximum Likelihood Estimation

Find the value of the unknown parameter(s) that maximizes the probability of the observed data.

Take n independent observations on a binary variable X ∈ {1, 2}. We observe n(1) outcomes X = 1 and n(2) = n - n(1) outcomes X = 2. What is the maximum likelihood estimate of p(1)?

The likelihood function (the probability of the data) is given by

L = p(1)^{n(1)} (1 - p(1))^{n - n(1)}

Taking the log we get

log L = n(1) log p(1) + (n - n(1)) log(1 - p(1))

Maximum Likelihood Estimation

Take the derivative with respect to p(1), equate it to zero, and solve for p(1):

\frac{d \log L}{d p(1)} = \frac{n(1)}{p(1)} - \frac{n - n(1)}{1 - p(1)} = 0,

since d log x / dx = 1/x (where log is the natural logarithm). Solving for p(1), we get

\hat{p}(1) = n(1) / n,

i.e., the fraction of ones in the sample!
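A quick numeric sanity check of this result (a sketch; the counts are made up): evaluate the log-likelihood on a grid of candidate values and confirm that it peaks at n(1)/n.

```python
import numpy as np

n, n1 = 10, 7  # made-up sample: 7 ones out of 10 observations

p = np.linspace(0.001, 0.999, 999)   # candidate values of p(1)
loglik = n1 * np.log(p) + (n - n1) * np.log(1 - p)

print("argmax of log-likelihood:", p[np.argmax(loglik)])  # ~0.7
print("n(1)/n:", n1 / n)                                  # 0.7
```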

ML Estimation of the Multinomial Distribution

Let X ∈ {1, 2, ..., J}. Estimate the probabilities p(1), p(2), ..., p(J) of getting outcomes 1, 2, ..., J.

If in n trials we observe n(1) outcomes of 1, n(2) outcomes of 2, ..., n(J) outcomes of J, then the obvious guess is to estimate

\hat{p}(j) = n(j) / n,   j = 1, 2, ..., J.

This is also the maximum likelihood estimate.

BN Factorisation

For a given BN-DAG, the joint distribution factorises according to

P(X) = \prod_{i=1}^{k} p(X_i \mid X_{pa(i)})

So to specify the distribution we have to estimate the probabilities

p(x_i \mid x_{pa(i)}),   i = 1, 2, ..., k,

i.e. the conditional distribution of each variable given its parents.

ML Estimation of a BN

The joint probability of n independent observations is

P(X^{(1)}, \ldots, X^{(n)}) = \prod_{j=1}^{n} P(X^{(j)}) = \prod_{j=1}^{n} \prod_{i=1}^{k} p(x_i^{(j)} \mid x_{pa(i)}^{(j)}),

where X^{(j)} denotes the j-th row of the data table. The likelihood function is therefore given by

L = \prod_{i=1}^{k} \prod_{x_i, x_{pa(i)}} p(x_i \mid x_{pa(i)})^{n(x_i, x_{pa(i)})},

where n(x_i, x_{pa(i)}) is a count of the number of records with X_i = x_i and X_{pa(i)} = x_{pa(i)}.

ML Estimation of a BN

Taking the log of the likelihood, we get

\log L = \sum_{i=1}^{k} \sum_{x_i, x_{pa(i)}} n(x_i, x_{pa(i)}) \log p(x_i \mid x_{pa(i)})

Maximize the log-likelihood with respect to the unknown parameters p(x_i | x_{pa(i)}). This decomposes into a collection of independent multinomial estimation problems: a separate estimation problem for each X_i and each configuration of X_{pa(i)}.

ML Estimation of a BN

The maximum likelihood estimate of p(x_i | x_{pa(i)}) is given by

\hat{p}(x_i \mid x_{pa(i)}) = \frac{n(x_i, x_{pa(i)})}{n(x_{pa(i)})},

where n(x_i, x_{pa(i)}) is the number of records in the data with X_i = x_i and X_{pa(i)} = x_{pa(i)}, and n(x_{pa(i)}) is the number of records with X_{pa(i)} = x_{pa(i)}.

Example BN and Factorisation

(figure: DAG with edges 1 → 3, 2 → 3, 3 → 4)

P(X_1, X_2, X_3, X_4) = p_1(X_1) p_2(X_2) p_{3|12}(X_3 \mid X_1, X_2) p_{4|3}(X_4 \mid X_3)

Example BN: Parameters

P(X_1, X_2, X_3, X_4) = p_1(X_1) p_2(X_2) p_{3|12}(X_3 \mid X_1, X_2) p_{4|3}(X_4 \mid X_3)

Now we have to estimate the following parameters (X_4 is ternary, the rest binary):

p_1(1);  p_1(2) = 1 - p_1(1)
p_2(1);  p_2(2) = 1 - p_2(1)
p_{3|12}(1 | 1, 1);  p_{3|12}(2 | 1, 1) = 1 - p_{3|12}(1 | 1, 1)
p_{3|12}(1 | 1, 2);  p_{3|12}(2 | 1, 2) = 1 - p_{3|12}(1 | 1, 2)
p_{3|12}(1 | 2, 1);  p_{3|12}(2 | 2, 1) = 1 - p_{3|12}(1 | 2, 1)
p_{3|12}(1 | 2, 2);  p_{3|12}(2 | 2, 2) = 1 - p_{3|12}(1 | 2, 2)
p_{4|3}(1 | 1), p_{4|3}(2 | 1);  p_{4|3}(3 | 1) = 1 - p_{4|3}(1 | 1) - p_{4|3}(2 | 1)
p_{4|3}(1 | 2), p_{4|3}(2 | 2);  p_{4|3}(3 | 2) = 1 - p_{4|3}(1 | 2) - p_{4|3}(2 | 2)
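Counting the free parameters listed above: 1 for X_1, 1 for X_2, 4 for X_3 (one per configuration of its two binary parents), and 2 × 2 = 4 for X_4 (two free probabilities per value of its binary parent, since X_4 is ternary), giving 1 + 1 + 4 + 4 = 10 parameters in total. By comparison, an unrestricted joint distribution over the same four variables would need 2 · 2 · 2 · 3 - 1 = 23.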

Example Data Set

obs  X1  X2  X3  X4
  1   1   1   1   1
  2   1   1   1   1
  3   1   1   2   1
  4   1   2   2   1
  5   1   2   2   2
  6   2   1   1   2
  7   2   1   2   3
  8   2   1   2   3
  9   2   2   2   3
 10   2   2   1   3

Maximum Likelihood Estimation

Using the example data set above:

\hat{p}_1(1) = n(x_1 = 1) / n = 5/10 = 1/2

Maximum Likelihood Estimation

\hat{p}_2(1) = n(x_2 = 1) / n = 6/10 = 3/5

Maximum Likelihood Estimation

\hat{p}_{3|12}(1 | 1, 1) = n(x_1 = 1, x_2 = 1, x_3 = 1) / n(x_1 = 1, x_2 = 1) = 2/3

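To tie these worked estimates together, here is a minimal sketch (assuming pandas; the variable names follow the slides) that computes the maximum likelihood estimates by counting, directly from the example data set:

```python
import pandas as pd

# The example data set from the slides
data = pd.DataFrame({
    "X1": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "X2": [1, 1, 1, 2, 2, 1, 1, 1, 2, 2],
    "X3": [1, 1, 2, 2, 2, 1, 2, 2, 2, 1],
    "X4": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
})

n = len(data)

# p-hat_1(1) = n(x1 = 1) / n
print((data["X1"] == 1).sum() / n)        # 0.5

# p-hat_2(1) = n(x2 = 1) / n
print((data["X2"] == 1).sum() / n)        # 0.6

# p-hat_{3|12}(x3 | x1, x2): counts per parent configuration, normalized
cpt3 = data.groupby(["X1", "X2"])["X3"].value_counts(normalize=True)
print(cpt3.loc[(1, 1, 1)])                # 2/3

# p-hat_{4|3}(x4 | x3)
cpt4 = data.groupby("X3")["X4"].value_counts(normalize=True)
print(cpt4)
```

The printed values reproduce the estimates 1/2, 3/5 and 2/3 derived on the preceding slides.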