Illustration of the K2 Algorithm for Learning Bayes Net Structures

Prof. Carolina Ruiz
Department of Computer Science, WPI
ruiz@cs.wpi.edu
http://www.cs.wpi.edu/~ruiz

The purpose of this handout is to illustrate the use of the K2 algorithm to learn the topology of a Bayes net. The algorithm is taken from [aeh93]. Consider the dataset given in [aeh93]:

case   x_1   x_2   x_3
  1     1     0     0
  2     1     1     1
  3     0     0     1
  4     1     1     1
  5     0     0     0
  6     0     1     1
  7     1     1     1
  8     0     0     0
  9     1     1     1
 10     0     0     0

Assume that x_1 is the classification target.
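To make the walkthrough below easy to reproduce, here is the same dataset written out as a small Python structure. The names D and values are mine, not part of the handout.

    # The 10 cases from the table above, one dict per case.
    D = [
        {"x1": 1, "x2": 0, "x3": 0},   # case 1
        {"x1": 1, "x2": 1, "x3": 1},   # case 2
        {"x1": 0, "x2": 0, "x3": 1},   # case 3
        {"x1": 1, "x2": 1, "x3": 1},   # case 4
        {"x1": 0, "x2": 0, "x3": 0},   # case 5
        {"x1": 0, "x2": 1, "x3": 1},   # case 6
        {"x1": 1, "x2": 1, "x3": 1},   # case 7
        {"x1": 0, "x2": 0, "x3": 0},   # case 8
        {"x1": 1, "x2": 1, "x3": 1},   # case 9
        {"x1": 0, "x2": 0, "x3": 0},   # case 10
    ]
    # Possible values of each attribute (all binary here).
    values = {"x1": [0, 1], "x2": [0, 1], "x3": [0, 1]}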

The K2 algorithm, taken from [aeh93], is included below. This algorithm heuristically searches for the most probable belief-network structure given a database of cases.

1.  procedure K2;
2.  {Input: A set of n nodes, an ordering on the nodes, an upper bound u on the
3.   number of parents a node may have, and a database D containing m cases.}
4.  {Output: For each node, a printout of the parents of the node.}
5.  for i := 1 to n do
6.      π_i := ∅;
7.      P_old := f(i, π_i);  {This function is computed using Equation 20.}
8.      OKToProceed := true;
9.      while OKToProceed and |π_i| < u do
10.         let z be the node in Pred(x_i) - π_i that maximizes f(i, π_i ∪ {z});
11.         P_new := f(i, π_i ∪ {z});
12.         if P_new > P_old then
13.             P_old := P_new;
14.             π_i := π_i ∪ {z};
15.         else OKToProceed := false;
16.     end {while};
17.     write("Node:", x_i, "Parents of x_i:", π_i);
18. end {for};
19. end {K2};

Equation 20 is included below:

    f(i, π_i) = ∏_{j=1}^{q_i} [ (r_i - 1)! / (N_ij + r_i - 1)! ] ∏_{k=1}^{r_i} α_ijk!

where:

π_i: set of parents of node x_i.

q_i = |φ_i|.

φ_i: list of all possible instantiations of the parents of x_i in database D. That is, if p_1, ..., p_s are the parents of x_i, then φ_i is the Cartesian product {v_1^{p_1}, ..., v_{r_{p_1}}^{p_1}} × ... × {v_1^{p_s}, ..., v_{r_{p_s}}^{p_s}} of all the possible values of attributes p_1 through p_s.

r_i = |V_i|.

V_i: list of all possible values of the attribute x_i.

α_ijk: number of cases (i.e., instances) in D in which the attribute x_i is instantiated with its k-th value, and the parents of x_i in π_i are instantiated with the j-th instantiation in φ_i.

N_ij = Σ_{k=1}^{r_i} α_ijk. That is, the number of instances in the database in which the parents of x_i in π_i are instantiated with the j-th instantiation in φ_i.
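The scoring function and the search loop translate almost directly into code. Below is a minimal Python sketch of Equation 20 and of the K2 loop, written to mirror the pseudocode above; the function names f and k2 and the dictionary-based data layout are my own choices, and exact rational arithmetic (Fraction) is used so that the small probabilities computed later in this handout come out exactly. The empty parent set is handled as a single empty instantiation, matching the interpretation of q_i = 0 discussed in the next section.

    from fractions import Fraction
    from itertools import product
    from math import factorial

    def f(i, parents, D, values):
        """Equation 20: score of database D given that node i has the given parents.
        values[v] lists the possible values of variable v; D is a list of dicts."""
        r_i = len(values[i])
        # An empty parent set is treated as one (empty) instantiation.
        insts = list(product(*(values[p] for p in parents))) if parents else [()]
        score = Fraction(1)
        for inst in insts:
            # alpha_ijk: cases where x_i takes its k-th value and the parents
            # take the j-th instantiation; N_ij is their sum.
            alphas = [sum(1 for case in D
                          if case[i] == v and
                          all(case[p] == pv for p, pv in zip(parents, inst)))
                      for v in values[i]]
            N_ij = sum(alphas)
            term = Fraction(factorial(r_i - 1), factorial(N_ij + r_i - 1))
            for a in alphas:
                term *= factorial(a)
            score *= term
        return score

    def k2(order, u, D, values):
        """Greedy K2 search over a fixed node ordering, at most u parents per node."""
        parents = {}
        for idx, x_i in enumerate(order):
            pi = []                          # current parent set pi_i
            p_old = f(x_i, pi, D, values)
            pred = order[:idx]               # Pred(x_i): nodes preceding x_i
            ok = True
            while ok and len(pi) < u:
                candidates = [z for z in pred if z not in pi]
                if not candidates:
                    break
                z = max(candidates, key=lambda c: f(x_i, pi + [c], D, values))
                p_new = f(x_i, pi + [z], D, values)
                if p_new > p_old:
                    p_old, pi = p_new, pi + [z]
                else:
                    ok = False
            parents[x_i] = pi
        return parents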

The informal intuition here is that f(i, π_i) is the probability of the database D given that the parents of x_i are π_i. Below, we follow the K2 algorithm over the database above.

Inputs:
- The set of n = 3 nodes {x_1, x_2, x_3} and the ordering x_1, x_2, x_3 on the nodes. We assume that x_1 is the classification target. As such, the Weka system would place it first in the node ordering so that it can be the parent of each of the predicting attributes.
- The upper bound u = 2 on the number of parents a node may have.
- The database D above, containing m = 10 cases.

K2 Algorithm.

i = 1: Note that for i = 1, the attribute under consideration is x_1. Here, r_1 = 2 since x_1 has two possible values {0, 1}.

1. π_1 := ∅

2. P_old := f(1, ∅) = ∏_{j=1}^{q_1} [ (r_1 - 1)! / (N_1j + r_1 - 1)! ] ∏_{k=1}^{r_1} α_1jk!

Let's compute the necessary values for this formula. Since π_1 = ∅, we have q_1 = 0, so the product would range from j = 1 to j = 0. The standard convention for a product whose upper limit is smaller than its lower limit is that the product equals 1. However, this convention wouldn't work here, since it would give f(i, ∅) = 1 regardless of i; because this value denotes a probability, that would imply that the best option for each node is to have no parents. Hence, the index j is ignored in the formula above (the empty parent set is treated as a single instantiation), while i and k are not. The example on page 3 of [aeh93] shows that this is the authors' intended interpretation of the formula.

α_11 = 5: # of cases with x_1 = 0 (cases 3, 5, 6, 8, 10)
α_12 = 5: # of cases with x_1 = 1 (cases 1, 2, 4, 7, 9)
N_1 = α_11 + α_12 = 10

Hence, P_old := f(1, ∅) = [ (r_1 - 1)! / (N_1 + r_1 - 1)! ] ∏_{k=1}^{r_1} α_1k! = [ 1! / (10 + 2 - 1)! ] · 5! · 5! = (5! · 5!) / 11! = 1/2772

3. Since Pred(x_1) = ∅, the iteration for i = 1 ends here with π_1 = ∅.
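Under the assumptions of the sketch above, this first score can be checked directly:

    print(f("x1", [], D, values))   # -> Fraction(1, 2772), matching P_old = 1/2772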

i = 2: Note that for i = 2, the attribute under consideration is x_2. Here, r_2 = 2 since x_2 has two possible values {0, 1}.

1. π_2 := ∅

2. P_old := f(2, ∅) = ∏_{j=1}^{q_2} [ (r_2 - 1)! / (N_2j + r_2 - 1)! ] ∏_{k=1}^{r_2} α_2jk!

As in the case i = 1, since π_2 = ∅ we have q_2 = 0, so the index j is ignored in the formula.

α_21 = 5: # of cases with x_2 = 0 (cases 1, 3, 5, 8, 10)
α_22 = 5: # of cases with x_2 = 1 (cases 2, 4, 6, 7, 9)
N_2 = α_21 + α_22 = 10

Hence, P_old := f(2, ∅) = [ (r_2 - 1)! / (N_2 + r_2 - 1)! ] ∏_{k=1}^{r_2} α_2k! = [ 1! / (10 + 2 - 1)! ] · 5! · 5! = (5! · 5!) / 11! = 1/2772

3. Since Pred(x_2) = {x_1}, the only candidate parent for i = 2 is z = x_1.

P_new := f(2, π_2 ∪ {x_1}) = f(2, {x_1}) = ∏_{j=1}^{q_2} [ (r_2 - 1)! / (N_2j + r_2 - 1)! ] ∏_{k=1}^{r_2} α_2jk!

φ_2: list of unique instantiations of the parent set {x_1} in D = ((x_1 = 0), (x_1 = 1))
q_2 = |φ_2| = 2

α_211 = 4: # of cases with x_1 = 0 and x_2 = 0 (cases 3, 5, 8, 10)
α_212 = 1: # of cases with x_1 = 0 and x_2 = 1 (case 6)
α_221 = 1: # of cases with x_1 = 1 and x_2 = 0 (case 1)
α_222 = 4: # of cases with x_1 = 1 and x_2 = 1 (cases 2, 4, 7, 9)
N_21 = α_211 + α_212 = 5
N_22 = α_221 + α_222 = 5

P_new = [ 1!/6! · α_211! · α_212! ] · [ 1!/6! · α_221! · α_222! ] = [ 1!/6! · 4! · 1! ] · [ 1!/6! · 1! · 4! ] = (1/(6·5)) · (1/(6·5)) = 1/900

4. Since P_new = 1/900 > P_old = 1/2772, the iteration for i = 2 ends with π_2 = {x_1}.
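The two scores computed for x_2 can be checked the same way with the sketch above:

    print(f("x2", [], D, values))       # -> 1/2772  (P_old, no parents)
    print(f("x2", ["x1"], D, values))   # -> 1/900   (P_new) > 1/2772, so pi_2 = {x1}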

i = 3: Note that for i = 3, the attribute under consideration is x_3. Here, r_3 = 2 since x_3 has two possible values {0, 1}.

1. π_3 := ∅

2. P_old := f(3, ∅) = ∏_{j=1}^{q_3} [ (r_3 - 1)! / (N_3j + r_3 - 1)! ] ∏_{k=1}^{r_3} α_3jk!

As before, since π_3 = ∅ we have q_3 = 0, so the index j is ignored in the formula.

α_31 = 4: # of cases with x_3 = 0 (cases 1, 5, 8, 10)
α_32 = 6: # of cases with x_3 = 1 (cases 2, 3, 4, 6, 7, 9)
N_3 = α_31 + α_32 = 10

Hence, P_old := f(3, ∅) = [ (r_3 - 1)! / (N_3 + r_3 - 1)! ] ∏_{k=1}^{r_3} α_3k! = [ 1! / (10 + 2 - 1)! ] · 4! · 6! = (4! · 6!) / 11! = 1/2310

3. Note that Pred(x_3) = {x_1, x_2}. Initially, π_3 = ∅. We need to compute argmax( f(3, π_3 ∪ {x_1}), f(3, π_3 ∪ {x_2}) ).

f(3, π_3 ∪ {x_1}) = f(3, {x_1}) = ∏_{j=1}^{q_3} [ (r_3 - 1)! / (N_3j + r_3 - 1)! ] ∏_{k=1}^{r_3} α_3jk!

φ_3: list of unique instantiations of the parent set {x_1} in D = ((x_1 = 0), (x_1 = 1))
q_3 = |φ_3| = 2

α_311 = 3: # of cases with x_1 = 0 and x_3 = 0 (cases 5, 8, 10)
α_312 = 2: # of cases with x_1 = 0 and x_3 = 1 (cases 3, 6)
α_321 = 1: # of cases with x_1 = 1 and x_3 = 0 (case 1)
α_322 = 4: # of cases with x_1 = 1 and x_3 = 1 (cases 2, 4, 7, 9)
N_31 = α_311 + α_312 = 5
N_32 = α_321 + α_322 = 5

f(3, {x_1}) = [ 1!/6! · 3! · 2! ] · [ 1!/6! · 1! · 4! ] = (1/(6·5·2)) · (1/(6·5)) = 1/1800

f(3, π_3 ∪ {x_2}) = f(3, {x_2}) = ∏_{j=1}^{q_3} [ (r_3 - 1)! / (N_3j + r_3 - 1)! ] ∏_{k=1}^{r_3} α_3jk!

φ_3: list of unique instantiations of the parent set {x_2} in D = ((x_2 = 0), (x_2 = 1))
q_3 = |φ_3| = 2

α_311 = 4: # of cases with x_2 = 0 and x_3 = 0 (cases 1, 5, 8, 10)
α_312 = 1: # of cases with x_2 = 0 and x_3 = 1 (case 3)
α_321 = 0: # of cases with x_2 = 1 and x_3 = 0 (no case)
α_322 = 5: # of cases with x_2 = 1 and x_3 = 1 (cases 2, 4, 6, 7, 9)
N_31 = α_311 + α_312 = 5
N_32 = α_321 + α_322 = 5

f(3, {x_2}) = [ 1!/6! · 4! · 1! ] · [ 1!/6! · 0! · 5! ] = (1/(6·5)) · (1/6) = 1/180

Here we assume that 0! = 1.

4. Since f(3, {x_2}) = 1/180 > f(3, {x_1}) = 1/1800, then z = x_2. Also, since f(3, {x_2}) = 1/180 > P_old = f(3, ∅) = 1/2310, we set π_3 = {x_2} and P_old := P_new = 1/180.

5. Now, the next iteration of the algorithm for i = 3 considers adding the remaining predecessor of x_3, namely x_1, to the parents of x_3.

f(3, π_3 ∪ {x_1}) = f(3, {x_1, x_2}) = ∏_{j=1}^{q_3} [ (r_3 - 1)! / (N_3j + r_3 - 1)! ] ∏_{k=1}^{r_3} α_3jk!

φ_3: list of unique instantiations of the parent set {x_1, x_2} in D = ((x_1 = 0, x_2 = 0), (x_1 = 0, x_2 = 1), (x_1 = 1, x_2 = 0), (x_1 = 1, x_2 = 1))
q_3 = |φ_3| = 4

α_311 = 3: # of cases with x_1 = 0, x_2 = 0 and x_3 = 0 (cases 5, 8, 10)
α_312 = 1: # of cases with x_1 = 0, x_2 = 0 and x_3 = 1 (case 3)
α_321 = 0: # of cases with x_1 = 0, x_2 = 1 and x_3 = 0 (no case)
α_322 = 1: # of cases with x_1 = 0, x_2 = 1 and x_3 = 1 (case 6)
α_331 = 1: # of cases with x_1 = 1, x_2 = 0 and x_3 = 0 (case 1)
α_332 = 0: # of cases with x_1 = 1, x_2 = 0 and x_3 = 1 (no case)
α_341 = 0: # of cases with x_1 = 1, x_2 = 1 and x_3 = 0 (no case)
α_342 = 4: # of cases with x_1 = 1, x_2 = 1 and x_3 = 1 (cases 2, 4, 7, 9)
N_31 = α_311 + α_312 = 4
N_32 = α_321 + α_322 = 1
N_33 = α_331 + α_332 = 1
N_34 = α_341 + α_342 = 4

f(3, {x_1, x_2}) = [ 1!/(4+2-1)! · 3! · 1! ] · [ 1!/(1+2-1)! · 0! · 1! ] · [ 1!/(1+2-1)! · 1! · 0! ] · [ 1!/(4+2-1)! · 0! · 4! ]
                = [ 1!/5! · 3! · 1! ] · [ 1!/2! · 0! · 1! ] · [ 1!/2! · 1! · 0! ] · [ 1!/5! · 0! · 4! ]
                = (1/(5·4)) · (1/2) · (1/2) · (1/5) = 1/400

6. Since P_new = 1/400 < P_old = 1/180, the iteration for i = 3 ends with π_3 = {x_2}.
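The scores for x_3, and the full run of the algorithm, can be checked with the sketch given earlier; the final call should reproduce the parent sets reported in the Outputs section below.

    print(f("x3", [], D, values))             # -> 1/2310  (P_old)
    print(f("x3", ["x1"], D, values))         # -> 1/1800
    print(f("x3", ["x2"], D, values))         # -> 1/180   best single parent: z = x2
    print(f("x3", ["x2", "x1"], D, values))   # -> 1/400   < 1/180, so x1 is rejected
    print(k2(["x1", "x2", "x3"], 2, D, values))
    # -> {'x1': [], 'x2': ['x1'], 'x3': ['x2']}, i.e. x1 -> x2 -> x3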

Outputs: For each node, a printout of the parents of the node.

Node: x_1, Parents of x_1: π_1 = ∅
Node: x_2, Parents of x_2: π_2 = {x_1}
Node: x_3, Parents of x_3: π_3 = {x_2}

This concludes the run of K2 over the database D. The learned topology is x_1 → x_2 → x_3.

References

[aeh93] Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Technical Report KSL-91-02, Knowledge Systems Laboratory, Medical Computer Science, Stanford University School of Medicine, Stanford, CA 94305-5479. Updated Nov. 1993. Available at: http://smiweb.stanford.edu/pubs/smi Abstracts/SMI-9-0355.html