Illustration of the K2 Algorithm for Learning Bayes Net Structures

Prof. Carolina Ruiz
Department of Computer Science, WPI
ruiz@cs.wpi.edu
http://www.cs.wpi.edu/~ruiz

The purpose of this handout is to illustrate the use of the K2 algorithm to learn the topology of a Bayes Net. The algorithm is taken from [aeh93].

Consider the dataset given in [aeh93]:

    case   x_1   x_2   x_3
      1     1     0     0
      2     1     1     1
      3     0     0     1
      4     1     1     1
      5     0     0     0
      6     0     1     1
      7     1     1     1
      8     0     0     0
      9     1     1     1
     10     0     0     0

Assume that $x_1$ is the classification target.
The K2 algorithm taken from [aeh93] is included below. This algorithm heuristically searches for the most probable belief network structure given a database of cases.

    procedure K2;
    {Input: A set of n nodes, an ordering on the nodes, an upper bound u on the
     number of parents a node may have, and a database D containing m cases.}
    {Output: For each node, a printout of the parents of the node.}
    for i := 1 to n do
        π_i := ∅;
        P_old := f(i, π_i);   {This function is computed using Equation 20.}
        OKToProceed := true;
        while OKToProceed and |π_i| < u do
            let z be the node in Pred(x_i) − π_i that maximizes f(i, π_i ∪ {z});
            P_new := f(i, π_i ∪ {z});
            if P_new > P_old then
                P_old := P_new;
                π_i := π_i ∪ {z};
            else OKToProceed := false;
        end {while};
        write('Node:', x_i, 'Parent of x_i:', π_i);
    end {for};
    end {K2};

Equation 20 is included below:

\[ f(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} \alpha_{ijk}! \]

where:

$\pi_i$: set of parents of node $x_i$.

$q_i = |\phi_i|$.

$\phi_i$: list of all possible instantiations of the parents of $x_i$ in database D. That is, if $p_1, \ldots, p_s$ are the parents of $x_i$, then $\phi_i$ is the Cartesian product $\{v^{p_1}_1, \ldots, v^{p_1}_{r_{p_1}}\} \times \cdots \times \{v^{p_s}_1, \ldots, v^{p_s}_{r_{p_s}}\}$ of all the possible values of attributes $p_1$ through $p_s$.

$r_i = |V_i|$.

$V_i$: list of all possible values of the attribute $x_i$.

$\alpha_{ijk}$: number of cases (i.e. instances) in D in which the attribute $x_i$ is instantiated with its $k$-th value, and the parents of $x_i$ in $\pi_i$ are instantiated with the $j$-th instantiation in $\phi_i$.

$N_{ij} = \sum_{k=1}^{r_i} \alpha_{ijk}$. That is, the number of instances in the database in which the parents of $x_i$ in $\pi_i$ are instantiated with the $j$-th instantiation in $\phi_i$.
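Equation 20 can be evaluated mechanically. The following is a minimal Python sketch, not code from [aeh93]: the names `k2_score` and `D` are hypothetical, attributes are referred to by 0-based column index, and exact fractions are used to avoid rounding. It assumes that an empty parent set is treated as one empty instantiation that matches every case, so the product over j has exactly one factor.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def k2_score(data, i, parents, values=(0, 1)):
    """Evaluate Equation 20 for column i (0-based) given a list of parent columns."""
    r = len(values)  # r_i
    score = Fraction(1)
    # phi_i: Cartesian product of the parents' values. With no parents this
    # yields a single empty instantiation (an assumption of this sketch).
    for inst in product(values, repeat=len(parents)):
        rows = [row for row in data
                if all(row[p] == v for p, v in zip(parents, inst))]
        n_ij = len(rows)  # N_ij
        term = Fraction(factorial(r - 1), factorial(n_ij + r - 1))
        for v in values:
            alpha_ijk = sum(1 for row in rows if row[i] == v)
            term *= factorial(alpha_ijk)
        score *= term
    return score


# The ten cases of the handout's dataset, one (x1, x2, x3) tuple per case.
D = [(1, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 1), (0, 0, 0),
     (0, 1, 1), (1, 1, 1), (0, 0, 0), (1, 1, 1), (0, 0, 0)]

print(k2_score(D, 1, [0]))  # score of x2 with x1 as its only parent
```

An instantiation that never occurs in the data contributes a factor of 1 (since $N_{ij} = 0$ and $0! = 1$), so iterating over the full Cartesian product or only over the observed instantiations gives the same score.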
The informal intuition is that $f(i, \pi_i)$ is the probability of the database D given that the parents of $x_i$ are $\pi_i$.

Below, we follow the K2 algorithm over the database above.

Inputs:

The set of n = 3 nodes $\{x_1, x_2, x_3\}$;
the ordering on the nodes $x_1, x_2, x_3$ (we assume that $x_1$ is the classification target; as such, the Weka system would place it first in the node ordering so that it can be the parent of each of the predicting attributes);
the upper bound u = 2 on the number of parents a node may have; and
the database D above containing m = 10 cases.

K2 Algorithm.

i = 1: Note that for i = 1, the attribute under consideration is $x_1$. Here, $r_1 = 2$ since $x_1$ has two possible values {0, 1}.

1. $\pi_1 := \emptyset$

2. $P_{old} := f(1, \emptyset) = \prod_{j=1}^{q_1} \frac{(r_1 - 1)!}{(N_{1j} + r_1 - 1)!} \prod_{k=1}^{r_1} \alpha_{1jk}!$

Let's compute the necessary values for this formula. Since $\pi_1 = \emptyset$, $q_1 = 0$, so the product ranges from j = 1 to j = 0. The standard convention for a product whose upper limit is smaller than its lower limit is that the product equals 1. However, this convention does not work here: it would give $f(i, \emptyset) = 1$ regardless of the value of i, and since this value denotes a probability, it would imply that the best option for each node is to have no parents. Hence, j will be ignored in the formula above, but not i or k (below, an underscore marks the position of the ignored index j). The example on page 3 of [aeh93] shows that this is the authors' intended interpretation of this formula.

$\alpha_{1\_1} = 5$: # of cases with $x_1 = 0$ (cases 3, 5, 6, 8, 10)
$\alpha_{1\_2} = 5$: # of cases with $x_1 = 1$ (cases 1, 2, 4, 7, 9)
$N_{1\_} = \alpha_{1\_1} + \alpha_{1\_2} = 10$

Hence,

\[ P_{old} := f(1, \emptyset) = \frac{(r_1 - 1)!}{(N_{1\_} + r_1 - 1)!} \prod_{k=1}^{r_1} \alpha_{1\_k}! = \frac{(2 - 1)!}{(10 + 2 - 1)!} \prod_{k=1}^{2} \alpha_{1\_k}! = \frac{1!}{11!} \, 5! \, 5! = \frac{1}{2772} \]

3. Since $Pred(x_1) = \emptyset$, the iteration for i = 1 ends here with $\pi_1 = \emptyset$.
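The arithmetic for i = 1 can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the list `x1` is the $x_1$ column of the dataset typed in by hand.

```python
from fractions import Fraction
from math import factorial

# Column x1 of the handout's dataset, cases 1 through 10.
x1 = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]

r1 = 2                                   # x1 takes the values 0 and 1
alpha = [x1.count(v) for v in (0, 1)]    # counts of x1 = 0 and x1 = 1
n1 = sum(alpha)                          # N with the index j ignored

# (r1 - 1)! / (N + r1 - 1)! times the product of the alpha factorials
p_old = Fraction(factorial(r1 - 1), factorial(n1 + r1 - 1))
for a in alpha:
    p_old *= factorial(a)

print(alpha, n1, p_old)  # [5, 5] 10 1/2772
```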
i = 2: Note that for i = 2, the attribute under consideration is $x_2$. Here, $r_2 = 2$ since $x_2$ has two possible values {0, 1}.

1. $\pi_2 := \emptyset$

2. $P_{old} := f(2, \emptyset) = \prod_{j=1}^{q_2} \frac{(r_2 - 1)!}{(N_{2j} + r_2 - 1)!} \prod_{k=1}^{r_2} \alpha_{2jk}!$

Let's compute the necessary values for this formula. Since $\pi_2 = \emptyset$, $q_2 = 0$, and the product would range from j = 1 to j = 0. As discussed for i = 1, the empty-product convention would give $f(i, \emptyset) = 1$ for every i, so j is ignored in the formula, but not i or k (an underscore marks the position of the ignored index j). The example on page 3 of [aeh93] shows that this is the authors' intended interpretation of this formula.

$\alpha_{2\_1} = 5$: # of cases with $x_2 = 0$ (cases 1, 3, 5, 8, 10)
$\alpha_{2\_2} = 5$: # of cases with $x_2 = 1$ (cases 2, 4, 6, 7, 9)
$N_{2\_} = \alpha_{2\_1} + \alpha_{2\_2} = 10$

Hence,

\[ P_{old} := f(2, \emptyset) = \frac{(r_2 - 1)!}{(N_{2\_} + r_2 - 1)!} \prod_{k=1}^{r_2} \alpha_{2\_k}! = \frac{(2 - 1)!}{(10 + 2 - 1)!} \prod_{k=1}^{2} \alpha_{2\_k}! = \frac{1!}{11!} \, 5! \, 5! = \frac{1}{2772} \]

3. Since $Pred(x_2) = \{x_1\}$, the only iteration for i = 2 goes with $z = x_1$.

$P_{new} := f(2, \pi_2 \cup \{x_1\}) = f(2, \{x_1\}) = \prod_{j=1}^{q_2} \frac{(r_2 - 1)!}{(N_{2j} + r_2 - 1)!} \prod_{k=1}^{r_2} \alpha_{2jk}!$

$\phi_2$: list of unique $\pi$-instantiations of $\{x_1\}$ in D = $((x_1 = 0), (x_1 = 1))$
$q_2 = |\phi_2| = 2$

$\alpha_{211} = 4$: # of cases with $x_1 = 0$ and $x_2 = 0$ (cases 3, 5, 8, 10)
$\alpha_{212} = 1$: # of cases with $x_1 = 0$ and $x_2 = 1$ (case 6)
$\alpha_{221} = 1$: # of cases with $x_1 = 1$ and $x_2 = 0$ (case 1)
$\alpha_{222} = 4$: # of cases with $x_1 = 1$ and $x_2 = 1$ (cases 2, 4, 7, 9)
$N_{21} = \alpha_{211} + \alpha_{212} = 5$
$N_{22} = \alpha_{221} + \alpha_{222} = 5$

\[ P_{new} = \prod_{j=1}^{q_2} \frac{(r_2 - 1)!}{(N_{2j} + r_2 - 1)!} \prod_{k=1}^{2} \alpha_{2jk}! = \frac{1!}{(5 + 2 - 1)!} \prod_{k=1}^{2} \alpha_{21k}! \cdot \frac{1!}{(5 + 2 - 1)!} \prod_{k=1}^{2} \alpha_{22k}! \]

\[ = \frac{\alpha_{211}! \, \alpha_{212}!}{6!} \cdot \frac{\alpha_{221}! \, \alpha_{222}!}{6!} = \frac{4! \, 1!}{6!} \cdot \frac{1! \, 4!}{6!} = \frac{1}{6 \cdot 5} \cdot \frac{1}{6 \cdot 5} = \frac{1}{900} \]

4. Since $P_{new} = 1/900 > P_{old} = 1/2772$, the iteration for i = 2 ends with $\pi_2 = \{x_1\}$.
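The comparison made for i = 2 can be reproduced from a contingency count of the $(x_1, x_2)$ pairs. The sketch below is illustrative only; `pairs`, `counts`, and `factor` are hypothetical names, and `factor` computes a single j-term of Equation 20.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

# (x1, x2) pairs for cases 1 through 10 of the handout's dataset.
pairs = [(1, 0), (1, 1), (0, 0), (1, 1), (0, 0),
         (0, 1), (1, 1), (0, 0), (1, 1), (0, 0)]
counts = Counter(pairs)  # counts[(a, b)] = number of cases with x1 = a, x2 = b
r2 = 2


def factor(alphas):
    """One j-term of Equation 20: (r - 1)! / (N + r - 1)! * prod(alpha!)."""
    term = Fraction(factorial(r2 - 1), factorial(sum(alphas) + r2 - 1))
    for a in alphas:
        term *= factorial(a)
    return term


# No parents: a single factor over the marginal counts of x2.
p_old = factor([counts[(0, b)] + counts[(1, b)] for b in (0, 1)])
# Parent x1: one factor for x1 = 0 and one for x1 = 1.
p_new = (factor([counts[(0, 0)], counts[(0, 1)]])
         * factor([counts[(1, 0)], counts[(1, 1)]]))

print(p_old, p_new, p_new > p_old)  # 1/2772 1/900 True
```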
i = 3: Note that for i = 3, the attribute under consideration is $x_3$. Here, $r_3 = 2$ since $x_3$ has two possible values {0, 1}.

1. $\pi_3 := \emptyset$

2. $P_{old} := f(3, \emptyset) = \prod_{j=1}^{q_3} \frac{(r_3 - 1)!}{(N_{3j} + r_3 - 1)!} \prod_{k=1}^{r_3} \alpha_{3jk}!$

Let's compute the necessary values for this formula. Since $\pi_3 = \emptyset$, $q_3 = 0$, and the product would range from j = 1 to j = 0. As discussed for i = 1, the empty-product convention would give $f(i, \emptyset) = 1$ for every i, so j is ignored in the formula, but not i or k (an underscore marks the position of the ignored index j). The example on page 3 of [aeh93] shows that this is the authors' intended interpretation of this formula.

$\alpha_{3\_1} = 4$: # of cases with $x_3 = 0$ (cases 1, 5, 8, 10)
$\alpha_{3\_2} = 6$: # of cases with $x_3 = 1$ (cases 2, 3, 4, 6, 7, 9)
$N_{3\_} = \alpha_{3\_1} + \alpha_{3\_2} = 10$

Hence,

\[ P_{old} := f(3, \emptyset) = \frac{(r_3 - 1)!}{(N_{3\_} + r_3 - 1)!} \prod_{k=1}^{r_3} \alpha_{3\_k}! = \frac{(2 - 1)!}{(10 + 2 - 1)!} \prod_{k=1}^{2} \alpha_{3\_k}! = \frac{1!}{11!} \, 4! \, 6! = \frac{1}{2310} \]

3. Note that $Pred(x_3) = \{x_1, x_2\}$. Initially, $\pi_3 = \emptyset$. We need to compute $\arg\max(f(3, \pi_3 \cup \{x_1\}), f(3, \pi_3 \cup \{x_2\}))$.

First candidate:

$f(3, \pi_3 \cup \{x_1\}) = f(3, \{x_1\}) = \prod_{j=1}^{q_3} \frac{(r_3 - 1)!}{(N_{3j} + r_3 - 1)!} \prod_{k=1}^{r_3} \alpha_{3jk}!$

$\phi_3$: list of unique $\pi$-instantiations of $\{x_1\}$ in D = $((x_1 = 0), (x_1 = 1))$
$q_3 = |\phi_3| = 2$

$\alpha_{311} = 3$: # of cases with $x_1 = 0$ and $x_3 = 0$ (cases 5, 8, 10)
$\alpha_{312} = 2$: # of cases with $x_1 = 0$ and $x_3 = 1$ (cases 3, 6)
$\alpha_{321} = 1$: # of cases with $x_1 = 1$ and $x_3 = 0$ (case 1)
$\alpha_{322} = 4$: # of cases with $x_1 = 1$ and $x_3 = 1$ (cases 2, 4, 7, 9)
$N_{31} = \alpha_{311} + \alpha_{312} = 5$
$N_{32} = \alpha_{321} + \alpha_{322} = 5$

\[ f(3, \{x_1\}) = \prod_{j=1}^{q_3} \frac{(r_3 - 1)!}{(N_{3j} + r_3 - 1)!} \prod_{k=1}^{2} \alpha_{3jk}! = \frac{\alpha_{311}! \, \alpha_{312}!}{6!} \cdot \frac{\alpha_{321}! \, \alpha_{322}!}{6!} = \frac{3! \, 2!}{6!} \cdot \frac{1! \, 4!}{6!} = \frac{1}{6 \cdot 5 \cdot 2} \cdot \frac{1}{6 \cdot 5} = \frac{1}{1800} \]

Second candidate:

$f(3, \pi_3 \cup \{x_2\}) = f(3, \{x_2\}) = \prod_{j=1}^{q_3} \frac{(r_3 - 1)!}{(N_{3j} + r_3 - 1)!} \prod_{k=1}^{r_3} \alpha_{3jk}!$

$\phi_3$: list of unique $\pi$-instantiations of $\{x_2\}$ in D = $((x_2 = 0), (x_2 = 1))$
$q_3 = |\phi_3| = 2$

$\alpha_{311} = 4$: # of cases with $x_2 = 0$ and $x_3 = 0$ (cases 1, 5, 8, 10)
$\alpha_{312} = 1$: # of cases with $x_2 = 0$ and $x_3 = 1$ (case 3)
$\alpha_{321} = 0$: # of cases with $x_2 = 1$ and $x_3 = 0$ (no case)
$\alpha_{322} = 5$: # of cases with $x_2 = 1$ and $x_3 = 1$ (cases 2, 4, 6, 7, 9)
$N_{31} = \alpha_{311} + \alpha_{312} = 5$
$N_{32} = \alpha_{321} + \alpha_{322} = 5$

\[ f(3, \{x_2\}) = \prod_{j=1}^{q_3} \frac{(r_3 - 1)!}{(N_{3j} + r_3 - 1)!} \prod_{k=1}^{2} \alpha_{3jk}! = \frac{\alpha_{311}! \, \alpha_{312}!}{6!} \cdot \frac{\alpha_{321}! \, \alpha_{322}!}{6!} = \frac{4! \, 1!}{6!} \cdot \frac{0! \, 5!}{6!} = \frac{1}{6 \cdot 5} \cdot \frac{1}{6} = \frac{1}{180} \]

Here we assume that 0! = 1.

4. Since $f(3, \{x_2\}) = 1/180 > f(3, \{x_1\}) = 1/1800$, $z = x_2$. Also, since $f(3, \{x_2\}) = 1/180 > P_{old} = f(3, \emptyset) = 1/2310$, we set $\pi_3 = \{x_2\}$ and $P_{old} := P_{new} = 1/180$.

5. The next iteration of the algorithm for i = 3 considers adding the remaining predecessor of $x_3$, namely $x_1$, to the parents of $x_3$.

$f(3, \pi_3 \cup \{x_1\}) = f(3, \{x_1, x_2\}) = \prod_{j=1}^{q_3} \frac{(r_3 - 1)!}{(N_{3j} + r_3 - 1)!} \prod_{k=1}^{r_3} \alpha_{3jk}!$

$\phi_3$: list of unique $\pi$-instantiations of $\{x_1, x_2\}$ in D = $((x_1 = 0, x_2 = 0), (x_1 = 0, x_2 = 1), (x_1 = 1, x_2 = 0), (x_1 = 1, x_2 = 1))$
$q_3 = |\phi_3| = 4$

$\alpha_{311} = 3$: # of cases with $x_1 = 0$, $x_2 = 0$ and $x_3 = 0$ (cases 5, 8, 10)
$\alpha_{312} = 1$: # of cases with $x_1 = 0$, $x_2 = 0$ and $x_3 = 1$ (case 3)
$\alpha_{321} = 0$: # of cases with $x_1 = 0$, $x_2 = 1$ and $x_3 = 0$ (no case)
$\alpha_{322} = 1$: # of cases with $x_1 = 0$, $x_2 = 1$ and $x_3 = 1$ (case 6)
$\alpha_{331} = 1$: # of cases with $x_1 = 1$, $x_2 = 0$ and $x_3 = 0$ (case 1)
$\alpha_{332} = 0$: # of cases with $x_1 = 1$, $x_2 = 0$ and $x_3 = 1$ (no case)
$\alpha_{341} = 0$: # of cases with $x_1 = 1$, $x_2 = 1$ and $x_3 = 0$ (no case)
$\alpha_{342} = 4$: # of cases with $x_1 = 1$, $x_2 = 1$ and $x_3 = 1$ (cases 2, 4, 7, 9)
$N_{31} = \alpha_{311} + \alpha_{312} = 4$
$N_{32} = \alpha_{321} + \alpha_{322} = 1$
$N_{33} = \alpha_{331} + \alpha_{332} = 1$
$N_{34} = \alpha_{341} + \alpha_{342} = 4$

\[ f(3, \{x_1, x_2\}) = \prod_{j=1}^{q_3} \frac{(r_3 - 1)!}{(N_{3j} + r_3 - 1)!} \prod_{k=1}^{2} \alpha_{3jk}! = \frac{\prod_{k=1}^{2} \alpha_{31k}!}{(4 + 2 - 1)!} \cdot \frac{\prod_{k=1}^{2} \alpha_{32k}!}{(1 + 2 - 1)!} \cdot \frac{\prod_{k=1}^{2} \alpha_{33k}!}{(1 + 2 - 1)!} \cdot \frac{\prod_{k=1}^{2} \alpha_{34k}!}{(4 + 2 - 1)!} \]

\[ = \frac{\alpha_{311}! \, \alpha_{312}!}{5!} \cdot \frac{\alpha_{321}! \, \alpha_{322}!}{2!} \cdot \frac{\alpha_{331}! \, \alpha_{332}!}{2!} \cdot \frac{\alpha_{341}! \, \alpha_{342}!}{5!} = \frac{3! \, 1!}{5!} \cdot \frac{0! \, 1!}{2!} \cdot \frac{1! \, 0!}{2!} \cdot \frac{0! \, 4!}{5!} = \frac{1}{5 \cdot 4} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{5} = \frac{1}{400} \]

6. Since $P_{new} = 1/400 < P_{old} = 1/180$, the iteration for i = 3 ends with $\pi_3 = \{x_2\}$.
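The three iterations traced above can be replayed end to end. The following Python sketch is an illustration under stated assumptions rather than the procedure of [aeh93] verbatim: the names `score` and `k2` are hypothetical, nodes are 0-based column indices, an empty parent set is scored as one empty instantiation, and the loop stops early when no predecessor remains to be added (the quoted pseudocode leaves that case implicit).

```python
from fractions import Fraction
from itertools import product
from math import factorial

# The ten cases of the handout's dataset, one (x1, x2, x3) tuple per case.
D = [(1, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 1), (0, 0, 0),
     (0, 1, 1), (1, 1, 1), (0, 0, 0), (1, 1, 1), (0, 0, 0)]


def score(i, parents, data=D, values=(0, 1)):
    """Equation 20 for column i given a list of parent columns."""
    r = len(values)
    total = Fraction(1)
    for inst in product(values, repeat=len(parents)):
        rows = [row for row in data
                if all(row[p] == v for p, v in zip(parents, inst))]
        term = Fraction(factorial(r - 1), factorial(len(rows) + r - 1))
        for v in values:
            term *= factorial(sum(1 for row in rows if row[i] == v))
        total *= term
    return total


def k2(n, u, data=D):
    """Greedy parent search over nodes 0..n-1 in the given order."""
    result = {}
    for i in range(n):
        pi = []
        p_old = score(i, pi, data)
        while len(pi) < u:
            # Pred(x_i) - pi_i: earlier nodes that are not yet parents.
            candidates = [z for z in range(i) if z not in pi]
            if not candidates:  # assumption: nothing left to try, so stop
                break
            z = max(candidates, key=lambda c: score(i, pi + [c], data))
            p_new = score(i, pi + [z], data)
            if p_new <= p_old:
                break
            p_old, pi = p_new, pi + [z]
        result[i] = pi
    return result


parents = k2(n=3, u=2)
print(parents)  # {0: [], 1: [0], 2: [1]}
```

In 1-based terms, the result reads: $x_1$ has no parents, $x_2$ has parent $x_1$, and $x_3$ has parent $x_2$.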
Outputs: For each node, a printout of the parents of the node.

Node: $x_1$, Parent of $x_1$: $\pi_1 = \emptyset$
Node: $x_2$, Parent of $x_2$: $\pi_2 = \{x_1\}$
Node: $x_3$, Parent of $x_3$: $\pi_3 = \{x_2\}$

This concludes the run of K2 over the database D. The learned topology is

$x_1 \rightarrow x_2 \rightarrow x_3$

References

[aeh93] Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Technical Report KSL-91-02, Knowledge Systems Laboratory, Medical Computer Science, Stanford University School of Medicine, Stanford, CA 94305-5479. Updated Nov. 1993. Available at: http://smiweb.stanford.edu/pubs/smi Abstracts/SMI-9-0355.html.