Data Mining Homework Solutions: Chapter 4 and Chapter 5 Exercises


Chapter 4, Problem 3:

(a) This collection of training examples contains 4 instances of class + and 5 instances of class -. Hence, the entropy of the collection with respect to the + class is
  Entropy = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.52 + 0.47 = 0.99

(b) Splitting on a1 and a2:

Split on a1:
  N1 (a1 = T): 3 +, 1 -    Entropy(N1) = 0.81
  N2 (a1 = F): 1 +, 4 -    Entropy(N2) = 0.72
  Δgain = 0.99 - (4/9)(0.81) - (5/9)(0.72) = 0.23

Split on a2:
  N1 (a2 = F): 2 +, 2 -    Entropy(N1) = 1.00
  N2 (a2 = T): 2 +, 3 -    Entropy(N2) = 0.97
  Δgain = 0.99 - (4/9)(1.00) - (5/9)(0.97) = 0.01

So a1 gives the larger information gain.

(c) For the continuous attribute a3, the information gain is computed for every candidate split position; the best split for a3 is at 2.0.

(d) Based on the calculations made in parts (b) and (c), the best split is a1.

(e) Classification error rate for the full collection = 1 - max[4/9, 5/9] = 1 - 5/9 = 0.44

Split on a1:
  Error(N1) = 0.25, Error(N2) = 0.20
  Δ = 0.44 - (4/9)(0.25) - (5/9)(0.20) = 0.22

Split on a2:
  Error(N1) = 0.50, Error(N2) = 0.40
  Δ = 0.44 - (4/9)(0.50) - (5/9)(0.40) = 0.00

So, based on classification error rate, a1 is the best split.

(f) For the full collection, GINI = 1 - (4/9)^2 - (5/9)^2 = 0.49
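Every split-quality number above follows the same pattern: compute the impurity of the parent node, then subtract the weighted impurity of its children. The short Python sketch below recomputes the entropy-based gains; the per-child class counts (3 +/1 - and 1 +/4 - for a1, 2 +/2 - and 2 +/3 - for a2) are reconstructed from the entropies and weights quoted above rather than copied from the exercise table.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a node with the given class counts, e.g. [4, 5]."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain(parent, children, impurity=entropy):
    """Impurity of the parent minus the weighted impurity of the children."""
    n = sum(parent)
    return impurity(parent) - sum(sum(c) / n * impurity(c) for c in children)

parent = [4, 5]                                     # 4 '+' and 5 '-' examples
print(round(entropy(parent), 2))                    # 0.99

# Assumed child counts, consistent with Entropy(N1)/Entropy(N2) above.
print(round(gain(parent, [[3, 1], [1, 4]]), 2))     # split on a1: 0.23
print(round(gain(parent, [[2, 2], [2, 3]]), 2))     # split on a2: 0.01
```

The same gain helper works for the classification-error and Gini criteria of parts (e) and (f) by swapping in a different impurity function.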

Splitting on a1 and a2 under the GINI criterion:

Split on a1:
  GINI(N1) = 0.38, GINI(N2) = 0.32
  Δ = 0.49 - (4/9)(0.38) - (5/9)(0.32) = 0.14

Split on a2:
  GINI(N1) = 0.50, GINI(N2) = 0.48
  Δ = 0.49 - (4/9)(0.50) - (5/9)(0.48) = 0.00

Hence, a1 is the best split.

Chapter 4, Problem 7:

(a) Before any split, Error = 1 - max(50/100, 50/100) = 1 - 1/2 = 1/2.

Attribute A:
  A = T: 25 +, 0 -     Error = 1 - max(25/25, 0/25) = 1 - 1 = 0
  A = F: 25 +, 50 -    Error = 1 - max(25/75, 50/75) = 1 - 2/3 = 1/3
  Δ = 1/2 - [25/100 * 0 + 75/100 * 1/3] = 1/2 - 1/4 = 0.25

Attribute B:
  B = T: 30 +, 20 -    Error = 1 - max(30/50, 20/50) = 1 - 3/5 = 2/5
  B = F: 20 +, 30 -    Error = 1 - max(20/50, 30/50) = 1 - 3/5 = 2/5
  Δ = 1/2 - [50/100 * 2/5 + 50/100 * 2/5] = 1/2 - 2/5 = 1/10 = 0.1

Attribute C:
  C = T: 25 +, 25 -    Error = 1 - max(25/50, 25/50) = 1 - 1/2 = 1/2
  C = F: 25 +, 25 -    Error = 1 - max(25/50, 25/50) = 1 - 1/2 = 1/2
  Δ = 1/2 - [50/100 * 1/2 + 50/100 * 1/2] = 1/2 - 1/2 = 0

Clearly, attribute A is the best first split since it has the largest gain, so the root of the tree splits on A.
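The error-rate gains for the three candidate root splits can be checked mechanically. A minimal sketch, using the per-branch class counts that appear in the max(...) terms above:

```python
def error_rate(counts):
    """Classification error of a node: 1 minus the majority-class fraction."""
    return 1 - max(counts) / sum(counts)

def error_gain(parent, children):
    """Gain in classification error: parent error minus weighted child error."""
    n = sum(parent)
    return error_rate(parent) - sum(sum(c) / n * error_rate(c) for c in children)

root = [50, 50]                            # 50 '+' and 50 '-' records
splits = {
    "A": [[25, 0], [25, 50]],
    "B": [[30, 20], [20, 30]],
    "C": [[25, 25], [25, 25]],
}
for name, children in splits.items():
    print(name, round(error_gain(root, children), 2))   # A 0.25, B 0.1, C 0.0
```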

(b) The left child of the root (A = true) cannot be split further because it is a pure class. The right child of the root (A = false) holds 25 + and 50 - records, has classification error 1/3, and can be split further on attribute B or attribute C, whichever gives the better gain.

Attribute B:
  B = T: 25 +, 20 -    Error = 1 - max(25/45, 20/45) = 1 - 5/9 = 4/9
  B = F: 0 +, 30 -     Error = 1 - max(0/30, 30/30) = 1 - 1 = 0
  Δ = 1/3 - [45/75 * 4/9 + 30/75 * 0] = 1/3 - 4/15 = 1/15 ≈ 0.067

Attribute C:
  C = T: 0 +, 25 -     Error = 1 - max(0/25, 25/25) = 1 - 1 = 0
  C = F: 25 +, 25 -    Error = 1 - max(25/50, 25/50) = 1 - 1/2 = 1/2
  Δ = 1/3 - [25/75 * 0 + 50/75 * 1/2] = 1/3 - 1/3 = 0

Clearly, attribute B is the better second split since it has the larger gain, so the tree splits on A at the root and on B under the A = false branch.

(c) In the tree from part (b), assign the majority class + to the B = true leaf under A = false (25 +, 20 -); this leaf misclassifies 20 instances. Since the remaining leaf nodes are pure classes, the tree from part (b) misclassifies 20 of the 100 instances in total.
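The part (b) gains can be verified with the same error-gain helper, redefined here so the block runs on its own; the child counts are the contingency values used above.

```python
def error_rate(counts):
    return 1 - max(counts) / sum(counts)

def error_gain(parent, children):
    n = sum(parent)
    return error_rate(parent) - sum(sum(c) / n * error_rate(c) for c in children)

a_false = [25, 50]   # records reaching the A = false child of the root
print(round(error_gain(a_false, [[25, 20], [0, 30]]), 3))   # split on B: 0.067 (= 1/15)
print(round(error_gain(a_false, [[0, 25], [25, 25]]), 3))   # split on C: 0.0
```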

(d) Using the results from part (a), selecting attribute C as the first splitting attribute splits the 100 records into two children (C = true and C = false) with 25 + and 25 - records each. Let us start by finding the second split for the left child (C = true); note that its classification error is 1/2, as computed in part (a).

Attribute A:
  A = T: 25 +, 0 -     Error = 1 - max(25/25, 0/25) = 1 - 1 = 0
  A = F: 0 +, 25 -     Error = 1 - max(0/25, 25/25) = 1 - 1 = 0
  Δ = 1/2 - [25/50 * 0 + 25/50 * 0] = 1/2 - 0 = 0.5

Attribute B:
  B = T: 5 +, 20 -     Error = 1 - max(5/25, 20/25) = 1 - 4/5 = 1/5
  B = F: 20 +, 5 -     Error = 1 - max(20/25, 5/25) = 1 - 4/5 = 1/5
  Δ = 1/2 - [25/50 * 1/5 + 25/50 * 1/5] = 1/2 - 1/5 = 3/10 = 0.3

Clearly, attribute A is the better split for the left subtree.

Now let us find the second splitting attribute for the right child of the root (C = false); its classification error is also 1/2, as computed in part (a).

Attribute A:
  All 50 records (25 +, 25 -) fall into the A = F branch    Error = 1 - max(25/50, 25/50) = 1 - 1/2 = 1/2
  Δ = 1/2 - 50/50 * 1/2 = 1/2 - 1/2 = 0

Attribute B:
  B = T: 25 +, 0 -     Error = 1 - max(25/25, 0/25) = 1 - 1 = 0
  B = F: 0 +, 25 -     Error = 1 - max(0/25, 25/25) = 1 - 1 = 0
  Δ = 1/2 - [25/50 * 0 + 25/50 * 0] = 1/2 - 0 = 0.5

Hence, attribute B is the better split for the right subtree. The resulting tree splits on C at the root, on A under C = true, and on B under C = false. Since all of its leaf nodes are pure classes, 0 of the 100 instances are misclassified: this tree classifies every instance correctly.
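A quick way to confirm the misclassification totals for the two trees is to sum, over the leaves, the count of the minority class. The leaf counts below are the ones appearing in the contingency tables above; this is only a tally of those counts, not a re-induction of the trees.

```python
def misclassified(leaf_counts):
    """Errors of a tree whose leaves predict their majority class."""
    return sum(sum(leaf) - max(leaf) for leaf in leaf_counts)

# Greedy tree from parts (a)-(b): leaves A=T, (A=F, B=T), (A=F, B=F).
tree_a_first = [[25, 0], [25, 20], [0, 30]]
# Tree from part (d): leaves (C=T, A=T), (C=T, A=F), (C=F, B=T), (C=F, B=F).
tree_c_first = [[25, 0], [0, 25], [25, 0], [0, 25]]

print(misclassified(tree_a_first))   # 20
print(misclassified(tree_c_first))   # 0
```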

(e) It is clearly evident from the results of parts (c) and (d) that the greedy approach does not always lead to the decision tree with the lowest misclassification error.

Chapter 4, Problem 9:

Tree (a):
  Number of non-leaf nodes = 2
  Number of leaf nodes = 3
  Number of errors = 7
  Number of classes = 3
  Number of attributes = 16
  Number of records = n
Hence,
  Cost(tree) = 2 log2 16 + 3 log2 3 = 2*4 + 3*1.585 = 8 + 4.755 = 12.755 bits
  Cost(data | tree) = 7 log2 n bits
  Cost(tree, data) = 12.755 + 7 log2 n bits

Tree (b):
  Number of non-leaf nodes = 4
  Number of leaf nodes = 5
  Number of errors = 4
  Number of classes = 3
  Number of attributes = 16
  Number of records = n
Hence,
  Cost(tree) = 4 log2 16 + 5 log2 3 = 4*4 + 5*1.585 = 16 + 7.925 = 23.925 bits
  Cost(data | tree) = 4 log2 n bits
  Cost(tree, data) = 23.925 + 4 log2 n bits

Solving 12.755 + 7 log2 n = 23.925 + 4 log2 n gives 3 log2 n = 11.17, i.e. n ≈ 13.2. Hence, according to the MDL principle, decision tree (b) is better if n ≥ 14, and tree (a) is better otherwise.

Chapter 5, Problem 4:

For rule accuracy:
  Accuracy(R1) = 4 / (4 + 1) = 4/5 = 0.8
  Accuracy(R2) = 30 / (30 + 10) = 3/4 = 0.75
  Accuracy(R3) = 100 / (100 + 90) = 10/19 ≈ 0.526
  R1 is best and R3 is worst.

For FOIL's information gain, we treat each given rule as an extension of an initial rule with equal positive and negative coverage (p0 = n0) and compare the resulting gains:
  FOIL(R1) = 4 (log2(4/5) - log2(1/2)) = 4 (-0.322 + 1) ≈ 2.71
  FOIL(R2) = 30 (log2(30/40) - log2(1/2)) = 30 (-0.415 + 1) ≈ 17.55
  FOIL(R3) = 100 (log2(100/190) - log2(1/2)) = 100 (-0.926 + 1) ≈ 7.40
  R2 is best and R1 is worst.

For the likelihood ratio statistic (the training set contains 100 positive and 400 negative examples):
  For R1, k = 2, f+ = 4, e+ = 5 * 100/500 = 1, f- = 1, e- = 5 * 400/500 = 4
  LSR(R1) = 2 (4 log2(4/1) + 1 log2(1/4)) = 2 (8 + (-2)) = 2 * 6 = 12
  For R2, k = 2, f+ = 30, e+ = 40 * 100/500 = 8, f- = 10, e- = 40 * 400/500 = 32
  LSR(R2) = 2 (30 log2(30/8) + 10 log2(10/32)) = 2 (57.21 + (-16.78)) ≈ 80.9
  For R3, k = 2, f+ = 100, e+ = 190 * 100/500 = 38, f- = 90, e- = 190 * 400/500 = 152
  LSR(R3) = 2 (100 log2(100/38) + 90 log2(90/152)) = 2 (139.59 + (-68.05)) ≈ 143.1
  R3 is best and R1 is worst.

For the Laplace measure, with k = 2:
  Laplace(R1) = (4 + 1) / (5 + 2) = 5/7 ≈ 0.714
  Laplace(R2) = (30 + 1) / (40 + 2) = 31/42 ≈ 0.738
  Laplace(R3) = (100 + 1) / (190 + 2) = 101/192 ≈ 0.526
  R2 is best and R3 is worst.

For the m-estimate measure, with k = 2 and p+ = 0.2, so k * p+ = 0.4:
  m-estimate(R1) = (4 + 0.4) / (5 + 2) = 4.4/7 ≈ 0.629
  m-estimate(R2) = (30 + 0.4) / (40 + 2) = 30.4/42 ≈ 0.724
  m-estimate(R3) = (100 + 0.4) / (190 + 2) = 100.4/192 ≈ 0.523
  R2 is best and R3 is worst.
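All of the rule-quality measures above are short closed-form expressions, so the three rankings can be recomputed directly. A minimal sketch, assuming (as above) a training set of 100 positive and 400 negative examples and coverage (p, n) of (4, 1), (30, 10), (100, 90) for R1, R2, R3:

```python
import math

P, N = 100, 400                                   # positives / negatives in the training set
rules = {"R1": (4, 1), "R2": (30, 10), "R3": (100, 90)}   # (covered positives, covered negatives)

def accuracy(p, n):
    return p / (p + n)

def foil_gain(p, n):
    # Gain relative to an initial rule with equal positive and negative coverage (p0 = n0).
    return p * (math.log2(p / (p + n)) - math.log2(0.5))

def likelihood_ratio(p, n):
    ep = (p + n) * P / (P + N)                    # expected positives under the null
    en = (p + n) * N / (P + N)                    # expected negatives under the null
    return 2 * (p * math.log2(p / ep) + n * math.log2(n / en))

def laplace(p, n, k=2):
    return (p + 1) / (p + n + k)

def m_estimate(p, n, k=2, p_plus=0.2):
    return (p + k * p_plus) / (p + n + k)

for name, (p, n) in rules.items():
    print(name,
          round(accuracy(p, n), 3),               # 0.8,   0.75,  0.526
          round(foil_gain(p, n), 2),              # 2.71,  17.55, 7.4
          round(likelihood_ratio(p, n), 1),       # 12.0,  80.9,  143.1
          round(laplace(p, n), 3),                # 0.714, 0.738, 0.526
          round(m_estimate(p, n), 3))             # 0.629, 0.724, 0.523
```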

Chapter 5, Problem 8:

The class-conditional probabilities are
  P(A=1 | +) = P(A=1, class=+) / P(+) = (3/10) / (5/10) = 3/5
  P(B=1 | +) = P(B=1, class=+) / P(+) = (2/10) / (5/10) = 2/5
  P(C=1 | +) = P(C=1, class=+) / P(+) = (4/10) / (5/10) = 4/5
  P(A=1 | -) = P(A=1, class=-) / P(-) = (2/10) / (5/10) = 2/5
  P(B=1 | -) = P(B=1, class=-) / P(-) = (2/10) / (5/10) = 2/5
  P(C=1 | -) = P(C=1, class=-) / P(-) = (1/10) / (5/10) = 1/5

Assume that P(A=1, B=1, C=1) = x. Now let us compute P(+ | A=1, B=1, C=1) and P(- | A=1, B=1, C=1); based on these values, we assign a class label to the test sample.

P(+ | A=1, B=1, C=1)
  = P(+) P(A=1, B=1, C=1 | +) / P(A=1, B=1, C=1)
  = P(+) P(A=1 | +) P(B=1 | +) P(C=1 | +) / P(A=1, B=1, C=1)    [naive Bayes assumption]
  = (5/10 * 3/5 * 2/5 * 4/5) / x = 12 / (125x)

P(- | A=1, B=1, C=1)
  = P(-) P(A=1, B=1, C=1 | -) / P(A=1, B=1, C=1)
  = P(-) P(A=1 | -) P(B=1 | -) P(C=1 | -) / P(A=1, B=1, C=1)    [naive Bayes assumption]
  = (5/10 * 2/5 * 2/5 * 1/5) / x = 2 / (125x)

Clearly, for every x in (0, 1], P(+ | A=1, B=1, C=1) > P(- | A=1, B=1, C=1). Hence, we classify this test sample as belonging to class '+'.
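Because the unknown joint probability x appears in both posteriors, only the unnormalized scores need to be compared. A small sketch of that comparison, using the conditional probabilities estimated above:

```python
# Priors and class-conditional probabilities estimated above.
prior = {"+": 5/10, "-": 5/10}
cond = {
    "+": {"A": 3/5, "B": 2/5, "C": 4/5},   # P(attribute = 1 | +)
    "-": {"A": 2/5, "B": 2/5, "C": 1/5},   # P(attribute = 1 | -)
}

def nb_score(label):
    """Unnormalized posterior: P(label) * prod over attributes of P(attr = 1 | label)."""
    score = prior[label]
    for a in ("A", "B", "C"):
        score *= cond[label][a]
    return score

scores = {label: nb_score(label) for label in prior}
print({label: round(s, 3) for label, s in scores.items()})   # {'+': 0.096, '-': 0.016}, i.e. 12/125 and 2/125
print(max(scores, key=scores.get))                           # '+' (the common factor 1/x cancels)
```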

P(A=1) = 5/10 = 0.5 and P(B=1) = 4/10 = 0.4, so P(A=1) P(B=1) = 0.5 * 0.4 = 0.2.
P(A=1, B=1) = P(B=1 | A=1) P(A=1) = 2/5 * 5/10 = 2/10 = 0.2.
Since P(A=1, B=1) = P(A=1) P(B=1), the random variables A and B are independent.

P(A=1) = 5/10 = 0.5 and P(B=0) = 6/10 = 0.6, so P(A=1) P(B=0) = 0.5 * 0.6 = 0.3.
P(A=1, B=0) = P(B=0 | A=1) P(A=1) = 3/5 * 5/10 = 3/10 = 0.3.
Since P(A=1, B=0) = P(A=1) P(B=0), the random variables A and B are independent.

P(A=1, B=1 | +) = P(A=1, B=1, class=+) / P(+) = (1/10) / (5/10) = 1/5 = 0.2.
Using the conditional probabilities computed above, P(A=1 | +) P(B=1 | +) = (3/5)(2/5) = 6/25 = 0.24.
Since P(A=1, B=1 | +) ≠ P(A=1 | +) P(B=1 | +), the random variables A and B are not conditionally independent given the class '+'.

Chapter 5, Problem 12:

P(B=g, F=e, G=e, S=y)
  = P(S=y | B=g, F=e, G=e) P(B=g, F=e, G=e)
  = P(S=y | B=g, F=e) P(G=e | B=g, F=e) P(B=g, F=e)            [S does not depend on G given B and F]
  = (1 - P(S=n | B=g, F=e)) P(G=e | B=g, F=e) P(B=g) P(F=e)    [B and F are independent]
  = (1 - 0.8) * 0.8 * (1 - 0.1) * 0.2
  = 0.0288

P(B=b, F=e, G=ne, S=n)
  = P(S=n | B=b, F=e, G=ne) P(B=b, F=e, G=ne)
  = P(S=n | B=b, F=e) P(G=ne | B=b, F=e) P(B=b, F=e)           [S does not depend on G given B and F]
  = P(S=n | B=b, F=e) (1 - P(G=e | B=b, F=e)) P(B=b) P(F=e)    [B and F are independent]
  = 1 * (1 - 0.9) * 0.1 * 0.2
  = 0.002

P(S=y | B=b) = 1 - P(S=n | B=b) = 1 - P(S=n, B=b) / P(B=b) = 1 - 0.92 = 0.08, since
P(S=n, B=b) / P(B=b)
  = [P(S=n, B=b, F=e) + P(S=n, B=b, F=ne)] / P(B=b)
  = [P(S=n | B=b, F=e) P(B=b) P(F=e) + P(S=n | B=b, F=ne) P(B=b) P(F=ne)] / P(B=b)
  = P(S=n | B=b, F=e) P(F=e) + P(S=n | B=b, F=ne) P(F=ne)
  = 1 * 0.2 + 0.9 * (1 - 0.2) = 0.2 + 0.72 = 0.92
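For Problem 12 the joint probability factors along the network: B and F are independent priors, and both G and S depend only on B and F. A small sketch, assuming the conditional-probability entries implied by the calculations above (the string encodings such as "not empty" are local naming choices, not part of the exercise):

```python
# Entries taken from the calculations above: P(B = bad) = 0.1, P(F = empty) = 0.2,
# and P(G = empty | B, F), P(S = no | B, F) for the combinations that are needed.
p_b_bad, p_f_empty = 0.1, 0.2
p_g_empty = {("good", "empty"): 0.8, ("bad", "empty"): 0.9}
p_s_no = {("good", "empty"): 0.8, ("bad", "empty"): 1.0, ("bad", "not empty"): 0.9}

def p_b(b): return p_b_bad if b == "bad" else 1 - p_b_bad
def p_f(f): return p_f_empty if f == "empty" else 1 - p_f_empty
def p_g(g, b, f): return p_g_empty[(b, f)] if g == "empty" else 1 - p_g_empty[(b, f)]
def p_s(s, b, f): return p_s_no[(b, f)] if s == "no" else 1 - p_s_no[(b, f)]

def joint(b, f, g, s):
    """P(B, F, G, S) factored along the network structure."""
    return p_b(b) * p_f(f) * p_g(g, b, f) * p_s(s, b, f)

print(round(joint("good", "empty", "empty", "yes"), 4))     # 0.0288
print(round(joint("bad", "empty", "not empty", "no"), 4))   # 0.002

# P(S = yes | B = bad): B and F are independent and S depends only on B and F,
# so G marginalizes out and we can sum over F directly.
p_s_yes_given_b_bad = sum(p_f(f) * p_s("yes", "bad", f) for f in ("empty", "not empty"))
print(round(p_s_yes_given_b_bad, 2))                        # 0.08
```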
