Decision Support Systems MEIC - Alameda 2010/2011. Homework #8. Due date: 5.Dec.2011


1 Rule Learning

1. Consider once again the decision-tree you computed in Question 1c of Homework #7, used to determine the political affiliation of several Deputies of the Portuguese Parliament from their voting tendencies. For your convenience, we reproduce in Table 1 the dataset you used to construct the decision-tree.

Table 1: Data-set D with Portuguese Parliament vote samples.

    Dep. ID   N. Tax Inc. (V1)   Labor Reg. (V2)   Ed. Bud. (V3)   For. Pol. (V4)   Affil.
    ID013     Yes                No                No              Unk              Right
    ID030     Yes                No                Yes             Unk              Right
    ID050     No                 Yes               No              Unk              Right
    ID063     No                 Unk               Yes             Unk              Right
    ID070     Yes                Yes               Yes             Yes              Right
    ID072     Yes                Unk               Yes             No               Right
    ID102     Yes                No                Yes             No               Right
    ID112     Yes                Yes               No              Yes              Right
    ID130     No                 Yes               Yes             No               Right
    ID165     Yes                Unk               No              Unk              Left
    ID177     No                 Unk               No              Yes              Left
    ID217     Yes                Unk               No              Yes              Left
    ID221     No                 No                No              Unk              Left
    ID229     No                 No                Yes             No               Left

(a) (1/2 val.) From the decision-tree you computed, write down all IF-THEN rules necessary to build an equivalent rule-based classifier.

Note: Make sure to use the decision-tree from the official solution to HW7 provided by the faculty. Solutions based on different decision-trees will not be considered.

Recall that the decision-tree from the previous homework is

    V2 = Yes -> Right
    V2 = No  -> V1 = No  -> Left
                V1 = Yes -> Right
    V2 = Unk -> V3 = No  -> Left
                V3 = Yes -> Right

From this tree, we can derive the following rule-based classifier:

1. IF V2 = Yes THEN C = Right
2. IF (V2 = No ∧ V1 = No) THEN C = Left
3. IF (V2 = No ∧ V1 = Yes) THEN C = Right
4. IF (V2 = Unk ∧ V3 = No) THEN C = Left
5. IF (V2 = Unk ∧ V3 = Yes) THEN C = Right

(b) (1/2 val.) Compute the coverage and accuracy of these rules for the dataset D in Table 1.

The coverage of a rule is the fraction of tuples in D that satisfy the conditions in the rule's antecedent. The accuracy is the fraction of covered tuples that the rule classifies correctly. Since the tree correctly classifies all instances in the dataset D, all the rules derived above have an accuracy of 1.0. As for the coverage, we have:

    Rule 1 Coverage: 4/14 ≈ 0.29
    Rule 2 Coverage: 2/14 ≈ 0.14
    Rule 3 Coverage: 3/14 ≈ 0.21
    Rule 4 Coverage: 3/14 ≈ 0.21
    Rule 5 Coverage: 2/14 ≈ 0.14
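These values can be double-checked with a short Python script. The dataset is transcribed from Table 1; the rule encoding below is an illustrative sketch, not part of the official solution.

```python
# Dataset D from Table 1: (Dep. ID, V1, V2, V3, V4, Affiliation)
D = [
    ("ID013", "Yes", "No",  "No",  "Unk", "Right"),
    ("ID030", "Yes", "No",  "Yes", "Unk", "Right"),
    ("ID050", "No",  "Yes", "No",  "Unk", "Right"),
    ("ID063", "No",  "Unk", "Yes", "Unk", "Right"),
    ("ID070", "Yes", "Yes", "Yes", "Yes", "Right"),
    ("ID072", "Yes", "Unk", "Yes", "No",  "Right"),
    ("ID102", "Yes", "No",  "Yes", "No",  "Right"),
    ("ID112", "Yes", "Yes", "No",  "Yes", "Right"),
    ("ID130", "No",  "Yes", "Yes", "No",  "Right"),
    ("ID165", "Yes", "Unk", "No",  "Unk", "Left"),
    ("ID177", "No",  "Unk", "No",  "Yes", "Left"),
    ("ID217", "Yes", "Unk", "No",  "Yes", "Left"),
    ("ID221", "No",  "No",  "No",  "Unk", "Left"),
    ("ID229", "No",  "No",  "Yes", "No",  "Left"),
]

# Rules 1-5: (antecedent over (V1, V2, V3, V4), predicted class)
rules = [
    (lambda v1, v2, v3, v4: v2 == "Yes",                 "Right"),
    (lambda v1, v2, v3, v4: v2 == "No" and v1 == "No",   "Left"),
    (lambda v1, v2, v3, v4: v2 == "No" and v1 == "Yes",  "Right"),
    (lambda v1, v2, v3, v4: v2 == "Unk" and v3 == "No",  "Left"),
    (lambda v1, v2, v3, v4: v2 == "Unk" and v3 == "Yes", "Right"),
]

for i, (antecedent, label) in enumerate(rules, start=1):
    covered = [row for row in D if antecedent(*row[1:5])]
    correct = [row for row in covered if row[5] == label]
    print(f"Rule {i}: coverage = {len(covered)}/{len(D)} = {len(covered) / len(D):.2f}, "
          f"accuracy = {len(correct) / len(covered):.2f}")
```

Running the script reproduces the coverage values above and confirms that every rule has accuracy 1.0 on D.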

2 Neural networks

2. (3 val.) Derive a gradient descent training rule for a network with a single unit with p inputs and output given by

    ŷ(x) = w_0 + w_1 x_1 + w_1 x_1² + ... + w_p x_p + w_p x_p²,

where x = [x_1, ..., x_p]. Consider that the error in a dataset D = {(x_n, y_n), n = 1, ..., N} is given by

    E(w) = Σ_{n=1}^{N} (ŷ(x_n) − y_n)².

The gradient descent update rule can be obtained by differentiating the error with respect to the network parameters. In our case, we can represent the network output as

    ŷ(x) = Σ_{i=0}^{p} w_i φ_i(x),

where each φ_i is an input-dependent feature. In particular, we have that

    φ_i(x) = 1              if i = 0,
    φ_i(x) = x_i + x_i²     otherwise.

Differentiating the error with respect to a generic weight w_i yields:

    ∂E(w)/∂w_i = 2 Σ_{n=1}^{N} (ŷ(x_n) − y_n) ∂ŷ(x_n)/∂w_i
               = 2 Σ_{n=1}^{N} (ŷ(x_n) − y_n) φ_i(x_n).

Finally, we get the update rule w_i ← w_i + Δw_i, where each Δw_i is given by

    Δw_i = −η ∂E(w)/∂w_i
         = 2η Σ_{n=1}^{N} (y_n − ŷ(x_n)) φ_i(x_n)
         = 2η Σ_{n=1}^{N} (y_n − ŷ(x_n))                 if i = 0,
         = 2η Σ_{n=1}^{N} (y_n − ŷ(x_n)) (x_i + x_i²)    otherwise.
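A minimal Python sketch of this batch update rule follows. The feature map matches the solution above; the learning rate, iteration count and synthetic data are illustrative assumptions.

```python
import numpy as np

def features(x):
    """phi_0(x) = 1 and phi_i(x) = x_i + x_i^2 for i = 1, ..., p."""
    return np.concatenate(([1.0], x + x ** 2))

def train(X, y, eta=0.005, iterations=5000):
    """Batch gradient descent for y_hat(x) = sum_i w_i * phi_i(x)."""
    N, p = X.shape
    w = np.zeros(p + 1)
    Phi = np.array([features(x) for x in X])   # N x (p + 1) design matrix
    for _ in range(iterations):
        y_hat = Phi @ w
        # Delta w_i = 2 * eta * sum_n (y_n - y_hat(x_n)) * phi_i(x_n)
        w += 2 * eta * Phi.T @ (y - y_hat)
    return w

# Illustrative usage on synthetic data generated from known weights
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
true_w = np.array([0.5, 1.0, -2.0])            # [w_0, w_1, w_2]
y = np.array([true_w @ features(x) for x in X])
print(train(X, y))                             # should approach true_w
```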

3. (3 val.) Consider the two-layer neural network depicted in Fig. 1. Initialize the weight vector w = [w_0c, w_ac, w_bc, w_0d, w_cd] to [0.1, 0.1, 0.1, 0.1, 0.1] and indicate the values of the weight vector after the two initial iterations of back-propagation. Assume that the activation function of both units c and d is the logistic sigmoid function, given by σ(x) = 1 / (1 + exp(−x)). In your computations use η = 0.3 and the dataset D = {([1, 0], 1), ([0, 1], 0)}, where each point in the dataset is of the form (x, y), with x = [x_a, x_b].

Figure 1: Two-layer neural network with two inputs, x_a and x_b, and output ŷ = z_d. The nodes x_0 and z_0 correspond to the bias.

Each iteration of back-propagation consists in processing one data-point through the network. We begin the first iteration with the data-point x_1 = [1, 0], for which the intended output is y_1 = 1. The initial stage of back-propagation consists in computing the activation and output of each unit in the network. This yields:

    a_c = w_0c + w_ac x_a + w_bc x_b = 0.1 + 0.1 · 1 + 0.1 · 0 = 0.2
    z_c = σ(0.2) = 0.55
    a_d = w_0d + w_cd z_c = 0.1 + 0.1 · 0.55 ≈ 0.15
    z_d = σ(0.15) ≈ 0.54

We now compute the δ_j, propagating the error back through the network:

    δ_d = z_d (1 − z_d)(z_d − y_1) = 0.54 · (1 − 0.54) · (0.54 − 1) ≈ −0.12
    δ_c = z_c (1 − z_c) w_cd δ_d = 0.55 · (1 − 0.55) · 0.1 · (−0.12) ≈ 0.00

and we get the updated weights

    w_0c ← w_0c − η δ_c x_0 ≈ 0.10
    w_ac ← w_ac − η δ_c x_a ≈ 0.10
    w_bc ← w_bc − η δ_c x_b = 0.10
    w_0d ← w_0d − η δ_d z_0 ≈ 0.13
    w_cd ← w_cd − η δ_d z_c ≈ 0.12

In the second iteration, we use the data-point x_2 = [0, 1], for which the intended output is y_2 = 0. Again, the initial stage of back-propagation consists in computing the activation and output of each unit in the network. This yields:

    a_c = 0.10 + 0.10 · 0 + 0.10 · 1 = 0.2
    z_c = σ(0.2) = 0.55
    a_d = 0.13 + 0.12 · 0.55 ≈ 0.20
    z_d = σ(0.20) ≈ 0.55

We now compute the δ_j, propagating the error back through the network:

    δ_d = 0.55 · (1 − 0.55) · (0.55 − 0) ≈ 0.14
    δ_c = 0.55 · (1 − 0.55) · 0.12 · 0.14 ≈ 0.00

and we get the updated weights

    w_0c = 0.10
    w_ac = 0.10
    w_bc = 0.10
    w_0d = 0.09
    w_cd ≈ 0.10
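The two iterations can be reproduced with the following Python sketch, an illustrative transcription of the computations above (one data-point processed per iteration); it is not part of the official solution.

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# Initial weights [w_0c, w_ac, w_bc, w_0d, w_cd], learning rate and dataset from Question 3
w0c = wac = wbc = w0d = wcd = 0.1
eta = 0.3
D = [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.0)]

for (xa, xb), y in D:
    # Forward pass
    ac = w0c + wac * xa + wbc * xb
    zc = sigmoid(ac)
    ad = w0d + wcd * zc
    zd = sigmoid(ad)
    # Backward pass: deltas for the two sigmoid units
    dd = zd * (1.0 - zd) * (zd - y)
    dc = zc * (1.0 - zc) * wcd * dd
    # Gradient descent updates (bias inputs x_0 = z_0 = 1)
    w0c -= eta * dc
    wac -= eta * dc * xa
    wbc -= eta * dc * xb
    w0d -= eta * dd
    wcd -= eta * dd * zc
    print([round(w, 2) for w in (w0c, wac, wbc, w0d, wcd)])
```

Rounded to two decimal places, this prints [0.1, 0.1, 0.1, 0.13, 0.12] after the first iteration and [0.1, 0.1, 0.1, 0.09, 0.1] after the second, matching the values above.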

4. (3 val.) Consider the two-layer feed-forward neural network in Fig. 2.

Figure 2: Two-layer neural network with p inputs, x_1 through x_p, and one output. The nodes x_0 and z_0 correspond to the bias. Double indexed weights connect the inputs to the first layer, while single indexed weights connect the first layer to the output layer.

The output of the network is given by

    ŷ(x, w) = σ( Σ_{j=1}^{M} w_j h( Σ_{i=1}^{p} w_ij x_i + w_0j ) + w_0 ),    (1)

where h is the activation function for the units in the first layer and σ is the logistic sigmoid function σ(x) = 1 / (1 + exp(−x)). Suppose that the activation function h is also the logistic sigmoid function. Show that there exists an equivalent network that computes exactly the same function, but where the activation function for the units in the first layer is h(x) = tanh(x). Recall that the tanh function is defined as

    tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x)).

Suggestion: First find the relation between σ(x) and tanh(x), and then show that the parameters in the two networks differ by linear transformations.

We start by noting that

    tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x))
            = (1 − exp(−2x)) / (1 + exp(−2x))
            = 1 / (1 + exp(−2x)) − exp(−2x) / (1 + exp(−2x))
            = σ(2x) − (1 − σ(2x))
            = 2σ(2x) − 1.

Inverting the above relation, we get

    σ(x) = (tanh(x/2) + 1) / 2.

Replacing this in the expression for the output of the network, we get

    ŷ(x, w) = σ( Σ_{j=1}^{M} (w_j / 2) tanh( (1/2)(Σ_{i=1}^{p} w_ij x_i + w_0j) ) + Σ_{j=1}^{M} w_j / 2 + w_0 ),

or, equivalently,

    ŷ(x, w) = σ( Σ_{j=1}^{M} w'_j tanh( Σ_{i=1}^{p} w'_ij x_i + w'_0j ) + w'_0 ),

with

    w'_0 = w_0 + Σ_{j=1}^{M} w_j / 2;
    w'_j = w_j / 2,      for j ≠ 0;
    w'_ij = w_ij / 2.

This output expression has the same form as (1), indicating that (i) the network obtained has the same topology as the original network; (ii) the activation function h is tanh, as required; and (iii) the two networks compute exactly the same output.
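The identity σ(x) = (tanh(x/2) + 1)/2 and the resulting weight transformation can be checked numerically. The sketch below uses random test inputs and weights of my own choosing; they are not part of the assignment.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
# Identity used in the solution: sigma(x) = (tanh(x/2) + 1) / 2
assert np.allclose(sigmoid(x), (np.tanh(x / 2) + 1) / 2)

# Equivalence of the two networks on random inputs and weights
p, M = 3, 4
W = rng.normal(size=(M, p))        # w_ij, first-layer weights
b = rng.normal(size=M)             # w_0j, first-layer biases
v = rng.normal(size=M)             # w_j, output-layer weights
v0 = rng.normal()                  # w_0, output bias
X = rng.normal(size=(100, p))

def net(X, hidden, W, b, v, v0):
    return sigmoid(hidden(X @ W.T + b) @ v + v0)

y_sigmoid = net(X, sigmoid, W, b, v, v0)
# Transformed parameters: w'_ij = w_ij/2, w'_0j = w_0j/2, w'_j = w_j/2, w'_0 = w_0 + sum_j w_j/2
y_tanh = net(X, np.tanh, W / 2, b / 2, v / 2, v0 + v.sum() / 2)
assert np.allclose(y_sigmoid, y_tanh)
print("The sigmoid and tanh networks agree on all test inputs.")
```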

2.1 Practical Questions (Using SQL Server 2008)

5. Consider the 3 models you deployed in Lab 8 and analyzed in Homework #7. You will now compare these models with Microsoft Neural Networks. To this purpose, start from the deployment of the three mining models you analyzed in Homework #7 and add a fourth model, corresponding to Microsoft Neural Networks. Process all models.

(a) (3 val.) For Microsoft Neural Networks, provide a snapshot of the variables pane of the Neural Network viewer, showing the attributes that favor each of the two possible values for the output Bike Buyer. Compare these results with those obtained for the other methods in Homework #7.

We depict below the variables pane for MS Neural Networks on the attribute Bike Buyer. In contrast with the methods analyzed in HW7, MS Neural Networks indicates the region as a primary attribute to identify bike buyers. [1] Other important attributes include the commute distance, the number of children and the number of cars owned. These results are in accordance with those determined by MS Clustering (as seen in HW7), although the order of the factors is different. It is also important to note that, although not appearing as the most significant attribute, the number of cars owned is also considered an important factor to identify bike buyers, as outlined in both panes portrayed.

[1] In the results above, we have ignored the attribute Geography Key in the training of the classifier.

(b) (2 val.) Provide the lift chart comparing the performance of the four methods. Repeat the analysis in Question 4b of Homework #7, comparing the performance of Microsoft Neural Networks with that of the other methods and indicating which one performs best.

The lift chart comparing the performance of the different methods is depicted below.

As seen in HW7, MS Decision Trees exhibits the best predictive performance, while MS Naive Bayes and MS Clustering exhibit similar performance. Comparing now with MS Neural Networks, MS Decision Trees remains the method with the best predictive performance. MS Neural Networks, while slightly outperforming MS Naive Bayes and MS Clustering, is very similar to these two methods and still falls behind MS Decision Trees. This may indicate that MS Decision Trees (and, to a lesser extent, MS Neural Networks) is a richer classification model and, as such, is better able to capture the Bike Buyer classification.

(c) (2 val.) Provide the confusion matrix for the neural network model. Compare the performance of this model with that of the other methods in Homework #7 in terms of the confusion matrix. Compare also the results in terms of the confusion matrix with those obtained in the lift charts. Note: You don't need to include the results from Homework #7.

The confusion matrix for MS Neural Networks (predicted Positive/Negative counts against the actual Pos. and Neg. labels) is depicted below. These results are in accordance with those observed in the lift chart. The confusion matrix again indicates that MS Decision Trees exhibits the best performance, above MS Neural Networks, MS Naive Bayes and MS Clustering. MS Neural Networks, while slightly outperforming MS Naive Bayes and MS Clustering, exhibits a very similar performance in general. This comparison can be made more explicit by directly comparing the accuracy of all four methods:

    Method                 Accuracy
    MS Decision Trees      71.9%
    MS Neural Networks     64.7%
    MS Naive Bayes         63.6%
    MS Clustering          62.1%

(d) (3 val.) Perform cross-validation with Fold Count 3, 5 and 10 for the neural network model. Set Max Cases = 1,000. Make sure that the Target attribute is Bike Buyer and that Target State is blank. Indicate the average and standard deviation for the Pass measure, corresponding to the number of correct labels obtained. Perform a comparative analysis of the results obtained with Microsoft Neural Networks and all other methods in Homework #7 (you don't have to repeat the results obtained with the other methods).

Finally, we performed 3-, 5- and 10-fold cross-validation on MS Neural Networks, obtaining the following results:

    Method             N-fold     Av. Acc. (%)    Std. Dev.
    MS Neural Nets.    3-Fold     50.4
                       5-Fold     53.6
                       10-Fold    50.7

It is interesting to note the general tendency observed in these results. In general, MS Neural Networks and MS Decision Trees exhibit a significantly worse performance than that observed when analyzing the confusion matrix and the lift charts. In fact, the accuracy of MS Decision Trees goes from a value of around 70%, as seen in Question 5(c), to a value between 50% and 55%. Similarly, MS Neural Networks goes from around 65% accuracy to a value between 50% and 55%. On the other hand, the accuracy of both MS Naive Bayes and MS Clustering exhibits only a minor decrease, to a value around 60%.

To understand these results, we note that cross-validation is conducted with a dataset of 1,000 data-points (corresponding to the parameter Max Cases), which is then further divided for testing and training. The accuracy results observed are thus obtained with a significantly smaller amount of data. Therefore, it is not surprising that all methods show some decrease in performance, since they are trained with significantly less data. The fact that MS Clustering and MS Naive Bayes seem to be less sensitive to the smaller amount of data may suggest that these are simpler models that require less data for training. On the other hand, one may also venture that the worse performance of MS Decision Trees and MS Neural Networks is due to overfitting on the small dataset used for training. We note, however, that our results do not provide sufficient information to state this conclusively.
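For reference, the average and standard deviation reported by the cross-validation report are simply the mean and spread of the per-fold accuracies. The sketch below illustrates this computation generically with scikit-learn on placeholder data; it is not the SQL Server 2008 Analysis Services tooling used in the lab, and the classifier settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for the 1,000-case sample (Max Cases) used in the lab
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

for k in (3, 5, 10):
    scores = cross_val_score(MLPClassifier(max_iter=1000, random_state=0), X, y, cv=k)
    # Mean and standard deviation of the per-fold accuracy
    # (the Pass measure divided by the number of cases in each fold)
    print(f"{k}-fold: mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f}")
```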
