- Machine Learning - Homework
Enrique Areyan
April 8, 01

Problem 1: Give decision trees to represent the following Boolean functions

a) A
b) A C
c) Ā
d) A C D
e) A C D

where Ā is the negation of A and ⊕ is the exclusive OR operation.

Solution: I will use the convention that the left branch out of a node for attribute A corresponds to that attribute being true (A), and the right branch corresponds to the attribute being false (Ā).

[Figures: decision trees for parts a) through e). Note accompanying tree e): read it like {A C D}.]
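The left/right branch convention can be made concrete with a small sketch. The tree below is a hypothetical example (exclusive OR of two attributes named A and B), not necessarily one of the functions listed in the problem; the tuple encoding and the `evaluate` helper are assumptions made for illustration only.

```python
# A tree is either a bool leaf or a tuple (attribute, left_subtree, right_subtree).
# Convention: left branch = attribute is true, right branch = attribute is false.

def evaluate(tree, assignment):
    """Walk the tree for a dict mapping attribute names to booleans."""
    while not isinstance(tree, bool):
        attribute, if_true, if_false = tree
        tree = if_true if assignment[attribute] else if_false
    return tree

# Hypothetical example: A XOR B needs a test on B under both branches of A.
xor_tree = ("A", ("B", False, True), ("B", True, False))

for a in (True, False):
    for b in (True, False):
        print(a, b, evaluate(xor_tree, {"A": a, "B": b}))
```

Exclusive OR is a useful illustration because no single attribute test separates the classes, so both subtrees have to repeat the test on the second attribute.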
Problem 2: Consider the data set from the following table

#  Sky    Temperature  Humidity  Wind    Water  Forecast  Enjoy Sport
1  Sunny  Warm         Normal    Strong  Warm   Same      Yes
2  Sunny  Warm         High      Strong  Warm   Same      Yes
3  Rainy  Cold         High      Strong  Warm   Change    No
4  Sunny  Warm         High      Strong  Cool   Change    Yes

a) Using Enjoy Sport as the target, show the decision tree that would be learned if the splitting criterion was information gain.

b) Add the training example from the table below and compute the new decision tree. This time, show the value of the information gain for each candidate attribute at each step in growing the tree.

#  Sky    Temperature  Humidity  Wind    Water  Forecast  Enjoy Sport
1  Sunny  Warm         Normal    Weak    Warm   Same      No

Solution: I will use the notation from Tom Mitchell's book to denote a set with positive and negative examples, e.g., $S = [3+, 1-]$ in the first data set (there are 3 positive examples corresponding to Enjoy Sport = Yes, and 1 negative example corresponding to Enjoy Sport = No).

a) Let $S = [3+, 1-]$. Then,

$$\mathrm{Entropy}(S) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 0.811$$

Clearly, if the splitting criterion is information gain we will split either by attribute Sky or Temperature, since both of these partition the training set into pure subsets immediately. Nonetheless, let us check that the math works. We will use the following formula to compute the information gain for attribute $A$:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

i. $\mathrm{Values}(Sky) = \{Sunny, Rainy\}$.
   $S_{Sunny} = [3+, 0-] \Rightarrow \mathrm{Entropy}(S_{Sunny}) = 0$.
   $S_{Rainy} = [0+, 1-] \Rightarrow \mathrm{Entropy}(S_{Rainy}) = 0$.
   $\mathrm{Gain}(S, Sky) = 0.811 - \tfrac{3}{4}(0) - \tfrac{1}{4}(0) = 0.811$

ii. $\mathrm{Values}(Temperature) = \{Warm, Cold\}$.
   $S_{Warm} = [3+, 0-] \Rightarrow \mathrm{Entropy}(S_{Warm}) = 0$.
   $S_{Cold} = [0+, 1-] \Rightarrow \mathrm{Entropy}(S_{Cold}) = 0$.
   $\mathrm{Gain}(S, Temperature) = 0.811 - \tfrac{3}{4}(0) - \tfrac{1}{4}(0) = 0.811$

iii. $\mathrm{Values}(Humidity) = \{Normal, High\}$.
   $S_{Normal} = [1+, 0-] \Rightarrow \mathrm{Entropy}(S_{Normal}) = 0$.
   $S_{High} = [2+, 1-] \Rightarrow \mathrm{Entropy}(S_{High}) = -\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3} = 0.918$.
   $\mathrm{Gain}(S, Humidity) = 0.811 - \tfrac{1}{4}(0) - \tfrac{3}{4}(0.918) \approx 0.1226$

iv. $\mathrm{Values}(Wind) = \{Strong\}$.
   $S_{Strong} = [3+, 1-] \Rightarrow \mathrm{Entropy}(S_{Strong}) = 0.811$.
   $\mathrm{Gain}(S, Wind) = 0.811 - \tfrac{4}{4}(0.811) = 0$

v. $\mathrm{Values}(Water) = \{Warm, Cool\}$.
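   $S_{Warm} = [2+, 1-] \Rightarrow \mathrm{Entropy}(S_{Warm}) = -\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3} = 0.918$.
   $S_{Cool} = [1+, 0-] \Rightarrow \mathrm{Entropy}(S_{Cool}) = 0$.
   $\mathrm{Gain}(S, Water) = 0.811 - \tfrac{3}{4}(0.918) - \tfrac{1}{4}(0) \approx 0.1226$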
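vi. $\mathrm{Values}(Forecast) = \{Same, Change\}$.
   $S_{Same} = [2+, 0-] \Rightarrow \mathrm{Entropy}(S_{Same}) = 0$.
   $S_{Change} = [1+, 1-] \Rightarrow \mathrm{Entropy}(S_{Change}) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$.
   $\mathrm{Gain}(S, Forecast) = 0.811 - \tfrac{2}{4}(0) - \tfrac{2}{4}(1) = 0.311$

Sky and Temperature tie for the highest information gain. Choose one at random, say Temperature, to get the decision tree for the concept Enjoy Sport:

Temperature
   Warm: Yes
   Cold: No

b) Let us go through the same process as before, but adding the new training example. Let $S = [3+, 2-]$. Then,

$$\mathrm{Entropy}(S) = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} = 0.9710$$

i. $\mathrm{Values}(Sky) = \{Sunny, Rainy\}$.
   $S_{Sunny} = [3+, 1-] \Rightarrow \mathrm{Entropy}(S_{Sunny}) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 0.811$.
   $S_{Rainy} = [0+, 1-] \Rightarrow \mathrm{Entropy}(S_{Rainy}) = 0$.
   $\mathrm{Gain}(S, Sky) = 0.9710 - \tfrac{4}{5}(0.811) - \tfrac{1}{5}(0) \approx 0.322$

ii. $\mathrm{Values}(Temperature) = \{Warm, Cold\}$.
   $S_{Warm} = [3+, 1-] \Rightarrow \mathrm{Entropy}(S_{Warm}) = 0.811$.
   $S_{Cold} = [0+, 1-] \Rightarrow \mathrm{Entropy}(S_{Cold}) = 0$.
   $\mathrm{Gain}(S, Temperature) = 0.9710 - \tfrac{4}{5}(0.811) - \tfrac{1}{5}(0) \approx 0.322$

iii. $\mathrm{Values}(Humidity) = \{Normal, High\}$.
   $S_{Normal} = [1+, 1-] \Rightarrow \mathrm{Entropy}(S_{Normal}) = 1$.
   $S_{High} = [2+, 1-] \Rightarrow \mathrm{Entropy}(S_{High}) = -\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3} = 0.918$.
   $\mathrm{Gain}(S, Humidity) = 0.9710 - \tfrac{2}{5}(1) - \tfrac{3}{5}(0.918) \approx 0.0200$

iv. $\mathrm{Values}(Wind) = \{Strong, Weak\}$.
   $S_{Strong} = [3+, 1-] \Rightarrow \mathrm{Entropy}(S_{Strong}) = 0.811$.
   $S_{Weak} = [0+, 1-] \Rightarrow \mathrm{Entropy}(S_{Weak}) = 0$.
   $\mathrm{Gain}(S, Wind) = 0.9710 - \tfrac{4}{5}(0.811) - \tfrac{1}{5}(0) \approx 0.322$

v. $\mathrm{Values}(Water) = \{Warm, Cool\}$.
   $S_{Warm} = [2+, 2-] \Rightarrow \mathrm{Entropy}(S_{Warm}) = -\tfrac{2}{4}\log_2\tfrac{2}{4} - \tfrac{2}{4}\log_2\tfrac{2}{4} = 1$.
   $S_{Cool} = [1+, 0-] \Rightarrow \mathrm{Entropy}(S_{Cool}) = 0$.
   $\mathrm{Gain}(S, Water) = 0.9710 - \tfrac{4}{5}(1) - \tfrac{1}{5}(0) = 0.1710$

vi. $\mathrm{Values}(Forecast) = \{Same, Change\}$.
   $S_{Same} = [2+, 1-] \Rightarrow \mathrm{Entropy}(S_{Same}) = -\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3} = 0.918$.
   $S_{Change} = [1+, 1-] \Rightarrow \mathrm{Entropy}(S_{Change}) = 1$.
   $\mathrm{Gain}(S, Forecast) = 0.9710 - \tfrac{3}{5}(0.918) - \tfrac{2}{5}(1) \approx 0.0200$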
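There is a three-way tie between attributes Sky, Temperature and Wind. I am choosing Temperature first:

Temperature
   Warm: [3+, 1-]
   Cold: No

Now we have to repeat the process, but with S containing only the examples for which Temperature = Warm, i.e., $S = \{1, 2, 4, 5\}$, where 5 denotes the example added in part b). In this case $S = [3+, 1-]$. Then,

$$\mathrm{Entropy}(S) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 0.811$$

i. $\mathrm{Values}(Sky) = \{Sunny\}$.
   $S_{Sunny} = [3+, 1-] \Rightarrow \mathrm{Entropy}(S_{Sunny}) = 0.811$.
   $\mathrm{Gain}(S, Sky) = 0.811 - \tfrac{4}{4}(0.811) = 0$

ii. $\mathrm{Values}(Humidity) = \{Normal, High\}$.
   $S_{Normal} = [1+, 1-] \Rightarrow \mathrm{Entropy}(S_{Normal}) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$.
   $S_{High} = [2+, 0-] \Rightarrow \mathrm{Entropy}(S_{High}) = 0$.
   $\mathrm{Gain}(S, Humidity) = 0.811 - \tfrac{2}{4}(1) - \tfrac{2}{4}(0) = 0.311$

iii. $\mathrm{Values}(Wind) = \{Strong, Weak\}$.
   $S_{Strong} = [3+, 0-] \Rightarrow \mathrm{Entropy}(S_{Strong}) = 0$.
   $S_{Weak} = [0+, 1-] \Rightarrow \mathrm{Entropy}(S_{Weak}) = 0$.
   $\mathrm{Gain}(S, Wind) = 0.811 - \tfrac{3}{4}(0) - \tfrac{1}{4}(0) = 0.811$

iv. $\mathrm{Values}(Water) = \{Warm, Cool\}$.
   $S_{Warm} = [2+, 1-] \Rightarrow \mathrm{Entropy}(S_{Warm}) = -\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3} = 0.918$.
   $S_{Cool} = [1+, 0-] \Rightarrow \mathrm{Entropy}(S_{Cool}) = 0$.
   $\mathrm{Gain}(S, Water) = 0.811 - \tfrac{3}{4}(0.918) - \tfrac{1}{4}(0) \approx 0.1226$

v. $\mathrm{Values}(Forecast) = \{Same, Change\}$.
   $S_{Same} = [2+, 1-] \Rightarrow \mathrm{Entropy}(S_{Same}) = -\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3} = 0.918$.
   $S_{Change} = [1+, 0-] \Rightarrow \mathrm{Entropy}(S_{Change}) = 0$.
   $\mathrm{Gain}(S, Forecast) = 0.811 - \tfrac{3}{4}(0.918) - \tfrac{1}{4}(0) \approx 0.1226$

Attribute Wind has the highest information gain, so we split on this attribute:

Temperature
   Warm: Wind
      Strong: Yes
      Weak: No
   Cold: No

All training examples are correctly classified. Hence, this is the final tree.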
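The entropy and gain values in this problem can be re-derived mechanically. The following is a minimal sketch, not the code used for the original solution; the names `DATA`, `entropy`, `gain` and `gains` are introduced here for illustration, and the fifth row is the example added in part b).

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Sky", "Temperature", "Humidity", "Wind", "Water", "Forecast"]
DATA = [
    ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same", "Yes"),
    ("Sunny", "Warm", "High", "Strong", "Warm", "Same", "Yes"),
    ("Rainy", "Cold", "High", "Strong", "Warm", "Change", "No"),
    ("Sunny", "Warm", "High", "Strong", "Cool", "Change", "Yes"),
    ("Sunny", "Warm", "Normal", "Weak", "Warm", "Same", "No"),  # added in part b)
]

def entropy(rows):
    """Entropy of the Enjoy Sport label (last column) over the given rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def gain(rows, attribute):
    """Information gain of splitting the rows on one attribute."""
    idx = ATTRIBUTES.index(attribute)
    remainder = 0.0
    for value in {row[idx] for row in rows}:
        subset = [row for row in rows if row[idx] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def gains(rows):
    """Gain of every attribute, rounded to four decimals."""
    return {name: round(gain(rows, name), 4) for name in ATTRIBUTES}

part_a = gains(DATA[:4])                                    # root split, part a)
part_b = gains(DATA)                                        # root split, part b)
warm_branch = gains([row for row in DATA if row[1] == "Warm"])  # second split, part b)

for label, result in [("a", part_a), ("b", part_b), ("b, Warm", warm_branch)]:
    print(label, result)
```

Because the script uses unrounded entropies throughout, its output can differ from the hand computation in the last decimal place.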
Problem 3: Consider a two-layer feed-forward neural network with two inputs $x_1$ and $x_2$, one hidden unit $z$, and one output unit $f$. This network has five weights $(w_{1z}, w_{2z}, w_{0z}, w_{zf}, w_{0f})$, where $w_{0z}$ represents the bias for unit $z$ and $w_{0f}$ represents the bias for unit $f$. Initialize these weights to the values (0.1, -0.1, 0.1, -0.1, 0.1), then give their values after each of the first two training iterations of the backpropagation algorithm. Assume learning rate η = 0., momentum µ = 0.9, incremental weight updates, and the following training examples

$$\{((x_1, x_2)_i, y_i)\} = \{((1, 0), 1), ((0, 1), 0)\}$$

Solution: Using the tool pybrain in Python we can obtain the weights after two iterations of backpropagation:

#  w_1z   w_2z   w_0z   w_zf  w_0f
1  0.087  -0.10  0.08   0.17  0.0
2  0.08   -0.1   0.060  0.66  0.710

Note that this neural network has only one hidden unit, so it will learn a linear concept analogous to logistic regression. It is interesting to see the lines generated by the hidden unit for the first two iterations:

[Figure: lines generated by the hidden unit after the first (blue) and second (green) iterations.]

We start to see how the neural network adapts to the data, which in this case is linearly separable. The green line corresponds to the second iteration and the blue line to the first iteration.

Problem 4: Consider minimizing the following regularized error function using the backpropagation algorithm

$$E = \sum_{i=1}^{n}\sum_{j=1}^{m} (y_{ij} - f_{ij})^2 + \gamma \sum_{i,j} w_{i,j}^2$$

Derive the gradient descent update rule for this error function. Show that it can be implemented by multiplying each weight by some constant before performing the standard gradient descent update as shown in class.

Solution: Following the notation used in class, let us minimize the above regularized error function. Note: We will use the notation $b_{vu}$ for the weight from hidden unit $Z_u$ to output $f_v$, and the notation $a_{vu}$ for the weight from input $X_u$ to hidden unit $Z_v$. Now we optimize.
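First for the $b$'s and later for the $a$'s:

$$\frac{\partial E}{\partial b_{uv}} = \frac{\partial}{\partial b_{uv}}\left[\sum_{i=1}^{n}\sum_{j=1}^{m} (y_{ij} - f_{ij})^2 + \gamma \sum_{i,j} w_{i,j}^2\right]$$

$$= \sum_{i=1}^{n} \frac{\partial}{\partial b_{uv}} \sum_{j=1}^{m} (y_{ij} - f_{ij})^2 + \gamma \frac{\partial}{\partial b_{uv}} \sum_{i,j} w_{i,j}^2$$

$$= \sum_{i=1}^{n} \frac{\partial}{\partial b_{uv}} \left( (y_{i1} - f_{i1})^2 + (y_{i2} - f_{i2})^2 + \dots + (y_{im} - f_{im})^2 \right) + 2\gamma b_{uv}$$

$$= -2\sum_{i=1}^{n} (y_{iu} - f_{iu}) \frac{\partial}{\partial b_{uv}} \phi(b_u^T z_i) + 2\gamma b_{uv} \qquad (1)$$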
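Choose $\phi(x) = \dfrac{1}{1 + e^{-x}}$. Then

$$\phi'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \phi(x)(1 - \phi(x))$$

$$(1) = -2\sum_{i=1}^{n} (y_{iu} - f_{iu})\,\phi(b_u^T z_i)\left(1 - \phi(b_u^T z_i)\right) z_{iv} + 2\gamma b_{uv} \qquad (2)$$

Note that $f_{iu} = \phi(b_u^T z_i)$, which means that $1 - f_{iu} = 1 - \phi(b_u^T z_i)$. Thus,

$$(2) = -2\sum_{i=1}^{n} (y_{iu} - f_{iu}) f_{iu} (1 - f_{iu}) z_{iv} + 2\gamma b_{uv}$$

Let $\beta_{iu} = -2 (y_{iu} - f_{iu}) f_{iu} (1 - f_{iu})$ to conclude that

$$\frac{\partial E}{\partial b_{uv}} = \sum_{i=1}^{n} \beta_{iu} z_{iv} + 2\gamma b_{uv}$$

The update rule for gradient descent is

$$b_{uv}^{(t+1)} = b_{uv}^{(t)} - \eta\left\{\sum_{i=1}^{n} \beta_{iu} z_{iv} + 2\gamma b_{uv}^{(t)}\right\} = b_{uv}^{(t)} - \eta \sum_{i=1}^{n} \beta_{iu} z_{iv} - 2\eta\gamma b_{uv}^{(t)} = (1 - 2\eta\gamma)\, b_{uv}^{(t)} - \eta \sum_{i=1}^{n} \beta_{iu} z_{iv}$$

Optimization for the $a$'s:

$$\frac{\partial E}{\partial a_{uv}} = \frac{\partial}{\partial a_{uv}}\left[\sum_{i=1}^{n}\sum_{j=1}^{m} (y_{ij} - f_{ij})^2 + \gamma \sum_{i,j} w_{i,j}^2\right] = \sum_{i=1}^{n} \frac{\partial}{\partial a_{uv}} \sum_{j=1}^{m} (y_{ij} - f_{ij})^2 + \gamma \frac{\partial}{\partial a_{uv}} \sum_{i,j} w_{i,j}^2$$

$$= -2\sum_{i=1}^{n}\sum_{j=1}^{m} (y_{ij} - f_{ij}) \frac{\partial}{\partial a_{uv}} \phi(b_j^T z_i) + 2\gamma a_{uv} \qquad (3)$$

By definition $b_j^T z_i = \sum_{l=1}^{h} b_{jl} z_{il}$. Thus,

$$\frac{\partial}{\partial a_{uv}} \phi(b_j^T z_i) = \phi'(b_j^T z_i) \frac{\partial}{\partial a_{uv}} \left(b_j^T z_i\right) = \phi'(b_j^T z_i) \frac{\partial}{\partial a_{uv}} \left(\sum_{l=1}^{h} b_{jl} z_{il}\right) = \phi'(b_j^T z_i)\, b_{ju} \frac{\partial}{\partial a_{uv}} (z_{iu})$$

$$= \phi'(b_j^T z_i)\, b_{ju} \frac{\partial}{\partial a_{uv}} \phi(a_u^T x_i) = \phi'(b_j^T z_i)\, b_{ju}\, \phi'(a_u^T x_i)\, x_{iv} \qquad (4)$$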
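Replacing (4) into (3) and rearranging, with $\beta_{ij} = -2 (y_{ij} - f_{ij})\,\phi'(b_j^T z_i)$ as before:

$$\frac{\partial E}{\partial a_{uv}} = -2\sum_{i=1}^{n}\sum_{j=1}^{m} (y_{ij} - f_{ij})\,\phi'(b_j^T z_i)\, b_{ju}\, \phi'(a_u^T x_i)\, x_{iv} + 2\gamma a_{uv}$$

$$= \sum_{i=1}^{n}\sum_{j=1}^{m} \left\{-2 (y_{ij} - f_{ij})\,\phi'(b_j^T z_i)\right\} b_{ju}\, \phi'(a_u^T x_i)\, x_{iv} + 2\gamma a_{uv}$$

$$= \sum_{i=1}^{n}\sum_{j=1}^{m} \beta_{ij}\, b_{ju}\, \phi'(a_u^T x_i)\, x_{iv} + 2\gamma a_{uv} \qquad (5)$$

Let $\alpha_{iu} = \sum_{j=1}^{m} \beta_{ij}\, b_{ju}\, \phi'(a_u^T x_i) = \sum_{j=1}^{m} \beta_{ij}\, b_{ju}\, z_{iu}(1 - z_{iu})$, and replace in (5) to conclude that

$$\frac{\partial E}{\partial a_{uv}} = \sum_{i=1}^{n} \alpha_{iu} x_{iv} + 2\gamma a_{uv}$$

The update rule for gradient descent is

$$a_{uv}^{(t+1)} = a_{uv}^{(t)} - \eta\left\{\sum_{i=1}^{n} \alpha_{iu} x_{iv} + 2\gamma a_{uv}^{(t)}\right\} = a_{uv}^{(t)} - \eta \sum_{i=1}^{n} \alpha_{iu} x_{iv} - 2\eta\gamma a_{uv}^{(t)} = (1 - 2\eta\gamma)\, a_{uv}^{(t)} - \eta \sum_{i=1}^{n} \alpha_{iu} x_{iv}$$

The gradient descent update rule for this error function is:

$$b_{uv}^{(t+1)} = (1 - 2\eta\gamma)\, b_{uv}^{(t)} - \eta \sum_{i=1}^{n} \beta_{iu} z_{iv} \quad \text{for every } u \in \{1, 2, \dots, m\} \text{ and for every } v \in \{1, 2, \dots, h\}$$

$$a_{uv}^{(t+1)} = (1 - 2\eta\gamma)\, a_{uv}^{(t)} - \eta \sum_{i=1}^{n} \alpha_{iu} x_{iv} \quad \text{for every } u \in \{1, 2, \dots, h\} \text{ and for every } v \in \{1, 2, \dots, k\}$$

This shows that the minimization of the regularized error function can be implemented by multiplying each weight by the constant $(1 - 2\eta\gamma)$ before performing the standard gradient descent update.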
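The derivation can be sanity-checked numerically. The sketch below assumes the error is read as $E = \sum_i \sum_j (y_{ij} - f_{ij})^2 + \gamma \sum w^2$ with no bias terms; under that reading the shrink constant is $1 - 2\eta\gamma$ (a different scaling of the two terms would change the constant but not the argument). The network size, data, weights, and the values of `gamma` and `eta` are hypothetical, chosen only to compare the analytic gradient with a finite difference and to compare the two forms of the update.

```python
from math import exp

def phi(x):
    """Logistic activation."""
    return 1.0 / (1.0 + exp(-x))

def forward(a, b, x):
    """Hidden activations z and outputs f for one input x (no biases)."""
    z = [phi(sum(w * xv for w, xv in zip(row, x))) for row in a]
    f = [phi(sum(w * zv for w, zv in zip(row, z))) for row in b]
    return z, f

def error(a, b, X, Y, gamma):
    """Assumed form: squared error plus gamma times the sum of squared weights."""
    squared = 0.0
    for x, y in zip(X, Y):
        _, f = forward(a, b, x)
        squared += sum((yj - fj) ** 2 for yj, fj in zip(y, f))
    return squared + gamma * sum(w * w for row in a + b for w in row)

def gradients(a, b, X, Y, gamma):
    """Analytic gradients; beta absorbs the factor -2 from the squared error."""
    grad_a = [[2 * gamma * w for w in row] for row in a]
    grad_b = [[2 * gamma * w for w in row] for row in b]
    for x, y in zip(X, Y):
        z, f = forward(a, b, x)
        beta = [-2 * (yu - fu) * fu * (1 - fu) for yu, fu in zip(y, f)]
        for u in range(len(b)):
            for v in range(len(z)):
                grad_b[u][v] += beta[u] * z[v]
        for u in range(len(a)):
            alpha = sum(beta[j] * b[j][u] for j in range(len(b))) * z[u] * (1 - z[u])
            for v in range(len(x)):
                grad_a[u][v] += alpha * x[v]
    return grad_a, grad_b

# Hypothetical toy problem: 2 inputs, 2 hidden units, 1 output.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1.0], [0.0]]
a = [[0.1, -0.1], [0.2, 0.3]]
b = [[-0.1, 0.4]]
gamma, eta, eps = 0.01, 0.3, 1e-6

grad_a, grad_b = gradients(a, b, X, Y, gamma)

# Central finite difference on a[0][1] as a check of the analytic gradient.
a_plus = [row[:] for row in a]
a_minus = [row[:] for row in a]
a_plus[0][1] += eps
a_minus[0][1] -= eps
numeric = (error(a_plus, b, X, Y, gamma) - error(a_minus, b, X, Y, gamma)) / (2 * eps)

# Regularized step versus "shrink the weight, then take the unregularized step".
_, plain_grad_b = gradients(a, b, X, Y, 0.0)
direct = [w - eta * g for w, g in zip(b[0], grad_b[0])]
decayed = [(1 - 2 * eta * gamma) * w - eta * g for w, g in zip(b[0], plain_grad_b[0])]

print(numeric, grad_a[0][1])
print(direct, decayed)
```

The two printed pairs should agree up to floating-point error, which is the numerical counterpart of the algebraic identity in the derivation.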
Problem 5: Using the neural network code from class as a starting point, compare the following ensemble methods in a binary classification scenario: 0 bagged neural networks with 0 bagged regression trees. In each case, the final decision is made by averaging the raw outputs from the ensemble members. Use at least different classification data sets to provide comparisons. Select appropriate data sets from the UCI Machine Learning Repository. Proper comparisons should be made using 10-fold cross-validation experiments. Use your knowledge about model comparison to formally conclude which of the two algorithms is better. Normalize the variables on the training set, then normalize the test set using the normalization parameters from the training set (e.g. mean and standard deviation for each feature). If data sets have categorical variables, provide a proper approach to convert them into numerical variables. Use classification error to evaluate the quality of the classifiers.

Solution: All the code for this problem was produced using MATLAB. As usual, you can find the code attached to the OnCourse submission. I tested the following three data sets:

1. Spambase: https://archive.ics.uci.edu/ml/datasets/spambase
2. Transfusion: https://archive.ics.uci.edu/ml/datasets/Blood+transfusion+service+center
3. Ionosphere: https://archive.ics.uci.edu/ml/datasets/ionosphere

Results on each of the sets follow:

        Spambase             Transfusion        Ionosphere
Run #   NN        Trees      NN       Trees     NN        Trees
1       0.087     0.01606    0.19     0.10      0.08      0.01
2       0.0891    0.017      0.11     0.167     0.08      0.00698
3       0.0798    0.018      0.1      0.177     0.08      0.01196
4       0.0816    0.009780   0.0      0.166     0.0707    0.00698
5       0.08187   0.01171    0.17     0.166     0.09886   0.01
6       0.0898    0.010867   0.17     0.1       0.08      0.0087
7       0.0791    0.01171    0.1791   0.169     0.018     0.0089
8       0.08187   0.01177    0.1      0.1701    0.0707    0.01196
9       0.0816    0.01606    0.07     0.1       0.08      0.01
10      0.0810    0.0099978  0.19     0.10      0.018     0.0087
11      0.0889    0.010867   0.1      0.1       0.061     0.01196
12      0.0811    0.0191     0.19     0.1       0.011     0.0087
13      0.0891    0.0101     0.17     0.110     0.09886   0.01196
14      0.0876    0.018      0.11     0.1       0.011     0.01
15      0.0811    0.01171    0.17     0.199     0.08      0.0087
16      0.08808   0.0169     0.17     0.18      0.09886   0.00698
17      0.07861   0.01108    0.0989   0.1       0.09886   0.0087
18      0.0811    0.017      0.11     0.1968    0.09886   0.0087
19      0.0798    0.0189     0.1      0.167     0.0698    0.01709
20      0.08      0.01171    0.11     0.190     0.07      0.01196
21      0.08686   0.01606    0.19     0.10      0.018     0.0087
22      0.0876    0.0189     0.17     0.199     0.011     0.0087
23      0.0879    0.0110     0.1      0.1       0.09886   0.01
24      0.08199   0.01171    0.19     0.1       0.08      0.0087
25      0.0808    0.01171    0.086    0.1968    0.08      0.01

These results correspond to 25 runs of the corresponding bagged algorithm. Each bag consisted of 0 models, i.e., either 0 Neural networks or 0 Trees respectively. The Neural Networks are feedforward networks with 1 hidden layer of 10 neurons, trained using Scaled Conjugate Gradient. The models for the bags were selected by 10-fold cross-validation. Note that all the data was normalized before training the models. All of the datasets are binary and thus the final decision was made by taking a majority vote of the individual decisions from the 0 constituents of a single bag. In case of a tie, one class was chosen randomly and with
equal probability. For each run we have the classification error defined as:

$$E = \frac{\#\,\text{examples incorrectly classified}}{\#\,\text{total data set}}$$

For example, for Spambase, run 1, the bagged Neural Network (NN) misclassified 0.087 (about 8.% of examples) whereas the bag of Trees misclassified 0.01606 (about 1.% of examples). Just by inspecting the data one can conclude that Trees outperformed NN in every case. For mathematically sound evidence, let us perform two different statistical tests:

1. Let us hypothesize that NN perform the same as Trees, i.e., $H_0: T = NN$ vs. $H_1: T > NN$. Suppose that $H_0$ is true. In this case the probability that Trees won 25 times out of 25 is given by:

$$p = \sum_{i=25}^{25} \binom{25}{i} \left(\frac{1}{2}\right)^{i} \left(\frac{1}{2}\right)^{25-i} = \binom{25}{25} \left(\frac{1}{2}\right)^{25} \left(\frac{1}{2}\right)^{0} = \left(\frac{1}{2}\right)^{25} = 2.9802322387695312 \times 10^{-8}$$

Clearly, this probability is small enough to reject $H_0$ at any reasonable significance level. Conclude that it is not the case that $T = NN$. Note also that this test applies to all data sets, since in all of them Trees won every time.

2. The next test is a T-test for the means of two independent samples. The samples in this case are the results for NN and Trees. For this I used the following function provided by the Python package SciPy:

   (ttest, pvalue) = ttest_ind(data_nni, data_treesi)

where data_nni is a vector of numbers with the results for bagged neural networks. Likewise, data_treesi is a vector of numbers with the bagged results for trees. According to the function documentation (see 4), this function:

"Calculates the T-test for the means of TWO INDEPENDENT samples of scores. This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances... We can use this test, if we observe two independent samples from the same or different population, e.g. exam scores of boys and girls or of two ethnic groups. The test measures whether the average (expected) value differs significantly across samples. If we observe a large p-value, for example larger than 0.05 or 0.1, then we cannot reject the null hypothesis of identical average scores. If the p-value is smaller than the threshold, e.g. 1%, 5% or 10%, then we reject the null hypothesis of equal averages."

The results of the test for each data set are:

dataset      ttest      pvalue
spambase     1.1769     1.107168 10 6
transfusion  7.671971   6.70 10 1
ionosphere   .0061111   9.7889 10 7

Clearly, in all cases we reject the hypothesis that the mean classification errors of the two algorithms are the same. Together, tests (1) and (2) provide strong evidence that bagged Trees outperform bagged Neural Networks on all datasets tested. This may be because these are small datasets, where Trees have a greater chance of fitting the data. An alternative explanation is that more than 10 neurons would be needed to obtain better results with neural networks.
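The probability used in test (1) is a one-sided binomial tail under the null hypothesis that each run is a fair coin flip. A minimal sketch using only the standard library follows; the function name `sign_test_p_value` is introduced here for illustration and is not part of the original MATLAB or SciPy code.

```python
from math import comb

def sign_test_p_value(wins, runs):
    """P(X >= wins) for X ~ Binomial(runs, 1/2)."""
    return sum(comb(runs, i) for i in range(wins, runs + 1)) / 2 ** runs

# Trees had the lower error in all 25 runs.
p = sign_test_p_value(25, 25)
print(p)  # about 2.98e-08
```

The same function gives the tail probability for less extreme outcomes, for example `sign_test_p_value(20, 25)` if one method had won only 20 of the 25 runs.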
References

1. www.pybrain.org
2. Tom Mitchell's book, chapter on decision trees.
3. UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/index.html
4. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind