B555 - Machine Learning - Homework 4. Enrique Areyan April 28, 2015


Problem 1: Give decision trees to represent the following Boolean functions (parts (a)-(e)), which combine the attributes A, B, C and D using conjunction, disjunction, negation (Ā denotes the negation of A) and the exclusive-OR operation.

Solution: I will use the convention that the branch leaving the left side of a node for attribute A corresponds to that attribute being true (A), and the branch leaving the right side corresponds to it being false (Ā).

[Decision-tree diagrams for parts (a)-(e): each tree tests A at the root (left branch A, right branch Ā) and then tests the remaining attributes as needed. Note for parts (d) and (e): read each root-to-leaf path as a conjunction of the attribute values along it.]
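To make the path-as-conjunction reading concrete, here is a minimal sketch that encodes one such tree as nested conditionals. The function (A ∧ B) ∨ (C ∧ D) is a representative example of my own choosing, not necessarily one of the assigned functions; every root-to-leaf path that ends in a positive leaf corresponds to one conjunction of literals.

    from itertools import product

    def tree_a_and_b_or_c_and_d(A: bool, B: bool, C: bool, D: bool) -> bool:
        """Decision tree for (A and B) or (C and D), written as nested tests."""
        if A:                # left branch: A is true
            if B:
                return True  # path {A, B} -> positive leaf
            if C:            # A true but B false: the function reduces to (C and D)
                return D
            return False
        else:                # right branch: A is false, reduces to (C and D)
            if C:
                return D
            return False

    # Exhaustive check of the tree against the formula itself.
    for A, B, C, D in product([False, True], repeat=4):
        assert tree_a_and_b_or_c_and_d(A, B, C, D) == ((A and B) or (C and D))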

Problem 2: Consider the data set from the following table:

       Sky    Temperature  Humidity  Wind    Water  Forecast  Enjoy Sport
    1  Sunny  Warm         Normal    Strong  Warm   Same      Yes
    2  Sunny  Warm         High      Strong  Warm   Same      Yes
    3  Rainy  Cold         High      Strong  Warm   Change    No
    4  Sunny  Warm         High      Strong  Cool   Change    Yes

a) Using Enjoy Sport as the target, show the decision tree that would be learned if the splitting criterion were information gain.

b) Add the training example from the table below and compute the new decision tree. This time, show the value of the information gain for each candidate attribute at each step in growing the tree.

       Sky    Temperature  Humidity  Wind  Water  Forecast  Enjoy Sport
    5  Sunny  Warm         Normal    Weak  Warm   Same      No

Solution: I will use the notation from Tom Mitchell's book to denote a set with positive and negative examples, e.g., S = ⟨3+, 1−⟩ for the first data set (there are 3 positive examples, corresponding to Enjoy Sport = Yes, and 1 negative example, corresponding to Enjoy Sport = No).

a) Let S = ⟨3+, 1−⟩. Then Entropy(S) = −(3/4) log₂(3/4) − (1/4) log₂(1/4) = 0.811. Clearly, if the splitting criterion is information gain, we will split on either Sky or Temperature, since each of these partitions the training set perfectly right away. Nonetheless, let us check that the math works. We will use the following formula to compute the information gain of an attribute A:

    Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

i. Values(Sky) = {Sunny, Rainy}. S_Sunny = ⟨3+, 0−⟩, so Entropy(S_Sunny) = 0; S_Rainy = ⟨0+, 1−⟩, so Entropy(S_Rainy) = 0.
   Gain(S, Sky) = 0.811 − (3/4)·0 − (1/4)·0 = 0.811

ii. Values(Temperature) = {Warm, Cold}. S_Warm = ⟨3+, 0−⟩, Entropy(S_Warm) = 0; S_Cold = ⟨0+, 1−⟩, Entropy(S_Cold) = 0.
   Gain(S, Temperature) = 0.811 − (3/4)·0 − (1/4)·0 = 0.811

iii. Values(Humidity) = {Normal, High}. S_Normal = ⟨1+, 0−⟩, Entropy(S_Normal) = 0; S_High = ⟨2+, 1−⟩, Entropy(S_High) = −(2/3) log₂(2/3) − (1/3) log₂(1/3) = 0.918.
   Gain(S, Humidity) = 0.811 − (1/4)·0 − (3/4)·0.918 = 0.1226

iv. Values(Wind) = {Strong}. S_Strong = ⟨3+, 1−⟩, Entropy(S_Strong) = −(3/4) log₂(3/4) − (1/4) log₂(1/4) = 0.811.
   Gain(S, Wind) = 0.811 − 0.811 = 0

v. Values(Water) = {Warm, Cool}. S_Warm = ⟨2+, 1−⟩, Entropy(S_Warm) = −(2/3) log₂(2/3) − (1/3) log₂(1/3) = 0.918; S_Cool = ⟨1+, 0−⟩, Entropy(S_Cool) = 0.
   Gain(S, Water) = 0.811 − (3/4)·0.918 − (1/4)·0 = 0.1226

vi. V alues(f orecast) {Same, Change}. S Same +, 0 Entropy(S Same ) 0. S Change 1+, 1 Entropy(S Change ) 1 log 1 log 1. Gain(S, F orecast) 0.811 0 1 1 Gain(S, F orecast) 0.11 We have a tie between Sky and T emperature for the attributes with higher Information Gain. Choose one randomly, say T emperature to get the decision tree for the concept Enjoy Sport: W arm Temperature Cold Yes No b) Let us go through the same process as before but adding the new training example. Let S +,. Then, Entropy(S) ( ) log ( ) log 0.9710. i. V alues(sky) {Sunny, Rainy}. S Sunny +, 1 Entropy(S Sunny ) log S Rainy 0+, 1 Entropy(S Rainy ) 0. ( ) 1 log Gain(S, Sky) 0.9710 0.811 0 Gain(S, Sky) 0.0 ii. V alues(t emperature) {W arm, Cold}. S W arm +, 1 Entropy(S W arm ) log S Cold 0+, 1 Entropy(S Cold ) 0. ( ) 1 log Gain(S, T emperature) 0.9710 0.811 0 Gain(S, T emperature) 0.0 iii. V alues(humidity) {N ormal, High}. S Normal 1+, 1 Entropy(S Normal ) 1 log 1 log 1. S High +, 1 Entropy(S High ) ( ) log 1 log 0.918. Gain(S, Humidity) 0.9710 1 0.918 Gain(S, Humidity) 0.000 iv. V alues(w ind) {Strong, W eak}. S Strong +, 1 Entropy(S Strong ) log S W eak 0+, 1 Entropy(S Strong ) 0. ( ) 1 log Gain(S, W ind) 0.9710 0.811 0 Gain(S, W ind) 0.0 v. V alues(w ater) {W arm, Cool}. S W arm +, Entropy(S W arm ) log S Cool 1+, 0 Entropy(S Cool ) 0. ( ) log Gain(S, W ater) 0.9710 1 0 Gain(S, W ater) 0.1710 ( ) 1. vi. V alues(f orecast) {Same, Change}. S Same +, 1 Entropy(S Same ) ( ) log 1 log 0.918. S Change 1+, 1 Entropy(S Change ) 1 log 1 log 1. Gain(S, F orecast) 0.9710 0.918 1 Gain(S, F orecast) 0.000

There is a three-way tie between the attributes Sky, Temperature and Wind. I choose Temperature first:

    Temperature
      Warm → ⟨3+, 1−⟩ (to be expanded)
      Cold → No

Now we repeat the process, but with S containing only the examples S = {1, 2, 4, 5}. In this case S = ⟨3+, 1−⟩, so Entropy(S) = −(3/4) log₂(3/4) − (1/4) log₂(1/4) = 0.811.

i. Values(Sky) = {Sunny}. S_Sunny = ⟨3+, 1−⟩, Entropy(S_Sunny) = 0.811.
   Gain(S, Sky) = 0.811 − 0.811 = 0

ii. Values(Humidity) = {Normal, High}. S_Normal = ⟨1+, 1−⟩, Entropy(S_Normal) = 1; S_High = ⟨2+, 0−⟩, Entropy(S_High) = 0.
   Gain(S, Humidity) = 0.811 − (2/4)·1 − (2/4)·0 = 0.311

iii. Values(Wind) = {Strong, Weak}. S_Strong = ⟨3+, 0−⟩, Entropy(S_Strong) = 0; S_Weak = ⟨0+, 1−⟩, Entropy(S_Weak) = 0.
   Gain(S, Wind) = 0.811 − (3/4)·0 − (1/4)·0 = 0.811

iv. Values(Water) = {Warm, Cool}. S_Warm = ⟨2+, 1−⟩, Entropy(S_Warm) = 0.918; S_Cool = ⟨1+, 0−⟩, Entropy(S_Cool) = 0.
   Gain(S, Water) = 0.811 − (3/4)·0.918 − (1/4)·0 = 0.1226

v. Values(Forecast) = {Same, Change}. S_Same = ⟨2+, 1−⟩, Entropy(S_Same) = 0.918; S_Change = ⟨1+, 0−⟩, Entropy(S_Change) = 0.
   Gain(S, Forecast) = 0.811 − (3/4)·0.918 − (1/4)·0 = 0.1226

The attribute Wind has the highest information gain, so we split on it:

    Temperature
      Warm → Wind
               Strong → Yes
               Weak   → No
      Cold → No

All training examples are correctly classified, so this is the final tree.
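The entropies and gains above are easy to check with a short script. This is only an illustrative sketch, not code submitted with the homework; the data and attribute names are exactly those of the tables of Problem 2, and entropies use base-2 logarithms.

    from collections import Counter
    from math import log2

    COLUMNS = ["Sky", "Temperature", "Humidity", "Wind", "Water", "Forecast", "EnjoySport"]
    DATA = [
        ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same",   "Yes"),
        ("Sunny", "Warm", "High",   "Strong", "Warm", "Same",   "Yes"),
        ("Rainy", "Cold", "High",   "Strong", "Warm", "Change", "No"),
        ("Sunny", "Warm", "High",   "Strong", "Cool", "Change", "Yes"),
        ("Sunny", "Warm", "Normal", "Weak",   "Warm", "Same",   "No"),   # example 5, part (b)
    ]

    def entropy(examples):
        """Entropy of the Enjoy Sport labels (last column) of a list of examples."""
        counts = Counter(row[-1] for row in examples)
        total = sum(counts.values())
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def gain(examples, attribute):
        """Information gain of splitting the examples on the given attribute."""
        col = COLUMNS.index(attribute)
        remainder = 0.0
        for value in {row[col] for row in examples}:
            subset = [row for row in examples if row[col] == value]
            remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(examples) - remainder

    for name, subset in [("part (a)", DATA[:4]), ("part (b)", DATA)]:
        print(name, "Entropy(S) = %.4f" % entropy(subset))   # 0.8113 and 0.9710
        for attr in COLUMNS[:-1]:
            print("  Gain(S, %s) = %.4f" % (attr, gain(subset, attr)))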

Problem 3: Consider a two-layer feed-forward neural network with two inputs x_1 and x_2, one hidden unit z, and one output unit f. This network has five weights (w_1z, w_2z, w_0z, w_zf, w_0f), where w_0z represents the bias for unit z and w_0f represents the bias for unit f. Initialize these weights to the values (0.1, −0.1, 0.1, −0.1, 0.1), then give their values after each of the first two training iterations of the backpropagation algorithm. Assume learning rate η, momentum μ = 0.9, incremental weight updates, and the following training examples:

    {((x_1, x_2)_i, y_i)} = {((1, 0), 1), ((0, 1), 0)}

Solution: Using the pybrain tool in Python, we obtain the following weights after each of the first two iterations of backpropagation:

    #   w_1z     w_2z    w_0z    w_zf    w_0f
    1   0.087   -0.10    0.08    0.17    0.0
    2   0.08    -0.1     0.060   0.66    0.710

Note that this neural network has only one hidden unit, so it learns a linear concept, analogous to logistic regression. It is interesting to see the lines generated by the hidden unit for the first two iterations:

[Figure: the separating lines produced by the hidden unit after the first (blue) and second (green) iterations, plotted together with the two training points.]

We start to see how the neural network adapts to the data, which in this case is linearly separable. The green line corresponds to the second iteration and the blue line to the first iteration.
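Because pybrain hides the individual weight updates, the same computation can be sketched from scratch: one sigmoid hidden unit, one sigmoid output, squared-error loss, incremental updates with a momentum term. The learning rate of 0.2 and the choice of sigmoid activations below are assumptions made purely for illustration, so the numbers printed need not match the pybrain output above.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Weights (w1z, w2z, w0z, wzf, w0f) initialized as in the problem statement.
    w = {"w1z": 0.1, "w2z": -0.1, "w0z": 0.1, "wzf": -0.1, "w0f": 0.1}
    velocity = {k: 0.0 for k in w}           # previous update, for the momentum term
    ETA, MU = 0.2, 0.9                       # assumed learning rate, given momentum
    EXAMPLES = [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.0)]

    for iteration in range(2):               # first two passes over the training data
        for (x1, x2), y in EXAMPLES:         # incremental (per-example) updates
            # Forward pass.
            z = sigmoid(w["w1z"] * x1 + w["w2z"] * x2 + w["w0z"])
            f = sigmoid(w["wzf"] * z + w["w0f"])
            # Backward pass for the squared error (y - f)^2 / 2.
            delta_f = (y - f) * f * (1.0 - f)
            delta_z = delta_f * w["wzf"] * z * (1.0 - z)
            grads = {"wzf": delta_f * z, "w0f": delta_f,
                     "w1z": delta_z * x1, "w2z": delta_z * x2, "w0z": delta_z}
            # Momentum update: Delta_w(t) = eta * grad + mu * Delta_w(t-1).
            for k in w:
                velocity[k] = ETA * grads[k] + MU * velocity[k]
                w[k] += velocity[k]
        print("after iteration %d: %s"
              % (iteration + 1, ", ".join("%s=%+.4f" % (k, v) for k, v in w.items())))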

Problem 4: Consider minimizing the following regularized error function using the backpropagation algorithm:

    E = Σ_{i=1}^{n} Σ_{j=1}^{m} (y_ij − f_ij)² + γ Σ_{i,j} w_{i,j}²

Derive the gradient descent update rule for this error function. Show that it can be implemented by multiplying each weight by some constant before performing the standard gradient descent update, as shown in class.

Solution: Following the notation used in class, let us minimize the regularized error function above. Notation: b_uv denotes the weight from hidden unit Z_v to output unit f_u (so b_u is the weight vector feeding output u), and a_uv denotes the weight from input X_v to hidden unit Z_u; the network has k inputs, h hidden units and m outputs. We optimize first with respect to the b's and then the a's:

    ∂E/∂b_uv = ∂/∂b_uv [ Σ_{i=1}^{n} Σ_{j=1}^{m} (y_ij − f_ij)² + γ Σ_{i,j} w_{i,j}² ]
             = ∂/∂b_uv Σ_{i=1}^{n} [ (y_i1 − f_i1)² + (y_i2 − f_i2)² + ... + (y_im − f_im)² ] + 2γ b_uv
             = −2 Σ_{i=1}^{n} (y_iu − f_iu) ∂/∂b_uv ϕ(b_u^T z_i) + 2γ b_uv

Dropping the common factor of 2 (it can be absorbed into the learning rate η), we are left with

    ∂E/∂b_uv ∝ −Σ_{i=1}^{n} (y_iu − f_iu) ∂/∂b_uv ϕ(b_u^T z_i) + γ b_uv    (1)

Choose ϕ(x) = 1 / (1 + e^{−x}). Then

    ϕ'(x) = e^{−x} / (1 + e^{−x})² = [1 / (1 + e^{−x})] · [e^{−x} / (1 + e^{−x})] = ϕ(x)(1 − ϕ(x)),

so that

    (1) = −Σ_{i=1}^{n} (y_iu − f_iu) ϕ(b_u^T z_i)(1 − ϕ(b_u^T z_i)) z_iv + γ b_uv    (2)

Note that f_iu = ϕ(b_u^T z_i), which means that 1 − f_iu = 1 − ϕ(b_u^T z_i). Thus

    (2) = −Σ_{i=1}^{n} (y_iu − f_iu) f_iu (1 − f_iu) z_iv + γ b_uv

Let β_iu = (y_iu − f_iu) f_iu (1 − f_iu) to conclude that

    ∂E/∂b_uv = −Σ_{i=1}^{n} β_iu z_iv + γ b_uv

The gradient descent update rule is then

    b_uv^(t+1) = b_uv^(t) − η { −Σ_{i=1}^{n} β_iu z_iv + γ b_uv^(t) }
               = b_uv^(t) + η Σ_{i=1}^{n} β_iu z_iv − ηγ b_uv^(t)
               = (1 − ηγ) b_uv^(t) + η Σ_{i=1}^{n} β_iu z_iv

Optimization for the a's (again dropping the common factor of 2):

    ∂E/∂a_uv = ∂/∂a_uv [ Σ_{i=1}^{n} Σ_{j=1}^{m} (y_ij − f_ij)² + γ Σ_{i,j} w_{i,j}² ]
             = −Σ_{i=1}^{n} Σ_{j=1}^{m} (y_ij − f_ij) ∂/∂a_uv ϕ(b_j^T z_i) + γ a_uv    (3)

By definition b_j^T z_i = Σ_{l=1}^{h} b_jl z_il. Thus

    ∂/∂a_uv ϕ(b_j^T z_i) = ϕ'(b_j^T z_i) ∂/∂a_uv (b_j^T z_i)
                         = ϕ'(b_j^T z_i) ∂/∂a_uv Σ_{l=1}^{h} b_jl z_il
                         = ϕ'(b_j^T z_i) b_ju ∂/∂a_uv z_iu
                         = ϕ'(b_j^T z_i) b_ju ∂/∂a_uv ϕ(a_u^T x_i)
                         = ϕ'(b_j^T z_i) b_ju ϕ'(a_u^T x_i) x_iv    (4)

Replacing (4) into (3) and rearranging:

    ∂E/∂a_uv = −Σ_{i=1}^{n} Σ_{j=1}^{m} (y_ij − f_ij) ϕ'(b_j^T z_i) b_ju ϕ'(a_u^T x_i) x_iv + γ a_uv
             = −Σ_{i=1}^{n} [ Σ_{j=1}^{m} { (y_ij − f_ij) ϕ'(b_j^T z_i) } b_ju ] ϕ'(a_u^T x_i) x_iv + γ a_uv
             = −Σ_{i=1}^{n} [ Σ_{j=1}^{m} β_ij b_ju ] ϕ'(a_u^T x_i) x_iv + γ a_uv    (5)

Let α_iu = ( Σ_{j=1}^{m} β_ij b_ju ) ϕ'(a_u^T x_i) = ( Σ_{j=1}^{m} β_ij b_ju ) z_iu (1 − z_iu), and replace in (5) to conclude that

    ∂E/∂a_uv = −Σ_{i=1}^{n} α_iu x_iv + γ a_uv

The gradient descent update rule is

    a_uv^(t+1) = a_uv^(t) − η { −Σ_{i=1}^{n} α_iu x_iv + γ a_uv^(t) } = (1 − ηγ) a_uv^(t) + η Σ_{i=1}^{n} α_iu x_iv

In summary, the gradient descent update rules for this error function are

    b_uv^(t+1) = (1 − ηγ) b_uv^(t) + η Σ_{i=1}^{n} β_iu z_iv,   for every u ∈ {1, ..., m} and every v ∈ {1, ..., h};
    a_uv^(t+1) = (1 − ηγ) a_uv^(t) + η Σ_{i=1}^{n} α_iu x_iv,   for every u ∈ {1, ..., h} and every v ∈ {1, ..., k}.

This shows that minimizing the regularized error function can be implemented by multiplying each weight by the constant (1 − ηγ) before performing the standard gradient descent update.
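The conclusion is exactly weight decay: scale every weight by (1 − ηγ) and then apply the ordinary backpropagation step. A minimal sketch of this equivalence, where w and grad stand for any weight array and the corresponding unregularized gradient (names and numbers are illustrative, and constant factors are assumed to be absorbed into η and γ as in the derivation):

    import numpy as np

    def regularized_step(w, grad, eta, gamma):
        """One step when the gradient of the regularized error is grad + gamma * w.

        Two equivalent forms of the update derived above:
          direct:            w <- w - eta * (grad + gamma * w)
          decay-then-update: w <- (1 - eta * gamma) * w - eta * grad
        """
        w_direct = w - eta * (grad + gamma * w)
        w_decayed = (1.0 - eta * gamma) * w - eta * grad
        assert np.allclose(w_direct, w_decayed)   # the two forms coincide
        return w_decayed

    w = np.array([0.1, -0.1, 0.1])        # illustrative weights
    grad = np.array([0.05, 0.02, -0.03])  # illustrative unregularized gradient
    print(regularized_step(w, grad, eta=0.1, gamma=0.01))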

Problem 5: Using the neural network code from class as a starting point, compare the following ensemble methods in a binary classification scenario: a bag of neural networks against a bag of regression trees, with the same number of models in each bag. In each case, the final decision is made by averaging the raw outputs of the ensemble members. Use at least three different classification data sets to provide comparisons; select appropriate data sets from the UCI Machine Learning Repository. Proper comparisons should be made using 10-fold cross-validation experiments. Use your knowledge about model comparison to formally conclude which of the two algorithms is better. Normalize the variables on the training set, then normalize the test set using the normalization parameters from the training set (e.g. the mean and standard deviation of each feature). If a data set has categorical variables, provide a proper approach to convert them into numerical variables. Use the classification error to evaluate the quality of the classifiers.

Solution: All the code for this problem was produced using MatLab; as usual, the code is attached to the OnCourse submission. I tested the following three data sets:

1. Spambase: https://archive.ics.uci.edu/ml/datasets/spambase
2. Transfusion: https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center
3. Ionosphere: https://archive.ics.uci.edu/ml/datasets/ionosphere

[Table: classification error per run (columns Run #, NN, Trees) for 25 runs on each of the three data sets (Spambase, Transfusion, Ionosphere); the bagged trees obtain a lower error than the bagged neural networks in every run on every data set.]

These results correspond to 25 runs of each bagged algorithm. Each bag consisted of the same fixed number of models, either all neural networks or all regression trees. The neural networks are feedforward networks with one hidden layer of 10 neurons, trained with scaled conjugate gradient. The models for the bags were selected by 10-fold cross-validation, and all of the data was normalized before training. All of the data sets are binary, so the final decision was made by taking a majority vote of the individual decisions of the members of a bag; in case of a tie, one class was chosen at random with equal probability.
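The bagging scheme itself (bootstrap resampling plus a majority vote over the members of one bag) is straightforward. The sketch below is an illustrative Python version, not the MatLab code that was actually submitted; it uses scikit-learn decision trees as the base model and assumes binary labels coded as 0/1.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_bag(X, y, n_models, rng):
        """Train one bag: each member is fit on a bootstrap sample of (X, y)."""
        members = []
        n = len(X)
        for _ in range(n_models):
            idx = rng.choice(n, size=n, replace=True)   # bootstrap resample
            members.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return members

    def predict_bag(members, X, rng):
        """Majority vote of the members; ties are broken uniformly at random."""
        votes = np.stack([m.predict(X) for m in members])    # shape (n_models, n_points)
        positives = votes.sum(axis=0)                        # number of votes for class 1
        out = np.where(2 * positives > len(members), 1, 0)
        ties = 2 * positives == len(members)
        out[ties] = rng.integers(0, 2, size=int(ties.sum())) # random tie-break
        return out

    # Example usage (X_train, y_train, X_test, y_test assumed to be NumPy arrays):
    #   rng = np.random.default_rng(0)
    #   bag = train_bag(X_train, y_train, n_models=20, rng=rng)
    #   error = np.mean(predict_bag(bag, X_test, rng) != y_test)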

For each run, the classification error is defined as

    E = (# examples incorrectly classified) / (# examples in the data set)

For example, on Spambase, run 1, the bagged neural networks (NN) had an error of 0.087 (they misclassified about 8% of the examples), whereas the bagged trees had an error of 0.01606 (about 1.6% of the examples). Just by inspecting the data one can see that the trees outperformed the neural networks in every case. For mathematically sound evidence, let us perform two different statistical tests:

1. Hypothesize that the neural networks perform the same as the trees, i.e., test H_0: T = NN against H_1: T > NN. Suppose that H_0 is true. Then the probability that the trees win all 25 runs out of 25 is

    p = Σ_{i=25}^{25} C(25, i) (1/2)^i (1/2)^{25−i} = C(25, 25) (1/2)^25 (1/2)^0 = (1/2)^25 ≈ 2.98 × 10^−8

This probability is small enough at any reasonable significance level, so we conclude that it is not the case that T = NN. Note also that this test applies to every data set, since the trees won every run on every data set.

2. The second test is a t-test for the means of two independent samples. The samples in this case are the 25 error values for the bagged neural networks and the 25 error values for the bagged trees on a given data set. For this I used the following function provided by the Python package SciPy:

    (ttest, pvalue) = ttest_ind(data_nni, data_treesi)

where data_nni is the vector of error values for the bagged neural networks on data set i, and data_treesi is the corresponding vector for the bagged trees. According to the function's documentation (see reference 4), this function "Calculates the T-test for the means of TWO INDEPENDENT samples of scores. This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances. ... We can use this test, if we observe two independent samples from the same or different population, e.g. exam scores of boys and girls or of two ethnic groups. The test measures whether the average (expected) value differs significantly across samples. If we observe a large p-value, for example larger than 0.05 or 0.1, then we cannot reject the null hypothesis of identical average scores. If the p-value is smaller than the threshold, e.g. 1%, 5% or 10%, then we reject the null hypothesis of equal averages."

[Table: t statistic and p-value of the test for each data set (Spambase, Transfusion, Ionosphere); all three p-values are small enough to reject the null hypothesis.]

Clearly, in all cases we reject the hypothesis that the mean classification errors of the two algorithms are the same. Together, tests (1) and (2) provide strong evidence that bagged trees outperform bagged neural networks on all of the data sets tested. This may be due to the fact that these are small data sets, where trees have a better chance of fitting the data; an alternative explanation is that more than 10 hidden neurons would be needed to obtain better results from the neural networks.
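Both tests are easy to reproduce. The sketch below uses made-up error vectors in place of the real per-run results (hypothetical values, 25 runs each); scipy.stats.ttest_ind is the same SciPy function quoted above, and the sign-test probability is the binomial computation from test (1).

    from scipy import stats

    # Hypothetical per-run classification errors for one data set (25 runs each);
    # in the homework these come from the 10-fold cross-validation experiments.
    errors_nn = [0.083, 0.081, 0.085, 0.079, 0.082] * 5
    errors_trees = [0.016, 0.014, 0.017, 0.015, 0.013] * 5

    # Test (1): sign test. Under H0 (both methods equally good), the probability
    # that the trees win all n runs is (1/2)^n.
    n_runs = len(errors_nn)
    tree_wins = sum(t < n for t, n in zip(errors_trees, errors_nn))
    if tree_wins == n_runs:
        print("trees won %d/%d runs, sign-test p = %.3g" % (tree_wins, n_runs, 0.5 ** n_runs))

    # Test (2): two-sample t-test for equal mean error.
    result = stats.ttest_ind(errors_nn, errors_trees)
    print("t = %.4f, p-value = %.3g" % (result.statistic, result.pvalue))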

References

1. pybrain: www.pybrain.org
2. Tom Mitchell, Machine Learning, McGraw Hill, 1997 (chapter on decision trees).
3. UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/index.html
4. SciPy documentation for scipy.stats.ttest_ind: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind