Neural Networks and Ensemble Methods for Classification


NEURAL NETWORKS

Neural Networks
A neural network is a set of connected input/output units (neurons) where each connection has a weight associated with it.

During the learning phase, the network learns by adjusting the weights so that it can predict the correct class label of the input samples (the training samples).
- Knowledge about the learning task is given in the form of examples.
- Inter-neuron connection strengths (weights) are used to store the acquired information (the training examples).
- During the learning process the weights are modified in order to model the particular learning task correctly on the training examples.

Neural Networks: Advantages and Criticism

Advantages
- Prediction accuracy is generally high.
- Robust: works even when training examples contain errors or noisy data.
- Output may be discrete, real-valued, or a vector of several discrete or real-valued attributes.
- Fast evaluation of the learned target function.

Criticism
- Parameters are best determined empirically, such as the network topology or structure.
- Long training time.
- Difficult to understand the learned function (weights).
- Not easy to incorporate domain knowledge.

Network architectures
Three different classes of network architectures:
- single-layer feed-forward: an input layer of source nodes connected directly to an output layer of neurons
- multi-layer feed-forward: an input layer, one or more hidden layers, and an output layer
- recurrent
In feed-forward networks, neurons are organized in acyclic layers. The architecture of a neural network is linked with the learning algorithm used to train it.

Neurons
Neural networks are built out of a densely interconnected set of simple units (neurons).
- Each neuron takes a number of real-valued inputs and produces a single real-valued output.
- Inputs to a neuron may be the outputs of other neurons; a neuron's output may be used as input to many other neurons.

The neuron
- Input signals x_1, x_2, ..., x_m with associated weights w_1, w_2, ..., w_m.
- Bias b (weight w_0): serves to vary the activity of the unit.
- Adder function (linear combiner), which computes the weighted sum of the inputs (the local field v):
  u = b + Σ_{j=1..m} w_j x_j
- Activation function (squashing function) for limiting the amplitude of the output of the neuron:
  y = φ(u)
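To make the neuron model concrete, here is a minimal Python sketch (not from the original slides) of the adder and squashing function. The function name neuron_output and the use of NumPy are illustrative assumptions, and the logistic function stands in for a generic activation φ.

```python
import numpy as np

def neuron_output(x, w, b, activation=lambda u: 1.0 / (1.0 + np.exp(-u))):
    """One neuron: squash the weighted sum of the inputs plus the bias."""
    u = b + np.dot(w, x)   # adder / linear combiner: u = b + sum_j w_j * x_j
    return activation(u)   # squashing function (logistic by default)

# Illustrative values: three inputs, arbitrary weights and bias
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.2, 0.4, -0.5])
print(neuron_output(x, w, b=-0.4))   # about 0.332
```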

The Neuron: How Does It Work?
- Assign a weight to each input link.
- Multiply each weight by the input value (0 or 1).
- Sum all the weighted inputs.
- Apply the squash function, e.g.: if the sum is greater than the threshold for the neuron, then Output = +1, else Output = -1.

Popular activation functions
- Linear activation: φ(z) = z
- Logistic activation: φ(z) = 1 / (1 + e^(-z))
- Threshold (step) activation: φ(z) = sign(z) = +1 if z >= 0, -1 if z < 0
- Hyperbolic tangent activation: φ(u) = tanh(u) = (e^(2u) - 1) / (e^(2u) + 1)

How Are Neural Networks Trained?
- Initially: choose small random weights (w_i), set the threshold (for the step function), and choose a small learning rate (r).
- Apply each member of the training set to the neural net model, using a training rule to adjust the weights.
- For each unit:
  - Compute the net input to the unit as a linear combination of all the inputs to the unit.
  - Compute the output value using the activation function.
  - Compute the error.
  - Update the weights and the bias.
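The activation functions listed above can be written in a few lines. The sketch below is only an illustration (the helper names are my own); NumPy is used so the functions apply element-wise.

```python
import numpy as np

def linear(z):
    return z

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def threshold(z):
    # step / sign activation: +1 if z >= 0, -1 otherwise
    return np.where(z >= 0, 1.0, -1.0)

def hyperbolic_tangent(u):
    # (e^(2u) - 1) / (e^(2u) + 1), identical to np.tanh(u)
    return (np.exp(2 * u) - 1) / (np.exp(2 * u) + 1)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (linear, logistic, threshold, hyperbolic_tangent):
    print(f.__name__, np.round(f(z), 3))
```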

Single-Layer Perceptron

Single-layer perceptrons are the simplest form of neural networks: the input variables are connected directly to the output nodes.

Training rule: modify the weights w_i according to
  w_i = w_i + r (t - a) x_i
where
- r is the learning rate (e.g., 0.2)
- t is the target output
- a is the actual output
- x_i is the i-th input value

Learning rate: if it is too small, learning occurs at a very slow pace; if it is too large, the search may get stuck in a local minimum of the decision space.

Example
Inputs x1 = 0, x2 = 1, bias input b = 1; weights w0 = -0.49, w1 = 0.95, w2 = 0.15; threshold = 0.5; learning rate r = 0.05; target output t = 1.

Compute the output for the input:
  u = 1 x (-0.49) + 0 x 0.95 + 1 x 0.15 = -0.34 < threshold, thus y = 0

Compute the error:
  target output = 1, actual output (y) = 0, error = (1 - 0) = 1
  correction factor = error x r = 0.05

Compute the new weights:
  w0 = -0.49 + 0.05 x (1 - 0) x 1 = -0.44
  w1 = 0.95 + 0.05 x (1 - 0) x 0 = 0.95
  w2 = 0.15 + 0.05 x (1 - 0) x 1 = 0.20

Repeat the process with the new weights for a given number of iterations.

Multi-Layer Network
A multi-layer network consists of an input layer, one or more hidden layers, and an output layer.
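A minimal sketch of the perceptron training rule, reproducing the single update of the worked example above. The helper perceptron_step and the array layout (bias input as the first component) are assumptions made for illustration.

```python
import numpy as np

def perceptron_step(w, x, target, r=0.05, threshold=0.5):
    """One perceptron update: w_i <- w_i + r * (t - a) * x_i.
    x includes the bias input as its first component (x[0] = 1)."""
    u = np.dot(w, x)
    actual = 1 if u >= threshold else 0
    w = w + r * (target - actual) * x
    return w, actual

# Values from the worked example above
w = np.array([-0.49, 0.95, 0.15])   # w0 (bias weight), w1, w2
x = np.array([1.0, 0.0, 1.0])       # bias input, x1 = 0, x2 = 1
w, y = perceptron_step(w, x, target=1)
print(y, w)                          # y = 0, w is approximately [-0.44, 0.95, 0.20]
```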

Training Multi-Layer Networks: the Back-Propagation Algorithm

A multi-layer network of sigmoid units poses a problem: what is the desired output for a hidden node? The back-propagation algorithm answers this.

Phase 1: Propagation
- Forward propagation of a training input through the network.
- Backward propagation of the output activations (the errors) through the network.

Phase 2: Weight update
For each weight (synapse):
- Multiply its output delta and input activation to get the gradient of the weight.
- Bring the weight in the opposite direction of the gradient by subtracting a fraction of the gradient from the weight. This fraction (the learning rate) influences the speed and quality of learning.
The sign of the gradient of a weight indicates where the error is increasing; this is why the weight must be updated in the opposite direction.
Repeat phases 1 and 2 until the performance of the network is good enough.

Update rules (for a network with input nodes, hidden nodes, and output nodes, and input vector x_i):
- Net input of unit j: I_j = Σ_i w_ij O_i + θ_j
- Output of unit j: O_j = 1 / (1 + e^(-I_j))
- Error for a node in the output layer: Err_j = O_j (1 - O_j) (T_j - O_j)
- Error for a node in the hidden layer: Err_j = O_j (1 - O_j) Σ_k Err_k w_jk
- To update the weights: w_ij = w_ij + r Err_j O_i
- To update the bias: θ_j = θ_j + r Err_j

Example: Propagation
Input variables x1 = 1, x2 = 0, x3 = 1 (whose class is 1), feeding hidden units 4 and 5 and output unit 6. Randomly assigned weights: w14 = 0.2, w24 = 0.4, w34 = -0.5, w15 = -0.3, w25 = 0.1, w35 = 0.2, w46 = -0.3, w56 = -0.2; biases w04 = -0.4, w05 = 0.2, w06 = 0.1. Activation function O_j = 1 / (1 + e^(-I_j)) and learning rate = 0.9.

neuron | net input                                   | output
4      | 0.2x1 + 0.4x0 + (-0.5)x1 + (-0.4) = -0.7    | 1/(1+e^0.7) = 0.332
5      | -0.3x1 + 0.1x0 + 0.2x1 + 0.2 = 0.1          | 1/(1+e^-0.1) = 0.525
6      | -0.3x0.332 + (-0.2)x0.525 + 0.1 = -0.105    | 1/(1+e^0.105) = 0.474
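The forward-propagation table above can be reproduced with a short sketch. The matrix layout and variable names are assumptions, but the weights and biases are the ones from the example.

```python
import numpy as np

def sigmoid(I):
    return 1.0 / (1.0 + np.exp(-I))

# Network from the example: inputs 1-3, hidden units 4-5, output unit 6
x = np.array([1.0, 0.0, 1.0])              # x1, x2, x3
W_hidden = np.array([[0.2, 0.4, -0.5],     # w14, w24, w34
                     [-0.3, 0.1, 0.2]])    # w15, w25, w35
b_hidden = np.array([-0.4, 0.2])           # w04, w05
W_out = np.array([-0.3, -0.2])             # w46, w56
b_out = 0.1                                # w06

I_hidden = W_hidden @ x + b_hidden         # net inputs I4, I5
O_hidden = sigmoid(I_hidden)               # outputs O4, O5
I_out = W_out @ O_hidden + b_out           # net input I6
O_out = sigmoid(I_out)                     # output O6

print(I_hidden, O_hidden)   # [-0.7, 0.1] and [0.332, 0.525] (rounded)
print(I_out, O_out)         # -0.105 and 0.474 (rounded)
```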

Example: Calculation of the Neuron Errors
Recall the outputs: O4 = 0.332, O5 = 0.525, O6 = 0.474. Using
- error for a node in the output layer: Err_j = O_j (1 - O_j) (T_j - O_j)
- error for a node in the hidden layer: Err_j = O_j (1 - O_j) Σ_k Err_k w_jk

neuron | error
6      | 0.474 x (1 - 0.474) x (1 - 0.474) = 0.1311
5      | 0.525 x (1 - 0.525) x (-0.2) x 0.1311 = -0.0065
4      | 0.332 x (1 - 0.332) x (-0.3) x 0.1311 = -0.0087

Example: Updating the Weights
Using w_ij = w_ij + r Err_j O_i to update the weights and θ_j = θ_j + r Err_j to update the biases (r = 0.9):

weight | new value
w46    | -0.3 + 0.9 x 0.1311 x 0.332 = -0.261
w56    | -0.2 + 0.9 x 0.1311 x 0.525 = -0.138
w14    | 0.2 + 0.9 x (-0.0087) x 1 = 0.192
w15    | -0.3 + 0.9 x (-0.0065) x 1 = -0.306
w24    | 0.4 + 0.9 x (-0.0087) x 0 = 0.4
w25    | 0.1 + 0.9 x (-0.0065) x 0 = 0.1
w34    | -0.5 + 0.9 x (-0.0087) x 1 = -0.508
w35    | 0.2 + 0.9 x (-0.0065) x 1 = 0.194
w06    | 0.1 + 0.9 x 0.1311 = 0.218
w05    | 0.2 + 0.9 x (-0.0065) = 0.194
w04    | -0.4 + 0.9 x (-0.0087) = -0.408

This is the resulting network after the first iteration. We now have to process another training example, repeating until the overall error is low or we run out of examples.

Neural Network as a Classifier
Weaknesses
- Long training time.
- Requires a number of parameters that are typically best determined empirically, e.g., the network topology or "structure".
- Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network.
Strengths
- High tolerance to noisy data.
- Ability to classify untrained patterns.
- Well suited for continuous-valued inputs and outputs.
- Successful on a wide array of real-world data.
- Algorithms are inherently parallel.
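Continuing the same example, here is a sketch of the backward pass: the error formulas and the weight/bias updates applied to a few of the parameters. Variable names are illustrative, and only a subset of the updates is shown to keep it short.

```python
# Values carried over from the forward pass of the worked example
O4, O5, O6 = 0.332, 0.525, 0.474
x1, x2, x3 = 1.0, 0.0, 1.0
T6 = 1.0                                 # target class label
r = 0.9                                  # learning rate
w46, w56 = -0.3, -0.2

# Errors: output layer first, then hidden layer (using the old weights)
Err6 = O6 * (1 - O6) * (T6 - O6)         # about 0.1311
Err5 = O5 * (1 - O5) * Err6 * w56        # about -0.0065
Err4 = O4 * (1 - O4) * Err6 * w46        # about -0.0087

# Updates: w_ij += r * Err_j * O_i, bias theta_j += r * Err_j
w46 += r * Err6 * O4                     # about -0.261
w56 += r * Err6 * O5                     # about -0.138
w14 = 0.2 + r * Err4 * x1                # about 0.192
w04 = -0.4 + r * Err4                    # about -0.408
print(round(Err6, 4), round(w46, 3), round(w14, 3), round(w04, 3))
```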

ENSEMBLE METHODS

Ensemble Method
Aggregation of multiple learned models with the goal of improving accuracy. Intuition: simulate what we do when we combine an expert panel in a human decision-making process.

Some Comments
- Combining models adds complexity: it is more difficult to characterize and explain predictions, but the accuracy may increase.
- This is a violation of Ockham's Razor (the principle that simplicity leads to greater accuracy).
- Identifying the best model requires identifying the proper "model complexity".

Methods to Achieve Diversity
Diversity comes from differences in input variation (a small sketch follows below):
- Different feature weightings: each classifier is built on a different view of the features (e.g., one on Ratings, one on Actors, one on Genres), and their predictions are combined.
- Divide up the training data among models: each classifier is trained on a different portion of the training examples, and their predictions are combined.
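As a rough illustration of diversity from input variation, the sketch below trains one decision tree per feature subset on synthetic data and combines their predictions by majority vote. The data, the feature split, and the use of scikit-learn decision trees are all assumptions, not part of the original slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                      # 6 hypothetical features
y = (X[:, 0] + X[:, 2] - X[:, 4] > 0).astype(int)  # synthetic labels

# Diversity from input variation: each model sees a different feature subset
feature_subsets = [[0, 1], [2, 3], [4, 5]]
models = []
for cols in feature_subsets:
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[:, cols], y)
    models.append((cols, clf))

# Combine the individual predictions by simple majority vote
votes = np.array([clf.predict(X[:, cols]) for cols, clf in models])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the ensemble:", (ensemble_pred == y).mean())
```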

Ensemble Methods: Increasing the Accuracy
- Use a combination of models to increase accuracy.
- Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*.

How to combine models
- Algebraic methods: average, weighted average, sum, weighted sum, product, maximum, minimum, median.
- Voting methods: majority voting, weighted majority voting, Borda count (rank the candidates in order of preference).

Popular ensemble methods
- Bagging: averaging the prediction over a collection of classifiers.
- Boosting: weighted vote with a collection of classifiers.
- Ensemble: combining a set of heterogeneous classifiers.

Bagging: Bootstrap AGGregatING
- Analogy: diagnosis based on the majority vote of multiple doctors.
- Training: given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample), and a classifier model Mi is learned from each training set Di.
- Classification (classify an unknown sample X): each classifier Mi returns its class prediction; the bagged classifier M* counts the votes and assigns the class with the most votes to X. A minimal sketch is shown after this list.
- Prediction: bagging can be applied to the prediction of continuous values by taking the average of the predictions for a given test tuple.
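A minimal bagging sketch under the scheme just described: k bootstrap samples of the d tuples, one classifier per sample, and a majority vote at prediction time. Decision trees from scikit-learn as the base classifier, and the helper names, are assumptions; any learner could be plugged in.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    """Learn k classifiers, each on a bootstrap sample of the d training tuples."""
    rng = np.random.default_rng(seed)
    d = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, d, size=d)   # sample d tuples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Each classifier votes; the bagged classifier returns the majority class.
    Assumes non-negative integer class labels."""
    votes = np.array([m.predict(X) for m in models])
    return np.array([np.bincount(col).argmax() for col in votes.T])
```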

Bagging: Accuracy
- Often significantly better than a single classifier derived from D.
- For noisy data: not considerably worse, and more robust.
- Proven improved accuracy in prediction.
- Requirement: needs unstable classifier types. Unstable means that a small change to the training data may lead to major changes in the learned decisions.

Stability in Training
- Training: construct classifier f from D.
- Stability: small changes on D result in small changes in f.
- Decision trees are a typical unstable classifier.

Boosting
- Analogy: consult several doctors and combine their weighted diagnoses, where each weight is assigned based on the accuracy of the previous diagnoses.
- Incrementally create models, selectively using training examples based on some distribution.

How boosting works
- Weights are assigned to each training example.
- A series of k classifiers is iteratively learned.
- After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training examples that were misclassified by Mi.
- The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy.

Boosting: Construct Weak Classifiers Using Different Data Distributions
- Idea: start with uniform weighting. During each step of learning, increase the weights of the examples that are not correctly learned by the weak learner, and decrease the weights of the examples that are correctly learned by the weak learner.
- The effect is to focus on difficult examples that were not correctly classified in the previous steps (a minimal reweighting sketch follows this list).
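A minimal sketch of the reweighting idea only, not a full boosting algorithm: weights of misclassified examples go up, weights of correctly classified examples go down, and the weights are renormalized. The factor beta = 0.5 is an arbitrary illustrative choice; AdaBoost derives it from the weak learner's error rate.

```python
import numpy as np

def reweight_examples(weights, misclassified, beta=0.5):
    """One boosting-style reweighting step: divide the weights of misclassified
    examples by beta, multiply the others by beta, then renormalize so the
    weights again form a distribution."""
    w = np.where(misclassified, weights / beta, weights * beta)
    return w / w.sum()

# Start with uniform weights over 5 training examples
weights = np.full(5, 1 / 5)
misclassified = np.array([False, True, False, False, True])
print(reweight_examples(weights, misclassified))
```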

Boosting: Combine Weak Classifiers
- Weighted voting: construct a strong classifier by a weighted vote of the weak classifiers.
- Idea: a better weak classifier gets a larger weight; weak classifiers are added iteratively; the accuracy of the combined classifier is increased through minimization of a cost function.

Boosting: Differences with Bagging
- Models are built sequentially on modified versions of the data.
- The predictions of the models are combined through a weighted sum/vote.
- The boosting algorithm can be extended for numeric prediction.
- Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data.

Adaboost: a popular boosting algorithm (Freund and Schapire, 1997)
- Given a set of d class-labeled examples (X1, y1), ..., (Xd, yd).
- Initially, all example weights are set to the same value (1/d).
- Generate k classifiers in k rounds. At round i:
  - Tuples from D are sampled (with replacement) to form a training set Di of the same size; each example's chance of being selected is based on its weight.
  - A classification model Mi is derived from Di, and its error rate is calculated using Di as a test set.
  - If a tuple is misclassified, its weight is increased; otherwise it is decreased.
- Error rate: err(Xj) is the misclassification error of example Xj. The error rate of classifier Mi is the sum of the weights of the misclassified examples.

Adaboost comments
- This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier. Hence, the training data of consecutive classifiers are geared towards increasingly hard-to-classify instances.
- Unlike bagging, AdaBoost uses a rather undemocratic voting scheme, called weighted majority voting. The idea is an intuitive one: classifiers that have shown good performance during training are rewarded with higher voting weights than the others. A compact AdaBoost sketch follows.
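A compact AdaBoost sketch for binary labels in {-1, +1}. It follows the weighted-majority-vote idea above, but uses weight-aware fitting (sample_weight) instead of resampling, a common equivalent formulation; decision stumps from scikit-learn as weak learners and the helper names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=20):
    """Minimal AdaBoost sketch for labels y in {-1, +1}, with decision stumps
    as the weak learners."""
    d = len(X)
    w = np.full(d, 1.0 / d)                     # uniform initial example weights
    models, alphas = [], []
    for _ in range(k):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)        # weight-aware fit instead of resampling
        pred = stump.predict(X)
        err = max(w[pred != y].sum(), 1e-10)    # weighted error: sum of misclassified weights
        if err >= 0.5:                          # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)   # better classifiers get larger vote weights
        w = w * np.exp(-alpha * y * pred)       # raise weights of misclassified examples
        w = w / w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Weighted majority vote of the weak classifiers."""
    scores = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(scores)
```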

Random Forest (Breiman 2001)
- A variation of the bagging algorithm, built from individual decision trees whose parameters vary randomly. Such parameters can be bootstrapped replicas of the training data, as in bagging, but they can also be different feature subsets, as in random subspace methods.
- During classification, each tree votes and the most popular class is returned (a short scikit-learn usage sketch appears after the references).
- (Note on the accompanying boosting diagram, not reproduced here: the algorithm is sequential, so classifier C_k is created before classifier C_{k+1}, which in turn requires that β_k and the current distribution D_k be available.)

Two methods to construct a Random Forest:
- Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at that node; the CART methodology is used to grow the trees to maximum size.
- Forest-RC (random linear combinations): creates new attributes (features) that are linear combinations of the existing attributes, which reduces the correlation between individual classifiers.
- Random forests are comparable in accuracy to AdaBoost, but more robust to errors and outliers. They are insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting.

References
- Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 1999.
- Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, second edition, 2005.
- Todd Holloway, Ensemble Learning: Better Predictions Through Diversity, presentation, 2008.
- Leandro M. Almeida, Sistemas Baseados em Comitês de Classificadores (Systems Based on Committees of Classifiers).
- Cong Li, Machine Learning Basics 3: Ensemble Learning, 2009.
- R. Polikar, "Ensemble based systems in decision making," IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21-45, 2006.
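A short usage sketch of a random forest via scikit-learn, on synthetic data (both the data and the parameter values are illustrative assumptions). Here max_features plays the role of F, the number of attributes considered at each split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                       # synthetic data with 8 features
y = (X[:, 0] - X[:, 3] + X[:, 5] > 0).astype(int)   # synthetic labels

# max_features limits the attributes considered at each split (F in Forest-RI)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

# Each tree votes; predict() returns the most popular class
print(forest.predict(X[:5]))
print("training accuracy:", forest.score(X, y))
```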