Neural Networks and Ensemble Methods for Classification


NEURAL NETWORKS

Neural Networks
A neural network is a set of connected input/output units (neurons) where each connection has a weight associated with it.

During the learning phase, the network learns by adjusting the weights so that it can predict the correct class label of the input samples (the training samples).
- Knowledge about the learning task is given in the form of examples.
- Inter-neuron connection strengths (weights) are used to store the acquired information (the training examples).
- During the learning process the weights are modified in order to model the particular learning task correctly on the training examples.

Neural Networks: Advantages and Criticism

Advantages
- Prediction accuracy is generally high.
- Robust: works even when training examples contain errors or noisy data.
- Output may be discrete, real-valued, or a vector of several discrete or real-valued attributes.
- Fast evaluation of the learned target function.

Criticism
- Parameters are best determined empirically, such as the network topology or structure.
- Long training time.
- Difficult to understand the learned function (weights).
- Not easy to incorporate domain knowledge.

Network architectures
Three different classes of network architectures:
- single-layer feed-forward: an input layer of source nodes connected directly to an output layer of neurons
- multi-layer feed-forward: an input layer, one or more hidden layers, and an output layer
- recurrent
In feed-forward networks, neurons are organized in acyclic layers. The architecture of a neural network is linked with the learning algorithm used to train it.

Neurons
Neural networks are built out of a densely interconnected set of simple units (neurons).
- Each neuron takes a number of real-valued inputs and produces a single real-valued output.
- Inputs to a neuron may be the outputs of other neurons; a neuron's output may be used as input to many other neurons.

The neuron
- Input signals x_1, x_2, ..., x_m with associated weights w_1, w_2, ..., w_m.
- Bias b (weight w_0): serves to vary the activity of the unit.
- Adder function (linear combiner), which computes the weighted sum of the inputs (the local field v):
  u = b + Σ_{j=1..m} w_j x_j
- Activation function (squashing function) for limiting the amplitude of the output of the neuron:
  y = φ(u)
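To make the neuron model concrete, here is a minimal Python sketch (not from the original slides) of the adder and squashing function. The function name neuron_output and the use of NumPy are illustrative assumptions, and the logistic function stands in for a generic activation φ.

```python
import numpy as np

def neuron_output(x, w, b, activation=lambda u: 1.0 / (1.0 + np.exp(-u))):
    """One neuron: squash the weighted sum of the inputs plus the bias."""
    u = b + np.dot(w, x)   # adder / linear combiner: u = b + sum_j w_j * x_j
    return activation(u)   # squashing function (logistic by default)

# Illustrative values: three inputs, arbitrary weights and bias
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.2, 0.4, -0.5])
print(neuron_output(x, w, b=-0.4))   # about 0.332
```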

The Neuron: How Does It Work?
- Assign a weight to each input link.
- Multiply each weight by the input value (0 or 1).
- Sum all the weighted inputs.
- Apply the squash function, e.g.: if the sum is greater than the threshold for the neuron, then Output = +1, else Output = -1.

Popular activation functions
- Linear activation: φ(z) = z
- Logistic activation: φ(z) = 1 / (1 + e^(-z))
- Threshold (step) activation: φ(z) = sign(z) = +1 if z >= 0, -1 if z < 0
- Hyperbolic tangent activation: φ(u) = tanh(u) = (e^(2u) - 1) / (e^(2u) + 1)

How Are Neural Networks Trained?
- Initially: choose small random weights (w_i), set the threshold (for the step function), and choose a small learning rate (r).
- Apply each member of the training set to the neural net model, using a training rule to adjust the weights.
- For each unit:
  - Compute the net input to the unit as a linear combination of all the inputs to the unit.
  - Compute the output value using the activation function.
  - Compute the error.
  - Update the weights and the bias.
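The activation functions listed above can be written in a few lines. The sketch below is only an illustration (the helper names are my own); NumPy is used so the functions apply element-wise.

```python
import numpy as np

def linear(z):
    return z

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def threshold(z):
    # step / sign activation: +1 if z >= 0, -1 otherwise
    return np.where(z >= 0, 1.0, -1.0)

def hyperbolic_tangent(u):
    # (e^(2u) - 1) / (e^(2u) + 1), identical to np.tanh(u)
    return (np.exp(2 * u) - 1) / (np.exp(2 * u) + 1)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (linear, logistic, threshold, hyperbolic_tangent):
    print(f.__name__, np.round(f(z), 3))
```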

Single-Layer Perceptron

Single-layer perceptrons are the simplest form of neural networks: the input variables are connected directly to the output nodes.

Training rule: modify the weights w_i according to
  w_i = w_i + r (t - a) x_i
where
- r is the learning rate (e.g., 0.2)
- t is the target output
- a is the actual output
- x_i is the i-th input value

Learning rate: if it is too small, learning occurs at a very slow pace; if it is too large, the search may get stuck in a local minimum of the decision space.

Example
Inputs x1 = 0, x2 = 1, bias input b = 1; weights w0 = -0.49, w1 = 0.95, w2 = 0.15; threshold = 0.5; learning rate r = 0.05; target output t = 1.

Compute the output for the input:
  u = 1 x (-0.49) + 0 x 0.95 + 1 x 0.15 = -0.34 < threshold, thus y = 0

Compute the error:
  target output = 1, actual output (y) = 0, error = (1 - 0) = 1
  correction factor = error x r = 0.05

Compute the new weights:
  w0 = -0.49 + 0.05 x (1 - 0) x 1 = -0.44
  w1 = 0.95 + 0.05 x (1 - 0) x 0 = 0.95
  w2 = 0.15 + 0.05 x (1 - 0) x 1 = 0.20

Repeat the process with the new weights for a given number of iterations.

Multi-Layer Network
A multi-layer network consists of an input layer, one or more hidden layers, and an output layer.
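A minimal sketch of the perceptron training rule, reproducing the single update of the worked example above. The helper perceptron_step and the array layout (bias input as the first component) are assumptions made for illustration.

```python
import numpy as np

def perceptron_step(w, x, target, r=0.05, threshold=0.5):
    """One perceptron update: w_i <- w_i + r * (t - a) * x_i.
    x includes the bias input as its first component (x[0] = 1)."""
    u = np.dot(w, x)
    actual = 1 if u >= threshold else 0
    w = w + r * (target - actual) * x
    return w, actual

# Values from the worked example above
w = np.array([-0.49, 0.95, 0.15])   # w0 (bias weight), w1, w2
x = np.array([1.0, 0.0, 1.0])       # bias input, x1 = 0, x2 = 1
w, y = perceptron_step(w, x, target=1)
print(y, w)                          # y = 0, w is approximately [-0.44, 0.95, 0.20]
```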

Training Multi-Layer Networks: the Back-Propagation Algorithm

A multi-layer network of sigmoid units poses a problem: what is the desired output for a hidden node? The back-propagation algorithm answers this.

Phase 1: Propagation
- Forward propagation of a training input through the network.
- Backward propagation of the output activations (the errors) through the network.

Phase 2: Weight update
For each weight (synapse):
- Multiply its output delta and input activation to get the gradient of the weight.
- Bring the weight in the opposite direction of the gradient by subtracting a fraction of the gradient from the weight. This fraction (the learning rate) influences the speed and quality of learning.
The sign of the gradient of a weight indicates where the error is increasing; this is why the weight must be updated in the opposite direction.
Repeat phases 1 and 2 until the performance of the network is good enough.

Update rules (for a network with input nodes, hidden nodes, and output nodes, and input vector x_i):
- Net input of unit j: I_j = Σ_i w_ij O_i + θ_j
- Output of unit j: O_j = 1 / (1 + e^(-I_j))
- Error for a node in the output layer: Err_j = O_j (1 - O_j) (T_j - O_j)
- Error for a node in the hidden layer: Err_j = O_j (1 - O_j) Σ_k Err_k w_jk
- To update the weights: w_ij = w_ij + r Err_j O_i
- To update the bias: θ_j = θ_j + r Err_j

Example: Propagation
Input variables x1 = 1, x2 = 0, x3 = 1 (whose class is 1), feeding hidden units 4 and 5 and output unit 6. Randomly assigned weights: w14 = 0.2, w24 = 0.4, w34 = -0.5, w15 = -0.3, w25 = 0.1, w35 = 0.2, w46 = -0.3, w56 = -0.2; biases w04 = -0.4, w05 = 0.2, w06 = 0.1. Activation function O_j = 1 / (1 + e^(-I_j)) and learning rate = 0.9.

neuron | net input                                   | output
4      | 0.2x1 + 0.4x0 + (-0.5)x1 + (-0.4) = -0.7    | 1/(1+e^0.7) = 0.332
5      | -0.3x1 + 0.1x0 + 0.2x1 + 0.2 = 0.1          | 1/(1+e^-0.1) = 0.525
6      | -0.3x0.332 + (-0.2)x0.525 + 0.1 = -0.105    | 1/(1+e^0.105) = 0.474
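The forward-propagation table above can be reproduced with a short sketch. The matrix layout and variable names are assumptions, but the weights and biases are the ones from the example.

```python
import numpy as np

def sigmoid(I):
    return 1.0 / (1.0 + np.exp(-I))

# Network from the example: inputs 1-3, hidden units 4-5, output unit 6
x = np.array([1.0, 0.0, 1.0])              # x1, x2, x3
W_hidden = np.array([[0.2, 0.4, -0.5],     # w14, w24, w34
                     [-0.3, 0.1, 0.2]])    # w15, w25, w35
b_hidden = np.array([-0.4, 0.2])           # w04, w05
W_out = np.array([-0.3, -0.2])             # w46, w56
b_out = 0.1                                # w06

I_hidden = W_hidden @ x + b_hidden         # net inputs I4, I5
O_hidden = sigmoid(I_hidden)               # outputs O4, O5
I_out = W_out @ O_hidden + b_out           # net input I6
O_out = sigmoid(I_out)                     # output O6

print(I_hidden, O_hidden)   # [-0.7, 0.1] and [0.332, 0.525] (rounded)
print(I_out, O_out)         # -0.105 and 0.474 (rounded)
```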

Example: Calculation of the Neuron Errors
Recall the outputs: O4 = 0.332, O5 = 0.525, O6 = 0.474. Using
- error for a node in the output layer: Err_j = O_j (1 - O_j) (T_j - O_j)
- error for a node in the hidden layer: Err_j = O_j (1 - O_j) Σ_k Err_k w_jk

neuron | error
6      | 0.474 x (1 - 0.474) x (1 - 0.474) = 0.1311
5      | 0.525 x (1 - 0.525) x (-0.2) x 0.1311 = -0.0065
4      | 0.332 x (1 - 0.332) x (-0.3) x 0.1311 = -0.0087

Example: Updating the Weights
Using w_ij = w_ij + r Err_j O_i to update the weights and θ_j = θ_j + r Err_j to update the biases (r = 0.9):

weight | new value
w46    | -0.3 + 0.9 x 0.1311 x 0.332 = -0.261
w56    | -0.2 + 0.9 x 0.1311 x 0.525 = -0.138
w14    | 0.2 + 0.9 x (-0.0087) x 1 = 0.192
w15    | -0.3 + 0.9 x (-0.0065) x 1 = -0.306
w24    | 0.4 + 0.9 x (-0.0087) x 0 = 0.4
w25    | 0.1 + 0.9 x (-0.0065) x 0 = 0.1
w34    | -0.5 + 0.9 x (-0.0087) x 1 = -0.508
w35    | 0.2 + 0.9 x (-0.0065) x 1 = 0.194
w06    | 0.1 + 0.9 x 0.1311 = 0.218
w05    | 0.2 + 0.9 x (-0.0065) = 0.194
w04    | -0.4 + 0.9 x (-0.0087) = -0.408

This is the resulting network after the first iteration. We now have to process another training example, repeating until the overall error is low or we run out of examples.

Neural Network as a Classifier
Weaknesses
- Long training time.
- Requires a number of parameters that are typically best determined empirically, e.g., the network topology or "structure".
- Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network.
Strengths
- High tolerance to noisy data.
- Ability to classify untrained patterns.
- Well suited for continuous-valued inputs and outputs.
- Successful on a wide array of real-world data.
- Algorithms are inherently parallel.
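Continuing the same example, here is a sketch of the backward pass: the error formulas and the weight/bias updates applied to a few of the parameters. Variable names are illustrative, and only a subset of the updates is shown to keep it short.

```python
# Values carried over from the forward pass of the worked example
O4, O5, O6 = 0.332, 0.525, 0.474
x1, x2, x3 = 1.0, 0.0, 1.0
T6 = 1.0                                 # target class label
r = 0.9                                  # learning rate
w46, w56 = -0.3, -0.2

# Errors: output layer first, then hidden layer (using the old weights)
Err6 = O6 * (1 - O6) * (T6 - O6)         # about 0.1311
Err5 = O5 * (1 - O5) * Err6 * w56        # about -0.0065
Err4 = O4 * (1 - O4) * Err6 * w46        # about -0.0087

# Updates: w_ij += r * Err_j * O_i, bias theta_j += r * Err_j
w46 += r * Err6 * O4                     # about -0.261
w56 += r * Err6 * O5                     # about -0.138
w14 = 0.2 + r * Err4 * x1                # about 0.192
w04 = -0.4 + r * Err4                    # about -0.408
print(round(Err6, 4), round(w46, 3), round(w14, 3), round(w04, 3))
```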

ENSEMBLE METHODS

Ensemble Method
Aggregation of multiple learned models with the goal of improving accuracy. Intuition: simulate what we do when we combine an expert panel in a human decision-making process.

Some Comments
- Combining models adds complexity: it is more difficult to characterize and explain predictions, but the accuracy may increase.
- This is a violation of Ockham's Razor (the principle that simplicity leads to greater accuracy).
- Identifying the best model requires identifying the proper "model complexity".

Methods to Achieve Diversity
Diversity comes from differences in input variation (a small sketch follows below):
- Different feature weightings: each classifier is built on a different view of the features (e.g., one on Ratings, one on Actors, one on Genres), and their predictions are combined.
- Divide up the training data among models: each classifier is trained on a different portion of the training examples, and their predictions are combined.
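As a rough illustration of diversity from input variation, the sketch below trains one decision tree per feature subset on synthetic data and combines their predictions by majority vote. The data, the feature split, and the use of scikit-learn decision trees are all assumptions, not part of the original slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                      # 6 hypothetical features
y = (X[:, 0] + X[:, 2] - X[:, 4] > 0).astype(int)  # synthetic labels

# Diversity from input variation: each model sees a different feature subset
feature_subsets = [[0, 1], [2, 3], [4, 5]]
models = []
for cols in feature_subsets:
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[:, cols], y)
    models.append((cols, clf))

# Combine the individual predictions by simple majority vote
votes = np.array([clf.predict(X[:, cols]) for cols, clf in models])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the ensemble:", (ensemble_pred == y).mean())
```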

Ensemble Methods: Increasing the Accuracy
- Use a combination of models to increase accuracy.
- Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*.

How to combine models
- Algebraic methods: average, weighted average, sum, weighted sum, product, maximum, minimum, median.
- Voting methods: majority voting, weighted majority voting, Borda count (rank the candidates in order of preference).

Popular ensemble methods
- Bagging: averaging the prediction over a collection of classifiers.
- Boosting: weighted vote with a collection of classifiers.
- Ensemble: combining a set of heterogeneous classifiers.

Bagging: Bootstrap AGGregatING
- Analogy: diagnosis based on the majority vote of multiple doctors.
- Training: given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample), and a classifier model Mi is learned from each training set Di.
- Classification (classify an unknown sample X): each classifier Mi returns its class prediction; the bagged classifier M* counts the votes and assigns the class with the most votes to X. A minimal sketch is shown after this list.
- Prediction: bagging can be applied to the prediction of continuous values by taking the average of the predictions for a given test tuple.
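A minimal bagging sketch under the scheme just described: k bootstrap samples of the d tuples, one classifier per sample, and a majority vote at prediction time. Decision trees from scikit-learn as the base classifier, and the helper names, are assumptions; any learner could be plugged in.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    """Learn k classifiers, each on a bootstrap sample of the d training tuples."""
    rng = np.random.default_rng(seed)
    d = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, d, size=d)   # sample d tuples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Each classifier votes; the bagged classifier returns the majority class.
    Assumes non-negative integer class labels."""
    votes = np.array([m.predict(X) for m in models])
    return np.array([np.bincount(col).argmax() for col in votes.T])
```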

Bagging: Accuracy
- Often significantly better than a single classifier derived from D.
- For noisy data: not considerably worse, and more robust.
- Proven improved accuracy in prediction.
- Requirement: needs unstable classifier types. Unstable means that a small change to the training data may lead to major changes in the learned decisions.

Stability in Training
- Training: construct classifier f from D.
- Stability: small changes on D result in small changes in f.
- Decision trees are a typical unstable classifier.

Boosting
- Analogy: consult several doctors and combine their weighted diagnoses, where each weight is assigned based on the accuracy of the previous diagnoses.
- Incrementally create models, selectively using training examples based on some distribution.

How boosting works
- Weights are assigned to each training example.
- A series of k classifiers is iteratively learned.
- After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training examples that were misclassified by Mi.
- The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy.

Boosting: Construct Weak Classifiers Using Different Data Distributions
- Idea: start with uniform weighting. During each step of learning, increase the weights of the examples that are not correctly learned by the weak learner, and decrease the weights of the examples that are correctly learned by the weak learner.
- The effect is to focus on difficult examples that were not correctly classified in the previous steps (a minimal reweighting sketch follows this list).
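A minimal sketch of the reweighting idea only, not a full boosting algorithm: weights of misclassified examples go up, weights of correctly classified examples go down, and the weights are renormalized. The factor beta = 0.5 is an arbitrary illustrative choice; AdaBoost derives it from the weak learner's error rate.

```python
import numpy as np

def reweight_examples(weights, misclassified, beta=0.5):
    """One boosting-style reweighting step: divide the weights of misclassified
    examples by beta, multiply the others by beta, then renormalize so the
    weights again form a distribution."""
    w = np.where(misclassified, weights / beta, weights * beta)
    return w / w.sum()

# Start with uniform weights over 5 training examples
weights = np.full(5, 1 / 5)
misclassified = np.array([False, True, False, False, True])
print(reweight_examples(weights, misclassified))
```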

Boosting: Combine Weak Classifiers
- Weighted voting: construct a strong classifier by a weighted vote of the weak classifiers.
- Idea: a better weak classifier gets a larger weight; weak classifiers are added iteratively; the accuracy of the combined classifier is increased through minimization of a cost function.

Boosting: Differences with Bagging
- Models are built sequentially on modified versions of the data.
- The predictions of the models are combined through a weighted sum/vote.
- The boosting algorithm can be extended for numeric prediction.
- Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data.

Adaboost: a popular boosting algorithm (Freund and Schapire, 1997)
- Given a set of d class-labeled examples (X1, y1), ..., (Xd, yd).
- Initially, all example weights are set to the same value (1/d).
- Generate k classifiers in k rounds. At round i:
  - Tuples from D are sampled (with replacement) to form a training set Di of the same size; each example's chance of being selected is based on its weight.
  - A classification model Mi is derived from Di, and its error rate is calculated using Di as a test set.
  - If a tuple is misclassified, its weight is increased; otherwise it is decreased.
- Error rate: err(Xj) is the misclassification error of example Xj. The error rate of classifier Mi is the sum of the weights of the misclassified examples.

Adaboost comments
- This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier. Hence, the training data of consecutive classifiers are geared towards increasingly hard-to-classify instances.
- Unlike bagging, AdaBoost uses a rather undemocratic voting scheme, called weighted majority voting. The idea is an intuitive one: classifiers that have shown good performance during training are rewarded with higher voting weights than the others. A compact AdaBoost sketch follows.
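A compact AdaBoost sketch for binary labels in {-1, +1}. It follows the weighted-majority-vote idea above, but uses weight-aware fitting (sample_weight) instead of resampling, a common equivalent formulation; decision stumps from scikit-learn as weak learners and the helper names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=20):
    """Minimal AdaBoost sketch for labels y in {-1, +1}, with decision stumps
    as the weak learners."""
    d = len(X)
    w = np.full(d, 1.0 / d)                     # uniform initial example weights
    models, alphas = [], []
    for _ in range(k):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)        # weight-aware fit instead of resampling
        pred = stump.predict(X)
        err = max(w[pred != y].sum(), 1e-10)    # weighted error: sum of misclassified weights
        if err >= 0.5:                          # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)   # better classifiers get larger vote weights
        w = w * np.exp(-alpha * y * pred)       # raise weights of misclassified examples
        w = w / w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Weighted majority vote of the weak classifiers."""
    scores = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(scores)
```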

Random Forest (Breiman 2001)
- A variation of the bagging algorithm, built from individual decision trees whose parameters vary randomly. Such parameters can be bootstrapped replicas of the training data, as in bagging, but they can also be different feature subsets, as in random subspace methods.
- During classification, each tree votes and the most popular class is returned (a short scikit-learn usage sketch appears after the references).
- (Note on the accompanying boosting diagram, not reproduced here: the algorithm is sequential, so classifier C_k is created before classifier C_{k+1}, which in turn requires that β_k and the current distribution D_k be available.)

Two methods to construct a Random Forest:
- Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at that node; the CART methodology is used to grow the trees to maximum size.
- Forest-RC (random linear combinations): creates new attributes (features) that are linear combinations of the existing attributes, which reduces the correlation between individual classifiers.
- Random forests are comparable in accuracy to AdaBoost, but more robust to errors and outliers. They are insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting.

References
- Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 1999.
- Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, second edition, 2005.
- Todd Holloway, Ensemble Learning: Better Predictions Through Diversity, presentation, 2008.
- Leandro M. Almeida, Sistemas Baseados em Comitês de Classificadores (Systems Based on Committees of Classifiers).
- Cong Li, Machine Learning Basics 3: Ensemble Learning, 2009.
- R. Polikar, "Ensemble based systems in decision making," IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21-45, 2006.
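A short usage sketch of a random forest via scikit-learn, on synthetic data (both the data and the parameter values are illustrative assumptions). Here max_features plays the role of F, the number of attributes considered at each split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                       # synthetic data with 8 features
y = (X[:, 0] - X[:, 3] + X[:, 5] > 0).astype(int)   # synthetic labels

# max_features limits the attributes considered at each split (F in Forest-RI)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

# Each tree votes; predict() returns the most popular class
print(forest.predict(X[:5]))
print("training accuracy:", forest.score(X, y))
```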