Chapter 2 Single Layer Feedforward Networks

Perceptrons

By Rosenblatt (1962). For modeling visual perception (retina). A feedforward network of three layers of units: Sensory (S), Association (A), and Response (R). Learning occurs only on the weights from A units to R units (the weights from S units to A units are fixed). Each R unit receives inputs from n A units. For a given training sample s:t, change the weights between A and R only if the computed output y differs from the target output t (error driven).

[Figure: S → A → R network with weight matrices w_SA (fixed) and w_AR (learned)]

Perceptrons

A simple perceptron. Structure:
- A single output node with a threshold function
- n input nodes with weights w_i, i = 1, 2, ..., n
- Classifies input patterns into one of two classes (depending on whether the output is -1 or 1)

Example: input patterns (x1, x2). Two groups of input patterns,
(0, 0), (0, 1), (1, 0), (-1, -1)  and  (2.1, 0), (0, -2.5), (1.6, -1.6),
can be separated by a line on the (x1, x2) plane: x1 - x2 = 2.
Classification by a perceptron with w1 = 1, w2 = -1, threshold = 2.

Perceptrons

Node function with an explicit threshold (here threshold = 2):
  f(net) = 1 if net >= 2, -1 if net < 2

Implement the threshold by a node x0 with constant output 1 and weight w0 = -threshold (a common practice in NN design). The node function then becomes the sign function:
  f(net) = 1 if net >= 0, -1 if net < 0

[Figure: the two example groups, e.g. (-1, -1) and (1.6, -1.6), plotted on either side of the decision line]
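As a small illustration (a sketch, not part of the original slides), the following Python snippet checks that the explicit-threshold node and the bias-node formulation agree on the example points above (w1 = 1, w2 = -1, threshold = 2); the function names are illustrative.

```python
import numpy as np

def f_threshold(x, w, theta):
    """Output 1 if w . x >= theta, else -1 (explicit threshold)."""
    return 1 if np.dot(w, x) >= theta else -1

def f_bias(x, w, theta):
    """Same node with the threshold folded into a bias input x0 = 1, w0 = -theta."""
    x_aug = np.concatenate(([1.0], x))        # prepend constant input x0 = 1
    w_aug = np.concatenate(([-theta], w))     # w0 = -threshold
    return 1 if np.dot(w_aug, x_aug) >= 0 else -1

w, theta = np.array([1.0, -1.0]), 2.0
group_II = [(0, 0), (0, 1), (1, 0), (-1, -1)]      # x1 - x2 < 2  -> output -1
group_I  = [(2.1, 0), (0, -2.5), (1.6, -1.6)]      # x1 - x2 >= 2 -> output  1

for p in group_I + group_II:
    x = np.array(p, dtype=float)
    assert f_threshold(x, w, theta) == f_bias(x, w, theta)
    print(p, f_threshold(x, w, theta))
```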

Perceptrons

Linear separability: a set of (2D) patterns (x1, x2) of two classes is linearly separable if there exists a line on the (x1, x2) plane,
  w0 + w1 x1 + w2 x2 = 0,
that separates all patterns of one class from the other class. A perceptron can then be built with 3 inputs x0 = 1, x1, x2 and weights w0, w1, w2.
For n-dimensional patterns (x1, ..., xn), the hyperplane
  w0 + w1 x1 + w2 x2 + ... + wn xn = 0
divides the space into two regions.
Can we get the weights from a set of sample patterns? If the problem is linearly separable, then YES (by perceptron learning).

Examples of linearly separable classes

- Logical AND function, patterns (bipolar):

    x1  x2  output
    -1  -1    -1
    -1   1    -1
     1  -1    -1
     1   1     1

  Decision boundary: w1 = 1, w2 = 1, w0 = -1, i.e. -1 + x1 + x2 = 0
  x: class I (output = 1), o: class II (output = -1)

- Logical OR function, patterns (bipolar):

    x1  x2  output
    -1  -1    -1
    -1   1     1
     1  -1     1
     1   1     1

  Decision boundary: w1 = 1, w2 = 1, w0 = 1, i.e. 1 + x1 + x2 = 0
  x: class I (output = 1), o: class II (output = -1)

Perceptron Learning

The network:
- Input vector i (including the threshold input x0 = 1)
- Weight vector w = (w0, w1, ..., wn)
- net = sum_{k=0}^{n} w_k i_k
- Output: bipolar (-1, 1) using the sign node function
  output = 1 if w · i >= 0, -1 otherwise

Training samples: pairs (i, class(i)), where class(i) is the correct classification of i.
Training: update w so that all sample inputs are correctly classified (if possible).
If an input i is misclassified by the current w, i.e. class(i) (w · i) < 0, change w to w + Δw so that (w + Δw) · i is closer to class(i).

Perceptron Learning

Perceptron learning rule: Δw = η class(i) i, where η > 0 is the learning rate.

Perceptron Learning

Justification: with the new weights,
  (w + Δw) · i = (w + η class(i) i) · i = w · i + η class(i) (i · i).
Since i · i >= 0, the added term η class(i) (i · i) is >= 0 if class(i) = 1 and <= 0 if class(i) = -1, so the new net moves toward class(i).

Perceptron Learning

Perceptron learning convergence theorem.
Informal: any problem that can be represented by a perceptron can be learned by the learning rule.
Theorem: If there is a w* such that f(i_p · w*) = class(i_p) for all P training sample patterns {(i_p, class(i_p))}, then for any start weight vector w0, the perceptron learning rule will converge to a weight vector w such that f(i_p · w) = class(i_p) for all p (w and w* may not be the same).
Proof: reading for grad students (Sec. 2.4).

Perceptron Learning

Notes:
- It is supervised learning (class(i) is given for every sample input i).
- Learning occurs only when a sample input is misclassified (error driven).
- Termination criteria: learning stops when all samples are correctly classified,
  assuming the problem is linearly separable and the learning rate η is sufficiently small.
- Choice of learning rate: if η is too large, existing weights are overtaken by Δw = η class(i) i; if η is too small (→ 0), convergence is very slow. Common choice: η = 1.
- Non-numeric input: use an encoding scheme, e.g. Color = (red, blue, green, yellow); (0, 0, 1, 0) encodes green.
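Below is a minimal Python sketch of the perceptron learning rule just described (bipolar targets, threshold folded in as x0 = 1, η = 1). The function and variable names are illustrative, not from the chapter.

```python
import numpy as np

def perceptron_train(samples, eta=1.0, max_epochs=100):
    """samples: list of (input_vector, class) with class in {-1, +1}."""
    n = len(samples[0][0])
    w = np.zeros(n + 1)                      # w[0] is the threshold weight w0
    for _ in range(max_epochs):
        errors = 0
        for x, c in samples:
            i = np.concatenate(([1.0], x))   # threshold input x0 = 1
            if c * np.dot(w, i) <= 0:        # misclassified (or on the boundary)
                w += eta * c * i             # delta w = eta * class(i) * i
                errors += 1
        if errors == 0:                      # all samples correctly classified
            break
    return w

# Logical AND with bipolar patterns (linearly separable)
and_samples = [(np.array([-1., -1.]), -1), (np.array([-1., 1.]), -1),
               (np.array([1., -1.]), -1), (np.array([1., 1.]), 1)]
print(perceptron_train(and_samples))
```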

Perceptron Learning

Learning quality. Generalization: can a trained perceptron correctly classify patterns not included in the training samples? This is a common problem for many NN learning models. It depends on the quality of the training samples selected, and to some extent on the learning rate and initial weights (bad choices may make learning too slow to converge to be practical). How can we know the learning is OK? Cross-validation: reserve a few samples for testing; they do not participate in learning.

Adaline

By Widrow and Hoff (~1960): adaptive linear elements for signal processing. The same architecture as perceptrons. Learning method: the delta rule (another form of error-driven learning), also called the Widrow-Hoff learning rule. It tries to reduce the mean squared error (MSE) between the net input and the desired output.

Adaline

Delta rule: let i = (i_0, i_1, ..., i_n) be an input vector with desired output d. The squared error is
  E = (d - net)^2 = (d - sum_l w_l i_l)^2,
and its value is determined by the weights w_l. Modify the weights by the gradient descent approach: change the weights in the opposite direction of ∂E/∂w_k,
  Δw_k = η (d - net) i_k = η (d - sum_l w_l i_l) i_k.

Adaline Learning Algorithm
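The algorithm itself is not reproduced in the transcription; the following Python sketch shows one plausible per-sample (incremental) delta-rule training loop under the definitions above. The names, learning rate, and epoch count are assumptions.

```python
import numpy as np

def adaline_train(samples, eta=0.1, epochs=50):
    """samples: list of (input_vector, desired output d); bias folded in as i0 = 1."""
    n = len(samples[0][0])
    w = np.zeros(n + 1)
    for _ in range(epochs):
        for x, d in samples:
            i = np.concatenate(([1.0], x))
            net = np.dot(w, i)               # linear net input (no threshold during learning)
            w += eta * (d - net) * i         # delta w_k = eta * (d - net) * i_k
    return w

and_samples = [(np.array([-1., -1.]), -1), (np.array([-1., 1.]), -1),
               (np.array([1., -1.]), -1), (np.array([1., 1.]), 1)]
print(adaline_train(and_samples))            # classify afterwards with sign(w . i)
```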

Adaline Learning

Delta rule in batch mode, based on the mean squared error over all P samples:
  E = (1/P) sum_{p=1}^{P} (d_p - net_p)^2
E is again a function of w = (w0, w1, ..., wn). The gradient of E:
  ∂E/∂w_k = (2/P) sum_{p=1}^{P} (d_p - net_p) ∂(d_p - net_p)/∂w_k = -(2/P) sum_{p=1}^{P} (d_p - net_p) i_{k,p}
Therefore
  Δw_k = η sum_{p=1}^{P} (d_p - net_p) i_{k,p}   (the constant factor 2/P is absorbed into η)
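A corresponding batch-mode sketch (again illustrative; η and the epoch count are arbitrary choices) computes the gradient over all P samples before each update:

```python
import numpy as np

def adaline_train_batch(X, d, eta=0.1, epochs=200):
    """X: P x (n+1) input matrix with a leading column of 1s; d: P desired outputs."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        net = X @ w                                 # net_p for all P samples
        grad = -(2.0 / len(d)) * X.T @ (d - net)    # gradient of E = (1/P) sum (d_p - net_p)^2
        w -= eta * grad                             # move opposite to the gradient
    return w

X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1], dtype=float)
print(adaline_train_batch(X, d))
```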

Adaline Learning

Notes:
- Weights will be changed even if an input is classified correctly.
- E monotonically decreases until the system reaches a state with (local) minimum E (a small change of any w_i will cause E to increase).
- At a local-minimum state, ∂E/∂w_i = 0 for all i, but E is not guaranteed to be zero (net ≠ d). This is why Adaline applies a threshold function to the net input for classification, rather than using the linear output directly.

Linear Separability Again

Examples of linearly inseparable classes
- Logical XOR (exclusive OR) function, patterns (bipolar):

    x0  x1  x2  output
     1  -1  -1    -1
     1  -1   1     1
     1   1  -1     1
     1   1   1    -1

  x: class I (output = 1), o: class II (output = -1)

No line can separate these two classes, as can be seen from the fact that the following linear inequality system has no solution:
  w0 - w1 - w2 < 0    (1)
  w0 - w1 + w2 >= 0   (2)
  w0 + w1 - w2 >= 0   (3)
  w0 + w1 + w2 < 0    (4)
because we have w0 < 0 from (1) + (4), and w0 >= 0 from (2) + (3), which is a contradiction.

XOR can be solved by a more complex network with threshold hidden units.

[Figure: two-layer network. Inputs x1, x2 feed hidden units z1, z2 (threshold 2) with weights 2, -2 to z1 and -2, 2 to z2; z1 and z2 feed the output unit Y with weights 2, 2.]

  (x1, x2)   (z1, z2)    y
  (-1, -1)   (-1, -1)   -1
  (-1,  1)   (-1,  1)    1
  ( 1, -1)   ( 1, -1)    1
  ( 1,  1)   (-1, -1)   -1
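A quick Python check (a sketch, not from the chapter) that this network computes XOR on bipolar inputs. The hidden thresholds (2) and weights (±2) follow the figure; the output threshold is assumed to be 0, which makes Y an OR of z1 and z2 (any value in (-4, 0] would work).

```python
def step(net, threshold):
    """Bipolar threshold unit: 1 if net >= threshold, else -1."""
    return 1 if net >= threshold else -1

def xor_net(x1, x2):
    z1 = step(2 * x1 - 2 * x2, 2)     # fires only for (x1, x2) = (1, -1)
    z2 = step(-2 * x1 + 2 * x2, 2)    # fires only for (x1, x2) = (-1, 1)
    return step(2 * z1 + 2 * z2, 0)   # OR of z1, z2 (assumed output threshold)

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print((x1, x2), xor_net(x1, x2))   # -1, 1, 1, -1: exactly XOR
```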

Why must hidden units be non-linear?

A multi-layer net with linear hidden layers is equivalent to a single layer net.

[Figure: inputs x1, x2 with weights v11, v12, v21, v22 to linear hidden units z1, z2, which feed output Y with weights w1, w2 (threshold = 0).]

Because z1 and z2 are linear units,
  z1 = a1 * (x1*v11 + x2*v21) + b1
  z2 = a2 * (x1*v12 + x2*v22) + b2
  net y = z1*w1 + z2*w2 = x1*u1 + x2*u2 + (b1*w1 + b2*w2)
where u1 = a1*v11*w1 + a2*v12*w2 and u2 = a1*v21*w1 + a2*v22*w2.
So net y is still a linear combination of x1 and x2.
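A short numeric check of this argument (illustrative random values; the variable names mirror the derivation above):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(2, 2))        # V[i, j] = v_ij, input i to hidden unit j
w = rng.normal(size=2)             # hidden-to-output weights w1, w2
a = rng.normal(size=2)             # linear hidden slopes a1, a2
b = rng.normal(size=2)             # linear hidden offsets b1, b2
x = rng.normal(size=2)             # an arbitrary input (x1, x2)

z = a * (V.T @ x) + b              # z_j = a_j * (x1*v_1j + x2*v_2j) + b_j
y_two_layer = z @ w                # net y = z1*w1 + z2*w2

u = (V * a) @ w                    # u_i = sum_j v_ij * a_j * w_j
y_one_layer = x @ u + b @ w        # single layer plus a constant

print(np.isclose(y_two_layer, y_one_layer))   # True
```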

Summary

- Single layer nets have limited representation power (the linear separability problem).
- Error-driven learning seems a good way to train a net.
- Multi-layer nets with non-linear hidden units may overcome the linear inseparability problem; learning methods for such nets are needed.
- Threshold/step output functions hinder the effort to develop learning methods for multi-layered nets because they are not everywhere differentiable.