BACK-PROPAGATION NETWORKS

Serious limitations of (single-layer) perceptrons:
- Cannot learn non-linearly separable tasks
- Cannot approximate (learn) non-linear functions
- Difficult (if not impossible) to design learning algorithms for multi-layer networks of perceptrons

Solution:
- Use multi-layer networks for non-linearly separable tasks
- Use continuous, differentiable, non-linear activation functions
- Solve the credit assignment problem

Multi-layer Perceptrons

[Figure: a two-layer network with an input layer, a hidden layer, and an output layer producing o_1 and o_2; first-layer weights w^{(1,0)}_{j,i}, second-layer weights w^{(2,1)}_{k,j}.]

- Activation function: sigmoid
- Error-correction learning
- Least-mean-square and gradient-descent learning:
  - Forward pass
  - Backward pass

Preliminaries

Training set: {(x_p, d_p), 1 \le p \le P}
- P = number of input patterns
- x_p = (x_{p,0}, ..., x_{p,N})
- d_p = (d_{p,1}, ..., d_{p,K}) = desired output for x_p
- N = input space dimension
- K = number of output neurons
- o_p = (o_{p,1}, ..., o_{p,K}) = actual output

Objective: minimize the cumulative error
    Error = \sum_{p=1}^{P} Err(d_p, o_p)

Err should be a metric (distance measure):
- Err(x, y) \ge 0
- Err(x, y) = 0 if and only if x = y
- Err(x, y) = Err(y, x)
- Err(x, y) + Err(y, z) \ge Err(x, z)

Preliminaries (continued)

Popular choices are based on norms of d_p - o_p:
- e_{p,j} = d_{p,j} - o_{p,j}
- E_p = Err(d_p, o_p) = (|e_{p,1}|^u + ... + |e_{p,K}|^u)^{1/u}, u > 0
  - u = 1: Manhattan distance
  - u = 2: Euclidean distance

Let u = 2; the cumulative error to minimize is then
- Sum of Squared Errors: SSE = \sum_{p=1}^{P} \sum_{j=1}^{K} e_{p,j}^2
- Mean Squared Error: MSE = \frac{1}{P} \frac{1}{K} \sum_{p=1}^{P} \sum_{j=1}^{K} e_{p,j}^2
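
As an illustration, here is a minimal numpy sketch of these error measures for the u = 2 case; the P x K arrays d and o and their values are purely illustrative.

```python
import numpy as np

# Illustrative P x K arrays: desired outputs d and actual network outputs o.
d = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # P = 3 patterns, K = 2 outputs
o = np.array([[0.8, 0.1], [0.3, 0.7], [0.6, 0.4]])

e = d - o                                  # errors e_{p,j}
sse = np.sum(e ** 2)                       # Sum of Squared Errors
mse = sse / (e.shape[0] * e.shape[1])      # MSE = SSE / (P * K)
print(sse, mse)
```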

Back-Propagation Algorithm

Scenario for one-hidden-layer networks:
1. x_{p,i}: value at the i-th input node
2. net^{(1)}_{p,j} = \sum_{i=0}^{N} w^{(1,0)}_{j,i} x_{p,i}: net input of the j-th node in the hidden layer
3. x^{(1)}_{p,j} = \sigma(\sum_{i=0}^{N} w^{(1,0)}_{j,i} x_{p,i}): output of the j-th node in the hidden layer
4. net^{(2)}_{p,k} = \sum_{j} w^{(2,1)}_{k,j} x^{(1)}_{p,j}: net input of the k-th node in the output layer
5. o_{p,k} = \sigma(\sum_{j} w^{(2,1)}_{k,j} x^{(1)}_{p,j}): output of the k-th node in the output layer
6. e^2_{p,k} = (d_{p,k} - o_{p,k})^2: squared error at the k-th output node
7. w^{(i+1,i)}_{k,j}: weight from the j-th node in layer i to the k-th node in layer i+1
8. x^{(i)}_{p,j}: output of the j-th node of layer i for pattern x_p
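
A minimal numpy sketch of this forward pass for a single pattern, assuming a logistic sigmoid activation; the weight matrices W1, W2 and the layer sizes are illustrative, not part of the slides.

```python
import numpy as np

def sigmoid(net, alpha=1.0):
    # Logistic sigmoid activation.
    return 1.0 / (1.0 + np.exp(-alpha * net))

def forward(x, W1, W2):
    # x: input vector including the bias component x_0 = 1.
    net1 = W1 @ x            # net inputs of the hidden nodes, net^{(1)}_j
    x1 = sigmoid(net1)       # hidden-node outputs x^{(1)}_j
    net2 = W2 @ x1           # net inputs of the output nodes, net^{(2)}_k
    o = sigmoid(net2)        # network outputs o_k
    return x1, o

# Example: 2 inputs plus bias, 3 hidden nodes, 2 output nodes.
rng = np.random.default_rng(0)
W1 = rng.uniform(-1, 1, size=(3, 3))   # w^{(1,0)}_{j,i}
W2 = rng.uniform(-1, 1, size=(2, 3))   # w^{(2,1)}_{k,j}
x = np.array([1.0, 0.5, -0.2])         # x_0 = 1 is the bias input
x1, o = forward(x, W1, W2)
```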

Back-Propagation Algorithm

E_p = \sum_k e^2_{p,k}: error for pattern x_p. For simplicity we write E = \sum_{k=1}^{K} e^2_k (for now we drop the suffix p for convenience).

The MSE is minimal when E is minimal for each pattern x_p. Since o_k depends on the network weights w, E is also a function of w.

According to gradient descent, the weight update is
    \Delta w = -\eta \frac{\partial E}{\partial w}
that is,
    \Delta w^{(2,1)}_{k,j} = -\eta \frac{\partial E}{\partial w^{(2,1)}_{k,j}}   and   \Delta w^{(1,0)}_{j,i} = -\eta \frac{\partial E}{\partial w^{(1,0)}_{j,i}}

How does E depend on the weights?
1. E depends on o_k, which depends on net^{(2)}_k, which depends on w^{(2,1)}_{k,j}
2. E depends on o_k, which depends on net^{(2)}_k, which depends on x^{(1)}_j, which depends on net^{(1)}_j, which depends on w^{(1,0)}_{j,i}

Back-Propagation Algorithm

By the chain rule, we have
1. \frac{\partial E}{\partial w^{(2,1)}_{k,j}} = \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial net^{(2)}_k} \frac{\partial net^{(2)}_k}{\partial w^{(2,1)}_{k,j}}
2. \frac{\partial E}{\partial w^{(1,0)}_{j,i}} = \sum_{k=1}^{K} \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial net^{(2)}_k} \frac{\partial net^{(2)}_k}{\partial x^{(1)}_j} \frac{\partial x^{(1)}_j}{\partial net^{(1)}_j} \frac{\partial net^{(1)}_j}{\partial w^{(1,0)}_{j,i}}

After substitutions we obtain
    \frac{\partial E}{\partial w^{(2,1)}_{k,j}} = -2 (d_k - o_k) \sigma'(net^{(2)}_k) x^{(1)}_j
    \frac{\partial E}{\partial w^{(1,0)}_{j,i}} = -\sum_{k=1}^{K} 2 (d_k - o_k) \sigma'(net^{(2)}_k) w^{(2,1)}_{k,j} \sigma'(net^{(1)}_j) x_i

Applying the gradient-descent rule (the constant factor 2 is absorbed into the learning rate \eta) we have
1. \Delta w^{(2,1)}_{k,j} = \eta \delta_k x^{(1)}_j
2. \Delta w^{(1,0)}_{j,i} = \eta \mu_j x_i
where
    \delta_k = (d_k - o_k) \sigma'(net^{(2)}_k)
    \mu_j = (\sum_k \delta_k w^{(2,1)}_{k,j}) \sigma'(net^{(1)}_j)
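
A sketch of these update formulas for one pattern, using the logistic-sigmoid derivative \sigma' = \sigma(1-\sigma) derived on the next slide; the network sizes, the learning rate, and the pattern values are illustrative.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

# Illustrative small network: 3 inputs (incl. bias), 3 hidden nodes, 2 outputs.
rng = np.random.default_rng(1)
W1 = rng.uniform(-1, 1, size=(3, 3))   # w^{(1,0)}_{j,i}
W2 = rng.uniform(-1, 1, size=(2, 3))   # w^{(2,1)}_{k,j}
x = np.array([1.0, 0.5, -0.2])         # input pattern (x_0 = 1 is the bias)
d = np.array([1.0, 0.0])               # desired output
eta = 0.1                              # learning rate

# Forward pass.
x1 = sigmoid(W1 @ x)                   # hidden outputs x^{(1)}_j
o = sigmoid(W2 @ x1)                   # network outputs o_k

# Local gradients (sigma' = sigma * (1 - sigma) for the logistic sigmoid).
delta = (d - o) * o * (1 - o)          # delta_k
mu = (W2.T @ delta) * x1 * (1 - x1)    # mu_j

# Gradient-descent weight updates.
W2 += eta * np.outer(delta, x1)        # Delta w^{(2,1)}_{k,j} = eta * delta_k * x^{(1)}_j
W1 += eta * np.outer(mu, x)            # Delta w^{(1,0)}_{j,i} = eta * mu_j * x_i
```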

Back-Propagation Algorithm

The changes in the weights of the two layers have a similar form:
1. \delta_k is proportional to the actual error (d_k - o_k), multiplied by the derivative of the output node with respect to the net input of that node.
2. \mu_j is proportional to the weighted sum of the errors coming to the hidden node from the nodes in the upper layer.

We made no assumption about the activation function \sigma except that it should be differentiable. For the logistic sigmoid function we have
    \sigma(net) = \frac{1}{1 + e^{-\alpha\, net}}
    \sigma'(net) = \alpha \sigma(net) (1 - \sigma(net))
Therefore (taking \alpha = 1):
1. \delta_k = (d_k - o_k) o_k (1 - o_k)
2. \mu_j = (\sum_k \delta_k w^{(2,1)}_{k,j}) x^{(1)}_j (1 - x^{(1)}_j)
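
A quick finite-difference check of the derivative identity above (the value of alpha and the test points are arbitrary):

```python
import numpy as np

def sigmoid(net, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * net))

# Check sigma'(net) = alpha * sigma(net) * (1 - sigma(net)) numerically.
net = np.linspace(-4, 4, 9)
alpha, h = 1.5, 1e-6
numeric = (sigmoid(net + h, alpha) - sigmoid(net - h, alpha)) / (2 * h)
analytic = alpha * sigmoid(net, alpha) * (1 - sigmoid(net, alpha))
print(np.max(np.abs(numeric - analytic)))   # close to zero: the two agree
```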

Back-Propagation Algorithm

Start with initial random weights w;
Repeat
    For each input pattern x_p do
        {Propagate x_p (forward pass), that is:}
        From the first hidden layer to the output layer do
            Compute hidden node activations: net^{(1)}_{p,j};
            Compute hidden node outputs: x^{(1)}_{p,j};
            Compute output node activations: net^{(2)}_{p,k};
            Compute network outputs: o_{p,k};
        Compute the network's error: e_{p,k} = d_{p,k} - o_{p,k};
        {Back-propagate e_p (backward pass), that is:}
        From the output layer to the first hidden layer do
            For the output layer do
                Update output-layer weights:
                    \delta_{p,k} = (d_{p,k} - o_{p,k}) o_{p,k} (1 - o_{p,k});
                    \Delta w^{(2,1)}_{k,j} = \eta \delta_{p,k} x^{(1)}_{p,j};
            For a hidden layer do
                Update hidden-layer weights:
                    \mu_{p,j} = (\sum_k \delta_{p,k} w^{(2,1)}_{k,j}) x^{(1)}_{p,j} (1 - x^{(1)}_{p,j});
                    \Delta w^{(1,0)}_{j,i} = \eta \mu_{p,j} x_{p,i};
Until MSE(w) is minimal;
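
A runnable numpy sketch of this training loop for a one-hidden-layer network with logistic sigmoid units, shown on the XOR problem; the network size, learning rate, epoch count, and fixed-epoch stopping rule are illustrative choices rather than part of the slides, and convergence depends on the random initialization.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def train_backprop(X, D, n_hidden=4, eta=0.5, epochs=10000, seed=0):
    # X: P x (N+1) inputs with a bias column x_0 = 1;  D: P x K desired outputs.
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-1, 1, size=(n_hidden, X.shape[1]))   # w^{(1,0)}
    W2 = rng.uniform(-1, 1, size=(D.shape[1], n_hidden))   # w^{(2,1)}
    for _ in range(epochs):
        for x, d in zip(X, D):                    # sequential (per-pattern) updates
            x1 = sigmoid(W1 @ x)                  # forward pass: hidden outputs
            o = sigmoid(W2 @ x1)                  # forward pass: network outputs
            delta = (d - o) * o * (1 - o)         # output-layer local gradients
            mu = (W2.T @ delta) * x1 * (1 - x1)   # hidden-layer local gradients
            W2 += eta * np.outer(delta, x1)       # backward pass: update w^{(2,1)}
            W1 += eta * np.outer(mu, x)           # backward pass: update w^{(1,0)}
    return W1, W2

# XOR: each input pattern carries a leading bias component of 1.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_backprop(X, D)
print(sigmoid(W2 @ sigmoid(W1 @ X.T)).T.round(2))   # should approach [[0],[1],[1],[0]]
```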

General Multiclass Classification Problem

- K classes: C_1, ..., C_K
- n_k examples of class C_k: T_k = {(x^k_p, d^k_p) : 1 \le p \le n_k}, for 1 \le k \le K
- Training set: T = T_1 \cup ... \cup T_K

Output representation:
1. \log_2 K output nodes (bad): training is difficult since many output nodes must be high simultaneously. Cross-talk phenomenon: different training examples require conflicting changes to be made to the same weight.
2. K output nodes (better): one output node is assigned to each class, so each output node focuses on learning a single class rather than performing multiple duties.
3. Error-correcting output code (best): minimizes cross-talk.

General Multiclass Classification Problem

Desired output representation:
1. Outputs of exactly 0 or 1 would require weights of very high magnitude,
2. since \sigma'(net) \to 0 when net \to \pm\infty.
3. Desired output for class k: d^k = (\epsilon, ..., \epsilon, 1-\epsilon, \epsilon, ..., \epsilon), with 1-\epsilon in the k-th position, where \epsilon > 0 (typical choices are 0.01 and 0.1).

Perfect classification is possible even if the error d_{p,j} - o_{p,j} is not 0 (in absolute value), so during training set
1. if d_{p,j} = 1-\epsilon and o_{p,j} \ge d_{p,j}, then e_{p,j} = 0
2. if d_{p,j} = \epsilon and o_{p,j} \le d_{p,j}, then e_{p,j} = 0
3. otherwise e_{p,j} = d_{p,j} - o_{p,j}

Class membership of x_p after training:
1. Assign x_p to the class k for which \|d^k - o_p\| \le \|d^j - o_p\| for all j with 1 \le j \le K, or
2. Assign x_p to class k if o_{p,k} > o_{p,j} for all j \ne k with 1 \le j \le K.
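
A short sketch of these conventions, assuming \epsilon = 0.1 and the largest-output membership rule; the array names and values are illustrative.

```python
import numpy as np

eps = 0.1
K = 4

def target(k, K=K, eps=eps):
    # Desired output for class k: (eps, ..., eps, 1-eps, eps, ..., eps).
    d = np.full(K, eps)
    d[k] = 1.0 - eps
    return d

def training_error(d, o, eps=eps):
    # Treat outputs that already meet or overshoot the target in the right
    # direction as error-free; otherwise use the ordinary error d - o.
    e = d - o
    e[(d == 1.0 - eps) & (o >= d)] = 0.0
    e[(d == eps) & (o <= d)] = 0.0
    return e

o = np.array([0.15, 0.93, 0.05, 0.20])   # example network output
d = target(1)
print(training_error(d, o))              # zero where the output already suffices
print(int(np.argmax(o)))                 # class assignment by largest output -> 1
```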

Heuristic Modifications of Back-Propagation

Frequency of weight updates:
1. Sequential learning
   - Weights are updated after each example presentation
   - Slower, but uses less memory
   - Easy to implement
   - Local minimization: less ability to escape a local optimum
2. Batch learning: \Delta w = \sum_{p=1}^{P} \Delta w_p
   - Weights are updated after each epoch
   - Faster, but uses more memory
   - Can be parallelized
   - Global minimization: better ability to escape a local optimum

It is good practice to randomize the order of presentation of the training examples from one epoch to the next.
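
A sketch contrasting the two update schedules; grad_fn is a hypothetical helper returning the per-pattern updates (the outer products of \delta and \mu from the earlier sketch), and the shuffling follows the advice above.

```python
import numpy as np

def batch_epoch(X, D, W1, W2, eta, grad_fn):
    # Batch learning: accumulate the per-pattern updates, apply them once per epoch.
    dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)
    for x, d in zip(X, D):
        g1, g2 = grad_fn(x, d, W1, W2)
        dW1 += g1
        dW2 += g2
    return W1 + eta * dW1, W2 + eta * dW2

def sequential_epoch(X, D, W1, W2, eta, grad_fn, rng):
    # Sequential learning: update after every pattern, in a freshly shuffled order.
    for p in rng.permutation(len(X)):
        g1, g2 = grad_fn(X[p], D[p], W1, W2)
        W1 = W1 + eta * g1
        W2 = W2 + eta * g2
    return W1, W2
```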

Heuristic Modifications of Back-Propagation

Maximizing information content: select examples that have the largest possible information content for the learning problem.
1. Use samples that result in the largest training error
2. Use radically different samples in training
3. Use random presentation
4. Use an emphasizing scheme

Activation function:
1. Should be antisymmetric: \sigma(-y) = -\sigma(y). Example: the hyperbolic tangent \sigma(y) = a \tanh(by)
2. Bipolar representation (-1, +1) vs. binary (0, 1):
   (a) Weights are always updated
   (b) Noise representation
3. Faster learning and better generalizability

Heuristic Modifications of Back-Propagation

Normalizing the inputs:
1. Weights should be updated at approximately the same speed
2. Prevents the network from being biased toward particular inputs

Initialization of weights:
1. Random values: -1 \le w_{j,i} \le +1
2. Normalized: w_{j,i}(new) = w_{j,i}(old) / \bar{w}_j(old), where \bar{w}_j(old) denotes the average weight, computed over all values of i
3. Averaged: w^{(1,0)}_{j,i} = \pm \frac{1}{P} \sum_{p=1}^{P} \frac{1}{x_{p,i}}  or  w^{(2,1)}_{k,j} = \pm \frac{1}{P} \sum_{p=1}^{P} \frac{1}{\sigma(net^{(1)}_{p,j})}
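
A small sketch of input normalization and random weight initialization; the slide gives no formula for normalizing the inputs, so standardizing each input to zero mean and unit variance is used here as one common realization (an assumption), and the data values are illustrative.

```python
import numpy as np

# Illustrative training inputs, P x N (without the bias column).
X = np.array([[0.2, 150.0], [0.5, 90.0], [0.9, 120.0], [0.1, 200.0]])

# Input normalization (assumed realization): standardize each input feature so
# that weights attached to different inputs are updated at comparable speeds.
mean, std = X.mean(axis=0), X.std(axis=0)
X_norm = (X - mean) / std

# Weight initialization: random values in [-1, +1].
rng = np.random.default_rng(0)
W1 = rng.uniform(-1.0, 1.0, size=(4, X.shape[1] + 1))   # 4 hidden nodes, +1 for the bias
```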

Heuristic Modifications of Back-Propagation

Choice of learning rate:
1. A constant rate \eta
2. A separate rate \eta_i for each weight w_i
3. Start with a large \eta (or \eta_i) and decrease it steadily
4. At each iteration where
   (a) performance improves: increase \eta (or \eta_i)
   (b) performance worsens: decrease \eta (or \eta_i)
5. Double \eta (or \eta_i) until performance worsens (a runnable sketch is given after this list):

       i := 0;
       w_new := w - E'(w) \epsilon;
       While E(w_new) < E(w) do
           i := i + 1;
           w := w_new;
           w_new := w - E'(w) 2^i \epsilon;
       End-While
       \eta (or \eta_i) := 2^{i-1} \epsilon;

   This searches for a large enough rate (2^i \epsilon) at which the network's error no longer decreases. (We assume \epsilon > 0 is such that E(w - E'(w) \epsilon) < E(w).)
6. Make use of the second derivative of the MSE
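
A runnable sketch of the doubling search in item 5, applied to a simple quadratic error E(w) = ||w||^2; the error function, its gradient, the starting point, and epsilon are illustrative, and the 2^i factors follow the reconstruction above.

```python
import numpy as np

def E(w):
    # Illustrative error function: a quadratic bowl.
    return float(np.sum(w ** 2))

def E_grad(w):
    # Gradient of the illustrative error function.
    return 2.0 * w

def double_learning_rate(w, eps=1e-3):
    # Double the step size until the error no longer decreases, then return
    # the last rate at which the error still improved.
    i = 0
    w_new = w - E_grad(w) * eps
    while E(w_new) < E(w):
        i += 1
        w = w_new
        w_new = w - E_grad(w) * (2 ** i) * eps
    return (2 ** (i - 1)) * eps if i > 0 else eps

w0 = np.array([1.0, -2.0])
print(double_learning_rate(w0))
```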

Heuristic Modifications of Back-Propagation

Momentum:
1. \Delta w grows in magnitude when \partial E / \partial w has the same sign on consecutive iterations: this accelerates descent for faster learning.
2. \Delta w shrinks in magnitude when \partial E / \partial w has opposite signs on consecutive iterations: this stabilizes learning and avoids oscillation.
3. Both effects can be achieved by adding a momentum term to the learning rule:
       \Delta w(t+1) = \eta \delta_k x_j + \alpha \Delta w(t)
   where the momentum \alpha satisfies 0 < \alpha \le 1.
4. The momentum term has an averaging effect: the weights move in the general (or "average") direction of motion.
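
A minimal sketch of the momentum update, keeping the previous \Delta w for each weight matrix; the learning rate, momentum value, and the illustrative descent direction are assumptions.

```python
import numpy as np

def momentum_step(direction, prev_dW, eta=0.1, alpha=0.9):
    # Delta w(t+1) = eta * direction + alpha * Delta w(t)
    return eta * direction + alpha * prev_dW

# Usage inside a training loop: `direction` would be e.g. outer(delta, x1).
W2 = np.zeros((2, 3))
dW2_prev = np.zeros_like(W2)
direction = 0.05 * np.ones_like(W2)       # illustrative descent direction
dW2 = momentum_step(direction, dW2_prev)  # momentum-smoothed update
W2 += dW2
dW2_prev = dW2                            # remember Delta w(t) for the next step
```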

Heuristic Modifications of Back-Propagation

Generalizability:
1. Neural networks are useless if they do not generalize.
2. Design of neural network systems:
   (a) Training and testing
   (b) The choice of training and test sets is very important
   (c) The NN should give good results on both the training and the test set
3. When to stop training? When the test error starts to worsen (a sketch of this rule is given below). Otherwise overfitting occurs: the NN only memorizes the input and does not generalize. A small training error is acceptable when the sampling is good.
4. Smaller networks tend to generalize better.
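
A small sketch of this stopping rule; train_epoch and test_error are hypothetical callables standing in for one training pass and the test-set error, and the patience parameter is a small practical extension of "stop when the test error starts to worsen".

```python
def train_with_early_stopping(train_epoch, test_error, max_epochs=1000, patience=5):
    # Stop once the test error has not improved for `patience` consecutive epochs.
    best_error = float("inf")
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_epoch()                      # one pass over the training set
        error = test_error()               # error on the held-out test set
        if error < best_error:
            best_error = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                      # test error started to worsen: stop
    return best_error
```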