Back-Propagation Algorithm: Perceptron, Gradient Descent, Multilayer neural networks, Back-Propagation, More on Back-Propagation, Examples


Back-Propagation Algorithm. Outline: Perceptron, Gradient Descent, Multilayer neural networks, Back-Propagation, More on Back-Propagation, Examples.

Inner product: $net = \langle \mathbf{w}, \mathbf{x} \rangle = \|\mathbf{w}\|\,\|\mathbf{x}\|\cos\theta = \sum_{i=1}^{n} w_i x_i$, a measure of the projection of one vector onto another.
Activation function: $o = f(net) = f\!\left(\sum_{i=1}^{n} w_i x_i\right)$, with $f(x) := \mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}$
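To make the unit concrete, here is a minimal Python sketch (not from the original slides) of a single threshold unit with the sign activation above; the weight and input values are arbitrary illustrations:

```python
import numpy as np

def perceptron_output(w, x):
    """Single threshold unit: o = sgn(<w, x>)."""
    net = np.dot(w, x)              # inner product: sum_i w_i * x_i
    return 1 if net >= 0 else -1    # sign activation

# Hypothetical example values (not from the slides)
w = np.array([0.5, -0.3, 0.2])
x = np.array([1.0, 2.0, -1.0])
print(perceptron_output(w, x))      # -> -1, since net = 0.5 - 0.6 - 0.2 = -0.3
```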

Other activation functions:
Threshold: $f(x) := \varphi(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$
Piecewise linear: $f(x) := \varphi(x) = \begin{cases} 1 & \text{if } x \ge 0.5 \\ x & \text{if } -0.5 < x < 0.5 \\ 0 & \text{if } x \le -0.5 \end{cases}$
Sigmoid function: $f(x) := \sigma(x) = \dfrac{1}{1 + e^{-ax}}$

Gradient Descent. To understand it, consider a simpler linear unit, where $o = \sum_{i=0}^{n} w_i x_i$. Let's learn the $w_i$ that minimize the squared error $E[\mathbf{w}] = \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2$ over the training set $D = \{(x_1,t_1), (x_2,t_2), \dots, (x_d,t_d), \dots, (x_m,t_m)\}$ ($t$ for target).

The error can be plotted for different hypotheses, e.g. as a surface over $w_0$ and $w_1$ (2 dimensions). We want to move the weight vector in the direction that decreases $E$: $w_i \leftarrow w_i + \Delta w_i$, i.e. $\mathbf{w} \leftarrow \mathbf{w} + \Delta\mathbf{w}$.

Differentiating $E$ gives the update rule for gradient descent:
$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}$
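As an illustration of this batch rule (not part of the slides), a short Python sketch of gradient descent for a linear unit follows; the training data and learning rate are made-up placeholder values:

```python
import numpy as np

def train_linear_unit_batch(X, t, eta=0.05, epochs=100):
    """Batch gradient descent for a linear unit o = w . x.
    Each epoch applies Delta w_i = eta * sum_d (t_d - o_d) * x_id."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                      # outputs for all training examples
        w += eta * (t - o) @ X         # sum the update over all of D
    return w

# Hypothetical training data: targets generated by the linear rule w = (2, -1)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
t = np.array([2.0, -1.0, 1.0, 3.0])
print(train_linear_unit_batch(X, t))   # approaches [ 2. -1.]
```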

Stochastic approximation to gradient descent: $\Delta w_i = \eta (t - o)\, x_i$
The gradient-descent training rule updates the weights after summing over all the training examples in $D$. Stochastic gradient descent approximates gradient descent by updating the weights incrementally, calculating the error for each example. This is known as the delta rule or LMS (least mean-square) weight update, also the Adaline rule, used for adaptive filters. Widrow and Hoff (1960).
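For contrast with the batch sketch above, here is the same hypothetical linear unit trained with the incremental delta/LMS rule, updating after every example:

```python
import numpy as np

def train_linear_unit_stochastic(X, t, eta=0.05, epochs=100):
    """Delta / LMS rule: Delta w_i = eta * (t_d - o_d) * x_id after each example."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o_d = w @ x_d                  # output for this single example
            w += eta * (t_d - o_d) * x_d   # incremental weight update
    return w

# Same hypothetical data as the batch sketch above
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
t = np.array([2.0, -1.0, 1.0, 3.0])
print(train_linear_unit_stochastic(X, t))  # also approaches [ 2. -1.]
```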

The XOR problem and the Perceptron, analyzed by Minsky and Papert in the mid-1960s.

Multi-layer Networks. The limitations of the simple perceptron do not apply to feed-forward networks with intermediate ("hidden") nonlinear units. A network with just one hidden layer can represent any Boolean function. The great power of multi-layer networks was realized long ago, but it was only in the eighties that it was shown how to make them learn. Multiple layers of cascaded linear units still produce only linear functions, so we search for networks capable of representing nonlinear functions: the units should use nonlinear activation functions. Examples of nonlinear activation functions follow.

XOR example.

Back-propagation is a learning algorithm for multi-layer neural networks. It was invented independently several times: Bryson and Ho [1969], Werbos [1974], Parker [1985], Rumelhart et al. [1986]. Parallel Distributed Processing, Vol. 1: Foundations, by David E. Rumelhart, James L. McClelland and the PDP Research Group. "What makes people smarter than computers?" These volumes by a pioneering neurocomputing...

Back-propagation. The algorithm gives a prescription for changing the weights $w_{ij}$ in any feed-forward network to learn a training set of input-output pairs $\{x^d, t^d\}$. We consider a simple two-layer network with inputs $x_1, x_2, x_3, x_4, x_5$.

Given the pattern $x^d$, hidden unit $j$ receives a net input
$net_j^d = \sum_{k=1}^{5} w_{jk} x_k^d$
and produces the output
$V_j^d = f(net_j^d) = f\!\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)$
Output unit $i$ thus receives
$net_i^d = \sum_{j=1}^{3} W_{ij} V_j^d = \sum_{j=1}^{3} W_{ij}\, f\!\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)$
and produces the final output
$o_i^d = f(net_i^d) = f\!\left(\sum_{j=1}^{3} W_{ij} V_j^d\right) = f\!\left(\sum_{j=1}^{3} W_{ij}\, f\!\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)\right)$
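A minimal numpy sketch of this forward pass (added here for clarity, not from the slides), using the 5-3-2 architecture and the all-0.1 initial weights of the worked example later in these slides, so the printed values match the numbers computed there:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w, W):
    """Forward pass of the two-layer network.
    w: input-to-hidden weights, shape (3, 5); W: hidden-to-output weights, shape (2, 3)."""
    net_hidden = w @ x            # net_j = sum_k w_jk * x_k
    V = sigmoid(net_hidden)       # hidden outputs V_j = f(net_j)
    net_out = W @ V               # net_i = sum_j W_ij * V_j
    o = sigmoid(net_out)          # final outputs o_i = f(net_i)
    return V, o

# Initial weights and first pattern from the worked example
w = np.full((3, 5), 0.1)
W = np.full((2, 3), 0.1)
x = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
V, o = forward(x, w, W)
print(V, o)   # V ~ [0.54983 0.54983 0.54983], o ~ [0.54114 0.54114]
```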

Our usual error function: for $l$ outputs and $m$ input-output pairs $\{x^d, t^d\}$,
$E[\mathbf{w}] = \frac{1}{2}\sum_{d=1}^{m}\sum_{i=1}^{l} (t_i^d - o_i^d)^2$
In our example ($l = 2$), $E$ becomes
$E[\mathbf{w}] = \frac{1}{2}\sum_{d=1}^{m}\sum_{i=1}^{2} (t_i^d - o_i^d)^2 = \frac{1}{2}\sum_{d=1}^{m}\sum_{i=1}^{2}\left(t_i^d - f\!\Big(\sum_{j=1}^{3} W_{ij}\, f\big(\sum_{k=1}^{5} w_{jk} x_k^d\big)\Big)\right)^2$
$E[\mathbf{w}]$ is differentiable given $f$ is differentiable, so gradient descent can be applied.

For the hidden-to-output connections the gradient descent rule gives:
$\Delta W_{ij} = -\eta\, \frac{\partial E}{\partial W_{ij}} = \eta \sum_{d=1}^{m} (t_i^d - o_i^d)\, f'(net_i^d)\, V_j^d$
Defining
$\delta_i^d = f'(net_i^d)\,(t_i^d - o_i^d)$
this becomes
$\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d V_j^d$
For the input-to-hidden connections $w_{jk}$ we must differentiate with respect to $w_{jk}$. Using the chain rule we obtain
$\Delta w_{jk} = -\eta\, \frac{\partial E}{\partial w_{jk}} = -\eta \sum_{d=1}^{m} \frac{\partial E}{\partial V_j^d}\, \frac{\partial V_j^d}{\partial w_{jk}}$

$\Delta w_{jk} = \eta \sum_{d=1}^{m}\sum_{i=1}^{2} (t_i^d - o_i^d)\, f'(net_i^d)\, W_{ij}\, f'(net_j^d)\, x_k^d = \eta \sum_{d=1}^{m}\sum_{i=1}^{2} \delta_i^d\, W_{ij}\, f'(net_j^d)\, x_k^d$
with $\delta_i^d = f'(net_i^d)(t_i^d - o_i^d)$ as before. Defining
$\delta_j^d = f'(net_j^d) \sum_{i=1}^{2} W_{ij}\, \delta_i^d$
we obtain
$\Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d\, x_k^d$
which has the same form as $\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d V_j^d$, with a different definition of $\delta$.
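For completeness, the chain-rule expansion behind this input-to-hidden update (a restatement of the step above, not an addition to the slides) is:

$$\frac{\partial E}{\partial w_{jk}} = \sum_{d=1}^{m}\sum_{i=1}^{2} \frac{\partial E}{\partial o_i^d}\,\frac{\partial o_i^d}{\partial net_i^d}\,\frac{\partial net_i^d}{\partial V_j^d}\,\frac{\partial V_j^d}{\partial net_j^d}\,\frac{\partial net_j^d}{\partial w_{jk}} = -\sum_{d=1}^{m}\sum_{i=1}^{2} (t_i^d - o_i^d)\, f'(net_i^d)\, W_{ij}\, f'(net_j^d)\, x_k^d$$

so that $\Delta w_{jk} = -\eta\,\partial E/\partial w_{jk}$ recovers the expression above.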

In general, with an arbitrary number of layers, the back-propagation update rule always has the form
$\Delta w_{ij} = \eta \sum_{d=1}^{m} \delta_{\text{output}}\, V_{\text{input}}$
where "output" and "input" refer to the two ends of the connection concerned, $V$ stands for the appropriate input-end activation (hidden unit or real input $x^d$), and $\delta$ depends on the layer concerned. The equation
$\delta_j^d = f'(net_j^d) \sum_{i=1}^{2} W_{ij}\, \delta_i^d$
allows us to determine the $\delta$ for a given hidden unit $V_j$ in terms of the $\delta$'s of the units $o_i$ it feeds. The coefficients are the usual forward weights, but the errors $\delta$ are propagated backward: hence back-propagation.

We have to use a nonlinear, differentiable activation function. Examples:
$f(x) = \sigma(x) = \dfrac{1}{1 + e^{-\alpha x}}, \qquad f'(x) = \sigma'(x) = \alpha\, \sigma(x)\,(1 - \sigma(x))$
$f(x) = \tanh(\alpha x), \qquad f'(x) = \alpha\,(1 - f(x)^2)$
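As a quick numerical sanity check (not in the slides), both derivative formulas can be compared against a finite-difference approximation; the slope parameter alpha and the test point are arbitrary:

```python
import numpy as np

alpha = 2.0      # arbitrary slope parameter for the check

def sigma(x):    return 1.0 / (1.0 + np.exp(-alpha * x))
def dsigma(x):   return alpha * sigma(x) * (1.0 - sigma(x))

def tanh_a(x):   return np.tanh(alpha * x)
def dtanh_a(x):  return alpha * (1.0 - tanh_a(x) ** 2)

x, h = 0.3, 1e-6
print(dsigma(x),  (sigma(x + h) - sigma(x - h)) / (2 * h))    # should agree
print(dtanh_a(x), (tanh_a(x + h) - tanh_a(x - h)) / (2 * h))  # should agree
```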

Consider a network with $M$ layers, $m = 1, 2, \dots, M$. Let $V_i^m$ denote the output of the $i$-th unit in the $m$-th layer; $V_i^0$ is a synonym for the $i$-th input $x_i$. The superscript $m$ indexes layers, not patterns. $w_{ij}^m$ denotes the connection from $V_j^{m-1}$ to $V_i^m$.
Stochastic Back-Propagation Algorithm (mostly used):
1. Initialize the weights to small random values.
2. Choose a pattern $x^d$ and apply it to the input layer: $V_k^0 = x_k^d$ for all $k$.
3. Propagate the signal forward through the network: $V_i^m = f(net_i^m) = f\!\left(\sum_j w_{ij}^m V_j^{m-1}\right)$.
4. Compute the deltas for the output layer: $\delta_i^M = f'(net_i^M)\,(t_i^d - V_i^M)$.
5. Compute the deltas for the preceding layers, for $m = M, M-1, \dots, 2$: $\delta_i^{m-1} = f'(net_i^{m-1}) \sum_j w_{ji}^m\, \delta_j^m$.
6. Update all connections: $\Delta w_{ij}^m = \eta\, \delta_i^m V_j^{m-1}$, $w_{ij}^{\text{new}} = w_{ij}^{\text{old}} + \Delta w_{ij}$.
7. Go to 2 and repeat for the next pattern.
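A runnable sketch of steps 1-7 in plain numpy (my own illustration, not the slides' code); the training data reuses the two patterns of the worked example below, while the layer sizes, learning rate, and epoch count are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, T, sizes, eta=1.0, epochs=20000, seed=0):
    """Stochastic back-propagation (steps 1-7) for a fully-connected sigmoid network.
    `sizes` lists the layer widths, e.g. [5, 3, 2]. No bias units, as in the slides."""
    rng = np.random.default_rng(seed)
    # 1. Initialize the weights to small random values
    W = [rng.normal(0.0, 0.1, (n_out, n_in))
         for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    for _ in range(epochs):
        for x, t in zip(X, T):                            # 2. and 7.: loop over patterns
            V = [x]                                       # V^0 = x
            for Wm in W:                                  # 3. forward propagation
                V.append(sigmoid(Wm @ V[-1]))
            deltas = [V[-1] * (1 - V[-1]) * (t - V[-1])]  # 4. output-layer deltas
            for m in range(len(W) - 1, 0, -1):            # 5. deltas for earlier layers
                deltas.insert(0, V[m] * (1 - V[m]) * (W[m].T @ deltas[0]))
            for m in range(len(W)):                       # 6. update all connections
                W[m] += eta * np.outer(deltas[m], V[m])   #    Delta w_ij = eta*delta_i*V_j
    return W

# Training patterns taken from the slides' worked example
X = np.array([[1, 1, 0, 0, 0], [0, 0, 0, 1, 1]], dtype=float)
T = np.array([[1, 0], [0, 1]], dtype=float)
W = train_backprop(X, T, sizes=[5, 3, 2])
for x in X:
    v = x
    for Wm in W:
        v = sigmoid(Wm @ v)
    print(np.round(v, 2))   # outputs should end up close to (1, 0) and (0, 1)
```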

Example. Initial weights (all 0.1):
$w_1 = \{w_{11}=0.1,\ w_{12}=0.1,\ w_{13}=0.1,\ w_{14}=0.1,\ w_{15}=0.1\}$
$w_2 = \{w_{21}=0.1,\ w_{22}=0.1,\ w_{23}=0.1,\ w_{24}=0.1,\ w_{25}=0.1\}$
$w_3 = \{w_{31}=0.1,\ w_{32}=0.1,\ w_{33}=0.1,\ w_{34}=0.1,\ w_{35}=0.1\}$
$W_1 = \{W_{11}=0.1,\ W_{12}=0.1,\ W_{13}=0.1\}$, $W_2 = \{W_{21}=0.1,\ W_{22}=0.1,\ W_{23}=0.1\}$
Training patterns: $x^1 = \{1,1,0,0,0\}$ with $t^1 = \{1,0\}$; $x^2 = \{0,0,0,1,1\}$ with $t^2 = \{0,1\}$.
Activation: $f(x) = \sigma(x) = \frac{1}{1+e^{-x}}$, so $f'(x) = \sigma'(x) = \sigma(x)(1-\sigma(x))$.
Hidden layer for $x^1$:
$net_1^1 = \sum_{k=1}^{5} w_{1k} x_k^1 = 1\cdot 0.1 + 1\cdot 0.1 + 0\cdot 0.1 + 0\cdot 0.1 + 0\cdot 0.1 = 0.2$, $V_1^1 = f(net_1^1) = 1/(1+e^{-0.2}) = 0.54983$
$net_2^1 = \sum_{k=1}^{5} w_{2k} x_k^1 = 0.2$, $V_2^1 = f(net_2^1) = 0.54983$
$net_3^1 = \sum_{k=1}^{5} w_{3k} x_k^1 = 0.2$, $V_3^1 = f(net_3^1) = 0.54983$

Output layer for $x^1$:
$net_1^1 = \sum_{j=1}^{3} W_{1j} V_j^1 = 0.54983\cdot 0.1 + 0.54983\cdot 0.1 + 0.54983\cdot 0.1 = 0.16495$, $o_1^1 = f(net_1^1) = 1/(1+e^{-0.16495}) = 0.54114$
$net_2^1 = \sum_{j=1}^{3} W_{2j} V_j^1 = 0.16495$, $o_2^1 = f(net_2^1) = 0.54114$
We will use stochastic gradient descent with $\eta = 1$, so the hidden-to-output rule $\Delta W_{ij} = \eta \sum_{d=1}^{m}(t_i^d - o_i^d)\, f'(net_i^d)\, V_j^d$ becomes, per pattern,
$\Delta W_{ij} = (t_i - o_i)\, f'(net_i)\, V_j = (t_i - o_i)\,\sigma(net_i)(1-\sigma(net_i))\,V_j = \delta_i V_j$, with $\delta_i = (t_i - o_i)\,\sigma(net_i)(1-\sigma(net_i))$.

$\delta_1 = (t_1 - o_1)\,\sigma(net_1)(1-\sigma(net_1)) = (1-0.54114)\cdot 0.54114\cdot(1-0.54114) = 0.11394$, so $\Delta W_{1j} = \delta_1 V_j$
$\delta_2 = (t_2 - o_2)\,\sigma(net_2)(1-\sigma(net_2)) = (0-0.54114)\cdot 0.54114\cdot(1-0.54114) = -0.13437$, so $\Delta W_{2j} = \delta_2 V_j$
For the input-to-hidden weights,
$\Delta w_{jk} = \sum_{i=1}^{2}\delta_i W_{ij}\, f'(net_j)\, x_k = \sum_{i=1}^{2}\delta_i W_{ij}\,\sigma(net_j)(1-\sigma(net_j))\, x_k = \delta_j x_k$, with $\delta_j = \sigma(net_j)(1-\sigma(net_j))\sum_{i=1}^{2} W_{ij}\,\delta_i$.

$\delta_1 = \sigma(net_1)(1-\sigma(net_1))\sum_{i=1}^{2} W_{i1}\delta_i = \frac{1}{1+e^{-0.2}}\left(1-\frac{1}{1+e^{-0.2}}\right)\big(0.1\cdot 0.11394 + 0.1\cdot(-0.13437)\big) = -5.0568\mathrm{e}{-04}$
$\delta_2 = \sigma(net_2)(1-\sigma(net_2))\sum_{i=1}^{2} W_{i2}\delta_i = -5.0568\mathrm{e}{-04}$
$\delta_3 = \sigma(net_3)(1-\sigma(net_3))\sum_{i=1}^{2} W_{i3}\delta_i = -5.0568\mathrm{e}{-04}$
First adaptation, for $x^1$ (one epoch is an adaptation over all training patterns, in our case $x^1$ and $x^2$):
$\Delta w_{jk} = \delta_j x_k$ with hidden deltas $\delta_1 = \delta_2 = \delta_3 = -5.0568\mathrm{e}{-04}$ and inputs $x_1 = 1$, $x_2 = 1$, $x_3 = 0$, $x_4 = 0$, $x_5 = 0$
$\Delta W_{ij} = \delta_i V_j$ with output deltas $\delta_1 = 0.11394$, $\delta_2 = -0.13437$ and hidden outputs $V_1 = V_2 = V_3 = 0.54983$
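The following short numpy script (added for verification, not part of the slides) reproduces the numbers of this first adaptation for pattern $x^1$:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

w = np.full((3, 5), 0.1)           # input-to-hidden weights w_jk
W = np.full((2, 3), 0.1)           # hidden-to-output weights W_ij
x1 = np.array([1, 1, 0, 0, 0.0])   # first training pattern
t1 = np.array([1, 0.0])            # its target

V = sigmoid(w @ x1)                # hidden outputs           -> [0.54983 0.54983 0.54983]
o = sigmoid(W @ V)                 # network outputs          -> [0.54114 0.54114]
delta_out = (t1 - o) * o * (1 - o)            # output deltas -> [0.11394 -0.13437]
delta_hid = V * (1 - V) * (W.T @ delta_out)   # hidden deltas -> [-5.0568e-04 ...]

eta = 1.0                          # learning rate used in the slides
W += eta * np.outer(delta_out, V)  # Delta W_ij = eta * delta_i * V_j
w += eta * np.outer(delta_hid, x1) # Delta w_jk = eta * delta_j * x_k
print(V, o, delta_out, delta_hid)
```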

More on Back-Propagation. It performs gradient descent over the entire network weight vector and is easily generalized to arbitrary directed graphs. It will find a local, not necessarily global, error minimum; in practice it often works well (one can run it multiple times from different initializations). Gradient descent can be very slow if $\eta$ is too small, and can oscillate widely if $\eta$ is too large. One often includes a weight momentum term $\alpha$:
$\Delta w_{pq}(t+1) = -\eta\, \frac{\partial E}{\partial w_{pq}} + \alpha\, \Delta w_{pq}(t)$
The momentum parameter $\alpha$ is chosen between 0 and 1; 0.9 is a good value.
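A small sketch of how such a momentum term modifies a weight update; grad_E stands for $\partial E/\partial w_{pq}$, assumed to be supplied by back-propagation as above, and all numeric values are placeholders:

```python
import numpy as np

def momentum_step(w, grad_E, prev_delta, eta=0.5, alpha=0.9):
    """One update with momentum: Delta w(t+1) = -eta * dE/dw + alpha * Delta w(t)."""
    delta = -eta * grad_E + alpha * prev_delta
    return w + delta, delta          # updated weights and Delta w(t+1)

# Hypothetical usage inside a training loop:
w = np.zeros(3)
prev_delta = np.zeros_like(w)
for grad_E in [np.array([0.2, -0.1, 0.05])] * 10:   # placeholder gradients
    w, prev_delta = momentum_step(w, grad_E, prev_delta)
print(w)
```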

Back-propagation minimizes the error over the training examples; will it generalize well to unseen examples? Training can take thousands of iterations, so it is slow! Using the network after training, however, is very fast.

Convergence of Backpropagation. Gradient descent converges to some local minimum, perhaps not the global minimum. Remedies: add momentum, use stochastic gradient descent, train multiple nets with different initial weights. Nature of convergence: initialize the weights near zero, therefore the initial network is near-linear; increasingly non-linear functions become possible as training progresses.

Expressive Capabilities of ANNs. Boolean functions: every Boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs. Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]; any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]. NETtalk, Sejnowski et al. 1987.

Prediction


Summary: Perceptron, Gradient Descent, Multi-layer neural networks, Back-Propagation, More on Back-Propagation, Examples.

RBF Networks, Support Vector Machines