Back-Propagation Algorithm
- Perceptron
- Gradient Descent
- Multi-layered neural network
- Back-Propagation
- More on Back-Propagation
- Examples
- Inner product: net = ⟨w, x⟩ = ‖w‖ ‖x‖ cos(θ), i.e. net = Σ_{i=1}^{n} w_i x_i
- A measure of the projection of one vector onto another
- Activation function: o = f(net) = f(Σ_{i=1}^{n} w_i x_i)
- Sign activation: f(x) := sgn(x) = 1 if x ≥ 0, −1 if x < 0
- Threshold activation: f(x) := ϕ(x) = 1 if x ≥ 0, 0 if x < 0
- Piecewise-linear activation: f(x) := ϕ(x) = 1 if x ≥ 0.5, x if −0.5 < x < 0.5, 0 if x ≤ −0.5
- Sigmoid function: f(x) := σ(x) = 1 / (1 + e^{−ax})

Gradient Descent
- To understand it, consider a simpler linear unit, where o = Σ_{i=0}^{n} w_i x_i
- Let's learn the w_i that minimize the squared error over D = {(x_1, t_1), (x_2, t_2), ..., (x_d, t_d), ..., (x_m, t_m)} (t for target)
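As a concrete illustration (not part of the slides), here is a minimal Python/NumPy sketch of the sigmoid activation and the squared error of a linear unit over a training set D; the toy data and function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x, a=1.0):
    """Sigmoid activation from the slide: sigma(x) = 1 / (1 + exp(-a*x))."""
    return 1.0 / (1.0 + np.exp(-a * x))

def squared_error(w, X, t):
    """E(w) = 1/2 * sum_d (t_d - o_d)^2 for a linear unit o = w . x."""
    o = X @ w                          # outputs o_d for all training examples
    return 0.5 * np.sum((t - o) ** 2)

# Toy training set D with three examples and two inputs (illustrative values)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -0.2])
print(squared_error(w, X, t))          # -> 0.39 for these values
```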
- Error surface for different hypotheses, plotted over w_0 and w_1 (dim 2)
- We want to move the weight vector in the direction that decreases E:
  w_i = w_i + Δw_i, i.e. w = w + Δw
- Differentiating E gives the update rule for gradient descent:
  Δw_i = η Σ_{d∈D} (t_d − o_d) x_{id}
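A minimal sketch of this batch update rule in Python/NumPy; the function name, learning rate, and toy data are illustrative choices, not from the slides.

```python
import numpy as np

def batch_gradient_step(w, X, t, eta=0.05):
    """One gradient-descent step for a linear unit:
    Delta w_i = eta * sum_d (t_d - o_d) * x_id, summed over all of D."""
    o = X @ w                 # outputs o_d for every training example
    grad = (t - o) @ X        # sum over d of (t_d - o_d) * x_d
    return w + eta * grad

# Example usage with a toy training set (illustrative values)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)
for _ in range(100):
    w = batch_gradient_step(w, X, t)
print(w)                      # approaches the least-squares solution
```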
Stochastic Approximation to Gradient Descent
- Δw_i = η (t − o) x_i
- The gradient descent training rule updates by summing over all the training examples in D
- Stochastic gradient approximates gradient descent by updating the weights incrementally
- Calculate the error for each example
- Known as the delta rule or LMS (least mean squares) weight update
- Adaline rule, used for adaptive filters, Widrow and Hoff (1960)
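The incremental (stochastic) version, as a sketch under the same illustrative assumptions: one example is used per update.

```python
import numpy as np

def delta_rule_step(w, x, t, eta=0.05):
    """Delta-rule / LMS update for one example: Delta w_i = eta * (t - o) * x_i."""
    o = np.dot(w, x)                   # output of the linear unit for this example
    return w + eta * (t - o) * x

# Sweeping over the examples one by one (illustrative data)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)
for _ in range(100):
    for x_d, t_d in zip(X, t):
        w = delta_rule_step(w, x_d, t_d)
print(w)
```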
XOR Problem and the Perceptron
- The XOR problem cannot be solved by a single perceptron, as shown by Minsky and Papert in the mid-1960s
Multi-layer Networks
- The limitations of the simple perceptron do not apply to feed-forward networks with intermediate ("hidden") nonlinear units
- A network with just one hidden layer can represent any Boolean function
- The great power of multi-layer networks was realized long ago, but it was only in the eighties that it was shown how to make them learn
- Multiple layers of cascaded linear units still produce only linear functions
- We search for networks capable of representing nonlinear functions: units should use nonlinear activation functions
- Examples of nonlinear activation functions
XOR example
Back-Propagation
- Back-propagation is a learning algorithm for multi-layer neural networks
- It was invented independently several times: Bryson and Ho [1969], Werbos [1974], Parker [1985], Rumelhart et al. [1986]
- Parallel Distributed Processing, Vol. 1: Foundations. David E. Rumelhart, James L. McClelland and the PDP Research Group
Back-propagation
- The algorithm gives a prescription for changing the weights w_ij in any feed-forward network so as to learn a training set of input-output pairs {x^d, t^d}
- We consider a simple two-layer network with five inputs x_1, ..., x_5 (indexed by k)
- Given the pattern x^d, the hidden unit j receives a net input
  net_j^d = Σ_{k=1}^{5} w_{jk} x_k^d
  and produces the output
  V_j^d = f(net_j^d) = f(Σ_{k=1}^{5} w_{jk} x_k^d)
- Output unit i thus receives
  net_i^d = Σ_{j=1}^{3} W_{ij} V_j^d = Σ_{j=1}^{3} W_{ij} f(Σ_{k=1}^{5} w_{jk} x_k^d)
  and produces the final output
  o_i^d = f(net_i^d) = f(Σ_{j=1}^{3} W_{ij} V_j^d) = f(Σ_{j=1}^{3} W_{ij} f(Σ_{k=1}^{5} w_{jk} x_k^d))
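A sketch of this forward pass for the 5-3-2 network in Python/NumPy, using the sigmoid as an assumed choice for f; shapes and names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w, W):
    """Forward pass of the two-layer (5-3-2) network.
    w: (3, 5) input-to-hidden weights w_jk; W: (2, 3) hidden-to-output weights W_ij."""
    net_hidden = w @ x            # net_j = sum_k w_jk * x_k
    V = sigmoid(net_hidden)       # V_j = f(net_j)
    net_out = W @ V               # net_i = sum_j W_ij * V_j
    o = sigmoid(net_out)          # o_i = f(net_i)
    return V, o

# Example call with all weights set to 0.1 (as in the worked example later)
V, o = forward(np.array([1., 1., 0., 0., 0.]), np.full((3, 5), 0.1), np.full((2, 3), 0.1))
print(V, o)
```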
- Our usual error function, for l outputs and m input-output pairs {x^d, t^d}:
  E[w] = (1/2) Σ_{d=1}^{m} Σ_{i=1}^{l} (t_i^d − o_i^d)^2
- In our example E becomes
  E[w] = (1/2) Σ_{d=1}^{m} Σ_{i=1}^{2} (t_i^d − o_i^d)^2
       = (1/2) Σ_{d=1}^{m} Σ_{i=1}^{2} (t_i^d − f(Σ_{j=1}^{3} W_{ij} f(Σ_{k=1}^{5} w_{jk} x_k^d)))^2
- E[w] is differentiable given that f is differentiable, so gradient descent can be applied
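As a sketch (sigmoid assumed for f, data shapes illustrative), the error E[w] of the two-layer network can be computed directly from the nested expression above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def network_error(w, W, X, T):
    """E[w] = 1/2 * sum_d sum_i (t_i^d - o_i^d)^2 for the two-layer network."""
    total = 0.0
    for x, t in zip(X, T):
        o = sigmoid(W @ sigmoid(w @ x))    # o_i^d = f(sum_j W_ij f(sum_k w_jk x_k^d))
        total += 0.5 * np.sum((t - o) ** 2)
    return total
```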
- For the hidden-to-output connections, the gradient descent rule gives:
  ΔW_ij = −η ∂E/∂W_ij = η Σ_{d=1}^{m} (t_i^d − o_i^d) f'(net_i^d) V_j^d
- Defining δ_i^d = f'(net_i^d) (t_i^d − o_i^d), this becomes
  ΔW_ij = η Σ_{d=1}^{m} δ_i^d V_j^d
- For the input-to-hidden connections w_jk we must differentiate with respect to w_jk; using the chain rule we obtain
  Δw_jk = −η ∂E/∂w_jk = −η Σ_{d=1}^{m} (∂E/∂V_j^d) (∂V_j^d/∂w_jk)
- Carrying out the differentiation:
  Δw_jk = η Σ_{d=1}^{m} Σ_{i=1}^{2} (t_i^d − o_i^d) f'(net_i^d) W_ij f'(net_j^d) x_k^d
        = η Σ_{d=1}^{m} Σ_{i=1}^{2} δ_i^d W_ij f'(net_j^d) x_k^d
- Defining δ_j^d = f'(net_j^d) Σ_{i=1}^{2} W_ij δ_i^d, this becomes
  Δw_jk = η Σ_{d=1}^{m} δ_j^d x_k^d
- Comparing with ΔW_ij = η Σ_{d=1}^{m} δ_i^d V_j^d, we have the same form with a different definition of δ
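Putting the two δ definitions together, a per-pattern sketch (sigmoid f assumed, η folded in as a parameter; names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weight_changes_for_pattern(x, t, w, W, eta=1.0):
    """Delta W_ij = eta * delta_i * V_j and Delta w_jk = eta * delta_j * x_k for one pattern,
    with delta_i = f'(net_i)(t_i - o_i) and delta_j = f'(net_j) * sum_i W_ij * delta_i."""
    V = sigmoid(w @ x)                            # hidden outputs
    o = sigmoid(W @ V)                            # network outputs
    delta_out = o * (1 - o) * (t - o)             # output-layer deltas (f' = sigma(1-sigma))
    delta_hid = V * (1 - V) * (W.T @ delta_out)   # hidden-layer deltas, propagated backward
    return eta * np.outer(delta_out, V), eta * np.outer(delta_hid, x)
```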
- In general, with an arbitrary number of layers, the back-propagation update rule always has the form
  Δw_ij = η Σ_{d=1}^{m} δ_output V_input
  where "output" and "input" refer to the two ends of the connection concerned, and V stands for the appropriate input-end activation (hidden unit output or real input x^d)
- δ depends on the layer concerned: the equation δ_j^d = f'(net_j^d) Σ_{i=1}^{2} W_ij δ_i^d allows us to determine the δ of a given hidden unit V_j in terms of the δ's of the units o_i it feeds into
- The coefficients are the usual forward weights, but the errors δ are propagated backward: hence back-propagation
- We have to use a nonlinear, differentiable activation function. Examples:
  f(x) = σ(x) = 1 / (1 + e^{−αx}),   f'(x) = σ'(x) = α σ(x) (1 − σ(x))
  f(x) = tanh(αx),   f'(x) = α (1 − f(x)^2)
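A small sketch of these two activation functions and their derivatives (α defaulted to 1 purely for illustration):

```python
import numpy as np

def sigmoid(x, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * x))

def sigmoid_prime(x, alpha=1.0):
    s = sigmoid(x, alpha)
    return alpha * s * (1.0 - s)          # sigma'(x) = alpha * sigma(x) * (1 - sigma(x))

def tanh_prime(x, alpha=1.0):
    f = np.tanh(alpha * x)
    return alpha * (1.0 - f ** 2)         # f'(x) = alpha * (1 - f(x)^2)
```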
- Consider a network with M layers, m = 1, 2, ..., M
- V_i^m denotes the output of the i-th unit of the m-th layer; V_i^0 is a synonym for x_i, the i-th input
- The superscript m labels layers, not patterns
- w_ij^m denotes the connection from V_j^{m−1} to V_i^m

Stochastic Back-Propagation Algorithm (mostly used)
1. Initialize the weights to small random values
2. Choose a pattern x^d and apply it to the input layer: V_k^0 = x_k^d for all k
3. Propagate the signal forward through the network: V_i^m = f(net_i^m) = f(Σ_j w_ij^m V_j^{m−1})
4. Compute the deltas for the output layer: δ_i^M = f'(net_i^M) (t_i^d − V_i^M)
5. Compute the deltas for the preceding layers, for m = M, M−1, ..., 2: δ_i^{m−1} = f'(net_i^{m−1}) Σ_j w_ji^m δ_j^m
6. Update all connections: Δw_ij^m = η δ_i^m V_j^{m−1},   w_ij^new = w_ij^old + Δw_ij
7. Go to step 2 and repeat for the next pattern
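Steps 1-7 for a single-hidden-layer network, sketched in Python/NumPy with sigmoid units; the initialization range, epoch count, and function name are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stochastic_backprop(X, T, n_hidden=3, eta=1.0, epochs=1000, seed=0):
    """Stochastic back-propagation for one hidden layer, following steps 1-7 above."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    # 1. Initialize the weights to small random values
    w = rng.uniform(-0.1, 0.1, size=(n_hidden, n_in))    # input-to-hidden
    W = rng.uniform(-0.1, 0.1, size=(n_out, n_hidden))   # hidden-to-output
    for _ in range(epochs):
        for x, t in zip(X, T):                           # 2. choose a pattern
            V = sigmoid(w @ x)                           # 3. propagate the signal forward
            o = sigmoid(W @ V)
            delta_out = o * (1 - o) * (t - o)            # 4. deltas for the output layer
            delta_hid = V * (1 - V) * (W.T @ delta_out)  # 5. deltas for the preceding layer
            W += eta * np.outer(delta_out, V)            # 6. update all connections
            w += eta * np.outer(delta_hid, x)
    return w, W                                          # 7. loop repeats over the patterns
```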
Example
- Initial weights (all 0.1):
  w_1 = {w_11=0.1, w_12=0.1, w_13=0.1, w_14=0.1, w_15=0.1}
  w_2 = {w_21=0.1, w_22=0.1, w_23=0.1, w_24=0.1, w_25=0.1}
  w_3 = {w_31=0.1, w_32=0.1, w_33=0.1, w_34=0.1, w_35=0.1}
  W_1 = {W_11=0.1, W_12=0.1, W_13=0.1}
  W_2 = {W_21=0.1, W_22=0.1, W_23=0.1}
- Training patterns:
  x^1 = {1,1,0,0,0}, t^1 = {1,0}
  x^2 = {0,0,0,1,1}, t^2 = {0,1}
- Activation: f(x) = σ(x) = 1 / (1 + e^{−x}),   f'(x) = σ'(x) = σ(x) (1 − σ(x))
- Hidden layer, pattern x^1:
  net_1^1 = Σ_{k=1}^{5} w_1k x_k^1 = 1·0.1 + 1·0.1 + 0·0.1 + 0·0.1 + 0·0.1 = 0.2,   V_1^1 = f(net_1^1) = 1/(1 + exp(−0.2)) = 0.54983
  net_2^1 = Σ_{k=1}^{5} w_2k x_k^1 = 0.2,   V_2^1 = f(net_2^1) = 0.54983
  net_3^1 = Σ_{k=1}^{5} w_3k x_k^1 = 0.2,   V_3^1 = f(net_3^1) = 0.54983
- Output layer, pattern x^1:
  net_1^1 = Σ_{j=1}^{3} W_1j V_j^1 = 0.54983·0.1 + 0.54983·0.1 + 0.54983·0.1 = 0.16495,   o_1^1 = f(net_1^1) = 1/(1 + exp(−0.16495)) = 0.54114
  net_2^1 = Σ_{j=1}^{3} W_2j V_j^1 = 0.16495,   o_2^1 = f(net_2^1) = 0.54114
- From ΔW_ij = η Σ_{d=1}^{m} (t_i^d − o_i^d) f'(net_i^d) V_j^d, we will use stochastic gradient descent with η = 1:
  ΔW_ij = (t_i − o_i) f'(net_i) V_j
  With f'(x) = σ'(x) = σ(x)(1 − σ(x)):
  ΔW_ij = (t_i − o_i) σ(net_i) (1 − σ(net_i)) V_j = δ_i V_j,   δ_i = (t_i − o_i) σ(net_i) (1 − σ(net_i))
- Output-layer deltas for pattern x^1:
  δ_1 = (t_1 − o_1) σ(net_1) (1 − σ(net_1)) = (1 − 0.54114) · (1/(1+exp(−0.16495))) · (1 − 1/(1+exp(−0.16495))) = 0.11394,   ΔW_1j = δ_1 V_j
  δ_2 = (t_2 − o_2) σ(net_2) (1 − σ(net_2)) = (0 − 0.54114) · (1/(1+exp(−0.16495))) · (1 − 1/(1+exp(−0.16495))) = −0.13437,   ΔW_2j = δ_2 V_j
- Input-to-hidden updates:
  Δw_jk = Σ_{i=1}^{2} δ_i W_ij f'(net_j) x_k = Σ_{i=1}^{2} δ_i W_ij σ(net_j) (1 − σ(net_j)) x_k
  With δ_j = σ(net_j) (1 − σ(net_j)) Σ_{i=1}^{2} W_ij δ_i:   Δw_jk = δ_j x_k
- Hidden-layer deltas for pattern x^1:
  δ_1 = σ(net_1) (1 − σ(net_1)) Σ_{i=1}^{2} W_i1 δ_i = (1/(1+exp(−0.2))) · (1 − 1/(1+exp(−0.2))) · (0.1·0.11394 + 0.1·(−0.13437)) = −5.0568e-04
  δ_2 = σ(net_2) (1 − σ(net_2)) Σ_{i=1}^{2} W_i2 δ_i = −5.0568e-04
  δ_3 = σ(net_3) (1 − σ(net_3)) Σ_{i=1}^{2} W_i3 δ_i = −5.0568e-04
- First adaptation, for x^1 (one epoch is an adaptation over all training patterns, in our case x^1 and x^2):
  Δw_jk = δ_j x_k,   ΔW_ij = δ_i V_j
  Hidden deltas: δ_1 = δ_2 = δ_3 = −5.0568e-04; output deltas: δ_1 = 0.11394, δ_2 = −0.13437
  Inputs: x_1 = 1, x_2 = 1, x_3 = 0, x_4 = 0, x_5 = 0; hidden outputs: V_1 = V_2 = V_3 = 0.54983
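The hand-computed numbers above can be reproduced with a few lines of NumPy (a check, assuming the same all-0.1 initial weights):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w = np.full((3, 5), 0.1)                      # input-to-hidden weights, all 0.1
W = np.full((2, 3), 0.1)                      # hidden-to-output weights, all 0.1
x1 = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
t1 = np.array([1.0, 0.0])

V = sigmoid(w @ x1)                           # [0.54983 0.54983 0.54983]
o = sigmoid(W @ V)                            # [0.54114 0.54114]
delta_out = o * (1 - o) * (t1 - o)            # [ 0.11394 -0.13437]
delta_hid = V * (1 - V) * (W.T @ delta_out)   # [-5.0568e-04 -5.0568e-04 -5.0568e-04]
print(V, o, delta_out, delta_hid)
```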
More on Back-Propagation
- Gradient descent over the entire network weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global, error minimum
- In practice, it often works well (can run multiple times)
- Gradient descent can be very slow if η is too small, and can oscillate widely if η is too large
- Often include a weight momentum α:
  Δw_pq(t+1) = −η ∂E/∂w_pq + α Δw_pq(t)
- The momentum parameter α is chosen between 0 and 1; 0.9 is a good value
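A sketch of the momentum update; `velocity` holds the previous weight change Δw(t), α defaults to the slide's suggested 0.9, and η and the example gradient values are illustrative.

```python
import numpy as np

def momentum_step(w, grad_E, velocity, eta=0.05, alpha=0.9):
    """Delta w(t+1) = -eta * dE/dw + alpha * Delta w(t); returns updated weights and velocity."""
    velocity = -eta * grad_E + alpha * velocity
    return w + velocity, velocity

# Example usage (illustrative gradient values)
w = np.zeros(3)
velocity = np.zeros(3)
grad_E = np.array([0.2, -0.1, 0.05])
w, velocity = momentum_step(w, grad_E, velocity)
print(w, velocity)
```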
- Minimizes error over the training examples; will it generalize well to unseen examples?
- Training can take thousands of iterations, so it is slow!
- Using the network after training is very fast
Convergence of Back-propagation
- Gradient descent converges to some local minimum, perhaps not the global minimum...
  - Add momentum
  - Use stochastic gradient descent
  - Train multiple nets with different initial weights
- Nature of convergence
  - Initialize weights near zero
  - Therefore, initial networks are near-linear
  - Increasingly non-linear functions become possible as training progresses
Expressive Capabilities of ANNs
- Boolean functions: every Boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs
- Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
- NETtalk [Sejnowski et al. 1987]
Prediction
- Perceptron
- Gradient Descent
- Multi-layered neural network
- Back-Propagation
- More on Back-Propagation
- Examples
RBF Networks, Support Vector Machines