Pattern Recognition and Machine Learning. Artificial Neural Networks


Pattern Recognition and Machine Learning
James L. Crowley

ENSIMAG 3 - MMIS, Fall Semester 2017
Lesson 7, 20 Dec 2017

Outline: Artificial Neural Networks
  Notation
  Introduction
    Key Equations
    Artificial Neural Networks
  The Artificial Neuron
  The Neural Network Model
  Backpropagation
  Summary of Backpropagation

Using notation and figures from the Stanford Deep Learning tutorial at: http://ufldl.stanford.edu/tutorial/

Notation

$x_d$ : a feature; an observed or measured value.
$\vec{X}$ : a vector of D features.
$D$ : the number of dimensions for the vector $\vec{X}$.
$\{\vec{X}_m\}, \{y_m\}$ : training samples for learning.
$M$ : the number of training samples.
$a_j^{(l)}$ : the activation output of the j-th neuron of the l-th layer.
$w_{ij}^{(l)}$ : the weight from unit i of layer l-1 to unit j of layer l.
$b_j^{(l)}$ : the bias for unit j of layer l.
$\eta$ : a learning rate. Typically very small (0.01). Can be variable.
$\delta_m^{(L)} = a^{(L)} - y_m$ : error for the m-th training sample.
$\delta_{j,m}^{(l)}$ : error for the j-th neuron of layer l, for the m-th training sample.
$\Delta w_{ij,m}^{(l)} = a_{i,m}^{(l-1)}\,\delta_{j,m}^{(l)}$ : update for the weight from unit i of layer l-1 to unit j of layer l.
$\Delta b_{j,m}^{(l)} = \delta_{j,m}^{(l)}$ : update for the bias for unit j of layer l.

Introduction

Key Equations

Feed-forward from layer i to j:
$$a_j^{(l)} = f\left(\sum_{i=1}^{N^{(l-1)}} w_{ij}^{(l)}\,a_i^{(l-1)} + b_j^{(l)}\right)$$

Feed-forward from layer j to k:
$$a_k^{(l+1)} = f\left(\sum_{j=1}^{N^{(l)}} w_{jk}^{(l+1)}\,a_j^{(l)} + b_k^{(l+1)}\right)$$

Back-propagation from layer j to i:
$$\delta_{i,m}^{(l-1)} = \frac{\partial f(z_i^{(l-1)})}{\partial z_i^{(l-1)}} \sum_{j=1}^{N^{(l)}} w_{ij}^{(l)}\,\delta_{j,m}^{(l)}$$

Back-propagation from layer k to j:
$$\delta_{j,m}^{(l)} = \frac{\partial f(z_j^{(l)})}{\partial z_j^{(l)}} \sum_{k=1}^{N^{(l+1)}} w_{jk}^{(l+1)}\,\delta_{k,m}^{(l+1)}$$

Weight and bias corrections for layer j:
$$\Delta w_{ij,m}^{(l)} = a_{i,m}^{(l-1)}\,\delta_{j,m}^{(l)}, \qquad \Delta b_{j,m}^{(l)} = \delta_{j,m}^{(l)}$$

Network update formulas:
$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta\,\Delta w_{ij,m}^{(l)}, \qquad b_j^{(l)} \leftarrow b_j^{(l)} - \eta\,\Delta b_{j,m}^{(l)}$$

Artificial Neural Networks

Artificial Neural Networks, also referred to as Multi-layer Perceptrons, are computational structures composed of weighted sums of neural units. Each neural unit is composed of a weighted sum of input units, followed by a non-linear decision function. Note that the term "neural" is misleading: the computational mechanism of a neural network is only loosely inspired by neural biology. Neural networks do NOT implement the same learning and recognition algorithms as biological systems.

The approach was first proposed by Warren McCulloch and Walter Pitts in 1943 as a possible universal computational model. During the 1950s, Frank Rosenblatt developed the idea to provide a trainable machine for pattern recognition, called a Perceptron. The perceptron is an incremental learning algorithm for linear classifiers. The first Perceptron, constructed in 1956, was a room-sized analog computer that learned recognition functions. However, both the learning algorithm and the resulting recognition algorithm are easily implemented as computer programs, and later perceptrons were implemented as programs.

In 1969, Marvin Minsky and Seymour Papert of MIT published a book entitled "Perceptrons" that claimed to document the fundamental limitations of the perceptron approach. Notably, they demonstrated that a one-level perceptron could not be constructed to perform an exclusive OR.

In the 1970s, frustration with the limits of Artificial Intelligence research based on symbolic logic led a small community of researchers to explore the perceptron-based approach. In 1973, Stephen Grossberg showed that a two-layered perceptron could overcome the problems raised by Minsky and Papert and solve many problems that plagued symbolic AI. In 1975, Paul Werbos developed an algorithm referred to as Back-Propagation that uses gradient descent to learn the parameters for perceptrons from classification errors with training data.

During the 1980s, Neural Networks went through a period of popularity, with researchers showing that networks could be trained to provide simple solutions to problems such as recognizing handwritten characters, recognizing spoken words, and steering a car on a highway. However, these results were overtaken by more mathematically sound approaches for statistical pattern recognition, such as support vector machines and boosted learning. In 1998, Yann LeCun showed that convolutional networks composed from many layers could outperform other approaches for recognition problems. Unfortunately, such networks required extremely large amounts of data and computation. Around 2010, with the emergence of cloud computing combined with planetary-scale data, training and using convolutional networks became practical. Since 2012, Deep Networks have outperformed other approaches for recognition tasks common to Computer Vision, Speech and Robotics. A rapidly growing research community currently seeks to extend the applications beyond recognition to the generation of speech and robot actions. Notably, just about any algorithm can be used to train a network, often yielding a solution that executes faster.

The Artificial Neuron

The simplest possible neural network is composed of a single neuron. A neuron is a computational unit that integrates information from a vector of features, $\vec{X}$, to compute the likelihood of a hypothesis:
$$a = h_{w,b}(\vec{X})$$

The neuron is composed of a weighted sum of input values
$$z = w_1 x_1 + w_2 x_2 + \dots + w_D x_D + b$$
followed by a non-linear activation function $f(z)$, sometimes written $\sigma(z)$:
$$a = h_{w,b}(\vec{X}) = f(\vec{w}^T \vec{X} + b)$$

Many different activation functions may be used. A popular choice is the sigmoid:
$$f(z) = \frac{1}{1 + e^{-z}}$$

This function is useful because its derivative is:
$$\frac{df(z)}{dz} = f(z)\,(1 - f(z))$$

This gives a decision function: if $h_{w,b}(\vec{X}) > 0.5$ then POSITIVE else NEGATIVE.

Other popular decision functions include the hyperbolic tangent and the softmax.
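As a concrete illustration (not part of the original notes), the following minimal Python/NumPy sketch computes a single sigmoid neuron and its decision; the feature values, weights and bias are made up.

```python
import numpy as np

def sigmoid(z):
    # f(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def neuron(X, w, b):
    # Weighted sum of the inputs followed by the non-linear activation.
    z = np.dot(w, X) + b
    return sigmoid(z)

# Hypothetical example: D = 3 features, arbitrary weights and bias.
X = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.2])
b = 0.05

a = neuron(X, w, b)
print(a, "POSITIVE" if a > 0.5 else "NEGATIVE")
```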

The hyperbolic tangent,
$$f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}},$$
is a rescaled form of the sigmoid, ranging over [-1, 1].

We can also use the step function:
$$f(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}$$

or the sgn function:
$$f(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$$

[Figure: plot of the sigmoid (red), the hyperbolic tangent (blue) and the step function (green).]

The softmax function is often used for multi-class networks. For K classes:
$$f(z_k) = \frac{e^{z_k}}{\sum_{k=1}^{K} e^{z_k}}$$

The rectified linear unit (ReLU) is popular for deep learning because of its trivial derivative:
$$\text{ReLU:} \quad f(z) = \max(0, z)$$

While the derivative of ReLU is discontinuous at z = 0, for z > 0:
$$\frac{df(z)}{dz} = 1$$

Note that the choice of decision function will determine the target variable y for supervised learning.
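For reference, here is a small sketch (not from the notes) of these activation functions and the derivatives quoted above, written with NumPy; the test values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # f(z)(1 - f(z))

def step(z):
    return np.where(z >= 0, 1.0, 0.0)

def sgn(z):
    return np.where(z >= 0, 1.0, -1.0)

def softmax(z):
    e = np.exp(z - np.max(z))       # subtracting the max improves numerical stability
    return e / e.sum()

def relu(z):
    return np.maximum(0.0, z)

def relu_deriv(z):
    return np.where(z > 0, 1.0, 0.0)

z = np.array([-2.0, 0.0, 1.5])
print(sigmoid(z), np.tanh(z), step(z), softmax(z), relu(z))
```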

The Neural Network Model

A neural network is a multi-layer assembly of neurons. For example, this is a 2-layer network:

[Figure: a 2-layer network with three inputs, a hidden layer of three units, and a single output unit.]

The circles labeled +1 are the bias terms. The circles on the left are the input terms. Some authors, notably in the Stanford tutorials, refer to this as Level 1. We will NOT refer to this as a level, or, if necessary, call it level L = 0. The rightmost circle is the output layer, also called L. The circles in the middle are referred to as a hidden layer. In this example there is a single hidden layer, and the total number of layers is L = 2. The parameters carry a superscript referring to their layer.

We will use the following notation:

L : the number of layers (layers of non-linear activations).
l : the layer index; l ranges from 0 (input layer) to L (output layer).
$N^{(l)}$ : the number of units in layer l; $N^{(0)} = D$.
$a_j^{(l)}$ : the activation output of the j-th neuron of the l-th layer.
$w_{ij}^{(l)}$ : the weight from unit i of layer l-1 to unit j of layer l.
$b_j^{(l)}$ : the bias term for the j-th unit of the l-th layer.
$f(z)$ : a non-linear activation function, such as a sigmoid, tanh, or softmax.

For example, $a_1^{(2)}$ is the activation output of the first neuron of the second layer, and $w_{13}^{(2)}$ is the weight from neuron 1 of the first layer to neuron 3 of the second layer.

The above network would be described by:
$$a_1^{(1)} = f(w_{11}^{(1)} X_1 + w_{21}^{(1)} X_2 + w_{31}^{(1)} X_3 + b_1^{(1)})$$
$$a_2^{(1)} = f(w_{12}^{(1)} X_1 + w_{22}^{(1)} X_2 + w_{32}^{(1)} X_3 + b_2^{(1)})$$
$$a_3^{(1)} = f(w_{13}^{(1)} X_1 + w_{23}^{(1)} X_2 + w_{33}^{(1)} X_3 + b_3^{(1)})$$
$$h_{w,b}(\vec{X}) = a_1^{(2)} = f(w_{11}^{(2)} a_1^{(1)} + w_{21}^{(2)} a_2^{(1)} + w_{31}^{(2)} a_3^{(1)} + b_1^{(2)})$$
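To make the four equations above concrete, here is an illustrative sketch (not from the notes) of the forward pass for exactly this 3-input, 3-hidden-unit, 1-output network; the weight values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: W1[i, j] is w_ij^(1), the weight from input i to hidden unit j;
# W2[j, 0] is w_j1^(2), the weight from hidden unit j to the single output unit.
W1 = np.array([[ 0.2, -0.1,  0.4],
               [ 0.7,  0.3, -0.5],
               [-0.6,  0.1,  0.2]])
b1 = np.array([0.1, -0.2, 0.05])
W2 = np.array([[0.3], [-0.4], [0.6]])
b2 = np.array([0.1])

X = np.array([1.0, 0.5, -1.5])      # one input vector with D = 3 features

a1 = sigmoid(X @ W1 + b1)           # a_j^(1) = f(sum_i w_ij^(1) X_i + b_j^(1))
h  = sigmoid(a1 @ W2 + b2)          # h_{w,b}(X) = a_1^(2)
print(h)
```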

This can be generalized to multiple layers, where $\vec{h}(\vec{X})$ is the vector of network outputs, one for each class. Each unit is defined as follows. The notation for a multi-layer network is:

$\vec{a}^{(0)} = \vec{X}$ is the input layer, with $a_i^{(0)} = x_d$.
l is the current layer under discussion.
$N^{(l)}$ is the number of activation units in layer l, with $N^{(0)} = D$.
i, j, k are unit indices for layers l-1, l and l+1, respectively.
$w_{ij}^{(l)}$ is the weight for unit i of layer l-1 feeding to unit j of layer l.
$a_j^{(l)}$ is the activation output of the j-th unit of layer l.
$b_j^{(l)}$ is the bias term feeding to unit j of layer l.
$z_j^{(l)} = \sum_{i=1}^{N^{(l-1)}} w_{ij}^{(l)}\,a_i^{(l-1)} + b_j^{(l)}$ is the weighted input to the j-th unit of layer l.
$f(z)$ is a non-linear decision function, such as a sigmoid, tanh, or softmax.
$a_j^{(l)} = f(z_j^{(l)})$ is the activation output of the j-th unit of layer l.

For layer l this gives:
$$z_j^{(l)} = \sum_{i=1}^{N^{(l-1)}} w_{ij}^{(l)}\,a_i^{(l-1)} + b_j^{(l)}, \qquad a_j^{(l)} = f\left(\sum_{i=1}^{N^{(l-1)}} w_{ij}^{(l)}\,a_i^{(l-1)} + b_j^{(l)}\right)$$
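A direct, unit-by-unit transcription of these two formulas (an illustrative sketch, not from the notes) can be written as a pair of nested loops in plain Python:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(a_prev, w, b):
    """Compute a_j = f( sum_i w[i][j] * a_prev[i] + b[j] ) for each unit j of layer l.

    a_prev : activations of layer l-1 (length N^(l-1))
    w      : nested list where w[i][j] is the weight from unit i of layer l-1 to unit j of layer l
    b      : biases of layer l (length N^(l))
    """
    a = []
    for j in range(len(b)):
        z_j = b[j]
        for i in range(len(a_prev)):
            z_j += w[i][j] * a_prev[i]
        a.append(sigmoid(z_j))
    return a

# Hypothetical example: 3 inputs feeding a layer of 2 units.
a0 = [1.0, 0.5, -1.0]
w  = [[0.2, -0.3], [0.4, 0.1], [-0.5, 0.25]]
b  = [0.0, 0.1]
print(layer_forward(a0, w, b))
```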

Similarly, for layer l+1:
$$z_k^{(l+1)} = \sum_{j=1}^{N^{(l)}} w_{jk}^{(l+1)}\,a_j^{(l)} + b_k^{(l+1)}, \qquad a_k^{(l+1)} = f\left(\sum_{j=1}^{N^{(l)}} w_{jk}^{(l+1)}\,a_j^{(l)} + b_k^{(l+1)}\right)$$

It can be more convenient to represent this using vectors:
$$\vec{z}^{(l)} = \begin{pmatrix} z_1^{(l)} \\ z_2^{(l)} \\ \vdots \\ z_{N^{(l)}}^{(l)} \end{pmatrix}, \qquad \vec{a}^{(l)} = \begin{pmatrix} a_1^{(l)} \\ a_2^{(l)} \\ \vdots \\ a_{N^{(l)}}^{(l)} \end{pmatrix}$$

and to write the weights and bias at each level l as an $N^{(l)} \times N^{(l-1)}$ matrix and an $N^{(l)}$ vector:
$$W^{(l)} = \begin{pmatrix} w_{11}^{(l)} & \cdots & w_{1i}^{(l)} & \cdots & w_{1N^{(l-1)}}^{(l)} \\ \vdots & & \vdots & & \vdots \\ w_{j1}^{(l)} & \cdots & w_{ji}^{(l)} & \cdots & w_{jN^{(l-1)}}^{(l)} \\ \vdots & & \vdots & & \vdots \\ w_{N^{(l)}1}^{(l)} & \cdots & w_{N^{(l)}i}^{(l)} & \cdots & w_{N^{(l)}N^{(l-1)}}^{(l)} \end{pmatrix}, \qquad \vec{b}^{(l)} = \begin{pmatrix} b_1^{(l)} \\ \vdots \\ b_j^{(l)} \\ \vdots \\ b_{N^{(l)}}^{(l)} \end{pmatrix}$$

Note: to respect matrix notation, we have reversed the order of i and j in the subscripts.

We can see that the weights form a 3rd-order tensor (a vector of matrices), with one matrix for each level, and the biases form a matrix (a vector of vectors), with one vector for each level. Then:
$$\vec{z}^{(l)} = W^{(l)}\vec{a}^{(l-1)} + \vec{b}^{(l)} \qquad \text{and} \qquad \vec{a}^{(l)} = f(\vec{z}^{(l)}) = f(W^{(l)}\vec{a}^{(l-1)} + \vec{b}^{(l)})$$

We can assemble the set of matrices $W^{(l)}$ into a 3rd-order tensor (a vector of matrices) $\mathbf{W}$, and represent the $\vec{a}^{(l)}$, $\vec{z}^{(l)}$ and $\vec{b}^{(l)}$ as matrices (vectors of vectors): A, Z, B.

So how do we learn the weights W and biases B? We could train a 2-class detector from a labeled training set $\{\vec{X}_m\}, \{y_m\}$ using gradient descent. For more than two layers, we will need to use the more general backpropagation algorithm.
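Before turning to learning, here is an illustrative vectorized sketch (not from the notes) of the matrix form $\vec{a}^{(l)} = f(W^{(l)}\vec{a}^{(l-1)} + \vec{b}^{(l)})$, propagating one input through an arbitrary stack of layers; the parameter values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W, b):
    """Forward propagation through all layers.

    X : input vector a^(0), shape (D,)
    W : list [W^(1), ..., W^(L)]; W^(l) has shape (N^(l), N^(l-1))
    b : list [b^(1), ..., b^(L)]; b^(l) has shape (N^(l),)
    Returns the list of activations [a^(0), a^(1), ..., a^(L)].
    """
    a = [X]
    for W_l, b_l in zip(W, b):
        a.append(sigmoid(W_l @ a[-1] + b_l))   # a^(l) = f(W^(l) a^(l-1) + b^(l))
    return a

# Hypothetical 3-4-2 network with random parameters.
rng = np.random.default_rng(0)
W = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
b = [np.zeros(4), np.zeros(2)]
print(forward(np.array([1.0, -0.5, 2.0]), W, b)[-1])
```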

Backpropagation

Back-propagation adjusts the network weights $w_{ij}^{(l)}$ and biases $b_j^{(l)}$ so as to minimize an error function between the network output $h(\vec{X}_m; W, B) = a^{(L)}$ and the target value $y_m$ for the M training samples $\{\vec{X}_m\}, \{y_m\}$. This is an iterative algorithm that propagates an error term back through the hidden layers and computes a correction for the weights at each layer so as to minimize the error term.

This raises two questions:
1) How do we initialize the weights?
2) How do we compute the error term for hidden layers?

1) How do we initialize the weights?

A natural answer is to initialize the weights to 0. Experience shows that this causes problems: if the parameters all start with identical values, then the algorithm can end up learning the same value for all parameters. To avoid this, we initialize the parameters with a small random value near 0, for example drawn from a normal density with variance ε (typically 0.01):
$$\forall i,j,l:\; w_{ij}^{(l)} = \mathcal{N}(X; 0, \varepsilon) \qquad \text{and} \qquad \forall j,l:\; b_j^{(l)} = \mathcal{N}(X; 0, \varepsilon)$$
where $\mathcal{N}$ is a sample from a normal density. An even better solution is provided by Xavier Glorot's technique (see the course web site on Xavier normalization). However, that solution is too complex for today's lecture.

2) How do we compute the error term?

Back-propagation propagates the error term back through the layers, using the weights. We will present this for individual training samples. The algorithm can easily be generalized to learning from sets of training samples (batch mode).

Given a training sample $\vec{X}_m$, we first propagate $\vec{X}_m$ through the L layers of the network (forward propagation) to obtain a hypothesis $h(\vec{X}_m; W, B) = a^{(L)}$.
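A minimal sketch of this random initialization (illustrative only; the layer sizes, seed and helper name are made up) might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_network(layer_sizes, epsilon=0.01):
    """Initialize weights and biases with small zero-mean normal values.

    layer_sizes : [N^(0), N^(1), ..., N^(L)], where N^(0) = D is the input dimension.
    Returns lists W and b where W[l-1] is W^(l) of shape (N^(l), N^(l-1)) and b[l-1] is b^(l).
    """
    W, b = [], []
    for l in range(1, len(layer_sizes)):
        # w_ij^(l) and b_j^(l) drawn from a normal density with variance epsilon
        W.append(rng.normal(0.0, np.sqrt(epsilon), (layer_sizes[l], layer_sizes[l - 1])))
        b.append(rng.normal(0.0, np.sqrt(epsilon), layer_sizes[l]))
    return W, b

# Hypothetical network: D = 3 inputs, one hidden layer of 4 units, 1 output.
W, b = init_network([3, 4, 1])
print([w.shape for w in W], [v.shape for v in b])
```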

We then compute an error term. In the case of a multi-class network, this is a vector with K components, one output for each hypothesis. In this case the indicator vector would have one component for each possible class:
$$\vec{\delta}_m^{(L)} = \vec{a}^{(L)} - \vec{y}_m \qquad \text{or, for each class } k: \quad \delta_{k,m}^{(L)} = a_{k,m}^{(L)} - y_{k,m}$$

This error term tells how much the unit was responsible for differences between the activation of the network, $h(\vec{X}_m; w_k, b_k)$, and the target value $\vec{y}_m$.

To keep things simple, let us consider the case of a two-class network, so that $\delta_m^{(L)}$, $h(\vec{X}_m)$, $a^{(L)}$ and $y_m$ are scalars. The results are easily generalized to vectors for multi-class networks.

At the output layer, the error for each training sample is:
$$\delta_m^{(L)} = a^{(L)} - y_m$$

For the hidden units in layers l < L, the error $\delta_j^{(l)}$ is based on a weighted average of the error terms $\delta_k^{(l+1)}$. We compute error terms $\delta_{j,m}^{(l)}$ for each unit in layer l back to layer l-1 using the sum of errors times the corresponding weights times the derivative of the activation function:
$$\delta_{j,m}^{(l)} = \frac{\partial f(z_j^{(l)})}{\partial z_j^{(l)}} \sum_{k=1}^{N^{(l+1)}} w_{jk}^{(l+1)}\,\delta_{k,m}^{(l+1)}, \qquad \delta_{i,m}^{(l-1)} = \frac{\partial f(z_i^{(l-1)})}{\partial z_i^{(l-1)}} \sum_{j=1}^{N^{(l)}} w_{ij}^{(l)}\,\delta_{j,m}^{(l)}$$

For the sigmoid activation function $f(z) = \frac{1}{1+e^{-z}}$, the derivative is:
$$\frac{df(z)}{dz} = f(z)\,(1 - f(z))$$

For $a_j^{(l)} = f(z_j^{(l)})$ this gives:
$$\delta_{j,m}^{(l)} = a_{j,m}^{(l)}\,(1 - a_{j,m}^{(l)}) \sum_{k=1}^{N^{(l+1)}} w_{jk}^{(l+1)}\,\delta_{k,m}^{(l+1)}$$
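The backward pass can be sketched as follows (illustrative only, not from the notes); it assumes a sigmoid network whose activations and weight matrices are stored in Python lists.

```python
import numpy as np

def backprop_deltas(y, activations, W):
    """Compute the error terms delta^(l) for every layer of a sigmoid network.

    y           : target value(s) for one training sample
    activations : [a^(1), ..., a^(L)] from forward propagation (a^(L) is the network output)
    W           : [W^(1), ..., W^(L)], where W^(l) has shape (N^(l), N^(l-1))
    Returns [delta^(1), ..., delta^(L)].
    """
    L = len(W)
    deltas = [None] * L
    deltas[L - 1] = activations[L - 1] - y                     # delta^(L) = a^(L) - y
    for l in range(L - 2, -1, -1):
        a = activations[l]
        # delta_j^(l) = a_j (1 - a_j) * sum_k w_jk^(l+1) delta_k^(l+1)
        deltas[l] = a * (1.0 - a) * (W[l + 1].T @ deltas[l + 1])
    return deltas
```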

This error term can then be used to correct the weights and bias terms leading from unit i of layer l-1 to unit j of layer l:
$$\Delta w_{ij,m}^{(l)} = a_{i,m}^{(l-1)}\,\delta_{j,m}^{(l)}, \qquad \Delta b_{j,m}^{(l)} = \delta_{j,m}^{(l)}$$

Note that the corrections $\Delta w_{ij,m}^{(l)}$ and $\Delta b_{j,m}^{(l)}$ are NOT applied until after the error has propagated all the way back to layer l = 1, and that when l = 1, $a_i^{(0)} = x_i$.

For batch learning, the correction terms $\Delta w_{ij,m}^{(l)}$ and $\Delta b_{j,m}^{(l)}$ are averaged over the M samples of the training data, and only the average correction is applied to the weights:
$$\Delta w_{ij}^{(l)} = \frac{1}{M}\sum_{m=1}^{M} \Delta w_{ij,m}^{(l)}, \qquad \Delta b_j^{(l)} = \frac{1}{M}\sum_{m=1}^{M} \Delta b_{j,m}^{(l)}$$
then
$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta\,\Delta w_{ij}^{(l)}, \qquad b_j^{(l)} \leftarrow b_j^{(l)} - \eta\,\Delta b_j^{(l)}$$
where $\eta$ is the learning rate.

Back-propagation is equivalent to computing the gradient of the loss function for each layer of the network. A common problem with gradient descent is that the loss function can have local minima. This problem can be reduced by regularization. A popular regularization technique for back-propagation is to use momentum:
$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta\,\Delta w_{ij}^{(l)} + \mu\,w_{ij}^{(l)}, \qquad b_j^{(l)} \leftarrow b_j^{(l)} - \eta\,\Delta b_j^{(l)} + \mu\,b_j^{(l)}$$
where the terms $\mu\,w_{ij}^{(l)}$ and $\mu\,b_j^{(l)}$ serve to stabilize the estimation.

The back-propagation algorithm may be continued until all training data has been used. For batch training, the algorithm may be repeated until all error terms $\delta_{j,m}^{(l)}$ are less than a threshold.
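The batch update can be sketched as below (illustrative, not from the notes). Note that the momentum term here uses the common "fraction of the previous update" formulation rather than the $\mu\,w_{ij}^{(l)}$ term written above; all parameter names are made up.

```python
import numpy as np

def batch_update(W, b, dW_samples, db_samples, vW, vb, eta=0.01, mu=0.9):
    """Average per-sample corrections and apply one gradient step with momentum.

    W, b        : lists of weight matrices and bias vectors, one entry per layer
    dW_samples  : per-sample corrections; dW_samples[m][l] is Delta w^(l) for sample m
    db_samples  : per-sample corrections; db_samples[m][l] is Delta b^(l) for sample m
    vW, vb      : previous updates (the momentum "velocity"), same shapes as W and b
    """
    M = len(dW_samples)
    for l in range(len(W)):
        dW = sum(s[l] for s in dW_samples) / M     # (1/M) sum_m Delta w_{ij,m}^(l)
        db = sum(s[l] for s in db_samples) / M
        vW[l] = -eta * dW + mu * vW[l]             # keep a fraction of the previous step
        vb[l] = -eta * db + mu * vb[l]
        W[l] += vW[l]
        b[l] += vb[l]
    return W, b, vW, vb
```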

Summary of Backpropagation

The back-propagation algorithm can be summarized as:

1) Initialize the network and a set of correction vectors:
$$\forall i,j,l:\; w_{ij}^{(l)} = \mathcal{N}(X; 0, \varepsilon) \qquad \forall j,l:\; b_j^{(l)} = \mathcal{N}(X; 0, \varepsilon)$$
$$\forall i,j,l:\; \Delta w_{ij}^{(l)} = 0 \qquad \forall j,l:\; \Delta b_j^{(l)} = 0$$
where $\mathcal{N}$ is a sample from a normal density and $\varepsilon$ is a small value.

2) For each training sample $\vec{X}_m$, propagate $\vec{X}_m$ through the network (forward propagation) to obtain a hypothesis $h(\vec{X}_m; W, B)$. Compute the error and propagate it back through the network:

a) Compute the error term:
$$\delta_m^{(L)} = h(\vec{X}_m) - y_m = a^{(L)} - y_m$$

b) Propagate the error back from l = L-1 to l = 1:
$$\delta_{j,m}^{(l)} = \frac{\partial f(z_j^{(l)})}{\partial z_j^{(l)}} \sum_{k=1}^{N^{(l+1)}} w_{jk}^{(l+1)}\,\delta_{k,m}^{(l+1)}, \qquad \delta_{i,m}^{(l-1)} = \frac{\partial f(z_i^{(l-1)})}{\partial z_i^{(l-1)}} \sum_{j=1}^{N^{(l)}} w_{ij}^{(l)}\,\delta_{j,m}^{(l)}$$

c) Use the error to set a vector of correction weights:
$$\Delta w_{ij,m}^{(l)} = a_{i,m}^{(l-1)}\,\delta_{j,m}^{(l)}, \qquad \Delta b_{j,m}^{(l)} = \delta_{j,m}^{(l)}$$

3) For all layers, l = 1 to L, update the weights and biases using a learning rate $\eta$ and momentum $\mu$:
$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta\,\Delta w_{ij,m}^{(l)} + \mu\,w_{ij}^{(l)}, \qquad b_j^{(l)} \leftarrow b_j^{(l)} - \eta\,\Delta b_{j,m}^{(l)} + \mu\,b_j^{(l)}$$

Note that this last step can be done with an average correction matrix obtained from many training samples (batch mode), providing a more efficient algorithm.
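Putting the steps together, a compact end-to-end sketch (illustrative only: it assumes a sigmoid network, plain batch gradient descent without the momentum term, and a made-up XOR data set) could look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train(X, y, layer_sizes, eta=0.5, epsilon=0.01, epochs=5000):
    # 1) Initialize weights and biases with small normal values.
    W = [rng.normal(0, np.sqrt(epsilon), (layer_sizes[l], layer_sizes[l - 1]))
         for l in range(1, len(layer_sizes))]
    b = [rng.normal(0, np.sqrt(epsilon), layer_sizes[l]) for l in range(1, len(layer_sizes))]
    L = len(W)
    for _ in range(epochs):
        dW = [np.zeros_like(w) for w in W]
        db = [np.zeros_like(v) for v in b]
        for Xm, ym in zip(X, y):
            # 2) Forward propagation: a^(l) = f(W^(l) a^(l-1) + b^(l))
            a = [Xm]
            for l in range(L):
                a.append(sigmoid(W[l] @ a[-1] + b[l]))
            # 2a-2c) Output error, back-propagated deltas and corrections
            delta = a[-1] - ym                        # delta^(L) = a^(L) - y
            for l in range(L - 1, -1, -1):
                dW[l] += np.outer(delta, a[l])        # Delta w^(l) = a^(l-1) delta^(l)
                db[l] += delta
                if l > 0:
                    delta = a[l] * (1 - a[l]) * (W[l].T @ delta)
        # 3) Batch update with the averaged corrections
        for l in range(L):
            W[l] -= eta * dW[l] / len(X)
            b[l] -= eta * db[l] / len(X)
    return W, b

# Hypothetical example: learn XOR with one hidden layer of 3 units.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])
W, b = train(X, y, [2, 3, 1])
for Xm in X:
    print(Xm, sigmoid(W[1] @ sigmoid(W[0] @ Xm + b[0]) + b[1]))
```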