Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial Neuron... 4 The Neural Network odel... 6 Network Structures for Siple Feed-Forward Networks... 8 Regression Analysis...9 Linear Models... 10 Estiation of a hyperplane with supervised learning... 11 Gradient Descent... 13 Practical Considerations for Gradient Descent... 14 Regression for a Sigoid Activation Function...16 Gradient Descent... 17 Regularisation... 17 Using notation and figures fro the Stanford Deep learning tutorial at: http://ufldl.stanford.edu/tutorial/

Notation x d X D { X } {y } M (l a i (l w ij b i l A feature. An observed or easured value. A vector of D features. The nuber of diensions for the vector X Training saples for learning. The nuber of training saples. the activation output of the i th neuron of the l th layer. the weight for the unit j of layer l and the unit i of layer l+1. the bias ter for i th using of the the l+1 th layer 7-2

Introduction, also referred to as Multi-layer Perceptrons, are coputational structures coposed a weighted sus of neural units. Each neural unit is coposed of a weighted su of input units, followed by a non-linear decision function. Note that the ter neural is isleading. The coputational echanis of a neural network is only loosely inspired fro neural biology. Neural networks do NOT ipleent the sae learning and recognition algoriths as biological systes. The approach was first proposed by Warren McCullough and Walter Pitts in 1943 as a possible universal coputational odel. During the 1950 s, Frank Rosenblatt developed the idea to provide a trainable achine for pattern recognition, called a Perceptron. The perceptron is an increental learning algorith for linear classifiers invented by Frank Rosenblatt in 1956. The first Perceptron was a roo-sized analog coputer that ipleented Rosenblatz learning recognition functions. Both the learning algorith and the resulting recognition algorith are easily ipleented as coputer progras. In 1969, Marvin Minsky and Seyour Papert of MIT published a book entitled Perceptrons, that claied to docuent the fundaental liitations of the perceptron approach. Notably, they claied that a perceptron could not be constructed to perfor an exclusive OR. In the 1970s, frustrations with the liits of Artificial Intelligence research based on Sybolic Logic led a sall counity of researchers to explore the perceptron based approach. In 1973, Steven Grossberg, showed that a two layered perceptron could overcoe the probles raised by Minsky and Papert, and solve any probles that plagued sybolic AI. In 1975, Paul Werbos developed an algorith referred to as Back-Propagation that uses gradient descent to learn the paraeters for perceptrons fro classification errors with training data. During the 1980 s, Neural Networks went through a period of popularity with researchers showing that Networks could be trained to provide siple solutions to probles such as recognizing handwritten characters, recognizing spoken words, and steering a car on a highway. However, results were overtaken by ore 7-3

atheatically sound approaches for statistical pattern recognition such as support vector achines and boosted learning. In 1998, Yves LeCun showed that convolutional networks coposed fro any layers could outperfor other approaches recognition probles. Unfortunately such networks required extreely large aounts of data and coputation. Around 2010, with the eergence of cloud coputing cobined with planetary-scale data convolutional networks becae practical. Since 2012, Deep Networks have outperfored other approaches for recognition tasks coon to coputer Vision, Speech and robotics. A rapidly growing research counity currently seeks to extend the application beyond recognition to generation of speech and robot actions. The Artificial Neuron The siplest possible neural network is coposed of a single neuron. A neuron is a coputational unit that integrates inforation fro a vector of D features, X, to the likelihood of a hypothesis, h w,b ( a = h w,b ( X The neuron is coposed of a weighted su of input values z = w 1 x 1 + w 2 x 2 +...+ w D x D + b followed by a non-linear activation function, f (z (soeties written (z a = h ( w,b X = f ( w T X + b Many different activation functions are used. 1 A popular choice for hidden layers is the sigoid (logistic function: f (z = 1 e z 7-4

This function is useful because the derivative is: df (z dz = f (z(1 f (z This gives a decision function: if h w,b ( X > 0.5 POSITIVE else NEGATIVE Other popular decision functions include the hyperbolic tangent and the softax. The hyperbolic Tangent: f (z = tanh(z = ez e z e z + e z The hyperbolic tangent is a rescaled for of sigoid ranging over [-1,1] The rectified linear function is also popular Relu: f (z = ax(0,z While Relu is discontinuous at z=0, for z > 0 : df (z dz =1 The following plot (fro A. Ng shows the sigoid, tanh and Relu functions Note that the choice of decision function will deterine the target variable y for supervised learning. 7-5

The Neural Network odel A neural network is a ulti-layer assebly of neurons of the for. For exaple, this is a 2-layer network: The circles labeled +1 are the bias ters. The circles on the left are the input ters, also called L 1. This is called the input layer. Note that any authors do not consider this to count as a layer The rightost circle is the output layer (in this case, only one node, also called L 3 The circles in the iddle are referred to as a hidden layer, L 2. Here we follow the notation used by Andrew Ng. The paraeters carry a superscript, referring to their layer. For exaple: a 1 (2 is the activation output of the first neuron of the second layer. W 13 (2 is the weight for input 1 of activation neuron 3 in the second level. The above network would be described by: a (2 1 = f (w (1 11 X 1 + w (1 12 X 2 + w (1 13 X 3 + b (1 1 a (2 2 = f (w (1 21 X 1 + w (1 22 X 2 + w (1 23 X 3 + b (1 2 a (2 3 = f (w (1 31 X 1 + w (1 32 X 2 + w (1 33 X 3 + b (1 3 h ( w,b X = a (3 1 = f (w (2 11 a (2 1 + w (2 12 a (2 2 + w (2 13 a (2 3 + b (2 1 This can be generalized to ultiple layers. For exaple: 7-6

(taken fro the Stanford ufldl tutorial, Ng et al Note that we can recognize ultiple classes by learning ultiple h w,b (x functions. In general (following the notation of Ng, this can be described by: L 1 is the input layer. L l is the l th layer L N is the output layer (l w ij denotes the paraeter (weight for the unit j of layer l and the unit i of layer l+1. l b i is the bias ter for i th using of the the l+1 th layer (l a i is the activation i in layer l Note that any authors prefer to define the input layer as l=0. This way the l is the nuber of hidden layers. In deriving the regression algoriths for learning, we will use N l z (l+1 i = w (l a (l ij + b (l i j=1 It will be ore convenient to represent this using vectors: (l z 1 % (l z z (l 2 = # & (l z N l As defined, the neural network coputes a forward propagation of the for z l+1 = w (l a (l + b (l a (l1 = f ( z (l+1 7-7

This is called the Feed Forward Network. Note that feed forward networks do not contain loops. The weights for a Neural Network are coonly trained using a technique called back-propagation, or back propagation of errors. Back propagation is a for of regression analysis. The ost coon approach is to eploy a for of Gradient Descent. Gradient descent calculates a loss function for the weights of the network and then iteratively seeks to iniize this loss function. This is coonly perfored as a for of supervised learning using labeled training data. Note that Gradient descent requires that the activation function be differentiable. Network Structures for Siple Feed-Forward Networks The architecture of a neural network refers to the nuber of layers, the nuber of neurons of each layer and the kind of neurons, and the kinds of neural coputation perfored. For a siple feed forward network: The nuber of input neurons is deterined by the nuber of input values for each observation (size of the feature vector, D The nuber of output neurons is the nuber of classes to be recognized. The nuber of hidden layers is deterined by the coplexity of the recognition and the aount of training data available. If your training data is linearly separable (if there exists a hyper-plane between the two classes then you do not even need a hidden layer. Use a siple linear classifiers for exaple defined by a Support Vector achine or Linear Discriinant Analysis. A single hidden layer is often sufficient for any probles (although ore reliable solutions can be found using Support vector achines or other techniques. 7-8

The nuber of neurons in a hidden layer depends on the coplexity of the proble, and the aount of training data available. In theory, any proble can be solved with a single hidden layer, given enough training data. In practice the quantity of training data is not practical. Experience shows that soe very difficult probles can be solved using less training data by adding additional layers. However, the spectacular progress of the last few years has been obtained by the introduction of new coputational odels such as Convolutional Neural Networks, Pooling, Auto-encoders and Long Ter-Short Ter Meory. Many of these are actually known pattern recognition techniques recycled into a Neural Network fraework. But before we get to ore advanced techniques, we need to look at the basics. Regression Analysis The paraeters for a feed forward network are coonly learned using regression analysis of labeled training data. Regression is the estiation of the paraeters for a function that aps a set of independent variables into a dependent variable. y ˆ = f ( X, w Where X is a vector of D independent (unknown variables. y ˆ is an estiate for a variable y that depends on X. and f ( is a function that aps X onto y ˆ w is a vector of paraeters for the odel. Note: For y ˆ, the hat indicates an estiated value for the target value y X is upper case because it is a rando (unknown vector. 7-9

Regression analysis refers to a faily of techniques for odeling and analyzing the apping one or ore independent variables fro a dependent variable. For exaple, consider the following table of age, height and weight for 10 feales: M AGE H (M W (kg 1 17 163 52 2 32 169 68 3 25 158 49 4 55 158 73 5 12 161 71 6 41 172 99 7 32 156 50 8 56 161 82 9 22 154 56 10 16 145 46 We can use any two variables to estiate the third. We can use regression to estiate the paraeters for a function to predict any feature y ˆ fro the two other features X. For exaple we can predict weight fro height and age as a function. y ˆ = f ( X, w Age % where y ˆ = Weight, X = and w are the odel paraeters # Height& Linear Models A linear odel has the for y ˆ = f ( X, w = w T X + b = w 1 x 1 + w 2 x 2 +...+ w D x D + b w 1 % The vector w w 2 = are the paraeters of the odel that relates X to y ˆ. # w D & The equation w T X + b = 0 is a hyper-plane in a D-diensional space, w 1 % w 2 w = is the noral to the hyperplane and b is a constant ter. # & w D 7-10

It is generally convenient to include the constant as part of the paraeter vector and to add an extra constant ter to the observed feature vector. This gives a linear odel with D+1 paraeters where the vectors are: 1 % x 1 X = # & x D and w = # w 0 w 1 w D % & where w 0 represents b. This gives the hoogeneous equation for the odel: y ˆ = f ( X, w = w T X Hoogeneous coordinates provide a unified notation for geoetric operations. With this notation, we can predict weight fro height and age using a function learned fro regression analysis. y ˆ = f ( X, w 1 % w where y ˆ = Weight, X = Age and 0 % W = w 1 are the odel # Height& # w 2 & paraeters, and the surface is a plane in the space (weight, age, height. In a D diensional space, linear hoogeneous equation is a hyper-plane. The perpendicular distance of an arbitrary point fro the plane is coputed as d = w 0 +w 1 x 1 +w 2 x 2 This can be used as an error. Estiation of a hyperplane with supervised learning In supervised learning, we learn the paraeters of a odel fro a labeled set of training data. The training data is coposed of M sets of independent variables, { X } for which we know the value of the dependent variable {y }. The training data is the set { X }, {y } For a linear odel, learning the paraeters of the odel fro a training set is equivalent to estiating the paraeters of a hyperplane using least squares. In the case of a linear odel, there are any ways to estiate the paraeters: 7-11

For exaple, atrix algebra provides a direct, closed for solution. Assue a training set of M observations { X } {y } where the constant d is included as a 0th ter in X and w. 1 % x 1 X = and # & x D w = # w 0 w 1 w D % & We seek the paraeters for a linear odel: y ˆ = f ( X, w = w T X This can be deterined by iniizing a Loss function that can be defined as the Square of the error. L( w M = #( w T X y 2 =1 To build or function, we will use the M training saples to copose a atrix X and a vector Y. 1 1 1 % x 11 x 12 x 1M X = x 21 x 22 x 2 M # # x D1 x D2 x DM & (D+1 rows by M coluns Y = # y 1 y 2 y M % & (M rows. We can factor the loss function to obtain: L( w = ( w T X Y T ( w T X Y To iniize the loss function, we calculate the derivative and solve for w when the derivative is 0. L( w w = 2X T Y # 2X T X w = 0 which gives X T Y = 2X T X w and thus w = (X T X 1 X T Y While this is an elegant solution for linear regression, it does not generalize to other odels. A ore general approach is to use Gradient Descent. 7-12

Gradient Descent Gradient descent is a popular algorith for estiating paraeters for a large variety of odels. Here we will illustrate the approach with estiation of paraeters for a linear odel. As before we seek to estiate that paraeters w for a odel y ˆ = f ( X, w = w T X fro a training set of M saples { X } {y } We will define our loss function as 1 2 average error L( w = 1 M 2M ( f ( X, w # y 2 =1 where we have included the ter 1 2 to siplify the algebra later. The gradient is the derivative of the loss function with respect to each ter w d of w is f ( X, w = #f ( X, w # w #f ( X,w 0 & & #w 0 #f ( X &,w 1 = & #w 1 & # #f ( X &,w D & % #w D ( where: f ( X,w d = 1 M w d M ( f ( X, w # y x d =1 x d is the d th coefficient of the th training vector. Of course x 0 =1 is the constant ter. We use the gradient to correct an estiate of the paraeter vector for each training saple. The correction is weighted by a learning rate α We can see M 1 M ( f ( X, w (i1 (i1 # y x d as the average error for paraeter w d =1 Gradient descent corrects by subtracting the average error weighted by the learning rate. 7-13

Gradient Descent Algorith Initialization: (i=0 Let w d (o = 0 for all D coefficients of w Repeat until L( w (i L( w (i1 < # : w (i = w (i1 # f ( X, w (i1 where L( w = 1 M 2M ( f ( X, w y 2 =1 That is: w (i d = w (i1 d # 1 M M ( f ( X, w (i1 y x d =1 Note that all coefficients are updated in parallel. The algorith halts when the change in L( w (i becoes sall: L( w (i L( w (i1 < # For soe sall constant. Gradient Descent can be used to learn the paraeters for a non-linear odel. For exaple, when D=2, a second order odel would be: 1 % x 1 2 x 1 X = x 2 2 x 1 # x 1 x 2 & and f ( X, w = w 0 +w 1 x 1 + w 2 x 1 2 + w 3 x 2 + w 4 x 2 2 + w 5 x 1 x 2 Practical Considerations for Gradient Descent The following are soe practical issues concerning gradient descent. Feature Scaling Make sure that features have siilar scales (range of values. One way to assure this is to noralize the training date so that each feature has a range of 1. Siple technique: divide by the range of saple values. 7-14

For a training set { X } of M training saples with D values. Range: r D = Max(x d - Min(x d Then M =1 : x d := x d r d Even better would be to scale with the ean and standard deviation of the each feature in the training data µ d = E{x d } 2 = E{(x d # µ d 2 } M =1 : x d := (x # µ d d d Note that the value of the loss function should always decrease: Verify that L( w (i L( w (i1 < 0. if L( w (i L( w (i1 > 0 then decrease the learning rate α You can use this to dynaically adjust the learning rate α. For exaple, one can start with a high learning rate. Any tie that L( w (i L( w (i1 > 0 1 reset a a/2 2 Recalculate the i th iteration. Halt when a < threshold. 7-15

Regression for a Sigoid Activation Function Gradient descent is easily used to estiate a sigoid estiation function. Recall that the sigoid function has the for h( X, w 1 = 1+ e g( X, w This function is differentiable. When g k ( X, w = w T X h( X, w 1 = 1+ e w T X and the decision rule is: if h( X, w > 0.5 then Positive else Negative The cost function for this activation function is: Cost(h( X, w, y = # % Thus the loss function would be: L( w = 1 M Cost(h( X, w, y =1 log(h( X, w if y =1 log(1 h( X, w if y = 1 L( w = 1 M # y log(h( X, w + (1 y log(1 h( X, w =1 h( X,w d = 1 M w d M (y # h( X, w x d =1 7-16

Gradient Descent Repeat until L( w (i L( w (i1 < # : w (i = w (i1 # h( X, w (i1 where L( w = 1 M # y log(h( X, w + (1 y log(1 h( X, w =1 w (i d = w (i1 d # 1 M M (y h( X, w (i1 x d =1 Regularisation Overfitting: Tuning the function to the training data. Overfitting is a coon proble with achine learning, particularly when there are any features and not enough training data. One solution is to regularise the function by adding an additional ter. replace w (i 0 = w (i1 0 # 1 M M (y h( X, w x 0 with =1 w (i d = w (i1 d # 1 M M (y h( X, w (i1 x d + % w d =1 (i1 7-17