Pattern Recognition and Machine Learning: Artificial Neural Networks


Pattern Recognition and Machine Learning
James L. Crowley
ENSIMAG 3 - MMIS, Fall Semester 2016/2017
Lesson 9, 11 January 2017

Outline: Artificial Neural Networks
- Notation
- Convolutional Neural Networks
- Fully Connected Networks
- What Window Size?
- Convolutional Neural Network for an Image
- Pooling
- AutoEncoders
- The Sparsity Parameter
- Kullback-Leibler Divergence
- Examples of the Hidden Units

Using notation and figures from the Stanford Deep Learning tutorial at:

Notation

$x_d$ : A feature. An observed or measured value.
$\vec{X}$ : A vector of D features.
$D$ : The number of dimensions for the vector $\vec{X}$.
$\{\vec{X}_m\}$, $\{y_m\}$ : Training samples for learning.
$M$ : The number of training samples.
$N$ : Spatial extent (size) of the convolutional unit (N x N).
$a_{i,j}^k$ : A feature map of k features at each image position P(i,j).
$D^{(l)}$ : The number of activation units in layer l.
$\hat{\rho}_j = \frac{1}{M}\sum_{m=1}^{M} a_{j,m}^{(2)}$ : Sparsity: the average activation of hidden unit j over the training set.
$\sum_{j=1}^{D^{(2)}} KL(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{D^{(2)}} \left( \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j} \right)$ : Kullback-Leibler divergence.

Convolutional Neural Networks.

Convolutional Neural Networks take inspiration from the Receptive Field model of biological vision systems proposed by Hubel and Wiesel in 1968 to explain the organization of the visual cortex. By probing the visual cortex of cats with electrodes, Hubel and Wiesel discovered that the visual cortex is composed of retinal maps of visual features. Each retinal map is an image of the pattern projected on the retina, expressed as the correlation of a visual feature at a particular size (scale) and orientation. The collection of retinal feature maps serves as an intermediate representation for recognition.

Fully Connected Networks.

A fully connected network is a network in which each unit at level l+1 receives activations from all units at level l. If there are $D^{(l)}$ units at level l and $D^{(l+1)}$ units at level l+1, then a fully connected network requires learning $D^{(l)} \cdot D^{(l+1)}$ parameters. While this may be tractable for small examples, it quickly becomes excessive for practical problems, such as those found in computer vision or speech recognition. For example, a typical image may have $1024 \times 1024 = 2^{20}$ pixels. If we assume, say, $512 \times 512 = 2^{18}$ hidden units, we have $2^{38}$ parameters to learn for a single class of image pattern. Clearly this is not practical (and, in any case, unnecessary).

A common solution is to perform learning using a limited-size window, and to use all possible windows as training data. We have already seen this principle in the sliding-window approach of the Viola-Jones detector. This leads to a technique where we fix a window size of N x N input units and use all possible, overlapping, windows of size N x N from our training data to train the network. We then use the same learned weights with every hidden cell. The resulting operation is equivalent to a convolution of the learned weights with the input signal, as the sketch below illustrates.

This approach is reasonable for image and speech signals because both images and speech signals have two interesting theoretical properties: they are local and stationary.
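To make the equivalence concrete, here is a minimal sketch (not from the original notes; it assumes NumPy and SciPy are available) showing that applying one shared N x N weight window at every image position produces the same result as a correlation, i.e. a convolution with a flipped kernel:

```python
# Shared-weight sliding window == correlation (convolution with flipped kernel).
import numpy as np
from scipy.signal import correlate2d

def sliding_window_response(image, w, b):
    """Apply shared weights w (N x N) and bias b to every N x N window."""
    N = w.shape[0]
    I, J = image.shape
    out = np.zeros((I - N + 1, J - N + 1))
    for i in range(I - N + 1):
        for j in range(J - N + 1):
            out[i, j] = np.sum(w * image[i:i + N, j:j + N]) + b
    return out

image = np.random.rand(16, 16)
w, b = np.random.rand(5, 5), 0.1
r1 = sliding_window_response(image, w, b)
r2 = correlate2d(image, w, mode='valid') + b   # same result, one library call
assert np.allclose(r1, r2)
```

Because the weights are shared across positions, only the N x N kernel (plus one bias) must be learned, rather than one weight per (input unit, hidden unit) pair.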

1. Local. Local means that (most of) the required information can be found within a limited-size neighborhood of the signal. In fact, image information tends to be multi-scale, but this can easily be accommodated using multi-scale signal techniques such as a scale-invariant pyramid. Such a representation is local at multiple scales, with low-resolution scales providing context for higher-resolution scales. This can be referred to as multi-local.

2. Stationary. A stationary signal is a random (unknown) signal whose joint probability density function does not change when shifted in time (speech) or space (images). Image and speech signals tend to have stationary statistics, so the same processing can be applied to every possible (overlapping) window.

There are exceptions to both rules, but these can be handled with established techniques.

What Window Size?

What window size should be used for a feature in a Convolutional Neural Network? This tends to depend on the problem. It is not uncommon to see tutorials propose 5 x 5 image windows. This may be fine for illustration, but it is likely to be far from optimal. The impressive results in category learning were obtained with a 2D image window of 11 x 11. A minimum reasonable size is N=7. For smaller windows, the Fourier transform of the window function (the digital sinc function) dominates the spectrum and hides information. N=9 is a better choice, as the window effects are negligible.

However, there is the problem of scale invariance. In computer vision, patterns can occur at many scales. The solution is to use a scale-invariant pyramid, providing an invariant representation at a sampled set of scales. The window size should accommodate all scale changes between pyramid steps. This is generally between N=11 and N=15. The actual choice depends on the available training data, the computing time for learning, and the scale steps used in the pyramid. For today, consider that N=11 is a reasonable size.
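As an illustration of the pyramid idea, the following minimal sketch (an assumption: NumPy and SciPy are available; this is not the lecture's code) blurs and subsamples an image by a factor of 2 per level, so that a fixed N x N window covers progressively larger regions of the original image:

```python
# Scale pyramid: blur (anti-alias), then subsample by 2 at each level.
import numpy as np
from scipy.ndimage import gaussian_filter

def pyramid(image, levels=4, sigma=1.0):
    """Return a list of images, each half the resolution of the previous."""
    out = [image]
    for _ in range(levels - 1):
        blurred = gaussian_filter(out[-1], sigma)  # smooth before subsampling
        out.append(blurred[::2, ::2])              # keep every second row/column
    return out

img = np.random.rand(64, 64)
for level, p in enumerate(pyramid(img)):
    print(level, p.shape)   # (64, 64), (32, 32), (16, 16), (8, 8)
```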

Convolutional Neural Network for an Image.

A Convolutional Neural Network (CNN) can be used as a feature detector for image analysis. When used with images, a CNN provides K features at each pixel, computed by convolution with K hidden units (or kernels). Each feature is computed as a weighted sum of the pixels within an N x N window R(i,j).

Let us assume an image of J rows and I columns, where each pixel has 3 color values. Each pixel (i,j) is a color vector, $\vec{P}(i,j)$, represented by 3 integers between 0 and 255 representing Red, Green and Blue:

$$\vec{P}(i,j) = \begin{pmatrix} R \\ G \\ B \end{pmatrix}$$

In the literature on CNNs, the colors are referred to as channels, and the number of channels is called the Depth, D. We will refer to the position of a receptive field as its center; as a result, we prefer odd values for N.

The CNN will compute K filters (or kernels) for each window $R_{ij}(u,v)$ of size N x N x D that fits within the image. Indexing each window by its upper-left corner, for each position from $i = N/2,\, j = N/2$ to $i = I - N/2,\, j = J - N/2$:

$$R_{ij}(u,v) = \vec{P}(i+u-1,\; j+v-1)$$

The features can be learned as hidden units using back-propagation, with a training set where each window is labeled with a target class. For example, with the FDDB training set, each possible N x N window $R_{ij}(u,v)$ would be a training vector $\vec{X}$ of size $D^{(1)} = D \cdot N^2$. The target value y would be 1 if the center pixel of the window is in the face ellipse and 0 otherwise.
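A minimal sketch of this window-extraction step in Python with NumPy. The labeling function in_face_ellipse is hypothetical (standing in for the FDDB ellipse annotations) and is not defined here:

```python
# Extract all N x N x 3 training windows R_ij from a color image.
import numpy as np

def extract_windows(P, N, in_face_ellipse):
    """P: image of shape (J, I, 3); yields (window vector, label) pairs."""
    J, I, _ = P.shape
    h = N // 2                                        # half-width; N is odd
    for i in range(h, I - h):
        for j in range(h, J - h):
            R = P[j - h:j + h + 1, i - h:i + h + 1, :]  # N x N x 3 receptive field
            X = R.reshape(-1)                           # D^(1) = 3 * N^2 features
            y = 1 if in_face_ellipse(i, j) else 0       # label from the center pixel
            yield X, y
```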

The result is a feature map of K features at each position (i,j):

$$a_{i,j,k}^{(2)} = f\left(\sum_{u,v} w_k^{(1)}(u,v)\, R_{ij}(u,v) + b_k^{(1)}\right)$$

Note that written as a convolution, the formula would be:

$$a_k(i,j) = f\left(\sum_{u,v} w_k(u,v)\, R(i-u,\, j-v) + b_k\right)$$

Hyperparameters: CNNs are typically configured with a number of hyperparameters:

Depth: The number D of channels for each image pixel.
Stride: The step size, S, between window positions. By default it may be 1, but for larger windows it is possible to define larger step sizes.
Spatial Extent: The size, N x N, of the receptive field.
Zero-Padding: The size of the region at the border of the feature map that is filled with zeros in order to preserve the image size (typically N/2).

Pooling

Pooling is a form of non-linear down-sampling that partitions the image into non-overlapping regions and computes a representative value for each region. Pooling is typically performed over contiguous regions of the image; in this case, the stride equals the pooling window size. The CNN feature image is partitioned into small non-overlapping rectangular regions, typically of size 2x2 or 4x4. Several non-linear functions can be used, including max, average, median, and histograms. Max pooling seems to be the most popular. For example, the SIFT operator in computer vision uses local histograms over a 4x4 window. A sketch of the feature-map computation and max pooling is given below.
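The following minimal Python/NumPy sketch implements the feature-map equation above with a logistic activation, together with non-overlapping max pooling. It is written for clarity, not speed; the loops would normally be replaced by optimized convolution routines:

```python
# Feature maps a[j, i, k] = f( sum_{u,v,c} w[k,u,v,c] * R_ij[u,v,c] + b[k] ),
# followed by non-overlapping max pooling.
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))            # logistic activation

def feature_maps(P, w, b):
    """P: (J, I, D) image; w: (K, N, N, D) kernels; b: (K,) biases."""
    K, N, _, D = w.shape
    J, I, _ = P.shape
    a = np.zeros((J - N + 1, I - N + 1, K))
    for k in range(K):
        for j in range(J - N + 1):
            for i in range(I - N + 1):
                R = P[j:j + N, i:i + N, :]     # one N x N x D receptive field
                a[j, i, k] = f(np.sum(w[k] * R) + b[k])
    return a

def max_pool(a, size=2):
    """Non-overlapping max pooling: the stride equals the pooling window size."""
    J, I, K = a.shape
    J2, I2 = J // size, I // size
    a = a[:J2 * size, :I2 * size, :]           # crop to a multiple of the pool size
    return a.reshape(J2, size, I2, size, K).max(axis=(1, 3))
```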

AutoEncoders

An autoencoder is an unsupervised learning algorithm that uses back-propagation to learn a sparse set of features for describing the training data. Rather than trying to learn a target variable, y, the auto-encoder tries to learn to reconstruct the input $\vec{X}$ using a minimum set of features. The effect is similar to the use of Principal Components Analysis on neighborhoods (as we saw in lecture 4). However, the auto-encoder provides a more appropriate basis set for recognition, while Principal Components Analysis is more appropriate for reconstruction.

Using the notation from our 2-layer network, given an input feature vector $\vec{X}$, the auto-encoder learns weights $\{w_{ji}^{(1)}, b_j^{(1)}\}$ and $\{w_{kj}^{(2)}, b_k^{(2)}\}$ such that $\vec{a} = \hat{X} \approx \vec{X}$ using as few weights as possible. Note that $D = D^{(1)}$. When the number of hidden units $D^{(2)}$ is less than the number of input units $D^{(1)}$, $\vec{a} = \hat{X} \approx \vec{X}$ is necessarily an approximation.

The error for back-propagation is computed for each component $x_{i,m}$ of the training sample $\vec{X}_m$:

$$\delta_{k,m} = a_{k,m} - a_{i,m}^{(1)} = a_{k,m} - x_{i,m}$$
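A minimal sketch of the two-layer auto-encoder forward pass, assuming a logistic activation f and NumPy; the dimensions and initialization are illustrative only:

```python
# Encode X into D2 < D1 hidden activations, then reconstruct it.
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(X, W1, b1, W2, b2):
    """X: (D1,); W1: (D2, D1); W2: (D1, D2). Returns (hidden a2, output a3)."""
    a2 = f(W1 @ X + b1)        # level 2: hidden code
    a3 = f(W2 @ a2 + b2)       # level 3: reconstruction X_hat
    return a2, a3

D1, D2 = 64, 25                # fewer hidden units than inputs
rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.standard_normal((D2, D1)), np.zeros(D2)
W2, b2 = 0.1 * rng.standard_normal((D1, D2)), np.zeros(D1)
X = rng.random(D1)
a2, a3 = autoencoder_forward(X, W1, b1, W2, b2)
delta = a3 - X                 # per-unit error: delta_k = a_k - x_k
```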

The Sparsity Parameter

The auto-encoder learns weights subject to a sparseness constraint specified by a sparsity parameter $\rho$, typically set close to zero. The sparsity parameter $\rho$ is the target average activation for the hidden units. Using the notation from last week's lecture, the auto-encoder is described by:

Level 1: $\vec{X}_m = \begin{pmatrix} x_{1,m} \\ \vdots \\ x_{D^{(1)},m} \end{pmatrix}$

Level 2: $a_{j,m}^{(2)} = f\left(\sum_{i=1}^{D^{(1)}} w_{ji}^{(1)} x_{i,m} + b_j^{(1)}\right)$

Level 3: $a_{k,m} = f\left(\sum_{j=1}^{D^{(2)}} w_{kj}^{(2)} a_{j,m}^{(2)} + b_k^{(2)}\right)$

Desired output: $\vec{a} = \begin{pmatrix} a_1 \\ \vdots \\ a_D \end{pmatrix} = \hat{X} \approx \vec{X}$, with error $\delta_{k,m} = a_{k,m} - a_{i,m}^{(1)} = a_{k,m} - x_{i,m}$.

The average activation $\hat{\rho}_j$ is computed for each of the $D^{(2)}$ hidden units over the M training samples:

$$\hat{\rho}_j = \frac{1}{M}\sum_{m=1}^{M} a_{j,m}^{(2)}$$

The auto-encoder can be learned by back-propagation using a minor change to the cost function:

$$L_{sparse}(W, B; \vec{X}_m, y_m) = \frac{1}{2}\left\|\vec{a} - \vec{X}_m\right\|^2 + \beta \sum_{j=1}^{D^{(2)}} KL(\rho \,\|\, \hat{\rho}_j)$$

where $KL(\rho \,\|\, \hat{\rho}_j)$ is the Kullback-Leibler divergence and $\beta$ controls the weight of the sparsity term. (Don't panic; this is easy to do, as the sketch below shows.)
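A minimal sketch of this cost, assuming the activations have already been computed for all M training samples; the values of rho and beta are illustrative:

```python
# Sparse auto-encoder cost: reconstruction error + beta-weighted KL penalty.
import numpy as np

def kl(rho, rho_hat):
    """Elementwise KL(rho || rho_hat) for Bernoulli-style sparsity targets."""
    return rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))

def sparse_cost(X, A2, A3, rho=0.05, beta=3.0):
    """X, A3: (M, D1) inputs and reconstructions; A2: (M, D2) hidden activations."""
    recon = 0.5 * np.sum((A3 - X) ** 2) / X.shape[0]   # (1/2)||a - X||^2, averaged
    rho_hat = A2.mean(axis=0)                          # rho_hat_j over the training set
    return recon + beta * np.sum(kl(rho, rho_hat))
```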

Kullback-Leibler Divergence

The KL divergence between the desired and average activation is:

$$\sum_{j=1}^{D^{(2)}} KL(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{D^{(2)}} \left( \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j} \right)$$

To incorporate the KL divergence into back-propagation, we replace

$$\delta_j^{(2)} = \frac{\partial f(z_j^{(2)})}{\partial z_j^{(2)}} \sum_{k=1}^{D^{(1)}} w_{kj}^{(2)} \delta_k$$

with

$$\delta_j^{(2)} = \frac{\partial f(z_j^{(2)})}{\partial z_j^{(2)}} \left( \sum_{k=1}^{D^{(1)}} w_{kj}^{(2)} \delta_k + \beta\left(-\frac{\rho}{\hat{\rho}_j} + \frac{1-\rho}{1-\hat{\rho}_j}\right) \right)$$

Note that you need the average activation $\hat{\rho}_j$ to compute the correction. Thus you must compute a forward pass over all the training data before computing the back-propagation for any of the training samples. This can be a problem if the number of training samples is large.

The sparsity constraint forces the hidden units to become approximately orthogonal. Thus the hidden units act as a form of basis space for the input vectors.
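A minimal sketch of the modified hidden-layer delta, assuming a logistic activation (so $f'(z) = a(1-a)$) and that rho_hat has been obtained from a full forward pass over the training data, as described above:

```python
# Hidden-layer delta with the sparsity correction added.
import numpy as np

def hidden_delta_sparse(a2, W2, delta3, rho, rho_hat, beta):
    """a2, rho_hat: (D2,); W2: (D1, D2); delta3: (D1,) output errors."""
    sparsity_term = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
    return (W2.T @ delta3 + sparsity_term) * a2 * (1 - a2)  # f'(z2) = a2 * (1 - a2)
```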

Examples of the Hidden Units Given by Autoencoders

[Figure: an example of 100 hidden units learned by a sparse auto-encoder from images.]

[Figure: when applied at multiple levels and trained on face images, this can give recognizable features.]

[Figure: when trained on YouTube videos, recognizable features such as cats emerge; similarly when trained with car images.]
