Convolutional Neural Networks Srikumar Ramalingam
Reference Many of the slides are prepared using the following resources: neuralnetworksanddeeplearning.com (mainly Chapter 6), http://cs231n.github.io/convolutional-networks/, and Marc'Aurelio Ranzato's deep learning tutorial at CVPR 2014.
Introduction Deep learning allows computational models that are composed of multiple layers to learn representations of data. It has significantly improved state-of-the-art results in speech recognition, visual object recognition, object detection, drug discovery, and genomics. The term "deep" comes from having multiple layers of non-linearity. [Source: Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, Deep Learning, Nature 2015]
Introduction The term "neural" is used because these models are loosely inspired by neuroscience. The goal is generally to approximate some function f, e.g., consider a classifier y = f(x). We define a mapping y = f(x; θ) and learn the value of the parameters θ that result in the best function approximation. A feedforward network is a specific type of deep neural network where information flows through the function being evaluated from the input x, through the intermediate computations used to define f, and finally to the output y.
Perceptron A perceptron takes several Boolean inputs (x1, x2, x3) and returns a Boolean output. The weights (w1, w2, w3) and the threshold are real numbers.
The first learning machine: the Perceptron Built at Cornell in 1960. It's an old paradigm: the Perceptron was a linear classifier on top of a simple feature extractor, y = sign(Σ_{i=1..N} W_i F_i(X) + b). The vast majority of practical applications of ML today use glorified linear classifiers or glorified template matching. Designing a feature extractor requires considerable effort by experts. Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
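As a minimal sketch (not from the slides), this is how the perceptron-style linear classifier y = sign(Σ W_i F_i(X) + b) might look in numpy; the feature values, weights, and bias below are made up purely for illustration.

```python
import numpy as np

def perceptron_predict(features, weights, bias):
    """Linear classifier on top of extracted features: y = sign(sum_i W_i * F_i(X) + b)."""
    return np.sign(np.dot(weights, features) + bias)

# Toy usage with made-up feature values and weights.
features = np.array([0.2, -1.3, 0.7])   # F_1(X), F_2(X), F_3(X) from some feature extractor
weights = np.array([1.0, 0.5, -2.0])    # W_1, W_2, W_3
bias = 0.1
print(perceptron_predict(features, weights, bias))  # -> -1.0
```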
Motivation for CNNs Consider an input with 28x28 = 784 values and 3 fully connected hidden layers. We can achieve about 98% accuracy with just fully connected layers on the MNIST digit recognition dataset.
Fully Connected Layer Example: 300x300 image, 40K hidden units -> ~4B parameters!!! Fully connected layers do not take into account the spatial structure of the images. For instance, they treat input pixels which are far apart and pixels which are close together on exactly the same footing. - Spatial correlation is local - Waste of resources + we don't have enough training samples anyway. Slide Credit: Marc'Aurelio Ranzato 8
Locally Connected Layer Example: 300x300 image 40K hidden units Filter size: 10x10 4M parameters Note: This parameterization is good when input image is registered (e.g., face recognition). Slide Credit: Marc'Aurelio Ranzato 9
Convolutional Layer Share the same parameters across different locations (assuming input is stationary): Convolutions with learned kernels Slide Credit: Marc'Aurelio Ranzato 11
Local Receptive Fields in CNNs Local receptive fields: each neuron in the hidden layer will be connected to a small window of input neurons, say a 5x5 region, corresponding to 25 input neurons. Each hidden neuron can be thought of as analyzing its local receptive field.
Local Receptive Fields in CNNs We slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron. If we have a 28x28 image and 5x5 receptive field, we will have 24x24 hidden neurons.
Stride Length Sometimes we slide the local receptive field over by more than one pixel to the right (or down). In that case the stride length could be 2 or more. This will lead to fewer hidden neurons. For example, in the case of a 28x28 image with 5x5 receptive field and stride length 2, we will have just 12x12 hidden neurons.
Shared Weights and Biases The output of the (j, k)-th hidden neuron is σ(b + Σ_l Σ_m w_{l,m} a_{j+l, k+m}), with l, m = 0..4, where σ is the activation function such as the sigmoid unit, b is the bias, w_{l,m} are the weights, and a_{x,y} is the input activation at location (x, y). Note that the weights and the bias are the same for all the hidden neurons in one feature map. All the neurons in the first hidden layer detect exactly the same feature, just at different locations in the image (e.g., a cat).
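A minimal numpy sketch of the shared-weight computation above, assuming a 28x28 input, a 5x5 receptive field, and stride 1; the image and kernel values are placeholders, not from the slides.

```python
import numpy as np

def feature_map(image, kernel, bias, activation=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Slide one shared kernel over the image (stride 1, no padding) to produce a feature map."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for j in range(H - k + 1):
        for m in range(W - k + 1):
            window = image[j:j + k, m:m + k]          # local receptive field of neuron (j, m)
            out[j, m] = activation(bias + np.sum(kernel * window))
    return out

image = np.random.rand(28, 28)          # placeholder input
kernel = np.random.randn(5, 5) * 0.1    # shared weights for one feature map
print(feature_map(image, kernel, bias=0.0).shape)   # -> (24, 24) hidden neurons
```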
Sigmoid neuron A sigmoid neuron can take real-valued inputs (x1, x2, x3) and returns a number between 0 and 1. The weights (w1, w2, w3) and the bias term b are real numbers. Sigmoid function: σ(z) = 1 / (1 + e^(-z)).
Rectified linear neuron The rectified linear unit, f(z) = max(0, z), is the preferred choice for many computer vision problems.
Convolutional Layer Slide Credit: Marc'Aurelio Ranzato 18
Convolutional Layer Example: input image * kernel [-1 0 1; -1 0 1; -1 0 1] = feature map (an edge-detection kernel). Slide Credit: Marc'Aurelio Ranzato 21
Feature maps Example: 3 feature maps. We call the map from the input to the hidden layer a feature map. The weights are called shared weights and the biases are called shared biases, and they are shared only within one feature map. Different feature maps have different weights and biases.
Pooling Layers A pooling layer takes each feature map output from the convolutional layer and prepares a condensed feature map. For instance, each unit in the pooling layer may summarize a region of neurons in the previous layer. As a concrete example, one common procedure for pooling is known as max-pooling, e.g., a pooling unit can output the maximum activation in the 2x2 input region.
Pooling Layers L2 pooling: take the square root of the sum of the squares of the activations in each 2x2 region. Several pooling options exist (e.g., average pooling).
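A small sketch of 2x2 max-pooling as described above (not from the slides); the toy feature map is just for illustration.

```python
import numpy as np

def max_pool(feature_map, pool=2):
    """2x2 max-pooling: each output unit is the maximum activation in its 2x2 input region."""
    H, W = feature_map.shape
    out = np.zeros((H // pool, W // pool))
    for j in range(0, H - pool + 1, pool):
        for k in range(0, W - pool + 1, pool):
            out[j // pool, k // pool] = np.max(feature_map[j:j + pool, k:k + pool])
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 feature map
print(max_pool(fmap))   # each entry is the max of one 2x2 block -> [[5, 7], [13, 15]]
```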
Output size of a convolutional layer Input: M@(D x D), kernels of size K x K, N output channels. Assumption: zero padding (P = 0) and stride 1. Output: N@((D-K+1) x (D-K+1)).
With padding P and Stride S Input: D x D x M, i.e., there are M input channels (M@(D x D)). Let us assume that we have kernels of size K x K and N output channels. What are the dimensions of the output? N@( ((D - K + 2P)/S + 1) x ((D - K + 2P)/S + 1) ). Example in 1D: D=5, P=1, S=1, K=3 and D=5, P=1, S=2, K=3 (filter in 1D, input with padding).
Stride S = 2, Padding P = 1, input size D=5, filter size K=3
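A small helper (an assumption of this write-up, not from the slides) that evaluates the output-size formula (D - K + 2P)/S + 1 and reproduces the 1D examples above.

```python
def conv_output_size(D, K, P, S):
    """Spatial output size of a convolution: (D - K + 2P)/S + 1."""
    assert (D - K + 2 * P) % S == 0, "filter does not tile the padded input evenly"
    return (D - K + 2 * P) // S + 1

# 1D examples from the slides:
print(conv_output_size(D=5, K=3, P=1, S=1))   # -> 5
print(conv_output_size(D=5, K=3, P=1, S=2))   # -> 3
# 2D case: an M@(D x D) input with N filters gives N@(out x out)
print(conv_output_size(D=28, K=5, P=0, S=1))  # -> 24
```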
Why use padding? If there is no padding, then the size of the output would reduce by a small amount after each CONV, and the information at the borders would be washed away too quickly.
Output size of a pooling layer Input: M@(D x D), non-overlapping K x K pooling regions. Output: M@((D/K) x (D/K)).
With overlap? Stride S and filter size K. Output dimensions? M@( ((D-K)/S + 1) x ((D-K)/S + 1) )
What should I set the size of the pools to? It depends on how robust or invariant we want the representation to be. It is best to pool slowly, using a sequence of convolution layers (i.e., each pooling is applied after a sequence of conv layers).
Getting rid of the pooling layer To reduce the size of the representation, use a larger stride in the CONV layer once in a while. Discarding pooling layers has also been found to be important in training good generative models, such as variational autoencoders (VAEs) or generative adversarial networks (GANs). It seems likely that future architectures will use very few to no pooling layers. Reference: Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, Martin Riedmiller, Striving for Simplicity: The All Convolutional Net, 2014.
Conv->FC and FC->Conv A convolutional layer can be seen as an FC layer (with sparse, shared weights), and an FC layer can be seen as a convolution (with a kernel that spans the entire input).
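A hedged numpy sketch of this equivalence: a fully connected layer over a small M@(D x D) input gives the same outputs as a convolution whose kernels span the whole input. The sizes and random weights are arbitrary choices for illustration.

```python
import numpy as np

# FC over an M@(D x D) input vs. a convolution with N full-size kernels (each M x D x D).
M, D, N = 2, 3, 4
x = np.random.rand(M, D, D)
W_fc = np.random.randn(N, M * D * D)          # FC weight matrix
b = np.random.randn(N)

fc_out = W_fc @ x.reshape(-1) + b             # FC layer: flatten, then matrix multiply

W_conv = W_fc.reshape(N, M, D, D)             # same weights viewed as N full-size kernels
conv_out = np.array([np.sum(W_conv[n] * x) for n in range(N)]) + b

print(np.allclose(fc_out, conv_out))          # -> True
```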
MNIST data Each grayscale image is of size 28x28. 60,000 training images and 10,000 test images 10 possible labels (0,1,2,3,4,5,6,7,8,9) One of the very early datasets for neural networks, but still actively used by researchers for testing their algorithms.
Performance on MNIST is near perfect Only 33 out of 10,000 test images are misclassified. Top right: correct. Bottom right: misclassified. This uses several ideas: convolutions, pooling, the use of GPUs to do far more training than we did with shallow networks, algorithmic expansion of the training data, dropout, etc.
Digit recognition using 3 layers Example output: 6 -> [0 0 0 0 0 0 1 0 0 0]. Inputs are normalized to values between 0 and 1.
Matrix equations for neural networks The indices j and k seem a little counter-intuitive!
Layer to layer relationship a_j^l = σ(z_j^l), where z_j^l = Σ_k w_{jk}^l a_k^{l-1} + b_j^l. Here b_j^l is the bias term of the jth neuron in the lth layer, a_j^l is the activation of the jth neuron in the lth layer, and z_j^l is the weighted input to the jth neuron in the lth layer.
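A minimal sketch of this layer-to-layer computation in vectorized form, a^l = σ(W^l a^{l-1} + b^l); the layer sizes and random weights below are placeholders, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(a_prev, W, b):
    """One layer: z^l = W^l a^{l-1} + b^l, a^l = sigma(z^l).
    W has shape (neurons in layer l, neurons in layer l-1); entry W[j, k] is w^l_{jk}."""
    z = W @ a_prev + b
    return sigmoid(z)

# Toy network: 784 -> 30 -> 10 (sizes are placeholders)
a0 = np.random.rand(784)
W1, b1 = np.random.randn(30, 784) * 0.01, np.zeros(30)
W2, b2 = np.random.randn(10, 30) * 0.01, np.zeros(10)
a2 = forward_layer(forward_layer(a0, W1, b1), W2, b2)
print(a2.shape)   # -> (10,)
```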
(Board example: counting the layers, weights, and bias terms of a small feedforward network, e.g., # bias terms = 4 + 2.)
Cost function from the network C(w, b) = (1/2n) Σ_x || y(x) - a ||^2, where y(x) is the ground truth for each input, a is the output activation vector for the specific training sample x, and n is the # of input samples.
Cost function w and b are the parameters to compute, n is the # of input samples, x is the input vector, and a is the output. We assume that the network approximates a function y(x) and outputs a. We use a quadratic cost function, i.e., mean squared error or MSE.
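A small sketch (assuming the quadratic/MSE cost above) that evaluates C on two toy samples; the outputs and one-hot labels are made up for illustration.

```python
import numpy as np

def quadratic_cost(outputs, targets):
    """MSE cost over the training set: C = 1/(2n) * sum_x ||y(x) - a||^2."""
    n = len(outputs)
    return sum(0.5 * np.sum((y - a) ** 2) for a, y in zip(outputs, targets)) / n

# Two toy training samples with 10-dimensional outputs (e.g., digit classes).
a1, y1 = np.full(10, 0.1), np.eye(10)[6]      # network output vs. one-hot label "6"
a2, y2 = np.full(10, 0.1), np.eye(10)[3]
print(quadratic_cost([a1, a2], [y1, y2]))     # small non-negative number
```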
Cost function Can the cost function be negative in the above example? What does it mean when the cost is approximately equal to zero?
Gradient Descent Let us consider a cost function C(v1, v2) that depends on two variables. The goal is to change the two variables to minimize the cost function. Small changes in the parameters lead to small changes in the output: ΔC ≈ ∇C · Δv, where ∇C is the gradient vector. We change the parameters using the (positive) learning rate η and the gradient vector, giving the update rule v -> v' = v - η ∇C.
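A minimal sketch of gradient descent on a toy two-variable cost C(v1, v2) = v1^2 + 2*v2^2 (chosen here only for illustration), using the update rule above.

```python
import numpy as np

def grad_C(v):
    """Gradient of the toy cost C(v1, v2) = v1^2 + 2*v2^2 (a stand-in for a real cost)."""
    return np.array([2 * v[0], 4 * v[1]])

v = np.array([3.0, -2.0])     # initial parameters
eta = 0.1                     # learning rate (positive)
for _ in range(100):
    v = v - eta * grad_C(v)   # update rule: v -> v' = v - eta * grad C
print(v)                      # close to the minimizer (0, 0)
```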
Cost function from the network w and b are the parameters to compute and n is the # of input samples. The gradient is an average of the gradients from the individual training samples, ∇C = (1/n) Σ_x ∇C_x. What are the challenges in gradient descent when you have a large number of training samples?
Consider a simple case: a single sigmoid neuron with a = σ(wx + b) and quadratic cost C = (y - a)^2 / 2. Then ∂C/∂w = (a - y) σ'(z) x. From the graph of σ, the derivative σ'(z) is very small when z > 4 or z < -4, so the gradient becomes very small there. This leads to learning slowdown with the MSE (mean squared error) cost.
Cross-entropy loss function C = -(1/n) Σ_x [ y ln a + (1 - y) ln(1 - a) ], where n is the total number of items of training data, x is the input, y is the required output, and a is the output from the neuron. Here a can be interpreted as the probability of the output being class 1, and (1 - a) as the probability of the output being class 0.
Entropy Entropy measures "disorderliness": E = -Σ_i P_i log2 P_i. Coin-toss example: if P(H) = P(T) = 0.5, then E = -0.5 log2 0.5 - 0.5 log2 0.5 = log2 2 = 1, i.e., 1 bit is needed to transmit heads or tails. If P(H) = 1 and P(T) = 0, then E = 0: if it is always heads, there is no need to transmit any information.
Cross-entropy loss function C = -(1/n) Σ_x [ y ln a + (1 - y) ln(1 - a) ], where n is the total number of items of training data, x is the input, y is the required output, and a is the output from the neuron. The cost function is non-negative, since ln a is negative whenever 0 < a < 1. If the neuron's actual output is close to the desired output, then the cost function is close to 0.
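A small sketch that evaluates the cross-entropy cost for scalar outputs; the example outputs and targets are made up to show that good predictions give a cost near 0.

```python
import numpy as np

def cross_entropy_cost(outputs, targets):
    """C = -1/n * sum_x [ y*ln(a) + (1-y)*ln(1-a) ] for scalar outputs a in (0, 1)."""
    n = len(outputs)
    return -sum(y * np.log(a) + (1 - y) * np.log(1 - a)
                for a, y in zip(outputs, targets)) / n

# Outputs close to the desired labels give a cost near 0; bad outputs give a large cost.
print(cross_entropy_cost(outputs=[0.99, 0.02], targets=[1, 0]))  # ~0.015
print(cross_entropy_cost(outputs=[0.10, 0.90], targets=[1, 0]))  # ~2.3
```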
Consider the simple case again: a = σ(z) with z = wx + b, and the cross-entropy cost C = -[ y ln a + (1 - y) ln(1 - a) ]. Then ∂C/∂w = x σ'(z) (a - y) / (a(1 - a)) = x(a - y), since σ'(z) = σ(z)(1 - σ(z)) = a(1 - a). The σ'(z) term that causes learning slowdown cancels out, so how fast the neuron learns depends only on the error (a - y).
Derivative of Sigmoid function σ(z) = 1 / (1 + e^(-z)), so σ'(z) = e^(-z) / (1 + e^(-z))^2 = σ(z)(1 - σ(z)).
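A quick numerical check of this identity, plus an illustration of why σ'(z) causes learning slowdown for large |z| (a sketch, not from the slides).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# Numerical check of sigma'(z) = sigma(z) * (1 - sigma(z)) via central differences.
z, h = 0.7, 1e-6
print(np.isclose((sigmoid(z + h) - sigmoid(z - h)) / (2 * h), sigmoid_prime(z)))  # -> True

# Learning slowdown: sigma'(z) is tiny when |z| > 4.
print(sigmoid_prime(0.0), sigmoid_prime(5.0), sigmoid_prime(-5.0))  # 0.25, ~0.0066, ~0.0066
```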
Cross-entropy loss for multiple neurons C = -(1/n) Σ_x Σ_j [ y_j ln a_j^L + (1 - y_j) ln(1 - a_j^L) ]. The desired values of the output neurons are given by y_1, y_2, ..., and the actual output values are given by a_1^L, a_2^L, ...
SoftMax layer A new kind of output layer: a_j^L = e^{z_j^L} / Σ_k e^{z_k^L}. The output activations are guaranteed to sum to 1 and could be interpreted as probabilities.
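A minimal softmax sketch; subtracting max(z) is a standard numerical-stability trick added here, not something stated on the slide.

```python
import numpy as np

def softmax(z):
    """a_j = exp(z_j) / sum_k exp(z_k); subtracting max(z) avoids overflow without changing the result."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1, -1.0])   # weighted inputs z_j^L of the output layer
a = softmax(z)
print(a, a.sum())                      # activations sum to 1 and can be read as probabilities
```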
Stochastic gradient descent The idea is to compute the gradient using a small set of randomly chosen training data. We assume that the average gradient obtained from the small set is close to the gradient obtained from the entire set.
Stochastic gradient descent Let us consider a mini-batch of m randomly chosen samples. Provided that the sample size m is large enough, we expect the average gradient from the m samples to be approximately equal to the average gradient from all n samples. Backpropagation is a method to compute the gradients!
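A hedged sketch of one mini-batch SGD step; the grad_fn interface and the linear-regression stand-in gradient are assumptions made here for illustration (the slides would use backpropagation on the network's cost instead).

```python
import numpy as np

def sgd_step(params, data, labels, grad_fn, eta=0.1, m=32):
    """One SGD step: estimate the gradient on a mini-batch of m random samples instead of all n.
    grad_fn(params, x_batch, y_batch) is assumed to return the average gradient over the batch
    (in a neural network this would be computed by backpropagation)."""
    idx = np.random.choice(len(data), size=m, replace=False)
    grad = grad_fn(params, data[idx], labels[idx])
    return params - eta * grad

# Toy usage: least-squares gradient as a stand-in for a real backprop gradient.
def linreg_grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

X, true_w = np.random.randn(1000, 3), np.array([1.0, -2.0, 0.5])
y = X @ true_w
w = np.zeros(3)
for _ in range(500):
    w = sgd_step(w, X, y, linreg_grad)
print(w)   # close to [1.0, -2.0, 0.5]
```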
Thank You