EVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN)
TARGETED PIECES OF KNOWLEDGE
Linear regression, Activation function, Multi-Layer Perceptron (MLP), Stochastic Gradient Descent (SGD), Back-propagation, Convolution, Pooling (or Sub-sampling), Convolutional Neural Networks (CNN), Feature maps, Dropout, Batch Normalization
NOTATION
{x, y}: a training example (x the input, y the label)
x: a scalar
x: a vector
X: a matrix
W, θ: network weights
J(θ): a loss function
MNIST DATASET
Dataset of handwritten digits: 60,000 training examples and 10,000 test examples. Digits are size-normalized and centered in fixed-size images. An easy dataset for beginners in machine learning.
SUPERVISED LEARNING
Diagram: an input is fed to a function that produces an output (classes: 0, 1, 2, etc.); a loss compares the output with the label to measure the error.
OUR FIRST NEURAL NETWORK
LINEAR REGRESSION
Plot: a line fitted through 2-D data points.
Linear function: f(x, w) = w_0 + w_1 x
Objective: find w = (w_0, w_1) which minimizes the error J(w) = (1/2) Σ_i (f(x_i, w) - y_i)²
Animation of the optimization problem: https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/#train-your-dragon
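As a minimal sketch (the data values below are made up for illustration, roughly y = 2 + 3x), the same optimization can be written directly with Torch7 tensors:

require 'torch'
local x = torch.Tensor{1, 2, 3, 4, 5}
local y = torch.Tensor{5, 8, 11, 14, 17}

local w0, w1 = 0, 0           -- parameters of f(x, w) = w0 + w1*x
local lr = 0.01               -- learning rate

for step = 1, 2000 do
   local pred = x * w1 + w0   -- f(x_i, w) for every training point
   local err  = pred - y      -- residuals f(x_i, w) - y_i
   local dw0  = err:sum()     -- dJ/dw0 = sum_i (f(x_i, w) - y_i)
   local dw1  = err:dot(x)    -- dJ/dw1 = sum_i x_i (f(x_i, w) - y_i)
   w0 = w0 - lr * dw0         -- gradient descent step
   w1 = w1 - lr * dw1
end
print(w0, w1)                 -- should approach 2 and 3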
CLASSIFICATION FUNCTION
Plot: binary labels (-1 / +1) plotted against x, separated by a linear classifier.
Binary classification: f(x, w) ∈ {-1; +1}
Using a non-linearity: f(x, w) = +1 if tanh(w_0 + w_1 x) ≥ 0, -1 otherwise
ACTIVATION FUNCTION
Plots of common activation functions: Threshold, Tanh, Sigmoid, Rectified Linear Unit (ReLU).
Other variants: Leaky ReLU, PReLU, etc.
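For reference, the usual definitions of these functions are:
Threshold: h(x) = 1 if x ≥ 0, 0 otherwise
tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}), output in (-1, 1)
sigmoid(x) = 1 / (1 + e^{-x}), output in (0, 1)
ReLU(x) = max(0, x)
Leaky ReLU(x) = max(αx, x) with a small fixed α (PReLU learns α)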
PERCEPTRON
Diagram: inputs 1, x_1, x_2, x_3 weighted by w_0, w_1, w_2, w_3 feeding a single unit.
If h(x) is an activation function, then a perceptron is defined as follows:
F(x, w) = h(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3) = h(Σ_i w_i x_i) = h(wᵀx)
FIRST LAYER OF NEURONS
Diagram: inputs (1, x_1, x_2) fully connected to three output neurons y_1, y_2, y_3.
F(x, W) = h(xᵀW)
        = h([1 x_1 x_2] [w_01 w_02 w_03 ; w_11 w_12 w_13 ; w_21 w_22 w_23])
        = h([w_01 + x_1 w_11 + x_2 w_21, w_02 + x_1 w_12 + x_2 w_22, w_03 + x_1 w_13 + x_2 w_23])
        = h([y_1 y_2 y_3]) = [h(y_1), h(y_2), h(y_3)]
MLP: MULTI-LAYER PERCEPTRON
Diagram: several layers F_1, F_2, F_3, each built as in the previous slide, stacked on top of each other.
F_N(⋯ F_2(F_1(x, W_1), W_2) ⋯, W_N) = (F_N ∘ ⋯ ∘ F_2 ∘ F_1)(x)
BUILDING OUR MLP
Torch7 works with modules. Module is an abstract class which defines the fundamental methods necessary for training a neural network. Modules are serializable.
The network: Input → nn.Sequential [ nn.Linear → nn.ReLU → nn.Linear ] → Output
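A minimal sketch of this network in Torch7; the layer sizes (784 inputs for flattened 28x28 MNIST images, 128 hidden units, 10 classes) are placeholder choices, not from the slides:

require 'nn'

local model = nn.Sequential()
model:add(nn.Linear(784, 128))   -- first fully connected layer
model:add(nn.ReLU())             -- non-linearity
model:add(nn.Linear(128, 10))    -- output layer, one unit per class

An input tensor of size 784 can then be passed through the whole stack with model:forward(input).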
LOSS FUNCTION FOR CLASSIFICATION
Converting the network outputs u into probabilities (softmax): f(u)_j = e^{u_j} / Σ_k e^{u_k}
Negative log-likelihood: J(p, t) = -log(p_t)
Combination of both: J(u, t) = -u_t + log(Σ_k e^{u_k})
Example: network output u = (14, 11, 16) → class probabilities (0.1185, 0.0059, 0.8756)
Error for true class 3: J(f(u), 3) = -log(0.8756) = 0.1328
LOSS FUNCTION IN TORCH7
A Criterion is a special kind of Module that takes two parameters as input: the network output and the target.
Either: Output → nn.LogSoftMax → nn.ClassNLLCriterion (with Target) → Error
Or: Output → nn.CrossEntropyCriterion (with Target) → Error
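A minimal sketch of the two equivalent setups, assuming the MLP from the previous sketch and placeholder input/target tensors (targets are class indices, 1 to 10 for MNIST):

require 'nn'

-- Option 1: append nn.LogSoftMax to the model and use nn.ClassNLLCriterion
model:add(nn.LogSoftMax())
local loss_function = nn.ClassNLLCriterion()

-- Option 2: keep raw outputs and use nn.CrossEntropyCriterion,
-- which combines LogSoftMax and ClassNLLCriterion internally
-- local loss_function = nn.CrossEntropyCriterion()

local err = loss_function:forward(model:forward(input), target)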
HOW TO TRAIN A NEURAL NETWORK?
GRADIENT DESCENT
Objective: minimize an objective (loss) function J(θ).
The gradient gives the slope of the function.
Update the parameters θ in the opposite direction of the gradient, according to a learning rate η: θ ← θ - η ∇_θ J(θ).
Repeat until convergence.
CHAIN RULE
Composition of functions: F(x) = (f ∘ g)(x) = f(g(x))
Derivative of a composition: F′(x) = f′(g(x)) · g′(x)
Using Leibniz's notation: ∂F(x)/∂x = (∂f(g(x))/∂g(x)) · (∂g(x)/∂x)
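A one-line worked example: for F(x) = sin(x²), take g(x) = x² and f(u) = sin(u), so F′(x) = cos(x²) · 2x.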
BACK-PROPAGATION
A three-layer network:
y_1 = f_1(x, w_1)
y_2 = f_2(y_1, w_2) = f_2(f_1(x, w_1), w_2)
y_3 = f_3(y_2, w_3) = f_3(f_2(f_1(x, w_1), w_2), w_3)
Objective: ∇_{w_1, w_2, w_3} y_3 = [∂y_3/∂w_1 ; ∂y_3/∂w_2 ; ∂y_3/∂w_3]
∂y_3/∂w_3 = ∂f_3(y_2, w_3)/∂w_3
∂y_3/∂w_2 = (∂f_3(y_2, w_3)/∂y_2) · (∂f_2(y_1, w_2)/∂w_2)
∂y_3/∂w_1 = (∂f_3(y_2, w_3)/∂y_2) · (∂f_2(y_1, w_2)/∂y_1) · (∂f_1(x, w_1)/∂w_1)
The gradients are computed backwards, layer by layer, reusing the terms already computed for the deeper layers.
ONE STEP IN TORCH7
Diagram: data → Model (parameters θ) → output → Loss function (with target) → error; the gradient df_do flows back through the model: g_θ ← 0, g_θ ← g_θ + ∂error/∂θ, θ ← θ - η g_θ.

-- Reset gradients
model:zeroGradParameters()
-- Forward
local output = model:forward(data)
local f_error = loss_function:forward(output, target)
-- Backward
local df_do = loss_function:backward(output, target)
model:backward(data, df_do)
-- Update parameters
model:updateParameters(0.01)
BATCH GRADIENT DESCENT
Computes the gradient of the cost function for the entire dataset: θ ← θ - η ∇_θ J(θ)

-- Reset gradients
model:zeroGradParameters()
for i = 1, traindata:size() do
   -- Forward
   local output = model:forward(traindata.data[i])
   local f_error = loss_function:forward(output, traindata.labels[i])
   -- Backward (gradients accumulate over the whole dataset)
   local df_do = loss_function:backward(output, traindata.labels[i])
   model:backward(traindata.data[i], df_do)
end
-- Update parameters once, using the accumulated gradient
model:updateParameters(0.01)
STOCHASTIC GRADIENT DESCENT
Performs a parameter update for each training example (x^(i), y^(i)): θ ← θ - η ∇_θ J(θ, x^(i), y^(i))

-- Create a random permutation
local shuffle = torch.randperm(traindata:size())
for i = 1, traindata:size() do
   -- Reset gradients
   model:zeroGradParameters()
   -- Forward
   local output = model:forward(traindata.data[shuffle[i]])
   local f_error = loss_function:forward(output, traindata.labels[shuffle[i]])
   -- Backward
   local df_do = loss_function:backward(output, traindata.labels[shuffle[i]])
   model:backward(traindata.data[shuffle[i]], df_do)
   -- Update parameters after each example
   model:updateParameters(0.01)
end
MINI-BATCH SGD
Takes the best of both worlds and performs an update for every mini-batch of n training examples: θ ← θ - η ∇_θ J(θ, x^(i:i+n), y^(i:i+n))

for i = 1, traindata:size(), batchsize do
   -- Reset gradients
   model:zeroGradParameters()
   -- Create batch
   local batch = getbatch(traindata, batchsize)
   -- Forward
   local output = model:forward(batch.inputs)
   local f_error = loss_function:forward(output, batch.targets)
   -- Backward
   local df_do = loss_function:backward(output, batch.targets)
   model:backward(batch.inputs, df_do)
   -- Update parameters after each mini-batch
   model:updateParameters(0.01)
end
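getbatch is not defined on the slide; a minimal sketch of it, assuming traindata.data is a tensor whose first dimension indexes examples and traindata.labels holds the class indices (the extra starting-index argument is an assumption, since the slide calls it without one):

local function getbatch(traindata, batchsize, first)
   -- take up to batchsize consecutive examples starting at index 'first'
   local n = math.min(batchsize, traindata.data:size(1) - first + 1)
   return {
      inputs  = traindata.data:narrow(1, first, n),    -- n inputs
      targets = traindata.labels:narrow(1, first, n),  -- n class indices
   }
end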
SGD OPTIMIZATION ALGORITHMS
Momentum: adds a fraction of the previously computed gradient (gives inertia to the gradient): v_t ← γ v_{t-1} + η ∇_θ J(θ); θ ← θ - v_t
NAG: extension of momentum
Adagrad: adapts the learning rate to each parameter individually
Adadelta: extension of Adagrad
RMSprop: another extension of Adagrad
Adam: takes into account the mean and variance of gradients
Etc.
GRADIENT DESCENT ILLUSTRATION See: http://sebastianruder.com/optimizing-gradient-descent/index.html#whichoptimizertouse
PACKAGE OPTIM IN TORCH7 Torch package providing several optimization algorithms. Easy to use, easy to switch from one optimizer to another. https://github.com/torch/optim
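A minimal sketch of one parameter update with optim.sgd, assuming the model, loss_function and batch variables from the previous slides (this replaces the manual updateParameters call):

require 'optim'

local params, gradParams = model:getParameters()
local optimState = {learningRate = 0.01, momentum = 0.9}

-- evaluates the loss and the gradient for the current mini-batch
local function feval(x)
   if x ~= params then params:copy(x) end
   gradParams:zero()
   local output = model:forward(batch.inputs)
   local loss = loss_function:forward(output, batch.targets)
   local df_do = loss_function:backward(output, batch.targets)
   model:backward(batch.inputs, df_do)
   return loss, gradParams
end

optim.sgd(feval, params, optimState)   -- performs one update of params in place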
CONVOLUTIONAL NEURAL NETWORK
CONVOLUTION
(f ∗ g)(x) = ∫_{-∞}^{+∞} f(t) g(x - t) dt
DISCRETE CONVOLUTION
(f ∗ g)(n) = Σ_{m=-∞}^{+∞} f(n - m) g(m)
SLIDING MASK Convolution tool from Rémi Emonet: http://dl.heeere.com/convolution/ Convolution layer in Torch7: https://github.com/torch/nn/blob/master/doc/convolution.md
CONVOLUTION EXAMPLE
Common 3×3 kernels applied to the original image:
Sharpen: [ 0 -1 0 ; -1 5 -1 ; 0 -1 0 ]
Emboss: [ -2 -1 0 ; -1 1 1 ; 0 1 2 ]
Blur: [ 1 1 1 ; 1 1 1 ; 1 1 1 ]
Edge detect: [ 0 1 0 ; 1 -4 1 ; 0 1 0 ]
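A minimal sketch of applying the edge-detect kernel with Torch7's nn.SpatialConvolution (img is a placeholder 1 x H x W grayscale tensor, not defined in the slides):

require 'nn'

-- one input plane, one output plane, 3x3 kernel
local conv = nn.SpatialConvolution(1, 1, 3, 3)
conv.bias:zero()
conv.weight[1][1]:copy(torch.Tensor{{0,  1, 0},
                                    {1, -4, 1},
                                    {0,  1, 0}})
local edges = conv:forward(img)   -- edge map of the input image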
CONVOLUTIONAL NEURAL NETWORK
Diagrams (three slides): first, the fixed 3×3 edge-detect kernel [ 0 1 0 ; 1 -4 1 ; 0 1 0 ] is slid over the image to produce the outputs y_1, y_2, y_3 of a feature map; then the kernel entries are replaced by learnable weights w_1 … w_9 connecting each output neuron to a 3×3 patch of the input; finally, the same weights w_1 … w_9 are shared across every position of the input.
POOLING
Maximum pooling example (2×2 windows):
Input: [ -1 17 -18 13 ; -15 19 11 5 ; -10 -3 2 18 ; -4 4 -12 13 ] → Output: [ 19 13 ; 4 18 ]
Effect: reduces the feature map's size and increases the field of view.
Variants: average pooling, sum pooling, stochastic pooling, etc.
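The example above can be reproduced with nn.SpatialMaxPooling; a minimal sketch:

require 'nn'

local input = torch.Tensor{{ -1, 17, -18, 13},
                           {-15, 19,  11,  5},
                           {-10, -3,   2, 18},
                           { -4,  4, -12, 13}}:view(1, 4, 4)

-- 2x2 pooling window moved with a stride of 2 in both directions
local pool = nn.SpatialMaxPooling(2, 2, 2, 2)
print(pool:forward(input))   -- 1x2x2 tensor: {{19, 13}, {4, 18}}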
FIELD OF VIEW Convolution with mask size = 3 Pooling with mask size = 2
LENET 5 Gradient-based learning applied to document recognition, Yann LeCun, Léon Bottou, Yoshua Bengio and Patrick Haffner [1998].
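As a sketch, a LeNet-style network for 28x28 MNIST images could be written in Torch7 as follows; the layer sizes approximate the original architecture rather than reproduce it exactly, and ReLU replaces the original tanh units:

require 'nn'

local net = nn.Sequential()
net:add(nn.SpatialConvolution(1, 6, 5, 5))    -- 1 input plane, 6 feature maps, 5x5 kernels
net:add(nn.ReLU())
net:add(nn.SpatialMaxPooling(2, 2, 2, 2))     -- sub-sampling
net:add(nn.SpatialConvolution(6, 16, 5, 5))   -- 16 feature maps
net:add(nn.ReLU())
net:add(nn.SpatialMaxPooling(2, 2, 2, 2))
net:add(nn.View(16 * 4 * 4))                  -- flatten (4x4 maps remain for 28x28 inputs)
net:add(nn.Linear(16 * 4 * 4, 120))
net:add(nn.ReLU())
net:add(nn.Linear(120, 84))
net:add(nn.ReLU())
net:add(nn.Linear(84, 10))                    -- 10 digit classes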
IF WE HAVE TIME
DROPOUT
During training, for each forward pass, randomly set units to 0.
Example with drop factor = 0.3:
Input: [ 13 9 7 4 ; 4 16 2 5 ; 0 6 8 19 ; 4 7 7 5 ] → Output: [ 13 0 7 4 ; 4 0 2 0 ; 0 6 8 0 ; 0 0 7 5 ]
At test time, keep the same "energy" in the network: no unit is dropped, and activations are rescaled so that their expected magnitude matches training.
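In Torch7, dropout is a module; a minimal sketch with the drop factor of the slide (the input values are just an example):

require 'nn'

-- each unit is zeroed with probability 0.3 during training;
-- surviving units are scaled by 1/(1 - 0.3) so the expected activation is unchanged
local drop = nn.Dropout(0.3)
local out  = drop:forward(torch.Tensor{13, 9, 7, 4, 4, 16, 2, 5})

drop:evaluate()   -- switch to test mode: no unit is dropped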
BATCH NORMALIZATION
During training, for each forward pass, normalize the activations of each unit using the mean and variance computed over the mini-batch, followed by a learned scale and shift.
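In Torch7 this is also a module; a minimal sketch inserting it after a fully connected layer of the MLP sketched earlier (nn.SpatialBatchNormalization is the counterpart for convolutional feature maps):

require 'nn'

model:add(nn.Linear(784, 128))
model:add(nn.BatchNormalization(128))   -- normalizes each of the 128 units over the mini-batch,
                                        -- then applies a learned scale (gamma) and shift (beta)
model:add(nn.ReLU())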
CREATING OUR OWN LAYERS
Diagram: the same chain of modules f_1, f_2, f_3 as in the back-propagation slide.
Each module y = f(x, w) has to compute ∂y/∂x and ∂y/∂w.
In Torch7 a new module has to overload 3 functions (see the sketch below):
[output] updateOutput(input)
[gradInput] updateGradInput(input, gradOutput)
accGradParameters(input, gradOutput)
Torch7 documentation: http://torch.ch/docs/developer-docs.html
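A minimal sketch of a custom, parameter-free module that multiplies its input by a constant (the name nn.MyScale is made up; nn.MulConstant already provides this behaviour, the point here is the structure). accGradParameters is omitted because this module has no weights:

require 'nn'

local MyScale, parent = torch.class('nn.MyScale', 'nn.Module')

function MyScale:__init(factor)
   parent.__init(self)
   self.factor = factor
end

-- y = f(x): here simply y = factor * x
function MyScale:updateOutput(input)
   self.output:resizeAs(input):copy(input):mul(self.factor)
   return self.output
end

-- dJ/dx = dJ/dy * dy/dx = gradOutput * factor
function MyScale:updateGradInput(input, gradOutput)
   self.gradInput:resizeAs(gradOutput):copy(gradOutput):mul(self.factor)
   return self.gradInput
end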
LINKS
TORCH7
Torch7 documentation:
Torch7: http://torch.ch/
Optim package: https://github.com/torch/optim
Criterions: https://github.com/torch/nn/blob/master/doc/criterion.md
Convolutional modules: https://github.com/torch/nn/blob/master/doc/convolution.md
Some tutorial code in Torch7:
Torch7 tutorials: https://github.com/torch/tutorials
Digit classifier: https://github.com/torch/demos/tree/master/train-a-digit-classifier
TUTORIALS
A Visual and Interactive Guide to the Basics of Neural Networks: https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/
An overview of gradient descent optimization algorithms: http://sebastianruder.com/optimizing-gradient-descent/index.html
Artificial Intelligence: https://leonardoaraujosantos.gitbooks.io/artificialinteligence/content/
Andrew Ng's course on Coursera: https://www.coursera.org/learn/machine-learning
TARGETED PIECES OF KNOWLEDGE
Linear regression, Activation function, Multi-Layer Perceptron (MLP), Stochastic Gradient Descent (SGD), Back-propagation, Convolution, Pooling (or Sub-sampling), Convolutional Neural Networks (CNN), Feature maps, Dropout, Batch Normalization