EVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN)

Philippa Johnson
5 years ago

2 TARGETED PIECES OF KNOWLEDGE
- Linear regression
- Activation function
- Multi-Layer Perceptron (MLP)
- Stochastic Gradient Descent (SGD)
- Back-propagation
- Convolution
- Pooling (or sub-sampling)
- Convolutional Neural Networks (CNN)
- Feature maps
- Dropout
- Batch Normalization

3 NOTATION
- $\{x, y\}$: a training example ($x$ the input, $y$ the label)
- $x$: a scalar
- $\mathbf{x}$: a vector
- $\mathbf{X}$: a matrix
- $\mathbf{W}$, $\theta$: network weights
- $J(\theta)$: a loss function

4 MNIST DATASET
- Dataset of handwritten digits: 60,000 training examples and 10,000 test examples.
- Digits are size-normalized and centered in fixed-size (28×28 pixel) images.
- An easy dataset for beginners in machine learning.

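5 SUPERVISED LEARNING
[Diagram: an input (an image of the digit 5) goes through a function that produces an output (classes: 0, 1, 2, etc.); a loss compares this output with the label (5) and yields an error.]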

6 OUR FIRST NEURAL NETWORK

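7 LINEAR REGRESSION
Linear function: $f(x, \mathbf{w}) = w_0 + w_1 x$
Objective: find $\mathbf{w} = (w_0, w_1)$ which minimizes the error
$$J(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{n} \left( f(x_i, \mathbf{w}) - y_i \right)^2$$
where $n$ is the number of training examples.
[Figure: plot of $y$ against $x$ with the fitted line; animation of the optimization problem.]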

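8 CLASSIFICATION FUNCTION
Binary classification: $f(x, \mathbf{w}) \in \{-1; +1\}$
Using a non-linearity function:
$$f(x, \mathbf{w}) = \begin{cases} +1 & \text{if } \tanh(w_0 + w_1 x) \geq 0 \\ -1 & \text{otherwise} \end{cases}$$
[Figure: linear classification, plot of $y$ against $x$.]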

9 ACTIVATION FUNCTION
- Threshold
- Tanh
- Sigmoid
- Rectified Linear Unit (ReLU)
- Leaky ReLU
- PReLU
- Etc.
[Figure: plots of the threshold, tanh, sigmoid and ReLU functions.]
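The activation functions listed above can be sketched in a few lines of plain Lua. This is only an illustration for scalar inputs: the 0/1 convention for the threshold and the default slope `alpha = 0.01` for Leaky ReLU are assumptions, not values taken from the slides.

```lua
-- Activation functions for a scalar input z.
local function threshold(z) return z >= 0 and 1 or 0 end

-- tanh written with math.exp so it runs on every Lua version.
local function tanh(z)
   local e = math.exp(2 * z)
   return (e - 1) / (e + 1)
end

local function sigmoid(z) return 1 / (1 + math.exp(-z)) end

local function relu(z) return math.max(0, z) end

-- alpha is an assumed slope for negative inputs; PReLU learns it instead.
local function leaky_relu(z, alpha)
   alpha = alpha or 0.01
   return z > 0 and z or alpha * z
end

for _, z in ipairs({-2, 0, 2}) do
   print(z, threshold(z), tanh(z), sigmoid(z), relu(z), leaky_relu(z))
end
```

ReLU and its variants keep a non-zero slope for large positive inputs, whereas tanh and sigmoid flatten out there.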

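10 PERCEPTRON
If $h(x)$ is an activation function, then a perceptron is defined as follows:
$$F(\mathbf{x}, \mathbf{w}) = h(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3) = h\left( \sum_{i=0}^{3} w_i x_i \right) = h(\mathbf{w}^T \mathbf{x})$$
with $x_0 = 1$, so that $w_0$ acts as the bias.
[Diagram: inputs $1, x_1, x_2, x_3$ weighted by $w_0, w_1, w_2, w_3$ and combined into the output $y$.]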

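11 FIRST LAYER OF NEURONS
$$F(\mathbf{x}, \mathbf{W}) = h(\mathbf{x}^T \mathbf{W}) = h\left( \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}^T \begin{bmatrix} w_{01} & w_{02} & w_{03} \\ w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix} \right) = h\left( \begin{bmatrix} w_{01} + x_1 w_{11} + x_2 w_{21} \\ w_{02} + x_1 w_{12} + x_2 w_{22} \\ w_{03} + x_1 w_{13} + x_2 w_{23} \end{bmatrix}^T \right) = h\left( \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}^T \right) = \begin{bmatrix} h(y_1) \\ h(y_2) \\ h(y_3) \end{bmatrix}^T$$
[Diagram: inputs $1, x_1, x_2$ fully connected to three neurons $y_1, y_2, y_3$.]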

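12 MLP: MULTI-LAYER PERCEPTRON
$$F_3(F_2(F_1(\mathbf{x}, \mathbf{W}_1), \mathbf{W}_2), \mathbf{W}_3) = F_3 \circ F_2 \circ F_1(\mathbf{x}) = (F_3 \circ F_2 \circ F_1)(\mathbf{x})$$
[Diagram: inputs $x_1, x_2$ pass through three successive layers $F_1, F_2, F_3$ to give the outputs $y_1, y_2, y_3$.]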

13 BUILDING OUR MLP
- Torch7 works with modules.
- Module is an abstract class which defines the fundamental methods necessary for training a neural network.
- Modules are serializable.
nn.Sequential: Input → nn.Linear → nn.ReLU → nn.Linear → Output
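A minimal sketch of this container in Torch7 could look as follows. The layer sizes (784 inputs for a flattened 28×28 MNIST image, 128 hidden units, 10 classes) are assumptions chosen for illustration; the slide does not give them.

```lua
require 'nn'

-- Input -> nn.Linear -> nn.ReLU -> nn.Linear -> Output
local model = nn.Sequential()
model:add(nn.Linear(784, 128))  -- 784 = 28 * 28 pixels; 128 hidden units (assumed)
model:add(nn.ReLU())
model:add(nn.Linear(128, 10))   -- one output per digit class

-- Forward a random vector standing in for a flattened image.
local input = torch.rand(784)
local output = model:forward(input)
print(output:size(1))           -- number of output scores
```

Printing `model` shows the sequence of modules, which is a quick way to check the architecture.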

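14 LOSS FUNCTION FOR CLASSIFICATION
Converting the network outputs $\mathbf{u}$ into probabilities (softmax), with $K$ the number of classes:
$$f(\mathbf{u})_j = \frac{e^{u_j}}{\sum_{k=1}^{K} e^{u_k}}$$
Negative log-likelihood of the target class $t$, given the probabilities $\mathbf{p}$:
$$J(\mathbf{p}, t) = -\log(p_t)$$
Combination of both:
$$J(\mathbf{u}, t) = -u_t + \log\left( \sum_{k=1}^{K} e^{u_k} \right)$$
[Example on the slide: a network output $\mathbf{u}$, the class probabilities $f(\mathbf{u})$, and the error $J(f(\mathbf{u}), 3) = -\log(f(\mathbf{u})_3)$ for target class 3.]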
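The softmax and negative log-likelihood can be checked numerically with a short plain-Lua sketch. The output vector `u` and the target `t` below are hypothetical values, not the numbers from the slide.

```lua
-- Softmax: turn raw network outputs u into class probabilities.
local function softmax(u)
   local max_u = u[1]
   for i = 2, #u do max_u = math.max(max_u, u[i]) end
   local p, sum = {}, 0
   for i = 1, #u do
      p[i] = math.exp(u[i] - max_u)  -- subtracting the max avoids overflow
      sum = sum + p[i]
   end
   for i = 1, #u do p[i] = p[i] / sum end
   return p
end

-- Negative log-likelihood of the target class t (1-based index).
local function nll(p, t) return -math.log(p[t]) end

-- Combined form: J(u, t) = -u_t + log(sum_k exp(u_k)).
local function cross_entropy(u, t)
   local sum = 0
   for i = 1, #u do sum = sum + math.exp(u[i]) end
   return -u[t] + math.log(sum)
end

local u = {1.0, 2.0, 0.5}  -- hypothetical network output for 3 classes
local t = 2                -- hypothetical target class
print(nll(softmax(u), t), cross_entropy(u, t))  -- the two values agree
```

The combined form avoids computing the probabilities explicitly, which is why a single criterion can replace the softmax followed by the log-likelihood.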

15 LOSS FUNCTION IN TORCH7
A Criterion works like a Module, except that it takes two parameters as input: the network output and the target.
Output → nn.LogSoftMax → nn.ClassNLLCriterion (with Target) → Error
or
Output → nn.CrossEntropyCriterion (with Target) → Error

16 HOW TO TRAIN A NEURAL NETWORK?

17 GRADIENT DESCENT
- Objective: minimize an objective (loss) function $J(\theta)$.
- The gradient gives the slope of the function.
- Update the parameters $\theta$ in the opposite direction of the gradient, scaled by a learning rate $\eta$.
- Repeat until convergence.
[Figure: plot of $J(\theta)$ against $\theta$, showing successive steps of size $\eta$ towards the minimum.]
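As a sketch of these steps, gradient descent can be applied to the linear regression loss from slide 7 in plain Lua. The toy data (points on the line y = 1 + 2x), the learning rate and the number of iterations are assumptions made for the example.

```lua
-- Toy data generated from y = 1 + 2x (assumed for illustration).
local xs = {0, 1, 2, 3}
local ys = {1, 3, 5, 7}

local w0, w1 = 0, 0   -- initial parameters
local eta = 0.05      -- learning rate (assumed value)

for step = 1, 2000 do
   -- Gradient of J(w) = 1/2 * sum_i (w0 + w1 * x_i - y_i)^2
   local g0, g1 = 0, 0
   for i = 1, #xs do
      local err = (w0 + w1 * xs[i]) - ys[i]
      g0 = g0 + err
      g1 = g1 + err * xs[i]
   end
   -- Move against the gradient.
   w0 = w0 - eta * g0
   w1 = w1 - eta * g1
end

print(w0, w1)  -- close to the generating parameters 1 and 2
```

With a learning rate that is too large the updates overshoot and diverge; with one that is too small, convergence needs many more iterations.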

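18 CHAIN RULE
Composition function: $F(x) = (f \circ g)(x) = f(g(x))$
Derivative of a composition function: $F'(x) = (f' \circ g)(x) \cdot g'(x) = f'(g(x)) \, g'(x)$
Using Leibniz's notation:
$$F'(x) = \frac{\partial F(x)}{\partial x} = \frac{\partial f(g(x))}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \cdot \frac{\partial g(x)}{\partial x}$$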

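19 BACK-PROPAGATION
[Diagram: $x \to f_1 \to y_1 \to f_2 \to y_2 \to f_3 \to y_3$, where each module $f_i$ has its own weights $w_i$.]
$$y_1 = f_1(x, w_1)$$
$$y_2 = f_2(y_1, w_2) = f_2(f_1(x, w_1), w_2)$$
$$y_3 = f_3(y_2, w_3) = f_3(f_2(y_1, w_2), w_3) = f_3(f_2(f_1(x, w_1), w_2), w_3)$$
Objective: $\nabla_{w_1, w_2, w_3}\, y_3 = \left( \nabla_{w_1} y_3 ;\ \nabla_{w_2} y_3 ;\ \nabla_{w_3} y_3 \right)$
$$\nabla_{w_3} y_3 = \frac{\partial y_3}{\partial w_3} = \frac{\partial f_3(y_2, w_3)}{\partial w_3}$$
$$\nabla_{w_2} y_3 = \frac{\partial y_3}{\partial w_2} = \frac{\partial f_3(y_2, w_3)}{\partial w_2} = \frac{\partial f_3(y_2, w_3)}{\partial y_2} \cdot \frac{\partial y_2}{\partial w_2} = \frac{\partial f_3(y_2, w_3)}{\partial y_2} \cdot \frac{\partial f_2(y_1, w_2)}{\partial w_2}$$
$$\nabla_{w_1} y_3 = \frac{\partial y_3}{\partial w_1} = \frac{\partial f_3(y_2, w_3)}{\partial w_1} = \frac{\partial f_3(y_2, w_3)}{\partial y_2} \cdot \frac{\partial y_2}{\partial w_1} = \frac{\partial f_3(y_2, w_3)}{\partial y_2} \cdot \frac{\partial f_2(y_1, w_2)}{\partial w_1} = \frac{\partial f_3(y_2, w_3)}{\partial y_2} \cdot \frac{\partial f_2(y_1, w_2)}{\partial y_1} \cdot \frac{\partial y_1}{\partial w_1} = \frac{\partial f_3(y_2, w_3)}{\partial y_2} \cdot \frac{\partial f_2(y_1, w_2)}{\partial y_1} \cdot \frac{\partial f_1(x, w_1)}{\partial w_1}$$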
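The three gradients can be verified on a small scalar chain in plain Lua. The concrete modules used here ($f_1 = w_1 x$, $f_2 = \tanh(w_2 y_1)$, $f_3 = w_3 y_2$) and the numeric values are assumptions picked for the example; the finite-difference check is a common sanity test, not something shown on the slide.

```lua
local function tanh(z)
   local e = math.exp(2 * z)
   return (e - 1) / (e + 1)
end

-- Assumed toy modules: f1 = w1 * x, f2 = tanh(w2 * y1), f3 = w3 * y2.
local function forward(x, w1, w2, w3)
   local y1 = w1 * x
   local y2 = tanh(w2 * y1)
   local y3 = w3 * y2
   return y3, y2, y1
end

-- Gradients of y3 with respect to w1, w2, w3, following the chain rule.
local function backward(x, w1, w2, w3)
   local _, y2, y1 = forward(x, w1, w2, w3)
   local dy3_dy2 = w3                        -- df3/dy2
   local dy2_dy1 = (1 - y2 * y2) * w2        -- df2/dy1
   local g3 = y2                             -- df3/dw3
   local g2 = dy3_dy2 * (1 - y2 * y2) * y1   -- df3/dy2 * df2/dw2
   local g1 = dy3_dy2 * dy2_dy1 * x          -- df3/dy2 * df2/dy1 * df1/dw1
   return g1, g2, g3
end

local x, w1, w2, w3 = 0.5, 0.8, -1.2, 2.0
local g1, g2, g3 = backward(x, w1, w2, w3)

-- Central finite difference on w1 as a sanity check.
local eps = 1e-6
local num_g1 = (forward(x, w1 + eps, w2, w3) - forward(x, w1 - eps, w2, w3)) / (2 * eps)
print(g1, num_g1)  -- analytic and numeric gradients should be close
```

Note that `dy3_dy2` is reused for both `g2` and `g1`: back-propagation computes each local derivative once and passes it backwards.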

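20 ONE STEP IN TORCH7
[Diagram: data → Model → output → Loss function (with target) → f_error; df_do flows back from the loss function into the model. The model accumulates the gradient $g_\theta$ (reset with $g_\theta \leftarrow 0$) and updates its parameters with $\theta \leftarrow \theta - \eta\, g_\theta$.]

    -- Reset gradients
    model:zeroGradParameters()
    -- Forward
    local output = model:forward(data)
    local f_error = loss_function:forward(output, target)
    -- Backward
    local df_do = loss_function:backward(output, target)
    model:backward(data, df_do)
    -- Update parameters (learning rate 0.01)
    model:updateParameters(0.01)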

21 BATCH GRADIENT DESCENT
Computes the gradient of the cost function for the entire dataset: $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$

    -- Reset gradients
    model:zeroGradParameters()
    for i = 1, traindata:size() do
       -- Forward
       local output = model:forward(traindata.data[i])
       local f_error = loss_function:forward(output, traindata.labels[i])
       -- Backward: gradients accumulate over the whole dataset
       local df_do = loss_function:backward(output, traindata.labels[i])
       model:backward(traindata.data[i], df_do)
    end
    -- Update parameters once, after the full pass
    model:updateParameters(0.01)

22 STOCHASTIC GRADIENT DESCENT
Performs a parameter update for each training example $(x^{(i)}, y^{(i)})$: $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$

    -- Create a random permutation
    local shuffle = torch.randperm(traindata:size())
    for i = 1, traindata:size() do
       -- Reset gradients
       model:zeroGradParameters()
       -- Forward
       local output = model:forward(traindata.data[shuffle[i]])
       local f_error = loss_function:forward(output, traindata.labels[shuffle[i]])
       -- Backward
       local df_do = loss_function:backward(output, traindata.labels[shuffle[i]])
       model:backward(traindata.data[shuffle[i]], df_do)
       -- Update parameters after every example
       model:updateParameters(0.01)
    end

23 MINI-BATCH SGD
Takes the best of both worlds and performs an update for every mini-batch of $n$ training examples: $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta; x^{(i:i+n)}, y^{(i:i+n)})$

    for i = 1, traindata:size(), batchsize do
       -- Reset gradients
       model:zeroGradParameters()
       -- Create batch (getbatch is a helper that is not defined on the slides)
       local batch = getbatch(traindata, batchsize)
       -- Forward
       local output = model:forward(batch.inputs)
       local f_error = loss_function:forward(output, batch.targets)
       -- Backward
       local df_do = loss_function:backward(output, batch.targets)
       model:backward(batch.inputs, df_do)
       -- Update parameters after every mini-batch
       model:updateParameters(0.01)
    end

24 SGD OPTIMIZATION ALGORITHMS
- Momentum: adds a fraction $\gamma$ of the previous update to the current one (gives inertia to the gradient):
  $v_t \leftarrow \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$, then $\theta \leftarrow \theta - v_t$
- NAG (Nesterov Accelerated Gradient): extension of momentum
- Adagrad: adapts the learning rate to each parameter individually
- Adadelta: extension of Adagrad
- RMSprop: another extension of Adagrad
- Adam: takes into account running estimates of the mean and variance of the gradients
- Etc.
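The momentum update can be sketched in plain Lua on a one-dimensional toy objective. The objective $J(\theta) = (\theta - 3)^2$, the learning rate and the momentum coefficient are assumptions made for the example.

```lua
-- Minimise J(theta) = (theta - 3)^2 with momentum (toy objective, assumed).
local function grad(theta) return 2 * (theta - 3) end

local theta, v = 0, 0
local eta, gamma = 0.1, 0.9   -- learning rate and momentum (assumed values)

for step = 1, 500 do
   v = gamma * v + eta * grad(theta)   -- v_t = gamma * v_{t-1} + eta * grad
   theta = theta - v                   -- theta = theta - v_t
end

print(theta)  -- approaches the minimum at theta = 3
```

Setting `gamma = 0` recovers plain gradient descent, which makes the two easy to compare on the same objective.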

25 GRADIENT DESCENT ILLUSTRATION See:

26 PACKAGE OPTIM IN TORCH7 Torch package providing several optimization algorithms. Easy to use, easy to switch from one optimizer to another.
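A minimal sketch of one optimization step with the optim package might look like this. The tiny model, the random input and the target are placeholders invented for the example; only the `feval` closure pattern and the `optim.sgd` call are the point.

```lua
require 'nn'
require 'optim'

local model = nn.Sequential():add(nn.Linear(4, 3))   -- tiny assumed model
local criterion = nn.CrossEntropyCriterion()
local params, gradParams = model:getParameters()     -- flattened parameters and gradients

local input, target = torch.rand(4), 2   -- hypothetical training example
local config = {learningRate = 0.01}

-- optim calls this closure to get the loss and its gradient at x.
local function feval(x)
   if x ~= params then params:copy(x) end
   gradParams:zero()
   local output = model:forward(input)
   local loss = criterion:forward(output, target)
   model:backward(input, criterion:backward(output, target))
   return loss, gradParams
end

local _, fs = optim.sgd(feval, params, config)
print(fs[1])   -- loss evaluated before the update
```

Switching optimizer only changes the last call, for example `optim.adam(feval, params, config)`, while `feval` stays the same.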

27 CONVOLUTIONAL NEURAL NETWORK

28 CONVOLUTION
$$(f * g)(x) = \int_{-\infty}^{\infty} f(t)\, g(x - t)\, dt$$

29 DISCRETE CONVOLUTION
$$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[n - m]\, g[m]$$
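For finite sequences the sum reduces to a double loop. The sketch below is plain Lua and assumes that values outside the sequences are zero; because Lua tables are 1-based, the index into `f` is shifted by one compared with the formula.

```lua
-- Full discrete convolution of two finite sequences (1-based tables).
-- Values outside the tables are treated as zero.
local function conv1d(f, g)
   local out = {}
   for n = 1, #f + #g - 1 do
      local s = 0
      for m = 1, #g do
         local k = n - m + 1          -- index into f (shifted for 1-based tables)
         if k >= 1 and k <= #f then
            s = s + f[k] * g[m]
         end
      end
      out[n] = s
   end
   return out
end

local y = conv1d({1, 2, 3}, {1, 1})
print(table.concat(y, " "))  -- 1 3 5 3
```

The mask `{1, 1}` sums each pair of neighbouring values, which is the one-dimensional version of the sliding mask.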

30 SLIDING MASK
Convolution tool from Rémi Emonet: http://dl.heeere.
Convolution layer in Torch7:

31 CONVOLUTION EXAMPLE
[Figure: an original image and the results of convolving it with four masks: Sharpen, Emboss, Blur, Edge detect.]

32-34 CONVOLUTIONAL NEURAL NETWORK
[Figure sequence: the same 3×3 mask of weights $w_1, \ldots, w_9$ is applied at successive positions of the input to produce the outputs $y_1$, $y_2$ and $y_3$; the weights are shared across positions.]

35 POOLING
Maximum pooling. Effect:
- Reduces the feature map's size
- Increases the field of view
Other variants:
- Average pooling
- Sum pooling
- Stochastic pooling
- Etc.

36 FIELD OF VIEW Convolution with mask size = 3 Pooling with mask size = 2
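Maximum pooling with a mask of size 2 can be sketched in plain Lua as follows. The 4×4 feature map is made up for the example, and the stride is assumed equal to the mask size (non-overlapping windows).

```lua
-- Max pooling over a 2D table (rows of numbers), stride equal to the mask size.
local function max_pool(map, size)
   local out = {}
   for r = 1, #map - size + 1, size do
      local row = {}
      for c = 1, #map[1] - size + 1, size do
         local best = map[r][c]
         for dr = 0, size - 1 do
            for dc = 0, size - 1 do
               best = math.max(best, map[r + dr][c + dc])
            end
         end
         row[#row + 1] = best
      end
      out[#out + 1] = row
   end
   return out
end

local feature_map = {   -- hypothetical 4x4 feature map
   {1, 3, 2, 0},
   {4, 2, 1, 5},
   {0, 1, 7, 2},
   {3, 2, 1, 1},
}
local pooled = max_pool(feature_map, 2)
for _, row in ipairs(pooled) do print(table.concat(row, " ")) end
-- prints "4 5" then "3 7"
```

Each pooled value summarizes a 2×2 region, so a mask of size 3 applied after this pooling covers a larger area of the original input.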

37 LENET 5 Gradient-based learning applied to document recognition, Yann LeCun, Léon Bottou, Yoshua Bengio and Patrick Haffner [1998].
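A simplified LeNet-5-style network can be sketched in Torch7 as below. It is not the exact architecture of the paper: max pooling, full connections between feature maps and a LogSoftMax output are substitutions made here for brevity. The 32×32 input and the layer sizes (6, 16, 120, 84, 10) follow the paper.

```lua
require 'nn'

local model = nn.Sequential()
model:add(nn.SpatialConvolution(1, 6, 5, 5))    -- 1x32x32 -> 6x28x28
model:add(nn.Tanh())
model:add(nn.SpatialMaxPooling(2, 2, 2, 2))     -- 6x28x28 -> 6x14x14
model:add(nn.SpatialConvolution(6, 16, 5, 5))   -- 6x14x14 -> 16x10x10
model:add(nn.Tanh())
model:add(nn.SpatialMaxPooling(2, 2, 2, 2))     -- 16x10x10 -> 16x5x5
model:add(nn.View(16 * 5 * 5))                  -- flatten the feature maps
model:add(nn.Linear(16 * 5 * 5, 120))
model:add(nn.Tanh())
model:add(nn.Linear(120, 84))
model:add(nn.Tanh())
model:add(nn.Linear(84, 10))                    -- one score per digit
model:add(nn.LogSoftMax())                      -- log-probabilities for nn.ClassNLLCriterion

local input = torch.rand(1, 32, 32)             -- random stand-in for a padded digit image
local output = model:forward(input)
print(output:size(1))
```

MNIST images are 28×28, so they would need to be padded to 32×32 (or the first convolution given a padding of 2) before being fed to this sketch.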

38 IF WE HAVE TIME

39 DROPOUT
- During training, for each forward pass, randomly set units to 0.
- At test time, no unit is dropped; activations are scaled so that the network keeps the same «energy» as during training.
[Diagram: Input → Dropout (with a drop factor) → Output]
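One common way to keep the same «energy» is the so-called inverted dropout, sketched here in plain Lua: kept units are scaled up during training so that nothing has to change at test time. The drop probability of 0.5 and the input values are assumptions; other formulations scale at test time instead.

```lua
-- Inverted dropout on a table of activations.
-- p is the drop probability; rand is a function returning a number in [0, 1).
local function dropout(x, p, training, rand)
   local out = {}
   for i = 1, #x do
      if not training then
         out[i] = x[i]                 -- test time: keep everything
      elseif rand() < p then
         out[i] = 0                    -- dropped unit
      else
         out[i] = x[i] / (1 - p)       -- rescale the kept units
      end
   end
   return out
end

local x = {1, 2, 3, 4}
local train_out = dropout(x, 0.5, true, math.random)
local test_out = dropout(x, 0.5, false, math.random)
print(table.concat(train_out, " "))   -- some units zeroed, the others doubled
print(table.concat(test_out, " "))    -- unchanged input
```

Passing the random generator as an argument makes the function easy to test with a fixed sequence of numbers.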

40 BATCH NORMALIZATION
During training, for each forward pass, normalize the data according to the statistics (mean and variance) of the mini-batch.

41 CREATING OUR OWN LAYERS
[Diagram: $x \to f_1 \to y_1 \to f_2 \to y_2 \to f_3 \to y_3$, where each module $f_i$ has its own weights $w_i$.]
Each module $f(x, w) = y$ has to compute:
- $y$
- $\partial y / \partial x$
- $\partial y / \partial w$
In Torch7 a new module has to override 3 functions:
- [output] updateOutput(input)
- [gradInput] updateGradInput(input, gradOutput)
- accGradParameters(input, gradOutput)
Torch7 documentation:
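As a sketch of these three functions, here is a hypothetical layer that multiplies its input by a single learnable scalar, $y = w\,x$. The class name `nn.ScalarScale` is invented for this example and is not part of the nn package.

```lua
require 'nn'

-- Hypothetical example layer: y = w * x with one learnable scalar w.
local ScalarScale, parent = torch.class('nn.ScalarScale', 'nn.Module')

function ScalarScale:__init()
   parent.__init(self)
   self.weight = torch.Tensor(1):fill(1)
   self.gradWeight = torch.Tensor(1):zero()
end

-- Computes y = w * x.
function ScalarScale:updateOutput(input)
   self.output:resizeAs(input):copy(input):mul(self.weight[1])
   return self.output
end

-- Computes the gradient with respect to the input: gradOutput * dy/dx = gradOutput * w.
function ScalarScale:updateGradInput(input, gradOutput)
   self.gradInput:resizeAs(gradOutput):copy(gradOutput):mul(self.weight[1])
   return self.gradInput
end

-- Accumulates the gradient with respect to w: sum(gradOutput * x).
function ScalarScale:accGradParameters(input, gradOutput, scale)
   scale = scale or 1
   self.gradWeight[1] = self.gradWeight[1] + scale * input:dot(gradOutput)
end

local layer = nn.ScalarScale()
local x = torch.Tensor({1, 2, 3})
local y = layer:forward(x)
layer:zeroGradParameters()
local gradInput = layer:backward(x, torch.ones(3))
print(layer.gradWeight[1])   -- 1 + 2 + 3 = 6
```

Because the layer exposes `weight` and `gradWeight`, the generic `zeroGradParameters` and `updateParameters` methods of nn.Module work on it without extra code.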

42 LINKS

43 TORCH7
- Torch7 documentation:
- Torch7:
- Optim package:
- Criterions:
- Convolutional modules:
- Some tutorial code in Torch7:
- Torch7 tutorials:
- Digit classifier:

44 TUTORIALS
- A Visual and Interactive Guide to the Basics of Neural Networks:
- An overview of gradient descent optimization algorithms:
- Artificial Intelligence:
- Andrew Ng's lessons on Coursera:
