EVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN)

TARGETED PIECES OF KNOWLEDGE: Linear regression, Activation function, Multi-Layer Perceptron (MLP), Stochastic Gradient Descent (SGD), Back-propagation, Convolution, Pooling (or sub-sampling), Convolutional Neural Networks (CNN), Feature maps, Dropout, Batch Normalization

NOTATION: {x, y}: a training example (x the input, y the label); x: a scalar; x (bold): a vector; X: a matrix; W, θ: network weights; J(θ): a loss function

MNIST DATASET Dataset of handwritten digits: 60,000 training examples and 10,000 test examples. Digits are size-normalized and centered in fixed-size images. An easy dataset for beginners in machine learning.

SUPERVISED LEARNING [Diagram] An input (e.g. an image of the digit 5) goes through a function that produces an output over the classes (0, 1, 2, etc.); a loss compares the output with the label and yields an error.

OUR FIRST NEURAL NETWORK

LINEAR REGRESSION [Plot: a fitted line over scattered (x, y) points] Linear function: f(x, w) = w_0 + w_1·x. Objective: find w = (w_0, w_1) which minimizes the error J(w) = (1/2) Σ_i (f(x_i, w) - y_i)². Animation of the optimization problem: https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/#train-your-dragon
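A minimal sketch (not from the slides) of fitting f(x, w) = w_0 + w_1·x by gradient descent in Torch7; the toy data, learning rate and iteration count are arbitrary choices for illustration.

require 'torch'

-- Toy data: y ≈ 2 + 0.8·x plus noise
local x = torch.linspace(0, 10, 50)
local y = x * 0.8 + 2 + torch.randn(50) * 0.5
local n = x:size(1)

local w0, w1 = 0, 0     -- parameters
local lr = 0.02         -- learning rate

for step = 1, 2000 do
  local err = x * w1 + w0 - y      -- f(x_i, w) - y_i
  local g0 = err:mean()            -- dJ/dw_0 (averaged over the data)
  local g1 = err:dot(x) / n        -- dJ/dw_1
  w0 = w0 - lr * g0
  w1 = w1 - lr * g1
end
print(w0, w1)   -- roughly 2 and 0.8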

CLASSIFICATION FUNCTION [Plot: points labelled -1/+1 along x with a linear decision boundary] Binary classification: f(x, w) ∈ {-1, +1}. Using a non-linear function: f(x, w) = +1 if tanh(w_0 + w_1·x) ≥ 0, -1 otherwise.

ACTIVATION FUNCTION [Plots of common activation functions] Threshold, Tanh, Sigmoid, Rectified Linear Unit (ReLU), Leaky ReLU, PReLU, etc.
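A quick sketch (not from the slides) evaluating a few of these activation modules in Torch7 on the same inputs:

require 'nn'

local z = torch.Tensor{-2, -0.5, 0, 0.5, 2}
print(nn.Tanh():forward(z))      -- values squashed into (-1, 1)
print(nn.Sigmoid():forward(z))   -- values squashed into (0, 1)
print(nn.ReLU():forward(z))      -- max(0, z)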

PERCEPTRON [Diagram: inputs 1, x_1, x_2, x_3 with weights w_0, w_1, w_2, w_3 feeding a single output y] If h is an activation function, then a perceptron is defined as follows: F(x, w) = h(w_0 + w_1·x_1 + w_2·x_2 + w_3·x_3) = h(Σ_i w_i·x_i) = h(wᵀx), with x_0 = 1 for the bias.

FIRST LAYER OF NEURONES [Diagram: inputs 1, x_1, x_2 fully connected to outputs y_1, y_2, y_3]
F(x, W) = h(xᵀW)
        = h( [1, x_1, x_2] · [[w_01, w_02, w_03], [w_11, w_12, w_13], [w_21, w_22, w_23]] )
        = h( [w_01 + x_1·w_11 + x_2·w_21,  w_02 + x_1·w_12 + x_2·w_22,  w_03 + x_1·w_13 + x_2·w_23] )
        = h( [y_1, y_2, y_3] )
        = [h(y_1), h(y_2), h(y_3)]

MLP: MULTI-LAYER PERCEPTRON [Diagram: stacked layers F_1, F_2, F_3] F_3(F_2(F_1(x, W_1), W_2), W_3) = F_3 ∘ F_2 ∘ F_1 (x) = (F_3 ∘ F_2 ∘ F_1)(x)

BUILDING OUR MLP Torch7 works with modules. Module is an abstract class which defines the fundamental methods necessary for training a neural network. Modules are serializable. [Diagram] nn.Sequential containing: Input → nn.Linear → nn.ReLU → nn.Linear → Output, as sketched below.
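A minimal sketch of this MLP in Torch7; the layer sizes (28×28 = 784 MNIST inputs, 128 hidden units, 10 classes) are assumptions for illustration.

require 'nn'

local model = nn.Sequential()
model:add(nn.Linear(28 * 28, 128))   -- input layer → hidden layer
model:add(nn.ReLU())                 -- non-linearity
model:add(nn.Linear(128, 10))        -- hidden layer → 10 class scores
print(model)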

LOSS FUNCTION FOR CLASSIFICATION Converting the network outputs into probabilities (softmax): f(u)_j = e^(u_j) / Σ_k e^(u_k). Negative log likelihood: J(p, t) = -log(p_t). Combination of both (cross-entropy): J(u, t) = -u_t + log(Σ_k e^(u_k)). Example: network output u = (14, 11, 16) gives class probabilities (0.1185, 0.0059, 0.8756); error J(f(u), 3) = -log(0.8756) = 0.1328.
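The numbers on the slide can be reproduced directly; a minimal sketch in Torch7:

require 'torch'

-- Network output u = (14, 11, 16), target class t = 3
local u = torch.Tensor{14, 11, 16}
local p = torch.exp(u)
p:div(p:sum())              -- softmax: p_j = e^(u_j) / Σ_k e^(u_k)
print(p)                    -- ≈ (0.1185, 0.0059, 0.8756)
print(-math.log(p[3]))      -- negative log likelihood ≈ 0.1328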

LOSS FUNCTION IN TORCH7 A Criterion is a special kind of Module that takes two parameters as input: the network output and the target. [Diagram] Output → nn.LogSoftMax → nn.ClassNLLCriterion (with Target) → Error, or Output → nn.CrossEntropyCriterion (with Target) → Error.
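A minimal sketch of the two equivalent setups, checked against the example from the previous slide (u = (14, 11, 16), target class 3):

require 'nn'

local u, t = torch.Tensor{14, 11, 16}, 3

-- Option 1: LogSoftMax as the last layer of the network + ClassNLLCriterion
print(nn.ClassNLLCriterion():forward(nn.LogSoftMax():forward(u), t))   -- ≈ 0.1328

-- Option 2: raw network outputs + CrossEntropyCriterion (LogSoftMax + NLL combined)
print(nn.CrossEntropyCriterion():forward(u, t))                        -- ≈ 0.1328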

HOW TO TRAIN A NEURAL NETWORK?

GRADIENT DESCENT [Plot: the loss J(θ) as a function of θ, with steps moving downhill] Update rule: θ ← θ - η·∇_θ J(θ). Objective: minimize an objective (loss) function J(θ). The gradient gives the slope of the function. Update the parameters θ in the opposite direction of the gradient, scaled by a learning rate η. Repeat until convergence.

CHAIN RULE Composite function: F(x) = (f ∘ g)(x) = f(g(x)). Derivative of a composite function: F'(x) = f'(g(x)) · g'(x). Using Leibniz's notation: ∂F(x)/∂x = ∂f(g(x))/∂x = (∂f(g(x))/∂g(x)) · (∂g(x)/∂x).
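As a worked example (my own, not from the slides), take f(u) = tanh(u) and g(x) = w_0 + w_1 x:

F(x) = \tanh(w_0 + w_1 x),
\qquad
\frac{\partial F(x)}{\partial x}
  = \frac{\partial f(g(x))}{\partial g(x)} \cdot \frac{\partial g(x)}{\partial x}
  = \left(1 - \tanh^2(w_0 + w_1 x)\right) w_1 .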

BACK-PROPAGATION [Diagram: x → f_1 → y_1 → f_2 → y_2 → f_3 → y_3, with weights w_1, w_2, w_3]
y_1 = f_1(x, w_1)
y_2 = f_2(y_1, w_2) = f_2(f_1(x, w_1), w_2)
y_3 = f_3(y_2, w_3) = f_3(f_2(f_1(x, w_1), w_2), w_3)
Objective: ∇_{w_1, w_2, w_3} y_3 = ( ∂y_3/∂w_1 ; ∂y_3/∂w_2 ; ∂y_3/∂w_3 )
∂y_3/∂w_3 = ∂f_3(y_2, w_3)/∂w_3
∂y_3/∂w_2 = ∂y_3/∂y_2 · ∂y_2/∂w_2 = ∂f_3(y_2, w_3)/∂y_2 · ∂f_2(y_1, w_2)/∂w_2
∂y_3/∂w_1 = ∂y_3/∂y_2 · ∂y_2/∂y_1 · ∂y_1/∂w_1 = ∂f_3(y_2, w_3)/∂y_2 · ∂f_2(y_1, w_2)/∂y_1 · ∂f_1(x, w_1)/∂w_1

ONE STEP IN TORCH7 [Diagram: data → Model → output → Loss function (with target) → error; gradients g_θ are accumulated during the backward pass and the parameters are updated with θ ← θ - η·g_θ]
-- Reset gradients
model:zeroGradParameters()
-- Forward
local output = model:forward(data)
local f_error = loss_function:forward(output, target)
-- Backward
local df_do = loss_function:backward(output, target)
model:backward(data, df_do)
-- Update parameters
model:updateParameters(0.01)

BATCH GRADIENT DESCENT Computes the gradient of the cost function for the entire dataset: θ ← θ - η·∇_θ J(θ)
-- Reset gradients
model:zeroGradParameters()
for i = 1, traindata:size() do
  -- Forward
  local output = model:forward(traindata.data[i])
  local f_error = loss_function:forward(output, traindata.labels[i])
  -- Backward (gradients accumulate over the whole dataset)
  local df_do = loss_function:backward(output, traindata.labels[i])
  model:backward(traindata.data[i], df_do)
end
-- Update parameters (a single update for the whole dataset)
model:updateParameters(0.01)

STOCHASTIC GRADIENT DESCENT Performs a parameter update for each training example (x^(i), y^(i)): θ ← θ - η·∇_θ J(θ, x^(i), y^(i))
-- Create a random permutation
local shuffle = torch.randperm(traindata:size())
for i = 1, traindata:size() do
  -- Reset gradients
  model:zeroGradParameters()
  -- Forward
  local output = model:forward(traindata.data[shuffle[i]])
  local f_error = loss_function:forward(output, traindata.labels[shuffle[i]])
  -- Backward
  local df_do = loss_function:backward(output, traindata.labels[shuffle[i]])
  model:backward(traindata.data[shuffle[i]], df_do)
  -- Update parameters (one update per training example)
  model:updateParameters(0.01)
end

MINI-BATCH SGD Takes the best of both worlds and performs an update for every mini-batch of n training examples: θ ← θ - η·∇_θ J(θ, x^(i:i+n), y^(i:i+n))
for i = 1, traindata:size(), batchsize do
  -- Reset gradients
  model:zeroGradParameters()
  -- Create batch (a possible getbatch helper is sketched below)
  local batch = getbatch(traindata, i, batchsize)
  -- Forward
  local output = model:forward(batch.inputs)
  local f_error = loss_function:forward(output, batch.targets)
  -- Backward
  local df_do = loss_function:backward(output, batch.targets)
  model:backward(batch.inputs, df_do)
  -- Update parameters (one update per mini-batch)
  model:updateParameters(0.01)
end
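The slide calls getbatch without showing it; here is one possible, purely hypothetical implementation, assuming traindata.data is a tensor whose first dimension indexes examples, traindata.labels is an N-element tensor, and the loop passes the current index i as in the cleaned-up code above.

-- Hypothetical getbatch helper (not from the slides): returns the slice of
-- examples starting at index i, clipped at the end of the dataset
local function getbatch(traindata, i, batchsize)
  local n = traindata.data:size(1)
  local size = math.min(batchsize, n - i + 1)
  return {
    inputs  = traindata.data:narrow(1, i, size),
    targets = traindata.labels:narrow(1, i, size),
  }
end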

SGD OPTIMIZATION ALGORITHMS Momentum: adds a fraction of the previously computed gradient (gives inertia to the gradient): v_t = γ·v_{t-1} + η·∇_θ J(θ), θ ← θ - v_t (a minimal sketch follows). NAG: extension of momentum. Adagrad: adapts the learning rate to each parameter individually. Adadelta: extension of Adagrad. RMSprop: another extension of Adagrad. Adam: takes into account the mean and variance of the gradients. Etc.
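A minimal sketch of the momentum update on flattened parameters; γ and η are assumed values, and `model`, with gradients already accumulated by a forward/backward pass as on the previous slides, is assumed to exist.

local params, gradParams = model:getParameters()
local v = params:clone():zero()
local gamma, eta = 0.9, 0.01

-- ... after a forward/backward pass has filled gradParams ...
v:mul(gamma):add(eta, gradParams)   -- v_t = gamma*v_{t-1} + eta*grad J(theta)
params:add(-1, v)                   -- theta = theta - v_t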

GRADIENT DESCENT ILLUSTRATION See: http://sebastianruder.com/optimizing-gradient-descent/index.html#whichoptimizertouse

PACKAGE OPTIM IN TORCH7 A Torch package providing several optimization algorithms. Easy to use, and easy to switch from one optimizer to another (see the sketch below). https://github.com/torch/optim
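A minimal sketch of one update with optim.sgd, reusing the model, loss_function and batch variables from the previous slides (their existence here is an assumption):

require 'nn'
require 'optim'

local params, gradParams = model:getParameters()
local config = {learningRate = 0.01, momentum = 0.9}

-- optim expects a closure returning the loss and the gradient of the loss
local function feval(p)
  if p ~= params then params:copy(p) end
  gradParams:zero()
  local output = model:forward(batch.inputs)
  local loss = loss_function:forward(output, batch.targets)
  model:backward(batch.inputs, loss_function:backward(output, batch.targets))
  return loss, gradParams
end

optim.sgd(feval, params, config)   -- one update; swap in optim.adam, optim.rmsprop, ...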

CONVOLUTIONAL NEURAL NETWORK

CONVOLUTION (f ∗ g)(x) = ∫ f(t)·g(x - t) dt

DISCRETE CONVOLUTION (f ∗ g)[n] = Σ_m f(n - m)·g(m)
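A direct implementation of this formula for finite 1-D signals, as an illustrative sketch (not from the slides):

-- (f*g)[n] = sum_m f[n - m] * g[m], for finite signals stored in Lua tables
local function conv1d(f, g)
  local out = {}
  for n = 1, #f + #g - 1 do
    local s = 0
    for m = 1, #g do
      local k = n - m + 1
      if k >= 1 and k <= #f then s = s + f[k] * g[m] end
    end
    out[n] = s
  end
  return out
end

-- Example: conv1d({1, 2, 3}, {1, 1}) returns {1, 3, 5, 3}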

SLIDING MASK Convolution tool from Rémi Emonet: http://dl.heeere.com/convolution/ Convolution layer in Torch7: https://github.com/torch/nn/blob/master/doc/convolution.md

CONVOLUTION EXAMPLE [Original image convolved with different 3×3 kernels] Sharpen: [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]; Emboss: [[-2, -1, 0], [-1, 1, 1], [0, 1, 2]]; Blur: [[1, 1, 1], [1, 1, 1], [1, 1, 1]]; Edge detect: [[0, 1, 0], [1, -4, 1], [0, 1, 0]].
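A minimal sketch applying the edge-detect kernel from the slide to a grayscale image; the image tensor here is a random placeholder, any 2-D tensor works.

require 'torch'

local img = torch.rand(28, 28)                 -- placeholder grayscale image
local edge = torch.Tensor{{0,  1, 0},
                          {1, -4, 1},
                          {0,  1, 0}}
local filtered = torch.conv2(img, edge, 'V')   -- 'valid' convolution -> 26x26 output
print(filtered:size())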

CONVOLUTIONAL NEURAL NETWORK [Diagram: the 3×3 edge-detect kernel slides over the input to produce the outputs y_1, y_2, y_3]

CONVOLUTIONAL NEURAL NETWORK [Diagram: the kernel values become learnable weights w_1 ... w_9]

CONVOLUTIONAL NEURAL NETWORK [Diagram: the same weights w_1 ... w_9 are shared across all positions of the input (weight sharing)]
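A minimal sketch of such a convolution layer in Torch7; the number of feature maps and the image size are assumptions for illustration.

require 'nn'

-- 1 input plane, 16 feature maps, 3x3 kernels whose weights are learned
-- and shared across all spatial positions
local conv = nn.SpatialConvolution(1, 16, 3, 3)
local input = torch.rand(1, 28, 28)     -- a 1x28x28 image
print(conv:forward(input):size())       -- 16x26x26 feature maps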

POOLING Maximum pooling example (2×2 window, stride 2):
Input: [[-1, 17, -18, 13], [-15, 19, 11, 5], [-10, -3, 2, 18], [-4, 4, -12, 13]] → Output: [[19, 13], [4, 18]]
Effect: reduces the feature map's size and increases the field of view. Variants: average pooling, sum pooling, stochastic pooling, etc.
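The slide's example can be reproduced with the max-pooling module in Torch7; a minimal sketch:

require 'nn'

local fmap = torch.Tensor{{ -1, 17, -18, 13},
                          {-15, 19,  11,  5},
                          {-10, -3,   2, 18},
                          { -4,  4, -12, 13}}
local pool = nn.SpatialMaxPooling(2, 2, 2, 2)   -- 2x2 window, stride 2
print(pool:forward(fmap:view(1, 4, 4)))         -- {{19, 13}, {4, 18}}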

FIELD OF VIEW Convolution with mask size = 3 Pooling with mask size = 2

LENET 5 Gradient-based learning applied to document recognition, Yann LeCun, Léon Bottou, Yoshua Bengio and Patrick Haffner [1998].
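A LeNet-5-style network for 1×32×32 inputs in Torch7, as an illustrative sketch only; it is not an exact reproduction of the 1998 architecture (layer sizes and activations are assumptions).

require 'nn'

local net = nn.Sequential()
net:add(nn.SpatialConvolution(1, 6, 5, 5))    -- 6 feature maps of 28x28
net:add(nn.Tanh())
net:add(nn.SpatialMaxPooling(2, 2, 2, 2))     -- 6 x 14x14
net:add(nn.SpatialConvolution(6, 16, 5, 5))   -- 16 feature maps of 10x10
net:add(nn.Tanh())
net:add(nn.SpatialMaxPooling(2, 2, 2, 2))     -- 16 x 5x5
net:add(nn.View(16 * 5 * 5))                  -- flatten (single image; use :setNumInputDims(3) for mini-batches)
net:add(nn.Linear(16 * 5 * 5, 120))
net:add(nn.Tanh())
net:add(nn.Linear(120, 84))
net:add(nn.Tanh())
net:add(nn.Linear(84, 10))                    -- 10 class scores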

IF WE HAVE TIME

DROPOUT During training, for each forward pass, randomly set units to 0. Example (drop factor = 0.3):
Input: [[13, 9, 7, 4], [4, 16, 2, 5], [0, 6, 8, 19], [4, 7, 7, 5]] → Output: [[13, 0, 7, 4], [4, 0, 2, 0], [0, 6, 8, 0], [0, 0, 7, 5]]
At test time, no units are dropped; the activations are rescaled so the same "energy" goes into the network.
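A minimal sketch of dropout in Torch7 with the slide's drop factor of 0.3; note that by default nn.Dropout rescales the kept units by 1/(1-p) during training, so the module is simply the identity at test time.

require 'nn'

local drop = nn.Dropout(0.3)

drop:training()
print(drop:forward(torch.ones(8)))   -- some units set to 0, the rest scaled by 1/0.7

drop:evaluate()
print(drop:forward(torch.ones(8)))   -- identity at test time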

BATCH NORMALIZATION During training, for each forward pass, normalize the activations according to the statistics of the mini-batch: x̂ = (x - μ_B) / sqrt(σ_B² + ε), then scale and shift with learned parameters: y = γ·x̂ + β.
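A minimal sketch of the corresponding Torch7 modules; the feature sizes are assumptions for illustration.

require 'nn'

local mlpBN  = nn.BatchNormalization(128)         -- after a nn.Linear(..., 128)
local convBN = nn.SpatialBatchNormalization(16)   -- after a conv layer with 16 feature maps

-- During training these layers use mini-batch statistics; at test time they
-- use running estimates, so call model:training() / model:evaluate() accordingly.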

CREATING OUR OWN LAYERS [Diagram: the layered network x → f_1 → y_1 → f_2 → y_2 → f_3 → y_3 with weights w_1, w_2, w_3 again] Each module y = f(x, w) has to compute ∂y/∂x and ∂y/∂w. In Torch7, a new module has to override 3 functions: [output] updateOutput(input), [gradInput] updateGradInput(input, gradOutput), accGradParameters(input, gradOutput). Torch7 documentation: http://torch.ch/docs/developer-docs.html
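A skeleton of a custom module, as a minimal sketch: a layer that multiplies its input by a fixed constant, so it has no learnable parameters and accGradParameters can stay empty. The module name is my own invention for illustration.

require 'nn'

local MyMulConstant, parent = torch.class('nn.MyMulConstant', 'nn.Module')

function MyMulConstant:__init(constant)
  parent.__init(self)
  self.constant = constant or 2
end

function MyMulConstant:updateOutput(input)
  -- y = f(x, w): here simply constant * x
  self.output:resizeAs(input):copy(input):mul(self.constant)
  return self.output
end

function MyMulConstant:updateGradInput(input, gradOutput)
  -- dJ/dx = dJ/dy * dy/dx = gradOutput * constant
  self.gradInput:resizeAs(gradOutput):copy(gradOutput):mul(self.constant)
  return self.gradInput
end

-- accGradParameters(input, gradOutput) would accumulate dJ/dw for a module
-- with learnable parameters; it is omitted here because there are none.
-- Usage: print(nn.MyMulConstant(3):forward(torch.ones(2)))  -- prints 3, 3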

LINKS

TORCH7
Torch7 documentation:
Torch7: http://torch.ch/
Optim package: https://github.com/torch/optim
Criterions: https://github.com/torch/nn/blob/master/doc/criterion.md
Convolutional modules: https://github.com/torch/nn/blob/master/doc/convolution.md
Some tutorial code in Torch7:
Torch7 tutorials: https://github.com/torch/tutorials
Digit classifier: https://github.com/torch/demos/tree/master/train-a-digit-classifier

TUTORIALS
A Visual and Interactive Guide to the Basics of Neural Networks: https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/
An overview of gradient descent optimization algorithms: http://sebastianruder.com/optimizing-gradient-descent/index.html
Artificial Intelligence (gitbook): https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/
Andrew Ng's machine learning course on Coursera: https://www.coursera.org/learn/machine-learning

TARGETED PIECES OF KNOWLEDGE: Linear regression, Activation function, Multi-Layer Perceptron (MLP), Stochastic Gradient Descent (SGD), Back-propagation, Convolution, Pooling (or sub-sampling), Convolutional Neural Networks (CNN), Feature maps, Dropout, Batch Normalization