Neural networks and optimization

Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85

1 Introduction 2 Deep networks 3 Optimization 4 Convolutional neural networks Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 2 / 85

1 Introduction 2 Deep networks 3 Optimization 4 Convolutional neural networks Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 3 / 85

Goal: classification and regression. Medical imaging: cancer or not? (Classification) Autonomous driving: optimal wheel position (Regression) Kinect: where are the limbs? (Regression) OCR: what are the characters? (Classification) Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 4 / 85

Goal: classification and regression. Medical imaging: cancer or not? (Classification) Autonomous driving: optimal wheel position (Regression) Kinect: where are the limbs? (Regression) OCR: what are the characters? (Classification) It also works for structured outputs Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 4 / 85

Object recognition Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 5 / 85

Object recognition Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 6 / 85

1 Introduction 2 Deep networks 3 Optimization 4 Convolutional neural networks Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 7 / 85

Linear classifier Dataset: $(X^{(i)}, Y^{(i)})$ pairs, $i = 1, \dots, N$, with $X^{(i)} \in \mathbb{R}^n$ and $Y^{(i)} \in \{-1, 1\}$. Goal: find $w$ and $b$ such that $\operatorname{sign}(w^\top X^{(i)} + b) = Y^{(i)}$. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 8 / 85

Perceptron algorithm (Rosenblatt, 57) $w_0 = 0$, $b_0 = 0$, $\hat{Y}^{(i)} = \operatorname{sign}(w^\top X^{(i)} + b)$, $w_{t+1} \leftarrow w_t + \sum_i \bigl(Y^{(i)} - \hat{Y}^{(i)}\bigr) X^{(i)}$, $b_{t+1} \leftarrow b_t + \sum_i \bigl(Y^{(i)} - \hat{Y}^{(i)}\bigr)$. Linearly separable demo Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 9 / 85
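
Not in the original slides: a minimal NumPy sketch of the batch perceptron update written above, with illustrative variable names, just to make the update rule concrete.

```python
import numpy as np

def perceptron(X, Y, n_iters=100):
    """Batch perceptron update from the slide.
    X: (N, n) array of inputs, Y: (N,) array of labels in {-1, +1}."""
    N, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iters):
        Y_hat = np.sign(X @ w + b)   # current predictions
        err = Y - Y_hat              # zero on correctly classified points
        if not err.any():            # converged (only happens if the data are separable)
            break
        w += err @ X                 # w_{t+1} = w_t + sum_i (Y^(i) - Yhat^(i)) X^(i)
        b += err.sum()               # b_{t+1} = b_t + sum_i (Y^(i) - Yhat^(i))
    return w, b
```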

Some data are not separable The Perceptron algorithm is NOT convergent for non-linearly separable data. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 10 / 85

Non-convergence of the perceptron algorithm Non-linearly separable perceptron We need an algorithm which works on both separable and non-separable data. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 11 / 85

Classification error A good classifier minimizes $\ell(w) = \sum_i \ell_i = \sum_i \ell\bigl(\hat{Y}^{(i)}, Y^{(i)}\bigr) = \sum_i \mathbf{1}_{\operatorname{sign}(w^\top X^{(i)} + b) \neq Y^{(i)}}$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 12 / 85

Cost function Classification error is not smooth. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 13 / 85

Cost function Classification error is not smooth. Sigmoid is smooth but not convex. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 13 / 85

Convex cost functions Classification error is not smooth. Sigmoid is smooth but not convex. Logistic loss is a convex upper bound. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 14 / 85

Convex cost functions Classification error is not smooth. Sigmoid is smooth but not convex. Logistic loss is a convex upper bound. Hinge loss (SVMs) is very much like logistic. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 14 / 85
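
Not part of the slides: a small NumPy sketch comparing the losses just mentioned, written as functions of the margin $m = Y (w^\top X + b)$; the exact scalings (e.g. the base of the logarithm) are a convention choice.

```python
import numpy as np

def zero_one_loss(m):
    return (m <= 0).astype(float)       # classification error: not smooth

def sigmoid_loss(m):
    return 1.0 / (1.0 + np.exp(m))      # smooth but not convex

def logistic_loss(m):
    return np.log1p(np.exp(-m))         # convex surrogate of the 0-1 loss

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)     # SVM loss, very much like logistic

margins = np.linspace(-2.0, 2.0, 5)
for loss in (zero_one_loss, sigmoid_loss, logistic_loss, hinge_loss):
    print(loss.__name__, np.round(loss(margins), 3))
```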

Interlude Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 15 / 85

Convex function Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 16 / 85

Non-convex function Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 17 / 85

Not-convex-but-still-OK function Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 18 / 85

End of interlude Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 19 / 85

Solving separable AND non-separable problems Non-linearly separable logistic Linearly separable logistic Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 20 / 85

Non-linear classification Non-linearly separable polynomial Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 21 / 85

Non-linear classification Features $X_1, X_2$: linear classifier. Features $X_1, X_2, X_1 X_2, X_1^2, \dots$: non-linear classifier. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 22 / 85

Choosing the features To make it work, I created lots of extra features: $X_1^i X_2^j$ for $i, j \in [0, 4]$ (p = 18) Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 23 / 85

Choosing the features To make it work, I created lots of extra features: $X_1^i X_2^j$ for $i, j \in [0, 4]$ (p = 18) Would it work with fewer features? Test with $X_1^i X_2^j$ for $i, j \in [1, 3]$ Non-linearly separable polynomial degree 2 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 23 / 85
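
An illustrative NumPy sketch (not from the slides) of building hand-designed monomial features $X_1^i X_2^j$ like the ones above, so that a linear classifier on them becomes non-linear in the original inputs.

```python
import numpy as np

def monomial_features(X1, X2, degree=4):
    """Features X1^i * X2^j for i, j in [0, degree], one column per monomial."""
    return np.stack([X1 ** i * X2 ** j
                     for i in range(degree + 1)
                     for j in range(degree + 1)], axis=-1)

X1 = np.array([0.5, -1.0, 2.0])
X2 = np.array([1.0, 0.3, -0.7])
print(monomial_features(X1, X2).shape)   # (3, 25) for degree 4
```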

A graphical view of the classifiers $f(X) = w_1 X_1 + w_2 X_2 + b$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 24 / 85

A graphical view of the classifiers $f(X) = w_1 X_1 + w_2 X_2 + w_3 X_1^2 + w_4 X_2^2 + w_5 X_1 X_2 + \dots$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 24 / 85

Non-linear features A linear classifier on a non-linear transformation is non-linear. A non-linear classifier relies on non-linear features. Which ones do we choose? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 25 / 85

Non-linear features A linear classifier on a non-linear transformation is non-linear. A non-linear classifier relies on non-linear features. Which ones do we choose? Example: $H_j = X_1^{p_j} X_2^{q_j}$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 25 / 85

Non-linear features A linear classifier on a non-linear transformation is non-linear. A non-linear classifier relies on non-linear features. Which ones do we choose? Example: $H_j = X_1^{p_j} X_2^{q_j}$ SVM: $H_j = K(X, X^{(j)})$ with $K$ some kernel function Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 25 / 85

Non-linear features A linear classifier on a non-linear transformation is non-linear. A non-linear classifier relies on non-linear features. Which ones do we choose? Example: $H_j = X_1^{p_j} X_2^{q_j}$ SVM: $H_j = K(X, X^{(j)})$ with $K$ some kernel function Do they have to be predefined? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 25 / 85

Non-linear features A linear classifier on a non-linear transformation is non-linear. A non-linear classifier relies on non-linear features. Which ones do we choose? Example: $H_j = X_1^{p_j} X_2^{q_j}$ SVM: $H_j = K(X, X^{(j)})$ with $K$ some kernel function Do they have to be predefined? A neural network will learn the $H_j$'s Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 25 / 85

A 3-step algorithm for a neural network 1 Pick an example x Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 26 / 85

A 3-step algorithm for a neural network 1 Pick an example x 2 Transform it into $\hat{x} = Vx$ with some matrix $V$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 26 / 85

A 3-step algorithm for a neural network 1 Pick an example x 2 Transform it into $\hat{x} = Vx$ with some matrix $V$ 3 Compute $w^\top \hat{x} + b$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 26 / 85

A 3-step algorithm for a neural network 1 Pick an example x 2 Transform it into $\hat{x} = Vx$ with some matrix $V$ 3 Compute $w^\top \hat{x} + b = w^\top V x + b = \hat{w}^\top x + b$ (still linear in $x$) Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 27 / 85

A 3-step algorithm for a neural network 1 Pick an example x 2 Transform it into $\hat{x} = Vx$ with some matrix $V$ 3 Apply a nonlinear function $g$ to all the elements of $\hat{x}$ 4 Compute $w^\top \hat{x} + b = w^\top g(Vx) + b \neq \hat{w}^\top x + b$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 28 / 85

Neural networks Usually, we use $H_j = g(v_j^\top X)$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 29 / 85

Neural networks Usually, we use $H_j = g(v_j^\top X)$. $H_j$: hidden unit. $v_j$: input weight. $g$: transfer function. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 29 / 85

Transfer function $f(X) = \sum_j w_j H_j(X) + b = \sum_j w_j\, g(v_j^\top X) + b$. $g$ is the transfer function. $g$ used to be the sigmoid or the tanh. If $g$ is the sigmoid, each hidden unit is a soft classifier. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 30 / 85

Transfer function $f(X) = \sum_j w_j H_j(X) + b = \sum_j w_j\, g(v_j^\top X) + b$. $g$ is the transfer function. $g$ is now often the positive part. This is called a rectified linear unit. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 31 / 85
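
A small NumPy sketch (not from the slides; the names and sizes are illustrative) of this single-hidden-layer model $f(X) = \sum_j w_j\, g(v_j^\top X) + b$ with the rectified-linear transfer function.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)            # "positive part" transfer function

def forward(X, V, w, b, g=relu):
    """f(X) = sum_j w_j g(v_j^T X) + b for a batch of inputs.
    X: (N, n) inputs, V: (k, n) input weights, w: (k,) output weights, b: scalar."""
    H = g(X @ V.T)                       # hidden units H_j = g(v_j^T X), shape (N, k)
    return H @ w + b                     # linear classifier on the learned features

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))              # 5 examples, 2 raw features
V = rng.normal(size=(10, 2))             # 10 hidden units
w = rng.normal(size=10)
print(forward(X, V, w, b=0.0).shape)     # (5,)
```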

Neural networks $f(X) = \sum_j w_j H_j(X) + b = \sum_j w_j\, g(v_j^\top X) + b$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 32 / 85

Example on the non-separable problem Non-linearly separable neural net 3 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 33 / 85

Training a neural network Dataset: $(X^{(i)}, Y^{(i)})$ pairs, $i = 1, \dots, N$. Goal: find $w$ and $b$ such that $\operatorname{sign}(w^\top X^{(i)} + b) = Y^{(i)}$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 34 / 85

Training a neural network Dataset: $(X^{(i)}, Y^{(i)})$ pairs, $i = 1, \dots, N$. Goal: find $w$ and $b$ to minimize $\sum_i \log\bigl(1 + \exp\bigl(-Y^{(i)} (w^\top X^{(i)} + b)\bigr)\bigr)$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 34 / 85

Training a neural network Dataset: $(X^{(i)}, Y^{(i)})$ pairs, $i = 1, \dots, N$. Goal: find $v_1, \dots, v_k$, $w$ and $b$ to minimize $\sum_i \log\Bigl(1 + \exp\Bigl(-Y^{(i)} \bigl(\sum_j w_j\, g(v_j^\top X^{(i)}) + b\bigr)\Bigr)\Bigr)$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 34 / 85

Cost function $s$ = cost function (logistic loss, hinge loss, ...). $\ell(V, w, b, X^{(i)}, Y^{(i)}) = s\bigl(\hat{Y}^{(i)}, Y^{(i)}\bigr) = s\bigl(\sum_j w_j H_j(X^{(i)}),\, Y^{(i)}\bigr) = s\bigl(\sum_j w_j\, g(v_j^\top X^{(i)}),\, Y^{(i)}\bigr)$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 35 / 85

Cost function $s$ = cost function (logistic loss, hinge loss, ...). $\ell(V, w, b, X^{(i)}, Y^{(i)}) = s\bigl(\hat{Y}^{(i)}, Y^{(i)}\bigr) = s\bigl(\sum_j w_j H_j(X^{(i)}),\, Y^{(i)}\bigr) = s\bigl(\sum_j w_j\, g(v_j^\top X^{(i)}),\, Y^{(i)}\bigr)$ We can minimize it using gradient descent. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 35 / 85
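
To make the "minimize it using gradient descent" step concrete, here is a sketch (mine, not the speaker's) of the logistic loss of the one-hidden-layer network above, together with hand-derived, backpropagated gradients.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def loss_and_grads(V, w, b, X, Y):
    """Average logistic loss of f(X) = sum_j w_j relu(v_j^T X) + b, plus gradients.
    X: (N, n) inputs, Y: (N,) labels in {-1, +1}."""
    Z = X @ V.T                               # pre-activations, (N, k)
    H = relu(Z)                               # hidden units, (N, k)
    f = H @ w + b                             # scores, (N,)
    loss = np.log1p(np.exp(-Y * f)).mean()
    df = -Y / (1.0 + np.exp(Y * f)) / len(Y)  # d loss / d f, one value per example
    grad_w = H.T @ df
    grad_b = df.sum()
    dZ = np.outer(df, w) * (Z > 0)            # backpropagate through w and the ReLU
    grad_V = dZ.T @ X
    return loss, grad_V, grad_w, grad_b
```

Full-batch gradient descent then repeats $\theta \leftarrow \theta - \alpha\, \partial L / \partial \theta$ with these gradients; the stochastic variant discussed later replaces the average over all examples by a single one.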

Finding a good transformation of x $H_j(x) = g(v_j^\top x)$ Is it a good $H_j$? What can we say about it? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 36 / 85

Finding a good transformation of x $H_j(x) = g(v_j^\top x)$ Is it a good $H_j$? What can we say about it? It is theoretically enough (universal approximation) Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 36 / 85

Finding a good transformation of x $H_j(x) = g(v_j^\top x)$ Is it a good $H_j$? What can we say about it? It is theoretically enough (universal approximation) That does not mean you should use it Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 36 / 85

Going deeper $H_j(x) = g(v_j^\top x)$, i.e. $\hat{x}_j = g(v_j^\top x)$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 37 / 85

Going deeper $H_j(x) = g(v_j^\top x)$, i.e. $\hat{x}_j = g(v_j^\top x)$ We can transform $\hat{x}$ as well. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 37 / 85

Going from 2 to 3 layers Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 38 / 85

Going from 2 to 3 layers We can have as many layers as we like Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 38 / 85

Going from 2 to 3 layers We can have as many layers as we like But it makes the optimization problem harder Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 38 / 85

1 Introduction 2 Deep networks 3 Optimization 4 Convolutional neural networks Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 39 / 85

The need for fast learning Neural networks may need many examples (several millions or more). We need to be able to use them quickly. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 40 / 85

Batch methods $L(\theta) = \frac{1}{N} \sum_i \ell(\theta, X^{(i)}, Y^{(i)})$, $\theta_{t+1} \leftarrow \theta_t - \frac{\alpha_t}{N} \sum_i \frac{\partial \ell(\theta, X^{(i)}, Y^{(i)})}{\partial \theta}$. To compute one update of the parameters, we need to go through all the data. This can be very expensive. What if we have an infinite amount of data? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 41 / 85

Potential solutions 1 Discard data. Seems stupid Yet many people do it Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 42 / 85

Potential solutions 1 Discard data. Seems stupid Yet many people do it 2 Use approximate methods. Update = average of the updates for all datapoints. Are these updates really different? If not, how can we learn faster? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 42 / 85

Stochastic gradient descent $L(\theta) = \frac{1}{N} \sum_i \ell(\theta, X^{(i)}, Y^{(i)})$, $\theta_{t+1} \leftarrow \theta_t - \frac{\alpha_t}{N} \sum_i \frac{\partial \ell(\theta, X^{(i)}, Y^{(i)})}{\partial \theta}$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 43 / 85

Stochastic gradient descent $L(\theta) = \frac{1}{N} \sum_i \ell(\theta, X^{(i)}, Y^{(i)})$, $\theta_{t+1} \leftarrow \theta_t - \alpha_t \frac{\partial \ell(\theta, X^{(i_t)}, Y^{(i_t)})}{\partial \theta}$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 43 / 85

Stochastic gradient descent $L(\theta) = \frac{1}{N} \sum_i \ell(\theta, X^{(i)}, Y^{(i)})$, $\theta_{t+1} \leftarrow \theta_t - \alpha_t \frac{\partial \ell(\theta, X^{(i_t)}, Y^{(i_t)})}{\partial \theta}$ What do we lose when updating the parameters to satisfy just one example? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 43 / 85
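
A minimal sketch (not from the slides) of the stochastic update $\theta_{t+1} \leftarrow \theta_t - \alpha_t\, \partial \ell(\theta, X^{(i_t)}, Y^{(i_t)}) / \partial \theta$, assuming a per-example gradient function is available; the decaying step-size schedule is one common choice, not the speaker's.

```python
import numpy as np

def sgd(theta, per_example_grad, X, Y, alpha0=0.1, n_steps=10_000, seed=0):
    """Stochastic gradient descent: one randomly drawn example per update.
    per_example_grad(theta, x, y) returns d l(theta, x, y) / d theta."""
    rng = np.random.default_rng(seed)
    for t in range(n_steps):
        i = rng.integers(len(X))                  # draw the index i_t
        alpha_t = alpha0 / (1.0 + t / 1000.0)     # decaying step size
        theta = theta - alpha_t * per_example_grad(theta, X[i], Y[i])
    return theta
```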

Disagreement $\mu^2/\sigma^2$ during optimization (log scale) As optimization progresses, disagreement increases Early on, one can pick one example at a time What about later? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 44 / 85

Stochastic vs. deterministic methods Goal = best of both worlds: linear rate with O(1) iteration cost. [Plot: log(excess cost) vs. time, stochastic and deterministic methods.] Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 45 / 85

Stochastic vs. deterministic methods Goal = best of both worlds: linear rate with O(1) iteration cost. [Plot: log(excess cost) vs. time, stochastic and deterministic methods.] For non-convex problems, stochastic works best. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 45 / 85

Understanding the loss function of deep networks Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 46 / 85

Understanding the loss function of deep networks Recent studies say most local minima are close to the optimum. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 46 / 85

Regularizing deep networks Deep networks have many parameters How do we avoid overfitting? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 47 / 85

Regularizing deep networks Deep networks have many parameters How do we avoid overfitting? Regularizing the parameters Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 47 / 85

Regularizing deep networks Deep networks have many parameters How do we avoid overfitting? Regularizing the parameters Limiting the number of units Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 47 / 85

Regularizing deep networks Deep networks have many parameters How do we avoid overfitting? Regularizing the parameters Limiting the number of units Dropout Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 47 / 85

Dropout Overfitting happens when each unit gets too specialized Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 48 / 85

Dropout Overfitting happens when each unit gets too specialized The core idea is to prevent each unit from relying too much on the others Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 48 / 85

Dropout Overfitting happens when each unit gets too specialized The core idea is to prevent each unit from relying too much on the others How do we do this? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 48 / 85

Dropout - an illustration Figure taken from Dropout : A Simple Way to Prevent Neural Networks from Overfitting Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 49 / 85

Dropout - a conclusion By removing units at random, we force the others to be less specialized At test time, we use all the units Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 50 / 85

Dropout - a conclusion By removing units at random, we force the others to be less specialized At test time, we use all the units This is sometimes presented as an ensemble method. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 50 / 85
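
A small NumPy sketch (my own, not from the slides) of the mechanism just described, using the common "inverted dropout" scaling so that expected activations match between training and test time.

```python
import numpy as np

def dropout(H, p_drop=0.5, train=True, rng=None):
    """Randomly zero hidden activations H during training; keep all units at test time."""
    if not train:
        return H                              # at test time, we use all the units
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(H.shape) >= p_drop      # units that survive this pass
    return H * mask / (1.0 - p_drop)          # rescale so expected activations are unchanged
```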

Neural networks - Summary A linear classifier in a feature space can model non-linear boundaries. Finding a good feature space is essential. One can design the feature map by hand. One can learn the feature map, using fewer features than if it were done by hand. Learning the feature map is potentially HARD (non-convexity). Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 51 / 85

Recipe for training a neural network Use rectified linear units Use dropout Use stochastic gradient descent Use 2000 units on each layer Try different numbers of layers Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 52 / 85

1 Introduction 2 Deep networks 3 Optimization 4 Convolutional neural networks Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 53 / 85

$v_j$'s for images $f(X) = \sum_j w_j H_j(X) + b = \sum_j w_j\, g(v_j^\top X) + b$ If $X$ is an image, $v_j$ is an image too. $v_j$ acts as a filter (presence or absence of a pattern). What does $v_j$ look like? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 54 / 85

v j s for images - Examples Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 55 / 85

v j s for images - Examples Filters are mostly local Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 55 / 85

Basic idea of convolutional neural networks Filters are mostly local. Instead of using image-wide filters, use small ones over patches. Repeat for every patch to get a response image. Subsample the response image to get local invariance. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 56 / 85
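
An illustrative NumPy sketch (not from the slides) of the patch-wise filtering step: the same small filter is applied to every patch of the image, producing a response image.

```python
import numpy as np

def filter_image(image, filt):
    """Slide a small filter over every patch ('valid' positions only)
    and record one response per position."""
    H, W = image.shape
    h, w = filt.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * filt)
    return out

image = np.random.default_rng(0).random((8, 8))
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])   # a hypothetical 2x2 filter
print(filter_image(image, edge_filter).shape)        # (7, 7) response image
```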

Filtering - Filter 1 Original image Filter Output image Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 57 / 85

Filtering - Filter 2 Original image Filter Output image Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 58 / 85

Filtering - Filter 3 Original image Filter Output image Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 59 / 85

Pooling - Filter 1 Original image Output image Subsampled image How to do 2x subsampling-pooling: output image = $O$, subsampled image = $S$, $S_{ij} = \max_{k \in \text{window around } (2i, 2j)} O_k$. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 60 / 85

Pooling - Filter 2 Original image Output image Subsampled image How to do 2x subsampling-pooling: output image = $O$, subsampled image = $S$, $S_{ij} = \max_{k \in \text{window around } (2i, 2j)} O_k$. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 61 / 85

Pooling - Filter 3 Original image Output image Subsampled image How to do 2x subsampling-pooling: output image = $O$, subsampled image = $S$, $S_{ij} = \max_{k \in \text{window around } (2i, 2j)} O_k$. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 62 / 85
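
And a matching sketch of the subsampling-pooling formula above, here with a non-overlapping 2x2 window (one common choice).

```python
import numpy as np

def max_pool_2x(O):
    """S[i, j] = max of the response image O over the 2x2 window at (2i, 2j)."""
    H, W = O.shape
    S = np.zeros((H // 2, W // 2))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = O[2 * i:2 * i + 2, 2 * j:2 * j + 2].max()
    return S

O = np.arange(16.0).reshape(4, 4)
print(max_pool_2x(O))    # [[ 5.  7.] [13. 15.]]
```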

A convolutional layer Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 63 / 85

Transforming the data with a layer Original datapoint New datapoint Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 64 / 85

A convolutional network Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 65 / 85

Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 66 / 85

Face detection Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 67 / 85

Face detection Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 68 / 85

NORB dataset 50 toys belonging to 5 categories: animal, human figure, airplane, truck, car. 10 instances per category: 5 instances used for training, 5 instances for testing. Raw dataset: 972 stereo pairs of each toy, 48,600 image pairs total. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 69 / 85

NORB dataset - 2 For each instance: 18 azimuths (0 to 350 degrees, every 20 degrees), 9 elevations (30 to 70 degrees from horizontal, every 5 degrees), 6 illuminations (on/off combinations of 4 lights), 2 cameras (stereo, 7.5 cm apart, 40 cm from the object). Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 70 / 85

NORB dataset - 3 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 71 / 85

Textured and cluttered versions Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 72 / 85

90,857 free parameters, 3,901,162 connections. The entire network is trained end-to-end (all the layers are trained simultaneously). Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 73 / 85

Normalized-Uniform set (method: error)
Linear Classifier on raw stereo images: 30.2%
K-Nearest-Neighbors on raw stereo images: 18.4%
K-Nearest-Neighbors on PCA-95: 16.6%
Pairwise SVM on 96x96 stereo images: 11.6%
Pairwise SVM on 95 Principal Components: 13.3%
Convolutional Net on 96x96 stereo images: 5.8%
Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 74 / 85

Jittered-Cluttered Dataset 291,600 stereo pairs for training, 58,320 for testing. Objects are jittered: position, scale, in-plane rotation, contrast, brightness, backgrounds, distractor objects, ... Input dimension: 98x98x2 (approx 18,000) Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 75 / 85

Jittered-Cluttered Dataset - Results (method: error)
SVM with Gaussian kernel: 43.3%
Convolutional Net with binocular input: 7.8%
Convolutional Net + SVM on top: 5.9%
Convolutional Net with monocular input: 20.8%
Smaller mono net (DEMO): 26.0%
Dataset available from http://www.cs.nyu.edu/~yann
Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 76 / 85

NORB recognition - 1 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 77 / 85

NORB recognition - 2 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 78 / 85

ConvNet summary With complex problems, it is hard to design features by hand. Neural networks circumvent this problem. They can be hard to train (again...). Convolutional neural networks use knowledge about locality in images. They are much easier to train than standard networks. And they are FAST (again...). Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 79 / 85

What has not been covered In some cases, we have lots of data, but without the labels. Unsupervised learning. There are techniques to use these data to get better performance. E.g. : Task-Driven Dictionary Learning, Mairal et al. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 80 / 85

Additional techniques currently used Local response normalization Data augmentation Batch normalization Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 81 / 85

Adversarial examples ConvNets achieve stellar performance in object recognition Do they really understand what's going on in an image? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 82 / 85

What I did not cover Algorithms for pretraining (RBMs, autoencoders) Generative models (DBMs) Recurrent neural networks for machine translation Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 83 / 85

Conclusions Neural networks are complex non-linear models Optimizing them is hard It is as much about theory as about engineering When properly tuned, they can perform wonders Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 84 / 85

Bibliography
Deep Learning book, http://www-labs.iro.umontreal.ca/~bengioy/dlbook/
Dropout: A Simple Way to Prevent Neural Networks from Overfitting by Srivastava et al.
Intriguing properties of neural networks by Szegedy et al.
ImageNet Classification with Deep Convolutional Neural Networks by Krizhevsky, Sutskever and Hinton
Gradient-based learning applied to document recognition by LeCun et al.
Qualitatively characterizing neural network optimization problems by Goodfellow and Vinyals
Sequence to Sequence Learning with Neural Networks by Sutskever, Vinyals and Le
Caffe, http://caffe.berkeleyvision.org/
Hugo Larochelle's MOOC, https://www.youtube.com/playlist?list=pl6xpj9i5qxyecohn7tqghaj6naprnmubh
Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 85 / 85