Neural networks and optimization

Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85

1 Introduction 2 Deep networks 3 Optimization 4 Convolutional neural networks Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 2 / 85

1 Introduction 2 Deep networks 3 Optimization 4 Convolutional neural networks Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 3 / 85

Goal: classification and regression. Medical imaging: cancer or not? (Classification) Autonomous driving: optimal wheel position (Regression) Kinect: where are the limbs? (Regression) OCR: what are the characters? (Classification) Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 4 / 85

Goal: classification and regression. Medical imaging: cancer or not? (Classification) Autonomous driving: optimal wheel position (Regression) Kinect: where are the limbs? (Regression) OCR: what are the characters? (Classification) It also works for structured outputs Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 4 / 85

Object recognition Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 5 / 85

Object recognition Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 6 / 85

1 Introduction 2 Deep networks 3 Optimization 4 Convolutional neural networks Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 7 / 85

Linear classifier Dataset: $(X^{(i)}, Y^{(i)})$ pairs, $i = 1, \dots, N$, with $X^{(i)} \in \mathbb{R}^n$ and $Y^{(i)} \in \{-1, 1\}$. Goal: find $w$ and $b$ such that $\operatorname{sign}(w^\top X^{(i)} + b) = Y^{(i)}$. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 8 / 85

Perceptron algorithm (Rosenblatt, 57) $w_0 = 0$, $b_0 = 0$, $\hat{Y}^{(i)} = \operatorname{sign}(w^\top X^{(i)} + b)$, $w_{t+1} \leftarrow w_t + \sum_i \bigl(Y^{(i)} - \hat{Y}^{(i)}\bigr) X^{(i)}$, $b_{t+1} \leftarrow b_t + \sum_i \bigl(Y^{(i)} - \hat{Y}^{(i)}\bigr)$. Linearly separable demo Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 9 / 85
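
Not in the original slides: a minimal NumPy sketch of the batch perceptron update written above, with illustrative variable names, just to make the update rule concrete.

```python
import numpy as np

def perceptron(X, Y, n_iters=100):
    """Batch perceptron update from the slide.
    X: (N, n) array of inputs, Y: (N,) array of labels in {-1, +1}."""
    N, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iters):
        Y_hat = np.sign(X @ w + b)   # current predictions
        err = Y - Y_hat              # zero on correctly classified points
        if not err.any():            # converged (only happens if the data are separable)
            break
        w += err @ X                 # w_{t+1} = w_t + sum_i (Y^(i) - Yhat^(i)) X^(i)
        b += err.sum()               # b_{t+1} = b_t + sum_i (Y^(i) - Yhat^(i))
    return w, b
```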

Some data are not separable The Perceptron algorithm is NOT convergent for non-linearly separable data. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 10 / 85

Non-convergence of the perceptron algorithm Non-linearly separable perceptron We need an algorithm which works on both separable and non-separable data. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 11 / 85

Classification error A good classifier minimizes $\ell(w) = \sum_i \ell_i = \sum_i \ell\bigl(\hat{Y}^{(i)}, Y^{(i)}\bigr) = \sum_i \mathbf{1}_{\operatorname{sign}(w^\top X^{(i)} + b) \neq Y^{(i)}}$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 12 / 85

Cost function Classification error is not smooth. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 13 / 85

Cost function Classification error is not smooth. Sigmoid is smooth but not convex. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 13 / 85

Convex cost functions Classification error is not smooth. Sigmoid is smooth but not convex. Logistic loss is a convex upper bound. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 14 / 85

Convex cost functions Classification error is not smooth. Sigmoid is smooth but not convex. Logistic loss is a convex upper bound. Hinge loss (SVMs) is very much like logistic. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 14 / 85
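
Not part of the slides: a small NumPy sketch comparing the losses just mentioned, written as functions of the margin $m = Y (w^\top X + b)$; the exact scalings (e.g. the base of the logarithm) are a convention choice.

```python
import numpy as np

def zero_one_loss(m):
    return (m <= 0).astype(float)       # classification error: not smooth

def sigmoid_loss(m):
    return 1.0 / (1.0 + np.exp(m))      # smooth but not convex

def logistic_loss(m):
    return np.log1p(np.exp(-m))         # convex surrogate of the 0-1 loss

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)     # SVM loss, very much like logistic

margins = np.linspace(-2.0, 2.0, 5)
for loss in (zero_one_loss, sigmoid_loss, logistic_loss, hinge_loss):
    print(loss.__name__, np.round(loss(margins), 3))
```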

Interlude Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 15 / 85

Convex function Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 16 / 85

Non-convex function Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 17 / 85

Not-convex-but-still-OK function Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 18 / 85

End of interlude Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 19 / 85

Solving separable AND non-separable problems Non-linearly separable logistic Linearly separable logistic Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 20 / 85

Non-linear classification Non-linearly separable polynomial Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 21 / 85

Non-linear classification Features $X_1, X_2$: linear classifier. Features $X_1, X_2, X_1 X_2, X_1^2, \dots$: non-linear classifier. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 22 / 85

Choosing the features To make it work, I created lots of extra features: $X_1^i X_2^j$ for $i, j \in [0, 4]$ (p = 18) Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 23 / 85

Choosing the features To make it work, I created lots of extra features: $X_1^i X_2^j$ for $i, j \in [0, 4]$ (p = 18) Would it work with fewer features? Test with $X_1^i X_2^j$ for $i, j \in [1, 3]$ Non-linearly separable polynomial degree 2 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 23 / 85
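
An illustrative NumPy sketch (not from the slides) of building hand-designed monomial features $X_1^i X_2^j$ like the ones above, so that a linear classifier on them becomes non-linear in the original inputs.

```python
import numpy as np

def monomial_features(X1, X2, degree=4):
    """Features X1^i * X2^j for i, j in [0, degree], one column per monomial."""
    return np.stack([X1 ** i * X2 ** j
                     for i in range(degree + 1)
                     for j in range(degree + 1)], axis=-1)

X1 = np.array([0.5, -1.0, 2.0])
X2 = np.array([1.0, 0.3, -0.7])
print(monomial_features(X1, X2).shape)   # (3, 25) for degree 4
```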

A graphical view of the classifiers $f(X) = w_1 X_1 + w_2 X_2 + b$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 24 / 85

A graphical view of the classifiers $f(X) = w_1 X_1 + w_2 X_2 + w_3 X_1^2 + w_4 X_2^2 + w_5 X_1 X_2 + \dots$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 24 / 85

Non-linear features A linear classifier on a non-linear transformation is non-linear. A non-linear classifier relies on non-linear features. Which ones do we choose? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 25 / 85

Non-linear features A linear classifier on a non-linear transformation is non-linear. A non-linear classifier relies on non-linear features. Which ones do we choose? Example: $H_j = X_1^{p_j} X_2^{q_j}$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 25 / 85

Non-linear features A linear classifier on a non-linear transformation is non-linear. A non-linear classifier relies on non-linear features. Which ones do we choose? Example: $H_j = X_1^{p_j} X_2^{q_j}$ SVM: $H_j = K(X, X^{(j)})$ with $K$ some kernel function Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 25 / 85

Non-linear features A linear classifier on a non-linear transformation is non-linear. A non-linear classifier relies on non-linear features. Which ones do we choose? Example: $H_j = X_1^{p_j} X_2^{q_j}$ SVM: $H_j = K(X, X^{(j)})$ with $K$ some kernel function Do they have to be predefined? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 25 / 85

Non-linear features A linear classifier on a non-linear transformation is non-linear. A non-linear classifier relies on non-linear features. Which ones do we choose? Example: $H_j = X_1^{p_j} X_2^{q_j}$ SVM: $H_j = K(X, X^{(j)})$ with $K$ some kernel function Do they have to be predefined? A neural network will learn the $H_j$'s Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 25 / 85

A 3-step algorithm for a neural network 1 Pick an example x Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 26 / 85

A 3-step algorithm for a neural network 1 Pick an example x 2 Transform it into $\hat{x} = Vx$ with some matrix $V$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 26 / 85

A 3-step algorithm for a neural network 1 Pick an example x 2 Transform it into $\hat{x} = Vx$ with some matrix $V$ 3 Compute $w^\top \hat{x} + b$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 26 / 85

A 3-step algorithm for a neural network 1 Pick an example x 2 Transform it into $\hat{x} = Vx$ with some matrix $V$ 3 Compute $w^\top \hat{x} + b = w^\top V x + b = \hat{w}^\top x + b$ (still linear in $x$) Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 27 / 85

A 3-step algorithm for a neural network 1 Pick an example x 2 Transform it into $\hat{x} = Vx$ with some matrix $V$ 3 Apply a nonlinear function $g$ to all the elements of $\hat{x}$ 4 Compute $w^\top \hat{x} + b = w^\top g(Vx) + b \neq \hat{w}^\top x + b$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 28 / 85

Neural networks Usually, we use $H_j = g(v_j^\top X)$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 29 / 85

Neural networks Usually, we use $H_j = g(v_j^\top X)$. $H_j$: hidden unit. $v_j$: input weight. $g$: transfer function. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 29 / 85

Transfer function $f(X) = \sum_j w_j H_j(X) + b = \sum_j w_j\, g(v_j^\top X) + b$. $g$ is the transfer function. $g$ used to be the sigmoid or the tanh. If $g$ is the sigmoid, each hidden unit is a soft classifier. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 30 / 85

Transfer function $f(X) = \sum_j w_j H_j(X) + b = \sum_j w_j\, g(v_j^\top X) + b$. $g$ is the transfer function. $g$ is now often the positive part. This is called a rectified linear unit. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 31 / 85
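
A small NumPy sketch (not from the slides; the names and sizes are illustrative) of this single-hidden-layer model $f(X) = \sum_j w_j\, g(v_j^\top X) + b$ with the rectified-linear transfer function.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)            # "positive part" transfer function

def forward(X, V, w, b, g=relu):
    """f(X) = sum_j w_j g(v_j^T X) + b for a batch of inputs.
    X: (N, n) inputs, V: (k, n) input weights, w: (k,) output weights, b: scalar."""
    H = g(X @ V.T)                       # hidden units H_j = g(v_j^T X), shape (N, k)
    return H @ w + b                     # linear classifier on the learned features

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))              # 5 examples, 2 raw features
V = rng.normal(size=(10, 2))             # 10 hidden units
w = rng.normal(size=10)
print(forward(X, V, w, b=0.0).shape)     # (5,)
```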

Neural networks $f(X) = \sum_j w_j H_j(X) + b = \sum_j w_j\, g(v_j^\top X) + b$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 32 / 85

Example on the non-separable problem Non-linearly separable neural net 3 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 33 / 85

Training a neural network Dataset: $(X^{(i)}, Y^{(i)})$ pairs, $i = 1, \dots, N$. Goal: find $w$ and $b$ such that $\operatorname{sign}(w^\top X^{(i)} + b) = Y^{(i)}$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 34 / 85

Training a neural network Dataset: $(X^{(i)}, Y^{(i)})$ pairs, $i = 1, \dots, N$. Goal: find $w$ and $b$ to minimize $\sum_i \log\bigl(1 + \exp\bigl(-Y^{(i)} (w^\top X^{(i)} + b)\bigr)\bigr)$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 34 / 85

Training a neural network Dataset: $(X^{(i)}, Y^{(i)})$ pairs, $i = 1, \dots, N$. Goal: find $v_1, \dots, v_k$, $w$ and $b$ to minimize $\sum_i \log\Bigl(1 + \exp\Bigl(-Y^{(i)} \bigl(\sum_j w_j\, g(v_j^\top X^{(i)}) + b\bigr)\Bigr)\Bigr)$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 34 / 85

Cost function $s$ = cost function (logistic loss, hinge loss, ...). $\ell(V, w, b, X^{(i)}, Y^{(i)}) = s\bigl(\hat{Y}^{(i)}, Y^{(i)}\bigr) = s\bigl(\sum_j w_j H_j(X^{(i)}),\, Y^{(i)}\bigr) = s\bigl(\sum_j w_j\, g(v_j^\top X^{(i)}),\, Y^{(i)}\bigr)$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 35 / 85

Cost function $s$ = cost function (logistic loss, hinge loss, ...). $\ell(V, w, b, X^{(i)}, Y^{(i)}) = s\bigl(\hat{Y}^{(i)}, Y^{(i)}\bigr) = s\bigl(\sum_j w_j H_j(X^{(i)}),\, Y^{(i)}\bigr) = s\bigl(\sum_j w_j\, g(v_j^\top X^{(i)}),\, Y^{(i)}\bigr)$ We can minimize it using gradient descent. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 35 / 85
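
To make the "minimize it using gradient descent" step concrete, here is a sketch (mine, not the speaker's) of the logistic loss of the one-hidden-layer network above, together with hand-derived, backpropagated gradients.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def loss_and_grads(V, w, b, X, Y):
    """Average logistic loss of f(X) = sum_j w_j relu(v_j^T X) + b, plus gradients.
    X: (N, n) inputs, Y: (N,) labels in {-1, +1}."""
    Z = X @ V.T                               # pre-activations, (N, k)
    H = relu(Z)                               # hidden units, (N, k)
    f = H @ w + b                             # scores, (N,)
    loss = np.log1p(np.exp(-Y * f)).mean()
    df = -Y / (1.0 + np.exp(Y * f)) / len(Y)  # d loss / d f, one value per example
    grad_w = H.T @ df
    grad_b = df.sum()
    dZ = np.outer(df, w) * (Z > 0)            # backpropagate through w and the ReLU
    grad_V = dZ.T @ X
    return loss, grad_V, grad_w, grad_b
```

Full-batch gradient descent then repeats $\theta \leftarrow \theta - \alpha\, \partial L / \partial \theta$ with these gradients; the stochastic variant discussed later replaces the average over all examples by a single one.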

Finding a good transformation of x $H_j(x) = g(v_j^\top x)$ Is it a good $H_j$? What can we say about it? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 36 / 85

Finding a good transformation of x $H_j(x) = g(v_j^\top x)$ Is it a good $H_j$? What can we say about it? It is theoretically enough (universal approximation) Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 36 / 85

Finding a good transformation of x $H_j(x) = g(v_j^\top x)$ Is it a good $H_j$? What can we say about it? It is theoretically enough (universal approximation) That does not mean you should use it Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 36 / 85

Going deeper $H_j(x) = g(v_j^\top x)$, i.e. $\hat{x}_j = g(v_j^\top x)$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 37 / 85

Going deeper $H_j(x) = g(v_j^\top x)$, i.e. $\hat{x}_j = g(v_j^\top x)$ We can transform $\hat{x}$ as well. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 37 / 85

Going from 2 to 3 layers Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 38 / 85

Going from 2 to 3 layers We can have as many layers as we like Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 38 / 85

Going from 2 to 3 layers We can have as many layers as we like But it makes the optimization problem harder Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 38 / 85

1 Introduction 2 Deep networks 3 Optimization 4 Convolutional neural networks Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 39 / 85

The need for fast learning Neural networks may need many examples (several millions or more). We need to be able to use them quickly. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 40 / 85

Batch methods $L(\theta) = \frac{1}{N} \sum_i \ell(\theta, X^{(i)}, Y^{(i)})$, $\theta_{t+1} \leftarrow \theta_t - \frac{\alpha_t}{N} \sum_i \frac{\partial \ell(\theta, X^{(i)}, Y^{(i)})}{\partial \theta}$. To compute one update of the parameters, we need to go through all the data. This can be very expensive. What if we have an infinite amount of data? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 41 / 85

Potential solutions 1 Discard data. Seems stupid Yet many people do it Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 42 / 85

Potential solutions 1 Discard data. Seems stupid Yet many people do it 2 Use approximate methods. Update = average of the updates for all datapoints. Are these updates really different? If not, how can we learn faster? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 42 / 85

Stochastic gradient descent $L(\theta) = \frac{1}{N} \sum_i \ell(\theta, X^{(i)}, Y^{(i)})$, $\theta_{t+1} \leftarrow \theta_t - \frac{\alpha_t}{N} \sum_i \frac{\partial \ell(\theta, X^{(i)}, Y^{(i)})}{\partial \theta}$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 43 / 85

Stochastic gradient descent $L(\theta) = \frac{1}{N} \sum_i \ell(\theta, X^{(i)}, Y^{(i)})$, $\theta_{t+1} \leftarrow \theta_t - \alpha_t \frac{\partial \ell(\theta, X^{(i_t)}, Y^{(i_t)})}{\partial \theta}$ Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 43 / 85

Stochastic gradient descent $L(\theta) = \frac{1}{N} \sum_i \ell(\theta, X^{(i)}, Y^{(i)})$, $\theta_{t+1} \leftarrow \theta_t - \alpha_t \frac{\partial \ell(\theta, X^{(i_t)}, Y^{(i_t)})}{\partial \theta}$ What do we lose when updating the parameters to satisfy just one example? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 43 / 85
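
A minimal sketch (not from the slides) of the stochastic update $\theta_{t+1} \leftarrow \theta_t - \alpha_t\, \partial \ell(\theta, X^{(i_t)}, Y^{(i_t)}) / \partial \theta$, assuming a per-example gradient function is available; the decaying step-size schedule is one common choice, not the speaker's.

```python
import numpy as np

def sgd(theta, per_example_grad, X, Y, alpha0=0.1, n_steps=10_000, seed=0):
    """Stochastic gradient descent: one randomly drawn example per update.
    per_example_grad(theta, x, y) returns d l(theta, x, y) / d theta."""
    rng = np.random.default_rng(seed)
    for t in range(n_steps):
        i = rng.integers(len(X))                  # draw the index i_t
        alpha_t = alpha0 / (1.0 + t / 1000.0)     # decaying step size
        theta = theta - alpha_t * per_example_grad(theta, X[i], Y[i])
    return theta
```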

Disagreement $\mu^2/\sigma^2$ during optimization (log scale) As optimization progresses, disagreement increases Early on, one can pick one example at a time What about later? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 44 / 85

Stochastic vs. deterministic methods Goal = best of both worlds: linear rate with O(1) iteration cost. [Plot: log(excess cost) vs. time, stochastic and deterministic methods.] Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 45 / 85

Stochastic vs. deterministic methods Goal = best of both worlds: linear rate with O(1) iteration cost. [Plot: log(excess cost) vs. time, stochastic and deterministic methods.] For non-convex problems, stochastic works best. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 45 / 85

Understanding the loss function of deep networks Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 46 / 85

Understanding the loss function of deep networks Recent studies say most local minima are close to the optimum. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 46 / 85

Regularizing deep networks Deep networks have many parameters How do we avoid overfitting? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 47 / 85

Regularizing deep networks Deep networks have many parameters How do we avoid overfitting? Regularizing the parameters Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 47 / 85

Regularizing deep networks Deep networks have many parameters How do we avoid overfitting? Regularizing the parameters Limiting the number of units Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 47 / 85

Regularizing deep networks Deep networks have many parameters How do we avoid overfitting? Regularizing the parameters Limiting the number of units Dropout Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 47 / 85

Dropout Overfitting happens when each unit gets too specialized Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 48 / 85

Dropout Overfitting happens when each unit gets too specialized The core idea is to prevent each unit from relying too much on the others Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 48 / 85

Dropout Overfitting happens when each unit gets too specialized The core idea is to prevent each unit from relying too much on the others How do we do this? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 48 / 85

Dropout - an illustration Figure taken from Dropout : A Simple Way to Prevent Neural Networks from Overfitting Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 49 / 85

Dropout - a conclusion By removing units at random, we force the others to be less specialized At test time, we use all the units Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 50 / 85

Dropout - a conclusion By removing units at random, we force the others to be less specialized At test time, we use all the units This is sometimes presented as an ensemble method. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 50 / 85
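
A small NumPy sketch (my own, not from the slides) of the mechanism just described, using the common "inverted dropout" scaling so that expected activations match between training and test time.

```python
import numpy as np

def dropout(H, p_drop=0.5, train=True, rng=None):
    """Randomly zero hidden activations H during training; keep all units at test time."""
    if not train:
        return H                              # at test time, we use all the units
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(H.shape) >= p_drop      # units that survive this pass
    return H * mask / (1.0 - p_drop)          # rescale so expected activations are unchanged
```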

Neural networks - Summary A linear classifier in a feature space can model non-linear boundaries. Finding a good feature space is essential. One can design the feature map by hand. One can learn the feature map, using fewer features than if it were done by hand. Learning the feature map is potentially HARD (non-convexity). Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 51 / 85

Recipe for training a neural network Use rectified linear units Use dropout Use stochastic gradient descent Use 2000 units on each layer Try different numbers of layers Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 52 / 85

1 Introduction 2 Deep networks 3 Optimization 4 Convolutional neural networks Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 53 / 85

$v_j$'s for images $f(X) = \sum_j w_j H_j(X) + b = \sum_j w_j\, g(v_j^\top X) + b$ If $X$ is an image, $v_j$ is an image too. $v_j$ acts as a filter (presence or absence of a pattern). What does $v_j$ look like? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 54 / 85

v j s for images - Examples Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 55 / 85

v j s for images - Examples Filters are mostly local Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 55 / 85

Basic idea of convolutional neural networks Filters are mostly local. Instead of using image-wide filters, use small ones over patches. Repeat for every patch to get a response image. Subsample the response image to get local invariance. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 56 / 85
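
An illustrative NumPy sketch (not from the slides) of the patch-wise filtering step: the same small filter is applied to every patch of the image, producing a response image.

```python
import numpy as np

def filter_image(image, filt):
    """Slide a small filter over every patch ('valid' positions only)
    and record one response per position."""
    H, W = image.shape
    h, w = filt.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * filt)
    return out

image = np.random.default_rng(0).random((8, 8))
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])   # a hypothetical 2x2 filter
print(filter_image(image, edge_filter).shape)        # (7, 7) response image
```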

Filtering - Filter 1 Original image Filter Output image Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 57 / 85

Filtering - Filter 2 Original image Filter Output image Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 58 / 85

Filtering - Filter 3 Original image Filter Output image Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 59 / 85

Pooling - Filter 1 Original image Output image Subsampled image How to do 2x subsampling-pooling: output image = $O$, subsampled image = $S$, $S_{ij} = \max_{k \in \text{window around } (2i, 2j)} O_k$. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 60 / 85

Pooling - Filter 2 Original image Output image Subsampled image How to do 2x subsampling-pooling: output image = $O$, subsampled image = $S$, $S_{ij} = \max_{k \in \text{window around } (2i, 2j)} O_k$. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 61 / 85

Pooling - Filter 3 Original image Output image Subsampled image How to do 2x subsampling-pooling: output image = $O$, subsampled image = $S$, $S_{ij} = \max_{k \in \text{window around } (2i, 2j)} O_k$. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 62 / 85
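
And a matching sketch of the subsampling-pooling formula above, here with a non-overlapping 2x2 window (one common choice).

```python
import numpy as np

def max_pool_2x(O):
    """S[i, j] = max of the response image O over the 2x2 window at (2i, 2j)."""
    H, W = O.shape
    S = np.zeros((H // 2, W // 2))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = O[2 * i:2 * i + 2, 2 * j:2 * j + 2].max()
    return S

O = np.arange(16.0).reshape(4, 4)
print(max_pool_2x(O))    # [[ 5.  7.] [13. 15.]]
```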

A convolutional layer Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 63 / 85

Transforming the data with a layer Original datapoint New datapoint Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 64 / 85

A convolutional network Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 65 / 85

Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 66 / 85

Face detection Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 67 / 85

Face detection Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 68 / 85

NORB dataset 50 toys belonging to 5 categories: animal, human figure, airplane, truck, car. 10 instances per category: 5 instances used for training, 5 instances for testing. Raw dataset: 972 stereo pairs of each toy, 48,600 image pairs total. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 69 / 85

NORB dataset - 2 For each instance: 18 azimuths (0 to 350 degrees, every 20 degrees), 9 elevations (30 to 70 degrees from horizontal, every 5 degrees), 6 illuminations (on/off combinations of 4 lights), 2 cameras (stereo, 7.5 cm apart, 40 cm from the object). Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 70 / 85

NORB dataset - 3 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 71 / 85

Textured and cluttered versions Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 72 / 85

90,857 free parameters, 3,901,162 connections. The entire network is trained end-to-end (all the layers are trained simultaneously). Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 73 / 85

Normalized-Uniform set (method: error)
Linear Classifier on raw stereo images: 30.2%
K-Nearest-Neighbors on raw stereo images: 18.4%
K-Nearest-Neighbors on PCA-95: 16.6%
Pairwise SVM on 96x96 stereo images: 11.6%
Pairwise SVM on 95 Principal Components: 13.3%
Convolutional Net on 96x96 stereo images: 5.8%
Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 74 / 85

Jittered-Cluttered Dataset 291,600 stereo pairs for training, 58,320 for testing. Objects are jittered: position, scale, in-plane rotation, contrast, brightness, backgrounds, distractor objects, ... Input dimension: 98x98x2 (approx 18,000) Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 75 / 85

Jittered-Cluttered Dataset - Results (method: error)
SVM with Gaussian kernel: 43.3%
Convolutional Net with binocular input: 7.8%
Convolutional Net + SVM on top: 5.9%
Convolutional Net with monocular input: 20.8%
Smaller mono net (DEMO): 26.0%
Dataset available from http://www.cs.nyu.edu/~yann
Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 76 / 85

NORB recognition - 1 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 77 / 85

NORB recognition - 2 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 78 / 85

ConvNet summary With complex problems, it is hard to design features by hand. Neural networks circumvent this problem. They can be hard to train (again...). Convolutional neural networks use knowledge about locality in images. They are much easier to train than standard networks. And they are FAST (again...). Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 79 / 85

What has not been covered In some cases, we have lots of data, but without the labels. Unsupervised learning. There are techniques to use these data to get better performance. E.g. : Task-Driven Dictionary Learning, Mairal et al. Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 80 / 85

Additional techniques currently used Local response normalization Data augmentation Batch normalization Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 81 / 85

Adversarial examples ConvNets achieve stellar performance in object recognition Do they really understand what's going on in an image? Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 82 / 85

What I did not cover Algorithms for pretraining (RBMs, autoencoders) Generative models (DBMs) Recurrent neural networks for machine translation Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 83 / 85

Conclusions Neural networks are complex non-linear models Optimizing them is hard It is as much about theory as about engineering When properly tuned, they can perform wonders Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 84 / 85

Bibliography
Deep Learning book, http://www-labs.iro.umontreal.ca/~bengioy/dlbook/
Dropout: A Simple Way to Prevent Neural Networks from Overfitting by Srivastava et al.
Intriguing properties of neural networks by Szegedy et al.
ImageNet Classification with Deep Convolutional Neural Networks by Krizhevsky, Sutskever and Hinton
Gradient-based learning applied to document recognition by LeCun et al.
Qualitatively characterizing neural network optimization problems by Goodfellow and Vinyals
Sequence to Sequence Learning with Neural Networks by Sutskever, Vinyals and Le
Caffe, http://caffe.berkeleyvision.org/
Hugo Larochelle's MOOC, https://www.youtube.com/playlist?list=pl6xpj9i5qxyecohn7tqghaj6naprnmubh
Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 85 / 85