Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire National des Arts et Métiers (Cnam)
8 Weeks on Deep Learning and Structured Prediction. Weeks 1-5: Deep Learning. Weeks 6-8: Structured Prediction and Applications.
Outline 1 Context 2 Neural Networks 3 Training Deep Neural Networks
Context: Big Data. Superabundance of data: images, videos, audio, text, usage traces, etc. BBC: 2.4M videos; Facebook: 350B images; 100M monitoring cameras; 1B each day. Obvious need to access, search, or classify these data ⇒ Recognition. Huge number of applications: mobile visual search, robotics, autonomous driving, augmented reality, medical imaging, etc. Leading track in major ML/CV conferences during the last decade.
Recognition and classification. Classification: assign a given datum to one of a set of pre-defined classes. Recognition is much more general than classification, e.g. object localization in images, ranking for document indexing, sequence prediction for text, speech, audio, etc. Many tasks can be cast as classification problems ⇒ importance of classification.
Focus on Visual Recognition: Perceiving the Visual World. Visual recognition: archetype of low-level signal understanding. Supposed to be a master's-level class problem in the early 80s. Certainly the topic most impacted by deep learning. Scene categorization; object localization; context & attribute recognition; rough 3D layout, depth ordering; rich description of scenes, e.g. sentences.
Recognition of low-level signals. Challenge: filling the semantic gap: what we perceive vs what a computer sees. Illumination variations, viewpoint variations, deformable objects, intra-class variance, etc. How to design "good" intermediate representations?
Deep Learning (DL) & Recognition of low-level signals DL: breakthrough for the recognition of low-level signal data Before DL: handcrafted intermediate representations for each task Needs expertise (PhD level) in each field Weak level of semantics in the representation
Deep Learning (DL) & Recognition of low-level signals. DL: breakthrough for the recognition of low-level signal data. Since DL: automatically learning intermediate representations. Outstanding experimental performances, far beyond handcrafted features. Able to learn high-level intermediate representations. Common learning methodology: field independent, no field expertise required.
Deep Learning (DL) & Representation Learning. DL: breakthrough for representation learning. Automatically learning intermediate levels of representation. Ex: Natural Language Processing (NLP).
Outline 1 Context 2 Neural Networks 3 Training Deep Neural Networks
The Formal Neuron: 1943 [MP43]. Basis of Neural Networks. Input: vector x ∈ R^m, i.e. x = {x_i}_{i ∈ {1,2,...,m}}. Neuron output ŷ ∈ R: scalar.
The Formal Neuron: 1943 [MP43]. Mapping from x to ŷ: 1. Linear (affine) mapping: s = w^T x + b. 2. Non-linear activation function f: ŷ = f(s).
The Formal Neuron: Linear Mapping. Linear (affine) mapping: s = w^T x + b = Σ_{i=1}^m w_i x_i + b. w: normal vector to a hyperplane in R^m ⇒ linear boundary. b: bias, shifts the hyperplane position. 2D hyperplane: line; 3D hyperplane: plane.
The Formal Neuron: Activation Function. ŷ = f(w^T x + b), f: activation function. Popular choices of f: step, sigmoid, tanh. Step (Heaviside) function: H(z) = 1 if z ≥ 0, 0 otherwise.
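To make the formal neuron concrete, here is a minimal NumPy sketch (not from the slides) of the mapping ŷ = f(w^T x + b) with the step and sigmoid activations defined above; the input and weight values are arbitrary.

```python
import numpy as np

def heaviside(z):
    """Step activation H(z): 1 if z >= 0, 0 otherwise."""
    return (z >= 0).astype(float)

def sigmoid(z, a=1.0):
    """Sigmoid activation (1 + exp(-a*z))^-1."""
    return 1.0 / (1.0 + np.exp(-a * z))

def formal_neuron(x, w, b, activation=heaviside):
    """Formal neuron: linear mapping s = w.x + b followed by activation f(s)."""
    s = np.dot(w, x) + b
    return activation(s)

# Toy values (arbitrary): m = 3 inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 0.5, -0.3])
b = 0.1
print(formal_neuron(x, w, b, heaviside))   # 0.0 or 1.0
print(formal_neuron(x, w, b, sigmoid))     # value in (0, 1)
```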
Step function: Connection to Biological Neurons. Formal neuron with step activation H: ŷ = H(w^T x + b). ŷ = 1 (activated) ⇔ w^T x ≥ −b; ŷ = 0 (not activated) ⇔ w^T x < −b. Biological neurons: output activated ⇔ input weighted by the synaptic weights exceeds a threshold.
Sigmoid Activation Function. Neuron output ŷ = f(w^T x + b), f: activation function. Sigmoid: σ(z) = (1 + e^{−az})^{−1}. Larger a: closer to the step function (step: a → ∞). Sigmoid: linear and saturating regimes.
The Formal Neuron: Application to Binary Classification. Binary classification: label input x as belonging to class 1 or 0. Neuron output with sigmoid: ŷ = 1 / (1 + e^{−a(w^T x + b)}). Sigmoid: probabilistic interpretation ŷ ≈ P(1|x). Input x classified as 1 if P(1|x) > 0.5 ⇔ w^T x + b > 0. Input x classified as 0 if P(1|x) < 0.5 ⇔ w^T x + b < 0. sign(w^T x + b): linear decision boundary in input space!
The Formal Neuron: Toy Example for Binary Classification. 2D example: m = 2, x = {x_1, x_2} ∈ [−5; 5] × [−5; 5]. Linear mapping: w = [1; 1] and b = 2; result of the linear mapping: s = w^T x + b. Sigmoid activation function: ŷ = (1 + e^{−a(w^T x + b)})^{−1}, shown successively for a = 10, a = 1 and a = 0.1.
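A minimal sketch of this toy example (using the slide's values w = [1; 1], b = 2; the grid resolution is arbitrary), showing how the sharpness parameter a changes the sigmoid output while leaving the decision boundary unchanged.

```python
import numpy as np

w = np.array([1.0, 1.0])
b = 2.0

# 2D grid over [-5, 5] x [-5, 5]
x1, x2 = np.meshgrid(np.linspace(-5, 5, 101), np.linspace(-5, 5, 101))
s = w[0] * x1 + w[1] * x2 + b            # linear mapping s = w.x + b

for a in (10.0, 1.0, 0.1):
    y_hat = 1.0 / (1.0 + np.exp(-a * s))  # sigmoid activation
    # Fraction of the grid classified as class 1 (y_hat > 0.5): identical for all a,
    # since the boundary w.x + b = 0 does not depend on a; only the sharpness changes.
    print(a, (y_hat > 0.5).mean(), y_hat.min(), y_hat.max())
```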
From Formal Neuron to Neural Networks. Formal neuron: 1. a single scalar output; 2. a linear decision boundary for binary classification. A single scalar output is limiting for several tasks, e.g. multi-class classification (MNIST or CIFAR).
Perceptron and Multi-Class Classification. Formal neuron: limited to binary classification. Multi-class classification: use several output neurons instead of a single one! ⇒ Perceptron. Input x ∈ R^m. Output neuron ŷ_1 is a formal neuron: linear (affine) mapping s_1 = w_1^T x + b_1; non-linear activation function f: ŷ_1 = f(s_1). Linear mapping parameters: w_1 = {w_{11}, ..., w_{m1}} ∈ R^m, b_1 ∈ R.
Perceptron and Multi-Class Classification. Input x ∈ R^m. Output neuron ŷ_k is a formal neuron: linear (affine) mapping s_k = w_k^T x + b_k; non-linear activation function f: ŷ_k = f(s_k). Linear mapping parameters: w_k = {w_{1k}, ..., w_{mk}} ∈ R^m, b_k ∈ R.
Perceptron and Multi-Class Classification. Input x ∈ R^m (row vector, 1 × m); output ŷ: concatenation of K formal neurons. Linear (affine) mapping ⇔ matrix multiplication: s = xW + b. W: matrix of size m × K whose columns are the w_k. b: bias vector of size 1 × K. Element-wise non-linear activation: ŷ = f(s).
Perceptron and Multi-Class Classification. Soft-max activation: ŷ_k = f(s_k) = e^{s_k} / Σ_{k'=1}^K e^{s_{k'}}. Probabilistic interpretation for multi-class classification: each output neuron ⇔ one class, ŷ_k ≈ P(k|x, W). ⇒ Logistic Regression (LR) model!
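A minimal sketch of this perceptron / logistic-regression forward pass: s = xW + b followed by a soft-max. The shapes and random values are illustrative only.

```python
import numpy as np

def softmax(s):
    """Row-wise soft-max: y_k = exp(s_k) / sum_k' exp(s_k')."""
    e = np.exp(s - s.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

m, K = 4, 3                      # input dimension, number of classes
rng = np.random.default_rng(0)
W = rng.normal(size=(m, K))      # columns are the w_k
b = np.zeros((1, K))             # bias vector (1 x K)

x = rng.normal(size=(1, m))      # one input, as a 1 x m row vector
y_hat = softmax(x @ W + b)       # y_hat[0, k] ~ P(k | x, W)
print(y_hat, y_hat.sum())        # probabilities summing to 1
```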
2D Toy Example for Multi-Class Classification. x = {x_1, x_2} ∈ [−5; 5] × [−5; 5], ŷ: 3 outputs (classes). Linear mapping for each class: s_k = w_k^T x + b_k, with w_1 = [1; 1], b_1 = 2; w_2 = [0; 1], b_2 = 1; w_3 = [1; 0.5], b_3 = 10. Soft-max output: P(k|x, W). Class prediction: k* = arg max_k P(k|x, W).
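A short sketch of this toy example, using the parameters listed on the slide; the test points are arbitrary and only illustrate the soft-max output and the arg-max class prediction.

```python
import numpy as np

W = np.array([[1.0, 0.0, 1.0],    # columns: w_1 = [1;1], w_2 = [0;1], w_3 = [1;0.5]
              [1.0, 1.0, 0.5]])
b = np.array([2.0, 1.0, 10.0])    # b_1, b_2, b_3 as on the slide

def predict(x):
    s = x @ W + b                             # linear mapping per class
    e = np.exp(s - s.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)      # soft-max: P(k | x, W)
    return p, p.argmax(axis=1)                # probabilities and predicted class

X = np.array([[-4.0, -4.0], [0.0, 0.0], [4.0, 4.0]])  # arbitrary test points
probs, classes = predict(X)
print(probs)
print(classes)   # index of the most probable class for each point
```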
Beyond Linear Classification: the X-OR Problem. Logistic Regression (LR): NN with 1 input layer & 1 output layer. LR: limited to linear decision boundaries. X-OR: (NOT x_1 AND x_2) OR (NOT x_2 AND x_1). X-OR: non-linear decision function.
Beyond Linear Classification. LR: limited to linear boundaries. Solution: add a layer! Input x ∈ R^m, e.g. m = 4. Output ŷ ∈ R^K (K: number of classes), e.g. K = 2. Hidden layer h ∈ R^L.
Multi-Layer Perceptron. Hidden layer h: projection of x into a new space R^L. Neural net with 1 hidden layer: Multi-Layer Perceptron (MLP). h: intermediate representation of x for the classification ŷ: h = f(xW + b). Mapping from x to ŷ: non-linear boundary! ⇒ the activation f is crucial!
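A minimal sketch of the MLP forward pass with one hidden layer: a sigmoid hidden layer followed by a soft-max output, as used in the rest of this course. The names W_h, b_h, W_y, b_y and the layer sizes are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mlp_forward(x, W_h, b_h, W_y, b_y):
    """One hidden layer: h = f(x W_h + b_h), then output y_hat = softmax(h W_y + b_y)."""
    h = sigmoid(x @ W_h + b_h)     # hidden representation in R^L
    y_hat = softmax(h @ W_y + b_y)
    return h, y_hat

m, L, K = 4, 8, 2                  # input dim, hidden dim, number of classes
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(size=(m, L)), np.zeros((1, L))
W_y, b_y = rng.normal(size=(L, K)), np.zeros((1, K))

x = rng.normal(size=(1, m))
h, y_hat = mlp_forward(x, W_h, b_h, W_y, b_y)
print(h.shape, y_hat)              # (1, 8) and class probabilities
```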
Deep Neural Networks. Adding more hidden layers: Deep Neural Networks (DNN) ⇒ basis of Deep Learning. Each layer h^l projects the previous layer h^{l−1} into a new space. Gradually learning intermediate representations useful for the task.
Conclusion. Deep Neural Networks: applicable to classification problems with non-linear decision boundaries. So far: visualizing predictions from fixed model parameters. Reverse problem: finding the parameters from labelled data ⇒ supervised learning.
Outline 1 Context 2 Neural Networks 3 Training Deep Neural Networks
Training Multi-Layer Perceptrons (MLP). Input x, output y. A parametrized model x → y: f_w(x_i) = ŷ_i. Supervised context: training set A = {(x_i, y_i)}_{i ∈ {1,2,...,N}}. A loss function ℓ(ŷ_i, y_i) for each annotated pair (x_i, y_i). Assumptions: parameters w ∈ R^d continuous, L differentiable. Gradient ∇_w = ∂L/∂w: steepest direction to decrease the loss L(w).
MLP Training. Gradient descent algorithm: initialize parameters w; update w^{(t+1)} = w^{(t)} − η ∂L/∂w; iterate until convergence, e.g. ||∇_w||^2 ≈ 0.
Gradient Descent. Update rule: w^{(t+1)} = w^{(t)} − η ∂L/∂w, η: learning rate. Convergence ensured? Provided a "well chosen" learning rate η.
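A generic gradient-descent loop as a sketch of the update rule above, run on an arbitrary differentiable toy loss (a simple quadratic, chosen only so the gradient is easy to write by hand); the stopping threshold is an assumption.

```python
import numpy as np

def loss(w):
    """Toy convex loss L(w) = ||w - w_star||^2 (illustration only)."""
    w_star = np.array([1.0, -2.0])
    return np.sum((w - w_star) ** 2)

def grad(w):
    """Gradient dL/dw = 2 (w - w_star)."""
    w_star = np.array([1.0, -2.0])
    return 2.0 * (w - w_star)

w = np.zeros(2)          # initialize parameters
eta = 0.1                # learning rate
for t in range(1000):
    g = grad(w)
    w = w - eta * g      # w^(t+1) = w^(t) - eta * dL/dw
    if np.linalg.norm(g) ** 2 < 1e-12:   # convergence test: ||grad||^2 ~ 0
        break
print(t, w, loss(w))     # w converges to w_star = [1, -2]
```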
Gradient Descent. Update rule: w^{(t+1)} = w^{(t)} − η ∂L/∂w. Global minimum? Convex (a) vs non-convex (b) loss L(w). (a) convex function; (b) non-convex function.
Supervised Learning: Multi-Class Classification. Logistic Regression for multi-class classification: s_i = x_i W + b. Soft-Max (SM): ŷ_k ≈ P(k|x_i, W, b) = e^{s_k} / Σ_{k'=1}^K e^{s_{k'}}. Supervised loss function: L(W, b) = (1/N) Σ_{i=1}^N ℓ(ŷ_i, y_i), with: 1. y ∈ {1; 2; ...; K}; 2. ŷ_i = arg max_k P(k|x_i, W, b); 3. ℓ_{0/1}(ŷ_i, y_i) = 1 if ŷ_i ≠ y_i, 0 otherwise: the 0/1 loss.
Logistic Regression Training Formulation. Input x_i, ground-truth output supervision y_i. One-hot encoding of y_i: y_{c,i} = 1 if c is the ground-truth class of x_i, 0 otherwise.
Logistic Regression Training Formulation. Loss function: multi-class Cross-Entropy (CE) ℓ_CE. ℓ_CE: Kullback-Leibler divergence between y_i and ŷ_i: ℓ_CE(ŷ_i, y_i) = KL(y_i, ŷ_i) = −Σ_{c=1}^K y_{c,i} log(ŷ_{c,i}) = −log(ŷ_{c*,i}), where c* is the ground-truth class of x_i. ⚠ KL is asymmetric: KL(ŷ_i, y_i) ≠ KL(y_i, ŷ_i) ⚠
Logistic Regression Training. L_CE(W, b) = (1/N) Σ_{i=1}^N ℓ_CE(ŷ_i, y_i) = −(1/N) Σ_{i=1}^N log(ŷ_{c*,i}). ℓ_CE: smooth and convex upper bound of ℓ_{0/1} ⇒ gradient descent optimization. Gradient descent: W^{(t+1)} = W^{(t)} − η ∂L_CE/∂W (and b^{(t+1)} = b^{(t)} − η ∂L_CE/∂b). MAIN CHALLENGE: computing ∂L_CE/∂W = (1/N) Σ_{i=1}^N ∂ℓ_CE/∂W? Key property: chain rule ∂z/∂x = (∂z/∂y)(∂y/∂x) ⇒ backpropagation of the gradient of the error!
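A minimal sketch of the cross-entropy loss above and of its gradient with respect to the pre-softmax scores, which is the quantity backpropagated in the next slides; the toy scores and labels are arbitrary.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_hat, y_true):
    """L_CE = -(1/N) sum_i log(y_hat[i, c*_i]), c*_i the ground-truth class of x_i."""
    n = y_hat.shape[0]
    return -np.mean(np.log(y_hat[np.arange(n), y_true]))

def one_hot(y_true, K):
    n = y_true.shape[0]
    Y = np.zeros((n, K))
    Y[np.arange(n), y_true] = 1.0
    return Y

# Toy scores s_i = x_i W + b for N = 3 examples, K = 4 classes (arbitrary values)
S = np.array([[2.0, 1.0, 0.1, -1.0],
              [0.5, 2.5, 0.0,  0.3],
              [1.0, 0.2, 0.2,  3.0]])
y_true = np.array([0, 1, 3])

Y_hat = softmax(S)
print(cross_entropy(Y_hat, y_true))

# Gradient of L_CE w.r.t. the scores: (1/N) (Y_hat - Y*), the delta used in backpropagation
delta = (Y_hat - one_hot(y_true, 4)) / S.shape[0]
print(delta)
```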
Chain Rule. ∂ℓ/∂x = (∂ℓ/∂ŷ)(∂ŷ/∂x). Logistic regression: ∂ℓ_CE/∂W = (∂ℓ_CE/∂ŷ_i)(∂ŷ_i/∂s_i)(∂s_i/∂W).
Logistic Regression Training: Backpropagation. ∂ℓ_CE/∂W = (∂ℓ_CE/∂ŷ_i)(∂ŷ_i/∂s_i)(∂s_i/∂W), with ℓ_CE(ŷ_i, y_i) = −log(ŷ_{c*,i}). Update for 1 example: ∂ℓ_CE/∂ŷ_i = −(1/ŷ_{c*,i}) δ_{c,c*}; ∂ℓ_CE/∂s_i = ŷ_i − y_i = δ_i^y; ∂ℓ_CE/∂W = x_i^T δ_i^y.
Logistic Regression Training: Backpropagation. Whole dataset: data matrix X (N × m), prediction and label matrices Ŷ, Y (N × K). L_CE(W, b) = −(1/N) Σ_{i=1}^N log(ŷ_{c*,i}); ∂L_CE/∂W = (∂L_CE/∂Ŷ)(∂Ŷ/∂S)(∂S/∂W). ∂L_CE/∂S = Ŷ − Y = Δ^y; ∂L_CE/∂W = X^T Δ^y.
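A sketch of the full logistic-regression gradient in matrix form, following ∂L_CE/∂W = X^T(Ŷ − Y) above (with the 1/N averaging folded into Δ, which is a convention of this sketch); the random data are illustrative only.

```python
import numpy as np

def softmax(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

N, m, K = 100, 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(N, m))                   # data matrix (N x m)
y = rng.integers(0, K, size=N)                # ground-truth classes
Y = np.zeros((N, K)); Y[np.arange(N), y] = 1  # one-hot label matrix (N x K)

W, b = np.zeros((m, K)), np.zeros((1, K))
eta = 0.5
for t in range(200):                          # plain (full-batch) gradient descent
    Y_hat = softmax(X @ W + b)                # forward pass
    delta = (Y_hat - Y) / N                   # dL_CE/dS, averaged over the dataset
    grad_W = X.T @ delta                      # dL_CE/dW = X^T delta
    grad_b = delta.sum(axis=0, keepdims=True) # dL_CE/db
    W -= eta * grad_W
    b -= eta * grad_b
print(-np.mean(np.log(Y_hat[np.arange(N), y])))  # training cross-entropy after descent
```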
Perceptron Training: Backpropagation. Perceptron vs Logistic Regression: adding a hidden layer (sigmoid). Goal: train parameters W^y and W^h (+ biases) with backpropagation ⇒ computing ∂L_CE/∂W^y = (1/N) Σ_{i=1}^N ∂ℓ_CE/∂W^y and ∂L_CE/∂W^h = (1/N) Σ_{i=1}^N ∂ℓ_CE/∂W^h. Last hidden layer → output: ⇔ Logistic Regression. First hidden layer: ∂ℓ_CE/∂W^h = x_i^T ∂ℓ_CE/∂u_i ⇒ computing ∂ℓ_CE/∂u_i = δ_i^h.
Perceptron Training: Backpropagation. Computing ∂ℓ_CE/∂u_i = δ_i^h ⇒ use the chain rule: ∂ℓ_CE/∂u_i = (∂ℓ_CE/∂v_i)(∂v_i/∂h_i)(∂h_i/∂u_i). Leading to: δ_i^h = δ_i^y W^{yT} ⊙ σ'(u_i) = δ_i^y W^{yT} ⊙ (h_i (1 − h_i)).
Deep Neural Network Training: Backpropagation. Multi-Layer Perceptron (MLP): adding more hidden layers. Backpropagation update ⇔ Perceptron case: assuming Δ^{l+1} = ∂L/∂U^{l+1} known, ∂L/∂W^{l+1} = H^{lT} Δ^{l+1}. Computing ∂L/∂U^l = Δ^l (= Δ^{l+1} W^{l+1 T} ⊙ H^l(1 − H^l) for the sigmoid); ∂L/∂W^l = H^{l−1 T} Δ^l.
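A sketch of backpropagation through a stack of sigmoid hidden layers with a soft-max output, following the recursions above (Δ^l = Δ^{l+1} W^{l+1 T} ⊙ H^l(1 − H^l) and ∂L/∂W^l = H^{l−1 T} Δ^l); the layer sizes and random data are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
sizes = [4, 16, 16, 3]                  # input, two hidden layers, output (arbitrary)
Ws = [rng.normal(scale=0.5, size=(sizes[l], sizes[l + 1])) for l in range(len(sizes) - 1)]
bs = [np.zeros((1, sizes[l + 1])) for l in range(len(sizes) - 1)]

N = 32
X = rng.normal(size=(N, sizes[0]))
y = rng.integers(0, sizes[-1], size=N)
Y = np.zeros((N, sizes[-1])); Y[np.arange(N), y] = 1

# Forward pass: keep every layer's activation H^l (H^0 = X)
H = [X]
for W, b in zip(Ws[:-1], bs[:-1]):
    H.append(sigmoid(H[-1] @ W + b))
Y_hat = softmax(H[-1] @ Ws[-1] + bs[-1])

# Backward pass
delta = (Y_hat - Y) / N                           # output layer: dL/dS = Y_hat - Y
grads_W, grads_b = [], []
for l in reversed(range(len(Ws))):
    grads_W.append(H[l].T @ delta)                # dL/dW^l = H^(l-1)T Delta^l
    grads_b.append(delta.sum(axis=0, keepdims=True))
    if l > 0:                                     # propagate to the previous layer
        delta = (delta @ Ws[l].T) * H[l] * (1 - H[l])   # sigmoid derivative term
grads_W, grads_b = grads_W[::-1], grads_b[::-1]
print([g.shape for g in grads_W])                 # same shapes as the W^l
```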
Neural Network Training: Optimization Issues. Classification loss over the training set (vectorized w, b ignored): L_CE(w) = (1/N) Σ_{i=1}^N ℓ_CE(ŷ_i, y_i) = −(1/N) Σ_{i=1}^N log(ŷ_{c*,i}). Gradient descent optimization: w^{(t+1)} = w^{(t)} − η ∇_w^{(t)}, with gradient ∇_w^{(t)} = (1/N) Σ_{i=1}^N ∂ℓ_CE(ŷ_i, y_i)/∂w (w^{(t)}). Computing ∇_w^{(t)} scales linearly with the dimension of w and with the training set size N ⇒ too slow even for moderate dimensionality & dataset size!
Stochastic Gradient Descent. Solution: approximate ∇_w^{(t)} = (1/N) Σ_{i=1}^N ∂ℓ_CE(ŷ_i, y_i)/∂w (w^{(t)}) with a subset of examples ⇒ Stochastic Gradient Descent (SGD). Use a single example (online): ∇_w^{(t)} ≈ ∂ℓ_CE(ŷ_i, y_i)/∂w (w^{(t)}). Mini-batch: use B < N examples: ∇_w^{(t)} ≈ (1/B) Σ_{i=1}^B ∂ℓ_CE(ŷ_i, y_i)/∂w (w^{(t)}). Full gradient vs SGD (online) vs SGD (mini-batch).
Stochastic Gradient Descent. SGD: approximation of the true gradient ∇_w! A noisy gradient can lead to a bad direction and increase the loss. BUT: many more parameter updates: online ×N, mini-batch ×N/B ⇒ faster convergence; at the core of Deep Learning for large-scale datasets. Full gradient vs SGD (online) vs SGD (mini-batch).
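A minimal sketch of mini-batch SGD for the logistic-regression model used above; the batch size, learning rate and random data are arbitrary, and the point is only that each update uses a subset of B examples rather than the whole training set.

```python
import numpy as np

def softmax(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, m, K, B = 1000, 10, 3, 32            # dataset size, input dim, classes, batch size
X = rng.normal(size=(N, m))
y = rng.integers(0, K, size=N)
Y = np.zeros((N, K)); Y[np.arange(N), y] = 1

W, b, eta = np.zeros((m, K)), np.zeros((1, K)), 0.1
for epoch in range(5):
    perm = rng.permutation(N)            # shuffle once per epoch
    for start in range(0, N, B):
        idx = perm[start:start + B]      # mini-batch of B examples
        Xb, Yb = X[idx], Y[idx]
        Y_hat = softmax(Xb @ W + b)
        delta = (Y_hat - Yb) / len(idx)  # gradient estimated on the batch only
        W -= eta * (Xb.T @ delta)
        b -= eta * delta.sum(axis=0, keepdims=True)
# ~N/B parameter updates per epoch instead of 1 for full-batch gradient descent
```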
Optimization: Learning Rate Decay. Gradient descent optimization: w^{(t+1)} = w^{(t)} − η ∇_w^{(t)}. How to set η? Open question ⇒ Learning Rate Decay: decrease η as training progresses. Inverse (time-based) decay: η_t = η_0 / (1 + r·t), r: decay rate. Exponential decay: η_t = η_0 e^{−λt}. Step decay: η_t = η_0 r^{t/t_u}. ... Exponential decay (η_0 = 0.1, λ = 0.1); step decay (η_0 = 0.1, r = 0.5, t_u = 10).
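The three decay schedules above, written out as a short sketch; the values η_0 = 0.1, λ = 0.1, r = 0.5 and t_u = 10 are taken from the slide, while the inverse-decay rate r = 0.1 is an arbitrary choice made here.

```python
import math

def inverse_decay(t, eta0=0.1, r=0.1):        # r: decay rate (value assumed here)
    return eta0 / (1.0 + r * t)

def exponential_decay(t, eta0=0.1, lam=0.1):  # eta_t = eta_0 * exp(-lambda * t)
    return eta0 * math.exp(-lam * t)

def step_decay(t, eta0=0.1, r=0.5, tu=10):    # multiply by r every tu iterations
    return eta0 * r ** (t // tu)

for t in (0, 10, 50, 100):
    print(t, inverse_decay(t), exponential_decay(t), step_decay(t))
```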
Generalization and Overfitting. Learning: minimizing the classification loss L_CE over the training set. Training set: a sample representing the data vs label distributions. Ultimate goal: train a prediction function with low prediction error on the true (unknown) data distribution. Example (figure): two models with L_train = 4 and L_train = 9, and L_test = 15 and L_test = 13 respectively. Optimization ≠ Machine Learning! ⇒ Generalization / Overfitting!
Regularization. Regularization: improving generalization, i.e. test (≠ train) performance. Structural regularization: add a prior R(w) to the training objective: L(w) = L_CE(w) + αR(w). L2 regularization: weight decay, R(w) = ||w||^2. Commonly used in neural networks. Theoretical justifications, generalization bounds (SVM). Other possible R(w): L1 regularization, dropout, etc.
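A sketch of how L2 regularization modifies the gradient update: with L(w) = L_CE(w) + α||w||², the extra term adds 2αw to the gradient, which shrinks the weights at every step (hence "weight decay"); α and the base gradient below are placeholders for illustration.

```python
import numpy as np

def regularized_step(w, grad_ce, eta=0.1, alpha=1e-3):
    """One gradient step on L(w) = L_CE(w) + alpha * ||w||^2.

    dL/dw = dL_CE/dw + 2 * alpha * w, so the update decays the weights towards 0
    in addition to following the data-fitting gradient.
    """
    return w - eta * (grad_ce + 2.0 * alpha * w)

w = np.array([2.0, -3.0, 0.5])
grad_ce = np.zeros(3)                 # placeholder: pretend the CE gradient is 0
for _ in range(100):
    w = regularized_step(w, grad_ce)  # with zero data gradient, w decays towards 0
print(w)
```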
Regularization and hyper-parameters. Neural networks: hyper-parameters to tune. Training parameters: learning rate, weight decay, learning rate decay, number of epochs, etc. Architectural parameters: number of layers, number of neurons per layer, type of non-linearity, etc. Hyper-parameter tuning to improve generalization: estimate performance on a validation set.
Neural networks: Conclusion. Training issues at several levels: optimization, generalization, cross-validation. Limits of fully connected layers, and Convolutional Neural Nets? ⇒ Next course!
References I. Warren S. McCulloch and Walter Pitts, A logical calculus of the ideas immanent in nervous activity, The Bulletin of Mathematical Biophysics 5 (1943), no. 4, 115–133.