Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning


Apprentissage, réseaux de neurones et modèles graphiques (RCP209): Neural Networks and Deep Learning. Nicolas Thome (Prenom.Nom@cnam.fr), http://cedric.cnam.fr/vertigo/cours/ml2/, Département Informatique, Conservatoire National des Arts et Métiers (Cnam).

8 weeks on Deep Learning and Structured Prediction. Weeks 1-5: Deep Learning. Weeks 6-8: Structured Prediction and Applications.

Outline: 1. Context; 2. Neural Networks; 3. Training Deep Neural Networks.

Context: Big Data. Superabundance of data: images, videos, audio, text, usage traces, etc. BBC: 2.4M videos; Facebook: 350B images; 100M monitoring cameras; 1B each day. Obvious need to access, search, or classify these data: recognition. Huge number of applications: mobile visual search, robotics, autonomous driving, augmented reality, medical imaging, etc. Leading track in major ML/CV conferences during the last decade.

Recognition and classification. Classification: assign a given input to one of a set of pre-defined classes. Recognition is much more general than classification, e.g. object localization in images, ranking for document indexing, sequence prediction for text, speech, audio, etc. Many tasks can be cast as classification problems, hence the importance of classification.

Focus on Visual Recognition: Perceiving the Visual World. Visual recognition: archetype of low-level signal understanding. Supposed to be a master-class problem in the early 80s. Certainly the topic most impacted by deep learning. Tasks: scene categorization, object localization, context and attribute recognition, rough 3D layout and depth ordering, rich description of scenes, e.g. in sentences.

Recognition of low-level signals. Challenge: filling the semantic gap between what we perceive and what a computer sees: illumination variations, view-point variations, deformable objects, intra-class variance, etc. How to design a "good" intermediate representation?

Deep Learning (DL) & Recognition of low-level signals. DL: breakthrough for the recognition of low-level signal data. Before DL: handcrafted intermediate representations for each task; needs expertise (PhD level) in each field; weak level of semantics in the representation.

Deep Learning (DL) & Recognition of low-level signals. Since DL: automatically learned intermediate representations. Outstanding experimental performances, far beyond handcrafted features. Able to learn high-level intermediate representations. Common learning methodology: field independent, no domain expertise required.

Deep Learning (DL) & Representation Learning. DL: breakthrough for representation learning, i.e. automatically learning intermediate levels of representation. Example: Natural Language Processing (NLP).

Outline: 1. Context; 2. Neural Networks; 3. Training Deep Neural Networks.

The Formal Neuron: 1943 [MP43]. Basis of neural networks. Input: vector $x \in \mathbb{R}^m$, i.e. $x = \{x_i\}_{i \in \{1,2,...,m\}}$. Neuron output $\hat{y} \in \mathbb{R}$: a scalar.

The Formal Neuron: 1943 [MP43]. Mapping from $x$ to $\hat{y}$: 1. linear (affine) mapping: $s = w^\top x + b$; 2. non-linear activation function $f$: $\hat{y} = f(s)$.

The Formal Neuron: Linear Mapping. Linear (affine) mapping: $s = w^\top x + b = \sum_{i=1}^{m} w_i x_i + b$. $w$: normal vector of a hyperplane in $\mathbb{R}^m$, i.e. a linear boundary. $b$: bias, shifts the position of the hyperplane. In 2D the hyperplane is a line, in 3D a plane.

The Formal Neuron: Activation Function. $\hat{y} = f(w^\top x + b)$, with $f$ the activation function. Popular choices for $f$: step, sigmoid, tanh. Step (Heaviside) function: $H(z) = 1$ if $z \geq 0$, $0$ otherwise.
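Below is a minimal NumPy sketch of the formal neuron just described (affine mapping followed by an activation); the input, weight, and bias values are illustrative placeholders, not taken from the slides.

```python
import numpy as np

def formal_neuron(x, w, b, f="step"):
    """Formal neuron: affine map s = w.x + b followed by an activation f(s)."""
    s = np.dot(w, x) + b
    if f == "step":               # Heaviside: 1 if s >= 0, else 0
        return float(s >= 0)
    if f == "tanh":
        return np.tanh(s)
    return 1.0 / (1.0 + np.exp(-s))   # sigmoid (slope a = 1)

# Illustrative input and parameters (not taken from the slides)
x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 0.3, -0.2])
b = 0.1
print([formal_neuron(x, w, b, f) for f in ("step", "sigmoid", "tanh")])
```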

Step function: Connection to Biological Neurons. Formal neuron with step activation $H$: $\hat{y} = H(w^\top x + b)$; $\hat{y} = 1$ (activated) $\Leftrightarrow w^\top x \geq -b$, $\hat{y} = 0$ (inactive) $\Leftrightarrow w^\top x < -b$. Biological neurons: the output is activated when the input, weighted by the synaptic weights, exceeds a threshold.

Sigmoid Activation Function. Neuron output $\hat{y} = f(w^\top x + b)$, $f$ activation function. Sigmoid: $\sigma(z) = (1 + e^{-az})^{-1}$. Larger $a$: closer to the step function (step: $a \to \infty$). The sigmoid has a linear regime and saturating regimes.

The Formal Neuron: Application to Binary Classification. Binary classification: label input $x$ as belonging to class 1 or class 0. Neuron output with sigmoid: $\hat{y} = \frac{1}{1 + e^{-a(w^\top x + b)}}$. Probabilistic interpretation of the sigmoid: $\hat{y} \approx P(1 \mid x)$. Input $x$ is classified as 1 if $P(1 \mid x) > 0.5 \Leftrightarrow w^\top x + b > 0$, and as 0 if $P(1 \mid x) < 0.5 \Leftrightarrow w^\top x + b < 0$. $\mathrm{sign}(w^\top x + b)$: linear decision boundary in input space!
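A small sketch, under the same notation, showing that thresholding the sigmoid output at 0.5 gives the same decision as testing the sign of $w^\top x + b$; the parameter values are illustrative.

```python
import numpy as np

def predict_binary(x, w, b, a=1.0):
    """Sigmoid output interpreted as P(1|x); threshold at 0.5."""
    s = np.dot(w, x) + b
    p = 1.0 / (1.0 + np.exp(-a * s))
    return p, int(p > 0.5), int(s > 0)   # the two decision rules coincide

# Illustrative parameters (not from the course)
w, b = np.array([1.0, 1.0]), -2.0
for x in (np.array([3.0, 1.5]), np.array([0.5, 0.5])):
    p, by_prob, by_sign = predict_binary(x, w, b)
    print(x, p, by_prob, by_sign)
```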

The Formal Neuron: Toy Example for Binary Classification. 2D example: $m = 2$, $x = \{x_1, x_2\} \in [-5; 5] \times [-5; 5]$. Linear mapping: $w = [1; 1]$ and $b = 2$; result of the linear mapping: $s = w^\top x + b$. Sigmoid activation function $\hat{y} = (1 + e^{-a(w^\top x + b)})^{-1}$, shown for $a = 10$, $a = 1$, and $a = 0.1$.
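A sketch of this toy example: it evaluates the neuron on a grid over $[-5; 5]^2$ with the slide's $w$ and $b$ and reports how step-like the sigmoid becomes for each slope $a$ (the grid resolution and the 0.05/0.95 saturation thresholds are arbitrary choices).

```python
import numpy as np

# Toy example from the slides: x in [-5, 5]^2, w = [1, 1], b = 2
w, b = np.array([1.0, 1.0]), 2.0
x1, x2 = np.meshgrid(np.linspace(-5, 5, 11), np.linspace(-5, 5, 11))
s = w[0] * x1 + w[1] * x2 + b              # linear mapping on the grid

for a in (10.0, 1.0, 0.1):                 # slope values used in the slides
    y_hat = 1.0 / (1.0 + np.exp(-a * s))   # sigmoid activation
    # Fraction of the grid confidently classified (output close to 0 or 1)
    confident = np.mean((y_hat < 0.05) | (y_hat > 0.95))
    print(f"a={a:4}: outputs in [{y_hat.min():.3f}, {y_hat.max():.3f}], "
          f"{100 * confident:.0f}% of points are saturated")
```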

From Formal Neuron to Neural Networks. Formal neuron: 1. a single scalar output; 2. a linear decision boundary for binary classification. A single scalar output is limiting for several tasks, e.g. multi-class classification such as MNIST or CIFAR.

Perceptron and Multi-Class Classification. Formal neuron: limited to binary classification. Multi-class classification: use several output neurons instead of a single one! Perceptron: input $x \in \mathbb{R}^m$; output neuron $\hat{y}_1$ is a formal neuron: linear (affine) mapping $s_1 = w_1^\top x + b_1$, non-linear activation $f$: $\hat{y}_1 = f(s_1)$. Linear mapping parameters: $w_1 = \{w_{11}, ..., w_{m1}\} \in \mathbb{R}^m$, $b_1 \in \mathbb{R}$.

Perceptron and Multi-Class Classification. Input $x \in \mathbb{R}^m$; output neuron $\hat{y}_k$ is a formal neuron: linear (affine) mapping $s_k = w_k^\top x + b_k$, non-linear activation $f$: $\hat{y}_k = f(s_k)$. Linear mapping parameters: $w_k = \{w_{1k}, ..., w_{mk}\} \in \mathbb{R}^m$, $b_k \in \mathbb{R}$.

Perceptron and Multi-Class Classification. Input $x \in \mathbb{R}^m$ (a $1 \times m$ row vector), output $\hat{y}$: concatenation of $K$ formal neurons. The linear (affine) mapping becomes a matrix multiplication: $s = xW + b$, where $W$ is an $m \times K$ matrix whose columns are the $w_k$, and $b$ is a $1 \times K$ bias vector. Element-wise non-linear activation: $\hat{y} = f(s)$.
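A minimal sketch of this vectorized form, with one matrix multiplication computing all $K$ neurons at once; sizes and values are illustrative.

```python
import numpy as np

def perceptron_layer(x, W, b, f=lambda s: 1.0 / (1.0 + np.exp(-s))):
    """K formal neurons at once: x (1 x m), W (m x K), b (1 x K)."""
    s = x @ W + b          # one matrix multiplication instead of K dot products
    return f(s)

# Illustrative sizes: m = 4 inputs, K = 3 output neurons
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))
W = rng.normal(size=(4, 3))
b = np.zeros((1, 3))
print(perceptron_layer(x, W, b))   # shape (1, 3)
```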

Perceptron and Multi-Class Classification. Soft-max activation: $\hat{y}_k = f(s_k) = \frac{e^{s_k}}{\sum_{k'=1}^{K} e^{s_{k'}}}$. Probabilistic interpretation for multi-class classification: each output neuron corresponds to a class, $\hat{y}_k \approx P(k \mid x, w)$. This is the Logistic Regression (LR) model!
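A possible soft-max implementation; subtracting the row-wise maximum before exponentiating is a common numerical-stability trick and is not part of the slide's formula.

```python
import numpy as np

def softmax(s):
    """Soft-max over the last axis; subtracting the max keeps exp() bounded."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

s = np.array([[2.0, 1.0, -1.0]])
p = softmax(s)
print(p, p.sum())   # a probability distribution over the K classes
```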

2D Toy Example for Multi-Class Classification. $x = \{x_1, x_2\} \in [-5; 5] \times [-5; 5]$, $\hat{y}$: 3 outputs (classes). Linear mapping for each class: $s_k = w_k^\top x + b_k$, with $w_1 = [1; 1]$, $b_1 = 2$; $w_2 = [0; 1]$, $b_2 = 1$; $w_3 = [1; 0.5]$, $b_3 = 10$. Soft-max output: $P(k \mid x, W)$. Class prediction: $k^* = \arg\max_k P(k \mid x, W)$.
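A sketch of soft-max prediction with arg max for a 2-input, 3-class model; the weights below are placeholders rather than the slide's exact values.

```python
import numpy as np

# Illustrative 2-input, 3-class linear classifier (weights are placeholders,
# not the exact values used in the slides)
W = np.array([[1.0, 0.0, -1.0],    # columns are w_1, w_2, w_3
              [1.0, 1.0, -0.5]])
b = np.array([-2.0, 1.0, 1.0])

def predict(x):
    s = x @ W + b
    e = np.exp(s - s.max())
    p = e / e.sum()                  # soft-max: P(k | x, W)
    return p, int(np.argmax(p)) + 1  # predicted class in {1, 2, 3}

for x in ([4.0, 4.0], [0.0, 4.0], [-4.0, 0.0]):
    p, k = predict(np.array(x))
    print(x, np.round(p, 3), "-> class", k)
```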

Beyond Linear Classification: the X-OR Problem. Logistic Regression (LR): a neural network with one input layer and one output layer. LR is limited to linear decision boundaries. X-OR: $(\text{NOT } x_1 \text{ AND } x_2)$ OR $(\text{NOT } x_2 \text{ AND } x_1)$, which requires a non-linear decision function.

Beyond Linear Classification. LR: limited to linear boundaries. Solution: add a layer! Input $x \in \mathbb{R}^m$, e.g. $m = 4$; output $\hat{y} \in \mathbb{R}^K$ ($K$ = number of classes), e.g. $K = 2$; hidden layer $h \in \mathbb{R}^L$.

Multi-Layer Perceptron. Hidden layer $h$: projection of $x$ into a new space $\mathbb{R}^L$. A neural net with one hidden layer is a Multi-Layer Perceptron (MLP). $h$ is an intermediate representation of $x$ for classification $\hat{y}$: $h = f(xW + b)$. The mapping from $x$ to $\hat{y}$ has a non-linear boundary: the activation $f$ is crucial!
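A minimal forward pass for such an MLP, assuming a sigmoid hidden layer and a soft-max output; sizes follow the slide's example ($m = 4$, $K = 2$) while $L$ and the random weights are arbitrary.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, Wh, bh, Wy, by):
    """One hidden layer: h = f(x Wh + bh), then y_hat = softmax(h Wy + by)."""
    h = sigmoid(x @ Wh + bh)       # hidden representation of x (size L)
    return softmax(h @ Wy + by)    # class probabilities (size K)

# Illustrative sizes: m = 4 inputs, L hidden units, K = 2 classes
rng = np.random.default_rng(0)
m, L, K = 4, 5, 2
x = rng.normal(size=(1, m))
Wh, bh = rng.normal(size=(m, L)), np.zeros((1, L))
Wy, by = rng.normal(size=(L, K)), np.zeros((1, K))
print(mlp_forward(x, Wh, bh, Wy, by))
```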

Deep Neural Networks. Adding more hidden layers: Deep Neural Networks (DNN), the basis of Deep Learning. Each layer $h^l$ projects the previous layer $h^{l-1}$ into a new space, gradually learning intermediate representations useful for the task.
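A sketch of the same idea with an arbitrary number of layers: each $(W, b)$ pair projects the previous representation into a new space (the architecture chosen here is purely illustrative).

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def dnn_forward(x, layers):
    """Stack of layers: each (W, b) projects the previous representation
    h^(l-1) into a new space, h^l = f(h^(l-1) W + b)."""
    h = x
    for W, b in layers:
        h = sigmoid(h @ W + b)
    return h

# Illustrative architecture: 8 -> 16 -> 8 -> 3
rng = np.random.default_rng(0)
sizes = [8, 16, 8, 3]
layers = [(rng.normal(size=(m, n)), np.zeros((1, n)))
          for m, n in zip(sizes[:-1], sizes[1:])]
print(dnn_forward(rng.normal(size=(1, 8)), layers).shape)   # (1, 3)
```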

Conclusion. Deep neural networks are applicable to classification problems with non-linear decision boundaries. So far we have only visualized predictions from fixed model parameters; the reverse problem, estimating the parameters from data, is supervised learning.

Outline: 1. Context; 2. Neural Networks; 3. Training Deep Neural Networks.

Training Multi-Layer Perceptrons (MLP). Input $x$, output $y$. A parametrized model $x \to y$: $f_w(x_i) = \hat{y}_i$. Supervised context: training set $\mathcal{A} = \{(x_i, y_i)\}_{i \in \{1,2,...,N\}}$ and a loss function $\ell(\hat{y}_i, y_i)$ for each annotated pair $(x_i, y_i)$. Assumptions: the parameters $w \in \mathbb{R}^d$ are continuous and the total loss $L$ is differentiable. The gradient $\nabla_w = \frac{\partial L}{\partial w}$ points in the direction of steepest increase of the loss, so moving along $-\nabla_w$ decreases $L$ fastest.

MLP Training. Gradient descent algorithm: initialize the parameters $w$; update $w^{(t+1)} = w^{(t)} - \eta \frac{\partial L}{\partial w}$; repeat until convergence, e.g. until $\|\nabla_w\|^2 \approx 0$.

Gradient Descent. Update rule: $w^{(t+1)} = w^{(t)} - \eta \frac{\partial L}{\partial w}$, with $\eta$ the learning rate. Is convergence ensured? Yes, provided a "well chosen" learning rate $\eta$.
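A generic gradient-descent loop matching this update rule, demonstrated on a toy convex loss; the step count, tolerance, and loss are illustrative.

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, n_steps=100, tol=1e-8):
    """Generic update w <- w - eta * dL/dw, stopped when the gradient vanishes."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        g = grad(w)
        if np.linalg.norm(g) ** 2 < tol:   # convergence test from the slide
            break
        w = w - eta * g
    return w

# Toy convex loss L(w) = ||w - [1, 2]||^2, whose gradient is 2 (w - [1, 2])
target = np.array([1.0, 2.0])
grad = lambda w: 2.0 * (w - target)
print(gradient_descent(grad, w0=[0.0, 0.0]))   # converges near [1, 2]
```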

Gradient Descent. Update rule: $w^{(t+1)} = w^{(t)} - \eta \frac{\partial L}{\partial w}$. Does it reach the global minimum? It depends on whether the loss $L(w)$ is convex or non-convex. (Figure: a) convex function, b) non-convex function.)

Supervised Learning: Multi-Class Classification. Logistic regression for multi-class classification: $s_i = x_i W + b$; soft-max (SM): $\hat{y}_k \approx P(k \mid x_i, W, b) = \frac{e^{s_k}}{\sum_{k'=1}^{K} e^{s_{k'}}}$. Supervised loss function: $L(W, b) = \frac{1}{N} \sum_{i=1}^{N} \ell(\hat{y}_i, y_i)$, with 1. label $y_i^* \in \{1; 2; ...; K\}$; 2. prediction $\hat{y}_i^* = \arg\max_k P(k \mid x_i, W, b)$; 3. $\ell_{0/1}(\hat{y}_i^*, y_i^*) = 1$ if $\hat{y}_i^* \neq y_i^*$ and $0$ otherwise: the 0/1 loss.

Logistic Regression Training Formulation. Input $x_i$, ground-truth output supervision $y_i^*$. One-hot encoding of $y_i^*$: a vector $y_i$ with $y_{c,i} = 1$ if $c$ is the ground-truth class $c^*$ of $x_i$, and $0$ otherwise.

Logistic Regression Training Formulation. Loss function: multi-class Cross-Entropy (CE) $\ell_{CE}$. $\ell_{CE}$ is the Kullback-Leibler divergence between $y_i$ and $\hat{y}_i$: $\ell_{CE}(\hat{y}_i, y_i) = KL(y_i, \hat{y}_i) = -\sum_{c=1}^{K} y_{c,i} \log(\hat{y}_{c,i}) = -\log(\hat{y}_{c^*,i})$, where $c^*$ is the ground-truth class. Note that KL is asymmetric: $KL(\hat{y}_i, y_i) \neq KL(y_i, \hat{y}_i)$.
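A small sketch of the cross-entropy computation for a one-hot target; the `eps` term only guards against $\log(0)$ and is not in the slide's formula.

```python
import numpy as np

def cross_entropy(y_hat, y_onehot, eps=1e-12):
    """Multi-class CE: -sum_c y_c * log(y_hat_c), i.e. -log of the probability
    assigned to the ground-truth class (y is one-hot)."""
    return -np.sum(y_onehot * np.log(y_hat + eps), axis=-1)

# One example with K = 3 classes, ground-truth class c* = 2 (index 1)
y_hat = np.array([0.2, 0.7, 0.1])
y = np.array([0.0, 1.0, 0.0])
print(cross_entropy(y_hat, y), -np.log(0.7))   # both ~ 0.357
```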

Logistic Regression Training. $L_{CE}(W, b) = \frac{1}{N} \sum_{i=1}^{N} \ell_{CE}(\hat{y}_i, y_i) = -\frac{1}{N} \sum_{i=1}^{N} \log(\hat{y}_{c^*,i})$. $\ell_{CE}$ is a smooth, convex upper bound of $\ell_{0/1}$, which allows gradient descent optimization: $W^{(t+1)} = W^{(t)} - \eta \frac{\partial L_{CE}}{\partial W}$ (and $b^{(t+1)} = b^{(t)} - \eta \frac{\partial L_{CE}}{\partial b}$). Main challenge: computing $\frac{\partial L_{CE}}{\partial W} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial \ell_{CE}}{\partial W}$. Key property: the chain rule $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$, which gives backpropagation of the gradient error!

Chain Rule. $\frac{\partial \ell}{\partial x} = \frac{\partial \ell}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial x}$. For logistic regression: $\frac{\partial \ell_{CE}}{\partial W} = \frac{\partial \ell_{CE}}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial s_i} \frac{\partial s_i}{\partial W}$.

Logistic Regression Training: Backpropagation. $\frac{\partial \ell_{CE}}{\partial W} = \frac{\partial \ell_{CE}}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial s_i} \frac{\partial s_i}{\partial W}$, with $\ell_{CE}(\hat{y}_i, y_i^*) = -\log(\hat{y}_{c^*,i})$. Update for one example: $\frac{\partial \ell_{CE}}{\partial \hat{y}_{c,i}} = -\frac{\delta_{c,c^*}}{\hat{y}_{c^*,i}}$; $\frac{\partial \ell_{CE}}{\partial s_i} = \hat{y}_i - y_i = \delta_i^y$; $\frac{\partial \ell_{CE}}{\partial W} = x_i^\top \delta_i^y$.

Logistic Regression Training: Backpropagation. Whole dataset: data matrix $X$ ($N \times m$), prediction and label matrices $\hat{Y}$, $Y$ ($N \times K$). $L_{CE}(W, b) = -\frac{1}{N} \sum_{i=1}^{N} \log(\hat{y}_{c^*,i})$, $\frac{\partial L_{CE}}{\partial W} = \frac{\partial L_{CE}}{\partial \hat{Y}} \frac{\partial \hat{Y}}{\partial S} \frac{\partial S}{\partial W}$, with $\frac{\partial L_{CE}}{\partial S} = \hat{Y} - Y = \Delta^y$ and $\frac{\partial L_{CE}}{\partial W} = X^\top \Delta^y$.
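A sketch of these whole-dataset gradients, with the $\frac{1}{N}$ averaging folded into $\Delta^y$; data shapes and values are random placeholders.

```python
import numpy as np

def softmax(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def lr_gradients(X, Y, W, b):
    """Backprop for logistic regression: delta_y = (Y_hat - Y) / N,
    then dL/dW = X^T delta_y and dL/db = column sums of delta_y."""
    N = X.shape[0]
    Y_hat = softmax(X @ W + b)
    delta_y = (Y_hat - Y) / N
    return X.T @ delta_y, delta_y.sum(axis=0, keepdims=True)

# Illustrative random data: N = 6 examples, m = 4 features, K = 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Y = np.eye(3)[rng.integers(0, 3, size=6)]          # one-hot labels
W, b = rng.normal(size=(4, 3)), np.zeros((1, 3))
dW, db = lr_gradients(X, Y, W, b)
print(dW.shape, db.shape)                          # (4, 3) (1, 3)
```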

Perceptron Training: Backpropagation. Perceptron vs logistic regression: adding a hidden layer (sigmoid). Goal: train the parameters $W^y$ and $W^h$ (plus biases) with backpropagation, i.e. compute $\frac{\partial L_{CE}}{\partial W^y} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial \ell_{CE}}{\partial W^y}$ and $\frac{\partial L_{CE}}{\partial W^h} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial \ell_{CE}}{\partial W^h}$. The last (output) layer behaves like logistic regression. First (hidden) layer: $\frac{\partial \ell_{CE}}{\partial W^h} = x_i^\top \frac{\partial \ell_{CE}}{\partial u_i}$, which requires computing $\frac{\partial \ell_{CE}}{\partial u_i} = \delta_i^h$.

Perceptron Training: Backpropagation. Computing $\frac{\partial \ell_{CE}}{\partial u_i} = \delta_i^h$ with the chain rule: $\frac{\partial \ell_{CE}}{\partial u_i} = \frac{\partial \ell_{CE}}{\partial v_i} \frac{\partial v_i}{\partial h_i} \frac{\partial h_i}{\partial u_i}$, leading to $\delta_i^h = \left(\delta_i^y W^{y\top}\right) \odot \sigma'(u_i) = \left(\delta_i^y W^{y\top}\right) \odot h_i \odot (1 - h_i)$.

Deep Neural Network Training: Backpropagation. Multi-Layer Perceptron (MLP): adding more hidden layers. Backpropagation update: assuming $\frac{\partial L}{\partial U^{l+1}} = \Delta^{l+1}$ is known, then $\frac{\partial L}{\partial W^{l+1}} = H^{l\top} \Delta^{l+1}$; computing $\frac{\partial L}{\partial U^l} = \Delta^l = \left(\Delta^{l+1} W^{l+1\top}\right) \odot H^l \odot (1 - H^l)$ (for sigmoid activations); and $\frac{\partial L}{\partial W^l} = H^{l-1\top} \Delta^l$.
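A compact sketch of these recursions for one hidden layer (sigmoid units, soft-max/cross-entropy output); shapes and data are illustrative.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def softmax(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mlp_backprop(X, Y, Wh, bh, Wy, by):
    """Forward then backward pass for a 1-hidden-layer MLP with sigmoid units
    and a soft-max/cross-entropy output, following the recursions above."""
    N = X.shape[0]
    H = sigmoid(X @ Wh + bh)                    # hidden activations
    Y_hat = softmax(H @ Wy + by)                # output probabilities
    delta_y = (Y_hat - Y) / N                   # dL/dV (output pre-activations)
    dWy, dby = H.T @ delta_y, delta_y.sum(0, keepdims=True)
    delta_h = (delta_y @ Wy.T) * H * (1.0 - H)  # dL/dU (hidden pre-activations)
    dWh, dbh = X.T @ delta_h, delta_h.sum(0, keepdims=True)
    return dWh, dbh, dWy, dby

# Illustrative shapes: N = 8, m = 4, L = 5 hidden units, K = 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
Y = np.eye(3)[rng.integers(0, 3, size=8)]
Wh, bh = rng.normal(size=(4, 5)), np.zeros((1, 5))
Wy, by = rng.normal(size=(5, 3)), np.zeros((1, 3))
print([g.shape for g in mlp_backprop(X, Y, Wh, bh, Wy, by)])
```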

Neural Network Training: Optimization Issues. Classification loss over the training set (vectorized $w$, $b$ ignored): $L_{CE}(w) = \frac{1}{N} \sum_{i=1}^{N} \ell_{CE}(\hat{y}_i, y_i) = -\frac{1}{N} \sum_{i=1}^{N} \log(\hat{y}_{c^*,i})$. Gradient descent optimization: $w^{(t+1)} = w^{(t)} - \eta \nabla_w^{(t)}$, with $\nabla_w^{(t)} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial \ell_{CE}(\hat{y}_i, y_i)}{\partial w}(w^{(t)})$. The cost of each update scales linearly with both the dimension of $w$ and the training set size $N$: too slow even for moderate dimensionality and dataset size!

Stochastic Gradient Descent. Solution: approximate $\nabla_w^{(t)} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial \ell_{CE}(\hat{y}_i, y_i)}{\partial w}(w^{(t)})$ with a subset of examples: Stochastic Gradient Descent (SGD). Use a single example (online): $\nabla_w^{(t)} \approx \frac{\partial \ell_{CE}(\hat{y}_i, y_i)}{\partial w}(w^{(t)})$. Mini-batch: use $B < N$ examples: $\nabla_w^{(t)} \approx \frac{1}{B} \sum_{i=1}^{B} \frac{\partial \ell_{CE}(\hat{y}_i, y_i)}{\partial w}(w^{(t)})$. (Figure: full gradient vs SGD (online) vs SGD (mini-batch).)

Stochastic Gradient Descent. SGD is an approximation of the true gradient $\nabla_w$! A noisy gradient can point in a bad direction and temporarily increase the loss, BUT there are many more parameter updates per epoch: $\times N$ online, $\times \frac{N}{B}$ with mini-batches. The result is faster convergence; SGD is at the core of deep learning for large-scale datasets. (Figure: full gradient vs SGD (online) vs SGD (mini-batch).)
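A minimal mini-batch SGD loop in this spirit, demonstrated on a toy least-squares problem; batch size, learning rate, and epoch count are arbitrary.

```python
import numpy as np

def sgd(grad_batch, w0, X, Y, eta=0.1, batch_size=32, n_epochs=10, seed=0):
    """Mini-batch SGD: each update uses B examples instead of the full set."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    N = X.shape[0]
    for _ in range(n_epochs):
        order = rng.permutation(N)                     # reshuffle every epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            w = w - eta * grad_batch(w, X[idx], Y[idx])
    return w

# Toy least-squares problem: the gradient of mean((Xw - y)^2) over a batch
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true
grad = lambda w, Xb, Yb: 2.0 * Xb.T @ (Xb @ w - Yb) / len(Yb)
print(sgd(grad, np.zeros(3), X, Y))   # close to w_true
```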

Optimization: Learning Rate Decay. Gradient descent optimization: $w^{(t+1)} = w^{(t)} - \eta \nabla_w^{(t)}$. How to set $\eta$? Still an open question. Learning rate decay: decrease $\eta$ as training progresses. Inverse (time-based) decay: $\eta_t = \frac{\eta_0}{1 + r t}$, with $r$ the decay rate; exponential decay: $\eta_t = \eta_0 e^{-\lambda t}$; step decay: $\eta_t = \eta_0 \, r^{\lfloor t / t_u \rfloor}$; etc. (Figures: exponential decay with $\eta_0 = 0.1$, $\lambda = 0.1$; step decay with $\eta_0 = 0.1$, $r = 0.5$, $t_u = 10$.)
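The three decay schedules above as a small helper; the parameter names (`r`, `lam`, `t_u`) are ours and the defaults mirror the slide's figure settings.

```python
import numpy as np

def lr_schedule(t, eta0=0.1, kind="step", r=0.5, lam=0.1, t_u=10):
    """Learning-rate decay schedules listed above (parameter names are ours)."""
    if kind == "inverse":                 # time-based decay
        return eta0 / (1.0 + r * t)
    if kind == "exponential":
        return eta0 * np.exp(-lam * t)
    return eta0 * r ** (t // t_u)         # step decay: drop every t_u epochs

for t in (0, 10, 20, 50):
    print(t, [round(lr_schedule(t, kind=k), 4)
              for k in ("inverse", "exponential", "step")])
```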

Generalization and Overfitting. Learning: minimizing the classification loss $L_{CE}$ over the training set. The training set is only a sample of the joint data and label distributions. Ultimate goal: train a prediction function with low prediction error on the true (unknown) data distribution. (Example from the slide: a model with $L_{train} = 4$ but $L_{test} = 15$ vs a model with $L_{train} = 9$ but $L_{test} = 13$.) Optimization is not the same as machine learning: generalization vs overfitting!

Regularization. Regularization: improving generalization, i.e. test (not train) performance. Structural regularization: add a prior $R(w)$ to the training objective: $L(w) = L_{CE}(w) + \alpha R(w)$. $L_2$ regularization (weight decay): $R(w) = \|w\|^2$, commonly used in neural networks, with theoretical justifications and generalization bounds (SVM). Other possible $R(w)$: $L_1$ regularization, dropout, etc.
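A sketch of how the $L_2$ term modifies the loss and its gradient; the quadratic data loss used here is a stand-in for $L_{CE}$, chosen only to keep the example self-contained.

```python
import numpy as np

def regularized_loss_and_grad(w, data_loss, data_grad, alpha=1e-3):
    """L(w) = L_data(w) + alpha * ||w||^2, so the gradient gets an extra
    2 * alpha * w term (weight decay)."""
    loss = data_loss(w) + alpha * np.sum(w ** 2)
    grad = data_grad(w) + 2.0 * alpha * w
    return loss, grad

# Toy quadratic data loss, just to show the extra term (illustrative only)
data_loss = lambda w: np.sum((w - 3.0) ** 2)
data_grad = lambda w: 2.0 * (w - 3.0)
print(regularized_loss_and_grad(np.array([3.0, 3.0]), data_loss, data_grad))
```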

Regularization and hyper-parameters. Neural networks have many hyper-parameters to tune. Training parameters: learning rate, weight decay, learning rate decay, number of epochs, etc. Architectural parameters: number of layers, number of neurons per layer, type of non-linearity, etc. Hyper-parameter tuning aims at improving generalization: estimate performance on a validation set.

Neural networks: Conclusion. Training raises issues at several levels: optimization, generalization, cross-validation. Limits of fully connected layers, and Convolutional Neural Nets: next course!

References I. [MP43] Warren S. McCulloch and Walter Pitts, A logical calculus of the ideas immanent in nervous activity, The Bulletin of Mathematical Biophysics 5 (1943), no. 4, 115-133.