RegML 2018 Class 8 Deep learning


RegML 2018, Class 8: Deep learning. Lorenzo Rosasco, UNIGE-MIT-IIT. June 18, 2018.

Supervised vs unsupervised learning? So far we have been thinking of learning schemes made in two steps:

$f(x) = \langle w, \Phi(x) \rangle_{\mathcal{F}}, \quad x \in X$

unsupervised learning of $\Phi$, supervised learning of $w$. But can we perform only one learning step?

In practice all is multi-layer! (an old slide) Typical data representation schemes, e.g. in vision or speech, involve multiple stages (layers). Pipeline: raw data are often processed by first computing some low level features, then learning some mid level representation, ... and finally using supervised learning. These stages are often done separately, but is it possible to design end-to-end learning systems?

In practice all is (becoming) deep learning! (updated slide) Typical data representation schemes, e.g. in vision or speech, involve deep learning. Pipeline: design some wild, but differentiable, hierarchical architecture, then proceed with end-to-end learning! Ok, maybe not all is deep learning, but let's take a look.

Shallow nets

$f(x) = \langle w, \Phi(x) \rangle$, with the representation $x \mapsto \Phi(x)$ fixed.

Empirical Risk Minimization (ERM):

$\min_{w} \frac{1}{n} \sum_{i=1}^{n} (y_i - \langle w, \Phi(x_i) \rangle)^2$

Neural Nets Basic idea of neural networks: functions obtained by composition,

$\Phi = \Phi_L \circ \cdots \circ \Phi_2 \circ \Phi_1$

Let $d_0 = d$ and $\Phi_l : \mathbb{R}^{d_{l-1}} \to \mathbb{R}^{d_l}$, $l = 1, \dots, L$, and in particular

$\Phi_l = \sigma \circ W_l, \quad l = 1, \dots, L$

where $W_l : \mathbb{R}^{d_{l-1}} \to \mathbb{R}^{d_l}$, $l = 1, \dots, L$, is linear/affine and $\sigma$ is a non linear map acting component-wise, $\sigma : \mathbb{R} \to \mathbb{R}$.

Deep neural nets

$f(x) = \langle w, \Phi_L(x) \rangle, \quad \Phi_L = \underbrace{\Phi_L \circ \cdots \circ \Phi_1}_{\text{compositional representation}}, \quad \Phi_1 = \sigma \circ W_1, \; \dots, \; \Phi_L = \sigma \circ W_L$

ERM:

$\min_{w, (W_j)_j} \frac{1}{n} \sum_{i=1}^{n} (y_i - \langle w, \Phi_L(x_i) \rangle)^2$

Neural networks terminology

$\Phi_L(x) = \sigma(W_L \cdots \sigma(W_2\, \sigma(W_1 x)))$

Each intermediate representation corresponds to a (hidden) layer; the dimensionalities $(d_l)_l$ correspond to the number of hidden units; the non linearity is called the activation function.
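To make the composition concrete, here is a minimal NumPy sketch of the forward map $\Phi_L(x) = \sigma(W_L \cdots \sigma(W_1 x))$; the layer widths, the random weights, and the choice of the hinge activation are arbitrary assumptions made only for illustration.

```python
import numpy as np

def forward(x, weights, sigma=lambda a: np.maximum(a, 0.0)):
    """Compute Phi_L(x) = sigma(W_L ... sigma(W_2 sigma(W_1 x)))."""
    h = x
    for W in weights:          # W_l : R^{d_{l-1}} -> R^{d_l}, here without bias
        h = sigma(W @ h)       # linear map followed by the component-wise non-linearity
    return h

# Example with d_0 = 4 inputs and two hidden layers of 8 and 3 units.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 4)), rng.standard_normal((3, 8))]
x = rng.standard_normal(4)
print(forward(x, weights))     # the representation Phi_L(x) in R^3
```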

Neural networks illustrated [Figure: a network mapping the input x to the output y through weight matrices W.] Each neuron computes an inner product based on a column of a weight matrix W. The non-linearity σ is the neuron activation function.

Activation functions

logistic function: $s(\alpha) = (1 + e^{-\alpha})^{-1}$, $\alpha \in \mathbb{R}$,
hyperbolic tangent: $s(\alpha) = (e^{\alpha} - e^{-\alpha})/(e^{\alpha} + e^{-\alpha})$, $\alpha \in \mathbb{R}$,
hinge: $s(\alpha) = |\alpha|_+$, $\alpha \in \mathbb{R}$.

Note: if the activation is chosen to be linear, the architecture is equivalent to one layer.
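The three activations listed above, written out as a small NumPy sketch (the hinge is what is nowadays usually called the ReLU):

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))   # s(a) = (1 + e^{-a})^{-1}

def hyperbolic_tangent(a):
    return np.tanh(a)                 # (e^a - e^{-a}) / (e^a + e^{-a})

def hinge(a):
    return np.maximum(a, 0.0)         # s(a) = |a|_+ (ReLU)

a = np.linspace(-3.0, 3.0, 7)
for s in (logistic, hyperbolic_tangent, hinge):
    print(s.__name__, np.round(s(a), 3))
```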

Neural networks function spaces Consider the non linear space of functions of the form

$f_{w,(W_l)_l} : X \to \mathbb{R}, \quad f_{w,(W_l)_l}(x) = \langle w, \Phi_{(W_l)_l}(x) \rangle, \quad \Phi_{(W_l)_l}(x) = \sigma(W_L \cdots \sigma(W_2\, \sigma(W_1 x)))$

Very little structure, but we can: train by gradient descent (next); get (some) approximation/statistical guarantees (later).

One layer neural networks Consider only one hidden layer:

$f_{w,W}(x) = \langle w, \sigma(Wx) \rangle = \sum_{j=1}^{u} w_j\, \sigma(\langle W_j, x \rangle)$

typically optimized given supervised data by minimizing

$\frac{1}{n} \sum_{i=1}^{n} (y_i - f_{w,W}(x_i))^2$,

possibly with norm constraints on the weights (regularization). The problem is non-convex (though possibly smooth, depending on σ)!
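A minimal sketch of the one-hidden-layer model and of the (unregularized) least squares objective above; the toy data, the number u of hidden units, and the choice σ = tanh are assumptions made for the example only.

```python
import numpy as np

def f(x, w, W, sigma=np.tanh):
    """f_{w,W}(x) = <w, sigma(W x)> = sum_j w_j sigma(<W_j, x>)."""
    return w @ sigma(W @ x)

def empirical_risk(w, W, X, y, sigma=np.tanh):
    """(1/n) sum_i (y_i - f_{w,W}(x_i))^2."""
    preds = np.array([f(x, w, W, sigma) for x in X])
    return np.mean((y - preds) ** 2)

rng = np.random.default_rng(0)
n, d, u = 50, 3, 10                       # n points in R^d, u hidden units
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])                       # toy regression target
w, W = rng.standard_normal(u), rng.standard_normal((u, d))
print(empirical_risk(w, W, X, y))
```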

Back-propagation Empirical risk minimization:

$\min_{w,W} \hat{E}(w, W), \quad \hat{E}(w, W) = \sum_{i=1}^{n} (y_i - f_{(w,W)}(x_i))^2$

An approximate minimizer is computed via the following update rules:

$w_j^{t+1} = w_j^t - \gamma_t \frac{\partial \hat{E}}{\partial w_j}(w^t, W^t)$

$W_{j,k}^{t+1} = W_{j,k}^t - \gamma_t \frac{\partial \hat{E}}{\partial W_{j,k}}(w^{t+1}, W^t)$

where the step-size $(\gamma_t)_t$ is often called the learning rate.

Back-propagation & chain rule Direct computations show that:

$\frac{\partial \hat{E}}{\partial w_j}(w, W) = -2 \sum_{i=1}^{n} (y_i - f_{(w,W)}(x_i))\, \underbrace{\sigma(\langle W_j, x_i \rangle)}_{h_{j,i}}$

$\frac{\partial \hat{E}}{\partial W_{j,k}}(w, W) = -2 \sum_{i=1}^{n} \underbrace{(y_i - f_{(w,W)}(x_i))\, w_j\, \sigma'(\langle W_j, x_i \rangle)}_{\eta_{i,j}}\, x_i^k$

Back-prop equations: $\eta_{i,j} = c_i\, w_j\, \sigma'(\langle W_j, x_i \rangle)$, with $c_i = y_i - f_{(w,W)}(x_i)$.

Using the above equations, the updates are performed in two steps: a forward pass computes the function values keeping the weights fixed; a backward pass computes the errors and propagates them back. Hence the weights are updated.
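A self-contained NumPy sketch of these gradients and of the resulting gradient-descent updates for the one-hidden-layer model with σ = tanh; for brevity both blocks of weights are updated with gradients computed at the same point, and the data, widths, and learning rate are arbitrary choices for illustration.

```python
import numpy as np

sigma = np.tanh
dsigma = lambda a: 1.0 - np.tanh(a) ** 2             # sigma'(a)

def gradients(w, W, X, y):
    """Gradients of E(w, W) = sum_i (y_i - <w, sigma(W x_i)>)^2."""
    gw, gW = np.zeros_like(w), np.zeros_like(W)
    for x, t in zip(X, y):
        a = W @ x                                    # pre-activations <W_j, x_i>
        h = sigma(a)                                 # hidden values h_{j,i}
        r = t - w @ h                                # residual y_i - f_{(w,W)}(x_i)
        gw += -2.0 * r * h                           # dE/dw_j
        gW += -2.0 * r * np.outer(w * dsigma(a), x)  # dE/dW_{j,k}
    return gw, gW

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3)); y = np.sin(X[:, 0])
w, W = 0.1 * rng.standard_normal(10), 0.1 * rng.standard_normal((10, 3))
gamma = 1e-3                                         # learning rate
for _ in range(200):
    gw, gW = gradients(w, W, X, y)
    w, W = w - gamma * gw, W - gamma * gW
print(sum((t - w @ sigma(W @ x)) ** 2 for x, t in zip(X, y)))  # training error
```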

Few remarks Multiple layers can be analogously considered. Batch gradient descent can be replaced by stochastic gradients. Faster iterations are available, e.g. variable metric/accelerated gradients. Online update rules are potentially biologically plausible: Hebbian learning rules describing neuron plasticity.

Computations

$\min_{w,W} \hat{E}(w, W), \quad \hat{E}(w, W) = \sum_{i=1}^{n} (y_i - f_{(w,W)}(x_i))^2$

[Figure: convex vs. non-convex optimization; convex: unique optimum (global = local); non-convex: multiple local optima.]

Non-convex problem: in practice, there is no access to a global minimizer, but only to approximate minimizers. Convergence of back-prop to a reasonable local minimum can depend heavily on the initialization. Empirically: the more the layers, the easier it is to find good minima.

An older idea: pre-training and unsupervised learning. Pre-training: use unsupervised training of each layer to initialize supervised training, with the potential benefit of exploiting unlabeled data.

Auto-encoders [Figure: a network mapping the input x back to x through a hidden layer.] A neural network with one input layer, one output layer, and one (or more) hidden layers connecting them. The output layer has as many nodes as the input layer, and it is trained to predict the input rather than some target output.

Auto-encoders (cont.) An auto-encoder with one hidden layer of k units can be seen as a representation-reconstruction pair:

$\Phi : X \to F_k, \quad \Phi(x) = \sigma(Wx), \quad x \in X$

$\Psi : F_k \to X, \quad \Psi(\beta) = \sigma(W'\beta), \quad \beta \in F_k$

with $F_k = \mathbb{R}^k$, $k < d$.
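A small NumPy sketch of this representation-reconstruction pair; the decoding matrix W' is taken here to be a second weight matrix (tying it to the transpose of W is another common choice), and the dimensions and data are made up for the example.

```python
import numpy as np

sigma = np.tanh

def encode(x, W):
    """Phi(x) = sigma(W x): X = R^d -> F_k = R^k."""
    return sigma(W @ x)

def decode(beta, W_dec):
    """Psi(beta) = sigma(W' beta): F_k -> X."""
    return sigma(W_dec @ beta)

def reconstruction_error(X, W, W_dec):
    """(1/n) sum_i ||x_i - Psi(Phi(x_i))||^2, the auto-encoder objective."""
    return np.mean([np.sum((x - decode(encode(x, W), W_dec)) ** 2) for x in X])

rng = np.random.default_rng(0)
d, k, n = 8, 3, 100                    # k < d: a compressed hidden representation
W, W_dec = rng.standard_normal((k, d)), rng.standard_normal((d, k))
X = rng.standard_normal((n, d))
print(reconstruction_error(X, W, W_dec))
```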

Auto-encoders & dictionary learning

$\Phi(x) = \sigma(Wx), \quad \Psi(\beta) = \sigma(W'\beta)$

The above formulation is closely related to dictionary learning: the weights can be seen as dictionary atoms. Reconstructive approaches have connections with so-called energy models [LeCun et al. ...]. Possible probabilistic/Bayesian interpretations and variations exist (e.g. Boltzmann machines [Hinton et al. ...]).

Stacked auto-encoders Multiple layers of auto-encoders can be stacked [Hinton et al. 06],

$(\Phi_1 \circ \Psi_1) \circ (\Phi_2 \circ \Psi_2) \circ \cdots \circ \underbrace{(\Phi_l \circ \Psi_l)}_{\text{Auto-encoder}} \circ \cdots$

with the potential of obtaining richer representations.

Beyond reconstruction In many applications, the connectivity of neural networks is limited in a specific way. Weights in the first few layers have smaller support and are repeated. Subsampling (pooling) is interleaved with standard neural net computations. The obtained architectures are called convolutional neural networks.

Convolutional layers Consider the composite representation

$\Phi : X \to F, \quad \Phi = \sigma \circ W$

with representation by filtering $W : X \to F$ and representation by pooling $\sigma : F \to F$. Note: here $\sigma$ and $W$ are more complex than in standard neural nets.

Convolution and filtering The matrix W is made of blocks,

$W = (G_{t_1}, \dots, G_{t_T})$

where each block is a convolution matrix obtained by transforming a vector (template) t, e.g. $G_t = (g_1 t, \dots, g_N t)$, as in

$G_t = \begin{pmatrix} t_1 & t_2 & t_3 & \cdots & t_d \\ t_d & t_1 & t_2 & \cdots & t_{d-1} \\ t_{d-1} & t_d & t_1 & \cdots & t_{d-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ t_2 & t_3 & t_4 & \cdots & t_1 \end{pmatrix}$

For all $x \in X$, $W(x)(j, i) = \langle g_i t_j, x \rangle$.
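A minimal sketch of one such block G_t for a single template t, where the transformations g_i are taken to be cyclic shifts as in the matrix above (an assumption for illustration; stacking several such blocks for templates t_1, ..., t_T gives W):

```python
import numpy as np

def circulant(t):
    """G_t: row i is the template t cyclically shifted by i positions,
    so (G_t x)_i = <g_i t, x> collects the responses to all shifted copies of t."""
    d = len(t)
    return np.array([np.roll(t, i) for i in range(d)])

t = np.array([1.0, 2.0, 3.0, 4.0])
G = circulant(t)
x = np.array([1.0, 0.0, 0.0, 0.0])
print(G)        # first row: t itself; next rows: its cyclic shifts
print(G @ x)    # filtering x with every shifted copy of the template
```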

Pooling The pooling map aggregates (pools) the values corresponding to the same transformed template,

$\langle g_1 t, x \rangle, \dots, \langle g_N t, x \rangle$

and can be seen as a form of subsampling.

Pooling functions Given a template t, let

$\beta = (s(\langle g_1 t, x \rangle), \dots, s(\langle g_N t, x \rangle))$

for some non-linearity s, e.g. $s(\cdot) = |\cdot|_+$. Examples of pooling:

max pooling: $\max_{j=1,\dots,N} \beta^j$,
average pooling: $\frac{1}{N} \sum_{j=1}^{N} \beta^j$,
$\ell_p$ pooling: $\|\beta\|_p = \left( \sum_{j=1}^{N} |\beta^j|^p \right)^{1/p}$.
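The three pooling functions above as a small NumPy sketch (the example responses β are made up):

```python
import numpy as np

def pool(beta, kind="max", p=2):
    """Aggregate beta = (s(<g_1 t, x>), ..., s(<g_N t, x>)) into a single value."""
    if kind == "max":
        return np.max(beta)                               # max pooling
    if kind == "average":
        return np.mean(beta)                              # (1/N) sum_j beta^j
    if kind == "lp":
        return np.sum(np.abs(beta) ** p) ** (1.0 / p)     # ||beta||_p
    raise ValueError(f"unknown pooling: {kind}")

beta = np.array([0.2, 0.0, 1.3, 0.7])   # e.g. hinge of the filtered responses
for kind in ("max", "average", "lp"):
    print(kind, pool(beta, kind))
```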

Why pooling? The intuition is that pooling can provide some form of robustness, and even invariance, to the transformations. Invariance & selectivity: a good representation should be invariant to semantically irrelevant transformations, yet it should be discriminative with respect to relevant information (selective).

Basic computations: simple & complex cells (Hubel, Wiesel 62) [Diagram: simple cells compute the filtered responses $\langle x, g_1 t \rangle, \dots, \langle x, g_N t \rangle$; complex cells pool them, e.g. $\sum_g |\langle x, g t \rangle|_+$.]

Basic computations: convolutional networks (Le Cun 88) [Diagram: convolutional filters compute $\langle x, g_1 t \rangle, \dots, \langle x, g_N t \rangle$; subsampling/pooling aggregates them, e.g. $\sum_g |\langle x, g t \rangle|_+$.]

Deep convolutional networks [Figure: Input → Filtering → Pooling (first layer) → Filtering → Pooling (second layer) → Classifier → Output.] In practice: multiple convolution layers are stacked; pooling is not global, but over a subset of transformations (receptive field); the receptive field size increases in higher layers.
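A toy 1D sketch of such a stack, filtering with local templates, applying the hinge, and max-pooling over local receptive fields that grow with depth; the templates, widths, and strides are arbitrary choices for illustration, not the architecture from the slides.

```python
import numpy as np

def conv_layer(x, templates):
    """Filter x with each (local) template at every valid shift."""
    return np.array([np.convolve(x, t, mode="valid") for t in templates])

def pool_layer(responses, width, stride):
    """Max-pool each response map over local receptive fields."""
    n = responses.shape[1]
    starts = range(0, n - width + 1, stride)
    return np.array([[r[i:i + width].max() for i in starts] for r in responses])

rng = np.random.default_rng(0)
x = rng.standard_normal(64)                            # 1D input signal
templates1 = [rng.standard_normal(5) for _ in range(4)]
h1 = pool_layer(np.maximum(conv_layer(x, templates1), 0.0), width=4, stride=4)

templates2 = [rng.standard_normal(3) for _ in range(2)]
h2 = np.concatenate([pool_layer(np.maximum(conv_layer(h, templates2), 0.0),
                                width=6, stride=6) for h in h1])
print(h1.shape, h2.shape)   # spatial extent shrinks, receptive fields grow with depth
```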

A biological motivation The processing in DCNs has analogies with computational neuroscience models of the information processing in the visual cortex, see [Poggio et al. ...]. [Figure: sketch of the HMAX hierarchical model of visual processing, from classification units down through PIT/AIT, V4/PIT, V2/V4, V1/V2. Acronyms: V1, V2 and V4 correspond to the primary, second and fourth visual areas, PIT and AIT to the posterior and anterior inferotemporal areas, respectively.]

Theory

$\Phi_L(x) = \sigma(W_L \cdots \sigma(W_2\, \sigma(W_1 x)))$

No pooling: metric properties of networks with random weights; connection with compressed sensing [Giryes et al. 15].
Invariance: $x' = gx \Rightarrow \Phi(x') = \Phi(x)$ [Anselmi et al. 12, R. Poggio 15, Mallat 12, Soatto, Chiuso 13], and covariance for multiple layers [Anselmi et al. 12].
Selectivity/maximal invariance, i.e. injectivity modulo transformations: $\Phi(x') = \Phi(x) \Rightarrow x' = gx$ [R. Poggio 15, Soatto, Chiuso 15].

Theory (cont.) Similarity preservation:

$\|\Phi(x') - \Phi(x)\| \sim \min_g \|x' - gx\|$ ???

Stability to diffeomorphisms [Mallat, 12]:

$\|\Phi(x) - \Phi(d(x))\| \lesssim \|d\|\, \|x\|$

Reconstruction: connection to phase retrieval/one-bit compressed sensing [Bruna et al. 14].

This class: neural nets, auto-encoders, convolutional neural nets.

The End