RegML 2018, Class 8: Deep Learning. Lorenzo Rosasco, UNIGE-MIT-IIT. June 18, 2018
Supervised vs unsupervised learning?
So far we have been thinking of learning schemes made in two steps:
$f(x) = \langle w, \Phi(x)\rangle_{\mathcal F}$, $x \in X$,
with unsupervised learning of $\Phi$ and supervised learning of $w$.
But can we perform only one learning step?
In practice all is multi-layer! (an old slide)
Typical data representation schemes, e.g. in vision or speech, involve multiple stages (layers).
Pipeline: raw data are often processed by first computing some low-level features, then learning some mid-level representation, ... finally using supervised learning.
These stages are often done separately, but is it possible to design end-to-end learning systems?
In practice all is (becoming) deep learning! (updated slide)
Typical data representation schemes, e.g. in vision or speech, involve deep learning.
Pipeline: design some wild, but differentiable, hierarchical architecture and proceed with end-to-end learning!
Ok, maybe not all is deep learning, but let's take a look.
Shallow nets
$f(x) = \langle w, \Phi(x)\rangle$, with $x \mapsto \Phi(x)$ fixed.
Empirical Risk Minimization (ERM):
$\min_w \frac{1}{n}\sum_{i=1}^n (y_i - \langle w, \Phi(x_i)\rangle)^2$
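As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of ERM with the square loss on a fixed feature map; the polynomial features, the toy data, and the closed-form least-squares solution are choices made purely for illustration.

```python
import numpy as np

def phi(x):
    # Fixed (hand-designed) feature map: polynomial features up to degree 3.
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)                    # toy 1-d inputs
y = np.sin(3 * x) + 0.1 * rng.standard_normal(50)  # toy targets

# ERM: min_w (1/n) sum_i (y_i - <w, phi(x_i)>)^2, solved in closed form.
Phi = phi(x)                                       # n x d feature matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

f = lambda x_new: phi(np.atleast_1d(x_new)) @ w    # learned predictor
```

Only $w$ is learned here; the representation $\Phi$ is fixed in advance, which is exactly the two-step scheme the deep approach below tries to avoid.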
Neural Nets
Basic idea of neural networks: functions obtained by composition,
$\Phi = \Phi_L \circ \cdots \circ \Phi_2 \circ \Phi_1$.
Let $d_0 = d$ and $\Phi_l : \mathbb R^{d_{l-1}} \to \mathbb R^{d_l}$, $l = 1, \ldots, L$, and in particular
$\Phi_l = \sigma \circ W_l$, $l = 1, \ldots, L$,
where $W_l : \mathbb R^{d_{l-1}} \to \mathbb R^{d_l}$, $l = 1, \ldots, L$, is linear/affine and $\sigma$ is a non-linear map acting component-wise, $\sigma : \mathbb R \to \mathbb R$.
Deep neural nets
$f(x) = \langle w, \Phi_L(x)\rangle$, with the compositional representation $\Phi_L = \Phi_L \circ \cdots \circ \Phi_1$, $\Phi_1 = \sigma \circ W_1, \ldots, \Phi_L = \sigma \circ W_L$.
ERM:
$\min_{w, (W_j)_j} \frac{1}{n}\sum_{i=1}^n (y_i - \langle w, \Phi_L(x_i)\rangle)^2$
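To fix ideas, the sketch below (not from the slides) computes such a compositional representation layer by layer; the ReLU activation, the layer widths, and the random weights are illustrative assumptions.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def forward(x, Ws, w):
    """Compute f(x) = <w, Phi_L(x)>, with Phi_l = sigma o W_l applied in sequence."""
    h = x
    for W in Ws:          # composition Phi_L o ... o Phi_1
        h = relu(W @ h)   # linear map followed by the component-wise non-linearity
    return w @ h

# Hypothetical widths d_0 = 4, d_1 = 8, d_2 = 8 and a final linear output layer.
rng = np.random.default_rng(0)
dims = [4, 8, 8]
Ws = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
w = rng.standard_normal(dims[-1])
print(forward(rng.standard_normal(4), Ws, w))
```

In the ERM above, the output weights $w$ and all the layer matrices $(W_l)_l$ are learned jointly, which is the single learning step asked about at the beginning.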
Neural networks terminology
$\Phi_L(x) = \sigma(W_L \ldots \sigma(W_2\, \sigma(W_1 x)))$
- Each intermediate representation corresponds to a (hidden) layer.
- The dimensionalities $(d_l)_l$ correspond to the numbers of hidden units.
- The non-linearity $\sigma$ is called the activation function.
Neural networks illustrated
[Figure: diagram of a network mapping an input x through a weight matrix W to an output y.]
Each neuron computes an inner product based on a column of the weight matrix $W$; the non-linearity $\sigma$ is the neuron activation function.
Activation functions
- logistic function: $s(\alpha) = (1 + e^{-\alpha})^{-1}$, $\alpha \in \mathbb R$,
- hyperbolic tangent: $s(\alpha) = (e^{\alpha} - e^{-\alpha})/(e^{\alpha} + e^{-\alpha})$, $\alpha \in \mathbb R$,
- hinge: $s(\alpha) = |\alpha|_+$, $\alpha \in \mathbb R$.
Note: if the activation is chosen to be linear, the architecture is equivalent to one layer.
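For reference, a minimal numpy sketch (added here, not in the slides) of the three activation functions above; the name relu is simply the usual modern name for the hinge/positive part.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # Same as np.tanh(a), written out to match the formula on the slide.
    return (np.exp(a) - np.exp(-a)) / (np.exp(a) + np.exp(-a))

def hinge(a):
    # Positive part |a|_+, commonly called ReLU.
    return np.maximum(a, 0.0)
```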
Neural networks function spaces
Consider the non-linear space of functions of the form
$f_{w,(W_l)_l} : X \to \mathbb R$, $f_{w,(W_l)_l}(x) = \langle w, \Phi_{(W_l)_l}(x)\rangle$, $\Phi_{(W_l)_l}(x) = \sigma(W_L \ldots \sigma(W_2\, \sigma(W_1 x)))$.
Very little structure, but we can:
- train by gradient descent (next),
- get (some) approximation/statistical guarantees (later).
One layer neural networks
Consider only one hidden layer:
$f_{w,W}(x) = \langle w, \sigma(Wx)\rangle = \sum_{j=1}^u w_j\, \sigma(\langle W^j, x\rangle)$,
typically optimized given supervised data by minimizing
$\frac{1}{n}\sum_{i=1}^n (y_i - f_{w,W}(x_i))^2$,
possibly with norm constraints on the weights (regularization).
The problem is non-convex! (though possibly smooth, depending on $\sigma$)
Back-propagation
Empirical risk minimization:
$\min_{w,W} \hat E(w, W)$, with $\hat E(w, W) = \sum_{i=1}^n (y_i - f_{(w,W)}(x_i))^2$.
An approximate minimizer is computed via the following update rules
$w_j^{t+1} = w_j^t - \gamma_t\, \frac{\partial \hat E}{\partial w_j}(w^t, W^t)$
$W_{j,k}^{t+1} = W_{j,k}^t - \gamma_t\, \frac{\partial \hat E}{\partial W_{j,k}}(w^{t+1}, W^t)$
where the step-size $(\gamma_t)_t$ is often called the learning rate.
Back-propagation & chain rule
Direct computations show that:
$\frac{\partial \hat E}{\partial w_j}(w, W) = -2\sum_{i=1}^n \underbrace{(y_i - f_{(w,W)}(x_i))}_{\Delta_i}\, h_{j,i}$, with $h_{j,i} = \sigma(\langle W^j, x_i\rangle)$,
$\frac{\partial \hat E}{\partial W_{j,k}}(w, W) = -2\sum_{i=1}^n \underbrace{(y_i - f_{(w,W)}(x_i))\, w_j\, \sigma'(\langle W^j, x_i\rangle)}_{\eta_{j,i}}\, x_i^k$.
Back-prop equations: the terms $\eta_{j,i} = \Delta_i\, w_j\, \sigma'(\langle W^j, x_i\rangle)$ are obtained by propagating the output errors $\Delta_i$ backwards through the network.
Using the above equations, the updates are performed in two steps:
- Forward pass: compute the function values keeping the weights fixed,
- Backward pass: compute the errors and propagate them back.
Hence the weights are updated.
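The numpy sketch below (an addition, not the slides' code) implements these update rules for a one-hidden-layer network; the tanh activation, the toy data, the learning rate, and the 1/n averaging of the loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, u = 200, 3, 16                    # sample size, input dimension, hidden units
X = rng.standard_normal((n, d))
y = np.sin(X @ rng.standard_normal(d))  # toy regression targets

sigma  = lambda a: np.tanh(a)           # activation
dsigma = lambda a: 1.0 - np.tanh(a)**2  # its derivative

W = 0.1 * rng.standard_normal((u, d))   # hidden-layer weights W
w = 0.1 * rng.standard_normal(u)        # output weights w
gamma = 0.05                            # learning rate (step-size)

for t in range(500):
    # Forward pass: hidden activations and predictions.
    A = X @ W.T                         # pre-activations <W^j, x_i>
    H = sigma(A)                        # h_{j,i}
    f = H @ w                           # f(x_i) = <w, sigma(W x_i)>

    # Backward pass: propagate the errors Delta_i = y_i - f(x_i).
    Delta = y - f
    grad_w = -2 * H.T @ Delta / n                              # dE/dw_j
    grad_W = -2 * (dsigma(A) * np.outer(Delta, w)).T @ X / n   # dE/dW_{j,k}

    # Gradient step.
    w -= gamma * grad_w
    W -= gamma * grad_W
```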
A few remarks
- Multiple layers can be treated analogously.
- Batch gradient descent can be replaced by stochastic gradients.
- Faster iterations are available, e.g. variable metric/accelerated gradient methods, ...
- Online update rules are potentially biologically plausible: Hebbian learning rules describing neuron plasticity.
Computations
$\min_{w,W} \hat E(w, W)$, $\hat E(w, W) = \sum_{i=1}^n (y_i - f_{(w,W)}(x_i))^2$.
[Figure: convex vs. non-convex optimization, a unique (global) optimum vs. multiple local optima.]
The problem is non-convex: in practice there is no access to a global minimizer, only to approximate (local) minimizers.
Convergence of back-prop to a reasonable local minimum can depend heavily on the initialization.
Empirically: the more the layers, the easier it is to find good minima.
An older idea: pre-training and unsupervised learning Pre-training Use unsupervised training of each layer to initialize supervised training. Potential benefit of unlabeled data.
Auto-encoders
[Figure: network mapping the input x through weights W back to a reconstruction of x.]
A neural network with one input layer, one output layer and one (or more) hidden layers connecting them. The output layer has as many nodes as the input layer, and it is trained to predict the input rather than some target output.
Auto-encoders (cont.)
An auto-encoder with one hidden layer of $k$ units can be seen as a representation-reconstruction pair: with $\mathcal F_k = \mathbb R^k$, $k < d$,
$\Phi : X \to \mathcal F_k$, $\Phi(x) = \sigma(Wx)$, $x \in X$,
$\Psi : \mathcal F_k \to X$, $\Psi(\beta) = \sigma(W'\beta)$, $\beta \in \mathcal F_k$.
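A small numpy sketch (not from the slides) of this representation-reconstruction pair; the separate decoder matrix W2 (standing in for the second weight matrix above, which could also be tied to W), the dimensions, and the squared reconstruction loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 5                              # input dimension and code size, k < d
sigma = np.tanh

W  = 0.1 * rng.standard_normal((k, d))    # encoder weights
W2 = 0.1 * rng.standard_normal((d, k))    # decoder weights

def encode(x):                            # Phi : X -> F_k
    return sigma(W @ x)

def decode(beta):                         # Psi : F_k -> X
    return sigma(W2 @ beta)

def reconstruction_error(X):
    # Training minimizes sum_i ||x_i - Psi(Phi(x_i))||^2, e.g. by gradient descent.
    return sum(np.sum((x - decode(encode(x)))**2) for x in X)
```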
Auto-encoders & dictionary learning
$\Phi(x) = \sigma(Wx)$, $\Psi(\beta) = \sigma(W'\beta)$
- The above formulation is closely related to dictionary learning: the weights can be seen as dictionary atoms.
- Reconstructive approaches have connections with so-called energy models [LeCun et al. ...].
- Possible probabilistic/Bayesian interpretations/variations (e.g. Boltzmann machines [Hinton et al. ...]).
Stacked auto-encoders
Multiple layers of auto-encoders can be stacked [Hinton et al. '06],
$(\Phi_1 \Psi_1)\,(\Phi_2 \Psi_2)\cdots\underbrace{(\Phi_\ell \Psi_\ell)}_{\text{auto-encoder}}$
with the potential of obtaining richer representations.
Beyond reconstruction
In many applications the connectivity of neural networks is limited in a specific way:
- weights in the first few layers have smaller support and are repeated,
- subsampling (pooling) is interleaved with standard neural net computations.
The obtained architectures are called convolutional neural networks.
Convolutional layers
Consider the composite representation $\Phi : X \to \mathcal F$, $\Phi = \sigma \circ W$, with
- representation by filtering $W : X \to \mathcal F$,
- representation by pooling $\sigma : \mathcal F \to \mathcal F$.
Note: here $\sigma$ and $W$ are more complex than in standard neural nets.
Convolution and filtering
The matrix $W$ is made of blocks
$W = (G_{t_1}, \ldots, G_{t_T})$,
where each block is a convolution matrix obtained by transforming a vector (template) $t$, e.g. $G_t = (g_1 t, \ldots, g_N t)$, e.g.
$$G_t = \begin{pmatrix} t_1 & t_2 & t_3 & \cdots & t_d \\ t_d & t_1 & t_2 & \cdots & t_{d-1} \\ t_{d-1} & t_d & t_1 & \cdots & t_{d-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ t_2 & t_3 & t_4 & \cdots & t_1 \end{pmatrix}$$
For all $x \in X$, $W(x)(j, i) = \langle g_i t_j, x\rangle$.
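As a quick illustration (added, not from the slides), the convolution matrix for cyclic shifts of a template can be built directly; the template and input below are arbitrary.

```python
import numpy as np

def circulant(t):
    """Convolution matrix G_t whose rows are the cyclic shifts of the template t."""
    return np.stack([np.roll(t, k) for k in range(len(t))])

t = np.array([1.0, 2.0, 3.0, 4.0])
G_t = circulant(t)
x = np.array([0.5, -1.0, 0.0, 2.0])

# Each entry of G_t @ x is the inner product <g_i t, x> of a shifted template with x.
print(G_t)
print(G_t @ x)
```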
Pooling
The pooling map aggregates (pools) the values corresponding to the same transformed template,
$\langle g_1 t, x\rangle, \ldots, \langle g_N t, x\rangle$,
and can be seen as a form of subsampling.
Pooling functions
Given a template $t$, let
$\beta = (s(\langle g_1 t, x\rangle), \ldots, s(\langle g_N t, x\rangle))$
for some non-linearity $s$, e.g. $s(\cdot) = |\cdot|_+$.
Examples of pooling:
- max pooling: $\max_{j=1,\ldots,N} \beta^j$,
- average pooling: $\frac{1}{N}\sum_{j=1}^N \beta^j$,
- $\ell^p$ pooling: $\|\beta\|_p = \Big(\sum_{j=1}^N |\beta^j|^p\Big)^{1/p}$.
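A minimal sketch (an addition) of the three pooling functions above, applied to a vector of responses; the example values are arbitrary.

```python
import numpy as np

def pool(beta, kind="max", p=2):
    """Aggregate beta = (s(<g_1 t, x>), ..., s(<g_N t, x>)) into a single value."""
    if kind == "max":
        return np.max(beta)
    if kind == "average":
        return np.mean(beta)
    if kind == "lp":
        return np.sum(np.abs(beta) ** p) ** (1.0 / p)
    raise ValueError(f"unknown pooling: {kind}")

beta = np.array([0.2, 1.5, 0.0, 0.7])
print(pool(beta, "max"), pool(beta, "average"), pool(beta, "lp", p=2))
```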
Why pooling? The intuition is that pooling can provide some form of robustness and even invariance to the transformations. Invariance & selectivity A good representation should be invariant to semantically irrelevant transformations. Yet, it should be discriminative with respect to relevant information (selective).
Basic computations: simple & complex cells (Hubel, Wiesel '62)
[Figure: simple cells compute the filter responses $\langle x, g_1 t\rangle, \ldots, \langle x, g_N t\rangle$; complex cells pool them, e.g. $\sum_g |\langle x, g t\rangle|_+$.]
Basic computations: convolutional networks (LeCun '88)
[Figure: convolutional filters compute the responses $\langle x, g_1 t\rangle, \ldots, \langle x, g_N t\rangle$; subsampling/pooling aggregates them, e.g. $\sum_g |\langle x, g t\rangle|_+$.]
Deep convolutional networks
[Figure: Input → First Layer (Filtering, Pooling) → Second Layer (Filtering, Pooling) → Classifier → Output.]
In practice:
- multiple convolution layers are stacked,
- pooling is not global, but over a subset of transformations (receptive field),
- the receptive field size increases in higher layers.
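The 1-d numpy sketch below (illustrative, not from the slides) shows one filtering plus local pooling stage of this kind; cyclic shifts play the role of the transformations, and the random templates, window width, and use of max pooling are assumptions.

```python
import numpy as np

def conv_layer(x, templates):
    """Responses of x to all cyclic shifts of each template (cf. the circulant G_t)."""
    d = len(x)
    return np.array([[np.dot(np.roll(t, k), x) for k in range(d)] for t in templates])

def local_pool(responses, width):
    """Max-pool over small windows of shifts (a local receptive field), not globally."""
    d = responses.shape[1]
    return np.array([[resp[k:k + width].max() for k in range(0, d, width)]
                     for resp in responses])

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
templates = rng.standard_normal((4, 16))
layer1 = local_pool(np.maximum(conv_layer(x, templates), 0.0), width=2)
# layer1 (4 maps of length 8) could feed a second filtering/pooling stage and then a classifier.
print(layer1.shape)
```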
A biological motivation
The processing in deep convolutional networks has analogies with computational neuroscience models of information processing in the visual cortex, see [Poggio et al. ...].
[Figure: sketch of the HMAX hierarchical model of visual processing, from V1/V2 up through V2/V4, V4/PIT and PIT/AIT to the classification units. Acronyms: V1, V2 and V4 correspond to the primary, second and fourth visual areas, PIT and AIT to the posterior and anterior inferotemporal areas, respectively.]
Theory
$\Phi_L(x) = \sigma(W_L \ldots \sigma(W_2\, \sigma(W_1 x)))$
- No pooling: metric properties of networks with random weights, connections with compressed sensing [Giryes et al. '15].
- Invariance: $x' = gx \Rightarrow \Phi(x') = \Phi(x)$ [Anselmi et al. '12, R., Poggio '15, Mallat '12, Soatto, Chiuso '13], and covariance for multiple layers [Anselmi et al. '12].
- Selectivity/maximal invariance, i.e. injectivity modulo transformations: $\Phi(x') = \Phi(x) \Rightarrow x' = gx$ [R., Poggio '15, Soatto, Chiuso '15].
Theory (cont.)
- Similarity preservation: $\|\Phi(x') - \Phi(x)\| \approx \min_g \|x' - gx\|$ ???
- Stability to diffeomorphisms [Mallat '12]: $\|\Phi(x) - \Phi(d(x))\| \lesssim \|d\|\,\|x\|$
- Reconstruction: connection to phase retrieval/one-bit compressed sensing [Bruna et al. '14].
This class Neural nets Autoencoders Convolutional neural nets
THE END