Deep Learning Autoencoder Models


Deep Learning Autoencoder Models
Davide Bacciu, Dipartimento di Informatica, Università di Pisa
Intelligent Systems for Pattern Recognition (ISPR)

Generative Models Wrap-up: Generative Graphical Models
- Bayesian models/nets: unsupervised data understanding and interpretability, but weak on supervised performance
- Markov Random Fields: knowledge and constraints expressed through feature functions; CRFs as the supervised way to be generative; computationally heavy
- Dynamic models: the topology unfolds on the data structure; structured data processing; complex causal relationships

Module's Take Home Messages
Consider using generative models when:
- You need interpretability
- You need to incorporate prior knowledge
- Learning is unsupervised, or supervision is only partially observable
- You need reusable/portable learned knowledge
Consider avoiding generative models when:
- You have tight computational constraints
- You are dealing with raw, noisy, low-level data
Variational inference: an efficient way to learn an approximation to an intractable distribution; the variational function can be a neural network.

Deep Learning
[Diagram: the path from hard-coded expert reasoning (AI), to expert-designed features plus a trainable predictor (ML), to learned features (ANN), to a learned feature hierarchy (Deep Learning).]

Module Outline
- Recurrent, recursive and contextual models (Micheli): recurrent NN training refresher, recursive NN, graph processing
- Foundational models: autoencoders and RBMs
- Convolutional Neural Networks
- Gated recurrent networks (LSTM, ...)
- Advanced topics: adversarial models, memory networks and attentional models; variational deep learning; deep reinforcement learning
- API and applications seminars

Reference Book
Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, MIT Press (chapters 6-10, 14, 20), integrated by course slides and additional readings. Freely available online.

Module's Prerequisites
- Formal model of the neuron
- Neural networks: feed-forward and recurrent
- Cost function optimization: backpropagation/SGD, regularization
- Neural network hyper-parameters and model selection

Lecture Outline
- Autoencoders, a.k.a. the first and the latest deep learning model
- Autoencoders and dimensionality reduction
- Deep neural autoencoders: sparse, denoising, contractive
- Deep generative-based autoencoders: Deep Belief Networks, Deep Boltzmann Machines
- Application examples

Basic Autoencoder (AE)
Train a model to reconstruct its own input x (a latent space projection, again), passing through some form of information bottleneck: either K << D, or a sparsely active h.
- Encoder f: input x (dimension D) -> hidden code h (dimension K)
- Decoder g: hidden code h -> output x̃ (dimension D)
Train by loss minimization: L(x, x̃) = L(x, g(f(x)))
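To make the bottleneck idea concrete, here is a minimal sketch of such an autoencoder in PyTorch; the sizes D and K, the tanh nonlinearity and the squared-error loss are illustrative assumptions, not choices prescribed by the slide.

```python
# Minimal bottleneck autoencoder: reconstruct x from a K-dimensional code, K << D.
import torch
import torch.nn as nn

D, K = 784, 32          # input size and code size (assumed values)

encoder = nn.Sequential(nn.Linear(D, K), nn.Tanh())   # h = f(x)
decoder = nn.Linear(K, D)                              # x_hat = g(h)

loss_fn = nn.MSELoss()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, D)      # a batch of stand-in inputs in place of real training data
h = encoder(x)             # project into the K-dimensional latent space
x_hat = decoder(h)         # reconstruct the input from the code
loss = loss_fn(x_hat, x)   # L(x, g(f(x)))
opt.zero_grad()
loss.backward()
opt.step()
```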

A Very Well Known Autoencoder
Encoding-decoding with linear maps: h = f(x) = W_e x and x̃ = g(h) = W_d W_e x.
Tied weights (often, not always): W_d = W_e^T = W^T.
Euclidean loss: L(x, x̃) = ||x - W^T W x||_2^2.
What if we take f and g linear and K << D? The AE learns the same subspace as PCA.
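A hedged sketch of the tied-weight linear case: a single matrix W plays both roles, so the reconstruction is W^T W x, and with the Euclidean loss and K << D training recovers the same subspace as PCA. Sizes and learning rate below are illustrative assumptions.

```python
# Tied-weight linear autoencoder: h = W x, x_hat = W^T W x, loss = ||x - W^T W x||_2^2.
import torch

D, K = 784, 32                                     # illustrative sizes, K << D
W = (0.01 * torch.randn(K, D)).requires_grad_()    # single shared encoder/decoder matrix
opt = torch.optim.SGD([W], lr=1e-2)

x = torch.rand(64, D)                              # stand-in data batch
h = x @ W.t()                                      # encoding:  h = W_e x,  with W_e = W
x_hat = h @ W                                      # decoding:  x_hat = W_d h,  with W_d = W^T
loss = ((x - x_hat) ** 2).sum(dim=1).mean()        # Euclidean reconstruction loss
opt.zero_grad(); loss.backward(); opt.step()
```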

A Probabilistic View
Stochastic autoencoder: an encoding distribution P_e(h|x) and a decoding distribution P_d(x̃|h).
A unifying view of neural and generative AEs; it paves the way to Variational Autoencoders.

Neural Autoencoders
Generally, we would like to train nonlinear AEs, possibly with K > D, that do not learn the trivial identity. Regularized autoencoders:
- Sparse AE
- Denoising AE
- Contractive AE
- Autoencoders with dropout layers

Sparse Autoencoder
Add a term to the cost function that penalizes h (we want the number of active units to be small):
J_SAE(θ) = Σ_{x∈S} ( L(x, x̃) + λ Ω(h) )
Typically Ω(h) = Ω(f(x)) = Σ_j |h_j(x)|
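A minimal sketch of the sparse objective; the sigmoid encoder, the overcomplete code size and the value of λ are assumptions made only for illustration.

```python
# Sparse autoencoder objective: J_SAE = L(x, x_hat) + lambda * sum_j |h_j(x)|.
import torch
import torch.nn as nn

D, K, lam = 784, 1024, 1e-3          # overcomplete code (K > D) is fine once h is penalized
encoder = nn.Sequential(nn.Linear(D, K), nn.Sigmoid())
decoder = nn.Linear(K, D)

x = torch.rand(64, D)                # stand-in data batch
h = encoder(x)
x_hat = decoder(h)
recon = ((x - x_hat) ** 2).sum(dim=1).mean()     # L(x, x_hat)
sparsity = h.abs().sum(dim=1).mean()             # Omega(h) = sum_j |h_j(x)|
loss = recon + lam * sparsity                    # J_SAE(theta) for this batch
```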

Probabilistic Interpretation (Oh No, Again!)
(From the ML course) Training with regularization is MAP inference:
max log P(h, x) = max ( log P(x|h) + log P(h) )
where log P(x|h) is the likelihood and P(h) = (λ/2) exp(-λ ||h||_1) is a Laplace prior, yielding Ω(h) = λ ||h||_1.
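Written out, the step from the Laplace prior to the L1 penalty (a standard derivation added here for clarity, not taken verbatim from the slide):

```latex
-\log P(h) \;=\; -\log\!\Big(\tfrac{\lambda}{2}\, e^{-\lambda \lVert h\rVert_1}\Big)
            \;=\; \lambda\,\lVert h\rVert_1 \;-\; \log\tfrac{\lambda}{2}
```

so maximizing log P(x|h) + log P(h) is, up to an additive constant, the same as minimizing the reconstruction loss plus Ω(h) = λ ||h||_1.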

Denoising Autoencoder (DAE)
Train the AE to minimize L(x, g(f(x̃))), where x̃ is a version of the original input x corrupted by some noise process C(x̃|x).
Key intuition: learned representations should be robust to partial destruction of the input.
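A sketch of one DAE training step, using additive Gaussian corruption as the noise process C(x̃|x); masking noise would work the same way, and the noise level σ below is an arbitrary illustrative value.

```python
# Denoising autoencoder step: corrupt the input, reconstruct it, and compare the
# reconstruction against the *clean* input: minimize L(x, g(f(x_tilde))).
import torch
import torch.nn as nn

D, K, sigma = 784, 128, 0.3
encoder = nn.Sequential(nn.Linear(D, K), nn.ReLU())
decoder = nn.Linear(K, D)

x = torch.rand(64, D)                        # stand-in clean batch
x_tilde = x + sigma * torch.randn_like(x)    # C(x_tilde | x): Gaussian corruption
x_hat = decoder(encoder(x_tilde))            # g(f(x_tilde))
loss = ((x - x_hat) ** 2).sum(dim=1).mean()  # note: the target is x, not x_tilde
```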

Another Interpretation (yes, exactly the one you are thinking of)
The DAE learns the denoising distribution P(x|x̃), by minimizing the expected negative log-likelihood
E_{x̃ ~ C(x̃|x)} [ -log P_d(x | h = f(x̃)) ]

DAE as Manifold Learning
The DAE learns a vector field (the green arrows in the slide figure) approximating the gradient of the unknown data-generating distribution, ∂ log p(x) / ∂x, when trained with Gaussian corruption C(x̃|x) = N(x̃; x, σ^2 I).
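Stated as an equation (a known result for DAEs trained with small Gaussian corruption, cf. Alain and Bengio, JMLR 2014; added here to make the "vector field" claim precise): the reconstruction displacement approximates the score of the data density,

```latex
g\big(f(\tilde{x})\big) - \tilde{x} \;\approx\; \sigma^2\,\frac{\partial \log p(\tilde{x})}{\partial \tilde{x}},
\qquad C(\tilde{x}\mid x) = \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I).
```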

The Manifold Assumption
- Assume data lies on a lower-dimensional non-linear manifold, since the variables in the data are typically dependent
- Regularized AEs can afford to represent only the variations that are needed to reconstruct training examples
- The AE mapping is sensitive only to changes along the manifold directions
Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2009.

Contractive Autoencoder
Penalize the encoding function for sensitivity to the input:
J_CAE(θ) = Σ_{x∈S} ( L(x, x̃) + λ Ω(h) )
Ω(h) = Ω(f(x)) = || ∂f(x)/∂x ||_F^2
You can just as well penalize higher-order derivatives.
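A sketch of the contractive penalty for a sigmoid encoder h = σ(Wx + b), for which the squared Frobenius norm of the Jacobian has the closed form Σ_j (h_j(1-h_j))^2 ||W_j||^2, with W_j the j-th row of W. This is one common way to compute the CAE penalty; the sizes and λ below are illustrative assumptions.

```python
# Contractive autoencoder: reconstruction loss + lambda * ||d f(x)/d x||_F^2.
# For a sigmoid encoder the Jacobian penalty has a cheap closed form.
import torch
import torch.nn as nn

D, K, lam = 784, 128, 0.1
enc_lin = nn.Linear(D, K)
decoder = nn.Linear(K, D)

x = torch.rand(64, D)                              # stand-in data batch
h = torch.sigmoid(enc_lin(x))                      # h = f(x)
x_hat = decoder(h)
recon = ((x - x_hat) ** 2).sum(dim=1).mean()

dh = (h * (1.0 - h)) ** 2                          # (h_j (1 - h_j))^2, shape (batch, K)
w_sq = (enc_lin.weight ** 2).sum(dim=1)            # ||W_j||^2 per hidden unit, shape (K,)
jac_penalty = (dh * w_sq).sum(dim=1).mean()        # ||df(x)/dx||_F^2, averaged over the batch

loss = recon + lam * jac_penalty                   # J_CAE(theta) for this batch
```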

Hierarchical Autoencoder
Stack several hidden layers (x -> h1 -> h2 -> h3 -> h4), combining unsupervised training of the stack with supervised learning on top. It extracts a representation of the inputs that facilitates:
- Data visualization, exploration, indexing, ...
- Realization of a supervised task

Unsupervised Layerwise Pretraining
Incremental unsupervised construction of the deep AE, using any form of AE (e.g. those shown in the previous slides):
1. Train a first autoencoder on x: encode into h1 with W1, decode with W1^T to reconstruct x.
2. Freeze W1, compute the codes h1 and train a second autoencoder on them: encode into h2 with W2, decode with W2^T to reconstruct h1.
3. Repeat for the higher layers (W3, W4, ...), each time training on the codes produced by the previously trained layer.

Optional Fine Tuning
Fine-tune the whole autoencoder (the encoder stack W1...W4 and the mirrored decoder stack W4^T...W1^T) to optimize input reconstruction. You can use backpropagation, but it remains an unsupervised task.
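A compact sketch of the procedure from the last few slides: each layer is trained greedily as a single-layer autoencoder on the codes produced by the previous one, then the stack is unrolled into a deep autoencoder and, optionally, fine-tuned end to end on reconstruction. Layer sizes, epoch counts and optimizers are illustrative assumptions.

```python
# Greedy layer-wise pretraining of a deep autoencoder, followed by optional fine-tuning.
import torch
import torch.nn as nn

sizes = [784, 256, 64, 16]            # x -> h1 -> h2 -> h3 (illustrative sizes)
data = torch.rand(512, sizes[0])      # stand-in training set

encoders, decoders = [], []
inputs = data
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
    dec = nn.Sequential(nn.Linear(d_out, d_in), nn.Sigmoid())
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(100):              # pretrain this single-layer AE on the current codes
        x_hat = dec(enc(inputs))
        loss = ((inputs - x_hat) ** 2).sum(dim=1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    encoders.append(enc); decoders.append(dec)
    inputs = enc(inputs).detach()     # codes become the next layer's training data

# Unroll into a deep AE (encoder stack + mirrored decoder stack) and fine-tune it.
deep_ae = nn.Sequential(*encoders, *reversed(decoders))
opt = torch.optim.Adam(deep_ae.parameters(), lr=1e-4)
for _ in range(100):                  # still an unsupervised task, trained by backprop
    x_hat = deep_ae(data)
    loss = ((data - x_hat) ** 2).sum(dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```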

Rearranging the Graphics
Does it look like something familiar? Rearranged, the stack of encoding layers looks like a layered Restricted Boltzmann Machine: we can use RBMs to perform the layerwise pretraining and learn the matrices W_i.

Deep Belief Network (DBN)
A stack of pairwise RBMs: RBM 1 between x and h1, RBM 2 between h1 and h2, and so on up to RBM 4.
IMPORTANT NOTE: a DBN is a deep autoencoder, but it is NOT a deep RBM; it is (mostly) directed!

Deep Boltzmann Machine (DBM)
How do we get this? Training requires some attention because of the recurrent interactions from the higher layers back to the bottom:
P(h^1_j = 1 | x, h^2) = σ( Σ_i W^1_ij x_i + Σ_m W^2_jm h^2_m )
P(x_i = 1 | h^1) = σ( Σ_j W^1_ij h^1_j )
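A small sketch of these two conditionals for a binary DBM with two hidden layers: unit j in h^1 pools bottom-up input through W^1 and top-down input through W^2, which is exactly the recurrent interaction that makes DBM training delicate. Sizes and the random parameters are illustrative, and biases are omitted.

```python
# Conditional activations in a 2-layer binary DBM (biases omitted for brevity).
import torch

D, K1, K2 = 784, 256, 64
W1 = 0.01 * torch.randn(D, K1)        # visible <-> first hidden layer
W2 = 0.01 * torch.randn(K1, K2)       # first <-> second hidden layer

x = (torch.rand(D) < 0.5).float()     # a binary visible vector
h2 = (torch.rand(K2) < 0.5).float()   # current state of the top hidden layer

# P(h1_j = 1 | x, h2): bottom-up and top-down contributions are summed.
p_h1 = torch.sigmoid(x @ W1 + h2 @ W2.t())

# P(x_i = 1 | h1): reconstruction of the visible layer from a sample of h1.
h1 = (torch.rand(K1) < p_h1).float()
p_x = torch.sigmoid(h1 @ W1.t())
```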

Pretraining DBM
How do we get this?
1. (Pre)training the first layer entails fitting the model P(x|θ) = Σ_{h^1} P(h^1|W^1) P(x|h^1, W^1), which implies a prior P(h^1|W^1) = Σ_x P(h^1, x|W^1).
2. (Pre)training the second layer changes the h^1 prior to P(h^1|W^2) = Σ_{h^2} P(h^1, h^2|W^2).
When putting things together, we need to average between the two.

Pretraining DBM - Trick
Averaging the two models of h^1 can be approximated by taking half of the contribution from W^1 and half from W^2. Using the full W^1 and W^2 would double-count the contribution of x, since h^2 depends on x. When training with more than two RBMs, apply the trick to the first and last layers and halve the weights (in both directions) of the intermediate RBMs.

DBM Discriminative Fine Tuning
The pretrained DBM matrices can be used to initialize a deep feed-forward network:
- Add the input from h^2 (through P(h^2|v)) to the first hidden layer
- Add an output layer
- Fine-tune the RBM matrices by backpropagation

Software - Deep Neural Autoencoders
All deep learning frameworks offer facilities to build (deep) AEs:
- Check out the classic Theano-based tutorials for denoising autoencoders and their stacked version
- A variety of deep AEs in Keras, and their counterparts in Lua-Torch
- Stacked autoencoders built with the official Matlab toolbox functions

Matlab - Deep Generative Models
- Matlab code for the DBN paper, with a demo on MNIST data
- Matlab code for Deep Boltzmann Machines, with a demo on MNIST data
- Deepmat: a Matlab library for deep generative models
- DeeBNet: a Matlab/Octave toolbox for deep generative models with GPU support

Python - Deep Generative Models
DBN and DBM implementations exist for all major deep learning libraries:
- A Deep Boltzmann Machine implementation (TensorFlow-based) with an image processing application, pre-trained networks and notebooks
- Deepnet: a Toronto-based implementation of deep autoencoders (neural and generative)
- Check out the classic Theano-based tutorials for Deep Belief Networks and RBMs

AE - Visualization: visualizing complex data in the learned latent space

Visualizing Sound

AE Image Restoration/Colorization: apply the autoencoder construction with advanced building blocks (e.g. CNN layers)

DBM Learning Image Features [Figure: CIFAR-10 images, level-1 filters, level-2 filters] https://github.com/monsta-hd/boltzmann-machines

Multimodal DBM: modality fusion layers combining inputs from Modality 1 through Modality K. N. Srivastava, R. Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, JMLR 2014

Multimodal DBM - Image and Text: the joint model supports both P(txt|img) and P(img|txt). N. Srivastava, R. Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, JMLR 2014

Multimodal DBM - Sampling. N. Srivastava, R. Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, JMLR 2014

Multimodal DBM - Multimodal Querying. N. Srivastava, R. Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, JMLR 2014

Multimodal DBM. Pang et al., Deep Multimodal Learning for Affective Analysis and Retrieval, IEEE Trans. on Multimedia, 2015

Take Home Messages
- Regularized autoencoders optimize reconstruction quality while constraining the stored information
- Autoencoder training is manifold learning: learn a latent-space manifold where the input data resides, storing only the variations that are useful to represent the training data
- Autoencoders learn a (conditional) distribution of the input data
- Deep AE: pretraining, fine tuning, supervised optimization
- Use AEs to find new/useful data representations, or to learn the input distribution

Next Lecture
The next lecture will be on Monday, 16th April 2018; stay tuned on Moodle for the topic. Dates still to finalize:
- Lecture recovery: 19th April 2018, 11-13 (room TBD)
- Midterm 2: 3rd May 2018, 11-13 (room TBD)