Unsupervised Learning


CS 3750 Advanced Machine Learning (hkc6@pitt.edu)

Unsupervised Learning
- Data: just data, no labels
- Goal: learn some underlying hidden structure of the data
- Examples: Principal Component Analysis (dimensionality reduction), Autoencoders (feature learning), Generative Models p(x)

Generative Models

Given training data, generate new samples from the same distribution. Example: the CIFAR-10 dataset (Krizhevsky and Hinton, 2009). Training data ~ p_data(x); generated samples ~ p_model(x). We want to learn a p_model(x) similar to p_data(x).

Why generative models?
- Realistic samples for artwork, super-resolution, colorization, etc.
- Generative models of time-series data can be used for simulation and planning (reinforcement learning applications!)
- Training generative models can also enable inference of latent representations that can be useful as general features

Taxonomy of Generative Models
- Explicit density
  - Tractable density: fully visible belief nets (NADE, MADE, PixelRNN/CNN), change of variables models (nonlinear ICA)
  - Approximate density
    - Variational: Variational Autoencoder
    - Markov chain: Boltzmann Machine
- Implicit density
  - Direct: GAN
  - Markov chain: GSN

Restricted Boltzmann Machines (RBM)

Restricted Boltzmann Machines

Many interesting theoretical results about undirected models depend on the assumption that, for all x, the unnormalized probability p̃(x) > 0. A convenient way to enforce this condition is to use an energy-based model

p̃(x) = exp(-E(x))

where E(x) is known as the energy function. Remember the normalized probability distribution: p(x) = (1/Z) p̃(x). Any distribution of this form is an example of a Boltzmann distribution. For this reason, many energy-based models are called Boltzmann machines.

Example: for a graph over variables a, b, c, d, e, f with edges (a,b), (b,c), (a,d), (b,e), (e,f), the energy decomposes over the edges, E(a, b, c, d, e, f) = E_{a,b}(a,b) + E_{b,c}(b,c) + E_{a,d}(a,d) + E_{b,e}(b,e) + E_{e,f}(e,f), and each factor is of the form φ_{a,b}(a, b) = exp(-E_{a,b}(a, b)).

Boltzmann machines were originally introduced as a general "connectionist" approach to learning arbitrary probability distributions over binary vectors. While Boltzmann machines were defined to encompass both models with and without latent variables, the term Boltzmann machine is today most often used to designate models with latent variables.

Joint probability distribution: p(x, h) = (1/Z) exp(-E(x, h))
Energy function: E(x, h) = -x^T R x - x^T W h - h^T S h - c^T x - b^T h

Restricted Boltzmann Machines

Restricted Boltzmann machines (RBMs) are undirected probabilistic graphical models containing a layer of observable variables and a single layer of latent variables. An RBM is a bipartite graph, with no connections permitted between any variables in the observed layer or between any units in the latent layer. (Related architectures: the restricted Boltzmann machine, the deep belief network, and the deep Boltzmann machine.)

Model components: a hidden layer h of binary units with biases b_j, a visible layer x of binary units with biases c_k, and a weight matrix W of connections between the two layers.

p(x, h) = exp(-E(x, h)) / Z
        = exp(h^T W x + c^T x + b^T h) / Z
        = exp(h^T W x) exp(c^T x) exp(b^T h) / Z    (product of factors)

Z = Σ_{x,h} exp(-E(x, h)) is the partition function (intractable).

The notation based on an energy function is simply an alternative to the representation as a product of factors.
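As a concrete illustration of the energy function and partition function above, here is a minimal NumPy sketch (not from the slides; the sizes and values are illustrative). The brute-force Z is only feasible for tiny toy models.

```python
# Minimal NumPy sketch of the RBM energy E(x, h) = -h^T W x - c^T x - b^T h
# and the joint p(x, h) = exp(-E(x, h)) / Z, with Z enumerated exactly
# (exponential cost, toy sizes only).
import itertools
import numpy as np

def energy(x, h, W, b, c):
    """RBM energy for binary vectors x (visible) and h (hidden)."""
    return -(h @ W @ x + c @ x + b @ h)

def partition_function(W, b, c):
    """Exact Z by enumerating all (x, h) pairs."""
    H, D = W.shape
    Z = 0.0
    for x in itertools.product([0, 1], repeat=D):
        for h in itertools.product([0, 1], repeat=H):
            Z += np.exp(-energy(np.array(x), np.array(h), W, b, c))
    return Z

rng = np.random.default_rng(0)
H, D = 3, 4                      # 3 hidden units, 4 visible units
W = 0.1 * rng.standard_normal((H, D))
b, c = np.zeros(H), np.zeros(D)  # hidden and visible biases

x = np.array([1, 0, 1, 0]); h = np.array([1, 1, 0])
Z = partition_function(W, b, c)
print("p(x, h) =", np.exp(-energy(x, h, W, b, c)) / Z)
```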

Restricted Boltzmann Machines

Bipartite structure: hidden variables h_1, h_2, ..., h_H; visible variables x_1, x_2, ..., x_D.

p(x, h) = (1/Z) [Π_{j,k} exp(W_{j,k} h_j x_k)] [Π_k exp(c_k x_k)] [Π_j exp(b_j h_j)]

The first product collects the pair-wise factors; the other two collect the unary factors. This scalar (per-unit) view is more informative of the structure within the vectors.

RBM: Inference

"Restricted" means there is no interaction between the hidden variables (and none between the visible variables), so the conditionals factorize:

p(h | x) = Π_j p(h_j | x)    and similarly    p(x | h) = Π_k p(x_k | h)

These factorized conditionals are easy to compute (as in Markov random fields, Boltzmann machines, log-linear models): inferring the distribution over the hidden variables is easy.

RBM: Inference

Conditional distributions:

p(h | x) = Π_j p(h_j | x), with
p(h_j = 1 | x) = 1 / (1 + exp(-(b_j + W_{j·} x))) = sigm(b_j + W_{j·} x),    where W_{j·} is the j-th row of W

p(x | h) = Π_k p(x_k | h), with
p(x_k = 1 | h) = 1 / (1 + exp(-(c_k + h^T W_{·k}))) = sigm(c_k + h^T W_{·k}),    where W_{·k} is the k-th column of W

RBM: Free Energy

What about computing the marginal p(x)?

p(x) = Σ_{h ∈ {0,1}^H} p(x, h) = Σ_{h ∈ {0,1}^H} exp(-E(x, h)) / Z
     = exp(c^T x + Σ_{j=1}^H log(1 + exp(b_j + W_{j·} x))) / Z
     = exp(-F(x)) / Z,    where F(x) is called the free energy
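A minimal NumPy sketch of these factorized conditionals, and of one Gibbs half-step sampled from them; the shapes (W: H x D, b: H, c: D) and values are illustrative assumptions.

```python
# Factorized RBM conditionals p(h_j = 1 | x) = sigm(b_j + W_{j.} x) and
# p(x_k = 1 | h) = sigm(c_k + h^T W_{.k}), plus Bernoulli sampling from them.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_x(x, W, b):
    return sigmoid(b + W @ x)        # vector of p(h_j = 1 | x)

def p_x_given_h(h, W, c):
    return sigmoid(c + W.T @ h)      # vector of p(x_k = 1 | h)

def sample_bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

rng = np.random.default_rng(0)
H, D = 3, 4
W = 0.1 * rng.standard_normal((H, D)); b = np.zeros(H); c = np.zeros(D)
x = np.array([1.0, 0.0, 1.0, 0.0])
h = sample_bernoulli(p_h_given_x(x, W, b), rng)        # sample hidden given visible
x_recon = sample_bernoulli(p_x_given_h(h, W, c), rng)  # sample visible given hidden
```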

RBM: Free Energy

What about computing the marginal p(x)? Writing the sum over h explicitly:

p(x) = Σ_{h ∈ {0,1}^H} exp(h^T W x + c^T x + b^T h) / Z
     = exp(c^T x) Σ_{h_1 ∈ {0,1}} ... Σ_{h_H ∈ {0,1}} Π_j exp(h_j W_{j·} x + b_j h_j) / Z
     = exp(c^T x) [Σ_{h_1 ∈ {0,1}} exp(h_1 W_{1·} x + b_1 h_1)] ... [Σ_{h_H ∈ {0,1}} exp(h_H W_{H·} x + b_H h_H)] / Z
     = exp(c^T x) (1 + exp(b_1 + W_{1·} x)) ... (1 + exp(b_H + W_{H·} x)) / Z
     = exp(c^T x) exp(log(1 + exp(b_1 + W_{1·} x))) ... exp(log(1 + exp(b_H + W_{H·} x))) / Z
     = exp(c^T x + Σ_{j=1}^H log(1 + exp(b_j + W_{j·} x))) / Z

This is also known as a Product of Experts model.

RBM: Free Energy

p(x) = exp(c^T x + Σ_{j=1}^H log(1 + exp(b_j + W_{j·} x))) / Z
     = exp(c^T x + Σ_{j=1}^H softplus(b_j + W_{j·} x)) / Z

Here c^T x acts as a bias on the probability of each x_i, and each softplus term adds the evidence for the corresponding feature (hidden unit) being expected in x.
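A short NumPy sketch of the free energy F(x) = -c^T x - Σ_j softplus(b_j + W_{j·} x) implied by the derivation above; np.logaddexp(0, a) is used as a numerically stable softplus, and the toy parameters are illustrative.

```python
# Free energy of an RBM: p(x) is proportional to exp(-F(x)), so comparing
# free energies of two inputs compares their probabilities without knowing Z.
import numpy as np

def free_energy(x, W, b, c):
    return -(c @ x) - np.sum(np.logaddexp(0.0, b + W @ x))

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((3, 4)); b = np.zeros(3); c = np.zeros(4)
print(free_energy(np.array([1.0, 0.0, 1.0, 0.0]), W, b, c))
```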

RBM: Model Learning

Given a set of i.i.d. training examples {x^(t)}, we want to minimize the average negative log-likelihood:

(1/T) Σ_t l(f(x^(t))) = -(1/T) Σ_t log p(x^(t))

Derivative of the negative log-likelihood objective (for stochastic gradient descent):

∂(-log p(x^(t)))/∂θ = E_h[∂E(x^(t), h)/∂θ | x^(t)] - E_{x,h}[∂E(x, h)/∂θ]

Remember:
log p(x^(t)) = log Σ_h p(x^(t), h) = log (Σ_h exp(-E(x^(t), h)) / Z)
             = log Σ_h exp(-E(x^(t), h)) - log Z

The first expectation is the positive phase; the second, coming from the log partition function, is the negative phase and is hard to compute. (See the Restricted Boltzmann Machines Tutorial by Chris Maddison.)

RBM: Contrastive Divergence

Key idea behind contrastive divergence: replace the expectation E_{x,h} by a point estimate at a negative sample x̃, obtained by Gibbs sampling with the chain started at x^(t):

x^(t) → h ~ p(h | x) → x ~ p(x | h) → ... (k steps) → x̃ = negative sample

RBM: Contrastive Divergence

Intuition:

∂(-log p(x^(t)))/∂θ = E_h[∂E(x^(t), h)/∂θ | x^(t)] - E_{x,h}[∂E(x, h)/∂θ]

Contrastive divergence approximates the two expectations by point estimates,

E_h[∂E(x^(t), h)/∂θ | x^(t)] ≈ ∂E(x^(t), ĥ(x^(t)))/∂θ    and    E_{x,h}[∂E(x, h)/∂θ] ≈ ∂E(x̃, ĥ(x̃))/∂θ,

so the positive phase pushes down the energy around the training point x^(t) while the negative phase pushes up the energy around the sampled point x̃.

RBM: Deriving the Learning Rule

Let us look at the derivative of E(x, h) for θ = W_{j,k}:

∂E(x, h)/∂W_{j,k} = ∂/∂W_{j,k} (-Σ_{j,k} W_{j,k} h_j x_k - Σ_k c_k x_k - Σ_j b_j h_j)
                  = -∂/∂W_{j,k} Σ_{j,k} W_{j,k} h_j x_k
                  = -h_j x_k

Remember: E(x, h) = -h^T W x - c^T x - b^T h. Hence: ∇_W E(x, h) = -h x^T.

Let us now derive E_h[∂E(x, h)/∂W_{j,k} | x]:

E_h[∂E(x, h)/∂W_{j,k} | x] = E_h[-h_j x_k | x] = -Σ_{h_j ∈ {0,1}} h_j x_k p(h_j | x) = -x_k p(h_j = 1 | x)

Hence: E_h[∇_W E(x, h) | x] = -ĥ(x) x^T, where ĥ(x) = [p(h_1 = 1 | x), ..., p(h_H = 1 | x)]^T = sigm(b + W x).

RBM: Deriving the Learning Rule

For θ = W, the stochastic gradient step on example x^(t) is:

W ← W - α ∂(-log p(x^(t)))/∂W
  = W - α (E_h[∇_W E(x^(t), h) | x^(t)] - E_{x,h}[∇_W E(x, h)])
  ≈ W - α (E_h[∇_W E(x^(t), h) | x^(t)] - E_h[∇_W E(x̃, h) | x̃])
  = W + α (ĥ(x^(t)) x^(t)T - ĥ(x̃) x̃^T)

where α is the learning rate.

RBM: CD-k Algorithm

For each training example x^(t):
- Generate a negative sample x̃ using k steps of Gibbs sampling, starting at the data point x^(t)
- Update the model parameters:
  W ← W + α (ĥ(x^(t)) x^(t)T - ĥ(x̃) x̃^T)
  b ← b + α (ĥ(x^(t)) - ĥ(x̃))
  c ← c + α (x^(t) - x̃)
Go back to the first step until the stopping criterion is met. (A NumPy sketch of these updates follows below.)
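Below is a minimal sketch of the CD-k updates just listed, in NumPy; the data, sizes, learning rate, and number of epochs are illustrative assumptions, not values from the lecture.

```python
# CD-k updates for a binary RBM, one training example per update.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd_k_update(x, W, b, c, rng, alpha=0.05, k=1):
    # Positive phase: hidden probabilities at the data point x^(t).
    h_data = sigmoid(b + W @ x)
    # Gibbs chain started at x^(t) to obtain the negative sample x_tilde.
    x_tilde = x.copy()
    for _ in range(k):
        h_sample = (rng.random(b.shape) < sigmoid(b + W @ x_tilde)).astype(float)
        x_tilde = (rng.random(c.shape) < sigmoid(c + W.T @ h_sample)).astype(float)
    h_tilde = sigmoid(b + W @ x_tilde)
    # Parameter updates from the CD-k algorithm above.
    W += alpha * (np.outer(h_data, x) - np.outer(h_tilde, x_tilde))
    b += alpha * (h_data - h_tilde)
    c += alpha * (x - x_tilde)
    return W, b, c

# Toy usage: 6 visible units, 3 hidden units, random binary "data".
rng = np.random.default_rng(1)
H, D = 3, 6
W = 0.01 * rng.standard_normal((H, D)); b = np.zeros(H); c = np.zeros(D)
data = (rng.random((100, D)) < 0.5).astype(float)
for epoch in range(10):
    for x in data:
        W, b, c = cd_k_update(x, W, b, c, rng, k=1)
```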

RBM: CD-k Algorithm

CD-k: contrastive divergence with k iterations of Gibbs sampling. In general, the bigger k is, the less biased the estimate of the gradient will be. In practice, k = 1 works pretty well for learning good features and for pre-training.

RBM: Persistent CD (Stochastic Maximum Likelihood Estimator)

Idea: instead of initializing the chain at x^(t), initialize the chain at the negative sample x̃ of the last iteration:

x̃_old → h ~ p(h | x) → x ~ p(x | h) → ... (k steps) → x̃ = negative sample

where x̃_old comes from the previous iteration.
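A sketch of the persistent-CD variant described above, assuming the same NumPy setup as the CD-k sketch; the only change is that the Gibbs chain starts from a persistent state carried across updates rather than from the current data point.

```python
# Persistent CD: the Gibbs chain is continued from the previous negative sample.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pcd_update(x, x_persist, W, b, c, rng, alpha=0.05, k=1):
    h_data = sigmoid(b + W @ x)      # positive phase at the data point
    x_tilde = x_persist              # chain continues from the last iteration
    for _ in range(k):
        h_sample = (rng.random(b.shape) < sigmoid(b + W @ x_tilde)).astype(float)
        x_tilde = (rng.random(c.shape) < sigmoid(c + W.T @ h_sample)).astype(float)
    h_tilde = sigmoid(b + W @ x_tilde)
    W += alpha * (np.outer(h_data, x) - np.outer(h_tilde, x_tilde))
    b += alpha * (h_data - h_tilde)
    c += alpha * (x - x_tilde)
    return W, b, c, x_tilde          # x_tilde becomes the next persistent state
```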

Variational Autoencoders (VAE)

Autoencoders (Recap)

An unsupervised approach for learning a lower-dimensional feature representation z from unlabeled training data x: an encoder maps the input data x to features z, and a decoder maps z back to reconstructed input data.

Architectures: originally linear layers + a nonlinearity (sigmoid); later deep, fully-connected networks; later ReLU CNNs (with upconvolutions in the decoder).

Autoencoders (Recap)

Train the model so that the features z can be used to reconstruct the original data, using an L2 loss ||x - x̂||^2 between the input and the reconstruction. This doesn't use labels!

Autoencoders (Recap)

The trained encoder can be used to initialize a supervised model: put a classifier on top of the features z to predict a label ŷ (e.g. bird, plane, panther, truck, dog), train with a classification loss (softmax, etc.), and fine-tune the encoder jointly with the classifier.

Autoencoders (Recap)

Autoencoders can reconstruct data and can learn features to initialize a supervised model. The features capture factors of variation in the training data. Can we generate new data from an autoencoder?

Variational autoencoders put a probabilistic spin on autoencoders; this will let us sample from the model to generate data!

Assume the training data {x^(i)}_{i=1}^N is generated from an underlying unobserved (latent) representation z: sample z from the true prior p_θ(z), then sample x^(i) from the true conditional p_θ(x | z).

Intuition: x is an image, z is the latent factors used to generate x: attributes, orientation, etc.

We want to estimate the true parameters θ of this generative model, in which z is sampled from the true prior p_θ(z) and x is sampled from the true conditional p_θ(x | z), represented by a decoder network.

How should we represent this model? Choose the prior p(z) to be simple, e.g. Gaussian. The conditional p(x | z) is complex (it generates an image), so represent it with a neural network.

How do we train this model? Learn the model parameters to maximize the likelihood of the training data:

p_θ(x) = ∫ p_θ(z) p_θ(x | z) dz

Data likelihood: p_θ(x) = ∫ p_θ(z) p_θ(x | z) dz. Intractable: we cannot compute p(x | z) for every z!

The posterior density is also intractable: p_θ(z | x) = p_θ(x | z) p_θ(z) / p_θ(x).

Solution: in addition to the decoder network modeling p_θ(x | z), define an additional encoder network q_φ(z | x) that approximates p_θ(z | x). We will see that this allows us to derive a lower bound on the data likelihood that is tractable, which we can optimize.

Since we're modeling probabilistic generation of data, the encoder and decoder networks are probabilistic:
- The encoder network q_φ(z | x) (parameters φ) outputs the mean μ_{z|x} and (diagonal) covariance Σ_{z|x} of z given x.
- The decoder network p_θ(x | z) (parameters θ) outputs the mean μ_{x|z} and (diagonal) covariance Σ_{x|z} of x given z.

Since we're modeling probabilistic generation of data, the encoder and decoder networks are probabilistic: sample z ~ N(μ_{z|x}, Σ_{z|x}) from the encoder network q_φ(z | x) (parameters φ), and sample x ~ N(μ_{x|z}, Σ_{x|z}) from the decoder network p_θ(x | z) (parameters θ).

Now equipped with our encoder and decoder networks, let's work out the (log) data likelihood:

log p_θ(x^(i)) = E_{z ~ q_φ(z | x^(i))}[log p_θ(x^(i))]    (p_θ(x^(i)) does not depend on z)
             = E_z[log (p_θ(x^(i) | z) p_θ(z) / p_θ(z | x^(i)))]    (Bayes rule)
             = E_z[log (p_θ(x^(i) | z) p_θ(z) q_φ(z | x^(i)) / (p_θ(z | x^(i)) q_φ(z | x^(i))))]    (multiply by a constant)
             = E_z[log p_θ(x^(i) | z)] - E_z[log (q_φ(z | x^(i)) / p_θ(z))] + E_z[log (q_φ(z | x^(i)) / p_θ(z | x^(i)))]    (logarithms)
             = E_z[log p_θ(x^(i) | z)] - D_KL(q_φ(z | x^(i)) || p_θ(z)) + D_KL(q_φ(z | x^(i)) || p_θ(z | x^(i)))

- First term: the decoder network gives p_θ(x | z), so we can compute an estimate of this term through sampling.
- Second term: this KL divergence (between two Gaussians, the encoder distribution and the prior) has a nice closed-form solution!
- Third term: p_θ(z | x) is intractable (as seen earlier), so we can't compute this KL term. But we know a KL divergence is always >= 0.

The first two terms form L(x^(i), θ, φ), a tractable lower bound that we can take gradients of and optimize (p_θ(x | z) is differentiable and the closed-form KL term is differentiable). Since the third KL term is >= 0:

log p_θ(x^(i)) >= L(x^(i), θ, φ) = E_z[log p_θ(x^(i) | z)] - D_KL(q_φ(z | x^(i)) || p_θ(z))

L(x^(i), θ, φ) is the variational lower bound ("ELBO").

Training: maximize the lower bound over the training set,

θ*, φ* = arg max_{θ,φ} Σ_{i=1}^N L(x^(i), θ, φ)
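A minimal NumPy sketch of a one-sample estimate of this lower bound, assuming a diagonal-Gaussian encoder, a standard normal prior, and a Bernoulli decoder; the function and variable names here are illustrative, not from the lecture.

```python
# Per-example lower bound L(x) = E_z[log p_theta(x | z)] - KL(q_phi(z | x) || N(0, I)),
# with q_phi(z | x) = N(mu, diag(exp(log_var))).
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)).
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def bernoulli_log_likelihood(x, x_recon_prob, eps=1e-7):
    # One-sample Monte Carlo estimate of log p_theta(x | z) for binary x.
    p = np.clip(x_recon_prob, eps, 1.0 - eps)
    return np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))

def elbo(x, x_recon_prob, mu, log_var):
    return bernoulli_log_likelihood(x, x_recon_prob) - kl_to_standard_normal(mu, log_var)

# Tiny illustrative call with made-up encoder/decoder outputs.
mu = np.array([0.1, -0.2]); log_var = np.array([-0.5, 0.3])
x = np.array([1.0, 0.0, 1.0]); x_recon_prob = np.array([0.9, 0.2, 0.7])
print(elbo(x, x_recon_prob, mu, log_var))
```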

Putting it all together: maximizing the likelihood lower bound

L(x^(i), θ, φ) = E_z[log p_θ(x^(i) | z)] - D_KL(q_φ(z | x^(i)) || p_θ(z))

For an input data point x:
- Pass x through the encoder network q_φ(z | x) to get μ_{z|x} and Σ_{z|x}.
- The KL term D_KL(q_φ(z | x) || p_θ(z)) makes the approximate posterior distribution close to the prior.
- Sample z ~ N(μ_{z|x}, Σ_{z|x}).
- Pass z through the decoder network p_θ(x | z) to get μ_{x|z} and Σ_{x|z}.
- Sample the reconstruction x̂ ~ N(μ_{x|z}, Σ_{x|z}); the expectation term maximizes the likelihood of the original input being reconstructed.

For every minibatch, compute this forward pass and then backpropagate through both networks to update θ and φ. (A sketch of this pipeline follows below.)
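A compact sketch of the whole pipeline, written in PyTorch (the lecture does not prescribe a framework); the layer sizes (784-dimensional binary inputs, 20-dimensional latent code), the Bernoulli decoder, and the training hyperparameters are illustrative assumptions.

```python
# Encoder -> (mu, log sigma^2) -> reparameterized sample z -> decoder -> reconstruction,
# trained by minimizing the negative of the lower bound derived above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mean of q_phi(z | x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log-variance of q_phi(z | x)
        self.dec = nn.Linear(z_dim, h_dim)
        self.dec_out = nn.Linear(h_dim, x_dim)      # Bernoulli decoder p_theta(x | z)

    def encode(self, x):
        h = F.relu(self.enc(x))
        return self.enc_mu(h), self.enc_logvar(h)

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps with eps ~ N(0, I), so sampling stays differentiable.
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps

    def decode(self, z):
        return torch.sigmoid(self.dec_out(F.relu(self.dec(z))))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

def negative_elbo(x, x_recon, mu, logvar):
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")    # -E[log p(x|z)]
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # KL(q || N(0, I))
    return recon + kl

# One illustrative training step on a random binary batch.
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.bernoulli(torch.rand(32, 784))
x_recon, mu, logvar = model(x)
loss = negative_elbo(x, x_recon, mu, logvar)
opt.zero_grad(); loss.backward(); opt.step()
```

Generating new data afterwards only needs the decoder and a draw from the prior, e.g. `model.decode(torch.randn(n, 20))`, which matches the generation procedure described next.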

Generating data: use the decoder network and now sample z from the prior! Sample z ~ N(0, I), pass it through the decoder network p_θ(x | z) to get μ_{x|z} and Σ_{x|z}, and sample x ~ N(μ_{x|z}, Σ_{x|z}).

The diagonal prior on z gives independent latent variables. Different dimensions of z encode interpretable factors of variation: for example, varying z_1 changes the degree of smile while varying z_2 changes the head pose.

The latent code is also a good feature representation that can be computed from an input x using q_φ(z | x)!

Generating data: sample grids of 32x32 CIFAR-10 images and of faces from Labeled Faces in the Wild. [Figures (L) from Dirk Kingma et al. 2016; (R) from Anders Larsen et al. 2017.]

Variational autoencoders put a probabilistic spin on traditional autoencoders, which allows generating data.

Pros:
- Principled approach to generative models
- Allows inference of q(z | x), which can be a useful feature representation for other tasks

Cons:
- Maximizes a lower bound of the likelihood rather than the likelihood itself: okay, but not an exact objective
- Samples are blurrier and of lower quality compared to the state-of-the-art (GANs)

Active areas of research:
- More flexible approximations, e.g. a richer approximate posterior instead of a diagonal Gaussian
- Incorporating structure in the latent variables

Tools
- Python library for Bernoulli Restricted Boltzmann Machines: sklearn.neural_network.BernoulliRBM (see the usage sketch below)
- Python Keras for variational autoencoders
- Generative models (including RBM and VAE): https://github.com/wiseodd/generative-models
- Variational autoencoder tutorial: http://pyro.ai/examples/vae.html
- Code: https://github.com/uber/pyro/tree/dev/examples/vae
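A quick usage sketch of the scikit-learn class listed above, on random binary data; the data and hyperparameters are placeholders (scikit-learn trains BernoulliRBM with persistent contrastive divergence / stochastic maximum likelihood).

```python
# Fit a Bernoulli RBM with scikit-learn and use it to extract hidden features.
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
X = (rng.random((200, 64)) < 0.3).astype(float)   # 200 binary "images", 64 visible units

rbm = BernoulliRBM(n_components=16, learning_rate=0.05, batch_size=10,
                   n_iter=20, random_state=0)
rbm.fit(X)

H = rbm.transform(X)          # p(h_j = 1 | x) for each example: learned features
v_step = rbm.gibbs(X[:5])     # one Gibbs sampling step starting from 5 data points
print(H.shape, v_step.shape)  # (200, 16) (5, 64)
```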

References and Resources
- Goodfellow, Ian, et al. Deep Learning. Vol. 1. Cambridge: MIT Press, 2016.
- Diederik P. Kingma, Max Welling: Auto-Encoding Variational Bayes. ICLR 2014.
- Restricted Boltzmann Machines Tutorial - Chris Maddison: https://www.cs.toronto.edu/~tijmen/csc321/documents/maddison_rbmtutorial.pdf
- Geoffrey E. Hinton: Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation (2002).
- Generative Models Lecture (Stanford University): http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture13.pdf
- Restricted Boltzmann Machines (CMU): https://www.youtube.com/watch?v=jfpp9cy1efo
- Hugo Larochelle's class on Neural Networks: http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html
- Image Generation (ICLR 2018): https://www.youtube.com/watch?v=g06decz-qtg

THANK YOU!

Hung Kim Chau
University of Pittsburgh, School of Computing and Information
Department of Informatics and Networked Systems
135 N Bellefield Ave, Pittsburgh, PA 15260
Phone: +1 (412) 552-3055