Unsupervised Learning

Size: px

Start display at page:

Download "Unsupervised Learning"

Dustin Ryan
5 years ago
Views:

CS 3750 Advanced Machine Learning hkc6@pitt.

Component Analysis (Dimensionality reduction)

1 CS 3750 Advanced Machine Learning Unsupervised Learning Data: Just data, no labels Goal: Learn some underlying hidden structure of the data P(, ) P( ) Principle Component Analysis (Dimensionality reduction) Autoencoders (Feature learning) P( ) Generative Models 1

Generative Models Given training data, generate new samples from same distribution CIFAR-10

Want to learn p model () similar to p data () Why Generative Models?

Generative models of time-series data can be used for simulation and planning (reinforcement

2 Generative Models Given training data, generate new samples from same distribution CIFAR-10 dataset (Krihevsky and Hinton, 2009) Training data p data () Generated sampled p model () Want to learn p model () similar to p data () Why Generative Models? Realistic samples for artwork, super-resolution, coloriation, etc. Generative models of time-series data can be used for simulation and planning (reinforcement learning applications!) Training generative models can also enable inference of latent representation that can be useful as general features 2

3 Taonomy of Generative Models Generative Models Eplicit density Implicit density Tractable density Approimate density Markov Chain Direct Fully visible belief nets - NADE - MADE - PielRNN/CNN Change of variables models (nonlinear ICA) Variational Variational Autoencoder GSN Markov Chain Boltmann Machine GAN Restricted Boltmann Machines (RBM) 3

4 Restricted Boltmann Machines Many interesting theoretical results about undirected models depends on the assumption that, p > 0. A convenient way to enforce this condition is to use an energy-based model where E() is known as the energy function p = ep( E ) Remember: normalied probability distribution p() = 1 p Z Any distribution of this form is an eample of a Boltmann distribution. For this reason, many energy-based models are called Boltmann machines. a b c d e f E(a, b, c, d, e, f) can be written as E a,b (a,b) + E b,c (b,c) + E a,d (a,d) + E b,e (b,e) + E e,f (e,f) φ a,b (a, b) = ep( E(a, b) Restricted Boltmann Machines Boltmann machines were originally introduced as a general connectionist approach to learning arbitrary probability distributions over binary vectors While Boltmann machines were defined to encompass both models with and without latent variables, the term Boltmann machine is today most often used to designate models with latent variables Joint probability distribution: p, h = 1 ep( E, h ) Z Energy function: E, h = T R T Wh h T Sh c T b T h 4

Restricted Boltmann Machines Restricted Boltmann machines (RBMs) are undirected probabilistic graphical models containing a

between any variables in the observed layer or between any units in the latent layer Restricted Boltmann machine Deep belief

layer (binary units) p, h = ep E, h /Z = ep h T W + c T + b T h /Z = ep h T W ep c T ep b T h /Z Factors Z = ep( E, h ) The

5 Restricted Boltmann Machines Restricted Boltmann machines (RBMs) are undirected probabilistic graphical models containing a layer of observable variables and a single layer of latent variables RBM is a bipartite graph, with no connections permitted between any variables in the observed layer or between any units in the latent layer Restricted Boltmann machine Deep belief network Deep Boltmann machine Restricted Boltmann Machines b j h hidden layer (binary units) bias W connections c k visible layer (binary units) p, h = ep E, h /Z = ep h T W + c T + b T h /Z = ep h T W ep c T ep b T h /Z Factors Z = ep( E, h ) The notation based on an energy function is simply an alternative to the representation as the product of factors,h partition function (intractable) 5

6 Restricted Boltmann Machines Pair-wise Factors Hidden variables h 1 h 2... h 3 h H p, h = 1 Z j ep W j,k h j k k Visible variables D Bipartite Structure ep c k k k ep b j h j j Unary Factors The scalar visualiation is more informative of the structure within the vectors RBM: Inference Hidden variables h 1 h 2... h 3 h H Bipartite Structure Restricted: No interaction between hidden variables Visible variables p h Similarly: p h D = p(h j ) j = p( k h) k Factories: easy to compute Markov random fields, Boltmann machines, log-linear models Inferring the distribution over the hidden variables is easy 6

7 RBM: Inference Conditional Distributions h p h = p(h j ) j 1 p h j = 1 = 1 + ep b j + W j. = sigm(b j + W j. ) j th row if W h p h = p( k h) k 1 p k = 1 h = 1 + ep c k + h T W. k = sigm(c k + h T W. k ) k th column if W RBM: Free Energy What about computing marginal p()? p = p, h = ep E(, h) /Z h 0,1 H h 0,1 H H = ep c T + log 1 + ep b j + W j. j=1 = ep F() /Z /Z X X X X X X h Free energy 7

8 RBM: Free Energy What about computing marginal p()? p = h 0,1 H ep h T W + c T + b T h /Z = ep c T h 1 {0,1} h H {0,1} ep h i W j. + b j h j j /Z = ep c T ep(h 1 W 1. + b 1 h 1 ) ep h H W H. + b H h H h 1 {0,1} h H 0,1 /Z = ep c T 1 + ep b 1 + W ep b H + W H. /Z = ep c T ep(log 1 + ep b 1 + W 1. ) ep(log( 1 + ep b H + W H. )/Z H = ep c T + log 1 + ep b j + W j. /Z j=1 Also known as Product of Eperts model RBM: Free Energy H p = ep c T + log 1 + ep b j + W j. j=1 H = ep c T + softplus(b j + W j.) /Z j=1 /Z X X X X X X h bias of the probability of each i bias of each feature feature epected in 8

9 RBM: Model Learning h 1 h 2... h 3 h H 1 Hidden variables 2... Visible variables D Derivative of the negative log-likelihood objective (stochastic gradient descent): log p( (t) ) E( t, h) = E h (t) E(, h) E,h Given a set of i.i.d. training eamples we want to minimie the average negative log-likelihood: 1 T l(f( (t) ) = 1 T log p( (t) ) t t Remember: log p( (t) ) = log p( (t), h) h = log h ep( E t, h ) Z = log ep( E t, h ) h log Z Positive phase Negative phase Hard to compute Restricted Boltmann machines Tutorial - Chris Maddison RBM: Contrastive Divergence Key idea behind Contrastive Divergence: Replace the epectation by a point estimate at Obtain the point by Gibbs sampling Start sampling chain at (t) ~p(h ) ~p( h) (t) 1 k = negative sample 9

10 RBM: Contrastive Divergence Intuition: log p( (t) ) E( t, h) = E h (t) E(, h) E,h E h E( t, h) (t) E( t, h (t) ) E,h E(, h) E(, h) RBM: Contrastive Divergence Intuition: log p( (t) ) E( t, h) = E h (t) E(, h) E,h E h E( t, h) (t) E( t, h (t) ) E,h E(, h) E(, h) 10

11 RBM: Deriving Learning Rule Let us look at derivative of E(,h) for θ = W jk E(, h) = W jk W j,k h j k c k k b j h j jk = W W j,k h j k jk jk = h j k k j Remember: E, h = h T W c T b T h Hence: W E, h = h T RBM: Deriving Learning Rule Let us now derive E h E(,h) E h E(, h) W j,k = E h h j k = h j 0,1 h j k p(h j ) Hence: = k p h j = 1 E h W E, h = h() T h = p(h 1 = 1 ) p(h H = 1 ) = sigm(b + W) 11

12 RBM: Deriving Learning Rule (t) θ = W W W α W log p( (t) ) W α E h W E t, h (t) W α E h W E t, h (t) E,h W E, h E h W E, h W + α h( t ) t T h( ) T Learning rate RBM: CD-k Algorithm For each training eample t Generate a negative sample using k steps of Gibbs sampling, starting at the data point t Update model parameters: W W + α h( t ) t T h( ) T b b + α h( t ) h( ) c c + α (t) Go back to the first step until stopping criteria 12

13 RBM: CD-k Algorithm CD-k: contrastive divergence with k iterations of Gibbs sampling In general, the bigger k is, the less biased the estimate of the gradient will be In practice, k = 1 works pretty well for learning good features and for pre-training RBM: Persistent CD: Stochastic ML Estimator Idea: instead of initialiing the chain of t, initialie the chain to the negative sample of the last iteration ~p(h ) ~p( h) (t) 1 k = comes from the previous iteration negative sample 13

14 Variational Autoencoders (VAE) Autoencoders (Recap) Unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data Reconstructed data Reconstructed input data Decoder Originally: Linear + nonlinearity (sigmoid) Later: Deep, fully-connected Later: ReLU CNN (upconv) Encoder - Decoder Features Input data Encoder Input data 14

Autoencoders (Recap) Train a model such that features can be used to

Reconstructed data Reconstructed input data Decoder Encoder - Decoder Features

bird plane panther truck dog Predicted label y y Encoder can be used to

15 Autoencoders (Recap) Train a model such that features can be used to reconstruct original data L2 Loss function 2 Doesn t use labels! Reconstructed data Reconstructed input data Decoder Encoder - Decoder Features Input data Encoder Input data Autoencoders (Recap) Loss function (Softma, etc) bird plane panther truck dog Predicted label y y Encoder can be used to initialie a supervised model Features Classifier Encoder Fine-tune encoder jointly with classifier Input data 15

16 Autoencoders (Recap) Autoencoders can reconstruct data, and can learn features to initialie a supervised model Reconstructed input data Decoder Features capture factors of variation in training data. Can we generate new data from an autoencoder? Features Encoder Input data Probabilistic spin on autoencoders - will let us sample from the model to generate data! Assume training data (i) N i=1 representation Z Sample from true conditional p θ (i) is generated from underlying unobserved (latent) Intuition: is an image, is latent factors used to generate : attributes, orientation, etc. Sample from true prior p θ 16

17 We want to estimate the true parameters θ of this generative model Sample from true conditional p θ Sample from true prior p θ (i) Decoder network How should we represent this model? Choose prior p() to be simple, e.g. Gaussian Conditional p( ) is comple (generates image) => represent with neural network We want to estimate the true parameters θ of this generative model Sample from true conditional p θ Sample from true prior p θ (i) Decoder network How train this model? Learn model parameters to maimie likelihood of training data p θ = න p θ p θ d 17

18 Data likelihood: p θ = න p θ p θ d Intractable to compute p( ) for every! Posterior density also intractable: p θ = p θ p θ p θ Solution: in addition to decoder network modeling p θ ( ), define additional encoder network q φ ( ) that approimates p θ ( ) Will see that this allows us to derive a lower bound on the data likelihood that is tractable, which we can optimie Since we re modeling probabilistic generation of data, encoder and decoder networks are probabilistic Mean and (diagonal) covariance of Mean and (diagonal) covariance of μ Σ μ Σ Encoder network q φ ( ) (parameter φ) Decoder network p θ ( ) (parameter θ) 18

19 Since we re modeling probabilistic generation of data, encoder and decoder networks are probabilistic Sample from ~N μ, Σ Sample from ~N μ, Σ μ Σ μ Σ Encoder network q φ ( ) (parameter φ) Decoder network p θ ( ) (parameter θ) Now equipped with our encoder and decoder networks, let s work out the (log) data likelihood: log p θ (i) = E ~qφ (i) log p θ (i) ( p θ (i) does not depend on ) = E log p θ i p θ () p θ (i) (Bayes Rule) = E log p θ i p θ () q φ ( (i) ) p θ (i) q φ ( (i) ) (Multiply by constant) = E log p θ i E log q φ i p θ + E log q φ i p θ i (Logarithms) = E log p θ i D KL (q φ i p θ ) + D KL (q φ i p θ i ) Decoder network gives p θ ( ), can compute estimate of this term through sampling This KL term (between Gaussians for encoder and prior) has nice closed-form solution! p θ ( ) intractable (saw earlier), can t compute this KL term. But we know KL divergence always >= 0. 19

20 Now equipped with our encoder and decoder networks, let s work out the (log) data likelihood: log p θ (i) = E ~qφ (i) log p θ (i) ( p θ (i) does not depend on ) = E log p θ i p θ () p θ (i) (Bayes Rule) = E log p θ i p θ () q φ ( (i) ) p θ (i) q φ ( (i) ) (Multiply by constant) = E log p θ i E log q φ i p θ + E log q φ i p θ i (Logarithms) = E log p θ i D KL (q φ i p θ ) + D KL (q φ i p θ i ) L( i, θ, φ) 0 Tractable lower bound which we can take gradient of and optimie! (p θ ( ) differentiable, KL term differentiable) Now equipped with our encoder and decoder networks, let s work out the (log) data likelihood: log p θ (i) = E ~qφ (i) log p θ (i) ( p θ (i) does not depend on ) = E log p θ i p θ () p θ (i) = E log p θ i p θ () q φ ( (i) ) p θ (i) q φ ( (i) ) = E log p θ i E log q φ i (Bayes Rule) p θ (Multiply by constant) + E log q φ i p θ i (Logarithms) = E log p θ i D KL (q φ i p θ ) + D KL (q φ i p θ i ) log p θ (i) L( i, θ, φ) Variational lower bound ( ELBO ) L( i, θ, φ) θ, φ = arg ma θ,φ i=1 0 L( i, θ, φ) Training: Maimie lower bound N 20

21 Putting it all together: maimiing the likelihood lower bound E log p θ i D KL (q φ i p θ ) L( i, θ, φ) Putting it all together: maimiing the likelihood lower bound E log p θ i D KL (q φ i p θ ) L( i, θ, φ) Sample from ~N μ, Σ Encoder network q φ ( ) Input data μ Σ 21

22 Putting it all together: maimiing the likelihood lower bound E log p θ i D KL (q φ i p θ ) L( i, θ, φ) Make approimate posterior distribution close to prior Encoder network q φ ( ) Input data μ Σ Putting it all together: maimiing the likelihood lower bound E log p θ i D KL (q φ i p θ ) L( i, θ, φ) Make approimate posterior distribution close to prior Encoder network q φ ( ) Input data Sample from ~N μ, Σ μ Σ 22

23 Putting it all together: maimiing the likelihood lower bound E log p θ i D KL (q φ i p θ ) L( i, θ, φ) Make approimate posterior distribution close to prior Decoder network p θ ( ) Encoder network q φ ( ) Input data μ Σ Sample from ~N μ, Σ μ Σ Putting it all together: maimiing the likelihood lower bound Maimie likelihood of original input being reconstructed Sample from ~N μ, Σ E log p θ i D KL (q φ i p θ ) L( i, θ, φ) Make approimate posterior distribution close to prior Decoder network p θ ( ) Encoder network q φ ( ) Input data μ Σ Sample from ~N μ, Σ μ Σ 23

24 Use decoder network. Now sample from prior! Sample from ~N μ, Σ Decoder network p θ ( ) μ Σ Sample from ~N 0, I Diagonal prior on => independent latent variables Degree of smile Different dimensions of encode interpretable factors of variation Vary 1 Vary 2 Head pose 24

Diagonal prior on => independent latent variables Degree of

of variation Vary 1 Also good feature representation that

Vary 2 Head pose : Generating Data 3232 CIFAR-10 Labeled

25 Diagonal prior on => independent latent variables Degree of smile Different dimensions of encode interpretable factors of variation Vary 1 Also good feature representation that can be computed using q ɸ ( )! Vary 2 Head pose : Generating Data 3232 CIFAR-10 Labeled Faces in the Wild Figures (L) from Dirk Kingma et al. 2016; (R) from Anders Larsen et al

26 Probabilistic spin to traditional autoencoders => allows generating data Pros: - Principled approach to generative models - Allows inference of q( ), can be useful feature representation for other tasks Cons: - Maimies lower bound of likelihood: okay - Samples blurrier and lower quality compared to state-of-the-art (GANs) Active areas of research: - More fleible approimations, e.g. richer approimate posterior instead of diagonal Gaussian - Incorporating structure in latent variables Tools Python library for Bernoulli Restricted Boltmann Machines: sklearn.neural_network.bernoullirbm Python Keras for Variational auto-encoder Generative models (including RBM and VAE): Variational Auto-encoder: Tutorials: Codes: 26

27 References and Resources Goodfellow, Ian, et al. Deep learning. Vol. 1. Cambridge: MIT press, Diederik P Kingma, Ma Welling: Auto-Encoding Variational Bayes. ICLR 2014 Restricted Boltmann machines Tutorial - Chris Maddison: Geoffrey E. Hinton: Training products of eperts by minimiing contrastive divergence. Neural Computation (2002) Generative Models Lecture (Stanford University): Restricted Boltmann Machines (CMU): Hugo Larochelle s class on Neural Networks: Image Generation (ICLR 2018): THANK YOU!!! Hung Kim Chau University of Pittsburgh School of Computing and Information Department of Informatics and Networked Systems 135 N Bellefield Ave, Pittsburgh, PA, Phone: +1 (412)

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed