Reading Group on Deep Learning Session 4 Unsupervised Neural Networks

1 Reading Group on Deep Learning, Session 4: Unsupervised Neural Networks
Jakob Verbeek & Daan Wynen

2 Outline
Autoencoders
(Restricted) Boltzmann Machines
Deep Belief Networks

3 Autoencoders

4 Autoencoders: Autoencoder
Input $x$, code $f(x)$, reconstruction $g(f(x))$.
Losses shown on the slide: $L(f(x), y)$ on the code, and the reconstruction loss $L(x, g(f(x)))$, e.g. MSE.
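
The encode, decode and loss chain on this slide can be sketched in a few lines. This is only an illustration: the linear encoder, the tied decoder weights, and the names `encode`, `decode` and `mse` are assumptions of the sketch, not something the slides specify.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def encode(W, x):
    """Code f(x) = W x (a linear encoder; an assumption of this sketch)."""
    return matvec(W, x)

def decode(W, code):
    """Reconstruction g(f(x)) = W^T f(x) (tied weights; also an assumption)."""
    W_t = [list(col) for col in zip(*W)]
    return matvec(W_t, code)

def mse(x, x_rec):
    """Reconstruction loss L(x, g(f(x))) as mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]          # 2-dimensional code for a 3-dimensional input
x = [1.0, 2.0, 3.0]
code = encode(W, x)            # keeps the first two coordinates
x_rec = decode(W, code)        # the third coordinate cannot be recovered
loss = mse(x, x_rec)
print(code, x_rec, loss)
```

Because the code has fewer dimensions than the input, the part of `x` outside the span of the rows of `W` shows up as reconstruction error.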

9 Autoencoders: Denoising Autoencoder
Input $x$, corrupted input $\hat{x}$, reconstruction $g(f(\hat{x}))$.
Loss: $L(x, g(f(\hat{x})))$, measured against the clean input $x$.
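
A minimal sketch of the denoising objective follows. Masking noise is just one possible corruption process and is an assumption here, as are the names `corrupt` and `denoising_loss`; the encoder `f` and decoder `g` are passed in as plain functions.

```python
import random

def corrupt(x, drop_prob, rng):
    """Masking noise: set each input component to 0 with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]

def denoising_loss(x, f, g, drop_prob, rng):
    """MSE between the clean x and the reconstruction of the corrupted input."""
    x_hat = corrupt(x, drop_prob, rng)
    x_rec = g(f(x_hat))
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)

rng = random.Random(0)
identity = lambda z: list(z)   # stand-in for a trained encoder/decoder
x = [1.0, 2.0, 3.0, 4.0]
loss_clean = denoising_loss(x, identity, identity, 0.0, rng)   # nothing masked
loss_masked = denoising_loss(x, identity, identity, 1.0, rng)  # everything masked
print(loss_clean, loss_masked)
```

The identity map is a perfect autoencoder on clean inputs but cannot undo the corruption, which is why the denoising loss forces the model to learn something about the data.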

12 Autoencoders: Sparse Autoencoder
Objective: $L(x, g(f(x))) + \Omega(f(x))$
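
As an illustration of the penalized objective, the sketch below uses an L1 penalty for $\Omega$. The slide does not say which penalty is meant, so the L1 choice, the weight `lam`, and the function names are assumptions.

```python
def l1_penalty(code, lam):
    """Omega(f(x)) = lam * sum_i |f(x)_i| (L1 sparsity penalty; an assumed choice)."""
    return lam * sum(abs(h) for h in code)

def sparse_objective(x, code, x_rec, lam):
    """L(x, g(f(x))) + Omega(f(x)) with MSE as the reconstruction loss."""
    rec = sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)
    return rec + l1_penalty(code, lam)

x = [1.0, 0.0]
x_rec = [1.0, 0.0]                       # both codes reconstruct x perfectly
dense = sparse_objective(x, [0.5, -0.5, 0.5], x_rec, lam=0.1)
sparse = sparse_objective(x, [0.5, 0.0, 0.0], x_rec, lam=0.1)
print(dense, sparse)
```

With equal reconstruction error, the code with fewer active units gets the lower objective.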

13 (Restricted) Boltzmann Machines

14 (Restricted) Boltzmann Machines: Boltzmann Machine
Binary units $x$
$P(x) = \exp(-E(x)) / Z$
$E(x) = -x^\top U x - b^\top x$

15 (Restricted) Boltzmann Machines: Boltzmann Machine
Binary units $v$, $h$
$P(v, h) = \exp(-E(v, h)) / Z$
$E(v, h) = -v^\top R v - v^\top W h - h^\top S h - b^\top v - c^\top h$

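16 (Restricted) Boltzmann Machines: Restricted Boltzmann Machine
Binary units $v$, $h$
$P(v, h) = \exp(-E(v, h)) / Z$
$E(v, h) = -v^\top R v - v^\top W h - h^\top S h - b^\top v - c^\top h$, where the restriction removes the visible-visible and hidden-hidden terms ($R = 0$, $S = 0$), leaving $E(v, h) = -v^\top W h - b^\top v - c^\top h$.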
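
For a very small RBM the energy and the partition function $Z$ can be evaluated by brute force. The sketch below assumes the restricted energy $E(v, h) = -v^\top W h - b^\top v - c^\top h$, with `b` the visible and `c` the hidden bias as on this slide; the parameter values and function names are made up for illustration.

```python
import itertools
import math

def energy(v, h, W, b, c):
    """E(v, h) = -v^T W h - b^T v - c^T h; W is indexed [visible][hidden]."""
    vWh = sum(v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h)))
    return -vWh - sum(bi * vi for bi, vi in zip(b, v)) - sum(cj * hj for cj, hj in zip(c, h))

def joint_table(W, b, c):
    """Return {(v, h): P(v, h)} by enumerating all binary states (tiny models only)."""
    states = [(v, h)
              for v in itertools.product((0, 1), repeat=len(b))
              for h in itertools.product((0, 1), repeat=len(c))]
    weights = {s: math.exp(-energy(s[0], s[1], W, b, c)) for s in states}
    Z = sum(weights.values())            # partition function
    return {s: w / Z for s, w in weights.items()}

W = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]   # 3 visible units, 2 hidden units
b = [0.1, -0.2, 0.0]
c = [0.3, -0.1]
P = joint_table(W, b, c)
print(len(P), sum(P.values()))
```

The enumeration has $2^{D+N}$ terms, which is why $Z$ is intractable for models of realistic size.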

17 (Restricted) Boltzmann Machines
Slides by Honglak Lee: http://videolectures.net/deeplearning2015_lee_boltzmann_machines

18 Restricted Boltzmann Machines (RBMs): Representation
Undirected bipartite graphical model
$v \in \{0, 1\}^D$: observed (visible) binary variables
$h \in \{0, 1\}^N$: hidden binary variables
Figure: hidden layer (H) with units $h_j$ above visible layer (V) with units $v_i$; hidden unit $h_j$ has weight vector $w_j$.
$E(v, h) = -(h_1 w_1^\top v + h_2 w_2^\top v + h_3 w_3^\top v) - b^\top h - c^\top v$
In these slides $b$ is the hidden bias and $c$ the visible bias.

20 Conditional Probabilities (RBM with binary-valued input data)
Given $v$, all the $h_j$ are conditionally independent. $P(h \mid v)$ can be used as features.
Given $h$, all the $v_i$ are conditionally independent.
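
The conditional independence follows directly from the bipartite energy. A short derivation, assuming the energy $E(v, h) = -\sum_j h_j w_j^\top v - b^\top h - c^\top v$ with $b$ the hidden and $c$ the visible bias; the symbol $\sigma$ for the logistic sigmoid and $w_{ji}$ for the $i$-th component of $w_j$ are notation introduced here:

```latex
\begin{align*}
P(h \mid v) &= \frac{\exp(-E(v,h))}{\sum_{h'} \exp(-E(v,h'))}
  = \prod_j \frac{\exp\big(h_j (w_j^\top v + b_j)\big)}{1 + \exp(w_j^\top v + b_j)}, \\
P(h_j = 1 \mid v) &= \sigma(w_j^\top v + b_j), \qquad \sigma(a) = \frac{1}{1 + e^{-a}}, \\
P(v_i = 1 \mid h) &= \sigma\Big(c_i + \sum_j h_j w_{ji}\Big).
\end{align*}
```

The term $c^\top v$ cancels between numerator and denominator of $P(h \mid v)$, and the sum over $h'$ factorizes over the hidden units because no two hidden units share an energy term.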

21 RBMs with real-valued input data: Representation
Undirected bipartite graphical model
V: observed (visible) real variables
H: hidden binary variables

22 RBMs with real-valued input data
Given $v$, all the $h_j$ are conditionally independent. $P(h \mid v)$ can be used as features.
Given $h$, all the $v_i$ are conditionally independent.

23 Inference
Conditional distribution, $P(v \mid h)$ or $P(h \mid v)$: easy to compute (see previous slides). Due to conditional independence, we can sample the hidden units in parallel given the visible units (and vice versa).
Joint distribution, $P(v, h)$: requires Gibbs sampling (approximate).
Initialize with $v^0$
Sample $h^0$ from $P(h \mid v^0)$
Repeat until convergence (t = 1, ...) {
  Sample $v^t$ from $P(v^t \mid h^{t-1})$
  Sample $h^t$ from $P(h^t \mid v^t)$
}
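
The alternating Gibbs chain can be sketched as follows for a binary RBM. The sketch assumes the sigmoid conditionals of a binary RBM with `b` the hidden and `c` the visible bias, stores the weights as `W[j][i]` (hidden unit j, visible unit i), and runs a fixed number of steps instead of testing for convergence; all function names are illustrative.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sample_h_given_v(v, W, b, rng):
    """Sample every h_j independently with P(h_j = 1 | v) = sigmoid(b_j + w_j . v)."""
    return [1 if rng.random() < sigmoid(b[j] + sum(w * vi for w, vi in zip(W[j], v))) else 0
            for j in range(len(b))]

def sample_v_given_h(h, W, c, rng):
    """Sample every v_i independently with P(v_i = 1 | h) = sigmoid(c_i + sum_j h_j W[j][i])."""
    return [1 if rng.random() < sigmoid(c[i] + sum(W[j][i] * h[j] for j in range(len(h)))) else 0
            for i in range(len(c))]

def gibbs_chain(v0, W, b, c, steps, rng):
    """Run `steps` rounds of alternating Gibbs sampling starting from v0."""
    v = list(v0)
    h = sample_h_given_v(v, W, b, rng)      # h^0 ~ P(h | v^0)
    for _ in range(steps):
        v = sample_v_given_h(h, W, c, rng)  # v^t ~ P(v | h^{t-1})
        h = sample_h_given_v(v, W, b, rng)  # h^t ~ P(h | v^t)
    return v, h

rng = random.Random(0)
W = [[0.5, -0.5, 1.0], [-1.0, 0.5, 0.0]]    # 2 hidden units, 3 visible units
v, h = gibbs_chain([1, 0, 1], W, [0.0, 0.0], [0.0, 0.0, 0.0], steps=10, rng=rng)
print(v, h)
```

Each half-step samples a whole layer at once, which is exactly what the conditional independence makes possible.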

24 Training RBMs
Maximum likelihood training
Objective: log-likelihood of the training data, $\sum_m \log P(v^{(m)})$, where $P(v) = \sum_h P(v, h)$.
Computing the exact gradient is intractable. Typically a sampling-based approximation is used (e.g., contrastive divergence).
Usually optimized via stochastic gradient descent.

25 Training RBMs
Model: how can we find parameters $\theta$ that maximize $P_\theta(v)$?
The gradient has two terms: an expectation under the data distribution (posterior of $h$ given $v$) and an expectation under the model distribution.
We need to compute $P(h \mid v)$ and $P(v, h)$, and the derivative of $E$ w.r.t. the parameters $\{W, b, c\}$.
$P(h \mid v)$: tractable (see previous slides).
$P(v, h)$: intractable. Can be approximated with Gibbs sampling, but this requires lots of iterations.
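
The two expectations can be written out explicitly. This is the standard gradient for an energy-based model with hidden variables, stated here under the assumption $P_\theta(v, h) = \exp(-E(v, h)) / Z$; the primed symbols $v'$, $h'$ are notation introduced for the model samples.

```latex
\begin{align*}
\log P_\theta(v) &= \log \sum_h \exp(-E(v,h)) - \log Z,
  \qquad Z = \sum_{v',h'} \exp(-E(v',h')), \\
\frac{\partial \log P_\theta(v)}{\partial \theta}
  &= -\,\mathbb{E}_{h \sim P(h \mid v)}\!\left[\frac{\partial E(v,h)}{\partial \theta}\right]
     + \mathbb{E}_{(v',h') \sim P(v,h)}\!\left[\frac{\partial E(v',h')}{\partial \theta}\right].
\end{align*}
```

The first expectation uses the tractable posterior at the observed $v$; the second comes from $\log Z$ and ranges over all joint configurations, which is the intractable part.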

26 Contrastive Divergence
Approximation of the log-likelihood gradient:
1. Replace the average over all possible inputs by samples.
2. Run the MCMC chain (Gibbs sampling) for only k steps, starting from the observed example.
Initialize with $v^0 = v$
Sample $h^0$ from $P(h \mid v^0)$
For t = 1, ..., k {
  Sample $v^t$ from $P(v^t \mid h^{t-1})$
  Sample $h^t$ from $P(h^t \mid v^t)$
}

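27 Maximum likelihood learning for RBM
Figure: alternating Gibbs chain between hidden units $j$ and visible units $i$ at t = 0, t = 1, t = 2, ..., t = infinity (equilibrium distribution), with the statistics $\langle v_i h_j \rangle^0$ at the data and $\langle v_i h_j \rangle^\infty$ at equilibrium.
Slide credit: Geoff Hinton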

28 Contrastive divergence to learn RBM
Figure: t = 0 (data) with statistics $\langle v_i h_j \rangle^0$, and t = 1 (reconstruction) with statistics $\langle v_i h_j \rangle^1$.
Start with a training vector on the visible units.
Update all the hidden units in parallel.
Update all the visible units in parallel to get a "reconstruction".
Update the hidden units again.
Slide credit: Geoff Hinton

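29 Update rule: Putting together
Training via stochastic gradient.
Note, $\partial E / \partial W_{ij} = -h_i v_j$. Therefore the update for $W_{ij}$ contrasts the statistics at the training example $v$ with those at $v^k$, where $v^k$ is a sample from k-step CD.
A similar update rule can be derived for the biases $b$ and $c$.
Mini-batches (~100 samples) are used to reduce the variance of the gradient estimate.
Implemented in ~10 lines of MATLAB code.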
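
A possible single-example CD-k update is sketched below. It follows the index convention of this slide (`W[i][j]` couples hidden unit i with visible unit j, `b` is the hidden and `c` the visible bias) and uses hidden probabilities rather than sampled hidden states in the statistics, which is a common variance-reducing choice and an assumption here. The function names and the learning rate `lr` are illustrative, and no mini-batching is done.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def hidden_probs(v, W, b):
    """P(h_i = 1 | v) for every hidden unit i."""
    return [sigmoid(b[i] + sum(w * vj for w, vj in zip(W[i], v))) for i in range(len(b))]

def visible_probs(h, W, c):
    """P(v_j = 1 | h) for every visible unit j."""
    return [sigmoid(c[j] + sum(W[i][j] * h[i] for i in range(len(h)))) for j in range(len(c))]

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def cd_k_update(v_data, W, b, c, k, lr, rng):
    """One CD-k step on a single example; updates W, b and c in place."""
    ph_data = hidden_probs(v_data, W, b)
    h = sample(ph_data, rng)
    v_k = list(v_data)
    for _ in range(k):                       # k steps of alternating Gibbs sampling
        v_k = sample(visible_probs(h, W, c), rng)
        h = sample(hidden_probs(v_k, W, b), rng)
    ph_model = hidden_probs(v_k, W, b)
    for i in range(len(b)):
        for j in range(len(c)):
            # data statistics minus statistics at the k-step sample
            W[i][j] += lr * (ph_data[i] * v_data[j] - ph_model[i] * v_k[j])
        b[i] += lr * (ph_data[i] - ph_model[i])
    for j in range(len(c)):
        c[j] += lr * (v_data[j] - v_k[j])
    return v_k

rng = random.Random(0)
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]     # 3 hidden units, 2 visible units
b = [0.0, 0.0, 0.0]
c = [0.0, 0.0]
for _ in range(200):
    v_k = cd_k_update([1, 0], W, b, c, k=1, lr=0.1, rng=rng)
print(c)
```

Training on the single pattern `[1, 0]` can only push the first visible bias up and the second one down, since the data term is fixed and the sample is binary.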

30 Deep Belief Networks

31 Deep Belief Networks (DBNs)
Probabilistic generative model with a deep architecture (multiple layers).
Unsupervised pre-training with RBMs provides a good initialization of the model. It is theoretically justified as maximizing a lower bound on the log-likelihood of the data.
Supervised fine-tuning:
Generative: up-down algorithm [Hinton et al., 2006]
Discriminative: backpropagation (convert to NN)
Related work: Deep Boltzmann Machine (Salakhutdinov and Hinton, 2009)

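32 Deep Belief Networks (DBN)
Figure: visible layer $v$ and hidden layers $h^1$, $h^2$, $h^3$; the top two layers form an RBM, the layers below are directed belief nets.
$P(v, h^1, h^2, \ldots, h^l) = P(v \mid h^1) \, P(h^1 \mid h^2) \cdots P(h^{l-2} \mid h^{l-1}) \, P(h^{l-1}, h^l)$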

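33 DBN structure
Figure: layers $v$, $h^1$, $h^2$, $h^3$ with weights $W^1$, $W^2$, $W^3$.
Generative process: $P(h^2, h^3)$, $P(h^1 \mid h^2)$, $P(v \mid h^1)$.
(Approximate) inference: $Q(h^1 \mid v)$, $Q(h^2 \mid h^1)$, $Q(h^3 \mid h^2) = P(h^3 \mid h^2)$.
$P(v, h^1, h^2, \ldots, h^l) = P(v \mid h^1) \, P(h^1 \mid h^2) \cdots P(h^{l-2} \mid h^{l-1}) \, P(h^{l-1}, h^l)$
$P(h^i_j = 1 \mid h^{i+1}) = \mathrm{sigm}\big(b^i_j + (W^{i+1}_{\cdot j})' h^{i+1}\big)$
$Q(h^{i+1}_j = 1 \mid h^i) = \mathrm{sigm}\big(b^{i+1}_j + W^{i+1}_{j \cdot} h^i\big)$, with $h^0 = v$
Hinton et al., 2006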

34 DBN Greedy training: first step
Construct an RBM with an input layer $v$ and a hidden layer $h$.
Train the RBM.
Hinton et al., 2006

35 DBN Greedy training: second step
Stack another hidden layer on top of the RBM to form a new RBM.
Fix $W^1$, sample $h^1$ from $Q(h^1 \mid v)$ as input. Train $W^2$ as an RBM.
Hinton et al., 2006

36 DBN Greedy training: third step
Continue to stack layers on top of the network and train each as in the previous step, with samples $h^2$ drawn from $Q(h^2 \mid h^1)$.
And so on.
Hinton et al., 2006
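
The three steps can be sketched as one loop over layers. This is a toy illustration, not the procedure of Hinton et al. in detail: each RBM is trained with single-example CD-1, the hyperparameters are arbitrary, and the names `train_rbm` and `greedy_pretrain` are made up for this sketch. Weights are stored as `W[i][j]` (hidden unit i, input unit j), with `b` the hidden and `c` the input-side bias.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def up_probs(x, W, b):
    """Q(h_i = 1 | x) for every hidden unit i."""
    return [sigmoid(b[i] + sum(w * xj for w, xj in zip(W[i], x))) for i in range(len(b))]

def down_probs(h, W, c):
    """P(x_j = 1 | h) for every unit j of the layer below."""
    return [sigmoid(c[j] + sum(W[i][j] * h[i] for i in range(len(h)))) for j in range(len(c))]

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def train_rbm(data, n_hidden, epochs, lr, rng):
    """Train one RBM with CD-1 and return its parameters (W, b, c)."""
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_visible)] for _ in range(n_hidden)]
    b, c = [0.0] * n_hidden, [0.0] * n_visible
    for _ in range(epochs):
        for v0 in data:
            p0 = up_probs(v0, W, b)
            v1 = sample(down_probs(sample(p0, rng), W, c), rng)
            p1 = up_probs(v1, W, b)
            for i in range(n_hidden):
                for j in range(n_visible):
                    W[i][j] += lr * (p0[i] * v0[j] - p1[i] * v1[j])
                b[i] += lr * (p0[i] - p1[i])
            for j in range(n_visible):
                c[j] += lr * (v0[j] - v1[j])
    return W, b, c

def greedy_pretrain(data, layer_sizes, epochs, lr, rng):
    """Train a stack of RBMs bottom-up; returns a list of (W, b, c), one per layer."""
    layers, inputs = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(inputs, n_hidden, epochs, lr, rng)
        layers.append((W, b, c))
        # Freeze this layer; its sampled hidden states are the next layer's data.
        inputs = [sample(up_probs(x, W, b), rng) for x in inputs]
    return layers

data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1]]
layers = greedy_pretrain(data, layer_sizes=[4, 3], epochs=5, lr=0.1, rng=random.Random(0))
print([(len(W), len(W[0])) for W, b, c in layers])
```

Each layer only ever sees the output of the frozen layers below it, which is what makes the procedure greedy.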

37 DBN and supervised fine-tuning
Discriminative fine-tuning: initialize a neural network with the pre-trained weights, then train with backpropagation. Maximizes $\log P(Y \mid X)$ (X: data, Y: label).
Generative fine-tuning: up-down algorithm. Maximizes $\log P(Y, X)$ (joint likelihood of data and labels).
Hinton et al. used supervised + generative fine-tuning in their Neural Computation paper. However, it is possible to use unsupervised + generative fine-tuning as well.

38 A model for digit recognition
Architecture (from the figure): 28 x 28 pixel image, 500 neurons, 500 neurons, 2000 top-level neurons, 10 label neurons.
The top two layers form an associative memory. The energy valleys have names.
The model learns to generate combinations of labels and images.
To perform recognition we start with a neutral state of the label units and do an up-pass from the image, followed by a few iterations of the top-level associative memory.
http://
Slide credit: Geoff Hinton

39 Generative fine-tuning via Up-down algorithm
After pre-training many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass. Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM. Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass. Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
http://
Slide credit: Geoff Hinton

40 Generating sample from a DBN
Want to sample from $P(v, h^1, h^2, \ldots, h^l) = P(v \mid h^1) \, P(h^1 \mid h^2) \cdots P(h^{l-2} \mid h^{l-1}) \, P(h^{l-1}, h^l)$
Sample $h^{l-1}$ using Gibbs sampling in the RBM.
Sample the lower layer $h^{i-1}$ from $P(h^{i-1} \mid h^i)$.
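
The two sampling stages can be sketched as follows. The sketch assumes each layer is stored as a tuple `(W, b, c)` with `W[i][j]` coupling upper unit i to lower unit j, `b` the upper-layer bias and `c` the lower-layer bias, and it reuses the RBM weights as top-down generative weights. The function name `sample_dbn`, the random start state and the fixed number of Gibbs steps are assumptions.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def up_probs(x, W, b):
    return [sigmoid(b[i] + sum(w * xj for w, xj in zip(W[i], x))) for i in range(len(b))]

def down_probs(h, W, c):
    return [sigmoid(c[j] + sum(W[i][j] * h[i] for i in range(len(h)))) for j in range(len(c))]

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def sample_dbn(layers, gibbs_steps, rng):
    """Draw one visible vector: Gibbs sampling in the top RBM, then a top-down pass."""
    W_top, b_top, c_top = layers[-1]
    x = [rng.randint(0, 1) for _ in c_top]        # random start for h^{l-1}
    for _ in range(gibbs_steps):                  # Gibbs sampling in the top-level RBM
        h = sample(up_probs(x, W_top, b_top), rng)
        x = sample(down_probs(h, W_top, c_top), rng)
    for W, b, c in reversed(layers[:-1]):         # directed layers: sample layer below
        x = sample(down_probs(x, W, c), rng)
    return x

# Toy two-layer model with zero weights; the lower biases fix the visible pattern.
layers = [
    ([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0], [100.0, -100.0, 100.0]),
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [0.0, 0.0]),
]
v = sample_dbn(layers, gibbs_steps=20, rng=random.Random(0))
print(v)
```

Only the top RBM needs an iterative sampler; every layer below it is sampled once, in order, from its parent.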

41 Generating samples from DBN
Hinton et al., A Fast Learning Algorithm for Deep Belief Nets, 2006

42 Stacking of RBMs as Deep Neural Networks

43 Using Stacks of RBMs as Neural Networks
The feedforward (approximate) inference of the DBN looks the same as that of a sigmoid deep neural network.
Idea: use the DBN as an initialization of the deep neural network, and then do fine-tuning with back-propagation.
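
A sketch of that feedforward pass: each layer applies a sigmoid to an affine map, which is the same computation as the mean-field up-pass $Q(h^{i+1} \mid h^i)$ of a DBN. The layer format `(W, b)` and the function name `forward` are assumptions, and the zero weights only stand in for pre-trained values.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, layers):
    """Deterministic up-pass: every layer computes sigmoid(W x + b)."""
    for W, b in layers:
        x = [sigmoid(b_i + sum(w * x_j for w, x_j in zip(row, x)))
             for row, b_i in zip(W, b)]
    return x

layers = [
    ([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0]),  # 3 inputs -> 2 hidden units
    ([[0.0, 0.0]], [0.0]),                              # 2 hidden -> 1 output unit
]
out = forward([1.0, 0.0, 1.0], layers)
print(out)
```

For fine-tuning, the `(W, b)` pairs would be copied from the pre-trained RBM stack and then updated by backpropagation, typically after adding a task-specific output layer.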

44 DBN for classification
Slide credit: Russ Salakhutdinov

45 Deep Belief Networks: Deep Boltzmann Machine
Figure: from http://www.deeplearningbook.
Specialization of the Boltzmann Machine.

46 Deep Belief Networks: Deep Boltzmann Machine
Also a specialization of the Restricted Boltzmann Machine!

47 Deep Belief Networks: DBM Properties
Allows for up-down interactions between layers.
Compared to DBNs, $Q(h \mid v)$ can be tighter to $P(h \mid v)$.
Also makes inference and training harder: MCMC across all layers...
