Lecture 16 Deep Neural Generative Models


CMSC 35246: Deep Learning
Shubhendu Trivedi & Risi Kondor
University of Chicago
May 22, 2017

Approach so far: We have considered simple models and then constructed their deep, non-linear variants.
Example: PCA (and the linear autoencoder) to non-linear PCA (non-linear, possibly deep, autoencoders).
Example: Sparse coding (a sparse autoencoder with linear decoding) to deep sparse autoencoders.
All the models we have considered so far are completely deterministic: the encoders and decoders have no stochasticity.
We don't construct a probabilistic model of the data, so we can't sample from the model.

Representations (Figure: Ruslan Salakhutdinov)

To motivate deep neural generative models, let's, as before, seek inspiration from simple linear models first.

Linear Factor Model
We want to build a probabilistic model of the input, P(x).
Like before, we are interested in latent factors h that explain x.
We then care about the marginal: P(x) = E_h[P(x | h)].
h is a representation of the data.

Linear Factor Model
The latent factors h are an encoding of the data.
Simplest decoding model: obtain x from a linear transformation of h, plus some noise.
Formally: suppose we sample the latent factors from a distribution h ~ P(h).
Then: x = W h + b + ε.
P(h) is a factorial distribution (the figure shows the directed model with latent units h_1, h_2, h_3 and observed units x_1, ..., x_5).
How do we learn in such a model? Let's look at a simple example.

Probabilistic PCA
Suppose the underlying latent factor has a Gaussian distribution: h ~ N(h; 0, I).
For the noise model, assume ε ~ N(0, σ^2 I).
Then: P(x | h) = N(x; W h + b, σ^2 I).
We care about the marginal P(x) (the predictive distribution): P(x) = N(x; b, W W^T + σ^2 I).
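
As a concrete illustration, here is a minimal NumPy sketch of ancestral sampling from this generative model (the dimensions d, k, n and the random seed are arbitrary choices, not from the lecture); for large n the sample covariance of x approaches W W^T + σ^2 I.

```python
# Minimal sketch: sample h ~ N(0, I), eps ~ N(0, sigma^2 I), x = W h + b + eps,
# then check the marginal covariance W W^T + sigma^2 I empirically.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 2, 100_000          # observed dim, latent dim, number of samples (illustrative)
W = rng.normal(size=(d, k))      # factor loadings
b = rng.normal(size=d)           # offset
sigma = 0.1                      # isotropic noise standard deviation

h = rng.normal(size=(n, k))                   # h ~ N(0, I)
eps = sigma * rng.normal(size=(n, d))         # eps ~ N(0, sigma^2 I)
x = h @ W.T + b + eps                         # x = W h + b + eps

emp_cov = np.cov(x, rowvar=False)             # sample covariance of x
model_cov = W @ W.T + sigma**2 * np.eye(d)    # predicted marginal covariance
print(np.max(np.abs(emp_cov - model_cov)))    # small for large n
```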

Probabilistic PCA: ML Estimation
P(x) = N(x; b, W W^T + σ^2 I). How do we learn the parameters? (EM, ML estimation)
Let's look at ML estimation. Let C = W W^T + σ^2 I.
We want to maximize l(θ; X) = Σ_i log P(x_i | θ). Up to an additive constant,
l(θ; X) = Σ_i log P(x_i | θ)
        = -(N/2) log |C| - (1/2) Σ_i (x_i - b)^T C^{-1} (x_i - b)
        = -(N/2) log |C| - (1/2) Tr[C^{-1} Σ_i (x_i - b)(x_i - b)^T]
        = -(N/2) log |C| - (1/2) Tr[C^{-1} S]
Now fit the parameters θ = (W, b, σ) to maximize the log-likelihood.
Can also use EM.
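
The log-likelihood above is straightforward to evaluate numerically. Below is a minimal NumPy sketch (the function name and argument conventions are my own, not from the lecture) that computes l(θ; X), with the constant term that the slide drops included explicitly.

```python
# Minimal sketch: probabilistic PCA log-likelihood with C = W W^T + sigma^2 I
# and S = sum_i (x_i - b)(x_i - b)^T, as in the derivation above.
import numpy as np

def ppca_log_likelihood(X, W, b, sigma):
    """X: (N, d) data, W: (d, k) loadings, b: (d,) offset, sigma: noise std."""
    N, d = X.shape
    C = W @ W.T + sigma**2 * np.eye(d)
    Xc = X - b
    S = Xc.T @ Xc                              # scatter matrix sum_i (x_i - b)(x_i - b)^T
    _, logdetC = np.linalg.slogdet(C)          # log |C|, computed stably
    quad = np.trace(np.linalg.solve(C, S))     # Tr[C^{-1} S]
    const = -0.5 * N * d * np.log(2 * np.pi)   # constant dropped on the slide
    return const - 0.5 * N * logdetC - 0.5 * quad
```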

Factor Analysis
Fix the latent factor prior to be the unit Gaussian as before: h ~ N(h; 0, I).
Noise is sampled from a Gaussian with a diagonal covariance: Ψ = diag([σ_1^2, σ_2^2, ..., σ_d^2]).
Still a linear relationship between the latent factors and the observed variables; the marginal is P(x) = N(x; b, W W^T + Ψ).

Factor Analysis
Deriving the posterior P(h | x) = N(h; µ, Λ), we get:
µ = W^T (W W^T + Ψ)^{-1} (x - b)
Λ = I - W^T (W W^T + Ψ)^{-1} W
The parameters are coupled, which makes ML estimation difficult.
Need to employ EM (or non-linear optimization).
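
The posterior formulas translate directly into code. A minimal NumPy sketch follows (names are my own; Ψ is passed as the vector of diagonal noise variances).

```python
# Minimal sketch: posterior P(h | x) = N(h; mu, Lambda) for factor analysis.
import numpy as np

def fa_posterior(x, W, b, Psi):
    """x: (d,) observation, W: (d, k) loadings, b: (d,) offset, Psi: (d,) noise variances."""
    C = W @ W.T + np.diag(Psi)                               # marginal covariance W W^T + Psi
    mu = W.T @ np.linalg.solve(C, x - b)                     # W^T (W W^T + Psi)^{-1} (x - b)
    Lam = np.eye(W.shape[1]) - W.T @ np.linalg.solve(C, W)   # I - W^T (W W^T + Psi)^{-1} W
    return mu, Lam
```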

More General Models
Suppose P(h) cannot be assumed to have a nice Gaussian form.
The decoding of the input from the latent states can be a complicated non-linear function.
Estimation and inference can get complicated!

Earlier we had the directed model with latent units h_1, h_2, h_3 and observed units x_1, ..., x_5.

Quick Review
Generative models can be modeled as directed graphical models.
The nodes represent random variables and the arcs indicate dependencies.
Some of the random variables are observed, others are hidden.

Sigmoid Belief Networks
(Figure: a network with layers x, h^1, h^2, h^3.) Just like a feedforward network, but with the arrows reversed.

Sigmoid Belief Networks
Let x = h^0. Consider binary activations; then:
P(h^k_i = 1 | h^{k+1}) = sigm(b^k_i + Σ_j W^{k+1}_{i,j} h^{k+1}_j)
The joint probability factorizes as:
P(x, h^1, ..., h^l) = P(h^l) ( ∏_{k=1}^{l-1} P(h^k | h^{k+1}) ) P(x | h^1)
Marginalization yields P(x); this is intractable in practice except for very small models.
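
Since the joint factorizes top-down, sampling is by ancestral sampling: sample the top layer from its prior and propagate down through the sigmoid conditionals. A minimal NumPy sketch for binary units follows (the data structures and names are my own).

```python
# Minimal sketch: ancestral sampling from a sigmoid belief network with binary units.
# weights[k] maps layer h^{k+1} to layer h^k; layer 0 is the visible layer x.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_sbn(weights, biases, top_prior, rng):
    """weights[k]: (n_k, n_{k+1}), biases[k]: (n_k,), top_prior: Bernoulli means of h^l."""
    h = (rng.random(top_prior.shape) < top_prior).astype(float)   # h^l ~ P(h^l)
    for W, b in zip(reversed(weights), reversed(biases)):
        p = sigmoid(b + W @ h)                                    # P(h^k_i = 1 | h^{k+1})
        h = (rng.random(p.shape) < p).astype(float)               # sample layer k in parallel
    return h                                                      # the last layer is x = h^0
```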

Sigmoid Belief Networks
P(x, h^1, ..., h^l) = P(h^l) ( ∏_{k=1}^{l-1} P(h^k | h^{k+1}) ) P(x | h^1)
The top-level prior is chosen to be factorizable: P(h^l) = ∏_i P(h^l_i).
A single (Bernoulli) parameter is needed for each h^l_i in the case of binary units.
Deep Belief Networks are like Sigmoid Belief Networks, except for the top two layers.

Sigmoid Belief Networks
The general-case models are called Helmholtz Machines.
Two key references:
G. E. Hinton, P. Dayan, B. J. Frey, R. M. Neal: The Wake-Sleep Algorithm for Unsupervised Neural Networks. Science, 1995.
R. M. Neal: Connectionist Learning of Belief Networks. Artificial Intelligence, 1992.

Deep Belief Networks
(Figure: a network with layers x, h^1, h^2, h^3.) The top two layers now have undirected edges.

Deep Belief Networks
The joint probability changes to:
P(x, h^1, ..., h^l) = P(h^l, h^{l-1}) ( ∏_{k=1}^{l-2} P(h^k | h^{k+1}) ) P(x | h^1)

Deep Belief Networks
The top two layers are a Restricted Boltzmann Machine (RBM).
An RBM has the joint distribution:
P(h^{l+1}, h^l) ∝ exp(b^T h^{l+1} + c^T h^l + (h^l)^T W h^{l+1})
We will return to RBMs and training procedures in a while, but first we look at the mathematical machinery that will make our task easier.

Energy Based Models
Energy-based models assign a scalar energy to every configuration of the variables under consideration.
Learning: change the energy function so that its final shape has some desirable properties.
We can define a probability distribution through an energy:
P(x) = exp(-Energy(x)) / Z
Energies are in the log-probability domain: Energy(x) = -log(Z P(x)).

Energy Based Models
P(x) = exp(-Energy(x)) / Z
Z is a normalizing factor called the Partition Function:
Z = Σ_x exp(-Energy(x))
How do we specify the energy function?
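
For intuition, here is a minimal sketch (the toy energy function and the 4-unit binary space are my own choices, not from the lecture) that turns an energy into a normalized distribution by brute-force enumeration; this is exactly the computation that becomes infeasible for realistic models.

```python
# Minimal sketch: P(x) = exp(-Energy(x)) / Z over a small binary space, by enumeration.
import itertools
import numpy as np

def energy(x):                         # toy energy (assumption): neighbouring units like to agree
    x = np.asarray(x, dtype=float)
    return -np.sum(x[:-1] * x[1:])

states = list(itertools.product([0, 1], repeat=4))
energies = np.array([energy(s) for s in states])
Z = np.sum(np.exp(-energies))          # partition function: sum over all configurations
P = np.exp(-energies) / Z              # probability of every configuration
print(P.sum())                         # 1.0 up to floating point error
```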

Product of Experts Formulation
In this formulation, the energy function is: Energy(x) = Σ_i f_i(x)
Therefore: P(x) = exp(-Σ_i f_i(x)) / Z
We have a product of experts: P(x) ∝ ∏_i P_i(x) ∝ ∏_i exp(-f_i(x))

Product of Experts Formulation
P(x) ∝ ∏_i P_i(x) ∝ ∏_i exp(-f_i(x))
Every expert f_i can be seen as enforcing a constraint on x.
If f_i(x) is large ⟹ P_i(x) is small, i.e. the expert thinks x is implausible (constraint violated).
If f_i(x) is small ⟹ P_i(x) is large, i.e. the expert thinks x is plausible (constraint satisfied).
Contrast this with mixture models.

Latent Variables
x is observed; let's say h are hidden factors that explain x.
The probability then becomes: P(x, h) = exp(-Energy(x, h)) / Z
We only care about the marginal: P(x) = Σ_h exp(-Energy(x, h)) / Z

Latent Variables
P(x) = Σ_h exp(-Energy(x, h)) / Z
We introduce another term, in analogy with statistical physics, the free energy:
P(x) = exp(-FreeEnergy(x)) / Z
Free energy is just a marginalization of energies in the log-domain:
FreeEnergy(x) = -log Σ_h exp(-Energy(x, h))

Latent Variables
P(x) = exp(-FreeEnergy(x)) / Z
Likewise, the partition function: Z = Σ_x exp(-FreeEnergy(x))
We have an expression for P(x) (and hence for the data log-likelihood). Let us see what the gradient looks like.

Data Log-Likelihood Gradient
P(x) = exp(-FreeEnergy(x)) / Z
The gradient follows directly from the above:
∂ log P(x)/∂θ = -∂ FreeEnergy(x)/∂θ + (1/Z) Σ_x̃ exp(-FreeEnergy(x̃)) ∂ FreeEnergy(x̃)/∂θ
Note that P(x̃) = exp(-FreeEnergy(x̃)) / Z, so the second term is an expectation under the model distribution.

Data Log-Likelihood Gradient
The expected log-likelihood gradient over the training set has the following form:
E_P̂[∂ log P(x)/∂θ] = -E_P̂[∂ FreeEnergy(x)/∂θ] + E_P[∂ FreeEnergy(x)/∂θ]
P̂ is the empirical training distribution; its term is easy to compute!
P is the model distribution (exponentially many configurations!); its term is usually very hard to compute!
Resort to Markov Chain Monte Carlo to get a stochastic estimator of the gradient.

A Special Case
Suppose the energy has the following form:
Energy(x, h) = -β(x) + Σ_i γ_i(x, h_i)
Then the free energy, and the numerator of the likelihood, can be computed tractably!
What is P(x)? What is FreeEnergy(x)?

A Special Case
The likelihood term:
P(x) = (exp(β(x)) / Z) ∏_i Σ_{h_i} exp(-γ_i(x, h_i))
The free energy term:
FreeEnergy(x) = -log P(x) - log Z = -β(x) - Σ_i log Σ_{h_i} exp(-γ_i(x, h_i))

Restricted Boltzmann Machines
Recall the form of the energy:
Energy(x, h) = -b^T x - c^T h - h^T W x
This takes the earlier nice form with β(x) = b^T x and γ_i(x, h_i) = -h_i(c_i + W_i x).
Originally proposed by Smolensky (1986), who called them Harmoniums, as a special case of Boltzmann Machines.

Restricted Boltzmann Machines
As seen before, the free energy can be computed efficiently:
FreeEnergy(x) = -b^T x - Σ_i log Σ_{h_i} exp(h_i(c_i + W_i x))
The conditional probability:
P(h | x) = exp(b^T x + c^T h + h^T W x) / Σ_h̃ exp(b^T x + c^T h̃ + h̃^T W x) = ∏_i P(h_i | x)
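
For binary hidden units the inner sum is over h_i ∈ {0, 1}, so Σ_{h_i} exp(h_i(c_i + W_i x)) = 1 + exp(c_i + W_i x) and the free energy has a closed form. A minimal NumPy sketch follows (names are my own).

```python
# Minimal sketch: RBM free energy for binary hidden units,
# FreeEnergy(x) = -b^T x - sum_i log(1 + exp(c_i + W_i x)).
import numpy as np

def rbm_free_energy(x, W, b, c):
    """x: (d,) visible vector, W: (n_hidden, d), b: (d,) visible bias, c: (n_hidden,) hidden bias."""
    pre = c + W @ x
    return -b @ x - np.sum(np.logaddexp(0.0, pre))   # logaddexp(0, a) = log(1 + exp(a)), stable
```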

Restricted Boltzmann Machines
x and h play symmetric roles: P(x | h) = ∏_j P(x_j | h)
The common transfer function (for the binary case):
P(h_i = 1 | x) = σ(c_i + W_i x)
P(x_j = 1 | h) = σ(b_j + W_{:,j}^T h)
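
These factorized conditionals are all that is needed for inference in either direction. A minimal NumPy sketch (names are my own) is below.

```python
# Minimal sketch: factorized RBM conditionals for binary units.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(x, W, c):
    return sigmoid(c + W @ x)          # P(h_i = 1 | x) = sigma(c_i + W_i x), all i at once

def p_x_given_h(h, W, b):
    return sigmoid(b + W.T @ h)        # P(x_j = 1 | h) = sigma(b_j + W_{:,j}^T h), all j at once
```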

Approximate Learning and Gibbs Sampling
E_P̂[∂ log P(x)/∂θ] = -E_P̂[∂ FreeEnergy(x)/∂θ] + E_P[∂ FreeEnergy(x)/∂θ]
We saw the expression for the free energy of an RBM, but the second term above is intractable. How do we learn in this case?
Replace the average over all possible input configurations by samples.
Run Markov Chain Monte Carlo (Gibbs sampling):
First sample x_1 ~ P̂(x), then h_1 ~ P(h | x_1), then x_2 ~ P(x | h_1), then h_2 ~ P(h | x_2), and so on until x_{k+1}.

Approximate Learning, Alternating Gibbs Sampling
We have already seen: P(x | h) = ∏_j P(x_j | h) and P(h | x) = ∏_i P(h_i | x)
With: P(h_i = 1 | x) = σ(c_i + W_i x) and P(x_j = 1 | h) = σ(b_j + W_{:,j}^T h)
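
Because both conditionals factorize, each Gibbs step updates a whole layer in parallel (block Gibbs). A minimal NumPy sketch of k alternating steps (names and conventions are my own) follows.

```python
# Minimal sketch: k steps of alternating (block) Gibbs sampling in a binary RBM.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_chain(x0, W, b, c, k, rng):
    x = x0.copy()
    for _ in range(k):
        ph = sigmoid(c + W @ x)                        # P(h | x), factorized over h_i
        h = (rng.random(ph.shape) < ph).astype(float)  # sample all hidden units in parallel
        px = sigmoid(b + W.T @ h)                      # P(x | h), factorized over x_j
        x = (rng.random(px.shape) < px).astype(float)  # sample all visible units in parallel
    return x
```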

Training an RBM: The Contrastive Divergence Algorithm
Start with a training example on the visible units.
Update all the hidden units in parallel.
Update all the visible units in parallel to obtain a reconstruction.
Update all the hidden units again.
Update the model parameters.
Aside: it is easy to extend RBMs (and contrastive divergence) to the continuous case.
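
Put together, the steps above give the familiar CD-1 update. Below is a minimal NumPy sketch for a binary RBM (the learning rate, the use of hidden probabilities in the statistics, and the names are my own choices, one common variant among several).

```python
# Minimal sketch: one CD-1 parameter update for a binary RBM, following the steps above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(x, W, b, c, lr, rng):
    ph0 = sigmoid(c + W @ x)                           # hidden probabilities given the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample all hidden units in parallel
    px1 = sigmoid(b + W.T @ h0)                        # visible probabilities (reconstruction)
    x1 = (rng.random(px1.shape) < px1).astype(float)   # sampled reconstruction
    ph1 = sigmoid(c + W @ x1)                          # hidden probabilities again
    W += lr * (np.outer(ph0, x) - np.outer(ph1, x1))   # positive minus negative statistics
    b += lr * (x - x1)
    c += lr * (ph0 - ph1)
    return W, b, c
```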

Boltzmann Machines
A model in which the energy has the form:
Energy(x, h) = -b^T x - c^T h - h^T W x - x^T U x - h^T V h
Originally proposed by Hinton and Sejnowski (1983). Important historically, but very difficult to train (why?).

Back to Deep Belief Networks
(Figure: a DBN with layers x, h^1, h^2, h^3; the top two layers are undirected.)
P(x, h^1, ..., h^l) = P(h^l, h^{l-1}) ( ∏_{k=1}^{l-2} P(h^k | h^{k+1}) ) P(x | h^1)

Greedy Layer-wise Training of DBNs
Reference: G. E. Hinton, S. Osindero and Y.-W. Teh: A Fast Learning Algorithm for Deep Belief Networks. Neural Computation, 2006.
First step: construct an RBM with input x and a hidden layer h; train the RBM.
Stack another layer on top of the RBM to form a new RBM: fix W^1, sample from P(h^1 | x), train W^2 as an RBM.
Continue until k layers.
This implicitly defines P(x) and P(h) (a variational bound justifies the layer-wise training).
The network can then be discriminatively fine-tuned using backpropagation.
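
A minimal sketch of the stacking procedure is below (NumPy; `train_rbm` is a hypothetical placeholder for any RBM trainer, e.g. repeated CD-1 updates as sketched earlier, assumed to return the layer parameters W, b, c).

```python
# Minimal sketch: greedy layer-wise pre-training of a DBN by stacking RBMs.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def greedy_pretrain(X, layer_sizes, train_rbm, rng):
    """X: (n, d) data; layer_sizes: hidden layer sizes from bottom to top."""
    reps, params = X, []
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(reps, n_hidden, rng)       # train this layer as an RBM (placeholder)
        params.append((W, b, c))                       # fix W, then move up one layer
        probs = sigmoid(c + reps @ W.T)                # P(h | x) for every example
        reps = (rng.random(probs.shape) < probs).astype(float)   # sample h ~ P(h | x)
    return params
```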

From last time: it was hard to train deep networks from scratch in 2006!
Deep Autoencoders (2006): G. E. Hinton, R. R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks. Science, 2006.

Semantic Hashing: G. Hinton and R. Salakhutdinov, Semantic Hashing, 2006.

Why does Unsupervised Pre-training work?
Regularization: feature representations that are good for P(x) are good for P(y | x).
Optimization: unsupervised pre-training leads to better regions of the parameter space, i.e. better than random initialization.

Effect of Unsupervised Pre-training (figures)

Important topics we didn't talk about in detail (or at all):
Joint unsupervised training of all layers (the Wake-Sleep algorithm)
Deep Boltzmann Machines
Variational bounds justifying greedy layer-wise training
Conditional RBMs, Multimodal RBMs, Temporal RBMs, etc.

Next Time
Some applications of the methods we just considered
Generative Adversarial Networks
