Lecture 16 Deep Neural Generative Models


CMSC 35246: Deep Learning
Shubhendu Trivedi & Risi Kondor
University of Chicago
May 22, 2017

Approach so far: We have considered simple models and then constructed their deep, non-linear variants.
Example: PCA (and the linear autoencoder) to non-linear PCA (non-linear, possibly deep, autoencoders).
Example: Sparse coding (a sparse autoencoder with linear decoding) to deep sparse autoencoders.
All the models we have considered so far are completely deterministic: the encoders and decoders have no stochasticity.
We don't construct a probabilistic model of the data, so we can't sample from the model.

Representations (Figure: Ruslan Salakhutdinov)

To motivate deep neural generative models, let's, as before, seek inspiration from simple linear models first.

Linear Factor Model
We want to build a probabilistic model of the input, P(x).
Like before, we are interested in latent factors h that explain x.
We then care about the marginal: P(x) = E_h[P(x | h)].
h is a representation of the data.

Linear Factor Model
The latent factors h are an encoding of the data.
Simplest decoding model: obtain x from a linear transformation of h, plus some noise.
Formally: suppose we sample the latent factors from a distribution h ~ P(h).
Then: x = W h + b + ε.
P(h) is a factorial distribution (the figure shows the directed model with latent units h_1, h_2, h_3 and observed units x_1, ..., x_5).
How do we learn in such a model? Let's look at a simple example.

Probabilistic PCA
Suppose the underlying latent factor has a Gaussian distribution: h ~ N(h; 0, I).
For the noise model, assume ε ~ N(0, σ^2 I).
Then: P(x | h) = N(x; W h + b, σ^2 I).
We care about the marginal P(x) (the predictive distribution): P(x) = N(x; b, W W^T + σ^2 I).
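
As a concrete illustration, here is a minimal NumPy sketch of ancestral sampling from this generative model (the dimensions d, k, n and the random seed are arbitrary choices, not from the lecture); for large n the sample covariance of x approaches W W^T + σ^2 I.

```python
# Minimal sketch: sample h ~ N(0, I), eps ~ N(0, sigma^2 I), x = W h + b + eps,
# then check the marginal covariance W W^T + sigma^2 I empirically.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 2, 100_000          # observed dim, latent dim, number of samples (illustrative)
W = rng.normal(size=(d, k))      # factor loadings
b = rng.normal(size=d)           # offset
sigma = 0.1                      # isotropic noise standard deviation

h = rng.normal(size=(n, k))                   # h ~ N(0, I)
eps = sigma * rng.normal(size=(n, d))         # eps ~ N(0, sigma^2 I)
x = h @ W.T + b + eps                         # x = W h + b + eps

emp_cov = np.cov(x, rowvar=False)             # sample covariance of x
model_cov = W @ W.T + sigma**2 * np.eye(d)    # predicted marginal covariance
print(np.max(np.abs(emp_cov - model_cov)))    # small for large n
```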

Probabilistic PCA: ML Estimation
P(x) = N(x; b, W W^T + σ^2 I). How do we learn the parameters? (EM, ML estimation)
Let's look at ML estimation. Let C = W W^T + σ^2 I.
We want to maximize l(θ; X) = Σ_i log P(x_i | θ). Up to an additive constant,
l(θ; X) = Σ_i log P(x_i | θ)
        = -(N/2) log |C| - (1/2) Σ_i (x_i - b)^T C^{-1} (x_i - b)
        = -(N/2) log |C| - (1/2) Tr[C^{-1} Σ_i (x_i - b)(x_i - b)^T]
        = -(N/2) log |C| - (1/2) Tr[C^{-1} S]
Now fit the parameters θ = (W, b, σ) to maximize the log-likelihood.
Can also use EM.
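
The log-likelihood above is straightforward to evaluate numerically. Below is a minimal NumPy sketch (the function name and argument conventions are my own, not from the lecture) that computes l(θ; X), with the constant term that the slide drops included explicitly.

```python
# Minimal sketch: probabilistic PCA log-likelihood with C = W W^T + sigma^2 I
# and S = sum_i (x_i - b)(x_i - b)^T, as in the derivation above.
import numpy as np

def ppca_log_likelihood(X, W, b, sigma):
    """X: (N, d) data, W: (d, k) loadings, b: (d,) offset, sigma: noise std."""
    N, d = X.shape
    C = W @ W.T + sigma**2 * np.eye(d)
    Xc = X - b
    S = Xc.T @ Xc                              # scatter matrix sum_i (x_i - b)(x_i - b)^T
    _, logdetC = np.linalg.slogdet(C)          # log |C|, computed stably
    quad = np.trace(np.linalg.solve(C, S))     # Tr[C^{-1} S]
    const = -0.5 * N * d * np.log(2 * np.pi)   # constant dropped on the slide
    return const - 0.5 * N * logdetC - 0.5 * quad
```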

Factor Analysis
Fix the latent factor prior to be the unit Gaussian as before: h ~ N(h; 0, I).
Noise is sampled from a Gaussian with a diagonal covariance: Ψ = diag([σ_1^2, σ_2^2, ..., σ_d^2]).
Still a linear relationship between the latent factors and the observed variables; the marginal is P(x) = N(x; b, W W^T + Ψ).

Factor Analysis
Deriving the posterior P(h | x) = N(h; µ, Λ), we get:
µ = W^T (W W^T + Ψ)^{-1} (x - b)
Λ = I - W^T (W W^T + Ψ)^{-1} W
The parameters are coupled, which makes ML estimation difficult.
Need to employ EM (or non-linear optimization).
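
The posterior formulas translate directly into code. A minimal NumPy sketch follows (names are my own; Ψ is passed as the vector of diagonal noise variances).

```python
# Minimal sketch: posterior P(h | x) = N(h; mu, Lambda) for factor analysis.
import numpy as np

def fa_posterior(x, W, b, Psi):
    """x: (d,) observation, W: (d, k) loadings, b: (d,) offset, Psi: (d,) noise variances."""
    C = W @ W.T + np.diag(Psi)                               # marginal covariance W W^T + Psi
    mu = W.T @ np.linalg.solve(C, x - b)                     # W^T (W W^T + Psi)^{-1} (x - b)
    Lam = np.eye(W.shape[1]) - W.T @ np.linalg.solve(C, W)   # I - W^T (W W^T + Psi)^{-1} W
    return mu, Lam
```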

More General Models
Suppose P(h) cannot be assumed to have a nice Gaussian form.
The decoding of the input from the latent states can be a complicated non-linear function.
Estimation and inference can get complicated!

Earlier we had the directed model with latent units h_1, h_2, h_3 and observed units x_1, ..., x_5.

Quick Review
Generative models can be modeled as directed graphical models.
The nodes represent random variables and the arcs indicate dependencies.
Some of the random variables are observed, others are hidden.

Sigmoid Belief Networks
(Figure: a network with layers x, h^1, h^2, h^3.) Just like a feedforward network, but with the arrows reversed.

Sigmoid Belief Networks
Let x = h^0. Consider binary activations; then:
P(h^k_i = 1 | h^{k+1}) = sigm(b^k_i + Σ_j W^{k+1}_{i,j} h^{k+1}_j)
The joint probability factorizes as:
P(x, h^1, ..., h^l) = P(h^l) ( ∏_{k=1}^{l-1} P(h^k | h^{k+1}) ) P(x | h^1)
Marginalization yields P(x); this is intractable in practice except for very small models.
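
Since the joint factorizes top-down, sampling is by ancestral sampling: sample the top layer from its prior and propagate down through the sigmoid conditionals. A minimal NumPy sketch for binary units follows (the data structures and names are my own).

```python
# Minimal sketch: ancestral sampling from a sigmoid belief network with binary units.
# weights[k] maps layer h^{k+1} to layer h^k; layer 0 is the visible layer x.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_sbn(weights, biases, top_prior, rng):
    """weights[k]: (n_k, n_{k+1}), biases[k]: (n_k,), top_prior: Bernoulli means of h^l."""
    h = (rng.random(top_prior.shape) < top_prior).astype(float)   # h^l ~ P(h^l)
    for W, b in zip(reversed(weights), reversed(biases)):
        p = sigmoid(b + W @ h)                                    # P(h^k_i = 1 | h^{k+1})
        h = (rng.random(p.shape) < p).astype(float)               # sample layer k in parallel
    return h                                                      # the last layer is x = h^0
```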

Sigmoid Belief Networks
P(x, h^1, ..., h^l) = P(h^l) ( ∏_{k=1}^{l-1} P(h^k | h^{k+1}) ) P(x | h^1)
The top-level prior is chosen to be factorizable: P(h^l) = ∏_i P(h^l_i).
A single (Bernoulli) parameter is needed for each h^l_i in the case of binary units.
Deep Belief Networks are like Sigmoid Belief Networks, except for the top two layers.

Sigmoid Belief Networks
The general-case models are called Helmholtz Machines.
Two key references:
G. E. Hinton, P. Dayan, B. J. Frey, R. M. Neal: The Wake-Sleep Algorithm for Unsupervised Neural Networks. Science, 1995.
R. M. Neal: Connectionist Learning of Belief Networks. Artificial Intelligence, 1992.

Deep Belief Networks
(Figure: a network with layers x, h^1, h^2, h^3.) The top two layers now have undirected edges.

Deep Belief Networks
The joint probability changes to:
P(x, h^1, ..., h^l) = P(h^l, h^{l-1}) ( ∏_{k=1}^{l-2} P(h^k | h^{k+1}) ) P(x | h^1)

Deep Belief Networks
The top two layers are a Restricted Boltzmann Machine (RBM).
An RBM has the joint distribution:
P(h^{l+1}, h^l) ∝ exp(b^T h^{l+1} + c^T h^l + (h^l)^T W h^{l+1})
We will return to RBMs and training procedures in a while, but first we look at the mathematical machinery that will make our task easier.

Energy Based Models
Energy-based models assign a scalar energy to every configuration of the variables under consideration.
Learning: change the energy function so that its final shape has some desirable properties.
We can define a probability distribution through an energy:
P(x) = exp(-Energy(x)) / Z
Energies are in the log-probability domain: Energy(x) = -log(Z P(x)).

Energy Based Models
P(x) = exp(-Energy(x)) / Z
Z is a normalizing factor called the Partition Function:
Z = Σ_x exp(-Energy(x))
How do we specify the energy function?
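
For intuition, here is a minimal sketch (the toy energy function and the 4-unit binary space are my own choices, not from the lecture) that turns an energy into a normalized distribution by brute-force enumeration; this is exactly the computation that becomes infeasible for realistic models.

```python
# Minimal sketch: P(x) = exp(-Energy(x)) / Z over a small binary space, by enumeration.
import itertools
import numpy as np

def energy(x):                         # toy energy (assumption): neighbouring units like to agree
    x = np.asarray(x, dtype=float)
    return -np.sum(x[:-1] * x[1:])

states = list(itertools.product([0, 1], repeat=4))
energies = np.array([energy(s) for s in states])
Z = np.sum(np.exp(-energies))          # partition function: sum over all configurations
P = np.exp(-energies) / Z              # probability of every configuration
print(P.sum())                         # 1.0 up to floating point error
```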

Product of Experts Formulation
In this formulation, the energy function is: Energy(x) = Σ_i f_i(x)
Therefore: P(x) = exp(-Σ_i f_i(x)) / Z
We have a product of experts: P(x) ∝ ∏_i P_i(x) ∝ ∏_i exp(-f_i(x))

Product of Experts Formulation
P(x) ∝ ∏_i P_i(x) ∝ ∏_i exp(-f_i(x))
Every expert f_i can be seen as enforcing a constraint on x.
If f_i(x) is large ⟹ P_i(x) is small, i.e. the expert thinks x is implausible (constraint violated).
If f_i(x) is small ⟹ P_i(x) is large, i.e. the expert thinks x is plausible (constraint satisfied).
Contrast this with mixture models.

Latent Variables
x is observed; let's say h are hidden factors that explain x.
The probability then becomes: P(x, h) = exp(-Energy(x, h)) / Z
We only care about the marginal: P(x) = Σ_h exp(-Energy(x, h)) / Z

Latent Variables
P(x) = Σ_h exp(-Energy(x, h)) / Z
We introduce another term, in analogy with statistical physics, the free energy:
P(x) = exp(-FreeEnergy(x)) / Z
Free energy is just a marginalization of energies in the log-domain:
FreeEnergy(x) = -log Σ_h exp(-Energy(x, h))

Latent Variables
P(x) = exp(-FreeEnergy(x)) / Z
Likewise, the partition function: Z = Σ_x exp(-FreeEnergy(x))
We have an expression for P(x) (and hence for the data log-likelihood). Let us see what the gradient looks like.

Data Log-Likelihood Gradient
P(x) = exp(-FreeEnergy(x)) / Z
The gradient follows directly from the above:
∂ log P(x)/∂θ = -∂ FreeEnergy(x)/∂θ + (1/Z) Σ_x̃ exp(-FreeEnergy(x̃)) ∂ FreeEnergy(x̃)/∂θ
Note that P(x̃) = exp(-FreeEnergy(x̃)) / Z, so the second term is an expectation under the model distribution.

Data Log-Likelihood Gradient
The expected log-likelihood gradient over the training set has the following form:
E_P̂[∂ log P(x)/∂θ] = -E_P̂[∂ FreeEnergy(x)/∂θ] + E_P[∂ FreeEnergy(x)/∂θ]
P̂ is the empirical training distribution; its term is easy to compute!
P is the model distribution (exponentially many configurations!); its term is usually very hard to compute!
Resort to Markov Chain Monte Carlo to get a stochastic estimator of the gradient.

A Special Case
Suppose the energy has the following form:
Energy(x, h) = -β(x) + Σ_i γ_i(x, h_i)
Then the free energy, and the numerator of the likelihood, can be computed tractably!
What is P(x)? What is FreeEnergy(x)?

A Special Case
The likelihood term:
P(x) = (exp(β(x)) / Z) ∏_i Σ_{h_i} exp(-γ_i(x, h_i))
The free energy term:
FreeEnergy(x) = -log P(x) - log Z = -β(x) - Σ_i log Σ_{h_i} exp(-γ_i(x, h_i))

Restricted Boltzmann Machines
Recall the form of the energy:
Energy(x, h) = -b^T x - c^T h - h^T W x
This takes the earlier nice form with β(x) = b^T x and γ_i(x, h_i) = -h_i(c_i + W_i x).
Originally proposed by Smolensky (1986), who called them Harmoniums, as a special case of Boltzmann Machines.

Restricted Boltzmann Machines
As seen before, the free energy can be computed efficiently:
FreeEnergy(x) = -b^T x - Σ_i log Σ_{h_i} exp(h_i(c_i + W_i x))
The conditional probability:
P(h | x) = exp(b^T x + c^T h + h^T W x) / Σ_h̃ exp(b^T x + c^T h̃ + h̃^T W x) = ∏_i P(h_i | x)
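
For binary hidden units the inner sum is over h_i ∈ {0, 1}, so Σ_{h_i} exp(h_i(c_i + W_i x)) = 1 + exp(c_i + W_i x) and the free energy has a closed form. A minimal NumPy sketch follows (names are my own).

```python
# Minimal sketch: RBM free energy for binary hidden units,
# FreeEnergy(x) = -b^T x - sum_i log(1 + exp(c_i + W_i x)).
import numpy as np

def rbm_free_energy(x, W, b, c):
    """x: (d,) visible vector, W: (n_hidden, d), b: (d,) visible bias, c: (n_hidden,) hidden bias."""
    pre = c + W @ x
    return -b @ x - np.sum(np.logaddexp(0.0, pre))   # logaddexp(0, a) = log(1 + exp(a)), stable
```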

Restricted Boltzmann Machines
x and h play symmetric roles: P(x | h) = ∏_j P(x_j | h)
The common transfer function (for the binary case):
P(h_i = 1 | x) = σ(c_i + W_i x)
P(x_j = 1 | h) = σ(b_j + W_{:,j}^T h)
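
These factorized conditionals are all that is needed for inference in either direction. A minimal NumPy sketch (names are my own) is below.

```python
# Minimal sketch: factorized RBM conditionals for binary units.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(x, W, c):
    return sigmoid(c + W @ x)          # P(h_i = 1 | x) = sigma(c_i + W_i x), all i at once

def p_x_given_h(h, W, b):
    return sigmoid(b + W.T @ h)        # P(x_j = 1 | h) = sigma(b_j + W_{:,j}^T h), all j at once
```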

Approximate Learning and Gibbs Sampling
E_P̂[∂ log P(x)/∂θ] = -E_P̂[∂ FreeEnergy(x)/∂θ] + E_P[∂ FreeEnergy(x)/∂θ]
We saw the expression for the free energy of an RBM, but the second term above is intractable. How do we learn in this case?
Replace the average over all possible input configurations by samples.
Run Markov Chain Monte Carlo (Gibbs sampling):
First sample x_1 ~ P̂(x), then h_1 ~ P(h | x_1), then x_2 ~ P(x | h_1), then h_2 ~ P(h | x_2), and so on until x_{k+1}.

Approximate Learning, Alternating Gibbs Sampling
We have already seen: P(x | h) = ∏_j P(x_j | h) and P(h | x) = ∏_i P(h_i | x)
With: P(h_i = 1 | x) = σ(c_i + W_i x) and P(x_j = 1 | h) = σ(b_j + W_{:,j}^T h)
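
Because both conditionals factorize, each Gibbs step updates a whole layer in parallel (block Gibbs). A minimal NumPy sketch of k alternating steps (names and conventions are my own) follows.

```python
# Minimal sketch: k steps of alternating (block) Gibbs sampling in a binary RBM.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_chain(x0, W, b, c, k, rng):
    x = x0.copy()
    for _ in range(k):
        ph = sigmoid(c + W @ x)                        # P(h | x), factorized over h_i
        h = (rng.random(ph.shape) < ph).astype(float)  # sample all hidden units in parallel
        px = sigmoid(b + W.T @ h)                      # P(x | h), factorized over x_j
        x = (rng.random(px.shape) < px).astype(float)  # sample all visible units in parallel
    return x
```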

Training an RBM: The Contrastive Divergence Algorithm
Start with a training example on the visible units.
Update all the hidden units in parallel.
Update all the visible units in parallel to obtain a reconstruction.
Update all the hidden units again.
Update the model parameters.
Aside: it is easy to extend RBMs (and contrastive divergence) to the continuous case.
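
Put together, the steps above give the familiar CD-1 update. Below is a minimal NumPy sketch for a binary RBM (the learning rate, the use of hidden probabilities in the statistics, and the names are my own choices, one common variant among several).

```python
# Minimal sketch: one CD-1 parameter update for a binary RBM, following the steps above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(x, W, b, c, lr, rng):
    ph0 = sigmoid(c + W @ x)                           # hidden probabilities given the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample all hidden units in parallel
    px1 = sigmoid(b + W.T @ h0)                        # visible probabilities (reconstruction)
    x1 = (rng.random(px1.shape) < px1).astype(float)   # sampled reconstruction
    ph1 = sigmoid(c + W @ x1)                          # hidden probabilities again
    W += lr * (np.outer(ph0, x) - np.outer(ph1, x1))   # positive minus negative statistics
    b += lr * (x - x1)
    c += lr * (ph0 - ph1)
    return W, b, c
```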

Boltzmann Machines
A model in which the energy has the form:
Energy(x, h) = -b^T x - c^T h - h^T W x - x^T U x - h^T V h
Originally proposed by Hinton and Sejnowski (1983). Important historically, but very difficult to train (why?).

Back to Deep Belief Networks
(Figure: a DBN with layers x, h^1, h^2, h^3; the top two layers are undirected.)
P(x, h^1, ..., h^l) = P(h^l, h^{l-1}) ( ∏_{k=1}^{l-2} P(h^k | h^{k+1}) ) P(x | h^1)

Greedy Layer-wise Training of DBNs
Reference: G. E. Hinton, S. Osindero and Y.-W. Teh: A Fast Learning Algorithm for Deep Belief Networks. Neural Computation, 2006.
First step: construct an RBM with input x and a hidden layer h; train the RBM.
Stack another layer on top of the RBM to form a new RBM: fix W^1, sample from P(h^1 | x), train W^2 as an RBM.
Continue until k layers.
This implicitly defines P(x) and P(h) (a variational bound justifies the layer-wise training).
The network can then be discriminatively fine-tuned using backpropagation.
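
A minimal sketch of the stacking procedure is below (NumPy; `train_rbm` is a hypothetical placeholder for any RBM trainer, e.g. repeated CD-1 updates as sketched earlier, assumed to return the layer parameters W, b, c).

```python
# Minimal sketch: greedy layer-wise pre-training of a DBN by stacking RBMs.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def greedy_pretrain(X, layer_sizes, train_rbm, rng):
    """X: (n, d) data; layer_sizes: hidden layer sizes from bottom to top."""
    reps, params = X, []
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(reps, n_hidden, rng)       # train this layer as an RBM (placeholder)
        params.append((W, b, c))                       # fix W, then move up one layer
        probs = sigmoid(c + reps @ W.T)                # P(h | x) for every example
        reps = (rng.random(probs.shape) < probs).astype(float)   # sample h ~ P(h | x)
    return params
```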

From last time: it was hard to train deep networks from scratch in 2006!
Deep Autoencoders (2006): G. E. Hinton, R. R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks. Science, 2006.

Semantic Hashing: G. Hinton and R. Salakhutdinov, Semantic Hashing, 2006.

Why does Unsupervised Pre-training work?
Regularization: feature representations that are good for P(x) are good for P(y | x).
Optimization: unsupervised pre-training leads to better regions of the parameter space, i.e. better than random initialization.

Effect of Unsupervised Pre-training (figures)

Important topics we didn't talk about in detail (or at all):
Joint unsupervised training of all layers (the Wake-Sleep algorithm)
Deep Boltzmann Machines
Variational bounds justifying greedy layer-wise training
Conditional RBMs, Multimodal RBMs, Temporal RBMs, etc.

Next Time
Some applications of the methods we just considered
Generative Adversarial Networks
