Chapter 11. Stochastic Methods Rooted in Statistical Mechanics


Chapter 11. Stochastic Methods Rooted in Statistical Mechanics
Neural Networks and Learning Machines (Haykin)
Lecture Notes on Self-learning Neural Algorithms
Byoung-Tak Zhang, School of Computer Science and Engineering, Seoul National University
Version: 20170926 → 20170928 → 20171011

Contents
11.1 Introduction
11.2 Statistical Mechanics
11.3 Markov Chains
11.4 Metropolis Algorithm
11.5 Simulated Annealing
11.6 Gibbs Sampling
11.7 Boltzmann Machine
11.8 Logistic Belief Nets
11.9 Deep Belief Nets
11.10 Deterministic Annealing (DA)
11.11 Analogy of DA with EM
Summary and Discussion

11.1 Introduction

Statistical mechanics serves as a source of ideas for unsupervised (self-organized) learning systems.

Statistical mechanics:
- The formal study of macroscopic equilibrium properties of large systems of elements that are subject to the microscopic laws of mechanics.
- The number of degrees of freedom is enormous, making the use of probabilistic methods mandatory.
- The concept of entropy plays a vital role in statistical mechanics, as it does in Shannon's information theory.
- The more ordered the system, or the more concentrated the underlying probability distribution, the smaller the entropy will be.

Statistical mechanics for the study of neural networks:
- Cragg and Temperley (1954) and Cowan (1968)
- Boltzmann machine (Hinton and Sejnowski, 1983, 1986; Ackley et al., 1985)

11.2 Statistical Mechanics (1/2)

Let $p_i$ denote the probability of occurrence of state $i$ of a stochastic system, with $p_i \ge 0$ for all $i$ and $\sum_i p_i = 1$, and let $E_i$ denote the energy of the system when it is in state $i$.

In thermal equilibrium, the probability of state $i$ follows the canonical (Gibbs) distribution:

$$p_i = \frac{1}{Z} \exp\left(-\frac{E_i}{k_B T}\right), \qquad Z = \sum_i \exp\left(-\frac{E_i}{k_B T}\right)$$

where $\exp(-E_i/k_B T)$ is the Boltzmann factor and the sum over states $Z$ is the partition function. Two observations:

1. States of low energy have a higher probability of occurrence than states of high energy.
2. As the temperature $T$ is reduced, the probability becomes concentrated on a smaller subset of low-energy states.

We set $k_B = 1$ and view $-\log p_i$ as "energy".
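To make the two observations concrete, here is a minimal Python sketch (an illustration, not part of the original notes) that evaluates the Gibbs distribution over a few made-up energy levels at a high and a low temperature:

```python
import numpy as np

def gibbs_distribution(energies, T, k_B=1.0):
    """Canonical (Gibbs) distribution: p_i = exp(-E_i / (k_B T)) / Z."""
    boltzmann_factors = np.exp(-np.asarray(energies) / (k_B * T))
    Z = boltzmann_factors.sum()              # partition function
    return boltzmann_factors / Z

energies = [0.0, 1.0, 2.0, 3.0]              # hypothetical energy levels
print(gibbs_distribution(energies, T=10.0))  # high T: nearly uniform
print(gibbs_distribution(energies, T=0.1))   # low T: mass on the lowest-energy state
```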

11.2 Statistical Mechanics (2/2)

Helmholtz free energy:

$$F = -T \log Z$$

Average energy:

$$\langle E \rangle = \sum_i p_i E_i$$

Then

$$\langle E \rangle - F = -T \sum_i p_i \log p_i$$

With the entropy $H = -\sum_i p_i \log p_i$, we have

$$\langle E \rangle - F = TH, \qquad F = \langle E \rangle - TH$$

The free energy of the system, $F$, tends to decrease and becomes a minimum in an equilibrium situation. The resulting probability distribution is defined by the Gibbs distribution (the principle of minimum free energy).

Consider two systems $A$ and $A'$ in thermal contact, with entropy changes $\Delta H$ and $\Delta H'$. The total entropy tends to increase:

$$\Delta H + \Delta H' \ge 0$$

Nature likes to find a physical system with minimum free energy.
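The identity $F = \langle E \rangle - TH$ can be checked numerically for any finite system; the sketch below (illustrative values, $k_B = 1$) reuses the Gibbs distribution from the previous example:

```python
import numpy as np

T = 2.0
energies = np.array([0.0, 1.0, 2.0, 3.0])    # hypothetical energy levels
p = np.exp(-energies / T)
Z = p.sum()
p /= Z

F = -T * np.log(Z)                    # Helmholtz free energy
E_avg = (p * energies).sum()          # average energy <E>
H = -(p * np.log(p)).sum()            # entropy
assert np.isclose(F, E_avg - T * H)   # F = <E> - T H
```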

11.3 Markov Chains (1/9)

Markov property:

$$P(X_{n+1} = x_{n+1} \mid X_n = x_n, \ldots, X_1 = x_1) = P(X_{n+1} = x_{n+1} \mid X_n = x_n)$$

Transition probability from state $i$ at time $n$ to state $j$ at time $n+1$:

$$p_{ij} = P(X_{n+1} = j \mid X_n = i), \qquad p_{ij} \ge 0 \;\;\forall i, j \quad \text{and} \quad \sum_j p_{ij} = 1 \;\;\forall i$$

If the transition probabilities are fixed over time, the Markov chain is homogeneous. For a system with a finite number of possible states $K$, the transition probabilities constitute a $K \times K$ stochastic matrix:

$$\mathbf{P} = \begin{pmatrix} p_{11} & \cdots & p_{1K} \\ \vdots & \ddots & \vdots \\ p_{K1} & \cdots & p_{KK} \end{pmatrix}$$
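A minimal sketch (not from the notes) of simulating a homogeneous finite-state Markov chain from its stochastic matrix; the matrix shown here is the one used in Example 1 below:

```python
import numpy as np

rng = np.random.default_rng(0)

# 2-state stochastic matrix: each row sums to 1.
P = np.array([[0.25, 0.75],
              [0.50, 0.50]])

def simulate_chain(P, x0, n_steps, rng):
    """Sample a trajectory X_0, X_1, ..., X_n using the rows of P."""
    states = [x0]
    for _ in range(n_steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(simulate_chain(P, x0=0, n_steps=10, rng=rng))
```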

11.3 Markov Chains (2/9)

Generalization to the $m$-step transition probability:

$$p_{ij}^{(m)} = P(X_{n+m} = x_j \mid X_n = x_i), \qquad m = 1, 2, \ldots$$

$$p_{ij}^{(m+1)} = \sum_k p_{ik}^{(m)} p_{kj}, \qquad m = 1, 2, \ldots, \qquad p_{ik}^{(1)} = p_{ik}$$

We can generalize further to the Chapman-Kolmogorov identity:

$$p_{ij}^{(m+n)} = \sum_k p_{ik}^{(m)} p_{kj}^{(n)}, \qquad m, n = 1, 2, \ldots$$

11.3 Markov Chains (3/9)

Recurrent state: the probability $p_i$ of ever returning to state $i$ equals 1.
Transient state: $p_i < 1$.
Periodic: a recurrent state $i$ is periodic with period $d$ if the states can be grouped into $d$ disjoint subsets $S_1, \ldots, S_d$ such that, whenever $i \in S_k$ and $p_{ij} > 0$, then $j \in S_{k+1}$ for $k = 1, \ldots, d-1$ and $j \in S_1$ for $k = d$.
Aperiodic: a recurrent state that is not periodic.

Accessible: state $j$ is accessible from state $i$ if there is a finite sequence of transitions from $i$ to $j$.
Communicate: states $i$ and $j$ communicate if they are accessible to each other. If two states communicate with each other, they belong to the same class. If all the states consist of a single class, the Markov chain is indecomposable or irreducible.

11.3 Markov Chains (4/9)

Figure 11.1: A periodic recurrent Markov chain with d = 3.

11.3 Markov Chains (5/9)

Ergodic Markov chains. Ergodicity means that the time average equals the ensemble average; i.e., the long-term proportion of time spent by the chain in state $i$ corresponds to the steady-state probability $\pi_i$.

Let $v_i(k)$ be the proportion of time spent in state $i$ after $k$ returns:

$$v_i(k) = \frac{k}{\sum_{l=1}^{k} T_i(l)}, \qquad \lim_{k \to \infty} v_i(k) = \pi_i, \quad i = 1, 2, \ldots, K$$

11.3 Markov Chains (6/9)

Convergence to stationary distributions. Consider an ergodic Markov chain with stochastic matrix $\mathbf{P}$, and let $\boldsymbol{\pi}^{(n-1)}$ be the state-transition (row) vector of the chain at time $n-1$. The state-transition vector at time $n$ is

$$\boldsymbol{\pi}^{(n)} = \boldsymbol{\pi}^{(n-1)} \mathbf{P}$$

By iteration we obtain

$$\boldsymbol{\pi}^{(n)} = \boldsymbol{\pi}^{(n-1)} \mathbf{P} = \boldsymbol{\pi}^{(n-2)} \mathbf{P}^2 = \boldsymbol{\pi}^{(n-3)} \mathbf{P}^3 = \cdots = \boldsymbol{\pi}^{(0)} \mathbf{P}^n$$

where $\boldsymbol{\pi}^{(0)}$ is the initial value. In the limit,

$$\lim_{n \to \infty} \mathbf{P}^n = \begin{pmatrix} \pi_1 & \cdots & \pi_K \\ \vdots & & \vdots \\ \pi_1 & \cdots & \pi_K \end{pmatrix} = \begin{pmatrix} \boldsymbol{\pi} \\ \vdots \\ \boldsymbol{\pi} \end{pmatrix}$$

Ergodic theorem:
1. $\lim_{n \to \infty} p_{ij}^{(n)} = \pi_j$ for all $j$
2. $\pi_j > 0$ for all $j$
3. $\sum_{j=1}^{K} \pi_j = 1$
4. $\pi_j = \sum_{i=1}^{K} \pi_i p_{ij}$ for $j = 1, 2, \ldots, K$

11.3 Markov Chains (6/9)

Figure 11.2: State-transition diagram of the Markov chain for Example 1. The states $x_1$ and $x_2$ may be identified as up-to-date and behind, respectively.

$$\mathbf{P} = \begin{pmatrix} \tfrac{1}{4} & \tfrac{3}{4} \\[1mm] \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}, \qquad \boldsymbol{\pi}^{(0)} = \begin{pmatrix} \tfrac{1}{6} & \tfrac{5}{6} \end{pmatrix}$$

$$\boldsymbol{\pi}^{(1)} = \boldsymbol{\pi}^{(0)} \mathbf{P} = \begin{pmatrix} \tfrac{1}{6} & \tfrac{5}{6} \end{pmatrix} \begin{pmatrix} \tfrac{1}{4} & \tfrac{3}{4} \\[1mm] \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix} = \begin{pmatrix} \tfrac{11}{24} & \tfrac{13}{24} \end{pmatrix}$$

Successive powers of $\mathbf{P}$ converge to the stationary distribution $\boldsymbol{\pi} = (0.4, 0.6)$:

$$\mathbf{P}^{2} = \begin{pmatrix} 0.4375 & 0.5625 \\ 0.3750 & 0.6250 \end{pmatrix}, \quad \mathbf{P}^{3} = \begin{pmatrix} 0.3906 & 0.6094 \\ 0.4063 & 0.5938 \end{pmatrix}, \quad \mathbf{P}^{4} = \begin{pmatrix} 0.4023 & 0.5977 \\ 0.3984 & 0.6016 \end{pmatrix} \;\to\; \begin{pmatrix} 0.4 & 0.6 \\ 0.4 & 0.6 \end{pmatrix}$$
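A quick numpy check of Example 1 (illustrative, not from the notes): iterate the powers of $\mathbf{P}$ and extract the stationary distribution as the left eigenvector for eigenvalue 1:

```python
import numpy as np

P = np.array([[0.25, 0.75],
              [0.50, 0.50]])
pi0 = np.array([1/6, 5/6])

print(pi0 @ P)                         # pi^(1) = [11/24, 13/24]
print(np.linalg.matrix_power(P, 16))   # both rows approach [0.4, 0.6]

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
print(pi / pi.sum())                   # [0.4, 0.6]
```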

11.3 Markov Chains (7/9)

Figure 11.3: State-transition diagram of the Markov chain for Example 2.

$$\mathbf{P} = \begin{pmatrix} 0 & 0 & 1 \\[1mm] \tfrac{1}{3} & \tfrac{1}{6} & \tfrac{1}{2} \\[1mm] \tfrac{3}{4} & \tfrac{1}{4} & 0 \end{pmatrix}$$

Solving $\pi_j = \sum_{i=1}^{K} \pi_i p_{ij}$:

$$\pi_1 = \tfrac{1}{3}\pi_2 + \tfrac{3}{4}\pi_3, \qquad \pi_2 = \tfrac{1}{6}\pi_2 + \tfrac{1}{4}\pi_3, \qquad \pi_3 = \pi_1 + \tfrac{1}{2}\pi_2$$

Together with $\sum_j \pi_j = 1$, this gives $\pi_1 \approx 0.3953$, $\pi_2 \approx 0.1395$, $\pi_3 \approx 0.4652$.

11.3 Markov Chains (8/9)

Figure 11.4: Classification of the states of a Markov chain and their associated long-term behavior.

11.3 Markov Chains (9/9)

Principle of detailed balance: at thermal equilibrium, the rate of occurrence of any transition equals the rate of occurrence of the corresponding inverse transition:

$$\pi_i p_{ij} = \pi_j p_{ji}$$

Application: detailed balance implies stationarity. If $\pi_i p_{ij} = \pi_j p_{ji}$ for all $i, j$, then

$$\sum_{i=1}^{K} \pi_i p_{ij} = \sum_{i=1}^{K} \pi_j p_{ji} = \pi_j \sum_{i=1}^{K} p_{ji} = \pi_j$$

since $\sum_i p_{ji} = 1$. Hence $\boldsymbol{\pi}$ is a stationary distribution.
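A small check (illustrative) that the chain of Example 1 satisfies detailed balance with its stationary distribution:

```python
import numpy as np

P = np.array([[0.25, 0.75],
              [0.50, 0.50]])
pi = np.array([0.4, 0.6])               # stationary distribution from Example 1

flows = pi[:, None] * P                 # flow matrix: pi_i * p_ij
print(np.allclose(flows, flows.T))      # True: pi_i p_ij = pi_j p_ji
print(pi @ P)                           # [0.4, 0.6]: pi is indeed stationary
```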

11.4 Metropolis Algorithm (1/3)

The Metropolis algorithm is a stochastic algorithm for simulating the evolution of a physical system to thermal equilibrium: a modified Monte Carlo method, and a Markov chain Monte Carlo (MCMC) method.

Algorithm (Metropolis):
1. Given $X_n = x_i$, randomly generate a new (candidate) state $x_j$.
2. Compute the energy difference $\Delta E = E(x_j) - E(x_i)$.
3. If $\Delta E < 0$, accept: $X_{n+1} = x_j$.
   Otherwise ($\Delta E \ge 0$), select a random number $\xi \sim U[0, 1]$;
   if $\xi < \exp(-\Delta E / T)$, accept: $X_{n+1} = x_j$;
   else reject: $X_{n+1} = x_i$.
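A minimal Python sketch of the Metropolis loop (not from the notes); the one-dimensional double-well energy function is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x):
    """Hypothetical 1-D double-well energy surface."""
    return (x**2 - 1.0)**2

def metropolis_step(x, T, step_size=0.5):
    """One Metropolis update at temperature T."""
    x_new = x + rng.normal(scale=step_size)      # random candidate move
    dE = energy(x_new) - energy(x)
    if dE < 0 or rng.uniform() < np.exp(-dE / T):
        return x_new                             # accept
    return x                                     # reject

x, T = 0.0, 0.5
samples = []
for _ in range(10_000):
    x = metropolis_step(x, T)
    samples.append(x)
# samples now approximate the Gibbs distribution exp(-E(x)/T) / Z
```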

11.4 Metropolis Algorithm (2/3)

Choice of transition probabilities. Start from a proposed set of transition probabilities $\tau_{ij}$ satisfying:
1. $\tau_{ij} \ge 0$ for all $i, j$ (nonnegativity)
2. $\sum_j \tau_{ij} = 1$ for all $i$ (normalization)
3. $\tau_{ij} = \tau_{ji}$ for all $i, j$ (symmetry)

The desired set of transition probabilities is then, for $j \ne i$,

$$p_{ij} = \begin{cases} \tau_{ij} \dfrac{\pi_j}{\pi_i} & \text{if } \dfrac{\pi_j}{\pi_i} < 1 \\[3mm] \tau_{ij} & \text{if } \dfrac{\pi_j}{\pi_i} \ge 1 \end{cases}$$

$$p_{ii} = \tau_{ii} + \sum_{j \ne i} \tau_{ij} \left( 1 - \alpha_{ij} \right)$$

where the moving probability is

$$\alpha_{ij} = \min\left(1, \frac{\pi_j}{\pi_i}\right)$$

11.4 Metropolis Algorithm (3/3)

How do we choose the ratio $\pi_j / \pi_i$? We choose the probability distribution to which we want the Markov chain to converge to be a Gibbs distribution:

$$\pi_j = \frac{1}{Z} \exp\left(-\frac{E_j}{T}\right), \qquad \frac{\pi_j}{\pi_i} = \exp\left(-\frac{\Delta E}{T}\right), \qquad \Delta E = E_j - E_i$$

so the partition function $Z$ cancels in the ratio and never needs to be computed.

Proof of detailed balance:

Case 1: $\Delta E < 0$, so $\pi_j > \pi_i$. Then

$$\pi_i p_{ij} = \pi_i \tau_{ij} = \pi_i \tau_{ji}, \qquad \pi_j p_{ji} = \pi_j \left(\frac{\pi_i}{\pi_j}\right) \tau_{ji} = \pi_i \tau_{ji}$$

Case 2: $\Delta E > 0$, so $\pi_j < \pi_i$. Then

$$\pi_i p_{ij} = \pi_i \left(\frac{\pi_j}{\pi_i}\right) \tau_{ij} = \pi_j \tau_{ij}, \qquad \pi_j p_{ji} = \pi_j \tau_{ji} = \pi_j \tau_{ij}$$

In both cases $\pi_i p_{ij} = \pi_j p_{ji}$.

11.5 Simulated Annealing (1/3)

Simulated annealing is a stochastic relaxation technique for solving optimization problems. It improves the computational efficiency of the Metropolis algorithm and makes random moves on the energy surface:

$$F = \langle E \rangle - TH, \qquad \lim_{T \to 0} F = \langle E \rangle$$

The idea is to operate a stochastic system at a high temperature (where convergence to equilibrium is fast) and then iteratively lower the temperature (at $T = 0$, the Markov chain collapses onto the global minima).

Two ingredients:
1. A schedule that determines the rate at which the temperature is lowered.
2. An algorithm, such as the Metropolis algorithm, that iteratively finds the equilibrium distribution at each new temperature in the schedule, using the final state of the system at the previous temperature as the starting point for the new temperature.

11.5 Simulated Annealing (2/3)

1. Initial value of the temperature. The initial value $T_0$ of the temperature is chosen high enough to ensure that virtually all proposed transitions are accepted by the simulated-annealing algorithm.
2. Decrement of the temperature. Ordinarily, the cooling is performed exponentially, and the changes made in the value of the temperature are small. In particular, the decrement function is defined by
$$T_k = \alpha T_{k-1}, \qquad k = 1, 2, \ldots, K$$
where $\alpha$ is a constant smaller than, but close to, unity; typical values of $\alpha$ lie between 0.8 and 0.99. At each temperature, enough transitions are attempted so that there are 10 accepted transitions per experiment, on average.
3. Final value of the temperature. The system is fixed and annealing stops if the desired number of acceptances is not achieved at three successive temperatures.
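Combining the Metropolis step with the exponential cooling schedule gives the following sketch (illustrative parameter values; the energy function is the earlier made-up double well, tilted so that it has a unique global minimum):

```python
import numpy as np

rng = np.random.default_rng(1)

def anneal(energy, x0, T0=10.0, alpha=0.95, n_temps=200, steps_per_T=100):
    """Simulated annealing with exponential cooling T_k = alpha * T_{k-1}."""
    x, T = x0, T0
    best_x, best_E = x0, energy(x0)
    for _ in range(n_temps):
        for _ in range(steps_per_T):             # Metropolis steps at fixed T
            x_new = x + rng.normal(scale=0.5)
            dE = energy(x_new) - energy(x)
            if dE < 0 or rng.uniform() < np.exp(-dE / T):
                x = x_new
                if energy(x) < best_E:
                    best_x, best_E = x, energy(x)
        T *= alpha                               # lower the temperature
    return best_x, best_E

# Tilted double well: global minimum near x = -1.
print(anneal(lambda x: (x**2 - 1.0)**2 + 0.3 * x, x0=2.0))
```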

11.5 Simulated Annealing (3/3)

Simulated Annealing for Combinatorial Optimization

11.6 Gibbs Sampling (1/2)

Gibbs sampling is an iterative adaptive scheme that generates a value from the conditional distribution of each component of the random vector $\mathbf{X}$ in turn, rather than drawing all components at the same time.

Let $\mathbf{X} = (X_1, X_2, \ldots, X_K)$ be a random vector of $K$ components, and assume we know the conditionals $P(X_k \mid \mathbf{X}_{-k})$, where $\mathbf{X}_{-k} = (X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_K)$.

Gibbs sampling algorithm (Gibbs sampler):
1. Initialize $x_1(0), x_2(0), \ldots, x_K(0)$.
2. For iteration $i$, sample each component in turn, conditioning on the most recent values:
   $x_1(1) \sim P(X_1 \mid x_2(0), x_3(0), x_4(0), \ldots, x_K(0))$
   $x_2(1) \sim P(X_2 \mid x_1(1), x_3(0), x_4(0), \ldots, x_K(0))$
   $x_3(1) \sim P(X_3 \mid x_1(1), x_2(1), x_4(0), \ldots, x_K(0))$
   $\vdots$
   $x_k(1) \sim P(X_k \mid x_1(1), x_2(1), \ldots, x_{k-1}(1), x_{k+1}(0), \ldots, x_K(0))$
   $\vdots$
   $x_K(1) \sim P(X_K \mid x_1(1), x_2(1), \ldots, x_{K-1}(1))$
3. If the termination condition is not met, set $i \leftarrow i + 1$ and go to step 2.
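A classic minimal example (illustrative, not from the notes): a Gibbs sampler for a standard bivariate Gaussian with correlation $\rho$, where each full conditional is itself a one-dimensional Gaussian:

```python
import numpy as np

rng = np.random.default_rng(2)

rho, n_samples = 0.8, 5_000
x1, x2 = 0.0, 0.0
samples = np.empty((n_samples, 2))
for i in range(n_samples):
    # Full conditionals: X1 | X2 = x2 ~ N(rho * x2, 1 - rho^2), and symmetrically.
    x1 = rng.normal(loc=rho * x2, scale=np.sqrt(1 - rho**2))
    x2 = rng.normal(loc=rho * x1, scale=np.sqrt(1 - rho**2))
    samples[i] = (x1, x2)

print(np.corrcoef(samples.T)[0, 1])   # close to rho = 0.8
```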

11.6 Gibbs Sampling (2/2)

1. Convergence theorem. The random variable $X_k(n)$ converges in distribution to the true probability distribution of $X_k$ for $k = 1, 2, \ldots, K$ as $n$ approaches infinity; that is,

$$\lim_{n \to \infty} P(X_k(n) \le x \mid x_k(0)) = P_{X_k}(x) \quad \text{for } k = 1, 2, \ldots, K$$

where $P_{X_k}(x)$ is the marginal cumulative distribution function of $X_k$.

2. Rate-of-convergence theorem. The joint cumulative distribution of the random variables $X_1(n), X_2(n), \ldots, X_K(n)$ converges to the true joint cumulative distribution of $X_1, X_2, \ldots, X_K$ at a geometric rate in $n$.

3. Ergodic theorem. For any measurable function $g$ of the random variables $X_1, X_2, \ldots, X_K$ whose expectation exists, we have

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} g(X_1(i), X_2(i), \ldots, X_K(i)) \to E\left[g(X_1, X_2, \ldots, X_K)\right]$$

with probability 1 (i.e., almost surely).

11.7 Boltzmann Machine (1/5)

A Boltzmann machine (BM) is a stochastic machine consisting of stochastic neurons with symmetric synaptic connections.

Let $\mathbf{x}$ be the state vector of the BM and $w_{ji}$ the synaptic connection from neuron $i$ to neuron $j$.

Structure (weights):

$$w_{ji} = w_{ij} \;\;\forall i, j \qquad \text{and} \qquad w_{ii} = 0 \;\;\forall i$$

Energy:

$$E(\mathbf{x}) = -\frac{1}{2} \sum_{j} \sum_{i,\, i \ne j} w_{ji} x_i x_j$$

Probability:

$$P(\mathbf{X} = \mathbf{x}) = \frac{1}{Z} \exp\left(-\frac{E(\mathbf{x})}{T}\right)$$

Figure 11.5: Architectural graph of the Boltzmann machine; $K$ is the number of visible neurons, and $L$ is the number of hidden neurons. The distinguishing features of the machine are:
1. The connections between the visible and hidden neurons are symmetric.
2. The symmetric connections are extended to the visible and hidden neurons.

11.7 Boltzmann Machine (2/5)

Consider three events:
- $A$: $X_j = x_j$
- $B$: $\{X_i = x_i\}_{i=1,\, i \ne j}^{K}$
- $C$: $\{X_i = x_i\}_{i=1}^{K}$

The joint event $B$ excludes $A$, and the joint event $C$ includes both $A$ and $B$. Then

$$P(C) = P(A, B) = \frac{1}{Z} \exp\left(\frac{1}{2T} \sum_{j} \sum_{i,\, i \ne j} w_{ji} x_i x_j\right)$$

$$P(B) = \sum_{A} P(A, B) = \frac{1}{Z} \sum_{x_j} \exp\left(\frac{1}{2T} \sum_{j} \sum_{i,\, i \ne j} w_{ji} x_i x_j\right)$$

Dividing and keeping only the component involving $x_j$,

$$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{1}{1 + \exp\left(-\dfrac{x_j}{T} \displaystyle\sum_{i,\, i \ne j} w_{ji} x_i\right)}$$

that is,

$$P\left(X_j = x \,\middle|\, \{X_i = x_i\}_{i=1,\, i \ne j}^{K}\right) = \varphi\left(\frac{x}{T} \sum_{i,\, i \ne j}^{K} w_{ji} x_i\right), \qquad \varphi(v) = \frac{1}{1 + \exp(-v)}$$
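The conditional above is a sigmoid of the local field, which suggests the following sampling routine for one neuron (a sketch with hypothetical weights; states are bipolar, $x_i \in \{-1, +1\}$):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_neuron(j, x, W, T=1.0):
    """Sample neuron j given all other states, using
    P(X_j = x | rest) = phi((x/T) * sum_{i != j} w_ji x_i)."""
    v = W[j] @ x - W[j, j] * x[j]              # local field, excluding i = j
    p_plus = 1.0 / (1.0 + np.exp(-v / T))      # probability that x_j = +1
    return 1 if rng.uniform() < p_plus else -1

# Hypothetical symmetric weights with zero diagonal.
W = np.array([[ 0.0, 0.5, -0.3],
              [ 0.5, 0.0,  0.8],
              [-0.3, 0.8,  0.0]])
x = np.array([1, -1, 1])
x[0] = sample_neuron(0, x, W)                  # one stochastic update
```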

11.7 Boltzmann Machine (3/5)

Figure 11.6: Sigmoid-shaped function $P(v)$.

The machine operates in two phases:
1. Positive phase. The network operates in its clamped condition (i.e., under the direct influence of the training sample $\mathcal{T}$).
2. Negative phase. The network is allowed to run freely, and therefore with no environmental input.

Log-likelihood of the training data:

$$L(\mathbf{w}) = \log \prod_{\mathbf{x}_\alpha \in \mathcal{T}} P(\mathbf{X}_\alpha = \mathbf{x}_\alpha) = \sum_{\mathbf{x}_\alpha \in \mathcal{T}} \log P(\mathbf{X}_\alpha = \mathbf{x}_\alpha)$$

11.7 Boltzmann Machine (4/5)

Let $\mathbf{x}_\alpha$ be the state of the visible neurons and $\mathbf{x}_\beta$ the state of the hidden neurons (both subsets of $\mathbf{x}$).

Probability of a visible state:

$$P(\mathbf{X}_\alpha = \mathbf{x}_\alpha) = \frac{1}{Z} \sum_{\mathbf{x}_\beta} \exp\left(-\frac{E(\mathbf{x})}{T}\right), \qquad Z = \sum_{\mathbf{x}} \exp\left(-\frac{E(\mathbf{x})}{T}\right)$$

Log-likelihood function given the training data $\mathcal{T}$:

$$L(\mathbf{w}) = \log P(\mathcal{T} \mid \mathbf{w}) = \sum_{\mathbf{x}_\alpha \in \mathcal{T}} \left( \log \sum_{\mathbf{x}_\beta} \exp\left(-\frac{E(\mathbf{x})}{T}\right) - \log \sum_{\mathbf{x}} \exp\left(-\frac{E(\mathbf{x})}{T}\right) \right)$$

Derivative of the log-likelihood function:

$$\frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \frac{1}{T} \sum_{\mathbf{x}_\alpha \in \mathcal{T}} \left( \sum_{\mathbf{x}_\beta} P(\mathbf{X}_\beta = \mathbf{x}_\beta \mid \mathbf{X}_\alpha = \mathbf{x}_\alpha)\, x_j x_i - \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x})\, x_j x_i \right)$$

11.7 Boltzmann Machine (5/5)

Mean firing rate in the positive phase (clamped):

$$\rho_{ji}^{+} = \langle x_j x_i \rangle^{+} = \sum_{\mathbf{x}_\alpha \in \mathcal{T}} \sum_{\mathbf{x}_\beta} P(\mathbf{X}_\beta = \mathbf{x}_\beta \mid \mathbf{X}_\alpha = \mathbf{x}_\alpha)\, x_j x_i$$

Mean firing rate in the negative phase (free-running):

$$\rho_{ji}^{-} = \langle x_j x_i \rangle^{-} = \sum_{\mathbf{x}_\alpha \in \mathcal{T}} \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x})\, x_j x_i$$

Thus, we may write

$$\frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \frac{1}{T} \left( \rho_{ji}^{+} - \rho_{ji}^{-} \right)$$

Gradient ascent to maximize $L(\mathbf{w})$ yields the Boltzmann machine learning rule:

$$\Delta w_{ji} = \eta \frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \eta' \left( \rho_{ji}^{+} - \rho_{ji}^{-} \right), \qquad \eta' = \frac{\epsilon}{T}$$
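In code, the learning rule is a one-line update once the two correlation matrices have been estimated by sampling; this sketch assumes they are given (hypothetical helper, not from the notes):

```python
import numpy as np

def boltzmann_update(W, rho_plus, rho_minus, lr=0.01):
    """One Boltzmann learning step: dw_ji = eta' * (rho+_ji - rho-_ji).

    rho_plus / rho_minus: correlations <x_j x_i> estimated by Gibbs sampling
    in the clamped (positive) and free-running (negative) phases.
    """
    dW = lr * (rho_plus - rho_minus)
    np.fill_diagonal(dW, 0.0)        # keep w_ii = 0 (no self-connections)
    return W + dW
```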

11.8 Logistic Belief Nets

A logistic belief net is a stochastic machine consisting of multiple layers of stochastic neurons with directed synaptic connections.

Parents of node $j$:

$$\mathrm{pa}(X_j) \subseteq \{X_1, X_2, \ldots, X_{j-1}\}$$

Conditional probability:

$$P(X_j = x_j \mid X_1 = x_1, \ldots, X_{j-1} = x_{j-1}) = P(X_j = x_j \mid \mathrm{pa}(X_j))$$

Figure 11.7: Directed (logistic) belief network.

Calculation of conditional probabilities uses weights $w_{ji}$ constrained by:
1. $w_{ji} = 0$ for all $X_i \notin \mathrm{pa}(X_j)$
2. $w_{ji} = 0$ for $i \ge j$ (acyclicity)

Weight update rule:

$$\Delta w_{ji} = \eta \frac{\partial L(\mathbf{w})}{\partial w_{ji}}$$

11.9 Deep Belief Nets (1/4)

Maximum-likelihood learning in a restricted Boltzmann machine (RBM).

Sequential pre-training proceeds by alternating Gibbs sampling:
1. Update the hidden states $\mathbf{h}$ in parallel, given the visible states $\mathbf{x}$.
2. Do the same in reverse: update the visible states $\mathbf{x}$ in parallel, given the hidden states $\mathbf{h}$.

Maximum-likelihood learning rule:

$$\frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \rho_{ji}^{(0)} - \rho_{ji}^{(\infty)}$$

Figure 11.8: Neural structure of the restricted Boltzmann machine (RBM). Contrasting this with Fig. 11.5, we see that, unlike the Boltzmann machine, the RBM has no connections among the visible neurons and none among the hidden neurons.
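In practice the $\rho^{(\infty)}$ term is commonly approximated by truncating the alternating Gibbs chain after one step (contrastive divergence, CD-1). Below is a minimal CD-1 sketch for a binary RBM; the shapes, learning rate, and helper names are illustrative assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def cd1_update(W, a, b, x0, lr=0.1):
    """One CD-1 step for a binary RBM.

    W: (n_visible, n_hidden) weights; a, b: visible/hidden biases;
    x0: batch of training vectors, shape (batch, n_visible).
    """
    # Positive phase: sample hidden units given the data.
    p_h0 = sigmoid(x0 @ W + b)
    h0 = (rng.uniform(size=p_h0.shape) < p_h0).astype(float)
    # Negative phase: reconstruct visibles, then recompute hidden probabilities.
    p_x1 = sigmoid(h0 @ W.T + a)
    x1 = (rng.uniform(size=p_x1.shape) < p_x1).astype(float)
    p_h1 = sigmoid(x1 @ W + b)
    # Gradient estimate: <x h>_data - <x h>_reconstruction.
    batch = x0.shape[0]
    W += lr * (x0.T @ p_h0 - x1.T @ p_h1) / batch
    a += lr * (x0 - x1).mean(axis=0)
    b += lr * (p_h0 - p_h1).mean(axis=0)
    return W, a, b
```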

11.9 Deep Belief Nets (2/4)

Figure 11.9: Top-down learning, using a logistic belief network of infinite depth.

Figure 11.10: A hybrid generative model in which the two top layers form a restricted Boltzmann machine and the lower two layers form a directed model. The weights shown with blue shaded arrows are not part of the generative model; they are used to infer the feature values given the data, but they are not used for generating data.

11.9 Deep Belief Nets (3/4)

Figure 11.11: Illustrating the progression of alternating Gibbs sampling in an RBM. After sufficiently many steps, the visible and hidden vectors are sampled from the stationary distribution defined by the current parameters of the model.

11.9 Deep Belief Nets (4/4)

Figure 11.12: The task of modeling the sensory (visible) data is divided into two subtasks.

11.10 Deterministic Annealing (1/5)

Deterministic annealing incorporates randomness into the energy function, which is then deterministically optimized at a sequence of decreasing temperatures (cf. simulated annealing: random moves on the energy surface).

Clustering via deterministic annealing. Let $\mathbf{x}$ be the source (input) vector and $\mathbf{y}$ the reconstruction (output) vector.

Distortion measure:

$$d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|^2$$

Expected distortion:

$$D = \sum_{\mathbf{x}} \sum_{\mathbf{y}} P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y})\, d(\mathbf{x}, \mathbf{y}) = \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}) \sum_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})\, d(\mathbf{x}, \mathbf{y})$$

where $P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) = P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})\, P(\mathbf{X} = \mathbf{x})$, and $P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$ is the association probability.

11.10 Deterministic Annealing (2/5)

Entropy as the randomness measure:

$$H(\mathbf{X}, \mathbf{Y}) = -\sum_{\mathbf{x}} \sum_{\mathbf{y}} P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) \log P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y})$$

Constrained optimization of $D$ as minimization of the Lagrangian:

$$F = D - TH$$

The joint entropy decomposes as

$$H(\mathbf{X}, \mathbf{Y}) = \underbrace{H(\mathbf{X})}_{\text{source entropy}} + \underbrace{H(\mathbf{Y} \mid \mathbf{X})}_{\text{conditional entropy}}$$

$$H(\mathbf{Y} \mid \mathbf{X}) = -\sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}) \sum_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \log P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$$

Minimizing $F$ with respect to the association probabilities gives a Gibbs distribution:

$$P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}} \exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right), \qquad Z_{\mathbf{x}} = \sum_{\mathbf{y}} \exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right)$$
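Computing these association probabilities for a set of code vectors amounts to a softmax over negative squared distances; a sketch (illustrative, with a standard numerical-stability shift):

```python
import numpy as np

def association_probs(X, Y, T):
    """Association probabilities P(y | x) = exp(-||x - y||^2 / T) / Z_x.

    X: (n, d) source vectors; Y: (m, d) code vectors; returns an (n, m) matrix.
    """
    d2 = ((X[:, None, :] - Y[None, :, :])**2).sum(axis=-1)  # squared distances
    logits = -d2 / T
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the exponentials
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)
```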

11.10 Deterministic Annealing (3/5)

Minimizing $F$ with respect to the association probabilities yields

$$F^* = \min_{P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})} F = -T \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}) \log Z_{\mathbf{x}}$$

Setting the gradient with respect to the code vectors to zero,

$$\frac{\partial F^*}{\partial \mathbf{y}} = \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) \frac{\partial}{\partial \mathbf{y}} d(\mathbf{x}, \mathbf{y}) = 0 \qquad \forall \mathbf{y} \in \Upsilon$$

the minimizing condition becomes

$$\frac{1}{N} \sum_{\mathbf{x}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \frac{\partial}{\partial \mathbf{y}} d(\mathbf{x}, \mathbf{y}) = 0 \qquad \forall \mathbf{y} \in \Upsilon$$

The deterministic annealing algorithm consists of minimizing the Lagrangian $F^*$ with respect to the code vectors at a high value of temperature $T$ and then tracking the minimum while the temperature $T$ is lowered.

11.10 Deterministic Annealing (4/5)

Figure 11.13: Clustering at various phases. The lines are equiprobability contours, p = ½ in (b) and p = ⅓ elsewhere: (a) 1 cluster (B = 0), (b) 2 clusters (B = 0.0049), (c) 3 clusters (B = 0.0056), (d) 4 clusters (B = 0.0100), (e) 5 clusters (B = 0.0156), (f) 6 clusters (B = 0.0347), and (g) 19 clusters (B = 0.0605), where B = 1/T.

11.10 Deterministic Annealing (5/5)

Figure 11.14: Phase diagram for the case study in deterministic annealing. The number of effective clusters is shown for each phase; B = 1/T.

11.11 Analogy of DA with EM (1/2)

Suppose we view the association probability $P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$ as the expected value of a random binary variable $V_{\mathbf{x}\mathbf{y}}$ defined as

$$V_{\mathbf{x}\mathbf{y}} = \begin{cases} 1 & \text{if the source vector } \mathbf{x} \text{ is assigned to code vector } \mathbf{y} \\ 0 & \text{otherwise} \end{cases}$$

Then the two steps of DA correspond to the two steps of EM:
1. Step 1 of DA (= E-step of EM): compute the association probabilities $P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$.
2. Step 2 of DA (= M-step of EM): optimize the distortion measure $d(\mathbf{x}, \mathbf{y})$.
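Putting the two steps together gives an annealed soft-clustering loop (a sketch under illustrative parameters; for the squared-error distortion the M-like step is the probability-weighted centroid). It reuses association_probs from the sketch above:

```python
import numpy as np

rng = np.random.default_rng(5)

def da_cluster(X, m=3, T0=5.0, alpha=0.9, n_temps=50, n_inner=10):
    """Deterministic-annealing clustering: alternate the two DA steps
    while the temperature is slowly lowered."""
    Y = X[rng.choice(len(X), size=m, replace=False)].copy()
    T = T0
    for _ in range(n_temps):
        for _ in range(n_inner):
            P = association_probs(X, Y, T)          # step 1: association probabilities
            Y = (P.T @ X) / P.sum(axis=0)[:, None]  # step 2: re-estimate code vectors
        T *= alpha
    return Y

# Three hypothetical Gaussian blobs in 2-D.
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in [(-2, 0), (0, 2), (2, 0)]])
print(da_cluster(X))
```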

11.11 Analogy of DA with EM (2/2)

Expectation-Maximization (EM) algorithm. Let $\mathbf{r}$ denote the complete data (including the missing data $\mathbf{z}$) and $\mathbf{d} = \mathbf{d}(\mathbf{r})$ the incomplete data. The conditional pdf of $\mathbf{r}$ given the parameter vector $\boldsymbol{\theta}$ is

$$p_D(\mathbf{d} \mid \boldsymbol{\theta}) = \int_{R(\mathbf{d})} p_c(\mathbf{r} \mid \boldsymbol{\theta})\, d\mathbf{r}$$

where $R(\mathbf{d})$ is the subspace of $R$ determined by $\mathbf{d} = \mathbf{d}(\mathbf{r})$.

Incomplete-data log-likelihood function: $L(\boldsymbol{\theta}) = \log p_D(\mathbf{d} \mid \boldsymbol{\theta})$.
Complete-data log-likelihood function: $L_c(\boldsymbol{\theta}) = \log p_c(\mathbf{r} \mid \boldsymbol{\theta})$.

With $\hat{\boldsymbol{\theta}}(n)$ the value of $\boldsymbol{\theta}$ at iteration $n$ of EM:

1. E-step:
$$Q(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(n)) = E_{\hat{\boldsymbol{\theta}}(n)}\left[ L_c(\boldsymbol{\theta}) \right]$$

2. M-step:
$$\hat{\boldsymbol{\theta}}(n+1) = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(n))$$

After an iteration of the EM algorithm, the incomplete-data log-likelihood function is not decreased:

$$L(\hat{\boldsymbol{\theta}}(n+1)) \ge L(\hat{\boldsymbol{\theta}}(n)), \qquad n = 0, 1, 2, \ldots$$

Summary and Discussion

- Statistical mechanics as the mathematical basis for the formulation of stochastic simulation / optimization / learning:
  1. Metropolis algorithm
  2. Simulated annealing
  3. Gibbs sampling
- Stochastic learning machines:
  1. (Classical) Boltzmann machine
  2. Restricted Boltzmann machine (RBM)
  3. Deep belief nets (DBN)
- Deterministic annealing (DA):
  1. For optimization: connection to simulated annealing (SA)
  2. For clustering: connection to expectation-maximization (EM)