Chapter 11. Stochastic Methods Rooted in Statistical Mechanics


Neural Networks and Learning Machines (Haykin)
Lecture Notes on Self-learning Neural Algorithms
Byoung-Tak Zhang
School of Computer Science and Engineering, Seoul National University

Contents
11.1 Introduction
11.2 Statistical Mechanics
11.3 Markov Chains
11.4 Metropolis Algorithm
11.5 Simulated Annealing
11.6 Gibbs Sampling
11.7 Boltzmann Machine
11.8 Logistic Belief Nets
11.9 Deep Belief Nets
11.10 Deterministic Annealing (DA)
11.11 Analogy of DA with EM
Summary and Discussion
(c) 2017 Biointelligence Lab, SNU

11.1 Introduction
Statistical mechanics is a source of ideas for unsupervised (self-organized) learning systems.
Statistical mechanics:
- The formal study of the macroscopic equilibrium properties of large systems of elements that are subject to the microscopic laws of mechanics.
- The number of degrees of freedom is enormous, making the use of probabilistic methods mandatory.
- The concept of entropy plays a vital role in statistical mechanics, as it does in Shannon's information theory.
- The more ordered the system, or the more concentrated the underlying probability distribution, the smaller the entropy will be.
Statistical mechanics for the study of neural networks:
- Cragg and Temperley (1954) and Cowan (1968)
- Boltzmann machine (Hinton and Sejnowski, 1983, 1986; Ackley et al., 1985)

11.2 Statistical Mechanics (1/2)
p_i: probability of occurrence of state i of a stochastic system, with p_i ≥ 0 for all i and Σ_i p_i = 1.
E_i: energy of the system when it is in state i.
In thermal equilibrium, the probability of state i follows the canonical (Gibbs) distribution:
  p_i = (1/Z) exp(-E_i / k_B T),  Z = Σ_i exp(-E_i / k_B T)
Here exp(-E / k_B T) is the Boltzmann factor, and Z, the sum over states, is the partition function.
1. States of low energy have a higher probability of occurrence than states of high energy.
2. As the temperature T is reduced, the probability is concentrated on a smaller subset of low-energy states.
We set k_B = 1 and view -log p_i as "energy."
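The two numbered properties above are easy to check directly. A minimal Python sketch (the three-state energy ladder is an illustrative assumption, not an example from the text):

```python
import numpy as np

def gibbs_distribution(energies, T):
    """Return the Gibbs probabilities p_i = exp(-E_i / T) / Z, with k_B = 1."""
    e = np.asarray(energies, dtype=float)
    # Subtracting the minimum energy is a standard stability trick; it cancels in the ratio.
    w = np.exp(-(e - e.min()) / T)
    return w / w.sum()   # Z is the sum over states (partition function)

energies = [0.0, 1.0, 2.0]
p_hot = gibbs_distribution(energies, T=10.0)   # high T: nearly uniform
p_cold = gibbs_distribution(energies, T=0.1)   # low T: mass concentrates on the ground state
```

At T = 10 the three probabilities are close to uniform, while at T = 0.1 virtually all probability sits on the lowest-energy state.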

11.2 Statistical Mechanics (2/2)
Helmholtz free energy: F = -T log Z
Average energy: <E> = Σ_i p_i E_i
Substituting the Gibbs distribution gives <E> - F = -T Σ_i p_i log p_i.
Entropy: H = -Σ_i p_i log p_i
Thus we have <E> - F = TH, i.e., F = <E> - TH.
The free energy of the system, F, tends to decrease and become a minimum in an equilibrium situation. The resulting probability distribution is defined by the Gibbs distribution (the principle of minimal free energy).
Consider two systems A and A' in thermal contact, with entropy changes ΔH and ΔH'. The total entropy tends to increase, with ΔH + ΔH' ≥ 0.
Nature likes to find a physical system with minimum free energy.

11.3 Markov Chains (1/9)
Markov property:
  P(X_{n+1} = x_{n+1} | X_n = x_n, ..., X_1 = x_1) = P(X_{n+1} = x_{n+1} | X_n = x_n)
Transition probability from state i at time n to state j at time n+1:
  p_ij = P(X_{n+1} = j | X_n = i), with p_ij ≥ 0 for all i, j and Σ_j p_ij = 1 for all i
If the transition probabilities are fixed, the Markov chain is homogeneous. In the case of a system with a finite number of possible states K, the transition probabilities constitute a K-by-K stochastic matrix:
  P = [p_ij], with rows (p_i1, ..., p_iK) for i = 1, ..., K

11.3 Markov Chains (2/9)
Generalization to the m-step transition probability:
  p_ij^(m) = P(X_{n+m} = x_j | X_n = x_i), m = 1, 2, ...
  p_ij^(m+1) = Σ_k p_ik^(m) p_kj, m = 1, 2, ..., with p_ik^(1) = p_ik
We can generalize further to the Chapman-Kolmogorov identity:
  p_ij^(m+n) = Σ_k p_ik^(m) p_kj^(n), m, n = 1, 2, ...
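For a finite chain, the m-step transition probabilities are simply the entries of the matrix power P^m, so the Chapman-Kolmogorov identity is matrix multiplication. A quick sketch with an arbitrary two-state stochastic matrix (chosen for illustration):

```python
import numpy as np

# Chapman-Kolmogorov: p_ij^(m+n) = sum_k p_ik^(m) p_kj^(n)
# is the matrix identity P^(m+n) = P^m P^n.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])            # illustrative 2-state stochastic matrix

P2 = P @ P                            # two-step transition probabilities
P5 = np.linalg.matrix_power(P, 5)     # five-step transition probabilities

# P^5 = P^2 P^3, and every P^m is again a stochastic matrix (rows sum to 1).
```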

11.3 Markov Chains (3/9)
Classification of states:
- Recurrent: the probability p_i of ever returning to state i equals 1.
- Transient: p_i < 1.
- Periodic (with period d): the recurrent states can be partitioned into d disjoint subsets S_1, ..., S_d such that if i ∈ S_k and p_ij > 0, then j ∈ S_{k+1} for k = 1, ..., d-1, and j ∈ S_1 for k = d.
- Aperiodic: d = 1.
- Accessible: state j is accessible from state i if there is a finite sequence of transitions from i to j.
- Communicate: states i and j communicate if they are accessible to each other. If two states communicate with each other, they belong to the same class. If all the states constitute a single class, the Markov chain is indecomposable (irreducible).

11.3 Markov Chains (4/9)
Figure 11.1: A periodic recurrent Markov chain with d = 3.

11.3 Markov Chains (5/9)
Ergodic Markov chains.
Ergodicity: time average = ensemble average; i.e., the long-term proportion of time spent by the chain in state i corresponds to the steady-state probability π_i.
Let T_i(l) denote the duration of the l-th return cycle to state i. The proportion of time spent in state i after k returns is
  v_i(k) = k / Σ_{l=1}^k T_i(l)
and ergodicity means
  lim_{k→∞} v_i(k) = π_i, for i = 1, 2, ..., K

11.3 Markov Chains (6/9)
Convergence to stationary distributions. Consider an ergodic Markov chain with a stochastic matrix P, and let π^(n-1) be the state vector of the chain at time n-1. The state vector at time n is
  π^(n) = π^(n-1) P
By iteration we obtain
  π^(n) = π^(n-1) P = π^(n-2) P^2 = π^(n-3) P^3 = ... = π^(0) P^n
where π^(0) is the initial value. As n → ∞, P^n converges to a matrix whose K identical rows are each the stationary vector π = (π_1, ..., π_K).
Ergodic theorem:
1. lim_{n→∞} p_ij^(n) = π_j for all i
2. π_j > 0 for all j
3. Σ_{j=1}^K π_j = 1
4. π_j = Σ_{i=1}^K π_i p_ij, for j = 1, 2, ..., K
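The iteration π^(n) = π^(n-1) P can be run directly. In this sketch (the two-state matrix is an illustrative choice), the limit is independent of the initial vector π^(0) and satisfies property 4 of the ergodic theorem:

```python
import numpy as np

# Power iteration pi^(n) = pi^(n-1) P for an ergodic two-state chain.
P = np.array([[0.50, 0.50],
              [0.25, 0.75]])        # illustrative stochastic matrix

pi = np.array([1.0, 0.0])           # arbitrary initial distribution pi^(0)
for _ in range(100):
    pi = pi @ P                     # one step of the chain

# pi has converged to the stationary vector, which solves pi = pi P;
# for this matrix that is pi = (1/3, 2/3).
```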

11.3 Markov Chains (6/9)
Figure 11.2: State-transition diagram of the Markov chain for Example 1. The states x_1 and x_2 may be identified as up-to-date and behind, respectively. The example iterates π^(1) = π^(0) P from an initial vector π^(0) and computes the powers P^(2), P^(3), P^(4), showing convergence toward the stationary distribution.

11.3 Markov Chains (7/9)
Figure 11.3: State-transition diagram of the Markov chain for Example 2. The stationary distribution follows from the balance equations
  π_j = Σ_{i=1}^K π_i p_ij
solved together with the normalization Σ_j π_j = 1.

11.3 Markov Chains (8/9)
Figure 11.4: Classification of the states of a Markov chain and their associated long-term behavior.

11.3 Markov Chains (9/9)
Principle of detailed balance: at thermal equilibrium, the rate of occurrence of any transition equals the corresponding rate of occurrence of the inverse transition:
  π_i p_ij = π_j p_ji
Application (stationarity of π): detailed balance implies that π is a stationary distribution, since
  Σ_{i=1}^K π_i p_ij = Σ_{i=1}^K π_j p_ji   (by detailed balance)
                     = π_j Σ_{i=1}^K p_ji = π_j   (since Σ_i p_ji = 1)
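Detailed balance is easy to verify numerically. This sketch uses a small three-state chain (the matrix and π are illustrative) and checks that detailed balance immediately gives stationarity:

```python
import numpy as np

# A reversible 3-state chain with its stationary distribution pi.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])

balance = pi[:, None] * P               # entries pi_i * p_ij
assert np.allclose(balance, balance.T)  # detailed balance: pi_i p_ij = pi_j p_ji
assert np.allclose(pi @ P, pi)          # hence pi is stationary
```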

11.4 Metropolis Algorithm (1/3)
The Metropolis algorithm is a stochastic algorithm for simulating the evolution of a physical system to thermal equilibrium. It is a modified Monte Carlo method and a Markov chain Monte Carlo (MCMC) method.
Algorithm (one step, starting from X_n = x_i):
1. Randomly generate a new state x_j.
2. Compute ΔE = E(x_j) - E(x_i).
3. If ΔE < 0, accept: X_{n+1} = x_j.
   Otherwise (ΔE ≥ 0), select a random number ξ ~ U[0, 1]:
   if ξ < exp(-ΔE / T), accept: X_{n+1} = x_j; else reject: X_{n+1} = x_i.
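The steps above can be sketched for a finite state space. This illustrative example (the three energy levels are an assumption, not from the text) targets the Gibbs distribution of those energies; the empirical visit frequencies approach π:

```python
import numpy as np

rng = np.random.default_rng(0)
energies = np.array([0.0, 1.0, 2.0])    # illustrative state energies
T = 1.0
K = len(energies)

state = 0
counts = np.zeros(K)
for n in range(200_000):
    proposal = rng.integers(K)          # symmetric random proposal
    dE = energies[proposal] - energies[state]
    # Accept downhill moves always; uphill moves with probability exp(-dE/T).
    if dE < 0 or rng.random() < np.exp(-dE / T):
        state = proposal
    counts[state] += 1

empirical = counts / counts.sum()
target = np.exp(-energies / T)
target /= target.sum()                  # the Gibbs distribution pi
```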

11.4 Metropolis Algorithm (2/3)
Choice of transition probabilities. Start from a proposed set of transition probabilities τ_ij satisfying:
1. τ_ij ≥ 0 for all i, j (nonnegativity)
2. Σ_j τ_ij = 1 for all i (normalization)
3. τ_ij = τ_ji for all i, j (symmetry)
The desired set of transition probabilities is, for j ≠ i,
  p_ij = τ_ij (π_j / π_i)  if π_j / π_i < 1
  p_ij = τ_ij              if π_j / π_i ≥ 1
together with the probability of staying put,
  p_ii = τ_ii + Σ_{j≠i} τ_ij (1 - α_ij)
where the moving probability is α_ij = min(1, π_j / π_i).

11.4 Metropolis Algorithm (3/3)
How do we choose the ratio π_j / π_i? We choose the probability distribution to which we want the Markov chain to converge to be a Gibbs distribution:
  π_j / π_i = exp(-ΔE / T), with π_j = (1/Z) exp(-E_j / T) and ΔE = E_j - E_i
Only energy differences are needed; the partition function Z cancels in the ratio.
Proof of detailed balance:
Case 1: ΔE < 0, so π_j > π_i. Then p_ij = τ_ij and p_ji = τ_ji (π_i / π_j), giving
  π_i p_ij = π_i τ_ij
  π_j p_ji = π_j (π_i / π_j) τ_ji = π_i τ_ji = π_i τ_ij
Case 2: ΔE > 0, so π_j < π_i. Then p_ij = τ_ij (π_j / π_i) and p_ji = τ_ji, giving
  π_i p_ij = π_j τ_ij
  π_j p_ji = π_j τ_ji = π_j τ_ij
In both cases π_i p_ij = π_j p_ji.

11.5 Simulated Annealing (1/3)
Simulated annealing is a stochastic relaxation technique for solving optimization problems that improves the computational efficiency of the Metropolis algorithm. It makes random moves on the energy surface. Since F = <E> - TH, we have lim_{T→0} F = <E>.
The idea is to operate a stochastic system at a high temperature (where convergence to equilibrium is fast) and then iteratively lower the temperature (at T = 0, the Markov chain collapses onto the global minima).
Two ingredients:
1. A schedule that determines the rate at which the temperature is lowered.
2. An algorithm, such as the Metropolis algorithm, that iteratively finds the equilibrium distribution at each new temperature in the schedule by using the final state of the system at the previous temperature as the starting point for the new temperature.

11.5 Simulated Annealing (2/3)
1. Initial value of the temperature. The initial value T_0 is chosen high enough to ensure that virtually all proposed transitions are accepted by the simulated-annealing algorithm.
2. Decrement of the temperature. Ordinarily, the cooling is performed exponentially, and the changes made in the value of the temperature are small. In particular, the decrement function is defined by
  T_k = α T_{k-1}, k = 1, 2, ..., K
where α is a constant smaller than, but close to, unity; typical values of α lie between 0.8 and 1. At each temperature, enough transitions are attempted so that there are 10 accepted transitions per experiment, on average.
3. Final value of the temperature. The system is fixed and annealing stops if the desired number of acceptances is not achieved at three successive temperatures.
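The recipe above (exponential cooling with α close to 1, several Metropolis trials per temperature) can be sketched compactly. The integer-valued objective below is an illustrative stand-in, not an example from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def energy(x):
    # Double-well over the integers: minima at x = 2 (E = 2) and x = -2 (E = -2).
    return (x ** 2 - 4) ** 2 + x

x = 10
T = 10.0                      # T_0: high enough that most moves are accepted
alpha = 0.9                   # decrement factor, smaller than but close to 1
best = x
for k in range(200):
    for _ in range(50):       # several Metropolis trials per temperature
        x_new = x + rng.integers(-1, 2)          # propose a +/-1 (or 0) step
        dE = energy(x_new) - energy(x)
        if dE < 0 or rng.random() < np.exp(-dE / T):
            x = x_new
        if energy(x) < energy(best):
            best = x
    T *= alpha                # exponential cooling T_k = alpha * T_{k-1}
```

At high T the walker crosses barriers freely; as T shrinks it freezes into a low-energy basin, and `best` records the lowest-energy state visited.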

11.5 Simulated Annealing (3/3)
Simulated annealing for combinatorial optimization.

11.6 Gibbs Sampling (1/2)
Gibbs sampling is an iterative adaptive scheme that generates a value for one component of the random vector X at a time, drawn from the conditional distribution of that component given all the others, rather than generating all components at once.
Let X = (X_1, X_2, ..., X_K) be a random vector of K components, and assume we know the conditional distributions P(X_k | X_{-k}), where X_{-k} = (X_1, ..., X_{k-1}, X_{k+1}, ..., X_K).
Gibbs sampling algorithm (Gibbs sampler):
1. Initialize x_1(0), x_2(0), ..., x_K(0).
2. For iteration i = 1, 2, ..., draw in turn
   x_1(i) ~ P(X_1 | x_2(i-1), x_3(i-1), ..., x_K(i-1))
   x_2(i) ~ P(X_2 | x_1(i), x_3(i-1), ..., x_K(i-1))
   x_3(i) ~ P(X_3 | x_1(i), x_2(i), x_4(i-1), ..., x_K(i-1))
   ...
   x_k(i) ~ P(X_k | x_1(i), ..., x_{k-1}(i), x_{k+1}(i-1), ..., x_K(i-1))
   ...
   x_K(i) ~ P(X_K | x_1(i), x_2(i), ..., x_{K-1}(i))
3. If the termination condition is not met, go to step 2.
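For K = 2 binary variables the sampler reduces to alternating draws from the two conditionals. The joint probability table below is an illustrative assumption:

```python
import numpy as np

# Gibbs sampler for two binary components with a known joint P(X1, X2);
# each sweep draws x1 from P(X1 | x2) and then x2 from P(X2 | x1).
rng = np.random.default_rng(2)

joint = np.array([[0.1, 0.2],      # joint[i, j] = P(X1 = i, X2 = j)
                  [0.3, 0.4]])

x1, x2 = 0, 0
counts = np.zeros((2, 2))
for n in range(100_000):
    p = joint[:, x2] / joint[:, x2].sum()      # P(X1 | X2 = x2)
    x1 = int(rng.random() < p[1])
    p = joint[x1, :] / joint[x1, :].sum()      # P(X2 | X1 = x1)
    x2 = int(rng.random() < p[1])
    counts[x1, x2] += 1

empirical = counts / counts.sum()  # converges to the joint distribution
```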

11.6 Gibbs Sampling (2/2)
1. Convergence theorem. The random variable X_k(n) converges in distribution to the true probability distribution of X_k as n approaches infinity; that is,
  lim_{n→∞} P(X_k^(n) ≤ x | x_k(0)) = P_{X_k}(x), for k = 1, 2, ..., K
where P_{X_k}(x) is the marginal cumulative distribution function of X_k.
2. Rate-of-convergence theorem. The joint cumulative distribution of the random variables X_1(n), X_2(n), ..., X_K(n) converges to the true joint cumulative distribution of X_1, X_2, ..., X_K at a geometric rate in n.
3. Ergodic theorem. For any measurable function g of the random variables X_1, X_2, ..., X_K whose expectation exists, we have
  lim_{n→∞} (1/n) Σ_{i=1}^n g(X_1(i), X_2(i), ..., X_K(i)) = E[g(X_1, X_2, ..., X_K)]
with probability 1 (i.e., almost surely).

11.7 Boltzmann Machine (1/5)
The Boltzmann machine (BM) is a stochastic machine consisting of stochastic neurons with symmetric synaptic connections.
x: state vector of the BM; w_ji: synaptic connection from neuron i to neuron j.
Structure (weights): w_ji = w_ij for all i, j, and w_ii = 0 for all i.
Figure 11.5: Architectural graph of the Boltzmann machine; K is the number of visible neurons, and L is the number of hidden neurons. The distinguishing features of the machine are: (1) the connections between the visible and hidden neurons are symmetric; (2) the symmetric connections also exist among the visible neurons themselves and among the hidden neurons themselves.
Energy: E(x) = -(1/2) Σ_j Σ_{i, i≠j} w_ji x_i x_j
Probability: P(X = x) = (1/Z) exp(-E(x) / T)

11.7 Boltzmann Machine (2/5)
Consider three events:
  A: X_j = x_j
  B: {X_i = x_i}_{i=1}^K with i ≠ j
  C: {X_i = x_i}_{i=1}^K
The joint event B excludes A, and the joint event C includes both A and B. Then
  P(C) = P(A, B) = (1/Z) exp[(1/2T) Σ_j Σ_{i, i≠j} w_ji x_i x_j]
  P(B) = Σ_A P(A, B) = (1/Z) Σ_{x_j} exp[(1/2T) Σ_j Σ_{i, i≠j} w_ji x_i x_j]
Isolating the component of the exponent that involves x_j and dividing, the 1/2 cancels (x_j appears twice by symmetry), giving
  P(A | B) = P(A, B) / P(B) = 1 / (1 + exp(-(x_j / T) Σ_{i, i≠j} w_ji x_i))
that is,
  P(X_j = x_j | {X_i = x_i}_{i≠j}) = φ((x_j / T) Σ_{i, i≠j} w_ji x_i)
with the sigmoid function φ(v) = 1 / (1 + exp(-v)).
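The conditional above is exactly a coin flip with sigmoid probability. A sketch with two bipolar (±1) neurons and an illustrative symmetric weight matrix (zero diagonal, as the machine requires):

```python
import numpy as np

# One Gibbs update of a Boltzmann-machine neuron with states in {-1, +1}:
# P(X_j = +1 | rest) = phi(v_j / T), where v_j = sum_{i != j} w_ji x_i.
rng = np.random.default_rng(3)

def phi(v):
    return 1.0 / (1.0 + np.exp(-v))

def update_neuron(x, W, j, T):
    """Resample component j of the bipolar state vector x; returns x."""
    v = W[j] @ x - W[j, j] * x[j]     # local field, excluding self (w_jj = 0)
    x[j] = 1 if rng.random() < phi(v / T) else -1
    return x

W = np.array([[0.0, 1.0],             # symmetric weights, zero diagonal
              [1.0, 0.0]])

# With x_0 clamped to +1 and T = 1, neuron 1 is resampled to +1 with
# probability phi(1) ~ 0.73; check the empirical frequency.
frac = np.mean([update_neuron(np.array([1, -1]), W, j=1, T=1.0)[1] == 1
                for _ in range(50_000)])
```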

11.7 Boltzmann Machine (3/5)
Figure 11.6: Sigmoid-shaped function P(v).
Two phases of operation:
1. Positive phase. In this phase, the network operates in its clamped condition (i.e., under the direct influence of the training set I).
2. Negative phase. In this second phase, the network is allowed to run freely, and therefore with no environmental input.
Log-likelihood of the training set (assuming independent examples):
  L(w) = log Π_{x_α ∈ I} P(X_α = x_α) = Σ_{x_α ∈ I} log P(X_α = x_α)

11.7 Boltzmann Machine (4/5)
x_α: the state of the visible neurons (a subset of x); x_β: the state of the hidden neurons (a subset of x).
Probability of the visible state:
  P(X_α = x_α) = (1/Z) Σ_{x_β} exp(-E(x) / T),  Z = Σ_x exp(-E(x) / T)
Log-likelihood function given the training data I:
  L(w) = Σ_{x_α ∈ I} [log Σ_{x_β} exp(-E(x) / T) - log Σ_x exp(-E(x) / T)]
Derivative of the log-likelihood function:
  ∂L(w)/∂w_ji = (1/T) Σ_{x_α ∈ I} [Σ_{x_β} P(X_β = x_β | X_α = x_α) x_j x_i - Σ_x P(X = x) x_j x_i]

11.7 Boltzmann Machine (5/5)
Mean correlation in the positive phase (clamped):
  ρ⁺_ji = <x_j x_i>⁺ = Σ_{x_α ∈ I} Σ_{x_β} P(X_β = x_β | X_α = x_α) x_j x_i
Mean correlation in the negative phase (free-running):
  ρ⁻_ji = <x_j x_i>⁻ = Σ_{x_α ∈ I} Σ_x P(X = x) x_j x_i
Thus, we may write
  ∂L(w)/∂w_ji = (1/T)(ρ⁺_ji - ρ⁻_ji)
Gradient ascent to maximize L(w) gives the Boltzmann machine learning rule:
  Δw_ji = η ∂L(w)/∂w_ji = η'(ρ⁺_ji - ρ⁻_ji), with η' = η / T

11.8 Logistic Belief Nets
A logistic belief net is a stochastic machine consisting of multiple layers of stochastic neurons with directed synaptic connections.
Parents of node j: pa(X_j) ⊆ {X_1, X_2, ..., X_{j-1}}.
Conditional probability:
  P(X_j = x_j | X_1 = x_1, ..., X_{j-1} = x_j-1) = P(X_j = x_j | pa(X_j))
Figure 11.7: Directed (logistic) belief network.
The calculation of the conditional probabilities uses weights that satisfy:
1. w_ji = 0 for all X_i ∉ pa(X_j)
2. w_ji = 0 for i ≥ j (acyclicity)
Weight update rule (gradient ascent on the log-likelihood): Δw_ji = η ∂L(w)/∂w_ji

11.9 Deep Belief Nets (1/4)
Maximum-likelihood learning in a restricted Boltzmann machine (RBM).
Sequential pre-training alternates two updates:
1. Update the hidden states h in parallel, given the visible states x.
2. Do the same in reverse: update the visible states x in parallel, given the hidden states h.
Maximum-likelihood learning:
  ∂L(w)/∂w_ji = ρ_ji^(0) - ρ_ji^(∞)
Figure 11.8: Neural structure of the restricted Boltzmann machine (RBM). Contrasting this with Fig. 11.5, we see that unlike the Boltzmann machine, the RBM has no connections among the visible neurons and none among the hidden neurons.
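The alternating updates can be sketched as a contrastive-divergence-style learning loop (CD-1, which approximates ρ^(∞) after a single reconstruction step). The binary {0, 1} units, layer sizes, toy data, and learning rate below are illustrative assumptions; biases are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

K, L = 6, 3                                   # visible, hidden units
W = 0.01 * rng.standard_normal((L, K))
W0 = W.copy()                                 # keep the initial weights for comparison
data = rng.integers(0, 2, size=(20, K)).astype(float)

eta = 0.1
for epoch in range(10):
    for x in data:
        ph = sigmoid(W @ x)                   # positive phase: P(h | x), x clamped
        h = (rng.random(L) < ph).astype(float)
        px = sigmoid(W.T @ h)                 # negative phase: reconstruct x ...
        x_neg = (rng.random(K) < px).astype(float)
        ph_neg = sigmoid(W @ x_neg)           # ... then the hidden probabilities
        # CD-1 update: positive correlations minus one-step negative correlations.
        W += eta * (np.outer(ph, x) - np.outer(ph_neg, x_neg))
```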

11.9 Deep Belief Nets (2/4)
Figure 11.9: Top-down learning, using a logistic belief network of infinite depth.
Figure 11.10: A hybrid generative model in which the two top layers form a restricted Boltzmann machine and the lower two layers form a directed model. The weights shown with blue shaded arrows are not part of the generative model; they are used to infer the feature values given the data, but they are not used for generating data.

11.9 Deep Belief Nets (3/4)
Figure 11.11: Illustrating the progression of alternating Gibbs sampling in an RBM. After sufficiently many steps, the visible and hidden vectors are sampled from the stationary distribution defined by the current parameters of the model.

11.9 Deep Belief Nets (4/4)
Figure 11.12: The task of modeling the sensory (visible) data is divided into two subtasks.

11.10 Deterministic Annealing (1/5)
Deterministic annealing incorporates randomness into the energy function, which is then deterministically optimized at a sequence of decreasing temperatures (cf. simulated annealing, which makes random moves on the energy surface).
Clustering via deterministic annealing:
x: source (input) vector; y: reconstruction (output) vector.
Distortion measure: d(x, y) = ||x - y||²
Expected distortion:
  D = Σ_x Σ_y P(X = x, Y = y) d(x, y) = Σ_x P(X = x) Σ_y P(Y = y | X = x) d(x, y)
where P(X = x, Y = y) = P(Y = y | X = x) P(X = x), and P(Y = y | X = x) is the association probability.

11.10 Deterministic Annealing (2/5)
Entropy as the randomness measure (Table 11.2):
  H(X, Y) = -Σ_x Σ_y P(X = x, Y = y) log P(X = x, Y = y)
Constrained optimization of D as minimization of the Lagrangian:
  F = D - TH
The joint entropy splits as H(X, Y) = H(X) (source entropy) + H(Y | X) (conditional entropy), with
  H(Y | X) = -Σ_x P(X = x) Σ_y P(Y = y | X = x) log P(Y = y | X = x)
Minimizing F with respect to the association probabilities yields the Gibbs form
  P(Y = y | X = x) = (1/Z_x) exp(-d(x, y) / T),  Z_x = Σ_y exp(-d(x, y) / T)
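The association probabilities above, combined with the centroid condition of the next slide, give the deterministic-annealing clustering loop. The 1-D two-cluster data set, cooling schedule, and symmetric initialization in this sketch are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two well-separated 1-D clusters (illustrative data).
X = np.concatenate([rng.normal(-3, 0.5, (50, 1)),
                    rng.normal(+3, 0.5, (50, 1))])
Y = np.array([[-0.1], [0.1]])            # two code vectors, split near the origin

for T in [10.0, 3.0, 1.0, 0.3, 0.1]:     # lower the temperature
    for _ in range(20):
        d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # d(x, y)
        d = d - d.min(axis=1, keepdims=True)                 # stability shift
        P = np.exp(-d / T)
        P /= P.sum(axis=1, keepdims=True)    # association probabilities P(y | x)
        # Minimizing condition: each y is the P-weighted centroid of the data.
        Y = (P.T @ X) / P.T.sum(axis=1, keepdims=True)
```

At high T both code vectors sit near the global mean; as T is lowered the solution splits, and at low T the code vectors track the two cluster means.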

11.10 Deterministic Annealing (3/5)
Substituting the optimal association probabilities gives
  F* = min_{P(Y=y|X=x)} F = -T Σ_x P(X = x) log Z_x
Setting ∂F*/∂y = 0 yields
  Σ_x P(X = x, Y = y) ∇_y d(x, y) = 0, for every code vector y
so, with P(X = x) = 1/N for a training set of N samples, the minimizing condition is
  (1/N) Σ_x P(Y = y | X = x) ∇_y d(x, y) = 0, for every code vector y
The deterministic-annealing algorithm consists of minimizing the Lagrangian F* with respect to the code vectors at a high value of the temperature T and then tracking the minimum while the temperature T is lowered.

11.10 Deterministic Annealing (4/5)
Figure 11.13: Clustering at various phases, at increasing values of B = 1/T. The lines are equiprobability contours, p = ½ in (b) and p = ⅓ elsewhere: (a) 1 cluster, (b) 2 clusters, (c) 3 clusters, (d) 4 clusters, (e) 5 clusters, (f) 6 clusters, and (g) 19 clusters.

11.10 Deterministic Annealing (5/5)
Figure 11.14: Phase diagram for the case study in deterministic annealing, with B = 1/T. The number of effective clusters is shown for each phase.

11.11 Analogy of DA with EM (1/2)
Suppose we view the association probability P(Y = y | X = x) as the expected value of a random binary variable V_xy defined as
  V_xy = 1 if the source vector x is assigned to code vector y, and 0 otherwise.
Then the two steps of DA correspond to the two steps of EM:
1. Step 1 of DA (= E-step of EM): compute the association probabilities P(Y = y | X = x).
2. Step 2 of DA (= M-step of EM): optimize the distortion measure d(x, y).
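The analogy can be made concrete with ordinary EM for a two-component 1-D Gaussian mixture (unit variances and equal priors are assumed for brevity): the E-step computes soft association probabilities, and the M-step re-estimates the code vectors (means):

```python
import numpy as np

rng = np.random.default_rng(6)
# Illustrative data: two 1-D Gaussian clusters centered at -2 and +2.
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(+2, 1, 200)])

mu = np.array([-0.5, 0.5])            # initial means (code vectors)
for _ in range(50):
    # E-step: responsibilities r[n, k] = P(component k | x_n).
    logw = -0.5 * (x[:, None] - mu[None, :]) ** 2
    r = np.exp(logw - logw.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means (the centroid condition).
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
```

The E-step here plays the role of computing the DA association probabilities at a fixed "temperature" (the variance), and the M-step plays the role of minimizing the expected distortion.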

11.11 Analogy of DA with EM (2/2)
Expectation-Maximization (EM) algorithm.
r: complete data, including the missing data z; d = d(r): incomplete data.
Conditional pdf of r given the parameter vector θ:
  p_D(d | θ) = ∫_{R(d)} p_c(r | θ) dr
where R(d) is the subspace of R determined by d = d(r).
Incomplete-data log-likelihood function: L(θ) = log p_D(d | θ)
Complete-data log-likelihood function: L_c(θ) = log p_c(r | θ)
Let θ̂(n) denote the value of θ at iteration n of EM.
1. E-step: Q(θ, θ̂(n)) = E_{θ̂(n)}[L_c(θ)]
2. M-step: θ̂(n+1) = arg max_θ Q(θ, θ̂(n))
After an iteration of the EM algorithm, the incomplete-data log-likelihood function is not decreased:
  L(θ̂(n+1)) ≥ L(θ̂(n)), for n = 0, 1, 2, ...

Summary and Discussion
- Statistical mechanics as a mathematical basis for the formulation of stochastic simulation, optimization, and learning:
  1. Metropolis algorithm
  2. Simulated annealing
  3. Gibbs sampling
- Stochastic learning machines:
  1. (Classical) Boltzmann machine
  2. Restricted Boltzmann machine (RBM)
  3. Deep belief nets (DBN)
- Deterministic annealing (DA):
  1. For optimization: connection to simulated annealing (SA)
  2. For clustering: connection to expectation-maximization (EM)


More information

Deep Boltzmann Machines

Deep Boltzmann Machines Deep Boltzmann Machines Ruslan Salakutdinov and Geoffrey E. Hinton Amish Goel University of Illinois Urbana Champaign agoel10@illinois.edu December 2, 2016 Ruslan Salakutdinov and Geoffrey E. Hinton Amish

More information

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions - Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions Simon Luo The University of Sydney Data61, CSIRO simon.luo@data61.csiro.au Mahito Sugiyama National Institute of

More information

Mobile Robot Localization

Mobile Robot Localization Mobile Robot Localization 1 The Problem of Robot Localization Given a map of the environment, how can a robot determine its pose (planar coordinates + orientation)? Two sources of uncertainty: - observations

More information

7.1 Basis for Boltzmann machine. 7. Boltzmann machines

7.1 Basis for Boltzmann machine. 7. Boltzmann machines 7. Boltzmann machines this section we will become acquainted with classical Boltzmann machines which can be seen obsolete being rarely applied in neurocomputing. It is interesting, after all, because is

More information

( ) ( ) ( ) ( ) Simulated Annealing. Introduction. Pseudotemperature, Free Energy and Entropy. A Short Detour into Statistical Mechanics.

( ) ( ) ( ) ( ) Simulated Annealing. Introduction. Pseudotemperature, Free Energy and Entropy. A Short Detour into Statistical Mechanics. Aims Reference Keywords Plan Simulated Annealing to obtain a mathematical framework for stochastic machines to study simulated annealing Parts of chapter of Haykin, S., Neural Networks: A Comprehensive

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

Energy Based Models. Stefano Ermon, Aditya Grover. Stanford University. Lecture 13

Energy Based Models. Stefano Ermon, Aditya Grover. Stanford University. Lecture 13 Energy Based Models Stefano Ermon, Aditya Grover Stanford University Lecture 13 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 13 1 / 21 Summary Story so far Representation: Latent

More information

Course 495: Advanced Statistical Machine Learning/Pattern Recognition

Course 495: Advanced Statistical Machine Learning/Pattern Recognition Course 495: Advanced Statistical Machine Learning/Pattern Recognition Lecturer: Stefanos Zafeiriou Goal (Lectures): To present discrete and continuous valued probabilistic linear dynamical systems (HMMs

More information

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26 Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar

More information

Implementation of a Restricted Boltzmann Machine in a Spiking Neural Network

Implementation of a Restricted Boltzmann Machine in a Spiking Neural Network Implementation of a Restricted Boltzmann Machine in a Spiking Neural Network Srinjoy Das Department of Electrical and Computer Engineering University of California, San Diego srinjoyd@gmail.com Bruno Umbria

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Markov Chain Monte Carlo Methods

Markov Chain Monte Carlo Methods Markov Chain Monte Carlo Methods p. /36 Markov Chain Monte Carlo Methods Michel Bierlaire michel.bierlaire@epfl.ch Transport and Mobility Laboratory Markov Chain Monte Carlo Methods p. 2/36 Markov Chains

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

Gradient Estimation for Attractor Networks

Gradient Estimation for Attractor Networks Gradient Estimation for Attractor Networks Thomas Flynn Department of Computer Science Graduate Center of CUNY July 2017 1 Outline Motivations Deterministic attractor networks Stochastic attractor networks

More information

Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines

Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines KyungHyun Cho, Tapani Raiko, Alexander Ilin Abstract A new interest towards restricted Boltzmann machines (RBMs) has risen due

More information

Probabilistic Graphical Models Lecture Notes Fall 2009

Probabilistic Graphical Models Lecture Notes Fall 2009 Probabilistic Graphical Models Lecture Notes Fall 2009 October 28, 2009 Byoung-Tak Zhang School of omputer Science and Engineering & ognitive Science, Brain Science, and Bioinformatics Seoul National University

More information

Inference in Bayesian Networks

Inference in Bayesian Networks Andrea Passerini passerini@disi.unitn.it Machine Learning Inference in graphical models Description Assume we have evidence e on the state of a subset of variables E in the model (i.e. Bayesian Network)

More information

Knowledge Extraction from DBNs for Images

Knowledge Extraction from DBNs for Images Knowledge Extraction from DBNs for Images Son N. Tran and Artur d Avila Garcez Department of Computer Science City University London Contents 1 Introduction 2 Knowledge Extraction from DBNs 3 Experimental

More information

Graphical Models and Kernel Methods

Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

More information

CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash

CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash Equilibrium Price of Stability Coping With NP-Hardness

More information

Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models Introduction to Probabilistic Graphical Models Kyu-Baek Hwang and Byoung-Tak Zhang Biointelligence Lab School of Computer Science and Engineering Seoul National University Seoul 151-742 Korea E-mail: kbhwang@bi.snu.ac.kr

More information

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables Revised submission to IEEE TNN Aapo Hyvärinen Dept of Computer Science and HIIT University

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

3 : Representation of Undirected GM

3 : Representation of Undirected GM 10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:

More information

CS145: Probability & Computing Lecture 18: Discrete Markov Chains, Equilibrium Distributions

CS145: Probability & Computing Lecture 18: Discrete Markov Chains, Equilibrium Distributions CS145: Probability & Computing Lecture 18: Discrete Markov Chains, Equilibrium Distributions Instructor: Erik Sudderth Brown University Computer Science April 14, 215 Review: Discrete Markov Chains Some

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines

Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines Christopher Tosh Department of Computer Science and Engineering University of California, San Diego ctosh@cs.ucsd.edu Abstract The

More information

Lecture 8: Bayesian Networks

Lecture 8: Bayesian Networks Lecture 8: Bayesian Networks Bayesian Networks Inference in Bayesian Networks COMP-652 and ECSE 608, Lecture 8 - January 31, 2017 1 Bayes nets P(E) E=1 E=0 0.005 0.995 E B P(B) B=1 B=0 0.01 0.99 E=0 E=1

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Does the Wake-sleep Algorithm Produce Good Density Estimators?

Does the Wake-sleep Algorithm Produce Good Density Estimators? Does the Wake-sleep Algorithm Produce Good Density Estimators? Brendan J. Frey, Geoffrey E. Hinton Peter Dayan Department of Computer Science Department of Brain and Cognitive Sciences University of Toronto

More information

RANDOM TOPICS. stochastic gradient descent & Monte Carlo

RANDOM TOPICS. stochastic gradient descent & Monte Carlo RANDOM TOPICS stochastic gradient descent & Monte Carlo MASSIVE MODEL FITTING nx minimize f(x) = 1 n i=1 f i (x) Big! (over 100K) minimize 1 least squares 2 kax bk2 = X i 1 2 (a ix b i ) 2 minimize 1 SVM

More information

Approximate inference in Energy-Based Models

Approximate inference in Energy-Based Models CSC 2535: 2013 Lecture 3b Approximate inference in Energy-Based Models Geoffrey Hinton Two types of density model Stochastic generative model using directed acyclic graph (e.g. Bayes Net) Energy-based

More information

for Global Optimization with a Square-Root Cooling Schedule Faming Liang Simulated Stochastic Approximation Annealing for Global Optim

for Global Optimization with a Square-Root Cooling Schedule Faming Liang Simulated Stochastic Approximation Annealing for Global Optim Simulated Stochastic Approximation Annealing for Global Optimization with a Square-Root Cooling Schedule Abstract Simulated annealing has been widely used in the solution of optimization problems. As known

More information

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 009 Mark Craven craven@biostat.wisc.edu Sequence Motifs what is a sequence

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Introduction to Computational Biology Lecture # 14: MCMC - Markov Chain Monte Carlo

Introduction to Computational Biology Lecture # 14: MCMC - Markov Chain Monte Carlo Introduction to Computational Biology Lecture # 14: MCMC - Markov Chain Monte Carlo Assaf Weiner Tuesday, March 13, 2007 1 Introduction Today we will return to the motif finding problem, in lecture 10

More information

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions

More information

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo Group Prof. Daniel Cremers 11. Sampling Methods: Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

Neural Networks. William Cohen [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ]

Neural Networks. William Cohen [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ] Neural Networks William Cohen 10-601 [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ] WHAT ARE NEURAL NETWORKS? William s notation Logis;c regression + 1

More information

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient

More information

Hopfield Networks and Boltzmann Machines. Christian Borgelt Artificial Neural Networks and Deep Learning 296

Hopfield Networks and Boltzmann Machines. Christian Borgelt Artificial Neural Networks and Deep Learning 296 Hopfield Networks and Boltzmann Machines Christian Borgelt Artificial Neural Networks and Deep Learning 296 Hopfield Networks A Hopfield network is a neural network with a graph G = (U,C) that satisfies

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 9: Expectation Maximiation (EM) Algorithm, Learning in Undirected Graphical Models Some figures courtesy

More information

Recap. Probability, stochastic processes, Markov chains. ELEC-C7210 Modeling and analysis of communication networks

Recap. Probability, stochastic processes, Markov chains. ELEC-C7210 Modeling and analysis of communication networks Recap Probability, stochastic processes, Markov chains ELEC-C7210 Modeling and analysis of communication networks 1 Recap: Probability theory important distributions Discrete distributions Geometric distribution

More information

Notes on Machine Learning for and

Notes on Machine Learning for and Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Lecture 11: Introduction to Markov Chains. Copyright G. Caire (Sample Lectures) 321

Lecture 11: Introduction to Markov Chains. Copyright G. Caire (Sample Lectures) 321 Lecture 11: Introduction to Markov Chains Copyright G. Caire (Sample Lectures) 321 Discrete-time random processes A sequence of RVs indexed by a variable n 2 {0, 1, 2,...} forms a discretetime random process

More information

Mobile Robot Localization

Mobile Robot Localization Mobile Robot Localization 1 The Problem of Robot Localization Given a map of the environment, how can a robot determine its pose (planar coordinates + orientation)? Two sources of uncertainty: - observations

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Lecture 15: MCMC Sanjeev Arora Elad Hazan. COS 402 Machine Learning and Artificial Intelligence Fall 2016

Lecture 15: MCMC Sanjeev Arora Elad Hazan. COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 15: MCMC Sanjeev Arora Elad Hazan COS 402 Machine Learning and Artificial Intelligence Fall 2016 Course progress Learning from examples Definition + fundamental theorem of statistical learning,

More information