Mathematics for Artificial Intelligence

1 Mathematics for Artificial Intelligence Reading Course Elena Agliari Dipartimento di Matematica Sapienza Università di Roma

2 (TENTATIVE) PLAN OF THE COURSE
Introduction
Chapter 1: Basics of statistical mechanics. The Curie-Weiss model
Chapter 2: Neural networks for associative memory and pattern recognition
Chapter 3: The Hopfield model. Hopfield model with low load and solution via log-constrained entropy. Self-average, spurious states, phase diagram. Hopfield model with high load and solution via stochastic stability
Chapter 4: Beyond the Hebbian paradigm
Chapter 5: A gentle introduction to machine learning. Maximum likelihood and Bayesian learning. Concepts and preliminaries in machine learning
Chapter 6: Neural networks for statistical learning and feature discovery. Rosenblatt and Minsky & Papert perceptrons. Restricted Boltzmann machines
Chapter 7: A few remarks on deep learning, complex patterns, and outlooks. Multilayered Boltzmann machines and deep learning. Mapping restricted Boltzmann machines and Hopfield networks
Seminars: Numerical tools for machine learning; Non-mean-field neural networks; (Bio-)Logic gates; Maximum entropy approach; Hamilton-Jacobi techniques for mean-field models

3 (Restricted) Boltzmann machines
A Boltzmann machine (BM) is a type of stochastic recurrent neural network. Assuming symmetric couplings among units, detailed balance holds and the invariant measure is of the Boltzmann-Gibbs type, after which these machines are named. The units in the BM are divided into visible units and hidden units. The visible units are those that receive information from the environment. Learning is impractical in general Boltzmann machines, yet it can be made quite efficient in a restricted architecture, the Restricted Boltzmann Machine (RBM), which does not allow intralayer connections. RBMs were initially invented under the name Harmonium by Paul Smolensky in 1986, and rose to prominence after Geoffrey Hinton and collaborators invented fast learning algorithms for them in the mid-2000s. After training one RBM, the activities of its hidden units can be treated as data for training a higher-level RBM. This method of stacking RBMs makes it possible to train many layers of hidden units efficiently and is one of the most common deep learning strategies. As each new layer is added the overall generative model gets better.

4 Restricted Boltzmann machines
Three layers composed of N+K+M binary neurons
Interaction matrix (symmetric, zero diagonal)

5 Hidden layer: each neuron $\sigma_j$ corresponds to a feature, set on/off according to whether the sampled individual matches feature j. Ex.: j=1 -> Thriller, j=2 -> SF, ..., j=4 -> Romance
Output layer: each neuron $y_k$ provides the likelihood of interest of the sampled individual towards an item k. Ex.: k=1 -> Asimov, k=2 -> Brontë, ..., k=4 -> King
Input layer: each neuron $x_i$ corresponds to a movie, set on/off according to whether the sampled individual likes movie i. Ex.: i=1 -> Harry Potter, i=2 -> Avatar, ..., i=4 -> Titanic

6 Stochastic dynamics for neurons
The parameters $J_{ij}$ give the magnitude of the pairwise coupling between neurons (spins), the parameters $\theta_i$ determine the firing thresholds of the neurons, and the parameter $\beta$ controls the degree of randomness in the dynamics. For $\beta = 0$ the dynamics just assigns random values to the states of the updated neurons. For $\beta \to \infty$ the candidate neurons align their states strictly to the sign of the local fields.
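The update formula of this slide did not survive the transcription, so the following is only a minimal sketch under an assumption: the common Glauber form, in which the updated neuron takes the value +1 with probability $\frac{1}{2}[1 + \tanh(\beta h_i)]$, with local field $h_i = \sum_j J_{ij} s_j + \theta_i$. The function names `local_field` and `glauber_step` are illustrative and not taken from the course.

```python
import math
import random

def local_field(J, theta, s, i):
    """Local field h_i = sum_j J[i][j] * s[j] + theta[i] (J has zero diagonal)."""
    return sum(J[i][j] * s[j] for j in range(len(s)) if j != i) + theta[i]

def glauber_step(J, theta, s, beta, rng):
    """Update one randomly chosen neuron in place and return its index.

    Assumed rule: P(s_i = +1) = (1 + tanh(beta * h_i)) / 2.
    beta = 0 gives a fair coin; large beta follows the sign of h_i.
    """
    i = rng.randrange(len(s))
    h = local_field(J, theta, s, i)
    p_up = 0.5 * (1.0 + math.tanh(beta * h))
    s[i] = 1 if rng.random() < p_up else -1
    return i

rng = random.Random(0)
J = [[0, 1, -1], [1, 0, 1], [-1, 1, 0]]   # symmetric couplings, zero diagonal
theta = [0.0, 0.0, 0.0]
s = [1, -1, 1]
for _ in range(100):
    glauber_step(J, theta, s, beta=1.0, rng=rng)
print(s)
```

With $\beta = 0$ every update is a fair coin flip, while for large $\beta$ the neuron follows the sign of its local field, which matches the two limits described on the slide.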

7 Stochastic dynamics for neurons
Remark: focus on the probability $p_t(s)$ to find the system in a certain state s at a certain time t.
Remark: we have the freedom to update only a subset S of the neurons.
Prop.: the unique stationary probability distribution of the neuronal dynamics formulated above is
$$p_\infty(s) = \frac{1}{Z} e^{-\beta H(s)}, \qquad H(s) = -\frac{1}{2}\sum_{ij} J_{ij} s_i s_j - \sum_i \theta_i s_i .$$

8 [Figure panels: "x is clamped" and "x & y are clamped"]

9 Task: learn a prescribed target joint input-output probability distribution q(x,y).
The system has accomplished this when $p_\infty(x,y)$ (i.e., the equilibrium input-output probability distribution of the network) equals the target distribution q(x,y).
Performance measure: the distance between q(x,y) and $p_\infty(x,y)$, measured by the Kullback-Leibler distance $D(q\|p_\infty)$:
1. Positive definite
2. $D(q\|p_\infty) = 0$ iff $q(x,y) = p_\infty(x,y)$
3. Additive for independent distributions
4. Convex in the pair (q, p)
If your data comes from probability distribution Q, but you use a compression scheme optimised for P, $D(Q\|P)$ is the number of extra bits you'll require to store a record of each sample from Q.
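As a small illustration of the "extra bits" reading, the sketch below evaluates the standard definition $D(q\|p) = \sum_x q(x)\log_2[q(x)/p(x)]$ on a toy pair of distributions. The function name `kl_divergence_bits` and the example numbers are illustrative assumptions.

```python
import math

def kl_divergence_bits(q, p):
    """D(q||p) = sum_x q(x) * log2(q(x) / p(x)), in bits.

    Terms with q(x) = 0 contribute nothing; if q(x) > 0 where p(x) = 0
    the distance is infinite.
    """
    total = 0.0
    for qx, px in zip(q, p):
        if qx > 0.0:
            if px == 0.0:
                return math.inf
            total += qx * math.log2(qx / px)
    return total

q = [0.5, 0.25, 0.25]   # toy target distribution
p = [0.25, 0.25, 0.5]   # toy model distribution
print(kl_divergence_bits(q, p))  # 0.5*1 + 0.25*0 + 0.25*(-1) = 0.25 bits
print(kl_divergence_bits(q, q))  # 0.0: the distance vanishes iff q = p
```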

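10 Gradient descent learning rule
$$\Delta\lambda = -\epsilon\,\frac{\partial}{\partial\lambda} D(q\|p_\infty),$$
where $\lambda$ is any parameter in the system and $\epsilon > 0$ is the learning rate. This ensures that $D(q\|p_\infty)$ decreases monotonically until a stationary state is reached, which could, but need not, correspond to $D(q\|p_\infty) = 0$. Thus, we need
$$\frac{\partial}{\partial\lambda} D(q\|p_\infty) = -\frac{1}{\log 2}\sum_{x,y} q(x,y)\,\frac{\partial}{\partial\lambda}\log p_\infty(x,y).$$
Details of the calculation depend on the network operation mode (A, B, C).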

11 A: Operation with all neurons freely evolving
Each individual modification step, to be carried out repeatedly, thus involves:
(i) Operate the neuronal dynamics with input and output neuron states (x,y) fixed until equilibrium is reached, and measure the averages $\langle s_i s_j\rangle_+$ and $\langle s_i\rangle_+$; repeat this for many combinations of (x,y) generated according to the desired joint distribution q(x,y).
(ii) Operate the neuronal dynamics with all neurons evolving freely until equilibrium is reached, and measure the averages $\langle s_i s_j\rangle_-$ and $\langle s_i\rangle_-$.
(iii) Insert the results of (i) and (ii) into the gradient descent rule and execute it.
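Steps (i) to (iii) can be sketched on a network small enough that equilibrium averages are computed by exact enumeration instead of by running the dynamics. This is a toy illustration under stated assumptions, not the course's implementation: two visible neurons and one hidden neuron, $\beta = 1$, a made-up target `Q`, and the standard Boltzmann-machine update $\Delta J_{ij} = \epsilon\beta(\langle s_i s_j\rangle_+ - \langle s_i s_j\rangle_-)$, $\Delta\theta_i = \epsilon\beta(\langle s_i\rangle_+ - \langle s_i\rangle_-)$.

```python
import itertools
import math

N_VIS, N = 2, 3            # two visible neurons plus one hidden neuron (toy sizes)
BETA, EPS = 1.0, 0.1       # inverse temperature and learning rate (assumed values)
STATES = list(itertools.product([-1, 1], repeat=N))
Q = {(1, 1): 0.4, (-1, -1): 0.4, (1, -1): 0.1, (-1, 1): 0.1}  # toy target q

def energy(s, J, theta):
    pair = sum(J[i][j] * s[i] * s[j] for i in range(N) for j in range(i + 1, N))
    return -pair - sum(theta[i] * s[i] for i in range(N))

def free_distribution(J, theta):
    """Equilibrium distribution with all neurons evolving freely."""
    w = [math.exp(-BETA * energy(s, J, theta)) for s in STATES]
    z = sum(w)
    return [x / z for x in w]

def visible_marginal(p):
    marg = {}
    for s, ps in zip(STATES, p):
        marg[s[:N_VIS]] = marg.get(s[:N_VIS], 0.0) + ps
    return marg

def clamped_distribution(p):
    """Visible neurons follow q; the hidden neuron equilibrates given them."""
    marg = visible_marginal(p)
    return [ps / marg[s[:N_VIS]] * Q[s[:N_VIS]] for s, ps in zip(STATES, p)]

def kl_to_target(p):
    """D(q || p) on the visible neurons, in nats."""
    marg = visible_marginal(p)
    return sum(qv * math.log(qv / marg[v]) for v, qv in Q.items())

def average(p, f):
    return sum(ps * f(s) for s, ps in zip(STATES, p))

def learning_step(J, theta):
    free = free_distribution(J, theta)        # step (ii): free averages
    clamped = clamped_distribution(free)      # step (i): clamped averages
    for i in range(N):                        # step (iii): apply the rule
        theta[i] += EPS * BETA * (average(clamped, lambda s: s[i])
                                  - average(free, lambda s: s[i]))
        for j in range(i + 1, N):
            d = (average(clamped, lambda s: s[i] * s[j])
                 - average(free, lambda s: s[i] * s[j]))
            J[i][j] += EPS * BETA * d
            J[j][i] = J[i][j]

J = [[0.0] * N for _ in range(N)]
J[0][2] = J[2][0] = 0.1    # small hidden coupling so the hidden neuron takes part
theta = [0.0] * N
kl_before = kl_to_target(free_distribution(J, theta))
for _ in range(500):
    learning_step(J, theta)
kl_after = kl_to_target(free_distribution(J, theta))
print(kl_before, kl_after)
```

Exact enumeration is only feasible for a handful of neurons; for realistic sizes the two sets of averages have to be estimated by sampling.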

12 B: Operation with hidden & output neurons freely evolving
Each individual modification step, to be carried out repeatedly, now involves:
(i) Operate the neuronal dynamics with input and output neuron states (x,y) fixed until equilibrium is reached, and measure the averages $\langle s_i s_j\rangle_+$ and $\langle s_i\rangle_+$; repeat this for many combinations of (x,y) generated according to the desired joint distribution q(x,y).
(ii) Operate the neuronal dynamics with all hidden and output neurons evolving freely until equilibrium is reached, and measure the averages $\langle s_i s_j\rangle_-$ and $\langle s_i\rangle_-$; repeat this for many input configurations x generated according to the desired distribution $q(x) = \sum_y q(x,y)$.
(iii) Insert the results of (i) and (ii) into the gradient descent rule and execute it.

13 Case A: learning is accomplished when the first and second moments of the free system correspond to those of the system with clamped input & output.
Case B: learning is accomplished when the first and second moments of the system with clamped input correspond to those of the system with clamped input & output.
Evaluating averages is complex!

14 A telegraphic introduction to Markov Chain Monte Carlo (MCMC) sampling
Evaluate the thermodynamic average by summing over all possible configurations: beyond reach if N is large.
Evaluate the thermodynamic average by summing over a subset of drawn configurations: good selections are unlikely.
Markov chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a number of steps is then used as a sample of the desired distribution. The quality of the sample improves as a function of the number of steps.
$$\sigma(0) \to \sigma(1) \to \dots \to \sigma(t) \to \sigma(t+1) \to \dots \to \sigma(t+M)$$
Here $\sigma(0)$ is a random initial configuration, $\sigma(t)$ is an equilibrium configuration, and the average is taken over the M samples that follow it:
$$\langle f(\sigma)\rangle \approx \frac{1}{M}\sum_{k=1}^{M} f(\sigma(t+k)).$$

15 Let p be the probability distribution (over a finite set X) you want to sample from.
Suppose you are in configuration $i \in X$ and there are r-1 possible new configurations you can reach in one move, that is, overall r possible outcomes (including retaining the current configuration).
Pick a neighbouring configuration $j \in X$ with probability 1/r.
If you pick configuration j and $p(j) \ge p(i)$, then deterministically go to j.
Otherwise, $p(j) < p(i)$, and you go to j with probability p(j)/p(i).
The probability weight $p_{ij}$ for the transition $i \to j$ is indeed a probability distribution for each configuration i and turns out to be
$$p_{ij} = \frac{1}{r}\min\left(1, \frac{p(j)}{p(i)}\right) \quad (j \ne i), \qquad p_{ii} = 1 - \sum_{j \ne i} p_{ij}.$$
Also, p is the stationary distribution for this chain, as one can prove by checking that detailed balance holds.
We could take a large sample $S \subset X$ via the solution to the sampling problem, and then compute the average value of f on that sample.
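The detailed-balance claim can be checked numerically on a toy chain. The sketch below builds the transition matrix $p_{ij} = \frac{1}{r}\min(1, p(j)/p(i))$ for four configurations arranged on a ring, so that every configuration has the same number of neighbours (an assumption that keeps the proposal symmetric). The target `p` and the helper name `metropolis_matrix` are illustrative.

```python
def metropolis_matrix(p, neighbours):
    """Transition matrix with P[i][j] = (1/r) * min(1, p[j] / p[i]) for each
    neighbour j of i, and P[i][i] = 1 - sum_{j != i} P[i][j]."""
    n = len(p)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r = len(neighbours[i]) + 1          # r outcomes, counting "stay at i"
        for j in neighbours[i]:
            P[i][j] = min(1.0, p[j] / p[i]) / r
        P[i][i] = 1.0 - sum(P[i][j] for j in neighbours[i])
    return P

p = [0.1, 0.2, 0.3, 0.4]                                    # toy target distribution
ring = {i: [(i - 1) % 4, (i + 1) % 4] for i in range(4)}    # two neighbours each
P = metropolis_matrix(p, ring)

# Detailed balance: p(i) * P[i][j] == p(j) * P[j][i] for every pair.
balanced = all(abs(p[i] * P[i][j] - p[j] * P[j][i]) < 1e-12
               for i in range(4) for j in range(4))
# Stationarity: one step of the chain leaves p unchanged.
one_step = [sum(p[i] * P[i][j] for i in range(4)) for j in range(4)]
print(balanced, one_step)
```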

16 Metropolis algorithm
With Boltzmann weights, $p(j)/p(i) = \exp(-E_j/T)/\exp(-E_i/T) = \exp(-\Delta E/T)$, so that
$$p_{ij} = \frac{1}{r}\min\left(1, \frac{p(j)}{p(i)}\right) = \frac{1}{r}\begin{cases} 1 & \text{if } \exp(-\Delta E/T) \ge 1 \text{ (energetically convenient)} \\ \exp(-\Delta E/T) & \text{if } \exp(-\Delta E/T) < 1 \text{ (energetically not convenient)} \end{cases} \qquad p_{ii} = 1 - \sum_{j \ne i} p_{ij}.$$
Initialize the system at an arbitrary point $\sigma_0$ and choose a probability that suggests a candidate configuration for the next step. Iterate the following steps:
1. Extract a candidate configuration $\sigma'$.
2. With $\sigma$ the current configuration, calculate the energy variation following the move, $\Delta E = E(\sigma') - E(\sigma)$.
3. If $\Delta E < 0$ accept the move; if $\Delta E > 0$ generate a uniform random number $u \in [0,1]$ and accept the move if $u \le \exp(-\Delta E/T)$, otherwise reject.
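The three steps above can be sketched for a small spin system. The choices below are assumptions made for illustration: candidates are single spin flips, the energy is $E(s) = -\frac{1}{2}\sum_{i \ne j} J_{ij} s_i s_j$ without thresholds, and the names `delta_energy`, `metropolis_step` and `estimate_average` are hypothetical.

```python
import math
import random

def energy(s, J):
    """E(s) = -1/2 * sum_{i != j} J[i][j] * s[i] * s[j]."""
    n = len(s)
    return -0.5 * sum(J[i][j] * s[i] * s[j]
                      for i in range(n) for j in range(n) if i != j)

def delta_energy(s, J, i):
    """Energy change caused by flipping spin i (J symmetric)."""
    return 2.0 * s[i] * sum(J[i][j] * s[j] for j in range(len(s)) if j != i)

def metropolis_step(s, J, T, rng):
    i = rng.randrange(len(s))        # 1. candidate: flip one random spin
    dE = delta_energy(s, J, i)       # 2. energy variation of the move
    if dE <= 0 or rng.random() <= math.exp(-dE / T):   # 3. accept or reject
        s[i] = -s[i]

def estimate_average(f, s, J, T, burn_in, n_samples, rng):
    """Discard burn_in steps, then average f over the next n_samples states."""
    for _ in range(burn_in):
        metropolis_step(s, J, T, rng)
    total = 0.0
    for _ in range(n_samples):
        metropolis_step(s, J, T, rng)
        total += f(s)
    return total / n_samples

rng = random.Random(0)
J = [[0 if i == j else 1 for j in range(4)] for i in range(4)]  # toy ferromagnet
s = [rng.choice([-1, 1]) for _ in range(4)]
mean_energy = estimate_average(lambda x: energy(x, J), s, J,
                               T=2.0, burn_in=1000, n_samples=5000, rng=rng)
print(mean_energy)
```

Computing $\Delta E$ from the local field of the flipped spin avoids re-evaluating the full energy at every step.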

17 Contrastive divergence
Consider a probability distribution over a vector x (assumed discrete w.l.o.g.)
$$p(x;\lambda) = \frac{1}{Z(\lambda)}\, e^{-E(x;\lambda)},$$
where $\lambda$ is a vector of model parameters and the partition function $Z(\lambda)$ is defined as
$$Z(\lambda) = \sum_x e^{-E(x;\lambda)}.$$
We derive the model parameters $\lambda$ by maximizing the likelihood, namely the probability of a training set of i.i.d. data $X = \{x_1, x_2, \dots, x_N\}$,
$$p(X;\lambda) = \prod_{k=1}^{N} \frac{1}{Z(\lambda)}\, e^{-E(x_k;\lambda)},$$
or, equivalently, by minimizing the negative of the log-likelihood
$$L(X;\lambda) = \log Z(\lambda) + \frac{1}{N}\sum_{k=1}^{N} E(x_k;\lambda).$$

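18 The gradient equation is obtained by first writing down the partial derivative of $L(X;\lambda)$:
$$\frac{\partial L(X;\lambda)}{\partial\lambda_i} = \frac{\partial \log Z(\lambda)}{\partial\lambda_i} + \frac{1}{N}\sum_{k=1}^{N}\frac{\partial E(x_k;\lambda)}{\partial\lambda_i} = \frac{\partial \log Z(\lambda)}{\partial\lambda_i} + \left\langle \frac{\partial E(x;\lambda)}{\partial\lambda_i} \right\rangle_{p_0},$$
where $\langle\cdot\rangle_{p_0}$ is the expectation of $\cdot$ given the data distribution, i.e.,
$$p_0(x) = \frac{1}{N}\sum_{n=1}^{N}\delta(x - x_n).$$
The first term is
$$\frac{\partial \log Z(\lambda)}{\partial\lambda_i} = \frac{1}{Z(\lambda)}\sum_x \frac{\partial}{\partial\lambda_i}\, e^{-E(x;\lambda)} = -\left\langle \frac{\partial E(x;\lambda)}{\partial\lambda_i} \right\rangle_{p(x;\lambda)},$$
which can be numerically approximated by drawing samples from the proposed distribution. However, samples cannot be drawn directly from $p(x;\lambda)$ as we do not know the value of the partition function, but we can use many cycles of MCMC sampling to transform our training data (drawn from the target distribution) into data drawn from the proposed distribution.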

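19 Denoting with $X^n$ the training data transformed using n cycles of MCMC, such that $X^0 \equiv X$, we have
$$\frac{\partial L(X;\lambda)}{\partial\lambda_i} = -\left\langle \frac{\partial E(x;\lambda)}{\partial\lambda_i} \right\rangle_{p(x;\lambda)} + \left\langle \frac{\partial E(x;\lambda)}{\partial\lambda_i} \right\rangle_{X^0}.$$
Update parameters:
$$\lambda_i(t+1) = \lambda_i(t) - \epsilon\left( \left\langle \frac{\partial E(x;\lambda)}{\partial\lambda_i} \right\rangle_{X^0} - \left\langle \frac{\partial E(x;\lambda)}{\partial\lambda_i} \right\rangle_{X^\infty} \right).$$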

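20 $$\lambda_i(t+1) = \lambda_i(t) - \epsilon\left( \left\langle \frac{\partial E(x;\lambda)}{\partial\lambda_i} \right\rangle_{X^0} - \left\langle \frac{\partial E(x;\lambda)}{\partial\lambda_i} \right\rangle_{X^\infty} \right), \qquad p(x;\lambda) = \frac{1}{Z(\lambda)}\, e^{-E(x;\lambda)}$$
Choosing a pairwise energy function we recover the previous results:
$$E(x;\lambda) \to E(s;J,\theta) = -\beta\left[ \sum_{i<j} J_{ij} s_i s_j + \sum_i s_i \theta_i \right], \qquad \frac{\partial E}{\partial J_{ij}} = -\beta\, s_i s_j, \qquad \frac{\partial E}{\partial \theta_i} = -\beta\, s_i .$$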

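21 Denoting with $X^n$ the training data transformed using n cycles of MCMC, such that $X^0 \equiv X$, we have
$$\frac{\partial L(X;\lambda)}{\partial\lambda_i} = \left\langle \frac{\partial E(x;\lambda)}{\partial\lambda_i} \right\rangle_{X^0} - \left\langle \frac{\partial E(x;\lambda)}{\partial\lambda_i} \right\rangle_{X^\infty}.$$
How many MCMC cycles are required to compute an accurate gradient?
Hinton 2000: only a few MCMC cycles (empirically even just one) are needed to calculate an approximate gradient. The intuition is that after a few iterations the data will have moved from the target distribution (i.e., that of the training data) towards the proposed distribution, and so give an idea of the direction in which the proposed distribution should move to better model the training data.
$$\lambda_i(t+1) = \lambda_i(t) - \epsilon\left( \left\langle \frac{\partial E(x;\lambda)}{\partial\lambda_i} \right\rangle_{X^0} - \left\langle \frac{\partial E(x;\lambda)}{\partial\lambda_i} \right\rangle_{X^n} \right), \qquad n \text{ small (even } n = 1).$$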
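A minimal sketch of one-cycle contrastive divergence (CD-1) for a tiny restricted Boltzmann machine follows. Several choices are assumptions made for illustration and are not taken from the slides: units take values ±1, the energy is $E(v,h) = -\sum_{ij} W_{ij} v_i h_j - \sum_i b_i v_i - \sum_j c_j h_j$ with $\beta = 1$, one MCMC cycle is one Gibbs sweep (hidden given visible, then visible given hidden), and the names `sample_layer` and `cd1_update` are hypothetical.

```python
import math
import random

def sample_layer(fields, rng):
    """Sample +/-1 units independently with P(+1) = (1 + tanh(field)) / 2."""
    return [1 if rng.random() < 0.5 * (1.0 + math.tanh(f)) else -1 for f in fields]

def hidden_fields(v, W, c):
    return [c[j] + sum(W[i][j] * v[i] for i in range(len(v))) for j in range(len(c))]

def visible_fields(h, W, b):
    return [b[i] + sum(W[i][j] * h[j] for j in range(len(h))) for i in range(len(b))]

def cd1_update(data, W, b, c, eps, rng):
    """One CD-1 step: lambda <- lambda - eps * (<dE/dlambda>_X0 - <dE/dlambda>_X1).

    Since dE/dW_ij = -v_i h_j, the weight update is +eps * (<v h>_X0 - <v h>_X1).
    """
    n_v, n_h, n = len(b), len(c), len(data)
    dW = [[0.0] * n_h for _ in range(n_v)]
    db, dc = [0.0] * n_v, [0.0] * n_h
    for v0 in data:
        h0 = sample_layer(hidden_fields(v0, W, c), rng)
        v1 = sample_layer(visible_fields(h0, W, b), rng)   # one MCMC cycle
        h1 = sample_layer(hidden_fields(v1, W, c), rng)
        for i in range(n_v):
            db[i] += (v0[i] - v1[i]) / n
            for j in range(n_h):
                dW[i][j] += (v0[i] * h0[j] - v1[i] * h1[j]) / n
        for j in range(n_h):
            dc[j] += (h0[j] - h1[j]) / n
    for i in range(n_v):
        b[i] += eps * db[i]
        for j in range(n_h):
            W[i][j] += eps * dW[i][j]
    for j in range(n_h):
        c[j] += eps * dc[j]

rng = random.Random(0)
data = [[1, 1, -1, -1], [-1, -1, 1, 1]] * 10      # toy training set
W = [[0.0, 0.0] for _ in range(4)]                # 4 visible x 2 hidden units
b, c = [0.0] * 4, [0.0] * 2
for _ in range(200):
    cd1_update(data, W, b, c, eps=0.05, rng=rng)
print(W)
```

Using sampled hidden states in both phases keeps the sketch close to the formula; practical implementations often replace some samples by their conditional means to reduce noise.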

22 Another numerical method: Simulated annealing
Approximate global optimization in a large search space. It is preferable to gradient descent for problems where finding an approximate global optimum is more important than finding a precise local optimum in a fixed amount of time.
Some thermal noise is useful to escape from local minima.
Too much thermal noise would make even global minima unstable.
S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi (1983) Optimization by Simulated Annealing, Science

23 Another numerical method: Simulated annealing
Annealing in metallurgy: a technique involving heating and controlled cooling of a material to increase the size of its crystals and reduce their defects.
At each time step, the algorithm
- randomly selects a solution close to the current one (e.g., a spin configuration reached by a single spin flip)
- measures its quality (e.g., the energy change $\Delta E$)
- decides probabilistically to move to it or to stay (e.g., with probability $p \propto \exp[-\Delta E/T]$).
During the search, the temperature is progressively decreased from an initial positive value to zero. This affects the move probability: at each step, the probability of moving to a worse new solution is progressively driven towards zero.
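The loop described above can be sketched as follows. The linear cooling schedule, the ferromagnetic ring used as a toy cost function, and the names `ring_energy` and `simulated_annealing` are assumptions made for illustration; the slides do not prescribe a particular schedule.

```python
import math
import random

def ring_energy(s):
    """Toy cost function: ferromagnetic ring, E = -sum_i s_i * s_{i+1}."""
    n = len(s)
    return -sum(s[i] * s[(i + 1) % n] for i in range(n))

def simulated_annealing(energy, start, t_start, n_steps, rng):
    """Single-spin-flip annealing with a linear schedule from t_start towards 0."""
    s = list(start)
    e = energy(s)
    best, best_e = list(s), e
    for step in range(n_steps):
        T = t_start * (1.0 - step / n_steps)      # assumed cooling schedule
        i = rng.randrange(len(s))
        s[i] = -s[i]                              # propose a nearby solution
        d = energy(s) - e                         # measure its quality
        if d <= 0 or (T > 0 and rng.random() < math.exp(-d / T)):
            e += d                                # move to the new solution
        else:
            s[i] = -s[i]                          # stay: undo the flip
        if e < best_e:
            best, best_e = list(s), e
    return best, best_e

rng = random.Random(0)
start = [rng.choice([-1, 1]) for _ in range(20)]
best, best_e = simulated_annealing(ring_energy, start, t_start=2.0,
                                   n_steps=5000, rng=rng)
print(ring_energy(start), best_e)
```

Keeping track of the best configuration seen so far is a common convenience and does not change the dynamics of the search.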

24 Four ingredients:
- a configuration of the system
- a random generator of rearrangements of the elements in a configuration
- a cost function containing the trade-offs that have to be made
- an annealing schedule of the temperatures and lengths of time for which the system is to be evolved
For a given problem, the annealing schedule may be developed by trial and error, or may consist of just warming the system until it is obviously melted, then cooling in slow stages until diffusion of the components ceases. Inventing the most effective sets of moves and deciding which factors to incorporate into the objective function require insight into the problem being solved and may not be obvious.
[Figure credit: Kingpin3, own work, CC0]

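25 Another method: Mean-field formulation
Learning consists of adjusting the weights and thresholds in such a way that the Boltzmann distribution on the visible units, $p_\infty$, approximates a target distribution q as closely as possible. This can be recast as the problem of minimizing the KL distance using gradient descent, and the learning rules are given by
$$\Delta J_{ij} = \epsilon\left( \langle s_i s_j\rangle_+ - \langle s_i s_j\rangle_- \right), \qquad \Delta\theta_i = \epsilon\left( \langle s_i\rangle_+ - \langle s_i\rangle_- \right).$$
The parameter $\epsilon$ is the learning rate. The brackets $\langle\cdot\rangle_-$ and $\langle\cdot\rangle_+$ denote the free and clamped expectation values, respectively:
$$\langle f(x,\sigma,y)\rangle_+ = \sum_{x,\sigma,y} f(x,\sigma,y)\, p_\infty(\sigma|x,y)\, q(x,y), \qquad \langle f(x,\sigma,y)\rangle_- = \sum_{x,\sigma,y} f(x,\sigma,y)\, p_\infty(x,\sigma,y).$$
Naive mean-field approximation: the free and the clamped expectations are approximated by their classic, infinite-size, mean-field values
$$\langle s_i\rangle \approx m_i, \qquad \langle s_i s_j\rangle \approx m_i m_j, \qquad m_i = \tanh\left[ \beta\left( \sum_{j \ne i} J_{ij} m_j + \theta_i \right) \right].$$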

26 In each step of the gradient descent procedure, one must solve the naive mean-field self-consistent equations for the current parameter values, evaluate the distance with respect to the clamped averages and update the parameters accordingly. This method is O(10) times faster than the MC method.
However, this method may not converge! In fact, it leads to a converging gradient descent algorithm only when the data are such that
$$\langle s_i s_j\rangle_+ = \langle s_i\rangle_+ \langle s_j\rangle_+, \qquad i \ne j.$$
This is a property of the data. It is equivalent to stating that the target probability distribution q is factorized in all its variables: $q(s) = \prod_i q_i(s_i)$. The quality of this naive method depends on the extent to which the previous equation is violated. There exist possible improvements, e.g., the linear response correction.
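Solving the self-consistent equations at fixed parameters can be sketched with a damped fixed-point iteration. The sketch assumes the tanh form $m_i = \tanh[\beta(\sum_{j \ne i} J_{ij} m_j + \theta_i)]$ for ±1 neurons; the damping factor, the iteration count and the function name are illustrative choices, and convergence is not guaranteed in general.

```python
import math

def mean_field_magnetizations(J, theta, beta, n_iter=200, damping=0.5):
    """Iterate m_i <- tanh(beta * (sum_{j != i} J[i][j] * m[j] + theta[i])).

    Each sweep mixes the old and new values (damping) to tame oscillations.
    """
    n = len(theta)
    m = [0.0] * n
    for _ in range(n_iter):
        new = [math.tanh(beta * (sum(J[i][j] * m[j] for j in range(n) if j != i)
                                 + theta[i]))
               for i in range(n)]
        m = [damping * old + (1.0 - damping) * upd for old, upd in zip(m, new)]
    return m

J = [[0.0, 0.5], [0.5, 0.0]]       # toy symmetric couplings
theta = [0.2, -0.1]
m = mean_field_magnetizations(J, theta, beta=1.0)
pair = m[0] * m[1]                 # naive estimate of <s_0 s_1>
print(m, pair)
```

In the naive scheme the pair average is simply the product of the magnetizations, which is why the method is exact only for factorized targets.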

27 A few examples: MNIST database
[Figure panels: "Features", "Two"]

28 Natural images database
[Figure panel: "Pines"]

29 Olivetti database
The dataset consists of 400 images with greyscale pixels. There are 10 images for each person, so there are 40 persons (targets). The first 6 faces in the Olivetti dataset look like this: [Figure: faces]
First convert the image array to binary (black and white). This is needed because an RBM can extract features more efficiently from binary data than from greyscale data.
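The conversion to binary can be sketched as a simple threshold. The sketch assumes pixel intensities scaled to the interval [0, 1] and a cut at 0.5; both the threshold and the function name `binarize` are hypothetical, since the slides do not say how the conversion was done.

```python
def binarize(image, threshold=0.5):
    """Map each greyscale pixel to 1 (above threshold) or 0 (otherwise)."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]

tiny_image = [[0.1, 0.9],
              [0.5, 0.7]]          # toy 2x2 "image" with values in [0, 1]
print(binarize(tiny_image))        # [[0, 1], [0, 1]]
```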

30 After running the RBM, we see the extracted features on the first 6 faces. These look similar to eigenfaces.

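31 A simple example to see frustration at work
[Figure: three neurons $\sigma_1$, $\sigma_2$, $\sigma_3$ connected by the couplings $J_{12}$, $J_{13}$, $J_{23}$]
$$\langle\sigma_i(t+1)\rangle = \frac{1}{2}\tanh(2\beta)\left[ J_{ij}\langle\sigma_j(t)\rangle + J_{ik}\langle\sigma_k(t)\rangle \right], \qquad i,j,k = 1,2,3$$
$$u_i(t) = \langle\sigma_i(t)\rangle, \quad i = 1,2,3; \qquad J_{ij} = J_{ji} \text{ and } J_{ij} = \pm 1 \ \ \forall\, i,j$$
$$\frac{d}{dt}u_1(t) = -u_1(t) + \frac{1}{2}\tanh(2\beta)\left[ J_{12}u_2(t) + J_{13}u_3(t) \right]$$
$$\frac{d}{dt}u_2(t) = -u_2(t) + \frac{1}{2}\tanh(2\beta)\left[ J_{12}u_1(t) + J_{23}u_3(t) \right]$$
$$\frac{d}{dt}u_3(t) = -u_3(t) + \frac{1}{2}\tanh(2\beta)\left[ J_{13}u_1(t) + J_{23}u_2(t) \right]$$
In matrix notation $\dot{u}(t) = A u(t)$, where $\Theta = \tanh(2\beta)$ and
$$A = \begin{pmatrix} -1 & \Theta J_{12}/2 & \Theta J_{13}/2 \\ \Theta J_{12}/2 & -1 & \Theta J_{23}/2 \\ \Theta J_{13}/2 & \Theta J_{23}/2 & -1 \end{pmatrix}.$$
Characteristic polynomial: $\det(A - \lambda I) = 0$, with roots $\lambda_k = -1/\tau_k$ given by
$$\frac{1}{\tau_1} = \frac{1}{\tau_2} = \frac{1}{2}\left( 2 + \Theta J_{12}J_{23}J_{13} \right), \qquad \frac{1}{\tau_3} = 1 - \Theta J_{12}J_{23}J_{13}.$$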

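32 $$\frac{1}{\tau_1} = \frac{1}{\tau_2} = \frac{1}{2}\left( 2 + \Theta J_{12}J_{23}J_{13} \right), \qquad \frac{1}{\tau_3} = 1 - \Theta J_{12}J_{23}J_{13}$$
$C_{1,2,3}$ depend on the initial configuration:
$$\langle\sigma_1(t)\rangle = -2C_1 J_{13}J_{23}\, e^{-t/\tau_1} + C_3 J_{13}J_{23}\, e^{-t/\tau_3},$$
$$\langle\sigma_2(t)\rangle = \left( C_1 - C_2 J_{12}J_{13} \right) e^{-t/\tau_1} + C_3\, e^{-t/\tau_3},$$
$$\langle\sigma_3(t)\rangle = \left( C_1 J_{12}J_{13} + C_2 \right) e^{-t/\tau_1} + C_3 J_{12}J_{13}\, e^{-t/\tau_3}.$$
$\langle\sigma_i(t)\rangle$ relax to zero* as $t \to \infty$ with characteristic times $\tau_k$.
* The system is finite!
[Figure: relaxation times $\tau$ versus $\beta$. Panel "$J_{12}J_{13}J_{23} = 1$: simple system" and panel "$J_{12}J_{13}J_{23} = -1$: complex system"; curves $\tau_1 = \tau_2$ and $\tau_3$.]
Simple system: eigenvalues are peaked at short time scales.
Complex system: eigenvalues are peaked at long time scales.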
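The relaxation modes of the three-neuron system can be checked numerically. The sketch below assumes the linear system $\dot{u} = Au$ with diagonal entries $-1$ and off-diagonal entries $\Theta J_{ij}/2$, $\Theta = \tanh(2\beta)$, and verifies for all eight sign choices of the couplings that the rates $1 + \Theta J_{12}J_{13}J_{23}/2$ (twice) and $1 - \Theta J_{12}J_{13}J_{23}$ belong to explicit eigenvectors. The helper name `check_modes` is illustrative.

```python
import itertools
import math

def check_modes(beta, tol=1e-12):
    """Return True if A v = lambda v holds for the three modes, for every
    choice of couplings J12, J13, J23 in {-1, +1}."""
    th = math.tanh(2.0 * beta)
    for J12, J13, J23 in itertools.product([-1, 1], repeat=3):
        P = J12 * J13 * J23                     # +1: simple, -1: frustrated
        A = [[-1.0, th * J12 / 2, th * J13 / 2],
             [th * J12 / 2, -1.0, th * J23 / 2],
             [th * J13 / 2, th * J23 / 2, -1.0]]
        modes = [
            (-(1.0 + th * P / 2), [-2 * J13 * J23, 1, J12 * J13]),  # rate 1/tau_1
            (-(1.0 + th * P / 2), [0, -J12 * J13, 1]),              # rate 1/tau_2
            (-(1.0 - th * P), [J13 * J23, 1, J12 * J13]),           # rate 1/tau_3
        ]
        for lam, v in modes:
            Av = [sum(A[r][c] * v[c] for c in range(3)) for r in range(3)]
            if any(abs(Av[r] - lam * v[r]) > tol for r in range(3)):
                return False
    return True

print(check_modes(beta=0.7))
```

For $J_{12}J_{13}J_{23} = 1$ the single mode becomes slow as $\beta$ grows, since $1 - \Theta \to 0$, whereas for $J_{12}J_{13}J_{23} = -1$ all three rates stay finite.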

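33 Pairwise correlation functions
$$u_{ij}(t) = \langle\sigma_i(t)\sigma_j(t)\rangle, \qquad i,j = 1,2,3$$
$$\frac{d}{dt}u_{12}(t) = -2u_{12}(t) + \frac{\Theta}{2}\left( 2J_{12} + J_{23}u_{13}(t) + J_{13}u_{23}(t) \right),$$
$$\frac{d}{dt}u_{13}(t) = -2u_{13}(t) + \frac{\Theta}{2}\left( 2J_{13} + J_{23}u_{12}(t) + J_{12}u_{23}(t) \right),$$
$$\frac{d}{dt}u_{23}(t) = -2u_{23}(t) + \frac{\Theta}{2}\left( 2J_{23} + J_{13}u_{12}(t) + J_{12}u_{13}(t) \right).$$
Roots of the characteristic polynomial, $\lambda_k = -1/\tau_k$:
$$\frac{1}{\tau_1} = 2 - \Theta J_{12}J_{23}J_{13}, \qquad \frac{1}{\tau_2} = \frac{1}{\tau_3} = 2 + \frac{1}{2}\Theta J_{12}J_{23}J_{13}.$$
As $t \to \infty$, $\langle\sigma_i(t)\sigma_j(t)\rangle$ relax to a non-zero value independent of the initial configuration:
$$\langle\sigma_1(t)\sigma_2(t)\rangle = C_1 J_{12}J_{13}\, e^{-t/\tau_1} - 2C_2\, e^{-t/\tau_2} + \frac{\Theta}{2J_{12} - \Theta J_{13}J_{23}},$$
$$\langle\sigma_1(t)\sigma_3(t)\rangle = C_1\, e^{-t/\tau_1} + \left( C_2 J_{13}J_{12} - C_3 J_{13}J_{23} \right) e^{-t/\tau_2} + \frac{\Theta}{2J_{13} - \Theta J_{12}J_{23}},$$
$$\langle\sigma_2(t)\sigma_3(t)\rangle = C_1 J_{13}J_{23}\, e^{-t/\tau_1} + \left( C_2 J_{12}J_{23} + C_3 \right) e^{-t/\tau_3} + \frac{\Theta}{2J_{23} - \Theta J_{12}J_{13}}.$$
$C_{1,2,3}$ depend on the initial configuration.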

34 [Figure: relaxation times $\tau$ of the pairwise correlation functions versus $\beta$. Panel "$J_{12}J_{13}J_{23} = 1$: simple system" and panel "$J_{12}J_{13}J_{23} = -1$: complex system"; curves $\tau_1$ and $\tau_2 = \tau_3$.]

35 Conditioned reflex
[Equations not recoverable from the transcription: coupled evolution equations for the averages of the neurons $\sigma_i$, $\sigma_j$ and of the coupling $J_{ij}$, in the presence of the fields $h_i$, $h_j$]
[Figure: neurons $\sigma_i$ and $\sigma_j$ connected by the coupling $J_{ij}$]
Ivan Pavlov
E. Agliari, A. Barra, K. Gervasi-Vidal, F. Guerra, Journal of Biological Dynamics (2012)

36 Some fun: Google DeepDream
Initially it was invented to help scientists and engineers see what a deep neural network is seeing when it is looking at a given image. Later the algorithm became a new form of psychedelic and abstract art.
The network's 'answer' comes from this final output layer. In doing this, the software builds up an idea of what it thinks an object looks like.
Other images were created by feeding a picture into the network and then asking the software to recognise a feature of it, and modify the picture to emphasise the feature it recognises, such as animals and eyes. That modified picture is then fed back into the network, which is again tasked to recognise features and emphasise them, and so on. Eventually, the feedback loop modifies the picture beyond all recognition.

37 Neural art
Neural networks are trained to be able to work out what makes each artist's style unique.


Introduction to Neural Networks

Introduction to Neural Networks Introduction to Neural Networks What are (Artificial) Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning

More information

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence ESANN 0 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 7-9 April 0, idoc.com publ., ISBN 97-7707-. Stochastic Gradient

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Approximate inference in Energy-Based Models

Approximate inference in Energy-Based Models CSC 2535: 2013 Lecture 3b Approximate inference in Energy-Based Models Geoffrey Hinton Two types of density model Stochastic generative model using directed acyclic graph (e.g. Bayes Net) Energy-based

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Neural Networks. Hopfield Nets and Auto Associators Fall 2017

Neural Networks. Hopfield Nets and Auto Associators Fall 2017 Neural Networks Hopfield Nets and Auto Associators Fall 2017 1 Story so far Neural networks for computation All feedforward structures But what about.. 2 Loopy network Θ z = ቊ +1 if z > 0 1 if z 0 y i

More information

Learning Deep Architectures for AI. Part II - Vijay Chakilam

Learning Deep Architectures for AI. Part II - Vijay Chakilam Learning Deep Architectures for AI - Yoshua Bengio Part II - Vijay Chakilam Limitations of Perceptron x1 W, b 0,1 1,1 y x2 weight plane output =1 output =0 There is no value for W and b such that the model

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

Hopfield Networks. (Excerpt from a Basic Course at IK 2008) Herbert Jaeger. Jacobs University Bremen

Hopfield Networks. (Excerpt from a Basic Course at IK 2008) Herbert Jaeger. Jacobs University Bremen Hopfield Networks (Excerpt from a Basic Course at IK 2008) Herbert Jaeger Jacobs University Bremen Building a model of associative memory should be simple enough... Our brain is a neural network Individual

More information

SIMU L TED ATED ANNEA L NG ING

SIMU L TED ATED ANNEA L NG ING SIMULATED ANNEALING Fundamental Concept Motivation by an analogy to the statistical mechanics of annealing in solids. => to coerce a solid (i.e., in a poor, unordered state) into a low energy thermodynamic

More information

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Thore Graepel and Nicol N. Schraudolph Institute of Computational Science ETH Zürich, Switzerland {graepel,schraudo}@inf.ethz.ch

More information

Artificial Neural Networks. MGS Lecture 2

Artificial Neural Networks. MGS Lecture 2 Artificial Neural Networks MGS 2018 - Lecture 2 OVERVIEW Biological Neural Networks Cell Topology: Input, Output, and Hidden Layers Functional description Cost functions Training ANNs Back-Propagation

More information

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo Group Prof. Daniel Cremers 11. Sampling Methods: Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative

More information

Hamiltonian Monte Carlo for Scalable Deep Learning

Hamiltonian Monte Carlo for Scalable Deep Learning Hamiltonian Monte Carlo for Scalable Deep Learning Isaac Robson Department of Statistics and Operations Research, University of North Carolina at Chapel Hill isrobson@email.unc.edu BIOS 740 May 4, 2018

More information

Algorithms other than SGD. CS6787 Lecture 10 Fall 2017

Algorithms other than SGD. CS6787 Lecture 10 Fall 2017 Algorithms other than SGD CS6787 Lecture 10 Fall 2017 Machine learning is not just SGD Once a model is trained, we need to use it to classify new examples This inference task is not computed with SGD There

More information

Knowledge Extraction from DBNs for Images

Knowledge Extraction from DBNs for Images Knowledge Extraction from DBNs for Images Son N. Tran and Artur d Avila Garcez Department of Computer Science City University London Contents 1 Introduction 2 Knowledge Extraction from DBNs 3 Experimental

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

A Video from Google DeepMind.

A Video from Google DeepMind. A Video from Google DeepMind http://www.nature.com/nature/journal/v518/n7540/fig_tab/nature14236_sv2.html Can machine learning teach us cluster updates? Lei Wang Institute of Physics, CAS https://wangleiphy.github.io

More information

Stochastic optimization Markov Chain Monte Carlo

Stochastic optimization Markov Chain Monte Carlo Stochastic optimization Markov Chain Monte Carlo Ethan Fetaya Weizmann Institute of Science 1 Motivation Markov chains Stationary distribution Mixing time 2 Algorithms Metropolis-Hastings Simulated Annealing

More information

Learning Energy-Based Models of High-Dimensional Data

Learning Energy-Based Models of High-Dimensional Data Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal

More information

Deep Belief Networks are compact universal approximators

Deep Belief Networks are compact universal approximators 1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation

More information

Neural Networks. Mark van Rossum. January 15, School of Informatics, University of Edinburgh 1 / 28

Neural Networks. Mark van Rossum. January 15, School of Informatics, University of Edinburgh 1 / 28 1 / 28 Neural Networks Mark van Rossum School of Informatics, University of Edinburgh January 15, 2018 2 / 28 Goals: Understand how (recurrent) networks behave Find a way to teach networks to do a certain

More information

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 14. Sinan Kalkan

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 14. Sinan Kalkan CENG 783 Special topics in Deep Learning AlchemyAPI Week 14 Sinan Kalkan Today Hopfield Networks Boltzmann Machines Deep BM, Restricted BM Generative Adversarial Networks Variational Auto-encoders Autoregressive

More information

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required.

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In humans, association is known to be a prominent feature of memory.

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Neural networks. Chapter 19, Sections 1 5 1

Neural networks. Chapter 19, Sections 1 5 1 Neural networks Chapter 19, Sections 1 5 Chapter 19, Sections 1 5 1 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 19, Sections 1 5 2 Brains 10

More information

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight) CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information

More information

Robust Classification using Boltzmann machines by Vasileios Vasilakakis

Robust Classification using Boltzmann machines by Vasileios Vasilakakis Robust Classification using Boltzmann machines by Vasileios Vasilakakis The scope of this report is to propose an architecture of Boltzmann machines that could be used in the context of classification,

More information

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

PDF hosted at the Radboud Repository of the Radboud University Nijmegen PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is a publisher's version. For additional information about this publication click this link. http://hdl.handle.net/2066/112727

More information

Statistical NLP for the Web

Statistical NLP for the Web Statistical NLP for the Web Neural Networks, Deep Belief Networks Sameer Maskey Week 8, October 24, 2012 *some slides from Andrew Rosenberg Announcements Please ask HW2 related questions in courseworks

More information

Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines

Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines KyungHyun Cho, Tapani Raiko, Alexander Ilin Abstract A new interest towards restricted Boltzmann machines (RBMs) has risen due

More information

Credit Assignment: Beyond Backpropagation

Credit Assignment: Beyond Backpropagation Credit Assignment: Beyond Backpropagation Yoshua Bengio 11 December 2016 AutoDiff NIPS 2016 Workshop oo b s res P IT g, M e n i arn nlin Le ain o p ee em : D will r G PLU ters p cha k t, u o is Deep Learning

More information

Deep Neural Networks

Deep Neural Networks Deep Neural Networks DT2118 Speech and Speaker Recognition Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 45 Outline State-to-Output Probability Model Artificial Neural Networks Perceptron Multi

More information

Using a Hopfield Network: A Nuts and Bolts Approach

Using a Hopfield Network: A Nuts and Bolts Approach Using a Hopfield Network: A Nuts and Bolts Approach November 4, 2013 Gershon Wolfe, Ph.D. Hopfield Model as Applied to Classification Hopfield network Training the network Updating nodes Sequencing of

More information

Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber

Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber Representational Power of Restricted Boltzmann Machines and Deep Belief Networks Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber Introduction Representational abilities of functions with some

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Deep Boltzmann Machines

Deep Boltzmann Machines Deep Boltzmann Machines Ruslan Salakutdinov and Geoffrey E. Hinton Amish Goel University of Illinois Urbana Champaign agoel10@illinois.edu December 2, 2016 Ruslan Salakutdinov and Geoffrey E. Hinton Amish

More information

Index. Santanu Pattanayak 2017 S. Pattanayak, Pro Deep Learning with TensorFlow,

Index. Santanu Pattanayak 2017 S. Pattanayak, Pro Deep Learning with TensorFlow, Index A Activation functions, neuron/perceptron binary threshold activation function, 102 103 linear activation function, 102 rectified linear unit, 106 sigmoid activation function, 103 104 SoftMax activation

More information

Fundamentals of Computational Neuroscience 2e

Fundamentals of Computational Neuroscience 2e Fundamentals of Computational Neuroscience 2e January 1, 2010 Chapter 10: The cognitive brain Hierarchical maps and attentive vision A. Ventral visual pathway B. Layered cortical maps Receptive field size

More information

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria 12. LOCAL SEARCH gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley h ttp://www.cs.princeton.edu/~wayne/kleinberg-tardos

More information

UNSUPERVISED LEARNING

UNSUPERVISED LEARNING UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training

More information

5. Simulated Annealing 5.1 Basic Concepts. Fall 2010 Instructor: Dr. Masoud Yaghini

5. Simulated Annealing 5.1 Basic Concepts. Fall 2010 Instructor: Dr. Masoud Yaghini 5. Simulated Annealing 5.1 Basic Concepts Fall 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Real Annealing and Simulated Annealing Metropolis Algorithm Template of SA A Simple Example References

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

Ways to make neural networks generalize better

Ways to make neural networks generalize better Ways to make neural networks generalize better Seminar in Deep Learning University of Tartu 04 / 10 / 2014 Pihel Saatmann Topics Overview of ways to improve generalization Limiting the size of the weights

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Hopfield Neural Network

Hopfield Neural Network Lecture 4 Hopfield Neural Network Hopfield Neural Network A Hopfield net is a form of recurrent artificial neural network invented by John Hopfield. Hopfield nets serve as content-addressable memory systems

More information

Administration. Registration Hw3 is out. Lecture Captioning (Extra-Credit) Scribing lectures. Questions. Due on Thursday 10/6

Administration. Registration Hw3 is out. Lecture Captioning (Extra-Credit) Scribing lectures. Questions. Due on Thursday 10/6 Administration Registration Hw3 is out Due on Thursday 10/6 Questions Lecture Captioning (Extra-Credit) Look at Piazza for details Scribing lectures With pay; come talk to me/send email. 1 Projects Projects

More information

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

More information

Advanced Sampling Algorithms

Advanced Sampling Algorithms + Advanced Sampling Algorithms + Mobashir Mohammad Hirak Sarkar Parvathy Sudhir Yamilet Serrano Llerena Advanced Sampling Algorithms Aditya Kulkarni Tobias Bertelsen Nirandika Wanigasekara Malay Singh

More information

RANDOM TOPICS. stochastic gradient descent & Monte Carlo

RANDOM TOPICS. stochastic gradient descent & Monte Carlo RANDOM TOPICS stochastic gradient descent & Monte Carlo MASSIVE MODEL FITTING nx minimize f(x) = 1 n i=1 f i (x) Big! (over 100K) minimize 1 least squares 2 kax bk2 = X i 1 2 (a ix b i ) 2 minimize 1 SVM

More information

Does the Wake-sleep Algorithm Produce Good Density Estimators?

Does the Wake-sleep Algorithm Produce Good Density Estimators? Does the Wake-sleep Algorithm Produce Good Density Estimators? Brendan J. Frey, Geoffrey E. Hinton Peter Dayan Department of Computer Science Department of Brain and Cognitive Sciences University of Toronto

More information

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA 1/ 21

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA   1/ 21 Neural Networks Chapter 8, Section 7 TB Artificial Intelligence Slides from AIMA http://aima.cs.berkeley.edu / 2 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural

More information

Artificial Neural Networks" and Nonparametric Methods" CMPSCI 383 Nov 17, 2011!

Artificial Neural Networks and Nonparametric Methods CMPSCI 383 Nov 17, 2011! Artificial Neural Networks" and Nonparametric Methods" CMPSCI 383 Nov 17, 2011! 1 Todayʼs lecture" How the brain works (!)! Artificial neural networks! Perceptrons! Multilayer feed-forward networks! Error

More information

Information in Biology

Information in Biology Information in Biology CRI - Centre de Recherches Interdisciplinaires, Paris May 2012 Information processing is an essential part of Life. Thinking about it in quantitative terms may is useful. 1 Living

More information

OPTIMIZATION BY SIMULATED ANNEALING: A NECESSARY AND SUFFICIENT CONDITION FOR CONVERGENCE. Bruce Hajek* University of Illinois at Champaign-Urbana

OPTIMIZATION BY SIMULATED ANNEALING: A NECESSARY AND SUFFICIENT CONDITION FOR CONVERGENCE. Bruce Hajek* University of Illinois at Champaign-Urbana OPTIMIZATION BY SIMULATED ANNEALING: A NECESSARY AND SUFFICIENT CONDITION FOR CONVERGENCE Bruce Hajek* University of Illinois at Champaign-Urbana A Monte Carlo optimization technique called "simulated

More information