Hopfield networks and Boltzmann machines. Geoffrey Hinton et al. Presented by Tambet Matiisen

Transcription:

Hopfield networks and Boltzmann machines. Geoffrey Hinton et al. Presented by Tambet Matiisen, 18.11.2014

Hopfield network. Binary units. Symmetrical connections. http://www.nnwj.de/hopfield-net.html

Energy function. The global energy: E = -Σ_i s_i b_i - Σ_{i<j} s_i s_j w_ij. The energy gap: ΔE_i = E(s_i = 0) - E(s_i = 1) = b_i + Σ_j s_j w_ij. Update rule: s_i = 1 if b_i + Σ_j s_j w_ij ≥ 0, otherwise s_i = 0. http://en.wikipedia.org/wiki/Hopfield_network
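
As a concrete companion to these formulas (not part of the original slides), here is a minimal NumPy sketch of the global energy, the energy gap, and the threshold update rule; the names s, b and W for the state vector, biases and symmetric zero-diagonal weight matrix are just illustrative.

```python
import numpy as np

def energy(s, b, W):
    # E = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij  (W symmetric, zero diagonal)
    return -s @ b - 0.5 * s @ W @ s

def energy_gap(s, b, W, i):
    # delta E_i = E(s_i=0) - E(s_i=1) = b_i + sum_j s_j w_ij
    return b[i] + W[i] @ s

def update_unit(s, b, W, i):
    # Threshold rule: s_i = 1 if b_i + sum_j s_j w_ij >= 0, else 0
    s[i] = 1 if energy_gap(s, b, W, i) >= 0 else 0
    return s
```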

Example. [Diagram: a small Hopfield net with weights and binary unit states; -E = goodness is computed for the shown configuration.]

Deeper energy minimum. [Diagram: the same net with a different configuration of unit states; -E = goodness = 5.]

Is the updating of a Hopfield network deterministic or non-deterministic? A. Deterministic B. Non-deterministic

How to update? Nodes must be updated sequentially, usually in randomized order. With parallel updating the energy could go up. [Diagram: two units, each with bias +5, connected by a weight of -100, both starting in state 0.] If updates occur in parallel but with random timing, the oscillations are usually destroyed.
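
A sketch of the sequential updating described above, under the same assumptions as the earlier sketch (s, b, W as NumPy arrays, W symmetric with zero diagonal); the stopping criterion and the max_sweeps limit are illustrative choices, not from the slides.

```python
import numpy as np

def settle(s, b, W, rng, max_sweeps=100):
    # Visit units one at a time in a randomized order; with sequential updates
    # the energy never goes up, and we stop once a full sweep changes nothing.
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(s)):
            new_state = 1 if b[i] + W[i] @ s >= 0 else 0   # threshold rule
            changed |= (new_state != s[i])
            s[i] = new_state
        if not changed:
            break
    return s

# Usage (hypothetical): s = settle(s, b, W, np.random.default_rng(0))
```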

Content-addressable memory. Using energy minima to represent memories gives a content-addressable memory. An item can be accessed by just knowing part of its content. It can fill out missing or corrupted pieces of information. It is robust against hardware damage.

Classical conditioning. http://changecom.wordpress.com/2013/01/03/classical-conditioning/

Storing memories. The energy landscape is determined by the weights! If we use activities of -1 and 1: Δw_ij = s_i s_j. If we use states of 0 and 1: Δw_ij = 4 (s_i - 1/2)(s_j - 1/2), i.e. Δw_ij = +1 if s_i = s_j and -1 otherwise.
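
A possible implementation of this storage rule for 0/1 states (an illustration, not the presenter's code); `patterns` is assumed to hold one memory per row.

```python
import numpy as np

def store_memories(patterns):
    # patterns: array of shape (M, N) with 0/1 states, one memory per row.
    # Increment w_ij by 4 (s_i - 1/2)(s_j - 1/2) for every memory,
    # i.e. +1 if s_i == s_j and -1 otherwise.
    S = patterns - 0.5
    W = 4 * S.T @ S              # accumulate the rule over all memories at once
    np.fill_diagonal(W, 0)       # no self-connections
    return W
```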

Demo: http://www.tarkvaralabor.ee/doodler/ (choose Algorithm: Hopfield and Initialize)

How many weights did the example have? A. 100 B. 1000 C. 10000

Storage capacity. The capacity of a totally connected net with N units is only about 0.15 N memories. With N bits per memory this is only 0.15 N² bits. The net has N² weights and biases. After storing M memories, each connection weight has an integer value in the range [-M, M]. So the number of bits required to store the weights and biases is: N² log(2M + 1).
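
To make these numbers concrete, a tiny back-of-the-envelope script, assuming for illustration a fully connected net with N = 100 units (the slide itself does not fix N):

```python
import numpy as np

N = 100                                    # illustrative number of units
M = round(0.15 * N)                        # about 15 memories
stored_bits = M * N                        # 1500 bits of stored content
weight_bits = N ** 2 * np.log2(2 * M + 1)  # bits for integer weights in [-M, M]
print(M, stored_bits, round(weight_bits))  # 15 1500 49542 (roughly 50 000)
```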

How many bits are needed to represent the weights in the example? A. 1500 B. 50 000 C. 320 000

Spurious minima. Each time we memorize a configuration, we hope to create a new energy minimum. But what if two minima merge to create a minimum at an intermediate location?

Reverse learning. Let the net settle from a random initial state and then do unlearning. This will get rid of deep, spurious minima and increase memory capacity.

Increasing memory capacity. Instead of trying to store vectors in one shot, cycle through the training set many times. Use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in that vector: x̂_i = f(Σ_j x_j w_ij), Δw_ij = (x_i - x̂_i) x_j.
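
A rough sketch of how this perceptron-style training could look; the learning rate, the number of sweeps and the final symmetrization step are assumptions filled in for illustration, not the presenter's code.

```python
import numpy as np

def train_perceptron_style(patterns, sweeps=10, lr=1.0):
    # patterns: (M, N) array of 0/1 training vectors, cycled through many times.
    M, N = patterns.shape
    W = np.zeros((N, N))
    b = np.zeros(N)
    for _ in range(sweeps):
        for x in patterns:
            for i in range(N):
                # Predict unit i from the states of all the other units.
                x_hat = 1.0 if b[i] + W[i] @ x >= 0 else 0.0
                err = x[i] - x_hat              # perceptron error for unit i
                W[i] += lr * err * x            # delta w_ij = (x_i - x_hat_i) x_j
                W[i, i] = 0.0                   # no self-connection
                b[i] += lr * err
    return 0.5 * (W + W.T), b                   # symmetrize (a simplifying choice)
```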

Hopfield nets with hidden units. Instead of using the net to store memories, use it to construct interpretations of sensory input. The input is represented by the visible units. The interpretation is represented by the states of the hidden units. The badness of the interpretation is represented by the energy. [Diagram: a layer of hidden units above a layer of visible units.]

3-D edges from 2-D images. [Diagram: a 2-D line in the picture and the family of 3-D lines that all project to it.] You can only see one of these 3-D edges at a time because they occlude one another.

Noisy networks. A Hopfield net tries to reduce the energy at each step. This makes it impossible to escape from local minima. We can use random noise to escape from poor minima. Start with a lot of noise so it is easy to cross energy barriers. Slowly reduce the noise so that the system ends up in a deep minimum. This is simulated annealing. [Diagram: an energy landscape with states A, B and C.]

Temperature. [Diagram: transition probabilities between two minima A and B. High-temperature transition probabilities: p(A→B) = 0.2, p(B→A) = 0.1. Low-temperature transition probabilities: p(A→B) = 0.001, p(B→A) = 0.000001.]

Stochastic binary units. Replace the binary threshold units by binary stochastic units that make biased random decisions. The temperature controls the amount of noise. Raising the noise level is equivalent to decreasing all the energy gaps between configurations. p(s_i = 1) = 1 / (1 + e^(-ΔE_i / T)), where T is the temperature.
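
A sketch of a stochastic binary unit together with a simple cooling schedule for simulated annealing; the geometric schedule and its start and end temperatures are illustrative assumptions, not from the slides.

```python
import numpy as np

def stochastic_update(s, b, W, i, T, rng):
    # p(s_i = 1) = 1 / (1 + exp(-dE_i / T)),  dE_i = b_i + sum_j s_j w_ij
    gap = b[i] + W[i] @ s
    p_on = 1.0 / (1.0 + np.exp(-gap / T))
    s[i] = 1 if rng.random() < p_on else 0
    return s

def anneal(s, b, W, rng, T_start=10.0, T_end=0.1, sweeps=100):
    # Start noisy so energy barriers are easy to cross, then cool down slowly.
    for T in np.geomspace(T_start, T_end, sweeps):
        for i in rng.permutation(len(s)):
            stochastic_update(s, b, W, i, T, rng)
    return s
```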

Why do we need stochastic binary units? A. Because we cannot get rid of inherent noise. B. Because it helps to escape local minima. C. Because we want the system to produce randomized results.

Thermal equilibrium. Thermal equilibrium is a difficult concept! Reaching thermal equilibrium does not mean that the system has settled down into the lowest energy configuration. The thing that settles down is the probability distribution over configurations. This settles to the stationary distribution. Any given system keeps changing its configuration, but the fraction of systems in each configuration does not change.

Modeling binary data. Given a training set of binary vectors, fit a model that will assign a probability to every possible binary vector. The model can be used for generating data with the same distribution as the original data. To decide whether a particular model (distribution) produced the observed data: p(model_i | data) = p(data | model_i) p(model_i) / Σ_j p(data | model_j) p(model_j).

Boltzmann machine ...is defined in terms of the energies of joint configurations of the visible and hidden units. Probability of a joint configuration: p(v, h) ∝ e^(-E(v, h)), i.e. the probability of finding the network in that joint configuration after we have updated all of the stochastic binary units many times.

Energy of a joint configuration. -E(v, h) = Σ_{i∈vis} v_i b_i + Σ_{k∈hid} h_k b_k + Σ_{i<j} v_i v_j w_ij + Σ_{i,k} v_i h_k w_ik + Σ_{k<l} h_k h_l w_kl. Here E(v, h) is the energy with configuration v on the visible units and h on the hidden units, v_i is the binary state of unit i in v, b_k is the bias of unit k, w_ik is the weight between visible unit i and hidden unit k, and i<j indexes every non-identical pair of i and j once.

From energies to probabilities. The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations: p(v, h) = e^(-E(v, h)) / Σ_{u,g} e^(-E(u, g)), where the denominator is the partition function. The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it: p(v) = Σ_h e^(-E(v, h)) / Σ_{u,g} e^(-E(u, g)).

Example: how the weights define a distribution. [Diagram: two visible units v1, v2 and two hidden units h1, h2, with weight +2 between v1 and h1, -1 between h1 and h2, and +1 between v2 and h2.]

v1 v2 h1 h2 |  -E | e^-E | p(v,h) | p(v)
 1  1  1  1 |   2 | 7.39 |  .186  |
 1  1  1  0 |   2 | 7.39 |  .186  |
 1  1  0  1 |   1 | 2.72 |  .069  |
 1  1  0  0 |   0 | 1    |  .025  | 0.466
 1  0  1  1 |   1 | 2.72 |  .069  |
 1  0  1  0 |   2 | 7.39 |  .186  |
 1  0  0  1 |   0 | 1    |  .025  |
 1  0  0  0 |   0 | 1    |  .025  | 0.305
 0  1  1  1 |   0 | 1    |  .025  |
 0  1  1  0 |   0 | 1    |  .025  |
 0  1  0  1 |   1 | 2.72 |  .069  |
 0  1  0  0 |   0 | 1    |  .025  | 0.144
 0  0  1  1 |  -1 | 0.37 |  .009  |
 0  0  1  0 |   0 | 1    |  .025  |
 0  0  0  1 |   0 | 1    |  .025  |
 0  0  0  0 |   0 | 1    |  .025  | 0.084
                   Σ = 39.70
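
Since the net is tiny, the table above can be reproduced by brute-force enumeration. A sketch, using the weights from the diagram (+2 between v1 and h1, -1 between h1 and h2, +1 between v2 and h2, zero biases):

```python
import itertools
import numpy as np

def neg_energy(v1, v2, h1, h2):
    # -E(v, h) for the example weights: v1-h1 = +2, h1-h2 = -1, v2-h2 = +1
    return 2 * v1 * h1 - 1 * h1 * h2 + 1 * v2 * h2

configs = list(itertools.product([0, 1], repeat=4))        # all (v1, v2, h1, h2)
weights = {c: np.exp(neg_energy(*c)) for c in configs}     # e^{-E(v, h)}
Z = sum(weights.values())                                  # partition function, about 39.7

p_joint = {c: w / Z for c, w in weights.items()}           # p(v, h)
p_visible = {}
for (v1, v2, h1, h2), p in p_joint.items():
    p_visible[(v1, v2)] = p_visible.get((v1, v2), 0.0) + p # p(v) = sum over h

print(round(Z, 2), {v: round(p, 3) for v, p in p_visible.items()})
```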

Getting a sample from the model. We cannot compute the normalizing term (the partition function) because it has exponentially many terms. So we use Markov Chain Monte Carlo to get samples from the model, starting from a random global configuration: keep picking units at random and allowing them to stochastically update their states based on their energy gaps. Run the Markov chain until it reaches its stationary distribution. The probability of a global configuration is then related to its energy by the Boltzmann distribution.
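
A sketch of this sampling procedure, assuming the whole network state (visible and hidden units together) lives in one vector s with a full symmetric weight matrix W; the burn-in and thinning values are arbitrary illustrative choices.

```python
import numpy as np

def sample_from_model(b, W, rng, n_samples=10, burn_in=1000, thin=100, T=1.0):
    # s holds both visible and hidden units; W is the full symmetric weight matrix.
    N = len(b)
    s = rng.integers(0, 2, size=N)                 # random global configuration
    samples = []
    for step in range(burn_in + n_samples * thin):
        i = rng.integers(N)                        # pick a unit at random
        gap = b[i] + W[i] @ s                      # its energy gap
        p_on = 1.0 / (1.0 + np.exp(-gap / T))
        s[i] = 1 if rng.random() < p_on else 0
        if step >= burn_in and (step - burn_in) % thin == 0:
            samples.append(s.copy())               # approximate draw from p(v, h)
    return np.array(samples)
```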

Getting a sample from the posterior distribution for a given data vector. The number of possible hidden configurations is exponential, so we need MCMC to sample from the posterior. It is just the same as getting a sample from the model, except that we keep the visible units clamped to the given data vector. Only the hidden units are allowed to change states. Samples from the posterior are required for learning the weights. Each hidden configuration is an explanation of an observed visible configuration. Better explanations have lower energy.
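
The clamped version is the same loop restricted to the hidden units; a sketch, assuming (purely as a layout choice) that the visible units occupy the first entries of the state vector.

```python
import numpy as np

def sample_posterior(v, b, W, rng, steps=1000, T=1.0):
    # Clamp the visible units to the data vector v; only hidden units may change.
    n_vis, N = len(v), len(b)
    s = np.concatenate([v, rng.integers(0, 2, size=N - n_vis)])
    for _ in range(steps):
        i = rng.integers(n_vis, N)                 # pick a *hidden* unit at random
        gap = b[i] + W[i] @ s
        s[i] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-gap / T)) else 0
    return s[n_vis:]                               # a sample of the hidden explanation
```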

What does a Boltzmann machine really do? A. Models the probability distribution of input data. B. Generates samples from the modeled distribution. C. Learns the probability distribution of input data from samples. D. All of the above. E. None of the above.