Hopfield networks and Boltzmann machines. Geoffrey Hinton et al. Presented by Tambet Matiisen, 18.11.2014
Hopfield network
Binary units. Symmetrical connections.
http://www.nnwj.de/hopfield-net.html
Energy function
The global energy: $E = -\sum_i s_i b_i - \sum_{i<j} s_i s_j w_{ij}$
The energy gap: $\Delta E_i = E(s_i = 0) - E(s_i = 1) = b_i + \sum_j s_j w_{ij}$
Update rule: $s_i = 1$ if $b_i + \sum_j s_j w_{ij} \ge 0$, otherwise $s_i = 0$ (with $w_{ij} = w_{ji}$ and $w_{ii} = 0$).
http://en.wikipedia.org/wiki/Hopfield_network
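To make the energy and the threshold update concrete, here is a minimal sketch in NumPy, assuming 0/1 states and a symmetric weight matrix with zero diagonal; the function and variable names are my own, not from the slides.

```python
import numpy as np

def energy(s, W, b):
    """Global energy E = -sum_i s_i*b_i - sum_{i<j} s_i*s_j*w_ij."""
    # W is symmetric with a zero diagonal, so 0.5 * s @ W @ s counts each pair once
    return -s @ b - 0.5 * s @ W @ s

def update_unit(s, W, b, i):
    """Binary threshold update: turn unit i on if its energy gap is >= 0."""
    gap = b[i] + W[i] @ s          # Delta E_i = E(s_i = 0) - E(s_i = 1)
    s[i] = 1 if gap >= 0 else 0
    return s
```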
Example
[figure: a small example net with weights -4, 3, 2, 3, 3, -1, 1 and three units marked "?" whose states are to be computed; the configuration found has -E = goodness = 3]
Deeper energy minimum
[figure: the same net settled into a different configuration with -E = goodness = 5]
Is the updating of a Hopfield network deterministic or non-deterministic? A. Deterministic B. Non-deterministic
How to update?
Nodes must be updated sequentially, usually in randomized order. With parallel updating the energy could go up.
[figure: two units, each with a bias of +5, joined by a weight of -100; updated in parallel they oscillate]
If updates occur in parallel but with random timing, the oscillations are usually destroyed.
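A sketch of the sequential updating loop described above, reusing the energy-gap rule from the earlier slide; the sweep count and names are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def settle(s, W, b, n_sweeps=10):
    """Update units one at a time, in a random order each sweep."""
    for _ in range(n_sweeps):
        for i in rng.permutation(len(s)):   # randomized sequential order
            gap = b[i] + W[i] @ s           # energy gap of unit i
            s[i] = 1 if gap >= 0 else 0     # each sequential step never raises E
    return s
```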
Content-addressable memory
Using energy minima to represent memories gives a content-addressable memory. An item can be accessed by just knowing part of its content. It can fill out missing or corrupted pieces of information, and it is robust against hardware damage.
Classical conditioning
http://changecom.wordpress.com/2013/01/03/classical-conditioning/
Storing memories
The energy landscape is determined by the weights!
If we use activities of -1 and 1: $\Delta w_{ij} = s_i s_j$
If we use states of 0 and 1: if $s_i = s_j$ then $\Delta w_{ij} = +1$, if $s_i \ne s_j$ then $\Delta w_{ij} = -1$, i.e. $\Delta w_{ij} = 4(s_i - \tfrac{1}{2})(s_j - \tfrac{1}{2})$
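A minimal sketch of the one-shot storage rule for 0/1 states, accumulating $\Delta w_{ij} = 4(s_i - \tfrac{1}{2})(s_j - \tfrac{1}{2})$ over the memories; names are my own assumptions.

```python
import numpy as np

def store(memories):
    """Accumulate weights from a list of binary (0/1) memory vectors."""
    n = len(memories[0])
    W = np.zeros((n, n))
    for s in memories:
        s = np.asarray(s, dtype=float)
        W += 4 * np.outer(s - 0.5, s - 0.5)   # Hebbian increment for one memory
    np.fill_diagonal(W, 0)                    # no self-connections
    return W
```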
Demo: http://www.tarkvaralabor.ee/doodler/ (choose Algorithm: Hopfield and Initialize)
How many weights did the example have? A. 100 B. 1000 C. 10000
Storage capacity
The capacity of a totally connected net with N units is only about 0.15 N memories. With N bits per memory this is only 0.15 N² bits. The net has N² weights and biases. After storing M memories, each connection weight has an integer value in the range [-M, M]. So the number of bits required to store the weights and biases is: $N^2 \log_2(2M + 1)$
How many bits are needed to represent the weights in the example? A. 1500 B. 50 000 C. 320 000
Spurious minima
Each time we memorize a configuration, we hope to create a new energy minimum. But what if two minima merge to create a minimum at an intermediate location?
Reverse learning
Let the net settle from a random initial state and then do unlearning. This will get rid of deep, spurious minima and increase memory capacity.
Increasing memory capacity
Instead of trying to store vectors in one shot, cycle through the training set many times. Use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in that vector:
$\hat{x}_i = f\big(\sum_j x_j w_{ij}\big)$, $\quad \Delta w_{ij} = (x_i - \hat{x}_i)\, x_j$
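A sketch (not from the slides) of this perceptron-style storage: cycle through the training vectors many times and nudge each unit's incoming weights toward its correct state given the other units, i.e. $\Delta w_{ij} = (x_i - \hat{x}_i)\, x_j$. The learning rate, epoch count, and names are my own assumptions; note that the learned weights are not forced to stay symmetric here.

```python
import numpy as np

def train_perceptron_style(memories, n_epochs=20, lr=1.0):
    X = np.asarray(memories, dtype=float)   # rows are 0/1 training vectors
    n = X.shape[1]
    W = np.zeros((n, n))
    b = np.zeros(n)
    for _ in range(n_epochs):
        for x in X:
            for i in range(n):
                pred = 1.0 if b[i] + W[i] @ x >= 0 else 0.0  # threshold unit i
                err = x[i] - pred                            # x_i - x_hat_i
                W[i] += lr * err * x                         # perceptron update
                W[i, i] = 0.0                                # no self-connection
                b[i] += lr * err
    return W, b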
Hopfield nets with hidden units
Instead of using the net to store memories, use it to construct interpretations of sensory input. The input is represented by the visible units. The interpretation is represented by the states of the hidden units. The badness of the interpretation is represented by the energy.
[figure: a layer of hidden units connected to a layer of visible units]
3D edges from 2D images
[figure: 2-D lines in a picture, each consistent with many different 3-D lines]
You can only see one of these 3-D edges at a time because they occlude one another.
Noisy networks
A Hopfield net tries to reduce the energy at each step. This makes it impossible to escape from local minima. We can use random noise to escape from poor minima: start with a lot of noise so it is easy to cross energy barriers, then slowly reduce the noise so that the system ends up in a deep minimum. This is simulated annealing.
[figure: an energy landscape with configurations A, B, C]
Temperature
High temperature transition probabilities: $p(A \to B) = 0.2$, $p(B \to A) = 0.1$
Low temperature transition probabilities: $p(A \to B) = 0.001$, $p(B \to A) = 0.000001$
[figure: two energy minima A and B separated by a barrier]
Stochastic binary units
Replace the binary threshold units by binary stochastic units that make biased random decisions. The temperature controls the amount of noise. Raising the noise level is equivalent to decreasing all the energy gaps between configurations.
$p(s_i = 1) = \dfrac{1}{1 + e^{-\Delta E_i / T}}$, where $T$ is the temperature.
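A minimal sketch of a stochastic binary unit with temperature, plus a simple annealing schedule; the temperature values are illustrative only, and all names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_update(s, W, b, i, T):
    """Turn unit i on with probability 1 / (1 + exp(-Delta E_i / T))."""
    gap = b[i] + W[i] @ s
    p_on = 1.0 / (1.0 + np.exp(-gap / T))
    s[i] = 1 if rng.random() < p_on else 0
    return s

def anneal(s, W, b, temperatures=(10.0, 5.0, 2.0, 1.0, 0.5), sweeps=20):
    """Start with lots of noise, then cool so the net settles into a deep minimum."""
    for T in temperatures:
        for _ in range(sweeps):
            for i in rng.permutation(len(s)):
                stochastic_update(s, W, b, i, T)
    return s
```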
Why do we need stochastic binary units? A. Because we cannot get rid of inherent noise. B. Because they help to escape local minima. C. Because we want the system to produce randomized results.
Thermal equilibrium
Thermal equilibrium is a difficult concept! Reaching thermal equilibrium does not mean that the system has settled down into the lowest energy configuration. The thing that settles down is the probability distribution over configurations. This settles to the stationary distribution. Any given system keeps changing its configuration, but the fraction of systems in each configuration does not change.
Modeling binary data
Given a training set of binary vectors, fit a model that will assign a probability to every possible binary vector. The model can be used for generating data with the same distribution as the original data. The posterior probability that a particular model (distribution) produced the observed data:
$p(\text{model}_i \mid \text{data}) = \dfrac{p(\text{data} \mid \text{model}_i)\, p(\text{model}_i)}{\sum_j p(\text{data} \mid \text{model}_j)\, p(\text{model}_j)}$
Boltzmann machine
...is defined in terms of the energies of joint configurations of the visible and hidden units. Probability of a joint configuration: $p(v, h) \propto e^{-E(v,h)}$. This is the probability of finding the network in that joint configuration after we have updated all of the stochastic binary units many times.
Energy of a joint configuration
$-E(v, h) = \sum_{i \in \text{vis}} v_i b_i + \sum_{k \in \text{hid}} h_k b_k + \sum_{i<j} v_i v_j w_{ij} + \sum_{i,k} v_i h_k w_{ik} + \sum_{k<l} h_k h_l w_{kl}$
Here $v_i$ is the binary state of unit $i$ in $v$, $b_k$ is the bias of unit $k$, $i<j$ indexes every non-identical pair of $i$ and $j$ once, and $w_{ik}$ is the weight between visible unit $i$ and hidden unit $k$. $E(v,h)$ is the energy with configuration $v$ on the visible units and $h$ on the hidden units.
From energies to probabilities
The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations:
$p(v, h) = \dfrac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$, where the denominator is the partition function.
The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it:
$p(v) = \dfrac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$
Example: how the weights define a distribution
[figure: visible units v1, v2 connected to hidden units h1, h2 with weights w(v1,h1) = +2, w(v2,h2) = +1, w(h1,h2) = -1]

 v1 v2  h1 h2    -E   e^{-E}   p(v,h)   p(v)
  1  1   1  1     2    7.39     .186
  1  1   1  0     2    7.39     .186
  1  1   0  1     1    2.72     .069
  1  1   0  0     0    1        .025    0.466
  1  0   1  1     1    2.72     .069
  1  0   1  0     2    7.39     .186
  1  0   0  1     0    1        .025
  1  0   0  0     0    1        .025    0.305
  0  1   1  1     0    1        .025
  0  1   1  0     0    1        .025
  0  1   0  1     1    2.72     .069
  0  1   0  0     0    1        .025    0.144
  0  0   1  1    -1    0.37     .009
  0  0   1  0     0    1        .025
  0  0   0  1     0    1        .025
  0  0   0  0     0    1        .025    0.084
                      39.70  (sum = partition function)
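Below is a minimal sketch (not from the slides) that reproduces this table by brute-force enumeration. The energy function hard-codes the three weights as I read them off the figure above, so treat those numbers and all names as assumptions.

```python
from itertools import product
import numpy as np

def energy(v, h):
    """E(v,h) = -(2*v1*h1 + 1*v2*h2 - 1*h1*h2); no bias terms in this example."""
    v1, v2 = v
    h1, h2 = h
    return -(2 * v1 * h1 + 1 * v2 * h2 - 1 * h1 * h2)

configs = list(product([0, 1], repeat=4))              # all (v1, v2, h1, h2)
exp_neg_E = {c: np.exp(-energy(c[:2], c[2:])) for c in configs}
Z = sum(exp_neg_E.values())                            # partition function, about 39.70

p_joint = {c: w / Z for c, w in exp_neg_E.items()}     # p(v, h)
p_v = {}
for c, p in p_joint.items():
    p_v[c[:2]] = p_v.get(c[:2], 0.0) + p               # p(v) = sum_h p(v, h)

print(p_v)   # e.g. p(v1=1, v2=1) is about 0.466, matching the table above
```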
Getting a sample from the model
We cannot compute the normalizing term (the partition function) because it has exponentially many terms. So we use Markov Chain Monte Carlo to get samples from the model, starting from a random global configuration: keep picking units at random and allowing them to stochastically update their states based on their energy gaps. Run the Markov chain until it reaches its stationary distribution. The probability of a global configuration is then related to its energy by the Boltzmann distribution.
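A minimal sketch of this MCMC procedure (Gibbs sampling at T = 1): visible and hidden states are held in one vector, W covers all pairwise connections and b all biases; the sweep count and names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_model(W, b, n_sweeps=1000):
    n = len(b)
    s = rng.integers(0, 2, size=n).astype(float)    # random global configuration
    for _ in range(n_sweeps):                       # run the chain toward its
        for i in rng.permutation(n):                # stationary distribution
            gap = b[i] + W[i] @ s                   # energy gap of unit i
            p_on = 1.0 / (1.0 + np.exp(-gap))
            s[i] = 1.0 if rng.random() < p_on else 0.0
    return s                                        # one sample from the model
```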
Getting a sample from the posterior distribution for a given data vector
The number of possible hidden configurations is exponential, so we need MCMC to sample from the posterior. It is just the same as getting a sample from the model, except that we keep the visible units clamped to the given data vector; only the hidden units are allowed to change states. Samples from the posterior are required for learning the weights. Each hidden configuration is an explanation of an observed visible configuration. Better explanations have lower energy.
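The same sampler with the visible units clamped to the data vector, so only the hidden units are updated; the index bookkeeping and names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_posterior(W, b, visible_idx, v, n_sweeps=1000):
    n = len(b)
    s = rng.integers(0, 2, size=n).astype(float)
    s[visible_idx] = v                                # clamp visibles to the data
    hidden_idx = [i for i in range(n) if i not in set(visible_idx)]
    for _ in range(n_sweeps):
        for i in rng.permutation(hidden_idx):         # only hidden units change
            gap = b[i] + W[i] @ s
            p_on = 1.0 / (1.0 + np.exp(-gap))
            s[i] = 1.0 if rng.random() < p_on else 0.0
    return s[hidden_idx]                              # one explanation of v
```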
What does a Boltzmann machine really do?
A. Models the probability distribution of input data.
B. Generates samples from the modeled distribution.
C. Learns the probability distribution of input data from samples.
D. All of the above.
E. None of the above.