Teaching Feed-Forward Neural Networks by Simulated Annealing


Complex Systems 2 (1988) 641-648

Jonathan Engel
Norman Bridge Laboratory of Physics 161-33, California Institute of Technology, Pasadena, CA 91125, USA

Abstract. Simulated annealing is applied to the problem of teaching feed-forward neural networks with discrete-valued weights. Network performance is optimized by repeated presentation of training data at lower and lower temperatures. Several examples, including the parity and "clump-recognition" problems, are treated, scaling with network complexity is discussed, and the viability of mean-field approximations to the annealing process is considered.

1. Introduction

Back propagation [1] and related techniques have focused attention on the prospect of effective learning by feed-forward neural networks with hidden layers. Most current teaching methods suffer from one of several problems, including the tendency to get stuck in local minima, and poor performance in large-scale examples. In addition, gradient-descent methods are applicable only when the weights can assume a continuum of values. The solutions reached by back propagation are sometimes only marginally stable against perturbations, and rounding off weights after or during the procedure can severely affect network performance. If the weights are restricted to a few discrete values, an alternative is required.

At first sight, such a restriction seems counterproductive. Why decrease the flexibility of the network any more than necessary? One answer is that truly tunable analog weights are still a bit beyond the capabilities of current VLSI technology. New techniques will no doubt be developed, but there are other more fundamental reasons to prefer discrete-weight networks. Application of back propagation often results in weights that vary greatly, must be precisely specified, and embody no particular pattern.
If the network is to incorporate structured rules underlying the examples it has learned, the weights ought often to assume regular, integer values. Examples of such very structured sets of weights include the "human solution" to the "clump-recognition" problem [2] discussed in section 2.2, and the configuration presented in reference [1] that solves the parity problem. The relatively low number of correct solutions to any problem when weights are restricted to a few integers increases the probability that by finding one the network discovers real rules. Discrete weights can thus in certain instances improve the ability of a network to generalize.

(c) 1988 Complex Systems Publications, Inc.

Abandoning the continuum necessarily complicates the teaching problem; combinatorial minimization is always harder than gradient descent. New techniques, however, have in certain cases mitigated the problem. Simulated annealing [3], and more recently, a mean-field approximation to it [4,5], are proving successful in a number of optimization tasks. In what follows, we apply the technique to the problem of determining the best set of weights in feed-forward neural networks. For simplicity, we restrict ourselves here to networks with one hidden layer and a single output. A measure of network performance is given by the "energy"

E = \sum_\alpha [t^\alpha - O^\alpha]^2,    (1.1)

where

O^\alpha = g\left(-W_0 + \sum_i W_i V_i^\alpha\right),    (1.2)

and

V_i^\alpha = g\left(-T_{i0} + \sum_j T_{ij} I_j^\alpha\right).    (1.3)

Here g is a step function with threshold 0 and taking values {1, -1}, and the W_i and T_{ij} are weights for the output and hidden layers respectively; a subscript 0 indicates a threshold unit. The I_j^\alpha are the input values for question \alpha, and t^\alpha is the target output value (answer) for that question. Back propagation works by taking derivatives of E with respect to the weights and following a path downhill on the E surface. Since the weights are restricted here to a few discrete values, this procedure is not applicable and finding minima of E, as noted above, becomes a combinatorial problem. The approach we adopt is to sweep through the weights W_i and T_{ij} one at a time in cyclic order, presenting each with the training data and allowing it the option of changing its value to another (chosen at random) from the allowed set.
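The network and energy of equations (1.1)-(1.3) can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation; the array layout, with each unit's threshold stored in column 0, is our own convention.

```python
import numpy as np

def g(x):
    """Step function with threshold 0, taking values {1, -1} (eqs. 1.2-1.3)."""
    return np.where(x > 0, 1, -1)

def output(W, T, I):
    """Network output O^alpha for one input pattern I.

    W: output weights; W[0] is the threshold W_0, W[1:] the W_i.
    T: hidden weights; T[i, 0] is the threshold T_i0, T[i, 1:] the T_ij.
    """
    V = g(-T[:, 0] + T[:, 1:] @ I)   # hidden-unit values V_i^alpha (eq. 1.3)
    return g(-W[0] + W[1:] @ V)      # output O^alpha (eq. 1.2)

def energy(W, T, inputs, targets):
    """E = sum over questions alpha of [t^alpha - O^alpha]^2 (eq. 1.1)."""
    return sum((t - output(W, T, I)) ** 2 for I, t in zip(inputs, targets))
```

Because g takes values in {1, -1}, each wrong answer contributes 4 to E and each right answer contributes 0, so E = 0 exactly when every training question is answered correctly.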
The weight accepts the change with probability

P = \exp[-(E_{\mathrm{new}} - E_{\mathrm{old}})/T]    (1.4)

if E_{\mathrm{new}} > E_{\mathrm{old}}, and with probability P = 1 if the change decreases the energy. The parameter T is initially set at some relatively large value, and then decreased according to the prescription

T \to \gamma T, \quad \gamma < 1,    (1.5)
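The acceptance rule (1.4), the geometric cooling schedule (1.5), and the one-weight-at-a-time cyclic sweep can be put together as follows. This is a hedged sketch under the scheme described in the text; the parameter names and the stopping criterion (quit at zero energy or at a floor temperature) are our additions.

```python
import math
import random

def accept(E_new, E_old, T):
    """Acceptance rule of eq. (1.4): always accept downhill (or equal) moves,
    accept uphill moves with probability exp[-(E_new - E_old)/T]."""
    if E_new <= E_old:
        return True
    return random.random() < math.exp(-(E_new - E_old) / T)

def anneal(weights, values, energy, T0, gamma, n_sweeps, T_min=1e-3):
    """Sweep the weights in cyclic order, offering each a random new value
    from the allowed set; equilibrate for n_sweeps sweeps at each T, then
    cool by T -> gamma * T as in eq. (1.5).

    weights: mutable list of current discrete weight values.
    values:  the allowed set, e.g. (-1, 0, 1).
    energy:  callable evaluating E for the current weight list.
    """
    T = T0
    E_old = energy(weights)
    while T > T_min and E_old > 0:
        for _ in range(n_sweeps):            # "equilibrate" at temperature T
            for k in range(len(weights)):    # cyclic sweep over the weights
                old = weights[k]
                weights[k] = random.choice(values)   # propose a random value
                E_new = energy(weights)
                if accept(E_new, E_old, T):
                    E_old = E_new
                else:
                    weights[k] = old                  # reject: restore
        T = gamma * T                         # cooling schedule (1.5)
    return weights, E_old
```

Setting T0 effectively to zero recovers the iterated-improvement limit discussed below, in which only downhill or energy-preserving changes are kept.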

after a fixed number of sweeps N_sw, during which time the system "equilibrates." It is far from obvious that this annealing algorithm will prove a useful tool for minimizing E. Equations (1.1)-(1.3) specify a function very different from the energies associated with spin glasses or the Traveling Salesman Problem, systems to which annealing has been successfully applied. In the next section, we test the procedure on a few simple learning problems.

2. Examples

2.1 Parity

The XOR problem and its generalization to more inputs (the parity problem) have been widely studied as measures of the efficiency of training algorithms. While not "realistic," these problems elucidate some of the issues that arise in the application of annealing. We consider first a network with two hidden units connected to the inputs, an output unit connected to the hidden layer, and threshold values for each unit. All the weights and thresholds may take the values -1, 1, or 0. We add an extra constraint term to the energy discouraging configurations in which the total input into some unit (for some question \alpha) is exactly equal to the threshold for that unit. When the temperature T is set to zero from the start, the annealing algorithm reduces to a kind of iterated improvement. Changes in the network are accepted only if they decrease the energy, or leave it unchanged. The XOR problem specified above has enough solutions that the algorithm will find one about half the time starting at T = 0. It takes, on the average, 100-300 presentations of the input data to find the minimum. If, on the other hand, annealing is performed, starting at T_0 \approx 1, iterating 20 times at each temperature, and cooling with \gamma = 0.9 after each set of iterations, a solution is always found after (on average) about 400 presentations. In this context, therefore, annealing offers no real advantage.
Furthermore, by increasing the number of hidden units to four, we can increase the likelihood that the T = 0 network finds a solution to 90%. These results persist when we go from the XOR problem with two hidden units to the parity problem with four inputs and four hidden units (we now allow the thresholds to take all integer values between -3 and 3). There, by taking T_0 = 3.3 and \gamma = 0.95, the annealing algorithm always finds a solution, typically after about 15,000 presentations. However, the T = 0 net finds a solution once every 30 or 40 attempts, so that the work involved is comparable to or less than that needed to anneal. And as before, increasing the number of hidden units makes the T = 0 search even faster. Neither of these points will retain its force when we move to more complicated, interesting, and realistic problems. When the ratio of solutions to total configurations becomes too small, iterated improvement techniques perform badly, and will not solve the problem in a reasonable amount of time. And increasing the number of hidden units is not a good remedy. The more

solutions available to the network, the less likely it is that any particular one will generalize well to data outside the set presented. We want to train the network by finding one of only a few solutions, not by altering network structure to allow more.

Table 1: Average number of data-set presentations needed to find a zero-energy solution to the one-vs.-two-clumps problem by iterated improvement.

    n_i = 6: > 2,000,000

2.2 Recognizing clumps - scaling behavior

To test our algorithms in more characteristic situations, we present the network with the problem of distinguishing two-or-more clumps of 1's in a binary string (of 1's and -1's) from one-or-fewer [2]. This predicate has order two, no matter the number of input units n_i. In what follows we always take the number of hidden units equal to the number of inputs, allow the weights to take the values -1, 0, 1, and let the thresholds assume any integer value between -(n_i - 1) and n_i + 1. By examining the performance of the network on the entire 2^{n_i} input strings for n_i equal to four, five, and six, we hope to get a rudimentary idea of how well the performance scales with the number of neurons. To establish some kind of benchmark, we again discuss the T = 0 iterated improvement algorithm. For n_i = 4, the technique works very well, but slows considerably for n_i = 5. By the time the number of inputs reaches six, we are unable to find a zero-energy solution in many hours of VAX 11/780 time. These results are summarized in table 1, where we present the average number of question presentations needed to find a correct solution. The full annealing algorithm is slower than iterated improvement for n_i = 4, but by n_i = 5, it is already a better approach. The average number of presentations for each n_i, with the annealing schedule adjusted so that the ground state is found at least 80% of the time, is shown in table 2.
While it is not possible to reliably extrapolate to larger n_i (for n_i = 7, running times were already too long to gather statistics), it is clear that for these small values the learning takes substantially longer with each increment in the number of hidden units and corresponding doubling of the number of examples. One factor intrinsic to the algorithm partially accounts for the increase. The number of weights and thresholds in the network is (n_i + 1)^2. Since a sweep consists of changing each weight, it takes (n_i + 1)^2 presentations to expose the entire network to the input data one time. But it is also clear that more sweeps and a slower annealing schedule are required as n_i increases. A better quantification of these tendencies awaits further study. One interesting feature of the solutions found is that many of the weights allotted to the network are not used; that is, they assume the value zero.

                 n_i = 4    n_i = 5    n_i = 6
T_0              10         20         33
N_sw             40         40         40
\gamma           0.9        0.95       0.993
presentations    19,000     131,000    652,000

Table 2: Average number of presentations needed to anneal into a zero-energy solution in the one-vs.-two-clumps problem.

Figure 1 shows two typical solutions for n_i = 6. In the second of these, one of the output weights is zero, so that only five hidden units are utilized. While the network shows no inclination to settle into the very low order "human" solution of [2], its tendency to eliminate many allotted connections is encouraging, and could be increased by adding a term to the energy that penalizes large numbers of weights. An algorithm like back propagation that is designed for continuous-valued connections will certainly find solutions [2], but cannot be expected to set weights exactly to zero.

Is it possible that a modification in the procedure could speed convergence? Szu [6] has shown that for continuous functions a sampling less local than the usual one leads to quicker freezing. Our sampling here is extremely local; we change only one weight at a time, and it can therefore be difficult for the network to extract itself from false minima. An approach in which several weights are sometimes changed needs to be tried. This will certainly increase the time needed to simulate the effect of a network change on the energy. A hardware or parallel implementation, however, avoids the problem.

3. Mean field approach

One might hope to speed learning in other ways. In spin glasses [7], certain representative NP-complete combinatorial problems [4], and Boltzmann-machine learning [5], mean-field approximations to the Monte Carlo equilibration have resulted in good solutions while reducing computation time by an order of magnitude.
In these calculations, the configurations are not explicitly varied, but rather an average value at any temperature is calculated (self-consistently) for each variable under the assumption that all the others are fixed at their average values. The procedure, which is quite similar to the Hopfield-net approach to combinatorial minimization detailed in reference [8], results in the equations

\langle X_i \rangle = \sum_{x_i} x_i \exp[-E(x_i, \langle X \rangle)/T] \Big/ \sum_{x_i} \exp[-E(x_i, \langle X \rangle)/T],    (3.1)

where X_i stands for one of the weights W_i or T_{ij}, \langle \cdot \rangle denotes the average value, and the expression \langle X \rangle means that all the weights except X_i are fixed at their average values. These equations are iterated at each temperature until self-consistent solutions for the weights are reached. At very high T,

Figure 1: Typical solutions to the clump-recognition problem found by the simulated annealing algorithm. The black lines denote weights of strength +1 and the dashed lines weights of strength -1. Threshold values are displayed inside the circles corresponding to hidden and output neurons. Many of the weights are zero (absent), including one of the output weights in (b).

there is only one such solution: all weights equal to zero. By using the self-consistent solutions at a given T as starting points at the next temperature, we hope to reach a zero-energy network at T = 0. Although the algorithm works well for simple tasks like the XOR, it runs into problems when extended to clump recognition. The difficulties arise from the discontinuities in the energy induced by the step function (threshold) g in equations (1.2) and (1.3); steep variations can cause the iteration procedure to fail. At certain temperatures, the self-consistent solutions either disappear or become unstable in our iteration scheme, and the equations never settle down. The problem can be avoided by replacing the step with a relatively shallow-sloped sigmoid function, but then, because the weights must take discrete values, no zero-energy solution exists. By changing the output unit from a sigmoid function to a step (to determine right and wrong answers) after the teaching, we can obtain solutions in which the network makes 2 or 3 mistakes (in 56 examples) but none in which performance is perfect.

Some of these difficulties can conceivably be overcome. In an article on mean-field theory in spin glasses [9], Ling et al. introduce a more sophisticated iteration procedure that greatly improves the chances of convergence. Their method, however, seems to require the reformulation of our problem in terms of neurons that can take only two values (on or off), and entails a significant amount of matrix diagonalization, making it less practical as the size of the net is increased. Scaling difficulties plague every other teaching algorithm as well, though, so while mean-field techniques do not at first sight offer clear improvement over full-scale annealing, they do deserve continued investigation.
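One pass of the self-consistency equations (3.1) can be sketched as a Boltzmann average of each weight over its allowed values, with all other weights held at their current averages. This is an illustration only; the quadratic energy in the test below is a stand-in, not the network energy of eq. (1.1), and in practice the update is repeated at each temperature until the averages stop changing.

```python
import math

def mean_field_update(avgs, values, energy, T):
    """One sweep of eq. (3.1): replace each average <X_i> by the Boltzmann
    average of x_i over the allowed discrete values, with every other
    weight fixed at its current average value."""
    for i in range(len(avgs)):
        # Boltzmann weight for each candidate value of X_i
        boltz = [math.exp(-energy(avgs[:i] + [v] + avgs[i + 1:]) / T)
                 for v in values]
        Z = sum(boltz)                                   # partition sum
        avgs[i] = sum(v * b for v, b in zip(values, boltz)) / Z
    return avgs
```

Note that for a value set symmetric about zero, the T -> infinity limit of this update sends every average to zero, matching the single high-temperature solution described above.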
4. Conclusions

We have shown that simulated annealing can be used to train discrete-valued weights; questions concerning scaling and generalization, touched on briefly here, remain to be explored in more detail. How well will the procedure perform in truly large-scale problems? The results presented here do not inspire excitement, but the algorithm is to a certain extent parallelizable, has already been implemented in hardware in a different context [10], and can be modified to sample far-away configurations. How well do the discrete-weight nets generalize? Is our intuition about integers and rules really correct? A systematic study of all these issues is still needed.

References

[1] D. Rumelhart and J. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, (Cambridge: MIT Press, 1986).
[2] J. Denker, et al., Complex Systems, 1 (1987) 877.
[3] S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi, Science, 220 (1983) 671.
[4] G. Bilbro, et al., North Carolina State preprint (1988).

[5] C. Peterson and J.R. Anderson, Complex Systems, 1 (1987) 995.
[6] H. Szu, Physics Letters, 122 (1987) 157.
[7] C.M. Soukoulis, G.S. Grest, and K. Levin, Physical Review Letters, 50 (1983) 80.
[8] J.J. Hopfield and D.W. Tank, Biological Cybernetics, 52 (1985) 141.
[9] David D. Ling, David R. Bowman, and K. Levin, Physical Review, B28 (1983) 262.
[10] J. Alspector and R.B. Allen, Advanced Research in VLSI, ed. Losleben (Cambridge: MIT Press, 1987).