Lending Direction to Neural Networks. Richard S. Zemel, Christopher K. I. Williams, Michael C. Mozer.


Lending Direction to Neural Networks

Richard S. Zemel, CNL, The Salk Institute, North Torrey Pines Rd., La Jolla, CA
Christopher K. I. Williams, Department of Computer Science, University of Toronto, Toronto, ONT M5S 1A4, Canada
Michael C. Mozer, Department of Computer Science & Institute of Cognitive Science, University of Colorado, Boulder, CO

Abstract

We present a general formulation for a network of stochastic directional units. This formulation is an extension of the Boltzmann machine in which the units are not binary, but take on values on a cyclic range, between 0 and 2π radians. This measure is appropriate to many domains, representing cyclic or angular values, e.g., wind direction, days of the week, phases of the moon. The state of each unit in a Directional-Unit Boltzmann Machine (DUBM) is described by a complex variable, where the phase component specifies a direction; the weights are also complex variables. We associate a quadratic energy function, and corresponding probability, with each DUBM configuration. The conditional distribution of a unit's stochastic state is a circular version of the Gaussian probability distribution, known as the von Mises distribution. In a mean-field approximation to a stochastic DUBM, the phase component of a unit's state represents its mean direction, and the magnitude component specifies the degree of certainty associated with this direction. This combination of a value and a certainty provides additional representational power in a unit. We present a proof that the settling dynamics for a mean-field DUBM cause convergence to a free energy minimum. Finally, we describe a learning algorithm and simulations that demonstrate a mean-field DUBM's ability to learn interesting mappings.

To appear in: Neural Networks

1 Introduction

Many kinds of information can naturally be represented in terms of angular, or directional, variables. A circular range forms a suitable representation for explicitly directional information, such as wind direction, as well as for information where the underlying range is periodic, such as days of the week or months of the year. In computer vision, tangent fields and optic flow fields are represented as fields of oriented line segments, each of which can be described by a magnitude and direction. Directions can also be used to represent a set of symbolic labels, e.g., object label A at 0, and object label B at π/2 radians. We discuss below some advantages of representing symbolic labels with directional units. These and many other phenomena can be usefully encoded using a directional representation: a polar coordinate representation of complex values in which the phase parameter indicates a direction between 0 and 2π radians.

We have devised a general formulation of networks of stochastic directional units. This formulation is a novel generalization of a Boltzmann machine (Ackley, Hinton and Sejnowski, 1985) in which the units are not binary, but instead take on directional values between 0 and 2π radians. This paper presents the theoretical basis of a directional-unit Boltzmann machine (DUBM), including an energy function and underlying probabilistic model. We also present a simple derivation of a deterministic version of a DUBM and a convergence proof for its dynamics. Finally, we describe a generalization of the Boltzmann machine learning algorithm, and demonstrate that a DUBM is able to learn interesting mappings.

2 Stochastic Directional Unit Boltzmann Machines

2.1 The DUBM Energy Function

We formulate a network composed of directional units, a DUBM, in a manner analogous to a Boltzmann machine with binary-state units. In the Hopfield spin-system (Hopfield, 1982) and in a Boltzmann machine, the units are binary, and the energy is defined to be the sum of the pairwise products of the units' states and their connecting weights. This energy can be written in quadratic form: E(s) = -1/2 s'Cs, where the weight matrix C is a real symmetric matrix, and s is the vector of the units' binary states in a particular global configuration.

In a DUBM each stochastic directional unit takes on values on the unit circle. We associate with unit j a random variable Z_j; a particular state of unit j is described by a complex number with magnitude one and direction, or phase, θ_j: z_j = e^{iθ_j}. The weights of a DUBM also take on complex values. The weight from unit k to unit j is w_jk = b_jk e^{iφ_jk}. The activation function of a unit is a generalization of the dot product to the complex domain: the net input x_j = z · w_j* = Σ_k z_k w_jk*, where z is the vector of the units' complex states in a particular global configuration, and the asterisk indicates the complex conjugate operation. Figure 1 shows an example of this multiplication operation. We constrain the weight matrix W to be Hermitian, W = W*', where the diagonal elements of the matrix are zero. Note that if the components are real, then W' = W, which is a real symmetric matrix.

Figure 1: This figure shows the effect of using complex weights. Unit 1 is in state z_1 = e^{i0}, and the weight from unit 1 to 2 is w_21 = 2e^{-iπ/2}. Multiplying the state of unit 1 by the complex conjugate of this weight doubles the magnitude and rotates the phase by π/2 radians: x_2 = z_1 w_21* = 2e^{iπ/2}.

Thus, the Hermitian form is a natural generalization of weight symmetry to the complex domain. This definition of W leads to a Hermitian quadratic form that generalizes the real quadratic form of the Hopfield energy function:

E(z) = -1/2 z'Wz* = -1/2 Σ_{j,k} z_j z_k* w_jk    (1)

Noest (1988) independently proposed this energy function. It is similar to that used in Fradkin, Huberman, and Shenker's (1978) generalization of the XY model of statistical mechanics to allow arbitrary weight phases φ_jk, and to coupled oscillator models, e.g., Baldi and Meir (1990). We discuss below the relationship between DUBMs and these models.

2.2 The Circular Normal Distribution

We can define a probability distribution over the possible states of a stochastic network by using the Boltzmann factor. In a binary-unit Boltzmann machine, a binary random variable S_j describes the state of unit j, and it has the relative probability distribution:

p(S_j = 1) / p(S_j = 0) = e^{-βE(S_j=1)} / e^{-βE(S_j=0)} = e^{βx_j}

where x_j = Σ_k w_jk s_k is the net input to unit j, and β = 1/T. The probability that a unit is on is thus described by the familiar logistic function: p(S_j = 1) = (1 + e^{-βx_j})^{-1} = σ(βx_j).

We use a similar method to define a probability distribution over the states of directional units; it turns out that this distribution is well-known for directional data. In a DUBM, we can describe the energy as a function of the state of a particular unit j:

E(Z_j = z_j) = -1/2 [Σ_k z_j z_k* w_jk + Σ_k z_k z_j* w_kj] = -1/2 [z_j x_j* + (z_j x_j*)*]    (2)

where

x_j = Σ_k z_k w_jk*    (3)

We can consider x_j as the net input to unit j; it captures the relevant information in the local field of the unit. We denote the magnitude and phase of x_j as a_j and ψ_j respectively, such that

x_j = a_j e^{iψ_j}    (4)

Using Equations 2, 3, and 4, we find that E(Z_j = z_j) = -a_j cos(θ_j - ψ_j). Applying the Boltzmann factor, the probability that unit j is in a particular state is proportional to:

p(Z_j = z_j) ∝ e^{-βE(Z_j = z_j)} = e^{βa_j cos(θ_j - ψ_j)}    (5)

This probability distribution for a unit's state corresponds to a distribution known as the von Mises, or circular normal, distribution (Mardia, 1972), which in many ways acts on a circular range as the Gaussian distribution acts on a linear (i.e., non-circular) range.[1] Two parameters completely characterize this distribution: a mean direction μ ∈ (0, 2π] and a concentration parameter m > 0 that behaves like the reciprocal of the variance of a Gaussian distribution on a linear random variable. The probability density function of a circular normal random variable Z is:

p(θ; μ, m) = [1 / (2π I_0(m))] e^{m cos(θ - μ)}    (6)

The normalization factor I_0(m) is the modified Bessel function of the first kind and order zero.[2] From Equation 5, we see that if a unit adopts states according to its contribution to the system energy, it will be a circular normal variable with mean direction ψ_j and concentration parameter m_j = βa_j. These parameters are directly determined by the net input to the unit.

Figure 2 shows a circular normal density function for Z_j, the state of unit j. This figure also shows the expected value of its stochastic state, which we define as:

y_j = <Z_j> = r_j e^{iθ_j}    (7)

[1] The circular normal resembles the Gaussian in that (a) the maximum likelihood estimate of the mean direction of a random variable x is the sample mean if and only if x is a circular normal random variable; and (b) the circular normal distribution is the maximum entropy distribution if a mean direction and circular variance are fixed. Some of the other important properties of the Gaussian, such as the central limit theorem, apply to another circular distribution, the wrapped normal.

[2] An integral representation of this function is I_0(m) = (1/2π) ∫_0^{2π} e^{m cos φ} dφ. It can be computed by numerical routines.
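As a minimal sketch (the helper names are mine, not the paper's), the density of Equation 6 can be evaluated with I_0 computed from the integral representation in footnote 2, so no special-function library is needed:

```python
import numpy as np

# Circular normal (von Mises) density of Eq. 6:
#   p(theta; mu, m) = exp(m * cos(theta - mu)) / (2*pi*I0(m)).
# I0 is evaluated by the midpoint rule applied to footnote 2's integral.

def bessel_i0(m, n=4096):
    # I0(m) = (1/2pi) * integral_0^{2pi} exp(m*cos(phi)) dphi
    phi = (np.arange(n) + 0.5) * (2 * np.pi / n)
    return np.exp(m * np.cos(phi)).mean()

def von_mises_pdf(theta, mu, m):
    return np.exp(m * np.cos(theta - mu)) / (2 * np.pi * bessel_i0(m))

# Sanity checks: the density integrates to one over a full cycle, and
# its mode sits at the mean direction mu.
theta = (np.arange(4096) + 0.5) * (2 * np.pi / 4096)
p = von_mises_pdf(theta, mu=np.pi / 4, m=3.0)
mass = p.sum() * (2 * np.pi / 4096)
peak = theta[np.argmax(p)]
```

The midpoint rule is spectrally accurate here because the integrand is smooth and periodic, which is why a plain grid sum suffices in place of a Bessel routine.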

Figure 2: This figure shows a circular normal density function laid over a unit circle. The dots along the circle represent samples of the circular normal random variable Z_j. The expected direction of Z_j, θ_j, is π/4; its resultant length is r_j, which increases with m_j. If m_j is large, then most of the samples of Z_j will be concentrated on a small arc about θ_j, and r_j → 1. As m_j decreases, the sample spread increases, and r_j → 0.

where θ_j, the phase of y_j, is the mean direction and r_j, the magnitude of y_j, is the resultant length. For a circular normal random variable, its expected direction θ_j is simply the mean direction of the random variable:

θ_j = ψ_j    (8)

The calculation of r_j, the resultant length, is more involved. From Mardia (1972) we find that[3]:

r_j = I_1(m_j) / I_0(m_j)    (9)

From the figure we can see that when most of the samples of Z_j are concentrated on a small arc about the mean, r_j will approach length one. This corresponds to a large concentration parameter (m_j = βa_j), which means that in Equation 6, small deviations from the mean direction will produce much smaller exponents than that of the mean, and the probability distribution will be concentrated about the mean. Conversely, for small m_j, the distribution approaches the uniform distribution on the circle, and the resultant length falls toward zero. For a uniform distribution, r_j = 0. A plot of r_j against m_j is shown in Figure 3.

Note that the concentration parameter for a unit's circular normal density function is the product of a_j, the magnitude of its input, and β, the reciprocal of the system temperature.

[3] An integral representation of the modified Bessel function of the first kind and order k is I_k(m) = (1/π) ∫_0^π e^{m cos φ} cos(kφ) dφ. Note that I_1(m) = dI_0(m)/dm.
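The mapping of Equation 9 can be sketched numerically, again using only the integral representations from the footnotes (the function names are mine):

```python
import numpy as np

# Resultant length r = I1(m)/I0(m) of Eq. 9, with both Bessel functions
# evaluated by the midpoint rule applied to footnote 3's integral.

def bessel_i(k, m, n=4096):
    # I_k(m) = (1/2pi) * integral_0^{2pi} exp(m*cos(phi)) * cos(k*phi) dphi
    phi = (np.arange(n) + 0.5) * (2 * np.pi / n)
    return (np.exp(m * np.cos(phi)) * np.cos(k * phi)).mean()

def resultant_length(m):
    return bessel_i(1, m) / bessel_i(0, m)

# r falls toward 0 as the distribution approaches uniform, and rises
# toward 1 as the samples concentrate about the mean direction.
r_small = resultant_length(0.01)
r_large = resultant_length(20.0)
```

This reproduces the monotonic curve of Figure 3: near-zero resultant length for a near-uniform distribution and a value approaching one for a tightly concentrated one.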

Higher temperatures will thus have the effect of making this distribution more uniform, just as they do in a binary-unit Boltzmann machine.

Figure 3: This figure shows the function defined by Equation 9 that maps m_j, the concentration parameter associated with the circular normal distribution of unit j, to r_j, its resultant length.

3 Emergent Properties of a DUBM

3.1 Evidence accumulation and representing certainty

A network of directional units as defined above contains two important emergent properties. The first property is that the magnitude of the net input to unit j describes the extent to which its various inputs "agree". Intuitively, one can think of each component z_k w_jk* of the sum that comprises x_j as predicting a phase for unit j. When the phases of these components are equal, the magnitude of x_j, a_j, is maximized. If these phase predictions are far apart, then they will act to cancel each other out, and produce a small a_j. Given x_j, we can compute the expected value of the output of unit j (Equation 11). The expected direction of the unit roughly represents the weighted average of the phase predictions, while the resultant length, a number between zero and one, is a monotonic function of a_j and hence describes the agreement between the various predictions. Figure 4 illustrates this property with two examples showing the computation of the input and expected output of a unit in a simple 3-unit DUBM.

The key idea here is that the resultant length directly describes the degree of certainty in the expected direction of unit j. Thus, a DUBM naturally incorporates a representation of the system's confidence in a value. This ability to combine several sources of evidence, and not

only represent a value but also describe the certainty of that value, is an important property that may be useful in a variety of domains.

3.2 Rotation invariance

The second emergent property is that the DUBM energy is globally rotation-invariant: E is unaffected when the same rotation is applied to all units' states in the network. This can easily be shown from Equation 1. Rotating a unit state z_j by some phase ρ is equivalent to multiplying z_j by e^{iρ}. If we apply this to each unit, then each e^{iρ} will be cancelled out in the z_j z_k* term in Equation 1. For each DUBM configuration, there is thus an equivalence class of configurations which have the same energy.

In a similar way, we find that the magnitude of x_j is rotation-invariant. That is, when we obtain a new input x'_j by translating the phases of all of the other units by some phase ρ, the magnitude a_j is unaffected:

x'_j = Σ_k b_jk e^{i(θ_k + ρ - φ_jk)} = a_j e^{i(ψ_j + ρ)} = x_j e^{iρ}

This property underlies one of the key advantages of the representation: both the magnitude of a unit's state and the system energy depend on the relative rather than absolute phases of the units.

4 Deterministic Directional Unit Boltzmann Machines

4.1 Approximating the stochastic system

Calculating properties of a large stochastic system is time-consuming: a simulation of such a system must be allowed to run for a sufficient length of time to build up statistics of the probability distribution across the possible configurations. Just as in deterministic binary-unit Boltzmann machines (Peterson and Anderson, 1987; Hinton, 1989), we can greatly reduce the computational time required if we invoke the mean-field approximation, which states that once the system has reached equilibrium, the stochastic variables can be approximated by their mean values. In this approximation, the variables are treated as independent, and the system probability distribution is simply the product of the probability distributions for the individual units.[4]

Gislen, Peterson, and Soderberg (1992) originally proposed a mean-field theory for networks of directional (or "rotor") units, but only considered the case of real-valued weights. They derived the mean-field consistency equations by using the saddle-point method to approximate the partition function of the system, ∫ e^{-βE(ξ)} dξ, where ξ indexes possible configurations of the system. Our approach provides an alternative, perhaps more intuitive derivation, due to the use of the circular normal distribution.

[4] The tradeoff for the increase in computational speed of the mean-field approximation is a decrease in the number of states that can be represented, and a loss of information concerning the correlations between units' states.
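The rotation-invariance property of Section 3.2 is easy to check numerically. The following sketch (my own check, not from the paper) builds a random Hermitian weight matrix and verifies that a global phase rotation leaves both the energy of Equation 1 and every net-input magnitude a_j unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# Random Hermitian weight matrix with zero self-weights.
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
W = (A + A.conj().T) / 2
np.fill_diagonal(W, 0)

def energy(z):
    # E(z) = -1/2 * sum_{j,k} z_j z_k^* w_jk  (Eq. 1; real for Hermitian W)
    return float(np.real(-0.5 * z @ (W @ np.conj(z))))

def net_input(z):
    # x_j = sum_k z_k w_jk^*  (Eq. 3)
    return np.conj(W) @ z

z = np.exp(1j * rng.uniform(0, 2 * np.pi, n))
z_rot = z * np.exp(1j * 1.234)       # rotate every unit's phase by rho

e0, e1 = energy(z), energy(z_rot)
a0, a1 = np.abs(net_input(z)), np.abs(net_input(z_rot))
```

Because the e^{iρ} factors cancel in every z_j z_k* term, e0 equals e1 and a0 equals a1 for any rotation ρ.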

[Figure 4, top diagram: w_31 = (5/2)e^{iπ/4}, w_32 = 2e^{i0}; z_1 = e^{i3π/4}, z_2 = e^{iπ/2}; x_3 = (5/2)e^{iπ/2} + 2e^{iπ/2} = (9/2)e^{iπ/2}; y_3 = <Z_3> = [I_1(9/2)/I_0(9/2)] e^{iπ/2} ≈ 0.9 e^{iπ/2}. Bottom diagram: z_1 = e^{-iπ/4}, z_2 = e^{iπ/2}; x_3 = (5/2)e^{-iπ/2} + 2e^{iπ/2} = (1/2)e^{-iπ/2}; y_3 = <Z_3> = [I_1(1/2)/I_0(1/2)] e^{-iπ/2} ≈ 0.25 e^{-iπ/2}.]

Figure 4: These two diagrams graphically depict the computation of the input and expected output of a unit. Unit 3 is connected to units 1 and 2. Its input, x_3, is computed as described in Equation 3, while its expected value, y_3, can be computed directly from its input, as described in Equations 8 and 9. In the top diagram, the inputs from units 1 and 2 "agree", i.e., their contributions to x_3 have the same phase. The second diagram shows the effect of rotating the state of unit 1 by π: the phases of the two components of x_3 no longer agree, resulting in a smaller magnitude, and hence a smaller resultant length.
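The two cases of Figure 4 can be reproduced in a few lines (a sketch using the figure's weights and states):

```python
import numpy as np

# Unit 3 receives input from units 1 and 2 through
# w_31 = (5/2)e^{i pi/4} and w_32 = 2e^{i0}.
w31 = 2.5 * np.exp(1j * np.pi / 4)
w32 = 2.0 * np.exp(1j * 0)

def x3(z1, z2):
    # x_3 = z_1 w_31^* + z_2 w_32^*  (Eq. 3)
    return z1 * np.conj(w31) + z2 * np.conj(w32)

# Top diagram: both contributions predict phase pi/2, so they reinforce.
x_agree = x3(np.exp(1j * 3 * np.pi / 4), np.exp(1j * np.pi / 2))
# Bottom diagram: unit 1 rotated by pi, so the predictions nearly cancel.
x_disagree = x3(np.exp(-1j * np.pi / 4), np.exp(1j * np.pi / 2))
print(abs(x_agree), abs(x_disagree))  # 4.5 vs 0.5
```

The large magnitude in the first case and the small magnitude in the second feed Equation 9, giving unit 3 a confident versus a nearly uniform output distribution, exactly the evidence-accumulation behavior described above.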

We can directly describe these mean values based on the circular normal interpretation. We still denote the net input to a unit j as x_j:

x_j = Σ_k y_k w_jk* = a_j e^{iψ_j}    (10)

Once equilibrium has been reached, the state of unit j is y_j, the expected value of Z_j given the mean-field approximation. We have already shown that Z_j is a circular normal random variable with mean direction ψ_j and concentration parameter βa_j; thus from Equations 8 and 9:

y_j = r_j e^{iθ_j} = [I_1(βa_j) / I_0(βa_j)] e^{iψ_j}    (11)

In the stochastic as well as the deterministic system, units evolve to minimize the free energy, F = <E> - TH. We calculate H, the entropy of the system, using the circular normal distribution and the mean-field approximation. The entropy of a circular normal random variable Z is H(Z) = -mr + log(2π I_0(m)). Since under the mean-field approximation the units are considered independent, the entropy of the mean-field DUBM is simply the sum of the entropies of the individual units' states. The mean-field free energy is:

F^MF = -Σ_{j<k} r_j r_k b_jk cos(θ_j - θ_k + φ_jk) - T Σ_j [-βa_j r_j + log(2π I_0(βa_j))]    (12)

We can derive mean-field consistency equations for x_j and y_j by minimizing F^MF with respect to each variable independently, i.e., by finding where the partial derivatives of F^MF with respect to each of these variables equal zero. The resulting equations match the mean-field equations (Equations 10 and 11) derived directly from the circular normal probability density function. They also match the special case derived by Gislen et al. for real-valued weights.

4.2 Running a deterministic DUBM

We have implemented a DUBM using the mean-field approximation. The goal is to solve for a consistent set of x and y values; there are several possible methods to settle the network to this state. One is to perform asynchronous updates of Equations 10 and 11. We have selected an alternative method, performing synchronous updates of a discrete-time approximation of a set of differential equations based on the net input to each unit j. We update the x_j variables using the following differential equation:

dx_j/dt = -x_j + Σ_k y_k w_jk*    (13)

which has Equation 10 as its steady-state solution. In order to adapt x_j, we actually adapt its real and imaginary components: x_j^R = a_j cos ψ_j and x_j^I = a_j sin ψ_j. The steady-state solutions for these two quantities are equivalent to, and

follow directly from, Equation 10, and their update equations follow directly from Equation 13:

dx_j^R/dt = -x_j^R + Σ_k Re(y_k w_jk*) = -x_j^R + Σ_k r_k b_jk cos(θ_k - φ_jk)

dx_j^I/dt = -x_j^I + Σ_k Im(y_k w_jk*) = -x_j^I + Σ_k r_k b_jk sin(θ_k - φ_jk)

We use synchronous updates of these differential equations. At each step, we update x_j^R and x_j^I:

x_j^R(t) = (1 - ε) x_j^R(t-1) + ε Σ_k r_k(t-1) b_jk cos(θ_k(t-1) - φ_jk)

x_j^I(t) = (1 - ε) x_j^I(t-1) + ε Σ_k r_k(t-1) b_jk sin(θ_k(t-1) - φ_jk)

where 0 < ε < 1 is a variable step-size parameter that is selected to encourage the system to approach a steady-state solution while avoiding oscillations. We then compute the magnitude and phase of the input to unit j: a_j(t) = sqrt[(x_j^R(t))^2 + (x_j^I(t))^2] and ψ_j(t) = arctan[x_j^I(t) / x_j^R(t)]. These are then used to directly compute the state r_j(t) and θ_j(t) of each unit j (see Equation 11). In the simulations, we typically set ε = 0.2. We also use simulated annealing to help find good minima of the free energy F. This is implemented using an annealing schedule which exponentially lowers the temperature until T = 1.

Just as for the Hopfield binary-state network, it can be shown that the free energy F^MF always decreases during the dynamical evolution described in Equation 13. The equilibrium solutions are free energy minima. The Appendix contains a convergence proof for F^MF, in the style of Hopfield (1984).

5 Learning in DUBMs

The units in a DUBM can be arranged in a variety of architectures. The appropriate method for determining weight values for the network depends on the particular class of network architecture. In a standard autoassociative network, containing a single set of interconnected units as in a Hopfield network, the weights can be set directly from the training patterns. If hidden units are required to perform a task, then an algorithm for learning the weights is required.

We have explored two learning algorithms. The first is a generalization of recurrent back-propagation (Pineda, 1987; Almeida, 1987) to a network of directional units. The complex representations in a DUBM complicate this algorithm, as it requires several matrix inversions. The second algorithm generalizes the Boltzmann machine training algorithm (Ackley, Hinton and Sejnowski, 1985; Peterson and Anderson, 1987) to these networks. This algorithm is considerably simpler; we thus focus on it here, and results obtained from it are described in the following section.
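The synchronous settling procedure of Section 4.2 can be sketched as follows (a minimal two-unit example of my own, at T = 1, with the initial resultant lengths assumed to be one and I_1/I_0 taken from the footnotes' integral representations):

```python
import numpy as np

def bessel_i(k, m, n=4096):
    # I_k(m) = (1/2pi) * integral_0^{2pi} exp(m*cos(phi)) * cos(k*phi) dphi
    phi = (np.arange(n) + 0.5) * (2 * np.pi / n)
    return (np.exp(m * np.cos(phi)) * np.cos(k * phi)).mean()

def settle(W, theta0, steps=60, eps=0.2):
    y = np.exp(1j * theta0)              # assumed start: unit resultant lengths
    x = np.conj(W) @ y                   # x_j = sum_k y_k w_jk^*  (Eq. 10)
    for _ in range(steps):
        x = (1 - eps) * x + eps * (np.conj(W) @ y)   # discrete-time Eq. 13
        r = np.array([bessel_i(1, a) / bessel_i(0, a) for a in np.abs(x)])
        y = r * np.exp(1j * np.angle(x))             # Eq. 11
    return y

# Two units coupled by w_12 = 3 e^{i pi/3}: the free energy term
# -r1 r2 b12 cos(theta1 - theta2 + phi12) is lowest when
# theta_1 - theta_2 = -pi/3, whatever the global rotation.
W = np.array([[0, 3 * np.exp(1j * np.pi / 3)],
              [3 * np.exp(-1j * np.pi / 3), 0]])
y = settle(W, np.array([0.3, 2.0]))
rel_phase = np.angle(y[0] * np.conj(y[1]))
```

Only the relative phase is pinned down; the rotation-invariance property leaves the global phase wherever the initial conditions put it.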

We assume that the DUBM contains two sets of units, visible and hidden, and concentrate on a particular type of architecture, in which the visible units are split into two sets, input and output. The network is trained using a set of patterns, each specifying values for the input units and target values for each of the output units.

We can regard the target output values as describing a desired set of probabilities. If we take the output units as representing a probability distribution across a set of directional values, then an appropriate error measure is the relative entropy of these probability distributions. This is the asymmetric divergence, or Kullback information measure, used in binary-unit Boltzmann machines. The derivation of the learning rule for a DUBM parallels that for the standard Boltzmann machine. Let ω and α index possible configurations of the output and input units, respectively. The goal is to make the network learn associations from α to ω such that the network's distribution P_{ω|α} resembles the desired distribution R_{ω|α} for each α. If the possible inputs α occur with probability p_α, then the error measure, G, is:

G = ∫∫ p_α R_{ω|α} log(R_{ω|α} / P_{ω|α}) dω dα
  = G_0 - ∫∫ p_α R_{ω|α} log P_{ω|α} dω dα

In the stochastic network, P_{ω|α} = e^{-βF_{ωα}} / e^{-βF_α}, where F_{ωα} is the minimum free energy when the input units are in configuration α and the output units are clamped to configuration ω, and F_α is the minimum free energy when only the input units are clamped. If we assume that the training set specifies the occurrence probabilities p_α, as well as the probabilities R_{ω|α} for the ω and α, then G simplifies to the following expression:

G = G_0 + β <F_{ωα} - F_α>

Thus, the objective reduces to minimizing the difference between F_{ωα} and F_α, averaged over the training set. For the mean-field DUBM, we use this same objective, where the free energies are now the mean-field free energies, as described in Section 4.1 (see Equation 12). We differentiate this objective function with respect to the magnitude and direction of the complex-valued weights to determine the weight update rule. The partial derivative of G with respect to a weight is the difference between the partials of the two mean-field free energies, F^MF_{ωα} and F^MF_α, with respect to the weight. On a given training case, the required derivatives are given by:

∂F^MF/∂b_jk = -r_j r_k cos(θ_j - θ_k + φ_jk)

∂F^MF/∂φ_jk = r_j r_k b_jk sin(θ_j - θ_k + φ_jk)

Note that the derivatives only involve the explicit derivatives of F^MF with respect to the weight parameters, because we have settled to equilibrium (Hinton, 1989; Baldi and Pineda, 1991). The learning algorithm uses these gradients to find weight values that will minimize the objective G over a training set.

6 Experimental Results

The focus of this paper has been on the theoretical ideas underlying a generalization of a Boltzmann machine to directional units. We present below some illustrative examples to show that an adaptive network of directional units can be used in a range of paradigms, including associative memory, input/output mappings, and pattern completion.

6.1 Simple autoassociative DUBMs

The first set of experiments considers a simple autoassociative DUBM, which contains no hidden units, and in which the units are fully connected. As in a standard Hopfield network, the weights are set directly from the training patterns; they equal the superposition of the outer products of the patterns:

w_jk = (1/N) Σ_{p=1}^{N} y_j^p (y_k^p)*    (14)

where p indexes the N training patterns. We have experimented with a number of different networks. Figure 5 shows a network containing 30 units that has stored 3 patterns. The weights shown are generated as described above; the figure also shows two examples of the network settling from a corrupted version of one of the stored patterns to a state near the pattern. These patterns thus form stable attractors, as the network can perform pattern completion and clean-up from noisy inputs. The rotation-invariance property of the energy function allows any rotated version of a training pattern to also act as an attractor.

We have run several experiments with simple autoassociative DUBMs. The empirical results parallel those for binary-unit autoassociative networks. Using an error criterion that measures the distance from an equilibrium state to the closest stored pattern, we find that the mean error rapidly increases as the number of stored patterns increases. For example, the performance of the network shown in Figure 5 rapidly degrades for more than 4 orthogonal patterns; the patterns themselves no longer act as fixed points, and many random initial states end in states far from any stored pattern. In addition, more orthogonal patterns can be stored than random patterns. See Noest (1988) for an analysis of the capacity of an autoassociative DUBM with sparse and asymmetric connections.

6.2 Learning input/output mappings

We have also used the mean-field DUBM learning algorithm to learn the weights in networks containing hidden units. Due to the presence of the hidden units, the network must settle via the synchronous updating procedure described in Section 4.2 on each training pattern, first with the input clamped, and then with both the input and target output clamped.

We have experimented with a task that is well-suited to a directional representation. There is a single-jointed robot arm, anchored at a point, as shown in Figure 6. The input consists of two angles: the angle between the first arm segment and the positive x-axis (θ),

[Figure 5 panels: Patterns; Example 1 (iterations 0, 10, 21); Example 2 (iterations 0, 10, 30); Weights.]

Figure 5: This figure shows the patterns and weights stored in an autoassociative network of 30 directional units. The area of each circle is proportional to the magnitude of the complex value, and the orientation of the notch indicates its phase. The weights are the outer product of the 3 patterns shown at the top. Each row shows the incoming weights to a particular unit. Note that the self-weights (along the diagonal) have zero magnitude, and the symmetry of the diagram about the diagonal reflects the Hermitian property of the weights. On the right are two examples of the network settling to a fixed point. The first example ends up near the first of the three stored patterns after 21 iterations (modulo a global rotation). The second example is considerably corrupted; it requires 30 iterations to approach the second stored pattern.
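The weight-setting rule of Equation 14, used for the network in Figure 5, can be sketched as (the helper name is mine):

```python
import numpy as np

# Autoassociative storage: superimpose the complex outer products of the
# N stored patterns (Eq. 14) and zero the self-weights.

def store_patterns(patterns):
    # patterns: (N, n) array of unit-magnitude complex states
    N = patterns.shape[0]
    w = patterns.T @ np.conj(patterns) / N   # w_jk = (1/N) sum_p y_j^p (y_k^p)*
    np.fill_diagonal(w, 0)
    return w

rng = np.random.default_rng(0)
n_units, n_patterns = 30, 3
patterns = np.exp(1j * rng.uniform(0, 2 * np.pi, (n_patterns, n_units)))
w = store_patterns(patterns)
```

By construction the stored matrix is Hermitian with zero diagonal, matching the structure visible in the weight panel of Figure 5.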

[Figure 6: the arm, with segments A and B.]

Figure 6: A graphical display of a training case for the robot arm problem. The arm is anchored on the horizontal axis, and the lengths of its two segments, A and B, are fixed. The two angles, θ and φ, are given as input for each training case, and the target output for the net is the angle α.

and the angle between the two arm segments (φ). The two segments each have a fixed length, A and B; these are not explicitly given to the network. The output is the angle between the line connecting the two ends of the arm and the x-axis (α). This target angle is related in a complex, non-linear way to the input angles: the network must learn to approximate the following trigonometric relationship:

α = arctan[ (A sin θ - B sin(θ + φ)) / (A cos θ - B cos(θ + φ)) ]

With 500 training cases, a DUBM with 2 input units and 8 hidden units was able to learn the task so that it could accurately estimate α for novel patterns. The learning required 200 iterations of the conjugate gradient training algorithm. On each of 100 testing patterns, the resultant length of the output unit exceeded 0.85, and the mean error on the angle was less than 0.05 radians. The network can also learn the task with as few as 5 hidden units, with a concomitant decrease in learning speed. The compact nature of this network shows that the directional units form a natural, efficient representation for this problem.

6.3 Complex pattern completion

Our earlier work describes a large-scale DUBM that attacks a difficult problem in computer vision: image segmentation. In MAGIC (Mozer, Zemel and Behrmann, 1992; Mozer et al., 1992), directional values are used to represent alternative labels that can be assigned to image features. The goal of MAGIC is to learn to assign appropriate object labels to a set of image features (e.g., edge segments) based on a set of examples. The idea is that the features of a given object should have consistent phases, with each object taking on its own

phase. The units in the network are arranged into two layers, feature and hidden, and the computation proceeds by randomly initializing the phases of the units in the feature layer, and settling on a labeling through a relaxation procedure. The units in the hidden layer learn to detect spatially local configurations of the image features that are labeled in a consistent manner across the training examples. MAGIC successfully learns to segment novel scenes consisting of overlapping geometric objects.

The emergent DUBM properties described in Section 3 are essential to MAGIC's ability to perform this task. The evidence accumulation property allows units to act as feature detectors, expressing constraints in their complex-valued weight patterns. The complex weights are necessary in MAGIC, as the weights encode statistical regularities in the relationships between image features, e.g., that two features typically belong to the same object (i.e., have similar phase values) or to different objects (i.e., are out of phase). The fact that a unit's resultant length reflects the certainty in a phase label allows the system to decide which phase labels to use when updating labels of neighboring features: the initially random phases are ignored, while confident labels are propagated. Finally, the rotation-invariance property allows the system to assign labels to features in a manner consistent with the relationships described in the weights, where it is the relative rather than absolute phases of the units that are important.

7 Discussion

We have presented a general formulation for networks of directional units, called directional-unit Boltzmann machines, or DUBMs. This formulation is based on an underlying probabilistic model, and we have developed and experimented with a mean-field version of the stochastic system. A deterministic DUBM provides a natural representation for both computed values (the phase of a unit) and the certainty associated with these values (the magnitude of the unit's state). This formulation provides a well-founded activation function for directional units, based on the mean-field approximation, as well as a convergence proof for the system dynamics. In addition, this type of network has many potential applications: it is well-suited to data that can naturally be represented as directions, such as the vector fields that arise in optic flow or tangent field measurements in computer vision, as well as to information that can take advantage of the cyclic nature of the directional representation, and to tasks where relative rather than absolute information is important.

Our work on directional units fits in a long tradition of work on phase-based representations. The well-known XY models of statistical mechanics, as well as coupled oscillator models, describe the dynamics of a population of oscillators in which each oscillator is described by a single parameter, its phase. Several researchers have recently studied such systems of coupled oscillators (Kammen, Holmes and Koch, 1989; Baldi and Meir, 1990), and have applied these methods to model experimental findings concerning oscillations and phase-locking of cortical neurons in cats (Gray et al., 1989; Eckhorn et al., 1988). This work relates to the suggestion (von der Malsburg, 1981) that synchronized oscillations provide a biologically plausible mechanism of labeling features through temporal correlations among

neural signals. Other related work, mentioned earlier, includes that of Gislen, Peterson, and Soderberg (1992), which describes a mean-field analysis for a system of directional units and considers the general case where units assume multi-dimensional directional states on a hypersphere. Also, Noest (1988) proposed an energy function for phase-based units and analyzed the capacity of an autoassociative network that minimizes this energy function.

Several important distinctions exist between our work and these other models. First, just as a binary-unit Boltzmann machine extends the representational power of the Hopfield network by adding hidden units, our formulation extends the power of autoassociative coupled-oscillator models. A DUBM can contain layers of hidden units, which enables the network to solve tasks that require a representation of higher-order relations between elements of input patterns, such as the robot-arm task described above.[5] In addition, our formulation allows the weights to be complex-valued, which is important for encoding desired phase relationships between two directional units. This is a crucial property for systems looking for patterns of phase relationships, as was discussed in Section 6.3. One could conceivably encode phase relationships with scalar weights, but this would require additional connections between units, as well as an additional term in the energy function. Finally, our work introduces the probabilistic interpretation of networks of directional units. We introduce the notion that the density function across directions for a given unit can be described by the circular normal, and show that the mean-field equations for such a unit correspond to the expected value of this distribution. This approach greatly simplifies the derivation of the deterministic system dynamics.

Our work demonstrates that networks composed of directional units can learn interesting mappings. MAGIC, as described in (Mozer et al., 1992), is a large-scale DUBM that learns to segment novel images
composed of a wide range of overlapping geometric objects.

One extension of the current model is appropriate for tasks that involve learning a mapping from a set of input patterns to a set of output patterns. We can consider a different type of architecture for these tasks, in which the directional units are used in a strictly feedforward manner; such a network can then be trained by a modified form of the back-propagation training algorithm. In this case, the convergence properties of a DUBM, as well as the correspondence between state probabilities and energies, no longer apply. However, this approach has some advantages over previous methods of training networks of complex units, e.g., (Benvenuto and Piazza, 1992; Widrow, McCool and Ball, 1975). Several interesting properties of DUBMs, such as the rotation invariance and the interpretation of the magnitude of a unit as capturing the agreement of its inputs, are embodied in the activation function of the network and are hence retained.

We are currently exploring an important extension to the model which will expand its representational capabilities. Radford Neal (personal communication, 1992) has suggested a method for combining binary and directional units. This expanded representation may be

[5] Note that deterministic DUBMs share the problem of deterministic binary-unit Boltzmann machines, in that their performance drops off in deep networks (Galland, 1992). This is largely due to the fact that the mean-field approximation breaks down as multiple layers introduce increasing correlations among the units.

useful in domains with directional data that is not present everywhere. For example, it can be directly applied to the object-labeling problem explored in MAGIC. The binary aspect of the unit can describe whether a particular image feature is present or absent. This may enable the system to handle various complications, particularly labeling across gaps along the contour of an object. Finally, we are applying a DUBM network to the interesting and challenging problem of time-series prediction of wind directions.

8 Acknowledgements

The authors thank Geoffrey Hinton for his generous support and guidance. We thank Radford Neal for many valuable discussions that significantly influenced this work. We also thank Peter Dayan, Conrad Galland, Sue Becker, Steve Nowlan, and the other members of the Connectionist Research Group at the University of Toronto for helpful comments regarding this work. This research was supported by a grant from the Information Technology Research Centre of Ontario to Geoffrey Hinton, and NSF Presidential Young Investigator award IRI- and grant 90-21 from the James S. McDonnell Foundation to MM.

References

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9:147-169.

Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings, First International Conference on Neural Networks, volume 2, pages 609-618, San Diego, CA. IEEE.

Baldi, P. and Meir, R. (1990). Computing with arrays of coupled oscillators: An application to preattentive texture discrimination. Neural Computation, 2(4):458-471.

Baldi, P. and Pineda, F. (1991). Contrastive learning and neural oscillations. Neural Computation, 3(4):526-533.

Benvenuto, N. and Piazza, F. (1992). On the complex backpropagation algorithm. IEEE Transactions on Signal Processing, 40(4):967-969.

Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual
cortex. Biological Cybernetics, 60:121-130.

Fradkin, E., Huberman, B. A., and Shenker, S. H. (1978). Gauge symmetries in random magnetic systems. Physical Review B, 18(9):4789-4814.

Galland, C. C. (1992). Self-organization using Deterministic Boltzmann Machine learning. PhD thesis, Physics Department of the University of Toronto.

Gislen, L., Peterson, C., and Soderberg, B. (1992). Rotor neurons: Basic formalism and dynamics. Neural Computation, 4(5):737-745.

Gray, C. M., Koenig, P., Engel, A. K., and Singer, W. (1989). Oscillatory responses in cat visual cortex exhibiting inter-columnar synchronization which reflects global properties. Nature (London), 338:334-337.

Hinton, G. E. (1989). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1(2):143-150.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79:2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences USA, 81:3088-3092.

Kammen, D. M., Holmes, P. J., and Koch, C. (1989). Cortical oscillations in neuronal networks: Feed-back versus local coupling. In Models of Brain Function, pages 273-284. Cambridge University Press, Cambridge.

Mardia, K. V. (1972). Statistics of Directional Data. Academic Press, London.

Mozer, M. C., Zemel, R. S., and Behrmann, M. (1992). Learning to segment images using dynamic feature binding. In Advances in Neural Information Processing Systems 4, pages 436-443. Morgan Kaufmann, San Mateo, CA.

Mozer, M. C., Zemel, R. S., Behrmann, M., and Williams, C. K. I. (1992). Learning to segment images using dynamic feature binding. Neural Computation, 4(5):650-665.

Noest, A. J. (1988). Phasor neural networks. In Neural Information Processing Systems, pages 584-591, New York. AIP.

Peterson, C. and Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1:995-1019.

Pineda, F. J. (1987). Generalization of back propagation to recurrent and higher order neural networks. In Proceedings of IEEE Conference on Neural Information Processing Systems, Denver, Colorado. IEEE.

von der Malsburg, C. (1981). The correlation theory of brain function. Technical Report 81-2, Department of Neurobiology, Max Planck
Institute for Biophysical Chemistry, Goettingen.

Widrow, B., McCool, J., and Ball, M. (1975). The complex LMS algorithm. Proceedings of the IEEE, pages 719-720.

A Appendix

To show that F^{MF} always decreases, we first differentiate E^{MF} with respect to time:

E^{MF} = -\frac{1}{2} \sum_{j,k} y_j^* y_k w_{jk}

\frac{dE^{MF}}{dt} = -\frac{1}{2} \sum_{j,k} \left( \frac{dy_j^*}{dt} y_k w_{jk} + y_j^* \frac{dy_k}{dt} w_{jk} \right)
= -\frac{1}{2} \sum_j \left[ \frac{dy_j^*}{dt} \left( \sum_k y_k w_{jk} \right) + \frac{dy_j}{dt} \left( \sum_k y_k w_{jk} \right)^* \right]    (15)

We then differentiate H^{MF} with respect to time:

H^{MF} = \sum_j \left[ -a_j r_j + \log(2\pi I_0(a_j)) \right]

\frac{dH^{MF}}{dt} = \sum_j \left[ -\frac{da_j}{dt} r_j - a_j \frac{dr_j}{dt} + \frac{I_1(a_j)}{I_0(a_j)} \frac{da_j}{dt} \right] = -\sum_j a_j \frac{dr_j}{dt}    (16)

where we use the fact that \frac{dI_0(m)}{dm} = I_1(m), along with the definition r_j = \frac{I_1(a_j)}{I_0(a_j)}.

Now consider (recalling that x = a e^{i\alpha} and y = r e^{i\alpha}):

x^* \frac{dy}{dt} = a e^{-i\alpha} \left( \frac{dr}{dt} e^{i\alpha} + i r e^{i\alpha} \frac{d\alpha}{dt} \right) = a \frac{dr}{dt} + i a r \frac{d\alpha}{dt}

so that \frac{1}{2}\left( x^* \frac{dy}{dt} + x \frac{dy^*}{dt} \right) = a \frac{dr}{dt}. This last step follows from the fact that if z = f + ig, then z + z^* = 2f. Substituting back into Equation 16, we find that:

\frac{dH^{MF}}{dt} = -\frac{1}{2} \sum_j \left[ x_j^* \frac{dy_j}{dt} + x_j \frac{dy_j^*}{dt} \right]    (17)

Combining Equations 15 and 17, and using the definition \frac{dx_j}{dt} = -x_j + \sum_k y_k w_{jk} (see Equation 13):

\frac{dF^{MF}}{dt} = \frac{dE^{MF}}{dt} - T \frac{dH^{MF}}{dt}
= -\frac{1}{2} \sum_j \left[ \frac{dy_j^*}{dt} \left( \left(\sum_k y_k w_{jk}\right) - x_j \right) + \frac{dy_j}{dt} \left( \left(\sum_k y_k w_{jk}\right) - x_j \right)^* \right]
= -\frac{1}{2} \sum_j \left[ \frac{dx_j}{dt} \frac{dy_j^*}{dt} + \frac{dx_j^*}{dt} \frac{dy_j}{dt} \right]

We can expand these products as follows:

\frac{dx}{dt}\frac{dy^*}{dt} = \left( \frac{da}{dt} e^{i\alpha} + i a e^{i\alpha} \frac{d\alpha}{dt} \right) \left( \frac{dr}{dt} e^{-i\alpha} - i r e^{-i\alpha} \frac{d\alpha}{dt} \right)
= \frac{da}{dt}\frac{dr}{dt} + a r \left(\frac{d\alpha}{dt}\right)^2 + i \left( a \frac{dr}{dt}\frac{d\alpha}{dt} - r \frac{da}{dt}\frac{d\alpha}{dt} \right)

and \frac{dx^*}{dt}\frac{dy}{dt} is its complex conjugate, so the imaginary parts cancel in the sum; we have also used the fact that, since r depends solely on a, \frac{dr}{dt} = \frac{dr}{da}\frac{da}{dt}. Finally,

\frac{dF^{MF}}{dt} = -\frac{1}{2} \sum_j \left[ \frac{dx_j}{dt}\frac{dy_j^*}{dt} + \frac{dx_j^*}{dt}\frac{dy_j}{dt} \right] = -\sum_j \left[ \frac{dr_j}{da_j}\left(\frac{da_j}{dt}\right)^2 + a_j r_j \left(\frac{d\alpha_j}{dt}\right)^2 \right] \le 0

The final step in this derivation follows because r_j is a monotonically increasing function of a_j (see Figure 3), and because a_j and r_j are always non-negative.
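As an informal numerical check of the mean-field equations and the convergence property derived above, the following sketch (our own variable names, not the paper's; T = 1, a simple Euler discretization of dx_j/dt = -x_j + sum_k y_k w_jk, with SciPy's modified Bessel functions I_0 and I_1) builds a small random Hermitian complex weight matrix, computes y_j = r_j e^{i alpha_j} with r_j = I_1(a_j)/I_0(a_j), and verifies both the rotation invariance of the energy and that the free energy ends lower than it starts:

```python
import numpy as np
from scipy.special import i0, i1  # modified Bessel functions I_0, I_1

def mean_field_state(x):
    """y_j = r_j exp(i alpha_j): resultant length r_j = I1(a_j)/I0(a_j),
    phase alpha_j taken from the net input x_j (temperature T = 1)."""
    a = np.abs(x)
    return (i1(a) / i0(a)) * np.exp(1j * np.angle(x))

def energy(y, W):
    """E = -1/2 sum_jk y_j^* w_jk y_k (real-valued, since W is Hermitian)."""
    return -0.5 * np.real(np.conj(y) @ W @ y)

def free_energy(x, W):
    """F = E - H, with H = sum_j [-a_j r_j + log(2 pi I0(a_j))]."""
    a = np.abs(x)
    r = i1(a) / i0(a)
    H = np.sum(-a * r + np.log(2 * np.pi * i0(a)))
    return energy(mean_field_state(x), W) - H

rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
W = (A + A.conj().T) / n                 # Hermitian weights, scaled for stable dynamics
x = rng.normal(size=n) + 1j * rng.normal(size=n)   # random initial net inputs

# Rotation invariance: a global phase shift of all states leaves E unchanged.
y = mean_field_state(x)
assert np.isclose(energy(y, W), energy(y * np.exp(0.7j), W))

# Euler steps of dx_j/dt = -x_j + sum_k y_k w_jk; F should settle downward.
F = [free_energy(x, W)]
for _ in range(300):
    x = x + 0.05 * (-x + W @ mean_field_state(x))
    F.append(free_energy(x, W))
assert F[-1] < F[0]
```

Note that the resultant length I_1(a)/I_0(a) stays strictly below 1, so a unit's magnitude always encodes a degree of certainty rather than saturating; the strict monotone decrease of F holds for the continuous dynamics, and the discrete Euler steps only approximate it, which is why the sketch checks the overall descent rather than every step.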


More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Numerous alications in statistics, articularly in the fitting of linear models. Notation and conventions: Elements of a matrix A are denoted by a ij, where i indexes the rows and

More information

Hidden Predictors: A Factor Analysis Primer

Hidden Predictors: A Factor Analysis Primer Hidden Predictors: A Factor Analysis Primer Ryan C Sanchez Western Washington University Factor Analysis is a owerful statistical method in the modern research sychologist s toolbag When used roerly, factor

More information

LIMITATIONS OF RECEPTRON. XOR Problem The failure of the perceptron to successfully simple problem such as XOR (Minsky and Papert).

LIMITATIONS OF RECEPTRON. XOR Problem The failure of the perceptron to successfully simple problem such as XOR (Minsky and Papert). LIMITATIONS OF RECEPTRON XOR Problem The failure of the ercetron to successfully simle roblem such as XOR (Minsky and Paert). x y z x y z 0 0 0 0 0 0 Fig. 4. The exclusive-or logic symbol and function

More information

Ordered and disordered dynamics in random networks

Ordered and disordered dynamics in random networks EUROPHYSICS LETTERS 5 March 998 Eurohys. Lett., 4 (6),. 599-604 (998) Ordered and disordered dynamics in random networks L. Glass ( )andc. Hill 2,3 Deartment of Physiology, McGill University 3655 Drummond

More information

PARTIAL FACE RECOGNITION: A SPARSE REPRESENTATION-BASED APPROACH. Luoluo Liu, Trac D. Tran, and Sang Peter Chin

PARTIAL FACE RECOGNITION: A SPARSE REPRESENTATION-BASED APPROACH. Luoluo Liu, Trac D. Tran, and Sang Peter Chin PARTIAL FACE RECOGNITION: A SPARSE REPRESENTATION-BASED APPROACH Luoluo Liu, Trac D. Tran, and Sang Peter Chin Det. of Electrical and Comuter Engineering, Johns Hokins Univ., Baltimore, MD 21218, USA {lliu69,trac,schin11}@jhu.edu

More information

A Special Case Solution to the Perspective 3-Point Problem William J. Wolfe California State University Channel Islands

A Special Case Solution to the Perspective 3-Point Problem William J. Wolfe California State University Channel Islands A Secial Case Solution to the Persective -Point Problem William J. Wolfe California State University Channel Islands william.wolfe@csuci.edu Abstract In this aer we address a secial case of the ersective

More information

Deriving Indicator Direct and Cross Variograms from a Normal Scores Variogram Model (bigaus-full) David F. Machuca Mory and Clayton V.

Deriving Indicator Direct and Cross Variograms from a Normal Scores Variogram Model (bigaus-full) David F. Machuca Mory and Clayton V. Deriving ndicator Direct and Cross Variograms from a Normal Scores Variogram Model (bigaus-full) David F. Machuca Mory and Clayton V. Deutsch Centre for Comutational Geostatistics Deartment of Civil &

More information

Machine Learning: Homework 4

Machine Learning: Homework 4 10-601 Machine Learning: Homework 4 Due 5.m. Monday, February 16, 2015 Instructions Late homework olicy: Homework is worth full credit if submitted before the due date, half credit during the next 48 hours,

More information

Maximum Entropy and the Stress Distribution in Soft Disk Packings Above Jamming

Maximum Entropy and the Stress Distribution in Soft Disk Packings Above Jamming Maximum Entroy and the Stress Distribution in Soft Disk Packings Above Jamming Yegang Wu and S. Teitel Deartment of Physics and Astronomy, University of ochester, ochester, New York 467, USA (Dated: August

More information

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES AARON ZWIEBACH Abstract. In this aer we will analyze research that has been recently done in the field of discrete

More information

VIBRATION ANALYSIS OF BEAMS WITH MULTIPLE CONSTRAINED LAYER DAMPING PATCHES

VIBRATION ANALYSIS OF BEAMS WITH MULTIPLE CONSTRAINED LAYER DAMPING PATCHES Journal of Sound and Vibration (998) 22(5), 78 85 VIBRATION ANALYSIS OF BEAMS WITH MULTIPLE CONSTRAINED LAYER DAMPING PATCHES Acoustics and Dynamics Laboratory, Deartment of Mechanical Engineering, The

More information

Modeling and Estimation of Full-Chip Leakage Current Considering Within-Die Correlation

Modeling and Estimation of Full-Chip Leakage Current Considering Within-Die Correlation 6.3 Modeling and Estimation of Full-Chi Leaage Current Considering Within-Die Correlation Khaled R. eloue, Navid Azizi, Farid N. Najm Deartment of ECE, University of Toronto,Toronto, Ontario, Canada {haled,nazizi,najm}@eecg.utoronto.ca

More information

One step ahead prediction using Fuzzy Boolean Neural Networks 1

One step ahead prediction using Fuzzy Boolean Neural Networks 1 One ste ahead rediction using Fuzzy Boolean eural etworks 1 José A. B. Tomé IESC-ID, IST Rua Alves Redol, 9 1000 Lisboa jose.tome@inesc-id.t João Paulo Carvalho IESC-ID, IST Rua Alves Redol, 9 1000 Lisboa

More information

Outline. EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Simple Error Detection Coding

Outline. EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Simple Error Detection Coding Outline EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Error detection using arity Hamming code for error detection/correction Linear Feedback Shift

More information

Recursive Estimation of the Preisach Density function for a Smart Actuator

Recursive Estimation of the Preisach Density function for a Smart Actuator Recursive Estimation of the Preisach Density function for a Smart Actuator Ram V. Iyer Deartment of Mathematics and Statistics, Texas Tech University, Lubbock, TX 7949-142. ABSTRACT The Preisach oerator

More information

LECTURE 7 NOTES. x n. d x if. E [g(x n )] E [g(x)]

LECTURE 7 NOTES. x n. d x if. E [g(x n )] E [g(x)] LECTURE 7 NOTES 1. Convergence of random variables. Before delving into the large samle roerties of the MLE, we review some concets from large samle theory. 1. Convergence in robability: x n x if, for

More information

On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm

On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm Gabriel Noriega, José Restreo, Víctor Guzmán, Maribel Giménez and José Aller Universidad Simón Bolívar Valle de Sartenejas,

More information

The non-stochastic multi-armed bandit problem

The non-stochastic multi-armed bandit problem Submitted for journal ublication. The non-stochastic multi-armed bandit roblem Peter Auer Institute for Theoretical Comuter Science Graz University of Technology A-8010 Graz (Austria) auer@igi.tu-graz.ac.at

More information

Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation

Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation Paer C Exact Volume Balance Versus Exact Mass Balance in Comositional Reservoir Simulation Submitted to Comutational Geosciences, December 2005. Exact Volume Balance Versus Exact Mass Balance in Comositional

More information

An Algorithm of Two-Phase Learning for Eleman Neural Network to Avoid the Local Minima Problem

An Algorithm of Two-Phase Learning for Eleman Neural Network to Avoid the Local Minima Problem IJCSNS International Journal of Comuter Science and Networ Security, VOL.7 No.4, Aril 2007 1 An Algorithm of Two-Phase Learning for Eleman Neural Networ to Avoid the Local Minima Problem Zhiqiang Zhang,

More information

Time Domain Calculation of Vortex Induced Vibration of Long-Span Bridges by Using a Reduced-order Modeling Technique

Time Domain Calculation of Vortex Induced Vibration of Long-Span Bridges by Using a Reduced-order Modeling Technique 2017 2nd International Conference on Industrial Aerodynamics (ICIA 2017) ISBN: 978-1-60595-481-3 Time Domain Calculation of Vortex Induced Vibration of Long-San Bridges by Using a Reduced-order Modeling

More information

Phase locking of two independent degenerate. coherent anti-stokes Raman scattering processes

Phase locking of two independent degenerate. coherent anti-stokes Raman scattering processes Phase locking of two indeendent degenerate coherent anti-tokes Raman scattering rocesses QUN Zhang* Hefei National Laboratory for Physical ciences at the Microscale and Deartment of Chemical Physics, University

More information

On the capacity of the general trapdoor channel with feedback

On the capacity of the general trapdoor channel with feedback On the caacity of the general tradoor channel with feedback Jui Wu and Achilleas Anastasooulos Electrical Engineering and Comuter Science Deartment University of Michigan Ann Arbor, MI, 48109-1 email:

More information

arxiv: v1 [cond-mat.stat-mech] 1 Dec 2017

arxiv: v1 [cond-mat.stat-mech] 1 Dec 2017 Suscetibility Proagation by Using Diagonal Consistency Muneki Yasuda Graduate School of Science Engineering, Yamagata University Kazuyuki Tanaka Graduate School of Information Sciences, Tohoku University

More information

Estimating Time-Series Models

Estimating Time-Series Models Estimating ime-series Models he Box-Jenkins methodology for tting a model to a scalar time series fx t g consists of ve stes:. Decide on the order of di erencing d that is needed to roduce a stationary

More information

PHYS 301 HOMEWORK #9-- SOLUTIONS

PHYS 301 HOMEWORK #9-- SOLUTIONS PHYS 0 HOMEWORK #9-- SOLUTIONS. We are asked to use Dirichlet' s theorem to determine the value of f (x) as defined below at x = 0, ± /, ± f(x) = 0, - < x

More information

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model Shadow Comuting: An Energy-Aware Fault Tolerant Comuting Model Bryan Mills, Taieb Znati, Rami Melhem Deartment of Comuter Science University of Pittsburgh (bmills, znati, melhem)@cs.itt.edu Index Terms

More information

Developing A Deterioration Probabilistic Model for Rail Wear

Developing A Deterioration Probabilistic Model for Rail Wear International Journal of Traffic and Transortation Engineering 2012, 1(2): 13-18 DOI: 10.5923/j.ijtte.20120102.02 Develoing A Deterioration Probabilistic Model for Rail Wear Jabbar-Ali Zakeri *, Shahrbanoo

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell October 25, 2009 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition

Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition Haiing Lu, K.N. Plataniotis and A.N. Venetsanooulos The Edward S. Rogers Sr. Deartment of

More information

x and y suer from two tyes of additive noise [], [3] Uncertainties e x, e y, where the only rior knowledge is their boundedness and zero mean Gaussian

x and y suer from two tyes of additive noise [], [3] Uncertainties e x, e y, where the only rior knowledge is their boundedness and zero mean Gaussian A New Estimator for Mixed Stochastic and Set Theoretic Uncertainty Models Alied to Mobile Robot Localization Uwe D. Hanebeck Joachim Horn Institute of Automatic Control Engineering Siemens AG, Cororate

More information

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK Comuter Modelling and ew Technologies, 5, Vol.9, o., 3-39 Transort and Telecommunication Institute, Lomonosov, LV-9, Riga, Latvia MATHEMATICAL MODELLIG OF THE WIRELESS COMMUICATIO ETWORK M. KOPEETSK Deartment

More information

Finding recurrent sources in sequences

Finding recurrent sources in sequences Finding recurrent sources in sequences Aristides Gionis Deartment of Comuter Science Stanford University Stanford, CA, 94305, USA gionis@cs.stanford.edu Heikki Mannila HIIT Basic Research Unit Deartment

More information

KEY ISSUES IN THE ANALYSIS OF PILES IN LIQUEFYING SOILS

KEY ISSUES IN THE ANALYSIS OF PILES IN LIQUEFYING SOILS 4 th International Conference on Earthquake Geotechnical Engineering June 2-28, 27 KEY ISSUES IN THE ANALYSIS OF PILES IN LIQUEFYING SOILS Misko CUBRINOVSKI 1, Hayden BOWEN 1 ABSTRACT Two methods for analysis

More information

MAKING WALD TESTS WORK FOR. Juan J. Dolado CEMFI. Casado del Alisal, Madrid. and. Helmut Lutkepohl. Humboldt Universitat zu Berlin

MAKING WALD TESTS WORK FOR. Juan J. Dolado CEMFI. Casado del Alisal, Madrid. and. Helmut Lutkepohl. Humboldt Universitat zu Berlin November 3, 1994 MAKING WALD TESTS WORK FOR COINTEGRATED VAR SYSTEMS Juan J. Dolado CEMFI Casado del Alisal, 5 28014 Madrid and Helmut Lutkeohl Humboldt Universitat zu Berlin Sandauer Strasse 1 10178 Berlin,

More information