
The Traveling Salesman Problem: A Neural Network Perspective

Jean-Yves Potvin
Centre de Recherche sur les Transports
Université de Montréal
C.P. 6128, Succ. A, Montréal (Québec)
Canada H3C 3J7
potvin@iro.umontreal.ca

Abstract. This paper surveys the "neurally" inspired problem-solving approaches to the traveling salesman problem, namely, the Hopfield-Tank network, the elastic net, and the self-organizing map. The latest achievements in the neural network domain are reported, and numerical comparisons are provided with the classical solution approaches of operations research. An extensive bibliography with more than one hundred references is also included.

Introduction

The Traveling Salesman Problem (TSP) is a classical combinatorial optimization problem, which is simple to state but very difficult to solve. The problem is to find the shortest possible tour through a set of N vertices so that each vertex is visited exactly once. This problem is known to be NP-hard, and no polynomial-time exact algorithm is known for it. Many exact and heuristic algorithms have been devised in the field of operations research (OR) to solve the TSP. We refer readers to [15, 64, 65] for good overviews of the TSP. In the sections that follow, we briefly introduce the OR problem-solving approaches to the TSP. Then, the neural network approaches for solving that problem are discussed.

Exact Algorithms

The exact algorithms are designed to find the optimal solution to the TSP, that is, the tour of minimum length. They are computationally expensive because they must (implicitly) consider

all feasible solutions in order to identify the optimum. The exact algorithms are typically derived from the integer linear programming (ILP) formulation of the TSP

Min Σ_i Σ_j d_ij x_ij

subject to:

Σ_j x_ij = 1, i = 1,...,N
Σ_i x_ij = 1, j = 1,...,N
(x_ij) ∈ X
x_ij = 0 or 1,

where d_ij is the distance between vertices i and j and the x_ij's are the decision variables: x_ij is set to 1 when arc (i,j) is included in the tour, and 0 otherwise. (x_ij) ∈ X denotes the set of subtour-breaking constraints that restrict the feasible solutions to those consisting of a single tour. Although the subtour-breaking constraints can be formulated in many different ways, one very intuitive formulation is

Σ_{i,j ∈ S_v} x_ij ≤ |S_v| - 1    (S_v ⊂ V; 2 ≤ |S_v| ≤ N-2),

where V is the set of all vertices, S_v is some subset of V, and |S_v| is the cardinality of S_v. These constraints prohibit subtours, that is, tours on subsets with fewer than N vertices. If there were such a subtour on some subset of vertices S_v, this subtour would contain |S_v| arcs. Consequently, the left-hand side of the inequality would be equal to |S_v|, which is greater than |S_v| - 1, and the constraint would be violated for this particular subset. Without the subtour-breaking constraints, the TSP reduces to an assignment problem (AP), and a solution like the one shown in Figure 1 would then be feasible.

Branch and bound algorithms are commonly used to find an optimal solution to the TSP, and the AP-relaxation is useful to generate good lower bounds on the optimum value. This is true in particular for asymmetric problems, where d_ij ≠ d_ji for some i,j. For symmetric problems, like the Euclidean TSP (ETSP), the AP-solutions often contain many subtours with only two vertices. Consequently,

these problems are better addressed by specialized algorithms that can exploit their particular structure. For instance, a specific ILP formulation can be derived for the symmetric problem which allows for relaxations that provide sharp lower bounds (e.g., the shortest spanning one-tree [46]).

Fig. 1. (a) Solving the TSP, (b) Solving the assignment problem.

It is worth noting that problems with a few hundred vertices can now be routinely solved to optimality. Also, instances involving more than 2,000 vertices have been addressed. For example, the optimal solution to a symmetric problem with 2,392 vertices was identified after two hours and forty minutes of computation time on a powerful vector computer, the IBM 3090/600. [76,77] On the other hand, a classical problem with 532 vertices took five and a half hours on the same machine, indicating that the size of the problem is not the only factor determining computation time. We refer the interested reader to [64] for a complete description of the state of the art with respect to exact algorithms.

Heuristic Algorithms

Running an exact algorithm for hours on an expensive computer may not be very cost-effective if a solution within a few percent of the optimum can be found quickly on a microcomputer. Accordingly, heuristic or approximate algorithms are often preferred to exact algorithms for solving the large TSPs that occur in practice (e.g., drilling problems).
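The two families of methods discussed so far can be contrasted on a toy instance. The sketch below (a made-up random Euclidean instance; the nearest-neighbour construction is a simpler stand-in for the insertion heuristics discussed next) computes the exact optimum by exhaustive enumeration, then builds a tour heuristically and improves it with 2-opt exchanges; none of this is taken from the paper's experiments.

```python
import math
import random
from itertools import permutations

random.seed(42)
# Small random Euclidean instance (illustrative only).
N = 9
pts = [(random.random(), random.random()) for _ in range(N)]

def dist(a, b):
    return math.dist(pts[a], pts[b])

def tour_length(tour):
    return sum(dist(tour[k], tour[(k + 1) % N]) for k in range(N))

# Exact approach: exhaustive enumeration, O((N-1)!) tours; viable for tiny N only.
optimum = min(tour_length((0,) + p) for p in permutations(range(1, N)))

# Construction heuristic: nearest neighbour (repeatedly visit the closest
# unvisited city), a simple member of the tour construction family.
def nearest_neighbour(start=0):
    unvisited = set(range(N)) - {start}
    tour = [start]
    while unvisited:
        nxt = min(unvisited, key=lambda c: dist(tour[-1], c))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Improvement heuristic: 2-opt, i.e. replace arcs (i,k),(j,l) by (i,j),(k,l)
# (a segment reversal) whenever this shortens the tour.
def two_opt(tour):
    improving = True
    while improving:
        improving = False
        for a in range(N - 1):
            for b in range(a + 2, N):
                if a == 0 and b == N - 1:
                    continue  # would remove and re-add the same two arcs
                i, k, j, l = tour[a], tour[a + 1], tour[b], tour[(b + 1) % N]
                if dist(i, j) + dist(k, l) < dist(i, k) + dist(j, l) - 1e-12:
                    tour[a + 1:b + 1] = reversed(tour[a + 1:b + 1])
                    improving = True
    return tour

constructed = nearest_neighbour()
final_tour = two_opt(constructed[:])
```

On instances this small the 2-opt tour is often optimal; what the sketch guarantees is only that improvement never makes the constructed tour worse, and that no heuristic tour can beat the enumerated optimum.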

Generally speaking, TSP heuristics can be classified as tour construction procedures, tour improvement procedures, and composite procedures, which are based on both construction and improvement techniques.

(a) Construction procedures. The best known procedures in this class gradually build a tour by selecting vertices in turn and inserting them one by one into the current tour. Various metrics are used for selecting the next vertex and for identifying the best place to insert it, like the proximity to the current tour and the minimum detour. [88]

(b) Improvement procedures. Among the local improvement procedures, the k-opt exchange heuristics are the most widely used, in particular, the 2-opt, 3-opt, and Lin-Kernighan heuristics. [67,68] These heuristics locally modify the current solution by replacing k arcs in the tour by k new arcs so as to generate a new improved tour. Figure 2 shows an example of a 2-opt exchange. Typically, the exchange heuristics are applied iteratively until a local optimum is found, namely a tour which cannot be improved further via the exchange heuristic under consideration. In order to overcome the limitations associated with local optimality, new heuristics like simulated annealing and tabu search are being used. [25,39,40,60] Basically, these new procedures allow local modifications that increase the length of the tour. By this means, the method can escape from local minima and explore a larger number of solutions. The neural network models discussed in this paper are often compared to the simulated annealing heuristic described in [60]. In this context, simulated annealing refers to an implementation based on the 2-opt exchanges of Lin [67], where an increase in the length of the tour has some probability of being accepted (see the description of simulated annealing in Section 3).

(c) Composite procedures.
Recently developed composite procedures, which use both construction and improvement techniques, are now among the most powerful heuristics for solving TSPs. Among the new generation of composite heuristics, the most successful ones are the CCAO heuristic, [41] the GENIUS heuristic, [38] and the iterated Lin-Kernighan heuristic. [53] For example, the iterated Lin-Kernighan heuristic can routinely find solutions within 1% of the optimum for problems with up to

10,000 vertices. [53] Heuristic solutions within 4% of the optimum for some 1,000,000-city ETSPs are reported in [12]. Here, the tour construction procedure is a simple greedy heuristic. At the start, each city is considered as a fragment, and multiple fragments are built in parallel by iteratively connecting the closest fragments together until a single tour is generated. The solution is then processed by a 3-opt exchange heuristic. A clever implementation of this procedure solved some 1,000,000-city problems in less than four hours on a VAX.

Fig. 2. Exchange of links (i,k),(j,l) for links (i,j),(k,l).

Artificial Neural Networks

Because of the simplicity of its formulation, the TSP has always been a fertile ground for new solution ideas. Consequently, it is not surprising that many problem-solving approaches inspired by artificial neural networks have been applied to the TSP. Currently, neural networks do not provide solution quality that compares with the classical heuristics of OR. However, the technology is quite young, and spectacular improvements have already been achieved since the first attempts in [51]. All of these efforts for solving a problem that has already been quite successfully addressed by operations researchers are motivated, in part, by the fact that artificial neural networks are powerful parallel devices. They are made up of a large number of simple elements that can process their inputs in parallel. Accordingly, they lend themselves naturally to implementations on parallel computers. Moreover, many neural

network models have already been directly implemented in hardware as "neural chips." Hence, the neural network technology could provide a means to solve optimization problems at a speed that has never been achieved before. It remains to be seen, however, whether the quality of the neural network solutions will ever compare to the solutions produced by the best heuristics in OR. Given the spectacular improvements in the neural network technology in the last few years, it would certainly be premature at this time to consider this line of research to be a "dead end."

In the sections that follow, we review the three basic neural network approaches to the TSP, namely, the Hopfield-Tank network, the elastic net, and the self-organizing map. Actually, the elastic nets and self-organizing maps appear to be the best approaches for solving the TSP. But the Hopfield-Tank model was the first to be applied to the TSP, and it has been the dominant neural approach for solving combinatorial optimization problems over the last decade. Even today, many researchers are still working on that model, trying to explain its failures and successes. Because of its importance, a large part of this paper is thus devoted to that model and its refinements over the years.

The paper is organized along the following lines. Sections 1 and 2 first describe the Hopfield-Tank model and its many variants. Sections 3 and 4 are then devoted to the elastic net and the self-organizing map, respectively. Finally, concluding remarks are made in Section 5. Each basic model is described in detail, and no deep understanding of neural network technology is assumed. However, previous exposure to an introductory paper on the subject could help to better understand the various models. [61] In each section, computation times and numerical comparisons with other OR heuristics are provided when they are available.
However, the OR specialist must understand that the computation time for simulating a neural network on a serial digital computer is not particularly meaningful, because such an implementation does not exploit the inherent parallelism of the model. For this reason, computation times are often missing in neural network research papers. A final remark concerns the class of TSPs addressed by neural network researchers. Although the Hopfield-Tank network has been applied to TSPs with randomly generated distance matrices, [106]

virtually all work concerns the ETSP. Accordingly, Euclidean distances should be assumed in the sections that follow, unless it is explicitly stated otherwise. The reader should also note that general surveys on the use of neural networks in combinatorial optimization may be found in [22, 70]. An introductory paper about the impacts of neurocomputing on operations research may be found in [29].

Section 1. The Hopfield-Tank Model

Before going further into the details of the Hopfield model, it is important to observe that the network or graph defining the TSP is very different from the neural network itself. As a consequence, the TSP must be mapped, in some way, onto the neural network structure. For example, Figure 3a shows a TSP defined over a transportation network. The artificial neural network encoding that problem is shown in Figure 3b. In the transportation network, the five vertices stand for cities and the links are labeled or weighted by the inter-city distances d_ij (e.g., d_NY,LA is the distance between New York and Los Angeles). A feasible solution to that problem is the tour Montreal-Boston-NY-LA-Toronto-Montreal, as shown by the bold arcs. In Figure 3b, the Hopfield network [50] is depicted as a 5x5 matrix of nodes or units that are used to encode solutions to the TSP. Each row corresponds to a particular city and each column to a particular position in the tour. The black nodes are the activated units that encode the current solution (namely, Montreal is in first position in the tour, Boston in second position, NY in third, etc.). Only a few connections between the units are shown in Figure 3b. In fact, there is a connection between each pair of units, and a weight is associated with each connection. The signal sent along a connection from unit i to unit j is equal to the weight T_ij if unit i is activated. It is equal to 0 otherwise. A negative weight thus defines an inhibitory connection between the two units.
In such a case, it is unlikely that both units will be active or "on" at the same time, because the first unit that turns on immediately sends an inhibitory signal to the other unit through that connection to prevent its activation. On the other hand, it is more likely for both units to be on at the same time if the connection has a positive weight. In such a

case, the first unit that turns on sends a positive excitatory signal to the other unit through that connection to facilitate its activation.

Fig. 3. Mapping a TSP onto the Hopfield network: (a) the TSP defined over a transportation network; (b) the neural network representation.
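The effect of an inhibitory connection can be illustrated with a toy network of just two binary units. All numerical values below (the weight and the thresholds) are made-up illustrative assumptions, not taken from the paper: the negative weight makes the configuration with both units "on" unstable, so whichever unit updates first switches the other off.

```python
# Toy two-unit network with a single inhibitory connection.
# Weight and threshold values are illustrative assumptions.
T12 = -1.0                 # negative weight: inhibitory connection
theta = [-0.5, -0.5]       # thresholds; an isolated unit prefers to be "on"

def settle(V, order=(0, 1, 0, 1)):
    """Update each unit in turn with the binary threshold rule:
    turn on if the incoming signal exceeds the threshold, off if below."""
    V = list(V)
    for i in order:
        signal = T12 * V[1 - i]      # signal received from the other unit
        if signal > theta[i]:
            V[i] = 1
        elif signal < theta[i]:
            V[i] = 0
    return V
```

Starting from [1, 1], the first unit to update receives the inhibitory signal -1.0, falls below its threshold, and turns off; the network then settles with exactly one active unit, as described above.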

In the TSP context, the weights are derived in part from the inter-city distances. They are chosen to penalize infeasible tours and, among the feasible tours, to favor the shorter ones. For example, T_LA5,NY5 in Figure 3b denotes the weight on the connection between the units that represent a visit to cities LA and NY, both in the fifth position on a tour. Consequently, that connection should be inhibitory (negative weight), because two cities cannot occupy the same position. The first unit to be activated will inhibit the other unit via that connection, so as to prevent an infeasible solution from occurring.

In Section 1.1, we first introduce the Hopfield model, which is a network composed of binary "on/off" or "0/1" units, like the artificial neural network shown in Figure 3b. We will then describe the Hopfield-Tank model, which is a natural extension of the discrete model to units with continuous activation levels. Finally, the application of the Hopfield-Tank network to the TSP will be described.

1.1 The Discrete Hopfield Model

The original Hopfield neural network model [50] is a fully interconnected network of binary units with symmetric connection weights between the units. The connection weights are not learned but are defined a priori from problem data (the inter-city distances in a TSP context). Starting from some arbitrarily chosen initial configuration, either feasible or infeasible, the Hopfield network evolves by updating the activation of each unit in turn (i.e., an activated unit can be turned off, and an unactivated unit can be turned on). The update rule of any given unit involves the activation of the units it is connected to as well as the weights on the connections. Via this update process, various configurations are explored until the network settles into a stable configuration. In this final state, all units are stable according to the update rule and do not change their activation status.
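This relaxation process can be sketched in a few lines. The network below uses random symmetric weights and thresholds (illustrative values, not a TSP encoding) and applies the threshold update rule, formalized next, to one unit at a time; the energy E = -1/2 Σ_i Σ_j T_ij V_i V_j + Σ_i θ_i V_i never increases along the way.

```python
import random

random.seed(1)
L = 12                                   # number of binary units (illustrative)
# Random symmetric weights with zero diagonal, and random thresholds.
T = [[0.0] * L for _ in range(L)]
for i in range(L):
    for j in range(i + 1, L):
        T[i][j] = T[j][i] = random.uniform(-1.0, 1.0)
theta = [random.uniform(-0.5, 0.5) for _ in range(L)]

def energy(V):
    """E = -1/2 sum_ij T_ij V_i V_j + sum_i theta_i V_i."""
    quad = sum(T[i][j] * V[i] * V[j] for i in range(L) for j in range(L))
    return -0.5 * quad + sum(theta[i] * V[i] for i in range(L))

def update(V, i):
    """Threshold update rule for unit i (a tie leaves the unit unchanged)."""
    s = sum(T[i][j] * V[j] for j in range(L))
    if s > theta[i]:
        V[i] = 1
    elif s < theta[i]:
        V[i] = 0

V = [random.randint(0, 1) for _ in range(L)]
trace = [energy(V)]
for _ in range(300):                     # random asynchronous updates
    update(V, random.randrange(L))
    trace.append(energy(V))
```

Because each accepted flip strictly lowers the energy and the number of configurations is finite, the trace is monotone and the network eventually stops changing.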
The dynamics of the Hopfield network can be described formally in mathematical terms. To this end, the activation levels of the binary units are set to zero and one for "off" and "on," respectively. Starting from some initial configuration {V_i} i=1,...,L, where L is the number of units and V_i is the activation level of unit i, the network relaxes to a stable configuration according to the following update rule

set V_i to 1 if Σ_j T_ij V_j > θ_i
set V_i to 0 if Σ_j T_ij V_j < θ_i
do not change V_i if Σ_j T_ij V_j = θ_i,

where T_ij is the connection weight between units i and j, and θ_i is the threshold of unit i. The units are updated at random, one unit at a time. Since the configurations of the network are L-dimensional, the update of one unit from zero to one or from one to zero moves the configuration of the network from one corner to another of the L-dimensional unit hypercube.

The behavior of the network can be characterized by an appropriate energy function. The energy E depends only on the activation levels V_i (the weights T_ij and the thresholds θ_i are fixed and derived from problem data), and is such that it can only decrease as the network evolves over time. This energy is given by

E = -1/2 Σ_i Σ_j T_ij V_i V_j + Σ_i θ_i V_i.  (1.1)

Since the connection weights T_ij are symmetric, each term T_ij V_i V_j appears twice within the double summation of (1.1). Hence, this double summation is divided by 2. It is easy to show that a unit changes its activation level if and only if the energy of the network decreases by doing so. In order to prove that statement, we must consider the contribution E_i of a given unit i to the overall energy E, that is,

E_i = -Σ_j T_ij V_i V_j + θ_i V_i.

Consequently,

if V_i = 1 then E_i = -Σ_j T_ij V_j + θ_i,
if V_i = 0 then E_i = 0.

Hence, the change in energy due to a change ΔV_i in the activation level of unit i is

ΔE_i = -ΔV_i (Σ_j T_ij V_j - θ_i).

Now, ΔV_i is one if unit i changed its activation level from zero to one, and such a change can only occur if the expression between the parentheses is positive. As a consequence, ΔE_i is negative and the energy decreases. This same line of reasoning can be applied when a unit i changes its activation level from one to zero (i.e., ΔV_i = -1). Since the energy can only decrease over time and the number of configurations is finite, the network must necessarily converge to a stable state (but not necessarily the minimum energy state). In the next section, a natural extension of this model to units with continuous activation levels is described.

1.2 The Continuous Hopfield-Tank Model

In [51], Hopfield and Tank extended the original model to a fully interconnected network of nonlinear analog units, where the activation level of each unit is a value in the interval [0,1]. Hence, the space of possible configurations {V_i} i=1,...,L is now continuous rather than discrete, and is bounded by the L-dimensional hypercube defined by V_i = 0 or 1. Obviously, the final configuration of the network can be decoded into a solution of the optimization problem if it is close to a corner of the hypercube (i.e., if the activation value of each unit is close to zero or one). The main motivation of Hopfield and Tank for extending the discrete network to a continuous one was to provide a model that could be easily implemented using simple analog hardware. However, it seems that continuous dynamics also facilitate convergence. [47] The evolution of the units over time is now characterized by the following differential equations (usually called "equations of motion")

dU_i/dt = Σ_j T_ij V_j + I_i - U_i,  i=1,...,L  (1.2)

where U_i, I_i and V_i are the input, input bias, and activation level of unit i, respectively. The activation level of unit i is a function of its input, namely

V_i = g(U_i) = 1/2 (1 + tanh(U_i/U_0)) = 1/(1 + e^(-2U_i/U_0)).  (1.3)

The activation function g is the well-known sigmoidal function, which always returns a value between 0 and 1. The parameter U_0 is

used to modify the slope of the function. In Figure 4, for example, the U_0 value is lower for curve (2) than for curve (1).

Fig. 4. The sigmoidal activation function.

The energy function for the continuous Hopfield-Tank model is now

E = -1/2 Σ_i Σ_j T_ij V_i V_j - Σ_i V_i I_i + Σ_i ∫_0^V_i g^-1(x) dx.  (1.4)

Note in particular that dU_i/dt = -∂E/∂V_i. Accordingly, when the units obey the dynamics of the equations of motion, the network is performing a gradient descent in the network's configuration space with respect to that energy function, and stabilizes at a local minimum. At that point, dU_i/dt = 0 and the input to any given unit i is the weighted sum of the activation levels of all the other units plus the bias, that is

U_i = Σ_j T_ij V_j + I_i.
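The continuous dynamics can be simulated with a simple discrete-time scheme (the same approximation used in the next section). The weights, biases, slope parameter, and step size below are illustrative assumptions; the point of the sketch is that, at convergence, each input approximately satisfies the fixed-point condition U_i = Σ_j T_ij V_j + I_i stated above.

```python
import math
import random

random.seed(2)
L = 8                     # number of units (illustrative)
U0 = 0.05                 # sigmoid slope parameter (illustrative)
dt = 0.001                # discrete time step (illustrative)

# Random symmetric weights with zero diagonal, and small input biases.
T = [[0.0] * L for _ in range(L)]
for i in range(L):
    for j in range(i + 1, L):
        T[i][j] = T[j][i] = random.uniform(-1.0, 1.0)
I = [random.uniform(-0.1, 0.1) for _ in range(L)]

def g(u):
    """Sigmoidal activation (1.3): g(u) = 1/2 (1 + tanh(u/U0))."""
    return 0.5 * (1.0 + math.tanh(u / U0))

U = [random.uniform(-0.01, 0.01) for _ in range(L)]
for _ in range(30000):
    V = [g(u) for u in U]
    # Euler step: U_i(t+dt) = U_i(t) + dt (sum_j T_ij V_j + I_i - U_i)
    U = [U[i] + dt * (sum(T[i][j] * V[j] for j in range(L)) + I[i] - U[i])
         for i in range(L)]
V = [g(u) for u in U]
```

With the steep sigmoid chosen here, most activation levels end up close to zero or one, i.e., near a corner of the hypercube, which is the regime in which a configuration can be decoded into a solution.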

1.3 Simulation of the Hopfield-Tank Model

In order to simulate the behavior of the continuous Hopfield-Tank model, a discrete time approximation is applied to the equations of motion

{U_i(t+Δt) - U_i(t)} / Δt = Σ_j T_ij V_j(t) + I_i - U_i(t),

where Δt is a small time interval. This formula can be rewritten as

U_i(t+Δt) = U_i(t) + Δt (Σ_j T_ij V_j(t) + I_i - U_i(t)).

Starting with some initial values {U_i(0)} i=1,...,L at time t=0, the system evolves according to these equations until a stable state is reached. During the simulation, Δt is usually set to a small value such as 10^-5. Smaller values provide a better approximation of the analog system, but more iterations are then required to converge to a stable state. In the literature, the simulations have mostly been performed on standard sequential machines. However, implementations on parallel machines are discussed in [10, 93]. The authors report that it is possible to achieve almost linear speed-up with the number of processors. For example, a Hopfield-Tank network for a 100-city TSP took almost three hours to converge to a solution on a single processor of the Sequent Balance 8000 computer. [10] The computation time was reduced to about 20 minutes using eight processors.

1.4 Application of the Hopfield-Tank Model to the TSP

In the previous sections, we have shown that the Hopfield-Tank model performs a descent towards a local minimum of the energy function E. The "art" of applying that model to the TSP is to appropriately define the connection weights T_ij and the bias I_i so that the local minima of E will correspond to good TSP solutions. In order to map a combinatorial optimization problem like the TSP onto the Hopfield-Tank model, the following steps are suggested in [83, 84]:

(1) Choose a representation scheme which allows the activation levels of the units to be decoded into a solution of the problem.

(2) Design an energy function whose minimum corresponds to the best solution of the problem.

(3) Derive the connectivity of the network from the energy function.

(4) Set up the initial activation levels of the units.

These ideas can easily be applied to the design of a Hopfield-Tank network in a TSP context:

(1) First, a suitable representation of the problem must be chosen. In [51], the TSP is represented as an NxN matrix of units, where each row corresponds to a particular city and each column to a particular position in the tour (see Figure 3). If the activation level of a given unit V_Xi is close to 1, it is then assumed that city X is visited at the ith position in the tour. In this way, the final configuration of the network can be interpreted as a solution to the TSP. Note that N^2 units are needed to encode a solution for a TSP with N cities.

(2) Second, the energy function must be defined. The following function is used in [51]

E = A/2 Σ_X Σ_i Σ_{j≠i} V_Xi V_Xj
  + B/2 Σ_i Σ_X Σ_{Y≠X} V_Xi V_Yi
  + C/2 (Σ_X Σ_i V_Xi - N)^2
  + D/2 Σ_X Σ_{Y≠X} Σ_i d_XY V_Xi (V_Y,i+1 + V_Y,i-1),  (1.5)

where the A, B, C, and D parameters are used to weight the various components of the energy. The first three terms penalize solutions that do not correspond to feasible tours. Namely, there must be exactly one activated unit in each row and column of the matrix. The first and second terms, respectively, penalize the rows and columns with more than one activated unit, and the third term requires a total of N activated units (so as to avoid the trivial solution V_Xi = 0 for all X,i). The fourth term ensures that the energy function will favor short tours over longer ones. This term adds the distance d_XY to the energy value when cities X and Y are in consecutive positions in the tour (note that subscripts are taken modulo N, so that V_X,N+1 is the same as V_X,1).
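The behavior of this energy function can be checked numerically. The sketch below (using an arbitrary 5-city distance matrix and unit penalty weights, both illustrative assumptions rather than the values of [51]) builds the 0/1 configuration of a valid tour and confirms that the three penalty terms vanish, leaving E = D × (tour length): the data term counts every arc of the tour twice and is divided by 2.

```python
N = 5
A = B = C = D = 1.0   # penalty weights (illustrative; [51] used much larger values)
# Arbitrary symmetric distance matrix (illustrative).
d = [[0, 3, 4, 2, 7],
     [3, 0, 4, 6, 3],
     [4, 4, 0, 5, 8],
     [2, 6, 5, 0, 6],
     [7, 3, 8, 6, 0]]

def energy(V):
    """TSP energy (1.5); V[X][i] is the activation of unit (city X, position i)."""
    rows = sum(V[X][i] * V[X][j] for X in range(N)
               for i in range(N) for j in range(N) if j != i)
    cols = sum(V[X][i] * V[Y][i] for i in range(N)
               for X in range(N) for Y in range(N) if Y != X)
    count = (sum(V[X][i] for X in range(N) for i in range(N)) - N) ** 2
    data = sum(d[X][Y] * V[X][i] * (V[Y][(i + 1) % N] + V[Y][(i - 1) % N])
               for X in range(N) for Y in range(N) if Y != X for i in range(N))
    return A / 2 * rows + B / 2 * cols + C / 2 * count + D / 2 * data

def config(tour):
    """Permutation-matrix configuration: city tour[i] at position i."""
    V = [[0.0] * N for _ in range(N)]
    for i, X in enumerate(tour):
        V[X][i] = 1.0
    return V

def tour_length(tour):
    return sum(d[tour[i]][tour[(i + 1) % N]] for i in range(N))
```

Conversely, the all-zero configuration has zero row, column, and data penalties; only the C term fires, which is exactly the pathology that the global-inhibitor/bias combination described next must counteract.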

(3) Third, the bias and connection weights are derived. To do so, the energy function of Hopfield and Tank (1.5) is compared to the generic energy function (1.6), which is a slightly modified version of (1.4): each unit now has two subscripts (city and position), and the last term is removed (since it does not play any role here)

E = -1/2 Σ_Xi Σ_Yj T_Xi,Yj V_Xi V_Yj - Σ_Xi V_Xi I_Xi.  (1.6)

Consequently, the weights T_Xi,Yj on the connections of the Hopfield-Tank network are identified by looking at the quadratic terms in the TSP energy function, while the bias I_Xi is derived from the linear terms. Hence,

T_Xi,Yj = -A δ_XY (1 - δ_ij) - B δ_ij (1 - δ_XY) - C - D d_XY (δ_j,i+1 + δ_j,i-1),
I_Xi = +C N_e,

where δ_ij = 1 if i=j and 0 otherwise. The first and second terms in the definition of the connection weights stand for inhibitory connections within each row and each column, respectively. Hence, a unit whose activation level is close to 1 tends to inhibit the other units in the same row and column. The third term is a global inhibitor term. The combined action of this term and the input bias I_Xi, which are both derived from the C term in the energy function (1.5), favors solutions with a total of N activated units. Finally, the fourth term is called the "data term" and prevents solutions with adjacent cities that are far apart (namely, the inhibition is stronger between two units when they represent two cities X, Y in consecutive positions in the tour with a large inter-city distance d_XY).

In the experiments of Hopfield and Tank, the parameter N_e in the definition of the bias I_Xi = C N_e does not always correspond exactly to the number of cities N. This parameter is used by Hopfield and Tank to adjust the level of the positive bias signal with respect to the negative signals coming through the other connections, and it is usually slightly larger than N. Note

finally that there are O(N^4) connections between the N^2 units for a TSP with N cities.

(4) The last step is to set the initial activation value of each unit to 1/N, plus or minus a small random perturbation (in this way, the sum of the initial activations is approximately equal to N).

With this model, Hopfield and Tank were able to solve a randomly generated 10-city ETSP, with the following parameter values: A=B=500, C=200, D=500, N_e=15. They reported that for 20 distinct trials, using different starting configurations, the network converged 16 times to feasible tours. Half of those tours were one of the two optimal tours. On the other hand, the network was much less reliable on a randomly generated 30-city ETSP (900 units). Apart from frequent convergence to infeasible solutions, the network commonly found feasible tours with a length over 7.0, as compared to a tour of length 4.26 generated by the Lin-Kernighan exchange heuristic. [68]

Three years later, it was claimed in [105] that the results of Hopfield and Tank were quite difficult to reproduce. For the 10-city ETSP of Hopfield and Tank, using the same parameter settings, the authors report that on 100 different trials, the network converged to feasible solutions only 15 times. Moreover, the feasible tours were only slightly better than randomly generated tours. Other experiments by the same authors, on various randomly generated 10-city ETSPs, produced the same kind of results.

The main weaknesses of the original Hopfield-Tank model, as pointed out in [105], are the following.

(a) Solving a TSP with N cities requires O(N^2) units and O(N^4) connections.

(b) The optimization problem is not solved in a problem space of O(N!), but in a space of O(2^(N^2)) where many configurations correspond to infeasible solutions.

(c) Each valid tour is represented 2N times in the Hopfield-Tank model, because any one of the N cities can be chosen as the starting city, and the two orientations of the tour are equivalent for a symmetric problem. This phenomenon is referred to as "2N-degeneracy" in neural network terminology.

(d) The model performs a gradient descent of the energy function in the configuration space, and is thus plagued with the limitations of "hill-climbing" approaches, where a local optimum is found. As a consequence, the performance of the model is very sensitive to the initial starting configuration.

(e) The model does not guarantee feasibility. In other words, many local minima of the energy function correspond to infeasible solutions. This is related to the fact that the constraints of the problem, namely that each city must be visited exactly once, are not strictly enforced but rather introduced into the energy function as penalty terms.

(f) Setting the values of the parameters A, B, C, and D is much more an art than a science and requires a long "trial-and-error" process. Setting the penalty parameters A, B, and C to small values usually leads to short but infeasible tours. Alternatively, setting the penalty parameters to large values forces the network to converge to any feasible solution regardless of the total length. Moreover, it seems to be increasingly difficult to find "good" parameter settings as the number of cities increases.

(g) Many infeasible tours produced by the network visit only a subset of cities. This is due to the fact that the third term in the energy function (C term) is the only one to penalize such a situation. The first two terms (A and B terms), as well as the fourth term (D term), benefit from such a situation.

(h) It usually takes a large number of iterations (in the thousands) before the network converges to a solution. Moreover, the network can "freeze" far from a corner of the hypercube in the configuration space, where it is not possible to interpret the configuration as a TSP solution. This phenomenon can be explained by the shape of the sigmoidal activation function, which is very flat for large positive and large negative U_i's (see Figure 4).
Consequently, if the activation level V_i of a given unit i is close to zero or one, even large modifications to U_i will produce only slight modifications to the activation level. If a large number of units are in this situation, the network will evolve very slowly, a phenomenon referred to as "network paralysis." Paralysis far from a corner of the hypercube can occur if the slope of the activation function is not very steep. In that case, the flat regions of the sigmoidal function extend further and

affect a larger number of units (even those with activation levels far from zero and one).

(i) The network is not adaptive, because the weights of the network are fixed and derived from problem data, rather than learned from it.

The positive points are that the model can be easily implemented in hardware, using simple analog devices, and that it can also be applied to non-Euclidean TSPs, in particular, problems that do not satisfy the triangle inequality and cannot be interpreted geometrically. [106] This is an advantage over the geometric approaches that are presented in Sections 3 and 4.

Section 2. Variants of the Hopfield-Tank Model

Surprisingly enough, the results of Wilson and Pawley [105] did not discourage the community of researchers, but rather stimulated the search for ways to improve the original Hopfield-Tank model. There were also numerous papers providing in-depth analysis of the model to explain its failures and propose various improvements to the method. [4,5,6,7,17,24,56,86,87,90,108] The modifications to the original model can be classified into six distinct categories: modifications to the energy function, techniques for estimating "good" parameter settings, addition of hard constraints to the model, incorporation of techniques to escape from local minima, new problem representations, and modifications to the starting configurations. We now describe each category, and emphasize the most important contributions.

2.1 Modifications to the Energy Function

The first attempts were aimed at modifying the energy function to improve the performance of the Hopfield-Tank model. Those studies, which add or modify terms to push the model towards feasible solutions, are mostly empirical.

(a) In [18, 75, 78, 95], the authors suggest replacing either the third term (C term) or the first three terms (A, B, and C terms) of the original energy function (1.5) by

F/2 Σ_X (Σ_i V_Xi - 1)^2 + G/2 Σ_i (Σ_X V_Xi - 1)^2.

This modification helps the model converge towards feasible tours, because it heavily penalizes configurations that do not have exactly one active unit in each row and each column. In particular, it prevents many solutions that do not visit all cities. Note that a formulation in which the A, B, and C terms are replaced by the two new terms can be implemented with only O(N^3) connections. In [14], the author proposes an alternative approach to the same problem, adding an excitatory bias to each unit so as to obtain a larger number of activated units.

(b) In [18], the authors suggest adding a new penalty term to the energy function (1.5). This term drives the search away from the center of the hypercube (and thus towards its corners) so as to alleviate the network-paralysis problem. The additional term is

(F/2) Σ_Xi V_Xi (1 - V_Xi) = (F/2) (N^2/4 - Σ_Xi (V_Xi - 1/2)^2).

In the same paper, they also propose a formulation in which the inter-city distances appear only in the linear components of the energy function. Hence, the distances are provided to the network via the input biases rather than encoded into the connection weights. This is a great advantage for a hardware implementation, such as a neural chip, because the connection weights do not change from one TSP instance to another and can be fixed into the hardware at fabrication time. However, a new representation of the problem with O(N^3) units is then required.

Brandt and his colleagues [18] report that the Hopfield-Tank model with the two modifications suggested in (a) and (b) consistently converged to feasible tours for randomly generated 10-city ETSPs. Moreover, the average tour length was now much better than the length of randomly generated tours. Table 1 shows these results for problems with 10, 16, and 32 cities. Note that the heading "Manual" in the table refers to tours constructed by hand.
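The two penalty terms above can be sketched in a few lines of Python for an N x N activation matrix V stored as nested lists. This is an illustration only; the function names are mine, not from [18]. The first function is zero exactly when every row and column sums to one, and the second vanishes only at the 0/1 corners of the hypercube:

```python
def penalty_terms(V, F=1.0, G=1.0):
    # (F/2) sum_X (sum_i V_Xi - 1)^2 + (G/2) sum_i (sum_X V_Xi - 1)^2:
    # zero exactly when every row and every column of V sums to one
    n = len(V)
    rows = sum((sum(V[X]) - 1.0) ** 2 for X in range(n))
    cols = sum((sum(V[X][i] for X in range(n)) - 1.0) ** 2 for i in range(n))
    return F / 2.0 * rows + G / 2.0 * cols

def interior_term(V, F=1.0):
    # (F/2) sum_Xi V_Xi (1 - V_Xi): vanishes only at 0/1 corners of the
    # hypercube, pushing the search away from its center
    return F / 2.0 * sum(v * (1.0 - v) for row in V for v in row)

# a permutation matrix (one active unit per row and column) incurs no penalty
perm = [[1.0, 0.0], [0.0, 1.0]]
print(penalty_terms(perm), interior_term(perm))   # 0.0 0.0
```

Note that the center of the hypercube (all V_Xi = 1/2) satisfies the row and column constraints yet maximizes the interior term, which is why the term in (b) is needed in addition to the terms in (a).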

Table 1. Comparison of Results for Three Solution Procedures. (Columns: number of cities, number of problems, and average tour length for the Brandt, Manual, and Random procedures.)

2.2 Finding Good Settings for the Parameter Values

In [44, 45], the authors experimentally demonstrate that various relationships among the parameters of the energy function must be satisfied for the Hopfield-Tank model to converge to feasible tours. Their work also indicates that the region of good settings in the parameter space quickly becomes very narrow as the number of cities grows. This study supports previous observations about the difficulty of tuning the Hopfield-Tank energy function for problems with a large number of cities. Cuykendall and Reese [28] also provide ways of estimating parameter values from problem data as the number of cities increases. In [4, 5, 6, 7, 27, 57, 58, 74, 86, 87], theoretical relationships among the parameters are investigated in order for feasible tours to be stable. The work described in these papers is mostly based on a close analysis of the eigenvalues and eigenvectors of the connection matrix. Wang and Tsai [102] propose gradually reducing the value of some parameters over time. However, time-varying parameters preclude a simple hardware implementation. Lai and Coghill [63] propose using the genetic algorithm, as described in [49], to find good parameter values for the Hopfield-Tank model. Along this line of research, the most impressive practical results are reported in [28]. The authors generate feasible solutions for a 165-city ETSP by appropriately setting the bias of each unit and the U_0 parameter in the sigmoidal activation function (1.3). Depending on the parameter settings, it took between one and ten hours of computation time on an Apollo DN4000 to converge to a solution. The computation times ranged from 10 to 30 minutes for

another 70-city ETSP. Unfortunately, no comparisons are provided with other problem-solving heuristics.

2.3 Addition of Constraints to the Model

The approaches that we now describe add new constraints to the Hopfield-Tank model so as to restrict the configuration space to feasible tours.

(a) In [81, 97, 98], the activation levels of the units are normalized so that Σ_i V_Xi = 1 for all cities X. The introduction of these additional constraints is only one aspect of the problem-solving methodology, which is closely related to the simulated annealing heuristic. Accordingly, the full discussion is deferred to Section 2.4, where simulated annealing is introduced.

(b) Other approaches are more aggressive and explicitly restrict the configuration space to feasible tours. In [96], the authors calculate the changes required in the remaining units to maintain a feasible solution when a given unit is updated. The energy function is then evaluated on the basis of the change to the updated unit and all the logically implied changes to the other units. This approach converges consistently to feasible solutions on 30-city ETSPs; the resulting tours are only 5% longer on average than those generated by the simulated annealing heuristic. In [69], the author updates the configurations using Lin and Kernighan's exchange heuristic. [68] Foo and Szu [35] use a "divide-and-conquer" approach: they partition the set of cities into subsets and apply the Hopfield-Tank model to each subset. The subtours are then merged into a single larger tour with a simple heuristic. Although their approach is not conclusive, the integration of classical OR heuristics and artificial intelligence within a neural network framework could provide interesting research avenues for the future.

2.4 Incorporation of Techniques to Escape from Local Minima

The Hopfield-Tank model converges to a local minimum and is thus highly sensitive to the starting configuration.
Hence, various modifications have been proposed in the literature to alleviate this problem.

(a) In [1, 2, 3], a Boltzmann machine [48] is designed to solve the TSP. Basically, a Boltzmann machine incorporates the simulated annealing heuristic [25, 60] within a discrete Hopfield network, so as to allow the network to escape from bad local minima. The simulated annealing heuristic performs a stochastic search in the space of configurations of a discrete system, such as a Hopfield network with binary units. As opposed to classical hill-climbing approaches, simulated annealing allows modifications to the current configuration that increase the value of the objective or energy function (for a minimization problem). More precisely, a modification that reduces the energy of the system is always accepted, while a modification that increases the energy by ΔE is accepted with Boltzmann probability e^(-ΔE/T), where T is the temperature parameter. At a high temperature, the probability of accepting an increase in energy is high; this probability decreases as the temperature is reduced. The simulated annealing heuristic is typically initiated at a high temperature, where most modifications are accepted, so as to perform a coarse search of the configuration space. The temperature is then gradually reduced to focus the search on a specific region of the configuration space. At each temperature T, the configurations are modified according to the Boltzmann update rule until the system reaches an "equilibrium." At that point, the configurations follow a Boltzmann distribution, where the probability of the system being in configuration s' at temperature T is

P_T(s') = e^(-E(s')/T) / Σ_s e^(-E(s)/T).    (2.1)

Here, E(s') is the energy of configuration s', and the denominator is a summation over all configurations. According to this distribution, configurations of high energy are very likely to be observed at high temperatures and much less likely at low temperatures; the reverse is true for low-energy configurations.
Hence, by gradually reducing the temperature parameter T and by allowing the system to reach equilibrium at each temperature, the system is expected to ultimately settle down at a configuration of low energy.
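The cooling scheme just described can be sketched as a generic simulated-annealing skeleton. This is an illustration, not code from any of the cited papers; the toy energy function, the geometric cooling schedule, and all parameter values (T0, alpha, steps_per_T, T_min) are my assumptions:

```python
import math
import random

def metropolis_accept(delta_e, T, rng=random):
    # accept always if the energy decreases; otherwise with the
    # Boltzmann probability exp(-delta_e / T)
    return delta_e <= 0.0 or rng.random() < math.exp(-delta_e / T)

def anneal(energy, neighbor, state, T0=10.0, alpha=0.95, steps_per_T=100, T_min=1e-3):
    # generic simulated-annealing skeleton: geometric cooling from T0,
    # Metropolis acceptance at each temperature level
    T, e = T0, energy(state)
    while T > T_min:
        for _ in range(steps_per_T):
            cand = neighbor(state)
            delta = energy(cand) - e
            if metropolis_accept(delta, T):
                state, e = cand, e + delta
        T *= alpha                      # gradual temperature reduction
    return state, e

# toy example: minimize (x - 3)^2 over the integers with +/-1 moves
random.seed(0)
s, e = anneal(lambda x: (x - 3) ** 2, lambda x: x + random.choice((-1, 1)), 50)
print(s, e)   # settles at x = 3, energy 0
```

At high T nearly every move is accepted (coarse search); near T_min only energy-decreasing moves survive, so the final state sits in a low-energy configuration.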

This simulated annealing heuristic has been incorporated into the discrete Hopfield model to produce the so-called Boltzmann machine. Here, the binary units obey a stochastic update rule rather than a deterministic one. At each iteration, a unit is randomly selected and the effect on the energy of flipping its activation level (from zero to one, or from one to zero) is evaluated. The probability of accepting the modification is then

1 / (1 + e^(ΔE/T)),

where ΔE is the change in energy. This update probability is slightly different from the one used in simulated annealing. In particular, the probability of accepting a modification that decreases the energy of the network (i.e., ΔE < 0) is not one here, but rather a value between 0.5 and one. However, this new update probability has the same convergence properties as the one used in simulated annealing and, in that sense, the two expressions are equivalent. Aarts and Korst [1, 2, 3] design a Boltzmann machine for solving the TSP based on these ideas. Unfortunately, their approach suffers from very slow convergence and, as a consequence, only 30-city TSPs have been solved with this model.

(b) In [43], the Boltzmann machine is generalized to units with continuous activation levels. A truncated exponential distribution is used to compute the activation level of each unit. As with the discrete Boltzmann machine, the model suffers from slow convergence, and only small 10-city ETSPs have been solved.

(c) The research described in [81, 97, 98], which is derived from mean-field theory, is probably the most important contribution to the literature on the Hopfield-Tank model since its original description in [51]. The term "mean-field" refers to the fact that the model computes the mean activation levels of the stochastic binary units of a Boltzmann machine. This section focuses on the model of Van den Bout and Miller, [97, 98] but the model of Peterson and Soderberg [81] is similar.
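The two acceptance rules compared in point (a) can be checked numerically. The sketch below (function names mine) evaluates both rules at a few energy changes; it also illustrates why they are "equivalent": both yield the same acceptance ratio e^(-ΔE/T) between a move and its reverse, so both leave the Boltzmann distribution invariant:

```python
import math

def metropolis(delta_e, T):
    # simulated-annealing acceptance probability: min(1, exp(-delta_e / T))
    return min(1.0, math.exp(-delta_e / T))

def boltzmann(delta_e, T):
    # Boltzmann-machine acceptance probability: 1 / (1 + exp(delta_e / T))
    return 1.0 / (1.0 + math.exp(delta_e / T))

# an energy-decreasing move (delta_e < 0) is certain under the Metropolis
# rule but only accepted with probability between 0.5 and 1 under the
# Boltzmann rule
T = 1.0
for dE in (-2.0, 0.0, 2.0):
    print(dE, round(metropolis(dE, T), 4), round(boltzmann(dE, T), 4))
```

For both rules the ratio p(accept ΔE) / p(accept -ΔE) equals e^(-ΔE/T), which is the detailed-balance condition behind their shared convergence properties.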
We first introduce the iterative algorithm for updating the

configurations of the network. Then, we explain the relationships between this model and the Boltzmann machine. The neural network model introduced in [97] is characterized by a new, simplified energy function

E = (d_max/2) Σ_i Σ_X Σ_{Y≠X} V_Xi V_Yi + Σ_X Σ_{Y≠X} Σ_i d_XY V_Xi (V_{Y,i+1} + V_{Y,i-1}).    (2.2)

The first summation penalizes solutions with multiple cities at the same position, while the second summation computes the tour length. Note that the penalty is weighted by the parameter d_max. Starting from some arbitrary initial configuration, the model evolves to a stable configuration that minimizes (2.2) via the following iterative algorithm:

1. Set the temperature T.
2. Select a city X at random.
3. Compute U_Xi = -d_max Σ_{Y≠X} V_Yi - Σ_{Y≠X} d_XY (V_{Y,i+1} + V_{Y,i-1}), for i = 1,...,N.
4. Compute V_Xi = e^(U_Xi/T) / Σ_j e^(U_Xj/T), for i = 1,...,N.
5. Evaluate the energy E.
6. Repeat Steps 2 to 5 until the energy no longer decreases (i.e., a stable configuration has been reached).

Note that the activation levels always satisfy the constraints Σ_i V_Xi = 1 for all cities X. Accordingly, each value V_Xi can be interpreted as the probability that city X occupies position i. When a stable configuration is reached, the activation levels V_Xi satisfy the following system of equations (called the "mean-field equations")

V_Xi = e^(U_Xi/T) / Σ_j e^(U_Xj/T),    (2.3)

where U_Xi = -dE/dV_Xi (see Step 3 of the algorithm). To understand the origin of the mean-field equations, we must go back to the evolution of a discrete Hopfield network with binary units, when those units are governed by a stochastic update rule such as the Boltzmann rule (see the description of the simulated annealing heuristic and the Boltzmann machine in point (a)). It is known that the configurations of that network follow a Boltzmann distribution at equilibrium (i.e., after a large number of updates). Since the network is stochastic, it is not possible to know what the exact configuration will be at a given time. On the other hand, the average or mean activation value of each binary unit at Boltzmann equilibrium at a given temperature T is a deterministic value, which can be computed as follows:

<V_Xi> = Σ_s P_T(s) V_Xi(s) = Σ_{s_Xi} P_T(s_Xi).

In this equation, the summations are restricted to the configurations satisfying Σ_j V_Xj = 1 for all cities X (so as to comply with the model of Van den Bout and Miller), P_T(s) is the Boltzmann probability of configuration s at temperature T, V_Xi(s) is the activation level of unit Xi in configuration s, and s_Xi denotes the configurations where V_Xi = 1. Hence, we have

<V_Xi> = Σ_{s_Xi} e^(-E(s_Xi)/T) / Σ_j Σ_{s_Xj} e^(-E(s_Xj)/T).

In this formula, the double summation in the denominator is equivalent to a single summation over all configurations s, because each configuration contains exactly one activated unit in {Xj} j=1,...,N (Σ_j V_Xj = 1 and each V_Xj is either zero or one). Now we can apply the so-called "mean-field approximation" to <V_Xi>. Rather than summing over all configurations, we assume that the activation levels of all units that interact with a given unit Xj are fixed at their mean values. For example, rather than summing up

over all configurations s_Xi in the numerator (configurations where V_Xi = 1), we fix the activation levels of all the other units at their mean values. In this way, the summation can be removed. By applying this idea to both the numerator and the denominator, and by observing that -U_Xi is the contribution of unit Xi to the energy (2.2) when V_Xi = 1, the expression can be simplified to

<V_Xi> = e^(<U_Xi>/T) / Σ_j e^(<U_Xj>/T),

where

<U_Xi> = -d_max Σ_{Y≠X} <V_Yi> - Σ_{Y≠X} d_XY (<V_{Y,i+1}> + <V_{Y,i-1}>).

These equations are the same as the equations (2.3) of Van den Bout and Miller. Hence, the V_Xi values computed via their iterative algorithm can be interpreted as the mean activation levels of the corresponding stochastic binary units at Boltzmann equilibrium (at a given temperature T). At low temperatures, the low-energy configurations have high Boltzmann probability and dominate in the computation of the mean values <V_Xi>. Hence, the stable configuration computed by the algorithm of Van den Bout and Miller, which is composed of those mean values <V_Xi>, is expected to be of low energy for a sufficiently small parameter value T. As noted in [98], all the activation levels are the same at high temperatures, that is, V_Xi → 1/N as T → ∞. As the temperature parameter is lowered, each city gradually settles into a single position, because such configurations correspond to low-energy states. In addition, the model prevents two cities from occupying the same position, because a penalty of d_max/2 is then incurred in the energy function. If the parameter d_max is set to a value slightly larger than twice the largest distance between any two cities, the network can always find a configuration of lower energy simply by moving one of the two cities into an empty position.
Feasible tours are thus guaranteed through the combined actions of the new energy function and the additional constraints imposed on the activation levels V_Xi (once again, for a sufficiently small parameter value T). It is clear that the key problem is to identify a "good" value for the parameter T. By gradually decreasing the temperature, Van den

Bout and Miller identified a critical value T_c where all the energy minimization takes place. Above the critical value T_c, the units rarely converge to zero or one, and feasible tours do not emerge. Below T_c, all the tours generated are feasible, and the best tours emerge when T is close to T_c. Obviously, the critical temperature is highly dependent on the particular TSP to be solved. In [97, 98], Van den Bout and Miller describe a methodology for estimating that value from the inter-city distances. Using various T and d_max parameter values, their best tour on a 30-city TSP had a length of 26.9, as compared to 24.1 for a tour obtained with the simulated annealing heuristic. Peterson and Soderberg [81] test a similar model on much larger problems, ranging in size from 50 to 200 cities. They observe that the tours generated by the neural network approach are about 8% longer on average than the tours generated with a simulated annealing heuristic. Moreover, no tour was more than 10% longer than the corresponding simulated annealing tour. However, the average tour lengths are not provided (the results are displayed only as small histograms), and no computation times are reported. It is quite interesting to note that Bilbro et al. [13] have shown that the evolution of the above model is equivalent to the evolution of the Hopfield-Tank network governed by the equations of motion (see Section 1). However, convergence to a stable configuration is much faster when the mean-field equations are solved directly. This increased convergence speed explains the success of Peterson and Soderberg, who routinely found feasible solutions to TSPs with up to 200 cities, the largest problems ever solved with models derived from the work of Hopfield and Tank.
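The iterative algorithm of Steps 1-6 can be sketched in Python. This is a minimal illustration under my own simplifications, not the authors' implementation: a single fixed temperature and a fixed iteration budget replace the paper's stopping rule and critical-temperature estimate, the noise scale for symmetry breaking and the argmax decoding are mine, and feasibility of the decoded tour is not guaranteed in general:

```python
import math
import random

def mean_field_tsp(d, d_max, T, iters=20000, seed=0):
    # Sketch of the Van den Bout-Miller iterative update for energy (2.2).
    # d[X][Y] are inter-city distances; positions are treated cyclically.
    rng = random.Random(seed)
    N = len(d)
    # near-uniform start; small noise breaks the symmetry between positions
    V = [[1.0 / N + 0.001 * rng.random() for _ in range(N)] for _ in range(N)]
    for _ in range(iters):
        X = rng.randrange(N)                      # step 2: pick a city at random
        U = []
        for i in range(N):                        # step 3: U_Xi = -dE/dV_Xi
            penalty = d_max * sum(V[Y][i] for Y in range(N) if Y != X)
            length = sum(d[X][Y] * (V[Y][(i + 1) % N] + V[Y][(i - 1) % N])
                         for Y in range(N) if Y != X)
            U.append(-penalty - length)
        m = max(u / T for u in U)                 # step 4: softmax over positions
        w = [math.exp(u / T - m) for u in U]      # (shifted for numerical safety)
        total = sum(w)
        V[X] = [wi / total for wi in w]           # row now satisfies sum_i V_Xi = 1
    # read out the most active position for each city
    return [max(range(N), key=lambda i: V[X][i]) for X in range(N)]

# four cities on a unit square, Euclidean distances; d_max is set slightly
# above twice the largest inter-city distance, as suggested in the text
pts = [(0, 0), (0, 1), (1, 1), (1, 0)]
d = [[math.dist(a, b) for b in pts] for a in pts]
tour = mean_field_tsp(d, d_max=2.1 * max(max(row) for row in d), T=0.1)
print(tour)
```

The softmax in Step 4 enforces the constraint Σ_i V_Xi = 1 by construction, which is exactly why each V_Xi can be read as the probability that city X sits at position i.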
(d) Mean-field annealing refers to the application of the mean-field algorithm, as described in (c), with a gradual reduction of the temperature from high to low values, as in the simulated annealing heuristic. [13, 19, 80, 107] As pointed out in [97], this approach is of little use if the critical temperature, where all the energy minimization takes place, can be accurately estimated from problem data. If the estimate is not accurate, however, it can be useful to gradually decrease the temperature.

(e) In [8, 26, 66], random noise is introduced into the activation levels of the units in order to escape from local minima. The random


More information

Mathematics for Decision Making: An Introduction. Lecture 8

Mathematics for Decision Making: An Introduction. Lecture 8 Mathematics for Decision Making: An Introduction Lecture 8 Matthias Köppe UC Davis, Mathematics January 29, 2009 8 1 Shortest Paths and Feasible Potentials Feasible Potentials Suppose for all v V, there

More information

Chapter 3: Discrete Optimization Integer Programming

Chapter 3: Discrete Optimization Integer Programming Chapter 3: Discrete Optimization Integer Programming Edoardo Amaldi DEIB Politecnico di Milano edoardo.amaldi@polimi.it Website: http://home.deib.polimi.it/amaldi/opt-16-17.shtml Academic year 2016-17

More information

8. INTRACTABILITY I. Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley. Last updated on 2/6/18 2:16 AM

8. INTRACTABILITY I. Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley. Last updated on 2/6/18 2:16 AM 8. INTRACTABILITY I poly-time reductions packing and covering problems constraint satisfaction problems sequencing problems partitioning problems graph coloring numerical problems Lecture slides by Kevin

More information

CS 380: ARTIFICIAL INTELLIGENCE

CS 380: ARTIFICIAL INTELLIGENCE CS 380: ARTIFICIAL INTELLIGENCE PROBLEM SOLVING: LOCAL SEARCH 10/11/2013 Santiago Ontañón santi@cs.drexel.edu https://www.cs.drexel.edu/~santi/teaching/2013/cs380/intro.html Recall: Problem Solving Idea:

More information

Metaheuristics and Local Search

Metaheuristics and Local Search Metaheuristics and Local Search 8000 Discrete optimization problems Variables x 1,..., x n. Variable domains D 1,..., D n, with D j Z. Constraints C 1,..., C m, with C i D 1 D n. Objective function f :

More information

An artificial neural networks (ANNs) model is a functional abstraction of the

An artificial neural networks (ANNs) model is a functional abstraction of the CHAPER 3 3. Introduction An artificial neural networs (ANNs) model is a functional abstraction of the biological neural structures of the central nervous system. hey are composed of many simple and highly

More information

Scheduling and Optimization Course (MPRI)

Scheduling and Optimization Course (MPRI) MPRI Scheduling and optimization: lecture p. /6 Scheduling and Optimization Course (MPRI) Leo Liberti LIX, École Polytechnique, France MPRI Scheduling and optimization: lecture p. /6 Teachers Christoph

More information

Hertz, Krogh, Palmer: Introduction to the Theory of Neural Computation. Addison-Wesley Publishing Company (1991). (v ji (1 x i ) + (1 v ji )x i )

Hertz, Krogh, Palmer: Introduction to the Theory of Neural Computation. Addison-Wesley Publishing Company (1991). (v ji (1 x i ) + (1 v ji )x i ) Symmetric Networks Hertz, Krogh, Palmer: Introduction to the Theory of Neural Computation. Addison-Wesley Publishing Company (1991). How can we model an associative memory? Let M = {v 1,..., v m } be a

More information

Branch-and-cut Approaches for Chance-constrained Formulations of Reliable Network Design Problems

Branch-and-cut Approaches for Chance-constrained Formulations of Reliable Network Design Problems Branch-and-cut Approaches for Chance-constrained Formulations of Reliable Network Design Problems Yongjia Song James R. Luedtke August 9, 2012 Abstract We study solution approaches for the design of reliably

More information

Part B" Ants (Natural and Artificial)! Langton s Vants" (Virtual Ants)! Vants! Example! Time Reversibility!

Part B Ants (Natural and Artificial)! Langton s Vants (Virtual Ants)! Vants! Example! Time Reversibility! Part B" Ants (Natural and Artificial)! Langton s Vants" (Virtual Ants)! 11/14/08! 1! 11/14/08! 2! Vants!! Square grid!! Squares can be black or white!! Vants can face N, S, E, W!! Behavioral rule:!! take

More information

Local and Stochastic Search

Local and Stochastic Search RN, Chapter 4.3 4.4; 7.6 Local and Stochastic Search Some material based on D Lin, B Selman 1 Search Overview Introduction to Search Blind Search Techniques Heuristic Search Techniques Constraint Satisfaction

More information

Algorithms: COMP3121/3821/9101/9801

Algorithms: COMP3121/3821/9101/9801 NEW SOUTH WALES Algorithms: COMP3121/3821/9101/9801 Aleks Ignjatović School of Computer Science and Engineering University of New South Wales LECTURE 9: INTRACTABILITY COMP3121/3821/9101/9801 1 / 29 Feasibility

More information

16.410/413 Principles of Autonomy and Decision Making

16.410/413 Principles of Autonomy and Decision Making 6.4/43 Principles of Autonomy and Decision Making Lecture 8: (Mixed-Integer) Linear Programming for Vehicle Routing and Motion Planning Emilio Frazzoli Aeronautics and Astronautics Massachusetts Institute

More information

Integer Programming ISE 418. Lecture 8. Dr. Ted Ralphs

Integer Programming ISE 418. Lecture 8. Dr. Ted Ralphs Integer Programming ISE 418 Lecture 8 Dr. Ted Ralphs ISE 418 Lecture 8 1 Reading for This Lecture Wolsey Chapter 2 Nemhauser and Wolsey Sections II.3.1, II.3.6, II.4.1, II.4.2, II.5.4 Duality for Mixed-Integer

More information

Revisiting the Hamiltonian p-median problem: a new formulation on directed graphs and a branch-and-cut algorithm

Revisiting the Hamiltonian p-median problem: a new formulation on directed graphs and a branch-and-cut algorithm Revisiting the Hamiltonian p-median problem: a new formulation on directed graphs and a branch-and-cut algorithm Tolga Bektaş 1, Luís Gouveia 2, Daniel Santos 2 1 Centre for Operational Research, Management

More information

The core of solving constraint problems using Constraint Programming (CP), with emphasis on:

The core of solving constraint problems using Constraint Programming (CP), with emphasis on: What is it about? l Theory The core of solving constraint problems using Constraint Programming (CP), with emphasis on: l Modeling l Solving: Local consistency and propagation; Backtracking search + heuristics.

More information

ON COST MATRICES WITH TWO AND THREE DISTINCT VALUES OF HAMILTONIAN PATHS AND CYCLES

ON COST MATRICES WITH TWO AND THREE DISTINCT VALUES OF HAMILTONIAN PATHS AND CYCLES ON COST MATRICES WITH TWO AND THREE DISTINCT VALUES OF HAMILTONIAN PATHS AND CYCLES SANTOSH N. KABADI AND ABRAHAM P. PUNNEN Abstract. Polynomially testable characterization of cost matrices associated

More information

AI Programming CS F-20 Neural Networks

AI Programming CS F-20 Neural Networks AI Programming CS662-2008F-20 Neural Networks David Galles Department of Computer Science University of San Francisco 20-0: Symbolic AI Most of this class has been focused on Symbolic AI Focus or symbols

More information

A Randomized Rounding Approach to the Traveling Salesman Problem

A Randomized Rounding Approach to the Traveling Salesman Problem A Randomized Rounding Approach to the Traveling Salesman Problem Shayan Oveis Gharan Amin Saberi. Mohit Singh. Abstract For some positive constant ɛ 0, we give a ( 3 2 ɛ 0)-approximation algorithm for

More information

A Polynomial-Time Algorithm for Pliable Index Coding

A Polynomial-Time Algorithm for Pliable Index Coding 1 A Polynomial-Time Algorithm for Pliable Index Coding Linqi Song and Christina Fragouli arxiv:1610.06845v [cs.it] 9 Aug 017 Abstract In pliable index coding, we consider a server with m messages and n

More information

More on NP and Reductions

More on NP and Reductions Indian Institute of Information Technology Design and Manufacturing, Kancheepuram Chennai 600 127, India An Autonomous Institute under MHRD, Govt of India http://www.iiitdm.ac.in COM 501 Advanced Data

More information

Metaheuristics and Local Search. Discrete optimization problems. Solution approaches

Metaheuristics and Local Search. Discrete optimization problems. Solution approaches Discrete Mathematics for Bioinformatics WS 07/08, G. W. Klau, 31. Januar 2008, 11:55 1 Metaheuristics and Local Search Discrete optimization problems Variables x 1,...,x n. Variable domains D 1,...,D n,

More information

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks. Nicholas Ruozzi University of Texas at Dallas Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify

More information

Input layer. Weight matrix [ ] Output layer

Input layer. Weight matrix [ ] Output layer MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.034 Artificial Intelligence, Fall 2003 Recitation 10, November 4 th & 5 th 2003 Learning by perceptrons

More information

Technische Universität München, Zentrum Mathematik Lehrstuhl für Angewandte Geometrie und Diskrete Mathematik. Combinatorial Optimization (MA 4502)

Technische Universität München, Zentrum Mathematik Lehrstuhl für Angewandte Geometrie und Diskrete Mathematik. Combinatorial Optimization (MA 4502) Technische Universität München, Zentrum Mathematik Lehrstuhl für Angewandte Geometrie und Diskrete Mathematik Combinatorial Optimization (MA 4502) Dr. Michael Ritter Problem Sheet 1 Homework Problems Exercise

More information

CS 583: Approximation Algorithms: Introduction

CS 583: Approximation Algorithms: Introduction CS 583: Approximation Algorithms: Introduction Chandra Chekuri January 15, 2018 1 Introduction Course Objectives 1. To appreciate that not all intractable problems are the same. NP optimization problems,

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................

More information

Chapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved.

Chapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved. Chapter 11 Approximation Algorithms Slides by Kevin Wayne. Copyright @ 2005 Pearson-Addison Wesley. All rights reserved. 1 Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should

More information

Theory and Applications of Simulated Annealing for Nonlinear Constrained Optimization 1

Theory and Applications of Simulated Annealing for Nonlinear Constrained Optimization 1 Theory and Applications of Simulated Annealing for Nonlinear Constrained Optimization 1 Benjamin W. Wah 1, Yixin Chen 2 and Tao Wang 3 1 Department of Electrical and Computer Engineering and the Coordinated

More information

Discrete Optimization 2010 Lecture 8 Lagrangian Relaxation / P, N P and co-n P

Discrete Optimization 2010 Lecture 8 Lagrangian Relaxation / P, N P and co-n P Discrete Optimization 2010 Lecture 8 Lagrangian Relaxation / P, N P and co-n P Marc Uetz University of Twente m.uetz@utwente.nl Lecture 8: sheet 1 / 32 Marc Uetz Discrete Optimization Outline 1 Lagrangian

More information

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu

More information

Algorithms for a Special Class of State-Dependent Shortest Path Problems with an Application to the Train Routing Problem

Algorithms for a Special Class of State-Dependent Shortest Path Problems with an Application to the Train Routing Problem Algorithms fo Special Class of State-Dependent Shortest Path Problems with an Application to the Train Routing Problem Lunce Fu and Maged Dessouky Daniel J. Epstein Department of Industrial & Systems Engineering

More information

Part III: Traveling salesman problems

Part III: Traveling salesman problems Transportation Logistics Part III: Traveling salesman problems c R.F. Hartl, S.N. Parragh 1/74 Motivation Motivation Why do we study the TSP? it easy to formulate it is a difficult problem many significant

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

Fundamentals of Metaheuristics

Fundamentals of Metaheuristics Fundamentals of Metaheuristics Part I - Basic concepts and Single-State Methods A seminar for Neural Networks Simone Scardapane Academic year 2012-2013 ABOUT THIS SEMINAR The seminar is divided in three

More information

An Effective Chromosome Representation for Evolving Flexible Job Shop Schedules

An Effective Chromosome Representation for Evolving Flexible Job Shop Schedules An Effective Chromosome Representation for Evolving Flexible Job Shop Schedules Joc Cing Tay and Djoko Wibowo Intelligent Systems Lab Nanyang Technological University asjctay@ntuedusg Abstract As the Flexible

More information

Robust Network Codes for Unicast Connections: A Case Study

Robust Network Codes for Unicast Connections: A Case Study Robust Network Codes for Unicast Connections: A Case Study Salim Y. El Rouayheb, Alex Sprintson, and Costas Georghiades Department of Electrical and Computer Engineering Texas A&M University College Station,

More information

CHAPTER 3 FUNDAMENTALS OF COMPUTATIONAL COMPLEXITY. E. Amaldi Foundations of Operations Research Politecnico di Milano 1

CHAPTER 3 FUNDAMENTALS OF COMPUTATIONAL COMPLEXITY. E. Amaldi Foundations of Operations Research Politecnico di Milano 1 CHAPTER 3 FUNDAMENTALS OF COMPUTATIONAL COMPLEXITY E. Amaldi Foundations of Operations Research Politecnico di Milano 1 Goal: Evaluate the computational requirements (this course s focus: time) to solve

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

Neural Networks Lecture 6: Associative Memory II

Neural Networks Lecture 6: Associative Memory II Neural Networks Lecture 6: Associative Memory II H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011. A. Talebi, Farzaneh Abdollahi Neural

More information

Notes on Back Propagation in 4 Lines

Notes on Back Propagation in 4 Lines Notes on Back Propagation in 4 Lines Lili Mou moull12@sei.pku.edu.cn March, 2015 Congratulations! You are reading the clearest explanation of forward and backward propagation I have ever seen. In this

More information

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required.

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In humans, association is known to be a prominent feature of memory.

More information

Artificial Intelligence Heuristic Search Methods

Artificial Intelligence Heuristic Search Methods Artificial Intelligence Heuristic Search Methods Chung-Ang University, Jaesung Lee The original version of this content is created by School of Mathematics, University of Birmingham professor Sandor Zoltan

More information

CS/COE

CS/COE CS/COE 1501 www.cs.pitt.edu/~nlf4/cs1501/ P vs NP But first, something completely different... Some computational problems are unsolvable No algorithm can be written that will always produce the correct

More information

Analysis of Algorithms. Unit 5 - Intractable Problems

Analysis of Algorithms. Unit 5 - Intractable Problems Analysis of Algorithms Unit 5 - Intractable Problems 1 Intractable Problems Tractable Problems vs. Intractable Problems Polynomial Problems NP Problems NP Complete and NP Hard Problems 2 In this unit we

More information

Linear discriminant functions

Linear discriminant functions Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative

More information

Integer weight training by differential evolution algorithms

Integer weight training by differential evolution algorithms Integer weight training by differential evolution algorithms V.P. Plagianakos, D.G. Sotiropoulos, and M.N. Vrahatis University of Patras, Department of Mathematics, GR-265 00, Patras, Greece. e-mail: vpp

More information

Discrete evaluation and the particle swarm algorithm

Discrete evaluation and the particle swarm algorithm Volume 12 Discrete evaluation and the particle swarm algorithm Tim Hendtlass and Tom Rodgers Centre for Intelligent Systems and Complex Processes Swinburne University of Technology P. O. Box 218 Hawthorn

More information

Lecture 4: NP and computational intractability

Lecture 4: NP and computational intractability Chapter 4 Lecture 4: NP and computational intractability Listen to: Find the longest path, Daniel Barret What do we do today: polynomial time reduction NP, co-np and NP complete problems some examples

More information

Algorithms. NP -Complete Problems. Dong Kyue Kim Hanyang University

Algorithms. NP -Complete Problems. Dong Kyue Kim Hanyang University Algorithms NP -Complete Problems Dong Kyue Kim Hanyang University dqkim@hanyang.ac.kr The Class P Definition 13.2 Polynomially bounded An algorithm is said to be polynomially bounded if its worst-case

More information

Chapter 4 Beyond Classical Search 4.1 Local search algorithms and optimization problems

Chapter 4 Beyond Classical Search 4.1 Local search algorithms and optimization problems Chapter 4 Beyond Classical Search 4.1 Local search algorithms and optimization problems CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information