MAPPING LARGE PARALLEL SIMULATION PROGRAMS TO MULTICOMPUTER SYSTEMS

A. Tentner (ed.): High Performance Computing 1994, Proc. of the SCS Simulation Multiconference 1994, San Diego, 11.-15. April 1994, pp. 285-290.

Hans-Ulrich Heiss and Marcus Dormanns
Department of Informatics, University of Karlsruhe, D-76128 Karlsruhe, Germany (heiss@ira.uka.de)

ABSTRACT

We consider the problem of mapping parallel simulation programs to distributed-memory parallel machines. Since a large fraction of computer simulations consists of solving partial differential equations, the communication patterns of the resulting parallel programs can be exploited to construct efficient mappings that lead to low communication overhead. We report on the application of Kohonen networks to find such mappings.

1 INTRODUCTION

Most computer simulations deal with physical processes in space and time modeled by a set of partial differential equations (PDEs). Discretization leads to a large sparse system of linear equations or, in the case of nonlinear PDEs or discretization in time, to a sequence of such systems. The simulation can therefore easily be parallelized along the spatial structure of the model, resulting in programs that can be considered as grid-like graphs with vertices as computational nodes and edges as communication relations between them. Ideally, the parallel machine (MIMD with distributed memory) executing the simulation program matches the topological structure of the model, i.e. the communication graph of the parallel program. While suitable machines can be found for regular problem structures (complete 2D or 3D grids), many communication graphs are irregular to some degree, e.g. if the physical system to be modeled is of irregular shape or if the discretization uses different degrees of refinement in different areas (finite element method, FEM). Irregular communication patterns are also usual in parallel discrete event simulation programs.
In these cases the problem arises of how to map the communication graph of the simulation program to the machine graph that consists of the processor nodes and the interconnection links. The goal is to spread the computational load as evenly as possible across the processor network while keeping communication delays low by placing strongly communicating tasks close together. The problem can be formalized as a graph embedding problem, which is notoriously NP-hard. Algorithms to solve this problem are therefore of heuristic nature. After giving a brief overview of heuristic approaches to that problem, we propose a new algorithm that is based on the concept of self-organizing maps introduced by Kohonen. We show how our algorithm performs compared to the well-known Simulated Annealing (SA) heuristic. This is done based on simulation experiments using real FEM graphs mapped to processor networks of different sizes. As another example application we report on experiences with modeling the non-stationary heat conduction in an irregularly shaped area by the corresponding PDE, which is solved using a parallel Conjugate Gradient (CG) solver on a transputer-based multicomputer system.

2 THE MODEL

A natural way to model multicomputer systems is the so-called Processor Connection Graph $PCG = (P, E_P)$ with

    $P$               set of identical processor elements
    $E_P$             set of bidirectional processor links
    $\gamma(p,q)$     link communication delay (time / data unit)

The edge weights $\gamma(p,q)$ can be used to define a distance metric on the PCG between arbitrary pairs of processor nodes, $d: P \times P \to \mathbb{R}^{+}$, where $d(p,q)$ indicates the length of the shortest path between $p$ and $q$.

Analogously, a parallel program can be modeled as a Task Interaction Graph $TIG = (T, E_T)$ with

    $T$               set of tasks
    $E_T$             set of communication relationships
    $\alpha(i,j)$     communication intensity (number of data units to be exchanged)

Depending on the problem solved by the parallel program, TIGs cover a large spectrum of different sizes and topologies.
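The distance metric $d(p,q)$ of the PCG can be illustrated with a short Python sketch. It is not taken from the paper: the helper names `mesh_pcg` and `distances_from`, the dictionary representation of the graph, and the uniform link delay `gamma=1.0` are assumptions chosen for illustration. Shortest paths under the edge weights $\gamma$ are computed with Dijkstra's algorithm.

```python
import heapq

def mesh_pcg(rows, cols, gamma=1.0):
    """Build a rows x cols mesh PCG as {p: {q: gamma(p, q)}} with bidirectional links.
    The uniform delay `gamma` is an illustrative assumption."""
    links = {(r, c): {} for r in range(rows) for c in range(cols)}
    for (r, c) in list(links):
        for q in ((r + 1, c), (r, c + 1)):
            if q in links:
                links[(r, c)][q] = gamma
                links[q][(r, c)] = gamma
    return links

def distances_from(pcg, source):
    """Dijkstra: d(source, q) is the length of the shortest path under the weights gamma."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, p = heapq.heappop(heap)
        if d > dist[p]:
            continue  # stale heap entry, a shorter path was already found
        for q, g in pcg[p].items():
            nd = d + g
            if nd < dist.get(q, float("inf")):
                dist[q] = nd
                heapq.heappush(heap, (nd, q))
    return dist

pcg = mesh_pcg(3, 3)
d = distances_from(pcg, (0, 0))
print(d[(2, 2)])  # 4.0: opposite corners of a 3x3 mesh are four links apart
```

With non-uniform delays only the dictionary values change; the metric itself is still the shortest-path length.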
In the case of the numerical parallel solution of a PDE, each task (node of the TIG) corresponds to one discretization point and consists of the calculations required for the iterative solution. Those TIGs are usually large and based on mostly local communication. TIGs of other examples (e.g. the n-body problem) may have far fewer nodes and mostly non-local communication. Given such a TIG, we have to find a mapping $\pi: T \to P$ which minimizes the communication costs, defined as the time each of the transmitted data units needs to reach its destination, summed up over all communication relations:

$$CC := \sum_{(i,j) \in E_T} \alpha(i,j)\, d(\pi(i), \pi(j)) \to \min \tag{1}$$

The mapping $\pi$ can be either injective (one-to-one), i.e. at most one task is allocated to each processor, or contractive (many-to-one), which means that more than one task may be assigned to a processor. We assume that communication between tasks residing on the same processor is fast enough to be negligible compared to communication across processor boundaries. The above cost function reflects this by assuming $d(p,p) = 0$ for all $p \in P$.

In most cases, the number of computational tasks is larger than the number of processors available, i.e. $|T| = m > |P| = n$. This means that we have to assign more than one task to each processor. These tasks have to share the computational power of the processor, which decreases their speed of progress. Since most algorithms require synchronization between the tasks, it is important that all tasks make equal progress to minimize synchronization delays. This can be achieved by balancing the computational load across the processors. We therefore have a second optimization goal, the minimization of the load unbalance LU. In the simplest case, which we assume here for brevity, the load of a processor is defined as the number of tasks assigned to it, and the load unbalance is the accumulated deviation from the average load:

$$LU := \sum_{p \in P} \left| load(p) - \overline{load} \right| \to \min \tag{2}$$

with

$$load(p) := \left| \{\, i \in T \mid \pi(i) = p \,\} \right| \quad \text{and} \quad \overline{load} := \frac{1}{n} \sum_{p \in P} load(p)$$

(If necessary, it is easy to incorporate different processor speeds and different task sizes, i.e. computational requirements.)

Using only goal (1) (minimizing CC) and ignoring goal (2) (minimizing LU) would lead to a situation where all tasks are assigned to exactly one processor, because then we have only local communication and the communication cost is 0. On the other hand, using only goal (2) and ignoring (1) could place strongly communicating tasks on extremely remote processors. Therefore, goals (1) and (2) are contradictory and have to be considered together.
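The conflict between goals (1) and (2) can be made concrete with a small Python sketch. The function names, the toy TIG (four tasks in a ring with intensities 3 and 1), the two-processor PCG and the use of the Manhattan distance as $d$ are illustrative assumptions, not data from the paper.

```python
def communication_cost(alpha, pi, d):
    """CC as in (1): sum over TIG edges of alpha(i, j) * d(pi(i), pi(j))."""
    return sum(a * d(pi[i], pi[j]) for (i, j), a in alpha.items())

def load_unbalance(pi, processors):
    """LU as in (2): accumulated absolute deviation from the average load,
    where load(p) is the number of tasks mapped to p."""
    load = {p: 0 for p in processors}
    for p in pi.values():
        load[p] += 1
    mean = sum(load.values()) / len(processors)
    return sum(abs(l - mean) for l in load.values())

def manhattan(p, q):
    """Hop distance on a mesh with unit link delay; note d(p, p) = 0."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

processors = [(0, 0), (0, 1)]
alpha = {(0, 1): 3, (1, 2): 1, (2, 3): 3, (3, 0): 1}    # hypothetical ring TIG

all_on_one = {0: (0, 0), 1: (0, 0), 2: (0, 0), 3: (0, 0)}
scattered  = {0: (0, 0), 1: (0, 1), 2: (0, 0), 3: (0, 1)}
balanced   = {0: (0, 0), 1: (0, 0), 2: (0, 1), 3: (0, 1)}

for name, pi in [("all_on_one", all_on_one), ("scattered", scattered), ("balanced", balanced)]:
    print(name, communication_cost(alpha, pi, manhattan), load_unbalance(pi, processors))
```

In this toy case, placing everything on one processor gives CC = 0 but LU = 4, the scattered mapping gives LU = 0 but CC = 8, and the mapping that keeps the strongly communicating pairs together reaches LU = 0 with CC = 2.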
Considering two contradictory optimization goals can be done in two different ways:

a) Building a new objective function as a linear combination of the two original functions.
b) Taking one of the two functions as the objective function and the other one as a constraint, e.g. 'minimize CC subject to $LU \le x$'.

To find such a task-to-processor mapping, there are two general methods available: (i) the indirect or two-step method, and (ii) the direct or one-step method (Figure 1).

With the indirect method, which is usually applied for mapping large parallel simulation programs to parallel computers, the task interaction graph is broken down into as many (equal) partitions as there are processors available while minimizing communication costs (n-way graph partitioning with minimum cut). For this partitioning step, two different approaches can be taken. If the TIG represents the solution domain of a PDE, then geometrically oriented decomposition methods may be chosen, e.g. recursive bisection schemes [Fox 1988; Sadayappan et al. 1990; Williams 1991; DeKeyser and Roose 1993]. However, these methods are only useful if the solution domain has a benign shape, i.e. the computation to be performed at each discretization point is affected only by nearby discretization points. There are other application areas where these properties are not available, e.g. molecular dynamics simulations.

For all simulation programs that result in more general TIGs, more general partitioning schemes have to be employed. One class of those general partitioning algorithms also uses recursive bisection, based not on a geometric representation of the computational domain but on general graphs. The core of these algorithms is the famous heuristic of Kernighan and Lin [Kernighan and Lin 1970] for near-optimal bipartitioning of graphs with minimal cut costs. Also applicable to n-way graph partitioning are modern algorithms for general combinatorial problems, e.g. Genetic Algorithms (GA), Simulated Annealing (SA) and Neural Networks (NN) [Reeves 1993].
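The geometrically oriented recursive bisection mentioned above can be sketched in a few lines of Python. This is a minimal coordinate-bisection variant written for illustration, not the specific schemes of the cited papers. It assumes that tasks are given as 2D points, that the number of parts is a power of two, and that there are at least as many points as parts. The function name `recursive_bisection` is hypothetical.

```python
def recursive_bisection(points, parts):
    """Split `points` (list of (x, y) tuples) into `parts` equally sized groups by
    repeatedly cutting the point set at the median of its longer extent."""
    if parts == 1:
        return [points]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Cut perpendicular to the axis along which the point set is most spread out.
    axis = 0 if max(xs) - min(xs) >= max(ys) - min(ys) else 1
    ordered = sorted(points, key=lambda p: (p[axis], p[1 - axis]))
    half = len(ordered) // 2
    return (recursive_bisection(ordered[:half], parts // 2)
            + recursive_bisection(ordered[half:], parts // 2))

grid = [(x, y) for x in range(4) for y in range(4)]   # 4x4 discretization points
blocks = recursive_bisection(grid, 4)
for block in blocks:
    print(sorted(block))
```

For the 4x4 grid the four partitions come out as 2x2 blocks, so each partition communicates only across its block boundary.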
As a result of this first step, the partitions obtained can be regarded as supernodes of a new graph, the so-called contraction graph, which, as the second step, has to be embedded in the target processor graph (one-to-one embedding).

[Figure 1: Contractive mappings. Direct method: TIG mapped to PCG. Indirect method: TIG with 4-partitioning, contraction graph, PCG.]

If the interconnection network of the processors is fast enough that communication between remote processors is only insignificantly slower than between adjacent processors, this embedding may be uncritical, i.e. the mapping of the supernodes to the processors can be arbitrary without sacrificing communication efficiency. However, most architectures exhibit a communication delay that increases monotonically with the physical distance between the processors in the network. In many cases, this delay is a linear function of the distance. And even if the propagation delay is negligible, the limited bandwidth of the connections requires careful embedding to avoid congestion. The point we want to make is that even with the powerful interconnection structures of recent parallel machines, it is advantageous to map communicating (super)tasks to neighboring processors. However, this embedding problem is also NP-hard, at least in the general case. Efficient embeddings are available only if both source and target graph topologies are regular.

Apart from these unavoidable problems, the indirect approach has another drawback: since the partitioning of the TIG only considers the number of processors of the target machine, but not their interconnection topology, the resulting partitioning may be optimal in general, but not with regard to a particular target topology. For example, the optimal decomposition of a TIG for later embedding in a 2D mesh may be different from a decomposition for a hypercube. So there is some 'loss' of optimality when using the indirect approach. This 'loss' can be avoided if we directly map the nodes of the TIG (tasks) to the nodes of the PCG (processors) in a many-to-one mapping. The direct method therefore has a greater potential for optimal or near-optimal solutions. Of course, this direct mapping problem is also NP-hard and can be solved using the general modern heuristics for hard combinatorial problems mentioned above.
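The contraction graph used by the indirect method can be sketched as follows. The function name, the dictionary representation of the TIG and the six-task example are assumptions for illustration. The sketch only shows how partitions become supernodes whose edge weights accumulate the communication intensity crossing between two partitions.

```python
from collections import defaultdict

def contraction_graph(alpha, part_of):
    """Collapse a partitioned TIG: each partition becomes a supernode, and the weight
    of a superedge is the total intensity alpha crossing between its two partitions."""
    contracted = defaultdict(float)
    for (i, j), a in alpha.items():
        pi_, pj_ = part_of[i], part_of[j]
        if pi_ != pj_:                      # traffic inside a partition disappears
            contracted[(min(pi_, pj_), max(pi_, pj_))] += a
    return dict(contracted)

# Hypothetical TIG: a ring of six tasks, partitioned into three pairs.
alpha = {(0, 1): 2, (1, 2): 1, (2, 3): 2, (3, 4): 1, (4, 5): 2, (0, 5): 3}
part_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}

supergraph = contraction_graph(alpha, part_of)
print(supergraph)                 # superedges with accumulated intensities
print(sum(supergraph.values()))   # total cut cost of the partitioning
```

The sum of the superedge weights is the cut cost that the partitioning step tries to minimize. The second step then embeds this smaller graph one-to-one into the PCG.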
3 THE KOHONEN PROCESS

A Kohonen network is a special type of neural network and can be regarded as a layer of $n$ neurons which are arranged and interconnected in some way, e.g. as a two-dimensional grid, which allows a metric $d(p,q)$ to be defined on the network [Kohonen 1989; Ritter et al. 1991]. Each neuron is connected to each of $m$ inputs forming an $m$-dimensional input vector $x$. At each neuron $p$, there is a weight $w_{ip}$ associated with each input $i$, which is used to compute the weighted sum of the inputs. Besides this excitation by the inputs, there are signals coming in from other neurons that are also weighted, leading to a nonlinear system of equations whose solution indicates the stationary states of the neurons. It can be shown that the overall effect of the neurons' behaviour is that for every particular input signal there will be a neuron that is maximally excited, i.e. different signals excite different parts of the network, and the more similar the input signals are, the closer (in terms of the network metric) are the excited regions.

Kohonen's idea is to approximate the network behavior by a simple calculation. He proposed calculating the center of excitation using only the input signal and replacing the signal interchange of the neurons by a simple function that decreases with increasing interneuron distance. Neuron $p^*$, which is maximally excited by the input signal $x$, is called the excitation center of the signal $x$:

$$\| w_{p^*} - x \| = \min_{p \in P} \| w_p - x \| \tag{3}$$

with $w_p = (w_{1p}, \ldots, w_{mp})^T$ and $x = (x_1, \ldots, x_m)^T$.

So far, depending on the weights $w_{ip}$, the Kohonen network defines a mapping $\pi_w$ which maps each input signal $x$ to a location (or neuron) $p^*$ (the center of excitation):

$$\pi_w : x \mapsto p^* = \pi_w(x) \tag{4}$$

Kohonen's approximative approach uses a special function $h_{p^*,p}$, which models the neural interaction by indicating to which degree other neurons in the neighborhood of the excitation center are excited as well. This environment excitation is a function of the distance of the neuron from the excitation center $p^*$.
It is maximal for $p = p^*$ and decreases with increasing distance $d(p^*,p)$. Usually, the bell-shaped Gaussian density function is employed.

To ensure that similar input signals are mapped to neighboring neurons, the weights have to be adapted suitably. To that end, a learning rule is applied that updates the weights incrementally for each signal. The increment depends on the signal and on the degree of excitation at this neuron. The learning rule can be formulated as follows:

$$w_p^{new} = w_p^{old} + \varepsilon\, h_{p^*,p} \left( x - w_p^{old} \right) \tag{5}$$

Weights are changed in proportion to their difference from the input signal and, more importantly, in proportion to the neighborhood function $h_{p^*,p}$, which causes the weight correction to fade with increasing distance from the center $p^*$. The third coefficient, $\varepsilon$, serves as the learning step size, which varies during learning.

Figure 2 illustrates the update operation as a response to an input signal $x$: it causes the calculation of an excitation center $p^*$ that corrects its weight vector $w_{p^*}$ in the direction of $x$ by $\Delta w_{p^*}$. The same correction, but to a lesser extent, is also carried out by the neurons in the neighborhood of $p^*$. The decay of the excitation is shown by different shadings.

[Figure 2: Weight update with the Kohonen algorithm. The vector space of input signals $X$ (a triangular input space) is mapped by $\pi$ to the neuron layer $P$ (a rectangular grid).]

Now we are ready to present a scheme for an iterative algorithm that assigns each input signal $x \in X$ to a location $p \in P$ and, by doing so, transforms similarity of signals into spatial closeness. The more similar the signals, the smaller the distance of their images:

$$\pi: X \to P \quad \text{with} \quad \| w_{\pi(x)} - x \| = \min_{p \in P} \| w_p - x \| \tag{6}$$

The algorithm that approximately finds such a mapping selects at each step an element from the set of input signals, which leads to an update of some weight vectors. The algorithm is usually terminated after a prespecified number of steps, $t_{max}$. Other parameters that affect the performance of the algorithm are the update step size $\varepsilon$ and the shape of the neighborhood function $h$. $\varepsilon(t)$ is usually continuously decreasing to allow large corrections at the beginning of the process and only small ones toward the end, when a global order has already been established. The shape of the neighborhood function $h_{p^*,p}$ is governed by the width $\sigma(t)$, which is also continuously decreasing in order to find a coarse global order early and later to allow focusing on narrower regions for local optimizations.

    initialize w_ip
    for k = 1 to t_max do
        select x ∈ X randomly
        determine p* with ||w_p* - x|| = min_p ||w_p - x||
        w_p ← w_p + ε h_p*,p (x - w_p)   for all p ∈ P
    end for

Algorithm 1: Kohonen algorithm

Since the weight vectors of the neurons are elements of the input space, it can be shown how the network adapts to the topology of the input space (Figure 3). The position of the neurons corresponds to the final values of their respective weight vectors.

[Figure 3: Possible result of a Kohonen process applied to the example of Figure 2]

It can be shown analytically [Ritter et al. 1991] that in the case of a discrete and finite set of input vectors and a uniform probability for their selection, the following objective function is minimized:

$$V = \frac{1}{2} \sum_{p,q \in P} h_{p,q} \sum_{x \in F(q)} \frac{1}{m} \left( x - w_p \right)^2 \tag{7}$$

where $F(q)$ denotes the receptive field of $q$, i.e. the set of input vectors for which $q$ is the excitation center:

$$F(q) := \{\, x \in X \mid \| w_q - x \| = \min_{p \in P} \| w_p - x \| \,\}$$

4 ADAPTATION TO THE PROBLEM

To apply the Kohonen algorithm to map TIGs to PCGs, we associate the set of tasks with the input space and the processor network with the neural network. Since the input space should exhibit a metric, we need a suitable representation of the tasks, which we obtain by defining a correlation measure between arbitrary tasks (input vectors). This measure is based on the communication intensities between the tasks. For arbitrary TIGs, the tasks are represented as $m$-dimensional vectors with the correlations to other tasks as the vector components. Details can be found in [Heiss and Dormanns 1993]. The potential function $V$ (7) is strongly related to the communication cost CC (1), and it can be shown that minimizing $V$ approximately also minimizes CC.

In the original Kohonen process, load balancing takes place as a kind of side effect of the process. It can be enhanced by a dedicated load balancing mechanism incorporated in the algorithm: the receptive fields of the neurons lead to a partitioning of the input space and can be regarded as Voronoi cells. The size of a cell, or receptive field, corresponds to the load the processor obtains. By enlarging the lengths of the weight vectors we can influence the size of the receptive fields, which gives us the means to fine-tune load balancing.

[Equation (8) is missing from the source text.]
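Algorithm 1, applied to task mapping as described above, can be sketched in plain Python. Several parts are assumptions made for illustration and are not taken from the paper: the task representation in `correlation_vectors` (normalized intensities with a self-correlation of 1.0; the actual measure is given in the cited internal report), the exponentially decaying schedules for $\varepsilon$ and $\sigma$, the Euclidean grid distance as network metric, and the random initialization. The dedicated load balancing mechanism is not included.

```python
import math
import random

def correlation_vectors(m, alpha):
    """Assumed task representation: component j of vector i is the normalized
    intensity alpha(i, j); the self-correlation is set to 1.0."""
    top = max(alpha.values())
    vecs = [[0.0] * m for _ in range(m)]
    for (i, j), a in alpha.items():
        vecs[i][j] = vecs[j][i] = a / top
    for i in range(m):
        vecs[i][i] = 1.0
    return vecs

def kohonen_map(inputs, rows, cols, t_max=2000, seed=0):
    """Run the Kohonen process on a rows x cols neuron grid (the processors)
    and return {task index: neuron} as the many-to-one mapping pi."""
    rng = random.Random(seed)
    neurons = [(r, c) for r in range(rows) for c in range(cols)]
    dim = len(inputs[0])
    w = {p: [rng.random() for _ in range(dim)] for p in neurons}  # initialize w_ip
    sigma0 = max(rows, cols) / 2.0

    def center(x):
        # Excitation center p* as in (3): neuron with the closest weight vector.
        return min(neurons, key=lambda p: sum((wi - xi) ** 2 for wi, xi in zip(w[p], x)))

    for k in range(t_max):
        frac = k / t_max
        eps = 0.5 * (0.02 ** frac)                   # step size decays from 0.5 towards 0.01
        sigma = sigma0 * ((0.3 / sigma0) ** frac)    # neighborhood width shrinks towards 0.3
        x = rng.choice(inputs)                       # select x randomly
        p_star = center(x)
        for p in neurons:
            d2 = (p[0] - p_star[0]) ** 2 + (p[1] - p_star[1]) ** 2
            h = math.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian neighborhood function
            # Learning rule (5): move w_p towards x, scaled by eps and h.
            w[p] = [wi + eps * h * (xi - wi) for wi, xi in zip(w[p], x)]
    return {i: center(x) for i, x in enumerate(inputs)}

m = 8
alpha = {(i, (i + 1) % m): 1.0 for i in range(m)}    # hypothetical ring TIG
pi = kohonen_map(correlation_vectors(m, alpha), rows=2, cols=2)
print(pi)
```

The returned dictionary can be evaluated with the cost functions (1) and (2). Because the process is randomized, different seeds may yield different mappings of similar quality.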

Table 1: Results for mapping FEM graphs to mesh-connected systems

    TIG size m   PCG topology   load deviation (%)      CC
                                SA       Kohonen        SA        Kohonen
    290          6x6-M           8.0     10.3            589.8     618.8
    290          8x8-M          16.5     15.2           1048.4     849.3
    290          10x10-M        18.8     18.9           1425.9    1101.1
    491          6x6-M           6.3     12.5            758.9     803.0
    491          8x8-M          14.1     14.2           1535.7    1127.5
    491          10x10-M        16.3     15.2           2051.4    1449.6
    491          12x12-M        21.4     20.5           2745.9    1709.4

Table 1 compares some results of mapping a FEM graph to different mesh-connected machines by Simulated Annealing and the Kohonen process. We do not claim that the results achieved by SA are the best one can obtain using this method, but we did put some effort into tuning the parameters of the algorithm. At least the results indicate that the Kohonen process is competitive concerning the quality of results. With regard to efficiency, i.e. computational overhead, both algorithms needed roughly the same time: to map the 491-node FEM-TIG to a 6x6 mesh, SA needed 80 seconds on a Sun SPARC 10, while Kohonen needed 54 seconds.

5 AN EXAMPLE

As a simple example we report on the implementation of the simulation of the non-stationary heat conduction in a two-dimensional cooling body. The parabolic differential equation to be solved is given by

$$u_{xx} + u_{yy} = u_t \tag{10}$$

[Figure 4: Cooling body (example)]

The body is represented in the unit square, which was discretized by 200x200 grid points, which, according to the shape of the body, results in 22,000 variables. Temporal differentiation is done by first-order forward differences. This leads to a series of linear equation systems with a constant matrix which is symmetric, positive definite and of dimension 22,000 with at most 5 non-zero entries per row. It should be noted that the discretization is unnecessarily fine for the simple shape, but this was intentionally chosen to obtain sufficient computational load. Since the number of tasks is too large for an efficient use of the Kohonen process, we first apply a simple clustering scheme to reduce the number of tasks to be mapped while leaving sufficient leeway for the actual mapping. It turned out that a number of clusters equal to five times the number of processors is reasonable (Figure 5).

[Figure 5: Result after clustering]

This clustered task graph is then mapped to the processor network, which is a 4x8 mesh in our example. Figure 6 shows the processor network with the black circles indicating the load each processor receives and with the arcs indicating communication relations crossing processor boundaries.

[Figure 6: Final mapping, e.g. on a 4x8 mesh]

To examine the scalability properties, the CG algorithm was implemented on a transputer-based system running our own operating system COSY. Since the hardware does not provide useful group communication algorithms, tree-based multicast and combine operations are implemented in the operating system. Figure 7 shows the measured execution times for 10 time steps of the non-stationary heat conduction, totalling 1094 iteration steps of the CG algorithm, which in turn represents $5 \times 10^8$ floating-point instructions. It can be seen that the time spent for communication is slightly increasing, with the dot product communication making up the lion's share. For the largest configuration (8x8 mesh) we end up with an efficiency of roughly 27%, which is reasonable considering the rather low bandwidth of T805 transputers (ca. 1.7 Mb/sec per link) and the operating system overhead. (For the implementation of the NAS CG benchmark on a 128-node Intel iPSC/860, a rate of 181 MFLOPS was reported, which corresponds to only 3% efficiency [Lewis and van de Geijn 1993]. It should be noted that the matrix used in the NAS benchmark has a random structure that cannot be exploited for mapping purposes. It shows, however, the potential for structure exploitation in those computations.)

[Figure 7: Run times of the CG algorithm on different square meshes (3x3, 4x4, 5x5, 6x6, 8x8). Axes: time (sec) versus number of processors. Components: computation, communication for the matrix-vector product, communication for the inner product.]

Besides its competitiveness compared to other heuristics like SA, the Kohonen algorithm as proposed has several useful features.

If for some reason particular tasks have to be mapped to specific processors (e.g. because special hardware is required), we simply initialize the weight vector of the target processor (neuron) with the input vector of the task and keep it fixed during the self-organization process.

Application in dynamic multiprogramming environments is possible by superposition of several self-organizing processes at the same time that only cooperate for load balancing purposes.

Dynamic task creation or changing communication patterns of parallel programs can also be handled by running the process within small neighborhoods (small step sizes $\varepsilon$ and a narrow neighborhood function $h$) to remove developing distortions.

Parallelization of the algorithm is possible with almost linear speed-up. The communication overhead of a parallel implementation is smaller than one might expect, since in the convergence phase of the process, which needs the most iterations, the process is working only in small environments, i.e. only few and adjacent processors are involved.

The asymptotic complexity of the sequential algorithm is $O(n m^2)$, which mainly results from the inner products to be calculated and from the number of iterations required, which is proportional to $m$. For some types of PCGs this can be reduced to $O(n m^2 / \log n)$.

For 2D or 3D geometric TIGs, the intertask correlations can be based on the Euclidean distance, and tasks can be represented by their 2D or 3D coordinates. This leads to a much more efficient implementation.
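The geometric representation can be illustrated with a short Python sketch. It is an assumption-laden toy, not the authors' implementation: tasks are the cell centers of a 4x4 discretization of the unit square, and the neuron weights of a 2x2 mesh are placed by hand at the quadrant centers, roughly where a self-organization process might leave them. The point is that each input vector has only two components, so finding the excitation center no longer involves $m$-dimensional vectors.

```python
import math

def excitation_center(coord, weights):
    """Neuron whose 2D weight vector is nearest (Euclidean) to the task coordinate."""
    return min(weights, key=lambda p: math.hypot(weights[p][0] - coord[0],
                                                 weights[p][1] - coord[1]))

# Tasks represented by their 2D coordinates (cell centers of a 4x4 grid).
tasks = [((x + 0.5) / 4, (y + 0.5) / 4) for x in range(4) for y in range(4)]

# Hypothetical weights of a 2x2 processor mesh, set by hand for illustration.
weights = {(r, c): (0.25 + 0.5 * r, 0.25 + 0.5 * c) for r in range(2) for c in range(2)}

pi = {t: excitation_center(t, weights) for t in tasks}
load = {p: sum(1 for q in pi.values() if q == p) for p in weights}
print(load)
```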
In this geometric case, the time complexity of a parallel implementation of the Kohonen process is independent of $n$ and $m$ and is only influenced by the extent of dissimilarity between TIG and PCG.

6 CONCLUSION

We have presented a new method to map large simulation programs to parallel architectures. Instead of running on the front-end of the parallel machine and calculating a static allocation of the tasks to the processors, it can run on the parallel machine itself, either as part of the (distributed) operating system or, if no operating system is used, as part of the run-time system of the simulation program. For simulations exhibiting changes in the communication pattern or creating and deleting tasks during run-time, it offers the possibility of dynamic load balancing with low overhead while taking into account the communication behavior of the program.

REFERENCES

Bollinger, S.W. and S.F. Midkiff 1991. Heuristic Technique for Processor and Link Assignment in Multicomputers. IEEE TOC Vol. 40, 3, pp. 325-333.

Chockalingam, T. and S. Arunkumar 1992. A randomized heuristics for the mapping problem: The genetic approach. Parallel Computing 18, pp. 1157-1165.

DeKeyser, J. and D. Roose 1993. Load Balancing Data Parallel Programs on Distributed Memory Computers. Parallel Computing 19, pp. 1199-121

Dormanns, M. and H.-U. Heiss 1993. Topology Conserving Graph Mapping by Self Organization: A Solution to the Processor Allocation Problem. In: Albrecht, R.F. et al. (eds) Artificial Neural Networks and Genetic Algorithms, Conf. Proc. (Innsbruck, April 1993), Springer, Wien, pp. 198-205.

Fox, G.C. 1988. A Graphical Approach to Load Balancing and Sparse Matrix Vector Multiplication on the Hypercube. In: Schultz, E. (ed.): Numerical Algorithms for Modern Parallel Computers, Springer-Verlag, Berlin.

Heiss, H.-U. and M. Dormanns 1993. Task Assignment by Self-Organizing Maps. Internal Report No. 17/93, Dept. of Informatics, University of Karlsruhe, Germany.

Kernighan, B.W. and S. Lin 1970. An Efficient Heuristic for Partitioning Graphs. Bell Systems Journal Vol. 49, pp. 291-307.

Kohonen, T. 1989. Self-Organization and Associative Memory. 3rd edition, Springer-Verlag, Berlin.

Lewis, J.G. and R. van de Geijn 1993. Distributed Memory Matrix Vector Multiplication and Conjugate Gradient Algorithms. Proc. Supercomputing '93.

Reeves, C. (ed.) 1993. Modern Heuristic Techniques for Combinatorial Problems. Blackwell Scientific Publ., Oxford.

Ritter, H.; T. Martinetz and K. Schulten 1991. Neural Computation and Self-Organizing Maps. Addison-Wesley.

Sadayappan, P.; F. Ercal and J. Ramanujam 1990. Cluster partitioning approaches to mapping parallel programs onto a hypercube. Parallel Computing 13, pp. 1-16.

Williams, R.P. 1991. Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations. Concurrency: Practice and Experience 3, 5, pp. 457-481.