Laboratoire de l'Informatique du Parallélisme

Laboratoire de l'Informatique du Parallélisme
École Normale Supérieure de Lyon
Unité de recherche associée au CNRS n° 1398

Neural Network Parallelization on a Ring of Processors: Training Set Partition and Load Sharing

Bernard Girau

September 1994

Research Report

École Normale Supérieure de Lyon, 46 Allée d'Italie, 69364 Lyon Cedex 07, France
Téléphone : (+33) / Télécopieur : (+33) / Adresse électronique : lip@lip.ens-lyon.fr

Neural Network Parallelization on a Ring of Processors: Training Set Partition and Load Sharing

Bernard Girau

September 1994

Abstract

A parallel back-propagation algorithm that partitions the training set was introduced in [6]. Its performance on MIMD machines is studied, and a new version is developed, based on a heterogeneous load sharing method. Algebraic models allow precise comparisons and show great improvements when learning steps are processed.

Keywords: neural networks, parallel algorithms, load sharing, MIMD machines, performance models

Résumé

A parallel back-propagation algorithm based on a partition of the training set was introduced in [6]. This report studies its performance on several MIMD machines and presents a new version of it, based on a heterogeneous sharing of the computational load. Precise performance comparisons between the resulting methods are made possible by models involving linear algebraic systems; they highlight significant improvements for the learning phases.

Mots-clés: neural networks, parallel algorithms, load sharing, MIMD machines, performance modeling

Contents

Introduction
1 Initial algorithm
  1.1 Description and modeling
    1.1.1 Generalization phase
    1.1.2 Learning phase: centralized updating
    1.1.3 Learning phase: distributed updating
  1.2 Performance study
    1.2.1 Machine-dependent performance
    1.2.2 Influence of block sizes
2 Heterogeneous load sharing
  2.1 Message gathering
    2.1.1 Principle
    2.1.2 Modeling
  2.2 Heterogeneous load sharing (generalization phase)
    2.2.1 Description
    2.2.2 Algebraic model of the heterogeneous load sharing
    2.2.3 Solvability
    2.2.4 Performance modeling
    2.2.5 Exact and approximate solving
  2.3 Learning phase
    2.3.1 Studied versions
    2.3.2 New linear system
    2.3.3 Numerical results
Conclusion
Bibliography
Index

Introduction

Many studies have been performed to obtain efficient algorithms for neural network parallelization. Nevertheless, their efficiency mainly depends on the network characteristics. Moreover, most of them apply to specific neural architectures, or to specific parallel computers. Therefore, none of them can claim to be better than the others. Standard algorithms for neural network parallelization may be partitioned into two subsets. In an inner algorithm, any computation of the neural network is performed by all processors, whereas an outer algorithm performs each computation of the neural network on only one processor and achieves the parallelization by partitioning the set of all needed neural computations among the processors. Inner algorithms highly depend on the network structure.

This report deals with the training set partition algorithm. This outer algorithm applies to any neural problem using supervised learning with training sets. Therefore, this report follows the works of [4], [6] and [5]. In order to obtain better performance when the standard training set partition does not provide satisfactory efficiency, the communication steps are modified, and a heterogeneous load sharing is introduced so as to reduce the partial processor inactivity that appears because of the communication modifications. Precise characteristics of the appropriate load sharing are obtained by means of a linear algebraic model. Performance models show that the obtained heterogeneous version leads to a great efficiency increase when the training set partition is applied to the gradient descent learning algorithm.

Since many notations and terms are used without always recalling their meaning, an index is provided at the end of this report. It mentions where each notation or term is first employed or described.

Chapter 1: Initial algorithm

1.1 Description and modeling

In [4], [6] and [5], H. Paugam-Moisy partitions the training set so as to parallelize any neural network application. Each processor owns a local copy of the whole neural network. This algorithm has been tested on a one-directional ring of transputers, with up to 32 T800 processors. A neural learning task with a multilayer network and a training set of 2000 to 4000 samples has been parallelized. This network has 27 inputs, 3 outputs and only one hidden layer with 10 to 80 neurons. Satisfactory performance has been obtained, which justifies a precise analysis of this algorithm. This section recalls some important aspects of [6]. It also develops some new theoretical arguments. For instance, new critical cases are taken into account so as to determine the limits of this algorithm, called the standard set partition algorithm.

1.1.1 Generalization phase

A sample set is given. In a generalization phase, the neural output is computed for each input in this set. No learning is achieved.

Host processor. The T-Node machine that was used for the former works implies the existence of a privileged node, the I/O processor or host processor, which is able to communicate with the outside environment. Moreover, information distribution has to be taken into account in real-time applications, whatever the employed parallel machine is. Therefore, it will be assumed that there is such a host, which centralizes information exchange in each studied algorithm, even if the ring topology is simulated on a homogeneous computer in which all nodes have equivalent resources. Figure 1.1 describes the parallel structure that is considered in this report. In a generalization phase, the inputs are known by the host, and the computed outputs have to be communicated to the host.

Principle. Let S be the number of given samples. The input set is first partitioned into n subsets, so that S = ns. Each subset of s inputs is called a block. If p is the number of processors on the ring, and if s = pk, then we use n iterations of the following algorithm (a message-passing sketch of the node process is given after the communication model below).

Partition the block into p subsets of size k, called sub-blocks. The host sends all the sub-blocks to node 1, one after the other, and then waits for the p output sub-blocks, which are sent one after the other by node p. Node i runs the following process:
1. Receive p − i input sub-blocks, one after the other, and immediately send them to the next processor on the ring (that is: receive an input sub-block, forward it, receive the next one, and so on).
2. Receive a sub-block with k input vectors, and compute the corresponding outputs.

3. Send the obtained output sub-block to its successor.
4. Receive and immediately forward i − 1 output sub-blocks (that is: receive an output sub-block, forward it, receive the next one, and so on).

If i = p, we consider that the next processor is the host, and these last two steps make the host receive all the output sub-blocks.

[Figure 1.1: ring communication structure with host.]
[Figure 1.2: sample set partition principle. Node i receives and forwards (p − i)·k samples, sends k computed outputs, and receives and forwards (i − 1)·k outputs; the host sends s = p·k samples and receives s = p·k outputs.]

Figure 1.2 describes the set partition method which is thus obtained, whereas figure 1.3 shows when and how the nodes are busy during the algorithm processing. It is simplified in figure 1.4, which shows only the general shape of this behaviour. Such shapes will be used later when they are explicit enough, since they clearly point out the main aspects of any description.

Communication modeling. The following models aim to be more exhaustive than in [6]. A linear communication time model is used:
- between the host and a node (β0: startup time, L: message size, τ0: per-byte transfer time):
  T_c1(L) = β0 + L·τ0,
- between two nodes, with β ≤ β0 and τ ≤ τ0:
  T_c2(L) = β + L·τ.
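The node process described above can be summarised by the following sketch. MPI is used only as a modern stand-in for the transputer and iPSC/860 message-passing primitives of the original experiments; the rank layout (host = rank 0, ring nodes = ranks 1 to p), the byte sizes and the compute_outputs stub are illustrative assumptions rather than details taken from the report.

/* Sketch of the node process of the standard set partition algorithm
 * (generalization phase), for one block of s = p*k samples. */
#include <mpi.h>
#include <stdlib.h>

#define LAMBDA 108   /* input size in bytes (27 floats)   */
#define R       12   /* output size in bytes (3 floats)   */
#define K      100   /* samples per sub-block (k)         */

/* placeholder for the forward pass of the local copy of the network */
static void compute_outputs(const char *in, char *out) { (void)in; (void)out; }

void node_process(int i, int p, MPI_Comm ring)
{
    int pred = (i == 1) ? 0 : i - 1;   /* node 1 receives from the host  */
    int succ = (i == p) ? 0 : i + 1;   /* node p sends back to the host  */
    char *in  = malloc((size_t)K * LAMBDA);
    char *out = malloc((size_t)K * R);

    /* step 1: receive p-i input sub-blocks and forward them immediately */
    for (int j = 0; j < p - i; j++) {
        MPI_Recv(in, K * LAMBDA, MPI_BYTE, pred, 0, ring, MPI_STATUS_IGNORE);
        MPI_Send(in, K * LAMBDA, MPI_BYTE, succ, 0, ring);
    }
    /* step 2: receive the local sub-block and compute its outputs */
    MPI_Recv(in, K * LAMBDA, MPI_BYTE, pred, 0, ring, MPI_STATUS_IGNORE);
    compute_outputs(in, out);

    /* step 3: send the local output sub-block to the successor (or the host) */
    MPI_Send(out, K * R, MPI_BYTE, succ, 1, ring);

    /* step 4: receive and forward the i-1 output sub-blocks of the predecessors */
    for (int j = 0; j < i - 1; j++) {
        MPI_Recv(out, K * R, MPI_BYTE, pred, 1, ring, MPI_STATUS_IGNORE);
        MPI_Send(out, K * R, MPI_BYTE, succ, 1, ring);
    }
    free(in);
    free(out);
}

The host-side process (sending the p input sub-blocks one after the other and collecting the p output sub-blocks) is symmetric and is omitted here.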

[Figure 1.3: algorithm steps for the generalization phase (sendings of input and output sub-blocks between the host and the transputers, and output sub-block computations).]
[Figure 1.4: shape of the algorithm behaviour for the generalization phase.]

Digression. It might be assumed that β ≤ β0 and τ ≤ τ0, even if slower communications with the host are observed for the T-Node machine. Figure 1.5 shows the differences between both cases. This figure can be used so as to determine at what time node 1 starts its computation (the algorithm iteration starting at time 0). Let λ and r be the input and output sizes. If (β, τ) ≤ (β0, τ0), then node 1 starts computing at time

  p(β0 + kλτ0) + (p − 1)(β + kλτ) < 2(β0 + kλτ0) + (2p − 3)·max(β0 + kλτ0, β + kλτ),

and if (β, τ) ≥ (β0, τ0), then node 1 starts computing at time

  2(β0 + kλτ0) + (2p − 3)(β + kλτ) = 2(β0 + kλτ0) + (2p − 3)·max(β0 + kλτ0, β + kλτ).

It shows that the standard case, with (β, τ) ≤ (β0, τ0), is theoretically better (and it is the more likely one anyway).

Performance modeling. Figure 1.6 is used so as to determine the processing time of one iteration. The case with λ ≥ r is analysed first.

1. From the beginning of the algorithm until the beginning of node 1 computation, the host sends p sub-blocks, whereas node 1 forwards p − 1 of them:
  p(β0 + kλτ0) + (p − 1)(β + kλτ).
2. The computation time of a multilayer perceptron depends linearly on its parameter size. For one input vector, it is α + τ_c·w, if w is the size of the network parameter vector. Therefore node 1 computation lasts
  k(α + τ_c·w).

3. The output sub-block of node 1 is sent through all the nodes to the host. The first case of figure 1.6 shows that p − 1 communications between two nodes are implied, and that it ends with a node/host communication:
  (p − 1)(β + krτ) + (β0 + krτ0).

But the aim of figure 1.6 is to show that if λ ≥ r, then the host only has to wait a constant time between each output receipt, whereas if r ≥ λ, then node p must wait for the host's first receipt, node p − 1 must therefore wait even more, and so on. A phenomenon of accumulated delays appears. For node 1, the obtained delay is:
  (p − 1)·k·τ0·(r − λ).

[Figure 1.5: influence of the host/node communication times (host/node sendings longer than node/node sendings, and conversely).]
[Figure 1.6: processor activity all along the algorithm processing (communications, computations, and accumulated delays in the cases λ > r and r > λ).]

Speedup. Indeed, λ and r play symmetrical roles. The previous paragraph privileged node 1 (time before the beginning of node 1 computation, time of node 1 computation, time for node 1 output to reach the host). Figure 1.6 shows that if node p is privileged and if λ and r are exchanged, then the second case, with r ≥ λ, leads to a formula which is similar

to the simple formula obtained for the first case. Therefore a global speedup formula can be given:

  speedup = pk(α + τ_c·w) / [ k(α + τ_c·w) + p(β0 + k·max(λ, r)·τ0) + (p − 1)(2β + k(λ + r)·τ) + (β0 + k·min(λ, r)·τ0) ]   (1.1)

Improvement. In order to reduce communication times, message sendings and receipts may be processed in parallel, provided that the machine allows simultaneous communications on the different links of each node (the T-Node machine allows such parallel communications). In this case, node i receives block j from node i − 1 while it sends block j − 1 to node i + 1. Figure 1.7 shows the shape of the resulting algorithm. Such an improvement actually suffers from two drawbacks:
- Many parallel computers do not allow parallel communications, or simulate them with irregular performance (see [3] for the iPSC/860). A good model is thus difficult to determine.
- Parallel communications need more management work, and their startup time is longer than for standard communications. A precise communication time model for the T-Node machine is given in [1]. It shows that the startup time depends on the number of possible links on each node. It can be derived from this work that if parallel communications are to be used, then the startup time is increased by β_//. For a T-Node and only two parallel links, β_// = 2.9 µs.

[Figure 1.7: shape of the algorithm behaviour with parallel communications.]
[Figure 1.8: pipeline algorithm, with parallel communications.]

Modeling with parallel communications. If communication sendings and receipts are performed in parallel, it can be considered that only the host/node communication times must be taken into account, since they are the slowest. Moreover, if the algorithm is given a pipeline structure as shown in figure 1.8, then the transfer of node 1 output for iteration i is performed during the transfer of node p input for iteration i + 1. If many iterations are processed, then the speedup is:

  speedup = pk(α + τ_c·w) / [ k(α + τ_c·w) + p(β_// + k·max(λ, r)·τ0) ]   (1.2)
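The two speedup models can be evaluated directly. The following helper is a minimal sketch in which the structure and function names are illustrative; the parameter values are meant to be filled with the experimental data given in the next paragraph.

/* Evaluation of the speedup models (1.1) and (1.2). */
#include <stdio.h>

struct params {
    double beta0, tau0;   /* host/node startup and per-byte transfer time  */
    double beta, tau;     /* node/node startup and per-byte transfer time  */
    double beta_par;      /* extra startup for parallel communications     */
    double alpha, tau_c;  /* per-sample computation cost: alpha + tau_c*w  */
    double lambda, r;     /* input and output sizes (bytes)                */
    double w;             /* number of network parameters                  */
};

static double maxd(double a, double b) { return a > b ? a : b; }
static double mind(double a, double b) { return a < b ? a : b; }

/* formula (1.1): standard algorithm, no parallel communications */
double speedup_std(const struct params *m, double p, double k)
{
    double comp = k * (m->alpha + m->tau_c * m->w);
    double den  = comp
                + p * (m->beta0 + k * maxd(m->lambda, m->r) * m->tau0)
                + (p - 1) * (2 * m->beta + k * (m->lambda + m->r) * m->tau)
                + (m->beta0 + k * mind(m->lambda, m->r) * m->tau0);
    return p * comp / den;
}

/* formula (1.2): pipelined algorithm with parallel communications */
double speedup_par(const struct params *m, double p, double k)
{
    double comp = k * (m->alpha + m->tau_c * m->w);
    double den  = comp + p * (m->beta_par + k * maxd(m->lambda, m->r) * m->tau0);
    return p * comp / den;
}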

[Figure 1.9: speedup in the generalization phase, with and without parallel communications, for 1 to 32 processors (k = 100).]
[Figure 1.10: speedup in the generalization phase, with and without parallel communications, for 1 to 1000 processors (k = 100).]

Numerical performance. The following numerical values correspond to the experiments of [6]:
- β0 = 5.4 µs and τ0 = 1.8 µs,
- β = 3.9 µs and τ = 1.1 µs,
- λ = 108 bytes (27 floats) and r = 12 bytes (3 floats),
- α = 50 µs and τ_c = 3 µs,
- w = 1243 (40 neurons in the single hidden layer).

Figure 1.9 shows satisfactory performance for a limited number of processors. But some limits appear for massively parallel machines, as shown in figure 1.10.

Speedup formula interpretation. In accordance with formula 1.1, a good efficiency is obtained with:

1. few processors (p),
2. large input sub-blocks (k),
3. small input and output sizes (λ, r),
4. many network parameters (w).

Let us try to sharpen this analysis.

Terms gathering. A dense efficiency formula can be derived from formula 1.1:

  e = 1 / [ 1 + (c1 + c2·p)/(kw) + (c3 + c4·p)·L/w ],   (1.3)

where the c_i are machine-dependent constants, and L stands for the sum of the input and output sizes.

Study of the L/w term. For a neural network with only one hidden layer (containing N neurons), it can be asserted that w = O(LN), and the third term of the denominator may therefore be rewritten as (c'3 + c'4·p)/N.

Influence of L. It may now be considered that L and N are independent variables. Then L does not have any influence on the third term (c'3 + c'4·p)/N, and the second term becomes (c'1 + c'2·p)/(kLN). It shows that large input and output sizes are indeed useful to obtain a good efficiency.

Dominating variables. A great efficiency increase is obtained thanks to many hidden neurons and few nodes, since p/N belongs to both the second and the third terms of the formula denominator.

Multilayer nets. If there are two hidden layers or more, then w = w1·N² + w2·LN + w3. It follows that the influence of L becomes negligible.

1.1.2 Learning phase: centralized updating

This paragraph describes the most intuitive version, directly derived from the algorithm above. The term centralized updating is employed because the nodes only perform gradient computation, whereas the host updates the neural network parameters thanks to the communicated gradients. This algorithm does not appear in [4].

Description. The previous algorithm may be used so as to simulate the total gradient algorithm in the learning phase. Instead of sending output sub-blocks, node i computes a sum of error gradients (one error gradient per input sample), which is called G_i W. If n = 1, then the total gradient algorithm is implemented, whereas n > 1 generates a block-gradient algorithm (if p = 1 and n = S, we obtain the stochastic gradient algorithm). In the case of centralized updating, node i sends G_i W to node i + 1, so as to communicate it to the host. Then the host computes the sum GW of all the obtained vectors, and applies any simple or advanced gradient descent algorithm, for instance W(t + 1) = W(t) − ε(t)·GW(t) for the standard gradient descent, where W(t) is the parameter vector of the network at time t. It must be ensured that the host communicates the network modifications to all nodes before each of the n algorithm iterations.

Without parallel communication. New constants have to be introduced for the computation time model: gradient computation lasts α_l + τ_l·w for one sample. Let us notice that the output message size no longer depends on k, but is proportional to w, since all nodes communicate error gradients (for instance, if float values are used in the C program, then the output message size is w*sizeof(float)). Therefore, let l·w be the output message size. If the problem of the computation and communication of the network modifications is not yet taken into account (see the paragraph "Version choice according to the machine" below), the speedup is then:

  speedup = pk(α_l + τ_l·w) / [ k(α_l + τ_l·w) + p(β0 + max(kλ, lw)·τ0) + (p − 1)(2β + (kλ + lw)·τ) + (β0 + min(kλ, lw)·τ0) ]   (1.4)
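As a complement, the dense form 1.3 introduced above can be recovered from formula 1.1 by dividing the numerator and the denominator by k(α + τ_c·w). The grouping below is only a sketch of that step, under the approximation α + τ_c·w ≈ τ_c·w; the machine constants c_i absorb β0, β, τ0, τ and the ratios of max(λ, r), λ + r and min(λ, r) to L = λ + r.

\[
e=\frac{\text{speedup}}{p}
 =\frac{k(\alpha+\tau_c w)}
       {k(\alpha+\tau_c w)+(p+1)\beta_0+2(p-1)\beta
        +k\bigl(p\max(\lambda,r)\tau_0+(p-1)(\lambda+r)\tau+\min(\lambda,r)\tau_0\bigr)}
\]
\[
\approx\frac{1}{\,1+\dfrac{(p+1)\beta_0+2(p-1)\beta}{k\,\tau_c w}
 +\dfrac{p\max(\lambda,r)\tau_0+(p-1)(\lambda+r)\tau+\min(\lambda,r)\tau_0}{\tau_c w}\,}
 =\frac{1}{1+\dfrac{c_1+c_2\,p}{k\,w}+(c_3+c_4\,p)\dfrac{L}{w}}.
\]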

[Figure 1.11: efficiency on the T-Node machine for the learning and validation (generalization) phases, with regard to the number of processors and the number of neurons.]

When studying a corresponding dense form,

  e = 1 / [ 1 + c''1·p/(kw) + c''2·p/w + c''3·p/k ],   (1.5)

a new term is added to the denominator of the efficiency formula. It is proportional to p/k; w does not appear in the denominator of this term, and therefore even a boundless computation load cannot lead to an efficiency equal to 100% (a boundless load may correspond to a boundless number of hidden neurons). This phenomenon can be observed in figure 1.11.

Optimal block size. Small blocks are required for a quick convergence of the block-gradient algorithm (see [2]). But the arguments above show that a good parallelization is achieved with large blocks. A compromise must therefore be found. Performance improvement for small blocks is indeed a priority.

Pipeline. The interest of a pipeline structure for the training set partition algorithm is discussed.
- Pipeline for the generalization phase: formulae 1.1 and 1.2 show that an increase of k always leads to an improvement of the algorithm efficiency. Therefore the best solution is s = S, that is n = 1, which allows no pipelining.
- Pipeline for the learning phase: since updating is performed by the host, the obtained modifications must be communicated to all nodes before each iteration. Figure 1.12 shows the induced behaviour of the algorithm. It is clear that no pipelining is possible.

With parallel communication. Since there is no pipeline processing, and if the computation and communication of the network modifications is not yet considered, the speedup formula is (see figure 1.12):

  speedup = pk(α_l + τ_l·w) / [ k(α_l + τ_l·w) + p(2β_// + (kλ + lw)·τ0) ]   (1.6)
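The bounded-efficiency phenomenon of formula 1.5 can be checked numerically; in the sketch below the constants c''1, c''2, c''3 are arbitrary illustrative values, not fitted machine constants.

/* Limit of the dense efficiency form (1.5) as the network grows:
 * e tends to 1/(1 + c3*p/k), not to 1, because the p/k term does not
 * depend on w. */
#include <stdio.h>

int main(void)
{
    const double c1 = 50.0, c2 = 0.5, c3 = 0.2;   /* illustrative only */
    const double p = 32.0, k = 10.0;
    for (double w = 1e2; w <= 1e8; w *= 100.0) {
        double e = 1.0 / (1.0 + c1 * p / (k * w) + c2 * p / w + c3 * p / k);
        printf("w = %8.0f   e = %.4f\n", w, e);
    }
    printf("limit as w -> infinity: %.4f\n", 1.0 / (1.0 + c3 * p / k));
    return 0;
}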

[Figure 1.12: consecutive iterations for the learning phase, without and with parallel communications (inputs, computations, outputs, computation of the new parameters by the host, communication of the new parameters).]

1.1.3 Learning phase: distributed updating

The parallel learning algorithm of [4] is presented here. It is particularly adapted to a parallel computer which allows parallel communications. It is called distributed updating because the host only communicates inputs. Instead of sending the gradient sum of the sub-block to the host, the nodes communicate directly. At the end of this communication process (or gradient exchange), each node knows all the error gradients, and therefore can update the network.

Description. Input sub-blocks are communicated in the same way as for centralized updating. Then each node performs its own computation. At last, the gradient exchange is realized by means of the following algorithm (a message-passing sketch of this exchange is given below):
- first step: node i communicates G_i W to its direct successor; it therefore receives G_{i−1} W from its predecessor and can compute G_{i−1} W + G_i W;
- j-th step: node i sends Σ_{m=0}^{j−1} G_{i−m} W to node i + 1, receives Σ_{m=0}^{j−1} G_{i−1−m} W, and then computes Σ_{m=0}^{j} G_{i−m} W;
- step p − 1: every node knows GW = Σ_{m=1}^{p} G_m W, from which it can induce the new network parameters.

With parallel communications, all nodes simultaneously end their computation, and therefore they can start this gradient exchange process without any delay. There is no question of a pipeline structure for distributed updating, since the gradient exchange process also ends at the same time on all nodes.

Modeling. Once again, the computation and communication of the network modifications is not considered. Therefore, each step of the gradient exchange lasts 2(β + lw·τ), and the computation of G_{i−1} W + G_i W is not yet taken into account.

Without parallel communications, the gradient exchange must wait for every node to have finished its local computation before it can start:

  speedup = pk(α_l + τ_l·w) / [ k(α_l + τ_l·w) + p(β0 + kλτ0) + (p − 1)(β + kλτ) + 2(p − 1)(β + lw·τ) ]
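The gradient exchange described above is, in modern terms, a ring reduction. The following sketch mirrors the step-by-step description; MPI is again only a stand-in for the original message-passing primitives, the ring is assumed to be a communicator containing the p nodes only, and the buffer organisation is an illustrative assumption. MPI_Sendrecv models the simultaneous send and receive of the parallel-communication case.

/* Gradient exchange of the distributed updating version:
 * after p-1 ring steps every node holds GW, the sum of all local gradients. */
#include <mpi.h>

/* grad   : local gradient G_i W (left unchanged)
 * acc    : partial sum, initialised by the caller as a copy of grad;
 *          after the call it contains GW on every node
 * recvbuf: scratch buffer of the same size w */
void gradient_exchange(const float *grad, float *acc, float *recvbuf,
                       int w, MPI_Comm ring)
{
    int p, i;
    MPI_Comm_size(ring, &p);
    MPI_Comm_rank(ring, &i);
    int succ = (i + 1) % p;
    int pred = (i - 1 + p) % p;

    for (int step = 0; step < p - 1; step++) {
        /* send the current partial sum, receive the predecessor's partial sum */
        MPI_Sendrecv(acc, w, MPI_FLOAT, succ, step,
                     recvbuf, w, MPI_FLOAT, pred, step,
                     ring, MPI_STATUS_IGNORE);
        /* new partial sum = received partial sum + local gradient */
        for (int j = 0; j < w; j++)
            acc[j] = recvbuf[j] + grad[j];
    }
    /* every node can now compute the new network parameters from acc */
}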

With parallel communications, each step of the gradient exchange lasts β_// + β + lw·τ (which still does not take into account the computation of the gradient sum):

  speedup = pk(α_l + τ_l·w) / [ k(α_l + τ_l·w) + p(β_// + kλτ0) + (p − 1)(β_// + β + lw·τ) ]

Version choice according to the machine. In order to determine the best algorithm for a learning phase, the time needed for the computation and communication of the gradient modifications must be introduced. The elapsed time between the end of the first node computation and the start of the next algorithm iteration (next main block) is estimated in this paragraph for both the centralized and the distributed updating versions. The computation time of the sum of two gradient vectors must be taken into account: α_add + τ_add·n for n-dimensional vectors. Moreover, most algorithms for function minimization (gradient descent, conjugate gradient descent, and so on) are performed within a linear time with regard to the gradient size: it will be estimated as α_desc + τ_desc·w.

With parallel communications:

- Centralized updating:
  - all the G_i W are received by the host: p(β0 + β_// + lw·τ0);
  - the host computes the sum of all gradients: (p − 1)(α_add + τ_add·w);
  - the host computes the new network parameters thanks to the obtained global gradient: α_desc + τ_desc·w;
  - the host communicates the obtained modifications. In order to minimize the number of startup times, the modifications of iteration h may be included in the first message of iteration h + 1, since this message contains node p input sub-block and therefore goes through all nodes. That is why this modification communication only adds the gradient transfer time from the host to node p: (τ0 + (p − 1)τ)·lw.
- Distributed updating:
  - all nodes communicate the G_i W to each other: (p − 1)(β_// + β + lw·τ);
  - for each step j, node i adds its own gradient to the received partial sum; there are p − 1 steps: (p − 1)(α_add + τ_add·w);
  - each node computes the new network parameters: α_desc + τ_desc·w.

Without parallel communications:

- Centralized updating:
  - G_1 W is forwarded to the host; possible delays must be taken into account in the same way as for the determination of formula 1.1: (p − 1)(β + lw·τ) + (β0 + lw·τ0) + (p − 1)τ0·(lw − kλ);
  - the host computes the sum of all gradients. After having received an output sub-block from node p, the host must wait for node p to have received the next output sub-block from node p − 1. Therefore, the host can use this waiting time to perform some gradient sum computation. But these computations may not end within the waiting times, so partial computation times may remain: each of them is equal to max(0, α_add + τ_add·w − (β + lw·τ)). The global computation therefore implies an extra time equal to (α_add + τ_add·w) + (p − 2)·max(0, α_add + τ_add·w − (β + lw·τ)). This is generally equal to (α_add + τ_add·w), since the addition time of n real numbers is usually smaller than their transfer time;
  - the host computes the new network parameters: α_desc + τ_desc·w;
  - the host communicates the obtained modifications: (τ0 + (p − 1)τ)·lw.
- Distributed updating:
  - all nodes communicate the G_i W to each other: 2(p − 1)(β + lw·τ);
  - node i adds the received partial sums: (p − 1)(α_add + τ_add·w);
  - each node computes the new network parameters: α_desc + τ_desc·w.

It clearly appears that distributed updating must be chosen if parallel communications are possible. Otherwise, distributed updating is to be preferred only if node-to-node communications are significantly faster than host communications (β ≪ β0), which is true for the T-Node machine, but not for the applications on the iPSC/860 or Volvox machines.

Thanks to this study, more exact speedup formulae may be established. They take into account the time needed for the computation and communication of the network modifications. But they still make an assumption: the computation of these modifications is not performed with a parallel algorithm (the time it needs is equal to α_desc + τ_desc·w in a sequential processing as well as in a parallel processing). The following formulae give the obtained speedups.

Centralized updating without parallel communications:

  speedup = [ pk(α_l + τ_l·w) + α_desc + τ_desc·w ] /
            [ k(α_l + τ_l·w) + p(β0 + max(kλ, lw)·τ0) + (p − 1)(2β + (kλ + lw)·τ) + (β0 + min(kλ, lw)·τ0)
              + α_add + τ_add·w + (p − 2)·max(0, α_add + τ_add·w − (β + lw·τ)) + α_desc + τ_desc·w + (τ0 + (p − 1)τ)·lw ]

Centralized updating with parallel communications:

  speedup = [ pk(α_l + τ_l·w) + α_desc + τ_desc·w ] /
            [ k(α_l + τ_l·w) + p(2β_// + (kλ + lw)·τ0) + (p − 1)(α_add + τ_add·w) + α_desc + τ_desc·w + (τ0 + (p − 1)τ)·lw ]

Distributed updating without parallel communications:

  speedup = [ pk(α_l + τ_l·w) + α_desc + τ_desc·w ] /
            [ k(α_l + τ_l·w) + p(β0 + kλτ0) + (p − 1)(β + kλτ) + 2(p − 1)(β + lw·τ) + (p − 1)(α_add + τ_add·w) + α_desc + τ_desc·w ]

Distributed updating with parallel communications:

  speedup = [ pk(α_l + τ_l·w) + α_desc + τ_desc·w ] /
            [ k(α_l + τ_l·w) + p(β_// + kλτ0) + (p − 1)(β_// + β + lw·τ) + (p − 1)(α_add + τ_add·w) + α_desc + τ_desc·w ]

1.2 Performance study

The dependence of the algorithm efficiency on the machine characteristics is now studied according to experimental data. It follows the qualitative discussion of section 1.1.1. Only MIMD machines with distributed memory and asynchronous communications are considered. The iPSC/860 machine has been used for all experiments, though the performance models have also been applied to the Volvox machine with the Volvix environment.

1.2.1 Machine-dependent performance

Communications on an iPSC/860. Though a linear communication time model β + Lτ is rather good for this parallel computer (one value of (β, τ) is taken if L ≤ 100 bytes, and another one if L > 100 bytes), many special cases appear, even without any parallel communication. A study of these problems can be found in [3]. Experiments have shown that they can be avoided thanks to some programming precautions, which are also described in [3]; without them, an efficiency decrease of more than 20% may be observed during experiments, compared with the performance modelings.

Experimental data. See [3] for a precise study of the iPSC/860 communications.

Communication characteristics:
- iPSC/860:

  β0 = β = 79 µs for communications of less than 100 bytes, 94 µs for more than 100 bytes;
  τ0 = τ = 0.42 µs (less than 100 bytes), 0.4 µs (more than 100 bytes).
- Volvox (Volvix system):
  β0 = β ≈ 5000 µs;
  τ0 = τ = 1.53 µs.

Computation characteristics: both the iPSC/860 and the Volvox use i860 processors: α ≈ 10 µs and τ_c = 1.37 µs.

[Figure 1.13: modeled speedup in the generalization phase on the Volvox and iPSC/860 machines (k = 100).]

Performance analysis. Figure 1.13 shows the modeled speedups for the iPSC/860 machine and the Volvox machine (each message is supposed to contain more than 100 bytes). Formula 1.1 is used as the performance model. The iPSC/860 computer provides rather unsatisfactory performance, and the Volvox machine provides really feeble performance. Indeed, the computation capability of the i860 processor is too high, compared with the communication speed of both machines, for such an application. The T-Node machine is more balanced (quicker communications, slower computations). And yet, the iPSC/860 must be chosen, since the i860 speed makes it process the algorithm five times faster than the T-Node machine (without parallel communications, with 32 nodes, k = 100, 40 hidden neurons). It is all the more important that the algorithm efficiency be increased on such powerful machines, as they already provide the highest speed with an unsatisfactory efficiency.

1.2.2 Influence of block sizes

The low performance on the iPSC/860 is mainly due to its big startup time. Formula 1.1 shows that this phenomenon also generates an enlarged sensitiveness to the block size (in the efficiency formula, the 1/k term is negligible only for large block sizes). This is illustrated by figure 1.14. In a learning phase, it is a crucial problem, since better convergence is obtained thanks to small blocks.

[Figure 1.14: speedup in the generalization phase on the iPSC/860 and on the T-Node, with regard to the number of processors p and the sub-block size k.]
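Before turning to the gathered algorithms, the exact learning-phase models of section 1.1 can be evaluated with the machine constants listed above. In the sketch below, the communication constants are the iPSC/860 values as printed in this section, while the gradient and updating constants (α_l, τ_l, α_add, τ_add, α_desc, τ_desc) and the block size are placeholders only, since their measured values are not reproduced here.

/* Comparison of the exact learning-phase speedup models (centralized vs
 * distributed updating, without parallel communications). */
#include <stdio.h>

static double maxd(double a, double b) { return a > b ? a : b; }
static double mind(double a, double b) { return a < b ? a : b; }

int main(void)
{
    /* iPSC/860 communication constants (messages larger than 100 bytes) */
    double beta0 = 94.0, tau0 = 0.4, beta = 94.0, tau = 0.4;   /* microseconds */
    /* computation model: per-sample gradient cost and vector operations */
    double alpha_l = 20.0, tau_l = 4.0;        /* placeholders (us) */
    double alpha_add = 10.0, tau_add = 0.1;    /* placeholders (us) */
    double alpha_desc = 10.0, tau_desc = 0.2;  /* placeholders (us) */
    double lambda = 108.0, l = 4.0, w = 1243.0, k = 10.0;

    for (double p = 2; p <= 64; p *= 2) {
        double comp = k * (alpha_l + tau_l * w);
        double num  = p * comp + alpha_desc + tau_desc * w;

        /* centralized updating, no parallel communications */
        double cent = comp
            + p * (beta0 + maxd(k * lambda, l * w) * tau0)
            + (p - 1) * (2 * beta + (k * lambda + l * w) * tau)
            + (beta0 + mind(k * lambda, l * w) * tau0)
            + alpha_add + tau_add * w
            + (p - 2) * maxd(0.0, alpha_add + tau_add * w - (beta + l * w * tau))
            + alpha_desc + tau_desc * w + (tau0 + (p - 1) * tau) * l * w;

        /* distributed updating, no parallel communications */
        double dist = comp
            + p * (beta0 + k * lambda * tau0)
            + (p - 1) * (beta + k * lambda * tau)
            + 2 * (p - 1) * (beta + l * w * tau)
            + (p - 1) * (alpha_add + tau_add * w)
            + alpha_desc + tau_desc * w;

        printf("p = %2.0f  centralized: %.2f   distributed: %.2f\n",
               p, num / cent, num / dist);
    }
    return 0;
}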

Chapter 2: Heterogeneous load sharing

In this chapter, no parallel communications will be considered. The described algorithms intend to improve the efficiency of the training set partition algorithm when the machine characteristics do not fit well with the standard form of this algorithm. Indeed, a machine without parallel communications is less adapted to the standard algorithm than a machine with such facilities.

2.1 Message gathering

2.1.1 Principle

Sub-block gathering. In the previous algorithm, node 1 begins its own computations after 2p − 1 communications, and therefore after 2p − 1 startup times. A simple idea is to gather all the sub-blocks into a unique communication. The following algorithm, called the simple gathered algorithm, is obtained:
1. The host sends the whole training block to node 1, that is, one message for p sub-blocks.
2. Node i (i ≠ p) receives p − i + 1 gathered sub-blocks, sends the last p − i ones to its successor, and computes the neural outputs of the first sub-block it has received.
3. Node 1 sends its output sub-block to node 2.
4. Node i (i ≠ p and i ≠ 1) receives the output sub-blocks gathered from number 1 to number i − 1, and is therefore able to send the output sub-blocks from number 1 to number i.
5. Node p sends all the gathered output sub-blocks to the host.

Figure 2.1 shows the obtained algorithm behaviour. This algorithm is the simplest one (basic scattering algorithm on a ring). It has better performance than the previous one when the startup time is much greater than the transfer time. Its obvious drawback is that the number of communicated bytes, before node p can compute, is no longer linear with regard to the number of nodes, but quadratic. In a more concrete way, a significant loss of time appears for the first ring nodes, since node i + 1 starts its computation long after node i. It also appears for the last ones, since node i + 1 waits for node i to have performed a large output communication with node i − 1 after its computation time.

2.1.2 Modeling

Generalization phase. Despite a simple structure, this algorithm is more difficult to model than the previous one. Accumulated delays appear, as shown in figure 2.2. Thanks to figure 2.2, the processing time can be computed as the sum of the following times (a small routine summing these contributions is sketched after the list):

[Figure 2.1: simple gathered algorithm (the host sends all the inputs, each node forwards the remaining input sub-blocks and the gathered output sub-blocks, node p returns all the outputs to the host).]
[Figure 2.2: behaviour of the algorithm with sub-block gathering (communications, computations, inactivity).]

- Input transfer: the host first sends the whole block to node 1, then node i sends p − i input sub-blocks to its direct successor. For all nodes:
  β0 + pkλτ0 + Σ_{i=1}^{p−1} (β + (p − i)kλτ).
- Computation: node p computes its output sub-block:
  k(α + τ_c·w).
- Accumulated delays: in order to determine how long node p must wait before it can receive node p − 1 outputs, it should be noted that a delay appears for node i if and only if the output communication time between node i − 2 and node i − 1 is longer than the input communication time between node i and node i + 1. Therefore it appears only if (i − 2)kr ≥ (p − i)kλ, and all the obtained delays must be added:
  Σ_{i=3}^{p} max(0, ((i − 2)kr − (p − i)kλ)·τ).
- Output transfer: node p receives the outputs of blocks 1 to p − 1 and then sends all the outputs to the host:
  β + (p − 1)krτ + β0 + pkrτ0.
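The following routine simply adds the four contributions listed above; it is a sketch with illustrative parameter names, and it keeps the max(0, ·) form of the delays rather than the simplified bound used in the closed-form speedup given next.

/* Processing time of the simple gathered algorithm (generalization phase). */
double gathered_time(double p, double k,
                     double beta0, double tau0, double beta, double tau,
                     double alpha, double tau_c,
                     double lambda, double r, double w)
{
    /* input transfer: host -> node 1, then each node forwards p-i sub-blocks */
    double t = beta0 + p * k * lambda * tau0;
    for (int i = 1; i <= (int)p - 1; i++)
        t += beta + (p - i) * k * lambda * tau;

    /* computation of node p's own sub-block */
    t += k * (alpha + tau_c * w);

    /* accumulated delays before node p can receive the gathered outputs */
    for (int i = 3; i <= (int)p; i++) {
        double d = ((i - 2) * k * r - (p - i) * k * lambda) * tau;
        if (d > 0.0)
            t += d;
    }

    /* output transfer: node p receives p-1 output sub-blocks, sends p to host */
    t += beta + (p - 1) * k * r * tau + beta0 + p * k * r * tau0;
    return t;
}

/* The corresponding speedup is  p * k * (alpha + tau_c * w) / gathered_time(...). */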

The obtained speedup is therefore:

  speedup = pk(α + τ_c·w) / [ k(α + τ_c·w) + pk(r + λ)τ0 + 2β0 + pβ + (p(p − 1)/2)·kλτ + (p − 1)krτ + Σ_{i=⌈(2r+pλ)/(r+λ)⌉}^{p} ((i − 2)r − (p − i)λ)·kτ ]

Learning phase. In the previous centralized updating, all the gradient vectors are sent to the host, which computes their sum afterwards. Such a method would become rather inefficient with gathered output messages: node 1 sends its computed gradient vector to node 2, which sends its gradient vector and node 1 gradient vector, and so on; node p finally sends a message containing p gradient vectors to the host. From the end of node 1 computation to the end of the iteration, the time is O(p²w)! Another version of centralized updating is therefore introduced. The host still performs the computation of the new network parameters (updating), but it receives the sum of all gradients from node p. When node i receives a gradient from node i − 1, it adds its own gradient vector, and sends the sum to node i + 1. Therefore, node i sends Σ_{j≤i} G_j W. All output messages have the same size. From the end of node 1 computation to the end of the iteration, the time is now O(pw). The obtained speedup is:

  speedup = pk(α_l + τ_l·w) / [ k(α_l + τ_l·w) + (lw + pkλ)τ0 + 2β0 + pβ + (p(p − 1)/2)·kλτ + lw·τ + Σ_{i=⌈p − lw/(kλ)⌉}^{p} (lw − (p − i)kλ)·τ ]

2.2 Heterogeneous load sharing (generalization phase)

2.2.1 Description

[Figure 2.3: algorithm with processors as fully loaded as possible (host/node and node/node emissions and receipts, computations).]

The previous model shows that a simple block gathering should not be chosen, even for a parallel computer with long startup times. The main drawbacks are due to a lack of balance among input and output communications, especially in the case of a generalization phase. This algorithm may be improved by giving more samples to deal with to the nodes that have long inactive times. Figure 2.3 describes the algorithm behaviour that should be obtained. Node i begins its computation after having sent the inputs of all its successors to its direct successor. The time that this communication needs is therefore the time between node i − 1 computation start and node i computation start.

In the same way, node i has to receive all the outputs of nodes 1 to i − 1 from node i − 1 before it is allowed to send its own outputs and those of all its predecessors to node i + 1. Node i + 1 might go on with its own computation during this communication between nodes i and i − 1. Nodes 1 and 2 might simultaneously end their computation, since node 1 has no predecessor output to receive. In the same way, nodes p − 1 and p simultaneously begin their computation, since node p has no successor in the ring. If node i is among the first ring nodes, it has many inputs to transfer, whereas node i − 1 has few outputs to receive; therefore node i − 1 might deal with many more samples than node i. In the same way, if node i is among the last ring nodes, node i + 1 might deal with many more samples than node i.

A new algorithm can be imagined, in which communications are still gathered so as to decrease the number of startup times, but in which each node has to deal with a specific number of samples so as to minimize its inactive time. This heterogeneous load sharing cannot lead to optimal efficiency. But it aims to provide satisfactory performance when the standard algorithm appears very unsatisfactory.

Mutual influence between communications and computations. The training set partition is thus first changed by gathering the messages, and this gathered algorithm is now given a new aspect by performing a heterogeneous load sharing according to the inactive times that are due to unbalanced communications. But adding some samples for a given node will modify most communications. We have to change the loads because of the communication times, and yet any load change modifies these communications. This mutual influence must be given a mathematical expression so as to find the best training set partition according to the principle above. Another problem is that even adding only one sample might provide more extra computation time than is needed. Therefore, a heterogeneous load sharing might better adapt to an experiment with large communication sizes, e.g. many nodes. But it has been shown that a block gathering generates a long waiting time before node p can compute when p is big (this time is quadratic with regard to p). Each argument shows that a global improvement cannot be expected: a precise study of the heterogeneous algorithm must be performed so as to know when it may provide a significant performance increase.

About an active host. A previous discussion in section 1.1.1 has justified the idea of a privileged node, the host. But in many parallel machines, nodes have equivalent resources. Therefore the host may also perform some computation. The following modification may appear desirable: if p nodes are available, the training set partition algorithm (standard or heterogeneous) is implemented with a ring of p − 1 nodes, whereas the last one, which centralizes all data, keeps some samples to deal with between its last input communication and its first output communication. In such a case, it must be noted that the heterogeneous version provides more available computation time for the host than the standard version (see figure 2.4). Of course, if the host performs no computation, then it is not taken into account in the efficiency computation (in this case, only the ring nodes are taken into account so as to determine the number p of processors).

2.2.2 Algebraic model of the heterogeneous load sharing

The previous description is now given a mathematical expression. Let k_i be the size of the sub-block node i has to deal with. Node i computation time lasts k_i(α + τ_c·w).

[Figure 2.4: available computation time for the host, for the initial algorithm and for the gathered algorithm.]

The time to send the input sub-blocks within one message from node i is (0 for i = p):

  ∀i ∈ [1, p − 1]:  β + λτ · Σ_{j=i+1}^{p} k_j.

The time to send the output sub-blocks within one message from node i is:

  ∀i ∈ [1, p − 1]:  β + rτ · Σ_{j=1}^{i} k_j.

Node p sends the s outputs to the host, therefore no k_i value depends on this communication, just as it does not depend on the sending of all the inputs from the host to node 1. That is why β0 and τ0 do not appear in the following mathematical expressions. Let t_begin(i) and t_end(i) be the start and end times of node i computation (in the heterogeneous version):

  ∀i ∈ [2, p − 2]:  t_begin(i + 1) − t_begin(i) = β + λτ · Σ_{j=i+2}^{p} k_j,
                    t_end(i + 1) − t_end(i) = β + rτ · Σ_{j=1}^{i−1} k_j,
  ∀i:               t_end(i) − t_begin(i) = k_i(α + τ_c·w).

Therefore:

  (k_{i+1} − k_i)(α + τ_c·w) = rτ · Σ_{j=1}^{i−1} k_j − λτ · Σ_{j=i+2}^{p} k_j,   for i ∈ [2, p − 2],
  (k_2 − k_1)(α + τ_c·w) = −β − λτ · Σ_{j=3}^{p} k_j,

  (k_p − k_{p−1})(α + τ_c·w) = β + rτ · Σ_{j=1}^{p−2} k_j.   (2.1)

This can be expressed with matrices:

  ( b  −b   c   c  ...   c )   (  k_1   )     (  β )
  ( a   b  −b   c  ...   c )   (  k_2   )     (  0 )
  ( a   a   b  −b  ...   c ) · (  ...   )  =  ( ... )   (2.2)
  ( ...               ... )   ( k_{p−1} )     (  0 )
  ( a   a  ...   a   b  −b )   (  k_p   )     ( −β )

where b = α + τ_c·w, a = rτ and c = −λτ. This is a linear system with p variables and p − 1 equations. If it is assumed that there is a solution (this will be discussed), then the solution space is a one-dimensional affine space. It might be chosen to parametrize it with k_1: (k_2, k_3, ..., k_p) = f(k_1), where f is an affine function from ℝ to ℝ^{p−1}. The following system must be solved so as to find out f:

  ( −b   c   c  ...   c )   ( k_2 )     (  β − b·k_1 )
  (  b  −b   c  ...   c )   ( k_3 )     ( −a·k_1     )
  (  a   b  −b  ...   c ) · ( ... )  =  (   ...      )   (2.3)
  ( ...             ... )   (     )     ( −a·k_1     )
  (  a   a  ...   b  −b )   ( k_p )     ( −β − a·k_1 )

But a better choice, which does not privilege any k_i, is to write (k_1, k_2, ..., k_p) = F(s), where F is a function from ℝ to ℝ^p (and s is still the block size, that is S = ns and s = Σ_{i=1}^{p} k_i). This can be done with the following p-dimensional linear system:

  ( b  −b   c   c  ...   c )   ( k_1 )     (  β )
  ( a   b  −b   c  ...   c )   ( k_2 )     (  0 )
  ( ...               ... ) · ( ... )  =  ( ... )   (2.4)
  ( a   a  ...   a   b  −b )   (     )     ( −β )
  ( 1   1   1   1  ...   1 )   ( k_p )     (  s )

System 2.4 should be chosen, because the training set size is a more significant parameter than the load of only one node.

2.2.3 Solvability

The aim is now to find out under which conditions the system is solvable. If system 2.3 is considered, it should be known when the system matrix is invertible. Another question will be discussed: if the system is solvable, does it give judicious solutions, that is, values of the k_i which correspond to a concrete situation?
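For concreteness, system 2.4 written out for p = 4 (a direct instantiation of the matrices above) reads:

\[
\begin{pmatrix}
 b & -b &  c &  c\\
 a &  b & -b &  c\\
 a &  a &  b & -b\\
 1 &  1 &  1 &  1
\end{pmatrix}
\begin{pmatrix} k_1\\ k_2\\ k_3\\ k_4 \end{pmatrix}
=
\begin{pmatrix} \beta\\ 0\\ -\beta\\ s \end{pmatrix},
\qquad b=\alpha+\tau_c w,\; a=r\tau,\; c=-\lambda\tau .
\]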

Algebraic computation. The system matrix of 2.3 is invertible when the following determinant is different from 0:

  | −b   c   c  ...   c |
  |  b  −b   c  ...   c |
  |  a   b  −b  ...   c |
  | ...             ... |
  |  a   a  ...   b  −b |

Rather than computing some linear combinations of either rows or columns, a polynomial function is used:

  P(X) = | −b−X   c−X   c−X  ...   c−X |
         |  b−X  −b−X   c−X  ...   c−X |
         |  a−X   b−X  −b−X  ...   c−X |
         |  ...                    ... |
         |  a−X   a−X   ...   b−X  −b−X |

Since this determinant is an n-dimensional anti-symmetrical multilinear function, the degree of P(X) is 1 (the same vector X·(1, ..., 1)ᵀ is subtracted from every column, and any term involving it twice vanishes by anti-symmetry). Of course, P(0) is the value to be determined; therefore, two values of P(X) are sufficient. A first computation is obvious: for X = c, all the entries above the diagonal vanish, so that the matrix is lower triangular and

  P(c) = (−b − c)^{p−1}.

Let P(a) be the other value to compute. It is called Δ_{p−1} because of a recursive computation: for X = a, the entries below the sub-diagonal vanish, so that

  Δ_{p−1} = | −b−a   c−a   c−a  ...   c−a |
            |  b−a  −b−a   c−a  ...   c−a |
            |   0    b−a  −b−a  ...   c−a |
            |  ...                    ... |
            |   0     0    ...   b−a  −b−a |

Then, let r_n be the determinant of the following n-dimensional square matrix:

  r_n = | c−a   c−a   c−a  ...   c−a |
        | b−a  −b−a   c−a  ...   c−a |
        |  0    b−a  −b−a  ...   c−a |
        | ...                    ... |
        |  0     0    ...   b−a  −b−a |

If the following notations are used: A = −b − a, B = a − b, C = c − a,

and if Δ_n is developed with regard to its first column, then the following system is obtained:

  Δ_n = A·Δ_{n−1} + B·r_{n−1},
  r_n = C·Δ_{n−1} + B·r_{n−1},

that is,

  ( Δ_n )     ( A  B )   ( Δ_{n−1} )     ( A  B )^{n−2}   ( Δ_2 )
  (     )  =  (      ) · (         )  =  (      )       · (     )
  ( r_n )     ( C  B )   ( r_{n−1} )     ( C  B )         ( r_2 )

where Δ_2 = A² + BC and r_2 = C(A + B).

In order to compute the power of the 2-dimensional matrix, its eigenvalues have to be computed. The characteristic polynomial is

  det( A−Y  B ; C  B−Y ) = Y² − (A + B)Y + B(A − C) = Y² + 2bY − (b + c)(a − b),

whose reduced discriminant is

  disc' = a(b + c) − bc.

Now, c < 0, a > 0 and b > 0. Moreover, it can be assumed that b ≫ |c|, that is to say α + τ_c·w ≫ λτ. It means that a neural output computation takes more time than the transfer of the corresponding data. Such an assumption is rather reasonable, and it ensures that disc' > 0, and therefore that there are two distinct eigenvalues v_1 and v_2 in ℝ. From this it can be derived that:

  ( A  B )        1           ( B         B      )   ( v_1   0  )   ( v_2 − A   −B )
  (      )  =  ───────────  · (                  ) · (          ) · (              )
  ( C  B )     B(v_2 − v_1)   ( v_1 − A   v_2 − A )   (  0   v_2 )   ( A − v_1     B )

Δ_{p−1} can now be computed:

  Δ_{p−1} = [ B·( v_1^{p−2}(v_2 − A) + v_2^{p−2}(A − v_1) )·(A² + BC) + B²·( v_2^{p−2} − v_1^{p−2} )·C(A + B) ] / ( B(v_2 − v_1) )   (2.5)

Since a > 0 and c < 0, and therefore a ≠ c, and since P has degree 1, this leads to:

  P(0) = ( c·P(a) − a·P(c) ) / ( c − a ).

Case of unsolvability. The system matrix is not invertible if and only if c·P(a) = a·P(c), that is:

  a·B·(v_2 − v_1)·(−b − c)^{p−1} = c·B·[ (A² + BC)·( v_1^{p−2}(v_2 − A) + v_2^{p−2}(A − v_1) ) + B·( v_2^{p−2} − v_1^{p−2} )·C(A + B) ]   (2.6)

If the initial notations are used, and if M_1 and M_2 are the left-hand and right-hand sides of equation 2.6 (divided by B), then

  M_1 = a·(−b − c)^{p−1}·(v_2 − v_1),
  M_2 = c·[ (b² + 3ab + ac − bc)·( v_1^{p−2}(v_2 + a + b) − v_2^{p−2}(v_1 + a + b) ) − 2b(a − b)(c − a)·( v_2^{p−2} − v_1^{p−2} ) ],

and the system is not solvable if and only if M_1 = M_2.
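In practice, whenever the matrix is invertible, the loads k_i are simply obtained numerically. The following sketch builds system 2.4 for a small ring and solves it by Gaussian elimination with partial pivoting; all parameter values are illustrative placeholders (loosely inspired by the figures of chapter 1), and the resulting real k_i still have to be rounded to integers.

/* Numerical solution of system (2.4). */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define P 8   /* number of ring nodes (illustrative) */

int main(void)
{
    double alpha = 50.0, tau_c = 3.0, w = 1243.0;   /* computation model       */
    double beta = 94.0, tau = 0.4;                  /* node/node communication */
    double lambda = 108.0, r = 12.0, s = 800.0;     /* sizes and block size    */

    double b = alpha + tau_c * w, a = r * tau, c = -lambda * tau;
    double M[P][P + 1];   /* augmented matrix [ A | rhs ] */

    /* rows 0 .. P-2: the balance equations (2.1); last row: sum of loads = s */
    for (int i = 0; i < P - 1; i++) {
        for (int j = 0; j < P; j++)
            M[i][j] = (j < i) ? a : (j == i) ? b : (j == i + 1) ? -b : c;
        M[i][P] = (i == 0) ? beta : (i == P - 2) ? -beta : 0.0;
    }
    for (int j = 0; j < P; j++) M[P - 1][j] = 1.0;
    M[P - 1][P] = s;

    /* Gaussian elimination with partial pivoting */
    for (int col = 0; col < P; col++) {
        int piv = col;
        for (int i = col + 1; i < P; i++)
            if (fabs(M[i][col]) > fabs(M[piv][col])) piv = i;
        for (int j = col; j <= P; j++) {
            double t = M[col][j]; M[col][j] = M[piv][j]; M[piv][j] = t;
        }
        for (int i = col + 1; i < P; i++) {
            double f = M[i][col] / M[col][col];
            for (int j = col; j <= P; j++) M[i][j] -= f * M[col][j];
        }
    }
    /* back substitution */
    double k[P];
    for (int i = P - 1; i >= 0; i--) {
        k[i] = M[i][P];
        for (int j = i + 1; j < P; j++) k[i] -= M[i][j] * k[j];
        k[i] /= M[i][i];
    }
    for (int i = 0; i < P; i++)
        printf("k_%d = %.2f\n", i + 1, k[i]);
    return 0;
}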

A realistic hypothesis is that b ≫ a and b ≫ |c|. Units can be chosen so that a and c remain O(1) while b grows, which implies

  disc' = (a − c)b + O(1),

and therefore

  v_1 = −b − √(a − c)·b^{1/2} + O(1),     v_2 = −b + √(a − c)·b^{1/2} + O(1).

For the first term:

  M_1 = 2a·(−1)^{p−1}·√(a − c)·b^{p−1/2} + O(b^{p−3/2}).   (2.7)

For M_2, since

  v_1^{p−2}(v_2 + a + b) − v_2^{p−2}(v_1 + a + b)
    = ( −b − √(a − c)·b^{1/2} + O(1) )^{p−2}·( √(a − c)·b^{1/2} + O(1) ) − ( −b + √(a − c)·b^{1/2} + O(1) )^{p−2}·( −√(a − c)·b^{1/2} + O(1) )
    = 2·(−1)^{p−2}·√(a − c)·b^{p−3/2} + O(b^{p−2}),

and

  v_2^{p−2} − v_1^{p−2} = O(b^{p−5/2}),

the following estimate is obtained:

  M_2 = 2c·(−1)^{p−2}·√(a − c)·b^{p+1/2} + O(b^{p}).   (2.8)

Equations 2.7 and 2.8 show that M_1 and M_2 then have different orders of magnitude, so that the system accepts solutions when the neural output computation time is longer than the input and output transfer times.

Concrete significance.

Real solutions vs needed integers. Solutions are taken in ℝ, whereas integers must be used in practice (the k_i are numbers of samples). Therefore, after having solved the system, approximate solutions will be taken in the set of integers. In this case, we should not worry about the theoretical solvability of the system, since GL_n(ℝ) is dense in M_n(ℝ): exact solvability is an improper worry with regard to approximate values. The aim was in fact to show that the theoretical necessary condition has a concrete significance (computation time versus byte transfer time), which allows one to say that a good load sharing will be easy to obtain in most experimental conditions.

Positive solutions. The last argument has shown that non-integer solutions are not a concern. But negative solutions (in ℝ, and then in ℕ by approximation) would have no meaning, since an optimal schedule cannot be achieved with the help of negative computation times. It is now intended to determine under which conditions the system provides positive solutions. But when exact theoretical solutions are computed with small systems (p = 3 or 4), it can be seen that such conditions are very difficult to determine and to express. The idea is then to first assume that the previous condition (b ≫ a and b ≫ |c|; in our application, b > 100a and b > 100|c|) is satisfied, which ensures that the system is solvable. A special system is then studied: the system in which a = 0 and c = 0. It leads to:

  k_1 = k_p = β/b + (bs − 2β)/(pb),
  k_i = (bs − 2β)/(pb)   for all i ∈ [2, p − 1].

All solutions are positive as soon as the computation time is longer than the startup time (another reasonable assumption). Since inversion and multiplication are continuous functions in the space of matrices, it can be asserted that there exist a_max > 0 and c_max > 0 such that |a| < a_max and |c| < c_max imply k_i ≥ 0 for all i ∈ [1, p]. Since a_max and c_max obviously depend on b, the main assumption leads to both system solvability and solution positivity (keeping in mind that it is only a qualitative assumption, since there is no theoretical equivalence between |a| < a_max and b ≫ a).

2.2.4 Performance modeling

After having solved the system, and therefore after having found the k_i values, the theoretical performance is computed. According to figure 2.3:

- sequential computation time: (α + τ_c·w)·Σ_{i=1}^{p} k_i,
- host sends the inputs to node 1: β0 + λτ0·Σ_{i=1}^{p} k_i,
- node 1 sends the inputs to node 2: β + λτ·Σ_{i=2}^{p} k_i,
- node 1 computation: k_1(α + τ_c·w),
- output communications: Σ_{i=2}^{p} ( β + rτ·Σ_{j=1}^{i−1} k_j ),
- host receives the outputs: β0 + rτ0·Σ_{i=1}^{p} k_i.

The speedup formula is therefore:

  speedup = (α + τ_c·w)·Σ_{i=1}^{p} k_i / [ k_1(α + τ_c·w) + (λ + r)τ0·Σ_{i=1}^{p} k_i + 2β0 + β + λτ·Σ_{i=2}^{p} k_i + Σ_{i=2}^{p} ( β + rτ·Σ_{j=1}^{i−1} k_j ) ]   (2.9)

2.2.5 Exact and approximate solving

Each performance model is applied to experimental data, so as to analyse the possible improvement that is obtained thanks to gathered communications and heterogeneous load sharing. (The computation of the determinant of system 2.4 is difficult enough to show that a theoretical direct formula for each k_i cannot be given. That is why a Maple program is given the experimental data, so as to find the corresponding k_i by solving the linear system.)
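Given a load vector, the model 2.9 is straightforward to evaluate; the following helper is a sketch with illustrative parameter names, summing the terms of the critical path listed above. It can be fed either the exact real solutions of system 2.4 or their integer approximations.

/* Evaluation of the heterogeneous speedup model (2.9) for a given load
 * vector k[0..p-1] (k[i] is the load of node i+1). */
double heterogeneous_speedup(const double *k, int p,
                             double beta0, double tau0, double beta, double tau,
                             double alpha, double tau_c,
                             double lambda, double r, double w)
{
    double total = 0.0;
    for (int i = 0; i < p; i++) total += k[i];

    double comp_unit = alpha + tau_c * w;
    double sequential = total * comp_unit;

    /* parallel time along the critical path of figure 2.3 */
    double parallel = beta0 + lambda * tau0 * total         /* host -> node 1     */
                    + beta + lambda * tau * (total - k[0])  /* node 1 -> node 2   */
                    + k[0] * comp_unit;                     /* node 1 computation */
    double prefix = 0.0;                                    /* k_1 + ... + k_{i-1} */
    for (int i = 2; i <= p; i++) {
        prefix += k[i - 2];
        parallel += beta + r * tau * prefix;                /* outputs i-1 -> i   */
    }
    parallel += beta0 + r * tau0 * total;                   /* node p -> host     */
    return sequential / parallel;
}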

Studied algorithms. Eight versions of the training set partition principle are considered:
1. SA stands for the standard algorithm.
2. SA2 stands for the standard algorithm where the host computes some outputs between its last input communication and its first output communication.
3. GA stands for the simple gathered algorithm (without computation by the host).
4. GA2 stands for the gathered algorithm with computation by the host.
5. HA stands for the heterogeneous gathered algorithm where the host performs no computation; moreover, the exact real solutions of system 2.4 are taken.
6. aHA stands for the heterogeneous gathered algorithm where the host performs no computation, but with approximate solutions (integers) of system 2.4, since each k_i is a number of samples. The performance model 2.9 cannot be used any more: indeed, delays may appear once again. But modeling these delays is more difficult than for the simple gathered algorithm. If a delay appears, for instance on account of an over-approximation, it might afterwards be balanced by a negative delay. Here is an example: if round(k_{i−1}) > k_{i−1}, then node i must wait for node i − 1 between the end of node i computation and the beginning of node i − 1 sending its outputs to node i; a delay appears, compared with the theoretical behaviour of figure 2.3. If afterwards round(k_{j−1}) < k_{j−1} (with j > i), then node j − 1 should wait for node j before they can communicate outputs; but this waiting balances the previously created delay, so that node j undergoes a shorter delay. Caution: this delay balancing cannot produce a negative global delay, since a delay cannot be balanced beforehand. A recurrence is therefore necessary to estimate the final delay (that of node p):

  delay(2) = max(0, (k_1 − k_2)(α + τ_c·w) − β − λτ·Σ_{j=3}^{p} k_j),
  delay(i + 1) = max(0, delay(i) + (k_i − k_{i+1})(α + τ_c·w) + rτ·Σ_{j=1}^{i−1} k_j − λτ·Σ_{j=i+2}^{p} k_j),
  delay(p) = max(0, delay(p − 1) + (k_{p−1} − k_p)(α + τ_c·w) + β + rτ·Σ_{j=1}^{p−2} k_j).

7. HA2 is HA with computation performed by the host.
8. aHA2 is HA2 with approximate solutions of system 2.4 and possible delays.

Notes about computation. Thanks to system 2.4, all solutions are given so that Σ_{i=1}^{p} k_i is equal to a given total number of samples. Numerical resolutions have been achieved thanks to a Maple program, which provides all the k_i values and infers the theoretical parallel efficiencies of each studied algorithm with the help of the previously given formulae. Numerous experiments have been performed with 4 to 64 processors on an iPSC/860. They have corroborated the theoretical models with an average error near 2% (see [3]). Even when the host deals with a number k_h of samples (k_h depends on the inactive time of the host, that is on all the k_i), the aim is to provide a constant total number of samples. The host is then said to be active. For HA2, Σ_{i=1}^{p} k_i = s − k_h, therefore the desired k_i are found after the following steps:


More information

Novel determination of dierential-equation solutions: universal approximation method

Novel determination of dierential-equation solutions: universal approximation method Journal of Computational and Applied Mathematics 146 (2002) 443 457 www.elsevier.com/locate/cam Novel determination of dierential-equation solutions: universal approximation method Thananchai Leephakpreeda

More information

CSC242: Intro to AI. Lecture 21

CSC242: Intro to AI. Lecture 21 CSC242: Intro to AI Lecture 21 Administrivia Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM Posters Apr 24 and 26 You need an idea! You need to present it nicely on 2-wide by 4-high landscape pages

More information

Neural networks. Chapter 20, Section 5 1

Neural networks. Chapter 20, Section 5 1 Neural networks Chapter 20, Section 5 Chapter 20, Section 5 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 20, Section 5 2 Brains 0 neurons of

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA 1/ 21

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA   1/ 21 Neural Networks Chapter 8, Section 7 TB Artificial Intelligence Slides from AIMA http://aima.cs.berkeley.edu / 2 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural

More information

Introduction to Neural Networks

Introduction to Neural Networks Introduction to Neural Networks What are (Artificial) Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning

More information

Performance and Scalability. Lars Karlsson

Performance and Scalability. Lars Karlsson Performance and Scalability Lars Karlsson Outline Complexity analysis Runtime, speedup, efficiency Amdahl s Law and scalability Cost and overhead Cost optimality Iso-efficiency function Case study: matrix

More information

Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions

Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions BACK-PROPAGATION NETWORKS Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks Cannot approximate (learn) non-linear functions Difficult (if not impossible) to design

More information

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: Ryan Lowe (ryan.lowe@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted,

More information

Artificial Neural Network : Training

Artificial Neural Network : Training Artificial Neural Networ : Training Debasis Samanta IIT Kharagpur debasis.samanta.iitgp@gmail.com 06.04.2018 Debasis Samanta (IIT Kharagpur) Soft Computing Applications 06.04.2018 1 / 49 Learning of neural

More information

Laboratoire Bordelais de Recherche en Informatique. Universite Bordeaux I, 351, cours de la Liberation,

Laboratoire Bordelais de Recherche en Informatique. Universite Bordeaux I, 351, cours de la Liberation, Laboratoire Bordelais de Recherche en Informatique Universite Bordeaux I, 351, cours de la Liberation, 33405 Talence Cedex, France Research Report RR-1145-96 On Dilation of Interval Routing by Cyril Gavoille

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Learning from Examples

Learning from Examples Learning from Examples Data fitting Decision trees Cross validation Computational learning theory Linear classifiers Neural networks Nonparametric methods: nearest neighbor Support vector machines Ensemble

More information

Data redistribution algorithms for heterogeneous processor rings

Data redistribution algorithms for heterogeneous processor rings Data redistribution algorithms for heterogeneous processor rings Hélène Renard, Yves Robert, Frédéric Vivien To cite this version: Hélène Renard, Yves Robert, Frédéric Vivien. Data redistribution algorithms

More information

LIP. Laboratoire de l Informatique du Parallélisme. Ecole Normale Supérieure de Lyon

LIP. Laboratoire de l Informatique du Parallélisme. Ecole Normale Supérieure de Lyon LIP Laboratoire de l Informatique du Parallélisme Ecole Normale Supérieure de Lyon Institut IMAG Unité de recherche associée au CNRS n 1398 Inversion of 2D cellular automata: some complexity results runo

More information

Radial-Basis Function Networks

Radial-Basis Function Networks Radial-Basis Function etworks A function is radial () if its output depends on (is a nonincreasing function of) the distance of the input from a given stored vector. s represent local receptors, as illustrated

More information

4. Multilayer Perceptrons

4. Multilayer Perceptrons 4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output

More information

The impact of heterogeneity on master-slave on-line scheduling

The impact of heterogeneity on master-slave on-line scheduling Laboratoire de l Informatique du Parallélisme École Normale Supérieure de Lyon Unité Mixte de Recherche CNRS-INRIA-ENS LYON-UCBL n o 5668 The impact of heterogeneity on master-slave on-line scheduling

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

Multilayer Perceptron

Multilayer Perceptron Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Single Perceptron 3 Boolean Function Learning 4

More information

Artificial Neural Networks Examination, June 2004

Artificial Neural Networks Examination, June 2004 Artificial Neural Networks Examination, June 2004 Instructions There are SIXTY questions (worth up to 60 marks). The exam mark (maximum 60) will be added to the mark obtained in the laborations (maximum

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Last update: October 26, Neural networks. CMSC 421: Section Dana Nau

Last update: October 26, Neural networks. CMSC 421: Section Dana Nau Last update: October 26, 207 Neural networks CMSC 42: Section 8.7 Dana Nau Outline Applications of neural networks Brains Neural network units Perceptrons Multilayer perceptrons 2 Example Applications

More information

Essentials of Intermediate Algebra

Essentials of Intermediate Algebra Essentials of Intermediate Algebra BY Tom K. Kim, Ph.D. Peninsula College, WA Randy Anderson, M.S. Peninsula College, WA 9/24/2012 Contents 1 Review 1 2 Rules of Exponents 2 2.1 Multiplying Two Exponentials

More information

Fundamentals of Neural Network

Fundamentals of Neural Network Chapter 3 Fundamentals of Neural Network One of the main challenge in actual times researching, is the construction of AI (Articial Intelligence) systems. These systems could be understood as any physical

More information

Basic building blocks for a triple-double intermediate format

Basic building blocks for a triple-double intermediate format Laboratoire de l Informatique du Parallélisme École Normale Supérieure de Lyon Unité Mixte de Recherche CNRS-INRIA-ENS LYON-UCBL n o 5668 Basic building blocks for a triple-double intermediate format Christoph

More information

Classification with Perceptrons. Reading:

Classification with Perceptrons. Reading: Classification with Perceptrons Reading: Chapters 1-3 of Michael Nielsen's online book on neural networks covers the basics of perceptrons and multilayer neural networks We will cover material in Chapters

More information

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer. University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

More information

SPI. Laboratoire de l Informatique du Parallélisme

SPI. Laboratoire de l Informatique du Parallélisme Laboratoire de l Informatique du Parallélisme Ecole Normale Superieure de Lyon Unite de recherche associee au CNRS n o 198 SPI More on Scheduling Block-Cyclic Array Redistribution 1 Frederic Desprez,Stephane

More information

Neural Networks Lecture 4: Radial Bases Function Networks

Neural Networks Lecture 4: Radial Bases Function Networks Neural Networks Lecture 4: Radial Bases Function Networks H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011. A. Talebi, Farzaneh Abdollahi

More information

Ecient Higher-order Neural Networks. for Classication and Function Approximation. Joydeep Ghosh and Yoan Shin. The University of Texas at Austin

Ecient Higher-order Neural Networks. for Classication and Function Approximation. Joydeep Ghosh and Yoan Shin. The University of Texas at Austin Ecient Higher-order Neural Networks for Classication and Function Approximation Joydeep Ghosh and Yoan Shin Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX

More information

In: Advances in Intelligent Data Analysis (AIDA), International Computer Science Conventions. Rochester New York, 1999

In: Advances in Intelligent Data Analysis (AIDA), International Computer Science Conventions. Rochester New York, 1999 In: Advances in Intelligent Data Analysis (AIDA), Computational Intelligence Methods and Applications (CIMA), International Computer Science Conventions Rochester New York, 999 Feature Selection Based

More information

A Parallel Implementation of the. Yuan-Jye Jason Wu y. September 2, Abstract. The GTH algorithm is a very accurate direct method for nding

A Parallel Implementation of the. Yuan-Jye Jason Wu y. September 2, Abstract. The GTH algorithm is a very accurate direct method for nding A Parallel Implementation of the Block-GTH algorithm Yuan-Jye Jason Wu y September 2, 1994 Abstract The GTH algorithm is a very accurate direct method for nding the stationary distribution of a nite-state,

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required.

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In humans, association is known to be a prominent feature of memory.

More information

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES ITERATIVE METHODS BASED ON KRYLOV SUBSPACES LONG CHEN We shall present iterative methods for solving linear algebraic equation Au = b based on Krylov subspaces We derive conjugate gradient (CG) method

More information

Logistic Regression & Neural Networks

Logistic Regression & Neural Networks Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein Logistic Regression Perceptron & Probabilities What if we want a probability

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

Machine Learning: Multi Layer Perceptrons

Machine Learning: Multi Layer Perceptrons Machine Learning: Multi Layer Perceptrons Prof. Dr. Martin Riedmiller Albert-Ludwigs-University Freiburg AG Maschinelles Lernen Machine Learning: Multi Layer Perceptrons p.1/61 Outline multi layer perceptrons

More information

MACHINE LEARNING AND PATTERN RECOGNITION Fall 2005, Lecture 4 Gradient-Based Learning III: Architectures Yann LeCun

MACHINE LEARNING AND PATTERN RECOGNITION Fall 2005, Lecture 4 Gradient-Based Learning III: Architectures Yann LeCun Y. LeCun: Machine Learning and Pattern Recognition p. 1/3 MACHINE LEARNING AND PATTERN RECOGNITION Fall 2005, Lecture 4 Gradient-Based Learning III: Architectures Yann LeCun The Courant Institute, New

More information

Integer weight training by differential evolution algorithms

Integer weight training by differential evolution algorithms Integer weight training by differential evolution algorithms V.P. Plagianakos, D.G. Sotiropoulos, and M.N. Vrahatis University of Patras, Department of Mathematics, GR-265 00, Patras, Greece. e-mail: vpp

More information

Unit 8: Introduction to neural networks. Perceptrons

Unit 8: Introduction to neural networks. Perceptrons Unit 8: Introduction to neural networks. Perceptrons D. Balbontín Noval F. J. Martín Mateos J. L. Ruiz Reina A. Riscos Núñez Departamento de Ciencias de la Computación e Inteligencia Artificial Universidad

More information

AI Programming CS F-20 Neural Networks

AI Programming CS F-20 Neural Networks AI Programming CS662-2008F-20 Neural Networks David Galles Department of Computer Science University of San Francisco 20-0: Symbolic AI Most of this class has been focused on Symbolic AI Focus or symbols

More information

Image Reconstruction And Poisson s equation

Image Reconstruction And Poisson s equation Chapter 1, p. 1/58 Image Reconstruction And Poisson s equation School of Engineering Sciences Parallel s for Large-Scale Problems I Chapter 1, p. 2/58 Outline 1 2 3 4 Chapter 1, p. 3/58 Question What have

More information

ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

More information

Training Multi-Layer Neural Networks. - the Back-Propagation Method. (c) Marcin Sydow

Training Multi-Layer Neural Networks. - the Back-Propagation Method. (c) Marcin Sydow Plan training single neuron with continuous activation function training 1-layer of continuous neurons training multi-layer network - back-propagation method single neuron with continuous activation function

More information

`First Come, First Served' can be unstable! Thomas I. Seidman. Department of Mathematics and Statistics. University of Maryland Baltimore County

`First Come, First Served' can be unstable! Thomas I. Seidman. Department of Mathematics and Statistics. University of Maryland Baltimore County revision2: 9/4/'93 `First Come, First Served' can be unstable! Thomas I. Seidman Department of Mathematics and Statistics University of Maryland Baltimore County Baltimore, MD 21228, USA e-mail: hseidman@math.umbc.edui

More information

Neural Nets Supervised learning

Neural Nets Supervised learning 6.034 Artificial Intelligence Big idea: Learning as acquiring a function on feature vectors Background Nearest Neighbors Identification Trees Neural Nets Neural Nets Supervised learning y s(z) w w 0 w

More information

Carnegie Mellon University Forbes Ave. Pittsburgh, PA 15213, USA. fmunos, leemon, V (x)ln + max. cost functional [3].

Carnegie Mellon University Forbes Ave. Pittsburgh, PA 15213, USA. fmunos, leemon, V (x)ln + max. cost functional [3]. Gradient Descent Approaches to Neural-Net-Based Solutions of the Hamilton-Jacobi-Bellman Equation Remi Munos, Leemon C. Baird and Andrew W. Moore Robotics Institute and Computer Science Department, Carnegie

More information

Artificial Neural Networks (ANN)

Artificial Neural Networks (ANN) Artificial Neural Networks (ANN) Edmondo Trentin April 17, 2013 ANN: Definition The definition of ANN is given in 3.1 points. Indeed, an ANN is a machine that is completely specified once we define its:

More information

Kernel Methods. Barnabás Póczos

Kernel Methods. Barnabás Póczos Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels

More information

Bounding the End-to-End Response Times of Tasks in a Distributed. Real-Time System Using the Direct Synchronization Protocol.

Bounding the End-to-End Response Times of Tasks in a Distributed. Real-Time System Using the Direct Synchronization Protocol. Bounding the End-to-End Response imes of asks in a Distributed Real-ime System Using the Direct Synchronization Protocol Jun Sun Jane Liu Abstract In a distributed real-time system, a task may consist

More information

At the start of the term, we saw the following formula for computing the sum of the first n integers:

At the start of the term, we saw the following formula for computing the sum of the first n integers: Chapter 11 Induction This chapter covers mathematical induction. 11.1 Introduction to induction At the start of the term, we saw the following formula for computing the sum of the first n integers: Claim

More information

Learning and Neural Networks

Learning and Neural Networks Artificial Intelligence Learning and Neural Networks Readings: Chapter 19 & 20.5 of Russell & Norvig Example: A Feed-forward Network w 13 I 1 H 3 w 35 w 14 O 5 I 2 w 23 w 24 H 4 w 45 a 5 = g 5 (W 3,5 a

More information

Collected trivialities on algebra derivations

Collected trivialities on algebra derivations Collected trivialities on algebra derivations Darij Grinberg December 4, 2017 Contents 1. Derivations in general 1 1.1. Definitions and conventions....................... 1 1.2. Basic properties..............................

More information

How to do backpropagation in a brain

How to do backpropagation in a brain How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

More information

Long-Short Term Memory and Other Gated RNNs

Long-Short Term Memory and Other Gated RNNs Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling

More information

Scheduling divisible loads with return messages on heterogeneous master-worker platforms

Scheduling divisible loads with return messages on heterogeneous master-worker platforms Scheduling divisible loads with return messages on heterogeneous master-worker platforms Olivier Beaumont 1, Loris Marchal 2, and Yves Robert 2 1 LaBRI, UMR CNRS 5800, Bordeaux, France Olivier.Beaumont@labri.fr

More information

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

More information

Neural Networks Lecturer: J. Matas Authors: J. Matas, B. Flach, O. Drbohlav

Neural Networks Lecturer: J. Matas Authors: J. Matas, B. Flach, O. Drbohlav Neural Networks 30.11.2015 Lecturer: J. Matas Authors: J. Matas, B. Flach, O. Drbohlav 1 Talk Outline Perceptron Combining neurons to a network Neural network, processing input to an output Learning Cost

More information

Lab 5: 16 th April Exercises on Neural Networks

Lab 5: 16 th April Exercises on Neural Networks Lab 5: 16 th April 01 Exercises on Neural Networks 1. What are the values of weights w 0, w 1, and w for the perceptron whose decision surface is illustrated in the figure? Assume the surface crosses the

More information

Linearly-solvable Markov decision problems

Linearly-solvable Markov decision problems Advances in Neural Information Processing Systems 2 Linearly-solvable Markov decision problems Emanuel Todorov Department of Cognitive Science University of California San Diego todorov@cogsci.ucsd.edu

More information

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks C.M. Bishop s PRML: Chapter 5; Neural Networks Introduction The aim is, as before, to find useful decompositions of the target variable; t(x) = y(x, w) + ɛ(x) (3.7) t(x n ) and x n are the observations,

More information

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Neural Networks Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Brains as Computational Devices Brains advantages with respect to digital computers: Massively parallel Fault-tolerant Reliable

More information

The WENO Method for Non-Equidistant Meshes

The WENO Method for Non-Equidistant Meshes The WENO Method for Non-Equidistant Meshes Philip Rupp September 11, 01, Berlin Contents 1 Introduction 1.1 Settings and Conditions...................... The WENO Schemes 4.1 The Interpolation Problem.....................

More information

Introduction to Machine Learning Spring 2018 Note Neural Networks

Introduction to Machine Learning Spring 2018 Note Neural Networks CS 189 Introduction to Machine Learning Spring 2018 Note 14 1 Neural Networks Neural networks are a class of compositional function approximators. They come in a variety of shapes and sizes. In this class,

More information

Intrinsic products and factorizations of matrices

Intrinsic products and factorizations of matrices Available online at www.sciencedirect.com Linear Algebra and its Applications 428 (2008) 5 3 www.elsevier.com/locate/laa Intrinsic products and factorizations of matrices Miroslav Fiedler Academy of Sciences

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................

More information

Functional Preprocessing for Multilayer Perceptrons

Functional Preprocessing for Multilayer Perceptrons Functional Preprocessing for Multilayer Perceptrons Fabrice Rossi and Brieuc Conan-Guez Projet AxIS, INRIA, Domaine de Voluceau, Rocquencourt, B.P. 105 78153 Le Chesnay Cedex, France CEREMADE, UMR CNRS

More information

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Learning Neural Networks Classifier Short Presentation INPUT: classification data, i.e. it contains an classification (class) attribute.

More information