Laboratoire de l'Informatique du Parallélisme

Laboratoire de l'Informatique du Parallélisme
École Normale Supérieure de Lyon
Unité de recherche associée au CNRS n° 1398

Neural Network Parallelization on a Ring of Processors: Training Set Partition and Load Sharing

Bernard Girau

September 1994

Research Report

École Normale Supérieure de Lyon, 46 Allée d'Italie, 69364 Lyon Cedex 07, France
Téléphone : (+33) / Télécopieur : (+33) / Adresse électronique : lip@lip.ens-lyon.fr

Neural Network Parallelization on a Ring of Processors: Training Set Partition and Load Sharing

Bernard Girau

September 1994

Abstract

A parallel back-propagation algorithm that partitions the training set was introduced in [6]. Its performance on MIMD machines is studied, and a new version is developed, based on a heterogeneous load sharing method. Algebraic models allow precise comparisons and show great improvements when learning steps are processed.

Keywords: neural networks, parallel algorithms, load sharing, MIMD machines, performance models

Résumé

A parallel back-propagation algorithm based on a partition of the training set was introduced in [6]. This report studies its performance on several MIMD machines and presents a new version of it, based on a heterogeneous sharing of the computational load. Precise performance comparisons between the resulting methods are made possible by models involving linear algebraic systems; they highlight significant improvements for the learning phases.

Mots-clés: neural networks, parallel algorithms, load sharing, MIMD machines, performance modeling

Contents

Introduction
1 Initial algorithm
  1.1 Description and modeling
    1.1.1 Generalization phase
    1.1.2 Learning phase: centralized updating
    1.1.3 Learning phase: distributed updating
  1.2 Performance study
    1.2.1 Machine-dependent performance
    1.2.2 Influence of block sizes
2 Heterogeneous load sharing
  2.1 Message gathering
    2.1.1 Principle
    2.1.2 Modeling
  2.2 Heterogeneous load sharing (generalization phase)
    2.2.1 Description
    2.2.2 Algebraic model of the heterogeneous load sharing
    2.2.3 Solvability
    2.2.4 Performance modeling
    2.2.5 Exact and approximate solving
  2.3 Learning phase
    2.3.1 Studied versions
    2.3.2 New linear system
    2.3.3 Numerical results
Conclusion
Bibliography
Index

Introduction

Many studies have been performed to obtain efficient algorithms for neural network parallelization. Nevertheless, their efficiency mainly depends on the network characteristics. Moreover, most of them apply to specific neural architectures, or to specific parallel computers. Therefore, none of them can claim to be better than the others. Standard algorithms for neural network parallelization may be partitioned into two subsets. In an inner algorithm, any computation of the neural network is performed by all processors, whereas an outer algorithm performs each computation of the neural network on only one processor and achieves the parallelization by partitioning the set of all needed neural computations among the processors. Inner algorithms highly depend on the network structure.

This report deals with the training set partition algorithm. This outer algorithm applies to any neural problem using supervised learning with training sets. Therefore, this report follows the works of [4], [6] and [5]. In order to obtain better performance when the standard training set partition does not provide satisfactory efficiency, the communication steps are modified, and a heterogeneous load sharing is introduced so as to reduce the partial processor inactivity that appears because of the communication modifications. Precise characteristics of the appropriate load sharing are obtained by means of a linear algebraic model. Performance models show that the obtained heterogeneous version leads to a great efficiency increase when the training set partition is applied to the gradient descent learning algorithm.

Since many notations and terms are used without always recalling their meaning, an index is provided at the end of this report. It mentions where each notation or term is first employed or described.

Chapter 1: Initial algorithm

1.1 Description and modeling

In [4], [6] and [5], H. Paugam-Moisy partitions the training set so as to parallelize any neural network application. Each processor owns a local copy of the whole neural network. This algorithm has been tested on a one-directional ring of transputers, with up to 32 T800 processors. A neural learning task with a multilayer network and a training set of 2000 to 4000 samples has been parallelized. This network has 27 inputs, 3 outputs and only one hidden layer with 10 to 80 neurons. Satisfactory performance has been obtained, which justifies a precise analysis of this algorithm. This section recalls some important aspects of [6]. It also develops some new theoretical arguments. For instance, new critical cases are taken into account so as to determine the limits of this algorithm, called the standard set partition algorithm.

1.1.1 Generalization phase

A sample set is given. In a generalization phase, the neural output is computed for each input in this set. No learning is achieved.

Host processor. The T-Node machine that was used for the former works implies the existence of a privileged node, the I/O processor or host processor, which is able to communicate with the outside environment. Moreover, information distribution has to be taken into account in real-time applications, whatever the employed parallel machine is. Therefore, it will be assumed that there is such a host, which centralizes information exchange in each studied algorithm, even if the ring topology is simulated on a homogeneous computer in which all nodes have equivalent resources. Figure 1.1 describes the parallel structure that is considered in this report. In a generalization phase, the inputs are known by the host, and the computed outputs have to be communicated to the host.

Principle. Let S be the number of given samples. The input set is first partitioned into n subsets, so that S = ns. Each subset of s inputs is called a block. If p is the number of processors on the ring, and if s = pk, then we use n iterations of the following algorithm (a message-passing sketch of the node process is given after the communication model below).

Partition the block into p subsets of size k, called sub-blocks. The host sends all the sub-blocks to node 1, one after the other, and then waits for the p output sub-blocks, which are sent one after the other by node p. Node i runs the following process:
1. Receive p − i input sub-blocks, one after the other, and immediately send them to the next processor on the ring (that is: receive an input sub-block, forward it, receive the next one, and so on).
2. Receive a sub-block with k input vectors, and compute the corresponding outputs.

3. Send the obtained output sub-block to its successor.
4. Receive and immediately forward i − 1 output sub-blocks (that is: receive an output sub-block, forward it, receive the next one, and so on).

If i = p, we consider that the next processor is the host, and these last two steps make the host receive all the output sub-blocks.

[Figure 1.1: ring communication structure with host.]
[Figure 1.2: sample set partition principle. Node i receives and forwards (p − i)·k samples, sends k computed outputs, and receives and forwards (i − 1)·k outputs; the host sends s = p·k samples and receives s = p·k outputs.]

Figure 1.2 describes the set partition method which is thus obtained, whereas figure 1.3 shows when and how the nodes are busy during the algorithm processing. It is simplified in figure 1.4, which shows only the general shape of this behaviour. Such shapes will be used later when they are explicit enough, since they clearly point out the main aspects of any description.

Communication modeling. The following models aim to be more exhaustive than in [6]. A linear communication time model is used:
- between the host and a node (β0: startup time, L: message size, τ0: per-byte transfer time):
  T_c1(L) = β0 + L·τ0,
- between two nodes, with β ≤ β0 and τ ≤ τ0:
  T_c2(L) = β + L·τ.
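The node process described above can be summarised by the following sketch. MPI is used only as a modern stand-in for the transputer and iPSC/860 message-passing primitives of the original experiments; the rank layout (host = rank 0, ring nodes = ranks 1 to p), the byte sizes and the compute_outputs stub are illustrative assumptions rather than details taken from the report.

/* Sketch of the node process of the standard set partition algorithm
 * (generalization phase), for one block of s = p*k samples. */
#include <mpi.h>
#include <stdlib.h>

#define LAMBDA 108   /* input size in bytes (27 floats)   */
#define R       12   /* output size in bytes (3 floats)   */
#define K      100   /* samples per sub-block (k)         */

/* placeholder for the forward pass of the local copy of the network */
static void compute_outputs(const char *in, char *out) { (void)in; (void)out; }

void node_process(int i, int p, MPI_Comm ring)
{
    int pred = (i == 1) ? 0 : i - 1;   /* node 1 receives from the host  */
    int succ = (i == p) ? 0 : i + 1;   /* node p sends back to the host  */
    char *in  = malloc((size_t)K * LAMBDA);
    char *out = malloc((size_t)K * R);

    /* step 1: receive p-i input sub-blocks and forward them immediately */
    for (int j = 0; j < p - i; j++) {
        MPI_Recv(in, K * LAMBDA, MPI_BYTE, pred, 0, ring, MPI_STATUS_IGNORE);
        MPI_Send(in, K * LAMBDA, MPI_BYTE, succ, 0, ring);
    }
    /* step 2: receive the local sub-block and compute its outputs */
    MPI_Recv(in, K * LAMBDA, MPI_BYTE, pred, 0, ring, MPI_STATUS_IGNORE);
    compute_outputs(in, out);

    /* step 3: send the local output sub-block to the successor (or the host) */
    MPI_Send(out, K * R, MPI_BYTE, succ, 1, ring);

    /* step 4: receive and forward the i-1 output sub-blocks of the predecessors */
    for (int j = 0; j < i - 1; j++) {
        MPI_Recv(out, K * R, MPI_BYTE, pred, 1, ring, MPI_STATUS_IGNORE);
        MPI_Send(out, K * R, MPI_BYTE, succ, 1, ring);
    }
    free(in);
    free(out);
}

The host-side process (sending the p input sub-blocks one after the other and collecting the p output sub-blocks) is symmetric and is omitted here.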

[Figure 1.3: algorithm steps for the generalization phase (sendings of input and output sub-blocks between the host and the transputers, and output sub-block computations).]
[Figure 1.4: shape of the algorithm behaviour for the generalization phase.]

Digression. It might be assumed that β ≤ β0 and τ ≤ τ0, even if slower communications with the host are observed for the T-Node machine. Figure 1.5 shows the differences between both cases. This figure can be used so as to determine at what time node 1 starts its computation (the algorithm iteration starting at time 0). Let λ and r be the input and output sizes. If (β, τ) ≤ (β0, τ0), then node 1 starts computing at time

  p(β0 + kλτ0) + (p − 1)(β + kλτ) < 2(β0 + kλτ0) + (2p − 3)·max(β0 + kλτ0, β + kλτ),

and if (β, τ) ≥ (β0, τ0), then node 1 starts computing at time

  2(β0 + kλτ0) + (2p − 3)(β + kλτ) = 2(β0 + kλτ0) + (2p − 3)·max(β0 + kλτ0, β + kλτ).

It shows that the standard case, with (β, τ) ≤ (β0, τ0), is theoretically better (and it is the more likely one anyway).

Performance modeling. Figure 1.6 is used so as to determine the processing time of one iteration. The case with λ ≥ r is analysed first.

1. From the beginning of the algorithm until the beginning of node 1 computation, the host sends p sub-blocks, whereas node 1 forwards p − 1 of them:
  p(β0 + kλτ0) + (p − 1)(β + kλτ).
2. The computation time of a multilayer perceptron depends linearly on its parameter size. For one input vector, it is α + τ_c·w, if w is the size of the network parameter vector. Therefore node 1 computation lasts
  k(α + τ_c·w).

3. The output sub-block of node 1 is sent through all the nodes to the host. The first case of figure 1.6 shows that p − 1 communications between two nodes are implied, and that it ends with a node/host communication:
  (p − 1)(β + krτ) + (β0 + krτ0).

But the aim of figure 1.6 is to show that if λ ≥ r, then the host only has to wait a constant time between each output receipt, whereas if r ≥ λ, then node p must wait for the host's first receipt, node p − 1 must therefore wait even more, and so on. A phenomenon of accumulated delays appears. For node 1, the obtained delay is:
  (p − 1)·k·τ0·(r − λ).

[Figure 1.5: influence of the host/node communication times (host/node sendings longer than node/node sendings, and conversely).]
[Figure 1.6: processor activity all along the algorithm processing (communications, computations, and accumulated delays in the cases λ > r and r > λ).]

Speedup. Indeed, λ and r play symmetrical roles. The previous paragraph privileged node 1 (time before the beginning of node 1 computation, time of node 1 computation, time for node 1 output to reach the host). Figure 1.6 shows that if node p is privileged and if λ and r are exchanged, then the second case, with r ≥ λ, leads to a formula which is similar

to the simple formula obtained for the first case. Therefore a global speedup formula can be given:

  speedup = pk(α + τ_c·w) / [ k(α + τ_c·w) + p(β0 + k·max(λ, r)·τ0) + (p − 1)(2β + k(λ + r)·τ) + (β0 + k·min(λ, r)·τ0) ]   (1.1)

Improvement. In order to reduce communication times, message sendings and receipts may be processed in parallel, provided that the machine allows simultaneous communications on the different links of each node (the T-Node machine allows such parallel communications). In this case, node i receives block j from node i − 1 while it sends block j − 1 to node i + 1. Figure 1.7 shows the shape of the resulting algorithm. Such an improvement actually suffers from two drawbacks:
- Many parallel computers do not allow parallel communications, or simulate them with irregular performance (see [3] for the iPSC/860). A good model is thus difficult to determine.
- Parallel communications need more management work, and their startup time is longer than for standard communications. A precise communication time model for the T-Node machine is given in [1]. It shows that the startup time depends on the number of possible links on each node. It can be derived from this work that if parallel communications are to be used, then the startup time is increased by β_//. For a T-Node and only two parallel links, β_// = 2.9 µs.

[Figure 1.7: shape of the algorithm behaviour with parallel communications.]
[Figure 1.8: pipeline algorithm, with parallel communications.]

Modeling with parallel communications. If communication sendings and receipts are performed in parallel, it can be considered that only the host/node communication times must be taken into account, since they are the slowest. Moreover, if the algorithm is given a pipeline structure as shown in figure 1.8, then the transfer of node 1 output for iteration i is performed during the transfer of node p input for iteration i + 1. If many iterations are processed, then the speedup is:

  speedup = pk(α + τ_c·w) / [ k(α + τ_c·w) + p(β_// + k·max(λ, r)·τ0) ]   (1.2)
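The two speedup models can be evaluated directly. The following helper is a minimal sketch in which the structure and function names are illustrative; the parameter values are meant to be filled with the experimental data given in the next paragraph.

/* Evaluation of the speedup models (1.1) and (1.2). */
#include <stdio.h>

struct params {
    double beta0, tau0;   /* host/node startup and per-byte transfer time  */
    double beta, tau;     /* node/node startup and per-byte transfer time  */
    double beta_par;      /* extra startup for parallel communications     */
    double alpha, tau_c;  /* per-sample computation cost: alpha + tau_c*w  */
    double lambda, r;     /* input and output sizes (bytes)                */
    double w;             /* number of network parameters                  */
};

static double maxd(double a, double b) { return a > b ? a : b; }
static double mind(double a, double b) { return a < b ? a : b; }

/* formula (1.1): standard algorithm, no parallel communications */
double speedup_std(const struct params *m, double p, double k)
{
    double comp = k * (m->alpha + m->tau_c * m->w);
    double den  = comp
                + p * (m->beta0 + k * maxd(m->lambda, m->r) * m->tau0)
                + (p - 1) * (2 * m->beta + k * (m->lambda + m->r) * m->tau)
                + (m->beta0 + k * mind(m->lambda, m->r) * m->tau0);
    return p * comp / den;
}

/* formula (1.2): pipelined algorithm with parallel communications */
double speedup_par(const struct params *m, double p, double k)
{
    double comp = k * (m->alpha + m->tau_c * m->w);
    double den  = comp + p * (m->beta_par + k * maxd(m->lambda, m->r) * m->tau0);
    return p * comp / den;
}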

[Figure 1.9: speedup in the generalization phase, with and without parallel communications, for 1 to 32 processors (k = 100).]
[Figure 1.10: speedup in the generalization phase, with and without parallel communications, for 1 to 1000 processors (k = 100).]

Numerical performance. The following numerical values correspond to the experiments of [6]:
- β0 = 5.4 µs and τ0 = 1.8 µs,
- β = 3.9 µs and τ = 1.1 µs,
- λ = 108 bytes (27 floats) and r = 12 bytes (3 floats),
- α = 50 µs and τ_c = 3 µs,
- w = 1243 (40 neurons in the single hidden layer).

Figure 1.9 shows satisfactory performance for a limited number of processors. But some limits appear for massively parallel machines, as shown in figure 1.10.

Speedup formula interpretation. In accordance with formula 1.1, a good efficiency is obtained with:

1. few processors (p),
2. large input sub-blocks (k),
3. small input and output sizes (λ, r),
4. many network parameters (w).

Let us try to sharpen this analysis.

Terms gathering. A dense efficiency formula can be derived from formula 1.1:

  e = 1 / [ 1 + (c1 + c2·p)/(kw) + (c3 + c4·p)·L/w ],   (1.3)

where the c_i are machine-dependent constants, and L stands for the sum of the input and output sizes.

Study of the L/w term. For a neural network with only one hidden layer (containing N neurons), it can be asserted that w = O(LN), and the third term of the denominator may therefore be rewritten as (c'3 + c'4·p)/N.

Influence of L. It may now be considered that L and N are independent variables. Then L does not have any influence on the third term (c'3 + c'4·p)/N, and the second term becomes (c'1 + c'2·p)/(kLN). It shows that large input and output sizes are indeed useful to obtain a good efficiency.

Dominating variables. A great efficiency increase is obtained thanks to many hidden neurons and few nodes, since p/N belongs to both the second and the third terms of the formula denominator.

Multilayer nets. If there are two hidden layers or more, then w = w1·N² + w2·LN + w3. It follows that the influence of L becomes negligible.

1.1.2 Learning phase: centralized updating

This paragraph describes the most intuitive version, directly derived from the algorithm above. The term centralized updating is employed because the nodes only perform gradient computation, whereas the host updates the neural network parameters thanks to the communicated gradients. This algorithm does not appear in [4].

Description. The previous algorithm may be used so as to simulate the total gradient algorithm in the learning phase. Instead of sending output sub-blocks, node i computes a sum of error gradients (one error gradient per input sample), which is called G_i W. If n = 1, then the total gradient algorithm is implemented, whereas n > 1 generates a block-gradient algorithm (if p = 1 and n = S, we obtain the stochastic gradient algorithm). In the case of centralized updating, node i sends G_i W to node i + 1, so as to communicate it to the host. Then the host computes the sum GW of all the obtained vectors, and applies any simple or advanced gradient descent algorithm, for instance W(t + 1) = W(t) − ε(t)·GW(t) for the standard gradient descent, where W(t) is the parameter vector of the network at time t. It must be ensured that the host communicates the network modifications to all nodes before each of the n algorithm iterations.

Without parallel communication. New constants have to be introduced for the computation time model: gradient computation lasts α_l + τ_l·w for one sample. Let us notice that the output message size no longer depends on k, but is proportional to w, since all nodes communicate error gradients (for instance, if float values are used in the C program, then the output message size is w*sizeof(float)). Therefore, let l·w be the output message size. If the problem of the computation and communication of the network modifications is not yet taken into account (see the paragraph "Version choice according to the machine" below), the speedup is then:

  speedup = pk(α_l + τ_l·w) / [ k(α_l + τ_l·w) + p(β0 + max(kλ, lw)·τ0) + (p − 1)(2β + (kλ + lw)·τ) + (β0 + min(kλ, lw)·τ0) ]   (1.4)
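As a complement, the dense form 1.3 introduced above can be recovered from formula 1.1 by dividing the numerator and the denominator by k(α + τ_c·w). The grouping below is only a sketch of that step, under the approximation α + τ_c·w ≈ τ_c·w; the machine constants c_i absorb β0, β, τ0, τ and the ratios of max(λ, r), λ + r and min(λ, r) to L = λ + r.

\[
e=\frac{\text{speedup}}{p}
 =\frac{k(\alpha+\tau_c w)}
       {k(\alpha+\tau_c w)+(p+1)\beta_0+2(p-1)\beta
        +k\bigl(p\max(\lambda,r)\tau_0+(p-1)(\lambda+r)\tau+\min(\lambda,r)\tau_0\bigr)}
\]
\[
\approx\frac{1}{\,1+\dfrac{(p+1)\beta_0+2(p-1)\beta}{k\,\tau_c w}
 +\dfrac{p\max(\lambda,r)\tau_0+(p-1)(\lambda+r)\tau+\min(\lambda,r)\tau_0}{\tau_c w}\,}
 =\frac{1}{1+\dfrac{c_1+c_2\,p}{k\,w}+(c_3+c_4\,p)\dfrac{L}{w}}.
\]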

[Figure 1.11: efficiency on the T-Node machine for the learning and validation (generalization) phases, with regard to the number of processors and the number of neurons.]

When studying a corresponding dense form,

  e = 1 / [ 1 + c''1·p/(kw) + c''2·p/w + c''3·p/k ],   (1.5)

a new term is added to the denominator of the efficiency formula. It is proportional to p/k; w does not appear in the denominator of this term, and therefore even a boundless computation load cannot lead to an efficiency equal to 100% (a boundless load may correspond to a boundless number of hidden neurons). This phenomenon can be observed in figure 1.11.

Optimal block size. Small blocks are required for a quick convergence of the block-gradient algorithm (see [2]). But the arguments above show that a good parallelization is achieved with large blocks. A compromise must therefore be found. Performance improvement for small blocks is indeed a priority.

Pipeline. The interest of a pipeline structure for the training set partition algorithm is discussed.
- Pipeline for the generalization phase: formulae 1.1 and 1.2 show that an increase of k always leads to an improvement of the algorithm efficiency. Therefore the best solution is s = S, that is n = 1, which allows no pipelining.
- Pipeline for the learning phase: since updating is performed by the host, the obtained modifications must be communicated to all nodes before each iteration. Figure 1.12 shows the induced behaviour of the algorithm. It is clear that no pipelining is possible.

With parallel communication. Since there is no pipeline processing, and if the computation and communication of the network modifications is not yet considered, the speedup formula is (see figure 1.12):

  speedup = pk(α_l + τ_l·w) / [ k(α_l + τ_l·w) + p(2β_// + (kλ + lw)·τ0) ]   (1.6)
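The bounded-efficiency phenomenon of formula 1.5 can be checked numerically; in the sketch below the constants c''1, c''2, c''3 are arbitrary illustrative values, not fitted machine constants.

/* Limit of the dense efficiency form (1.5) as the network grows:
 * e tends to 1/(1 + c3*p/k), not to 1, because the p/k term does not
 * depend on w. */
#include <stdio.h>

int main(void)
{
    const double c1 = 50.0, c2 = 0.5, c3 = 0.2;   /* illustrative only */
    const double p = 32.0, k = 10.0;
    for (double w = 1e2; w <= 1e8; w *= 100.0) {
        double e = 1.0 / (1.0 + c1 * p / (k * w) + c2 * p / w + c3 * p / k);
        printf("w = %8.0f   e = %.4f\n", w, e);
    }
    printf("limit as w -> infinity: %.4f\n", 1.0 / (1.0 + c3 * p / k));
    return 0;
}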

[Figure 1.12: consecutive iterations for the learning phase, without and with parallel communications (inputs, computations, outputs, computation of the new parameters by the host, communication of the new parameters).]

1.1.3 Learning phase: distributed updating

The parallel learning algorithm of [4] is presented here. It is particularly adapted to a parallel computer which allows parallel communications. It is called distributed updating because the host only communicates inputs. Instead of sending the gradient sum of the sub-block to the host, the nodes communicate directly. At the end of this communication process (or gradient exchange), each node knows all the error gradients, and therefore can update the network.

Description. Input sub-blocks are communicated in the same way as for centralized updating. Then each node performs its own computation. At last, the gradient exchange is realized by means of the following algorithm (a message-passing sketch of this exchange is given below):
- first step: node i communicates G_i W to its direct successor; it therefore receives G_{i−1} W from its predecessor and can compute G_{i−1} W + G_i W;
- j-th step: node i sends Σ_{m=0}^{j−1} G_{i−m} W to node i + 1, receives Σ_{m=0}^{j−1} G_{i−1−m} W, and then computes Σ_{m=0}^{j} G_{i−m} W;
- step p − 1: every node knows GW = Σ_{m=1}^{p} G_m W, from which it can induce the new network parameters.

With parallel communications, all nodes simultaneously end their computation, and therefore they can start this gradient exchange process without any delay. There is no question of a pipeline structure for distributed updating, since the gradient exchange process also ends at the same time on all nodes.

Modeling. Once again, the computation and communication of the network modifications is not considered. Therefore, each step of the gradient exchange lasts 2(β + lw·τ), and the computation of G_{i−1} W + G_i W is not yet taken into account.

Without parallel communications, the gradient exchange must wait for every node to have finished its local computation before it can start:

  speedup = pk(α_l + τ_l·w) / [ k(α_l + τ_l·w) + p(β0 + kλτ0) + (p − 1)(β + kλτ) + 2(p − 1)(β + lw·τ) ]
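The gradient exchange described above is, in modern terms, a ring reduction. The following sketch mirrors the step-by-step description; MPI is again only a stand-in for the original message-passing primitives, the ring is assumed to be a communicator containing the p nodes only, and the buffer organisation is an illustrative assumption. MPI_Sendrecv models the simultaneous send and receive of the parallel-communication case.

/* Gradient exchange of the distributed updating version:
 * after p-1 ring steps every node holds GW, the sum of all local gradients. */
#include <mpi.h>

/* grad   : local gradient G_i W (left unchanged)
 * acc    : partial sum, initialised by the caller as a copy of grad;
 *          after the call it contains GW on every node
 * recvbuf: scratch buffer of the same size w */
void gradient_exchange(const float *grad, float *acc, float *recvbuf,
                       int w, MPI_Comm ring)
{
    int p, i;
    MPI_Comm_size(ring, &p);
    MPI_Comm_rank(ring, &i);
    int succ = (i + 1) % p;
    int pred = (i - 1 + p) % p;

    for (int step = 0; step < p - 1; step++) {
        /* send the current partial sum, receive the predecessor's partial sum */
        MPI_Sendrecv(acc, w, MPI_FLOAT, succ, step,
                     recvbuf, w, MPI_FLOAT, pred, step,
                     ring, MPI_STATUS_IGNORE);
        /* new partial sum = received partial sum + local gradient */
        for (int j = 0; j < w; j++)
            acc[j] = recvbuf[j] + grad[j];
    }
    /* every node can now compute the new network parameters from acc */
}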

With parallel communications, each step of the gradient exchange lasts β_// + β + lw·τ (which still does not take into account the computation of the gradient sum):

  speedup = pk(α_l + τ_l·w) / [ k(α_l + τ_l·w) + p(β_// + kλτ0) + (p − 1)(β_// + β + lw·τ) ]

Version choice according to the machine. In order to determine the best algorithm for a learning phase, the time needed for the computation and communication of the gradient modifications must be introduced. The elapsed time between the end of the first node computation and the start of the next algorithm iteration (next main block) is estimated in this paragraph for both the centralized and the distributed updating versions. The computation time of the sum of two gradient vectors must be taken into account: α_add + τ_add·n for n-dimensional vectors. Moreover, most algorithms for function minimization (gradient descent, conjugate gradient descent, and so on) are performed within a linear time with regard to the gradient size: it will be estimated as α_desc + τ_desc·w.

With parallel communications:

- Centralized updating:
  - all the G_i W are received by the host: p(β0 + β_// + lw·τ0);
  - the host computes the sum of all gradients: (p − 1)(α_add + τ_add·w);
  - the host computes the new network parameters thanks to the obtained global gradient: α_desc + τ_desc·w;
  - the host communicates the obtained modifications. In order to minimize the number of startup times, the modifications of iteration h may be included in the first message of iteration h + 1, since this message contains node p input sub-block and therefore goes through all nodes. That is why this modification communication only adds the gradient transfer time from the host to node p: (τ0 + (p − 1)τ)·lw.
- Distributed updating:
  - all nodes communicate the G_i W to each other: (p − 1)(β_// + β + lw·τ);
  - for each step j, node i adds its own gradient to the received partial sum; there are p − 1 steps: (p − 1)(α_add + τ_add·w);
  - each node computes the new network parameters: α_desc + τ_desc·w.

Without parallel communications:

- Centralized updating:
  - G_1 W is forwarded to the host; possible delays must be taken into account in the same way as for the determination of formula 1.1: (p − 1)(β + lw·τ) + (β0 + lw·τ0) + (p − 1)τ0·(lw − kλ);
  - the host computes the sum of all gradients. After having received an output sub-block from node p, the host must wait for node p to have received the next output sub-block from node p − 1. Therefore, the host can use this waiting time to perform some gradient sum computation. But these computations may not end within the waiting times, so partial computation times may remain: each of them is equal to max(0, α_add + τ_add·w − (β + lw·τ)). The global computation therefore implies an extra time equal to (α_add + τ_add·w) + (p − 2)·max(0, α_add + τ_add·w − (β + lw·τ)). This is generally equal to (α_add + τ_add·w), since the addition time of n real numbers is usually smaller than their transfer time;
  - the host computes the new network parameters: α_desc + τ_desc·w;
  - the host communicates the obtained modifications: (τ0 + (p − 1)τ)·lw.
- Distributed updating:
  - all nodes communicate the G_i W to each other: 2(p − 1)(β + lw·τ);
  - node i adds the received partial sums: (p − 1)(α_add + τ_add·w);
  - each node computes the new network parameters: α_desc + τ_desc·w.

It clearly appears that distributed updating must be chosen if parallel communications are possible. Otherwise, distributed updating is to be preferred only if node-to-node communications are significantly faster than host communications (β ≪ β0), which is true for the T-Node machine, but not for the applications on the iPSC/860 or Volvox machines.

Thanks to this study, more exact speedup formulae may be established. They take into account the time needed for the computation and communication of the network modifications. But they still make an assumption: the computation of these modifications is not performed with a parallel algorithm (the time it needs is equal to α_desc + τ_desc·w in a sequential processing as well as in a parallel processing). The following formulae give the obtained speedups.

Centralized updating without parallel communications:

  speedup = [ pk(α_l + τ_l·w) + α_desc + τ_desc·w ] /
            [ k(α_l + τ_l·w) + p(β0 + max(kλ, lw)·τ0) + (p − 1)(2β + (kλ + lw)·τ) + (β0 + min(kλ, lw)·τ0)
              + α_add + τ_add·w + (p − 2)·max(0, α_add + τ_add·w − (β + lw·τ)) + α_desc + τ_desc·w + (τ0 + (p − 1)τ)·lw ]

Centralized updating with parallel communications:

  speedup = [ pk(α_l + τ_l·w) + α_desc + τ_desc·w ] /
            [ k(α_l + τ_l·w) + p(2β_// + (kλ + lw)·τ0) + (p − 1)(α_add + τ_add·w) + α_desc + τ_desc·w + (τ0 + (p − 1)τ)·lw ]

Distributed updating without parallel communications:

  speedup = [ pk(α_l + τ_l·w) + α_desc + τ_desc·w ] /
            [ k(α_l + τ_l·w) + p(β0 + kλτ0) + (p − 1)(β + kλτ) + 2(p − 1)(β + lw·τ) + (p − 1)(α_add + τ_add·w) + α_desc + τ_desc·w ]

Distributed updating with parallel communications:

  speedup = [ pk(α_l + τ_l·w) + α_desc + τ_desc·w ] /
            [ k(α_l + τ_l·w) + p(β_// + kλτ0) + (p − 1)(β_// + β + lw·τ) + (p − 1)(α_add + τ_add·w) + α_desc + τ_desc·w ]

1.2 Performance study

The dependence of the algorithm efficiency on the machine characteristics is now studied according to experimental data. It follows the qualitative discussion of section 1.1.1. Only MIMD machines with distributed memory and asynchronous communications are considered. The iPSC/860 machine has been used for all experiments, though the performance models have also been applied to the Volvox machine with the Volvix environment.

1.2.1 Machine-dependent performance

Communications on an iPSC/860. Though a linear communication time model β + Lτ is rather good for this parallel computer (one value of (β, τ) is taken if L ≤ 100 bytes, and another one if L > 100 bytes), many special cases appear, even without any parallel communication. A study of these problems can be found in [3]. Experiments have shown that they can be avoided thanks to some programming precautions, which are also described in [3]; without them, an efficiency decrease of more than 20% may be observed during experiments, compared with the performance modelings.

Experimental data. See [3] for a precise study of the iPSC/860 communications.

Communication characteristics:
- iPSC/860:

  β0 = β = 79 µs for communications of less than 100 bytes, 94 µs for more than 100 bytes;
  τ0 = τ = 0.42 µs (less than 100 bytes), 0.4 µs (more than 100 bytes).
- Volvox (Volvix system):
  β0 = β ≈ 5000 µs;
  τ0 = τ = 1.53 µs.

Computation characteristics: both the iPSC/860 and the Volvox use i860 processors: α ≈ 10 µs and τ_c = 1.37 µs.

[Figure 1.13: modeled speedup in the generalization phase on the Volvox and iPSC/860 machines (k = 100).]

Performance analysis. Figure 1.13 shows the modeled speedups for the iPSC/860 machine and the Volvox machine (each message is supposed to contain more than 100 bytes). Formula 1.1 is used as the performance model. The iPSC/860 computer provides rather unsatisfactory performance, and the Volvox machine provides really feeble performance. Indeed, the computation capability of the i860 processor is too high, compared with the communication speed of both machines, for such an application. The T-Node machine is more balanced (quicker communications, slower computations). And yet, the iPSC/860 must be chosen, since the i860 speed makes it process the algorithm five times faster than the T-Node machine (without parallel communications, with 32 nodes, k = 100, 40 hidden neurons). It is all the more important that the algorithm efficiency be increased on such powerful machines, as they already provide the highest speed with an unsatisfactory efficiency.

1.2.2 Influence of block sizes

The low performance on the iPSC/860 is mainly due to its big startup time. Formula 1.1 shows that this phenomenon also generates an enlarged sensitiveness to the block size (in the efficiency formula, the 1/k term is negligible only for large block sizes). This is illustrated by figure 1.14. In a learning phase, it is a crucial problem, since better convergence is obtained thanks to small blocks.

[Figure 1.14: speedup in the generalization phase on the iPSC/860 and on the T-Node, with regard to the number of processors p and the sub-block size k.]
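Before turning to the gathered algorithms, the exact learning-phase models of section 1.1 can be evaluated with the machine constants listed above. In the sketch below, the communication constants are the iPSC/860 values as printed in this section, while the gradient and updating constants (α_l, τ_l, α_add, τ_add, α_desc, τ_desc) and the block size are placeholders only, since their measured values are not reproduced here.

/* Comparison of the exact learning-phase speedup models (centralized vs
 * distributed updating, without parallel communications). */
#include <stdio.h>

static double maxd(double a, double b) { return a > b ? a : b; }
static double mind(double a, double b) { return a < b ? a : b; }

int main(void)
{
    /* iPSC/860 communication constants (messages larger than 100 bytes) */
    double beta0 = 94.0, tau0 = 0.4, beta = 94.0, tau = 0.4;   /* microseconds */
    /* computation model: per-sample gradient cost and vector operations */
    double alpha_l = 20.0, tau_l = 4.0;        /* placeholders (us) */
    double alpha_add = 10.0, tau_add = 0.1;    /* placeholders (us) */
    double alpha_desc = 10.0, tau_desc = 0.2;  /* placeholders (us) */
    double lambda = 108.0, l = 4.0, w = 1243.0, k = 10.0;

    for (double p = 2; p <= 64; p *= 2) {
        double comp = k * (alpha_l + tau_l * w);
        double num  = p * comp + alpha_desc + tau_desc * w;

        /* centralized updating, no parallel communications */
        double cent = comp
            + p * (beta0 + maxd(k * lambda, l * w) * tau0)
            + (p - 1) * (2 * beta + (k * lambda + l * w) * tau)
            + (beta0 + mind(k * lambda, l * w) * tau0)
            + alpha_add + tau_add * w
            + (p - 2) * maxd(0.0, alpha_add + tau_add * w - (beta + l * w * tau))
            + alpha_desc + tau_desc * w + (tau0 + (p - 1) * tau) * l * w;

        /* distributed updating, no parallel communications */
        double dist = comp
            + p * (beta0 + k * lambda * tau0)
            + (p - 1) * (beta + k * lambda * tau)
            + 2 * (p - 1) * (beta + l * w * tau)
            + (p - 1) * (alpha_add + tau_add * w)
            + alpha_desc + tau_desc * w;

        printf("p = %2.0f  centralized: %.2f   distributed: %.2f\n",
               p, num / cent, num / dist);
    }
    return 0;
}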

Chapter 2: Heterogeneous load sharing

In this chapter, no parallel communications will be considered. The described algorithms intend to improve the efficiency of the training set partition algorithm when the machine characteristics do not fit well with the standard form of this algorithm. Indeed, a machine without parallel communications is less adapted to the standard algorithm than a machine with such facilities.

2.1 Message gathering

2.1.1 Principle

Sub-block gathering. In the previous algorithm, node 1 begins its own computations after 2p − 1 communications, and therefore after 2p − 1 startup times. A simple idea is to gather all the sub-blocks into a unique communication. The following algorithm, called the simple gathered algorithm, is obtained:
1. The host sends the whole training block to node 1, that is, one message for p sub-blocks.
2. Node i (i ≠ p) receives p − i + 1 gathered sub-blocks, sends the last p − i ones to its successor, and computes the neural outputs of the first sub-block it has received.
3. Node 1 sends its output sub-block to node 2.
4. Node i (i ≠ p and i ≠ 1) receives the output sub-blocks gathered from number 1 to number i − 1, and is therefore able to send the output sub-blocks from number 1 to number i.
5. Node p sends all the gathered output sub-blocks to the host.

Figure 2.1 shows the obtained algorithm behaviour. This algorithm is the simplest one (basic scattering algorithm on a ring). It has better performance than the previous one when the startup time is much greater than the transfer time. Its obvious drawback is that the number of communicated bytes, before node p can compute, is no longer linear with regard to the number of nodes, but quadratic. In a more concrete way, a significant loss of time appears for the first ring nodes, since node i + 1 starts its computation long after node i. It also appears for the last ones, since node i + 1 waits for node i to have performed a large output communication with node i − 1 after its computation time.

2.1.2 Modeling

Generalization phase. Despite a simple structure, this algorithm is more difficult to model than the previous one. Accumulated delays appear, as shown in figure 2.2. Thanks to figure 2.2, the processing time can be computed as the sum of the following times (a small routine summing these contributions is sketched after the list):

[Figure 2.1: simple gathered algorithm (the host sends all the inputs, each node forwards the remaining input sub-blocks and the gathered output sub-blocks, node p returns all the outputs to the host).]
[Figure 2.2: behaviour of the algorithm with sub-block gathering (communications, computations, inactivity).]

- Input transfer: the host first sends the whole block to node 1, then node i sends p − i input sub-blocks to its direct successor. For all nodes:
  β0 + pkλτ0 + Σ_{i=1}^{p−1} (β + (p − i)kλτ).
- Computation: node p computes its output sub-block:
  k(α + τ_c·w).
- Accumulated delays: in order to determine how long node p must wait before it can receive node p − 1 outputs, it should be noted that a delay appears for node i if and only if the output communication time between node i − 2 and node i − 1 is longer than the input communication time between node i and node i + 1. Therefore it appears only if (i − 2)kr ≥ (p − i)kλ, and all the obtained delays must be added:
  Σ_{i=3}^{p} max(0, ((i − 2)kr − (p − i)kλ)·τ).
- Output transfer: node p receives the outputs of blocks 1 to p − 1 and then sends all the outputs to the host:
  β + (p − 1)krτ + β0 + pkrτ0.
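The following routine simply adds the four contributions listed above; it is a sketch with illustrative parameter names, and it keeps the max(0, ·) form of the delays rather than the simplified bound used in the closed-form speedup given next.

/* Processing time of the simple gathered algorithm (generalization phase). */
double gathered_time(double p, double k,
                     double beta0, double tau0, double beta, double tau,
                     double alpha, double tau_c,
                     double lambda, double r, double w)
{
    /* input transfer: host -> node 1, then each node forwards p-i sub-blocks */
    double t = beta0 + p * k * lambda * tau0;
    for (int i = 1; i <= (int)p - 1; i++)
        t += beta + (p - i) * k * lambda * tau;

    /* computation of node p's own sub-block */
    t += k * (alpha + tau_c * w);

    /* accumulated delays before node p can receive the gathered outputs */
    for (int i = 3; i <= (int)p; i++) {
        double d = ((i - 2) * k * r - (p - i) * k * lambda) * tau;
        if (d > 0.0)
            t += d;
    }

    /* output transfer: node p receives p-1 output sub-blocks, sends p to host */
    t += beta + (p - 1) * k * r * tau + beta0 + p * k * r * tau0;
    return t;
}

/* The corresponding speedup is  p * k * (alpha + tau_c * w) / gathered_time(...). */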

The obtained speedup is therefore:

  speedup = pk(α + τ_c·w) / [ k(α + τ_c·w) + pk(r + λ)τ0 + 2β0 + pβ + (p(p − 1)/2)·kλτ + (p − 1)krτ + Σ_{i=⌈(2r+pλ)/(r+λ)⌉}^{p} ((i − 2)r − (p − i)λ)·kτ ]

Learning phase. In the previous centralized updating, all the gradient vectors are sent to the host, which computes their sum afterwards. Such a method would become rather inefficient with gathered output messages: node 1 sends its computed gradient vector to node 2, which sends its gradient vector and node 1 gradient vector, and so on; node p finally sends a message containing p gradient vectors to the host. From the end of node 1 computation to the end of the iteration, the time is O(p²w)! Another version of centralized updating is therefore introduced. The host still performs the computation of the new network parameters (updating), but it receives the sum of all gradients from node p. When node i receives a gradient from node i − 1, it adds its own gradient vector, and sends the sum to node i + 1. Therefore, node i sends Σ_{j≤i} G_j W. All output messages have the same size. From the end of node 1 computation to the end of the iteration, the time is now O(pw). The obtained speedup is:

  speedup = pk(α_l + τ_l·w) / [ k(α_l + τ_l·w) + (lw + pkλ)τ0 + 2β0 + pβ + (p(p − 1)/2)·kλτ + lw·τ + Σ_{i=⌈p − lw/(kλ)⌉}^{p} (lw − (p − i)kλ)·τ ]

2.2 Heterogeneous load sharing (generalization phase)

2.2.1 Description

[Figure 2.3: algorithm with processors as fully loaded as possible (host/node and node/node emissions and receipts, computations).]

The previous model shows that a simple block gathering should not be chosen, even for a parallel computer with long startup times. The main drawbacks are due to a lack of balance among input and output communications, especially in the case of a generalization phase. This algorithm may be improved by giving more samples to deal with to the nodes that have long inactive times. Figure 2.3 describes the algorithm behaviour that should be obtained. Node i begins its computation after having sent the inputs of all its successors to its direct successor. The time that this communication needs is therefore the time between node i − 1 computation start and node i computation start.

In the same way, node i has to receive all the outputs of nodes 1 to i − 1 from node i − 1 before it is allowed to send its own outputs and those of all its predecessors to node i + 1. Node i + 1 might go on with its own computation during this communication between nodes i and i − 1. Nodes 1 and 2 might simultaneously end their computation, since node 1 has no predecessor output to receive. In the same way, nodes p − 1 and p simultaneously begin their computation, since node p has no successor in the ring. If node i is among the first ring nodes, it has many inputs to transfer, whereas node i − 1 has few outputs to receive; therefore node i − 1 might deal with many more samples than node i. In the same way, if node i is among the last ring nodes, node i + 1 might deal with many more samples than node i.

A new algorithm can be imagined, in which communications are still gathered so as to decrease the number of startup times, but in which each node has to deal with a specific number of samples so as to minimize its inactive time. This heterogeneous load sharing cannot lead to optimal efficiency. But it aims to provide satisfactory performance when the standard algorithm appears very unsatisfactory.

Mutual influence between communications and computations. The training set partition is thus first changed by gathering the messages, and this gathered algorithm is now given a new aspect by performing a heterogeneous load sharing according to the inactive times that are due to unbalanced communications. But adding some samples for a given node will modify most communications. We have to change the loads because of the communication times, and yet any load change modifies these communications. This mutual influence must be given a mathematical expression so as to find the best training set partition according to the principle above. Another problem is that even adding only one sample might provide more extra computation time than is needed. Therefore, a heterogeneous load sharing might better adapt to an experiment with large communication sizes, e.g. many nodes. But it has been shown that a block gathering generates a long waiting time before node p can compute when p is big (this time is quadratic with regard to p). Each argument shows that a global improvement cannot be expected: a precise study of the heterogeneous algorithm must be performed so as to know when it may provide a significant performance increase.

About an active host. A previous discussion in section 1.1.1 has justified the idea of a privileged node, the host. But in many parallel machines, nodes have equivalent resources. Therefore the host may also perform some computation. The following modification may appear desirable: if p nodes are available, the training set partition algorithm (standard or heterogeneous) is implemented with a ring of p − 1 nodes, whereas the last one, which centralizes all data, keeps some samples to deal with between its last input communication and its first output communication. In such a case, it must be noted that the heterogeneous version provides more available computation time for the host than the standard version (see figure 2.4). Of course, if the host performs no computation, then it is not taken into account in the efficiency computation (in this case, only the ring nodes are taken into account so as to determine the number p of processors).

2.2.2 Algebraic model of the heterogeneous load sharing

The previous description is now given a mathematical expression. Let k_i be the size of the sub-block node i has to deal with. Node i computation time lasts k_i(α + τ_c·w).

[Figure 2.4: available computation time for the host, for the initial algorithm and for the gathered algorithm.]

The time to send the input sub-blocks within one message from node i is (0 for i = p):

  ∀i ∈ [1, p − 1]:  β + λτ · Σ_{j=i+1}^{p} k_j.

The time to send the output sub-blocks within one message from node i is:

  ∀i ∈ [1, p − 1]:  β + rτ · Σ_{j=1}^{i} k_j.

Node p sends the s outputs to the host, therefore no k_i value depends on this communication, just as it does not depend on the sending of all the inputs from the host to node 1. That is why β0 and τ0 do not appear in the following mathematical expressions. Let t_begin(i) and t_end(i) be the start and end times of node i computation (in the heterogeneous version):

  ∀i ∈ [2, p − 2]:  t_begin(i + 1) − t_begin(i) = β + λτ · Σ_{j=i+2}^{p} k_j,
                    t_end(i + 1) − t_end(i) = β + rτ · Σ_{j=1}^{i−1} k_j,
  ∀i:               t_end(i) − t_begin(i) = k_i(α + τ_c·w).

Therefore:

  (k_{i+1} − k_i)(α + τ_c·w) = rτ · Σ_{j=1}^{i−1} k_j − λτ · Σ_{j=i+2}^{p} k_j,   for i ∈ [2, p − 2],
  (k_2 − k_1)(α + τ_c·w) = −β − λτ · Σ_{j=3}^{p} k_j,

  (k_p − k_{p−1})(α + τ_c·w) = β + rτ · Σ_{j=1}^{p−2} k_j.   (2.1)

This can be expressed with matrices:

  ( b  −b   c   c  ...   c )   (  k_1   )     (  β )
  ( a   b  −b   c  ...   c )   (  k_2   )     (  0 )
  ( a   a   b  −b  ...   c ) · (  ...   )  =  ( ... )   (2.2)
  ( ...               ... )   ( k_{p−1} )     (  0 )
  ( a   a  ...   a   b  −b )   (  k_p   )     ( −β )

where b = α + τ_c·w, a = rτ and c = −λτ. This is a linear system with p variables and p − 1 equations. If it is assumed that there is a solution (this will be discussed), then the solution space is a one-dimensional affine space. It might be chosen to parametrize it with k_1: (k_2, k_3, ..., k_p) = f(k_1), where f is an affine function from ℝ to ℝ^{p−1}. The following system must be solved so as to find out f:

  ( −b   c   c  ...   c )   ( k_2 )     (  β − b·k_1 )
  (  b  −b   c  ...   c )   ( k_3 )     ( −a·k_1     )
  (  a   b  −b  ...   c ) · ( ... )  =  (   ...      )   (2.3)
  ( ...             ... )   (     )     ( −a·k_1     )
  (  a   a  ...   b  −b )   ( k_p )     ( −β − a·k_1 )

But a better choice, which does not privilege any k_i, is to write (k_1, k_2, ..., k_p) = F(s), where F is a function from ℝ to ℝ^p (and s is still the block size, that is S = ns and s = Σ_{i=1}^{p} k_i). This can be done with the following p-dimensional linear system:

  ( b  −b   c   c  ...   c )   ( k_1 )     (  β )
  ( a   b  −b   c  ...   c )   ( k_2 )     (  0 )
  ( ...               ... ) · ( ... )  =  ( ... )   (2.4)
  ( a   a  ...   a   b  −b )   (     )     ( −β )
  ( 1   1   1   1  ...   1 )   ( k_p )     (  s )

System 2.4 should be chosen, because the training set size is a more significant parameter than the load of only one node.

2.2.3 Solvability

The aim is now to find out under which conditions the system is solvable. If system 2.3 is considered, it should be known when the system matrix is invertible. Another question will be discussed: if the system is solvable, does it give judicious solutions, that is, values of the k_i which correspond to a concrete situation?
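For concreteness, system 2.4 written out for p = 4 (a direct instantiation of the matrices above) reads:

\[
\begin{pmatrix}
 b & -b &  c &  c\\
 a &  b & -b &  c\\
 a &  a &  b & -b\\
 1 &  1 &  1 &  1
\end{pmatrix}
\begin{pmatrix} k_1\\ k_2\\ k_3\\ k_4 \end{pmatrix}
=
\begin{pmatrix} \beta\\ 0\\ -\beta\\ s \end{pmatrix},
\qquad b=\alpha+\tau_c w,\; a=r\tau,\; c=-\lambda\tau .
\]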

Algebraic computation. The system matrix of 2.3 is invertible when the following determinant is different from 0:

  | −b   c   c  ...   c |
  |  b  −b   c  ...   c |
  |  a   b  −b  ...   c |
  | ...             ... |
  |  a   a  ...   b  −b |

Rather than computing some linear combinations of either rows or columns, a polynomial function is used:

  P(X) = | −b−X   c−X   c−X  ...   c−X |
         |  b−X  −b−X   c−X  ...   c−X |
         |  a−X   b−X  −b−X  ...   c−X |
         |  ...                    ... |
         |  a−X   a−X   ...   b−X  −b−X |

Since this determinant is an n-dimensional anti-symmetrical multilinear function, the degree of P(X) is 1 (the same vector X·(1, ..., 1)ᵀ is subtracted from every column, and any term involving it twice vanishes by anti-symmetry). Of course, P(0) is the value to be determined; therefore, two values of P(X) are sufficient. A first computation is obvious: for X = c, all the entries above the diagonal vanish, so that the matrix is lower triangular and

  P(c) = (−b − c)^{p−1}.

Let P(a) be the other value to compute. It is called Δ_{p−1} because of a recursive computation: for X = a, the entries below the sub-diagonal vanish, so that

  Δ_{p−1} = | −b−a   c−a   c−a  ...   c−a |
            |  b−a  −b−a   c−a  ...   c−a |
            |   0    b−a  −b−a  ...   c−a |
            |  ...                    ... |
            |   0     0    ...   b−a  −b−a |

Then, let r_n be the determinant of the following n-dimensional square matrix:

  r_n = | c−a   c−a   c−a  ...   c−a |
        | b−a  −b−a   c−a  ...   c−a |
        |  0    b−a  −b−a  ...   c−a |
        | ...                    ... |
        |  0     0    ...   b−a  −b−a |

If the following notations are used: A = −b − a, B = a − b, C = c − a,

and if Δ_n is developed with regard to its first column, then the following system is obtained:

  Δ_n = A·Δ_{n−1} + B·r_{n−1},
  r_n = C·Δ_{n−1} + B·r_{n−1},

that is,

  ( Δ_n )     ( A  B )   ( Δ_{n−1} )     ( A  B )^{n−2}   ( Δ_2 )
  (     )  =  (      ) · (         )  =  (      )       · (     )
  ( r_n )     ( C  B )   ( r_{n−1} )     ( C  B )         ( r_2 )

where Δ_2 = A² + BC and r_2 = C(A + B).

In order to compute the power of the 2-dimensional matrix, its eigenvalues have to be computed. The characteristic polynomial is

  det( A−Y  B ; C  B−Y ) = Y² − (A + B)Y + B(A − C) = Y² + 2bY − (b + c)(a − b),

whose reduced discriminant is

  disc' = a(b + c) − bc.

Now, c < 0, a > 0 and b > 0. Moreover, it can be assumed that b ≫ |c|, that is to say α + τ_c·w ≫ λτ. It means that a neural output computation takes more time than the transfer of the corresponding data. Such an assumption is rather reasonable, and it ensures that disc' > 0, and therefore that there are two distinct eigenvalues v_1 and v_2 in ℝ. From this it can be derived that:

  ( A  B )        1           ( B         B      )   ( v_1   0  )   ( v_2 − A   −B )
  (      )  =  ───────────  · (                  ) · (          ) · (              )
  ( C  B )     B(v_2 − v_1)   ( v_1 − A   v_2 − A )   (  0   v_2 )   ( A − v_1     B )

Δ_{p−1} can now be computed:

  Δ_{p−1} = [ B·( v_1^{p−2}(v_2 − A) + v_2^{p−2}(A − v_1) )·(A² + BC) + B²·( v_2^{p−2} − v_1^{p−2} )·C(A + B) ] / ( B(v_2 − v_1) )   (2.5)

Since a > 0 and c < 0, and therefore a ≠ c, and since P has degree 1, this leads to:

  P(0) = ( c·P(a) − a·P(c) ) / ( c − a ).

Case of unsolvability. The system matrix is not invertible if and only if c·P(a) = a·P(c), that is:

  a·B·(v_2 − v_1)·(−b − c)^{p−1} = c·B·[ (A² + BC)·( v_1^{p−2}(v_2 − A) + v_2^{p−2}(A − v_1) ) + B·( v_2^{p−2} − v_1^{p−2} )·C(A + B) ]   (2.6)

If the initial notations are used, and if M_1 and M_2 are the left-hand and right-hand sides of equation 2.6 (divided by B), then

  M_1 = a·(−b − c)^{p−1}·(v_2 − v_1),
  M_2 = c·[ (b² + 3ab + ac − bc)·( v_1^{p−2}(v_2 + a + b) − v_2^{p−2}(v_1 + a + b) ) − 2b(a − b)(c − a)·( v_2^{p−2} − v_1^{p−2} ) ],

and the system is not solvable if and only if M_1 = M_2.
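In practice, whenever the matrix is invertible, the loads k_i are simply obtained numerically. The following sketch builds system 2.4 for a small ring and solves it by Gaussian elimination with partial pivoting; all parameter values are illustrative placeholders (loosely inspired by the figures of chapter 1), and the resulting real k_i still have to be rounded to integers.

/* Numerical solution of system (2.4). */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define P 8   /* number of ring nodes (illustrative) */

int main(void)
{
    double alpha = 50.0, tau_c = 3.0, w = 1243.0;   /* computation model       */
    double beta = 94.0, tau = 0.4;                  /* node/node communication */
    double lambda = 108.0, r = 12.0, s = 800.0;     /* sizes and block size    */

    double b = alpha + tau_c * w, a = r * tau, c = -lambda * tau;
    double M[P][P + 1];   /* augmented matrix [ A | rhs ] */

    /* rows 0 .. P-2: the balance equations (2.1); last row: sum of loads = s */
    for (int i = 0; i < P - 1; i++) {
        for (int j = 0; j < P; j++)
            M[i][j] = (j < i) ? a : (j == i) ? b : (j == i + 1) ? -b : c;
        M[i][P] = (i == 0) ? beta : (i == P - 2) ? -beta : 0.0;
    }
    for (int j = 0; j < P; j++) M[P - 1][j] = 1.0;
    M[P - 1][P] = s;

    /* Gaussian elimination with partial pivoting */
    for (int col = 0; col < P; col++) {
        int piv = col;
        for (int i = col + 1; i < P; i++)
            if (fabs(M[i][col]) > fabs(M[piv][col])) piv = i;
        for (int j = col; j <= P; j++) {
            double t = M[col][j]; M[col][j] = M[piv][j]; M[piv][j] = t;
        }
        for (int i = col + 1; i < P; i++) {
            double f = M[i][col] / M[col][col];
            for (int j = col; j <= P; j++) M[i][j] -= f * M[col][j];
        }
    }
    /* back substitution */
    double k[P];
    for (int i = P - 1; i >= 0; i--) {
        k[i] = M[i][P];
        for (int j = i + 1; j < P; j++) k[i] -= M[i][j] * k[j];
        k[i] /= M[i][i];
    }
    for (int i = 0; i < P; i++)
        printf("k_%d = %.2f\n", i + 1, k[i]);
    return 0;
}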

A realistic hypothesis is that b ≫ a and b ≫ |c|. Units can be chosen so that a and c remain O(1) while b grows, which implies

  disc' = (a − c)b + O(1),

and therefore

  v_1 = −b − √(a − c)·b^{1/2} + O(1),     v_2 = −b + √(a − c)·b^{1/2} + O(1).

For the first term:

  M_1 = 2a·(−1)^{p−1}·√(a − c)·b^{p−1/2} + O(b^{p−3/2}).   (2.7)

For M_2, since

  v_1^{p−2}(v_2 + a + b) − v_2^{p−2}(v_1 + a + b)
    = ( −b − √(a − c)·b^{1/2} + O(1) )^{p−2}·( √(a − c)·b^{1/2} + O(1) ) − ( −b + √(a − c)·b^{1/2} + O(1) )^{p−2}·( −√(a − c)·b^{1/2} + O(1) )
    = 2·(−1)^{p−2}·√(a − c)·b^{p−3/2} + O(b^{p−2}),

and

  v_2^{p−2} − v_1^{p−2} = O(b^{p−5/2}),

the following estimate is obtained:

  M_2 = 2c·(−1)^{p−2}·√(a − c)·b^{p+1/2} + O(b^{p}).   (2.8)

Equations 2.7 and 2.8 show that M_1 and M_2 then have different orders of magnitude, so that the system accepts solutions when the neural output computation time is longer than the input and output transfer times.

Concrete significance.

Real solutions vs needed integers. Solutions are taken in ℝ, whereas integers must be used in practice (the k_i are numbers of samples). Therefore, after having solved the system, approximate solutions will be taken in the set of integers. In this case, we should not worry about the theoretical solvability of the system, since GL_n(ℝ) is dense in M_n(ℝ): exact solvability is an improper worry with regard to approximate values. The aim was in fact to show that the theoretical necessary condition has a concrete significance (computation time versus byte transfer time), which allows one to say that a good load sharing will be easy to obtain in most experimental conditions.

Positive solutions. The last argument has shown that non-integer solutions are not a concern. But negative solutions (in ℝ, and then in ℕ by approximation) would have no meaning, since an optimal schedule cannot be achieved with the help of negative computation times. It is now intended to determine under which conditions the system provides positive solutions. But when exact theoretical solutions are computed with small systems (p = 3 or 4), it can be seen that such conditions are very difficult to determine and to express. The idea is then to first assume that the previous condition (b ≫ a and b ≫ |c|; in our application, b > 100a and b > 100|c|) is satisfied, which ensures that the system is solvable. A special system is then studied: the system in which a = 0 and c = 0. It leads to:

  k_1 = k_p = β/b + (bs − 2β)/(pb),
  k_i = (bs − 2β)/(pb)   for all i ∈ [2, p − 1].

All solutions are positive as soon as the computation time is longer than the startup time (another reasonable assumption). Since inversion and multiplication are continuous functions in the space of matrices, it can be asserted that there exist a_max > 0 and c_max > 0 such that |a| < a_max and |c| < c_max imply k_i ≥ 0 for all i ∈ [1, p]. Since a_max and c_max obviously depend on b, the main assumption leads to both system solvability and solution positivity (keeping in mind that it is only a qualitative assumption, since there is no theoretical equivalence between |a| < a_max and b ≫ a).

2.2.4 Performance modeling

After having solved the system, and therefore after having found the k_i values, the theoretical performance is computed. According to figure 2.3:

- sequential computation time: (α + τ_c·w)·Σ_{i=1}^{p} k_i,
- host sends the inputs to node 1: β0 + λτ0·Σ_{i=1}^{p} k_i,
- node 1 sends the inputs to node 2: β + λτ·Σ_{i=2}^{p} k_i,
- node 1 computation: k_1(α + τ_c·w),
- output communications: Σ_{i=2}^{p} ( β + rτ·Σ_{j=1}^{i−1} k_j ),
- host receives the outputs: β0 + rτ0·Σ_{i=1}^{p} k_i.

The speedup formula is therefore:

  speedup = (α + τ_c·w)·Σ_{i=1}^{p} k_i / [ k_1(α + τ_c·w) + (λ + r)τ0·Σ_{i=1}^{p} k_i + 2β0 + β + λτ·Σ_{i=2}^{p} k_i + Σ_{i=2}^{p} ( β + rτ·Σ_{j=1}^{i−1} k_j ) ]   (2.9)

2.2.5 Exact and approximate solving

Each performance model is applied to experimental data, so as to analyse the possible improvement that is obtained thanks to gathered communications and heterogeneous load sharing. (The computation of the determinant of system 2.4 is difficult enough to show that a theoretical direct formula for each k_i cannot be given. That is why a Maple program is given the experimental data, so as to find the corresponding k_i by solving the linear system.)
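Given a load vector, the model 2.9 is straightforward to evaluate; the following helper is a sketch with illustrative parameter names, summing the terms of the critical path listed above. It can be fed either the exact real solutions of system 2.4 or their integer approximations.

/* Evaluation of the heterogeneous speedup model (2.9) for a given load
 * vector k[0..p-1] (k[i] is the load of node i+1). */
double heterogeneous_speedup(const double *k, int p,
                             double beta0, double tau0, double beta, double tau,
                             double alpha, double tau_c,
                             double lambda, double r, double w)
{
    double total = 0.0;
    for (int i = 0; i < p; i++) total += k[i];

    double comp_unit = alpha + tau_c * w;
    double sequential = total * comp_unit;

    /* parallel time along the critical path of figure 2.3 */
    double parallel = beta0 + lambda * tau0 * total         /* host -> node 1     */
                    + beta + lambda * tau * (total - k[0])  /* node 1 -> node 2   */
                    + k[0] * comp_unit;                     /* node 1 computation */
    double prefix = 0.0;                                    /* k_1 + ... + k_{i-1} */
    for (int i = 2; i <= p; i++) {
        prefix += k[i - 2];
        parallel += beta + r * tau * prefix;                /* outputs i-1 -> i   */
    }
    parallel += beta0 + r * tau0 * total;                   /* node p -> host     */
    return sequential / parallel;
}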

Studied algorithms. Eight versions of the training set partition principle are considered:
1. SA stands for the standard algorithm.
2. SA2 stands for the standard algorithm where the host computes some outputs between its last input communication and its first output communication.
3. GA stands for the simple gathered algorithm (without computation by the host).
4. GA2 stands for the gathered algorithm with computation by the host.
5. HA stands for the heterogeneous gathered algorithm where the host performs no computation; moreover, the exact real solutions of system 2.4 are taken.
6. aHA stands for the heterogeneous gathered algorithm where the host performs no computation, but with approximate solutions (integers) of system 2.4, since each k_i is a number of samples. The performance model 2.9 cannot be used any more: indeed, delays may appear once again. But modeling these delays is more difficult than for the simple gathered algorithm. If a delay appears, for instance on account of an over-approximation, it might afterwards be balanced by a negative delay. Here is an example: if round(k_{i−1}) > k_{i−1}, then node i must wait for node i − 1 between the end of node i computation and the beginning of node i − 1 sending its outputs to node i; a delay appears, compared with the theoretical behaviour of figure 2.3. If afterwards round(k_{j−1}) < k_{j−1} (with j > i), then node j − 1 should wait for node j before they can communicate outputs; but this waiting balances the previously created delay, so that node j undergoes a shorter delay. Caution: this delay balancing cannot produce a negative global delay, since a delay cannot be balanced beforehand. A recurrence is therefore necessary to estimate the final delay (that of node p):

  delay(2) = max(0, (k_1 − k_2)(α + τ_c·w) − β − λτ·Σ_{j=3}^{p} k_j),
  delay(i + 1) = max(0, delay(i) + (k_i − k_{i+1})(α + τ_c·w) + rτ·Σ_{j=1}^{i−1} k_j − λτ·Σ_{j=i+2}^{p} k_j),
  delay(p) = max(0, delay(p − 1) + (k_{p−1} − k_p)(α + τ_c·w) + β + rτ·Σ_{j=1}^{p−2} k_j).

7. HA2 is HA with computation performed by the host.
8. aHA2 is HA2 with approximate solutions of system 2.4 and possible delays.

Notes about computation. Thanks to system 2.4, all solutions are given so that Σ_{i=1}^{p} k_i is equal to a given total number of samples. Numerical resolutions have been achieved thanks to a Maple program, which provides all the k_i values and infers the theoretical parallel efficiencies of each studied algorithm with the help of the previously given formulae. Numerous experiments have been performed with 4 to 64 processors on an iPSC/860. They have corroborated the theoretical models with an average error near 2% (see [3]). Even when the host deals with a number k_h of samples (k_h depends on the inactive time of the host, that is on all the k_i), the aim is to provide a constant total number of samples. The host is then said to be active. For HA2, Σ_{i=1}^{p} k_i = s − k_h, therefore the desired k_i are found after the following steps:


More information

Novel determination of dierential-equation solutions: universal approximation method

Novel determination of dierential-equation solutions: universal approximation method Journal of Computational and Applied Mathematics 146 (2002) 443 457 www.elsevier.com/locate/cam Novel determination of dierential-equation solutions: universal approximation method Thananchai Leephakpreeda

More information

CSC242: Intro to AI. Lecture 21

CSC242: Intro to AI. Lecture 21 CSC242: Intro to AI Lecture 21 Administrivia Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM Posters Apr 24 and 26 You need an idea! You need to present it nicely on 2-wide by 4-high landscape pages

More information

Neural networks. Chapter 20, Section 5 1

Neural networks. Chapter 20, Section 5 1 Neural networks Chapter 20, Section 5 Chapter 20, Section 5 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 20, Section 5 2 Brains 0 neurons of

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA 1/ 21

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA   1/ 21 Neural Networks Chapter 8, Section 7 TB Artificial Intelligence Slides from AIMA http://aima.cs.berkeley.edu / 2 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural

More information

Introduction to Neural Networks

Introduction to Neural Networks Introduction to Neural Networks What are (Artificial) Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning

More information

Performance and Scalability. Lars Karlsson

Performance and Scalability. Lars Karlsson Performance and Scalability Lars Karlsson Outline Complexity analysis Runtime, speedup, efficiency Amdahl s Law and scalability Cost and overhead Cost optimality Iso-efficiency function Case study: matrix

More information

Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions

Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions BACK-PROPAGATION NETWORKS Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks Cannot approximate (learn) non-linear functions Difficult (if not impossible) to design

More information

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: Ryan Lowe (ryan.lowe@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted,

More information

Artificial Neural Network : Training

Artificial Neural Network : Training Artificial Neural Networ : Training Debasis Samanta IIT Kharagpur debasis.samanta.iitgp@gmail.com 06.04.2018 Debasis Samanta (IIT Kharagpur) Soft Computing Applications 06.04.2018 1 / 49 Learning of neural

More information

Laboratoire Bordelais de Recherche en Informatique. Universite Bordeaux I, 351, cours de la Liberation,

Laboratoire Bordelais de Recherche en Informatique. Universite Bordeaux I, 351, cours de la Liberation, Laboratoire Bordelais de Recherche en Informatique Universite Bordeaux I, 351, cours de la Liberation, 33405 Talence Cedex, France Research Report RR-1145-96 On Dilation of Interval Routing by Cyril Gavoille

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Learning from Examples

Learning from Examples Learning from Examples Data fitting Decision trees Cross validation Computational learning theory Linear classifiers Neural networks Nonparametric methods: nearest neighbor Support vector machines Ensemble

More information

Data redistribution algorithms for heterogeneous processor rings

Data redistribution algorithms for heterogeneous processor rings Data redistribution algorithms for heterogeneous processor rings Hélène Renard, Yves Robert, Frédéric Vivien To cite this version: Hélène Renard, Yves Robert, Frédéric Vivien. Data redistribution algorithms

More information

LIP. Laboratoire de l Informatique du Parallélisme. Ecole Normale Supérieure de Lyon

LIP. Laboratoire de l Informatique du Parallélisme. Ecole Normale Supérieure de Lyon LIP Laboratoire de l Informatique du Parallélisme Ecole Normale Supérieure de Lyon Institut IMAG Unité de recherche associée au CNRS n 1398 Inversion of 2D cellular automata: some complexity results runo

More information

Radial-Basis Function Networks

Radial-Basis Function Networks Radial-Basis Function etworks A function is radial () if its output depends on (is a nonincreasing function of) the distance of the input from a given stored vector. s represent local receptors, as illustrated

More information

4. Multilayer Perceptrons

4. Multilayer Perceptrons 4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output

More information

The impact of heterogeneity on master-slave on-line scheduling

The impact of heterogeneity on master-slave on-line scheduling Laboratoire de l Informatique du Parallélisme École Normale Supérieure de Lyon Unité Mixte de Recherche CNRS-INRIA-ENS LYON-UCBL n o 5668 The impact of heterogeneity on master-slave on-line scheduling

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

Multilayer Perceptron

Multilayer Perceptron Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Single Perceptron 3 Boolean Function Learning 4

More information

Artificial Neural Networks Examination, June 2004

Artificial Neural Networks Examination, June 2004 Artificial Neural Networks Examination, June 2004 Instructions There are SIXTY questions (worth up to 60 marks). The exam mark (maximum 60) will be added to the mark obtained in the laborations (maximum

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Last update: October 26, Neural networks. CMSC 421: Section Dana Nau

Last update: October 26, Neural networks. CMSC 421: Section Dana Nau Last update: October 26, 207 Neural networks CMSC 42: Section 8.7 Dana Nau Outline Applications of neural networks Brains Neural network units Perceptrons Multilayer perceptrons 2 Example Applications

More information

Essentials of Intermediate Algebra

Essentials of Intermediate Algebra Essentials of Intermediate Algebra BY Tom K. Kim, Ph.D. Peninsula College, WA Randy Anderson, M.S. Peninsula College, WA 9/24/2012 Contents 1 Review 1 2 Rules of Exponents 2 2.1 Multiplying Two Exponentials

More information

Fundamentals of Neural Network

Fundamentals of Neural Network Chapter 3 Fundamentals of Neural Network One of the main challenge in actual times researching, is the construction of AI (Articial Intelligence) systems. These systems could be understood as any physical

More information

Basic building blocks for a triple-double intermediate format

Basic building blocks for a triple-double intermediate format Laboratoire de l Informatique du Parallélisme École Normale Supérieure de Lyon Unité Mixte de Recherche CNRS-INRIA-ENS LYON-UCBL n o 5668 Basic building blocks for a triple-double intermediate format Christoph

More information

Classification with Perceptrons. Reading:

Classification with Perceptrons. Reading: Classification with Perceptrons Reading: Chapters 1-3 of Michael Nielsen's online book on neural networks covers the basics of perceptrons and multilayer neural networks We will cover material in Chapters

More information

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer. University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

More information

SPI. Laboratoire de l Informatique du Parallélisme

SPI. Laboratoire de l Informatique du Parallélisme Laboratoire de l Informatique du Parallélisme Ecole Normale Superieure de Lyon Unite de recherche associee au CNRS n o 198 SPI More on Scheduling Block-Cyclic Array Redistribution 1 Frederic Desprez,Stephane

More information

Neural Networks Lecture 4: Radial Bases Function Networks

Neural Networks Lecture 4: Radial Bases Function Networks Neural Networks Lecture 4: Radial Bases Function Networks H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011. A. Talebi, Farzaneh Abdollahi

More information

Ecient Higher-order Neural Networks. for Classication and Function Approximation. Joydeep Ghosh and Yoan Shin. The University of Texas at Austin

Ecient Higher-order Neural Networks. for Classication and Function Approximation. Joydeep Ghosh and Yoan Shin. The University of Texas at Austin Ecient Higher-order Neural Networks for Classication and Function Approximation Joydeep Ghosh and Yoan Shin Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX

More information

In: Advances in Intelligent Data Analysis (AIDA), International Computer Science Conventions. Rochester New York, 1999

In: Advances in Intelligent Data Analysis (AIDA), International Computer Science Conventions. Rochester New York, 1999 In: Advances in Intelligent Data Analysis (AIDA), Computational Intelligence Methods and Applications (CIMA), International Computer Science Conventions Rochester New York, 999 Feature Selection Based

More information

A Parallel Implementation of the. Yuan-Jye Jason Wu y. September 2, Abstract. The GTH algorithm is a very accurate direct method for nding

A Parallel Implementation of the. Yuan-Jye Jason Wu y. September 2, Abstract. The GTH algorithm is a very accurate direct method for nding A Parallel Implementation of the Block-GTH algorithm Yuan-Jye Jason Wu y September 2, 1994 Abstract The GTH algorithm is a very accurate direct method for nding the stationary distribution of a nite-state,

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required.

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In humans, association is known to be a prominent feature of memory.

More information

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES ITERATIVE METHODS BASED ON KRYLOV SUBSPACES LONG CHEN We shall present iterative methods for solving linear algebraic equation Au = b based on Krylov subspaces We derive conjugate gradient (CG) method

More information

Logistic Regression & Neural Networks

Logistic Regression & Neural Networks Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein Logistic Regression Perceptron & Probabilities What if we want a probability

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

Machine Learning: Multi Layer Perceptrons

Machine Learning: Multi Layer Perceptrons Machine Learning: Multi Layer Perceptrons Prof. Dr. Martin Riedmiller Albert-Ludwigs-University Freiburg AG Maschinelles Lernen Machine Learning: Multi Layer Perceptrons p.1/61 Outline multi layer perceptrons

More information

MACHINE LEARNING AND PATTERN RECOGNITION Fall 2005, Lecture 4 Gradient-Based Learning III: Architectures Yann LeCun

MACHINE LEARNING AND PATTERN RECOGNITION Fall 2005, Lecture 4 Gradient-Based Learning III: Architectures Yann LeCun Y. LeCun: Machine Learning and Pattern Recognition p. 1/3 MACHINE LEARNING AND PATTERN RECOGNITION Fall 2005, Lecture 4 Gradient-Based Learning III: Architectures Yann LeCun The Courant Institute, New

More information

Integer weight training by differential evolution algorithms

Integer weight training by differential evolution algorithms Integer weight training by differential evolution algorithms V.P. Plagianakos, D.G. Sotiropoulos, and M.N. Vrahatis University of Patras, Department of Mathematics, GR-265 00, Patras, Greece. e-mail: vpp

More information

Unit 8: Introduction to neural networks. Perceptrons

Unit 8: Introduction to neural networks. Perceptrons Unit 8: Introduction to neural networks. Perceptrons D. Balbontín Noval F. J. Martín Mateos J. L. Ruiz Reina A. Riscos Núñez Departamento de Ciencias de la Computación e Inteligencia Artificial Universidad

More information

AI Programming CS F-20 Neural Networks

AI Programming CS F-20 Neural Networks AI Programming CS662-2008F-20 Neural Networks David Galles Department of Computer Science University of San Francisco 20-0: Symbolic AI Most of this class has been focused on Symbolic AI Focus or symbols

More information

Image Reconstruction And Poisson s equation

Image Reconstruction And Poisson s equation Chapter 1, p. 1/58 Image Reconstruction And Poisson s equation School of Engineering Sciences Parallel s for Large-Scale Problems I Chapter 1, p. 2/58 Outline 1 2 3 4 Chapter 1, p. 3/58 Question What have

More information

ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

More information

Training Multi-Layer Neural Networks. - the Back-Propagation Method. (c) Marcin Sydow

Training Multi-Layer Neural Networks. - the Back-Propagation Method. (c) Marcin Sydow Plan training single neuron with continuous activation function training 1-layer of continuous neurons training multi-layer network - back-propagation method single neuron with continuous activation function

More information

`First Come, First Served' can be unstable! Thomas I. Seidman. Department of Mathematics and Statistics. University of Maryland Baltimore County

`First Come, First Served' can be unstable! Thomas I. Seidman. Department of Mathematics and Statistics. University of Maryland Baltimore County revision2: 9/4/'93 `First Come, First Served' can be unstable! Thomas I. Seidman Department of Mathematics and Statistics University of Maryland Baltimore County Baltimore, MD 21228, USA e-mail: hseidman@math.umbc.edui

More information

Neural Nets Supervised learning

Neural Nets Supervised learning 6.034 Artificial Intelligence Big idea: Learning as acquiring a function on feature vectors Background Nearest Neighbors Identification Trees Neural Nets Neural Nets Supervised learning y s(z) w w 0 w

More information

Carnegie Mellon University Forbes Ave. Pittsburgh, PA 15213, USA. fmunos, leemon, V (x)ln + max. cost functional [3].

Carnegie Mellon University Forbes Ave. Pittsburgh, PA 15213, USA. fmunos, leemon, V (x)ln + max. cost functional [3]. Gradient Descent Approaches to Neural-Net-Based Solutions of the Hamilton-Jacobi-Bellman Equation Remi Munos, Leemon C. Baird and Andrew W. Moore Robotics Institute and Computer Science Department, Carnegie

More information

Artificial Neural Networks (ANN)

Artificial Neural Networks (ANN) Artificial Neural Networks (ANN) Edmondo Trentin April 17, 2013 ANN: Definition The definition of ANN is given in 3.1 points. Indeed, an ANN is a machine that is completely specified once we define its:

More information

Kernel Methods. Barnabás Póczos

Kernel Methods. Barnabás Póczos Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels

More information

Bounding the End-to-End Response Times of Tasks in a Distributed. Real-Time System Using the Direct Synchronization Protocol.

Bounding the End-to-End Response Times of Tasks in a Distributed. Real-Time System Using the Direct Synchronization Protocol. Bounding the End-to-End Response imes of asks in a Distributed Real-ime System Using the Direct Synchronization Protocol Jun Sun Jane Liu Abstract In a distributed real-time system, a task may consist

More information

At the start of the term, we saw the following formula for computing the sum of the first n integers:

At the start of the term, we saw the following formula for computing the sum of the first n integers: Chapter 11 Induction This chapter covers mathematical induction. 11.1 Introduction to induction At the start of the term, we saw the following formula for computing the sum of the first n integers: Claim

More information

Learning and Neural Networks

Learning and Neural Networks Artificial Intelligence Learning and Neural Networks Readings: Chapter 19 & 20.5 of Russell & Norvig Example: A Feed-forward Network w 13 I 1 H 3 w 35 w 14 O 5 I 2 w 23 w 24 H 4 w 45 a 5 = g 5 (W 3,5 a

More information

Collected trivialities on algebra derivations

Collected trivialities on algebra derivations Collected trivialities on algebra derivations Darij Grinberg December 4, 2017 Contents 1. Derivations in general 1 1.1. Definitions and conventions....................... 1 1.2. Basic properties..............................

More information

How to do backpropagation in a brain

How to do backpropagation in a brain How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

More information

Long-Short Term Memory and Other Gated RNNs

Long-Short Term Memory and Other Gated RNNs Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling

More information

Scheduling divisible loads with return messages on heterogeneous master-worker platforms

Scheduling divisible loads with return messages on heterogeneous master-worker platforms Scheduling divisible loads with return messages on heterogeneous master-worker platforms Olivier Beaumont 1, Loris Marchal 2, and Yves Robert 2 1 LaBRI, UMR CNRS 5800, Bordeaux, France Olivier.Beaumont@labri.fr

More information

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

More information

Neural Networks Lecturer: J. Matas Authors: J. Matas, B. Flach, O. Drbohlav

Neural Networks Lecturer: J. Matas Authors: J. Matas, B. Flach, O. Drbohlav Neural Networks 30.11.2015 Lecturer: J. Matas Authors: J. Matas, B. Flach, O. Drbohlav 1 Talk Outline Perceptron Combining neurons to a network Neural network, processing input to an output Learning Cost

More information

Lab 5: 16 th April Exercises on Neural Networks

Lab 5: 16 th April Exercises on Neural Networks Lab 5: 16 th April 01 Exercises on Neural Networks 1. What are the values of weights w 0, w 1, and w for the perceptron whose decision surface is illustrated in the figure? Assume the surface crosses the

More information

Linearly-solvable Markov decision problems

Linearly-solvable Markov decision problems Advances in Neural Information Processing Systems 2 Linearly-solvable Markov decision problems Emanuel Todorov Department of Cognitive Science University of California San Diego todorov@cogsci.ucsd.edu

More information

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks C.M. Bishop s PRML: Chapter 5; Neural Networks Introduction The aim is, as before, to find useful decompositions of the target variable; t(x) = y(x, w) + ɛ(x) (3.7) t(x n ) and x n are the observations,

More information

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Neural Networks Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Brains as Computational Devices Brains advantages with respect to digital computers: Massively parallel Fault-tolerant Reliable

More information

The WENO Method for Non-Equidistant Meshes

The WENO Method for Non-Equidistant Meshes The WENO Method for Non-Equidistant Meshes Philip Rupp September 11, 01, Berlin Contents 1 Introduction 1.1 Settings and Conditions...................... The WENO Schemes 4.1 The Interpolation Problem.....................

More information

Introduction to Machine Learning Spring 2018 Note Neural Networks

Introduction to Machine Learning Spring 2018 Note Neural Networks CS 189 Introduction to Machine Learning Spring 2018 Note 14 1 Neural Networks Neural networks are a class of compositional function approximators. They come in a variety of shapes and sizes. In this class,

More information

Intrinsic products and factorizations of matrices

Intrinsic products and factorizations of matrices Available online at www.sciencedirect.com Linear Algebra and its Applications 428 (2008) 5 3 www.elsevier.com/locate/laa Intrinsic products and factorizations of matrices Miroslav Fiedler Academy of Sciences

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................

More information

Functional Preprocessing for Multilayer Perceptrons

Functional Preprocessing for Multilayer Perceptrons Functional Preprocessing for Multilayer Perceptrons Fabrice Rossi and Brieuc Conan-Guez Projet AxIS, INRIA, Domaine de Voluceau, Rocquencourt, B.P. 105 78153 Le Chesnay Cedex, France CEREMADE, UMR CNRS

More information

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Learning Neural Networks Classifier Short Presentation INPUT: classification data, i.e. it contains an classification (class) attribute.

More information