Matrix representation of a Neural Network

by Bjørn Christensen

June 2003

Contents

1 INTRODUCTION
2 THE WEIGHT MATRIX
  2.1 Feedforward
  2.2 Backpropagation
  2.3 Squared Error
3 REFERENCES
1 Introduction

This paper describes the implementation of a three-layer feedforward backpropagation neural network. The paper does not explain feedforward, backpropagation or what a neural network is; it is assumed that the reader knows all this. If not, please read chapters 2, 8 and 9 in Parallel Distributed Processing by David Rumelhart (Rumelhart 1986) for an easy-to-read introduction.

What the paper does explain is how a matrix representation of a neural net allows for a very simple implementation. The matrix representation is introduced in (Rumelhart 1986, chapter 9), but only for a two-layer linear network and the feedforward algorithm. This paper develops the idea further to three-layer non-linear networks and the backpropagation algorithm.

[Figure 1: layout of a three-layer network with input nodes i_0 .. i_{I-1} plus bias node i_I = 1, hidden nodes h_0 .. h_{J-1} plus bias node h_J = 1, and output nodes o_0 .. o_{K-1}.]

Figure 1 shows the layout of a three-layer network. There are I input nodes, J hidden nodes and K output nodes, all indexed from 0. The bias node for the hidden nodes is called i_I (an extra input node fixed at 1), and the bias node for the output nodes is called h_J (an extra hidden node fixed at 1).
2 The Weight Matrix

In the matrix representation the input, hidden and output nodes are represented by three vectors i, h and o respectively. The weights connecting each layer are represented by a matrix: V connects the input layer with the hidden layer, and W connects the hidden layer with the output layer. See Figure 2.

[Figure 2: the weight matrix V (entries v_{j,i}) connects the input nodes i_0 .. i_{I-1} and the bias node i_I = 1 to the hidden nodes h_0 .. h_{J-1}; the weight matrix W (entries w_{k,j}) connects the hidden nodes h_0 .. h_{J-1} and the bias node h_J = 1 to the output nodes o_0 .. o_{K-1}.]

Definitions:

i  - input nodes vector, dimension I, indexed by i ∈ [0, I-1]
h  - hidden nodes vector, dimension J, indexed by j ∈ [0, J-1]
o  - output nodes vector, dimension K, indexed by k ∈ [0, K-1]
ib - input nodes vector incl. bias, dimension I+1, indexed by i ∈ [0, I]
hb - hidden nodes vector incl. bias, dimension J+1, indexed by j ∈ [0, J]
V  - weight matrix from ib to h, dimension J × (I+1); V_{j,i} is the weight from ib_i to h_j
W  - weight matrix from hb to o, dimension K × (J+1); W_{k,j} is the weight from hb_j to o_k

Bold denotes a vector or a matrix in the text.
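To make the definitions concrete, the following minimal Python/NumPy sketch sets up vectors and weight matrices with the dimensions given above. The array names, example sizes and random initialisation are my own assumptions, not part of the paper.

    import numpy as np

    # Assumed example sizes: I input nodes, J hidden nodes, K output nodes.
    I, J, K = 4, 3, 2

    rng = np.random.default_rng(0)
    V = rng.uniform(-0.5, 0.5, size=(J, I + 1))   # weights from ib (incl. bias) to h
    W = rng.uniform(-0.5, 0.5, size=(K, J + 1))   # weights from hb (incl. bias) to o

    i = np.zeros(I)                 # input nodes vector
    ib = np.append(i, 1.0)          # input vector incl. bias node ib_I = 1
    h = np.zeros(J)                 # hidden nodes vector
    hb = np.append(h, 1.0)          # hidden vector incl. bias node hb_J = 1
    o = np.zeros(K)                 # output nodes vector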
2.1 Feedforward

The feedforward algorithm produces an output vector given an input vector. As described in (Callan 1999, chapter 1), the input to hidden node h_j is

    \text{net}_j = \sum_{i=0}^{I-1} v_{j,i} \, i_i    (1)

The naming convention of Callan is changed to match the definitions in this paper, which follow (Rumelhart 1986). Note that Callan has interchanged the indices in the weight matrix. The value of hidden node h_j is

    h_j = f(\text{net}_j)    (2)

where f() is the activation function defined as

    f(x) = \frac{1}{1 + e^{-x}}    (3)

Using matrix-vector multiplication, the value of all hidden nodes h can be calculated in a single operation

    h = F(V \, ib)    (4)

where F() is the vector function that applies f() to all elements of its argument. Similarly the output vector is calculated from

    o = F(W \, hb)    (5)

ib and hb are produced from i and h by appending the bias value 1:

    ib_i = i_i \text{ for } i = 0..I-1, \quad ib_I = 1    (6)

    hb_j = h_j \text{ for } j = 0..J-1, \quad hb_J = 1    (7)
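Equations (1) to (7) translate directly into matrix-vector products. The sketch below is one possible NumPy reading of the feedforward pass; the function and variable names are my own, not taken from the paper.

    import numpy as np

    def sigmoid(x):
        # Activation function f(x) = 1 / (1 + e^(-x)), equation (3), applied element-wise.
        return 1.0 / (1.0 + np.exp(-x))

    def feedforward(i, V, W):
        # V has shape (J, I+1) and W has shape (K, J+1); the last column of
        # each matrix holds the bias weights.
        ib = np.append(i, 1.0)     # equation (6): append bias node ib_I = 1
        h = sigmoid(V @ ib)        # equation (4): h = F(V ib)
        hb = np.append(h, 1.0)     # equation (7): append bias node hb_J = 1
        o = sigmoid(W @ hb)        # equation (5): o = F(W hb)
        return h, o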
2.2 Backpropagation

The backpropagation algorithm calculates the changes of the weights based on the error between the output o, as calculated by the feedforward algorithm, and the desired target output value t.

Definition:

t - target vector, dimension K, indexed by k ∈ [0, K-1]

From equation 11 in (Rumelhart 1986, chapter 8), we have the change of each weight:

    \Delta W_{k,j} = \eta \, \delta_k \, h_j    (8)

where \eta is the learning rate, \delta_k is the error signal from output node o_k, and h_j is the value of node h_j. This formula applies to all weights in the network. The error signals from the output layer and the hidden layer are calculated differently. Rumelhart defines the error signals as follows.

Error signal from output node k:

    \delta o_k = o_k (1 - o_k)(t_k - o_k)    (9)

Error signal from hidden node j:

    \delta h_j = h_j (1 - h_j) \sum_{k=0}^{K-1} \delta o_k \, W_{k,j}    (10)

Definitions:

\delta o - error signal from output nodes, dimension K, indexed by k ∈ [0, K-1]
\delta h - error signal from hidden nodes, dimension J, indexed by j ∈ [0, J-1]

Now we are ready to calculate the weight change \Delta W for all weights in a single operation:

    \Delta W =
    \begin{pmatrix}
    \Delta W_{0,0} & \Delta W_{0,1} & \cdots & \Delta W_{0,J} \\
    \Delta W_{1,0} & \Delta W_{1,1} & \cdots & \Delta W_{1,J} \\
    \vdots & & & \vdots \\
    \Delta W_{K-1,0} & \Delta W_{K-1,1} & \cdots & \Delta W_{K-1,J}
    \end{pmatrix}
    = \eta
    \begin{pmatrix}
    \delta o_0 \, hb_0 & \delta o_0 \, hb_1 & \cdots & \delta o_0 \, hb_J \\
    \delta o_1 \, hb_0 & \delta o_1 \, hb_1 & \cdots & \delta o_1 \, hb_J \\
    \vdots & & & \vdots \\
    \delta o_{K-1} \, hb_0 & \delta o_{K-1} \, hb_1 & \cdots & \delta o_{K-1} \, hb_J
    \end{pmatrix}
    = \eta \, \delta o \otimes hb    (11)

So \Delta W is \eta times the outer product of \delta o and hb; \otimes is the operator for the outer product. Similarly for V we get

    \Delta V = \eta \, \delta h \otimes ib    (12)
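In NumPy the outer product of equations (11) and (12) is np.outer, so the whole weight-change matrix can be formed in a single call. A small sketch of the update for the output-layer weights, with names that are my own assumptions:

    import numpy as np

    def output_weight_change(o, t, hb, eta):
        # Equation (9): error signal from the output nodes, element-wise products.
        delta_o = o * (1.0 - o) * (t - o)
        # Equation (11): Delta W = eta * (delta_o outer hb), shape (K, J+1).
        delta_W = eta * np.outer(delta_o, hb)
        return delta_o, delta_W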
We need to express \delta o and \delta h in vector form. Let us define a vector multiplication operator % that multiplies the components of vectors x and y element by element:

    x \% y = (x_1 y_1, \; x_2 y_2, \; \ldots, \; x_n y_n)    (13)

Calculating \delta o is easy:

    \delta o = o \% (1 - o) \% (t - o)    (14)

where 1 is a vector of ones with the same dimension as o. Calculating \delta h requires a little more work. Let us define a vector s' as:

    s' = W^T \delta o    (15)

W^T is the transpose of W. Expanding W^T and \delta o gives us:

    s' =
    \begin{pmatrix}
    W_{0,0} & W_{1,0} & \cdots & W_{K-1,0} \\
    W_{0,1} & W_{1,1} & \cdots & W_{K-1,1} \\
    \vdots & & & \vdots \\
    W_{0,J} & W_{1,J} & \cdots & W_{K-1,J}
    \end{pmatrix}
    \begin{pmatrix}
    \delta o_0 \\ \delta o_1 \\ \vdots \\ \delta o_{K-1}
    \end{pmatrix}
    =
    \begin{pmatrix}
    \delta o_0 W_{0,0} + \delta o_1 W_{1,0} + \ldots + \delta o_{K-1} W_{K-1,0} \\
    \delta o_0 W_{0,1} + \delta o_1 W_{1,1} + \ldots + \delta o_{K-1} W_{K-1,1} \\
    \vdots \\
    \delta o_0 W_{0,J} + \delta o_1 W_{1,J} + \ldots + \delta o_{K-1} W_{K-1,J}
    \end{pmatrix},
    \quad s'_j = \sum_{k=0}^{K-1} \delta o_k W_{k,j}    (16)

The component s'_j is exactly the sum appearing in the error signal \delta h_j of equation (10). Note that we do not need an error signal for hb_J since it is a bias node. Therefore we skip the last component of s' and create a vector s defined as:

    s_j = s'_j \text{ for } j = 0..J-1    (17)

The dimension of s is J. Finally we can write \delta h as:

    \delta h = h \% (1 - h) \% s    (18)

Here 1 has the same dimension as h. Let us sum up the important equations of the backpropagation algorithm.

Error signal from output nodes:

    \delta o = o \% (1 - o) \% (t - o)    (19)

Changes of weights connected to output nodes:

    \Delta W = \eta \, \delta o \otimes hb    (20)

Propagated errors from output nodes:

    s' = W^T \delta o, \quad s_j = s'_j \text{ for } j = 0..J-1    (21)

Error signal from hidden nodes:

    \delta h = h \% (1 - h) \% s    (22)

Changes of weights connected to hidden nodes:

    \Delta V = \eta \, \delta h \otimes ib    (23)
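Putting equations (19) to (23) together, one backpropagation step for a single pattern can be written as a handful of NumPy operations. This is a sketch under the same assumptions as the earlier snippets; the element-wise operator % corresponds to ordinary * on NumPy arrays.

    import numpy as np

    def backprop_step(i, h, o, t, V, W, eta):
        # Returns the weight changes Delta W and Delta V for one pattern,
        # given the vectors h and o produced by the feedforward pass.
        ib = np.append(i, 1.0)                    # input vector incl. bias
        hb = np.append(h, 1.0)                    # hidden vector incl. bias
        delta_o = o * (1.0 - o) * (t - o)         # (19) error signal, output nodes
        delta_W = eta * np.outer(delta_o, hb)     # (20) outer product delta_o x hb
        s = (W.T @ delta_o)[:-1]                  # (21) propagated errors, bias component dropped
        delta_h = h * (1.0 - h) * s               # (22) error signal, hidden nodes
        delta_V = eta * np.outer(delta_h, ib)     # (23) outer product delta_h x ib
        return delta_W, delta_V

    # Usage: after a feedforward pass, apply the changes with W += delta_W and V += delta_V.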
2.3 Squared Error

The squared error is a measure of the correctness of the network. It is calculated for each input pattern p. Rumelhart defines the squared error as:

    E_p = \frac{1}{2} \sum_{k=0}^{K-1} (t_k - o_k)^2    (24)

In vector notation, using the inner product, this is

    E_p = \frac{1}{2} (t - o) \cdot (t - o)    (25)

Summing the squared errors for all patterns, we get the squared error E for an epoch:

    E = \sum_{p=1}^{n} E_p    (26)

where n is the number of patterns in the training set.

3 References

Rumelhart 1986: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. David E. Rumelhart, James L. McClelland. MIT Press, 1986.

Callan 1999: The Essence of Neural Networks. Robert Callan. Prentice Hall, 1999.