
A FPGA Implementation of Large Restricted Boltzmann Machines

by

Charles Lo

Supervisor: Paul Chow
April 2010

Abstract

A FPGA Implementation of Large Restricted Boltzmann Machines
Charles Lo
Engineering Science 2010

Restricted Boltzmann Machines (RBMs) are a type of Artificial Neural Network and the fundamental building blocks of Deep Belief Networks (DBNs) [1]. DBNs have been successfully applied to a number of machine learning problems [2, 3, 1]. However, the O(n^2) complexity of training a RBM presents a serious impediment to their use in large applications. Attempts have been made to accelerate the process using custom FPGA hardware [4, 5], but no implementation has been demonstrated to run RBMs with the number of nodes necessary for real-world applications. This thesis builds upon a virtualized FPGA architecture presented by Ly et al. [4] with the goal of investigating its scalability towards large RBMs. The virtualized architecture time-multiplexes the hardware resources of a single FPGA to implement large virtual RBMs. To maintain the performance gain of the custom hardware in the presence of context switches, a number of approaches were used: the architecture was ported to a faster, more modern FPGA; the data representation was reduced from 32 bits to 16 bits to increase communication throughput; and coarse-grain parallelism was provided by extending the architecture to four FPGAs. A sequential benchmark written in C was used to test the performance of the architecture. The analysis shows a strong dependence of performance on the communication overhead between the supervising microprocessor and the hardware cores. Although very little speedup is possible with the implementation presented, this thesis provides a direction for further improvements to the architecture.

Acknowledgements

I would like to express my gratitude to Professor Paul Chow for giving me the opportunity to work on this project as well as for his guidance over the course of this thesis. I would also like to thank Daniel Ly for helping to define the direction of this thesis and always being available to answer my questions. Finally, I am grateful to Chris Madill, Arun Patel, Manuel Saldaña, Geng Liu and Chu Pang for their assistance during the past year.

Contents

1 Introduction
2 Background
  2.1 Restricted Boltzmann Machine Operation
    2.1.1 Structure
    2.1.2 Alternating Gibbs Sampling
    2.1.3 Energy
    2.1.4 Learning Rules
    2.1.5 Batch Learning
    2.1.6 Summary
    2.1.7 Deep Belief Networks
  2.2 Methods for accelerating Restricted Boltzmann Machines
3 Virtualized FPGA Architecture
  3.1 Partitioning
  3.2 Computational Cores
    3.2.1 Restricted Boltzmann Machine Core
    3.2.2 Energy Accumulator Core and Node Select Core
  3.3 Message Passing Interface
4 Large Restricted Boltzmann Machine Architecture
  4.1 Investigation of Data Bit Widths
  4.2 Memory and Communication Considerations
    4.2.1 Data Storage
    4.2.2 Communication Overhead
  4.3 Extension to Four FPGAs
5 Results and Analysis
  5.1 Test Methods
    5.1.1 Test Setup
    5.1.2 Test Metric
  5.2 Results
    5.2.1 Batch Size vs. Speedup
    5.2.2 Intrinsic RBM Size
    5.2.3 Summary
6 Conclusion
  6.1 Conclusions
  6.2 Future Work
    6.2.1 Weight Matrix Caching
    6.2.2 Distributed Energy Accumulator Core Structure
Bibliography
A Outline of MicroBlaze Operation

List of Tables

5.1 Summary of Performance Measurements

List of Figures

2.1 Structure of a 3x3 Restricted Boltzmann Machine
2.2 Sigmoid Function
2.3 Layout of a Deep Belief Network
3.1 Weight Distribution in BRAM
3.2 Structure of the Virtualized Restricted Boltzmann Machine Architecture
4.1 MicroBlaze PLB Connectivity
4.2 Weight distribution of eight partitions among four FPGAs
4.3 Overall Layout in Four FPGA System. All MPI Ranks are interconnected
5.1 Mini-batch size vs. speedup
5.2 Mini-batch size vs. speedup without node calculation
5.3 Mini-batch size vs. speedup for a virtual 512x512 network with intrinsic RBM sizes of 64 and 128

Chapter 1
Introduction

The paradigm of machine learning deals with methods that allow a computer to extract complex patterns underlying data. The applications of such methods are extensive, including visual pattern recognition, speech recognition and video game artificial intelligence.

One popular method of machine learning is the use of artificial neural networks (ANNs). Such networks roughly model the structure of the biological neural networks in the brain, in that they consist of many simple neurons operating in parallel, connected together through weighted relationships. The activation of each neuron, dependent on the weights and the states of connected neurons, determines the reaction of the network to some input. By controlling the value of the weights, the network can be trained to recognize certain patterns or features of a dataset.

Many different types of ANNs exist with different network topologies, activation functions and learning algorithms. A particularly popular architecture is the Restricted Boltzmann Machine (RBM), a stochastic, generative model that has proven to perform well in problems such as face recognition [6]. Recently, it has been shown that when several RBMs are stacked together to form a Deep Belief Network (DBN), an efficient learning algorithm exists to train the entire network [1]. DBNs have the benefit of being able to learn more complex features and have been applied to problems of generating facial expressions [2], semantic hashing of documents [3] and recognition of handwritten digits [1].

Although the learning algorithm is relatively efficient, training the large networks required for the real-world applications above can still take several days or weeks on a general purpose desktop computer [6]. The parallel nature of the RBM architecture makes it very tractable by hardware implementations, and several groups have created FPGA and GPU based RBM solutions providing much needed speed-up [7, 8, 5]. In particular, Ly et al.'s FPGA architecture has produced a 145x speed-up relative to a desktop PC [7]. However, it has only implemented relatively small RBM networks of 256x256 neurons, whereas real world applications require much larger networks. For example, the DBN used to recognize handwritten digits [1] contained a RBM of size 2000x510.

The goal of my thesis is to scale up Ly et al.'s FPGA architecture [4] to be capable of handling the thousands of neurons necessary in real world DBN applications while maintaining maximum performance. The main bottleneck limiting the current implementation size of the FPGA architecture is the size of the weight matrix. This data structure is necessarily large since each node must be connected through a weight to all of the nodes in the next layer due to the bipartite graph organization of the RBM. To allow for larger networks, my project will first involve adapting the FPGA architecture to a larger, faster FPGA platform. This allows for the possibility of higher clock speeds as well as better interconnects between FPGAs and thus greater performance. In addition, I will investigate the effect of decreasing the bit width of the weights. This could allow more weights to be stored on-chip and an increase in communication bandwidth between computational cores, provided that the network is trainable at lower precision. Finally, by time-multiplexing the resources of four FPGAs, I hope to accelerate the training performance of arbitrary size networks.

Chapter 2
Background

2.1 Restricted Boltzmann Machine Operation

Artificial Neural Networks (ANNs) are models of biological neural networks that rely on the interactions between simple units called neurons or nodes to perform computations. By modifying connections between nodes, ANNs may be taught to model patterns in a set of training data [9]. A Restricted Boltzmann Machine (RBM) [10, 11] is a type of ANN that has become recently popular due to its role as a building block in Deep Belief Networks (DBNs). An interesting property of RBMs is that they are taught to reproduce the training data they are given. The internal model they create allows them to generate new data statistically similar to the training set, and thus a RBM is said to be a generative neural network.

2.1.1 Structure

The RBM consists of two layers of nodes: a visible layer representing the input to the network, and a hidden layer. Each node is connected to all of the nodes of the opposite layer through a weighted connection. The real valued weights on the connections are the learning parameters of the RBM and it is through their adjustment that the network may be trained. We will denote the weight connecting visible node i to hidden node j as w_{i,j}.

[Figure 2.1: Structure of a 3x3 Restricted Boltzmann Machine]

The restriction that nodes of the same type are not interconnected allows for the development of a fast learning algorithm and is one of the properties that separates the RBM from general Boltzmann Machines. Generally, the node states are binary valued. However, some applications benefit from having real valued visible nodes to represent data such as greyscale images. With this topology in mind, we can write the elements of a RBM in matrix notation:

W = \begin{bmatrix} w_{0,0} & \cdots & w_{0,J-1} \\ \vdots & \ddots & \vdots \\ w_{I-1,0} & \cdots & w_{I-1,J-1} \end{bmatrix}    (2.1)

V = [v_0 \cdots v_{I-1}]    (2.2)

H = [h_0 \cdots h_{J-1}]    (2.3)

The RBM is a stochastic neural network in that its node states are determined through a probabilistic function rather than a deterministic one. The function used in a RBM is the sigmoid or logistic function [10] (Fig. 2.2). Thus, given that the opposite layer is determined, the probability of activating a node may be calculated as the logistic function of its weighted inputs:

P(v_i = 1) = \frac{1}{1 + \exp\left(-\sum_{j=0}^{J-1} h_j w_{i,j}\right)}    (2.4)

P(h_j = 1) = \frac{1}{1 + \exp\left(-\sum_{i=0}^{I-1} v_i w_{i,j}\right)}    (2.5)

[Figure 2.2: Sigmoid Function (node activation probability vs. node energy)]

2.1.2 Alternating Gibbs Sampling

We can now describe the main operating mode of a RBM, called Alternating Gibbs Sampling (AGS). Looking at Eqns. 2.4 and 2.5, the node states for one layer can be determined as long as the other layer is fixed. In the first AGS phase, the hidden layer is generated based on test data on the visible nodes; the second AGS phase reconstructs the test data by clamping the hidden nodes and stochastically finding the states of the visible nodes. This process can continue to higher order AGS phases as each layer is clamped or determined in turn. To keep track of node states we will denote the AGS phase as a superscript; for example, V^3 would represent the visible layer node states at the third AGS phase.
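As a concrete illustration of Eqns. 2.4 and 2.5 and of one AGS half-step, the following C sketch computes the weighted input of each hidden node from the visible states and samples the hidden state from the resulting logistic probability. It is only a software model of the computation under the binary-node assumption; the function and variable names are my own, not taken from any implementation in this thesis.

#include <math.h>
#include <stdlib.h>

/* One AGS half-step: given visible states v[0..I-1] and weights w[i][j],
 * sample hidden states h[0..J-1] according to Eqn. 2.5.
 * Names and data layout are illustrative, not taken from the thesis code. */
void ags_half_step(int I, int J, const int *v, const double *w, int *h)
{
    for (int j = 0; j < J; j++) {
        double energy = 0.0;                      /* E_j = sum_i v_i * w_{i,j}      */
        for (int i = 0; i < I; i++)
            energy += v[i] * w[i * J + j];

        double p = 1.0 / (1.0 + exp(-energy));    /* logistic function, Eqn. 2.5    */
        double u = (double)rand() / RAND_MAX;     /* uniform random number          */
        h[j] = (u < p) ? 1 : 0;                   /* stochastic node state selection */
    }
}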

2.1.3 Energy

The Restricted Boltzmann Machine draws inspiration from the Boltzmann Distribution of statistical mechanics, which describes the probability distribution over the states of a system [12]. The state of a RBM is defined by its visible and hidden layers. Thus, given a certain configuration of visible and hidden nodes and fixing the weights, we can define an energy:

E(V, H) = -\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} v_i h_j w_{i,j}    (2.6)

From the Boltzmann Distribution:

P(V, H) \propto \exp(-E(V, H))    (2.7)

The goal of learning in an RBM is to model the training set; this can be accomplished by modifying the weights such that we obtain a Boltzmann distribution in which the probability of configurations containing the training vectors is maximized. Looking at the previous equation, we see that to maximize the probability of a training vector P(V) we need to minimize the energy associated with its configurations.

In addition, from the energy equation, we can see that for each visible or hidden node there is an associated local energy:

E(V, H) = -\sum_{i=0}^{I-1} v_i E_i = -\sum_{j=0}^{J-1} h_j E_j    (2.8)

E_i = \sum_{j=0}^{J-1} h_j w_{i,j}    (2.9)

E_j = \sum_{i=0}^{I-1} v_i w_{i,j}    (2.10)

The local energy for a given node is in fact the weighted sum of the states of the nodes from the opposite layer. Therefore, we can rewrite Eqns. 2.4 and 2.5 as:

P(v_i = 1) = \frac{1}{1 + e^{-E_i}}    (2.11)

P(h_j = 1) = \frac{1}{1 + e^{-E_j}}    (2.12)

We can write these local energies more succinctly as members of vectors:

E_V = [E_0 \cdots E_{I-1}] = H W^T    (2.13)

E_H = [E_0 \cdots E_{J-1}] = V W    (2.14)

Thus the node state vectors V and H are functions of the energy vectors E_V and E_H respectively.

2.1.4 Learning Rules

Given this concept of state energy, we can find the learning rule for a RBM by differentiating the log probability of obtaining a particular visible layer configuration:

\frac{\partial \log P(V)}{\partial w_{i,j}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty    (2.15)

where \langle a \rangle^n denotes the expected value of a at the nth AGS phase. From this equation we can see that to increase the probability of training vectors, we can apply the following weight update rule:

\Delta w_{i,j} = \epsilon (\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty)    (2.16)

where \epsilon is the learning rate. This learning rate must be carefully controlled: large weight updates may overshoot the energy minima, while small updates take too long to reach them. One solution is to dynamically decrease the learning rate during training using the process of simulated annealing [12].

Clearly, getting a sample from the infinite AGS phase is not feasible in computation time. However, it has been shown that we can estimate the infinite phase with a finite one; this is called contrastive divergence (CD) learning [13]. Using CD we no longer perform true gradient descent in weight space, but it has been shown to work well even with as few as three AGS phases.

2.1.5 Batch Learning

To have weight updates which represent the entire set of training data, it would be best to calculate the average weight update for the entire training set before committing the change. This type of weight update is called batch learning. For large sets, batch learning would result in long computation times between weight updates. To address this problem, we can reduce the batch size and create mini-batches to increase the update rate at the expense of update precision. At the limit of one training vector per weight update, we are performing on-line learning. For a batch size of L, the learning rule becomes:

\Delta w_{i,j} = \frac{\epsilon}{L} \sum_{l=0}^{L-1} (\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty)    (2.17)

2.1.6 Summary

The following procedure describes the RBM training operation with three AGS phases and a mini-batch size of L. A software sketch of this loop is given below.

1. Apply a training vector to the visible layer. This becomes V^1 in the next step.
2. AGS Phase 1: Compute the local energy E_H^1 = V^1 W and apply the logistic function to find the node states H^1 = f(E_H^1).
3. Increment the weight update: \Delta w_{i,j} = \Delta w_{i,j} + (V^1)^T H^1
4. AGS Phase 2: Compute the local energy E_V^2 = H^1 W^T and apply the logistic function to find the node states V^2 = f(E_V^2).
5. AGS Phase 3: Compute the local energy E_H^3 = V^2 W and apply the logistic function to find the node states H^3 = f(E_H^3).
6. Decrement the weight update: \Delta w_{i,j} = \Delta w_{i,j} - (V^2)^T H^3
7. Repeat steps 1-6 for each training vector in the mini-batch.
8. Commit the weight update: w_{i,j} = w_{i,j} + (\epsilon / L) \Delta w_{i,j}
9. Repeat steps 1-8 for each mini-batch in the training set.

It should be noted that the computation of the energies in each AGS phase is of complexity O(n^2) and the weight update computation is also O(n^2). This makes training RBMs with thousands of nodes a very time consuming process.
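The nine steps above translate directly into software. The listing below is a plain C rendering of the procedure for a square n x n RBM with binary nodes, written purely for illustration in the spirit of the sequential benchmark used later; the names and data layout are my own assumptions.

#include <math.h>
#include <stdlib.h>

/* Illustrative CD training for an n x n RBM with binary nodes.
 * W and dW are row-major n*n arrays; eps is the learning rate. */

static double logistic(double e) { return 1.0 / (1.0 + exp(-e)); }

static int sample(double p) { return ((double)rand() / RAND_MAX) < p; }

/* Compute one layer from the other: if to_hidden, H = f(V W); otherwise V = f(H W^T). */
static void ags_phase(int n, const double *W, const int *from, int *to, int to_hidden)
{
    for (int k = 0; k < n; k++) {
        double e = 0.0;
        for (int m = 0; m < n; m++)
            e += from[m] * (to_hidden ? W[m * n + k] : W[k * n + m]);
        to[k] = sample(logistic(e));
    }
}

/* Steps 1-8: one mini-batch of L training vectors (CD with three AGS phases). */
void train_minibatch(int n, double *W, double *dW, const int *batch, int L, double eps)
{
    for (int i = 0; i < n * n; i++) dW[i] = 0.0;

    int *h1 = malloc(n * sizeof *h1);
    int *v2 = malloc(n * sizeof *v2);
    int *h3 = malloc(n * sizeof *h3);

    for (int l = 0; l < L; l++) {
        const int *v1 = batch + l * n;              /* step 1: training vector V^1  */
        ags_phase(n, W, v1, h1, 1);                 /* step 2: AGS phase 1 -> H^1   */
        ags_phase(n, W, h1, v2, 0);                 /* step 4: AGS phase 2 -> V^2   */
        ags_phase(n, W, v2, h3, 1);                 /* step 5: AGS phase 3 -> H^3   */
        for (int i = 0; i < n; i++)                 /* steps 3 and 6: accumulate    */
            for (int j = 0; j < n; j++)             /* (V^1)^T H^1 - (V^2)^T H^3    */
                dW[i * n + j] += v1[i] * h1[j] - v2[i] * h3[j];
    }
    for (int i = 0; i < n * n; i++)                 /* step 8: commit the update    */
        W[i] += (eps / L) * dW[i];

    free(h1); free(v2); free(h3);
}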

2.1.7 Deep Belief Networks

The Restricted Boltzmann Machine is powerful in itself at extracting features from test data. However, it becomes even more useful as part of a Deep Belief Network (DBN). In effect, DBNs consist of multiple RBMs stacked upon each other; the hidden nodes of one layer become the visible nodes of the next, as in Fig. 2.3. The additional layers of hidden nodes are used to model patterns within the patterns generated by earlier layers. Thus, the DBN is able to model more complex features in data.

[Figure 2.3: Layout of a Deep Belief Network]

What is interesting about these deep networks is that they can be greedily trained layer by layer using the same efficient algorithm presented above [1]. To get better classification or generative properties, additional training can be performed using wake-sleep or backpropagation algorithms.

2.2 Methods for accelerating Restricted Boltzmann Machines

A number of computationally intensive operations need to be performed during RBM training. In calculating the local energies, a vector-matrix multiplication must be performed, as well as a matrix transposition during even AGS phases. In addition, to evaluate the node states, the non-linear logistic function must be evaluated. These operations can be slow on sequential general purpose processors. However, there have been a number of attempts to accelerate the process.

Of particular interest are three published implementations: one design using the inherent parallelism in Graphics Processing Units (GPUs) [8] and two custom hardware designs implemented on Field Programmable Gate Arrays (FPGAs) [5, 4].

Modern GPUs offer several layers of parallelism much greater than standard multi-core CPUs, allowing them to operate on large batches of data at once. In addition, optimized linear algebra packages are available for them. Raina et al. [8] used an NVIDIA GTX 280 GPU with 1GB of RAM and the CUDA application layer to accelerate RBM operations and build deep belief networks. On a single RBM of 4096x11008 size, they achieved a speed-up of 72.6x over a software implementation using the optimized matrix operation library Goto BLAS running on a 3.16GHz dual-core processor. To minimize data transfer of weights in DBNs, they developed the idea of overlapping patches. By representing the visible layer as a 2D surface and tiling patches across it, they were able to create local connections between hidden layers where the patches overlapped. Using this method, they were able to build 4-layer DBNs with 96 million parameters. However, the amount of overlapping area decreases as the order of overlap increases, so this method is inherently limited to DBNs of decreasing size for higher layers. In addition, the layers are not fully connected with the overlapping patches method, so this implementation is limited to a subset of DBN problems.

Kim et al. [5] developed a hardware implementation of an RBM on an Altera Stratix III EP3SL340 FPGA. In this design the authors decided to use 16-bit fixed point words to represent the weights, energies and visible node states. The main computational cores of this design were partitioned into groups of adders and multipliers to perform the vector-matrix operation of the local energy calculation. To perform the energy calculation for all of the nodes of a given layer in parallel, all of the row or column elements of the weight matrix must be available at the same time and thus must be stored on separately addressable memory elements. To address this, the authors stored each column of the weight matrix in a separate memory block such that a single row was available at a time. This allowed the visible energies to be calculated simply using a multiplier and tree adder.

Then, by using an accumulator structure to calculate the hidden energies, they did not have to modify their memory structure. To compute the logistic function, a Piecewise Linear Approximation of a Nonlinear function (PLAN) was implemented. When benchmarked against a software implementation running on a 2.4GHz Intel Core 2 system, they achieved a speed-up of 25x over single precision MATLAB code and 30x over double precision. The maximum network size achieved was 512x512.

The final FPGA implementation, by Ly et al. [4], was developed on a Berkeley Emulation Engine 2 hardware platform consisting of five interconnected Virtex-II Pro XC2VP70 FPGAs. In this design, a set of tree adders was used to calculate the visible and hidden energies. The problem of weight addressing was alleviated by storing diagonal sections of the matrix in different memory blocks. In this way, the same set of memory blocks could be used to access a row or column of the weight matrix. The logistic function was performed using a piecewise linear interpolator. Some significant differences from the FPGA design by Kim et al. are that the weights and energies are represented as 32-bit fixed point numbers rather than 16-bit ones, and that the visible nodes can only be binary valued, whereas they are real valued in Kim et al.'s design. Three different designs were presented: one on a single FPGA running a 128x128 RBM, one using coarse grain parallelism across four FPGAs to run a 256x256 RBM, and one time-multiplexing the resources of a single FPGA to realize a 256x256 network. The speed-ups obtained were 61x, 145x and 32x respectively over an optimized C implementation running on a 2.8GHz Pentium 4 processor.

The works described here did not use a common benchmark, so it is difficult to compare performance directly. The GPU implementation has a clear advantage in network size, but the limitations of its overlapping patches technique make it unusable for large, general DBNs. Of the two FPGA implementations, the one by Ly et al. has a clear performance advantage, especially considering that it is implemented on older FPGA hardware. Notably, no designs have been published implementing real world DBN applications.

Chapter 3
Virtualized FPGA Architecture

The work in this thesis is built on top of the virtualized FPGA RBM architecture designed by Ly et al. [4]. In this chapter, some important aspects of the architecture will be discussed.

Custom FPGA hardware cores are able to perform the computations involved in RBM training very quickly, but a FPGA has a finite amount of resources. Therefore, the size of the network a single FPGA can work on is limited. One way to increase the workable network size is by simply adding more FPGAs. However, as the size of the application grows, this method quickly becomes cost and power prohibitive. A better approach is to time-multiplex the hardware to handle problems of almost arbitrary size. The tradeoff in this approach is that a context switch is required to work on different portions of the network. The virtualized RBM architecture that this thesis is based on uses the time-multiplexing approach to work on networks whose size would not normally fit on a single FPGA.

3.1 Partitioning

To use a virtualized system for performing Restricted Boltzmann Machine operations, the computations must first be partitioned into independent work units. By partitioning the visible and hidden vectors into A and B parts respectively, the weight matrix can be broken into a group of block matrices:

W = \begin{bmatrix} W_{0,0} & \cdots & W_{0,B-1} \\ \vdots & \ddots & \vdots \\ W_{A-1,0} & \cdots & W_{A-1,B-1} \end{bmatrix}    (3.1)

V = [V_0 \cdots V_{A-1}]    (3.2)

H = [H_0 \cdots H_{B-1}]    (3.3)

The energy calculation then becomes:

E_H = V W = [E_{H_0} \cdots E_{H_{B-1}}], \quad E_{H_j} = V_0 W_{0,j} + \cdots + V_{A-1} W_{A-1,j}    (3.4)

E_V = H W^T = [E_{V_0} \cdots E_{V_{A-1}}], \quad E_{V_i} = H_0 W_{i,0}^T + \cdots + H_{B-1} W_{i,B-1}^T    (3.5)

In this configuration, the energies E_{H_j} and E_{V_i} required to calculate a block of node states are now divided into a number of partial energy blocks involving the calculation V_i W_{i,j} or H_j W_{i,j}^T. These partial energy calculations can be done independently and later recombined to resolve the node states. Also, by partitioning the weights in this way, the weight update calculations may be performed on each weight matrix W_{i,j} independently as well.
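To illustrate Eqn. 3.4 in software (this is not the hardware flow itself), the sketch below computes each partial energy V_a W_{a,b} independently and accumulates it into the full hidden energy vector; the block computations could equally be performed in any order, or on different devices, before being recombined.

/* Illustrative partial-energy computation for Eqn. 3.4.
 * n is the full layer size, p the partition (block) size, so A = B = n/p.
 * W is row-major n*n; V holds binary visible states; EH receives hidden energies. */
void hidden_energies_partitioned(int n, int p, const double *W, const int *V, double *EH)
{
    for (int j = 0; j < n; j++) EH[j] = 0.0;

    for (int a = 0; a < n / p; a++) {          /* block row: partition V_a        */
        for (int b = 0; b < n / p; b++) {      /* block column: partition W_{a,b} */
            /* Partial energy V_a * W_{a,b}, accumulated into E_{H_b}.            */
            for (int j = 0; j < p; j++) {
                double partial = 0.0;
                for (int i = 0; i < p; i++)
                    partial += V[a * p + i] * W[(a * p + i) * n + (b * p + j)];
                EH[b * p + j] += partial;      /* recombination of partial energies */
            }
        }
    }
}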

3.2 Computational Cores

The hardware architecture consists of three major cores. The Restricted Boltzmann Machine Core (RBMC) performs the primary vector-matrix energy calculation as well as the weight update. The Energy Accumulator Core (EAC) is used to sum the partial energies described in the previous section. Finally, the Node Select Core (NSC) evaluates the node states using the sigmoid function.

3.2.1 Restricted Boltzmann Machine Core

The RBMC performs the O(n^2) energy calculation (Eqns. 2.13, 2.14) and weight update (Eqn. 2.16) steps in O(n) time. To achieve this speed, a number of restrictions are applied to the RBM network:

- The sizes of the visible and hidden layers must be the same, to allow reuse of the same computational logic for both layers.
- Node states must be binary values. This condition allows the use of AND gates in place of multipliers for operations involving the node states.
- Weights and energies use a 32-bit fixed point representation. The choice of fixed point over floating point simplifies the arithmetic logic for energy and weight update calculations.
- The layer size must be a power of two. This limitation allows the use of a binary tree adder when calculating the energy.

The size restrictions on the node layers do not limit the space of problems the architecture is capable of handling, since unused nodes may be added to reach the next power of two. However, maximum effective performance will not be attained unless the sizes of the layers in the application match the previous descriptions. Also, a 32-bit fixed point representation provides a very large range of supported values and would likely not run into overflow or underflow issues unless the radix was chosen poorly. The choice of binary valued node states, on the other hand, does limit the number of real world applications, but the simplicity of the logic required allows for very fast computation in the set of problems the architecture can handle.

With these restrictions in place, the node energies may be calculated by accessing an entire row or column of the weight matrix, performing a logical AND with the node states and feeding the result into a binary tree adder. Due to the pipelined nature of the tree adder, one energy can be produced every clock cycle, thus reducing the computational complexity to O(n). Likewise, an entire row or column of weight updates may be generated in parallel by performing a logical AND between the visible and hidden node states and using the outcome to decide whether or not the learning rate should be applied to the weight updates.

Clearly, an important characteristic of the RBMC is its ability to access a full column or row of the weight matrix in parallel. To facilitate this, for a RBM of size nxn, n physical dual ported Block RAMs (BRAMs) are instantiated, each containing a diagonal of the weight matrix. An example for a 3x3 RBM is shown in Fig. 3.1. Notice that for every row and column, each weight is stored on a different BRAM. A sketch of this addressing scheme is given below.

[Figure 3.1: Weight Distribution in BRAM]

The parallel storage of weights turns out to be the limiting factor in terms of the size of RBM synthesizable on a single FPGA. For weight storage, n BRAMs are required, and an additional n BRAMs are required to store the weight updates.
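The diagonal storage scheme can be modelled in C as follows. Each of the n BRAMs is represented by one array, and both a full row and a full column of W can be read with exactly one element coming from each memory, which is what lets the RBMC fetch them in parallel. The particular indexing convention is an assumption for illustration; the actual core may map diagonals to BRAMs differently.

/* Model of diagonal weight storage (indexing convention assumed for illustration):
 * BRAM b holds the b-th diagonal of W, i.e. element w[i][j] with b = (j - i) mod n,
 * stored at address i. bram is an n x n array standing in for n independent BRAMs. */

/* Store the full weight matrix into the n "BRAMs". */
void store_weights(int n, const double *W, double *bram)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            bram[((j - i + n) % n) * n + i] = W[i * n + j];   /* BRAM (j-i) mod n, address i */
}

/* Read row i: one element from each BRAM, all at the same address i. */
void read_row(int n, const double *bram, int i, double *row)
{
    for (int b = 0; b < n; b++)
        row[(i + b) % n] = bram[b * n + i];                   /* element w[i][(i+b) mod n]   */
}

/* Read column j: one element from each BRAM, at address (j - b) mod n. */
void read_col(int n, const double *bram, int j, double *col)
{
    for (int b = 0; b < n; b++)
        col[(j - b + n) % n] = bram[b * n + ((j - b + n) % n)]; /* element w[(j-b) mod n][j] */
}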

3.2.2 Energy Accumulator Core and Node Select Core

The EAC and NSC work in tandem to find the node states given a set of partial energies. First, as a stream of partial energies arrives at the EAC, they are summed and stored in a BRAM First In First Out (FIFO) memory structure. Once the total energies have been computed, the EAC sends them to the NSC, which performs node state selection using an approximated sigmoid function and a uniform random number generator. The sigmoid function is calculated using a look-up table (LUT) whose output is sent through a pipelined piecewise linear interpolator (PLI) in order to get a better estimate. Once the node states have been determined, they are sent back to the EAC and from there back to the source of the partial energies.
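The LUT-plus-interpolation approach of the NSC can be sketched behaviourally in C as below. The table size, input range and fixed-point format used by the real core are not specified here, so the constants in this sketch are illustrative assumptions.

#include <math.h>

#define SIG_TABLE_SIZE 64          /* illustrative table size, not the core's actual size */
#define SIG_RANGE      8.0         /* assume the table covers energies in [-8, 8)         */

static double sig_table[SIG_TABLE_SIZE + 1];

/* Fill the table with exact sigmoid samples (done once). */
void sigmoid_table_init(void)
{
    for (int k = 0; k <= SIG_TABLE_SIZE; k++) {
        double e = -SIG_RANGE + 2.0 * SIG_RANGE * k / SIG_TABLE_SIZE;
        sig_table[k] = 1.0 / (1.0 + exp(-e));
    }
}

/* Approximate sigmoid: look up the two nearest table entries and
 * linearly interpolate between them, as the NSC's LUT + PLI does. */
double sigmoid_approx(double e)
{
    if (e <= -SIG_RANGE) return 0.0;
    if (e >=  SIG_RANGE) return 1.0;
    double pos  = (e + SIG_RANGE) * SIG_TABLE_SIZE / (2.0 * SIG_RANGE);
    int    k    = (int)pos;                 /* table index                  */
    double frac = pos - k;                  /* position between two entries */
    return sig_table[k] + frac * (sig_table[k + 1] - sig_table[k]);
}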

3.3 Message Passing Interface

All three cores performing different parts of the RBM calculations must be connected to each other as well as to a supervising microprocessor. In order to provide a simple, high bandwidth communication channel, TMD-MPI [14] was used to connect the cores. TMD-MPI implements a subset of the Message Passing Interface (MPI) standard for embedded systems. This communication layer offers a level of abstraction away from the implementation of the computational cores. Data is sent point to point through a network as packets called messages. Each message has a defined source, destination, tag and word count, where words are 32-bit pieces of data. At initialization, each device on the MPI network is given a specific address called a rank, which is used to route packets through the network. When data is received, it is stored in a message queue. Once a hardware core begins to read the message, a new word of the data is available each clock cycle. This operation allows the cores to operate asynchronously and yet still have high bandwidth communication with each other.

The microprocessor has its own Message Passing Engine (MPE) which supports direct memory access (DMA) and burst access to memory. These features allow a minimal overhead from the processor, since only four 32-bit words must be sent to the MPE before it may begin streaming data.

[Figure 3.2: Structure of the Virtualized Restricted Boltzmann Machine Architecture]

The MPI connectivity of the full system is shown in Fig. 3.2. The circles represent the MPI hardware and show the ranks of each computing element. The platform used in [4] was a Berkeley Emulation Engine 2 (BEE2) [15] with five Virtex-II Pro XC2VP70 FPGAs in a communication mesh and a hard PowerPC (PPC) processor.
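From the supervising processor's point of view, a context switch is essentially a short sequence of message-passing calls. The fragment below sketches this using standard MPI-style send and receive calls; TMD-MPI implements a subset of the MPI standard, but the exact calls, ranks and tags used here are assumptions for illustration rather than the thesis code.

#include <mpi.h>

/* Illustrative context switch for one weight partition, seen from the
 * supervising processor. RANK_RBMC and the tags are assumed values. */
#define RANK_RBMC   1
#define TAG_WEIGHTS 0
#define TAG_NODES   1

void load_partition_and_get_states(int *weights, int n_words,
                                   int *nodes, int node_words)
{
    /* Stream one weight block matrix to the RBM core. */
    MPI_Send(weights, n_words, MPI_INT, RANK_RBMC, TAG_WEIGHTS, MPI_COMM_WORLD);

    /* ... the RBMC computes partial energies, the EAC/NSC resolve node states ... */

    /* Receive the resulting node states back from the hardware. */
    MPI_Recv(nodes, node_words, MPI_INT, RANK_RBMC, TAG_NODES,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}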

Chapter 4
Large Restricted Boltzmann Machine Architecture

The goal of this thesis was to investigate the scalability of the FPGA architecture discussed in the previous chapter and adapt it to handle the size of Restricted Boltzmann Machines required in real world Deep Belief Network applications. In this chapter, the design modifications to the existing architecture will be discussed.

One of the first steps taken to increase the performance of the virtualized FPGA architecture was to move the design to a more modern FPGA. The BEECube Berkeley Emulation Engine 3 (BEE3) [16] hardware platform was chosen as a logical upgrade from the Virtex-II based BEE2. The BEE3 contains four Xilinx Virtex-5 5VLX155T FPGAs connected in a ring, each with access to up to 16GB of DDR2 external RAM. The only major design change during this transition was a switch from a PowerPC processor managing the hardware cores to soft MicroBlaze processors.

4.1 Investigation of Data Bit Widths

The existing virtualized Restricted Boltzmann Machine architecture represented weight and energy values as 32-bit fixed-point numbers. This bit width was a convenient design choice since the MPI hardware operated with 32-bit data widths and the configurable dual ported BRAMs supported up to 36-bit data widths. However, significant performance improvements may be realized with a reduction in bit width. For example, given the 32-bit width of the MPI channel, a bit width of 16 bits would result in double the number of weights or energies transferred per clock cycle, since two of them could be packed into each MPI word. A width of 8 bits would allow four times the number of weights or energies transferred. Since the weights must be transferred during every context switch, this gain in throughput becomes significant in the virtualized RBM architecture. With data packing, the operation of the EAC and NSC may also be parallelized to add multiple energies and calculate multiple node states per clock cycle. Finally, a RBM of size n requires 2n physical dual ported BRAMs to store the weight and weight update matrices on the RBMC. If the BRAMs on the FPGA may be split into narrower, but more plentiful, dual ported BRAMs, the size of RBM that can be synthesized on a single FPGA may also be increased. This would significantly increase the performance of the RBMC.

The drawback of using fewer bits to represent data is the reduction in the range of possible values. Depending on the RBM application, there exists the possibility of overflow or underflow. This could lead to problems finding a set of values in weight space to accurately represent the given training set. To roughly estimate the effect of using different data widths, a simple experiment was carried out. A RBM was trained in software with three different signed fixed-point representations: 32-bit with 8 magnitude bits, 16-bit with 8 magnitude bits and 8-bit with 4 magnitude bits. The network of size 1024x512 was trained for 100 epochs to recognize an image of the number 0. As a comparison metric, the trained networks were fed back the training image and AGS was run for 1025 phases. If the weights had been set properly, the network would be able to reproduce the image faithfully. The figure below shows the average number of errors found in a bitwise comparison of the original image versus the reconstructed one over ten attempts.

[Figure: Bit Width vs. Average Reconstruction Error (average reconstruction error vs. weight and energy bit width)]

The reconstruction with 16-bit weights produced a result similar to the case with 32-bit weights, while the network with 8-bit weights failed to reproduce the image at all. Previous studies [17, 5] on data width reinforce these results and suggest that 16 bits is adequate for many neural network applications. Thus, this project uses 16-bit representations for weight and energy values. Due to the choice of 16-bit widths, two energies are packed into each word transmitted over MPI, and the EAC and NSC were modified to perform the energy summation and node state calculation in parallel for both incoming energies; a sketch of this packing is given at the end of this section.

The Virtex-5 family Block RAMs are 36 Kbit dual ported modules configurable in a number of width and depth settings. Each BRAM may also be configured as two independent 18 Kbit dual ported modules. In addition, both the 36 Kbit and 18 Kbit BRAMs may be configured in simple dual-port mode, in which there is a single dedicated read port and a single dedicated write port. In this configuration, the 36 Kbit BRAM width is doubled to 72 bits and the 18 Kbit BRAM width is doubled to 36 bits [18]. The 5VLX155T FPGA provides a limited number of 36 Kbit BRAMs, each of which may be split into two 18 Kbit BRAMs; the maximum RBM which can be implemented on the FPGA is 128x128, using 18 Kbit BRAMs [19]. The RBMC only requires a single read port and a single write port for each weight storage BRAM, so the maximum RBMC is 128x128 with both 32-bit and 16-bit data widths. However, the improvement in communication performance from the reduced bit width is still beneficial to overall performance.
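The data packing described above can be sketched as follows; this is my own illustration of the idea rather than the HDL. Two signed 16-bit fixed-point values are packed into one 32-bit MPI word and unpacked again, doubling the number of weights or energies carried per transferred word.

#include <stdint.h>

/* Illustration of the 16-bit packing described in Section 4.1.
 * Pack two signed 16-bit fixed-point values into one 32-bit word:
 * 'a' in the low half, 'b' in the high half. */
uint32_t pack16(int16_t a, int16_t b)
{
    return ((uint32_t)(uint16_t)a) | ((uint32_t)(uint16_t)b << 16);
}

/* Unpack the two 16-bit values; casting back to int16_t restores the sign. */
void unpack16(uint32_t word, int16_t *a, int16_t *b)
{
    *a = (int16_t)(word & 0xFFFFu);
    *b = (int16_t)(word >> 16);
}

/* Example: a stream of 2n energies occupies only n MPI words. */
void pack_energies(const int16_t *energies, int n_energies, uint32_t *words)
{
    for (int i = 0; i < n_energies / 2; i++)
        words[i] = pack16(energies[2 * i], energies[2 * i + 1]);
}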

4.2 Memory and Communication Considerations

4.2.1 Data Storage

A weight matrix for a RBM of size 128x128 with a 16-bit weight representation requires 32 KBytes of memory to store. As we increase the RBM network size, the weight matrix grows quadratically and quickly exceeds the storage resources on a FPGA. In addition, when training real world applications the number of training vectors can become very large; for example, the MNIST database of handwritten digit images was used while training a network to recognize handwritten numbers [1]. In order to store all of this data, off-chip DDR2 RAM was used. Fig. 4.1 shows the local connections to the MicroBlaze processor. Data is streamed through the processor local bus (PLB) from the DDR2 RAM to the multi-ported memory controller (MPMC) and finally to the PLB MPE core, after which it is distributed to the appropriate computational core through the MPI network.

[Figure 4.1: MicroBlaze PLB Connectivity]

4.2.2 Communication Overhead

Since much of the data must be located external to the FPGA, additional latency is introduced during transfers between the MicroBlaze and the computational cores. The MPE core is designed with direct memory access (DMA) and is able to perform burst writes and reads to and from the external memory through the PLB. In principle this is fast, especially for large blocks of data such as the weight matrices. The only overhead from the MicroBlaze processor is the transmission of four words to set up the MPE core. However, when transferring smaller batches of data this overhead becomes very significant, since the MicroBlaze is slow compared to the hardware cores. In particular, the node states for a 128x128 system are only four 32-bit words long and a set of energies is only 64 words long. Therefore, operations heavily involving these elements, such as the node state calculation, are subject to a significant performance hit.

In addition to the overhead from the MicroBlaze, another significant performance reduction comes from the context switch operation itself. Although the transmission of weight matrices is relatively fast given DMA, bursting and the two packed 16-bit elements in each MPI word, the communication time represents a significant period during which the RBMC is idle. One simple way to address this problem is by increasing the mini-batch size. This allows the weight matrix to remain on the RBMC for a longer period of time and, as the batch size increases, the computation time of the RBMC would theoretically become the limiting factor; a rough model of this trade-off is sketched below. One key drawback of using larger batch sizes is the need to store more partial energies before the node state calculations may occur. In this particular implementation, since the energies are stored in large external DDR RAM, this does not have a significant effect.
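To make the batch-size argument concrete, the toy model below estimates the fraction of a context switch that the RBMC spends computing rather than waiting for weight transfers. The formula and parameter names are a simplification of the discussion above, not measurements from this work.

/* Toy model of context-switch amortization.
 * t_weight_xfer: time to stream one weight partition in and out (per context switch)
 * t_compute:     RBMC compute time for one training vector on that partition
 * batch_size:    training vectors processed per context switch (mini-batch size L)
 * Returns the fraction of the context switch spent on useful computation. */
double rbmc_utilization(double t_weight_xfer, double t_compute, int batch_size)
{
    double busy  = t_compute * batch_size;
    double total = busy + t_weight_xfer;
    return busy / total;   /* approaches 1.0 as batch_size grows */
}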

4.3 Extension to Four FPGAs

Additional performance was obtained by using the four FPGAs available on the BEE3 platform to provide coarse grain parallelism. Since communication links between FPGAs are not as fast as on-chip links, minimizing the inter-FPGA communication was essential to maintaining performance. In addition, it was important to ensure that the workload was shared evenly among the FPGAs. From these two conditions, the weight matrices, representing the largest data transfer, were assigned to FPGAs where they are simply streamed locally in and out of DDR2 RAM. The partitioning of the matrices shown in Fig. 4.2 is similar to the weight distribution within the RBMC from Fig. 3.1, the one difference being that there may be fewer FPGAs than weight matrices, and thus multiple sets of calculations may be required to get all the partial energies for a set of nodes. This structure allows all of the FPGAs to work together computing either a set of visible or hidden nodes at once.

[Figure 4.2: Weight distribution of eight partitions among four FPGAs]

The overall system layout is shown in Fig. 4.3. In this configuration, all of the partial energies from each FPGA must be sent to a single location to be added together, and the node states must then be distributed from that single source back to the rest of the FPGAs. Given the operation of the EAC, this was the simplest method of connectivity. However, it results in a communication bottleneck that becomes more significant as FPGAs are added.

[Figure 4.3: Overall Layout in Four FPGA System. All MPI Ranks are interconnected.]

In addition, a key limitation in this implementation is network size. A bug exists in the EAC in which it ceases operation when more than ten partial energies are delivered to it. Due to time constraints, this bug was not addressed for this thesis. Therefore, the maximum network size implementable by this system is 8n, where n is the size of RBM synthesized on a single FPGA. An outline of the code run on each FPGA is provided in Appendix A; a simplified sketch of that per-FPGA control flow follows.
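Appendix A is not reproduced here, so the following fragment is only my reading of the per-FPGA control flow described in this chapter: each MicroBlaze streams its locally stored weight partitions to its RBMC, forwards the resulting partial energies to the FPGA hosting the EAC and NSC, and waits for the node states to come back before the next AGS phase. All function names are placeholders.

/* Simplified per-FPGA control loop (my interpretation of the flow described
 * above; the real code is outlined in Appendix A of the thesis). The helper
 * functions below are stubs standing in for TMD-MPI transfers. */

static void stream_weights_to_rbmc(int partition)      { (void)partition; }
static void stream_clamped_layer_to_rbmc(int vector)   { (void)vector; }
static void send_partial_energies_to_eac(void)         { }
static void recv_node_states_from_eac(void)            { }

/* One AGS phase of one mini-batch, as seen by a single FPGA. */
void run_ags_phase(int num_local_partitions, int batch_size)
{
    for (int p = 0; p < num_local_partitions; p++) {
        stream_weights_to_rbmc(p);                 /* context switch: load W_{i,j} from DDR2 */
        for (int v = 0; v < batch_size; v++) {
            stream_clamped_layer_to_rbmc(v);       /* clamped layer states for this vector   */
            send_partial_energies_to_eac();        /* partial energies go to the EAC FPGA    */
        }
    }
    recv_node_states_from_eac();                   /* node states for the mini-batch return  */
}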

Chapter 5
Results and Analysis

5.1 Test Methods

5.1.1 Test Setup

The design was tested on the BEE3 platform with all computational cores and MicroBlaze soft processors running at 100MHz. In addition, an external 2GB DDR2-667 RDIMM module running at 200MHz was connected through a multi-ported memory controller (MPMC) to the processor local bus (PLB) of the MicroBlaze processor. In hardware, two different network sizes were synthesized: 64x64 and 128x128. Virtual network sizes of 1024x1024 and 512x512 were tested on the 128x128 system, while only 512x512 was tested on the 64x64 system. Although the f_max of the 128x128 system reported by the Xilinx Synthesis Tool (XST) would have permitted a higher clock rate, system clock frequencies greater than 100MHz were not explored due to time constraints. In testing the effect of various batch sizes on performance, batches of 1, 8, 16, 32, 64 and 84 were run.

A sequential implementation written in C was used as a basis for comparison for relative speedup. The equivalent software versions of the hardware components were written such that the output of the software benchmark matched the output of the hardware implementation. The C implementation was compiled using gcc with optimization level 2. The benchmark was run on an Intel Core 2 Duo E8400 at 3GHz on a 32-bit version of Ubuntu Linux. To record the computation time, the function gettimeofday() was used in the software implementation and the results of 25 runs were averaged to get the final computation time. For the hardware implementation, the function MPI_TIME() was used and the results were averaged over 10 runs.

5.1.2 Test Metric

Although relative speedup is an interesting measure, it is difficult to compare different architectures without an absolute measure of performance. One popular method of measuring neural network training performance is Connection Updates per Second (CUPS) [20]. This is defined as the number of weight updates per second, or

CUPS = \frac{n^2}{T}    (5.1)

where n is the size of the node layers and T is the amount of time required for all of the weights to be updated for one test vector. The speedup over the sequential C implementation was taken to be the ratio of the Connection Updates per Second of the hardware implementation and that of the software implementation:

S = \frac{CUPS_h}{CUPS_s}    (5.2)
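Eqns. 5.1 and 5.2 translate directly into code. The helpers below, with hypothetical argument names, compute CUPS from a measured per-vector update time and the resulting speedup of the hardware over the software baseline.

/* Connection Updates per Second for an n x n RBM, where t_per_vector is the
 * measured time (in seconds) to update all weights for one training vector.
 * Illustrative helpers only; these are not part of the benchmark code. */
double cups(int n, double t_per_vector)
{
    return ((double)n * (double)n) / t_per_vector;   /* Eqn. 5.1 */
}

/* Speedup of the hardware implementation over the software baseline. */
double speedup(double cups_hw, double cups_sw)
{
    return cups_hw / cups_sw;                        /* Eqn. 5.2 */
}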

5.2 Results

5.2.1 Batch Size vs. Speedup

As previously mentioned, one of the major concerns with scaling a RBM architecture is the O(n^2) growth in the number of context switches and thus data transfers. Fig. 5.1 shows the effect of increasing batch size on overall performance. Both network sizes plotted were implemented on a system with a 128x128 intrinsic core size.

[Figure 5.1: Mini-batch size vs. speedup]

From Fig. 5.1, we can see that as batch size increases, the larger 1024x1024 system becomes increasingly faster than the 512x512 network. We can infer from this result that at a batch size of one, the weight transfers consume a vast majority of the computation time. As the network size is increased, the RBMC and EAC/NSC operations begin to take up a greater percentage of the time and the O(n) benefits of those cores become apparent relative to the O(n^2) software baseline.

If the RBM training time were limited purely by the RBMC or EAC/NSC computation times, we would expect to see a continued increase in speedup with batch size. However, the plots level off fairly quickly. One other operation that keeps the RBMC idle, apart from the weight transfer, is the node state calculation. Particularly in this architecture, the partial energies must be transferred from all of the FPGAs to one point and the node states must then be streamed back. This operation must be done synchronously between FPGAs; thus, some overhead in setting up the timings between FPGAs is required and the computation must wait for the most delayed FPGA to be ready.

[Figure 5.2: Mini-batch size vs. speedup without node calculation]

As an artificial experiment, the same test was run without the node calculation stage (Fig. 5.2). Here, the speedup increases noticeably beyond the last test, but still begins to taper off before reaching very high performance. Since, from [4], the hardware cores themselves are very fast, this is likely due once again to communication bottlenecks between the RAM and the compute cores. From these results, we can see that the communication overhead due to transfers between the MicroBlaze and the hardware cores limits the overall system performance. Since the number of communications increases as O(n^2), this has significant implications for scaling.

5.2.2 Intrinsic RBM Size

A second performance factor measured was the effect of changing the intrinsic network size n. A reduction in n would degrade the performance benefit of the O(n) compute cores and require more context switches for the same virtualized RBM size. However, it would also reduce the overall transfer time of data between computations. Fig. 5.3 shows a comparison of running a virtual 512x512 RBM on both n = 64 and n = 128 hardware over varying batch sizes. From this plot we can see that increasing the intrinsic size of the RBM is very beneficial above small batch sizes.

[Figure 5.3: Mini-batch size vs. speedup for a virtual 512x512 network with intrinsic RBM sizes of 64 and 128]

5.2.3 Summary

The absolute CUPS results are summarized in Table 5.1. It is interesting to note that the CUPS is approximately the same at a batch size of 1 with n = 128, regardless of the virtualized size. This represents an O(n^2) relationship.

[Table 5.1: Summary of Performance Measurements (Platform, RBM Size, Batch Size, MCUPS) for the virtualized FPGA at n = 64 and n = 128]

Chapter 6
Conclusion

6.1 Conclusions

The purpose of this thesis was to investigate the scalability of Ly et al.'s [4] virtualized FPGA architecture. The primary impediment to using the architecture for large networks is the communication overhead required in context switching: as the size of the RBM grows linearly, the number of transfers increases as O(n^2). To maintain performance, the architecture was ported to a faster, more modern FPGA platform, the BEE3. The data representation was also changed from 32 bits to 16 bits in order to reduce the time required to transfer a set of energies or weights. Finally, the system was implemented on four FPGAs in order to provide some extra coarse grain parallelism.

When compared to a sequential O(n^2) C benchmark, the results showed several different communication overhead problems. The speed at low batch sizes was limited by the weight transfers during AGS phases. At higher batch sizes, the transfer of data to the EAC became the bottleneck, and when that was removed, additional overhead reduced performance before a good speedup could be observed. The design presented in this thesis only achieves a small speedup over software at high batch sizes. However, the analysis provided may be used as a basis for further improvements to the architecture.

6.2 Future Work

The architecture presented in this thesis still has a great deal of room for improvement. Primarily, a reduction in the significant communication overheads present in the virtualized system would allow the system to more fully utilize the computational cores.

6.2.1 Weight Matrix Caching

The transfer of weights during the context switches is a significant performance bottleneck of this architecture. As shown in this work, the effect of the context switch can be partially alleviated by using large batch sizes. However, during the transfer from external DDR2 RAM, the RBMC is still inactive. To reduce the transfer latency, the next weight matrix to be processed may be cached in a compact structure within the RBMC. Another possibility is to use the leftover depth of the weight storage BRAMs to cache multiple weight matrices. In the 128x128, 16-bit architecture, only 128 of the roughly 1K elements of each 18 Kbit BRAM are being used; by making use of the independent write port, additional weight matrices may be loaded while the RBMC performs other calculations.

6.2.2 Distributed Energy Accumulator Core Structure

Another method of reducing communication overhead is to improve the operation of the EAC. If more FPGAs are added to the system, the single EAC point in the current architecture will become increasingly bottlenecked. In order for the performance of the architecture to scale well when implemented on many FPGAs, the calculation of the node states should be distributed in a tree or ring fashion. This would reduce the communication bottleneck as well as improve the node selection time in the case of the tree structure.


More information

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight) CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

ARestricted Boltzmann machine (RBM) [1] is a probabilistic

ARestricted Boltzmann machine (RBM) [1] is a probabilistic 1 Matrix Product Operator Restricted Boltzmann Machines Cong Chen, Kim Batselier, Ching-Yun Ko, and Ngai Wong chencong@eee.hku.hk, k.batselier@tudelft.nl, cyko@eee.hku.hk, nwong@eee.hku.hk arxiv:1811.04608v1

More information

UNSUPERVISED LEARNING

UNSUPERVISED LEARNING UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training

More information

LRADNN: High-Throughput and Energy- Efficient Deep Neural Network Accelerator using Low Rank Approximation

LRADNN: High-Throughput and Energy- Efficient Deep Neural Network Accelerator using Low Rank Approximation LRADNN: High-Throughput and Energy- Efficient Deep Neural Network Accelerator using Low Rank Approximation Jingyang Zhu 1, Zhiliang Qian 2, and Chi-Ying Tsui 1 1 The Hong Kong University of Science and

More information

Artificial Neural Networks Examination, June 2005

Artificial Neural Networks Examination, June 2005 Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either

More information

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari Deep Belief Nets Sargur N. Srihari srihari@cedar.buffalo.edu Topics 1. Boltzmann machines 2. Restricted Boltzmann machines 3. Deep Belief Networks 4. Deep Boltzmann machines 5. Boltzmann machines for continuous

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

EEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture. Rajeevan Amirtharajah University of California, Davis

EEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture. Rajeevan Amirtharajah University of California, Davis EEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture Rajeevan Amirtharajah University of California, Davis Outline Announcements Review: PDP, EDP, Intersignal Correlations, Glitching, Top

More information

7.1 Basis for Boltzmann machine. 7. Boltzmann machines

7.1 Basis for Boltzmann machine. 7. Boltzmann machines 7. Boltzmann machines this section we will become acquainted with classical Boltzmann machines which can be seen obsolete being rarely applied in neurocomputing. It is interesting, after all, because is

More information

Efficient random number generation on FPGA-s

Efficient random number generation on FPGA-s Proceedings of the 9 th International Conference on Applied Informatics Eger, Hungary, January 29 February 1, 2014. Vol. 1. pp. 313 320 doi: 10.14794/ICAI.9.2014.1.313 Efficient random number generation

More information

Scalable and Power-Efficient Data Mining Kernels

Scalable and Power-Efficient Data Mining Kernels Scalable and Power-Efficient Data Mining Kernels Alok Choudhary, John G. Searle Professor Dept. of Electrical Engineering and Computer Science and Professor, Kellogg School of Management Director of the

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs. Cross-coupled NOR gates

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs. Cross-coupled NOR gates EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs April 16, 2009 John Wawrzynek Spring 2009 EECS150 - Lec24-blocks Page 1 Cross-coupled NOR gates remember, If both R=0 & S=0, then

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

Deep unsupervised learning

Deep unsupervised learning Deep unsupervised learning Advanced data-mining Yongdai Kim Department of Statistics, Seoul National University, South Korea Unsupervised learning In machine learning, there are 3 kinds of learning paradigm.

More information

Implementation of Nonlinear Template Runner Emulated Digital CNN-UM on FPGA

Implementation of Nonlinear Template Runner Emulated Digital CNN-UM on FPGA Implementation of Nonlinear Template Runner Emulated Digital CNN-UM on FPGA Z. Kincses * Z. Nagy P. Szolgay * Department of Image Processing and Neurocomputing, University of Pannonia, Hungary * e-mail:

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Administration. Registration Hw3 is out. Lecture Captioning (Extra-Credit) Scribing lectures. Questions. Due on Thursday 10/6

Administration. Registration Hw3 is out. Lecture Captioning (Extra-Credit) Scribing lectures. Questions. Due on Thursday 10/6 Administration Registration Hw3 is out Due on Thursday 10/6 Questions Lecture Captioning (Extra-Credit) Look at Piazza for details Scribing lectures With pay; come talk to me/send email. 1 Projects Projects

More information

Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier

Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier Espen Stenersen Master of Science in Electronics Submission date: June 2008 Supervisor: Per Gunnar Kjeldsberg, IET Co-supervisor: Torstein

More information

High Performance Computing

High Performance Computing Master Degree Program in Computer Science and Networking, 2014-15 High Performance Computing 2 nd appello February 11, 2015 Write your name, surname, student identification number (numero di matricola),

More information

Quantum Artificial Intelligence and Machine Learning: The Path to Enterprise Deployments. Randall Correll. +1 (703) Palo Alto, CA

Quantum Artificial Intelligence and Machine Learning: The Path to Enterprise Deployments. Randall Correll. +1 (703) Palo Alto, CA Quantum Artificial Intelligence and Machine : The Path to Enterprise Deployments Randall Correll randall.correll@qcware.com +1 (703) 867-2395 Palo Alto, CA 1 Bundled software and services Professional

More information

Reading Group on Deep Learning Session 4 Unsupervised Neural Networks

Reading Group on Deep Learning Session 4 Unsupervised Neural Networks Reading Group on Deep Learning Session 4 Unsupervised Neural Networks Jakob Verbeek & Daan Wynen 206-09-22 Jakob Verbeek & Daan Wynen Unsupervised Neural Networks Outline Autoencoders Restricted) Boltzmann

More information

The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural

The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural 1 2 The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural networks. First we will look at the algorithm itself

More information

New Delhi & Affiliated to VTU, Belgavi ) Oorgaum, Kolar Dist ,Karnataka

New Delhi & Affiliated to VTU, Belgavi ) Oorgaum, Kolar Dist ,Karnataka Design and Implementation of Logic Gates and Adder Circuits on FPGA Using ANN Neelu Farha 1., Ann Louisa Paul J 2., Naadiya Kousar L S 3., Devika S 4., Prof. Ruckmani Divakaran 5 1,2,3,4,5 Department of

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight) CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information

More information

Lecture 4: Feed Forward Neural Networks

Lecture 4: Feed Forward Neural Networks Lecture 4: Feed Forward Neural Networks Dr. Roman V Belavkin Middlesex University BIS4435 Biological neurons and the brain A Model of A Single Neuron Neurons as data-driven models Neural Networks Training

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Jeff Clune Assistant Professor Evolving Artificial Intelligence Laboratory Announcements Be making progress on your projects! Three Types of Learning Unsupervised Supervised Reinforcement

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

ACCELERATED LEARNING OF GAUSSIAN PROCESS MODELS

ACCELERATED LEARNING OF GAUSSIAN PROCESS MODELS ACCELERATED LEARNING OF GAUSSIAN PROCESS MODELS Bojan Musizza, Dejan Petelin, Juš Kocijan, Jožef Stefan Institute Jamova 39, Ljubljana, Slovenia University of Nova Gorica Vipavska 3, Nova Gorica, Slovenia

More information

Artificial Neural Networks. Introduction to Computational Neuroscience Tambet Matiisen

Artificial Neural Networks. Introduction to Computational Neuroscience Tambet Matiisen Artificial Neural Networks Introduction to Computational Neuroscience Tambet Matiisen 2.04.2018 Artificial neural network NB! Inspired by biology, not based on biology! Applications Automatic speech recognition

More information

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Learning Neural Networks Classifier Short Presentation INPUT: classification data, i.e. it contains an classification (class) attribute.

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information

Logic. Combinational. inputs. outputs. the result. system can

Logic. Combinational. inputs. outputs. the result. system can Digital Electronics Combinational Logic Functions Digital logic circuits can be classified as either combinational or sequential circuits. A combinational circuit is one where the output at any time depends

More information

NEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0

NEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0 NEC PerforCache Influence on M-Series Disk Array Behavior and Performance. Version 1.0 Preface This document describes L2 (Level 2) Cache Technology which is a feature of NEC M-Series Disk Array implemented

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

Faster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)

Faster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) Faster Machine Learning via Low-Precision Communication & Computation Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) 2 How many bits do you need to represent a single number in machine

More information

Parallelization of the QC-lib Quantum Computer Simulator Library

Parallelization of the QC-lib Quantum Computer Simulator Library Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer VCPC European Centre for Parallel Computing at Vienna Liechtensteinstraße 22, A-19 Vienna, Austria http://www.vcpc.univie.ac.at/qc/

More information

COMP9444 Neural Networks and Deep Learning 11. Boltzmann Machines. COMP9444 c Alan Blair, 2017

COMP9444 Neural Networks and Deep Learning 11. Boltzmann Machines. COMP9444 c Alan Blair, 2017 COMP9444 Neural Networks and Deep Learning 11. Boltzmann Machines COMP9444 17s2 Boltzmann Machines 1 Outline Content Addressable Memory Hopfield Network Generative Models Boltzmann Machine Restricted Boltzmann

More information

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino Artificial Neural Networks Data Base and Data Mining Group of Politecnico di Torino Elena Baralis Politecnico di Torino Artificial Neural Networks Inspired to the structure of the human brain Neurons as

More information

Sum-Product Networks: A New Deep Architecture

Sum-Product Networks: A New Deep Architecture Sum-Product Networks: A New Deep Architecture Pedro Domingos Dept. Computer Science & Eng. University of Washington Joint work with Hoifung Poon 1 Graphical Models: Challenges Bayesian Network Markov Network

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

Deep Learning Architecture for Univariate Time Series Forecasting

Deep Learning Architecture for Univariate Time Series Forecasting CS229,Technical Report, 2014 Deep Learning Architecture for Univariate Time Series Forecasting Dmitry Vengertsev 1 Abstract This paper studies the problem of applying machine learning with deep architecture

More information

Delayed and Higher-Order Transfer Entropy

Delayed and Higher-Order Transfer Entropy Delayed and Higher-Order Transfer Entropy Michael Hansen (April 23, 2011) Background Transfer entropy (TE) is an information-theoretic measure of directed information flow introduced by Thomas Schreiber

More information

Tunable Floating-Point for Energy Efficient Accelerators

Tunable Floating-Point for Energy Efficient Accelerators Tunable Floating-Point for Energy Efficient Accelerators Alberto Nannarelli DTU Compute, Technical University of Denmark 25 th IEEE Symposium on Computer Arithmetic A. Nannarelli (DTU Compute) Tunable

More information

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo

More information

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.

More information

Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber

Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber Representational Power of Restricted Boltzmann Machines and Deep Belief Networks Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber Introduction Representational abilities of functions with some

More information

Welcome to MCS 572. content and organization expectations of the course. definition and classification

Welcome to MCS 572. content and organization expectations of the course. definition and classification Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

More information

SPSS, University of Texas at Arlington. Topics in Machine Learning-EE 5359 Neural Networks

SPSS, University of Texas at Arlington. Topics in Machine Learning-EE 5359 Neural Networks Topics in Machine Learning-EE 5359 Neural Networks 1 The Perceptron Output: A perceptron is a function that maps D-dimensional vectors to real numbers. For notational convenience, we add a zero-th dimension

More information

How to do backpropagation in a brain

How to do backpropagation in a brain How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

More information

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Neural Networks Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Brains as Computational Devices Brains advantages with respect to digital computers: Massively parallel Fault-tolerant Reliable

More information

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2) INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron

More information

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano ... Our contribution PIPS-PSBB*: Multi-level parallelism for Stochastic

More information

Deep Learning Autoencoder Models

Deep Learning Autoencoder Models Deep Learning Autoencoder Models Davide Bacciu Dipartimento di Informatica Università di Pisa Intelligent Systems for Pattern Recognition (ISPR) Generative Models Wrap-up Deep Learning Module Lecture Generative

More information

Dense Arithmetic over Finite Fields with CUMODP

Dense Arithmetic over Finite Fields with CUMODP Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,

More information

On the Use of a Many core Processor for Computational Fluid Dynamics Simulations

On the Use of a Many core Processor for Computational Fluid Dynamics Simulations On the Use of a Many core Processor for Computational Fluid Dynamics Simulations Sebastian Raase, Tomas Nordström Halmstad University, Sweden {sebastian.raase,tomas.nordstrom} @ hh.se Preface based on

More information

Hardware Acceleration of the Tate Pairing in Characteristic Three

Hardware Acceleration of the Tate Pairing in Characteristic Three Hardware Acceleration of the Tate Pairing in Characteristic Three CHES 2005 Hardware Acceleration of the Tate Pairing in Characteristic Three Slide 1 Introduction Pairing based cryptography is a (fairly)

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE 4: Linear Systems Summary # 3: Introduction to artificial neural networks DISTRIBUTED REPRESENTATION An ANN consists of simple processing units communicating with each other. The basic elements of

More information

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation) Learning for Deep Neural Networks (Back-propagation) Outline Summary of Previous Standford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation

More information

Area-Time Optimal Adder with Relative Placement Generator

Area-Time Optimal Adder with Relative Placement Generator Area-Time Optimal Adder with Relative Placement Generator Abstract: This paper presents the design of a generator, for the production of area-time-optimal adders. A unique feature of this generator is

More information

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University } 2017/11/15 Midterm } 2017/11/22 Final Project Announcement 2 1. Introduction 2.

More information

Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics)

Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Eftychios Sifakis CS758 Guest Lecture - 19 Sept 2012 Introduction Linear systems

More information

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 9. Datapath Design Jacob Abraham Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 October 2, 2017 ECE Department, University of Texas at Austin

More information